Feature Generation for Text Categorization Using World Knowledge


Evgeniy Gabrilovich and Shaul Markovitch. Feature Generation for Text Categorization Using World Knowledge. In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, pages 1048-1053, Edinburgh, Scotland, 2005.


Abstract

We enhance machine learning algorithms for text categorization with generated features based on domain-specific and common-sense knowledge. This knowledge is represented using publicly available ontologies that contain hundreds of thousands of concepts, such as the Open Directory; these ontologies are further enriched by several orders of magnitude through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts, which in turn induce a set of generated features that augment the standard bag of words. Feature generation is accomplished through contextual analysis of document text, implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses the two main problems of natural language processing: synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the documents alone. Experimental results confirm improved performance, breaking through the plateau previously reached in the field.
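
The pipeline described in the abstract (map a document onto ontology concepts by looking at its context, generalize through the hierarchy, then add the resulting concepts to the bag of words) can be illustrated with a minimal sketch. This is not the authors' implementation: the tiny ontology, the word lists attached to each concept, the overlap-based scoring, and every name below (`TOY_ONTOLOGY`, `generate_features`, the `CONCEPT=` prefix) are assumptions made only for illustration.

```python
from collections import Counter

# Hypothetical toy ontology: concept -> (parent concept or None, characteristic words).
# A real ontology such as the Open Directory is vastly larger; this is illustration only.
TOY_ONTOLOGY = {
    "Business": (None, {"market", "shares", "investor", "bank"}),
    "Business/Banking": ("Business", {"bank", "loan", "deposit", "interest"}),
    "Science": (None, {"research", "study"}),
    "Science/Geography": ("Science", {"river", "bank", "water", "flood"}),
}


def tokenize(text):
    """Lowercase whitespace tokenizer that strips surrounding punctuation."""
    stripped = (w.strip(".,;:!?").lower() for w in text.split())
    return [w for w in stripped if w]


def score_concepts(tokens, ontology):
    """Score each concept by how often its characteristic words occur in the text."""
    counts = Counter(tokens)
    return {c: sum(counts[w] for w in words) for c, (_, words) in ontology.items()}


def ancestors(concept, ontology):
    """Return the chain of parents of a concept, nearest first."""
    chain = []
    parent = ontology[concept][0]
    while parent is not None:
        chain.append(parent)
        parent = ontology[parent][0]
    return chain


def generate_features(text, ontology, top_k=1):
    """Bag of words augmented with the top-scoring concepts and their ancestors."""
    tokens = tokenize(text)
    scores = score_concepts(tokens, ontology)
    matched = [c for c in scores if scores[c] > 0]
    ranked = sorted(matched, key=lambda c: (-scores[c], c))
    features = Counter(tokens)  # the standard bag of words
    for concept in ranked[:top_k]:
        for c in [concept] + ancestors(concept, ontology):
            features["CONCEPT=" + c] += 1  # generated feature
    return features


if __name__ == "__main__":
    for doc in (
        "The bank raised the interest rate on every loan",
        "The river bank was under water after the flood",
    ):
        feats = generate_features(doc, TOY_ONTOLOGY)
        print(sorted(k for k in feats if k.startswith("CONCEPT=")))
```

In this sketch the ambiguous word "bank" is resolved by the surrounding words: the first document scores highest for the banking concept and the second for the geography concept, while the ancestor features let documents about different subtopics share a more general feature.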


Keywords: Feature Generation, Text Categorization, Information Retrieval, ESA, Explicit Semantic Analysis
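Secondary Keywords: Common-Sense Knowledge
Online version: http://www.cs.technion.ac.il/~shaulm/papers/pdf/Gabrilovich-Markovitch-ijcai2005.pdf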
Bibtex entry:
@inproceedings{Gabrilovich:2005:FGT,
  Author = {Evgeniy Gabrilovich and Shaul Markovitch},
  Title = {Feature Generation for Text Categorization Using World Knowledge},
  Year = {2005},
  Booktitle = {Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence},
  Pages = {1048--1053},
  Address = {Edinburgh, Scotland},
  Url = {http://www.cs.technion.ac.il/~shaulm/papers/pdf/Gabrilovich-Markovitch-ijcai2005.pdf},
  Keywords = {Feature Generation, Text Categorization, Information Retrieval, ESA, Explicit Semantic Analysis},
  Secondary-keywords = {Common-Sense Knowledge},
  Abstract = {
    We enhance machine learning algorithms for text categorization
    with generated features based on domain-specific and common-sense
    knowledge. This knowledge is represented using publicly available
    ontologies that contain hundreds of thousands of concepts, such as
    the Open Directory; these ontologies are further enriched by
    several orders of magnitude through controlled Web crawling. Prior
    to text categorization, a feature generator analyzes the documents
    and maps them onto appropriate ontology concepts, which in turn
    induce a set of generated features that augment the standard bag
    of words. Feature generation is accomplished through contextual
    analysis of document text, implicitly performing word sense
    disambiguation. Coupled with the ability to generalize concepts
    using the ontology, this approach addresses the two main problems
    of natural language processing---synonymy and polysemy.
    Categorizing documents with the aid of knowledge-based features
    leverages information that cannot be deduced from the documents
    alone. Experimental results confirm improved performance, breaking
    through the plateau previously reached in the field.
  }
}