Evgeniy Gabrilovich and Shaul Markovitch. Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization. Journal of Machine Learning Research, 8:2297-2345, 2007.
Most existing methods for text categorization use induction algorithms in conjunction with the bag of words document representation. While they perform well in many categorization tasks, these methods are inherently limited when faced with more complicated tasks where external knowledge is essential. Recently, there have been a number of attempts to augment the bag of words approach with external knowledge, including semi-supervised learning and transfer learning. In this work, we present a new framework for automatic acquisition of world knowledge and methods for incorporating it into the text categorization process. Our approach enhances machine learning algorithms with features generated from domain-specific and common-sense knowledge. This knowledge is represented by ontologies that contain hundreds of thousands of concepts, further enriched by several orders of magnitude through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts that augment the bag of words. Feature generation is accomplished through contextual analysis of document text, thus implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses the two main problems of natural language processing---synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the training documents alone. We applied our methodology using the Open Directory Project---the largest existing Web directory, built by over 70,000 human editors. Experimental results over a range of datasets confirm improved performance compared to the bag of words document representation.
@article{Gabrilovich:2007:HEH,
Author = {Evgeniy Gabrilovich and Shaul Markovitch},
Title = {Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization},
Year = {2007},
Journal = {Journal of Machine Learning Research},
Volume = {8},
Month = {Oct},
Pages = {2297--2345},
Url = {http://www.cs.technion.ac.il/~shaulm/papers/pdf/Gabrilovich-Markovitch-JMLR2007.pdf},
Keywords = {Feature Generation, Text Categorization, Information Retrieval, Explicit Semantic Analysis, ESA},
Secondary-keywords = {Common-Sense Knowledge, Feature Construction},
Abstract = {
Most existing methods for text categorization use induction
algorithms in conjunction with the bag of words document
representation. While they perform well in many categorization
tasks, these methods are inherently limited when faced with more
complicated tasks where external knowledge is essential. Recently,
there have been a number of attempts to augment the bag of words
approach with external knowledge, including semi-supervised
learning and transfer learning. In this work, we present a new
framework for automatic acquisition of world knowledge and methods
for incorporating it into the text categorization process. Our
approach enhances machine learning algorithms with features
generated from domain-specific and common-sense knowledge. This
knowledge is represented by ontologies that contain hundreds of
thousands of concepts, further enriched by several orders of
magnitude through controlled Web crawling. Prior to text
categorization, a feature generator analyzes the documents and
maps them onto appropriate ontology concepts that augment the bag
of words. Feature generation is accomplished through contextual
analysis of document text, thus implicitly performing word sense
disambiguation. Coupled with the ability to generalize concepts
using the ontology, this approach addresses the two main problems
of natural language processing---synonymy and polysemy.
Categorizing documents with the aid of knowledge-based features
leverages information that cannot be deduced from the training
documents alone. We applied our methodology using the Open
Directory Project---the largest existing Web directory, built by
over 70,000 human editors. Experimental results over a range of
datasets confirm improved performance compared to the bag of words
document representation.
}
}