Evgeniy Gabrilovich and Shaul Markovitch. Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization. Journal of Machine Learning Research, 8:2297-2345, 2007.
Most existing methods for text categorization use induction algorithms in conjunction with the bag of words document representation. While they perform well in many categorization tasks, these methods are inherently limited when faced with more complicated tasks where external knowledge is essential. Recently, there have been a number of attempts to augment the bag of words approach with external knowledge, including semi-supervised learning and transfer learning. In this work, we present a new framework for automatic acquisition of world knowledge and methods for incorporating it into the text categorization process. Our approach enhances machine learning algorithms with features generated from domain-specific and common-sense knowledge. This knowledge is represented by ontologies that contain hundreds of thousands of concepts, further enriched by several orders of magnitude through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts that augment the bag of words. Feature generation is accomplished through contextual analysis of document text, thus implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses the two main problems of natural language processing---synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the training documents alone. We applied our methodology using the Open Directory Project---the largest existing Web directory, built by over 70,000 human editors. Experimental results over a range of datasets confirm improved performance compared to the bag of words document representation.
@article{Gabrilovich:2007:HEH,
  Author = {Evgeniy Gabrilovich and Shaul Markovitch},
  Title = {Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization},
  Year = {2007},
  Journal = {Journal of Machine Learning Research},
  Volume = {8},
  Number = {},
  Month = {Oct},
  Pages = {2297--2345},
  Url = {http://www.cs.technion.ac.il/~shaulm/papers/pdf/Gabrilovich-Markovitch-JMLR2007.pdf},
  Keywords = {Feature Generation, Text Categorization, Information Retrieval, Explicit Semantic Analysis, ESA},
  Secondary-keywords = {Common-Sense Knowledge, Feature Construction},
  Abstract = {Most existing methods for text categorization use induction algorithms in conjunction with the bag of words document representation. While they perform well in many categorization tasks, these methods are inherently limited when faced with more complicated tasks where external knowledge is essential. Recently, there have been a number of attempts to augment the bag of words approach with external knowledge, including semi-supervised learning and transfer learning. In this work, we present a new framework for automatic acquisition of world knowledge and methods for incorporating it into the text categorization process. Our approach enhances machine learning algorithms with features generated from domain-specific and common-sense knowledge. This knowledge is represented by ontologies that contain hundreds of thousands of concepts, further enriched by several orders of magnitude through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts that augment the bag of words. Feature generation is accomplished through contextual analysis of document text, thus implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses the two main problems of natural language processing---synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the training documents alone. We applied our methodology using the Open Directory Project---the largest existing Web directory, built by over 70,000 human editors. Experimental results over a range of datasets confirm improved performance compared to the bag of words document representation.}
}