
Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge


Evgeniy Gabrilovich and Shaul Markovitch. Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, pages 1301-1306, Boston, MA, 2006.


Abstract

When humans approach the task of text categorization, they interpret the specific wording of the document in the much larger context of their background knowledge and experience. On the other hand, state-of-the-art information retrieval systems are quite brittle---they traditionally represent documents as bags of words, and are restricted to learning from individual word occurrences in the (necessarily limited) training set. For instance, given the sentence "Wal-Mart supply chain goes real time", how can a text categorization system know that Wal-Mart manages its stock with RFID technology? And having read that "Ciprofloxacin belongs to the quinolones group", how on earth can a machine know that the drug mentioned is an antibiotic produced by Bayer? We propose to enrich document representation through automatic use of a vast compendium of human knowledge---an encyclopedia. We apply machine learning techniques to Wikipedia, the largest encyclopedia to date, which surpasses in scope many conventional encyclopedias and provides a cornucopia of world knowledge. Each Wikipedia article represents a concept, and documents to be categorized are represented in the rich feature space of words and relevant Wikipedia concepts. Empirical results confirm that this knowledge-intensive representation brings text categorization to a qualitatively new level of performance across a diverse collection of datasets.
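To make the feature-generation idea concrete, the sketch below shows one way a document's bag-of-words representation can be augmented with Wikipedia-concept features, in the spirit of the abstract. It is an illustration only, not the paper's implementation: the concept texts are tiny hypothetical stand-ins for full Wikipedia articles, and the similarity-based scoring and top-k selection are assumptions about how a minimal version could work.

    # Illustrative sketch: augment a document's word features with
    # Wikipedia "concept" features. The concept texts below are
    # hypothetical placeholders; the real system builds concept vectors
    # from full Wikipedia articles.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    from scipy.sparse import hstack, csr_matrix

    # Hypothetical stand-ins for Wikipedia articles (one per concept).
    concepts = {
        "Wal-Mart": "Wal-Mart retail supply chain stores inventory RFID",
        "RFID": "RFID radio frequency identification tags tracking stock",
        "Ciprofloxacin": "ciprofloxacin antibiotic quinolone drug Bayer infection",
    }

    vectorizer = TfidfVectorizer()
    concept_matrix = vectorizer.fit_transform(concepts.values())  # concepts x words

    def enrich(documents, top_k=2):
        """Concatenate word features with scores of the top-k most
        relevant concepts for each document (other concepts zeroed)."""
        word_features = vectorizer.transform(documents)            # docs x words
        concept_scores = cosine_similarity(word_features, concept_matrix)
        for row in concept_scores:
            weakest = row.argsort()[:-top_k]   # keep only the top-k concepts
            row[weakest] = 0.0
        return hstack([word_features, csr_matrix(concept_scores)])

    features = enrich(["Wal-Mart supply chain goes real time"])
    print(features.shape)  # vocabulary size + 3 concept features

In this toy example the document about Wal-Mart's supply chain picks up the Wal-Mart and RFID concept features even though "RFID" never appears in its text, which is the kind of background-knowledge enrichment the paper argues a purely bag-of-words learner cannot achieve.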


Keywords: Wikipedia, Feature Generation, Text Categorization, Information Retrieval, ESA, Explicit Semantic Analysis
Secondary Keywords: Common-sense Knowledge
Online version: http://www.cs.technion.ac.il/~shaulm/papers/pdf/Gabrilovich-Markovitch-aaai2006.pdf
Bibtex entry:
 @inproceedings{Gabrilovich:2006:OBB,
  Author = {Evgeniy Gabrilovich and Shaul Markovitch},
  Title = {Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge},
  Year = {2006},
  Booktitle = {Proceedings of the Twenty-First National Conference on Artificial Intelligence},
  Pages = {1301--1306},
  Address = {Boston, MA},
  Url = {http://www.cs.technion.ac.il/~shaulm/papers/pdf/Gabrilovich-Markovitch-aaai2006.pdf},
  Keywords = {Wikipedia, Feature Generation, Text Categorization, Information Retrieval, ESA, Explicit Semantic Analysis},
  Secondary-keywords = {Common-sense Knowledge},
  Abstract = {
    When humans approach the task of text categorization, they
    interpret the specific wording of the document in the much larger
    context of their background knowledge and experience. On the other
    hand, state-of-the-art information retrieval systems are quite
    \emph{brittle}---they traditionally represent documents as bags of
    words, and are restricted to learning from individual word
    occurrences in the (necessarily limited) training set. For
    instance, given the sentence ``Wal-Mart supply chain goes real
    time'', how can a text categorization system know that Wal-Mart
    manages its stock with RFID technology? And having read that
    ``Ciprofloxacin belongs to the quinolones group'', how on earth
    can a machine know that the drug mentioned is an antibiotic
    produced by Bayer? We propose to enrich document representation
    through automatic use of a vast compendium of human knowledge---an
    encyclopedia. We apply machine learning techniques to Wikipedia,
    the largest encyclopedia to date, which surpasses in scope many
    conventional encyclopedias and provides a cornucopia of world
    knowledge. Each Wikipedia article represents a \emph{concept}, and
    documents to be categorized are represented in the rich feature
    space of words and relevant Wikipedia concepts. Empirical results
    confirm that this knowledge-intensive representation brings text
    categorization to a qualitatively new level of performance across
    a diverse collection of datasets.
  }
}