Wikipedia-based Semantic Interpretation for Natural Language Processing

Evgeniy Gabrilovich and Shaul Markovitch. Wikipedia-based Semantic Interpretation for Natural Language Processing. Journal of Artificial Intelligence Research, 34:443-498, 2009.

Abstract

Adequate representation of natural language semantics requires access to vast amounts of common sense and domain-specific world knowledge. Prior work in the field was based on purely statistical techniques that did not make use of background knowledge, on limited lexicographic knowledge bases such as WordNet, or on huge manual efforts such as the CYC project. Here we propose a novel method, called Explicit Semantic Analysis (ESA), for fine-grained semantic interpretation of unrestricted natural language texts. Our method represents meaning in a high-dimensional space of concepts derived from Wikipedia, the largest encyclopedia in existence. We explicitly represent the meaning of any text in terms of Wikipedia-based concepts. We evaluate the effectiveness of our method on text categorization and on computing the degree of semantic relatedness between fragments of natural language text. Using ESA results in significant improvements over the previous state of the art in both tasks. Importantly, due to the use of natural concepts, the ESA model is easy to explain to human users.
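To make the idea of representing text in a space of Wikipedia-derived concepts more concrete, here is a minimal sketch in Python. It is not the authors' implementation. The three-entry `CONCEPTS` dictionary is a hypothetical stand-in for Wikipedia articles, and the TF-IDF weighting, the cosine measure, and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical toy "concepts": each key stands in for a Wikipedia article
# title and each value for that article's text (an assumption for this sketch).
CONCEPTS = {
    "Bank (finance)": "bank money loan deposit interest account money",
    "River": "river water bank stream flow water",
    "Computer": "computer software hardware program data",
}


def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def build_inverted_index(concepts):
    """Map each word to {concept name: TF-IDF weight}."""
    n = len(concepts)
    tf = {name: Counter(tokenize(body)) for name, body in concepts.items()}
    # Document frequency: in how many concept texts each word occurs.
    df = Counter(word for counts in tf.values() for word in counts)
    index = {}
    for name, counts in tf.items():
        for word, count in counts.items():
            idf = math.log(n / df[word])
            if idf > 0:  # skip words that occur in every concept text
                index.setdefault(word, {})[name] = count * idf
    return index


def interpret(text, index):
    """Represent a text as a weighted vector over concepts."""
    vector = Counter()
    for word in tokenize(text):
        for concept, weight in index.get(word, {}).items():
            vector[concept] += weight
    return vector


def relatedness(text_a, text_b, index):
    """Cosine similarity between the concept vectors of two texts."""
    va, vb = interpret(text_a, index), interpret(text_b, index)
    dot = sum(va[c] * vb[c] for c in va if c in vb)
    norm_a = math.sqrt(sum(w * w for w in va.values()))
    norm_b = math.sqrt(sum(w * w for w in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text shares no words with any concept
    return dot / (norm_a * norm_b)


index = build_inverted_index(CONCEPTS)
print(dict(interpret("bank", index)))
print(relatedness("money loan", "deposit interest", index))
print(relatedness("money loan", "software data", index))
```

In this toy setting the ambiguous word "bank" receives weight on both the finance and the river concept, and each dimension of a vector is a named concept, which illustrates why such a representation is easy to explain to human users.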

Keywords: Wikipedia, Feature Generation, Information Retrieval, ESA, Semantic Relatedness, Explicit Semantic Analysis

Secondary Keywords: Word Similarity, Common-Sense Knowledge, Semantics

Online version: http://www.cs.technion.ac.il/~shaulm/papers/pdf/Gabrilovich-Markovitch-JAIR2009.pdf

Bibtex entry:

 @article{Gabrilovich:2009:WBS,
  Author = {Evgeniy Gabrilovich and Shaul Markovitch},
  Title = {Wikipedia-based Semantic Interpretation for Natural Language Processing},
  Year = {2009},
  Journal = {Journal of Artificial Intelligence Research},
  Volume = {34},
  Pages = {443--498},
  Url = {http://www.cs.technion.ac.il/~shaulm/papers/pdf/Gabrilovich-Markovitch-JAIR2009.pdf},
  Keywords = {Wikipedia, Feature Generation, Information Retrieval, ESA, Semantic Relatedness, Explicit Semantic Analysis},
  Secondary-keywords = {Word Similarity, Common-Sense Knowledge, Semantics},
  Abstract = {
    Adequate representation of natural language semantics requires
    access to vast amounts of common sense and domain-specific world
    knowledge. Prior work in the field was based on purely statistical
    techniques that did not make use of background knowledge, on
    limited lexicographic knowledge bases such as WordNet, or on huge
    manual efforts such as the CYC project. Here we propose a novel
    method, called Explicit Semantic Analysis (ESA), for fine-grained
    semantic interpretation of unrestricted natural language texts.
    Our method represents meaning in a high-dimensional space of
    concepts derived from Wikipedia, the largest encyclopedia in
    existence. We explicitly represent the meaning of any text in
    terms of Wikipedia-based concepts. We evaluate the effectiveness
    of our method on text categorization and on computing the degree
    of semantic relatedness between fragments of natural language
    text. Using ESA results in significant improvements over the
    previous state of the art in both tasks. Importantly, due to the
    use of natural concepts, the ESA model is easy to explain to human
    users.
  }
 }