
Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis


Evgeniy Gabrilovich and Shaul Markovitch. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence, pages 1606–1611, Hyderabad, India, 2007.


Abstract

Computing semantic relatedness of natural language texts requires access to vast amounts of common-sense and domain-specific world knowledge. We propose Explicit Semantic Analysis (ESA), a novel method that represents the meaning of texts in a high-dimensional space of concepts derived from Wikipedia. We use machine learning techniques to explicitly represent the meaning of any text as a weighted vector of Wikipedia-based concepts. Assessing the relatedness of texts in this space amounts to comparing the corresponding vectors using conventional metrics (e.g., cosine). Compared with the previous state of the art, using ESA results in substantial improvements in correlation of computed relatedness scores with human judgments: from $r=0.56$ to $0.75$ for individual words and from $r=0.60$ to $0.72$ for texts. Importantly, due to the use of natural concepts, the ESA model is easy to explain to human users.
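To make the abstract's description concrete, here is a minimal, illustrative Python sketch of the ESA idea (not the authors' implementation, which indexes the full Wikipedia snapshot): a term-to-concept index is built with TF-IDF weights over a collection of Wikipedia article texts, a text fragment is interpreted as a weighted vector of concepts, and relatedness is the cosine of two such vectors. The names wikipedia_articles, build_concept_index, and interpret are assumptions introduced for this sketch.

  # Illustrative ESA-style relatedness sketch (hypothetical helper names).
  import math
  from collections import Counter, defaultdict

  def build_concept_index(wikipedia_articles):
      """Build term -> {concept: tf-idf weight} from concept (article) texts."""
      doc_freq = Counter()
      term_counts = {}
      for concept, text in wikipedia_articles.items():
          counts = Counter(text.lower().split())
          term_counts[concept] = counts
          doc_freq.update(counts.keys())  # one count per concept containing the term

      n_concepts = len(wikipedia_articles)
      index = defaultdict(dict)  # term -> {concept: weight}
      for concept, counts in term_counts.items():
          for term, tf in counts.items():
              idf = math.log(n_concepts / doc_freq[term])
              index[term][concept] = tf * idf
      return index

  def interpret(text, index):
      """Map a text to its weighted vector of Wikipedia-based concepts."""
      vector = Counter()
      for term in text.lower().split():
          for concept, weight in index.get(term, {}).items():
              vector[concept] += weight
      return vector

  def cosine(u, v):
      """Cosine similarity between two sparse concept vectors."""
      dot = sum(w * v[c] for c, w in u.items() if c in v)
      norm = math.sqrt(sum(w * w for w in u.values())) * \
             math.sqrt(sum(w * w for w in v.values()))
      return dot / norm if norm else 0.0

  # Usage (wikipedia_articles is a hypothetical {title: text} corpus):
  # index = build_concept_index(wikipedia_articles)
  # score = cosine(interpret("stock market crash", index),
  #                interpret("financial crisis", index))

The sketch omits the preprocessing and pruning the paper applies to Wikipedia, but it shows the core pipeline: explicit concept vectors compared with a conventional vector-space metric.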


Keywords: Wikipedia, Feature Generation, Information Retrieval, ESA, Semantic Relatedness, Explicit Semantic Analysis
Secondary Keywords: Word Similarity, Common-Sense Knowledge, Semantics
Online version: http://www.cs.technion.ac.il/~shaulm/papers/pdf/Gabrilovich-Markovitch-ijcai2007.pdf
Bibtex entry:
 @inproceedings{Gabrilovich:2007:CSR,
  Author = {Evgeniy Gabrilovich and Shaul Markovitch},
  Title = {Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis},
  Year = {2007},
  Booktitle = {Proceedings of the Twentieth International Joint Conference on Artificial Intelligence},
  Pages = {1606--1611},
  Address = {Hyderabad, India},
  Url = {http://www.cs.technion.ac.il/~shaulm/papers/pdf/Gabrilovich-Markovitch-ijcai2007.pdf},
  Keywords = {Wikipedia, Feature Generation, Information Retrieval, ESA, Semantic Relatedness, Explicit Semantic Analysis},
  Secondary-keywords = {Word Similarity, Common-Sense Knowledge, Semantics},
  Abstract = {
    Computing semantic relatedness of natural language texts requires
    access to vast amounts of common-sense and domain-specific world
    knowledge. We propose Explicit Semantic Analysis (ESA), a novel
    method that represents the meaning of texts in a high-dimensional
    space of concepts derived from Wikipedia. We use machine learning
    techniques to explicitly represent the meaning of any text as a
    weighted vector of Wikipedia-based concepts. Assessing the
    relatedness of texts in this space amounts to comparing the
    corresponding vectors using conventional metrics (e.g., cosine).
    Compared with the previous state of the art, using ESA results in
    substantial improvements in correlation of computed relatedness
    scores with human judgments: from $r=0.56$ to $0.75$ for
    individual words and from $r=0.60$ to $0.72$ for texts.
    Importantly, due to the use of natural concepts, the ESA model is
    easy to explain to human users.
  }
}