Concept-Based Information Retrieval using Explicit Semantic Analysis


Ofer Egozi, Shaul Markovitch and Evgeniy Gabrilovich. Concept-Based Information Retrieval using Explicit Semantic Analysis. ACM Transactions on Information Systems, 29(2):8:1-8:34, 2011.


Abstract

Information retrieval systems traditionally rely on textual keywords to index and retrieve documents. Keyword-based retrieval may return inaccurate and incomplete results when different keywords are used to describe the same concept in the documents and in the queries. Furthermore, the relationship between these related keywords may be semantic rather than syntactic, and capturing it thus requires access to comprehensive human world knowledge. Concept-based retrieval methods have attempted to tackle these difficulties by using manually built thesauri, by relying on term cooccurrence data, or by extracting latent word relationships and concepts from a corpus. In this article we introduce a new concept-based retrieval approach based on Explicit Semantic Analysis (ESA), a recently proposed method that augments keyword-based text representation with concept-based features, automatically extracted from massive human knowledge repositories such as Wikipedia. Our approach generates new text features automatically, and we have found that high-quality feature selection becomes crucial in this setting to make the retrieval more focused. However, due to the lack of labeled data, traditional feature selection methods cannot be used, hence we propose new methods that use self-generated labeled training data. The resulting system is evaluated on several TREC datasets, showing superior performance over previous state-of-the-art results.
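The core idea of mapping text into a space of explicit concepts can be illustrated with a toy sketch. This is not the authors' system: the three-entry `CONCEPTS` dictionary is a hypothetical stand-in for a knowledge repository such as Wikipedia, the smoothed TF-IDF weighting is an assumption, and the keyword features and feature selection described in the abstract are left out entirely.

```python
import math
from collections import Counter

# Hypothetical miniature "knowledge repository": concept name -> article text.
# A real ESA setup would use a large resource such as Wikipedia.
CONCEPTS = {
    "Automobile": "car engine wheel road driver vehicle car",
    "Astronomy": "star planet telescope galaxy orbit star",
    "Cooking": "recipe oven flour sugar kitchen recipe",
}


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption made for brevity)."""
    return text.lower().split()


def build_inverted_index(concepts):
    """Map each word to {concept: weight} using a smoothed TF-IDF weight."""
    n = len(concepts)
    tf = {name: Counter(tokenize(body)) for name, body in concepts.items()}
    df = Counter(word for counts in tf.values() for word in counts)
    index = {}
    for name, counts in tf.items():
        for word, count in counts.items():
            idf = math.log(n / df[word]) + 1.0
            index.setdefault(word, {})[name] = count * idf
    return index


def concept_vector(text, index):
    """Represent a text as the sum of the concept vectors of its words."""
    vec = Counter()
    for word in tokenize(text):
        for concept, weight in index.get(word, {}).items():
            vec[concept] += weight
    return dict(vec)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


INDEX = build_inverted_index(CONCEPTS)
query = concept_vector("vehicle", INDEX)
doc_a = concept_vector("car driver", INDEX)   # shares no keyword with the query
doc_b = concept_vector("oven recipe", INDEX)

print(cosine(query, doc_a))  # high: both texts map to the "Automobile" concept
print(cosine(query, doc_b))  # zero: no concept in common
```

In this sketch the query "vehicle" and the document "car driver" have no keyword in common, yet they match in concept space because both activate the same concept. This is the vocabulary-mismatch situation the abstract describes.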


Keywords: Information Retrieval, Concept-based Retrieval, Explicit Semantic Analysis, Feature Selection, Semantic Search, ESA
Online version: http://www.cs.technion.ac.il/~shaulm/papers/pdf/Egozi-Gabrilovich-Markovitch-TOIS2011.pdf
Bibtex entry:
 @article{Egozi:2011:CBI,
  Author = {Ofer Egozi and Shaul Markovitch and Evgeniy Gabrilovich},
  Title = {Concept-Based Information Retrieval using Explicit Semantic Analysis},
  Year = {2011},
  Journal = {{ACM} {T}ransactions on {I}nformation {S}ystems},
  Volume = {29},
  Number = {2},
  Pages = {8:1--8:34},
  Address = {New York, NY, USA},
  Url = {http://www.cs.technion.ac.il/~shaulm/papers/pdf/Egozi-Gabrilovich-Markovitch-TOIS2011.pdf},
  Keywords = {Information Retrieval, Concept-based Retrieval, Explicit Semantic Analysis, Feature Selection, Semantic Search, ESA},
  Abstract = {
    Information retrieval systems traditionally rely on textual
    keywords to index and retrieve documents. Keyword-based retrieval
    may return inaccurate and incomplete results when different
    keywords are used to describe the same concept in the documents
    and in the queries. Furthermore, the relationship between these
    related keywords may be semantic rather than syntactic, and
    capturing it thus requires access to comprehensive human world
    knowledge. Concept-based retrieval methods have attempted to
    tackle these difficulties by using manually built thesauri, by
    relying on term cooccurrence data, or by extracting latent word
    relationships and concepts from a corpus. In this article we
    introduce a new concept-based retrieval approach based on Explicit
    Semantic Analysis (ESA), a recently proposed method that augments
    keyword-based text representation with concept-based features,
    automatically extracted from massive human knowledge repositories
    such as Wikipedia. Our approach generates new text features
    automatically, and we have found that high-quality feature
    selection becomes crucial in this setting to make the retrieval
    more focused. However, due to the lack of labeled data,
    traditional feature selection methods cannot be used, hence we
    propose new methods that use self-generated labeled training data.
    The resulting system is evaluated on several TREC datasets,
    showing superior performance over previous state-of-the-art
    results.
  }
}