Concept-Based Information Retrieval using Explicit Semantic Analysis

Ofer Egozi, M.Sc. Thesis Seminar
Wednesday, 24.6.2009, 14:30
Taub 601
Prof. Shaul Markovitch

Information retrieval systems traditionally rely on textual keywords to index and retrieve documents. Keywords may be related, describing a common human concept, but they are not used consistently across documents and queries, which makes retrieval inaccurate and incomplete. Furthermore, relations between keywords may extend beyond simple syntactic relations, so capturing them requires access to comprehensive world knowledge. Concept-based retrieval methods have attempted to tackle these difficulties using manually built thesauri, or by extracting latent artificial concepts from a corpus. In this work we propose a new approach to concept-based retrieval using Explicit Semantic Analysis, a recently proposed representation method that can augment the keyword-based representation with concept-based features automatically extracted from massive human knowledge resources such as Wikipedia. We find that for such a representation to be successful, high-quality feature selection is required, but unlike supervised learning tasks, the retrieval task provides no labeled training data. Inspired by pseudo-relevance feedback, we present several feature selection methods that use the top-ranked and bottom-ranked documents retrieved by keyword-based retrieval as per-query training data. The resulting system is evaluated on TREC data and shows superior performance over previous state-of-the-art results.
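
As a rough illustration of the per-query feature selection idea described above, the following minimal sketch treats the top-ranked documents of a keyword-based run as pseudo-positive examples and the bottom-ranked ones as pseudo-negative examples. The function name, the parameters `k` and `n_concepts`, the toy concept weights, and the scoring rule (difference of mean concept weights) are all assumptions made for illustration; they are not the feature selection methods presented in the thesis.

```python
from collections import defaultdict


def select_concepts(ranked_docs, concept_vectors, k=2, n_concepts=2):
    """Pick concepts that separate top-ranked from bottom-ranked documents.

    ranked_docs: document ids ordered best-first by a keyword-based retriever.
    concept_vectors: document id -> {concept name: weight} (hypothetical ESA output).
    """
    top = ranked_docs[:k]        # pseudo-relevant documents
    bottom = ranked_docs[-k:]    # pseudo-irrelevant documents

    # Illustrative score: mean weight in the top set minus mean weight in the bottom set.
    scores = defaultdict(float)
    for doc in top:
        for concept, weight in concept_vectors[doc].items():
            scores[concept] += weight / len(top)
    for doc in bottom:
        for concept, weight in concept_vectors[doc].items():
            scores[concept] -= weight / len(bottom)

    # Highest score first; ties broken by concept name for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [concept for concept, score in ranked[:n_concepts] if score > 0]


# Toy data: a keyword run for the query "jaguar" and made-up concept weights.
ranked = ["d1", "d2", "d3", "d4"]
vectors = {
    "d1": {"Jaguar (animal)": 0.9, "Big cat": 0.6},
    "d2": {"Jaguar (animal)": 0.7, "Rainforest": 0.4},
    "d3": {"Jaguar Cars": 0.8, "Big cat": 0.1},
    "d4": {"Jaguar Cars": 0.9, "Rainforest": 0.4},
}
selected = select_concepts(ranked, vectors)
print(selected)
```

In this toy run, concepts that are frequent among the top-ranked documents but rare among the bottom-ranked ones are kept, while concepts that appear equally in both sets or mostly in the bottom set are dropped. The sketch assumes `k` is small enough that the top and bottom sets do not overlap.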
