Technical Report CIS-2006-04

Title: Computing semantic relatedness of words and texts in Wikipedia-derived semantic space
Authors: Evgeniy Gabrilovich, Shaul Markovitch
Abstract: Adequate representation of natural language semantics requires access to vast amounts of common sense and domain-specific world knowledge. Prior work in the field was based either on purely statistical techniques that did not make use of background knowledge or on huge manual efforts, such as the CYC project. Here we propose a novel method, called Explicit Semantic Analysis (ESA), for fine-grained semantic interpretation of unrestricted natural language texts. Our method represents meaning in a high-dimensional space of concepts derived from Wikipedia, the largest encyclopedia in existence. We use machine learning techniques that allow us to explicitly represent the meaning of any text in terms of Wikipedia-based concepts. We evaluate the effectiveness of our method on automatically computing the degree of semantic relatedness between fragments of natural language text. Compared with the previous state of the art, using ESA results in substantial improvements in correlation of computed relatedness scores with human judgments: from r=0.56 to 0.75 for individual words and from r=0.60 to 0.72 for texts. Consequently, we anticipate ESA to give rise to the next generation of natural language processing tools. Importantly, due to the use of natural concepts, the ESA model is easy to explain to human users.
Copyright: The above paper is copyright by the Technion, Author(s), or others. Please contact the author(s) for more information.

Remark: Any link to this technical report should be to this page, rather than to the URL of the PDF files directly. The latter URLs may change without notice.

Computer science department, Technion