Concept-Based Feature Generation and Selection for Information Retrieval


Ofer Egozi, Evgeniy Gabrilovich and Shaul Markovitch. Concept-Based Feature Generation and Selection for Information Retrieval. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, 1132-1137, Chicago, IL, 2008.


Abstract

Traditional information retrieval systems use query words to identify relevant documents. In difficult retrieval tasks, however, one needs access to a wealth of background knowledge. We present a method that uses Wikipedia-based feature generation to improve retrieval performance. Intuitively, we expect that using extensive world knowledge is likely to improve recall but may adversely affect precision. High-quality feature selection is necessary to maintain high precision, but here we lack the labeled training data for evaluating features that is available in supervised learning. We present a new feature selection method that is inspired by pseudo-relevance feedback. We use the top-ranked and bottom-ranked documents retrieved by the bag-of-words method as representative sets of relevant and non-relevant documents. The generated features are then evaluated and filtered on the basis of these sets. Experiments on TREC data confirm the superior performance of our method compared to the previous state of the art.
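
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how features are scored, so the scoring rule used here (mean concept weight in the top-ranked documents minus mean weight in the bottom-ranked documents) is an assumption made purely for illustration. The function name `select_concepts`, the parameters `k` and `keep`, and the dictionary representation of concept vectors are likewise hypothetical.

```python
from typing import Dict, List

# Hypothetical representation: a document's generated features as a mapping
# from concept name (e.g. a Wikipedia article title) to a weight.
ConceptVector = Dict[str, float]


def select_concepts(
    query_concepts: List[str],
    ranked_docs: List[ConceptVector],
    k: int = 2,
    keep: int = 2,
) -> List[str]:
    """Filter generated query concepts using pseudo-relevance sets.

    ranked_docs must be ordered by the bag-of-words ranking, best first.
    The first k documents stand in for relevant documents and the last k
    for non-relevant ones. The scoring rule is an illustrative assumption.
    """
    if len(ranked_docs) < 2 * k:
        raise ValueError("need at least 2*k ranked documents")
    top, bottom = ranked_docs[:k], ranked_docs[-k:]

    def mean_weight(concept: str, docs: List[ConceptVector]) -> float:
        return sum(d.get(concept, 0.0) for d in docs) / len(docs)

    # Score each concept by how much more strongly it appears in the
    # pseudo-relevant set than in the pseudo-non-relevant set.
    scores = {
        c: mean_weight(c, top) - mean_weight(c, bottom) for c in query_concepts
    }
    ranked = sorted(query_concepts, key=lambda c: scores[c], reverse=True)
    # Keep at most `keep` concepts, and only those with a positive score.
    return [c for c in ranked[:keep] if scores[c] > 0]


if __name__ == "__main__":
    # Toy data, invented for this example: four documents in bag-of-words order.
    docs = [
        {"Solar energy": 0.9, "Energy": 0.5},
        {"Solar energy": 0.7, "Photovoltaics": 0.6, "Energy": 0.4},
        {"Energy": 0.6, "Sun (newspaper)": 0.8},
        {"Sun (newspaper)": 0.9, "Energy": 0.5},
    ]
    concepts = ["Solar energy", "Photovoltaics", "Energy", "Sun (newspaper)"]
    print(select_concepts(concepts, docs, k=2, keep=2))
```

In this toy run, a concept that is spread evenly across both sets ("Energy") or concentrated in the bottom-ranked set ("Sun (newspaper)") is filtered out, which mirrors the stated goal of keeping precision high while still adding background knowledge to the query.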


Keywords: Feature Generation, Feature Selection, ESA, Information Retrieval, Explicit Semantic Analysis
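Secondary Keywords: Common-Sense Knowledge, Feature Construction
Online version: http://www.cs.technion.ac.il/~shaulm/papers/pdf/Egozi-Gabrilovich-Markovitch-AAAI2008.pdf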
Bibtex entry:
 @inproceedings{Egozi:2008:CBF,
  Author = {Ofer Egozi and Evgeniy Gabrilovich and Shaul Markovitch},
  Title = {Concept-Based Feature Generation and Selection for Information Retrieval},
  Year = {2008},
  Booktitle = {Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence},
  Pages = {1132--1137},
  Address = {Chicago, IL},
  Url = {http://www.cs.technion.ac.il/~shaulm/papers/pdf/Egozi-Gabrilovich-Markovitch-AAAI2008.pdf},
  Keywords = {Feature Generation, Feature Selection, ESA, Information Retrieval, Explicit Semantic Analysis},
  Secondary-keywords = {Common-Sense Knowledge, Feature Construction},
  Abstract = {
    Traditional information retrieval systems use query words to
    identify relevant documents. In difficult retrieval tasks,
    however, one needs access to a wealth of background knowledge. We
    present a method that uses Wikipedia-based feature generation to
    improve retrieval performance. Intuitively, we expect that using
    extensive world knowledge is likely to improve recall but may
    adversely affect precision. High-quality feature selection is
    necessary to maintain high precision, but here we lack the
    labeled training data for evaluating features that is available
    in supervised learning. We present a new feature selection method
    that is inspired by pseudo-relevance feedback. We use the
    top-ranked and bottom-ranked documents retrieved by the bag-of-words
    method as representative sets of relevant and non-relevant
    documents. The generated features are then evaluated and filtered
    on the basis of these sets. Experiments on TREC data confirm the
    superior performance of our method compared to the previous state
    of the art.
  }
}