
Text Categorization with Many Redundant Features: Using Aggressive Feature Selection to Make SVMs Competitive with C4.5


Evgeniy Gabrilovich and Shaul Markovitch. Text Categorization with Many Redundant Features: Using Aggressive Feature Selection to Make SVMs Competitive with C4.5. In Proceedings of The Twenty-First International Conference on Machine Learning, pages 321–328, Banff, Alberta, Canada, 2004. Morgan Kaufmann.


Abstract

Text categorization algorithms usually represent documents as bags of words and consequently have to deal with huge numbers of features. Most previous studies found that the majority of these features are relevant for classification, and that the performance of text categorization with support vector machines peaks when no feature selection is performed. We describe a class of text categorization problems that are characterized by many redundant features. Even though most of these features are relevant, the underlying concepts can be concisely captured using only a few features, while keeping all of them has a substantially detrimental effect on categorization accuracy. We develop a novel measure that captures feature redundancy, and use it to analyze a large collection of datasets. We show that for problems plagued with numerous redundant features, the performance of C4.5 is significantly superior to that of SVM, while aggressive feature selection allows SVM to beat C4.5 by a narrow margin.
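
A minimal sketch (in Python with scikit-learn, not the authors' code) of the kind of pipeline the abstract describes: bag-of-words features, aggressive feature selection down to a handful of features, and a linear SVM compared against a decision tree. The 20 Newsgroups corpus and the chi-squared scoring function below are stand-ins; the paper's own datasets and redundancy measure are not reproduced here.

    # Compare a linear SVM and a decision tree on bag-of-words text,
    # with and without aggressive feature selection (illustrative only).
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.metrics import f1_score
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC
    from sklearn.tree import DecisionTreeClassifier

    categories = ["rec.autos", "sci.space"]   # small binary problem for brevity
    train = fetch_20newsgroups(subset="train", categories=categories)
    test = fetch_20newsgroups(subset="test", categories=categories)

    for k in ("all", 50):                     # "all" = no selection; 50 = aggressive selection
        for clf in (LinearSVC(), DecisionTreeClassifier()):
            pipe = make_pipeline(
                CountVectorizer(stop_words="english"),  # bag-of-words features
                SelectKBest(chi2, k=k),                 # keep only the k highest-scoring features
                clf,
            )
            pipe.fit(train.data, train.target)
            f1 = f1_score(test.target, pipe.predict(test.data))
            print(f"k={k:>4}  {type(clf).__name__:<22}  F1={f1:.3f}")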


Keywords: Feature Selection, Text Categorization
Secondary Keywords: SVM, C4.5, Decision Trees
Online version: http://www.cs.technion.ac.il/~shaulm/papers/pdf/Gabrilovich-Markovitch-icml2004.pdf
Bibtex entry:
 @inproceedings{Gabrilovich:2004:TCM,
  Author = {Evgeniy Gabrilovich and Shaul Markovitch},
  Title = {Text Categorization with Many Redundant Features: Using Aggressive Feature Selection to Make SVMs Competitive with C4.5},
  Year = {2004},
  Booktitle = {Proceedings of The Twenty-First International Conference on Machine Learning},
  Pages = {321--328},
  Address = {Banff, Alberta, Canada},
  Url = {http://www.cs.technion.ac.il/~shaulm/papers/pdf/Gabrilovich-Markovitch-icml2004.pdf},
  Keywords = {Feature Selection, Text Categorization},
  Secondary-keywords = {SVM, C4.5, Decision Trees},
  Abstract = {
    Text categorization algorithms usually represent documents as bags
    of words and consequently have to deal with huge numbers of
    features. Most previous studies found that the majority of these
    features are relevant for classification, and that the performance
    of text categorization with support vector machines peaks when no
    feature selection is performed. We describe a class of text
    categorization problems that are characterized by many redundant
    features. Even though most of these features are relevant, the
    underlying concepts can be concisely captured using only a few
    features, while keeping all of them has a substantially detrimental
    effect on categorization accuracy. We develop a novel measure that
    captures feature redundancy, and use it to analyze a large
    collection of datasets. We show that for problems plagued with
    numerous redundant features the performance of C4.5 is
    significantly superior to that of SVM, while aggressive feature
    selection allows SVM to beat C4.5 by a narrow margin.
  }
 }