Parameterized Generation of Labeled Datasets for Text Categorization Based on a Hierarchical Directory


Dmitry Davidov, Evgeniy Gabrilovich, and Shaul Markovitch. Parameterized Generation of Labeled Datasets for Text Categorization Based on a Hierarchical Directory. In Proceedings of the 27th Annual International ACM SIGIR Conference, pages 250-257, Sheffield, UK, 2004. ACM Press.


Abstract

Although text categorization is a burgeoning area of IR research, readily available test collections in this field are surprisingly scarce. We describe a methodology and system named ACCIO for automatically acquiring labeled datasets for text categorization from the World Wide Web, by capitalizing on the body of knowledge encoded in the structure of existing hierarchical directories such as the Open Directory. We define parameters of categories that make it possible to acquire numerous datasets with desired properties, which in turn allow better control over categorization experiments. In particular, we develop metrics that estimate the difficulty of a dataset by examining the host directory structure. These metrics are shown to be good predictors of the categorization accuracy that can be achieved on a dataset, and serve as efficient heuristics for generating datasets subject to user requirements. A large collection of automatically generated datasets is made available for other researchers to use.


Keywords: Text Categorization, Dataset Generation
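Secondary Keywords: Dataset Parameters
Online version: http://www.cs.technion.ac.il/~shaulm/papers/pdf/Davidov-Gabrilovich-Markovitch-sigir2004.pdf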
BibTeX entry:
 @inproceedings{Davidov:2004:PGL,
  Author = {Dmitry Davidov and Evgeniy Gabrilovich and Shaul Markovitch},
  Title = {Parameterized Generation of Labeled Datasets for Text Categorization Based on a Hierarchical Directory},
  Year = {2004},
  Booktitle = {Proceedings of The 27th Annual International ACM SIGIR Conference},
  Pages = {250--257},
  Address = {Sheffield, UK},
  Url = {http://www.cs.technion.ac.il/~shaulm/papers/pdf/Davidov-Gabrilovich-Markovitch-sigir2004.pdf},
  Keywords = {Text Categorization, Dataset Generation},
  Secondary-keywords = {Dataset Parameters},
  Abstract = {
    Although text categorization is a burgeoning area of IR research,
    readily available test collections in this field are surprisingly
    scarce. We describe a methodology and system named ACCIO for
    automatically acquiring labeled datasets for text categorization
    from the World Wide Web, by capitalizing on the body of knowledge
    encoded in the structure of existing hierarchical directories such
    as the Open Directory. We define parameters of categories that
    make it possible to acquire numerous datasets with desired
    properties, which in turn allow better control over categorization
    experiments. In particular, we develop metrics that estimate the
    difficulty of a dataset by examining the host directory structure.
    These metrics are shown to be good predictors of categorization
    accuracy that can be achieved on a dataset, and serve as efficient
    heuristics for generating datasets subject to user requirements.
    A large collection of automatically generated datasets is made
    available for other researchers to use.
  }
}