Resources for Text, Speech and Language Processing

Bibliography on Automated Text Categorization

This is a large online bibliography on automated text categorization (ATC). You can view it or download it as a single file (ASCII text in BibTeX format), or access the fully searchable online version.


ATC is the activity of automatically building, by means of machine learning techniques, automated text classifiers, i.e., systems capable of assigning a text document to one or more thematic categories (or labels) from a predefined set. The following article contains a very comprehensive survey of the state of the art in ATC (see entry [Sebastiani02] in the bibliography): Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, 34(1):1-47, 2002.
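
As a rough illustration of the definition above, the following minimal Python sketch learns a classifier from labelled examples and then assigns a new text to one category from a predefined set. It is only a sketch under stated assumptions: the learner is a plain multinomial naive Bayes with Laplace smoothing (one of many methods covered in the literature, chosen here for brevity), it assigns exactly one category per document rather than "one or more", and the function names, categories and training texts are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()


def train(labeled_docs):
    """Learn per-category word counts from (text, category) pairs."""
    doc_counts = Counter()              # number of training documents per category
    word_counts = defaultdict(Counter)  # word frequencies per category
    vocabulary = set()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        for word in tokenize(text):
            word_counts[category][word] += 1
            vocabulary.add(word)
    return doc_counts, word_counts, vocabulary


def classify(text, model):
    """Return the category with the highest Laplace-smoothed log-probability."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in doc_counts:
        # Start from the log prior of the category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[category][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical toy training set; categories and texts are invented for illustration.
TRAINING_DOCS = [
    ("stocks fell as markets reacted to interest rates", "finance"),
    ("the central bank raised interest rates again", "finance"),
    ("the team won the match in extra time", "sport"),
    ("the striker scored twice in the final match", "sport"),
]

model = train(TRAINING_DOCS)
print(classify("interest rates and markets", model))  # prints: finance
print(classify("the final match", model))             # prints: sport
```

A realistic system would add proper tokenization, feature selection and evaluation on held-out data, and would handle documents that belong to several categories at once, for instance by training one binary classifier per category.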


In general, only references specific to ATC are considered pertinent to this bibliography; in particular, references that are considered pertinent are:

  • publications that discuss novel ATC methods, novel experimentation with previously known methods, or new resources for ATC experimentation;
  • publications that discuss applications of ATC (e.g., automated indexing for Boolean IR systems, filtering).

References that are not considered pertinent are:

  • publications that discuss techniques that are in principle useful for ATC (e.g., machine learning techniques, information retrieval techniques) but do not explicitly discuss their application to ATC;
  • publications that discuss related topics sometimes confused with ATC; these include (but are not limited to) text clustering (i.e., text classification by unsupervised learning) and text indexing;
  • technical reports and workshop papers. Only papers that have been formally published (i.e., in conference proceedings and journals) are included in the bibliography, so as to avoid its explosion and the inclusion of material bound to obsolescence.

Updates

Please do send me new references, as well as corrections and additions (e.g., missing URLs and abstracts) to the existing ones. I routinely monitor major conferences and journals several times a year, but there will always be articles that I overlook, so please help me keep the bibliography as current and complete as possible.

Concerning URLs from which online copies of the papers can be downloaded: where possible, I have included URLs with unrestricted access (e.g., the authors' home pages). When no such URL was available, a URL with restricted access (e.g., the ACM Digital Library or the IEEE Computer Society Digital Library, which are accessible to subscribers only) is sometimes provided instead. In that case, if you know of a URL with unrestricted access from which the paper is also available, please let me know and I will update the link.

Historical notes

This bibliography was originally created by Fabrizio Sebastiani.

Evgeniy Gabrilovich

Keywords: Text Categorization, Machine Learning, Computational Linguistics, Natural Language Processing, NLP, Natural Language Understanding, Natural Language Analysis, Information Retrieval, IR, Artificial Intelligence, AI, Corpus Linguistics, Text Mining, Text Data Mining