A text categorization task can in many cases be considered as a combination of three problems:
| Text representation | |
| Classifier induction | |
| Model selection |
We mainly dealt with different aspects of the text representation. As a classifier, we used the SVM (SVMlight packet by Thorsten Joachims).
We chose to work with a text representation based on word clusters instead of single words. Our clustering scheme is an implementation of the Information Bottleneck method, introduced by Naftali Tishby, Fernando Pereira and William Bialek in 1999. In our scheme, words are represented as distributions over the dataset categories and then clustered using the Deterministic Annealing technique. Early research on Distributional Clustering was performed by Douglas Baker and Andrew McCallum in 1998.
We tested our scheme on three benchmark datasets:
| Reuters-21578 | |
| 20 Newsgroups | |
| WebKB |
We compared our technique with the one proposed by Susan Dumais et.al. in 1998. Their technique is based on simple Bag-Of-Words with feature selection by Mutual Information between a word and a category. This technique achieved state-of-the-art results on the Reuters dataset. The comparison showed that our technique led to a certain improvement on the 20NG dataset while being not better on Reuters and WebKB.
The question is why our results are not consistent over different benchmarks. Is the problem in our approach or maybe in the datasets? Our study showed that a simple text categorization technique (Bag-Of-Words + Mutual Information) is probably more appropriate for Reuters and WebKB. Namely, only as few as 50 words extracted for each category do the whole job. Thousands of other words do not affect the process. It means that Reuters and WebKB are likely to contain a small number of highlighting keywords and many don't-cares.
This is not the case for the 20NG. We saw that each single word contributes something. And clusters of words preserve semantic connections between the words which gives more power to the representation.
The bottom line is: if you deal with text representation or feature selection, use complex datasets for testing your smart techniques. Do not build a sophisticated representation for documents of a dataset whose structure is too simple.
Here you can find our Distributional Clustering software (courtesy Uri Weiss). As an input, it gets a file with words represented as distributions and then it composes clusters of words that have similar distributions.
| cluster.exe (217Kb) |
Here you can find the preprocessed datasets we were working with. The three of them have a common format which is rather computer-friendly than user-friendly. Both the 20NG and WebKB are uniformly split to 4 folds and prepared for cross-validation. Reuters is in its standard ModApte split so it does not need the cross-validation to be applied.
| Preprocessed Reuters (5.7Mb) | |
| Preprocessed 20NG (12.7Mb) | |
| Preprocessed WebKB (4.8Mb) |
All our scripts and files that contain our results can be downloaded from here. After unzipping the file you can see an hierarchy of directories according to a dataset, a representation method applied (the Information Bottleneck or the Bag-Of-Words) and its parameters. All the scripts are written in Perl under Cygwin. Each leaf directory includes a script named run_all that activates the whole process. The scripts have until now been in our internal use only, so please be patient if they do not run right away (they contain local paths etc). Tip: first few lines of the scripts include all the necessary configuration parameters. Please contact me if you have any question or bug report.
| Scripts2.0.zip (3.5Mb) - version 2.0. |