
My supervisors are Dr. Ran El-Yaniv and Dr. Yoad Winter.
Information Retrieval: Text Categorization and Clustering, Information Extraction, Text Segmentation, Feature Selection Approaches, Methodology of Information Retrieval
Machine Learning: Support Vector Machines, Semi-supervised and Unsupervised Classification
My thesis is on Word Distributional Clustering for Text Categorization. We study an approach to text categorization that combines distributional clustering of words and a Support Vector Machine classifier. The word-cluster representation is computed using the recently introduced Information Bottleneck method, which generates a compact and efficient representation of documents. When combined with the classification power of the SVM, this method yields high performance in text categorization.
We compare this technique with SVM-based categorization using the simple-minded bag-of-words representation. The comparison is performed over three well-known datasets. On one of these datasets (20 Newsgroups), the method based on word clusters significantly outperforms the word-based representation in terms of categorization accuracy or representation efficiency. On the other two datasets (Reuters-21578 and WebKB), the word-based representation slightly outperforms the word-cluster representation. We investigate the potential reasons for this behavior.
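As a rough illustration of the word-cluster representation described above, here is a minimal Python sketch. It is not the Information Bottleneck procedure used in the thesis: the clustering step is a simple k-means-style loop over KL divergence, swapped in only to show the idea of grouping words by their distributions over categories. All function names (`word_class_distributions`, `cluster_words`, `to_cluster_counts`) are hypothetical, and the SVM training step is left out; the resulting cluster-count vectors would simply be passed to an SVM implementation of your choice.

```python
import math
from collections import Counter, defaultdict


def word_class_distributions(docs, labels):
    """Estimate P(class | word) from tokenized, labeled documents (add-one smoothing)."""
    classes = sorted(set(labels))
    counts = defaultdict(Counter)  # word -> per-class occurrence counts
    for tokens, label in zip(docs, labels):
        for tok in tokens:
            counts[tok][label] += 1
    dists = {}
    for word, by_class in counts.items():
        total = sum(by_class.values()) + len(classes)
        dists[word] = [(by_class[c] + 1) / total for c in classes]
    return dists


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cluster_words(dists, k, iterations=10):
    """Hard-cluster words by similarity of their class distributions.

    Assumption: a naive k-means-style stand-in with round-robin initialization,
    NOT the Information Bottleneck algorithm from the thesis.
    """
    if not dists:
        return {}
    words = sorted(dists)
    assignment = {w: i % k for i, w in enumerate(words)}
    n_classes = len(dists[words[0]])
    centroids = [[1.0 / n_classes] * n_classes for _ in range(k)]
    for _ in range(iterations):
        # Recompute each centroid as the unweighted mean of its members.
        for c in range(k):
            members = [dists[w] for w in words if assignment[w] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        # Reassign every word to the centroid with the smallest divergence.
        new_assignment = {}
        for w in words:
            divergences = [kl(dists[w], centroids[c]) for c in range(k)]
            new_assignment[w] = divergences.index(min(divergences))
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment


def to_cluster_counts(tokens, assignment, k):
    """Represent a document as counts per word cluster; unseen words are ignored."""
    vec = [0] * k
    for tok in tokens:
        if tok in assignment:
            vec[assignment[tok]] += 1
    return vec
```

A document is then described by `k` cluster counts rather than one count per vocabulary word, which is where the compactness of the representation comes from.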
More information about my thesis (including preprocessed datasets and text categorization software).
On Feature Distributional Clustering for Text Categorization. Joint work with R. El-Yaniv, N. Tishby and Y. Winter. In Proceedings of SIGIR 2001 (ps, pdf)
Distributional Word Clusters vs. Words for Text Categorization. Joint work with R. El-Yaniv, N. Tishby and Y. Winter. In Special Issue on Variable and Feature Selection of JMLR 2003 (ps, pdf; preliminary version)
Word Distributional Clustering for Text Categorization. M.Sc. Thesis (ps, pdf)

JMLR

BISFAI, Jun 2001 (ppt)
SIGIR, Sept 2001 (ppt)
Technion, Aug 2002 (ppt)
Bar Ilan University, March 2003 (ppt)

Computer Organization and Programming 234118 (1999-2001)

Zoran Inc. (2001-present)
Motorola Semiconductor Israel (1996-1999)

M.Sc. in Computer Science Cum Laude (1999-2002). Department of Computer Science, Technion - Israel Institute of Technology
B.Sc. in Computer Science Cum Laude (1994-1997). Department of Computer Science, Technion - Israel Institute of Technology
B.A. in Applied Mathematics Summa Cum Laude (1991-1994). Faculty of Computational Mathematics and Cybernetics, Moscow State University