Abstract:
In this talk, I will address the problem of automatically cleaning noisy
data sets. I will focus on cleaning textual data collections, though the
methods I will discuss are applicable to various types of data. Many textual
data sets contain a subset of documents that share a common topic (we
call them "the core"), while the rest of the documents have no strong
correlation with the others ("the noise"). For example, given a query
"iPhone", one of Google's top search results is an advertisement for a
blender that can crush an iPhone -- information that does not have much to
do with iPhones.
More formally, the problem can be defined as detecting the largest and most
comprehensive subset in a given text collection. A dual formulation of the
same problem would be to remove "outliers", i.e. documents that are "too far
away" from all the others. Since it is computationally expensive to estimate
distances between all pairs of documents, we take an alternative approach of
optimizing a global objective function. We derive two versions of the
resulting model and apply them to three real-world problems in Web Mining,
Information Retrieval, and Topic Detection.
Joint work with Koby Crammer (University of Pennsylvania).
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Ron Bekkerman is a Research Scientist at HP Labs, California. Ron joined HP
Labs in October 2007, after completing his PhD in Computer Science at the
University of Massachusetts, Amherst. Ron's research interests include
practical aspects of machine learning and data mining. In his PhD work, Ron
proposed a new model for unsupervised and semi-supervised learning, with
applications to Web Mining and Information Retrieval. Ron received his BSc
and MSc in Computer Science from the Technion -- Israel Institute of
Technology. His Master's thesis was on feature induction for text
categorization. Ron has served or is serving on the program committees of
top-tier conferences, including ICML-07, SIGIR-08, and KDD-09.