Abstract:
During the past decade huge bodies of raw data have emerged in numerous
domains, including science (genomic data and astronomical sky surveys),
finance (credit card transactions), and technology (the World-Wide Web).
These "massive data sets" hide precious pieces of information, which
once discovered, can have major implications to medical breakthroughs,
to the advancement of science, and to business success,
Massive data sets are typically unstructured, noisy, heterogeneous, and
dynamic --- characteristics that make the task of extracting valuable
information from them very challenging. In this talk I will describe
four highlights of my work on the theoretical study of computing on
large data sets and on applied research in Web Data Mining:
1. Web Statistics: a method for generating statistics about web pages,
using random walks on the web graph.
2. Sampling Lower Bounds: an information-theoretic technique for finding
how many data samples are needed in order to estimate a function over a
large data set.
3. Hamming Distance Sketching: a scheme for fingerprinting large
documents into small "sketches", so that finding whether they are
similar or not can be inferred from the sketches only,
4. Template Detection: an algorithm for eliminating "templates" from web
pages, yielding a significant improvement to search engine precision.