Time+Place: Sunday 21/12/2003 14:30 Room 337-8 Taub Bld.
Title: Massive Data Sets: Theory and Practice
Speaker: Ziv Bar-Yossef http://www.almaden.ibm.com/cs/people/ziv
Affiliation: IBM Almaden Research Center
Host: Johann Makowsky

Abstract:

During the past decade huge bodies of raw data have emerged in numerous 
domains, including science (genomic data and astronomical sky surveys), 
finance (credit card transactions), and technology (the World-Wide Web). 
These "massive data sets" hide precious pieces of information, which 
once discovered, can have major implications to medical breakthroughs, 
to the advancement of science, and to business success,

Massive data sets are typically unstructured, noisy, heterogeneous, and 
dynamic --- characteristics that make the task of extracting valuable 
information from them very challenging. In this talk I will describe 
four highlights of my work on the theoretical study of computing on 
large data sets and on applied research in Web Data Mining:

1. Web Statistics: a method for generating statistics about web pages, 
using random walks on the web graph.

2. Sampling Lower Bounds: an information-theoretic technique for finding 
how many data samples are needed in order to estimate a function over a 
large data set.

3. Hamming Distance Sketching: a scheme for fingerprinting large 
documents into small "sketches", so that finding whether they are 
similar or not can be inferred from the sketches only,

4. Template Detection: an algorithm for eliminating "templates" from web 
pages, yielding a significant improvement to search engine precision.