Dan Feldman (Callifornia Institute of Technology)
Data analysis of massive data sets is common today for web-search (e.g. Google), social networking (e.g. Facebook), financial
applications, supermarkets, bioinformatics and many other fields.
A coreset (or, core-set) for a given problem is a "compressed"
representation of its input, in the sense that a solution for the
problem with the (small) coreset as input would yield an approximate solution to the problem with the original (large) input.
Using traditional techniques, a coreset usually implies provable
linear time algorithms for the corresponding optimization problem,
which can be computed in parallel, via one pass over the data, and
using only logarithmic space (i.e, in the streaming model).
In this talk I will present a unified framework that yields the
efficient construction of succinct coresets for several problems.
Representing the data by a set F of positive functions over a ground
set X, our framework forges a link between the combinatorial
complexity of the function family F at hand (measured in terms of
classical VC dimension) and the paradigm of coresets. Our coresets are
obtained via randomized sampling according to a delicately designed
sampling distribution. Examples in which our framework has been
successful (and significantly improves over previous works) include
the k-median, the k-line median, projective clustering, linear
regression, low rank approximation, and subspace approximation