Skip to content (access key 's')
Logo of Technion
Logo of CS Department


Pixel Club: Clustering and Approximating High-Dimensional Streaming Data using Coresets
event speaker icon
Dan Feldman (Callifornia Institute of Technology)
event date icon
Sunday, 22.5.2011, 11:30
event location icon
Room 337-8 Taub Bld.
Data analysis of massive data sets is common today for web-search (e.g. Google), social networking (e.g. Facebook), financial applications, supermarkets, bioinformatics and many other fields.

A coreset (or, core-set) for a given problem is a "compressed" representation of its input, in the sense that a solution for the problem with the (small) coreset as input would yield an approximate solution to the problem with the original (large) input. Using traditional techniques, a coreset usually implies provable linear time algorithms for the corresponding optimization problem, which can be computed in parallel, via one pass over the data, and using only logarithmic space (i.e, in the streaming model).

In this talk I will present a unified framework that yields the efficient construction of succinct coresets for several problems. Representing the data by a set F of positive functions over a ground set X, our framework forges a link between the combinatorial complexity of the function family F at hand (measured in terms of classical VC dimension) and the paradigm of coresets. Our coresets are obtained via randomized sampling according to a delicately designed sampling distribution. Examples in which our framework has been successful (and significantly improves over previous works) include the k-median, the k-line median, projective clustering, linear regression, low rank approximation, and subspace approximation problems.
[Back to the index of events]