Moshe Gabel, Ph.D. Thesis Seminar
Wednesday, 10.5.2017, 13:00
Recent years have seen an explosion in the number of connected devices, which means not only growth in the velocity and volume of data, but also that data sources are increasingly geographically distributed, raising the cost of communication. Data mining algorithms often assume that data is centralized or that communication is inexpensive: the setting is implicitly assumed to be a data center.
In settings like wireless sensor networks, however, communication costs battery power. Moreover, most work only considers one-shot computation: computing a result once from a fixed data set. Yet data is increasingly dynamic, and many applications need current results over a recent time window.
In this talk, we focus on computing approximations over aggregated distributed data streams with reduced communication. Using a safe zone framework developed in our group (also called geometric monitoring), we'll describe three novel distributed approximations for important non-linear functions: variance, least-squares regression, and Shannon's entropy. Our algorithms provide deterministic user-defined error bounds, while avoiding messages unless needed to maintain those bounds. Compared to the centralized solution, our algorithms reduce communication by up to two orders of magnitude on several real data sets, including machine health monitoring, network monitoring with netflows, traffic monitoring, and others.
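As a rough illustration of the safe zone idea for variance, the toy simulation below has each node check a purely local condition and triggers communication only when that condition fails. It is a sketch under stated assumptions, not the algorithm presented in the talk: the ball-shaped safe zone, the equal-size sliding windows, the Gaussian data, the full resynchronization on any violation, and all names (`simulate`, `safe_radius`, `local_vector`) are simplifications chosen here for illustration.

```python
import math
import random


def local_vector(window):
    """Local statistics vector: (mean of x, mean of x^2)."""
    n = len(window)
    return (sum(window) / n, sum(x * x for x in window) / n)


def variance_of(v):
    """Variance as a non-linear function of the statistics vector."""
    mean, mean_sq = v
    return mean_sq - mean * mean


def safe_radius(v0, eps):
    """Radius r of a ball around v0 inside {v : |var(v) - var(v0)| <= eps}.

    For a displacement d with |d| <= r, the change in variance is
    d2 - 2*m*d1 - d1^2, which is bounded by r*sqrt(1 + 4*m^2) + r^2.
    Solving r*c + r^2 = eps for r gives the formula below.
    """
    c = math.sqrt(1.0 + 4.0 * v0[0] ** 2)
    return (-c + math.sqrt(c * c + 4.0 * eps)) / 2.0


def average(vectors):
    k = len(vectors)
    return (sum(v[0] for v in vectors) / k, sum(v[1] for v in vectors) / k)


def simulate(num_nodes=4, window=50, steps=200, eps=0.5, seed=0):
    """Returns (messages sent, centralized messages, max observed error)."""
    rng = random.Random(seed)
    nodes = [[rng.gauss(0.0, 1.0) for _ in range(window)]
             for _ in range(num_nodes)]

    def sync():
        # Every node reports its local vector; the coordinator averages them.
        vecs = [local_vector(w) for w in nodes]
        v0 = average(vecs)
        return vecs, v0, safe_radius(v0, eps)

    at_sync, v0, radius = sync()
    messages = num_nodes
    max_error = 0.0
    for _ in range(steps):
        violated = False
        for i, w in enumerate(nodes):
            w.pop(0)
            w.append(rng.gauss(0.0, 1.0))
            u = local_vector(w)
            # Local check: has this node drifted out of the ball around v0?
            drift = math.hypot(u[0] - at_sync[i][0], u[1] - at_sync[i][1])
            if drift > radius:
                violated = True
        if violated:
            at_sync, v0, radius = sync()
            messages += num_nodes
        # Ground truth, computed only to validate the bound in this sketch.
        truth = variance_of(average([local_vector(w) for w in nodes]))
        max_error = max(max_error, abs(truth - variance_of(v0)))
    return messages, num_nodes * steps, max_error


if __name__ == "__main__":
    sent, centralized, err = simulate()
    print(f"messages: {sent} (centralized: {centralized}), max error: {err:.4f}")
```

The bound holds because the ball is convex: the global vector is the average of the nodes' drift vectors, so while every node stays inside the ball, the average does too, and the coordinator's last estimate remains within `eps` of the true variance without any messages.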