Ran Bernstein, M.Sc. Thesis Seminar
Real systems for mining dynamic data streams should be able to detect changes that affect the accuracy of their model. A distributed setting poses one of the main challenges for this kind of change detection: training the model requires centralizing the data from all nodes (hereafter, synchronization), which is very costly in terms of communication. To minimize this communication, a monitoring algorithm should be executed locally at each node while preserving the validity of the global model (the model that would be computed if a synchronization occurred). We propose the first communication-efficient algorithm for monitoring a classification model over distributed, dynamic data streams. The classification algorithm we chose to monitor is Linear Discriminant Analysis (LDA), a popular method for classification and dimensionality reduction in many fields. We made this choice because of the strong theoretical guarantee of correctness that we prove for the monitoring algorithm of this kind of model.
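As background on the monitored model, a two-class LDA classifier can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the thesis's monitoring algorithm; it assumes equal class priors and a shared covariance, as standard LDA does:

```python
import numpy as np

# Minimal two-class LDA (illustrative; not the monitoring algorithm).
# LDA models each class as Gaussian with a shared covariance matrix,
# which yields a linear decision rule w.x + b > 0.
def fit_lda(X0, X1):
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class covariance (shared by the LDA assumption).
    S = (np.cov(X0, rowvar=False) * (len(X0) - 1)
         + np.cov(X1, rowvar=False) * (len(X1) - 1)) / (len(X0) + len(X1) - 2)
    w = np.linalg.solve(S, mu1 - mu0)   # discriminant direction
    b = -0.5 * w @ (mu0 + mu1)          # threshold, assuming equal priors
    return w, b

def predict(w, b, X):
    return (X @ w + b > 0).astype(int)  # 1 = class of X1, 0 = class of X0

# Synthetic two-class data for demonstration.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
X1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(200, 2))
w, b = fit_lda(X0, X1)
acc = (np.mean(predict(w, b, X0) == 0) + np.mean(predict(w, b, X1) == 1)) / 2
```

In a distributed streaming setting, the quantities that define this rule (class means and the pooled covariance) drift as data arrives at each node, which is precisely what the monitoring algorithm tracks without synchronizing every round.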
Beyond its theoretical guarantee, we demonstrate that our algorithm and a probabilistic variant of it reduce communication volume by up to two orders of magnitude (compared to synchronizing in every round) on three real data sets from different content domains. Moreover, our approach monitors the classification model itself rather than its misclassifications, which makes it possible to detect a change before misclassifications occur.