Ilya Kolchinsky, Ph.D. Thesis Seminar
Tuesday, 26.2.2019, 12:30
Rapid advances in data-driven applications over recent years have intensified the need for efficient mechanisms capable of real-time detection of arbitrarily complex patterns in massive data streams. Complex event processing (CEP) is a prominent technology widely employed for this task in many areas, including online finance, security monitoring, credit card fraud detection, and the Internet of Things (IoT). CEP is also an increasingly active and rapidly developing area of academic research, and its functionality is provided by multiple open-source libraries and commercial data analysis platforms.
CEP engines operate by collecting basic data items arriving on input streams and using them to infer complex events according to patterns defined by the system users. To that end, data items are combined into higher-level entities matching the pattern-specified structure. To guarantee detection correctness, a CEP system must actively maintain every subset of data items that might eventually become part of a successful pattern match. As a result, the overall number of such potential matches grows exponentially with the size and sophistication of the pattern being detected. Since real-life patterns typically have a highly convoluted structure and may consist of 10 or more events connected by complex operators and predicates, this introduces a crucial performance bottleneck, rendering complex event processing virtually infeasible even for large businesses capable of acquiring extensive computational resources. In addition, modern CEP applications are often required to process hundreds or even thousands of patterns and streams in parallel under tight real-time constraints, which further compounds the problem.
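To make the blow-up concrete, the following back-of-the-envelope sketch counts the partial matches a naive engine must retain for a simple sequence pattern. This is purely illustrative and not the thesis's implementation; the uniform events-per-type assumption is made up for the example:

```python
# Illustrative sketch: for a sequence pattern SEQ(A, B, ..., K), an eager
# CEP engine must retain every combination of already-seen events that
# could still extend into a full match.  A partial match is a filled
# prefix of the pattern: one event chosen for each of the first j steps.

def count_partial_matches(events_per_type: int, pattern_length: int) -> int:
    """Partial matches kept by a naive engine, assuming (for illustration)
    the same number of live events per type and no filtering predicates."""
    return sum(events_per_type ** j for j in range(1, pattern_length + 1))

# Doubling the pattern length roughly squares the state to be maintained:
print(count_partial_matches(10, 3))  # 1110
print(count_partial_matches(10, 6))  # 1111110
```

The count grows as the sum of n^j over pattern prefixes, i.e. exponentially in the pattern length, which is exactly the bottleneck described above.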
In this talk, we present a novel solution for overcoming the exponential resource requirements of complex event processing. Our solution is based on the principle of 'statistic-based lazy evaluation'. Under this paradigm, incoming data items may be processed in an order different from their natural order of appearance in the input stream. As a result, statistical properties of the underlying data, such as the occurrence rates of the different item types and the selectivities (success probabilities) of the predicates, can be exploited to generate efficient evaluation plans that provide close-to-optimal detection performance.
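The following hypothetical sketch shows why the processing order matters. The arrival rates and the single uniform selectivity are invented numbers, and the cost model is a deliberately simplified stand-in for the ones studied in the thesis:

```python
# Assumed, purely illustrative statistics: per-type arrival rates
# (events per time window) and one uniform predicate selectivity.
rates = {"A": 1000, "B": 50, "C": 5}
SELECTIVITY = 0.1  # assumed probability that a pairwise predicate holds

def expected_intermediate_results(order):
    """Expected number of partial matches summed over all evaluation steps,
    when event types are matched in the given order."""
    total, partial = 0.0, 1.0
    for i, event_type in enumerate(order):
        # Each step extends every surviving partial match with events of
        # the next type; predicates prune all but a SELECTIVITY fraction.
        partial *= rates[event_type] * (SELECTIVITY if i > 0 else 1.0)
        total += partial
    return total

# Natural arrival order vs. rarest-type-first:
print(expected_intermediate_results(["A", "B", "C"]))  # 8500.0
print(expected_intermediate_results(["C", "B", "A"]))  # 2530.0
```

Starting from the rarest event type keeps the intermediate partial-match sets small throughout, which is the intuition behind exploiting occurrence rates and selectivities in plan generation.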
We devise and describe an efficient lazy evaluation mechanism for complex event processing based on nondeterministic finite automata. By exploiting the similarity of our problem to the well-known problem of join query plan generation, we develop a procedure for adapting existing join-based algorithms to the CEP domain, thus creating a new family of algorithms for generating practically efficient pattern detection plans. Since the statistical data properties required for plan generation are rarely known in advance and may change dynamically, we present an efficient and precise mechanism that continuously estimates the required data statistics on-the-fly and, whenever necessary, adapts the evaluation plan accordingly. Finally, we extend our methods to a multi-pattern CEP environment, demonstrating how the lazy evaluation approach can facilitate sharing common subexpressions between different patterns in a workload. An extensive theoretical and empirical analysis demonstrates the superiority of our innovations over state-of-the-art approaches.
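To illustrate the join-ordering analogy, here is a minimal greedy heuristic in the spirit of classical greedy join-order algorithms. It is a sketch under strong simplifying assumptions (made-up statistics, one uniform selectivity), not the actual procedure developed in the thesis:

```python
# Hypothetical greedy plan generation: repeatedly pick the event type whose
# addition minimizes the growth of the intermediate partial-match set,
# analogously to greedy join-order heuristics in query optimization.
rates = {"A": 1000, "B": 50, "C": 5}  # assumed arrival rates
SELECTIVITY = 0.1                      # assumed uniform selectivity

def greedy_evaluation_order(rates, selectivity):
    """Return an evaluation order chosen greedily by step cost."""
    remaining = set(rates)
    order = []
    partial = 1.0
    while remaining:
        # Cost of extending by type t: current partials * rate * selectivity
        # (the first step has no predicate to apply yet).
        factor = selectivity if order else 1.0
        best = min(remaining, key=lambda t: partial * rates[t] * factor)
        partial *= rates[best] * factor
        order.append(best)
        remaining.remove(best)
    return order

print(greedy_evaluation_order(rates, SELECTIVITY))  # ['C', 'B', 'A']
```

With a single uniform selectivity the heuristic reduces to rarest-first ordering; with per-pair selectivities, as in realistic workloads, the step costs diverge and the choice of join-based algorithm genuinely matters.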