Technical Report PHD-2017-17

Title: Generalizations of the Cardinality Estimation Problem and Applications to Computer Networks
Authors: Aviv Yehezkel
Supervisors: Reuven Cohen
PDFCurrently accessibly only within the Technion network
Abstract: Classical processing algorithms for database management systems usually require several passes over (static) datasets in order to produce an accurate answer to a user query. However, for a wide range of application domains, the data set is very large and is updated on a continuous basis, making this approach impractical. Such big data streams appear in a wide variety of computer science applications. They are common, for example, in computer networks, where detailed usage statistics (such as the source IP addresses of packets) from different parts of the network need to be continuously collected and analyzed for various security and management tasks.

For this reason, there is a growing interest in developing “streaming algorithms” for efficient processing and querying of continuous data streams. These algorithms seek to provide accurate results while minimizing both the required storage and the processing time per stream element, at the price of a small inaccuracy in their output. Streaming algorithms use small fixed-size storage to store a summary (“sketch”) of the input data, and use probabilistic algorithms to accurately estimate the desired quantity.

A fundamental streaming problem is the “cardinality estimation problem”: given a very long stream of elements x1, x2, x3, ...,xs with repetitions, the goal is to find the number n of distinct elements. This is a well-known problem in numerous big data applications for network monitoring, security, query optimization, query execution progress, etc. The cardinality estimation problem can be generalized to set expressions over multiple streams, which yields many new important big data applications in large data base systems, network monitoring and network security.

It is easy to use linear O(n) space to produce an accurate solution to the cardinality problem. This can be done, for example, by comparing the value of a newly encountered element, xi, to every (stored) value encountered so far. If the value of xi has not been seen before, it is stored and counted. However, for a wide range of application domains, the data set is very large, making linear space algorithms impractical. Thus, streaming algorithms are required.

This thesis focuses on several generalizations of the cardinality estimation problem. It proposes new streaming algorithms, analyzes their efficiency and statistical performance, and shows how they can be used for solving timely management and monitoring problems in computer networks.

CopyrightThe above paper is copyright by the Technion, Author(s), or others. Please contact the author(s) for more information

Remark: Any link to this technical report should be to this page (, rather than to the URL of the PDF files directly. The latter URLs may change without notice.

To the list of the PHD technical reports of 2017
To the main CS technical reports page

Computer science department, Technion