Technical Report MSC-2021-34

TR#:MSC-2021-34
Class:MSC
Title: Keyword Search in Deduplicated Storage Systems
Authors: Nadav Elias
Supervisors: Gala Yadgar
PDFCurrently accessibly only within the Technion network
Abstract: Deduplication is widely used to effectively increase the logical capacity of large-scale storage systems, by replacing redundant chunks of data with references to their unique copies. As a result, the logical size of a storage system may be many multiples of the physical data size. The many-to-one relationship between logical references and physical chunks complicates many functionalities supported by traditional storage systems, but, at the same time, presents an opportunity to rethink and optimize others. We focus on the common task of searching for a byte string (keyword) in a large data repository.

The traditional, naïve, search mechanism traverses the directory tree and reads the data chunks in the order in which they are referenced, fetching them from the underlying storage devices repeatedly if they are referenced multiple times. We propose a DedupSearch algorithm that operates in two phases: it first scans the storage sequentially and processes each data chunk only once, recording keyword matches in a temporary result database. It then traverses the system's metadata in its logical order, attributing matches within chunks to the files that contain them. The main challenge is to identify keywords that are split between logically adjacent chunks. To do that, the physical phase records keyword prefixes and suffixes at chunk boundaries, and the logical phase matches these substrings when processing the file's metadata. We limit the memory usage of the result database by offloading records of tiny (one-character) partial matches to the SSD/HDD, and ensure that it is rarely accessed.

We compare our DedupSearch algorithm to the naïve one on datasets of three different data types (text, code, and binaries), and show that it can reduce the overall search time by orders of magnitude.

CopyrightThe above paper is copyright by the Technion, Author(s), or others. Please contact the author(s) for more information

Remark: Any link to this technical report should be to this page (http://www.cs.technion.ac.il/users/wwwb/cgi-bin/tr-info.cgi/2021/MSC/MSC-2021-34), rather than to the URL of the PDF files directly. The latter URLs may change without notice.

To the list of the MSC technical reports of 2021
To the main CS technical reports page

Computer science department, Technion
admin