Technical Report MSC-2020-32

Title: Managing Capacity in Deduplicated Storage Systems
Authors: Aviv Nachman
Supervisors: Gala Yadgar
PDFCurrently accessibly only within the Technion network
Abstract: Deduplication decreases the physical occupancy of files in a storage volume by removing duplicate copies of data chunks, but creates data-sharing dependencies that complicate standard storage management tasks. Specifically, data migration plans, which consist the information of which files are remapped and which are not, must consider the dependencies between files that are remapped to new volumes and files that are not. Thus far, only greedy approaches have been suggested for constructing such plans, and it is unclear how they compare to one another and how much they can be improved theoretically and practically.

We set to bridge this gap for seeding---migration in which the target volume is initially empty. We prove that even this basic instance of data migration is NP-hard in the presence of deduplication. We then present GoSeed, a formulation of seeding as an integer linear programming (ILP) problem, and three acceleration methods for applying it to real-sized storage volumes. Our experimental evaluation shows that, while the greedy approaches perform well on "easy" problem instances, instances with relatively low deduplication ratio, the cost of their solution can be significantly higher than that of GoSeed's solution, which is our approach, on "hard" instances, for which they are sometimes unable to find a solution at all.

CopyrightThe above paper is copyright by the Technion, Author(s), or others. Please contact the author(s) for more information

Remark: Any link to this technical report should be to this page (, rather than to the URL of the PDF files directly. The latter URLs may change without notice.

To the list of the MSC technical reports of 2020
To the main CS technical reports page

Computer science department, Technion