|Title:||Clustering Based Data Migration in Deduplicated Storage
|Currently accessibly only within the Technion network|
|Abstract:||Deduplication is a leading method for reducing physical storage capacity when duplicate data is present. This method can be applied on chunks, files, containers, and more. Instead of storing the same physical data multiple times, a pointer is created from each logical copy to the same physical copy, saving the space of the duplicate data. Due to this, data is shared between objects, such as files or entire directories, which result in garbage collection overhead and migration challenges.
In our work, we addressed the general migration problem where files are remapped between different volumes due to system expansion or maintenance. The question of which files and blocks to migrate has been extensively studied in systems without deduplication. However, only simplified migration problems have been considered in the context of deduplicated storage. As part of a migration plan, we aim to minimize the system’s size while simultaneously ensuring that the storage load is evenly distributed across the volumes and that the network traffic required for the migration does not exceed its allocation.
Following that, we outline a way to develop effective migration plans using hierarchical clustering. Clustering refers to grouping objects based on their similarity. Hierarchical clustering, in particular, takes the distance between those objects into account. Each object is initially clustered separately, and the process of iterative clustering merges, in each step, two clusters with a minimal distance between them. We are interested in clustering files with high similarity together in order to reduce the amount of physical data while still maintaining low network traffic and a balanced system. Based on each cluster, we calculate data savings, traffic consumed, and load balance achieved and determine the plan’s quality.
We show that this method has different tradeoffs between computation time and migration efficiency compared to other algorithms such as greedy and ILP (Integer Linear Programming). Our algorithm achieves almost identical results (and sometimes even better) than ILP, which, theoretically, is optimal, but in much less time.
|Copyright||The above paper is copyright by the Technion, Author(s), or others. Please contact the author(s) for more information|
|Disclaimer||Recent theses may have not yet been approved by the Technion Senate, and are provided here for the purpose of fast dissemination of knowledge only. Final approval of the Technion Senate is needed for a thesis to be used for the partial fulfillment of the requirements for the degree of M.Sc. or Ph.D.|
Remark: Any link to this technical report should be to this page (http://www.cs.technion.ac.il/users/wwwb/cgi-bin/tr-info.cgi/2022/MSC/MSC-2022-36), rather than to the URL of the PDF files directly. The latter URLs may change without notice.
To the list of the MSC technical reports of 2022
To the main CS technical reports page
Computer science department, Technion