Roei Kisous, M.Sc. Thesis Seminar
Deduplication is a leading method for reducing physical storage capacity when duplicate data is present. This method can be applied on chunks, files, containers, and more. Instead of storing the same physical data multiple times, a pointer is created from each logical copy to the same physical copy, saving the space of the duplicate data. Due to this, data is shared between objects, such as files or entire directories, which result in garbage collection overhead and migration challenges.
In our work, we addressed the general migration problem where files are remapped between different volumes due to system expansion or maintenance. The question of which files and blocks to migrate has been extensively studied in systems without deduplication. However, only simplified migration problems have been considered in the context of deduplicated storage. As part of a migration plan, we aim to minimize the system's size while simultaneously ensuring that the storage load is evenly distributed across the volumes and that the network traffic required for the migration does not exceed its allocation.
Following that, we outline a way to develop effective migration plans using hierarchical clustering. Clustering refers to grouping objects based on their similarity. Hierarchical clustering, in particular, takes the distance between those objects into account. Each object is initially clustered separately, and the process of iterative clustering merges, in each step, two clusters with a minimal distance between them. We are interested in clustering files with high similarity together in order to reduce the amount of physical data while still maintaining low network traffic and a balanced system. Based on each cluster, we calculate data savings, traffic consumed, and load balance achieved and determine the plan's quality.
We show that this method has different tradeoffs between computation time and migration efficiency compared to other algorithms such as greedy and ILP (Integer Linear Programming). Our algorithm achieves almost identical results (and sometimes even better) than ILP, which, theoretically, is optimal, but in much less time.