Technical Report MSC-2021-17

Title: Clustering-Correcting Codes for DNA-based Storage System
Authors: Tal Shinkar
Supervisors: Eitan Yaakobi
PDFCurrently accessibly only within the Technion network
Abstract: A new family of codes, called \emph{clustering-correcting codes}, is presented in this thesis. This family of codes is motivated by the special structure of the data that is stored in DNA-based storage systems. The data stored in these systems has the form of unordered sequences, also called \emph{strands}, and every strand is synthesized thousands to millions of times, where some of these copies are read back during sequencing.

In the context of storing data, we refer to a DNA strand as a linear sequence over the alphabet {A,C,G,T}. In 2012 and 2013 two pioneering DNA based storage systems were implemented by Church et al. and Goldman et al., respectively. Those systems paved the way for multiple studies in which synthetic DNA molecules are used as a volume for storing data. The different systems share similar architecture principles. The writing channel is a chemical process named synthesis that generates artificial DNA strands. The reading is done using a technology called DNA sequencing. The length of the strands is limited to few hundreds symbols hence the DNA pool consists of multiple strands in order to store reasonable amount of data.

Due to the unordered structure of the strands, an important task in the decoding process is to place them in their correct order. This is usually accomplished by allocating part of the strand for an index. However, in the presence of errors in the index field, important information on the order of the strands may be lost. Clustering-correcting codes ensure that if the distance between the index fields of two strands is small, their data fields have large distance. It is shown how this property enables to place the strands together in their correct clusters even in the presence of errors.

Clustering-correcting codes are parametric in $\taui,\taud$. Those two values are chosen in respect to a DNA-storage system that defines the maximum number of errors that can occur in the index field and in the data field. Those values are denoted by $t_i, t_d$, respectively. We show that for $ \taud > 4t_d$ and a suitable choice of $\taui$ clustering-correcting capabilities can be achieved. This is due to the fact that for two strands read with the same index $(\mathsf{ind}_i, \bfu_k), (\mathsf{ind}_i, \bfu_j)$, it holds that $\dist(\bfu_k, \bfu_j) \le t_d + t_d = 2t_d$, where $\dist(\bfx, \bfy)$ is the Hamming distance between two vectors $\bfx$ and $\bfy$. While on the other hand for two different $(\mathsf{ind}_i, \bfu_i), (\mathsf{ind}_j, \bfu_j)$ strands that satisfy the constraint, hence $\dist(\mathsf{ind}_i, \mathsf{ind}_j) \le \taui$, it holds that $\dist(\bfu_i, \bfu_j) \ge \taud - 2t_d > 2t_d$.

This property allows to differentiate the strands that belong in the same cluster and those that do not. In some cases it is also possible to move the mis-clustered strands to their correct clusters. We present lower and upper bounds on the size of clustering-correcting codes and an explicit construction of these codes which uses only a single symbol of redundancy. The results are first presented for the Hamming metric and are then extended for the edit distance.

CopyrightThe above paper is copyright by the Technion, Author(s), or others. Please contact the author(s) for more information

Remark: Any link to this technical report should be to this page (, rather than to the URL of the PDF files directly. The latter URLs may change without notice.

To the list of the MSC technical reports of 2021
To the main CS technical reports page

Computer science department, Technion