Random Reed-Solomon Codes Achieve the Half-Singleton Bound for Insertions and Deletions over Linear-Sized Alphabets
Abstract
In this paper, we prove that with high probability, random Reed–Solomon codes approach the half-Singleton bound – the optimal rate versus error tradeoff for linear insdel codes – with linear-sized alphabets. More precisely, we prove that, for any $\varepsilon>0$ and positive integers $n$ and $k$, with high probability, random Reed–Solomon codes of length $n$ and dimension $k$ can correct $n-2k+1-\varepsilon n$ adversarial insdel errors over alphabets of size $\mathrm{poly}(1/\varepsilon)\cdot n$. This significantly improves upon the alphabet size demonstrated in the work of Con, Shpilka, and Tamo (IEEE TIT, 2023), who showed the existence of Reed–Solomon codes with exponential alphabet size precisely achieving the half-Singleton bound.
Our methods are inspired by recent works on list-decoding Reed–Solomon codes. Brakensiek–Gopi–Makam (STOC 2023) showed that random Reed–Solomon codes are list-decodable up to capacity with exponential-sized alphabets, and Guo–Zhang (FOCS 2023) and Alrabiah–Guruswami–Li (STOC 2024) improved the alphabet-size to linear. We achieve a similar alphabet-size reduction by similarly establishing strong bounds on the probability that certain random rectangular matrices are full rank. To accomplish this in our insdel context, our proof combines the random matrix techniques from list-decoding with structural properties of Longest Common Subsequences.
Keywords and phrases:
coding theory, error-correcting codes, Reed-Solomon codes, insdel, insertion-deletion errors, half-Singleton bound
Category:
Track A: Algorithms, Complexity and Games
Funding:
Roni Con: Supported by the European Union (DiDAX, 101115134). Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.
2012 ACM Subject Classification:
Mathematics of computing → Coding theory
Acknowledgements:
This work was conducted while all four authors were visiting the Simons Institute for the Theory of Computing at UC Berkeley. The authors would like to thank the institute for its support and hospitality. Additionally, the last author wishes to thank Prof. Xin Li for the helpful discussions and suggestions.
Editors:
Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis
1 Introduction
Error-correcting codes (hereafter referred to as codes) are constructs designed to enable the recovery of original information from data that has been corrupted. The primary corruption model explored in the literature involves errors or erasures. In this model, each symbol in the transmitted word is either replaced with a different symbol from the alphabet (an error) or with a “?” (an erasure).
Coding theory originated with the influential works of Shannon and Hamming. Shannon [49] studied random errors and erasures, whereas Hamming [32] studied the adversarial model for errors and erasures. These models are well understood, and today, we have efficiently encodable and decodable codes that are optimal for Shannon’s model of random errors. For adversarial errors, optimal and efficient codes exist over large alphabets, and there are good codes (codes with constant relative rate and relative distance) for every constant-sized alphabet.
Another important model that has been considered ever since Shannon’s work is that of synchronization errors. The most common model for studying synchronization errors is the insertion-deletion model (insdel for short): an insertion error is when a new symbol is inserted between two symbols of the transmitted word, and a deletion is when a symbol is removed from the transmitted word. For example, over the binary alphabet, when $1100$ is transmitted, we may receive the word $01001$, which is obtained from two insertions (a $0$ at the beginning and a $1$ at the end) and one deletion (one of the $1$'s at the beginning of the transmitted word). These errors are related to the more common substitution and erasure errors: a deletion is like an erasure, but the position of the deletion is unknown, and a substitution can be imitated by a deletion followed by an insertion. Despite their apparent similarity to well studied error models, insdel errors are much more challenging to handle.
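The insertion and deletion operations above are easy to state in code. The following toy Python sketch (our own illustration; the binary word is hypothetical) applies two insertions and one deletion to a transmitted word, and shows how a substitution can be imitated by a deletion followed by an insertion.

```python
def delete_at(s, i):
    """Remove the symbol at position i (a deletion error)."""
    return s[:i] + s[i+1:]

def insert_at(s, i, c):
    """Insert symbol c before position i (an insertion error)."""
    return s[:i] + c + s[i:]

# Two insertions and one deletion applied to a transmitted word:
sent = "1100"
step1 = delete_at(sent, 0)            # "1100" -> "100"  (delete a leading 1)
step2 = insert_at(step1, 0, "0")      # "100"  -> "0100" (insert 0 at the front)
received = insert_at(step2, 4, "1")   # "0100" -> "01001" (insert 1 at the end)

# A substitution is imitated by a deletion followed by an insertion:
substituted = insert_at(delete_at(sent, 2), 2, "1")  # "1100" -> "1110"
```

Note that three insdel operations changed the length by only one, which is why positions in the received word no longer line up with positions in the transmitted word.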
Coding schemes that correct insdel errors are not only an intriguing theoretical concept but also have implications in real-world scenarios. Such codes find applications in magnetic and optical recording, semiconductor devices, integrated circuits, and synchronous digital communication networks (for detailed applications, refer to the survey by Mercier [42]). This natural theoretical model, coupled with its relevance across various domains, has led many researchers in recent years to study and design codes that can correct insdel errors [6, 31, 5, 28, 46, 11, 29, 9, 24, 25], to name just a few. Although there has been significant progress in recent years on understanding this model of insdel errors (both on limitations and on constructing efficient codes), our comprehension of this model lags far behind our understanding of codes that correct erasures and substitution errors (we refer the reader to the following excellent surveys [43, 42, 12, 30]).
Codes that can correct insdel errors also attract a lot of attention due to their possible application in designing DNA-based storage systems [20]. This recent increased interest was paved by substantial progress in synthesis and sequencing technologies. The main advantages of DNA-based storage over classical storage technologies are very high data densities and long-term reliability without an electrical supply. It is thus natural that designing codes for DNA storage and studying the limitations of this model received a lot of attention recently [34, 37, 33, 50].
While linear codes are highly desirable, most of the works constructing codes for the insdel model are not linear codes, in contrast to the predominant use of linear codes for Hamming errors. Notable examples of linear codes for Hamming errors include Hamming codes, Reed–Solomon codes, Reed–Muller codes, algebraic-geometry codes, polar codes, Turbo codes, expander codes, and LDPC codes. The reason for the absence of linear codes in the insdel model can be found in works such as [1, 9], which show that the maximal rate of a linear code capable of correcting insdel errors is significantly worse than that of a nonlinear code correcting the same number of insdel errors. However, linear codes are desirable for many reasons: they offer compact representation, efficient encoding, easier analysis, and in many cases, efficient decoding algorithms. Moreover, linear codes are a central mathematical object that has found widespread applications across various research areas. As such, studying them in the context of insdel errors is important and was the subject of recent works [9, 8, 13, 14, 35, 10, 40].
This work concerns the performance of Reed–Solomon codes in the insdel model. As Reed–Solomon codes are ubiquitous in theory and practice, it is important to understand whether they can also decode from insdel errors. This question received a lot of attention recently [45, 56, 54, 16, 41, 7, 14, 15, 40]. Our result makes a substantial improvement in this line of research. Specifically, we show that “most” RS codes over linear-sized fields have almost optimal insdel error-correcting capabilities. Our methods are inspired by recent works on list-decoding Reed–Solomon codes [23, 2], which showed that “most” RS codes are list-decodable up to capacity over linear-sized alphabets. Specifically, we achieve our result by adapting the random matrix techniques from [23, 2] to the insdel setting, which required the development of several chain-based structural results for longest common subsequences.
1.1 Previous works
Insdel codes
To offer a wider context, we now discuss some results related to the broader study of codes correcting adversarial insdel errors.
Codes correcting adversarial insdel errors were first considered in the seminal works of Levenshtein [38] and Varshamov and Tenengolts [55], the latter of which constructed binary codes that can correct a single insdel error with optimal redundancy. The first efficient construction of asymptotically good binary insdel codes – codes with asymptotically positive rate tolerating an asymptotically positive insdel fraction – is, as far as we know, given in [47].
Despite these early successes and much recent interest, there are still significant gaps in our understanding of insdel codes, particularly for binary codes. For binary codes, gaps remain for several important questions, including (i) determining the optimal redundancy-error tradeoff and finding explicit binary codes for correcting $t$ insdels, for constant $t$ [24, 51, 52]; (ii) finding explicit binary codes with optimal rate for correcting a $\delta$ fraction of insdels, for small constant $\delta$ [11, 29]; (iii) determining the zero-rate threshold, the maximum fraction of errors correctable by an asymptotically positive rate code (we know the answer to lie in a strict subinterval of $(0,1/2)$) [26, 25]; and (iv) determining the optimal rate-versus-distance tradeoff for deletion codes [38, 39, 58], among others.
On the other hand, when the alphabet can be larger, as is the case in this work, the picture is more complete. Haeupler and Shahrasbi [31], using a novel primitive called synchronization strings, constructed codes of rate $1-\delta-\varepsilon$ that can correct a $\delta$ fraction of insdel errors over an alphabet of size $O_\varepsilon(1)$. This work was the first to give explicit and efficient constructions of codes that achieve the Singleton bound in the edit distance setting over constant alphabets. It also inspired several follow-up works that improved the secondary code parameters and applied synchronization strings to related problems; we refer the reader to [30] for a discussion.
Linear insdel codes
As far as we know, the first study of the performance of linear codes against insdel errors is due to [1]. Specifically, they showed that any linear code that can correct one deletion must have rate at most $1/2$. Note that this result shows that linear codes are provably worse than nonlinear codes in the context of insdel errors. Indeed, nonlinear codes correcting a single deletion can achieve rate close to $1$, whereas linear codes have rate at most $1/2$.
Nevertheless, as described in the introduction, studying the performance of linear codes against insdel errors is an important question that indeed has been studied in recent years. The basic question of whether there exist good linear codes for the insdel model was first addressed in the work of Cheng, Guruswami, Haeupler, and Li [9]. They showed that there are linear codes of rate approaching $(1-\delta)/2$ that can correct a $\delta$ fraction of insdel errors. They also established two upper bounds on the rate of linear codes that can correct a $\delta$ fraction of deletions. First, they proved the half-Plotkin bound ([9, Theorem 5.1]), which states that every linear code over a finite field $\mathbb{F}_q$ capable of correcting a $\delta$-fraction of insdel errors has rate at most $\frac{1}{2}\left(1-\frac{q}{q-1}\delta\right)+o(1)$. Then, they established the following alphabet-independent bound.
Theorem 1 (Half-Singleton bound [9, Corollary 5.2]).
Every linear insdel code which is capable of correcting a $\delta$ fraction of deletions has rate at most $\frac{1-\delta}{2}+o(1)$.
Remark.
The following non-asymptotic version of the half-Singleton bound can be derived from the proof of [9, Corollary 5.2]: an $[n,k]$ linear code can correct at most $n-2k+1$ insdel errors.
The question of constructing explicit linear codes that are efficiently decodable has also been studied and there are explicit and efficient constructions of linear codes over small alphabets (such as binary) that are asymptotically good [13, 10]. However, we note that over small alphabets, determining the optimal rate-error tradeoff and finding optimal explicit constructions remains an open problem.
Reed–Solomon codes against insdel errors
In this work, we focus on Reed–Solomon codes, which are among the most well-known codes. These codes are defined as follows.
Definition 2 (Reed–Solomon code).
Let $\alpha_1,\ldots,\alpha_n$ be distinct elements in the finite field $\mathbb{F}_q$ of order $q$. For $k\le n$, the Reed–Solomon (RS) code of dimension $k$ and block length $n$ associated with the evaluation vector $\alpha=(\alpha_1,\ldots,\alpha_n)$ is defined to be the set of codewords
$$\mathrm{RS}_{n,k}(\alpha)=\left\{\big(f(\alpha_1),\ldots,f(\alpha_n)\big)\;:\;f\in\mathbb{F}_q[x],\ \deg f<k\right\}.$$
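As an illustration of Definition 2, here is a minimal Python sketch that enumerates an RS code over a small prime field. The function names are ours, and a prime modulus stands in for a general finite field; this is a toy for intuition, not an efficient encoder.

```python
from itertools import product

def rs_codeword(coeffs, evals, q):
    """Evaluate the message polynomial f (coefficients listed lowest degree
    first) at each evaluation point, modulo the prime q."""
    def f(x):
        return sum(c * pow(x, i, q) for i, c in enumerate(coeffs)) % q
    return tuple(f(a) for a in evals)

def rs_code(n, k, evals, q):
    """All q**k codewords of the [n, k] RS code with evaluation vector evals."""
    assert len(evals) == n and len(set(evals)) == n  # distinct evaluation points
    return {rs_codeword(coeffs, evals, q) for coeffs in product(range(q), repeat=k)}
```

For instance, `rs_code(4, 2, (1, 2, 3, 4), 5)` lists the $5^2 = 25$ codewords of a $[4,2]$ RS code over $\mathbb{F}_5$; since the code is linear, the componentwise sum (mod 5) of any two codewords is again a codeword.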
The performance of RS codes against insdel errors was first considered in [45] in the context of traitor tracing. In [56], the authors constructed an RS code correcting a single deletion. In [54], a generalized RS code capable of correcting multiple deletions was constructed. Several constructions of two-dimensional RS codes correcting insdel errors are given in [16, 41]. Note that the half-Singleton bound states that an $[n,k]$ code can correct at most $n-2k+1$ insdel errors. In [14], a lower bound was shown on the field size needed for an $[n,2]$ RS code to decode from $n-3$ insdel errors. On the other hand, [15] gave an explicit construction with an essentially matching field size. Thus, for the special case of two-dimensional RS codes, there is a complete characterization of RS codes that achieve the half-Singleton bound.
For higher dimensions $k$, much less is known. It was shown in [14] that over large enough fields, there exist RS codes that exactly achieve the half-Singleton bound. Specifically,
Theorem 3 ([14, Theorem 16]).
Let $k$ and $n$ be positive integers such that $2k-1\le n$. For $q=n^{O(n)}$, there exists an $[n,k]_q$ RS code that can decode from $n-2k+1$ insdel errors. (The alphabet size in [14, proof of Theorem 16] is actually $n^{O(k)}$, which is better, especially for sublinear $k$. However, given that the primary parameter regime of interest in this paper is $k=\Theta(n)$, we state this simplified version for brevity.)
In both theoretical and practical scenarios, codes over smaller alphabets tend to be more valuable, but the field size above is very large. Whether Reed–Solomon codes over significantly smaller fields are capable of correcting insdel errors up to the half-Singleton bound remained an interesting and important open problem, which we address in this work.
List-decodable (Hamming) codes
As we mentioned, we port some crucial ideas from a different context (list decoding under the Hamming metric) to build our main results in the insertion-deletion world. To facilitate a better understanding, we provide a concise summary of recent and significant advancements in the field of list decoding of codes under the Hamming metric.
The notion of list decoding was introduced by Elias and Wozencraft [17, 57] in the 1950s as a natural generalization of unique decoding. Briefly, a code exhibits satisfactory list decodability if its codewords are widely dispersed throughout the entire space, that is, there are not too many codewords located within a single ball under the Hamming metric. After the seminal work of Sudan [53] and Guruswami–Sudan [27], which provided efficient algorithms for list decoding RS codes up to the Johnson bound [36], the issue of understanding the list decodability of RS codes beyond the Johnson bound became very important. A long line of works [44, 48, 21, 18, 19, 4, 23, 2, 3, 22] have made significant advancements in the understanding of list decodability for RS and related codes. Specifically, recent works [4, 23, 2] have shown that “most” RS codes over exponential-sized alphabets (in terms of the code length) are optimally list decodable, and “most” RS codes over linear-sized alphabets are in fact almost optimally list decodable.
1.2 Our results
When $\alpha$ is uniformly distributed over the set of all $n$-tuples of distinct elements in $\mathbb{F}_q$, we say the code $\mathrm{RS}_{n,k}(\alpha)$ is a random RS code of dimension $k$ and length $n$ over $\mathbb{F}_q$. In this work, we show that random RS codes over alphabets of size linear in $n$, with high probability, approach the half-Singleton bound for insdel errors. Specifically,
Theorem 4 (Informal, Details in Theorem 40).
Let $\varepsilon>0$, and let $n$ and $k$ be positive integers. For a prime power $q\ge\mathrm{poly}(1/\varepsilon)\cdot n$, with high probability, a random RS code of dimension $k$ and length $n$ over $\mathbb{F}_q$ corrects at least $n-2k+1-\varepsilon n$ adversarial insdel errors.
For the constant rate regime, $k=\Theta(n)$, our result exponentially improves the alphabet size of Con, Shpilka, and Tamo [14], where they have $q=n^{O(k)}$. As a warmup to this result, we prove a weaker but more straightforward result (Theorem 22), which establishes Theorem 4 for quadratic-sized alphabets.
1.3 Proof overview
We outline the proof of our main theorem in this section. First, we review the proof of Theorem 3 [14] that achieves the half-Singleton bound with exponential-sized alphabets. We slightly modify the proof’s presentation to parallel our proofs. Second, we show how to prove a weaker version of Theorem 4 (Theorem 22) with quadratic-sized alphabets. Lastly, we describe how to prove Theorem 4, which achieves linear-sized alphabets. Throughout this overview, let $\mathcal{C}=\mathrm{RS}_{n,k}(\alpha)$ be a random Reed–Solomon code of length $n$ and dimension $k$, where the tuple $\alpha$ of evaluation points is sampled uniformly from all $n$-tuples of pairwise-distinct elements of $\mathbb{F}_q$.
Warmup: exponential alphabet size
We start with the argument from [14] that proves that Reed–Solomon codes achieve the half-Singleton bound over exponential-sized alphabets. We want (with high probability over $\alpha$) that our code corrects $n-2k+1$ insdel errors, or equivalently, has pairwise small LCS: $\mathrm{LCS}(c_1,c_2)\le 2k-2$ for any two distinct codewords $c_1,c_2$.
The key observation is that, if our code fails to correct $n-2k+1$ insdels, then there exist increasing sequences of indices $I=(i_1<\cdots<i_{2k-1})$ and $J=(j_1<\cdots<j_{2k-1})$ such that the following matrix, which we call the $V$-matrix, is bad, meaning it does not have full column rank.
$$V_{I,J}\;=\;\begin{pmatrix}
1 & x_{i_1} & \cdots & x_{i_1}^{k-1} & x_{j_1} & \cdots & x_{j_1}^{k-1}\\
\vdots & \vdots & & \vdots & \vdots & & \vdots\\
1 & x_{i_{2k-1}} & \cdots & x_{i_{2k-1}}^{k-1} & x_{j_{2k-1}} & \cdots & x_{j_{2k-1}}^{k-1}
\end{pmatrix}$$
Indeed, if $\mathcal{C}$ fails to correct $n-2k+1$ insdels, there exist two distinct polynomials $f(x)=\sum_{s=0}^{k-1}f_s x^s$ and $g(x)=\sum_{s=0}^{k-1}g_s x^s$ of degree less than $k$, such that the $I$-indexed subsequence of the codeword for $f$ equals the $J$-indexed subsequence of the codeword for $g$. In that case,
$$V_{I,J}\cdot\big(f_0-g_0,\ f_1,\ \ldots,\ f_{k-1},\ -g_1,\ \ldots,\ -g_{k-1}\big)^{\top}=0,$$
so the $V$-matrix is bad.
Now, it suffices to show that, with high probability, all $V$-matrices are good (have full column rank). By considering the determinant of the $V$-matrix (which is square, as it has $2k-1$ rows and $1+2(k-1)=2k-1$ columns), the probability that one $V$-matrix is bad is at most $\mathrm{poly}(k)/q$ by the Schwartz–Zippel lemma. (An important detail here, which [14] proves, is that, to apply the Schwartz–Zippel lemma, the determinant needs to be symbolically nonzero.) A $V$-matrix is defined by the indices of the subsequences $I$ and $J$, so there are at most $n^{2(2k-1)}$ possible $V$-matrices. Hence, by the union bound, the probability that some $V$-matrix is bad is at most $n^{2(2k-1)}\cdot\mathrm{poly}(k)/q$. Hence, for sufficiently large exponential alphabet sizes $q=n^{O(k)}$, our code corrects $n-2k+1$ insdel errors with high probability, as desired.
Quadratic alphabet size
We now discuss how to improve the field size bound, first to quadratic, and then to linear.
Our main idea, inspired by [23, 2], is to use “slackness” in the coding parameters to amplify our bound on the probability that a $V$-matrix is bad. The above warmup gives random RS codes that correct $n-2k+1$ errors, exactly achieving the half-Singleton bound. We relax the guarantee, and now ask for a random RS code to correct $n-2k+1-\varepsilon n$ errors, approaching the half-Singleton bound. Now, the corresponding $V$-matrix is an $m\times(2k-1)$ matrix, for $m=2k-1+\varepsilon n$. For this relaxation, we show the probability that a $V$-matrix is bad is exponentially small in $\varepsilon n$, rather than merely $\mathrm{poly}(k)/q$.
First, we discuss a toy problem that illustrates this probability amplification. Consider the toy problem of independently picking uniformly random row vectors $v_1,\ldots,v_N\in\mathbb{F}_q^m$ to form an $N\times m$ matrix $A$, which we want to have full column rank. If we choose $N=m$, then the probability that $A$ has full column rank is $\prod_{i=1}^{m}(1-q^{-i})=1-\Theta(1/q)$, and this happens only if each $v_i$ is not in the span of $v_1,\ldots,v_{i-1}$. However, suppose we choose $N=(1+\varepsilon)m$ for some small $\varepsilon>0$. In this case, we could afford $\varepsilon m$ “faulty” vectors $v_i$, i.e., $v_i$ may be in the span of previous vectors, in which case we just skip it and consider the next vector. The probability that the matrix fails to have full column rank is then exponentially small, $q^{-\Omega(\varepsilon m)}$.
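A quick Monte Carlo sketch of this toy problem (our own illustration, using a small prime field and plain Gaussian elimination) shows how a few extra rows drive the failure probability down:

```python
import random

def rank_mod_p(rows, p):
    """Rank of an integer matrix over F_p, by Gaussian elimination."""
    rows = [r[:] for r in rows]
    rank, ncols = 0, len(rows[0])
    for col in range(ncols):
        piv = next((i for i in range(rank, len(rows)) if rows[i][col] % p), None)
        if piv is None:
            continue
        rows[rank], rows[piv] = rows[piv], rows[rank]
        inv = pow(rows[rank][col], -1, p)  # modular inverse (Python 3.8+)
        rows[rank] = [(x * inv) % p for x in rows[rank]]
        for i in range(len(rows)):
            if i != rank and rows[i][col] % p:
                c = rows[i][col]
                rows[i] = [(x - c * y) % p for x, y in zip(rows[i], rows[rank])]
        rank += 1
    return rank

def failure_rate(N, m, p, trials, seed):
    """Fraction of random N x m matrices over F_p lacking full column rank."""
    rng = random.Random(seed)
    fails = sum(
        rank_mod_p([[rng.randrange(p) for _ in range(m)] for _ in range(N)], m and p) < m
        if False else
        rank_mod_p([[rng.randrange(p) for _ in range(m)] for _ in range(N)], p) < m
        for _ in range(trials)
    )
    return fails / trials

# Square matrices over F_2 often fail; a few extra rows make failure rare.
f_square = failure_rate(6, 6, 2, 400, 1)
f_tall = failure_rate(9, 6, 2, 400, 1)
```

Over $\mathbb{F}_2$ the effect is dramatic: a random $6\times 6$ matrix is singular a large constant fraction of the time, while three extra rows already make full column rank the typical case.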
Now we outline the formal proof of this probability amplification, captured in Corollary 21.
Corollary 5 (Corollary 21, informal).
Let $\varepsilon>0$ and let $I,J$ be two increasing subsequences of length $2k-1+\varepsilon n$ that agree on at most $k-1$ coordinates (more precisely, $i_t=j_t$ for at most $k-1$ values of $t$; this technical condition ensures that the $V$-matrix is symbolically full column rank). Then, the probability that $V_{I,J}(\alpha)$ does not have full column rank is at most $\left(O(nk)/q\right)^{\varepsilon n/2}$.
At the highest level, our proof of Corollary 21 is a union bound over “certificates.” For all evaluation points $\alpha$ where $V_{I,J}(\alpha)$ is bad, we show that there is a certificate $(b_1,\ldots,b_\ell)$ of distinct indices in $[n]$ (Lemma 19) that intuitively “attests” that the matrix is bad.
We generate the certificate deterministically from the evaluations $\alpha$ using Algorithm 1. We compute the certificate one index at a time. Given indices $b_1,\ldots,b_{j-1}$, define index $b_j$ as follows: let $M_j$ be the top square submatrix of the truncated $V$-matrix – the $V$-matrix after removing all rows containing any of the variables $x_{b_1},\ldots,x_{b_{j-1}}$ – and let $b_j$ be the smallest index such that substituting $\alpha_{b_j}$ for $x_{b_j}$ (on top of the earlier substitutions) leaves $M_j$ not full column rank (we call $b_j$ a faulty index, Definition 18). Since $M_j$ is a full rank submatrix of a bad $V$-matrix (the full-rank part needs to be checked, but it follows from [14]), $M_j(\alpha)$ is not full rank, so index $b_j$ always exists. Hence, we can keep generating indices as long as the truncated $V$-matrix has at least $2k-1$ rows. By definition, each $b_i$ participates in at most 2 rows of a $V$-matrix, so we get a certificate of length at least $\varepsilon n/2$.
We then show (Lemma 20) that, for any certificate $(b_1,\ldots,b_\ell)$, the probability that the $V$-matrix has certificate $(b_1,\ldots,b_\ell)$ is exponentially small. Conditioned on $M_j$ being full rank after the substitutions for $b_1,\ldots,b_{j-1}$, the probability that $M_j$ becomes not-full-rank when setting $x_{b_j}=\alpha_{b_j}$ is at most $\frac{2(k-1)}{q-n+1}$: the value $\alpha_{b_j}$ is uniformly random over at least $q-n+1$ field elements, and the degree of $x_{b_j}$ in the determinant of $M_j$ is at most $2(k-1)$. This event needs to happen $\ell$ times, and it is possible to run the conditional probabilities in the correct order to conclude that the probability of generating a particular certificate is at most $\left(\frac{2(k-1)}{q-n+1}\right)^{\ell}$.
Since there are at most $n^{\ell}$ certificates, the total probability that a particular $V$-matrix is bad is at most $n^{\ell}\cdot\left(\frac{2(k-1)}{q-n+1}\right)^{\ell}=\left(O(nk)/q\right)^{\ell}$. This is exponentially small for sufficiently large quadratic alphabet sizes $q$. For such $q$, by a union bound over the at most $4^n$ $V$-matrices, with high probability, the code corrects $n-2k+1-\varepsilon n$ deletions, thus approaching the half-Singleton bound with quadratic alphabet size.
Linear alphabet size
To improve the alphabet size to linear, we modify the certificate argument so that the number of certificates is only $\binom{n}{\ell}\le 2^n$, rather than $n^{\ell}$. The idea is to force the certificates to have increasing coordinates $b_1<b_2<\cdots<b_\ell$ (this does not hold automatically). (For convenience, in the actual proof, our certificate is slightly different; the indices have a slightly different meaning, but the idea is the same.)
First, we preprocess the $V$-matrix by removing some of its rows (equivalently, we remove elements from $I$ and $J$), so that the remaining matrix can be partitioned into “blocks” of bounded length. Crucially, the variables in a block appear only in that block, so that the blocks partition the variables (in the proof, these blocks are given by chains, Definition 24). Proving this step requires establishing structural properties of longest common subsequences.
We then generate our certificates in a more careful way. We remove the largest blocks from our $V$-matrix to create a bank of blocks, and we reindex the variables so that the banked blocks have the highest-indexed variables. (This is acceptable, since we do not use the original ordering on the variables after this point.) As in Algorithm 1, we choose $M_1$ to be the top square submatrix of the $V$-matrix – this time, after removing the blocks in the bank – and for all $j$, we again choose $b_j$ as the smallest index such that setting $x_{b_j}=\alpha_{b_j}$ makes $M_j$ not full rank. However, we choose the matrices $M_j$ more carefully. After choosing $b_j$, we let $M_{j+1}$ be a matrix that “re-indeterminates” $M_j$: we remove from $M_j$ the block containing variable $x_{b_j}$, and replace it with an “equivalent” new block from our bank – possibly truncating the new block, if the new block is longer than the old block – to get $M_{j+1}$. This results in a matrix “equivalent” to $M_j$; it is the same polynomial matrix up to permuting the rows and relabeling the indeterminates. Since this matrix is an equivalent, “re-indeterminated” version of $M_j$, it must be full column rank, so we have $b_{j+1}>b_j$, which is our desired property for our certificates. Further, since our bank has at least $\ell$ blocks, we can “re-indeterminate” at least $\ell$ times, giving a certificate of length $\ell$.
Since our certificates now satisfy $b_1<\cdots<b_\ell$, the number of certificates is at most $\binom{n}{\ell}\le 2^n$, so the probability that a particular $V$-matrix is bad is only $2^n\cdot\left(\frac{2(k-1)}{q-n+1}\right)^{\ell}$, which is exponentially small for sufficiently large linear alphabet sizes $q=O_\varepsilon(n)$. Again, a union bound over the at most $4^n$ $V$-matrices gives that, with high probability, our code corrects $n-2k+1-\varepsilon n$ deletions, thus approaching the half-Singleton bound with linear-sized alphabets.
1.4 Future research directions
We conclude the introduction by outlining several open questions for future research:
1. Lower bounds on the field size. In [14], it was demonstrated that there exist RS codes over fields of size $n^{O(k)}$ that exactly attain the half-Singleton bound. This paper shows that the field size can be significantly reduced if we only need to get $\varepsilon$-close to the half-Singleton bound. A remaining open question is to prove a lower bound on the field size of linear codes, not just RS codes, that achieve (or get close to) the half-Singleton bound.
2. Explicit constructions. The results presented in this paper are existential; specifically, we demonstrate the existence of RS codes capable of correcting insdel errors with an almost optimal rate-distance tradeoff. An important question remains: how can one provide explicit constructions of RS codes that achieve these parameters?
3. Decoding algorithms. One of the primary advantages of RS codes in the Hamming metric is their efficient decoding algorithms. Currently, as far as we know, there are no efficient algorithms for decoding RS codes from insdel errors. Therefore, a crucial open question is how to design efficient decoding algorithms for RS codes handling insdel errors.
4. Affine codes. In [9, Theorem 1.5], it was demonstrated that affine codes outperform linear codes in correcting insdel errors. Specifically, efficient affine binary codes were constructed whose rate exceeds the upper bound that linear codes can achieve. By efficient codes, we mean explicit constructions of codes with efficient encoding and decoding algorithms. An immediate open question is to construct efficient affine codes with improved rate-distance trade-offs. Moreover, considering that RS codes achieve the half-Singleton bound, a natural question arises: can affine RS codes perform even better? In particular, can they approach the Singleton bound?
2 Notation and Preliminaries
Define $[n]=\{1,2,\ldots,n\}$ and $[a,b]=\{a,a+1,\ldots,b\}$ for integers $a\le b$. Throughout the paper, $\mathbb{F}_q$ denotes the finite field of order $q$.
Denote by $\mathbb{F}_q[x_1,\ldots,x_n]$ the ring of polynomials in $x_1,\ldots,x_n$ over $\mathbb{F}_q$, and denote by $\mathbb{F}_q(x_1,\ldots,x_n)$ the field of fractions of that ring. More generally, for a collection of variables $(x_i)_{i\in S}$ indexed by a set $S$, denote by $\mathbb{F}_q[(x_i)_{i\in S}]$ the ring of polynomials in these variables over $\mathbb{F}_q$. The value of a polynomial $P$ at $\alpha=(\alpha_1,\ldots,\alpha_n)$ is denoted by $P(\alpha)$.
The degree of a nonzero multivariate polynomial refers to its total degree. For a variable $x_i$, $\deg_{x_i}(P)$ denotes the degree of $P$ in $x_i$, where the other variables are viewed as constants.
We also need the following notation, adapted from [23], regarding the partial assignment of a symbolic matrix.
Definition 6 (Partial assignment).
Let $M$ be a matrix over $\mathbb{F}_q(x_1,\ldots,x_n)$ such that the entries of $M$ are in $\mathbb{F}_q[x_1,\ldots,x_n]$. For $i\in[n]$ and $\beta\in\mathbb{F}_q$, denote by $M^{x_i\to\beta}$ the matrix obtained from $M$ by substituting $\beta$ for $x_i$. More generally, for $S\subseteq[n]$ and a tuple $\beta=(\beta_i)_{i\in S}$, denote by $M^{(x_i)_{i\in S}\to\beta}$ the matrix obtained from $M$ by substituting $\beta_i$ for $x_i$ for all $i\in S$.
We need the following variation of the Schwartz–Zippel lemma.
Lemma 7.
Let $P\in\mathbb{F}_q[x_1,\ldots,x_n]$ be a nonzero polynomial such that $\deg_{x_i}(P)\le d$ for all $i\in[n]$. Let $S\subseteq\mathbb{F}_q$ be a set of size at least $n$, and let $\alpha$ be uniformly distributed over the set of $n$-tuples with distinct coordinates in $S$. Then $\Pr[P(\alpha)=0]\le\frac{dn}{|S|-n+1}$.
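To illustrate the flavor of this lemma (without reproducing its exact constant), the following sketch exactly computes, by brute force, how often a hypothetical nonzero polynomial of low individual degree vanishes on tuples with distinct coordinates:

```python
from itertools import permutations

def vanish_fraction(P, S, n):
    """Exact probability that P vanishes at a uniformly random n-tuple
    of DISTINCT coordinates drawn from S (brute-force enumeration)."""
    tuples = list(permutations(S, n))
    return sum(1 for t in tuples if P(*t) == 0) / len(tuples)

# A nonzero bivariate polynomial over F_11 with degree 1 in each variable
# (a hypothetical example; q = 11).
q = 11
P = lambda x1, x2: (x1 * x2 - 1) % q
frac = vanish_fraction(P, range(q), 2)
# Solutions of x1*x2 = 1 with x1 != x2: the 8 pairs (x1, x1^{-1}) with
# x1 not in {0, 1, 10}, out of 11*10 = 110 ordered distinct pairs.
```

Here the empirical vanishing probability, $8/110\approx 0.073$, sits comfortably below a Schwartz–Zippel-style bound of the form $dn/(|S|-n+1)=2/10$.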
We recall the notion of a subsequence and a longest common subsequence.
Definition 8.
A subsequence of a string $s$ is a string obtained by removing some (possibly none) of the symbols in $s$.
Definition 9.
Let $s_1,s_2$ be strings over an alphabet $\Sigma$. A longest common subsequence between $s_1$ and $s_2$ is a subsequence of both $s_1$ and $s_2$ of maximal length. We denote by $\mathrm{LCS}(s_1,s_2)$ the length of a longest common subsequence.
The edit distance between $s_1$ and $s_2$, denoted by $\mathrm{ED}(s_1,s_2)$, is the minimal number of insertions and deletions needed in order to turn $s_1$ into $s_2$.
It is well known that the insdel correction capability of a code is determined by the LCS of its codewords. Specifically,
Lemma 10.
A code $C\subseteq\Sigma^n$ can correct $t$ insdel errors if and only if $\mathrm{LCS}(c_1,c_2)<n-t$ for any distinct $c_1,c_2\in C$.
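The LCS quantity in Lemma 10 is computable by textbook dynamic programming. The sketch below (our own illustration) also uses the identity $\mathrm{ED}(s_1,s_2)=|s_1|+|s_2|-2\,\mathrm{LCS}(s_1,s_2)$, which holds for the insertion/deletion edit distance of Definition 9.

```python
def lcs(s, t):
    """Length of a longest common subsequence, by standard dynamic programming."""
    dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i, a in enumerate(s):
        for j, b in enumerate(t):
            dp[i+1][j+1] = dp[i][j] + 1 if a == b else max(dp[i][j+1], dp[i+1][j])
    return dp[len(s)][len(t)]

def edit_distance(s, t):
    """Insertion/deletion edit distance via ED = |s| + |t| - 2 * LCS."""
    return len(s) + len(t) - 2 * lcs(s, t)
```

For instance, for the transmitted/received pair from the introduction, `lcs("1100", "01001")` is 3, so their insdel edit distance is $4+5-2\cdot 3=3$, matching the two insertions and one deletion used there.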
We now adopt two definitions and a lemma from [14]. These will establish the algebraic conditions ensuring that an RS code can correct insdel errors.
Definition 11 (Increasing subsequence).
We call $I=(i_1,\ldots,i_\ell)$ an increasing subsequence if it holds that $1\le i_1<i_2<\cdots<i_\ell\le n$, where $\ell$ is called the length of $I$.
Definition 12 (-matrix).
For a positive integer $k$ and increasing subsequences $I=(i_1,\ldots,i_m)$ and $J=(j_1,\ldots,j_m)$ of length $m$, define the matrix
$$V_{I,J}\;=\;\begin{pmatrix}
1 & x_{i_1} & \cdots & x_{i_1}^{k-1} & x_{j_1} & \cdots & x_{j_1}^{k-1}\\
\vdots & \vdots & & \vdots & \vdots & & \vdots\\
1 & x_{i_m} & \cdots & x_{i_m}^{k-1} & x_{j_m} & \cdots & x_{j_m}^{k-1}
\end{pmatrix}$$
over the field $\mathbb{F}_q(x_1,\ldots,x_n)$. We call $V_{I,J}$ a $V$-matrix.
The following lemma states that if a Reed–Solomon code cannot correct insdel errors, then we can identify a specific -matrix that does not have full column rank. The proof of this lemma is identical to the proof of Proposition 2.1 in [14].
Lemma 13.
Let $\alpha\in\mathbb{F}_q^n$ have distinct coordinates. Consider the Reed–Solomon code $\mathrm{RS}_{n,k}(\alpha)$ associated with the evaluation vector $\alpha$. If the code cannot correct $t$ arbitrary insdel errors, then there exist two increasing subsequences $I,J$ of length $n-t$ that agree on at most $k-1$ coordinates such that the matrix $V_{I,J}(\alpha)$ does not have full column rank.
A key technical lemma regarding the -matrices that was proved in [14] is the following.
Lemma 14 (Proposition 2.4, [14]).
Let $I,J$ be two increasing subsequences of length $2k-1$ that agree on at most $k-1$ coordinates. Then, we have $\det(V_{I,J})\neq 0$ as a multivariate polynomial in $\mathbb{F}_q[x_1,\ldots,x_n]$.
Informally speaking, by using this lemma and by considering all the $V$-matrices, the authors of [14] showed, using the Schwartz–Zippel lemma, that there exists an assignment $\alpha$ over a large enough field for which $\det(V_{I,J}(\alpha))$ is nonzero for all pairs $I,J$ of length $2k-1$ that agree on at most $k-1$ coordinates. For convenience, we introduce the following definition.
Definition 15 (Selecting a subset of coordinates).
For an increasing subsequence $I=(i_1,\ldots,i_m)$ and $S=\{s_1<s_2<\cdots<s_r\}\subseteq[m]$, denote by $I|_S$ the increasing subsequence $(i_{s_1},\ldots,i_{s_r})$.
The following is a corollary of Lemma 14.
Corollary 16.
Let $I,J$ of length $m\ge 2k-1$ be two increasing subsequences that agree on at most $k-1$ coordinates. Let $S\subseteq[m]$ be a set of size $2k-1$. Then $\det(V_{I|_S,J|_S})\ne 0$. Additionally, $V_{I,J}$ has full column rank over $\mathbb{F}_q(x_1,\ldots,x_n)$.
3 Achieving Quadratic Alphabet Size
In this section, we present a warm-up result, Theorem 22, which achieves a quadratic alphabet size. Our proof resembles the proof by Guo and Zhang [23] showing that random RS codes achieve list decodability over fields of quadratic size.
3.1 Full rankness of $V_{I,J}$ under a random assignment
Firstly, we introduce the following definition of $V_{I,J}^{B}$, a submatrix of $V_{I,J}$ obtained by deleting certain rows.
Definition 17.
Under the notation in Definition 12, for a subset $B\subseteq[n]$, define the matrix $V_{I,J}^{B}$ to be the submatrix of $V_{I,J}$ obtained by deleting the $s$-th row for all $s$ such that $i_s\in B$ or $j_s\in B$. In other words, $V_{I,J}^{B}$ is obtained from $V_{I,J}$ by deleting all the rows containing powers of $x_b$ for every index $b\in B$.
We also need to define the notion of faulty indices. Our definition is a simplification of a similar definition in [23].
Definition 18 (Faulty index).
Let $M$ be an $N\times m$ matrix, where $N\ge m$, whose entries are in $\mathbb{F}_q[x_1,\ldots,x_n]$, and let $\alpha\in\mathbb{F}_q^n$. For $i\in[n]$, we say $i$ is the faulty index of $M$ (with respect to $\alpha$) if $M^{(x_1,\ldots,x_{i-1})\to(\alpha_1,\ldots,\alpha_{i-1})}$ has full column rank but $M^{(x_1,\ldots,x_i)\to(\alpha_1,\ldots,\alpha_i)}$ does not. Note that the faulty index of $M$ is unique if it exists.
Next, we will present an algorithm that, when provided with two increasing subsequences of the same length that agree on at most $k-1$ coordinates, elements $\alpha_1,\ldots,\alpha_n\in\mathbb{F}_q$, and a parameter $t$, attempts to verify whether the matrix $V_{I,J}(\alpha)$ has full column rank. If it is unable to confirm this, the algorithm produces either a “FAIL” message or identifies a sequence of faulty indices $b_1,\ldots,b_t$. The algorithm is given as Algorithm 1.
Lemma 19 (Behavior of Algorithm 1).
Let $\alpha=(\alpha_1,\ldots,\alpha_n)\in\mathbb{F}_q^n$ have distinct coordinates. Let $I,J$ be two increasing subsequences of the same length that agree on at most $k-1$ coordinates. Let $t$ be a positive integer small enough for Algorithm 1 to run to completion. Then for all $\alpha$, running Algorithm 1 on the input $I$, $J$, $\alpha$, and $t$ yields one of the following two possible scenarios:
1. Algorithm 1 outputs “SUCCESS”. In this case, $V_{I,J}(\alpha)$ has full column rank.
2. Algorithm 1 outputs a sequence of distinct indices $b_1,\ldots,b_t$. For each $j\in[t]$, $b_j$ is the faulty index (with respect to $\alpha$) of the corresponding submatrix maintained by the algorithm.
The next lemma bounds the probability that Algorithm 1 outputs a particular sequence of faulty indices over random . The proof follows the same approach as [23, Lemma 4.5].
Lemma 20.
Under the notation and conditions in Lemma 19, suppose and is chosen uniformly at random from the set of all -tuples of distinct elements in . Then for any sequence , the probability that Algorithm 1 outputs on the input , , and is at most .
By combining Lemma 19 and Lemma 20 and taking a union bound over the set of possible outputs of Algorithm 1, we obtain the following corollary.
Corollary 21.
Under the notation and conditions in Lemma 19, suppose and is chosen uniformly at random from the set of all -tuples of distinct elements in . Then for any positive integer , we have
3.2 Putting it together
We are now ready to prove a weaker version of our main result, which achieves quadratic-sized alphabets.
Theorem 22.
Let and , where . Let be a prime power such that . Suppose is chosen uniformly at random from the set of all -tuples of distinct elements in . Then with probability at least , the code over corrects at least adversarial insdel errors.
4 Achieving Linear Alphabet Size
In this section, we present an improved analysis that achieves a linear alphabet size. We begin by defining the notion of “chains” and establishing related structural results about common subsequences in Section 4.1. The proof of Theorem 4 – which is omitted due to space limitations – is built on these foundations.
4.1 Chain decomposition
For convenience, we first introduce some definitions.
Definition 23.
For an increasing subsequence , define .
Definition 24 (Chain).
For two increasing subsequences , where , we call the pair a chain if either (i) for all , or (ii) for all . If and , we call a Type I chain. Otherwise, we call a Type II chain. See Figure 1 for some examples.
Recall Definition 15 that for an increasing subsequence and with , we let .
Definition 25 (Maximal chain).
Let be two increasing subsequences. Let be a subset of of size . Let and . Suppose is a chain. We say the chain is maximal with respect to if
-
1.
is a Type I chain, or
-
2.
is a Type II chain, and both and are in exactly one of and .
Intuitively, a chain is maximal with respect to if it cannot be extended to a longer chain in by adding matches from either side.
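As a toy illustration of chains (the precise condition of Definition 24 involves the elided formulas), the sketch below treats a chain as a maximal run of matches that are consecutive in both sequences; this simplified condition and the function name are our own, not the paper's.

```python
def chain_decomposition(matches):
    """Split a matching (a list of pairs, strictly increasing in both
    coordinates) into maximal runs in which both coordinates advance by
    exactly 1 -- a simplified stand-in for maximal chains."""
    chains, cur = [], []
    for s, t in matches:
        if cur and (s, t) == (cur[-1][0] + 1, cur[-1][1] + 1):
            cur.append((s, t))        # extends the current run on both sides
        else:
            if cur:
                chains.append(cur)    # the current run is maximal; close it
            cur = [(s, t)]
    if cur:
        chains.append(cur)
    return chains
```

For example, the matching `[(1,2), (2,3), (5,7), (6,8), (9,9)]` decomposes into the three runs `[(1,2),(2,3)]`, `[(5,7),(6,8)]`, and `[(9,9)]`, and by construction the runs partition the matching.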
Definition 26 (Indices of variables).
For increasing subsequences and , define
Note that is the set of indices of the variables involved in a -matrix . (If , these variables do not really appear as . But we still consider as the set of indices of variables involved in in this case.)
Definition 27 (-disjointness).
Let be increasing subsequences. We say two sets are -disjoint if and are disjoint. Note that -disjointness implies disjointness.
The next theorem states that for increasing subsequences , can always be decomposed into maximal chains. Its proof is omitted.
Theorem 28.
For any two increasing subsequences , there exists a partition of such that is a maximal chain with respect to for all .
Splitting long chains into short ones
The maximal chains in Theorem 28 can be very long. The next lemma states that, by removing a small fraction of matches from , we can split the long chains into very short ones while maintaining their -disjointness. (See also Figure 2.)
Lemma 29.
Let be increasing subsequences. Let . Then there exist a subset of size at least and a partition of such that,
-
1.
are mutually -disjoint.
-
2.
for .
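The splitting step can be illustrated in a simplified setting: deleting every (B+1)-th match of a long chain breaks it into pieces of length at most B while discarding at most a 1/(B+1) fraction of the matches. The sketch below (with hypothetical names) shows only this counting mechanism; the actual lemma additionally preserves -disjointness of the resulting pieces.

```python
def split_chain(chain, B):
    """Split one long chain into pieces of length at most B by discarding
    every (B+1)-th match; at most a 1/(B+1) fraction is removed."""
    kept, pieces, cur = 0, [], []
    for i, m in enumerate(chain):
        if (i + 1) % (B + 1) == 0:
            # Drop this match to break the chain here.
            if cur:
                pieces.append(cur)
            cur = []
        else:
            cur.append(m)
            kept += 1
    if cur:
        pieces.append(cur)
    return pieces, kept
```

On a chain of 10 matches with B = 3, two matches are dropped, leaving pieces of sizes 3, 3, 2 and keeping 8 of 10 matches, consistent with the 1/(B+1) = 1/4 removal bound.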
4.2 Proof of Theorem 4
Fix to be increasing subsequences that agree on at most coordinates. For convenience, we introduce the following notation.
Definition 30.
Let . Let be nonempty subsets of . Define the block matrix which is a matrix over .
Lemma 31.
Let and . If has full column rank under the assignment , then also has full column rank under the same assignment.
Choosing sets
Fix a parameter , whose exact value will be determined later. We note that by Lemma 29, there exist mutually -disjoint nonempty sets , each of size at most , such that and each is a chain. Fix such . We further make the following assumption:
Assumption 32.
. Moreover, for with , if is a Type I chain, so is . In other words, the Type I chains have the smallest indices .
Note that Assumption 32 can be guaranteed by permuting the sets . We will show that has full column rank with high probability over random . By Lemma 31, this would imply that also has full column rank with high probability. To bound the probability that does not have full column rank, we use Algorithm 2, whose pseudocode is given below, to assign the variables chain by chain.
We sketch how the algorithm works.
First, choose the smallest such that (Line 3). We will soon prove that such exists for slightly larger than . The last chains , , form a “bank.” At a high level, Algorithm 2 attempts to assign the variables in the first chains, i.e., those not in the bank, while keeping the top submatrix of nonsingular even after the assignment. If assigning the variables in some chain violates this property, the algorithm will use part of a chain in the bank to replace that chain, and continue. Such a replacement is done by updating the sets . To avoid confusion, we create a copy for (Line 7) and update the sets instead.
The variables are assigned chain by chain. That is, for , Algorithm 2 assigns to for all . The algorithm terminates when the top rows of have been fully assigned. This condition is checked at Line 8. The number of assigned rows is maintained as a variable . That is, if the first chains have been assigned, then .
When attempting to assign the variables in the -th chain , we check (Line 12) whether the top submatrix of remains nonsingular after the assignment. If so, we perform the assignment, and then move to the next chain (Lines 13–15). If not, we use a prefix of a chain in the bank to replace the chain , and then throw away . (Alternatively, one can remove the used portion of and continue using the remaining part; however, this approach does not qualitatively improve our parameters.) In the algorithm, the updates to the chains and are done by simply updating the sets and (Lines 13–15).
Finally, the algorithm maintains a variable , which is the number of times the nonsingularity condition at Line 12 fails to hold. It also maintains a sequence . When the nonsingularity condition fails for the -th time, the number of successfully assigned rows is stored into (Line 21). We have due to the condition checked at Line 8. If reaches a given integer , the algorithm outputs the sequence and halts (Line 22). We will eventually choose .
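The control flow sketched above can be abstracted as the toy routine below, with the nonsingularity test of Line 12 replaced by an arbitrary callback `assign_ok`; all names and the failure bookkeeping here are ours, and the sketch deliberately ignores the matrix updates of the real Algorithm 2.

```python
def assign_chains(chains, bank, assign_ok, max_failures):
    """Toy sketch of the bank strategy: process chains in order; whenever
    the test `assign_ok` rejects a chain, discard it, record the progress
    (the number of rows assigned so far), and swap in a same-length prefix
    of a bank chain. Abort after max_failures rejections."""
    assigned, failures, record = [], 0, []
    queue = list(chains)
    i = 0
    while i < len(queue):
        ch = queue[i]
        if assign_ok(assigned, ch):
            assigned.extend(ch)                # assignment keeps the test happy
            i += 1
        else:
            failures += 1
            record.append(len(assigned))       # analogous to storing a_c
            if failures >= max_failures or not bank:
                return "FAIL", record
            queue[i] = bank.pop(0)[:len(ch)]   # prefix of a bank chain
    return "SUCCESS", record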
From now on, assume is chosen such that
(11)
Line 3 of Algorithm 2 is valid by the following lemma.
Lemma 33.
There exists such that . Moreover, for such , it holds that and .
The output of Algorithm 2 can be used to certify the full-column-rankness of the matrix , as stated by the following lemma.
Lemma 34.
Algorithm 2 terminates, outputting either “SUCCESS” or . In the former case, has full column rank.
As the variable in Algorithm 2 never decreases, we have the following lemma.
Lemma 35.
Any sequence output by Algorithm 2 satisfies .
Next, we make the crucial observation that the behavior of the algorithm is mostly determined by the values of , , and , regardless of .
Lemma 36.
At Line 12 of Algorithm 2, the values of and are determined by the values of , , and , together with the input excluding .
We also need the following crucial lemma.
Lemma 37.
At Line 12 of Algorithm 2, is nonsingular.
The next lemma bounds the probability that Algorithm 2 outputs a particular sequence.
Lemma 38.
Suppose and is chosen uniformly at random from the set of all -tuples of distinct elements in . The probability that Algorithm 2 outputs a fixed sequence is at most , where .
Taking the union bound over all possible output sequences, we obtain the following corollary.
Corollary 39.
Suppose and is chosen uniformly at random from the set of all -tuples of distinct elements in . Also suppose and satisfy (11). Then the probability that does not have full column rank is at most , where .
Finally, we state our main theorem below, whose proof is based on Algorithm 2.
Theorem 40 (Detailed version of Theorem 4).
Let and , where . Let be a prime power such that , where is a large enough absolute constant. Suppose is chosen uniformly at random from the set of all -tuples of distinct elements in . Then with probability at least , the code over corrects at least adversarial insdel errors.
Proof.
Let be a large enough absolute constant, and let . First assume . Since , it suffices to prove the claim for . In this case, the code can, in fact, correct at least adversarial insdel errors. A proof was implicitly given in [14, proof of Theorem 16] and also sketched in our Section 1.3.
Next, assume . Choose , , and . It is easy to verify that (11) holds given that the constant is large enough.
By Lemma 13, if fails to correct adversarial insdel errors, then we will have two increasing subsequences which agree on at most coordinates, such that the matrix does not have full column rank. On the other hand, applying Corollary 39 with the chosen parameters , , and , for fixed , we have that
where the last inequality uses the facts that and that is a large enough constant. By the union bound over all , the probability that cannot correct adversarial insdel errors is at most , as desired.
References
- [1] Khaled AS Abdel-Ghaffar, Hendrik C Ferreira, and Ling Cheng. On linear and cyclic codes for correcting deletions. In 2007 IEEE International Symposium on Information Theory (ISIT), pages 851–855. IEEE, 2007.
- [2] Omar Alrabiah, Venkatesan Guruswami, and Ray Li. Randomly punctured Reed–Solomon codes achieve list-decoding capacity over linear-sized fields. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing (STOC), pages 1458–1469, 2024. doi:10.1145/3618260.3649634.
- [3] Joshua Brakensiek, Manik Dhar, Sivakanth Gopi, and Zihan Zhang. AG codes achieve list decoding capacity over constant-sized fields. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing (STOC), pages 740–751, 2024. doi:10.1145/3618260.3649651.
- [4] Joshua Brakensiek, Sivakanth Gopi, and Visu Makam. Generic Reed-Solomon codes achieve list-decoding capacity. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing (STOC), pages 1488–1501, 2023. doi:10.1145/3564246.3585128.
- [5] Joshua Brakensiek, Venkatesan Guruswami, and Samuel Zbarsky. Efficient low-redundancy codes for correcting multiple deletions. IEEE Transactions on Information Theory, 64(5):3403–3410, 2017. doi:10.1109/TIT.2017.2746566.
- [6] Boris Bukh, Venkatesan Guruswami, and Johan Håstad. An improved bound on the fraction of correctable deletions. IEEE Transactions on Information Theory, 63(1):93–103, 2016. doi:10.1109/TIT.2016.2621044.
- [7] Bocong Chen and Guanghui Zhang. Improved singleton bound on insertion-deletion codes and optimal constructions. IEEE Transactions on Information Theory, 68(5):3028–3033, 2022. doi:10.1109/TIT.2022.3148185.
- [8] Hao Chen. Coordinate-ordering-free upper bounds for linear insertion-deletion codes. IEEE Transactions on Information Theory, 68(8):5126–5132, 2022. doi:10.1109/TIT.2022.3167662.
- [9] Kuan Cheng, Venkatesan Guruswami, Bernhard Haeupler, and Xin Li. Efficient linear and affine codes for correcting insertions/deletions. In Dániel Marx, editor, Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1–20. SIAM, 2021. doi:10.1137/1.9781611976465.1.
- [10] Kuan Cheng, Zhengzhong Jin, Xin Li, Zhide Wei, and Yu Zheng. Linear insertion deletion codes in the high-noise and high-rate regimes. In 50th International Colloquium on Automata, Languages, and Programming (ICALP). Schloss-Dagstuhl-Leibniz Zentrum für Informatik, 2023.
- [11] Kuan Cheng, Zhengzhong Jin, Xin Li, and Ke Wu. Deterministic document exchange protocols and almost optimal binary codes for edit errors. Journal of the ACM, 69(6):1–39, 2022. doi:10.1145/3561046.
- [12] Mahdi Cheraghchi and João Ribeiro. An overview of capacity results for synchronization channels. IEEE Transactions on Information Theory, 67(6):3207–3232, 2020. doi:10.1109/TIT.2020.2997329.
- [13] Roni Con, Amir Shpilka, and Itzhak Tamo. Explicit and efficient constructions of linear codes against adversarial insertions and deletions. IEEE Transactions on Information Theory, 68(10):6516–6526, 2022. doi:10.1109/TIT.2022.3173185.
- [14] Roni Con, Amir Shpilka, and Itzhak Tamo. Reed–Solomon codes against adversarial insertions and deletions. IEEE Transactions on Information Theory, 2023.
- [15] Roni Con, Amir Shpilka, and Itzhak Tamo. Optimal two-dimensional Reed–Solomon codes correcting insertions and deletions. IEEE Transactions on Information Theory, 2024.
- [16] Tai Do Duc, Shu Liu, Ivan Tjuawinata, and Chaoping Xing. Explicit constructions of two-dimensional Reed-Solomon codes in high insertion and deletion noise regime. IEEE Transactions on Information Theory, 67(5):2808–2820, 2021. doi:10.1109/TIT.2021.3065618.
- [17] Peter Elias. List decoding for noisy channels. Wescon Convention Record, Part 2, Institute of Radio Engineers, pages 99–104, 1957.
- [18] Asaf Ferber, Matthew Kwan, and Lisa Sauermann. List-decodability with large radius for Reed-Solomon codes. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 720–726. IEEE, 2022.
- [19] Eitan Goldberg, Chong Shangguan, and Itzhak Tamo. List-decoding and list-recovery of Reed–Solomon codes beyond the Johnson radius for every rate. IEEE Transactions on Information Theory, 69(4):2261–2268, 2022. doi:10.1109/TIT.2022.3222877.
- [20] Nick Goldman, Paul Bertone, Siyuan Chen, Christophe Dessimoz, Emily M LeProust, Botond Sipos, and Ewan Birney. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature, 494(7435):77–80, 2013. doi:10.1038/NATURE11875.
- [21] Zeyu Guo, Ray Li, Chong Shangguan, Itzhak Tamo, and Mary Wootters. Improved list-decodability and list-recoverability of Reed-Solomon codes via tree packings. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 708–719. IEEE, 2022.
- [22] Zeyu Guo, Chaoping Xing, Chen Yuan, and Zihan Zhang. Random Gabidulin codes achieve list decoding capacity in the rank metric. arXiv preprint arXiv:2404.13230, 2024. To appear in FOCS 2024. doi:10.48550/arXiv.2404.13230.
- [23] Zeyu Guo and Zihan Zhang. Randomly punctured Reed-Solomon codes achieve the list decoding capacity over polynomial-size alphabets. In 2023 IEEE 64th Annual Symposium on Foundations of Computer Science (FOCS), pages 164–176. IEEE, 2023. doi:10.1109/FOCS57990.2023.00019.
- [24] Venkatesan Guruswami and Johan Håstad. Explicit two-deletion codes with redundancy matching the existential bound. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 21–32. SIAM, 2021. doi:10.1137/1.9781611976465.2.
- [25] Venkatesan Guruswami, Xiaoyu He, and Ray Li. The zero-rate threshold for adversarial bit-deletions is less than 1/2. IEEE Transactions on Information Theory, 69(4):2218–2239, 2022. doi:10.1109/TIT.2022.3223023.
- [26] Venkatesan Guruswami and Ray Li. Efficiently decodable insertion/deletion codes for high-noise and high-rate regimes. In 2016 IEEE International Symposium on Information Theory (ISIT), pages 620–624. IEEE, 2016. doi:10.1109/ISIT.2016.7541373.
- [27] Venkatesan Guruswami and Madhu Sudan. Improved decoding of Reed-Solomon and algebraic-geometric codes. In Proceedings 39th Annual Symposium on Foundations of Computer Science (FOCS), pages 28–37. IEEE, 1998.
- [28] Venkatesan Guruswami and Carol Wang. Deletion codes in the high-noise and high-rate regimes. IEEE Transactions on Information Theory, 63(4):1961–1970, 2017. doi:10.1109/TIT.2017.2659765.
- [29] Bernhard Haeupler. Optimal document exchange and new codes for insertions and deletions. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 334–347. IEEE, 2019. doi:10.1109/FOCS.2019.00029.
- [30] Bernhard Haeupler and Amirbehshad Shahrasbi. Synchronization strings and codes for insertions and deletions–a survey. IEEE Transactions on Information Theory, 67(6):3190–3206, 2021. doi:10.1109/TIT.2021.3056317.
- [31] Bernhard Haeupler and Amirbehshad Shahrasbi. Synchronization strings: Codes for insertions and deletions approaching the Singleton bound. Journal of the ACM, 68(5):1–39, 2021. doi:10.1145/3468265.
- [32] Richard W. Hamming. Error detecting and error correcting codes. Bell System technical journal, 29(2):147–160, 1950.
- [33] Reinhard Heckel, Gediminas Mikutis, and Robert N Grass. A characterization of the DNA data storage channel. Scientific reports, 9(1):1–12, 2019.
- [34] Reinhard Heckel, Ilan Shomorony, Kannan Ramchandran, and David N. C. Tse. Fundamental limits of DNA storage systems. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 3130–3134. IEEE, 2017.
- [35] Qinqin Ji, Dabin Zheng, Hao Chen, and Xiaoqiang Wang. Strict half-Singleton bound, strict direct upper bound for linear insertion-deletion codes and optimal codes. IEEE Transactions on Information Theory, 2023.
- [36] Selmer Johnson. A new upper bound for error-correcting codes. IRE Transactions on Information Theory, 8(3):203–207, 1962. doi:10.1109/TIT.1962.1057714.
- [37] Andreas Lenz, Paul H Siegel, Antonia Wachter-Zeh, and Eitan Yaakobi. Coding over sets for DNA storage. IEEE Transactions on Information Theory, 66(4):2331–2351, 2019. doi:10.1109/TIT.2019.2961265.
- [38] Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10:707–710, 1966.
- [39] Vladimir I Levenshtein. Bounds for deletion/insertion correcting codes. In Proceedings IEEE International Symposium on Information Theory (ISIT), page 370. IEEE, 2002.
- [40] Jingge Liu. Optimal RS codes and GRS codes against adversarial insertions and deletions and optimal constructions. IEEE Transactions on Information Theory, 2024.
- [41] Shu Liu and Ivan Tjuawinata. On 2-dimensional insertion-deletion Reed-Solomon codes with optimal asymptotic error-correcting capability. Finite Fields and Their Applications, 73:101841, 2021. doi:10.1016/J.FFA.2021.101841.
- [42] Hugues Mercier, Vijay K Bhargava, and Vahid Tarokh. A survey of error-correcting codes for channels with symbol synchronization errors. IEEE Communications Surveys & Tutorials, 12(1):87–96, 2010. doi:10.1109/SURV.2010.020110.00079.
- [43] Michael Mitzenmacher. A survey of results for deletion channels and related synchronization channels. Probability Surveys, 6:1–33, 2009.
- [44] Atri Rudra and Mary Wootters. Every list-decodable code for high noise has abundant near-optimal rate puncturings. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing (STOC), pages 764–773, 2014. doi:10.1145/2591796.2591797.
- [45] Reihaneh Safavi-Naini and Yejing Wang. Traitor tracing for shortened and corrupted fingerprints. In ACM workshop on Digital Rights Management, pages 81–100. Springer, 2002. doi:10.1007/978-3-540-44993-5_6.
- [46] Clayton Schoeny, Antonia Wachter-Zeh, Ryan Gabrys, and Eitan Yaakobi. Codes correcting a burst of deletions or insertions. IEEE Transactions on Information Theory, 63(4):1971–1985, 2017. doi:10.1109/TIT.2017.2661747.
- [47] Leonard J Schulman and David Zuckerman. Asymptotically good codes correcting insertions, deletions, and transpositions. IEEE Transactions on Information Theory, 45(7):2552–2557, 1999. doi:10.1109/18.796406.
- [48] Chong Shangguan and Itzhak Tamo. Combinatorial list-decoding of Reed-Solomon codes beyond the Johnson radius. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 538–551, 2020. doi:10.1145/3357713.3384295.
- [49] Claude Elwood Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423, 1948. doi:10.1002/J.1538-7305.1948.TB01338.X.
- [50] Ilan Shomorony and Reinhard Heckel. Information-theoretic foundations of DNA data storage. Foundations and Trends® in Communications and Information Theory, 19(1):1–106, 2022. doi:10.1561/0100000117.
- [51] Jin Sima and Jehoshua Bruck. On optimal k-deletion correcting codes. IEEE Transactions on Information Theory, 67(6):3360–3375, 2020. doi:10.1109/TIT.2020.3028702.
- [52] Jin Sima, Ryan Gabrys, and Jehoshua Bruck. Optimal systematic t-deletion correcting codes. In 2020 IEEE International Symposium on Information Theory (ISIT), pages 769–774. IEEE, 2020. doi:10.1109/ISIT44484.2020.9173986.
- [53] Madhu Sudan. Decoding of Reed Solomon codes beyond the error-correction bound. Journal of Complexity, 13(1):180–193, 1997. doi:10.1006/JCOM.1997.0439.
- [54] Dongvu Tonien and Reihaneh Safavi-Naini. Construction of deletion correcting codes using generalized Reed–Solomon codes and their subcodes. Designs, Codes and Cryptography, 42(2):227–237, 2007. doi:10.1007/S10623-006-9032-7.
- [55] R. R. Varshamov and G. M. Tenengolts. Codes which correct single asymmetric errors (in Russian). Avtomatika i Telemekhanika, 161(3):288–292, 1965.
- [56] Yejing Wang, Luke McAven, and Reihaneh Safavi-Naini. Deletion correcting using generalized Reed-Solomon codes. In Coding, Cryptography and Combinatorics, pages 345–358. Springer, 2004.
- [57] John M Wozencraft. List decoding. Quarterly Progress Report, 48:90–95, 1958.
- [58] Kenji Yasunaga. Improved bounds for codes correcting insertions and deletions. Designs, Codes and Cryptography, pages 1–12, 2024.