New Bounds for Circular Trace Reconstruction
Abstract
The “trace reconstruction” problem asks, given an unknown binary string and a channel that repeatedly returns “traces” of with each bit randomly deleted with some probability , how many traces are needed to recover ? There is an exponential gap between the best known upper and lower bounds for this problem. Many variants of the model have been introduced in hopes of motivating or revealing new approaches to narrow this gap. We study the variant of circular trace reconstruction introduced by Narayanan and Ren (ITCS 2021), in which traces undergo a random cyclic shift in addition to random deletions.
We show an improved lower bound of for circular trace reconstruction. This contrasts with the (previously) best known lower bounds of in the circular case and in the linear case. Our bound shows the indistinguishability of traces from two sparse strings that each have a constant number of nonzeros. Can this technique be extended significantly? How hard is it to reconstruct a sparse string under a cyclic deletion channel? We resolve these questions by showing, using Fourier techniques, that traces suffice for reconstructing any constant-sparse string in a circular deletion channel, in contrast to the best known upper bound of for general strings in the circular deletion channel. This shows that new algorithms or new lower bounds must focus on non-constant-sparse strings.
Keywords and phrases:
Trace reconstruction, algorithmic statistics, Fourier analysisCopyright and License:
2012 ACM Subject Classification:
Mathematics of computing Probabilistic inference problemsFunding:
This work is partially supported by NSF award CCF-2127806 and by Office of Naval Research award N000142412695.Editor:
Shubhangi SarafSeries and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Given a binary string and a fixed probability , the deletion channel deletes each character in with probability and returns the remaining string, which is called a trace of the deletion channel. The trace reconstruction problem asks: how many traces from are needed to recover reliably. This problem has been studied extensively in various forms and under various assumptions [3, 2, 6, 7, 15, 20]. In the standard setting, where the deletion probability is a constant and the original string is an arbitrary binary string of length , there remains an exponential gap between the best known upper bound and the best known lower bound for worst-case reconstruction [6, 7]. In hopes of narrowing this gap, many related problems have been proposed and studied, such as matrix trace reconstruction [15], population recovery [2], and coded trace reconstruction [10, 4].
We study one such mild variant, the problem of circular trace reconstruction introduced by [17]. In this problem, the original string undergoes a random cyclic shift after deletion, and the reconstruction algorithm must return any string which is cyclically equivalent to . Circular trace reconstruction is known to be at least as hard as the linear version (essentially a consequence of the data processing inequality). Also, the best known bounds in each model are fairly similar: prior to our work, the best known lower bound was in the circular model compared to in the linear model [6, 17]. The best known upper bound for arbitrary strings is in the circular model (assuming can be written as a product of two or fewer primes), compared to in the linear model [7, 17].
1.1 Our Contributions
We improve the lower bound for circular trace reconstruction from to .
We prove this result by exhibiting two cyclically distinct strings that have a constant number of nonzero entries, yet which need traces to distinguish. This strong lower bound motivates the general question: how difficult is it to reconstruct constant-sparse strings from circular traces? To what degree can this lower bound of be improved?
We resolve these questions almost-tightly by showing that traces suffice to distinguish any two sparse circular strings. Our results imply that stronger lower bounds can only be obtained from non-constant-sparse strings, and also that new algorithmic improvements should focus on the non-constant-sparse case.
1.2 Technical Overview
We focus on the specific case of sparse circular strings, defined as binary strings which contain s for some constant . This family of strings has the useful property that any -sparse string is a cyclic shift of ; therefore, can be represented as the integer sequence , where specifies the number of s (the “gap”) between the th and th s in . With probability the deletion channel will preserve all s in , yielding a trace which can also be represented by a sequence of gaps. Crucially, in our setting of constant , this probability is constant. When this occurs, the distribution over traces depends only on the original integer sequence of gaps in . Both our upper and lower bounds are based on properties of this integer sequence.
Cyclic Statistics.
Our main technical tool is a family of shift-invariant functions which we call cyclic statistics. At a high level, a cyclic statistic of a sequence is a cyclically-invariant polynomial of the variables ; the statistic is obtained by summing a monomial function over all cyclic shifts of . The order of this statistic is the degree of the corresponding monomial. If two strings and are cyclic shifts of each other, their cyclic statistics are clearly identical. Surprisingly, we prove that the converse holds for order : if and are cyclically distinct, there exists a cyclic statistic of order in which they differ. We give a lower bound showing that this result is nearly tight.
Lemma 1 (Informal).
Let and be two cyclically distinct integer sequences. Then and differ in some cyclic statistic of order at most . Conversely, there exist integer sequences and with identical cyclic statistics up to order .
We prove the upper bound in Lemma 1 using Fourier techniques (see Lemma 19 in Section 3). We show that if and have identical cyclic statistics up to order , their Fourier transforms and must satisfy a certain system of sparse linear equations, each involving or fewer variables. We then analyze this system and show that any solution has the property that there exists an integer for which
| (1) |
By basic properties of Fourier transforms, this implies that is a cyclic shift of .
The significance of the order stems from a number theoretic result. Our original constraints involve an equation of two variables for each , but our analysis only applies for relatively prime to . We show that each can be written as the sum of 2 or 3 integers relatively prime to , transforming our constraint into a 6-variable equation, which implies Equation 1, given all cyclic statistics up to order 6 vanish.
The lower bound – that there exist and with identical cyclic statistics up to order 4 – is proved directly. We provide a pair of sequences of length 12 satisfying this property, obtained via computer search: and the sequence .
Cyclic statistics are preserved in expectation by the cyclic deletion channel, leading to our algorithmic results. They also characterize the distribution of traces obtained from the deletion channel, from which we obtain our information-theoretic lower bound. We give a brief overview of this connection and our results below.
Lower Bound.
In order to build our lower bound, we relate the probability distribution over traces of to the cyclic statistics of . Roughly, we show that the probability of generating a trace from can be expressed as a polynomial in of degree whose degree coefficient depends only on the order cyclic statistics of . We use this fact to show that if two cyclically distinct strings and have identical cyclic statistics up to order , the Hellinger distance between the distributions of the traces generated from and is . Therefore, it requires samples to distinguish the two distributions.
Theorem 2 (Informal version of Lemma 33).
Let and be two sequences that are permutations of each other and have identical cyclic statistics up to order . Then for any constant deletion probability , any algorithm which distinguishes the strings
requires traces from a cyclic deletion channel.
Upper Bound.
We build an algorithm that can recover any -sparse string using samples from the deletion channel, showing that our techniques from above cannot be extended to yield a significantly stronger lower bound.
Theorem 3 (Informal version of Theorem 26).
Let and be two cyclically distinct constant-sparse binary strings of length . Then there exists an algorithm which, for any constant deletion probability , distinguishes from using traces from a cyclic deletion channel.
As explained above, our algorithm exploits the fact that the retention probability is a constant. Therefore, with constant probability, no 1 is deleted in . Conditioning on this, every trace can be expressed as a sequence of binomial random variables; that is, with . By Lemma 1, any two cyclically distinct sequences and differ in some cyclic statistic of order at most . As all cyclic statistics of integer sequences are integers, the difference between and in this statistic is at least . Thus we aim to estimate cyclic statistics from traces to within constant error.
A naïve estimator is the empirical average of a certain cyclic statistic of the traces. However, this naïve algorithm requires samples. This is because, in the worst case, each may have mean and variance both . Thus the variance of the product of such random variables may be . However, instead, if we can shift the binomials such that they become random variables whose absolute value is , the absolute value of their product will be bounded by leading to variance . We accomplish this by greedily partitioning the set of expected gaps into clusters, so that all elements in the same cluster are -close to each other and -far from all other clusters. Since each binomial is tightly concentrated around its expectation, we can map each to the correct cluster with high probability, and subtract the cluster center from to obtain a random variable whose absolute value is bounded by with high probability.
A final hurdle is that subtracting cluster centers may erase any differences in the cyclic statistics of and . See the discussion at the start of Section 4 and Example 22 for more algorithmic intuition, and discussion of why cyclic statistics mod are needed to yield a algorithm. See Algorithm 1 and Theorem 26 for our main algorithmic result.
1.3 Structure of the Rest of the Paper
In Section 1.4 we discuss related work. Section 2 introduces the key definitions used in this paper. The rest of the paper consists of three technical sections that may be read independently, in any order. Section 3 builds tools to analyze cyclic statistics, leading to our characterization that “any two cyclically distinct strings must differ in some cyclic statistic of order .” Section 4 contains our algorithmic results, using the characterization of Section 3 to give an algorithm that distinguishes any constant-sparse strings with traces. Section 5 shows our lower bounds, based on a Hellinger distance analysis, showing that two strings that are permutations of each other and have identical cyclic statistics up to order require traces to distinguish.
1.4 Related Work
The trace reconstruction problem was introduced in [16, 3]. [3] shows strong results for reconstructing random strings , in contrast to the worst-case setting we consider here. [14] gives an algorithm for the (worst-case) trace reconstruction problem with constant deletion probability using traces. This result was improved independently by [11] and [18], which present an algorithm using traces. The state of the art is [7], providing an algorithm using traces.
On the lower bound side, [3] provides a bound, showing that distinguishing from requires traces. [13] improved the lower bound to by replacing the ’s in the previous strings by alternating , and [6] tightened this analysis to yield , the best known lower bound, but still exponentially far from .
Restricting the class of algorithms to yield improved lower bounds, [11] shows that all “mean-based” algorithms require samples to reconstruct strings. Here, a mean-based algorithm is any algorithm that only uses the empirical means of individual bits of the traces. Extending this line of work, [9] considers “-mer-based” algorithms which rely on counting the occurrences of contiguous -bit strings (where “mean-based” algorithm are the case ). [9] shows that, even for up to , any -mer-based algorithm requires samples to reconstruct the string. All known algorithms for the worst-case trace reconstruction problem are -mer-based, including the algorithm presented by [7]. More recently, [8] strengthened this result to show an analogous -sample lower bound for all statistical query (SQ) algorithms whose queries are “-local.” Thus if there is an algorithm using fewer than samples, it must use “non-local information.” We point out that the algorithm of the present paper, though we are in the rather different setting of sparse and circular trace reconstruction, looks at more global properties of the input (see Algorithm 1 and Theorem 26).
As the general worst-case trace reconstruction problem still has an exponential gap, much work has developed and investigated variants of the problem. [15] investigates many different variants, including matrix reconstruction and trace reconstruction for sparse strings. Most relevant to our work: in the regime where and are both constants, they analyze the “trivial” algorithm of directly estimating the size of each gap, giving an upper bound. (Directly estimating the size of each gap does not work in our circular trace reconstruction setting, since each trace gets randomly cyclically shifted, and thus a gap in a trace could have come from any of the gaps in the original string, and thus we cannot aggregate statistics about any individual gap.) [1] provides an algorithm using samples for the trace reconstruction problem for “separated” strings, which means the number of ’s between two ’s is at least . The work of [12] and [21] shows that when the string is confined in a known Hamming ball or an edit distance ball of radius , there are algorithms that can recover the string with traces. [19] investigates the average-case trace reconstruction problem, that is, when the string is uniformly randomly chosen instead of adversarially generated.They show that, in contrast to the worst-case, it only requires traces in the average case. [10] investigates the coded version of the trace reconstruction problem, that is, if we code the original string with a high-rate error-correction code, how many traces do we need to recover it? [4] shows that the coded trace reconstruction is nearly equivalent to the average-case trace reconstruction problem, and traces are enough to reconstruct the string.
The work of [17] introduces the circular trace reconstruction problem. In this problem, the traces are generated by passing through a deletion channel and then undergoing a random cyclic shift. This problem aims to recover the string or any of its cyclic shifts. They show, via a reduction, that this problem is at least as hard as the linear trace reconstruction problem and provide a upper bound when is a prime or a product of two primes in the worst-case, and a upper bound for the average case, along with a lower bound of . This lower bound comes from showing the indistinguishability of two constant-sparse strings whose cyclic statistics cancel up to order ; the current paper extends this construction to and shows this is almost the best possible.
2 Definitions
We study traces of -sparse binary strings, meaning the set of binary strings with nonzeros; we assume is bounded by a constant. For sequences and , we say is a cyclic shift of if there exists an integer such that for all , interpreting indices mod . If no such exists, then we say and are cyclically distinct.
Every -sparse binary string is a cyclic shift of some string of the form . For our analysis, it will often be useful to represent each string as an integer sequence , where corresponds to the number of zeros in the “gap” between the and s in . In this paper, we will generally use the integer sequence representation of ; when we require the binary string instead, we will explicitly write .
We use to refer to the set of integers and to denote a binomial random variable with trials and success probability . For two probability measures and over a set , the Hellinger distance between and is defined as .
Most of the technical analysis of this paper concerns understanding properties of “cyclic statistics” of sequences , which we define here:
Definition 4.
Let , and be a divisor of (with allowed). This defines a cyclic statistic mod of defined by , (where all indices are interpreted). When , we refer to the function as merely a cyclic statistic and write .
Problem definitions.
We investigate the problem of recovering a string from its traces generated by a circular deletion channel, which is defined as follows.
Definition 5 (Circular deletion channel).
Fix . A circular deletion channel, denoted takes as input and induces the probability distribution defined by the following process:
-
1.
Delete each bit in independently with probability to yield a string .
-
2.
Return a uniformly random cyclic shift of .
In this paper, we focus on the case where is a universal constant. We are now ready to define the (circular) trace reconstruction problem.
Definition 6 (Trace reconstruction problem).
Let . Given sample access to traces generated by the deletion channel , an algorithm which returns solves the trace reconstruction problem if, with probability at least , is a cyclic shift of .
Definition 7 (Distinguishing problem).
Let with and cyclically distinct. An algorithm distinguishes from if, given sample access to a deletion channel , returns with probability at least when is a cyclic shift of , and returns with probability at least when is a cyclic shift of .
Remark 8.
We note that the two problems above are nearly equivalent for -sparse strings in the sense that an upper bound for the distinguishing problem implies an algorithm for reconstruction. This is due to the following standard technique. Suppose algorithm distinguishes all pairs with probability at least using samples. To reconstruct an unknown string , we draw samples from the deletion channel , and test each possible pair of cyclically distinct strings against each other using . By a Chernoff bound, will return a valid answer on each pair with probability . Since there are cyclically distinct -sparse strings of length , all tests will succeed with constant probability, in which case we return the unique string which is chosen by against all other candidates.
3 Cyclic Statistics Characterize Cyclically Distinct Sequences
We begin by studying cyclic statistics of integer sequences, a key component of our upper and lower bounds. The goal of this section is to answer the question: given two cyclically distinct integer sequences and , what is the smallest value of such that there exists a cyclic statistic satisfying ? We show always suffices.
Our proof relies on the Fourier transforms of and , denoted and respectively. We partition the Fourier coefficients into equivalence classes based on . We show that for each equivalence class, the Fourier coefficients in are jointly zero or nonzero together. Notably, as it turns out, for any and with matching second-order cyclic statistics, has identical pattern of zeros and nonzeros as (see Lemma 13).
Armed with this insight, we restrict our analysis to the set of nonzero Fourier coefficients in and , and define a set of variables for each index with nonzero Fourier coefficients in and . We show that equivalence of cyclic statistics (up to order 6) imposes a system of linear constraints on the variables , and that any set of solutions must be strictly real and there exists an integer such that for all . It follows immediately that for all . By basic properties of Fourier transforms, this implies that is a cyclic shift of , proving our upper bound.
The organization of this section is as follows. In Section 3.1, we relate cyclic statistics to the Fourier transform and prove that for any and with identical cyclic statistics up to order , and have the same pattern of zeros and nonzeros. In Section 3.2 we set up several number theoretic results necessary to analyze the system of linear constraints. Finally, in Section 3.3, we apply these facts to the set of variables defined above to obtain the main result of this section, Lemma 19, which states that any two cyclically distinct strings must differ in some cyclic statistic of order .
3.1 Cyclic Statistics and the Fourier Transform
Suppose two integer sequences and have identical cyclic statistics up to order . We show this implies an equivalence between certain -way products of Fourier coefficients.
Lemma 9.
Let and be two integer sequences of length with identical order cyclic statistics. Consider their Fourier transforms and . For any -tuple with , we have .
Proof.
Letting , and interpreting indices mod :
where the exponential terms vanish in the second-to-last equality because and so . Since for all by assumption, we have that the corresponding -way products of Fourier terms are equal.
Lemma 9 implies that for any satisfying , if the Fourier coefficients are nonzero, then must be nonzero as well. We now partition the elements of into equivalence classes based on their gcd with .
Definition 10.
Let be a divisor of . We define .
We show that for a single integer sequence , the Fourier coefficients of in each are either all zero or all nonzero, allowing us to analyze them together. This can be shown using a well-known property of cyclotomic fields: for every primitive th root of unity of form , the Galois group is isomorphic to , the multiplicative group of integers mod . The isomorphism is specified by Fact 11.
Fact 11.
Let and let be a primitive th root of unity. Consider , the field extension of generated by . For all , there exists an automorphism such that .111For more on cyclotomic fields, see https://en.wikipedia.org/wiki/Cyclotomic_field.
Lemma 12.
Let be an integer sequence of length with Fourier transform . For any and , if , then .
Proof.
Suppose . Let and consider the a primitive th root of unity . The th Fourier coefficient of is given by
Let be a sequence of length generated by interpreting each index of mod and summing together all elements that map to a given location. (In other words, .) Then the Fourier transform of at location is given by
Similarly, . Because is an integer sequence, and are elements of the cyclotomic field . Choose such that . By Fact 11, there exists such that . Therefore, we have
as desired.
Consider two sequences and with identical second order cyclic statistics. As a simple consequence of Lemmas 9 and 12, and are jointly zero or jointly nonzero everywhere in the indices belonging to a single .
Lemma 13.
Let and be two integer sequences of length with identical second order cyclic statistics. Then for each , either (1) and for all or (2) for all .
Proof.
Suppose for some . By Lemma 12, we must have for all . Note that , so . Moreover, , so by Lemma 9, we have . This requires that . Applying Lemma 12 to , we have that is nonzero for all indices in , proving the first statement.
If , then similarly , requiring that either or . By Lemma 12, and must be zero everywhere in , proving the second statement.
3.2 Number Theoretic Prerequisites
Recall that our goal is to analyze a system of constraints involving the nonzero Fourier coefficients of and , where the constraints arise from the assumption that and have identical cyclic statistics up to order . These constraints will turn out to take the form , for a certain set of complex valued variables and certain tuples . We wish to show that any set of variables satisfying these constraints must be strictly real and form an arithmetic sequence mod . In Lemma 16, we show this is true; we apply this fact in Section 3.3 to show that and are cyclically equivalent.
Before proving this relation, we require certain number theoretic facts. First, we show that for any integer and , can be written as a sum (mod ) of either 2 or 3 numbers relatively prime to . Here and below, we use to refer to the cyclic group of integers modulo , represented as the set .
Fact 14.
The following facts are true for any prime power and integer :
-
If , there exists a pair that are relatively prime to such that . Also, there exists a triple that are relatively prime to such that .
-
If and is even, there exists a pair that are relatively prime to such that .
-
If and is odd then there exists a triple that are relatively prime to such that .
Proof.
First, consider the case in which . At least one of must be relatively prime to since their difference is 2, which itself is relatively prime to . Since , , and both and are relatively prime to , at least one of these two expressions expresses as the sum of 2 numbers both relatively prime to . Similarly, at least one of must be relatively prime to since their difference is 4, which itself is relatively prime to . Thus since and , and since both and are relatively prime to , at least one of these two expressions expresses as the sum of 3 numbers both relatively prime to .
For the case of , we note that if is even, then is relatively prime to and thus we express . Otherwise, if is odd, then is relatively prime to , and we express .
Lemma 15.
Let . The following is true for all :
-
If is even and is odd, then can be written as the sum mod of 3 numbers relatively prime to .
-
Else can be written as the sum mod of 2 numbers relatively prime to .
Proof.
Consider the prime factorization of : . Given , we use the Chinese remainder theorem to express via its residues mod , for each prime factor . Define . We note that is relatively prime to if and only if each is relatively prime to its corresponding prime power .
We use Fact 14 (and the Chinese remainder theorem) to canonically represent each as the sum mod of either 2 or 3 numbers that are each relatively prime to . If either is odd or and are both even, we represent as the sum of 2 numbers, by applying Fact 14 to represent each as with and relatively prime to . By the Chinese remainder theorem, the system, , has a unique solution relatively prime to ; similarly, there exists a unique relatively prime to satisfying for all . Finally, , which proves the first case. If is odd and is even, we use the analogous procedure to represent as the sum of 3 numbers relatively prime to , which proves the second case.
Using the representation given by Lemma 15, we analyze the system of complex-valued linear equations described at the beginning of this subsection.
Lemma 16.
Let and for each relatively prime to . Suppose that for every , and any -tuple relatively prime to satisfying , we have . Then there exists an integer such that for all relatively prime to we have .
Proof.
For all , let denote the canonical representation of given by Lemma 15 as the sum mod of either 2 or 3 numbers relatively prime to (so is a set of size 2 or 3). Let denote the sum of the 2 or 3 corresponding variables. For convenience, we write the set . Then the conditions of the lemma imply the following set of conditions:
To see the implication, it suffices to show that the left hand side of each constraint is a sum of at most 6 real variables whose indices are relatively prime to . For all with , we have , so the first constraint satisfies the condition. For all , can be written as a sum of 2 or 3 variables whose indices are relatively prime to , so the second constraint requires at most 4 variables. Finally, if is odd, the representation comprises 2 variables for all , in which case the third constraint requires 5 variables. If is even, the values and comprise 1 even number and 1 odd number, and has the same parity as , resulting in at most 6 terms.
Note that these three sets of conditions collectively imply that for all . Therefore, any set of solutions must form an arithmetic sequence of increment plus some integer, i.e., for all . Therefore, , which requires that is a multiple of (and is therefore real). Writing for some integer , we have for all . Finally, the first and second conditions imply that for all relatively prime to , which proves the lemma.
The following technical lemma is used in the proof of the main result of this section, Lemma 19. Intuitively, in the proof of Lemma 19 we aim to show that variables form an arithmetic sequence. Instead we show that, for each value , those with index satisfying form an arithmetic sequence of increment . We use the following lemma to show that this implies there is a consistent arithmetic sequence across all , with some increment .
Definition 17.
For a positive integer and prime number , define the -adic valuation of to be .
Lemma 18.
Let and . Let for each , and suppose the following are true for all pairs :
-
1.
.
-
2.
Then there exists such that for all .
Proof.
Consider the prime factorization of , . For each , we will show there exists satisfying the desired property mod , and conclude the lemma by the Chinese remainder theorem.
Fix . Define to be the element of with the lowest -adic valuation: . Let be an integer satisfying , and consider any . By a well-known property of -adic valuations, we have . The prime power divides , so it also divides . Therefore, we have the following two facts:
| (by assumption (2) of the lemma) | ||||
| (by definition of ) |
Using the fact that to combine the two equations, we have . Finally, we multiply both sides by the integer to yield . Choosing for all , the equation above holds for all pairs . We now wish to find such that for all . Such an integer is guaranteed to exist by the Chinese remainder theorem, which completes the proof.
3.3 Main Characterization
We now prove the main result of this section.
Lemma 19.
Let and be two integer sequences of length with identical cyclic statistics up to order . Then and must be identical up to a cyclic shift.
Proof.
It suffices to show that there exists an integer such that for all . By Lemma 13, this property holds for all where . Therefore, we consider only the divisors such that the Fourier coefficients are nonzero everywhere in for both and . For all with nonzero Fourier coefficients and , define , where is the unique complex-valued function satisfying and for all . To prove the lemma, we show that there exists an integer such that for all with and , and .
Fix and let . Note that for all , . Therefore, each corresponds to a unique number relatively prime to . Consider any -tuple such that . By Lemma 9, we have . Applying Lemma 16 (and using the fact that ), there exists an integer such that for all , we have . In other words, is real and .
We have shown that for all with nonzero Fourier coefficients, we can write . We will now show that this is true for a single consistent integer across all . Consider where and are nonzero everywhere in . Let respectively be the corresponding coefficients, and . Since is a multiple of , Lemma 15 implies that we can write as a sum mod of 2 or 3 numbers relatively prime to . Multiplying by , we can write as a sum mod of 2 or 3 numbers whose gcd with is ; denote this representation as . Since is also a multiple of , it must also have a representation as a sum mod of 2 or 3 numbers whose gcd with is .
We write and to denote the sum of the variables corresponding to the indices in and respectively, and use the convention that . Since for all , we have for all . From the previous paragraph, we have that the sum of the elements in equals the sum of the elements in mod . We thus conclude the condition that has a consistent coefficient in both representations, as , and reexpress this as a sum of 4 to 6 coefficients, moving everything to the left-hand side, as . Recall that only contains indices with , and the corresponding variables satisfy . Similarly, only contains indices with , and each satisfies . Therefore, the constraint holds for any pair of , where .
Applying Lemma 18, there exists , such that for all . Therefore, for every , for every , . Notice that by definition, is an integer and thus, for every , for every , . Therefore, for every where , are nonzero, we have . Dividing by , we have for every where , are nonzero, as desired.
4 Upper Bound
We now give an algorithm for distinguishing any two cyclically distinct strings with a constant number of s. Given Lemma 19, a naïve approach for testing whether a set of traces is generated from or is to identify a cyclic statistic in which and differ, and then estimate this statistic from traces. For a string with s, the deletion channel will preserve all s with constant probability; a resulting trace can be represented as a sequence of “gaps,” for which we can compute cyclic statistics. In particular, for a string and retention probability , a trace can be represented as a sequence of binomial random variables , where .
Unfortunately, computing a cyclic statistic of requires multiplying up to 6 of these binomial random variables. In the worst case, we might have for all , causing our estimator to have variance roughly . Since cyclic statistics are integers, we may need to estimate the desired cyclic statistic to within constant error, and thus the naïve algorithm would require samples.
We give a more efficient algorithm using the following insight: while the product of independent binomials with trials each has variance roughly , the variance is considerably lower if each binomial is shifted to have mean . To accomplish this, we preprocess and by greedily grouping the values that are within of each other into a cluster. When processing a trace from either or , we compute the cluster center closest to each element , and subtract that center from . Because the gap sizes are highly likely to be within of their expectations, we can find the “correct” cluster with high probability. The resulting estimator has variance .
Of course, subtracting an arbitrary sequence from the observed trace raises another concern: the sequences and might now be cyclically equivalent, erasing any differences in their cyclic statistics. We circumvent this problem with a slightly different estimator which only considers cyclic statistics mod where is chosen so that is preserved by a cyclic shift of .
The remainder of this section is organized as follows. In Section 4.1, we give an algorithm for generating a well separated partition of the expectations . In Section 4.2, we show that cyclic statistics mod are preserved by subtraction of a sequence invariant to cyclic shifts of . Finally, in Section 4.3, we present our cyclic statistics-based algorithm and prove its correctness.
4.1 Determining Centers
As a preprocessing step, our algorithm partitions the means of binomial random variables (the gaps in traces from and ) into well separated clusters. For our purposes, clusters are considered to be well separated if they are far apart; we give a formal definition below.
Definition 20.
Let and . Then a function is a -separated partition of if it satisfies the following:
-
1.
For any , if , then .
-
2.
For any , if , then .
-
3.
For all , .
We show that a -separated partition exists for any and . The proof is constructive: we give a greedy algorithm for producing a -separated partition of a fixed set of points . The algorithm starts by assigning each point to its own cluster; it then iteratively merges clusters which violate the separation condition. When no violations remain, we prove that the algorithm has obtained a -separated partition.
Lemma 21.
Let , and . Then there exists a -separated partition of .
4.2 Uniqueness of Cyclic Statistics
As discussed at the beginning of this section, our goal is to reduce the variance of our estimator by shifting each binomial random variable by a value which is within a constant of its mean. In particular, let be a -separated partition of , and . The binomial random variables concentrate tightly around their expectation, allowing us to approximately recover the partition of and generated by the partitioning algorithm with high probability. The logical approach is therefore to estimate cyclic statistics of .
Unfortunately, this introduces a second problem. If and are distinct after applying the partition elementwise, then they can be distinguished simply by recovering the partition. But if and appear similar under the partition – that is, is a cyclic shift of – then might be a cyclic shift of , resulting in identical cyclic statistics and thus a failure to distinguish from .
Example 22.
Consider the case where , and where the sequence of shifts/clusters follows the pattern . Further, let and . In this case, when we sample from or , once we subtract , we get two 6-tuples that are cyclically identical; however, algorithmically, we would hope to be able to distinguish traces from vs , since the alignment between the sequence of shifts vs the sequence of offsets differs for vs . The right thing to do here, after subtracting , is to look at cyclic statistics mod 3.
This motivates our general strategy of looking at cyclic statistics mod , where is the smallest period of repetition of the shift sequence.
Definition 23.
A sequence is symmetric mod if for all , .
Definition 24.
Let , , and be a divisor of . For a sequence , the function is an -th order cyclic statistic mod , shifted by .
We show that for any period and set of shifts that repeats mod , there is a -th order statistic that differs on , when shifted by and taken mod . We prove this by starting with Lemma 19 that guarantees the existence of a statistic of order that distinguishes from ; we show this implies existence of a distinguishing statistic of the same order mod . Taking the distinguishing statistic of smallest order lets us subtract off any sequence that is symmetric mod , without modifying the discrepancy between and . This yields:
Lemma 25.
Let and be two cyclically distinct integer sequences of length . Let divide , and let be a sequence that is symmetric mod . Then there exists with such that
4.3 Our Algorithm
We are now ready to describe our proposed tester, Algorithm 1. Given two candidates and , a trace from the deletion channel can be described as a sequence of binomial random variables , each of which has an expectation in the set . The algorithm begins by preprocessing the strings and to create a -separated partition of the expectations (Line 2). We then consider the sequences and created by applying elementwise to and . We design two tests for two separate cases:
-
If is a cyclic shift of , then traces from and appear relatively similar. In this case, we invoke a subroutine (Algorithm 2), which determines a shifted cyclic statistic in which and differ, then estimates this statistic from traces. Owing to the shift, the resulting estimate has low variance, allowing us to estimate the cyclic statistic from traces.
-
If and are cyclically distinct, then the algorithm can reliably distinguish and by examining only a single trace whose s are intact. Therefore, we draw a constant number of traces, choose an arbitrary trace with s (Line 7), and cluster the gaps according to . For large enough, the resulting sequence will with high probability correspond exactly to either or .
Combining the results from each case yields our overall upper bound. We now state and prove the main result of Section 4. Our proof relies on Lemmas 27 and 28 in Section 4.3.2, which prove the correctness of Algorithm 2.
4.3.1 Proof of Upper Bound
Theorem 26.
Let and be two cyclically distinct -sparse binary strings of length . Then Algorithm 1 distinguishes from with probability for sufficiently large , using traces from a cyclic deletion channel.
Proof.
We analyze Algorithm 1 to show the theorem. We consider two cases:
- Case 1:
-
is a cyclic shift of . In this case, we invoke Algorithm 2, which draws returns the correct answer with probability at least , by Lemma 28.
- Case 2:
-
is not a cyclic shift of . Then we draw traces. With probability , at least one trace will contain exactly s. Suppose . Due to Lemma 27, with probability at least , we have for all . Therefore, , and the algorithm returns with probability at least for large enough .
In both cases, the number of traces is bounded by .
4.3.2 Distinguishing Similar Strings
We now handle the case in which , i.e., and are cyclically equivalent under the -separated partition constructed in Algorithm 1. We build an estimator based on shifted cyclic statistics and show that traces suffice to distinguish from . In particular, let be the minimum value such that and are symmetric mod . For each , let . Define the function to be the average of the points in cluster ; that is, . The algorithm chooses the smallest value of for which there is an order cyclic statistic such that . We call a trace useful if it satisfies the following conditions:
-
1.
There are exactly s in .
-
2.
For all , .
Intuitively, the second condition says that the gaps in can be matched up with cluster centers as we expect. Suppose there are useful traces. Our estimator will apply the statistic to all the useful traces after subtracting out their cluster centers:
| (2) |
We define as the average of the above equation over useful traces. We show that, for any set of traces from a string , with probability at least , which implies an efficient tester.
The main result of this subsection requires the following lemma, which roughly states that we can recover clusters from a random trace with high probability.
Lemma 27.
Let , , and for all . Then there exists an absolute constant such that, for any -separated partition of , and for all with probability at least .
Proof.
Fix . Let be a -separated partition of . By a Chernoff bound, there exists a constant such that with probability at least . When this occurs, the following holds for all with :
In this case, clearly . The contrapositive of this is that, if , we must have . Taking a union bound over the random variables, the probability that this occurs for some is at most , as desired.
We are now ready to prove the correctness of Algorithm 2.
Lemma 28.
Let and be two cyclically distinct binary strings of length , and . Let be a -separated partition of , and suppose is a cyclic shift of . Then for sufficiently large and , there exists an algorithm (Algorithm 2) which distinguishes from with probability at least using traces from a cyclic deletion channel.
Proof.
Fix to be the constant determined by Lemma 27. Let denote the set of useful traces (defined at the top of Section 4.3.2). The first usefulness condition holds with probability exactly for each trace. By Lemma 27, we have and with probability at least . In this case, we have , which is precisely the second condition. Therefore, for large enough , each trace falls in with constant probability. By a Chernoff bound, with probability . For each , the algorithm computes , where is defined in Equation 2. Let be the event that and for all . Conditioned on , it is possible to compute precisely, since we can choose a consistent cyclic shift of and match to that shift up to symmetry. We therefore have
| (s are independent) | ||||
where the term in the first line comes from the probability that there are 1s in but event fails; this is multiplied by the universal bound on for .
By Lemma 27, holds with probability at least conditioned on . Additionally, the second condition implies that for all , we have . Combining these facts and applying the law of total probability yields
Combining the bound on with Hoeffding’s inequality, we have
By our choice of , we have with probability . Therefore, for an appropriate choice of constants, the above equation is bounded by . By the triangle inequality, the following holds for large enough with probability at least : . By Lemma 25, there must exist a statistic such that (namely, the statistic chosen on Line 2). Therefore, if , the algorithm returns with probability at least (Line 5). The same is true for , which completes the proof.
5 Lower Bounds
Our goal in this section is to upper-bound the distance between the distributions and for two strings with that are permutations of each other and have identical low-order cyclic statistics, which will yield a lower bound for the number of traces needed to distinguish these strings. We first analyze the probability of observing a given trace from , viewed as a polynomial in . We show that the higher-order coefficients of this polynomial depend on the lower-order cyclic statistics of . Using this fact, we bound the distance between probabilities from versus . We conclude by providing two strings with identical cyclic statistics up to order 4, proving our lower bound.
Lemma 29.
Let with an integer upper bound on , and let be any cyclic shift of . Then for a circular deletion channel with deletion probability , letting , and letting for , we have, for some symmetric function of , that depends on , that
| (3) |
where all indices are taken modulo .
Proof.
Since all cyclic shifts of a trace have equal probability under the deletion channel, we assume without loss of generality that . By definition of a circular deletion channel, and letting denote the length of the underlying binary strings respectively, we have
| (4) |
where indices are taken modulo . We point out that the terms outside the parentheses, are a symmetric function of (depending only on ), and thus can be absorbed into the initial term in the lemma statement. We slightly refactor each binomial term as
We thus need only consider the sum over of the product over of the above expression. The first term does not depend on so is by definition symmetric in and can be absorbed into the initial . The next term, , when we take its product over all , becomes a symmetric function of and can thus also be absorbed into . The final term becomes the last term in Equation 3, after noting that , by definition of .
The only non-symmetric portion of Equation 3 is the expression inside the sum, which is
| (5) |
Thus, when we compare traces from versus for sequences that are permutations of each other, it is this last term that will distinguish between them. We now view this term as a polynomial in , and show that the high-order coefficients of this polynomial can be expressed in terms of low-order cyclic statistics of . Thus for two sequences with identical low-order cyclic statistics, we expect the distributions of versus to have high-order terms in that exactly cancel, leading to our main lower bound.
Lemma 30.
Expression 5 when viewed as a polynomial in of degree , for any has coefficient that is a linear combination of degree cyclic statistics of , each multiplied by some symmetric function of (possibly depending on ).
Proof.
Define to be the set of pairs such that and , where as defined in the lemma. Thus the main expression of the lemma equals . In particular, the coefficient of in this expression can be found by multiplying the non- parts of all combinations of distinct terms:
| (6) |
For each fixed , this expression is a symmetric polynomial of the terms for , and we thus use Newton’s symmetric polynomial identities to conclude that we may reexpress this expression for fixed as a linear combination of products of “power sums” of these terms; the power sum of degree is defined as . Namely, Equation 6 can be expressed as some linear combination of the following expressions, where are some positive integers that sum to :
| (7) |
We break down the inner sum of this expression using the definition of , and the binomial expansion (across different powers ), as
For any exponent , we consider the final sum as a degree polynomial in , with notation . Thus Equation 7 equals
We pull the sums over outside:
We split the product by whether, for each , we have , since when this is true the expression simplifies significantly (note that we also move the sum over to the right, past the first parenthetical, which does not depend on ):
The first parenthetical is clearly a symmetric function of . To analyze the second expression, we apply a variable substitution in the inner sum, replacing with . For any fixed tuple , the last parenthetical equals
We view the product as a polynomial in the variables , which we denote as ; the degree of this polynomial is thus bounded by , which is at most . Crucially, the sum over now equals , which, since we sum a degree polynomial over all cyclic shifts of , is thus clearly a linear combination of cyclic statistics of degree .
In conclusion, Equation 7 is thus a sum of cyclic statistics of each time some symmetric function of . Thus, the coefficient of in Equation 5, being a linear combination of expressions of the form of Equation 7, is also a sum of cyclic statistics of each time some symmetric function of , as desired.
We now directly compare the probabilities of getting a certain trace from versus .
Lemma 31.
Let and . Suppose is a permutation of and that the two sequences have matching cyclic statistics up to some order . Let . If for all then for sufficiently large depending on , , , and we have
Proof.
As above, define and define . Since upper bounds , we have that upper bounds too, since they are a permutation of . By Lemma 29, we have
and the corresponding expression for . Let and . Because is a permutation of , we have
It remains only to show that .
Let . Then are both degree polynomials in with leading coefficient . For we now bound the contribution of the terms of degree .
By assumption, each ; thus for and we trivially have that , for . Thus by definition of , the contribution of the term to either or has magnitude at most . We bound , and also bound . We use these bounds to bound the contribution to or of all terms of degree in : this is bounded by the geometric series . We choose large enough so that each term of this series is at most of the previous term, namely, we choose so that . Because the ratio of terms is , the term of the series is at least twice as large as the sum of magnitudes the remaining terms combined, so that, since the term of is , we conclude that . Second, the sum of the series starting with the term has magnitude at most times the term, namely,
We now invoke Lemma 30, which says that the coefficients of respectively are linear combinations of degree cyclic statistics of respectively, times symmetric functions of respectively. The assumption that are permutations of each other means that all symmetric functions of equal the corresponding symmetric functions of ; the assumption that have identical cyclic statistics up to order further implies that the coefficients of must be identical for . Thus we bound by plugging in the bounds of the previous paragraph starting at the first nonzero degree, : . Combining with our lower bounds from above that , and setting , we conclude as desired. We use the following bound relating the Hellinger distance between to the TV distance between samples from or .
Lemma 32 (Lemma A.5 in [13]).
Let and be probability measures with . Then if .
Lemma 33.
Let and be two strings where and are permutations of each other and have identical cyclic statistics up to order . Then any algorithm which distinguishes from with probability requires samples.
Proof.
Let represent conditioned on deleting only s; with defined correspondingly for . The distributions are only supported on traces that are cyclic shifts of , for some gaps .
As above, define . We point out that, by a standard Chernoff bound, we have with probability at least for some universal constant (assuming is a constant).
Combining Lemma 31 with this Chernoff bound yields the following bound on the Hellinger distance between and :
We can generate a sample from by first sampling from and then passing the sample through a second channel which deletes only s (and similarly for and ). By the data processing inequality, the second channel cannot increase the Hellinger distance between and , so we have . Applying Lemma 32, we conclude that one requires traces to distinguish from with constant success probability.
Finally, we provide two strings that have identical cyclic statistics up to -th order, and whose gaps are permutations of each other. They are the sequence and the sequence . Thus, distinguishing these two strings requires traces.
References
- [1] Anders Aamand, Allen Liu, and Shyam Narayanan. Near-Optimal Trace Reconstruction for Mildly Separated Strings. In Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis, editors, 52nd International Colloquium on Automata, Languages, and Programming (ICALP 2025), volume 334 of Leibniz International Proceedings in Informatics (LIPIcs), pages 3:1–3:20, Dagstuhl, Germany, 2025. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.ICALP.2025.3.
- [2] Frank Ban, Xi Chen, Adam Freilich, Rocco A Servedio, and Sandip Sinha. Beyond trace reconstruction: Population recovery from the deletion channel. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 745–768. IEEE, 2019. doi:10.1109/FOCS.2019.00050.
- [3] Tugkan Batu, Sampath Kannan, Sanjeev Khanna, and Andrew McGregor. Reconstructing strings from random traces. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’04, pages 910–918, USA, 2004. Society for Industrial and Applied Mathematics. URL: http://dl.acm.org/citation.cfm?id=982792.982929.
- [4] Joshua Brakensiek, Ray Li, and Bruce Spang. Coded trace reconstruction in a constant number of traces. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 482–493. IEEE, 2020. doi:10.1109/FOCS46700.2020.00052.
- [5] Arnav Burudgunte, Paul Valiant, and Hongao Wang. New bounds for circular trace reconstruction, 2025. arXiv:2512.02412.
- [6] Zachary Chase. New lower bounds for trace reconstruction. In Annales de l’Institut Henri Poincaré-Probabilités et Statistiques, volume 57, pages 627–643, 2021.
- [7] Zachary Chase. Separating words and trace reconstruction. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 21–31, 2021. doi:10.1145/3406325.3451118.
- [8] Xi Chen, Anindya De, Chin Ho Lee, and Rocco A. Servedio. Trace Reconstruction from Local Statistical Queries. In Amit Kumar and Noga Ron-Zewi, editors, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2024), volume 317 of Leibniz International Proceedings in Informatics (LIPIcs), pages 52:1–52:24, Dagstuhl, Germany, 2024. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.APPROX/RANDOM.2024.52.
- [9] Kuan Cheng, Elena Grigorescu, Xin Li, Madhu Sudan, and Minshen Zhu. On -mer-based and maximum likelihood estimation algorithms for trace reconstruction. In 2024 IEEE International Symposium on Information Theory (ISIT), pages 879–884, 2024. doi:10.1109/ISIT57864.2024.10619392.
- [10] Mahdi Cheraghchi, Ryan Gabrys, Olgica Milenkovic, and Joao Ribeiro. Coded trace reconstruction. IEEE Transactions on Information Theory, 66(10):6084–6103, 2020. doi:10.1109/TIT.2020.2996377.
- [11] Anindya De, Ryan O’Donnell, and Rocco A Servedio. Optimal mean-based algorithms for trace reconstruction. The Annals of Applied Probability, 29(2):851–874, 2019.
- [12] Elena Grigorescu, Madhu Sudan, and Minshen Zhu. Limitations of Mean-Based Algorithms for Trace Reconstruction at Small Edit Distance. IEEE Transactions on Information Theory, 68(10):6790–6801, 2022. doi:10.1109/TIT.2022.3168624.
- [13] Nina Holden and Russell Lyons. Lower bounds for trace reconstruction. The Annals of Applied Probability, 30(2):503–525, April 2020. doi:10.1214/19-AAP1506.
- [14] Thomas Holenstein, Michael Mitzenmacher, Rina Panigrahy, and Udi Wieder. Trace reconstruction with constant deletion probability and related results. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’08, pages 389–398, San Francisco, California and USA, 2008. Society for Industrial and Applied Mathematics. URL: http://dl.acm.org/citation.cfm?id=1347082.1347125.
- [15] Akshay Krishnamurthy, Arya Mazumdar, Andrew McGregor, and Soumyabrata Pal. Trace reconstruction: Generalized and parameterized. IEEE Transactions on Information Theory, 67(6):3233–3250, 2021. doi:10.1109/TIT.2021.3066010.
- [16] Vladimir I Levenshtein. Efficient reconstruction of sequences. IEEE Transactions on Information Theory, 47(1):2–22, 2002. doi:10.1109/18.904499.
- [17] Shyam Narayanan and Michael Ren. Circular trace reconstruction. In 12th Innovations in Theoretical Computer Science Conference (ITCS 2021), pages 18:1–18:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPIcs.ITCS.2021.18.
- [18] Fedor Nazarov and Yuval Peres. Trace reconstruction with exp(O()) samples. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pages 1042–1046, New York, NY, USA, 2017. Association for Computing Machinery. doi:10.1145/3055399.3055494.
- [19] Yuval Peres and Alex Zhai. Average-Case Reconstruction for the Deletion Channel: Subpolynomially Many Traces Suffice. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 228–239, Berkeley, CA, October 2017. IEEE. doi:10.1109/FOCS.2017.29.
- [20] Joey Rivkin, Gregory Valiant, and Paul Valiant. A generalized trace reconstruction problem: Recovering a string of probabilities. In Proceedings of the 57th Annual ACM Symposium on Theory of Computing, STOC ’25, pages 1657–1667, New York, NY, USA, 2025. Association for Computing Machinery. doi:10.1145/3717823.3718315.
- [21] Jin Sima and Jehoshua Bruck. Trace Reconstruction with Bounded Edit Distance. In 2021 IEEE International Symposium on Information Theory (ISIT), pages 2519–2524, Melbourne, Australia, 2021. IEEE Press. doi:10.1109/ISIT45174.2021.9518244.
