Abstract 1 Introduction 2 Definitions 3 Cyclic Statistics Characterize Cyclically Distinct Sequences 4 Upper Bound 5 Lower Bounds References

New Bounds for Circular Trace Reconstruction

Arnav Burudgunte ORCID Department of Computer Science, Purdue University, West Lafayette, IN, USA Paul Valiant ORCID Department of Computer Science, Purdue University, West Lafayette, IN, USA Hongao Wang ORCID Department of Computer Science, Purdue University, West Lafayette, IN, USA
Abstract

The “trace reconstruction” problem asks, given an unknown binary string x and a channel that repeatedly returns “traces” of x with each bit randomly deleted with some probability p, how many traces are needed to recover x? There is an exponential gap between the best known upper and lower bounds for this problem. Many variants of the model have been introduced in hopes of motivating or revealing new approaches to narrow this gap. We study the variant of circular trace reconstruction introduced by Narayanan and Ren (ITCS 2021), in which traces undergo a random cyclic shift in addition to random deletions.

We show an improved lower bound of Ω~(n5) for circular trace reconstruction. This contrasts with the (previously) best known lower bounds of Ω~(n3) in the circular case and Ω~(n3/2) in the linear case. Our bound shows the indistinguishability of traces from two sparse strings x,y that each have a constant number of nonzeros. Can this technique be extended significantly? How hard is it to reconstruct a sparse string x under a cyclic deletion channel? We resolve these questions by showing, using Fourier techniques, that O~(n6) traces suffice for reconstructing any constant-sparse string in a circular deletion channel, in contrast to the best known upper bound of exp(O~(n1/3)) for general strings in the circular deletion channel. This shows that new algorithms or new lower bounds must focus on non-constant-sparse strings.

Keywords and phrases:
Trace reconstruction, algorithmic statistics, Fourier analysis
Copyright and License:
[Uncaptioned image] © Arnav Burudgunte, Paul Valiant, and Hongao Wang; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Mathematics of computing Probabilistic inference problems
Related Version:
Full Version: https://arxiv.org/abs/2512.02412 [5]
Funding:
This work is partially supported by NSF award CCF-2127806 and by Office of Naval Research award N000142412695.
Editor:
Shubhangi Saraf

1 Introduction

Given a binary string x and a fixed probability p, the deletion channel Del(x) deletes each character in x with probability p and returns the remaining string, which is called a trace of the deletion channel. The trace reconstruction problem asks: how many traces from Del(x) are needed to recover x reliably. This problem has been studied extensively in various forms and under various assumptions [3, 2, 6, 7, 15, 20]. In the standard setting, where the deletion probability p is a constant and the original string x is an arbitrary binary string of length n, there remains an exponential gap between the best known upper bound exp(O~(n1/5)) and the best known lower bound Ω~(n3/2)) for worst-case reconstruction [6, 7]. In hopes of narrowing this gap, many related problems have been proposed and studied, such as matrix trace reconstruction [15], population recovery [2], and coded trace reconstruction [10, 4].

We study one such mild variant, the problem of circular trace reconstruction introduced by [17]. In this problem, the original string x undergoes a random cyclic shift after deletion, and the reconstruction algorithm must return any string which is cyclically equivalent to x. Circular trace reconstruction is known to be at least as hard as the linear version (essentially a consequence of the data processing inequality). Also, the best known bounds in each model are fairly similar: prior to our work, the best known lower bound was Ω~(n3) in the circular model compared to Ω~(n3/2) in the linear model [6, 17]. The best known upper bound for arbitrary strings is exp(O~(n1/3)) in the circular model (assuming n can be written as a product of two or fewer primes), compared to exp(O~(n1/5)) in the linear model [7, 17].

1.1 Our Contributions

We improve the lower bound for circular trace reconstruction from Ω~(n3) to Ω~(n5).

We prove this result by exhibiting two cyclically distinct strings x,y that have a constant number of nonzero entries, yet which need Ω~(n5) traces to distinguish. This strong lower bound motivates the general question: how difficult is it to reconstruct constant-sparse strings x from circular traces? To what degree can this lower bound of Ω~(n5) be improved?

We resolve these questions almost-tightly by showing that O~(n6) traces suffice to distinguish any two sparse circular strings. Our results imply that stronger lower bounds can only be obtained from non-constant-sparse strings, and also that new algorithmic improvements should focus on the non-constant-sparse case.

1.2 Technical Overview

We focus on the specific case of sparse circular strings, defined as binary strings which contain k 1s for some constant k. This family of strings has the useful property that any k-sparse string x is a cyclic shift of 10x110x210xk; therefore, x can be represented as the integer sequence (x1,,xk), where xj specifies the number of 0s (the “gap”) between the jth and (j+1)th 1s in x. With probability (1p)k the deletion channel will preserve all 1s in x, yielding a trace x~ which can also be represented by a sequence of k gaps. Crucially, in our setting of constant p,k, this probability (1p)k is constant. When this occurs, the distribution over traces x~ depends only on the original integer sequence x1,,xk of gaps in x. Both our upper and lower bounds are based on properties of this integer sequence.

Cyclic Statistics.

Our main technical tool is a family of shift-invariant functions which we call cyclic statistics. At a high level, a cyclic statistic of a sequence x=(x1,,xk) is a cyclically-invariant polynomial of the variables x1,,xk; the statistic is obtained by summing a monomial function over all cyclic shifts of x. The order of this statistic is the degree of the corresponding monomial. If two strings x and y are cyclic shifts of each other, their cyclic statistics are clearly identical. Surprisingly, we prove that the converse holds for order 6: if x and y are cyclically distinct, there exists a cyclic statistic of order 6 in which they differ. We give a lower bound showing that this result is nearly tight.

Lemma 1 (Informal).

Let x and y be two cyclically distinct integer sequences. Then x and y differ in some cyclic statistic of order at most 6. Conversely, there exist integer sequences x and y with identical cyclic statistics up to order 4.

We prove the upper bound in Lemma 1 using Fourier techniques (see Lemma 19 in Section 3). We show that if x and y have identical cyclic statistics up to order 6, their Fourier transforms x^ and y^ must satisfy a certain system of sparse linear equations, each involving 6 or fewer variables. We then analyze this system and show that any solution x^,y^ has the property that there exists an integer c for which

x^j=y^jexp(2πicjk)for all j[k]. (1)

By basic properties of Fourier transforms, this implies that y is a cyclic shift of x.

The significance of the order 6 stems from a number theoretic result. Our original constraints involve an equation of two variables for each j[k], but our analysis only applies for j relatively prime to k. We show that each j[k] can be written as the sum of 2 or 3 integers relatively prime to k, transforming our constraint into a 6-variable equation, which implies Equation 1, given all cyclic statistics up to order 6 vanish.

The lower bound – that there exist x and y with identical cyclic statistics up to order 4 – is proved directly. We provide a pair of sequences of length 12 satisfying this property, obtained via computer search: x(j)=(0,2,3,2,1,1,1,1,2,3,2,0) and the sequence y(j)=3x(j).

Cyclic statistics are preserved in expectation by the cyclic deletion channel, leading to our algorithmic results. They also characterize the distribution of traces obtained from the deletion channel, from which we obtain our information-theoretic lower bound. We give a brief overview of this connection and our results below.

Lower Bound.

In order to build our lower bound, we relate the probability distribution over traces of x to the cyclic statistics of (x1,,xk). Roughly, we show that the probability of generating a trace x~ from x can be expressed as a polynomial in n of degree L whose degree Lm coefficient depends only on the order m cyclic statistics of x. We use this fact to show that if two cyclically distinct strings x and y have identical cyclic statistics up to order m, the Hellinger distance between the distributions of the traces generated from x and y is O~(nm1). Therefore, it requires Ω~(nm+1) samples to distinguish the two distributions.

Theorem 2 (Informal version of Lemma 33).

Let (x1,,xk) and (y1,,yk) be two sequences that are permutations of each other and have identical cyclic statistics up to order m. Then for any constant deletion probability p, any algorithm which distinguishes the strings

x=10n+x110n+x210n+xk and y=10n+y110n+y210n+yk

requires Ω~(nm+1) traces from a cyclic deletion channel.

By Lemma 1, there exist two cyclically distinct sequences x and y with identical cyclic statistics up to order 4. Plugging the corresponding strings into Theorem 2, we obtain our Ω~(n5) lower bound for circular trace reconstruction.

Upper Bound.

We build an algorithm that can recover any k-sparse string x using O~(n6) samples from the deletion channel, showing that our techniques from above cannot be extended to yield a significantly stronger lower bound.

Theorem 3 (Informal version of Theorem 26).

Let x and y be two cyclically distinct constant-sparse binary strings of length n. Then there exists an algorithm which, for any constant deletion probability p, distinguishes x from y using O~(n6) traces from a cyclic deletion channel.

As explained above, our algorithm exploits the fact that the retention probability q1p is a constant. Therefore, with constant probability, no 1 is deleted in x. Conditioning on this, every trace x~ can be expressed as a sequence of binomial random variables; that is, x~=(x~1,,x~k) with x~jBin(xj,q). By Lemma 1, any two cyclically distinct sequences x and y differ in some cyclic statistic of order at most 6. As all cyclic statistics of integer sequences are integers, the difference between x and y in this statistic is at least 1. Thus we aim to estimate cyclic statistics from traces to within constant error.

A naïve estimator is the empirical average of a certain cyclic statistic of the traces. However, this naïve algorithm requires Ω(n11) samples. This is because, in the worst case, each x~j may have mean and variance both Ω(n). Thus the variance of the product of 6 such random variables may be Ω(n11). However, instead, if we can shift the binomials such that they become random variables whose absolute value is O~(n), the absolute value of their product will be bounded by O~(n3) leading to variance O~(n6). We accomplish this by greedily partitioning the set of expected gaps {qxj} into clusters, so that all elements in the same cluster are O~(n)-close to each other and Ω~(n)-far from all other clusters. Since each binomial is tightly concentrated around its expectation, we can map each x~j to the correct cluster with high probability, and subtract the cluster center from x~j to obtain a random variable whose absolute value is bounded by O~(n) with high probability.

A final hurdle is that subtracting cluster centers may erase any differences in the cyclic statistics of x and y. See the discussion at the start of Section 4 and Example 22 for more algorithmic intuition, and discussion of why cyclic statistics mod are needed to yield a O~(n6) algorithm. See Algorithm 1 and Theorem 26 for our main algorithmic result.

1.3 Structure of the Rest of the Paper

In Section 1.4 we discuss related work. Section 2 introduces the key definitions used in this paper. The rest of the paper consists of three technical sections that may be read independently, in any order. Section 3 builds tools to analyze cyclic statistics, leading to our characterization that “any two cyclically distinct strings x,y must differ in some cyclic statistic of order m6.” Section 4 contains our algorithmic results, using the characterization of Section 3 to give an algorithm that distinguishes any constant-sparse strings with O~(n6) traces. Section 5 shows our lower bounds, based on a Hellinger distance analysis, showing that two strings x,y that are permutations of each other and have identical cyclic statistics up to order m require Ω~(nm+1) traces to distinguish.

1.4 Related Work

The trace reconstruction problem was introduced in [16, 3]. [3] shows strong results for reconstructing random strings x, in contrast to the worst-case setting we consider here. [14] gives an algorithm for the (worst-case) trace reconstruction problem with constant deletion probability using exp(O~(n1/2)) traces. This result was improved independently by [11] and [18], which present an algorithm using exp(O(n1/3)) traces. The state of the art is [7], providing an algorithm using exp(O~(n1/5)) traces.

On the lower bound side, [3] provides a Ω(n) bound, showing that distinguishing 0n10n1 from 0n110n requires Ω(n) traces. [13] improved the lower bound to Ω~(n5/4) by replacing the 0’s in the previous strings by alternating 01010101, and [6] tightened this analysis to yield Ω~(n3/2), the best known lower bound, but still exponentially far from exp(O~(n1/5)).

Restricting the class of algorithms to yield improved lower bounds, [11] shows that all “mean-based” algorithms require exp(Ω(n1/3)) samples to reconstruct strings. Here, a mean-based algorithm is any algorithm that only uses the empirical means of individual bits of the traces. Extending this line of work, [9] considers “k-mer-based” algorithms which rely on counting the occurrences of contiguous k-bit strings (where “mean-based” algorithm are the case k=1). [9] shows that, even for k up to n1/5, any k-mer-based algorithm requires exp(Ω(n1/5)) samples to reconstruct the string. All known algorithms for the worst-case trace reconstruction problem are k-mer-based, including the algorithm presented by [7]. More recently, [8] strengthened this result to show an analogous exp(Ω(n1/5))-sample lower bound for all statistical query (SQ) algorithms whose queries are “O~(n1/5)-local.” Thus if there is an algorithm using fewer than exp(n1/5) samples, it must use “non-local information.” We point out that the algorithm of the present paper, though we are in the rather different setting of sparse and circular trace reconstruction, looks at more global properties of the input (see Algorithm 1 and Theorem 26).

As the general worst-case trace reconstruction problem still has an exponential gap, much work has developed and investigated variants of the problem. [15] investigates many different variants, including matrix reconstruction and trace reconstruction for sparse strings. Most relevant to our work: in the regime where k and p are both constants, they analyze the “trivial” algorithm of directly estimating the size of each gap, giving an O(nlogn) upper bound. (Directly estimating the size of each gap does not work in our circular trace reconstruction setting, since each trace gets randomly cyclically shifted, and thus a gap in a trace could have come from any of the k gaps in the original string, and thus we cannot aggregate statistics about any individual gap.) [1] provides an algorithm using O(nlogn) samples for the trace reconstruction problem for “separated” strings, which means the number of 0’s between two 1’s is at least polylog n. The work of [12] and [21] shows that when the string is confined in a known Hamming ball or an edit distance ball of radius k, there are algorithms that can recover the string with nO(k) traces. [19] investigates the average-case trace reconstruction problem, that is, when the string x is uniformly randomly chosen instead of adversarially generated.They show that, in contrast to the worst-case, it only requires exp(O(log1/2n)) traces in the average case. [10] investigates the coded version of the trace reconstruction problem, that is, if we code the original string with a high-rate error-correction code, how many traces do we need to recover it? [4] shows that the coded trace reconstruction is nearly equivalent to the average-case trace reconstruction problem, and exp(O(log1/3(1/ϵ)) traces are enough to reconstruct the string.

The work of [17] introduces the circular trace reconstruction problem. In this problem, the traces are generated by passing through a deletion channel and then undergoing a random cyclic shift. This problem aims to recover the string or any of its cyclic shifts. They show, via a reduction, that this problem is at least as hard as the linear trace reconstruction problem and provide a exp(O(n1/3)) upper bound when n is a prime or a product of two primes in the worst-case, and a nOp(1) upper bound for the average case, along with a lower bound of Ω~(n3). This lower bound comes from showing the indistinguishability of two constant-sparse strings whose cyclic statistics cancel up to order m=2; the current paper extends this construction to m=4 and shows this is almost the best possible.

2 Definitions

We study traces of k-sparse binary strings, meaning the set of binary strings with k nonzeros; we assume k is bounded by a constant. For sequences x=(x1,,xk) and y=(y1,,yk), we say y is a cyclic shift of x if there exists an integer c such that xj=yj+c for all j[k], interpreting indices mod k. If no such c exists, then we say x and y are cyclically distinct.

Every k-sparse binary string x is a cyclic shift of some string of the form 10x110xk. For our analysis, it will often be useful to represent each string as an integer sequence x=(x1,,xk), where xj corresponds to the number of zeros in the “gap” between the jth and j+1st 1s in x. In this paper, we will generally use the integer sequence representation of x; when we require the binary string instead, we will explicitly write x=10x110xk.

We use [k] to refer to the set of integers {1,,k} and Bin(z,q) to denote a binomial random variable with z trials and success probability q. For two probability measures μ and ν over a set Ω, the Hellinger distance between μ and ν is defined as dH(μ,ν)xΩ(μ(x)ν(x))2.

Most of the technical analysis of this paper concerns understanding properties of “cyclic statistics” of sequences x1,,xk, which we define here:

Definition 4.

Let i1,,im[k], and be a divisor of k (with =1 allowed). This defines a cyclic statistic mod of x=(x1,,xk)k defined by Si1,,im;(x)j=1k/xi1+jxim+j, (where all indices are interpreted mod k). When =1, we refer to the function as merely a cyclic statistic and write Si1,,im(x).

Problem definitions.

We investigate the problem of recovering a string from its traces generated by a circular deletion channel, which is defined as follows.

Definition 5 (Circular deletion channel).

Fix p(0,1). A circular deletion channel, denoted Del(x) takes as input x{0,1}n and induces the probability distribution defined by the following process:

  1. 1.

    Delete each bit in x independently with probability p to yield a string x~.

  2. 2.

    Return a uniformly random cyclic shift of x~.

In this paper, we focus on the case where p is a universal constant. We are now ready to define the (circular) trace reconstruction problem.

Definition 6 (Trace reconstruction problem).

Let x{0,1}n. Given sample access to traces generated by the deletion channel Del(x), an algorithm 𝒜 which returns x solves the trace reconstruction problem if, with probability at least 2/3, x is a cyclic shift of x.

Definition 7 (Distinguishing problem).

Let x,y{0,1}n with x and y cyclically distinct. An algorithm 𝒜 distinguishes x from y if, given sample access to a deletion channel Del(z), 𝒜 returns x with probability at least 2/3 when z is a cyclic shift of x, and 𝒜 returns y with probability at least 2/3 when z is a cyclic shift of y.

 Remark 8.

We note that the two problems above are nearly equivalent for k-sparse strings in the sense that an O(f(n)) upper bound for the distinguishing problem implies an O(f(n)logn) algorithm for reconstruction. This is due to the following standard technique. Suppose algorithm 𝒜 distinguishes all pairs x,y𝒳 with probability at least 2/3 using f(n) samples. To reconstruct an unknown string z, we draw O(f(n)logn2k)=Ok(f(n)logn) samples from the deletion channel Del(z), and test each possible pair of cyclically distinct strings x,y against each other using 𝒜. By a Chernoff bound, 𝒜 will return a valid answer on each pair with probability 1O(1/n2k). Since there are O(nk) cyclically distinct k-sparse strings of length n, all O(n2k) tests will succeed with constant probability, in which case we return the unique string z which is chosen by 𝒜 against all other candidates.

3 Cyclic Statistics Characterize Cyclically Distinct Sequences

We begin by studying cyclic statistics of integer sequences, a key component of our upper and lower bounds. The goal of this section is to answer the question: given two cyclically distinct integer sequences x and y, what is the smallest value of m such that there exists a cyclic statistic Si1,,im satisfying Si1,,im(x)Si1,,im(y)? We show m6 always suffices.

Our proof relies on the Fourier transforms of x and y, denoted x^ and y^ respectively. We partition the Fourier coefficients x^j into equivalence classes based on gcd(j,k). We show that for each equivalence class, the Fourier coefficients in x are jointly zero or nonzero together. Notably, as it turns out, for any x and y with matching second-order cyclic statistics, y^ has identical pattern of zeros and nonzeros as x^ (see Lemma 13).

Armed with this insight, we restrict our analysis to the set of nonzero Fourier coefficients in x and y, and define a set of variables {zj=12πilogx^jy^j} for each index j with nonzero Fourier coefficients in x^ and y^. We show that equivalence of cyclic statistics (up to order 6) imposes a system of linear constraints on the variables zj, and that any set of solutions must be strictly real and there exists an integer c such that zjcjk mod 1 for all j. It follows immediately that x^j=y^jexp(2πicjk) for all j. By basic properties of Fourier transforms, this implies that x is a cyclic shift of y, proving our upper bound.

The organization of this section is as follows. In Section 3.1, we relate cyclic statistics to the Fourier transform and prove that for any x and y with identical cyclic statistics up to order 6, x^ and y^ have the same pattern of zeros and nonzeros. In Section 3.2 we set up several number theoretic results necessary to analyze the system of linear constraints. Finally, in Section 3.3, we apply these facts to the set of variables {zj} defined above to obtain the main result of this section, Lemma 19, which states that any two cyclically distinct strings x,y must differ in some cyclic statistic of order m6.

3.1 Cyclic Statistics and the Fourier Transform

Suppose two integer sequences x and y have identical cyclic statistics up to order m. We show this implies an equivalence between certain m-way products of Fourier coefficients.

Lemma 9.

Let x and y be two integer sequences of length k with identical mth order cyclic statistics. Consider their Fourier transforms x^ and y^. For any m-tuple i1,,im with i1++im0 mod k, we have x^i1x^im=y^i1y^im.

Proof.

Letting j2j2j1 mod k,jmjmj1 mod k, and interpreting indices mod k:

x^i1x^im =(j1=0k1xj1e2πikj1i1)(jm=0k1xjme2πikjmim)
=(j1=0k1xj1e2πikj1i1)(j2=0k1xj2+j1e2πik(j2+j1)i2)(jm=0k1xjm+j1e2πik(jm+j1)im)
=j2,,jm=0k1(e2πik(j2i2++jmim)j1=0k1xj1xj2+j1xjm+j1)
=j2,,jm=0k1e2πik(j2i2++jmim)S0,j2,,jm(x)

where the exponential terms vanish in the second-to-last equality because i1++im0 mod k and so e2πikj1i1e2πikj1im=1. Since S0,j2,,jm(x)=S0,j2,,jm(y) for all j2,,jm by assumption, we have that the corresponding m-way products of Fourier terms are equal.

Lemma 9 implies that for any i1,,im satisfying i1++im0 mod k, if the Fourier coefficients x^i1,,x^im are nonzero, then y^i1,,y^im must be nonzero as well. We now partition the elements of [k] into equivalence classes based on their gcd with k.

Definition 10.

Let α be a divisor of k. We define Gα{j[k]:gcd(j,k)=α}.

We show that for a single integer sequence x, the Fourier coefficients of x in each Gα are either all zero or all nonzero, allowing us to analyze them together. This can be shown using a well-known property of cyclotomic fields: for every primitive dth root of unity of form ωd=e2πi/d, the Galois group Gal((ωd)/) is isomorphic to (/d)×{j{0,,d1}:gcd(j,k)=1}, the multiplicative group of integers mod d. The isomorphism is specified by Fact 11.

Fact 11.

Let d and let ωde2πi/d be a primitive dth root of unity. Consider (ωd), the field extension of generated by ωd. For all a(/d)×, there exists an automorphism σaGal((ωd)/) such that σa(ωd)=ωda.111For more on cyclotomic fields, see https://en.wikipedia.org/wiki/Cyclotomic_field.

Lemma 12.

Let x be an integer sequence of length k with Fourier transform x^. For any α[k] and j,jGα, if x^j=0, then x^j=0.

Proof.

Suppose x^j=0. Let dkα and consider the a primitive dth root of unity ωde2πi/d. The jth Fourier coefficient of x is given by

x^j==0k1xe(2πi/k)j==0k1xe(2πi/d)(d/k)j==0k1xωd(j/α)

Let z be a sequence of length d generated by interpreting each index of x mod d and summing together all elements that map to a given location. (In other words, zj:j mod dx.) Then the Fourier transform of z at location jα is given by

z^j/α==0d1zωd(j/α)==0k1xωd(j/α)( mod d)==0k1xωd(j/α)=x^j

Similarly, z^j/α=x^j. Because z is an integer sequence, z^j/α and z^j/α are elements of the cyclotomic field (ωd/α). Choose b(/d)× such that bjαjα mod d. By Fact 11, there exists σbGal((ωd)/) such that σb(ωd)=ωdb. Therefore, we have

x^j=z^b(j/α) mod d==0d1zωdb(j/α)==0d1zσb(ωd(j/α))=σb(z^j/α)=σb(x^j)=0

as desired.

Consider two sequences x and y with identical second order cyclic statistics. As a simple consequence of Lemmas 9 and 12, x^ and y^ are jointly zero or jointly nonzero everywhere in the indices belonging to a single Gα.

Lemma 13.

Let x and y be two integer sequences of length k with identical second order cyclic statistics. Then for each α[k], either (1) x^j0 and y^j0 for all jGα or (2) x^j=y^j=0 for all jGα.

Proof.

Suppose x^j0 for some jGα. By Lemma 12, we must have x^j0 for all jGα. Note that kjGα, so x^kj0. Moreover, j+(kj)0 mod k, so by Lemma 9, we have y^jy^kj=x^jx^kj0. This requires that y^j0. Applying Lemma 12 to y, we have that y^ is nonzero for all indices in Gα, proving the first statement.

If xj=0, then similarly y^jy^kj=x^jx^kj=0, requiring that either y^j=0 or y^kj=0. By Lemma 12, x^ and y^ must be zero everywhere in Gα, proving the second statement.

3.2 Number Theoretic Prerequisites

Recall that our goal is to analyze a system of constraints involving the nonzero Fourier coefficients of x and y, where the constraints arise from the assumption that x and y have identical cyclic statistics up to order 6. These constraints will turn out to take the form zj1++zjm, for a certain set of complex valued variables {zj} and certain m tuples j1,,jm. We wish to show that any set of variables satisfying these constraints must be strictly real and form an arithmetic sequence mod 1. In Lemma 16, we show this is true; we apply this fact in Section 3.3 to show that x and y are cyclically equivalent.

Before proving this relation, we require certain number theoretic facts. First, we show that for any integer d and j/d, j can be written as a sum (mod d) of either 2 or 3 numbers relatively prime to d. Here and below, we use /d to refer to the cyclic group of integers modulo d, represented as the set {0,,d1}.

Fact 14.

The following facts are true for any prime power pa and integer j:

  • If p2, there exists a pair b1,b2[pa] that are relatively prime to p such that b1+b2j mod pa. Also, there exists a triple b1,b2,b3[pa] that are relatively prime to p such that b1+b2+b3j mod pa.

  • If p=2 and j is even, there exists a pair b1,b2[pa] that are relatively prime to p such that b1+b2j mod pa.

  • If p=2 and j is odd then there exists a triple b1,b2,b3[pa] that are relatively prime to p such that b1+b2+b3j mod pa.

Proof.

First, consider the case in which p2. At least one of j1,j+1 must be relatively prime to pa since their difference is 2, which itself is relatively prime to p. Since j(j1)+1 mod pa, j(j+1)+(pa1) mod pa, and both 1 and pa1 are relatively prime to pa, at least one of these two expressions expresses j as the sum of 2 numbers both relatively prime to pa. Similarly, at least one of j2,j+2 must be relatively prime to pa since their difference is 4, which itself is relatively prime to p. Thus since j(j2)+1+1 mod pa and j(j+2)+(pa1)+(pa1) mod pa, and since both 1 and pa1 are relatively prime to pa, at least one of these two expressions expresses j as the sum of 3 numbers both relatively prime to pa.

For the case of p=2, we note that if j is even, then j1 is relatively prime to 2a and thus we express j=(j1)+1. Otherwise, if j is odd, then j2 is relatively prime to 2a, and we express j=(j2)+1+1.

Lemma 15.

Let d. The following is true for all j/d:

  • If d is even and j is odd, then j can be written as the sum mod d of 3 numbers relatively prime to d.

  • Else j can be written as the sum mod d of 2 numbers relatively prime to d.

Proof.

Consider the prime factorization of d: d=p1a1pa. Given j/d, we use the Chinese remainder theorem to express j via its residues mod piai, for each prime factor pi. Define jij mod piai. We note that j is relatively prime to d if and only if each ji is relatively prime to its corresponding prime power piai.

We use Fact 14 (and the Chinese remainder theorem) to canonically represent each j/d as the sum mod d of either 2 or 3 numbers that are each relatively prime to d. If either d is odd or j and d are both even, we represent j as the sum of 2 numbers, by applying Fact 14 to represent each ji as jibi(1)+bi(2) mod piai with bi(1) and bi(2) relatively prime to piai. By the Chinese remainder theorem, the system, b(1)bi(1) mod piai,i[], has a unique solution b(1) relatively prime to d; similarly, there exists a unique b(2) relatively prime to d satisfying b(2)bi(2) mod piai for all i. Finally, b(1)+b(2)j mod d, which proves the first case. If j is odd and d is even, we use the analogous procedure to represent j as the sum of 3 numbers relatively prime to d, which proves the second case.

Using the representation given by Lemma 15, we analyze the system of complex-valued linear equations described at the beginning of this subsection.

Lemma 16.

Let d and zj for each j[d] relatively prime to d. Suppose that for every m6, and any m-tuple (j1,,jm) relatively prime to d satisfying j1++jm0(modd), we have zj1++zjm. Then there exists an integer c such that for all j relatively prime to d we have zjcjd.

Proof.

For all j/d, let [[j]] denote the canonical representation of j given by Lemma 15 as the sum mod d of either 2 or 3 numbers relatively prime to d (so [[j]] is a set of size 2 or 3). Let z[[j]] denote the sum of the 2 or 3 corresponding variables. For convenience, we write the set d[[j]]{dj:j[[j]]}. Then the conditions of the lemma imply the following set of conditions:

j relatively prime to d:zj+zdj,
j relatively prime to d:z[[j]]+zdj,
j/d:z[[j]]+z1+zd[[j+1]]

To see the implication, it suffices to show that the left hand side of each constraint is a sum of at most 6 real variables whose indices are relatively prime to d. For all j with gcd(j,d)=1, we have gcd(dj,d)=1, so the first constraint satisfies the condition. For all j, z[[j]] can be written as a sum of 2 or 3 variables whose indices are relatively prime to d, so the second constraint requires at most 4 variables. Finally, if d is odd, the representation [[j]] comprises 2 variables for all j, in which case the third constraint requires 5 variables. If d is even, the values j and j+1 comprise 1 even number and 1 odd number, and d[[j+1]] has the same parity as j+1, resulting in at most 6 terms.

Note that these three sets of conditions collectively imply that z[[j]]+z1z[[j+1]] for all j/d. Therefore, any set of solutions z[[j]] must form an arithmetic sequence of increment z1 plus some integer, i.e., z[[j]]jz1 for all j. Therefore, dz1, which requires that z1 is a multiple of 1d (and is therefore real). Writing z1=cd for some integer c, we have z[[j]]cjd mod 1 for all j/d. Finally, the first and second conditions imply that zj=z[[j]] for all j relatively prime to d, which proves the lemma.

The following technical lemma is used in the proof of the main result of this section, Lemma 19. Intuitively, in the proof of Lemma 19 we aim to show that variables z0,,zk1 form an arithmetic sequence. Instead we show that, for each value α, those zj with index j satisfying gcd(j,k)=α form an arithmetic sequence of increment cα. We use the following lemma to show that this implies there is a consistent arithmetic sequence across all j, with some increment c.

Definition 17.

For a positive integer a and prime number p, define the p-adic valuation of a to be νp(a)max{0:pa}.

Lemma 18.

Let k and A[k]. Let cα for each αA, and suppose the following are true for all pairs α,αA:

  1. 1.

    lcm(α,α)A.

  2. 2.

    (cαcα)lcm(α,α)0 mod k

Then there exists c such that (ccα)α0 mod k for all αA.

Proof.

Consider the prime factorization of k, k=p1a1pa. For each i[], we will show there exists ci satisfying the desired property mod piai, and conclude the lemma by the Chinese remainder theorem.

Fix i[]. Define αi to be the element of A with the lowest pi-adic valuation: αi=argminαAνpi(α). Let ci be an integer satisfying (cicαi)αi0 mod piai, and consider any αA. By a well-known property of p-adic valuations, we have νpi(lcm(α,αi))=max{νpi(α),νpi(αi)}=νpi(α). The prime power piνpi(α) divides α, so it also divides lcm(α,αi). Therefore, we have the following two facts:

(cαicα)pνpi(α) 0 mod piai (by assumption (2) of the lemma)
(cicαi)piνpi(αi) 0 mod piai (by definition of ci)

Using the fact that νpi(α)νpi(αi) to combine the two equations, we have (cicα)piνpi(α)0 mod piai. Finally, we multiply both sides by the integer αpiνpi(α) to yield (cicα)α0 mod piai. Choosing ci=cαi for all i, the equation above holds for all pairs i,α. We now wish to find c such that cci mod piai for all i. Such an integer c is guaranteed to exist by the Chinese remainder theorem, which completes the proof.

3.3 Main Characterization

We now prove the main result of this section.

Lemma 19.

Let x and y be two integer sequences of length k with identical cyclic statistics up to order 6. Then x and y must be identical up to a cyclic shift.

Proof.

It suffices to show that there exists an integer c such that x^j=y^jexp(2πicjk) for all j[k]. By Lemma 13, this property holds for all j where x^j=0. Therefore, we consider only the divisors α such that the Fourier coefficients are nonzero everywhere in Gα for both x and y. For all j with nonzero Fourier coefficients x^j and y^j, define zj12πilog(x^jy^j), where log is the unique complex-valued function satisfying elogz=z and Imlogz(0,2π] for all z\{0}. To prove the lemma, we show that there exists an integer c such that for all j with x^j0 and y^j0, Imzj=0 and zjcjk mod 1.

Fix α[k] and let dkα. Note that for all jGα, gcd(jα,d)=1. Therefore, each jGα corresponds to a unique number relatively prime to d. Consider any m-tuple j1,,jmGα such that j1α++jmα0 mod d. By Lemma 9, we have zj1++zjm. Applying Lemma 16 (and using the fact that k=dα), there exists an integer cα such that for all jGα, we have zjcαjk. In other words, zj is real and zjcαjk mod 1.

We have shown that for all Gα with nonzero Fourier coefficients, we can write kzjcαj mod k. We will now show that this is true for a single consistent integer c=cα across all α. Consider αα where x^ and y^ are nonzero everywhere in GαGα. Let cα,cα respectively be the corresponding coefficients, and =lcm(α,α). Since is a multiple of α, Lemma 15 implies that we can write /α as a sum mod k/α of 2 or 3 numbers relatively prime to k/α. Multiplying by α, we can write as a sum mod k of 2 or 3 numbers whose gcd with k is α; denote this representation as [[]]. Since is also a multiple of α, it must also have a representation [[]] as a sum mod k of 2 or 3 numbers whose gcd with k is α.

We write z[[]] and z[[]] to denote the sum of the variables corresponding to the indices in [[]] and [[]] respectively, and use the convention that k[[]]={kj:j[[]]}. Since jα+kjα0 mod kα for all jGα, we have z+zk[[]]0 mod 1 for all . From the previous paragraph, we have that the sum of the elements in [[]] equals the sum of the elements in [[]] mod k. We thus conclude the condition that has a consistent coefficient in both representations, as z[[]]=z[[]], and reexpress this as a sum of 4 to 6 coefficients, moving everything to the left-hand side, as z[[]]+zk[[]]0 mod 1. Recall that [[]] only contains indices j with gcd(j,k)=α, and the corresponding variables zj satisfy kzjcαj mod k. Similarly, [[]] only contains indices j with gcd(j,k)=α, and each zj satisfies kzjcαj mod k. Therefore, the constraint (cαcα)0 mod k holds for any pair of α,αA[k], where A={α:x^,y^ are nonzero everywhere in Gα}.

Applying Lemma 18, there exists c, such that (ccα)α0 mod k for all αA. Therefore, for every αA, for every jGα, cαjαcααjα mod k. Notice that by definition, jα is an integer and thus, for every αA, for every jGα, cjcαj mod k. Therefore, for every j where x^j, y^j are nonzero, we have kzjcαjcj mod k. Dividing by k, we have zjcjk mod 1 for every j where x^j, y^j are nonzero, as desired.

4 Upper Bound

We now give an algorithm for distinguishing any two cyclically distinct strings with a constant number of 1s. Given Lemma 19, a naïve approach for testing whether a set of traces is generated from x or y is to identify a cyclic statistic in which x and y differ, and then estimate this statistic from traces. For a string with k 1s, the deletion channel will preserve all 1s with constant probability; a resulting trace can be represented as a sequence of k “gaps,” for which we can compute cyclic statistics. In particular, for a string x=(x1,,xk) and retention probability q:=1p, a trace can be represented as a sequence of binomial random variables x~=(x~1,,x~k), where x~jBin(xj,q).

Unfortunately, computing a cyclic statistic of x~ requires multiplying up to 6 of these binomial random variables. In the worst case, we might have xj=Ω(n) for all j, causing our estimator to have variance roughly n11. Since cyclic statistics are integers, we may need to estimate the desired cyclic statistic to within constant error, and thus the naïve algorithm would require Ω(n11) samples.

We give a more efficient algorithm using the following insight: while the product of 6 independent binomials with Ω(n) trials each has variance roughly Ω(n11), the variance is considerably lower if each binomial is shifted to have mean O~(n). To accomplish this, we preprocess x and y by greedily grouping the values qx1,,qxk,qy1,,qyk that are within O~(n) of each other into a cluster. When processing a trace z~ from either x or y, we compute the cluster center closest to each element z~j, and subtract that center from z~j. Because the gap sizes z~j are highly likely to be within O~(n) of their expectations, we can find the “correct” cluster with high probability. The resulting estimator has variance O~(n6).

Of course, subtracting an arbitrary sequence s from the observed trace z~ raises another concern: the sequences qxs and qys might now be cyclically equivalent, erasing any differences in their cyclic statistics. We circumvent this problem with a slightly different estimator which only considers cyclic statistics mod where is chosen so that s is preserved by a cyclic shift of .

The remainder of this section is organized as follows. In Section 4.1, we give an algorithm for generating a well separated partition of the expectations qx1,,qxk,qy1,,qyk. In Section 4.2, we show that cyclic statistics mod are preserved by subtraction of a sequence s invariant to cyclic shifts of . Finally, in Section 4.3, we present our cyclic statistics-based algorithm and prove its correctness.

4.1 Determining Centers

As a preprocessing step, our algorithm partitions the means of 2k binomial random variables (the gaps in traces from x and y) into well separated clusters. For our purposes, clusters are considered to be well separated if they are Ω(nlogn) far apart; we give a formal definition below.

Definition 20.

Let x1,,xk and C>0. Then a function c:[k] is a C-separated partition of x1,,xk if it satisfies the following:

  1. 1.

    For any j,j[k], if c(xj)c(xj), then |qxjqxj|>2Cnlogn.

  2. 2.

    For any j,j[k], if c(xj)=c(xj), then |qxjqxj|2Cknlogn.

  3. 3.

    For all x, c(x)=c(argminxj:j[k]|xjx|).

We show that a C-separated partition exists for any C>0 and x1,,xk. The proof is constructive: we give a greedy algorithm for producing a C-separated partition of a fixed set of points x1,,xk. The algorithm starts by assigning each point to its own cluster; it then iteratively merges clusters which violate the separation condition. When no violations remain, we prove that the algorithm has obtained a C-separated partition.

Lemma 21.

Let x1,,xk[n], and C>0. Then there exists a C-separated partition of x1,,xk.

4.2 Uniqueness of Cyclic Statistics

As discussed at the beginning of this section, our goal is to reduce the variance of our estimator by shifting each binomial random variable x~1,,x~k by a value which is within a constant of its mean. In particular, let c be a C-separated partition of qx1,,qxk,qy1,,qyk, and s=(c(qx1),,c(qxk)). The binomial random variables concentrate tightly around their expectation, allowing us to approximately recover the partition of x and y generated by the partitioning algorithm with high probability. The logical approach is therefore to estimate cyclic statistics of xs.

Unfortunately, this introduces a second problem. If x and y are distinct after applying the partition c elementwise, then they can be distinguished simply by recovering the partition. But if x and y appear similar under the partition – that is, sx=(c(qx1),,c(qxk)) is a cyclic shift of sy=(c(qy1),,c(qyk)) – then xsx might be a cyclic shift of ysy, resulting in identical cyclic statistics and thus a failure to distinguish x from y.

Example 22.

Consider the case where k=6, and where the sequence of shifts/clusters sx=sy follows the pattern s,s,s,s,s,s. Further, let x=(sq+0,sq+1,sq+2,sq+0,sq+1,sq+2) and y=(sq+1,sq+2,sq+0,sq+1,sq+2,sq+0). In this case, when we sample from Del(x) or Del(y), once we subtract sx=sy, we get two 6-tuples that are cyclically identical; however, algorithmically, we would hope to be able to distinguish traces from x vs y, since the alignment between the sequence of shifts sx=sy vs the sequence of offsets (0,1,2,0,1,2) differs for x vs y. The right thing to do here, after subtracting sx=sy, is to look at cyclic statistics mod 3.

This motivates our general strategy of looking at cyclic statistics mod , where is the smallest period of repetition of the shift sequence.

Definition 23.

A sequence s=(s1,,sk)k is symmetric mod if for all j[k], sj=s(j+) mod k.

Definition 24.

Let x=(x1,,xk)k, i1,,im[k], and be a divisor of k. For a sequence s=(s1,,sk)k, the function Si1,,im;(xs)j=1k/(xi1+jsi1+j)(xim+jsim+j) is an m-th order cyclic statistic mod , shifted by s.

We show that for any period and set of shifts s that repeats mod , there is a 6-th order statistic that differs on x,y, when shifted by s and taken mod . We prove this by starting with Lemma 19 that guarantees the existence of a statistic of order 6 that distinguishes x from y; we show this implies existence of a distinguishing statistic of the same order mod . Taking the distinguishing statistic of smallest order lets us subtract off any sequence s that is symmetric mod , without modifying the discrepancy between x and y. This yields:

Lemma 25.

Let x and y be two cyclically distinct integer sequences of length k. Let divide k, and let s=(s1,,sk)k be a sequence that is symmetric mod . Then there exists i1,,im[k] with m6 such that

|Si1,,im;(xs)Si1,,im;(ys)|=|Si1,,im;(x)Si1,,im;(y)|1.

4.3 Our Algorithm

We are now ready to describe our proposed tester, Algorithm 1. Given two candidates x and y, a trace z~ from the deletion channel can be described as a sequence of binomial random variables z~1,,z~k, each of which has an expectation in the set {qx1,,qxk,qy1,,qyk}. The algorithm begins by preprocessing the strings x and y to create a C-separated partition c of the expectations (Line 2). We then consider the sequences sx and sy created by applying c elementwise to x and y. We design two tests for two separate cases:

  • If sx is a cyclic shift of sy, then traces from x and y appear relatively similar. In this case, we invoke a subroutine (Algorithm 2), which determines a shifted cyclic statistic in which x and y differ, then estimates this statistic from traces. Owing to the shift, the resulting estimate has low variance, allowing us to estimate the cyclic statistic from O~(n6) traces.

  • If sx and sy are cyclically distinct, then the algorithm can reliably distinguish x and y by examining only a single trace whose 1s are intact. Therefore, we draw a constant number of traces, choose an arbitrary trace z~ with k 1s (Line 7), and cluster the gaps according to c. For C large enough, the resulting sequence will with high probability correspond exactly to either sx or sy.

Combining the results from each case yields our overall upper bound. We now state and prove the main result of Section 4. Our proof relies on Lemmas 27 and 28 in Section 4.3.2, which prove the correctness of Algorithm 2.

4.3.1 Proof of Upper Bound

Algorithm 1 Test-Cyclic-Traces.
Theorem 26.

Let x and y be two cyclically distinct k-sparse binary strings of length n. Then Algorithm 1 distinguishes x from y with probability 23 for sufficiently large n, using O(n6log12n) traces from a cyclic deletion channel.

Proof.

We analyze Algorithm 1 to show the theorem. We consider two cases:

Case 1:

sx is a cyclic shift of sy. In this case, we invoke Algorithm 2, which draws O(n6log12n) returns the correct answer with probability at least 2/3, by Lemma 28.

Case 2:

sx is not a cyclic shift of sy. Then we draw T=log(1/4)/log(1qk)=O(1) traces. With probability 3/4, at least one trace z~ will contain exactly k 1s. Suppose z~Del(x). Due to Lemma 27, with probability at least 1O(n10), we have c(z~j)=c(qxj) for all j[k]. Therefore, c(z~)=sx, and the algorithm returns x with probability at least 3/4O(n10)2/3 for large enough n.

In both cases, the number of traces T is bounded by O(n6log12n).

4.3.2 Distinguishing Similar Strings

We now handle the case in which sx=sy, i.e., x and y are cyclically equivalent under the C-separated partition c constructed in Algorithm 1. We build an estimator based on shifted cyclic statistics and show that O~(n6) traces suffice to distinguish x from y. In particular, let be the minimum value such that sx and sy are symmetric mod . For each j[k], let Pj{xj:c(qxj)=j}{yj:c(qyj)=j}. Define the function g:[k] to be the average of the points in cluster j; that is, g(j)=1|Pj|zPjqz. The algorithm chooses the smallest value of m for which there is an order m cyclic statistic Si1,,im; such that Si1,,im;(x)Si1,,im;(y). We call a trace z~ useful if it satisfies the following conditions:

  1. 1.

    There are exactly k 1s in z~.

  2. 2.

    For all j[k], |z~jg(c(z~j))|(4k+1)Cnlogn.

Intuitively, the second condition says that the gaps in z~ can be matched up with cluster centers as we expect. Suppose there are T useful traces. Our estimator will apply the statistic Si1,,im;(x) to all the useful traces after subtracting out their cluster centers:

fi1,,im;(z~)=1qmj=1k/(z~i1+jg(c(z~i1+j)))(z~im+jg(c(z~im+j))) (2)

We define f^=1Tz~:useful(z~)fi1,,im;(z~) as the average of the above equation over useful traces. We show that, for any set of Θ~(n6) traces from a string z, |f^Si1,,im;(z)|1/3 with probability at least 2/3, which implies an efficient tester.

The main result of this subsection requires the following lemma, which roughly states that we can recover clusters from a random trace with high probability.

Lemma 27.

Let z1,,z2k, q(0,1), and z~jBin(zj,q) for all j[k]. Then there exists an absolute constant C>0 such that, for any C-separated partition c of qz1,,qz2k, |z~jqzj|Cnlogn and c(z~j)=c(qzj) for all j[k] with probability at least 1Ok(n10).

Proof.

Fix j[k]. Let c be a C-separated partition of qz1,,qz2k. By a Chernoff bound, there exists a constant C such that |z~jqzj|Cnlogn with probability at least 1n10. When this occurs, the following holds for all j[k] with c(qzj)c(qzj):

|z~jqzj||qzjqzj||z~jqzj|>2CnlognCnlogn=Cnlogn|z~jqzj|

In this case, clearly c(z~j)=c(qzj). The contrapositive of this is that, if c(z~j)c(qzj), we must have |z~jqzj|>Cnlogn. Taking a union bound over the k random variables, the probability that this occurs for some j[k] is at most kn10, as desired.

Algorithm 2 Test-Similar-Traces.

We are now ready to prove the correctness of Algorithm 2.

Lemma 28.

Let x=(x1,,xk) and y=(y1,,yk) be two cyclically distinct binary strings of length n, and q(0,1). Let c be a C-separated partition of qx1,,qxk,qy1,,qyk, and suppose c(qx1),,c(qxk) is a cyclic shift of c(qy1),,c(qyk). Then for sufficiently large C and n, there exists an algorithm (Algorithm 2) which distinguishes x from y with probability at least 2/3 using T=O(n6log12n) traces from a cyclic deletion channel.

Proof.

Fix C to be the constant determined by Lemma 27. Let 𝒵 denote the set of useful traces (defined at the top of Section 4.3.2). The first usefulness condition holds with probability exactly qk for each trace. By Lemma 27, we have |z~jqzj|Cnlogn and c(z~j)=c(qzj) with probability at least 1O(n10). In this case, we have |z~jg(c(z~j))||z~jqzj|+|qzjg(c(z~j))|Cnlogn+4Cknlogn, which is precisely the second condition. Therefore, for large enough n, each trace falls in 𝒵 with constant probability. By a Chernoff bound, |𝒵|=Ω(T) with probability 1o(1). For each z~𝒵, the algorithm computes fi1,,im;(z~), where fi1,,im; is defined in Equation 2. Let A be the event that z~𝒵 and c(z~j)=c(qzj) for all j[k]. Conditioned on A, it is possible to compute f(z~) precisely, since we can choose a consistent cyclic shift of s and match z~ to that shift up to symmetry. We therefore have

𝐄[f(z~)A] =𝐄[1qmj=1k/(z~i1+jg(c(qzi1+j)))(z~im+jg(c(qzim+j)))]±O(n4)
=1qmj=1k/𝐄[z~i1+jg(c(qzi1+j))]𝐄[z~im+jg(c(qzim+j))]±O(n4) (z~js are independent)
=1qmj=1k/(qzi1+jg(si1+j))(qzim+jg(sim+j))±O(n4)
=Si1,,im;(z1qg(s))±O(n4)

where the ±O(n4) term in the first line comes from the O(n10) probability that there are k 1s in z~ but event A fails; this is multiplied by the universal O(n6) bound on |f(z~)| for m6.

By Lemma 27, A holds with probability at least 1O(n10) conditioned on z~𝒵. Additionally, the second condition implies that for all z~𝒵, we have |f(z~)|kqm(4k+1)6C6n3log6n. Combining these facts and applying the law of total probability yields

|𝐄[f(z~)z~𝒵]Si1,,im;(z1qg(s))|O(n3log6nn10+n4)=o(1)

Combining the bound on |f(z~)| with Hoeffding’s inequality, we have

𝐏𝐫[|f^𝐄[f(z~z𝒵)]|14] 2exp(2|𝒵|216|𝒵|(kqm(4k+1)6C6n3log6n)2)
=2exp(|𝒵|16k2q2m(4k+1)12C12n6log12n)

By our choice of T, we have |𝒵|=Ω(n6log12n) with probability 1o(1). Therefore, for an appropriate choice of constants, the above equation is bounded by 0.1+o(1). By the triangle inequality, the following holds for large enough n with probability at least 0.8: |f^Si1,,im;(z1qg(s))|14+o(1)13. By Lemma 25, there must exist a statistic Si1,,im; such that |Si1,,im;(x1qg(s))Si1,,im;(y1qg(s))|1 (namely, the statistic chosen on Line 2). Therefore, if z=x, the algorithm returns x with probability at least 0.9 (Line 5). The same is true for y, which completes the proof.

5 Lower Bounds

Our goal in this section is to upper-bound the distance between the distributions Del(x) and Del(y) for two strings x,y with that are permutations of each other and have identical low-order cyclic statistics, which will yield a lower bound for the number of traces needed to distinguish these strings. We first analyze the probability of observing a given trace a from Del(x), viewed as a polynomial in n. We show that the higher-order coefficients of this polynomial depend on the lower-order cyclic statistics of x. Using this fact, we bound the distance between probabilities from Del(x) versus Del(y). We conclude by providing two strings x,y with identical cyclic statistics up to order 4, proving our Ω~(n5) lower bound.

Lemma 29.

Let x=10n+x110n+x210n+xk with x an integer upper bound on x1,,xk, and let a be any cyclic shift of 10a110a210ak. Then for a circular deletion channel Del(x) with deletion probability p, letting n=pn, and letting bj=ajn(1p) for j[k], we have, for Sym(x;n,p,b,x) some symmetric function of x1,,xk, that depends on n,p,b,x, that

𝐏𝐫x~Del(x)[x~=a]=Sym(x;n,p,b,x)i=1kj=1kh=xj+1x(nbj+i+h), (3)

where all indices are taken modulo k.

Proof.

Since all cyclic shifts of a trace have equal probability under the deletion channel, we assume without loss of generality that a=10a110a210ak. By definition of a circular deletion channel, and letting |a|,|x| denote the length of the underlying binary strings a,x respectively, we have

𝐏𝐫x~Del(x)[x~=a]=1|a|(i=1kj=1k(n+xjaj+i))p|x||a|q|a| (4)

where indices are taken modulo k. We point out that the terms outside the parentheses, 1|a|p|x||a|q|a| are a symmetric function of x (depending only on |x|), and thus can be absorbed into the initial Sym(x;n,p,b,x) term in the lemma statement. We slightly refactor each binomial term as

(n+xjai+j)=n!ai+j!(n+xai+j)!(h=1xj(n+h))(h=xj+1x(nai+j+h))

We thus need only consider the sum over i of the product over j of the above expression. The first term n!ai+j!(n+xai+j)! does not depend on x so is by definition symmetric in x and can be absorbed into the initial Sym(x;n,p,b,x). The next term, h=1xj(n+h), when we take its product over all j, becomes a symmetric function of x1,,xk and can thus also be absorbed into Sym(x;n,p,b,x). The final term becomes the last term in Equation 3, after noting that nai+j=nbi+j, by definition of n,b.

The only non-symmetric portion of Equation 3 is the expression inside the sum, which is

i=1kj=1kh=xj+1x(nbj+i+h). (5)

Thus, when we compare traces from Del(x) versus Del(y) for sequences x,y that are permutations of each other, it is this last term that will distinguish between them. We now view this term as a polynomial in n, and show that the high-order coefficients of this polynomial can be expressed in terms of low-order cyclic statistics of x. Thus for two sequences x,y with identical low-order cyclic statistics, we expect the distributions of Del(x) versus Del(y) to have high-order terms in n that exactly cancel, leading to our main lower bound.

Lemma 30.

Expression 5 when viewed as a polynomial in n of degree L=j=1kxxj, for any m0 has nLm coefficient that is a linear combination of degree m cyclic statistics of x, each multiplied by some symmetric function of x (possibly depending on n,p,b,x).

Proof.

Define S to be the set of pairs (j,h) such that j[k] and h{xj+1,,x}, where |S|=L as defined in the lemma. Thus the main expression of the lemma equals i=1k(j,h)S(nbj+i+h). In particular, the coefficient of nLm in this expression can be found by multiplying the non-n parts of all combinations of m distinct terms:

i=1kdistinct (j1,h1),,(jm,hm)S(h1bj1+i)(hmbjm+i) (6)

For each fixed i, this expression is a symmetric polynomial of the terms (hbi+j) for (j,h)S, and we thus use Newton’s symmetric polynomial identities to conclude that we may reexpress this expression for fixed i as a linear combination of products of “power sums” of these terms; the power sum of degree is defined as (j,h)S(hbj+i). Namely, Equation 6 can be expressed as some linear combination of the following expressions, where 1,s are some positive integers that sum to m:

i=1kr=1s(j,h)S(hbi+j)r (7)

We break down the inner sum of this expression using the definition of S, and the binomial expansion (across different powers t), as

(j,h)S(hbi+j)r=j=1kt=0r(rt)(bi+j)rth=xjxht.

For any exponent t, we consider the final sum as a degree t+1 polynomial in xj, with notation Pt+1(xj):=h=xjxht. Thus Equation 7 equals

i=1kr=1s(jr=1ktr=0r(rtr)(bi+jr)rtrPtr+1(xjr)).

We pull the sums over tr outside:

t1=01ts=0si=1kr=1s(jr=1k(rtr)(bi+jr)rtrPtr+1(xjr))

We split the product by whether, for each r, we have tr=r, since when this is true the expression simplifies significantly (note that we also move the sum over i to the right, past the first parenthetical, which does not depend on i):

t1=01ts=0s(r[s]:tr=rjr=1kPr+1(xjr))(i=1kr[s]:tr<rjr=1k(rtr)(bi+jr)rtrPtr+1(xjr))

The first parenthetical is clearly a symmetric function of x1,,xk. To analyze the second expression, we apply a variable substitution in the inner sum, replacing jr with jri. For any fixed tuple t1,,ts, the last parenthetical equals

i=1kr[s]:tr<rjr=1k(rtr)(bjr)rtrPtr+1(xjri).

We view the product as a polynomial in the variables x1i,,xki, which we denote as Q(x1i,,xki); the degree of this polynomial is thus bounded by r:tr<r(tr+1), which is at most rr=m. Crucially, the sum over i now equals i=1kQ(x1i,,xki), which, since we sum a degree m polynomial over all cyclic shifts of x1,,xk, is thus clearly a linear combination of cyclic statistics of degree m.

In conclusion, Equation 7 is thus a sum of cyclic statistics of x each time some symmetric function of x. Thus, the coefficient of nLm in Equation 5, being a linear combination of expressions of the form of Equation 7, is also a sum of cyclic statistics of x each time some symmetric function of x, as desired.

We now directly compare the probabilities of getting a certain trace a from Del(x) versus Del(y).

Lemma 31.

Let x=10n+x110n+x210n+xk and y=10n+y110n+y210n+yk. Suppose (x1,,xk) is a permutation of (y1,,yk) and that the two sequences have matching cyclic statistics up to some order z1. Let a=10a110a210ak. If ajn(1p)[Cnlogn,Cnlogn] for all j[k] then for sufficiently large n,c depending on C, k, p, and x=max{x1,,xk} we have

𝐏𝐫x~Del(x)[x~=a]𝐏𝐫y~Del(y)[y~=a][1(clogn/n)z+12,1+(clogn/n)z+12].
Proof.

As above, define bj=ajn(1p) and define n=pn. Since x upper bounds x1,,xk, we have that x upper bounds y1,,yk too, since they are a permutation of x1,,xk. By Lemma 29, we have

𝐏𝐫x~Del(x)[x~=a]=Sym(x;n,p,b,x)i=1kj=1kh=xj+1x(nbj+i+h)

and the corresponding expression for y. Let D1i=1kj=1kh=xj+1x(nbj+i+h) and D2i=1kj=1kh=yj+1x(nbj+i+h). Because (y1,,yk) is a permutation of (x1,,xk), we have

𝐏𝐫x~Del(x)[x~=a]𝐏𝐫y~Del(y)[y~=a]=i=1kj=1kh=xj+1x(nbj+i+h)i=1kj=1kh=yj+1x(nbj+i+h)=D1D2=1+D1D2D2

It remains only to show that D1D2D2[(clogn/n)z+12,(clogn/n)z+12].

Let Lkxj=1kxj. Then D1,D2 are both degree L polynomials in n with leading coefficient k. For m1 we now bound the contribution of the terms of degree Lm.

By assumption, each bj[Cnlogn,Cnlogn]; thus for C=C+x and n2 we trivially have that |bj+h|Cnlogn, for h{1,,x}. Thus by definition of D1,D2, the contribution of the nLm term to either D1 or D2 has magnitude at most k(Lm)(Cnlogn)mnLm. We bound (Lm)Lm, and also bound Lkx. We use these bounds to bound the contribution to D1 or D2 of all terms of degree Lm in n: this is bounded by the geometric series kim(kxCnlogn)inLi. We choose n large enough so that each term of this series is at most 13 of the previous term, namely, we choose n so that nlogn3kxCp. Because the ratio of terms is 13, the i=0 term of the series is at least twice as large as the sum of magnitudes the remaining terms combined, so that, since the i=0 term of D2 is k(n)L, we conclude that D212knL. Second, the sum of the series starting with the i=m term has magnitude at most 32 times the i=m term, namely,

kim(kxCnlogn)inLi32k(kxCnlogn)mnLm=32knL((kxC)2lognp2n)m2

We now invoke Lemma 30, which says that the nLm coefficients of D1,D2 respectively are linear combinations of degree m cyclic statistics of x,y respectively, times symmetric functions of x,y respectively. The assumption that x,y are permutations of each other means that all symmetric functions of x equal the corresponding symmetric functions of y; the assumption that x,y have identical cyclic statistics up to order z further implies that the nLm coefficients of D1,D2 must be identical for mz. Thus we bound |D1D2| by plugging in the bounds of the previous paragraph starting at the first nonzero degree, m=z+1: |D1D2|232knL((kxC)2lognp2n)z+12. Combining with our lower bounds from above that D212knL, and setting c=6(kxC)2p2, we conclude D1D2D2[(clogn/n)z+12,(clogn/n)z+12] as desired. We use the following bound relating the Hellinger distance between μ,ν to the TV distance between t samples from μ or ν.

Lemma 32 (Lemma A.5 in [13]).

Let μ and ν be probability measures with dH(μ,ν)1/2. Then 1DTV(μt,νt)ε if tlog(1/ε)9dH(μ,ν).

Lemma 33.

Let x=10n+x110n+x210n+xk and y=10n+y110n+y210n+yk be two strings where (x1,xk) and (y1,,yk) are permutations of each other and have identical cyclic statistics up to order z. Then any algorithm which distinguishes Del(x) from Del(y) with probability 23 requires Ω(nz+1/logz+1n) samples.

Proof.

Let μ0 represent Del(x) conditioned on deleting only 0s; with ν0 defined correspondingly for Del(y). The distributions μ0,ν0 are only supported on traces that are cyclic shifts of a=10a110a210ak, for some gaps a1,,ak.

As above, define bi=ain(1p). We point out that, by a standard Chernoff bound, we have bi[Cnlogn,Cnlogn] with probability at least 1n(z+100) for some universal constant C (assuming z is a constant).

Combining Lemma 31 with this Chernoff bound yields the following bound on the Hellinger distance between μ0 and ν0:

dH2(μ0,ν0)=a1,,ak(μ0(10a110ak)ν0(10a110ak))2
=1(1p)ka1,,ak(PraDel(x)[a=a]PraDel(y)[a=a])2
2kn(z+100)(1p)k+1(1p)ka1,,ak[n(1p)Cnlogn,n(1p)+Cnlogn]PraDel(y)[a=a](1PraDel(x)[a=a]PraDel(y)[a=a])2
=O(logz+1nnz+1)

We can generate a sample from Del(x) by first sampling from μ0 and then passing the sample through a second channel which deletes only 1s (and similarly for Del(y) and ν0). By the data processing inequality, the second channel cannot increase the Hellinger distance between μ0 and ν0, so we have dH(Del(x),Del(y))dH(μ0,ν0)=O(logz+1nnz+1). Applying Lemma 32, we conclude that one requires Ω(nz+1logz+1n) traces to distinguish x from y with constant success probability.

Finally, we provide two strings that have identical cyclic statistics up to 4-th order, and whose gaps are permutations of each other. They are the sequence x(j)=(0,2,3,2,1,1,1,1,2,3,2,0) and the sequence y(j)=3x(j). Thus, distinguishing these two strings requires Ω(n5log5n) traces.

References

  • [1] Anders Aamand, Allen Liu, and Shyam Narayanan. Near-Optimal Trace Reconstruction for Mildly Separated Strings. In Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis, editors, 52nd International Colloquium on Automata, Languages, and Programming (ICALP 2025), volume 334 of Leibniz International Proceedings in Informatics (LIPIcs), pages 3:1–3:20, Dagstuhl, Germany, 2025. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.ICALP.2025.3.
  • [2] Frank Ban, Xi Chen, Adam Freilich, Rocco A Servedio, and Sandip Sinha. Beyond trace reconstruction: Population recovery from the deletion channel. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 745–768. IEEE, 2019. doi:10.1109/FOCS.2019.00050.
  • [3] Tugkan Batu, Sampath Kannan, Sanjeev Khanna, and Andrew McGregor. Reconstructing strings from random traces. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’04, pages 910–918, USA, 2004. Society for Industrial and Applied Mathematics. URL: http://dl.acm.org/citation.cfm?id=982792.982929.
  • [4] Joshua Brakensiek, Ray Li, and Bruce Spang. Coded trace reconstruction in a constant number of traces. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 482–493. IEEE, 2020. doi:10.1109/FOCS46700.2020.00052.
  • [5] Arnav Burudgunte, Paul Valiant, and Hongao Wang. New bounds for circular trace reconstruction, 2025. arXiv:2512.02412.
  • [6] Zachary Chase. New lower bounds for trace reconstruction. In Annales de l’Institut Henri Poincaré-Probabilités et Statistiques, volume 57, pages 627–643, 2021.
  • [7] Zachary Chase. Separating words and trace reconstruction. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 21–31, 2021. doi:10.1145/3406325.3451118.
  • [8] Xi Chen, Anindya De, Chin Ho Lee, and Rocco A. Servedio. Trace Reconstruction from Local Statistical Queries. In Amit Kumar and Noga Ron-Zewi, editors, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2024), volume 317 of Leibniz International Proceedings in Informatics (LIPIcs), pages 52:1–52:24, Dagstuhl, Germany, 2024. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.APPROX/RANDOM.2024.52.
  • [9] Kuan Cheng, Elena Grigorescu, Xin Li, Madhu Sudan, and Minshen Zhu. On k-mer-based and maximum likelihood estimation algorithms for trace reconstruction. In 2024 IEEE International Symposium on Information Theory (ISIT), pages 879–884, 2024. doi:10.1109/ISIT57864.2024.10619392.
  • [10] Mahdi Cheraghchi, Ryan Gabrys, Olgica Milenkovic, and Joao Ribeiro. Coded trace reconstruction. IEEE Transactions on Information Theory, 66(10):6084–6103, 2020. doi:10.1109/TIT.2020.2996377.
  • [11] Anindya De, Ryan O’Donnell, and Rocco A Servedio. Optimal mean-based algorithms for trace reconstruction. The Annals of Applied Probability, 29(2):851–874, 2019.
  • [12] Elena Grigorescu, Madhu Sudan, and Minshen Zhu. Limitations of Mean-Based Algorithms for Trace Reconstruction at Small Edit Distance. IEEE Transactions on Information Theory, 68(10):6790–6801, 2022. doi:10.1109/TIT.2022.3168624.
  • [13] Nina Holden and Russell Lyons. Lower bounds for trace reconstruction. The Annals of Applied Probability, 30(2):503–525, April 2020. doi:10.1214/19-AAP1506.
  • [14] Thomas Holenstein, Michael Mitzenmacher, Rina Panigrahy, and Udi Wieder. Trace reconstruction with constant deletion probability and related results. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’08, pages 389–398, San Francisco, California and USA, 2008. Society for Industrial and Applied Mathematics. URL: http://dl.acm.org/citation.cfm?id=1347082.1347125.
  • [15] Akshay Krishnamurthy, Arya Mazumdar, Andrew McGregor, and Soumyabrata Pal. Trace reconstruction: Generalized and parameterized. IEEE Transactions on Information Theory, 67(6):3233–3250, 2021. doi:10.1109/TIT.2021.3066010.
  • [16] Vladimir I Levenshtein. Efficient reconstruction of sequences. IEEE Transactions on Information Theory, 47(1):2–22, 2002. doi:10.1109/18.904499.
  • [17] Shyam Narayanan and Michael Ren. Circular trace reconstruction. In 12th Innovations in Theoretical Computer Science Conference (ITCS 2021), pages 18:1–18:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPIcs.ITCS.2021.18.
  • [18] Fedor Nazarov and Yuval Peres. Trace reconstruction with exp(O(n1/3)) samples. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pages 1042–1046, New York, NY, USA, 2017. Association for Computing Machinery. doi:10.1145/3055399.3055494.
  • [19] Yuval Peres and Alex Zhai. Average-Case Reconstruction for the Deletion Channel: Subpolynomially Many Traces Suffice. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 228–239, Berkeley, CA, October 2017. IEEE. doi:10.1109/FOCS.2017.29.
  • [20] Joey Rivkin, Gregory Valiant, and Paul Valiant. A generalized trace reconstruction problem: Recovering a string of probabilities. In Proceedings of the 57th Annual ACM Symposium on Theory of Computing, STOC ’25, pages 1657–1667, New York, NY, USA, 2025. Association for Computing Machinery. doi:10.1145/3717823.3718315.
  • [21] Jin Sima and Jehoshua Bruck. Trace Reconstruction with Bounded Edit Distance. In 2021 IEEE International Symposium on Information Theory (ISIT), pages 2519–2524, Melbourne, Australia, 2021. IEEE Press. doi:10.1109/ISIT45174.2021.9518244.