
Near-Optimal Trace Reconstruction for Mildly Separated Strings

Anders Aamand, BARC, University of Copenhagen, Denmark; Allen Liu, Massachusetts Institute of Technology, Cambridge, MA, USA; Shyam Narayanan, Citadel Securities, Miami, FL, USA
Abstract

In the trace reconstruction problem our goal is to learn an unknown string $x \in \{0,1\}^n$ given independent traces of $x$. A trace is obtained by independently deleting each bit of $x$ with some probability $\delta$ and concatenating the remaining bits. It is a major open question whether the trace reconstruction problem can be solved with a polynomial number of traces when the deletion probability $\delta$ is constant. The best known upper and lower bounds are $\exp(\tilde{O}(n^{1/5}))$ [7] and $\tilde{\Omega}(n^{3/2})$ [6], respectively. Our main result is that if the string $x$ is mildly separated, meaning that the number of zeros between any two ones in $x$ is at least $\mathrm{polylog}\,n$, and if $\delta$ is a sufficiently small constant, then the trace reconstruction problem can be solved with $O(n\log n)$ traces and in polynomial time.

Keywords and phrases:
Trace Reconstruction, deletion channel, sample complexity
Category:
Track A: Algorithms, Complexity and Games
Funding:
Anders Aamand: This work was supported by the DFF-International Postdoc Grant 0164-00022B and by the VILLUM Foundation grants 54451 and 16582.
Allen Liu: This work was supported in part by an NSF Graduate Research Fellowship, a Hertz Fellowship, and a Citadel GQS Fellowship.
Shyam Narayanan: Supported by the NSF TRIPODS Program, an NSF Graduate Fellowship, and a Google Fellowship.
Copyright and License:
© Anders Aamand, Allen Liu, and Shyam Narayanan; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Design and analysis of algorithms
Related Version:
Previous Version: https://arxiv.org/abs/2411.18765
Editors:
Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis

1 Introduction

Trace reconstruction is a well-studied problem at the interface of string algorithms and learning theory. Informally, the goal of trace reconstruction is to recover an unknown string x given several independent noisy copies of the string.

Formally, fix an integer $n \ge 1$ and a deletion parameter $\delta \in (0,1)$. Let $x \in \{0,1\}^n$ be an unknown binary string with $x_i$ representing the $i$th bit of $x$. Then, a trace $\tilde{x}$ of $x$ is generated by deleting every bit $x_i$ independently with probability $\delta$ (and retaining it otherwise), and concatenating the retained bits together. For instance, if $x = 01001$ and we delete the second and third bits, the trace would be $001$ (from the first, fourth, and fifth bits of $x$). For a fixed string $x$, note that the trace follows some distribution over bitstrings, where the randomness comes from which bits are deleted. In trace reconstruction, we assume we are given $N$ i.i.d. traces $\tilde{x}^{(1)},\dots,\tilde{x}^{(N)}$, and our goal is to recover the original string $x$ with high probability.
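For concreteness, here is a minimal Python sketch of the deletion channel (the function name is ours, not from the paper):

```python
import random

def sample_trace(x: str, delta: float) -> str:
    # Delete each bit of x independently with probability delta and
    # concatenate the retained bits.
    return "".join(bit for bit in x if random.random() > delta)

# e.g. sample_trace("01001", 0.1) returns "001" whenever the second
# and third bits happen to be deleted.
```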

The trace reconstruction problem has been a very well studied problem over the past two decades [23, 24, 3, 21, 20, 34, 25, 16, 30, 31, 17, 18, 19, 6, 10, 9, 7, 32]. There have also been numerous generalizations or variants of trace reconstruction studied in the literature, including coded trace reconstruction [13, 4], reconstructing mixture models [1, 2, 28], reconstructing alternatives to strings [14, 22, 29, 26, 33, 27], and approximate trace reconstruction [15, 8, 5, 11, 12].

In perhaps the most well-studied version of trace reconstruction, $x$ is assumed to be an arbitrary $n$-bit string and the deletion parameter $\delta$ is assumed to be a fixed constant independent of $n$. In this case, the best known algorithm requires $e^{\tilde{O}(n^{1/5})}$ random traces to reconstruct $x$ with high probability [7]. As we do not know of any polynomial-time (or even polynomial-sample) algorithms for trace reconstruction, there have been many works making distributional assumptions on the string $x$, such as $x$ being a uniformly random string [20, 25, 31, 19, 32] or $x$ being drawn from a “smoothed” distribution [10]. An alternative assumption is that the string $x$ is parameterized, meaning that $x$ comes from a certain “nice” class of strings that may be amenable to efficient algorithms [22, 15].

In this work, we also wish to understand parameterized classes of strings for which we can solve trace reconstruction efficiently. Indeed, we give an algorithm, using polynomially many traces and polynomial runtime, that works for a general class of strings that we call $L$-separated strings. This significantly broadens the classes of strings for which polynomial-time algorithms are known [22].

Main Result

Our main result concerns trace reconstruction of strings that are mildly separated. We say that a string $x$ is $L$-separated if the number of zeros between any two consecutive ones is at least $L$. Depicting a string $x \in \{0,1\}^n$ with $t$ ones as

$\underbrace{0\cdots0}_{a_0\text{ times}}\,1\,\underbrace{0\cdots0}_{a_1\text{ times}}\,1\,\cdots\,1\,\underbrace{0\cdots0}_{a_t\text{ times}},$

it is $L$-separated if and only if $a_i \ge L$ for each $i$ with $1 \le i \le t-1$. Note that we make no assumptions on $a_0$ or $a_t$. Our main result is as follows.

Theorem 1.

There exists an algorithm that solves the trace reconstruction problem with high probability in $n$ on any $L$-separated string $x$, provided that $L \ge C(\log n)^8$ for a universal constant $C$, and that the deletion probability is at most some universal constant $c_0$. The algorithm uses $N = O(n\log n)$ independently sampled traces of $x$, $\tilde{x}^{(1)},\dots,\tilde{x}^{(N)}$, and runs in $\mathrm{poly}(n)$ time.

We note that the number of traces is nearly optimal. Even distinguishing between two strings $x, x'$ which contain only a single one, at positions $n/2$ and $n/2+1$ respectively, requires $\Omega(n)$ traces to succeed with probability $1/2 + \Omega(1)$.

While trace reconstruction is known to be solvable very efficiently for random strings [19, 32], there are certain structured classes of strings that appear to be natural hard instances for existing approaches. Our algorithm can be seen as solving one basic class of hard instances. It is worth noting the work of [9], which studies the trace reconstruction problem when the deletion probability $\delta$ is sub-constant. They show that the simple Bitwise Majority Alignment (BMA) algorithm from [3] can succeed with $n^{-o(1)}$ deletion probability as long as the original string does not contain deserts – highly repetitive blocks where some short substring is repeated many times. They then combine this with an algorithm for reconstructing repetitive blocks – but this part of their algorithm requires a significantly smaller deletion probability of $\delta \le 1/n^{1/3+\varepsilon}$. This suggests that strings containing many repetitive blocks are a natural hard instance and a good test-bed for developing new algorithms and approaches. $L$-separated strings can be thought of as the simplest class of highly repetitive strings (where the repeating pattern is just a 0), where every repetition has length at least $L$.

Comparison to Related Work

Most closely related to our work is the result by Krishnamurthy et al. [22] stating that if $x$ has at most $k$ ones and if each pair of ones is separated by a run of zeros of length $\Omega(k\log n)$, then $x$ can be recovered in polynomial time from $\mathrm{poly}(n)$ many traces. In particular, for strings with $k = O((\log n)^7)$ ones, the required separation is milder than ours, albeit not below $\Omega(\log n)$. Our algorithm works in general assuming a $\mathrm{polylog}\,n$ separation of the ones but with no additional requirement on the number of ones: indeed, we could even have $n/\mathrm{polylog}\,n$ ones. With no sparsity assumptions, [22] would need to set $L \ge \Omega(\sqrt{n\log n})$, as a $\sqrt{n\log n}$-separated string can be $\Theta(\sqrt{n/\log n})$-sparse in the worst case. The techniques of [22] are also very different from ours. They recursively cluster the positions of the ones in the observed traces to correctly align a large fraction of the ones in the observed traces to ones in the string $x$. In contrast, our algorithm works quite differently and is of a more sequential nature, processing the traces from left to right (or right to left). See Section 1.1 for a discussion of our algorithm.

Another paper studying strings with large runs is by Davies et al. [15]. They consider approximate trace reconstruction, specifically how many traces are needed to approximately reconstruct $x$ up to edit distance $\varepsilon n$ under various assumptions on the lengths of the runs of zeros and ones in $x$. Among other results but most closely related to ours, they show that one can $\varepsilon$-approximately reconstruct $x$ using $O((\log n)/\varepsilon^2)$ traces if the runs of zeros have length at least $(\log n)/\varepsilon$ and if the runs of ones are all of length at most $C\log n$ or at least $3C\log n$ for a constant $C$ (e.g. they could have length one as in our paper). However, for exact reconstruction, they would need to set $\varepsilon < 1/n$, which means they do not provide any nontrivial guarantees in our setting.

1.1 Technical Contributions

In this section, we give a high level overview of our techniques. Recall that we want to reconstruct a string $x \in \{0,1\}^n$ from independent traces $\tilde{x}$, where we assume that $x$ is mildly separated. More concretely, we assume that there are numbers $a_0,\dots,a_t \ge \mathrm{polylog}\,n$ such that $x$ consists of $a_0$ zeros followed by a one, followed by $a_1$ zeros followed by a one and so on, with the last $a_t$ bits of $x$ being zero. Writing $\overline{a}_i = \sum_{0\le j\le i} a_j$, we thus have that the $t$ ones in $x$ are at positions $\overline{a}_i + i + 1$ for $0 \le i \le t-1$.

Note that a retained bit in a trace $\tilde{x}$ naturally corresponds to a bit in $x$. More formally, for a trace $\tilde{x}$ of length $\ell$, let $i_1 < \dots < i_\ell$ be the positions in $x$ where the bit was retained when generating $\tilde{x}$, so that $\tilde{x} = x_{i_1}\cdots x_{i_\ell}$. Then, the correspondence is defined by the map from $[\ell]$ to $[n]$ mapping $j \mapsto i_j$. We think of this map as the correct alignment of $\tilde{x}$ to $x$.

Our main technical contribution is an alignment algorithm (see Algorithm 1) which takes in some $m \le t$ and estimates $b_0,\dots,b_{m-1}$ of $a_0,\dots,a_{m-1}$ satisfying that for all $i$, $|b_i - a_i| = O(\sqrt{a_i\log n})$, and correctly aligns the one in a trace $\tilde{x}$ corresponding to the $m$’th one of $x$ with probability $1 - O(\delta)$ (where the randomness is over the draw of $\tilde{x}$ – naturally, this requires that the $m$’th one of $x$ was not deleted).

Moreover, we ensure that the alignment procedure with high probability, say $1 - O(n^{-100})$, never aligns a one in $\tilde{x}$ too far to the right in $x$: if the one in $\tilde{x}$ corresponding to the $m_0$’th one of $x$ is aligned to the $m$’th one of $x$, then $m \le m_0$. We will refer to this latter property by saying that the algorithm is never ahead with high probability. If $m < m_0$, we say that the algorithm is behind. Thus, to show that the algorithm correctly aligns the $m$’th one, it suffices to show that the probability that the algorithm is behind is $O(\delta)$.

We first discuss how to implement this alignment procedure and then afterwards we discuss how to complete the reconstruction by using this alignment procedure.

The alignment procedure of Algorithm 1

The main technical challenge of this paper is the analysis of Algorithm 1. Let us first describe on a high level how the algorithm works. For $0 \le j \le j' \le m$, we write $b_{j:j'} = \sum_{i=j}^{j'-1} b_i$. Suppose that the trace $\tilde{x}$ consists of $s_0$ zeros followed by a one, followed by $s_1$ zeros followed by a one and so on. The algorithm first attempts to align the first one in $\tilde{x}$ with a one in $x$ by finding the minimal $j_0$ such that $(1-\delta)b_{0:j_0}$ is within $C\log n\sqrt{b_{0:j_0}}$ of $s_0$ for a sufficiently large $C$. Inductively, having determined $j_i$ (that is, the alignment of the $i$’th one of $\tilde{x}$), it looks for the minimal $j_{i+1} > j_i$ satisfying that there is a $j_i \le j < j_{i+1}$ such that $(1-\delta)b_{j:j_{i+1}}$ is within $C\log n\sqrt{b_{j:j_{i+1}}}$ of $s_{i+1}$. Intuitively, when looking at the $i$’th one in the trace, we want to find the earliest possible location in the real string (which has gaps estimated by $b_0, b_1, \dots$) that could plausibly align with the one in the trace.

It is relatively easy to check that the algorithm is never ahead with very high probability. Indeed, by concentration bounds on the number of deleted zeros and the fact that $|b_j - a_j| = O(\sqrt{a_j\log n})$ for all $j \le m$, it always has the option of aligning the $(i+1)$’st one in $\tilde{x}$ to the correct one in $x$. However, it might align to an earlier one in $x$ since it is looking for the minimum $j_{i+1}$ such that an alignment is possible. For a very simple example, suppose that $a_0 = n^{\Omega(1)}$ and $a_1 = \dots = a_m = b_1 = \dots = b_m = \mathrm{polylog}(n)$. If the first $k < m$ ones of $x$ are deleted and the $(k+1)$’st one is retained, the algorithm will align the retained one (which corresponds to the $(k+1)$’st one of $x$) with the first one of $x$, resulting in the aligning algorithm being $k$ steps behind. Moreover, the algorithm will remain $k$ steps behind all the way up to the $m$’th one of $x$. The probability of this happening is $\Theta(\delta^k)$. To prove that the probability of the algorithm being behind when aligning the $m$’th one of $x$ is at most $O(\delta)$, we prove a much stronger statement which is amenable to an inductive proof, essentially stating that this is the worst that can happen: the probability of the algorithm being $k$ steps behind at any fixed point is bounded by $(C\delta)^k$ for a constant $C$. In particular, we show that there is a sort of amortization – whenever there is a substring that can cause the algorithm to fall further behind with some probability (i.e. if certain bits are deleted), the substring also helps the algorithm catch back up if it is already behind.

The algorithm is not too far behind

Proving that the algorithm cannot be too far behind, i.e., is $k$ steps behind with probability at most $O(\delta)^k$, is perhaps the most challenging technical part of our paper. We discuss some of the ideas behind proving this result.

The first step towards proving this lemma is to attempt to prove an even stronger statement: that even if the current estimates $b_0,\dots,b_m$ are totally arbitrary (perhaps not similar to $a_0,\dots,a_m$ at all), we will still not be behind. This is not too far-fetched, as for general $b_0,\dots,b_m$ we might actually start jumping ahead. For example, if $b_0$ is not close to $a_0$ and we do not delete the first 1, we will predict the first 1 in the trace to have come from a later location. This will end up being proven by induction on the length $m$ of the string.

Now, condition on the $(m+1)$th 1 from the true string $x$ not being deleted, and consider the probability of being $k$ steps behind after seeing this bit. Recall that the true gaps between the bits until the $(m+1)$th 1 are $a_0,\dots,a_m$ but the algorithm believes the gaps are $b_0,\dots,b_m$, and the algorithm believes we have just gone through the gaps $b_0,\dots,b_{m-k}$ so far. Let $h$ be the smallest value where $a_h \approx b_h$ doesn’t hold (here, think of $\not\approx$ as $|a_h - b_h|$ being much larger than $\sqrt{a_h}$, so it would be easy to distinguish between these gaps even with random deletions). Let $h'$ be the smallest value where $b_{m-k-h'} \approx a_{m-h'}$ doesn’t hold. This can be thought of in terms of reading the sequences $a$ and $b$ backward, from where the algorithm thinks the gaps are from the sequence $b$ whereas the gaps actually come from $a$. The idea is that if we are aligned after the $h$th 1 (which is after the gaps $a_{h-1}$ and $b_{h-1}$), the difference between $a_h$ and $b_h$ should cause us to move ahead, meaning that we will have to fall $k+1$ steps behind afterwards, making the inductive argument easier for us. By a symmetric argument, we shouldn’t expect to have the $h'$th-to-last 1 aligned with the $h'$th-to-last 1 in $b$. So, the point is that we should expect to fall behind both in the first $h$ gaps and the last $h'$ gaps. This will allow us to split the string into pieces where in each one we fall behind, and we can apply an inductive hypothesis on the length of the string. Another option is that there is never a value $h$ where $a_h \approx b_h$ fails or $h'$ where $a_{m-k-h'} \approx a_{m-h'}$ fails. In this case, $(a_0,\dots,a_m)$ is approximately periodic with period $k$, and we would have to fall an entire period behind, which we show happens with very low probability.

Reconstructing $x$ using Algorithm 1

Using Algorithm 1 we can iteratively get estimates $b_0,\dots,b_t$ with $|b_i - a_i| = O(\sqrt{a_i\log n})$. Namely, suppose that we have the estimates $b_0,\dots,b_m$. We then run Algorithm 1 on $O(\log n)$ independent traces, and with high probability, for a $1 - O(\delta)$ fraction of them, we have that the $m$’th and $(m+1)$’st one of $x$ are retained in $\tilde{x}$ and correctly aligned. In particular, with probability $1 - O(\delta)$ we can identify both the $m$’th and $(m+1)$’st one of $x$ in $\tilde{x}$, and taking the median over the gaps between these (and appropriately rescaling by $\frac{1}{1-\delta}$), we obtain an estimate $b_{m+1}$ such that $|b_{m+1} - a_{m+1}| = O(\sqrt{a_{m+1}\log n})$. Note that the success probability of $1 - O(\delta)$ is enough to obtain the coarse estimates using the median approach, but we cannot obtain a fine estimate by taking the average, since with constant probability $O(\delta)$ we may have misaligned the gap completely, and then our estimate can be arbitrarily off.

To obtain fine estimates, we first obtain coarse estimates, say $b_0,\dots,b_t$, for all of the gaps. Next, we show that we can identify the $m$’th and $(m+1)$’st one of $x$ in a trace $\tilde{x}$ (if they are retained), and we can detect if they were deleted, not just with probability $1 - O(\delta)$ but with very high probability. The trick here is to run Algorithm 1 both from the left and from the right on $\tilde{x}$, looking for respectively the one in $\tilde{x}$ aligned to the $m$’th one in $x$ and the one in $\tilde{x}$ aligned to the $(m+1)$’st one in $x$ (which is the $(t-m)$’th one when running the algorithm from the right). If either of these runs fails to align a one in $\tilde{x}$ to respectively the $m$’th and $(m+1)$’st one in $x$, or the runs disagree on their alignment, then we will almost certainly know. To see why, assuming that we are never ahead in the alignment procedure from the left, if we believe we have reached the $m$’th one in $x$, then we are truly at some $m_0$’th one where $m_0 \ge m$. By a symmetric argument, if we believe we have reached the $(m+1)$’st one in $x$ after running the procedure from the right, we are truly at the $m_1$’th one in $x$, where $m_1 \le m+1$. The key observation now is that $m_0 < m_1$ if and only if $m_0 = m$ and $m_1 = m+1$, meaning that both runs succeeding is equivalent to the one found in the left-alignment procedure being strictly earlier than the one found in the right-alignment procedure. So, if we realize that either run fails to align the ones properly, we discard the trace and repeat on a newly sampled trace.

Finally, we can ensure that the success of the runs of the alignment algorithm is independent of the deletion of zeros between the $m$’th and $(m+1)$’st ones in $x$. If a trace is not discarded, then with very high probability, the gap between the ones in $\tilde{x}$ aligned to the $m$’th and $(m+1)$’st ones in $x$ (normalized by $\frac{1}{1-\delta}$) is an unbiased estimator for $a_{m+1}$. By taking the average of the gap over $\tilde{O}(n)$ traces, normalizing by $\frac{1}{1-\delta}$, and rounding to the nearest integer, we determine $a_{m+1}$ exactly with very high probability. Doing so for each $m$ reconstructs $x$.

Roadmap of our paper

In Section 2, we introduce notation. In Section 3, we describe and analyze our main alignment procedure. We first prove that with high probability it is never ahead (Lemma 2). Second, in Section 3.2, we bound the probability that it is behind (Lemma 3). Finally, in Section 4, we describe our full trace reconstruction algorithm and prove Theorem 1.

2 Notation

We note a few notational conventions and definitions.

  • We recall that a bitstring $x$ is $L$-separated if the gap between any consecutive 1’s in the string contains at least $L$ 0’s.

  • Given a string $x$, we say that a run is a contiguous sequence of 0’s in $x$. For $x = \underbrace{0\cdots0}_{a_0\text{ times}}1\underbrace{0\cdots0}_{a_1\text{ times}}1\cdots1\underbrace{0\cdots0}_{a_t\text{ times}}$, the $i$th run of $x$ is the sequence $\underbrace{0\cdots0}_{a_i\text{ times}}$, and has length $a_i$.

  • For any bitstring $x = x_1x_2\cdots x_n$, we use $\mathrm{rev}(x) := x_nx_{n-1}\cdots x_1$ to denote the string where the bits have been reversed.

  • We use $\mathbf{a} = a_0, a_1, \dots, a_{m-1}$ to denote an integer sequence of length $m$. For notational convenience, for any $0 \le j < j' \le m$, we write $\mathbf{a}_{j:j'}$ to denote the subsequence $a_j, a_{j+1}, \dots, a_{j'-1}$, and $a_{j:j'} := \sum_{i=j}^{j'-1} a_i$.
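These conventions are easy to mirror in code; a small illustrative Python sketch (the names are ours):

```python
def runs(x: str) -> list[int]:
    # Lengths a_0, ..., a_t of the runs of 0s delimited by the 1s of x:
    # "0001001".split("1") -> ["000", "00", ""], i.e. a = [3, 2, 0].
    return [len(r) for r in x.split("1")]

def is_L_separated(x: str, L: int) -> bool:
    # Only gaps between consecutive 1s are constrained; a_0 and a_t are free.
    return all(g >= L for g in runs(x)[1:-1])

def rev(x: str) -> str:
    # rev(x) = x_n x_{n-1} ... x_1.
    return x[::-1]
```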

We will define some sufficiently large constants $C_0, C_1, C_2, C_3$ and a small constant $c_0$. We will assume the separation parameter is $L = C_3\log^8 n$ and the deletion parameter satisfies $\delta \le c_0$, where $c_0 = \frac{1}{3\cdot10^6}$. We did not make a significant effort to optimize the constant $c_0$ or the exponent 8 in $\log^8 n$, though we believe that any straightforward modifications to our analysis will not obtain bounds such as $c_0 \ge \frac{1}{2}$ or a separation of $L = O(\log n)$.

3 Main Alignment Procedure

3.1 Description and Main Lemmas

In this section, we consider a probabilistic process that models a simpler version of the trace reconstruction problem that we aim to solve. In this simpler version, suppose that we never delete any 0’s, but delete each 1 independently with probability $\delta$. Let $a_0,\dots,a_{m-1} \in [n]$ represent the true lengths of the first $m$ gaps (so the first 1 is at position $1+a_0$, the second 1 is at position $2+a_0+a_1$, and so on). Moreover, suppose we have some current predictions $b_0,\dots,b_{m-1}$ of the gaps $a_0,\dots,a_{m-1}$. The high-level goal will be, given a single trace (where here a trace means only 1’s are deleted), to identify the $m$th 1 in the trace from the original string with reasonably high probability. (Note that the $m$th 1 is deleted with probability $\delta$, in which case we cannot succeed.)

In this section, we will describe and analyze the probabilistic process, and then explain how this analysis helps us solve the trace reconstruction problem in Section 4.

In the process, we fix $m \le n$ and two sequences $\mathbf{a} = a_0,\dots,a_{m-1}$ and $\mathbf{b} = b_0,\dots,b_{m'-1}$, where $\mathbf{a}$ has length $m$ but $\mathbf{b}$ has some length $m'$ which may or may not equal $m$. Moreover, we assume $L \le a_i \le n$ and $L \le b_j \le n$ for every term $a_i \in \mathbf{a}$ and $b_j \in \mathbf{b}$.

Now, for each $1 \le i \le m-1$, let $w_i \in \{0,1\}$ be i.i.d. random variables, with $w_i = 1$ with probability $1-\delta$ and $w_i = 0$ with probability $\delta$. Also, let $w_0 = w_m = 1$ with probability 1. For each $0 \le i \le m$ with $w_i = 1$, we define a value $f_i$ as follows. First, we set $f_0 = 0$. Next, for each index $i \ge 1$ such that $w_i = 1$, let $i_0$ denote the previous index with $w_{i_0} = 1$. We define $f_i$ to be the smallest index $j > f_{i_0}$ such that there exists $f_{i_0} \le j' < j$ with $|b_{j':j} - a_{i_0:i}| \le C_0\log n\sqrt{b_{j':j}}$, where $C_0$ is a sufficiently large constant. (If such an index does not exist, we set $f_i = \infty$.)
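For concreteness, the following Python sketch (our own code, not from the paper) computes the $f_i$ directly from this definition, given $\mathbf{a}$, $\mathbf{b}$, and the retention indicators $w$:

```python
import math

def compute_f(a, b, w, C0, n):
    # a has length m; w has length m + 1 with w[0] = w[m] = 1.
    pre_a = [0]
    for v in a:
        pre_a.append(pre_a[-1] + v)          # pre_a[i] = a_{0:i}
    pre_b = [0]
    for v in b:
        pre_b.append(pre_b[-1] + v)          # pre_b[j] = b_{0:j}
    f = {0: 0}                               # f_0 = 0
    i0 = 0                                   # previous index with w = 1
    for i in range(1, len(a) + 1):
        if not w[i]:
            continue
        gap = pre_a[i] - pre_a[i0]           # a_{i0:i}
        f[i] = math.inf                      # "no such index exists" case
        for j in range(f[i0] + 1, len(b) + 1):
            # witness j' with f_{i0} <= j' < j matching the gap
            if any(abs((pre_b[j] - pre_b[jp]) - gap)
                   <= C0 * math.log2(n) * math.sqrt(pre_b[j] - pre_b[jp])
                   for jp in range(f[i0], j)):
                f[i] = j                     # smallest such j
                break
        if f[i] == math.inf:
            return f                         # alignment lost; stop here
        i0 = i
    return f
```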

Our goal will be for $f_m = m$. In general, for any $i$ with $w_i = 1$, we would like $f_i = i$. If $f_i < i$, we say that we are $i - f_i$ steps behind at step $i$, and if $f_i > i$, we say that we are $f_i - i$ steps ahead at step $i$.

First, we note the following lemma, which states that we will never be ahead with very high probability, as long as the sequences 𝐚 and 𝐛 are similar enough.

Lemma 2.

Set $C_1 = C_0/4$. Let $\mathbf{a}, \mathbf{b}$ be sequences of lengths $m, m'$, respectively, where $m' \ge m$. Suppose that $|b_i - a_i| \le C_1\sqrt{b_i\log n}$ for all $0 \le i < m$. Then, with probability at least $1 - \frac{1}{n^{10}}$ (over the randomness of the $w_i$), for all $0 \le i \le m$ with $w_i = 1$, $f_i \le i$.

Proof.

Let us consider the event that for every index $0 \le i \le m - 15\log n$, at least one of $w_i, w_{i+1}, \dots, w_{i+15\log n}$ equals 1. Equivalently, the string $w_0w_1\cdots w_m$ does not ever have $15\log n + 1$ 0’s in a row. For any fixed $i$, the probability of this being false is at most $\delta^{15\log n} \le n^{-15}$, so by a union bound over all choices of $i$, the event holds with at most $n^{-10}$ failure probability.

First, note that $f_0 = 0$. Now, suppose that some $i \ge 0$ satisfies $w_i = 1$ and $f_i \le i$, and let $i'$ be the smallest index strictly larger than $i$ such that $w_{i'} = 1$. Note that $i' - i \le 15\log n + 1 \le 16\log n$, by our assumed event. Note that if we set $j' = i$ and $j = i'$, then $j > j' \ge f_i$, since $f_i \le i$. Moreover, $|b_{j':j} - a_{j':j}| \le \sum_{i''=j'}^{j-1}|b_{i''} - a_{i''}| \le C_1\sqrt{\log n}\sum_{i''=j'}^{j-1}\sqrt{b_{i''}} \le C_1\sqrt{\log n}\cdot\sqrt{b_{j':j}\,|j-j'|} \le 4C_1\log n\sqrt{b_{j':j}} = C_0\log n\sqrt{b_{j':j}}$, where the second-to-last inequality is by Cauchy–Schwarz. Thus, $j' = i, j = i'$ satisfies the requirements for $f_{i'}$, which means that $f_{i'} \le j = i'$. Thus, if $f_i \le i$, then $f_{i'} \le i'$. Since $f_0 \le 0$, this means $f_i \le i$ for all $i$ with $w_i = 1$.

The main technical result will be showing that $f_m \ge m$ with reasonably high probability, i.e., with reasonably high probability we are not behind. This result will hold for any choice of $\mathbf{a}, \mathbf{b}$ and does not require any similarity between these sequences. In other words, our goal is to prove the following lemma.

Lemma 3.

Let $\mathbf{a}, \mathbf{b}$ be strings of length at most $n$ with every $\mathbf{a}_i, \mathbf{b}_j$ between $L$ and $n$, where $L = C\log^8 n$ for a sufficiently large constant $C$. Define $m = |\mathbf{a}|$. Then, for any $\delta \le \frac{1}{3\cdot10^6}$, with probability at least $1 - 200\delta$ over the randomness of $w_1,\dots,w_{m-1}$, $f_m \ge m$.

3.2 Proof of Lemma 3

In this section, we prove Lemma 3.

We will set a parameter $K = C_2\log n$, where $C_2$ is a sufficiently large constant. For any $k \ge 0$, given the sequences $\mathbf{a} = a_0,\dots,a_{m-1}$ and $\mathbf{b} = b_0,\dots,b_{m'-1}$ (of possibly differing lengths), we define $p_k(\mathbf{a},\mathbf{b})$ to be the probability (over the randomness of $w_1,\dots,w_{m-1}$) that

  • $f_m \le m - k$.

  • For any indices $0 \le i \le i' \le m$ with $w_i, w_{i'} = 1$, $f_{i'} - f_i \ge (i' - i) - K$.

Equivalently, this is the same as the probability that we fall behind at least $k$ steps from step 0 to step $m$, but we never fall behind $K+1$ or more steps (relatively) from any (possibly intermediate) steps $i$ to $i'$. For any $m \ge 1$, we define $p_k(m)$ to be the supremum value of $p_k(\mathbf{a},\mathbf{b})$ over any sequences $\mathbf{a},\mathbf{b}$ where $\mathbf{a}$ has length at most $m$ and every $a_i$ and $b_j$ is between $L$ and $n$, and we also define $p_k := \sup_{m\ge1} p_k(m)$.

Note that for any $k > K$, $p_k(\mathbf{a},\mathbf{b}) = 0$, as $f_m = m-k$ means $f_m - f_0 < (m - 0) - K$. So, $p_k(m)$ and $p_k$ also equal 0 for any $k > K$.

First, we note a simple proposition, that will only be useful for simplifying the argument at certain places.

Proposition 4.

For any $m \ge 1$, $p_0(m) = 1$.

Proof.

Since $p_0(m)$ is the maximum over all $\mathbf{a},\mathbf{b}$ where $\mathbf{a}$ has length at most $m$, it suffices to prove it for some $\mathbf{a},\mathbf{b}$ of length 1. Indeed, for $m = 1$ and $a_0 = b_0 = L$, we must have that $w_0 = w_1 = 1$, so we must have $f_0 = 0$ and $f_1 = 1$.

We now aim to bound the probabilities $p_k$ for $k \le K$. We will do this via an inductive approach on the length $m$, where the high-level idea is that if we fall back by $k$ steps, there is a natural splitting point where we can say that first we fell back by $k_1$ steps, and then by $k_2$ steps, for some $k_1, k_2 > 0$ with $k_1 + k_2 = k$ – see Lemmas 6 and 7. This natural splitting point will be based on the structure of the similarity of $\mathbf{a}$ and $\mathbf{b}$, and will not work if $\mathbf{a}$ and $\mathbf{b}$ share a $k$-periodic structure. But in the periodic case, we can give a more direct argument that we cannot fall back by $k$ steps (i.e., a full period), even with $\frac{1}{\mathrm{poly}(n)}$ probability – see Lemma 5. We can then compute a recursive formula for the probability of falling back $k$ steps, by saying we need to first fall back $k_1$ steps and then fall back $k_2$ steps. In Lemma 9, we bound the terms of this recursion.

Lemma 5.

Fix any $m \ge k \ge 1$ such that $k \le K$, and suppose that $L \ge C_3\log^8 n$, where $C_3$ is a sufficiently large multiple of $C_0^2C_2^6$. Suppose that $\mathbf{a},\mathbf{b}$ are sequences such that for every $0 \le i < m-k$, $|b_i - a_i| \le C_0\log n\sqrt{b_i}$ and $|b_i - a_{i+k}| \le C_0\log n\sqrt{b_i}$. Then, the probability $p_k(\mathbf{a},\mathbf{b}) \le (2\delta)^K$.

Proof.

We show that the probability of ever being behind by $k$ or more is at most $(2\delta)^K$. In fact, we will show this deterministically never happens, conditioned on the event that for every index $0 \le i \le m - Kk$, at least one of $w_i, w_{i+k}, w_{i+2k}, \dots, w_{i+Kk}$ equals 1. Indeed, the probability of this being false for any fixed $i$ is at most $\delta^K$, so by a union bound over all choices of $i$, the event holds with at most $n\delta^K \le (2\delta)^K$ failure probability.

Now, assume the event and suppose that $f_i \le i - k$ holds for some $i$. More precisely, we fix $i$ to be the smallest index such that $w_i = 1$ and $f_i \le i - k$.

First, assume that $i \ge 2Kk$. Consider the values $a_{i-1}, a_{i-2}, \dots, a_{i-k}$, and let $h = \arg\max_{1\le t\le k} a_{i-t}$. By our conditional assumption, and since $i \ge 2Kk$, at least one of $w_{i-h}, w_{i-h-k}, \dots, w_{i-h-Kk}$ equals 1. Say that $w_{i-h-rk} = 1$, where $0 \le r \le K$. Also, by our choice of $i$, we know that $f_{i-h-rk} > i-h-(r+1)k \ge 0$, and that $f_i \le i-k$. So, we have two options:

  1. $i \ge 2Kk$, and $f_i \le i-k$, $f_{i-h-rk} > i-h-(r+1)k \ge 0$, for some $r \le K$ and where $h = \arg\max_{1\le t\le k} a_{i-t}$.

  2. $i < 2Kk$, and $f_i \le i-k$, $f_0 = 0$.

Now, let’s consider the list of all indices $i_0 < i_1 < \dots < i_s = i$ with $w_{i_0}, w_{i_1}, \dots, w_{i_s} = 1$, starting with $i_0 = i-h-rk$ if $i \ge 2Kk$ and $i_0 = 0$ otherwise, and ending with $i_s = i$. By definition of the sequence $f$, for every $0 \le t < s$ there exist $j, j'$ such that $f_{i_t} \le j' < j \le f_{i_{t+1}}$ and $|b_{j':j} - a_{i_t:i_{t+1}}| \le C_0\log n\sqrt{b_{j':j}}$. Assuming that $L \ge (10C_0\log n)^2$, then $a_{i_t:i_{t+1}} \ge (10C_0\log n)^2$, which means $a_{i_t:i_{t+1}} \ge b_{j':j}/2$, and thus $|b_{j':j} - a_{i_t:i_{t+1}}| \le 2C_0\log n\sqrt{a_{i_t:i_{t+1}}}$. So,

$b_{f_{i_t}:f_{i_{t+1}}} \ge b_{j':j} \ge a_{i_t:i_{t+1}} - 2C_0\log n\sqrt{a_{i_t:i_{t+1}}}.$

Adding the above equation over $0 \le t \le s-1$, we obtain

$b_{f_{i_0}:f_i} \ge a_{i_0:i} - 2C_0\log n\sum_{t=0}^{s-1}\sqrt{a_{i_t:i_{t+1}}} \ge a_{i_0:i} - 2C_0\log n\sqrt{a_{i_0:i}\cdot s},$

where the final inequality follows by Cauchy–Schwarz. Let $j_0$ be $i-h-(r+1)k+1$ if $i \ge 2Kk$ and $j_0 = 0$ otherwise. Then, since $s \le i - i_0 \le 2kK \le 4K^2$, we have

$b_{j_0:i-k} \ge b_{f_{i_0}:f_i} \ge a_{i_0:i} - 4C_0K\log n\sqrt{a_{i_0:i}}. \quad (1)$

The above equation tells us that $b_{j_0:i-k} = \sum_{t=j_0}^{i-k-1} b_t$ can’t be too much smaller than $a_{i_0:i} = \sum_{t=i_0}^{i-1} a_t$. We now show contrary evidence, thus establishing a contradiction.

First, we compare $b_{j_0:i-k}$ to $a_{j_0+k:i}$. Indeed, for any $k \le t < m$, $|b_{t-k} - a_t| \le C_0\log n\sqrt{b_{t-k}}$. Since every $a_i \ge (10C_0\log n)^2$, this also means $|b_{t-k} - a_t| \le 2C_0\log n\sqrt{a_t}$. Adding over all $j_0+k \le t < i$, we have

$a_{j_0+k:i} \ge b_{j_0:i-k} - 2C_0\log n\sum_{t=j_0+k}^{i-1}\sqrt{a_t} \ge b_{j_0:i-k} - 4C_0K\log n\sqrt{a_{j_0+k:i}},$

where the last inequality follows by Cauchy–Schwarz and the fact that $i - (j_0+k) \le i - j_0 \le 2Kk \le 4K^2$.

However, we do not care about $a_{j_0+k:i}$ – we really care about $a_{i_0:i}$. To bound this, first note that for any $k \le i' < m$, $|a_{i'} - b_{i'-k}| \le C_0\log n\sqrt{b_{i'-k}}$ and $|a_{i'-k} - b_{i'-k}| \le C_0\log n\sqrt{b_{i'-k}}$. So, $|a_{i'} - a_{i'-k}| \le 4C_0\log n\sqrt{a_{i'}}$, assuming every $a_i \ge (10C_0\log n)^2$. If we additionally have that $L \ge (100C_0K\log n)^2$, then $|a_{i'} - a_{i'-sk}| \le 8C_0\log n\cdot s\sqrt{a_{i'}}$ for any $1 \le s \le K$ and $sk \le i' < m$. Importantly, $\frac{a_{i'-sk}}{a_{i'}} \in [1/2, 2]$.

In the case that $i \ge 2Kk$, this implies that $\sum_{t=i-h-rk}^{i-1} a_t \le 2(r+1)\sum_{t=i-k}^{i-1} a_t \le 4K\sum_{t=i-k}^{i-1} a_t$. So, because $h = \arg\max_{1\le t\le k} a_{i-t}$, we have

$a_{i_0} = a_{i-h-rk} \ge \frac{1}{2}\,a_{i-h} \ge \frac{1}{2k}\sum_{t=1}^{k}a_{i-t} \ge \frac{1}{8K^2}\sum_{t=i-h-rk}^{i-1}a_t.$

Recalling that $i_0 = i-h-rk$ and $j_0 = i-h-(r+1)k+1$, since $i_0 = j_0 + k - 1$,

$a_{i_0:i} \ge \left(1+\tfrac{1}{8K^2}\right)a_{j_0+k:i} \ge \left(1+\tfrac{1}{8K^2}\right)\left(b_{j_0:i-k} - 4C_0K\log n\sqrt{a_{j_0+k:i}}\right) \ge \left(1+\tfrac{1}{8K^2}\right)\left(b_{j_0:i-k} - 4C_0K\log n\sqrt{a_{i_0:i}}\right). \quad (2)$

In the case that $i < 2Kk$, we instead have $\sum_{t=0}^{i-1}a_t \le 2\lceil i/k\rceil\sum_{t=0}^{k-1}a_t \le 4K\sum_{t=0}^{k-1}a_t$. So, since $i_0 = j_0 = 0$, we have that

$a_{i_0:i} = a_{j_0+k:i} + a_{0:k} \ge \left(1+\tfrac{1}{4K}\right)a_{j_0+k:i} \ge \left(1+\tfrac{1}{4K}\right)\left(b_{j_0:i-k} - 4C_0K\log n\sqrt{a_{j_0+k:i}}\right),$

so the same bound as (2) holds (in fact, an even stronger bound holds).

So, both (1) and (2) hold in either case. Together, they imply that

$a_{i_0:i} \ge \left(1+\tfrac{1}{8K^2}\right)\left(a_{i_0:i} - 8C_0K\log n\sqrt{a_{i_0:i}}\right).$

This is impossible if $a_{i_0:i}$ is a sufficiently large multiple of $(C_0K\log n\cdot K^2)^2 = C_0^2\log^2 n\cdot K^6$. Since $i \ge i_0 + 1$ in either case, it suffices for $L$ to be a sufficiently large multiple of $C_0^2\log^2 n\cdot K^6 = C_0^2C_2^6\log^8 n$.

Lemma 6.

Fix any $m \ge k$ such that $k \le K$, and suppose that $L \ge C_3\log^2 n\cdot K^6$. Suppose that $\mathbf{a},\mathbf{b}$ are sequences of length $m$, such that for every $0 \le i < m-k$, $|b_i - a_{i+k}| \le C_0\log n\sqrt{b_i}$. Then, the probability

$p_k(\mathbf{a},\mathbf{b}) \le (2\delta)^K + \sum_{\substack{h_1,h_2,k_1,k_2 \ge 0,\ h_1+h_2+k_1+k_2 \ge k,\ k_1,k_2 \le K \\ (h_1,h_2,k_1,k_2) \ne (0,0,0,k),\,(0,0,k,0)}} \delta^{h_1+h_2}\,p_{k_1}(m-1)\,p_{k_2}(m-1).$
Proof.

Suppose that for all $0 \le i < m-k$, $|b_i - a_i| \le C_0\log n\sqrt{b_i}$. Then, we can use Lemma 5 to bound $p_k(\mathbf{a},\mathbf{b}) \le (2\delta)^K$. Alternatively, let $0 \le h < m-k$ be the smallest index such that $|b_h - a_h| > C_0\log n\sqrt{b_h}$. Next, let $h_1, h_2 \ge 0$ be such that $h-h_1$ is the largest index at most $h$ with $w_{h-h_1} = 1$, and $h+1+h_2$ is the smallest index at least $h+1$ with $w_{h+1+h_2} = 1$. Finally, let $k_1 := \max(0, (h-h_1) - f_{h-h_1})$ and $k_2 := \max(0, (m - (h+1+h_2)) - (f_m - f_{h+1+h_2}))$. In other words, $k_1$ is the number of steps we fall behind from 0 to $h-h_1$, and $k_2$ is the number of steps we fall behind from $h+1+h_2$ to $m$.

Note that $k_1 + k_2 \ge m - 1 - h_1 - h_2 - f_m + f_{h+1+h_2} - f_{h-h_1}$, and since each subsequent $f_i$ is strictly increasing, this means $f_{h+1+h_2} - f_{h-h_1} \ge 1$, so $k_1 + k_2 \ge m - f_m - (h_1+h_2) \ge k - (h_1+h_2)$, assuming that $f_m \le m-k$. In other words, we have that $h_1,h_2,k_1,k_2$ are nonnegative integers such that $h_1+h_2+k_1+k_2 \ge k$.

Now, let us bound the probability (over the randomness of $w_1,\dots,w_{m-1}$) of the event indicated by $p_k(\mathbf{a},\mathbf{b})$ occurring, with the corresponding values $h_1,h_2,k_1,k_2$. Note that for any fixed $h_1,h_2$, the event of those specific values is equivalent to $w_{h-h_1}$ and $w_{h+1+h_2}$ being 1, and everything in between being 0. So, the probability is at most $\delta^{h_1+h_2}$. Now, conditioned on $h_1,h_2$, the values $k_1,k_2$ imply that we fall back $k_1$ steps from step 0 to $h-h_1$ (or we may move forward if $k_1 = 0$) and we fall back $k_2$ steps from step $h+1+h_2$ to $m$. Moreover, there cannot be two steps $i, i'$ such that we fell back more than $K$ steps from $i$ to $i'$. Since $h-h_1 \le h < m$ and $m - (h+1+h_2) \le m-1$, this means both $h-h_1, m-(h+1+h_2) \le m-1$. So, the overall probability of the corresponding values $h_1,h_2,k_1,k_2$ is at most $\delta^{h_1+h_2}p_{k_1}(m-1)p_{k_2}(m-1)$, where we are using the fact that $p_0(m) = 1$ for all $m$ by Proposition 4.

Overall, the probability pk(𝐚,𝐛) is at most

$\sum_{\substack{h_1,h_2,k_1,k_2 \ge 0,\ h_1+h_2+k_1+k_2 \ge k \\ k_1,k_2 \le K}} \delta^{h_1+h_2}\,p_{k_1}(m-1)\,p_{k_2}(m-1).$

We can cap $k_1, k_2$ as at most $K$ since otherwise $p_{k_1}(m-1)$ or $p_{k_2}(m-1)$ is 0. Moreover, we can give improved bounds in the cases when $h_1 = h_2 = 0$ and either $(k_1,k_2) = (k,0)$ or $(k_1,k_2) = (0,k)$.

Note that in either case, both $w_h$ and $w_{h+1}$ equal 1. In the former case, we must have $f_h = h-k$ and $f_{h+1} = h+1-k$. Importantly, the algorithm fell back by exactly $k$ steps from 0 to $h$. However, we know that for all $0 \le i \le h-1$, $|b_i - a_i| \le C_0\log n\sqrt{b_i}$. In that case, if we restrict ourselves to the strings $\mathbf{a}_{0:h} = a_0a_1\cdots a_{h-1}$ and $\mathbf{b}_{0:h} = b_0b_1\cdots b_{h-1}$, we are dealing with the case of Lemma 5. Hence, we can bound the overall probability of this case by $(2\delta)^K$. In the latter case, we must have $f_h = h$ and $f_{h+1} = h+1$, since we need to fall back by exactly $k$ steps from $h$ to $m$. However, this actually cannot happen, because by definition of $f_h$ and $f_{h+1}$, we must have that $|b_h - a_h| \le C_0\log n\sqrt{b_h}$, which is not true by our definition of $h$.

Overall, this means

$p_k(\mathbf{a},\mathbf{b}) \le (2\delta)^K + \sum_{\substack{h_1,h_2,k_1,k_2 \ge 0,\ h_1+h_2+k_1+k_2 \ge k,\ k_1,k_2 \le K \\ (h_1,h_2,k_1,k_2) \ne (0,0,0,k),\,(0,0,k,0)}} \delta^{h_1+h_2}\,p_{k_1}(m-1)\,p_{k_2}(m-1).$

Lemma 7.

Fix any $m \ge k$ such that $k \le K$, and suppose that $L \ge C_3\log^2 n\cdot K^6$. Suppose that $\mathbf{a},\mathbf{b}$ are sequences of length $m$. Then, the probability

$p_k(\mathbf{a},\mathbf{b}) \le (2\delta)^K + 2\sum_{\substack{h_1,h_2,k_1,k_2 \ge 0,\ h_1+h_2+k_1+k_2 \ge k,\ k_1,k_2 \le K \\ (h_1,h_2,k_1,k_2) \ne (0,0,0,k),\,(0,0,k,0)}} \delta^{h_1+h_2}\,p_{k_1}(m-1)\,p_{k_2}(m-1).$
Proof.

Our proof will be quite similar to that of Lemma 6, so we omit some of the identical details.

First, assume that for every $k \le i < m$, $|b_{i-k} - a_i| \le C_0\log n\sqrt{b_{i-k}}$. Then, we can directly apply Lemma 6. Alternatively, let $k \le h < m$ be the largest index such that $|b_{h-k} - a_h| > C_0\log n\sqrt{b_{h-k}}$. As in the proof of Lemma 6, let $h_1, h_2 \ge 0$ be such that $h-h_1$ is the largest index at most $h$ with $w_{h-h_1} = 1$, and $h+1+h_2$ is the smallest index at least $h+1$ with $w_{h+1+h_2} = 1$. Also, let $k_1 := \max(0, (h-h_1) - f_{h-h_1})$ and $k_2 := \max(0, (m-(h+1+h_2)) - (f_m - f_{h+1+h_2}))$.

As in the proof of Lemma 6, we have $h_1+h_2+k_1+k_2 \ge k$, as long as $f_m \le m-k$. We can again do the same casework on $h_1,h_2,k_1,k_2$, to obtain the bound

$\sum_{\substack{h_1,h_2,k_1,k_2 \ge 0,\ h_1+h_2+k_1+k_2 \ge k \\ k_1,k_2 \le K}} \delta^{h_1+h_2}\,p_{k_1}(m-1)\,p_{k_2}(m-1).$

Once again, we wish to consider the individual cases of $(h_1,h_2,k_1,k_2) = (0,0,0,k)$ or $(h_1,h_2,k_1,k_2) = (0,0,k,0)$ separately. In either case, $w_h = w_{h+1} = 1$. In the former case, we must have $f_h = h$ and $f_{h+1} = h+1$. In this case, from step $h+1$ to $m$ we fall behind $k$ steps. In other words, we can restrict ourselves to the strings $\mathbf{a}_{h+1:m} = a_{h+1}\cdots a_{m-1}$ and $\mathbf{b}_{h+1:m} = b_{h+1}\cdots b_{m-1}$. However, we have now restricted ourselves to strings which satisfy the conditions of Lemma 6, so we can bound the probability in this case as at most

$(2\delta)^K + \sum_{\substack{h_1,h_2,k_1,k_2 \ge 0,\ h_1+h_2+k_1+k_2 \ge k,\ k_1,k_2 \le K \\ (h_1,h_2,k_1,k_2) \ne (0,0,0,k),\,(0,0,k,0)}} \delta^{h_1+h_2}\,p_{k_1}(m-1)\,p_{k_2}(m-1).$

In the latter case, we must have $f_h = h-k$ and $f_{h+1} = h+1-k$. However, this is impossible, because $|a_h - b_{h-k}| > C_0\log n\sqrt{b_{h-k}}$, by our definition of $h$.

Overall, by adding all cases together, we obtain

$p_k(\mathbf{a},\mathbf{b}) \le (2\delta)^K + 2\sum_{\substack{h_1,h_2,k_1,k_2 \ge 0,\ h_1+h_2+k_1+k_2 \ge k,\ k_1,k_2 \le K \\ (h_1,h_2,k_1,k_2) \ne (0,0,0,k),\,(0,0,k,0)}} \delta^{h_1+h_2}\,p_{k_1}(m-1)\,p_{k_2}(m-1).$

Overall, this implies that

$p_k(m) \le (2\delta)^K + 2\sum_{\substack{h_1,h_2,k_1,k_2 \ge 0,\ h_1+h_2+k_1+k_2 \ge k,\ k_1,k_2 \le K \\ (h_1,h_2,k_1,k_2) \ne (0,0,0,k),\,(0,0,k,0)}} \delta^{h_1+h_2}\,p_{k_1}(m-1)\,p_{k_2}(m-1).$

We can now universally bound $p_k$ for all $0 \le k \le K$. To do so, we first recall some basic properties of the Catalan numbers.

Fact 8.

For $n \ge 0$, the Catalan numbers $\mathcal{C}_n$ are defined as $\mathcal{C}_n = \binom{2n}{n}/(n+1)$. (We use $\mathcal{C}_n$ rather than the more standard $C_n$ to avoid confusion with the constants $C_0, C_1, \dots$ we have defined.) They satisfy the following list of properties.

  1. $\mathcal{C}_0 = 1$ and for all $n \ge 0$, $\mathcal{C}_{n+1} = \sum_{i=0}^{n}\mathcal{C}_i\,\mathcal{C}_{n-i}$.

  2. For all $n \ge 1$, $\mathcal{C}_n \le \mathcal{C}_{n+1} \le 4\,\mathcal{C}_n$.

  3. For all $n \ge 0$, $\mathcal{C}_n \le 4^n$.

Lemma 9.

Assume $\delta \le \frac{1}{3\cdot100^3}$, and define $\mathfrak{D}_k := 100^{2k-1}\mathcal{C}_k$ for $k \ge 1$ and $\mathfrak{D}_0 = 1$. Then, for all $0 \le k \le K$, $p_k \le \mathfrak{D}_k\delta^k$.

Proof.

We prove the statement by induction on $m$. For $m = 1$, note that $w_0 = w_1 = 1$ with probability 1, so $f_1 \ge 1$. Indeed, either $f_1 = 1$ or $f_1 = \infty$. So, $p_0(m) \le 1$ and $p_k(m) = 0$ for all $k \ge 1$.

Now, suppose that the induction hypothesis holds for $m-1$: we now prove the statement for $m$. First, note that $p_0(m) = 1 = \mathfrak{D}_0\delta^0$. Next, for $k \ge 1$,

$p_k(m) \le (2\delta)^K + 2\sum_{\substack{h_1,h_2,k_1,k_2 \ge 0,\ h_1+h_2+k_1+k_2 \ge k,\ k_1,k_2 \le K \\ (h_1,h_2,k_1,k_2) \ne (0,0,0,k),\,(0,0,k,0)}} \delta^{h_1+h_2}\cdot\mathfrak{D}_{k_1}\delta^{k_1}\cdot\mathfrak{D}_{k_2}\delta^{k_2} \le (2\delta)^k + 2\sum_{\substack{h_1,h_2,k_1,k_2 \ge 0,\ h_1+h_2+k_1+k_2 \ge k \\ (h_1,h_2,k_1,k_2) \ne (0,0,0,k),\,(0,0,k,0)}} \mathfrak{D}_{k_1}\mathfrak{D}_{k_2}\,\delta^{h_1+h_2+k_1+k_2}. \quad (3)$

We now bound the summation in the above expression. First, we focus on the terms where one of $k_1$ or $k_2$ is 0. If $k_1 = k_2 = 0$, the summation becomes $\sum_{h_1+h_2\ge k}\delta^{h_1+h_2}$. If we fix $h_3 = h_1+h_2$, for each $h_3 \ge k$ there are $h_3+1$ choices of $(h_1,h_2)$, which means the summation is $\sum_{h_3\ge k}\delta^{h_3}(h_3+1)$. For $\delta \le \frac{1}{3\cdot100^3}$, each term is at most half the previous term, so this is at most $2(k+1)\delta^k$. Next, for $k_1 = 0, k_2 > 0$, if we fix $h_3 = h_1+h_2$, the summation is $\sum_{h_3+k_2\ge k,\ (h_3,k_2)\ne(0,k)}(h_3+1)\mathfrak{D}_{k_2}\delta^{h_3+k_2}$, since there are $h_3+1$ choices of $(h_1,h_2)$ with $h_1+h_2 = h_3$. We have a symmetric summation for $k_1 > 0, k_2 = 0$. Finally, if we focus on the terms with $k_1, k_2 \ge 1$, by writing $h_3 = h_1+h_2$ and $k_3 = k_1+k_2$, for any fixed $h_3, k_3$, the sum of $\mathfrak{D}_{k_1}\mathfrak{D}_{k_2}$ is at most $100^{2k_1+2k_2-2}\,\mathcal{C}_{k_3+1} \le \frac{1}{100^3}\mathfrak{D}_{k_3+1}$, and there are $h_3+1$ choices for $(h_1,h_2)$. So, the summation is at most $\frac{1}{100^3}\sum_{h_3+k_3\ge k}(h_3+1)\mathfrak{D}_{k_3+1}\delta^{h_3+k_3} \le \frac{4}{100}\sum_{h_3+k_3\ge k}(h_3+1)\mathfrak{D}_{k_3}\delta^{h_3+k_3}$, where the last inequality holds because $\frac{\mathfrak{D}_{k_3+1}}{\mathfrak{D}_{k_3}} = 100^2\,\frac{\mathcal{C}_{k_3+1}}{\mathcal{C}_{k_3}} \le 4\cdot100^2$.

Overall, replacing indices accordingly, we can write (3) as at most

$(2\delta)^k + 2\left(2(k+1)\delta^k + 2\sum_{\substack{a,b\ge0,\ a+b\ge k \\ (a,b)\ne(0,k)}}(a+1)\mathfrak{D}_b\,\delta^{a+b} + \frac{4}{100}\sum_{\substack{a,b\ge0 \\ a+b\ge k}}(a+1)\mathfrak{D}_b\,\delta^{a+b}\right) \le (2\delta)^k + 2\left(2(k+1)\delta^k + 3\sum_{\substack{a,b\ge0,\ a+b\ge k \\ (a,b)\ne(0,k)}}(a+1)\mathfrak{D}_b\,\delta^{a+b} + \frac{4}{100}\mathfrak{D}_k\delta^k\right).$

We can now focus on the middle summation term. If we first consider all terms with $b = 0$, the sum equals $\sum_{a\ge k}(a+1)\delta^a = (k+1)\delta^k + (k+2)\delta^{k+1} + \dots \le 2(k+1)\delta^k$, as long as $\delta \le \frac{1}{3\cdot100^3}$. For the remaining terms, we fix $d = a+b$ and consider the sum. If $d = k$, the sum equals $\delta^k(2\mathfrak{D}_{k-1} + 3\mathfrak{D}_{k-2} + \dots + k\mathfrak{D}_1)$. Since $\mathfrak{D}_{n+1} \ge 100^2\,\mathfrak{D}_n$ for all $n \ge 1$, this is at most $\delta^k\cdot4\mathfrak{D}_{k-1}$. For $d > k$, the sum equals $\delta^d(\mathfrak{D}_d + 2\mathfrak{D}_{d-1} + \dots + d\mathfrak{D}_1) \le 2\delta^d\mathfrak{D}_d$. Since $\mathfrak{D}_{d+1} \le 4\cdot100^2\,\mathfrak{D}_d$, as long as $\delta \le \frac{1}{3\cdot100^3}$ the terms $2\delta^d\mathfrak{D}_d$ decrease by a factor greater than 2 each time $d$ increases. So the sum over all $d > k$ is at most $4\delta^{k+1}\mathfrak{D}_{k+1}$. Overall, the summation in the middle term is at most $2(k+1)\delta^k + 4\mathfrak{D}_{k-1}\delta^k + 4\mathfrak{D}_{k+1}\delta^{k+1}$.

Overall, this means (3) is at most

$2^k\delta^k + 16(k+1)\delta^k + 24\mathfrak{D}_{k-1}\delta^k + 24\mathfrak{D}_{k+1}\delta^{k+1} + \frac{8}{100}\mathfrak{D}_k\delta^k. \quad (4)$

Now, note that $\mathfrak{D}_{k-1} \le \frac{1}{100}\mathfrak{D}_k$ for all $k \ge 1$, even for $k = 1$. Moreover, $\frac{\mathfrak{D}_{k+1}}{\mathfrak{D}_k} = 100^2\,\frac{\mathcal{C}_{k+1}}{\mathcal{C}_k} \le 4\cdot100^2$. Thus, (4) is at most

$\delta^k\left(2^k + 16(k+1) + \frac{32}{100}\mathfrak{D}_k + 96\cdot100^2\,\mathfrak{D}_k\delta\right).$

Assuming that $\delta \le \frac{1}{3\cdot100^3}$, this is at most $\delta^k(2^k + 16(k+1) + 0.64\,\mathfrak{D}_k)$, which can be verified to be at most $\delta^k\mathfrak{D}_k$ for all $k \ge 1$, by just using the fact that $\mathfrak{D}_k \ge 100^k$ for all $k \ge 1$. (For instance, at $k = 1$ this reads $2 + 32 + 0.64\cdot100 = 98 \le 100 = \mathfrak{D}_1$, and for $k \ge 2$ the terms $2^k + 16(k+1)$ are negligible compared to $100^k$.) This completes the inductive step.

We are now ready to prove Lemma 3.

Proof of Lemma 3.

If $f_m < m$, this means that either the event indicated by $p_1(\mathbf{a},\mathbf{b})$ occurs, or there exist indices $i < i'$ with $w_i = w_{i'} = 1$ but we fall behind at least $K+1$ steps from step $i$ to step $i'$.

Assuming $\delta \le \frac{1}{3\cdot100^3}$, the probability of $p_1(\mathbf{a},\mathbf{b})$ is at most $100\delta$. Alternatively, if there exist $i < i'$ with $w_i = w_{i'} = 1$ but we fall behind at least $K+1$ steps from step $i$ to step $i'$, there must exist such an $i, i'$ with minimal $i' - i$ (breaking ties arbitrarily). This could be because $w_{i+1} = w_{i+2} = \dots = w_{i+r} = 0$ for some $r \ge K/2$. However, the probability of there being $r \ge K/2$ consecutive indices with $w_{i+1} = w_{i+2} = \dots = w_{i+r} = 0$ is at most $n\delta^{K/2} \le \delta$.

The final option is that, if we look at the first index $i+r > i$ with $w_{i+r} = 1$, we have $r \le K/2$. This means that from step $i+r$ to $i'$, we must fall behind at least $K/2$ steps, and there could not have been any intermediate steps where we fell behind more than $K$ steps. Hence, if we restrict ourselves to the strings $\mathbf{a}_{i+r:i'}$ and $\mathbf{b}_{f_{i+r}:}$ (the suffix of $\mathbf{b}$ starting from position $f_{i+r}$), the event indicated by $p_{K/2}(\mathbf{a}_{i+r:i'}, \mathbf{b}_{f_{i+r}:})$ must occur, since conditioned on $f_{i+r}$ and the fact that $w_{i+r} = w_{i'} = 1$, the value $f_{i'}$ only depends on $\mathbf{a}_{i+r:i'}$, $\mathbf{b}$ starting from position $f_{i+r}$, and $w_{i+r+1},\dots,w_{i'-1}$.

In other words, there exist some contiguous subsequences $\mathbf{a}'$ and $\mathbf{b}'$ of $\mathbf{a}$ and $\mathbf{b}$, respectively, such that the event of $p_{K/2}(\mathbf{a}',\mathbf{b}')$ occurs. For any fixed $\mathbf{a}',\mathbf{b}'$, the probability is at most $(4\cdot100^2\,\delta)^{K/2}$. Since there are at most $n^2$ possible contiguous subsequences for each of $\mathbf{a}$ and $\mathbf{b}$, the overall probability is at most $n^4(4\cdot100^2\,\delta)^{K/2} \le 50\delta$, assuming that $\delta \le \frac{1}{3\cdot10^6}$ and $K = C_2\log n$ where $C_2$ is sufficiently large.

Overall, the probability of falling behind is at most $100\delta + \delta + 50\delta \le 200\delta$.

4 Full algorithm/analysis

Let us depict the true string $x \in \{0,1\}^n$ as $\underbrace{0\cdots0}_{a_0\text{ times}}1\underbrace{0\cdots0}_{a_1\text{ times}}1\cdots1\underbrace{0\cdots0}_{a_t\text{ times}}$, i.e., there are $t \ge 1$ ones, and the string starts and ends with a run of 0’s. This assumption can be made WLOG by padding the string with $L$ 0’s at the front and the end. For any $L$-separated string, doing this padding maintains the $L$-separated property, and we can easily simulate the padded trace by adding $\mathrm{Bin}(L, 1-\delta)$ 0’s at the front and $\mathrm{Bin}(L, 1-\delta)$ 0’s at the back. Once we reconstruct the padded string, we remove the padding to get $x$.

We assume we know the value of $t$. Indeed, the number of 1’s in a single trace $\tilde{x}$ is distributed as $\mathrm{Bin}(t, 1-\delta)$. So, by averaging the number of 1’s over $O(n\log n)$ random traces and dividing by $1-\delta$, we get an estimate of $t$ that is accurate within 0.1 with probability $1 - \frac{1}{n^{10}}$. Thus, by rounding, we know $t$ exactly with probability $1 - \frac{1}{n^{10}}$.
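A small sketch of this estimator (our own code, with hypothetical names):

```python
def estimate_t(traces: list[str], delta: float) -> int:
    # Each trace has Bin(t, 1 - delta) ones, so the empirical mean divided
    # by (1 - delta) concentrates around t; rounding recovers t exactly.
    avg_ones = sum(tr.count("1") for tr in traces) / len(traces)
    return round(avg_ones / (1 - delta))
```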

The main goal is now to learn the lengths $a_0, a_1, \dots, a_t$. If we learn these exactly just using the traces, this completes the proof. Our algorithm runs in two phases: a coarse estimation phase and a fine estimation phase. In the coarse estimation phase, we sequentially learn each $a_i$ up to error $O(\sqrt{a_i\log n})$. In the fine estimation phase, we learn each $a_i$ exactly, given the coarse estimates.

4.1 Coarse estimation

Fix some $0 \le m \le t$, and suppose that for all $i < m$, we have estimates $b_i$ satisfying $|b_i - (1-\delta)a_i| \le 10\sqrt{a_i}$. (If $m = 0$, then we have no estimates yet.) Our goal will be to provide an estimate $b_m$ such that $|b_m - (1-\delta)a_m| \le 10\sqrt{a_m}$.

Consider a trace $\tilde{x}$ of $x$. Let $w_0 = w_{t+1} = 1$ and for each $1 \le i \le t$, let $w_i$ be the indicator that the $i$th 1 is retained. Next, for each $0 \le i \le t$, let $\tilde{a}_i \sim \mathrm{Bin}(a_i, 1-\delta)$ represent the number of 0’s in the $i$th run that were not deleted. Note that with at least 0.99 probability, $|\tilde{a}_i - (1-\delta)a_i| \le 10\sqrt{a_i\log n}$ for all $i$. Since $|b_i - (1-\delta)a_i| \le 10\sqrt{a_i}$ for all $i < m$, this implies that $|\tilde{a}_i - b_i| \le 20\sqrt{b_i\log n}$ for all $i < m$.

Now, even though we have no knowledge of $\tilde{a}_i$ or $a_i$, we can still simulate the probabilistic process of Section 3. Let $0 = i_0 < i_1 < \dots < i_h = t+1$ be the list of all indices $i$ with $0 \le i \le t+1$ and $w_i = 1$. While we do not know the values $\tilde{a}_i$, for every pair of consecutive indices $i_q, i_{q+1}$, the value $\tilde{a}_{i_q:i_{q+1}}$ is exactly the number of 0’s between the $q$th and $(q+1)$st 1 in the trace $\tilde{x}$ (where we say that the 0th 1 is at position 0 and the $(t+1)$st 1 is at position $|\tilde{x}|+1$). In other words, if $r_q$ represents the position of the $q$th 1, then $\tilde{a}_{i_q:i_{q+1}} = r_{q+1} - r_q - 1$. Hence, because computing each $f_{i_{q+1}}$ only requires knowledge of $\mathbf{b}$ and the value of $\tilde{a}_{i_q:i_{q+1}}$, and since $f_{i_0} = f_0 = 0$, the algorithm can in fact compute $g_q := f_{i_q}$ for all $0 \le q \le h$, using the same process as described in Section 3, even if the values $i_q$ are not known.

Algorithm 1 simulates this process, assuming knowledge of $m$, $b_0,\dots,b_{m-1}$, a single trace $\tilde{x}$, and $t$. In Algorithm 1, we use the variable $\mathrm{val}$ to represent $g_q = f_{i_q}$, i.e., the current prediction of the position $i_q$. In other words, $\mathrm{val} - i_q$ equals the number of steps ahead (or $i_q - \mathrm{val}$ equals the number of steps behind) we are.

Algorithm 1 Locate the $m$th and $(m+1)$st 1 of $x$ in the trace $\tilde{x}$, and return the position and length of the gap.
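Since the pseudocode itself is not reproduced here, the following Python sketch (our own rendering; the name `align` and the default constant are assumptions) illustrates the procedure, with the acceptance test mirroring the $C_0\log n\sqrt{b_{j':j}}$ condition of Section 3:

```python
import math

def align(trace: str, b: list[int], m: int, C0: float = 10.0):
    # b holds the estimates b_0, ..., b_{m-1}, which already absorb the
    # (1 - delta) shrinkage. Returns (q, gap): the q-th 1 of the trace is
    # our guess for the m-th 1 of x, and gap is the number of 0s before
    # the next 1. Returns None (FAIL) if no plausible alignment exists.
    n = max(sum(b) + len(b) + 1, 2)
    logn = math.log2(n)
    ones = [i + 1 for i, c in enumerate(trace) if c == "1"]
    bounds = [0] + ones                    # virtual 0th 1 at position 0
    pre = [0]
    for gap in b:
        pre.append(pre[-1] + gap)          # pre[j] = b_{0:j}
    val, q = 0, 0                          # val tracks f_{i_q}
    while val < m:
        q += 1
        if q >= len(bounds):
            return None                    # ran out of 1s in the trace
        s = bounds[q] - bounds[q - 1] - 1  # observed gap before the q-th 1
        nxt = None
        for j in range(val + 1, m + 1):    # smallest feasible j ...
            for jp in range(val, j):       # ... with witness j' in [val, j)
                seg = pre[j] - pre[jp]     # b_{j':j}
                if abs(seg - s) <= C0 * logn * math.sqrt(seg):
                    nxt = j
                    break
            if nxt is not None:
                break
        if nxt is None:
            return None                    # FAIL
        val = nxt
    if q + 1 >= len(bounds):
        return None                        # the (m+1)-st 1 was deleted
    return q, bounds[q + 1] - bounds[q] - 1
```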
Lemma 10.

Fix $b_0,\dots,b_{m-1}$ such that $|b_i - (1-\delta)a_i| \le 10\sqrt{a_i}$ for all $0 \le i \le m-1$. With probability at least 0.98 over the randomness of $\tilde{x}$, we have that Algorithm 1 returns $q$ such that the $q$th 1 in $\tilde{x}$ corresponds to the $m$th 1 in $x$. Moreover, conditioned on this event holding, the distribution of $r_{q+1} - r_q - 1$ exactly follows $\mathrm{Bin}(a_m, 1-\delta)$.

Proof.

Let us first condition on the values $\tilde{a}_0,\dots,\tilde{a}_{m-1}$, assuming that $|\tilde{a}_i - (1-\delta)a_i| \le 10\sqrt{a_i\log n}$ for all $0 \le i \le m-1$. As discussed earlier, this occurs with at least 0.99 probability, and implies that $|\tilde{a}_i - b_i| \le 20\sqrt{b_i\log n}$ for all $i < m$.

Let us also condition on $w_m = 1$. By Lemma 2 and Lemma 3, the probability that $f_m = m$, for $\delta \le \frac{1}{3\cdot10^6}$, is at least 0.99. This is conditioned on $w_m = 1$ and the values $\tilde{a}_1,\dots,\tilde{a}_{m-1}$ (assuming $|\tilde{a}_i - b_i| \le 20\sqrt{b_i\log n}$). This means that with at least 0.99 probability, the algorithm finds the position $q$ with $i_q = m$. Since $f_m$ only depends on $\mathbf{b}$, $\tilde{\mathbf{a}}_{0:m}$ and $w_1,\dots,w_m$, with probability at least $0.99\cdot(1-\delta)\cdot0.99$ over the randomness of $w_1,\dots,w_m$ and $\tilde{a}_1,\dots,\tilde{a}_{m-1}$, we have that $w_m = 1$ and $i_q = m$. This is independent of $w_{m+1}$, so with probability at least $0.99^2(1-\delta)^2 \ge 0.98$, we additionally have that $w_{m+1} = 1$.

The event that $i_q = m$ means that $r_q$ is the position in $\tilde{x}$ of the $m$th 1 in the true string $x$. Moreover, since neither the $m$th nor $(m+1)$th 1 was deleted, $r_{q+1}$ is the position in $\tilde{x}$ of the $(m+1)$th 1 in the true string $x$. So, $r_{q+1} - r_q - 1$ is in fact the length of the gap between the $m$th and $(m+1)$th 1 after deletion, which means it has length $\tilde{a}_m \sim \mathrm{Bin}(a_m, 1-\delta)$, since $\tilde{a}_m$ is independent of the events that decide whether $w_m = w_{m+1} = 1$ and $i_q = m$.

Given this, we can crudely estimate every gap, in order. Namely, assuming that we have estimates $b_0,\dots,b_{m-1}$ (where $0 \le m \le t$), we can run the Align procedure on $O(\log n)$ independent traces. By a Chernoff bound, with $\frac{1}{n^{15}}$ failure probability, at least a 0.9 fraction of the traces will have the desired property of Lemma 10, so will output some $(q, b)$ where $b \sim \mathrm{Bin}(a_m, 1-\delta)$. Since $\mathrm{Bin}(a_m, 1-\delta)$ is in the range $a_m(1-\delta) \pm 10\sqrt{a_m}$ with at least 0.99 probability, at least a 0.75 fraction of the outputs $(q, b)$ will satisfy $|b - (1-\delta)a_m| \le 10\sqrt{a_m}$, with $\frac{1}{n^{15}}$ failure probability. Thus, by defining $b_m$ to be the median value of $b$ across the randomly drawn traces, we have that $|b_m - (1-\delta)a_m| \le 10\sqrt{a_m}$ with at least $1 - \frac{1}{n^{10}}$ probability.

By running this procedure iteratively to provide estimates $b_0, b_1, \dots, b_t$, we obtain Algorithm 2. The analysis in the above paragraph implies the following result.

Theorem 11 (Crude Approximation).

Algorithm 2 uses $O(n\log n)$ traces and polynomial time, and learns estimates $b_0, b_1, \dots, b_t$ such that with at least $1 - \frac{1}{n^9}$ probability, $|b_m - (1-\delta)a_m| \le 10\sqrt{a_m}$ for all $0 \le m \le t$.

Algorithm 2 Crude Estimation of all gaps.
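A sketch of this loop, reusing the hypothetical `align` sketch above (again our own code; it assumes enough fresh traces are supplied):

```python
import statistics

def crude_estimation(traces: list[str], t: int, reps: int) -> list[int]:
    # Estimate the gaps one at a time, running align on reps = O(log n)
    # fresh traces per gap and taking the median of the observed gaps.
    b = []
    it = iter(traces)
    for m in range(t + 1):
        samples = []
        for _ in range(reps):
            out = align(next(it), b, m)
            if out is not None:
                samples.append(out[1])   # ~ Bin(a_m, 1 - delta) when aligned
        b.append(int(statistics.median(samples)))
    return b
```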

4.2 Fine estimation

In this section, we show how to exactly compute each $a_m$ with high probability, given the crude estimates $b_0, b_1, \dots, b_t$. This will again be done using an alignment procedure, but this time running the alignment both “forward and backward”.

Namely, given a trace $\tilde{x}$, we will try to identify the $m$th and $(m+1)$st 1’s from the original string, but we try to identify the $m$th 1 by running Align on $\tilde{x}$ and the $(m+1)$st 1 by running Align on the reverse string $\mathrm{rev}(\tilde{x}) := \tilde{x}_{|\tilde{x}|}\cdots\tilde{x}_2\tilde{x}_1$. The idea is: assuming that we never go ahead in the alignment procedure, if we find some index $q$ in the forward alignment procedure with $g_q = f_{i_q} = m$, then the true position $i_q$ must be at least $m$. Likewise, if we do the alignment procedure in reverse until we believe we have found the $(t-m)$th 1 from the back (equivalently, the $(m+1)$th 1 from the front), the true position must be at most $m+1$.

So, the true position of the index found in the forward alignment procedure can only be earlier than that of the index from the backward alignment procedure if the true positions were exactly $m$ and $m+1$, respectively. Thus, by comparing the indices, we can effectively verify that the positions are correct, with negligible failure probability (rather than with $O(\delta)$ failure probability). This is the key towards obtaining the fine estimate of $a_m$, rather than just a coarse estimate that may be off by $O(\sqrt{a_m})$.

Algorithm 3 formally describes the fine alignment procedure, using $N = O(n\log n)$ traces, assuming we have already done the coarse estimation to find $b_0, b_1, \dots, b_t$.

Algorithm 3 Fine Estimation of all gaps.
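A sketch of this procedure, again in terms of the hypothetical `align` function above (our own code):

```python
def fine_estimation(traces: list[str], b: list[int], t: int,
                    delta: float) -> list[int]:
    # For each gap m, run align forward and backward; keep the gap only
    # when the two runs are consistent (q_f + q_b equals the number of 1s
    # in the trace), then average and rescale by 1 / (1 - delta).
    a = []
    for m in range(t + 1):
        samples = []
        for trace in traces:
            fwd = align(trace, b[:m], m)                     # m-th 1
            bwd = align(trace[::-1], b[m + 1:][::-1], t - m) # (t-m)-th from back
            if fwd is None or bwd is None:
                continue                   # a run returned FAIL: discard
            (qf, gap), (qb, _) = fwd, bwd
            if qf + qb == trace.count("1"):
                samples.append(gap)        # unbiased: E[gap] = (1-delta)*a_m
        a.append(round(sum(samples) / len(samples) / (1 - delta)))
    return a
```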
Lemma 12.

Suppose that $|b_i - (1-\delta)a_i| \le 10\sqrt{a_i}$ for all $0 \le i \le t$. Fix indices $0 \le m \le t$ and $1 \le i \le N$, and for simplicity of notation, let $\tilde{x} := \tilde{x}^{(i)}$. Let $\tilde{m}$ be the number of 1’s in $\tilde{x}$. Then, the probability that $q_f + q_b = \tilde{m}$, but either the forward or backward iteration finds an index in $\tilde{x}$ which does not correspond to the $m$th 1 or $(m+1)$th 1, respectively, from $x$, is at most $\frac{2}{n^{10}}$. Moreover, if the forward and backward iterations find indices in $\tilde{x}$ corresponding to the $m$th 1 and $(m+1)$th 1, respectively, then $q_f + q_b = \tilde{m}$. Finally, the probability of finding both corresponding indices is at least 0.9.

Proof.

First, let us consider the forward alignment procedure. We know that $\mathrm{val}$ tracks $f_{i_q}$ when looking at the $q$th 1 of $\tilde{x}$ (from left to right). So, if we do not return FAIL, then $f_{i_{q_f}} = m$. If $i_{q_f} < m$, this implies there is an index $i = i_{q_f}$ where $f_i > i$. The probability of this is at most $n^{-10}$, by Lemma 2. Otherwise, $i_{q_f} \ge m$, meaning that the $q_f$th 1 in $\tilde{x}$ is after (or equal to) the $m$th 1 in $x$.

Likewise, if we consider the backward alignment procedure, if we do not return FAIL, then except for an event with probability at most $n^{-10}$, the $q_b$th 1 in $\mathrm{rev}(\tilde{x})$ is ahead of (or equal to) the $(t-m)$th 1 in $\mathrm{rev}(x)$. Equivalently, the $(\tilde{m}+1-q_b)$th 1 in $\tilde{x}$ (reading from left to right) is before (or equal to) the $(m+1)$th 1 in $x$ (reading from left to right).

So, barring a $\frac{2}{n^{10}}$ probability event, the only way that the $q_f$th 1 in $\tilde{x}$ is strictly before the $(\tilde{m}+1-q_b)$th 1 in $\tilde{x}$ is if the $q_f$th 1 in $\tilde{x}$ is precisely the $m$th 1 in $x$ and the $(\tilde{m}+1-q_b)$th 1 in $\tilde{x}$ is precisely the $(m+1)$th 1 in $x$. However, if $q_f + q_b = \tilde{m}$, then in fact the $q_f$th 1 is before the $(\tilde{m}+1-q_b)$th 1 in $\tilde{x}$ (reading from left to right). This proves the first statement.

Next, if we in fact found the corresponding indices, they are consecutive 1’s in $x$, which means they must be consecutive 1’s in $\tilde{x}$. So, if we found the $q_f$th 1 from the left, and the $q_b$th 1 from the right, we must have $q_f + q_b = \tilde{m}$.

Finally, the event of finding both corresponding indices is equivalent to $f_m = m$ in the forward iteration and $f_{t-m} = t-m$ in the backward iteration. Conditioned on the corresponding 1’s not being deleted, each of these occurs with at least 0.98 probability, by Lemmas 2 and 3. So, the overall probability is at least 0.9.

We are now ready to prove Theorem 1. Indeed, given the accuracy of the crude estimation procedure, it suffices to check that for each $m$, we compute $a_m$ correctly with at least $1 - n^{-5}$ probability.

Theorem 13 (Fine Estimation).

Assume that $t$, the number of ones in $x$, is computed correctly, and that for all $0 \le m \le t$, $|b_m - (1-\delta)a_m| \le 10\sqrt{a_m}$.

Then, for any fixed $m$ with $0 \le m \le t$, with at least $1 - n^{-5}$ probability, we compute the gap $a_m$ correctly.

Proof.

For any fixed iteration $i$ with $1 \le i \le N$, if both the forward and backward procedures correctly identify the $m$th and $(m+1)$th 1’s from the left, respectively, then $q_f + q_b = \tilde{m}$ by Lemma 12. In this case, we will compute an actual value $b^{(i)} = b_f$. Moreover, as discussed in the proof of Lemma 10, the event that the forward procedure correctly identifies the right 1 only depends on $\mathbf{b}$, $\tilde{a}_0,\dots,\tilde{a}_{m-1}$, and the events of whether the first $m$ 1’s are deleted. Likewise, the event that the backward procedure correctly identifies the right 1 only depends on $\mathbf{b}$, $\tilde{a}_{m+1},\dots,\tilde{a}_t$, and the events of whether the $(m+1)$th 1 through the $t$th 1 are deleted.

Thus, the forward and backward procedures correctly identifying the right 1’s is independent of $\tilde{a}_m \sim \mathrm{Bin}(a_m, 1-\delta)$. Moreover, in this case, $b_f$ is precisely $\tilde{a}_m$, since $q_f$ is the position in $\tilde{x}$ corresponding to the $m$th 1 in $x$, and neither the $m$th nor $(m+1)$th 1 can be deleted if both of these 1’s are identified.

So, if the forward and backward procedures identify the right 1’s for trace $\tilde{x}^{(i)}$, the conditional distribution of $b^{(i)}$ is $\mathrm{Bin}(a_m, 1-\delta)$. However, we really want to look at the distribution conditioned on the event $q_f + q_b = \tilde{m}$. Indeed, by Lemma 12, this event is equivalent to either the forward and backward procedures identifying the right 1’s, or some other event which occurs with at most $\frac{2}{n^{10}}$ probability. Because $b^{(i)}$ is clearly between 0 and $n$, and since the probability of both 1’s being correctly identified is at least 0.9 by Lemma 12, the expectation of $b^{(i)}$, conditioned on not being NULL, is $a_m(1-\delta) \pm O(n^{-10}\cdot n) = a_m(1-\delta) \pm O(n^{-9})$.

By a Chernoff bound, the number of $1 \le i \le N$ with $b^{(i)} \ne \mathrm{NULL}$ is at least $0.5N$ with at least $1 - n^{-10}$ probability, since in expectation it is at least $0.9N$. Then, by another Chernoff bound, the empirical average of all such $b^{(i)}$ is within 0.1 of its expectation, which is $a_m(1-\delta) \pm O(n^{-9})$, with $1 - n^{-10}$ probability. Thus, taking the empirical average and dividing by $1-\delta$, with at most $O(n^{-10})$ failure probability, $\frac{1}{1-\delta}$ times the average of all non-null $b^{(i)}$’s is within 0.2 of $a_m$, and thus rounds to $a_m$.

5 Conclusion and Open Questions

In this paper, we established that the trace reconstruction problem can be solved with a polynomial number of traces, as long as any two ones in the initial string are separated by at least $\mathrm{polylog}\,n$ zeros and the deletion probability is at most a sufficiently small constant. It is an interesting open question to handle more general deserts, such as $(01)^n = 0101\cdots01$, interspersed with mildly separated zeros and ones. Indeed, we believe that this is an important step towards solving the general trace reconstruction problem with deletion probability $\delta = n^{-o(1)}$. With this deletion probability, the Bitwise Majority Alignment (BMA) algorithm from [3] succeeds in reconstructing $x$ as long as $x$ does not contain any such highly repetitive contiguous substrings. If one can provide a separate algorithm for such strings, one could then imagine $x$ being partitioned into contiguous substrings that can be reconstructed by BMA and by the algorithm for highly repetitive strings in an alternating fashion. Additional work is required to determine how to switch between the two algorithms.

References

  • [1] Frank Ban, Xi Chen, Adam Freilich, Rocco A. Servedio, and Sandip Sinha. Beyond trace reconstruction: Population recovery from the deletion channel. In Foundations of Computer Science (FOCS), pages 745–768, 2019. doi:10.1109/FOCS.2019.00050.
  • [2] Frank Ban, Xi Chen, Rocco A. Servedio, and Sandip Sinha. Efficient average-case population recovery in the presence of insertions and deletions. In Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques, pages 44:1–44:18, 2019. doi:10.4230/LIPIcs.APPROX-RANDOM.2019.44.
  • [3] Tugkan Batu, Sampath Kannan, Sanjeev Khanna, and Andrew McGregor. Reconstructing strings from random traces. In Symposium on Discrete Algorithms (SODA), pages 910–918, 2004. URL: http://dl.acm.org/citation.cfm?id=982792.982929.
  • [4] Joshua Brakensiek, Ray Li, and Bruce Spang. Coded trace reconstruction in a constant number of traces. In Foundations of Computer Science (FOCS), 2020. arXiv:1908.03996.
  • [5] Diptarka Chakraborty, Debarati Das, and Robert Krauthgamer. Approximate trace reconstruction via median string (in average-case). In Foundations of Software Technology and Theoretical Computer Science (FSTTCS), volume 213 of LIPIcs, pages 11:1–11:23. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.FSTTCS.2021.11.
  • [6] Zachary Chase. New lower bounds for trace reconstruction. Ann. Inst. H. Poincaré Probab. Statist., 57(2), 2021. URL: http://arxiv.org/abs/1905.03031.
  • [7] Zachary Chase. Separating words and trace reconstruction. In Symposium on Theory of Computing (STOC), 2021.
  • [8] Zachary Chase and Yuval Peres. Approximate trace reconstruction of random strings from a constant number of traces. CoRR, abs/2107.06454, 2021.
  • [9] Xi Chen, Anindya De, Chin Ho Lee, Rocco A. Servedio, and Sandip Sinha. Polynomial-time trace reconstruction in the low deletion rate regime. In Innovations in Theoretical Computer Science (ITCS), 2021. arXiv:2012.02844.
  • [10] Xi Chen, Anindya De, Chin Ho Lee, Rocco A. Servedio, and Sandip Sinha. Polynomial-time trace reconstruction in the smoothed complexity model. In Symposium on Discrete Algorithms (SODA), 2021. arXiv:2008.12386.
  • [11] Xi Chen, Anindya De, Chin Ho Lee, Rocco A. Servedio, and Sandip Sinha. Near-optimal average-case approximate trace reconstruction from few traces. In Symposium on Discrete Algorithms (SODA), 2022. arXiv:2107.11530.
  • [12] Xi Chen, Anindya De, Chin Ho Lee, Rocco A. Servedio, and Sandip Sinha. Approximate trace reconstruction from a single trace. In Symposium on Discrete Algorithms (SODA), 2023. doi:10.48550/arXiv.2211.03292.
  • [13] Mahdi Cheraghchi, Ryan Gabrys, Olgica Milenkovic, and João Ribeiro. Coded trace reconstruction. IEEE Trans. Inf. Theory, 66(10):6084–6103, 2020. doi:10.1109/TIT.2020.2996377.
  • [14] Sami Davies, Miklos Racz, and Cyrus Rashtchian. Reconstructing trees from traces. In Conference On Learning Theory (COLT), pages 961–978, 2019. URL: http://proceedings.mlr.press/v99/davies19a.html.
  • [15] Sami Davies, Miklós Z. Rácz, Benjamin G. Schiffer, and Cyrus Rashtchian. Approximate trace reconstruction: Algorithms. In International Symposium on Information Theory (ISIT), pages 2525–2530. IEEE, 2021. doi:10.1109/ISIT45174.2021.9517926.
  • [16] Anindya De, Ryan O’Donnell, and Rocco A. Servedio. Optimal mean-based algorithms for trace reconstruction. Annals of Applied Probability, 29(2):851–874, 2019. doi:10.1214/18-AAP1394.
  • [17] Lisa Hartung, Nina Holden, and Yuval Peres. Trace reconstruction with varying deletion probabilities. In Analytic Algorithmics and Combinatorics (ANALCO), pages 54–61, 2018. doi:10.1137/1.9781611975062.6.
  • [18] Nina Holden and Russell Lyons. Lower bounds for trace reconstruction. Annals of Applied Probability, 30(2):503–525, 2020. doi:10.1214/19-AAP1506.
  • [19] Nina Holden, Robin Pemantle, and Yuval Peres. Subpolynomial trace reconstruction for random strings and arbitrary deletion probability. In Conference On Learning Theory (COLT), pages 1799–1840, 2018. URL: http://proceedings.mlr.press/v75/holden18a.html.
  • [20] Thomas Holenstein, Michael Mitzenmacher, Rina Panigrahy, and Udi Wieder. Trace reconstruction with constant deletion probability and related results. In Symposium on Discrete Algorithms (SODA), pages 389–398, 2008. doi:10.1145/1347082.1347125.
  • [21] Sampath Kannan and Andrew McGregor. More on reconstructing strings from random traces: insertions and deletions. In International Symposium on Information Theory (ISIT), pages 297–301, 2005. doi:10.1109/ISIT.2005.1523342.
  • [22] Akshay Krishnamurthy, Arya Mazumdar, Andrew McGregor, and Soumyabrata Pal. Trace reconstruction: Generalized and parameterized. IEEE Trans. Inf. Theory, 67(6):3233–3250, 2021. doi:10.1109/TIT.2021.3066010.
  • [23] Vladimir I. Levenshtein. Efficient reconstruction of sequences. IEEE Trans. Information Theory, 47(1):2–22, 2001. doi:10.1109/18.904499.
  • [24] Vladimir I. Levenshtein. Efficient reconstruction of sequences from their subsequences or supersequences. J. Comb. Theory, Ser. A, 93(2):310–332, 2001. doi:10.1006/jcta.2000.3081.
  • [25] Andrew McGregor, Eric Price, and Sofya Vorotnikova. Trace reconstruction revisited. In European Symposium on Algorithms (ESA), pages 689–700, 2014. doi:10.1007/978-3-662-44777-2_57.
  • [26] Andrew McGregor and Rik Sengupta. Graph reconstruction from random subgraphs. In International Colloquium on Automata, Languages, and Programming (ICALP), volume 229, pages 96:1–96:18, 2022. doi:10.4230/LIPICS.ICALP.2022.96.
  • [27] Andrew McGregor and Rik Sengupta. Graph reconstruction from noisy random subgraphs. CoRR, abs/2405.04261, 2024. doi:10.48550/arXiv.2405.04261.
  • [28] Shyam Narayanan. Improved algorithms for population recovery from the deletion channel. In Symposium on Discrete Algorithms (SODA), pages 1259–1278. SIAM, 2021. doi:10.1137/1.9781611976465.77.
  • [29] Shyam Narayanan and Michael Ren. Circular trace reconstruction. In Innovations in Theoretical Computer Science (ITCS), 2021. arXiv:2009.01346.
  • [30] Fedor Nazarov and Yuval Peres. Trace reconstruction with $\exp(O(n^{1/3}))$ samples. In Symposium on Theory of Computing (STOC), pages 1042–1046, 2017. doi:10.1145/3055399.3055494.
  • [31] Yuval Peres and Alex Zhai. Average-case reconstruction for the deletion channel: Subpolynomially many traces suffice. In Foundations of Computer Science (FOCS), pages 228–239, 2017. doi:10.1109/FOCS.2017.29.
  • [32] Ittai Rubinstein. Average-case to (shifted) worst-case reduction for the trace reconstruction problem. In International Colloquium on Automata, Languages, and Programming (ICALP), volume 261 of LIPIcs, pages 102:1–102:20, 2023. URL: https://arxiv.org/abs/2207.11489.
  • [33] Alec Sun and William Yue. The trace reconstruction problem for spider graphs. Discrete Mathematics, 346(1):113115, 2023. doi:10.1016/J.DISC.2022.113115.
  • [34] Krishnamurthy Viswanathan and Ram Swaminathan. Improved string reconstruction over insertion-deletion channels. In Symposium on Discrete Algorithms (SODA), pages 399–408, 2008. doi:10.1145/1347082.1347126.