Near-Optimal Trace Reconstruction for Mildly Separated Strings
Abstract
In the trace reconstruction problem our goal is to learn an unknown string $x \in \{0,1\}^n$ given independent traces of $x$. A trace is obtained by independently deleting each bit of $x$ with some probability $\delta$ and concatenating the remaining bits. It is a major open question whether the trace reconstruction problem can be solved with a polynomial number of traces when the deletion probability $\delta$ is constant. The best known upper and lower bounds are respectively $\exp(\widetilde{O}(n^{1/5}))$ [7] and $\widetilde{\Omega}(n^{3/2})$ [6]. Our main result is that if the string $x$ is mildly separated, meaning that the number of zeros between any two ones in $x$ is at least $C \log n$ for a sufficiently large constant $C$, and if the deletion probability $\delta$ is a sufficiently small constant, then the trace reconstruction problem can be solved with $\widetilde{O}(n)$ traces and in polynomial time.
Keywords and phrases: Trace Reconstruction, deletion channel, sample complexity
Category: Track A: Algorithms, Complexity and Games
Funding: Anders Aamand: This work was supported by the DFF-International Postdoc Grant 0164-00022B and by the VILLUM Foundation grants 54451 and 16582.
2012 ACM Subject Classification: Theory of computation → Design and analysis of algorithms
1 Introduction
Trace reconstruction is a well-studied problem at the interface of string algorithms and learning theory. Informally, the goal of trace reconstruction is to recover an unknown string given several independent noisy copies of the string.
Formally, fix an integer $n$ and a deletion parameter $\delta \in (0,1)$. Let $x = x_1 x_2 \cdots x_n \in \{0,1\}^n$ be an unknown binary string, with $x_i$ representing the $i$th bit of $x$. Then, a trace of $x$ is generated by deleting every bit independently with probability $\delta$ (and retaining it otherwise), and concatenating the retained bits together. For instance, if $x = 10110$ and we delete the second and third bits, the trace would be $110$ (from the first, fourth, and fifth bits of $x$). For a fixed string $x$, note that a trace follows some distribution over bitstrings, where the randomness comes from which bits are deleted. In trace reconstruction, we assume we are given i.i.d. traces $\tilde{x}^{(1)}, \ldots, \tilde{x}^{(m)}$ of $x$, and our goal is to recover the original string $x$ with high probability.
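As a concrete reference point, the deletion channel is straightforward to simulate. Below is a minimal Python sketch (the function name and the string representation are our own choices, not from the paper):

```python
import random

def trace(x: str, delta: float, rng: random.Random = random) -> str:
    """Generate one trace of x: delete each bit independently with
    probability delta and concatenate the retained bits."""
    return "".join(bit for bit in x if rng.random() >= delta)

# With x = "10110", a run in which the second and third bits happen to be
# deleted yields trace "110", matching the example above.
```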
The trace reconstruction problem has been a very well studied problem over the past two decades [23, 24, 3, 21, 20, 34, 25, 16, 30, 31, 17, 18, 19, 6, 10, 9, 7, 32]. There have also been numerous generalizations or variants of trace reconstruction studied in the literature, including coded trace reconstruction [13, 4], reconstructing mixture models [1, 2, 28], reconstructing alternatives to strings [14, 22, 29, 26, 33, 27], and approximate trace reconstruction [15, 8, 5, 11, 12].
In perhaps the most well-studied version of trace reconstruction, $x$ is assumed to be an arbitrary $n$-bit string and the deletion parameter $\delta$ is assumed to be a fixed constant independent of $n$. In this case, the best known algorithm requires $\exp(\widetilde{O}(n^{1/5}))$ random traces to reconstruct $x$ with high probability [7]. As we do not know of any polynomial-time (or even polynomial-sample) algorithms for trace reconstruction, there have been many works making distributional assumptions on the string $x$, such as $x$ being a uniformly random string [20, 25, 31, 19, 32] or being drawn from a “smoothed” distribution [10]. An alternative assumption is that the string is parameterized, meaning that $x$ comes from a certain “nice” class of strings that may be amenable to efficient algorithms [22, 15].
In this work, we also wish to understand parameterized classes of strings for which we can solve trace reconstruction efficiently. Indeed, we give an algorithm, using a near-linear number of traces and polynomial runtime, that works for a general class of strings that we call $d$-separated strings. This significantly broadens the classes of strings for which polynomial-time algorithms are known [22].
Main Result
Our main result concerns trace reconstruction of strings that are mildly separated. We say that a string is $d$-separated if the number of zeros between any two consecutive ones is at least $d$. Depicting a string $x$ with $k$ ones as
$$x = 0^{g_0}\, 1\, 0^{g_1}\, 1\, \cdots\, 1\, 0^{g_k},$$
it is $d$-separated if and only if $g_i \geq d$ for each $i$ with $1 \leq i \leq k-1$. Note that we make no assumptions on $g_0$ or $g_k$.
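In code, the gap decomposition and the separation condition look as follows (a small illustrative sketch; the names `gaps` and `is_separated` are ours):

```python
def gaps(x: str) -> list:
    """Return (g_0, ..., g_k): the lengths of the zero-runs before the
    first one, between consecutive ones, and after the last one."""
    return [len(run) for run in x.split("1")]

def is_separated(x: str, d: int) -> bool:
    """A string is d-separated if every *interior* gap g_1, ..., g_{k-1}
    has length at least d; no assumption is made on g_0 or g_k."""
    g = gaps(x)
    return all(gi >= d for gi in g[1:-1])

assert gaps("0001010000") == [3, 1, 4]
assert is_separated("0001010000", 1) and not is_separated("0001010000", 2)
```

Our main result is as follows.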
Theorem 1.
There exists an algorithm that solves the trace reconstruction problem with high probability in $n$ on any $d$-separated string $x \in \{0,1\}^n$, provided that $d \geq C \log n$ for a universal constant $C$, and that the deletion probability $\delta$ is at most some universal constant $\delta_0 > 0$. The algorithm uses $m$ independently sampled traces of $x$, where $m = \widetilde{O}(n)$, and runs in $\mathrm{poly}(n)$ time.
We note that the number of traces is nearly optimal. Even distinguishing between two strings which contain only a single one, at positions $n/2$ and $n/2+1$ respectively, requires $\Omega(n)$ traces to succeed with probability $2/3$.
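To see where such a bound comes from, here is a back-of-the-envelope calculation (a heuristic sketch for the single-one instance above, not the formal argument). If $x$ has a single one at position $p$, then conditioned on that one surviving (probability $1-\delta$), its position in the trace is $1 + \mathrm{Bin}(p-1, 1-\delta)$:

```latex
% Distinguishing p = n/2 from p = n/2 + 1 amounts to distinguishing
%   Bin(n/2, 1-\delta)  vs.  Bin(n/2 + 1, 1-\delta):
% the means differ by 1-\delta while both standard deviations are
% \Theta(\sqrt{n}), so the squared Hellinger distance per trace is
H^2\Big(\mathrm{Bin}\big(\tfrac{n}{2},\,1-\delta\big),\;
        \mathrm{Bin}\big(\tfrac{n}{2}+1,\,1-\delta\big)\Big) = O(1/n).
% Squared Hellinger distance is subadditive over i.i.d. samples, so m
% traces give distinguishing power O(m/n); constant success probability
% therefore forces m = \Omega(n).
```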
While trace reconstruction is known to be solvable very efficiently for random strings [19, 32], there are certain structured classes of strings that appear to be natural hard instances for existing approaches. Our algorithm can be seen as solving one basic class of hard instances. It is worth noting the work of [9], which studies the trace reconstruction problem when the deletion probability is sub-constant. They show that the simple Bitwise Majority Alignment (BMA) algorithm from [3] can succeed with deletion probability $O(1/\log n)$ as long as the original string does not contain deserts – highly repetitive blocks where some short substring is repeated many times. They then combine this with an algorithm for reconstructing repetitive blocks – but this part of their algorithm requires a significantly smaller, polynomially small deletion probability. This suggests that strings containing many repetitive blocks are a natural hard instance and a good test-bed for developing new algorithms and approaches. $d$-separated strings can be thought of as the simplest class of highly repetitive strings (where the repeating pattern is just a single zero), in which every repetition has length at least $d$.
Comparison to Related Work
Most closely related to our work is the result by Krishnamurthy et al. [22], stating that if $x$ is sufficiently sparse, with each pair of ones separated by a run of zeros whose required length grows with the number of ones, then $x$ can be recovered in polynomial time from polynomially many traces. In particular, for strings with very few ones, the required separation is milder than ours, albeit still super-constant. Our algorithm works in general assuming an $O(\log n)$ separation of the ones but with no additional requirement on the number of ones: indeed, we could have as many as $\Theta(n/\log n)$ ones. With no sparsity assumptions, [22] would need a separation that is nearly linear in $n$, as an $O(\log n)$-separated string can have $\Theta(n/\log n)$ ones in the worst case. The techniques of [22] are also very different from ours. They recursively cluster the positions of the ones in the observed traces to correctly align a large fraction of the ones in the observed traces to ones in the string $x$. In contrast, our algorithm works quite differently and is of a more sequential nature, processing the traces from left to right (or right to left). See Section 1.1 for a discussion of our algorithm.
Another paper studying strings with long runs is by Davies et al. [15]. They consider approximate trace reconstruction, specifically how many traces are needed to approximately reconstruct $x$ up to small edit distance under various assumptions on the lengths of the runs of zeros and ones in $x$. Among other results, and most closely related to ours, they show that one can approximately reconstruct $x$ from polynomially many traces if the runs of zeros are sufficiently long and if the runs of ones all have the same constant length (e.g., they could all have length one, as in our paper). However, for exact reconstruction, they would need to drive the edit-distance parameter so low that their trace complexity is no longer polynomial, which means they do not provide any nontrivial guarantees in our setting.
1.1 Technical Contributions
In this section, we give a high-level overview of our techniques. Recall that we want to reconstruct a string $x$ from independent traces, where we assume that $x$ is mildly separated. More concretely, we assume that there are numbers $g_0, g_1, \ldots, g_k$ such that $x$ consists of $g_0$ zeros followed by a one, followed by $g_1$ zeros followed by a one, and so on, with the last $g_k$ bits of $x$ being zero. Writing $p_i = i + \sum_{j=0}^{i-1} g_j$, we thus have that there are $k$ ones in $x$, at positions $p_i$ for $1 \leq i \leq k$.
Note that a retained bit in a trace $\tilde{x}$ naturally corresponds to a bit in $x$. More formally, for a trace $\tilde{x}$ of length $m$, let $t_1 < t_2 < \cdots < t_m$ be the positions in $x$ where the bit was retained when generating $\tilde{x}$, so that $\tilde{x}_j = x_{t_j}$. Then, the correspondence is defined by the map from $[m]$ to $[n]$ mapping $j \mapsto t_j$. We think of this map as the correct alignment of $\tilde{x}$ to $x$.
Our main technical contribution is an alignment algorithm (see Algorithm 1) which takes in some $i \leq k$ and estimates $\tilde{g}_0, \ldots, \tilde{g}_{i-1}$ of $g_0, \ldots, g_{i-1}$ satisfying that for all $j$, $|\tilde{g}_j - g_j| = O(\sqrt{g_j \log n})$, and correctly aligns the one in a trace $\tilde{x}$ corresponding to the $i$th one of $x$ with probability $1 - O(\delta)$ (where the randomness is over the draw of $\tilde{x}$ – naturally, this requires that the $i$th one of $x$ was not deleted).
Moreover, we ensure that the alignment procedure, with high probability, say $1 - n^{-10}$, never aligns a one in $\tilde{x}$ too far to the right in $x$: if the one in $\tilde{x}$ corresponding to the $i$th one of $x$ is aligned to the $i'$th one of $x$, then $i' \leq i$. We will refer to this latter property by saying that the algorithm is never ahead with high probability. If $i' < i$, we say that the algorithm is behind. Thus, to show that the algorithm correctly aligns the $i$th one, it suffices to show that the probability that the algorithm is behind is $O(\delta)$.
We first discuss how to implement this alignment procedure, and afterwards we discuss how to use it to complete the reconstruction.
The alignment procedure of Algorithm 1
The main technical challenge of this paper is the analysis of Algorithm 1. Let us first describe, at a high level, how the algorithm works. Suppose that the trace $\tilde{x}$ consists of $h_0$ zeros followed by a one, followed by $h_1$ zeros followed by a one, and so on. The algorithm first attempts to align the first one in $\tilde{x}$ with a one in $x$ by finding the minimal $a_1$ such that $h_0/(1-\delta)$ is within $C\sqrt{(\tilde{g}_0 + \cdots + \tilde{g}_{a_1 - 1})\log n}$ of $\tilde{g}_0 + \cdots + \tilde{g}_{a_1 - 1}$, for a sufficiently large constant $C$. Inductively, having determined $a_{t-1}$ (that is, the alignment of the $(t-1)$st one of $\tilde{x}$), it looks for the minimal $a_t > a_{t-1}$ satisfying that $h_{t-1}/(1-\delta)$ is within the analogous tolerance of $\tilde{g}_{a_{t-1}} + \cdots + \tilde{g}_{a_t - 1}$. Intuitively, when looking at the $t$th one in the trace, we want to find the earliest possible location in the real string (which has gaps estimated by $\tilde{g}$) that could plausibly align with the one in the trace.
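The following Python sketch illustrates this greedy rule. It is a schematic rendering of the idea only, not the paper's Algorithm 1: the concrete tolerance, the constant `C`, and the rescaling by $1/(1-\delta)$ are our simplifications.

```python
import math

def greedy_align(trace_gaps, g_est, delta, C=10.0):
    """Schematic greedy alignment: trace_gaps[t] is the number of zeros
    observed between the t-th and (t+1)-st ones of the trace (with a
    virtual one at position 0); g_est = (g_0, g_1, ...) are estimated
    gaps of x. Returns a list whose t-th entry is the (1-indexed) index
    of the one of x aligned to the (t+1)-st one of the trace, or None
    if no plausible alignment exists."""
    n_rough = sum(g_est) + len(g_est)   # rough length scale for log factors
    prev, alignment = 0, []
    for obs in trace_gaps:
        target = obs / (1.0 - delta)    # undo expected shrinkage of zero-runs
        a, total = prev, 0.0
        while a < len(g_est):           # earliest plausible index wins
            total += g_est[a]
            a += 1
            if abs(total - target) <= C * math.sqrt(max(total, 1.0) * math.log(n_rough + 2)):
                break
        else:
            return None                 # fell off the end: alignment failed
        alignment.append(a)
        prev = a
    return alignment
```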
It is relatively easy to check that the algorithm is never ahead with very high probability. Indeed, by concentration bounds on the number of deleted zeros and the fact that $\tilde{g}_j \approx g_j$ for all $j$, it always has the option of aligning the $(t+1)$st one in $\tilde{x}$ to the correct one in $x$. However, it might align to an earlier one in $x$, since it is looking for the minimum $a$ such that an alignment is possible. For a very simple example, suppose that $\tilde{g} = g$ and that $g$ is periodic, say $g_j = d$ for all $j$. If the first $r$ ones of $x$ are deleted and the $(r+1)$st one is retained, the algorithm can align the retained one (which corresponds to the $(r+1)$st one of $x$) with the first one of $x$, resulting in the aligning algorithm being $r$ steps behind. Moreover, the algorithm can then remain $r$ steps behind all the way up to the $k$th one of $x$. The probability of this happening is $\delta^r$. To prove that the probability of the algorithm being behind when aligning the $i$th one of $x$ is $O(\delta)$, we prove a much stronger statement which is amenable to an inductive proof, essentially stating that this is the worst that can happen: the probability of the algorithm being $r$ steps behind at any fixed point is bounded by $(C\delta)^{\Omega(r)}$ for a constant $C$. In particular, we show that there is a sort of amortization – whenever there is a substring that can cause the algorithm to fall further behind with some probability (i.e., if certain bits are deleted), the substring also helps the algorithm catch back up if it is already behind.
The algorithm is not too far behind
Proving that the algorithm cannot be too far behind, i.e., is $r$ steps behind with probability at most $(C\delta)^{\Omega(r)}$, is perhaps the most challenging technical part of our paper. We discuss some of the ideas behind proving this result.
The first step towards proving this lemma is to attempt to prove an even stronger statement: that even if the current estimates $\tilde{g}$ are totally arbitrary (perhaps not similar to $g$ at all), we will still not be far behind. This is not too far-fetched since, for general $\tilde{g}$, we might actually start jumping ahead. For example, if $\tilde{g}_1$ is much smaller than $g_1$ and we do not delete the first one, we will predict the first one in the trace to have come from a later location. This will end up being proven by induction on the length of the string.
Now, condition on the $k$th one from the true string not being deleted, and consider the probability of being $r$ steps behind after seeing this bit. Recall that the true gaps between the ones until the $k$th one are $g_1, \ldots, g_k$, but the algorithm believes the gaps are $\tilde{g}_1, \ldots, \tilde{g}_{\tilde{k}}$, and the algorithm believes we have just gone through all $\tilde{k}$ gaps so far. Let $s_1$ be the smallest value $s$ where $g_{k-s} \approx \tilde{g}_{\tilde{k}-s}$ doesn't hold (here, think of the allowed error as being much larger than $\sqrt{d}$, so it would be easy to distinguish between these gaps even with random deletions). Let $s_2$ be the smallest value $s$ where $g_{k-r-s} \approx \tilde{g}_{\tilde{k}-s}$ doesn't hold. This can be thought of in terms of reading the sequences $g$ and $\tilde{g}$ backward, from where the algorithm thinks the gaps are from the sequence $\tilde{g}$ whereas the gaps actually come from $g$. The idea is that if we are aligned after the $s_1$th-to-last one (which is after the gaps $g_{k-s_1}$ and $\tilde{g}_{\tilde{k}-s_1}$), the difference between $g_{k-s_1}$ and $\tilde{g}_{\tilde{k}-s_1}$ should cause us to move ahead, meaning that we will have to fall $r$ steps behind afterwards, making the inductive argument easier for us. By a symmetric argument, we shouldn't expect to have the $s_2$th-to-last one in $\tilde{g}$ aligned with the $(r+s_2)$th-to-last one in $g$. So, the point is that we should expect to fall behind both in the first gaps and the last gaps. This will allow us to split the string into pieces where in each one we fall behind, and we can apply an inductive hypothesis on the length of the string. Another option is that there is never a value $s$ where $g_{k-s} \not\approx \tilde{g}_{\tilde{k}-s}$ or where $g_{k-r-s} \not\approx \tilde{g}_{\tilde{k}-s}$. In this case, $g$ is approximately periodic with period $r$, and we would have to fall an entire period behind, which we show happens with very low probability.
Reconstructing using Algorithm 1
Using Algorithm 1, we can iteratively get estimates $\tilde{g}_i$ with $|\tilde{g}_i - g_i| = O(\sqrt{g_i \log n})$. Namely, suppose that we have the estimates $\tilde{g}_0, \ldots, \tilde{g}_{i-1}$. We then run Algorithm 1 on independent traces, and with high probability, for a $1 - O(\delta)$ fraction of them, we have that the $i$th and $(i+1)$st ones of $x$ are retained in $\tilde{x}$ and correctly aligned. In particular, with probability $1 - O(\delta)$ we can identify both the $i$th and $(i+1)$st ones of $x$ in $\tilde{x}$, and taking the median over the gaps between these (and appropriately rescaling by $1/(1-\delta)$), we obtain an estimate $\tilde{g}_i$ of $g_i$ such that $|\tilde{g}_i - g_i| = O(\sqrt{g_i \log n})$. Note that the success probability of $1 - O(\delta)$ is enough to obtain the coarse estimates using the median approach, but we cannot obtain a fine estimate by taking the average, since with constant probability $\Theta(\delta)$, we may have misaligned the gap completely, and then our estimate can be arbitrarily off.
To obtain fine estimates, we first obtain coarse estimates, say $\tilde{g}_0, \ldots, \tilde{g}_k$, for all of the gaps. Next, we show that we can identify the $i$th and $(i+1)$st ones of $x$ in a trace $\tilde{x}$ (if they are retained), and we can detect if they were deleted, not just with probability $1 - O(\delta)$ but with very high probability. The trick here is to run Algorithm 1 both from the left and from the right on $\tilde{x}$, looking for respectively the one in $\tilde{x}$ aligned to the $i$th one in $x$ and the one in $\tilde{x}$ aligned to the $(i+1)$st one in $x$ (which is the $(k-i)$th one when running the algorithm from the right). If either of these runs fails to align a one in $\tilde{x}$ to respectively the $i$th and $(i+1)$st one in $x$, or the runs disagree on their alignment, then we will almost certainly know. To see why, assuming that we are never ahead in the alignment procedure from the left, if we believe we have reached the $i$th one in $x$, then we are truly at some $i_1$th one where $i_1 \geq i$. By a symmetric argument, if we believe we have reached the $(i+1)$st one in $x$ after running the procedure from the right, we are truly at the $i_2$th one in $x$, where $i_2 \leq i+1$. The key observation now is that $i_1 < i_2$ if and only if $i_1 = i$ and $i_2 = i+1$, meaning that both runs succeeding is equivalent to the one found in the left-alignment procedure being strictly earlier than the one found in the right-alignment procedure. So, if we realize that either run fails to align the ones properly, we discard the trace and repeat on a newly sampled trace.
Finally, we can ensure that the success of the runs of the alignment algorithm is independent of the deletion of zeros between the $i$th and $(i+1)$st ones in $x$. If a trace is not discarded, then with very high probability, the gap between the ones in $\tilde{x}$ aligned to the $i$th and $(i+1)$st ones in $x$ (normalized by $1/(1-\delta)$) is an unbiased estimator for $g_i$. By taking the average of the gap over sufficiently many traces, normalizing by $1/(1-\delta)$, and rounding to the nearest integer, we determine $g_i$ exactly with very high probability. Doing so for each $i$ reconstructs $x$.
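For concreteness, here is a minimal sketch of this last averaging step, assuming we already have the verified (non-discarded) observed gap counts for index $i$ (the function name is ours):

```python
def fine_gap_estimate(observed_gaps, delta):
    """Each verified observation is Bin(g_i, 1 - delta) distributed;
    averaging, rescaling by 1/(1 - delta), and rounding recovers g_i
    exactly once the average is within (1 - delta)/2 of its mean."""
    avg = sum(observed_gaps) / len(observed_gaps)
    return round(avg / (1.0 - delta))
```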
Roadmap of our paper
In Section 2, we introduce notation. In Section 3, we describe and analyse our main alignment procedure. We first prove that with high probability it is never ahead (Lemma 2). Second, in Section 3.2, we bound the probability that it is behind (Lemma 3). Finally, in Section 4, we describe our full trace reconstruction algorithm and prove Theorem 1.
2 Notation
We note a few notational conventions and definitions.
-
We recall that a bitstring is $d$-separated if the gap between any two consecutive $1$'s in the string contains at least $d$ $0$'s.
-
Given a string $x = 0^{g_0} 1 0^{g_1} 1 \cdots 1 0^{g_k}$, we say that a run is a contiguous sequence of $0$'s in $x$. For $0 \leq j \leq k$, the $j$th run of $x$ is the sequence $0^{g_j}$, and has length $g_j$.
-
For any bitstring $y$, we use $y^R$ to denote the string where the bits have been reversed.
-
We use $g = (g_1, \ldots, g_k)$ to denote an integer sequence of length $k$. For notational convenience, for any $i \leq j$, we write $g_{i:j}$ to denote the subsequence $(g_i, g_{i+1}, \ldots, g_j)$, and $g_{:j}$ to denote $g_{1:j}$.
We will define some sufficiently large constants $C_1, C_2, C_3, C_4$ and a small constant $\delta_0$. We will assume the separation parameter $d \geq C_1 \log n$, and the deletion parameter $\delta \leq \delta_0$, where $\delta_0$ is a sufficiently small constant. We did not make a significant effort to optimize the constant $C_1$ or the value of $\delta_0$, though we believe that any straightforward modifications to our analysis will not obtain bounds such as $\delta_0 = 0.1$ or a separation of $o(\log n)$.
3 Main Alignment Procedure
3.1 Description and Main Lemmas
In this section, we consider a probabilistic process that models a simpler version of the trace reconstruction problem that we aim to solve. In the simpler version of the trace reconstruction problem, suppose that we never delete any $0$'s, but delete each $1$ independently with probability $\delta$. Let $g_1, \ldots, g_k$ represent the true lengths of the first $k$ gaps (so the first $1$ is at position $g_1 + 1$, the second is at position $g_1 + g_2 + 2$, and so on). Moreover, suppose we have some current predictions $\tilde{g}_1, \ldots, \tilde{g}_{\tilde{k}}$ of the gaps. The high-level goal will be, given a single trace (where the trace means only $1$'s are deleted), to identify the $k$th $1$ in the trace from the original string with reasonably high probability. (Note that the $k$th $1$ is deleted with probability $\delta$, in which case we cannot succeed.)
In this section, we will describe and analyze the probabilistic process, and then explain how this analysis helps us solve the trace reconstruction problem in Section 4.
In the process, we fix $k, \tilde{k}$ and two sequences $g = (g_1, \ldots, g_k)$ and $\tilde{g} = (\tilde{g}_1, \ldots, \tilde{g}_{\tilde{k}})$, where $g$ has length $k$ but $\tilde{g}$ has some length $\tilde{k}$ which may or may not equal $k$. Moreover, we assume $g_i \geq d$ and $\tilde{g}_i \geq d$ for every term, and $k, \tilde{k} \leq n$.
Now, for each $1 \leq i \leq k$, let $\xi_i$ be i.i.d. random variables, with $\xi_i = 1$ with probability $1 - \delta$ and $\xi_i = 0$ with probability $\delta$. Also, let $\xi_0 = 1$ with probability $1$. For each $i$ with $\xi_i = 1$, we define a value $a_i$ as follows. First, we set $a_0 = 0$. Next, for each index $i \geq 1$ such that $\xi_i = 1$, let $i'$ denote the previous index with $\xi_{i'} = 1$. We define $a_i$ to be the smallest index $a > a_{i'}$ such that the observed gap $g_{i'+1} + \cdots + g_i$ is within $C_3\sqrt{(\tilde{g}_{a_{i'}+1} + \cdots + \tilde{g}_a)\log n}$ of $\tilde{g}_{a_{i'}+1} + \cdots + \tilde{g}_a$, where $C_3$ is a sufficiently large constant. (If such an index does not exist, we set $a_i = \infty$.)
Our goal will be for $a_k = k$. In general, for any $i$ with $\xi_i = 1$, we would like $a_i = i$. If $a_i < i$, we say that we are $i - a_i$ steps behind at step $i$, and if $a_i > i$, we say that we are $a_i - i$ steps ahead at step $i$.
First, we note the following lemma, which states that with very high probability we will never be ahead, as long as the sequences $g$ and $\tilde{g}$ are similar enough.
Lemma 2.
Set $L = C_2 \log n$ for a sufficiently large constant $C_2$. Let $g, \tilde{g}$ be sequences of lengths $k, \tilde{k}$, respectively, where $\tilde{k} \geq k$. Suppose that $|g_i - \tilde{g}_i| \leq \sqrt{g_i L}$ for all $i \leq k$. Then, with probability at least $1 - n^{-10}$ (over the randomness of the $\xi_i$), for all $i$ with $\xi_i = 1$, $a_i \leq i$.
Proof.
Let us consider the event that for every index $j$, at least one of $\xi_{j+1}, \ldots, \xi_{j+L}$ equals $1$. Equivalently, the string $\xi_1 \xi_2 \cdots \xi_k$ does not ever have $L$ $0$'s in a row. For any fixed $j$, the probability of this being false is at most $\delta^L$, so by a union bound over all choices of $j$, the event holds with at most $n\delta^L \leq n^{-10}$ failure probability.
First, note that $a_0 = 0$. Now, suppose that some $i'$ satisfies $\xi_{i'} = 1$ and $a_{i'} \leq i'$. Suppose $i$ is the smallest index strictly larger than $i'$ such that $\xi_i = 1$. Note that $i - i' \leq L$, by our assumed event. Note that if we set $a = a_{i'} + (i - i')$, then $a \leq i$, since $a_{i'} \leq i'$. Moreover, the observed gap and the corresponding sum of estimated gaps differ by at most $\sum_j \sqrt{g_j L} \leq \sqrt{(i - i') L \sum_j g_j}$, where the second-to-last inequality is by Cauchy-Schwarz, and this is within the tolerance in the definition of $a_i$. Thus, $a$ satisfies the requirements for $a_i$, which means that $a_i \leq a \leq i$. Thus, if $a_{i'} \leq i'$, then $a_i \leq i$. Since $a_0 = 0$, this means $a_i \leq i$ for all $i$ with $\xi_i = 1$.
The main technical result will be showing that $a_i \geq i$ with reasonably high probability, i.e., with reasonably high probability we are not behind. This result will hold for any choice of $g$ and $\tilde{g}$ and does not require any similarity between these sequences. In other words, our goal is to prove the following lemma.
Lemma 3.
Let $g, \tilde{g}$ be sequences of length at most $n$ with every term between $d$ and $n$, where $d \geq C_1 \log n$ for a sufficiently large constant $C_1$. Define $a_i$ as above. Then, for any $i$ with $\xi_i = 1$, with probability at least $1 - O(\delta)$ over the randomness of $\xi$, $a_i \geq i$.
3.2 Proof of Lemma 3
In this section, we prove Lemma 3.
We will set a parameter $T = C_5 \log n$, where $C_5$ is a sufficiently large constant. For any $r$, given the sequences $g$ and $\tilde{g}$ (of possibly differing lengths), we define $q_r(g, \tilde{g})$ to be the probability (over the randomness of $\xi$) that
-
$a_k \leq k - r$.
-
For any indices $i \leq i'$ with $\xi_i = \xi_{i'} = 1$, $(i' - a_{i'}) - (i - a_i) < T$.
Equivalently, this is the same as the probability that we fall behind at least $r$ steps from step $0$ to step $k$, but we never fall behind $T$ or more steps (relatively) from any (possibly intermediate) steps $i$ to $i'$. For any $\ell$, we define $q_{r,\ell}$ to be the supremum value of $q_r(g, \tilde{g})$ over any sequences where $g$ has length at most $\ell$ and every $g_j$ and $\tilde{g}_j$ is between $d$ and $n$, and we also define $q_r = \sup_{\ell} q_{r,\ell}$.
Note that for any $r \geq T$, $q_r(g, \tilde{g}) = 0$, as the first condition means we fell behind at least $T$ steps from step $0$ to step $k$, contradicting the second condition. So, $q_{r,\ell}$ and $q_r$ also equal $0$ for any $r \geq T$.
First, we note a simple proposition that will only be useful for simplifying the argument at certain places.
Proposition 4.
For any $r$ and $\ell \leq \ell'$, $q_{r,\ell} \leq q_{r,\ell'}$.
Proof.
Since $q_{r,\ell}$ is the supremum over all pairs $(g, \tilde{g})$ where $g$ has length at most $\ell$, it suffices to note that every such pair is also considered in the supremum defining $q_{r,\ell'}$. Indeed, for any $g$ of length at most $\ell$ and any $\tilde{g}$, we must have that $g$ also has length at most $\ell'$, so we must have $q_r(g, \tilde{g}) \leq q_{r,\ell'}$ and hence $q_{r,\ell} \leq q_{r,\ell'}$.
We now aim to bound the probabilities $q_{r,\ell}$ for $r \geq 1$. We will do this via an inductive approach on the length of $g$, where the high-level idea is that if we fall back by $r$ steps, there is a natural splitting point where we can say first we fell back by $r_1$ steps, and then by $r_2$ steps, for some $r_1, r_2$ with $r_1 + r_2 = r$ – see Lemmas 6 and 7. This natural splitting point will be based on the structure of the similarity of $g$ and $\tilde{g}$, and will not work if $g$ and $\tilde{g}$ share an $r$-periodic structure. But in the periodic case, we can give a more direct argument that we cannot fall back by $r$ steps (i.e., a full period), even with small probability – see Lemma 5. We can then compute a recursive formula for the probability of falling back $r$ steps, by saying we need to first fall back $r_1$ steps and then fall back $r - r_1$ steps. In Lemma 9, we bound the terms of this recursion.
Lemma 5.
Fix any $r$ such that $1 \leq r \leq T$, and suppose that $d \geq C_1 \log n$, where $C_1$ is a sufficiently large multiple of $C_3^2$. Suppose that $g, \tilde{g}$ are sequences such that $\tilde{g}_j$ approximately matches $g_{j+r}$ for every $j$ for which both are defined (an approximate period-$r$ agreement). Then, the probability $q_r(g, \tilde{g}) \leq n^{-10}$.
Proof.
We show that the probability of ever being behind by $r$ or more is at most $n^{-10}$. In fact, we will show this deterministically never happens, conditioned on the event that for every index $j$, at least one of $\xi_{j+1}, \ldots, \xi_{j+L}$ equals $1$. Indeed, the probability of this being false for any fixed $j$ is at most $\delta^L$, so by a union bound over all choices of $j$, the event holds with at most $n\delta^L \leq n^{-10}$ failure probability.
Now, assume the event, and suppose that $a_i \leq i - r$ holds for some $i$. More precisely, we fix $i$ to be the smallest index such that $\xi_i = 1$ and $a_i \leq i - r$.
First, assume that . Consider the values , and let By our conditional assumption, and since , at least one of equals . Say that , where . Also, by our choice of , we know that , and that . So, we have two options:
-
1.
, and , for some and where .
-
2.
, and .
Now, let’s consider the list of all indices with , starting with if and otherwise, and ending with . By definition of the sequence , for every there exists such that and . Assuming that , then which means and thus So,
Adding the above equation over , we obtain
where the final line follows by Cauchy-Schwarz. Let be if and otherwise. Then, since , we have
(1)
The above equation tells us that can’t be too much smaller than . We now show contrary evidence, thus establishing a contradiction.
First, we compare to . Indeed, for any , . Since every , this also means . Adding over all , we have
where the last inequality follows by Cauchy-Schwarz and the fact that .
However, we do not care about – we really care about . To bound this, first note that for any , and . So, , assuming every . If we additionally have that then for any and . Importantly, .
In the case that this implies that . So, because we have
Recalling that and , since ,
(2)
In the case that , we instead have . So, since , we have that
so the same bound as (2) holds (in fact, an even stronger bound holds).
So, both (1) and (2) hold, in either case. Together, they imply that
This is impossible if is a sufficiently large multiple of . Since in either case, it suffices for to be a sufficiently large multiple of .
Lemma 6.
Fix any such that , and suppose that . Suppose that are sequences of length , such that for every . Then, the probability
Proof.
Suppose that for all . Then, we can use Lemma 5 to bound . Alternatively, let be the smallest index such that . Next, let be such that is the largest index less than with , and is the smallest index at least with . Finally, let and . In other words, is the number of steps we fall behind from to , and is the number of steps we fall behind from to .
Note that , and since each subsequent is strictly increasing, this means , so , assuming that . In other words, we have that are nonnegative integers such that .
Now, let us bound the probability (over the randomness of ) of the event indicated by occurring, with the corresponding values . Note that for any fixed , the event of those specific values is equivalent to and being , and everything in between being . So, the probability is at most . Now, conditioned on , the values imply that we fall back steps from step to (or we may move forward if ) and we fall back steps from step to . Moreover, there cannot be two steps such that that we fell back steps from to . Since and , this means both . So, the overall probability of the corresponding values is at most , where we are using the fact that for all by Proposition 4.
Overall, the probability is at most
We can cap as at most since otherwise or is . Moreover, we can give improved bounds in the cases when and either or .
Note that in either case, both and equal . In the former case, we must have and . Importantly, the algorithm fell back by exactly steps from to . However, we know that for all , . In that case, if we restrict ourselves to the strings and , we are dealing with the case of Lemma 5. Hence, we can bound the overall probability of this case by . In the latter case, we must have and , since we need to fall back by exactly steps from to . However, this actually cannot happen, because by definition of and , we must have that which is not true by our definition of .
Overall, this means
Lemma 7.
Fix any such that , and suppose that . Suppose that are sequences of length . Then, the probability
Proof.
Our proof will be quite similar to that of Lemma 6, so we omit some of the identical details.
First, assume that for every . Then, we can directly apply Lemma 6. Alternatively, let be the largest index such that . As in the proof of Lemma 6, let be such that is the largest index less than with , and is the smallest index at least with . Also, let and .
As in the proof of Lemma 6, we have , as long as . We can again do the same casework on , to obtain
Once again, we wish to consider the individual cases of or separately. In either case, . In the former case, we must have and . In this case, from step to we fall behind steps. In other words, we can restrict ourselves to the strings and . However, we have now restricted ourselves to strings which satisfy the conditions of Lemma 6, so we can bound the probability in this case as at most
In the latter case, we must have and . However, this is impossible, because by our definition of .
Overall, by adding all cases together, we obtain
Overall, this implies that
We can now universally bound $q_{r,\ell}$ for all $r, \ell$. To do so, we first recall some basic properties of the Catalan numbers.
Fact 8.
For $s \geq 0$, the Catalan numbers $\mathrm{Cat}_s$ (we use $\mathrm{Cat}_s$ rather than the more standard $C_s$ to avoid confusion with the constants we have defined) are defined as $\mathrm{Cat}_s = \frac{1}{s+1}\binom{2s}{s}$. They satisfy the following list of properties.
1. $\mathrm{Cat}_0 = 1$ and for all $s \geq 1$, $\mathrm{Cat}_s = \sum_{t=0}^{s-1} \mathrm{Cat}_t \cdot \mathrm{Cat}_{s-1-t}$.
2. For all $s \geq 0$, $\mathrm{Cat}_s \leq 4^s$.
3. For all $s \geq 0$, $\mathrm{Cat}_{s+1} \leq 4 \cdot \mathrm{Cat}_s$.
Lemma 9.
Assume $\delta \leq \delta_0$ for a sufficiently small constant $\delta_0$, and define $\alpha = C_4 \delta$ for a sufficiently large constant $C_4$, with $\mathrm{Cat}_r$ as in Fact 8. Then, for all $r \geq 1$ and all $\ell$, $q_{r,\ell} \leq \mathrm{Cat}_r \cdot \alpha^r$.
Proof.
We prove the statement by induction on . For , note that with probability , so . Indeed, either or . So, and for all .
Now, suppose that the induction hypothesis holds for : we now prove the statement for . First, note that . Next, for ,
(3)
We now bound the summation in the above expression. First, we focus on the terms where one of or is . If , the summation becomes . If we fix , for each there are choices of , which means the summation is . For , each term is at most half the previous term, so this is at most . Next, for , if we fix , the summation is , since there are choices of . We have a symmetric summation for . Finally, if we focus on the terms with , by writing and , for any fixed , the sum of is at most , and there are choices for . So, the summation is at most , where the last inequality holds because .
Overall, replacing indices accordingly, we can write (3) as at most
We can now focus on the middle summation term. If we first consider all terms with , the sum equals as long as . For the remaining terms, we fix and consider the sum. If , the sum equals . Since for all , this is at most . For , the sum equals . Since , as long as , the terms decrease by a factor greater than each time increases. So the sum over all is at most Overall, the summation in the middle term is at most .
Overall, this means (3) is at most
(4)
Now, note that for all , even for . Moreover, . Thus, (4) is at most
Assuming that , this is at most which can be verified to be at most for all , by just using the fact that for all . This completes the inductive step.
We are now ready to prove Lemma 3.
Proof of Lemma 3.
If , this means that either the event occurs, or there exist indices with but we fall behind at least steps from step to step .
Assuming , the probability of is at most . Alternatively, if there exist with but we fall behind at least steps from step to step , there must exist such an with minimal (breaking ties arbitrarily). This could be because for some . However, the probability of there being consecutive indices is at most .
The final option is that, if we look at the first index with , . This means that from step to , we must fall behind at least steps, and there could not have been any intermediate steps where we fell behind more than steps. Hence, if we restrict ourselves to the strings and , the event indicated by must occur, since conditioned on and the fact that , the value only depends on , starting from position , and .
In other words, there exists some contiguous subsequences and of and , respectively, such that the event of occurs. For any fixed , the probability is at most . Since there are at most possible contiguous subsequences for each of and , the overall probability is at most , assuming that and where is sufficiently large.
Overall, the probability of falling behind is at most .
4 Full algorithm/analysis
Let us depict the true string as $x = 0^{g_0} 1 0^{g_1} 1 \cdots 1 0^{g_k}$, i.e., there are $k$ ones, and the string starts and ends with a run of $0$'s. This assumption can be made WLOG by padding the string with $d$ $0$'s at the front and the end. For any $d$-separated string, doing this padding maintains the $d$-separated property, and we can easily simulate the padded trace by adding $\mathrm{Bin}(d, 1-\delta)$-many $0$'s at the front and at the back of each trace. Once we reconstruct the padded string, we remove the padding to get $x$.
We assume we know the value of $k$. Indeed, the number of $1$'s in a single trace is distributed as $\mathrm{Bin}(k, 1-\delta)$. So, by averaging the number of $1$'s over $\widetilde{O}(n)$ random traces and dividing by $1-\delta$, we get an estimate of $k$ that is accurate within $1/3$ with high probability. Thus, by rounding, we know $k$ exactly with high probability.
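This estimation of $k$ is simple enough to state in code (a sketch under the stated assumptions; `traces` is a nonempty list of bitstrings):

```python
def estimate_num_ones(traces, delta):
    """The number of ones in a trace is Bin(k, 1 - delta), so the average
    count divided by (1 - delta) concentrates around k; rounding then
    recovers k exactly with high probability, given enough traces."""
    avg = sum(t.count("1") for t in traces) / len(traces)
    return round(avg / (1.0 - delta))
```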
The main goal is now to learn the lengths $g_0, g_1, \ldots, g_k$. If we learn these exactly using the traces, this completes the proof. Our algorithm runs in two phases: a coarse estimation phase and a fine estimation phase. In the coarse estimation phase, we sequentially learn each $g_i$ up to additive error $O(\sqrt{g_i \log n})$. In the fine estimation phase, we learn each $g_i$ exactly, given the coarse estimates.
4.1 Coarse estimation
Fix some $0 \leq i \leq k$, and suppose that for all $j < i$, we have estimates $\tilde{g}_j$ satisfying $|\tilde{g}_j - g_j| = O(\sqrt{g_j \log n})$. (If $i = 0$, then we have no estimates yet.) Our goal will be to provide an estimate $\tilde{g}_i$ such that $|\tilde{g}_i - g_i| = O(\sqrt{g_i \log n})$.
Consider a trace $\tilde{x}$ of $x$. Let $\xi_0 = 1$, and for each $1 \leq j \leq k$, let $\xi_j$ be the indicator that the $j$th $1$ is retained. Next, for each $0 \leq j \leq k$, let $s_j$ represent the number of $0$'s in the $j$th run that were not deleted. Note that with at least $1 - n^{-10}$ probability, $|s_j - (1-\delta)g_j| \leq C\sqrt{g_j \log n}$ for all $j$. Since $g_j \geq d \geq C_1 \log n$ for all $j$, this implies that $|s_j/(1-\delta) - g_j| = O(\sqrt{g_j \log n})$ for all $j$.
Now, even though we have no knowledge of $g$ or $\xi$, we can still simulate the probabilistic process of Section 3. Let $j_1 < j_2 < \cdots$ be the list of all indices $j$ with $\xi_j = 1$. While we do not know the values $\xi_j$, for every pair of consecutive indices $j_t, j_{t+1}$, the value $s_{j_t} + s_{j_t+1} + \cdots + s_{j_{t+1}-1}$ is exactly the number of $0$'s between the $t$th and $(t+1)$st $1$ in the trace (where we say that the $0$th $1$ is at position $0$). In other words, if $\pi_t$ represents the position of the $t$th $1$ in the trace, then $s_{j_t} + \cdots + s_{j_{t+1}-1} = \pi_{t+1} - \pi_t - 1$. Hence, because computing each $a_{j_t}$ only requires knowledge of these observed $0$-counts and the value of $1-\delta$, and since the observed counts are visible in the trace, the algorithm can in fact compute $a_{j_t}$ for all $t$, using the same process as described in Section 3, even if the values $\xi_j$ are not known.
Algorithm 1 simulates this process, assuming knowledge of $\tilde{g}$, $1-\delta$, a single trace $\tilde{x}$, and the target index $i$. In Algorithm 1, we use a variable to represent $a_{j_t}$, i.e., the current prediction of the position of the current $1$. In other words, if the current $1$ is truly the $j$th $1$ of $x$, then $a_{j_t} - j$ equals the number of steps ahead (or $j - a_{j_t}$ equals the number of steps behind) we are.
Lemma 10.
Fix $i$ such that $|\tilde{g}_j - g_j| = O(\sqrt{g_j \log n})$ for all $j < i$. With probability at least $1 - O(\delta)$ over the randomness of $\tilde{x}$, we have that Algorithm 1 returns a position such that the corresponding $1$ in $\tilde{x}$ corresponds to the $i$th $1$ in $x$, and the next $1$ in $\tilde{x}$ corresponds to the $(i+1)$st $1$ in $x$. Moreover, conditioned on this event holding, the distribution of the number of $0$'s between these two $1$'s in $\tilde{x}$ exactly follows $\mathrm{Bin}(g_i, 1-\delta)$.
Proof.
Let us first condition on the values $s_j$, assuming that $|s_j - (1-\delta)g_j| \leq C\sqrt{g_j \log n}$ for all $j$. As discussed earlier, this occurs with at least $1 - n^{-10}$ probability, and implies that $|s_j/(1-\delta) - g_j| = O(\sqrt{g_j \log n})$ for all $j$.
Let us also condition on $\xi_i = 1$. By Lemma 2 and Lemma 3, the probability that $a_{j_t} = j_t$ whenever $j_t \leq i$, is at least $1 - O(\delta)$. This is conditioned on $\xi_i = 1$ and the values $s_j$ (assuming they satisfy the above concentration). This means that with at least $1 - O(\delta)$ probability, the algorithm finds the position in $\tilde{x}$ of the $i$th $1$ of $x$. Since this only depends on $\xi_1, \ldots, \xi_i$ and $s_0, \ldots, s_{i-1}$, with probability at least $1 - O(\delta)$ over the randomness of $\xi$ and $s$, we have that the algorithm succeeds and $\xi_i = 1$. This is independent of $\xi_{i+1}$, so with probability at least $(1-\delta)(1 - O(\delta)) = 1 - O(\delta)$, we additionally have that $\xi_{i+1} = 1$.
The event above means that the returned position is the position in $\tilde{x}$ of the $i$th $1$ in the true string $x$. Moreover, since neither the $i$th nor $(i+1)$st $1$ was deleted, the next $1$ of $\tilde{x}$ is the position in $\tilde{x}$ of the $(i+1)$st $1$ in the true string $x$. So, the number of $0$'s between them is in fact the length of the gap between the $i$th and $(i+1)$st $1$'s after deletion, which means it has distribution $\mathrm{Bin}(g_i, 1-\delta)$, since the deletion of the $0$'s in this gap is independent of the events that decide whether the two $1$'s are retained and correctly aligned.
Given this, we can crudely estimate every gap, in order. Namely, assuming that we have estimates $\tilde{g}_0, \ldots, \tilde{g}_{i-1}$ (where $|\tilde{g}_j - g_j| = O(\sqrt{g_j \log n})$), we can run the Align procedure on $O(\log n)$ independent traces. By a Chernoff bound, with $n^{-10}$ failure probability, at least a $2/3$ fraction of the traces will have the desired property of Lemma 10, so Align will output a gap count $s$ distributed as $\mathrm{Bin}(g_i, 1-\delta)$. Since $\mathrm{Bin}(g_i, 1-\delta)$ is in the range $(1-\delta)g_i \pm C\sqrt{g_i \log n}$ with at least $1 - n^{-10}$ probability, at least a $3/5$ fraction of the outputs will satisfy $|s/(1-\delta) - g_i| = O(\sqrt{g_i \log n})$, with $n^{-10}$ failure probability. Thus, by defining $\tilde{g}_i$ to be the median value of $s/(1-\delta)$ across the randomly drawn traces, we have that $|\tilde{g}_i - g_i| = O(\sqrt{g_i \log n})$ with at least $1 - n^{-9}$ probability.
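In code, the median-based coarse estimation of a single gap might look as follows (illustrative only; `align_once` is a hypothetical stand-in for one run of the Align procedure on a fresh trace, returning the observed zero-count for gap $i$, or None on failure):

```python
import statistics

def coarse_estimate(traces, delta, align_once):
    """Median-of-traces sketch for one gap: rescale each successful
    observation by 1/(1 - delta) and take the median, which is robust
    to the O(delta) fraction of misaligned traces."""
    samples = [align_once(t) for t in traces]
    samples = [s / (1.0 - delta) for s in samples if s is not None]
    if not samples:
        return None
    return statistics.median(samples)
```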
By running this procedure iteratively to provide estimates $\tilde{g}_0, \tilde{g}_1, \ldots, \tilde{g}_k$, we obtain Algorithm 2. The analysis in the above paragraph implies the following result.
Theorem 11 (Crude Approximation).
Algorithm 2 uses $O(n)$ traces and polynomial time, and learns estimates $\tilde{g}_0, \ldots, \tilde{g}_k$ such that with at least $1 - n^{-8}$ probability, $|\tilde{g}_j - g_j| = O(\sqrt{g_j \log n})$ for all $j$.
4.2 Fine estimation
In this section, we show how to exactly compute each $g_i$ with high probability, given the crude estimates $\tilde{g}_0, \ldots, \tilde{g}_k$. This will again be done using an alignment procedure, but this time running the alignment both “forward and backward”.
Namely, given a trace $\tilde{x}$, we will try to identify the $i$th and $(i+1)$st $1$'s from the original string, but we try to identify the $i$th by running Align on $\tilde{x}$ and the $(i+1)$st by running Align on the reverse string $\tilde{x}^R$. The idea is: assuming that we never go ahead in the alignment procedure, if we find some $1$ in the forward alignment procedure that we believe is the $i$th, then its true index must be at least $i$. Likewise, if we do the alignment procedure in reverse until we believe we have found the $(k-i)$th $1$ from the back (equivalently, the $(i+1)$st from the front), the true index must be at most $i+1$.
So, the true position of the one found in the forward alignment procedure can only be strictly earlier than that of the one found in the backward alignment procedure if the true positions were exactly $i$ and $i+1$, respectively. Thus, by comparing the indices, we can effectively verify that the positions are correct, with negligible failure probability (rather than with $O(\delta)$ failure probability). This is the key towards obtaining the fine estimate of $g_i$, rather than just a coarse estimate that may be off by $O(\sqrt{g_i \log n})$.
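A sketch of this two-sided verification follows (schematic; `align` is a hypothetical one-sided alignment routine returning the index, among the ones of the given trace, believed to correspond to the target one of $x$, or None on failure):

```python
def verified_gap(trace, i, g_est, delta, align):
    """Locate the ones aligned to the i-th and (i+1)-st ones of x by
    aligning from the left and from the right; keep the trace only if
    the left-found one is strictly earlier than the right-found one."""
    ones = [pos for pos, bit in enumerate(trace) if bit == "1"]
    k = len(g_est) - 1                      # number of ones (gaps g_0..g_k)
    left = align(trace, i, g_est, delta)    # index into `ones`, or None
    right_rev = align(trace[::-1], k - i, g_est[::-1], delta)
    if left is None or right_rev is None:
        return None                         # a run failed outright: discard
    right = len(ones) - 1 - right_rev       # convert to a left-to-right index
    if not (left < right):                  # runs disagree: discard the trace
        return None
    return ones[right] - ones[left] - 1     # zeros between the two ones
```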
Algorithm 3 formally describes the fine alignment procedure, using fresh traces, assuming we have already done the coarse estimation to find $\tilde{g}_0, \ldots, \tilde{g}_k$.
Lemma 12.
Suppose that $|\tilde{g}_j - g_j| = O(\sqrt{g_j \log n})$ for all $j$. Fix an index $i$ and a trace, and for simplicity of notation, denote the trace by $\tilde{x}$. Let $N$ be the number of $1$'s in $\tilde{x}$. Then, the probability that the procedure does not return FAIL, but either the forward or backward iteration finds a $1$ in $\tilde{x}$ which does not correspond to the $i$th or $(i+1)$st $1$, respectively, from $x$, is at most $n^{-8}$. Moreover, if the forward and backward iterations find $1$'s in $\tilde{x}$ corresponding to the $i$th and $(i+1)$st $1$'s, respectively, then these are consecutive $1$'s of $\tilde{x}$. Finally, the probability of finding both corresponding $1$'s is at least $1 - O(\delta)$.
Proof.
First, let us consider the forward alignment procedure. We know that the alignment tracks the candidate position when looking at each $1$ of $\tilde{x}$ (from left to right). So, if we do not return FAIL, the procedure stops at some $1$ of $\tilde{x}$, which truly corresponds to the $i_1$th $1$ of $x$ for some $i_1$. If $i_1 < i$, this implies there is an index where the alignment was ahead. The probability of this is at most $n^{-10}$, by Lemma 2. Otherwise, $i_1 \geq i$, meaning that the $1$ found in $\tilde{x}$ is after (or equal to) the $i$th $1$ in $x$.
Likewise, if we consider the backward alignment procedure, if we do not return FAIL, then except for an event with probability at most $n^{-10}$, the $1$ found in $\tilde{x}^R$ is after (or equal to) the $(k-i)$th $1$ in $x^R$. Equivalently, the $1$ found in $\tilde{x}$ (reading from left to right) is before (or equal to) the $(i+1)$st $1$ in $x$ (reading from left to right).
So, barring an event of probability $O(n^{-10})$, writing $i_1 \geq i$ and $i_2 \leq i+1$ for the true indices found by the two runs, the only way that the $1$ found by the forward run is strictly before the $1$ found by the backward run is if the former is precisely the $i$th $1$ in $x$ and the latter is precisely the $(i+1)$st $1$ in $x$. Indeed, if $i_1 > i$, then $i_1 \geq i+1 \geq i_2$, and if $i_2 < i+1$, then $i_2 \leq i \leq i_1$; either way, the forward-found $1$ is not strictly before the backward-found $1$ (reading from left to right). This proves the first statement.
Next, if we in fact found the corresponding indices, they are consecutive $1$'s in $x$, which means they must be consecutive $1$'s in $\tilde{x}$ as well (as both were retained). So, if we found the $t$th $1$ from the left, and the $t'$th $1$ from the right, we must have $t + t' = N$.
Finally, the event of finding both corresponding indices is equivalent to the alignment being correct at the $i$th $1$ in the forward iteration and at the $(k-i)$th $1$ in the backward iteration. Conditioned on the corresponding $1$'s not being deleted, each of these occurs with at least $1 - O(\delta)$ probability, by Lemmas 2 and 3. So, the overall probability is at least $(1-\delta)^2 \cdot (1 - O(\delta)) = 1 - O(\delta)$.
We are now ready to prove Theorem 1. Indeed, given the accuracy of the crude estimation procedure, it suffices to check that for each $i$, we compute $g_i$ correctly with at least $1 - n^{-5}$ probability.
Theorem 13 (Fine Estimation).
Assume that $k$, the number of ones in $x$, is computed correctly, and for all $j$, $|\tilde{g}_j - g_j| = O(\sqrt{g_j \log n})$.
Then, for any fixed $i$, with at least $1 - n^{-5}$ probability, we compute the gap $g_i$ correctly.
Proof.
For any fixed trace $\tilde{x}^{(t)}$, if both the forward and backward procedures correctly identify the $i$th and $(i+1)$st $1$'s from the left, respectively, then the two found $1$'s are consecutive in $\tilde{x}^{(t)}$ by Lemma 12. In this case, we will compute an actual value rather than NULL. Moreover, as discussed in the proof of Lemma 10, the event that the forward procedure correctly identifies the right $1$ only depends on $\xi_1, \ldots, \xi_i$ and $s_0, \ldots, s_{i-1}$, i.e., on the events of whether the first $i$ $1$'s and the $0$'s before the $i$th $1$ are deleted. Likewise, the event that the backward procedure correctly identifies the right $1$ only depends on $\xi_{i+1}, \ldots, \xi_k$ and $s_{i+1}, \ldots, s_k$, i.e., on the events of whether the $(i+1)$st until the $k$th $1$'s and the $0$'s after the $(i+1)$st $1$ are deleted.
Thus, the forward and backward procedures correctly identifying the right $1$'s is independent of $s_i$. Moreover, in this case, the computed value is precisely $s_i$, since the found $1$'s are the positions in $\tilde{x}^{(t)}$ corresponding to the $i$th and $(i+1)$st $1$'s in $x$, and neither the $i$th nor $(i+1)$st $1$ can be deleted if both of these $1$'s are identified.
So, if the forward and backward procedures identify the right $1$'s for trace $\tilde{x}^{(t)}$, the conditional distribution of the computed value is $\mathrm{Bin}(g_i, 1-\delta)$. However, we really want to look at the distribution conditioned on the event that the value is not NULL. Indeed, by Lemma 12, this event is equivalent to either the forward and backward procedures identifying the right $1$'s, or some other event which occurs with at most $n^{-8}$ probability. Because the value is clearly between $0$ and $n$, and since the probability of both $1$'s being correctly identified is at least $1 - O(\delta)$ by Lemma 12, the expectation of the value, conditioned on not being NULL, is $(1-\delta)g_i \pm O(n^{-7})$.
By a Chernoff bound, the number of traces with non-NULL values is at least half its expectation with at least $1 - n^{-10}$ probability. Then, by another Chernoff bound, the empirical average of all such values is within $(1-\delta)/3$ of its expectation with high probability. Thus, taking the empirical average and dividing by $1-\delta$, with at most $n^{-5}$ failure probability, $1/(1-\delta)$ times the average of all non-NULL values is within $1/2$ of $g_i$, and thus rounds to $g_i$.
5 Conclusion and Open Questions
In this paper, we established that the trace reconstruction problem can be solved with a near-linear number of traces, as long as any two ones in the initial string are separated by at least $C\log n$ zeros and the deletion probability is at most a sufficiently small constant. It is an interesting open question to handle more general deserts, such as repetitions of a short pattern like $0101\cdots01$, interspersed with mildly separated zeros and ones. Indeed, we believe that this is an important step towards solving the general trace reconstruction problem with deletion probability $\delta = O(1/\log n)$. With this deletion probability, the Bitwise Majority Alignment (BMA) algorithm from [3] succeeds in reconstructing $x$ as long as $x$ does not contain any such highly repetitive contiguous substrings. If one can provide a separate algorithm for such strings, one could then imagine $x$ being partitioned into contiguous substrings that can be reconstructed by respectively BMA and the highly-repetitive-string algorithm in an alternating fashion. Additional work is required to determine how to switch between the two algorithms.
References
- [1] Frank Ban, Xi Chen, Adam Freilich, Rocco A. Servedio, and Sandip Sinha. Beyond trace reconstruction: Population recovery from the deletion channel. In Foundations of Computer Science (FOCS), pages 745–768, 2019. doi:10.1109/FOCS.2019.00050.
- [2] Frank Ban, Xi Chen, Rocco A. Servedio, and Sandip Sinha. Efficient average-case population recovery in the presence of insertions and deletions. In Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques, pages 44:1–44:18, 2019. doi:10.4230/LIPIcs.APPROX-RANDOM.2019.44.
- [3] Tugkan Batu, Sampath Kannan, Sanjeev Khanna, and Andrew McGregor. Reconstructing strings from random traces. In Symposium on Discrete Algorithms (SODA), pages 910–918, 2004. URL: http://dl.acm.org/citation.cfm?id=982792.982929.
- [4] Joshua Brakensiek, Ray Li, and Bruce Spang. Coded trace reconstruction in a constant number of traces. In Foundations of Computer Science (FOCS), 2020. arXiv:1908.03996.
- [5] Diptarka Chakraborty, Debarati Das, and Robert Krauthgamer. Approximate trace reconstruction via median string (in average-case). In Foundations of Software Technology and Theoretical Computer Science (FSTTCS), volume 213 of LIPIcs, pages 11:1–11:23. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.FSTTCS.2021.11.
- [6] Zachary Chase. New lower bounds for trace reconstruction. Ann. Inst. H. Poincaré Probab. Statist., 57(2), 2021. URL: http://arxiv.org/abs/1905.03031.
- [7] Zachary Chase. Separating words and trace reconstruction. In Symposium on Theory of Computing (STOC), 2021.
- [8] Zachary Chase and Yuval Peres. Approximate trace reconstruction of random strings from a constant number of traces. CoRR, abs/2107.06454, 2021.
- [9] Xi Chen, Anindya De, Chin Ho Lee, Rocco A. Servedio, and Sandip Sinha. Polynomial-time trace reconstruction in the low deletion rate regime. In Innovations in Theoretical Computer Science (ITCS), 2021. arXiv:2012.02844.
- [10] Xi Chen, Anindya De, Chin Ho Lee, Rocco A. Servedio, and Sandip Sinha. Polynomial-time trace reconstruction in the smoothed complexity model. In Symposium on Discrete Algorithms (SODA), 2021. arXiv:2008.12386.
- [11] Xi Chen, Anindya De, Chin Ho Lee, Rocco A. Servedio, and Sandip Sinha. Near-optimal average-case approximate trace reconstruction from few traces. In Symposium on Discrete Algorithms (SODA), 2022. arXiv:2107.11530.
- [12] Xi Chen, Anindya De, Chin Ho Lee, Rocco A. Servedio, and Sandip Sinha. Approximate trace reconstruction from a single trace. In Symposium on Discrete Algorithms (SODA), 2023. doi:10.48550/arXiv.2211.03292.
- [13] Mahdi Cheraghchi, Ryan Gabrys, Olgica Milenkovic, and João Ribeiro. Coded trace reconstruction. IEEE Trans. Inf. Theory, 66(10):6084–6103, 2020. doi:10.1109/TIT.2020.2996377.
- [14] Sami Davies, Miklos Racz, and Cyrus Rashtchian. Reconstructing trees from traces. In Conference On Learning Theory (COLT), pages 961–978, 2019. URL: http://proceedings.mlr.press/v99/davies19a.html.
- [15] Sami Davies, Miklós Z. Rácz, Benjamin G. Schiffer, and Cyrus Rashtchian. Approximate trace reconstruction: Algorithms. In International Symposium on Information Theory (ISIT), pages 2525–2530. IEEE, 2021. doi:10.1109/ISIT45174.2021.9517926.
- [16] Anindya De, Ryan O’Donnell, and Rocco A. Servedio. Optimal mean-based algorithms for trace reconstruction. Annals of Applied Probability, 29(2):851–874, 2019. doi:10.1214/18-AAP1394.
- [17] Lisa Hartung, Nina Holden, and Yuval Peres. Trace reconstruction with varying deletion probabilities. In Analytic Algorithmics and Combinatorics (ANALCO), pages 54–61, 2018. doi:10.1137/1.9781611975062.6.
- [18] Nina Holden and Russell Lyons. Lower bounds for trace reconstruction. Annals of Applied Probability, 30(2):503–525, 2020. doi:10.1214/19-AAP1506.
- [19] Nina Holden, Robin Pemantle, and Yuval Peres. Subpolynomial trace reconstruction for random strings and arbitrary deletion probability. In Conference On Learning Theory (COLT), pages 1799–1840, 2018. URL: http://proceedings.mlr.press/v75/holden18a.html.
- [20] Thomas Holenstein, Michael Mitzenmacher, Rina Panigrahy, and Udi Wieder. Trace reconstruction with constant deletion probability and related results. In Symposium on Discrete Algorithms (SODA), pages 389–398, 2008. doi:10.1145/1347082.1347125.
- [21] Sampath Kannan and Andrew McGregor. More on reconstructing strings from random traces: insertions and deletions. In International Symposium on Information Theory (ISIT), pages 297–301, 2005. doi:10.1109/ISIT.2005.1523342.
- [22] Akshay Krishnamurthy, Arya Mazumdar, Andrew McGregor, and Soumyabrata Pal. Trace reconstruction: Generalized and parameterized. IEEE Trans. Inf. Theory, 67(6):3233–3250, 2021. doi:10.1109/TIT.2021.3066010.
- [23] Vladimir I. Levenshtein. Efficient reconstruction of sequences. IEEE Trans. Information Theory, 47(1):2–22, 2001. doi:10.1109/18.904499.
- [24] Vladimir I. Levenshtein. Efficient reconstruction of sequences from their subsequences or supersequences. J. Comb. Theory, Ser. A, 93(2):310–332, 2001. doi:10.1006/jcta.2000.3081.
- [25] Andrew McGregor, Eric Price, and Sofya Vorotnikova. Trace reconstruction revisited. In European Symposium on Algorithms (ESA), pages 689–700, 2014. doi:10.1007/978-3-662-44777-2_57.
- [26] Andrew McGregor and Rik Sengupta. Graph reconstruction from random subgraphs. In International Colloquium on Automata, Languages, and Programming (ICALP), volume 229, pages 96:1–96:18, 2022. doi:10.4230/LIPICS.ICALP.2022.96.
- [27] Andrew McGregor and Rik Sengupta. Graph reconstruction from noisy random subgraphs. CoRR, abs/2405.04261, 2024. doi:10.48550/arXiv.2405.04261.
- [28] Shyam Narayanan. Improved algorithms for population recovery from the deletion channel. In Symposium on Discrete Algorithms (SODA), pages 1259–1278. SIAM, 2021. doi:10.1137/1.9781611976465.77.
- [29] Shyam Narayanan and Michael Ren. Circular trace reconstruction. In Innovations in Theoretical Computer Science (ITCS), 2021. arXiv:2009.01346.
- [30] Fedor Nazarov and Yuval Peres. Trace reconstruction with $\exp(O(n^{1/3}))$ samples. In Symposium on Theory of Computing (STOC), pages 1042–1046, 2017. doi:10.1145/3055399.3055494.
- [31] Yuval Peres and Alex Zhai. Average-case reconstruction for the deletion channel: Subpolynomially many traces suffice. In Foundations of Computer Science (FOCS), pages 228–239, 2017. doi:10.1109/FOCS.2017.29.
- [32] Ittai Rubinstein. Average-case to (shifted) worst-case reduction for the trace reconstruction problem. In International Colloquium on Automata, Languages, and Programming (ICALP), volume 261 of LIPIcs, pages 102:1–102:20, 2023. URL: https://arxiv.org/abs/2207.11489.
- [33] Alec Sun and William Yue. The trace reconstruction problem for spider graphs. Discrete Mathematics, 346(1):113115, 2023. doi:10.1016/J.DISC.2022.113115.
- [34] Krishnamurthy Viswanathan and Ram Swaminathan. Improved string reconstruction over insertion-deletion channels. In Symposium on Discrete Algorithms (SODA), pages 399–408, 2008. doi:10.1145/1347082.1347126.