Near-Optimal Trace Reconstruction for Mildly Separated Strings
Abstract
In the trace reconstruction problem our goal is to learn an unknown string $x \in \{0,1\}^n$ given independent traces of $x$. A trace is obtained by independently deleting each bit of $x$ with some probability $\delta$ and concatenating the remaining bits. It is a major open question whether the trace reconstruction problem can be solved with a polynomial number of traces when the deletion probability $\delta$ is constant. The best known upper and lower bounds are respectively $\exp(\widetilde{O}(n^{1/5}))$ [7] and $\widetilde{\Omega}(n^{3/2})$ [6]. Our main result is that if the string $x$ is mildly separated, meaning that the number of zeros between any two ones in $x$ is at least $C \log n$ for a sufficiently large constant $C$, and if the deletion probability $\delta$ is a sufficiently small constant, then the trace reconstruction problem can be solved with $\widetilde{O}(n)$ traces and in polynomial time.
Keywords and phrases: Trace Reconstruction, deletion channel, sample complexity
Category: Track A: Algorithms, Complexity and Games
Funding: Anders Aamand: This work was supported by the DFF-International Postdoc Grant 0164-00022B and by the VILLUM Foundation grants 54451 and 16582.
2012 ACM Subject Classification: Theory of computation → Design and analysis of algorithms
1 Introduction
Trace reconstruction is a well-studied problem at the interface of string algorithms and learning theory. Informally, the goal of trace reconstruction is to recover an unknown string given several independent noisy copies of the string.
Formally, fix an integer $n$ and a deletion parameter $\delta \in (0,1)$. Let $x = x_1 x_2 \cdots x_n \in \{0,1\}^n$ be an unknown binary string, with $x_i$ representing the $i$th bit of $x$. Then, a trace of $x$ is generated by deleting every bit independently with probability $\delta$ (and retaining it otherwise), and concatenating the retained bits together. For instance, if $x = 10110$ and we delete the second and third bits, the trace would be $110$ (from the first, fourth, and fifth bits of $x$). For a fixed string $x$, note that a trace follows some distribution over bitstrings, where the randomness comes from which bits are deleted. In trace reconstruction, we assume we are given i.i.d. traces $\tilde{x}^{(1)}, \ldots, \tilde{x}^{(m)}$ of $x$, and our goal is to recover the original string $x$ with high probability.
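As a concrete reference point, the deletion channel is straightforward to simulate. Below is a minimal Python sketch (the function name and the string representation are our own choices, not from the paper):

```python
import random

def trace(x: str, delta: float, rng: random.Random = random) -> str:
    """Generate one trace of x: delete each bit independently with
    probability delta and concatenate the retained bits."""
    return "".join(bit for bit in x if rng.random() >= delta)

# With x = "10110", a run in which the second and third bits happen to be
# deleted yields trace "110", matching the example above.
```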
The trace reconstruction problem has been a very well studied problem over the past two decades [23, 24, 3, 21, 20, 34, 25, 16, 30, 31, 17, 18, 19, 6, 10, 9, 7, 32]. There have also been numerous generalizations or variants of trace reconstruction studied in the literature, including coded trace reconstruction [13, 4], reconstructing mixture models [1, 2, 28], reconstructing alternatives to strings [14, 22, 29, 26, 33, 27], and approximate trace reconstruction [15, 8, 5, 11, 12].
In perhaps the most well-studied version of trace reconstruction, $x$ is assumed to be an arbitrary $n$-bit string and the deletion parameter $\delta$ is assumed to be a fixed constant independent of $n$. In this case, the best known algorithm requires $\exp(\widetilde{O}(n^{1/5}))$ random traces to reconstruct $x$ with high probability [7]. As we do not know of any polynomial-time (or even polynomial-sample) algorithms for trace reconstruction, there have been many works making distributional assumptions on the string $x$, such as $x$ being a uniformly random string [20, 25, 31, 19, 32] or being drawn from a “smoothed” distribution [10]. An alternative assumption is that the string is parameterized, meaning that $x$ comes from a certain “nice” class of strings that may be amenable to efficient algorithms [22, 15].
In this work, we also wish to understand parameterized classes of strings for which we can solve trace reconstruction efficiently. Indeed, we give an algorithm, using a near-linear number of traces and polynomial runtime, that works for a general class of strings that we call $d$-separated strings. This significantly broadens the classes of strings for which polynomial-time algorithms are known [22].
Main Result
Our main result concerns trace reconstruction of strings that are mildly separated. We say that a string is $d$-separated if the number of zeros between any two consecutive ones is at least $d$. Depicting a string $x$ with $k$ ones as
$$x = 0^{g_0}\, 1\, 0^{g_1}\, 1\, \cdots\, 1\, 0^{g_k},$$
it is $d$-separated if and only if $g_i \geq d$ for each $i$ with $1 \leq i \leq k-1$. Note that we make no assumptions on $g_0$ or $g_k$.
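In code, the gap decomposition and the separation condition look as follows (a small illustrative sketch; the names `gaps` and `is_separated` are ours):

```python
def gaps(x: str) -> list:
    """Return (g_0, ..., g_k): the lengths of the zero-runs before the
    first one, between consecutive ones, and after the last one."""
    return [len(run) for run in x.split("1")]

def is_separated(x: str, d: int) -> bool:
    """A string is d-separated if every *interior* gap g_1, ..., g_{k-1}
    has length at least d; no assumption is made on g_0 or g_k."""
    g = gaps(x)
    return all(gi >= d for gi in g[1:-1])

assert gaps("0001010000") == [3, 1, 4]
assert is_separated("0001010000", 1) and not is_separated("0001010000", 2)
```

Our main result is as follows.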
Theorem 1.
There exists an algorithm that solves the trace reconstruction problem with high probability in $n$ on any $d$-separated string $x \in \{0,1\}^n$, provided that $d \geq C \log n$ for a universal constant $C$, and that the deletion probability $\delta$ is at most some universal constant $\delta_0 > 0$. The algorithm uses $m$ independently sampled traces of $x$, where $m = \widetilde{O}(n)$, and runs in $\mathrm{poly}(n)$ time.
We note that the number of traces is nearly optimal. Even distinguishing between two strings which contain only a single one, at positions $n/2$ and $n/2+1$ respectively, requires $\Omega(n)$ traces to succeed with probability $2/3$.
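To see where such a bound comes from, here is a back-of-the-envelope calculation (a heuristic sketch for the single-one instance above, not the formal argument). If $x$ has a single one at position $p$, then conditioned on that one surviving (probability $1-\delta$), its position in the trace is $1 + \mathrm{Bin}(p-1, 1-\delta)$:

```latex
% Distinguishing p = n/2 from p = n/2 + 1 amounts to distinguishing
%   Bin(n/2, 1-\delta)  vs.  Bin(n/2 + 1, 1-\delta):
% the means differ by 1-\delta while both standard deviations are
% \Theta(\sqrt{n}), so the squared Hellinger distance per trace is
H^2\Big(\mathrm{Bin}\big(\tfrac{n}{2},\,1-\delta\big),\;
        \mathrm{Bin}\big(\tfrac{n}{2}+1,\,1-\delta\big)\Big) = O(1/n).
% Squared Hellinger distance is subadditive over i.i.d. samples, so m
% traces give distinguishing power O(m/n); constant success probability
% therefore forces m = \Omega(n).
```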
While trace reconstruction is known to be solvable very efficiently for random strings [19, 32], there are certain structured classes of strings that appear to be natural hard instances for existing approaches. Our algorithm can be seen as solving one basic class of hard instances. It is worth noting the work of [9], which studies the trace reconstruction problem when the deletion probability is sub-constant. They show that the simple Bitwise Majority Alignment (BMA) algorithm from [3] can succeed with deletion probability $O(1/\log n)$ as long as the original string does not contain deserts – highly repetitive blocks where some short substring is repeated many times. They then combine this with an algorithm for reconstructing repetitive blocks – but this part of their algorithm requires a significantly smaller, polynomially small deletion probability. This suggests that strings containing many repetitive blocks are a natural hard instance and a good test-bed for developing new algorithms and approaches. $d$-separated strings can be thought of as the simplest class of highly repetitive strings (where the repeating pattern is just a single zero), in which every repetition has length at least $d$.
Comparison to Related Work
Most closely related to our work is the result by Krishnamurthy et al. [22], stating that if $x$ is sufficiently sparse, with each pair of ones separated by a run of zeros whose required length grows with the number of ones, then $x$ can be recovered in polynomial time from polynomially many traces. In particular, for strings with very few ones, the required separation is milder than ours, albeit still super-constant. Our algorithm works in general assuming an $O(\log n)$ separation of the ones but with no additional requirement on the number of ones: indeed, we could have as many as $\Theta(n/\log n)$ ones. With no sparsity assumptions, [22] would need a separation that is nearly linear in $n$, as an $O(\log n)$-separated string can have $\Theta(n/\log n)$ ones in the worst case. The techniques of [22] are also very different from ours. They recursively cluster the positions of the ones in the observed traces to correctly align a large fraction of the ones in the observed traces to ones in the string $x$. In contrast, our algorithm works quite differently and is of a more sequential nature, processing the traces from left to right (or right to left). See Section 1.1 for a discussion of our algorithm.
Another paper studying strings with long runs is by Davies et al. [15]. They consider approximate trace reconstruction, specifically how many traces are needed to approximately reconstruct $x$ up to small edit distance under various assumptions on the lengths of the runs of zeros and ones in $x$. Among other results, and most closely related to ours, they show that one can approximately reconstruct $x$ from polynomially many traces if the runs of zeros are sufficiently long and if the runs of ones all have the same constant length (e.g., they could all have length one, as in our paper). However, for exact reconstruction, they would need to drive the edit-distance parameter so low that their trace complexity is no longer polynomial, which means they do not provide any nontrivial guarantees in our setting.
1.1 Technical Contributions
In this section, we give a high-level overview of our techniques. Recall that we want to reconstruct a string $x$ from independent traces, where we assume that $x$ is mildly separated. More concretely, we assume that there are numbers $g_0, g_1, \ldots, g_k$ such that $x$ consists of $g_0$ zeros followed by a one, followed by $g_1$ zeros followed by a one, and so on, with the last $g_k$ bits of $x$ being zero. Writing $p_i = i + \sum_{j=0}^{i-1} g_j$, we thus have that there are $k$ ones in $x$, at positions $p_i$ for $1 \leq i \leq k$.
Note that a retained bit in a trace $\tilde{x}$ naturally corresponds to a bit in $x$. More formally, for a trace $\tilde{x}$ of length $m$, let $t_1 < t_2 < \cdots < t_m$ be the positions in $x$ where the bit was retained when generating $\tilde{x}$, so that $\tilde{x}_j = x_{t_j}$. Then, the correspondence is defined by the map from $[m]$ to $[n]$ mapping $j \mapsto t_j$. We think of this map as the correct alignment of $\tilde{x}$ to $x$.
Our main technical contribution is an alignment algorithm (see Algorithm 1) which takes in some $i \leq k$ and estimates $\tilde{g}_0, \ldots, \tilde{g}_{i-1}$ of $g_0, \ldots, g_{i-1}$ satisfying that for all $j$, $|\tilde{g}_j - g_j| = O(\sqrt{g_j \log n})$, and correctly aligns the one in a trace $\tilde{x}$ corresponding to the $i$th one of $x$ with probability $1 - O(\delta)$ (where the randomness is over the draw of $\tilde{x}$ – naturally, this requires that the $i$th one of $x$ was not deleted).
Moreover, we ensure that the alignment procedure, with high probability, say $1 - n^{-10}$, never aligns a one in $\tilde{x}$ too far to the right in $x$: if the one in $\tilde{x}$ corresponding to the $i$th one of $x$ is aligned to the $i'$th one of $x$, then $i' \leq i$. We will refer to this latter property by saying that the algorithm is never ahead with high probability. If $i' < i$, we say that the algorithm is behind. Thus, to show that the algorithm correctly aligns the $i$th one, it suffices to show that the probability that the algorithm is behind is $O(\delta)$.
We first discuss how to implement this alignment procedure, and afterwards we discuss how to use it to complete the reconstruction.
The alignment procedure of Algorithm 1
The main technical challenge of this paper is the analysis of Algorithm 1. Let us first describe, at a high level, how the algorithm works. Suppose that the trace $\tilde{x}$ consists of $h_0$ zeros followed by a one, followed by $h_1$ zeros followed by a one, and so on. The algorithm first attempts to align the first one in $\tilde{x}$ with a one in $x$ by finding the minimal $a_1$ such that $h_0/(1-\delta)$ is within $C\sqrt{(\tilde{g}_0 + \cdots + \tilde{g}_{a_1 - 1})\log n}$ of $\tilde{g}_0 + \cdots + \tilde{g}_{a_1 - 1}$, for a sufficiently large constant $C$. Inductively, having determined $a_{t-1}$ (that is, the alignment of the $(t-1)$st one of $\tilde{x}$), it looks for the minimal $a_t > a_{t-1}$ satisfying that $h_{t-1}/(1-\delta)$ is within the analogous tolerance of $\tilde{g}_{a_{t-1}} + \cdots + \tilde{g}_{a_t - 1}$. Intuitively, when looking at the $t$th one in the trace, we want to find the earliest possible location in the real string (which has gaps estimated by $\tilde{g}$) that could plausibly align with the one in the trace.
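The following Python sketch illustrates this greedy rule. It is a schematic rendering of the idea only, not the paper's Algorithm 1: the concrete tolerance, the constant `C`, and the rescaling by $1/(1-\delta)$ are our simplifications.

```python
import math

def greedy_align(trace_gaps, g_est, delta, C=10.0):
    """Schematic greedy alignment: trace_gaps[t] is the number of zeros
    observed between the t-th and (t+1)-st ones of the trace (with a
    virtual one at position 0); g_est = (g_0, g_1, ...) are estimated
    gaps of x. Returns a list whose t-th entry is the (1-indexed) index
    of the one of x aligned to the (t+1)-st one of the trace, or None
    if no plausible alignment exists."""
    n_rough = sum(g_est) + len(g_est)   # rough length scale for log factors
    prev, alignment = 0, []
    for obs in trace_gaps:
        target = obs / (1.0 - delta)    # undo expected shrinkage of zero-runs
        a, total = prev, 0.0
        while a < len(g_est):           # earliest plausible index wins
            total += g_est[a]
            a += 1
            if abs(total - target) <= C * math.sqrt(max(total, 1.0) * math.log(n_rough + 2)):
                break
        else:
            return None                 # fell off the end: alignment failed
        alignment.append(a)
        prev = a
    return alignment
```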
It is relatively easy to check that the algorithm is never ahead with very high probability. Indeed, by concentration bounds on the number of deleted zeros and the fact that $\tilde{g}_j \approx g_j$ for all $j$, it always has the option of aligning the $(t+1)$st one in $\tilde{x}$ to the correct one in $x$. However, it might align to an earlier one in $x$, since it is looking for the minimum $a$ such that an alignment is possible. For a very simple example, suppose that $\tilde{g} = g$ and that $g$ is periodic, say $g_j = d$ for all $j$. If the first $r$ ones of $x$ are deleted and the $(r+1)$st one is retained, the algorithm can align the retained one (which corresponds to the $(r+1)$st one of $x$) with the first one of $x$, resulting in the aligning algorithm being $r$ steps behind. Moreover, the algorithm can then remain $r$ steps behind all the way up to the $k$th one of $x$. The probability of this happening is $\delta^r$. To prove that the probability of the algorithm being behind when aligning the $i$th one of $x$ is $O(\delta)$, we prove a much stronger statement which is amenable to an inductive proof, essentially stating that this is the worst that can happen: the probability of the algorithm being $r$ steps behind at any fixed point is bounded by $(C\delta)^{\Omega(r)}$ for a constant $C$. In particular, we show that there is a sort of amortization – whenever there is a substring that can cause the algorithm to fall further behind with some probability (i.e., if certain bits are deleted), the substring also helps the algorithm catch back up if it is already behind.
The algorithm is not too far behind
Proving that the algorithm cannot be too far behind, i.e., is $r$ steps behind with probability at most $(C\delta)^{\Omega(r)}$, is perhaps the most challenging technical part of our paper. We discuss some of the ideas behind proving this result.
The first step towards proving this lemma is to attempt to prove an even stronger statement: that even if the current estimates $\tilde{g}$ are totally arbitrary (perhaps not similar to $g$ at all), we will still not be far behind. This is not too far-fetched since, for general $\tilde{g}$, we might actually start jumping ahead. For example, if $\tilde{g}_1$ is much smaller than $g_1$ and we do not delete the first one, we will predict the first one in the trace to have come from a later location. This will end up being proven by induction on the length of the string.
Now, condition on the $k$th one from the true string not being deleted, and consider the probability of being $r$ steps behind after seeing this bit. Recall that the true gaps between the ones until the $k$th one are $g_1, \ldots, g_k$, but the algorithm believes the gaps are $\tilde{g}_1, \ldots, \tilde{g}_{\tilde{k}}$, and the algorithm believes we have just gone through all $\tilde{k}$ gaps so far. Let $s_1$ be the smallest value $s$ where $g_{k-s} \approx \tilde{g}_{\tilde{k}-s}$ doesn't hold (here, think of the allowed error as being much larger than $\sqrt{d}$, so it would be easy to distinguish between these gaps even with random deletions). Let $s_2$ be the smallest value $s$ where $g_{k-r-s} \approx \tilde{g}_{\tilde{k}-s}$ doesn't hold. This can be thought of in terms of reading the sequences $g$ and $\tilde{g}$ backward, from where the algorithm thinks the gaps are from the sequence $\tilde{g}$ whereas the gaps actually come from $g$. The idea is that if we are aligned after the $s_1$th-to-last one (which is after the gaps $g_{k-s_1}$ and $\tilde{g}_{\tilde{k}-s_1}$), the difference between $g_{k-s_1}$ and $\tilde{g}_{\tilde{k}-s_1}$ should cause us to move ahead, meaning that we will have to fall $r$ steps behind afterwards, making the inductive argument easier for us. By a symmetric argument, we shouldn't expect to have the $s_2$th-to-last one in $\tilde{g}$ aligned with the $(r+s_2)$th-to-last one in $g$. So, the point is that we should expect to fall behind both in the first gaps and the last gaps. This will allow us to split the string into pieces where in each one we fall behind, and we can apply an inductive hypothesis on the length of the string. Another option is that there is never a value $s$ where $g_{k-s} \not\approx \tilde{g}_{\tilde{k}-s}$ or where $g_{k-r-s} \not\approx \tilde{g}_{\tilde{k}-s}$. In this case, $g$ is approximately periodic with period $r$, and we would have to fall an entire period behind, which we show happens with very low probability.
Reconstructing using Algorithm 1
Using Algorithm 1, we can iteratively get estimates $\tilde{g}_i$ with $|\tilde{g}_i - g_i| = O(\sqrt{g_i \log n})$. Namely, suppose that we have the estimates $\tilde{g}_0, \ldots, \tilde{g}_{i-1}$. We then run Algorithm 1 on independent traces, and with high probability, for a $1 - O(\delta)$ fraction of them, we have that the $i$th and $(i+1)$st ones of $x$ are retained in $\tilde{x}$ and correctly aligned. In particular, with probability $1 - O(\delta)$ we can identify both the $i$th and $(i+1)$st ones of $x$ in $\tilde{x}$, and taking the median over the gaps between these (and appropriately rescaling by $1/(1-\delta)$), we obtain an estimate $\tilde{g}_i$ of $g_i$ such that $|\tilde{g}_i - g_i| = O(\sqrt{g_i \log n})$. Note that the success probability of $1 - O(\delta)$ is enough to obtain the coarse estimates using the median approach, but we cannot obtain a fine estimate by taking the average, since with constant probability $\Theta(\delta)$, we may have misaligned the gap completely, and then our estimate can be arbitrarily off.
To obtain fine estimates, we first obtain coarse estimates, say $\tilde{g}_0, \ldots, \tilde{g}_k$, for all of the gaps. Next, we show that we can identify the $i$th and $(i+1)$st ones of $x$ in a trace $\tilde{x}$ (if they are retained), and we can detect if they were deleted, not just with probability $1 - O(\delta)$ but with very high probability. The trick here is to run Algorithm 1 both from the left and from the right on $\tilde{x}$, looking for respectively the one in $\tilde{x}$ aligned to the $i$th one in $x$ and the one in $\tilde{x}$ aligned to the $(i+1)$st one in $x$ (which is the $(k-i)$th one when running the algorithm from the right). If either of these runs fails to align a one in $\tilde{x}$ to respectively the $i$th and $(i+1)$st one in $x$, or the runs disagree on their alignment, then we will almost certainly know. To see why, assuming that we are never ahead in the alignment procedure from the left, if we believe we have reached the $i$th one in $x$, then we are truly at some $i_1$th one where $i_1 \geq i$. By a symmetric argument, if we believe we have reached the $(i+1)$st one in $x$ after running the procedure from the right, we are truly at the $i_2$th one in $x$, where $i_2 \leq i+1$. The key observation now is that $i_1 < i_2$ if and only if $i_1 = i$ and $i_2 = i+1$, meaning that both runs succeeding is equivalent to the one found in the left-alignment procedure being strictly earlier than the one found in the right-alignment procedure. So, if we realize that either run fails to align the ones properly, we discard the trace and repeat on a newly sampled trace.
Finally, we can ensure that the success of the runs of the alignment algorithm is independent of the deletion of zeros between the $i$th and $(i+1)$st ones in $x$. If a trace is not discarded, then with very high probability, the gap between the ones in $\tilde{x}$ aligned to the $i$th and $(i+1)$st ones in $x$ (normalized by $1/(1-\delta)$) is an unbiased estimator for $g_i$. By taking the average of the gap over sufficiently many traces, normalizing by $1/(1-\delta)$, and rounding to the nearest integer, we determine $g_i$ exactly with very high probability. Doing so for each $i$ reconstructs $x$.
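For concreteness, here is a minimal sketch of this last averaging step, assuming we already have the verified (non-discarded) observed gap counts for index $i$ (the function name is ours):

```python
def fine_gap_estimate(observed_gaps, delta):
    """Each verified observation is Bin(g_i, 1 - delta) distributed;
    averaging, rescaling by 1/(1 - delta), and rounding recovers g_i
    exactly once the average is within (1 - delta)/2 of its mean."""
    avg = sum(observed_gaps) / len(observed_gaps)
    return round(avg / (1.0 - delta))
```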
Roadmap of our paper
In Section 2, we introduce notation. In Section 3, we describe and analyse our main alignment procedure. We first prove that with high probability it is never ahead (Lemma 2). Second, in Section 3.2, we bound the probability that it is behind (Lemma 3). Finally, in Section 4, we describe our full trace reconstruction algorithm and prove Theorem 1.
2 Notation
We note a few notational conventions and definitions.
-
We recall that a bitstring is $d$-separated if the gap between any two consecutive $1$'s in the string contains at least $d$ $0$'s.
-
Given a string $x = 0^{g_0} 1 0^{g_1} 1 \cdots 1 0^{g_k}$, we say that a run is a contiguous sequence of $0$'s in $x$. For $0 \leq j \leq k$, the $j$th run of $x$ is the sequence $0^{g_j}$, and has length $g_j$.
-
For any bitstring $y$, we use $y^R$ to denote the string where the bits have been reversed.
-
We use $g = (g_1, \ldots, g_k)$ to denote an integer sequence of length $k$. For notational convenience, for any $i \leq j$, we write $g_{i:j}$ to denote the subsequence $(g_i, g_{i+1}, \ldots, g_j)$, and $g_{:j}$ to denote $g_{1:j}$.
We will define some sufficiently large constants $C_1, C_2, C_3, C_4$ and a small constant $\delta_0$. We will assume the separation parameter $d \geq C_1 \log n$, and the deletion parameter $\delta \leq \delta_0$, where $\delta_0$ is a sufficiently small constant. We did not make a significant effort to optimize the constant $C_1$ or the value of $\delta_0$, though we believe that any straightforward modifications to our analysis will not obtain bounds such as $\delta_0 = 0.1$ or a separation of $o(\log n)$.
3 Main Alignment Procedure
3.1 Description and Main Lemmas
In this section, we consider a probabilistic process that models a simpler version of the trace reconstruction problem that we aim to solve. In the simpler version of the trace reconstruction problem, suppose that we never delete any $0$'s, but delete each $1$ independently with probability $\delta$. Let $g_1, \ldots, g_k$ represent the true lengths of the first $k$ gaps (so the first $1$ is at position $g_1 + 1$, the second is at position $g_1 + g_2 + 2$, and so on). Moreover, suppose we have some current predictions $\tilde{g}_1, \ldots, \tilde{g}_{\tilde{k}}$ of the gaps. The high-level goal will be, given a single trace (where the trace means only $1$'s are deleted), to identify the $k$th $1$ in the trace from the original string with reasonably high probability. (Note that the $k$th $1$ is deleted with probability $\delta$, in which case we cannot succeed.)
In this section, we will describe and analyze the probabilistic process, and then explain how this analysis helps us solve the trace reconstruction problem in Section 4.
In the process, we fix $k, \tilde{k}$ and two sequences $g = (g_1, \ldots, g_k)$ and $\tilde{g} = (\tilde{g}_1, \ldots, \tilde{g}_{\tilde{k}})$, where $g$ has length $k$ but $\tilde{g}$ has some length $\tilde{k}$ which may or may not equal $k$. Moreover, we assume $g_i \geq d$ and $\tilde{g}_i \geq d$ for every term, and $k, \tilde{k} \leq n$.
Now, for each $1 \leq i \leq k$, let $\xi_i$ be i.i.d. random variables, with $\xi_i = 1$ with probability $1 - \delta$ and $\xi_i = 0$ with probability $\delta$. Also, let $\xi_0 = 1$ with probability $1$. For each $i$ with $\xi_i = 1$, we define a value $a_i$ as follows. First, we set $a_0 = 0$. Next, for each index $i \geq 1$ such that $\xi_i = 1$, let $i'$ denote the previous index with $\xi_{i'} = 1$. We define $a_i$ to be the smallest index $a > a_{i'}$ such that the observed gap $g_{i'+1} + \cdots + g_i$ is within $C_3\sqrt{(\tilde{g}_{a_{i'}+1} + \cdots + \tilde{g}_a)\log n}$ of $\tilde{g}_{a_{i'}+1} + \cdots + \tilde{g}_a$, where $C_3$ is a sufficiently large constant. (If such an index does not exist, we set $a_i = \infty$.)
Our goal will be for $a_k = k$. In general, for any $i$ with $\xi_i = 1$, we would like $a_i = i$. If $a_i < i$, we say that we are $i - a_i$ steps behind at step $i$, and if $a_i > i$, we say that we are $a_i - i$ steps ahead at step $i$.
First, we note the following lemma, which states that with very high probability we will never be ahead, as long as the sequences $g$ and $\tilde{g}$ are similar enough.
Lemma 2.
Set $L = C_2 \log n$ for a sufficiently large constant $C_2$. Let $g, \tilde{g}$ be sequences of lengths $k, \tilde{k}$, respectively, where $\tilde{k} \geq k$. Suppose that $|g_i - \tilde{g}_i| \leq \sqrt{g_i L}$ for all $i \leq k$. Then, with probability at least $1 - n^{-10}$ (over the randomness of the $\xi_i$), for all $i$ with $\xi_i = 1$, $a_i \leq i$.
Proof.
Let us consider the event that for every index $j$, at least one of $\xi_{j+1}, \ldots, \xi_{j+L}$ equals $1$. Equivalently, the string $\xi_1 \xi_2 \cdots \xi_k$ does not ever have $L$ $0$'s in a row. For any fixed $j$, the probability of this being false is at most $\delta^L$, so by a union bound over all choices of $j$, the event holds with at most $n\delta^L \leq n^{-10}$ failure probability.
First, note that $a_0 = 0$. Now, suppose that some $i'$ satisfies $\xi_{i'} = 1$ and $a_{i'} \leq i'$. Suppose $i$ is the smallest index strictly larger than $i'$ such that $\xi_i = 1$. Note that $i - i' \leq L$, by our assumed event. Note that if we set $a = a_{i'} + (i - i')$, then $a \leq i$, since $a_{i'} \leq i'$. Moreover, the observed gap and the corresponding sum of estimated gaps differ by at most $\sum_j \sqrt{g_j L} \leq \sqrt{(i - i') L \sum_j g_j}$, where the second-to-last inequality is by Cauchy-Schwarz, and this is within the tolerance in the definition of $a_i$. Thus, $a$ satisfies the requirements for $a_i$, which means that $a_i \leq a \leq i$. Thus, if $a_{i'} \leq i'$, then $a_i \leq i$. Since $a_0 = 0$, this means $a_i \leq i$ for all $i$ with $\xi_i = 1$.
The main technical result will be showing that $a_i \geq i$ with reasonably high probability, i.e., with reasonably high probability we are not behind. This result will hold for any choice of $g$ and $\tilde{g}$ and does not require any similarity between these sequences. In other words, our goal is to prove the following lemma.
Lemma 3.
Let $g, \tilde{g}$ be sequences of length at most $n$ with every term between $d$ and $n$, where $d \geq C_1 \log n$ for a sufficiently large constant $C_1$. Define $a_i$ as above. Then, for any $i$ with $\xi_i = 1$, with probability at least $1 - O(\delta)$ over the randomness of $\xi$, $a_i \geq i$.
3.2 Proof of Lemma 3
In this section, we prove Lemma 3.
We will set a parameter $T = C_5 \log n$, where $C_5$ is a sufficiently large constant. For any $r$, given the sequences $g$ and $\tilde{g}$ (of possibly differing lengths), we define $q_r(g, \tilde{g})$ to be the probability (over the randomness of $\xi$) that
-
$a_k \leq k - r$.
-
For any indices $i \leq i'$ with $\xi_i = \xi_{i'} = 1$, $(i' - a_{i'}) - (i - a_i) < T$.
Equivalently, this is the same as the probability that we fall behind at least $r$ steps from step $0$ to step $k$, but we never fall behind $T$ or more steps (relatively) from any (possibly intermediate) steps $i$ to $i'$. For any $\ell$, we define $q_{r,\ell}$ to be the supremum value of $q_r(g, \tilde{g})$ over any sequences where $g$ has length at most $\ell$ and every $g_j$ and $\tilde{g}_j$ is between $d$ and $n$, and we also define $q_r = \sup_{\ell} q_{r,\ell}$.
Note that for any $r \geq T$, $q_r(g, \tilde{g}) = 0$, as the first condition means we fell behind at least $T$ steps from step $0$ to step $k$, contradicting the second condition. So, $q_{r,\ell}$ and $q_r$ also equal $0$ for any $r \geq T$.
First, we note a simple proposition that will only be useful for simplifying the argument at certain places.
Proposition 4.
For any $r$ and $\ell \leq \ell'$, $q_{r,\ell} \leq q_{r,\ell'}$.
Proof.
Since $q_{r,\ell}$ is the supremum over all pairs $(g, \tilde{g})$ where $g$ has length at most $\ell$, it suffices to note that every such pair is also considered in the supremum defining $q_{r,\ell'}$. Indeed, for any $g$ of length at most $\ell$ and any $\tilde{g}$, we must have that $g$ also has length at most $\ell'$, so we must have $q_r(g, \tilde{g}) \leq q_{r,\ell'}$ and hence $q_{r,\ell} \leq q_{r,\ell'}$.
We now aim to bound the probabilities $q_{r,\ell}$ for $r \geq 1$. We will do this via an inductive approach on the length of $g$, where the high-level idea is that if we fall back by $r$ steps, there is a natural splitting point where we can say first we fell back by $r_1$ steps, and then by $r_2$ steps, for some $r_1, r_2$ with $r_1 + r_2 = r$ – see Lemmas 6 and 7. This natural splitting point will be based on the structure of the similarity of $g$ and $\tilde{g}$, and will not work if $g$ and $\tilde{g}$ share an $r$-periodic structure. But in the periodic case, we can give a more direct argument that we cannot fall back by $r$ steps (i.e., a full period), even with small probability – see Lemma 5. We can then compute a recursive formula for the probability of falling back $r$ steps, by saying we need to first fall back $r_1$ steps and then fall back $r - r_1$ steps. In Lemma 9, we bound the terms of this recursion.
Lemma 5.
Fix any $r$ such that $1 \leq r \leq T$, and suppose that $d \geq C_1 \log n$, where $C_1$ is a sufficiently large multiple of $C_3^2$. Suppose that $g, \tilde{g}$ are sequences such that $\tilde{g}_j$ approximately matches $g_{j+r}$ for every $j$ for which both are defined (an approximate period-$r$ agreement). Then, the probability $q_r(g, \tilde{g}) \leq n^{-10}$.
Proof.
We show that the probability of ever being behind by $r$ or more is at most $n^{-10}$. In fact, we will show this deterministically never happens, conditioned on the event that for every index $j$, at least one of $\xi_{j+1}, \ldots, \xi_{j+L}$ equals $1$. Indeed, the probability of this being false for any fixed $j$ is at most $\delta^L$, so by a union bound over all choices of $j$, the event holds with at most $n\delta^L \leq n^{-10}$ failure probability.
Now, assume the event, and suppose that $a_i \leq i - r$ holds for some $i$. More precisely, we fix $i$ to be the smallest index such that $\xi_i = 1$ and $a_i \leq i - r$.
First, assume that . Consider the values , and let By our conditional assumption, and since , at least one of equals . Say that , where . Also, by our choice of , we know that , and that . So, we have two options:
-
1.
, and , for some and where .
-
2.
, and .
Now, let’s consider the list of all indices with , starting with if and otherwise, and ending with . By definition of the sequence , for every there exists such that and . Assuming that , then which means and thus So,
Adding the above equation over , we obtain
where the final line follows by Cauchy-Schwarz. Let be if and otherwise. Then, since , we have
(1)
The above equation tells us that can’t be too much smaller than . We now show contrary evidence, thus establishing a contradiction.
First, we compare to . Indeed, for any , . Since every , this also means . Adding over all , we have
where the last inequality follows by Cauchy-Schwarz and the fact that .
However, we do not care about – we really care about . To bound this, first note that for any , and . So, , assuming every . If we additionally have that then for any and . Importantly, .
In the case that this implies that . So, because we have
Recalling that and , since ,
(2)
In the case that , we instead have . So, since , we have that
so the same bound as (2) holds (in fact, an even stronger bound holds).
So, both (1) and (2) hold, in either case. Together, they imply that
This is impossible if is a sufficiently large multiple of . Since in either case, it suffices for to be a sufficiently large multiple of .
Lemma 6.
Fix any such that , and suppose that . Suppose that are sequences of length , such that for every . Then, the probability
Proof.
Suppose that for all . Then, we can use Lemma 5 to bound . Alternatively, let be the smallest index such that . Next, let be such that is the largest index less than with , and is the smallest index at least with . Finally, let and . In other words, is the number of steps we fall behind from to , and is the number of steps we fall behind from to .
Note that , and since each subsequent is strictly increasing, this means , so , assuming that . In other words, we have that are nonnegative integers such that .
Now, let us bound the probability (over the randomness of ) of the event indicated by occurring, with the corresponding values . Note that for any fixed , the event of those specific values is equivalent to and being , and everything in between being . So, the probability is at most . Now, conditioned on , the values imply that we fall back steps from step to (or we may move forward if ) and we fall back steps from step to . Moreover, there cannot be two steps such that that we fell back steps from to . Since and , this means both . So, the overall probability of the corresponding values is at most , where we are using the fact that for all by Proposition 4.
Overall, the probability is at most
We can cap as at most since otherwise or is . Moreover, we can give improved bounds in the cases when and either or .
Note that in either case, both and equal . In the former case, we must have and . Importantly, the algorithm fell back by exactly steps from to . However, we know that for all , . In that case, if we restrict ourselves to the strings and , we are dealing with the case of Lemma 5. Hence, we can bound the overall probability of this case by . In the latter case, we must have and , since we need to fall back by exactly steps from to . However, this actually cannot happen, because by definition of and , we must have that which is not true by our definition of .
Overall, this means
Lemma 7.
Fix any such that , and suppose that . Suppose that are sequences of length . Then, the probability
Proof.
Our proof will be quite similar to that of Lemma 6, so we omit some of the identical details.
First, assume that for every . Then, we can directly apply Lemma 6. Alternatively, let be the largest index such that . As in the proof of Lemma 6, let be such that is the largest index less than with , and is the smallest index at least with . Also, let and .
As in the proof of Lemma 6, we have , as long as . We can again do the same casework on , to obtain
Once again, we wish to consider the individual cases of or separately. In either case, . In the former case, we must have and . In this case, from step to we fall behind steps. In other words, we can restrict ourselves to the strings and . However, we have now restricted ourselves to strings which satisfy the conditions of Lemma 6, so we can bound the probability in this case as at most
In the latter case, we must have and . However, this is impossible, because by our definition of .
Overall, by adding all cases together, we obtain
Overall, this implies that
We can now universally bound $q_{r,\ell}$ for all $r, \ell$. To do so, we first recall some basic properties of the Catalan numbers.
Fact 8.
For $s \geq 0$, the Catalan numbers $\mathrm{Cat}_s$ (we use $\mathrm{Cat}_s$ rather than the more standard $C_s$ to avoid confusion with the constants we have defined) are defined as $\mathrm{Cat}_s = \frac{1}{s+1}\binom{2s}{s}$. They satisfy the following list of properties.
1. $\mathrm{Cat}_0 = 1$ and for all $s \geq 1$, $\mathrm{Cat}_s = \sum_{t=0}^{s-1} \mathrm{Cat}_t \cdot \mathrm{Cat}_{s-1-t}$.
2. For all $s \geq 0$, $\mathrm{Cat}_s \leq 4^s$.
3. For all $s \geq 0$, $\mathrm{Cat}_{s+1} \leq 4 \cdot \mathrm{Cat}_s$.
Lemma 9.
Assume $\delta \leq \delta_0$ for a sufficiently small constant $\delta_0$, and define $\alpha = C_4 \delta$ for a sufficiently large constant $C_4$, with $\mathrm{Cat}_r$ as in Fact 8. Then, for all $r \geq 1$ and all $\ell$, $q_{r,\ell} \leq \mathrm{Cat}_r \cdot \alpha^r$.
Proof.
We prove the statement by induction on . For , note that with probability , so . Indeed, either or . So, and for all .
Now, suppose that the induction hypothesis holds for : we now prove the statement for . First, note that . Next, for ,
(3)
We now bound the summation in the above expression. First, we focus on the terms where one of or is . If , the summation becomes . If we fix , for each there are choices of , which means the summation is . For , each term is at most half the previous term, so this is at most . Next, for , if we fix , the summation is , since there are choices of . We have a symmetric summation for . Finally, if we focus on the terms with , by writing and , for any fixed , the sum of is at most , and there are choices for . So, the summation is at most , where the last inequality holds because .
Overall, replacing indices accordingly, we can write (3) as at most
We can now focus on the middle summation term. If we first consider all terms with , the sum equals as long as . For the remaining terms, we fix and consider the sum. If , the sum equals . Since for all , this is at most . For , the sum equals . Since , as long as , the terms decrease by a factor greater than each time increases. So the sum over all is at most Overall, the summation in the middle term is at most .
Overall, this means (3) is at most
(4)
Now, note that for all , even for . Moreover, . Thus, (4) is at most
Assuming that , this is at most which can be verified to be at most for all , by just using the fact that for all . This completes the inductive step.
We are now ready to prove Lemma 3.
Proof of Lemma 3.
If , this means that either the event occurs, or there exist indices with but we fall behind at least steps from step to step .
Assuming , the probability of is at most . Alternatively, if there exist with but we fall behind at least steps from step to step , there must exist such an with minimal (breaking ties arbitrarily). This could be because for some . However, the probability of there being consecutive indices is at most .
The final option is that, if we look at the first index with , . This means that from step to , we must fall behind at least steps, and there could not have been any intermediate steps where we fell behind more than steps. Hence, if we restrict ourselves to the strings and , the event indicated by must occur, since conditioned on and the fact that , the value only depends on , starting from position , and .
In other words, there exists some contiguous subsequences and of and , respectively, such that the event of occurs. For any fixed , the probability is at most . Since there are at most possible contiguous subsequences for each of and , the overall probability is at most , assuming that and where is sufficiently large.
Overall, the probability of falling behind is at most .
4 Full algorithm/analysis
Let us depict the true string as $x = 0^{g_0} 1 0^{g_1} 1 \cdots 1 0^{g_k}$, i.e., there are $k$ ones, and the string starts and ends with a run of $0$'s. This assumption can be made WLOG by padding the string with $d$ $0$'s at the front and the end. For any $d$-separated string, doing this padding maintains the $d$-separated property, and we can easily simulate the padded trace by adding $\mathrm{Bin}(d, 1-\delta)$-many $0$'s at the front and at the back of each trace. Once we reconstruct the padded string, we remove the padding to get $x$.
We assume we know the value of $k$. Indeed, the number of $1$'s in a single trace is distributed as $\mathrm{Bin}(k, 1-\delta)$. So, by averaging the number of $1$'s over $\widetilde{O}(n)$ random traces and dividing by $1-\delta$, we get an estimate of $k$ that is accurate within $1/3$ with high probability. Thus, by rounding, we know $k$ exactly with high probability.
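This estimation of $k$ is simple enough to state in code (a sketch under the stated assumptions; `traces` is a nonempty list of bitstrings):

```python
def estimate_num_ones(traces, delta):
    """The number of ones in a trace is Bin(k, 1 - delta), so the average
    count divided by (1 - delta) concentrates around k; rounding then
    recovers k exactly with high probability, given enough traces."""
    avg = sum(t.count("1") for t in traces) / len(traces)
    return round(avg / (1.0 - delta))
```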
The main goal is now to learn the lengths $g_0, g_1, \ldots, g_k$. If we learn these exactly using the traces, this completes the proof. Our algorithm runs in two phases: a coarse estimation phase and a fine estimation phase. In the coarse estimation phase, we sequentially learn each $g_i$ up to additive error $O(\sqrt{g_i \log n})$. In the fine estimation phase, we learn each $g_i$ exactly, given the coarse estimates.
4.1 Coarse estimation
Fix some $0 \leq i \leq k$, and suppose that for all $j < i$, we have estimates $\tilde{g}_j$ satisfying $|\tilde{g}_j - g_j| = O(\sqrt{g_j \log n})$. (If $i = 0$, then we have no estimates yet.) Our goal will be to provide an estimate $\tilde{g}_i$ such that $|\tilde{g}_i - g_i| = O(\sqrt{g_i \log n})$.
Consider a trace $\tilde{x}$ of $x$. Let $\xi_0 = 1$, and for each $1 \leq j \leq k$, let $\xi_j$ be the indicator that the $j$th $1$ is retained. Next, for each $0 \leq j \leq k$, let $s_j$ represent the number of $0$'s in the $j$th run that were not deleted. Note that with at least $1 - n^{-10}$ probability, $|s_j - (1-\delta)g_j| \leq C\sqrt{g_j \log n}$ for all $j$. Since $g_j \geq d \geq C_1 \log n$ for all $j$, this implies that $|s_j/(1-\delta) - g_j| = O(\sqrt{g_j \log n})$ for all $j$.
Now, even though we have no knowledge of $g$ or $\xi$, we can still simulate the probabilistic process of Section 3. Let $j_1 < j_2 < \cdots$ be the list of all indices $j$ with $\xi_j = 1$. While we do not know the values $\xi_j$, for every pair of consecutive indices $j_t, j_{t+1}$, the value $s_{j_t} + s_{j_t+1} + \cdots + s_{j_{t+1}-1}$ is exactly the number of $0$'s between the $t$th and $(t+1)$st $1$ in the trace (where we say that the $0$th $1$ is at position $0$). In other words, if $\pi_t$ represents the position of the $t$th $1$ in the trace, then $s_{j_t} + \cdots + s_{j_{t+1}-1} = \pi_{t+1} - \pi_t - 1$. Hence, because computing each $a_{j_t}$ only requires knowledge of these observed $0$-counts and the value of $1-\delta$, and since the observed counts are visible in the trace, the algorithm can in fact compute $a_{j_t}$ for all $t$, using the same process as described in Section 3, even if the values $\xi_j$ are not known.
Algorithm 1 simulates this process, assuming knowledge of $\tilde{g}$, $1-\delta$, a single trace $\tilde{x}$, and the target index $i$. In Algorithm 1, we use a variable to represent $a_{j_t}$, i.e., the current prediction of the position of the current $1$. In other words, if the current $1$ is truly the $j$th $1$ of $x$, then $a_{j_t} - j$ equals the number of steps ahead (or $j - a_{j_t}$ equals the number of steps behind) we are.
Lemma 10.
Fix $i$ such that $|\tilde{g}_j - g_j| = O(\sqrt{g_j \log n})$ for all $j < i$. With probability at least $1 - O(\delta)$ over the randomness of $\tilde{x}$, we have that Algorithm 1 returns a position such that the corresponding $1$ in $\tilde{x}$ corresponds to the $i$th $1$ in $x$, and the next $1$ in $\tilde{x}$ corresponds to the $(i+1)$st $1$ in $x$. Moreover, conditioned on this event holding, the distribution of the number of $0$'s between these two $1$'s in $\tilde{x}$ exactly follows $\mathrm{Bin}(g_i, 1-\delta)$.
Proof.
Let us first condition on the values $s_j$, assuming that $|s_j - (1-\delta)g_j| \leq C\sqrt{g_j \log n}$ for all $j$. As discussed earlier, this occurs with at least $1 - n^{-10}$ probability, and implies that $|s_j/(1-\delta) - g_j| = O(\sqrt{g_j \log n})$ for all $j$.
Let us also condition on $\xi_i = 1$. By Lemma 2 and Lemma 3, the probability that $a_{j_t} = j_t$ whenever $j_t \leq i$, is at least $1 - O(\delta)$. This is conditioned on $\xi_i = 1$ and the values $s_j$ (assuming they satisfy the above concentration). This means that with at least $1 - O(\delta)$ probability, the algorithm finds the position in $\tilde{x}$ of the $i$th $1$ of $x$. Since this only depends on $\xi_1, \ldots, \xi_i$ and $s_0, \ldots, s_{i-1}$, with probability at least $1 - O(\delta)$ over the randomness of $\xi$ and $s$, we have that the algorithm succeeds and $\xi_i = 1$. This is independent of $\xi_{i+1}$, so with probability at least $(1-\delta)(1 - O(\delta)) = 1 - O(\delta)$, we additionally have that $\xi_{i+1} = 1$.
The event above means that the returned position is the position in $\tilde{x}$ of the $i$th $1$ in the true string $x$. Moreover, since neither the $i$th nor $(i+1)$st $1$ was deleted, the next $1$ of $\tilde{x}$ is the position in $\tilde{x}$ of the $(i+1)$st $1$ in the true string $x$. So, the number of $0$'s between them is in fact the length of the gap between the $i$th and $(i+1)$st $1$'s after deletion, which means it has distribution $\mathrm{Bin}(g_i, 1-\delta)$, since the deletion of the $0$'s in this gap is independent of the events that decide whether the two $1$'s are retained and correctly aligned.
Given this, we can crudely estimate every gap, in order. Namely, assuming that we have estimates $\tilde{g}_0, \ldots, \tilde{g}_{i-1}$ (where $|\tilde{g}_j - g_j| = O(\sqrt{g_j \log n})$), we can run the Align procedure on $O(\log n)$ independent traces. By a Chernoff bound, with $n^{-10}$ failure probability, at least a $2/3$ fraction of the traces will have the desired property of Lemma 10, so Align will output a gap count $s$ distributed as $\mathrm{Bin}(g_i, 1-\delta)$. Since $\mathrm{Bin}(g_i, 1-\delta)$ is in the range $(1-\delta)g_i \pm C\sqrt{g_i \log n}$ with at least $1 - n^{-10}$ probability, at least a $3/5$ fraction of the outputs will satisfy $|s/(1-\delta) - g_i| = O(\sqrt{g_i \log n})$, with $n^{-10}$ failure probability. Thus, by defining $\tilde{g}_i$ to be the median value of $s/(1-\delta)$ across the randomly drawn traces, we have that $|\tilde{g}_i - g_i| = O(\sqrt{g_i \log n})$ with at least $1 - n^{-9}$ probability.
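In code, the median-based coarse estimation of a single gap might look as follows (illustrative only; `align_once` is a hypothetical stand-in for one run of the Align procedure on a fresh trace, returning the observed zero-count for gap $i$, or None on failure):

```python
import statistics

def coarse_estimate(traces, delta, align_once):
    """Median-of-traces sketch for one gap: rescale each successful
    observation by 1/(1 - delta) and take the median, which is robust
    to the O(delta) fraction of misaligned traces."""
    samples = [align_once(t) for t in traces]
    samples = [s / (1.0 - delta) for s in samples if s is not None]
    if not samples:
        return None
    return statistics.median(samples)
```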
By running this procedure iteratively to provide estimates $\tilde{g}_0, \tilde{g}_1, \ldots, \tilde{g}_k$, we obtain Algorithm 2. The analysis in the above paragraph implies the following result.
Theorem 11 (Crude Approximation).
Algorithm 2 uses $O(n)$ traces and polynomial time, and learns estimates $\tilde{g}_0, \ldots, \tilde{g}_k$ such that with at least $1 - n^{-8}$ probability, $|\tilde{g}_j - g_j| = O(\sqrt{g_j \log n})$ for all $j$.
4.2 Fine estimation
In this section, we show how to exactly compute each $g_i$ with high probability, given the crude estimates $\tilde{g}_0, \ldots, \tilde{g}_k$. This will again be done using an alignment procedure, but this time running the alignment both “forward and backward”.
Namely, given a trace $\tilde{x}$, we will try to identify the $i$th and $(i+1)$st $1$'s from the original string, but we try to identify the $i$th by running Align on $\tilde{x}$ and the $(i+1)$st by running Align on the reverse string $\tilde{x}^R$. The idea is: assuming that we never go ahead in the alignment procedure, if we find some $1$ in the forward alignment procedure that we believe is the $i$th, then its true index must be at least $i$. Likewise, if we do the alignment procedure in reverse until we believe we have found the $(k-i)$th $1$ from the back (equivalently, the $(i+1)$st from the front), the true index must be at most $i+1$.
So, the true position of the one found in the forward alignment procedure can only be strictly earlier than that of the one found in the backward alignment procedure if the true positions were exactly $i$ and $i+1$, respectively. Thus, by comparing the indices, we can effectively verify that the positions are correct, with negligible failure probability (rather than with $O(\delta)$ failure probability). This is the key towards obtaining the fine estimate of $g_i$, rather than just a coarse estimate that may be off by $O(\sqrt{g_i \log n})$.
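A sketch of this two-sided verification follows (schematic; `align` is a hypothetical one-sided alignment routine returning the index, among the ones of the given trace, believed to correspond to the target one of $x$, or None on failure):

```python
def verified_gap(trace, i, g_est, delta, align):
    """Locate the ones aligned to the i-th and (i+1)-st ones of x by
    aligning from the left and from the right; keep the trace only if
    the left-found one is strictly earlier than the right-found one."""
    ones = [pos for pos, bit in enumerate(trace) if bit == "1"]
    k = len(g_est) - 1                      # number of ones (gaps g_0..g_k)
    left = align(trace, i, g_est, delta)    # index into `ones`, or None
    right_rev = align(trace[::-1], k - i, g_est[::-1], delta)
    if left is None or right_rev is None:
        return None                         # a run failed outright: discard
    right = len(ones) - 1 - right_rev       # convert to a left-to-right index
    if not (left < right):                  # runs disagree: discard the trace
        return None
    return ones[right] - ones[left] - 1     # zeros between the two ones
```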
Algorithm 3 formally describes the fine alignment procedure, using fresh traces, assuming we have already done the coarse estimation to find $\tilde{g}_0, \ldots, \tilde{g}_k$.
Lemma 12.
Suppose that $|\tilde{g}_j - g_j| = O(\sqrt{g_j \log n})$ for all $j$. Fix an index $i$ and a trace, and for simplicity of notation, denote the trace by $\tilde{x}$. Let $N$ be the number of $1$'s in $\tilde{x}$. Then, the probability that the procedure does not return FAIL, but either the forward or backward iteration finds a $1$ in $\tilde{x}$ which does not correspond to the $i$th or $(i+1)$st $1$, respectively, from $x$, is at most $n^{-8}$. Moreover, if the forward and backward iterations find $1$'s in $\tilde{x}$ corresponding to the $i$th and $(i+1)$st $1$'s, respectively, then these are consecutive $1$'s of $\tilde{x}$. Finally, the probability of finding both corresponding $1$'s is at least $1 - O(\delta)$.
Proof.
First, let us consider the forward alignment procedure. We know that the alignment tracks the candidate position when looking at each $1$ of $\tilde{x}$ (from left to right). So, if we do not return FAIL, the procedure stops at some $1$ of $\tilde{x}$, which truly corresponds to the $i_1$th $1$ of $x$ for some $i_1$. If $i_1 < i$, this implies there is an index where the alignment was ahead. The probability of this is at most $n^{-10}$, by Lemma 2. Otherwise, $i_1 \geq i$, meaning that the $1$ found in $\tilde{x}$ is after (or equal to) the $i$th $1$ in $x$.
Likewise, if we consider the backward alignment procedure, if we do not return FAIL, then except for an event with probability at most $n^{-10}$, the $1$ found in $\tilde{x}^R$ is after (or equal to) the $(k-i)$th $1$ in $x^R$. Equivalently, the $1$ found in $\tilde{x}$ (reading from left to right) is before (or equal to) the $(i+1)$st $1$ in $x$ (reading from left to right).
So, barring an event of probability $O(n^{-10})$, writing $i_1 \geq i$ and $i_2 \leq i+1$ for the true indices found by the two runs, the only way that the $1$ found by the forward run is strictly before the $1$ found by the backward run is if the former is precisely the $i$th $1$ in $x$ and the latter is precisely the $(i+1)$st $1$ in $x$. Indeed, if $i_1 > i$, then $i_1 \geq i+1 \geq i_2$, and if $i_2 < i+1$, then $i_2 \leq i \leq i_1$; either way, the forward-found $1$ is not strictly before the backward-found $1$ (reading from left to right). This proves the first statement.
Next, if we in fact found the corresponding indices, they are consecutive $1$'s in $x$, which means they must be consecutive $1$'s in $\tilde{x}$ as well (as both were retained). So, if we found the $t$th $1$ from the left, and the $t'$th $1$ from the right, we must have $t + t' = N$.
Finally, the event of finding both corresponding indices is equivalent to the alignment being correct at the $i$th $1$ in the forward iteration and at the $(k-i)$th $1$ in the backward iteration. Conditioned on the corresponding $1$'s not being deleted, each of these occurs with at least $1 - O(\delta)$ probability, by Lemmas 2 and 3. So, the overall probability is at least $(1-\delta)^2 \cdot (1 - O(\delta)) = 1 - O(\delta)$.
We are now ready to prove Theorem 1. Indeed, given the accuracy of the crude estimation procedure, it suffices to check that for each $i$, we compute $g_i$ correctly with at least $1 - n^{-5}$ probability.
Theorem 13 (Fine Estimation).
Assume that $k$, the number of ones in $x$, is computed correctly, and for all $j$, $|\tilde{g}_j - g_j| = O(\sqrt{g_j \log n})$.
Then, for any fixed $i$, with at least $1 - n^{-5}$ probability, we compute the gap $g_i$ correctly.
Proof.
For any fixed trace $\tilde{x}^{(t)}$, if both the forward and backward procedures correctly identify the $i$th and $(i+1)$st $1$'s from the left, respectively, then the two found $1$'s are consecutive in $\tilde{x}^{(t)}$ by Lemma 12. In this case, we will compute an actual value rather than NULL. Moreover, as discussed in the proof of Lemma 10, the event that the forward procedure correctly identifies the right $1$ only depends on $\xi_1, \ldots, \xi_i$ and $s_0, \ldots, s_{i-1}$, i.e., on the events of whether the first $i$ $1$'s and the $0$'s before the $i$th $1$ are deleted. Likewise, the event that the backward procedure correctly identifies the right $1$ only depends on $\xi_{i+1}, \ldots, \xi_k$ and $s_{i+1}, \ldots, s_k$, i.e., on the events of whether the $(i+1)$st until the $k$th $1$'s and the $0$'s after the $(i+1)$st $1$ are deleted.
Thus, the forward and backward procedures correctly identifying the right $1$'s is independent of $s_i$. Moreover, in this case, the computed value is precisely $s_i$, since the found $1$'s are the positions in $\tilde{x}^{(t)}$ corresponding to the $i$th and $(i+1)$st $1$'s in $x$, and neither the $i$th nor $(i+1)$st $1$ can be deleted if both of these $1$'s are identified.
So, if the forward and backward procedures identify the right $1$'s for trace $\tilde{x}^{(t)}$, the conditional distribution of the computed value is $\mathrm{Bin}(g_i, 1-\delta)$. However, we really want to look at the distribution conditioned on the event that the value is not NULL. Indeed, by Lemma 12, this event is equivalent to either the forward and backward procedures identifying the right $1$'s, or some other event which occurs with at most $n^{-8}$ probability. Because the value is clearly between $0$ and $n$, and since the probability of both $1$'s being correctly identified is at least $1 - O(\delta)$ by Lemma 12, the expectation of the value, conditioned on not being NULL, is $(1-\delta)g_i \pm O(n^{-7})$.
By a Chernoff bound, the number of traces with non-NULL values is at least half its expectation with at least $1 - n^{-10}$ probability. Then, by another Chernoff bound, the empirical average of all such values is within $(1-\delta)/3$ of its expectation with high probability. Thus, taking the empirical average and dividing by $1-\delta$, with at most $n^{-5}$ failure probability, $1/(1-\delta)$ times the average of all non-NULL values is within $1/2$ of $g_i$, and thus rounds to $g_i$.
5 Conclusion and Open Questions
In this paper, we established that the trace reconstruction problem can be solved with a near-linear number of traces, as long as any two ones in the initial string are separated by at least $C\log n$ zeros and the deletion probability is at most a sufficiently small constant. It is an interesting open question to handle more general deserts, such as repetitions of a short pattern like $0101\cdots01$, interspersed with mildly separated zeros and ones. Indeed, we believe that this is an important step towards solving the general trace reconstruction problem with deletion probability $\delta = O(1/\log n)$. With this deletion probability, the Bitwise Majority Alignment (BMA) algorithm from [3] succeeds in reconstructing $x$ as long as $x$ does not contain any such highly repetitive contiguous substrings. If one can provide a separate algorithm for such strings, one could then imagine $x$ being partitioned into contiguous substrings that can be reconstructed by respectively BMA and the highly-repetitive-string algorithm in an alternating fashion. Additional work is required to determine how to switch between the two algorithms.
References
- [1] Frank Ban, Xi Chen, Adam Freilich, Rocco A. Servedio, and Sandip Sinha. Beyond trace reconstruction: Population recovery from the deletion channel. In Foundations of Computer Science (FOCS), pages 745–768, 2019. doi:10.1109/FOCS.2019.00050.
- [2] Frank Ban, Xi Chen, Rocco A. Servedio, and Sandip Sinha. Efficient average-case population recovery in the presence of insertions and deletions. In Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques, pages 44:1–44:18, 2019. doi:10.4230/LIPIcs.APPROX-RANDOM.2019.44.
- [3] Tugkan Batu, Sampath Kannan, Sanjeev Khanna, and Andrew McGregor. Reconstructing strings from random traces. In Symposium on Discrete Algorithms (SODA), pages 910–918, 2004. URL: http://dl.acm.org/citation.cfm?id=982792.982929.
- [4] Joshua Brakensiek, Ray Li, and Bruce Spang. Coded trace reconstruction in a constant number of traces. In Foundations of Computer Science (FOCS), 2020. arXiv:1908.03996.
- [5] Diptarka Chakraborty, Debarati Das, and Robert Krauthgamer. Approximate trace reconstruction via median string (in average-case). In Foundations of Software Technology and Theoretical Computer Science (FSTTCS), volume 213 of LIPIcs, pages 11:1–11:23. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.FSTTCS.2021.11.
- [6] Zachary Chase. New lower bounds for trace reconstruction. Ann. Inst. H. Poincaré Probab. Statist., 57(2), 2021. URL: http://arxiv.org/abs/1905.03031.
- [7] Zachary Chase. Separating words and trace reconstruction. In Symposium on Theory of Computing (STOC), 2021.
- [8] Zachary Chase and Yuval Peres. Approximate trace reconstruction of random strings from a constant number of traces. CoRR, abs/2107.06454, 2021.
- [9] Xi Chen, Anindya De, Chin Ho Lee, Rocco A. Servedio, and Sandip Sinha. Polynomial-time trace reconstruction in the low deletion rate regime. In Innovations in Theoretical Computer Science (ITCS), 2021. arXiv:2012.02844.
- [10] Xi Chen, Anindya De, Chin Ho Lee, Rocco A. Servedio, and Sandip Sinha. Polynomial-time trace reconstruction in the smoothed complexity model. In Symposium on Discrete Algorithms (SODA), 2021. arXiv:2008.12386.
- [11] Xi Chen, Anindya De, Chin Ho Lee, Rocco A. Servedio, and Sandip Sinha. Near-optimal average-case approximate trace reconstruction from few traces. In Symposium on Discrete Algorithms (SODA), 2022. arXiv:2107.11530.
- [12] Xi Chen, Anindya De, Chin Ho Lee, Rocco A. Servedio, and Sandip Sinha. Approximate trace reconstruction from a single trace. In Symposium on Discrete Algorithms (SODA), 2023. doi:10.48550/arXiv.2211.03292.
- [13] Mahdi Cheraghchi, Ryan Gabrys, Olgica Milenkovic, and João Ribeiro. Coded trace reconstruction. IEEE Trans. Inf. Theory, 66(10):6084–6103, 2020. doi:10.1109/TIT.2020.2996377.
- [14] Sami Davies, Miklos Racz, and Cyrus Rashtchian. Reconstructing trees from traces. In Conference On Learning Theory (COLT), pages 961–978, 2019. URL: http://proceedings.mlr.press/v99/davies19a.html.
- [15] Sami Davies, Miklós Z. Rácz, Benjamin G. Schiffer, and Cyrus Rashtchian. Approximate trace reconstruction: Algorithms. In International Symposium on Information Theory (ISIT), pages 2525–2530. IEEE, 2021. doi:10.1109/ISIT45174.2021.9517926.
- [16] Anindya De, Ryan O’Donnell, and Rocco A. Servedio. Optimal mean-based algorithms for trace reconstruction. Annals of Applied Probability, 29(2):851–874, 2019. doi:10.1214/18-AAP1394.
- [17] Lisa Hartung, Nina Holden, and Yuval Peres. Trace reconstruction with varying deletion probabilities. In Analytic Algorithmics and Combinatorics (ANALCO), pages 54–61, 2018. doi:10.1137/1.9781611975062.6.
- [18] Nina Holden and Russell Lyons. Lower bounds for trace reconstruction. Annals of Applied Probability, 30(2):503–525, 2020. doi:10.1214/19-AAP1506.
- [19] Nina Holden, Robin Pemantle, and Yuval Peres. Subpolynomial trace reconstruction for random strings and arbitrary deletion probability. In Conference On Learning Theory (COLT), pages 1799–1840, 2018. URL: http://proceedings.mlr.press/v75/holden18a.html.
- [20] Thomas Holenstein, Michael Mitzenmacher, Rina Panigrahy, and Udi Wieder. Trace reconstruction with constant deletion probability and related results. In Symposium on Discrete Algorithms (SODA), pages 389–398, 2008. doi:10.1145/1347082.1347125.
- [21] Sampath Kannan and Andrew McGregor. More on reconstructing strings from random traces: insertions and deletions. In International Symposium on Information Theory (ISIT), pages 297–301, 2005. doi:10.1109/ISIT.2005.1523342.
- [22] Akshay Krishnamurthy, Arya Mazumdar, Andrew McGregor, and Soumyabrata Pal. Trace reconstruction: Generalized and parameterized. IEEE Trans. Inf. Theory, 67(6):3233–3250, 2021. doi:10.1109/TIT.2021.3066010.
- [23] Vladimir I. Levenshtein. Efficient reconstruction of sequences. IEEE Trans. Information Theory, 47(1):2–22, 2001. doi:10.1109/18.904499.
- [24] Vladimir I. Levenshtein. Efficient reconstruction of sequences from their subsequences or supersequences. J. Comb. Theory, Ser. A, 93(2):310–332, 2001. doi:10.1006/jcta.2000.3081.
- [25] Andrew McGregor, Eric Price, and Sofya Vorotnikova. Trace reconstruction revisited. In European Symposium on Algorithms (ESA), pages 689–700, 2014. doi:10.1007/978-3-662-44777-2_57.
- [26] Andrew McGregor and Rik Sengupta. Graph reconstruction from random subgraphs. In International Colloquium on Automata, Languages, and Programming (ICALP), volume 229, pages 96:1–96:18, 2022. doi:10.4230/LIPICS.ICALP.2022.96.
- [27] Andrew McGregor and Rik Sengupta. Graph reconstruction from noisy random subgraphs. CoRR, abs/2405.04261, 2024. doi:10.48550/arXiv.2405.04261.
- [28] Shyam Narayanan. Improved algorithms for population recovery from the deletion channel. In Symposium on Discrete Algorithms (SODA), pages 1259–1278. SIAM, 2021. doi:10.1137/1.9781611976465.77.
- [29] Shyam Narayanan and Michael Ren. Circular trace reconstruction. In Innovations in Theoretical Computer Science (ITCS), 2021. arXiv:2009.01346.
- [30] Fedor Nazarov and Yuval Peres. Trace reconstruction with $\exp(O(n^{1/3}))$ samples. In Symposium on Theory of Computing (STOC), pages 1042–1046, 2017. doi:10.1145/3055399.3055494.
- [31] Yuval Peres and Alex Zhai. Average-case reconstruction for the deletion channel: Subpolynomially many traces suffice. In Foundations of Computer Science (FOCS), pages 228–239, 2017. doi:10.1109/FOCS.2017.29.
- [32] Ittai Rubinstein. Average-case to (shifted) worst-case reduction for the trace reconstruction problem. In International Colloquium on Automata, Languages, and Programming (ICALP), volume 261 of LIPIcs, pages 102:1–102:20, 2023. URL: https://arxiv.org/abs/2207.11489.
- [33] Alec Sun and William Yue. The trace reconstruction problem for spider graphs. Discrete Mathematics, 346(1):113115, 2023. doi:10.1016/J.DISC.2022.113115.
- [34] Krishnamurthy Viswanathan and Ram Swaminathan. Improved string reconstruction over insertion-deletion channels. In Symposium on Discrete Algorithms (SODA), pages 399–408, 2008. doi:10.1145/1347082.1347126.