Efficient Average-Case Population Recovery in the Presence of Insertions and Deletions
A number of recent works have considered the trace reconstruction problem, in which an unknown source string x in {0,1}^n is transmitted through a probabilistic channel that may randomly delete coordinates or insert random bits, resulting in a trace of x. The goal is to reconstruct the original string x from independent traces of x. While the asymptotically best algorithms known for worst-case strings use exp(O(n^{1/3})) traces [De et al., 2017; Nazarov and Peres, 2017], several highly efficient algorithms are known [Peres and Zhai, 2017; Holden et al., 2018] for the average-case version of the problem, in which the source string x is chosen uniformly at random from {0,1}^n.
In this paper we consider a generalization of this average-case trace reconstruction problem, which we call average-case population recovery in the presence of insertions and deletions. In this problem, rather than a single unknown source string, there is an unknown distribution over s unknown source strings x^1,...,x^s in {0,1}^n, and each sample given to the algorithm is independently generated by drawing some x^i from this distribution and returning an independent trace of x^i.
Building on the results of [Peres and Zhai, 2017] and [Holden et al., 2018], we give an efficient algorithm for the average-case population recovery problem in the presence of insertions and deletions. For any support size 1 <= s <= exp(Theta(n^{1/3})), for a 1-o(1) fraction of all s-element support sets {x^1,...,x^s} subset of {0,1}^n, and for every distribution D supported on {x^1,...,x^s}, our algorithm can efficiently recover D up to total variation distance at most epsilon with high probability, given access to independent traces of independent draws from D. The running time of our algorithm is poly(n,s,1/epsilon) and its sample complexity is poly(s,1/epsilon,exp(log^{1/3} n)).
This polynomial dependence on the support size s is in sharp contrast with the worst-case version of the problem (when x^1,...,x^s may be any strings in {0,1}^n), in which the sample complexity of the most efficient known algorithm [Ban et al., 2019] is doubly exponential in s.
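The sampling model described in the abstract can be illustrated with a minimal Python sketch. The channel used here is a simplifying assumption rather than the paper's exact definition: before each source bit a geometric number of uniform random bits is inserted (continuation probability `insert_p`), and the source bit itself is then deleted with probability `delete_p`. The names `trace`, `sample`, `total_variation`, `delete_p`, and `insert_p` are hypothetical and chosen only for illustration. The sketch shows how samples are generated and how the recovery error is measured, not the recovery algorithm itself.

```python
import random


def trace(x, delete_p, insert_p, rng):
    """Return one trace of the bit list x under the assumed channel."""
    out = []
    for bit in x:
        # Assumed insertion model: keep inserting uniform random bits
        # while a coin with bias insert_p comes up heads.
        while rng.random() < insert_p:
            out.append(rng.randrange(2))
        # The source bit survives unless it is deleted.
        if rng.random() >= delete_p:
            out.append(bit)
    return out


def sample(support, weights, delete_p, insert_p, rng):
    """Draw some x^i from the distribution, then return an independent trace of it."""
    x = rng.choices(support, weights=weights, k=1)[0]
    return trace(x, delete_p, insert_p, rng)


def total_variation(p, q):
    """Total variation distance between two distributions given as dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


if __name__ == "__main__":
    rng = random.Random(0)
    n, s = 16, 3
    # Average-case setting: the support strings are uniformly random.
    support = [[rng.randrange(2) for _ in range(n)] for _ in range(s)]
    weights = [0.5, 0.3, 0.2]
    for _ in range(3):
        print(sample(support, weights, delete_p=0.2, insert_p=0.1, rng=rng))
```

A recovery algorithm would see only the output of repeated `sample` calls and would be judged by `total_variation` between its estimate and the true distribution D.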
population recovery
deletion channel
trace reconstruction
Mathematics of computing~Information theory
Theory of computation~Machine learning theory
44:1-44:18
RANDOM
Frank Ban
UC Berkeley, Berkeley, CA, USA
Xi Chen
Columbia University, New York, NY, USA
http://www.cs.columbia.edu/~xichen
Supported by NSF IIS-1838154 and NSF CCF-1703925.
Rocco A. Servedio
Columbia University, New York, NY, USA
http://www.cs.columbia.edu/~rocco
Supported by NSF grants CCF-1563155, CCF-1814873, IIS-1838154, and by the Simons Collaboration on Algorithms and Geometry.
Sandip Sinha
Columbia University, New York, NY, USA
https://sites.google.com/view/sandips
https://orcid.org/0000-0002-2592-175X
Supported by NSF awards CCF-1563155, CCF-1420349, CCF-1617955, CCF-1740833, CCF-1421161, CCF-1714818 and Simons Foundation (#491119).
10.4230/LIPIcs.APPROX-RANDOM.2019.44
Alexandr Andoni, Mark Braverman, and Avinatan Hassidim. Phylogenetic Reconstruction with Insertions and Deletions. Manuscript, 2014.
Alexandr Andoni, Constantinos Daskalakis, Avinatan Hassidim, and Sébastien Roch. Global Alignment of Molecular Sequences via Ancestral State Reconstruction. In ICS, pages 358-369, 2010.
Frank Ban, Xi Chen, Adam Freilich, Rocco A. Servedio, and Sandip Sinha. Beyond trace reconstruction: Population recovery from the deletion channel. CoRR, abs/1904.05532, 2019. URL: http://arxiv.org/abs/1904.05532.
T. Batu, S. Kannan, S. Khanna, and A. McGregor. Reconstructing strings from random traces. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2004, pages 910-918, 2004.
Zachary Chase. New lower bounds for trace reconstruction. arXiv preprint, 2019. URL: http://arxiv.org/abs/1905.03031.
Constantinos Daskalakis and Sébastien Roch. Alignment-Free Phylogenetic Reconstruction. In RECOMB, pages 123-137, 2010.
A. De, M. Saks, and S. Tang. Noisy population recovery in polynomial time. Technical Report TR-16-026, Electronic Colloquium on Computational Complexity, 2016. To appear in FOCS 2016.
Anindya De, Ryan O'Donnell, and Rocco A. Servedio. Optimal mean-based algorithms for trace reconstruction. In Proceedings of the 49th ACM Symposium on Theory of Computing (STOC), pages 1047-1056, 2017.
Anindya De, Ryan O'Donnell, and Rocco A. Servedio. Sharp bounds for population recovery. CoRR, abs/1703.01474, 2017. URL: http://arxiv.org/abs/1703.01474.
Z. Dvir, A. Rao, A. Wigderson, and A. Yehudayoff. Restriction access. In Innovations in Theoretical Computer Science, pages 19-33, 2012.
W. Feller. An introduction to probability theory and its applications. John Wiley & Sons, 1968.
Nina Holden and Russell Lyons. Lower bounds for trace reconstruction. arXiv preprint, 2018. URL: https://arxiv.org/abs/1808.02336.
Nina Holden, Robin Pemantle, and Yuval Peres. Subpolynomial trace reconstruction for random strings and arbitrary deletion probability. CoRR, abs/1801.04783, 2018.
T. Holenstein, M. Mitzenmacher, R. Panigrahy, and U. Wieder. Trace reconstruction with constant deletion probability and related results. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2008, pages 389-398, 2008.
Svante Janson. Tail bounds for sums of geometric and exponential variables. Statistics & Probability Letters, 135:1-6, 2018. URL: https://doi.org/10.1016/j.spl.2017.11.017.
Sampath Kannan and Andrew McGregor. More on Reconstructing Strings from Random Traces: Insertions and Deletions. In IEEE International Symposium on Information Theory, pages 297-301, 2005.
Akshay Krishnamurthy, Arya Mazumdar, Andrew McGregor, and Soumyabrata Pal. Trace Reconstruction: Generalized and Parameterized. arXiv preprint, 2019. URL: http://arxiv.org/abs/1904.09618.
S. Lovett and J. Zhang. Improved Noisy Population Recovery, and Reverse Bonami-Beckner Inequality for Sparse Functions. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015, pages 137-142, 2015.
Andrew McGregor, Eric Price, and Sofya Vorotnikova. Trace Reconstruction Revisited. In Proceedings of the 22nd Annual European Symposium on Algorithms, pages 689-700, 2014.
Ankur Moitra and Michael E. Saks. A Polynomial Time Algorithm for Lossy Population Recovery. In 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2013, 26-29 October, 2013, Berkeley, CA, USA, pages 110-116, 2013.
Fedor Nazarov and Yuval Peres. Trace reconstruction with exp(O(n^{1/3})) samples. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pages 1042-1046, 2017.
Lee Organick, Siena Dumas Ang, Yuan-Jyue Chen, Randolph Lopez, Sergey Yekhanin, Konstantin Makarychev, Miklos Z. Racz, Govinda Kamath, Parikshit Gopalan, Bichlien Nguyen, et al. Random access in large-scale DNA data storage. Nature Biotechnology, 36(3):242, 2018.
Yuval Peres and Alex Zhai. Average-Case Reconstruction for the Deletion Channel: Subpolynomially Many Traces Suffice. In FOCS, pages 228-239, 2017.
Yury Polyanskiy, Ananda Theertha Suresh, and Yihong Wu. Sample complexity of population recovery. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, Amsterdam, The Netherlands, 7-10 July 2017, pages 1589-1618, 2017.
Krishnamurthy Viswanathan and Ram Swaminathan. Improved string reconstruction over insertion-deletion channels. In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 399-408, 2008.
A. Wigderson and A. Yehudayoff. Population recovery and partial identification. Machine Learning, 102(1):29-56, 2016. Preliminary version in FOCS 2012.
S.M. Hossein Tabatabaei Yazdi, Ryan Gabrys, and Olgica Milenkovic. Portable and Error-Free DNA-Based Data Storage. Scientific Reports, 7(1):5011, 2017.
Frank Ban, Xi Chen, Rocco A. Servedio, and Sandip Sinha
Creative Commons Attribution 3.0 Unported license
https://creativecommons.org/licenses/by/3.0/legalcode