Approximate Trace Reconstruction via Median String (In Average-Case)

Chakraborty, Diptarka; Das, Debarati; Krauthgamer, Robert

doi:10.4230/LIPIcs.FSTTCS.2021.11

Abstract

We consider an approximate version of the trace reconstruction problem, where the goal is to recover an unknown string s ∈ {0,1}ⁿ from m traces (each trace is generated independently by passing s through a probabilistic insertion-deletion channel with rate p). We present a deterministic near-linear time algorithm for the average-case model, where s is random, that uses only three traces. It runs in near-linear time Õ(n) and with high probability reports a string within edit distance Õ(p² n) from s, which significantly improves over the straightforward bound of O(pn).
Technically, our algorithm computes a (1+ε)-approximate median of the three input traces. To prove its correctness, our probabilistic analysis shows that an approximate median is indeed close to the unknown s. To achieve a near-linear time bound, we have to bypass the well-known dynamic programming algorithm that computes an optimal median in time O(n³).

J. Abreu and Juan Ramón Rico-Juan. A new iterative algorithm for computing a quality approximate median of strings based on edit operations. Pattern Recognition Letters, 36:74-80, 2014.
Nir Ailon, Moses Charikar, and Alantha Newman. Aggregating inconsistent information: Ranking and clustering. J. ACM, 55(5):23:1-23:27, 2008. URL: https://doi.org/10.1145/1411509.1411513.
Frank Ban, Xi Chen, Adam Freilich, Rocco A Servedio, and Sandip Sinha. Beyond trace reconstruction: Population recovery from the deletion channel. In 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 745-768. IEEE, 2019.
Tugkan Batu, Funda Ergün, Joe Kilian, Avner Magen, Sofya Raskhodnikova, Ronitt Rubinfeld, and Rahul Sami. A sublinear algorithm for weakly approximating edit distance. In Proceedings of the 35th Annual ACM Symposium on Theory of Computing, STOC '03, pages 316-324. ACM, 2003.
Tugkan Batu, Sampath Kannan, Sanjeev Khanna, and Andrew McGregor. Reconstructing strings from random traces. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2004, pages 910-918. SIAM, 2004.
Hervé Cardot, Peggy Cénac, and Antoine Godichon-Baggioni. Online estimation of the geometric median in Hilbert spaces: Nonasymptotic confidence balls. Annals of Statistics, 45(2):591-614, 2017. URL: https://doi.org/10.1214/16-AOS1460.
Francisco Casacuberta and M. D. Antonio. A greedy algorithm for computing approximate median strings. In Proc. of National Symposium on Pattern Recognition and Image Analysis, pages 193-198, 1997.
Diptarka Chakraborty, Debarati Das, and Robert Krauthgamer. Approximating the median under the ulam metric. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 761-775. SIAM, 2021. URL: https://doi.org/10.1137/1.9781611976465.48.
Zachary Chase. New lower bounds for trace reconstruction. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 57(2):627-643, 2021.
Zachary Chase. Separating words and trace reconstruction. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 21-31, 2021.
Xi Chen, Anindya De, Chin Ho Lee, Rocco A. Servedio, and Sandip Sinha. Polynomial-time trace reconstruction in the low deletion rate regime. In James R. Lee, editor, 12th Innovations in Theoretical Computer Science Conference, ITCS 2021, volume 185 of LIPIcs, pages 20:1-20:20. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021.
Xi Chen, Anindya De, Chin Ho Lee, Rocco A. Servedio, and Sandip Sinha. Polynomial-time trace reconstruction in the smoothed complexity model. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms, pages 54-73. SIAM, 2021.
Mahdi Cheraghchi, Ryan Gabrys, Olgica Milenkovic, and Joao Ribeiro. Coded trace reconstruction. IEEE Transactions on Information Theory, 66(10):6084-6103, 2020.
Flavio Chierichetti, Ravi Kumar, Sandeep Pandey, and Sergei Vassilvitskii. Finding the Jaccard median. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pages 293-311. SIAM, 2010. URL: https://doi.org/10.1137/1.9781611973075.25.
Michael B. Cohen, Yin Tat Lee, Gary Miller, Jakub Pachocki, and Aaron Sidford. Geometric median in nearly linear time. In Proceedings of the forty-eighth annual ACM Symposium on Theory of Computing, pages 9-21, 2016.
Sami Davies, Miklós Z. Rácz, Cyrus Rashtchian, and Benjamin G. Schiffer. Approximate trace reconstruction. CoRR, abs/2012.06713, 2020. URL: http://arxiv.org/abs/2012.06713.
Anindya De, Ryan O'Donnell, and Rocco A Servedio. Optimal mean-based algorithms for trace reconstruction. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1047-1056, 2017.
Colin de la Higuera and Francisco Casacuberta. Topology of strings: Median string is NP-complete. Theor. Comput. Sci., 230(1-2):39-48, 2000. URL: https://doi.org/10.1016/S0304-3975(97)00240-5.
Cynthia Dwork, Ravi Kumar, Moni Naor, and D. Sivakumar. Rank aggregation methods for the web. In Proceedings of the Tenth International World Wide Web Conference, WWW 10, pages 613-622, 2001. URL: https://doi.org/10.1145/371920.372165.
Igor Fischer and Andreas Zell. String averages and self-organizing maps for strings. Proceedings of the neural computation, pages 208-215, 2000.
P. Thomas Fletcher, Suresh Venkatasubramanian, and Sarang Joshi. Robust statistics on riemannian manifolds via the geometric median. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1-8. IEEE, 2008.
Zvi Galil and Kunsoo Park. An improved algorithm for approximate string matching. SIAM Journal on Computing, 19(6):989-999, 1990.
Nick Goldman, Paul Bertone, Siyuan Chen, Christophe Dessimoz, Emily M. LeProust, Botond Sipos, and Ewan Birney. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature, 494(7435):77-80, 2013.
Elena Grigorescu, Madhu Sudan, and Minshen Zhu. Limitations of mean-based algorithms for trace reconstruction at small distance. CoRR, abs/2011.13737, 2020. URL: http://arxiv.org/abs/2011.13737.
Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997.
Morihiro Hayashida and Hitoshi Koyano. Integer linear programming approach to median and center strings for a probability distribution on a set of strings. In BIOINFORMATICS, pages 35-41, 2016.
Nina Holden and Russell Lyons. Lower bounds for trace reconstruction. The Annals of Applied Probability, 30(2):503-525, 2020.
Nina Holden, Robin Pemantle, Yuval Peres, and Alex Zhai. Subpolynomial trace reconstruction for random strings and arbitrary deletion probability. Mathematical Statistics and Learning, 2(3):275-309, 2020.
Thomas Holenstein, Michael Mitzenmacher, Rina Panigrahy, and Udi Wieder. Trace reconstruction with constant deletion probability and related results. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2008, pages 389-398. SIAM, 2008.
V. V. Kalashnik. Reconstruction of a word from its fragments. Computational Mathematics and Computer Science (Vychislitel’naya matematika i vychislitel’naya tekhnika), Kharkov, 4:56-57, 1973.
Sampath Kannan and Andrew McGregor. More on reconstructing strings from random traces: insertions and deletions. In Proceedings. International Symposium on Information Theory, 2005. ISIT 2005., pages 297-301. IEEE, 2005.
Claire Kenyon-Mathieu and Warren Schudy. How to rank with few errors. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, STOC '07, pages 95-103. ACM, 2007. URL: https://doi.org/10.1145/1250790.1250806.
Teuvo Kohonen. Median strings. Pattern Recognition Letters, 3(5):309-313, 1985. URL: https://doi.org/10.1016/0167-8655(85)90061-3.
Joseph B Kruskal. An overview of sequence comparison: Time warps, string edits, and macromolecules. SIAM review, 25(2):201-237, 1983. URL: https://doi.org/10.1137/1025045.
Ferenc Kruzslicz. Improved greedy algorithm for computing approximate median strings. Acta Cybernetica, 14(2):331-339, 1999.
Gad M. Landau and Uzi Vishkin. Fast parallel and serial approximate string matching. Journal of Algorithms, 10(2):157-169, 1989.
Vladimir I. Levenshtein. Efficient reconstruction of sequences. IEEE Transactions on Information Theory, 47(1):2-22, 2001. URL: https://doi.org/10.1109/18.904499.
Vladimir I. Levenshtein. Efficient reconstruction of sequences from their subsequences or supersequences. Journal of Combinatorial Theory, Series A, 93(2):310-332, 2001. URL: https://doi.org/10.1006/jcta.2000.3081.
Carlos D. Martínez-Hinarejos, Alfons Juan, and Francisco Casacuberta. Use of median string for classification. In Proceedings 15th International Conference on Pattern Recognition. ICPR-2000, volume 2, pages 903-906. IEEE, 2000. URL: https://doi.org/10.1109/ICPR.2000.906220.
Andrew McGregor, Eric Price, and Sofya Vorotnikova. Trace reconstruction revisited. In Algorithms - ESA 2014 - 22th Annual European Symposium, Wroclaw, Poland, September 8-10, 2014. Proceedings, volume 8737 of Lecture Notes in Computer Science, pages 689-700. Springer, 2014.
Stanislav Minsker. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308-2335, 2015.
P. Mirabal, J. Abreu, and D. Seco. Assessing the best edit in perturbation-based iterative refinement algorithms to compute the median string. Pattern Recognition Letters, 120:104-111, April 2019.
Michael Mitzenmacher. A survey of results for deletion channels and related synchronization channels. Probability Surveys, 6:1-33, 2009.
Shyam Narayanan. Population recovery from the deletion channel: Nearly matching trace reconstruction bounds. arXiv preprint, 2020. URL: http://arxiv.org/abs/2004.06828.
Fedor Nazarov and Yuval Peres. Trace reconstruction with exp(o(n^1/3)) samples. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pages 1042-1046. ACM, 2017.
François Nicolas and Eric Rivals. Complexities of the centre and median string problems. In 14th Annual Symposium on Combinatorial Pattern Matching, CPM 2003, pages 315-327, 2003.
Lee Organick, Siena Dumas Ang, Yuan-Jyue Chen, Randolph Lopez, Sergey Yekhanin, Konstantin Makarychev, Miklos Z. Racz, Govinda Kamath, Parikshit Gopalan, Bichlien Nguyen, et al. Random access in large-scale dna data storage. Nature biotechnology, 36(3):242, 2018.
Oscar Pedreira and Nieves R. Brisaboa. Spatial selection of sparse pivots for similarity search in metric spaces. In International Conference on Current Trends in Theory and Practice of Computer Science, pages 434-445. Springer, 2007.
Yuval Peres and Alex Zhai. Average-case reconstruction for the deletion channel: Subpolynomially many traces suffice. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, pages 228-239. IEEE Computer Society, 2017.
Pavel Pevzner. Computational molecular biology: an algorithmic approach. MIT press, 2000.
Cyrus Rashtchian, Konstantin Makarychev, Miklós Z. Rácz, Siena Ang, Djordje Jevdjic, Sergey Yekhanin, Luis Ceze, and Karin Strauss. Clustering billions of reads for DNA data storage. In Advances in Neural Information Processing Systems 30, pages 3360-3371. Curran Associates, Inc., 2017.
Richard J Roberts, Mauricio O Carneiro, and Michael C Schatz. The advantages of smrt sequencing. Genome Biology, 14(7):405, 2013.
David Sankoff. Minimal mutation trees of sequences. SIAM Journal on Applied Mathematics, 28(1):35-42, 1975. URL: https://doi.org/10.1137/0128004.
Krishnamurthy Viswanathan and Ram Swaminathan. Improved string reconstruction over insertion-deletion channels. In Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms, pages 399-408, 2008.
SM Hossein Tabatabaei Yazdi, Ryan Gabrys, and Olgica Milenkovic. Portable and error-free dna-based data storage. Scientific reports, 7(1):1-6, 2017.

Approximate Trace Reconstruction via Median String (In Average-Case)

Authors Diptarka Chakraborty, Debarati Das, Robert Krauthgamer

File

Document Identifiers

Author Details

Cite As Get BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

References

Thanks for your feedback!

Could not send message

Approximate Trace Reconstruction via Median String (In Average-Case)

Authors Diptarka Chakraborty, Debarati Das, Robert Krauthgamer

File

Document Identifiers

Author Details

Funding

Cite As Get BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

Related Versions

References

Thanks for your feedback!

Could not send message