Approximate Trace Reconstruction via Median String (In Average-Case)

Authors Diptarka Chakraborty, Debarati Das, Robert Krauthgamer



PDF
Thumbnail PDF

File

LIPIcs.FSTTCS.2021.11.pdf
  • Filesize: 1.05 MB
  • 23 pages

Document Identifiers

Author Details

Diptarka Chakraborty
  • National University of Singapore, Singapore
Debarati Das
  • Basic Algorithm Research Copenhagen (BARC), University of Copenhagen, Denmark
Robert Krauthgamer
  • Weizmann Institute of Science, Rehovot, Israel

Cite AsGet BibTex

Diptarka Chakraborty, Debarati Das, and Robert Krauthgamer. Approximate Trace Reconstruction via Median String (In Average-Case). In 41st IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 213, pp. 11:1-11:23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.FSTTCS.2021.11

Abstract

We consider an approximate version of the trace reconstruction problem, where the goal is to recover an unknown string s ∈ {0,1}ⁿ from m traces (each trace is generated independently by passing s through a probabilistic insertion-deletion channel with rate p). We present a deterministic near-linear time algorithm for the average-case model, where s is random, that uses only three traces. It runs in near-linear time Õ(n) and with high probability reports a string within edit distance Õ(p² n) from s, which significantly improves over the straightforward bound of O(pn). Technically, our algorithm computes a (1+ε)-approximate median of the three input traces. To prove its correctness, our probabilistic analysis shows that an approximate median is indeed close to the unknown s. To achieve a near-linear time bound, we have to bypass the well-known dynamic programming algorithm that computes an optimal median in time O(n³).

Subject Classification

ACM Subject Classification
  • Theory of computation → Approximation algorithms analysis
Keywords
  • Trace Reconstruction
  • Approximation Algorithms
  • Edit Distance
  • String Median

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. J. Abreu and Juan Ramón Rico-Juan. A new iterative algorithm for computing a quality approximate median of strings based on edit operations. Pattern Recognition Letters, 36:74-80, 2014. Google Scholar
  2. Nir Ailon, Moses Charikar, and Alantha Newman. Aggregating inconsistent information: Ranking and clustering. J. ACM, 55(5):23:1-23:27, 2008. URL: https://doi.org/10.1145/1411509.1411513.
  3. Frank Ban, Xi Chen, Adam Freilich, Rocco A Servedio, and Sandip Sinha. Beyond trace reconstruction: Population recovery from the deletion channel. In 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 745-768. IEEE, 2019. Google Scholar
  4. Tugkan Batu, Funda Ergün, Joe Kilian, Avner Magen, Sofya Raskhodnikova, Ronitt Rubinfeld, and Rahul Sami. A sublinear algorithm for weakly approximating edit distance. In Proceedings of the 35th Annual ACM Symposium on Theory of Computing, STOC '03, pages 316-324. ACM, 2003. Google Scholar
  5. Tugkan Batu, Sampath Kannan, Sanjeev Khanna, and Andrew McGregor. Reconstructing strings from random traces. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2004, pages 910-918. SIAM, 2004. Google Scholar
  6. Hervé Cardot, Peggy Cénac, and Antoine Godichon-Baggioni. Online estimation of the geometric median in Hilbert spaces: Nonasymptotic confidence balls. Annals of Statistics, 45(2):591-614, 2017. URL: https://doi.org/10.1214/16-AOS1460.
  7. Francisco Casacuberta and M. D. Antonio. A greedy algorithm for computing approximate median strings. In Proc. of National Symposium on Pattern Recognition and Image Analysis, pages 193-198, 1997. Google Scholar
  8. Diptarka Chakraborty, Debarati Das, and Robert Krauthgamer. Approximating the median under the ulam metric. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 761-775. SIAM, 2021. URL: https://doi.org/10.1137/1.9781611976465.48.
  9. Zachary Chase. New lower bounds for trace reconstruction. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 57(2):627-643, 2021. Google Scholar
  10. Zachary Chase. Separating words and trace reconstruction. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 21-31, 2021. Google Scholar
  11. Xi Chen, Anindya De, Chin Ho Lee, Rocco A. Servedio, and Sandip Sinha. Polynomial-time trace reconstruction in the low deletion rate regime. In James R. Lee, editor, 12th Innovations in Theoretical Computer Science Conference, ITCS 2021, volume 185 of LIPIcs, pages 20:1-20:20. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. Google Scholar
  12. Xi Chen, Anindya De, Chin Ho Lee, Rocco A. Servedio, and Sandip Sinha. Polynomial-time trace reconstruction in the smoothed complexity model. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms, pages 54-73. SIAM, 2021. Google Scholar
  13. Mahdi Cheraghchi, Ryan Gabrys, Olgica Milenkovic, and Joao Ribeiro. Coded trace reconstruction. IEEE Transactions on Information Theory, 66(10):6084-6103, 2020. Google Scholar
  14. Flavio Chierichetti, Ravi Kumar, Sandeep Pandey, and Sergei Vassilvitskii. Finding the Jaccard median. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pages 293-311. SIAM, 2010. URL: https://doi.org/10.1137/1.9781611973075.25.
  15. Michael B. Cohen, Yin Tat Lee, Gary Miller, Jakub Pachocki, and Aaron Sidford. Geometric median in nearly linear time. In Proceedings of the forty-eighth annual ACM Symposium on Theory of Computing, pages 9-21, 2016. Google Scholar
  16. Sami Davies, Miklós Z. Rácz, Cyrus Rashtchian, and Benjamin G. Schiffer. Approximate trace reconstruction. CoRR, abs/2012.06713, 2020. URL: http://arxiv.org/abs/2012.06713.
  17. Anindya De, Ryan O'Donnell, and Rocco A Servedio. Optimal mean-based algorithms for trace reconstruction. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1047-1056, 2017. Google Scholar
  18. Colin de la Higuera and Francisco Casacuberta. Topology of strings: Median string is NP-complete. Theor. Comput. Sci., 230(1-2):39-48, 2000. URL: https://doi.org/10.1016/S0304-3975(97)00240-5.
  19. Cynthia Dwork, Ravi Kumar, Moni Naor, and D. Sivakumar. Rank aggregation methods for the web. In Proceedings of the Tenth International World Wide Web Conference, WWW 10, pages 613-622, 2001. URL: https://doi.org/10.1145/371920.372165.
  20. Igor Fischer and Andreas Zell. String averages and self-organizing maps for strings. Proceedings of the neural computation, pages 208-215, 2000. Google Scholar
  21. P. Thomas Fletcher, Suresh Venkatasubramanian, and Sarang Joshi. Robust statistics on riemannian manifolds via the geometric median. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1-8. IEEE, 2008. Google Scholar
  22. Zvi Galil and Kunsoo Park. An improved algorithm for approximate string matching. SIAM Journal on Computing, 19(6):989-999, 1990. Google Scholar
  23. Nick Goldman, Paul Bertone, Siyuan Chen, Christophe Dessimoz, Emily M. LeProust, Botond Sipos, and Ewan Birney. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature, 494(7435):77-80, 2013. Google Scholar
  24. Elena Grigorescu, Madhu Sudan, and Minshen Zhu. Limitations of mean-based algorithms for trace reconstruction at small distance. CoRR, abs/2011.13737, 2020. URL: http://arxiv.org/abs/2011.13737.
  25. Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997. Google Scholar
  26. Morihiro Hayashida and Hitoshi Koyano. Integer linear programming approach to median and center strings for a probability distribution on a set of strings. In BIOINFORMATICS, pages 35-41, 2016. Google Scholar
  27. Nina Holden and Russell Lyons. Lower bounds for trace reconstruction. The Annals of Applied Probability, 30(2):503-525, 2020. Google Scholar
  28. Nina Holden, Robin Pemantle, Yuval Peres, and Alex Zhai. Subpolynomial trace reconstruction for random strings and arbitrary deletion probability. Mathematical Statistics and Learning, 2(3):275-309, 2020. Google Scholar
  29. Thomas Holenstein, Michael Mitzenmacher, Rina Panigrahy, and Udi Wieder. Trace reconstruction with constant deletion probability and related results. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2008, pages 389-398. SIAM, 2008. Google Scholar
  30. V. V. Kalashnik. Reconstruction of a word from its fragments. Computational Mathematics and Computer Science (Vychislitel’naya matematika i vychislitel’naya tekhnika), Kharkov, 4:56-57, 1973. Google Scholar
  31. Sampath Kannan and Andrew McGregor. More on reconstructing strings from random traces: insertions and deletions. In Proceedings. International Symposium on Information Theory, 2005. ISIT 2005., pages 297-301. IEEE, 2005. Google Scholar
  32. Claire Kenyon-Mathieu and Warren Schudy. How to rank with few errors. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, STOC '07, pages 95-103. ACM, 2007. URL: https://doi.org/10.1145/1250790.1250806.
  33. Teuvo Kohonen. Median strings. Pattern Recognition Letters, 3(5):309-313, 1985. URL: https://doi.org/10.1016/0167-8655(85)90061-3.
  34. Joseph B Kruskal. An overview of sequence comparison: Time warps, string edits, and macromolecules. SIAM review, 25(2):201-237, 1983. URL: https://doi.org/10.1137/1025045.
  35. Ferenc Kruzslicz. Improved greedy algorithm for computing approximate median strings. Acta Cybernetica, 14(2):331-339, 1999. Google Scholar
  36. Gad M. Landau and Uzi Vishkin. Fast parallel and serial approximate string matching. Journal of Algorithms, 10(2):157-169, 1989. Google Scholar
  37. Vladimir I. Levenshtein. Efficient reconstruction of sequences. IEEE Transactions on Information Theory, 47(1):2-22, 2001. URL: https://doi.org/10.1109/18.904499.
  38. Vladimir I. Levenshtein. Efficient reconstruction of sequences from their subsequences or supersequences. Journal of Combinatorial Theory, Series A, 93(2):310-332, 2001. URL: https://doi.org/10.1006/jcta.2000.3081.
  39. Carlos D. Martínez-Hinarejos, Alfons Juan, and Francisco Casacuberta. Use of median string for classification. In Proceedings 15th International Conference on Pattern Recognition. ICPR-2000, volume 2, pages 903-906. IEEE, 2000. URL: https://doi.org/10.1109/ICPR.2000.906220.
  40. Andrew McGregor, Eric Price, and Sofya Vorotnikova. Trace reconstruction revisited. In Algorithms - ESA 2014 - 22th Annual European Symposium, Wroclaw, Poland, September 8-10, 2014. Proceedings, volume 8737 of Lecture Notes in Computer Science, pages 689-700. Springer, 2014. Google Scholar
  41. Stanislav Minsker. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308-2335, 2015. Google Scholar
  42. P. Mirabal, J. Abreu, and D. Seco. Assessing the best edit in perturbation-based iterative refinement algorithms to compute the median string. Pattern Recognition Letters, 120:104-111, April 2019. Google Scholar
  43. Michael Mitzenmacher. A survey of results for deletion channels and related synchronization channels. Probability Surveys, 6:1-33, 2009. Google Scholar
  44. Shyam Narayanan. Population recovery from the deletion channel: Nearly matching trace reconstruction bounds. arXiv preprint, 2020. URL: http://arxiv.org/abs/2004.06828.
  45. Fedor Nazarov and Yuval Peres. Trace reconstruction with exp(o(n^1/3)) samples. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pages 1042-1046. ACM, 2017. Google Scholar
  46. François Nicolas and Eric Rivals. Complexities of the centre and median string problems. In 14th Annual Symposium on Combinatorial Pattern Matching, CPM 2003, pages 315-327, 2003. Google Scholar
  47. Lee Organick, Siena Dumas Ang, Yuan-Jyue Chen, Randolph Lopez, Sergey Yekhanin, Konstantin Makarychev, Miklos Z. Racz, Govinda Kamath, Parikshit Gopalan, Bichlien Nguyen, et al. Random access in large-scale dna data storage. Nature biotechnology, 36(3):242, 2018. Google Scholar
  48. Oscar Pedreira and Nieves R. Brisaboa. Spatial selection of sparse pivots for similarity search in metric spaces. In International Conference on Current Trends in Theory and Practice of Computer Science, pages 434-445. Springer, 2007. Google Scholar
  49. Yuval Peres and Alex Zhai. Average-case reconstruction for the deletion channel: Subpolynomially many traces suffice. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, pages 228-239. IEEE Computer Society, 2017. Google Scholar
  50. Pavel Pevzner. Computational molecular biology: an algorithmic approach. MIT press, 2000. Google Scholar
  51. Cyrus Rashtchian, Konstantin Makarychev, Miklós Z. Rácz, Siena Ang, Djordje Jevdjic, Sergey Yekhanin, Luis Ceze, and Karin Strauss. Clustering billions of reads for DNA data storage. In Advances in Neural Information Processing Systems 30, pages 3360-3371. Curran Associates, Inc., 2017. Google Scholar
  52. Richard J Roberts, Mauricio O Carneiro, and Michael C Schatz. The advantages of smrt sequencing. Genome Biology, 14(7):405, 2013. Google Scholar
  53. David Sankoff. Minimal mutation trees of sequences. SIAM Journal on Applied Mathematics, 28(1):35-42, 1975. URL: https://doi.org/10.1137/0128004.
  54. Krishnamurthy Viswanathan and Ram Swaminathan. Improved string reconstruction over insertion-deletion channels. In Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms, pages 399-408, 2008. Google Scholar
  55. SM Hossein Tabatabaei Yazdi, Ryan Gabrys, and Olgica Milenkovic. Portable and error-free dna-based data storage. Scientific reports, 7(1):1-6, 2017. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail