Approximating Longest Common Substring with k mismatches: Theory and Practice

Authors Garance Gourdel, Tomasz Kociumaka , Jakub Radoszewski , Tatiana Starikovskaya



PDF
Thumbnail PDF

File

LIPIcs.CPM.2020.16.pdf
  • Filesize: 0.73 MB
  • 15 pages

Document Identifiers

Author Details

Garance Gourdel
  • ENS Paris Saclay, France
Tomasz Kociumaka
  • Bar-Ilan University, Ramat Gan, Israel
Jakub Radoszewski
  • Institute of Informatics, University of Warsaw, Poland
  • Samsung R&D Institute, Warsaw, Poland
Tatiana Starikovskaya
  • DIENS, École normale supérieure, PSL Research University, France

Cite AsGet BibTex

Garance Gourdel, Tomasz Kociumaka, Jakub Radoszewski, and Tatiana Starikovskaya. Approximating Longest Common Substring with k mismatches: Theory and Practice. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 16:1-16:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)
https://doi.org/10.4230/LIPIcs.CPM.2020.16

Abstract

In the problem of the longest common substring with k mismatches we are given two strings X, Y and must find the maximal length 𝓁 such that there is a length-𝓁 substring of X and a length-𝓁 substring of Y that differ in at most k positions. The length 𝓁 can be used as a robust measure of similarity between X, Y. In this work, we develop new approximation algorithms for computing 𝓁 that are significantly more efficient that previously known solutions from the theoretical point of view. Our approach is simple and practical, which we confirm via an experimental evaluation, and is probably close to optimal as we demonstrate via a conditional lower bound.

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • approximation algorithms
  • string similarity
  • LSH
  • conditional lower bounds

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Amir Abboud, Richard Ryan Williams, and Huacheng Yu. More applications of the polynomial method to algorithm design. In SODA'15, pages 218-230, 2015. URL: https://doi.org/10.1137/1.9781611973730.17.
  2. Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671-687, 2003. URL: https://doi.org/10.1016/S0022-0000(03)00025-4.
  3. Manindra Agrawal, Neeraj Kayal, and Nitin Saxena. PRIMES is in P. Annals of Mathematics, 160(2):781-793, 2004. URL: https://doi.org/10.4007/annals.2004.160.781.
  4. Alexandr Andoni and Piotr Indyk. Efficient algorithms for substring near neighbor problem. In SODA'06, pages 1203-1212, 2006. URL: https://doi.org/10.1145/1109557.1109690.
  5. Alexandr Andoni and Ilya Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In STOC'15, pages 793-801, 2015. URL: https://doi.org/10.1145/2746539.2746553.
  6. Maxim Babenko and Tatiana Starikovskaya. Computing the longest common substring with one mismatch. Problems of Information Transmission, 47(1):28-33, 2011. URL: https://doi.org/10.1134/S0032946011010030.
  7. Arturs Backurs and Piotr Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). SIAM Journal on Computing, 47(3):1087-1097, 2018. URL: https://doi.org/10.1137/15M1053128.
  8. Diptarka Chakraborty, Debarati Das, Elazar Goldenberg, Michal Koucký, and Michael E. Saks. Approximating edit distance within constant factor in truly sub-quadratic time. In FOCS'18, pages 979-990, 2018. URL: https://doi.org/10.1109/FOCS.2018.00096.
  9. Min-Te Chao. A general purpose unequal probability sampling plan. Biometrika, 69(3):653-656, December 1982. URL: https://doi.org/10.2307/2336002.
  10. Panagiotis Charalampopoulos, Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Linear-time algorithm for long LCF with k mismatches. In CPM'18, pages 23:1-23:16, 2018. URL: https://doi.org/10.4230/LIPIcs.CPM.2018.23.
  11. Aditi Dhagat, Peter Gács, and Peter Winkler. On playing "Twenty questions" with a liar. In SODA'92, pages 16-22, 1992. URL: http://dl.acm.org/citation.cfm?id=139404.139409.
  12. Michael J. Fischer and Michael S. Paterson. String matching and other products. In Complexity of Computation, pages 113-125, 1974. Google Scholar
  13. Tomás Flouri, Emanuele Giaquinta, Kassian Kobert, and Esko Ukkonen. Longest common substrings with k mismatches. Information Processing Letters, 115(6-8):643-647, 2015. Google Scholar
  14. Zvi Galil and Raffaele Giancarlo. Improved string matching with k mismatches. SIGACT News, 17(4):52–54, March 1986. URL: https://doi.org/10.1145/8307.8309.
  15. Szymon Grabowski. A note on the longest common substring with k-mismatches problem. Information Processing Letters, 115(6-8):640-642, 2015. Google Scholar
  16. Russell Impagliazzo, Ramamohan Paturi, and Francis Zane. Which problems have strongly exponential complexity? Journal of Computer and System Sciences, 63(4):512-530, 2001. URL: https://doi.org/10.1006/jcss.2001.1774.
  17. William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Modern Analysis and Probability, volume 26 of Contemporary Mathematics, pages 189-206, 1984. Google Scholar
  18. Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249-260, 1987. URL: https://doi.org/10.1147/rd.312.0249.
  19. Tomasz Kociumaka, Jakub Radoszewski, and Tatiana Starikovskaya. Longest common substring with approximately k mismatches. Algorithmica, 81(6):2633-2652, 2019. URL: https://doi.org/10.1007/s00453-019-00548-x.
  20. Chris-Andre Leimeister and Burkhard Morgenstern. kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics, 30(14):2000-2008, 2014. URL: https://doi.org/10.1093/bioinformatics/btu331.
  21. Benny Porat and Ely Porat. Exact and approximate pattern matching in the streaming model. In FOCS'09, pages 315-323, 2009. URL: https://doi.org/10.1109/FOCS.2009.11.
  22. Aviad Rubinstein. Hardness of approximate nearest neighbor search. In STOC'18, pages 1260-1268, 2018. URL: https://doi.org/10.1145/3188745.3188916.
  23. Terence Tao, Ernest Croot III, and Harald Helfgott. Deterministic methods to find primes. AMS Mathematics of Computation, 81(278):1233-1246, 2012. URL: https://doi.org/10.1090/S0025-5718-2011-02542-1.
  24. Sharma V. Thankachan, Alberto Apostolico, and Srinivas Aluru. A provably efficient algorithm for the k-mismatch average common substring problem. Journal of Computational Biology, 23(6):472-482, 2016. URL: https://doi.org/10.1089/cmb.2015.0235.
  25. Sharma V. Thankachan, Sriram P. Chockalingam, Yongchao Liu, Alberto Apostolico, and Srinivas Aluru. ALFRED: A practical method for alignment-free distance computation. Journal of Computational Biology, 23(6):452-460, 2016. URL: https://doi.org/10.1089/cmb.2015.0217.
  26. Sharma V. Thankachan, Sriram P. Chockalingam, Yongchao Liu, Ambujam Krishnan, and Srinivas Aluru. A greedy alignment-free distance estimator for phylogenetic inference. BMC Bioinformatics, 18(8):238, June 2017. URL: https://doi.org/10.1186/s12859-017-1658-0.
  27. Virginia Vassilevska Williams. On some fine-grained questions in algorithms and complexity. In ICM'18, pages 3447-3487, 2018. URL: https://doi.org/10.1142/9789813272880_0188.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail