Approximating Longest Common Substring with k mismatches: Theory and Practice

Gourdel, Garance; Kociumaka, Tomasz; Radoszewski, Jakub; Starikovskaya, Tatiana

doi:10.4230/LIPIcs.CPM.2020.16

Abstract

In the problem of the longest common substring with k mismatches we are given two strings X, Y and must find the maximal length 𝓁 such that there is a length-𝓁 substring of X and a length-𝓁 substring of Y that differ in at most k positions. The length 𝓁 can be used as a robust measure of similarity between X, Y. In this work, we develop new approximation algorithms for computing 𝓁 that are significantly more efficient that previously known solutions from the theoretical point of view. Our approach is simple and practical, which we confirm via an experimental evaluation, and is probably close to optimal as we demonstrate via a conditional lower bound.

Cite As Get BibTex

Garance Gourdel, Tomasz Kociumaka, Jakub Radoszewski, and Tatiana Starikovskaya. Approximating Longest Common Substring with k mismatches: Theory and Practice. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 16:1-16:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020) https://doi.org/10.4230/LIPIcs.CPM.2020.16

Author Details

Garance Gourdel

ENS Paris Saclay, France

Tomasz Kociumaka

Bar-Ilan University, Ramat Gan, Israel

Jakub Radoszewski

Institute of Informatics, University of Warsaw, Poland
Samsung R&D Institute, Warsaw, Poland

Tatiana Starikovskaya

DIENS, École normale supérieure, PSL Research University, France

Funding

Kociumaka, Tomasz: Supported by ISF grants no. 1278/16 and 1926/19, a BSF grant no. 2018364, and an ERC grant MPM (no. 683064) under the EU’s Horizon 2020 Research and Innovation Programme.
Radoszewski, Jakub: Supported by the Polish National Science Center, grant number 2018/31/D/ST6/03991.

Supplementary Materials

To ensure the reproducibility of our results, our complete experimental setup, including data files, is available at https://github.com/fnareoh/LCS_Approx_k_mis.

References

Amir Abboud, Richard Ryan Williams, and Huacheng Yu. More applications of the polynomial method to algorithm design. In SODA'15, pages 218-230, 2015. URL: https://doi.org/10.1137/1.9781611973730.17.
Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671-687, 2003. URL: https://doi.org/10.1016/S0022-0000(03)00025-4.
Manindra Agrawal, Neeraj Kayal, and Nitin Saxena. PRIMES is in P. Annals of Mathematics, 160(2):781-793, 2004. URL: https://doi.org/10.4007/annals.2004.160.781.
Alexandr Andoni and Piotr Indyk. Efficient algorithms for substring near neighbor problem. In SODA'06, pages 1203-1212, 2006. URL: https://doi.org/10.1145/1109557.1109690.
Alexandr Andoni and Ilya Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In STOC'15, pages 793-801, 2015. URL: https://doi.org/10.1145/2746539.2746553.
Maxim Babenko and Tatiana Starikovskaya. Computing the longest common substring with one mismatch. Problems of Information Transmission, 47(1):28-33, 2011. URL: https://doi.org/10.1134/S0032946011010030.
Arturs Backurs and Piotr Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). SIAM Journal on Computing, 47(3):1087-1097, 2018. URL: https://doi.org/10.1137/15M1053128.
Diptarka Chakraborty, Debarati Das, Elazar Goldenberg, Michal Koucký, and Michael E. Saks. Approximating edit distance within constant factor in truly sub-quadratic time. In FOCS'18, pages 979-990, 2018. URL: https://doi.org/10.1109/FOCS.2018.00096.
Min-Te Chao. A general purpose unequal probability sampling plan. Biometrika, 69(3):653-656, December 1982. URL: https://doi.org/10.2307/2336002.
Panagiotis Charalampopoulos, Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Linear-time algorithm for long LCF with k mismatches. In CPM'18, pages 23:1-23:16, 2018. URL: https://doi.org/10.4230/LIPIcs.CPM.2018.23.
Aditi Dhagat, Peter Gács, and Peter Winkler. On playing "Twenty questions" with a liar. In SODA'92, pages 16-22, 1992. URL: http://dl.acm.org/citation.cfm?id=139404.139409.
Michael J. Fischer and Michael S. Paterson. String matching and other products. In Complexity of Computation, pages 113-125, 1974.
Tomás Flouri, Emanuele Giaquinta, Kassian Kobert, and Esko Ukkonen. Longest common substrings with k mismatches. Information Processing Letters, 115(6-8):643-647, 2015.
Zvi Galil and Raffaele Giancarlo. Improved string matching with k mismatches. SIGACT News, 17(4):52–54, March 1986. URL: https://doi.org/10.1145/8307.8309.
Szymon Grabowski. A note on the longest common substring with k-mismatches problem. Information Processing Letters, 115(6-8):640-642, 2015.
Russell Impagliazzo, Ramamohan Paturi, and Francis Zane. Which problems have strongly exponential complexity? Journal of Computer and System Sciences, 63(4):512-530, 2001. URL: https://doi.org/10.1006/jcss.2001.1774.
William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Modern Analysis and Probability, volume 26 of Contemporary Mathematics, pages 189-206, 1984.
Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249-260, 1987. URL: https://doi.org/10.1147/rd.312.0249.
Tomasz Kociumaka, Jakub Radoszewski, and Tatiana Starikovskaya. Longest common substring with approximately k mismatches. Algorithmica, 81(6):2633-2652, 2019. URL: https://doi.org/10.1007/s00453-019-00548-x.
Chris-Andre Leimeister and Burkhard Morgenstern. kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics, 30(14):2000-2008, 2014. URL: https://doi.org/10.1093/bioinformatics/btu331.
Benny Porat and Ely Porat. Exact and approximate pattern matching in the streaming model. In FOCS'09, pages 315-323, 2009. URL: https://doi.org/10.1109/FOCS.2009.11.
Aviad Rubinstein. Hardness of approximate nearest neighbor search. In STOC'18, pages 1260-1268, 2018. URL: https://doi.org/10.1145/3188745.3188916.
Terence Tao, Ernest Croot III, and Harald Helfgott. Deterministic methods to find primes. AMS Mathematics of Computation, 81(278):1233-1246, 2012. URL: https://doi.org/10.1090/S0025-5718-2011-02542-1.
Sharma V. Thankachan, Alberto Apostolico, and Srinivas Aluru. A provably efficient algorithm for the k-mismatch average common substring problem. Journal of Computational Biology, 23(6):472-482, 2016. URL: https://doi.org/10.1089/cmb.2015.0235.
Sharma V. Thankachan, Sriram P. Chockalingam, Yongchao Liu, Alberto Apostolico, and Srinivas Aluru. ALFRED: A practical method for alignment-free distance computation. Journal of Computational Biology, 23(6):452-460, 2016. URL: https://doi.org/10.1089/cmb.2015.0217.
Sharma V. Thankachan, Sriram P. Chockalingam, Yongchao Liu, Ambujam Krishnan, and Srinivas Aluru. A greedy alignment-free distance estimator for phylogenetic inference. BMC Bioinformatics, 18(8):238, June 2017. URL: https://doi.org/10.1186/s12859-017-1658-0.
Virginia Vassilevska Williams. On some fine-grained questions in algorithms and complexity. In ICM'18, pages 3447-3487, 2018. URL: https://doi.org/10.1142/9789813272880_0188.

Approximating Longest Common Substring with k mismatches: Theory and Practice

Authors Garance Gourdel, Tomasz Kociumaka , Jakub Radoszewski , Tatiana Starikovskaya

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

References

Thanks for your feedback!

Could not send message

Approximating Longest Common Substring with k mismatches: Theory and Practice

Authors Garance Gourdel, Tomasz Kociumaka , Jakub Radoszewski , Tatiana Starikovskaya

File

Document Identifiers

Related Versions

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Funding

Supplementary Materials

References

Thanks for your feedback!

Could not send message