Longest Common Substring with Approximately k Mismatches

Author Tatiana Starikovskaya



PDF
Thumbnail PDF

File

LIPIcs.CPM.2016.21.pdf
  • Filesize: 405 kB
  • 11 pages

Document Identifiers

Author Details

Tatiana Starikovskaya

Cite AsGet BibTex

Tatiana Starikovskaya. Longest Common Substring with Approximately k Mismatches. In 27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 54, pp. 21:1-21:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016)
https://doi.org/10.4230/LIPIcs.CPM.2016.21

Abstract

In the longest common substring problem we are given two strings of length n and must find a substring of maximal length that occurs in both strings. It is well-known that the problem can be solved in linear time, but the solution is not robust and can vary greatly when the input strings are changed even by one letter. To circumvent this, Leimester and Morgenstern introduced the problem of the longest common substring with k mismatches. Lately, this problem has received a lot of attention in the literature, and several algorithms have been suggested. The running time of these algorithms is n^{2-o(1)}, and unfortunately, conditional lower bounds have been shown which imply that there is little hope to improve this bound. In this paper we study a different but closely related problem of the longest common substring with approximately k mismatches and use computational geometry techniques to show that it admits a randomised solution with strongly subquadratic running time.
Keywords
  • Randomised algorithms
  • string similarity measures
  • longest common substring
  • sketching
  • locality-sensitive hashing

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. A. Abboud, R. Williams, and H. Yu. More applications of the polynomial method to algorithm design. In Proceedings of the 26th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 218-230, 2015. Google Scholar
  2. N. Alon and J. Spencer. The Probabilistic Method. John Wiley and Sons, 1992. Google Scholar
  3. S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D. J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403-410, 1990. Google Scholar
  4. S. Aluru, A. Apostolico, and S. Thankachan. Proceedings of the 19th annual international conference on research in computational molecular biology. pages 1-12, 2015. Google Scholar
  5. A. Andoni and P. Indyk. Efficient algorithms for substring near neighbor problem. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1203-1212, 2006. Google Scholar
  6. M. Babenko and T. Starikovskaya. Computing longest common substrings via suffix arrays. In Proceedings of the 3rd International Computer Science Symposium in Russia, pages 64-75, 2008. Google Scholar
  7. M. Babenko and T. Starikovskaya. Computing the longest common substring with one mismatch. Problems of Information Transmission, 47(1):28-33, 2011. Google Scholar
  8. A. Backurs and P. Indyk. Edit distance cannot be computed in strongly subquadratic time (Unless SETH is false). In Proceedings of the 47th Annual ACM on Symposium on Theory of Computing (STOC), pages 51-58, 2015. Google Scholar
  9. P. Bille, I.L. Gørtz, and J. Kristensen. Longest common extensions via fingerprinting. In Proceedings of the 6th International Conference on Language and Automata Theory and Applications, pages 119-130, 2012. Google Scholar
  10. P. Bille, I.L. Gørtz, B. Sach, and H.W. Vildhøj. Time-space trade-offs for longest common extensions. J. of Discrete Algorithms, 25:42-50, March 2014. Google Scholar
  11. K. Bringmann and M. Kunnemann. Quadratic conditional lower bounds for string problems and dynamic time warping. In Proceedings of the 56th Annual Symposium on Foundations of Computer Science (FOCS), pages 79-97, 2015. Google Scholar
  12. M. Crochemore, G.M. Landau, and M. Ziv-Ukelson. A subquadratic sequence alignment algorithm for unrestricted scoring matrices. SIAM J. Comput., 32(6):1654-1673, 2003. Google Scholar
  13. M. Farach-Colton. Optimal suffix tree construction with large alphabets. In Procedings of the 38th IEEE Symposium on Foundations of Computer Science (FOCS), pages 137-143, 1997. Google Scholar
  14. J. Fischer and V. Heun. Theoretical and practical improvements on the RMQ-problem, with applications to LCA and LCE. In Proceedings of the 17th Annual Conference on Combinatorial Pattern Matching, pages 36-48, 2006. Google Scholar
  15. M. J. Fischer and M. S. Paterson. String-matching and other products. In Complexity of Computation, volume 7, pages 113-125. SIAM AMS, 1974. Google Scholar
  16. T. Flouri, E. Giaquinta, K. Kobert, and E. Ukkonen. Longest common substrings with k mismatches. Information Processing Letters, 115(6–8):643-647, 2015. Google Scholar
  17. Z. Galil and R. Giancarlo. Parallel string matching with k mismatches. Theoretical Computer Science, 51(3):341-348, 1987. Google Scholar
  18. S. Grabowski. A note on the longest common substring with k-mismatches problem. Information Processing Letters, 115(6–8):640-642, 2015. Google Scholar
  19. D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997. Google Scholar
  20. D. Harel and R. E. Tarjan. Fast algorithms for finding nearest common ancestors. SIAM J. Comput., 13(2):338-355, May 1984. Google Scholar
  21. L.C.K. Hui. Color set size problem with applications to string matching. In Proceedings of the 3rd Annual Symposium on Combinatorial Pattern Matching, volume 644 of Lecture notes in computer science, pages 230-243, 1992. Google Scholar
  22. L. Ilie, G. Navarro, and L. Tinta. The longest common extension problem revisited and applications to approximate string searching. J. of Discrete Algorithms, 8(4):418-428, 2010. Google Scholar
  23. P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pages 604-613, 1998. Google Scholar
  24. R.M. Karp and M.O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development - Mathematics and computing, 31(2):249-260, 1987. Google Scholar
  25. T. Kociumaka, T. Starikovskaya, and Vildhøj H. W. Sublinear space algorithms for the longest common substring problem. In Proceedings of the 22nd Annual European Symposium on Algorithms, volume 8737 of Lecture notes in computer science, pages 605-617, 2014. Google Scholar
  26. E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approximate nearest neighbor in high dimensional spaces. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing (STOC), pages 614-623, New York, NY, USA, 1998. ACM. Google Scholar
  27. G.M. Landau and U. Vishkin. Efficient string matching with k mismatches. Theoretical Computer Science, 43:239-249, 1986. Google Scholar
  28. C.A. Leimeister and B. Morgenstern. kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics, 30(14):2000-2008, 2014. Google Scholar
  29. W.J. Masek and M.S. Paterson. A faster algorithm computing string edit distances. Journal of Computer and System Sciences, 20(1):18-31, 1980. Google Scholar
  30. T.F. Smith and M.S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195-197, 1981. Google Scholar
  31. T. Starikovskaya and H.W. Vildhøj. Time-space trade-offs for the longest common substring problem. In Proceedings of the 24th Annual Symposium in Combinatorial Pattern Matching, volume 7922 of Lecture notes in computer science, pages 223-234, 2013. Google Scholar
  32. P. Weiner. Linear pattern matching algorithms. In Proceedings of the 14th IEEE Symposium on Foundations of Computer Science (FOCS), pages 1-11, 1973. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail