Streaming and Small Space Approximation Algorithms for Edit Distance and Longest Common Subsequence

Authors Kuan Cheng, Alireza Farhadi, MohammadTaghi Hajiaghayi, Zhengzhong Jin, Xin Li, Aviad Rubinstein, Saeed Seddighin, Yu Zheng



PDF
Thumbnail PDF

File

LIPIcs.ICALP.2021.54.pdf
  • Filesize: 0.79 MB
  • 20 pages

Document Identifiers

Author Details

Kuan Cheng
  • Peking University, Beijing, China
Alireza Farhadi
  • University of Maryland, College Park, MD, USA
MohammadTaghi Hajiaghayi
  • University of Maryland, College Park, MD, USA
Zhengzhong Jin
  • Johns Hopkins University, Baltimore, MD, USA
Xin Li
  • Johns Hopkins University, Baltimore, MD, USA
Aviad Rubinstein
  • Stanford University, CA, USA
Saeed Seddighin
  • Toyota Technological Institute, Chicago, IL, USA
Yu Zheng
  • Johns Hopkins University, Baltimore, MD, USA

Cite AsGet BibTex

Kuan Cheng, Alireza Farhadi, MohammadTaghi Hajiaghayi, Zhengzhong Jin, Xin Li, Aviad Rubinstein, Saeed Seddighin, and Yu Zheng. Streaming and Small Space Approximation Algorithms for Edit Distance and Longest Common Subsequence. In 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 198, pp. 54:1-54:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.ICALP.2021.54

Abstract

The edit distance (ED) and longest common subsequence (LCS) are two fundamental problems which quantify how similar two strings are to one another. In this paper, we first consider these problems in the asymmetric streaming model introduced by Andoni, Krauthgamer and Onak [Andoni et al., 2010] (FOCS'10) and Saks and Seshadhri [Saks and Seshadhri, 2013] (SODA'13). In this model we have random access to one string and streaming access the other one. Our main contribution is a constant factor approximation algorithm for ED with memory Õ(n^δ) for any constant δ > 0. In addition to this, we present an upper bound of Õ _ε(√n) on the memory needed to approximate ED or LCS within a factor 1±ε. All our algorithms are deterministic and run in polynomial time in a single pass. We further study small-space approximation algorithms for ED, LCS, and longest increasing sequence (LIS) in the non-streaming setting. Here, we design algorithms that achieve 1 ± ε approximation for all three problems, where ε > 0 can be any constant and even slightly sub-constant. Our algorithms only use poly-logarithmic space while maintaining a polynomial running time. This significantly improves previous results in terms of space complexity, where all known results need to use space at least Ω(√n). Our algorithms make novel use of triangle inequality and carefully designed recursions to save space, which can be of independent interest.

Subject Classification

ACM Subject Classification
  • Theory of computation → Design and analysis of algorithms
Keywords
  • Edit Distance
  • Longest Common Subsequence
  • Longest Increasing Subsequence
  • Space Efficient Algorithm
  • Approximation Algorithm

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Amir Abboud and Arturs Backurs. Towards hardness of approximation for polynomial time problems. In ITCS, 2017. Google Scholar
  2. Amir Abboud, Arturs Backurs, and Virginia Vassilevska Williams. Tight hardness results for LCS and other sequence similarity measures. In FOCS, 2015. Google Scholar
  3. Amir Abboud, Thomas Dueholm Hansen, Virginia Vassilevska Williams, and Ryan Williams. Simulating branching programs with edit distance and friends or: a polylog shaved is a lower bound made. In STOC, 2016. Google Scholar
  4. Amir Abboud and Aviad Rubinstein. Fast and deterministic constant factor approximation algorithms for LCS imply new circuit lower bounds. In ITCS, 2018. Google Scholar
  5. David Aldous and Persi Diaconis. Longest increasing subsequences: from patience sorting to the baik-deift-johansson theorem. Bulletin of the American Mathematical Society, 36(4):413-432, 1999. Google Scholar
  6. Carlos ER Alves, Edson N Cáceres, and Siang Wun Song. A coarse-grained parallel algorithm for the all-substrings longest common subsequence problem. Algorithmica, 45(3):301-335, 2006. Google Scholar
  7. A. Andoni, M. Deza, A. Gupta, P. Indyk, and S. Raskhodnikova. Lower bounds for embedding edit distance into normed spaces. In SODA, 2003. Google Scholar
  8. Alexandr Andoni, Assaf Goldberger, Andrew McGregor, and Ely Porat. Homomorphic fingerprints under misalignments: Sketching edit and shift distances. In STOC, 2013. Google Scholar
  9. Alexandr Andoni and Robert Krauthgamer. The computational hardness of estimating edit distance [extended abstract]. In FOCS, 2007. Google Scholar
  10. Alexandr Andoni and Robert Krauthgamer. The smoothed complexity of edit distance. In ICALP, 2008. Google Scholar
  11. Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. Polylogarithmic approximation for edit distance and the asymmetric query complexity. In FOCS, 2010. Google Scholar
  12. Alexandr Andoni and Negev Shekel Nosatzki. Edit distance in near-linear time: it’s a constant factor. In Proceedings of the 61st Annual Symposium on Foundations of Computer Science (FOCS), 2020. Google Scholar
  13. Alexandr Andoni and Krzysztof Onak. Approximating edit distance in near-linear time. In STOC, 2009. Google Scholar
  14. Arturs Backurs and Piotr Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). In STOC, 2015. Google Scholar
  15. Nikhil Bansal, Moshe Lewenstein, Bin Ma, and Kaizhong Zhang. On the longest common rigid subsequence problem. Algorithmica, 56(2):270-280, February 2010. URL: https://doi.org/10.1007/s00453-008-9175-1.
  16. Ziv Bar-Yossef, TS Jayram, Robert Krauthgamer, and Ravi Kumar. Approximating edit distance efficiently. In FOCS, 2004. Google Scholar
  17. Tuğkan Batu, Funda Ergun, and Cenk Sahinalp. Oblivious string embeddings and edit distance approximations. In SODA, 2006. Google Scholar
  18. Djamal Belazzougui and Qin Zhang. Edit distance: Sketching, streaming, and document exchange. In FOCS, 2016. Google Scholar
  19. Richard Bellman. Dynamic programming (dp), 1957. Google Scholar
  20. Mahdi Boroujeni, Soheil Ehsani, Mohammad Ghodsi, MohammadTaghi HajiAghayi, and Saeed Seddighin. Approximating edit distance in truly subquadratic time: Quantum and MapReduce. In SODA, 2018. Google Scholar
  21. Mahdi Boroujeni and Saeed Seddighin. Improved MPC algorithms for edit distance and Ulam distance. In SPAA, 2019. Google Scholar
  22. Joshua Brakensiek and Aviad Rubinstein. Constant-factor approximation of near-linear edit distance in near-linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 685-698, 2020. Google Scholar
  23. Karl Bringman and Marvin Künnemann. Multivariate fine-grained complexity of longest common subsequence. In SODA, 2018. Google Scholar
  24. Karl Bringmann, Fabrizio Grandoni, Barna Saha, and Virginia Vassilevska Williams. Truly sub-cubic algorithms for language edit distance and RNA-folding via fast bounded-difference min-plus product. In FOCS, 2016. Google Scholar
  25. Karl Bringmann and Marvin Kunnemann. Quadratic conditional lower bounds for string problems and dynamic time warping. In FOCS, 2015. Google Scholar
  26. Diptarka Chakraborty, Debarati Das, Elazar Goldenberg, Michal Koucky, and Michael Saks. Approximating edit distance within constant factor in truly sub-quadratic time. In FOCS, 2018. Google Scholar
  27. Diptarka Chakraborty, Elazar Goldenberg, and Michal Koucký. Streaming algorithms for computing edit distance without exploiting suffix trees. arXiv preprint, 2016. URL: http://arxiv.org/abs/1607.03718.
  28. Diptarka Chakraborty, Elazar Goldenberg, and Michal Koucký. Streaming algorithms for embedding and computing edit distance in the low distance regime. In STOC, 2016. Google Scholar
  29. Moses Charikar, Ofir Geri, Michael P Kim, and William Kuszmaul. On estimating edit distance: Alignment, dimension reduction, and embeddings. In ICALP, 2018. Google Scholar
  30. Lijie Chen, Shafi Goldwasser, Kaifeng Lyu, Guy N Rothblum, and Aviad Rubinstein. Fine-grained complexity meets IP=PSPACE. In SODA, 2019. Google Scholar
  31. Kuan Cheng, Zhengzhong Jin, Xin Li, and Yu Zheng. Space efficient deterministic approximation of string measures. CoRR, abs/2002.08498, 2020. URL: http://arxiv.org/abs/2002.08498.
  32. Maxime Crochemore, Costas S Iliopoulos, Yoan J Pinzon, and James F Reid. A fast and practical bit-vector algorithm for the longest common subsequence problem. Information Processing Letters, 80(6):279-285, 2001. Google Scholar
  33. Maxime Crochemore, Gad M Landau, and Michal Ziv-Ukelson. A subquadratic sequence alignment algorithm for unrestricted scoring matrices. SIAM Journal on Computing, 32(6):1654-1673, 2003. Google Scholar
  34. J Boutet de Monvel. Extensive simulations for longest common subsequences. The European Physical Journal B-Condensed Matter and Complex Systems, 7(2):293-308, 1999. Google Scholar
  35. Funda Ergün and Hossein Jowhari. On distance to monotonicity and longest increasing subsequence of a data stream. In SODA, 2008. Google Scholar
  36. Alireza Farhadi, MohammadTaghi Hajiaghayi, Aviad Rubinstein, and Saeed Seddighin. Streaming with oracle: New streaming algorithms for edit distance and LCS. CoRR, abs/2002.11342, 2020. URL: http://arxiv.org/abs/2002.11342.
  37. Anna Gál and Parikshit Gopalan. Lower bounds on streaming algorithms for approximating the length of the longest increasing subsequence. In FOCS, 2007. Google Scholar
  38. Minos Garofalakis and Amit Kumar. Correlating XML data streams using tree-edit distance embeddings. In PODS, 2003. Google Scholar
  39. Omer Gold and Micha Sharir. Dynamic time warping and geometric edit distance: Breaking the quadratic barrier. In ICALP, 2017. Google Scholar
  40. Elazar Goldenberg, Robert Krauthgamer, and Barna Saha. Sublinear algorithms for gap edit distance. In FOCS, 2019. Google Scholar
  41. Parikshit Gopalan, T. S. Jayram, Robert Krauthgamer, and Ravi Kumar. Estimating the sortedness of a data stream. In SODA, 2007. Google Scholar
  42. Dan Gusfield. Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge University Press, 1997. Google Scholar
  43. Bernhard Haeupler, Aviad Rubinstein, and Amirbehshad Shahrasbi. Near-linear time insertion-deletion codes and (1+ε)-approximating edit distance via indexing. In STOC, 2019. Google Scholar
  44. MohammadTaghi Hajiaghayi, Masoud Seddighin, Saeed Seddighin, and Xiaorui Sun. Approximating lcs in linear time: Beating the √n barrier. In SODA, 2019. Google Scholar
  45. MohammadTaghi Hajiaghayi, Saeed Seddighin, and Xiaorui Sun. Massively parallel approximation algorithms for edit distance and longest common subsequence. In SODA, 2019. Google Scholar
  46. James W Hunt and Thomas G Szymanski. A fast algorithm for computing longest common subsequences. Communications of the ACM, 20(5):350-353, 1977. Google Scholar
  47. Piotr Indyk. Algorithmic applications of low-distortion geometric embeddings. In FOCS, 2001. Google Scholar
  48. Rajesh Jayaram and Barna Saha. Approximating language edit distance beyond fast matrix multiplication: Ultralinear grammars are where parsing becomes hard! In ICALP, 2017. Google Scholar
  49. Ce Jin, Jelani Nelson, and Kewen Wu. An improved sketching algorithm for edit distance. In 38th International Symposium on Theoretical Aspects of Computer Science (STACS 2021). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2021. Google Scholar
  50. Masashi Kiyomi, Takashi Horiyama, and Yota Otachi. Longest common subsequence in sublinear space. arXiv preprint, 2020. URL: http://arxiv.org/abs/2009.08588.
  51. Masashi Kiyomi, Hirotaka Ono, Yota Otachi, Pascal Schweitzer, and Jun Tarui. Space-efficient algorithms for longest increasing subsequence. Theory of Computing Systems, pages 1-20, 2018. Google Scholar
  52. Michal Koucky and Michael Saks. Constant factor approximations to edit distance on far input pairs in nearly linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 699-712, 2020. Google Scholar
  53. William Kuszmaul. Dynamic time warping in strongly subquadratic time: Algorithms for the low-distance regime and approximate evaluation. arXiv preprint, 2019. URL: http://arxiv.org/abs/1904.09690.
  54. Gad M Landau, Eugene W Myers, and Jeanette P Schmidt. Incremental string comparison. SIAM Journal on Computing, 27(2):557-582, 1998. Google Scholar
  55. Charles Eric Leiserson, Ronald L Rivest, Thomas H Cormen, and Clifford Stein. Introduction to algorithms, volume 6. MIT press Cambridge, MA, 2001. Google Scholar
  56. David Liben-Nowell, Erik Vee, and An Zhu. Finding longest increasing and common subsequences in streaming data. In COCOON, 2005. Google Scholar
  57. William J Masek and Michael S Paterson. A faster algorithm computing string edit distances. Journal of Computer and System Sciences, 20(1):18-31, 1980. Google Scholar
  58. Timothy Naumovitz and Michael Saks. A polylogarithmic space deterministic streaming algorithm for approximating distance to monotonicity. SODA, 2014. Google Scholar
  59. Rafail Ostrovsky and Yuval Rabani. Low distortion embeddings for edit distance. In STOC, 2005. Google Scholar
  60. Aviad Rubinstein and Zhao Song. Reducing approximate longest common subsequence to approximate edit distance. arXiv preprint, 2019. URL: http://arxiv.org/abs/1904.05451.
  61. Aviad Runbinstein, Saeed Seddighin, Zhao Song, and Xiaorui Sun. Approximation algorithms for LCS and LIS with truly improved running times. In FOCS, 2019. Google Scholar
  62. Barna Saha. Language edit distance and maximum likelihood parsing of stochastic grammars: Faster algorithms and connection to fundamental graph problems. In FOCS, 2015. Google Scholar
  63. Barna Saha. Fast & space-efficient approximations of language edit distance and RNA folding: An amnesic dynamic programming approach. In FOCS, 2017. Google Scholar
  64. Michael Saks and C Seshadhri. Space efficient streaming algorithms for the distance to monotonicity and asymmetric edit distance. In SODA, 2013. Google Scholar
  65. Walter J Savitch. Relationships between nondeterministic and deterministic tape complexities. Journal of computer and system sciences, 4(2):177-192, 1970. Google Scholar
  66. Xiaoming Sun and David P. Woodruff. The communication and streaming complexity of computing the longest common and increasing subsequences. In SODA, 2007. Google Scholar
  67. Alexander Tiskin. Semi-local string comparison: Algorithmic techniques and applications. Mathematics in Computer Science, 1(4):571-603, 2008. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail