A Compact DAG for Storing and Searching Maximal Common Subsequences

Authors: Alessio Conte, Roberto Grossi, Giulia Punzi, Takeaki Uno



Author Details

Alessio Conte
  • Università di Pisa, Italy
Roberto Grossi
  • Università di Pisa, Italy
Giulia Punzi
  • National Institute of Informatics, Tokyo, Japan
Takeaki Uno
  • National Institute of Informatics, Tokyo, Japan

Acknowledgements

We thank the anonymous Referees for their comments, leading us to the current version of Theorem 13.

Cite As

Alessio Conte, Roberto Grossi, Giulia Punzi, and Takeaki Uno. A Compact DAG for Storing and Searching Maximal Common Subsequences. In 34th International Symposium on Algorithms and Computation (ISAAC 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 283, pp. 21:1-21:15, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.ISAAC.2023.21

Abstract

Maximal Common Subsequences (MCSs) between two strings X and Y are subsequences of both X and Y that are maximal under inclusion. MCSs relax and generalize the well-known and widely used concept of Longest Common Subsequences (LCSs), which can be seen as MCSs of maximum length. While the number of both LCSs and MCSs can be exponential in the length of the strings, LCSs have long been exploited for string and text analysis, as simple compact representations of all LCSs between two strings, built via dynamic programming or automata, have been known since the '70s. MCSs appear to have a more challenging structure: even listing them efficiently was an open problem until recently. Its resolution narrowed the complexity difference between the two problems, but the gap remained significant. In this paper we close the complexity gap: we show how to build a DAG of polynomial size, in polynomial time, which allows for efficient operations on the set of all MCSs, such as enumeration in Constant Amortized Time (CAT) per solution, counting, and random access to the i-th element (i.e., rank and select operations). Besides improving known algorithmic results, this work paves the way for new sequence analysis methods based on MCSs.
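
To make the definitions above concrete, the following Python sketch first computes the MCSs of two tiny strings by brute force, directly from the definition, and then shows the generic path-counting idea that makes counting and select possible on a DAG whose source-to-sink paths spell the stored strings. It is only an illustration under stated assumptions: the brute force is exponential, the DAG is written by hand for this one example, and the names `maximal_common_subsequences`, `DAG`, `count`, and `select` are hypothetical. It is not the construction presented in the paper.

```python
from functools import lru_cache
from itertools import combinations


def subsequences(s):
    """Return the set of all subsequences of s (exponential; tiny inputs only)."""
    return {"".join(c) for r in range(len(s) + 1) for c in combinations(s, r)}


def is_subsequence(a, b):
    """Return True if a is a subsequence of b."""
    it = iter(b)
    return all(ch in it for ch in a)


def maximal_common_subsequences(x, y):
    """Brute-force MCSs: common subsequences not contained in a different one."""
    common = subsequences(x) & subsequences(y)
    return sorted(
        w for w in common
        if not any(w != v and is_subsequence(w, v) for v in common)
    )


# Hand-built DAG (an assumption for illustration, not the paper's DAG).
# Each node maps to its outgoing edges as (label, child), sorted by label.
# Its two source-to-sink paths spell "ab" and "c".
DAG = {
    "s": [("a", "u"), ("c", "t")],
    "u": [("b", "t")],
    "t": [],
}
SOURCE, SINK = "s", "t"


@lru_cache(maxsize=None)
def count(node):
    """Number of paths from node to the sink, memoized over the DAG."""
    if node == SINK:
        return 1
    return sum(count(child) for _, child in DAG[node])


def select(i, node=SOURCE):
    """Return the string spelled by the i-th path (0-based, edge-label order)."""
    if not 0 <= i < count(node):
        raise IndexError("path index out of range")
    out = []
    while node != SINK:
        for label, child in DAG[node]:
            c = count(child)
            if i < c:          # the i-th path goes through this edge
                out.append(label)
                node = child
                break
            i -= c             # skip all paths through this edge
    return "".join(out)


if __name__ == "__main__":
    print(maximal_common_subsequences("abc", "cab"))  # ['ab', 'c']
    print(count(SOURCE), [select(i) for i in range(count(SOURCE))])
```

For X = "abc" and Y = "cab", the common subsequences are the empty string, "a", "b", "c", and "ab". The only LCS is "ab", while "c" is an MCS that is not an LCS, since no longer common subsequence contains it. The `count` and `select` functions rely only on per-node path counts, which is why a polynomial-size DAG can support these operations even when the number of stored strings is exponential.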

Subject Classification

ACM Subject Classification
  • Mathematics of computing → Combinatorial algorithms
  • Information systems → Structured text search
Keywords
  • Maximal common subsequence
  • DAG
  • Compact data structures
  • Enumeration
  • Constant amortized time
  • Random access
