McDag: Indexing Maximal Common Subsequences in Practice

Authors Giovanni Buzzega, Alessio Conte, Roberto Grossi, Giulia Punzi




File

LIPIcs.WABI.2024.21.pdf
  • Filesize: 0.96 MB
  • 18 pages

Author Details

Giovanni Buzzega
  • University of Pisa, Italy
Alessio Conte
  • University of Pisa, Italy
Roberto Grossi
  • University of Pisa, Italy
Giulia Punzi
  • University of Pisa, Italy

Acknowledgements

The authors would like to thank the anonymous reviewers for their insightful comments and suggestions.

Cite As

Giovanni Buzzega, Alessio Conte, Roberto Grossi, and Giulia Punzi. McDag: Indexing Maximal Common Subsequences in Practice. In 24th International Workshop on Algorithms in Bioinformatics (WABI 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 312, pp. 21:1-21:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.WABI.2024.21

Abstract

Analyzing and comparing sequences of symbols is among the most fundamental problems in computer science, possibly even more so in bioinformatics. Maximal Common Subsequences (MCSs), i.e., inclusion-maximal sequences of non-contiguous symbols common to two or more strings, have only recently received attention in this area, despite being a basic notion and a natural generalization of more common tools like Longest Common Substrings/Subsequences. In this paper we simplify and engineer recent advancements on MCSs into a practical tool called McDag, the first publicly available tool that can index MCSs of real genomic data. We demonstrate that our tool can index sequences exceeding 10,000 base pairs within minutes, utilizing only 4-7% more than the minimum required nodes, while also extracting relevant insights.
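
To make the definition in the abstract concrete, the following is a minimal brute-force sketch that lists the MCSs of two very short strings. It is not the McDag algorithm and does not build any index; the function names and the toy inputs are assumptions chosen for illustration, and the running time is exponential in the input length.

```python
from itertools import combinations


def is_subsequence(s, t):
    """Return True if s can be obtained from t by deleting zero or more symbols."""
    it = iter(t)
    # `ch in it` advances the iterator, so symbols must appear in order.
    return all(ch in it for ch in s)


def common_subsequences(x, y):
    """Return the set of distinct common subsequences of x and y.

    Exponential in len(x): intended for toy inputs only.
    """
    found = set()
    for k in range(len(x) + 1):
        for positions in combinations(range(len(x)), k):
            candidate = "".join(x[i] for i in positions)
            if is_subsequence(candidate, y):
                found.add(candidate)
    return found


def maximal_common_subsequences(x, y):
    """Return the common subsequences that are not contained in a different one."""
    cs = common_subsequences(x, y)
    return sorted(
        s for s in cs
        if not any(s != t and is_subsequence(s, t) for t in cs)
    )


# Hypothetical toy input: the longest common subsequence is "AC",
# but "G" is also maximal because no common subsequence extends it.
print(maximal_common_subsequences("ACG", "GAC"))  # ['AC', 'G']
```

The example shows why MCSs generalize longest common subsequences: every longest common subsequence is maximal, but a maximal one (here "G") can be strictly shorter.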

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
  • Applied computing → Molecular sequence analysis
  • Applied computing → Computational genomics
Keywords
  • Index data structure
  • DAG
  • Common subsequence
  • Inclusion-wise maximality
  • LCS

