A Myhill-Nerode Theorem for Generalized Automata, with Applications to Pattern Matching and Compression

Author Nicola Cotumaccio



Author Details

Nicola Cotumaccio
  • Gran Sasso Science Institute, L'Aquila, Italy
  • Dalhousie University, Halifax, Canada

Cite As

Nicola Cotumaccio. A Myhill-Nerode Theorem for Generalized Automata, with Applications to Pattern Matching and Compression. In 41st International Symposium on Theoretical Aspects of Computer Science (STACS 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 289, pp. 26:1-26:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.STACS.2024.26

Abstract

The model of generalized automata, introduced by Eilenberg in 1974, represents a regular language more concisely than conventional automata by allowing edges to be labeled not only with characters but also with strings. Giammarresi and Montalbano introduced a notion of determinism for generalized automata [STACS 1995]. While generalized deterministic automata retain many properties of conventional deterministic automata, the uniqueness of a minimal generalized deterministic automaton is lost. In the first part of the paper, we show that the lack of uniqueness can be explained by introducing a set 𝒲(𝒜) associated with a generalized automaton 𝒜. If 𝒜 is a conventional automaton, the set 𝒲(𝒜) is always trivially equal to the set of all prefixes of the language recognized by the automaton, but this need not be true for generalized automata. By fixing 𝒲(𝒜), we are able to derive for the first time a full Myhill-Nerode theorem for generalized automata, which contains the textbook Myhill-Nerode theorem for conventional automata as a degenerate case. In the second part of the paper, we show that the set 𝒲(𝒜) leads to applications for pattern matching and data compression. Wheeler automata [TCS 2017, SODA 2020] are a popular class of automata that can be compactly stored using e log σ (1 + o(1)) + O(e) bits (e being the number of edges, σ being the size of the alphabet) in such a way that pattern matching queries can be solved in Õ(m) time (m being the length of the pattern). In the paper, we show how to extend these results to generalized automata. More precisely, a Wheeler generalized automaton can be stored using 𝔢 log σ (1 + o(1)) + O(e + rn) bits so that pattern matching queries can be solved in Õ(rm) time, where 𝔢 is the total length of all edge labels, r is the maximum length of an edge label and n is the number of states.
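As an illustration only, the following Python sketch builds a toy generalized automaton whose edges carry strings. The abstract does not define 𝒲(𝒜), so the sketch assumes it is the set of strings spelled by paths leaving the initial state. The automaton, its state numbers and the helper names (`accepts`, `path_strings`, `language_prefixes`) are hypothetical and not taken from the paper.

```python
from collections import deque

# Hypothetical toy generalized automaton recognizing {"abc"}.
# Edge labels are non-empty strings, not single characters.
EDGES = {
    0: [("ab", 1)],
    1: [("c", 2)],
}
INITIAL = 0
FINAL = {2}


def accepts(word):
    """Return True if some path from INITIAL to a final state spells `word`."""
    frontier = deque([(INITIAL, 0)])  # (state, number of characters consumed)
    seen = set()
    while frontier:
        state, pos = frontier.popleft()
        if (state, pos) in seen:
            continue
        seen.add((state, pos))
        if pos == len(word) and state in FINAL:
            return True
        for label, target in EDGES.get(state, []):
            # An edge can be taken only if its whole label matches at `pos`.
            if word.startswith(label, pos):
                frontier.append((target, pos + len(label)))
    return False


def path_strings(max_len):
    """Strings of length <= max_len spelled by a path starting at INITIAL.

    Assumption: this is used as a stand-in for the set W(A) of the abstract.
    """
    found = set()
    frontier = deque([(INITIAL, "")])
    seen = set()
    while frontier:
        state, spelled = frontier.popleft()
        if (state, spelled) in seen:
            continue
        seen.add((state, spelled))
        found.add(spelled)
        for label, target in EDGES.get(state, []):
            if len(spelled) + len(label) <= max_len:
                frontier.append((target, spelled + label))
    return found


def language_prefixes(words):
    """All prefixes (including the empty string) of the given words."""
    return {w[:i] for w in words for i in range(len(w) + 1)}


if __name__ == "__main__":
    print(sorted(path_strings(3)))
    print(sorted(language_prefixes({"abc"}) - path_strings(3)))
```

In this toy example the string "a" is a prefix of the recognized language, yet no path from the initial state spells it, because the first edge consumes "ab" in one step. Under the stated assumption, this is the kind of gap between 𝒲(𝒜) and the prefix set that cannot occur when every edge label is a single character.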

Subject Classification

ACM Subject Classification
  • Theory of computation → Regular languages
  • Theory of computation → Pattern matching
  • Theory of computation → Data compression
Keywords
  • Generalized Automata
  • Myhill-Nerode Theorem
  • Regular Languages
  • Wheeler Graphs
  • FM-index
  • Burrows-Wheeler Transform

References

  1. Tatsuya Akutsu. A linear time pattern matching algorithm between a string and a tree. In Alberto Apostolico, Maxime Crochemore, Zvi Galil, and Udi Manber, editors, Combinatorial Pattern Matching, pages 1-10, Berlin, Heidelberg, 1993. Springer Berlin Heidelberg.
  2. Jarno Alanko, Nicola Cotumaccio, and Nicola Prezza. Linear-time minimization of wheeler dfas. In 2022 Data Compression Conference (DCC), pages 53-62, 2022. URL: https://doi.org/10.1109/DCC52660.2022.00013.
  3. Jarno Alanko, Giovanna D'Agostino, Alberto Policriti, and Nicola Prezza. Regular Languages meet Prefix Sorting, pages 911-930. SIAM, 2020. URL: https://doi.org/10.1137/1.9781611975994.55.
  4. Jarno Alanko, Giovanna D'Agostino, Alberto Policriti, and Nicola Prezza. Wheeler languages. Information and Computation, 281:104820, 2021. URL: https://doi.org/10.1016/j.ic.2021.104820.
  5. Amihood Amir, Moshe Lewenstein, and Noa Lewenstein. Pattern matching in hypertext. Journal of Algorithms, 35(1):82-99, 2000. URL: https://doi.org/10.1006/jagm.1999.1063.
  6. Jasmijn A. Baaijens, Paola Bonizzoni, Christina Boucher, Gianluca Della Vedova, Yuri Pirola, Raffaella Rizzi, and Jouni Sirén. Computational graph pangenomics: a tutorial on data structures and their applications. Nat. Comput., 21(1):81-108, 2022. URL: https://doi.org/10.1007/s11047-022-09882-6.
  7. Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A. Gurevich, Mikhail Dvorkin, Alexander S. Kulikov, Valery M. Lesin, Sergey I. Nikolenko, Son Pham, Andrey D. Prjibelski, Alexey V. Pyshkin, Alexander V. Sirotkin, Nikolay Vyahhi, Glenn Tesler, Max A. Alekseyev, and Pavel A. Pevzner. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology, 19(5):455-477, 2012. PMID: 22506599. URL: https://doi.org/10.1089/cmb.2012.0021.
  8. Ruben Becker, Manuel Cáceres, Davide Cenzato, Sung-Hwan Kim, Bojana Kodric, Francisco Olivares, and Nicola Prezza. Sorting Finite Automata via Partition Refinement. In Inge Li Gørtz, Martin Farach-Colton, Simon J. Puglisi, and Grzegorz Herman, editors, 31st Annual European Symposium on Algorithms (ESA 2023), volume 274 of Leibniz International Proceedings in Informatics (LIPIcs), pages 15:1-15:15, Dagstuhl, Germany, 2023. Schloss Dagstuhl - Leibniz-Zentrum für Informatik. URL: https://doi.org/10.4230/LIPIcs.ESA.2023.15.
  9. Alexander Bowe, Taku Onodera, Kunihiko Sadakane, and Tetsuo Shibuya. Succinct de bruijn graphs. In Ben Raphael and Jijun Tang, editors, Algorithms in Bioinformatics, pages 225-235, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
  10. M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm, 1994.
  11. Alessio Conte, Nicola Cotumaccio, Travis Gagie, Giovanni Manzini, Nicola Prezza, and Marinella Sciortino. Computing matching statistics on wheeler dfas. In 2023 Data Compression Conference (DCC), pages 150-159, 2023. URL: https://doi.org/10.1109/DCC55655.2023.00023.
  12. Nicola Cotumaccio. Graphs can be succinctly indexed for pattern matching in O(|E|² + |V|^{5/2}) time. In 2022 Data Compression Conference (DCC), pages 272-281, 2022. URL: https://doi.org/10.1109/DCC52660.2022.00035.
  13. Nicola Cotumaccio. Prefix Sorting DFAs: A Recursive Algorithm. In Satoru Iwata and Naonori Kakimura, editors, 34th International Symposium on Algorithms and Computation (ISAAC 2023), volume 283 of Leibniz International Proceedings in Informatics (LIPIcs), pages 22:1-22:15, Dagstuhl, Germany, 2023. Schloss Dagstuhl - Leibniz-Zentrum für Informatik. URL: https://doi.org/10.4230/LIPIcs.ISAAC.2023.22.
  14. Nicola Cotumaccio. A myhill-nerode theorem for generalized automata, with applications to pattern matching and compression, 2024. URL: https://arxiv.org/abs/2302.06506.
  15. Nicola Cotumaccio, Giovanna D’Agostino, Alberto Policriti, and Nicola Prezza. Co-lexicographically ordering automata and regular languages - part i. J. ACM, 70(4), August 2023. URL: https://doi.org/10.1145/3607471.
  16. Nicola Cotumaccio, Travis Gagie, Dominik Köppl, and Nicola Prezza. Space-time trade-offs for the lcp array of wheeler dfas. In Franco Maria Nardini, Nadia Pisanti, and Rossano Venturini, editors, String Processing and Information Retrieval, pages 143-156, Cham, 2023. Springer Nature Switzerland.
  17. Nicola Cotumaccio and Nicola Prezza. On indexing and compressing finite automata. In Proceedings of the Thirty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '21, pages 2585-2599, USA, 2021. Society for Industrial and Applied Mathematics.
  18. Samuel Eilenberg. Automata, Languages, and Machines. Academic Press, Inc., USA, 1974.
  19. Massimo Equi, Roberto Grossi, Veli Mäkinen, and Alexandru I. Tomescu. On the complexity of string matching for graphs. In Christel Baier, Ioannis Chatzigiannakis, Paola Flocchini, and Stefano Leonardi, editors, 46th International Colloquium on Automata, Languages, and Programming, ICALP 2019, July 9-12, 2019, Patras, Greece, volume 132 of LIPIcs, pages 55:1-55:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019. URL: https://doi.org/10.4230/LIPIcs.ICALP.2019.55.
  20. Massimo Equi, Veli Mäkinen, and Alexandru I. Tomescu. Graphs cannot be indexed in polynomial time for sub-quadratic time string matching, unless seth fails. In Tomáš Bureš, Riccardo Dondi, Johann Gamper, Giovanna Guerrini, Tomasz Jurdziński, Claus Pahl, Florian Sikora, and Prudence W.H. Wong, editors, SOFSEM 2021: Theory and Practice of Computer Science, pages 608-622, Cham, 2021. Springer International Publishing.
  21. Massimo Equi, Veli Mäkinen, Alexandru I. Tomescu, and Roberto Grossi. On the complexity of string matching for graphs. ACM Trans. Algorithms, 19(3), April 2023. URL: https://doi.org/10.1145/3588334.
  22. Massimo Equi, Tuukka Norri, Jarno Alanko, Bastien Cazaux, Alexandru I. Tomescu, and Veli Mäkinen. Algorithms and complexity on indexing elastic founder graphs. In Hee-Kap Ahn and Kunihiko Sadakane, editors, 32nd International Symposium on Algorithms and Computation, ISAAC 2021, December 6-8, 2021, Fukuoka, Japan, volume 212 of LIPIcs, pages 20:1-20:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. URL: https://doi.org/10.4230/LIPIcs.ISAAC.2021.20.
  23. Massimo Equi, Tuukka Norri, Jarno Alanko, Bastien Cazaux, Alexandru I. Tomescu, and Veli Mäkinen. Algorithms and complexity on indexing founder graphs. Algorithmica, 85(6):1586-1623, 2023. URL: https://doi.org/10.1007/s00453-022-01007-w.
  24. P. Ferragina and G. Manzini. Opportunistic data structures with applications. In Proc. 41st Annual Symposium on Foundations of Computer Science (FOCS'00), pages 390-398, 2000. URL: https://doi.org/10.1109/SFCS.2000.892127.
  25. Paolo Ferragina and Giovanni Manzini. Indexing compressed text. J. ACM, 52(4):552-581, July 2005. URL: https://doi.org/10.1145/1082036.1082039.
  26. Travis Gagie, Giovanni Manzini, and Jouni Sirén. Wheeler graphs: A framework for bwt-based data structures. Theoretical Computer Science, 698:67-78, 2017. Algorithms, Strings and Theoretical Approaches in the Big Data Era (In Honor of the 60th Birthday of Professor Raffaele Giancarlo). URL: https://doi.org/10.1016/j.tcs.2017.06.016.
  27. Dora Giammarresi and Rosa Montalbano. Deterministic generalized automata. In Ernst W. Mayr and Claude Puech, editors, STACS 95, 12th Annual Symposium on Theoretical Aspects of Computer Science, Munich, Germany, March 2-4, 1995, Proceedings, volume 900 of Lecture Notes in Computer Science, pages 325-336. Springer, 1995. URL: https://doi.org/10.1007/3-540-59042-0_84.
  28. Dora Giammarresi and Rosa Montalbano. Deterministic generalized automata. Theor. Comput. Sci., 215(1-2):191-208, 1999. URL: https://doi.org/10.1016/S0304-3975(97)00166-7.
  29. Kosaburo Hashiguchi. Algorithms for determining the smallest number of nonterminals (states) sufficient for generating (accepting) a regular language. In Javier Leach Albert, Burkhard Monien, and Mario Rodríguez Artalejo, editors, Automata, Languages and Programming, pages 641-648, Berlin, Heidelberg, 1991. Springer Berlin Heidelberg.
  30. John Hopcroft. An n log n algorithm for minimizing states in a finite automaton. In Zvi Kohavi and Azaria Paz, editors, Theory of Machines and Computations, pages 189-196. Academic Press, 1971. URL: https://doi.org/10.1016/B978-0-12-417750-5.50022-1.
  31. John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation (3rd Edition). Addison-Wesley Longman Publishing Co., Inc., USA, 2006.
  32. Ramana M. Idury and Michael S. Waterman. A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2):291-306, 1995.
  33. Sung-Hwan Kim, Francisco Olivares, and Nicola Prezza. Faster prefix-sorting algorithms for deterministic finite automata. In Laurent Bulteau and Zsuzsanna Lipták, editors, 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023, June 26-28, 2023, Marne-la-Vallée, France, volume 259 of LIPIcs, pages 16:1-16:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2023. URL: https://doi.org/10.4230/LIPIcs.CPM.2023.16.
  34. Donald E. Knuth, James H. Morris, Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323-350, 1977. URL: https://doi.org/10.1137/0206024.
  35. Udi Manber and Sun Wu. Approximate string matching with arbitrary costs for text and hypertext. In Advances In Structural And Syntactic Pattern Recognition, pages 22-33. World Scientific, 1992.
  36. Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu. Genome-Scale Algorithm Design: Bioinformatics in the Era of High-Throughput Sequencing. Cambridge University Press, 2 edition, 2023.
  37. Gonzalo Navarro. Improved approximate pattern matching on hypertext. Theor. Comput. Sci., 237(1–2):455-463, April 2000. URL: https://doi.org/10.1016/S0304-3975(99)00333-3.
  38. Gonzalo Navarro. Compact Data Structures: A Practical Approach. Cambridge University Press, 2016. URL: https://doi.org/10.1017/CBO9781316588284.
  39. Kunsoo Park and Dong Kyue Kim. String matching in hypertext. In Zvi Galil and Esko Ukkonen, editors, Combinatorial Pattern Matching, pages 318-329, Berlin, Heidelberg, 1995. Springer Berlin Heidelberg.
  40. Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences, 98(17):9748-9753, 2001. URL: https://doi.org/10.1073/pnas.171285098.
  41. Mikko Rautiainen and Tobias Marschall. Aligning sequences to general graphs in O(V + mE) time. bioRxiv, 2017. URL: https://doi.org/10.1101/216127.
  42. Nicola Rizzo and Veli Mäkinen. Linear time construction of indexable elastic founder graphs. In Cristina Bazgan and Henning Fernau, editors, Combinatorial Algorithms, pages 480-493, Cham, 2022. Springer International Publishing.
  43. Jared T. Simpson and Richard Durbin. Efficient construction of an assembly string graph using the fm-index. Bioinform., 26(12):367-373, 2010. URL: https://doi.org/10.1093/bioinformatics/btq217.