Minimal Absent Words on Run-Length Encoded Strings

Authors Tooru Akagi, Kouta Okabe, Takuya Mieno , Yuto Nakashima , Shunsuke Inenaga



PDF
Thumbnail PDF

File

LIPIcs.CPM.2022.27.pdf
  • Filesize: 0.97 MB
  • 17 pages

Document Identifiers

Author Details

Tooru Akagi
  • Department of Informatics, Kyushu University, Fukuoka, Japan
Kouta Okabe
  • Department of Information Science and Technology, Kyushu University, Fukuoka, Japan
Takuya Mieno
  • Faculty of Information Science and Technology, Hokkaido University, Sapporo, Japan
Yuto Nakashima
  • Department of Informatics, Kyushu University, Fukuoka, Japan
Shunsuke Inenaga
  • Department of Informatics, Kyushu University, Fukuoka, Japan
  • PRESTO, Japan Science and Technology Agency, Kawaguchi, Japan

Acknowledgements

We thank the anonymous referees for their comments.

Cite AsGet BibTex

Tooru Akagi, Kouta Okabe, Takuya Mieno, Yuto Nakashima, and Shunsuke Inenaga. Minimal Absent Words on Run-Length Encoded Strings. In 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 223, pp. 27:1-27:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.CPM.2022.27

Abstract

A string w is called a minimal absent word for another string T if w does not occur (as a substring) in T and all proper substrings of w occur in T. State-of-the-art data structures for reporting the set MAW(T) of MAWs from a given string T of length n require O(n) space, can be built in O(n) time, and can report all MAWs in O(|MAW(T)|) time upon a query. This paper initiates the problem of computing MAWs from a compressed representation of a string. In particular, we focus on the most basic compressed representation of a string, run-length encoding (RLE), which represents each maximal run of the same characters a by a^p where p is the length of the run. Let m be the RLE-size of string T. After categorizing the MAWs into five disjoint sets ℳ₁, ℳ₂, ℳ₃, ℳ₄, ℳ₅ using RLE, we present matching upper and lower bounds for the number of MAWs in ℳ_i for i = 1,2,4,5 in terms of RLE-size m, except for ℳ₃ whose size is unbounded by m. We then present a compact O(m)-space data structure that can report all MAWs in optimal O(|MAW(T)|) time.

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • string algorithms
  • combinatorics on words
  • minimal absent words
  • run-length encoding

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Tooru Akagi, Yuki Kuhara, Takuya Mieno, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Combinatorics of minimal absent words for a sliding window. CoRR, abs/2105.08496, 2021. URL: http://arxiv.org/abs/2105.08496.
  2. Yannis Almirantis, Panagiotis Charalampopoulos, Jia Gao, Costas S. Iliopoulos, Manal Mohamed, Solon P. Pissis, and Dimitris Polychronopoulos. On avoided words, absent words, and their application to biological sequence analysis. Algorithms for Molecular Biology, 12(1):5, 2017. Google Scholar
  3. Lorraine A. K. Ayad, Golnaz Badkobeh, Gabriele Fici, Alice Héliou, and Solon P. Pissis. Constructing antidictionaries of long texts in output-sensitive space. Theory Comput. Syst., 65(5):777-797, 2021. Google Scholar
  4. Carl Barton, Alice Heliou, Laurent Mouchard, and Solon P. Pissis. Linear-time computation of minimal absent words using suffix array. BMC Bioinformatics, 15(1):388, 2014. Google Scholar
  5. Marie Pierre Béal, Filippo Mignosi, and Antonio Restivo. Minimal forbidden words and symbolic dynamics. In STACS 1996, pages 555-566, 1996. Google Scholar
  6. Djamal Belazzougui, Fabio Cunial, Juha Kärkkäinen, and Veli Mäkinen. Versatile succinct representations of the bidirectional Burrows-Wheeler transform. In ESA 2013, pages 133-144, 2013. Google Scholar
  7. Anselm Blumer, J. Blumer, David Haussler, Andrzej Ehrenfeucht, M. T. Chen, and Joel I. Seiferas. The smallest automaton recognizing the subwords of a text. Theor. Comput. Sci., 40:31-55, 1985. Google Scholar
  8. Supaporn Chairungsee and Maxime Crochemore. Using minimal absent words to build phylogeny. Theor. Comput. Sci., 450:109-116, 2012. Google Scholar
  9. Panagiotis Charalampopoulos, Maxime Crochemore, Gabriele Fici, Robert Mercas, and Solon P. Pissis. Alignment-free sequence comparison using absent words. Inf. Comput., 262:57-68, 2018. Google Scholar
  10. Panagiotis Charalampopoulos, Maxime Crochemore, and Solon P Pissis. On extended special factors of a word. In SPIRE 2018, pages 131-138. Springer, 2018. Google Scholar
  11. Tim Crawford, Golnaz Badkobeh, and David Lewis. Searching page-images of early music scanned with OMR: A scalable solution using minimal absent words. In ISMIR 2018, pages 233-239, 2018. Google Scholar
  12. M. Crochemore, F. Mignosi, A. Restivo, and S. Salemi. Data compression using antidictionaries. Proc. IEEE, 88(11):1756-1768, 2000. Google Scholar
  13. Maxime Crochemore, Alice Héliou, Gregory Kucherov, Laurent Mouchard, Solon P. Pissis, and Yann Ramusat. Absent words in a sliding window with applications. Information and Computation, 270:104461, 2020. Google Scholar
  14. Maxime Crochemore, F. Mignosi, and A. Restivo. Automata and forbidden words. Information Processing Letters, 67(3):111-117, 1998. Google Scholar
  15. Maxime Crochemore and Gonzalo Navarro. Improved antidictionary based compression. In 12th International Conference of the Chilean Computer Science Society, 2002. Proceedings., pages 7-13. IEEE, 2002. Google Scholar
  16. Gabriele Fici. Minimal forbidden words and applications. PhD thesis, Università di Palermo and Université Paris-Est Marne-la-Vallée, 2006. Google Scholar
  17. Gabriele Fici and Pawel Gawrychowski. Minimal absent words in rooted and unrooted trees. In SPIRE 2019, pages 152-161, 2019. Google Scholar
  18. Gabriele Fici, Antonio Restivo, and Laura Rizzo. Minimal forbidden factors of circular words. Theor. Comput. Sci., 792:144-153, 2019. Google Scholar
  19. Yuta Fujishige, Yuki Tsujimaru, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Computing DAWGs and minimal absent words in linear time for integer alphabets. In MFCS 2016, volume 58, pages 38:1-38:14, 2016. Google Scholar
  20. John C. Kieffer and En-Hui Yang. Grammar-based codes: A new class of universal lossless source codes. IEEE Trans. Inf. Theory, 46(3):737-754, 2000. URL: https://doi.org/10.1109/18.841160.
  21. Grigorios Koulouras and Martin C Frith. Significant non-existence of sequences in genomes and proteomes. Nucleic acids research, 49(6):3139-3155, 2021. Google Scholar
  22. Veli Mäkinen and Gonzalo Navarro. Succinct suffix arrays based on run-length encoding. Nord. J. Comput., 12(1):40-66, 2005. Google Scholar
  23. Takuya Mieno, Yuki Kuhara, Tooru Akagi, Yuta Fujishige, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Minimal unique substrings and minimal absent words in a sliding window. In SOFSEM 2020, volume 12011 of Lecture Notes in Computer Science, pages 148-160. Springer, 2020. Google Scholar
  24. Diogo Pratas and Jorge M Silva. Persistent minimal sequences of SARS-CoV-2. Bioinformatics, 36(21):5129-5132, 2020. Google Scholar
  25. Yuya Tamakoshi, Keisuke Goto, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. An opportunistic text indexing structure based on run length encoding. In CIAC 2015, volume 9079 of Lecture Notes in Computer Science, pages 390-402. Springer, 2015. Google Scholar
  26. J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory, IT-23(3):337-349, 1977. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail