R-enum: Enumeration of Characteristic Substrings in BWT-runs Bounded Space

Authors Takaaki Nishimoto, Yasuo Tabei




Author Details

Takaaki Nishimoto
  • RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
Yasuo Tabei
  • RIKEN Center for Advanced Intelligence Project, Tokyo, Japan

Acknowledgements

We thank the reviewers for their useful comments.

Cite As

Takaaki Nishimoto and Yasuo Tabei. R-enum: Enumeration of Characteristic Substrings in BWT-runs Bounded Space. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 21:1-21:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.CPM.2021.21

Abstract

Enumerating characteristic substrings (e.g., maximal repeats, minimal unique substrings, and minimal absent words) in a given string is an important research topic with a wide variety of applications in areas such as string processing and computational biology. Although several enumeration algorithms for characteristic substrings have been proposed, they are not space-efficient: their space usage is proportional to the length of the input string. Recently, the run-length encoded Burrows-Wheeler transform (RLBWT) has attracted increasing attention in string processing, and various algorithms for the RLBWT have been developed. Developing enumeration algorithms for characteristic substrings with the RLBWT, however, remains a challenge. In this paper, we present r-enum (RLBWT-based enumeration), the first enumeration algorithm for characteristic substrings based on the RLBWT. R-enum runs in O(n log log (n/r)) time and O(r log n) bits of working space, where n is the string length and r is the number of runs in the RLBWT. Here, r is expected to be significantly smaller than n for highly repetitive strings (i.e., strings with many repetitions). Experiments on a benchmark dataset of highly repetitive strings show that r-enum is more space-efficient than previous methods. In addition, we demonstrate the applicability of r-enum to a huge string by performing experiments on a 300-gigabyte string of 100 human genomes.
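
To make the quantities n and r in the abstract concrete, the following is a minimal Python sketch that builds a BWT by sorting rotations and then run-length encodes it. It is an illustration only and not the construction used by r-enum: the function names `bwt` and `rlbwt`, the `$` sentinel, and the naive rotation sort are all assumptions made for this example, and the sketch uses space proportional to n rather than to r.

```python
from itertools import groupby
from typing import List, Tuple


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT via sorted rotations (illustrative; not space-efficient)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    # Sort all cyclic rotations and take the last character of each.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def rlbwt(text: str) -> List[Tuple[str, int]]:
    """Run-length encode the BWT as (character, run length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(bwt(text))]


if __name__ == "__main__":
    for t in ["banana", "ab" * 50]:
        runs = rlbwt(t)
        n = len(t) + 1  # string length including the sentinel
        r = len(runs)   # number of runs in the RLBWT
        print(f"n={n} r={r}")
```

For "banana" the BWT is "annb$aa", which has r = 5 runs for n = 7. For the highly repetitive string consisting of "ab" repeated 50 times, the sketch gives r = 3 for n = 101, which is the kind of gap between r and n that the O(r log n)-bit working space is meant to exploit.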

Subject Classification

ACM Subject Classification
  • Theory of computation → Data compression
Keywords
  • Enumeration algorithm
  • Burrows-Wheeler transform
  • Maximal repeats
  • Minimal unique substrings
  • Minimal absent words

