R-enum: Enumeration of Characteristic Substrings in BWT-runs Bounded Space

Nishimoto, Takaaki; Tabei, Yasuo

doi:10.4230/LIPIcs.CPM.2021.21

Author Details

Takaaki Nishimoto

RIKEN Center for Advanced Intelligence Project, Tokyo, Japan

Yasuo Tabei

RIKEN Center for Advanced Intelligence Project, Tokyo, Japan

Cite As

Takaaki Nishimoto and Yasuo Tabei. R-enum: Enumeration of Characteristic Substrings in BWT-runs Bounded Space. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 21:1-21:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.CPM.2021.21

Abstract

Enumerating characteristic substrings (e.g., maximal repeats, minimal unique substrings, and minimal absent words) in a given string has been an important research topic because of its wide variety of applications in areas such as string processing and computational biology. Although several enumeration algorithms for characteristic substrings have been proposed, they are not space-efficient in that their space usage is proportional to the length of the input string. Recently, the run-length encoded Burrows-Wheeler transform (RLBWT) has attracted increased attention in string processing, and various algorithms for the RLBWT have been developed. Developing enumeration algorithms for characteristic substrings with the RLBWT, however, remains a challenge. In this paper, we present r-enum (RLBWT-based enumeration), the first enumeration algorithm for characteristic substrings based on the RLBWT. R-enum runs in O(n log log (n/r)) time and with O(r log n) bits of working space for string length n and number r of runs in the RLBWT. Here, r is expected to be significantly smaller than n for highly repetitive strings (i.e., strings with many repetitions). Experiments using a benchmark dataset of highly repetitive strings show that r-enum is more space-efficient than previous methods. In addition, we demonstrate the applicability of r-enum to a huge string by performing experiments on a 300-gigabyte string of 100 human genomes.
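To make the quantities n and r in the abstract concrete, the following minimal Python sketch computes the BWT of a short string by sorting its rotations and then run-length encodes it. This is not r-enum and not the authors' implementation: the function names `bwt` and `rlbwt` and the `$` sentinel are assumptions chosen for illustration, and the rotation sort takes far more time and space than any practical construction.

```python
from itertools import groupby


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort all rotations of text + sentinel, take the last column.

    Assumes the sentinel does not occur in text and sorts before every
    other character (true for "$" versus ASCII letters).
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def rlbwt(text: str) -> list[tuple[str, int]]:
    """Run-length encode the BWT as (character, run length) pairs."""
    return [(ch, sum(1 for _ in run)) for ch, run in groupby(bwt(text))]


# A string with little repetition: n = 7 (with sentinel), r = 5 runs.
banana_runs = rlbwt("banana")
# banana_runs == [("a", 1), ("n", 2), ("b", 1), ("$", 1), ("a", 2)]

# A highly repetitive string: n = 17 (with sentinel), but only r = 3 runs.
repetitive_runs = rlbwt("ab" * 8)
# repetitive_runs == [("b", 8), ("$", 1), ("a", 8)]
```

The second input illustrates why r can be much smaller than n for highly repetitive strings: repeated contexts place equal characters next to each other in the BWT, so the run-length encoding stays short while the string grows.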

Subject Classification

ACM Subject Classification

Theory of computation → Data compression

Keywords

Enumeration algorithm
Burrows-Wheeler transform
Maximal repeats
Minimal unique substrings
Minimal absent words
