Optimal-Time Queries on BWT-Runs Compressed Indexes

Authors Takaaki Nishimoto, Yasuo Tabei





Author Details

Takaaki Nishimoto
  • RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
Yasuo Tabei
  • RIKEN Center for Advanced Intelligence Project, Tokyo, Japan

Cite As

Takaaki Nishimoto and Yasuo Tabei. Optimal-Time Queries on BWT-Runs Compressed Indexes. In 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 198, pp. 101:1-101:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.ICALP.2021.101

Abstract

Indexing highly repetitive strings (i.e., strings with many repetitions) for fast queries has become a central research topic in string processing because of its wide variety of applications in bioinformatics and natural language processing. Although a substantial number of indexes for highly repetitive strings have been proposed thus far, developing compressed indexes that support various queries remains a challenge. The run-length Burrows-Wheeler transform (RLBWT) is a lossless data compression method that applies a reversible permutation to an input string and then run-length encodes the result, and it has received interest for indexing highly repetitive strings. LF and ϕ^{-1} are two key functions for building indexes on RLBWT, and the best previous result computes LF and ϕ^{-1} in O(log log n) time with O(r) words of space, where n is the string length and r is the number of runs in RLBWT. In this paper, we improve LF and ϕ^{-1} so that they can be computed in constant time with O(r) words of space. Subsequently, we present OptBWTR (optimal-time queries on BWT-runs compressed indexes), the first string index that supports various queries, including locate, count, and extract queries, in optimal time and O(r) words of space.
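
To make the terms in the abstract concrete, the following is a minimal Python sketch of the BWT, its run-length encoding (whose number of runs is r), and the LF function. It is a naive illustration that uses O(n) words of space, not the O(r)-space constant-time structure described in the paper, and it does not cover ϕ^{-1}. The function names and the use of `$` as a unique end-of-string sentinel are assumptions made for this example only.

```python
def bwt(text: str) -> str:
    """Naive BWT: sort all rotations and read off the last column.

    Assumes `text` ends with a unique sentinel '$' that is smaller
    than every other character (an assumption of this sketch).
    """
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(rot[-1] for rot in rotations)


def run_length_encode(s: str) -> list[tuple[str, int]]:
    """Collapse maximal runs of equal characters into (char, length) pairs."""
    runs: list[tuple[str, int]] = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def lf_mapping(last: str) -> list[int]:
    """LF[i] is the row whose first character is the occurrence last[i].

    Computed by stably sorting positions of the last column by character.
    """
    order = sorted(range(len(last)), key=lambda i: (last[i], i))
    lf = [0] * len(last)
    for first_pos, last_pos in enumerate(order):
        lf[last_pos] = first_pos
    return lf


def invert_bwt(last: str) -> str:
    """Recover the text by following LF from row 0 (the rotation starting with '$')."""
    lf = lf_mapping(last)
    chars = []
    i = 0
    for _ in range(len(last) - 1):
        chars.append(last[i])  # character preceding the current suffix
        i = lf[i]
    return "".join(reversed(chars)) + "$"


if __name__ == "__main__":
    text = "banana$"
    last = bwt(text)
    runs = run_length_encode(last)
    print(last, runs, "r =", len(runs))
    print(invert_bwt(last))
```

Here the run-length encoded pairs play the role of the RLBWT, and r is simply `len(runs)`. Repeatedly applying LF walks the text backwards, which is why fast LF evaluation matters for the queries an index supports.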

Subject Classification

ACM Subject Classification
  • Theory of computation → Data compression
Keywords
  • Compressed text indexes
  • Burrows-Wheeler transform
  • highly repetitive text collections
