Optimal-Time Queries on BWT-Runs Compressed Indexes

Nishimoto, Takaaki; Tabei, Yasuo

doi:10.4230/LIPIcs.ICALP.2021.101

File

LIPIcs.ICALP.2021.101.pdf

Filesize: 0.84 MB
15 pages

Document Identifiers

DOI: 10.4230/LIPIcs.ICALP.2021.101
URN: urn:nbn:de:0030-drops-141702

Author Details

Takaaki Nishimoto

RIKEN Center for Advanced Intelligence Project, Tokyo, Japan

Yasuo Tabei

RIKEN Center for Advanced Intelligence Project, Tokyo, Japan

Cite As

Takaaki Nishimoto and Yasuo Tabei. Optimal-Time Queries on BWT-Runs Compressed Indexes. In 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 198, pp. 101:1-101:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.ICALP.2021.101

Abstract

Indexing highly repetitive strings (i.e., strings with many repetitions) for fast queries has become a central research topic in string processing because of its wide variety of applications in bioinformatics and natural language processing. Although a substantial number of indexes for highly repetitive strings have been proposed thus far, developing compressed indexes that support various queries remains a challenge. The run-length Burrows-Wheeler transform (RLBWT) is a lossless data compression method that applies a reversible permutation to an input string and then run-length encodes the result, and it has received interest for indexing highly repetitive strings. LF and ϕ^{-1} are two key functions for building indexes on RLBWT, and the best previous result computes LF and ϕ^{-1} in O(log log n) time with O(r) words of space, where n is the string length and r is the number of runs in the RLBWT. In this paper, we improve LF and ϕ^{-1} so that they can be computed in constant time with O(r) words of space. Subsequently, we present OptBWTR (optimal-time queries on BWT-runs compressed indexes), the first string index that supports various queries, including locate, count, and extract queries, in optimal time and O(r) words of space.
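
To make the terms in the abstract concrete, here is a minimal Python sketch of the underlying definitions only. It is not the paper's data structure. It builds the BWT by sorting rotations, run-length encodes it to expose the r runs, and tabulates LF in an array of n entries. The function names, the '$' terminator, and the naive construction are assumptions chosen for illustration. The constant-time, O(r)-word computation of LF and ϕ^{-1} described in the abstract is not attempted here.

```python
def bwt(text):
    """Return the BWT of text + '$' via sorted rotations (naive, illustration only)."""
    # Assumption: '$' does not occur in text and sorts before every other character.
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def run_length_encode(s):
    """Collapse s into (character, run length) pairs; len(result) is r."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def lf_mapping(last_column):
    """Tabulate LF: the row of the sorted first column holding the same
    character occurrence as last_column[i]. Uses O(n) space, unlike the paper."""
    # Sorting indices by (character, index) ranks equal characters in text order.
    order = sorted(range(len(last_column)), key=lambda i: (last_column[i], i))
    lf = [0] * len(last_column)
    for first_pos, last_pos in enumerate(order):
        lf[last_pos] = first_pos
    return lf


def invert_bwt(last_column):
    """Recover the original text by following LF from row 0 (the '$' rotation)."""
    lf = lf_mapping(last_column)
    i = 0
    out = []
    for _ in range(len(last_column) - 1):
        out.append(last_column[i])  # characters come out right to left
        i = lf[i]
    return "".join(reversed(out))


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)                     # annb$aa
    print(run_length_encode(transformed))  # 5 runs for n = 7
    print(invert_bwt(transformed))         # banana
```

In this sketch LF is a plain lookup table, so each step is constant time but the table costs n words. The result summarized in the abstract concerns achieving constant time while storing only O(r) words.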

Subject Classification

ACM Subject Classification

Theory of computation → Data compression

Keywords

Compressed text indexes
Burrows-Wheeler transform
highly repetitive text collections


