Bi-Directional r-Indexes

Authors Yuma Arakawa, Gonzalo Navarro, Kunihiko Sadakane



PDF
Thumbnail PDF

File

LIPIcs.CPM.2022.11.pdf
  • Filesize: 0.83 MB
  • 14 pages

Document Identifiers

Author Details

Yuma Arakawa
  • Department of Mathematical Informatics, The University of Tokyo, Japan
Gonzalo Navarro
  • CeBiB and Department of Computer Science, University of Chile, Santiago, Chile
Kunihiko Sadakane
  • Department of Mathematical Informatics, The University of Tokyo, Japan

Cite AsGet BibTex

Yuma Arakawa, Gonzalo Navarro, and Kunihiko Sadakane. Bi-Directional r-Indexes. In 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 223, pp. 11:1-11:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.CPM.2022.11

Abstract

Indexing highly repetitive texts is important in fields such as bioinformatics and versioned repositories. The run-length compression of the Burrows-Wheeler transform (BWT) provides a compressed representation particularly well-suited to text indexing. The r-index is one such index. It enables fast locating of occurrences of a pattern within O(r) words of space, where r is the number of equal-letter runs in the BWT. Its mechanism of locating is to maintain one suffix array sample along the backward-search of the pattern, and to compute all the pattern positions from that sample once the backward-search is complete. In this paper we develop this algorithm further, and propose a new bi-directional text index called the br-index, which supports extending the matched pattern both in forward and backward directions, and locating the occurrences of the pattern at any step of the search, within O(r+r_R) words of space, where r_R is the number of equal-letter runs in the BWT of the reversed text. Our experiments show that the br-index captures the long repetitions of the text, and outperforms the existing indexes in text searching allowing some mismatches except in an internal part.

Subject Classification

ACM Subject Classification
  • Theory of computation → Data compression
Keywords
  • Compressed text indexes
  • Burrows-Wheeler Transform
  • highly repetitive text collections

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Djamal Belazzougui and Fabio Cunial. Smaller fully-functional bidirectional BWT indexes. In International Symposium on String Processing and Information Retrieval, pages 42-59, 2020. Google Scholar
  2. Michael. Burrows and David J. Wheeler. A block sorting lossless data compression algorithm. Digital SRC Research Report, 1994. Google Scholar
  3. Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4):552-581, 2005. Google Scholar
  4. Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro. Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms, 3(2):article 20, 2007. Google Scholar
  5. Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Optimal-time text indexing in BWT-runs bounded space. Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1459-1477, 2018. Google Scholar
  6. Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. Journal of the ACM, 67(1):1-54, 2020. Google Scholar
  7. Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms, pages 841-850, 2003. Google Scholar
  8. Dan Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997. Google Scholar
  9. Dominik Kempa and Tomasz Kociumaka. Resolution of the burrows-wheeler transform conjecture. In IEEE 61st Annual Symposium on Foundations of Computer Science, pages 1002-1013, 2020. Google Scholar
  10. T. W. Lam, Ruiqiang Li, Alan Tam, Simon Wong, Edward Wu, and S. M. Yiu. High throughput short read alignment via bi-directional BWT. 2009 IEEE International Conference on Bioinformatics and Biomedicine, pages 31-36, 2009. Google Scholar
  11. Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu. Genome-Scale Algorithm Design. Cambridge University Press, 2015. Google Scholar
  12. Veli Mäkinen and Gonzalo Navarro. Succinct suffix arrays based on run-length encoding. In Annual Symposium on Combinatorial Pattern Matching, pages 45-56, 2005. Google Scholar
  13. Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology, 17(3):281-308, 2010. Google Scholar
  14. Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935-948, 1993. Google Scholar
  15. Gonzalo Navarro. Wavelet trees for all. Journal of Discrete Algorithms, 25:2-20, 2014. Google Scholar
  16. Gonzalo Navarro. Document listing on repetitive collections with guaranteed performance. Theoretical Computer Science, 772:58-72, June 2019. Google Scholar
  17. Gonzalo Navarro. Indexing highly repetitive string collections, part i. ACM Computing Surveys, 54(2), 2021. Google Scholar
  18. Gonzalo Navarro and Alberto Ordóñez. Faster compressed suffix trees for repetitive collections. ACM Journal of Experimental Algorithmics, 21(1):article 1.8, 2016. Google Scholar
  19. Takaaki Nishimoto and Yasuo Tabei. Optimal-time queries on BWT-runs compressed indexes. In Leibniz International Proceedings in Informatics, pages 101:1-101:15, 2021. Google Scholar
  20. Daisuke Okanohara and Kunihiko Sadakane. Practical entropy-compressed rank/select dictionary. In Proceedings of the 9th Workshop on Algorithm Engineering and Experiments, pages 60-70, 2007. Google Scholar
  21. Rajeev Raman, Venkatesh Raman, and Srinivasa Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, pages 233-242, 2002. Google Scholar
  22. Kunihiko Sadakane. Compressed suffix trees with full functionality. Theory of Computing Systems, 41(4):589-607, 2007. Google Scholar
  23. Peter Weiner. Linear pattern matching algorithms. 14th Annual Symposium on Switching and Automata Theory, pages 1-11, 1973. Google Scholar