Online LZ77 Parsing and Matching Statistics with RLBWTs

Authors: Hideo Bannai, Travis Gagie, Tomohiro I



File

LIPIcs.CPM.2018.7.pdf
  • Filesize: 0.5 MB
  • 12 pages

Author Details

Hideo Bannai
  • Department of Informatics, Kyushu University, Japan; RIKEN Center for Advanced Intelligence Project, Japan
Travis Gagie
  • Diego Portales University and CeBiB, Chile
Tomohiro I
  • Frontier Research Academy for Young Researchers, Kyushu Institute of Technology, Japan

Cite As

Hideo Bannai, Travis Gagie, and Tomohiro I. Online LZ77 Parsing and Matching Statistics with RLBWTs. In 29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 105, pp. 7:1-7:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)
https://doi.org/10.4230/LIPIcs.CPM.2018.7

Abstract

Lempel-Ziv 1977 (LZ77) parsing, matching statistics and the Burrows-Wheeler Transform (BWT) are all fundamental elements of stringology. In a series of recent papers, Policriti and Prezza (DCC 2016, CPM 2017 and Algorithmica 2017) showed how we can use an augmented run-length compressed BWT (RLBWT) of the reverse T^R of a text T to compute offline the LZ77 parse of T in O(n log r) time and O(r) space, where n is the length of T and r is the number of runs in the BWT of T^R. In this paper we first extend a well-known technique for updating an unaugmented RLBWT when a character is prepended to a text, so that it works with Policriti and Prezza's augmented RLBWT. This immediately implies that we can build the LZ77 parse of T online while still using O(n log r) time and O(r) space; it also seems likely to be of independent interest. Our experiments, using an extension of Ohno, Takabatake, I and Sakamoto's (IWOCA 2017) implementation of updating, show our approach is both time- and space-efficient for repetitive strings. We then show how to augment the RLBWT further (albeit making it static again and increasing its space by a factor proportional to the size of the alphabet) such that later, given another string S and O(log log n)-time random access to T, we can compute the matching statistics of S with respect to T in O(|S| log log n) time.
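
For readers unfamiliar with the three objects named in the abstract, the following is a minimal sketch of their definitions only. It is not the paper's algorithm: it uses naive quadratic-or-worse loops and none of the RLBWT machinery. The function names, the `$` terminator, and the particular LZ77 variant (each phrase is either a fresh single character or the longest prefix of the remaining text that also starts at an earlier position, overlaps allowed) are assumptions made for illustration; other LZ77 variants exist and the paper may use a different one.

```python
def bwt_runs(text):
    """Return (bwt, r): the BWT of text + '$' and its number of equal-letter runs.

    Assumption: '$' does not occur in text and sorts before every other character.
    """
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    bwt = "".join(rot[-1] for rot in rotations)
    # A new run starts at position 0 and wherever the character changes.
    r = sum(1 for i, c in enumerate(bwt) if i == 0 or c != bwt[i - 1])
    return bwt, r


def lz77_parse(text):
    """Greedy LZ77-style parse (illustrative variant, see the lead-in)."""
    phrases = []
    i, n = 0, len(text)
    while i < n:
        longest = 0
        # Try every earlier starting position j < i; the copy may overlap i.
        for j in range(i):
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            longest = max(longest, length)
        longest = max(longest, 1)  # no earlier occurrence: emit one fresh character
        phrases.append(text[i:i + longest])
        i += longest
    return phrases


def matching_statistics(s, t):
    """ms[i] = length of the longest prefix of s[i:] that occurs somewhere in t."""
    ms = []
    for i in range(len(s)):
        length = 0
        while i + length < len(s) and s[i:i + length + 1] in t:
            length += 1
        ms.append(length)
    return ms
```

Under these assumptions, `bwt_runs(T[::-1])` gives the quantity r from the abstract for a text `T`, since r is defined there on the BWT of the reverse T^R.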

Subject Classification

ACM Subject Classification
  • Theory of computation → Data compression
Keywords
  • Lempel-Ziv 1977
  • Matching Statistics
  • Run-Length Compressed Burrows-Wheeler Transform

References

  1. Djamal Belazzougui and Fabio Cunial. Fast matching statistics in small space. In Proceedings of the 17th Symposium on Experimental Algorithms (SEA), pages 179-190, 2014.
  2. Djamal Belazzougui and Fabio Cunial. Indexed matching statistics and shortest unique substrings. In Proceedings of the 21st Symposium on String Processing and Information Retrieval (SPIRE), pages 179-190, 2014.
  3. Djamal Belazzougui and Fabio Cunial. Representing the suffix tree with the CDAWG. In Proceedings of the 28th Symposium on Combinatorial Pattern Matching (CPM), pages 7:1-7:13, 2017.
  4. Michael Burrows and David J. Wheeler. A block sorting lossless compression algorithm. Technical Report 124, DEC, 1994.
  5. Anthony J. Cox, Andrea Farruggia, Travis Gagie, Simon J. Puglisi, and Jouni Sirén. RLZAP: relative Lempel-Ziv with adaptive pointers. In Proceedings of the 23rd Symposium on String Processing and Information Retrieval (SPIRE), pages 1-14, 2016.
  6. Richard Durbin. Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT). Bioinformatics, 30(9):1266-1272, 2014.
  7. Dynamic: dynamic succinct/compressed data structures library. URL: https://github.com/xxsds/DYNAMIC.
  8. Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM (JACM), 52(4):552-581, 2005.
  9. Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Optimal-time text indexing in BWT-runs bounded space. Technical Report 1705.10382, arXiv.org, 2017.
  10. Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Optimal-time text indexing in BWT-runs bounded space. In Proceedings of the 19th Symposium on Discrete Algorithms (SODA), pages 1459-1477, 2018.
  11. get-git-revisions: Get all revisions of a git repository. URL: https://github.com/nicolaprezza/get-git-revisions.
  12. Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Lightweight Lempel-Ziv parsing. In Proceedings of the 13th Symposium on Experimental Algorithms (SEA), pages 139-150, 2013.
  13. Shanika Kuruppu, Simon J. Puglisi, and Justin Zobel. Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval. In Proceedings of the 17th Symposium on String Processing and Information Retrieval (SPIRE), pages 201-206, 2010.
  14. Markus Lohrey. Algorithmics on SLP-compressed strings: A survey. Groups Complexity Cryptology, 4(2):241-299, 2012.
  15. Lzscan. URL: https://www.cs.helsinki.fi/group/pads/.
  16. Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I Tomescu. Genome-scale algorithm design. Cambridge University Press, 2015.
  17. Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology, 17(3):281-308, 2010.
  18. Gonzalo Navarro and Alberto Ordóñez Pereira. Faster compressed suffix trees for repetitive collections. ACM Journal of Experimental Algorithmics, 21(1):1.8:1-1.8:38, 2016.
  19. Enno Ohlebusch. Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch Verlag, 2013.
  20. Tatsuya Ohno, Yoshimasa Takabatake, Tomohiro I, and Hiroshi Sakamoto. A faster implementation of online run-length Burrows-Wheeler transform. In Proceedings of the 28th International Workshop on Combinatorial Algorithms (IWOCA), 2017. To appear.
  21. Online rlbwt. URL: https://github.com/itomomoti/OnlineRlbwt.
  22. Alberto Policriti and Nicola Prezza. Fast online Lempel-Ziv factorization in compressed space. In Proceedings of the 22nd Symposium on String Processing and Information Retrieval (SPIRE), pages 13-20, 2015.
  23. Alberto Policriti and Nicola Prezza. Computing LZ77 in run-compressed space. In Proceedings of the Data Compression Conference (DCC), pages 23-32, 2016.
  24. Alberto Policriti and Nicola Prezza. From LZ77 to the run-length encoded Burrows-Wheeler transform, and back. In Proceedings of the 28th Symposium on Combinatorial Pattern Matching (CPM), pages 17:1-17:10, 2017.
  25. Alberto Policriti and Nicola Prezza. LZ77 computation based on the run-length encoded BWT. Algorithmica, Jul 2017.
  26. Daniel Valenzuela and Veli Mäkinen. CHIC: a short read aligner for pan-genomic references. bioRxiv, 2017.