Linking BWT and XBW via Aho-Corasick Automaton: Applications to Run-Length Encoding

Authors: Bastien Cazaux, Eric Rivals



File

LIPIcs.CPM.2019.24.pdf
  • Filesize: 0.66 MB
  • 20 pages

Author Details

Bastien Cazaux
  • Department of Computer Science, University of Helsinki, Finland
  • L.I.R.M.M., CNRS, Université Montpellier, France
Eric Rivals
  • L.I.R.M.M., CNRS, Université Montpellier, France

Cite As

Bastien Cazaux and Eric Rivals. Linking BWT and XBW via Aho-Corasick Automaton: Applications to Run-Length Encoding. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, pp. 24:1-24:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)
https://doi.org/10.4230/LIPIcs.CPM.2019.24

Abstract

The boom of genomic sequencing makes compression of sets of sequences inescapable. This underlies the need for multi-string indexing data structures that help compress the data. The most prominent example of such data structures is the Burrows-Wheeler Transform (BWT), a reversible permutation of a text that improves its compressibility. A similar data structure, the eXtended Burrows-Wheeler Transform (XBW), can index a tree labelled with alphabet symbols. A link between a multi-string BWT and the Aho-Corasick automaton has already been found, and it led to a way to build an XBW from a multi-string BWT. We exhibit a stronger link between a multi-string BWT and an XBW by using the order of the concatenation in the multi-string. This bijective link has several applications: first, it allows one to build one data structure from the other; second, it enables one to compute an ordering of the input strings that optimises a Run-Length measure (i.e., the compressibility) of the BWT or of the XBW.
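To make the role of the concatenation order concrete, here is a minimal Python sketch. It is not the construction from the paper: it assumes every input string is terminated by the same `$` symbol (smaller than all letters), builds the BWT naively by sorting rotations, and searches all orderings by brute force, which is only feasible for a handful of strings. The helper names `bwt`, `count_runs`, `concat`, and `best_order` are hypothetical and chosen for this example only.

```python
from itertools import permutations


def bwt(text: str) -> str:
    """Naive BWT: sort all rotations of text and read off the last column.

    Assumption: text ends with '$', which compares smaller than letters.
    This takes O(n^2 log n) time and is meant for tiny inputs only.
    """
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal symbols (the run-length measure)."""
    if not s:
        return 0
    return 1 + sum(1 for a, b in zip(s, s[1:]) if a != b)


def concat(strings, sep: str = "$") -> str:
    """Concatenate the strings in the given order, each followed by sep."""
    return "".join(s + sep for s in strings)


def best_order(strings):
    """Brute-force search for an ordering whose BWT has the fewest runs.

    Illustration only: it tries all k! orderings of the k input strings.
    """
    return min(
        permutations(strings),
        key=lambda order: count_runs(bwt(concat(order))),
    )


if __name__ == "__main__":
    reads = ["ab", "ba", "aa"]
    for order in permutations(reads):
        transformed = bwt(concat(order))
        print(order, transformed, count_runs(transformed))
    print("one ordering with the fewest runs:", best_order(reads))
```

Running the script prints the BWT and its run count for each ordering of the toy input, so the effect of the ordering on the run-length measure can be inspected directly.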

Subject Classification

ACM Subject Classification
  • Mathematics of computing → Discrete mathematics
  • Theory of computation → Randomness, geometry and discrete structures
  • Theory of computation → Data structures and algorithms for data management
Keywords
  • Data Structure
  • Algorithm
  • Aho-Corasick Tree
  • compression
  • RLE
