On Undetected Redundancy in the Burrows-Wheeler Transform

Author Uwe Baier



PDF
Thumbnail PDF

File

LIPIcs.CPM.2018.3.pdf
  • Filesize: 0.61 MB
  • 15 pages

Document Identifiers

Author Details

Uwe Baier
  • Institute of Theoretical Computer Science, Ulm University, D-89069 Ulm, Germany

Cite AsGet BibTex

Uwe Baier. On Undetected Redundancy in the Burrows-Wheeler Transform. In 29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 105, pp. 3:1-3:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)
https://doi.org/10.4230/LIPIcs.CPM.2018.3

Abstract

The Burrows-Wheeler-Transform (BWT) is an invertible permutation of a text known to be highly compressible but also useful for sequence analysis, what makes the BWT highly attractive for lossless data compression. In this paper, we present a new technique to reduce the size of a BWT using its combinatorial properties, while keeping it invertible. The technique can be applied to any BWT-based compressor, and, as experiments show, is able to reduce the encoding size by 8-16 % on average and up to 33-57 % in the best cases (depending on the BWT-compressor used), making BWT-based compressors competitive or even superior to today's best lossless compressors.

Subject Classification

ACM Subject Classification
  • Theory of computation → Data compression
  • Applied computing → Document analysis
  • Mathematics of computing → Coding theory
Keywords
  • Lossless data compression
  • BWT
  • Tunneling

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Jürgen Abel. Post BWT Stages of the Burrows-Wheeler Compression Algorithm. Software - Practice and Experiences, 40(9):751-777, 2010. Google Scholar
  2. Uwe Baier. Tunneled BWT Implementation and Benchmark. https://github.com/waYne1337/tbwt. last visited January 2018.
  3. Michael Burrows and David J Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994. Google Scholar
  4. Sebastian Deorowicz. Silesia Corpus. http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia. last visited January 2018.
  5. Patrick Dinklage, Johannes Fischer, Dominik Köppl, Marvin Löbel, and Kunihiko Sadakane. Compression with the tudocomp Framework. In 16th International Symposium on Experimental Algorithms, SEA '17, pages 13:1-13:22, 2017. Google Scholar
  6. Peter M. Fenwick. Burrows-Wheeler compression: Principles and reflections. Theoretical Computer Science, 387(3):200-219, 2007. Google Scholar
  7. Paolo Ferragina, Raffaele Giancarlo, and Giovanni Manzini. The Engineering of a Compression Boosting Library: Theory vs Practice in BWT Compression. In Algorithms - ESA 2006: 14th Annual European Symposium, Zurich, Switzerland, September 11-13, 2006. Proceedings, ESA '06, pages 756-767, 2006. Google Scholar
  8. Paolo Ferragina and Giovanni Manzini. Indexing Compressed Text. Journal of the ACM, 52(4):552-581, 2005. Google Scholar
  9. Paolo Ferragina and Gonzalo Navarro. Pizza &Chili Corpus. http://pizzachili.dcc.uchile.cl/texts.html. last visited January 2018.
  10. Paolo Ferragina and Gonzalo Navarro. Repetitive Corpus. http://pizzachili.dcc.uchile.cl/repcorpus.html. last visited January 2018.
  11. Luca Foschini, Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. When Indexing Equals Compression: Experiments with Compressing Suffix Arrays and Applications. ACM Transactions on Algorithms, 2(4):611-639, 2006. Google Scholar
  12. Jean-Loup Gailly and Mark Adler. gzip File Compressor. http://www.gzip.org/. last visited January 2018.
  13. Simon Gog. sdsl-lite Library. https://github.com/simongog/sdsl-lite. last visited January 2018.
  14. Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order Entropy-compressed Text Indexes. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '03, pages 841-850, 2003. Google Scholar
  15. David A. Huffman. A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the IRE, 40(9):1098-1101, 1952. Google Scholar
  16. Juha Kärkkainen, Dominik Kempa, and Simon J. Puglisi. Slashing the Time for BWT Inversion. In Proceedings of the 2012 Data Compression Conference, DCC '12, pages 99-108, 2012. Google Scholar
  17. Juha Kärkkainen, Dominik Kempa, and Simon J. Puglisi. Hybrid Compression of Bitvectors for the FM-Index. In Proceedings of the 2014 Data Compression Conference, DCC '14, pages 302-311, 2014. Google Scholar
  18. Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, and Kunsoo Park. Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications. In Proceedings of the 12th Annual Conference on Combinatorial Pattern Matching, CPM '01, pages 181-192, 2001. Google Scholar
  19. Matt Mahoney. Large Text Compression Benchmark. http://mattmahoney.net/dc/text.html. last visited January 2018.
  20. Matt Mahoney. zpaq File Compressor. http://mattmahoney.net/dc/zpaq.html. last visited January 2018.
  21. Veli Mäkinen. Compact Suffix Array. In Proceedings of the 11th Annual Conference on Combinatorial Pattern Matching, CPM '00, pages 305-319, 2000. Google Scholar
  22. Udi Manber and Gene Myers. Suffix Arrays: A New Method for On-Line String Searches. SIAM Journal on Computing, 5:935-948, 1993. Google Scholar
  23. Markus Mauer, Timo Beller, and Enno Ohlebusch. A Lempel-Ziv-style Compression Method for Repetitive Texts. In Proceedings of the Prague Stringology Conference 2017, PSC '17, pages 96-107, 2017. Google Scholar
  24. Alistair Moffat and R. Yugo Kartono Isal. Word-based Text Compression Using the Burrows-Wheeler Transform. Information Processing and Management, 41:1175-1192, 2005. Google Scholar
  25. Ilya Muravyov. bcm File Compressor. https://github.com/encode84/bcm. last visited January 2018.
  26. Gonzalo Navarro and Veli Mäkinen. Compressed Full-text Indexes. ACM Computing Surveys, 39(1), 2007. Google Scholar
  27. Igor Pavlov. 7zip File Compressor. http://www.7-zip.org/. last visited January 2018.
  28. Jorma J. Rissanen and Glen G. Langdon. Arithmetic coding. IBM Journal of Research and Development, 23:149-162, 1979. Google Scholar
  29. B. Ya Ryabko. Data compression by means of a "book stack". Problems of Information Transmission, 16:265-269, 1980. Google Scholar
  30. Julian Seward. bzip2 File Compressor. http://bzip.org/. last visited January 2018.
  31. Claude E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379-423, 1948. Google Scholar
  32. Tukaani. xz File Compressor. https://tukaani.org/xz/. last visited January 2018.
  33. Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23:337-343, 1977. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail