A New Class of Searchable and Provably Highly Compressible String Transformations

Authors Raffaele Giancarlo , Giovanni Manzini , Giovanna Rosone , Marinella Sciortino



PDF
Thumbnail PDF

File

LIPIcs.CPM.2019.12.pdf
  • Filesize: 437 kB
  • 12 pages

Document Identifiers

Author Details

Raffaele Giancarlo
  • University of Palermo, Dipartimento di Matematica e Informatica, Italy
Giovanni Manzini
  • University of Eastern Piedmont, Alessandria, Italy
  • IIT-CNR, Pisa, Italy
Giovanna Rosone
  • University of Pisa, Dipartimento di Informatica, Italy
Marinella Sciortino
  • University of Palermo, Dipartimento di Matematica e Informatica, Italy

Cite As Get BibTex

Raffaele Giancarlo, Giovanni Manzini, Giovanna Rosone, and Marinella Sciortino. A New Class of Searchable and Provably Highly Compressible String Transformations. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, pp. 12:1-12:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019) https://doi.org/10.4230/LIPIcs.CPM.2019.12

Abstract

The Burrows-Wheeler Transform is a string transformation that plays a fundamental role for the design of self-indexing compressed data structures. Over the years, researchers have successfully extended this transformation outside the domains of strings. However, efforts to find non-trivial alternatives of the original, now 25 years old, Burrows-Wheeler string transformation have met limited success. In this paper we bring new lymph to this area by introducing a whole new family of transformations that have all the "myriad virtues" of the BWT: they can be computed and inverted in linear time, they produce provably highly compressible strings, and they support linear time pattern search directly on the transformed string. This new family is a special case of a more general class of transformations based on context adaptive alphabet orderings, a concept introduced here. This more general class includes also the Alternating BWT, another invertible string transforms recently introduced in connection with a generalization of Lyndon words.

Subject Classification

ACM Subject Classification
  • Theory of computation → Data compression
  • Mathematics of computing → Combinatorial algorithms
Keywords
  • Data Indexing and Compression
  • Burrows-Wheeler Transformation
  • Combinatorics on Words

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. D. Belazzougui and G. Navarro. Optimal Lower and Upper Bounds for Representing Sequences. ACM T. Algorithms, 11(4):31:1-31:21, 2015. Google Scholar
  2. M. Burrows and D. J. Wheeler. A Block Sorting data Compression Algorithm. Technical report, DIGITAL System Research Center, 1994. Google Scholar
  3. B. Chapin and S. Tate. Higher Compression from the Burrows-Wheeler Transform by Modified Sorting. In DCC, page 532. IEEE Computer Society, 1998. Full version available from URL: https://www.uncg.edu/cmp/faculty/srtate/papers/bwtsort.pdf.
  4. A. Cox, M. Bauer, T. Jakobi, and G. Rosone. Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics, 28(11):1415-1419, 2012. Google Scholar
  5. M. Crochemore, J. Désarménien, and D. Perrin. A note on the Burrows-Wheeler transformation. Theor. Comput. Sci., 332:567-572, 2005. Google Scholar
  6. J. S. Culpepper, M. Petri, and S. J. Puglisi. Revisiting bounded context block-sorting transformations. Software Pract. Exper., 42(8):1037-1054, 2012. Google Scholar
  7. J. Daykin, R. Groult, Y. Guesnet, T. Lecroq, A. Lefebvre, M. Léonard, and É. Prieur-Gaston. A survey of string orderings and their application to the Burrows-Wheeler transform. Theor. Comput. Sci., 2017. Google Scholar
  8. P. Ferragina, R. Giancarlo, G. Manzini, and M. Sciortino. Boosting textual compression in optimal linear time. J. ACM, 52(4):688-713, 2005. Google Scholar
  9. P. Ferragina and G. Manzini. Opportunistic data structures with applications. In FOCS 2000, pages 390-398. IEEE Computer Society, 2000. Google Scholar
  10. P. Ferragina and G. Manzini. Indexing compressed text. J. ACM, 52:552-581, 2005. Google Scholar
  11. T. Gagie, G. Manzini, and J. Sirén. Wheeler graphs: A framework for BWT-based data structures. Theor. Comput. Sci., 698:67-78, 2017. Google Scholar
  12. T. Gagie, G. Navarro, and N. Prezza. Optimal-Time Text Indexing in BWT-runs Bounded Space. In SODA, pages 1459-1477. SIAM, 2018. Google Scholar
  13. I. M. Gessel, A. Restivo, and C. Reutenauer. A bijection between words and multisets of necklaces. Eur. J. Combin., 33(7):1537-1546, 2012. Google Scholar
  14. I. M. Gessel and C. Reutenauer. Counting permutations with given cycle structure and descent set. J. Comb. Theory A, 64(2):189-215, 1993. Google Scholar
  15. R. Giancarlo, G. Manzini, A. Restivo, G. Rosone, and M. Sciortino. Block Sorting-Based Transformations on Words: Beyond the Magic BWT. In DLT, volume 11088 of Lecture Notes in Computer Science, pages 1-17. Springer, 2018. Google Scholar
  16. R. Giancarlo, A. Restivo, and M. Sciortino. From first principles to the Burrows and Wheeler transform and beyond, via combinatorial optimization. Theor. Comput. Sci., 387:236-248, 2007. Google Scholar
  17. J. Kärkkäinen, P. Sanders, and S. Burkhardt. Linear work suffix array construction. J. ACM, 53(6):918-936, 2006. Google Scholar
  18. D. Kempa and N. Prezza. At the roots of dictionary compression: string attractors. In STOC, pages 827-840. ACM, 2018. Google Scholar
  19. S. Kreft and G. Navarro. On compressing and indexing repetitive sequences. Theor. Comput. Sci., 483:115-133, 2013. Google Scholar
  20. V. Mäkinen, D. Belazzougui, F. Cunial, and A. Tomescu. Genome-Scale Algorithm Design. Cambridge University Press, 2015. ISBN 978-1-107-07853-6. Google Scholar
  21. V. Mäkinen, G. Navarro, J. Sirén, and Niko Välimäki. Storage and Retrieval of Highly Repetitive Sequence Collections. J. Comput. Biol., 17(3):281-308, 2010. Google Scholar
  22. S. Mantaci, A. Restivo, G. Rosone, and M. Sciortino. Burrows-Wheeler Transform and Run-Length Enconding. In Combinatorics on Words - 11th International Conference, WORDS 2017. Proceedings, volume 10432 of LNCS, pages 228-239. Springer, 2017. Google Scholar
  23. S. Mantaci, A. Restivo, G. Rosone, M. Sciortino, and L. Versari. Measuring the clustering effect of BWT via RLE. Theor. Comput. Sci., 698:79-87, 2017. Google Scholar
  24. G. Manzini. An analysis of the Burrows-Wheeler transform. J. ACM, 48(3):407-430, 2001. Google Scholar
  25. G. Navarro. Compact Data Structures - A practical approach. Cambridge University Press, 2016. ISBN 978-1-107-15238-0. Google Scholar
  26. G. Navarro and V. Mäkinen. Compressed Full-Text Indexes. ACM Comput. Surv., 39(1), 2007. Google Scholar
  27. M. Petri, G. Navarro, J. S. Culpepper, and S. J. Puglisi. Backwards Search in Context Bound Text Transformations. In CCP, pages 82-91. IEEE Computer Society, 2011. Google Scholar
  28. M. Schindler. A Fast Block-Sorting algorithm for Lossless Data Compression. In DCC, page 469. IEEE Computer Society, 1997. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail