A Theoretical and Experimental Analysis of BWT Variants for String Collections

Authors Davide Cenzato , Zsuzsanna Lipták



PDF
Thumbnail PDF

File

LIPIcs.CPM.2022.25.pdf
  • Filesize: 1.84 MB
  • 18 pages

Document Identifiers

Author Details

Davide Cenzato
  • Department of Computer Science, University of Verona, Italy
Zsuzsanna Lipták
  • Department of Computer Science, University of Verona, Italy

Acknowledgements

We would like to thank Massimiliano Rossi who supplied us with some cleaned and filtered datasets.

Cite AsGet BibTex

Davide Cenzato and Zsuzsanna Lipták. A Theoretical and Experimental Analysis of BWT Variants for String Collections. In 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 223, pp. 25:1-25:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.CPM.2022.25

Abstract

The extended Burrows-Wheeler-Transform (eBWT), introduced by Mantaci et al. [Theor. Comput. Sci., 2007], is a generalization of the Burrows-Wheeler-Transform (BWT) to multisets of strings. While the original BWT is based on the lexicographic order, the eBWT uses the omega-order, which differs from the lexicographic order in important ways. A number of tools are available that compute the BWT of string collections; however, the data structures they generate in most cases differ from the one originally defined, as well as from each other. In this paper, we review the differences between these BWT variants, both from a theoretical and from a practical point of view, comparing them on several real-life datasets with different characteristics. We find that the differences can be extensive, depending on the dataset characteristics, and are largest on collections of many highly similar short sequences. The widely-used parameter r, the number of runs of the BWT, also shows notable variation between the different BWT variants; on our datasets, it varied by a multiplicative factor of up to 4.2.

Subject Classification

ACM Subject Classification
  • Theory of computation → Data compression
  • Applied computing → Bioinformatics
Keywords
  • Burrows-Wheeler-Transform
  • extended BWT
  • string collections
  • repetitiveness measures
  • r
  • compression

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Tooru Akagi, Mitsuru Funakoshi, and Shunsuke Inenaga. Sensitivity of string compressors and repetitiveness measures. CoRR, abs/2107.08615, 2021. URL: http://arxiv.org/abs/2107.08615.
  2. Hideo Bannai, Travis Gagie, and Tomohiro I. Refining the r-index. Theor. Comput. Sci., 812:96-108, 2020. URL: https://doi.org/10.1016/j.tcs.2019.08.005.
  3. Markus J. Bauer, Anthony J. Cox, and Giovanna Rosone. Lightweight algorithms for constructing and inverting the BWT of string collections. Theor. Comput. Sci., 483:134-148, 2013. URL: https://doi.org/10.1016/j.tcs.2012.02.002.
  4. Jason W. Bentley, Daniel Gibney, and Sharma V. Thankachan. On the complexity of BWT-runs minimization via alphabet reordering. In Proc. of 28th Annual European Symposium on Algorithms (ESA 2020), volume 173 of LIPIcs, pages 15:1-15:13, 2020. URL: https://doi.org/10.4230/LIPIcs.ESA.2020.15.
  5. Paola Bonizzoni, Gianluca Della Vedova, Yuri Pirola, Marco Previtali, and Raffaella Rizzi. Multithread multistring Burrows-Wheeler Transform and Longest Common Prefix array. J. Comput. Biol., 26(9):948-961, 2019. URL: https://doi.org/10.1089/cmb.2018.0230.
  6. Christina Boucher, Davide Cenzato, Zsuzsanna Lipták, Massimiliano Rossi, and Marinella Sciortino. Computing the original eBWT faster, simpler, and with less memory. In Proc. of 28th International Symposium on String Processing and Information Retrieval (SPIRE 2021), volume 12944 of LNCS, pages 129-142, 2021. URL: https://doi.org/10.1007/978-3-030-86692-1_11.
  7. Christina Boucher, Ondrej Cvacho, Travis Gagie, Jan Holub, Giovanni Manzini, Gonzalo Navarro, and Massimiliano Rossi. PFP compressed suffix trees. In Proc. of 23rd Symposium on Algorithm Engineering and Experiments (ALENEX 2021), pages 60-72. SIAM, 2021. URL: https://doi.org/10.1137/1.9781611976472.5.
  8. Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, and Taher Mun. Prefix-free parsing for building big BWTs. Algorithms Mol. Biol., 14(1):13:1-13:15, 2019. URL: https://doi.org/10.1186/s13015-019-0148-5.
  9. Michael Burrows and David J. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994. Google Scholar
  10. Bastien Cazaux and Eric Rivals. Linking BWT and XBW via Aho-Corasick automaton: Applications to run-length encoding. In Proc. of 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019), volume 128 of LIPIcs, pages 24:1-24:20, 2019. URL: https://doi.org/10.4230/LIPIcs.CPM.2019.24.
  11. Shubham Chandak, Kedar Tatwawadi, Idoia Ochoa, Mikel Hernaez, and Tsachy Weissman. SPRING: a next-generation compressor for FASTQ data. Bioinform., 35(15):2674-2676, 2019. URL: https://doi.org/10.1093/bioinformatics/bty1015.
  12. Dustin Cobas, Travis Gagie, and Gonzalo Navarro. A fast and small subsampled r-index. In Proc. of 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021), volume 191 of LIPIcs, pages 13:1-13:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. URL: https://doi.org/10.4230/LIPIcs.CPM.2021.13.
  13. Anthony J. Cox, Markus J. Bauer, Tobias Jakobi, and Giovanna Rosone. Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinform., 28(11):1415-1419, 2012. URL: https://doi.org/10.1093/bioinformatics/bts173.
  14. Diego Díaz-Domínguez and Gonzalo Navarro. Efficient construction of the extended BWT from grammar-compressed DNA sequencing reads. CoRR, abs/2102.03961, 2021. URL: http://arxiv.org/abs/2102.03961.
  15. Robert C Edgar. Updating the 97% identity threshold for 16S ribosomal RNA OTUs. Bioinf., 34(14):2371-2375, 2018. URL: https://doi.org/10.1093/bioinformatics/bty113.
  16. Lavinia Egidi, Felipe A. Louza, Giovanni Manzini, and Guilherme P. Telles. External memory BWT and LCP computation for sequence collections with applications. Algorithms Mol. Biol., 14(1):6:1-6:15, 2019. URL: https://doi.org/10.1186/s13015-019-0140-0.
  17. Paolo Ferragina, Travis Gagie, and Giovanni Manzini. Lightweight data indexing and compression in external memory. Algorithmica, 63(3):707-730, 2012. URL: https://doi.org/10.1007/s00453-011-9535-0.
  18. Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S. Muthukrishnan. Structuring labeled trees for optimal succinctness, and beyond. In Proc. of 46th IEEE Symposium on Foundations of Computer Science (FOCS 2005), pages 184-193, 2005. URL: https://doi.org/10.1109/SFCS.2005.69.
  19. Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S. Muthukrishnan. Compressing and indexing labeled trees, with applications. J. ACM, 57(1):4:1-4:33, 2009. URL: https://doi.org/10.1145/1613676.1613680.
  20. Johannes Fischer and Florian Kurpicz. sais-lite-lcp. https://github.com/kurpicz/sais-lite-lcp. Accessed: 2022-02-05.
  21. Travis Gagie, Garance Gourdel, and Giovanni Manzini. Compressing and indexing aligned readsets. In Proc. of 21st International Workshop on Algorithms in Bioinformatics (WABI 2021), volume 201 of LIPIcs, pages 13:1-13:21, 2021. URL: https://doi.org/10.4230/LIPIcs.WABI.2021.13.
  22. Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Optimal-time text indexing in BWT-runs bounded space. In Proc. of 39th ACM-SIAM Symposium on Discrete Algorithms (SODA 2018), pages 1459-1477, 2018. URL: https://doi.org/10.1137/1.9781611975031.96.
  23. Joseph Yossi Gil and David Allen Scott. A bijective string sorting transform. CoRR, abs/1201.3077, 2012. URL: http://arxiv.org/abs/1201.3077.
  24. Sara Giuliani, Shunsuke Inenaga, Zsuzsanna Lipták, Nicola Prezza, Marinella Sciortino, and Anna Toffanello. Novel results on the number of runs of the Burrows-Wheeler-Transform. In Proc. of 47th International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM 2021), volume 12607 of LNCS, pages 249-262, 2021. URL: https://doi.org/10.1007/978-3-030-67731-2_18.
  25. Allison J. Greaney et al. A SARS-CoV-2 variant elicits an antibody response with a shifted immunodominance hierarchy. PLOS Pathogens, 18:1-27, February 2022. URL: https://doi.org/10.1101/2021.10.12.464114.
  26. Ilya Grebnov. libsais. https://github.com/IlyaGrebnov/libsais. Accessed: 2022-02-05.
  27. Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997. Google Scholar
  28. James Holt and Leonard McMillan. Merging of multi-string BWTs with applications. Bioinform., 30(24):3524-3531, 2014. URL: https://doi.org/10.1093/bioinformatics/btu584.
  29. Dominik Kempa and Tomasz Kociumaka. Resolution of the Burrows-Wheeler Transform conjecture. In Proc. of 61st IEEE Annual Symposium on Foundations of Computer Science (FOCS 2020), pages 1002-1013, 2020. URL: https://doi.org/10.1109/FOCS46700.2020.00097.
  30. Dominik Köppl, Daiki Hashimoto, Diptarama Hendrian, and Ayumi Shinohara. In-place Bijective Burrows-Wheeler Transforms. In Proc. of 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020), volume 161 of LIPIcs, pages 21:1-21:15, 2020. URL: https://doi.org/10.4230/LIPIcs.CPM.2020.21.
  31. Gregory Kucherov, Lilla Tóthmérész, and Stéphane Vialette. On the combinatorics of suffix arrays. Inf Process Lett, 113(22-24):915-920, 2013. URL: https://doi.org/10.1016/j.ipl.2013.09.009.
  32. Alan Kuhnle, Taher Mun, Christina Boucher, Travis Gagie, Ben Langmead, and Giovanni Manzini. Efficient construction of a complete index for pan-genomics read alignment. In Proc. of 23rd Annual Conference in Computational Molecular Biology (RECOMB 2019), volume 11467 of LNCS, pages 158-173, 2019. URL: https://doi.org/10.1089/cmb.2019.0309.
  33. Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4):357-359, 2012. URL: https://doi.org/10.1038/nmeth.1923.
  34. Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10:R25, 2009. URL: https://doi.org/10.1186/gb-2009-10-3-r25.
  35. Heng Li. Fast construction of FM-index for long sequence reads. Bioinform., 30(22):3274-3275, 2014. URL: https://doi.org/10.1093/bioinformatics/btu541.
  36. Heng Li and Richard Durbin. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26(5):589-595, 2010. URL: https://doi.org/10.1093/bioinformatics/btp698.
  37. Felipe A. Louza, Guilherme P. Telles, Simon Gog, Nicola Prezza, and Giovanna Rosone. gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections. Algorithms Mol. Biol., 15(1):18, 2020. URL: https://doi.org/10.1186/s13015-020-00177-y.
  38. Felipe A. Louza, Guilherme P. Telles, Steve Hoffmann, and Cristina Dutra de Aguiar Ciferri. Generalized enhanced suffix array construction in external memory. Algorithms Mol. Biol., 12(1):26:1-26:16, 2017. URL: https://doi.org/10.1186/s13015-017-0117-9.
  39. Swapan Mallick et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature, 538(7624):201-206, 2016. URL: https://doi.org/10.1038/nature18964.
  40. Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino. An extension of the Burrows-Wheeler Transform. Theor. Comput. Sci., 387(3):298-312, 2007. URL: https://doi.org/10.1016/j.tcs.2007.07.014.
  41. Giovanni Manzini. XBWT tricks. In Proc. of 23rd International Symposium on String Processing and Information Retrieval (SPIRE 2016), volume 9954 of LNCS, pages 80-92, 2016. URL: https://doi.org/10.1007/978-3-319-46049-9_8.
  42. Yuta Mori. libdivsufsort. https://github.com/y-256/libdivsufsort. Accessed: 2022-02-05.
  43. Gonzalo Navarro. Indexing highly repetitive string collections, part I: repetitiveness measures. ACM Comput. Surv., 54(2):29:1-29:31, 2021. URL: https://doi.org/10.1145/3434399.
  44. Genome 10K Community of Scientists. A proposal to obtain whole-genome sequence for 10,000 vertebrate species. J Hered., 100:659-674, 2009. URL: https://doi.org/10.1093/jhered/esp086.
  45. Enno Ohlebusch. Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch Verlag, 2013. Google Scholar
  46. Enno Ohlebusch, Stefan Stauß, and Uwe Baier. Trickier XBWT tricks. In Proc. of 25th International Symposium in String Processing and Information Retrieval (SPIRE 2018), volume 11147 of LNCS, pages 325-333, 2018. URL: https://doi.org/10.1007/978-3-030-00479-8_26.
  47. Marco Oliva, Massimiliano Rossi, Jouni Sirén, Giovanni Manzini, Tamer Kahveci, Travis Gagie, and Christina Boucher. Efficiently merging r-indexes. In Proc. of 31st Data Compression Conference (DCC 2021), pages 203-212, 2021. URL: https://doi.org/10.1109/DCC50243.2021.00028.
  48. Jacopo Pantaleoni. BWT of large string sets. CoRR, abs/1410.0562, 2014. URL: http://arxiv.org/abs/1410.0562.
  49. Simon J. Puglisi and Bella Zhukova. Document retrieval hacks. In Proc. of 19th International Symposium on Experimental Algorithms (SEA 2021), volume 190 of LIPIcs, pages 12:1-12:12, 2021. URL: https://doi.org/10.4230/LIPIcs.SEA.2021.12.
  50. Jouni Sirén. Burrows-Wheeler Transform for terabases. In Proc. of 26th Data Compression Conference (DCC 2016), pages 211-220, 2016. URL: https://doi.org/10.1109/DCC.2016.17.
  51. Tyler N. Starr et al. Deep mutational scanning of SARS-CoV-2 receptor binding domain reveals constraints on folding and ACE2 binding. Cell, 182(5):1295-1310.e20, 2020. URL: https://doi.org/10.1016/j.cell.2020.08.012.
  52. C. Sun et al. RPAN: rice pan-genome browser for 3000 rice genomes. Nucleic Acids Res, 45(2):597-605, 2017. URL: https://doi.org/10.1093/nar/gkw958.
  53. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526:68-74, 2015. URL: https://doi.org/10.1038/nature15393.
  54. The 1001 Genomes Consortium. Epigenomic Diversity in a Global Collection of Arabidopsis thaliana Accessions. Cell, 166(2):492-505, 2016. URL: https://doi.org/10.1016/j.cell.2016.06.044.
  55. C. Turnbull et al. The 100,000 genomes project: bringing whole genome sequencing to the NHS. Br Med J, 361, 2018. URL: https://doi.org/10.1136/bmj.k1687.
  56. Silvie Van den Hoecke, Judith Verhelst, Marnik Vuylsteke, and Xavier Saelens. Analysis of the genetic diversity of influenza A viruses using next-generation DNA sequencing. BMC Genomics, 16(1):79, 2015. URL: https://doi.org/10.1186/s12864-015-1284-z.
  57. Raf Winand et al. Targeting the 16s rRNA gene for bacterial identification in complex mixed samples: Comparative evaluation of second (Illumina) and third (Oxford nanopore technologies) generation sequencing technologies. Int. J. of Mol. Sci., 21(1):298, 2019. URL: https://doi.org/10.3390/ijms21010298.
  58. Michael H. Woodworth et al. Sentinel case of Candida auris in the Western United States Following Prolonged Occult Colonization in a Returned Traveler from India. Microb Drug Resist, 25(5):677-680, 2019. URL: https://doi.org/10.1089/mdr.2018.0408.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail