Document Open Access Logo

Space-Efficient Computation of the LCP Array from the Burrows-Wheeler Transform

Authors Nicola Prezza , Giovanna Rosone



PDF
Thumbnail PDF

File

LIPIcs.CPM.2019.7.pdf
  • Filesize: 0.58 MB
  • 18 pages

Document Identifiers

Author Details

Nicola Prezza
  • Department of Computer Science, University of Pisa, Italy
Giovanna Rosone
  • Department of Computer Science, University of Pisa, Italy

Cite AsGet BibTex

Nicola Prezza and Giovanna Rosone. Space-Efficient Computation of the LCP Array from the Burrows-Wheeler Transform. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, pp. 7:1-7:18, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2019)
https://doi.org/10.4230/LIPIcs.CPM.2019.7

Abstract

We show that the Longest Common Prefix Array of a text collection of total size n on alphabet [1,sigma] can be computed from the Burrows-Wheeler transformed collection in O(n log sigma) time using o(n log sigma) bits of working space on top of the input and output. Our result improves (on small alphabets) and generalizes (to string collections) the previous solution from Beller et al., which required O(n) bits of extra working space. We also show how to merge the BWTs of two collections of total size n within the same time and space bounds. The procedure at the core of our algorithms can be used to enumerate suffix tree intervals in succinct space from the BWT, which is of independent interest. An engineered implementation of our first algorithm on DNA alphabet induces the LCP of a large (16 GiB) collection of short (100 bases) reads at a rate of 2.92 megabases per second using in total 1.5 Bytes per base in RAM. Our second algorithm merges the BWTs of two short-reads collections of 8 GiB each at a rate of 1.7 megabases per second and uses 0.625 Bytes per base in RAM. An extension of this algorithm that computes also the LCP array of the merged collection processes the data at a rate of 1.48 megabases per second and uses 1.625 Bytes per base in RAM.

Subject Classification

ACM Subject Classification
  • Theory of computation → Data structures design and analysis
  • Theory of computation → Data compression
Keywords
  • Burrows-Wheeler Transform
  • LCP array
  • DNA reads

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. M.J. Bauer, A.J. Cox, and G. Rosone. Lightweight algorithms for constructing and inverting the BWT of string collections. Theor. Comput. Sci., 483(0):134-148, 2013. Google Scholar
  2. D. Belazzougui. Linear time construction of compressed text indices in compact space. In STOC, pages 148-193. ACM, 2014. Google Scholar
  3. D. Belazzougui, F. Cunial, J. Kärkkäinen, and V. Mäkinen. Linear-time string indexing and analysis in small space. arXiv preprint arXiv:1609.06378, 2016. Google Scholar
  4. D. Belazzougui and G. Navarro. Alphabet-independent compressed text indexing. TALG, 10(4):23, 2014. Google Scholar
  5. T. Beller, S. Gog, E. Ohlebusch, and T. Schnattinger. Computing the longest common prefix array based on the Burrows-Wheeler transform. J. Discrete Algorithms, 18:22-31, 2013. Google Scholar
  6. P. Bonizzoni, G. Della Vedova, S. Nicosia, Y. Pirola, M. Previtali, and R. Rizzi. Divide and Conquer Computation of the Multi-string BWT and LCP Array. In CiE, LNCS, pages 107-117. Springer, 2018. Google Scholar
  7. M. Burrows and D.J. Wheeler. A Block Sorting data Compression Algorithm. Technical report, DEC Systems Research Center, 1994. Google Scholar
  8. F. Claude, G. Navarro, and A. Ordónez. The wavelet matrix: An efficient wavelet tree for large alphabets. Information Systems, 47:15-32, 2015. Google Scholar
  9. A.J. Cox, F. Garofalo, G. Rosone, and M. Sciortino. Lightweight LCP construction for very large collections of strings. J. Discrete Algorithms, 37:17-33, 2016. Google Scholar
  10. L. Egidi, F.A. Louza, G. Manzini, and G.P. Telles. External memory BWT and LCP computation for sequence collections with applications. Algorithms Mol. Biol., 14(1):6, 2019. Google Scholar
  11. L. Egidi and G. Manzini. Lightweight BWT and LCP merging via the Gap algorithm. In SPIRE, LNCS, pages 176-190. Springer, 2017. Google Scholar
  12. P. Ferragina and G. Manzini. Opportunistic data structures with applications. In FOCS, pages 390-398. IEEE, 2000. Google Scholar
  13. J. Holt and L. McMillan. Constructing Burrows-Wheeler transforms of large string collections via merging. In ACM-BCB, pages 464-471. ACM, 2014. Google Scholar
  14. J. Holt and L. McMillan. Merging of multi-string BWTs with applications. Bioinformatics, 30(24):3524-3531, 2014. Google Scholar
  15. F.A. Louza, G.P. Telles, S. Hoffmann, and C.D.A. Ciferri. Generalized enhanced suffix array construction in external memory. Algorithms Mol. Biol., 12(1):26, 2017. Google Scholar
  16. S. Mantaci, A. Restivo, G. Rosone, and M. Sciortino. An extension of the Burrows-Wheeler Transform. Theor. Comput. Sci., 387(3):298-312, 2007. Google Scholar
  17. J.I. Munro, G. Navarro, and Y. Nekrich. Space-efficient construction of compressed indexes in deterministic linear time. In SODA, pages 408-424. SIAM, 2017. Google Scholar
  18. Gonzalo N. Wavelet trees for all. J. Discrete Algorithms, 25:2-20, 2014. Google Scholar
  19. N. Prezza, N. Pisanti, M. Sciortino, and G. Rosone. Detecting Mutations by eBWT. In WABI 2018, volume 113 of LIPIcs, pages 3:1-3:15, 2018. Google Scholar
  20. N. Prezza, N. Pisanti, M. Sciortino, and G. Rosone. SNPs detection by eBWT positional clustering. Algorithms Mol. Biol., 14(1):3, 2019. Google Scholar
  21. F. Shi. Suffix Arrays for Multiple Strings: A Method for On-Line Multiple String Searches. In ASIAN, volume 1179 of LNCS, pages 11-22. Springer, 1996. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail