Space-Efficient Computation of the LCP Array from the Burrows-Wheeler Transform

Prezza, Nicola; Rosone, Giovanna

doi:10.4230/LIPIcs.CPM.2019.7

Abstract

We show that the Longest Common Prefix (LCP) array of a text collection of total size n on alphabet [1,sigma] can be computed from the Burrows-Wheeler transformed collection in O(n log sigma) time using o(n log sigma) bits of working space on top of the input and output. Our result improves (on small alphabets) and generalizes (to string collections) the previous solution from Beller et al., which required O(n) bits of extra working space. We also show how to merge the BWTs of two collections of total size n within the same time and space bounds. The procedure at the core of our algorithms can be used to enumerate suffix tree intervals in succinct space from the BWT, which is of independent interest. An engineered implementation of our first algorithm on the DNA alphabet induces the LCP array of a large (16 GiB) collection of short (100 bases) reads at a rate of 2.92 megabases per second using a total of 1.5 bytes per base in RAM. Our second algorithm merges the BWTs of two short-read collections of 8 GiB each at a rate of 1.7 megabases per second and uses 0.625 bytes per base in RAM. An extension of this algorithm that also computes the LCP array of the merged collection processes the data at a rate of 1.48 megabases per second and uses 1.625 bytes per base in RAM.
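
To make the two objects named in the abstract concrete, the following is a minimal definitional sketch in Python. It is not the algorithm of the paper: it builds the BWT and the LCP array of a single text through a naively sorted suffix array, using quadratic time and far more than o(n log sigma) bits, and it reads the text itself rather than only the BWT. The function names, the example string "banana$", the use of "$" as a unique smallest terminator, and the convention LCP[0] = 0 are all assumptions made for illustration.

```python
def suffix_array(text):
    """Naive suffix array: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text, sa):
    """BWT[i] is the character that cyclically precedes the i-th smallest suffix."""
    return "".join(text[i - 1] for i in sa)


def lcp_from_sa(text, sa):
    """LCP[i] is the longest common prefix length of the suffixes ranked i-1 and i.

    LCP[0] has no predecessor and is set to 0 here (an assumed convention).
    """
    def common_prefix(a, b):
        k = 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        return k

    return [0] + [common_prefix(sa[i - 1], sa[i]) for i in range(1, len(sa))]


# Assumed toy input: one text ending with a unique terminator '$' that is
# smaller than every other character.
text = "banana$"
sa = suffix_array(text)
bwt = bwt_from_sa(text, sa)
lcp = lcp_from_sa(text, sa)

print(sa)   # [6, 5, 3, 1, 0, 4, 2]
print(bwt)  # annb$aa
print(lcp)  # [0, 0, 1, 3, 0, 0, 2]
```

The sketch only fixes what the input (the BWT) and the output (the LCP array) look like on a toy example. The point of the paper is to obtain the second from the first without materializing the text or the suffix array.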

Cite As

Nicola Prezza and Giovanna Rosone. Space-Efficient Computation of the LCP Array from the Burrows-Wheeler Transform. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, pp. 7:1-7:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019) https://doi.org/10.4230/LIPIcs.CPM.2019.7

Author Details

Nicola Prezza

Department of Computer Science, University of Pisa, Italy

Giovanna Rosone

Department of Computer Science, University of Pisa, Italy

Funding

GR is partially, and NP is totally, supported by the project MIUR-SIR CMACBioSeq ("Combinatorial methods for analysis and compression of biological sequences") grant n. RBSI146R5L.

References

M.J. Bauer, A.J. Cox, and G. Rosone. Lightweight algorithms for constructing and inverting the BWT of string collections. Theor. Comput. Sci., 483:134-148, 2013.
D. Belazzougui. Linear time construction of compressed text indices in compact space. In STOC, pages 148-193. ACM, 2014.
D. Belazzougui, F. Cunial, J. Kärkkäinen, and V. Mäkinen. Linear-time string indexing and analysis in small space. arXiv preprint arXiv:1609.06378, 2016.
D. Belazzougui and G. Navarro. Alphabet-independent compressed text indexing. TALG, 10(4):23, 2014.
T. Beller, S. Gog, E. Ohlebusch, and T. Schnattinger. Computing the longest common prefix array based on the Burrows-Wheeler transform. J. Discrete Algorithms, 18:22-31, 2013.
P. Bonizzoni, G. Della Vedova, S. Nicosia, Y. Pirola, M. Previtali, and R. Rizzi. Divide and Conquer Computation of the Multi-string BWT and LCP Array. In CiE, LNCS, pages 107-117. Springer, 2018.
M. Burrows and D.J. Wheeler. A Block Sorting Lossless Data Compression Algorithm. Technical report, DEC Systems Research Center, 1994.
F. Claude, G. Navarro, and A. Ordóñez. The wavelet matrix: An efficient wavelet tree for large alphabets. Information Systems, 47:15-32, 2015.
A.J. Cox, F. Garofalo, G. Rosone, and M. Sciortino. Lightweight LCP construction for very large collections of strings. J. Discrete Algorithms, 37:17-33, 2016.
L. Egidi, F.A. Louza, G. Manzini, and G.P. Telles. External memory BWT and LCP computation for sequence collections with applications. Algorithms Mol. Biol., 14(1):6, 2019.
L. Egidi and G. Manzini. Lightweight BWT and LCP merging via the Gap algorithm. In SPIRE, LNCS, pages 176-190. Springer, 2017.
P. Ferragina and G. Manzini. Opportunistic data structures with applications. In FOCS, pages 390-398. IEEE, 2000.
J. Holt and L. McMillan. Constructing Burrows-Wheeler transforms of large string collections via merging. In ACM-BCB, pages 464-471. ACM, 2014.
J. Holt and L. McMillan. Merging of multi-string BWTs with applications. Bioinformatics, 30(24):3524-3531, 2014.
F.A. Louza, G.P. Telles, S. Hoffmann, and C.D.A. Ciferri. Generalized enhanced suffix array construction in external memory. Algorithms Mol. Biol., 12(1):26, 2017.
S. Mantaci, A. Restivo, G. Rosone, and M. Sciortino. An extension of the Burrows-Wheeler Transform. Theor. Comput. Sci., 387(3):298-312, 2007.
J.I. Munro, G. Navarro, and Y. Nekrich. Space-efficient construction of compressed indexes in deterministic linear time. In SODA, pages 408-424. SIAM, 2017.
G. Navarro. Wavelet trees for all. J. Discrete Algorithms, 25:2-20, 2014.
N. Prezza, N. Pisanti, M. Sciortino, and G. Rosone. Detecting Mutations by eBWT. In WABI 2018, volume 113 of LIPIcs, pages 3:1-3:15, 2018.
N. Prezza, N. Pisanti, M. Sciortino, and G. Rosone. SNPs detection by eBWT positional clustering. Algorithms Mol. Biol., 14(1):3, 2019.
F. Shi. Suffix Arrays for Multiple Strings: A Method for On-Line Multiple String Searches. In ASIAN, volume 1179 of LNCS, pages 11-22. Springer, 1996.
