External memory BWT and LCP computation for sequence collections with applications

Authors Lavinia Egidi , Felipe A. Louza , Giovanni Manzini , Guilherme P. Telles



PDF
Thumbnail PDF

File

LIPIcs.WABI.2018.10.pdf
  • Filesize: 0.55 MB
  • 14 pages

Document Identifiers

Author Details

Lavinia Egidi
  • University of Eastern Piedmont, Alessandria, Italy
Felipe A. Louza
  • Department of Computing and Mathematics, University of São Paulo, Ribeirão Preto, Brazil
Giovanni Manzini
  • University of Eastern Piedmont, Alessandria, Italy, IIT CNR, Pisa, Italy
Guilherme P. Telles
  • Institute of Computing, University of Campinas, Campinas, Brazil

Cite AsGet BibTex

Lavinia Egidi, Felipe A. Louza, Giovanni Manzini, and Guilherme P. Telles. External memory BWT and LCP computation for sequence collections with applications. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 10:1-10:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)
https://doi.org/10.4230/LIPIcs.WABI.2018.10

Abstract

We propose an external memory algorithm for the computation of the BWT and LCP array for a collection of sequences. Our algorithm takes the amount of available memory as an input parameter, and tries to make the best use of it by splitting the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external memory and in the process it also computes the LCP values. We show that our algorithm performs O(n maxlcp) sequential I/Os, where n is the total length of the collection and maxlcp is the maximum LCP value. The experimental results show that our algorithm outperforms the current best algorithm for collections of sequences with different lengths and when the average LCP of the collection is relatively small compared to the length of the sequences. In the second part of the paper, we show that our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP arrays, provide simple, scan based, external memory algorithms for three well known problems in bioinformatics: the computation of the all pairs suffix-prefix overlaps, the computation of maximal repeats, and the construction of succinct de Bruijn graphs.

Subject Classification

ACM Subject Classification
  • Theory of computation → Design and analysis of algorithms
Keywords
  • Burrows-Wheeler Transform
  • Longest Common Prefix Array
  • All pairs suffix-prefix overlaps
  • Succinct de Bruijn graph
  • Maximal repeats

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Markus J. Bauer, Anthony J. Cox, and Giovanna Rosone. Lightweight algorithms for constructing and inverting the BWT of string collections. Theor. Comput. Sci., 483:134-148, 2013. Google Scholar
  2. Djamal Belazzougui. Linear time construction of compressed text indices in compact space. In STOC, pages 148-193. ACM, 2014. Google Scholar
  3. Djamal Belazzougui, Travis Gagie, Veli Mäkinen, Marco Previtali, and Simon J. Puglisi. Bidirectional variable-order de Bruijn graphs. In LATIN, volume 9644 of Lecture Notes in Computer Science, pages 164-178. Springer, 2016. Google Scholar
  4. Paola Bonizzoni, Gianluca Della Vedova, Yuri Pirola, Marco Previtali, and Raffaella Rizzi. Computing the BWT and LCP array of a set of strings in external memory. CoRR, abs/1705.07756, 2017. Google Scholar
  5. Paola Bonizzoni, Gianluca Della Vedova, Yuri Pirola, Marco Previtali, and Raffaella Rizzi. An external-memory algorithm for string graph construction. Algorithmica, 78(2):394-424, 2017. URL: http://dx.doi.org/10.1007/s00453-016-0165-4.
  6. Paola Bonizzoni, Gianluca Della Vedova, Yuri Pirola, Marco Previtali, and Raffaella Rizzi. Divide and conquer computation of the multi-string BWT and LCP array. In Proc. Computability in Europe (CiE), 2018. To appear. Google Scholar
  7. Christina Boucher, Alexander Bowe, Travis Gagie, Simon J. Puglisi, and Kunihiko Sadakane. Variable-order de Bruijn graphs. In DCC, pages 383-392. IEEE, 2015. Google Scholar
  8. Alexander Bowe, Taku Onodera, Kunihiko Sadakane, and Tetsuo Shibuya. Succinct de Bruijn graphs. In WABI, volume 7534 of Lecture Notes in Computer Science, pages 225-235. Springer, 2012. Google Scholar
  9. S. Burkhardt and J. Kärkkäinen. Fast lightweight suffix array construction and checking. In Proc. 14th Symposium on Combinatorial Pattern Matching (CPM '03), pages 55-69. Springer-Verlag LNCS n. 2676, 2003. Google Scholar
  10. Michael Burrows and David J. Wheeler. A block-sorting lossless data compression algorithm. Technical report, Digital SRC Research Report, 1994. Google Scholar
  11. Anthony J. Cox, Fabio Garofalo, Giovanna Rosone, and Marinella Sciortino. Lightweight LCP construction for very large collections of strings. J. Discrete Algorithms, 37, 2016. Google Scholar
  12. Lavinia Egidi and Giovanni Manzini. Lightweight BWT and LCP merging via the Gap algorithm. In SPIRE, volume 10508 of Lecture Notes in Computer Science, pages 176-190. Springer, 2017. Google Scholar
  13. P. Ferragina, T. Gagie, and G. Manzini. Lightweight data indexing and compression in external memory. Algorithmica, 2011. Google Scholar
  14. Simon Gog and Enno Ohlebusch. Compressed suffix trees: Efficient computation and storage of LCP-values. ACM Journal of Experimental Algorithmics, 18, 2013. URL: http://dx.doi.org/10.1145/2444016.2461327.
  15. D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997. Google Scholar
  16. Dan Gusfield, Gad M. Landau, and Baruch Schieber. An efficient algorithm for the all pairs suffix-prefix problem. Inf. Process. Lett., 41(4):181-185, 1992. Google Scholar
  17. James Holt and Leonard McMillan. Merging of multi-string BWTs with applications. Bioinformatics, 30(24):3524-3531, 2014. Google Scholar
  18. Juha Kärkkäinen and Dominik Kempa. LCP array construction in external memory. ACM Journal of Experimental Algorithmics, 21(1):1.7:1-1.7:22, 2016. Google Scholar
  19. Juha Kärkkäinen and Dominik Kempa. Engineering a lightweight external memory suffix array construction algorithm. Mathematics in Computer Science, 11(2):137-149, 2017. Google Scholar
  20. D. E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, Reading, MA, USA, second edition, 1998. Google Scholar
  21. M. Oguzhan Külekci, Jeffrey Scott Vitter, and Bojian Xu. Efficient maximal repeat finding using the burrows-wheeler transform and wavelet tree. IEEE/ACM Trans. Comput. Biology Bioinform., 9(2):421-429, 2012. Google Scholar
  22. Felipe A. Louza, Simon Gog, and Guilherme P. Telles. Inducing enhanced suffix arrays for string collections. Theor. Comput. Sci., 678:22-39, 2017. Google Scholar
  23. Felipe A. Louza, Guilherme P. Telles, Steve Hoffmann, and Cristina D. A. Ciferri. Generalized enhanced suffix array construction in external memory. Algorithms for Molecular Biology, 12(1):26:1-26:16, 2017. Google Scholar
  24. Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu. Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing. Cambridge University Press, 2015. Google Scholar
  25. Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino. An extension of the Burrows-Wheeler transform. Theor. Comput. Sci., 387(3):298-312, 2007. Google Scholar
  26. G. Manzini. Two space saving tricks for linear time LCP computation. In Proc. of 9th Scandinavian Workshop on Algorithm Theory (SWAT '04), pages 372-383. Springer-Verlag LNCS n. 3111, 2004. Google Scholar
  27. G. Manzini and P. Ferragina. Engineering a lightweight suffix array construction algorithm. In Proc. 10th European Symposium on Algorithms (ESA), pages 698-710. Springer Verlag LNCS n. 2461, 2002. Google Scholar
  28. Guillaume Marçais, Arthur L. Delcher, Adam M. Phillippy, Rachel Coston, Steven L. Salzberg, and Aleksey V. Zimin. Mummer4: A fast and versatile genome alignment system. PLoS Computational Biology, 14(1), 2018. Google Scholar
  29. Martin D. Muggli, Alexander Bowe, Noelle R. Noyes, Paul S. Morley, Keith E. Belk, Robert Raymond, Travis Gagie, Simon J. Puglisi, and Christina Boucher. Succinct colored de Bruijn graphs. Bioinformatics, 33(20):3181-3187, 2017. Google Scholar
  30. J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich. Space-efficient construction of compressed indexes in deterministic linear time. In SODA, pages 408-424. SIAM, 2017. Google Scholar
  31. G. Navarro and V. Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1), 2007. Google Scholar
  32. Ge Nong. Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Trans. Inf. Syst., 31(3):15, 2013. Google Scholar
  33. Enno Ohlebusch and Simon Gog. Efficient algorithms for the all-pairs suffix-prefix problem and the all-pairs substring-prefix problem. Inf. Process. Lett., 110(3):123-128, 2010. Google Scholar
  34. Enno Ohlebusch, Simon Gog, and Adrian Kügel. Computing matching statistics and maximal exact matches on compressed full-text indexes. In SPIRE, volume 6393 of Lecture Notes in Computer Science, pages 347-358. Springer, 2010. Google Scholar
  35. William H. A. Tustumi, Simon Gog, Guilherme P. Telles, and Felipe A. Louza. An improved algorithm for the all-pairs suffix-prefix problem. J. Discrete Algorithms, 37:34-43, 2016. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail