Fully-Functional Bidirectional Burrows-Wheeler Indexes and Infinite-Order De Bruijn Graphs

Authors: Djamal Belazzougui, Fabio Cunial

Author Details

Djamal Belazzougui
  • CAPA, DTISI, Centre de Recherche sur l'Information Scientifique et Technique, Algiers, Algeria
Fabio Cunial
  • Max Planck Institute for Molecular Cell Biology and Genetics (MPI-CBG), Dresden, Germany
  • Center for Systems Biology Dresden (CSBD), Dresden, Germany

Acknowledgements

We thank Martin Bundgaard for motivating the contract operation, Rodrigo Canovas for discussions about bidirectional indexes, Gene Myers for discussions about PacBio CCS reads, and German Tischler for help with k-mer counting.

Cite As

Djamal Belazzougui and Fabio Cunial. Fully-Functional Bidirectional Burrows-Wheeler Indexes and Infinite-Order De Bruijn Graphs. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, pp. 10:1-10:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)
https://doi.org/10.4230/LIPIcs.CPM.2019.10

Abstract

Given a string T on an alphabet of size σ, we describe a bidirectional Burrows-Wheeler index that takes O(|T| log σ) bits of space, and that supports the addition and removal of one character, on the left or right side of any substring of T, in constant time. Previously known data structures that used the same space allowed constant-time addition to any substring of T, but they could support removal only from specific substrings of T. We also describe an index that supports bidirectional addition and removal in O(log log |T|) time, and that takes a number of words proportional to the number of left and right extensions of the maximal repeats of T. We use such fully-functional indexes to implement bidirectional, frequency-aware, variable-order de Bruijn graphs with no upper bound on their order, and supporting natural criteria for increasing and decreasing the order during traversal.
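
To make the four operations in the abstract concrete, the following toy sketch shows their interface only. It is not the data structure of the paper: it stores all suffixes of T in sorted order, so it uses far more than O(|T| log σ) bits and each query costs a binary search rather than constant time. The class name `NaiveBidirectionalIndex` and its method names are hypothetical, chosen for illustration.

```python
from bisect import bisect_left, bisect_right


class NaiveBidirectionalIndex:
    """Toy stand-in for a bidirectional index (hypothetical names).

    Keeps every suffix of the text in sorted order. This is only meant
    to illustrate extend/contract semantics, not space or time bounds.
    """

    # Sentinel larger than any character assumed to occur in the text.
    _MAX_CHAR = "\U0010ffff"

    def __init__(self, text):
        self.text = text
        self.suffixes = sorted(text[i:] for i in range(len(text)))

    def interval(self, w):
        """Half-open range of sorted suffixes that start with w."""
        lo = bisect_left(self.suffixes, w)
        hi = bisect_right(self.suffixes, w + self._MAX_CHAR)
        return lo, hi

    def count(self, w):
        """Number of occurrences of w in the text (its frequency)."""
        lo, hi = self.interval(w)
        return hi - lo

    def _if_occurs(self, w):
        return w if self.count(w) > 0 else None

    def extend_left(self, w, c):
        """Add character c on the left; None if c + w does not occur."""
        return self._if_occurs(c + w)

    def extend_right(self, w, c):
        """Add character c on the right; None if w + c does not occur."""
        return self._if_occurs(w + c)

    def contract_left(self, w):
        """Remove the first character; the result always occurs."""
        return w[1:]

    def contract_right(self, w):
        """Remove the last character; the result always occurs."""
        return w[:-1]


idx = NaiveBidirectionalIndex("banana")
w = idx.extend_left("ana", "n")     # "nana"
v = idx.contract_left("ana")        # "na"
print(w, idx.count(w), v, idx.count(v))
```

In a variable-order de Bruijn graph built on such an interface, a right extension followed by a left contraction moves along an edge at fixed order, while an extension or a contraction alone raises or lowers the order; `count` supplies the frequency of the current node.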

Subject Classification

ACM Subject Classification
  • Theory of computation → Data structures design and analysis
  • Theory of computation → Design and analysis of algorithms
Keywords
  • BWT
  • suffix tree
  • CDAWG
  • de Bruijn graph
  • maximal repeat
  • string depth
  • contraction
  • bidirectional index
