Document Open Access Logo

Fulgor: A Fast and Compact {k-mer} Index for Large-Scale Matching and Color Queries

Authors Jason Fan , Noor Pratap Singh , Jamshed Khan , Giulio Ermanno Pibiri , Rob Patro



PDF
Thumbnail PDF

File

LIPIcs.WABI.2023.18.pdf
  • Filesize: 2.03 MB
  • 21 pages

Document Identifiers

Author Details

Jason Fan
  • Department of Computer Science, University of Maryland, College Park, MD, USA
Noor Pratap Singh
  • Department of Computer Science, University of Maryland, College Park, MD, USA
Jamshed Khan
  • Department of Computer Science, University of Maryland, College Park, MD, USA
Giulio Ermanno Pibiri
  • DAIS, Ca' Foscari University of Venice, Italy
  • STI-CNR, Pisa, Italy
Rob Patro
  • Department of Computer Science, University of Maryland, College Park, MD, USA

Cite AsGet BibTex

Jason Fan, Noor Pratap Singh, Jamshed Khan, Giulio Ermanno Pibiri, and Rob Patro. Fulgor: A Fast and Compact {k-mer} Index for Large-Scale Matching and Color Queries. In 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 273, pp. 18:1-18:21, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.WABI.2023.18

Abstract

The problem of sequence identification or matching - determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence - is relevant for many important tasks in Computational Biology, such as metagenomics and pan-genome analysis. Due to the complex nature of such analyses and the large scale of the reference collections a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe how recent advancements in associative, order-preserving, k-mer dictionaries can be combined with a compressed inverted index to implement a fast and compact colored de Bruijn graph data structure. This index takes full advantage of the fact that unitigs in the colored de Bruijn graph are monochromatic (all k-mers in a unitig have the same set of references of origin, or "color"), leveraging the order-preserving property of its dictionary. In fact, k-mers are kept in unitig order by the dictionary, thereby allowing for the encoding of the map from k-mers to their inverted lists in as little as 1+o(1) bits per unitig. Hence, one inverted list per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for inverted lists, the index achieves very small space. We implement these methods in a tool called Fulgor. Compared to Themisto, the prior state of the art, Fulgor indexes a heterogeneous collection of 30,691 bacterial genomes in 3.8× less space, a collection of 150,000 Salmonella enterica genomes in approximately 2× less space, is at least twice as fast for color queries, and is 2-6 × faster to construct.

Subject Classification

ACM Subject Classification
  • Applied computing → Bioinformatics
Keywords
  • k-mers
  • Colored de Bruijn Graph
  • Compression
  • Read-mapping

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Jarno N. Alanko, Simon J. Puglisi, and Jaakko Vuohtoniemi. Small searchable k-spectra via subset rank queries on the spectral burrows-wheeler transform. In SIAM Conference on Applied and Computational Discrete Algorithms (ACDA23), pages 225-236, 2023. Google Scholar
  2. Jarno N Alanko, Jaakko Vuohtoniemi, Tommi Mäklin, and Simon J Puglisi. Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes. In Proceedings of the 31st Conference on Intelligent Systems for Molecular Biology (ISMB), 2023. Google Scholar
  3. Fatemeh Almodaresi, Prashant Pandey, Michael Ferdman, Rob Johnson, and Rob Patro. An Efficient, Scalable, and Exact Representation of High-Dimensional Color Information Enabled Using de Bruijn Graph Search. Journal of Computational Biology, 27(4):485-499, 2020. PMID: 32176522. Google Scholar
  4. Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403-410, 1990. Google Scholar
  5. Grace A. Blackwell, Martin Hunt, Kerri M. Malone, Leandro Lima, Gal Horesh, Blaise T. F. Alako, Nicholas R. Thomson, and Zamin Iqbal. Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences. PLOS Biology, 19(11):1-16, November 2021. URL: https://doi.org/10.1371/journal.pbio.3001421.
  6. Nicolas L Bray, Harold Pimentel, Páll Melsted, and Lior Pachter. Near-optimal probabilistic rna-seq quantification. Nature biotechnology, 34(5):525-527, 2016. Google Scholar
  7. Samy Chambi, Daniel Lemire, Owen Kaser, and Robert Godin. Better bitmap performance with roaring bitmaps. Software: Practice and Experience, 46(5):709-719, 2016. Google Scholar
  8. Andrea Cracco and Alexandru I Tomescu. Extremely-fast Construction and Querying of Compacted and Colored de Bruijn Graphs with GGCAT. In Proceedings of the 27th Annual International Conference on Research in Computational Molecular Biology, pages 208-210. Springer, 2023. Google Scholar
  9. Peter Elias. Efficient storage and retrieval by content and address of static files. Journal of the ACM, 21(2):246-260, 1974. Google Scholar
  10. Peter Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194-203, 1975. Google Scholar
  11. Jason Fan, Jamshed Khan, Giulio Ermanno Pibiri, and Rob Patro. Spectrum preserving tilings enable sparse and modular reference indexing. In Research in Computational Molecular Biology, pages 21-40, 2023. Google Scholar
  12. Robert Mario Fano. On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, MIT, 1971. Google Scholar
  13. Dongze He, Mohsen Zakeri, Hirak Sarkar, Charlotte Soneson, Avi Srivastava, and Rob Patro. Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data. Nature Methods, 19(3):316-322, 2022. Google Scholar
  14. Pranvera Hiseni, Knut Rudi, Robert C Wilson, Finn Terje Hegge, and Lars Snipen. HumGut: a comprehensive human gut prokaryotic genomes collection filtered by metagenome data. Microbiome, 9(1):1-12, 2021. Google Scholar
  15. Guillaume Holley and Páll Melsted. Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs. Genome biology, 21(1):1-20, 2020. Google Scholar
  16. M. Holtgrewe. Mason - a read simulator for second generation sequencing data. Technical Report FU Berlin, October 2010. Google Scholar
  17. Mikhail Karasikov, Harun Mustafa, Amir Joudaki, Sara Javadzadeh-No, Gunnar Rätsch, and André Kahles. Sparse Binary Relation Representations for Genome Graph Annotation. J Comput Biol, 27(4):626-639, December 2019. Google Scholar
  18. Jamshed Khan, Marek Kokot, Sebastian Deorowicz, and Rob Patro. Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2. Genome Biology, 23(1):190, 2022. Google Scholar
  19. Jamshed Khan and Rob Patro. Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections. Bioinformatics, 37(Supplement_1):i177-i186, 2021. Google Scholar
  20. Nathan LaPierre, Mohammed Alser, Eleazar Eskin, David Koslicki, and Serghei Mangul. Metalign: efficient alignment-based metagenomic profiling via containment min hash. Genome Biology, 21(1):242, September 2020. Google Scholar
  21. Tommi Mäklin, Teemu Kallonen, Sophia David, Christine J Boinett, Ben Pascoe, Guillaume Méric, David M Aanensen, Edward J Feil, Stephen Baker, Julian Parkhill, et al. High-resolution sweep metagenomics using fast probabilistic inference [version 1; peer review: 1 approved, 1 approved with reservations]. Wellcome open research, 5(14), 2021. Google Scholar
  22. Camille Marchet, Christina Boucher, Simon J Puglisi, Paul Medvedev, Mikaël Salson, and Rayan Chikhi. Data structures based on k-mers for querying large collections of sequencing data sets. Genome Research, 31(1):1-12, 2021. Google Scholar
  23. Alexa B. R. McIntyre, Rachid Ounit, Ebrahim Afshinnekoo, Robert J. Prill, Elizabeth Hénaff, Noah Alexander, Samuel S. Minot, David Danko, Jonathan Foox, Sofia Ahsanuddin, Scott Tighe, Nur A. Hasan, Poorani Subramanian, Kelly Moffat, Shawn Levy, Stefano Lonardi, Nick Greenfield, Rita R. Colwell, Gail L. Rosen, and Christopher E. Mason. Comprehensive benchmarking and ensemble approaches for metagenomic classifiers. Genome Biology, 18(1):182, September 2017. Google Scholar
  24. Ilia Minkin, Son Pham, and Paul Medvedev. TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes. Bioinformatics, 33(24):4024-4032, 2017. Google Scholar
  25. Sergey Nurk, Sergey Koren, Arang Rhie, Mikko Rautiainen, Andrey V. Bzikadze, Alla Mikheenko, et al. The complete sequence of a human genome. Science, 376(6588):44-53, 2022. Google Scholar
  26. Giuseppe Ottaviano and Rossano Venturini. Partitioned Elias-Fano indexes. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, pages 273-282, 2014. Google Scholar
  27. Rachid Ounit, Steve Wanamaker, Timothy J Close, and Stefano Lonardi. Clark: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics, 16(1):1-13, 2015. Google Scholar
  28. Giulio Ermanno Pibiri. Fast and compact set intersection through recursive universe partitioning. In 2021 Data Compression Conference (DCC), pages 293-302. IEEE, 2021. Google Scholar
  29. Giulio Ermanno Pibiri. On weighted k-mer dictionaries. In International Workshop on Algorithms in Bioinformatics (WABI), pages 9:1-9:20, 2022. Google Scholar
  30. Giulio Ermanno Pibiri. Sparse and skew hashing of k-mers. Bioinformatics, 38(Supplement_1):i185-i194, June 2022. Google Scholar
  31. Giulio Ermanno Pibiri and Shunsuke Kanda. Rank/select queries over mutable bitmaps. Information Systems, 99(101756), 2021. Google Scholar
  32. Giulio Ermanno Pibiri and Roberto Trani. PTHash: Revisiting FCH minimal perfect hashing. In Proceedings of the 44th international ACM SIGIR conference on Research & development in information retrieval, pages 1339-1348, 2021. Google Scholar
  33. Giulio Ermanno Pibiri and Rossano Venturini. Clustered elias-fano indexes. ACM Transactions on Information Systems, 36(1):1-33, 2017. Google Scholar
  34. Giulio Ermanno Pibiri and Rossano Venturini. Techniques for inverted index compression. ACM Comput. Surv., 53(6):125:1-125:36, 2021. Google Scholar
  35. N Tessa Pierce, Luiz Irber, Taylor Reiter, Phillip Brooks, and C Titus Brown. Large-scale sequence comparisons with sourmash. F1000Research, 8, 2019. Google Scholar
  36. NT Pierce, L Irber, T Reiter, P Brooks, and CT Brown. Large-scale sequence comparisons with sourmash [version 1; peer review: 2 approved]. F1000Research, 8(1006), 2019. URL: https://doi.org/10.12688/f1000research.19675.1.
  37. Mark Reppell and John Novembre. Using pseudoalignment and base quality to accurately quantify microbial community composition. PLOS Computational Biology, 14(4):1-23, April 2018. Google Scholar
  38. Michael Roberts, Wayne Hayes, Brian R Hunt, Stephen M Mount, and James A Yorke. Reducing storage requirements for biological sequence comparison. Bioinformatics, 20(18):3363-3369, 2004. Google Scholar
  39. L Schaeffer, H Pimentel, N Bray, P Melsted, and L Pachter. Pseudoalignment for metagenomic read assignment. Bioinformatics, 33(14):2082-2088, February 2017. Google Scholar
  40. Wei Shen, Hongyan Xiang, Tianquan Huang, Hui Tang, Mingli Peng, Dachuan Cai, Peng Hu, and Hong Ren. KMCP: accurate metagenomic profiling of both prokaryotic and viral populations by pseudo-mapping. Bioinformatics, 39(1), December 2022. btac845. Google Scholar
  41. Sebastiano Vigna. Broadword implementation of rank/select queries. In International Workshop on Experimental and Efficient Algorithms, pages 154-168, 2008. Google Scholar
  42. Derrick E. Wood, Jennifer Lu, and Ben Langmead. Improved metagenomic analysis with Kraken 2. Genome Biology, 20(1):257, November 2019. Google Scholar
  43. Derrick E Wood and Steven L Salzberg. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome biology, 15(3):1-12, 2014. Google Scholar
  44. Ilya Y Zhbannikov, Samuel S Hunter, Matthew L Settles, and James A Foster. SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching. Journal of data mining in genomics & proteomics, 4(3), 2013. Google Scholar
  45. Justin Zobel and Alistair Moffat. Inverted files for text search engines. ACM Computing Surveys (CSUR), 38(2):6-es, 2006. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail