Fulgor: A Fast and Compact {k-mer} Index for Large-Scale Matching and Color Queries

Fan, Jason; Singh, Noor Pratap; Khan, Jamshed; Pibiri, Giulio Ermanno; Patro, Rob

doi:10.4230/LIPIcs.WABI.2023.18

Abstract

The problem of sequence identification or matching - determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence - is relevant for many important tasks in Computational Biology, such as metagenomics and pan-genome analysis. Due to the complex nature of such analyses and the large scale of the reference collections a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe how recent advancements in associative, order-preserving, k-mer dictionaries can be combined with a compressed inverted index to implement a fast and compact colored de Bruijn graph data structure. This index takes full advantage of the fact that unitigs in the colored de Bruijn graph are monochromatic (all k-mers in a unitig have the same set of references of origin, or "color"), leveraging the order-preserving property of its dictionary. In fact, k-mers are kept in unitig order by the dictionary, thereby allowing for the encoding of the map from k-mers to their inverted lists in as little as 1+o(1) bits per unitig. Hence, one inverted list per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for inverted lists, the index achieves very small space. We implement these methods in a tool called Fulgor. Compared to Themisto, the prior state of the art, Fulgor indexes a heterogeneous collection of 30,691 bacterial genomes in 3.8× less space, a collection of 150,000 Salmonella enterica genomes in approximately 2× less space, is at least twice as fast for color queries, and is 2-6 × faster to construct.

Jarno N. Alanko, Simon J. Puglisi, and Jaakko Vuohtoniemi. Small searchable k-spectra via subset rank queries on the spectral burrows-wheeler transform. In SIAM Conference on Applied and Computational Discrete Algorithms (ACDA23), pages 225-236, 2023.
Jarno N Alanko, Jaakko Vuohtoniemi, Tommi Mäklin, and Simon J Puglisi. Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes. In Proceedings of the 31st Conference on Intelligent Systems for Molecular Biology (ISMB), 2023.
Fatemeh Almodaresi, Prashant Pandey, Michael Ferdman, Rob Johnson, and Rob Patro. An Efficient, Scalable, and Exact Representation of High-Dimensional Color Information Enabled Using de Bruijn Graph Search. Journal of Computational Biology, 27(4):485-499, 2020. PMID: 32176522.
Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403-410, 1990.
Grace A. Blackwell, Martin Hunt, Kerri M. Malone, Leandro Lima, Gal Horesh, Blaise T. F. Alako, Nicholas R. Thomson, and Zamin Iqbal. Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences. PLOS Biology, 19(11):1-16, November 2021. URL: https://doi.org/10.1371/journal.pbio.3001421.
Nicolas L Bray, Harold Pimentel, Páll Melsted, and Lior Pachter. Near-optimal probabilistic rna-seq quantification. Nature biotechnology, 34(5):525-527, 2016.
Samy Chambi, Daniel Lemire, Owen Kaser, and Robert Godin. Better bitmap performance with roaring bitmaps. Software: Practice and Experience, 46(5):709-719, 2016.
Andrea Cracco and Alexandru I Tomescu. Extremely-fast Construction and Querying of Compacted and Colored de Bruijn Graphs with GGCAT. In Proceedings of the 27th Annual International Conference on Research in Computational Molecular Biology, pages 208-210. Springer, 2023.
Peter Elias. Efficient storage and retrieval by content and address of static files. Journal of the ACM, 21(2):246-260, 1974.
Peter Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194-203, 1975.
Jason Fan, Jamshed Khan, Giulio Ermanno Pibiri, and Rob Patro. Spectrum preserving tilings enable sparse and modular reference indexing. In Research in Computational Molecular Biology, pages 21-40, 2023.
Robert Mario Fano. On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, MIT, 1971.
Dongze He, Mohsen Zakeri, Hirak Sarkar, Charlotte Soneson, Avi Srivastava, and Rob Patro. Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data. Nature Methods, 19(3):316-322, 2022.
Pranvera Hiseni, Knut Rudi, Robert C Wilson, Finn Terje Hegge, and Lars Snipen. HumGut: a comprehensive human gut prokaryotic genomes collection filtered by metagenome data. Microbiome, 9(1):1-12, 2021.
Guillaume Holley and Páll Melsted. Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs. Genome biology, 21(1):1-20, 2020.
M. Holtgrewe. Mason - a read simulator for second generation sequencing data. Technical Report FU Berlin, October 2010.
Mikhail Karasikov, Harun Mustafa, Amir Joudaki, Sara Javadzadeh-No, Gunnar Rätsch, and André Kahles. Sparse Binary Relation Representations for Genome Graph Annotation. J Comput Biol, 27(4):626-639, December 2019.
Jamshed Khan, Marek Kokot, Sebastian Deorowicz, and Rob Patro. Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2. Genome Biology, 23(1):190, 2022.
Jamshed Khan and Rob Patro. Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections. Bioinformatics, 37(Supplement_1):i177-i186, 2021.
Nathan LaPierre, Mohammed Alser, Eleazar Eskin, David Koslicki, and Serghei Mangul. Metalign: efficient alignment-based metagenomic profiling via containment min hash. Genome Biology, 21(1):242, September 2020.
Tommi Mäklin, Teemu Kallonen, Sophia David, Christine J Boinett, Ben Pascoe, Guillaume Méric, David M Aanensen, Edward J Feil, Stephen Baker, Julian Parkhill, et al. High-resolution sweep metagenomics using fast probabilistic inference [version 1; peer review: 1 approved, 1 approved with reservations]. Wellcome open research, 5(14), 2021.
Camille Marchet, Christina Boucher, Simon J Puglisi, Paul Medvedev, Mikaël Salson, and Rayan Chikhi. Data structures based on k-mers for querying large collections of sequencing data sets. Genome Research, 31(1):1-12, 2021.
Alexa B. R. McIntyre, Rachid Ounit, Ebrahim Afshinnekoo, Robert J. Prill, Elizabeth Hénaff, Noah Alexander, Samuel S. Minot, David Danko, Jonathan Foox, Sofia Ahsanuddin, Scott Tighe, Nur A. Hasan, Poorani Subramanian, Kelly Moffat, Shawn Levy, Stefano Lonardi, Nick Greenfield, Rita R. Colwell, Gail L. Rosen, and Christopher E. Mason. Comprehensive benchmarking and ensemble approaches for metagenomic classifiers. Genome Biology, 18(1):182, September 2017.
Ilia Minkin, Son Pham, and Paul Medvedev. TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes. Bioinformatics, 33(24):4024-4032, 2017.
Sergey Nurk, Sergey Koren, Arang Rhie, Mikko Rautiainen, Andrey V. Bzikadze, Alla Mikheenko, et al. The complete sequence of a human genome. Science, 376(6588):44-53, 2022.
Giuseppe Ottaviano and Rossano Venturini. Partitioned Elias-Fano indexes. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, pages 273-282, 2014.
Rachid Ounit, Steve Wanamaker, Timothy J Close, and Stefano Lonardi. Clark: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics, 16(1):1-13, 2015.
Giulio Ermanno Pibiri. Fast and compact set intersection through recursive universe partitioning. In 2021 Data Compression Conference (DCC), pages 293-302. IEEE, 2021.
Giulio Ermanno Pibiri. On weighted k-mer dictionaries. In International Workshop on Algorithms in Bioinformatics (WABI), pages 9:1-9:20, 2022.
Giulio Ermanno Pibiri. Sparse and skew hashing of k-mers. Bioinformatics, 38(Supplement_1):i185-i194, June 2022.
Giulio Ermanno Pibiri and Shunsuke Kanda. Rank/select queries over mutable bitmaps. Information Systems, 99(101756), 2021.
Giulio Ermanno Pibiri and Roberto Trani. PTHash: Revisiting FCH minimal perfect hashing. In Proceedings of the 44th international ACM SIGIR conference on Research & development in information retrieval, pages 1339-1348, 2021.
Giulio Ermanno Pibiri and Rossano Venturini. Clustered elias-fano indexes. ACM Transactions on Information Systems, 36(1):1-33, 2017.
Giulio Ermanno Pibiri and Rossano Venturini. Techniques for inverted index compression. ACM Comput. Surv., 53(6):125:1-125:36, 2021.
N Tessa Pierce, Luiz Irber, Taylor Reiter, Phillip Brooks, and C Titus Brown. Large-scale sequence comparisons with sourmash. F1000Research, 8, 2019.
NT Pierce, L Irber, T Reiter, P Brooks, and CT Brown. Large-scale sequence comparisons with sourmash [version 1; peer review: 2 approved]. F1000Research, 8(1006), 2019. URL: https://doi.org/10.12688/f1000research.19675.1.
Mark Reppell and John Novembre. Using pseudoalignment and base quality to accurately quantify microbial community composition. PLOS Computational Biology, 14(4):1-23, April 2018.
Michael Roberts, Wayne Hayes, Brian R Hunt, Stephen M Mount, and James A Yorke. Reducing storage requirements for biological sequence comparison. Bioinformatics, 20(18):3363-3369, 2004.
L Schaeffer, H Pimentel, N Bray, P Melsted, and L Pachter. Pseudoalignment for metagenomic read assignment. Bioinformatics, 33(14):2082-2088, February 2017.
Wei Shen, Hongyan Xiang, Tianquan Huang, Hui Tang, Mingli Peng, Dachuan Cai, Peng Hu, and Hong Ren. KMCP: accurate metagenomic profiling of both prokaryotic and viral populations by pseudo-mapping. Bioinformatics, 39(1), December 2022. btac845.
Sebastiano Vigna. Broadword implementation of rank/select queries. In International Workshop on Experimental and Efficient Algorithms, pages 154-168, 2008.
Derrick E. Wood, Jennifer Lu, and Ben Langmead. Improved metagenomic analysis with Kraken 2. Genome Biology, 20(1):257, November 2019.
Derrick E Wood and Steven L Salzberg. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome biology, 15(3):1-12, 2014.
Ilya Y Zhbannikov, Samuel S Hunter, Matthew L Settles, and James A Foster. SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching. Journal of data mining in genomics & proteomics, 4(3), 2013.
Justin Zobel and Alistair Moffat. Inverted files for text search engines. ACM Computing Surveys (CSUR), 38(2):6-es, 2006.

Fulgor: A Fast and Compact {k-mer} Index for Large-Scale Matching and Color Queries

Authors Jason Fan , Noor Pratap Singh , Jamshed Khan , Giulio Ermanno Pibiri , Rob Patro

File

Document Identifiers

Author Details

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

References

Fulgor: A Fast and Compact {k-mer} Index for Large-Scale Matching and Color Queries

Authors Jason Fan , Noor Pratap Singh , Jamshed Khan , Giulio Ermanno Pibiri , Rob Patro

File

Document Identifiers

Author Details

Funding

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

Related Versions

Supplementary Materials

References