Space-Efficient Representation of Genomic k-Mer Count Tables

Shibuya, Yoshihiro; Belazzougui, Djamal; Kucherov, Gregory

doi:10.4230/LIPIcs.WABI.2021.8

Abstract

Motivation. k-mer counting is a common task in bioinformatic pipelines, with many dedicated tools available. Output formats could rely on quotienting to reduce the space of k-mers in hash tables, however counts are not usually stored in space-efficient formats. Overall, k-mer count tables for genomic data take a considerable space, easily reaching tens of GB. Furthermore, such tables do not support efficient random-access queries in general.

Results. In this work, we design an efficient representation of k-mer count tables supporting fast random-access queries. We propose to apply Compressed Static Functions (CSFs), with space proportional to the empirical zero-order entropy of the counts. For very skewed distributions, like those of k-mer counts in whole genomes, the only currently available implementation of CSFs does not provide a compact enough representation. By adding a Bloom Filter to a CSF we obtain a Bloom-enhanced CSF (BCSF) effectively overcoming this limitation. Furthermore, by combining BCSFs with minimizer-based bucketing of k-mers, we build even smaller representations breaking the empirical entropy lower bound, for large enough k. We also extend these representations to the approximate case, gaining additional space. We experimentally validate these techniques on k-mer count tables of whole genomes (E.Coli and C.Elegans) as well as on k-mer document frequency tables for 29 E.Coli genomes. In the case of exact counts, our representation takes about a half of the space of the empirical entropy, for large enough k’s.

Djamal Belazzougui and Rossano Venturini. Compressed Static Functions with Applications. In Proceedings of the 2013 Annual ACM-SIAM Symposium on Discrete Algorithms, Proceedings, pages 229-240. Society for Industrial and Applied Mathematics, 2013. URL: https://doi.org/10.1137/1.9781611973105.17.
Robert S. Boyer and J. Strother Moore. MJRTY - A Fast Majority Vote Algorithm. In Robert S. Boyer, editor, Automated Reasoning: Essays in Honor of Woody Bledsoe, Automated Reasoning Series, pages 105-117. Springer Netherlands, Dordrecht, 1991. URL: https://doi.org/10.1007/978-94-011-3488-0_5.
A. Z. Broder. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171), pages 21-29, June 1997. URL: https://doi.org/10.1109/SEQUEN.1997.666900.
Benny Chor, David Horn, Nick Goldman, Yaron Levy, and Tim Massingham. Genomic dna k-mer spectra: models and modalities. Genome Biology, 10(10):R108, October 2009. URL: https://doi.org/10.1186/gb-2009-10-10-r108.
Yingnan Cong, Yao-ban Chan, and Mark A. Ragan. A novel alignment-free method for detection of lateral genetic transfer based on TF-IDF. Scientific Reports, 6(1):30308, 2016. URL: https://doi.org/10.1038/srep30308.
M. Csűrös, L. Noé, and G. Kucherov. Reconsidering the significance of genomic word frequencies. Trends in Genetics, 23(11):543-546, November 2007. URL: https://doi.org/10.1016/j.tig.2007.07.008.
Thomas Dencker, Chris-André Leimeister, Michael Gerth, Christoph Bleidorn, Sagi Snir, and Burkhard Morgenstern. Multi-SpaM: A Maximum-Likelihood Approach to Phylogeny Reconstruction Using Multiple Spaced-Word Matches and Quartet Trees. In Mathieu Blanchette and Aïda Ouangraoua, editors, Comparative Genomics, Lecture Notes in Computer Science, pages 227-241, Cham, 2018. Springer International Publishing. URL: https://doi.org/10.1007/978-3-030-00834-5_13.
Barış Ekim, Bonnie Berger, and Yaron Orenstein. A Randomized Parallel Algorithm for Efficiently Finding Near-Optimal Universal Hitting Sets. In Russell Schwartz, editor, Research in Computational Molecular Biology, Lecture Notes in Computer Science, pages 37-53, Cham, 2020. Springer International Publishing. URL: https://doi.org/10.1007/978-3-030-45257-5_3.
Emmanuel Esposito, Thomas Mueller Graf, and Sebastiano Vigna. RecSplit: Minimal Perfect Hashing via Recursive Splitting. arXiv:1910.06416 [cs], November 2019. URL: http://arxiv.org/abs/1910.06416.
Huan Fan, Anthony R. Ives, Yann Surget-Groba, and Charles H. Cannon. An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC Genomics, 16(1):522, July 2015. URL: https://doi.org/10.1186/s12864-015-1647-5.
Marco Genuzio, Giuseppe Ottaviano, and Sebastiano Vigna. Fast scalable construction of ([compressed] static bar minimal perfect hash) functions. Information and Computation, 273:104517, August 2020. URL: https://doi.org/10.1016/j.ic.2020.104517.
Karen Spärck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11-21, 1972.
Mikhail Karasikov, Harun Mustafa, Daniel Danciu, Marc Zimmermann, Christopher Barber, Gunnar Rätsch, and André Kahles. MetaGraph: Indexing and Analysing Nucleotide Archives at Petabase-scale. bioRxiv, page 2020.10.01.322164, November 2020. URL: https://doi.org/10.1101/2020.10.01.322164.
Mikhail Karasikov, Harun Mustafa, Amir Joudaki, Sara Javadzadeh-no, Gunnar Rätsch, and André Kahles. Sparse Binary Relation Representations for Genome Graph Annotation. Journal of Computational Biology, 27(4):626-639, December 2019. URL: https://doi.org/10.1089/cmb.2019.0324.
Parsoa Khorsand and Fereydoun Hormozdiari. Nebula: Ultra-efficient mapping-free structural variant genotyper. bioRxiv, page 566620, March 2019. URL: https://doi.org/10.1101/566620.
Marek Kokot, Maciej Długosz, and Sebastian Deorowicz. KMC 3: counting and manipulating k-mer statistics. Bioinformatics, 33(17):2759-2761, May 2017. URL: https://doi.org/10.1093/bioinformatics/btx304.
Sergey Koren, Brian P. Walenz, Konstantin Berlin, Jason R. Miller, Nicholas H. Bergman, and Adam M. Phillippy. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Research, 27(5):722-736, 2017. URL: https://doi.org/10.1101/gr.215087.116.
Téo Lemane, Paul Medvedev, Rayan Chikhi, and Pierre Peterlongo. kmtricks: Efficient construction of Bloom filters for large sequencing data collections. bioRxiv, page 2021.02.16.429304, 2021. URL: https://doi.org/10.1101/2021.02.16.429304.
Heng Li. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32(14):2103-2110, July 2016. URL: https://doi.org/10.1093/bioinformatics/btw152.
Heng Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18):3094-3100, May 2018. URL: https://doi.org/10.1093/bioinformatics/bty191.
Antoine Limasset, Jean-François Flot, and Pierre Peterlongo. Toward perfect reads: self-correction of short reads via mapping on de Bruijn graphs. Bioinformatics, 36(5):1374-1381, 2020. URL: https://doi.org/10.1093/bioinformatics/btz102.
Antoine Limasset, Guillaume Rizk, Rayan Chikhi, and Pierre Peterlongo. Fast and Scalable Minimal Perfect Hashing for Massive Key Sets. In 16th International Symposium on Experimental Algorithms (SEA 2017), volume 75 of Leibniz International Proceedings in Informatics (LIPIcs), pages 25:1-25:16, Dagstuhl, Germany, 2017. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. URL: https://doi.org/10.4230/LIPIcs.SEA.2017.25.
Camille Marchet, Zamin Iqbal, Daniel Gautheret, Mikaël Salson, and Rayan Chikhi. REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets. Bioinformatics, 36(Supplement_1):i177-i185, July 2020. URL: https://doi.org/10.1093/bioinformatics/btaa487.
Guillaume Marçais and Carl Kingsford. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27(6):764, March 2011. URL: https://doi.org/10.1093/bioinformatics/btr011.
Marmar Moussa and Ion I. Măndoiu. Single cell RNA-seq data clustering using TF-IDF based methods. BMC Genomics, 19(6):569, 2018. URL: https://doi.org/10.1186/s12864-018-4922-4.
Harun Mustafa, André Kahles, Mikhail Karasikov, and Gunnar Rätsch. Metannot: A succinct data structure for compression of colors in dynamic de Bruijn graphs. bioRxiv, page 236711, March 2018. URL: https://doi.org/10.1101/236711.
Ingo Müller, Peter Sanders, Robert Schulze, and Wei Zhou. Retrieval and Perfect Hashing Using Fingerprinting. In Joachim Gudmundsson and Jyrki Katajainen, editors, Experimental Algorithms, Lecture Notes in Computer Science, pages 138-149, Cham, 2014. Springer International Publishing. URL: https://doi.org/10.1007/978-3-319-07959-2_12.
Atif Rahman, Ingileif Hallgrímsdóttir, Michael Eisen, and Lior Pachter. Association mapping from sequencing reads using k-mers. eLife, 7:e32920, June 2018. URL: https://doi.org/10.7554/eLife.32920.
Guillaume Rizk, Dominique Lavenier, and Rayan Chikhi. DSK: k-mer counting with very low memory usage. Bioinformatics, 29(5):652-653, March 2013. URL: https://doi.org/10.1093/bioinformatics/btt020.
Michael Roberts, Wayne Hayes, Brian R. Hunt, Stephen M. Mount, and James A. Yorke. Reducing storage requirements for biological sequence comparison. Bioinformatics, 20(18):3363-3369, 2004. URL: https://doi.org/10.1093/bioinformatics/bth408.
K. Salikhov, G. Sacomoto, and G. Kucherov. Using cascading Bloom filters to improve the memory usage for de Brujin graphs. BMC Algorithms for Molecular Biology, 9(1):2, 2014. URL: http://www.almob.org/content/9/1/2.
Saul Schleimer, Daniel S. Wilkerson, and Alex Aiken. Winnowing: local algorithms for document fingerprinting. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, SIGMOD '03, pages 76-85, San Diego, California, 2003. Association for Computing Machinery. URL: https://doi.org/10.1145/872757.872770.
Moustafa Shokrof, C. Titus Brown, and Tamer A. Mansour. MQF and buffered MQF: Quotient filters for efficient storage of k-mers with their counts and metadata. bioRxiv, page 2020.08.23.263061, August 2020. URL: https://doi.org/10.1101/2020.08.23.263061.
Gregory E. Sims, Se-Ran Jun, Guohong A. Wu, and Sung-Hou Kim. Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proceedings of the National Academy of Sciences of the United States of America, 106(8):2677-2682, February 2009. URL: https://doi.org/10.1073/pnas.0813249106.
Derrick E. Wood and Steven L. Salzberg. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology, 15(3):R46, March 2014. URL: https://doi.org/10.1186/gb-2014-15-3-r46.
Huiguang Yi and Li Jin. Co-phylog: an assembly-free phylogenomic approach for closely related organisms. Nucleic Acids Research, 41(7):e75, 2013. URL: https://doi.org/10.1093/nar/gkt003.
Ye Yu, Djamal Belazzougui, Chen Qian, and Qin Zhang. Memory-efficient and Ultra-fast Network Lookup and Forwarding using Othello Hashing. arXiv:1608.05699 [cs], November 2017. URL: http://arxiv.org/abs/1608.05699.
Ye Yu, Jinpeng Liu, Xinan Liu, Yi Zhang, Eamonn Magner, Erik Lehnert, Chen Qian, and Jinze Liu. SeqOthello: querying RNA-seq experiments at scale. Genome Biology, 19(1):167, 2018. URL: https://doi.org/10.1186/s13059-018-1535-9.
Hongyu Zheng, Carl Kingsford, and Guillaume Marçais. Lower Density Selection Schemes via Small Universal Hitting Sets with Short Remaining Path Length. In Russell Schwartz, editor, Research in Computational Molecular Biology, Lecture Notes in Computer Science, pages 202-217, Cham, 2020. Springer International Publishing. URL: https://doi.org/10.1007/978-3-030-45257-5_13.

Space-Efficient Representation of Genomic k-Mer Count Tables

Authors Yoshihiro Shibuya, Djamal Belazzougui, Gregory Kucherov

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

References

Thanks for your feedback!

Could not send message

Space-Efficient Representation of Genomic k-Mer Count Tables

Authors Yoshihiro Shibuya, Djamal Belazzougui, Gregory Kucherov

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Funding

Supplementary Materials

References

Thanks for your feedback!

Could not send message