Efficient Reconciliation of Genomic Datasets of High Similarity

Shibuya, Yoshihiro; Belazzougui, Djamal; Kucherov, Gregory

doi:10.4230/LIPIcs.WABI.2022.14

Abstract

We apply Invertible Bloom Lookup Tables (IBLTs) to the comparison of k-mer sets originated from large DNA sequence datasets. We show that for similar datasets, IBLTs provide a more space-efficient and, at the same time, more accurate method for estimating Jaccard similarity of underlying k-mer sets, compared to MinHash which is a go-to sketching technique for efficient pairwise similarity estimation. This is achieved by combining IBLTs with k-mer sampling based on syncmers, which constitute a context-independent alternative to minimizers and provide an unbiased estimator of Jaccard similarity. A key property of our method is that involved data structures require space proportional to the difference of k-mer sets and are independent of the size of sets themselves. As another application, we show how our ideas can be applied in order to efficiently compute (an approximation of) k-mers that differ between two datasets, still using space only proportional to their number. We experimentally illustrate our results on both simulated and real data (SARS-CoV-2 and Streptococcus Pneumoniae genomes).

Mahdi Belbasi, Antonio Blanca, Robert S. Harris, David Koslicki, and Paul Medvedev. The minimizer Jaccard estimator is biased and inconsistent. bioRxiv, 2022. URL: https://doi.org/10.1101/2022.01.14.476226.
Timo Bingmann, Phelim Bradley, Florian Gauger, and Zamin Iqbal. COBS: A Compact Bit-Sliced Signature Index. In Nieves R. Brisaboa and Simon J. Puglisi, editors, String Processing and Information Retrieval, Lecture Notes in Computer Science, pages 285-303, Cham, 2019. Springer International Publishing. URL: https://doi.org/10.1007/978-3-030-32686-9_21.
Phelim Bradley, Henk C Den Bakker, Eduardo P. C. Rocha, Gil McVean, and Zamin Iqbal. Ultra-fast search of all deposited bacterial and viral genomic data. Nature biotechnology, 37(2):152-159, February 2019. URL: https://doi.org/10.1038/s41587-018-0010-1.
Andrei Z. Broder. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171), pages 21-29, June 1997. URL: https://doi.org/10.1109/SEQUEN.1997.666900.
Karel Břinda, Alanna Callendrello, Kevin C. Ma, Derek R. MacFadden, Themoula Charalampous, Robyn S. Lee, Lauren Cowley, Crista B. Wadsworth, Yonatan H. Grad, Gregory Kucherov, Justin O’Grady, Michael Baym, and William P. Hanage. Rapid inference of antibiotic resistance and susceptibility by genomic neighbour typing. Nature Microbiology, 5(3):455-464, March 2020. URL: https://doi.org/10.1038/s41564-019-0656-6.
Dan DeBlasio, Fiyinfoluwa Gbosibo, Carl Kingsford, and Guillaume Marçais. Practical universal k-mer sets for minimizer schemes. In Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB '19, pages 167-176, New York, NY, USA, 2019. Association for Computing Machinery. URL: https://doi.org/10.1145/3307339.3342144.
Martin Dietzfelbinger, Andreas Goerdt, Michael Mitzenmacher, Andrea Montanari, Rasmus Pagh, and Michael Rink. Tight thresholds for cuckoo hashing via xorsat. In Proceedings of the 37th International Colloquium Conference on Automata, Languages and Programming, ICALP'10, pages 213-225, Berlin, Heidelberg, 2010. Springer-Verlag.
Robert Edgar. Syncmers are more sensitive than minimizers for selecting conserved k‑mers in biological sequences. PeerJ, 9:e10805, February 2021. URL: https://doi.org/10.7717/peerj.10805.
Barış Ekim, Bonnie Berger, and Yaron Orenstein. A Randomized Parallel Algorithm for Efficiently Finding Near-Optimal Universal Hitting Sets. In Russell Schwartz, editor, Research in Computational Molecular Biology, Lecture Notes in Computer Science, pages 37-53, Cham, 2020. Springer International Publishing. URL: https://doi.org/10.1007/978-3-030-45257-5_3.
David Eppstein and Michael T. Goodrich. Straggler identification in round-trip data streams via Newton’s identities and invertible Bloom filters. IEEE Transactions on Knowledge and Data Engineering, 23(2):297-306, 2011. URL: https://doi.org/10.1109/TKDE.2010.132.
Huan Fan, Anthony R. Ives, Yann Surget-Groba, and Charles H. Cannon. An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC Genomics, 16(1):522, July 2015. URL: https://doi.org/10.1186/s12864-015-1647-5.
Michael T. Goodrich and Michael Mitzenmacher. Invertible Bloom lookup tables, 2011. URL: https://doi.org/10.48550/ARXIV.1101.2245.
Gaurav Gupta, Minghao Yan, Benjamin Coleman, R. A. Leo Elworth, Todd Treangen, and Anshumali Shrivastava. Sub-linear Sequence Search via a Repeated And Merged Bloom Filter (RAMBO): Indexing 170 TB data in 14 hours. arXiv:1910.04358 [cs, q-bio], December 2019. URL: http://arxiv.org/abs/1910.04358.
Robert S Harris and Paul Medvedev. Improved representation of sequence Bloom trees. Bioinformatics, 36(3):721-727, August 2019. URL: https://doi.org/10.1093/bioinformatics/btz662.
Chirag Jain, Luis M. Rodriguez-R, Adam M. Phillippy, Konstantinos T. Konstantinidis, and Srinivas Aluru. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nature Communications, 9(1):5114, November 2018. URL: https://doi.org/10.1038/s41467-018-07641-9.
Mikhail Karasikov, Harun Mustafa, Gunnar Rätsch, and André Kahles. Lossless indexing with counting de bruijn graphs. bioRxiv, 2022. URL: https://doi.org/10.1101/2021.11.09.467907.
Marek Kokot, Maciej Długosz, and Sebastian Deorowicz. KMC 3: counting and manipulating k-mer statistics. Bioinformatics, 33(17):2759-2761, May 2017. URL: https://doi.org/10.1093/bioinformatics/btx304.
Nathan LaPierre, Mohammed Alser, Eleazar Eskin, David Koslicki, and Serghei Mangul. Metalign: efficient alignment-based metagenomic profiling via containment min hash. Genome Biology, 21(1):242, September 2020. URL: https://doi.org/10.1186/s13059-020-02159-0.
Heng Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18):3094-3100, May 2018. URL: https://doi.org/10.1093/bioinformatics/bty191.
Camille Marchet, Christina Boucher, Simon J. Puglisi, Paul Medvedev, Mikaël Salson, and Rayan Chikhi. Data structures based on k-mers for querying large collections of sequencing data sets. Genome Research, 31(1):1-12, January 2021. URL: https://doi.org/10.1101/gr.260604.119.
Guillaume Marçais and Carl Kingsford. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27(6):764, March 2011. URL: https://doi.org/10.1093/bioinformatics/btr011.
Michael Molloy. The pure literal rule threshold and cores in random hypergraphs. In Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms, SODA '04, pages 672-681, USA, January 2004. Society for Industrial and Applied Mathematics.
Martin D Muggli, Bahar Alipanahi, and Christina Boucher. Building large updatable colored de Bruijn graphs via merging. Bioinformatics, 35(14):i51-i60, July 2019. URL: https://doi.org/10.1093/bioinformatics/btz350.
Brian D. Ondov, Todd J. Treangen, Páll Melsted, Adam B. Mallonee, Nicholas H. Bergman, Sergey Koren, and Adam M. Phillippy. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology, 17(1):132, June 2016. URL: https://doi.org/10.1186/s13059-016-0997-x.
Giulio Ermanno Pibiri. Sparse and skew hashing of k-mers. bioRxiv, 2022. URL: https://doi.org/10.1101/2022.01.15.476199.
Ely Porat and Ohad Lipsky. Improved sketching of hamming distance with error correcting. In Bin Ma and Kaizhong Zhang, editors, Combinatorial Pattern Matching, pages 173-182, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg.
Guillaume Rizk, Dominique Lavenier, and Rayan Chikhi. DSK: k-mer counting with very low memory usage, March 2013. URL: https://doi.org/10.1093/bioinformatics/btt020.
Michael Roberts, Wayne Hayes, Brian R. Hunt, Stephen M. Mount, and James A. Yorke. Reducing storage requirements for biological sequence comparison. Bioinformatics, 20(18):3363-3369, December 2004. URL: https://doi.org/10.1093/bioinformatics/bth408.
Kamil Salikhov, Gustavo Sacomoto, and Gregory Kucherov. Using cascading Bloom filters to improve the memory usage for de Brujin graphs. BMC Algorithms for Molecular Biology, 9(1):2, 2014. URL: http://www.almob.org/content/9/1/2.
Saul Schleimer, Daniel S. Wilkerson, and Alex Aiken. Winnowing: local algorithms for document fingerprinting. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, SIGMOD '03, pages 76-85, San Diego, California, June 2003. Association for Computing Machinery. URL: https://doi.org/10.1145/872757.872770.

Efficient Reconciliation of Genomic Datasets of High Similarity

Authors Yoshihiro Shibuya , Djamal Belazzougui, Gregory Kucherov

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

References

Thanks for your feedback!

Could not send message

Efficient Reconciliation of Genomic Datasets of High Similarity

Authors Yoshihiro Shibuya , Djamal Belazzougui, Gregory Kucherov

File

Document Identifiers

Related Versions

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Supplementary Materials

References

Thanks for your feedback!

Could not send message