Toward Optimal Fingerprint Indexing for Large Scale Genomics

Agret, Clément; Cazaux, Bastien; Limasset, Antoine

doi:10.4230/LIPIcs.WABI.2022.25

File

LIPIcs.WABI.2022.25.pdf

Filesize: 1.27 MB
15 pages

Document Identifiers

DOI: 10.4230/LIPIcs.WABI.2022.25
URN: urn:nbn:de:0030-drops-170598

Author Details

Clément Agret

Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
LIRMM, Univ Montpellier, CNRS, Montpellier, France

Bastien Cazaux

Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France

Antoine Limasset

Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France

Acknowledgements

We thank Camille Marchet, Pierre Doignies, and the organizers and participants of the Bioinformatics: from Algorithms to Applications conference for their support and discussions on this project. The ANR SEQdigger supported this work.

Cite As

Clément Agret, Bastien Cazaux, and Antoine Limasset. Toward Optimal Fingerprint Indexing for Large Scale Genomics. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 25:1-25:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.WABI.2022.25

Abstract

Motivation. To keep up with the scale of genomic databases, several methods rely on locality-sensitive hashing to efficiently find potential matches within large genome collections. Existing solutions rely on MinHash or HyperLogLog fingerprints and require reading the whole index to perform a query. Such solutions cannot be considered scalable as the number of documents to index grows.

Results. We present NIQKI, a novel structure with well-designed fingerprints that lead to theoretical and practical query time improvements, outperforming the state of the art by orders of magnitude. Our contribution is threefold. First, we generalize the concept of HyperMinHash fingerprints into (h,m)-HMH fingerprints that can be tuned to present the lowest false positive rate given the expected sub-sampling applied. Second, we provide a structure based on inverted indexes that can index any kind of fingerprint and provides optimal queries, namely queries that run in time linear in the size of the output. Third, we implemented these approaches in a tool dubbed NIQKI that can index and calculate pairwise distances for over one million bacterial genomes from GenBank in a few days on a small cluster. We show that our approach can be orders of magnitude faster than the state of the art with comparable precision. We believe this approach can lead to tremendous improvements, allowing fast queries and scaling on extensive genomic databases.
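The inverted-index idea in the abstract can be illustrated with a minimal Python sketch. It is not NIQKI's implementation: the names `sketch` and `InvertedSketchIndex`, the BLAKE2b hash, the bucket count, and the plain b-bit min-hash fingerprint (used here in place of (h,m)-HMH fingerprints) are all assumptions chosen for illustration. The point it shows is that a query touches only the posting lists matching its own fingerprints, so the work is proportional to the number of reported hits rather than to the size of the index.

```python
import hashlib
from collections import defaultdict


def hash64(kmer):
    """64-bit hash of a k-mer (BLAKE2b is an arbitrary choice for this sketch)."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def sketch(seq, k=5, buckets=8, bits=8):
    """Partitioned min-hash: one small fingerprint per bucket, None if the bucket is empty.

    Assumption: a plain b-bit min-hash stands in for the paper's fingerprints.
    """
    mins = [None] * buckets
    for i in range(len(seq) - k + 1):
        h = hash64(seq[i:i + k])
        b, v = h % buckets, h // buckets
        if mins[b] is None or v < mins[b]:
            mins[b] = v
    # Keep only the low `bits` bits of each bucket minimum.
    return [None if v is None else v % (1 << bits) for v in mins]


class InvertedSketchIndex:
    """For each bucket, map a fingerprint value to the documents that hold it."""

    def __init__(self, buckets):
        self.postings = [defaultdict(list) for _ in range(buckets)]
        self.names = []

    def add(self, name, fingerprints):
        doc_id = len(self.names)
        self.names.append(name)
        for b, fp in enumerate(fingerprints):
            if fp is not None:
                self.postings[b][fp].append(doc_id)

    def query(self, fingerprints, min_hits=1):
        """Count shared fingerprints per document, visiting only matching posting lists."""
        hits = defaultdict(int)
        for b, fp in enumerate(fingerprints):
            if fp is None:
                continue
            for doc_id in self.postings[b].get(fp, ()):
                hits[doc_id] += 1
        found = [(self.names[d], c) for d, c in hits.items() if c >= min_hits]
        return sorted(found, key=lambda item: -item[1])


# Hand-written fingerprints make the behaviour easy to follow.
index = InvertedSketchIndex(buckets=4)
index.add("A", [1, 2, 3, 4])
index.add("B", [1, 9, 3, 7])
print(index.query([1, 2, 3, 0]))  # [('A', 3), ('B', 2)]
```

In this toy setting the number of shared fingerprints divided by the number of buckets gives a rough similarity estimate; a document that shares no fingerprint with the query is never visited.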

Subject Classification

ACM Subject Classification

Applied computing → Bioinformatics

Keywords

Data Structure
Indexation
Local Sensitive Hashing
Genomes
Databases
