
Toward Optimal Fingerprint Indexing for Large Scale Genomics

Authors Clément Agret, Bastien Cazaux, Antoine Limasset





Author Details

Clément Agret
  • Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
  • LIRMM, Univ Montpellier, CNRS, Montpellier, France
Bastien Cazaux
  • Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
Antoine Limasset
  • Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France

Acknowledgements

We want to thank Camille Marchet, Pierre Doignies, organizers and participants of the Bioinformatics: from Algorithms to Applications conference, for their support and discussions on this project. The ANR SEQdigger supported this work.

Cite As

Clément Agret, Bastien Cazaux, and Antoine Limasset. Toward Optimal Fingerprint Indexing for Large Scale Genomics. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 25:1-25:15, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.WABI.2022.25

Abstract

Motivation. To keep up with the scale of genomic databases, several methods rely on locality-sensitive hashing to efficiently find potential matches within large genome collections. Existing solutions rely on MinHash or HyperLogLog fingerprints and require reading the whole index to perform a query. Such solutions cannot be considered scalable as the number of documents to index keeps growing.

Results. We present NIQKI, a novel structure with well-designed fingerprints that lead to theoretical and practical query time improvements, outperforming the state of the art by orders of magnitude. Our contribution is threefold. First, we generalize the concept of HyperMinHash fingerprints into (h,m)-HMH fingerprints that can be tuned to present the lowest false positive rate given the expected sub-sampling applied. Second, we provide a structure, based on inverted indexes, that can index any kind of fingerprint and provides optimal queries, namely queries whose time is linear in the size of the output. Third, we implemented these approaches in a tool dubbed NIQKI that can index and calculate pairwise distances for over one million bacterial genomes from GenBank in a few days on a small cluster. We show that our approach can be orders of magnitude faster than the state of the art with comparable precision. We believe this approach can lead to tremendous improvements, allowing fast queries and scaling on extensive genomic databases.
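To illustrate the second contribution, the following is a minimal sketch of an inverted index over fixed-length fingerprint sketches. It is not the NIQKI implementation: the class name `InvertedFingerprintIndex`, its methods, the `min_hits` parameter, and the toy integer fingerprints are all assumptions made for illustration. The point it shows is that a query only visits the posting lists of matching fingerprints, so its cost grows with the number of hits reported rather than with the size of the whole index.

```python
from collections import defaultdict


class InvertedFingerprintIndex:
    """Toy inverted index with one fingerprint-to-documents table per sketch position.

    Hypothetical illustration only; names and layout are not taken from NIQKI.
    """

    def __init__(self, sketch_size):
        self.sketch_size = sketch_size
        # buckets[i][f] lists the documents whose i-th fingerprint equals f.
        self.buckets = [defaultdict(list) for _ in range(sketch_size)]

    def add(self, doc_id, sketch):
        """Register a document under each (position, fingerprint) pair of its sketch."""
        if len(sketch) != self.sketch_size:
            raise ValueError("sketch has the wrong length")
        for position, fingerprint in enumerate(sketch):
            self.buckets[position][fingerprint].append(doc_id)

    def query(self, sketch, min_hits=1):
        """Count shared fingerprints per document, visiting only matching posting lists."""
        if len(sketch) != self.sketch_size:
            raise ValueError("sketch has the wrong length")
        hits = defaultdict(int)
        for position, fingerprint in enumerate(sketch):
            # .get does not create an entry, so queries never grow the index.
            for doc_id in self.buckets[position].get(fingerprint, ()):
                hits[doc_id] += 1
        return {doc: count for doc, count in hits.items() if count >= min_hits}


# Toy usage with made-up integer fingerprints.
index = InvertedFingerprintIndex(sketch_size=4)
index.add("genome_a", [3, 7, 1, 9])
index.add("genome_b", [3, 2, 1, 4])
index.add("genome_c", [5, 6, 8, 0])
result = index.query([3, 7, 1, 0], min_hits=2)
```

In this toy setting, dividing a document's hit count by the sketch size gives the fraction of matching positions, which can serve as a rough similarity score between the query and that document.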

Subject Classification

ACM Subject Classification
  • Applied computing → Bioinformatics
Keywords
  • Data Structure
  • Indexation
  • Local Sensitive Hashing
  • Genomes
  • Databases

