Analysis of Min-Hashing for Variant Tolerant DNA Read Mapping

Authors: Jens Quedenfeld, Sven Rahmann

Cite As

Jens Quedenfeld and Sven Rahmann. Analysis of Min-Hashing for Variant Tolerant DNA Read Mapping. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 21:1-21:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)


DNA read mapping has become a ubiquitous task in bioinformatics. New technologies provide ever longer DNA reads (several thousand base pairs), although at comparatively high error rates (up to 15%). At the same time, the reference genome is increasingly regarded not as a simple string over ACGT but as a complex object containing known genetic variants in the population. Conventional indexes based on exact seed matches, in particular the suffix-array-based FM index, struggle with these changing conditions, so other methods are being considered; one such alternative is locality-sensitive hashing. Here we examine the question of whether including single nucleotide polymorphisms (SNPs) in a min-hashing index is beneficial. The answer depends on the population frequency of the SNP, and we analyze several models (from simple to complex) that provide precise answers to this question under various assumptions. Our results also provide sensitivity and specificity values for min-hashing-based read mappers and may be used to understand dependencies between the parameters of such methods. We hope that this article will provide a theoretical foundation for a new generation of read mappers.
  • read mapping
  • min-hashing
  • variant
  • SNP
  • analysis of algorithms



