Analysis of Min-Hashing for Variant Tolerant DNA Read Mapping

Quedenfeld, Jens; Rahmann, Sven

doi:10.4230/LIPIcs.WABI.2017.21

Keywords

read mapping
min-Hashing
variant
SNP
analysis of algorithms

Abstract

DNA read mapping has become a ubiquitous task in bioinformatics. New technologies provide ever longer DNA reads (several thousand base pairs), although at comparatively high error rates (up to 15%). At the same time, the reference genome is increasingly viewed not as a simple string over ACGT but as a complex object containing known genetic variants in the population. Conventional indexes based on exact seed matches, in particular the suffix-array-based FM index, struggle with these changing conditions, so other methods are being considered; one such alternative is locality-sensitive hashing. Here we examine whether including single nucleotide polymorphisms (SNPs) in a min-hashing index is beneficial. The answer depends on the population frequency of the SNP, and we analyze several models (from simple to complex) that provide precise answers to this question under various assumptions. Our results also provide sensitivity and specificity values for min-hashing-based read mappers and may be used to understand dependencies between the parameters of such methods. We hope that this article will provide a theoretical foundation for a new generation of read mappers.
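
As a rough illustration of the underlying idea, and not the index or models analyzed in the article, the following Python sketch compares the q-gram sets of a reference window and of a read using min-hashing. The function names, the example sequences, the q-gram length, the number of hash functions, and the use of salted BLAKE2b as a stand-in for a family of random hash functions are all assumptions made for this example. It shows that a read carrying the alternative allele of a SNP shares fewer q-grams with a reference window that contains only the reference allele, which is the effect that motivates asking whether the variant should be indexed as well.

```python
import hashlib


def qgrams(seq, q):
    """Return the set of all length-q substrings (q-grams) of seq."""
    return {seq[i:i + q] for i in range(len(seq) - q + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two non-empty sets."""
    return len(a & b) / len(a | b)


def hash_value(item, seed):
    """64-bit hash of a q-gram; the seed selects one hash function.

    Assumption: salted BLAKE2b stands in for a random hash family.
    """
    digest = hashlib.blake2b(
        item.encode("ascii"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def signature(items, num_hashes):
    """Min-hash signature: the minimum hash value under each hash function."""
    return [min(hash_value(x, seed) for x in items)
            for seed in range(num_hashes)]


def estimate(sig_a, sig_b):
    """Fraction of agreeing signature entries; estimates the Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Hypothetical example data: a reference window and the same window
# with a single substitution (a SNP) at position 10.
Q = 4
NUM_HASHES = 64
ref_window = "ACGTTGCATGCCGATAGCTA"
alt_window = ref_window[:10] + "T" + ref_window[11:]

ref_grams = qgrams(ref_window, Q)
alt_grams = qgrams(alt_window, Q)

ref_sig = signature(ref_grams, NUM_HASHES)
alt_sig = signature(alt_grams, NUM_HASHES)

print("exact Jaccard, ref vs alt:", jaccard(ref_grams, alt_grams))
print("min-hash estimate, ref vs alt:", estimate(ref_sig, alt_sig))
```

A single substitution changes up to Q of the q-grams in the window, so the exact Jaccard similarity drops below 1, and the min-hash estimate approximates that value with an accuracy that depends on the number of hash functions.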

Cite As

Jens Quedenfeld and Sven Rahmann. Analysis of Min-Hashing for Variant Tolerant DNA Read Mapping. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 21:1-21:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017) https://doi.org/10.4230/LIPIcs.WABI.2017.21

Author Details

Jens Quedenfeld

Sven Rahmann

