Spalter: A Meta Machine Learning Approach to Distinguish True DNA Variants from Sequencing Artefacts

Hartmann, Till; Rahmann, Sven

doi:10.4230/LIPIcs.WABI.2018.13

File

LIPIcs.WABI.2018.13.pdf

Filesize: 451 kB
8 pages

Document Identifiers

DOI: 10.4230/LIPIcs.WABI.2018.13
URN: urn:nbn:de:0030-drops-93158

Author Details

Till Hartmann

Genome Informatics, Institute of Human Genetics, University of Duisburg-Essen, University Hospital Essen, 45122 Essen, Germany

Sven Rahmann

Genome Informatics, Institute of Human Genetics, University of Duisburg-Essen, University Hospital Essen, 45122 Essen, Germany; Bioinformatics, Computer Science XI, TU Dortmund, Dortmund, Germany

Cite As

Till Hartmann and Sven Rahmann. Spalter: A Meta Machine Learning Approach to Distinguish True DNA Variants from Sequencing Artefacts. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 13:1-13:8, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)
https://doi.org/10.4230/LIPIcs.WABI.2018.13

Abstract

Distinguishing true DNA variants from technical sequencing artefacts is a fundamental task in whole-genome, exome, or targeted gene analysis. Variant calling tools provide diagnostic parameters, such as strand bias or an aggregated overall quality for each called variant, to help users make an informed choice about which variants to accept or discard. Having several such quality indicators poses a problem for users of variant callers because they need to set or adjust a threshold for each indicator. Alternatively, machine learning methods can be used to train a classifier based on these indicators. This approach needs large sets of labeled training data, which are not easily available. The new approach presented here relies on the idea that a true DNA variant exists independently of the technical features of the read in which it appears (e.g. base quality, strand, position in the read). Therefore, the nucleotide separability classification problem (predicting the nucleotide state of each read in a given pileup based on technical features only) should be nearly impossible to solve for true variants. Nucleotide separability, i.e. the achievable classification accuracy, can either be used directly to distinguish between true variants and technical artefacts via a threshold, or it can serve as a meta-feature to train a separability-based classifier. This article explores both possibilities with promising results, showing accuracies around 90%.
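The separability idea from the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the leave-one-out k-nearest-neighbour classifier, the function names `separability` and `is_probable_artefact`, the feature scaling, the threshold of 0.8, and the toy pileups are all assumptions chosen for illustration. Each read is described by three technical features (base quality scaled to [0, 1], strand as 0 or 1, relative position in the read), and the score is the fraction of reads whose nucleotide state is predicted correctly from those features alone.

```python
from collections import Counter
from math import dist


def separability(features, alleles, k=3):
    """Leave-one-out k-NN accuracy of predicting each read's nucleotide
    state from its technical features only (hypothetical stand-in for the
    'achievable classification accuracy' described in the abstract)."""
    n = len(features)
    correct = 0
    for i in range(n):
        # Rank all other reads by Euclidean distance in feature space.
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: dist(features[i], features[j]),
        )[:k]
        # Majority vote among the k nearest reads.
        votes = Counter(alleles[j] for j in neighbours)
        predicted = votes.most_common(1)[0][0]
        correct += predicted == alleles[i]
    return correct / n


def is_probable_artefact(score, threshold=0.8):
    """Thresholding rule; the value 0.8 is an illustrative assumption."""
    return score >= threshold


# Toy pileup resembling an artefact: all alternative-allele reads have low
# base quality, lie on the reverse strand, and sit near the read end.
artefact_features = [
    (0.90, 0, 0.20), (0.95, 0, 0.50), (0.85, 0, 0.70),  # ref, forward
    (0.90, 1, 0.30), (0.95, 1, 0.40), (0.90, 1, 0.60),  # ref, reverse
    (0.20, 1, 0.95), (0.25, 1, 0.90), (0.15, 1, 0.98), (0.20, 1, 0.92),  # alt
]
artefact_alleles = ["ref"] * 6 + ["alt"] * 4

# Toy pileup resembling a true heterozygous variant: both alleles occur on
# both strands, at all read positions, with the same base quality.
positions = [0.10, 0.25, 0.45, 0.70, 1.00]
variant_features = [(0.90, 0, p) for p in positions] + [(0.90, 1, p) for p in positions]
variant_alleles = [
    "ref", "ref", "alt", "ref", "alt",  # forward-strand reads
    "alt", "ref", "alt", "alt", "ref",  # reverse-strand reads
]

artefact_score = separability(artefact_features, artefact_alleles)
variant_score = separability(variant_features, variant_alleles)
```

On this toy data the artefact-like pileup is perfectly separable, while the variant-like pileup scores well below the assumed threshold. Instead of thresholding, the score could also be passed on as a single meta-feature to a downstream classifier, which is the second option the abstract mentions.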

Subject Classification

ACM Subject Classification

Applied computing → Sequencing and genotyping technologies
Computing methodologies → Unsupervised learning
Computing methodologies → Supervised learning by classification

Keywords

variant calling
sequencing error
technical artefact
meta machine learning
classification

