An Average-Case Sublinear Exact Li and Stephens Forward Algorithm

Rosen, Yohei M.; Paten, Benedict J.

doi:10.4230/LIPIcs.WABI.2018.9

Abstract

Hidden Markov models of haplotype inheritance such as the Li and Stephens model allow for computationally tractable probability calculations using the forward algorithm as long as the representative reference panel used in the model is sufficiently small. Specifically, the monoploid Li and Stephens model and its variants are linear in reference panel size unless heuristic approximations are used. However, sequencing projects numbering in the thousands to hundreds of thousands of individuals are underway, and others numbering in the millions are anticipated.
To make the Li and Stephens forward algorithm for these datasets computationally tractable, we have created a numerically exact version of the algorithm with observed average-case O(nk^{0.35}) runtime in number of genetic sites n and reference panel size k. This avoids any tradeoff between runtime and model complexity. We demonstrate that our approach also provides a succinct data structure for general-purpose haplotype data storage. We discuss generalizations of our algorithmic techniques to other hidden Markov models.
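
For context, the linear dependence on reference panel size mentioned above can be seen in a minimal sketch of the textbook forward recursion for a simplified haploid Li and Stephens model. This is the O(nk) baseline, not the sublinear algorithm the abstract describes. The function name `li_stephens_forward`, the parameters `rho` (per-site recombination probability) and `mu` (per-site mismatch probability), and the use of a single constant value for each are illustrative assumptions, not taken from the paper.

```python
def li_stephens_forward(panel, query, rho=0.01, mu=0.001):
    """Return P(query | panel) under a simplified haploid Li and Stephens model.

    panel: list of k reference haplotypes, each a sequence of n alleles.
    query: sequence of n alleles.
    rho:   assumed constant probability of recombining between adjacent sites.
    mu:    assumed constant probability that the query differs from the
           haplotype it is copying at a site.
    """
    k = len(panel)
    n = len(query)
    if k == 0 or any(len(h) != n for h in panel):
        raise ValueError("panel must be non-empty and match the query length")
    if n == 0:
        return 1.0

    def emit(j, i):
        # Emission probability of the query allele given copying haplotype j.
        return 1.0 - mu if panel[j][i] == query[i] else mu

    # Initial step: each haplotype is equally likely to be copied.
    f = [emit(j, 0) / k for j in range(k)]

    for i in range(1, n):
        total = sum(f)  # shared recombination term, computed once per site
        # Each site touches all k haplotypes, giving O(nk) work overall.
        f = [
            emit(j, i) * ((1.0 - rho) * f[j] + rho * total / k)
            for j in range(k)
        ]
    # Note: long sequences would need rescaling or log-space arithmetic
    # to avoid floating-point underflow; this sketch omits that.
    return sum(f)
```

The inner list comprehension visits every haplotype at every site, which is the cost that grows linearly with k and becomes limiting for panels of the sizes the abstract mentions.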

Cite As

Yohei M. Rosen and Benedict J. Paten. An Average-Case Sublinear Exact Li and Stephens Forward Algorithm. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 9:1-9:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018) https://doi.org/10.4230/LIPIcs.WABI.2018.9

Author Details

Yohei M. Rosen

University of California, Santa Cruz, California; New York University School of Medicine, New York, New York

Benedict J. Paten

University of California, Santa Cruz, California

Funding

This work was supported by the National Human Genome Research Institute of the National Institutes of Health under Award Number 5U54HG007990, the National Heart, Lung, and Blood Institute of the National Institutes of Health under Award Number 1U01HL137183-01, and grants from the W. M. Keck Foundation and the Simons Foundation.

Rosen, Yohei M.: Yohei Rosen was supported in part by a Howard Hughes Medical Institute Medical Research Fellowship.

Supplementary Materials

https://github.com/yoheirosen/sublinear-Li-Stephens/
