Memoization on Shared Subtrees Accelerates Computations on Genealogical Forests

Authors: Lukas Hübner, Alexandros Stamatakis



Author Details

Lukas Hübner
  • Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Germany
  • Heidelberg Institute for Theoretical Studies, Germany
Alexandros Stamatakis
  • Institute of Computer Science, Foundation for Research and Technology-Hellas, Heraklion, Greece
  • Heidelberg Institute for Theoretical Studies, Germany
  • Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Germany

Cite As

Lukas Hübner and Alexandros Stamatakis. Memoization on Shared Subtrees Accelerates Computations on Genealogical Forests. In 24th International Workshop on Algorithms in Bioinformatics (WABI 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 312, pp. 5:1-5:22, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.WABI.2024.5

Abstract

The field of population genetics attempts to advance our understanding of evolutionary processes. It has applications, for example, in medical research, wildlife conservation, and, in conjunction with recent advances in ancient DNA sequencing technology, the study of human migration patterns over the past few thousand years. The basic toolbox of population genetics includes genealogical trees, which describe the shared evolutionary history among individuals of the same species. They are calculated on the basis of genetic variations. However, in recombining organisms, a single tree is insufficient to describe the evolutionary history of the whole genome. Instead, a collection of correlated trees can be used, where each tree describes the evolutionary history of a consecutive region of the genome. The current state-of-the-art data structure, tree sequences, compresses these genealogical trees via edit operations when moving from one tree to the next along the genome instead of storing the full, often redundant, description of each tree. We propose a new data structure, genealogical forests, which compresses the set of genealogical trees into a DAG. In this DAG, identical subtrees that are shared across the input trees are encoded only once, thereby allowing for straightforward memoization of intermediate results. Additionally, we provide a C++ implementation of our proposed data structure, called gfkit, which is 2.1 to 11.2 (median 4.0) times faster than the state-of-the-art tool on empirical and simulated datasets at computing important population genetics statistics such as the Allele Frequency Spectrum, Patterson’s f, the Fixation Index, Tajima’s D, pairwise Lowest Common Ancestors, and others. On Lowest Common Ancestor queries with more than two samples as input, gfkit scales asymptotically better than the state-of-the-art tool and is thus up to 990 times faster.
In conclusion, our proposed data structure compresses genealogical trees by storing shared subtrees only once, thereby enabling straightforward memoization of intermediate results. This yields a substantial runtime reduction and a potentially more intuitive data representation compared to the state of the art. Our improvements will boost the development of novel analyses and models in the field of population genetics and increase scalability to ever-growing genomic datasets.
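The core idea of the abstract, storing each shared subtree once and reusing intermediate results, can be illustrated with a minimal Python sketch. This is not gfkit's implementation or API: the `Forest` class, its method names, and the nested-tuple tree encoding are hypothetical, and the sketch treats two subtrees as identical when they have the same topology over the same samples, ignoring branch lengths and mutations. It counts, for every node, how many selected samples lie below it, which is the kind of intermediate result that allele-frequency-style statistics build on.

```python
from typing import Dict, List, Set, Tuple, Union

# Assumed encoding: a leaf is an int sample id, an inner node is a tuple of subtrees.
Tree = Union[int, tuple]


class Forest:
    """Hypothetical DAG container in which every distinct subtree is stored once."""

    def __init__(self) -> None:
        self.children: List[Tuple[int, ...]] = []  # node id -> child node ids
        self.sample: List[int] = []                # node id -> sample id, -1 for inner nodes
        self.roots: List[int] = []                 # one root node id per input tree
        self._index: Dict[tuple, int] = {}         # canonical subtree key -> node id

    def _intern(self, key: tuple, kids: Tuple[int, ...], sample: int) -> int:
        # Reuse the existing node if this exact subtree was seen before.
        node = self._index.get(key)
        if node is None:
            node = len(self.children)
            self._index[key] = node
            self.children.append(kids)
            self.sample.append(sample)
        return node

    def _add_subtree(self, tree: Tree) -> int:
        if isinstance(tree, int):
            return self._intern(("leaf", tree), (), tree)
        # Children are interned first, so every child id is smaller than its parent's id.
        kids = tuple(sorted(self._add_subtree(child) for child in tree))
        return self._intern(("inner", kids), kids, -1)

    def add_tree(self, tree: Tree) -> int:
        root = self._add_subtree(tree)
        self.roots.append(root)
        return root

    def samples_below(self, selected: Set[int]) -> List[int]:
        # One bottom-up pass in id order: each DAG node is evaluated exactly once,
        # and its result is reused by every tree that contains it.
        counts = [0] * len(self.children)
        for node, kids in enumerate(self.children):
            if kids:
                counts[node] = sum(counts[k] for k in kids)
            else:
                counts[node] = 1 if self.sample[node] in selected else 0
        return counts


forest = Forest()
root_a = forest.add_tree(((0, 1), (2, 3)))
root_b = forest.add_tree((((0, 1), 2), 3))  # shares the subtree (0, 1) and all leaves
counts = forest.samples_below({0, 1, 3})
print(len(forest.children), counts[root_a], counts[root_b])
```

The two input trees have seven nodes each, but the forest stores only nine nodes because the four leaves and the subtree `(0, 1)` are shared. The count for a shared node is computed once, no matter how many trees contain it.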

Subject Classification

ACM Subject Classification
  • Applied computing → Molecular sequence analysis
  • Applied computing → Bioinformatics
  • Applied computing → Population genetics
  • Applied computing → Computational genomics
Keywords
  • bioinformatics
  • population genetics
  • algorithms

