Pangenomic Genotyping with the Marker Array

Authors Taher Mun, Naga Sai Kavya Vaddadi, Ben Langmead



File

LIPIcs.WABI.2022.19.pdf
  • Filesize: 0.88 MB
  • 17 pages

Author Details

Taher Mun
  • Johns Hopkins University, Baltimore MD, USA
  • Illumina, San Diego, USA
Naga Sai Kavya Vaddadi
  • Johns Hopkins University, Baltimore MD, USA
Ben Langmead
  • Johns Hopkins University, USA

Acknowledgements

We thank Massimiliano Rossi and Travis Gagie for many helpful discussions. We thank Margaret Gagie for her careful editing. Part of this research project was conducted using computational resources at the Maryland Advanced Research Computing Center (MARCC).

Cite As

Taher Mun, Naga Sai Kavya Vaddadi, and Ben Langmead. Pangenomic Genotyping with the Marker Array. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 19:1-19:17, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.WABI.2022.19

Abstract

We present a new method and software tool called rowbowt that applies a pangenome index to the problem of inferring genotypes from short-read sequencing data. The method uses a novel indexing structure called the marker array. Using the marker array, we can genotype variants with respect to large panels like the 1000 Genomes Project while avoiding the reference bias that results from aligning to a single linear reference. rowbowt can infer accurate genotypes using less time and memory than existing graph-based methods.

Subject Classification

ACM Subject Classification
  • Applied computing → Computational genomics
Keywords
  • Sequence alignment, indexing, genotyping
