MEM-Based Pangenome Indexing for k-mer Queries

Authors: Stephen Hwang, Nathaniel K. Brown, Omar Y. Ahmed, Katharine M. Jenike, Sam Kovaka, Michael C. Schatz, Ben Langmead




File

LIPIcs.WABI.2024.4.pdf
  • Filesize: 1.98 MB
  • 17 pages

Author Details

Stephen Hwang
  • XDBio Program, Johns Hopkins University, Baltimore, MD, USA
Nathaniel K. Brown
  • Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
Omar Y. Ahmed
  • Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
Katharine M. Jenike
  • Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
Sam Kovaka
  • Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
Michael C. Schatz
  • Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
Ben Langmead
  • Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA

Acknowledgements

We thank Christina Boucher for helpful conversations.

Cite As

Stephen Hwang, Nathaniel K. Brown, Omar Y. Ahmed, Katharine M. Jenike, Sam Kovaka, Michael C. Schatz, and Ben Langmead. MEM-Based Pangenome Indexing for k-mer Queries. In 24th International Workshop on Algorithms in Bioinformatics (WABI 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 312, pp. 4:1-4:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.WABI.2024.4

Abstract

Pangenomes are growing in number and size, thanks to the prevalence of high-quality long-read assemblies. However, current methods for studying sequence composition and conservation within pangenomes have limitations. Methods based on graph pangenomes require a computationally expensive multiple-alignment step, which can leave out some variation. Indexes based on k-mers and de Bruijn graphs are limited to answering questions at a specific substring length k. We present Maximal Exact Match Ordered (MEMO), a pangenome indexing method based on maximal exact matches (MEMs) between sequences. A single MEMO index can handle arbitrary-length queries over pangenomic windows. MEMO supports two kinds of query: membership queries, which test k-mer presence/absence, and conservation queries, which count the number of genomes containing each k-mer in a window. MEMO’s index for a pangenome of 89 human autosomal haplotypes fits in 2.04 GB, 8.8× smaller than a comparable KMC3 index and 11.4× smaller than a PanKmer index. MEMO indexes can be made smaller by sacrificing some counting resolution, with our decile-resolution HPRC index reaching 0.67 GB. MEMO can conduct a conservation query for 31-mers over the human leukocyte antigen locus in 13.89 seconds, 2.5× faster than other approaches. MEMO’s small index size, lack of k-mer length dependence, and efficient queries make it a flexible tool for studying and visualizing substring conservation in pangenomes.
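
To illustrate why MEMs allow queries at any k, the following minimal Python sketch relies on one property: a k-mer of a pivot genome occurs in another genome exactly when it lies entirely inside some MEM between the two. This is not the MEMO implementation. The interval representation (half-open MEM coordinates on the pivot), the function names kmer_present and conservation, and the choice to exclude the pivot from the counts are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: a MEM is a half-open interval [start, end)
# on the pivot genome that matches the other genome exactly.
Interval = Tuple[int, int]


def kmer_present(mems: List[Interval], pos: int, k: int) -> bool:
    """Membership: is the pivot k-mer starting at pos inside some MEM?"""
    return any(start <= pos and pos + k <= end for start, end in mems)


def conservation(mems_per_genome: List[List[Interval]],
                 window: Interval, k: int) -> List[int]:
    """Conservation: for each k-mer start in the window, count the
    non-pivot genomes whose MEMs fully contain that k-mer."""
    start, end = window
    return [
        sum(kmer_present(mems, pos, k) for mems in mems_per_genome)
        for pos in range(start, end - k + 1)
    ]


# Toy data: genome A matches the pivot across [0, 10);
# genome B differs from the pivot at position 4 only.
mems_per_genome = [[(0, 10)], [(0, 4), (5, 10)]]
counts = conservation(mems_per_genome, window=(0, 10), k=4)
print(counts)
```

The same interval lists answer the query for any k without rebuilding anything, which is the property the abstract contrasts with fixed-k indexes. A linear scan over intervals is used here only for brevity; it says nothing about how MEMO stores or searches its intervals.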

Subject Classification

ACM Subject Classification
  • Applied computing → Computational genomics
Keywords
  • Pangenomics
  • Comparative genomics
  • Compressed indexing
