MEM-Based Pangenome Indexing for k-mer Queries

Hwang, Stephen; Brown, Nathaniel K.; Ahmed, Omar Y.; Jenike, Katharine M.; Kovaka, Sam; Schatz, Michael C.; Langmead, Ben

doi:10.4230/LIPIcs.WABI.2024.4

File

LIPIcs.WABI.2024.4.pdf

Filesize: 1.98 MB
17 pages

Document Identifiers

DOI: 10.4230/LIPIcs.WABI.2024.4
URN: urn:nbn:de:0030-drops-206482

Author Details

Stephen Hwang

XDBio Program, Johns Hopkins University, Baltimore, MD, USA

Nathaniel K. Brown

Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA

Omar Y. Ahmed

Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA

Katharine M. Jenike

Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA

Sam Kovaka

Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA

Michael C. Schatz

Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA

Ben Langmead

Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA

Acknowledgements

We thank Christina Boucher for helpful conversations.

Cite As

Stephen Hwang, Nathaniel K. Brown, Omar Y. Ahmed, Katharine M. Jenike, Sam Kovaka, Michael C. Schatz, and Ben Langmead. MEM-Based Pangenome Indexing for k-mer Queries. In 24th International Workshop on Algorithms in Bioinformatics (WABI 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 312, pp. 4:1-4:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.WABI.2024.4

Abstract

Pangenomes are growing in number and size, thanks to the prevalence of high-quality long-read assemblies. However, current methods for studying sequence composition and conservation within pangenomes have limitations. Methods based on graph pangenomes require a computationally expensive multiple-alignment step, which can leave out some variation. Indexes based on k-mers and de Bruijn graphs are limited to answering questions at a specific substring length k. We present Maximal Exact Match Ordered (MEMO), a pangenome indexing method based on maximal exact matches (MEMs) between sequences. A single MEMO index can handle arbitrary-length queries over pangenomic windows. MEMO supports both membership queries, which test k-mer presence/absence, and conservation queries, which count the number of genomes containing k-mers in a window. MEMO’s index for a pangenome of 89 human autosomal haplotypes fits in 2.04 GB, 8.8× smaller than a comparable KMC3 index and 11.4× smaller than a PanKmer index. MEMO indexes can be made smaller by sacrificing some counting resolution, with our decile-resolution HPRC index reaching 0.67 GB. MEMO can conduct a conservation query for 31-mers over the human leukocyte antigen locus in 13.89 seconds, 2.5× faster than other approaches. MEMO’s small index size, lack of k-mer length dependence, and efficient queries make it a flexible tool for studying and visualizing substring conservation in pangenomes.
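
To make the two query types concrete, the following is a minimal Python sketch of what a membership query and a conservation query return for a window. It is a naive set-based baseline meant only to illustrate the semantics, not MEMO's MEM-based index; the function names and toy sequences are hypothetical and do not come from the paper or the MEMO software.

```python
from typing import Dict, List


def kmers(seq: str, k: int) -> List[str]:
    """Return every length-k substring of seq, in left-to-right order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def membership_query(window: str, genome: str, k: int) -> List[bool]:
    """For each k-mer in the window, report whether it occurs in the genome."""
    genome_kmers = set(kmers(genome, k))
    return [kmer in genome_kmers for kmer in kmers(window, k)]


def conservation_query(window: str, genomes: Dict[str, str], k: int) -> List[int]:
    """For each k-mer in the window, count how many genomes contain it."""
    per_genome = [membership_query(window, g, k) for g in genomes.values()]
    # Sum the per-genome presence flags column by column (one column per k-mer).
    return [sum(column) for column in zip(*per_genome)]


# Toy data (hypothetical sequences, not taken from the paper).
window = "ACGTAC"
genomes = {
    "hap1": "TTACGTACGG",
    "hap2": "ACGTTTTT",
    "hap3": "GGGGTAGA",
}

# The same functions answer queries for any k, but this naive version
# rebuilds a k-mer set per genome for every query.
print(membership_query(window, genomes["hap2"], 3))
print(conservation_query(window, genomes, 3))
print(conservation_query(window, genomes, 4))
```

This baseline rescans every genome for each query; the abstract's point is that a single MEMO index answers the same questions for arbitrary k without building a separate structure per k-mer length.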

Subject Classification

ACM Subject Classification

Applied computing → Computational genomics

Keywords

Pangenomics
Comparative genomics
Compressed indexing

