Taxonomic Classification with Maximal Exact Matches in KATKA Kernels and Minimizer Digests

Authors: Dominika Draesslerová, Omar Ahmed, Travis Gagie, Jan Holub, Ben Langmead, Giovanni Manzini, Gonzalo Navarro




Author Details

Dominika Draesslerová
  • Czech Technical University in Prague, Czech Republic
Omar Ahmed
  • Johns Hopkins University, Baltimore, MD, USA
Travis Gagie
  • CeBiB & Dalhousie University, Halifax, Canada
Jan Holub
  • Czech Technical University in Prague, Czech Republic
Ben Langmead
  • Johns Hopkins University, Baltimore, MD, USA
Giovanni Manzini
  • University of Pisa, Italy
Gonzalo Navarro
  • CeBiB & DCC, University of Chile, Chile

Acknowledgements

Many thanks to Sana Kashgouli and Finlay Maguire for helpful discussions.

Cite As

Dominika Draesslerová, Omar Ahmed, Travis Gagie, Jan Holub, Ben Langmead, Giovanni Manzini, and Gonzalo Navarro. Taxonomic Classification with Maximal Exact Matches in KATKA Kernels and Minimizer Digests. In 22nd International Symposium on Experimental Algorithms (SEA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 301, pp. 10:1-10:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.SEA.2024.10

Abstract

For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use k-mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can
  • build an augmented FM-index over the genomes in the tree concatenated in left-to-right order;
  • for each MEM in a read, find the interval in the suffix array containing the starting positions of that MEM’s occurrences in those genomes;
  • find the minimum and maximum values stored in that interval;
  • take the lowest common ancestor (LCA) of the genomes containing the characters at those positions.

This solution is practical, however, only when the total size of the genomes in the tree is fairly small. In this paper we consider applying the same solution to three lossily compressed representations of the genomes' concatenation:
  • a KATKA kernel, which discards characters that are not in the first or last occurrence of any k_max-tuple, for a parameter k_max;
  • a minimizer digest;
  • a KATKA kernel of a minimizer digest.

With a test dataset and these three representations of it, simulated reads and various parameter settings, we checked how many reads' longest MEMs occurred only in the sequences from which those reads were generated ("true positive" reads). For some parameter settings we achieved significant compression while only slightly decreasing the true-positive rate.
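The four query steps listed in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: a plain sorted suffix array stands in for the augmented FM-index, the longest match is found by brute force, and the genomes, leaf names, tree, and all function names are hypothetical choices made for illustration. Because the genomes are concatenated in left-to-right leaf order, the LCA of the genomes holding the minimum and maximum positions is also the LCA of every genome containing the match.

```python
from bisect import bisect_left, bisect_right

def build_suffix_array(text):
    # Naive construction by sorting suffixes; suitable for toy inputs only.
    return sorted(range(len(text)), key=lambda i: text[i:])

def sa_interval(text, sa, pattern):
    # Half-open range [lo, hi) of suffix-array entries whose suffix starts
    # with `pattern`. Prefixes of sorted suffixes are themselves sorted.
    m = len(pattern)
    prefixes = [text[i:i + m] for i in sa]
    return bisect_left(prefixes, pattern), bisect_right(prefixes, pattern)

def longest_match(text, read):
    # Brute-force longest substring of `read` occurring in `text`.
    # The longest such substring cannot be extended, so it is a MEM.
    best = ""
    for i in range(len(read)):
        j = i + len(best) + 1
        while j <= len(read) and read[i:j] in text:
            best = read[i:j]
            j += 1
    return best

def lca(parent, a, b):
    # Lowest common ancestor using child -> parent pointers (root has none).
    ancestors = set()
    while a is not None:
        ancestors.add(a)
        a = parent.get(a)
    while b not in ancestors:
        b = parent[b]
    return b

def classify(genomes, leaf_names, parent, read):
    # Step 1: index the genomes concatenated in left-to-right leaf order.
    # "$" separators keep matches from spanning two genomes.
    text = "$".join(genomes)
    sa = build_suffix_array(text)
    starts, offset = [], 0
    for g in genomes:
        starts.append(offset)
        offset += len(g) + 1
    # Step 2: suffix-array interval of the read's longest MEM.
    mem = longest_match(text, read)
    if not mem:
        return None
    lo, hi = sa_interval(text, sa, mem)
    # Step 3: minimum and maximum starting positions in that interval.
    positions = sa[lo:hi]
    first, last = min(positions), max(positions)
    # Step 4: LCA of the genomes containing those two positions.
    left = leaf_names[bisect_right(starts, first) - 1]
    right = leaf_names[bisect_right(starts, last) - 1]
    return lca(parent, left, right)

# Hypothetical toy data: three genomes at leaves A, B, C of the tree
# root -> (X -> (A, B), C).
genomes = ["ACGTACGT", "ACGTTTGA", "GGCATCGA"]
leaf_names = ["A", "B", "C"]
parent = {"A": "X", "B": "X", "X": "root", "C": "root"}

print(classify(genomes, leaf_names, parent, "GTTTG"))  # occurs only in B
print(classify(genomes, leaf_names, parent, "ACGT"))   # occurs in A and B
print(classify(genomes, leaf_names, parent, "CG"))     # occurs in A, B and C
```

In this toy setting a read whose longest match occurs in a single genome is assigned to that leaf, which corresponds to the "true positive" case counted in the experiments, while matches shared across genomes push the answer up to an internal node or the root.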

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • Taxonomic classification
  • metagenomics
  • KATKA
  • maximal exact matches
  • string kernels
  • minimizer digests

