Taxonomic Classification with Maximal Exact Matches in KATKA Kernels and Minimizer Digests

Draesslerová, Dominika; Ahmed, Omar; Gagie, Travis; Holub, Jan; Langmead, Ben; Manzini, Giovanni; Navarro, Gonzalo

doi:10.4230/LIPIcs.SEA.2024.10

File

LIPIcs.SEA.2024.10.pdf

Filesize: 0.73 MB
13 pages

Document Identifiers

DOI: 10.4230/LIPIcs.SEA.2024.10
URN: urn:nbn:de:0030-drops-203756

Author Details

Dominika Draesslerová

Czech Technical University in Prague, Czech Republic

Omar Ahmed

Johns Hopkins University, Baltimore, MD, USA

Travis Gagie

CeBiB & Dalhousie University, Halifax, Canada

Jan Holub

Czech Technical University in Prague, Czech Republic

Ben Langmead

Johns Hopkins University, Baltimore, MD, USA

Giovanni Manzini

University of Pisa, Italy

Gonzalo Navarro

CeBiB & DCC, University of Chile, Chile

Acknowledgements

Many thanks to Sana Kashgouli and Finlay Maguire for helpful discussions.

Cite As

Dominika Draesslerová, Omar Ahmed, Travis Gagie, Jan Holub, Ben Langmead, Giovanni Manzini, and Gonzalo Navarro. Taxonomic Classification with Maximal Exact Matches in KATKA Kernels and Minimizer Digests. In 22nd International Symposium on Experimental Algorithms (SEA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 301, pp. 10:1-10:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.SEA.2024.10

Abstract

For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use k-mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can

- build an augmented FM-index over the genomes in the tree concatenated in left-to-right order;
- for each MEM in a read, find the interval in the suffix array containing the starting positions of that MEM's occurrences in those genomes;
- find the minimum and maximum values stored in that interval;
- take the lowest common ancestor (LCA) of the genomes containing the characters at those positions.

This solution is practical, however, only when the total size of the genomes in the tree is fairly small. In this paper we consider applying the same solution to three lossily compressed representations of the genomes' concatenation:

- a KATKA kernel, which discards characters that are not in the first or last occurrence of any k_max-tuple, for a parameter k_max;
- a minimizer digest;
- a KATKA kernel of a minimizer digest.

With a test dataset and these three representations of it, simulated reads and various parameter settings, we checked how many reads' longest MEMs occurred only in the sequences from which those reads were generated ("true positive" reads). For some parameter settings we achieved significant compression while only slightly decreasing the true-positive rate.
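The four-step procedure in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it swaps the augmented FM-index for a naively sorted suffix array, takes an already-found match as input instead of computing MEMs, and uses a hypothetical four-genome tree. The names `build_index`, `classify_match`, `lca`, the `$` separator and the example genomes are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right

def build_index(genomes):
    """Concatenate genomes in left-to-right leaf order; build a naive suffix array."""
    text = "".join(g + "$" for g in genomes)  # '$' separator is an assumption
    starts, pos = [], 0
    for g in genomes:
        starts.append(pos)          # offset where each genome begins in text
        pos += len(g) + 1
    # Naive construction (sorting whole suffixes), fine only for toy inputs.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, sa, starts

def lca(parent, a, b):
    """Lowest common ancestor by walking parent pointers (root has no entry)."""
    ancestors = set()
    while a is not None:
        ancestors.add(a)
        a = parent.get(a)
    while b not in ancestors:
        b = parent[b]
    return b

def classify_match(match, text, sa, starts, leaves, parent):
    """Return the LCA of all genomes containing `match`, or None if it is absent."""
    # Suffixes truncated to len(match) stay sorted, so bisect finds the SA interval.
    prefixes = [text[i:i + len(match)] for i in sa]
    lo, hi = bisect_left(prefixes, match), bisect_right(prefixes, match)
    if lo == hi:
        return None
    # Minimum and maximum text positions stored in the interval.
    first, last = min(sa[lo:hi]), max(sa[lo:hi])
    # Map each position to the genome (leaf) that contains it.
    left = leaves[bisect_right(starts, first) - 1]
    right = leaves[bisect_right(starts, last) - 1]
    return lca(parent, left, right)

# Hypothetical tree: root -> (A -> g0, g1), (B -> g2, g3)
leaves = ["g0", "g1", "g2", "g3"]
parent = {"g0": "A", "g1": "A", "g2": "B", "g3": "B", "A": "root", "B": "root"}
genomes = ["ACGTACGT", "ACGTTTGA", "GGCATTGA", "GGCACCCC"]
text, sa, starts = build_index(genomes)

print(classify_match("ACGT", text, sa, starts, leaves, parent))  # A
print(classify_match("TTGA", text, sa, starts, leaves, parent))  # root
print(classify_match("CCCC", text, sa, starts, leaves, parent))  # g3
```

Because the genomes are concatenated in leaf order, the leftmost and rightmost occurrences are enough: the LCA of those two leaves is also the LCA of every genome in between that contains the match.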

Subject Classification

ACM Subject Classification

Theory of computation → Pattern matching

Keywords

Taxonomic classification
metagenomics
KATKA
maximal exact matches
string kernels
minimizer digests
