Compression of Multiple k-Mer Sets by Iterative SPSS Decomposition

Kitaya, Kazushi; Shibuya, Tetsuo

doi:10.4230/LIPIcs.WABI.2021.12

Abstract

k-mer sets are used in many bioinformatics tasks, and much work has gone into representing or compressing a single k-mer set efficiently. Methods for compressing multiple k-mer sets have been studied less, despite their clear benefits for researchers and maintainers of genome-related databases. This paper proposes an algorithm that compresses multiple k-mer sets by iteratively splitting SPSS (spectrum-preserving string sets). In experiments with 3292 k-mer sets constructed from E. coli whole-genome sequencing data and 2555 k-mer sets constructed from human RNA-Seq data, the proposed algorithm reduced compressed file sizes by 34.7% and 13.2%, respectively, compared to one of the state-of-the-art colored de Bruijn graph representations. The method also used less memory than the colored de Bruijn graph method. The paper additionally introduces several techniques that make the compression algorithm efficient in time and memory, one of which is a parallelizable small-weight SPSS construction algorithm.
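
As background for the term SPSS used above: a spectrum-preserving string set for a k-mer set is a set of strings, each of length at least k, in which every k-mer of the set occurs exactly once and no other k-mer occurs. The Python sketch below illustrates that definition only. It is not the compression algorithm or the parallelizable SPSS construction proposed in the paper; it uses a simple greedy extension, ignores reverse complements, and assumes k >= 2 over the alphabet ACGT. The function names `kmers`, `greedy_spss`, and `is_spss` are hypothetical and chosen for illustration.

```python
from collections import Counter


def kmers(s, k):
    """Return the set of all length-k substrings of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def greedy_spss(kmer_set, k, alphabet="ACGT"):
    """Greedily build an SPSS for kmer_set (illustrative heuristic only).

    Assumes k >= 2 and treats k-mers as plain strings (no reverse
    complements). Each unused k-mer seeds a string that is extended to
    the right, then to the left, while an unused overlapping k-mer exists.
    """
    remaining = set(kmer_set)
    strings = []
    while remaining:
        seed = min(remaining)  # deterministic choice of the next seed
        remaining.remove(seed)
        s = seed
        # Extend to the right: the next k-mer shares the last k-1 characters.
        extended = True
        while extended:
            extended = False
            for c in alphabet:
                nxt = s[-(k - 1):] + c
                if nxt in remaining:
                    remaining.remove(nxt)
                    s += c
                    extended = True
                    break
        # Extend to the left: the previous k-mer shares the first k-1 characters.
        extended = True
        while extended:
            extended = False
            for c in alphabet:
                prv = c + s[:k - 1]
                if prv in remaining:
                    remaining.remove(prv)
                    s = c + s
                    extended = True
                    break
        strings.append(s)
    return strings


def is_spss(strings, kmer_set, k):
    """Check that every k-mer of kmer_set occurs exactly once in strings,
    that no other k-mer occurs, and that each string has length >= k."""
    counts = Counter(
        s[i:i + k] for s in strings for i in range(len(s) - k + 1)
    )
    return (
        all(len(s) >= k for s in strings)
        and set(counts) == set(kmer_set)
        and all(v == 1 for v in counts.values())
    )


if __name__ == "__main__":
    k = 3
    kmer_set = kmers("ACGTACGGT", k)
    spss = greedy_spss(kmer_set, k)
    print("SPSS strings:", spss)
    print("weight (total characters):", sum(len(s) for s in spss))
    print("naive size (k * number of k-mers):", k * len(kmer_set))
    print("valid SPSS:", is_spss(spss, kmer_set, k))
```

The weight of an SPSS is the total number of characters in its strings, so a smaller weight means a more compact representation of the same k-mer set than listing each k-mer separately.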

Cite As

Kazushi Kitaya and Tetsuo Shibuya. Compression of Multiple k-Mer Sets by Iterative SPSS Decomposition. In 21st International Workshop on Algorithms in Bioinformatics (WABI 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 201, pp. 12:1-12:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021) https://doi.org/10.4230/LIPIcs.WABI.2021.12

Author Details

Kazushi Kitaya

Tokyo, Japan

Tetsuo Shibuya

Human Genome Center, Institute of Medical Science, The University of Tokyo, Japan

Funding

This work was supported by JSPS KAKENHI Grants 17H01693 and 20K21827, and JST CREST Grant JPMJCR1402JST. The super-computing resource was provided by the Human Genome Center, the Institute of Medical Science, the University of Tokyo.

Shibuya, Tetsuo: https://orcid.org/0000-0003-1514-5766

Supplementary Materials

Software (Source Code): https://github.com/kkty/kmer-sets-compression

