Compression of Multiple k-Mer Sets by Iterative SPSS Decomposition

Authors Kazushi Kitaya, Tetsuo Shibuya




Author Details

Kazushi Kitaya
  • Tokyo, Japan
Tetsuo Shibuya
  • Human Genome Center, Institute of Medical Science, The University of Tokyo, Japan

Cite As

Kazushi Kitaya and Tetsuo Shibuya. Compression of Multiple k-Mer Sets by Iterative SPSS Decomposition. In 21st International Workshop on Algorithms in Bioinformatics (WABI 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 201, pp. 12:1-12:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.WABI.2021.12

Abstract

A set of k-mers is used in many bioinformatics tasks, and much work has been done on methods to efficiently represent or compress a single set of k-mers. However, methods for compressing multiple k-mer sets have been less studied in spite of their obvious benefits for researchers and genome-related database maintainers. This paper proposes an algorithm to compress multiple k-mer sets, which works by iteratively splitting SPSS (spectrum-preserving string sets). In experiments with 3292 k-mer sets constructed from E. coli whole-genome sequencing data and 2555 k-mer sets constructed from human RNA-Seq data, the proposed algorithm could reduce the compressed file sizes by 34.7% and 13.2% respectively compared to one of the state-of-the-art colored de Bruijn graph representations. Also, our method used less memory than the colored de Bruijn graph method. This paper also introduces various methods to make the compression algorithm efficient in terms of time and memory, one of which is a parallelizable small-weight SPSS construction algorithm.
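To make the term SPSS concrete: a spectrum-preserving string set for a k-mer set is a collection of strings in which every k-mer of the set occurs exactly once and no other k-mer occurs. The sketch below is not the algorithm proposed in the paper. It is a minimal greedy construction for a single k-mer set, assuming plain strings over A, C, G, T and ignoring reverse complements; the function names `kmers`, `greedy_spss`, and `is_spss` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def kmers(s, k):
    """Return the set of all length-k substrings of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def greedy_spss(kmer_set, k):
    """Greedily merge overlapping k-mers into strings (illustrative only).

    Each k-mer is consumed exactly once, so the resulting strings
    contain every input k-mer once and no k-mer outside the input.
    """
    remaining = set(kmer_set)
    strings = []
    while remaining:
        seed = min(remaining)  # deterministic choice of starting k-mer
        remaining.remove(seed)
        s = seed
        # Extend to the right while an unused overlapping k-mer exists.
        extended = True
        while extended:
            extended = False
            for c in "ACGT":
                nxt = s[len(s) - k + 1:] + c
                if nxt in remaining:
                    remaining.remove(nxt)
                    s += c
                    extended = True
                    break
        # Extend to the left in the same way.
        extended = True
        while extended:
            extended = False
            for c in "ACGT":
                prev = c + s[:k - 1]
                if prev in remaining:
                    remaining.remove(prev)
                    s = c + s
                    extended = True
                    break
        strings.append(s)
    return strings


def is_spss(strings, kmer_set, k):
    """Check that each k-mer of kmer_set occurs exactly once in strings."""
    counts = Counter(
        s[i:i + k] for s in strings for i in range(len(s) - k + 1)
    )
    return set(counts) == set(kmer_set) and all(v == 1 for v in counts.values())


if __name__ == "__main__":
    k = 3
    example = kmers("ACGTACGGA", k)
    spss = greedy_spss(example, k)
    print(spss, sum(len(s) for s in spss), is_spss(spss, example, k))
```

The total length of the output strings (the weight of the SPSS) is never larger than storing each k-mer separately, which is why such representations are attractive as a starting point for compression.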

Subject Classification

ACM Subject Classification
  • Applied computing → Molecular sequence analysis
Keywords
  • sequencing data
  • k-mer
  • de Bruijn graph
  • compression
  • colored de Bruijn graph
