Tight Bounds for Compressing Substring Samples

Authors: Philip Bille, Christian Mikkelsen Fuglsang, Inge Li Gørtz



File

LIPIcs.CPM.2024.9.pdf
  • Filesize: 0.94 MB
  • 14 pages

Author Details

Philip Bille
  • Technical University of Denmark, Lyngby, Denmark
Christian Mikkelsen Fuglsang
  • Technical University of Denmark, Lyngby, Denmark
Inge Li Gørtz
  • Technical University of Denmark, Lyngby, Denmark

Cite As

Philip Bille, Christian Mikkelsen Fuglsang, and Inge Li Gørtz. Tight Bounds for Compressing Substring Samples. In 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 296, pp. 9:1-9:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.CPM.2024.9

Abstract

We consider the problem of compressing a set of substrings sampled from a string and analyzing the size of the compression. Given a string S of length n, and integers d and m where n ≥ m ≥ 2d > 0, let SCS(S, m, d) be the string obtained by sequentially concatenating substrings of length m sampled regularly at intervals of d starting at position 1 in S. We consider the size of the LZ77 parsing of SCS(S, m, d) in relation to the size of the LZ77 parsing of S. This is motivated by genome sequencing, where the sampling process described above is an idealization of short-read DNA sequencing. We show the following upper bound: |LZ77(SCS(S, m, d))| ≤ |LZ77(S)| + 2(n-m)/d. We also give a lower bound showing that this is tight. This improves previous results by Badkobeh et al. [ICTCS 2022] and closes the open problem of whether their bound can be improved. Another natural question is whether, assuming that all letters in S are part of a sample, it is always the case that |LZ77(S)| ≤ |LZ77(SCS(S, m, d))|. Surprisingly, we show that there is a family of strings such that |LZ77(SCS(S, m, d))| = |LZ77(S)| - 1.
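
As a rough illustration of the quantities in the abstract, the following Python sketch builds SCS(S, m, d) and counts LZ77 phrases with a naive quadratic parser. The function names `scs` and `lz77_size` are hypothetical, and two modeling choices are assumptions made for this example rather than the paper's definitions: samples start at 0-based positions 0, d, 2d, ... and any tail not covered by a full-length sample is dropped, and each LZ77 phrase is either a single fresh letter or the longest prefix of the remaining text that also occurs starting at an earlier position (overlap allowed). Phrase counts under the paper's exact definitions may differ.

```python
def scs(s: str, m: int, d: int) -> str:
    """Concatenate the length-m substrings of s starting at 0, d, 2d, ...

    Assumption: only full-length samples are taken, so a tail of s that is
    not covered by any sample is dropped.
    """
    n = len(s)
    if not (n >= m >= 2 * d > 0):
        raise ValueError("expected n >= m >= 2d > 0")
    return "".join(s[i:i + m] for i in range(0, n - m + 1, d))


def lz77_size(s: str) -> int:
    """Number of phrases in a greedy LZ77-style parse of s (naive O(n^2) or worse).

    Assumption: a phrase is the longest prefix of s[i:] that also occurs
    starting at some earlier position j < i (the copy may overlap the
    phrase itself), or a single letter if no such prefix exists.
    """
    n = len(s)
    i = 0
    phrases = 0
    while i < n:
        best = 0
        for j in range(i):
            length = 0
            # j < i, so j + length < i + length < n stays in range.
            while i + length < n and s[j + length] == s[i + length]:
                length += 1
            best = max(best, length)
        i += max(best, 1)
        phrases += 1
    return phrases


if __name__ == "__main__":
    S, m, d = "abcdefgh", 4, 2
    T = scs(S, m, d)
    bound = lz77_size(S) + 2 * (len(S) - m) / d
    print(T, lz77_size(S), lz77_size(T), bound)
```

For this toy input the sampled string is `abcdcdefefgh`, which parses into 10 phrases under the assumed variant, against 8 phrases for S and an upper bound of 8 + 2(8-4)/2 = 12.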

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
Keywords
  • Compression
  • Algorithms
  • Lempel-Ziv

References

  1. Sultan Al Yami and Chun-Hsi Huang. LFastqC: A lossless non-reference-based FASTQ compressor. PLoS One, 14(11):e0224806, 2019.
  2. Golnaz Badkobeh, Maxime Crochemore, and Chalita Toopsuwan. Computing the maximal-exponent repeats of an overlap-free string in linear time. In Proc. SPIRE, pages 61-72, 2012.
  3. Golnaz Badkobeh, Sara Giuliani, Zsuzsanna Lipták, and Simon J. Puglisi. On compressing collections of substring samples. In Proc. 23rd ICTCS, pages 136-147, 2022.
  4. Philip Bille, Patrick Hagge Cording, Johannes Fischer, and Inge Li Gørtz. Lempel-Ziv compression in a sliding window. In Proc. 28th CPM, volume 78, pages 15:1-15:11, 2017.
  5. Shubham Chandak, Kedar Tatwawadi, Idoia Ochoa, Mikel Hernaez, and Tsachy Weissman. SPRING: a next-generation compressor for FASTQ data. Bioinformatics, 35(15):2674-2676, 2018.
  6. Maxime Crochemore, Lucian Ilie, and William F. Smyth. A simple algorithm for computing the Lempel Ziv factorization. In Proc. DCC, pages 482-488, 2008.
  7. Sebastian Deorowicz. FQSqueezer: k-mer-based compression of sequencing data. Sci. Rep., 10(1):578, 2020.
  8. Robert Ekblom, Linnéa Smeds, and Hans Ellegren. Patterns of sequencing coverage bias revealed by ultra-deep sequencing of vertebrate mitochondria. BMC Genomics, 15:467, 2014.
  9. Susan Fairley, Ernesto Lowy-Gallego, Emily Perry, and Paul Flicek. The International Genome Sample Resource (IGSR) collection of open human genomic variation resources. Nucleic Acids Res., 48(D1):D941-D947, 2019.
  10. Paolo Ferragina and Giovanni Manzini. On compressing the textual web. In Proc. WSDM, pages 391-400, 2010.
  11. Edward R. Fiala and Daniel H. Greene. Data compression with finite windows. Commun. ACM, 32(4):490-505, 1989.
  12. Johannes Fischer, Travis Gagie, Paweł Gawrychowski, and Tomasz Kociumaka. Approximating LZ77 via small-space multiple-pattern matching. In Proc. ESA, pages 533-544, 2015.
  13. Johannes Fischer, Tomohiro I, and Dominik Köppl. Lempel Ziv computation in small space (LZ-CISS). In Proc. CPM, pages 172-184, 2015.
  14. Travis Gagie and Paweł Gawrychowski. Grammar-based compression in a streaming model. In Proc. LATA, pages 273-284, 2010.
  15. Keisuke Goto and Hideo Bannai. Space efficient linear time Lempel-Ziv factorization for small alphabets. In Proc. DCC, pages 163-172, 2014.
  16. Christopher Hoobin, Trey Kind, Christina Boucher, and Simon J. Puglisi. Fast and efficient compression of high-throughput sequencing reads. In Proc. 6th ACM-BCB, pages 325-334, 2015.
  17. Christopher Hoobin, Simon J. Puglisi, and Justin Zobel. Relative Lempel-Ziv factorization for efficient storage and retrieval of web collections. Proc. VLDB Endow., 5(3):265-273, 2011.
  18. Dominik Kempa and Simon J. Puglisi. Lempel-Ziv factorization: Simple, fast, practical. In Proc. ALENEX, pages 103-112, 2013.
  19. Roman Kolpakov and Gregory Kucherov. Finding approximate repetitions under Hamming distance. Theor. Comput. Sci., 303(1):135-156, 2003.
  20. Dmitry Kosolobov. Faster lightweight Lempel-Ziv parsing. In Proc. MFCS, pages 432-444, 2015.
  21. Sebastian Kreft and Gonzalo Navarro. LZ77-like compression with fast random access. In Proc. DCC, pages 239-248, 2010.
  22. Shanika Kuruppu, Simon J. Puglisi, and Justin Zobel. Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval. In Proc. SPIRE, pages 201-206, 2010.
  23. Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Lightweight Lempel-Ziv parsing. In Proc. SEA, pages 139-150, 2013.
  24. Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Linear time Lempel-Ziv factorization: Simple, fast, small. In Proc. CPM, pages 189-200, 2013.
  25. Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Lempel-Ziv parsing in external memory. In Proc. DCC, pages 153-162, 2014.
  26. Dominik Köppl and Kunihiko Sadakane. Lempel-Ziv computation in compressed space (LZ-CICS). In Proc. DCC, pages 3-12, 2016.
  27. Niklas Jesper Larsson. Extended application of suffix trees to data compression. In Proc. DCC, pages 190-199, 1996.
  28. Niklas Jesper Larsson and Alistair Moffat. Off-line dictionary-based compression. Proc. IEEE, 88(11):1722-1732, 2000.
  29. Abraham Lempel and Jacob Ziv. On the complexity of finite sequences. IEEE Trans. Inform. Theory, 22(1):75-81, 1976.
  30. Harris A. Lewin et al. Earth BioGenome Project: Sequencing life for the future of life. Proc. Natl. Acad. Sci. U.S.A., 115(17):4325-4333, 2018.
  31. Harris A. Lewin et al. The Earth BioGenome Project 2020: Starting the clock. Proc. Natl. Acad. Sci. U.S.A., 119(4), 2022.
  32. Craig G. Nevill-Manning and Ian H. Witten. Identifying hierarchical structure in sequences: A linear-time algorithm. JAIR, 7:67-82, 1997.
  33. Genome 10K Community of Scientists. Genome 10K: A Proposal to Obtain Whole-Genome Sequence for 10 000 Vertebrate Species. J. Hered., 100(6):659-674, 2009.
  34. Enno Ohlebusch and Simon Gog. Lempel-Ziv factorization revisited. In Proc. CPM, pages 15-26, 2011.
  35. Daisuke Okanohara and Kunihiko Sadakane. An online algorithm for finding the longest previous factors. In Proc. ESA, pages 696-707, 2008.
  36. Alberto Policriti and Nicola Prezza. Fast online Lempel-Ziv factorization in compressed space. In Proc. SPIRE, pages 13-20, 2015.
  37. Alberto Policriti and Nicola Prezza. Computing LZ77 in run-compressed space. In Proc. DCC, pages 23-32, 2016.
  38. Julian Shun and Fuyao Zhao. Practical parallel Lempel-Ziv factorization. In Proc. DCC, pages 123-132, 2013.
  39. Tatiana Starikovskaya. Computing Lempel-Ziv factorization online. In Proc. MFCS, pages 789-799, 2012.
  40. James Andrew Storer and Thomas Gregory Szymanski. Data compression via textual substitution. J. ACM, 29(4):928-951, 1982.
  41. Jun'ichi Yamamoto, Tomohiro I, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda. Faster compact on-line Lempel-Ziv factorization. In Proc. 31st STACS, volume 25, pages 675-686, 2014.
  42. En-Hui Yang and John C. Kieffer. Efficient universal lossless data compression algorithms based on a greedy sequential grammar transform - Part one: Without context models. IEEE Trans. Inform. Theory, 46(3):755-777, 2000.
  43. Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory, 23(3):337-343, 1977.
  44. Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate coding. IEEE Trans. Inform. Theory, 24(5):530-536, 1978.