Tight Bounds for Compressing Substring Samples

Bille, Philip; Fuglsang, Christian Mikkelsen; Gørtz, Inge Li

doi:10.4230/LIPIcs.CPM.2024.9

Subject Classification

ACM Subject Classification

Theory of computation → Pattern matching

Keywords

Compression
Algorithms
Lempel-Ziv

Abstract

We consider the problem of compressing a set of substrings sampled from a string and analyzing the size of the compression. Given a string S of length n, and integers d and m where n ≥ m ≥ 2d > 0, let SCS(S, m, d) be the string obtained by sequentially concatenating substrings of length m sampled regularly at intervals of d starting at position 1 in S. We consider the size of the LZ77 parsing of SCS(S, m, d) in relation to the size of the LZ77 parsing of S. This is motivated by genome sequencing, where the sampling process above is an idealization of short-read DNA sequencing. We show the following upper bound: |LZ77(SCS(S, m, d))| ≤ |LZ77(S)| + 2(n-m)/d. We also give a lower bound showing that this is tight. This improves previous results by Badkobeh et al. [ICTCS 2022] and closes the open problem of whether their bound can be improved. Another natural question is whether, assuming that all letters in S are part of a sample, it is always the case that |LZ77(S)| ≤ |LZ77(SCS(S, m, d))|. Surprisingly, we show that there is a family of strings such that |LZ77(SCS(S, m, d))| = |LZ77(S)| - 1.
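
The quantities in the abstract can be explored on small inputs with a brute-force sketch such as the one below. It is illustrative only and not taken from the paper. The function names scs and lz77_size are hypothetical, and two details are assumptions because the abstract does not fix them: a sample is kept only if it fits entirely inside S, and each LZ77 phrase is either a single fresh character or the longest prefix of the unparsed suffix that also starts at an earlier position (overlap allowed). The paper's exact LZ77 variant may differ, which would change the phrase counts.

```python
def scs(s: str, m: int, d: int) -> str:
    """Concatenate the length-m samples of s taken at 0-based offsets 0, d, 2d, ...

    Assumption: a sample is kept only if it lies entirely inside s.
    """
    n = len(s)
    if not (n >= m >= 2 * d > 0):
        raise ValueError("expected n >= m >= 2d > 0")
    return "".join(s[i:i + m] for i in range(0, n - m + 1, d))


def lz77_size(s: str) -> int:
    """Count the phrases of a greedy left-to-right parse of s.

    Assumption: a phrase is the longest prefix of the unparsed suffix that
    also starts at an earlier position (it may overlap itself), or a single
    character if no such prefix exists. Brute force, for small inputs only.
    """
    n = len(s)
    i = 0
    phrases = 0
    while i < n:
        longest = 0
        for j in range(i):  # every earlier starting position
            k = 0
            while i + k < n and s[j + k] == s[i + k]:
                k += 1
            longest = max(longest, k)
        i += max(longest, 1)  # a fresh character still forms one phrase
        phrases += 1
    return phrases


if __name__ == "__main__":
    S, m, d = "abcdefgh", 4, 2
    sampled = scs(S, m, d)
    bound = lz77_size(S) + 2 * (len(S) - m) / d  # right-hand side of the upper bound
    print(sampled, lz77_size(S), lz77_size(sampled), bound)
```

For S = abcdefgh with m = 4 and d = 2, the samples are abcd, cdef, and efgh, so the sketch parses abcdcdefefgh and compares its phrase count with the right-hand side |LZ77(S)| + 2(n-m)/d.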

Cite As

Philip Bille, Christian Mikkelsen Fuglsang, and Inge Li Gørtz. Tight Bounds for Compressing Substring Samples. In 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 296, pp. 9:1-9:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024) https://doi.org/10.4230/LIPIcs.CPM.2024.9

Author Details

Philip Bille

Technical University of Denmark, Lyngby, Denmark

Christian Mikkelsen Fuglsang

Technical University of Denmark, Lyngby, Denmark

Inge Li Gørtz

Technical University of Denmark, Lyngby, Denmark

References

Sultan Al Yami and Chun-Hsi Huang. LFastqC: A lossless non-reference-based FASTQ compressor. PLoS One, 14(11):e0224806, 2019.
Golnaz Badkobeh, Maxime Crochemore, and Chalita Toopsuwan. Computing the maximal-exponent repeats of an overlap-free string in linear time. In Proc. SPIRE, pages 61-72, 2012.
Golnaz Badkobeh, Sara Giuliani, Zsuzsanna Lipták, and Simon J. Puglisi. On compressing collections of substring samples. In Proc. 23rd ICTCS, pages 136-147, 2022.
Philip Bille, Patrick Hagge Cording, Johannes Fischer, and Inge Li Gørtz. Lempel-Ziv compression in a sliding window. In Proc. 28th CPM, volume 78, pages 15:1-15:11, 2017.
Shubham Chandak, Kedar Tatwawadi, Idoia Ochoa, Mikel Hernaez, and Tsachy Weissman. SPRING: a next-generation compressor for FASTQ data. Bioinformatics, 35(15):2674-2676, 2018.
Maxime Crochemore, Lucian Ilie, and William F. Smyth. A simple algorithm for computing the Lempel Ziv factorization. In Proc. DCC, pages 482-488, 2008.
Sebastian Deorowicz. FQSqueezer: k-mer-based compression of sequencing data. Sci. Rep., 10(1):578, 2020.
Robert Ekblom, Linnéa Smeds, and Hans Ellegren. Patterns of sequencing coverage bias revealed by ultra-deep sequencing of vertebrate mitochondria. BMC Genomics, 15:467, 2014.
Susan Fairley, Ernesto Lowy-Gallego, Emily Perry, and Paul Flicek. The International Genome Sample Resource (IGSR) collection of open human genomic variation resources. Nucleic Acids Res., 48(D1):D941-D947, 2019.
Paolo Ferragina and Giovanni Manzini. On compressing the textual web. In Proc. WSDM, pages 391-400, 2010.
Edward R. Fiala and Daniel H. Greene. Data compression with finite windows. Commun. ACM, 32(4):490-505, 1989.
Johannes Fischer, Travis Gagie, Paweł Gawrychowski, and Tomasz Kociumaka. Approximating LZ77 via small-space multiple-pattern matching. In Proc. ESA, pages 533-544, 2015.
Johannes Fischer, Tomohiro I, and Dominik Köppl. Lempel Ziv computation in small space (LZ-CISS). In Proc. CPM, pages 172-184, 2015.
Travis Gagie and Paweł Gawrychowski. Grammar-based compression in a streaming model. In Proc. LATA, pages 273-284, 2010.
Keisuke Goto and Hideo Bannai. Space efficient linear time Lempel-Ziv factorization for small alphabets. In Proc. DCC, pages 163-172, 2014.
Christopher Hoobin, Trey Kind, Christina Boucher, and Simon J. Puglisi. Fast and efficient compression of high-throughput sequencing reads. In Proc. 6th ACM-BCB, pages 325-334, 2015.
Christopher Hoobin, Simon J. Puglisi, and Justin Zobel. Relative Lempel-Ziv factorization for efficient storage and retrieval of web collections. Proc. VLDB Endow., 5(3):265-273, 2011.
Dominik Kempa and Simon J. Puglisi. Lempel-Ziv factorization: Simple, fast, practical. In Proc. ALENEX, pages 103-112, 2013.
Roman Kolpakov and Gregory Kucherov. Finding approximate repetitions under Hamming distance. Theor. Comput. Sci., 303(1):135-156, 2003.
Dmitry Kosolobov. Faster lightweight Lempel-Ziv parsing. In Proc. MFCS, pages 432-444, 2015.
Sebastian Kreft and Gonzalo Navarro. LZ77-like compression with fast random access. In Proc. DCC, pages 239-248, 2010.
Shanika Kuruppu, Simon J. Puglisi, and Justin Zobel. Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval. In Proc. SPIRE, pages 201-206, 2010.
Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Lightweight Lempel-Ziv parsing. In Proc. SEA, pages 139-150, 2013.
Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Linear time Lempel-Ziv factorization: Simple, fast, small. In Proc. CPM, pages 189-200, 2013.
Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Lempel-Ziv parsing in external memory. In Proc. DCC, pages 153-162, 2014.
Dominik Köppl and Kunihiko Sadakane. Lempel-Ziv computation in compressed space (LZ-CICS). In Proc. DCC, pages 3-12, 2016.
Niklas Jesper Larsson. Extended application of suffix trees to data compression. In Proc. DCC, pages 190-199, 1996.
Niklas Jesper Larsson and Alistair Moffat. Off-line dictionary-based compression. Proc. IEEE, 88(11):1722-1732, 2000.
Abraham Lempel and Jacob Ziv. On the complexity of finite sequences. IEEE Trans. Inform. Theory, 22(1):75-81, 1976.
Harris A. Lewin et al. Earth BioGenome Project: Sequencing life for the future of life. Proc. Natl. Acad. Sci. U.S.A, 115(17):4325-4333, 2018.
Harris A. Lewin et al. The Earth BioGenome Project 2020: Starting the clock. Proc. Natl. Acad. Sci. U.S.A, 119(4), 2022.
Craig G. Nevill-Manning and Ian H. Witten. Identifying hierarchical structure in sequences: A linear-time algorithm. JAIR, 7:67-82, 1997.
Genome 10K Community of Scientists. Genome 10K: A Proposal to Obtain Whole-Genome Sequence for 10 000 Vertebrate Species. J. Hered., 100(6):659-674, 2009.
Enno Ohlebusch and Simon Gog. Lempel-Ziv factorization revisited. In Proc. CPM, pages 15-26, 2011.
Daisuke Okanohara and Kunihiko Sadakane. An online algorithm for finding the longest previous factors. In Proc. ESA, pages 696-707, 2008.
Alberto Policriti and Nicola Prezza. Fast online Lempel-Ziv factorization in compressed space. In Proc. SPIRE, pages 13-20, 2015.
Alberto Policriti and Nicola Prezza. Computing LZ77 in run-compressed space. In Proc. DCC, pages 23-32, 2016.
Julian Shun and Fuyao Zhao. Practical parallel Lempel-Ziv factorization. In Proc. DCC, pages 123-132, 2013.
Tatiana Starikovskaya. Computing Lempel-Ziv factorization online. In Proc. MFCS, pages 789-799, 2012.
James Andrew Storer and Thomas Gregory Szymanski. Data compression via textual substitution. J. ACM, 29(4):928-951, 1982.
Jun'ichi Yamamoto, Tomohiro I, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda. Faster compact on-line Lempel-Ziv factorization. In Proc. 31st STACS, volume 25, pages 675-686, 2014.
En-Hui Yang and John C. Kieffer. Efficient universal lossless data compression algorithms based on a greedy sequential grammar transform - Part one: Without context models. IEEE Trans. Inform. Theory, 46(3):755-777, 2000.
Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory, 23(3):337-343, 1977.
Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate coding. IEEE Trans. Inform. Theory, 24(5):530-536, 1978.
