Efficient Large-Scale Text Precompression via Approximate LZ77 Parsings

Dinklage, Patrick

doi:10.4230/LIPIcs.SEA.2026.16

Abstract

The LZ77 compression scheme [Ziv and Lempel, 1977] is ubiquitous: it lies at the core of everyday general-purpose compressors such as gzip and zstd, and it also works behind the scenes of many applications, such as the compression of payloads transmitted over networks.
Computing the exact LZ77 parsing is largely solved in theory: it can be done in sublinear time and space, in compressed space, and in external memory, to name but a few scenarios. However, these approaches are often impractical for everyday use due to their intensive time or space requirements. Standard compressors tackle this issue with heuristics that go hand in hand with sophisticated encoding schemes to achieve very good compression quickly and in small space. Their drawback is that they only have a local view of the input (e.g., a sliding window) and may therefore miss long-range repetitions whose occurrences lie far apart.
In this work, we design and implement - in C++ and leveraging shared-memory parallelism - compression pipelines that first precompress the input using an approximate LZ77 parsing that takes care of long-range repetitions. The precompressed output is then handed to a standard compressor, which produces a succinct encoding of the remaining short and local repetitions. Similar approaches have been considered by [Kosolobov et al., 2020] and [Nalbach, 2024], using Relative Lempel-Ziv [Kuruppu et al., 2010] and string synchronizing sets [Kempa & Kociumaka, 2019], respectively.
We fill a gap by taking the route via prefix-free parsing [Boucher et al., 2019], using an intermediate result of [Hong et al., 2023]. On large repetitive inputs of tens of gigabytes, our pipelines are orders of magnitude faster than the state of the art for computing the exact LZ77 parsing, use less space than the input size, and still - despite producing more phrases - achieve the best overall compression in comparison to related work.
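
The two-stage idea can be illustrated with a small sketch. It is not the pipeline from the paper, which is implemented in C++ and built on prefix-free parsing. It is a toy in Python that assumes fixed-size, aligned blocks and a hash map to find long-range repeats. The names `precompress` and `decode`, the block size of 4096 bytes, and the phrase layout are hypothetical and chosen only for illustration. zlib stands in for the standard compressor: its DEFLATE format cannot reference data more than 32 KiB back, so a repetition 64 KiB away is invisible to it.

```python
import random
import zlib


def precompress(data: bytes, block: int = 4096):
    """Toy long-range pass (an assumption, not the paper's method):
    aligned blocks that occurred earlier become references,
    all other blocks are appended to a literal stream."""
    first_seen = {}        # block content -> position of its first occurrence
    literals = bytearray()
    phrases = []           # ("ref", source position, length) or ("lit", offset, length)
    for pos in range(0, len(data), block):
        chunk = data[pos:pos + block]
        src = first_seen.get(chunk)
        if src is not None:
            phrases.append(("ref", src, len(chunk)))
        else:
            first_seen[chunk] = pos
            phrases.append(("lit", len(literals), len(chunk)))
            literals += chunk
    return phrases, bytes(literals)


def decode(phrases, literals: bytes) -> bytes:
    """Rebuild the text by copying literals and resolving references."""
    out = bytearray()
    for kind, start, length in phrases:
        source = out if kind == "ref" else literals
        out += source[start:start + length]
    return bytes(out)


# 64 KiB of pseudo-random (practically incompressible) bytes, repeated once.
rng = random.Random(0)
unit = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
text = unit * 2

# Standard compressor alone: the second copy lies outside the 32 KiB window.
plain_size = len(zlib.compress(text, 9))

# Toy pipeline: long-range pass first, then the standard compressor.
phrases, literals = precompress(text)
piped_size = len(zlib.compress(literals, 9))

print("zlib only:          ", plain_size, "bytes")
print("precompress + zlib: ", piped_size, "bytes (phrase list not counted)")
```

In this sketch the literal stream holds only the first copy of the repeated unit, so the second stage has about half as much data to encode. The cost of storing the phrase list is ignored here; a real pipeline has to encode it as well.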

References

Uwe Baier. Linear-time suffix sorting - a new approach for suffix array construction. In 27th Annual Symposium on Combinatorial Pattern Matching (CPM), pages 23-1. Dagstuhl, 2016. URL: https://doi.org/10.4230/LIPIcs.CPM.2016.23.
Nico Bertram, Jonas Ellert, and Johannes Fischer. Lyndon words accelerate suffix sorting. In 29th European Symposium on Algorithms (ESA), pages 15-1. Dagstuhl, 2021. URL: https://doi.org/10.4230/LIPIcs.ESA.2021.15.
Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information. NCBI Virus. https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/. Accessed April 14, 2026.
Philip Bille, Patrick Hagge Cording, Johannes Fischer, and Inge Li Gørtz. Lempel-Ziv compression in a sliding window. In 28th Annual Symposium on Combinatorial Pattern Matching (CPM), volume 78 of LIPIcs, pages 15:1-15:11. Dagstuhl, 2017. URL: https://doi.org/10.4230/LIPIcs.CPM.2017.15.
Timo Bingmann, Andreas Eberle, and Peter Sanders. Engineering parallel string sorting. Algorithmica, 77(1):235-286, 2017. URL: https://doi.org/10.1007/s00453-015-0071-1.
Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, and Taher Mun. Prefix-free parsing for building big BWTs. Algorithms Mol. Biol., 14(1):13:1-13:15, 2019. URL: https://doi.org/10.1186/S13015-019-0148-5.
Michael Burrows and David Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
Robert Clausecker, Florian Kurpicz, and Etienne Palanga. Practical parallel block tree construction: First results. CoRR (accepted at SEA 2026), abs/2512.23314, 2025. URL: https://doi.org/10.48550/arXiv.2512.23314.
Yann Collet and Murray S. Kucherawy. Zstandard compression and the 'application/zstd' media type. RFC, 8878:1-45, 2021. URL: https://doi.org/10.17487/RFC8878.
Patrick Dinklage, Jonas Ellert, Johannes Fischer, Florian Kurpicz, and Marvin Löbel. Practical wavelet tree construction. Journal of Experimental Algorithms, 26:1.8:1-1.8:67, 2021. URL: https://doi.org/10.1145/3457197.
Patrick Dinklage, Johannes Fischer, and Nicola Prezza. top-k-compress. https://github.com/pdinklag/top-k-compress. Accessed April 14, 2026.
Patrick Dinklage, Johannes Fischer, and Nicola Prezza. Top-k frequent patterns in streams and parameterized-space LZ compression. In 22nd International Symposium on Experimental Algorithms (SEA), volume 301 of LIPIcs, pages 9:1-9:20. Dagstuhl, 2024. URL: https://doi.org/10.4230/LIPIcs.SEA.2024.9.
Jonas Ellert. Sublinear time Lempel-Ziv (LZ77) factorization. In 30th International Symposium on String Processing and Information Retrieval (SPIRE), volume 14240 of Lecture Notes in Computer Science, pages 171-187. Springer, 2023. URL: https://doi.org/10.1007/978-3-031-43980-3_14.
Paolo Ferragina and Gonzalo Navarro. Pizza & Chili corpus - compressed indexes and their testbeds. http://pizzachili.dcc.uchile.cl/texts.html. Accessed April 14, 2026.
Johannes Fischer, Travis Gagie, Pawel Gawrychowski, and Tomasz Kociumaka. Approximating LZ77 via small-space multiple-pattern matching. In 23rd European Symposium on Algorithms (ESA), volume 9294 of Lecture Notes in Computer Science, pages 533-544. Springer, 2015. URL: https://doi.org/10.1007/978-3-662-48350-3_45.
Pawel Gawrychowski, Maria Kosche, and Florin Manea. On the number of factors in the LZ-End factorization. In 30th International Symposium on String Processing and Information Retrieval (SPIRE), volume 14240 of Lecture Notes in Computer Science, pages 253-259. Springer, 2023. URL: https://doi.org/10.1007/978-3-031-43980-3_20.
Ilya Grebnov. libbsc. https://github.com/IlyaGrebnov/libbsc. Accessed April 14, 2026.
Ilya Grebnov. libsais. https://github.com/IlyaGrebnov/libsais. Accessed April 14, 2026.
Torben Hagerup. Sorting and searching on the word RAM. In 15th Annual Symposium on Theoretical Aspects of Computer Science (STACS), volume 1373 of Lecture Notes in Computer Science, pages 366-398. Springer, 1998. URL: https://doi.org/10.1007/BFb0028575.
Stefan Heule, Marc Nunkesser, and Alexander Hall. HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In Joint 2013 EDBT/ICDT Conferences, pages 683-692. ACM, 2013. URL: https://doi.org/10.1145/2452376.2452456.
Aaron Hong, Massimiliano Rossi, and Christina Boucher. PFP_LZ77. https://github.com/AaronHong1024/PFP_LZ77. Accessed April 14, 2026.
Aaron Hong, Massimiliano Rossi, and Christina Boucher. LZ77 via prefix-free parsing. In 25th Workshop on Algorithm Engineering and Experiments (ALENEX), pages 123-134. SIAM, 2023. URL: https://doi.org/10.1137/1.9781611977561.CH11.
Takumi Ideue, Takuya Mieno, Mitsuru Funakoshi, Yuto Nakashima, Shunsuke Inenaga, and Masayuki Takeda. On the approximation ratio of LZ-End to LZ77. In 28th International Symposium on String Processing and Information Retrieval (SPIRE), volume 12944 of Lecture Notes in Computer Science, pages 114-126. Springer, 2021. URL: https://doi.org/10.1007/978-3-030-86692-1_10.
Intel Corporation. Intel® oneAPI Threading Building Blocks. https://www.intel.com/content/www/us/en/developer/tools/oneapi/onetbb.html. Accessed April 14, 2026.
Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Linear time Lempel-Ziv factorization: Simple, fast, small. In 24th Annual Symposium on Combinatorial Pattern Matching (CPM), volume 7922 of Lecture Notes in Computer Science, pages 189-200. Springer, 2013. URL: https://doi.org/10.1007/978-3-642-38905-4_19.
Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Lempel-Ziv parsing in external memory. In 2014 Data Compression Conference (DCC), pages 153-162. IEEE, 2014. URL: https://doi.org/10.1109/DCC.2014.78.
Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM J. Res. Dev., 31(2):249-260, 1987. URL: https://doi.org/10.1147/rd.312.0249.
Dominik Kempa and Tomasz Kociumaka. String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure. In 51st Annual ACM Symposium on Theory of Computing (STOC), pages 756-767. ACM, 2019. URL: https://doi.org/10.1145/3313276.3316368.
Dominik Kempa and Tomasz Kociumaka. Lempel-Ziv (LZ77) factorization in sublinear time. In 65th Symposium on Foundations of Computer Science (FOCS), pages 2045-2055. IEEE, 2024. URL: https://doi.org/10.1109/FOCS61266.2024.00122.
Dominik Kempa and Barna Saha. An upper bound and linear-space queries on the LZ-End parsing. In ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2847-2866. SIAM, 2022. URL: https://doi.org/10.1137/1.9781611977073.111.
Dmitry Kosolobov, Daniel Valenzuela, Gonzalo Navarro, and Simon J. Puglisi. ReLZ. https://gitlab.com/dvalenzu/ReLZ. Accessed April 14, 2026.
Dmitry Kosolobov, Daniel Valenzuela, Gonzalo Navarro, and Simon J. Puglisi. Lempel-Ziv-like parsing in small space. Algorithmica, 82(11):3195-3215, 2020. URL: https://doi.org/10.1007/S00453-020-00722-6.
Sebastian Kreft and Gonzalo Navarro. LZ77-like compression with fast random access. In 2010 Data Compression Conference (DCC), pages 239-248. IEEE, 2010. URL: https://doi.org/10.1109/DCC.2010.29.
Shanika Kuruppu, Simon J. Puglisi, and Justin Zobel. Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval. In 17th International Symposium on String Processing and Information Retrieval (SPIRE), pages 201-206. Springer, 2010. URL: https://doi.org/10.1007/978-3-642-16321-0_20.
Tobias Maier, Peter Sanders, and Roman Dementiev. Concurrent hash tables: Fast and general(?)! ACM Trans. Parallel Comput., 5(4):16:1-16:32, 2019. URL: https://doi.org/10.1145/3309206.
Udi Manber and Eugene W. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935-948, 1993. URL: https://doi.org/10.1137/0222058.
Lukas Nalbach. lz77-sss. https://github.com/LukasNalbach/lz77-sss. Accessed April 14, 2026.
Lukas Nalbach. Implementing sublinear-time approximation algorithms for the Lempel-Ziv 77 factorization. Master’s thesis, TU Dortmund University, 2024. URL: https://doi.org/10.17877/DE290R-25750.
Ge Nong, Sen Zhang, and Wai Hong Chan. Linear suffix array construction by almost pure induced-sorting. In 2009 Data Compression Conference (DCC), pages 193-202. IEEE, 2009. URL: https://doi.org/10.1109/DCC.2009.42.
Tatsuya Ohno, Kensuke Sakai, Yoshimasa Takabatake, Tomohiro I, and Hiroshi Sakamoto. OnlineRlbwt. https://github.com/itomomoti/OnlineRlbwt. Accessed April 14, 2026.
Tatsuya Ohno, Kensuke Sakai, Yoshimasa Takabatake, Tomohiro I, and Hiroshi Sakamoto. A faster implementation of online RLBWT and its application to LZ77 parsing. J. Discrete Algorithms, 52-53:18-28, 2018. URL: https://doi.org/10.1016/J.JDA.2018.11.002.
OpenMP ARB. OpenMP. https://www.openmp.org. Accessed April 14, 2026.
Alberto Policriti and Nicola Prezza. LZ77 computation based on the run-length encoded BWT. Algorithmica, 80(7):1986-2011, 2018. URL: https://doi.org/10.1007/S00453-017-0327-Z.
Rolf Rabenseifner. Optimization of collective reduction operations. In 4th International Conference on Computational Science (ICCS), Lecture Notes in Computer Science, pages 1-9. Springer, 2004. URL: https://doi.org/10.1007/978-3-540-24685-5_1.
James A. Storer and Thomas G. Szymanski. Data compression via textual substitution. J. ACM, 29(4):928-951, 1982. URL: https://doi.org/10.1145/322344.322346.
Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory, 23(3):337-343, 1977. URL: https://doi.org/10.1109/TIT.1977.1055714.
