Document

**Published in:** LIPIcs, Volume 204, 29th Annual European Symposium on Algorithms (ESA 2021)

Grammar compression is, next to Lempel-Ziv (LZ77) and run-length Burrows-Wheeler transform (RLBWT), one of the most flexible approaches to representing and processing highly compressible strings. The main idea is to represent a text as a context-free grammar whose language is precisely the input string. This is called a straight-line grammar (SLG). An AVL grammar, proposed by Rytter [Theor. Comput. Sci., 2003], is a type of SLG that additionally satisfies the AVL property: the heights of parse trees for children of every nonterminal differ by at most one. In contrast to other SLG constructions, AVL grammars can be constructed from the LZ77 parsing in compressed time, 𝒪(z log n), where z is the size of the LZ77 parsing and n is the length of the input text. Despite these advantages, AVL grammars are thought to be too large to be practical.
We present a new technique for rapidly constructing a small AVL grammar from an LZ77 or LZ77-like parse. Our algorithm produces grammars that are always at least five times smaller than those produced by the original algorithm, and usually not more than double the size of grammars produced by the practical Re-Pair compressor [Larsson and Moffat, Proc. IEEE, 2000]. Our algorithm also achieves low peak RAM usage. By combining this algorithm with recent advances in approximating the LZ77 parsing, we show that our method has the potential to construct a run-length BWT in about one third of the time and peak RAM required by other approaches. Overall, we show that AVL grammars are surprisingly practical, opening the door to much faster construction of key compressed data structures.
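To make the definitions above concrete, here is a minimal Python sketch of a straight-line grammar and a check of the AVL property. It is a toy illustration only, not the construction algorithm from the paper; the `rules` dictionary and the function names `expand`, `height`, and `is_avl` are assumptions introduced for this example.

```python
# Toy straight-line grammar (SLG): every nonterminal has exactly one rule,
# which is either a single terminal character or a pair of symbols.
# The rule set below is hypothetical and chosen only for illustration.
rules = {
    "A": "a",
    "B": "b",
    "C": ("A", "B"),  # expands to "ab"
    "D": ("C", "A"),  # expands to "aba"
    "E": ("D", "C"),  # expands to "abaab"
    "F": ("E", "A"),  # expands to "abaaba", but violates the AVL property
}


def expand(sym):
    """Return the unique string generated by nonterminal `sym`."""
    rhs = rules[sym]
    if isinstance(rhs, str):
        return rhs
    left, right = rhs
    return expand(left) + expand(right)


def height(sym):
    """Height of the parse tree rooted at `sym` (terminal rules have height 0)."""
    rhs = rules[sym]
    if isinstance(rhs, str):
        return 0
    return 1 + max(height(rhs[0]), height(rhs[1]))


def is_avl(sym):
    """Check that child heights differ by at most one at every nonterminal."""
    rhs = rules[sym]
    if isinstance(rhs, str):
        return True
    left, right = rhs
    balanced = abs(height(left) - height(right)) <= 1
    return balanced and is_avl(left) and is_avl(right)


print(expand("E"), is_avl("E"))  # abaab True
print(expand("F"), is_avl("F"))  # abaaba False
```

The recursive helpers recompute heights repeatedly, which is fine for a toy grammar; a practical implementation would store the height with each nonterminal.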

Dominik Kempa and Ben Langmead. Fast and Space-Efficient Construction of AVL Grammars from the LZ77 Parsing. In 29th Annual European Symposium on Algorithms (ESA 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 204, pp. 56:1-56:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

@InProceedings{kempa_et_al:LIPIcs.ESA.2021.56,
  author = {Kempa, Dominik and Langmead, Ben},
  title = {{Fast and Space-Efficient Construction of AVL Grammars from the LZ77 Parsing}},
  booktitle = {29th Annual European Symposium on Algorithms (ESA 2021)},
  pages = {56:1--56:14},
  series = {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN = {978-3-95977-204-4},
  ISSN = {1868-8969},
  year = {2021},
  volume = {204},
  editor = {Mutzel, Petra and Pagh, Rasmus and Herman, Grzegorz},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address = {Dagstuhl, Germany},
  URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2021.56},
  URN = {urn:nbn:de:0030-drops-146373},
  doi = {10.4230/LIPIcs.ESA.2021.56},
  annote = {Keywords: grammar compression, straight-line program, SLP, AVL grammar, Lempel-Ziv compression, LZ77, dictionary compression}
}

**Published in:** LIPIcs, Volume 112, 26th Annual European Symposium on Algorithms (ESA 2018)

String attractors [STOC 2018] are combinatorial objects recently introduced to unify all known dictionary compression techniques in a single theory. A set $\Gamma \subseteq [1..n]$ is a $k$-attractor for a string $S \in \Sigma^n$ if and only if every distinct substring of $S$ of length at most $k$ has an occurrence crossing at least one of the positions in $\Gamma$. Finding the smallest $k$-attractor is NP-hard for $k \geq 3$, but polylogarithmic approximations can be found using reductions from dictionary compressors. It is easy to reduce the $k$-attractor problem to a set-cover instance where the string's positions are interpreted as sets of substrings. The main result of this paper is a much more powerful reduction based on the truncated suffix tree. Our new characterization of the problem leads to more efficient algorithms for string attractors: we show how to check the validity and minimality of a $k$-attractor in near-optimal time and how to quickly compute exact solutions. For example, we prove that a minimum 3-attractor can be found in $O(n)$ time when $|\Sigma| \in O(\sqrt[3+\epsilon]{\log n})$ for some constant $\epsilon > 0$, despite the problem being NP-hard for large $\Sigma$.
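The k-attractor definition above can be checked directly by brute force. The following Python sketch does exactly that; it is a naive polynomial-time check written for illustration, not the near-optimal verification algorithm from the paper. The function name `is_k_attractor` and the use of 1-based positions in `gamma` are assumptions made for this example.

```python
def is_k_attractor(s, gamma, k):
    """Naively test whether `gamma` (a set of 1-based positions) is a
    k-attractor of `s`: every distinct substring of length at most k must
    have at least one occurrence that covers a position in `gamma`."""
    n = len(s)
    positions = set(gamma)
    for length in range(1, min(k, n) + 1):
        seen = set()     # all distinct substrings of this length
        covered = set()  # those with an occurrence crossing gamma
        for i in range(n - length + 1):
            sub = s[i:i + length]
            seen.add(sub)
            # This occurrence occupies 1-based positions i+1 .. i+length.
            if any(p in positions for p in range(i + 1, i + length + 1)):
                covered.add(sub)
        if seen != covered:
            return False
    return True


print(is_k_attractor("abab", {2, 3}, 4))  # True
print(is_k_attractor("abab", {2}, 4))     # False: no occurrence of "a" covers position 2
```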

Dominik Kempa, Alberto Policriti, Nicola Prezza, and Eva Rotenberg. String Attractors: Verification and Optimization. In 26th Annual European Symposium on Algorithms (ESA 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 112, pp. 52:1-52:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

@InProceedings{kempa_et_al:LIPIcs.ESA.2018.52,
  author = {Kempa, Dominik and Policriti, Alberto and Prezza, Nicola and Rotenberg, Eva},
  title = {{String Attractors: Verification and Optimization}},
  booktitle = {26th Annual European Symposium on Algorithms (ESA 2018)},
  pages = {52:1--52:13},
  series = {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN = {978-3-95977-081-1},
  ISSN = {1868-8969},
  year = {2018},
  volume = {112},
  editor = {Azar, Yossi and Bast, Hannah and Herman, Grzegorz},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address = {Dagstuhl, Germany},
  URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2018.52},
  URN = {urn:nbn:de:0030-drops-95153},
  doi = {10.4230/LIPIcs.ESA.2018.52},
  annote = {Keywords: Dictionary compression, String attractors, Set cover}
}

**Published in:** LIPIcs, Volume 87, 25th Annual European Symposium on Algorithms (ESA 2017)

We present a deterministic algorithm that constructs in linear time and space the LZ-End parsing (a variation of LZ77) of a given string over an integer polynomially bounded alphabet.

Dominik Kempa and Dmitry Kosolobov. LZ-End Parsing in Linear Time. In 25th Annual European Symposium on Algorithms (ESA 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 87, pp. 53:1-53:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

@InProceedings{kempa_et_al:LIPIcs.ESA.2017.53,
  author = {Kempa, Dominik and Kosolobov, Dmitry},
  title = {{LZ-End Parsing in Linear Time}},
  booktitle = {25th Annual European Symposium on Algorithms (ESA 2017)},
  pages = {53:1--53:14},
  series = {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN = {978-3-95977-049-1},
  ISSN = {1868-8969},
  year = {2017},
  volume = {87},
  editor = {Pruhs, Kirk and Sohler, Christian},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address = {Dagstuhl, Germany},
  URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2017.53},
  URN = {urn:nbn:de:0030-drops-78471},
  doi = {10.4230/LIPIcs.ESA.2017.53},
  annote = {Keywords: LZ-End, LZ77, construction algorithm, linear time}
}

**Published in:** LIPIcs, Volume 75, 16th International Symposium on Experimental Algorithms (SEA 2017)

The suffix array augmented with the LCP array is perhaps the most important data structure in modern string processing. There has been a lot of recent research activity on constructing these arrays in external memory. In this paper, we engineer the two fastest LCP array construction algorithms (ESA 2016) and improve them in three ways. First, we speed up the algorithms by up to a factor of two through parallelism. Just 8 threads are sufficient to make the algorithms essentially I/O bound. Second, we reduce the disk space usage of the algorithms, making them in-place: the input (text and suffix array) is treated as read-only, and the working disk space never exceeds the size of the final output (the LCP array). Third, we add support for large alphabets. All previous implementations assume the byte alphabet.

Juha Kärkkäinen and Dominik Kempa. Engineering External Memory LCP Array Construction: Parallel, In-Place and Large Alphabet. In 16th International Symposium on Experimental Algorithms (SEA 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 75, pp. 17:1-17:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

@InProceedings{karkkainen_et_al:LIPIcs.SEA.2017.17,
  author = {K\"{a}rkk\"{a}inen, Juha and Kempa, Dominik},
  title = {{Engineering External Memory LCP Array Construction: Parallel, In-Place and Large Alphabet}},
  booktitle = {16th International Symposium on Experimental Algorithms (SEA 2017)},
  pages = {17:1--17:14},
  series = {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN = {978-3-95977-036-1},
  ISSN = {1868-8969},
  year = {2017},
  volume = {75},
  editor = {Iliopoulos, Costas S. and Pissis, Solon P. and Puglisi, Simon J. and Raman, Rajeev},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address = {Dagstuhl, Germany},
  URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2017.17},
  URN = {urn:nbn:de:0030-drops-76116},
  doi = {10.4230/LIPIcs.SEA.2017.17},
  annote = {Keywords: LCP array, suffix array, external memory algorithms}
}

**Published in:** LIPIcs, Volume 66, 34th Symposium on Theoretical Aspects of Computer Science (STACS 2017)

Lyndon factorization and Lempel-Ziv (LZ) factorization are both important tools for analysing the structure and complexity of strings, but their combinatorial structure is very different. In this paper, we establish the first direct connection between the two by showing that while the Lyndon factorization can be bigger than the non-overlapping LZ factorization (which we demonstrate by describing a new, non-trivial family of strings), it is always less than twice the size.
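For readers unfamiliar with the Lyndon side of this comparison, the sketch below computes a Lyndon factorization with Duval's classical linear-time algorithm. It illustrates only what the factorization is; it does not reproduce the paper's string family or its size bound, and the function name `lyndon_factorization` is an assumption made for this example.

```python
def lyndon_factorization(s):
    """Duval's algorithm: split `s` into a lexicographically non-increasing
    sequence of Lyndon words. Repeated factors are listed individually."""
    n = len(s)
    i = 0
    factors = []
    while i < n:
        j, k = i + 1, i
        # Extend the current run while it remains a prefix of a power
        # of a Lyndon word starting at position i.
        while j < n and s[k] <= s[j]:
            if s[k] < s[j]:
                k = i      # the whole run s[i..j] is a single Lyndon word
            else:
                k += 1     # still matching the period of the current word
            j += 1
        # Emit every full copy of the Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


print(lyndon_factorization("banana"))  # ['b', 'an', 'an', 'a']
```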

Juha Kärkkäinen, Dominik Kempa, Yuto Nakashima, Simon J. Puglisi, and Arseny M. Shur. On the Size of Lempel-Ziv and Lyndon Factorizations. In 34th Symposium on Theoretical Aspects of Computer Science (STACS 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 66, pp. 45:1-45:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

@InProceedings{karkkainen_et_al:LIPIcs.STACS.2017.45,
  author = {K\"{a}rkk\"{a}inen, Juha and Kempa, Dominik and Nakashima, Yuto and Puglisi, Simon J. and Shur, Arseny M.},
  title = {{On the Size of Lempel-Ziv and Lyndon Factorizations}},
  booktitle = {34th Symposium on Theoretical Aspects of Computer Science (STACS 2017)},
  pages = {45:1--45:13},
  series = {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN = {978-3-95977-028-6},
  ISSN = {1868-8969},
  year = {2017},
  volume = {66},
  editor = {Vollmer, Heribert and Vall\'{e}e, Brigitte},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address = {Dagstuhl, Germany},
  URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2017.45},
  URN = {urn:nbn:de:0030-drops-69878},
  doi = {10.4230/LIPIcs.STACS.2017.45},
  annote = {Keywords: Lempel-Ziv factorization, Lempel-Ziv parsing, LZ, Lyndon word, Lyndon factorization, Standard factorization}
}

**Published in:** LIPIcs, Volume 57, 24th Annual European Symposium on Algorithms (ESA 2016)

The suffix array, perhaps the most important data structure in modern string processing, needs to be augmented with the longest-common-prefix (LCP) array in many applications. Their construction is often a major bottleneck, especially when the data is too big for internal memory. We describe two new algorithms for computing the LCP array from the suffix array in external memory. Experiments demonstrate that the new algorithms are about a factor of two faster than the fastest previous algorithm.
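As background for the task described above, the following Python sketch computes the LCP array from a text and its suffix array using the classical in-memory algorithm of Kasai et al. It is not one of the external-memory algorithms from the paper; it only shows what is being computed. The helper `suffix_array` is a deliberately naive construction, and both function names are assumptions made for this example.

```python
def suffix_array(s):
    """Naive suffix array: sort suffix start positions by the suffix text."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    """Kasai-style LCP construction in O(n) time.
    lcp[r] is the length of the longest common prefix of the suffixes
    sa[r-1] and sa[r]; lcp[0] is defined as 0."""
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0  # current match length, carried over between iterations
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix preceding suffix i in sorted order
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h - 1 characters
        else:
            h = 0
    return lcp


text = "banana"
sa = suffix_array(text)
print(sa)                   # [5, 3, 1, 0, 4, 2]
print(lcp_array(text, sa))  # [0, 1, 3, 0, 0, 2]
```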

Juha Kärkkäinen and Dominik Kempa. Faster External Memory LCP Array Construction. In 24th Annual European Symposium on Algorithms (ESA 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 57, pp. 61:1-61:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016)

@InProceedings{karkkainen_et_al:LIPIcs.ESA.2016.61,
  author = {K\"{a}rkk\"{a}inen, Juha and Kempa, Dominik},
  title = {{Faster External Memory LCP Array Construction}},
  booktitle = {24th Annual European Symposium on Algorithms (ESA 2016)},
  pages = {61:1--61:16},
  series = {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN = {978-3-95977-015-6},
  ISSN = {1868-8969},
  year = {2016},
  volume = {57},
  editor = {Sankowski, Piotr and Zaroliagis, Christos},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address = {Dagstuhl, Germany},
  URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2016.61},
  URN = {urn:nbn:de:0030-drops-64026},
  doi = {10.4230/LIPIcs.ESA.2016.61},
  annote = {Keywords: LCP array, suffix array, external memory algorithms}
}

**Published in:** LIPIcs, Volume 25, 31st International Symposium on Theoretical Aspects of Computer Science (STACS 2014)

The sparse suffix sorting problem is to sort $b = o(n)$ arbitrary suffixes of a string of length $n$ using $o(n)$ words of space in addition to the string. We present an $O(n)$ time Monte Carlo algorithm using $O(b \log b)$ space and an $O(n \log b)$ time Las Vegas algorithm using $O(b)$ space. This is a significant improvement over the best prior solutions of [Bille et al., ICALP 2013]: a Monte Carlo algorithm running in $O(n \log b)$ time and $O(b^{1+\epsilon})$ space or $O(n \log^2 b)$ time and $O(b)$ space, and a Las Vegas algorithm running in $O(n \log^2 b + b^2 \log b)$ time and $O(b)$ space. All the above results are obtained with high probability, not just in expectation.
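The problem statement above can be illustrated with a simple baseline that sorts a chosen subset of suffixes by comparing them through Karp-Rabin fingerprints. This sketch is not the algorithm from the paper: it stores fingerprints of all prefixes and therefore uses O(n) extra space rather than o(n), and it is Monte Carlo in the sense that a fingerprint collision could produce a wrong order. The constants `MOD` and `BASE` and all function names are assumptions made for this example.

```python
from functools import cmp_to_key

MOD = (1 << 61) - 1  # a large prime modulus (assumed choice)
BASE = 911382323     # fixed base for reproducibility; normally chosen at random


def build_fingerprints(s):
    """Prefix fingerprints and powers of BASE for O(1) substring fingerprints."""
    n = len(s)
    pref = [0] * (n + 1)
    power = [1] * (n + 1)
    for i, ch in enumerate(s):
        pref[i + 1] = (pref[i] * BASE + ord(ch) + 1) % MOD
        power[i + 1] = (power[i] * BASE) % MOD
    return pref, power


def sparse_suffix_sort(s, positions):
    """Sort the suffixes starting at the given 0-based `positions`."""
    n = len(s)
    pref, power = build_fingerprints(s)

    def fp(i, length):
        # Fingerprint of s[i:i+length].
        return (pref[i + length] - pref[i] * power[length]) % MOD

    def lcp(i, j):
        # Binary search for the longest prefix with equal fingerprints.
        lo, hi = 0, n - max(i, j)
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if fp(i, mid) == fp(j, mid):
                lo = mid
            else:
                hi = mid - 1
        return lo

    def compare(i, j):
        if i == j:
            return 0
        l = lcp(i, j)
        if i + l == n:
            return -1  # suffix i is a proper prefix of suffix j
        if j + l == n:
            return 1   # suffix j is a proper prefix of suffix i
        return -1 if s[i + l] < s[j + l] else 1

    return sorted(positions, key=cmp_to_key(compare))


print(sparse_suffix_sort("banana", [0, 2, 4]))  # [0, 4, 2]
```

Each comparison costs O(log n) fingerprint lookups, so sorting b suffixes this way takes O(n + b log b log n) time.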

Tomohiro I, Juha Kärkkäinen, and Dominik Kempa. Faster Sparse Suffix Sorting. In 31st International Symposium on Theoretical Aspects of Computer Science (STACS 2014). Leibniz International Proceedings in Informatics (LIPIcs), Volume 25, pp. 386-396, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2014)

@InProceedings{i_et_al:LIPIcs.STACS.2014.386,
  author = {I, Tomohiro and K\"{a}rkk\"{a}inen, Juha and Kempa, Dominik},
  title = {{Faster Sparse Suffix Sorting}},
  booktitle = {31st International Symposium on Theoretical Aspects of Computer Science (STACS 2014)},
  pages = {386--396},
  series = {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN = {978-3-939897-65-1},
  ISSN = {1868-8969},
  year = {2014},
  volume = {25},
  editor = {Mayr, Ernst W. and Portier, Natacha},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address = {Dagstuhl, Germany},
  URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2014.386},
  URN = {urn:nbn:de:0030-drops-44738},
  doi = {10.4230/LIPIcs.STACS.2014.386},
  annote = {Keywords: string algorithms, sparse suffix sorting, sparse suffix trees, Karp-Rabin fingerprints, space-time tradeoffs}
}