Efficient Terabyte-Scale Text Compression via Stable Local Consistency and Parallel Grammar Processing
Abstract
We present compression algorithms designed to process terabyte-sized datasets in parallel. Our approach builds on locally consistent grammars, a lightweight form of compression, combined with simple post-processing techniques to achieve further space reductions. Locally consistent grammar algorithms are suitable for scaling as they need minimal satellite information to compact the text, but they are not inherently parallel. To enable parallelisation, we introduce a novel concept that we call stable local consistency. A grammar algorithm ALG is stable if, for any pattern $P$ occurring in a collection $\mathcal{T} = \{T_1, T_2, \ldots, T_k\}$, the instances $ALG(T_1), ALG(T_2), \ldots, ALG(T_k)$ independently produce cores for $P$ with the same topology. In a locally consistent grammar, the core of $P$ is a subset of nodes and edges in the parse tree that remains the same in all the occurrences of $P$. This feature enables compression, but it only holds if ALG defines a common set of nonterminal symbols for the $k$ strings. Stability removes this restriction, allowing us to run $ALG(T_1), \ldots, ALG(T_k)$ in parallel and subsequently merge their grammars into a single output equivalent to that of $ALG(\mathcal{T})$. We implemented our ideas and tested them on massive datasets. Our experiments showed that our method could process 7.9 TB of bacterial genomes in around nine hours, using 16 threads and 0.43 bits/symbol of working memory, achieving a compression ratio of 85x.
Keywords and phrases: Grammar compression, locally consistent parsing, hashing
2012 ACM Subject Classification: Theory of computation → Data compression
Funding: This project has received funding from the European Union's Horizon Europe research and innovation programme under grant agreement No 101060011.
Editors: Petra Mutzel and Nicola Prezza
Series and Publisher: Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Classical dictionary-based compression methods such as Lempel-Ziv (LZ) [27, 39] or grammar compression [23, 4] achieve significant space reductions, but often require extensive resources, limiting their practicality for large datasets. Tools like gzip and zstd provide resource-saving simplifications of LZ that offer acceptable trade-offs for smaller inputs, but still struggle with massive repositories.
Recent heuristics have been developed for large-scale applications. For example, Deorowicz et al. [8] compress pangenomes by partitioning strings and compressing similar segments together using zstd. Other approaches, such as that of Hunt et al. [17], reorder the genomes to improve LZ compression. Grammar algorithms like RePair [26] and SEQUITUR [32] achieve high compression ratios, but quickly exceed the available memory as the input grows. Gagie et al. [14] introduced a method using prefix-free parsing [3] to scale RePair to large inputs.
Locally consistent grammars [15, 35, 5, 10, 24] are a technique that performs rounds of locally consistent parsing [29, 38, 30, 18, 3, 19, 5, 10] to compress a text $T$. This approach recursively segments $T$ based on sequence patterns, producing nearly identical phrases for matching substrings. In the parse tree of a locally consistent grammar, the nodes that cover the occurrences of a pattern $P$ share an area with identical topology and labels. This area is the core of $P$ [38], and is what makes compression possible. Locally consistent grammars are simple to construct as, unlike LZ or RePair, they only need local information to break $T$ consistently. However, they are not only useful for compression; they also help to scale the processing of large string collections. In fact, they have been used to speed up the computation of the Burrows–Wheeler transform [9], perform pattern matching in grammar-based self-indexes [5], and find maximal exact matches [11, 31], among other things (see [1, 2, 20, 21, 22] for more applications).
Although it is possible to build from $T[1..n]$ a locally consistent grammar of size $O\!\left(\delta\log\frac{n\log\sigma}{\delta\log n}\right)$ in $O(n)$ expected time [25, 24], $\delta$ being the string complexity [37] and $\sigma$ being the alphabet size of $T$, these techniques probably yield less impressive compression ratios in practice than LZ or RePair. However, simple transformations to reduce the size of the grammar can yield further space reductions. In this regard, Ochoa and Navarro [36] showed that any irreducible grammar can reach the $k$th-order empirical entropy of a string. Their result suggests that building a locally consistent grammar and then transforming it might be an efficient alternative to greedy approaches in large datasets.
Parallelising the grammar construction in massive collections is desirable to leverage multi-core architectures. Given an input $\mathcal{T}$, an efficient solution would be to split $\mathcal{T}$ into chunks, say $\mathcal{T}_a$ and $\mathcal{T}_b$, compress the chunks in different instances $ALG(\mathcal{T}_a)$ and $ALG(\mathcal{T}_b)$, and merge the resulting (small) grammars $\mathcal{G}_a$ and $\mathcal{G}_b$. However, ensuring the local consistency of the merged grammar is difficult without synchronising the instances. Most locally consistent algorithms assign random fingerprints to the grammar symbols to perform the parsing. However, when there is no synchronisation, different metasymbols emitted by $ALG(\mathcal{T}_a)$ and $ALG(\mathcal{T}_b)$ that expand to equal sequences of $\mathcal{T}$ could have different fingerprints, thus producing an inconsistent parsing of $\mathcal{T}_a$ and $\mathcal{T}_b$. Therefore, new locally consistent schemes are necessary to make the parallelisation possible.
Our contribution.
We present a parallel grammar compression method that scales to terabytes of data. Our framework consists of two operations. Let $\mathcal{T}$ be a string collection with $n$ symbols, and let $H$ be a set of hash functions. The operation $BuildGram(\mathcal{T}, H)$ produces a locally consistent grammar $\mathcal{G}$ generating the strings in $\mathcal{T}$. Furthermore, let $\mathcal{T}_a$ and $\mathcal{T}_b$ be two collections with grammars $\mathcal{G}_a = BuildGram(\mathcal{T}_a, H)$ and $\mathcal{G}_b = BuildGram(\mathcal{T}_b, H)$. The operation $MergeGrams(\mathcal{G}_a, \mathcal{G}_b)$ builds a locally consistent grammar for the collection $\mathcal{T}_{ab}$ that combines $\mathcal{T}_a$ and $\mathcal{T}_b$. BuildGram uses $H$ to induce a stable locally consistent parsing. The stable property means that $BuildGram(\mathcal{T}_a, H)$ and $BuildGram(\mathcal{T}_b, H)$ independently produce cores with the same topology for identical patterns. The set $H$ assigns random fingerprints to the metasymbols of the grammar under construction to guide the locally consistent parsing, with the fingerprint of a metasymbol depending on the sequence of its expansion. This feature ensures that metasymbols from different grammars that expand to matching sequences get the same fingerprints. MergeGrams leverages the stable property to produce a grammar equivalent to that of $BuildGram(\mathcal{T}_{ab}, H)$, thus allowing us to parallelise the compression. We show that $BuildGram(\mathcal{T}, H)$ runs in $O(n)$ time w.h.p. and uses $O(G\log n)$ bits of working space, $G$ being the grammar size. Similarly, $MergeGrams(\mathcal{G}_a, \mathcal{G}_b)$ runs in $O(G_a + G_b)$ time and uses $O((G_a + G_b)\log(n_a + n_b))$ bits, with $n_a$ and $n_b$ being the number of symbols in $\mathcal{T}_a$ and $\mathcal{T}_b$, respectively. The parsing that we use in BuildGram is inspired by the concept of induced suffix sorting [34], which has been shown to be effective for processing strings [9, 11]. In future work, we plan to use our parallel compressor not only to reduce space usage but also to process large inputs. However, we note that the concept of stability is compatible with any locally consistent grammar that uses hashing to break the text. Our experiments showed that our strategy can efficiently compress several terabytes of data.
2 Notation and basic concepts
We consider the RAM model of computation. Given an input of $n$ symbols, we assume our procedures run in random-access memory, where the machine words use $w = \Theta(\log n)$ bits and can be manipulated in constant time. We use the big-$O$ notation to denote time and space complexities (in bits), and the term $\log$ to express logarithms of base two.
2.1 Strings
A string $S[1..n]$ is a sequence of $n$ symbols over an alphabet $\Sigma$. We use $S[j]$ to refer to the $j$th symbol in $S$ from left to right, and $S[i..j]$ to refer to the substring starting at position $i$ and ending at position $j$. An equal-symbol run $S[i..j]$ is a substring storing $j-i+1$ consecutive copies of the same symbol $c \in \Sigma$, with $i = 1$ or $S[i-1] \neq c$; and $j = n$ or $S[j+1] \neq c$.
We consider a collection $\mathcal{T} = \{T_1, \ldots, T_k\}$ of $k$ strings as a multiset where each element $T_u$ has an arbitrary order $u$. In addition, we use the operator $\|\mathcal{T}\| = \sum_{u=1}^{k} |T_u|$ to express the total number of symbols. We also use subscripts to differentiate collections (e.g., $\mathcal{T}_a$ and $\mathcal{T}_b$). The expression $\mathcal{T}_{ab} = \mathcal{T}_a \cup \mathcal{T}_b$ denotes the combination of $\mathcal{T}_a$ and $\mathcal{T}_b$ into a new collection. We assume that all collections have the same constant-size alphabet $\Sigma = \{1, \ldots, \sigma\}$.
2.2 Grammar compression
Grammar compression consists in representing a string $T$ as a small context-free grammar that generates only $T$ [23, 4]. Formally, a grammar is a tuple $\mathcal{G} = (\Sigma, V, \mathcal{R}, S)$, where $\Sigma$ is the alphabet of $T$ (the terminals), $V$ is the set of nonterminals, and $\mathcal{R}$ is the set of rewriting rules in the form $X \to A$, with $X \in V$ and $A \in (\Sigma \cup V)^{*}$. The symbol $S \in V$ is the grammar's start symbol. Given two strings $B, C \in (\Sigma \cup V)^{*}$, $B$ rewrites $C$ (denoted $B \Rightarrow C$) if $B = UXW$, $C = UAW$, and $X \to A$ exists in $\mathcal{R}$. Furthermore, $B$ derives $C$, denoted $B \Rightarrow^{*} C$, if there is a sequence $B_1, B_2, \ldots, B_x$ such that $B_1 = B$, $B_x = C$, and $B_y \Rightarrow B_{y+1}$ for $y \in [1..x-1]$. The string $A \in \Sigma^{*}$ resulting from $X \Rightarrow^{*} A$ is the expansion $\exp(X)$ of $X$, with the decompression of $T$ expressed as $S \Rightarrow^{*} T$. Compression algorithms ensure that every $X \in V$ occurs only once on the left-hand sides of $\mathcal{R}$. In this way, there is only one possible expansion for each $X$. This type of grammar is referred to as straight-line. The sum of the lengths of the right-hand sides of $\mathcal{R}$ is the grammar size $G$.
The parse tree of $\mathcal{G}$ represents the derivation $S \Rightarrow^{*} T$. Given the rule $S \to A$, the root of the parse tree is a node labelled $S$ that has $|A|$ children, which are labelled from left to right with $A[1], A[2], \ldots, A[|A|]$, respectively. The $j$th child of the root, labelled $A[j]$, is a leaf if $A[j] \in \Sigma$; otherwise, it is an internal node whose subtree is recursively defined according to $A[j]$ and its rule in $\mathcal{R}$.
Post-processing a grammar consists of capturing the remaining repetitions in its rules. For instance, if a string $Q$ appears multiple times on the right-hand sides of $\mathcal{R}$, one can create a new rule $X \to Q$ and replace the occurrences of $Q$ with $X$. Run-length compression encapsulates each distinct equal-symbol run $X^{\ell}$, with $\ell > 1$, appearing in the right-hand sides of $\mathcal{R}$ as a constant-size rule $Y \to X^{\ell}$. Grammar simplification removes every rule $X \to A$ whose symbol $X$ appears once on the right-hand sides, replacing its occurrence with $A$.
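As a small illustrative example of ours (not from the original text): given the rules $X \to ab$, $Y \to XXXc$, and $S \to Yd$, run-length compression creates $Z \to X^{3}$ and rewrites the second rule as $Y \to Zc$; since $Y$ now appears only once on the right-hand sides, simplification deletes $Y \to Zc$ and rewrites the start rule as $S \to Zcd$.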
2.3 Locally-consistent parsing and grammars
A parsing is a partition of a string $S$ into a sequence of phrases $S = S[1..i_1]\,S[i_1{+}1..i_2]\cdots S[i_{x-1}{+}1..|S|]$, where the indices $i_1 < i_2 < \cdots < i_{x-1}$ are breaks. Let $W$ denote the sequence of phrases that cover a substring $S[a..b]$, with $a$ inside the first phrase of $W$ and $b$ inside the last. A parsing is locally consistent [6] iff, for any pair of equal substrings $S[a..b] = S[a'..b']$, their phrase sequences differ in $O(1)$ phrases at the beginning and $O(1)$ at the end, with their internal phrase sequences identical.
A locally consistent parsing scheme relevant to this work is that of Nong et al. [34]. They originally described their idea to perform induced suffix sorting, but it has been shown that it is also locally consistent [10, 7]. They define a type for each position $S[i]$:

$$\mathrm{type}(i) = \begin{cases} \text{L-type} & \text{if } i = |S| \text{, } S[i] > S[i+1] \text{, or } S[i] = S[i+1] \text{ and } \mathrm{type}(i+1) = \text{L-type} \\ \text{S-type} & \text{if } S[i] < S[i+1] \text{, or } S[i] = S[i+1] \text{ and } \mathrm{type}(i+1) = \text{S-type} \\ \text{LMS-type} & \text{if } \mathrm{type}(i) = \text{S-type and } \mathrm{type}(i-1) = \text{L-type} \end{cases} \quad (1)$$

Their method scans $S$ from right to left and sets a break at each LMS-type position.
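As a small worked example (ours, using the L-type convention for the last position stated in Equation 1): for $S = \texttt{cabbage}$, the right-to-left scan yields the types L, S, L, L, S, L, L for positions $1, \ldots, 7$. Positions 2 and 5 are S-type and follow L-type positions, so they are LMS-type, and the induced phrases are $\texttt{c}$, $\texttt{abb}$, and $\texttt{age}$.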
One can create a grammar that only generates $T$ by applying successive rounds of locally consistent parsing. Different works present this idea slightly differently (see [38, 30, 18, 3, 5, 10]), but the procedure is similar: in every round $i \geq 1$, the algorithm receives a string $T^{i-1}$ as input (with $T^{0} = T$) and performs the following steps:
1. Break $T^{i-1}$ into phrases using a locally consistent parsing.
2. Assign a nonterminal $X$ to each distinct sequence $A$ that is a phrase in $T^{i-1}$.
3. Store every $X$ in $V$ and its associated rule $X \to A$ in $\mathcal{R}$.
4. Replace the phrases in $T^{i-1}$ by their nonterminals to form the string $T^{i}$ for round $i+1$.
The process ends when $T^{i}$ does not have more breaks, in which case it creates the rule $S \to T^{i}$ for the start symbol and returns the resulting grammar. The phrases have a length of at least two, so the length of $T^{i}$ is at most half the length of $T^{i-1}$. Consequently, the algorithm incurs $O(\log |T|)$ rounds in the worst case and runs in $O(|T|)$ time.
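To make the round loop concrete, here is a minimal single-threaded C++ sketch of ours. The break predicate (a new phrase starts where a symbol's fingerprint is a strict local minimum) is a simplified hash-based stand-in for illustration, not the RandLMSPar rule of Section 3.3, and all names (`parse_round`, `rule_fp`) are ours.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Sym  = uint32_t;
using Rule = std::vector<Sym>;

// Fingerprint of a new rule from the fingerprints of its symbols
// (FNV-1a here; a simplified stand-in for the paper's Equation 4).
static uint64_t rule_fp(const Rule& A, const std::vector<uint64_t>& fp) {
    uint64_t h = 14695981039346656037ULL;
    for (Sym s : A) { h ^= fp[s]; h *= 1099511628211ULL; }
    return h;
}

// One parsing round: steps 1-4 above. Returns T^i computed from T^{i-1}.
static std::vector<Sym> parse_round(const std::vector<Sym>& text,
                                    std::vector<uint64_t>& fp,
                                    std::map<Rule, Sym>& rules) {
    std::vector<Sym> out;
    Rule phrase;
    auto close = [&](Rule& p) {                   // steps 2-4 for one phrase
        if (p.empty()) return;
        auto [it, fresh] = rules.emplace(p, (Sym)fp.size());
        if (fresh) fp.push_back(rule_fp(p, fp));  // fingerprint of new symbol
        out.push_back(it->second);
        p.clear();
    };
    for (size_t i = 0; i < text.size(); ++i) {
        // Step 1: break before i when its fingerprint is a strict local minimum.
        if (i > 0 && i + 1 < text.size() &&
            fp[text[i]] < fp[text[i - 1]] && fp[text[i]] < fp[text[i + 1]])
            close(phrase);
        phrase.push_back(text[i]);
    }
    close(phrase);
    return out;
}

int main() {
    std::string T = "abracadabra_abracadabra";
    std::vector<Sym> text(T.begin(), T.end());
    std::vector<uint64_t> fp(256);                // h_0: terminal fingerprints
    for (size_t c = 0; c < 256; ++c) fp[c] = c * 0x9e3779b97f4a7c15ULL + 1;
    std::map<Rule, Sym> rules;
    while (text.size() > 1) {                     // rounds until no more breaks
        std::vector<Sym> nxt = parse_round(text, fp, rules);
        if (nxt.size() == text.size()) break;
        text = nxt;
    }
    // `text` is now the start rule's right-hand side; `rules` holds R.
}
```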
The output grammar is locally consistent because it compresses the occurrences of a pattern $P$ largely in the same way. The first parsing round transforms the occurrences of $P$ into substrings of the form $L^{1}C^{1}R^{1}$, where the superscripts indicate symbols in $T^{1}$. The blocks $L^{1}$ and $R^{1}$ contain $O(1)$ variable nonterminals that change with $P$'s context, while $C^{1}$ remains the same in all occurrences. In the second round, $C^{1}$ yields strings of the form $L^{2}C^{2}R^{2}$ that recursively have the same structure. The substring $C^{i}$ remains non-empty during the first $O(\log |P|)$ rounds. The substrings $C^{1}, C^{2}, \ldots$, with their nodes in the parse tree, constitute the core of $P$ [38] (see Figure 1).
2.4 Hashing and string fingerprints
Hashing refers to the idea of using a function $h$ to map elements in a universe $U$ to integers in a range $[0..m-1]$ uniformly at random. When the universe is large and only an unknown subset requires fingerprints over a range $[0..m-1]$ with $m \ll |U|$, the typical solution is to use a universal hash function. For any pair of distinct elements $x, y \in U$, a universal hash function ensures that the collision probability is $\Pr[h(x) = h(y)] \leq 1/m$. Let $U = [0..u-1]$ be the universe, $p \geq u$ a prime number, and $a \in [1..p-1]$ and $b \in [0..p-1]$ integers chosen uniformly at random. One can make $h$ universal using the formula $h(x) = ((ax + b) \bmod p) \bmod m$, with $x \in U$.
These ideas can be adapted to produce fingerprints for a set of strings over an alphabet $[0..u-1]$ [12]. Pick a prime number $q \geq u$ and choose an integer $c \in [0..q-1]$ uniformly at random. Then, build the degree-$|S|$ polynomial $\phi(S) = (S[1]c^{|S|-1} + S[2]c^{|S|-2} + \cdots + S[|S|]) \bmod q$ and regard the symbols in $S$ as the polynomial's coefficients. Additionally, compose $\phi$ with a universal hash function to obtain a fingerprint in $[0..m-1]$. Let $a \in [1..q-1]$ and $b, c \in [0..q-1]$ be three integers chosen uniformly at random. Then the function becomes $h(S) = ((a \cdot \phi(S) + b) \bmod q) \bmod m$.
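A possible C++ rendering of this fingerprint scheme follows. It is a sketch of ours: the struct name and the fixed constants are placeholders, and a real implementation would draw $q$, $c$, $a$, and $b$ at random as described above.

```cpp
#include <cstdint>
#include <string>

// Karp-Rabin-style string fingerprint composed with a universal hash,
// following Section 2.4. The parameters below are fixed placeholders.
struct StringHasher {
    uint64_t q = (1ULL << 61) - 1;  // prime modulus (Mersenne prime 2^61 - 1)
    uint64_t c = 0x5bd1e995;        // polynomial base, ideally random in [0..q-1]
    uint64_t a = 0x9e3779b9;        // universal-hash multiplier in [1..q-1]
    uint64_t b = 0x85ebca6b;        // universal-hash offset in [0..q-1]
    uint64_t m = 1ULL << 32;        // output range [0..m-1]

    // phi(S) = (S[1]*c^{|S|-1} + ... + S[|S|]) mod q, evaluated with Horner.
    uint64_t phi(const std::string& S) const {
        unsigned __int128 acc = 0;
        for (unsigned char sym : S) acc = (acc * c + sym) % q;
        return (uint64_t)acc;
    }
    // h(S) = ((a * phi(S) + b) mod q) mod m
    uint64_t operator()(const std::string& S) const {
        unsigned __int128 x = (unsigned __int128)a * phi(S) + b;
        return (uint64_t)(x % q) % m;
    }
};
```

Equal strings receive equal fingerprints no matter where they occur, which is precisely the property the grammar construction of Section 3 exploits.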
3 Our methods
3.1 The grammar model
We first introduce the features of the locally consistent grammar $\mathcal{G} = (\Sigma, V, \mathcal{R}, S)$ we build with BuildGram and MergeGrams. Given a rule $X \to A$, the operator $\mathrm{rhs}(X)$ is an alias for the string $A$, with $\mathrm{rhs}(X) = X$ when $X \in \Sigma$. The grammar is fully balanced, which means that, for any $X \in V$, all root-to-leaf paths in the parse tree of $X$ have the same number of edges. We refer to this number as the level of $X$, and extend the use of level to the rule associated with $X$. We indistinctly use the terms nonterminal and metasymbol to refer to the elements of $V$.
The start symbol $S$ is associated with a rule $S \to C$, where $C$ is a sequence such that each $C[u]$ expands to $T_u \in \mathcal{T}$ and has height $h_u$. The symbols in $C$ can have different heights and, hence, different levels (this idea will become clear in Section 3.3). Let $h$ be the highest level among the elements in $C$. We set the level of $S$ equal to $h + 1$, which we regard as the height of $\mathcal{G}$.
We define the partitions $\mathcal{R} = \{\mathcal{R}_1, \ldots, \mathcal{R}_h\}$ and $V = \{V_1, \ldots, V_h\}$, where every pair $(\mathcal{R}_l, V_l)$ is the set of rules and nonterminals (respectively) with level $l$. In each subset $\mathcal{R}_l$, the left-hand sides are symbols over the alphabet $V_l$, while the right-hand sides are strings over $V_{l-1}$. Further, we consider $V_0 = \Sigma$ to be the set of terminals.
3.2 Fingerprints for the grammar symbols
In this section, we describe the set $H = \{h_0, h_1, \ldots\}$ of hash functions that assign fingerprints in BuildGram. The universal hash function $h_0$ maps terminal symbols to integers over an arbitrary range $[0..m_0-1]$, with $m_0 \geq \sigma$. Furthermore, each function $h_l$, with $l \geq 1$, recursively assigns fingerprints to the right-hand sides of $\mathcal{R}_l$. Let $[0..m_{l-1}-1]$ be the integer range for the fingerprints emitted by $h_{l-1}$. We choose a random prime number $q_l \geq m_{l-1}$, three integer values $a_l \in [1..q_l-1]$ and $b_l, c_l \in [0..q_l-1]$ chosen uniformly at random, and a new integer $m_l$. Now, given a rule $X \to A \in \mathcal{R}_l$, we compute the fingerprint for $A$ as

$$h_l(A) = \left(\left(a_l \cdot \sum_{j=1}^{|A|} h_{l-1}(\mathrm{rhs}(A[j])) \cdot c_l^{|A|-j} + b_l\right) \bmod q_l\right) \bmod m_l \quad (2)$$
Although $h_l$ computes a fingerprint for a string, we associate this fingerprint with $X$ because each $X$ has one possible $\mathrm{rhs}(X)$. Notice that the recursive definition of $h_l$ implicitly traverses the parse tree of $X$ and ignores the nonterminals labelling its internal nodes. As a result, the value that $h_l$ assigns to $\mathrm{rhs}(X)$ (or equivalently, to $X$) depends on $\exp(X)$, the functions $h_0, \ldots, h_l$, and the topology of the parse tree. In practice, we avoid traversing the parse tree by operating bottom-up over $\mathcal{R}$: when we process $\mathcal{R}_l$, the fingerprints for $V_{l-1}$ that we require to obtain the fingerprints in $\mathcal{R}_l$ are available.
BuildGram does not know a priori the number of hash functions it needs to compress an input. However, the locally consistent grammar algorithm that we use requires $O(\log \ell)$ rounds of parsing to compress $\mathcal{T}$ (Section 3.3), where $\ell$ is the length of the longest string. If we consider the number of rounds plus the function $h_0$ for the terminals, then $|H| = \lceil \log \ell \rceil + 1$ is enough to process $\mathcal{T}$.
3.3 Our grammar algorithm
Our procedure $BuildGram(\mathcal{T}, H)$ receives as input a collection $\mathcal{T}$ of strings and a set $H$ of hash functions, and returns a locally consistent grammar $\mathcal{G}$ that only generates the strings in $\mathcal{T}$. We assume $H$ has at least $\lceil \log \ell \rceil + 1$ elements (see Section 3.2), where $\ell$ is the length of the longest string in $\mathcal{T}$.
The algorithm of BuildGram is inspired by the parsing of Nong et al. [34], which has been used not only for compression [35], but also for the processing of strings [9] (see Appendix B). However, as noted in the Introduction, the ideas we present here and in the next sections are compatible with any locally consistent grammar that uses hashing.
Overview of the algorithm.
BuildGram constructs $\mathcal{G}$ in successive rounds of locally consistent parsing. In each round $l$, we run steps 1–4 of Section 2.3, breaking the strings of $\mathcal{T}^{l-1}$ individually, but collapsing the rules into the same grammar $\mathcal{G}$. When we finish round $l$, we flag each string with length one as inactive (that is, fully compressed) and stop the compression in round $l$ if there are no active strings. Subsequently, we create the sequence $C$ with the compressed strings (that is, the symbols we marked as inactive), create the start symbol $S$ with the corresponding rule $S \to C$, and finish BuildGram.
3.3.1 Parsing mechanism
We parse the active strings of $\mathcal{T}^{l-1}$ (step 1 of the round) using a variant of the parsing of Nong et al. [34] (Section 2.3) that employs Equation 2 to randomise the sequences that trigger breaks. We refer to this modification as RandLMSPar.
Let $X, Y \in V_{l-1}$ be any pair of nonterminals. We define the partial order $\prec$ as follows:

$$X \prec Y \iff h_{l-1}(\mathrm{rhs}(X)) < h_{l-1}(\mathrm{rhs}(Y)).$$

Additionally, we define the equivalence relation

$$X \equiv Y \iff h_{l-1}(\mathrm{rhs}(X)) = h_{l-1}(\mathrm{rhs}(Y))$$

to cover the cases where $X = Y$, or $X \neq Y$ and their fingerprints collide. Now, let $i$ and $i+1$ be two adjacent positions in some string $T^{l-1}$ during round $l$. We redefine the types of Equation 1 for $T^{l-1}[i]$ as follows:

$$\mathrm{type}(i) = \begin{cases} \text{L-type} & \text{if } T^{l-1}[i] \succ T^{l-1}[i+1] \text{, or } T^{l-1}[i] \equiv T^{l-1}[i+1] \text{ and } \mathrm{type}(i+1) = \text{L-type} \\ \text{S-type} & \text{if } T^{l-1}[i] \prec T^{l-1}[i+1] \text{, or } T^{l-1}[i] \equiv T^{l-1}[i+1] \text{ and } \mathrm{type}(i+1) = \text{S-type} \\ \text{LMS-type} & \text{if } \mathrm{type}(i) = \text{S-type and } \mathrm{type}(i-1) = \text{L-type} \end{cases} \quad (3)$$

The above types are undefined for the suffix of $T^{l-1}$ that is an equal-symbol run under $\equiv$. This restriction implies that this suffix cannot have LMS-type positions (that is, breaks).
A substring $T^{l-1}[i..j]$ is a phrase in RandLMSPar if the following two conditions hold: (i) $i = 1$ or $T^{l-1}[i]$ is LMS-type; and (ii) $j = |T^{l-1}|$ or $T^{l-1}[j+1]$ is LMS-type.
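The C++ sketch below (ours) computes the break positions of RandLMSPar for one string from the fingerprints of its symbols; the array `fp` plays the role of the array $F$ introduced in Section 3.4, and the function name is hypothetical.

```cpp
#include <cstdint>
#include <vector>

// RandLMSPar breaks for one string in round l (Equation 3): position p is a
// break (LMS-type) when type(p) = S and type(p-1) = L, where the order between
// adjacent symbols is decided by their fingerprints. Positions are 0-based.
std::vector<size_t> lms_breaks(const std::vector<uint32_t>& s,
                               const std::vector<uint64_t>& fp) {
    std::vector<size_t> breaks;
    size_t n = s.size();
    if (n < 2) return breaks;
    size_t r = n - 1;                      // start of the equal-symbol run suffix
    while (r > 0 && fp[s[r - 1]] == fp[s[r]]) --r;  // '==' also models collisions
    if (r == 0) return breaks;             // the whole string is one run
    bool is_S = fp[s[r - 1]] < fp[s[r]];   // type of position r-1 (types inside
    for (size_t i = r - 1; i > 0; --i) {   // the run suffix are undefined)
        bool prev_S;                       // type of position i-1
        if (fp[s[i - 1]] < fp[s[i]]) prev_S = true;
        else if (fp[s[i - 1]] > fp[s[i]]) prev_S = false;
        else prev_S = is_S;                // equal fingerprints copy the type
        if (is_S && !prev_S) breaks.push_back(i);   // i is LMS-type: a break
        is_S = prev_S;
    }
    return breaks;                         // in decreasing position order
}
```

Cutting the string at the reported positions yields exactly the phrases of conditions (i) and (ii); the equal-symbol run suffix always falls inside the last phrase.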
Once we partition the strings of $\mathcal{T}^{l-1}$ and store the distinct phrases in a set, we assign metasymbols to the phrases (step 2 of the round). Let $v$ be the number of symbols $\mathcal{G}$ has when we begin round $l$. We assign the nonterminal $X = v + r$ to the $r$th distinct phrase $A$ and add the rule $X \to A$ to $\mathcal{R}_l$. We note that the order of the phrases is arbitrary and does not affect the properties of our method. The last step of the parsing round is to create $\mathcal{T}^{l}$ by replacing the phrases in $\mathcal{T}^{l-1}$ with their nonterminals in $V_l$. Figure 2 shows an example of BuildGram.
Our hash-based parsing mechanism induces a property in the grammar that we call stable local consistency.
Definition 1.
Stable local consistency: Let ALG be an algorithm that produces a locally consistent grammar. Additionally, let $P$ be a pattern occurring in an arbitrary number of text collections. ALG is stable iff, for any pair of distinct texts $\mathcal{T}_a \neq \mathcal{T}_b$ containing $P$, the instances $ALG(\mathcal{T}_a)$ and $ALG(\mathcal{T}_b)$ independently produce a core for $P$ (Section 2.3) with identical tree topology but possibly different nonterminal labels. The term "independently" means that $ALG(\mathcal{T}_a)$ does not use information about $\mathcal{T}_b$ in its execution, and vice versa.
The classification of a position $T^{l-1}[i]$ as a break depends on the fingerprint resulting from the evaluation of its expansion with the functions of $H$. Thus, if a pattern $P$ appears in another collection, surrounded by an identical context, processing that collection with the same set $H$ produces breaks and a core topology for $P$ identical to those in $\mathcal{T}$. The stable property depends on the use of $H$ and not on the parsing algorithm, which means that any locally consistent parsing compatible with hashing (e.g., [29, 5, 15]) would achieve similar results.
3.4 Implementing our grammar algorithm
Calculating the LMS-type positions of RandLMSPar during round $l$ requires Equation 3 to obtain the type of each $T^{l-1}[i]$, which involves knowing the relative order of $T^{l-1}[i]$ and $T^{l-1}[i+1]$. We obtain this information by feeding $\mathrm{rhs}(T^{l-1}[i])$ and $\mathrm{rhs}(T^{l-1}[i+1])$ to Equation 2. The problem is that Equation 2 decompresses $T^{l-1}[i]$ and $T^{l-1}[i+1]$ from $\mathcal{G}$, adding a logarithmic penalty to the grammar construction. We avoid decompression by keeping an array $F$ that stores the fingerprints of the symbols we already have in $\mathcal{G}$.
At the beginning of BuildGram, we initialize $F$ with $\sigma$ elements, where every $F[c]$ stores the fingerprint $h_0(c)$ of the terminal $c \in \Sigma$. Then, each round $l$ keeps in $F$ the fingerprints of the symbols in $V_{l-1}$. After we finish round $l$, we store in $F$ the fingerprint of every new $X \in V_l$ so that we can compute the types of $\mathcal{T}^{l}$ in the next round $l+1$.
Let $F[A[j]]$ be the replacement for $h_{l-1}(\mathrm{rhs}(A[j]))$. We modify Equation 2 as follows:

$$h_l(A) = \left(\left(a_l \cdot \sum_{j=1}^{|A|} F[A[j]] \cdot c_l^{|A|-j} + b_l\right) \bmod q_l\right) \bmod m_l \quad (4)$$

This operation is valid because the alphabet of $\mathcal{T}^{l-1}$ is $V_{l-1}$, and $F$ already has its fingerprints.
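In code, Equation 4 becomes a single left-to-right pass over the right-hand side. The sketch below is ours; the parameter names follow Section 3.2, and the 128-bit arithmetic is just one way to avoid overflow.

```cpp
#include <cstdint>
#include <vector>

// Equation 4: fingerprint of a right-hand side A over V_{l-1}, replacing the
// recursive evaluation of Equation 2 with O(1)-time lookups in F, where F[s]
// stores the fingerprint of symbol s. The parameters a, b, c, q, m correspond
// to a_l, b_l, c_l, q_l, m_l of Section 3.2.
uint64_t eq4(const std::vector<uint32_t>& A, const std::vector<uint64_t>& F,
             uint64_t a, uint64_t b, uint64_t c, uint64_t q, uint64_t m) {
    unsigned __int128 poly = 0;
    for (uint32_t sym : A)              // Horner: sum of F[A[j]] * c^{|A|-j}
        poly = (poly * c + F[sym]) % q;
    return (uint64_t)((poly * a + b) % q) % m;
}
```

After round $l$, BuildGram appends the result for each new nonterminal to $F$, so that round $l+1$ can evaluate Equation 3 in constant time per position.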
A note on collisions.
Consecutive positions $i$ and $i+1$ with $T^{l-1}[i] \neq T^{l-1}[i+1]$ but equal fingerprints (i.e., collisions) never cause a break because $T^{l-1}[i] \equiv T^{l-1}[i+1]$. Intuitively, the more contiguous colliding symbols we have, the fewer breaks the parsing produces, and the more inefficient the compression becomes. The chances of this situation are small if the hash function $h_l$ emits fingerprints in a range $[0..m_l-1]$ with $m_l \gg |V_l|$. However, we do not know $|V_l|$ a priori, so we have to choose a very large $m_l$. This decision has a trade-off because a large $m_l$ means that the cells of $F$ require more bits, and hence more working memory. In Section 4, we investigate suitable values for $m_l$ in large collections.
Now, we present the theoretical cost of BuildGram.
Theorem 2.
Let $\mathcal{T} = \{T_1, \ldots, T_k\}$ be a collection of $k$ strings and $n$ symbols, where the longest string has length $\ell$. Additionally, let $H$ be a set of hash functions with $\lceil \log \ell \rceil + 1$ elements. $BuildGram(\mathcal{T}, H)$ runs in $O(n)$ time w.h.p. and requires $O(G \log n)$ bits on top of $\mathcal{T}$, where $G$ is the grammar size of $\mathcal{G}$.
Proof.
Calculating the type of each $T^{l-1}[i]$ in $\mathcal{T}^{l-1}$ takes $O(1)$ time if we have the array $F$ with precomputed fingerprints of $V_{l-1}$. In addition, we can use a hash table to record the parsing phrases in $\mathcal{T}^{l-1}$, which takes $O(1)$ time per phrase w.h.p. If we consider all the strings in $\mathcal{T}^{l-1}$, the running time of parsing round $l$ is $O(\|\mathcal{T}^{l-1}\|)$ in expectation. On the other hand, all the phrases in $T^{l-1}$ have length $\geq 2$, except (possibly) one phrase at each end of $T^{l-1}$. Therefore, $T^{l}$ has $|T^{l-1}|/2 + O(1)$ symbols in the worst case. Considering that BuildGram requires $O(\log \ell)$ rounds, the cumulative length of the $\mathcal{T}^{l}$ is $O(n + k\log\ell)$, with $O(k\log\ell)$ being the contribution of the phrases with length one. However, BuildGram stops processing a string as soon as it is fully compressed, meaning that length-one phrases contribute $O(k)$ elements in the worst case. Therefore, as $k \leq n$, the running time of BuildGram is $O(n)$ w.h.p.
Let $G$ be the number of symbols in $\mathcal{G}$. The $O(G \log n)$ bits of working space in BuildGram comprise the bits of the hash tables, the array $F$ that stores the fingerprints, and the hash functions in $H$.
3.5 Merging grammars
We now present our algorithm for merging grammars. Let $\mathcal{T}_a$ and $\mathcal{T}_b$ be two collections, with $\ell_a$ and $\ell_b$ being the lengths of the longest strings in $\mathcal{T}_a$ and $\mathcal{T}_b$, respectively, and $\mathcal{T}_{ab} = \mathcal{T}_a \cup \mathcal{T}_b$ being their union (Section 2.1). Furthermore, let $\mathcal{G}_a = BuildGram(\mathcal{T}_a, H)$ and $\mathcal{G}_b = BuildGram(\mathcal{T}_b, H)$ be grammars that only generate the strings in $\mathcal{T}_a$ and $\mathcal{T}_b$, respectively. We assume that $H$ has $\lceil \log \max(\ell_a, \ell_b) \rceil + 1$ elements (see Section 3.2). The instance $MergeGrams(\mathcal{G}_a, \mathcal{G}_b)$ returns a locally consistent grammar $\mathcal{G}_{ab}$ that only generates the strings in $\mathcal{T}_{ab}$, and that is equivalent to the output of $BuildGram(\mathcal{T}_{ab}, H)$.
Overview of the algorithm.
The merging consists of making $\mathcal{G}_a$ absorb the content that is unique to $\mathcal{G}_b$. Specifically, we discard the rules of $\mathcal{G}_b$ whose expansions occur in $\mathcal{G}_a$, and for those expanding to sequences not in $\mathcal{G}_a$, we add them as new rules in $\mathcal{G}_a$.
3.6 The merge algorithm
For the grammar $\mathcal{G}_a$, let $\mathcal{R}^a$ be its set of rules, let $V^a$ be its set of nonterminals, and let $h_a$ be its height. The symbols $\mathcal{R}^b$, $V^b$, and $h_b$ denote equivalent information for $\mathcal{G}_b$. We consider the partitions $\mathcal{R}^a = \{\mathcal{R}^a_1, \ldots, \mathcal{R}^a_{h_a}\}$ and $V^a = \{V^a_1, \ldots, V^a_{h_a}\}$, where every pair $(\mathcal{R}^a_l, V^a_l)$ is the set of rules and nonterminals (respectively) that $BuildGram(\mathcal{T}_a, H)$ produced during parsing round $l$. The elements $\mathcal{R}^b_l$ and $V^b_l$ are the equivalent partitions for $\mathcal{G}_b$.
MergeGrams processes the grammar levels in increasing order. In each round $l$, we keep the invariant that the right-hand sides of $\mathcal{R}^a_l$ and $\mathcal{R}^b_l$ are comparable. That is, given two rules $X \to A \in \mathcal{R}^a_l$ and $Y \to B \in \mathcal{R}^b_l$, $A = B$ implies $\exp(X) = \exp(Y)$. When $l = 1$, the invariant holds as the right-hand sides of $\mathcal{R}^a_1$ and $\mathcal{R}^b_1$ are over $\Sigma$, which is the same for all collections (Section 2.1).
We begin the algorithm by creating an array $L_a[1..h_a]$ that stores in $L_a[l]$ the number of symbols with level less than $l$. Observe that $L_a[1] = \sigma$ and $L_b[1] = \sigma$ because $V^a_0 = V^b_0 = \Sigma$. We create the equivalent array $L_b$ for $\mathcal{G}_b$. We also initialize the sets $\mathcal{S}^b_1, \ldots, \mathcal{S}^b_{h_b}$, where every $\mathcal{S}^b_l$ keeps the indexes of $C_b$ with level-$l$ symbols.
The first step of merge round $l$ is to scan the right-hand sides of $\mathcal{R}^a_l$, and for each rule $X \to A \in \mathcal{R}^a_l$, we modify $X = X - L_a[l]$ and $A[j] = A[j] - L_a[l-1]$, with $j \in [1..|A|]$ (we regard $L_a[0] = 0$). The change allows us to append new elements to $V^a_l$ and $\mathcal{R}^a_l$ while maintaining the validity of the symbols on the right-hand sides of $\mathcal{R}^a_{l+1}$, whose alphabet is $V^a_l$. Then, we create a hash table $\mathcal{H}$ that stores every modified rule $X \to A$ as a key-value pair $(A, X)$, and an empty array $M$ to update the right-hand sides of $\mathcal{R}^b_{l+1}$.
We check which right-hand sides in $\mathcal{R}^b_l$ occur as keys in $\mathcal{H}$. Recall that, when $l = 1$, the strings in $\mathcal{R}^b_1$ are already comparable to the keys in $\mathcal{H}$ because they are over $\Sigma$ and the subtraction of $L_a[0] = 0$ does not change their values. For $l > 1$, we make these strings comparable during the previous round $l-1$. If the string $B$ of a rule $Y \to B \in \mathcal{R}^b_l$ occurs in $\mathcal{H}$ as a key, we extract the associated value $X$ from the hash table. On the other hand, if $B$ does not exist in $\mathcal{H}$, we create a new symbol $X = |V^a_l| + 1$ and set $V^a_l = V^a_l \cup \{X\}$. Subsequently, we record the new rule $X \to B$ in $\mathcal{R}^a_l$ and store $M[Y] = X$. Once we process $\mathcal{R}^b_l$, we scan the right-hand sides of $\mathcal{R}^b_{l+1}$ and use $M$ to update their symbols. We also use $M$ to update each position $j \in \mathcal{S}^b_l$ as $C_b[j] = M[C_b[j]]$. Now, we discard $\mathcal{H}$ and $M$ and continue to the next merge round $l+1$.
When we finish round $l$, the right-hand sides of $\mathcal{R}^a_{l+1}$ are over the updated set $V^a_l$, and the right-hand sides of $\mathcal{R}^b_{l+1}$ will be over $V^a_l$ too after we update their values with $M$. These modifications make both string sets comparable, and our invariant will hold for round $l+1$.
After $\min(h_a, h_b)$ rounds of merge, one of the input grammars will run out of levels. The remaining rounds will skip the creation and querying of $\mathcal{H}$, and will append new rules directly to $\mathcal{G}_a$ (if any). After we finish the rounds, we concatenate the compressed strings to form $C_{ab} = C_a \cdot C_b$ and update the start rule to $S \to C_{ab}$. The last step of MergeGrams is to restore the symbols of the merged grammar to a global alphabet. For that purpose, we recompute $L_a$, and for every level-$l$ rule $X \to A$, we set $X = X + L_a[l]$ and $A[j] = A[j] + L_a[l-1]$, with $j \in [1..|A|]$. Figure 3 shows an example of MergeGrams.
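The following C++ sketch of ours illustrates one merge round over a single level, with nonterminals represented as level-local ranks as in the description above (we use `std::map` for brevity where the algorithm uses a hash table; data layout and names are simplified assumptions, not LCG's actual structures).

```cpp
#include <cstdint>
#include <map>
#include <vector>

using Rule = std::vector<uint32_t>;  // right-hand side, symbols as level-local ranks

// One round of MergeGrams over level l (Section 3.6): the rules of G_a become
// the keys of a table; each rule of G_b either reuses the matching nonterminal
// of G_a or is appended as a new rule. The returned array plays the role of M:
// for every level-l nonterminal of G_b, its identity in the merged grammar.
std::vector<uint32_t> merge_level(std::vector<Rule>& Ra,        // G_a, level l
                                  const std::vector<Rule>& Rb)  // G_b, level l
{
    std::map<Rule, uint32_t> H;              // key-value pairs (A, X)
    for (uint32_t x = 0; x < Ra.size(); ++x) H.emplace(Ra[x], x);
    std::vector<uint32_t> M(Rb.size());
    for (uint32_t y = 0; y < Rb.size(); ++y) {
        auto [it, fresh] = H.emplace(Rb[y], (uint32_t)Ra.size());
        if (fresh) Ra.push_back(Rb[y]);      // expansion unique to G_b: new rule
        M[y] = it->second;                   // otherwise: reuse G_a's nonterminal
    }
    return M;
}

// Rewriting level l+1 of G_b with M keeps the invariant that the right-hand
// sides of both grammars are comparable in the next round.
void relabel_next_level(std::vector<Rule>& Rb_next, const std::vector<uint32_t>& M) {
    for (Rule& B : Rb_next)
        for (uint32_t& sym : B) sym = M[sym];
}
```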
Theorem 3.
Let $\mathcal{G}_a$ (respectively, $\mathcal{G}_b$) be a locally consistent grammar that generates the strings in a collection $\mathcal{T}_a$ with $n_a$ symbols (respectively, a collection $\mathcal{T}_b$ with $n_b$ symbols). The size of $\mathcal{G}_a$ is $G_a$ and the size of $\mathcal{G}_b$ is $G_b$. Similarly, let $g_a$ (respectively, $g_b$) be the number of grammar symbols. $MergeGrams(\mathcal{G}_a, \mathcal{G}_b)$ builds a locally consistent grammar generating the strings in $\mathcal{T}_{ab} = \mathcal{T}_a \cup \mathcal{T}_b$ in $O(G_a + G_b)$ time w.h.p. and $O((G_a + G_b)\log(n_a + n_b))$ bits of space.
Proof.
We obtain $L_a$ and $L_b$ in one scan of the nonterminal sets, which takes $O(g_a + g_b)$ time. It is also possible to obtain the sets $\mathcal{S}^b_l$ in $O(|C_b|)$ time. Let $G^l_a$ (respectively, $G^l_b$) be the number of symbols on the right-hand sides of $\mathcal{R}^a_l$ (respectively, $\mathcal{R}^b_l$). Filling the hash table $\mathcal{H}$ requires a linear scan of $\mathcal{R}^a_l$, which runs in $O(G^l_a)$ time w.h.p. On the other hand, scanning $\mathcal{R}^b_l$ and querying its right-hand sides in $\mathcal{H}$ takes $O(G^l_b)$ time w.h.p. In addition, modifying the right-hand sides of $\mathcal{R}^b_{l+1}$ with $M$ takes $O(G^{l+1}_b)$ time. If we transfer the cost of updating $\mathcal{R}^b_{l+1}$ to the next round $l+1$, performing merge round $l$ takes $O(G^l_a + G^l_b)$ time w.h.p. Now, let $g^l_a$ and $g^l_b$ be the number of level-$l$ symbols in each grammar. We require $O((h_a + h_b)\log(n_a + n_b))$ bits to encode $L_a$ and $L_b$, $O(G^l_a \log(n_a + n_b))$ bits for $\mathcal{H}$, and $O(g^l_b \log(n_a + n_b))$ bits for $M$. Consequently, the cost of round $l$ is $O((G^l_a + G^l_b)\log(n_a + n_b))$ bits. As $\sum_l G^l_a = G_a$ and $\sum_l G^l_b = G_b$, MergeGrams runs in $O(G_a + G_b)$ time w.h.p. and uses $O((G_a + G_b)\log(n_a + n_b))$ bits of space.
4 Experiments
Implementation details.
We implemented our framework in C++ in a tool called LCG (https://github.com/ddiazdom/lcg). We support parallel compression by interleaving executions of BuildGram and MergeGrams as follows: given a collection $\mathcal{T}$ and an integer $t$, LCG uses $t$ threads that execute BuildGram in parallel to compress different subsets of $\mathcal{T}$ into different buffer grammars. When the combined space usage of the buffer grammars exceeds a given threshold, LCG merges them into a sink grammar using MergeGrams and resets the buffer grammars. Appendix A explains this idea in more detail. We refer to this strategy as PBuildGram to differentiate it from our description in Section 3.3. After running PBuildGram, LCG run-length compresses the output (RL step), and then removes unique nonterminals from the output of PBuildGram + RL (Simp step).
Experimental setup and inputs.
We compared LCG against other state-of-the-art compressors, measuring the compression speed in MB per second (MB/s), the peak of the working memory in bits per input symbol (bps), and the compression ratio. Furthermore, we assessed the amount of compression LCG achieves, its resource usage, and how it scales with the number of threads. We conducted the experiments on a machine with AlmaLinux 8.4, 3 TiB of RAM, and an Intel(R) Xeon(R) CPU E7-8890 v4 @ 2.20GHz processor with 192 cores. We tested four collections. HUM: all the human genome assemblies available in NCBI up to August 27, 2024 (3.46 TB, $\sigma = 16$). ATB: release 2.0 of the AllTheBacteria dataset [17], which contains genomes of bacteria and archaea (7.9 TB). COVID: all the SARS-CoV-2 genomes in NCBI up to November 5, 2024. KERNEL: the last 40 versions of the Linux kernel (.h, .c, .txt, and .rst files) up to December 13, 2024.
Competitor tools.
zstd (https://github.com/facebook/zstd) is an efficient tool that uses a simplified version of LZ and encodes the output using Huffman [16] and ANS [13]. agc (https://github.com/refresh-bio/agc) is a compressor for highly similar genomic sequences by Deorowicz et al. [8] that breaks the strings into segments and groups the segments into blocks according to sequence similarity. RePair (https://users.dcc.uchile.cl/~gnavarro/software/repair.tgz) is a popular grammar compression algorithm by Larsson and Moffat [26] that recursively replaces the most common pair of symbols in the text. BigRePair (https://gitlab.com/manzai/bigrepair) is a RePair variant by Gagie et al. [14] that scales the compression by using prefix-free parsing [3]. LCG, agc, zstd, and BigRePair support multithreading, so we used 16 threads in each. RePair does not support multithreading. For zstd, we used compression level 15 and a window size of 2 GiB – the tool does not allow longer windows with this compression level.
4.1 Results and discussion
Table 1: Comparison of LCG against the competitor tools. A dash means we did not run the tool on that input.

| Tool | Compression ratio (plain/compressed) | | | | Compression speed (MB/s) | | | | Working memory (bps) | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | ATB | HUM | COVID | KERNEL | ATB | HUM | COVID | KERNEL | ATB | HUM | COVID | KERNEL |
| LCG | 85.26 | 135.54 | 328.10 | 99.99 | 232.26 | 244.73 | 506.44 | 237.91 | 0.43 | 0.29 | 0.36 | 2.05 |
| agc | – | 144.90 | 237.93 | – | – | 120.42 | 53.71 | – | – | 0.15 | 0.18 | – |
| zstd | 58.19 | 4.72 | 344.99 | 38.23 | 95.51 | 27.21 | 442.79 | 226.45 | 0.004 | 0.01 | 0.10 | 0.53 |
| RePair | – | – | 778.62 | 127.03 | – | – | 1.89 | 2.21 | – | – | 30.63 | 150.11 |
| BigRePair | – | – | 586.12 | 109.87 | – | – | 47.89 | 21.04 | – | – | 2.28 | 3.69 |
4.1.1 Comparison with other text compressors
LCG was the fastest tool, with speeds ranging from 232.26 to 506.44 MB/s (Table 1). The speed of the other tools varied with the input, but RePair remained the slowest (1.89–2.21 MB/s). The fact that RePair used one thread and the other tools 16 does not alone explain these results. For instance, LCG was 268 times faster than RePair in COVID (506.44 MB/s versus 1.89 MB/s, respectively). We also observed that LCG achieved higher speeds in more compressible inputs (e.g., COVID), probably because the hash tables recording phrases from the text (see the encoding in Section A.1) have to perform more lookups than insertions – lookups are cheaper.
The most space-efficient tool was zstd, with a working memory usage of 0.004–0.53 bps. This result is due to the cap of 2 GiB that zstd uses for the LZ window, regardless of the input size. This threshold keeps memory usage low, but limits compression in large datasets where the repetitiveness is spread out (HUM or KERNEL). On the other hand, zstd with ATB yielded important space reductions, probably because Hunt et al. [17] preprocessed ATB to place similar strings close to each other to improve LZ compression. LCG used far less working memory than the other grammar compressors (0.36–2.05 bps versus 30.63–150.11 bps of RePair and 2.28–3.69 bps of BigRePair), though it is still high compared to zstd.
RePair obtained the best compression ratios and would likely outperform the other tools in ATB and HUM, where we could not run it. Compared to LCG, RePair achieved 2.37 times more compression in COVID and 1.27 times more in KERNEL. The difference was smaller with BigRePair, with RePair achieving 1.33 times more compression in COVID and 1.16 times in KERNEL. Although LCG did not obtain the best ratio, it still achieves important reductions, and its trade-off between compression and resource usage seems to be the best. Besides, it is still possible to compress LCG's output further by applying RePair or Huffman encoding on top of it. We think these additional steps would be fast, as they operate over a small grammar.
Table 2: Breakdown of our method: grammar sizes and the effect of the post-processing steps RL and Simp. A dash means the value is not available.

| Dataset | RePair size | PBuildGram size | RL size reduction (%) | Simp deleted rules (%) | Simp size reduction (%) |
|---|---|---|---|---|---|
| ATB | – | – | 0.69 | 83.35 | 27.02 |
| HUM | – | – | 36.73 | 72.30 | 22.31 |
| COVID | | | 35.16 | 88.71 | 29.11 |
| KERNEL | | | 1.76 | 82.25 | 24.62 |
4.1.2 Breakdown of our method
Compression.
PBuildGram produced substantially larger grammars than RePair in both KERNEL and COVID (see Table 2). However, we expected this result as PBuildGram is not greedy. Interestingly, PBuildGram produced fewer nonterminals than RePair in KERNEL. The maximum number of nonterminals produced in a parsing round occurred in ATB, indicating that 32-bit fingerprints (i.e., $m_l = 2^{32}$ in $h_l$) are likely to be enough to keep the number of collisions low in repetitive collections with dozens of TB, although nonrepetitive collections of this size might require larger fingerprints. Recall from Section 3.4 that the more bits we use for the fingerprints, the fewer collisions we have, and hence the smaller the impact on compression, but more bits also mean more working memory. The effect of RL varied with the input: the grammar size reductions ranged between 0.69% and 36.73%, with a negligible increase in the number of nonterminals. RL performed well in HUM and COVID because they have long equal-symbol runs of Ns. On the other hand, Simp removed 81.65% of the nonterminals and reduced the size of the grammar by 25.77%, on average.
Resource usage.
The grammar encoding we chose (Section A.1) had an important effect on the usage of resources. LCG spent most of its running time, on average, executing PBuildGram (Figure 4A). The bottleneck was the lookup/insertion of phrases in the hash tables of the grammars when BuildGram (a subroutine of PBuildGram) parsed its input text. These hash table operations are costly because of the high number of cache misses and string comparisons. In particular, the first three parsing rounds of BuildGram are the most expensive because the text is not yet small and produces a large set of phrases (see Figure 6). Consequently, BuildGram has to hash more frequently and in larger hash tables in those rounds. The impact of the other steps is negligible, with RL and Simp accounting for only small fractions of the running time. The working memory usage (Figure 4B) varied with the input: in ATB and HUM, the sink grammar and the fingerprints of PBuildGram accounted for most of the usage. The rest of the memory was satellite data: the arrays and grammars of the thread buffers. The results are different in COVID and KERNEL, where the memory usage is dominated by satellite data. The hash tables imposed a considerable space overhead as they store their keys (i.e., parsing phrases) in arrays of 32-bit cells to perform fast lookups (via memcmp). We can reduce the memory cost by using hash tables that store keys using VBytes instead. In this way, the keys still use an integral number of bytes, and we can still use memcmp for lookups. Our preliminary tests (not shown here) suggest that using VBytes in the hash tables (and 32-bit fingerprints) substantially reduces the space of the sink grammar in HUM, COVID, and KERNEL. If we consider that the buffer grammars use the same encoding, the change to VBytes could drastically reduce the peak of working memory in LCG. We can decrease the working memory even further by keeping parts of the sink grammar on disk.
Effect of parallelism.
The compression speed of LCG in HUM increased steadily with the number of threads (Figure 4C), while the peak of working memory remained stable up to a moderate number of threads. Beyond that point, the peak increased to 0.31 bps and kept growing as we added more threads.
5 Conclusions and further work
We presented a parallel grammar compressor that processes texts at high speeds while achieving high compression ratios. Our working memory usage is still high compared to popular general-purpose compressors like zstd, but we can greatly reduce the gap by using VByte encoding or keeping some parts of the grammar on disk. On the other hand, we use substantially less memory than popular grammar compressors. In fact, to our knowledge, LCG is the only grammar-based tool that scales to terabytes of data. Furthermore, our simple strategy captures repetitions from distant parts of the text, making it more robust than other widely spread compression heuristics. Additional reductions in LCG are possible by using greedy methods, such as RePair, or statistical compression on top of the output grammar, but doing so would slow down the analysis of the strings. As mentioned, our goal is not only to compress but also to scale string processing algorithms in massive collections. It has been shown in the literature that locally consistent grammars can speed up those algorithms [9, 11], but the efficient computation of the grammar remains a bottleneck. We solved that problem in this work. Integrating our scheme with those algorithms could enable the processing of an unprecedented volume of strings.
References
- [1] Tuğkan Batu and S Cenk Sahinalp. Locally consistent parsing and applications to approximate string comparisons. In Proc. 9th International Conference on Developments in Language Theory (DLT), pages 22–35, 2005.
- [2] Or Birenzwige, Shay Golan, and Ely Porat. Locally consistent parsing for text indexing in small space. In Proc. 31st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 607–626, 2020. doi:10.1137/1.9781611975994.37.
- [3] Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, and Taher Mun. Prefix-free parsing for building big BWTs. Algorithms for Molecular Biology, 14:1–15, 2019. doi:10.1186/S13015-019-0148-5.
- [4] Moses Charikar, Eric Lehman, Ding Liu, Rina Panigrahy, Manoj Prabhakaran, Amit Sahai, and Abhi Shelat. The smallest grammar problem. IEEE Transactions on Information Theory, 51(7):2554–2576, 2005. doi:10.1109/TIT.2005.850116.
- [5] Anders Roy Christiansen, Mikko Berggren Ettienne, Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Optimal-time dictionary-compressed indexes. ACM Transactions on Algorithms (TALG), 17(1):1–39, 2020. doi:10.1145/3426473.
- [6] Richard Cole and Uzi Vishkin. Deterministic coin tossing and accelerating cascades: micro and macro techniques for designing parallel algorithms. In Proc. 18th Annual ACM Symposium on Theory of Computing (STOC), pages 206–219, 1986. doi:10.1145/12130.12151.
- [7] Jin-Jie Deng, Wing-Kai Hon, Dominik Köppl, and Kunihiko Sadakane. FM-indexing grammars induced by suffix sorting for long patterns. In Proc. 22nd Data Compression Conference (DCC), pages 63–72. IEEE, 2022. doi:10.1109/DCC52660.2022.00014.
- [8] Sebastian Deorowicz, Agnieszka Danek, and Heng Li. AGC: compact representation of assembled genomes with fast queries and updates. Bioinformatics, 39(3):btad097, 2023. doi:10.1093/BIOINFORMATICS/BTAD097.
- [9] Diego Díaz-Domínguez and Gonzalo Navarro. Efficient construction of the BWT for repetitive text using string compression. Information and Computation, 294:105088, 2023. doi:10.1016/J.IC.2023.105088.
- [10] Diego Díaz-Domínguez, Gonzalo Navarro, and Alejandro Pacheco. An LMS-based grammar self-index with local consistency properties. In Proc. 28th International Symposium on String Processing and Information Retrieval (SPIRE), pages 100–113, 2021. doi:10.1007/978-3-030-86692-1_9.
- [11] Diego Díaz-Domínguez and Leena Salmela. Computing all-vs-all MEMs in grammar-compressed text. In Proc. 30th International Symposium on String Processing and Information Retrieval (SPIRE), pages 157–170. Springer, 2023. doi:10.1007/978-3-031-43980-3_13.
- [12] Martin Dietzfelbinger, Joseph Gil, Yossi Matias, and Nicholas Pippenger. Polynomial hash functions are reliable. In Proc. 19th International Colloquium on Automata, Languages and Programming (ICALP), pages 235–246, 1992.
- [13] Jarek Duda. Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding. arXiv preprint arXiv:1311.2540, 2013.
- [14] Travis Gagie, Tomohiro I, Giovanni Manzini, Gonzalo Navarro, Hiroshi Sakamoto, and Yoshimasa Takabatake. Rpair: Rescaling RePair with rsync. In Proc. 26th International Symposium on String Processing and Information Retrieval (SPIRE), pages 35–44, 2019. doi:10.1007/978-3-030-32686-9_3.
- [15] Paweł Gawrychowski, Adam Karczmarz, Tomasz Kociumaka, Jakub Łącki, and Piotr Sankowski. Optimal dynamic strings. In Proc. 29th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1509–1528, 2018.
- [16] David A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101, 1952.
- [17] Martin Hunt, Leandro Lima, Wei Shen, John Lees, and Zamin Iqbal. AllTheBacteria-all bacterial genomes assembled, available and searchable, 2024. bioRxiv preprint. doi:10.1101/2024.03.08.584059.
- [18] Artur Jeż. A really simple approximation of smallest grammar. Theoretical Computer Science, 616:141–150, 2016. doi:10.1016/J.TCS.2015.12.032.
- [19] Dominik Kempa and Tomasz Kociumaka. String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure. In Proc. 51st Annual ACM Symposium on Theory of Computing (STOC), pages 756–767, 2019. doi:10.1145/3313276.3316368.
- [20] Dominik Kempa and Tomasz Kociumaka. Resolution of the burrows-wheeler transform conjecture. In Proc. 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 1002–1013, 2020. doi:10.1109/FOCS46700.2020.00097.
- [21] Dominik Kempa and Tomasz Kociumaka. Dynamic suffix array with polylogarithmic queries and updates. In Proc. 54th Annual ACM Symposium on Theory of Computing (STOC), pages 1657–1670, 2022. doi:10.1145/3519935.3520061.
- [22] Dominik Kempa and Tomasz Kociumaka. Collapsing the hierarchy of compressed data structures: Suffix arrays in optimal compressed space. In Proc. 64th Annual Symposium on Foundations of Computer Science (FOCS), pages 1877–1886, 2023. doi:10.1109/FOCS57990.2023.00114.
- [23] John C. Kieffer and En Hui Yang. Grammar–based codes: a new class of universal lossless source codes. IEEE Transactions on Information Theory, 46(3):737–754, 2000. doi:10.1109/18.841160.
- [24] Tomasz Kociumaka, Gonzalo Navarro, and Francisco Olivares. Near-optimal search time in δ-optimal space, and vice versa. Algorithmica, 86(4):1031–1056, 2024. doi:10.1007/S00453-023-01186-0.
- [25] Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Toward a definitive compressibility measure for repetitive sequences. IEEE Transactions on Information Theory, 69(4):2074–2092, 2022. doi:10.1109/TIT.2022.3224382.
- [26] N. Jesper Larsson and Alistair Moffat. Off-line dictionary-based compression. Proceedings of the IEEE, 88(11):1722–1732, 2000. doi:10.1109/5.892708.
- [27] Abraham Lempel and Jacob Ziv. On the complexity of finite sequences. IEEE Transactions on Information Theory, 22(1):75–81, 1976. doi:10.1109/TIT.1976.1055501.
- [28] Heng Li and Richard Durbin. Fast and accurate long-read alignment with Burrows-Wheeler Transform. Bioinformatics, 26(5):589–595, 2010. doi:10.1093/BIOINFORMATICS/BTP698.
- [29] Kurt Mehlhorn, Rajamani Sundar, and Christian Uhrig. Maintaining dynamic sequences under equality tests in polylogarithmic time. Algorithmica, 17:183–198, 1997. doi:10.1007/BF02522825.
- [30] S. Muthukrishnan and Süleyman Cenk Sahinalp. Approximate nearest neighbors and sequence comparison with block operations. In Proc. 32nd Annual ACM Symposium on Theory of Computing (STOC), pages 416–424, 2000. doi:10.1145/335305.335353.
- [31] Gonzalo Navarro. Computing MEMs and relatives on repetitive text collections. ACM Transactions on Algorithms, 21(1):1–33, 2024. doi:10.1145/3701561.
- [32] Craig G. Nevill-Manning and Ian H. Witten. Compression and explanation using hierarchical grammars. The Computer Journal, 40(2–3):103–116, 1997.
- [33] Ge Nong. Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Transactions on Information Systems, 31(3):1–15, 2013.
- [34] Ge Nong, Sen Zhang, and Wai Hong Chan. Linear suffix array construction by almost pure induced-sorting. In Proc. 19th Data Compression Conference (DCC), pages 193–202, 2009. doi:10.1109/DCC.2009.42.
- [35] Daniel Saad Nogueira Nunes, Felipe A. Louza, Simon Gog, Mauricio Ayala-Rincón, and Gonzalo Navarro. A grammar compression algorithm based on induced suffix sorting. In Proc. 28th Data Compression Conference (DCC), pages 42–51, 2018. doi:10.1109/DCC.2018.00012.
- [36] Carlos Ochoa and Gonzalo Navarro. RePair and all irreducible grammars are upper bounded by high-order empirical entropy. IEEE Transactions on Information Theory, 65(5):3160–3164, 2018. doi:10.1109/TIT.2018.2871452.
- [37] Sofya Raskhodnikova, Dana Ron, Ronitt Rubinfeld, and Adam Smith. Sublinear algorithms for approximating string compressibility. Algorithmica, 65:685–709, 2013. doi:10.1007/S00453-012-9618-6.
- [38] Süleyman Cenk Sahinalp and Uzi Vishkin. Symmetry breaking for suffix tree construction. In Proc. 26th Annual ACM Symposium on Theory of Computing (STOC), pages 300–309, 1994. doi:10.1145/195058.195164.
- [39] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337–343, 1977. doi:10.1109/TIT.1977.1055714.
Appendix A Implementation details
This section presents the details of the implementation of PBuildGram, the compression algorithm we implemented in LCG. We use BuildGram (Section 3.3) and MergeGrams (Section 3.5) as building blocks to parallelise the process without losing compression power. Section A.1 describes the encoding we use for locally consistent grammars, Section A.2 introduces some changes in BuildGram to make the parallel execution easier, and Section A.3 describes the steps of PBuildGram.
A.1 Grammar encoding
Let $\mathcal{G}$ be a locally consistent grammar of height $h$ generating the strings in $\mathcal{T}$. Additionally, let $(\mathcal{R}_l, V_l)$ be the $l$th pair, with $l \in [1..h]$, of the level-based partition of $\mathcal{G}$ (see Section 3.1). A hash table $\mathcal{H}_l$ keeps for every rule $X \to A \in \mathcal{R}_l$ the key-value pair $(A, X)$, while an array $F_l$ stores the fingerprints of $V_l$. The array $F_0$ stores the fingerprints for $\Sigma$. Additionally, the array $C$ maintains the symbols representing the compressed strings of $\mathcal{T}$. Similarly to Section 3.6, the sets $\mathcal{S}_1, \ldots, \mathcal{S}_h$ store in $\mathcal{S}_l$ the indexes of $C$ with level-$l$ symbols. Overall, the encoding of $\mathcal{G}$ comprises the pairs $(\mathcal{H}_l, F_l)$, the array $F_0$, the array $C$, and the sets $\mathcal{S}_l$. We remark that the nonterminals in the encoding represent ranks. Specifically, the value $X$ for a rule $X \to A \in \mathcal{R}_l$ means that this rule is the $X$th in $\mathcal{H}_l$, while the value $A[j]$, with $j \in [1..|A|]$, means that the rule where $A[j]$ is the left-hand side is the $A[j]$th in $\mathcal{H}_{l-1}$. The use of ranks simplifies the merge with other grammars (see Section 3.6). From now on, we will use the operator $\mathrm{size}(\mathcal{G})$ to denote the amount of bits of the encoding. We extend it so that $\mathrm{size}(\mathcal{G}_1, \ldots, \mathcal{G}_t)$ represents the sum of the encodings' sizes. We implemented each hash table using Robin Hood hashing, storing the nonterminal phrases in 32-bit cells. Finally, we implemented $H$ using the C++ library xxHash (https://github.com/Cyan4973/xxHash).
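A schematic C++ rendering of this encoding follows (field and type names are ours; LCG's actual layout uses Robin Hood tables, xxHash, and packs phrases into 32-bit cells):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// FNV-1a over the symbols of a phrase (a stand-in for xxHash).
struct PhraseHash {
    size_t operator()(const std::vector<uint32_t>& p) const {
        size_t h = 14695981039346656037ULL;
        for (uint32_t s : p) { h ^= s; h *= 1099511628211ULL; }
        return h;
    }
};

// One level l of the grammar: the pair (H_l, F_l) of Section A.1.
struct Level {
    // Key: right-hand side over the ranks of level l-1; value: rank in H_l.
    std::unordered_map<std::vector<uint32_t>, uint32_t, PhraseHash> rules;
    std::vector<uint64_t> fingerprints;         // F_l[X] for every level-l rank X
};

// The whole encoding of a grammar of height h.
struct Grammar {
    std::vector<uint64_t> f0;                   // F_0: fingerprints of Sigma
    std::vector<Level> levels;                  // (H_l, F_l) for l = 1..h
    std::vector<uint32_t> comp;                 // C: the compressed strings
    std::vector<std::vector<uint32_t>> s_sets;  // S_l: indexes of C per level
};
```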
A.2 Modification to the grammar algorithm
We modify BuildGram to operate in parallel efficiently. The new signature of the algorithm is $BuildGram(\mathcal{T}, H, \mathcal{G}_s, \mathcal{G}_c)$, where $\mathcal{G}_s$ and $\mathcal{G}_c$ are two (possibly nonempty) grammars, and the output is an updated version of $\mathcal{G}_c$. In this variant, we only record phrases that are not in $\mathcal{G}_s$ or $\mathcal{G}_c$. We store these phrases in $\mathcal{G}_c$ and keep $\mathcal{G}_s$ unchanged, using it as a "read-only" component. However, the general strategy to compress $\mathcal{T}$ remains the same. The convenience of the change will become evident in the next section. We assume that $\mathcal{G}_s$ and $\mathcal{G}_c$ use the encoding we described in Section A.1. We add the superscript $s$ or $c$ to the encoding's components to differentiate their origin, $\mathcal{G}_s$ or $\mathcal{G}_c$ (respectively).
BuildGram now works as follows: let us assume that we are in the $l$th parsing round and that we receive the partially compressed collection $\mathcal{T}^{l-1}$ as input. The alphabet of each $T^{l-1} \in \mathcal{T}^{l-1}$ is $V^s_{l-1} \cup V^c_{l-1}$, and we use the fingerprints in $F^s_{l-1}$ and $F^c_{l-1}$ to compute the types of each $T^{l-1}[i]$ (Equation 3) and thus the parsing phrases. Notice that, when $l = 1$, it holds that $V^s_0 = V^c_0 = \Sigma$ and $F^s_0 = F^c_0$. Let $A$ be the active phrase in the parsing. We first check if $A$ exists in $\mathcal{H}^s_l$ as a key. If that is the case, we get its corresponding value in the hash table and assign it to the active phrase. On the other hand, if $A$ is not a key in $\mathcal{H}^s_l$, we perform a lookup operation in $\mathcal{H}^c_l$. Like before, if $A$ is a key there, we get the corresponding value and assign it to the active phrase. Finally, if $A$ is not a key in $\mathcal{H}^c_l$ either, we insert the active phrase into $\mathcal{H}^c_l$ associated with a new metasymbol. Let $v_s$ and $v_c$ be the sizes of $V^s_l$ and $V^c_l$ (respectively) before starting parsing round $l$. The value we assign to the $r$th new metasymbol of the round is $v_s + v_c + r$. Subsequently, we replace $A$ by its metasymbol in $T^{l-1}$ and move to the next phrase. Before ending the parsing round, we store in $F^c_l$ the fingerprints for the new phrases we inserted in $\mathcal{H}^c_l$. Let $A$ be one of the new phrases, and let $X$ be its metasymbol. We compute the fingerprint of $A$ using Equation 4 and store the result in $F^c_l$. Identifying the provenance of a symbol $X$ in the next round is simple: $X \leq v_s$ means that the phrase associated with $X$ is a key in $\mathcal{H}^s_l$ and its fingerprint is in $F^s_l$ ($\mathcal{G}_s$ never changes). On the other hand, $X > v_s$ means that its phrase is in $\mathcal{H}^c_l$ and its fingerprint in $F^c_l$. Recall that we need $F^s_{l-1}$ and $F^c_{l-1}$ to compute the position types for round $l$. Similarly, we can obtain the fingerprints for the new phrases of the round by modifying Equation 4 to receive two fingerprint arrays instead of one.
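The lookup order of this variant can be sketched in C++ as follows (ours; `Table` and `PhraseHash` are simplified stand-ins for the Robin Hood tables of Section A.1):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

using Phrase = std::vector<uint32_t>;
struct PhraseHash {
    size_t operator()(const Phrase& p) const {
        size_t h = 14695981039346656037ULL;   // FNV-1a over the symbols
        for (uint32_t s : p) { h ^= s; h *= 1099511628211ULL; }
        return h;
    }
};
using Table = std::unordered_map<Phrase, uint32_t, PhraseHash>;

// Lookup order of the modified BuildGram: sink first (read-only), then the
// buffer; a phrase missing from both is inserted into the buffer only.
// v_s and v_c are the sizes of V^s_l and V^c_l before the round started;
// `fresh` counts the new metasymbols created so far during the round.
uint32_t assign_metasymbol(const Phrase& A, const Table& sink, Table& buffer,
                           uint32_t v_s, uint32_t v_c, uint32_t& fresh) {
    if (auto it = sink.find(A); it != sink.end()) return it->second;
    if (auto it = buffer.find(A); it != buffer.end()) return it->second;
    uint32_t X = v_s + v_c + ++fresh;   // the r-th new metasymbol: v_s + v_c + r
    buffer.emplace(A, X);               // record the phrase only in the buffer
    return X;
}
```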
A.3 Parallel grammar construction
Now that we have explained our grammar encoding and the changes that we need to compress in parallel, we are ready to describe PBuildGram. Figure 5 shows the steps in detail. We receive as input a string collection $\mathcal{T}$ (a file), the number $t$ of compression threads, and a threshold $M$ indicating the approximate amount of working memory we can use. We first initialize $t$ distinct buffers and an empty sink grammar $\mathcal{G}_s$. Each buffer $B_j$, with $j \in [1..t]$, is a triplet $(I_j, \mathcal{G}_j, D_j)$, where $I_j$ is an array to store chunks of $\mathcal{T}$ (one at a time), $\mathcal{G}_j$ is the grammar where we compress the chunks we load into $I_j$, and $D_j$ is an array of pairs indicating the chunks we have loaded into $I_j$. Specifically, a pair $(x, y) \in D_j$ means that one chunk we loaded into $I_j$ contained the subset $T_x, \ldots, T_y$. Notice that these strings lie contiguously in $\mathcal{T}$'s file. There is also an array $D_s$ with information equivalent to that of $D_j$ but for $\mathcal{G}_s$.
PBuildGram consists in a loop that interleaves two steps, compression and merge. During the compression step, $t$ parallel threads compress the buffers into the corresponding buffer grammars $\mathcal{G}_1, \ldots, \mathcal{G}_t$, and continue doing so while $\mathrm{size}(\mathcal{G}_s, \mathcal{G}_1, \ldots, \mathcal{G}_t) \leq M$. When the space exceeds the threshold, a merge step collapses $\mathcal{G}_1, \ldots, \mathcal{G}_t$ into the sink grammar $\mathcal{G}_s$ and flushes the buffers. The algorithm then enters a new iteration that restarts the compression from the point in $\mathcal{T}$ where it left off the last time.
The compression step initializes an I/O thread that reads $\mathcal{T}$ from disk, loading the chunks sequentially from left to right into the arrays $I_j$. On the other hand, the compression threads process the arrays in parallel using the variant of BuildGram we described in Section A.2. Every thread receives a buffer $B_j$ and runs $BuildGram(I_j, H, \mathcal{G}_s, \mathcal{G}_j)$. We synchronise the I/O thread with the compression threads using two concurrent queues $Q_{in}$ and $Q_{out}$. The queue $Q_{in}$ keeps the buffers with chunks that are ready to be processed by the compression threads, whereas $Q_{out}$ contains the buffers that were already processed and can be recycled by the I/O thread to insert new chunks. When the algorithm starts, $Q_{out}$ contains all the buffers.
The synchronisation process works as follows: let $x$ be the next chunk of $\mathcal{T}$ that PBuildGram has to process. The I/O thread extracts the head buffer $B_j$ from $Q_{out}$, reads the $x$th chunk from disk, and loads it into $I_j$. Subsequently, it appends the chunk's pair to $D_j$, and finally it appends $B_j$ to $Q_{in}$. The I/O thread continues to process the next chunks in the same way as long as the compression process remains active. On the other hand, each compression thread tries to acquire the next buffer available at the head of $Q_{in}$. After the thread acquires the buffer $B_j$ and runs BuildGram, it flushes $I_j$ and pushes $B_j$ into $Q_{out}$, thus marking this buffer for recycling. Notice that a compression thread can process multiple noncontiguous chunks of $\mathcal{T}$ and collapse their information into the same grammar $\mathcal{G}_j$. However, later in the execution of the algorithm, we use the information in the arrays $D$ to fix this problem.
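A minimal thread-safe queue supporting this recycling scheme could look as follows (our simplification; LCG's actual queues may differ):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// A blocking concurrent queue for the buffer-recycling scheme of Section A.3.
// The I/O thread pops free buffers from one queue, fills them with chunks, and
// pushes them into the other; the compression threads do the opposite.
template <class T>
class ConcurrentQueue {
    std::queue<T> q_;
    std::mutex mtx_;
    std::condition_variable cv_;
public:
    void push(T v) {
        { std::lock_guard<std::mutex> lk(mtx_); q_.push(std::move(v)); }
        cv_.notify_one();                 // wake up one waiting consumer
    }
    T pop() {                             // blocks until an element is available
        std::unique_lock<std::mutex> lk(mtx_);
        cv_.wait(lk, [&] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
};
```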
During the merge step, we execute $MergeGrams(\mathcal{G}_s, \mathcal{G}_j)$ with each buffer grammar $\mathcal{G}_j$. It is possible to collapse the grammars in parallel in a merge-sort fashion: let us assume w.l.o.g. that $t$ is a power of two. Thus, $t/2$ threads execute in parallel $t/2$ MergeGrams processes to produce the new grammars $\mathcal{G}'_1, \ldots, \mathcal{G}'_{t/2}$. Subsequently, the threads collapse the new grammars in the same way, and the process continues until only one grammar remains. Every time we execute MergeGrams, we also concatenate the corresponding arrays $D$ to keep track of the chunks of $\mathcal{T}$ that the resulting grammar encodes. Notice that after the merge, $\mathcal{G}_s$ has all the information of $\mathcal{G}_1, \ldots, \mathcal{G}_t$. Finally, we reset the buffers and begin a new iteration of compression.
After we process all the chunks, we perform one last merge step to collapse the buffers into $\mathcal{G}_s$ and then use the information in $D_s$ to reorder the elements of $C_s$. Once we finish, we return $\mathcal{G}_s$ and complete the execution of PBuildGram.
A.4 Storing the final grammar
As mentioned, we post-process the output of PBuildGram using RL and then Simp. The resulting file of this process (i.e., the output of LCG) encodes the rules of the grammar, the pointers to the right-hand sides of $\mathcal{R}$, and the pointers to the compressed sequences of the strings in $\mathcal{T}$.
A.5 Advantage of our parallel scheme
Our parallel scheme can use a high number of threads with little contention (and thus achieve high compression speeds), while keeping the amount of working memory manageable. A thread executing $BuildGram(I_j, H, \mathcal{G}_s, \mathcal{G}_j)$ is the only one modifying $I_j$ and $\mathcal{G}_j$, and although the sink $\mathcal{G}_s$ can be accessed by other threads concurrently, they only read information (i.e., little to no contention). On the other hand, there is some contention when the threads modify the queues $Q_{in}$ and $Q_{out}$ concurrently to remove or insert buffers (respectively). However, the compression threads spend most of their time executing BuildGram (modifying a queue is cheap), and it is unlikely that many of them attempt to access the same queue at the same time. On the other hand, the I/O thread might face more contention as we increase the number of threads because it has to compete with the compression threads for the queues.
As mentioned above, PBuildGram keeps the consumption of working memory manageable as we add more threads. Specifically, $\mathrm{size}(\mathcal{G}_1, \ldots, \mathcal{G}_t)$ does not grow proportionally to $t$ relative to $\mathrm{size}(\mathcal{G}_s)$. In the first compression iteration, the sink grammar is empty and, because the compression threads do not synchronise when they execute BuildGram, the grammars $\mathcal{G}_1, \ldots, \mathcal{G}_t$ will be redundant. Therefore, memory usage will grow rapidly at the beginning, exceeding the memory threshold $M$ and triggering a merge. In this phase, we will collapse the redundant content into $\mathcal{G}_s$ and delete the buffer grammars, thus reducing memory consumption. In the next compression iteration, $\mathcal{G}_s$ will be non-empty, so every instance $BuildGram(I_j, H, \mathcal{G}_s, \mathcal{G}_j)$ will only add to $\mathcal{G}_j$ what is not in $\mathcal{G}_s$. In addition, $\mathcal{G}_s$ will grow more slowly with every new merge iteration because there will be less "new" sequence content in $\mathcal{T}$. We note that the memory usage still depends on $\mathrm{size}(\mathcal{G}_s)$, which depends, in turn, on the amount of repetitiveness in $\mathcal{T}$.
Appendix B Speeding up string processing algorithms (sketch)
In this section, we briefly explain how locally consistent grammars help speed up string processing algorithms (the idea might vary with the application). The most expensive operation in string algorithms is to find matching sequences. We use the fact that a grammar collapses redundant sequence information, so the search space in which a string algorithm has to operate is significantly smaller in the grammar than in the text.
Let $\mathcal{G}$ be a locally consistent grammar generating the elements in $\mathcal{T}$, and let ALG be a string processing algorithm. As before, we divide $\mathcal{G}$ according to levels. We regard the right-hand sides of each $\mathcal{R}_l$ as a string collection and run (a section of) ALG using $\mathcal{R}_l$ as input. The process will give some complete answers and some partial answers. We pass the complete answers to the next level $l+1$ and use them as satellite data. When ALG is recursive and returns from the recursion to level $l$, we use the new information to complete what is missing in the partial answers.
An example of this idea is the computation of maximal exact matches (MEMs). Let $\mathrm{lcp}(X, Y)$ be the longest common prefix between $\exp(X)$ and $\exp(Y)$, with $X, Y \in V$. Similarly, let $\mathrm{lcs}(X, Y)$ be the longest common suffix of $\exp(X)$ and $\exp(Y)$. Assume that we have run a standard algorithm to compute the MEMs on the right-hand sides of $\mathcal{R}_l$ and that we found a match $A[a..b] = B[a'..b']$ between two right-hand sides of $\mathcal{R}_l$. Completing the MEM requires us to obtain $\mathrm{lcs}(A[a-1], B[a'-1])$ and $\mathrm{lcp}(A[b+1], B[b'+1])$. Once we compute all the matches in $\mathcal{R}_l$, we use the output so that we can compute the $\mathrm{lcp}$ and $\mathrm{lcs}$ values at the next level $l+1$. We still have to project the MEMs to text positions, but this operation is cheaper than computing the MEMs in their plain text locations, especially if the text is redundant.
The grammar we produce with PBuildGram still requires some modifications to run string algorithms as described above, but the cost of this transformation is proportional to the grammar size, not the text size.