
Efficient Terabyte-Scale Text Compression via Stable Local Consistency and Parallel Grammar Processing

Diego Díaz-Domínguez ORCID University of Helsinki, Finland
Abstract

We present compression algorithms designed to process terabyte-sized datasets in parallel. Our approach builds on locally consistent grammars, a lightweight form of compression, combined with simple post-processing techniques to achieve further space reductions. Locally consistent grammar algorithms are suitable for scaling as they need minimal satellite information to compact the text, but they are not inherently parallel. To enable parallelisation, we introduce a novel concept that we call stable local consistency. A grammar algorithm ALG is stable if, for any pattern P occurring in a collection 𝒯={T1,T2,…,Tk}, the instances ALG(T1),ALG(T2),…,ALG(Tk) independently produce cores for P with the same topology. In a locally consistent grammar, the core of P is a subset of nodes and edges in the parse tree of 𝒯 that remains the same in all the occurrences of P. This feature enables compression, but it only holds if ALG defines a common set of nonterminal symbols for the strings. Stability removes this restriction, allowing us to run ALG(T1),ALG(T2),…,ALG(Tk) in parallel and subsequently merge their grammars into a single output equivalent to that of ALG(𝒯). We implemented our ideas and tested them on massive datasets. Our experiments showed that our method could process 7.9 TB of bacterial genomes in around nine hours, using 16 threads and 0.43 bits/symbol of working memory, achieving a compression ratio of 85x.

Keywords and phrases:
Grammar compression, locally consistent parsing, hashing
Copyright and License:
© Diego Díaz-Domínguez; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Data compression
Funding:
This project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101060011.
Editors:
Petra Mutzel and Nicola Prezza

1 Introduction

Classical dictionary-based compression methods such as Lempel-Ziv (LZ) [27, 39] or grammar compression [23, 4] achieve significant space reductions, but often require extensive resources, limiting their practicality for large datasets. Tools like gzip and zstd provide resource-saving simplifications of LZ that offer acceptable trade-offs for smaller inputs, but still struggle with massive repositories.

Recent heuristics have been developed for large-scale applications. For example, Deorowicz et al. [8] compress pangenomes by partitioning strings and compressing similar segments together using zstd. Other approaches, such as that of Hunt et al. [17], reorder genomes to improve LZ compression. Grammar algorithms like RePair [26] and SEQUITUR [32] achieve high compression ratios, but quickly exceed the available memory as the input grows. Gagie et al. [14] introduced a method that uses prefix-free parsing [3] to scale RePair to large inputs.

Locally consistent grammars [15, 35, 5, 10, 24] are built by performing rounds of locally consistent parsing [29, 38, 30, 18, 3, 19, 5, 10] to compress a text T[1..n]. This approach recursively segments T based on sequence patterns, producing nearly identical phrases for matching substrings. In the parse tree of a locally consistent grammar, the nodes that cover the occurrences of a pattern P share an area with identical topology and labels. This area is the core of P [38], and it is what makes compression possible. Locally consistent grammars are simple to construct because, unlike LZ or RePair, they only need local information to break T consistently. However, they are not only useful for compression; they also help scale the processing of large string collections. In fact, they have been used to speed up the computation of the Burrows–Wheeler transform [9], perform pattern matching in grammar-based self-indexes [5], and find maximal exact matches [11, 31], among other things (see [1, 2, 20, 21, 22] for more applications).

Although it is possible to build from T a locally consistent grammar 𝒢 of size O(δ log(n log σ / (δ log n))) in O(n) expected time [25, 24], with δ being the string complexity [37] and σ the alphabet size of T, these techniques typically yield less impressive compression ratios in practice than LZ or RePair. However, simple transformations that reduce the size of 𝒢 can yield further space reductions. In this regard, Ochoa and Navarro [36] showed that any irreducible grammar can reach the kth-order empirical entropy of a string. Their result suggests that building 𝒢 and then transforming it might be an efficient alternative to greedy approaches in large datasets.

Parallelising the grammar construction in massive collections is desirable to leverage multi-core architectures. Given an input 𝒯={Ta,Tb}, an efficient solution would be to split 𝒯 into chunks, say {Ta} and {Tb}, compress the chunks in different instances ALG({Ta})=𝒢a and ALG({Tb})=𝒢b, and merge the resulting (small) grammars 𝒢a and 𝒢b. However, ensuring the local consistency of the merged grammar is difficult without synchronising the instances. Most locally consistent algorithms assign random fingerprints to the grammar symbols to perform the parsing. However, when there is no synchronisation, different metasymbols emitted by ALG({Ta}) and ALG({Tb}) that expand to equal sequences of 𝒯 could have different fingerprints, thus producing an inconsistent parsing of Ta and Tb. Therefore, new locally consistent schemes are necessary to make the parallelisation possible.

Our contribution.

We present a parallel grammar compression method that scales to terabytes of data. Our framework consists of two operations. Let 𝒯={T1,T2,…,Tk} be a string collection with ∑_{Tj∈𝒯} |Tj| = n symbols, and let ℋ={h0,h1,…,hl} be a set of hash functions. The operation BuildGram(𝒯,ℋ)=𝒢 produces a locally consistent grammar generating the strings in 𝒯. Furthermore, let 𝒯a and 𝒯b be two collections with BuildGram(𝒯a,ℋ)=𝒢a and BuildGram(𝒯b,ℋ)=𝒢b. The operation MergeGrams(𝒢a,𝒢b) builds a locally consistent grammar 𝒢ab for the collection 𝒯ab that combines 𝒯a and 𝒯b. BuildGram uses ℋ to induce a stable locally consistent parsing. The stable property means that BuildGram(𝒯a,ℋ) and BuildGram(𝒯b,ℋ) independently produce cores with the same topology for identical patterns. The set ℋ assigns random fingerprints to the metasymbols of the grammar under construction to guide the locally consistent parsing, with the fingerprint of a metasymbol X depending on the sequence of its expansion. This feature ensures that metasymbols from different grammars that expand to matching sequences get the same fingerprints. MergeGrams(𝒢a,𝒢b) leverages the stable property to produce a grammar equivalent to that of BuildGram(𝒯ab,ℋ), thus allowing us to parallelise the compression. We show that BuildGram(𝒯,ℋ) runs in O(n) time w.h.p. and uses O(G log G) bits of working space, G being the grammar size. Similarly, MergeGrams(𝒢a,𝒢b) runs in O(Ga+Gb) time and uses O(Ga log ga + Gb log gb) bits, with ga and gb being the number of symbols in 𝒢a and 𝒢b, respectively. The parsing that we use in BuildGram is inspired by the concept of induced suffix sorting [34], which has been shown to be effective for processing strings [9, 11]; we note, however, that the concept of stability is compatible with any locally consistent grammar that uses hashing to break the text. In future work, we plan to use our parallel compressor not only to reduce space usage but also to process large inputs. Our experiments showed that our strategy can efficiently compress several terabytes of data.

2 Notation and basic concepts

We consider the RAM model of computation. Given an input of n symbols, we assume our procedures run in random-access memory, where the machine words use w=Θ(logn) bits and can be manipulated in constant time. We use the big-O notation to denote time and space complexities (in bits), and the term log to express the logarithms of base two.

2.1 Strings

A string T[1..n] is a sequence of n symbols over an alphabet Σ={1,2,…,σ}. We use T[j] to refer to the jth symbol in T from left to right, and T[a..b] to refer to the substring starting in T[a] and ending in T[b]. An equal-symbol run T[a..b]=s^c is a substring storing c consecutive copies of s ∈ Σ, with a=1 or T[a−1]≠s, and b=n or T[b+1]≠s.

We consider a collection 𝒯={T1,T2,…,Tk} of k strings as a multiset where each element Tj ∈ 𝒯 has an arbitrary order j ∈ [1..k]. In addition, we use the operator ‖𝒯‖ = ∑_{Tj∈𝒯} |Tj| to express the total number of symbols. We also use subscripts to differentiate collections (e.g. 𝒯a and 𝒯b). The expression 𝒯ab = 𝒯a ∪ 𝒯b denotes the combination of 𝒯a and 𝒯b into a new collection 𝒯ab. We assume that all collections have the same constant-size alphabet Σ.

2.2 Grammar compression

Grammar compression consists in representing a string T[1..n] ∈ Σ* as a small context-free grammar that generates only T [23, 4]. Formally, a grammar 𝒢={Σ,V,ℛ,S} is a tuple where Σ is the alphabet of T (the terminals), V is the set of nonterminals, and ℛ ⊆ V × (Σ ∪ V)* is the set of rewriting rules, each of the form X→Q[1..q]. The symbol S ∈ V is the grammar's start symbol. Given two strings wa=AXB and wb=AQ[1..q]B in (Σ ∪ V)*, wb rewrites wa (denoted wa→wb) if X→Q[1..q] exists in ℛ. Furthermore, wa derives wb, denoted wa→*wb, if there is a sequence u1,u2,…,ux such that u1=wa, ux=wb, and uj→uj+1 for 1≤j<x. The string exp(X) ∈ Σ* resulting from X→*exp(X) is the expansion of X, with the decompression of T expressed as S→*exp(S)=T. Compression algorithms ensure that every X ∈ V occurs only once on the left-hand sides of ℛ. In this way, there is only one possible string exp(X) for each X. This type of grammar is referred to as straight-line. The sum of the lengths of the right-hand sides of ℛ is the grammar size.

The parse tree PT(X) of X ∈ V represents X→*exp(X). Given the rule X→Q[1..q], the root of PT(X) is a node r labelled X that has q children, which are labelled from left to right with Q[1],Q[2],…,Q[q], respectively. The jth child of r, labelled Q[j], is a leaf if Q[j] ∈ Σ; otherwise, it is an internal node whose subtree is recursively defined according to Q[j] and its rule in ℛ.

Post-processing a grammar consists of capturing the remaining repetitions in its rules. For instance, if the pair XY appears multiple times on the right-hand sides of ℛ, one can create a new rule Z→XY and replace the occurrences of XY with Z. Run-length compression encapsulates each distinct equal-symbol run X^c, with X ∈ Σ ∪ V, appearing on the right-hand sides of ℛ as a constant-size rule X′→(X,c). Grammar simplification removes every rule X→Q[1..q] whose symbol X appears only once on the right-hand sides, replacing its occurrence with Q[1..q].
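
As an illustration, the following is a minimal C++ sketch of the run-length step, assuming rules are stored as plain symbol vectors; the containers run_rules and next_nonterminal are bookkeeping names introduced only for this example, not part of the actual encoding.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Sketch: every maximal run X^c (c >= 2) on a right-hand side is replaced by a fresh
// nonterminal X' with the constant-size rule X' -> (X, c).
using Sym = uint64_t;

std::vector<Sym> run_length_rhs(const std::vector<Sym>& rhs,
                                std::map<std::pair<Sym, Sym>, Sym>& run_rules,  // (X, c) -> X'
                                Sym& next_nonterminal) {
    std::vector<Sym> out;
    for (size_t i = 0; i < rhs.size();) {
        size_t j = i;
        while (j < rhs.size() && rhs[j] == rhs[i]) ++j;            // maximal run rhs[i..j)
        Sym c = j - i;
        if (c < 2) out.push_back(rhs[i]);                          // singletons stay as they are
        else {
            auto key = std::make_pair(rhs[i], c);
            auto it = run_rules.find(key);
            if (it == run_rules.end())
                it = run_rules.emplace(key, next_nonterminal++).first;  // new rule X' -> (X, c)
            out.push_back(it->second);
        }
        i = j;
    }
    return out;
}
```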

2.3 Locally-consistent parsing and grammars

A parsing is a partition of a string T[1..n] into a sequence of phrases T = T[1..j_1−1] T[j_1..j_2−1] ⋯ T[j_x..n], where the indices j_1 < j_2 < ⋯ < j_x are breaks. Let par(o,o′) = T[j_y..j_{y+1}−1] T[j_{y+1}..j_{y+2}−1] ⋯ T[j_{y+u−1}..j_{y+u}−1] denote the u phrases that cover a substring T[o..o′], with o ∈ [j_y..j_{y+1}) and o′ ∈ [j_{y+u−1}..j_{y+u}). A parsing is locally consistent [6] iff, for any pair of equal substrings T[a..b]=T[a′..b′], par(a,b) and par(a′,b′) differ in O(1) phrases at the beginning and O(1) at the end, with their internal phrase sequences identical.

Figure 1: Locally consistent grammar compression of P[1..33]. The first row (bottom-up) is P, and the next rows are the metasymbols of the parsing rounds. The grey boxes are breaks. The boxes below the thick black line are the core of P. The values of Ai and Zi change if the context of P changes.

A locally consistent parsing scheme relevant to this work is that of Nong et al. [35]. They originally described their idea to perform induced suffix sorting, but it has been shown that it is also locally consistent [10, 7]. They define a type for each position T[ℓ]:

T[ℓ] is L-type ⟺ T[ℓ] > T[ℓ+1], or T[ℓ] = T[ℓ+1] and T[ℓ+1] is L-type.
T[ℓ] is S-type ⟺ T[ℓ] < T[ℓ+1], or T[ℓ] = T[ℓ+1] and T[ℓ+1] is S-type.
T[ℓ] is LMS-type ⟺ T[ℓ] is S-type and T[ℓ−1] is L-type.   (1)

Their method scans T from right to left and sets a break at each LMS-type position.
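
A minimal sketch of this classification is given below; it assumes a plain std::string input with 0-based positions and adopts the usual convention that the last position is S-type, which is enough to illustrate where the LMS-type breaks fall.

```cpp
#include <string>
#include <vector>

// Right-to-left pass that marks the LMS-type positions of T (Equation 1).
std::vector<size_t> lms_breaks(const std::string& T) {
    size_t n = T.size();
    std::vector<size_t> breaks;
    if (n < 2) return breaks;
    std::vector<bool> s_type(n, false);
    s_type[n - 1] = true;                  // simplification: treat the last position as S-type
    for (size_t i = n - 1; i-- > 0;) {     // scan from T[n-2] down to T[0]
        if (T[i] < T[i + 1])      s_type[i] = true;
        else if (T[i] > T[i + 1]) s_type[i] = false;
        else                      s_type[i] = s_type[i + 1];  // equal symbols copy the right neighbour's type
        if (!s_type[i] && s_type[i + 1])   // T[i] is L-type and T[i+1] is S-type
            breaks.push_back(i + 1);       // so T[i+1] is LMS-type (a break)
    }
    return breaks;                          // breaks listed from right to left
}
```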

One can create a grammar 𝒢={Σ,V,ℛ,S} that only generates T by applying successive rounds of locally consistent parsing. Different works present this idea slightly differently (see [38, 30, 18, 3, 5, 10]), but the procedure is similar: in every round i, the algorithm receives a string Ti as input (with Ti=T when i=1) and performs the following steps:

  1. Break Ti into phrases using a locally consistent parsing.

  2. Assign a nonterminal X to each distinct sequence Q[1..q] that is a phrase in Ti.

  3. Store every X in V and its associated rule X→Q[1..q] in ℛ.

  4. Replace the phrases in Ti with their nonterminals to form the string Ti+1 for round i+1.

The process ends when Ti does not have any more breaks, in which case the algorithm creates the rule S→Ti for the start symbol and returns the resulting grammar 𝒢. The phrases have a length of at least two, so Ti+1 is at most half as long as Ti. Consequently, the algorithm incurs O(log n) rounds and runs in O(n) time.
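
The sketch below illustrates steps 2–4 of one round, under the assumption that the break positions come from some locally consistent parsing (for instance, the LMS-based sketch above) and are all greater than zero; the names parse_round and rules are illustrative.

```cpp
#include <cstdint>
#include <map>
#include <vector>

using Sym = uint64_t;

// One parsing round: cut Ti at the breaks, give each distinct phrase a nonterminal,
// record the rule, and emit the replaced sequence Ti+1.
std::vector<Sym> parse_round(const std::vector<Sym>& Ti,
                             const std::vector<size_t>& breaks_sorted,  // increasing, all > 0
                             std::map<std::vector<Sym>, Sym>& rules,    // phrase -> nonterminal
                             Sym& next_nonterminal) {
    std::vector<Sym> Tnext;
    size_t start = 0;
    auto emit = [&](size_t end) {                                       // phrase Ti[start..end)
        std::vector<Sym> phrase(Ti.begin() + start, Ti.begin() + end);
        auto it = rules.find(phrase);
        if (it == rules.end())                                          // new nonterminal and rule
            it = rules.emplace(std::move(phrase), next_nonterminal++).first;
        Tnext.push_back(it->second);                                    // replace phrase
        start = end;
    };
    for (size_t b : breaks_sorted) emit(b);
    emit(Ti.size());                                                    // last phrase
    return Tnext;
}
```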

The output grammar 𝒢 is locally consistent because it compresses the occurrences of a pattern P[1..m] largely in the same way. The first parsing round transforms the occurrences into substrings of the form A2[1..a2] P2[1..m2] Z2[1..z2] ∈ V*, where the superscripts indicate symbols of T2. The blocks A2[1..a2] and Z2[1..z2] have O(1) variable nonterminals that change with P's context, while P2[1..m2] remains the same in all occurrences. In the second round, P2[1..m2] yields strings of the form A3[1..a3] P3[1..m3] Z3[1..z3] that recursively have the same structure. The substring Pi[1..mi] remains non-empty during the first O(log m) rounds. The substrings Pi, with i=1,…,O(log m), form the core of P [38] (see Figure 1).

2.4 Hashing and string fingerprints

Hashing refers to the idea of using a function h : 𝒰 → [m] to map elements in a universe 𝒰 to integers in a range [m]={0,1,…,m−1} uniformly at random. When the universe 𝒰 is large and only an unknown subset 𝒦 ⊆ 𝒰 requires fingerprints over a range [m] with |𝒦| < m ≪ |𝒰|, the typical solution is to use a universal hash function. For any pair of distinct elements x,y ∈ 𝒰, a universal hash function ensures that the collision probability is Pr[h(x)=h(y)] ≤ 1/m. Let 𝒰={0,1,2,…,σ} be the universe, p>σ a prime number, and a ∈ [p]+={1,2,…,p−1} and b ∈ [p]={0,1,…,p−1} integers chosen uniformly at random. One can make h : 𝒰 → [m] universal using the formula h(x)=((ax+b) mod p) mod m, with p>m.

These ideas can be adapted to produce fingerprints for a set 𝒦 ⊆ Σ* of strings over an alphabet Σ={1,2,…,σ} [12]. Pick a prime number p>σ and choose an integer c ∈ [p]+ uniformly at random. Then, build the polynomial h(Q[1..q]) = (∑_{i=1..q} Q[i]·c^(i−1)) mod p over [p], regarding the symbols Q[1],Q[2],…,Q[q] of Q[1..q] ∈ 𝒦 as the polynomial's coefficients. Additionally, compose h with a universal hash function h′ : [p] → [m] to obtain a fingerprint in [m]. Let a,b,c ∈ [p]+ be three integers chosen uniformly at random. Then the composed function becomes h′(Q[1..q]) = ((a·(∑_{i=1..q} Q[i]·c^(i−1)) + b) mod p) mod m.
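
A minimal sketch of the composed fingerprint follows; it assumes 64-bit symbols, the Mersenne prime 2^61−1 for p, and 128-bit intermediate arithmetic to avoid overflow. The constants in the usage example are placeholders, not values used later in the paper.

```cpp
#include <cstdint>
#include <vector>

// Sketch of h'(Q) = ((a * (sum Q[i]*c^{i-1}) + b) mod p) mod m.
struct StringFingerprint {
    uint64_t p, a, b, c, m;  // p prime, a,b,c in [1, p-1], m = output range

    uint64_t operator()(const std::vector<uint64_t>& Q) const {
        __uint128_t poly = 0, cpow = 1;
        for (uint64_t s : Q) {                     // evaluate sum Q[i]*c^{i-1} mod p
            poly = (poly + (__uint128_t)s % p * cpow) % p;
            cpow = cpow * c % p;
        }
        uint64_t h = (uint64_t)(((__uint128_t)a * poly + b) % p);  // universal hash over [p]
        return h % m;                              // compose to land in [m]
    }
};

// Usage: equal strings always get equal fingerprints; distinct strings collide with low probability.
// StringFingerprint h{(1ULL << 61) - 1, 1234577, 987654321, 31, 1ULL << 32};
// uint64_t f = h({1, 3, 3, 2});
```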

3 Our methods

3.1 The grammar model

We first introduce the features of the locally consistent grammar 𝒢={Σ,V,ℛ,S} we build with BuildGram and MergeGrams. Given a rule X→Q[1..q], the operator rhs(X) is an alias for the string Q[1..q], with rhs(s)=s when s ∈ Σ. The grammar 𝒢 is fully balanced, which means that, for any X ∈ V∖{S}, all root-to-leaf paths in PT(X) have the same number e of edges. We refer to e as the level of X, and extend the use of level to the rule X→Q[1..q] associated with X. We use the terms nonterminal and metasymbol interchangeably to refer to X ∈ V.

The start symbol S is associated with a rule S→C[1..k], where C[1..k] is a sequence such that each exp(C[j])=Tj[1..nj] ∈ 𝒯 and PT(C[j]) has height O(log nj). The symbols in C[1..k] can have different heights and, hence, different levels (this idea will become clear in Section 3.3). Let lmax be the highest level among the elements in C[1..k]. We set the level of S to l=lmax+1, which we regard as the height of 𝒢.

We define the partitions ℛ={ℛ1,…,ℛl−1} and V={V1,V2,…,Vl−1}, where every pair (ℛi,Vi) is the set of rules and nonterminals (respectively) with level i ∈ [1..l−1]. In each subset ℛi, the left-hand sides are symbols over the alphabet Vi, while the right-hand sides are strings over Vi−1. Further, we consider V0=Σ to be the set of terminals.

3.2 Fingerprints for the grammar symbols

In this section, we describe the set ℋ of hash functions that assign fingerprints in BuildGram. The universal hash function h0 : Σ → [m0] maps terminal symbols to integers over an arbitrary range [m0]={0,1,…,m0−1}, with m0>σ. Furthermore, each function hi, with 1≤i<l, recursively assigns fingerprints to the right-hand sides of ℛi. Let [mi−1] be the integer range of the fingerprints emitted by hi−1. We choose a random prime number pi>mi−1, three integer values ai,bi,ci ∈ [pi]+, and a new integer mi. Now, given a rule X→Q[1..q] ∈ ℛi, we compute the fingerprint of Q[1..q] as

hi(Q[1..q]) = ((ai · ∑_{j=1..q} hi−1(rhs(Q[j])) · ci^(j−1) + bi) mod pi) mod mi.   (2)

Although hi : [mi−1]* → [mi] computes a fingerprint for a string, we associate this fingerprint with X ∈ Vi because each Q[1..q] has one possible X. Notice that the recursive definition of hi implicitly traverses PT(X) and ignores the nonterminals labelling PT(X). As a result, the value that hi assigns to Q[1..q] (or equivalently, to X) depends on exp(X), the functions h0,h1,…,hi, and the topology of PT(X). In practice, we avoid traversing PT(X) by operating bottom-up over 𝒢: when we process ℛi, the fingerprints in hi−1 for Vi−1 that we require to obtain the fingerprints in hi are already available.

BuildGram does not know a priori the number of hash functions ℋ needs to compress an input. However, the locally consistent grammar algorithm that we use requires O(log n) rounds of parsing to compress T[1..n] (Section 3.3). If we consider the number of rounds plus the function h0 for the terminals, then |ℋ| ≥ log n + 1 is enough to process T[1..n].

3.3 Our grammar algorithm

Our procedure BuildGram(𝒯,ℋ) receives as input a collection 𝒯={T1,T2,…,Tk} of k strings and a set ℋ of hash functions, and returns a locally consistent grammar 𝒢={Σ,V,ℛ,S} that only generates strings in 𝒯. We assume ℋ has at least log nmax + 1 elements (see Section 3.2), where nmax is the length of the longest string in 𝒯.

The algorithm of BuildGram is inspired by the parsing of Nong et al. [28], which has been used not only for compression [35], but also for the processing of strings [9] (see Section B). However, as noted in the Introduction, the ideas we present here and in the next sections are compatible with any locally consistent grammar that uses hashing.

Overview of the algorithm.

BuildGram constructs 𝒢 in successive rounds of locally consistent parsing. In each round i, we run steps 1–4 of Section 2.3, breaking the strings of 𝒯 individually, but collapsing the rules into the same grammar 𝒢. When we finish round i, we flag each string Tji+1 ∈ 𝒯i+1 with length one as inactive (that is, fully compressed) and stop the compression in round i+1 if there are no active strings. Subsequently, we create the sequence C[1..k] with the compressed strings (that is, the symbols we marked as inactive), create the start symbol S ∈ V with the corresponding rule S→C[1..k], and finish BuildGram.

3.3.1 Parsing mechanism

We parse the active strings of 𝒯i (step 1 of the round) using a variant of the parsing of Nong et al. [33] (Section 2.3) that employs Equation 2 to randomise the sequences that trigger breaks. We refer to this modification as RandLMSPar.

Let X,Y ∈ Vi−1 be any pair of nonterminals. We define the partial order ≺ as follows:

X ≺ Y ⟺ hi−1(rhs(X)) < hi−1(rhs(Y)), and the order between X and Y is undefined when hi−1(rhs(X)) = hi−1(rhs(Y)).

Additionally, we define the equivalence relation:

X ≃ Y ⟺ hi−1(rhs(X)) = hi−1(rhs(Y)),

to cover the cases where X=Y, or X≠Y and their fingerprints collide. Now, let Tji[ℓ],Tji[ℓ+1] ∈ Vi−1 be two adjacent positions in some string Tji ∈ 𝒯i during round i. We redefine the types of Equation 1 for Tji[ℓ] as follows:

Tji[ℓ] is L-type ⟺ Tji[ℓ] ≻ Tji[ℓ+1], or Tji[ℓ] ≃ Tji[ℓ+1] and Tji[ℓ+1] is L-type.
Tji[ℓ] is S-type ⟺ Tji[ℓ] ≺ Tji[ℓ+1], or Tji[ℓ] ≃ Tji[ℓ+1] and Tji[ℓ+1] is S-type.
Tji[ℓ] is LMS-type ⟺ Tji[ℓ] is S-type and Tji[ℓ−1] is L-type.   (3)

The above types are undefined for a suffix Tji[ℓ..nj]=s^c that is an equal-symbol run. This restriction implies that Tji[ℓ..nj] cannot have LMS-type positions (that is, breaks).

A substring Tji[ℓ..r] is a phrase in RandLMSPar if the following two conditions hold: (i) ℓ=1 or Tji[ℓ] is LMS-type; and (ii) r=nj or Tji[r+1] is LMS-type.
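
The following sketch computes the RandLMSPar breaks of one string, assuming 0-based positions and an array F of precomputed fingerprints for the level-(i−1) symbols (as Section 3.4 describes); it is an illustration of Equation 3, not LCG's actual implementation.

```cpp
#include <cstdint>
#include <vector>

// The LMS rules of Section 2.3, but with symbols compared through the fingerprints F[.]
// of their expansions. Colliding fingerprints behave like equal symbols, and the
// equal-symbol suffix run yields no breaks.
std::vector<size_t> rand_lms_breaks(const std::vector<uint64_t>& Ti,
                                    const std::vector<uint64_t>& F) {   // F[s] = fingerprint of s
    size_t n = Ti.size();
    std::vector<size_t> breaks;
    if (n < 2) return breaks;
    size_t last = n - 1;                        // skip the suffix run: its types are undefined
    while (last > 0 && F[Ti[last - 1]] == F[Ti[last]]) --last;
    std::vector<bool> s_type(n, false);         // run positions stay L-type and never break
    for (size_t i = last; i-- > 0;) {
        uint64_t fl = F[Ti[i]], fr = F[Ti[i + 1]];
        if (fl != fr) s_type[i] = (fl < fr);    // strict order given by the fingerprints
        else          s_type[i] = s_type[i + 1];
        if (!s_type[i] && s_type[i + 1])        // L-type followed by S-type
            breaks.push_back(i + 1);            // => Ti[i+1] is LMS-type (a break)
    }
    return breaks;
}
```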

Figure 2: Example of BuildGram with the input string agtagtagtgtagtaggagatcggag and the hash functions ℋ={h0,h1,h2,h3}. The grey boxes indicate the breaks induced by ℋ.

Once we partition the strings of 𝒯i and store their distinct phrases in a set 𝒮, we assign metasymbols to the phrases (step 2 of the round). Let gi−1 = |Σ ∪ V| be the number of symbols 𝒢 has when we begin round i. We assign the nonterminal X = gi−1+o ∈ Vi to the oth string Q[1..q] ∈ 𝒮 and add X→Q[1..q] to ℛi. We note that the order of the strings in 𝒮 is arbitrary and does not affect the properties of our method. The last step of the parsing round is to create 𝒯i+1 by replacing the phrases in 𝒯i with their nonterminals in Vi. Figure 2 shows an example of BuildGram.

Our hash-based parsing mechanism induces a property in the grammar that we call stable local consistency.

Definition 1.

Stable local consistency: Let ALG be an algorithm that produces a locally consistent grammar. Additionally, let P[1..m] be a pattern occurring in an arbitrary number of text collections. ALG is stable iff, for any pair of distinct texts 𝒯a ≠ 𝒯b, the instances ALG(𝒯a)=𝒢a and ALG(𝒯b)=𝒢b independently produce a core for P (Section 2.3) with identical tree topology, although possibly with different nonterminal labels. The term “independently” means that ALG(𝒯a) does not use information about 𝒢b during its execution, and vice versa.

The classification of Tji[ℓ] as a break depends on the fingerprint resulting from the evaluation of exp(Tji[ℓ]) with the functions h0,h1,…,hi−1. Thus, if exp(Tji[ℓ]) appears in another collection 𝒯′ ≠ 𝒯, surrounded by an identical context, processing 𝒯′ with ℋ produces breaks and a core topology for exp(Tji[ℓ]) identical to those in 𝒯. The stable property depends on the use of ℋ and not on the parsing algorithm, which means that any locally consistent parsing compatible with hashing (e.g. [29, 5, 15]) would achieve similar results.

3.4 Implementing our grammar algorithm

Calculating the LMS-type positions of RandLMSPar during round i requires Equation 3 to obtain the type of each Tji[ℓ], which involves knowing the relative order of Tji[ℓ] and Tji[ℓ+1]. We obtain this information by feeding rhs(Tji[ℓ]) and rhs(Tji[ℓ+1]) to Equation 2. The problem is that Equation 2 decompresses Tji[ℓ] and Tji[ℓ+1] from 𝒢, adding a logarithmic penalty to the grammar construction. We avoid decompression by keeping an array F[1..|Σ ∪ V|] that stores the fingerprints of the symbols we already have in Σ ∪ V.

At the beginning of BuildGram, we initialize F with |Σ| elements, where every F[s]=h0(s) stores the fingerprint of s ∈ Σ. Then, each round i keeps in F[Ti[ℓ]] the fingerprint hi−1(rhs(Ti[ℓ])). After we finish round i, we store the fingerprint F[X]=hi(rhs(X)) of every X ∈ Vi so that we can compute the types of 𝒯i+1 in the next round i+1.

Let rhs(X)=Q[1..q] be the replacement for X. We modify Equation 2 as follows:

F[X] = hi(Q[1..q], F) = ((ai · ∑_{j=1..q} F[Q[j]] · ci^(j−1) + bi) mod pi) mod mi.   (4)

This operation is valid because the alphabet of Q[1..q] is Vi−1, and F already has its fingerprints.
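
A sketch of this bottom-up computation follows, assuming the same 64-bit modular arithmetic as in Section 2.4; LevelHash and its fields are illustrative names.

```cpp
#include <cstdint>
#include <vector>

// Sketch of Equation 4: the level-i fingerprint of a new rule X -> Q[1..q] is computed
// directly from the stored fingerprints F[Q[j]] of its level-(i-1) symbols, so the
// expansion of X never needs to be decompressed.
struct LevelHash {
    uint64_t p, a, b, c, m;                       // p prime, a,b,c random in [1, p-1], m = range

    uint64_t fingerprint(const std::vector<uint64_t>& Q,
                         const std::vector<uint64_t>& F) const {
        __uint128_t poly = 0, cpow = 1;
        for (uint64_t sym : Q) {                  // F[sym] plays the role of h_{i-1}(rhs(sym))
            poly = (poly + (__uint128_t)F[sym] % p * cpow) % p;
            cpow = cpow * c % p;
        }
        return (uint64_t)(((__uint128_t)a * poly + b) % p) % m;
    }
};

// After round i creates X -> Q[1..q], we would store F[X] = level_hash_i.fingerprint(Q, F),
// making F ready for the type computations of round i+1.
```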

A note on collisions.

Consecutive positions Tji[ℓ] ≠ Tji[ℓ+1] with hi−1(rhs(Tji[ℓ])) = hi−1(rhs(Tji[ℓ+1])) (i.e., collisions) never cause a break because Tji[ℓ] ≃ Tji[ℓ+1]. Intuitively, the more contiguous colliding symbols we have, the fewer breaks the parsing produces, and the more inefficient the compression becomes. The chances of this situation are small if the hash function hi−1 emits fingerprints in a range [mi−1] with |Vi−1|<mi−1. However, we do not know |Vi−1| a priori, so we have to choose a very large mi−1. This decision has a trade-off because a large mi−1 means that the cells of F require more bits, and hence more working memory. In Section 4, we investigate suitable values for mi in large collections.

Now, we present the theoretical cost of BuildGram.

Theorem 2.

Let 𝒯 be a collection of k strings with ‖𝒯‖=n symbols, where the longest string has length nmax. Additionally, let ℋ be a set of hash functions with |ℋ| ≥ log nmax + 1 elements. BuildGram(𝒯,ℋ)=𝒢 runs in O(n) time w.h.p. and requires O(G log w) bits on top of 𝒯, where G is the grammar size of 𝒢.

Proof.

Calculating the type of each Tji[ℓ] in Tji[1..nj] ∈ 𝒯i takes O(nj) time if we have the array F with the precomputed fingerprints of Vi−1. In addition, we can use a hash table to record the parsing phrases of Tji, which takes O(nj) time w.h.p. If we consider all the strings in 𝒯i, the running time of parsing round i is O(‖𝒯i‖) in expectation. On the other hand, all the phrases in Tji have length >1, except (possibly) one phrase at each end of Tji. Therefore, 𝒯i+1 has n/2^i + 2k symbols in the worst case. Considering that BuildGram requires O(log nmax) rounds, the cumulative length of 𝒯1,𝒯2,…,𝒯^(log nmax) is at most n + ∑_{i=1..log nmax} (n/2^i + 2k) ≤ 2n + 2k·log nmax, with 2k·log nmax being the contribution of the phrases with length one. However, BuildGram stops processing a string Tj ∈ 𝒯 as soon as it is fully compressed, meaning that length-one phrases contribute ∑_{Tj∈𝒯} log|Tj| ≤ k·log nmax elements in the worst case. Therefore, as ∑_{Tj∈𝒯} log|Tj| < ∑_{Tj∈𝒯} |Tj| = n, the running time of BuildGram is O(n) w.h.p.

Let g=|Σ|+|V| be the number of symbols in 𝒢. The O(G log G) + g·log w + |ℋ|·w = O(G log w) bits of working space in BuildGram correspond to the O(G log G) bits of the hash tables, the array F that stores the g fingerprints, and the hash functions in ℋ.

3.5 Merging grammars

We now present our algorithm for merging grammars. Let 𝒯a and 𝒯b be two collections, with na and nb being the lengths of the longest strings in 𝒯a and 𝒯b, respectively, and 𝒯ab = 𝒯a ∪ 𝒯b being their union (Section 2.1). Furthermore, let 𝒢a=BuildGram(𝒯a,ℋ) and 𝒢b=BuildGram(𝒯b,ℋ) be grammars that only generate strings in 𝒯a and 𝒯b, respectively. We assume that ℋ has |ℋ| ≥ log max(na,nb) + 1 elements (see Section 3.2). The instance MergeGrams(𝒢a,𝒢b) returns a locally consistent grammar 𝒢ab that only generates strings in 𝒯ab, and that is equivalent to the output of BuildGram(𝒯ab,ℋ).

Overview of the Algorithm.

The merging consists of making 𝒢a absorb the content that is unique to 𝒢b. Specifically, we discard rules from 𝒢b whose expansions occur in 𝒯a, and for those expanding to sequences not in 𝒯a, we add them as new rules in 𝒢a.

3.6 The merge algorithm

For the grammar 𝒢a, let ℛa be its set of rules, let Va be its set of nonterminals, and let la be its height. The symbols ℛb, Vb, and lb denote the equivalent information for 𝒢b. We consider the partitions ℛa={ℛa1,ℛa2,…,ℛala−1} and Va={Va1,Va2,…,Vala−1}, where every (ℛai,Vai) is the set of rules and nonterminals (respectively) that BuildGram(𝒯a,ℋ) produced during parsing round i ∈ [1..la−1]. The elements ℛb={ℛb1,ℛb2,…,ℛblb−1} and Vb={Vb1,Vb2,…,Vblb−1} are the equivalent partitions for 𝒢b.

MergeGrams processes the grammar levels 1,2,…,max(la,lb)−1 in increasing order. In each round i, we keep the invariant that the right-hand sides of ℛai and ℛbi are comparable. That is, given two rules Xa→Qa[1..qa] ∈ ℛai and Xb→Qb[1..qb] ∈ ℛbi, Qa[1..qa]=Qb[1..qb] implies exp(Xa)=exp(Xb) ∈ Σ*. When i=1, the invariant holds as the right-hand sides of ℛa1 and ℛb1 are over Σ, which is the same for 𝒯a and 𝒯b (Section 2.1).

We begin the algorithm by creating an array La[0..la] that stores in La[i] = ∑_{j=0..i−1} |Vaj| the number of nonterminals with level <i. Observe that La[0]=0 and La[1]=|Σ| because Va0=Σ. We create an equivalent array Lb for 𝒢b. We also initialize the sets E1,E2,…,Elb−1, where every Ei[1..kb] keeps the indexes of Cb[1..kb] with level-i symbols.

The first step of merge round i is to scan the right-hand sides of ℛai, and for each Xa→Qa[1..qa], we modify Xa = Xa − La[i] and Qa[j] = Qa[j] − La[i−1], with j ∈ [1..qa]. The change Xa = Xa − La[i] allows us to append new elements to Vai and ℛai while maintaining the validity of the symbols on the right-hand sides of ℛai+1, whose alphabet is Vai. Then, we create a hash table Ha that stores every modified rule Xa→Qa[1..qa] ∈ ℛai as a key-value pair (Qa[1..qa],Xa), and an empty array M[1..|Vbi|] to update the right-hand sides of ℛbi+1.

We check which right-hand sides in ℛbi occur as keys in Ha. Recall that, when i=1, the right-hand sides of ℛb1 are already comparable to the keys in Ha because both are over Σ and the subtraction of La[i−1]=La[0]=0 does not change their values. For i>1, we make these strings comparable during the previous round i−1. If the string Qb[1..qb] of a rule Xb→Qb[1..qb] occurs in Ha as a key, we extract the associated value Xa from the hash table. On the other hand, if Qb[1..qb] does not exist in Ha, we create a new symbol Xa=|Vai|+1, set Vai = Vai ∪ {Xa}, and record the new rule Xa→Qb[1..qb] in ℛai. In either case, we store M[Xb−Lb[i]]=Xa. Once we process ℛbi, we scan the right-hand sides of ℛbi+1 and use M to update their symbols. We also use Ei to update each position j ∈ Ei as Cb[j]=M[Cb[j]−Lb[i]]. Now, we discard ℛbi and continue to the next merge round i+1.
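
The following sketch condenses one merge round, assuming each level stores its rank-encoded right-hand sides in a vector; Ra_i, Rb_i, and merge_level are illustrative names, and the remapping of level i+1 of 𝒢b with M is left to the caller.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

using Phrase = std::vector<uint32_t>;
struct PhraseHash {
    size_t operator()(const Phrase& q) const {
        size_t h = 0;                                   // simple mixing hash, enough for the sketch
        for (uint32_t s : q) h = h * 1000003ULL + s;
        return h;
    }
};

// Ra_i / Rb_i: right-hand sides of level i, indexed by (rank of the nonterminal - 1).
// Returns M, with M[rank_b - 1] = rank in the merged level i of the absorbing grammar.
std::vector<uint32_t> merge_level(std::vector<Phrase>& Ra_i, const std::vector<Phrase>& Rb_i) {
    std::unordered_map<Phrase, uint32_t, PhraseHash> Ha;
    for (size_t r = 0; r < Ra_i.size(); ++r) Ha.emplace(Ra_i[r], (uint32_t)(r + 1));   // key = rhs, value = rank
    std::vector<uint32_t> M(Rb_i.size());
    for (size_t r = 0; r < Rb_i.size(); ++r) {
        auto it = Ha.find(Rb_i[r]);
        if (it != Ha.end()) { M[r] = it->second; continue; }   // rhs already present: reuse its rank
        Ra_i.push_back(Rb_i[r]);                               // new rule: append at the end of level i
        M[r] = (uint32_t)Ra_i.size();
        Ha.emplace(Rb_i[r], M[r]);
    }
    return M;   // afterwards, remap every symbol X on level i+1 of the other grammar as X = M[X - 1]
}
```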

Figure 3: Example of MergeGrams(𝒢a,𝒢b). As the parsing is stable, Ta[1..19]=Tb[1..19] have cores (dashed boxes) with the same topology in 𝒢a and 𝒢b. In (B–C), the nonterminals represent the relative position of their rules in their corresponding levels. For example, the left-hand side of 2→524 in 𝒢b (side B) is 2 because that rule is the second in level 2. On the other hand, the symbol 2 in 524 refers to the second rule of level 1. In the first merge round, MergeGrams checks which right-hand sides in level one of 𝒢b are also right-hand sides in level one of 𝒢a (dashed lines in side B). Only tc is not in 𝒢a, so the algorithm appends it at the end of level one in 𝒢a and assigns it the new metasymbol 6 (side C). Subsequently, it discards level one in 𝒢b and updates the right-hand sides of level two in 𝒢b according to their corresponding metasymbols in 𝒢a. In (C), the rule 1→11431 becomes 1→22342 and 2→524 becomes 2→163. For example, 5 becomes 1 on the right-hand side of 2→524 because the level-one rule 5→agg in 𝒢b matches 1→agg in 𝒢a (see dashed lines in side B). After the update, MergeGrams goes to the next round and operates recursively.

When we finish round i, the right-hand sides of ℛbi+1 are over [1..|Vai|], and the right-hand sides of ℛai+1 will be over [1..|Vai|] after we update their values with La. These modifications will make both string sets comparable, and our invariant will hold for i+1.

After min(la,lb)−1 merge rounds, one of the input grammars will run out of levels. The remaining rounds will skip the creation and querying of Ha, and will append new rules directly to ℛai (if any). After we finish the rounds, we concatenate the compressed strings to form Cab[1..ka+kb] = Ca[1..ka]·Cb[1..kb] and update the starting rule to Sa→Cab[1..ka+kb]. The last step of MergeGrams is to modify the symbols in 𝒢a. For that purpose, we recompute La, and for every level-i rule Xa→Qa[1..qa], we set Xa = Xa + La[i] and Qa[j] = Qa[j] + La[i−1], with j ∈ [1..qa]. Figure 3 shows an example of MergeGrams.

Theorem 3.

Let 𝒢a (respectively, 𝒢b) be a locally consistent grammar that generates the strings in a collection 𝒯a of ka strings (respectively, a collection 𝒯b of kb strings). The size of 𝒢a is Ga and the size of 𝒢b is Gb. Similarly, let ga=|Σ|+|Va| be the number of grammar symbols in 𝒢a (respectively, gb=|Σ|+|Vb| in 𝒢b). MergeGrams(𝒢a,𝒢b)=𝒢ab builds a locally consistent grammar generating the strings in 𝒯ab = 𝒯a ∪ 𝒯b in O(Ga+Gb) time w.h.p. and O(Ga log ga + Gb log gb) bits of space.

Proof.

We obtain La and Lb in one scan of the nonterminal sets, which takes O(ga+gb) time. It is also possible to obtain the sets E1,E2,…,Elb−1 in O(Gb) time. Let Gai (respectively, Gbi) be the number of symbols on the right-hand sides of ℛai (respectively, ℛbi). Filling in the hash table Ha requires a linear scan of ℛai, which runs in O(Gai) time w.h.p. On the other hand, scanning ℛbi and querying its right-hand sides in Ha takes O(Gbi) time w.h.p. In addition, modifying the right-hand sides of ℛbi+1 with M takes O(Gbi+1) time. If we transfer the cost of updating ℛbi+1 to the next round i+1, performing merge round i takes O(Gai+Gbi) time w.h.p. Now, let gai=|Vai| and gbi=|Vbi| be the number of level-i symbols. We require Gai·log gai−1 + Gbi·log gbi−1 bits to encode ℛai and ℛbi, O(Gai·log gai−1) bits for Ha, and gbi·log gbi bits for M. Consequently, the cost of round i is O(Gai·log gai−1 + Gbi·log gbi−1) bits. As Ga = ka + |Σ| + ∑_{i=1..la−1} Gai and Gb = kb + |Σ| + ∑_{i=1..lb−1} Gbi, MergeGrams runs in O(Ga+Gb) time w.h.p. and uses O(Ga log ga + Gb log gb) bits of space.

4 Experiments

Implementation details.

We implemented our framework in C++ in a tool called LCG (https://github.com/ddiazdom/lcg). We support parallel compression by interleaving executions of BuildGram and MergeGrams as follows: given a collection 𝒯 and an integer p, LCG uses p threads that execute BuildGram in parallel to compress different subsets of 𝒯 into different buffer grammars. When the combined space usage of the buffer grammars exceeds a given threshold, LCG merges them into a sink grammar using MergeGrams and resets the buffer grammars. Section A explains this idea in more detail. We refer to this strategy as PBuildGram to differentiate it from our description in Section 3.3. After running PBuildGram, LCG run-length compresses the output (RL step), and then removes unique nonterminals from the output of PBuildGram + RL (Simp step).

Experimental setup and inputs.

We compared LCG against other state-of-the-art compressors, measuring the compression speed in MB per second (MB/s), the peak of the working memory in bits per input symbol (bps), and the compression ratio. Furthermore, we assessed the amount of compression LCG achieves, its resource usage, and how it scales with the number of threads. We conducted the experiments on a machine with AlmaLinux 8.4, 3 TiB of RAM, and processor Intel(R) Xeon(R) CPU E7-8890 v4 @ 2.20GHz, with 192 cores. We tested four collections. HUM: all the human genome assemblies available in NCBI up to August 27, 2024 (3.46 TB, σ=16). ATB: release 2.0 of the AllTheBacteria dataset [17], which contains genomes of bacteria and archaea (7.9 TB, σ=5). COVID: all the SARS-CoV-2 genomes in NCBI up to November 5 (267.4 GB, σ=16). KERNEL: last 40 versions of the Linux kernel (.h, .c, .txt, and .rst files) up to December 13, 2024 (54.4 GB, σ=190).

Competitor tools.

zstd (https://github.com/facebook/zstd) is an efficient tool that uses a simplified version of LZ and encodes the output using Huffman [16] and ANS [13]. agc (https://github.com/refresh-bio/agc) is a compressor for highly similar genomic sequences by Deorowicz et al. [8] that breaks strings into segments and groups segments into blocks according to sequence similarity. RePair (https://users.dcc.uchile.cl/˜gnavarro/software/repair.tgz) is a popular grammar compression algorithm by Larsson and Moffat [26] that recursively replaces the most common pair of symbols in the text. BigRePair (https://gitlab.com/manzai/bigrepair) is a RePair variant by Gagie et al. [14] that scales the compression by using prefix-free parsing [3]. LCG, agc, zstd, and BigRePair support multithreading, so we used 16 threads in each. RePair does not support multithreading. For zstd, we used compression level 15 and a window size of 2 GiB –the tool does not allow longer windows with compression level 15.

4.1 Results and discussion

Table 1: Performance of the competitor tools. The best result of each column is in bold. Cells with a dash are experiments that crashed or are incompatible with the input.
Tool Compression ratio Compression speed Working memory
plain/compressed MB/s bps
ATB HUM COVID KERNEL ATB HUM COVID KERNEL ATB HUM COVID KERNEL
LCG 85.26 135.54 328.10 99.99 232.26 244.73 506.44 237.91 0.43 0.29 0.36 2.05
agc 144.90 237.93 120.42 53.71 0.15 0.18
zstd 58.19 4.72 344.99 38.23 95.51 27.21 442.79 226.45 0.004 0.01 0.10 0.53
RePair 778.62 127.03 1.89 2.21 30.63 150.11
BigRePair 586.12 109.87 47.89 21.04 2.28 3.69

4.1.1 Comparison with other text compressors

LCG was the fastest tool, with a speed ranging from 232.26 to 506.44 MB/s (Table 1). The speed of the other tools varied with the input, but RePair remained the slowest (1.89–2.21 MB/s). The fact that RePair used one thread while the other tools used 16 does not by itself explain these results. For instance, LCG was 268 times faster than RePair in COVID (506.44 MB/s versus 1.89 MB/s, respectively). We also observed that LCG achieved higher speeds in more compressible inputs (e.g. COVID), probably because the hash tables recording phrases from the text (see the encoding in Section A.1) perform more lookups than insertions, and lookups are cheaper.

The most space-efficient tool was zstd, with a working memory usage of 0.004–0.53 bps. This result is due to the cap of 2 GiB that zstd uses for the LZ window, regardless of the input size. This threshold keeps memory usage low, but limits compression in large datasets where repetitiveness is spread out (HUM or KERNEL). On the other hand, zstd with ATB yielded important space reductions, probably because Hunt et al. [17] preprocessed ATB to place similar strings close to each other to improve LZ compression. LCG used far less working memory than the other grammar compressors (0.29–2.05 bps versus 30.63–150.11 bps for RePair and 2.28–3.69 bps for BigRePair), though it is still high compared to zstd.

RePair obtained the best compression ratios and would likely outperform the other tools in ATB and HUM, where we could not run it. Compared to LCG, RePair achieved 2.37 times more compression in COVID and 1.27 times more in KERNEL. The difference was smaller with BigRePair, with RePair achieving 1.32 times more compression in COVID and 1.15 times in KERNEL. Although LCG did not obtain the best ratio, it still achieves important reductions, and its trade-off between compression and resource usage seems to be the best. Besides, it is possible to compress the output of LCG further by applying RePair or Huffman encoding. We think these additional steps would be fast, as they operate over a small grammar.

Table 2: Compression statistics. The value g=|V| is the number of nonterminals, G = ∑_{X∈V} |rhs(X)| is the grammar size, max(gi) = max_{i∈[1..l−1]} |Vi| is the maximum number of nonterminals in a level, and max(Gi) = max_{i∈[1..l−1]} ∑_{X∈Vi} |rhs(X)| is the maximum level size. “Size reduction” is the percentage of decrease for G relative to the previous grammar. The column “% nonter. increase” is the percentage of increase for g relative to the grammar of PBuildGram.
Dataset RePair LCG
PBuildGram RL Simp
g G g G max(gi) max(Gi) % nonter. Size % deleted Size
increase reduction rules reduction
ATB 9 487 673 329 29 475 041 123 3 289 664 779 10 457 037 296 0.006 68 0.69 83.35 27.02
HUM 2 397 891 525 12 287 192 412 445 803 832 1 454 454 746 0.021 61 36.73 72.30 22.31
COVID 16 568 289 114 476 810 108 043 879 507 843 065 15 270 975 180 659 822 0.005 22 35.16 88.71 29.11
KERNEL 65 668 740 131 876 293 64 044 303 217 799 801 23 924 433 75 625 197 0.026 96 1.76 82.25 24.62

4.1.2 Breakdown of our method

Compression.

PBuildGram produced substantially larger grammars than RePair, being 69.32% larger in KERNEL and 363.63% larger in COVID (see Table 2). However, we expected this result as PBuildGram is not greedy. Interestingly, PBuildGram produced fewer nonterminals than RePair in KERNEL (6.4×10^7 versus 6.6×10^7, respectively). The maximum number of nonterminals max(gi) produced in a round was 3.3×10^9 < 2^32−1 (ATB), indicating that 32-bit fingerprints (i.e. mi < 2^32 in ℋ) are likely to be enough to keep the number of collisions low in repetitive collections with dozens of TB, although nonrepetitive collections of this size might require larger fingerprints. Recall from Section 3.4 that the more bits we use for fingerprints, the fewer collisions we have, and hence the smaller the impact on compression, but it also means more working memory. The effect of RL varied with the input: the grammar size reductions ranged between 0.69% and 36.73%, with a negligible increase in the number of nonterminals. RL performed well in HUM and COVID because they have long equal-symbol runs of Ns. On the other hand, Simp removed 81.65% of the nonterminals and reduced the size of the grammar by 25.76%, on average.

Resource usage.

The grammar encoding we chose (Section A.1) had an important effect on resource usage. LCG spent 93.2% of its running time, on average, executing PBuildGram (Figure 4A). The bottleneck was the lookup/insertion of phrases in the hash tables H1,H2,…,Hl−1 of the grammars when BuildGram (a subroutine of PBuildGram) parsed its input text. These hash table operations are costly because of the high number of cache misses and string comparisons. In particular, the first three parsing rounds of BuildGram are the most expensive because the text is still large and produces a large set of phrases (see Figure 6). Consequently, BuildGram has to hash more frequently and in larger hash tables in those rounds. The impact of the other steps is negligible, with RL and Simp accounting (on average) for 1.82% and 4.98% of the running time, respectively. The working memory usage (Figure 4B) varied with the input: in ATB and HUM, the sink grammar and the fingerprints of PBuildGram accounted for 75%–82% of the usage. The rest of the memory was satellite data: the arrays and grammars of the thread buffers. The results are different in COVID and KERNEL, where the memory usage is dominated by satellite data. The hash tables H1,H2,…,Hl−1 imposed a considerable space overhead as they store their keys (i.e. parsing phrases) in arrays of 32-bit cells to perform fast lookups (via memcmp). We can reduce the memory cost by using hash tables that store keys as VBytes instead. In this way, the keys still use an integral number of bytes, and we can still use memcmp for lookups. Our preliminary tests (not shown here) suggest that using VBytes in the hash tables (and 32-bit fingerprints) reduces the space of the sink grammar by 39% in HUM, 47% in COVID, and 45% in KERNEL. If we consider that the buffer grammars use the same encoding, the change to VBytes could drastically reduce the peak working memory of LCG. We can decrease working memory even further by keeping parts of the sink grammar on disk.
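
For reference, a minimal sketch of the VByte coding we have in mind is shown below; the variant that marks the last byte of each symbol is an assumption, not necessarily the exact layout LCG would adopt.

```cpp
#include <cstdint>
#include <vector>

// VByte sketch: each symbol is stored in 7-bit groups, one byte per group, with the high
// bit marking the last byte. Keys keep an integral number of bytes, so hash-table lookups
// can still compare them with memcmp.
void vbyte_encode(uint32_t x, std::vector<uint8_t>& out) {
    while (x >= 128) { out.push_back((uint8_t)(x & 0x7F)); x >>= 7; }
    out.push_back((uint8_t)(x | 0x80));          // high bit set = last byte of this symbol
}

uint32_t vbyte_decode(const std::vector<uint8_t>& in, size_t& pos) {
    uint32_t x = 0; int shift = 0;
    while (!(in[pos] & 0x80)) { x |= (uint32_t)in[pos++] << shift; shift += 7; }
    x |= (uint32_t)(in[pos++] & 0x7F) << shift;  // last byte
    return x;
}
```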

Effect of parallelism.

The compression speed of LCG in HUM increased steadily with the number of threads (Figure 4C), while the working memory peak remained stable at 0.29 bps up to 16 threads. After that, the peak increased to 0.31 bps with 20 threads, and 0.33 bps with 24.

Figure 4: Performance of LCG. (A) Running time breakdown. (B) Memory peak breakdown. “Fps” are the fingerprints in PBuildGram, and “Sat. data” are arrays and grammars of the buffers (Section A.1). (C) Performance of LCG in HUM relative to the number of threads. The left y axis is the compression speed and the right y axis is the memory peak.

5 Conclusions and further work

We presented a parallel grammar compressor that processes texts at high speeds while achieving high compression ratios. Our working memory usage is still high compared to popular general-purpose compressors like zstd, but we can greatly reduce the gap by using VByte encoding or keeping some parts of the grammar on disk. On the other hand, we use substantially less memory than popular grammar compressors. In fact, to our knowledge, LCG is the only grammar-based tool that scales to terabytes of data. Furthermore, our simple strategy captures repetitions from distant parts of the text, making it more robust than other widespread compression heuristics. Additional reductions in LCG are possible by using greedy methods, such as RePair, or statistical compression on top of the output grammar, but this would slow down the analysis of strings. As mentioned, our goal is not only to compress but also to scale string processing algorithms in massive collections. In the literature, it has been shown that locally consistent grammars can speed up those algorithms [9, 11], but the efficient computation of the grammar remains a bottleneck. We solved that problem in this work. Integrating our scheme with those algorithms could enable the processing of an unprecedented volume of strings.

References

  • [1] Tuğkan Batu and S Cenk Sahinalp. Locally consistent parsing and applications to approximate string comparisons. In Proc. 9th International Conference on Developments in Language Theory (DLT), pages 22–35, 2005.
  • [2] Or Birenzwige, Shay Golan, and Ely Porat. Locally consistent parsing for text indexing in small space. In Proc. 31th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 607–626, 2020. doi:10.1137/1.9781611975994.37.
  • [3] Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, and Taher Mun. Prefix-free parsing for building big BWTs. Algorithms for Molecular Biology, 14:1–15, 2019. doi:10.1186/S13015-019-0148-5.
  • [4] Moses Charikar, Eric Lehman, Ding Liu, Rina Panigrahy, Manoj Prabhakaran, Amit Sahai, and Abhi Shelat. The smallest grammar problem. IEEE Transactions on Information Theory, 51(7):2554–2576, 2005. doi:10.1109/TIT.2005.850116.
  • [5] Anders Roy Christiansen, Mikko Berggren Ettienne, Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Optimal-time dictionary-compressed indexes. ACM Transactions on Algorithms (TALG), 17(1):1–39, 2020. doi:10.1145/3426473.
  • [6] Richard Cole and Uzi Vishkin. Deterministic coin tossing and accelerating cascades: micro and macro techniques for designing parallel algorithms. In Proc. 18th Annual ACM Symposium on Theory of Computing (STOC), pages 206–219, 1986. doi:10.1145/12130.12151.
  • [7] Jin-Jie Deng, Wing-Kai Hon, Dominik Köppl, and Kunihiko Sadakane. FM-indexing grammars induced by suffix sorting for long patterns. In Proc. 22nd Data Compression Conference (DCC), pages 63–72. IEEE, 2022. doi:10.1109/DCC52660.2022.00014.
  • [8] Sebastian Deorowicz, Agnieszka Danek, and Heng Li. AGC: compact representation of assembled genomes with fast queries and updates. Bioinformatics, 39(3):btad097, 2023. doi:10.1093/BIOINFORMATICS/BTAD097.
  • [9] Diego Díaz-Domínguez and Gonzalo Navarro. Efficient construction of the BWT for repetitive text using string compression. Information and Computation, 294:105088, 2023. doi:10.1016/J.IC.2023.105088.
  • [10] Diego Díaz-Domínguez, Gonzalo Navarro, and Alejandro Pacheco. An LMS-based grammar self-index with local consistency properties. In Proc. 28th International Symposium on String Processing and Information Retrieval (SPIRE), pages 100–113, 2021. doi:10.1007/978-3-030-86692-1_9.
  • [11] Diego Díaz-Domínguez and Leena Salmela. Computing all-vs-all MEMs in grammar-compressed text. In Proc. 30th International Symposium on String Processing and Information Retrieval (SPIRE), pages 157–170. Springer, 2023. doi:10.1007/978-3-031-43980-3_13.
  • [12] Martin Dietzfelbinger, Joseph Gil, Yossi Matias, and Nicholas Pippenger. Polynomial hash functions are reliable. In Proc. 19th International Colloquium on Automata, Languages and Programming (ICALP), pages 235–246, 1992.
  • [13] Jarek Duda. Asymmetric numeral systems: entropy coding combining speed of huffman coding with compression rate of arithmetic coding. arXiv preprint arXiv:1311.2540, 2013.
  • [14] Travis Gagie, Tomohiro I, Giovanni Manzini, Gonzalo Navarro, Hiroshi Sakamoto, and Yoshimasa Takabatake. Rpair: Rescaling RePair with rsync. In Proc. 26th International Symposium on String Processing and Information Retrieval (SPIRE), pages 35–44, 2019. doi:10.1007/978-3-030-32686-9_3.
  • [15] Paweł Gawrychowski, Adam Karczmarz, Tomasz Kociumaka, Jakub Łącki, and Piotr Sankowski. Optimal dynamic strings. In Proc. 29th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1509–1528, 2018.
  • [16] David A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101, 1952.
  • [17] Martin Hunt, Leandro Lima, Wei Shen, John Lees, and Zamin Iqbal. AllTheBacteria-all bacterial genomes assembled, available and searchable, 2024. bioRxiv preprint. doi:10.1101/2024.03.08.584059.
  • [18] Artur Jeż. A really simple approximation of smallest grammar. Theoretical Computer Science, 616:141–150, 2016. doi:10.1016/J.TCS.2015.12.032.
  • [19] Dominik Kempa and Tomasz Kociumaka. String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure. In Proc. 51st Annual ACM Symposium on Theory of Computing (STOC), pages 756–767, 2019. doi:10.1145/3313276.3316368.
  • [20] Dominik Kempa and Tomasz Kociumaka. Resolution of the burrows-wheeler transform conjecture. In Proc. 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 1002–1013, 2020. doi:10.1109/FOCS46700.2020.00097.
  • [21] Dominik Kempa and Tomasz Kociumaka. Dynamic suffix array with polylogarithmic queries and updates. In Proc. 54th Annual ACM Symposium on Theory of Computing (STOC), pages 1657–1670, 2022. doi:10.1145/3519935.3520061.
  • [22] Dominik Kempa and Tomasz Kociumaka. Collapsing the hierarchy of compressed data structures: Suffix arrays in optimal compressed space. In Proc. 64th Annual Symposium on Foundations of Computer Science (FOCS), pages 1877–1886, 2023. doi:10.1109/FOCS57990.2023.00114.
  • [23] John C. Kieffer and En Hui Yang. Grammar–based codes: a new class of universal lossless source codes. IEEE Transactions on Information Theory, 46(3):737–754, 2000. doi:10.1109/18.841160.
  • [24] Tomasz Kociumaka, Gonzalo Navarro, and Francisco Olivares. Near-optimal search time in δ-optimal space, and vice versa. Algorithmica, 86(4):1031–1056, 2024. doi:10.1007/S00453-023-01186-0.
  • [25] Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Toward a definitive compressibility measure for repetitive sequences. IEEE Transactions on Information Theory, 69(4):2074–2092, 2022. doi:10.1109/TIT.2022.3224382.
  • [26] N. Jesper Larsson and Alistair Moffat. Off-line dictionary-based compression. Proceedings of the IEEE, 88(11):1722–1732, 2000. doi:10.1109/5.892708.
  • [27] Abraham Lempel and Jacob Ziv. On the complexity of finite sequences. IEEE Transactions on Information Theory, 22(1):75–81, 1976. doi:10.1109/TIT.1976.1055501.
  • [28] Heng Li and Richard Durbin. Fast and accurate long-read alignment with Burrows-Wheeler Transform. Bioinformatics, 26(5):589–595, 2010. doi:10.1093/BIOINFORMATICS/BTP698.
  • [29] Kurt Mehlhorn, Rajamani Sundar, and Christian Uhrig. Maintaining dynamic sequences under equality tests in polylogarithmic time. Algorithmica, 17:183–198, 1997. doi:10.1007/BF02522825.
  • [30] S. Muthukrishnan and Süleyman Cenk Sahinalp. Approximate nearest neighbors and sequence comparison with block operations. In Proc of 32nd Annual ACM Symposium on Theory of Computing (STOC), pages 416–424, 2000. doi:10.1145/335305.335353.
  • [31] Gonzalo Navarro. Computing MEMs and relatives on repetitive text collections. ACM Transactions on Algorithms, 21(1):1–33, 2024. doi:10.1145/3701561.
  • [32] Craig G. Nevill-Manning and Ian H. Witten. Compression and explanation using hierarchical grammars. The Computer Journal, 40(2 and 3):103–116, 1997.
  • [33] Ge Nong. Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Transactions on Information Systems, 31(3):1–15, 2013.
  • [34] Ge Nong, Sen Zhang, and Wai Hong Chan. Linear suffix array construction by almost pure induced-sorting. In Proc. 19th Data Compression Conference (DCC), pages 193–202, 2009. doi:10.1109/DCC.2009.42.
  • [35] Daniel Saad Nogueira Nunes, Felipe A. Louza, Simon Gog, Mauricio Ayala-Rincón, and Gonzalo Navarro. A grammar compression algorithm based on induced suffix sorting. In Proc. 28th Data Compression Conference (DCC), pages 42–51, 2018. doi:10.1109/DCC.2018.00012.
  • [36] Carlos Ochoa and Gonzalo Navarro. RePair and all irreducible grammars are upper bounded by high-order empirical entropy. IEEE Transactions on Information Theory, 65(5):3160–3164, 2018. doi:10.1109/TIT.2018.2871452.
  • [37] Sofya Raskhodnikova, Dana Ron, Ronitt Rubinfeld, and Adam Smith. Sublinear algorithms for approximating string compressibility. Algorithmica, 65:685–709, 2013. doi:10.1007/S00453-012-9618-6.
  • [38] Süleyman Cenk Sahinalp and Uzi Vishkin. Symmetry breaking for suffix tree construction. In Proc. 26th Annual ACM Symposium on Theory of Computing (STOC), pages 300–309, 1994. doi:10.1145/195058.195164.
  • [39] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337–343, 1977. doi:10.1109/TIT.1977.1055714.

Appendix A Implementation details

This section presents the details of the implementation of PBuildGram, the compression algorithm we implemented in LCG. We use BuildGram (Section 3.3) and MergeGrams (Section 3.5) as building blocks to parallelize the process without losing compression power. Section A.1 describes the encoding we use for locally consistent grammars, Section A.2 introduces some changes in BuildGram to make parallel execution easier, and Section A.3 describes the steps of PBuildGram.

A.1 Grammar encoding

Let 𝒢={Σ,V,ℛ,S} be a locally consistent grammar of height l generating strings in 𝒯={T1,T2,…,Tk}. Additionally, let (ℛi,Vi) be the ith pair, with i ∈ [1..l−1], of the level-based partition of 𝒢 (see Section 3.5). A hash table Hi keeps for every X→Q[1..q] ∈ ℛi the key-value pair (Q[1..q],X), while an array Fi[1..|Vi|] stores the fingerprints in hi of Vi. The array F0[1..σ] stores the fingerprints in h0 for Σ. Additionally, the array C[1..k] maintains the symbols representing the compressed strings of 𝒯. Similarly to Section 3.6, the sets E1,E2,…,El−1 store in Ei[1..k] the indexes of C[1..k] with level-i symbols. Overall, the encoding of 𝒢 is (Σ,F0),(H1,F1,E1),…,(Hl−1,Fl−1,El−1), and C. We remark that the nonterminals in (ℛi,Vi) represent ranks. Specifically, the value X=r for a rule X→Q[1..q] means that this rule is the rth in ℛi, while Q[u]=r, with u ∈ [1..q], means that Q[u] ∈ Vi−1 is the left-hand side of the rth rule in ℛi−1. The use of ranks simplifies the merge with other grammars (see Section 3.6). From now on, we will use the operator space(𝒢) to denote the number of bits of the encoding, and extend it so that space(𝒢1,𝒢2,…,𝒢p) represents ∑_{j=1..p} space(𝒢j). We implemented each hash table Hi using Robin Hood hashing, storing the nonterminal phrases in 32-bit cells. Finally, we implemented ℋ using the C++ library xxHash (https://github.com/Cyan4973/xxHash).
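
A rough sketch of this layout is given below; it uses std::unordered_map as a stand-in for the Robin Hood tables and stores phrases as 32-bit vectors, so it reflects the structure of the encoding rather than its exact memory representation.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

using Phrase = std::vector<uint32_t>;
struct PhraseHash {                                       // placeholder hash for the sketch
    size_t operator()(const Phrase& q) const {
        size_t h = 0;
        for (uint32_t s : q) h = h * 1000003ULL + s;
        return h;
    }
};

struct Level {
    std::unordered_map<Phrase, uint32_t, PhraseHash> H;   // H_i: rhs -> rank of the nonterminal in this level
    std::vector<uint64_t> F;                               // F_i: fingerprint of each level-i nonterminal
    std::vector<uint32_t> E;                               // E_i: indexes of C whose symbol lives in this level
};

struct Grammar {
    std::vector<uint64_t> F0;      // fingerprints of the terminals (h_0)
    std::vector<Level> levels;     // (H_i, F_i, E_i) for i = 1..l-1
    std::vector<uint32_t> C;       // compressed strings, one rank per string of the collection
};
```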

A.2 Modification to the grammar algorithm

We modify BuildGram to operate efficiently in parallel. The new signature of the algorithm is BuildGram(𝒯,ℋ,𝒢a,𝒢b)=𝒢a, where 𝒢a and 𝒢b are two (possibly nonempty) grammars, and the output is an updated version of 𝒢a. In this variant, we only record phrases of 𝒯 that are not in 𝒢a or 𝒢b. We store these phrases in 𝒢a and keep 𝒢b unchanged, using it as a “read-only” component. However, the general strategy to compress 𝒯 remains the same. The convenience of this change will become evident in the next section. We assume that 𝒢a and 𝒢b use the encoding we described in Section A.1. We add the subscript a or b to the encoding's components to differentiate their origin, 𝒢a or 𝒢b (respectively).

BuildGram now works as follows: let us assume that we are in the ith parsing round and that we receive the partially compressed collection 𝒯i as input. The alphabet of each Tji ∈ 𝒯i is Vai−1 ∪ Vbi−1, and we use the fingerprints in Fai−1 and Fbi−1 to compute the types of each Tji[ℓ] (Equation 1) and thus the parsing phrases. Notice that, when i=1, it holds that Va0=Vb0=Σ and Fa0=Fb0. Let Q[1..q]=Tji[ℓ..ℓ+q−1] be the active phrase in the parsing. We first check if Q[1..q] exists in Hbi as a key. If that is the case, we get its corresponding value X from the hash table and assign it to the active phrase. On the other hand, if Q[1..q] is not a key in Hbi, we perform a lookup operation in Hai. Like before, if Q[1..q] is a key there, we get the corresponding value X and assign it to the active phrase. Finally, if Q[1..q] is not a key in Hai either, we insert the active phrase Tji[ℓ..ℓ+q−1]=Q[1..q] into Hai associated with a new metasymbol X. Let sa and sb be the sizes of Hai and Hbi (respectively) before starting parsing round i. The value we assign to the new metasymbol is X=sa+sb+1. Subsequently, we replace Q[1..q] by X in Tji and move to the next phrase. Before ending the parsing round, we store in Fai the fingerprints in hi for the new phrases we inserted into Hai. Let Q[1..q] be one of the new phrases, and let X be its metasymbol. We compute the fingerprint hi(Q[1..q]) using Equation 4 and store the result in Fai[X−sb]. Identifying the provenance of a symbol Ti+1[ℓ] ∈ Vi in the next round i+1 is simple: Ti+1[ℓ]=Y>sb means that the phrase associated with Y is a key in Hai and its fingerprint in hi is in Fai[Y−sb] (sb never changes). On the other hand, Y ≤ sb means that its phrase is in Hbi and its fingerprint in hi is in Fbi[Y]. Recall that we need Fai and Fbi to compute the position types for the strings of 𝒯i+1. Similarly, we can obtain the fingerprints in hi+1 for the new phrases of Hai by modifying Equation 4 to receive the two fingerprint arrays Fai and Fbi instead of one.
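
The lookup order can be summarised with the following sketch, where Hb_i is the read-only level of the sink grammar, Ha_i the level of the buffer grammar, and sa is maintained as a running counter of buffer metasymbols, one interpretation of the text's X = sa + sb + 1; the function name and the template are illustrative.

```cpp
#include <cstdint>
#include <vector>

using Phrase = std::vector<uint32_t>;

// Probe the sink level first, then the buffer level, and only then create a new metasymbol.
// sb is the (fixed) number of level-i nonterminals in the sink grammar; sa counts the
// level-i nonterminals already created in the buffer grammar.
template <class Table>
uint32_t assign_metasymbol(const Phrase& Q, const Table& Hb_i, Table& Ha_i,
                           uint32_t sb, uint32_t& sa) {
    auto itb = Hb_i.find(Q);
    if (itb != Hb_i.end()) return itb->second;        // phrase already in the sink grammar
    auto ita = Ha_i.find(Q);
    if (ita != Ha_i.end()) return ita->second;        // phrase already in the buffer grammar
    uint32_t X = sb + (++sa);                         // new metasymbol: values > sb live in Ha_i
    Ha_i.emplace(Q, X);
    return X;
}
```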

A.3 Parallel grammar construction

Now that we have explained our grammar encoding and the changes needed to compress in parallel, we are ready to describe PBuildGram. Figure 5 shows the steps in detail. We receive as input a string collection 𝒯 (a file), the number p of compression threads, and a threshold t indicating the approximate amount of working memory we can use. We first initialize p distinct buffers (B1,𝒢1,K1),(B2,𝒢2,K2),…,(Bp,𝒢p,Kp) and an empty grammar 𝒢. Each buffer j is a triplet (Bj,𝒢j,Kj), where Bj is an array that stores chunks of 𝒯 (one at a time), 𝒢j is the grammar into which we compress the chunks loaded into Bj, and Kj is an array of pairs (x1,y1),(x2,y2),… indicating the chunks we have loaded into Bj. Specifically, Kj[u]=(x,y) means that the uth chunk we loaded into Bj contained the subset Tx,Tx+1,…,Tx+y−1∈𝒯. Notice that these strings lie contiguously in 𝒯’s file. There is also an array K with information equivalent to that of Kj but for 𝒢.
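The buffer state can be pictured with the sketch below, which builds on the hypothetical Grammar struct of Section A.1; Buffer and ParallelState are illustrative names, not those of our implementation.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// One compression buffer (B_j, G_j, K_j). B_j holds one chunk of T at a time,
// G_j accumulates the compression of the chunks loaded into B_j, and K_j
// records which runs of strings each chunk covered: K_j[u] = (x, y) means the
// u-th chunk contained T_x, T_{x+1}, ..., T_{x+y-1}.
struct Buffer {
    std::vector<std::string> B;                       // current chunk of T
    Grammar G;                                        // buffer grammar G_j
    std::vector<std::pair<uint64_t, uint64_t>> K;     // chunk bookkeeping K_j
};

// PBuildGram starts with p empty buffers, an empty sink grammar, and an empty K.
struct ParallelState {
    std::vector<Buffer> buffers;                      // (B_1,G_1,K_1),...,(B_p,G_p,K_p)
    Grammar sink;                                     // the sink grammar G
    std::vector<std::pair<uint64_t, uint64_t>> K;     // chunks encoded by the sink
    explicit ParallelState(size_t p) : buffers(p) {}
};
```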

Figure 5: Schematic representation of PBuildGram. The steps (a-e) indicate the cycle of a buffer during the compression step.

PBuildGram consists of a loop that interleaves two steps, compression and merge. During the compression step, p parallel threads compress the buffers B1,B2,…,Bp into the corresponding buffer grammars 𝒢1,𝒢2,…,𝒢p, and keep doing so while space(𝒢1,𝒢2,…,𝒢p)<t. When the space exceeds the threshold, a merge step collapses 𝒢1,𝒢2,…,𝒢p into the sink grammar 𝒢 and flushes the p buffers. The algorithm then enters a new iteration that restarts the compression from the point in 𝒯 where it last stopped.
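At a high level, the loop can be sketched as follows; chunks_remain, space_of, compress_chunks, merge_all, and reorder_compressed_strings are placeholders for the machinery detailed in the remainder of this section.

```cpp
// Placeholders standing in for the steps described in Sections A.2 and A.3.
bool chunks_remain();                                  // any chunk of T left to read?
uint64_t space_of(const std::vector<Buffer>& bufs);    // space(G_1,...,G_p)
void compress_chunks(ParallelState& st);               // one round of parallel BuildGram
void merge_all(ParallelState& st);                     // collapse G_1,...,G_p into the sink G
void reorder_compressed_strings(ParallelState& st);    // reorder C[1..k] using K

// Skeleton of the PBuildGram loop interleaving compression and merge.
void pbuild_gram(ParallelState& st, uint64_t t /* memory threshold, in bits */) {
    while (chunks_remain()) {
        while (chunks_remain() && space_of(st.buffers) < t)
            compress_chunks(st);   // threads run BuildGram(B_j, ., G_j, G) in parallel
        merge_all(st);             // merge step: collapse into G and flush the buffers
    }
    reorder_compressed_strings(st);
}
```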

The compression step initializes an I/O thread that reads 𝒯 from disk, loading the chunks sequentially from left to right into the arrays B1,B2,…,Bp. Meanwhile, the p compression threads process the arrays B1,B2,…,Bp in parallel using the variant of BuildGram described in Section A.2. Every thread receives a buffer j∈[1..p] and runs BuildGram(Bj,,𝒢j,𝒢)=𝒢j. We synchronize the I/O thread with the compression threads using two concurrent queues, I and O. The queue I keeps the buffers whose chunks are ready to be processed by the compression threads, whereas O contains the buffers that were already processed and can be recycled by the I/O thread to load new chunks. When the algorithm starts, O contains all the buffers.

The synchronization process works as follows: let u be the next chunk of 𝒯 that PBuildGram has to process. The I/O thread extracts the head (Bj,𝒢j,Kj) from O, reads the uth chunk Tx,Tx+1,…,Tx+y−1∈𝒯 from disk, and loads it into Bj. Subsequently, it appends (x,y) to Kj, and finally it appends (Bj,𝒢j,Kj) to I. The I/O thread continues to process the next chunks u+1,u+2,… in the same way as long as the compression step remains active. On the other hand, each compression thread tries to acquire the next available buffer (Bj,𝒢j,Kj) from the head of I. After the thread acquires the buffer and runs BuildGram, it flushes Bj and pushes (Bj,𝒢j,Kj) into O, thus marking the buffer for recycling. Notice that a compression thread can process multiple noncontiguous chunks of 𝒯 and collapse their information into the same grammar 𝒢j. We fix this later in the execution of the algorithm by modifying 𝒢j using the information in Kj.
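One way to realize the two queues is the standard mutex/condition-variable pattern sketched below; BufferQueue, build_gram, and the SIZE_MAX sentinel are illustrative, not the concrete machinery of LCG.

```cpp
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>

// Minimal blocking queue used for both I (buffers ready to compress) and
// O (buffers ready to be recycled). Elements are indexes into the buffer array.
class BufferQueue {
    std::deque<size_t> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(size_t j) {
        { std::lock_guard<std::mutex> lk(m_); q_.push_back(j); }
        cv_.notify_one();
    }
    size_t pop() {                        // blocks until a buffer is available
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&]{ return !q_.empty(); });
        size_t j = q_.front(); q_.pop_front();
        return j;
    }
};

// Placeholder for the BuildGram variant of Section A.2: compresses B_j into G_j
// while reading the sink grammar G.
void build_gram(Buffer& buf, const Grammar& sink);

// Compression thread: take a buffer from I, run BuildGram, flush B_j, and
// return the buffer to O so the I/O thread can load the next chunk into it.
void compression_thread(ParallelState& st, BufferQueue& I, BufferQueue& O) {
    for (;;) {
        size_t j = I.pop();
        if (j == SIZE_MAX) break;          // sentinel: no more chunks to process
        build_gram(st.buffers[j], st.sink);
        st.buffers[j].B.clear();           // flush B_j
        O.push(j);                         // mark the buffer for recycling
    }
}
```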

During the merge step, we execute MergeGram(𝒢,𝒢j)=𝒢 for each j∈[1..p]. It is possible to collapse the grammars in parallel in a merge-sort fashion: let us assume w.l.o.g. that p+1 is a power of two. Then, (p+1)/2 threads execute in parallel the processes MergeGram(𝒢1,𝒢2)=𝒢1, MergeGram(𝒢3,𝒢4)=𝒢3, …, MergeGram(𝒢,𝒢p)=𝒢 to produce the new grammars 𝒢1,𝒢3,…,𝒢. Subsequently, (p+1)/4 threads collapse the new grammars in the same way, and the process continues until only 𝒢 remains. Every time we execute MergeGram(𝒢j,𝒢j+1)=𝒢j, we also concatenate the corresponding arrays Kj and Kj+1 to keep track of the chunks of 𝒯 that the resulting 𝒢j encodes. Notice that, after the merge, K has all the information of K1,K2,…,Kp. Finally, we reset the buffers and begin a new compression iteration.
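The pairwise collapse can be sketched with std::thread as follows; merge_gram is a placeholder for MergeGram, the pairing of grammars is simplified (the sink absorbs a partner in every round), and the concatenation of the K arrays is omitted.

```cpp
#include <thread>
#include <vector>

// Placeholder for MergeGram(Ga, Gb) = Ga, which collapses Gb into Ga
// (Section 3.6). Here it only matters that disjoint pairs can run in parallel.
void merge_gram(Grammar& dst, const Grammar& src);

// Merge-sort-style collapse of the sink grammar and the p buffer grammars.
// 'gs' holds pointers to the surviving grammars, with the sink at gs[0];
// every round halves their number until only gs[0] (the sink) remains.
void parallel_merge(std::vector<Grammar*>& gs) {
    while (gs.size() > 1) {
        std::vector<std::thread> workers;
        for (size_t u = 0; u + 1 < gs.size(); u += 2)
            workers.emplace_back([&, u]{ merge_gram(*gs[u], *gs[u + 1]); });
        for (auto& w : workers) w.join();
        // Keep the grammars that absorbed a partner (plus a possible leftover).
        std::vector<Grammar*> next;
        for (size_t u = 0; u < gs.size(); u += 2) next.push_back(gs[u]);
        gs.swap(next);
    }
}
```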

After we process all the chunks, we perform one last merge step to collapse the buffer grammars into 𝒢, and then use the information in K to reorder the elements of C[1..k]. Once we finish, we return 𝒢 and complete the execution of PBuildGram.

A.4 Storing the final grammar

As mentioned, we post-process the output of PBuildGram using RL and then Simp. The resulting file of this process (i.e., the output of LCG) uses G log g bits to encode ℛ, g log G bits to store pointers to the right-hand sides of ℛ, and k log w bits to store pointers to the compressed sequences of the strings in 𝒯.

A.5 Advantage of our parallel scheme

Our parallel scheme can use a large number of threads with little contention (and thus achieve high compression speeds) while keeping the amount of working memory manageable. A thread executing BuildGram(Bj,,𝒢j,𝒢) is the only one modifying Bj and 𝒢j, and although other threads can access the sink 𝒢 concurrently, they only read from it (i.e., little to no contention). There is some contention when the threads modify the queues I and O concurrently to remove or insert buffers (respectively). However, the compression threads spend most of their time executing BuildGram (modifying a queue is cheap), and it is unlikely that many of them attempt to access the same queue at the same time. The I/O thread might experience more contention as we increase the number of threads because it has to compete to modify O.

As mentioned above, PBuildGram keeps the consumption of working memory manageable as we add more threads. Specifically, space(𝒢1,𝒢2,…,𝒢p) does not grow to be p times larger than space(𝒢). In the first compression iteration, the sink grammar 𝒢 is empty, and because the compression threads do not synchronize while they execute BuildGram, the grammars 𝒢1,𝒢2,…,𝒢p will be redundant. Therefore, memory usage will grow rapidly at the beginning, exceeding the memory threshold t and triggering a merge. In this phase, we collapse the redundant content into 𝒢 and delete the buffer grammars, thus reducing memory consumption. In the next compression iteration, 𝒢 will be non-empty, so every instance BuildGram(Bj,,𝒢j,𝒢)=𝒢j will only add to 𝒢j what is not already in 𝒢. In addition, 𝒢j will grow more slowly with every new merge iteration because there will be less “new” sequence content in 𝒯. We note that memory usage still depends on max(t,space(𝒢)), with space(𝒢) depending, in turn, on the amount of repetitiveness in 𝒯.

Appendix B Speeding up string processing algorithms (sketch)

In this section, we briefly explain how locally consistent grammars help speed up string processing algorithms (the details vary with the application). Typically, the most expensive operation in string algorithms is finding matching sequences. We exploit the fact that a grammar collapses redundant sequence information, so the search space in which a string algorithm has to operate is significantly smaller in the grammar than in the text.

Let 𝒢={Σ,V,ℛ,S} be a locally consistent grammar generating the elements of 𝒯, and let ALG(𝒢) be a string processing algorithm. As before, we divide ℛ={ℛ1,ℛ2,…,ℛl−1} according to levels. We regard the right-hand sides of ℛi as a string collection and run (a section of) ALG using ℛi as input. The process yields some complete answers and some partial answers. We pass the complete answers to the next level i+1 and use them as satellite data. When ALG is recursive and returns from the recursion to level i, we use the new information to complete what is missing in the partial answers.

An example of this idea is the computation of maximal exact matches (MEMs). Let lcp(X,Y) be the longest common prefix between exp(X) and exp(Y), with X,Y∈Vi−1. Similarly, let lcs(X,Y) be the longest common suffix of exp(X) and exp(Y). Assume that we have run a standard algorithm to compute the MEMs on the right-hand sides of ℛi and that we found a match A[j1..j2]=B[k1..k2] between two right-hand sides A,B of ℛi. Completing the MEM requires us to obtain lcs(A[j1−1],B[k1−1]) and lcp(A[j2+1],B[k2+1]). Once we compute all the matches in ℛi, we use the output to compute lcp and lcs at the next level i+1. We still have to project the MEMs to text positions, but this operation is cheaper than computing the MEMs directly over the plain text, especially if the text is redundant.
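A toy sketch of the extension step, where exp() is materialized as plain strings for illustration; in the actual algorithm, the lcp/lcs values come from the recursion over the levels rather than from explicit expansions, and all function names here are hypothetical.

```cpp
#include <algorithm>
#include <string>

// Length of the longest common prefix of two strings.
static size_t lcp_len(const std::string& x, const std::string& y) {
    size_t i = 0, n = std::min(x.size(), y.size());
    while (i < n && x[i] == y[i]) ++i;
    return i;
}

// Length of the longest common suffix of two strings.
static size_t lcs_len(const std::string& x, const std::string& y) {
    size_t i = 0, n = std::min(x.size(), y.size());
    while (i < n && x[x.size() - 1 - i] == y[y.size() - 1 - i]) ++i;
    return i;
}

// Given a level-i match A[j1..j2] = B[k1..k2], the MEM extends by
// lcs(exp(A[j1-1]), exp(B[k1-1])) symbols to the left and
// lcp(exp(A[j2+1]), exp(B[k2+1])) symbols to the right of the expanded match.
size_t extend_match(const std::string& left_a,  const std::string& left_b,
                    const std::string& mid,     // exp(A[j1..j2]) = exp(B[k1..k2])
                    const std::string& right_a, const std::string& right_b) {
    return lcs_len(left_a, left_b) + mid.size() + lcp_len(right_a, right_b);
}
```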

The grammar we produce with PBuildGram still requires some modifications to run string algorithms as described above, but the cost of this transformation is proportional to the grammar size, not the text size.

Appendix C Figures

Figure 6: Number of nonterminals and phrases generated in each parsing round for the output grammar 𝒢={Σ,V,ℛ,S} of PBuildGram. The x-axes show the parsing rounds i=1,2,…,l. (A) Percentage gi/g×100 that the number gi=|Vi| of level-i nonterminals contributes to the total number g=|V| of nonterminals. (B) Percentage Gi/G×100 that the number of symbols Gi=∑_{X∈Vi}|rhs(X)| contributes to the total grammar size G. High percentages denote expensive rounds in each collection.