
Two-Dimensional Longest Common Extension Queries in Compact Space

Arnab Ganguly, University of Wisconsin, Whitewater, WI, USA · Daniel Gibney, University of Texas at Dallas, TX, USA · Rahul Shah, Louisiana State University, Baton Rouge, LA, USA · Sharma V. Thankachan, North Carolina State University, Raleigh, NC, USA
Abstract

For a length-n text over an alphabet of size σ, we can encode the suffix tree data structure in 𝒪(n log σ) bits of space. It supports suffix array (SA), inverse suffix array (ISA), and longest common extension (LCE) queries in 𝒪(log_σ^{ε} n) time, which enables efficient pattern matching; here ε > 0 is an arbitrarily small constant. Further improvements are possible for LCE queries, where 𝒪(1)-time queries can be achieved using an index of 𝒪(n log σ) bits. However, compactly indexing a two-dimensional text (i.e., an n×n matrix) has been a major open problem. We show progress in this direction by first presenting an 𝒪(n² log σ)-bit structure supporting LCE queries in near 𝒪((log_σ n)^{2/3}) time. We then present an 𝒪(n² log σ + n² log log n)-bit structure supporting ISA queries in near 𝒪(log n (log_σ n)^{2/3}) time. Within a similar space, achieving SA queries in poly-logarithmic (even strongly sub-linear) time is a significant challenge. However, our 𝒪(n² log σ + n² log log n)-bit structure can support SA queries in 𝒪(n²/(σ log n)^c) time, where c is an arbitrarily large constant, which enables pattern matching in time faster than what is possible without preprocessing.

We then design a repetition-aware data structure. The δ_{2D} compressibility measure for two-dimensional texts was recently introduced by Carfagna and Manzini [SPIRE 2023]. The measure ranges from 1 to n², with a smaller δ_{2D} indicating a highly compressible two-dimensional text. The currently known data structure utilizing δ_{2D} allows only element access. We obtain the first structure based on δ_{2D} for LCE queries. It takes 𝒪̃(n^{5/3} + n^{8/5}·δ_{2D}^{1/5}) space and answers queries in 𝒪(log n) time.

Keywords and phrases:
String matching, text indexing, two-dimensional text
Copyright and License:
© Arnab Ganguly, Daniel Gibney, Rahul Shah, and Sharma V. Thankachan; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Pattern matching
Funding:
Supported by the US National Science Foundation (NSF) under Grant Numbers 2315822 (S Thankachan) and 2137057 (R Shah).
Editors:
Olaf Beyersdorff, Michał Pilipczuk, Elaine Pimentel, and Nguyễn Kim Thắng

1 Introduction

A two-dimensional text T[0..n)[0..n) can be viewed as an n×n matrix, where each entry is a character from an alphabet set Σ of size σ. Data structures for two-dimensional texts have been studied for decades. In particular, there has been extensive work on generalizing suffix trees [16, 17, 23] and suffix arrays [16, 22] to 2D texts. These data structures, although capable of answering most queries in optimal (or near optimal) time, require 𝒪(n²) words, or 𝒪(n² log n) bits, of space.

On the other hand, in the case of 1D texts of length n, there exist data structures with the same functionality as suffix trees/arrays but requiring only 𝒪(n log σ) bits of space [18, 32], or even less when the text is compressible [11, 21]. This is true even for some variants of suffix trees, such as parameterized [14, 13] and order-isomorphic [12] suffix trees [33]. The query times of these space-efficient versions are often polylogarithmic, with the exception of LCE queries, for which Kempa and Kociumaka demonstrated that the query time can be made constant [19]. For 2D texts, the only known result in this direction is a data structure by Patel and Shah that uses 𝒪(n² log σ + n² log log n) bits and supports inverse suffix array (ISA) queries in 𝒪(log⁴ n/(log log n)³) time [28]. In this work, we make further progress in this direction. In particular, we focus on space-efficient data structures for longest common extension (LCE) queries in the 2D setting. The problem is formally defined as follows:

Problem 1 (2D LCE).

Preprocess a 2D text T[0..n)[0..n) over an alphabet Σ of size σ into a data structure that can answer 2D LCE queries efficiently. A 2D LCE query consists of points (i_1, j_1), (i_2, j_2) and asks to return the largest L such that T[i_1..i_1+L)[j_1..j_1+L) and T[i_2..i_2+L)[j_2..j_2+L) are matching square submatrices of T.

A 2D suffix tree of size 𝒪(n² log n) bits can answer LCE queries in constant time. Our first result is an LCE data structure that occupies 𝒪(n² log σ) bits of space.

Theorem 1.

By maintaining an 𝒪(n² log σ)-bit data structure, we can answer 2D LCE queries in 𝒪((log_σ n)^{2/3}(log log_σ n)^{5/3}) time.

Turning now to highly compressible 2D texts, we consider repetition-aware compression measures. The δ measure is an important and well-studied compressibility measure for 1D texts [26]. Only recently has it been extended to 2D texts by Carfagna and Manzini via the δ_{2D} measure [5]. They demonstrate that the data structure of Brisaboa et al. [3] occupies 𝒪((δ_{2D} + √(n·δ_{2D})) log(n log σ/(δ_{2D} log n))) space. However, this data structure only supports access to the elements of T. We provide the first repetition-aware data structure supporting the more advanced LCE queries. Note that the measure δ_{2D} ranges from 1 to n², with a smaller δ_{2D} value implying higher compressibility.

Theorem 2.

By maintaining an 𝒪((n^{5/3} + n^{8/5}·δ_{2D}^{1/5}) log β)-word data structure, we can answer 2D LCE queries in 𝒪(1 + log β) time, where β is always 𝒪(n) and goes to 𝒪(1) as δ_{2D} approaches n². In particular,

β = n if δ_{2D} < n^{9/5}, and β = n^{9/5}/δ_{2D}^{9/10} if δ_{2D} ≥ n^{9/5}.

When δ_{2D} = Θ(n²), our data structure takes 𝒪(n²) words of space and answers LCE queries in 𝒪(1) time. When δ_{2D} = o(n²), the space becomes o(n²) and LCE queries are answered in logarithmic time. Our approach builds on many of the same techniques as our compact index but also introduces a matrix representation of the leaves of a truncated suffix tree. We call this a macro-matrix. We prove that if the original 2D text is compressible, then this macro-matrix remains compressible for appropriately chosen parameters. This is then combined with the data structure of Brisaboa et al. [3] to achieve Theorem 2.

As the first steps towards obtaining the other functionalities of the suffix tree, we apply our 2D LCE query structure from Theorem 1 to get the following results. Definitions of suffix array (SA) and inverse suffix array (ISA) are deferred to Section 1.1.

The following theorem significantly improves on the results by Patel and Shah [28].

Theorem 3 (2D ISA queries).

By maintaining an 𝒪(n² log σ + n² log log n)-bit data structure, we can answer inverse suffix array queries in 𝒪(log n (log_σ n)^{2/3}(log log_σ n)^{5/3}) time.

We also provide the first known results regarding a nearly compact index for 2D suffix array queries.

Theorem 4 (2D SA queries).

By maintaining an 𝒪(n² log σ + n² log log n)-bit data structure, we can answer suffix array (SA) queries in 𝒪(n²/(σ log n)^c) time, where c is an arbitrarily large constant fixed at the time of construction.

A fundamental problem is to find all submatrices of T that match a given square pattern P[0..m)[0..m). After building the 2D suffix tree, given P as a query, the number of occurrences of P (denoted by occ) can be obtained in 𝒪(m²) time, and all occurrences can be reported in 𝒪(m² + occ) time. Our result, which uses a smaller index, is the following.

Theorem 5 (PM queries).

By maintaining an 𝒪(n² log σ + n² log log n)-bit data structure, we can count the occurrences of an m×m query pattern in 𝒪(m² + n²/(σ log n)^c) time and report all occurrences in 𝒪(m² + occ + n²/(σ log n)^c) time, where c is an arbitrarily large constant fixed at the time of construction.

Although the time complexities in Theorems 4 and 5 are far from satisfactory, these are the first results demonstrating that subquadratic query times for 2D SA and PM queries are possible in compact space.

1.1 Preliminaries

Notation and Strings.

We denote the interval i, i+1, …, j by [i..j] and the interval i, i+1, …, j−1 by [i..j). For a string S of length n, we use S[i] to refer to its i-th character, i ∈ [0..n). We use S_1 S_2 to denote the concatenation of two strings S_1 and S_2. For notation, S[i..j] = S[i] S[i+1] ⋯ S[j], S[i..j) = S[i] S[i+1] ⋯ S[j−1], and S[i..] = S[i..n). Arrays and strings are zero-indexed throughout this work.

For a single string S[0..n) and i, j ∈ [0..n), LCE(i,j) is defined as the length of the longest common prefix of S[i..] and S[j..]. In the case of two strings, S_1[0..n_1) and S_2[0..n_2), we overload the notation so that for i ∈ [0..n_1), j ∈ [0..n_2), LCE(i,j) is the length of the longest common prefix of S_1[i..] and S_2[j..]. For a given string S, the suffix tree [34] is a compact trie over all suffixes of S with leaves ordered according to the lexicographic rank of the corresponding suffixes. The classical suffix tree takes 𝒪(n) words of space and can be constructed in 𝒪(n) time for polynomially sized integer alphabets [9]. The suffix array SA[0..n) of a string S[0..n) is the unique array such that S[SA[i]..] is the i-th smallest suffix lexicographically. The inverse suffix array ISA[0..n) is the unique array such that ISA[SA[i]] = i; equivalently, ISA[i] gives the lexicographic rank of S[i..]. The suffix tree can answer LCE queries in 𝒪(1) time. We call a compact trie with lexicographically ordered leaves for a subset of suffixes a sparse suffix tree. Observe that the number of nodes in a sparse suffix tree remains proportional to the number of suffixes it is built from.

We will utilize the following result by Kempa and Kociumaka, which provides an LCE data structure smaller than a classical suffix tree.

Lemma 6 ([19]).

1D LCE queries on a text S[0..n) over an alphabet set Σ = [0..σ) can be answered in 𝒪(1) time by maintaining a data structure of size 𝒪(n log σ) bits.

The next result by Bille et al. allows for a trade-off between space and query time. We will utilize it in Section 2.2.

Lemma 7 ([1]).

Suppose we have read-only access to the text S[0..n), such that we can determine the lexicographic order of any two of its characters in constant time. Then we can answer 1D LCE queries on S in 𝒪(τ) time by maintaining an auxiliary structure of 𝒪(n/τ) words, where 1 ≤ τ ≤ n is a parameter fixed at the time of construction.

d-Covers.

A d-cover of an interval [0..n) is a subset of positions, denoted by 𝒞, such that for any x, y ∈ [0..n−d) there exists h ∈ [0..d) where x+h, y+h ∈ 𝒞. It was shown by Burkhardt and Kärkkäinen that there exists a d-cover of size 𝒪(n/√d) that can be computed in 𝒪(n/√d) time [4]. d-Covers have been used previously for LCE queries in the 1D case by Gawrychowski et al. [15] and Bille et al. [2]. Since we need a small data structure that lets us find an h value as described above in constant time, we briefly outline the construction given in [4].

A difference cover modulo d is a subset 𝒟 ⊆ {0, 1, …, d−1} such that for all w ∈ {0, 1, …, d−1} there exist u, v ∈ 𝒟 with w ≡ u − v (mod d). Colbourn and Ling showed that there exists such a 𝒟 with |𝒟| = Θ(√d) [8]. A d-cover 𝒞 is constructed from a difference cover 𝒟 as follows: for j ∈ [0..n), if (j mod d) ∈ 𝒟, then j is added to 𝒞. We also build a look-up table A of size d such that, for all i ∈ {0, 1, …, d−1}, both A[i] and (A[i]+i) mod d are in 𝒟. This is always possible, thanks to the definition of the difference cover. See Figure 1.

Figure 1: An example d-cover for n = 12 and d = 7. Here the difference cover used is 𝒟 = {1, 2, 4}, resulting in the d-cover 𝒞 = {1, 2, 4, 8, 9, 11} and the lookup table A = [1, 1, 2, 1, 4, 4, 2]. For the positions x = 3 and y = 6, we have h = (A[(6−3) mod 7] − 3) mod 7 = 5. Observe that 3+5, 6+5 ∈ 𝒞.
Lemma 8 ([4]).

For a d-cover 𝒞 of an interval [0..n), there exists a data structure of size 𝒪(d) that, given x, y ∈ [0..n−d), outputs in 𝒪(1) time an h ∈ [0..d) such that x+h, y+h ∈ 𝒞.

Proof.

We maintain the 𝒪(d)-space look-up table A as described above. We assume, without loss of generality, that y ≥ x. Let h = (A[(y−x) mod d] − x) mod d. Observe that

x + h ≡ A[(y−x) mod d] (mod d).

Hence, ((x+h) mod d) ∈ 𝒟 and x+h ∈ 𝒞. Also,

y + h ≡ A[(y−x) mod d] + (y−x) (mod d).

Hence, ((y+h) mod d) ∈ 𝒟 and y+h ∈ 𝒞.
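To make the construction concrete, the following sketch builds the d-cover and lookup table from a given difference cover and answers the query of Lemma 8. The names build_cover and h_query are ours; the difference cover 𝒟 itself is assumed to be supplied, e.g., by the construction of Colbourn and Ling [8].

```python
def build_cover(D, d, n):
    """Build the d-cover C = {j in [0..n) : j mod d in D} and the lookup
    table A, where A[i] is in D and (A[i] + i) mod d is in D for all i."""
    Dset = set(D)
    C = sorted(j for j in range(n) if j % d in Dset)
    # Existence of A[i] is guaranteed by the difference-cover property:
    # every i is a difference u - v (mod d) of two elements of D.
    A = [next(v for v in D if (v + i) % d in Dset) for i in range(d)]
    return C, A

def h_query(A, d, x, y):
    """Lemma 8: return h in [0..d) with x+h and y+h both in the cover
    (for x, y in [0..n-d))."""
    x, y = min(x, y), max(x, y)          # assume w.l.o.g. y >= x
    return (A[(y - x) % d] - x) % d

# Reproducing Figure 1: D = {1,2,4}, d = 7, n = 12.
C, A = build_cover([1, 2, 4], 7, 12)     # C = [1,2,4,8,9,11], A = [1,1,2,1,4,4,2]
assert h_query(A, 7, 3, 6) == 5          # 3+5 and 6+5 are both in C
```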

2D Suffix Trees and 2D Suffix Arrays.

We utilize the generalization of suffix trees to 2D texts presented by Giancarlo [16]. This suffix tree is built from the Lstrings of the 2D text T. Lstrings are over the alphabet ∪_{i=1}^{n} Σ^{2i−1}. For a position (i,j) ∈ [0..n)², the suffix T[i..][j..] is a_0 a_1 ⋯ a_l, where l = n − max(i,j) − 1, a_0 = T[i][j], and a_k = T[i+k][j..j+k) ∘ T[i..i+k][j+k] for k > 0. See Figure 2. The characters are maintained implicitly as references to T, resulting in the 2D suffix tree over all suffixes T[i..][j..], (i,j) ∈ [0..n)², occupying 𝒪(n²) words of space. Once constructed, the 2D suffix tree allows us to find the LCE of two positions in 𝒪(1) time through a lowest common ancestor (LCA) query. The 2D suffix tree also enables pattern matching in optimal 𝒪(m² + occ) time.

The order between two characters a and a′ of Lstrings is defined as the lexicographic order induced by the base alphabet Σ. The lexicographic order of two Lstrings (and corresponding submatrices) is induced by the order of their characters. We additionally assume that the bottom row and rightmost column of T consist only of a $ symbol, which is the smallest in the alphabet order and occurs nowhere else in T.

suffix starting at (0,0): a ∘ aab ∘ bbbba ∘ bcabcab ∘ $$$$$$$$$
suffix starting at (0,1): a ∘ bbb ∘ babca ∘ cab$$$$
suffix starting at (1,0): a ∘ bbb ∘ bcbaa ∘ $$$cab$
Figure 2: An example 2D text and the suffixes starting at positions (0,0), (0,1), and (1,0). The "∘" denotes concatenation, and consecutive symbols without "∘" between them are treated as a single character.
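The decomposition of a suffix into its Lstring characters can be sketched directly from the definition above (lstring_suffix is an illustrative name; T is assumed to be a list of equal-length row strings):

```python
def lstring_suffix(T, i, j):
    """Lstring characters a_0, a_1, ... of the suffix of T at (i, j):
    a_0 = T[i][j] and, for k > 0, a_k is the row piece T[i+k][j..j+k)
    followed by the column piece T[i..i+k][j+k] (2k+1 symbols in total)."""
    n = len(T)
    chars = [T[i][j]]
    for k in range(1, n - max(i, j)):
        row_part = T[i + k][j:j + k]                                  # T[i+k][j..j+k)
        col_part = ''.join(T[r][j + k] for r in range(i, i + k + 1))  # T[i..i+k][j+k]
        chars.append(row_part + col_part)
    return chars
```

Joining the returned characters with "∘" reproduces the suffixes listed in Figure 2.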

The suffix array SA[0..n²) of a 2D text T[0..n)[0..n) is an array containing 2D points such that if (i,j) = SA[h], then T[i..][j..] is the h-th smallest suffix lexicographically. The inverse suffix array maps each (i,j) ∈ [0..n)² to its position in SA, i.e., ISA[SA[h]] = h.

The δ_{2D} Measure and 2D Block Trees.

The δ measure is a well-studied compressibility measure for 1D texts [7, 20, 24, 25, 30]. It is defined as δ(T) = max_{1≤t≤n} d_t(T)/t, where d_t(T) denotes the number of distinct length-t substrings of T[0..n).

Carfagna and Manzini recently generalized the δ measure to 2D texts [5, 6]. Letting d_t(T) denote the number of distinct t×t submatrices of T[0..n)[0..n), δ_{2D}(T) = max_{1≤t≤n} d_t(T)/t². Observe that δ_{2D}(T) can range between 1, e.g., in the case where all elements of T are the same character, and n², i.e., the case where all elements of T are distinct. Carfagna and Manzini showed that the 2D block tree data structure of Brisaboa et al. [3] occupies 𝒪((δ_{2D}(T) + √(n·δ_{2D}(T))) log(n log σ/(δ_{2D} log n))) words of space and provides access to any entry of T in 𝒪(1 + log(n log σ/(δ_{2D} log n))) time. A further generalization of the δ measure to 2D, allowing for non-square matrices, was introduced by Romana et al. and related to other potential 2D compressibility measures [31]. In this work, we will only consider the δ_{2D} measure based on square submatrices. We hereafter refer to δ_{2D} as δ and omit the text T when it is clear from context.
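As an illustration of the definition (and not an efficient algorithm), δ_{2D} can be computed by collecting the distinct t×t submatrices for every t:

```python
def delta_2d(T):
    """Brute-force delta_2D(T) = max over t in [1..n] of d_t(T)/t^2,
    where d_t(T) counts the distinct t x t submatrices of T
    (T is a list of equal-length row strings)."""
    n = len(T)
    best = 0.0
    for t in range(1, n + 1):
        distinct = {tuple(T[i + r][j:j + t] for r in range(t))
                    for i in range(n - t + 1) for j in range(n - t + 1)}
        best = max(best, len(distinct) / (t * t))
    return best
```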

2 Compact Data Structures for 2D LCE Queries

We start with some definitions. Let R_i denote the i-th row and C_j denote the j-th column of our 2D text T, where 0 ≤ i, j < n. Specifically, R_i[0..n) (resp., C_j[0..n)) is a text of length n over the alphabet Σ whose k-th character is T[i][k] (resp., T[k][j]), where k ∈ [0..n).

We define a set of sampled positions on the diagonals of T, that is, on the diagonals T[n−1][0]; T[n−2][0] ⋯ T[n−1][1]; …; T[0][n−2] ⋯ T[1][n−1]; T[0][n−1], using a d-cover with d = Θ((log_σ n)²). This is obtained by taking a d-cover 𝒞 for [0..n) and using it to define sample positions starting from the top left of each diagonal. Formally, the sample positions are

𝒞_D = {(i,j) : i, j ∈ [0..n), min(i,j) ∈ 𝒞}.

See Figure 3.
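Concretely, the sampled set is generated from the one-dimensional cover as follows (a small sketch; C is the d-cover of [0..n) described above):

```python
def diagonal_samples(C, n):
    """C_D = {(i, j) : min(i, j) in C}: on every diagonal, the sampled
    offsets (measured from the diagonal's top-left endpoint) lie in C."""
    Cset = set(C)
    return {(i, j) for i in range(n) for j in range(n) if min(i, j) in Cset}
```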

We maintain a sparse suffix tree over the suffixes starting from these sampled positions. As this is a compact trie with |𝒞_D| = 𝒪(n²/√d) leaves, the space required for this sparse suffix tree is 𝒪(n²/√d) words. By our above choice of d, this is 𝒪(n² log σ) bits. Using this sparse suffix tree, we can obtain the LCE of any two sampled positions in 𝒪(1) time.

Figure 3: An example 7-cover 𝒞 = {1, 2, 4} used for the diagonal sample positions of a 7×7 text. Note that this value d = 7 is for illustrative purposes. Sample positions are marked in the figure.

Additionally, we maintain the data structure from Lemma 6 for the concatenation of the columns C_0, …, C_{n−1} and of the rows R_0, …, R_{n−1}, which adds another 𝒪(n² log σ) bits. This allows us to find the LCE between R_i[x..] and R_j[y..] (or C_i[x..] and C_j[y..]) in 𝒪(1) time. We take the minimum of the LCE value and min(n−x, n−y) to prevent common prefixes from crossing row or column boundaries.
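For example, a row LCE reduces to one query on the concatenated text plus a clamp (a sketch; lce1d stands for the Lemma 6 oracle over the concatenation R_0 R_1 ⋯ R_{n−1}):

```python
def row_lce(lce1d, n, i, x, j, y):
    """LCE of R_i[x..] and R_j[y..]: query the concatenation of all rows
    at the corresponding offsets, then clamp so that the reported match
    cannot run past the end of either row."""
    raw = lce1d(i * n + x, j * n + y)
    return min(raw, n - x, n - y)
```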

In what follows, we first present a simple preliminary solution. We then develop these ideas further with two refinements that lead us to Theorem 1. The components defined above (sparse suffix tree from diagonal samples and LCE data structures for concatenated rows and columns) are used in all three solutions.

2.1 Achieving 𝒪((log_σ n)²) Query Time

To answer an LCE query (i_1,j_1), (i_2,j_2), we use the look-up structure discussed in Lemma 8 to obtain an h ∈ [0..d) such that (i_1+h, j_1+h) and (i_2+h, j_2+h) are sampled diagonal positions. For convenience, in the case where no such h exists in the look-up structure, because either (i_1,j_1) or (i_2,j_2) is near the boundary of T, we take h to be one less than the minimum diagonal offset to a boundary of T. We first obtain LCE((i_1+h, j_1+h), (i_2+h, j_2+h)) in 𝒪(1) time. Next, for k ∈ [0..h), we compute the LCEs between R_{i_1+k}[j_1..] and R_{i_2+k}[j_2..], and between C_{j_1+k}[i_1..] and C_{j_2+k}[i_2..]. While iterating from k = 1 to k = h−1, if for some k either the LCE between R_{i_1+k}[j_1..] and R_{i_2+k}[j_2..] or between C_{j_1+k}[i_1..] and C_{j_2+k}[i_2..] becomes less than k, we output k−1. Otherwise, we output the minimum over h + LCE((i_1+h, j_1+h), (i_2+h, j_2+h)) and all of the LCE values computed for the rows and columns specified above.

Only one constant-time query for a diagonal sampled position is required, and the number of 1D LCE queries needed is at most 2d. Since d = Θ((log_σ n)²) and each 1D LCE query takes 𝒪(1) time, the total time is 𝒪((log_σ n)²).
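A sketch of this preliminary query follows, under the assumption that cover_h (Lemma 8), lce_diag (sparse suffix tree), and row_lce/col_lce (Lemma 6, with the boundary clamp above folded in) are available as 𝒪(1)-time oracles. Instead of the early exit described above, the sketch caps the contribution of each offset k at max(k, ·), which yields the same minimum:

```python
def lce_2d_basic(i1, j1, i2, j2, cover_h, lce_diag, row_lce, col_lce):
    """Preliminary 2D LCE query of Section 2.1: one diagonal-sample query
    plus at most 2d one-dimensional LCE queries."""
    h = cover_h(i1, j1, i2, j2)          # (i1+h, j1+h), (i2+h, j2+h) sampled
    best = h + lce_diag(i1 + h, j1 + h, i2 + h, j2 + h)
    for k in range(h):
        r = row_lce(i1 + k, j1, i2 + k, j2)   # rows R_{i1+k}, R_{i2+k}
        c = col_lce(j1 + k, i1, j2 + k, i2)   # columns C_{j1+k}, C_{j2+k}
        # A mismatch at offset k only constrains squares of side > k,
        # so this offset contributes max(k, min(r, c)) to the minimum.
        best = min(best, max(k, min(r, c)))
    return best
```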

2.2 Achieving 𝒪(log_σ n (log log_σ n)²) Query Time

First, we define R_{i,t} and C_{j,t}. These are texts of length n over the alphabet Σ^t, where 0 ≤ i, j and i+t−1, j+t−1 < n. The k-th characters of R_{i,t} and C_{j,t} are length-t strings over Σ defined as follows:

R_{i,t}[k] = R_i[k] R_{i+1}[k] ⋯ R_{i+t−1}[k]
C_{j,t}[k] = C_j[k] C_{j+1}[k] ⋯ C_{j+t−1}[k].

We call these meta characters. We also call Ri,t and Cj,t slabs of length t. Applying the structure from Lemma 6 over the concatenation of rows and the concatenation of columns, we can compare two meta characters in 𝒪(1) time.

Data Structure.

In addition to the previous components, we maintain the structure from Lemma 7 over the text obtained by concatenating R_{i,t} for i ∈ [0..n) and t = 1, 2, 4, 8, …, min(n−i, 2^{⌊log d⌋}). We also maintain the structure from Lemma 7 over the text obtained by concatenating C_{j,t} for j ∈ [0..n) and t = 1, 2, 4, 8, …, min(n−j, 2^{⌊log d⌋}). We leave the parameter τ appearing in Lemma 7 to be optimized later.

Querying.
Figure 4: The two cases (a) and (b) considered when querying. The actual LCE is shown as a black square, LCE((i_1+h, j_1+h), (i_2+h, j_2+h)) as a green square, and the LCEs of slabs in red and blue. Further binary search is necessary in case (b).

Given an LCE query (i_1,j_1), (i_2,j_2), we first find an h ∈ [0..d) such that (i_1+h, j_1+h) and (i_2+h, j_2+h) are sampled positions. We then decompose the intervals [i_1..i_1+h) and [j_1..j_1+h) into 𝒪(log d) slabs whose lengths are powers of two. We perform an LCE query for each corresponding slab for both rows and columns. A minimum is taken over all these LCE values and h + LCE((i_1+h, j_1+h), (i_2+h, j_2+h)). Denote this minimum by m. There are two possible cases.

  • m>h. See Figure 4(a). In this case, m is reported as the result.

  • m ≤ h. See Figure 4(b). In this case, we still need to find the largest value y such that the minimum LCE of the slabs covering C_{j_1}[i_1..], …, C_{j_1+y}[i_1..] (with the slabs covering C_{j_2}[i_2..], …, C_{j_2+y}[i_2..], respectively) and R_{i_1}[j_1..], …, R_{i_1+y}[j_1..] (with the slabs covering R_{i_2}[j_2..], …, R_{i_2+y}[j_2..], respectively) is at least y. To accomplish this, we perform a modified binary search while keeping track of the minimum LCE values for both the column and row slabs. The only difference compared to a standard binary search is that, rather than always dividing the current range under consideration in half, we recurse on the power of two closest to half the size of the current range. This is done to ensure that we always use slabs for which we have prepared LCE data structures.

Analysis.

Letting T(l) be the number of LCE queries on slabs for the binary search on a range of length l, the resulting recurrence is

T(l) ≤ T(2^{⌈log(l/2)⌉}) + 1 = 𝒪(log l).

Hence, T(h) = 𝒪(log d). We now fix τ = log_σ n · log log_σ n. Since each LCE query on a slab takes 𝒪(τ) time, the overall query time is 𝒪(τ log d) = 𝒪(log_σ n (log log_σ n)²), where we used that d = Θ((log_σ n)²). The total added space relative to the previous solution is 𝒪(log d · n²/τ) words. Using our definitions of d and τ, the space remains 𝒪(n² log σ) bits.
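The two power-of-two ingredients of this solution can be sketched as follows (helper names are ours): the greedy slab decomposition of a range, and the size of the subrange visited by one step of the modified binary search.

```python
def slab_decomposition(lo, hi):
    """Split [lo..hi) into O(log(hi - lo)) pieces whose lengths are powers
    of two, largest piece first; each piece corresponds to a prepared slab."""
    pieces = []
    while lo < hi:
        t = 1 << ((hi - lo).bit_length() - 1)   # largest power of two <= hi - lo
        pieces.append((lo, t))
        lo += t
    return pieces

def next_range(l):
    """Size 2^ceil(log(l/2)) of the subrange the modified binary search
    recurses on, giving the recurrence T(l) <= T(next_range(l)) + 1."""
    half = (l + 1) // 2
    return 1 << (half - 1).bit_length()
```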

2.3 Achieving 𝒪((log_σ n)^{2/3} (log log_σ n)^{5/3}) Query Time

Data Structure.

Let x be a parameter to be defined later. In addition to the previously defined diagonal sample positions, we now define sample positions for the rows and columns using an x-cover, denoted by 𝒳. We maintain the structure from Lemma 7 (with parameter τ left open for optimization) over the text obtained by concatenating the slabs R_{i,t} for t = 1, 2, 4, 8, …, min(n−i, 2^{⌊log d⌋}), whenever i ∈ 𝒳. We do the same for slabs R_{i,t} for t = 1, 2, 4, 8, …, min(n−i, 2^{⌊log d⌋}) whenever i+t−1 ∈ 𝒳 and i ≥ 0. Similarly, we maintain the structure from Lemma 7 over the concatenation of C_{j,t} for t = 1, 2, 4, 8, …, min(n−j, 2^{⌊log d⌋}) for j ∈ 𝒳, and we do the same for C_{j,t} for t = 1, 2, 4, 8, …, min(n−j, 2^{⌊log d⌋}) whenever j+t−1 ∈ 𝒳 and j ≥ 0. Note that these slabs do not need to be explicitly constructed and can be simulated directly using the input text.

Querying.

Given a query (i_1,j_1), (i_2,j_2), we first find h ∈ [0..d) such that (i_1+h, j_1+h) and (i_2+h, j_2+h) are diagonal sample positions. We then find y ∈ [0..x) such that i_1+y and i_2+y are in 𝒳, and find the LCEs of columns C_{i_1}[j_1..], …, C_{i_1+y−1}[j_1..] with C_{i_2}[j_2..], …, C_{i_2+y−1}[j_2..], respectively. We next find y′ ∈ [0..x) such that i_1+h−1−y′ and i_2+h−1−y′ are in 𝒳, and find the LCEs of columns C_{i_1+h−y′}[j_1..], …, C_{i_1+h−1}[j_1..] with C_{i_2+h−y′}[j_2..], …, C_{i_2+h−1}[j_2..], respectively. We then take the largest power of two, say 2^a, such that i_1+y+2^a ≤ i_1+h−1−y′, and obtain the LCE of the slab C_{i_1+y,2^a}[j_1..] with C_{i_2+y,2^a}[j_2..]. We also obtain the LCE of the slabs C_{i_1+h−y′−2^a,2^a}[j_1..] and C_{i_2+h−y′−2^a,2^a}[j_2..]. We perform a symmetric procedure on the rows. A minimum is taken over all of these LCE values as well as h + LCE((i_1+h, j_1+h), (i_2+h, j_2+h)). Let m denote this minimum. We consider two cases, as in Section 2.2.

  • m>h. In this case, m is reported as the result.

  • m ≤ h. As in Section 2.2, we want to output the largest value y such that the minimum LCE of the slabs covering C_{j_1}[i_1..], …, C_{j_1+y}[i_1..] (with LCE relative to the slabs covering C_{j_2}[i_2..], …, C_{j_2+y}[i_2..]) and R_{i_1}[j_1..], …, R_{i_1+y}[j_1..] (with LCE relative to the slabs covering R_{i_2}[j_2..], …, R_{i_2+y}[j_2..]) is at least y. The modification to the binary search algorithm from Section 2.2 is that we intermix at most x single row/column evaluations to reach the next position in 𝒳. After this position in 𝒳 is reached, the power of two that most evenly splits the remaining range can be used.

Analysis.

We claim that answering a query requires 𝒪(x log d) LCE queries on single rows/columns and 𝒪(log d) LCE queries on slabs. To see this, let S(l) be the number of single row/column LCE queries for a range of length l, and T(l) the number of slab LCE queries. Then we have

S(l) ≤ S(2^{⌈log(l/2)⌉}) + 𝒪(x) = 𝒪(x log l)
T(l) ≤ T(2^{⌈log(l/2)⌉}) + 1 = 𝒪(log l).

Hence, S(h) = 𝒪(x log d) and T(h) = 𝒪(log d). Each single row/column LCE query takes 𝒪(1) time, and each LCE query on a slab takes 𝒪(τ) time. As a result, the total query time is 𝒪(x log d + τ log d). To optimize, we keep d = Θ((log_σ n)²) and now fix x = τ = (log_σ n)^{2/3}(log log_σ n)^{2/3}, obtaining a query time of 𝒪((log_σ n)^{2/3}(log log_σ n)^{5/3}).
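For concreteness, the balancing behind this choice can be written out (assuming, as above, d = Θ((log_σ n)²) and hence log d = Θ(log log_σ n)):

```latex
% With x = \tau, the query time x\log d + \tau\log d becomes
% \Theta(\tau \log\log_\sigma n). Substituting
%   \tau = (\log_\sigma n)^{2/3} (\log\log_\sigma n)^{2/3}
% gives
\tau \log d
  = (\log_\sigma n)^{2/3} (\log\log_\sigma n)^{2/3}
    \cdot \Theta(\log\log_\sigma n)
  = \Theta\bigl((\log_\sigma n)^{2/3} (\log\log_\sigma n)^{5/3}\bigr).
```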

The extra space is 𝒪(log d · n²/(√x · τ)) words. This is because we take 𝒪(log d) slab lengths for each sampled row/column position, creating an overall string of length 𝒪(log d · n²/√x). The LCE structure from Lemma 7 then occupies 𝒪(log d · n²/(√x · τ)) words. With the above choice of x and τ, the total space is 𝒪(n² log σ) bits. This completes the proof of Theorem 1.

3 Repetition-Aware LCE Data Structure

Overview.

We use a parameter τ that we will optimize later. We aim to use a truncated suffix tree in conjunction with a sparse suffix tree on sampled positions from a τ-cover to efficiently answer LCE queries. If we truncate the 2D suffix tree at string depth τ, then the δ measure provides an upper bound of τ²δ on the number of leaves at depth τ. As we argue, one can also upper bound the number of additional leaves in the truncated suffix tree in terms of τ and n.

The first challenge in using the above ideas is that, for these LCE queries from sampled positions to provide information on the overall LCE result, the matching submatrices starting at the sampled positions should overlap. This is accomplished by using string depth 2τ for the truncated suffix tree while still using a τ-cover. The second challenge is that, given our LCE query, we need to know which leaves to consider in the truncated suffix tree; moreover, we should accomplish this in o(n²) space when δ is small. To this end, we introduce the notion of a macro-matrix M, which stores, for each position of T, the leaf of the truncated suffix tree to examine. We then relate the δ measure of this macro-matrix to the δ measure of the matrix T. This relationship enables us to use the 2D block tree data structure of Brisaboa et al. [3] on M, which occupies sublinear space for compressible matrices and supports efficient access to the elements of M.

3.1 Data Structures

Truncated Suffix Tree.

We first construct the 2D suffix tree of T truncated at string depth 2τ. Call this 𝒯_{2τ}. We use ℓ_1, ℓ_2, … to denote the leaves of 𝒯_{2τ}.

Compressed Representation of Macro-Matrix.

We next define the macro-matrix. The elements of the macro-matrix are essentially meta-symbols, where two meta-symbols are the same if and only if the corresponding 2τ×2τ square submatrices are identical. Formally, the macro-matrix M is the matrix obtained as follows: for i, j ∈ [0..n),

  • if there exists a 2τ×2τ submatrix with upper left corner (i,j), i.e., i, j ≤ n−2τ, then we set M[i][j] = ℓ, where ℓ is a pointer to the leaf of 𝒯_{2τ} corresponding to T[i..i+2τ−1][j..j+2τ−1];

  • if i > n−2τ or j > n−2τ, then we set M[i][j] = ℓ, where ℓ is a pointer to the leaf in 𝒯_{2τ} corresponding to the (n−max(i,j))×(n−max(i,j)) submatrix with upper left corner (i,j).

See Figure 5. We then construct the 2D block tree of M, denoted as 𝖡𝖳(M).
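An explicit (uncompressed) construction of M pins down the definition; here leaf pointers are simulated by canonical integer ids, one per distinct truncated submatrix (a sketch, not the actual pointer representation into 𝒯_{2τ}):

```python
def build_macro_matrix(T, tau):
    """M[i][j] identifies the square of side min(2*tau, n - max(i, j))
    with upper left corner (i, j); equal squares get equal ids, mirroring
    equal leaves of the truncated suffix tree."""
    n = len(T)
    leaf_id = {}                      # distinct submatrix -> canonical id
    M = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = min(2 * tau, n - max(i, j))
            key = tuple(T[i + r][j:j + s] for r in range(s))
            M[i][j] = leaf_id.setdefault(key, len(leaf_id))
    return M
```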

Figure 5: An example 2D text T, a truncated suffix tree with τ = 1, i.e., truncated at string depth 2τ = 2, and the resulting macro-matrix M.
Sparse Suffix Tree.

We define sample positions based on a τ-cover 𝒞 of [0..n). These consist of sample positions for the rows,

𝒞_R = {(i,j) : i ∈ 𝒞, j ∈ [0..n)},

for the columns,

𝒞_C = {(i,j) : i ∈ [0..n), j ∈ 𝒞},

and for the diagonals,

𝒞_D = {(i,j) : i, j ∈ [0..n), min(i,j) ∈ 𝒞}.

Let 𝒞* = 𝒞_R ∪ 𝒞_C ∪ 𝒞_D. Observe that |𝒞*| = Θ(n²/√τ). We build a sparse suffix tree over the suffixes starting at the sampled positions in 𝒞*, denoted 𝒯_s. We also maintain the lookup data structure from Lemma 8. As before, this allows us to find, in 𝒪(1) time, equally far sampled positions at most τ away from the queried positions in each row, column, and diagonal.

3.2 Querying

Given an LCE query (i_1,j_1), (i_2,j_2), we first use BT(M) to get the corresponding values in M. Say these correspond to the leaves ℓ_1 and ℓ_2 in 𝒯_{2τ}, respectively. If ℓ_1 ≠ ℓ_2, then the string depth of the LCA of ℓ_1 and ℓ_2 gives us the LCE of (i_1,j_1), (i_2,j_2).

If ℓ_1 = ℓ_2, then we use the lookup data structure from Lemma 8 to find:

  • h_1 ∈ [0..τ) such that (i_1+h_1, j_1) and (i_2+h_1, j_2) are sampled positions. We then use an 𝒪(1)-time query on 𝒯_s to get the LCE of (i_1+h_1, j_1) and (i_2+h_1, j_2). Denote this LCE value by L_1.

  • h_2 ∈ [0..τ) such that (i_1+h_2, j_1+h_2) and (i_2+h_2, j_2+h_2) are sampled positions. We use an 𝒪(1)-time query on 𝒯_s to get the LCE of (i_1+h_2, j_1+h_2) and (i_2+h_2, j_2+h_2). Denote this LCE value by L_2.

  • h_3 ∈ [0..τ) such that (i_1, j_1+h_3) and (i_2, j_2+h_3) are sampled positions. We use an 𝒪(1)-time query on 𝒯_s to get the LCE of (i_1, j_1+h_3) and (i_2, j_2+h_3). Denote this LCE value by L_3.

We report min(h_1+L_1, h_2+L_2, h_3+L_3) as the solution.
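The ℓ_1 = ℓ_2 case therefore costs three lookup queries and three sparse-suffix-tree queries (a sketch with hypothetical oracle names: find_h implements Lemma 8 for the row, diagonal, and column covers, and sst_lce answers an LCE query between two sampled positions via 𝒯_s):

```python
def lce_equal_leaf_case(i1, j1, i2, j2, find_h, sst_lce):
    """Combine the three shifted LCEs of Section 3.2."""
    h1 = find_h('row', i1, i2)                     # (i+h1, j) lands in C_R
    L1 = sst_lce(i1 + h1, j1, i2 + h1, j2)
    h2 = find_h('diag', min(i1, j1), min(i2, j2))  # (i+h2, j+h2) lands in C_D
    L2 = sst_lce(i1 + h2, j1 + h2, i2 + h2, j2 + h2)
    h3 = find_h('col', j1, j2)                     # (i, j+h3) lands in C_C
    L3 = sst_lce(i1, j1 + h3, i2, j2 + h3)
    return min(h1 + L1, h2 + L2, h3 + L3)
```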

3.3 Correctness

The first lemma is immediate.

Lemma 9.

When ℓ_1 ≠ ℓ_2, LCE((i_1,j_1),(i_2,j_2)) is the string depth of LCA(ℓ_1, ℓ_2).

Lemma 10.

When ℓ_1 = ℓ_2, LCE((i_1,j_1),(i_2,j_2)) = min(h_1+L_1, h_2+L_2, h_3+L_3).

Proof.

Define L ≜ LCE((i_1,j_1),(i_2,j_2)). First, we show that L ≤ min(h_1+L_1, h_2+L_2, h_3+L_3). Starting from (i_1+h_1, j_1), there exists a matching submatrix (with respect to the position (i_2+h_1, j_2)) of size at least L−h_1; thus we have L_1 ≥ L−h_1, and hence h_1+L_1 ≥ L. A similar argument holds for h_2 and h_3.

Next, we show L ≥ min(h_1+L_1, h_2+L_2, h_3+L_3).

  • We denote the submatrix T[i_1+h_1..i_1+h_1+L_1)[j_1..j_1+L_1) by T_1.

  • We denote the submatrix T[i_1+h_2..i_1+h_2+L_2)[j_1+h_2..j_1+h_2+L_2) by T_2.

  • We denote the submatrix T[i_1..i_1+L_3)[j_1+h_3..j_1+h_3+L_3) by T_3.

See Figure 6.

Figure 6: The solution LCE L is shown as a black square; submatrix T_1 is shown in red, T_2 in green, and T_3 in blue.

Observe that h_1, h_2, h_3 ≤ τ−1 and, since L ≥ 2τ, we have L_1, L_2, L_3 ≥ τ. Submatrix T_2 has its lower left corner in column j_1+h_2 ≤ j_1+L_1−1 and in row i_1+h_2+L_2−1 ≥ i_1+h_1, making it overlap with T_1. Also, T_2 has its upper right corner in column j_1+h_2+L_2−1 ≥ j_1+h_3 and row i_1+h_2 ≤ i_1+h_3+L_3−1. Hence, T_2 overlaps with T_3 as well.

Now, suppose for the sake of contradiction that h_1+L_1, h_2+L_2, h_3+L_3 > L. For any position in row x = i_1+L and column y with j_1 ≤ y ≤ j_1+L, we have

i_1 ≤ x = i_1+L ≤ i_1+h_1+L_1−1, i_1+h_2+L_2−1

and

j_1 ≤ y ≤ j_1+L ≤ j_1+h_1+L_1−1, j_1+h_2+L_2−1.

Similarly, for any position in column y = j_1+L and row x with i_1 ≤ x ≤ i_1+L, we have

j_1 ≤ y = j_1+L ≤ j_1+h_2+L_2−1, j_1+h_3+L_3−1

and

i_1 ≤ x ≤ i_1+L ≤ i_1+h_2+L_2−1, i_1+h_3+L_3−1.

Based on the above inequalities and the fact that the submatrices T_1, T_2, and T_3 overlap, the matching submatrices with upper left corners (i_1,j_1) and (i_2,j_2) can be extended further by at least one row and column. This contradicts the definition of L.

3.4 Analysis and Optimization

3.4.1 Space Analysis

Space for the τ-Cover Lookup Structure and Sparse Suffix Tree.

According to Lemma 8, the lookup structure requires 𝒪(τ) space. Since |𝒞*| = 𝒪(n²/√τ), the sparse suffix tree 𝒯_s uses 𝒪(n²/√τ) space.

Space for 𝒯_{2τ}.

The space for the truncated suffix tree 𝒯_{2τ} is bounded by the number of distinct 2τ×2τ submatrices of T, denoted d_{2τ}(T), plus the number of distinct matrices of size less than 2τ that cannot be extended further down and to the right (due to a boundary of T). There are at most 𝒪(τn) of the latter since, for every length from 1 to 2τ, at most 2n submatrices cannot be extended further. By the definition of δ, d_{2τ}(T) ≤ 4τ²·δ(T), making the space for 𝒯_{2τ} bounded by 𝒪(τ²·δ(T) + τn).

Space for Macro-Matrix.

The space for 𝖡𝖳(M) depends on δ(M). We prove the following lemma relating δ(T) and δ(M).

Lemma 11.

δ(M) = Ω(max(1, δ(T)/τ² − n/τ)) and δ(M) = 𝒪(τ²·δ(T) + τn).

Proof.

First, the lower bound. Observe that for an arbitrary t ∈ [2τ..n], two matching t×t submatrices in T yield two matching (t−2τ+1)×(t−2τ+1) submatrices in M (with the same upper left corners as the corresponding submatrices in T). In this way, every distinct t×t submatrix in T maps to one distinct (t−2τ+1)×(t−2τ+1) submatrix in M, and we have d_t(T) ≤ d_{t−2τ+1}(M). Then for t ≥ 2τ, we have

d_t(T)/t² ≤ d_{t−2τ+1}(M)/t² ≤ ((t−2τ+1)²/t²)·δ(M) ≤ δ(M)   (1)

implying d_t(T) ≤ t²·δ(M) for t ≥ 2τ.

Next, consider t ∈ [1..2τ). Note that the number of distinct t×t submatrices in T is almost upper bounded by the number of distinct (t+2τ)×(t+2τ) submatrices in T, except that some of the distinct matrices with sizes between t×t and (t+2τ)×(t+2τ) may be prevented from being extended due to the right and bottom boundaries of T. The number of such submatrices is bounded by 2n((t+2τ)−t) = 𝒪(τn). Hence, for t < 2τ,

d_t(T) ≤ d_{t+2τ}(T) + 𝒪(τn).

Applying Inequality (1), we can then write

d_t(T)/t² ≤ d_{t+2τ}(T)/t² + 𝒪(τn)/t² ≤ ((t+2τ)²/t²)·δ(M) + 𝒪(τn) = 𝒪(τ²·δ(M) + τn).

Taking the maximum over both cases yields that δ(T) = 𝒪(τ²·δ(M) + τn).

For the upper bound, we claim that, for an arbitrary t ∈ [1..n],

d_t(M) ≤ d_{t+2τ−1}(T) + 𝒪(τn),

where we take d_{t+2τ−1}(T) = 0 if t+2τ−1 > n. The above inequality follows from the fact that every distinct (t+2τ−1)×(t+2τ−1) submatrix in T maps to one distinct t×t submatrix in M. What remains to be counted for d_t(M) are the distinct t×t submatrices in M that do not result from some (t+2τ−1)×(t+2τ−1) submatrix in T, that is, submatrices on the bottom and/or right boundary. Again, the number of such t×t submatrices is bounded by 2n((t+2τ−1)−t) = 𝒪(τn).

To complete the proof, we have the bound

δ(M) = max_t d_t(M)/t² ≤ max_t (d_{t+2τ−1}(T) + 𝒪(τn))/t² ≤ max_t ((t+2τ−1)²/t²)·δ(T) + 𝒪(τn) = 𝒪(τ²·δ(T) + τn).

Let σ′ be the alphabet size of the macro-matrix M. The space for the block tree BT(M) is

𝒪((δ(M) + √(n·δ(M))) · log(n log σ′ / (δ(M) log n))).

Applying σ′ ≤ n² and Lemma 11, this space is bounded by

𝒪((τ²·δ(T) + τ·√(n·δ(T)) + τn) · log(n / max(1, δ(T)/τ² − n/τ))).
Total Space.

Summing the total data structure sizes, the combined space is

𝒪((τ²·δ(T) + τ·√(n·δ(T)) + τn) · log(n / max(1, δ(T)/τ² − n/τ)) + n²/√τ + τ).

3.4.2 Optimizing

We consider two cases based on δ(T), which we now denote simply by δ. If δ > n^{1/3}, we set τ = n^{4/5}/(2δ^{2/5}) and let β = n / max(1, 4δ^{9/5}/n^{8/5} − 2n^{1/5}δ^{2/5}). The space is (up to constant factors)

(n^{8/5}δ^{1/5} + n^{13/10}δ^{1/10} + n^{9/5}/δ^{2/5}) · log β + n^{8/5}δ^{1/5} + n^{4/5}/δ^{2/5} = 𝒪(n^{8/5}δ^{1/5} log β).

Observe that as δ approaches n2, β approaches 𝒪(1).

If δ ≤ n^{1/3}, we set τ = n^{2/3}. The resulting space complexity is

(n^{4/3}δ + n^{7/6}√δ + n^{5/3}) · log β + n^{5/3} + n^{2/3} = 𝒪(n^{5/3} log β).

For this case, the argument β of the logarithm is 𝒪(n). One can also readily check that β as defined above is bounded by the expression for β appearing in Theorem 2.

3.4.3 Query Time

The query time is dominated by the access to BT(M), which takes 𝒪(1 + log(n/δ(M))) = 𝒪(1 + log β) time, where β is defined as above. The remaining queries take 𝒪(1) time. This completes the proof of Theorem 2.

4 Applications

We next demonstrate some applications of Theorem 1 by proving Theorems 3, 4, 5.

4.1 ISA Queries

We maintain a sampled suffix array. Specifically, we sample the suffix array values of every (log_σ n)-th leaf of the suffix tree. The space required for this is 𝒪(n² log σ) bits. Additionally, for each text position, we store the distance from its leaf to the preceding sampled leaf in the suffix tree. This requires 𝒪(log log_σ n) bits per entry. The resulting total space is 𝒪(n² log σ + n² log log n) bits.

To find the ISA value of a text position (i,j), we perform a binary search on the sampled leaves to find the lexicographic predecessor of (i,j) within the sampled set. Once the predecessor is found, we add the offset associated with (i,j). This gives us the suffix array position associated with (i,j), i.e., its ISA value. The binary search requires 𝒪(log n) LCE queries. Each LCE query takes 𝒪((log_σ n)^{2/3}(log log_σ n)^{5/3}) time, resulting in an overall time complexity of 𝒪(log n (log_σ n)^{2/3}(log log_σ n)^{5/3}).
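A sketch of the predecessor search (hypothetical interfaces: suffix_leq(p, q) decides whether the suffix at p is lexicographically at most the suffix at q, using one LCE query of Theorem 1 plus one character comparison; rate is the sampling rate Θ(log_σ n)):

```python
def isa_query(pos, sampled_sa, offset, rate, suffix_leq):
    """ISA[pos] = rank of the predecessor sampled leaf + stored offset.
    Assumes sampled_sa[0] is the lexicographically smallest suffix."""
    lo, hi = 0, len(sampled_sa)       # invariant: predecessor index in [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if suffix_leq(sampled_sa[mid], pos):
            lo = mid                  # sampled suffix at mid is not larger
        else:
            hi = mid
    return lo * rate + offset[pos]
```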

4.2 SA queries

Let τ be a parameter. We divide the leaves of the suffix tree into contiguous blocks of size n²/τ (except for perhaps the last block, which can be smaller). There are Θ(τ) blocks. We associate each position in T with the block in which its leaf lies in the suffix tree. This information is stored as follows: consider a binary array B_b associated with each block b. Each binary array has length n² and represents a linearization of T. In B_b, a position holds a 1 if the corresponding suffix tree leaf is in block b, and 0 otherwise. Note that there are at most m ≤ n²/τ 1s in B_b. We build a data structure representing B_b using m log(n²/m) + 𝒪(m) bits of space, or equivalently, (n²/τ) log τ + 𝒪(n²/τ) bits of space, such that select queries can be answered in constant time [29]. The total space for the select data structures over all Θ(τ) bit vectors is n² log τ + 𝒪(n²) = 𝒪(n² log τ) bits. We also maintain the ISA data structure described previously.

Given an SA query for index i, we first identify which block i is in. Say this is block b. We use select queries to iterate through the text positions contained in block b. For each text position iterated over, we perform an ISA query and check whether its ISA value equals i.

The space required for the ISA data structure is 𝒪(n² log σ + n² log log n) bits. The space for the select data structures is 𝒪(n² log τ) bits. The query time is 𝒪((n²/τ) · log n · (log_σ n)^{2/3}(log log_σ n)^{5/3}). We obtain Theorem 4 by setting τ = (σ log n)^c, where c is an arbitrarily large constant that absorbs the additional logarithmic factors in the query time.
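A sketch of the resulting SA query (hypothetical interfaces: select(b, k) returns the k-th text position whose suffix tree leaf lies in block b, or None when exhausted; isa is the ISA query above):

```python
def sa_query(idx, block_size, select, isa):
    """SA[idx]: scan the O(n^2 / tau) positions whose leaves lie in the
    block containing index idx, and return the one mapping back to idx."""
    b = idx // block_size             # block containing leaf idx
    k = 1
    while (p := select(b, k)) is not None:
        if isa(p) == idx:
            return p
        k += 1
    return None                       # unreachable for a valid idx
```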

4.3 Pattern Matching

Counting.

In addition to the previous structures, we maintain the LCE data structure from Lemma 6 over the rows and columns. First, a binary search is performed to find the leaf of the lexicographically smallest suffix with P as a prefix (if one exists). We start by using an SA query to obtain SA[n²/2]. Using read access to the original text, we match the characters of P in Lstring order against the submatrix starting at SA[n²/2] until we reach the first mismatch. At this point, we know the lexicographic order of P relative to the current leaf. When we transition to a new leaf during the binary search, we perform an SA query followed by LCE queries against the position from the preceding leaf; this avoids repeatedly iterating over the characters of P. If the LCE is at least the length already matched, we continue matching from the last matched position. A similar binary search finds the lexicographically largest suffix with P as a prefix. We return the length of the resulting suffix range.

The total number of LCE and SA queries performed is 𝒪(log n). The time is dominated by the SA queries, which require 𝒪(n²/(σ log n)^c) time.

Reporting.

We start with the suffix range obtained previously, say [x..y]. We use the same blocking scheme over the suffix tree leaves described for SA queries, along with the constant-time select data structures. We first identify the block that x lies in, say block b. We use the select data structure to iterate through all of the text positions corresponding to suffixes in block b. For each position, we perform an ISA query and check whether its position in the suffix array is at least x; if it is, we output it. We perform a similar procedure for the block containing y, now checking whether the position in the suffix array is at most y. For the remaining blocks, those completely contained in [x..y], we use their select data structures to output all occurrences with suffixes in the block.

The space complexity is the same as that of the SA data structure. For the query time, each block has size 𝒪(n²/τ), and with τ = (σ log n)^c, the time spent on the blocks containing x and y is absorbed by the n²/(σ log n)^c term already incurred by the SA queries.

5 Open Problems

We leave open many directions for potential improvement, for example:

  • Can we design a data structure with faster SA query time that uses 𝒪(n² log σ + n² log log n) bits of space (or better)? This seems significantly harder than ISA queries. Suffix array sampling, as in the FM-index [10], is not immediately adaptable.

  • Can we design a data structure in repetition-aware compressed space that supports ISA, SA, or pattern-matching queries? Also, can the space of our data structure for LCE queries be improved? Grammar-based compression has proven useful for repetition-aware compressed data structures supporting LCE queries in the 1D case, particularly run-length straight-line programs (RL-SLPs). For 1D text, it is possible to construct RL-SLPs with size close to the δ measure [25], which can be used for LCE [27] and pattern matching queries [24]. Although Romana et al. [31] introduce a version of RL-SLPs for 2D text, it remains open how such an RL-SLP could be utilized for LCE queries and other types of queries, e.g., SA and pattern matching queries.

References

  • [1] Philip Bille, Inge Li Gørtz, Mathias Bæk Tejs Knudsen, Moshe Lewenstein, and Hjalte Wedel Vildhøj. Longest common extensions in sublinear space. In Ferdinando Cicalese, Ely Porat, and Ugo Vaccaro, editors, Combinatorial Pattern Matching - 26th Annual Symposium, CPM 2015, Ischia Island, Italy, June 29 - July 1, 2015, Proceedings, volume 9133 of Lecture Notes in Computer Science, pages 65–76. Springer, 2015. doi:10.1007/978-3-319-19929-0_6.
  • [2] Philip Bille, Inge Li Gørtz, Benjamin Sach, and Hjalte Wedel Vildhøj. Time-space trade-offs for longest common extensions. J. Discrete Algorithms, 25:42–50, 2014. doi:10.1016/J.JDA.2013.06.003.
  • [3] Nieves R. Brisaboa, Travis Gagie, Adrián Gómez-Brandón, and Gonzalo Navarro. Two-dimensional block trees. Comput. J., 67(1):391–406, 2024. doi:10.1093/COMJNL/BXAC182.
  • [4] Stefan Burkhardt and Juha Kärkkäinen. Fast lightweight suffix array construction and checking. In Ricardo A. Baeza-Yates, Edgar Chávez, and Maxime Crochemore, editors, Combinatorial Pattern Matching, 14th Annual Symposium, CPM 2003, Morelia, Michocán, Mexico, June 25-27, 2003, Proceedings, volume 2676 of Lecture Notes in Computer Science, pages 55–69. Springer, 2003. doi:10.1007/3-540-44888-8_5.
  • [5] Lorenzo Carfagna and Giovanni Manzini. Compressibility measures for two-dimensional data. In Franco Maria Nardini, Nadia Pisanti, and Rossano Venturini, editors, String Processing and Information Retrieval - 30th International Symposium, SPIRE 2023, Pisa, Italy, September 26-28, 2023, Proceedings, volume 14240 of Lecture Notes in Computer Science, pages 102–113. Springer, 2023. doi:10.1007/978-3-031-43980-3_9.
  • [6] Lorenzo Carfagna and Giovanni Manzini. The landscape of compressibility measures for two-dimensional data. IEEE Access, 12:87268–87283, 2024. doi:10.1109/ACCESS.2024.3417621.
  • [7] Anders Roy Christiansen, Mikko Berggren Ettienne, Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Optimal-time dictionary-compressed indexes. ACM Trans. Algorithms, 17(1):8:1–8:39, 2021. doi:10.1145/3426473.
  • [8] Charles J. Colbourn and Alan C. H. Ling. Quorums from difference covers. Inf. Process. Lett., 75(1-2):9–12, 2000. doi:10.1016/S0020-0190(00)00080-6.
  • [9] Martin Farach. Optimal suffix tree construction with large alphabets. In 38th Annual Symposium on Foundations of Computer Science, FOCS ’97, Miami Beach, Florida, USA, October 19-22, 1997, pages 137–143. IEEE Computer Society, 1997. doi:10.1109/SFCS.1997.646102.
  • [10] Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In 41st Annual Symposium on Foundations of Computer Science, FOCS 2000, 12-14 November 2000, Redondo Beach, California, USA, pages 390–398. IEEE Computer Society, 2000. doi:10.1109/SFCS.2000.892127.
  • [11] Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM, 67(1):2:1–2:54, 2020. doi:10.1145/3375890.
  • [12] Arnab Ganguly, Dhrumil Patel, Rahul Shah, and Sharma V. Thankachan. LF successor: Compact space indexing for order-isomorphic pattern matching. In Nikhil Bansal, Emanuela Merelli, and James Worrell, editors, 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021, July 12-16, 2021, Glasgow, Scotland (Virtual Conference), volume 198 of LIPIcs, pages 71:1–71:19. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.ICALP.2021.71.
  • [13] Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan. pbwt: Achieving succinct data structures for parameterized pattern matching and related problems. In Philip N. Klein, editor, Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, Barcelona, Spain, Hotel Porta Fira, January 16-19, pages 397–407. SIAM, 2017. doi:10.1137/1.9781611974782.25.
  • [14] Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan. Fully functional parameterized suffix trees in compact space. In Mikolaj Bojanczyk, Emanuela Merelli, and David P. Woodruff, editors, 49th International Colloquium on Automata, Languages, and Programming, ICALP 2022, July 4-8, 2022, Paris, France, volume 229 of LIPIcs, pages 65:1–65:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022. doi:10.4230/LIPICS.ICALP.2022.65.
  • [15] Pawel Gawrychowski, Tomasz Kociumaka, Wojciech Rytter, and Tomasz Walen. Faster longest common extension queries in strings over general alphabets. In Roberto Grossi and Moshe Lewenstein, editors, 27th Annual Symposium on Combinatorial Pattern Matching, CPM 2016, June 27-29, 2016, Tel Aviv, Israel, volume 54 of LIPIcs, pages 5:1–5:13. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2016. doi:10.4230/LIPIcs.CPM.2016.5.
  • [16] Raffaele Giancarlo. A generalization of the suffix tree to square matrices, with applications. SIAM J. Comput., 24(3):520–562, 1995. doi:10.1137/S0097539792231982.
  • [17] Gaston H Gonnet. Efficient searching of text and pictures. UW Centre for the New Oxford English Dictionary, 1990.
  • [18] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput., 35(2):378–407, 2005. doi:10.1137/S0097539702402354.
  • [19] Dominik Kempa and Tomasz Kociumaka. String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure. In Moses Charikar and Edith Cohen, editors, Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 756–767. ACM, 2019. doi:10.1145/3313276.3316368.
  • [20] Dominik Kempa and Tomasz Kociumaka. Resolution of the Burrows-Wheeler transform conjecture. Commun. ACM, 65(6):91–98, 2022. doi:10.1145/3531445.
  • [21] Dominik Kempa and Tomasz Kociumaka. Collapsing the hierarchy of compressed data structures: Suffix arrays in optimal compressed space. In 64th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2023, Santa Cruz, CA, USA, November 6-9, 2023, pages 1877–1886. IEEE, 2023. doi:10.1109/FOCS57990.2023.00114.
  • [22] Dong Kyue Kim, Yoo Ah Kim, and Kunsoo Park. Generalizations of suffix arrays to multi-dimensional matrices. Theor. Comput. Sci., 302(1-3):223–238, 2003. doi:10.1016/S0304-3975(02)00828-9.
  • [23] Dong Kyue Kim, Joong Chae Na, Jeong Seop Sim, and Kunsoo Park. Linear-time construction of two-dimensional suffix trees. Algorithmica, 59(2):269–297, 2011. doi:10.1007/S00453-009-9350-Z.
  • [24] Tomasz Kociumaka, Gonzalo Navarro, and Francisco Olivares. Near-optimal search time in δ-optimal space, and vice versa. Algorithmica, 86(4):1031–1056, 2024. doi:10.1007/S00453-023-01186-0.
  • [25] Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Toward a definitive compressibility measure for repetitive sequences. IEEE Trans. Inf. Theory, 69(4):2074–2092, 2023. doi:10.1109/TIT.2022.3224382.
  • [26] Gonzalo Navarro. Indexing highly repetitive string collections, part I: repetitiveness measures. ACM Comput. Surv., 54(2):29:1–29:31, 2022. doi:10.1145/3434399.
  • [27] Takaaki Nishimoto, Tomohiro I, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Fully dynamic data structure for LCE queries in compressed space. In Piotr Faliszewski, Anca Muscholl, and Rolf Niedermeier, editors, 41st International Symposium on Mathematical Foundations of Computer Science, MFCS 2016, August 22-26, 2016 - Kraków, Poland, volume 58 of LIPIcs, pages 72:1–72:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2016. doi:10.4230/LIPICS.MFCS.2016.72.
  • [28] Dhrumil Patel and Rahul Shah. Inverse suffix array queries for 2-dimensional pattern matching in near-compact space. In Hee-Kap Ahn and Kunihiko Sadakane, editors, 32nd International Symposium on Algorithms and Computation, ISAAC 2021, December 6-8, 2021, Fukuoka, Japan, volume 212 of LIPIcs, pages 60:1–60:14. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.ISAAC.2021.60.
  • [29] Rajeev Raman, Venkatesh Raman, and Srinivasa Rao Satti. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms, 3(4):43, 2007. doi:10.1145/1290672.1290680.
  • [30] Sofya Raskhodnikova, Dana Ron, Ronitt Rubinfeld, and Adam D. Smith. Sublinear algorithms for approximating string compressibility. Algorithmica, 65(3):685–709, 2013. doi:10.1007/S00453-012-9618-6.
  • [31] Giuseppe Romana, Marinella Sciortino, and Cristian Urbina. Exploring repetitiveness measures for two-dimensional strings, 2024. doi:10.48550/arXiv.2404.07030.
  • [32] Kunihiko Sadakane. Succinct representations of LCP information and improvements in the compressed suffix arrays. In David Eppstein, editor, Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 6-8, 2002, San Francisco, CA, USA, pages 225–232. ACM/SIAM, 2002. URL: http://dl.acm.org/citation.cfm?id=545381.545410.
  • [33] Sharma V. Thankachan. Compact text indexing for advanced pattern matching problems: Parameterized, order-isomorphic, 2d, etc. (invited talk). In Hideo Bannai and Jan Holub, editors, 33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022, June 27-29, 2022, Prague, Czech Republic, volume 223 of LIPIcs, pages 3:1–3:3. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022. doi:10.4230/LIPICS.CPM.2022.3.
  • [34] Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages 1–11. IEEE Computer Society, 1973. doi:10.1109/SWAT.1973.13.