
Two-Dimensional Longest Common Extension Queries in Compact Space

Arnab Ganguly, University of Wisconsin, Whitewater, WI, USA · Daniel Gibney, University of Texas at Dallas, TX, USA · Rahul Shah, Louisiana State University, Baton Rouge, LA, USA · Sharma V. Thankachan, North Carolina State University, Raleigh, NC, USA
Abstract

For a length-n text over an alphabet of size σ, we can encode the suffix tree data structure in 𝒪(n log σ) bits of space. It supports suffix array (SA), inverse suffix array (ISA), and longest common extension (LCE) queries in 𝒪(log_σ^{ε} n) time, which enables efficient pattern matching; here ε > 0 is an arbitrarily small constant. Further improvements are possible for LCE queries, where 𝒪(1)-time queries can be achieved using an index of 𝒪(n log σ) bits. However, compactly indexing a two-dimensional text (i.e., an n×n matrix) has been a major open problem. We show progress in this direction by first presenting an 𝒪(n² log σ)-bit structure supporting LCE queries in near 𝒪((log_σ n)^{2/3}) time. We then present an 𝒪(n² log σ + n² log log n)-bit structure supporting ISA queries in near 𝒪(log n (log_σ n)^{2/3}) time. Within a similar space, achieving SA queries in poly-logarithmic (even strongly sub-linear) time is a significant challenge. However, our 𝒪(n² log σ + n² log log n)-bit structure can support SA queries in 𝒪(n²/(σ log n)^c) time, where c is an arbitrarily large constant, which enables pattern matching in time faster than what is possible without preprocessing.

We then design a repetition-aware data structure. The δ_{2D} compressibility measure for two-dimensional texts was recently introduced by Carfagna and Manzini [SPIRE 2023]. The measure ranges from 1 to n², with a smaller δ_{2D} indicating a highly compressible two-dimensional text. The currently known data structure utilizing δ_{2D} allows only element access. We obtain the first structure based on δ_{2D} for LCE queries. It takes 𝒪̃(n^{5/3} + n^{8/5}·δ_{2D}^{1/5}) space and answers queries in 𝒪(log n) time.

Keywords and phrases:
String matching, text indexing, two-dimensional text
Copyright and License:
© Arnab Ganguly, Daniel Gibney, Rahul Shah, and Sharma V. Thankachan; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Pattern matching
Funding:
Supported by the US National Science Foundation (NSF) under Grant Numbers 2315822 (S Thankachan) and 2137057 (R Shah).
Editors:
Olaf Beyersdorff, Michał Pilipczuk, Elaine Pimentel, and Nguyễn Kim Thắng

1 Introduction

A two-dimensional text T[0..n)[0..n) can be viewed as an n×n matrix, where each entry is a character from an alphabet set Σ of size σ. Data structures for two-dimensional texts have been studied for decades. In particular, there has been extensive work on generalizing suffix trees [16, 17, 23] and suffix arrays [16, 22] to 2D texts. These data structures, although capable of answering most queries in optimal (or near optimal) time, require 𝒪(n²) words, or 𝒪(n² log n) bits, of space.

On the other hand, in the case of 1D texts of length n, there exist data structures with the same functionality as suffix trees/arrays but requiring only 𝒪(n log σ) bits of space [18, 32], or even less when the text is compressible [11, 21]. This is true even for some variants of suffix trees, such as parameterized [14, 13] and order-isomorphic [12] suffix trees [33]. The query times of these space-efficient versions are often polylogarithmic, with the exception of LCE queries, for which Kempa and Kociumaka demonstrated that the query time can be made constant [19]. For 2D texts, the only known result in this direction is a data structure by Patel and Shah that uses 𝒪(n² log σ + n² log log n) bits and supports inverse suffix array (ISA) queries in 𝒪(log⁴ n/(log log n)³) time [28]. In this work, we make further progress in this direction. In particular, we focus on space-efficient data structures for longest common extension (LCE) queries in the 2D setting. The problem is formally defined as follows:

Problem 1 (2D LCE).

Preprocess a 2D text T[0..n)[0..n) over an alphabet Σ of size σ into a data structure that can answer 2D LCE queries efficiently. A 2D LCE query consists of points (i_1, j_1), (i_2, j_2) and asks to return the largest L such that T[i_1..i_1+L)[j_1..j_1+L) and T[i_2..i_2+L)[j_2..j_2+L) are matching square submatrices of T.

A 2D suffix tree of size 𝒪(n² log n) bits can answer LCE queries in constant time. Our first result is an LCE data structure that occupies 𝒪(n² log σ) bits of space.

Theorem 1.

By maintaining an 𝒪(n² log σ)-bit data structure, we can answer 2D LCE queries in 𝒪((log_σ n)^{2/3}(log log_σ n)^{5/3}) time.

Turning now to highly compressible 2D texts, we consider repetition-aware compression measures. The δ measure is an important and well-studied compressibility measure for 1D texts [26]. Only recently has it been extended to 2D texts by Carfagna and Manzini via the δ_{2D} measure [5]. They demonstrate that the data structure of Brisaboa et al. [3] occupies 𝒪((δ_{2D} + √(n·δ_{2D})) log(n log σ/(δ_{2D} log n))) space. However, this data structure only supports access to the elements of T. We provide the first repetition-aware data structure supporting the more advanced LCE queries. Note that the measure δ_{2D} ranges from 1 to n², with a smaller δ_{2D} value implying higher compressibility.

Theorem 2.

By maintaining an 𝒪((n^{5/3} + n^{8/5}·δ_{2D}^{1/5}) log β)-word data structure, we can answer 2D LCE queries in 𝒪(1 + log β) time, where β is always 𝒪(n) and goes to 𝒪(1) as δ_{2D} approaches n². In particular,

β = n if δ_{2D} < n^{9/5}, and β = n^{9/5}/δ_{2D}^{9/10} if δ_{2D} ≥ n^{9/5}.

When δ_{2D} = Θ(n²), our data structure takes 𝒪(n²) words of space and answers LCE queries in 𝒪(1) time. When δ_{2D} = o(n²), the space becomes o(n²) and LCE queries are answered in logarithmic time. Our approach builds on many of the same techniques as our compact index but also introduces a matrix representation of the leaves of a truncated suffix tree. We call this a macro-matrix. We prove that if the original 2D text is compressible, then this macro-matrix remains compressible for appropriately chosen parameters. This is then combined with the data structure of Brisaboa et al. [3] to achieve Theorem 2.

As the first steps towards obtaining the other functionalities of the suffix tree, we apply our 2D LCE query structure from Theorem 1 to get the following results. Definitions of suffix array (SA) and inverse suffix array (ISA) are deferred to Section 1.1.

The following theorem significantly improves on the results by Patel and Shah [28].

Theorem 3 (2D ISA queries).

By maintaining an 𝒪(n² log σ + n² log log n)-bit data structure, we can answer inverse suffix array queries in 𝒪(log n (log_σ n)^{2/3}(log log_σ n)^{5/3}) time.

We also provide the first known results regarding a nearly compact index for 2D suffix array queries.

Theorem 4 (2D SA queries).

By maintaining an 𝒪(n² log σ + n² log log n)-bit data structure, we can answer suffix array (SA) queries in 𝒪(n²/(σ log n)^c) time, where c is an arbitrarily large constant fixed at the time of construction.

A fundamental problem is to find all submatrices of T that match a given square pattern P[0..m)[0..m). After building the 2D suffix tree, given P as a query, the number of occurrences of P (denoted by occ) can be obtained in 𝒪(m²) time, and all occurrences can be reported in 𝒪(m² + occ) time. Our result, which uses a smaller index, is the following.

Theorem 5 (PM queries).

By maintaining an 𝒪(n² log σ + n² log log n)-bit data structure, we can count the occurrences of an m×m query pattern in 𝒪(m² + n²/(σ log n)^c) time and report all occurrences in 𝒪(m² + occ + n²/(σ log n)^c) time, where c is an arbitrarily large constant fixed at the time of construction.

Although the time complexities in Theorems 4 and 5 are far from satisfactory, these are the first results demonstrating that subquadratic query times for 2D SA and PM queries are possible in compact space.

1.1 Preliminaries

Notation and Strings.

We denote the interval i, i+1, …, j by [i..j] and the interval i, i+1, …, j−1 by [i..j). For a string S of length n, we use S[i] to refer to its i-th character, i ∈ [0..n). We use S_1 S_2 to denote the concatenation of two strings S_1 and S_2. For notation, S[i..j] = S[i] S[i+1] ⋯ S[j], S[i..j) = S[i] S[i+1] ⋯ S[j−1], and S[i..] = S[i..n). Arrays and strings are zero-indexed throughout this work.

For a single string S[0..n) and i, j ∈ [0..n), LCE(i,j) is defined as the length of the longest common prefix of S[i..] and S[j..]. In the case of two strings, S_1[0..n_1) and S_2[0..n_2), we overload the notation so that for i ∈ [0..n_1), j ∈ [0..n_2), LCE(i,j) is the length of the longest common prefix of S_1[i..] and S_2[j..]. For a given string S, the suffix tree [34] is a compact trie over all suffixes of S with leaves ordered according to the lexicographic rank of the corresponding suffixes. The classical suffix tree takes 𝒪(n) words of space and can be constructed in 𝒪(n) time for polynomially sized integer alphabets [9]. The suffix array SA[0..n) of a string S[0..n) is the unique array such that S[SA[i]..] is the i-th smallest suffix lexicographically. The inverse suffix array ISA[0..n) is the unique array such that ISA[SA[i]] = i; equivalently, ISA[i] gives the lexicographic rank of S[i..]. The suffix tree can answer LCE queries in 𝒪(1) time. We call a compact trie with lexicographically ordered leaves for a subset of suffixes a sparse suffix tree. Observe that the number of nodes in a sparse suffix tree remains proportional to the number of suffixes it is built from.

We will utilize the following result by Kempa and Kociumaka, which provides an LCE data structure smaller than a classical suffix tree.

Lemma 6 ([19]).

1D LCE queries on a text S[0..n) over an alphabet set Σ = [0..σ) can be answered in 𝒪(1) time by maintaining a data structure of size 𝒪(n log σ) bits.

The next result by Bille et al. allows for a trade-off between space and query time. We will utilize it in Section 2.2.

Lemma 7 ([1]).

Suppose we have read-only access to the text S[0..n), such that we can determine the lexicographic order of any two of its characters in constant time. Then we can answer 1D LCE queries on S in 𝒪(τ) time by maintaining an auxiliary structure of 𝒪(n/τ) words, where 1 ≤ τ ≤ n is a parameter fixed at the time of construction.

d-Covers.

A d-cover of an interval [0..n) is a subset of positions, denoted by 𝒞, such that for any x, y ∈ [0..n−d) there exists h ∈ [0..d) where x+h, y+h ∈ 𝒞. It was shown by Burkhardt and Kärkkäinen that there exists a d-cover of size 𝒪(n/√d) that can be computed in 𝒪(n/√d) time [4]. d-Covers have been used previously for LCE queries in the 1D case by Gawrychowski et al. [15] and Bille et al. [2]. Since we need a small data structure that lets us find an h value as described above in constant time, we briefly outline the construction given in [4].

A difference cover modulo d is a subset 𝒟 ⊆ {0, 1, …, d−1} such that for all w ∈ {0, 1, …, d−1} there exist u, v ∈ 𝒟 with w ≡ u − v (mod d). Colbourn and Ling showed that there exists such a 𝒟 with |𝒟| = Θ(√d) [8]. A d-cover 𝒞 is constructed from a difference cover 𝒟 as follows: for j ∈ [0..n), if (j mod d) ∈ 𝒟, then j is added to 𝒞. We also build a look-up table A of size d such that, for all i ∈ {0, 1, …, d−1}, both A[i] and (A[i]+i) mod d are in 𝒟. This is always possible, thanks to the definition of the difference cover. See Figure 1.

Figure 1: An example d-cover for n = 12 and d = 7. Here the difference cover used is 𝒟 = {1, 2, 4}, resulting in the d-cover 𝒞 = {1, 2, 4, 8, 9, 11} and the lookup table A = [1, 1, 2, 1, 4, 4, 2]. For the positions x = 3 and y = 6, we have h = (A[(6−3) mod 7] − 3) mod 7 = 5. Observe that 3+5, 6+5 ∈ 𝒞.
Lemma 8 ([4]).

For a d-cover 𝒞 of an interval [0..n), there exists a data structure of size 𝒪(d) that, given x, y ∈ [0..n−d), outputs in 𝒪(1) time an h ∈ [0..d) such that x+h, y+h ∈ 𝒞.

Proof.

We maintain the 𝒪(d)-space look-up table A as described above. We assume, without loss of generality, that y ≥ x. Let h = (A[(y−x) mod d] − x) mod d. Observe that

x + h ≡ A[(y−x) mod d] (mod d).

Hence, ((x+h) mod d) ∈ 𝒟 and x+h ∈ 𝒞. Also,

y + h ≡ A[(y−x) mod d] + (y−x) (mod d).

Hence, ((y+h) mod d) ∈ 𝒟 and y+h ∈ 𝒞.
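To make the construction concrete, the following sketch builds the d-cover and lookup table from a given difference cover and answers the query of Lemma 8. The names build_cover and h_query are ours; the difference cover 𝒟 itself is assumed to be supplied, e.g., by the construction of Colbourn and Ling [8].

```python
def build_cover(D, d, n):
    """Build the d-cover C = {j in [0..n) : j mod d in D} and the lookup
    table A, where A[i] is in D and (A[i] + i) mod d is in D for all i."""
    Dset = set(D)
    C = sorted(j for j in range(n) if j % d in Dset)
    # Existence of A[i] is guaranteed by the difference-cover property:
    # every i is a difference u - v (mod d) of two elements of D.
    A = [next(v for v in D if (v + i) % d in Dset) for i in range(d)]
    return C, A

def h_query(A, d, x, y):
    """Lemma 8: return h in [0..d) with x+h and y+h both in the cover
    (for x, y in [0..n-d))."""
    x, y = min(x, y), max(x, y)          # assume w.l.o.g. y >= x
    return (A[(y - x) % d] - x) % d

# Reproducing Figure 1: D = {1,2,4}, d = 7, n = 12.
C, A = build_cover([1, 2, 4], 7, 12)     # C = [1,2,4,8,9,11], A = [1,1,2,1,4,4,2]
assert h_query(A, 7, 3, 6) == 5          # 3+5 and 6+5 are both in C
```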

2D Suffix Trees and 2D Suffix Arrays.

We utilize the generalization of suffix trees to 2D texts presented by Giancarlo [16]. This suffix tree is built from the Lstrings of the 2D text T. Lstrings are over the alphabet ∪_{i=1}^{n} Σ^{2i−1}. For a position (i,j) ∈ [0..n)², the suffix T[i..][j..] is a_0 a_1 ⋯ a_l, where l = n − max(i,j) − 1, a_0 = T[i][j], and a_k = T[i+k][j..j+k) ∘ T[i..i+k][j+k] for k > 0. See Figure 2. The characters are maintained implicitly as references to T, resulting in the 2D suffix tree over all suffixes T[i..][j..], (i,j) ∈ [0..n)², occupying 𝒪(n²) words of space. Once constructed, the 2D suffix tree allows us to find the LCE of two positions in 𝒪(1) time through a lowest common ancestor (LCA) query. The 2D suffix tree also enables pattern matching in optimal 𝒪(m² + occ) time.

The order between two characters a and a′ of Lstrings is defined as the lexicographic order induced by the base alphabet Σ. The lexicographic order of two Lstrings (and corresponding submatrices) is induced by the order of their characters. We additionally assume that the bottom row and rightmost column of T consist only of a $ symbol, which is the smallest in the alphabet order and occurs nowhere else in T.

suffix starting at (0,0): a ∘ aab ∘ bbbba ∘ bcabcab ∘ $$$$$$$$$
suffix starting at (0,1): a ∘ bbb ∘ babca ∘ cab$$$$
suffix starting at (1,0): a ∘ bbb ∘ bcbaa ∘ $$$cab$
Figure 2: An example 2D text and the suffixes starting at positions (0,0), (0,1), and (1,0). The "∘" denotes concatenation, and consecutive symbols without "∘" between them are treated as a single character.
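The decomposition of a suffix into its Lstring characters can be sketched directly from the definition above (lstring_suffix is an illustrative name; T is assumed to be a list of equal-length row strings):

```python
def lstring_suffix(T, i, j):
    """Lstring characters a_0, a_1, ... of the suffix of T at (i, j):
    a_0 = T[i][j] and, for k > 0, a_k is the row piece T[i+k][j..j+k)
    followed by the column piece T[i..i+k][j+k] (2k+1 symbols in total)."""
    n = len(T)
    chars = [T[i][j]]
    for k in range(1, n - max(i, j)):
        row_part = T[i + k][j:j + k]                                  # T[i+k][j..j+k)
        col_part = ''.join(T[r][j + k] for r in range(i, i + k + 1))  # T[i..i+k][j+k]
        chars.append(row_part + col_part)
    return chars
```

Joining the returned characters with "∘" reproduces the suffixes listed in Figure 2.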

The suffix array SA[0..n²) of a 2D text T[0..n)[0..n) is an array containing 2D points such that if (i,j) = SA[h], then T[i..][j..] is the h-th smallest suffix lexicographically. The inverse suffix array maps each (i,j) ∈ [0..n)² to its position in SA, i.e., ISA[SA[h]] = h.

The δ_{2D} Measure and 2D Block Trees.

The δ measure is a well-studied compressibility measure for 1D texts [7, 20, 24, 25, 30]. It is defined as δ(T) = max_{1≤t≤n} d_t(T)/t, where d_t(T) denotes the number of distinct length-t substrings of T[0..n).

Carfagna and Manzini recently generalized the δ measure to 2D texts [5, 6]. Letting d_t(T) denote the number of distinct t×t submatrices of T[0..n)[0..n), δ_{2D}(T) = max_{1≤t≤n} d_t(T)/t². Observe that δ_{2D}(T) can range between 1, e.g., in the case where all elements of T are the same character, and n², i.e., the case where all elements of T are distinct. Carfagna and Manzini showed that the 2D block tree data structure of Brisaboa et al. [3] occupies 𝒪((δ_{2D}(T) + √(n·δ_{2D}(T))) log(n log σ/(δ_{2D} log n))) words of space and provides access to any entry of T in 𝒪(1 + log(n log σ/(δ_{2D} log n))) time. A further generalization of the δ measure to 2D, allowing for non-square matrices, was introduced by Romana et al. and related to other potential 2D compressibility measures [31]. In this work, we will only consider the δ_{2D} measure based on square submatrices. We hereafter refer to δ_{2D} as δ and omit the text T when it is clear from context.
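As an illustration of the definition (and not an efficient algorithm), δ_{2D} can be computed by collecting the distinct t×t submatrices for every t:

```python
def delta_2d(T):
    """Brute-force delta_2D(T) = max over t in [1..n] of d_t(T)/t^2,
    where d_t(T) counts the distinct t x t submatrices of T
    (T is a list of equal-length row strings)."""
    n = len(T)
    best = 0.0
    for t in range(1, n + 1):
        distinct = {tuple(T[i + r][j:j + t] for r in range(t))
                    for i in range(n - t + 1) for j in range(n - t + 1)}
        best = max(best, len(distinct) / (t * t))
    return best
```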

2 Compact Data Structures for 2D LCE Queries

We start with some definitions. Let R_i denote the i-th row and C_j denote the j-th column of our 2D text T, where 0 ≤ i, j < n. Specifically, R_i[0..n) (resp., C_j[0..n)) is a text of length n over the alphabet Σ whose k-th character is T[i][k] (resp., T[k][j]), where k ∈ [0..n).

We define a set of sampled positions on the diagonals of T, that is, on the diagonals T[n−1][0]; T[n−2][0] ⋯ T[n−1][1]; …; T[0][n−2] ⋯ T[1][n−1]; T[0][n−1], using a d-cover with d = Θ((log_σ n)²). This is obtained by taking a d-cover 𝒞 for [0..n) and using it to define sample positions starting from the top left of each diagonal. Formally, the sample positions are

𝒞_D = {(i,j) : i, j ∈ [0..n), min(i,j) ∈ 𝒞}.

See Figure 3.
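Concretely, the sampled set is generated from the one-dimensional cover as follows (a small sketch; C is the d-cover of [0..n) described above):

```python
def diagonal_samples(C, n):
    """C_D = {(i, j) : min(i, j) in C}: on every diagonal, the sampled
    offsets (measured from the diagonal's top-left endpoint) lie in C."""
    Cset = set(C)
    return {(i, j) for i in range(n) for j in range(n) if min(i, j) in Cset}
```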

We maintain a sparse suffix tree over the suffixes starting from these sampled positions. As this is a compact trie with |𝒞_D| = 𝒪(n²/√d) leaves, the space required for this sparse suffix tree is 𝒪(n²/√d) words. By our above choice of d, this is 𝒪(n² log σ) bits. Using this sparse suffix tree, we can obtain the LCE of any two sampled positions in 𝒪(1) time.

Figure 3: An example 7-cover 𝒞 = {1, 2, 4} used for the diagonal sample positions of a 7×7 text. Note that this value d = 7 is for illustrative purposes. Sample positions are marked in the figure.

Additionally, we maintain the data structure from Lemma 6 for the concatenation of the columns C_0, …, C_{n−1} and of the rows R_0, …, R_{n−1}, which adds another 𝒪(n² log σ) bits. This allows us to find the LCE between R_i[x..] and R_j[y..] (or C_i[x..] and C_j[y..]) in 𝒪(1) time. We take the minimum of the LCE value and min(n−x, n−y) to prevent common prefixes from crossing row or column boundaries.
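For example, a row LCE reduces to one query on the concatenated text plus a clamp (a sketch; lce1d stands for the Lemma 6 oracle over the concatenation R_0 R_1 ⋯ R_{n−1}):

```python
def row_lce(lce1d, n, i, x, j, y):
    """LCE of R_i[x..] and R_j[y..]: query the concatenation of all rows
    at the corresponding offsets, then clamp so that the reported match
    cannot run past the end of either row."""
    raw = lce1d(i * n + x, j * n + y)
    return min(raw, n - x, n - y)
```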

In what follows, we first present a simple preliminary solution. We then develop these ideas further with two refinements that lead us to Theorem 1. The components defined above (sparse suffix tree from diagonal samples and LCE data structures for concatenated rows and columns) are used in all three solutions.

2.1 Achieving 𝒪((log_σ n)²) Query Time

To answer an LCE query (i_1,j_1), (i_2,j_2), we use the look-up structure discussed in Lemma 8 to obtain an h ∈ [0..d) such that (i_1+h, j_1+h) and (i_2+h, j_2+h) are sampled diagonal positions. For convenience, in the case where no such h exists in the look-up structure, because either (i_1,j_1) or (i_2,j_2) is near the boundary of T, we take h to be one less than the minimum diagonal offset to a boundary of T. We first obtain LCE((i_1+h, j_1+h), (i_2+h, j_2+h)) in 𝒪(1) time. Next, for k ∈ [0..h), we compute the LCEs between R_{i_1+k}[j_1..] and R_{i_2+k}[j_2..], and between C_{j_1+k}[i_1..] and C_{j_2+k}[i_2..]. While iterating from k = 1 to k = h−1, if for some k either the LCE between R_{i_1+k}[j_1..] and R_{i_2+k}[j_2..] or between C_{j_1+k}[i_1..] and C_{j_2+k}[i_2..] becomes less than k, we output k−1. Otherwise, we output the minimum over h + LCE((i_1+h, j_1+h), (i_2+h, j_2+h)) and all of the LCE values computed for the rows and columns specified above.

Only one constant-time query for a diagonal sampled position is required, and the number of 1D LCE queries needed is at most 2d. Since d = Θ((log_σ n)²) and each 1D LCE query takes 𝒪(1) time, the total time is 𝒪((log_σ n)²).
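A sketch of this preliminary query follows, under the assumption that cover_h (Lemma 8), lce_diag (sparse suffix tree), and row_lce/col_lce (Lemma 6, with the boundary clamp above folded in) are available as 𝒪(1)-time oracles. Instead of the early exit described above, the sketch caps the contribution of each offset k at max(k, ·), which yields the same minimum:

```python
def lce_2d_basic(i1, j1, i2, j2, cover_h, lce_diag, row_lce, col_lce):
    """Preliminary 2D LCE query of Section 2.1: one diagonal-sample query
    plus at most 2d one-dimensional LCE queries."""
    h = cover_h(i1, j1, i2, j2)          # (i1+h, j1+h), (i2+h, j2+h) sampled
    best = h + lce_diag(i1 + h, j1 + h, i2 + h, j2 + h)
    for k in range(h):
        r = row_lce(i1 + k, j1, i2 + k, j2)   # rows R_{i1+k}, R_{i2+k}
        c = col_lce(j1 + k, i1, j2 + k, i2)   # columns C_{j1+k}, C_{j2+k}
        # A mismatch at offset k only constrains squares of side > k,
        # so this offset contributes max(k, min(r, c)) to the minimum.
        best = min(best, max(k, min(r, c)))
    return best
```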

2.2 Achieving 𝒪(log_σ n (log log_σ n)²) Query Time

First, we define R_{i,t} and C_{j,t}. These are texts of length n over the alphabet Σ^t, where 0 ≤ i, j and i+t−1, j+t−1 < n. The k-th characters of R_{i,t} and C_{j,t} are length-t strings over Σ defined as follows:

R_{i,t}[k] = R_i[k] R_{i+1}[k] ⋯ R_{i+t−1}[k]
C_{j,t}[k] = C_j[k] C_{j+1}[k] ⋯ C_{j+t−1}[k].

We call these meta characters. We also call Ri,t and Cj,t slabs of length t. Applying the structure from Lemma 6 over the concatenation of rows and the concatenation of columns, we can compare two meta characters in 𝒪(1) time.

Data Structure.

In addition to the previous components, we maintain the structure from Lemma 7 over the text obtained by concatenating R_{i,t} for i ∈ [0..n) and t = 1, 2, 4, 8, …, min(n−i, 2^{⌊log d⌋}). We also maintain the structure from Lemma 7 over the text obtained by concatenating C_{j,t} for j ∈ [0..n) and t = 1, 2, 4, 8, …, min(n−j, 2^{⌊log d⌋}). We leave the parameter τ appearing in Lemma 7 to be optimized later.

Querying.
Figure 4: The two cases (a) and (b) considered when querying. The actual LCE is shown as a black square, LCE((i_1+h, j_1+h), (i_2+h, j_2+h)) as a green square, and the LCEs of slabs in red and blue. Further binary search is necessary in case (b).

Given an LCE query (i_1,j_1), (i_2,j_2), we first find an h ∈ [0..d) such that (i_1+h, j_1+h) and (i_2+h, j_2+h) are sampled positions. We then decompose the intervals [i_1..i_1+h) and [j_1..j_1+h) into 𝒪(log d) slabs whose lengths are powers of two. We perform an LCE query for each corresponding slab for both rows and columns. A minimum is taken over all these LCE values and h + LCE((i_1+h, j_1+h), (i_2+h, j_2+h)). Denote this minimum by m. There are two possible cases.

  • m>h. See Figure 4(a). In this case, m is reported as the result.

  • m ≤ h. See Figure 4(b). In this case, we still need to find the largest value y such that the minimum LCE of the slabs covering C_{j_1}[i_1..], …, C_{j_1+y}[i_1..] (with the slabs covering C_{j_2}[i_2..], …, C_{j_2+y}[i_2..], respectively) and R_{i_1}[j_1..], …, R_{i_1+y}[j_1..] (with the slabs covering R_{i_2}[j_2..], …, R_{i_2+y}[j_2..], respectively) is at least y. To accomplish this, we perform a modified binary search while keeping track of the minimum LCE values for both the column and row slabs. The only difference compared to a standard binary search is that, rather than always dividing the current range under consideration in half, we recurse on the power of two closest to half the size of the current range. This is done to ensure that we always use slabs for which we have prepared LCE data structures.

Analysis.

Letting T(l) be the number of LCE queries on slabs for the binary search on a range of length l, the resulting recurrence is

T(l) ≤ T(2^{⌈log(l/2)⌉}) + 1 = 𝒪(log l).

Hence, T(h) = 𝒪(log d). We now fix τ = log_σ n · log log_σ n. Since each LCE query on a slab takes 𝒪(τ) time, the overall query time is 𝒪(τ log d) = 𝒪(log_σ n (log log_σ n)²), where we used that d = Θ((log_σ n)²). The total added space relative to the previous solution is 𝒪(log d · n²/τ) words. Using our definitions of d and τ, the space remains 𝒪(n² log σ) bits.
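The two power-of-two ingredients of this solution can be sketched as follows (helper names are ours): the greedy slab decomposition of a range, and the size of the subrange visited by one step of the modified binary search.

```python
def slab_decomposition(lo, hi):
    """Split [lo..hi) into O(log(hi - lo)) pieces whose lengths are powers
    of two, largest piece first; each piece corresponds to a prepared slab."""
    pieces = []
    while lo < hi:
        t = 1 << ((hi - lo).bit_length() - 1)   # largest power of two <= hi - lo
        pieces.append((lo, t))
        lo += t
    return pieces

def next_range(l):
    """Size 2^ceil(log(l/2)) of the subrange the modified binary search
    recurses on, giving the recurrence T(l) <= T(next_range(l)) + 1."""
    half = (l + 1) // 2
    return 1 << (half - 1).bit_length()
```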

2.3 Achieving 𝒪((log_σ n)^{2/3} (log log_σ n)^{5/3}) Query Time

Data Structure.

Let x be a parameter to be defined later. In addition to the previously defined diagonal sample positions, we now define sample positions for the rows and columns using an x-cover, denoted by 𝒳. We maintain the structure from Lemma 7 (with parameter τ left open for optimization) over the text obtained by concatenating the slabs R_{i,t} for t = 1, 2, 4, 8, …, min(n−i, 2^{⌊log d⌋}), whenever i ∈ 𝒳. We do the same for slabs R_{i,t} for t = 1, 2, 4, 8, …, min(n−i, 2^{⌊log d⌋}) whenever i+t−1 ∈ 𝒳 and i ≥ 0. Similarly, we maintain the structure from Lemma 7 over the concatenation of C_{j,t} for t = 1, 2, 4, 8, …, min(n−j, 2^{⌊log d⌋}) for j ∈ 𝒳, and we do the same for C_{j,t} for t = 1, 2, 4, 8, …, min(n−j, 2^{⌊log d⌋}) whenever j+t−1 ∈ 𝒳 and j ≥ 0. Note that these slabs do not need to be explicitly constructed and can be simulated directly using the input text.

Querying.

Given a query (i_1,j_1), (i_2,j_2), we first find h ∈ [0..d) such that (i_1+h, j_1+h) and (i_2+h, j_2+h) are diagonal sample positions. We then find y ∈ [0..x) such that i_1+y and i_2+y are in 𝒳, and find the LCEs of columns C_{i_1}[j_1..], …, C_{i_1+y−1}[j_1..] with C_{i_2}[j_2..], …, C_{i_2+y−1}[j_2..], respectively. We next find y′ ∈ [0..x) such that i_1+h−1−y′ and i_2+h−1−y′ are in 𝒳, and find the LCEs of columns C_{i_1+h−y′}[j_1..], …, C_{i_1+h−1}[j_1..] with C_{i_2+h−y′}[j_2..], …, C_{i_2+h−1}[j_2..], respectively. We then take the largest power of two, say 2^a, such that i_1+y+2^a ≤ i_1+h−1−y′, and obtain the LCE of the slab C_{i_1+y,2^a}[j_1..] with C_{i_2+y,2^a}[j_2..]. We also obtain the LCE of the slabs C_{i_1+h−y′−2^a,2^a}[j_1..] and C_{i_2+h−y′−2^a,2^a}[j_2..]. We perform a symmetric procedure on the rows. A minimum is taken over all of these LCE values as well as h + LCE((i_1+h, j_1+h), (i_2+h, j_2+h)). Let m denote this minimum. We consider two cases, as in Section 2.2.

  • m>h. In this case, m is reported as the result.

  • m ≤ h. As in Section 2.2, we want to output the largest value y such that the minimum LCE of the slabs covering C_{j_1}[i_1..], …, C_{j_1+y}[i_1..] (with LCE relative to the slabs covering C_{j_2}[i_2..], …, C_{j_2+y}[i_2..]) and R_{i_1}[j_1..], …, R_{i_1+y}[j_1..] (with LCE relative to the slabs covering R_{i_2}[j_2..], …, R_{i_2+y}[j_2..]) is at least y. The modification to the binary search algorithm from Section 2.2 is that we intermix at most x single row/column evaluations to reach the next position in 𝒳. After this position in 𝒳 is reached, the power of two that most evenly splits the remaining range can be used.

Analysis.

We claim that answering a query requires 𝒪(x log d) LCE queries on single rows/columns and 𝒪(log d) LCE queries on slabs. To see this, let S(l) be the number of single row/column LCE queries for a range of length l, and T(l) the number of slab LCE queries. Then we have

S(l) ≤ S(2^{⌈log(l/2)⌉}) + 𝒪(x) = 𝒪(x log l)
T(l) ≤ T(2^{⌈log(l/2)⌉}) + 1 = 𝒪(log l).

Hence, S(h) = 𝒪(x log d) and T(h) = 𝒪(log d). Each single row/column LCE query takes 𝒪(1) time, and each LCE query on a slab takes 𝒪(τ) time. As a result, the total query time is 𝒪(x log d + τ log d). To optimize, we keep d = Θ((log_σ n)²) and now fix x = τ = (log_σ n)^{2/3}(log log_σ n)^{2/3}, obtaining a query time of 𝒪((log_σ n)^{2/3}(log log_σ n)^{5/3}).
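For concreteness, the balancing behind this choice can be written out (assuming, as above, d = Θ((log_σ n)²) and hence log d = Θ(log log_σ n)):

```latex
% With x = \tau, the query time x\log d + \tau\log d becomes
% \Theta(\tau \log\log_\sigma n). Substituting
%   \tau = (\log_\sigma n)^{2/3} (\log\log_\sigma n)^{2/3}
% gives
\tau \log d
  = (\log_\sigma n)^{2/3} (\log\log_\sigma n)^{2/3}
    \cdot \Theta(\log\log_\sigma n)
  = \Theta\bigl((\log_\sigma n)^{2/3} (\log\log_\sigma n)^{5/3}\bigr).
```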

The extra space is 𝒪(log d · n²/(√x · τ)) words. This is because we take 𝒪(log d) slab lengths for each sampled row/column position, creating an overall string of length 𝒪(log d · n²/√x). The LCE structure from Lemma 7 then occupies 𝒪(log d · n²/(√x · τ)) words. With the above choice of x and τ, the total space is 𝒪(n² log σ) bits. This completes the proof of Theorem 1.

3 Repetition-Aware LCE Data Structure

Overview.

We use a parameter τ that we will optimize later. We aim to use a truncated suffix tree in conjunction with a sparse suffix tree on sampled positions from a τ-cover to efficiently answer LCE queries. If we truncate the 2D suffix tree at string depth τ, then the δ measure provides an upper bound of τ²δ on the number of leaves at depth τ. As we argue, one can also upper bound the number of additional leaves in the truncated suffix tree in terms of τ and n.

The first challenge in using the above ideas is that, for these LCE queries from sampled positions to provide information on the overall LCE result, the matching submatrices starting at the sampled positions should overlap. This is accomplished by using string depth 2τ for the truncated suffix tree while still using a τ-cover. The second challenge is that, given our LCE query, we need to know which leaves to consider in the truncated suffix tree; moreover, we should accomplish this in o(n²) space when δ is small. To this end, we introduce the notion of a macro-matrix M, which stores, for each position of T, the leaf of the truncated suffix tree to examine. We then relate the δ measure of this macro-matrix to the δ measure of the matrix T. This relationship enables us to use the 2D block tree data structure of Brisaboa et al. [3] on M, which occupies sublinear space for compressible matrices and supports efficient access to the elements of M.

3.1 Data Structures

Truncated Suffix Tree.

We first construct the 2D suffix tree of T truncated at string depth 2τ. Call this 𝒯_{2τ}. We use ℓ_1, ℓ_2, … to denote the leaves of 𝒯_{2τ}.

Compressed Representation of Macro-Matrix.

We next define the macro-matrix. The elements of the macro-matrix are essentially meta-symbols, where two meta-symbols are the same if and only if the corresponding 2τ×2τ square submatrices are identical. Formally, the macro-matrix M is the matrix obtained as follows: for i, j ∈ [0..n),

  • if there exists a 2τ×2τ submatrix with upper left corner (i,j), i.e., i, j ≤ n−2τ, then we set M[i][j] = ℓ, where ℓ is a pointer to the leaf of 𝒯_{2τ} corresponding to T[i..i+2τ−1][j..j+2τ−1];

  • if i > n−2τ or j > n−2τ, then we set M[i][j] = ℓ, where ℓ is a pointer to the leaf in 𝒯_{2τ} corresponding to the (n−max(i,j))×(n−max(i,j)) submatrix with upper left corner (i,j).

See Figure 5. We then construct the 2D block tree of M, denoted as 𝖡𝖳(M).
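An explicit (uncompressed) construction of M pins down the definition; here leaf pointers are simulated by canonical integer ids, one per distinct truncated submatrix (a sketch, not the actual pointer representation into 𝒯_{2τ}):

```python
def build_macro_matrix(T, tau):
    """M[i][j] identifies the square of side min(2*tau, n - max(i, j))
    with upper left corner (i, j); equal squares get equal ids, mirroring
    equal leaves of the truncated suffix tree."""
    n = len(T)
    leaf_id = {}                      # distinct submatrix -> canonical id
    M = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = min(2 * tau, n - max(i, j))
            key = tuple(T[i + r][j:j + s] for r in range(s))
            M[i][j] = leaf_id.setdefault(key, len(leaf_id))
    return M
```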

Figure 5: An example 2D text T, a truncated suffix tree with τ = 1, i.e., truncated at string depth 2τ = 2, and the resulting macro-matrix M.
Sparse Suffix Tree.

We define sample positions based on a τ-cover 𝒞 of [0..n). These consist of sample positions for the rows,

𝒞_R = {(i,j) : i ∈ 𝒞, j ∈ [0..n)},

for the columns,

𝒞_C = {(i,j) : i ∈ [0..n), j ∈ 𝒞},

and for the diagonals,

𝒞_D = {(i,j) : i, j ∈ [0..n), min(i,j) ∈ 𝒞}.

Let 𝒞* = 𝒞_R ∪ 𝒞_C ∪ 𝒞_D. Observe that |𝒞*| = Θ(n²/√τ). We build a sparse suffix tree over the suffixes starting at the sampled positions in 𝒞*, denoted 𝒯_s. We also maintain the lookup data structure from Lemma 8. As before, this allows us to find, in 𝒪(1) time, equally far sampled positions at most τ away from the queried positions in each row, column, and diagonal.

3.2 Querying

Given an LCE query (i_1,j_1), (i_2,j_2), we first use BT(M) to get the corresponding values in M. Say these correspond to the leaves ℓ_1 and ℓ_2 in 𝒯_{2τ}, respectively. If ℓ_1 ≠ ℓ_2, then the string depth of the LCA of ℓ_1 and ℓ_2 gives us the LCE of (i_1,j_1), (i_2,j_2).

If ℓ_1 = ℓ_2, then we use the lookup data structure from Lemma 8 to find:

  • h_1 ∈ [0..τ) such that (i_1+h_1, j_1) and (i_2+h_1, j_2) are sampled positions. We then use an 𝒪(1)-time query on 𝒯_s to get the LCE of (i_1+h_1, j_1) and (i_2+h_1, j_2). Denote this LCE value by L_1.

  • h_2 ∈ [0..τ) such that (i_1+h_2, j_1+h_2) and (i_2+h_2, j_2+h_2) are sampled positions. We use an 𝒪(1)-time query on 𝒯_s to get the LCE of (i_1+h_2, j_1+h_2) and (i_2+h_2, j_2+h_2). Denote this LCE value by L_2.

  • h_3 ∈ [0..τ) such that (i_1, j_1+h_3) and (i_2, j_2+h_3) are sampled positions. We use an 𝒪(1)-time query on 𝒯_s to get the LCE of (i_1, j_1+h_3) and (i_2, j_2+h_3). Denote this LCE value by L_3.

We report min(h_1+L_1, h_2+L_2, h_3+L_3) as the solution.
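The ℓ_1 = ℓ_2 case therefore costs three lookup queries and three sparse-suffix-tree queries (a sketch with hypothetical oracle names: find_h implements Lemma 8 for the row, diagonal, and column covers, and sst_lce answers an LCE query between two sampled positions via 𝒯_s):

```python
def lce_equal_leaf_case(i1, j1, i2, j2, find_h, sst_lce):
    """Combine the three shifted LCEs of Section 3.2."""
    h1 = find_h('row', i1, i2)                     # (i+h1, j) lands in C_R
    L1 = sst_lce(i1 + h1, j1, i2 + h1, j2)
    h2 = find_h('diag', min(i1, j1), min(i2, j2))  # (i+h2, j+h2) lands in C_D
    L2 = sst_lce(i1 + h2, j1 + h2, i2 + h2, j2 + h2)
    h3 = find_h('col', j1, j2)                     # (i, j+h3) lands in C_C
    L3 = sst_lce(i1, j1 + h3, i2, j2 + h3)
    return min(h1 + L1, h2 + L2, h3 + L3)
```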

3.3 Correctness

The first lemma is immediate.

Lemma 9.

When ℓ_1 ≠ ℓ_2, LCE((i_1,j_1),(i_2,j_2)) is the string depth of LCA(ℓ_1, ℓ_2).

Lemma 10.

When ℓ_1 = ℓ_2, LCE((i_1,j_1),(i_2,j_2)) = min(h_1+L_1, h_2+L_2, h_3+L_3).

Proof.

Define L ≜ LCE((i_1,j_1),(i_2,j_2)). First, we show that L ≤ min(h_1+L_1, h_2+L_2, h_3+L_3). Starting from (i_1+h_1, j_1), there exists a matching submatrix (with respect to the position (i_2+h_1, j_2)) of size at least L−h_1; thus we have L_1 ≥ L−h_1, and hence h_1+L_1 ≥ L. A similar argument holds for h_2 and h_3.

Next, we show L ≥ min(h_1+L_1, h_2+L_2, h_3+L_3).

  • We denote the submatrix T[i_1+h_1..i_1+h_1+L_1)[j_1..j_1+L_1) by T_1.

  • We denote the submatrix T[i_1+h_2..i_1+h_2+L_2)[j_1+h_2..j_1+h_2+L_2) by T_2.

  • We denote the submatrix T[i_1..i_1+L_3)[j_1+h_3..j_1+h_3+L_3) by T_3.

See Figure 6.

Figure 6: The solution LCE L is shown as a black square; submatrix T_1 is shown in red, T_2 in green, and T_3 in blue.

Observe that h_1, h_2, h_3 ≤ τ−1 and, since L ≥ 2τ, we have L_1, L_2, L_3 ≥ τ. Submatrix T_2 has its lower left corner in column j_1+h_2 ≤ j_1+L_1−1 and in row i_1+h_2+L_2−1 ≥ i_1+h_1, making it overlap with T_1. Also, T_2 has its upper right corner in column j_1+h_2+L_2−1 ≥ j_1+h_3 and row i_1+h_2 ≤ i_1+h_3+L_3−1. Hence, T_2 overlaps with T_3 as well.

Now, suppose for the sake of contradiction that h_1+L_1, h_2+L_2, h_3+L_3 > L. For any position in row x = i_1+L and column y with j_1 ≤ y ≤ j_1+L, we have

i_1 ≤ x = i_1+L ≤ i_1+h_1+L_1−1, i_1+h_2+L_2−1

and

j_1 ≤ y ≤ j_1+L ≤ j_1+h_1+L_1−1, j_1+h_2+L_2−1.

Similarly, for any position in column y = j_1+L and row x with i_1 ≤ x ≤ i_1+L, we have

j_1 ≤ y = j_1+L ≤ j_1+h_2+L_2−1, j_1+h_3+L_3−1

and

i_1 ≤ x ≤ i_1+L ≤ i_1+h_2+L_2−1, i_1+h_3+L_3−1.

Based on the above inequalities and the fact that the submatrices T_1, T_2, and T_3 overlap, the matching submatrices with upper left corners (i_1,j_1) and (i_2,j_2) can be extended further by at least one row and column. This contradicts the definition of L.

3.4 Analysis and Optimization

3.4.1 Space Analysis

Space for the τ-Cover Lookup Structure and Sparse Suffix Tree.

According to Lemma 8, the lookup structure requires 𝒪(τ) space. Since |𝒞*| = 𝒪(n²/√τ), the sparse suffix tree 𝒯_s uses 𝒪(n²/√τ) space.

Space for 𝒯_{2τ}.

The space for the truncated suffix tree 𝒯_{2τ} is bounded by the number of distinct 2τ×2τ submatrices of T, denoted d_{2τ}(T), plus the number of distinct matrices of size less than 2τ that cannot be extended further down and to the right (due to a boundary of T). There are at most 𝒪(τn) of the latter since, for every length from 1 to 2τ, at most 2n submatrices cannot be extended further. By the definition of δ, d_{2τ}(T) ≤ 4τ²·δ(T), making the space for 𝒯_{2τ} bounded by 𝒪(τ²·δ(T) + τn).

Space for Macro-Matrix.

The space for 𝖡𝖳(M) depends on δ(M). We prove the following lemma relating δ(T) and δ(M).

Lemma 11.

δ(M) = Ω(max(1, δ(T)/τ² − n/τ)) and δ(M) = 𝒪(τ²·δ(T) + τn).

Proof.

First, the lower bound. Observe that for an arbitrary t ∈ [2τ..n], two matching t×t submatrices in T yield two matching (t−2τ+1)×(t−2τ+1) submatrices in M (with the same upper left corners as the corresponding submatrices in T). In this way, every distinct t×t submatrix in T maps to one distinct (t−2τ+1)×(t−2τ+1) submatrix in M, and we have d_t(T) ≤ d_{t−2τ+1}(M). Then for t ≥ 2τ, we have

d_t(T)/t² ≤ d_{t−2τ+1}(M)/t² ≤ ((t−2τ+1)²/t²)·δ(M) ≤ δ(M)   (1)

implying d_t(T) ≤ t²·δ(M) for t ≥ 2τ.

Next, consider t ∈ [1..2τ). Note that the number of distinct t×t submatrices in T is almost upper bounded by the number of distinct (t+2τ)×(t+2τ) submatrices in T, except that some of the distinct matrices with sizes between t×t and (t+2τ)×(t+2τ) may be prevented from being extended due to the right and bottom boundaries of T. The number of such submatrices is bounded by 2n((t+2τ)−t) = 𝒪(τn). Hence, for t < 2τ,

d_t(T) ≤ d_{t+2τ}(T) + 𝒪(τn).

Applying Inequality (1), we can then write

d_t(T)/t² ≤ d_{t+2τ}(T)/t² + 𝒪(τn)/t² ≤ ((t+2τ)²/t²)·δ(M) + 𝒪(τn) = 𝒪(τ²·δ(M) + τn).

Taking the maximum over both cases yields that δ(T) = 𝒪(τ²·δ(M) + τn).

For the upper bound, we claim that, for an arbitrary t ∈ [1..n],

d_t(M) ≤ d_{t+2τ−1}(T) + 𝒪(τn),

where we take d_{t+2τ−1}(T) = 0 if t+2τ−1 > n. The above inequality follows from the fact that every distinct (t+2τ−1)×(t+2τ−1) submatrix in T maps to one distinct t×t submatrix in M. What remains to be counted for d_t(M) are the distinct t×t submatrices in M that do not result from some (t+2τ−1)×(t+2τ−1) submatrix in T, that is, submatrices on the bottom and/or right boundary. Again, the number of such t×t submatrices is bounded by 2n((t+2τ−1)−t) = 𝒪(τn).

To complete the proof, we have the bound

δ(M) = max_t d_t(M)/t² ≤ max_t (d_{t+2τ−1}(T) + 𝒪(τn))/t² ≤ max_t ((t+2τ−1)²/t²)·δ(T) + 𝒪(τn) = 𝒪(τ²·δ(T) + τn).

Let σ′ be the alphabet size of the macro-matrix M. The space for the block tree BT(M) is

𝒪((δ(M) + √(n·δ(M))) · log(n log σ′ / (δ(M) log n))).

Applying σ′ ≤ n² and Lemma 11, this space is bounded by

𝒪((τ²·δ(T) + τ·√(n·δ(T)) + τn) · log(n / max(1, δ(T)/τ² − n/τ))).
Total Space.

Summing the total data structure sizes, the combined space is

𝒪((τ²·δ(T) + τ·√(n·δ(T)) + τn) · log(n / max(1, δ(T)/τ² − n/τ)) + n²/√τ + τ).

3.4.2 Optimizing

We consider two cases based on δ(T), which we now denote simply by δ. If δ > n^{1/3}, we set τ = n^{4/5}/(2δ^{2/5}) and let β = n / max(1, 4δ^{9/5}/n^{8/5} − 2n^{1/5}δ^{2/5}). The space is (up to constant factors)

(n^{8/5}δ^{1/5} + n^{13/10}δ^{1/10} + n^{9/5}/δ^{2/5}) · log β + n^{8/5}δ^{1/5} + n^{4/5}/δ^{2/5} = 𝒪(n^{8/5}δ^{1/5} log β).

Observe that as δ approaches n2, β approaches 𝒪(1).

If δ ≤ n^{1/3}, we set τ = n^{2/3}. The resulting space complexity is

(n^{4/3}δ + n^{7/6}√δ + n^{5/3}) · log β + n^{5/3} + n^{2/3} = 𝒪(n^{5/3} log β).

For this case, the argument β of the logarithm is 𝒪(n). One can also readily check that β as defined above is bounded by the expression for β appearing in Theorem 2.

3.4.3 Query Time

The query time is dominated by the access to BT(M), which takes 𝒪(1 + log(n/δ(M))) = 𝒪(1 + log β) time, where β is defined as above. The remaining queries take 𝒪(1) time. This completes the proof of Theorem 2.

4 Applications

We next demonstrate some applications of Theorem 1 by proving Theorems 3, 4, 5.

4.1 ISA Queries

We maintain a sampled suffix array. Specifically, we sample the suffix array values of every (log_σ n)-th leaf of the suffix tree. The space required for this is 𝒪(n² log σ) bits. Additionally, for each text position, we store the distance from its leaf to the preceding sampled leaf in the suffix tree. This requires 𝒪(log log_σ n) bits per entry. The resulting total space is 𝒪(n² log σ + n² log log n) bits.

To find the ISA value of a text position (i,j), we perform a binary search on the sampled leaves to find the lexicographic predecessor of (i,j) within the sampled set. Once the predecessor is found, we add the offset associated with (i,j). This gives us the suffix array position associated with (i,j), i.e., its ISA value. The binary search requires 𝒪(log n) LCE queries. Each LCE query takes 𝒪((log_σ n)^{2/3}(log log_σ n)^{5/3}) time, resulting in an overall time complexity of 𝒪(log n (log_σ n)^{2/3}(log log_σ n)^{5/3}).
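A sketch of the predecessor search (hypothetical interfaces: suffix_leq(p, q) decides whether the suffix at p is lexicographically at most the suffix at q, using one LCE query of Theorem 1 plus one character comparison; rate is the sampling rate Θ(log_σ n)):

```python
def isa_query(pos, sampled_sa, offset, rate, suffix_leq):
    """ISA[pos] = rank of the predecessor sampled leaf + stored offset.
    Assumes sampled_sa[0] is the lexicographically smallest suffix."""
    lo, hi = 0, len(sampled_sa)       # invariant: predecessor index in [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if suffix_leq(sampled_sa[mid], pos):
            lo = mid                  # sampled suffix at mid is not larger
        else:
            hi = mid
    return lo * rate + offset[pos]
```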

4.2 SA queries

Let τ be a parameter. We divide the leaves of the suffix tree into contiguous blocks of size n²/τ (except for perhaps the last block, which can be smaller). There are Θ(τ) blocks. We associate each position in T with the block in which its leaf lies in the suffix tree. This information is stored as follows: consider a binary array B_b associated with each block b. Each binary array has length n² and represents a linearization of T. In B_b, a position holds a 1 if the corresponding suffix tree leaf is in block b, and 0 otherwise. Note that there are at most m ≤ n²/τ 1s in B_b. We build a data structure representing B_b using m log(n²/m) + 𝒪(m) bits of space, or equivalently, (n²/τ) log τ + 𝒪(n²/τ) bits of space, such that select queries can be answered in constant time [29]. The total space for the select data structures over all Θ(τ) bit vectors is n² log τ + 𝒪(n²) = 𝒪(n² log τ) bits. We also maintain the ISA data structure described previously.

Given an SA query for index i, we first identify which block i is in. Say this is block b. We use select queries to iterate through the text positions contained in block b. For each text position iterated over, we perform an ISA query and check whether its ISA value equals i.

The space required for the ISA data structure is 𝒪(n² log σ + n² log log n) bits. The space for the select data structures is 𝒪(n² log τ) bits. The query time is 𝒪((n²/τ) · log n · (log_σ n)^{2/3}(log log_σ n)^{5/3}). We obtain Theorem 4 by setting τ = (σ log n)^c, where c is an arbitrarily large constant that absorbs the additional logarithmic factors in the query time.
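A sketch of the resulting SA query (hypothetical interfaces: select(b, k) returns the k-th text position whose suffix tree leaf lies in block b, or None when exhausted; isa is the ISA query above):

```python
def sa_query(idx, block_size, select, isa):
    """SA[idx]: scan the O(n^2 / tau) positions whose leaves lie in the
    block containing index idx, and return the one mapping back to idx."""
    b = idx // block_size             # block containing leaf idx
    k = 1
    while (p := select(b, k)) is not None:
        if isa(p) == idx:
            return p
        k += 1
    return None                       # unreachable for a valid idx
```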

4.3 Pattern Matching

Counting.

In addition to the previous structures, we maintain the LCE data structure from Lemma 6 over the rows and columns. First, a binary search is performed to find the leaf of the lexicographically smallest suffix with P as a prefix (if one exists). We start by using an SA query to obtain SA[n²/2]. Using read access to the original text, we match the characters of P in Lstring order against the submatrix starting at SA[n²/2] until we reach the first mismatch. At this point, we know the lexicographic order of P relative to the current leaf. When we transition to a new leaf during the binary search, we perform an SA query followed by LCE queries against the position from the preceding leaf; this avoids repeatedly iterating over the characters of P. If the LCE is at least the length already matched, we continue matching from the last matched position. A similar binary search finds the lexicographically largest suffix with P as a prefix. We return the length of the resulting suffix range.

The total number of LCE and SA queries performed is 𝒪(log n). The time is dominated by the SA queries, which require 𝒪(n²/(σ log n)^c) time.

Reporting.

We start with the suffix range obtained previously, say [x..y]. We use the same blocking scheme over the suffix tree leaves described for SA queries, along with the constant-time select data structures. We first identify the block that x lies in, say block b. We use the select data structure to iterate through all of the text positions corresponding to suffixes in block b. For each position, we perform an ISA query and check whether its position in the suffix array is at least x; if it is, we output it. We perform a similar procedure for the block containing y, now checking whether the position in the suffix array is at most y. For the remaining blocks, those completely contained in [x..y], we use their select data structures to output all occurrences with suffixes in the block.

The space complexity is the same as that of the SA data structure. For the query time, each block has size 𝒪(n²/τ), and with τ = (σ log n)^c, the time spent on the blocks containing x and y is absorbed by the n²/(σ log n)^c term already incurred by the SA queries.

5 Open Problems

We leave open many directions for potential improvement, for example:

  • Can we design a data structure with faster SA query time that uses 𝒪(n² log σ + n² log log n) bits of space (or better)? This seems significantly harder than ISA queries. Suffix array sampling, as in the FM-index [10], is not immediately adaptable.

  • Can we design a data structure in repetition-aware compressed space that supports ISA, SA, or pattern-matching queries? Also, can the space of our data structure for LCE queries be improved? Grammar-based compression has proven useful for repetition-aware compressed data structures supporting LCE queries in the 1D case, particularly run-length straight-line programs (RL-SLPs). For 1D text, it is possible to construct RL-SLPs with size close to the δ measure [25], which can be used for LCE [27] and pattern matching queries [24]. Although Romana et al. [31] introduce a version of RL-SLPs for 2D text, it remains open how such an RL-SLP could be utilized for LCE queries and other types of queries, e.g., SA and pattern matching queries.

References

  • [1] Philip Bille, Inge Li Gørtz, Mathias Bæk Tejs Knudsen, Moshe Lewenstein, and Hjalte Wedel Vildhøj. Longest common extensions in sublinear space. In Ferdinando Cicalese, Ely Porat, and Ugo Vaccaro, editors, Combinatorial Pattern Matching - 26th Annual Symposium, CPM 2015, Ischia Island, Italy, June 29 - July 1, 2015, Proceedings, volume 9133 of Lecture Notes in Computer Science, pages 65–76. Springer, 2015. doi:10.1007/978-3-319-19929-0_6.
  • [2] Philip Bille, Inge Li Gørtz, Benjamin Sach, and Hjalte Wedel Vildhøj. Time-space trade-offs for longest common extensions. J. Discrete Algorithms, 25:42–50, 2014. doi:10.1016/J.JDA.2013.06.003.
  • [3] Nieves R. Brisaboa, Travis Gagie, Adrián Gómez-Brandón, and Gonzalo Navarro. Two-dimensional block trees. Comput. J., 67(1):391–406, 2024. doi:10.1093/COMJNL/BXAC182.
  • [4] Stefan Burkhardt and Juha Kärkkäinen. Fast lightweight suffix array construction and checking. In Ricardo A. Baeza-Yates, Edgar Chávez, and Maxime Crochemore, editors, Combinatorial Pattern Matching, 14th Annual Symposium, CPM 2003, Morelia, Michocán, Mexico, June 25-27, 2003, Proceedings, volume 2676 of Lecture Notes in Computer Science, pages 55–69. Springer, 2003. doi:10.1007/3-540-44888-8_5.
  • [5] Lorenzo Carfagna and Giovanni Manzini. Compressibility measures for two-dimensional data. In Franco Maria Nardini, Nadia Pisanti, and Rossano Venturini, editors, String Processing and Information Retrieval - 30th International Symposium, SPIRE 2023, Pisa, Italy, September 26-28, 2023, Proceedings, volume 14240 of Lecture Notes in Computer Science, pages 102–113. Springer, 2023. doi:10.1007/978-3-031-43980-3_9.
  • [6] Lorenzo Carfagna and Giovanni Manzini. The landscape of compressibility measures for two-dimensional data. IEEE Access, 12:87268–87283, 2024. doi:10.1109/ACCESS.2024.3417621.
  • [7] Anders Roy Christiansen, Mikko Berggren Ettienne, Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Optimal-time dictionary-compressed indexes. ACM Trans. Algorithms, 17(1):8:1–8:39, 2021. doi:10.1145/3426473.
  • [8] Charles J. Colbourn and Alan C. H. Ling. Quorums from difference covers. Inf. Process. Lett., 75(1-2):9–12, 2000. doi:10.1016/S0020-0190(00)00080-6.
  • [9] Martin Farach. Optimal suffix tree construction with large alphabets. In 38th Annual Symposium on Foundations of Computer Science, FOCS ’97, Miami Beach, Florida, USA, October 19-22, 1997, pages 137–143. IEEE Computer Society, 1997. doi:10.1109/SFCS.1997.646102.
  • [10] Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In 41st Annual Symposium on Foundations of Computer Science, FOCS 2000, 12-14 November 2000, Redondo Beach, California, USA, pages 390–398. IEEE Computer Society, 2000. doi:10.1109/SFCS.2000.892127.
  • [11] Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM, 67(1):2:1–2:54, 2020. doi:10.1145/3375890.
  • [12] Arnab Ganguly, Dhrumil Patel, Rahul Shah, and Sharma V. Thankachan. LF successor: Compact space indexing for order-isomorphic pattern matching. In Nikhil Bansal, Emanuela Merelli, and James Worrell, editors, 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021, July 12-16, 2021, Glasgow, Scotland (Virtual Conference), volume 198 of LIPIcs, pages 71:1–71:19. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.ICALP.2021.71.
  • [13] Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan. pbwt: Achieving succinct data structures for parameterized pattern matching and related problems. In Philip N. Klein, editor, Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, Barcelona, Spain, Hotel Porta Fira, January 16-19, pages 397–407. SIAM, 2017. doi:10.1137/1.9781611974782.25.
  • [14] Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan. Fully functional parameterized suffix trees in compact space. In Mikolaj Bojanczyk, Emanuela Merelli, and David P. Woodruff, editors, 49th International Colloquium on Automata, Languages, and Programming, ICALP 2022, July 4-8, 2022, Paris, France, volume 229 of LIPIcs, pages 65:1–65:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022. doi:10.4230/LIPICS.ICALP.2022.65.
  • [15] Pawel Gawrychowski, Tomasz Kociumaka, Wojciech Rytter, and Tomasz Walen. Faster longest common extension queries in strings over general alphabets. In Roberto Grossi and Moshe Lewenstein, editors, 27th Annual Symposium on Combinatorial Pattern Matching, CPM 2016, June 27-29, 2016, Tel Aviv, Israel, volume 54 of LIPIcs, pages 5:1–5:13. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2016. doi:10.4230/LIPIcs.CPM.2016.5.
  • [16] Raffaele Giancarlo. A generalization of the suffix tree to square matrices, with applications. SIAM J. Comput., 24(3):520–562, 1995. doi:10.1137/S0097539792231982.
  • [17] Gaston H Gonnet. Efficient searching of text and pictures. UW Centre for the New Oxford English Dictionary, 1990.
  • [18] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput., 35(2):378–407, 2005. doi:10.1137/S0097539702402354.
  • [19] Dominik Kempa and Tomasz Kociumaka. String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure. In Moses Charikar and Edith Cohen, editors, Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 756–767. ACM, 2019. doi:10.1145/3313276.3316368.
  • [20] Dominik Kempa and Tomasz Kociumaka. Resolution of the Burrows-Wheeler transform conjecture. Commun. ACM, 65(6):91–98, 2022. doi:10.1145/3531445.
  • [21] Dominik Kempa and Tomasz Kociumaka. Collapsing the hierarchy of compressed data structures: Suffix arrays in optimal compressed space. In 64th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2023, Santa Cruz, CA, USA, November 6-9, 2023, pages 1877–1886. IEEE, 2023. doi:10.1109/FOCS57990.2023.00114.
  • [22] Dong Kyue Kim, Yoo Ah Kim, and Kunsoo Park. Generalizations of suffix arrays to multi-dimensional matrices. Theor. Comput. Sci., 302(1-3):223–238, 2003. doi:10.1016/S0304-3975(02)00828-9.
  • [23] Dong Kyue Kim, Joong Chae Na, Jeong Seop Sim, and Kunsoo Park. Linear-time construction of two-dimensional suffix trees. Algorithmica, 59(2):269–297, 2011. doi:10.1007/S00453-009-9350-Z.
  • [24] Tomasz Kociumaka, Gonzalo Navarro, and Francisco Olivares. Near-optimal search time in δ-optimal space, and vice versa. Algorithmica, 86(4):1031–1056, 2024. doi:10.1007/S00453-023-01186-0.
  • [25] Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Toward a definitive compressibility measure for repetitive sequences. IEEE Trans. Inf. Theory, 69(4):2074–2092, 2023. doi:10.1109/TIT.2022.3224382.
  • [26] Gonzalo Navarro. Indexing highly repetitive string collections, part I: repetitiveness measures. ACM Comput. Surv., 54(2):29:1–29:31, 2022. doi:10.1145/3434399.
  • [27] Takaaki Nishimoto, Tomohiro I, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Fully dynamic data structure for LCE queries in compressed space. In Piotr Faliszewski, Anca Muscholl, and Rolf Niedermeier, editors, 41st International Symposium on Mathematical Foundations of Computer Science, MFCS 2016, August 22-26, 2016 - Kraków, Poland, volume 58 of LIPIcs, pages 72:1–72:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2016. doi:10.4230/LIPICS.MFCS.2016.72.
  • [28] Dhrumil Patel and Rahul Shah. Inverse suffix array queries for 2-dimensional pattern matching in near-compact space. In Hee-Kap Ahn and Kunihiko Sadakane, editors, 32nd International Symposium on Algorithms and Computation, ISAAC 2021, December 6-8, 2021, Fukuoka, Japan, volume 212 of LIPIcs, pages 60:1–60:14. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.ISAAC.2021.60.
  • [29] Rajeev Raman, Venkatesh Raman, and Srinivasa Rao Satti. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms, 3(4):43, 2007. doi:10.1145/1290672.1290680.
  • [30] Sofya Raskhodnikova, Dana Ron, Ronitt Rubinfeld, and Adam D. Smith. Sublinear algorithms for approximating string compressibility. Algorithmica, 65(3):685–709, 2013. doi:10.1007/S00453-012-9618-6.
  • [31] Giuseppe Romana, Marinella Sciortino, and Cristian Urbina. Exploring repetitiveness measures for two-dimensional strings, 2024. doi:10.48550/arXiv.2404.07030.
  • [32] Kunihiko Sadakane. Succinct representations of LCP information and improvements in the compressed suffix arrays. In David Eppstein, editor, Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 6-8, 2002, San Francisco, CA, USA, pages 225–232. ACM/SIAM, 2002. URL: http://dl.acm.org/citation.cfm?id=545381.545410.
  • [33] Sharma V. Thankachan. Compact text indexing for advanced pattern matching problems: Parameterized, order-isomorphic, 2d, etc. (invited talk). In Hideo Bannai and Jan Holub, editors, 33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022, June 27-29, 2022, Prague, Czech Republic, volume 223 of LIPIcs, pages 3:1–3:3. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022. doi:10.4230/LIPICS.CPM.2022.3.
  • [34] Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages 1–11. IEEE Computer Society, 1973. doi:10.1109/SWAT.1973.13.