Abstract 1 Introduction 2 Preliminaries 3 𝑶(𝒏𝐥𝐠𝝈)-bit Index 4 Construction in 𝑶(𝒏𝐥𝐠𝝈) Bits of Space References

Extending the Burrows–Wheeler Transform for Cartesian Tree Matching and Constructing It

Eric M. Osterkamp ORCID University of Münster, Germany Dominik Köppl ORCID University of Yamanashi, Kofu, Japan
Abstract

Cartesian tree matching is a form of generalized pattern matching where a substring of the text matches with the pattern if they share the same Cartesian tree. This form of matching finds application for time series of stock prices and can be of interest for melody matching between musical scores. For the indexing problem, the state-of-the-art data structure is a Burrows–Wheeler transform based solution due to [Kim and Cho, CPM’21], which uses nearly succinct space and can count the number of substrings that Cartesian tree match with a pattern in time linear in the pattern length. The authors address the construction of their data structure with a straight-forward solution that, however, requires pointer-based data structures, resulting in O(nlgn) bits of space, where n is the text length [Kim and Cho, CPM’21, Section A.4]. We address this bottleneck by a construction that requires O(nlgσ) bits of space and has a time complexity of O(nlgσlgnlglgn), where σ is alphabet size. Additionally, we can extend this index for indexing multiple circular texts in the spirit of the extended Burrows–Wheeler transform without sacrificing the time and space complexities. We present this index in a dynamic variant, where we pay a logarithmic slowdown and need space linear in the input texts in bits for the extra functionality that we can incrementally add texts. Our extended setting is of interest for finding repetitive motifs common in the aforementioned applications, independent of offsets and scaling.

Keywords and phrases:
Cartesian tree matching, extended Burrows–Wheeler transform, construction algorithm, generalized pattern matching
Funding:
Dominik Köppl: JSPS KAKENHI Grant Numbers JP23H04378 and JP25K21150.
Copyright and License:
[Uncaptioned image] © Eric M. Osterkamp and Dominik Köppl; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation
Related Version:
Full Version: https://arxiv.org/abs/2411.12241
Acknowledgements:
We sincerely thank the anonymous reviewers for their constructive comments.
Editors:
Paola Bonizzoni and Veli Mäkinen

1 Introduction

String matching is ubiquitous in computer science, and its variations are custom-made to solve a wide variety of problems. We here focus on a special kind of variation called substring consistent equivalence relation (SCER[21]. Two strings X and Y are said to SCER-match if they have the same length and all substrings of equal length starting at the same text position SCER-match, i.e., X[i..j] SCER-matches with Y[i..j] for all 1ij|X|=|Y|. A specific instance of SCER-matching is order-preserving matching [19, 16], which has been studied for the analysis of numerical time series. The aim of order-preserving matching is to match two strings if the relative order of the symbols in both strings is the same. Order-preserving matching therefore can find matches independently of value offsets and scaling.

Since order-preserving matching takes the global order of the symbols in a string into account, it may be too strict in applications that primarily consider the local ordering of ranges partitioned by the peaks. In fact, time series of stock prices are such a case, where a common pattern called the head-and-shoulder [8] involves one absolute peak (head) and neighboring local peaks (shoulders), where each of these neighboring peaks can be treated individually to match similar head-and-shoulder patterns with slightly changed local peaks. For such a reason, Park et al. [26] proposed Cartesian tree matching. Cartesian tree matching relaxes the notion of order-preserving matching in the sense that an order-preserving match is always a Cartesian tree match, but not necessarily the other way around. For instance, the two strings of digits 1537462 and 2438372 fit into the head-and-shoulder pattern, do not order-preserving match, but Cartesian tree match. For that, Cartesian tree matching compares the Cartesian trees of two strings to decide whether they match. The Cartesian tree [28] is a binary tree built upon an array of numbers. If all numbers are distinct, it is the min-heap whose in-order traversal retrieves the original array. Since its inception, Cartesian tree matching and variations thereof have attracted interest from researchers, whose studies include multiple pattern matching [26, 27], approximate matching [1, 18], substring matching [4], subsequence matching [24], indeterminate matching [10], cover matching [15], and palindromic matching [9].

In addition to stock prices, applications of generalized pattern matching can also be found in melody matching between musical scores. Pattern matching music scores such as differences, directions, or magnitudes have been studied [12]. However, the detection of repetitions in a piece of music has also been considered of importance [7]. The question therefore is whether we can find repetitive substrings that Cartesian tree match with a repetition of a set of melodic motifs (i.e., input texts). For that to be efficient, we want to index these motifs.

An index for Cartesian tree matching of a text string T is a data structure built upon T that can report the number of substrings of T that Cartesian tree match a given pattern. Given T has n symbols drawn from an integer alphabet of size σ, Park et al. [26] and Nishimoto et al. [23] proposed indexes for Cartesian tree matching of T that both occupy O(nlgn) bits of space and are constructed with O(nlgn) and O(nσlgn) additional bits of space, respectively. Park et al. [26, Section 5.1] achieve O(mlgσ) time and Nishimoto et al. [23, Section 5.1] O(m(σ+lgm)+occ) time for answering count queries, where occ is the number of occurrences of the pattern of length m. Adapting Ferragina and Mantaci’s FM-index for exact string matching [5], Kim and Cho [17] proposed an index that occupies 3n+o(n) bits of space, and answers a count query in O(m) time for a pattern of length m. For construction, they proposed a straight-forward solution that takes as input the Cartesian suffix tree of Park et al. [26], which, however, requires O(nlgn) additional bits of space.

We address two goals in this paper. The first is a construction algorithm for Kim and Cho’s index that takes O(nlgσ) bits of working space, which is compact for constant alphabet sizes. While we consider an index supporting access to a character of the length-n text T as compact if its space is O(nlgσ) bits, we here restrict queries to Cartesian-tree pattern matching, and a Cartesian tree storing n nodes can be represented in 2n bits, cf. [26, Section 3.5]. Therefore, a compact index for Cartesian-tree pattern matching takes O(n) bits, a bound we reach for constant alphabets. Second, while all aforementioned indexes can partially address the problem by indexing multiple texts, it is hard to detect whether the pattern is a repetition of one of the input texts, which is of interest in case of indexing melodic motifs that can repeat. Here, our goal is an index that can find such matches, even if they start with different offsets of the same repetition. In concrete words, our aim is to index multiple texts for Cartesian pattern matching. The search space for a pattern are the texts that are considered to be infinite concatenations with themselves.

In this paper, we propose an extension of the index of Kim and Cho [17] for Cartesian tree matching with techniques of the extended Burrows–Wheeler transform [20], and call the resulting data structure the cBWT index. The cBWT index is an extension in the sense that it supports indexing multiple input texts for Cartesian tree matching circularly. We show that we can compute the cBWT index in O(nlgσlgnlglgn) time and O(nlgσ) bits of space, where n is the total length of all texts to index. Our construction allows to build the cBWT index incrementally, i.e., we support the incremental addition of a new string to the set of texts we index. We can also compute the original index of Kim and Cho within the same complexities. Our ideas stem from construction algorithms of indexes for parameterized matching by Hashimoto et al. [11] and Iseri et al. [14], which we recently extended for multiple circular texts [25]. During construction, the cBWT index supports the backward search and count queries for pattern strings in O(mlgσlgnlglgn) time, where m is the pattern length.

2 Preliminaries

Let lg=log2 denote the logarithm to base two. We assume a random access model with word size Ω(lgn), where n denotes the input size. An interval {i,i+1,..,j1,j} of integers is denoted by [i..j], where [i..j]= if j<i.

Strings.

Let Σ denote an alphabet. We call elements of Σ symbols, a sequence of symbols from Σ a string over Σ, and denote the set of strings over Σ by Σ. Let U,V,WΣ. The concatenation of U and V is denoted by UV or UV. We write Uk if we concatenate k instances of U for a non-negative integer k, and Uω for the string obtained by infinitely concatenating U. We call U primitive if U=Xk implies U=X and k=1. It is known that for every UΣ there exists a unique primitive XΣ and a unique integer k such that U=Xk, denoted by root(U) and exp(U), respectively. If X=UVW, then U is called a prefix, W a suffix, and U,V,W substrings of X. The length |U| of U is the number of symbols in U, lcp(U,V) reports the length of the longest common prefix of U and V, ε denotes the unique string of length 0, and we define Σ+=Σ{ε}. Let XΣ+ and i,j[1..|X|]. Then X[i] denotes the i-th symbol in X, and X[i..j]=X[i]X[j], where X[i..j]=ε if j<i, X[..i]=X[1..i], and X[i..]=X[i..|X|]. For notational convenience, X[..k]=ε if k0, and V[k..]=ε if k|V|+1. We call 1p|X| satisfying X[q]=X[q+p] for every q[1..|X|p] a period of X. Let Rot(X,0)=X and Rot(X,k+1)=Rot(X,k)[2..]Rot(X,k)[1] for each non-negative integer k, i.e., Rot(X,k) denotes the k-th left rotation of X. The left rotations of X for k[0..|X|1] are the conjugates of X. We write U<V if and only if (a) U=V[..|U|] and |U|<|V| or (b) U[lcp(U,V)+1]<V[lcp(U,V)+1].

i 1 2 3 4 5 6 7 8
V[i] 𝟽 𝟷𝟺 𝟻 𝟷 𝟷𝟷 𝟸𝟽 𝟷𝟷 𝟽
Figure 1: Example integer string V to illustrate queries. Here, rank11(V,5)=1, rnkcntV(2,7,7,14)=3, select11(V,2)=7, RNVV(1,5,8)=11, and MIV(5,6)=[4..8].

Let VΣ+ be a (dynamic) string, cdΣ, i[1..|V|+1], j[0..|V|] and k[1..|V|]. Then insertV(i,c) inserts the symbol c at position i of V, deleteV(k) deletes the k-th entry of V, rankc(V,j) returns the number of occurrences of c in V[..j], rnkcntV(i,j,c,d) returns |{x[i..j]cV[x]d}|, selectc(V,i) returns the index of the i-th occurrence of c in V if irankc(V,|V|), RNVV(i,j,c) returns the smallest value in V[i..j] larger than c if it exists, and MIV(j,c) returns the maximal interval [..r] such that 0jr|V| and V[x]c for every x[+1..r]. See Figure 1 for examples. We represent strings by the following dynamic data structure, which supports the aforementioned operations and queries.

Lemma 2.1 ([14, Lemma 4]).

A dynamic string of length n over [0..σ] with σnO(1) can be stored in a data structure occupying O(nlgσ) bits of space, supporting insertion, deletion and the queries access, rank, rnkcnt, select, RNV and MI in O(lgσlgnlglgn) time.

Alphabet.

Throughout, we will work with the integer alphabet Σ=[0..σ], where σnO(1), and a special symbol $Σ stipulated to be smaller than any symbol from Σ. The special symbol is motivated by the construction algorithm, and as a delimiter when a string should not be considered circular, as in the index of Kim and Cho [17]. Let Σ$=Σ{$}.

Cartesian Tree Matching.

The Cartesian tree ct(V) of a string VΣ$ is a binary tree defined as follows. If V=ε, then ct(V) is the empty tree. If Vε, let i denote the position of the smallest symbol in V, where ties are broken with respect to text position. Then ct(V) has V[i] as its root, ct(V[..i1]) as its left subtree and ct(V[i+1..]) as its right subtree. We say that two strings U,VΣ$ Cartesian tree match (ct-match) if and only if ct(U)=ct(V), and write UV. For instance, in Figure 2 the substring T[3..6]=𝟻𝟷𝟽𝟾 of T=𝟼𝟸𝟻𝟷𝟽𝟾𝟸𝟼𝟻 ct-matches P=𝟽𝟹𝟺𝟽 while T[4..7]=𝟷𝟽𝟾𝟸 does not.

(a) ct(T).
(b) ct(P).
(c) ct(T[3..6]).
(d) ct(T[4..7]).
Figure 2: Cartesian trees of T=𝟼𝟸𝟻𝟷𝟽𝟾𝟸𝟼𝟻, P=𝟽𝟹𝟺𝟽, T[3..6]=𝟻𝟷𝟽𝟾, and T[4..7]=𝟷𝟽𝟾𝟸. Here, PT[3..6] and PT[4..7].

Parent Distance Encoding.

Park et al. [26] use an encoding scheme for representing Cartesian trees that reduces the computation of a ct-match to checking whether the encoded strings exactly match. We here give a variant thereof which takes the special symbol $ into account. Let Σ$ denote a symbol larger than any integer. The parent distance encoding V of VΣ$ is a string of length |V| over Σ${} such that

V[i]={if $V[i]<min{V[j]j[..i1]},V[i]if V[i]=$imax{j[..i1]V[j]V[i]}otherwise,

for each i[1..|V|]. We have V[..i]=V[..i] for each VΣ$+ and i[1..|V|]. For example, 𝟺𝟷𝟹𝟸𝟽$𝟹=𝟷𝟸𝟷$𝟷.111As a side note, the parent distance encoding has been leveraged for devising compact O(n)-bit data structures answering range minimum queries, such as the 2d-Min-Heap by Fischer [6] and the LRM-trees by Sadakane and Navarro [22], which semantically coincide.

Lemma 2.2 ([26, Theorem 1]).

Let U,VΣ. Then UVU=V.

Problem Statement.

We are interested in a solution to the following problem.

Problem (Count).

Given 𝒯Σ+ and PΣ, count each of the conjugates of the strings in 𝒯 whose infinite iteration has a prefix ct-matching P.

Throughout, let 𝒯={T1,,Td}Σ+. Our running example consists of the strings T1=𝟻𝟷𝟸, T2=𝟻𝟹𝟼𝟹 and T3=𝟺𝟺𝟽𝟾 over Σ=[0..8]. Given our running example, the solution to Count for P1=𝟼𝟺𝟹 and P2=𝟻𝟼𝟹𝟺 is 0 and 2, respectively. Here, P2Rot(T3,2)ω[..4]Rot(T1,2)ω[..4] since P2=𝟷𝟷=Rot(T3,2)ω[..4]=Rot(T1,2)ω[..4].

3 𝑶(𝒏𝐥𝐠𝝈)-bit Index

Let n=|T1Td|, nk=|Tk| for each k[1..d], and C𝒯(i)=Rot(Tj,i1k=1j1nk) for each i[1..n], where j=min{h[1..d]k=1hnki}, i.e., we identify each conjugate of each text T1,,Td with its start position inside the concatenation T1Td, such that we give them ranks from 1 to n. See Figure 3 for an example. In what follows, we put these ranks in a specific order by a permutation of [1..n] such that the permuted ranks of all conjugates with prefixes of their infinite concatenation ct-matching a pattern form an interval [..r][1..n].

3.1 Conjugate Array

We express the permutation of the ranks of all conjugates by the so-called conjugate array, which we will subsequently define. To achieve this, we extend the ideas of Mantaci et al. [20] to Cartesian tree matching, and introduce a preorder on Σ$+. For notational convenience, we define the rotational parent distance encoding Vr of VΣ$+ by Vr=V2[|V|+1..]. For any V,UΣ$+, let VωU if and only if there exists some natural number i such that Vω[..i]<Uω[..i] or root(Vr)=root(Ur) holds.

Lemma 3.1.

The relation ω defines a total preorder on Σ$+, i.e., the relation ω is binary, reflexive, transitive and connected.

We call this preorder the ω-preorder. We write V=ωU if and only if VωUUωV, and VωU if and only if VωUVωU. For instance, T3=𝟺𝟺𝟽𝟾ω𝟷𝟸𝟻=Rot(T1,1) and T2=𝟻𝟹𝟼𝟹=ω𝟼𝟹𝟻𝟹=Rot(T2,2). Note that the latter example violates antisymmetry, i.e., the relation ω is not a total order. In the following, we present a convenient but not necessarily optimal way to compute the ω-preorder of two given strings.

Lemma 3.2 (Weak Periodicity Lemma).

Let p and q be two periods of a string X. If p+q|X|, then gcd(p,q) is also a period of X.

Lemma 3.3 ([13, Lemma 5]).

Let V,UΣ$+ and z=max{|V|,|U|}. Then V=ωU if and only if Vω[..3z]=Uω[..3z].

Proof.

Without loss of generality, |U|=z. Let |V|=i, |root(Vr)|=j and |root(Ur)|=k.

()

Assume Vω[..3z]=Uω[..3z]. Then, on the one hand,

Rot(V,z)rω[..2z]=Vω[..3z][z+1..]=Uω[..3z][z+1..]=UrUr.

Since the rotational prev-encoding is commutative with left rotations, j is a period of Rot(V,z)r. Consequently, both z and j are periods of UrUr. Since j+z2z=|UrUr|, we can apply Lemma 3.2 and find that gcd(j,z) is a period of UrUr. As gcd(j,z) divides z=|Ur|, Ur can be formed by repeating Ur[..gcd(j,z)] an integral number of times, which implies gcd(j,z)k, i.e., ijk. On the other hand,

VrVr=Vω[..3z][i+1..3i]=Uω[..3z][i+1..3i]=Rot(U,i)rω[..2i].

Since the rotational prev-encoding is commutative with left rotations, k is a period of Rot(U,i)r. As ki, k is a period of VrVr in addition to i. Then k+i2i=|VrVr| and Lemma 3.2 imply that gcd(k,i) is a period of VrVr. As gcd(k,i) divides i=|Vr|, Vr[..gcd(k,i)] can be repeated an integral number of times to form Vr, which implies gcd(k,i)j. Consequently, kj, and therefore |root(Vr)|=j=k=|root(Ur)|. In particular, root(Vr)=Vω[..3z][3zj+1..]=Uω[..3z][3zk+1..]=root(Ur), i.e., V=ωU.

()

Let V=ωU. Assume Vω[..3z]Uω[..3z] with x minimal such that Vω[..3z][x]Uω[..3z][x]. If max{Vω[..3z][x],Uω[..3z][x]}<, then Rot(root(Vr),x1)[1]=Vω[..3z][x]Uω[..3z][x]=Rot(root(Ur),x1)[1], which contradicts V=ωU. Hence, and without loss of generality, assume Vω[..3z][x]=. Then Uω[..3z][x]<, x|root(Vr)|, and consequently root(Ur)[x]=Uω[..3z][x]<xroot(Vr)[x], a contradiction to V=ωU. Thus, Vω[..3z]=Uω[..3z].

Corollary 3.4.

Let V,UΣ$+ and z=max{|V|,|U|}. Then VωU if and only if Vω[..3z]<Uω[..3z].

Similarly to Boucher et al. [2], we define the conjugate array CA𝒯 of 𝒯 as the string of length n over [1..n] such that CA𝒯[i]=j if and only if

i1=|{k[1..n]C𝒯(k)ωC𝒯(j) or C𝒯(k)=ωC𝒯(j)k<j}|,

i.e., i1 is the number of all conjugates smaller than C𝒯(j) according to ω-preorder, where we break ties first with respect to text index, and then with respect to text position. By resolving all ties this way, we ensure that CA𝒯 is well-defined. Since CA𝒯 is a permutation, its inverse, which we denote by CA𝒯1, is also well-defined. See Figure 3 for our running example’s conjugate array.

i C𝒯(i) C𝒯(i)ω[..12] CA𝒯1[i] CA𝒯[i] C𝒯(CA𝒯[i])ω[..12] C𝒯(CA𝒯[i])
1 𝟻𝟷𝟸 𝟷𝟷𝟹𝟷𝟷𝟹𝟷𝟷𝟹𝟷 𝟿 𝟾 𝟷𝟷𝟷𝟹𝟷𝟷𝟷𝟹𝟷𝟷𝟷 𝟺𝟺𝟽𝟾
2 𝟷𝟸𝟻 𝟷𝟷𝟹𝟷𝟷𝟹𝟷𝟷𝟹𝟷𝟷 𝟹 𝟿 𝟷𝟷𝟹𝟷𝟷𝟷𝟹𝟷𝟷𝟷𝟹 𝟺𝟽𝟾𝟺
3 𝟸𝟻𝟷 𝟷𝟷𝟷𝟹𝟷𝟷𝟹𝟷𝟷𝟹 𝟽 𝟸 𝟷𝟷𝟹𝟷𝟷𝟹𝟷𝟷𝟹𝟷𝟷 𝟷𝟸𝟻
4 𝟻𝟹𝟼𝟹 𝟷𝟸𝟷𝟸𝟷𝟸𝟷𝟸𝟷𝟸 𝟷𝟶 𝟻 𝟷𝟸𝟷𝟸𝟷𝟸𝟷𝟸𝟷𝟸𝟷 𝟹𝟼𝟹𝟻
5 𝟹𝟼𝟹𝟻 𝟷𝟸𝟷𝟸𝟷𝟸𝟷𝟸𝟷𝟸𝟷 𝟺 𝟽 𝟷𝟸𝟷𝟸𝟷𝟸𝟷𝟸𝟷𝟸𝟷 𝟹𝟻𝟹𝟼
6 𝟼𝟹𝟻𝟹 𝟷𝟸𝟷𝟸𝟷𝟸𝟷𝟸𝟷𝟸 𝟷𝟷 𝟷𝟶 𝟷𝟷𝟷𝟷𝟹𝟷𝟷𝟷𝟹𝟷 𝟽𝟾𝟺𝟺
7 𝟹𝟻𝟹𝟼 𝟷𝟸𝟷𝟸𝟷𝟸𝟷𝟸𝟷𝟸𝟷 𝟻 𝟹 𝟷𝟷𝟷𝟹𝟷𝟷𝟹𝟷𝟷𝟹 𝟸𝟻𝟷
8 𝟺𝟺𝟽𝟾 𝟷𝟷𝟷𝟹𝟷𝟷𝟷𝟹𝟷𝟷𝟷 𝟷 𝟷𝟷 𝟷𝟷𝟷𝟹𝟷𝟷𝟷𝟹𝟷𝟷 𝟾𝟺𝟺𝟽
9 𝟺𝟽𝟾𝟺 𝟷𝟷𝟹𝟷𝟷𝟷𝟹𝟷𝟷𝟷𝟹 𝟸 𝟷 𝟷𝟷𝟹𝟷𝟷𝟹𝟷𝟷𝟹𝟷 𝟻𝟷𝟸
10 𝟽𝟾𝟺𝟺 𝟷𝟷𝟷𝟷𝟹𝟷𝟷𝟷𝟹𝟷 𝟼 𝟺 𝟷𝟸𝟷𝟸𝟷𝟸𝟷𝟸𝟷𝟸 𝟻𝟹𝟼𝟹
11 𝟾𝟺𝟺𝟽 𝟷𝟷𝟷𝟹𝟷𝟷𝟷𝟹𝟷𝟷 𝟾 𝟼 𝟷𝟸𝟷𝟸𝟷𝟸𝟷𝟸𝟷𝟸 𝟼𝟹𝟻𝟹
Figure 3: The conjugate array CA𝒯 of our running example 𝒯={𝟻𝟷𝟸,𝟻𝟹𝟼𝟹,𝟺𝟺𝟽𝟾}.

We define the conjugate range CR𝒯(P) of a pattern PΣ of length m in 𝒯 as a maximal interval [..r][1..n] such that PC𝒯(CA𝒯[i])ω[..m] for every i[..r]. Leveraging Lemma 2.2, we find that the conjugate range is well-defined.

Corollary 3.5.

Let 𝒯={T1,,Td}Σ$+, n=|T1Td|, PΣ, and m=|P|. Then PC𝒯(CA𝒯[i])ω[..m] if and only if iCR𝒯(P).

We have CR𝒯(ε)=[1..n] because the empty string is a prefix of every conjugate. The computation of CR𝒯(P) for some pattern PΣ on indexes related to the FM-index usually perform an algorithmic technique called the backward search. The backward search for P in 𝒯 processes P backwards in an online manner and refines CR𝒯(P[i+1..]) to CR𝒯(P[i..]) at the i-th step on reading P[i] with P[|P|+1..]=ϵ. Finally, the length of CR𝒯(P) is the solution to Count by Corollary 3.5. For our running example, CR𝒯(𝟼𝟺𝟹)= and CR𝒯(𝟻𝟼𝟹𝟺)=[6..7]. Below we define the necessary tools to allow for an efficient backward search.

3.2 LF-mapping

We want to define a map that allows us to cycle backwards through each of the texts to be indexed by mapping to the position of a conjugate in the conjugate array from the position of its left rotation. However, if we want to represent this mapping space-efficiently, we have to be careful since we used tie-breaks within texts when we sorted the conjugates by a preorder. Our idea is to relax the requirements and define a map that allows us to cycle backwards through a text that is ω-equal to the original to dodge any issues arising from our tie-breaks. For that, we want to cycle backwards in text order inside the roots of each ..r-encoded text. Whenever we want to move backwards at the starting position of a root, we jump to its end position. We express this backwards movement with the following permutation. For every i[1..n] with j=min{h[1..d]k=1hnki}, let

prev𝒯(i)={i1+root(Tjr)if C𝒯(i)=ωTj,i1otherwise.

For our running example, prev𝒯=(1 3 2)(4 5)(6 7)(8 11 10 9). Here, we observe that we have two cycles (4 5) and (6 7) corresponding to T2[1..2]=𝟻𝟹 and T2[3..4]=𝟼𝟹, respectively, which are ω-equal to the original text T2=𝟻𝟹𝟼𝟹. Now we are ready to express the LF-mapping in terms of the function prev𝒯. The LF-mapping LF𝒯 of 𝒯 is a string of length n over [1..n] such that LF𝒯[i]=CA𝒯1[prev𝒯(CA𝒯[i])] for each i[1..n]. Since the LF-mapping is a permutation, it has an inverse LF𝒯1, which we call the FL-mapping of 𝒯 and denote by FL𝒯. See Figure 5 for an example. The LF-mapping is at the core of the backward search. However, storing LF-mapping and FL-mapping in their plain form creates the need for two integer arrays of length n and entries of lgn bits, which motivates the following encoding scheme. Let the rotational Cartesian tree signature encoding V of VΣ$+ denote a string of length |V| over Σ$ such that

V[i]={V[i]if V[i]=$,rank(Rot(V,i),|V|)rank(V[i]Rot(V,i)[2..],|V|)otherwise,

for each i[1..|V|], i.e., if V[i]$, then V[i] reports the number of positions j[1..|V|] such that Rot(V,i)[j]V[i] and Rot(V,i)[j]=. See Figure 4 for an example. Note that V[i..select$(V,1)]=V[i..select$(V,1)] for each VΣ$+ satisfying rank$(V,|V|)1 and i[1..select$(V,1)].

Lemma 3.6.

Let VΣ$+. Then i=1jV[i]|V|, where j=select$(V,1)1 if rank$(V,|V|)1, and j=|V| otherwise.

i T3[i] Rot(T3,i) Rot(T3,i) T3[i]Rot(T3,i) T3[i]
1 𝟺 𝟺𝟽𝟾𝟺 𝟷𝟷𝟹 𝟷𝟷𝟷𝟹 𝟷
2 𝟺 𝟽𝟾𝟺𝟺 𝟷𝟷 𝟷𝟷𝟹𝟷 𝟸
3 𝟽 𝟾𝟺𝟺𝟽 𝟷𝟷 𝟷𝟷𝟷 𝟷
4 𝟾 𝟺𝟺𝟽𝟾 𝟷𝟷𝟷 𝟷𝟷𝟷 𝟶
Figure 4: Rotational Cartesian tree signature encoding T3 of T3=𝟺𝟺𝟽𝟾.

Taking advantage of both encodings, we investigate how the ω-preorder of two strings changes if we rotate them. For convenience, we borrowed the following notation from literature [14, 11]. Let π(V)=V[1] for every VΣ$+, and lcp(U,W)=rank(U,lcp(U,W)) for each U,WΣ$. For example, π(𝟺𝟺𝟽𝟾)=1 and lcp(𝟺𝟺𝟽𝟾,𝟽𝟾𝟺𝟺)=1.

Lemma 3.7 ([17, Lemma 3]).

Let V,UΣ$+ such that Rot(V,1)ωRot(U,1). Then VωU if and only if (a) π(V)=$ or (b) π(V)min{e,π(U)} and π(U)$, where e=lcp(Rot(V,1),Rot(U,1)).

Proof.

We show the statement for min{π(V),π(U)}Σ since it is non-trivial. Let z=max{|V|,|U|} and λ=lcp(Rot(V,1)ω[..3z],Rot(U,1)ω[..3z]). Then Rot(V,1)ω[..3z]<Rot(U,1)ω[..3z] by Corollary 3.4, λ<3z, and select(Rot(V,1),e)min{|V|,|U|}.

()

Assume π(V)<min{e,π(U)}. Let i=select(V,2). Then V[..i1]=U[..i1] and V[i]=i1=U[i], i.e. U<V. Consequently, Uω[..3z]<Vω[..3z], which implies UωV by Corollary 3.4. The statement follows by contraposition.

()

We exhaust all possible cases for π(V)min{e,π(U)}.

Case 1.

Assume min{e,π(V)}>π(U). Let i=select(U,2). Then V[..i1]=U[..i1] and V[i]=i1=U[i], i.e. V<U. Consequently, Vω[..3z]<Uω[..3z], which implies VωU by Corollary 3.4.

Case 2.

Assume min{π(V),π(U)}e. Then we have Vω[..3z][..λ+1]=Uω[..3z][..λ+1]. Since Rot(V,1)ωRot(U,1), λ+23z by Lemma 3.3. If Rot(U,1)ω[..3z][λ+1]=, Vω[..3z][λ+2]λ<λ+1=Uω[..3z][λ+2]. If Rot(U,1)ω[..3z][λ], then Vω[..3z][λ+2]=Rot(V,1)ω[..3z][λ+1]<Rot(U,1)ω[..3z][λ+1]=Uω[..3z][λ+2]. Thus, VωU by Corollary 3.4.

Case 3.

Assume e>π(V)=π(U). Then Vω[..3z][..λ+1]=Uω[..3z][..λ+1] and λz. Moreover, Vω[..3z][λ+2]=Rot(V,1)ω[..3z][λ+1]<Rot(U,1)ω[..3z][λ+1]=Uω[..3z][λ+2]. Hence, VωU by Corollary 3.4.

Our representation of the LF- and FL-mapping consists of two strings L𝒯 and F𝒯, which are defined as follows. First, L𝒯 is the string of length n over Σ$ such that L𝒯[i]=π(C𝒯(CA𝒯[LF𝒯[i]]))=C𝒯(CA𝒯[i])[|C𝒯(CA𝒯[i])|] for each i[1..n].

Corollary 3.8.

Let 𝒯Σ$+, n the accumulated length of all texts in 𝒯, i,j[1..n], and i<j. If L𝒯[i]=L𝒯[j], then LF𝒯[i]<LF𝒯[j].

Second, F𝒯 is the string of length n over Σ$ such that F𝒯[LF𝒯[i]]=L𝒯[i] for each i[1..n]. By what follows, L𝒯 and F𝒯 suffice to compute both LF- and FL-mapping of 𝒯.

Corollary 3.9.

Let 𝒯Σ$+, n the accumulated length of all texts in 𝒯, and i[1..n]. Then LF𝒯[i]=selectL𝒯[i](F𝒯,rankL𝒯[i](L𝒯,i)) and FL𝒯[i]=selectF𝒯[i](L𝒯,rankF𝒯[i](F𝒯,i)).

In Figure 5 we present F𝒯 and L𝒯 of our running example.

 Remark 3.10.

Let d=1, rank$(T1,n1)=1 and T1[n1]=$. If we substitute the occurrence of $ in F𝒯 and L𝒯 for 1, then the modified strings constitute the integer-based representation of Kim and Cho’s index [17, Section 4].

3.3 Backward Search

The LF-mapping can be leveraged for the backward search by the following result, which is due to Lemma 2.2 and Corollary 3.5.

Lemma 3.11 ([17, Lemma 6]).

Let 𝒯Σ$+, PΣ+, |P|=m, i[1..m], h=π(P[i..]$) and e=rank(P[i..],mi+1). For jCR𝒯(P[i+1..]), LF𝒯[j]CR𝒯(P[i..]) if and only if (a) e>1 and L𝒯[j]=h or (b) e=1 and L𝒯[j]h.

At this point it is straight-forward to apply the techniques developed by Kim and Cho [17, Sections 5 and 6] to define a static index of 𝒯Σ+ that occupies 3n+o(n) bits of space and that solves Count in O(m) time, where m is the pattern length. However, for brevity and in view of a space efficient construction of our proposed index and its extension, we will represent F𝒯 and L𝒯 by the dynamic data structure of Lemma 2.1, and introduce an auxiliary string for the backward search. Let LCP𝒯 denote a string of length n over Σ such that LCPT[1]=0 and LCP𝒯[i]=lcp(C𝒯(CAT[i]),C𝒯(CAT[i1])) for each i[2..n].

Lemma 3.12.

Let 𝒯={T1,,Td}Σ$+, n=|T1Td|, i,j[1..n], and i<j. Then lcp(C𝒯(CA𝒯[i]),C𝒯(CA𝒯[j]))=min{LCP𝒯[k]k[i+1..j]}=RNVLCP𝒯(i+1,j,1).

Lemma 3.13.

Let 𝒯Σ$+, PΣ+, m=|P|, i[1..m], [..r]=CR𝒯(P[i+1..]), h=π(P[i..]$), and e=rank(P[i..],mi+1). Then Algorithm 1 correctly computes [..r]=CR𝒯(P[i..]).

Algorithm 1 Computing the conjugate range CR𝒯(P[i..]). Here, 𝒯Σ$+, PΣ+, m=|P|, i[1..m], [..r]=CR𝒯(P[i+1..]), h=π(P[i..]$), and e=rank(P[i..],mi+1).

Proof.

Let c=|CR𝒯(P[i..])|. We show how Algorithm 1 computes c and r to obtain [..r].

Case 1.

Assume e>1. This case is handled from Line 3 through Line 5. By Lemma 3.11, LF𝒯[j]CR𝒯(P[i..]) if and only if L𝒯[j]=h for each j[..r]. We compute c in Line 4 and return an empty interval in Line 13 due to Line 2 if c0. Assume c1. We compute the largest x[..r] such that L𝒯[x]=h, and then r=LF𝒯[x] in Line 5, with correctness following by Corollary 3.8.

Case 2.

Assume e=1. This case is handled from Line 6 through Line 12. By Lemma 3.11, LF𝒯[j]CR𝒯(P[i..]) if and only if L𝒯[j]h for each j[..r]. Thus, we correctly compute c in Line 7, and return an empty interval in Line 13 due to Line 2 if c0. Assume c1. We compute the lowest value v in L𝒯[..r] greater than h1 in Line 8 and determine the largest x[..r] such that L𝒯[x]=v in Line 9. Let [′′..r′′] denote the interval computed in Line 10. Then lcp(C𝒯(CA𝒯[j]),C𝒯(CA𝒯[k]))v+1 for each j,k[′′..r′′] by Lemma 3.12. We compute y=LF𝒯[x]. The following statements hold due to Lemma 3.7.

  • If j[..x1] satisfies L𝒯[j]h, then LF𝒯[j]<LF𝒯[x] since vL𝒯[j].

  • If j[x+1..r′′] satisfies L𝒯[j]h, then LF𝒯[j]<LF𝒯[x] since v<L𝒯[j] and v<v+1lcp(C𝒯(CA𝒯[j]),C𝒯(CA𝒯[x])).

  • If j[r′′+1..r] satisfies L𝒯[j]h, then LF𝒯[j]>LF𝒯[x] since we have vlcp(C𝒯(CA𝒯[j]),C𝒯(CA𝒯[x])).

We apply these results to compute y in Line 11, and then infer r=LF𝒯[x]+c(y+1) in Line 12.

The next result is obtained from the computation of Demaine and colleagues’ [3] Cartesian tree signature encoding of the given string.

Lemma 3.14.

Given PΣ$ of length m, we can process P in O(m) time such that we can subsequently compute π(P[i..]$) and rank(P[i..],mi+1) in O(1) time, for every i[1..m].

Theorem 3.15.

Let 𝒯={T1,,Td}Σ$+ and n=|T1Td|. There exists a data structure occupying O(nlgσ) bits of space that solves Count in O(mlgσlgnlglgn) time, where m is the pattern length.

Proof.

We represent F𝒯, L𝒯 and LCP𝒯 by the data structure of Lemma 2.1, which leads to the claimed space complexity. We preprocess a pattern PΣ+ of length m with Lemma 3.14, and compute CR𝒯(P[i..]) from CR𝒯(P[i+1..]) for each i[1..m] in descending order leveraging Lemma 3.13. Since each conjugate range update takes O(1) queries, the claimed complexity for solving Count follows from Lemma 2.1. We call the data structure of Theorem 3.15 the cBWT index of 𝒯. The cBWT index of the running example is presented in Figure 5.

i CA𝒯[i] C𝒯(CA𝒯[i]) LF𝒯[i] F𝒯[i] L𝒯[i] LCP𝒯[i] C𝒯(CA𝒯[i])
1 𝟾 𝟺𝟺𝟽𝟾 𝟾 𝟷 𝟶 𝟶 𝟷𝟷𝟷
2 𝟿 𝟺𝟽𝟾𝟺 𝟷 𝟸 𝟷 𝟷 𝟷𝟷𝟹
3 𝟸 𝟷𝟸𝟻 𝟿 𝟸 𝟶 𝟷 𝟷𝟷
4 𝟻 𝟹𝟼𝟹𝟻 𝟷𝟶 𝟸 𝟶 𝟷 𝟷𝟸𝟷
5 𝟽 𝟹𝟻𝟹𝟼 𝟷𝟷 𝟸 𝟶 𝟷 𝟷𝟸𝟷
6 𝟷𝟶 𝟽𝟾𝟺𝟺 𝟸 𝟷 𝟸 𝟷 𝟷𝟷
7 𝟹 𝟸𝟻𝟷 𝟹 𝟷 𝟸 𝟸 𝟷
8 𝟷𝟷 𝟾𝟺𝟺𝟽 𝟼 𝟶 𝟷 𝟷 𝟷𝟷
9 𝟷 𝟻𝟷𝟸 𝟽 𝟶 𝟷 𝟸 𝟷
10 𝟺 𝟻𝟹𝟼𝟹 𝟺 𝟶 𝟸 𝟸 𝟷𝟸
11 𝟼 𝟼𝟹𝟻𝟹 𝟻 𝟶 𝟸 𝟸 𝟷𝟸
Figure 5: The cBWT index of our running example 𝒯={𝟻𝟷𝟸,𝟻𝟹𝟼𝟹,𝟺𝟺𝟽𝟾}.

4 Construction in 𝑶(𝒏𝐥𝐠𝝈) Bits of Space

We will show how to construct the cBWT index of a single text, and how an existing cBWT index can be extended to also index an additional text. We then leverage these two results to iteratively construct the cBWT index of any set of texts, adding one new text per iteration step.

4.1 Single Text cBWT Index

Before we can tackle the construction of the cBWT index of a single text, we need another technical result, which follows from an examination of the definitions and Lemma 3.7.

Lemma 4.1.

Let V,UΣ$+, Rot(V,1)ωRot(U,1), and e=lcp(Rot(V,1),Rot(U,1)). Then

lcp(V,U)={0if π(U)=$ or π(V)=$,eπ(V)+1if VωU and $π(V)=π(U)<e,1otherwise.

We will first show how to construct the cBWT index of a text from Σ$+ in which $ occurs exactly once, and then show how to construct the cBWT index of an arbitrary text from Σ+ (without the requirement on $).

Lemma 4.2.

Let RΣ$+, ρ=|R|2, R[ρ]=$, and rank$(R,ρ)=1. Algorithm 2 correctly computes CA{bR}1[1] and the cBWT index of {bR} for each bΣ.

Algorithm 2 Computing c+1=CA{bR}1[1] and updating the cBWT of {R} to that of {bR} for bΣ. Here, RΣ$+, ρ=|R|, R[ρ]=$, rank$(R,ρ)=1, and y=CA{R}1[1].

Proof.

For each i[0..π(bR)], let Ji=[i..ri] maximal such that riCA{R}1[1] and lcp(R,C{R}(CA{R}[j]))=i for each jJi, and Jπ(bR)+1=[π(bR)+1..rπ(bR)+1] maximal such that lcp(R,C{R}(CA{R}[j]))π(bR)+1 for each jJπ(bR)+1, i.e., the left and right boundary of the former are computed in Line 6 and Line 5, respectively, and the latter in Line 2. The following statements are due to the location of the single $ in each conjugate of R and Lemma 3.7.

  • If j[rπ(bR)+1+1..ρ], then bRωC{R}(CA{R}[LF{R}[j]]).

  • If j[CA{R}1[1]+1..rπ(bR)+1], then C{R}(CA{R}[LF{R}[j]])ωbR if and only if L{R}[j]π(bR)+1.

  • If j=CA{R}1[1], then C{R}(CA{R}[LF{R}[j]])=($R)[..ρ]ωbR.

  • If j[π(bR)+1..CA{R}1[1]1], then C{R}(CA{R}[LF{R}[j]])ωbR if and only if L{R}[j]π(bR).

  • If i[0..π(bR)] and jJi, then C{R}(CA{R}[LF{R}[j]])ωbR if and only if L{R}[j]i.

We apply these results from Line 2 through Line 6 to compute the number c of conjugates of R smaller than bR according to ω-preorder. Since C{R}(k)[..select$(C{R}(k),1)]=C{bR}(k+1)[..select$(C{R}(k),1)] for each k[1..ρ] by assumption on R and b,

C{bR}(CA{bR}[i])[..x]={C{R}(CA{R}[i])[..x]if i[1..c],bRif i=c+1,C{R}(CA{R}[i1])[..x]otherwise,

for each i[1..ρ+1] with x=select$(C{bR}(CA{bR}[i]),1). Thus, c is also the number of conjugates of bR smaller than bR according to ω-preorder. Subsequently, Algorithm 2 updates the cBWT index of {R} from Line 8 through Line 17. By assumption on b and R, LCP{R}[..c]=LCP{bR}[..c] and LCP{R}[c+2..]=LCP{bR}[c+3..]. From Line 8 through Line 14 we leverage Lemma 3.12 and Lemma 4.1 to compute LCP{bR}[c+1] and, if c<ρ, LCP{bR}[c+2], and update LCP{R}. By assumption on R, π(C{R}(j))=π(C{bR}(j+1)) for each j[1..ρ]. Consequently, F{R}[..c]=F{bR}[..c] and F{R}[c+1..]=F{bR}[c+2..]. Moreover, if we set L{R}[CA{R}1[1]]=π(bR), then L{R}[..c]=L{bR}[..c] and L{R}[c+1..]=L{bR}[c+2..]. It remains to compute F{bR}[c+1] and L{bR}[c+1], which are π(bR) and $, respectively. The update of both F{R} and L{R} is done from Line 15 through Line 17.

Corollary 4.3.

Let RΣ$+, ρ=|R|, R[ρ]=$, and rank$(R,ρ)=1. The cBWT index of {R} can be constructed in O(ρlgσ) bits of space and O(ρlgσlgρlglgρ) time.

Proof.

We show the case where ρ2. We represent F{R[ρ1..]}=$𝟶, L{R[ρ1..]}=𝟶$ and LCP{R[ρ1..]}=𝟶𝟶 by the data structure of Lemma 2.1. Initially, CA{R[ρ1..]}1[1]=2. We preprocess R with Lemma 3.14 in O(ρ) time to have access to π(R[i..])=π(Rot(R,i1)) for each i[1..ρ] in O(1) time. For each i[1..ρ2] in descending order, we apply Algorithm 2 to compute CA{R[i..]}1[1] and extend the cBWT index of {R[i+1..]} to that of {R[i..]}. Since the extension of the cBWT index of {R[i+1..]} to that of {R[i..]} takes O(π(R[i..])) queries for each i[1..ρ2], the construction of the cBWT index of {R[1..]}={R} takes a total of O(ρ) queries by Lemma 3.6. The claimed complexities then follow by Lemma 2.1.

 Remark 4.4.

Due to Remark 3.10, the previous statement yields the integer-based representation of Kim and Cho’s index [17, Section 4] by substituting $ with 1 in both F𝒯 and L𝒯. Thus, we can construct their compact index [17, Section 5] directly [17, Section A.4] while retaining the complexities as stated in Corollary 4.3.

Corollary 4.5.

Let TΣ+ and |T|=n. Then the cBWT index of {T} can be constructed in O(nlgσ) bits of space and O(nlgσlgnlglgn) time.

Proof.

Let R=T4$. We construct the cBWT index of {R} within the claimed complexities by Corollary 4.3. During construction, we build a bit string Y of length 4n+1 such that Y[i]=1 if and only if CA{R}[i][2..n+1]. By choice of R, C{T}(i)ω[..3n]=C{R}(i)[..3n] for each i[1..n], and C{T}(1)ω[..3n]=C{R}(n+1)[..3n]. The following statements hold for each i[1..n] by choice of Y and Lemma 3.3.

  • C{T}(CA{T}[i])ω[..3n]=C{R}(CA{R}[select1(Y,i)])[..3n].

  • π(C{T}(CA{T}[i]))=π(C{R}(CA{R}[select1(Y,i)])).

  • π(C{T}(CA{T}[LF{T}[i]]))=π(C{R}(CA{R}[LF{R}[select1(Y,i)]])).

Thus, F{T}[i]=F{R}[select1(Y,i)] and L{T}[i]=L{R}[select1(Y,i)] for each i[1..n]. Moreover, LCP{T}[i]=RNVLCP{R}(select1(Y,i1)+1,select1(Y,i),1) for each i[2..n] due to Lemma 3.12, and LCP{T}[1]=0. Hence, we can transform the cBWT index of {R} to index {T} with O(n) queries and operations. The claimed complexities now follow by Lemma 2.1.

4.2 Extending an Existing cBWT Index

Let Σ+ be a non-empty set of strings and ρ=|CA| denote the accumulated length of all strings in . In this subsection, we assume we have the cBWT index of at hand, and want to extend it by another text SΣ+ of length λ to index 𝒮={S}. To achieve this, we compute the integers cnt(Rot(S,i)), plcp(Rot(S,i)) and slcp(Rot(S,i)) for each i[1..λ], whose definitions follow. Let cnt(V)=|{i[1..ρ]C(CA[i])ωV}|,

plcp(V)={1if cnt(V)=0,lcp(C(CA[cnt(V)]),V)otherwise, andslcp(V)={lcp(C(CA[cnt(V)+1]),V)if cnt(V)<ρ,1otherwise,

for each VΣ$+. We can apply the techniques exhibited in Lemma 4.2 to compute the helper values.

Algorithm 3 Computing c=cnt(V), p=plcp(V) and s=slcp(V). Here, Σ+, ρ=|CA|, VΣ$+, rank(V,|V|)1, and y=cnt(Rot(V,1)).
Lemma 4.6.

Let Σ+, ρ=|CA|, and VΣ$+ with rank(V,|V|)1. Then Algorithm 3 correctly computes cnt(V), plcp(V) and slcp(V).

Proof.

Algorithm 3 takes π(V), y=cnt(Rot(V,1)), plcp(Rot(V,1)), slcp(Rot(V,1)), and the cBWT index of as input. If π(V)=$, then we return cnt(V)=0, plcp(V)=1 and slcp(V)=0 in Line 2 since Σ+. Thus, assume π(V)>$. For each i[0..π(V)], let Ji=[i..ri] maximal such that riy and lcp(V,C(CA[j]))=i for each jJi. The following statements are due to Lemma 3.7 and Σ+.

  • If j[y+1..ρ], then C(CA𝒯[LF[j]])ωV if and only if L[j]π(V)+1 and lcp(C(CA𝒯[i]),Rot(V,j))π(V)+1.

  • If j[rπ(V)+1..y], then C(CA𝒯[LF[j]])ωV if and only if L[j]π(V).

  • If i[0..π(V)] and jJi, then C(CA𝒯[LF[j]])ωV if and only if L𝒯[j]i.

We apply these results from Line 3 through Line 10 to compute cnt(V). Leveraging Lemma 4.1 and (the proof of) Lemma 3.12, we then compute plcp(V) and slcp(V) from Line 11 through Line 16 and from Line 17 through Line 23, respectively.

Lemma 4.7.

Let Σ+, ρ=|CA|, VΣ$+, and rank(V,|V|)1. If π(V)$, then cnt(V), plcp(V) and slcp(V) can be computed from π(V), cnt(Rot(V,1)), plcp(Rot(V,1)) and slcp(Rot(V,1)) in O((1+π(V))lgσlgρlglgρ) time with the cBWT index of .

Proof.

We apply Algorithm 3. Correctness follows from Lemma 4.6, and the time complexity is due to the loop of Algorithm 3 in Line 7 and Lemma 2.1.

For the iterative construction, we augment the cBWT index by a dynamic bit string E with the invariant that E is zeroed before and after each extension with a new string. We use E to temporarily mark modified parts in the cBWT such that we retain the functionality of the index even during construction. The length of E is the number ρ of characters indexed by the cBWT index, which we call in the next lemma the augmented cBWT index for clarity.

Lemma 4.8.

Let Σ+, ρ=|CA|, SΣ+, 𝒮={S}, λ=|S|, and z=max{|V|V𝒮}. The augmented cBWT index of can be extended to index 𝒮 in O((λ+ρ)lgσ) bits of space and O(zlgσlg(λ+ρ)lglg(λ+ρ)) time.

Proof.

We compute the cBWT index of {S} within the claimed complexities by Corollary 4.5. Then we compute cnt(Rot(S,i)), plcp(Rot(S,i)) and slcp(Rot(S,i)) for each i[1..λ].

Case 1.

Assume Rot(S,i)=ωR for some R and i[1..λ]. Then cnt(Rot(S,i))=r, plcp(Rot(S,i))=rank(Rot(S,i),λ), and, if r<ρ, slcp(Rot(S,i))=LCP[r+1] by Lemma 3.3, where i[1..λ] and r is the right boundary of CR(Rot(S,i)ω[..3z]). Since rank(Rot(S,i1),λ)=rank(Sω[i..4z],4zi+1) for each i[2..λ+1], the last λ steps of the backward search for Sω[2..4z] yield the necessary values to compute cnt(Rot(S,i)), plcp(Rot(S,i)) and slcp(Rot(S,i)) for all i[1..λ] in descending order within the claimed complexities by Theorem 3.15.

Case 2.

Assume Rot(S,i)ωR for each R and i[1..λ]. Let V=Sω[..4z]$. By assumption and Corollary 3.4, cnt(Rot(S,i))=cnt(Rot(V,i)), plcp(Rot(S,i))=plcp(Rot(V,i)), and slcp(Rot(S,i))=slcp(Rot(V,i)) for each i[1..λ]. We leverage Lemma 3.14 to preprocess V within the claimed complexities. Since cnt(Rot(V,4z))=0, plcp(Rot(V,4z))=1 and slcp(Rot(V,4z))=0 by construction, we can now use Lemma 4.7 to compute cnt(Rot(V,j)), plcp(Rot(V,j)) and slcp(Rot(V,j)) for each j[1..4z1] in descending order. Hence, we obtain cnt(Rot(S,i)), plcp(Rot(S,i)) and slcp(Rot(S,i)) for all i[1..λ] in descending order within the claimed complexities by Lemma 3.6.

Leveraging the backward search on the cBWT index of {S}, we store the plcp-values and slcp-values in lex-order, which takes O(λlgσ) bits of space. To store the cnt-values in lex-order and stay within the claimed time and space complexity, we utilize E, which we assume to be represented by the data structure of Lemma 2.1. For each i[1..λ] in descending order, we call insertE(select0(E,cnt(Rot(S,i)))+1,1), and discard cnt(Rot(S,i)) once cnt(Rot(S,i1)) has been computed. Then cnt(C{S}(CA{S}[i]))=rank0(E,select1(E,i)). The following statements for each i[1..λ+ρ] are due to construction.

  • If E[i]=0, then F𝒮[i]=F[rank0(E,i)] and L𝒮[i]=L[rank0(E,i)].

  • If E[i]=1, then F𝒮[i]=F{S}[rank1(E,i)] and L𝒮[i]=L{S}[rank1(E,i)].

  • LCP𝒮[1]=0.

  • If i2, E[i1]=1 and E[i]=1, then LCP𝒮[i]=LCP{S}[rank1(E,i)].

  • If i2, E[i1]=1 and E[i]=0, then LCP𝒮[i]=slcp(C{S}(CA{S}[rank1(E,i)])).

  • If i2, E[i1]=0 and E[i]=1, then LCP𝒮[i]=plcp(C{S}(CA{S}[rank1(E,i)])).

  • If i2, E[i1]=0 and E[i]=0, then LCP𝒮[i]=LCP[rank0(E,i)].

Consequently, O(λ) queries and operations are needed to update the cBWT index of to index 𝒮. Finally, we zero E. The claimed complexities then follow from Lemma 2.1.

We are finally able to state the main result of this section.

Theorem 4.9.

Let 𝒯={T1,,Td}Σ+ and n=|T1Td|. The cBWT index of 𝒯 can be constructed in O(nlgσ) bits of space and O(nlgσlgnlglgn) time.

Proof.

Let |Tk1||Tk| for each k[2..d], and let Ej denote a zeroed bit string of length |T1Tj| for each j[1..d]. First, we construct E1 and the cBWT index of {T1} within the claimed complexities by Lemma 4.5, where each string is represented by the data structure of Lemma 2.1. Subsequently, we iteratively extend the cBWT index of {T1,,Tk1} augmented by Ek1 to the cBWT index of {T1,,Tk} augmented by Ek for each k[2..d] in ascending order leveraging Lemma 4.8. Then the cBWT index of 𝒯 is constructed in O((n+|Td|)lgσ)=O(nlgσ) bits of space and O((k=1d|Tk|)lgσlgnlglgn)=O(nlgσlgnlglgn) time.

References

  • [1] Bastien Auvray, Julien David, Richard Groult, and Thierry Lecroq. Approximate cartesian tree matching: An approach using swaps. In Proc. SPIRE, volume 14240 of LNCS, pages 49–61, 2023. doi:10.1007/978-3-031-43980-3_5.
  • [2] Christina Boucher, Davide Cenzato, Zsuzsanna Lipták, Massimiliano Rossi, and Marinella Sciortino. r-indexing the eBWT. In Proc. SPIRE, volume 12944 of LNCS, pages 3–12, 2021. doi:10.1007/978-3-030-86692-1_1.
  • [3] Erik D. Demaine, Gad M. Landau, and Oren Weimann. On Cartesian trees and range minimum queries. Algorithmica, 68(3):610–625, 2014. doi:10.1007/s00453-012-9683-x.
  • [4] Simone Faro, Thierry Lecroq, Kunsoo Park, and Stefano Scafiti. On the longest common Cartesian substring problem. The Computer Journal, 66(4):907–923, 2022. doi:10.1093/COMJNL/BXAB204.
  • [5] Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proc. FOCS, pages 390–398, 2000. doi:10.1109/SFCS.2000.892127.
  • [6] Johannes Fischer. Optimal succinctness for range minimum queries. In Proc. LATIN, volume 6034 of LNCS, pages 158–169, 2010. doi:10.1007/978-3-642-12200-2_16.
  • [7] Peter Foster, Anssi Klapuri, and Simon Dixon. A method for identifying repetition structure in musical audio based on time series prediction. In Proc. EUSIPCO, pages 1299–1303. IEEE, 2012. URL: https://ieeexplore.ieee.org/document/6334323/.
  • [8] Tak-Chung Fu, Korris Fu-Lai Chung, Robert Wing Pong Luk, and Chak-man Ng. Stock time series pattern matching: Template-based vs. rule-based approaches. Eng. Appl. Artif. Intell., 20(3):347–364, 2007. doi:10.1016/J.ENGAPPAI.2006.07.003.
  • [9] Mitsuru Funakoshi, Takuya Mieno, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Computing maximal palindromes in non-standard matching models. In Proc. IWOCA, volume 14764 of LNCS, pages 165–179, 2024. doi:10.1007/978-3-031-63021-7_13.
  • [10] Paweł Gawrychowski, Samah Ghazawi, and Gad M. Landau. On indeterminate strings matching. In Proc. CPM, volume 161 of LIPIcs, pages 14:1–14:14, 2020. doi:10.4230/LIPICS.CPM.2020.14.
  • [11] Daiki Hashimoto, Diptarama Hendrian, Dominik Köppl, Ryo Yoshinaka, and Ayumi Shinohara. Computing the parameterized Burrows–Wheeler transform online. In Proc. SPIRE, volume 13617 of LNCS, pages 70–85, 2022. doi:10.1007/978-3-031-20643-6_6.
  • [12] Yuzuru Hiraga. Structural recognition of music by pattern matching. In Proc. ICMC. Michigan Publishing, 1997. URL: https://hdl.handle.net/2027/spo.bbp2372.1997.113.
  • [13] Wing-Kai Hon, Tsung-Han Ku, Chen-Hua Lu, Rahul Shah, and Sharma V. Thankachan. Efficient algorithm for circular Burrows–Wheeler transform. In Proc. CPM, volume 7354 of LNCS, pages 257–268, 2012. doi:10.1007/978-3-642-31265-6_21.
  • [14] Kento Iseri, Tomohiro I, Diptarama Hendrian, Dominik Köppl, Ryo Yoshinaka, and Ayumi Shinohara. Breaking a barrier in constructing compact indexes for parameterized pattern matching. In Proc. ICALP, volume 297 of LIPIcs, pages 89:1–89:19, 2024. doi:10.4230/LIPICS.ICALP.2024.89.
  • [15] Natsumi Kikuchi, Diptarama Hendrian, Ryo Yoshinaka, and Ayumi Shinohara. Computing covers under substring consistent equivalence relations. In Proc. SPIRE, volume 12303 of LNCS, pages 131–146, 2020. doi:10.1007/978-3-030-59212-7_10.
  • [16] Jinil Kim, Peter Eades, Rudolf Fleischer, Seok-Hee Hong, Costas S. Iliopoulos, Kunsoo Park, Simon J. Puglisi, and Takeshi Tokuyama. Order-preserving matching. Theor. Comput. Sci., 525:68–79, 2014. doi:10.1016/J.TCS.2013.10.006.
  • [17] Sung-Hwan Kim and Hwan-Gue Cho. A compact index for Cartesian tree matching. In Proc. CPM, volume 191 of LIPIcs, pages 18:1–18:19, 2021. doi:10.4230/LIPIcs.CPM.2021.18.
  • [18] Sungmin Kim and Yo-Sub Han. Approximate Cartesian tree pattern matching. In Proc. DLT, volume 14791 of LNCS, pages 189–202, 2024. doi:10.1007/978-3-031-66159-4_14.
  • [19] Marcin Kubica, Tomasz Kulczyński, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. A linear time algorithm for consecutive permutation pattern matching. Inf. Process. Lett., 113(12):430–433, 2013. doi:10.1016/J.IPL.2013.03.015.
  • [20] Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino. An extension of the Burrows–Wheeler transform. Theor. Comput. Sci., 387(3):298–312, 2007. doi:10.1016/j.tcs.2007.07.014.
  • [21] Yoshiaki Matsuoka, Takahiro Aoki, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Generalized pattern matching and periodicity under substring consistent equivalence relations. Theor. Comput. Sci., 656:225–233, 2016. doi:10.1016/j.tcs.2016.02.017.
  • [22] Gonzalo Navarro and Kunihiko Sadakane. Fully functional static and dynamic succinct trees. ACM Trans. Algorithms, 10(3):16:1–16:39, 2014. doi:10.1145/2601073.
  • [23] Akio Nishimoto, Noriki Fujisato, Yuto Nakashima, and Shunsuke Inenaga. Position heaps for Cartesian-tree matching on strings and tries. In Proc. SPIRE, volume 12944 of LNCS, pages 241–254, 2021. doi:10.1007/978-3-030-86692-1_20.
  • [24] Tsubasa Oizumi, Takeshi Kai, Takuya Mieno, Shunsuke Inenaga, and Hiroki Arimura. Cartesian tree subsequence matching. In Proc. CPM, volume 223 of LIPIcs, pages 14:1–14:18, 2022. doi:10.4230/LIPICS.CPM.2022.14.
  • [25] Eric M. Osterkamp and Dominik Köppl. Extending the parameterized Burrows–Wheeler transform. In Proc. DCC, pages 143–152, 2024. doi:10.1109/DCC58796.2024.00022.
  • [26] Sung Gwan Park, Magsarjav Bataa, Amihood Amir, Gad M. Landau, and Kunsoo Park. Finding patterns and periods in Cartesian tree matching. Theor. Comput. Sci., 845:181–197, 2020. doi:10.1016/J.TCS.2020.09.014.
  • [27] Siwoo Song, Geonmo Gu, Cheol Ryu, Simone Faro, Thierry Lecroq, and Kunsoo Park. Fast algorithms for single and multiple pattern Cartesian tree matching. Theor. Comput. Sci., 849:47–63, 2021. doi:10.1016/J.TCS.2020.10.009.
  • [28] Jean Vuillemin. A unifying look at data structures. Commun. ACM, 23(4):229–239, 1980. doi:10.1145/358841.358852.