
Improved Circular Dictionary Matching

Nicola Cotumaccio, University of Helsinki, Finland
Abstract

The circular dictionary matching problem is an extension of the classical dictionary matching problem where every string in the dictionary is interpreted as a circular string: after reading the last character of a string, we can move back to its first character. The circular dictionary matching problem is motivated by applications in bioinformatics and computational geometry.

In 2011, Hon et al. [ISAAC 2011] showed how to efficiently solve circular dictionary matching queries within compressed space by building on Mantaci et al.’s eBWT and Sadakane’s compressed suffix tree. The proposed solution is based on the assumption that the strings in the dictionary are all distinct and non-periodic, no string is a circular rotation of some other string, and the strings in the dictionary have similar lengths.

In this paper, we consider arbitrary dictionaries, and we show how to solve circular dictionary matching queries in O((m + occ) log n) time within compressed space, using n log σ (1 + o(1)) + O(n) + O(d log n) bits, where n is the total length of the dictionary, m is the length of the pattern, occ is the number of occurrences, d is the number of strings in the dictionary and σ is the size of the alphabet. Our solution is based on an extension of the suffix array to arbitrary dictionaries and on a sampling mechanism for the LCP array of a dictionary, inspired by recent results in graph indexing and compression.

Keywords and phrases:
Circular pattern matching, dictionary matching, suffix tree, compressed suffix tree, suffix array, LCP array, Burrows-Wheeler Transform, FM-index
Copyright and License:
© Nicola Cotumaccio; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Pattern matching
Related Version:
Full Version: https://arxiv.org/abs/2504.03394 [21]
Acknowledgements:
I would like to thank Kunihiko Sadakane for introducing me to the topic.
Funding:
Funded by the Helsinki Institute for Information Technology (HIIT).
Editors:
Paola Bonizzoni and Veli Mäkinen

1 Introduction

The Burrows-Wheeler transform (BWT) [12] and the FM-index [27] support pattern matching on the compressed representation of a single string. If we want to encode a collection 𝒯 of strings, we may append a distinct end-of-string character $_i to each string and store the Burrows-Wheeler Transform of the concatenation of the strings. This approach, however, is only a naive extension of the BWT, and it can significantly increase the size of the alphabet. In 2007, Mantaci et al. introduced the eBWT [40], a more sophisticated and elegant extension of the BWT to a collection of strings. When sorting all suffixes of all strings in the collection, Mantaci et al. define the mutual order between two suffixes T_i and T_j to be the mutual order of T_i^ω and T_j^ω in the lexicographic order, where T_i^ω and T_j^ω are the infinite strings obtained by concatenating T_i with itself and T_j with itself infinitely many times (see [13] for a recent paper on different variants of the eBWT). From the definition of the eBWT we obtain that, if we extend the backward search mechanism of the FM-index to multisets of strings, we are not matching a pattern P against all suffixes T_i, but against all infinite strings T_i^ω. Equivalently, we are interpreting each string in the collection as a circular string in which, after reaching the last character of the string, we can go back to its first character.

In 2011, Hon et al. applied the framework of the eBWT to solve circular dictionary matching queries [36], although they explicitly pointed out the relationship between their techniques (which extend Sadakane's compressed suffix tree [45] to a collection of strings) and the eBWT only in a subsequent paper [32]. The circular dictionary matching problem is an extension of the dictionary matching problem, which admits a classical solution based on Aho-Corasick automata [1], as well as more recent solutions within compressed space [8, 34]. Other variations of the same problem include dictionary matching with gaps [6, 35, 5] or mismatches [30], dictionary matching in the streaming model [16], dynamic dictionary matching [31] and internal dictionary matching [14]. The circular dictionary matching problem is motivated by several applications, see [38]. In bioinformatics, the genomes of herpes simplex virus (HSV-1) and many other viruses are circular strings [47], and microbial samples collected directly from the environment, without isolating and culturing them, often contain circular chromosomes [26, 46]. In computational geometry, a polygon can be stored by listing its vertices in clockwise order. The circular dictionary matching problem has also been studied from other perspectives (average-case behavior [37], approximate variants [15]).

1.1 Our Contribution

Consider a dictionary 𝒯 = (T_1, T_2, …, T_d) of total length n on an alphabet of size σ. In [36], Hon et al. show that, by storing a data structure of n log σ (1 + o(1)) + O(n) + O(d log n) bits, it is possible to solve circular dictionary matching queries in O((m + occ) log^{1+ϵ} n) time, where m is the length of the pattern. This result holds under three assumptions: (i) the T_h's are distinct and not periodic, (ii) no T_h is a circular rotation of another T_{h′}, and (iii) the T_h's have bounded aspect ratio, namely, (max_{1≤h≤d} |T_h|) / (min_{1≤h≤d} |T_h|) = O(1). Assumption (ii) was made explicit only in the subsequent paper [32], where Hon et al. also mention that, in an extension of the paper, they will show how to remove assumptions (i-ii). Assumption (iii) is required to store a space-efficient data structure supporting longest common prefix (LCP) queries. In [33], Hon et al. sketched a new data structure for LCP queries that uses O(n + d log n) bits without requiring assumption (iii), but all details are again deferred to an extended version.

The main result of our paper is Theorem 1. In particular, we make two main contributions.

  • We obtain results that are valid for an arbitrary dictionary 𝒯, without imposing any restriction. To this end, in Section 3.2 we introduce a more general compressed suffix array for an arbitrary collection of strings. In particular, we support LCP functionality within only O(n)+o(nlogσ) bits by using a sampling mechanism inspired by recent results [23, 17] on Wheeler automata.

  • We provide a self-contained proof of our result and a full example of the data structure for solving circular dictionary matching queries. The original paper [36] contains the main ideas needed to extend Sadakane's compressed suffix tree to a multiset of strings, but it is very dense, and proofs are often only sketched or missing, which may explain why the additional properties needed to remove assumptions (i-iii) were only observed in subsequent papers. We will provide an intuitive explanation of several steps by using graphs (and in particular cycles, see Figure 2), consistently with a line of research that has shown how the suffix array, the Burrows-Wheeler transform and the FM-index admit a natural interpretation in the graph setting [29, 3, 24, 22, 18, 4, 2, 25, 20].

We will also show that the time bound of Theorem 1 can be improved when the number of occurrences is large.

The paper is organized as follows. In Section 2 we introduce the circular dictionary matching problem. In Section 3 we define our compressed suffix tree. In Section 4 we present an algorithm for circular dictionary matching. Due to space constraints, most proofs and some auxiliary results can be found in the full version [21].

2 Preliminaries

𝒯 a b c a b c b c a b c c a b
k 1 2 3 4 5 6 7 8 9 10 11 12 13 14
B1[k] 1 0 0 0 0 0 1 0 0 0 0 1 0 0
(a)
| 𝒮_j (length-9 prefix) | D_j | j | 𝖫𝖢𝖯[j] | 𝖡𝖶𝖳[j] | 𝖡𝖶𝖳′[j] | B2[j] |
| --- | --- | --- | --- | --- | --- | --- |
| abcabcabc | {1, 4, 13} | 1 | – | c | a | 1 |
| abcbcabcb | {9} | 2 | 3 | c | a | 0 |
| bcabcabca | {2, 5, 14} | 3 | 0 | a | b | 1 |
| bcabcbcab | {7} | 4 | 5 | c | b | 0 |
| bcbcabcbc | {10} | 5 | 2 | a | b | 0 |
| cabcabcab | {3, 6, 12} | 6 | 0 | b | c | 1 |
| cabcbcabc | {8} | 7 | 4 | b | c | 0 |
| cbcabcbca | {11} | 8 | 1 | b | c | 0 |

(b)
Figure 1: Consider the dictionary 𝒯 = (abcabc, bcabc, cab), our running example. In (b), the strings 𝒮_j are sorted lexicographically (only the length-9 prefix of each 𝒮_j is shown), and every row identifies the positions k whose circular suffixes 𝒯_k correspond to the same 𝒮_j. Each string 𝒯_k is a prefix of 𝒮_j (highlighted in orange in the original figure). For example, we have 4 ∈ D_1, 𝒯_4 = abcabc and 𝒮_1 = (abc)^ω.

Let Σ be a finite alphabet of size σ = |Σ|. We denote by Σ* the set of all finite strings on Σ (including the empty string ϵ) and by Σ^ω the set of all countably infinite strings on Σ. For example, if Σ = {a, b}, then abbba is a string in Σ* and ababab⋯ is a string in Σ^ω. If T ∈ Σ*, we denote by |T| the length of T. If 1 ≤ i ≤ |T|, we denote by T[i] the i-th character of T, and if 1 ≤ i ≤ j ≤ |T|, we define T[i, j] = T[i] T[i+1] ⋯ T[j−1] T[j]. If i > j, let T[i, j] = ϵ. Analogously, if T ∈ Σ^ω, then T[i] is the i-th character of T, we have T[i, j] = T[i] T[i+1] ⋯ T[j−1] T[j] for every j ≥ i ≥ 1, and if i > j we have T[i, j] = ϵ.

If T ∈ Σ*, then for every integer z ≥ 1 the string T^z ∈ Σ* is defined by T^z = T T ⋯ T, where T is repeated z times, and the string T^ω ∈ Σ^ω is defined by T^ω = T T T ⋯, where T is repeated infinitely many times. Given T_1, T_2 ∈ Σ* ∪ Σ^ω, let 𝗅𝖼𝗉(T_1, T_2) be the length of the longest common prefix between T_1 and T_2.

We say that a string T ∈ Σ* is primitive if for every S ∈ Σ* and for every integer z ≥ 1, if T = S^z, then z = 1 and T = S. For every nonempty T ∈ Σ*, there exist exactly one primitive string R ∈ Σ* and exactly one integer z ≥ 1 such that T = R^z (see [39, Prop. 1.3.1]); we say that R is the root of T and we write R = ρ(T).

Let T ∈ Σ* be a string of length n. We define the following queries on T: (i) 𝖺𝖼𝖼𝖾𝗌𝗌(T, i): given 1 ≤ i ≤ n, return T[i]; (ii) 𝗋𝖺𝗇𝗄_c(T, i): given c ∈ Σ and 1 ≤ i ≤ n, return |{1 ≤ h ≤ i : T[h] = c}|; (iii) 𝗌𝖾𝗅𝖾𝖼𝗍_c(T, i): given c ∈ Σ and 1 ≤ i ≤ 𝗋𝖺𝗇𝗄_c(T, n), return the unique 1 ≤ j ≤ n such that T[j] = c and |{1 ≤ h ≤ j : T[h] = c}| = i. To handle limit cases easily, it is expedient to define 𝗌𝖾𝗅𝖾𝖼𝗍_c(T, 𝗋𝖺𝗇𝗄_c(T, n) + 1) = n + 1. A bit array of length n can be stored using a data structure (called a bitvector) of n + o(n) bits that supports 𝖺𝖼𝖼𝖾𝗌𝗌, 𝗋𝖺𝗇𝗄 and 𝗌𝖾𝗅𝖾𝖼𝗍 queries in O(1) time [42].
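The following sketch illustrates the semantics of the three queries on a toy Python bitvector (the class NaiveBitvector and its method names are ours, introduced only for illustration). A real succinct bitvector answers all three queries in O(1) time within n + o(n) bits [42]; this toy version simply scans the array. We reuse B1 from Figure 1(a) in the later sketches.

```python
# A naive illustration of the bitvector interface used throughout the paper.
# Real succinct bitvectors answer these queries in O(1) time within n + o(n)
# bits [42]; here we simply scan, which shows the semantics but not the speed.

class NaiveBitvector:
    def __init__(self, bits):
        self.bits = list(bits)
        self.n = len(bits)

    def access(self, i):            # 1-based: return B[i]
        return self.bits[i - 1]

    def rank1(self, i):             # number of 1s in B[1..i]
        return sum(self.bits[:i])

    def select1(self, i):           # position of the i-th 1
        count = 0
        for j, b in enumerate(self.bits, start=1):
            count += b
            if b == 1 and count == i:
                return j
        return self.n + 1           # convention: select1(rank1(n) + 1) = n + 1

# The bitvector B1 of Figure 1(a) for the dictionary (abcabc, bcabc, cab):
B1 = NaiveBitvector([1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0])
assert B1.rank1(13) == 3 and B1.select1(3) == 12
```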

We consider a fixed total order ⪯ on Σ and we extend it to Σ* ∪ Σ^ω lexicographically. For every T_1, T_2 ∈ Σ* ∪ Σ^ω, we write T_1 ≺ T_2 if (T_1 ⪯ T_2) ∧ (T_1 ≠ T_2). In our examples (see e.g. Figure 1), Σ is a subset of the English alphabet and ⪯ is the usual order on letters. In our data structures, Σ = {0, 1, …, |Σ| − 1} and ⪯ is the usual order such that 0 ≺ 1 ≺ ⋯ ≺ |Σ| − 1.

2.1 Circular Dictionary Matching

Let T, P ∈ Σ*, with P ≠ ϵ. For every 1 ≤ i ≤ |P| − |T| + 1, we say that T occurs in P at position i if P[i, i + |T| − 1] = T.

Let T ∈ Σ* and let 1 ≤ i ≤ |T|. The circular suffix T_i of T is the string T[i, |T|] T[1, i − 1]. For example, if T = cab, then T_1 = cab, T_2 = abc and T_3 = bca.

Let d ≥ 1, and let T_1, T_2, …, T_d ∈ Σ* be nonempty strings. We say that 𝒯 = (T_1, T_2, …, T_d) is a dictionary. We assume that the alphabet Σ is effective, that is, every character in Σ occurs in some T_h (our results could be extended to larger alphabets by using standard tools such as dictionaries [42, 22]). Define the total length n of 𝒯 to be n = ∑_{h=1}^{d} |T_h|. In the example of Figure 1(a), we have d = 3 and n = 14. For every 1 ≤ k ≤ n, the circular suffix 𝒯_k of 𝒯 is the string (T_f)_g, where f is the largest integer in {1, 2, …, d} such that ∑_{h=1}^{f−1} |T_h| < k, and 1 ≤ g ≤ |T_f| is such that k = (∑_{h=1}^{f−1} |T_h|) + g. In other words, the circular suffix 𝒯_k is the circular suffix starting at position k in the concatenation T_1 T_2 ⋯ T_d. In Figure 1, if k = 13, we have f = 3, g = 2 and 𝒯_{13} = abc. Given a pattern P ∈ Σ*, where |P| = m, define:

𝖢𝖽𝗆(𝒯, P) = {(i, k) ∈ {1, 2, …, m} × {1, 2, …, n} : 𝒯_k occurs in P at position i}

and let occ = |𝖢𝖽𝗆(𝒯, P)|. For example, if P = abcbca, in the example of Figure 1 we have 𝖢𝖽𝗆(𝒯, P) = {(1, 9), (1, 13), (2, 10), (4, 14)} and occ = 4.
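To make these definitions concrete, here is a brute-force sketch (all function names are ours) that enumerates the circular suffixes 𝒯_1, …, 𝒯_n and computes 𝖢𝖽𝗆(𝒯, P) directly; it reproduces the example above and serves only as a reference point for the indexed solution developed in the rest of the paper.

```python
# Brute-force reference for Cdm(T, P): enumerate every circular suffix T_k of
# the dictionary and report all pairs (i, k) such that T_k occurs in P at
# position i. This is only a sanity check, not the compressed solution.

def circular_suffixes(dictionary):
    """Return the circular suffixes T_1, ..., T_n in their natural order."""
    suffixes = []
    for T in dictionary:
        for g in range(len(T)):
            suffixes.append(T[g:] + T[:g])
    return suffixes

def cdm(dictionary, P):
    result = set()
    for k, Tk in enumerate(circular_suffixes(dictionary), start=1):
        for i in range(1, len(P) - len(Tk) + 2):      # 1 <= i <= m - |T_k| + 1
            if P[i - 1:i - 1 + len(Tk)] == Tk:
                result.add((i, k))
    return result

T = ["abcabc", "bcabc", "cab"]                        # the running example
assert cdm(T, "abcbca") == {(1, 9), (1, 13), (2, 10), (4, 14)}
```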

The main result of this paper is the following.

Theorem 1.

Let 𝒯 = (T_1, T_2, …, T_d) be a dictionary of total length n. Then, 𝒯 can be encoded using a data structure of n log σ (1 + o(1)) + O(n) + O(d log n) bits such that, given a string P of length m, we can compute 𝖢𝖽𝗆(𝒯, P) in O((m + occ) log n) time.

We can also improve the time bound in Theorem 1 to O(m log n + min{occ · log n, n log n + occ}), without needing to know the value occ in advance. This is useful when occ is large.

Following Hon et al. [36], we will build a data structure that extends Sadakane's suffix tree to 𝒯. We will achieve the space bound in Theorem 1 as follows. (i) We will store the FM-index of 𝒯 using n log σ (1 + o(1)) + O(n) bits, see Theorem 9. (ii) We will introduce a notion of suffix array (Section 3.2) and a notion of longest common prefix (LCP) array (Section 3.3). Storing the suffix array and the LCP array explicitly would require O(n log n) bits so, in both cases, we will only store a sample (Theorem 10 and Theorem 13), leading to a compressed representation of both arrays. (iii) We will store some auxiliary data structures (O(n) bits in total): eight bitvectors (called B1-B8), a data structure supporting range minimum queries on the LCP array (see Section 3.3), a data structure storing the topology of the suffix tree (see Section 3.4), a data structure storing the topology of the auxiliary tree 𝖲𝗎𝖿𝖿′(𝒯) (see Section 4), and a data structure supporting range minimum queries on the auxiliary array 𝖫𝖾𝗇 (see Section 4).

3 The Compressed Suffix Tree of 𝓣

In this section, we extend Sadakane’s compressed suffix tree to the dictionary 𝒯. To this end, we will introduce a BWT-based encoding of 𝒯 (Section 3.1), a compressed suffix array (Section 3.2), an LCP array (Section 3.3), and the topology of the suffix tree (Section 3.4).

3.1 The Burrows-Wheeler Transform and the FM-index of 𝓣

Let us consider our dictionary 𝒯 (recall that n = ∑_{h=1}^{d} |T_h|). We can naturally define a bijective function ϕ from the set {1, 2, …, n} to the set {(f, g) : 1 ≤ f ≤ d, 1 ≤ g ≤ |T_f|} such that, if ϕ(k) = (f, g), then the k-th character of the concatenation T_1 T_2 ⋯ T_d is the g-th character of T_f. For example, in Figure 1 we have ϕ(13) = (3, 2). Let us define a bitvector B1 to compute ϕ and ϕ^{−1} in O(1) time. B1 is the bitvector of length n that contains exactly d ones such that, for every 1 ≤ h ≤ d, the number of consecutive zeros after the h-th one is |T_h| − 1 (see Figure 1(a)). Assume that ϕ(k) = (f, g). Given k, we have f = 𝗋𝖺𝗇𝗄_1(B1, k) and g = k − 𝗌𝖾𝗅𝖾𝖼𝗍_1(B1, f) + 1. Conversely, given f and g, we have k = 𝗌𝖾𝗅𝖾𝖼𝗍_1(B1, f) + g − 1. Note that in O(1) time we can also compute |T_h| for every 1 ≤ h ≤ d, because |T_h| = 𝗌𝖾𝗅𝖾𝖼𝗍_1(B1, h+1) − 𝗌𝖾𝗅𝖾𝖼𝗍_1(B1, h).
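As a sanity check, the computations of ϕ, ϕ^{−1} and |T_h| via 𝗋𝖺𝗇𝗄/𝗌𝖾𝗅𝖾𝖼𝗍 on B1 can be spelled out as follows (a sketch reusing the toy NaiveBitvector and B1 from Section 2; the helper names are ours).

```python
# phi, its inverse and |T_h| computed with rank/select on B1, as in the text.

def phi(k, B1):
    """Map position k of T_1 T_2 ... T_d to the pair (f, g)."""
    f = B1.rank1(k)
    g = k - B1.select1(f) + 1
    return f, g

def phi_inv(f, g, B1):
    return B1.select1(f) + g - 1

def length(h, B1):
    """|T_h| = select_1(B1, h + 1) - select_1(B1, h)."""
    return B1.select1(h + 1) - B1.select1(h)

assert phi(13, B1) == (3, 2) and phi_inv(3, 2, B1) == 13
assert [length(h, B1) for h in (1, 2, 3)] == [6, 5, 3]
```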

To exploit the circular nature of the dictionary, we will not sort the finite strings 𝒯_k, but the infinite strings 𝒯_k^ω, and the suffix tree of 𝒯 will be the trie of the 𝒯_k^ω's. Notice that we may have 𝒯_k^ω = 𝒯_{k′}^ω for distinct 1 ≤ k, k′ ≤ n (in Figure 1, we have 𝒯_1^ω = 𝒯_4^ω = 𝒯_{13}^ω = (abc)^ω). The next lemma shows that this happens exactly when 𝒯_k and 𝒯_{k′} have the same root.

Lemma 2.

Let 𝒯 be a dictionary, and let 1 ≤ k, k′ ≤ n. Then, 𝒯_k^ω = 𝒯_{k′}^ω if and only if ρ(𝒯_k) = ρ(𝒯_{k′}).

The next step is to sort the "suffixes" 𝒯_k^ω. Since in general they are not pairwise distinct, we will use an ordered partition (see Figure 1(b)).

Definition 3.

Let 𝒯 be a dictionary.

  • Let 𝒟 = (D_1, D_2, …, D_{n′}) be the ordered partition of {1, 2, …, n} (we denote by n′ the number of classes) such that, for every k, k′ ∈ {1, 2, …, n}: (i) if k and k′ are in the same D_j, then 𝒯_k^ω = 𝒯_{k′}^ω, and (ii) if k ∈ D_j, k′ ∈ D_{j′} and j < j′, then 𝒯_k^ω ≺ 𝒯_{k′}^ω.

  • For every 1 ≤ j ≤ n′, let 𝒮_j = 𝒯_k^ω, where 1 ≤ k ≤ n is such that k ∈ D_j.

Note that 𝒮_j is well defined because it does not depend on the choice of k ∈ D_j, by the definition of 𝒟. Note also that 𝒮_1 ≺ 𝒮_2 ≺ ⋯ ≺ 𝒮_{n′}.

Let 1 ≤ k ≤ n, and assume that ϕ(k) = (f, g). Define 𝗉𝗋𝖾𝖽(k) as follows: (i) if g ≥ 2, then 𝗉𝗋𝖾𝖽(k) = k − 1, and (ii) if g = 1, then 𝗉𝗋𝖾𝖽(k) = ϕ^{−1}(f, |T_f|). In other words, 𝗉𝗋𝖾𝖽(k) equals k − 1, modulo staying within the same string of 𝒯. In Figure 1(a), we have 𝗉𝗋𝖾𝖽(9) = 8 and 𝗉𝗋𝖾𝖽(8) = 7, but 𝗉𝗋𝖾𝖽(7) = 11. For every 1 ≤ k ≤ n, we can compute 𝗉𝗋𝖾𝖽(k) in O(1) time by using the bitvector B1: if B1[k] = 0, then 𝗉𝗋𝖾𝖽(k) = k − 1, and if B1[k] = 1, then 𝗉𝗋𝖾𝖽(k) = 𝗌𝖾𝗅𝖾𝖼𝗍_1(B1, 𝗋𝖺𝗇𝗄_1(B1, k) + 1) − 1.
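The constant-time computation of 𝗉𝗋𝖾𝖽 just described can be sketched as follows (again on the toy bitvector).

```python
# pred(k): step one position to the left, wrapping around inside the same
# string of the dictionary, computed from B1 exactly as in the text.

def pred(k, B1):
    if B1.access(k) == 0:
        return k - 1
    return B1.select1(B1.rank1(k) + 1) - 1    # wrap to the last position of T_f

assert pred(9, B1) == 8 and pred(8, B1) == 7 and pred(7, B1) == 11
```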

If U ⊆ {1, 2, …, n}, define 𝗉𝗋𝖾𝖽(U) = {𝗉𝗋𝖾𝖽(k) : k ∈ U}. Since the function 𝗉𝗋𝖾𝖽 is a permutation of the set {1, 2, …, n}, we always have |𝗉𝗋𝖾𝖽(U)| = |U| for every U ⊆ {1, 2, …, n}. Let us prove that 𝗉𝗋𝖾𝖽 yields a permutation of the D_j's.

Lemma 4.

Let 𝒯 be a dictionary, and let 1 ≤ j ≤ n′. Then, there exists 1 ≤ j′ ≤ n′ such that 𝗉𝗋𝖾𝖽(D_j) = D_{j′}. Moreover, if c = 𝒮_{j′}[1], we have 𝒮_{j′} = c 𝒮_j.

For every 1 ≤ j ≤ n′, let ψ(j) = j′, where 𝗉𝗋𝖾𝖽(D_j) = D_{j′} as in Lemma 4. Notice that ψ is a permutation of {1, 2, …, n′} because it is surjective: indeed, for every 1 ≤ j′ ≤ n′, if we pick any k′ ∈ D_{j′}, consider the 1 ≤ k ≤ n such that k′ = 𝗉𝗋𝖾𝖽(k) and the 1 ≤ j ≤ n′ such that k ∈ D_j, then 𝗉𝗋𝖾𝖽(D_j) = D_{j′}. Moreover, for every 1 ≤ j ≤ n′ define μ(j) = 𝒮_{ψ(j)}[1]. By Lemma 4, we know that 𝒮_{ψ(j)} = μ(j) 𝒮_j for every 1 ≤ j ≤ n′.

We can visualize ψ by drawing a (directed edge-labeled) graph G_𝒯 = (V, E), where V = {D_j : 1 ≤ j ≤ n′} and E = {(D_{ψ(j)}, D_j, μ(j)) : 1 ≤ j ≤ n′}, see Figure 2. Since ψ is a permutation of {1, 2, …, n′}, every node of G_𝒯 has exactly one incoming edge and exactly one outgoing edge, so G_𝒯 is the disjoint union of some cycles. Moreover, for every 1 ≤ j ≤ n′ the infinite string that we can read starting from node D_j and following the edges of the corresponding cycle is 𝒮_j, because we know that 𝒮_{ψ(j)} = μ(j) 𝒮_j for every 1 ≤ j ≤ n′. For example, in Figure 2 the infinite string that we can read starting from D_3 is 𝒮_3 = (bca)^ω.

Figure 2: The graph G𝒯 for the dictionary 𝒯=(abcabc,bcabc,cab) of Figure 1.
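Although G_𝒯 is never built explicitly in our data structure, the ordered partition 𝒟 of Definition 3 can be computed directly on small examples. In the sketch below the infinite strings 𝒯_k^ω are compared through their length-2n prefixes, which is sufficient by the Fine and Wilf theorem, since every period is at most n; circular_suffixes is the helper from the earlier sketch, and sorted_classes is our name.

```python
# The ordered partition D_1, ..., D_{n'} of Definition 3, computed naively.
# Two omega-powers with periods at most n coincide iff they agree on their
# first 2n characters, so a prefix of length 2n is a safe sorting key.

def sorted_classes(dictionary):
    suffixes = circular_suffixes(dictionary)
    n = len(suffixes)
    rep = lambda Tk: (Tk * (2 * n // len(Tk) + 1))[:2 * n]
    order = sorted(range(1, n + 1), key=lambda k: rep(suffixes[k - 1]))
    classes, last = [], None
    for k in order:
        r = rep(suffixes[k - 1])
        if r != last:
            classes.append([])
            last = r
        classes[-1].append(k)
    return classes

D = sorted_classes(T)                                 # T = (abcabc, bcabc, cab)
assert D[0] == [1, 4, 13] and D[2] == [2, 5, 14]      # D_1 and D_3, Figure 1(b)
```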

We will not explicitly build the graph G_𝒯 to solve circular dictionary matching queries, but G_𝒯 naturally captures the cyclic nature of 𝒯 and will help us detect some properties of the D_j's. For example, since we know that the infinite string starting from node D_j is 𝒮_j, and 𝒮_1 ≺ 𝒮_2 ≺ ⋯ ≺ 𝒮_{n′}, we can easily infer the following properties (see Figure 2): if we consider two edges (D_{j_1}, D_{j_1′}, c) and (D_{j_2}, D_{j_2′}, d), then (i) if j_1 < j_2, then c ⪯ d, and (ii) if c = d and j_1 < j_2, then j_1′ < j_2′. We can formalize these properties in our graph-free setting as follows.

Lemma 5.

Let 𝒯 be a dictionary, let 1 ≤ j_1, j_1′, j_2, j_2′ ≤ n′ and let c, d ∈ Σ be such that 𝒮_{j_1} = c 𝒮_{j_1′} and 𝒮_{j_2} = d 𝒮_{j_2′}.

  1. (Property 1) If j_1 < j_2, then c ⪯ d.

  2. (Property 2) If c = d and j_1 < j_2, then j_1′ < j_2′.

We now want to extend the backward search [27] to our dictionary 𝒯. We will again use G_𝒯 to guide our intuition. Given U ⊆ {1, 2, …, n′} and given c ∈ Σ, let 𝖻𝖺𝖼𝗄(U, c) be the set of all 1 ≤ j′ ≤ n′ such that there exists an edge (D_{j′}, D_j, c) for some j ∈ U. For example, in Figure 2 we have 𝖻𝖺𝖼𝗄({6, 7, 8}, b) = {3, 4, 5}. Notice that {6, 7, 8} is convex, and {3, 4, 5} is also convex. This is true in general: 𝖻𝖺𝖼𝗄(U, c) is always convex if U is convex. Indeed, if j_1′ < j′ < j_2′ and j_1′, j_2′ ∈ 𝖻𝖺𝖼𝗄(U, c), then from the properties of G_𝒯 mentioned before Lemma 5 we first conclude that the edge leaving node D_{j′} is labeled with c, and then we infer that the node D_j reached by this edge must be such that j ∈ U, which implies j′ ∈ 𝖻𝖺𝖼𝗄(U, c). We can now formally define 𝖻𝖺𝖼𝗄(U, c) in our graph-free setting and prove that 𝖻𝖺𝖼𝗄(U, c) is convex if U is convex.

Definition 6.

Let 𝒯 be a dictionary, let U ⊆ {1, 2, …, n′} and let c ∈ Σ. Define 𝖻𝖺𝖼𝗄(U, c) = {1 ≤ j′ ≤ n′ : there exists j ∈ U such that 𝒮_{j′} = c 𝒮_j}.

Note that, if U_1 ⊆ U_2 ⊆ {1, 2, …, n′} and c ∈ Σ, then 𝖻𝖺𝖼𝗄(U_1, c) ⊆ 𝖻𝖺𝖼𝗄(U_2, c).

Lemma 7.

Let 𝒯 be a dictionary, let U ⊆ {1, 2, …, n′} and let c ∈ Σ. If U is convex, then 𝖻𝖺𝖼𝗄(U, c) is convex.
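On the running example, ψ, μ and 𝖻𝖺𝖼𝗄 can be computed directly from the classes D_j and from 𝗉𝗋𝖾𝖽: since 𝒮_{ψ(j)} = μ(j) 𝒮_j, Definition 6 unfolds to 𝖻𝖺𝖼𝗄(U, c) = {ψ(j) : j ∈ U and μ(j) = c}. A naive sketch, reusing T, D, B1 and pred from the earlier sketches (all helper names are ours):

```python
# psi, mu and back computed naively on the running example.

concat = "".join(T)                                   # T_1 T_2 ... T_d

def class_of(k):                                      # the index j with k in D_j
    return next(j for j, Dj in enumerate(D, start=1) if k in Dj)

def psi(j):                                           # pred(D_j) = D_psi(j)
    return class_of(pred(D[j - 1][0], B1))

def mu(j):                                            # mu(j) = S_psi(j)[1]
    return concat[pred(D[j - 1][0], B1) - 1]

def back(U, c):
    return {psi(j) for j in U if mu(j) == c}

assert back({6, 7, 8}, "b") == {3, 4, 5}              # the example before Lemma 7
```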

Let us define the Burrows-Wheeler transform (BWT) of 𝒯.

Definition 8.

Let 𝒯 be a dictionary. Define the string 𝖡𝖶𝖳 ∈ Σ^{n′} of length n′ by 𝖡𝖶𝖳[j] = μ(j) for every 1 ≤ j ≤ n′.

In other words, 𝖡𝖶𝖳[j] is the label of the edge reaching node D_j, for every 1 ≤ j ≤ n′ (see Figure 1(b) and Figure 2). The data structure that we will define in Theorem 9 is based on two sequences, 𝖡𝖶𝖳′ and B2, that are related to 𝖡𝖶𝖳 (see Figure 1(b)). The sequence 𝖡𝖶𝖳′ is obtained from 𝖡𝖶𝖳 by sorting its elements. We know that for every pair of edges (D_{j_1}, D_{j_1′}, c) and (D_{j_2}, D_{j_2′}, d), if j_1 < j_2, then c ⪯ d, so 𝖡𝖶𝖳′[j] is the label of the edge leaving node D_j for every 1 ≤ j ≤ n′, which implies 𝖡𝖶𝖳′[j] = 𝒮_j[1] for every 1 ≤ j ≤ n′. The bitvector B2 has length n′ and is obtained by marking with 1 the beginning of each equal-letter run in 𝖡𝖶𝖳′. Formally, for every 1 ≤ j ≤ n′, we have B2[j] = 1 if and only if j = 1 or (j ≥ 2) ∧ (𝖡𝖶𝖳′[j−1] ≠ 𝖡𝖶𝖳′[j]). Since the alphabet is effective, the set {𝗌𝖾𝗅𝖾𝖼𝗍_1(B2, c), 𝗌𝖾𝗅𝖾𝖼𝗍_1(B2, c) + 1, …, 𝗌𝖾𝗅𝖾𝖼𝗍_1(B2, c+1) − 1} consists of all 1 ≤ j ≤ n′ such that the edge leaving node D_j is labeled with c.
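Continuing the sketch, 𝖡𝖶𝖳, 𝖡𝖶𝖳′ and B2 of Figure 1(b) fall out of μ directly (a plain construction, not the compressed data structure of Theorem 9 below):

```python
# BWT, its sorted variant BWT' and the run-marking bitvector B2 of Figure 1(b).

BWT = [mu(j) for j in range(1, len(D) + 1)]           # BWT[j] = mu(j)
BWT_prime = sorted(BWT)                               # BWT'[j] = S_j[1]
B2 = [1 if j == 0 or BWT_prime[j] != BWT_prime[j - 1] else 0
      for j in range(len(BWT_prime))]

assert "".join(BWT) == "ccacabbb"
assert "".join(BWT_prime) == "aabbbccc"
assert B2 == [1, 0, 1, 0, 0, 1, 0, 0]
```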

We can now extend the properties of the Burrows-Wheeler Transform and the FM-index to 𝒯. In particular, we will show that 𝖡𝖶𝖳 is an encoding of G_𝒯. This result shows that 𝖡𝖶𝖳 is related to the eBWT of Mantaci et al. [40], but there are two main differences: (1) we do not need assumptions (i-ii) in Section 1 to define 𝖡𝖶𝖳 (in particular, the T_h's need not be primitive), and (2) 𝖡𝖶𝖳 is an encoding of G_𝒯 but not of 𝒯 (namely, 𝖡𝖶𝖳 encodes the cyclic nature of the T_h's and not a specific circular suffix of each T_h). To extend the FM-index to 𝒯, we will once again base our intuition on G_𝒯.

Recall that two (directed edge-labeled) graphs G_1 = (V_1, E_1) and G_2 = (V_2, E_2) are isomorphic if there exists a bijection f from V_1 to V_2 such that for every u, v ∈ V_1 and for every c ∈ Σ we have (u, v, c) ∈ E_1 if and only if (f(u), f(v), c) ∈ E_2. In other words, two graphs are isomorphic if they are the same graph up to renaming the nodes.

Theorem 9.

Let 𝒯 be a dictionary.

  1. 𝖡𝖶𝖳 is an encoding of G_𝒯, that is, given 𝖡𝖶𝖳, we can retrieve G_𝒯 up to isomorphism.

  2. There exists a data structure encoding G_𝒯 of n log σ (1 + o(1)) + O(n) bits that supports the following queries in O(log log σ) time: (i) 𝖺𝖼𝖼𝖾𝗌𝗌, 𝗋𝖺𝗇𝗄, and 𝗌𝖾𝗅𝖾𝖼𝗍 queries on 𝖡𝖶𝖳; (ii) 𝖻𝗐𝗌(ℓ, r, c): given 1 ≤ ℓ ≤ r ≤ n′ and c ∈ Σ, decide whether 𝖻𝖺𝖼𝗄({ℓ, ℓ+1, …, r}, c) is empty and, if it is nonempty, return ℓ′ and r′ such that 𝖻𝖺𝖼𝗄({ℓ, ℓ+1, …, r}, c) = {ℓ′, ℓ′+1, …, r′}; (iii) 𝗉𝗋𝖾𝗏(j): given 1 ≤ j ≤ n′, return the 1 ≤ j′ ≤ n′ such that 𝗉𝗋𝖾𝖽(D_j) = D_{j′}; (iv) 𝖿𝗈𝗅𝗅𝗈𝗐(j′): given 1 ≤ j′ ≤ n′, return the 1 ≤ j ≤ n′ such that 𝗉𝗋𝖾𝖽(D_j) = D_{j′}.

Note that, even if 𝖡𝖶𝖳 is an encoding of G_𝒯, it is not an encoding of 𝒯 (in particular, from G_𝒯 we cannot recover 𝒯). This is true even when all D_j's are singletons, because G_𝒯 only stores circular strings, so we cannot retrieve the bitvector B1 that marks the beginning of each string (we cannot even retrieve the specific enumeration T_1, T_2, …, T_d of the strings in 𝒯).

3.2 The Compressed Suffix Array of 𝓣

We know that 𝒟 = (D_1, D_2, …, D_{n′}) is an ordered partition of {1, 2, …, n}, and we know that 𝒮_1 ≺ 𝒮_2 ≺ ⋯ ≺ 𝒮_{n′}. The suffix array of 𝒯 should be defined in such a way that we can answer the following query: given 1 ≤ j ≤ n′, return the set D_j.

In the following, we say that a position 1 ≤ k ≤ n refers to the string T_h if h = 𝗋𝖺𝗇𝗄_1(B1, k). Every position refers to exactly one string. Notice that a set D_j may contain several elements that refer to the same string T_h. In Figure 1, we have 2, 5 ∈ D_3, and both 2 and 5 refer to T_1. This happens because T_1 is not a primitive string. Let us show that, if we know any element of D_j referring to T_h, then we can retrieve all elements of D_j referring to T_h.

t: 1 2 3 4 5 6 7 8 9 10 11
B3[t]: 1 0 0 1 0 0 0 0 1 0 0
𝖲𝖠′[t]: 1 13 9 2 14 7 10 3 12 8 11
B4[t]: 1 0 1 1 0 1 1 1 0 1 1
B5[t]: 1 0 1 0 1 1 0 1 1 0 1
𝖫𝖾𝗇[t]: 6 3 5 6 3 5 5 6 3 5 5
𝖲𝖠′′ = (1, 9, 14, 7, 3, 12, 11): the entries 𝖲𝖠′[t] with B5[t] = 1.
Figure 3: The compressed suffix array of the dictionary 𝒯=(abcabc,bcabc,cab) in Figure 1. We use the sampling factor s=2.

For every 1 ≤ h ≤ d, let ρ_h be the root of T_h. Then, |T_h| is divisible by |ρ_h| and, for every 1 ≤ k ≤ n, the root ρ(𝒯_k) of 𝒯_k is 𝒯_k[1, |ρ_h|], where k refers to the string T_h. For every 1 ≤ h ≤ d, let k_h = 𝗌𝖾𝗅𝖾𝖼𝗍_1(B1, h) be the index in {1, 2, …, n} corresponding to the first character of T_h.

Fix 1 ≤ j ≤ n′ and 1 ≤ h ≤ d, and let k ∈ D_j be any position referring to T_h. For every k′ referring to T_h, we have k′ ∈ D_j if and only if ρ(𝒯_{k′}) = ρ(𝒯_k) (by Lemma 2), if and only if 𝒯_{k′}[1, |ρ_h|] = 𝒯_k[1, |ρ_h|], if and only if |k − k′| is a multiple of |ρ_h|. Then, if G is the set of all elements of D_j referring to T_h, we have G = {k_h + ((k − k_h + w|ρ_h|) mod |T_h|) : 0 ≤ w ≤ (|T_h|/|ρ_h|) − 1}. For example, in Figure 1(b), if we consider j = 3 and h = 1 and we pick k = 5, then 5 ∈ D_3, 5 refers to T_1, ρ_1 = abc, |ρ_1| = 3, |T_1| = 6, k_1 = 1 and G = {2, 5}.

To compute G we need to compute |T_h| and |ρ_h|. We have already seen that we can compute |T_h| in O(1) time using B1 (see Section 3.1), and we now store a bitvector B3 to compute |ρ_h| in O(1) time (see Figure 3). The bitvector B3 contains exactly d ones, we have B3[1] = 1, and for every 1 ≤ h ≤ d the number of consecutive zeros after the h-th one is |ρ_h| − 1. Then, |ρ_h| = 𝗌𝖾𝗅𝖾𝖼𝗍_1(B3, h+1) − 𝗌𝖾𝗅𝖾𝖼𝗍_1(B3, h) for every 1 ≤ h ≤ d.
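The formula for G can be verified with a short sketch (B3 is written out as in Figure 3; root_length and occurrences_in_class are our names for these helpers):

```python
# The set G of all elements of D_j referring to T_h, from one element k of G.

B3 = NaiveBitvector([1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0])   # |rho_h| = 3, 5, 3

def root_length(h):
    return B3.select1(h + 1) - B3.select1(h)

def occurrences_in_class(k, h):
    kh = B1.select1(h)                  # position of the first character of T_h
    Th, rh = length(h, B1), root_length(h)
    return {kh + ((k - kh + w * rh) % Th) for w in range(Th // rh)}

assert occurrences_in_class(5, 1) == {2, 5}               # the example above
```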

Let us define a suffix array for 𝒯. Let ∼ be the equivalence relation on {1, 2, …, n} such that for every 1 ≤ k, k′ ≤ n we have k ∼ k′ if and only if k and k′ belong to the same D_j and refer to the same string T_h. Fix 1 ≤ j ≤ n′. Notice that D_j is the union of some ∼-equivalence classes. Let D′_j be the subset of D_j obtained by picking the smallest element of each ∼-equivalence class contained in D_j. In Figure 1(b), for j = 3, the partition of D_3 induced by ∼ is {{2, 5}, {14}}, so D′_3 = {2, 14}. Then 𝖲𝖠′ is the array obtained by concatenating the elements of D′_1, the elements of D′_2, …, the elements of D′_{n′} (see Figure 3), where the elements of each D′_j are sorted from smallest to largest. Equivalently, the elements of each D′_j are sorted according to the indexes of the strings to which they refer (by definition, two distinct elements of D′_j cannot refer to the same string). We also define a bitvector B4 to mark the beginning of each D′_j (see again Figure 3). More precisely, B4 contains exactly n′ ones and, for every 1 ≤ j ≤ n′, the number of consecutive zeros after the j-th one is |D′_j| − 1.

Fix 1 ≤ k ≤ n, where k ∈ D_j refers to the string T_h. Note that there exists t such that 𝖲𝖠′[t] = k if and only if k ∈ D′_j, if and only if k_h ≤ k ≤ k_h + |ρ_h| − 1. Moreover, for every k ∈ D′_j, the set [k]_∼ = {k + w|ρ_h| : 0 ≤ w ≤ (|T_h|/|ρ_h|) − 1} is the set of all elements of D_j referring to T_h. In particular, 𝖲𝖠′ is an array of length n′′ = ∑_{j=1}^{n′} |D′_j| = ∑_{h=1}^{d} |ρ_h| (in Figure 3, we have n′′ = 11).

The suffix array 𝖲𝖠′ has the desired property: given 1 ≤ j ≤ n′, we can compute the set D_j in O(|D_j|) time as follows. We first retrieve D′_j by observing that its elements are stored in 𝖲𝖠′[t], 𝖲𝖠′[t+1], …, 𝖲𝖠′[t′], where t = 𝗌𝖾𝗅𝖾𝖼𝗍_1(B4, j) and t′ = 𝗌𝖾𝗅𝖾𝖼𝗍_1(B4, j+1) − 1. Then, for every k ∈ D′_j, we know that k refers to the string T_h, where h = 𝗋𝖺𝗇𝗄_1(B1, k), and we can compute the set [k]_∼ of all elements of D_j referring to T_h as shown above. Then, D_j = ∪_{k ∈ D′_j} [k]_∼, and the union is disjoint.
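The whole procedure for recovering D_j can be sketched as follows, with 𝖲𝖠′ and B4 written out as in Figure 3 (compute_Dj is our name):

```python
# Recovering D_j: read the block of SA' delimited by B4, then expand every
# stored element of D'_j to its whole ~-class with occurrences_in_class.

SA_prime = [1, 13, 9, 2, 14, 7, 10, 3, 12, 8, 11]
B4 = NaiveBitvector([1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1])

def compute_Dj(j):
    t, t2 = B4.select1(j), B4.select1(j + 1) - 1
    Dj = set()
    for k in SA_prime[t - 1:t2]:                      # the elements of D'_j
        Dj |= occurrences_in_class(k, B1.rank1(k))    # h = rank_1(B1, k)
    return Dj

assert compute_Dj(3) == {2, 5, 14} and compute_Dj(1) == {1, 4, 13}
```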

j: 2 3 4 5 6 7 8
𝖫𝖢𝖯[j]: 3 0 5 2 0 4 1
B6[j]: 1 0 0 0 0 0 1
𝖫𝖢𝖯′ = (3, 1): the entries 𝖫𝖢𝖯[j] with B6[j] = 1.
(a)
(b)
Figure 4: The LCP array of the dictionary 𝒯=(abcabc,bcabc,cab) in Figure 1. We use the sampling factor s=2. The graph in (b) is the graph Q used to determine which values will be sampled. Yellow nodes are the sampled nodes.

Storing 𝖲𝖠′ explicitly can require up to n′′ log n bits, so to achieve the space bound in Theorem 1 we need a sampling mechanism similar to the one of the compressed suffix array of a string: we sample some entries, and we reach a sampled entry by using the backward search to navigate the string (see [42]). More precisely, in our setting we use the query 𝗉𝗋𝖾𝗏(j) of Theorem 9 to navigate G_𝒯 in a backward fashion. Note that in the conventional case of a single string, if we start from the end of the string and repeatedly apply the backward search, we can reach every position of the string; in our setting, however, the graph G_𝒯 is the disjoint union of some cycles (see Figure 2), and the backward search does not allow navigating from one cycle to another. This means that we need to sample at least one value per cycle. In the worst case, we have d cycles (one for each T_h), so in the worst case we need to sample at least d values.

After choosing an appropriate sampling factor s, we store the sampled values in an array 𝖲𝖠′′ (respecting the order in 𝖲𝖠′), and we use a bitvector B5 to mark the entries of 𝖲𝖠′ that have been sampled (see Figure 3). We finally obtain the following theorem.

Theorem 10.

Let 𝒯 be a dictionary. By storing o(n log σ) + O(n) + O(d log n) bits, we can compute each entry 𝖲𝖠′[t] in O(log n) time.

We conclude that, by storing the data structure of Theorem 10, for every 1 ≤ j ≤ n′ we can compute D_j in O(|D_j| log n) time.

3.3 The LCP Array of 𝓣

Let us define the longest common prefix (LCP) array of 𝒯. We know that 𝒮_1 ≺ 𝒮_2 ≺ ⋯ ≺ 𝒮_{n′}, so it is natural to give the following definition (see Figure 1(b) and Figure 4(a)).

Definition 11.

Let 𝒯 be a dictionary. Let 𝖫𝖢𝖯 = 𝖫𝖢𝖯[2, n′] be the array such that 𝖫𝖢𝖯[j] = 𝗅𝖼𝗉(𝒮_{j−1}, 𝒮_j) for every 2 ≤ j ≤ n′.

Note that each 𝖫𝖢𝖯[j] is finite because the 𝒮_j's are pairwise distinct. Let us prove a stronger result: 𝖫𝖢𝖯[j] ≤ n − 2 for every 2 ≤ j ≤ n′ (Lemma 12). This implies that we can store an entry of 𝖫𝖢𝖯 using at most log n bits.

Lemma 12.

Let 𝒯 be a dictionary. Then, 𝖫𝖢𝖯[j] ≤ n − 2 for every 2 ≤ j ≤ n′.

Storing the LCP array explicitly can require up to n′ log n bits, so to achieve the space bound of Theorem 1 we need a sampling mechanism similar to the one for suffix arrays in Section 3.2.

Recall that, given an array A[1, n], range minimum queries are defined as follows: for every 1 ≤ i ≤ j ≤ n, 𝗋𝗆𝗊_A(i, j) returns an index i ≤ k ≤ j such that A[k] = min_{i ≤ h ≤ j} A[h]. There exists a data structure of 2n + o(n) bits that solves a query 𝗋𝗆𝗊_A(i, j) in O(1) time without accessing A [28]; moreover, this data structure always returns the smallest i ≤ k ≤ j such that A[k] = min_{i ≤ h ≤ j} A[h].
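For reference, here is the naive version of the query; like the succinct data structure of [28], it returns the leftmost position achieving the minimum.

```python
# A naive range minimum query; the data structure of [28] answers in O(1) time
# within 2n + o(n) bits and, like this version, returns the leftmost minimum.

def rmq(A, i, j):                       # 1-based, inclusive at both endpoints
    best = i
    for k in range(i + 1, j + 1):
        if A[k - 1] < A[best - 1]:      # strict '<' keeps the leftmost minimum
            best = k
    return best

assert rmq([3, 0, 5, 2, 0, 4, 1], 2, 6) == 2    # the leftmost of the two 0s
```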

Let us store a data structure for solving range minimum queries 𝗋𝗆𝗊_𝖫𝖢𝖯 on 𝖫𝖢𝖯 in O(1) time. We will use this data structure to retrieve each entry of the LCP array from our sampled LCP array. The main idea is to use the sampling mechanism in [23]: an auxiliary graph Q determines which values are sampled, based on a sampling parameter s (see Figure 4(b)). Then, we define an array 𝖫𝖢𝖯′ that stores the sampled values (respecting the order in 𝖫𝖢𝖯, see Figure 4(a)), and a bitvector B6 that marks the sampled entries of 𝖫𝖢𝖯 (see Figure 4(a)). By choosing an appropriate s, we obtain the following result.

Theorem 13.

Let 𝒯 be a dictionary. By storing o(n log σ) + O(n) bits, we can compute each entry 𝖫𝖢𝖯[j] in O(log n) time.

3.4 The Topology of the Suffix Tree of 𝓣

(a)
(b)
t: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
Z_{𝖲𝗎𝖿𝖿(𝒯)}[t]: 1 1 1 0 1 0 0 1 1 1 0 1 0 0 1 0 0 1 1 1 0 1 0 0 1 0 0 0
B7[t]: 0 0 1 0 1 0 0 0 0 1 0 1 0 0 1 0 0 0 0 1 0 1 0 0 1 0 0 0
B8[t]: 1 0 0 0 1 1 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1
Z_{𝖲𝗎𝖿𝖿′(𝒯)}[t] for 1 ≤ t ≤ 10: 1 1 0 1 0 1 0 1 0 0
(c)
Figure 5: (a) The suffix tree 𝖲𝗎𝖿𝖿(𝒯) of the dictionary 𝒯 = (abcabc, bcabc, cab) in Figure 1. Marked nodes are orange. (b) The tree 𝖲𝗎𝖿𝖿′(𝒯). (c) The bit arrays used to navigate 𝖲𝗎𝖿𝖿(𝒯) and 𝖲𝗎𝖿𝖿′(𝒯).

We can now introduce the suffix tree 𝖲𝗎𝖿𝖿(𝒯) of our dictionary 𝒯 (the details are discussed in the full version); refer to Figure 5(a) for an example. The (infinite) set of all finite strings that we can read starting from the root is 𝒫 = {𝒮_j[1, t] : 1 ≤ j ≤ n′, t ≥ 0}. Every node is an interval [ℓ, r] that describes the strings that can be read starting from the root by reading a nonempty prefix of the last edge. For example, the set of all strings P such that P is a prefix of 𝒮_j if and only if 3 ≤ j ≤ 4 is {bca, bcab, bcabc} and, in fact, the set of all strings that reach node [3, 4] is {bca, bcab, bcabc}. The suffix tree contains exactly n′ leaves (namely, [j, j] for every 1 ≤ j ≤ n′), and the set 𝒩 of all nodes has size at most 2n′ − 1. The set of all strings reaching a node [ℓ, r] is finite if and only if [ℓ, r] is an internal node, and in this case the longest string reaching [ℓ, r] has length λ_{ℓ,r} = 𝗅𝖼𝗉(𝒮_ℓ, 𝒮_r), which can be computed by using the LCP array (for example, λ_{3,4} = 𝗅𝖼𝗉(𝒮_3, 𝒮_4) = 5). The suffix tree 𝖲𝗎𝖿𝖿(𝒯) is an ordinal tree, and any ordinal tree T with t nodes can be encoded using its balanced parenthesis representation Z_T[1, 2t], a bit array that we can build incrementally as follows. Visit all nodes of the tree in depth-first search order (respecting the order of the children of each node) starting from the root. Every time we encounter a new node u we append a 1, and as soon as we leave the subtree rooted at u we append a 0. Hence Z_T[1, 2t] is a bit array containing exactly t values equal to 1 and t values equal to 0 (see Figure 5(c) for the balanced parenthesis representation Z_{𝖲𝗎𝖿𝖿(𝒯)} of 𝖲𝗎𝖿𝖿(𝒯)). Navarro and Sadakane showed how to support navigational queries on the balanced parenthesis representation of an ordinal tree in compact space and constant time (see the full version), and we store such a data structure for 𝖲𝗎𝖿𝖿(𝒯) (which requires O(n) bits). To correctly switch to the balanced parenthesis representation of 𝖲𝗎𝖿𝖿(𝒯), we also store a bitvector B7 of length 2|𝒩| such that for every 1 ≤ t ≤ 2|𝒩| we have B7[t] = 1 if and only if (i) Z_{𝖲𝗎𝖿𝖿(𝒯)}[t] = 1 and (ii) the index t corresponds to a leaf of 𝖲𝗎𝖿𝖿(𝒯) (see Figure 5(c)).
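The construction of Z_T is exactly the traversal just described; a minimal sketch, using nested Python lists as an assumed toy encoding of an ordinal tree:

```python
# Balanced parenthesis encoding of an ordinal tree: append 1 when entering a
# node, 0 when leaving its subtree (children are nested lists in this toy).

def balanced_parentheses(node):
    bits = [1]
    for child in node:
        bits += balanced_parentheses(child)
    bits.append(0)
    return bits

# A root with two children, the first of which has one child:
assert balanced_parentheses([[[]], []]) == [1, 1, 1, 0, 0, 1, 0, 0]
```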

4 Solving Circular Dictionary Matching Queries

In the following, we consider a pattern P=P[1,m] of length m, and we want to compute 𝖢𝖽𝗆(𝒯,P).

Fix 1 ≤ i ≤ m. We need to compute all 1 ≤ k ≤ n such that 𝒯_k occurs in P at position i. In general, if k ∈ D_j, then 𝒮_j = 𝒯_k^ω, hence 𝒯_k = 𝒮_j[1, |𝒯_k|] ∈ 𝒫, which means that 𝒯_k identifies a node of 𝖲𝗎𝖿𝖿(𝒯). This means that, if 𝒯_k occurs in P at position i, then there exists i′ ≥ i such that P[i, i′] ∈ 𝒫, and then every prefix of P[i, i′] is also in 𝒫 (because every prefix of a string in 𝒫 is also in 𝒫). Consequently, by considering the largest i′ such that P[i, i′] ∈ 𝒫, we know that we only need to consider P[i, i′] to compute all 1 ≤ k ≤ n such that 𝒯_k occurs in P at position i. We can then give the following definition.

Definition 14.

Let 𝒯 be a dictionary, and let P ∈ Σ*. For every 1 ≤ i ≤ m, let t_i be the largest integer such that i − 1 ≤ t_i ≤ m and P[i, t_i] ∈ 𝒫.

Note that t_i is well-defined for every 1 ≤ i ≤ m because P[i, i−1] = ϵ ∈ 𝒫.

Fix 1 ≤ i ≤ m, 1 ≤ j ≤ n′ and k ∈ D_j. If j ∈ [ℓ_{P[i,t_i]}, r_{P[i,t_i]}], then P[i, t_i] is a prefix of 𝒮_j = 𝒯_k^ω, and so 𝒯_k occurs in P at position i if and only if |𝒯_k| ≤ |P[i, t_i]|. If j ∉ [ℓ_{P[i,t_i]}, r_{P[i,t_i]}], we know that P[i, t_i] is not a prefix of 𝒯_k^ω, but it might still be true that a prefix of P[i, t_i] is equal to 𝒯_k. Let P[i, i′] be the longest prefix of P[i, t_i] that is a prefix of 𝒯_k^ω (or equivalently, let i′ be the largest integer such that j ∈ [ℓ_{P[i,i′]}, r_{P[i,i′]}]). Then, 𝒯_k occurs in P at position i if and only if |𝒯_k| ≤ |P[i, i′]|. To compute |P[i, i′]|, first notice that every prefix of P[i, t_i] reaches an ancestor of [ℓ_{P[i,t_i]}, r_{P[i,t_i]}], and every string reaching a strict ancestor of [ℓ_{P[i,t_i]}, r_{P[i,t_i]}] is a strict prefix of P[i, t_i]. As a consequence, we can first compute the nearest ancestor [ℓ′, r′] of [ℓ_{P[i,t_i]}, r_{P[i,t_i]}] for which j ∈ [ℓ′, r′], and then |P[i, i′]| is the length λ_{ℓ′,r′} of the longest string reaching node [ℓ′, r′].

The following lemma captures our intuition.

Lemma 15.

Let 𝒯 be a dictionary, let P ∈ Σ*, let 1 ≤ i ≤ m and let 1 ≤ j ≤ n′.

  1. Assume that j ∈ [ℓ_{P[i,t_i]}, r_{P[i,t_i]}]. Then, for every k ∈ D_j we have that 𝒯_k occurs in P at position i if and only if |𝒯_k| ≤ t_i − i + 1.

  2. Assume that j ∉ [ℓ_{P[i,t_i]}, r_{P[i,t_i]}]. Let [ℓ, r] be the highest ancestor of [ℓ_{P[i,t_i]}, r_{P[i,t_i]}] in 𝖲𝗎𝖿𝖿(𝒯) (possibly [ℓ_{P[i,t_i]}, r_{P[i,t_i]}] itself) for which j ∉ [ℓ, r]. Then, [ℓ, r] is not the root of 𝖲𝗎𝖿𝖿(𝒯). Moreover, let [ℓ′, r′] be the parent of [ℓ, r] in 𝖲𝗎𝖿𝖿(𝒯). Then, j ∈ [ℓ′, ℓ−1] ∪ [r+1, r′], and for every k ∈ D_j we have that 𝒯_k occurs in P at position i if and only if |𝒯_k| ≤ λ_{ℓ′,r′}.

Fix 1 ≤ i ≤ m. To implement Lemma 15, we need to navigate 𝖲𝗎𝖿𝖿(𝒯) starting from node [ℓ_{P[i,t_i]}, r_{P[i,t_i]}], so we need to know ℓ_{P[i,t_i]} and r_{P[i,t_i]}. Moreover, if j ∈ [ℓ_{P[i,t_i]}, r_{P[i,t_i]}] and k ∈ D_j, we know that 𝒯_k occurs in P at position i if and only if |𝒯_k| ≤ t_i − i + 1, so we also need to compute t_i. In the full version, we present an O(m log n)-time algorithm that computes all the t_i's, all the ℓ_{P[i,t_i]}'s and all the r_{P[i,t_i]}'s. The algorithm shares many similarities with Ohlebusch et al.'s algorithm for computing matching statistics using the FM-index and the LCP array of a string [44].

For every 1 ≤ i ≤ m, let 𝖢𝖽𝗆_i(𝒯, P) be the set of all 1 ≤ k ≤ n such that 𝒯_k occurs in P at position i, and let occ_i = |𝖢𝖽𝗆_i(𝒯, P)|. Computing 𝖢𝖽𝗆(𝒯, P) is equivalent to computing 𝖢𝖽𝗆_i(𝒯, P) for every 1 ≤ i ≤ m, and occ = ∑_{i=1}^{m} occ_i. In view of Theorem 1, it will be sufficient to show that we can compute each 𝖢𝖽𝗆_i(𝒯, P) in O((1 + occ_i) log n) time, because then we can compute 𝖢𝖽𝗆(𝒯, P) in ∑_{i=1}^{m} O((1 + occ_i) log n) = O((m + occ) log n) time.

Fix 1 ≤ i ≤ m. Lemma 15 suggests that, to compute 𝖢𝖽𝗆_i(𝒯, P), we can proceed as follows. We start from node [ℓ_{P[i,t_i]}, r_{P[i,t_i]}] in 𝖲𝗎𝖿𝖿(𝒯), we consider every j ∈ [ℓ_{P[i,t_i]}, r_{P[i,t_i]}] and every k ∈ D_j, and we decide whether 𝒯_k occurs in P at position i (we will see later how to do this efficiently). Next, we navigate from node [ℓ_{P[i,t_i]}, r_{P[i,t_i]}] up to the root. Every time we move from a node [ℓ, r] to its parent [ℓ′, r′], we consider every j ∈ [ℓ′, ℓ−1] ∪ [r+1, r′] and every k ∈ D_j, and we decide whether 𝒯_k occurs in P at position i (again, we will see later how to do this efficiently). To navigate 𝖲𝗎𝖿𝖿(𝒯) from [ℓ_{P[i,t_i]}, r_{P[i,t_i]}] to the root, in the worst case we visit many nodes, say Ω(log n) nodes. If each of these steps led to discovering at least one element of 𝖢𝖽𝗆_i(𝒯, P), we could hope to achieve the bound O((1 + occ_i) log n); but in the worst case many steps are not successful, occ_i is small, and we cannot achieve the bound O((1 + occ_i) log n).

Notice that i only determines the initial node [ℓ_{P[i,t_i]}, r_{P[i,t_i]}]: once we are navigating 𝖲𝗎𝖿𝖿(𝒯) up to the root, we do not need i to assess whether we have found an element of 𝖢𝖽𝗆_i(𝒯, P), because 𝒯_k occurs in P at position i if and only if |𝒯_k| ≤ λ_{ℓ′,r′}, and λ_{ℓ′,r′} does not depend on i. This means that we can determine which nodes will allow discovering an element of 𝖢𝖽𝗆_i(𝒯, P) before knowing the pattern P (that is, at indexing time). We can then give the following definition which, crucially, does not depend on i or P (see Figure 5(a)).

Definition 16.

Let 𝒯 be a dictionary, and let [ℓ, r] ∈ 𝒩. We say that [ℓ, r] is marked if one of the following is true: (i) [ℓ, r] = [1, n′] (i.e., [ℓ, r] is the root of 𝖲𝗎𝖿𝖿(𝒯)), or (ii) [ℓ, r] ≠ [1, n′] (i.e., [ℓ, r] is not the root of 𝖲𝗎𝖿𝖿(𝒯)) and, letting [ℓ′, r′] be the parent of [ℓ, r] in 𝖲𝗎𝖿𝖿(𝒯), there exist j ∈ [ℓ′, ℓ−1] ∪ [r+1, r′] and k ∈ D_j such that |𝒯_k| ≤ λ_{ℓ′,r′}.

When we navigate 𝖲𝗎𝖿𝖿(𝒯) from [ℓ_{P[i,t_i]}, r_{P[i,t_i]}] to the root, we should skip all non-marked nodes. Notice that the set of all marked nodes induces a new tree structure. More precisely, consider the ordinal tree 𝖲𝗎𝖿𝖿′(𝒯) with root [1, n′] defined as follows. The set 𝒩′ of its nodes is the set of all marked nodes in 𝒩; in particular, the root [1, n′] of 𝖲𝗎𝖿𝖿(𝒯) belongs to 𝒩′. We can build 𝖲𝗎𝖿𝖿′(𝒯) incrementally. We traverse all nodes of 𝖲𝗎𝖿𝖿(𝒯) in depth-first search order, respecting the order of the children of each node (this is the same order used for the balanced parenthesis representation of 𝖲𝗎𝖿𝖿(𝒯)). The first marked node that we encounter is [1, n′], which will be the root of 𝖲𝗎𝖿𝖿′(𝒯). Then, every time we encounter a marked node [ℓ, r] ≠ [1, n′], let [ℓ′, r′] be the nearest strict ancestor of [ℓ, r] in 𝖲𝗎𝖿𝖿(𝒯) that is marked, namely, the first marked node that we encounter after [ℓ, r] on the (backward) path from [ℓ, r] to [1, n′] in 𝖲𝗎𝖿𝖿(𝒯). Then, [ℓ′, r′] will be the parent of [ℓ, r] in 𝖲𝗎𝖿𝖿′(𝒯), and if [ℓ′, r′] has already been assigned q ≥ 0 children, then [ℓ, r] will be its (q+1)-th smallest child. See Figure 5(b) for an example, and see Figure 5(c) for the balanced parenthesis representation of 𝖲𝗎𝖿𝖿′(𝒯). In addition to Navarro and Sadakane's data structure for the tree 𝖲𝗎𝖿𝖿(𝒯), we also store the same data structure for the tree 𝖲𝗎𝖿𝖿′(𝒯), which also requires O(n) bits.

We also remember which nodes of 𝖲𝗎𝖿𝖿(𝒯) are marked by using a bitvector B8 of length 2|𝒩| such that for every 1 ≤ t ≤ 2|𝒩| we have B8[t] = 1 if and only if Z_{𝖲𝗎𝖿𝖿(𝒯)}[t] is one of the two values corresponding to a marked node of 𝖲𝗎𝖿𝖿(𝒯). More precisely, we build B8 as follows. We visit all nodes of 𝖲𝗎𝖿𝖿(𝒯) in depth-first search order, respecting the order of the children of each node. Every time we encounter a new node u we append a 1 if u is marked and a 0 if u is not marked, and as soon as we leave the subtree rooted at u we again append a 1 if u is marked and a 0 if u is not marked (see Figure 5(c)). By using B8, in O(1) time (i) we can move between the representations of 𝖲𝗎𝖿𝖿(𝒯) and 𝖲𝗎𝖿𝖿′(𝒯) in both directions, and (ii) we can determine the nearest marked ancestor of a node (see the full version).

Define the array 𝖫𝖾𝗇 of length n′′ = ∑_{h=1}^{d} |ρ_h| (the length of 𝖲𝖠′) such that 𝖫𝖾𝗇[t] = |𝒯_{𝖲𝖠′[t]}| for every t (see Figure 3). We will not store 𝖫𝖾𝗇 itself, but only a data structure supporting range minimum queries on 𝖫𝖾𝗇 in O(1) time, which requires O(n) bits (see Section 3.3).

We now have all the ingredients to prove Theorem 1, our main claim. As we have seen, it suffices to compute each 𝖢𝖽𝗆_i(𝒯, P) in O((1 + occ_i) log n) time. We start from node [ℓ_{P[i,t_i]}, r_{P[i,t_i]}] and, for every k ∈ ∪_{ℓ_{P[i,t_i]} ≤ j ≤ r_{P[i,t_i]}} D_j, we determine whether 𝒯_k occurs in P at position i. By Lemma 15, we need to check whether |𝒯_k| ≤ t_i − i + 1, so we repeatedly solve range minimum queries on 𝖫𝖾𝗇, starting from the interval [ℓ_{P[i,t_i]}, r_{P[i,t_i]}], to find all k's for which 𝒯_k occurs in P at position i in time proportional to the number of such occurrences. Next, we compute the lowest marked ancestor of [ℓ_{P[i,t_i]}, r_{P[i,t_i]}], and we use 𝖲𝗎𝖿𝖿′(𝒯) to determine all marked ancestors of [ℓ_{P[i,t_i]}, r_{P[i,t_i]}]. Let [ℓ, r] be one of these ancestors, and let [ℓ′, r′] be its parent in 𝖲𝗎𝖿𝖿(𝒯). For every k ∈ (∪_{ℓ′ ≤ j ≤ ℓ−1} D_j) ∪ (∪_{r+1 ≤ j ≤ r′} D_j), we determine whether 𝒯_k occurs in P at position i. By Lemma 15, we need to check whether |𝒯_k| ≤ λ_{ℓ′,r′}, so we repeatedly solve range minimum queries on 𝖫𝖾𝗇, starting from the intervals [ℓ′, ℓ−1] and [r+1, r′], again finding all k's for which 𝒯_k occurs in P at position i in time proportional to the number of such occurrences.
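The output-sensitive enumeration used in both steps is the classical recursive trick on top of range minimum queries: query the minimum of the current interval, and if it is at most the threshold, report it and recurse on the two remaining halves, so that every call either reports an occurrence or stops immediately. A sketch on the array 𝖫𝖾𝗇 of Figure 3, using the naive rmq from Section 3.3 (report_short is our name; the real algorithm runs on the succinct structures):

```python
# Report every t in [a, b] with Len[t] <= bound in time proportional to the
# number of reported entries (plus one rmq per failed call).

def report_short(Len, a, b, bound, out):
    if a > b:
        return
    t = rmq(Len, a, b)                   # leftmost minimum of Len[a..b]
    if Len[t - 1] <= bound:
        out.append(t)
        report_short(Len, a, t - 1, bound, out)
        report_short(Len, t + 1, b, bound, out)

Len = [6, 3, 5, 6, 3, 5, 5, 6, 3, 5, 5]  # Len[t] = |T_SA'[t]| from Figure 3
found = []
report_short(Len, 1, 11, 3, found)
assert sorted(found) == [2, 5, 9]        # exactly the entries with Len[t] <= 3
```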

The details of the proof of Theorem 1 are in the full version. Note that we cannot directly infer that our data structure is an encoding of 𝒯 from Theorem 9 because the graph G𝒯 is not sufficient to retrieve 𝒯.

5 Conclusions and Future Work

We have shown how to improve and extend previous results on circular dictionary matching. In the literature, much effort has been devoted to designing construction algorithms for the data structure of Hon et al. [36] and, implicitly, the eBWT of Mantaci et al. [40]. All available approaches first build the eBWT and the suffix array of circular strings, and then use the suffix array of circular strings to build the remaining auxiliary data structures. In [32], Hon et al. showed that the eBWT and the suffix array can be built in O(n log n) time using O(n log σ + d log n) bits of working space. Bannai et al. [7] improved the time bound to O(n) by showing that the bijective BWT can be built in linear time (Bonomo et al. [10, Section 6] showed how to reduce, in linear time, the problem of computing the eBWT to the problem of computing the bijective BWT). A more direct algorithm was proposed by Boucher et al. [11]. However, all these algorithms are still based on assumptions (i-ii) in Section 1 and cannot be immediately applied to our setting, in which we consider an arbitrary dictionary. After building the eBWT and the suffix array, it is possible to build the remaining auxiliary data structures in O(n log n) time using O(n log σ + d log n) bits of working space [33]. Some proofs in [33] are only sketched, but it is not too difficult to show that, if we do not insist on achieving O(n log σ + d log n) bits of working space, it is possible to build the remaining auxiliary data structures from the eBWT and the suffix array in linear time.

In a companion paper, we will show that the data structure of Theorem 1 can be built in O(n) time. The main technical issue is understanding how to remove assumptions (i-ii) in Section 1 when building the suffix array. The algorithm by Boucher et al. [11] is a recursive algorithm based on the SAIS algorithm [43] for building the suffix array. SAIS divides all suffixes into two categories (S-type and L-type). To remove assumptions (i-ii), we will use three (and not two) categories, building on a recent recursive algorithm [19] for constructing the "suffix array" of a deterministic automaton.

The natural question is whether the data structure of Theorem 1 (or a similar data structure with the same functionality) can be built in O(n) time within O(n log σ + d log n) bits of working space. We conjecture that this is possible; however, already for d = 1 this problem should be at least as difficult as building the compressed suffix tree of a string in O(n) time and O(n log σ) bits of working space, which requires advanced techniques [41, 9].

References

  • [1] Alfred V. Aho and Margaret J. Corasick. Efficient string matching: an aid to bibliographic search. Commun. ACM, 18(6):333–340, June 1975. doi:10.1145/360825.360855.
  • [2] Jarno Alanko, Nicola Cotumaccio, and Nicola Prezza. Linear-time minimization of Wheeler DFAs. In 2022 Data Compression Conference (DCC), pages 53–62. IEEE, 2022. doi:10.1109/DCC52660.2022.00013.
  • [3] Jarno Alanko, Giovanna D’Agostino, Alberto Policriti, and Nicola Prezza. Regular languages meet prefix sorting. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 911–930. SIAM, 2020. doi:10.1137/1.9781611975994.55.
  • [4] Jarno N Alanko, Davide Cenzato, Nicola Cotumaccio, Sung-Hwan Kim, Giovanni Manzini, and Nicola Prezza. Computing the LCP array of a labeled graph. In 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024). Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024.
  • [5] Amihood Amir, Tsvi Kopelowitz, Avivit Levy, Seth Pettie, Ely Porat, and B Riva Shalom. Mind the gap! Online dictionary matching with one gap. Algorithmica, 81:2123–2157, 2019. doi:10.1007/S00453-018-0526-2.
  • [6] Amihood Amir, Avivit Levy, Ely Porat, and B Riva Shalom. Dictionary matching with a few gaps. Theoretical Computer Science, 589:34–46, 2015. doi:10.1016/J.TCS.2015.04.011.
  • [7] Hideo Bannai, Juha Kärkkäinen, Dominik Köppl, and Marcin Piątkowski. Constructing the bijective and the extended Burrows-Wheeler transform in linear time. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021.
  • [8] Djamal Belazzougui. Succinct dictionary matching with no slowdown. In Amihood Amir and Laxmi Parida, editors, Combinatorial Pattern Matching, pages 88–100, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg. doi:10.1007/978-3-642-13509-5_9.
  • [9] Djamal Belazzougui, Fabio Cunial, Juha Kärkkäinen, and Veli Mäkinen. Linear-time string indexing and analysis in small space. ACM Transactions on Algorithms (TALG), 16(2):1–54, 2020. doi:10.1145/3381417.
  • [10] Silvia Bonomo, Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino. Sorting conjugates and suffixes of words in a multiset. International Journal of Foundations of Computer Science, 25(08):1161–1175, 2014.
  • [11] Christina Boucher, Davide Cenzato, Zsuzsanna Lipták, Massimiliano Rossi, and Marinella Sciortino. Computing the original eBWT faster, simpler, and with less memory. In Thierry Lecroq and Hélène Touzet, editors, String Processing and Information Retrieval, pages 129–142, Cham, 2021. Springer International Publishing. doi:10.1007/978-3-030-86692-1_11.
  • [12] Michael Burrows and David J. Wheeler. A block-sorting lossless data compression algorithm. Technical report, Systems Research Center, 1994.
  • [13] Davide Cenzato, Veronica Guerrini, Zsuzsanna Lipták, and Giovanna Rosone. Computing the optimal BWT of very large string collections. In 2023 Data Compression Conference (DCC), pages 71–80, 2023. doi:10.1109/DCC55655.2023.00015.
  • [14] Panagiotis Charalampopoulos, Tomasz Kociumaka, Manal Mohamed, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Internal dictionary matching. Algorithmica, 83(7):2142–2169, 2021. doi:10.1007/S00453-021-00821-Y.
  • [15] Panagiotis Charalampopoulos, Tomasz Kociumaka, Jakub Radoszewski, Solon P Pissis, Wojciech Rytter, Tomasz Waleń, and Wiktor Zuba. Approximate circular pattern matching. In ESA 2022-30th Annual European Symposium on Algorithms, 2022.
  • [16] Raphaël Clifford, Allyx Fontaine, Ely Porat, Benjamin Sach, and Tatiana Starikovskaya. Dictionary matching in a stream. In Algorithms-ESA 2015: 23rd Annual European Symposium, Patras, Greece, September 14-16, 2015, Proceedings, pages 361–372. Springer, 2015. doi:10.1007/978-3-662-48350-3_31.
  • [17] Alessio Conte, Nicola Cotumaccio, Travis Gagie, Giovanni Manzini, Nicola Prezza, and Marinella Sciortino. Computing matching statistics on Wheeler DFAs. In 2023 Data Compression Conference (DCC), pages 150–159, 2023. doi:10.1109/DCC55655.2023.00023.
  • [18] Nicola Cotumaccio. Graphs can be succinctly indexed for pattern matching in O(|E|2+|V|5/2) time. In 2022 Data Compression Conference (DCC), pages 272–281, 2022. doi:10.1109/DCC52660.2022.00035.
  • [19] Nicola Cotumaccio. Prefix sorting DFAs: A recursive algorithm. In 34th International Symposium on Algorithms and Computation (ISAAC 2023). Schloss-Dagstuhl-Leibniz Zentrum für Informatik, 2023.
  • [20] Nicola Cotumaccio. A Myhill-Nerode theorem for generalized automata, with applications to pattern matching and compression. In 41st International Symposium on Theoretical Aspects of Computer Science (STACS 2024). Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024.
  • [21] Nicola Cotumaccio. Improved circular dictionary matching, 2025. arXiv:2504.03394.
  • [22] Nicola Cotumaccio, Giovanna D’Agostino, Alberto Policriti, and Nicola Prezza. Co-lexicographically ordering automata and regular languages - part I. J. ACM, 70(4), August 2023. doi:10.1145/3607471.
  • [23] Nicola Cotumaccio, Travis Gagie, Dominik Köppl, and Nicola Prezza. Space-time trade-offs for the LCP array of Wheeler DFAs. In Franco Maria Nardini, Nadia Pisanti, and Rossano Venturini, editors, String Processing and Information Retrieval, pages 143–156, Cham, 2023. Springer Nature Switzerland. doi:10.1007/978-3-031-43980-3_12.
  • [24] Nicola Cotumaccio and Nicola Prezza. On indexing and compressing finite automata. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2585–2599. SIAM, 2021. doi:10.1137/1.9781611976465.153.
  • [25] Nicola Cotumaccio and Catia Trubiani. Convex Petri nets. In Michael Köhler-Bussmeier, Daniel Moldt, and Heiko Rölke, editors, Proceedings of the International Workshop on Petri Nets and Software Engineering 2024 co-located with the 45th International Conference on Application and Theory of Petri Nets and Concurrency (PETRI NETS 2024), June 24 - 25, 2024, Geneva, Switzerland, volume 3730 of CEUR Workshop Proceedings. CEUR-WS.org, 2024. URL: https://ceur-ws.org/Vol-3730/poster1.pdf.
  • [26] Jonathan A Eisen. Environmental shotgun sequencing: its potential and challenges for studying the hidden world of microbes. PLoS biology, 5(3):e82, 2007.
  • [27] Paolo Ferragina and Giovanni Manzini. Indexing compressed text. J. ACM, 52(4):552–581, July 2005. doi:10.1145/1082036.1082039.
  • [28] Johannes Fischer and Volker Heun. Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM Journal on Computing, 40(2):465–492, 2011. doi:10.1137/090779759.
  • [29] Travis Gagie, Giovanni Manzini, and Jouni Sirén. Wheeler graphs: A framework for BWT-based data structures. Theoretical computer science, 698:67–78, 2017. doi:10.1016/J.TCS.2017.06.016.
  • [30] Paweł Gawrychowski and Tatiana Starikovskaya. Streaming dictionary matching with mismatches. Algorithmica, pages 1–21, 2019.
  • [31] Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz, and Ely Porat. Dynamic dictionary matching in the online model. In Algorithms and Data Structures: 16th International Symposium, WADS 2019, Edmonton, AB, Canada, August 5–7, 2019, Proceedings 16, pages 409–422. Springer, 2019. doi:10.1007/978-3-030-24766-9_30.
  • [32] Wing-Kai Hon, Tsung-Han Ku, Chen-Hua Lu, Rahul Shah, and Sharma V. Thankachan. Efficient algorithm for circular Burrows-Wheeler transform. In Juha Kärkkäinen and Jens Stoye, editors, Combinatorial Pattern Matching, pages 257–268, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. doi:10.1007/978-3-642-31265-6_21.
  • [33] Wing-Kai Hon, Tsung-Han Ku, Rahul Shah, and Sharma V. Thankachan. Space-efficient construction algorithm for the circular suffix tree. In Johannes Fischer and Peter Sanders, editors, Combinatorial Pattern Matching, pages 142–152, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg. doi:10.1007/978-3-642-38905-4_15.
  • [34] Wing-Kai Hon, Tsung-Han Ku, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. Faster compressed dictionary matching. Theoretical Computer Science, 475:113–119, 2013. doi:10.1016/j.tcs.2012.10.050.
  • [35] Wing-Kai Hon, Tak-Wah Lam, Rahul Shah, Sharma V Thankachan, Hing-Fung Ting, and Yilin Yang. Dictionary matching with a bounded gap in pattern or in text. Algorithmica, 80:698–713, 2018. doi:10.1007/S00453-017-0288-2.
  • [36] Wing-Kai Hon, Chen-Hua Lu, Rahul Shah, and Sharma V Thankachan. Succinct indexes for circular patterns. In Algorithms and Computation: 22nd International Symposium, ISAAC 2011, Yokohama, Japan, December 5-8, 2011. Proceedings 22, pages 673–682. Springer, 2011. doi:10.1007/978-3-642-25591-5_69.
  • [37] Costas S Iliopoulos, Solon P Pissis, and M Sohel Rahman. Searching and indexing circular patterns. Algorithms for Next-Generation Sequencing Data: Techniques, Approaches, and Applications, pages 77–90, 2017. doi:10.1007/978-3-319-59826-0_3.
  • [38] Costas S. Iliopoulos and M. Sohel Rahman. Indexing circular patterns. In Shin-ichi Nakano and Md. Saidur Rahman, editors, WALCOM: Algorithms and Computation, pages 46–57, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg. doi:10.1007/978-3-540-77891-2_5.
  • [39] M. Lothaire. Combinatorics on words, volume 17. Cambridge university press, 1997.
  • [40] Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino. An extension of the Burrows–Wheeler transform. Theoretical Computer Science, 387(3):298–312, 2007. doi:10.1016/J.TCS.2007.07.014.
  • [41] J Ian Munro, Gonzalo Navarro, and Yakov Nekrich. Space-efficient construction of compressed indexes in deterministic linear time. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 408–424. SIAM, 2017. doi:10.1137/1.9781611974782.26.
  • [42] Gonzalo Navarro. Compact Data Structures: A Practical Approach. Cambridge University Press, 2016.
  • [43] Ge Nong, Sen Zhang, and Wai Hong Chan. Two efficient algorithms for linear time suffix array construction. IEEE Transactions on Computers, 60(10):1471–1484, 2011. doi:10.1109/TC.2010.188.
  • [44] Enno Ohlebusch, Simon Gog, and Adrian Kügel. Computing matching statistics and maximal exact matches on compressed full-text indexes. In Edgar Chavez and Stefano Lonardi, editors, String Processing and Information Retrieval, pages 347–358, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg. doi:10.1007/978-3-642-16321-0_36.
  • [45] Kunihiko Sadakane. Compressed suffix trees with full functionality. Theor. Comp. Sys., 41(4):589–607, December 2007. doi:10.1007/s00224-006-1198-x.
  • [46] Carola Simon and Rolf Daniel. Metagenomic analyses: past and future trends. Applied and environmental microbiology, 77(4):1153–1161, 2011.
  • [47] Blair L Strang and Nigel D Stow. Circularization of the herpes simplex virus type 1 genome upon lytic infection. Journal of virology, 79(19):12487–12494, 2005.