Sorted Consecutive Occurrence Queries in Substrings

Akram, Waseem; Mieno, Takuya

doi:10.4230/LIPIcs.CPM.2025.24

Sorted Consecutive Occurrence Queries in Substrings

Waseem Akram

Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur, India Takuya Mieno

Department of Computer and Network Engineering, University of Electro-Communications, Chofu, Japan

Abstract

The string indexing problem is a fundamental computational problem with numerous applications, including information retrieval and bioinformatics. It aims to efficiently solve the pattern matching problem: given a text $T$ of length $n$ for preprocessing and a pattern $P$ of length $m$ as a query, the goal is to report all occurrences of $P$ as substrings of $T$ . Navarro and Thankachan [CPM 2015, Theor. Comput. Sci. 2016] introduced a variant of this problem called the gap-bounded consecutive occurrence query, which reports pairs of consecutive occurrences of $P$ in $T$ such that their gaps (i.e., the distances between them) lie within a query-specified range $[g_{1},g_{2}]$ . Recently, Bille et al. [FSTTCS 2020, Theor. Comput. Sci. 2022] proposed the top- $k$ close consecutive occurrence query, which reports the $k$ closest consecutive occurrences of $P$ in $T$ , sorted in non-decreasing order of distance. Both problems are optimally solved in query time with $O(n\log n)$ -space data structures.

In this paper, we generalize these problems to the range query model, which focuses only on occurrences of $P$ in a specified substring $T[a..b]$ of $T$ . Our contributions are as follows:

$\blacksquare$

We propose an $O(n\log^{2}n)$ -space data structure that answers the range top- $k$ consecutive occurrence query in $O(|P|+\log\log n+k)$ time.
$\blacksquare$

We propose an $O(n\log^{2+\epsilon}n)$ -space data structure that answers the range gap-bounded consecutive occurrence query in $O(|P|+\log\log n+\mathit{output})$ time, where $\epsilon$ is a positive constant and $\mathit{output}$ denotes the number of outputs.

Additionally, as by-products, we present algorithms for geometric problems involving weighted horizontal segments in a 2D plane, which are of independent interest.

Keywords and phrases:

string algorithm, consecutive occurrences, suffix tree

Funding:

Takuya Mieno: JSPS KAKENHI Grant Number JP24K20734.

Copyright and License:

2012 ACM Subject Classification:

Mathematics of computing

\rightarrow

Combinatorial algorithms

Acknowledgements:

We sincerely thank the anonymous referees for their thorough review and valuable feedback, which have contributed to enhancing the clarity and quality of this paper. The first author extends his heartfelt gratitude to his thesis supervisor, Prof. Sanjeev Saxena, for his invaluable guidance and unwavering support throughout this work.

DOI:

10.4230/LIPIcs.CPM.2025.24

Event:

36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)

Editors:

Paola Bonizzoni and Veli Mäkinen

Series and Publisher:

Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik

1 Introduction

The string indexing problem involves constructing a data structure that efficiently supports pattern matching queries. Given a text $T$ of length $n$ and a pattern $P$ of length $m$ , the goal is to report all occurrences of $P$ as substrings of $T$ . This problem has numerous applications in the real world including information retrieval and bioinformatics. A typical example of such an index structure is the suffix tree [37], a compact trie of all suffixes of $T$ . When combined with perfect hashing (e.g., [16, 32]), a suffix tree provides an $O(n)$ -space data structure that answers pattern matching queries in optimal $O(|P|+\mathit{occ})$ time for given pattern $P$ , where $\mathit{occ}$ is the number of occurrences of $P$ in $T$ .

In this work, we focus on variants of pattern matching queries that consider the distances between the starting positions of every two occurrences of $P$ in $T$ . Keller et al. [23] introduced a non-overlapping pattern matching problem, which reports only occurrences of $P$ separated by at least $|P|$ characters. They proposed a data structure of size $O(n\log n)$ that reports the maximal sequence of non-overlapping occurrences of $P$ in $T$ sorted in textual order, in $O(m+\mathit{output}\cdot\log\log n)$ query time where $\mathit{output}$ is the output size. Cohen and Porat [12] later improved the query time to $O(m+\mathit{output})$ . Several subsequent studies extended their work [17, 22, 20].

In 2016, Navarro and Thankachan [28] introduced another new variant of the pattern matching query called the consecutive occurrence query. The query reports consecutive occurrences of pattern $P$ , where each pair of occurrences has no other occurrences of $P$ between them and the distance is within range $[g_{1},g_{2}]$ specified by a query. They designed an $O(n\log n)$ -space data structure that supports the consecutive occurrence query in optimal $O(m+\mathit{output})$ time. Subsequently, Bille et al. [7] considered the top- $k$ close consecutive occurrence query, which reports (at most) $k$ consecutive occurrences of $P$ in non-decreasing order of distance, where $k$ is specified in a query. They proposed an $O(n\log n)$ -space data structure that answers the top- $k$ consecutive occurrence query in optimal $O(m+k)$ time. In the same work, they also presented two other trade-offs for the problem, each using $O(n)$ space, with query times of either $O(m+k^{1+\epsilon})$ or $O(m+k\log^{1+\epsilon}n)$ . Recently, Akram and Saxena [2] revisited Bille et al.’s work and proposed a simpler data structure with the same time/space bounds, which is the basis of our proposed method in this paper. Further, Bille et al. [8] addressed an extension of the query that requires consecutive occurrences of two (possibly distinct) patterns $P_{1}$ and $P_{2}$ where their distances lie within range $[g_{1},g_{2}]$ . They showed that the query can be answered in $\tilde{O}(|P_{1}|+|P_{2}|+n^{2/3}\mathit{occ}^{1/3})$ time with $\tilde{O}(n)$ space, and that any data structure using $\tilde{O}(n)$ space requires $\tilde{\Omega}(|P_{1}|+|P_{2}|+\sqrt{n})$ query time unless the String SetDisjointness Conjecture [21] fails. More recently, Gawrychowski, Gourdel, Starikovskaya and Steiner considered consecutive occurrence problems on grammar-compressed texts [18, 19]. As we have mentioned, the consecutive occurrence queries has attracted significant attention in recent years.

In this paper, we consider each variant of consecutive occurrence queries defined in [7] and [28] within substrings specified by queries. The first one is the range top- $k$ consecutive occurrence query that reports the $k$ closest consecutive occurrences of pattern $P$ in substring $T[a..b]$ , sorted by distance, where the query input consists of pattern $P$ , range $[a,b]$ , and integer $k$ . The second one is the range gap-bounded consecutive occurrence query that reports all the consecutive occurrences of pattern $P$ in substring $T[a..b]$ where their distances lie within $[g_{1},g_{2}]$ , where the query input consists of pattern $P$ , range $[a,b]$ representing the substring of $T$ , and gap range $[g_{1},g_{2}]$ . The formal definitions of the problems are provided in Section 2. Our main results are as follows:

$\blacksquare$

For the range top- $k$ consecutive occurrence query, we propose an $O(n\log^{2}n)$ -size data structure with a query time of $O(|P|+\log\log n+k)$ (Theorem 12).
$\blacksquare$

For the range gap-bounded consecutive occurrence query, we propose an $O(n\log^{2+\epsilon}n)$ -size data structure with a query time $O(|P|+\log\log n+\mathit{output})$ , where $\epsilon$ is a positive constant (Theorem 15).

Both of the data structures consist of the suffix tree [37] of $T$ and some appropriately chosen geometric data structures, associated with heavy paths [36] of the suffix tree. Although the high-level idea is similar to that of [28], we carefully employ non-trivial combinations of data structures since our problems only require occurrences within the query substring and outputs needs to be sorted by distance unlike [28]. Additionally, as by-products, we develop algorithms for geometric problems involving weighted horizontal segments in 2D plane, which are of independent interest.

Related Work

Variants of range queries (a.k.a. internal or substring queries) have been proposed for various problems in string processing. Mäkinen and Navarro [26] proposed a data structure for a fundamental range query called the position-restricted substring searching query, which reports the occurrences of pattern $P$ within a specified substring $T[a..b]$ . Their solution requires $O(n\log^{\epsilon}n)$ space, and answers queries in $O(m+\log\log n+\mathit{occ})$ time. Keller et al. [23] introduced the range non-overlapping pattern matching query and proposed a data structure of size $O(n\log n)$ that can answer queries in $O(m+k\log\log n)$ time. Cohen and Porat [12] later improved these results to $O(n\log^{\epsilon}n)$ space and $O(m+\log\log n+k)$ query time. Crochemore et al. [14] proposed an alternative trade-off with $O(n^{1+\epsilon})$ space and $O(m+k)$ query time.

Bille and Gørtz [6] showed that several string problems, including the position-restricted substring searching query, can be reduced to the substring range reporting problem, and provided an efficient solution for the problem. One of the results from [6] is a data structure of size $O(n\log^{\epsilon}n)$ that answers the position-restricted substring searching query in optimal $O(m+\mathit{occ})$ time, which is an improvement of Mäkinen and Navarro’s result [26]. Range queries have been considered not only for finding patterns in strings but also for computing regularities in strings or compressing substrings (see [13, 24, 15, 1, 3, 4, 27, 35, 25] and references therein).

Given a set $S$ of $N$ horizontal segments in the plane, the orthogonal segment intersection problem involves designing an efficient data structure that can be used to find the segments intersected by a vertical query segment. Chazelle [10] presented a solution that occupies $O(N)$ space and takes $O(\log N+K)$ time to answer a query, where $K$ is the output size. It is an important problem in computational geometry and has been studied in various settings (see [31, 33, 34]).

Rahul and Janardan [31] studied a variant of the orthogonal segments intersection problem where the input set $S$ consists of weighted horizontal segments and queries to be answered are of the following type: given a vertical segment and an integer $k$ , report the $k$ segments intersected by the query segment with largest weights. They proposed a general technique to solve top- $k$ variants of geometric problems efficiently, and solve this problem using $O(N\log^{2}N)$ space and $O(\log^{3}N+K)$ query time.

2 Preliminaries

2.1 Notations

Let $\Sigma$ be an ordered alphabet. An element in $\Sigma$ is called a character. An element in $\Sigma^{\star}$ is called a string. The length of a string $T$ is denoted by $|T|$ . The empty string $\varepsilon$ is the string of length $0$ . If $T=xyz$ holds for strings $T, x, y$ , and $z$ , then $x, y$ , and $z$ are called a prefix, a substring, and a suffix of $T$ , respectively. They are called a proper prefix, a proper substring, and a proper suffix of $T$ if $x\neq T$ , $y\neq T$ , and $z\neq T$ , respectively. For a string $T$ and an integer $i$ with $1\leq i\leq|T|$ , we denote by $T[i]$ the $i$ th character of $T$ . Also, for integers $i, j$ with $1\leq i\leq j\leq|T|$ , we denote by $T[i..j]$ the substring of $T$ that starts at position $i$ and ends at position $j$ . For convenience, we define $T[i..j]=\varepsilon$ if $i>j$ . We sometimes use notation $T[i..]:=T[i..|T|]$ for suffixes. For strings $T$ and $w$ , we denote by $\mathit{occ}(w,T)=\{i\mid T[i..i+|w|-1]=w\}$ the set of starting positions of occurrences of $w$ in $T$ . For two positions $i,j\in\mathit{occ}(w,T)$ with $i<j$ , the pair $(i,j)$ is called a consecutive occurrence of $w$ in $T$ if there is no element $k\in\mathit{occ}(w,T)$ with $i<k<j$ . For a consecutive occurrence $(i,j)$ we call the value $j-i$ the distance of $(i,j)$ . We denote by $\mathit{cons}(w,T)$ the set of consecutive occurrences of $w$ in $T$ . Note that $|\mathit{cons}(w,T)|=|\mathit{occ}(w,T)|-1$ if $\mathit{occ}(w,T)\neq\emptyset$ .

Now we formally define two problems we tackle in this paper.

Definition 1 (Top- $k$ query).

The range top- $k$ consecutive occurrence query over string $T$ is, given a query $(P,[a,b],k)$ where $P$ is a pattern and $a, b, k$ are integers, to output the top- $k$ close consecutive occurrences $(i,j)\in\mathit{cons}(P,T[a..b])$ , sorted in non-decreasing order of distance.

Definition 2 (Gap-bounded query).

The range gap-bounded consecutive occurrence query over string $T$ is, given a query $(P,[a,b],[g_{1},g_{2}])$ where $P$ is a pattern and $a,b,g_{1},g_{2}$ are integers, to output the consecutive occurrences $(i,j)\in\mathit{cons}(P,T[a..b])$ with $g_{1}\leq j-i\leq g_{2}$ , sorted in non-decreasing order of distance.

We give a concrete example for the top- $k$ query in Figure 1.

Figure 1: An example of the top-

k

query for

k=4

. The occurrences of

P=\mathtt{aba}

in the index range

[3,20]

are

3,6,8,10,12,

and

17

. The occurrence of

P

at index

19

is invalid with respect to the range

[3,20]

as it ends at index

21>20

. Thus, the top-

4

consecutive occurrences of

P

in the index range

T[3..20]

are

(6,8),(8,10),(10,12)

, and

(3,6)

.

In what follows, we fix a string $T$ of length $n>0$ arbitrarily. In this paper, we assume the standard word-RAM model with word size $w=\Omega(\log n)$ .

2.2 Algorithmic Tools

Suffix Trees

The suffix tree of string $T$ is a compacted trie for the set of suffixes of $T$ [37]. We denote by $\mathsf{str}(v)$ the path-string obtained by concatenating the edge labels from the root to node $v$ of the suffix tree of $T$ . If the last character of $T$ is unique (i.e., occurs in $T$ only once), then the suffix tree of $T$ has exactly $|T|$ leaves and there is a one-to-one correspondence between the leaves and the suffixes of $T$ . More precisely, for each suffix $T[i..]$ of $T$ , there exists a leaf $\ell_{i}$ of the suffix tree of $T$ such that $\mathsf{str}(\ell_{i})=T[i..]$ , and vice versa. Thus, we label such a leaf $\ell_{i}$ with integer $i$ . For a substring $w$ of $T$ , we define $\mathsf{locus}(w)$ as the highest node $u$ of the suffix tree of $T$ such that $w$ is a prefix of the string $\mathsf{str}(u)$ .

The suffix tree of string $T$ over $\Sigma$ can be constructed in $O(n\log|\Sigma|)$ time [37].

In the rest of this paper, we assume that the last character of $T$ is unique (one can achieve this property by appending a new letter), and we denote by $\mathcal{T}$ the suffix tree of $T$ . Namely, $\mathcal{T}$ has exactly $n$ leaves.

Heavy Path Decomposition of Tree

A heavy path of a tree is a root-to-leaf path in which each node has a size no smaller than the size of any of its siblings (the size of a node is the number of nodes in the subtree rooted at that node). Heavy path decomposition is a process of decomposing a tree into heavy paths [36]. First, we find a heavy path by starting from the tree’s root and choosing a child with the maximum size at each level. We follow the same procedure recursively for each subtree rooted at a node that is not on the heavy path, but its parent node is (on the heavy path). As a result, a collection of (disjoint) heavy paths is obtained. Some of its properties are:

$\blacksquare$

each node $v$ belongs to exactly one heavy path,
$\blacksquare$

any path from the root to a leaf can pass through at most $\log_{2}N$ heavy paths where $N$ is the size of the number of nodes in the tree,
$\blacksquare$

the number of heavy paths equals the number of leaves in the tree.

All heavy paths of a rooted tree with $N$ nodes can be computed in $O(N)$ time [36].

Segment Tree

The segment tree [5] is an important data structure in computational geometry. It stores a set of $N$ line segments in the plane such that the segments intersected by a vertical query line can be computed in $O(\log N+K)$ time, where $K$ is the number of reported segments. It can be built in $O(N\log N)$ space and time. It is a balanced binary search tree of height $O(\log N)$ built over the $x$ -coordinates of the segments’ endpoints. Each node stores a vertical slab $H(v):=[a,b)\times\mathbb{R}$ and a set $S(v)$ of segments, where $a$ and $b$ are real numbers. Let $l_{1},l_{2},\ldots l_{m}$ be the list of leaf nodes in the left-to-right order. The slab of a leaf node $l_{i}$ , for each $i=1,2,\ldots m-1$ , is determined by the key values of $l_{i}$ and $l_{i+1}$ , and the vertical slab of an internal node is the union of those of its children. Note that the vertical slabs of the nodes of any particular level form a partition of the plane. A segment $s$ is stored at each highest node $v$ such that $s\cap H(v)\neq\emptyset$ and no endpoint of $s$ lies inside $H(v)$ . Given a real value $x$ , the path from the root to the leaf $l_{i}$ whose vertical slab contains $x$ is referred to as the search path for $x$ . An important property of the segment tree is that a segment intersects the vertical line through $x$ if and only if it is stored at some node of the search path for $x$ .

Fractional Cascading Technique

Chazelle and Guibas [11] introduced the fractional cascading technique, a data structure technique which is useful in the scenarios where one needs to search an item in multiple sorted lists. It speeds up the search process. We briefly describe the idea of the technique in the context where lists are associated with the nodes of a balanced binary tree, the reader may refer to [11] for full details. Let $T$ be a balanced binary tree with $O(\log n)$ height, and each node $v$ stores a sorted list $L_{v}$ of comparable items where $n$ is the combined size of all lists. Given a root-to-leaf path in $T$ , one can trivially find the position (rank) of an item in the list of each node on the path in $O(\log^{2}n)$ time; performing a binary search at each node of the path. By employing fractional cascading technique, the search time can be reduced to $O(\log n)$ . The technique introduces an augmented list $A_{v}$ for each node $v$ in the tree: the augmented list $A_{v}$ consists of the items of $L_{v}$ and every fourth item from the list of each child; the list $A_{v}$ is also sorted. These augmented lists are built in a bottom-up fashion. Each item $x$ in $A_{v}$ stores a constant amount of information: (1) whether $x$ is from $L_{v}$ or from one of $v$ ’s children, (2) pointers to immediate neighbors of $x$ on either side in $A_{v}$ that belong to $L_{v}$ , (3) a pointer to the appropriate item in $A_{w}$ for each child $w$ of $v$ . If the position of $x$ in $A_{v}$ is known, one can find its position in the list of either child (of node $v$ ) in $O(1)$ time using the saved pointers. Suppose we are to find the position (rank) of an item $\alpha$ in lists of nodes on a particular root-to-leaf path $\pi=\langle u_{0}=\mathit{root},u_{1},u_{2},\ldots\rangle$ . A binary search is used to find the location of $\alpha$ in $A_{u_{0}}$ (and hence in $L_{u_{0}}$ ). Then, using the pointers stored in $A_{u_{0}}$ , the correct rank of $\alpha$ in the list $L_{u_{1}}$ of the next node $u_{1}$ on the path $\pi$ can be computed in $O(1)$ time. Generally speaking, if the rank of $\alpha$ in $L_{u_{j}}$ is known, the rank of $i$ in $L_{u_{j+1}}$ can be obtained in $O(1)$ time using the pointers stored in $A_{u_{j}}$ . Thus, the total time spent would be $O(\log n)$ .

Sorted Orthogonal Range Reporting Queries

A sorted range reporting problem over an array $A$ of integers is, given a query $([x_{1},x_{2}],k)$ where $x_{1},x_{2},k$ are integers, to report the $k$ smallest values in sub-array $A[x_{1}:x_{2}]$ in sorted order. The following result is due to Brodal et al. [9].

Lemma 3 ([9]).

There is a data structure of size $O(|A|)$ that can answer any sorted range reporting query in $O(k)$ time. The data structure can be constructed in $O(|A|\log|A|)$ time.

A 2D orthogonal range reporting problem asks to represent a set of 2D planar points into an efficient data structure such that all the points inside an orthogonal query rectangle can be reported efficiently. Nekrich and Navarro [29] studied a variant of the problem where the output points are to be reported in increasing (or decreasing) order of one of the two coordinates. They proposed an almost $O(N)$ space solution to the problem that takes $O(\log\log N+K)$ time to answer a query, where $N$ is the size of the input set and $K$ is the number of reported points. The output points are reported one by one in the sorted order. Their main result is as follows.

Lemma 4 (Theorem 2 in [29]).

Given a set of $N$ points in the plane with integer coordinates, one can preprocess the set so that the points lying inside a query rectangle $[a,b]\times[c,d]$ can be reported in sorted order in $O(\log\log N+K)$ time, where $K$ is output size. The space used by the data structure is $O(N\log^{\epsilon}N)$ .

Dynamic Fusion Nodes

A dynamic fusion node [30] is a data structure for a dynamic set of integers which supports operations on the set including insertion, deletion, predecessor, and successor. Every query can be executed in $O(\log N/\log w)$ time where $N$ is the size of the dynamic set and $w$ is the machine word size. The data structure uses $O(N)$ space. If $N$ is close to $w$ , the following holds:

Lemma 5 ([30]).

If $N=w^{O(1)}$ , we can maintain a dynamic set of $N$ integers supporting insertion, deletion, predecessor, and successor in $O(1)$ time using $O(N)$ space.

We will use this lemma later in the case where $N\leq\log|T|\in O(w)$ .

3 Naïve Solutions

In this section, we present a data structure that answers top- $k$ queries in near-optimal time. The data structure also supports gap-bounded queries in the same time bound. However, the space complexity is $O(n^{2}\log^{\epsilon}n)$ .

We first build the suffix tree $\mathcal{T}$ over the text $T$ of length $n$ . Recall that the labels of leaves in the subtree rooted at a node $u$ correspond to the occurrences of $\mathsf{str}(u)$ . Let $C(u)$ denote the set of consecutive occurrences of $\mathsf{str}(u)$ . For each $(i,j)\in C(u)$ , we define a $2$ -dimensional point $(i,j-i)$ . Let $\mathsf{P}(u)$ be the set of all such $2$ -d points at node $u$ . We then preprocess the set $\mathsf{P}(u)$ , for each $u\in\mathcal{T}$ , into a data structure that supports sorted range reporting queries using Lemma 4. The resulting data structure supports both types of queries.

Firstly, we estimate the space complexity. The size of the set $C(v)$ of consecutive occurrences is one less than the number of leaves in the subtree rooted at node $v$ . In the worst case (when $\mathcal{T}$ is left-skewed or right-skewed), the total size of all sets $C(v)$ , $v\in\mathcal{T}$ , is $O(n^{2})$ . As the set $\mathsf{P}(v)$ contains one point for each consecutive occurrence in the set $C(v)$ , $|\mathsf{P}(v)|=|C(v)|$ holds. Consequently, $\sum_{v\in\mathcal{T}}|\mathsf{P}(v)|=\sum_{v\in\mathcal{T}}|C(v)|=O(n^{2})$ . The suffix tree $\mathcal{T}$ occupies $O(n)$ space. The sorted range reporting structure representing the set $\mathsf{P}(v)$ uses $O(|\mathsf{P}(v)|\log^{\epsilon}|\mathsf{P}(v)|)$ space due to Lemma 4. Thus, the total space used by the data structures associated with all nodes is $\sum_{v\in\mathcal{T}}O(|\mathsf{P}(v)|\log^{\epsilon}|\mathsf{P}(v)|)=O(n^{2}% \log^{\epsilon}n)$ . Therefore, the space complexity of the data structure is $O(n^{2}\log^{\epsilon}n)$ .

Top- $𝒌$ query algorithm for given $(P,[a,b],k)$

We first find the node $u=\mathsf{locus}(P)$ where the search for pattern $P$ in the suffix tree $\mathcal{T}$ terminates. Then, we find the $(k+1)$ points of $\mathsf{P}(u)$ in the rectangle $[a,b-|P|+1]\times(-\infty,+\infty)$ with smallest y-coordinates (i.e., smallest distances) using the range reporting data structure stored at node $u$ . Let $(i,j)$ be the consecutive occurrence corresponding to the (returned) point with the largest x-coordinate. If the copy of $P$ starting at index $j$ ends with a position greater than $b$ , report the remaining $k$ consecutive occurrences as output. Otherwise, we report the $k$ consecutive occurrences with smallest distances. Finding $\mathsf{locus}(P)$ in the suffix tree $\mathcal{T}$ takes $O(|P|)$ time. If we are to find $k+1$ points with smallest y-coordinates in sorted order, the structure stored at a node $u$ takes $O(\log\log|\mathsf{P}(u)|+k+1)$ time to answer a query. The additional step to scan the returned points and report $k$ consecutive occurrences takes $O(k+1)$ time. Thus, the total time needed to answer a top- $k$ query is $O(|P|+\log\log n+k)$ .

Gap-bounded query algorithm for given $(P,[a,b],[g_{1},g_{2}])$

For gap-bounded query, the procedure is very similar. Again, we first find the node $u=\mathsf{locus}(P)$ where the search for $P$ in the suffix tree $\mathcal{T}$ terminates. Then, we query the data structure stored at node $u$ with the rectangle $[a,b-|P|+1]\times[g_{1},g_{2}]$ . Let $S^{\prime}$ be the set of the consecutive occurrences corresponding to the returned points. Finally, we simply scan the set $S^{\prime}$ and remove the consecutive occurrence $(i,j)\in S^{\prime}$ with the largest first occurrence $i$ if $j>b-|P|+1$ , in words, the copy of $P$ at index $j$ does not fit entirely within the index range $[a,b]$ . The resulting set $S^{\prime}$ is our output to the gap-bounded query. The time spent to answer a gap-bounded query is $O(|P|+\log\log n+\mathit{output})$ where $\mathit{output}$ denotes the number of outputs for the query. The analysis is very similar to that of top- $k$ query. Thus, we have the following proposition.

Proposition 6.

Given a given text $T$ , one can build an $O(n^{2}\log^{\epsilon}n)$ -space index that can be used to answer a top- $k$ query or a gap-bounded query in $O(|P|+\log\log n+\mathit{output})$ query time.

4 Range Top- $𝒌$ Consecutive Occurrence Queries

In this section, we propose an algorithm that solves the top- $k$ query using sub-quadratic space. Our approach is similar to those employed in [28, 7]. The suffix tree $\mathcal{T}$ is decomposed using heavy path decomposition. For each heavy path $h$ in the decomposition, we define a set of horizontal segments that compactly represents the sets $C(u)$ of consecutive occurrences associated with nodes $u\in h$ . We organize these segments into a data structure, denoted by $D_{h}$ , that supports top- $k$ queries for patterns $P$ with $\mathsf{locus}(P)$ on the path $h$ .

4.1 Solution

Let $\mathcal{T}$ be the suffix tree built over the text $T$ . The tree $\mathcal{T}$ is decomposed using the heavy path decomposition, which gives $n$ disjoint (heavy) paths. Let $h$ be a heavy path in the decomposition. We denote by $\mathit{apex}(h)$ the highest node in the path $h$ . We say that a consecutive occurrence $(i,j)$ is present at node $v$ if $(i,j)\in C(v)$ . The following lemma is helpful in compactly representing the sets $C(v)$ of consecutive occurrences.

Lemma 7.

For each consecutive occurrence $(i,j)\in C(v),\text{ where }v\in h$ , there exists two nodes $u$ and $w$ on the heavy path $h$ such that $(i,j)$ is present at each node of the sub-path of $h$ joining $u$ and $w$ and is not present at any other node of $h$ .

Proof.

Let $h=\langle\mathit{apex}(h)=v_{0},v_{1},v_{2},\ldots,v_{t}\rangle$ , and suppose $(i,j)$ is present at node $v_{d}$ for some $d\in\{0,1,\ldots,t-1\}$ . This implies that the leaves $\ell_{i}$ and $\ell_{j}$ are present in the subtree of $\mathcal{T}$ rooted at node $v_{d}$ . As we move down from $v_{d}$ to $v_{d+1}$ , one or both leaves, $\ell_{i}$ or $\ell_{j}$ , might branch off from the heavy path $h$ . If neither leaf $\ell_{i}$ nor $\ell_{j}$ is present in subtree rooted at $v_{d+1}$ , the pair $(i,j)$ would not be part of the set $C(v_{d+1})$ . Otherwise, $(i,j)$ would continue to exist at the next node $v_{d+1}$ , i.e., $(i,j)\in C(v_{d+1})$ . Note that if $(i,j)$ is not present at $v_{d+1}$ , it will not appear in any descendant node of $v_{d+1}$ on the path $h$ . This is because as we move down the path the existing leaves only disappear. See also Figure 2 for examples.

Figure 2: Illustration for Lemma 7. A heavy path

h

in the suffix tree for

T=\mathtt{ababaababab\$}

is shown in the figure. The occurrences

3

and

6

form a consecutive occurrence at node

v_{2}

; at each ancestor

v_{0}

and

v_{1}

they are separated by some indices. It remains a consecutive occurrence till node

v_{3}

as leaf

l_{3}

is no longer present in the subtree rooted at

v_{4}

. Thus, it is said that the consecutive occurrence

(3,6)

is alive on the subpath joining

v_{2}

and

v_{3}

. The corresponding horizontal segment is

[2,3]\times 3

; x-coordinates are determined by the depths of

v_{2}

and

v_{3}

and y-coordinate is determined by the distance of consecutive occurrence.

Since $i$ and $j$ are two occurrences of $\mathsf{str}(v_{d})$ , the string $\mathsf{str}(v_{0})$ must occur at these positions as $\mathsf{str}(v_{0})$ is a prefix of $\mathsf{str}(v_{d})$ . If $\mathsf{str}(v_{0})$ has no occurrence $g$ such that $i<g<j$ , the pair $(i,j)$ is a consecutive occurrence of $\mathsf{str}(v_{0})$ , i.e., $(i,j)\in C(v_{0})$ . Otherwise, there are occurrences between $i$ and $j$ , and the pair $(i,j)$ becomes a consecutive occurrence at the highest ancestor of $v_{d}$ where all occurrences between $i$ and $j$ disappear (i.e., the corresponding leaves branch off from the heavy path $h$ ). $\hfill\blacktriangleleft$ Let $d(u)$ denote the depth of $u$ within heavy path $h$ , i.e., $d(\mathit{apex}(h))=0$ and $d(u)=d(v)+1$ where $v$ is the parent of $u\neq\mathit{apex}(h)$ . By Lemma 7, for each consecutive occurrence $(i,j)$ present at some node of path $h$ , there are two nodes $u$ and $w$ on path $h$ such that $(i,j)$ is present at each node of the path connecting these two nodes. We create a horizontal segment $[d(u),d(w)]\times(j-i)$ in a 2D grid for each consecutive occurrence $(i,j)$ and associate $(i,j)$ as satellite data. Let $I_{h}$ be the set of all such horizontal segments corresponding to path $h$ . By the construction of segments, the pair $(i,j)$ is a consecutive occurrence of pattern $P$ if and only if $\mathsf{locus}(P)\in h$ and $d(u)\leq d(\mathsf{locus}(P))\leq d(w)$ hold. Also, since we are only interested in occurrences within substring $T[a..b]$ , we will report such $(i,j)$ satisfying $a\leq i<b-|P|+1$ . Thus, we can obtain the desired top- $k$ segments by reporting $k+1$ segments¹¹1The reason for finding up to $k+1$ segments is the same as in Section 3. $[d(u),d(w)]\times(j-i)$ with the smallest y-coordinates such that $d(u)\leq d(\mathsf{locus}(P))\leq d(w)$ and $a\leq i<b-|P|+1$ hold. Therefore, once $\mathsf{locus}(P),h$ , and $d(\mathsf{locus}(P))$ in $h$ are computed, the remaining process of a top- $k$ query can be reduced to the following geometric problem for $I=I_{h}$ :

Definition 8 (2D top- $k$ stabbing query with weight constraint).

A set $I=\{[l_{1},r_{1}]\times y_{1},\ldots,[l_{|I|},r_{|I|}]\times y_{|I|}\}$ of horizontal segments in a 2D plane and weight function $w:I\to\mathbb{N}$ are given for preprocessing. The query is, given $(x,[w_{1},w_{2}],k)$ , to report the $k$ segments $s_{1},\ldots,s_{k}$ with smallest y-coordinates in non-decreasing order of y-coordinate such that $l_{i}\leq x\leq r_{i}$ and $w_{1}\leq w(s_{i})\leq w_{2}$ hold for every segment $s_{i}=[l_{i},r_{i}]\times y_{i}$ .

We will later prove the following theorem:

Theorem 9.

There is a data structure of size $O(|I|\log|I|)$ that can answer the query of Definition 8 in $O(\log n+k)$ time.

We have the following result regarding the total size of the sets of segments.

Lemma 10 ([7]).

The total number of horizontal segments for all heavy paths is $O(n\log n)$ .

By Theorem 9 and Lemma 10, the total size of our data structure is $O(|\mathcal{T}|+\sum_{h}|I_{h}|\log|I_{h}|)=O(n\log^{2}n)$ . Since the locus of $P$ can be computed in $O(|P|)$ time using the suffix tree, we obtain the following.

Lemma 11.

We can represent a given text $T$ into an $O(n\log^{2}n)$ space data structure that supports a range top- $k$ consecutive occurrence query in $O(|P|+\log n+k)$ time.

Note that the query time of the lemma is optimal if $|P|\geq\log n$ . When $|P|<\log n$ , the $\log$ factor dominates the pattern length $|P|$ , and the query time becomes $O(\log n+k)$ .

Further, we slightly improve the query time using an additional data structure that efficiently answer queries with short patterns, i.e., $|P|<\log n$ . For each node $v$ with $|\mathsf{str}(v)|<\log n$ in the suffix tree $\mathcal{T}$ , we store a data structure built over the set $C(v)$ of consecutive occurrences as in Section 3. The additional space used at each node $v$ is $O(|C(v)|\log^{\epsilon}|C(v)|)$ . In the suffix tree $\mathcal{T}$ , the total number of leaves in the subtrees rooted at the nodes of a particular level is $n$ , so $\sum_{v:|\mathsf{str}(v)|<\log n}|C(v)|=O(n\log n)$ . Hence, the overall space used by the additional data structure is $O(n\log^{1+\epsilon}n)$ , which is subsumed by $O(n\log^{2}n)$ .

For a query pattern $P$ with $|P|<\log n$ , we simply find the node $\mathsf{locus}(P)$ in the suffix tree $\mathcal{T}$ in $O(|P|)$ time and query the associated range reporting data structure as in the naïve solution of Section 3. Thus, the query time would be $O(|P|+\log\log n+k)$ . Thus, we obtain the main theorem of this paper:

Theorem 12.

We can represent a given text $T$ into an $O(n\log^{2}n)$ space data structure that supports a range top- $k$ consecutive occurrence query in $O(|P|+\log\log n+k)$ time.

4.2 Proof of Theorem 9: Solving Geometric Subproblem

In this subsection, we prove Theorem 9.

Data Structure

Given set $I$ of horizontal segments, we build a segment tree $T$ over $I$ . Each node $v\in T$ maintains the segments stored at node $v$ , in a list $L_{v}$ sorted by their weights. To efficiently perform predecessor search, we employ the fractional cascading technique (see also Section 2) on the sorted lists $L_{v}$ for all $v\in T$ . We define an array $Y_{v}$ over the list $L_{v}$ such that $Y_{v}[i]$ stores the y-coordinate of segment $L_{v}[i]$ . We then preprocess the array $Y_{v}$ for the sorted range reporting queries of Lemma 3. The size of the enhanced segment tree structure is $O(|I|\log|I|)$ since data structures associated with each node $v$ use $O(|L_{v}|)$ space and $O(\sum_{v\in T}|L_{v}|)=O(|I|\log|I|)$ holds.

Query Algorithm

Given a query $(x,[w_{1},w_{2}],k)$ , we first find the search path $\pi$ for $x$ in the segment tree $T$ . Let $\pi=\langle u_{0}=\mathit{root},u_{1},\ldots,u_{|\pi|-1}\rangle$ . We then traverse the path $\pi$ in the top-down fashion and for each node $v\in\pi$ , compute the positions (ranks) of $w_{1}$ and $w_{2}$ in list $L_{v}$ , respectively denoted by $\mathit{rnk}_{v}(w_{1})$ and $\mathit{rnk}_{v}(w_{2})$ , using the pointers provided by the fractional cascading technique described in Section 2. We prepare an incremental buffer $B$ of size $O(k+\log n)$ consisting of $|\pi|=O(\log n)$ rows. Each $i$ th row of $B$ corresponds to the node $u_{i}$ on the path $\pi$ , denoted by $B[u_{i}]$ . Moreover, we maintain a min-heap $H$ implemented as a dynamic fusion node (Lemma 5). See Figure 3 for illustration.

Figure 3: Illustration for an overview of the data structures used in the query algorithm.

We find top- $1$ (i.e., the smallest) items from the subarrays $Y_{v}[\mathit{rnk}_{v}(w_{1})..\mathit{rnk}_{v}(w_{2})]$ for all nodes $v\in\pi$ and initialize the heap $H$ with the minima. Also, we initialize the counter $e(v)=0$ for each node $v\in\pi$ . We then keep repeating the following steps until the heap $H$ gets empty or $k$ items are reported:

1.

Pop the minimum item $\mu$ from the heap $H$ and report it. Let $v$ be the node on the path $\pi$ from which the item $\mu$ came.
2.

If $v$ ’s buffer $B[v]$ is non-empty, simply pop the minimum item from $B[v]$ and insert it into the heap $H$ . Then continue to the next iteration.
3.

Otherwise (if $B[v]$ is empty), find top- $2^{e(v)+1}$ items from $Y_{v}[\mathit{rnk}_{v}(w_{1})..\mathit{rnk}_{v}(w_{2})]$ by using Lemma 3. Then remove the first half $2^{e(v)}$ of them, which have already been moved to $H$ in previous steps, and insert the remaining items to $B[v]$ . Lastly, pop the minimum item from $B[v]$ and insert it into the heap $H$ , and increment $v$ ’s counter $e(v)$ by $1$ .

Correctness

We focus on the changes to the buffer $B[v]$ for any fixed node $v\in\pi$ throughout the query algorithm. Buffer $B[v]$ grows only if we enter Step 3, where $B[v]$ is empty. Picking up only steps where $B[v]$ grows, in each such step, $B[v]$ is updated to the top- $2^{e(v)}$ items from the items in $Y_{v}[\mathit{rnk}_{v}(w_{1})..\mathit{rnk}_{v}(w_{2})]$ that are not inserted to $H$ before, for incremental $e(v)=0,1,2,\ldots$ . Also, when the smallest item from $B[v]$ is inserted to $H$ at Step 2, the item is removed from $B[v]$ . Hence, buffer $B[v]$ keeps the top- $c$ items from $Y_{v}[\mathit{rnk}_{v}(w_{1})..\mathit{rnk}_{v}(w_{2})]$ that have not been inserted to $H$ for some $c\geq 0$ . Also, the smallest unreported item from $Y_{v}[\mathit{rnk}_{v}(w_{1})..\mathit{rnk}_{v}(w_{2})]$ is always present in heap $H$ for every $v\in\pi$ due to Step 2. Thus, for each iteration of the three steps of the algorithm, the smallest unreported item from $\pi$ that satisfies the weight constraint will be reported at Step 1 of the iteration.

Running Time

We now analyze the running time of the query algorithm. Given $x$ , determining the search path $\pi$ in the segment tree $T$ takes $O(\log n)$ time. Also, thanks to the fractional cascading technique, all the ranks of $w_{1}$ and $w_{2}$ in the nodes in $\pi$ can be computed in a total of $O(\log n)$ time. As the heap $H$ is implemented as dynamic fusion node structure, initializing $H$ takes $O(\log n)$ time. While iterating the loop, when buffer $B[v]$ becomes empty (i.e., all items of $B[v]$ have either been moved to $H$ or reported as output) for a node $v$ , it is replaced by a new buffer twice the size of the previous one in $O(|B[v]|)$ time by Lemma 3. Since (1) the size of $B[v]$ is at most the number of items from $v$ already reported or moved to $H$ and (2) the number of rows in $B$ is $|\pi|\in O(\log n)$ , the total size of the buffer $B$ is $O(k+\log n)$ . Further, each operation in a dynamic fusion node structure takes constant time. To summarize, the total time spent²²2Note that if we implement the heap $H$ as a standard binary heap instead of the dynamic fusion node, the query time becomes $O((k+\log n)\log\log n)$ since the size of $H$ is $O(\log n)$ . to answer the query is $O(k+\log n)$ .

We have completed the proof of Theorem 9.

5 Range Gap-Bounded Consecutive Occurrence Queries

In this section, we describe a space efficient data structure that supports gap-bounded queries. The data structure is similar to the one described for top- $k$ queries in Section 4, and it uses $O(n\log^{2+\epsilon}n)$ space. The query time is $O(|P|+\log\log n+\mathit{output})$ . The data structure differs from the top- $k$ structure (described in Section 4) only in the representation of the sets $I_{h}$ .

We first build the suffix tree $\mathcal{T}$ of $T$ and decompose it using heavy path decomposition. For each heavy path $h$ in the decomposition, we compute a set $I_{h}$ of horizontal segments that compactly represent the consecutive occurrences as in Section 4.1. Let $(P,[a,b],[g_{1},g_{2}])$ be a query we should answer. Firstly, we find the heavy path $h$ where $\mathsf{locus}(P)$ belongs. Then, we can obtain the desired segments as follows:

1.

Compute segments $[d(u),d(w)]\times(j-i)\in I_{h}$ such that $d(u)\leq d(\mathsf{locus}(P))\leq d(w)$ , $a\leq i<b-|P|+1$ , and $g_{1}\leq(j-i)\leq g_{2}$ hold.
2.

Scan the consecutive occurrences corresponding to the obtained segments and remove (at most one) consecutive occurrence $(i,j)$ with $j>b-|P|+1$ if it exist (as we mentioned in Section 3).
3.

Report the remaining segments.

The second and the third steps are obvious. The first step can be reduced to the following problem for $I=I_{h}$ , which is different from that of the top- $k$ algorithm.

Definition 13 (2D height-bounded stabbing query with weight constraint).

A set $I=\{[l_{1},r_{1}]\times y_{1},\ldots,[l_{|I|},r_{|I|}]\times y_{|I|}\}$ of horizontal segments in a 2D plane and weight function $w:I\to\mathbb{N}$ are given for preprocessing. The query is, given $(x,[w_{1},w_{2}],[\psi_{1},\psi_{2}])$ , to report all the segments $s_{1},\ldots,s_{t}$ in non-decreasing order of y-coordinate such that $l_{i}\leq x\leq r_{i}$ , $\psi_{1}\leq y_{i}\leq\psi_{2}$ and $w_{1}\leq w(s_{i})\leq w_{2}$ hold for every segment $s_{i}=[l_{i},r_{i}]\times y_{i}$ .

We show the following theorem:

Theorem 14.

There is a data structure of size $O(|I|\log^{1+\epsilon}|I|)$ that can answer the query of Definition 13 in $O(\log n\log\log n+\mathit{output})$ time where $\epsilon>0$ is a constant and $\mathit{output}$ is the number of outputs.

Proof.

The idea is very similar to that of Theorem 9. We construct a segment tree $T$ over $I$ and associate some data structures with nodes of the tree. Given a query $(x,[w_{1},w_{2}],[\psi_{1},\psi_{2}])$ , we first find the search path $\pi$ for $x$ . After that, the remaining operations related to $[w_{1},w_{2}]$ and $[\psi_{1},\psi_{2}]$ are considered on each node in $\pi$ . The differences are: (1) we associate another data structure with nodes of the segment tree, and (2) we do not need to use a doubling technique with an incremental buffer in the query algorithm. Below, we focus only on the differences from Theorem 9.

Let $v$ be a node of the segment tree $T$ . For each segment $s=[l,r]\times y$ in the set $S(v)$ of segments for $v$ , we create a 2D point $p(s)=(w(s),y)$ . We then construct a sorted orthogonal range reporting data structure of Lemma 4 on top of the set $\bigcup_{s\in S(v)}p(s)$ of points. The total size of the segment tree structure with range reporting structures is $O(|I|\log^{1+\epsilon}|I|)$ .

When a query is given, we traverse the search path $\pi$ for $x$ . At each node $v$ in $\pi$ , we create a (sorted) list $\sigma(v)$ of all the desired segments in $S(v)$ by reporting all the points within the rectangle $[w_{1},w_{2}]\times[\psi_{1},\psi_{2}]$ using the sorted orthogonal range reporting data structure. After finishing the traversal, we merge the sorted lists $\sigma(v)$ for all $v\in\pi$ by using a heap implemented as dynamic fusion node as in Section 4, and report the segments in the merged list in order. The total time to answer a query is $O(\log n\log\log n+\mathit{output})$ time since the length of $\pi$ is $O(\log n)$ and range reporting query at each $v$ takes $O(\log\log n+|\sigma(v)|)$ time.

The correctness can be seen from the definition of the 2D points $(w(s),y)$ within each node and the definition of the sorted range reporting query to be performed on those points. $\hfill\blacktriangleleft$

Now we obtain an $O(|P|+\log n\log\log n+\mathit{output})$ -query-time method for the range gap-bounded consecutive occurrence query. We reduce the query time to $O(|P|+\log\log n+\mathit{output})$ using the same technique employed for the top- $k$ case. For each node $v$ in the suffix tree $\mathcal{T}$ with $|\mathsf{str}(v)|<\log n\log\log n$ , we build a data structure over the set $C(v)$ of consecutive occurrences as in Section 3. By the same arguments used in the top- $k$ case, the total space used by these additional data structures is $O(n\log^{1+\epsilon}n\log\log n)$ , which is dominated by $O(n\log^{2+\epsilon}n)$ . Therefore, the next theorem holds;

Theorem 15.

We can represent a given text $T$ into an $O(n\log^{2+\epsilon}n)$ space data structure that supports a range gap-bounded consecutive occurrence query in $O(|P|+\log\log n+\mathit{output})$ time.

6 Conclusions

In this paper, we introduced range variants of consecutive occurrence queries, generalizing the problems studied in previous work [28, 7]. Specifically, we proposed:

$\blacksquare$

A data structure of size $O(n\log^{2}n)$ that answers the range top- $k$ consecutive occurrence query in $O(|P|+\log\log n+k)$ time.
$\blacksquare$

A data structure of size $O(n\log^{2+\epsilon}n)$ that answers the range gap-bounded consecutive occurrence query in $O(|P|+\log\log n+\mathit{output})$ time, where $\epsilon$ is a positive constant and $\mathit{output}$ denotes the number of outputs.

Further, we presented efficient algorithms for geometric problems involving weighted horizontal segments in a 2D plane, which appeared as sub-problems in our approach and are of independent interest.

Our data structures require $\Omega(n\log^{2}n)$ space in the worst case, whereas existing solutions for non-range variants use only $O(n\log n)$ space. Thus, reducing the space complexity of our data structures is an interesting direction for future research.

References

[1] Paniz Abedin, Arnab Ganguly, Solon P. Pissis, and Sharma V. Thankachan. Efficient data structures for range shortest unique substring queries. Algorithms, 13(11):276, 2020. doi:10.3390/A13110276.
[2] Waseem Akram and Sanjeev Saxena. Consecutive occurrences with distance constraints. In Conference on Algorithms and Discrete Applied Mathematics, pages 3–13. Springer, 2024. doi:10.1007/978-3-031-52213-0_1.
[3] Amihood Amir, Panagiotis Charalampopoulos, Solon P. Pissis, and Jakub Radoszewski. Dynamic and internal longest common substring. Algorithmica, 82(12):3707–3743, 2020. doi:10.1007/S00453-020-00744-0.
[4] Golnaz Badkobeh, Panagiotis Charalampopoulos, Dmitry Kosolobov, and Solon P. Pissis. Internal shortest absent word queries in constant time and linear space. Theor. Comput. Sci., 922:271–282, 2022. doi:10.1016/J.TCS.2022.04.029.
[5] Mark de Berg, Otfried Cheong, Marc J. van Kreveld, and Mark H. Overmars. Computational geometry: algorithms and applications, 3rd Edition. Springer, 2008. doi:10.1007/978-3-540-77974-2.
[6] Philip Bille and Inge Li Gørtz. Substring range reporting. Algorithmica, 69(2):384–396, 2014. doi:10.1007/S00453-012-9733-4.
[7] Philip Bille, Inge Li Gørtz, Max Rishøj Pedersen, Eva Rotenberg, and Teresa Anna Steiner. String indexing for top-k close consecutive occurrences. Theor. Comput. Sci., 927:133–147, 2022. doi:10.1016/J.TCS.2022.06.004.
[8] Philip Bille, Inge Li Gørtz, Max Rishøj Pedersen, and Teresa Anna Steiner. Gapped indexing for consecutive occurrences. Algorithmica, 85(4):879–901, 2023. doi:10.1007/S00453-022-01051-6.
[9] Gerth Stølting Brodal, Rolf Fagerberg, Mark Greve, and Alejandro López-Ortiz. Online sorted range reporting. In Yingfei Dong, Ding-Zhu Du, and Oscar H. Ibarra, editors, Algorithms and Computation, 20th International Symposium, ISAAC 2009, Honolulu, Hawaii, USA, December 16-18, 2009. Proceedings, volume 5878 of Lecture Notes in Computer Science, pages 173–182. Springer, 2009. doi:10.1007/978-3-642-10631-6_19.
[10] Bernard Chazelle. Filtering search: A new approach to query-answering. SIAM Journal on Computing, 15(3):703–724, 1986. doi:10.1137/0215051.
[11] Bernard Chazelle and Leonidas Guibas. Fractional cascading: I. a data structuring technique. Algorithmica, 1:133–162, January 1986. doi:10.1007/BF01840440.
[12] Hagai Cohen and Ely Porat. Range non-overlapping indexing. In Yingfei Dong, Ding-Zhu Du, and Oscar H. Ibarra, editors, Algorithms and Computation, 20th International Symposium, ISAAC 2009, Honolulu, Hawaii, USA, December 16-18, 2009. Proceedings, volume 5878 of Lecture Notes in Computer Science, pages 1044–1053. Springer, 2009. doi:10.1007/978-3-642-10631-6_105.
[13] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to algorithms. MIT press, 2022. URL: http://mitpress.mit.edu/books/introduction-algorithms.
[14] Maxime Crochemore, Costas S. Iliopoulos, Marcin Kubica, M. Sohel Rahman, German Tischler, and Tomasz Walen. Improved algorithms for the range next value problem and applications. Theor. Comput. Sci., 434:23–34, 2012. doi:10.1016/J.TCS.2012.02.015.
[15] Maxime Crochemore, Costas S. Iliopoulos, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszynski, Tomasz Walen, and Wiktor Zuba. Internal quasiperiod queries. In Christina Boucher and Sharma V. Thankachan, editors, String Processing and Information Retrieval - 27th International Symposium, SPIRE 2020, Orlando, FL, USA, October 13-15, 2020, Proceedings, volume 12303 of Lecture Notes in Computer Science, pages 60–75. Springer, 2020. doi:10.1007/978-3-030-59212-7_5.
[16] Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time. In 23rd Annual Symposium on Foundations of Computer Science, Chicago, Illinois, USA, 3-5 November 1982, pages 165–169. IEEE Computer Society, 1982. doi:10.1109/SFCS.1982.39.
[17] Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan. Succinct non-overlapping indexing. Algorithmica, 82(1):107–117, 2020. doi:10.1007/S00453-019-00605-5.
[18] Pawel Gawrychowski, Garance Gourdel, Tatiana Starikovskaya, and Teresa Anna Steiner. Compressed indexing for consecutive occurrences. In Laurent Bulteau and Zsuzsanna Lipták, editors, 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023, June 26-28, 2023, Marne-la-Vallée, France, volume 259 of LIPIcs, pages 12:1–12:22. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.CPM.2023.12.
[19] Pawel Gawrychowski, Garance Gourdel, Tatiana Starikovskaya, and Teresa Anna Steiner. Compressed consecutive pattern matching. In Ali Bilgin, James E. Fowler, Joan Serra-Sagristà, Yan Ye, and James A. Storer, editors, Data Compression Conference, DCC 2024, Snowbird, UT, USA, March 19-22, 2024, pages 163–172. IEEE, 2024. doi:10.1109/DCC58796.2024.00024.
[20] Daniel Gibney, Paul Macnichol, and Sharma V. Thankachan. Non-overlapping indexing in BWT-runs bounded space. In Franco Maria Nardini, Nadia Pisanti, and Rossano Venturini, editors, String Processing and Information Retrieval - 30th International Symposium, SPIRE 2023, Pisa, Italy, September 26-28, 2023, Proceedings, volume 14240 of Lecture Notes in Computer Science, pages 260–270. Springer, 2023. doi:10.1007/978-3-031-43980-3_21.
[21] Isaac Goldstein, Tsvi Kopelowitz, Moshe Lewenstein, and Ely Porat. Conditional lower bounds for space/time tradeoffs. In Faith Ellen, Antonina Kolokolova, and Jörg-Rüdiger Sack, editors, Algorithms and Data Structures - 15th International Symposium, WADS 2017, St. John’s, NL, Canada, July 31 - August 2, 2017, Proceedings, volume 10389 of Lecture Notes in Computer Science, pages 421–436. Springer, 2017. doi:10.1007/978-3-319-62127-2_36.
[22] Sahar Hooshmand, Paniz Abedin, M. Oguzhan Külekci, and Sharma V. Thankachan. I/O-efficient data structures for non-overlapping indexing. Theor. Comput. Sci., 857:1–7, 2021. doi:10.1016/J.TCS.2020.12.006.
[23] Orgad Keller, Tsvi Kopelowitz, and Moshe Lewenstein. Range non-overlapping indexing and successive list indexing. In Frank K. H. A. Dehne, Jörg-Rüdiger Sack, and Norbert Zeh, editors, Algorithms and Data Structures, 10th International Workshop, WADS 2007, Halifax, Canada, August 15-17, 2007, Proceedings, volume 4619 of Lecture Notes in Computer Science, pages 625–636. Springer, 2007. doi:10.1007/978-3-540-73951-7_54.
[24] Tomasz Kociumaka. Minimal suffix and rotation of a substring in optimal time. In Roberto Grossi and Moshe Lewenstein, editors, 27th Annual Symposium on Combinatorial Pattern Matching, CPM 2016, June 27-29, 2016, Tel Aviv, Israel, volume 54 of LIPIcs, pages 28:1–28:12. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2016. doi:10.4230/LIPICS.CPM.2016.28.
[25] Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, and Tomasz Walen. Internal pattern matching queries in a text and applications. SIAM J. Comput., 53(5):1524–1577, 2024. doi:10.1137/23M1567618.
[26] Veli Mäkinen and Gonzalo Navarro. Rank and select revisited and extended. Theor. Comput. Sci., 387(3):332–347, 2007. doi:10.1016/J.TCS.2007.07.013.
[27] Kazuki Mitani, Takuya Mieno, Kazuhisa Seto, and Takashi Horiyama. Finding top-k longest palindromes in substrings. Theor. Comput. Sci., 979:114183, 2023. doi:10.1016/J.TCS.2023.114183.
[28] Gonzalo Navarro and Sharma V. Thankachan. Reporting consecutive substring occurrences under bounded gap constraints. Theor. Comput. Sci., 638:108–111, 2016. doi:10.1016/J.TCS.2016.02.005.
[29] Yakov Nekrich and Gonzalo Navarro. Sorted range reporting. In Algorithm Theory–SWAT 2012: 13th Scandinavian Symposium and Workshops, Helsinki, Finland, July 4-6, 2012. Proceedings 13, pages 271–282. Springer, 2012. doi:10.1007/978-3-642-31155-0_24.
[30] Mihai Pătraşcu and Mikkel Thorup. Dynamic integer sets with optimal rank, select, and predecessor search. In 55th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2014, Philadelphia, PA, USA, October 18-21, 2014, pages 166–175. IEEE Computer Society, 2014. doi:10.1109/FOCS.2014.26.
[31] Saladi Rahul and Ravi Janardan. A general technique for top- $k$ geometric intersection query problems. IEEE Transactions on Knowledge and Data Engineering, 26(12):2859–2871, 2014. doi:10.1109/TKDE.2014.2316807.
[32] Milan Ruzic. Constructing efficient dictionaries in close to sorting time. In Luca Aceto, Ivan Damgård, Leslie Ann Goldberg, Magnús M. Halldórsson, Anna Ingólfsdóttir, and Igor Walukiewicz, editors, Automata, Languages and Programming, 35th International Colloquium, ICALP 2008, Reykjavik, Iceland, July 7-11, 2008, Proceedings, Part I: Tack A: Algorithms, Automata, Complexity, and Games, volume 5125 of Lecture Notes in Computer Science, pages 84–95. Springer, 2008. doi:10.1007/978-3-540-70575-8_8.
[33] Qingmin Shi and Joseph JaJa. Optimal and near-optimal algorithms for generalized intersection reporting on pointer machines. Information Processing Letters, 95(3):382–388, 2005. doi:10.1016/j.ipl.2005.04.008.
[34] Qingmin Shi and Joseph JaJa. Fast fractional cascading and its applications. Digital Repository at the University of Maryland, 2008. URL: https://www.researchgate.net/publication/228922992.
[35] Hiroki Shibata and Dominik Köppl. LZ78 substring compression with CDAWGs. In Zsuzsanna Lipták, Edleno Silva de Moura, Karina Figueroa, and Ricardo Baeza-Yates, editors, String Processing and Information Retrieval - 31st International Symposium, SPIRE 2024, Puerto Vallarta, Mexico, September 23-25, 2024, Proceedings, volume 14899 of Lecture Notes in Computer Science, pages 289–305. Springer, 2024. doi:10.1007/978-3-031-72200-4_22.
[36] Daniel D. Sleator and Robert Endre Tarjan. A data structure for dynamic trees. Journal of Computer and System Sciences, 26(3):362–391, 1983. doi:10.1016/0022-0000(83)90006-5.
[37] Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages 1–11. IEEE Computer Society, 1973. doi:10.1109/SWAT.1973.13.

[bib.bib1] [1] Paniz Abedin, Arnab Ganguly, Solon P. Pissis, and Sharma V. Thankachan. Efficient data structures for range shortest unique substring queries. Algorithms, 13(11):276, 2020. doi:10.3390/A13110276.

[bib.bib2] [2] Waseem Akram and Sanjeev Saxena. Consecutive occurrences with distance constraints. In Conference on Algorithms and Discrete Applied Mathematics, pages 3–13. Springer, 2024. doi:10.1007/978-3-031-52213-0_1.

[bib.bib3] [3] Amihood Amir, Panagiotis Charalampopoulos, Solon P. Pissis, and Jakub Radoszewski. Dynamic and internal longest common substring. Algorithmica, 82(12):3707–3743, 2020. doi:10.1007/S00453-020-00744-0.

[bib.bib4] [4] Golnaz Badkobeh, Panagiotis Charalampopoulos, Dmitry Kosolobov, and Solon P. Pissis. Internal shortest absent word queries in constant time and linear space. Theor. Comput. Sci., 922:271–282, 2022. doi:10.1016/J.TCS.2022.04.029.

[bib.bib5] [5] Mark de Berg, Otfried Cheong, Marc J. van Kreveld, and Mark H. Overmars. Computational geometry: algorithms and applications, 3rd Edition. Springer, 2008. doi:10.1007/978-3-540-77974-2.

[bib.bib6] [6] Philip Bille and Inge Li Gørtz. Substring range reporting. Algorithmica, 69(2):384–396, 2014. doi:10.1007/S00453-012-9733-4.

[bib.bib7] [7] Philip Bille, Inge Li Gørtz, Max Rishøj Pedersen, Eva Rotenberg, and Teresa Anna Steiner. String indexing for top-k close consecutive occurrences. Theor. Comput. Sci., 927:133–147, 2022. doi:10.1016/J.TCS.2022.06.004.

[bib.bib8] [8] Philip Bille, Inge Li Gørtz, Max Rishøj Pedersen, and Teresa Anna Steiner. Gapped indexing for consecutive occurrences. Algorithmica, 85(4):879–901, 2023. doi:10.1007/S00453-022-01051-6.

[bib.bib9] [9] Gerth Stølting Brodal, Rolf Fagerberg, Mark Greve, and Alejandro López-Ortiz. Online sorted range reporting. In Yingfei Dong, Ding-Zhu Du, and Oscar H. Ibarra, editors, Algorithms and Computation, 20th International Symposium, ISAAC 2009, Honolulu, Hawaii, USA, December 16-18, 2009. Proceedings, volume 5878 of Lecture Notes in Computer Science, pages 173–182. Springer, 2009. doi:10.1007/978-3-642-10631-6_19.

[bib.bib10] [10] Bernard Chazelle. Filtering search: A new approach to query-answering. SIAM Journal on Computing, 15(3):703–724, 1986. doi:10.1137/0215051.

[bib.bib11] [11] Bernard Chazelle and Leonidas Guibas. Fractional cascading: I. a data structuring technique. Algorithmica, 1:133–162, January 1986. doi:10.1007/BF01840440.

[bib.bib12] [12] Hagai Cohen and Ely Porat. Range non-overlapping indexing. In Yingfei Dong, Ding-Zhu Du, and Oscar H. Ibarra, editors, Algorithms and Computation, 20th International Symposium, ISAAC 2009, Honolulu, Hawaii, USA, December 16-18, 2009. Proceedings, volume 5878 of Lecture Notes in Computer Science, pages 1044–1053. Springer, 2009. doi:10.1007/978-3-642-10631-6_105.

[bib.bib13] [13] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to algorithms. MIT press, 2022. URL: http://mitpress.mit.edu/books/introduction-algorithms.

[bib.bib14] [14] Maxime Crochemore, Costas S. Iliopoulos, Marcin Kubica, M. Sohel Rahman, German Tischler, and Tomasz Walen. Improved algorithms for the range next value problem and applications. Theor. Comput. Sci., 434:23–34, 2012. doi:10.1016/J.TCS.2012.02.015.

[bib.bib15] [15] Maxime Crochemore, Costas S. Iliopoulos, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszynski, Tomasz Walen, and Wiktor Zuba. Internal quasiperiod queries. In Christina Boucher and Sharma V. Thankachan, editors, String Processing and Information Retrieval - 27th International Symposium, SPIRE 2020, Orlando, FL, USA, October 13-15, 2020, Proceedings, volume 12303 of Lecture Notes in Computer Science, pages 60–75. Springer, 2020. doi:10.1007/978-3-030-59212-7_5.

[bib.bib16] [16] Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time. In 23rd Annual Symposium on Foundations of Computer Science, Chicago, Illinois, USA, 3-5 November 1982, pages 165–169. IEEE Computer Society, 1982. doi:10.1109/SFCS.1982.39.

[bib.bib17] [17] Arnab Ganguly, Rahul Shah, and Sharma V. Thankachan. Succinct non-overlapping indexing. Algorithmica, 82(1):107–117, 2020. doi:10.1007/S00453-019-00605-5.

[bib.bib18] [18] Pawel Gawrychowski, Garance Gourdel, Tatiana Starikovskaya, and Teresa Anna Steiner. Compressed indexing for consecutive occurrences. In Laurent Bulteau and Zsuzsanna Lipták, editors, 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023, June 26-28, 2023, Marne-la-Vallée, France, volume 259 of LIPIcs, pages 12:1–12:22. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.CPM.2023.12.

[bib.bib19] [19] Pawel Gawrychowski, Garance Gourdel, Tatiana Starikovskaya, and Teresa Anna Steiner. Compressed consecutive pattern matching. In Ali Bilgin, James E. Fowler, Joan Serra-Sagristà, Yan Ye, and James A. Storer, editors, Data Compression Conference, DCC 2024, Snowbird, UT, USA, March 19-22, 2024, pages 163–172. IEEE, 2024. doi:10.1109/DCC58796.2024.00024.

[bib.bib20] [20] Daniel Gibney, Paul Macnichol, and Sharma V. Thankachan. Non-overlapping indexing in BWT-runs bounded space. In Franco Maria Nardini, Nadia Pisanti, and Rossano Venturini, editors, String Processing and Information Retrieval - 30th International Symposium, SPIRE 2023, Pisa, Italy, September 26-28, 2023, Proceedings, volume 14240 of Lecture Notes in Computer Science, pages 260–270. Springer, 2023. doi:10.1007/978-3-031-43980-3_21.

[bib.bib21] [21] Isaac Goldstein, Tsvi Kopelowitz, Moshe Lewenstein, and Ely Porat. Conditional lower bounds for space/time tradeoffs. In Faith Ellen, Antonina Kolokolova, and Jörg-Rüdiger Sack, editors, Algorithms and Data Structures - 15th International Symposium, WADS 2017, St. John’s, NL, Canada, July 31 - August 2, 2017, Proceedings, volume 10389 of Lecture Notes in Computer Science, pages 421–436. Springer, 2017. doi:10.1007/978-3-319-62127-2_36.

[bib.bib22] [22] Sahar Hooshmand, Paniz Abedin, M. Oguzhan Külekci, and Sharma V. Thankachan. I/O-efficient data structures for non-overlapping indexing. Theor. Comput. Sci., 857:1–7, 2021. doi:10.1016/J.TCS.2020.12.006.

[bib.bib23] [23] Orgad Keller, Tsvi Kopelowitz, and Moshe Lewenstein. Range non-overlapping indexing and successive list indexing. In Frank K. H. A. Dehne, Jörg-Rüdiger Sack, and Norbert Zeh, editors, Algorithms and Data Structures, 10th International Workshop, WADS 2007, Halifax, Canada, August 15-17, 2007, Proceedings, volume 4619 of Lecture Notes in Computer Science, pages 625–636. Springer, 2007. doi:10.1007/978-3-540-73951-7_54.

[bib.bib24] [24] Tomasz Kociumaka. Minimal suffix and rotation of a substring in optimal time. In Roberto Grossi and Moshe Lewenstein, editors, 27th Annual Symposium on Combinatorial Pattern Matching, CPM 2016, June 27-29, 2016, Tel Aviv, Israel, volume 54 of LIPIcs, pages 28:1–28:12. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2016. doi:10.4230/LIPICS.CPM.2016.28.

[bib.bib25] [25] Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, and Tomasz Walen. Internal pattern matching queries in a text and applications. SIAM J. Comput., 53(5):1524–1577, 2024. doi:10.1137/23M1567618.

[bib.bib26] [26] Veli Mäkinen and Gonzalo Navarro. Rank and select revisited and extended. Theor. Comput. Sci., 387(3):332–347, 2007. doi:10.1016/J.TCS.2007.07.013.

[bib.bib27] [27] Kazuki Mitani, Takuya Mieno, Kazuhisa Seto, and Takashi Horiyama. Finding top-k longest palindromes in substrings. Theor. Comput. Sci., 979:114183, 2023. doi:10.1016/J.TCS.2023.114183.

[bib.bib28] [28] Gonzalo Navarro and Sharma V. Thankachan. Reporting consecutive substring occurrences under bounded gap constraints. Theor. Comput. Sci., 638:108–111, 2016. doi:10.1016/J.TCS.2016.02.005.

[bib.bib29] [29] Yakov Nekrich and Gonzalo Navarro. Sorted range reporting. In Algorithm Theory–SWAT 2012: 13th Scandinavian Symposium and Workshops, Helsinki, Finland, July 4-6, 2012. Proceedings 13, pages 271–282. Springer, 2012. doi:10.1007/978-3-642-31155-0_24.

[bib.bib30] [30] Mihai Pătraşcu and Mikkel Thorup. Dynamic integer sets with optimal rank, select, and predecessor search. In 55th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2014, Philadelphia, PA, USA, October 18-21, 2014, pages 166–175. IEEE Computer Society, 2014. doi:10.1109/FOCS.2014.26.

[bib.bib31] [31] Saladi Rahul and Ravi Janardan. A general technique for top- $k$ geometric intersection query problems. IEEE Transactions on Knowledge and Data Engineering, 26(12):2859–2871, 2014. doi:10.1109/TKDE.2014.2316807.

[bib.bib32] [32] Milan Ruzic. Constructing efficient dictionaries in close to sorting time. In Luca Aceto, Ivan Damgård, Leslie Ann Goldberg, Magnús M. Halldórsson, Anna Ingólfsdóttir, and Igor Walukiewicz, editors, Automata, Languages and Programming, 35th International Colloquium, ICALP 2008, Reykjavik, Iceland, July 7-11, 2008, Proceedings, Part I: Tack A: Algorithms, Automata, Complexity, and Games, volume 5125 of Lecture Notes in Computer Science, pages 84–95. Springer, 2008. doi:10.1007/978-3-540-70575-8_8.

[bib.bib33] [33] Qingmin Shi and Joseph JaJa. Optimal and near-optimal algorithms for generalized intersection reporting on pointer machines. Information Processing Letters, 95(3):382–388, 2005. doi:10.1016/j.ipl.2005.04.008.

[bib.bib34] [34] Qingmin Shi and Joseph JaJa. Fast fractional cascading and its applications. Digital Repository at the University of Maryland, 2008. URL: https://www.researchgate.net/publication/228922992.

[bib.bib35] [35] Hiroki Shibata and Dominik Köppl. LZ78 substring compression with CDAWGs. In Zsuzsanna Lipták, Edleno Silva de Moura, Karina Figueroa, and Ricardo Baeza-Yates, editors, String Processing and Information Retrieval - 31st International Symposium, SPIRE 2024, Puerto Vallarta, Mexico, September 23-25, 2024, Proceedings, volume 14899 of Lecture Notes in Computer Science, pages 289–305. Springer, 2024. doi:10.1007/978-3-031-72200-4_22.

[bib.bib36] [36] Daniel D. Sleator and Robert Endre Tarjan. A data structure for dynamic trees. Journal of Computer and System Sciences, 26(3):362–391, 1983. doi:10.1016/0022-0000(83)90006-5.

[bib.bib37] [37] Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages 1–11. IEEE Computer Society, 1973. doi:10.1109/SWAT.1973.13.

Sorted Consecutive Occurrence Queries in Substrings

Abstract

Keywords and phrases:

Funding:

Copyright and License:

2012 ACM Subject Classification:

Acknowledgements:

DOI:

Event:

Editors:

Series and Publisher:

1 Introduction

Related Work

2 Preliminaries

2.1 Notations

Definition 1 (Top-k query).

Definition 2 (Gap-bounded query).

2.2 Algorithmic Tools

Suffix Trees

Heavy Path Decomposition of Tree

Segment Tree

Fractional Cascading Technique

Sorted Orthogonal Range Reporting Queries

Lemma 3 ([9]).

Lemma 4 (Theorem 2 in [29]).

Dynamic Fusion Nodes

Lemma 5 ([30]).

3 Naïve Solutions

Top-𝒌 query algorithm for given (𝑷,[𝒂,𝒃],𝒌)

Gap-bounded query algorithm for given (𝑷,[𝒂,𝒃],[𝒈𝟏,𝒈𝟐])

Proposition 6.

4 Range Top-𝒌 Consecutive Occurrence Queries

4.1 Solution

Lemma 7.

Proof.

Definition 8 (2D top-k stabbing query with weight constraint).

Theorem 9.

Lemma 10 ([7]).

Lemma 11.

Theorem 12.

4.2 Proof of Theorem 9: Solving Geometric Subproblem

Data Structure

Query Algorithm

Correctness

Running Time

5 Range Gap-Bounded Consecutive Occurrence Queries

Definition 13 (2D height-bounded stabbing query with weight constraint).

Theorem 14.

Proof.

Theorem 15.

6 Conclusions

References

Definition 1 (Top- $k$ query).

Top- $𝒌$ query algorithm for given $(P,[a,b],k)$

Gap-bounded query algorithm for given $(P,[a,b],[g_{1},g_{2}])$

4 Range Top- $𝒌$ Consecutive Occurrence Queries

Definition 8 (2D top- $k$ stabbing query with weight constraint).