Abstract References

Faster Range LCP Queries in Linear Space

Yakov Nekirch Michigan Technological University, Houghton, MI, USA Sharma V. Thankachan ORCID North Carolina State University, Raleigh, NC, USA
Abstract

A range LCP query 𝗋𝗅𝖼𝗉(α,β) on a text T[1..n] asks to return the length of the longest common prefix of any two suffixes of T with starting positions in a range [α,β]. In this paper we describe a data structure that uses O(n) space and supports range LCP queries in time O(logεn) for any constant ε>0. Our result is the fastest currently known linear-space solution for this problem.

Keywords and phrases:
Data Structures, String Algorithms, Longest Common Prefix
Category:
Research
Funding:
Yakov Nekirch: Supported by the National Science Foundation under NSF grant 2203278.
Sharma V. Thankachan: Supported by the National Science Foundation under NSF grant 2315822.
Copyright and License:
[Uncaptioned image] © Yakov Nekirch and Sharma V. Thankachan; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Pattern matching
Editors:
Alessio Conte, Andrea Marino, Giovanna Rosone, and Jeffrey Scott Vitter

Problem Definition and Previous Work

In this note we consider a variant of the longest common prefix (LCP) problem, called the range LCP problem. In this problem we store a text T[1..n] in a data structure so that range LCP queries can be answered efficiently. A range LCP query [α,β] asks to return the length of the longest common prefix of any two suffixes with starting positions in a range [α,β],

𝗋𝗅𝖼𝗉(α,β)=max{𝖫𝖢𝖯(i,j)ij and i,j[α,β]},

where 𝖫𝖢𝖯(i,j) denotes the length of the longest common prefix of T[i..n] and T[j..n].

This problem and its variants were considered in several papers, see e.g., [5, 9, 2, 3, 1, 10, 7]. The currently fastest data structure by Amir et al. [2] uses O(nlogn) words of space and answers range LCP queries in O(loglogn) time. Henceforth we assume that a word of space consists of logn bits. The data structure with O(n) space usage by Abedin et al. [1] supports queries in O(log1+εn) time for any constant ε>0. The data structure of Matsuda et al. [10] uses O(nH0) bits of space where H0 is the 0-order entropy of the text T; however this space usage is achieved at a cost of significantly higher query time as their data structure supports queries in time O(nε).

In this note we describe a new trade-off between the space usage and the query time: Our data structure uses linear space and supports queries in time O(logεn) for any constant ε>0. Thus we achieve the same space usage as in [1] and query time that is close to [2]. Our solution combines the techniques from some previous papers with some new ideas. The compact data structure for predecessor queries by Grossi et al. [8] is also used in our construction.

Notation

We will say that a triple (i,j,h) is a bridge if 1i<jn and 𝖫𝖢𝖯(i,j)=h. The total number of bridges is O(n2). However in order to answer a range LCP query it is sufficient to consider a subset of all bridges of size O(nlogn) [2, 1]. Following the method of Abedin et al. [1], we consider special bridges that are defined below.

We consider the suffix tree of the text T and divide its nodes into heavy and light nodes [11]. Let size(u) denote the number of leaves in the subtree rooted at u. The root is a light node. Exactly one child u of every internal node u is designated as a heavy node, specifically one with the largest size(u) (ties are broken arbitrarily). All other nodes are light. Let i and j denote the leaves in the suffix tree that hold suffixes T[i..n] and T[j..n] respectively. Let u denote the lowest common ancestor of i and j and let ui and uj denote the children of u that are ancestors of i and j respectively. A bridge (i,j,h) is a special bridge if one of the following conditions is satisfied:

  1. 1.

    ui is a light node and j=min{x(i,x,h) is a bridge}

  2. 2.

    uj is a light node and i=max{x(x,j,h) is a bridge}

Lemma 1 ([1]).

There are O(nlogn) special bridges. For any α and β such that 1αβn, we have 𝗋𝗅𝖼𝗉(α,β)=max{h(i,j,h) is a special bridge and αi,jβ}

In the rest of this paper a bridge will denote a special bridge.

Bridge Classification

Let Δ=logn. For a bridge (i,j,h) we will say that i is its left leg, j is its right leg, and h is its height. We will say that a bridge (i,j,h) is in the interval [a,b] if aijb. Let t denote the set of all bridges of height t.

By pigeonhole principle, there exists some value π, 1πΔ, such that the total number of bridges in k0π+kΔ is bounded by O(nlognΔ)=O(n); see also [1, Section 5.1] We will say that all bridges in 1(k0π+kΔ) are base bridges. All other bridges are implicit bridges. Data structures for different categories of bridges are described below.

Base bridges

The total number of base bridges is O(n). Furthermore we can find the maximum height base bridge in a query range by answering a variant of an orthogonal range searching query. Our data structure for base bridges is summarized in the following lemma.

Lemma 2.

There exists an O(n)-word data structure that finds, for any interval [α,β], the base bridge with maximal height in [α,β]. The query time is O(loglogn).

Proof.

There is at most one bridge (i,j,1) for every value of i, 1in and the total number of bridges in 1 is O(n). Hence the total number of base bridges is O(n). In order to answer a query, we must find the largest h such that αiβ, αjβ, and there is a base bridge (i,j,h). Since ij, this is equivalent to finding the triple (i,j,h) such that iα, jβ, and h is maximized. The latter query is equivalent to a two-dimensional dominance maxima query. Using a data structure for top-k dominance queries with k=1, such a query can be answered in O(loglogn) time using O(n) space, see [4, Theorem 7].

Implicit Bridges

Using Lemma 2, we can find the largest h0 such that there is a base bridge of height h0 in the query range [α,β]. Now we explain how to find the largest h, h0<h<h0+Δ, such that there is (an implicit) bridge of height h in [α,βΔ].

The following properties of special bridges will be used.

Lemma 3 ([1], Lemma 6).

If there is a special bridge (i,j,h), then for any d<h there is a special bridge (i+d,j+d,hd) for some jj and a special bridge (i+d,j+d,hd) for some ii.

Lemma 4 ([1], Lemma 5).

There exists a data structure that uses O(n) space and supports the following queries in O(logεn) time:

  • find the right leg j of a special bridge (i,j,h) if its left leg i and its height h are known

  • find the left leg i of a special bridge (i,j,h) if its right leg j and its height h are known

Let H0 denote the set of heights of base bridges. For every h0H0 we consider all bridges of height h0 and construct the list of their left legs sorted in increasing order L(h0)={s1,s2,,sn0}, where n0 is the number of special bridges with height h0. For each k, 1kΔ, we also construct an array Rh0,k[1..n0] so that Rh0,k[i]=min{j|𝖫𝖢𝖯(sik,j)h0+k}. Finally let

Min(h0,k,a)=min{Rh0,k[i]|iai},

where sia is the successor of (a+k) in L(h0) and ia is the position of sia in L(h0).

Lemma 5.

If Min(h0,k,a)b, then there is a bridge of height h0+k in [a,b].
If Min(h0,k,a)>b, then there is no bridge (i,j,h) in [a,bΔ] such that hh0+k

Proof.

The first statement directly follows from the definition of Min: if Min(h0,k,a)b, then there is an index im, such that sima+k and a position j, such that simk<jb and 𝖫𝖢𝖯(simk,j)h0+k. Hence there is also a special bridge (simk,j,h0+k) where jjb and asimkb.
To prove the second part of the lemma, suppose that there is a bridge (i,j,h) where aibΔ, ajbΔ and h=h0+k+f for some f, 0fΔk. Then 𝖫𝖢𝖯(i,j)=h0+k+f and 𝖫𝖢𝖯(i+k+f,j+k+f)=h0. Hence there is a base bridge (i+k+f,j0,h0) for some j0j+k+f. Let t denote the position of (i+k+f,j0,h0) in L(h0). Furthermore 𝖫𝖢𝖯(i+f,j+f)=h0+k and there is an implicit bridge (i+f,j1,h0+k) for some j1j+f. Since j+fb, Rh0,k[t]b and Min(h0,k,a)b.

Lemma 6.

For any h0H0, there exists a data structure that uses O(n0logn) bits and determines whether Min(h0,k,a)b in O(logεn) time for any 1abn and for any 1kΔ.

Proof.

For every k, 1kΔ, we store a compact data structure that supports range minimum queries on Rh0,k in O(1) time and uses O(n0) bits of space. We can use the data structure from [6] for this purpose. All compact range minima data structures use O(n0Δ)=O(n0logn) bits. Arrays Rh0,k are not stored. Additionally we store L(h0) in a data structure that uses O(n0logn) bits and supports successor queries in O(loglogn) time [12].

To compute Min(h0,k,a), we find the smallest sL(h0) such that sa+k. Then we use the range minimum data structure and find the index im such that ims and Rh0,k[im]Rh0,k[i] for any is. Finally we compute the right leg jm of the bridge (im,jm,h0+k) using Lemma 4. If jm>b, then Min(h0,k,a)>b. If jmb, then Min(h0,k,a)b.

We keep a data structure from Lemma 6 for every h0H0. The total space usage of these data structures is O(nlogn) bits. Using Lemma 6 and binary search, we find the largest k such that Min(h0,k,a)b. By Lemma 5, there is a bridge of height h0+k in [a,b]; hence 𝖫𝖢𝖯(a,b)h. Also, by Lemma 5 there is no bridge (i,j,h) in [a,bΔ], such that hh0+k+1. The total time is O(logεnloglogn). We thus proved the following lemma.

Lemma 7.

There exists an O(n)-word data structure that finds, for any interval [α,β], the implicit bridge with height hi in [α,β], so that there is no bridge of height h>hi in [α,βΔ]. The query time is O(logεnloglogn).

Block Bridges

It remains to consider the case of bridges with right leg in the interval [βΔ,β] and of height h, h0<h<h0+Δ. We apply the pigeonhole principle again. Let Δ1=loglogn. There exists some value π1, 1π1Δ1, such that the total number of bridges in k0π1+kΔ1 is bounded by O(nlognΔ1)=O(nlognloglogn). Let H1={π1+kΔ1 1k(nπ1)/Δ1}. We will say that all bridges t where tH0H1 are good bridges. All other bridges are bad bridges.

We will denote by Blockt,h0 the set of all bridges (i,r,h) such that (a) hH0H1 and h0h<h0+Δ for some h0H and (b) tΔ+1r(t+1)Δ for some k, 0k<nΔ.

Lemma 8.

Let m denote the number of bridges in Blockt,h0. There exists a data structure that finds, for any interval [α,β], the bridge from Blockt,h0 with maximal height such that its right and left legs are in [α,β]. This data structure uses O(logn+mloglogn) bits and supports queries in O(logεn) time.

Proof.

Let It,h0 denote the set that contains left legs of all bridges in Blockt,h0. Every bridge (i,j,h) from Blockt,h0 is represented as follows: We replace the right leg j with d(j)=jtΔ and the height h with d(h)=hh0. We replace the left leg i with r(i), where r(i)=|{xi|xIt,h0}| is the rank of i in It,h0. Thus a bridge (i,j,h)Blockt,h0 is represented by a triple (r(i),d(j),d(h)). Since r(i), d(j), and d(h) are bounded by Δ we can store each triple (r(i),d(j),d(h)) using O(loglogn) bits. For each (r(i),d(j),d(h)), we can retrieve the corresponding values of j and h with one addition. If j and h of some bridge (i,j,h) are known, we can obtain the value of its left leg i in O(logεn) time using the data structure from Lemma 4.

In addition, we store all elements of It,h0 in a compact data structure that is described by Grossi et al. in [8, Lemma 3.3]. This data structure supports successor queries on a set of integers S; provided that we can access an arbitrary element of S in time tacc, a successor query can be answered in time O(logm/loglogn+tacc) where m is the number of elements in S. The data structure uses O(loglogu) bits per element, where u is the size of the universe (in addition to the space required to store S). In our case, It,h0 has O(Δ2) elements and the size of the universe is n. Hence for every left leg iIt,h0, the data structure uses O(loglogn) bits. We can obtain the value of any left leg in time O(logεn). Hence successor queries are answered in O((logΔ2)/loglogn+logεn)=O(logεn) time. That is, we can find for any α the smallest iαIt,h0 such that iαα.

Finally all triples (r(i),d(j),d(h)) are stored in the data structure described in [4, Theorem 7] 111The same data structure was also used in Lemma 2.. This data structure uses O(mlogm)=O(mloglogn) bits and supports the following range maxima queries in O(loglogn) time: for any rα and dβ, find the highest d(h), among all tuples (r(i),d(j),d(h)) satisfying r(i)rα and d(j)dβ.

In order to find the maximum-height bridge in [α,β] from Blockt,h0, we find the successor of α and its rank r(α), using the compact successor data structure. We also compute d(β)=βtΔ. Then we find the highest value d(hmax) among all (r(i),d(j),d(h)) satisfying r(i)rα and d(j)dβ. The maximum height of a bridge in [α,β] is hmax=d(hmax)+h0. The total time required to answer a query is O(logεn+loglogn)=O(logεn).

Lemma 9.

There are O(n) non-empty blocks.

Proof.

Suppose that there is at least one implicit bridge (i,j,h) in Blockt,h0. Let k=hh0. Then by Lemma 3 there is a special bridge (i,j+k,h0) such that i+kij+k. Since 1k<Δ and tΔ<j(t+1)Δ, we have tΔ<j+kt+2Δ. Thus for every base bridge (i,j,h0) where h0H0 there are at most two non-empty blocks. Since there are O(n) base bridges, the number of non-empty blocks is O(n).

Lemma 10.

There exists a data structure that finds, for any αβ, and any h0H0, the highest good bridge (i,j,h) in [α,β] such that its right leg j is in [βΔ,β] and its height h is in [h0,h0+Δ]. The data structure uses O(n) words of space and supports queries in O(logεn) time.

Proof.

We store a block data structure from Lemma 8 for each non-empty block Blockt,h0. The total number of bridges in all blocks is equal to the total number of good bridges. By Lemma 9, the total number of non-empty blocks is O(n). Hence the total space usage of all block data structures is O(nlogn+(nlognloglogn)loglogn)=O(nlogn) bits. The range [βΔ,β] intersects with at most two blocks. Hence we can find the highest bridge satisfying the conditions of this lemma in time O(logεn) by answering two queries to block data structures.

Putting All Parts Together

In order to answer a range LCP query [α,β] we need to identify the largest h such that there is a bridge (i,j,h) in [α,β]. Our algorithm works in four stages:

  1. 1.

    First, we find the largest h0 such that there is a base bridge of height h0 in [α,β]. This step takes O(loglogn) time by Lemma 2.

  2. 2.

    Then we find the largest hi, where h0<hi<h0+Δ, such that there is an implicit bridge of height hi in [α,βΔ]. This can be done in O(logεnloglogn) time by Lemma 7

  3. 3.

    We find the largest hn such that there is a good bridge of height hn>h0 with right leg in [βΔ,β]. This step takes O(logεn) time by Lemma 10.

  4. 4.

    Let h1=max(h0,hi,hn). We check if there is a bridge of height h in [α,β] for each h, h1<h<h1+Δ1. By Lemma 5 we can check each candidate value of h in O(logεn) time. Hence this step takes O(logεΔ1)=O(logεnloglogn) time.

The total query time is O(logεnloglogn). By replacing ε with a constant ε<ε in the above construction, we obtain our final result.

Theorem 11.

There exists a data structure that uses O(n) words of space and answers range LCP queries in time O(logεn) time.

References

  • [1] Paniz Abedin, Arnab Ganguly, Wing-Kai Hon, Kotaro Matsuda, Yakov Nekrich, Kunihiko Sadakane, Rahul Shah, and Sharma V. Thankachan. A linear-space data structure for range-lcp queries in poly-logarithmic time. Theor. Comput. Sci., 822:15–22, 2020. doi:10.1016/J.TCS.2020.04.009.
  • [2] Amihood Amir, Alberto Apostolico, Gad M. Landau, Avivit Levy, Moshe Lewenstein, and Ely Porat. Range LCP. J. Comput. Syst. Sci., 80(7):1245–1253, 2014. doi:10.1016/J.JCSS.2014.02.010.
  • [3] Amihood Amir, Moshe Lewenstein, and Sharma V. Thankachan. Range LCP queries revisited. In 22nd International Symposium String Processing and Information Retrieval (SPIRE), pages 350–361, 2015. doi:10.1007/978-3-319-23826-5_33.
  • [4] Timothy M. Chan, Yakov Nekrich, Saladi Rahul, and Konstantinos Tsakalidis. Orthogonal point location and rectangle stabbing queries in 3-d. J. Comput. Geom., 13(1):399–428, 2022. doi:10.20382/JOCG.V13I1A15.
  • [5] Graham Cormode and S. Muthukrishnan. Substring compression problems. In 16th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 321–330, 2005. URL: http://dl.acm.org/citation.cfm?id=1070432.1070478.
  • [6] Johannes Fischer and Volker Heun. Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput., 40(2):465–492, 2011. doi:10.1137/090779759.
  • [7] Arnab Ganguly, Manish Patil, Rahul Shah, and Sharma V. Thankachan. A linear space data structure for range LCP queries. Fundam. Informaticae, 163(3):245–251, 2018. doi:10.3233/FI-2018-1741.
  • [8] Roberto Grossi, Alessio Orlandi, Rajeev Raman, and S. Srinivasa Rao. More haste, less waste: Lowering the redundancy in fully indexable dictionaries. In 26th International Symposium on Theoretical Aspects of Computer Science (STACS), pages 517–528, 2009. doi:10.4230/LIPICS.STACS.2009.1847.
  • [9] Orgad Keller, Tsvi Kopelowitz, Shir Landau Feibish, and Moshe Lewenstein. Generalized substring compression. Theor. Comput. Sci., 525:42–54, 2014. doi:10.1016/j.tcs.2013.10.010.
  • [10] Kotaro Matsuda, Kunihiko Sadakane, Tatiana Starikovskaya, and Masakazu Tateshita. Compressed orthogonal search on suffix arrays with applications to range LCP. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM), pages 23:1–23:13, 2020. doi:10.4230/LIPICS.CPM.2020.23.
  • [11] Daniel Dominic Sleator and Robert Endre Tarjan. A data structure for dynamic trees. J. Comput. Syst. Sci., 26(3):362–391, 1983. doi:10.1016/0022-0000(83)90006-5.
  • [12] Peter van Emde Boas. Preserving order in a forest in less than logarithmic time and linear space. Inf. Process. Lett., 6(3):80–82, 1977. doi:10.1016/0020-0190(77)90031-X.