Faster Range LCP Queries in Linear Space
Abstract
A range LCP query on a text asks to return the length of the longest common prefix of any two suffixes of with starting positions in a range . In this paper we describe a data structure that uses space and supports range LCP queries in time for any constant . Our result is the fastest currently known linear-space solution for this problem.
Keywords and phrases:
Data Structures, String Algorithms, Longest Common PrefixCategory:
ResearchFunding:
Yakov Nekirch: Supported by the National Science Foundation under NSF grant 2203278.Copyright and License:
2012 ACM Subject Classification:
Theory of computation Pattern matchingEditors:
Alessio Conte, Andrea Marino, Giovanna Rosone, and Jeffrey Scott VitterSeries and Publisher:
Open Access Series in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Problem Definition and Previous Work
In this note we consider a variant of the longest common prefix (LCP) problem, called the range LCP problem. In this problem we store a text in a data structure so that range LCP queries can be answered efficiently. A range LCP query asks to return the length of the longest common prefix of any two suffixes with starting positions in a range ,
where denotes the length of the longest common prefix of and .
This problem and its variants were considered in several papers, see e.g., [5, 9, 2, 3, 1, 10, 7]. The currently fastest data structure by Amir et al. [2] uses words of space and answers range LCP queries in time. Henceforth we assume that a word of space consists of bits. The data structure with space usage by Abedin et al. [1] supports queries in time for any constant . The data structure of Matsuda et al. [10] uses bits of space where is the -order entropy of the text ; however this space usage is achieved at a cost of significantly higher query time as their data structure supports queries in time .
In this note we describe a new trade-off between the space usage and the query time: Our data structure uses linear space and supports queries in time for any constant . Thus we achieve the same space usage as in [1] and query time that is close to [2]. Our solution combines the techniques from some previous papers with some new ideas. The compact data structure for predecessor queries by Grossi et al. [8] is also used in our construction.
Notation
We will say that a triple is a bridge if and . The total number of bridges is . However in order to answer a range LCP query it is sufficient to consider a subset of all bridges of size [2, 1]. Following the method of Abedin et al. [1], we consider special bridges that are defined below.
We consider the suffix tree of the text and divide its nodes into heavy and light nodes [11]. Let denote the number of leaves in the subtree rooted at . The root is a light node. Exactly one child of every internal node is designated as a heavy node, specifically one with the largest (ties are broken arbitrarily). All other nodes are light. Let and denote the leaves in the suffix tree that hold suffixes and respectively. Let denote the lowest common ancestor of and and let and denote the children of that are ancestors of and respectively. A bridge is a special bridge if one of the following conditions is satisfied:
-
1.
is a light node and
-
2.
is a light node and
Lemma 1 ([1]).
There are special bridges. For any and such that , we have
In the rest of this paper a bridge will denote a special bridge.
Bridge Classification
Let . For a bridge we will say that is its left leg, is its right leg, and is its height. We will say that a bridge is in the interval if . Let denote the set of all bridges of height .
By pigeonhole principle, there exists some value , , such that the total number of bridges in is bounded by ; see also [1, Section 5.1] We will say that all bridges in are base bridges. All other bridges are implicit bridges. Data structures for different categories of bridges are described below.
Base bridges
The total number of base bridges is . Furthermore we can find the maximum height base bridge in a query range by answering a variant of an orthogonal range searching query. Our data structure for base bridges is summarized in the following lemma.
Lemma 2.
There exists an -word data structure that finds, for any interval , the base bridge with maximal height in . The query time is .
Proof.
There is at most one bridge for every value of , and the total number of bridges in is . Hence the total number of base bridges is . In order to answer a query, we must find the largest such that , , and there is a base bridge . Since , this is equivalent to finding the triple such that , , and is maximized. The latter query is equivalent to a two-dimensional dominance maxima query. Using a data structure for top- dominance queries with , such a query can be answered in time using space, see [4, Theorem 7].
Implicit Bridges
Using Lemma 2, we can find the largest such that there is a base bridge of height in the query range . Now we explain how to find the largest , , such that there is (an implicit) bridge of height in .
The following properties of special bridges will be used.
Lemma 3 ([1], Lemma 6).
If there is a special bridge , then for any there is a special bridge for some and a special bridge for some .
Lemma 4 ([1], Lemma 5).
There exists a data structure that uses space and supports the following queries in time:
-
find the right leg of a special bridge if its left leg and its height are known
-
find the left leg of a special bridge if its right leg and its height are known
Let denote the set of heights of base bridges. For every we consider all bridges of height and construct the list of their left legs sorted in increasing order , where is the number of special bridges with height . For each , , we also construct an array so that . Finally let
where is the successor of in and is the position of in .
Lemma 5.
If , then there is a bridge of height in .
If , then there is no bridge in such that
Proof.
The first statement directly follows from the definition of : if , then there is an index , such that and a position , such that and . Hence there is also a special bridge where and .
To prove the second part of the lemma, suppose that there is a bridge where , and for some , . Then and . Hence there is a base bridge for some . Let denote the position of in . Furthermore and there is an implicit bridge for some . Since , and .
Lemma 6.
For any , there exists a data structure that uses bits and determines whether in time for any and for any .
Proof.
For every , , we store a compact data structure that supports range minimum queries on in time and uses bits of space. We can use the data structure from [6] for this purpose. All compact range minima data structures use bits. Arrays are not stored. Additionally we store in a data structure that uses bits and supports successor queries in time [12].
To compute , we find the smallest such that . Then we use the range minimum data structure and find the index such that and for any . Finally we compute the right leg of the bridge using Lemma 4. If , then . If , then .
We keep a data structure from Lemma 6 for every . The total space usage of these data structures is bits. Using Lemma 6 and binary search, we find the largest such that . By Lemma 5, there is a bridge of height in ; hence . Also, by Lemma 5 there is no bridge in , such that . The total time is . We thus proved the following lemma.
Lemma 7.
There exists an -word data structure that finds, for any interval , the implicit bridge with height in , so that there is no bridge of height in . The query time is .
Block Bridges
It remains to consider the case of bridges with right leg in the interval and of height , . We apply the pigeonhole principle again. Let . There exists some value , , such that the total number of bridges in is bounded by . Let . We will say that all bridges where are good bridges. All other bridges are bad bridges.
We will denote by the set of all bridges such that (a) and for some and (b) for some , .
Lemma 8.
Let denote the number of bridges in . There exists a data structure that finds, for any interval , the bridge from with maximal height such that its right and left legs are in . This data structure uses bits and supports queries in time.
Proof.
Let denote the set that contains left legs of all bridges in . Every bridge from is represented as follows: We replace the right leg with and the height with . We replace the left leg with , where is the rank of in . Thus a bridge is represented by a triple . Since , , and are bounded by we can store each triple using bits. For each , we can retrieve the corresponding values of and with one addition. If and of some bridge are known, we can obtain the value of its left leg in time using the data structure from Lemma 4.
In addition, we store all elements of in a compact data structure that is described by Grossi et al. in [8, Lemma 3.3]. This data structure supports successor queries on a set of integers ; provided that we can access an arbitrary element of in time , a successor query can be answered in time where is the number of elements in . The data structure uses bits per element, where is the size of the universe (in addition to the space required to store ). In our case, has elements and the size of the universe is . Hence for every left leg , the data structure uses bits. We can obtain the value of any left leg in time . Hence successor queries are answered in time. That is, we can find for any the smallest such that .
Finally all triples are stored in the data structure described in [4, Theorem 7] 111The same data structure was also used in Lemma 2.. This data structure uses bits and supports the following range maxima queries in time: for any and , find the highest , among all tuples satisfying and .
In order to find the maximum-height bridge in from , we find the successor of and its rank , using the compact successor data structure. We also compute . Then we find the highest value among all satisfying and . The maximum height of a bridge in is . The total time required to answer a query is .
Lemma 9.
There are non-empty blocks.
Proof.
Suppose that there is at least one implicit bridge in . Let . Then by Lemma 3 there is a special bridge such that . Since and , we have . Thus for every base bridge where there are at most two non-empty blocks. Since there are base bridges, the number of non-empty blocks is .
Lemma 10.
There exists a data structure that finds, for any , and any , the highest good bridge in such that its right leg is in and its height is in . The data structure uses words of space and supports queries in time.
Proof.
We store a block data structure from Lemma 8 for each non-empty block . The total number of bridges in all blocks is equal to the total number of good bridges. By Lemma 9, the total number of non-empty blocks is . Hence the total space usage of all block data structures is bits. The range intersects with at most two blocks. Hence we can find the highest bridge satisfying the conditions of this lemma in time by answering two queries to block data structures.
Putting All Parts Together
In order to answer a range LCP query we need to identify the largest such that there is a bridge in . Our algorithm works in four stages:
-
1.
First, we find the largest such that there is a base bridge of height in . This step takes time by Lemma 2.
-
2.
Then we find the largest , where , such that there is an implicit bridge of height in . This can be done in time by Lemma 7
-
3.
We find the largest such that there is a good bridge of height with right leg in . This step takes time by Lemma 10.
-
4.
Let . We check if there is a bridge of height in for each , . By Lemma 5 we can check each candidate value of in time. Hence this step takes time.
The total query time is . By replacing with a constant in the above construction, we obtain our final result.
Theorem 11.
There exists a data structure that uses words of space and answers range LCP queries in time time.
References
- [1] Paniz Abedin, Arnab Ganguly, Wing-Kai Hon, Kotaro Matsuda, Yakov Nekrich, Kunihiko Sadakane, Rahul Shah, and Sharma V. Thankachan. A linear-space data structure for range-lcp queries in poly-logarithmic time. Theor. Comput. Sci., 822:15–22, 2020. doi:10.1016/J.TCS.2020.04.009.
- [2] Amihood Amir, Alberto Apostolico, Gad M. Landau, Avivit Levy, Moshe Lewenstein, and Ely Porat. Range LCP. J. Comput. Syst. Sci., 80(7):1245–1253, 2014. doi:10.1016/J.JCSS.2014.02.010.
- [3] Amihood Amir, Moshe Lewenstein, and Sharma V. Thankachan. Range LCP queries revisited. In 22nd International Symposium String Processing and Information Retrieval (SPIRE), pages 350–361, 2015. doi:10.1007/978-3-319-23826-5_33.
- [4] Timothy M. Chan, Yakov Nekrich, Saladi Rahul, and Konstantinos Tsakalidis. Orthogonal point location and rectangle stabbing queries in 3-d. J. Comput. Geom., 13(1):399–428, 2022. doi:10.20382/JOCG.V13I1A15.
- [5] Graham Cormode and S. Muthukrishnan. Substring compression problems. In 16th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 321–330, 2005. URL: http://dl.acm.org/citation.cfm?id=1070432.1070478.
- [6] Johannes Fischer and Volker Heun. Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput., 40(2):465–492, 2011. doi:10.1137/090779759.
- [7] Arnab Ganguly, Manish Patil, Rahul Shah, and Sharma V. Thankachan. A linear space data structure for range LCP queries. Fundam. Informaticae, 163(3):245–251, 2018. doi:10.3233/FI-2018-1741.
- [8] Roberto Grossi, Alessio Orlandi, Rajeev Raman, and S. Srinivasa Rao. More haste, less waste: Lowering the redundancy in fully indexable dictionaries. In 26th International Symposium on Theoretical Aspects of Computer Science (STACS), pages 517–528, 2009. doi:10.4230/LIPICS.STACS.2009.1847.
- [9] Orgad Keller, Tsvi Kopelowitz, Shir Landau Feibish, and Moshe Lewenstein. Generalized substring compression. Theor. Comput. Sci., 525:42–54, 2014. doi:10.1016/j.tcs.2013.10.010.
- [10] Kotaro Matsuda, Kunihiko Sadakane, Tatiana Starikovskaya, and Masakazu Tateshita. Compressed orthogonal search on suffix arrays with applications to range LCP. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM), pages 23:1–23:13, 2020. doi:10.4230/LIPICS.CPM.2020.23.
- [11] Daniel Dominic Sleator and Robert Endre Tarjan. A data structure for dynamic trees. J. Comput. Syst. Sci., 26(3):362–391, 1983. doi:10.1016/0022-0000(83)90006-5.
- [12] Peter van Emde Boas. Preserving order in a forest in less than logarithmic time and linear space. Inf. Process. Lett., 6(3):80–82, 1977. doi:10.1016/0020-0190(77)90031-X.
