Range Longest Increasing Subsequence and Its Relatives
Abstract
Longest increasing subsequence () is a classical textbook problem which is still actively studied in various computational models. In this work, we present a few results for the range longest increasing subsequence problem () and its variants. The input to is a sequence of real numbers and a collection of query ranges and for each query in , the goal is to report the of the sequence restricted to that query. Our two main results are for the following generalizations of the problem:
- 2D Range Queries:
-
In this variant of the problem, each query is a pair of ranges, one of indices and the other of values, and we provide a randomized algorithm with running time111The notation hides polylogarithmic factors in and . , where is the cumulative length of the output subsequences. This improves on the elementary runtime algorithm when . Previously, the only known result breaking the quadratic barrier was of Tiskin [SODAโ10] which could only handle 1D range queries (i.e., each query was a range of indices) and also just outputted the length of the (instead of reporting the subsequence achieving that length). Subsequent to our paper, Gawrychowski, Gorbachev, and Kociumaka in a preprint have extended Tiskinโs approach to handle reporting 1D range queries in time.
- Colored Sequences:
-
In this variant of the problem, each element in is colored and for each query in , the goal is to report a monochromatic contained in the sequence restricted to that query. For 2D queries, we provide a randomized algorithm for this colored version with running time . Moreover, for 1D queries, we provide an improved algorithm with running time . Thus, we again improve on the elementary runtime algorithm. Additionally, we prove that assuming the well-known Combinatorial Boolean Matrix Multiplication Hypothesis, that the runtime for 1D queries is essentially tight for combinatorial algorithms.
Our algorithms combine several tools such as dynamic programming (to precompute increasing subsequences with some desirable properties), geometric data structures (to efficiently compute the dynamic programming entries), random sampling (to capture elements which are part of the ), classification of query ranges into large and small , and classification of colors into light and heavy. We believe that our techniques will be of interest to tackle other variants of problem and other range-searching problems.
Keywords and phrases:
Longest Increasing Subsequence, Range Query, Fine-Grained ComplexityFunding:
Karthik C. S.: This work was supported by the National Science Foundation under Grant CCF-2313372 and by the Simons Foundation, Grant Number 825876, Awardee Thu D. Nguyen.Copyright and License:
2012 ACM Subject Classification:
Theory of computation Data structures design and analysis ; Theory of computation Computational geometryAcknowledgements:
We thank the anonymous reviewers for several helpful comments that improved the presentation of our results.Editor:
Shubhangi SarafSeries and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl โ Leibniz-Zentrum fรผr Informatik
1 Introduction
In the longest increasing subsequence () problem, the input is a sequence of real numbers and the goal is to report indices , for the largest possible value of , such that . The length of the is then defined to be . For simplicity of discussion, we will assume that all the real numbers in the sequence are distinct. A standard dynamic programming algorithm reports the in time, which can be improved to by performing patience sort [51, 52, 11, 27] or by suitably augmenting a binary search tree which computes each entry in the dynamic programming table in time.
In many applications, such as time-series data analysis, the user might not be interested in the of the entire sequence. Instead they would like to focus their attention only on a โrangeโ of indices in the sequence. Here are a few examples to illustrate this point.
-
Consider a stock in the trading market. Let the sequence represent the daily price of stock from 1950 till 2022. Instead of querying for the of the entire sequence, the user might gain more insights [38, 49] about the trends of the stock by querying for the in a range of dates (or time-windows) such as Jan 2020 till March 2020 or Feb 2018 till Dec 2018.
-
In modern times, social media platforms are the major source for generating enormous content on the Internet on a minute-to-minute basis. Consider a time-series data with statistics at a fine-grained level (of each minute of the day) about the number of Google searches, number of Tweets posted, number of Amazon purchases, or the number of Instagram posts. Monitoring the under time-window constraints on such datasets can lead to more insights in understanding the social media usage trends and can also help in detecting abnormalities.
-
Popular genome sequencing algorithms identify high scoring segment pairs (HSPs) between a query transcript sequence and a long reference genomic sequence in order to obtain a global alignment, and this amounts to computing the in the reference genome [9, 80, 8]. Typically many queries are made (corresponding to various transcripts/proteins) w.r.t. the same reference genomic sequence, and this can be modeled as range queries to compute the .
Recently, in the theoretical computer science community, there has been a lot of interest in the dynamic problem [53, 43, 31], where the goal is to maintain exact or approximate under insertion and deletion of elements. Interestingly, an important subroutine which shows up is the computation of of a given range of elements in the sequence. For example, Gawrychowski and Janczewski [31] design a dynamic algorithm to quickly compute the approximate for any range of indices in . In fact, Gawrychowski and Janczewski go further and design a dynamic algorithm which maintains the approximate for a special class of dyadic rectangles where the query222Consider mapping the element to a 2D point , to contextualize the notion of 2D queries. is an axis-aligned rectangle in 2D and the goal is to compute the approximate of the 2D points (from the above mapping) which lie inside the rectangle.
Motivated by this, in this paper we study several natural variants of the problem where the 1D queries impose restriction on the range of the indices and the 2D queries impose range restriction on both the indices and the values in the sequence. For each of these variants of the problem there is a data structure counterpart that can be studied as well, and these results are later listed in Table 2.
1.1 Vanilla Problem: One dimensional range problem
The first problem is the 1D range longest increasing subsequence problem (). In the problem, we are given as input a sequence of real numbers and a collection of query ranges, where each is a range (or an interval) . For each , the goal is to report the of the sequence (i.e., the output must be the longest increasing subsequence in ). The length of the of the sequence is denoted by . Throughout the paper, we abuse notation for simplicity by using to refer to both the subsequence and its length. We clarify the intended meaning wherever it may be unclear.
A straightforward approach to solve the problem would involve no preprocessing: for each range in , we will retrieve the elements and report their . This would require time in the worst-case (for instance when many of the ranges in are of length). In a remarkable work, Tiskin [74, 76, 75] broke the quadratic barrier, by designing an time algorithm, albeit for the computationally easier length version of the problem, in which for all , the goal is to only output the length of the in the range (i.e., ). While the algorithm designed by Tiskin is tailored to handle the length version of the problem, in a subsequent work to the current paper, Gawrychowski, Gorbachev, and Kociumaka [29] extend Tiskinโs algorithm to handle the reporting version as well by providing an algorithm which answers the reporting version of the problem in time.
Since the vanilla version of the range problem is completely understood (up to log factors), we focus on the following three variants.
1.2 Problem-I: Two dimensional range problem
The next problem is the 2D range longest increasing subsequence problem () which is a natural generalization of the problem. As before, we are given as input a sequence . Consider a mapping where the element is mapped to a 2D point . Let be a collection of these points. We are also given as input a collection of query ranges, where each range in is an axis-aligned rectangle in 2D. For each , the goal is to report the of , i.e., the points of lying inside . See Figure 1(a) for an example.
Once again the naive approach to solve the problem would take time, where for each range , we will scan the points in to filter out the points lying inside and then compute their . One might be tempted to use orthogonal range searching data structures [3, 5, 24, 56, 13] to efficiently report , but in the worst-case we could have many queries with . In this paper, even for the problem, we succeed to improve on the basic quadratic runtime algorithm when is roughly . We obtain the following result where the notation hides polylog factors.
Theorem 1.
There is a randomized algorithm to answer the reporting version of problem in time, where is the cumulative length of the output subsequences. The bound on the running time and the correctness of the solution holds with high probability.
1.3 Problem-II: Colored 1D range problem
In the geometric data structures literature, colored range searching [41, 57, 59, 54, 45, 40, 14, 6, 34, 37, 46, 55, 71, 33] is a well-studied generalization of traditional range searching problems. In the colored setting, the geometric objects are partitioned into disjoint groups (or colors), and given a query region , the typical goal is to efficiently report [22, 25, 23, 47], or count [39], or approximately count [60] the number of colors intersecting , or find the majority color inside [21, 44].
In the same spirit, we study the colored 1D range longest increasing subsequence problem (). In addition to the sequence and 1D range queries given as input to the problem, in the problem we are additionally given a coloring of using the color set . Each element has a color chosen from (the corresponding 2D point also has the same color). For all , let denote the set of points with color and for any , with a slight abuse of notation, we will use to denote the length of the of . Moreover, we say a subsequence is monochromatic, if all the elements in the subsequence have the same color. For each , the goal is then to report the longest monochromatic increasing subsequence whose length is equal to . See Figure 1(b) for an example in 2D. We obtain the following result.
Theorem 2.
There is a randomized algorithm to answer the reporting version of the problem in time, where is the cumulative length of the output subsequences. The bound on the running time and the correctness of the solution holds with high probability.
We complement the above result with a conditional lower bound that indicates that the above runtime for might be (near) optimal.
Theorem 3.
Assuming the Combinatorial Boolean Matrix Multiplication Hypothesis, for every and , there is no combinatorial algorithm to answer the reporting version of the problem in time, where is the cumulative size of the outputs. The lower bound continues to hold even when we are only required to report the color of the longest monochromatic increasing subsequence for each range query.
The Combinatorial Boolean Matrix Multiplication Hypothesis () roughly asserts that Boolean matrix multiplication of two matrices cannot be computed by any combinatorial algorithm running in time , for any constant . Here โcombinatorial algorithmโ is typically defined as โnon-Strassen-like algorithmโ (as defined in [12]), and this captures all known fast matrix multiplication algorithms.
While is widely explored since the 1990s [70, 48], there is no strong consensus in the computer science community about its plausibility. In a recent breakthrough, Abboud et al. [1], have come tantalizingly close to even refuting it. All that said, remains a widely used hypothesis for proving conditional lower bounds [66, 2, 35]. At the very least, Theorem 3 provides evidence that any algorithm for that is significantly faster than the runtime given in Theorem 2, must be highly non-trivial.
1.4 Problem-III: Colored 2D range problem
The most general problem that we study in this paper is the colored 2D range longest increasing subsequence problem (). The queries in will be axis-aligned rectangles. For each , the goal is then to report the longest monochromatic increasing subsequence which attains . See Figure 1(b) for an example. We obtain the following result.
Theorem 4.
There is a randomized algorithm to answer the reporting version of the problem in time, where is the cumulative length of the output subsequences. The bound on the running time and the correctness of the solution holds with high probability.
We remark that the conditional lower bound of Theorem 3 continues to hold here too.
In Table 1, we have summarized the state-of-the-art upper and (conditional) lower bounds for the various variants of discussed in this paper.
| Problem | Upper Bound | Lower Bound |
|---|---|---|
| 1D-Range LIS | ||
| (Theorem 5.13 in [29]) | (Trivial) | |
| 2D-Range LIS | ||
| (Theorem 1) | (Trivial) | |
| Colored 1D-Range LIS | ||
| (Theorem 2) | (Theorem 3) | |
| Colored 2D-Range LIS | ||
| (Theorem 4) | (Theorem 3) |
1.5 Data Structure Counterpart of the Three Variants of Problem
A natural data structure counterpart question to the aforementioned four problems is to design a data structure for these problems and measure the tradeoff in the time needed for preprocessing the input sequence versus the time needed to answer a single range query. In Table 2, we have summarized the preprocessing and query time bounds that can be derived from the proofs of Theorems 1, 2, and 4.
| Problem | Preprocessing Time | Query time | Reference |
|---|---|---|---|
| 1D-Range LIS | [29] | ||
| 2D-Range LIS | This Paper | ||
| Colored 1D-Range LIS | This Paper | ||
| Colored 2D-Range LIS | This Paper |
Finally, we note that the proof of Theorem 3 also implies that assuming , for every , there is no data structure for the reporting version of the problem (or the problem) which can be preprocessed using a combinatorial algorithm in time such that each query can be answered using a combinatorial algorithm in time.
1.6 Related Works
in popular settings and models
To the best of our knowledge, there is not much prior work on the reporting version of . The results of Tiskin [74, 76, 75] on the length version of were later adapted to the reporting version in [29], subsequent to our work. Therefore, for the rest of this subsection, we list works on the problem in some popular settings/models.
In the Dynamic problem we need to maintain the length of of an array under insertion and deletion. The problem in the dynamic setting was initiated by Mitzenmacher and Seddighin [53]. In [31], an algorithm running in time and providing -approximation was presented (this approximation algorithm can be adapted to approximate the problem in the dynamic setting), and in [43] a randomized exact algorithm with the update and query time was provided. Finally, in [30], the authors provide conditional polynomial lower bounds for exactly solving in the dynamic setting.
In the streaming model, computing the requires bits of space [32], so it is natural to resort to approximation algorithms. In [32] is a deterministic -approximation in space for problem, and this was shown to be optimal [26, 28]. We remark here that the problem has also been implicitly studied in the streaming algorithms literature as estimating the sortedness of an array [7, 50, 32, 72].
In the setting of sublinear time algorithms, the authors of [68] showed how to approximate the length of the to within an additive error of , for an arbitrary , with queries. Denoting the length of by , in [67] the authors designed a non-adaptive -multiplicative factor approximation algorithm, where , with queries (and also obtained different tradeoff between the dependency on and ). In [58], the authors proved that adaptivity is essential in obtaining polylogarithmic query complexity for the problem. Recently, for any , in [10] a (randomized) non-adaptive -multiplicative factor approximation algorithm was provided with running time.
In the read-only random access model, the authors of [42] showed how to find the length of in time and only space, for any parameter .
Range-aggregate queries
Range-aggregate queries have traditionally been well-studied in the database community and the computational geometry community. See the survey papers [5, 65, 33]. In a typical range-aggregate query, the input is a collection of points in a -dimensional space, and the user specifies a range query (such as an axis-aligned rectangle, or a disk, or a halfspace), and the goal is to compute an informative summary of the subset of points which lie inside the range query, such as the top- points [15, 63, 64, 4, 17], a random subset of points [73, 36], reporting or counting colors or groups [25, 22, 61], statistics such as mode, min [16], median [18], or sum, the closest pair of points [77, 78, 79], and the skyline points [19, 20, 62].
In an orthogonal range-max problem, the input is a set of points in -dimensional space. Each point in has a weight associated with it. Preprocess into a data structure, so that given an axis-parallel box , the goal is to report the point in with the largest weight. In this work, we will use vanilla range trees from the textbook [13] to answer range-max queries (since the goal of this work is not to shave polylogarithmic factors). The preprocessing time to build a vanilla range tree is and the query time is .
2 Our first technique: Handling small
We design a general technique to obtain all our upper bounds. We note that throughout the paper, there is ample scope to shave logarithmic factors in the runtime of our various algorithms and subroutines. However, that is not the focus of this paper.
For the sake of simplicity, we provide an overview of the general technique for the simplest case of which is in 1D and does not involve colors. Our strategy to solve the problem is the following. Consider an integer parameter . For the problem, is set to . We will design two different techniques. The first technique is deterministic and reports the correct answer if . If , then correctness is not guaranteed. The second technique is randomized and reports the correct answer with high probability if . As such, for each , w.h.p., the correct result is obtained by one of the two algorithms.
In this subsection, we will give an overview of the first technique for . In our discussion we will use the terms element and point interchangeably. As discussed before, the element in sequence is equivalent to the 2D point in . Consider the special case where there is a vertical line with -coordinate such that each query range in contains (in the full version of the paper we show how to remove the assumption).
2.1 Idea-1: Lowest peaks and highest bases
Please refer to Figure 2(i) where eighteen points are shown. Chain and chain are increasing sequences to the left of and both have length four. Our first observation is that if we are allowed to store either or , then it would be wiser to store . The reason is that can โstitchโ itself with chain and , to form an increasing subsequence of length six and eight, respectively. On the other hand, cannot stitch itself with or to form a larger increasing subsequence. Analogously, between chains and , we will prefer since it can stitch with , whereas cannot stitch with any chain to the left of . In fact, the of the eighteen points is realised by the chain with length eight. This motivates the definition of lowest peak and highest base.
For an increasing subsequence , we define peak (resp., base) to be the last (resp., first) element in . The algorithm will store two -dimensional arrays:
- Array :
-
For all integers and for all , among all increasing subsequences of length in the range , let be the subsequence with the lowest peak. Then stores the value of the last element in .
- Array :
-
For all integers and for all , among all increasing sequences of length in the range , let be the subsequence with the highest base. Then stores the value of the first element in .
For example, in Figure 2(i), for the range and , the chain is the subsequence with the lowest peak; and for and , chain is the subsequence with the highest base. Efficient computation of the entries in arrays and will be discussed later.
A pair is defined to be compatible if , i.e., it is possible to โstitchโ the sequences corresponding to and to obtain an increasing sequence of length . For a query range , our goal is to find the compatible pair which maximizes the value of , i.e.,
2.2 Idea-2: Monotone property of peaks and bases
For a given query , a naive approach is to inspect all pairs of the form , for all , and then pick a compatible pair which maximizes the value of . The number of pairs inspected is which we cannot afford. However, the following monotone property will help in reducing the number of pairs inspected to :
-
For a fixed value of , the value of increases as the value of increases, i.e., for any , we have .
-
Analogously, for a fixed value of , the value of decreases as the value of increases, i.e., for any , we have .
See Figure 2(ii) for an example where values are increasing with increase in and values are decreasing with increase in . For each , the largest for which is compatible can be found in time by doing a binary search on the monotone sequence . Finally, report as the length of the for . Overall, the time taken to answer the query ranges will be .
2.3 Idea-3: Pivot points and dynamic programming
Now the key task is to design a preprocessing algorithm which can quickly compute each entry in the two-dimensional arrays and . We will look at and an analogous discussion will hold for . Assume that we have computed the value of and would like to next compute . Let be the point with -coordinate which is encountered when we move the vertical line from to .
Formally, we claim that the value of is higher than if and only if there is an increasing subsequence with the following three properties: (i) the subsequence length is , (ii) the base of the subsequence is higher than , and (iii) the last point in the subsequence is . Please refer to Figure 3 for an example.
We define to be a pivot point. Now to efficiently compute , we will need the support of another two-dimensional array defined as follows: among all increasing subsequences of length in the range containing pivot point as the last element, let be the subsequence with the highest base. Then is defined to be the value of the first element in . In Figure 3(a) and Figure 3(b), is realized by the -coordinate of and , respectively. As a result, we can succinctly encode the three properties stated above as follows:
In other words, an increasing sequence of length in the interval with the highest base will either contain the pivot element or not. The first term (i.e., ) captures the case where is not included, whereas the second term (i.e., ) captures the case where is included. As such, the task of efficiently computing has been reduced to the task of efficiently computing .
2.4 Idea-4: 2D range-max data structure
The final step is to compute each entry in in polylogarithmic time. The entries will be computed in increasing values of . Assume that the entries corresponding to sequences of length at most have been computed. Then the entries โs can be reduced to entries of โs as follows:
We will efficiently compute โs by constructing a data structure which can answer 2D range-max queries. In a 2D range-max query, we are given weighted points in 2D. Each point is associated with a weight. For each point , given a query region of the form , the goal is to report the point in with the maximum weight. The 2D range-max problem can be solved in time (via reduction to the so-called 2D orthogonal point location problem [69]).
Now with each point , we associate the weight and build a 2D range-max data structure. For each point , the point in with the maximum weight is reported. If point is reported, then set . The correctness follows from the third case in the above dynamic programming. Since , the time taken to construct is .
Therefore, the overall time taken by this technique to answer will be by setting .
2.5 Idea-5: Generalizing and to handle
We briefly describe the challenge arising while handling where the queries are axis-parallel rectangles. The definition of highest bases and lowest peak fail to be directly useful. For example, see Figure 4, where is equal to five, and is realized by stitching and . However, will correspond to the sequence and will correspond to the sequence , both of which lie completely outside .
The key observation in the 2D setting is that for an increasing subsequence and a query rectangle , if the first and the last point of lie inside , then all the points of lie inside . Accordingly, we generalize the definitions of and . Specifically, in the 2D setting, takes two arguments, the query and an integer , and is defined as follows:
For example, in Figure 4, the new definition of ensures that the sequence gets captured instead of . We construct analogously. Note that in the 1D setting, and were explicitly computed. However, in the 2D setting, the number of combinatorially different axis-parallel rectangles is significantly more than our time budget. Therefore, and cannot be precomputed and stored. Instead, for all (resp., ), we will compute (resp., ) during the query algorithm in time.
3 Our second technique: Handling large
Now we give an overview of our second technique for the problem (which we show in the full version of the paper that it also extends to the problem). If , then this technique will report the answer correctly with high probability.
3.1 Idea-1: Stitching elements
One challenge with the problem is that it is not decomposable. Consider a query range and an such that . Let and such that . Then the of need not be the concatenation of the of and the of , since the rightmost point in the of might have a larger value than the leftmost point in the of .
However, suppose an oracle reveals that point lies in the of range . Then we claim that the problem becomes decomposable. We will define some notation before proceeding. For a point corresponding to an element , define the north-east region of as . Analogously, define the north-west, the south-west and the south-east region of which are denoted by , respectively. Then it is easy to observe that,
| (1) |
See Figure 5 for an example. In other words, knowing that belongs to the decomposes the original problem into two sub-problems which can be computed independently. In such a case, we refer to point as a stitching element, which is formally defined below.
Definition 5 (Stitching element).
For a range , fix any of and call it . Then each element in is defined to be a stitching element w.r.t. .
The goal of our algorithm is to:
Construct a small-sized set such that, for any , at least one stitching element w.r.t. is contained in .
For each , in the preprocessing phase, the two terms on the right hand side of equation 1 can be computed in time for all possible queries. Therefore, the preprocessing time will be .
3.2 Idea-2: Random sampling
By using the fact that , it is possible to construct a set of size roughly via random sampling. For each , sample independently with probability , where is a sufficiently large constant. Let be the set of sampled points. Then we can establish the following connection between and the stitching elements.
Lemma 6.
For all ranges such that , if is one of the of , then with high probability .
This ensures that the preprocessing time will be . To answer any range , first scan to identify the points which lie inside . For each element , compute the right hand side of equation 1 in time. Finally, report the largest value computed. Therefore, the query time is bounded by .
4 Our third technique: Handling small cardinality colors
4.1 A naive approach
The problem is more challenging than the problem. Firstly, the result of Theorem 1 does not really help in answering since it completely ignores the information about the colors. Secondly, there might be a temptation to solve the problem โindependentlyโ for each color class. For example, consider a setting where each color class has roughly equal number of points, i.e., points. Consider a color and build an instance of Theorem 1 based on the points of color . Then, for each query , we have the value of . Repeat this for each color in . Finally, for each query , report the color for which is maximized.
Now lets (informally) analyze this algorithm. The running time bound of Theorem 1 has three terms. For now, lets ignore the second term and the third term, which represent the time taken to build the data structure and the time taken to report the output of the queries. The first term is the time taken to answer queries (which is roughly time per query). Adding the first term for each color class, we get
which is when and when , for any . Therefore, in the worst-case this approach does not help achieve our target query time bound.
4.2 A technique to handle small cardinality colors
We will design a third technique which will be helpful to answer and . The key idea is to handle all the colors in a โcombinedโ manner. The technique will work well when all the colors have small cardinality. Given a parameter , assume that for each color , the value of . The key observations made by the algorithm are the following:
-
Bounding the number of output subsequences. For a query , let the output be of color . Let (resp., ) be the first (resp., last) point on . Call this a pair . As such, whenever color is the output, the number of distinct pairs will be . Adding up over all the colors, the total number of distinct pairs will be:
If , then this is a significant improvement over the naive bound of distinct pairs (or distinct output subsequences).
-
Reduction to an uncolored problem. For each possible output subsequence of color with (resp., ) as the first (resp., last) point point on , we construct an axis-aligned rectangle with (resp., ) as the bottom-left (resp., top-right) corner. A weight is associated with rectangle which is equal to . See Figure 6. Let be the collection of such weighted rectangles. As a result, the problem of is now reduced to the rectangle range-max problem, where given an axis-parallel rectangle , among all the rectangles in which lie completely inside , the goal is to report the rectangle with the largest weight.
It is crucial to note that the colored problem has been reduced to a problem on uncolored rectangles for which an efficient data structure exists: the rectangle range-max data structure can be constructed in time and the range queries can be answered in time.
5 Putting all the pieces together
As an illustration we will consider the problem and put together all the three techniques discussed in the previous subsections in a specific manner.
5.1 Light and heavy colors
Define a parameter which will be set later. A color is classified as light if , otherwise, it is classified as heavy. We will design different algorithms to handle light colors and heavy colors. The advantage with a light color, say , is that we can precompute the for 2D axis-parallel ranges and still be within the time budget. On the other hand, the advantage with heavy colors is that there can be only heavy colors.
5.2 Handling light colors
Let be the set of points which belong to a light color. We will use the third technique to answer on and queries . It follows that the running time is (see the full version of the paper for details).
5.3 Handling heavy colors and small queries
Let be the set of points which belong to a heavy color. In the full version of the paper we will prove that for an arbitrary value of , the running time of the first technique will be . The first technique as described above only handles uncolored points. In the full version of the paper we will โgeneralizeโ the first technique to answer problem as well. The running time of this generalized first technique on colored pointset and queries will be , where the number of heavy colors is .
5.4 Handling heavy colors and large queries
In the full version of the paper we will prove that for an arbitrary , the running time of the second technique will be for problem. In fact, we will generalize this algorithm to the large case of with the same running time (ignoring polylogarithmic factors). The generalized algorithm will answer on and queries .
Combining all the three subroutines, the total running time will be:
We set the parameters and in the above expression to obtain a running time of .
6 Open Problems
We conclude the paper with a few open problems.
-
Is it possible to extend the algorithm in [29] for the problem to solve the problem in time? Or, is there a (conditional) hardness result which makes obtaining the upper bound unlikely?
-
Another interesting research direction is to design a deterministic algorithm which runs in sub-quadratic time. Specifically, can the construction of stitching set be efficiently derandomized?
-
For the problem, can we bridge the gap between the upper bound and the (conditional) lower bound. We conjecture that the upper bound can be further improved.
-
Can we extend our algorithmic technique to beat the quadratic barrier for the weighted version of ?
-
Finally, it would be interesting to explore the in the dynamic model.
References
- [1] Amir Abboud, Nick Fischer, Zander Kelley, Shachar Lovett, and Raghu Meka. New graph decompositions and combinatorial boolean matrix multiplication algorithms. In STOC, 2024. To appear.
- [2] Amir Abboud and Virginia Vassilevska Williams. Popular conjectures imply strong lower bounds for dynamic problems. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 434โ443. IEEE, 2014. doi:10.1109/FOCS.2014.53.
- [3] Peyman Afshani, Lars Arge, and Kasper Dalgaard Larsen. Orthogonal range reporting in three and higher dimensions. In Proceedings of Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 149โ158, 2009. doi:10.1109/FOCS.2009.58.
- [4] Peyman Afshani, Gerth Stolting Brodal, and Norbert Zeh. Ordered and unordered top-k range reporting in large data sets. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 390โ400, 2011. doi:10.1137/1.9781611973082.31.
- [5] Pankaj K. Agarwal and Jeff Erickson. Geometric range searching and its relatives. Advances in Discrete and Computational Geometry, pages 1โ56, 1998.
- [6] Pankaj K. Agarwal, Sathish Govindarajan, and S. Muthukrishnan. Range searching in categorical data: Colored range searching on grid. In Proceedings of European Symposium on Algorithms (ESA), pages 17โ28, 2002. doi:10.1007/3-540-45749-6_6.
- [7] Miklรณs Ajtai, TS Jayram, Ravi Kumar, and D Sivakumar. Approximate counting of inversions in a data stream. In Proceedings of ACM Symposium on Theory of Computing (STOC), pages 370โ379, 2002.
- [8] Michael H Albert, Alexander Golynski, Angรจle M Hamel, Alejandro Lรณpez-Ortiz, S Srinivasa Rao, and Mohammad Ali Safari. Longest increasing subsequences in sliding windows. Theoretical Computer Science, 321(2-3):405โ414, 2004. doi:10.1016/J.TCS.2004.03.057.
- [9] Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. Basic local alignment search tool. Journal of molecular biology, 215(3):403โ410, 1990.
- [10] Alexandr Andoni, Negev Shekel Nosatzki, Sandip Sinha, and Clifford Stein. Estimating the longest increasing subsequence in nearly optimal time. In Proceedings of Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 708โ719, 2022. doi:10.1109/FOCS54457.2022.00073.
- [11] Lars Arge. The buffer tree: A technique for designing batched external data structures. Algorithmica, 37(1):1โ24, 2003. doi:10.1007/S00453-003-1021-X.
- [12] Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. Graph expansion and communication costs of fast matrix multiplication. Journal of the ACM (JACM), 59(6):1โ23, 2013. doi:10.1145/2395116.2395121.
- [13] Mark de Berg, Otfried Cheong, Marc van Kreveld, and Mark Overmars. Computational Geometry: Algorithms and Applications. Springer-Verlag, 3rd edition, 2008.
- [14] Panayiotis Bozanis, Nectarios Kitsios, Christos Makris, and Athanasios K. Tsakalidis. New upper bounds for generalized intersection searching problems. In Proceedings of International Colloquium on Automata, Languages and Programming (ICALP), pages 464โ474, 1995. doi:10.1007/3-540-60084-1_97.
- [15] Gerth Stolting Brodal. External memory three-sided range reporting and top- queries with sublogarithmic updates. In Proceedings of Symposium on Theoretical Aspects of Computer Science (STACS), volume 47, pages 23:1โ23:14. 2016. doi:10.4230/LIPIcs.STACS.2016.23.
- [16] Gerth Stรธlting Brodal, Pooya Davoodi, and S. Srinivasa Rao. On space efficient two dimensional range minimum data structures. In Proceedings of European Symposium on Algorithms (ESA), pages 171โ182, 2010. doi:10.1007/978-3-642-15781-3_15.
- [17] Gerth Stรธlting Brodal, Rolf Fagerberg, Mark Greve, and Alejandro Lopez-Ortiz. Online sorted range reporting. In International Symposium on Algorithms and Computation (ISAAC), pages 173โ182, 2009. doi:10.1007/978-3-642-10631-6_19.
- [18] Gerth Stรธlting Brodal, Beat Gfeller, Allan Gronlund Jorgensen, and Peter Sanders. Towards optimal range medians. Theoretical Computer Science, 412(24):2588โ2601, 2011. doi:10.1016/J.TCS.2010.05.003.
- [19] Gerth Stรธlting Brodal and Kasper Green Larsen. Optimal planar orthogonal skyline counting queries. In Scandinavian Symposium and Workshops on Algorithm Theory (SWAT), pages 110โ121, 2014. doi:10.1007/978-3-319-08404-6_10.
- [20] Gerth Stolting Brodal and Konstantinos Tsakalidis. Dynamic planar range maxima queries. In Proceedings of International Colloquium on Automata, Languages and Programming (ICALP), pages 256โ267, 2011.
- [21] Timothy M Chan, Stephane Durocher, Kasper Green Larsen, Jason Morrison, and Bryan T Wilkinson. Linear-space data structures for range mode query in arrays. Theory of Computing Systems, 55(4):719โ741, 2014. doi:10.1007/S00224-013-9455-2.
- [22] Timothy M. Chan, Qizheng He, and Yakov Nekrich. Further results on colored range searching. In International Symposium on Computational Geometry (SoCG), pages 28:1โ28:15, 2020. doi:10.4230/LIPIcs.SOCG.2020.28.
- [23] Timothy M. Chan and Zhengcheng Huang. Dynamic colored orthogonal range searching. In Proceedings of European Symposium on Algorithms (ESA), volume 204, pages 28:1โ28:13, 2021. doi:10.4230/LIPIcs.ESA.2021.28.
- [24] Timothy M. Chan, Kasper Green Larsen, and Mihai Patrascu. Orthogonal range searching on the ram, revisited. In Proceedings of Symposium on Computational Geometry (SoCG), pages 1โ10, 2011. doi:10.1145/1998196.1998198.
- [25] Timothy M. Chan and Yakov Nekrich. Better data structures for colored orthogonal range reporting. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 627โ636, 2020. doi:10.1137/1.9781611975994.38.
- [26] Funda Ergun and Hossein Jowhari. On distance to monotonicity and longest increasing subsequence of a data stream. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 730โ736, 2008.
- [27] Michael L Fredman. On computing the length of longest increasing subsequences. Discrete Mathematics, 11(1):29โ35, 1975. doi:10.1016/0012-365X(75)90103-X.
- [28] Anna Gรกl and Parikshit Gopalan. Lower bounds on streaming algorithms for approximating the length of the longest increasing subsequence. SIAM Journal on Computing, 39(8):3463โ3479, 2010. doi:10.1137/090770801.
- [29] Pawel Gawrychowski, Egor Gorbachev, and Tomasz Kociumaka. Core-sparse monge matrix multiplication: Improved algorithm and applications. CoRR, abs/2408.04613, 2024. doi:10.48550/arXiv.2408.04613.
- [30] Paweล Gawrychowski and Wojciech Janczewski. Conditional lower bounds for variants of dynamic LIS. arXiv preprint arXiv:2102.11797, 2021.
- [31] Pawel Gawrychowski and Wojciech Janczewski. Fully dynamic approximation of LIS in polylogarithmic time. In Proceedings of ACM Symposium on Theory of Computing (STOC), pages 654โ667, 2021. doi:10.1145/3406325.3451137.
- [32] Parikshit Gopalan, T. S. Jayram, Robert Krauthgamer, and Ravi Kumar. Estimating the sortedness of a data stream. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 318โ327, 2007. URL: http://dl.acm.org/citation.cfm?id=1283383.1283417.
- [33] Prosenjit Gupta, Ravi Janardan, Saladi Rahul, and Michiel H. M. Smid. Computational geometry: Generalized (or colored) intersection searching. In Handbook of Data Structures and Applications, CRC Press, 2nd edition, pages 1042โ1057, 2018.
- [34] Prosenjit Gupta, Ravi Janardan, and Michiel H. M. Smid. Further results on generalized intersection searching problems: Counting, reporting, and dynamization. Journal of Algorithms, 19(2):282โ317, 1995. doi:10.1006/JAGM.1995.1038.
- [35] Monika Henzinger, Sebastian Krinninger, Danupon Nanongkai, and Thatchaphol Saranurak. Unifying and strengthening hardness for dynamic problems via the online matrix-vector multiplication conjecture. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 21โ30, 2015. doi:10.1145/2746539.2746609.
- [36] Xiaocheng Hu, Miao Qiao, and Yufei Tao. Independent range sampling. In Proceedings of ACM Symposium on Principles of Database Systems (PODS), pages 246โ255, 2014. doi:10.1145/2594538.2594545.
- [37] Ravi Janardan and Mario A. Lopez. Generalized intersection searching problems. International Journal of Computational Geometry and Applications, 3(1):39โ69, 1993. doi:10.1142/S021819599300004X.
- [38] Ruoming Jin, Scott McCallen, and Eivind Almaas. Trend motif: A graph mining approach for analysis of dynamic complex networks. In Proceedings of International Conference on Management of Data (ICDM), pages 541โ546, 2007. doi:10.1109/ICDM.2007.92.
- [39] Haim Kaplan, Natan Rubin, Micha Sharir, and Elad Verbin. Counting colors in boxes. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 785โ794, 2007. URL: http://dl.acm.org/citation.cfm?id=1283383.1283467.
- [40] Haim Kaplan, Micha Sharir, and Elad Verbin. Colored intersection searching via sparse rectangular matrix multiplication. In Proceedings of Symposium on Computational Geometry (SoCG), pages 52โ60, 2006. doi:10.1145/1137856.1137866.
- [41] Marek Karpinski and Yakov Nekrich. Top-k color queries for document retrieval. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 401โ411, 2011. doi:10.1137/1.9781611973082.32.
- [42] Masashi Kiyomi, Hirotaka Ono, Yota Otachi, Pascal Schweitzer, and Jun Tarui. Space-efficient algorithms for longest increasing subsequence. In Proceedings of Symposium on Theoretical Aspects of Computer Science (STACS), 2018.
- [43] Tomasz Kociumaka and Saeed Seddighin. Improved dynamic algorithms for longest increasing subsequence. In Proceedings of ACM Symposium on Theory of Computing (STOC), pages 640โ653, 2021. doi:10.1145/3406325.3451026.
- [44] Danny Krizanc, Pat Morin, and Michiel H. M. Smid. Range mode and range median queries on lists and trees. Nordic Journal of Computing, 12(1):1โ17, 2005.
- [45] Ying Kit Lai, Chung Keung Poon, and Benyun Shi. Approximate colored range and point enclosure queries. Journal of Discrete Algorithms, 6(3):420โ432, 2008. doi:10.1016/J.JDA.2007.10.001.
- [46] Kasper Green Larsen and Rasmus Pagh. I/O-efficient data structures for colored range and prefix reporting. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 583โ592, 2012. doi:10.1137/1.9781611973099.49.
- [47] Kasper Green Larsen and Freek van Walderveen. Near-optimal range reporting structures for categorical data. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 265โ276, 2013. doi:10.1137/1.9781611973105.20.
- [48] Lillian Lee. Fast context-free grammar parsing requires fast boolean matrix multiplication. Journal of the ACM (JACM), 49(1):1โ15, 2002. doi:10.1145/505241.505242.
- [49] Youhuan Li, Lei Zou, Huaming Zhang, and Dongyan Zhao. Longest increasing subsequence computation over streaming sequences. IEEE Transactions on Knowledge and Data Engineering (TKDE), 30(6):1036โ1049, 2017. doi:10.1109/TKDE.2017.2761345.
- [50] David Liben-Nowell, Erik Vee, and An Zhu. Finding longest increasing and common subsequences in streaming data. Journal of Combinatorial Optimization, 11:155โ175, 2006. doi:10.1007/S10878-006-7125-X.
- [51] Colin L Mallows. Patience sorting. SIAM Review, 4(2):148โ149, 1962.
- [52] Colin L Mallows. Patience sorting. SIAM Review, 5(4):375, 1963.
- [53] Michael Mitzenmacher and Saeed Seddighin. Dynamic algorithms for LIS and distance to monotonicity. In Proceedings of ACM Symposium on Theory of Computing (STOC), pages 671โ684, 2020. doi:10.1145/3357713.3384240.
- [54] S. Muthukrishnan. Efficient algorithms for document retrieval problems. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 657โ666, 2002. URL: http://dl.acm.org/citation.cfm?id=545381.545469.
- [55] Yakov Nekrich. Efficient range searching for categorical and plain data. ACM Transactions on Database Systems (TODS), 39(1):9, 2014.
- [56] Yakov Nekrich and Saladi Rahul. 4d range reporting in the pointer machine model in almost-optimal time. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1862โ1876, 2023. doi:10.1137/1.9781611977554.CH71.
- [57] Yakov Nekrich and Jeffrey Scott Vitter. Optimal color range reporting in one dimension. In Proceedings of European Symposium on Algorithms (ESA), pages 743โ754, 2013. doi:10.1007/978-3-642-40450-4_63.
- [58] Ilan Newman and Nithin Varma. New sublinear algorithms and lower bounds for LIS estimation. In Proceedings of International Colloquium on Automata, Languages and Programming (ICALP), pages 100:1โ100:20, 2021. doi:10.4230/LIPIcs.ICALP.2021.100.
- [59] Manish Patil, Sharma V. Thankachan, Rahul Shah, Yakov Nekrich, and Jeffrey Scott Vitter. Categorical range maxima queries. In Proceedings of ACM Symposium on Principles of Database Systems (PODS), pages 266โ277, 2014. doi:10.1145/2594538.2594557.
- [60] Saladi Rahul. Approximate range counting revisited. In 33rd International Symposium on Computational Geometry (SoCG), volume 77, pages 55:1โ55:15, 2017. doi:10.4230/LIPIcs.SOCG.2017.55.
- [61] Saladi Rahul. Approximate range counting revisited. Journal of Computational Geometry, 12(1):40โ69, 2021. doi:10.20382/JOCG.V12I1A3.
- [62] Saladi Rahul and Ravi Janardan. Algorithms for range-skyline queries. In Proceedings of ACM Symposium on Advances in Geographic Information Systems (GIS), pages 526โ529, 2012. doi:10.1145/2424321.2424406.
- [63] Saladi Rahul and Yufei Tao. On top-k range reporting in 2d space. In Proceedings of ACM Symposium on Principles of Database Systems (PODS), pages 265โ275, 2015. doi:10.1145/2745754.2745777.
- [64] Saladi Rahul and Yufei Tao. Efficient top-k indexing via general reductions. In Proceedings of ACM Symposium on Principles of Database Systems (PODS), pages 277โ288, 2016. doi:10.1145/2902251.2902290.
- [65] Saladi Rahul and Yufei Tao. A guide to designing top-k indexes. SIGMOD Record, 48(2):6โ17, 2019. doi:10.1145/3377330.3377332.
- [66] Liam Roditty and Uri Zwick. On dynamic shortest paths problems. Algorithmica, 61:389โ401, 2011. doi:10.1007/S00453-010-9401-5.
- [67] Aviad Rubinstein, Saeed Seddighin, Zhao Song, and Xiaorui Sun. Approximation algorithms for lcs and LIS with truly improved running times. In Proceedings of Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 1121โ1145, 2019. doi:10.1109/FOCS.2019.00071.
- [68] Michael Saks and C Seshadhri. Estimating the longest increasing sequence in polylogarithmic time. SIAM Journal of Computing, 46(2):774โ823, 2017. doi:10.1137/130942152.
- [69] Neil Sarnak and Robert Endre Tarjan. Planar point location using persistent search trees. Communications of the ACM (CACM), 29(7):669โ679, 1986. doi:10.1145/6138.6151.
- [70] Giorgio Satta. Tree-adjoining grammar parsing and boolean matrix multiplication. Computational linguistics, 20(2):173โ191, 1994.
- [71] Qingmin Shi and Joseph JรกJรก. Optimal and near-optimal algorithms for generalized intersection reporting on pointer machines. Information Processing Letters (IPL), 95(3):382โ388, 2005. doi:10.1016/J.IPL.2005.04.008.
- [72] Xiaoming Sun and David P Woodruff. The communication and streaming complexity of computing the longest common and increasing subsequences. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 336โ345, 2007. URL: http://dl.acm.org/citation.cfm?id=1283383.1283419.
- [73] Yufei Tao. Algorithmic techniques for independent query sampling. In Proceedings of ACM Symposium on Principles of Database Systems (PODS), pages 129โ138, 2022. doi:10.1145/3517804.3526068.
- [74] Alexander Tiskin. Semi-local longest common subsequences in subquadratic time. Journal of Discrete Algorithms, 6(4):570โ581, 2008. doi:10.1016/J.JDA.2008.07.001.
- [75] Alexander Tiskin. Fast distance multiplication of unit-monge matrices. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1287โ1296, 2010. doi:10.1137/1.9781611973075.103.
- [76] Alexandre Tiskin. Semi-local string comparison: Algorithmic techniques and applications. Math. Comput. Sci., 1(4):571โ603, 2008. doi:10.1007/S11786-007-0033-3.
- [77] Jie Xue. Colored range closest-pair problem under general distance functions. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 373โ390, 2019. doi:10.1137/1.9781611975482.24.
- [78] Jie Xue, Yuan Li, Saladi Rahul, and Ravi Janardan. Searching for the closest-pair in a query translate. Journal of Computational Geometry, 11(2):26โ61, 2020. doi:10.20382/JOCG.V11I2A3.
- [79] Jie Xue, Yuan Li, Saladi Rahul, and Ravi Janardan. New bounds for range closest-pair problems. Discrete & Computational Geometry, 68(1):1โ49, 2022. doi:10.1007/S00454-022-00388-7.
- [80] Hongyu Zhang. Alignment of blast high-scoring segment pairs based on the longest increasing subsequence algorithm. Bioinformatics, 19(11):1391โ1396, 2003. doi:10.1093/BIOINFORMATICS/BTG168.
