Color Distance Oracles and Snippets: Separation Between Exact and Approximate Solutions
Abstract
In the snippets problem, the goal is to preprocess a text so that given two pattern queries, and , one can quickly locate the occurrences of the two patterns in that are closest to each other, or report the distance between these occurrences. Kopelowitz and Krauthgamer [CPM2016] showed upper bound tradeoffs and conditional lower bounds tradeoffs for the snippets problem, by utilizing connections between the snippets problem and the problem of constructing a color distance oracle (CDO), which is a data structure that preprocess a set of points with associated colors so that given two colors and one can quickly find the (distance between the) closest pair of points where one has color and the other has color . However, the existing upper bound and lower bound curves are not tight.
Inspired by recent advances by Kopelowitz and Vassilevska-Williams [ICALP2020] regarding tradeoff curves for Set-disjointness data structures, in this paper we introduce new conditionally optimal algorithms for a approximation version of the snippets problem and a approximation version of the CDO problem, by applying fast matrix multiplication. For example, for CDO on points in an array, if the preprocessing time is and the query time is then, assuming that (where is the exponent of in the runtime of the fastest matrix multiplication algorithm on two squared matrices of size ), we show that approximate CDO can be solved with the following tradeoff
Moreover, we prove that for exact CDO on points in an array, the algorithm of Kopelowitz and Krauthgamer [CPM2016], which obtains a tradeoff of , is essentially optimal assuming that the strong all-pairs shortest paths hypothesis holds for randomized algorithms. Thus, we demonstrate that the exact version of CDO is strictly harder than the approximate version. Moreover, this separation carries over to the snippets problem.
Keywords and phrases:
data structures, fast matrix multiplication, fine-grained complexity, pattern matching, distance oraclesCopyright and License:
2012 ACM Subject Classification:
Theory of computation Design and analysis of algorithmsFunding:
Supported by a BSF grant 2018364, and by an ERC grant MPM under the EU’s Horizon 2020 Research and Innovation Programme (grant no. 683064).Acknowledgements:
The authors thank an anonymous reviewer who suggested some ideas for tightening the lower bounds proved in Theorems 8 and 9.Editors:
Anne Benoit, Haim Kaplan, Sebastian Wild, and Grzegorz HermanSeries and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
In the snippets problem, introduced by Kopelowitz and Krauthgamer [17], the goal is to preprocess a text so that given two pattern queries, and , one can quickly locate the locations of the two patterns in that are closest to each other, or report the distance between these locations. The snippets problem is motivated by many common text indexing applications, such as searching a corpus of documents for two query keywords. Often, the relevance of a document to the query is measured by the proximity of the two keywords within the document. The term snippet is derived from search engines often providing each result with a snippet of text from the document with the two keywords close to each other.
Kopelowitz and Krauthgamer [17] designed a tradeoff algorithm for the snippets problem based on a connection to a problem on colored points in metric spaces, which we define next.
Colored points and color distances
Let be a metric space with distance function ). Let be a set of points where . Let be a set of colors. For sake of convenience we assume that . Each point has an associated color . For a color , let . When is clear from context we abuse notation and denote .
For sets of points and a point , let and let . We extend the definition of to inputs of colors as follows. For colors and point , let and let . We say that is the color distance between and . A natural problem is to construct a color distance oracle (CDO), which is defined as follows.
Problem 1 (The Color Distance Oracle (CDO) problem [17]).
Given a set of colored points with colors from , preprocess to support the following query: given , return .
In this paper we consider metrics defined by locations in an array of size , and so the metric is the set111Throughout this paper for a positive integer we denote . , and the distance function of two points and is .
Multi-colored points and color hierarchies
A natural generalization of a colored set is to allow each to be associated with a nonempty set of colors , and in such a case is said to be multi-colored. This version of the CDO problem is called the multi-color distance oracle (MCDO) problem. In general, representing costs space by explicitly listing all colors in . Thus, the input size is .
Nevertheless, there are interesting cases in which the lists of colors for each point are not required to be given explicitly. One such example, which we focus on, is when for every two colors , either one of the sets and contains the other, or the two sets are disjoint. In such a case, we say that has a color hierarchy with respect to (a formal terminology is that is a laminar family), represented by a rooted forest (i.e., each tree has a root) of size . Each color is associated with a vertex in , such that the descendants of are exactly all the vertices whose color satisfies . We convert the forest to a rooted tree by adding a dummy root vertex and making the dummy root the parent of all of the roots of the trees in the forest.
With the aid of , a multi-colored set that admits a color hierarchy can be represented using only machine words, because it suffices to store and associate with each point just one color (the color with the lowest corresponding vertex in ); the other colors of are implicit from (the colors on the path from to the root of ). Notice that implies that there are at least two colors with the same point set. Thus, we make the simplifying assumption that all colors have different point sets222If we happen to have several colors with the same point set, we ignore all of them but one during preprocessing, and use a straightforward mapping from the original coloring to the reduced coloring during query time., and so the input size is .
Thus, a natural extension of Problem 1 is the following.
Problem 2 (The Multi-Color Distance Oracle with a Color Hierarchy (MCDOCH) problem [17]).
Given a set of multi-colored points with colors from which form a color hierarchy with respect to , preprocess to support the following query: given , return .
Let and be real numbers such that the preprocessing and query times of a MCDOCH algorithm are 333Throughout this paper we use the notation to suppress sub-polynomial factors. and , respectively. A tradeoff algorithm for the MCDOCH problem was presented in [17], where given parameter , the query time is and the preprocessing time is . When ignoring sub-polynomial factors, this upper bound tradeoff is given by where and . Notice that the same tradeoff applies to the CDO problem.
In addition to the tradeoff algorithms, [17] proved a 3SUM based conditional lower bound (CLB) tradeoff444The CLB tradeoff stated here is straightforward to derive from the statement of Theorem 4 in [17]. of for the CDO problem (and hence for the MCDOCH problem), even when is defined by locations in an array. Their CLB holds also for approximate versions of the CDO problem (see Theorem 8 in [17]), where for a fixed , the answer to a color distance query between and is required to be between and .
Solving the Snippets problem
By designing a reduction from the snippets problem to the MCDOCH problem, [17] were able to design an algorithm for the snippets problem with essentially the same tradeoff: for a chosen parameter their snippets algorithm uses preprocessing time and answers queries in time. The main idea in the reduction is to construct the suffix tree of , assign a color to each vertex in the suffix tree, and for each leaf in the suffix tree, the corresponding location in is assigned all of the colors on the path from to the suffix tree root. It is straightforward to see that the colors define a hierarchy and that is exactly the suffix tree.
A complexity gap
The upper bound curve of and the CLB tradeoff of form a complexity gap for the snippets and (approximate) CDO problems, which was left open by [17]. Moreover, based on the combinatorial Boolean Matrix Multiplication (BMM) hypothesis, [17, Theorem 8] implies that any algorithm that beats the curve, even for approximate versions, must use non-combinatorial techniques, such as fast matrix multiplication (FMM) algorithms.
We remark that the CLB tradeoff given in [17] is based on a reduction from a SetDisjointness problem to the (approximate) CDO problem, and at the time [17] was published, the SetDisjointness problem was known to exhibit the same upper bound tradeoff and 3SUM based CLB tradeoff [18]. However, recently Kopelowitz and Vassilevska-Williams [19] closed this complexity gap by, among other ideas, using FMM techniques. Thus, a natural question to ask is whether FMM can assist in obtaining improved algorithms for CDO problems.
Our results
We close the complexity gap for the approximation versions of CDO and , where is defined by locations in an array. Formally, the problems are defined as follows.
Problem 3 (The -Approximate Color Distance Oracle (ACDO) problem on an array).
Let be a metric defined by locations in an array of size . Given a set of colored points with colors from and a fixed real , preprocess to support the following query: given two colors , return a value such that .
Problem 4 (The -Approximate Multi-Color Distance Oracle problem with a Color Hierarchy (AMCDOCH) on an array).
Let be a metric defined by locations in an array of size . Given a set of multi-colored points with colors from which form a color hierarchy with respect to , and a fixed real , preprocess to support the following query: given two colors , return a value such that .
Our first main result is a new tradeoff algorithm for ACDO, which is stated by the following theorem. Notice that the time complexities presented are dependent on , which is the exponent of in the runtime of the fastest FMM algorithm on two squared matrices of size . The currently best upper bound on is given by [2]. However, for clarity, we choose to express our tradeoffs in terms of . We remark that ACDO is a special case of AMCDOCH, and so Theorem 6 implies Theorem 5. However, Theorem 5 is used in the proof of Theorem 6, so we explicitly state both theorems.
Theorem 5.
For any real , there exists an ACDO algorithm with preprocessing time and query time where
Theorem 6.
For any real , there exists an AMCDOCH algorithm with preprocessing time and query time where
Combining Theorem 6 with the reduction of the snippets problem to the MCDOCH problem given in [17], we obtain the following tradeoff for a approximation version of the snippets problem.
Theorem 7.
For any fixed , and , there exists an algorithm that preprocesses a text in time such that given two pattern strings and where the distance between the closest occurrences of the two patterns in is , the algorithm returns a value such that in time, where
We remark that it is straightforward to adapt our ACDO and AMCDOCH algorithms (and hence our new approximate snippets algorithm) to return the two points (one of each color) that define the distance being returned (or the locations of the two patterns in the approximate snippets problem). For details, see the full version [12].
A complexity separation between exact and approximate solutions
We remark that by combining the reduction given in [17, Theorem 8] from SetDisjointness to the CDO problem with the CLBs for SetDisjointness given in [19, Theorem 6] for the case of (which some researchers believe should be obtainable), Theorems 5, 6 and 7 are all conditionally optimal555For Theorems 5 and 6, the optimality follows directly from applying [17, Theorem 8] and [19, Theorem 6]. For Theorem 7, the optimality follows since an approximate solution for the snippets problem can be used to solve ACDO on an array as follows: treat each color as a character, and then the array becomes a string. Preprocess the string using the snippets algorithm, so that given an ACDO query on two colors , query the snippets algorithm with patterns and . Thus, the ACDO lower bound applies to approximate snippets. (up to subpolynomial factors). A natural question to ask is whether one can design exact algorithms with the same tradeoff bounds. We provide evidence that this is not possible, based on a strong version of the popular all-pairs shortest paths (APSP) conjecture [7], thereby demonstrating a separation between CDO and ACDO.
The Strong-APSP hypothesis, introduced by Chan, Vassilevska-Williams and Xu [7], states that for a graph with vertices666We use to differentiate from in the various CDO problems., even if all of the edge weights are in777Actually, in [7] the weights are bounded by , but since we can bound the weights by . , APSP still does not have a truly subcubic algorithm. A closely related problem to APSP is the -matrix product problem, where the input is two by matrices and , and the output is the matrix where . Shoshan and Zwick [24] showed that solving APSP with weights is equivalent (in the sense of having the same runtime) to solving -matrix product with entries in . Thus, the Strong-APSP hypothesis implies that there is no truly subcubic time -matrix product algorithm even when all of the entries are in .
By reducing -matrix product with entries in to MCDO, we are able to prove the following CLB in Section 6, which matches the upper bound given in [17].
Theorem 8.
Assuming the Strong-APSP conjecture, any algorithm for MCDO on points in an array of size with preprocessing time and query time, respectively, must obey .
Moreover, we are able to leverage the ideas used in the proof of Theorem 8 to reduce -matrix product with entries in to CDO via a randomized reduction, thereby obtaining a CLB under the assumption that the Strong-APSP hypothesis holds even for randomized algorithms (either in expectation or with high probability).
Theorem 9.
Assuming the Strong-APSP hypothesis for randomized algorithms, any algorithm for CDO on points in an array of size with preprocessing time and query time, respectively, must obey .
The proof of Theorem 9 is based on the proof of Theorem 8, combined with probabilistic techniques in order to create instances that are not multi-colored. Due to space considerations, the proof is given in the full version [12]
Theorem 9 combined with Theorem 5 implies that the exact version of CDO is strictly harder than the approximate version. Moreover, we remark that the CLBs for CDO apply also to the snippets problem, since the special case of defined by locations in an array is captured by the snippets problem when and are each a single character. Thus, the CLB of Theorem 9 applies also to the snippets problem, and so the exact version of the snippets problem is also strictly harder than the approximate version.
Additional Related Work
A problem closely related to the CDO is the (approximate) vertex-labeled distance oracles for graphs (VLDO) problem, where the goal is to preprocess a colored graph with vertices, so that given a vertex query and a color , one can quickly return (an approximation of) . Hermelin, Levy, Weimann, and Yuster [11] introduced the VLDO problem and designed a solution using expected space, with stretch factor and query time. They also showed how to reduce the space usage to but the stretch factor is exponential in . Chechik [8] later showed how to lower the stretch back to . For planar graphs, Evald, Fredslund-Hansen, and Wulff-Nilsen [10] designed a near-optimal exact tradeoff using space with query time, or space with query time. Li, Ma, and Ning [21] designed a approximation for planar graphs.
2 Preliminaries and Algorithmic Overview
Let . Let be a set of integers. Let be the largest integer in and let be the smallest integer in . For integers where , let .
Our algorithms make use of the following Nearest Neighbor Search (NNS) data structures.
Problem 10 (The Nearest Neighbour Search problem [1, 4, 16, 25]).
Given a set of size , preprocess to support the following query: given an integer , return .
Problem 11 (The Range Nearest Neighbour Search (RNNS) problem [5, 6, 9, 14, 15, 20, 22, 23]).
Given a set of integers, preprocess to support the following query: given and an integer , return .
2.1 Algorithmic Overview
Generic (Approximate) CDO Algorithm
Our algorithm for ACDO is based on a generic algorithm whose structure follows the structure of the algorithm in [17] for the exact CDO problem. The generic algorithm is a classic heavy-light algorithm; our algorithms are a new implementation of the generic algorithm.
For an integer parameter , a color is said to be heavy if and light otherwise. Let be the set of heavy colors, and let be the set of light colors. Notice that . In the preprocessing phase, for each color , the algorithm stores in an NNS data structure, denoted by . In addition, the algorithm pre-computes a matrix of size , such that . An ACDO query on is processed as follows. If both , then, without loss of generality, and . In such a case the algorithm returns , which is a approximation of . Otherwise, without loss of generality, is a light color, and the algorithm returns , by performing NNS queries on , one for each point in .
Time complexity
The preprocessing and query time costs of the generic algorithm depend on the implementation of the NNS data structure, and the time used for computing matrix . Thus, we express the time complexity of the generic algorithm as a function of , , and , which are the preprocessing time and query time of the NNS data structure on a set of size , and the time used to compute , respectively. In the preprocessing phase, the algorithm computes and creates an NNS data structure for each color in . Thus, the preprocessing time cost is
In the query phase, the time cost is dominated by executing at most NNS queries on a set of size at most , and so the time cost is
Constructing and solving ACDO on an array
In Section 3 we describe an efficient algorithm for constructing when is the metric of integers. Our algorithm is structured as follows. For pairs of heavy colors whose color distance is smaller than some threshold, the algorithm computes their distance via a straightforward brute-force computation. The more challenging part is dealing with pairs of heavy colors with larger color distance. Our approach is to compute a series of Boolean matrices with specially designed properties, so that after the computation of the matrices we are able scan the matrices and deduce a approximation for every pair of colors with a rather large distance color. We remark that to compute the matrices used for the approximations, our algorithm makes use of FMM, which, as noted in Section 1, is required for obtaining improvements in runtime compared to the exact solution of [17].
Solving AMCDOCH on an array
In Section 5 we prove Theorem 6. Our algorithm is a reduction from AMCDOCH on an array to ACDO on an array. The structure of our reduction somewhat resembles the structure of the MCDOCH algorithm of [17]. In particular, in the algorithm of [17] they partition the points into sets of size , preprocess each set into an RNNS data structure, and for each pair of sets apply RNNS queries to compute the distance between the sets. To obtain a faster runtime, we compute the distance between every pair of sets by recoloring the input points using the sets as a new color scheme and then applying our ACDO algorithm with the new color set. One important property of our ACDO algorithm is that when a query takes place between two heavy colors, the query time is constant since the algorithm just looks up a single value in . It turns out that essentially all of the colors in the new color set are heavy, and so after the preprocessing phase of our ACDO algorithm, we are able to compute the distance between a pair of sets in time. This property turns out to be helpful for our AMCDOCH algorithm, which is described in Section 5.
3 Computing when is integers
In this section, we show how to compute matrix when our metric is one dimensional, and for we have . We make the simplifying assumption that since, otherwise, one could shift by subtracting from each point.
Let be a positive integer parameter that will be determined later. The construction algorithm for has two conceptual parts. The first part deals with pairs of heavy colors with distance at most , by computing the distances exactly using a brute-force method. The second part deals with pairs of heavy colors with distance greater than . In this case, the algorithm computes an approximation of the distances by utilizing a series of matrices defined as follows.
Let and let . Let be a Boolean matrix where
Thus, matrix indicates for each heavy color and each block of size in which starts at a location following an integer multiple of , whether the heavy color appears in the block or not.
For , let be a Boolean matrix where
Matrix indicates for each block of size in which starts at a location following an integer multiple of , and for each heavy color, whether the heavy color appears in the block or not.
Let , where the product is a Boolean matrix product. Thus, roughly indicates for each pair of heavy colors whether their color distance is at most .
In Section 3.1 we state and prove lemmas regarding the connections between the matrices and approximating color distances. In Section 3.2 we describe the algorithm for constructing , and prove its correctness based on the lemmas of Section 3.1.
3.1 Properties of The Matrices
The following lemma describes a connection between entries of in and lower bounds on color distances.
Lemma 12.
For and integer , if and , then .
Proof.
Let and be chosen such that . Assume towards a contradiction that . We focus on the case of and so ; the proof for the case is symmetrical. Thus, there exists an integer such that and so . Therefore,
We conclude that , implying that , and so , which is a contradiction.
The following lemma shows a connection between colors with distance greater than and entries in whose value is 0.
Lemma 13.
For , if then and .
Proof.
Assume by contradiction that or . We focus on the case of ; the proof for the case is symmetrical. Since , there exists an integer such that and . Therefore, there exist points and such that and . Thus,
The following lemma describes a connection between entries of in and upper bounds on color distances.
Lemma 14.
For and integer , if , then .
Proof.
If , then there exists an integer such that and . Hence, there exist and such that and . Thus,
Notice that . Finally, since , we have
and so . Thus,
completing the proof.
The following lemma shows that is weakly monotone increasing with respect to .
Lemma 15.
For , and , if , then for any we have .
Proof.
If , then there exists an integer such that and . Thus, there exists such that . Moreover, since , we have , and so . Finally, since , we conclude that .
The following lemma shows that in , for each pair of heavy colors, there exists at least one corresponding 1 entry.
Lemma 16.
For , either or .
Proof.
Assume by contradiction that and . Let and be chosen such that . Hence by Lemma 12, we have . Notice that . Therefore,
which is a contradiction.
In the following lemma we show the main property of the matrices used in the algorithm of Section 3.2 when dealing with pairs of heavy colors with color distance greater than .
Lemma 17.
For , if , then there exists a unique integer such that and either or .
Proof.
By Lemma 16, at least one of and must be . So assume without loss of generality that . Since , by Lemma 13 . Thus, by Lemma 15, there must exist exactly one such that and .
To complete the proof we show uniqueness. If , then assume towards a contradiction that there exist two different integers such that , , , , , and . Without loss of generality, , and so by Lemma 15, and implies which is a contradiction.
Otherwise, if at least one of or is , then since we assumed without loss of generality that , we have , which together with Lemma 15 implies that for all , . Thus, is unique.
3.2 Algorithm for Computing
In this section, we describe the algorithm for computing , prove its correctness and analyze its time cost. Pseudocode for the algorithm is given in Algorithm 1. The algorithm begins by a brute-force exact computation of color distances for all heavy colors and where . The brute-force computation costs time.
The rest of the algorithm deals with the case of . First, the algorithm computes the matrices from matrices and . The time cost of computing all is dominated by the cost of matrix multiplications. Each multiplication is between a matrix of size and a matrix of size . It is folklore knowledge that the matrix multiplication of two matrices of size and can be computed in . Thus, the time cost of computing the matrix multiplications is time.
Finally, the algorithm utilizes the matrices to complete entries in which correspond to heavy pairs of colors that were not covered by the brute-force computation. To do so, for each such pair , the algorithm scans all to find the unique from Lemma 17 such that and either or , and sets to be . Computing the entries in for all such pairs costs , which is dominated by the cost of the matrix multiplications performed during the computation of all of the matrices.
Lemma 18.
There exists an algorithm that computes in time, such that for we have .
Proof.
The runtime of the algorithm follows from the discussion above.
If , then there exist and such that , and so, after the brute-force computation, we have .
Otherwise, , and so the algorithm sets to be , where is the unique integer from Lemma 17, and in particular, and either or . By Lemmas 14 and 12, .
Finally, to obtain a approximation one can run the algorithm with approximation parameter , which does not affect the asymptotic time complexity.
4 Proof of Theorem 5
In this section, we analyze the combination of the construction algorithm for given in Section 3.2 together with the generic ACDO algorithm described in Section 2.1. However, we focus on the type of instances of ACDO which are used in Section 5 for solving the approximate snippets problem. In particular, our metric is defined by locations in an array of size , and so we have . Thus, in our case,
If , then we choose . In such a case, we have , and so
If , then we choose . In such a case we have , and so
Thus, to summarize we have shown
Notice that in either case, .
For the NNS data structure we use the van Emde Boas [25] data structure which for a set of points from integer universe has a preprocessing cost of and query time . In our setting, , and so and .
Thus, the construction time of the ACDO algorithm is
and the query time is . To complete the proof of Theorem 5, we plug into , to obtain the following tradeoff curve for a fixed .
5 Proof Sketch of Theorem 6
The proof of Theorem 6 is inspired by the exact MCDOCH algorithm of [17], combined with a new observation which allows to leverage Theorem 5. Due to space considerations, in this section, we only highlight the differences between our algorithm for AMCDOCH and the exact MCDOCH algorithm of [17]. The complete proof of Theorem 6 is given in the full version [12].
The relevant part of the algorithm of [17] creates an array of size which is a permutation of the points in , and preprocesses into an RNNS data structure. Then, the algorithm partitions into blocks of size , for an integer parameter . Let denote the set of blocks.
Following [17], the algorithm constructs a matrix of size , such that . To construct , the algorithm of [17] performs RNNS queries for the computation of each entry in . Computing turns out to be the dominating component in the preprocessing runtime of [17].
To obtain a faster preprocessing time for the approximate setting, instead of constructing , our algorithm constructs an approximate matrix where . The construction of utilizes the algorithm of Theorem 5 as follows. Define a new coloring set over the points in , such that for each we have . Notice that for every , we have , and so using the colors of on , every point in has only one color. Thus, the algorithm uses the algorithm of Theorem 5 on , but with the colors of , and the query time designed to be ). Now, is set to be the answer of the ACDO query on colors and , and so .
In the last step of the preprocessing phase, the algorithm preprocesses (instead of in [13]) using a Range Minimum Query (2DRMQ) data structure [3] so that given a rectangle in , defined by its corners, the algorithm returns in time the smallest value entry in the rectangle.
Answering queries and correctness
The process for answering a query is the same as in [17], but using instead of . Thus, the correctness of the algorithm needs to be proven given the use of .
As shown in [17], for each , is exactly the points in some interval . The challenging part is answering an AMCDOCH query between and , when and are disjoint. Thus, assume without loss of generality that . For the description here, we assume that (and ) is the first location of some block, and (and ) is the last location of some block. The more general cases are covered in the full version of this paper [12], with an additional execution of RNNS queries.
The query algorithm executes a 2DRMQ data structure query on the rectangle in defined by corners and . Let be the answer returned by the 2DRMQ query. Notice that there exist and such that . Thus, there exist integers such that and . Thus, , and so
Time Complexity
For the RNNS data structure we use the solution of [23] which on points (which is the size of ) has preprocessing time and query time .
By Remark 19 and the fact that each block has size , each ACDO query used for constructing costs time. So, after preprocessing the ACDO data structure, constructing costs time. The preprocessing time of the 2DRMQ data structure of [3] when applied to is . Thus, since , the total time cost of the preprocessing phase, which is composed of constructing the ACDO and RNNS data structures on , computing , and preprocessing a 2DRMQ data structure, is
The query process consists of RNNS queries, and a single 2DRMQ query. Thus, the query time cost is . Finally, similar to the proof of Theorem 5, we obtained the following tradeoff curve for a fixed .
6 Proof of Theorem 8: CLB for Exact MCDO on an Array
To prove Theorem 8, we design a reduction from -matrix product on values in to MCDO. To do so, consider an unbalanced version of -matrix product where is of size , is of size , and all of the values are bounded by some positive integer . It is straightforward to show that for any such that , solving unbalanced -matrix product with any polynomial time improvement over time is equivalent to solving -matrix product on balanced matrices with in time. Thus, an algorithm for unbalanced -matrix product with values bounded by in time less than would refute the Strong-APSP hypothesis.
The reduction
Our goal is to design a reduction from the unbalanced -matrix product problem with values bounded by to MCDO. In preparation for the proof of Theorem 9, we first describe the reduction using positive integer parameter as the upper bound on the values in the input matrices, and during the analysis we set .
For each , the algorithm defines a point with color . For each , the algorithm defines a point with color . Thus, the largest point defined is at most , and there are colors. Notice that there cannot be two entries in the same row (column) of () that define the same point. However, it is possible that for , and so , but the two points have different colors. A similar phenomenon can happen to points defined by entries of which share the same row. Thus, we merge all occurrences of the same point into a single occurrence but colored with all of the colors of the pre-merge occurrences. The set of points, denoted by , contains at most multi-colored points in a metric defined by an array of size , and so the algorithm uses the MCDO algorithm to preprocess . Finally, for each entry in , the algorithm sets by performing a query on the MCDO data structure.
Correctness
Suppose for some . For every , by definition we have and so . Since is colored with color and is colored with color , we have
| (1) |
Let . If , then
However, if then . Since , we have
Thus, there exists some such that , and combined with Equation 1 we have .
Lower bound
Recall that in our setting . The reduction preprocesses an MCDO data structure on points in an array metric of size , and then answers MCDO queries. If and are preprocessing and query times, respectively, of the MCDO data structure, then the Strong-APSP conjecture implies that
Recall that our goal is to prove that . Thus, assume towards a contradiction that , which, by straightforward algebraic manipulation, implies that Moreover, since the unbalanced -matrix product is hard for any choice of , we can choose and so . Notice that Thus, we have , which contradicts the Strong-APSP hypothesis.
7 Conclusions and Open Problems
We have shown the existence of FMM based algorithms for both ACDO and the -approximate snippets problem which, assuming , are essentially optimal. Moreover, we proved CLBs for exact version of CDO, implying that the exact versions of CDO and the snippets problem are strictly harder than their approximate versions.
We remark that one immediate and straightforward way to improve our algorithms if is to apply fast rectangular matrix multiplication ([2]). However, we chose not to describe such improvements since they are mostly technical and do not add any additional insight to the problems that we address. Moreover, we remark that it is straightforward to adapt our CDO algorithms to return the two points (one of each color) that define the distance in addition to the actual distance (see the full version [12]).
Our work leaves open the task of designing improved algorithms for more general metrics such as higher dimensional Euclidean space.
References
- [1] Mohammad Reza Abbasifard, Bijan Ghahremani, and Hassan Naderi. A survey on nearest neighbor search methods. International Journal of Computer Applications, 2014.
- [2] Josh Alman, Ran Duan, Virginia Vassilevska Williams, Yinzhan Xu, Zixuan Xu, and Renfei Zhou. More asymmetry yields faster matrix multiplication. In Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2005–2039. SIAM, 2025. doi:10.1137/1.9781611978322.63.
- [3] Amihood Amir, Johannes Fischer, and Moshe Lewenstein. Two-dimensional range minimum queries. In Combinatorial Pattern Matching: 18th Annual Symposium, CPM 2007, London, Canada, July 9-11, 2007. Proceedings 18, pages 286–294. Springer, 2007. doi:10.1007/978-3-540-73437-6_29.
- [4] Alexandr Andoni. Nearest neighbor search: The old, the new, and the impossible. PhD thesis, Massachusetts Institute of Technology, 2009.
- [5] Maxim Babenko, Pawel Gawrychowski, Tomasz Kociumaka, and Tatiana Starikovskaya. Wavelet trees meet suffix trees. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 572–591. SIAM, 2014.
- [6] Djamal Belazzougui and Simon J Puglisi. Range predecessor and lempel-ziv parsing. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2053–2071. SIAM, 2016. doi:10.1137/1.9781611974331.CH143.
- [7] Timothy M Chan, Virginia Vassilevska Williams, and Yinzhan Xu. Fredman’s trick meets dominance product: Fine-grained complexity of unweighted apsp, 3sum counting, and more. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, 2023.
- [8] Shiri Chechik. Improved distance oracles and spanners for vertex-labeled graphs. In Algorithms–ESA 2012: 20th Annual European Symposium, Ljubljana, Slovenia, September 10-12, 2012. Proceedings 20, pages 325–336. Springer, 2012. doi:10.1007/978-3-642-33090-2_29.
- [9] Maxime Crochemore, Costas S Iliopoulos, Marcin Kubica, M Sohel Rahman, German Tischler, and Tomasz Waleń. Improved algorithms for the range next value problem and applications. Theoretical Computer Science, 434:23–34, 2012. doi:10.1016/J.TCS.2012.02.015.
- [10] Jacob Evald, Viktor Fredslund-Hansen, and Christian Wulff-Nilsen. Near-optimal distance oracles for vertex-labeled planar graphs. In 32nd International Symposium on Algorithms and Computation (ISAAC 2021). Schloss-Dagstuhl-Leibniz Zentrum für Informatik, 2021.
- [11] Danny Hermelin, Avivit Levy, Oren Weimann, and Raphael Yuster. Distance oracles for vertex-labeled graphs. In Automata, Languages and Programming: 38th International Colloquium, ICALP 2011, Zurich, Switzerland, July 4-8, 2011, Proceedings, Part II 38, pages 490–501. Springer, 2011. doi:10.1007/978-3-642-22012-8_39.
- [12] Noam Horowicz and Tsvi Kopelowitz. Color distance oracles and snippets: Separation between exact and approximate solutions. arXiv preprint, 2025. arXiv:2507.04578.
- [13] Matti Karppa and Petteri Kaski. Probabilistic tensors and opportunistic boolean matrix multiplication. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 496–515. SIAM, 2019. doi:10.1137/1.9781611975482.31.
- [14] Orgad Keller, Tsvi Kopelowitz, Shir Landau Feibish, and Moshe Lewenstein. Generalized substring compression. Theoretical Computer Science, 525:42–54, 2014. doi:10.1016/J.TCS.2013.10.010.
- [15] Orgad Keller, Tsvi Kopelowitz, and Moshe Lewenstein. Range non-overlapping indexing and successive list indexing. In Algorithms and Data Structures: 10th International Workshop, WADS 2007, Halifax, Canada, August 15-17, 2007. Proceedings 10, pages 625–636. Springer, 2007. doi:10.1007/978-3-540-73951-7_54.
- [16] Jon M Kleinberg. Two algorithms for nearest-neighbor search in high dimensions. In Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, 1997.
- [17] Tsvi Kopelowitz and Robert Krauthgamer. Color-distance oracles and snippets. In 27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016). Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2016. doi:10.4230/LIPIcs.CPM.2016.24.
- [18] Tsvi Kopelowitz, Seth Pettie, and Ely Porat. Higher lower bounds from the 3sum conjecture. In Proceedings of the twenty-seventh annual ACM-SIAM symposium on Discrete algorithms, pages 1272–1287. SIAM, 2016. doi:10.1137/1.9781611974331.CH89.
- [19] Tsvi Kopelowitz and Virginia Vassilevska Williams. Towards optimal set-disjointness and set-intersection data structures. In Artur Czumaj, Anuj Dawar, and Emanuela Merelli, editors, 47th International Colloquium on Automata, Languages, and Programming, ICALP 2020, July 8-11, 2020, Saarbrücken, Germany (Virtual Conference), volume 168 of LIPIcs, pages 74:1–74:16. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPICS.ICALP.2020.74.
- [20] Robert Krauthgamer and James R Lee. Navigating nets: Simple algorithms for proximity search. In Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms, pages 798–807. Citeseer, 2004. URL: http://dl.acm.org/citation.cfm?id=982792.982913.
- [21] Mingfei Li, Chu Chung Christopher Ma, and Li Ning. (1+ )-distance oracles for vertex-labeled planar graphs. In International Conference on Theory and Applications of Models of Computation, pages 42–51. Springer, 2013. doi:10.1007/978-3-642-38236-9_5.
- [22] J Ian Munro, Yakov Nekrich, and Jeffrey S Vitter. Fast construction of wavelet trees. Theoretical Computer Science, 638:91–97, 2016. doi:10.1016/J.TCS.2015.11.011.
- [23] Yakov Nekrich and Gonzalo Navarro. Sorted range reporting. In Algorithm Theory – SWAT 2012: 13th Scandinavian Symposium and Workshops, Helsinki, Finland, July 4-6, 2012. Proceedings 13, pages 271–282. Springer, 2012. doi:10.1007/978-3-642-31155-0_24.
- [24] Avi Shoshan and Uri Zwick. All pairs shortest paths in undirected graphs with integer weights. In 40th Annual Symposium on Foundations of Computer Science (Cat. No. 99CB37039), pages 605–614. IEEE, 1999. doi:10.1109/SFFCS.1999.814635.
- [25] Peter van Emde Boas. Preserving order in a forest in less than logarithmic time. In 16th Annual Symposium on Foundations of Computer Science (sfcs 1975), pages 75–84. IEEE, 1975. doi:10.1109/SFCS.1975.26.
