Range Counting Oracles for Geometric Problems
Abstract
In this paper, we study estimators for geometric optimization problems in the sublinear geometric model. In this model, we have oracle access to a point set of size in a discrete space : queries can be made to an oracle that responds to orthogonal range counting requests. The query complexity of an optimization problem is measured by the number of oracle queries required to compute an estimator for the problem. We investigate two problems in this framework, the Euclidean Minimum Spanning Tree (MST) and the Earth Mover's Distance (EMD). For EMD, we show the existence of an estimator that approximates the cost of EMD with -relative error and -additive error using range counting queries for any parameter with . Moreover, we prove that this bound is tight. For MST, we demonstrate that the weight of MST can be estimated within a factor of using range counting queries.
Keywords and phrases:
Range counting oracles, minimum spanning trees, Earth Mover's Distance
Funding:
Eunjin Oh: Supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2024-00440239, Sublinear Scalable Algorithms for Large-Scale Data Analysis) and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2024-00358505).
Copyright and License:
2012 ACM Subject Classification:
Theory of computation → Computational geometry
Acknowledgements:
This research was initiated during the Workshop “Massive Data Models and Computational Geometry” held at the University of Bonn in September 2024 and funded by DFG, German Research Foundation, through EXC 2047 Hausdorff Center for Mathematics and FOR 5361 KI-FOR Algorithmic Data Analytics for Geodesy (AlgoForGe). We thank the organizers and participants for the stimulating environment that inspired this research.
Editors:
Oswin Aichholzer and Haitao Wang
Series and Publisher:

1 Introduction
In recent years, the size of data encountered in various applications has grown exponentially. While classical algorithms are designed to process the entire input to get exact or approximate solutions, the increasing demand for efficiency on massive datasets necessitates approaches that can provide useful results without examining all of the input data. Motivated by this, a wide range of sublinear-time algorithms has been extensively studied over the past few decades in various subfields of theoretical computer science. For instance, there are sublinear-time algorithms for the longest increasing subsequence and edit distance problems on strings [32, 34, 29], the triangle counting and vertex cover problems on graphs [10, 26, 38], Earth Mover's Distance on probability distributions [9], and minimum spanning tree on general metric spaces [22]. Moreover, there are sublinear-time algorithms for several fundamental geometric problems [16, 20, 21, 31, 35]. For further information, refer to [24, 39].
To obtain sublinear-time algorithms, we need an oracle to access the input data. While there are common oracle models for strings, graphs, and metric spaces, various sublinear-time oracle models exist in the geometric setting. Chazelle et al. [16] presented several geometric algorithms assuming that the input is given without any preprocessing. In the case of point inputs, the only available operation is uniform sampling. In the case of polygons, one can also check whether two edges are adjacent. Although they showed that several problems admit sublinear-time algorithms in this model, it does not seem strong enough to solve a wider range of fundamental geometric problems such as clustering, the Earth Mover's Distance problem, and the minimum spanning tree problem. Subsequently, many researchers developed sublinear-time algorithms in several different models. Examples are models in which the oracle can answer orthogonal range emptiness queries and cone nearest point queries [20], orthogonal range counting queries [21, 35], or separation queries [31].
In this paper, we study the range counting oracle model for geometric optimization problems. Here, we do not have direct access to the input point set, but instead we use an orthogonal range counting data structure for access. More specifically, given an orthogonal range as a query, it returns the number of input points contained in the query range. In fact, many database software systems, such as Oracle (https://docs.oracle.com/en/database/other-databases/nosql-database/24.3/sqlreferencefornosql/operator1.html) and Amazon SimpleDB (https://docs.aws.amazon.com/AmazonSimpleDB/latest/DeveloperGuide/RangeQueriesSelect.html), provide built-in support for such range queries. These queries allow users to compute aggregates, such as the count, sum, or average of records that fall within a specified range of values for a set of attributes, which is crucial for a wide array of applications. For example, in data analytics, range queries are used to calculate the number of sales within a specific date range or determine the total revenue from products within a given price interval. In geographic information systems (GIS), range queries help aggregate spatial data points within certain coordinate bounds, such as counting the number of locations within a specific radius of a given point. Furthermore, in machine learning, range queries are employed during data preprocessing to summarize statistics over selected subsets, which supports tasks such as data filtering and dimensionality reduction.
The following are desirable properties for a reasonable oracle model: queries to the oracle should be efficient to implement, the oracle should be supported by many well-known databases, and it should admit sublinear-time algorithms for fundamental geometric problems. This model clearly satisfies the first property; there are numerous works on data structures for range counting queries and their variants [2, 7, 14, 15, 13, 25, 40]. In particular, one can preprocess a set of points in in time to construct a data structure of size so that the number of points inside a query range can be found in time. Second, range counting queries are already supported by well-known public databases, as mentioned earlier. Therefore, without any special preprocessing, we can run sublinear-time algorithms designed for the range counting oracle directly on such databases. Third, we show in this paper that several fundamental problems can be solved in sublinear time in the range counting oracle model.
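To make the first property concrete, here is a minimal sketch of how a one-dimensional range counting oracle can be served from a sorted array (the class name `RangeCountOracle1D` and its interface are ours, not from the paper; higher-dimensional queries would use range trees or kd-trees instead):

```python
from bisect import bisect_left, bisect_right

class RangeCountOracle1D:
    """Orthogonal range counting in 1D: O(n log n) preprocessing,
    O(log n) per query, via a sorted array of coordinates."""
    def __init__(self, points):
        self.xs = sorted(points)

    def count(self, lo, hi):
        """Number of input points p with lo <= p <= hi."""
        return bisect_right(self.xs, hi) - bisect_left(self.xs, lo)

oracle = RangeCountOracle1D([3, 1, 4, 1, 5, 9, 2, 6])
print(oracle.count(2, 5))  # points 2, 3, 4, 5 -> 4
```

All algorithms in this paper treat `count` as a black box and are measured by the number of such calls.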
We focus on the following three fundamental geometric problems: the Earth Mover's Distance problem, the minimum spanning tree problem, and the cell sampling problem. The Earth Mover's Distance between two point sets $A$ and $B$ in a discrete space and the cost of a minimum spanning tree of a point set $P$ are defined as
$$\mathrm{EMD}(A,B)=\min_{\sigma}\sum_{a\in A}\|a-\sigma(a)\| \quad\text{and}\quad \mathrm{cost}(\mathrm{MST}(P))=\min_{T}\sum_{(u,v)\in E(T)}\|u-v\|,$$
where $\sigma$ ranges over all one-to-one matchings between $A$ and $B$, and $T$ ranges over all spanning trees of $P$. If we have direct access to the input point set(s), then the Earth Mover's Distance can be computed exactly in near-quadratic time [4], and approximately within -relative error in near-linear time [3]. Similarly, the cost of a minimum spanning tree can be computed exactly in near-quadratic time by computing the complete graph on the point set and applying Kruskal's algorithm, and approximately in near-linear time [30]. Note that the Earth Mover's Distance can be used for measuring the similarity between two point sets, and the minimum spanning tree can be used for summarizing the distribution of a point set. Thus these problems have various applications in multiple areas of computer science including computer vision [11], machine learning [27] and document similarity [33].
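On small inputs, both quantities can be computed directly from their definitions. The sketch below (function names `emd` and `mst_cost` are our own) brute-forces the minimum-cost matching over all permutations and runs Prim's algorithm on the complete Euclidean graph; it is meant only to pin down the two objectives, not as an efficient algorithm:

```python
from itertools import permutations
from math import dist, inf

def emd(A, B):
    """Earth Mover's Distance between equal-size point sets: minimum
    total length over all one-to-one matchings (brute force, |A|! time)."""
    return min(sum(dist(a, b) for a, b in zip(A, perm))
               for perm in permutations(B))

def mst_cost(P):
    """Cost of a minimum spanning tree of P, by Prim's algorithm on the
    complete Euclidean graph (O(n^2) time)."""
    n = len(P)
    in_tree = [False] * n
    key = [inf] * n   # cheapest connection of each vertex to the tree
    key[0] = 0.0
    total = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: key[i])
        in_tree[u] = True
        total += key[u]
        for v in range(n):
            if not in_tree[v]:
                key[v] = min(key[v], dist(P[u], P[v]))
    return total
```

The sublinear algorithms in this paper estimate these quantities without ever enumerating the point sets.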
The cell sampling problem is defined as follows. Let be a set of points in a discrete space . Given a value with , let be the grid imposed on whose cells have side length . We say a grid cell is non-empty if it contains a point of . Our goal is to sample one non-empty cell almost uniformly at random from the set of all non-empty cells of . If we have direct access to , this problem is trivial as we can compute all non-empty cells explicitly. However, in our sublinear model, we cannot compute all non-empty cells if the number of non-empty cells exceeds our desired query complexity. We believe that cell sampling can be considered a fundamental and basic primitive in sublinear models, as sublinear algorithms rely on efficient sampling to extract meaningful information from large data sets. In fact, this problem has been studied in the dynamic streaming model, and has been used as a primitive for several geometric problems [28, 36] and graph problems [19].
Table 1: Summary of our results.

|  |  | Additive Error | Multiplicative Error | Query Complexity |
|---|---|---|---|---|
| EMD | UB |  |  |  |
|  | LB | – |  |  |
| Cell Sampling | UB | – |  |  |
|  | LB | – |  |  |
| MST | UB | – |  |  |
|  | LB | – |  |  |
Our results.
We present sublinear-time algorithms for , Cell Sampling, and MST in the orthogonal range counting oracle model. Our results are summarized in Table 1.
- For EMD, we present an algorithm for estimating the Earth Mover's Distance between two point sets of size in within a multiplicative error of in addition to an additive error of . Our algorithm succeeds with a constant probability, and uses range counting queries for any parameter . Notice that the optimal solution can be as large as , and in this case, the additive error is relatively small. Furthermore, we show that this trade-off is tight for any parameter .
- For Cell Sampling, we present an algorithm that samples a non-empty cell in a grid so that the probability that any fixed non-empty cell is selected is in for any constant , where denotes the number of non-empty cells. We also show that this query complexity is tight. Moreover, by sampling non-empty cells, we can estimate the number of non-empty cells within a -factor.
- For MST, we present an algorithm for estimating the cost of a minimum spanning tree of a point set of size in within a multiplicative error of . Our algorithm again succeeds with constant probability, and uses range counting queries. Also, we show that any randomized algorithm using queries has at least a constant multiplicative error.
Related work.
Both EMD and MST have been extensively studied across various sublinear models over the past few decades. The work most closely related to ours is that of Czumaj et al. [20]. They studied the Euclidean minimum spanning tree problem in a different model. In particular, their oracle supports cone nearest queries, orthogonal range emptiness queries, and point sampling queries. (More precisely, the oracle supports cone nearest queries and orthogonal range queries only, but they are also given access to the input point set, which allows them to sample a vertex uniformly at random.) Here, given a cone nearest query consisting of a cone with apex , the oracle returns a -approximate nearest neighbor of in , where is the set of input points. Using queries, their algorithm estimates the cost of the MST of points within a relative error of . Moreover, they showed that if the oracle supports orthogonal emptiness queries only, then any deterministic algorithm for estimating the cost of MST within -relative error uses orthogonal range emptiness queries. Sublinear-time algorithms for computing a metric minimum spanning tree have also been studied [23]. Here, the oracle supports distance queries: given two points, it returns their distance in the underlying space. The most popular sublinear-space model is the streaming model. In this model, the input arrives as a data stream, and an algorithm is restricted to using a sublinear amount of memory relative to the input size. Both EMD and MST are studied extensively in this model [5, 18, 28], showcasing the challenges and techniques for processing large datasets in the streaming model.
Although no sublinear algorithms for EMD and MST were known in the range counting oracle model prior to our work, other geometric optimization problems have been studied in this model. Monemizadeh [35] presented an algorithm for the facility location problem using queries, and Czumaj and Sohler [21] studied clustering problems and map labeling problems. To the best of our knowledge, these are the only works done in the range counting oracle model. However, there are several closely related models, such as the range-optimization model. For instance, Arya et al. [8] studied the range MST problem, where the goal is to construct a data structure for computing the cost of MST within a query orthogonal range. In this setting, clustering problems have also been considered [1, 37].
Preliminaries.
We make use of quadtrees for both EMD and MST. Consider a discrete space . For , we use to denote a grid over the discrete space consisting of cells with side lengths of . A quadtree on is then constructed as follows. The root of the quadtree corresponds to the unique cell of . Each node of the quadtree at depth has four equal-sized children, each corresponding to a cell in contained in the cell corresponding to . Note that the depth of is . When it is clear from the context, we use a node of the quadtree and its corresponding cell interchangeably.
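As an illustration, a point can be mapped to its quadtree cell at a given depth with simple integer arithmetic (the function `cell_of` and its cell-coordinate encoding are our own convention; the root is the single cell at depth 0):

```python
def cell_of(point, depth, Delta):
    """Integer coordinates of the grid cell containing `point` at the
    given quadtree depth; cells at depth i have side length Delta/2^i."""
    side = Delta / (2 ** depth)
    return tuple(int(c // side) for c in point)

cell_of((5, 9), 2, 16)   # cells of side 4 -> (1, 2)
```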
Let be a point set in . For both EMD and MST we will use binary search in combination with range counting queries to find the smallest point in dimension . Furthermore, we will often want to sample a point from uniformly at random. We can do so using range counting queries in a procedure called telescoping sampling [35].
Lemma 1 (Telescoping Sampling [35]).
We can select a point of uniformly at random using range counting queries.
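A one-dimensional version of this idea can be sketched as follows: pick a uniform random rank and binary-search for the point of that rank, spending one counting query per halving step (this simplified sketch and its names are ours; the lemma itself applies in d dimensions):

```python
import random

def telescoping_sample_1d(count, lo, hi, n, r=None):
    """Sample a point of P uniformly at random, given only the oracle
    count(a, b) = |P ∩ [a, b)| on the discrete domain [lo, hi).
    Draw a uniform rank r, then binary-search for the r-th point in
    sorted order, one counting query per halving step."""
    if r is None:
        r = random.randint(1, n)  # uniform rank => uniform over P
    while hi - lo > 1:
        mid = (lo + hi) // 2
        left = count(lo, mid)     # one range counting query
        if r <= left:
            hi = mid
        else:
            r -= left
            lo = mid
    return lo
```

Since the rank is uniform, each point (with multiplicity) is returned with probability 1/n using logarithmically many queries in the domain size.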
Throughout this paper, we use to denote . Due to space restrictions, some proofs are omitted; they can be found in the full version of the paper.
2 Sublinear algorithms for the Earth Mover’s Distance problem
In this section, we present several sublinear-time algorithms that approximate the cost of the Earth Mover’s Distance between two point sets in for an integer . The approximation bounds we obtain are tight up to a factor of for any integer .
In a one-dimensional space, we know the configuration of an optimal matching: the i-th leftmost red point is matched with the i-th leftmost blue point for all i. Thus it suffices to estimate the cost of this matching. On the other hand, we do not know the exact configuration of an optimal matching in a higher dimensional space. Thus instead of considering an optimal matching, we consider a matching of cost , where is the Earth Mover's Distance between two point sets. This is where the multiplicative error of comes from. Apart from this, we use the same approach for all dimensions .
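In code, the one-dimensional optimal matching is just a sorted zip (a sketch with our own function name):

```python
def emd_1d(R, B):
    """Exact 1D EMD: the i-th leftmost red point is matched with the
    i-th leftmost blue point, so the cost is a sum over sorted pairs."""
    return sum(abs(r - b) for r, b in zip(sorted(R), sorted(B)))

emd_1d([1, 9, 4], [2, 3, 10])  # |1-2| + |4-3| + |9-10| = 3
```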
2.1 A sublinear algorithm for points in 1D
In this section, we present a sublinear algorithm using range counting queries in a one-dimensional space. More precisely, we prove the following theorem. The optimal solution can be as large as , and in this case, the additive error is relatively small.
Theorem 2.
Given two sets and of size in a discrete space and a parameter , we can estimate the cost of the Earth Mover’s Distance between and within multiplicative error in addition to additive error with probability at least using range counting queries.
If no two points coincide, we have . Therefore, we have the following corollary.
Corollary 3.
Given two sets and of size in a discrete space such that no two points of coincide, and a parameter , we can estimate the cost of the Earth Mover’s Distance between and within multiplicative error with probability at least using range counting queries.
Let and be two sets in sorted along the axis. Let be an optimal matching between and . We distinguish two types of edges: an edge of is long if its length is at least , and short otherwise. Let be the Earth Mover's Distance between and .
We employ the following win-win strategy. For a long edge, a slight perturbation of its endpoints introduces only a constant relative error; thus we move its endpoints to the centers of their grid cells, enabling us to estimate the total length efficiently. For short edges, we can disregard the case where the number of such edges is small, as this results in only a small additive error. Consequently, we may assume that the number of short edges is large, allowing us to estimate their total length by sampling. To further reduce the additive error, we partition the set of short edges into subsets based on edge length and consider each subset separately.
Long edges.
For long edges, we subdivide into segments each of length . For each long edge , imagine that we move and to the centers of the segments containing them. Since is long, this introduces an additive error of , resulting in a constant relative error. Using this fact, we estimate the total length of all long edges within a constant relative error using range counting queries. Let be the resulting estimator.
Lemma 4.
We can estimate the total length of the long edges in within multiplicative error using range counting queries.
Proof.
We subdivide into segments each of length . Here, the number of segments is . Let be the set of the resulting segments. Then a long edge contains at least one segment. Imagine that we move every point incident to a long edge to the center of the segment containing . For each segment , we compute the number of red points contained in and the number of blue points contained in , respectively, using queries. Then we construct another point set by adding red points to the center of and adding blue points to the center of . We can do this without using queries. For a point , let be its corresponding point in . Then consider the matching . We can compute explicitly without using any queries. Let be the set of edges of containing at least one segment of . A long edge of has its corresponding edge in . However, some short edges of might have their corresponding edges in . Here, observe that the short edges of having their corresponding edges in have length at least , so the cost induced by those edges is also within a constant factor of . They will also be counted as short edges, but this increases the total cost by only a constant factor. Therefore, the lemma holds.
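The snapping step in this proof can be sketched as follows, assuming we are given one counting oracle per color (the function name, interface, and parameters are ours; the paper's version additionally separates out short edges):

```python
def snapped_cost(count_red, count_blue, Delta, num_segments):
    """Snap every point to the center of its segment, using two range
    counting queries per segment, then compute the 1D EMD of the
    snapped multisets exactly by sorted matching."""
    side = Delta / num_segments
    R, B = [], []
    for i in range(num_segments):
        lo, hi = i * side, (i + 1) * side
        center = lo + side / 2
        R += [center] * count_red(lo, hi)    # one query
        B += [center] * count_blue(lo, hi)   # one query
    return sum(abs(r - b) for r, b in zip(sorted(R), sorted(B)))
```

Snapping moves each endpoint by at most half a segment length, so edges much longer than a segment incur only constant relative error.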
Short edges.
For short edges, we use random sampling. More specifically, we subdivide the set of short edges into subsets with respect to their lengths: , where is the set of all edges of of lengths in . Then we estimate the number of edges of for every index . To do this, we choose a random sample of of size . For each chosen point, we can find its mate. Let denote the set of all edges between the sampled points and their mates. If , then the number of edges in would be a good estimator for . Otherwise, we can ignore them, as induces an additive error of at most . This is because every short edge has length at most . Let . Then we have the following lemma.
Lemma 5.
If , we have with a constant probability. Otherwise, we have with a constant probability.
Proof.
Let be the random variable where if the th edge in is contained in , and otherwise. Let , where is the number of sampled points. By definition, we have , and . Now we analyze the failure probability. If , we use Chernoff bounds. Then the probability that is at most , where is the probability that . In our case, . Thus the failure probability is at most in this case. If , we use Markov’s inequality. Then the probability that is less than a small constant for a large constant . Therefore, the second part of the lemma also holds.
To amplify the success probability of Lemma 5, we repeat the procedure for computing times. Then we take the minimum, and let .
Lemma 6.
With a constant probability, .
Proof.
Since we can estimate the total length of long edges with an additive error of , it suffices to focus on short edges. For each edge of , we round up its length to . This induces the additive error of . Now we use Lemma 5. Consider the event that we have the desired bounds for all indices in Lemma 5. Due to the amplification of success probability we made before, this total success probability is a constant. If , the estimator has an error of in total for all such indices. If , we do not have a lower bound guarantee for . However, since the number of such edges is in total, and the length of each such edge is , this induces the additive error of . Therefore, the total additive error is .
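The mate-finding step used above can be sketched in one dimension: the mate of the i-th leftmost red point is the i-th leftmost blue point, and each can be located by rank-based binary search with counting queries (the function names and query interface are our own):

```python
def sample_short_edge_lengths(count_red, count_blue, Delta, sample_ranks):
    """For each sampled rank i, locate the i-th leftmost red point and
    its mate, the i-th leftmost blue point, by binary search driven by
    range counting queries; return the sampled edge lengths."""
    def kth(count, k):
        lo, hi = 0, Delta
        while hi - lo > 1:
            mid = (lo + hi) // 2
            left = count(lo, mid)   # one range counting query
            if k <= left:
                hi = mid
            else:
                k -= left
                lo = mid
        return lo
    return [abs(kth(count_red, i) - kth(count_blue, i)) for i in sample_ranks]
```

Bucketing the returned lengths by powers of two yields the per-class counts that the estimator scales up.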
2.2 A sublinear algorithm for higher dimensions
In this section, we present a sublinear algorithm that approximates the cost of the Earth Mover’s Distance between two point sets in for an integer . More specifically, we prove the following theorem. Again, notice that the Earth Mover’s Distance between two point sets can be as large as in the worst case.
Theorem 7.
Given two sets and of size in a discrete space and a parameter , we can estimate the cost of the Earth Mover’s Distance between and within -relative error in addition to additive error with probability at least using range counting queries.
If no two points coincide, we have . Therefore, we have the following corollary.
Corollary 8.
Given two sets and of size in a discrete space and a parameter such that no two points in coincide, we can estimate the cost of the Earth Mover’s Distance between and within -relative error with probability at least using range counting queries.
Let be the Earth Mover's Distance between and . This algorithm is essentially the same as the algorithm for the one-dimensional case. The only difference is that we do not know the configuration of an optimal matching in a higher dimensional space. Instead, we use a matching obtained from the quadtree on , which has cost . Then, as before, we distinguish two types of edges: an edge is long if its length is at least , and short otherwise. We can compute all long edges exactly, and we can estimate short edges using sampling. Details can be found in the full version of this paper.
3 Lower bounds for the Earth Mover’s Distance
In this section, we show that for any parameter , any randomized algorithm that approximates the Earth Mover’s Distance between two sets within an additive error of requires at least orthogonal range counting queries. More specifically, we prove:
Lemma 9.
For any parameter and any integers and , any randomized algorithm for approximating the Earth Mover’s Distance between two point sets of size in a discrete space within additive error uses range counting queries.
If points from may coincide, achieving any multiplicative approximation factor is impossible. Even when no two points of coincide, Lemma 9 shows that no randomized algorithm with a multiplicative error of uses range counting queries.
3.1 A lower bound for points in 1D
In this section, we prove Lemma 9 in the one-dimensional space; this will be extended to higher dimensional spaces later. To analyze the lower bound on the additive error of a Monte Carlo algorithm, we use the following version of Yao's Minimax Theorem [6]: Assume that there exists a randomized algorithm using range counting queries with success probability at least . Then for any distribution of instances, there exists a deterministic algorithm using range counting queries that returns desired answers on instances chosen from with probability at least . Here, the success probability of a randomized algorithm is the probability that the output is within our desired bound.
Thus it suffices to construct a distribution of instances such that no deterministic algorithm using range counting queries has success probability at least . For this, we define two types of gadgets on a segment of length : the near gadget and the far gadget. See Figure 1. For the far gadget, we put red points near the left endpoint of and blue points near the right endpoint of . For the near gadget, we put red points and blue points alternately starting from the left endpoint of so that the distance between any two consecutive points is one, and the starting point lies at the left endpoint of . Additionally, we put red points and blue points alternately so that the distance between any two consecutive points is one, and the last point lies at the right endpoint of . Notice that the cost induced by the near gadget is , while the cost induced by the far gadget is . Our strategy is to place copies of the near gadget inside and to hide one far gadget inside with probability . In this way, we can construct two types of instances, one with cost and one with cost .
More specifically, we define a distribution of instances as follows. We partition into segments each of length . Let be the set of resulting segments along the axis. See Figure 1. With probability , we let . Then with probability , we choose one index from uniformly at random. That is, the probability that a fixed index is chosen is for . Then for each index , we place the near gadget on . For index , we place the far gadget on . Notice that we do not place the far gadget anywhere if . Thus it suffices to show that, for any deterministic algorithm using range counting queries, the probability that it estimates the cost of within an additive error of on the instances chosen from is less than .
Lemma 10.
For any deterministic algorithm using range counting queries, the probability that it estimates the cost of an instance chosen from within an additive error of is less than .
3.2 Lower bounds for dimensions 2 and higher
We now extend the construction of the 1D case to the 2D case. We again use Yao's Minimax Theorem. Thus it suffices to construct a distribution of instances such that no deterministic algorithm using range counting queries has success probability at least .
For this, we define two types of gadgets on a square of side length : the near gadget and far gadget. See Figure 2. For the far gadget, we put two copies of the far gadget constructed from the 1D case, one on the upper side of and one on the lower side of , so that a red point comes first in the upper side, and a blue point comes first in the lower side. For the near gadget, we do the same: Put two copies of the near gadget constructed from the 1D case, one on the upper side and one on the lower side, so that a red point comes first in the upper side, and a blue point comes first in the lower side. In this way, every gadget has red points and blue points.
Now define a distribution of instances as follows. We construct the grid on consisting of cells each of side length . See Figure 2. We choose a cell as follows. With probability , we let . Then with probability , we choose one cell from the cells uniformly at random, and let it be . Then for each cell in the grid, except for , we place the near gadget in the middle of the cell. Then we place the far gadget in the middle of . Notice that we do not place the far gadget anywhere if . Thus it suffices to show that, for any deterministic algorithm using range counting queries, the probability that it estimates the cost of within an additive error of on an instance chosen from is less than . We can prove this similarly to Lemma 10.
Lemma 11.
For any deterministic algorithm using range counting queries, the probability that it estimates the cost of an instance chosen from within an additive error of is less than .
Extension to a higher dimensional space.
The construction of an input distribution used in the two-dimensional case can be easily extended to a higher dimensional space. We again define two gadgets on a -dimensional cube of side length . For the far gadget, we put two copies of the far gadget we constructed from the -dimensional case, one on one facet of and one on its parallel facet of . We do the same for the near gadget using two copies of the near gadget we constructed from the -dimensional case. Then we can define a distribution of instances as we did for the two-dimensional space. Details can be found in the full version of this paper.
4 Sampling a non-empty cell uniformly at random
In this section, we show how to sample a grid cell (almost) uniformly at random from all grid cells containing points of . Let be a grid of certain side length imposed on the discrete space . Let denote the set of all grid cells of containing points of ; we say such cells are non-empty. For each cell , we let denote the number of points contained in . We say the sampling is -approximate uniform if the probability that each non-empty cell is chosen is in for a constant with ; in this case, we simply say the sampling is almost uniform.
One might want to use the telescoping sampling of Lemma 1 directly: choose a point of uniformly at random, and return the cell of containing . However, the probability that a cell is chosen in this way is , not . So, instead, we use the two-step sampling procedure stated in Algorithm 1. Here, we are required to check whether the grid has at most non-empty cells and, if so, to find all of them explicitly. This can be done using range counting queries:
Lemma 12.
If the number of non-empty cells is , we can compute all of them using range counting queries.
Lemma 13.
CellSampling selects a cell almost uniformly at random from all non-empty cells of using range counting queries.
Corollary 14.
We can estimate the number of non-empty cells within an -relative error using range counting queries.
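A plausible reading of the two-step procedure is a rejection-sampling loop: a uniformly sampled point lands in cell c with probability proportional to n(c), and accepting c with probability 1/n(c) cancels that bias. The sketch below is our own illustration of this principle, not the paper's Algorithm 1 (which additionally controls the number of rejections and query cost):

```python
import random

def cell_sampling(sample_point, cell_of, cell_count, max_tries=100_000):
    """Step 1: draw a uniform random point of P (e.g. via telescoping
    sampling); its cell c is hit with probability n(c)/n, biased toward
    heavy cells.  Step 2: accept c with probability 1/n(c), so every
    non-empty cell is returned with equal probability, conditioned on
    acceptance."""
    for _ in range(max_tries):
        c = cell_of(sample_point())
        if random.random() < 1.0 / cell_count(c):  # one counting query
            return c
    raise RuntimeError("no cell accepted")
```

Here `cell_count(c)` is answered by a single range counting query on the cell c.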
Lower bound.
Now, we show that any randomized algorithm that can perform -approximate uniform sampling for non-empty cells requires at least range counting queries for any parameter . Our lower bound holds even in a discrete one-dimensional space with . We assume that we have an interval of length consisting of cells each of length . We subdivide into segments each of length , and let be the set of the resulting segments. In this way, each segment in contains cells. We construct a set of instances as follows.
We construct one uniform instance of . In this instance, for each segment , we identify its leftmost cell and place points in that cell. Therefore, each uniform instance contains non-empty cells (one corresponding to each segment) while maintaining a total of points per uniform instance.
Next, we construct non-uniform instances of . For each non-uniform instance, we select a distinct segment, which we refer to as its witness segment. In the leftmost cell of this witness segment, we place points. Additionally, in each of the next cells following the leftmost cell, we place one point per cell. For each of the remaining non-witness segments, we place points in the leftmost cell, as we did for the uniform instance. In this way, every non-uniform instance has non-empty cells and contains points in total. Thus, the ratio of the number of non-empty cells between a non-uniform instance and a uniform instance is . See Figure 3 for both types of instances.
Lemma 15.
Given an instance from , any randomized algorithm that determines whether is uniform with a success probability of at least requires range counting queries.
Lemma 16.
Any algorithm that can perform -approximate uniform sampling for non-empty cells requires range counting queries for any parameter .
5 Estimating the cost of a minimum spanning tree
Let be a set of points. In this section, we present a -approximation algorithm for estimating the cost of a minimum spanning tree of that uses range counting queries. The algorithm is randomized and succeeds with a constant probability. We then show that any randomized constant-factor approximation algorithm requires range counting queries.
5.1 Algorithm for the minimum spanning tree problem
For general graphs, Chazelle, Rubinfeld, and Trevisan [17] presented a -approximation algorithm for estimating the cost of a minimum spanning tree of a graph . This so-called CRT algorithm uses two types of oracles: we need to sample a vertex of uniformly at random, and we need to access the list of neighbors of each vertex. If the average degree of is , and the edge weights of come from , then the running time of this algorithm is . Our algorithm is based on the CRT algorithm. As it works for graphs with bounded average degree, we use a quadtree-based -spanner of introduced in [12]. Here, the maximum degree of is , and the cost of a minimum spanning tree of is within of the cost of a minimum spanning tree of , as shown in [30]. Thus it suffices to estimate the cost of a minimum spanning tree of .
However, a main issue here is that the running time depends on the maximum edge weight, which is in our case. To handle this, we examine the CRT algorithm more carefully. As we will see later, the problem reduces to estimating the number of connected components of for every , where denotes the subgraph of induced by the edges of length less than . To do this, the CRT algorithm samples vertices from and computes the number of edges contained in the component of containing each sampled vertex . If lies in a small component, we can traverse the component until we have visited all of its vertices. If lies in a large component, we cannot compute this quantity, and in this case we ignore it. This introduces an additive error of into the final estimator, where is the size threshold separating large components from small ones. To get an approximation factor of , we need to set , which is too large for our purpose: traversing a component of size requires at least queries. Here, the factor of in the additive error is precisely the number of components that we cannot traverse using queries.
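The component-counting step of the CRT algorithm can be sketched as follows. This is a simplified illustration in our own notation, not the authors' exact procedure: each sampled vertex contributes the reciprocal of its component size when the component is small, and is dropped (the source of the additive error discussed above) when the BFS exceeds the size threshold.

```python
import random
from collections import deque


def estimate_components(adj, n, samples, threshold, rng=random):
    """Sketch of CRT-style component counting on vertices 0..n-1.

    For each sampled vertex v, run a BFS that gives up once more than
    `threshold` vertices are seen. Small components contribute 1/|C(v)|
    to the estimate (summing to 1 per component in expectation); large
    components are ignored, inducing the additive error.
    """
    total = 0.0
    for _ in range(samples):
        v = rng.randrange(n)
        seen = {v}
        queue = deque([v])
        gave_up = False
        while queue and not gave_up:
            u = queue.popleft()
            for w in adj.get(u, []):
                if w not in seen:
                    seen.add(w)
                    if len(seen) > threshold:
                        gave_up = True  # large component: skip this sample
                        break
                    queue.append(w)
        if not gave_up:
            total += 1.0 / len(seen)
    return n * total / samples


# A toy graph with two components (sizes 2 and 3): expected estimate is 2.
adj = {0: [1], 1: [0], 2: [3], 3: [2, 4], 4: [3]}
print(estimate_components(adj, n=5, samples=200, threshold=10,
                          rng=random.Random(0)))
```

Every sample contributes either 1/2 or 1/3, so the estimate always lands between 5/3 and 5/2 and concentrates around the true count of 2.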
Our strategy here is to traverse in a cell-by-cell manner. More specifically, we consider a grid of side length and contract all vertices of lying in a single cell of into one vertex. Then, using the same number of queries, we can visit more of the original vertices. By a careful analysis, we can reduce the additive error to . This idea was already used by [28] for maintaining the cost of the minimum spanning tree in the dynamic streaming setting. (As they did, we could use the complete Euclidean graph on instead of a spanner of . However, this requires additional work: given two cells and and a value , we need to check whether there are two points and with . We cannot do this using range counting queries alone; instead, we could resolve it by defining another distance function. We feel that working with spanners makes the analysis simpler.) However, the cell-by-cell traversal raises a new issue: we need to sample a cell containing a point of uniformly at random. Here, we can use Lemma 13 to choose a random sample from the vertices of .
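The contraction step itself is simple to state in code. The sketch below groups points by the grid cell containing them; the function name and the concrete side length are our own, since the grid parameter used by the algorithm is given in the paper's (stripped) notation.

```python
def contract_to_cells(points, side):
    """Group points by the grid cell of side length `side` containing them.

    Visiting one contracted vertex accounts for every original point in
    its cell, which is why a cell-by-cell traversal sees more original
    vertices per query than a vertex-by-vertex one.
    """
    cells = {}
    for (x, y) in points:
        key = (int(x // side), int(y // side))
        cells.setdefault(key, []).append((x, y))
    return cells


cells = contract_to_cells([(0.1, 0.1), (0.9, 0.2), (2.5, 2.5)], side=1.0)
print(len(cells))          # 2 non-empty cells
print(len(cells[(0, 0)]))  # 2 points contracted into cell (0, 0)
```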
Then we have the following theorem. Details can be found in the full version of this paper.
Theorem 17.
Given a set of size in a discrete space , we can estimate the cost of a minimum spanning tree of within a factor of with a constant probability using range counting queries.
5.2 Lower bound for the minimum spanning tree problem
In this section, we show that any randomized constant-factor approximation algorithm for the Euclidean minimum spanning tree problem requires range counting queries. For this, we construct a distribution of instances such that any randomized algorithm using queries fails to compute a constant-factor approximate solution with probability at least .
Let be a discrete domain with . We subdivide into equal-sized cells, each of side length . See Figure 4. In the middle of each cell, we place either a strip gadget or a uniform gadget. Each gadget is defined on a domain , which we further subdivide into finer cells of side length . For the strip gadget, we put one point in each finer cell on one diagonal of . For the uniform gadget, we put one point in the cell in the -th row and -th column for . In total, each gadget contains points. Notice that the cost of a minimum spanning tree of the points inside a strip gadget is , while the cost of a minimum spanning tree of the points inside a uniform gadget is . Now we define a distribution of instances. With probability , we place copies of the strip gadget in all cells of the domain . Otherwise, with probability , we pick one cell of the domain uniformly at random, place a copy of the uniform gadget in , and place copies of the strip gadget in the other cells. Then we have the following lemma.
Lemma 18.
Let and be two instances of such that contains the uniform gadget while does not. Then , where denotes the cost of a minimum spanning tree.
Proof.
Let be a minimum spanning tree of . Observe that the cost of is at least : contains at least edges connecting two points contained in different cells of the domain , and their total length is at least .
Now we derive an upper bound on the cost of a minimum spanning tree of . We can construct a spanning tree of from as follows. The two instances are identical except that has a uniform gadget in some cell, say , while has a strip gadget in . By the cut property, there are at most edges of with one endpoint in and the other endpoint outside of , and each such edge has length at most . For each such edge, we reconnect it to an arbitrary vertex in . Then we remove from all edges with both endpoints in , and add the edges of a minimum spanning tree of the strip gadget. In total, the cost decreases by at most , where the term is the cost of the uniform gadget. Therefore, we have .
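The cost gap between the two gadgets can be checked numerically on small hypothetical parameters. The sketch below uses a quadratic-time Prim's algorithm on the complete Euclidean graph; the gadget size `m = 16` and the grid spacing are our own choices for illustration, since the paper's parameters are given in its (stripped) notation.

```python
import math


def mst_cost(points):
    """Cost of a Euclidean MST via Prim's algorithm in O(n^2)."""
    n = len(points)
    if n <= 1:
        return 0.0
    dist = [math.dist(points[0], p) for p in points]  # distance to the tree
    in_tree = [False] * n
    in_tree[0] = True
    total = 0.0
    for _ in range(n - 1):
        v = min((i for i in range(n) if not in_tree[i]), key=dist.__getitem__)
        total += dist[v]
        in_tree[v] = True
        for i in range(n):
            if not in_tree[i]:
                d = math.dist(points[v], points[i])
                if d < dist[i]:
                    dist[i] = d
    return total


m = 16  # hypothetical gadget size
strip = [(i, i) for i in range(m)]  # strip gadget: one point per diagonal cell
side = int(math.isqrt(m))
uniform = [(side * r, side * c)     # uniform gadget: sqrt(m) x sqrt(m) grid
           for r in range(side) for c in range(side)]
print(mst_cost(strip))    # 15 * sqrt(2), about 21.2
print(mst_cost(uniform))  # 15 * 4 = 60.0
```

Even at this toy scale, the uniform gadget's MST is markedly more expensive than the strip gadget's, which is the gap the lemma exploits.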
Lemma 19.
Any randomized constant-factor approximation algorithm for the minimum spanning tree problem on a point set of size in a discrete space requires range counting queries.
References
- [1] Mikkel Abrahamsen, Mark de Berg, Kevin Buchin, Mehran Mehr, and Ali D. Mehrabi. Range-Clustering Queries. In Proceedings of the 33rd International Symposium on Computational Geometry (SoCG 2017), volume 77, pages 5:1–5:16, 2017. doi:10.4230/LIPICS.SOCG.2017.5.
- [2] Peyman Afshani, Lars Arge, and Kasper Dalgaard Larsen. Orthogonal range reporting in three and higher dimensions. In Proceedings of the 50th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2009), pages 149–158, 2009. doi:10.1109/FOCS.2009.58.
- [3] Pankaj K Agarwal, Hsien-Chih Chang, Sharath Raghvendra, and Allen Xiao. Deterministic, near-linear -approximation algorithm for geometric bipartite matching. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2022), pages 1052–1065, 2022. doi:10.1145/3519935.3519977.
- [4] Pankaj K. Agarwal, Hsien-Chih Chang, and Allen Xiao. Efficient algorithms for geometric partial matching. In Proceedings of the 35th International Symposium on Computational Geometry (SoCG 2019), pages 6:1–6:14, 2019. doi:10.4230/LIPICS.SOCG.2019.6.
- [5] Alexandr Andoni, Khanh Do Ba, Piotr Indyk, and David Woodruff. Efficient sketches for earth-mover distance, with applications. In Proceedings of the 50th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2009), pages 324–330, 2009. doi:10.1109/FOCS.2009.25.
- [6] Sanjeev Arora and Boaz Barak. Computational complexity: a modern approach, 2009.
- [7] Sunil Arya and David M Mount. Approximate range searching. Computational Geometry, 17(3-4):135–152, 2000. doi:10.1016/S0925-7721(00)00022-5.
- [8] Sunil Arya, David M. Mount, and Eunhui Park. Approximate geometric mst range queries. In Proceedings of the 31st International Symposium on Computational Geometry (SoCG 2015), volume 34, pages 781–795, 2015. doi:10.4230/LIPICS.SOCG.2015.781.
- [9] Khanh Do Ba, Huy L Nguyen, Huy N Nguyen, and Ronitt Rubinfeld. Sublinear time algorithms for earth mover’s distance. Theory of Computing Systems, 48:428–442, 2011. doi:10.1007/S00224-010-9265-8.
- [10] Soheil Behnezhad, Mohammad Roghani, and Aviad Rubinstein. Sublinear time algorithms and complexity of approximate maximum matching. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing (STOC 2023), pages 267–280, 2023. doi:10.1145/3564246.3585231.
- [11] Nicolas Bonneel, Michiel Van De Panne, Sylvain Paris, and Wolfgang Heidrich. Displacement interpolation using lagrangian mass transport. In Proceedings of the 2011 SIGGRAPH Asia conference, pages 1–12, 2011.
- [12] Paul B Callahan and S Rao Kosaraju. A decomposition of multidimensional point sets with applications to k-nearest-neighbors and n-body potential fields. Journal of the ACM (JACM), 42(1):67–90, 1995. doi:10.1145/200836.200853.
- [13] Timothy M Chan. Orthogonal range searching in moderate dimensions: kd trees and range trees strike back. Discrete & Computational Geometry, 61:899–922, 2019. doi:10.1007/S00454-019-00062-5.
- [14] Timothy M Chan, Kasper Green Larsen, and Mihai Pătraşcu. Orthogonal range searching on the RAM, revisited. In Proceedings of the twenty-seventh annual Symposium on Computational Geometry (SoCG 2011), pages 1–10, 2011.
- [15] Timothy M Chan and Konstantinos Tsakalidis. Dynamic orthogonal range searching on the RAM, revisited. Journal of Computational Geometry, 9(2):45–66, 2018. doi:10.20382/JOCG.V9I2A5.
- [16] Bernard Chazelle, Ding Liu, and Avner Magen. Sublinear geometric algorithms. In Proceedings of the thirty-fifth annual ACM Symposium on Theory of Computing (STOC 2003), pages 531–540, 2003. doi:10.1145/780542.780620.
- [17] Bernard Chazelle, Ronitt Rubinfeld, and Luca Trevisan. Approximating the minimum spanning tree weight in sublinear time. SIAM J. Comput., 34(6):1370–1379, 2005. doi:10.1137/S0097539702403244.
- [18] Xi Chen, Vincent Cohen-Addad, Rajesh Jayaram, Amit Levi, and Erik Waingarten. Streaming Euclidean MST to a constant factor. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing (STOC 2023), pages 156–169, 2023. doi:10.1145/3564246.3585168.
- [19] Graham Cormode, Hossein Jowhari, Morteza Monemizadeh, and S. Muthukrishnan. The sparse awakens: Streaming algorithms for matching size estimation in sparse graphs. In Kirk Pruhs and Christian Sohler, editors, 25th Annual European Symposium on Algorithms, ESA 2017, September 4-6, 2017, Vienna, Austria, volume 87 of LIPIcs, pages 29:1–29:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2017. doi:10.4230/LIPICS.ESA.2017.29.
- [20] Artur Czumaj, Funda Ergün, Lance Fortnow, Avner Magen, Ilan Newman, Ronitt Rubinfeld, and Christian Sohler. Approximating the weight of the Euclidean minimum spanning tree in sublinear time. SIAM J. Comput., 35(1):91–109, 2005. doi:10.1137/S0097539703435297.
- [21] Artur Czumaj and Christian Sohler. Property testing with geometric queries. In Proceedings of the 9th Annual European Symposium on Algorithms (ESA 2001), pages 266–277, 2001. doi:10.1007/3-540-44676-1_22.
- [22] Artur Czumaj and Christian Sohler. Estimating the weight of metric minimum spanning trees in sublinear-time. In Proceedings of the thirty-sixth annual ACM Symposium on Theory of Computing (STOC 2004), pages 175–183, 2004. doi:10.1145/1007352.1007386.
- [23] Artur Czumaj and Christian Sohler. Estimating the weight of metric minimum spanning trees in sublinear time. SIAM J. Comput., 39(3):904–922, 2009. doi:10.1137/060672121.
- [24] Artur Czumaj and Christian Sohler. Sublinear-time algorithms. Property testing: current research and surveys, pages 41–64, 2010. doi:10.1007/978-3-642-16367-8_5.
- [25] Mark de Berg, Otfried Cheong, Marc van Kreveld, and Mark Overmars. Computational geometry: Algorithms and applications, 2008.
- [26] Talya Eden, Dana Ron, and C Seshadhri. On approximating the number of -cliques in sublinear time. In Proceedings of the 50th annual ACM SIGACT Symposium on Theory of Computing (STOC 2018), pages 722–734, 2018. doi:10.1145/3188745.3188810.
- [27] Rémi Flamary, Marco Cuturi, Nicolas Courty, and Alain Rakotomamonjy. Wasserstein discriminant analysis. Machine Learning, 107:1923–1945, 2018. doi:10.1007/S10994-018-5717-1.
- [28] Gereon Frahling, Piotr Indyk, and Christian Sohler. Sampling in dynamic data streams and applications. In Proceedings of the twenty-first annual Symposium on Computational Geometry (SoCG 2005), pages 142–149, 2005. doi:10.1145/1064092.1064116.
- [29] Elazar Goldenberg, Robert Krauthgamer, and Barna Saha. Sublinear algorithms for gap edit distance. In Proceedings of the 60th Annual Symposium on Foundations of Computer Science (FOCS 2019), pages 1101–1120, 2019. doi:10.1109/FOCS.2019.00070.
- [30] Sariel Har-Peled. Geometric approximation algorithms. Number 173 in Mathematical Surveys and Monographs. American Mathematical Soc., 2011.
- [31] Sariel Har-Peled, Mitchell Jones, and Saladi Rahul. Active-learning a convex body in low dimensions. Algorithmica, 83:1885–1917, 2021. doi:10.1007/S00453-021-00807-W.
- [32] Tomasz Kociumaka and Barna Saha. Sublinear-time algorithms for computing & embedding gap edit distance. In Proceedings of the 61st Annual Symposium on Foundations of Computer Science (FOCS 2020), pages 1168–1179, 2020. doi:10.1109/FOCS46700.2020.00112.
- [33] Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pages 957–966, 2015. URL: http://proceedings.mlr.press/v37/kusnerb15.html.
- [34] Michael Mitzenmacher and Saeed Seddighin. Improved sublinear time algorithm for longest increasing subsequence. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA 2021), pages 1934–1947, 2021. doi:10.1137/1.9781611976465.115.
- [35] Morteza Monemizadeh. Facility location in the sublinear geometric model. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2023), volume 275, pages 6:1–6:24, 2023. doi:10.4230/LIPICS.APPROX/RANDOM.2023.6.
- [36] Morteza Monemizadeh and David P. Woodruff. 1-pass relative-error l-sampling with applications. In Moses Charikar, editor, Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17-19, 2010, pages 1143–1160. SIAM, 2010. doi:10.1137/1.9781611973075.92.
- [37] Eunjin Oh and Hee-Kap Ahn. Approximate range queries for clustering. In 34th International Symposium on Computational Geometry (SoCG 2018), volume 99, pages 62:1–62:14, 2018. doi:10.4230/LIPICS.SOCG.2018.62.
- [38] Krzysztof Onak, Dana Ron, Michal Rosen, and Ronitt Rubinfeld. A near-optimal sublinear-time algorithm for approximating the minimum vertex cover size. In Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms (SODA 2012), pages 1123–1131, 2012. doi:10.1137/1.9781611973099.88.
- [39] Ronitt Rubinfeld and Asaf Shapira. Sublinear time algorithms. SIAM Journal on Discrete Mathematics, 25(4):1562–1588, 2011. doi:10.1137/100791075.
- [40] Cheng Sheng and Yufei Tao. New results on two-dimensional orthogonal range aggregation in external memory. In Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of Database Systems (PODS 2011), pages 129–139, 2011. doi:10.1145/1989284.1989297.