
Range Counting Oracles for Geometric Problems

Anne Driemel, University of Bonn, Germany; Morteza Monemizadeh, Department of Mathematics and Computer Science, TU Eindhoven, the Netherlands; Eunjin Oh, Department of Computer Science and Engineering, POSTECH, Pohang, South Korea; Frank Staals, Department of Information and Computing Sciences, Utrecht University, The Netherlands; David P. Woodruff, Carnegie Mellon University, Pittsburgh, PA, USA
Abstract

In this paper, we study estimators for geometric optimization problems in the sublinear geometric model. In this model, we have oracle access to a point set of size n in a discrete space [Δ]^d, where queries can be made to an oracle that responds to orthogonal range counting requests. The query complexity of an optimization problem is measured by the number of oracle queries required to compute an estimator for the problem. We investigate two problems in this framework, the Euclidean Minimum Spanning Tree (MST) and the Earth Mover's Distance (EMD). For EMD, we show the existence of an estimator that approximates the cost of EMD with O(logΔ)-relative error and O(nΔ/s^{1+1/d})-additive error using O(s·polylogΔ) range counting queries for any parameter s with 1 ≤ s ≤ n. Moreover, we prove that this bound is tight. For MST, we demonstrate that the weight of MST can be estimated within a factor of (1±ε) using O~(√n) range counting queries.

Keywords and phrases:
Range counting oracles, minimum spanning trees, Earth Mover’s Distance
Funding:
Eunjin Oh: Supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.RS-2024-00440239, Sublinear Scalable Algorithms for Large-Scale Data Analysis) and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No.RS-2024-00358505).
David P. Woodruff: supported in part by Office of Naval Research award number N000142112647 and a Simons Investigator Award.
Copyright and License:
© Anne Driemel, Morteza Monemizadeh, Eunjin Oh, Frank Staals, and David P. Woodruff; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Computational geometry
Related Version:
Full Version: https://arxiv.org/abs/2504.15292
Acknowledgements:
This research was initiated during the Workshop “Massive Data Models and Computational Geometry” held at the University of Bonn in September 2024 and funded by DFG, German Research Foundation, through EXC 2047 Hausdorff Center for Mathematics and FOR 5361 KI-FOR Algorithmic Data Analytics for Geodesy (AlgoForGe). We thank the organizers and participants for the stimulating environment that inspired this research.
Editors:
Oswin Aichholzer and Haitao Wang

1 Introduction

In recent years, the size of data encountered in various applications has grown exponentially. While classical algorithms are designed to process the entire input to get exact or approximate solutions, the increasing demand for efficiency in massive datasets necessitates approaches that can provide useful results without examining all of the input data. Motivated by this, a wide range of sublinear-time algorithms has been extensively studied over the past few decades from various subfields in theoretical computer science. For instance, there are sublinear-time algorithms for the longest increasing subsequence and edit distance problems on strings [32, 34, 29], the triangle counting and vertex cover problems on graphs [10, 26, 38], Earth Mover’s Distance on probability distributions [9], and minimum spanning tree on general metric spaces [22]. Moreover, there are sublinear-time algorithms for several fundamental geometric problems [16, 20, 21, 31, 35]. For further information, refer to [24, 39].

To obtain sublinear-time algorithms, we need an oracle to access the input data. While there are common oracles for strings, graphs, and metric spaces, various sublinear-time oracle models exist in the geometric setting. Chazelle et al. [16] presented several geometric algorithms assuming that the input is given without any preprocessing. In the case of point inputs, the only thing one can do is uniform sampling; in the case of polygons, one can also check if two edges are adjacent. Although they showed that several problems admit sublinear-time algorithms in this model, it does not seem strong enough to solve a wider range of fundamental geometric problems such as clustering problems, the Earth Mover's Distance problem, and the minimum spanning tree problem. Subsequently, many researchers developed sublinear-time algorithms for several different models. Examples are models in which the oracle can answer orthogonal range emptiness queries and cone nearest point queries [20], orthogonal range counting queries [21, 35], or separation queries [31].

In this paper, we study the range counting oracle model for geometric optimization problems. Here, we do not have direct access to the input point set, but instead we use an orthogonal range counting data structure for access. More specifically, given an orthogonal range as a query, it returns the number of input points contained in the query range. In fact, many database software systems, such as Oracle (https://docs.oracle.com/en/database/other-databases/nosql-database/24.3/sqlreferencefornosql/operator1.html) and Amazon SimpleDB (https://docs.aws.amazon.com/AmazonSimpleDB/latest/DeveloperGuide/RangeQueriesSelect.html), provide built-in support for such range queries. These queries allow users to compute aggregates, such as the count, sum, or average of records that fall within a specified range of values for a set of attributes, which is crucial for a wide array of applications. For example, in data analytics, range queries are used to calculate the number of sales within a specific date range or determine the total revenue from products within a given price interval. In geographic information systems (GIS), range queries help aggregate spatial data points within certain coordinate bounds, such as counting the number of locations within a specific radius of a given point. Furthermore, in machine learning, range queries are employed during data preprocessing to summarize statistics over selected subsets, which supports tasks such as data filtering and dimensionality reduction.

The following are desirable properties for a reasonable oracle model: queries to the oracle should be efficient to implement, the oracle should be supported by many well-known databases, and it should allow sublinear-time algorithms for fundamental geometric problems. The range counting oracle clearly satisfies the first property: there are numerous works on data structures for range counting queries and their variants [2, 7, 14, 15, 13, 25, 40]. In particular, one can preprocess a set of n points in ℝ^d in O(n·log^{d−1} n) time to construct a data structure of size O(n·log^{d−1} n) so that the number of points inside a query range can be found in O(log^{d−1} n) time. Second, range counting queries are already supported by well-known public databases, as mentioned earlier. Therefore, without any special preprocessing, we can apply sublinear-time algorithms designed for the range counting oracle to such databases. Third, we show in this paper that several fundamental problems can be solved in sublinear time in the range counting oracle model.
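For illustration, a one-dimensional range counting oracle can be simulated with two binary searches over a sorted array; the class name and the toy point set below are our own, not part of the cited data structures.

```python
from bisect import bisect_left, bisect_right

class RangeCountOracle1D:
    """A 1D orthogonal range counting oracle: O(n log n) preprocessing
    (sorting), O(log n) per query (two binary searches)."""
    def __init__(self, points):
        self.pts = sorted(points)

    def count(self, lo, hi):
        # number of points p with lo <= p <= hi
        return bisect_right(self.pts, hi) - bisect_left(self.pts, lo)

oracle = RangeCountOracle1D([3, 1, 4, 1, 5, 9, 2, 6])
print(oracle.count(2, 5))  # counts 2, 3, 4, 5 -> 4
```

The cited O(n·log^{d−1} n) structures extend this idea to higher dimensions via multi-level range trees.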

We focus on the following three fundamental geometric problems: the Earth Mover's Distance problem, the minimum spanning tree problem, and the cell sampling problem. The Earth Mover's Distance 𝖤𝖬𝖣(R,B) between two point sets R and B in ℝ^d and the cost 𝖬𝖲𝖳(P) of a minimum spanning tree of a point set P ⊂ ℝ^d are defined as follows.

𝖤𝖬𝖣(R,B) = min_{π:R→B} Σ_{r∈R} ‖r − π(r)‖  and  𝖬𝖲𝖳(P) = min_{T} Σ_{uv∈E(T)} ‖u − v‖,

where π ranges over all one-to-one matchings between R and B, and T ranges over all spanning trees of P. If we have direct access to the input point set(s), then 𝖤𝖬𝖣(R,B) can be computed exactly in near-quadratic time in ℝ² [4], and approximately within (1+ε)-relative error in near-linear time [3]. Similarly, 𝖬𝖲𝖳(P) can be computed exactly in near-quadratic time by computing the complete graph on P and applying Kruskal's algorithm, and approximately in near-linear time [30]. Note that 𝖤𝖬𝖣 can be used for measuring the similarity between two point sets, and 𝖬𝖲𝖳 can be used for summarizing the distribution of a point set. Thus these problems have various applications in multiple areas of computer science including computer vision [11], machine learning [27] and document similarity [33].

The cell sampling problem is defined as follows. Let P be a set of points in a discrete space [Δ]^d. Given a value r with 1 ≤ r ≤ Δ, let 𝒢 be the grid imposed on [Δ]^d whose cells have side length r. We say a grid cell is non-empty if it contains a point of P. Our goal is to sample one non-empty cell almost uniformly at random from the set of all non-empty cells of 𝒢. If we have direct access to P, this problem is trivial as we can compute all non-empty cells explicitly. However, in our sublinear model, we cannot compute all non-empty cells if the number of non-empty cells exceeds our desired query complexity. We believe that cell sampling can be considered a fundamental and basic primitive in sublinear models, as sublinear algorithms rely on efficient sampling to extract meaningful information from large data sets. In fact, this problem has been studied in the dynamic streaming model, and has been used as a primitive for several geometric problems [28, 36] and graph problems [19].

Table 1: Summary of our results. Here, n denotes the number of points, and Δ denotes the size of the domain. For 𝖤𝖬𝖣, a parameter s determines a trade-off between the additive error and the query complexity.
                   Additive Error       Multiplicative Error   Query Complexity
𝖤𝖬𝖣           UB   O(nΔ/s^{1+1/d})      O(logΔ)                O~(s)
              LB   Ω(nΔ/s^{1+1/d})      —                      O~(s)
Cell Sampling UB   —                    1±ε                    O~(√n)
              LB   —                    O(1)                   Ω(√n)
𝖬𝖲𝖳           UB   —                    1±ε                    O~(√n)
              LB   —                    O(1)                   Ω(n^{1/3})

Our results.

We present sublinear-time algorithms for 𝖤𝖬𝖣, Cell Sampling, and MST in the orthogonal range counting oracle model. Our results are summarized in Table 1.

  • For 𝖤𝖬𝖣, we present an algorithm for estimating the Earth Mover's Distance between two point sets of size n in [Δ]^d within a multiplicative error of O(logΔ) in addition to an additive error of O(nΔ/s^{1+1/d}). Our algorithm succeeds with a constant probability, and uses O~(s) range counting queries for any parameter s. Notice that the optimal solution can be as large as Θ(nΔ), in which case the additive error is relatively small. Furthermore, we show that this trade-off is tight for any parameter s.

  • For Cell Sampling, we present an algorithm that samples a non-empty cell in a grid so that the probability that a fixed non-empty cell is selected is in [(1−ε)/m, (1+ε)/m] for any constant 0 < ε < 1, where m denotes the number of non-empty cells. We also show that the query complexity here is tight. Moreover, if we sample O~(√n·log n) non-empty cells, we can estimate the number of non-empty cells within a (1±ε)-factor.

  • For 𝖬𝖲𝖳, we present an algorithm for estimating the cost of a minimum spanning tree of a point set of size n in [Δ]² within a multiplicative error of 1±ε. Our algorithm again succeeds with constant probability, and uses O~(√n) range counting queries. Also, we show that any randomized algorithm using o(n^{1/3}) queries has at least a constant multiplicative error.

Related work.

Both EMD and MST have been extensively studied across various sublinear models over the past few decades. The work most closely related to ours is the one by Czumaj et al. [20]. They studied the Euclidean minimum spanning tree problem in a different model. In particular, their oracle supports cone nearest queries, orthogonal range emptiness queries, and point sampling queries. (More precisely, the oracle supports cone nearest queries and orthogonal range queries only, but they are also given access to the input point set, which allows them to sample a vertex uniformly at random.) Here, given a cone nearest query consisting of a cone C with apex p, the oracle returns a (1+δ)-approximate nearest neighbor of p in C∩P, where P is the set of input points. Using O~(√n) queries, their algorithm estimates the cost of the MST of n points within a relative error of (1±ε). Moreover, they showed that if the oracle supports orthogonal emptiness queries only, then any deterministic algorithm for estimating the cost of MST within O(n^{1/4})-relative error uses Ω(√n) orthogonal range emptiness queries. Sublinear-time algorithms for computing a metric minimum spanning tree have also been studied [23]. Here, the oracle supports distance queries: given two points, it returns their distance in the underlying space. The most popular sublinear-space model is the streaming model. In this model, the input arrives as a data stream, and an algorithm is restricted to using a sublinear amount of memory relative to the input size. Both EMD and MST have been studied extensively in this model [5, 18, 28], showcasing the challenges and techniques for processing large datasets in the streaming model.

Although no sublinear algorithms for EMD and MST were known in the range counting oracle model prior to our work, other geometric optimization problems have been studied in this model. Monemizadeh [35] presented an algorithm for the facility location problem using O~(√n) queries, and Czumaj and Sohler [21] studied clustering problems and map labeling problems. To the best of our knowledge, these are the only works done in the range counting oracle model. However, there are several closely related models, such as the range-optimization model. For instance, Arya et al. [8] studied the range MST problem, where the goal is to construct a data structure for computing the cost of MST within a query orthogonal range. In this setting, clustering problems have also been considered [1, 37].

Preliminaries.

We make use of quadtrees for both EMD and MST. Consider a discrete space [Δ]². For i ∈ {0, 1, …, logΔ}, we use 𝒢_i to denote the grid over [Δ]² consisting of cells with side length 2^i. A quadtree 𝒬 on [Δ]² is then constructed as follows. The root of the quadtree corresponds to the unique cell of 𝒢_{logΔ}. Each node v whose cell belongs to 𝒢_i has four equal-sized children, each corresponding to a cell of 𝒢_{i−1} contained in the cell corresponding to v. Note that the depth of 𝒬 is O(logΔ). When it is clear from the context, we use a node of the quadtree and its corresponding cell interchangeably.

Let P be a point set in [Δ]^d. For both EMD and MST, we will use binary search in combination with range counting queries to find the k-th smallest point in dimension i, for i ≤ d. Furthermore, we will often want to sample a point from P uniformly at random. We can do so using O(logΔ) range counting queries via a procedure called telescoping sampling [35].

Lemma 1 (Telescoping Sampling [35]).

We can select a point of P uniformly at random using O(logΔ) range counting queries.
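Lemma 1 is proved in [35]; in one dimension, one way to realize such a sampler is to draw a uniform rank and binary-search for the point of that rank using O(logΔ) counting queries. The interface below (a `count(lo, hi)` callable standing in for one oracle query) and the toy point set are our own sketch, not the exact procedure of [35].

```python
import random
from bisect import bisect_left, bisect_right

def sample_uniform_point(count, n, delta):
    """Select a point of P uniformly at random: draw a uniform rank k in
    [1, n], then binary-search the domain [1, delta] for the k-th smallest
    point of P, using O(log delta) range counting queries."""
    k = random.randint(1, n)
    lo, hi = 1, delta
    while lo < hi:
        mid = (lo + hi) // 2
        if count(1, mid) >= k:   # at least k points lie in [1, mid]
            hi = mid
        else:
            lo = mid + 1
    return lo

# Toy oracle over P = {2, 5, 7} in the domain [1, 8].
P = [2, 5, 7]
count = lambda lo, hi: bisect_right(P, hi) - bisect_left(P, lo)
print(sample_uniform_point(count, len(P), 8))  # one of 2, 5, 7
```

In higher dimensions, the telescoping procedure of [35] fixes the coordinates one dimension at a time.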

Throughout this paper, we use O~(f(s)) to denote O(f(s)·polylogΔ). Due to space restrictions, some proofs are omitted; they can be found in the full version of the paper.

2 Sublinear algorithms for the Earth Mover’s Distance problem

In this section, we present several sublinear-time algorithms that approximate the cost of the Earth Mover's Distance between two point sets in [Δ]^d for an integer d ≥ 1. The approximation bounds we obtain are tight up to a factor of O(logΔ) for any integer d ≥ 1.

In a one-dimensional space, we know the configuration of an optimal matching: the k-th leftmost red point is matched with the k-th leftmost blue point for all 1 ≤ k ≤ n. Thus it suffices to estimate the cost of this matching. On the other hand, we do not know the exact configuration of an optimal matching in a higher-dimensional space. Thus, instead of considering an optimal matching, we consider a matching of cost O(𝖮𝖯𝖳·logΔ), where 𝖮𝖯𝖳 is the Earth Mover's Distance between the two point sets. This is where the multiplicative error of O(logΔ) comes from. Apart from this, we use the same approach for all dimensions d ≥ 1.
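The one-dimensional structure can be checked directly: with full access to the input, the optimal matching cost is computed exactly by sorting both sets and matching points by rank (a non-sublinear baseline, included only to make the matching structure concrete).

```python
def emd_1d(red, blue):
    """Exact 1D Earth Mover's Distance: the k-th leftmost red point is
    matched with the k-th leftmost blue point, for every k."""
    assert len(red) == len(blue)
    return sum(abs(r - b) for r, b in zip(sorted(red), sorted(blue)))

print(emd_1d([1, 4, 9], [2, 3, 10]))  # |1-2| + |4-3| + |9-10| = 3
```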

2.1 A sublinear algorithm for points in 1D

In this section, we present a sublinear algorithm using O~(s) range counting queries in a one-dimensional space. More precisely, we prove the following theorem. The optimal solution can be as large as Θ(nΔ), and in this case, the additive error is relatively small.

Theorem 2.

Given two sets R and B of size n in a discrete space [Δ] and a parameter s > 1, we can estimate the cost of the Earth Mover's Distance between R and B within an O(1) multiplicative error in addition to an O(Δn/s²) additive error with probability at least 2/3, using O~(s) range counting queries.

If no two points coincide, we have 𝖮𝖯𝖳 ≥ n. Therefore, we have the following corollary.

Corollary 3.

Given two sets R and B of size n in a discrete space [Δ] such that no two points of R∪B coincide, and a parameter s > 1, we can estimate the cost of the Earth Mover's Distance between R and B within an O(Δ/s²) multiplicative error with probability at least 2/3, using O~(s) range counting queries.

Let R = {r_1, r_2, …, r_n} and B = {b_1, b_2, …, b_n} be two point sets in [Δ], each sorted along the axis. Let M be an optimal matching between R and B. We distinguish two cases: an edge of M is long if its length is at least Δ/s, and it is short otherwise. Let 𝖮𝖯𝖳 be the Earth Mover's Distance between R and B.

We employ the following win-win strategy: for a long edge, a slight perturbation of its endpoints introduces only a constant relative error. Thus, we move its endpoints to the centers of grid segments, enabling us to estimate the total length efficiently. For short edges, we can disregard cases where the number of such edges is small, as this results in only a small additive error. Consequently, we may assume that the number of short edges is large, allowing us to estimate their total length using sampling. To further reduce the additive error, we partition the set of short edges into log(Δ/s) subsets based on edge length and then consider each subset separately.

Long edges.

For long edges, we subdivide [Δ] into segments each of length Δ/(2s). For each long edge (ri,bi), imagine that we move ri and bi to the centers of the segments containing them. Since (ri,bi) is long, this introduces an additive error of O(Δ/s), resulting in a constant relative error. Using this fact, we estimate the total length of all long edges within a constant relative error using O~(s) range counting queries. Let Llong be the resulting estimator.

Lemma 4.

We can estimate the total length of the long edges in M within O(1) multiplicative error using O~(s) range counting queries.

Proof.

We subdivide [Δ] into 2s segments, each of length Δ/(2s); thus the number of segments is O(s). Let 𝒮 be the set of the resulting segments. Then every long edge contains at least one segment. Imagine that we move every point p incident to a long edge to the center of the segment containing p. For each segment S, we compute the number n_r(S) of red points contained in S and the number n_b(S) of blue points contained in S using O(s) queries in total. Then we construct another point set P′ by adding n_r(S) red points and n_b(S) blue points at the center of each segment S. We can do this without using queries. For a point p ∈ R∪B, let p′ be its corresponding point in P′, and consider the matching M′ = {(r′,b′) : (r,b) ∈ M}. We can compute M′ explicitly without using any queries. Let M′_long be the set of edges of M′ containing at least one segment of 𝒮. Every long edge of M has its corresponding edge in M′_long. However, some short edges of M might also have their corresponding edges in M′_long. Observe that such short edges have length at least Δ/(2s), so the cost induced by them is within a constant factor of 𝖮𝖯𝖳. They will also be counted as short edges, but this increases the total cost by at most a constant factor. Therefore, the lemma holds.
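The snapping step in the proof of Lemma 4 can be sketched as follows; the interface (half-open range counting callables) and the toy instance are our own, and we assume for simplicity that 2s divides Δ.

```python
def estimate_long_edges(count_red, count_blue, delta, s):
    """Sketch of Lemma 4: snap all points to the centers of 2s segments of
    length delta/(2s), match the snapped points in sorted order (optimal in
    1D), and keep only the matched pairs spanning a full segment.
    count_red/count_blue answer half-open range counting queries [lo, hi)."""
    seg = delta // (2 * s)            # assumes 2s divides delta
    snapped_red, snapped_blue = [], []
    for i in range(2 * s):
        lo, hi = i * seg, (i + 1) * seg
        center = lo + seg / 2
        snapped_red  += [center] * count_red(lo, hi)
        snapped_blue += [center] * count_blue(lo, hi)
    # In 1D the optimal matching pairs the k-th red with the k-th blue point.
    return sum(abs(r - b)
               for r, b in zip(sorted(snapped_red), sorted(snapped_blue))
               if abs(r - b) >= seg)  # edge contains at least one segment

# Toy instance on [0, 8) with s = 1: R = {0}, B = {7}.
cnt = lambda P: (lambda lo, hi: sum(lo <= p < hi for p in P))
print(estimate_long_edges(cnt([0]), cnt([7]), 8, 1))  # snapped to 2 and 6 -> 4.0
```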

Short edges.

For short edges, we use random sampling. More specifically, we subdivide the set of short edges into t = log(Δ/s) subsets with respect to their lengths: E_1, E_2, …, E_t, where E_i is the set of all edges of M with lengths in [2^{i−1}, 2^i). Then we estimate the number of edges of E_i for every index i. To do this, we choose a random sample of R of size x = O(s·log²Δ). For each chosen point, we can find its mate. Let S_i denote the set of all edges between the sampled points and their mates. If |E_i| ≥ n/(s·logΔ), then the number of edges in E_i ∩ S_i, scaled appropriately, is a good estimator for |E_i|. Otherwise, we can ignore E_i, as it induces an additive error of at most Δn/(s²·logΔ); this is because every short edge has length at most Δ/s. Let ℓ_i = |E_i ∩ S_i|·(n/x). Then we have the following lemma.

Lemma 5.

If |E_i| ≥ n/(s·logΔ), we have ℓ_i = Θ(|E_i|) with a constant probability. Otherwise, we have ℓ_i = O(|E_i|) with a constant probability.

Proof.

Let Y_j be the random variable with Y_j = 1 if the j-th edge in S_i is contained in E_i, and Y_j = 0 otherwise. Let Y = Y_1 + ⋯ + Y_x, where x = s·log²Δ is the number of sampled points. By definition, we have 𝔼[Y_j] = |E_i|/n and 𝔼[Y] = x·|E_i|/n. Now we analyze the failure probability. If |E_i| ≥ n/(s·logΔ), we use a Chernoff bound: the probability that |𝔼[Y] − Y| ≥ ε·𝔼[Y] is at most 2·exp(−ε²·p·(s·log²Δ)/3), where p is the probability that Y_j = 1. In our case, p = |E_i|/n ≥ 1/(s·logΔ), so the failure probability is at most 1/Δ. If |E_i| < n/(s·logΔ), we use Markov's inequality: the probability that Y ≥ c·(x/n)·|E_i| is less than a small constant for a sufficiently large constant c. Therefore, the second part of the lemma also holds.
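The short-edge estimator can be sketched as follows. The primitive `edge_length_of_random_red`, returning the matching-edge length |r − π(r)| of a uniformly random red point, is our own abstraction of "choose a sampled point and find its mate"; the exact constants are illustrative.

```python
import math

def estimate_short_edge_classes(edge_length_of_random_red, n, s, delta):
    """Sketch of the short-edge estimator: sample x = O(s log^2 delta) red
    points, look up the length of each sample's matching edge, bucket short
    edges into classes E_i with lengths in [2^(i-1), 2^i), and scale each
    bucket count by n/x to obtain the estimates ell_i."""
    t = int(math.log2(delta / s))            # number of length classes
    x = s * int(math.log2(delta)) ** 2       # sample size
    buckets = [0] * (t + 1)
    for _ in range(x):
        ell = edge_length_of_random_red()    # |r - pi(r)| for a random r
        if 0 < ell < delta / s:              # short edge
            i = int(math.log2(ell)) + 1      # 2^(i-1) <= ell < 2^i
            if i <= t:
                buckets[i] += 1
    return [c * n / x for c in buckets]      # estimates ell_i of |E_i|
```

For instance, if every sampled red point has a matching edge of length 2 (class E_2), the estimator returns n for that class and 0 elsewhere.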

To amplify the success probability of Lemma 5, we repeat the procedure for computing ℓ_i O(logΔ) times and take the minimum. Let L = Σ_{i=1}^{t} 2^i·ℓ_i + L_long.

Lemma 6.

With a constant probability, (1/4)·𝖮𝖯𝖳 − nΔ/s² ≤ L ≤ 4·𝖮𝖯𝖳 + nΔ/s².

Proof.

Since we can estimate the total length of the long edges within an additive error of 2·𝖮𝖯𝖳, it suffices to focus on short edges. For each edge of E_i, we round its length up to 2^i; this induces an additive error of at most 2·𝖮𝖯𝖳. Now we use Lemma 5. Consider the event that the bounds of Lemma 5 hold for all indices i. Due to the amplification of the success probability made above, this event occurs with a constant probability. If |E_i| ≥ n/(s·logΔ), the estimator ℓ_i has an error of O(1)·𝖮𝖯𝖳 in total over all such indices. If |E_i| < n/(s·logΔ), we do not have a lower bound guarantee for ℓ_i. However, since the number of such edges is at most n/s in total, and the length of each such edge is at most Δ/s, ignoring them induces an additive error of at most nΔ/s². Therefore, the total additive error is at most nΔ/s².

2.2 A sublinear algorithm for higher dimensions

In this section, we present a sublinear algorithm that approximates the cost of the Earth Mover's Distance between two point sets in [Δ]^d for an integer d ≥ 2. More specifically, we prove the following theorem. Again, notice that the Earth Mover's Distance between two point sets can be as large as Θ(Δn) in the worst case.

Theorem 7.

Given two sets R and B of size n in a discrete space [Δ]^d and a parameter s > 1, we can estimate the cost of the Earth Mover's Distance between R and B within an O(logΔ)-relative error in addition to an O(Δn/s^{1+1/d}) additive error with probability at least 2/3, using O~(s) range counting queries.

If no two points coincide, we have 𝖮𝖯𝖳 ≥ n. Therefore, we have the following corollary.

Corollary 8.

Given two sets R and B of size n in a discrete space [Δ]^d and a parameter s > 1 such that no two points in R∪B coincide, we can estimate the cost of the Earth Mover's Distance between R and B within an O(max{logΔ, Δ/s^{1+1/d}})-relative error with probability at least 2/3, using O~(s) range counting queries.

Let 𝖮𝖯𝖳 be the Earth Mover's Distance between R and B. This algorithm is essentially the same as the algorithm for the one-dimensional case. The only difference is that we do not know the configuration of an optimal matching in a higher-dimensional space. Instead, we can use a matching obtained from the quadtree on [Δ]^d, which has cost O(logΔ)·𝖮𝖯𝖳. Then, as before, we distinguish two cases: if an edge has length at least Δ/s^{1/d}, it is long; otherwise, it is short. Then we can compute all long edges exactly, and we can estimate short edges using sampling. Details can be found in the full version of this paper.
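For intuition, a classic quadtree-type surrogate for EMD sums, over the grid levels, the cell side length times the red/blue discrepancy in each cell; for a randomly shifted grid its value is within an O(logΔ) factor of EMD in expectation. This is not the paper's exact estimator, but it illustrates where the multiplicative O(logΔ) factor comes from; the function name and fixed (unshifted) grid are our own choices.

```python
def grid_discrepancy_emd(red, blue, log_delta):
    """Grid-based surrogate for EMD in [Delta]^2: sum over levels i of
    (side length 2^i) * (red/blue discrepancy per cell at that level)."""
    total = 0
    for i in range(log_delta):           # levels with side length 2^i
        side = 1 << i
        disc = {}
        for (px, py) in red:
            key = (px // side, py // side)
            disc[key] = disc.get(key, 0) + 1
        for (px, py) in blue:
            key = (px // side, py // side)
            disc[key] = disc.get(key, 0) - 1
        total += side * sum(abs(v) for v in disc.values())
    return total

print(grid_discrepancy_emd([(0, 0)], [(3, 0)], 2))  # 1*2 + 2*2 = 6
```

Here the true EMD is 3 and the surrogate returns 6, consistent with an O(logΔ) overestimate.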

3 Lower bounds for the Earth Mover’s Distance

In this section, we show that for any parameter s, any randomized algorithm that approximates the Earth Mover's Distance between two sets within an additive error of O(nΔ/s^{1+1/d}) requires Ω(s) orthogonal range counting queries. More specifically, we prove:

Lemma 9.

For any parameter s and any integers d ≥ 1 and Δ ≥ 1, any randomized algorithm for approximating the Earth Mover's Distance between two point sets of size n in a discrete space [Δ]^d within additive error O(nΔ/s^{1+1/d}) uses Ω(s) range counting queries.

If points from B∪R may coincide, achieving any multiplicative approximation factor is impossible. Even when no two points of B∪R coincide, Lemma 9 shows that no randomized algorithm with a multiplicative error of O(Δ/s^{1+1/d}) uses o(s) range counting queries.

3.1 A lower bound for points in 1D

In this section, we prove Lemma 9 in the one-dimensional space; the proof will be extended to higher-dimensional spaces later. To analyze the lower bound on the additive error of a Monte Carlo algorithm, we use the following version of Yao's Minimax Theorem [6]: assume that there exists a randomized algorithm using s range counting queries with success probability at least 2/3. Then for any distribution μ of instances, there exists a deterministic algorithm using s range counting queries that returns desired answers on instances chosen from μ with probability at least 2/3. Here, the success probability of a randomized algorithm is the probability that its output is within the desired bound.

Thus it suffices to construct a distribution μ of instances such that no deterministic algorithm using s range counting queries has success probability at least 2/3. For this, we define two types of gadgets on a segment S of length Δ/(8s): the near gadget and the far gadget. See Figure 1. For the far gadget, we put 4n/s red points near the left endpoint of S and 4n/s blue points near the right endpoint of S. For the near gadget, we put 2n/s red points and 2n/s blue points alternately starting from the left endpoint of S so that the distance between any two consecutive points is one, and the starting point lies at the left endpoint of S. Additionally, we put 2n/s red points and 2n/s blue points alternately so that the distance between any two consecutive points is one, and the last point lies at the right endpoint of S. Notice that the cost induced by the near gadget is Θ(n/s), but the cost induced by the far gadget is Θ(nΔ/s²). Our strategy is to place O(s) copies of the near gadget inside [Δ] and to hide one far gadget inside [Δ] with probability 1/2. In this way, we can construct two types of instances, one with cost n and one with cost Θ(nΔ/s²).

Figure 1: (a) All segments use the near gadget. The cost of this instance is n. (b) The gray segment uses the far gadget. The cost of this instance is Θ(nΔ/s²).

More specifically, we define a distribution μ of instances as follows. We partition [Δ] into 8s segments, each of length Δ/(8s). Let 𝒮 = {S_1, S_2, …, S_{8s}} be the set of resulting segments along the axis. See Figure 1. With probability 1/2, we let t = 0; otherwise, we choose one index t from 1, 2, …, 8s uniformly at random. That is, the probability that a fixed index i is chosen is 1/(16s) for i = 1, 2, …, 8s. Then, for each index i ≠ t, we place the near gadget on S_i, and for i = t, we place the far gadget on S_t. Notice that we do not place the far gadget anywhere if t = 0. Thus it suffices to show that, for any deterministic algorithm using s range counting queries, the probability that it estimates the cost of 𝖮𝖯𝖳 within an additive error of O(nΔ/s²) on instances chosen from μ is less than 2/3.

Lemma 10.

For any deterministic algorithm using s range counting queries, the probability that it estimates the cost of an instance chosen from μ within an additive error of O(nΔ/s²) is less than 2/3.

3.2 Lower bounds for dimensions 2 and higher

We now extend the construction of the 1D case to the 2D case. We again use Yao's Minimax Theorem. Thus it suffices to construct a distribution μ of instances such that no deterministic algorithm using s range counting queries has success probability at least 2/3.

For this, we define two types of gadgets on a square S of side length Δ/(8√s): the near gadget and the far gadget. See Figure 2. For the far gadget, we put two copies of the far gadget constructed in the 1D case, one on the upper side of S and one on the lower side of S, so that a red point comes first on the upper side and a blue point comes first on the lower side. For the near gadget, we do the same: put two copies of the near gadget constructed in the 1D case, one on the upper side and one on the lower side, so that a red point comes first on the upper side and a blue point comes first on the lower side. In this way, every gadget has 8n/s red points and 8n/s blue points.

Figure 2: We partition [Δ]² into 16s squares. In each square, we place either the far gadget or the near gadget. The cost of the far gadget is at least nΔ/s^{1.5}, and the cost of the near gadget is Θ(n/s).

Now define a distribution μ of instances as follows. We construct the grid on [Δ]² consisting of 16s cells, each of side length Δ/(4√s). See Figure 2. We choose a cell W as follows: with probability 1/2, we let W = ∅; otherwise, we choose one cell from the 16s cells uniformly at random and let it be W. Then, for each cell in the grid except W, we place the near gadget in the middle of the cell, and we place the far gadget in the middle of W. Notice that we do not place the far gadget anywhere if W = ∅. Thus it suffices to show that, for any deterministic algorithm using s range counting queries, the probability that it estimates the cost of 𝖮𝖯𝖳 within an additive error of O(nΔ/s^{1.5}) on an instance chosen from μ is less than 2/3. We can prove this similarly to Lemma 10.

Lemma 11.

For any deterministic algorithm using s range counting queries, the probability that it estimates the cost of an instance chosen from μ within an additive error of O(nΔ/s^{1.5}) is less than 2/3.

Extension to a higher dimensional space.

The construction of an input distribution used in the two-dimensional case can be easily extended to a higher-dimensional space. We again define two gadgets on a d-dimensional cube S of side length Δ/(4s^{1/d}). For the far gadget, we put two copies of the far gadget constructed for the (d−1)-dimensional case, one on one facet of S and one on its parallel facet. We do the same for the near gadget using two copies of the near gadget constructed for the (d−1)-dimensional case. Then we can define a distribution of instances as we did for the two-dimensional space. Details can be found in the full version of this paper.

4 Sampling a non-empty cell uniformly at random

In this section, we show how to sample a grid cell (almost) uniformly at random from all grid cells containing points of P. Let 𝒢 be a grid of a certain side length r imposed on the discrete space [Δ]². Let 𝒞 = {c_1, c_2, …, c_m} denote the set of all grid cells of 𝒢 containing points of P; we say such cells are non-empty. For each cell c_i, we let n_i denote the number of points contained in c_i. We say sampling is c-approximate uniform if the probability that each non-empty cell is chosen is in [1/(cm), c/m] for a constant c > 1; or, we simply say sampling is almost uniform.

One might want to use the telescoping sampling of Lemma 1 directly: choose a point v of P uniformly at random, and return the cell of 𝒢 containing v. However, the probability that a cell c_i is chosen in this way is n_i/n, not 1/m. So, instead, we use a two-step sampling procedure as stated in Algorithm 1. Here, we are required to check if there are at most √n non-empty cells; if so, we find all of them explicitly. This can be done using O(√n·logΔ) range counting queries:

Lemma 12.

If the number of non-empty cells is O(√n), we can compute all of them using O(√n log Δ) range counting queries.
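The enumeration behind Lemma 12 can be realized by a quadtree-style descent that halves a region and recurses only into halves with a non-zero range count, spending O(log Δ) queries along the path to each reported cell. The sketch below is illustrative rather than the paper's exact procedure: the oracle `range_count` is simulated on an explicit point set, and Δ and r are assumed to be powers of two so that halving stays aligned with the grid.

```python
def range_count(points, x1, x2, y1, y2):
    # Simulated oracle: number of points in the box [x1, x2) x [y1, y2).
    return sum(1 for (x, y) in points if x1 <= x < x2 and y1 <= y < y2)

def nonempty_cells(points, delta, r):
    # Report all non-empty cells of the grid with side length r on [0, delta)^2.
    # Assumes delta and r are powers of two, so halving stays grid-aligned.
    found = []

    def recurse(x1, x2, y1, y2):
        if range_count(points, x1, x2, y1, y2) == 0:
            return                      # prune: empty region, no queries below
        if x2 - x1 == r and y2 - y1 == r:
            found.append((x1, y1))      # lower-left corner of a non-empty cell
            return
        if x2 - x1 >= y2 - y1:          # halve the longer side
            m = (x1 + x2) // 2
            recurse(x1, m, y1, y2)
            recurse(m, x2, y1, y2)
        else:
            m = (y1 + y2) // 2
            recurse(x1, x2, y1, m)
            recurse(x1, x2, m, y2)

    recurse(0, delta, 0, delta)
    return sorted(found)
```

Each non-empty cell is reached along a root-to-leaf path of O(log Δ) splits, and empty regions are pruned after a single query, which gives the query bound of the lemma up to constants.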

Algorithm 1 CellSampling(r).
Lemma 13.

CellSampling(r) selects a cell almost uniformly at random from all non-empty cells of 𝒢 using Õ(√n) range counting queries.
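Since the Algorithm 1 environment is not reproduced here, the following sketch shows one standard way such a two-step procedure can work, namely by rejection sampling; the helpers `sample_point` and `cell_count` are illustrative stand-ins for the point-sampling primitive of Lemma 1 and a single range counting query. A point lands in cell c_i with probability n_i/n, so accepting with probability 1/n_i makes every non-empty cell equally likely. When the number m of non-empty cells is at least √n, each trial succeeds with probability m/n ≥ 1/√n, so O(√n) trials suffice in expectation; otherwise one enumerates all non-empty cells via Lemma 12 and samples exactly.

```python
import random

def sample_point(points):
    # Stand-in for the uniform point-sampling primitive (Lemma 1).
    return random.choice(points)

def cell_count(points, cell, r):
    # Stand-in for one orthogonal range counting query on the cell.
    x, y = cell
    return sum(1 for (px, py) in points if x <= px < x + r and y <= py < y + r)

def cell_sampling(points, r):
    # Two-step rejection sampling: a trial returns each non-empty cell
    # with the same probability 1/n, hence uniformly overall.
    while True:
        px, py = sample_point(points)
        cell = (px // r * r, py // r * r)
        if random.random() * cell_count(points, cell, r) < 1.0:
            return cell
```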

Corollary 14.

We can estimate the number of non-empty cells within an O(1)-relative error using Õ(√n) range counting queries.

Lower bound.

Now, we show that any randomized algorithm that performs c-approximate uniform sampling of non-empty cells requires at least Ω(√n/c) range counting queries, for any parameter c ≥ 1. Our lower bound holds even in the discrete one-dimensional space [Δ] with Δ ≥ n. We consider an interval L of length Δ consisting of n cells, each of length Δ/n. We subdivide L into √n/(4c) segments, each of length 4cΔ/√n, and let 𝒮 denote the set of these segments. In this way, each segment in 𝒮 contains (4cΔ/√n)/(Δ/n) = 4c√n cells. We construct a set of √n/(4c)+1 instances as follows.

Figure 3: Illustration for the uniform instance and a non-uniform instance. The gray segment is the witness of the non-uniform instance.

We construct one uniform instance on L. In this instance, for each segment S ∈ 𝒮, we place 4c√n points in its leftmost cell. Therefore, the uniform instance contains √n/(4c) non-empty cells (one per segment) and n points in total.

Next, we construct √n/(4c) non-uniform instances on L. For the i-th non-uniform instance, we select the i-th segment, which we refer to as the witness segment. In the leftmost cell of this witness segment, we place 2c√n points, and in each of the 2c√n cells following the leftmost cell, we place one point. In each of the remaining √n/(4c)−1 non-witness segments, we place 4c√n points in the leftmost cell, as in the uniform instance. In this way, every non-uniform instance has 2c√n + 1 + √n/(4c) − 1 = 2c√n + √n/(4c) non-empty cells and contains 4c√n·(√n/(4c)−1) + 2c√n + 2c√n = n points in total. Thus, the ratio between the number of non-empty cells of a non-uniform instance and that of the uniform instance is (2c√n + √n/(4c))/(√n/(4c)) = 8c² + 1. Both types of instances are shown in Figure 3.
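The bookkeeping above can be checked mechanically. The sketch below is illustrative (instances are represented as maps from cell index to point count, with the concrete parameters c = 1 and n = 64); it builds the uniform instance and one non-uniform instance and verifies the point totals and the (8c² + 1) ratio of non-empty cells.

```python
def uniform_instance(n, c):
    # sqrt(n)/(4c) segments of 4c*sqrt(n) cells each; 4c*sqrt(n) points in
    # the leftmost cell of every segment (cell index -> point count).
    rn = round(n ** 0.5)
    seg_cells = 4 * c * rn
    return {s * seg_cells: 4 * c * rn for s in range(rn // (4 * c))}

def nonuniform_instance(n, c, witness):
    # Replace the witness segment: 2c*sqrt(n) points in its leftmost cell,
    # then one point in each of the following 2c*sqrt(n) cells.
    counts = uniform_instance(n, c)
    rn = round(n ** 0.5)
    base = witness * 4 * c * rn
    counts[base] = 2 * c * rn
    for j in range(1, 2 * c * rn + 1):
        counts[base + j] = 1
    return counts
```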

Lemma 15.

Given an instance I drawn from the set of instances constructed above, any randomized algorithm that determines whether I is the uniform instance with success probability at least 2/3 requires Ω(√n/c) range counting queries.

Lemma 16.

Any algorithm that performs c-approximate uniform sampling of non-empty cells requires Ω(√n/c) range counting queries, for any parameter c ≥ 1.

5 Estimating the cost of a minimum spanning tree

Let P ⊆ [Δ]^2 be a set of n points. In this section, we present a (1+ε)-approximation algorithm for estimating the cost 𝖮𝖯𝖳 of a minimum spanning tree of P that uses Õ(√n) range counting queries. The algorithm is randomized and succeeds with constant probability. We then show that any randomized constant-factor approximation algorithm requires Ω(n^{1/3}) range counting queries.

5.1 Algorithm for the minimum spanning tree problem

For general graphs, Chazelle, Rubinfeld, and Trevisan [17] presented a (1+ε)-approximation algorithm for estimating the cost of a minimum spanning tree of a graph G. This so-called CRT algorithm uses two types of oracles: one samples a vertex of G uniformly at random, and one gives access to the list of neighbors of each vertex. If the average degree of G is d, and the edge weights of G come from {1,…,w}, then the running time of this algorithm is O(d·w·ε^{−2}·log(dw/ε)). Our algorithm is based on the CRT algorithm. Since it works for graphs with bounded average degree, we use a quadtree-based (1+ε)-spanner S of P introduced by [12]. Here, the maximum degree of S is O(log Δ), and the cost of a minimum spanning tree of P is within a factor of (1+ε) of the cost of a minimum spanning tree of S, as shown in [30]. Thus it suffices to estimate the cost of a minimum spanning tree of S.

However, a main issue here is that the running time depends on the maximum edge weight, which is Δ in our case. To handle this, we need to look at the CRT algorithm more carefully. As we will see later, the problem reduces to estimating the number c_i of components of S_i for every i ∈ {0,1,…,log_{1+ε}(2Δ)}, where S_i denotes the subgraph of G induced by the edges of length less than (1+ε)^i. To do this, the CRT algorithm samples ε^{−2} vertices from V(S_i), and computes the number of edges contained in the component of S_i containing each sampled vertex v. If v is contained in a small component, we can traverse S_i until we visit all vertices of the component. If v is contained in a large component, we cannot compute this number, and we simply ignore v. This induces an additive error of ε(1+ε)^i·(n/t) in the final estimator, where t is the size threshold separating large components from small components. To get an approximation factor of 1+ε, we would need to set t = Δ, which is too large for our purpose: traversing a component of size t requires at least t queries. Here, the factor n/t in the additive error is exactly the number of components that we cannot traverse using O(t) queries.
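The component-counting reduction rests on the identity from [17]: for a connected graph G on n vertices with integer edge weights in {1,…,w}, the MST cost equals n − w + Σ_{i=1}^{w−1} c_i, where c_i is the number of connected components of the subgraph induced by the edges of weight at most i (the geometric algorithm applies the analogous identity with weight classes (1+ε)^i). A minimal sketch of the identity itself, with components counted by union-find:

```python
def num_components(n, edges):
    # Number of connected components of the graph ([0..n-1], edges),
    # computed with union-find (path halving).
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    comps = n
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            comps -= 1
    return comps

def mst_via_components(n, weighted_edges, w):
    # CRT identity: MST(G) = n - w + sum_{i=1}^{w-1} c_i for a connected
    # graph G with integer edge weights in {1, ..., w}.
    return n - w + sum(
        num_components(n, [(u, v) for (u, v, wt) in weighted_edges if wt <= i])
        for i in range(1, w))
```

The identity turns MST estimation into w − 1 component-counting problems, which is exactly why estimating the numbers c_i (with sublinearly many queries per level) suffices.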

Our strategy is to traverse S_i in a cell-by-cell manner. More specifically, we consider a grid 𝒢 of side length (1+ε)^i/2, and contract all vertices of S_i lying in a single cell of 𝒢 into one vertex. Then, using the same number of queries, we can visit more of the original vertices. By a careful analysis, we can reduce the additive error to ε(n/t) ≤ ε·𝖮𝖯𝖳. This idea was already used by [28] for maintaining the cost of a minimum spanning tree in the dynamic streaming setting. (As in [28], we could use the complete Euclidean graph on P instead of a spanner S of P. However, this requires additional work: given two cells c and c′ and a value r, we would need to check whether there are two points p ∈ c and p′ ∈ c′ with ‖p−p′‖ = r, which cannot be done using O(log Δ) range counting queries. One could resolve this by defining another distance function, but we feel that working with spanners makes the analysis simpler.) The cell-by-cell traversal raises a new issue, however: we need to sample a cell containing a point of P uniformly at random. Here, we can use Lemma 13 to choose a random sample from the vertices of S_i.

We then obtain the following theorem. Details can be found in the full version of this paper.

Theorem 17.

Given a set P of n points in the discrete space [Δ]^2, we can estimate the cost of a minimum spanning tree of P within a factor of (1+ε) with constant probability using Õ(√n) range counting queries.

5.2 Lower bound for the minimum spanning tree problem

In this section, we show that any randomized constant-factor approximation algorithm for the Euclidean minimum spanning tree problem uses Ω(n^{1/3}) range counting queries. For this, we construct a distribution μ of instances such that any randomized algorithm using o(n^{1/3}) queries fails to obtain a constant-factor approximate solution with probability at least 1/3.

Let [Δ]^2 be a discrete domain with Δ = O(n). We subdivide [Δ]^2 into 16n^{1/3} equal-sized cells, each of side length 4n^{5/6}. See Figure 4. In the middle of each cell, we place either the strip gadget or the uniform gadget. Each gadget is defined on the domain [n^{5/6}]^2, which we further subdivide into n^{4/3} finer cells of side length n^{1/6}. For the strip gadget, we put one point in each finer cell on one diagonal of [n^{5/6}]^2. For the uniform gadget, we put one point in the finer cell in the i-th row and the ((i·n^{1/3}) mod n^{2/3})-th column, for i = 0,1,…,n^{2/3}−1. In total, each gadget contains n^{2/3} points. Notice that the cost of a minimum spanning tree of the points inside the strip gadget is Θ(n^{5/6}), while the cost of a minimum spanning tree of the points inside the uniform gadget is Θ(n^{7/6}). Now we define a distribution μ of instances. With probability 1/2, we place copies of the strip gadget in all cells of the domain [Δ]^2. Otherwise, we pick one cell c of the domain [Δ]^2 uniformly at random, place a copy of the uniform gadget in c, and place copies of the strip gadget in all other cells. Then we have the following lemma.
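The two gadget costs can be sanity-checked numerically. The sketch below is illustrative only: it uses n = 64 (so that n^{1/6} is an integer) and a quadratic-time Prim's algorithm on the complete Euclidean graph, and confirms that the strip gadget is cheaper than the uniform gadget by a constant factor at this scale, in line with the Θ(n^{5/6}) versus Θ(n^{7/6}) bounds.

```python
import math

def mst_cost(pts):
    # Prim's algorithm on the complete Euclidean graph (fine for small inputs).
    n = len(pts)
    in_tree = [False] * n
    dist = [math.inf] * n
    dist[0] = 0.0
    total = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=dist.__getitem__)
        in_tree[u] = True
        total += dist[u]
        for v in range(n):
            if not in_tree[v]:
                dist[v] = min(dist[v], math.dist(pts[u], pts[v]))
    return total

def strip_gadget(n):
    # One point per finer cell (side n^(1/6)) on the diagonal of [n^(5/6)]^2.
    s, k = round(n ** (1 / 6)), round(n ** (2 / 3))
    return [(s * i, s * i) for i in range(k)]

def uniform_gadget(n):
    # One point in row i, column (i * n^(1/3)) mod n^(2/3), for i < n^(2/3).
    s, k, step = round(n ** (1 / 6)), round(n ** (2 / 3)), round(n ** (1 / 3))
    return [(s * ((i * step) % k), s * i) for i in range(k)]
```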

Lemma 18.

Let I and I′ be two instances in the support of μ such that I uses the uniform gadget but I′ does not. Then 𝖬𝖲𝖳(I) ≥ 2·𝖬𝖲𝖳(I′), where 𝖬𝖲𝖳(·) denotes the cost of a minimum spanning tree.

Proof.

Let M be a minimum spanning tree of I. Observe that the cost of M is at least 2n^{7/6}: M contains at least n^{1/3} edges connecting two points lying in different cells of the domain [Δ]^2, and the total length of these edges is at least 2n^{7/6}.

Now we derive an upper bound on the cost of a minimum spanning tree of I′ by constructing a spanning tree of I′ from M. The two instances are identical except that I has a uniform gadget in some cell, say c, while I′ has a strip gadget in c. By the cut property, at most O(1) edges of M have one endpoint in c and the other endpoint outside of c, and each such edge has length at most 8n^{5/6}. We reconnect each such edge to an arbitrary vertex in c. Then we remove from M all edges having both endpoints in c, and add the edges of a minimum spanning tree of the strip gadget. In total, the cost of M decreases by at least n^{7/6} − O(1)·n^{5/6}; here, the term n^{7/6} is the cost of the uniform gadget. Therefore, we have 𝖬𝖲𝖳(I′) ≤ 𝖬𝖲𝖳(I) − n^{7/6} + O(1)·n^{5/6} ≤ 𝖬𝖲𝖳(I) − (1/2)n^{7/6}.

Figure 4: The domain [Δ]^2 is partitioned into 16n^{1/3} cells. Each cell contains the strip gadget or the uniform gadget. The strip gadget has cost Θ(n^{5/6}), while the uniform gadget has cost Θ(n^{7/6}).
Lemma 19.

Any randomized constant-factor approximation algorithm for the minimum spanning tree problem on a point set of size n in a discrete space [Δ]^d requires Ω(n^{1/3}) range counting queries.

References

  • [1] Mikkel Abrahamsen, Mark de Berg, Kevin Buchin, Mehran Mehr, and Ali D. Mehrabi. Range-Clustering Queries. In Proceedings of the 33rd International Symposium on Computational Geometry (SoCG 2017), volume 77, pages 5:1–5:16, 2017. doi:10.4230/LIPICS.SOCG.2017.5.
  • [2] Peyman Afshani, Lars Arge, and Kasper Dalgaard Larsen. Orthogonal range reporting in three and higher dimensions. In Proceedings of the 50th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2009), pages 149–158, 2009. doi:10.1109/FOCS.2009.58.
  • [3] Pankaj K Agarwal, Hsien-Chih Chang, Sharath Raghvendra, and Allen Xiao. Deterministic, near-linear ε-approximation algorithm for geometric bipartite matching. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2022), pages 1052–1065, 2022. doi:10.1145/3519935.3519977.
  • [4] Pankaj K. Agarwal, Hsien-Chih Chang, and Allen Xiao. Efficient algorithms for geometric partial matching. In Proceedings of the 35th International Symposium on Computational Geometry (SoCG 2019), pages 6:1–6:14, 2019. doi:10.4230/LIPICS.SOCG.2019.6.
  • [5] Alexandr Andoni, Khanh Do Ba, Piotr Indyk, and David Woodruff. Efficient sketches for earth-mover distance, with applications. In Proceedings of the 50th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2009), pages 324–330, 2009. doi:10.1109/FOCS.2009.25.
  • [6] Sanjeev Arora and Boaz Barak. Computational complexity: a modern approach, 2009.
  • [7] Sunil Arya and David M Mount. Approximate range searching. Computational Geometry, 17(3-4):135–152, 2000. doi:10.1016/S0925-7721(00)00022-5.
  • [8] Sunil Arya, David M. Mount, and Eunhui Park. Approximate geometric mst range queries. In Proceedings of the 31st International Symposium on Computational Geometry (SoCG 2015), volume 34, pages 781–795, 2015. doi:10.4230/LIPICS.SOCG.2015.781.
  • [9] Khanh Do Ba, Huy L Nguyen, Huy N Nguyen, and Ronitt Rubinfeld. Sublinear time algorithms for earth mover’s distance. Theory of Computing Systems, 48:428–442, 2011. doi:10.1007/S00224-010-9265-8.
  • [10] Soheil Behnezhad, Mohammad Roghani, and Aviad Rubinstein. Sublinear time algorithms and complexity of approximate maximum matching. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing (STOC 2023), pages 267–280, 2023. doi:10.1145/3564246.3585231.
  • [11] Nicolas Bonneel, Michiel Van De Panne, Sylvain Paris, and Wolfgang Heidrich. Displacement interpolation using lagrangian mass transport. In Proceedings of the 2011 SIGGRAPH Asia conference, pages 1–12, 2011.
  • [12] Paul B Callahan and S Rao Kosaraju. A decomposition of multidimensional point sets with applications to k-nearest-neighbors and n-body potential fields. Journal of the ACM (JACM), 42(1):67–90, 1995. doi:10.1145/200836.200853.
  • [13] Timothy M Chan. Orthogonal range searching in moderate dimensions: kd trees and range trees strike back. Discrete & Computational Geometry, 61:899–922, 2019. doi:10.1007/S00454-019-00062-5.
  • [14] Timothy M Chan, Kasper Green Larsen, and Mihai Pătraşcu. Orthogonal range searching on the RAM, revisited. In Proceedings of the twenty-seventh annual Symposium on Computational Geometry (SoCG 2011), pages 1–10, 2011.
  • [15] Timothy M Chan and Konstantinos Tsakalidis. Dynamic orthogonal range searching on the RAM, revisited. Journal of Computational Geometry, 9(2):45–66, 2018. doi:10.20382/JOCG.V9I2A5.
  • [16] Bernard Chazelle, Ding Liu, and Avner Magen. Sublinear geometric algorithms. In Proceedings of the thirty-fifth annual ACM Symposium on Theory of Computing (STOC 2003), pages 531–540, 2003. doi:10.1145/780542.780620.
  • [17] Bernard Chazelle, Ronitt Rubinfeld, and Luca Trevisan. Approximating the minimum spanning tree weight in sublinear time. SIAM J. Comput., 34(6):1370–1379, 2005. doi:10.1137/S0097539702403244.
  • [18] Xi Chen, Vincent Cohen-Addad, Rajesh Jayaram, Amit Levi, and Erik Waingarten. Streaming Euclidean MST to a constant factor. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing (STOC 2023), pages 156–169, 2023. doi:10.1145/3564246.3585168.
  • [19] Graham Cormode, Hossein Jowhari, Morteza Monemizadeh, and S. Muthukrishnan. The sparse awakens: Streaming algorithms for matching size estimation in sparse graphs. In Kirk Pruhs and Christian Sohler, editors, 25th Annual European Symposium on Algorithms, ESA 2017, September 4-6, 2017, Vienna, Austria, volume 87 of LIPIcs, pages 29:1–29:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2017. doi:10.4230/LIPICS.ESA.2017.29.
  • [20] Artur Czumaj, Funda Ergün, Lance Fortnow, Avner Magen, Ilan Newman, Ronitt Rubinfeld, and Christian Sohler. Approximating the weight of the Euclidean minimum spanning tree in sublinear time. SIAM J. Comput., 35(1):91–109, 2005. doi:10.1137/S0097539703435297.
  • [21] Artur Czumaj and Christian Sohler. Property testing with geometric queries. In Proceedings of the 9th Annual European Symposium on Algorithms (ESA 2001), pages 266–277, 2001. doi:10.1007/3-540-44676-1_22.
  • [22] Artur Czumaj and Christian Sohler. Estimating the weight of metric minimum spanning trees in sublinear-time. In Proceedings of the thirty-sixth annual ACM Symposium on Theory of Computing (STOC 2004), pages 175–183, 2004. doi:10.1145/1007352.1007386.
  • [23] Artur Czumaj and Christian Sohler. Estimating the weight of metric minimum spanning trees in sublinear time. SIAM J. Comput., 39(3):904–922, 2009. doi:10.1137/060672121.
  • [24] Artur Czumaj and Christian Sohler. Sublinear-time algorithms. Property testing: current research and surveys, pages 41–64, 2010. doi:10.1007/978-3-642-16367-8_5.
  • [25] Mark de Berg, Otfried Cheong, Marc van Kreveld, and Mark Overmars. Computational geometry: Algorithms and applications, 2008.
  • [26] Talya Eden, Dana Ron, and C Seshadhri. On approximating the number of k-cliques in sublinear time. In Proceedings of the 50th annual ACM SIGACT Symposium on Theory of Computing (STOC 2018), pages 722–734, 2018. doi:10.1145/3188745.3188810.
  • [27] Rémi Flamary, Marco Cuturi, Nicolas Courty, and Alain Rakotomamonjy. Wasserstein discriminant analysis. Machine Learning, 107:1923–1945, 2018. doi:10.1007/S10994-018-5717-1.
  • [28] Gereon Frahling, Piotr Indyk, and Christian Sohler. Sampling in dynamic data streams and applications. In Proceedings of the twenty-first annual Symposium on Computational Geometry (SoCG 2005), pages 142–149, 2005. doi:10.1145/1064092.1064116.
  • [29] Elazar Goldenberg, Robert Krauthgamer, and Barna Saha. Sublinear algorithms for gap edit distance. In Proceedings of the 60th Annual Symposium on Foundations of Computer Science (FOCS 2019), pages 1101–1120, 2019. doi:10.1109/FOCS.2019.00070.
  • [30] Sariel Har-Peled. Geometric approximation algorithms. Number 173 in Mathematical Surveys and Monographs. American Mathematical Soc., 2011.
  • [31] Sariel Har-Peled, Mitchell Jones, and Saladi Rahul. Active-learning a convex body in low dimensions. Algorithmica, 83:1885–1917, 2021. doi:10.1007/S00453-021-00807-W.
  • [32] Tomasz Kociumaka and Barna Saha. Sublinear-time algorithms for computing & embedding gap edit distance. In Proceedings of the 61st Annual Symposium on Foundations of Computer Science (FOCS 2020), pages 1168–1179, 2020. doi:10.1109/FOCS46700.2020.00112.
  • [33] Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pages 957–966, 2015. URL: http://proceedings.mlr.press/v37/kusnerb15.html.
  • [34] Michael Mitzenmacher and Saeed Seddighin. Improved sublinear time algorithm for longest increasing subsequence. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA 2021), pages 1934–1947, 2021. doi:10.1137/1.9781611976465.115.
  • [35] Morteza Monemizadeh. Facility location in the sublinear geometric model. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2023), volume 275, pages 6:1–6:24, 2023. doi:10.4230/LIPICS.APPROX/RANDOM.2023.6.
  • [36] Morteza Monemizadeh and David P. Woodruff. 1-pass relative-error lp-sampling with applications. In Moses Charikar, editor, Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17-19, 2010, pages 1143–1160. SIAM, 2010. doi:10.1137/1.9781611973075.92.
  • [37] Eunjin Oh and Hee-Kap Ahn. Approximate range queries for clustering. In 34th International Symposium on Computational Geometry (SoCG 2018), volume 99, pages 62:1–62:14, 2018. doi:10.4230/LIPICS.SOCG.2018.62.
  • [38] Krzysztof Onak, Dana Ron, Michal Rosen, and Ronitt Rubinfeld. A near-optimal sublinear-time algorithm for approximating the minimum vertex cover size. In Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms (SODA 2012), pages 1123–1131, 2012. doi:10.1137/1.9781611973099.88.
  • [39] Ronitt Rubinfeld and Asaf Shapira. Sublinear time algorithms. SIAM Journal on Discrete Mathematics, 25(4):1562–1588, 2011. doi:10.1137/100791075.
  • [40] Cheng Sheng and Yufei Tao. New results on two-dimensional orthogonal range aggregation in external memory. In Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of Database Systems (PODS 2011), pages 129–139, 2011. doi:10.1145/1989284.1989297.