
Range Counting Oracles for Geometric Problems

Anne Driemel, University of Bonn, Germany; Morteza Monemizadeh, Department of Mathematics and Computer Science, TU Eindhoven, the Netherlands; Eunjin Oh, Department of Computer Science and Engineering, POSTECH, Pohang, South Korea; Frank Staals, Department of Information and Computing Sciences, Utrecht University, The Netherlands; David P. Woodruff, Carnegie Mellon University, Pittsburgh, PA, USA
Abstract

In this paper, we study estimators for geometric optimization problems in the sublinear geometric model. In this model, we have oracle access to a point set of size n in a discrete space [Δ]^d, where queries can be made to an oracle that responds to orthogonal range counting requests. The query complexity of an optimization problem is measured by the number of oracle queries required to compute an estimator for the problem. We investigate two problems in this framework, the Euclidean Minimum Spanning Tree (MST) and the Earth Mover's Distance (EMD). For EMD, we show the existence of an estimator that approximates the cost of EMD with O(logΔ)-relative error and O(nΔ/s^{1+1/d})-additive error using O(s·polylogΔ) range counting queries for any parameter s with 1 ≤ s ≤ n. Moreover, we prove that this bound is tight. For MST, we demonstrate that the weight of MST can be estimated within a factor of (1±ε) using O~(√n) range counting queries.

Keywords and phrases:
Range counting oracles, minimum spanning trees, Earth Mover’s Distance
Funding:
Eunjin Oh: Supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.RS-2024-00440239, Sublinear Scalable Algorithms for Large-Scale Data Analysis) and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No.RS-2024-00358505).
David P. Woodruff: supported in part by Office of Naval Research award number N000142112647 and a Simons Investigator Award.
Copyright and License:
© Anne Driemel, Morteza Monemizadeh, Eunjin Oh, Frank Staals, and David P. Woodruff; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Computational geometry
Related Version:
Full Version: https://arxiv.org/abs/2504.15292
Acknowledgements:
This research was initiated during the Workshop “Massive Data Models and Computational Geometry” held at the University of Bonn in September 2024 and funded by DFG, German Research Foundation, through EXC 2047 Hausdorff Center for Mathematics and FOR 5361 KI-FOR Algorithmic Data Analytics for Geodesy (AlgoForGe). We thank the organizers and participants for the stimulating environment that inspired this research.
Editors:
Oswin Aichholzer and Haitao Wang

1 Introduction

In recent years, the size of data encountered in various applications has grown exponentially. While classical algorithms are designed to process the entire input to get exact or approximate solutions, the increasing demand for efficiency in massive datasets necessitates approaches that can provide useful results without examining all of the input data. Motivated by this, a wide range of sublinear-time algorithms has been extensively studied over the past few decades from various subfields in theoretical computer science. For instance, there are sublinear-time algorithms for the longest increasing subsequence and edit distance problems on strings [32, 34, 29], the triangle counting and vertex cover problems on graphs [10, 26, 38], Earth Mover’s Distance on probability distributions [9], and minimum spanning tree on general metric spaces [22]. Moreover, there are sublinear-time algorithms for several fundamental geometric problems [16, 20, 21, 31, 35]. For further information, refer to [24, 39].

To obtain sublinear-time algorithms, we need an oracle to access the input data. While there are common oracles for strings, graphs, and metric spaces, various sublinear-time oracle models exist in the geometric setting. Chazelle et al. [16] presented several geometric algorithms assuming that the input is given without any preprocessing. In the case of point inputs, the only thing one can do is uniform sampling; in the case of polygons, one can also check if two edges are adjacent. Although they showed that several problems admit sublinear-time algorithms in this model, it does not seem strong enough to solve a wider range of fundamental geometric problems such as clustering problems, the Earth Mover's Distance problem, and the minimum spanning tree problem. Subsequently, many researchers developed sublinear-time algorithms for several different models. Examples are models in which the oracle can answer orthogonal range emptiness queries and cone nearest point queries [20], orthogonal range counting queries [21, 35], or separation queries [31].

In this paper, we study the range counting oracle model for geometric optimization problems. Here, we do not have direct access to the input point set, but instead we use an orthogonal range counting data structure for access. More specifically, given an orthogonal range as a query, it returns the number of input points contained in the query range. In fact, many database software systems, such as Oracle (https://docs.oracle.com/en/database/other-databases/nosql-database/24.3/sqlreferencefornosql/operator1.html) and Amazon SimpleDB (https://docs.aws.amazon.com/AmazonSimpleDB/latest/DeveloperGuide/RangeQueriesSelect.html), provide built-in support for such range queries. These queries allow users to compute aggregates, such as the count, sum, or average of records that fall within a specified range of values for a set of attributes, which is crucial for a wide array of applications. For example, in data analytics, range queries are used to calculate the number of sales within a specific date range or determine the total revenue from products within a given price interval. In geographic information systems (GIS), range queries help aggregate spatial data points within certain coordinate bounds, such as counting the number of locations within a specific radius of a given point. Furthermore, in machine learning, range queries are employed during data preprocessing to summarize statistics over selected subsets, which supports tasks such as data filtering and dimensionality reduction.

The following are desirable properties for a reasonable oracle model: queries to the oracle should be efficient to implement, the oracle should be supported by many well-known databases, and it should allow sublinear-time algorithms for fundamental geometric problems. The range counting oracle clearly satisfies the first property: there are numerous works on data structures for range counting queries and their variants [2, 7, 14, 15, 13, 25, 40]. In particular, one can preprocess a set of n points in ℝ^d in O(n·log^{d−1} n) time to construct a data structure of size O(n·log^{d−1} n) so that the number of points inside a query range can be found in O(log^{d−1} n) time. Second, range counting queries are already supported by well-known public databases, as mentioned earlier. Therefore, without any special preprocessing, we can apply sublinear-time algorithms designed for the range counting oracle to such databases. Third, we show in this paper that several fundamental problems can be solved in sublinear time in the range counting oracle model.
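For illustration, a one-dimensional range counting oracle can be simulated with two binary searches over a sorted array; the class name and the toy point set below are our own, not part of the cited data structures.

```python
from bisect import bisect_left, bisect_right

class RangeCountOracle1D:
    """A 1D orthogonal range counting oracle: O(n log n) preprocessing
    (sorting), O(log n) per query (two binary searches)."""
    def __init__(self, points):
        self.pts = sorted(points)

    def count(self, lo, hi):
        # number of points p with lo <= p <= hi
        return bisect_right(self.pts, hi) - bisect_left(self.pts, lo)

oracle = RangeCountOracle1D([3, 1, 4, 1, 5, 9, 2, 6])
print(oracle.count(2, 5))  # counts 2, 3, 4, 5 -> 4
```

The cited O(n·log^{d−1} n) structures extend this idea to higher dimensions via multi-level range trees.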

We focus on the following three fundamental geometric problems: the Earth Mover's Distance problem, the minimum spanning tree problem, and the cell sampling problem. The Earth Mover's Distance 𝖤𝖬𝖣(R,B) between two point sets R and B in ℝ^d and the cost 𝖬𝖲𝖳(P) of a minimum spanning tree of a point set P ⊂ ℝ^d are defined as follows.

𝖤𝖬𝖣(R,B) = min_{π:R→B} Σ_{r∈R} ‖r − π(r)‖  and  𝖬𝖲𝖳(P) = min_{T} Σ_{uv∈E(T)} ‖u − v‖,

where π ranges over all one-to-one matchings between R and B, and T ranges over all spanning trees of P. If we have direct access to the input point set(s), then 𝖤𝖬𝖣(R,B) can be computed exactly in near-quadratic time in ℝ² [4], and approximately within (1+ε)-relative error in near-linear time [3]. Similarly, 𝖬𝖲𝖳(P) can be computed exactly in near-quadratic time by computing the complete graph on P and applying Kruskal's algorithm, and approximately in near-linear time [30]. Note that 𝖤𝖬𝖣 can be used for measuring the similarity between two point sets, and 𝖬𝖲𝖳 can be used for summarizing the distribution of a point set. Thus these problems have various applications in multiple areas of computer science including computer vision [11], machine learning [27] and document similarity [33].

The cell sampling problem is defined as follows. Let P be a set of points in a discrete space [Δ]^d. Given a value r with 1 ≤ r ≤ Δ, let 𝒢 be the grid imposed on [Δ]^d whose cells have side length r. We say a grid cell is non-empty if it contains a point of P. Our goal is to sample one non-empty cell almost uniformly at random from the set of all non-empty cells of 𝒢. If we have direct access to P, this problem is trivial as we can compute all non-empty cells explicitly. However, in our sublinear model, we cannot compute all non-empty cells if the number of non-empty cells exceeds our desired query complexity. We believe that cell sampling can be considered a fundamental and basic primitive in sublinear models, as sublinear algorithms rely on efficient sampling to extract meaningful information from large data sets. In fact, this problem has been studied in the dynamic streaming model, and has been used as a primitive for several geometric problems [28, 36] and graph problems [19].

Table 1: Summary of our results. Here, n denotes the number of points, and Δ denotes the size of the domain. For 𝖤𝖬𝖣, a parameter s determines a trade-off between the additive error and the query complexity.
                   Additive Error       Multiplicative Error   Query Complexity
𝖤𝖬𝖣           UB   O(nΔ/s^{1+1/d})      O(logΔ)                O~(s)
              LB   Ω(nΔ/s^{1+1/d})      —                      O~(s)
Cell Sampling UB   —                    1±ε                    O~(√n)
              LB   —                    O(1)                   Ω(√n)
𝖬𝖲𝖳           UB   —                    1±ε                    O~(√n)
              LB   —                    O(1)                   Ω(n^{1/3})

Our results.

We present sublinear-time algorithms for 𝖤𝖬𝖣, Cell Sampling, and MST in the orthogonal range counting oracle model. Our results are summarized in Table 1.

  • For 𝖤𝖬𝖣, we present an algorithm for estimating the Earth Mover's Distance between two point sets of size n in [Δ]^d within a multiplicative error of O(logΔ) in addition to an additive error of O(nΔ/s^{1+1/d}). Our algorithm succeeds with a constant probability, and uses O~(s) range counting queries for any parameter s. Notice that the optimal solution can be as large as Θ(nΔ), in which case the additive error is relatively small. Furthermore, we show that this trade-off is tight for any parameter s.

  • For Cell Sampling, we present an algorithm that samples a non-empty cell in a grid so that the probability that a fixed non-empty cell is selected is in [(1−ε)/m, (1+ε)/m] for any constant 0 < ε < 1, where m denotes the number of non-empty cells. We also show that the query complexity here is tight. Moreover, if we sample O~(√n·log n) non-empty cells, we can estimate the number of non-empty cells within a (1±ε)-factor.

  • For 𝖬𝖲𝖳, we present an algorithm for estimating the cost of a minimum spanning tree of a point set of size n in [Δ]² within a multiplicative error of 1±ε. Our algorithm again succeeds with constant probability, and uses O~(√n) range counting queries. Also, we show that any randomized algorithm using o(n^{1/3}) queries has at least a constant multiplicative error.

Related work.

Both EMD and MST have been extensively studied across various sublinear models over the past few decades. The work most closely related to ours is the one by Czumaj et al. [20]. They studied the Euclidean minimum spanning tree problem in a different model. In particular, their oracle supports cone nearest queries, orthogonal range emptiness queries, and point sampling queries. (More precisely, the oracle supports cone nearest queries and orthogonal range queries only, but they are also given access to the input point set, which allows them to sample a vertex uniformly at random.) Here, given a cone nearest query consisting of a cone C with apex p, the oracle returns a (1+δ)-approximate nearest neighbor of p in C∩P, where P is the set of input points. Using O~(√n) queries, their algorithm estimates the cost of the MST of n points within a relative error of (1±ε). Moreover, they showed that if the oracle supports orthogonal emptiness queries only, then any deterministic algorithm for estimating the cost of MST within O(n^{1/4})-relative error uses Ω(√n) orthogonal range emptiness queries. Sublinear-time algorithms for computing a metric minimum spanning tree have also been studied [23]. Here, the oracle supports distance queries: given two points, it returns their distance in the underlying space. The most popular sublinear-space model is the streaming model. In this model, the input arrives as a data stream, and an algorithm is restricted to using a sublinear amount of memory relative to the input size. Both EMD and MST have been studied extensively in this model [5, 18, 28], showcasing the challenges and techniques for processing large datasets in the streaming model.

Although no sublinear algorithms for EMD and MST were known in the range counting oracle model prior to our work, other geometric optimization problems have been studied in this model. Monemizadeh [35] presented an algorithm for the facility location problem using O~(√n) queries, and Czumaj and Sohler [21] studied clustering problems and map labeling problems. To the best of our knowledge, these are the only works done in the range counting oracle model. However, there are several closely related models, such as the range-optimization model. For instance, Arya et al. [8] studied the range MST problem, where the goal is to construct a data structure for computing the cost of MST within a query orthogonal range. In this setting, clustering problems have also been considered [1, 37].

Preliminaries.

We make use of quadtrees for both EMD and MST. Consider a discrete space [Δ]². For i ∈ {0, 1, …, logΔ}, we use 𝒢_i to denote the grid over [Δ]² consisting of cells with side length 2^i. A quadtree 𝒬 on [Δ]² is then constructed as follows. The root of the quadtree corresponds to the unique cell of 𝒢_{logΔ}. Each node v whose cell belongs to 𝒢_i has four equal-sized children, each corresponding to a cell of 𝒢_{i−1} contained in the cell corresponding to v. Note that the depth of 𝒬 is O(logΔ). When it is clear from the context, we use a node of the quadtree and its corresponding cell interchangeably.

Let P be a point set in [Δ]^d. For both EMD and MST, we will use binary search in combination with range counting queries to find the k-th smallest point in dimension i, for i ≤ d. Furthermore, we will often want to sample a point from P uniformly at random. We can do so using O(logΔ) range counting queries via a procedure called telescoping sampling [35].

Lemma 1 (Telescoping Sampling [35]).

We can select a point of P uniformly at random using O(logΔ) range counting queries.
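Lemma 1 is proved in [35]; in one dimension, one way to realize such a sampler is to draw a uniform rank and binary-search for the point of that rank using O(logΔ) counting queries. The interface below (a `count(lo, hi)` callable standing in for one oracle query) and the toy point set are our own sketch, not the exact procedure of [35].

```python
import random
from bisect import bisect_left, bisect_right

def sample_uniform_point(count, n, delta):
    """Select a point of P uniformly at random: draw a uniform rank k in
    [1, n], then binary-search the domain [1, delta] for the k-th smallest
    point of P, using O(log delta) range counting queries."""
    k = random.randint(1, n)
    lo, hi = 1, delta
    while lo < hi:
        mid = (lo + hi) // 2
        if count(1, mid) >= k:   # at least k points lie in [1, mid]
            hi = mid
        else:
            lo = mid + 1
    return lo

# Toy oracle over P = {2, 5, 7} in the domain [1, 8].
P = [2, 5, 7]
count = lambda lo, hi: bisect_right(P, hi) - bisect_left(P, lo)
print(sample_uniform_point(count, len(P), 8))  # one of 2, 5, 7
```

In higher dimensions, the telescoping procedure of [35] fixes the coordinates one dimension at a time.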

Throughout this paper, we use O~(f(s)) to denote O(f(s)·polylogΔ). Due to space restrictions, some proofs are omitted; they can be found in the full version of the paper.

2 Sublinear algorithms for the Earth Mover’s Distance problem

In this section, we present several sublinear-time algorithms that approximate the cost of the Earth Mover's Distance between two point sets in [Δ]^d for an integer d ≥ 1. The approximation bounds we obtain are tight up to a factor of O(logΔ) for any integer d ≥ 1.

In a one-dimensional space, we know the configuration of an optimal matching: the k-th leftmost red point is matched with the k-th leftmost blue point for all 1 ≤ k ≤ n. Thus it suffices to estimate the cost of this matching. On the other hand, we do not know the exact configuration of an optimal matching in a higher-dimensional space. Thus, instead of considering an optimal matching, we consider a matching of cost O(𝖮𝖯𝖳·logΔ), where 𝖮𝖯𝖳 is the Earth Mover's Distance between the two point sets. This is where the multiplicative error of O(logΔ) comes from. Apart from this, we use the same approach for all dimensions d ≥ 1.
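The one-dimensional structure can be checked directly: with full access to the input, the optimal matching cost is computed exactly by sorting both sets and matching points by rank (a non-sublinear baseline, included only to make the matching structure concrete).

```python
def emd_1d(red, blue):
    """Exact 1D Earth Mover's Distance: the k-th leftmost red point is
    matched with the k-th leftmost blue point, for every k."""
    assert len(red) == len(blue)
    return sum(abs(r - b) for r, b in zip(sorted(red), sorted(blue)))

print(emd_1d([1, 4, 9], [2, 3, 10]))  # |1-2| + |4-3| + |9-10| = 3
```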

2.1 A sublinear algorithm for points in 1D

In this section, we present a sublinear algorithm using O~(s) range counting queries in a one-dimensional space. More precisely, we prove the following theorem. The optimal solution can be as large as Θ(nΔ), and in this case, the additive error is relatively small.

Theorem 2.

Given two sets R and B of size n in a discrete space [Δ] and a parameter s > 1, we can estimate the cost of the Earth Mover's Distance between R and B within an O(1) multiplicative error in addition to an O(Δn/s²) additive error with probability at least 2/3, using O~(s) range counting queries.

If no two points coincide, we have 𝖮𝖯𝖳 ≥ n. Therefore, we have the following corollary.

Corollary 3.

Given two sets R and B of size n in a discrete space [Δ] such that no two points of R∪B coincide, and a parameter s > 1, we can estimate the cost of the Earth Mover's Distance between R and B within an O(Δ/s²) multiplicative error with probability at least 2/3, using O~(s) range counting queries.

Let R = {r_1, r_2, …, r_n} and B = {b_1, b_2, …, b_n} be two point sets in [Δ], each sorted along the axis. Let M be an optimal matching between R and B. We distinguish two cases: an edge of M is long if its length is at least Δ/s, and it is short otherwise. Let 𝖮𝖯𝖳 be the Earth Mover's Distance between R and B.

We employ the following win-win strategy: for a long edge, a slight perturbation of its endpoints introduces only a constant relative error. Thus, we move its endpoints to the centers of grid segments, enabling us to estimate the total length efficiently. For short edges, we can disregard cases where the number of such edges is small, as this results in only a small additive error. Consequently, we may assume that the number of short edges is large, allowing us to estimate their total length using sampling. To further reduce the additive error, we partition the set of short edges into log(Δ/s) subsets based on edge length and then consider each subset separately.

Long edges.

For long edges, we subdivide [Δ] into segments each of length Δ/(2s). For each long edge (ri,bi), imagine that we move ri and bi to the centers of the segments containing them. Since (ri,bi) is long, this introduces an additive error of O(Δ/s), resulting in a constant relative error. Using this fact, we estimate the total length of all long edges within a constant relative error using O~(s) range counting queries. Let Llong be the resulting estimator.

Lemma 4.

We can estimate the total length of the long edges in M within O(1) multiplicative error using O~(s) range counting queries.

Proof.

We subdivide [Δ] into 2s segments, each of length Δ/(2s); thus the number of segments is O(s). Let 𝒮 be the set of the resulting segments. Then every long edge contains at least one segment. Imagine that we move every point p incident to a long edge to the center of the segment containing p. For each segment S, we compute the number n_r(S) of red points contained in S and the number n_b(S) of blue points contained in S using O(s) queries in total. Then we construct another point set P′ by adding n_r(S) red points and n_b(S) blue points at the center of each segment S. We can do this without using queries. For a point p ∈ R∪B, let p′ be its corresponding point in P′, and consider the matching M′ = {(r′,b′) : (r,b) ∈ M}. We can compute M′ explicitly without using any queries. Let M′_long be the set of edges of M′ containing at least one segment of 𝒮. Every long edge of M has its corresponding edge in M′_long. However, some short edges of M might also have their corresponding edges in M′_long. Observe that such short edges have length at least Δ/(2s), so the cost induced by them is within a constant factor of 𝖮𝖯𝖳. They will also be counted as short edges, but this increases the total cost by at most a constant factor. Therefore, the lemma holds.
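The snapping step in the proof of Lemma 4 can be sketched as follows; the interface (half-open range counting callables) and the toy instance are our own, and we assume for simplicity that 2s divides Δ.

```python
def estimate_long_edges(count_red, count_blue, delta, s):
    """Sketch of Lemma 4: snap all points to the centers of 2s segments of
    length delta/(2s), match the snapped points in sorted order (optimal in
    1D), and keep only the matched pairs spanning a full segment.
    count_red/count_blue answer half-open range counting queries [lo, hi)."""
    seg = delta // (2 * s)            # assumes 2s divides delta
    snapped_red, snapped_blue = [], []
    for i in range(2 * s):
        lo, hi = i * seg, (i + 1) * seg
        center = lo + seg / 2
        snapped_red  += [center] * count_red(lo, hi)
        snapped_blue += [center] * count_blue(lo, hi)
    # In 1D the optimal matching pairs the k-th red with the k-th blue point.
    return sum(abs(r - b)
               for r, b in zip(sorted(snapped_red), sorted(snapped_blue))
               if abs(r - b) >= seg)  # edge contains at least one segment

# Toy instance on [0, 8) with s = 1: R = {0}, B = {7}.
cnt = lambda P: (lambda lo, hi: sum(lo <= p < hi for p in P))
print(estimate_long_edges(cnt([0]), cnt([7]), 8, 1))  # snapped to 2 and 6 -> 4.0
```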

Short edges.

For short edges, we use random sampling. More specifically, we subdivide the set of short edges into t = log(Δ/s) subsets with respect to their lengths: E_1, E_2, …, E_t, where E_i is the set of all edges of M with lengths in [2^{i−1}, 2^i). Then we estimate the number of edges of E_i for every index i. To do this, we choose a random sample of R of size x = O(s·log²Δ). For each chosen point, we can find its mate. Let S_i denote the set of all edges between the sampled points and their mates. If |E_i| ≥ n/(s·logΔ), then the number of edges in E_i ∩ S_i, scaled appropriately, is a good estimator for |E_i|. Otherwise, we can ignore E_i, as it induces an additive error of at most Δn/(s²·logΔ); this is because every short edge has length at most Δ/s. Let ℓ_i = |E_i ∩ S_i|·(n/x). Then we have the following lemma.

Lemma 5.

If |E_i| ≥ n/(s·logΔ), we have ℓ_i = Θ(|E_i|) with a constant probability. Otherwise, we have ℓ_i = O(|E_i|) with a constant probability.

Proof.

Let Y_j be the random variable with Y_j = 1 if the j-th edge in S_i is contained in E_i, and Y_j = 0 otherwise. Let Y = Y_1 + ⋯ + Y_x, where x = s·log²Δ is the number of sampled points. By definition, we have 𝔼[Y_j] = |E_i|/n and 𝔼[Y] = x·|E_i|/n. Now we analyze the failure probability. If |E_i| ≥ n/(s·logΔ), we use a Chernoff bound: the probability that |𝔼[Y] − Y| ≥ ε·𝔼[Y] is at most 2·exp(−ε²·p·(s·log²Δ)/3), where p is the probability that Y_j = 1. In our case, p = |E_i|/n ≥ 1/(s·logΔ), so the failure probability is at most 1/Δ. If |E_i| < n/(s·logΔ), we use Markov's inequality: the probability that Y ≥ c·(x/n)·|E_i| is less than a small constant for a sufficiently large constant c. Therefore, the second part of the lemma also holds.
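The short-edge estimator can be sketched as follows. The primitive `edge_length_of_random_red`, returning the matching-edge length |r − π(r)| of a uniformly random red point, is our own abstraction of "choose a sampled point and find its mate"; the exact constants are illustrative.

```python
import math

def estimate_short_edge_classes(edge_length_of_random_red, n, s, delta):
    """Sketch of the short-edge estimator: sample x = O(s log^2 delta) red
    points, look up the length of each sample's matching edge, bucket short
    edges into classes E_i with lengths in [2^(i-1), 2^i), and scale each
    bucket count by n/x to obtain the estimates ell_i."""
    t = int(math.log2(delta / s))            # number of length classes
    x = s * int(math.log2(delta)) ** 2       # sample size
    buckets = [0] * (t + 1)
    for _ in range(x):
        ell = edge_length_of_random_red()    # |r - pi(r)| for a random r
        if 0 < ell < delta / s:              # short edge
            i = int(math.log2(ell)) + 1      # 2^(i-1) <= ell < 2^i
            if i <= t:
                buckets[i] += 1
    return [c * n / x for c in buckets]      # estimates ell_i of |E_i|
```

For instance, if every sampled red point has a matching edge of length 2 (class E_2), the estimator returns n for that class and 0 elsewhere.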

To amplify the success probability of Lemma 5, we repeat the procedure for computing ℓ_i O(logΔ) times and take the minimum. Let L = Σ_{i=1}^{t} 2^i·ℓ_i + L_long.

Lemma 6.

With a constant probability, (1/4)·𝖮𝖯𝖳 − nΔ/s² ≤ L ≤ 4·𝖮𝖯𝖳 + nΔ/s².

Proof.

Since we can estimate the total length of the long edges within an additive error of 2·𝖮𝖯𝖳, it suffices to focus on short edges. For each edge of E_i, we round its length up to 2^i; this induces an additive error of at most 2·𝖮𝖯𝖳. Now we use Lemma 5. Consider the event that the bounds of Lemma 5 hold for all indices i. Due to the amplification of the success probability made above, this event occurs with a constant probability. If |E_i| ≥ n/(s·logΔ), the estimator ℓ_i has an error of O(1)·𝖮𝖯𝖳 in total over all such indices. If |E_i| < n/(s·logΔ), we do not have a lower bound guarantee for ℓ_i. However, since the number of such edges is at most n/s in total, and the length of each such edge is at most Δ/s, ignoring them induces an additive error of at most nΔ/s². Therefore, the total additive error is at most nΔ/s².

2.2 A sublinear algorithm for higher dimensions

In this section, we present a sublinear algorithm that approximates the cost of the Earth Mover's Distance between two point sets in [Δ]^d for an integer d ≥ 2. More specifically, we prove the following theorem. Again, notice that the Earth Mover's Distance between two point sets can be as large as Θ(Δn) in the worst case.

Theorem 7.

Given two sets R and B of size n in a discrete space [Δ]^d and a parameter s > 1, we can estimate the cost of the Earth Mover's Distance between R and B within an O(logΔ)-relative error in addition to an O(Δn/s^{1+1/d}) additive error with probability at least 2/3, using O~(s) range counting queries.

If no two points coincide, we have 𝖮𝖯𝖳 ≥ n. Therefore, we have the following corollary.

Corollary 8.

Given two sets R and B of size n in a discrete space [Δ]^d and a parameter s > 1 such that no two points in R∪B coincide, we can estimate the cost of the Earth Mover's Distance between R and B within an O(max{logΔ, Δ/s^{1+1/d}})-relative error with probability at least 2/3, using O~(s) range counting queries.

Let 𝖮𝖯𝖳 be the Earth Mover's Distance between R and B. This algorithm is essentially the same as the algorithm for the one-dimensional case. The only difference is that we do not know the configuration of an optimal matching in a higher-dimensional space. Instead, we can use a matching obtained from the quadtree on [Δ]^d, which has cost O(logΔ)·𝖮𝖯𝖳. Then, as before, we distinguish two cases: if an edge has length at least Δ/s^{1/d}, it is long; otherwise, it is short. Then we can compute all long edges exactly, and we can estimate short edges using sampling. Details can be found in the full version of this paper.
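For intuition, a classic quadtree-type surrogate for EMD sums, over the grid levels, the cell side length times the red/blue discrepancy in each cell; for a randomly shifted grid its value is within an O(logΔ) factor of EMD in expectation. This is not the paper's exact estimator, but it illustrates where the multiplicative O(logΔ) factor comes from; the function name and fixed (unshifted) grid are our own choices.

```python
def grid_discrepancy_emd(red, blue, log_delta):
    """Grid-based surrogate for EMD in [Delta]^2: sum over levels i of
    (side length 2^i) * (red/blue discrepancy per cell at that level)."""
    total = 0
    for i in range(log_delta):           # levels with side length 2^i
        side = 1 << i
        disc = {}
        for (px, py) in red:
            key = (px // side, py // side)
            disc[key] = disc.get(key, 0) + 1
        for (px, py) in blue:
            key = (px // side, py // side)
            disc[key] = disc.get(key, 0) - 1
        total += side * sum(abs(v) for v in disc.values())
    return total

print(grid_discrepancy_emd([(0, 0)], [(3, 0)], 2))  # 1*2 + 2*2 = 6
```

Here the true EMD is 3 and the surrogate returns 6, consistent with an O(logΔ) overestimate.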

3 Lower bounds for the Earth Mover’s Distance

In this section, we show that for any parameter s, any randomized algorithm that approximates the Earth Mover's Distance between two sets within an additive error of O(nΔ/s^{1+1/d}) requires Ω(s) orthogonal range counting queries. More specifically, we prove:

Lemma 9.

For any parameter s and any integers d ≥ 1 and Δ ≥ 1, any randomized algorithm for approximating the Earth Mover's Distance between two point sets of size n in a discrete space [Δ]^d within additive error O(nΔ/s^{1+1/d}) uses Ω(s) range counting queries.

If points from B∪R may coincide, achieving any multiplicative approximation factor is impossible. Even when no two points of B∪R coincide, Lemma 9 shows that no randomized algorithm with a multiplicative error of O(Δ/s^{1+1/d}) uses o(s) range counting queries.

3.1 A lower bound for points in 1D

In this section, we prove Lemma 9 in the one-dimensional space; the proof will be extended to higher-dimensional spaces later. To analyze the lower bound on the additive error of a Monte Carlo algorithm, we use the following version of Yao's Minimax Theorem [6]: assume that there exists a randomized algorithm using s range counting queries with success probability at least 2/3. Then for any distribution μ of instances, there exists a deterministic algorithm using s range counting queries that returns desired answers on instances chosen from μ with probability at least 2/3. Here, the success probability of a randomized algorithm is the probability that its output is within the desired bound.

Thus it suffices to construct a distribution μ of instances such that no deterministic algorithm using s range counting queries has success probability at least 2/3. For this, we define two types of gadgets on a segment S of length Δ/(8s): the near gadget and the far gadget. See Figure 1. For the far gadget, we put 4n/s red points near the left endpoint of S and 4n/s blue points near the right endpoint of S. For the near gadget, we put 2n/s red points and 2n/s blue points alternately starting from the left endpoint of S so that the distance between any two consecutive points is one, and the starting point lies at the left endpoint of S. Additionally, we put 2n/s red points and 2n/s blue points alternately so that the distance between any two consecutive points is one, and the last point lies at the right endpoint of S. Notice that the cost induced by the near gadget is Θ(n/s), but the cost induced by the far gadget is Θ(nΔ/s²). Our strategy is to place O(s) copies of the near gadget inside [Δ] and to hide one far gadget inside [Δ] with probability 1/2. In this way, we can construct two types of instances, one with cost n and one with cost Θ(nΔ/s²).

Figure 1: (a) All segments use the near gadget. The cost of this instance is n. (b) The gray segment uses the far gadget. The cost of this instance is Θ(nΔ/s²).

More specifically, we define a distribution μ of instances as follows. We partition [Δ] into 8s segments, each of length Δ/(8s). Let 𝒮 = {S_1, S_2, …, S_{8s}} be the set of resulting segments along the axis. See Figure 1. With probability 1/2, we let t = 0; otherwise, we choose one index t from 1, 2, …, 8s uniformly at random. That is, the probability that a fixed index i is chosen is 1/(16s) for i = 1, 2, …, 8s. Then, for each index i ≠ t, we place the near gadget on S_i, and for i = t, we place the far gadget on S_t. Notice that we do not place the far gadget anywhere if t = 0. Thus it suffices to show that, for any deterministic algorithm using s range counting queries, the probability that it estimates the cost of 𝖮𝖯𝖳 within an additive error of O(nΔ/s²) on instances chosen from μ is less than 2/3.

Lemma 10.

For any deterministic algorithm using s range counting queries, the probability that it estimates the cost of an instance chosen from μ within an additive error of O(nΔ/s²) is less than 2/3.

3.2 Lower bounds for dimensions 2 and higher

We now extend the construction of the 1D case to the 2D case. We again use Yao's Minimax Theorem. Thus it suffices to construct a distribution μ of instances such that no deterministic algorithm using s range counting queries has success probability at least 2/3.

For this, we define two types of gadgets on a square S of side length Δ/(8√s): the near gadget and the far gadget. See Figure 2. For the far gadget, we put two copies of the far gadget constructed in the 1D case, one on the upper side of S and one on the lower side of S, so that a red point comes first on the upper side and a blue point comes first on the lower side. For the near gadget, we do the same: put two copies of the near gadget constructed in the 1D case, one on the upper side and one on the lower side, so that a red point comes first on the upper side and a blue point comes first on the lower side. In this way, every gadget has 8n/s red points and 8n/s blue points.

Figure 2: We partition [Δ]² into 16s squares. In each square, we place either the far gadget or the near gadget. The cost of the far gadget is at least nΔ/s^{1.5}, and the cost of the near gadget is Θ(n/s).

Now define a distribution μ of instances as follows. We construct the grid on [Δ]² consisting of 16s cells, each of side length Δ/(4√s). See Figure 2. We choose a cell W as follows: with probability 1/2, we let W = ∅; otherwise, we choose one cell from the 16s cells uniformly at random and let it be W. Then, for each cell in the grid except W, we place the near gadget in the middle of the cell, and we place the far gadget in the middle of W. Notice that we do not place the far gadget anywhere if W = ∅. Thus it suffices to show that, for any deterministic algorithm using s range counting queries, the probability that it estimates the cost of 𝖮𝖯𝖳 within an additive error of O(nΔ/s^{1.5}) on an instance chosen from μ is less than 2/3. We can prove this similarly to Lemma 10.

Lemma 11.

For any deterministic algorithm using s range counting queries, the probability that it estimates the cost of an instance chosen from μ within an additive error of O(nΔ/s^{1.5}) is less than 2/3.

Extension to a higher dimensional space.

The construction of an input distribution used in the two-dimensional case can be easily extended to a higher-dimensional space. We again define two gadgets on a d-dimensional cube S of side length Δ/(4s^{1/d}). For the far gadget, we put two copies of the far gadget constructed for the (d−1)-dimensional case, one on one facet of S and one on its parallel facet. We do the same for the near gadget using two copies of the near gadget constructed for the (d−1)-dimensional case. Then we can define a distribution of instances as we did for the two-dimensional space. Details can be found in the full version of this paper.

4 Sampling a non-empty cell uniformly at random

In this section, we show how to sample a grid cell (almost) uniformly at random from all grid cells containing points of P. Let 𝒢 be a grid of a certain side length r imposed on the discrete space [Δ]². Let 𝒞 = {c_1, c_2, …, c_m} denote the set of all grid cells of 𝒢 containing points of P; we say such cells are non-empty. For each cell c_i, we let n_i denote the number of points contained in c_i. We say sampling is c-approximate uniform if the probability that each non-empty cell is chosen is in [1/(cm), c/m] for a constant c > 1; or, we simply say sampling is almost uniform.

One might want to use the telescoping sampling of Lemma 1 directly: choose a point v of P uniformly at random, and return the cell of 𝒢 containing v. However, the probability that a cell c_i is chosen in this way is n_i/n, not 1/m. So, instead, we use a two-step sampling procedure as stated in Algorithm 1. Here, we are required to check if there are at most √n non-empty cells; if so, we find all of them explicitly. This can be done using O(√n·logΔ) range counting queries:

Lemma 12.

If the number of non-empty cells is O(√n), we can compute all of them using O(√n log Δ) range counting queries.
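The enumeration behind Lemma 12 can be realized by a quadtree-style descent that halves a region and recurses only into halves with a non-zero range count, spending O(log Δ) queries along the path to each reported cell. The sketch below is illustrative rather than the paper's exact procedure: the oracle `range_count` is simulated on an explicit point set, and Δ and r are assumed to be powers of two so that halving stays aligned with the grid.

```python
def range_count(points, x1, x2, y1, y2):
    # Simulated oracle: number of points in the box [x1, x2) x [y1, y2).
    return sum(1 for (x, y) in points if x1 <= x < x2 and y1 <= y < y2)

def nonempty_cells(points, delta, r):
    # Report all non-empty cells of the grid with side length r on [0, delta)^2.
    # Assumes delta and r are powers of two, so halving stays grid-aligned.
    found = []

    def recurse(x1, x2, y1, y2):
        if range_count(points, x1, x2, y1, y2) == 0:
            return                      # prune: empty region, no queries below
        if x2 - x1 == r and y2 - y1 == r:
            found.append((x1, y1))      # lower-left corner of a non-empty cell
            return
        if x2 - x1 >= y2 - y1:          # halve the longer side
            m = (x1 + x2) // 2
            recurse(x1, m, y1, y2)
            recurse(m, x2, y1, y2)
        else:
            m = (y1 + y2) // 2
            recurse(x1, x2, y1, m)
            recurse(x1, x2, m, y2)

    recurse(0, delta, 0, delta)
    return sorted(found)
```

Each non-empty cell is reached along a root-to-leaf path of O(log Δ) splits, and empty regions are pruned after a single query, which gives the query bound of the lemma up to constants.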

Algorithm 1 CellSampling(r).
Lemma 13.

CellSampling(r) selects a cell almost uniformly at random from all non-empty cells of 𝒢 using Õ(√n) range counting queries.
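Since the Algorithm 1 environment is not reproduced here, the following sketch shows one standard way such a two-step procedure can work, namely by rejection sampling; the helpers `sample_point` and `cell_count` are illustrative stand-ins for the point-sampling primitive of Lemma 1 and a single range counting query. A point lands in cell c_i with probability n_i/n, so accepting with probability 1/n_i makes every non-empty cell equally likely. When the number m of non-empty cells is at least √n, each trial succeeds with probability m/n ≥ 1/√n, so O(√n) trials suffice in expectation; otherwise one enumerates all non-empty cells via Lemma 12 and samples exactly.

```python
import random

def sample_point(points):
    # Stand-in for the uniform point-sampling primitive (Lemma 1).
    return random.choice(points)

def cell_count(points, cell, r):
    # Stand-in for one orthogonal range counting query on the cell.
    x, y = cell
    return sum(1 for (px, py) in points if x <= px < x + r and y <= py < y + r)

def cell_sampling(points, r):
    # Two-step rejection sampling: a trial returns each non-empty cell
    # with the same probability 1/n, hence uniformly overall.
    while True:
        px, py = sample_point(points)
        cell = (px // r * r, py // r * r)
        if random.random() * cell_count(points, cell, r) < 1.0:
            return cell
```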

Corollary 14.

We can estimate the number of non-empty cells within an O(1)-relative error using Õ(√n) range counting queries.

Lower bound.

Now, we show that any randomized algorithm that performs c-approximate uniform sampling of non-empty cells requires at least Ω(√n/c) range counting queries, for any parameter c ≥ 1. Our lower bound holds even in the discrete one-dimensional space [Δ] with Δ ≥ n. We consider an interval L of length Δ consisting of n cells, each of length Δ/n. We subdivide L into √n/(4c) segments, each of length 4cΔ/√n, and let 𝒮 denote the set of these segments. In this way, each segment in 𝒮 contains (4cΔ/√n)/(Δ/n) = 4c√n cells. We construct a set of √n/(4c)+1 instances as follows.

Figure 3: Illustration for the uniform instance and a non-uniform instance. The gray segment is the witness of the non-uniform instance.

We construct one uniform instance on L. In this instance, for each segment S ∈ 𝒮, we place 4c√n points in its leftmost cell. Therefore, the uniform instance contains √n/(4c) non-empty cells (one per segment) and n points in total.

Next, we construct √n/(4c) non-uniform instances on L. For the i-th non-uniform instance, we select the i-th segment, which we refer to as the witness segment. In the leftmost cell of this witness segment, we place 2c√n points, and in each of the 2c√n cells following the leftmost cell, we place one point. In each of the remaining √n/(4c)−1 non-witness segments, we place 4c√n points in the leftmost cell, as in the uniform instance. In this way, every non-uniform instance has 2c√n + 1 + √n/(4c) − 1 = 2c√n + √n/(4c) non-empty cells and contains 4c√n·(√n/(4c)−1) + 2c√n + 2c√n = n points in total. Thus, the ratio between the number of non-empty cells of a non-uniform instance and that of the uniform instance is (2c√n + √n/(4c))/(√n/(4c)) = 8c² + 1. Both types of instances are shown in Figure 3.
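The bookkeeping above can be checked mechanically. The sketch below is illustrative (instances are represented as maps from cell index to point count, with the concrete parameters c = 1 and n = 64); it builds the uniform instance and one non-uniform instance and verifies the point totals and the (8c² + 1) ratio of non-empty cells.

```python
def uniform_instance(n, c):
    # sqrt(n)/(4c) segments of 4c*sqrt(n) cells each; 4c*sqrt(n) points in
    # the leftmost cell of every segment (cell index -> point count).
    rn = round(n ** 0.5)
    seg_cells = 4 * c * rn
    return {s * seg_cells: 4 * c * rn for s in range(rn // (4 * c))}

def nonuniform_instance(n, c, witness):
    # Replace the witness segment: 2c*sqrt(n) points in its leftmost cell,
    # then one point in each of the following 2c*sqrt(n) cells.
    counts = uniform_instance(n, c)
    rn = round(n ** 0.5)
    base = witness * 4 * c * rn
    counts[base] = 2 * c * rn
    for j in range(1, 2 * c * rn + 1):
        counts[base + j] = 1
    return counts
```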

Lemma 15.

Given an instance I drawn from the set of instances constructed above, any randomized algorithm that determines whether I is the uniform instance with success probability at least 2/3 requires Ω(√n/c) range counting queries.

Lemma 16.

Any algorithm that performs c-approximate uniform sampling of non-empty cells requires Ω(√n/c) range counting queries, for any parameter c ≥ 1.

5 Estimating the cost of a minimum spanning tree

Let P ⊆ [Δ]^2 be a set of n points. In this section, we present a (1+ε)-approximation algorithm for estimating the cost 𝖮𝖯𝖳 of a minimum spanning tree of P that uses Õ(√n) range counting queries. The algorithm is randomized and succeeds with constant probability. We then show that any randomized constant-factor approximation algorithm requires Ω(n^{1/3}) range counting queries.

5.1 Algorithm for the minimum spanning tree problem

For general graphs, Chazelle, Rubinfeld, and Trevisan [17] presented a (1+ε)-approximation algorithm for estimating the cost of a minimum spanning tree of a graph G. This so-called CRT algorithm uses two types of oracles: one samples a vertex of G uniformly at random, and one gives access to the list of neighbors of each vertex. If the average degree of G is d, and the edge weights of G come from {1,…,w}, then the running time of this algorithm is O(d·w·ε^{−2}·log(dw/ε)). Our algorithm is based on the CRT algorithm. Since it works for graphs with bounded average degree, we use a quadtree-based (1+ε)-spanner S of P introduced by [12]. Here, the maximum degree of S is O(log Δ), and the cost of a minimum spanning tree of P is within a factor of (1+ε) of the cost of a minimum spanning tree of S, as shown in [30]. Thus it suffices to estimate the cost of a minimum spanning tree of S.

However, a main issue here is that the running time depends on the maximum edge weight, which is Δ in our case. To handle this, we need to look at the CRT algorithm more carefully. As we will see later, the problem reduces to estimating the number c_i of components of S_i for every i ∈ {0,1,…,log_{1+ε}(2Δ)}, where S_i denotes the subgraph of G induced by the edges of length less than (1+ε)^i. To do this, the CRT algorithm samples ε^{−2} vertices from V(S_i), and computes the number of edges contained in the component of S_i containing each sampled vertex v. If v is contained in a small component, we can traverse S_i until we visit all vertices of the component. If v is contained in a large component, we cannot compute this number, and we simply ignore v. This induces an additive error of ε(1+ε)^i·(n/t) in the final estimator, where t is the size threshold separating large components from small components. To get an approximation factor of 1+ε, we would need to set t = Δ, which is too large for our purpose: traversing a component of size t requires at least t queries. Here, the factor n/t in the additive error is exactly the number of components that we cannot traverse using O(t) queries.
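The component-counting reduction rests on the identity from [17]: for a connected graph G on n vertices with integer edge weights in {1,…,w}, the MST cost equals n − w + Σ_{i=1}^{w−1} c_i, where c_i is the number of connected components of the subgraph induced by the edges of weight at most i (the geometric algorithm applies the analogous identity with weight classes (1+ε)^i). A minimal sketch of the identity itself, with components counted by union-find:

```python
def num_components(n, edges):
    # Number of connected components of the graph ([0..n-1], edges),
    # computed with union-find (path halving).
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    comps = n
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            comps -= 1
    return comps

def mst_via_components(n, weighted_edges, w):
    # CRT identity: MST(G) = n - w + sum_{i=1}^{w-1} c_i for a connected
    # graph G with integer edge weights in {1, ..., w}.
    return n - w + sum(
        num_components(n, [(u, v) for (u, v, wt) in weighted_edges if wt <= i])
        for i in range(1, w))
```

The identity turns MST estimation into w − 1 component-counting problems, which is exactly why estimating the numbers c_i (with sublinearly many queries per level) suffices.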

Our strategy is to traverse S_i in a cell-by-cell manner. More specifically, we consider a grid 𝒢 of side length (1+ε)^i/2, and contract all vertices of S_i lying in a single cell of 𝒢 into one vertex. Then, using the same number of queries, we can visit more of the original vertices. By a careful analysis, we can reduce the additive error to ε(n/t) ≤ ε·𝖮𝖯𝖳. This idea was already used by [28] for maintaining the cost of a minimum spanning tree in the dynamic streaming setting. (As in [28], we could use the complete Euclidean graph on P instead of a spanner S of P. However, this requires additional work: given two cells c and c′ and a value r, we would need to check whether there are two points p ∈ c and p′ ∈ c′ with ‖p−p′‖ = r, which cannot be done using O(log Δ) range counting queries. One could resolve this by defining another distance function, but we feel that working with spanners makes the analysis simpler.) The cell-by-cell traversal raises a new issue, however: we need to sample a cell containing a point of P uniformly at random. Here, we can use Lemma 13 to choose a random sample from the vertices of S_i.

We then obtain the following theorem. Details can be found in the full version of this paper.

Theorem 17.

Given a set P of n points in the discrete space [Δ]^2, we can estimate the cost of a minimum spanning tree of P within a factor of (1+ε) with constant probability using Õ(√n) range counting queries.

5.2 Lower bound for the minimum spanning tree problem

In this section, we show that any randomized constant-factor approximation algorithm for the Euclidean minimum spanning tree problem uses Ω(n^{1/3}) range counting queries. For this, we construct a distribution μ of instances such that any randomized algorithm using o(n^{1/3}) queries fails to obtain a constant-factor approximate solution with probability at least 1/3.

Let [Δ]^2 be a discrete domain with Δ = O(n). We subdivide [Δ]^2 into 16n^{1/3} equal-sized cells, each of side length 4n^{5/6}. See Figure 4. In the middle of each cell, we place either the strip gadget or the uniform gadget. Each gadget is defined on the domain [n^{5/6}]^2, which we further subdivide into n^{4/3} finer cells of side length n^{1/6}. For the strip gadget, we put one point in each finer cell on one diagonal of [n^{5/6}]^2. For the uniform gadget, we put one point in the finer cell in the i-th row and the ((i·n^{1/3}) mod n^{2/3})-th column, for i = 0,1,…,n^{2/3}−1. In total, each gadget contains n^{2/3} points. Notice that the cost of a minimum spanning tree of the points inside the strip gadget is Θ(n^{5/6}), while the cost of a minimum spanning tree of the points inside the uniform gadget is Θ(n^{7/6}). Now we define a distribution μ of instances. With probability 1/2, we place copies of the strip gadget in all cells of the domain [Δ]^2. Otherwise, we pick one cell c of the domain [Δ]^2 uniformly at random, place a copy of the uniform gadget in c, and place copies of the strip gadget in all other cells. Then we have the following lemma.
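The two gadget costs can be sanity-checked numerically. The sketch below is illustrative only: it uses n = 64 (so that n^{1/6} is an integer) and a quadratic-time Prim's algorithm on the complete Euclidean graph, and confirms that the strip gadget is cheaper than the uniform gadget by a constant factor at this scale, in line with the Θ(n^{5/6}) versus Θ(n^{7/6}) bounds.

```python
import math

def mst_cost(pts):
    # Prim's algorithm on the complete Euclidean graph (fine for small inputs).
    n = len(pts)
    in_tree = [False] * n
    dist = [math.inf] * n
    dist[0] = 0.0
    total = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=dist.__getitem__)
        in_tree[u] = True
        total += dist[u]
        for v in range(n):
            if not in_tree[v]:
                dist[v] = min(dist[v], math.dist(pts[u], pts[v]))
    return total

def strip_gadget(n):
    # One point per finer cell (side n^(1/6)) on the diagonal of [n^(5/6)]^2.
    s, k = round(n ** (1 / 6)), round(n ** (2 / 3))
    return [(s * i, s * i) for i in range(k)]

def uniform_gadget(n):
    # One point in row i, column (i * n^(1/3)) mod n^(2/3), for i < n^(2/3).
    s, k, step = round(n ** (1 / 6)), round(n ** (2 / 3)), round(n ** (1 / 3))
    return [(s * ((i * step) % k), s * i) for i in range(k)]
```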

Lemma 18.

Let I and I′ be two instances in the support of μ such that I uses the uniform gadget but I′ does not. Then 𝖬𝖲𝖳(I) ≥ 2·𝖬𝖲𝖳(I′), where 𝖬𝖲𝖳(·) denotes the cost of a minimum spanning tree.

Proof.

Let M be a minimum spanning tree of I. Observe that the cost of M is at least 2n^{7/6}: M contains at least n^{1/3} edges connecting two points lying in different cells of the domain [Δ]^2, and the total length of these edges is at least 2n^{7/6}.

Now we derive an upper bound on the cost of a minimum spanning tree of I′ by constructing a spanning tree of I′ from M. The two instances are identical except that I has a uniform gadget in some cell, say c, while I′ has a strip gadget in c. By the cut property, at most O(1) edges of M have one endpoint in c and the other endpoint outside of c, and each such edge has length at most 8n^{5/6}. We reconnect each such edge to an arbitrary vertex in c. Then we remove from M all edges having both endpoints in c, and add the edges of a minimum spanning tree of the strip gadget. In total, the cost of M decreases by at least n^{7/6} − O(1)·n^{5/6}; here, the term n^{7/6} is the cost of the uniform gadget. Therefore, we have 𝖬𝖲𝖳(I′) ≤ 𝖬𝖲𝖳(I) − n^{7/6} + O(1)·n^{5/6} ≤ 𝖬𝖲𝖳(I) − (1/2)n^{7/6}.

Figure 4: The domain [Δ]^2 is partitioned into 16n^{1/3} cells. Each cell contains the strip gadget or the uniform gadget. The strip gadget has cost Θ(n^{5/6}), while the uniform gadget has cost Θ(n^{7/6}).
Lemma 19.

Any randomized constant-factor approximation algorithm for the minimum spanning tree problem on a point set of size n in a discrete space [Δ]^d requires Ω(n^{1/3}) range counting queries.

References

  • [1] Mikkel Abrahamsen, Mark de Berg, Kevin Buchin, Mehran Mehr, and Ali D. Mehrabi. Range-Clustering Queries. In Proceedings of the 33rd International Symposium on Computational Geometry (SoCG 2017), volume 77, pages 5:1–5:16, 2017. doi:10.4230/LIPICS.SOCG.2017.5.
  • [2] Peyman Afshani, Lars Arge, and Kasper Dalgaard Larsen. Orthogonal range reporting in three and higher dimensions. In Proceedings of the 50th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2009), pages 149–158, 2009. doi:10.1109/FOCS.2009.58.
  • [3] Pankaj K Agarwal, Hsien-Chih Chang, Sharath Raghvendra, and Allen Xiao. Deterministic, near-linear ε-approximation algorithm for geometric bipartite matching. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2022), pages 1052–1065, 2022. doi:10.1145/3519935.3519977.
  • [4] Pankaj K. Agarwal, Hsien-Chih Chang, and Allen Xiao. Efficient algorithms for geometric partial matching. In Proceedings of the 35th International Symposium on Computational Geometry (SoCG 2019), pages 6:1–6:14, 2019. doi:10.4230/LIPICS.SOCG.2019.6.
  • [5] Alexandr Andoni, Khanh Do Ba, Piotr Indyk, and David Woodruff. Efficient sketches for earth-mover distance, with applications. In Proceedings of the 50th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2009), pages 324–330, 2009. doi:10.1109/FOCS.2009.25.
  • [6] Sanjeev Arora and Boaz Barak. Computational complexity: a modern approach, 2009.
  • [7] Sunil Arya and David M Mount. Approximate range searching. Computational Geometry, 17(3-4):135–152, 2000. doi:10.1016/S0925-7721(00)00022-5.
  • [8] Sunil Arya, David M. Mount, and Eunhui Park. Approximate geometric mst range queries. In Proceedings of the 31st International Symposium on Computational Geometry (SoCG 2015), volume 34, pages 781–795, 2015. doi:10.4230/LIPICS.SOCG.2015.781.
  • [9] Khanh Do Ba, Huy L Nguyen, Huy N Nguyen, and Ronitt Rubinfeld. Sublinear time algorithms for earth mover’s distance. Theory of Computing Systems, 48:428–442, 2011. doi:10.1007/S00224-010-9265-8.
  • [10] Soheil Behnezhad, Mohammad Roghani, and Aviad Rubinstein. Sublinear time algorithms and complexity of approximate maximum matching. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing (STOC 2023), pages 267–280, 2023. doi:10.1145/3564246.3585231.
  • [11] Nicolas Bonneel, Michiel Van De Panne, Sylvain Paris, and Wolfgang Heidrich. Displacement interpolation using lagrangian mass transport. In Proceedings of the 2011 SIGGRAPH Asia conference, pages 1–12, 2011.
  • [12] Paul B Callahan and S Rao Kosaraju. A decomposition of multidimensional point sets with applications to k-nearest-neighbors and n-body potential fields. Journal of the ACM (JACM), 42(1):67–90, 1995. doi:10.1145/200836.200853.
  • [13] Timothy M Chan. Orthogonal range searching in moderate dimensions: kd trees and range trees strike back. Discrete & Computational Geometry, 61:899–922, 2019. doi:10.1007/S00454-019-00062-5.
  • [14] Timothy M Chan, Kasper Green Larsen, and Mihai Pătraşcu. Orthogonal range searching on the RAM, revisited. In Proceedings of the twenty-seventh annual Symposium on Computational Geometry (SoCG 2011), pages 1–10, 2011.
  • [15] Timothy M Chan and Konstantinos Tsakalidis. Dynamic orthogonal range searching on the RAM, revisited. Journal of Computational Geometry, 9(2):45–66, 2018. doi:10.20382/JOCG.V9I2A5.
  • [16] Bernard Chazelle, Ding Liu, and Avner Magen. Sublinear geometric algorithms. In Proceedings of the thirty-fifth annual ACM Symposium on Theory of Computing (STOC 2003), pages 531–540, 2003. doi:10.1145/780542.780620.
  • [17] Bernard Chazelle, Ronitt Rubinfeld, and Luca Trevisan. Approximating the minimum spanning tree weight in sublinear time. SIAM J. Comput., 34(6):1370–1379, 2005. doi:10.1137/S0097539702403244.
  • [18] Xi Chen, Vincent Cohen-Addad, Rajesh Jayaram, Amit Levi, and Erik Waingarten. Streaming Euclidean MST to a constant factor. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing (STOC 2023), pages 156–169, 2023. doi:10.1145/3564246.3585168.
  • [19] Graham Cormode, Hossein Jowhari, Morteza Monemizadeh, and S. Muthukrishnan. The sparse awakens: Streaming algorithms for matching size estimation in sparse graphs. In Kirk Pruhs and Christian Sohler, editors, 25th Annual European Symposium on Algorithms, ESA 2017, September 4-6, 2017, Vienna, Austria, volume 87 of LIPIcs, pages 29:1–29:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2017. doi:10.4230/LIPICS.ESA.2017.29.
  • [20] Artur Czumaj, Funda Ergün, Lance Fortnow, Avner Magen, Ilan Newman, Ronitt Rubinfeld, and Christian Sohler. Approximating the weight of the Euclidean minimum spanning tree in sublinear time. SIAM J. Comput., 35(1):91–109, 2005. doi:10.1137/S0097539703435297.
  • [21] Artur Czumaj and Christian Sohler. Property testing with geometric queries. In Proceedings of the 9th Annual European Symposium on Algorithms (ESA 2001), pages 266–277, 2001. doi:10.1007/3-540-44676-1_22.
  • [22] Artur Czumaj and Christian Sohler. Estimating the weight of metric minimum spanning trees in sublinear-time. In Proceedings of the thirty-sixth annual ACM Symposium on Theory of Computing (STOC 2004), pages 175–183, 2004. doi:10.1145/1007352.1007386.
  • [23] Artur Czumaj and Christian Sohler. Estimating the weight of metric minimum spanning trees in sublinear time. SIAM J. Comput., 39(3):904–922, 2009. doi:10.1137/060672121.
  • [24] Artur Czumaj and Christian Sohler. Sublinear-time algorithms. Property testing: current research and surveys, pages 41–64, 2010. doi:10.1007/978-3-642-16367-8_5.
  • [25] Mark de Berg, Otfried Cheong, Marc van Kreveld, and Mark Overmars. Computational geometry: Algorithms and applications, 2008.
  • [26] Talya Eden, Dana Ron, and C Seshadhri. On approximating the number of k-cliques in sublinear time. In Proceedings of the 50th annual ACM SIGACT Symposium on Theory of Computing (STOC 2018), pages 722–734, 2018. doi:10.1145/3188745.3188810.
  • [27] Rémi Flamary, Marco Cuturi, Nicolas Courty, and Alain Rakotomamonjy. Wasserstein discriminant analysis. Machine Learning, 107:1923–1945, 2018. doi:10.1007/S10994-018-5717-1.
  • [28] Gereon Frahling, Piotr Indyk, and Christian Sohler. Sampling in dynamic data streams and applications. In Proceedings of the twenty-first annual Symposium on Computational Geometry (SoCG 2005), pages 142–149, 2005. doi:10.1145/1064092.1064116.
  • [29] Elazar Goldenberg, Robert Krauthgamer, and Barna Saha. Sublinear algorithms for gap edit distance. In Proceedings of the 60th Annual Symposium on Foundations of Computer Science (FOCS 2019), pages 1101–1120, 2019. doi:10.1109/FOCS.2019.00070.
  • [30] Sariel Har-Peled. Geometric approximation algorithms. Number 173 in Mathematical Surveys and Monographs. American Mathematical Soc., 2011.
  • [31] Sariel Har-Peled, Mitchell Jones, and Saladi Rahul. Active-learning a convex body in low dimensions. Algorithmica, 83:1885–1917, 2021. doi:10.1007/S00453-021-00807-W.
  • [32] Tomasz Kociumaka and Barna Saha. Sublinear-time algorithms for computing & embedding gap edit distance. In Proceedings of the 61st Annual Symposium on Foundations of Computer Science (FOCS 2020), pages 1168–1179, 2020. doi:10.1109/FOCS46700.2020.00112.
  • [33] Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pages 957–966, 2015. URL: http://proceedings.mlr.press/v37/kusnerb15.html.
  • [34] Michael Mitzenmacher and Saeed Seddighin. Improved sublinear time algorithm for longest increasing subsequence. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA 2021), pages 1934–1947, 2021. doi:10.1137/1.9781611976465.115.
  • [35] Morteza Monemizadeh. Facility location in the sublinear geometric model. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2023), volume 275, pages 6:1–6:24, 2023. doi:10.4230/LIPICS.APPROX/RANDOM.2023.6.
  • [36] Morteza Monemizadeh and David P. Woodruff. 1-pass relative-error lp-sampling with applications. In Moses Charikar, editor, Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17-19, 2010, pages 1143–1160. SIAM, 2010. doi:10.1137/1.9781611973075.92.
  • [37] Eunjin Oh and Hee-Kap Ahn. Approximate range queries for clustering. In 34th International Symposium on Computational Geometry (SoCG 2018), volume 99, pages 62:1–62:14, 2018. doi:10.4230/LIPICS.SOCG.2018.62.
  • [38] Krzysztof Onak, Dana Ron, Michal Rosen, and Ronitt Rubinfeld. A near-optimal sublinear-time algorithm for approximating the minimum vertex cover size. In Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms (SODA 2012), pages 1123–1131, 2012. doi:10.1137/1.9781611973099.88.
  • [39] Ronitt Rubinfeld and Asaf Shapira. Sublinear time algorithms. SIAM Journal on Discrete Mathematics, 25(4):1562–1588, 2011. doi:10.1137/100791075.
  • [40] Cheng Sheng and Yufei Tao. New results on two-dimensional orthogonal range aggregation in external memory. In Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of Database Systems (PODS 2011), pages 129–139, 2011. doi:10.1145/1989284.1989297.