Fully Scalable MPC Algorithms for Euclidean $k$-Center
Abstract
The $k$-center problem is a fundamental optimization problem with numerous applications in machine learning, data analysis, data mining, and communication networks. The $k$-center problem has been extensively studied in the classical sequential setting for several decades, and more recently there have been some efforts in understanding the problem in parallel computing, in the Massively Parallel Computation (MPC) model. By now, we have a good understanding of $k$-center in the case where each local MPC machine has sufficient local memory to store some representatives from each cluster, that is, when one has $\Omega(k)$ local memory per machine. While this setting covers the case of small values of $k$, for a large number of clusters these algorithms require undesirably large local memory, making them poorly scalable. The case of large $k$ has been considered only recently for the fully scalable low-local-memory MPC model, for Euclidean instances of the $k$-center problem. However, the earlier works considered only constant-dimensional Euclidean space, required a super-constant number of rounds, and produced center sets whose cost is only a super-constant-factor approximation of the optimal $k$-center cost.
In this work, we significantly improve upon the earlier results for the $k$-center problem in the fully scalable low-local-memory MPC model. In the low-dimensional Euclidean case in $\mathbb{R}^d$, we present the first constant-round fully scalable MPC algorithm for $(2+\epsilon)$-approximation. We push the ratio further to $(1+\epsilon)$-approximation, albeit using slightly more, $(1+\epsilon)k$, centers. All these results naturally extend to slightly super-constant values of $d$. In the high-dimensional regime, we provide the first fully scalable MPC algorithm that in a constant number of rounds achieves an $O(\log n / \log\log n)$-approximation for $k$-center.
Keywords and phrases:
Massively Parallel Computing, Euclidean Spaces, $k$-Center Clustering
Category:
Track A: Algorithms, Complexity and Games
Funding:
Artur Czumaj: Research supported in part by the Centre for Discrete Mathematics and its Applications (DIMAP), by EPSRC award EP/V01305X/1, by a Weizmann-UK Making Connections Grant, and by an IBM Award.
2012 ACM Subject Classification:
Theory of computation → Massively parallel algorithms; Theory of computation → Facility location and clustering; Theory of computation → Randomness, geometry and discrete structures
Editors:
Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis
1 Introduction
Clustering and the $k$-Center Problem.
Clustering is a fundamental task in data analysis and machine learning. We consider a well-known clustering problem, called $k$-Center, in Euclidean spaces. In this problem, given an integer parameter $k \ge 1$ and a dataset $P \subseteq \mathbb{R}^d$ of $n$ points, the goal is to find a center set $C \subseteq \mathbb{R}^d$ of $k$ points such that the following clustering objective is minimized:
\[ \mathrm{cost}(P, C) \;:=\; \max_{p \in P}\, \mathrm{dist}(p, C). \tag{1} \]
Here, $\mathrm{dist}(x, y) := \|x - y\|_2$ for any two points $x, y \in \mathbb{R}^d$, and $\mathrm{dist}(p, C) := \min_{c \in C} \mathrm{dist}(p, c)$ for any point $p$ and any set of points $C$.
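To make objective (1) concrete, here is a minimal (centralized, not MPC) Python sketch that evaluates the $k$-Center cost of a candidate center set; all names are ours for illustration:

```python
import numpy as np

def kcenter_cost(P: np.ndarray, C: np.ndarray) -> float:
    """Objective (1): the maximum, over points p in P, of dist(p, C)."""
    # Pairwise Euclidean distances between data points and centers.
    D = np.linalg.norm(P[:, None, :] - C[None, :, :], axis=2)
    return D.min(axis=1).max()  # nearest-center distance, then worst point

# Example: two well-separated clusters, k = 2.
P = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
C = np.array([[0.5, 0.0], [10.5, 0.0]])
print(kcenter_cost(P, C))  # 0.5
```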
Solving $k$-Center on massive datasets poses significant scalability challenges. To meet these challenges, practical approaches usually use several interconnected computers to solve the problem, i.e., they resort to distributed computing. Accordingly, there has been significant recent interest in scalable algorithms with provable guarantees for $k$-Center [1, 5, 7, 9, 11, 14, 20, 31, 37, 44, 46], primarily in the Massively Parallel Computation (MPC) model, which has nowadays become the de facto standard theoretical model for such large-scale distributed computation settings (see, e.g., [6, 29, 36]).
Massively Parallel Computation (MPC) Model.
The MPC model, introduced in [43], provides a theoretical abstraction for widely used practical frameworks such as MapReduce [19], Hadoop [52], Spark [53], and Dryad [38]. In this model, we are given a set of $m$ machines, each with some given memory of size $s$ (also known as local memory). At the beginning of the computation, the input (which in our case is a set $P$ of $n$ data points from $\mathbb{R}^d$) is arbitrarily distributed among these machines, with the constraint that it must fit within the machines' local memory. (Hence we will require that the number of machines is $m = \Omega(n/s)$, for otherwise the input would not fit the system.) The MPC computation proceeds in synchronous rounds. In each round, each machine first processes its local data and performs an arbitrary computation on this data without communicating with other machines. Then, at the end of each round, machines can communicate by exchanging messages, subject to the constraint that, for every machine, the total size of the messages it sends or receives is $O(s)$. When the algorithm terminates, the MPC machines collectively output the solution. The goal is to finish the computational task in as few rounds as possible.
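The following toy sequential simulation summarizes this round structure; it is only an illustration of the model (the state representation, the `step` interface, and the exact size accounting are our own simplifications, not part of the formal model):

```python
def mpc_simulate(states, step, rounds, s):
    """Toy sequential simulation of synchronous MPC rounds.

    states: one local state per machine (each assumed to fit in s words);
    step(i, state, inbox) -> (new_state, outbox), where outbox maps a
    destination machine index to a list of words to send there.
    """
    inboxes = [[] for _ in states]
    for _ in range(rounds):
        next_inboxes = [[] for _ in states]
        for i in range(len(states)):
            states[i], outbox = step(i, states[i], inboxes[i])
            # Each machine sends at most O(s) words per round ...
            assert sum(len(words) for words in outbox.values()) <= s
            for j, words in outbox.items():
                next_inboxes[j].extend(words)
        # ... and receives at most O(s) words per round.
        assert all(len(box) <= s for box in next_inboxes)
        inboxes = next_inboxes
    return states
```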
Local Memory Regimes and Full Scalability.
The central parameter determining the computational model is the size of the local memory $s$. Unlike the input size $n$, local memory is determined by the hardware provided and, as such, one would like the relation between $s$ and $n$ to be as flexible as possible. Therefore an ideal MPC algorithm should be fully scalable, meaning that it should work with $s = n^{\sigma}$ for any constant $\sigma > 0$. The importance of designing fully scalable MPC algorithms has been recently observed (cf. [8]) for clustering problems like $k$-Center (and also $k$-means and $k$-median), where the prior research (see below) demonstrates that the problem's difficulty changes radically depending on whether $s \ge k$ or not, i.e., whether one machine can hold the entire set of proposed centers or not. It is furthermore desirable for the algorithm to use near-linear total memory.
Prior Work with High Local Memory Requirements.
In the non-fully scalable regime, when $s \ge k$, a classical technique of coresets [32, 33] can be applied to reduce the input points into a $(1+\epsilon)$-approximate proxy with only $\tilde{O}(k)$ points (ignoring $\mathrm{poly}(d/\epsilon)$ factors). For $k$-Center, provided that $s \ge k \cdot n^{\sigma}$ for some constant $\sigma > 0$ [7], this small proxy can be computed in $O(1)$ MPC rounds and moved to a single machine. The clustering problem (in fact, its approximate version, because of the approximation caused by the use of coresets) can then be solved locally without further communication in a single MPC round. However, the coreset approach is not applicable when $s \ll k$, because coresets suffer a trivial size lower bound of $\Omega(k)$, making the approach sketched above unsuitable. Hence, new techniques must be developed for fully scalable algorithms when local memory is sublinear in $k$.
Several other works for $k$-Center, although not using a coreset directly, also follow a similar paradigm of finding a sketch of $\Theta(k)$ points in each machine [1, 20, 31, 46], and therefore they still require $s = \Omega(k)$. We remark that these results work in general metrics with a distance oracle, which is a setting very different from our Euclidean setting. This fundamental difference translates to a different flavor of studies: in general metrics not much more can be done than using the triangle inequality, whereas in $\mathbb{R}^d$ we need to investigate which Euclidean properties are useful for fully scalable algorithms.
Toward Fully Scalable MPC Algorithms.
To combat these technical challenges, several recent works have designed fully scalable MPC algorithms for $k$-Center and for related clustering problems, including $k$-Median and $k$-Means [5, 8, 12, 13, 14, 17]. These MPC algorithms are fully scalable in the sense that they can work with $s = n^{\sigma}$ for any constant $\sigma > 0$. Indeed, they usually work with any such $s$ regardless of $k$ (albeit many of these results output more than $k$ centers, only achieving a bi-criteria approximation). Nevertheless, despite the inspiring recent progress, fully scalable MPC algorithms for $k$-Center remain poorly understood. The state-of-the-art algorithm (for geometric $k$-Center and only for constant $d$) is by Coy, Czumaj, and Mishra [14]: it achieves a super-constant approximation ratio (improving over the round bounds from an earlier work of Bateni et al. [5]) and uses slightly more than $k$ centers; this violates the constraint of using at most $k$ centers, and the algorithm works in a super-constant number of rounds. These bounds, which are stated for the standard regime of $s = n^{\sigma}$ with a constant $\sigma > 0$, show a drastic gap to the above-mentioned bounds achieved in the high-local-memory regime.
Challenges of High Dimensionality.
Apart from full scalability, the high dimensionality of Euclidean spaces is another challenge in designing MPC algorithms. Indeed, there have been emerging works that address high dimensionality in MPC [4, 8, 12, 13, 17, 21, 39]. Unfortunately, all previous fully scalable MPC algorithms for $k$-Center only work in the low-dimensional regime, since they require a $2^{\Theta(d)}$ dependence on $d$ in the local space, and fully scalable MPC algorithms suitable for high dimension, especially those with only $\mathrm{poly}(d)$ dependence, constitute an open area of research. Overall, there has been a large gap in fully scalable algorithms for $k$-Center in both the low- and high-dimensional regimes.
1.1 Our Results
We give new fully scalable MPC algorithms for Euclidean $k$-Center, and we systematically address both the low-dimensional and high-dimensional regimes. Our results in low dimension significantly improve previous results simultaneously in various aspects (i.e., approximation ratio, round complexity, etc.). We also obtain the first results for high dimension, and our approximation ratio bypasses several natural barriers.
Low-Dimensional Regime.
In low dimension, we provide the first fully scalable MPC algorithm that achieves a constant approximation for $k$-Center, running in a constant number of rounds (Theorem 1.1). This result significantly improves the previous fully scalable algorithms for $k$-Center [5, 14] in several major aspects: it achieves constant round complexity, a better approximation factor, and a true approximation (without bi-criteria relaxations that allow slightly more than $k$ centers).
Theorem 1.1.
There exists an MPC algorithm that, given $k \ge 1$, $\epsilon \in (0, 1)$, and a dataset $P \subseteq \mathbb{R}^d$ of $n$ points distributed across MPC machines with local memory $s \ge (d/\epsilon)^{O(d)}$, with probability at least $1 - 1/\mathrm{poly}(n)$ computes a $(2+\epsilon)$-approximate solution to $k$-Center, using $O(\log_s n)$ rounds and $(d/\epsilon)^{O(d)} \cdot \tilde{O}(n)$ total memory.
Furthermore, we can improve the approximation bound if we allow bi-criteria approximation: we design a $(1+\epsilon)$-approximation $k$-Center algorithm that uses at most $(1+\epsilon)k$ centers (Theorem 1.2). This bi-criteria approximation is almost optimal, and it is the first of its kind in the literature.
Theorem 1.2.
There exists an MPC algorithm that, given $k \ge 1$, $\epsilon \in (0, 1)$, and a dataset $P \subseteq \mathbb{R}^d$ of $n$ points distributed across MPC machines with local memory $s \ge (d/\epsilon)^{O(d)}$, with probability at least $1 - 1/\mathrm{poly}(n)$ computes a $((1+\epsilon), (1+\epsilon))$-approximate solution to $k$-Center (an $(\alpha, \beta)$-approximate solution to $k$-Center, also called a bi-criteria solution, is a center set that has at most $\alpha k$ centers and has cost at most $\beta$ times the optimal cost of using at most $k$ centers), using $O(\log_s n)$ rounds and $(d/\epsilon)^{O(d)} \cdot \tilde{O}(n)$ total memory.
Observe that for the memory regime of $s = n^{\sigma}$ with a constant $\sigma > 0$, both Theorems 1.1 and 1.2 run in a constant number of rounds. Moreover, $\Theta(\log_s n)$ is the round complexity of far more rudimentary tasks, e.g., outputting the summation of $n$ numbers (see, e.g., [49]). The dependence on $d$ and $\epsilon$ in the memory bounds is $(d/\epsilon)^{O(d)}$. Thus, for any constant $d$ and constant $\epsilon > 0$, the algorithms work in a constant number of rounds with local memory $n^{\sigma}$ and total memory $\tilde{O}(n)$, which is the regime studied in the previous works on fully scalable algorithms for $k$-Center [5, 14]. Furthermore, even for slightly super-constant $d$, as long as $(d/\epsilon)^{O(d)} = n^{o(1)}$, the total memory remains in the desirable regime of $n^{1+o(1)}$.
High-Dimensional Regime.
Next, we go beyond the low-dimensional regime and explore the high-dimensional case, where $d$ can be as large as $\mathrm{poly}(n)$. (As also observed in, e.g., [12, 17], one can assume $d = O(\log n)$ without loss of generality by a Johnson–Lindenstrauss transform.) For this regime, we provide the first fully scalable MPC algorithm for $k$-Center, and it achieves an $O(\log n / \log\log n)$-approximation, running in a constant number of rounds (Theorem 1.3).
Theorem 1.3.
There exists an MPC algorithm that, given $k \ge 1$ and a dataset $P$ of $n$ points in $\mathbb{R}^d$ distributed across MPC machines with local memory $s \ge \mathrm{poly}(d \log n)$, with probability at least $1 - 1/\mathrm{poly}(n)$ computes an $O(\log n / \log\log n)$-approximate solution to $k$-Center, using $O(\log_s n)$ rounds and $\tilde{O}(n d)$ total memory.
We are not aware of any earlier fully scalable MPC algorithms for $k$-Center in high dimension that we could compare with. In fact, fully scalable MPC algorithms for clustering problems in high dimension are generally less understood: the only known result is a polylogarithmic-factor approximation for $k$-Median [12] (and in fact this ratio may be improved using the techniques from [2], although this is not explicitly mentioned there), and, as far as we know, nothing is known for $k$-Center and $k$-Means. These existing results for $k$-Median rely on the tree embedding technique, for which only super-logarithmic expected distortion is currently known [2] (and the distortion translates directly into the approximation ratio). As a result, even if these techniques could be adapted for $k$-Center, they would only provide a super-logarithmic approximation, which falls short of the $O(\log n/\log\log n)$-approximation achieved by our method; in fact, our bound is even better than the fundamental $\Omega(\log n)$ distortion lower bound for tree embeddings. This is not to mention the added technical difficulty of using tree embeddings: their expected distance distortion guarantee is too weak to be useful for the "max" aggregation of distances in $k$-Center.
1.2 Technical Overview
Our algorithms for $k$-Center rely on variants of the classic reductions to geometric versions of the ruling set (RS) and minimum dominating set (MDS) problems, which are fundamental problems in distributed computing. We briefly describe the reductions in Section 1.2.1, particularly to present our exact setup of RS and MDS, and state the results we obtain for each. Then, in Section 1.2.2 and Section 1.2.3, we provide a technical overview of our proposed MPC algorithms for these results, focusing on the key challenges and core techniques. While the relation between $k$-Center and RS/MDS is well known and has also been used in MPC algorithms for $k$-Center in general metrics [1, 31], it has not been studied for Euclidean $k$-Center in the fully scalable setting. This is a key technical difference from previous fully scalable algorithms for Euclidean $k$-Center [5, 14], which employ successive uniform sampling to find centers.
Both our low-dimension and high-dimension results rely on geometric hashing techniques (Section 3), through which we utilize the Euclidean structure. For low dimension, our algorithms are based on natural parallel algorithms whose variants were also considered in graph MPC algorithms, and the geometric hashing is the key to achieving the new bounds. For high dimension, our algorithm is a variant of the one-round version of Luby's algorithm. It has been known that the one-round Luby's algorithm yields an $O(\log n)$ bound for RS in general graphs (see, e.g., [24, Exercise 1.12]). However, our new variant crucially makes use of the Euclidean structure via geometric hashing, and it breaks the mentioned bound for general graphs, improving it by a $\Theta(\log\log n)$ factor in the Euclidean setting; see Section 1.2.3 for a more formal discussion. Due to the space limit, proofs for claims made in this section may be omitted in this version and can be found in the full version [16].
1.2.1 Reductions and Results for Geometric RS and MDS
To establish Theorems 1.1, 1.2, and 1.3, we begin by introducing the definitions and reductions for RS and MDS. Let $r, \epsilon > 0$ be parameters, and let $\mathrm{OPT}$ be the minimum cost of a solution for $k$-Center.
Geometric RS and MDS.
A subset $S \subseteq P$ is called an $r$-independent set ($r$-IS) for $P$ if, for every two distinct $x, y \in S$, $\mathrm{dist}(x, y) > r$; and we say $S \subseteq \mathbb{R}^d$ is an $r$-dominating set ($r$-DS) for $P$ if, for every $x \in P$, $\mathrm{dist}(x, S) \le r$. A subset $S \subseteq P$ is an $(\alpha, \beta)$-ruling set ($(\alpha, \beta)$-RS) for $P$ if $S$ is both an $\alpha$-IS and a $\beta$-DS for $P$. An $r$-MDS is an $r$-DS with the minimum size, denoted as $\mathrm{MDS}_r(P)$. A related well-known notion is the maximal independent set (MIS): an $r$-MIS is an $(r, r)$-RS.
Reductions.
It is known (see, e.g., [35]) that any $r$-IS of $P$ for $r > 2\,\mathrm{OPT}$ must have at most $k$ points. Therefore, an MPC algorithm that computes an $(r, \beta r)$-RS of $P$, invoked with $r$ slightly above $2\,\mathrm{OPT}$, would immediately yield roughly a $2\beta$-approximation for $k$-Center on $P$ (see the counting sketch after this paragraph). On the other hand, for MDS, since the optimal solution for $k$-Center itself is a candidate $r$-DS with at most $k$ points for $r \ge \mathrm{OPT}$, a $(1+\epsilon)$-approximation to the $r$-MDS would yield at most $(1+\epsilon)k$ centers. This relation helps to obtain the desired bi-criteria approximation. Compared with the setting of RS, which can only lead to a $(2+\epsilon)$-approximation, MDS operates on the entire $\mathbb{R}^d$. This is necessary for the $(1+\epsilon)$ ratio, since centers need to be picked from $\mathbb{R}^d$ instead of only from $P$.
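For completeness, here is the counting step behind the first reduction, written out under the conventions above (a standard argument, cf. [35]):

\[
r > 2\,\mathrm{OPT} \;\Longrightarrow\; |I| \le k \quad \text{for every } r\text{-IS } I \subseteq P,
\]

since any two points $x, y \in P$ served by the same optimal center $c^*$ satisfy $\mathrm{dist}(x, y) \le \mathrm{dist}(x, c^*) + \mathrm{dist}(c^*, y) \le 2\,\mathrm{OPT} < r$, so each of the $k$ optimal clusters contains at most one point of $I$. Consequently, an $(r, \beta r)$-RS computed with $r$ slightly above $2\,\mathrm{OPT}$ is a feasible set of at most $k$ centers of cost at most $\beta r \approx 2\beta\,\mathrm{OPT}$.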
Results for RS and MDS.
We obtain the following results for RS and MDS, in both low and high dimension. Combined with the above-mentioned reductions, these results readily imply Theorems 1.1, 1.2, and 1.3. All results run in $O(\log_s n)$ rounds, which is constant in the typical setup of $s = n^{\sigma}$ for constant $\sigma > 0$. In low dimension, we obtain an $(r, (1+\epsilon)r)$-RS and a $((1+\epsilon)r)$-DS whose size is at most $(1+\epsilon)$ times the size of the $r$-MDS (which in a sense is a "bi-criteria" approximation). Both results use $(d/\epsilon)^{O(d)}$ local space and $(d/\epsilon)^{O(d)} \cdot \tilde{O}(n)$ total space. In high dimension, we obtain an $(r, O(\log n/\log\log n)\cdot r)$-RS, using (ideally small) $\mathrm{poly}(d \log n)$ local space and $\tilde{O}(nd)$ total space. These results are summarized in Table 1.
| guarantee | local space | total space | reference |
|---|---|---|---|
| $(r, (1+\epsilon)r)$-RS | $(d/\epsilon)^{O(d)}$ | $(d/\epsilon)^{O(d)} \cdot \tilde{O}(n)$ | Lemma 4.1 |
| $((1+\epsilon)r)$-DS of size $(1+\epsilon)\,|\mathrm{MDS}_r|$ | $(d/\epsilon)^{O(d)}$ | $(d/\epsilon)^{O(d)} \cdot \tilde{O}(n)$ | Lemma 4.2 |
| $(r, O(\log n/\log\log n)\cdot r)$-RS | $\mathrm{poly}(d \log n)$ | $\tilde{O}(nd)$ | Lemma 4.3 |
We remark that MIS, RS, and MDS are fundamental yet notoriously challenging problems in MPC. Existing studies on these problems mostly concern (general) graphs or general metric spaces, and they achieve worse bounds than ours: e.g., they need to use a super-constant number of rounds [25, 26, 27, 30, 40, 47], and/or are not fully scalable [10, 27, 31]. However, our results seem to suggest that these problems in Euclidean spaces behave very differently than in graphs/general metrics. On the one hand, we obtain fully scalable algorithms in both low and high dimension; on the other hand, our algorithms are only "approximations" to MIS and MDS: for instance, in low dimension, both our RS and MDS results are a $(1+\epsilon)$ factor off in the dominating parameter compared with MIS and MDS. For RS/MIS and MDS without violating the dominating parameter, we are only aware of a line of research in distributed computing for growth-bounded graphs, see Schneider and Wattenhofer [51], which indirectly leads to $O(\log^* n)$-round fully scalable algorithms for MIS/MDS in $\mathbb{R}^d$ for constant $d$. It is still open to design fully scalable algorithms for MIS, even in 2D, in a constant number of rounds. In fact, this problem is already challenging on a 2D input set of bounded diameter. Nevertheless, our RS and MDS results suffice for approximations for $k$-Center.
1.2.2 RS and MDS in Low Dimension
The RS and MDS algorithms in low dimension start with a rounding to a $\delta$-grid, for $\delta := \epsilon r / \sqrt{d}$. Specifically, we move each data point to the nearest $\delta$-grid point (whose coordinates are integer multiples of $\delta$), and denote the resulting dataset by $P'$. (In our proof we actually need to use a slightly different rounding to make sure the image also belongs to $P$.) Then we show that any $(r, r)$-RS for $P'$ yields an $((1-\epsilon)r, (1+\epsilon)r)$-RS for the original dataset $P$, and similarly, any $r$-DS for $P'$ yields a $((1+\epsilon)r)$-DS for $P$. Hence, we can assume without loss of generality that $P$ is a subset of the said grid, and find the RS and MDS on it. This rounding is useful since it ensures that any subset of $\mathbb{R}^d$ of diameter $\ell$ (for $\ell \ge \delta$) contains at most $(O(\ell/\delta))^d$ grid points. Hence, as long as $s \ge (O(\ell/\delta))^d$, we can afford to bring all grid points in a small area to a single machine and solve RS/MDS on it locally. A minimal sketch of the rounding follows.
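The sketch below illustrates the grid rounding (the function name and the exact choice of $\delta$ are ours for illustration; the paper's actual rounding is slightly different, as noted above):

```python
import numpy as np

def round_to_grid(P: np.ndarray, delta: float) -> np.ndarray:
    """Snap each point to the nearest delta-grid point and deduplicate.

    Each point moves by at most (sqrt(d)/2) * delta, so choosing
    delta = eps * r / sqrt(d) perturbs every pairwise distance by <= eps * r.
    """
    snapped = np.round(P / delta) * delta
    return np.unique(snapped, axis=0)  # duplicates collapse onto one grid point

P = np.random.rand(1000, 2)
P_rounded = round_to_grid(P, delta=0.05)
print(len(P_rounded), "distinct grid points")
```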
An Overview for the Proof of RS.
In fact, on the rounded dataset $P'$, we find an $(r, r)$-RS, which is at the same time an $r$-MIS of $P'$. A standard way of finding an MIS in a graph is the greedy algorithm: start with the entire vertex set; while the current vertex set is nonempty, find any vertex $v$, add it to the output (MIS) set, remove all vertices adjacent to $v$ from the current vertex set, and repeat. In the geometric setting, we can use an improved version where in each iteration we identify a large number of vertices that are independent, instead of only one. Specifically, we partition the entire $\mathbb{R}^d$ into groups $G_1, \ldots, G_g$ of regions, such that each group consists of regions that are more than $r$ apart from each other. Furthermore, each region has a bounded diameter $\ell = \mathrm{poly}(d) \cdot r$. (The reader may notice the resemblance to the widely used notion of network decomposition in graphs [3, 50]: in that notion, the node set of the graph is partitioned into groups $G_1, \ldots, G_g$ such that in the subgraph induced by each $G_i$, each connected component has bounded diameter.) For $d = 1$, there is a simple way to partition with $g = 2$ and $\ell = O(r)$, as illustrated in Figure 1. For general $d$, we give a partition that achieves $g = d + 1$ and $\ell = \mathrm{poly}(d) \cdot r$. See Lemma 3.1 for the precise statement.
A key property of this decomposition is that, within each group, the MIS computation in each region can be done independently, as the regions are more than $r$ apart. This yields an $O(d \log_s n)$-round MPC algorithm: iterate over the groups $G_1, \ldots, G_g$, compute an MIS in each region of the current group in parallel, and remove all points within distance $r$ of the selected MIS points before proceeding to the next group. Since each region in any group has diameter $\ell = \mathrm{poly}(d) \cdot r$, there are at most $(\mathrm{poly}(d)/\epsilon)^d$ data points in each region, using the assumption of the rounded instance. Hence, as long as $s \ge (\mathrm{poly}(d)/\epsilon)^d$, the data points in each region can be stored in a single machine. Each such iteration can be implemented in $O(\log_s n)$ MPC rounds, and thus the overall scheme, which has $g = d + 1$ iterations, takes $O(d \log_s n)$ MPC rounds.
To reduce this to $O(\log_s n)$ rounds, we exploit the locality of the above procedure. Instead of iterating sequentially over the groups, each region directly determines its final subset by identifying the influence from earlier groups that lies within a bounded range. Since an MIS selection only affects an $r$-radius per iteration, and each region has diameter at most $\ell$, the cumulative affected area over the $g$ iterations remains bounded by a radius of $O(g(\ell + r)) = \mathrm{poly}(d) \cdot r$. Leveraging the rounding property again, the number of relevant grid points remains at most $(\mathrm{poly}(d)/\epsilon)^d$, allowing all necessary regions to be replicated locally. This ensures that each region can compute its MIS in parallel in $O(\log_s n)$ rounds, provided $s \ge (\mathrm{poly}(d)/\epsilon)^{O(d)}$. A sequential illustration of the group-by-group computation is sketched below.
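The following sequential Python sketch illustrates the group-by-group greedy, using a simple $2^d$-group parity decomposition of hypercubes rather than the $d+1$ groups of Lemma 3.1 (all names and parameter choices are illustrative):

```python
import numpy as np
from collections import defaultdict
from itertools import product

def ruling_set_by_groups(P: np.ndarray, r: float) -> np.ndarray:
    """Greedy r-MIS processed cell by cell, in 2^d parity groups.

    Cells have side 2r, so two distinct cells with the same parity vector are
    at distance >= 2r > r: within one group, local greedy choices never
    interact, which is what permits parallelism in the MPC version.
    """
    d = P.shape[1]
    cell_of = np.floor(P / (2.0 * r)).astype(int)
    cells = defaultdict(list)
    for idx, key in enumerate(map(tuple, cell_of)):
        cells[key].append(idx)
    chosen = []
    for parity in product((0, 1), repeat=d):            # the 2^d groups
        for key, members in cells.items():
            if tuple(c % 2 for c in key) != parity:
                continue
            for i in members:                            # local greedy MIS
                if all(np.linalg.norm(P[i] - P[j]) > r for j in chosen):
                    chosen.append(i)
    return P[chosen]  # an (r, r)-RS: independent and maximal, hence dominating
```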
An Overview for the Proof of MDS.
We start with a weaker local space bound that has a dependence of $2^{\Theta(d^2)}$ in $s$, instead of the claimed $(d/\epsilon)^{O(d)}$ bound. Our approach starts with a simple algorithm: partition $\mathbb{R}^d$ into hypercubes of side length $\ell$, where $\ell$ is a parameter to be determined. Then send the data points in each hypercube to a single machine, and locally solve the exact MDS of each hypercube. The union of these MDSs forms the final dominating set. Clearly, this returns a dominating set for the entire dataset, but it may not be of small size. Intuitively, the major gap in this algorithm arises when the optimal solution uses points located at the boundary of the hypercubes to dominate adjacent hypercubes, while the algorithm described here only uses a point to dominate points in a single hypercube.
To address this issue, we observe that bridging the gap between the algorithm's solution and the optimal solution requires only bounding the number of points in the outer "$r$-extended region" of each hypercube $H$, denoted $\mathcal{A}_r(H) := \mathcal{N}_r(H) \setminus H$, that intersect the optimal solution. Specifically, it suffices to show that this number is at most an $\epsilon$-fraction of the optimal solution size. To establish this bound, we apply an averaging argument: if we shift the entire hypercube partition by some multiple of $r$, there must exist a shift that satisfies our requirement, provided that the hypercube side length $\ell$ is sufficiently large. This may be visualized more easily in 1D: each $\mathcal{A}_r(H)$ is simply two intervals of length $r$ to the left and right of $H$ (which itself is also an interval, of length $\ell$). Then, ranging over shifts by multiples of $2r$, the two intervals of $\mathcal{A}_r(H)$, after these shifts, form a partition of the entire $\mathbb{R}$. Unfortunately, this simple shifting of hypercubes only works for an exponentially large side length $\ell$, which translates to a dependence of $2^{\Theta(d^2)}$ in the local space $s$. The main reason for this bound is that the same point in the optimal solution may belong to up to $2^d$ sets $\mathcal{A}_r(H)$ (over the hypercubes $H$).
To further reduce $\ell$, which leads to the $(d/\epsilon)^{O(d)}$ local space bound, we need to employ a more sophisticated geometric hashing. We state this in Lemma 3.1 and Fact 3.2. This hash maps each point in $\mathbb{R}^d$ into some bucket, and we replace the hypercubes in the above-mentioned algorithm with such buckets. An important property of this bucketing is that any point in $\mathbb{R}^d$ (hence any point in the optimal solution) can intersect at most $d+1$ of the extended regions $\mathcal{A}_r(B)$ over all buckets $B$, instead of $2^d$ as in the simple hypercube partition. However, the use of this new bucketing also introduces additional issues. Specifically, since the buckets have a complicated structure, it is difficult to analyze the even more complex sets $\mathcal{A}_r(B)$ for the averaging argument. To this end, we manage to show (in Lemma 3.1, third property) that the union of the sets $\mathcal{A}_r(B)$ is contained in the complement of a Cartesian power of (1D) intervals (i.e., a set of the form $\mathbb{R}^d \setminus J^d$ for a union of intervals $J \subseteq \mathbb{R}$). This structure of a Cartesian power is similar enough to the annulus of hypercubes (which may be handled by projecting to each dimension), albeit taking the complement. This eventually enables us to use a modified averaging argument to finish the proof. A sketch of the baseline partition-based algorithm follows.
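The sketch below shows the baseline partition-based DS (without the shifting or the hashing of Lemma 3.1; `greedy_local_ds` is a greedy stand-in for the exact local MDS computation, so the size guarantee is not reproduced here):

```python
import numpy as np
from collections import defaultdict

def greedy_local_ds(Q: np.ndarray, r: float) -> list:
    """Greedy r-DS for the points Q of one hypercube (stand-in for exact MDS):
    a maximal r-independent subset of Q, which dominates Q within radius r."""
    centers = []
    for q in Q:
        if all(np.linalg.norm(q - c) > r for c in centers):
            centers.append(q)
    return centers

def partitioned_ds(P: np.ndarray, r: float, ell: float) -> np.ndarray:
    """Partition R^d into hypercubes of side ell, solve each cube locally,
    and return the union; this dominates P but may be far from minimum."""
    cubes = defaultdict(list)
    for p in P:
        cubes[tuple(np.floor(p / ell).astype(int))].append(p)
    centers = []
    for Q in cubes.values():
        centers.extend(greedy_local_ds(np.array(Q), r))
    return np.array(centers)
```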
1.2.3 RS in High Dimension
Our $(r, O(\log n/\log\log n)\cdot r)$-RS in high dimension is a modification of the well-known Luby's algorithm. In this discussion, we normalize $r = 1$ and ignore this parameter. The first modification is that, unlike the standard Luby's algorithm, which runs for $O(\log n)$ iterations [45], our algorithm runs Luby's for only one iteration: for every data point $x$, generate an independent uniform random value $h_x \in [0, 1]$, and then, for each $x$, include $x$ in the RS if $x$ has the smallest value in $\mathrm{Ball}_P(x, r)$ (the ball centered at $x$ with radius $r$, intersected with $P$). This one-round Luby's algorithm achieves an $(r, O(\log n)\cdot r)$-RS (with high probability), and we also show that this is tight in general graphs. (Similar bounds were also mentioned without proof in the literature, see, e.g., [24, Exercise 1.12]; we give a proof sketch of this tight bound for completeness.) However, this is worse than the $O(\log n / \log\log n)$ factor that we can achieve. A centralized sketch of the one-round step follows.
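Below is a centralized Python sketch of the one-round step (the MPC implementation instead uses the approximate-ball primitive of Lemma 5.7; names are ours):

```python
import numpy as np

def one_round_luby(X: np.ndarray, r: float, seed: int = 0) -> np.ndarray:
    """Include x in S iff x holds the smallest random value in Ball_X(x, r)."""
    rng = np.random.default_rng(seed)
    h = rng.random(len(X))                       # one uniform value per point
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    S = []
    for i in range(len(X)):
        ball = np.flatnonzero(D[i] <= r)         # indices of Ball_X(x_i, r)
        if h[i] == h[ball].min():                # x_i wins its own ball
            S.append(i)
    return X[S]  # almost surely an r-independent set
```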
A New Preprocessing Step Based on Geometric Hashing.
Hence, we need to introduce the second important modification to this one-round Luby's algorithm in order to bypass the $O(\log n)$ factor. Specifically, before running the one-round Luby's algorithm, we run a preprocessing step to map the data points to the buckets of a geometric hashing. The geometric hashing that we use is the consistent hashing [18] (see Lemma 3.4), and the concrete guarantee in our context is that each hash bucket has diameter $O(r)$, and for any subset of diameter at most $\Omega(r)$, the number of buckets that it intersects is at most $\Gamma$ (where $\Gamma = \mathrm{poly}(n)$ after dimension reduction). We pick an arbitrary point of $P$ from each (non-empty) bucket, denote the resultant set by $X$, and run the one-round Luby's on $X$. Clearly, this hashing step only additively increases the dominating parameter by the bucket diameter $O(r)$, which we can afford. At a high level, the use of this rounding is to limit the size of the $r$-neighborhood of every point, which is the analogue of the degree of a graph. In a sense, what we prove is that one-round Luby's on graphs with degree bound $\Gamma$ yields an $O(\log\Gamma/\log\log\Gamma)$-ruling set.
Next, we explain in more detail why this hashing step helps to obtain the $O(\log n/\log\log n)$ dominating parameter. This requires a formal analysis of the one-round Luby's algorithm (which did not seem to appear in the literature), utilizing the property of the hashing that for every $x \in X$, $|\mathrm{Ball}_X(x, O(r))| \le \Gamma$.
A Re-Assignment Argument.
Let $S$ be the resultant set found by running one-round Luby's on $X$. Fix a point $x \in X$; we need to upper bound $\mathrm{dist}(x, S)$. To this end, we interpret the algorithm as defining an auxiliary sequence $y_0, y_1, \ldots, y_T$ (where $T$ is also random). Specifically, the sequence is formed by the following process: we start with $y_0 := x$; whenever $y_i$ is not picked into $S$, we define $y_{i+1}$ as the point with the smallest value in $\mathrm{Ball}_X(y_i, r)$, and we terminate the procedure otherwise. Clearly, $\mathrm{dist}(x, S) \le r \cdot T$, and it suffices to upper bound $T$ (which is a random stopping time). Indeed, a similar plan of a re-assignment argument has been employed in [17], but the definition and analysis of the sequence are quite different, since that work focuses on facility location.
Now, a crucial step is to bound the probability of the event that, given that the algorithm picks a prefix $y_0, \ldots, y_i$ of the sequence, the algorithm picks some new point $y_{i+1}$ instead of terminating. We call this the extension probability. Ideally, if we could give this extension probability a universal upper bound $p < 1$, then the probability that $T \ge t$ would be at most $p^t$, which decreases exponentially with respect to $t$. However, this seemingly good bound may not immediately be sufficient, since one still needs to take a union bound over sequences of length $t$, because the sequence is random. A naïve bound on the number of such sequences is $n^t$, since each $y_i$ may be picked from any point in $X$, and this is prohibitively large (i.e., $p^t$ cannot "cancel it out"). Moreover, an added difficulty is that the "ideal" case of having a universal upper bound for the extension probability may not be possible in the first place.
Our analysis resolves both difficulties. We show that the extension probability is roughly upper bounded by a ratio involving the sizes of the balls along the sequence (recalling that the prefix is given). For the union bound, since the hashing guarantees that $|\mathrm{Ball}_X(y, r)| \le \Gamma$ for any $y$, we can approximate each ball size $|\mathrm{Ball}_X(y_i, r)|$ by rounding it to the next power of $2$, denoted $\gamma_i$, and this yields only $O(\log\Gamma)$ possibilities after rounding. For a (fixed) sequence $y_0, \ldots, y_t$, we define its configuration as the rounded values $(\gamma_0, \ldots, \gamma_t)$. We can take a union bound with respect to the configurations of the sequences, and there are at most $O(\log\Gamma)^t$ configurations of length-$t$ sequences.
Finally, given a sequence with a given configuration $(\gamma_0, \ldots, \gamma_t)$ (for some $t \ge 1$), to upper bound the extension probability we need to analyze the joint probability of all $t$ extension steps along the sequence, and we show in a key lemma that this is upper bounded by a quantity of order $1/t!$ (up to factors controlled by the configuration). Therefore, we can pick $t = \Theta(\log n / \log\log n)$, apply the union bound, and conclude that $\Pr[T \ge t] \le 1/\mathrm{poly}(n)$.
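Schematically, and treating the per-configuration bound as the content of the key lemma (so the $1/t!$-type factor below is an assumption quoted from the discussion above, not derived here), the union bound takes the shape

\[
\Pr[T \ge t] \;\le\; \sum_{(\gamma_0, \ldots, \gamma_t)} \Pr\big[\,T \ge t \text{ and the chain has configuration } (\gamma_0, \ldots, \gamma_t)\,\big] \;\le\; O(\log\Gamma)^{t} \cdot \max_{(\gamma_0, \ldots, \gamma_t)} \Pr[\,\cdots\,],
\]

and with the maximum of order $1/t!$ (up to configuration-controlled factors) and $\Gamma \le \mathrm{poly}(n)$, choosing $t = \Theta(\log n/\log\log n)$ makes the right-hand side $1/\mathrm{poly}(n)$.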
We also provide a sketch of how to slightly modify our analysis to show that the one-round Luby's algorithm yields an $(r, O(\log n)\cdot r)$-RS in general. As mentioned, this did not seem to appear in the literature. We also provide a tight instance for this one-round Luby's algorithm.
1.3 Related Work
The $k$-Center problem, as one of the fundamental clustering problems [28, 34, 35], has been studied extensively in the sequential setting and, more recently, also in the parallel setting. Among MPC algorithms for $k$-Center, most prior works have focused on non-fully-scalable algorithms that have a dependence of $\Omega(k)$ in the local memory and can generally achieve $O(1)$ rounds. In general metric spaces, [20] obtains a large-constant approximation using $O(1/\sigma)$ rounds if the local memory is $\Omega(k \cdot n^{\sigma})$, which offers a tradeoff between the number of rounds and the local memory. More recent works aim to achieve a smaller constant ratio, down to factor 2, at the cost of having a local memory size with a fixed polynomial dependence on $k$ and/or a polynomial dependence on the number of machines [1, 31, 37, 46]. There have also been some $k$-Center studies on doubling-metric instances (which generalize low-dimensional $\mathbb{R}^d$) [7, 11], where in the bi-criteria (or with outliers) setting, not only a constant but even a $(1+\epsilon)$ ratio can be achieved thanks to the low-dimensional structure [7].
Fully scalable MPC algorithms have also been studied for the related clustering problems of $k$-Median and $k$-Means in $\mathbb{R}^d$. [12] describes a hierarchical clustering algorithm that in a constant number of rounds returns a polylogarithmic-factor approximation for $k$-Median using fully scalable local memory; we are not aware of any similar approximation for $k$-Means (even when allowing a polylogarithmic ratio). Furthermore, since the techniques used in [12] rely critically on the approach of hierarchically well-separated trees, they are unlikely to lead to sub-polylogarithmic approximation ratio algorithms. On the other hand, bi-criteria approximations are known for both $k$-Median and $k$-Means [8, 17]: in a constant number of rounds and with $n^{\sigma}$ local memory, one can approximate $k$-Median and $k$-Means with ratio $(1+\epsilon)$ using $(1+\epsilon)k$ centers [17]. In the same setting, it is also possible to achieve a true (non-bi-criteria) approximation for special, perturbation-resilient inputs [13].
2 Preliminaries
For an integer $t \ge 1$, let $[t] := \{1, 2, \ldots, t\}$. For a mapping $\varphi$, denote by $\varphi^{-1}(z)$ the pre-image of $z$. For a mapping $\varphi$ and a set $S$, write $\varphi(S)$ to denote $\{\varphi(x) : x \in S\}$. For $\delta > 0$, let $\mathbb{G}_\delta$ be the set of $\delta$-grid points, that is, points whose coordinates take values that are integer multiples of $\delta$. For a subset $S \subseteq \mathbb{R}^d$, let $\mathrm{diam}(S) := \sup_{x, y \in S} \mathrm{dist}(x, y)$ be its diameter. Similarly define $\mathrm{dist}_\infty$ and $\mathrm{diam}_\infty$ as the $\ell_\infty$ versions.
Neighborhoods.
Let $\mathrm{Ball}(x, \rho) := \{y \in \mathbb{R}^d : \mathrm{dist}(x, y) \le \rho\}$ be the ball centered at $x \in \mathbb{R}^d$ with radius $\rho > 0$. For a point set $S \subseteq \mathbb{R}^d$, let $\mathrm{Ball}_S(x, \rho) := \mathrm{Ball}(x, \rho) \cap S$ denote the ball restricted to $S$. For a point set $S$ and $\rho > 0$, let $\mathcal{N}_\rho(S) := \{x \in \mathbb{R}^d : \mathrm{dist}(x, S) \le \rho\}$ be the $\rho$-neighborhood of $S$ under the $\ell_2$ distance (with $\mathcal{N}^\infty_\rho(S)$ denoting the analogous neighborhood under the $\ell_\infty$ distance), and let $\mathcal{A}_\rho(S) := \mathcal{N}_\rho(S) \setminus S$ be the $\rho$-annulus of $S$.
Geometric Independent Set, Ruling Set and Dominating Set.
Let $r > 0$ be a threshold, $\alpha, \beta > 0$ be parameters, and $P \subseteq \mathbb{R}^d$ be a dataset. A subset $S \subseteq P$ is called an $r$-independent set ($r$-IS) for $P$ if, for every two distinct $x, y \in S$, $\mathrm{dist}(x, y) > r$; and we say $S \subseteq \mathbb{R}^d$ is an $r$-dominating set ($r$-DS) for $P$ if, for every $x \in P$, $\mathrm{dist}(x, S) \le r$. A subset $S \subseteq P$ is an $(\alpha, \beta)$-ruling set ($(\alpha, \beta)$-RS) for $P$ if $S$ is both an $\alpha$-IS and a $\beta$-DS for $P$. An $r$-MDS is an $r$-DS with the minimum size, denoted as $\mathrm{MDS}_r(P)$. A related well-known notion is the maximal independent set (MIS), where an $r$-MIS for $P$, denoted $\mathrm{MIS}_r(P)$, is an $(r, r)$-RS for $P$.
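These definitions translate directly into a brute-force checker, which can serve as a test oracle for the algorithms below (a quadratic-time sketch; names are ours):

```python
import numpy as np

def is_independent(S: np.ndarray, r: float) -> bool:
    """r-IS: every two distinct points of S are at distance > r."""
    D = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    return bool((D > r).all())

def is_dominating(S: np.ndarray, P: np.ndarray, r: float) -> bool:
    """r-DS: every point of P has some point of S within distance r."""
    D = np.linalg.norm(P[:, None, :] - S[None, :, :], axis=2)
    return bool((D.min(axis=1) <= r).all())

def is_ruling(S: np.ndarray, P: np.ndarray, alpha: float, beta: float) -> bool:
    """(alpha, beta)-RS: an alpha-IS that is also a beta-DS for P."""
    return is_independent(S, alpha) and is_dominating(S, P, beta)
```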
-Center.
Recall that the objective of $k$-Center is defined in Section 1 as $\mathrm{cost}(P, C) = \max_{p \in P} \mathrm{dist}(p, C)$ for a dataset $P$ and a center set $C$. Let $\mathrm{OPT}_k(P)$ be the minimum value of a solution for $k$-Center, i.e., $\mathrm{OPT}_k(P) := \min_{C \subseteq \mathbb{R}^d,\, |C| \le k} \mathrm{cost}(P, C)$. When the context is clear, we simply write $\mathrm{OPT}$ for $\mathrm{OPT}_k(P)$.
Standard MPC Primitives.
In our algorithms we frequently use several basic primitives on MPC with local memory $s$, using $O(N)$ total memory and $O(\log_s N)$ rounds, where $N$ is the size of a generic input. This includes the standard procedures of broadcast and converge-cast (of a message of size at most $s$), see, e.g., [23]. Goodrich et al. [29] show that the task of sorting $N$ numbers can also be performed deterministically in the above setting.
Lemma 2.1 (Packing property, cf. [48, Lemma 4.1]).
For a point set $S \subseteq \mathbb{R}^d$ such that $\mathrm{diam}(S) \le \Delta$ and $\mathrm{dist}(x, y) > r$ for every two distinct $x, y \in S$, we have $|S| \le (1 + 2\Delta/r)^d$.
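For completeness, the bound follows from a standard volume argument (a sketch): the balls of radius $r/2$ around the points of $S$ are pairwise disjoint and all contained in a ball of radius $\Delta + r/2$ around any fixed point of $S$, hence

\[
|S| \;\le\; \frac{\mathrm{vol}\big(\mathrm{Ball}(\cdot,\, \Delta + r/2)\big)}{\mathrm{vol}\big(\mathrm{Ball}(\cdot,\, r/2)\big)} \;=\; \Big(\frac{\Delta + r/2}{r/2}\Big)^{d} \;=\; \Big(1 + \frac{2\Delta}{r}\Big)^{d}.
\]

For example, in $\mathbb{R}^2$ with $\Delta = 10r$ this gives $|S| \le 21^2 = 441$.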
3 Geometric Hashing
We present our new geometric hashing in Lemma 3.1. This hashing is crucially used in our low-dimension results. The construction of this hashing is the same as in [18, Theorem 5.3], and the first two properties have also been established in [18]. However, the third property is new, and it is based on a careful analysis that utilizes the structure of this specific construction.
Lemma 3.1.
For every $\Delta > 0$ and integer $d \ge 1$, there is a hash function $\varphi : \mathbb{R}^d \to \mathbb{R}^d$, with a width parameter $\rho = \Delta/\mathrm{poly}(d)$, such that the following holds.
1. Each bucket has diameter at most $\Delta$, namely, for every image $z \in \varphi(\mathbb{R}^d)$, $\mathrm{diam}(\varphi^{-1}(z)) \le \Delta$.
2. The bucket set $\{\varphi^{-1}(z) : z \in \varphi(\mathbb{R}^d)\}$ can be partitioned into $d+1$ groups $G_0, \ldots, G_d$, such that every two distinct buckets $B, B'$ in the same group have $\mathrm{dist}_\infty(B, B') > \rho$.
3. For every bucket $B$, $\mathcal{N}_\rho(B) \setminus B \subseteq \mathbb{R}^d \setminus J^d$ (recall that the notation $J^d$ denotes the $d$-th Cartesian power of $J$), where $J := \bigcup_{t \in \mathbb{Z}} I_t$ and $I_t := [t\Lambda + \rho^{\ast},\, (t+1)\Lambda - \rho^{\ast}]$ for $t \in \mathbb{Z}$; here $\Lambda := \Delta/\sqrt{d}$ and $\rho^{\ast} = \mathrm{poly}(d) \cdot \rho < \Lambda/2$ are parameters of the construction.
Furthermore, it takes $\mathrm{poly}(d)$ space to store $\varphi$ and to evaluate $\varphi(x)$ for every $x \in \mathbb{R}^d$.
Proof.
The proof can be found in Appendix A.
Lemma 3.1 readily implies the following property. Roughly speaking, it ensures that any point set with small enough diameter intersects only a bounded number of buckets of the hash.
Fact 3.2.
The second property of Lemma 3.1 implies that for every $S \subseteq \mathbb{R}^d$ such that $\mathrm{diam}_\infty(S) \le \rho$, it holds that $|\varphi(S)| \le d + 1$.
Geometric hash functions with guarantees similar to Fact 3.2 have been studied under the notion of consistent hashing [18], or sparse partitions, which interpret the hashing as a space partition [22, 41]. Specifically, the definition of consistent hashing is stated as follows. The consistency guarantee of Fact 3.2 is slightly stronger in the sense that the diameter bound on $S$ is with respect to the $\ell_\infty$ distance instead of $\ell_2$.
Definition 3.3 ([18, Definition 1.6]).
A mapping $\varphi : \mathbb{R}^d \to \mathbb{R}^d$ is called a $\Gamma$-gap $\Lambda$-consistent hash with diameter bound $\Delta$, or simply a $(\Gamma, \Lambda)$-hash, if it satisfies:
-
Diameter: for every image $z \in \varphi(\mathbb{R}^d)$, we have $\mathrm{diam}(\varphi^{-1}(z)) \le \Delta$; and
-
Consistency: for every $S \subseteq \mathbb{R}^d$ with $\mathrm{diam}(S) \le \Delta/\Gamma$, we have $|\varphi(S)| \le \Lambda$.
Lemma 3.4 gives a space-efficient consistent hashing with near-optimal parameter tradeoffs. We rely on this parameter tradeoff in a preprocessing step of our high dimension ruling set result.
Lemma 3.4 ([18, Theorem 5.1]).
For every $\Gamma \ge 2$, there exists a (deterministic) $(\Gamma, \Lambda)$-hash $\varphi$ where $\Lambda = \exp(O(d/\Gamma)) \cdot \mathrm{poly}(d)$. Furthermore, $\varphi$ can be described using $\mathrm{poly}(d)$ bits, and one can evaluate $\varphi(x)$ for any point $x \in \mathbb{R}^d$ in $\mathrm{poly}(d)$ space.
4 MPC Algorithms for RS and MDS: Formal Statements
In this section, we present the formal statements of our results for RS and MDS. The proofs for the low-dimensional RS and MDS results (Lemmas 4.1 and 4.2) closely follow the arguments outlined in the technical overview (see Section 1.2.2) and are therefore omitted due to space limitations (they can be found in the full version [16]). We nonetheless provide a proof outline for the high-dimensional RS result (Lemma 4.3) in Section 5.
Lemma 4.1.
There is a deterministic MPC algorithm that, given a threshold $r > 0$, a constant $\epsilon > 0$, and a dataset $P \subseteq \mathbb{R}^d$ of $n$ points distributed across MPC machines with local memory $s \ge (d/\epsilon)^{O(d)}$, computes an $(r, (1+\epsilon)r)$-RS for $P$ in $O(\log_s n)$ rounds, using $(d/\epsilon)^{O(d)} \cdot \tilde{O}(n)$ total memory.
Lemma 4.2.
There is a deterministic MPC algorithm that, given a threshold $r > 0$, a parameter $\epsilon > 0$, and a dataset $P \subseteq \mathbb{R}^d$ of $n$ points distributed across MPC machines with local memory $s \ge (d/\epsilon)^{O(d)}$, computes a $((1+\epsilon)r)$-DS $S$ for $P$ such that $|S| \le (1+\epsilon) \cdot |\mathrm{MDS}_r(P)|$, in $O(\log_s n)$ rounds, using $(d/\epsilon)^{O(d)} \cdot \tilde{O}(n)$ total memory.
Lemma 4.3.
There is an MPC algorithm that, given a threshold $r > 0$ and a dataset $P$ of $n$ points in $\mathbb{R}^d$ distributed across MPC machines with local memory $s \ge \mathrm{poly}(d \log n)$, with probability at least $1 - 1/\mathrm{poly}(n)$ computes an $(r, O(\log n/\log\log n)\cdot r)$-ruling set for $P$ in $O(\log_s n)$ rounds, using $\tilde{O}(nd)$ total memory.
5 Proof Outline of Lemma 4.3: RS in High Dimension
Our algorithm may be viewed as a Euclidean version of one-round Luby's, with two major differences. One is an additional preprocessing step (Algorithm 1), which is crucially needed to improve the ruling set parameter from $O(\log n)$ (which is what the one-round Luby's algorithm can achieve in general [24]) to $O(\log n/\log\log n)$. The other is that it is not immediate to exactly simulate the one-round Luby's in high-dimensional Euclidean spaces, as the neighborhood around a data point can be huge and is not easy to handle in MPC. To this end, we use an approximate ball $\widehat{\mathrm{Ball}}(x, r)$ that is "sandwiched" between $\mathrm{Ball}(x, r)$ and $\mathrm{Ball}(x, (1+\epsilon)r)$ for some $\epsilon > 0$ and every point $x$; an MPC procedure for this has been suggested in [17] (restated in Lemma 5.7). We address the inaccuracy introduced by $\widehat{\mathrm{Ball}}$ in the analysis of the offline algorithm. In the remainder of this section, we use the notations from Algorithms 1 and 2.
Lemma 5.1.
$S$ is an $r$-independent set for $X$ (with probability $1$).
Lemma 5.2.
For any $\beta > 0$, any $(r, \beta)$-ruling set for $X$ is an $(r, \beta + O(r))$-ruling set for $P$.
Fact 5.3.
For every $x \in X$, $|\mathrm{Ball}_X(x, 2r)| \le \Gamma$.
Lemma 5.4.
With probability at least $1 - 1/\mathrm{poly}(n)$, for every $x \in X$ it holds that $\mathrm{dist}(x, S) \le O(\log n/\log\log n) \cdot r$.
Proof.
It suffices to show that for every $x \in X$, with probability $1 - 1/\mathrm{poly}(n)$, $\mathrm{dist}(x, S) \le O(\log n/\log\log n) \cdot r$, since one can then conclude the proof using a union bound over all points $x \in X$.
Now, fix a point $x \in X$ and consider $\mathrm{dist}(x, S)$. In order to analyze $\mathrm{dist}(x, S)$, we construct a sequence $y_0, y_1, \ldots, y_T$ using an auxiliary algorithm, stated in Algorithm 3, where the sequence starts at the point $y_0 = x$, and we denote the length of the sequence by $T$, which is random. We emphasize that Algorithm 3 is defined with respect to the internal states of a specific (random) run of Algorithm 2, and it does not introduce new randomness.
Observe that in Algorithm 3, for every $0 \le i < T$, $y_{i+1} \in \widehat{\mathrm{Ball}}(y_i, r)$. Since for every $y$, $\widehat{\mathrm{Ball}}(y, r) \subseteq \mathrm{Ball}_X(y, (1+\epsilon)r)$, we have that
\[ \mathrm{dist}(x, S) \;\le\; \sum_{i=0}^{T-1} \mathrm{dist}(y_i, y_{i+1}) \;\le\; (1+\epsilon)\, r \cdot T \]
by the triangle inequality. Hence, it remains to give an upper bound for $T$, and we show this in the following Lemma 5.5, which is the main technical lemma. Its proof can be found in the full version due to the space limit.
Lemma 5.5.
For $t = \Theta(\log n/\log\log n)$, we have $\Pr[T \ge t] \le 1/\mathrm{poly}(n)$.
Proof of Lemma 4.3.
Our MPC algorithm for Lemma 4.3 is obtained via an MPC implementation of the offline Algorithms 1 and 2. However, before we do so, our algorithm needs to run the Johnson–Lindenstrauss (JL) transform [42] on the dataset as preprocessing, to reduce the dimension $d$ to $O(\log n)$ using a constant accuracy parameter $\epsilon$; the transform is restated as follows with our notation.
Lemma 5.6 (JL transform [42]).
For some $d' = O(\epsilon^{-2}\log n)$, let $G \in \mathbb{R}^{d' \times d}$ be a random matrix such that every entry is an independent standard Gaussian variable, and let $\pi(x) := \frac{1}{\sqrt{d'}}\, G x$; then the mapping $\pi$ satisfies, with probability at least $1 - 1/\mathrm{poly}(n)$, that $\mathrm{dist}(\pi(x), \pi(y)) \in (1 \pm \epsilon) \cdot \mathrm{dist}(x, y)$ for all $x, y \in P$.
This JL transform only needs $O(1)$ rounds to run in MPC, since one can generate the matrix $G$ on a lead machine, which uses space $O(d \cdot d')$, and then broadcast it to every other machine. Since the pairwise distances between data points are preserved, and our algorithms only use distances between data points, it is immediate that any ruling set on $\pi(P)$ is as well a ruling set on $P$, up to a $(1 \pm \epsilon)$ rescaling of its parameters. Hence, it suffices to work on $\pi(P)$, which is of dimension $O(\log n)$. Without loss of generality, we can simply assume $P$ is of dimension $d = O(\log n)$.
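A minimal centralized sketch of this JL preprocessing (in MPC, the matrix is generated on a lead machine and broadcast, as described above; the constant in $d'$ is illustrative):

```python
import numpy as np

def jl_transform(P: np.ndarray, eps: float, seed: int = 0) -> np.ndarray:
    """Project n points in R^d down to d' = O(eps^-2 log n) dimensions."""
    n, d = P.shape
    d_new = int(np.ceil(24 * np.log(n) / eps**2))  # constant is illustrative
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((d, d_new))            # i.i.d. standard Gaussians
    return (P @ G) / np.sqrt(d_new)                # scaling preserves norms in expectation

P = np.random.rand(500, 10_000)
Q = jl_transform(P, eps=0.5)
print(Q.shape)  # (500, d')
```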
Now, we turn to implementing Algorithms 1 and 2 in MPC. For Algorithm 1, since the hash in Lemma 3.4 that we use is data-oblivious and only requires $\mathrm{poly}(d)$ space, all machines can generate the same hash without communication. The other steps in Algorithm 1 may be implemented using standard MPC procedures, including sorting, broadcasting, and converge-casting. For Algorithm 2, the only nontrivial step is to implement $\widehat{\mathrm{Ball}}$. To this end, we make use of the following lemma from [17]. In particular, our MPC implementation of Algorithm 2 applies Lemma 5.7 with a constant parameter $\epsilon$. This finishes the description of the algorithm for Lemma 4.3.
Lemma 5.7 ([17, Theorem 3.1]).
There is a deterministic MPC algorithm that takes as input $\epsilon \in (0, 1)$, $r > 0$, a dataset $P \subseteq \mathbb{R}^d$ of $n$ points and, for each $x \in P$, a value $v_x$, distributed across machines with local memory $s$, and computes for every $x \in P$ the value $\min_{y \in \widehat{\mathrm{Ball}}(x, r)} v_y$, where $\widehat{\mathrm{Ball}}(x, r)$ is an arbitrary set that satisfies
\[ \mathrm{Ball}_P(x, r) \;\subseteq\; \widehat{\mathrm{Ball}}(x, r) \;\subseteq\; \mathrm{Ball}_P(x, (1+\epsilon)\, r), \tag{2} \]
in $O(\log_s n)$ rounds and $\tilde{O}(nd)$ total memory.
Now we turn to the analysis. Observe that the guarantees of $\widehat{\mathrm{Ball}}$ underlying the analysis of Algorithms 1 and 2 remain valid. By Lemma 5.1, the found set $S$ is also an $r$-independent set for $P$, as $S \subseteq X \subseteq P$. Moreover, since we may assume $d = O(\log n)$, the hashing parameter satisfies $\Gamma \le \mathrm{poly}(n)$. Therefore, applying Lemma 5.4, we conclude that $\mathrm{dist}(x, S) \le O(\log n/\log\log n) \cdot r$ for every $x \in X$, with probability $1 - 1/\mathrm{poly}(n)$. Combining this with Lemma 5.2, we conclude that $S$ is an $(r, O(\log n/\log\log n)\cdot r)$-ruling set for $P$, with probability $1 - 1/\mathrm{poly}(n)$.
References
- [1] Sepideh Aghamolaei and Mohammad Ghodsi. A 2-approximation algorithm for data-distributed metric k-center. arXiv, 2309.04327, 2023. doi:10.48550/arXiv.2309.04327.
- [2] AmirMohsen Ahanchi, Alexandr Andoni, MohammadTaghi Hajiaghayi, Marina Knittel, and Peilin Zhong. Massively parallel tree embeddings for high dimensional spaces. In Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 77–88. ACM, 2023. doi:10.1145/3558481.3591096.
- [3] Baruch Awerbuch, Andrew V. Goldberg, Michael Luby, and Serge A. Plotkin. Network decomposition and locality in distributed computation. In 30th Annual Symposium on Foundations of Computer Science (FOCS), pages 364–369. IEEE Computer Society, 1989. doi:10.1109/SFCS.1989.63504.
- [4] Amir Azarmehr, Soheil Behnezhad, Rajesh Jayaram, Jakub Lacki, Vahab Mirrokni, and Peilin Zhong. Massively parallel minimum spanning tree in general metric spaces. In Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 143–174. SIAM, 2025. doi:10.1137/1.9781611978322.5.
- [5] MohammadHossein Bateni, Hossein Esfandiari, Manuela Fischer, and Vahab S. Mirrokni. Extreme k-center clustering. In Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI), pages 3941–3949. AAAI Press, 2021. doi:10.1609/AAAI.V35I5.16513.
- [6] Paul Beame, Paraschos Koutris, and Dan Suciu. Communication steps for parallel query processing. Journal of the ACM, 64(6):40:1–40:58, 2017. doi:10.1145/3125644.
- [7] Mark de Berg, Leyla Biabani, and Morteza Monemizadeh. k-center clustering with outliers in the MPC and streaming model. In IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 853–863. IEEE, 2023. doi:10.1109/IPDPS54959.2023.00090.
- [8] Aditya Bhaskara and Maheshakya Wijewardena. Distributed clustering via LSH based data partitioning. In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pages 569–578. PMLR, 2018. URL: http://proceedings.mlr.press/v80/bhaskara18a.html.
- [9] Leyla Biabani and Ami Paz. k-center clustering in distributed models. In Structural Information and Communication Complexity – 31st International Colloquium (SIROCCO), volume 14662 of Lecture Notes in Computer Science, pages 83–100. Springer, 2024. doi:10.1007/978-3-031-60603-8_5.
- [10] Mélanie Cambus, Fabian Kuhn, Shreyas Pai, and Jara Uitto. Time and space optimal massively parallel algorithm for the 2-ruling set problem. In 37th International Symposium on Distributed Computing (DISC), volume 281 of LIPIcs, pages 11:1–11:12. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.DISC.2023.11.
- [11] Matteo Ceccarello, Andrea Pietracaprina, and Geppino Pucci. Solving k-center clustering (with outliers) in MapReduce and streaming, almost as accurately as sequentially. Proceedings of the VLDB Endowment, 12(7):766–778, 2019. doi:10.14778/3317315.3317319.
- [12] Vincent Cohen-Addad, Silvio Lattanzi, Ashkan Norouzi-Fard, Christian Sohler, and Ola Svensson. Parallel and efficient hierarchical k-median clustering. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems (NeurIPS), pages 20333–20345, 2021. URL: https://proceedings.neurips.cc/paper/2021/hash/aa495e18c7e3a21a4e48923b92048a61-Abstract.html.
- [13] Vincent Cohen-Addad, Vahab S. Mirrokni, and Peilin Zhong. Massively parallel k-means clustering for perturbation resilient instances. In International Conference on Machine Learning (ICML), volume 162 of Proceedings of Machine Learning Research, pages 4180–4201. PMLR, 2022. URL: https://proceedings.mlr.press/v162/cohen-addad22b.html.
- [14] Sam Coy, Artur Czumaj, and Gopinath Mishra. On parallel k-center clustering. In Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 65–75. ACM, 2023. doi:10.1145/3558481.3591075.
- [15] Artur Czumaj, Arnold Filtser, Shaofeng H.-C. Jiang, Robert Krauthgamer, Pavel Veselý, and Mingwei Yang. Streaming facility location in high dimension via geometric hashing. arXiv preprint arXiv:2204.02095, 2022. The latest version has additional results compared to the preliminary version in [18]. arXiv:2204.02095.
- [16] Artur Czumaj, Guichen Gao, Mohsen Ghaffari, and Shaofeng H.-C. Jiang. Fully scalable MPC algorithms for Euclidean k-center. arXiv, 2504.16382, 2025. doi:10.48550/arXiv.2504.16382.
- [17] Artur Czumaj, Guichen Gao, Shaofeng H.-C. Jiang, Robert Krauthgamer, and Pavel Veselý. Fully-scalable MPC algorithms for clustering in high dimension. In 51st International Colloquium on Automata, Languages, and Programming (ICALP), volume 297 of LIPIcs, pages 50:1–50:20. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024. doi:10.4230/LIPICS.ICALP.2024.50.
- [18] Artur Czumaj, Shaofeng H.-C. Jiang, Robert Krauthgamer, Pavel Veselý, and Mingwei Yang. Streaming facility location in high dimension via geometric hashing. In 63rd IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 450–461. IEEE, 2022. doi:10.1109/FOCS54457.2022.00050.
- [19] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008. doi:10.1145/1327452.1327492.
- [20] Alina Ene, Sungjin Im, and Benjamin Moseley. Fast clustering using MapReduce. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 681–689. ACM, 2011. doi:10.1145/2020408.2020515.
- [21] Alessandro Epasto, Mohammad Mahdian, Vahab S. Mirrokni, and Peilin Zhong. Massively parallel and dynamic algorithms for minimum size clustering. In Proceedings of the 2022 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1613–1660. SIAM, 2022. doi:10.1137/1.9781611977073.66.
- [22] Arnold Filtser. Scattering and sparse partitions, and their applications. ACM Transactions on Algorithms, 20(4):30:1–30:42, 2024. doi:10.1145/3672562.
- [23] Mohsen Ghaffari. Massively parallel algorithms, 2019. Lecture notes from ETH Zürich. URL: http://people.csail.mit.edu/ghaffari/MPA19/Notes/MPA.pdf.
- [24] Mohsen Ghaffari. Distributed graph algorithms, 2022. Lecture notes from MIT. URL: https://people.csail.mit.edu/ghaffari/DA22/Notes/DGA.pdf.
- [25] Mohsen Ghaffari, Christoph Grunau, and Ce Jin. Improved MPC algorithms for MIS, matching, and coloring on trees and beyond. In 34th International Symposium on Distributed Computing (DISC), volume 179 of LIPIcs, pages 34:1–34:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPICS.DISC.2020.34.
- [26] Mohsen Ghaffari and Jara Uitto. Sparsifying distributed algorithms with ramifications in massively parallel computation and centralized local computation. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1636–1653. SIAM, 2019. doi:10.1137/1.9781611975482.99.
- [27] Jeff Giliberti and Zahra Parsaeian. Massively parallel ruling set made deterministic. In 38th International Symposium on Distributed Computing (DISC), volume 319 of LIPIcs, pages 29:1–29:21. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024. doi:10.4230/LIPICS.DISC.2024.29.
- [28] Teofilo F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306, 1985. doi:10.1016/0304-3975(85)90224-5.
- [29] Michael T. Goodrich, Nodari Sitchinava, and Qin Zhang. Sorting, searching, and simulation in the MapReduce framework. In Algorithms and Computation – 22nd International Symposium (ISAAC), volume 7074 of Lecture Notes in Computer Science, pages 374–383. Springer, 2011. doi:10.1007/978-3-642-25591-5_39.
- [30] Chetan Gupta, Rustam Latypov, Yannic Maus, Shreyas Pai, Simo Särkkä, Jan Studený, Jukka Suomela, Jara Uitto, and Hossein Vahidi. Fast dynamic programming in trees in the MPC model. In Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 443–453. ACM, 2023. doi:10.1145/3558481.3591098.
- [31] Alireza Haqi and Hamid Zarrabi-Zadeh. Almost optimal massively parallel algorithms for k-center clustering and diversity maximization. In Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 239–247. ACM, 2023. doi:10.1145/3558481.3591077.
- [32] Sariel Har-Peled. No, coreset, no cry. In Foundations of Software Technology and Theoretical Computer Science (FSTTCS), volume 3328 of Lecture Notes in Computer Science, pages 324–335. Springer, 2004. doi:10.1007/978-3-540-30538-5_27.
- [33] Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC), pages 291–300. ACM, 2004. doi:10.1145/1007352.1007400.
- [34] Dorit S. Hochbaum and David B. Shmoys. A best possible heuristic for the k-center problem. Mathematics of Operations Research, 10(2):180–184, 1985. doi:10.1287/MOOR.10.2.180.
- [35] Dorit S. Hochbaum and David B. Shmoys. A unified approach to approximation algorithms for bottleneck problems. Journal of the ACM, 33(3):533–550, 1986. doi:10.1145/5925.5933.
- [36] Sungjin Im, Ravi Kumar, Silvio Lattanzi, Benjamin Moseley, and Sergei Vassilvitskii. Massively parallel computation: Algorithms and applications. Foundations and Trends in Optimization, 5(4):340–417, 2023. doi:10.1561/2400000025.
- [37] Sungjin Im and Benjamin Moseley. Brief announcement: Fast and better distributed MapReduce algorithms for k-center clustering. In Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 65–67. ACM, 2015. doi:10.1145/2755573.2755607.
- [38] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings of the 2007 EuroSys Conference, pages 59–72. ACM, 2007. doi:10.1145/1272996.1273005.
- [39] Rajesh Jayaram, Vahab Mirrokni, Shyam Narayanan, and Peilin Zhong. Massively parallel algorithms for high-dimensional Euclidean minimum spanning tree. In Proceedings of the 2024 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 3960–3996. SIAM, 2024. doi:10.1137/1.9781611977912.139.
- [40] Hongyan Ji, Kishore Kothapalli, Sriram V. Pemmaraju, and Ajitanshu Singh. Fast deterministic massively parallel ruling sets algorithms. In Proceedings of the 26th International Conference on Distributed Computing and Networking (ICDCN), pages 152–160. ACM, 2025. doi:10.1145/3700838.3700872.
- [41] Lujun Jia, Guolong Lin, Guevara Noubir, Rajmohan Rajaraman, and Ravi Sundaram. Universal approximations for TSP, Steiner tree, and set cover. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing (STOC), pages 386–395. ACM, 2005. doi:10.1145/1060590.1060649.
- [42] William Johnson and Joram Lindenstrauss. Extensions of Lipschitz maps into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984. doi:10.1090/conm/026/737400.
- [43] Howard J. Karloff, Siddharth Suri, and Sergei Vassilvitskii. A model of computation for MapReduce. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 938–948. SIAM, 2010. doi:10.1137/1.9781611973075.76.
- [44] Ting Liang, Qilong Feng, Xiaoliang Wu, Jinhui Xu, and Jianxin Wang. Improved approximation algorithm for the distributed lower-bounded k-center problem. In Theory and Applications of Models of Computation – 18th Annual Conference (TAMC), volume 14637 of Lecture Notes in Computer Science, pages 309–319. Springer, 2024. doi:10.1007/978-981-97-2340-9_26.
- [45] Michael Luby. A simple parallel algorithm for the maximal independent set problem. In Proceedings of the 17th Annual ACM Symposium on Theory of Computing (STOC), pages 1–10. ACM, 1985. doi:10.1145/22145.22146.
- [46] Gustavo Malkomes, Matt J. Kusner, Wenlin Chen, Kilian Q. Weinberger, and Benjamin Moseley. Fast distributed k-center clustering with outliers on massive data. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems (NIPS), pages 1063–1071, 2015. URL: https://proceedings.neurips.cc/paper/2015/hash/8fecb20817b3847419bb3de39a609afe-Abstract.html.
- [47] Krzysztof Onak. Round compression for parallel graph algorithms in strongly sublinear space. arXiv, 1807.08745, 2018. doi:10.48550/arXiv.1807.08745.
- [48] David Pollard. Empirical Processes: Theory and Applications, chapter 4: Packing and Covering in Euclidean Spaces, pages 14–20. IMS, 1990. doi:10.1214/cbms/1462061091.
- [49] Tim Roughgarden, Sergei Vassilvitskii, and Joshua R. Wang. Shuffles and circuits (on lower bounds for modern parallel computation). Journal of the ACM, 65(6):41:1–41:24, 2018. doi:10.1145/3232536.
- [50] Václav Rozhoň and Mohsen Ghaffari. Polylogarithmic-time deterministic network decomposition and distributed derandomization. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 350–363. ACM, 2020. doi:10.1145/3357713.3384298.
- [51] Johannes Schneider and Roger Wattenhofer. An optimal maximal independent set algorithm for bounded-independence graphs. Distributed Computing, 22(5-6):349–361, 2010. doi:10.1007/S00446-010-0097-1.
- [52] Tom White. Hadoop: The Definitive Guide. O'Reilly, 4th edition, 2015. URL: http://www.oreilly.de/catalog/9781491901632/index.html.
- [53] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. In 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud'10). USENIX Association, 2010. URL: https://www.usenix.org/conference/hotcloud-10/spark-cluster-computing-working-sets.
Appendix A Proof of Lemma 3.1: Geometric Hashing
See Lemma 3.1.
The construction of the hash is the same as that in [15, Theorem 5.3]. Their proof already implies the space bound as well as the first two properties; in fact, our second property is stronger than what they state, but the stronger version was indeed proved in their paper, albeit implicitly. We restate their algorithm/construction, briefly sketch the proof of the first two properties, and focus on the third property.
Hash Construction.
We start by reviewing their algorithm/construction of $\varphi$. Partition $\mathbb{R}^d$ into hypercubes of side length $\Lambda := \Delta/\sqrt{d}$, of the form $\prod_{i=1}^{d}[t_i\Lambda, (t_i+1)\Lambda)$ where $t_1, \ldots, t_d \in \mathbb{Z}$. Notice that each of these hypercubes is of diameter $\Delta$. For every $k \in \{0, 1, \ldots, d-1\}$, let $F_k$ be the set of $k$-dimensional faces of the above-mentioned hypercubes, and let $V_k$ be the set of points that belong to these faces, i.e., $V_k := \bigcup_{F \in F_k} F$. For each $k$, let $N_k := \mathcal{N}^\infty_{\rho_k}(V_k)$ be the $\rho_k$-neighborhood of $V_k$ (with respect to the $\ell_\infty$ distance), where $\rho_0 > \rho_1 > \cdots > \rho_{d-1} \ge \rho$ are suitably chosen decreasing multiples of the base width $\rho = \Delta/\mathrm{poly}(d)$. Let $N := \bigcup_k N_k$ and $\rho^{\ast} := \rho_0 + \rho$. The main procedure for the construction of $\varphi$ goes as follows.
-
For every $k = 0, 1, \ldots, d-1$ (in increasing order), for every face $F \in F_k$, let $B_F := \mathcal{N}^\infty_{\rho_k}(F) \setminus \bigcup_{B'} B'$, where the union is over all buckets $B'$ created so far, and for every $x \in B_F$ assign $\varphi(x) := z_F$, where $z_F \in B_F$ is an arbitrary but fixed point. Observe that $B_F \subseteq N_k$, and thus the points of $B_F$ will not be assigned again in later iterations.
-
For every hypercube $H$, let $B_H$ be the remaining part of $H$ whose $\varphi$-value has not been assigned, and assign $\varphi(x) := z_H$ for every $x \in B_H$, where $z_H \in B_H$ is an arbitrary but fixed point.
First Two Properties.
The first property is immediate from this construction, since every bucket is contained either in the $\rho_k$-neighborhood of a single face or in a single hypercube of side length $\Lambda$, each of which has diameter at most $\Delta$ by the choice of parameters. To establish the second property, we need to define the groups. For $k \in \{0, \ldots, d-1\}$, define $G_k := \{B_F : F \in F_k\}$, and let $G_d := \{B_H : H \text{ is a hypercube}\}$. Then, by the construction of $\varphi$, every bucket belongs to exactly one $G_k$, hence $(G_0, \ldots, G_d)$ is a proper partition of the buckets. These $G_k$'s are implicitly analyzed in [15, Lemma 5.5 and Lemma 5.8], restated and adapted (as Lemma A.1) to our notation as follows. This lemma readily implies our second property.
Lemma A.1 ([15, Lemma 5.5 and Lemma 5.8]).
For every $k \in \{0, 1, \ldots, d\}$ and every two distinct buckets $B, B' \in G_k$, $\mathrm{dist}_\infty(B, B') > \rho$.
Third Property.
We proceed to prove the third property. The third property is well-defined, in particular each interval $I_t$ is nonempty, since by the choice of parameters it can be verified that $\rho^{\ast} < \Lambda/2$. The proof starts with the following claim. It relates the complicated sets $\mathcal{N}_\rho(B) \setminus B$ to a much simpler object, $\mathcal{N}^\infty_{\rho^{\ast}}(V_{d-1})$, which is merely a neighborhood of the $(d-1)$-dimensional faces.
Claim A.2.
$\bigcup_{B}\big(\mathcal{N}_\rho(B) \setminus B\big) \subseteq \mathcal{N}^\infty_{\rho^{\ast}}(V_{d-1})$, where the union is taken over all buckets $B$.
Proof.
Recall that every bucket belongs to exactly one group $G_k$. Therefore, it suffices to prove that for every $k$ and for every bucket $B \in G_k$, $\mathcal{N}_\rho(B) \setminus B \subseteq \mathcal{N}^\infty_{\rho^{\ast}}(V_{d-1})$. We prove this separately for the cases $k < d$ and $k = d$.
Case I: $k < d$.
We first analyze the case that $k < d$. Fix some $k < d$ and some $B = B_F$ (for some face $F \in F_k$). Observe that $B \subseteq \mathcal{N}^\infty_{\rho_k}(F)$. Then we have that $\mathcal{N}_\rho(B) \subseteq \mathcal{N}^\infty_{\rho}(B) \subseteq \mathcal{N}^\infty_{\rho_k + \rho}(F)$. Observe that $F \subseteq V_{d-1}$ and $\rho_k + \rho \le \rho_0 + \rho = \rho^{\ast}$. Hence, $\mathcal{N}_\rho(B) \subseteq \mathcal{N}^\infty_{\rho^{\ast}}(V_{d-1})$. This finishes the case of $k < d$.
Case II: $k = d$.
Next, we analyze the case that $k = d$, i.e., $B = B_H$ for some hypercube $H$. By Lemma A.1, for every two distinct $B_H, B_{H'} \in G_d$, $\mathrm{dist}_\infty(B_H, B_{H'}) > \rho$, and this implies
\[ \mathrm{dist}(B_H, B_{H'}) > \rho, \tag{3} \]
as the $\ell_\infty$ distance between any two points is at most their $\ell_2$ distance. Now fix some $B_H \in G_d$. By the construction, $B_H \subseteq H$ and $H \setminus B_H \subseteq N$. By (3), $\mathcal{N}_\rho(B_H)$ has no intersection with any bucket of $G_d$ other than $B_H$ itself. Hence,
\[ \mathcal{N}_\rho(B_H) \setminus B_H \subseteq N. \tag{4} \]
To further relate $N$ with $\mathcal{N}^\infty_{\rho^{\ast}}(V_{d-1})$, we have
\[ N = \bigcup_{k' < d} \mathcal{N}^\infty_{\rho_{k'}}(V_{k'}) \subseteq \mathcal{N}^\infty_{\rho_0}(V_{d-1}) \subseteq \mathcal{N}^\infty_{\rho^{\ast}}(V_{d-1}), \]
where the first inclusion follows from $V_{k'} \subseteq V_{d-1}$ and $\rho_{k'} \le \rho_0$ for every $k' < d$. Combining this with (4), we conclude that $\mathcal{N}_\rho(B_H) \setminus B_H \subseteq \mathcal{N}^\infty_{\rho^{\ast}}(V_{d-1})$. This finishes the proof of Claim A.2.
By Claim A.2, it remains to prove $\mathcal{N}^\infty_{\rho^{\ast}}(V_{d-1}) \subseteq \mathbb{R}^d \setminus J^d$. Since $V_{d-1} = \bigcup_{F \in F_{d-1}} F$, it suffices to show, for every $F \in F_{d-1}$, that $\mathcal{N}^\infty_{\rho^{\ast}}(F) \subseteq \mathbb{R}^d \setminus J^d$. Now fix some $F \in F_{d-1}$ and $y \in \mathcal{N}^\infty_{\rho^{\ast}}(F)$. Since $F$ is a $(d-1)$-dimensional face, it has $d-1$ dimensions spanning an entire interval of length $\Lambda$, and it has exactly one dimension taking a fixed value of the form $t\Lambda$ ($t \in \mathbb{Z}$). Formally, $x \in F$ if and only if there exist $i \in [d]$ and $t \in \mathbb{Z}$ such that
\[ x_i = t\Lambda, \quad \text{and} \quad x_j \in [t_j\Lambda, (t_j+1)\Lambda] \text{ for every } j \ne i \text{ (for integers } t_j \text{ fixed by } F\text{)}. \tag{5} \]
Let $i$ be as in (5). Then for every $y \in \mathcal{N}^\infty_{\rho^{\ast}}(F)$,
\[ |y_i - t\Lambda| \le \mathrm{dist}_\infty(y, F) \le \rho^{\ast}, \]
where the last inequality follows from the definition of $\mathcal{N}^\infty_{\rho^{\ast}}$. This further implies that $y_i \notin I_{t'}$ for every $t' \in \mathbb{Z}$, i.e., $y_i \notin J$, by the triangle inequality. Therefore, we conclude that $y \notin J^d$, and this finishes the proof of Lemma 3.1. ∎