Streaming Diameter of High-Dimensional Points
Abstract
We improve the space bound for streaming approximation of Diameter but also of Farthest Neighbor queries, Minimum Enclosing Ball and its Coreset, in high-dimensional Euclidean spaces. In particular, our deterministic streaming algorithms store points. This improves by a factor of the previous space bound of Agarwal and Sharathkumar (SODA 2010), while retaining the state-of-the-art approximation guarantees, such as for Diameter or Farthest Neighbor queries, and also offering a simpler and more complete argument. Moreover, we show that storing points is necessary for a streaming -approximation of Farthest Pair and Farthest Neighbor queries.
Keywords and phrases:
streaming algorithm, farthest pair, diameter, minimum enclosing ball, coresetFunding:
Magnús M. Halldórsson: Partially supported by Icelandic Research Fund grant 2511609.Copyright and License:
2012 ACM Subject Classification:
Theory of computation Streaming, sublinear and near linear time algorithmsAcknowledgements:
We would like to thank Martin Tancer for helpful discussions on high-dimensional geometry.Editors:
Anne Benoit, Haim Kaplan, Sebastian Wild, and Grzegorz HermanSeries and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
In the streaming model, the input data is assumed to be vast and must be processed using limited memory in one or a few passes. Therefore, streaming algorithms “sketch” the input, yielding a small data structure that still accurately preserves desired properties. The research on streaming algorithms has been remarkably fruitful and we now have optimal or near-optimal algorithms for counting distinct elements, frequency moments, quantiles and a plethora of other problems (see [14] for a comprehensive exposition).
We focus on high-dimensional geometric streams, where the input consists of points in for large, a topic of recent interest, e.g., [28, 17, 26, 15, 25, 11, 10, 12, 23]. Extent measures, such as the Diameter or the Minimum Enclosing Ball, are fundamental statistics of a set of points, having a body of work in both streaming and non-streaming settings [2, 22, 19, 3, 8, 7].
In an influential work, Agarwal and Sharathkumar [5] gave a streaming algorithmic framework for several high-dimensional extent problems. Their Blurred Ball Cover data structure maintains a collection of balls, whose union approximately covers the input . It was then used to approximate a number of high-dimensional extent problems. The claim is that each ball is represented by a coreset of points of , for a total space of points. It appears though that a somewhat higher space bound is needed for the claimed approximations; see Appendix A.
We give a modified data structure, Guarded Ball Cover, building on [5]. It allows for both a simpler and more complete treatment, and also results in the smaller space bound of points. The improved space bound extends to all four applications: approximate Farthest Neighbor queries and for maintaining approximate Farthest Pair (providing an estimate for Diameter), Minimum Enclosing Ball, and Coreset for Minimum Enclosing Ball. This is feasible by storing only a single point per ball, along with its center and radius. Correctness arguments are simplified by also storing the first point of as a proxy for all points deleted later from memory. We retain the approximation guarantees of all four applications, as in [5]: for Minimum Enclosing Ball and for each of Diameter, Farthest Neighbor Queries, and Coreset for Minimum Enclosing Ball.
We also show that points need to be stored for a comparable -approximation of Farthest Neighbor queries and also for maintaining -approximate Farthest Pair. This applies to a computational model where the algorithm must return an input point upon a query and the space is determined by the number of points stored; crucially, once a point is deleted from memory, it cannot be retrieved.
2 Preliminaries
Let be a multiset of points in that arrive sequentially in a stream. Upon arrival, each point is either stored in memory (and, possibly, deleted later) or irrevocably discarded. We assume one-pass streaming algorithms in the insertion-only setting. By we denote an error parameter and by an approximation guarantee.
An extent measure of a set of points computes certain statistics of either this set or a geometric shape enclosing it [2]. Let denote the Euclidean distance between points and . The Diameter is the maximum Euclidean distance between any pair of points in and the Farthest Pair is a pair of points of having Euclidean distance equal to the Diameter. The Farthest Neighbor of a point is a point of the largest Euclidean distance from . An -farthest-neighbor - of a query point is a point such that for every it is .
By we denote a ball centered at point with radius . The -expansion of is defined as . The Minimum Enclosing Ball is the ball of minimum radius containing all points of . A ball is - if and each point of is within Euclidean distance from .
For a set of points , a coreset is a set preserving a geometric property of [21]. A set is - for MEB if each point of is contained in the -expansion of .
We focus on computing - for any query and maintaining -, - and -; these ratios are the same as in [5].
2.1 Related Work
The streaming algorithm of Gonzalez [20] computes a 2-MEB by storing the first point of and its farthest neighbor ; the enclosing ball is simply . Zarrabi-Zadeh and Chan [29] improved the guarantee of -MEB to by giving a one-pass streaming algorithm that stores one ball. They also gave a lower bound of for the guarantee of any deterministic algorithm for -MEB that stores only one ball.
Badŏiu et al. [8] showed that the number of coreset points approximating - for a set in does not depend on . Improved algorithms were given in [24, 6, 7]; however, these algorithms do not work in a streaming fashion.
Preceding the work of Agarwal and Sharathkumar [4], a simple -approximate two-pass algorithm for the Diameter was given by Egecioglu and Kalantari [16], working in space .
Following the conference result of [4] that maintains a -, Chan and Pathak [9] improved this guarantee to by employing a detailed analysis to this algorithm. Subsequently, Agarwal and Sharathkumar in their journal paper [5] observed that the guarantee of their algorithm is slightly greater than by presenting an input for .
On the negative side, any randomized streaming algorithm that maintains -, - or - with probability at least requires space for certain values of . These values are for - and - and for -, as shown by Agarwal and Sharathkumar [5].
For low , such as or , the lower bounds do not apply as it is possible to maintain - or answer - queries in a poly-logarithmic space, using an optimized version of the sampling approach of [18]; this applies also to dynamic streams where points may be deleted. For high-dimensional dynamic streams, the best streaming algorithm follows from asymmetric embedding techniques of Indyk [22] but provides -approximation only at the cost of using space polynomial in .
3 The Guarded Ball Cover
The Guarded Ball Cover is a collection of balls that approximately cover all points of . We represent each ball of by a triplet , where is the center of (possibly ), is its radius, and is a point of inside . The point is referred to as the guard of . Our algorithm maintains a coreset that consists of the guard points. We treat the first point specially by always having .
Let be the collection of the -expansions of the balls in . If the arriving point belongs to , then it is discarded. Otherwise, a new ball is added to . Finally, all balls of too small radius are removed from .
As the space bound is our primary measure, we assume an exact MEB algorithm, but a good approximation also suffices, specifically within a factor of (using the algorithm of [6] as a subroutine).
The following lemma holds for every MEB computed in line 2 of Algorithm 1:
Lemma 1 (Lemma 2.2 in [8]).
If is the MEB of a set of points, then any closed half-space containing also contains a point of on the boundary of .
To analyze the algorithm, we first observe that deleted balls are “guarded” by .
Lemma 2.
Suppose ball is deleted when a ball of radius is added to . Then, is contained in a ball of radius centered at . Consequently, for any guard point that has been deleted up to the current time, it holds that .
Proof.
contains and when deleted in line 3, it has radius at most . Since the guard of (which is also evicted from memory) is inside , it is within distance at most from . The main technical part is to show that the radii of new balls increase exponentially:
Lemma 3.
If is added to following , then it holds that .
Proof.
The proof is similar to that of Lemma 2 in [5] and Claim 2.4 in [8]. We first assume that the exact MEB is computed in line 2 of Algorithm 1. Let , be the point sets such that and , i.e., is the coreset just before computing . Consider two cases:
If then (Figure 1, left), using the triangle inequality and that (by line 1).
Otherwise . Then let be the hyperplane passing through with direction as its normal and let be the halfspace bounded by that does not contain . There is a point at Euclidean distance from , by Lemma 1. Then, (Figure 1, right), where the first inequality follows from the cosine law. By Lemma 2, there is a point such that (if then , and otherwise). Hence, .
Finally, if we use a -approximate MEB, then we still have that .
Finally, we show that at any time, the -expansion of any deleted ball is contained in the -expansion of each ball created after the deletion of the former ball.
Lemma 4.
If ball was deleted then , for each created after the deletion of .
Proof.
Let . By Lemmas 2 and 3, it is ; therefore, we have as . Since is in , it follows that . Our main result follows from the preceding lemmas:
Theorem 5.
contains balls and .
Proof.
Differences to the Blurred Ball Cover
The Blurred Ball Cover [5], similarly to the Guarded Ball Cover, initiates the computation of a new ball when an arriving point is not contained in any -expansion of a stored ball. The key difference is that in the Blurred Ball Cover, the coreset on the boundary of the new ball, which is returned by the MEB computation and comprises up to points, is explicitly stored in memory for each ball. (In fact, points may need to be stored for each ball in the Blurred Ball Cover as we discuss in Appendix A.) Coresets belonging to sufficiently small balls are also deleted from the Blurred Ball Cover, though the radius threshold for this deletion is instead of in our algorithm. Finally, the first point is never deleted by the Guarded Ball Cover, contrary to the Blurred Ball Cover which treats as any other point of .
Update time
The worst-case update time of our algorithm is when an approximate MEB computation is used [7], which is comparable to that of [5]; the precise update times primarily depend on the MEB computation. (Note that the update times of [5] are also affected by the possible issue mentioned in Appendix A.) We remark that the Blurred Ball Cover of [5] allows for better amortized update time as it is possible to perform batched updates, i.e., storing incoming points in a buffer and only running the update procedure when the buffer gets full (see also the second paragraph in Appendix A). This is not possible in our case as we require storing one guard per ball; that is, if we run an update on our sketch with a batch of size , we may need to store up to guard points for if they all end up in the coreset of (and thus on its boundary).
4 Applications
Farthest Neighbor Queries and Diameter
We largely follow the analysis of [5]. For a query point , the algorithm computes and returns the farthest point in : . We show that , where is one of the (optimal) farthest points from .
Let be the ball in of greatest radius that contains in its -expansion, which exists by Theorem 5. Applying the triangle inequality, followed by the inequality , we have that
| (1) |
By Lemma 1 (for half-space with direction as normal, on the boundary, and not containing ), when was created, there was a guard such that: i) , ii) , and iii) . By i), iii), and the cosine law, we have
| (2) |
Combining (1) and (2) and using ii), we get that
| (3) |
Note that , where is the radius of the largest ball in , as otherwise there is a ball containing of radius less than . Also, by Lemma 2, there is a point ( or ) of distance at most from . By definition of , . Thus, . Combining this with (3) gives that is a -, for .
Finally, for the closely related problem of Diameter (Farthest Pair), we return for each point . If and exceeds the stored value for Diameter, then we replace the old Farthest Pair with the new pair .
Corollary 6.
For a stream of points in , the Guarded Ball Cover of stored points answers - for any query , and maintains -.
Minimum Enclosing Ball
The following theorem was shown by Chan and Pathak [9] and improved the guarantee of the approximate MEB algorithm of Agarwal and Sharathkumar to (see [5], p. 91):
Theorem 7 (Theorem 1 in [9], Theorem 1 in [27]).
Let be subsets of a point set in , with such that: i) is increasing over , and ii) , for each . Then, , where .
In our case, is the coreset on the boundary of the MEB computed in line 2 of Algorithm 1. Therefore, the first requirement of Theorem 7 holds by Lemma 3. The second requirement follows immediately for points in and by Lemma 4 for points deleted from .
Corollary 8.
For a stream of points in , the Guarded Ball Cover of stored points maintains -.
Coreset for Minimum Enclosing Ball
We mostly follow the analysis of [5]. Let be the most recently created ball added to and let be the coreset at the time of computing , thus . Note that has radius at least , since is an MEB of a subset of . We claim that contains all points in , which implies that forms a -. Namely, we show that each point has , for .
Consider a point that is farthest from . Let be a guarded ball that has not been deleted and whose -expansion contains ( is well-defined by Theorem 5). By the triangle inequality and the definition of , . Let be the hyperplane passing through with direction as normal and let be the halfspace bounded by that does not contain . By Lemma 1 there is a guard in with , and by the cosine law it is . Then (using the inequality ),
By Lemma 2, there is a guard with , so by the triangle inequality, . Hence, .
Corollary 9.
For a stream of points in , the Guarded Ball Cover of stored points maintains -.
5 Lower Bound for Farthest Pair and Farthest Neighbor Queries
We show that computing a -approximation (with ) of Farthest Neighbor queries or Farthest Pair is impossible in the streaming model without returning points outside of .
This applies to the computational model of “coreset-based algorithms” in which the space bound is counted in the number of input points stored and the algorithm must return an input point upon a query (or two input points for the Farthest Pair); crucially, once a point is deleted from memory, it cannot be retrieved. This model is akin to comparison-based model in sorting or selection, as used in [13] for streaming lower bounds for quantile estimation.
Theorem 10.
For any and , any coreset-based randomized streaming algorithm answering approximate Farthest Neighbor queries or maintaining the approximate Farthest Pair in with multiplicative error , has to store points.
Proof.
We use the easy direction of Yao’s minimax principle and design a distribution over instances (points and Farthest Neighbor queries) so that any deterministic streaming algorithm using space will, with high constant probability, answer a query incorrectly, i.e., the point returned on the query will be more than -factor closer to than the farthest point. The same argument applies to Farthest Pair.
Suppose without loss of generality that . We insert points of the standard basis, i.e., , where the -th coordinate is , for (the order of insertions does not matter). The random part of the construction is to choose uniformly at random and make a Farthest Neighbor query for point , where the coordinate is and only the first coordinates are not . Clearly, the farthest point from is and their Euclidean distance is , using the choice of .
However, with some constant probability, point is not stored as the algorithm stores points. Conditioning on the event that is not stored, the algorithm needs to answer the query with a point for . However, the distance between and with is , using the choice of . Thus, the approximation ratio of the algorithm is at least .
6 Conclusions and Open Problems
We have designed streaming algorithms storing points from that can estimate several extent statistics of the input. All error guarantees are almost optimal, with the only exception of the Minimum Enclosing Ball application, where there exists a small gap between the approximation guarantee of and the lower bound converging to for . This is achieved by simplifying (and also fixing) the Blurred Ball Cover from [5] into a “Guarded Ball Cover”, where we store points per ball, compared to points per ball for the Blurred Ball Cover.
We believe that the space bound can be improved, at least by shaving off the factor. One possible direction for improvement is the use of randomization as both our algorithms and those of [5] are deterministic, while the simple lower bound of for “coreset-based” algorithms holds even when randomization is used.
While our algorithms are more space-efficient than that of [5], the amortized update times are somewhat worse, as discussed in Section 3. Thus, we ask if it is possible to optimize the amortized update time while retaining the near-quadratic space bound. More importantly, it would be interesting to develop a “fast” streaming algorithm for extent problems, that is, with amortized update time .
Beyond the streaming setting, an important property of data sketches is mergeability [1], which enables to summarize the input in a parallel or distributed way and then merge the resulting sketches into one summary of the whole dataset. It is open how to design a merge operation for Guarded (or Blurred) Ball covers while retaining the space and approximation guarantees.
References
- [1] Pankaj K. Agarwal, Graham Cormode, Zengfeng Huang, Jeff M. Phillips, Zhewei Wei, and Ke Yi. Mergeable summaries. ACM Trans. Database Syst., 38(4):26, 2013. doi:10.1145/2500128.
- [2] Pankaj K. Agarwal, Sariel Har-Peled, and Kasturi R. Varadarajan. Approximating extent measures of points. J. ACM, 51(4):606–635, 2004. doi:10.1145/1008731.1008736.
- [3] Pankaj K. Agarwal, Jiří Matoušek, and Subhash Suri. Farthest neighbors, maximum spanning trees and related problems in higher dimensions. Comput. Geom., 1:189–201, 1991. doi:10.1016/0925-7721(92)90001-9.
- [4] Pankaj K. Agarwal and R. Sharathkumar. Streaming algorithms for extent problems in high dimensions. In Proceedings of the 21st ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1481–1489. SIAM, 2010. doi:10.1137/1.9781611973075.120.
- [5] Pankaj K. Agarwal and R. Sharathkumar. Streaming algorithms for extent problems in high dimensions. Algorithmica, 72(1):83–98, 2015. doi:10.1007/S00453-013-9846-4.
- [6] Mihai Badoiu and Kenneth L. Clarkson. Smaller core-sets for balls. In Proceedings of the 14th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 801–802. ACM/SIAM, 2003. URL: http://dl.acm.org/citation.cfm?id=644108.644240.
- [7] Mihai Badoiu and Kenneth L. Clarkson. Optimal core-sets for balls. Comput. Geom., 40(1):14–22, 2008. doi:10.1016/J.COMGEO.2007.04.002.
- [8] Mihai Badoiu, Sariel Har-Peled, and Piotr Indyk. Approximate clustering via core-sets. In Proceedings of the 34th ACM Symposium on Theory of Computing (STOC), pages 250–257. ACM, 2002. doi:10.1145/509907.509947.
- [9] Timothy M. Chan and Vinayak Pathak. Streaming and dynamic algorithms for minimum enclosing balls in high dimensions. In Proceedings of the 12th Workshop on Algorithms and Data Structures (WADS), pages 195–206. Springer, 2011. doi:10.1007/978-3-642-22300-6_17.
- [10] Xi Chen, Vincent Cohen-Addad, Rajesh Jayaram, Amit Levi, and Erik Waingarten. Streaming Euclidean MST to a constant factor. In Proceedings of the 55th ACM Symposium on Theory of Computing (STOC), pages 156–169. ACM, 2023. doi:10.1145/3564246.3585168.
- [11] Xi Chen, Rajesh Jayaram, Amit Levi, and Erik Waingarten. New streaming algorithms for high dimensional EMD and MST. In Proceedings of the 54th ACM Symposium on Theory of Computing (STOC), pages 222–233. ACM, 2022. doi:10.1145/3519935.3519979.
- [12] Xiaoyu Chen, Shaofeng H.-C. Jiang, and Robert Krauthgamer. Streaming Euclidean max-cut: Dimension vs data reduction. In Proceedings of the 55th ACM Symposium on Theory of Computing (STOC), pages 170–182. ACM, 2023. doi:10.1145/3564246.3585170.
- [13] Graham Cormode and Pavel Veselý. A tight lower bound for comparison-based quantile summaries. In Proceedings of the 39th ACM Symposium on Principles of Database Systems (PODS), pages 81–93. ACM, 2020. doi:10.1145/3375395.3387650.
- [14] Graham Cormode and Ke Yi. Small Summaries for Big Data. Cambridge University Press, 2020.
- [15] Artur Czumaj, Shaofeng H.-C. Jiang, Robert Krauthgamer, Pavel Veselý, and Mingwei Yang. Streaming facility location in high dimension via geometric hashing. In Proceedings of the 63rd IEEE Symposium on Foundations of Computer Science (FOCS), pages 450–461. IEEE, 2022. doi:10.1109/FOCS54457.2022.00050.
- [16] Ömer Egecioglu and Bahman Kalantari. Approximating the diameter of a set of points in the Euclidean space. Inf. Process. Lett., 32(4):205–211, 1989. doi:10.1016/0020-0190(89)90045-8.
- [17] Hossein Esfandiari, Praneeth Kacham, Vahab Mirrokni, David P. Woodruff, and Peilin Zhong. High-dimensional geometric streaming for nearly low rank data. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024. URL: https://openreview.net/forum?id=yQfA0etfB7.
- [18] Gereon Frahling, Piotr Indyk, and Christian Sohler. Sampling in dynamic data streams and applications. Int. J. Comput. Geom. Appl., 18(1/2):3–28, 2008. doi:10.1142/S0218195908002520.
- [19] Ashish Goel, Piotr Indyk, and Kasturi R. Varadarajan. Reductions among high dimensional proximity problems. In Proceedings of the 12th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 769–778. ACM/SIAM, 2001. URL: http://dl.acm.org/citation.cfm?id=365411.365776.
- [20] Teofilo F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theor. Comput. Sci., 38:293–306, 1985. doi:10.1016/0304-3975(85)90224-5.
- [21] Sariel Har-Peled. Geometric Approximation Algorithms. American Mathematical Society, 2011.
- [22] Piotr Indyk. Better algorithms for high-dimensional proximity problems via asymmetric embeddings. In Proceedings of the 14th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 539–545. ACM/SIAM, 2003. URL: http://dl.acm.org/citation.cfm?id=644108.644200.
- [23] Shaofeng H.-C. Jiang, Robert Krauthgamer, and Shay Sapir. Moderate dimension reduction for k-center clustering. In Proceedings of the 40th International Symposium on Computational Geometry (SoCG), volume 293, pages 64:1–64:16, 2024. doi:10.4230/LIPICS.SOCG.2024.64.
- [24] Piyush Kumar, Joseph S. B. Mitchell, and E. Alper Yildirim. Approximate minimum enclosing balls in high dimensions using core-sets. ACM J. Exp. Algorithmics, 8, 2003. doi:10.1145/996546.996548.
- [25] Sepideh Mahabadi, Ilya P. Razenshteyn, David P. Woodruff, and Samson Zhou. Non-adaptive adaptive sampling on turnstile streams. In Proceedings of the 52nd ACM Symposium on Theory of Computing (STOC), pages 1251–1264. ACM, 2020. doi:10.1145/3357713.3384331.
- [26] Yury Makarychev, Naren Sarayu Manoj, and Max Ovsiankin. Near-optimal streaming ellipsoidal rounding for general convex polytopes. In Proceedings of the 56th ACM Symposium on Theory of Computing (STOC), pages 1526–1537. ACM, 2024. doi:10.1145/3618260.3649692.
- [27] Vinayak Pathak. Streaming and dynamic algorithms for minimum enclosing balls in high dimensions. Master’s thesis, University of Waterloo, 2011.
- [28] David P. Woodruff and Taisuke Yasuda. High-dimensional geometric streaming in polynomial space. In Proceedings of the 63rd IEEE Symposium on Foundations of Computer Science (FOCS), pages 732–743. IEEE, 2022. doi:10.1109/FOCS54457.2022.00075.
- [29] Hamid Zarrabi-Zadeh and Timothy M. Chan. A simple streaming algorithm for minimum enclosing balls. In Proceedings of the 18th Annual Canadian Conference on Computational Geometry (CCCG), 2006.
Appendix A A Note on Lemma 2 in [5]
We report on a possible issue in the proof of Lemma 2 in [5] (a similar issue appears in the conference version [4]). Lemma 2 in [5] states that for any , , where and are two consecutive balls of the Blurred Ball Cover. The property that the radii of balls increase geometrically is crucially needed to bound the space requirements.
The proof of Lemma 2 goes as follows: Ball is created ( is the coreset of and APPROX-MEB is the subroutine of [7] that computes it) because one of the points in a buffer is not contained in any -expansion of a ball for . Define ball as the MEB of . It is subsequently claimed that , without a proof. However, is the MEB of , while is an approximate MEB for , namely the -expansion of contains all these points but a smaller expansion of may not contain them all.
We observe that this is fixable by computing a tighter MEB approximation . That, however, may result in a coreset of size , which (unlike our approach) increases the space bound in [5] from to . We are unaware of a fix that does not affect the space bound or error guarantees.
