
Approximating Klee’s Measure Problem and a Lower Bound for Union Volume Estimation

Karl Bringmann, Saarland University, Saarland Informatics Campus, Saarbrücken, Germany; Max-Planck-Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
Kasper Green Larsen, Aarhus University, Denmark
André Nusser, Université Côte d’Azur, CNRS, Inria, I3S, Sophia Antipolis, France
Eva Rotenberg, Technical University of Denmark, Kongens Lyngby, Denmark
Yanheng Wang, Saarland University, Saarland Informatics Campus, Saarbrücken, Germany; Max-Planck-Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
Abstract

Union volume estimation is a classical algorithmic problem. Given a family of objects $O_1,\dots,O_n \subseteq \mathbb{R}^d$, we want to approximate the volume of their union. In the special case where all objects are boxes (also called hyperrectangles), this is known as Klee’s measure problem. The state-of-the-art $(1+\varepsilon)$-approximation algorithm [Karp, Luby, Madras ’89] for union volume estimation, as well as for Klee’s measure problem in constant dimension $d$, uses a total of $O(n/\varepsilon^2)$ queries of three types: (i) determine the volume of $O_i$; (ii) sample a point uniformly at random from $O_i$; and (iii) ask whether a given point is contained in $O_i$.

First, we show that if an algorithm learns about the objects only through these types of queries, then $\Omega(n/\varepsilon^2)$ queries are necessary. In this sense, the complexity of [Karp, Luby, Madras ’89] is optimal. Our lower bound holds even if the objects are equiponderous axis-aligned polygons in $\mathbb{R}^2$, if the containment query allows arbitrary (not necessarily sampled) points, and if the algorithm can spend arbitrary time and space examining the query responses.

Second, we provide a more efficient approximation algorithm for Klee’s measure problem, which improves the running time from $O(n/\varepsilon^2)$ to $O((n+\frac{1}{\varepsilon^2})\log^{O(d)}(n))$. We circumvent our lower bound by exploiting the geometry of boxes in various ways: (1) We sort the boxes into classes of similar shapes after inspecting their corner coordinates. (2) With orthogonal range searching, we show how to sample points from the union of boxes in each class, and how to merge samples from different classes. (3) We bound the amount of wasted work by arguing that most pairs of classes have a small intersection.

Keywords and phrases:
approximation, volume of union, union of objects, query complexity
Funding:
Kasper Green Larsen: Supported by a DFF Sapere Aude Research Leader Grant No. 9064-00068B.
André Nusser: This work was supported by the French government through the France 2030 investment plan managed by the National Research Agency (ANR), as part of the Initiative of Excellence of Université Côte d’Azur under reference number ANR-15-IDEX-01. Part of this work was conducted while the author was at BARC, University of Copenhagen, supported by the VILLUM Foundation grant 16582.
Eva Rotenberg: Supported by DFF Grant 2020-2023 (9131-00044B) “Dynamic Network Analysis”, the VILLUM Foundation grant VIL37507 “Efficient Recomputations for Changeful Problems” and the Carlsberg Foundation Young Researcher Fellowship CF21-0302 “Graph Algorithms with Geometric Applications”.
Copyright and License:
© Karl Bringmann, Kasper Green Larsen, André Nusser, Eva Rotenberg, and
Yanheng Wang; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Computational geometry
Related Version:
Full Version: https://arxiv.org/abs/2410.00996 [5]
Funding:
Karl Bringmann and Yanheng Wang: This work is part of the project TIPEA that has received funding from the European Research Council (ERC) under the European Unions Horizon 2020 research and innovation programme (grant agreement No. 850979).
Editors:
Oswin Aichholzer and Haitao Wang

1 Introduction

We revisit the classical problem of union volume estimation: given objects $O_1,\dots,O_n \subseteq \mathbb{R}^d$, we want to estimate the volume of $O_1 \cup \dots \cup O_n$.¹ This problem has several important applications such as DNF counting and network reliability; see the discussion in Section 1.2.

¹ Technically, the objects need to be measurable. In the most general form, $O_1,\dots,O_n$ can be any measurable sets in a measure space, and we want to estimate the measure of their union. However, this work only deals with boxes in $\mathbb{R}^d$ (in our algorithm) and polygons in the plane (in our lower bound).

The state-of-the-art solution [19] works in a model where one has access to each input object Oi by three types of queries: (i) determine the volume of the object, (ii) sample a point uniformly at random from the object, and (iii) ask whether a point is contained in the object. Apart from these types of queries, the model allows arbitrary computations. The complexity of algorithms is thus measured by the number of queries.

After Karp and Luby [18] introduced this model, Karp, Luby and Madras [19] gave an algorithm that $(1+\varepsilon)$-approximates the volume of $n$ objects in this model with constant success probability.² It uses $O(n/\varepsilon^2)$ queries plus $O(n/\varepsilon^2)$ additional time, which improves earlier algorithms [18, 22], and only asks containment queries on previously sampled points. The last 35 years have seen no algorithmic improvement. Is this classical upper bound best possible? We resolve the question in this work by providing a matching lower bound.

² The success probability can be boosted to $1-\delta$, adding a $\log(1/\delta)$ factor in time and query complexity.
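To make the query model concrete, the following is a simplified sketch of the basic Karp-Luby importance-sampling estimator. This is our illustration, not the refined scheme of [19]; the `Box` class and its interface are hypothetical stand-ins for the three oracle queries.

```python
import math
import random

class Box:
    """Hypothetical axis-aligned box exposing the three query types
    of the model; the class name and interface are our illustration."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
    def volume(self):                                   # query (i)
        return math.prod(h - l for l, h in zip(self.lo, self.hi))
    def sample(self):                                   # query (ii)
        return tuple(random.uniform(l, h) for l, h in zip(self.lo, self.hi))
    def contains(self, x):                              # query (iii)
        return all(l <= xi <= h for l, h, xi in zip(self.lo, self.hi, x))

def union_volume_estimate(objects, num_samples):
    """Unbiased importance-sampling estimator for the union volume:
    pick an object proportionally to its volume, sample a point in it,
    and weight by 1 over the number of objects containing that point."""
    total = sum(o.volume() for o in objects)
    weights = [o.volume() / total for o in objects]
    acc = 0.0
    for _ in range(num_samples):
        o = random.choices(objects, weights=weights)[0]
        x = o.sample()
        # 1/count corrects for points covered by several objects.
        count = sum(obj.contains(x) for obj in objects)
        acc += 1.0 / count
    return total * acc / num_samples
```

The estimator is unbiased because a point of the union is proposed with probability proportional to the number of objects containing it, which the $1/\mathrm{count}$ factor cancels; the refined algorithm of [19] additionally avoids testing all $n$ objects per sampled point.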

The union volume estimation problem was also studied recently in the streaming setting [25, 23], where a stream of objects $O_1,\dots,O_n \subseteq \Omega$ arrives in order. When we are at position $i$ in the stream, we can only query object $O_i$. This line of work yields algorithms with constant success probability that use $O(\mathrm{polylog}(|\Omega|)/\varepsilon^2)$ queries and additional time per object (and the same bound also holds for the space complexity). Summed over all $n$ objects, the bounds match the classical algorithm up to the $\mathrm{polylog}(|\Omega|)$ factor. So, interestingly, even in the streaming setting the same upper bound can be achieved.³

³ See also [30] for earlier work studying Klee’s measure problem in the streaming setting.

Perhaps the most famous application of the algorithm by Karp, Luby, and Madras [19] is Klee’s measure problem [21]. This is a fundamental problem in computational geometry in which we are given $n$ axis-aligned boxes in $\mathbb{R}^d$ and want to compute the volume of their union. An axis-aligned box is a set of the form $[a_1,b_1]\times\dots\times[a_d,b_d] \subseteq \mathbb{R}^d$, and the input consists of the corner coordinates $a_1,b_1,\dots,a_d,b_d$ of each box. A long line of research on this problem and various special cases (e.g., for fixed dimensions or for cubes) [31, 26, 11, 2, 1, 12, 32, 3] led to an exact algorithm running in time $O(n^{d/2}+n\log n)$ for constant $d$ [13]. A conditional lower bound suggests that any faster algorithm would require fast matrix multiplication techniques [12], but it is unclear how to apply fast matrix multiplication to this problem. On the approximation side, note that the three queries can be implemented in time $O(d)$ for any $d$-dimensional axis-aligned box. Thus the union volume estimation algorithm can be applied, and it computes a $(1+\varepsilon)$-approximation to Klee’s measure problem in time $O(nd/\varepsilon^2)$, as observed in [4]. This direct application of union volume estimation was the state of the art for approximating Klee’s measure problem until our work. See Section 1.2 for further applications of Klee’s measure problem.

1.1 Our contribution

Lower bound for union volume estimation

Given the state of the art, a natural question is to ask whether the query complexity of the general union volume estimation algorithm of [19] can be further improved. Any such improvement would speed up several important applications, cf. Section 1.2. On the other hand, any lower bound showing that the algorithm of [19] is optimal also implies tightness of the known streaming algorithms (up to logarithmic factors), as the streaming algorithms match the static running time bound.

Previously, a folklore lower bound of $\Omega(n+1/\varepsilon^2)$ was known. But between this and the upper bound of $O(n/\varepsilon^2)$ there is still a large gap; consider for example the regime $\varepsilon = 1/\sqrt{n}$, where the bounds are $\Omega(n)$ and $O(n^2)$, respectively. We close this gap by strengthening the lower bound to $\Omega(n/\varepsilon^2)$, thereby also showing optimality of [19]. Note that the lower bound concerns the query complexity in the model; it does not make any computational assumptions. In particular, it holds even if the algorithm spends arbitrary time and space on computation.

Theorem 1.

Any algorithm that computes a $(1+\varepsilon)$-approximation to the volume of the union of $n$ objects via volume, sampling, and containment queries with success probability at least $2/3$ must make $\Omega(n/\varepsilon^2)$ queries.

We highlight that our lower bound holds even for equiponderous, axis-aligned polygons in the plane.

Upper bound for Klee’s measure problem

Our lower bound for union volume estimation implies that any improved algorithm for Klee’s measure problem must exploit the geometric structure of boxes. To this end, we exploit our knowledge of the corner coordinates and classify the boxes by shape. In addition, we apply the geometric tool of orthogonal range searching. Together, these allow us to approximate Klee’s measure problem faster than the previous $O(n/\varepsilon^2)$ algorithm. Here and throughout, we assume that the dimension $d$ is a constant; in particular, we suppress factors of $2^{O(d)}$ in the time complexity.

Theorem 2.

There is an algorithm that runs in time $O((n+\frac{1}{\varepsilon^2})\log^{2d+1}(n))$ and with probability at least $0.9$ computes a $(1+\varepsilon)$-approximation for Klee’s measure problem.

This is a strict improvement when $\varepsilon \leq 1/\log^{d+1}(n)$, i.e., when $\varepsilon$ is small. Consider for example the regime $\varepsilon = 1/\sqrt{n}$: our algorithm runs in near-linear time, whereas the previous algorithm runs in quadratic time. We remark that the success probability can be boosted to any $1-\delta$ by standard techniques, which incurs an extra $\log(1/\delta)$ factor in the running time. Finally, we highlight that the core of our algorithm is an efficient method to sample uniformly and independently with a given density from the union of the input boxes. This can be of independent interest outside the context of volume estimation.

1.2 Related work

A major application of union volume estimation is DNF counting, in which we are given a Boolean formula in disjunctive normal form and want to count the number of satisfying assignments. Computing the exact number of satisfying assignments is #P-complete, so it likely requires superpolynomial time. Approximating the number of satisfying assignments can be achieved by a direct application of union volume estimation, as described in [19]. Their algorithm remains the state of the art for approximate DNF counting, see e.g. [24]. This has been extended to more general model counting [27, 9, 24], probabilistic databases [20, 15, 28], and probabilistic queries on databases [7].

We also mention network reliability as another application for union volume estimation, which was already discussed in [19]. Additionally, Karger’s famous paper on the problem [17] uses the algorithm of [19] as a subroutine. However, the current state-of-the-art algorithms no longer use union volume estimation as a tool [8].

Finally, we want to draw a connection to the following well-known query sampling bound. Canetti, Even, and Goldreich [6] showed that approximating the mean of a random variable whose codomain is the unit interval requires $\Omega(1/\varepsilon^2)$ queries, thus obtaining tight bounds for the sampling complexity of the mean estimation problem. Their bound generalizes to $\Omega(1/(\mu\varepsilon^2))$ on the number of queries needed to estimate the mean $\mu$ of a general random variable. Before our work it was thus natural to expect that the $1/\varepsilon^2$ dependence in the number of queries for union volume estimation is optimal. However, whether the factor $n$ is necessary, or whether the number of queries could be improved to, say, $O(n+1/\varepsilon^2)$, was open to the best of our knowledge.

Klee’s measure problem is an important problem in computational geometry. One reason for its significance is that techniques developed for it can often be adapted to solve various related problems, such as the depth problem (given a set of boxes, what is the largest number of boxes that can be stabbed by a single point?) [13] or Hausdorff distance under translation in $L_\infty$ [14]. Moreover, various other problems can be reduced to Klee’s measure problem or to its related problems. For example, deciding whether a set of boxes covers its bounding box can be reduced to Klee’s measure problem [13]. The continuous $k$-center problem on graphs (i.e., finding centers that can lie on the edges of a graph and that cover the vertices of the graph) is reducible to Klee’s measure problem as well [29]. Finding the smallest hypercube containing at least $k$ points among $n$ given points can be reduced to the depth problem [16, 10, 13]. In light of this, it would be interesting to see if our approximation techniques generalize to any of these related problems.

2 Technical overview

We now sketch our upper and lower bounds at an intuitive level. The formal arguments will be presented in Section 3 and Section 4, respectively. In this paper we denote $[n] := \{1,2,\dots,n\}$.

2.1 Upper bound for Klee’s measure problem

Due to our lower bound, we have to exploit the structure of the input boxes to obtain a running time of the form $O((n+\frac{1}{\varepsilon^2})\,\mathrm{polylog}(n))$. We follow the common algorithmic approach of sampling. Specifically, we aim to sample a set $S$ from the union of boxes such that each point is selected with probability density $p$. For an appropriately defined $p$, the value $|S|/p$ is a good estimate of the volume of the union. In this overview we assume that a proper $p$ is given and focus on the main difficulty: creating a sample set $S$ for the given $p$.

We start by sorting the input boxes into classes by shape. Two boxes are in the same class if, in each dimension $k\in[d]$, their side lengths both lie in $[2^{L_k}, 2^{L_k+1})$ for some $L_k\in\mathbb{Z}$. We call two classes similar if the corresponding side lengths are polynomially related (e.g., within a factor of $n^4$) in each dimension. We make two crucial observations:

  1. Each class is similar to only polylogarithmically many classes (Observation 11).

  2. Dissimilar classes have a small intersection compared to their union (Observation 12, Figure 1).

Figure 1: When the side lengths of two boxes differ a lot in at least one of their dimensions (in our examples, the y-axis), their intersection is small compared to their union.

Our algorithm continues as follows. We scan through the classes in an arbitrary order. For each class, we sample with density $p$ from the union of the boxes in this class, but we only keep a point if it is not contained in any class that comes later in the order. To efficiently test containment in a later class, we use an orthogonal range searching data structure (with an additional dimension for the index of the class). When the scan finishes, we obtain the desired sample set $S$.

Figure 2: We sample points in the grid cells $\mathcal{G}'$ that are intersected by a box $O_i$ from a fixed class. We then use orthogonal range searching to determine whether a sampled point is in a box from the class and should be kept (✓), or is not and should be discarded (×).

Let us elaborate why the algorithm is efficient.

Sampling from a single class.

Our approach is simple yet powerful: (i) grid the space into cells of side lengths comparable to the boxes in this class; (ii) sample points from the relevant cells uniformly at random; (iii) discard points outside the union by querying an orthogonal range searching data structure. See Figure 2. All these steps exploit the geometry of boxes. Since the grid cells and boxes have roughly the same shape, a significant fraction of the points sampled in (ii) is contained in the union, i.e., not discarded in (iii). The orthogonal range searching data structure allows us to quickly decide which points to discard. Together, this explains why sampling from a class is efficient.

Bounding the amount of wasted work.

Sampled points that appear in later classes also need to be discarded, and this is a potential (and the only) source of inefficiency. So we need to bound their number. Such a point shows up later either (i) in a similar class, or (ii) in a dissimilar class. Our first observation states that a class is similar to at most polylogarithmically many classes, so (i) can happen at most polylogarithmically many times from the perspective of a fixed point in the union. On the other hand, our second observation states that the intersection of dissimilar classes is small, so on average (ii) rarely happens and does not affect the expected running time significantly.

2.2 Lower bound for union volume estimation

Our lower bound is shown by a reduction from the Query-Gap-Hamming problem: given two vectors $x,y\in\{-1,+1\}^\ell$, distinguish whether their inner product is greater than $\sqrt{\ell}$ or less than $-\sqrt{\ell}$. It is known that any algorithm distinguishing these two cases with success probability at least $2/3$ must access $\Omega(\ell)$ bits of $x$ and $y$.

We first sketch why $\Omega(1/\varepsilon^2)$ queries are necessary to $(1+\varepsilon)$-approximate the volume of the union of two objects in the query model. Given a Query-Gap-Hamming instance $x,y$, we construct two objects $X=\{(j,x_j): j\in[\ell]\}$ and $Y=\{(j,y_j): j\in[\ell]\}$. These are discrete point sets, but they can be inflated to polygons such that cardinality corresponds to volume. See Figure 3 for an example. Note that for all $k\in\{0,\dots,\ell\}$, we have

$|X\cup Y| = \ell + k \iff \langle x,y\rangle = \ell - 2k.$

If an algorithm $\mathcal{A}$ can $(1+\varepsilon)$-approximate $|X\cup Y|$, then it can approximate $k$ within additive error $2\varepsilon\ell$, thus it can also approximate $\langle x,y\rangle$ within additive error $4\varepsilon\ell$. Setting $\ell = 1/(16\varepsilon^2)$, the additive error is at most $4\varepsilon\ell = \frac{1}{4\varepsilon} = \sqrt{\ell}$, which suffices to distinguish $\langle x,y\rangle \geq \sqrt{\ell}$ from $\langle x,y\rangle \leq -\sqrt{\ell}$ and thereby solves the Query-Gap-Hamming instance.

Note that $|X|=|Y|=\ell$, so a volume query does not disclose any information about $x$ and $y$. Each sample or containment query accesses at most one bit of $x$ or $y$. Therefore, algorithm $\mathcal{A}$ has to make $\Omega(\ell)=\Omega(1/\varepsilon^2)$ queries to $X,Y$.
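The identity relating union size and inner product can be checked mechanically; a minimal sketch (the function name is ours):

```python
def two_object_reduction(x, y):
    """For +-1 vectors x, y, build X = {(j, x_j)} and Y = {(j, y_j)} and
    return (|X u Y|, <x, y>), illustrating that |X u Y| = l + k holds
    exactly when <x, y> = l - 2k, where k counts disagreeing coordinates."""
    l = len(x)
    X = {(j, xj) for j, xj in enumerate(x)}
    Y = {(j, yj) for j, yj in enumerate(y)}
    k = sum(xj != yj for xj, yj in zip(x, y))
    # points at agreeing coordinates coincide, disagreeing ones are disjoint
    assert len(X | Y) == l + k
    assert sum(xj * yj for xj, yj in zip(x, y)) == l - 2 * k
    return len(X | Y), l - 2 * k
```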

Figure 3: The vector $x=(+1,-1,+1,+1,-1,-1)$ represented as the set $\{(j,x_j): j\in[6]\}$, where each point is drawn as a rectangle.

In order to generalize this lower bound for the union of two objects to an $\Omega(n/\varepsilon^2)$ lower bound for the union of $n$ objects, we need to ensure that each query gives away only $O(1/n)$ bits of information about $x$ and $y$. We apply two obfuscations that jointly slow down the exposure of bits; see Figure 4. Firstly, we introduce objects $X_1,\dots,X_n$ whose union is $X$, and objects $Y_1,\dots,Y_n$ whose union is $Y$. Imagine cutting each $X$-induced rectangle in Figure 3 into $n$ side-by-side pieces and distributing them randomly among $X_1,\dots,X_n$. Do the same for $Y$. The idea is that one needs to make $\Omega(n)$ containment queries on the rectangular region in order to hit the correct piece; only then is the corresponding bit revealed. Secondly, we introduce a large band shared by all $X_i$ and $Y_i$ for $i\in[n]$. In Figure 4, this is the long dark-blue rectangle that spans from left to right. Intuitively, it enforces $\Omega(n)$ sample queries to obtain a single point that carries any information about $x$ and $y$.

Figure 4: The vector $x=(+1,-1,+1,+1,-1,-1)$ (and analogously $y$) gives rise to $n$ polygons; one of these polygons is illustrated in dark blue. The light blue area indicates the union of all these $n$ polygons.

3 Approximation algorithm for Klee’s measure problem

In this section, we present our approximation algorithm for Klee’s measure problem in constant dimension $d$, proving Theorem 2.

Lemmas marked with (★) are folklore or straightforward. Their proofs can be found in the full version of this paper [5].

3.1 Preliminaries

In Klee’s measure problem we are given boxes $O_1,\dots,O_n \subseteq \mathbb{R}^d$. Here, a box is an object of the form $O_i = [a_1,b_1]\times\dots\times[a_d,b_d]$, and as input we are given the corner coordinates $a_1,b_1,\dots,a_d,b_d$ of each box. Based on these, it is easy to compute the side lengths and volume of each box.

Throughout we write $V := \mathrm{Volume}(\bigcup_{i=1}^{n} O_i)$ for the volume of the union of the boxes. The goal is to approximate $V$ up to a factor of $1+\varepsilon$. Our approach is based on sampling, so let us now introduce the relevant notions.

Recall the Poisson distribution $\mathrm{Pois}(\lambda)$ with mean and variance $\lambda$: it captures the number of active points in a universe, under the assumption that active points occur uniformly and independently at random across the universe, and that $\lambda$ points are active on average. The following definition is usually referred to as a homogeneous Poisson point process at rate $p$. Intuitively, we activate each point in some universe $U\subseteq\mathbb{R}^d$ independently with “probability density” $p$; thus the number of activated points follows the Poisson distribution with mean $p\cdot\mathrm{Volume}(U)$.

Definition 3 (p-sample).

Let $U\subseteq\mathbb{R}^d$ be a measurable set, and let $p\in[0,1]$. We say that a random subset $S\subseteq U$ is a $p$-sample of $U$ if for any measurable $U'\subseteq U$ we have that $|S\cap U'| \sim \mathrm{Pois}(p\cdot\mathrm{Volume}(U'))$.

In particular, if $S$ is a $p$-sample of $U$, then $|S| \sim \mathrm{Pois}(p\cdot\mathrm{Volume}(U))$. Two more useful properties follow from the definition:

  (i) For any measurable subset $U'\subseteq U$, the restriction $S\cap U'$ is a $p$-sample of $U'$.

  (ii) The union of $p$-samples of two disjoint sets $U, U'$ is a $p$-sample of $U\cup U'$.
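For a box given explicitly, the definition can be realized directly: draw $K\sim\mathrm{Pois}(p\cdot\mathrm{Volume}(U))$ and then place $K$ i.i.d. uniform points. A minimal sketch (our illustration; the Poisson sampler is Knuth's folklore product-of-uniforms method, not code from the paper):

```python
import math
import random

def poisson(lam):
    """Sample from Pois(lam) via Knuth's product-of-uniforms method
    (adequate for the moderate means used in this sketch)."""
    k, prod, threshold = 0, 1.0, math.exp(-lam)
    while True:
        prod *= random.random()
        if prod <= threshold:
            return k
        k += 1

def p_sample_box(lo, hi, p):
    """Generate a p-sample of the axis-aligned box
    [lo_1, hi_1] x ... x [lo_d, hi_d]: the point count is
    Pois(p * Volume), and conditioned on the count the points
    are i.i.d. uniform in the box."""
    volume = math.prod(h - l for l, h in zip(lo, hi))
    k = poisson(p * volume)
    return [tuple(random.uniform(l, h) for l, h in zip(lo, hi))
            for _ in range(k)]
```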

Besides sampling, we will apply orthogonal range searching to handle the so-called $\mathrm{appears}(x,i)$ query: given $x\in\mathbb{R}^d$ and $i$, is $x\in O_i\cup\dots\cup O_n$?

Lemma 4 (★).

We can build a data structure in $O(n\log^{d+1}(n))$ time that answers $\mathrm{appears}(x,i)$ queries in $O(\log^{d+1}(n))$ time.

For our algorithm to work, we also need a constant-factor approximation of the volume V. It is known that this can be computed in O(n) time [19]. In order to stay simple and self-contained, we state a weaker result by implementing the Karp-Luby algorithm [18] with the help of appears queries.

Lemma 5 (★, adapted from [18]).

Given the data structure from Lemma 4, there exists an algorithm that computes in $O(n\log^{d+1}(n))$ time a 2-approximation to $V$ with probability at least 0.9.

3.2 Classifying boxes by shapes

As our first step in the algorithm, we classify boxes by their shapes.

Definition 6.

Let $L_1,\dots,L_d\in\mathbb{Z}$. We say that a box $O\subseteq\mathbb{R}^d$ is of type $(L_1,\dots,L_d)$ if its side length in dimension $k$ is contained in $[2^{L_k}, 2^{L_k+1})$, for all $k\in[d]$.

Using this definition, we partition the input boxes $O_1,\dots,O_n$ into classes $C_1,\dots,C_m$ such that each class corresponds to one type of boxes. This notation is fixed from now on. For each $t\in[m]$, we denote by $U_t := \bigcup_{O\in C_t} O \subseteq \mathbb{R}^d$ the union of the boxes in class $C_t$.
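The classification step amounts to one floor-of-logarithm per dimension; a minimal sketch of Definition 6 (the helper names are ours):

```python
import math
from collections import defaultdict

def box_type(lo, hi):
    """Type (L_1, ..., L_d) of a box (Definition 6): L_k is the unique
    integer with the side length in dimension k lying in [2^L_k, 2^(L_k+1))."""
    return tuple(math.floor(math.log2(h - l)) for l, h in zip(lo, hi))

def classify(boxes):
    """Partition boxes (given as (lo, hi) corner tuples) into classes
    C_1, ..., C_m, one class per occurring type."""
    classes = defaultdict(list)
    for lo, hi in boxes:
        classes[box_type(lo, hi)].append((lo, hi))
    return list(classes.values())
```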

Similarly to the appears query, we may use orthogonal range searching to handle the so-called $\mathrm{inClass}(x,t)$ query: is a given point $x\in\mathbb{R}^d$ contained in $U_t$?

Lemma 7 (★).

We can build a data structure in $O(n\log^{d+1}(n))$ time that answers $\mathrm{inClass}(x,t)$ queries in $O(\log^{d+1}(n))$ time.

Sampling from a class

The next lemma shows that we can $p$-sample from any $U_t$ efficiently by rejection sampling.

Lemma 8.

Given $t\in[m]$, $p\in[0,1]$ and the data structure from Lemma 7, one can generate a $p$-sample of $U_t$ in expected time $O(|C_t|\log|C_t| + p\cdot\mathrm{Volume}(U_t)\cdot\log^{d+1}(n))$.

Proof.

Let $(L_1,\dots,L_d)$ be the type of class $C_t$. We subdivide $\mathbb{R}^d$ into the grid

$\mathcal{G} := \{[i_1\,2^{L_1}, (i_1+1)\,2^{L_1}) \times \dots \times [i_d\,2^{L_d}, (i_d+1)\,2^{L_d}) \mid i_1,\dots,i_d \in \mathbb{Z}\}.$

We call each element of $\mathcal{G}$ a cell. Let $\mathcal{G}' := \{G\in\mathcal{G} \mid G\cap U_t \neq \emptyset\}$ be the set of cells that intersect $U_t$, and let $U := \bigcup_{G\in\mathcal{G}'} G$.

First we create a $p$-sample $S$ of $U$ as follows. Generate $K\sim\mathrm{Pois}(p\cdot\mathrm{Volume}(U))$, the number of points we are going to include. Then sample $K$ points uniformly at random from $U$ by repeating the following step $K$ times: select a cell $G\in\mathcal{G}'$ uniformly at random and then sample a point from $G$ uniformly at random. The sampled points constitute our set $S$.

Next we compute $S\cap U_t$: for each $x\in S$, we query $\mathrm{inClass}(x,t)$; if the answer is true then we keep $x$, otherwise we discard it. The resulting set $S\cap U_t$ is a $p$-sample of $U_t$, since restricting to a fixed subset preserves the $p$-sample property.

Before we analyze the running time, we show that $U_t$ makes up a decent proportion of $U$. Recall that every box in class $C_t$ is of type $(L_1,\dots,L_d)$. Consider the projection to any dimension $k\in[d]$. Each projected box from $C_t$ can intersect at most three projected cells from $\mathcal{G}$. So each box from $C_t$ intersects at most $3^d$ cells from $\mathcal{G}$, implying that $|\mathcal{G}'| \leq 3^d\,|C_t|$. Moreover, since the volume of any cell is at most the volume of a box in $C_t$, we have $\mathrm{Volume}(U) \leq 3^d\,\mathrm{Volume}(U_t)$.

Recall that we assume $d$ to be constant and hence $3^d = O(1)$. The computation of $\mathcal{G}'$ takes $O(|\mathcal{G}'|\log|\mathcal{G}'|) \subseteq O(|C_t|\log|C_t|)$ time. The remaining time is dominated by the inClass queries. The expected size of $S$ is $p\cdot\mathrm{Volume}(U) \leq 3^d\,p\cdot\mathrm{Volume}(U_t)$. As we query inClass once for each point of $S$, the total expected time is $O(p\cdot\mathrm{Volume}(U_t)\cdot\log^{d+1}(n))$ by Lemma 7.
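The rejection-sampling procedure of this proof can be condensed into a short sketch. In the following illustration (ours, not code from the paper) the $\mathrm{inClass}$ structure of Lemma 7 is replaced by a brute-force membership scan and the cells of $\mathcal{G}'$ are enumerated directly, so only the sampling logic follows the proof:

```python
import itertools
import math
import random

def poisson(lam):
    """Sample from Pois(lam) via Knuth's product-of-uniforms method
    (adequate for the moderate means used in this sketch)."""
    k, prod, threshold = 0, 1.0, math.exp(-lam)
    while True:
        prod *= random.random()
        if prod <= threshold:
            return k
        k += 1

def p_sample_class(boxes, typ, p):
    """Generate a p-sample of the union U_t of `boxes`, all of type
    typ = (L_1, ..., L_d), by gridding and rejection (Lemma 8)."""
    side = [2.0 ** L for L in typ]
    # G': grid cells intersecting some box (at most 3 per dimension per box).
    cells = set()
    for lo, hi in boxes:
        ranges = [range(math.floor(l / s), math.floor(h / s) + 1)
                  for l, h, s in zip(lo, hi, side)]
        cells.update(itertools.product(*ranges))
    cells = list(cells)
    # K ~ Pois(p * Volume(U)), where U is the union of the marked cells.
    k = poisson(p * math.prod(side) * len(cells))
    sample = []
    for _ in range(k):
        idx = random.choice(cells)                       # uniform cell
        x = tuple((i + random.random()) * s for i, s in zip(idx, side))
        # keep x only if it lies in U_t (the inClass query of the paper)
        if any(all(l <= xi <= h for l, h, xi in zip(lo, hi, x))
               for lo, hi in boxes):
            sample.append(x)
    return sample
```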

Classes do not overlap much

We show the following interesting property of the classes: the sum of their volumes is within a polylogarithmic factor of the total volume $V$.

Lemma 9.

We have $\sum_{t=1}^{m}\mathrm{Volume}(U_t) \leq 2^{3d+1}\log^d(n)\cdot V$.

We later exploit this to draw $p$-samples from $\bigcup_{i=1}^{n} O_i = \bigcup_{t=1}^{m} U_t$ efficiently. Towards proving the lemma, let us collect some simple observations.

Definition 10.

We say that a class of type $(L_1,\dots,L_d)$ is similar to a class of type $(L'_1,\dots,L'_d)$ if $2^{|L_k - L'_k|} < n^4$ for all $k\in[d]$. Otherwise they are said to be dissimilar.

Observation 11.

Every class is similar to at most $8^d\log^d(n)$ classes.

Proof.

Fix a type $(L_1,\dots,L_d)$. For each $k\in[d]$, there are at most $8\log n$ many integers $L'_k$ such that $2^{|L_k - L'_k|} < n^4$.

Observation 12.

If two boxes $O, O'$ are in dissimilar classes, then $\mathrm{Volume}(O\cap O') \leq 2V/n^4$.

Proof.

Let $(L_1,\dots,L_d)$ be the type of $O$, and $(L'_1,\dots,L'_d)$ be the type of $O'$. Since the boxes belong to dissimilar classes, there is a dimension $k\in[d]$ such that $2^{|L_k - L'_k|} \geq n^4$. Without loss of generality, assume $2^{L_k - L'_k} \geq n^4$; the other case is symmetric. Let $[a_k,b_k]$ and $[a'_k,b'_k]$ be the projections of $O$ and $O'$ onto dimension $k$, respectively. Note that $b_k - a_k \in [2^{L_k}, 2^{L_k+1})$ and $b'_k - a'_k \in [2^{L'_k}, 2^{L'_k+1})$. So we have $\frac{b_k - a_k}{b'_k - a'_k} \geq 2^{L_k - (L'_k+1)} \geq n^4/2$. In other words, at most a $2/n^4$ fraction of the interval $[a_k,b_k]$ intersects the interval $[a'_k,b'_k]$. Hence, $\mathrm{Volume}(O\cap O') \leq \mathrm{Volume}(O)\cdot 2/n^4 \leq 2V/n^4$.  

Proof of Lemma 9.

Without loss of generality assume $\mathrm{Volume}(U_1) \geq \dots \geq \mathrm{Volume}(U_m)$. We construct a set of indices $T\subseteq[m]$ by the following procedure:

  • Initialize $T = \emptyset$.

  • For $t = 1,\dots,m$: if $C_t$ and $C_s$ are dissimilar for all $s\in T$, then add $t$ to $T$.

For $t\in[m]$, we have $t\notin T$ only if there exists an $s\in T$ such that $C_s, C_t$ are similar and $\mathrm{Volume}(U_s) \geq \mathrm{Volume}(U_t)$; we call $s$ a witness of $t$. If multiple witnesses exist, then we pick an arbitrary one. Conversely, every $s\in T$ can be a witness at most $8^d\log^d(n)$ times by Observation 11. Hence,

$\sum_{t=1}^{m}\mathrm{Volume}(U_t) \;\leq\; 8^d\log^d(n)\cdot\sum_{t\in T}\mathrm{Volume}(U_t).$ (1)

It remains to bound $\sum_{t\in T}\mathrm{Volume}(U_t)$. Consider any distinct $s,t\in T$. By construction, $C_s$ and $C_t$ are dissimilar, and each class contains at most $n$ boxes. So $\mathrm{Volume}(U_s\cap U_t) \leq n^2\cdot(2V/n^4) = 2V/n^2$ by Observation 12. Using this and inclusion-exclusion, we bound

$\sum_{t\in T}\mathrm{Volume}(U_t) \;\leq\; \mathrm{Volume}\Big(\bigcup_{t\in T}U_t\Big) + \sum_{\{s,t\}\subseteq T}\mathrm{Volume}(U_s\cap U_t) \;\leq\; V + \binom{m}{2}\cdot\frac{2V}{n^2} \;\leq\; 2V.$

Plugging this into the right-hand side of Expression (1), we obtain the lemma statement.

3.3 Joining the classes

Recall that $C_1,\dots,C_m$ are the classes of the input boxes and $U_1,\dots,U_m$ their respective unions. Assume without loss of generality that the boxes are ordered in accordance with the class ordering, that is, $C_1 = \{O_1,\dots,O_{i_1}\}$ forms the first class, $C_2 = \{O_{i_1+1},\dots,O_{i_2}\}$ the second class, and so on. In general, we assume that $C_t = \{O_{i_{t-1}+1},\dots,O_{i_t}\}$ for all $t\in[m]$, where $0 = i_0 < i_1 < \dots < i_m = n$.

Let $D_t := U_t \setminus (\bigcup_{s=t+1}^{m} U_s)$ be the set of points in $U_t$ that are not contained in later classes. Note that $D_1,\dots,D_m$ is a partition of $\bigcup_{t=1}^{m} U_t = \bigcup_{i=1}^{n} O_i$. Hence, to generate a $p$-sample of $\bigcup_{i=1}^{n} O_i$, it suffices to draw $p$-samples from each $D_t$ and then take their union.⁴ To this end, we draw a $p$-sample $S_t$ from $U_t$ via Lemma 8. Then we remove all $x\in S_t$ for which $\mathrm{appears}(x,\,i_t+1) = \mathrm{true}$; these are exactly the points that appear in a later class. What remains is a $p$-sample of $D_t$. The union of these sets thus is a $p$-sample of $\bigcup_{i=1}^{n} O_i$, and we can use the size of this $p$-sample to estimate the volume $V$ of $\bigcup_{i=1}^{n} O_i$. The complete algorithm is summarized in Algorithm 1.

⁴ This idea has previously been used on objects, by considering the differences $D_i := O_i \setminus (\bigcup_{j=i+1}^{n} O_j)$ [18, 25], while we use it on classes.

Algorithm 1 Approximation of Klee’s measure.

  1. Partition the boxes into classes $C_1,\dots,C_m$. Relabel the boxes so that their indices are in accordance with the class ordering, i.e., $C_t = \{O_{i_{t-1}+1},\dots,O_{i_t}\}$ for all $t\in[m]$.

  2. Build the data structures from Lemmas 4 and 7.

  3. Obtain a crude estimate $\tilde{V}$ by Lemma 5. Set $p := 8/(\varepsilon^2\tilde{V})$.

  4. For $t = 1,\dots,m$ do:

    • Draw a $p$-sample $S_t$ from the union $U_t = \bigcup_{O\in C_t} O$ via Lemma 8.

    • Compute $|S'_t|$ where $S'_t := \{x\in S_t : \mathrm{appears}(x,\,i_t+1) = \mathrm{false}\}$.

  5. Output $\sum_{t=1}^{m} |S'_t|\,/\,p$.
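For concreteness, the following is our brute-force sketch of Algorithm 1: the range-searching structures of Lemmas 4 and 7 are replaced by linear scans over the boxes (making this quadratic rather than near-linear), and the crude estimate $\tilde{V}$ of Lemma 5 is passed in as a parameter.

```python
import itertools
import math
import random
from collections import defaultdict

def poisson(lam):
    """Sample from Pois(lam) via Knuth's product-of-uniforms method
    (adequate for the moderate means used in this sketch)."""
    k, prod, threshold = 0, 1.0, math.exp(-lam)
    while True:
        prod *= random.random()
        if prod <= threshold:
            return k
        k += 1

def inside(box, x):
    lo, hi = box
    return all(l <= xi <= h for l, h, xi in zip(lo, hi, x))

def klee_estimate(boxes, crude, eps):
    """Brute-force sketch of Algorithm 1; `crude` plays the role of the
    constant-factor estimate from Lemma 5."""
    p = 8.0 / (eps * eps * crude)
    # Step 1: partition the boxes into classes by type (Definition 6).
    classes = defaultdict(list)
    for lo, hi in boxes:
        typ = tuple(math.floor(math.log2(h - l)) for l, h in zip(lo, hi))
        classes[typ].append((lo, hi))
    classes = list(classes.items())
    count = 0
    for t, (typ, cls) in enumerate(classes):
        # Step 4a: p-sample S_t of U_t by gridding + rejection (Lemma 8).
        side = [2.0 ** L for L in typ]
        cells = set()
        for lo, hi in cls:
            ranges = [range(math.floor(l / s), math.floor(h / s) + 1)
                      for l, h, s in zip(lo, hi, side)]
            cells.update(itertools.product(*ranges))
        cells = list(cells)
        later = [b for _, c in classes[t + 1:] for b in c]
        for _ in range(poisson(p * math.prod(side) * len(cells))):
            idx = random.choice(cells)
            x = tuple((i + random.random()) * s for i, s in zip(idx, side))
            # Step 4b: keep x if in U_t but in no later class (p-sample of D_t).
            if any(inside(b, x) for b in cls) and not any(inside(b, x) for b in later):
                count += 1
    # Step 5: the kept points form a p-sample of the union; |S|/p estimates V.
    return count / p
```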

Lemma 13.

Conditioned on $\tilde{V} \leq 2V$, Algorithm 1 outputs a $(1+\varepsilon)$-approximation to $V$ with probability at least $3/4$.

Proof.

Note that for all $t\in[m]$, the set $S'_t$ is a $p$-sample of $D_t$. Since $D_1,\dots,D_m$ partition $\bigcup_{t=1}^{m} U_t = \bigcup_{i=1}^{n} O_i$, their union $\bigcup_{t=1}^{m} S'_t$ is a $p$-sample of $\bigcup_{i=1}^{n} O_i$. It follows that $N := \sum_{t=1}^{m} |S'_t| \sim \mathrm{Pois}(pV)$. The expectation and variance of $N$ are $pV = 8V/(\varepsilon^2\tilde{V}) \geq 4/\varepsilon^2$. So by Chebyshev’s inequality,

$\Pr(|N - pV| > \varepsilon pV) \;\leq\; \frac{\mathrm{Var}[N]}{(\varepsilon pV)^2} \;=\; \frac{1}{\varepsilon^2 pV} \;\leq\; \frac{1}{4}.$

Hence, with probability at least 3/4, the output N/p is a (1+ε)-approximation to V.

Lemma 14.

Conditioned on $\tilde{V} \geq V/2$, Algorithm 1 runs in time $O((n+\frac{1}{\varepsilon^2})\log^{2d+1}(n))$ in expectation.

Proof.

Step 1 takes $O(n\log n)$ time: we compute the side lengths of each box, determine its class, and then sort the boxes by class index. Step 2 takes $O(n\log^{d+1}(n))$ time by Lemmas 4 and 7. Step 3 takes $O(n\log^{d+1}(n))$ time by Lemma 5.

In iteration $t$ of step 4, sampling $S_t$ costs expected time $O((i_t - i_{t-1})\log(i_t - i_{t-1}) + p\cdot\mathrm{Volume}(U_t)\cdot\log^{d+1}(n))$ by Lemma 8, and computing $S'_t$ costs expected time $O((1 + p\cdot\mathrm{Volume}(U_t))\cdot\log^{d+1}(n))$ by Lemma 4. Over all iterations, the expected running time is

$O\Big(\log^{d+1}(n)\cdot\Big(n + p\sum_{t=1}^{m}\mathrm{Volume}(U_t)\Big)\Big).$

Substituting $p = 8/(\varepsilon^2\tilde{V}) \leq 16/(\varepsilon^2 V)$ and applying Lemma 9, we can bound

$p\sum_{t=1}^{m}\mathrm{Volume}(U_t) \;\leq\; \frac{16}{\varepsilon^2 V}\sum_{t=1}^{m}\mathrm{Volume}(U_t) \;\leq\; \frac{2^{3d+5}\log^d(n)}{\varepsilon^2}.$

Hence the expected running time of step 4 is $O(\log^{2d+1}(n)\cdot(n+\frac{1}{\varepsilon^2}))$.

Finally, step 5 takes O(n) time.

Proof of Theorem 2.

Run Algorithm 1 with a time budget tenfold the bound in Lemma 14; abort the algorithm as soon as it exceeds the budget. The stated time bound is then clearly satisfied. Now consider three bad events:

  • $\tilde{V}\notin[\frac{V}{2},\,2V]$.

  • $\tilde{V}\in[\frac{V}{2},\,2V]$, but the algorithm is aborted.

  • $\tilde{V}\in[\frac{V}{2},\,2V]$ and the algorithm is not aborted, but it does not output a $(1+\varepsilon)$-approximation to $V$.

By Lemma 5, the first event happens with probability at most $0.1$. By Lemma 14 and Markov’s inequality, the second event happens with probability at most $0.1$. Lastly, by Lemma 13, the third event happens with probability at most $1/4$. So the total error probability is at most $0.1+0.1+\frac{1}{4} = \frac{9}{20}$. If none of the bad events happens, then the algorithm correctly outputs a $(1+\varepsilon)$-approximation to $V$. The success probability of $\frac{11}{20}$ can be boosted to, say, $0.9$ by returning the median of a sufficiently large constant number of repetitions.

4 Lower bound for union volume estimation

In this section, any randomized algorithm or protocol is assumed to have success probability at least $2/3$. We consider the discrete version of union volume estimation in two dimensions, where each object $O_i$ is a finite subset of the integer lattice $\mathbb{Z}^2$. The goal is to $(1+\varepsilon)$-approximate the cardinality $|O_1\cup\dots\cup O_n|$ of their union. The algorithm does not have direct access to the objects, but it can ask three forms of queries:

  • $\mathrm{Volume}(O_i)$: return the cardinality $|O_i|$.

  • $\mathrm{Sample}(O_i)$: draw a uniform random point from $O_i$.

  • $\mathrm{Contains}((a,b),\,O_i)$: given a point $(a,b)\in\mathbb{Z}^2$, return whether $(a,b)\in O_i$ or not.

We aim to prove that $\Omega(n/\varepsilon^2)$ queries are necessary. Later, in Section 5, we translate this to a lower bound in the continuous space $\mathbb{R}^2$, thereby proving Theorem 1.

Our starting point is the Query-Gap-Hamming problem: the inputs are two (hidden) vectors $x,y\in\{-1,1\}^\ell$, and we can access one bit of $x$ or $y$ of our choice at a time. The goal is to distinguish the cases $\langle x,y\rangle \geq \sqrt{\ell}$ and $\langle x,y\rangle \leq -\sqrt{\ell}$ using as few accesses as possible. The following lemma is folklore; its proof can be found in the full version of this paper [5].

Lemma 15.

There exists a constant δ ∈ (0,1) with the following property: for every ℓ ∈ ℕ, any randomized algorithm for Query-Gap-Hamming on vectors of length ℓ needs at least δℓ accesses in expectation.

From now on, let us fix the constant δ ∈ (0,1) guaranteed by Lemma 15. Let n ∈ ℕ and ε ∈ (0,1) be arbitrary, and define ℓ := 1/(36ε²); for simplicity we assume that ℓ is an integer. We reduce Query-Gap-Hamming on vectors of length ℓ to estimating the cardinality of the union of 2n objects in ℤ².

The reduction

From the hidden vectors x, y ∈ {−1,1}^ℓ we construct 2n objects X1, …, Xn, Y1, …, Yn ⊆ ℤ². Let R := {(n+1, 0), …, (ℓn+n, 0)}. Take ℓ independent random permutations π1, …, πℓ of [n] and define

Xi := R ∪ {(jn + πj(i), xj) : j ∈ [ℓ]}

for i ∈ [n]. Then take another set of ℓ independent random permutations τ1, …, τℓ of [n] and define

Yi := R ∪ {(jn + τj(i), yj) : j ∈ [ℓ]}

for i ∈ [n]. Note that R is a subset of all Xi and Yi.

How is ⟨x,y⟩ related to the cardinality of the union |X1 ∪ ⋯ ∪ Xn ∪ Y1 ∪ ⋯ ∪ Yn|? Consider an arbitrary index j ∈ [ℓ]. If xj = yj then the point sets {(jn + πj(i), xj) : i ∈ [n]} and {(jn + τj(i), yj) : i ∈ [n]} are equal (regardless of the concrete permutations), so together they contribute n to the cardinality of the union. On the other hand, if xj ≠ yj then they are disjoint and thus contribute 2n. Furthermore, the point set R is contained in all objects and contributes ℓn. Hence, the cardinality of the union is equal to

ℓn + ∑_{j: xj=yj} n + ∑_{j: xj≠yj} 2n = (5/2)ℓn − (n/2)·(∑_{j: xj=yj} 1 + ∑_{j: xj≠yj} (−1)) = (5/2)ℓn − (n/2)·⟨x,y⟩.

Suppose we get a (1+ε)-approximation ρ to the cardinality of the union; then

ρ ∈ [(1−ε)·((5/2)ℓn − (n/2)·⟨x,y⟩), (1+ε)·((5/2)ℓn − (n/2)·⟨x,y⟩)].

Since |⟨x,y⟩| ≤ ℓ, the cardinality of the union is at most 3ℓn. Hence, by computing ((5/2)ℓn − ρ)·(2/n) we obtain a value in ⟨x,y⟩ ± 6εℓ, that is, an additive 6εℓ-approximation to ⟨x,y⟩. Our choice of ℓ = 1/(36ε²) gives 6εℓ = √ℓ, which allows us to distinguish ⟨x,y⟩ ≥ 2√ℓ from ⟨x,y⟩ ≤ −2√ℓ.
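The construction and the counting argument above can be checked on a small instance; the code below is an illustrative sketch with our own naming, not part of the reduction itself.

```python
import random

def build_objects(x, y, n, seed=0):
    """Build X_1..X_n and Y_1..Y_n from two +-1 vectors of length l."""
    l = len(x)
    rng = random.Random(seed)
    # R = {(n+1, 0), ..., (ln+n, 0)}, written blockwise as (jn+s, 0)
    R = {(j * n + s, 0) for j in range(1, l + 1) for s in range(1, n + 1)}
    pis = [rng.sample(range(1, n + 1), n) for _ in range(l)]   # pi_1..pi_l
    taus = [rng.sample(range(1, n + 1), n) for _ in range(l)]  # tau_1..tau_l
    Xs = [R | {(j * n + pis[j - 1][i - 1], x[j - 1]) for j in range(1, l + 1)}
          for i in range(1, n + 1)]
    Ys = [R | {(j * n + taus[j - 1][i - 1], y[j - 1]) for j in range(1, l + 1)}
          for i in range(1, n + 1)]
    return Xs, Ys

def union_cardinality(x, y, n):
    """The closed form (5/2)*l*n - (n/2)*<x,y> derived above."""
    l = len(x)
    inner = sum(a * b for a, b in zip(x, y))
    return (5 * l * n - n * inner) // 2

# l = 4, n = 3, <x,y> = 0: the union should have (5/2)*4*3 = 30 points,
# independently of the random permutations.
x, y, n = [1, -1, 1, 1], [1, 1, -1, 1], 3
Xs, Ys = build_objects(x, y, n)
assert len(set().union(*Xs, *Ys)) == union_cardinality(x, y, n) == 30
```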

The reduction is not yet complete. The catch is that an algorithm for union volume estimation can only learn about the objects via queries. It is left to us to answer those queries. In Algorithm 2 we describe how to answer queries on object Xi. Queries on object Yi can be handled similarly by replacing πj with τj and xj with yj. The correctness is easy to verify, and this completes the reduction. So far, the argument works for arbitrary permutations; randomness becomes relevant when we bound the number of bit accesses.

Algorithm 2 Answering queries.
  • For query Volume(Xi): Return ℓ(n+1).

  • For query Sample(Xi): Draw a uniformly random block j ∈ [ℓ] and a uniformly random shift s ∈ [n]. With probability 1 − 1/(n+1) return (jn+s, 0). Otherwise return (jn + πj(i), xj).

  • For query Contains((a,b), Xi): If (a,b) ∈ R then return true. Else, compute j := ⌈a/n⌉ − 1 and s := a − jn. If j ∉ [ℓ] or πj(i) ≠ s then return false. Otherwise, if xj = b then return true; else return false.
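The three handlers can be sketched in code as follows; the class name and indexing conventions are ours, with blocks j ∈ [ℓ] and shifts s ∈ [n], and the bit xj read only in the “otherwise” clauses.

```python
import random

class XiAnswerer:
    """Answers Volume/Sample/Contains queries on a single object X_i."""

    def __init__(self, x, pis, i, n):
        self.x, self.pis, self.i, self.n, self.l = x, pis, i, n, len(x)

    def volume(self):
        return self.l * (self.n + 1)

    def sample(self, rng):
        j = rng.randrange(1, self.l + 1)       # random block
        s = rng.randrange(1, self.n + 1)       # random shift
        if rng.random() < 1 - 1 / (self.n + 1):
            return (j * self.n + s, 0)         # a point of R
        return (j * self.n + self.pis[j - 1][self.i - 1], self.x[j - 1])

    def contains(self, a, b):
        if b == 0 and self.n + 1 <= a <= self.l * self.n + self.n:
            return True                        # (a, b) lies in R
        j = -(-a // self.n) - 1                # j = ceil(a/n) - 1
        s = a - j * self.n
        if not (1 <= j <= self.l) or self.pis[j - 1][self.i - 1] != s:
            return False
        return self.x[j - 1] == b

# Consistency check against the explicit point set X_i.
rng = random.Random(1)
x, n, i = [1, -1, 1], 4, 2
pis = [rng.sample(range(1, n + 1), n) for _ in range(len(x))]
R = {(j * n + s, 0) for j in range(1, len(x) + 1) for s in range(1, n + 1)}
Xi = R | {(j * n + pis[j - 1][i - 1], x[j - 1]) for j in range(1, len(x) + 1)}
qa = XiAnswerer(x, pis, i, n)
assert qa.volume() == len(Xi)
assert all(qa.contains(a, b) == ((a, b) in Xi)
           for a in range(0, 20) for b in (-1, 0, 1))
assert all(qa.sample(rng) in Xi for _ in range(100))
```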

Bounding the number of accesses

Now, for the sake of contradiction, suppose there exists a randomized algorithm 𝒜 that (1+ε)-approximates the cardinality of the union of any 2n objects in ℤ² using fewer than (δ/360)·2^(−9/δ)·n/ε² queries. We run 𝒜 on the objects defined earlier (using Algorithm 2 to handle its queries). From the output, we can distinguish ⟨x,y⟩ ≥ 2√ℓ from ⟨x,y⟩ ≤ −2√ℓ. We claim that fewer than δℓ bits of x, y are accessed in expectation on the side of Algorithm 2.

To this end, note that accesses to x,y only happen in the “otherwise” clauses of Sample() and Contains() in Algorithm 2. For analysis we isolate the following mini game:

Lemma 16.

Consider a game of m := 2^(−9/δ)·n rounds. At the start, Merlin generates a secret random permutation π of [n]. In each round, Arthur adaptively sends one of the following two requests to Merlin:

  • Random(): Answer “yes” with probability 1/(n+1) and “no” with the remaining probability.

  • Probe(i,s): Answer “yes” if π(i)=s and “no” otherwise.

No matter what strategy Arthur employs, the probability that Merlin ever answers “yes” in the game is at most 2δ/5.

Proof.

Let E be the event that Merlin ever answers “yes” and the first one was due to a Random() request. Let F be the event that Merlin ever answers “yes” and the first one was due to a Probe() request. We will bound Pr(E) and Pr(F) separately.

To bound Pr(E), define events E1, …, Em, where Et indicates that Arthur’s t-th request is Random() and the answer is “yes”. Clearly Pr(Et) ≤ 1/(n+1), thus Pr(E) ≤ ∑_{t=1}^{m} Pr(Et) ≤ m/(n+1) < 2^(−9/δ) < δ/15.
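As a quick numerical sanity check (ours, not part of the proof) that the final inequality 2^(−9/δ) < δ/15 holds on the whole range δ ∈ (0,1):

```python
# The ratio 15 * 2**(-9/d) / d is increasing in d on (0, 1], so checking a
# fine grid up to d = 1 supports the inequality 2**(-9/d) < d/15 used above.
assert all(2 ** (-9 / (k / 100)) < (k / 100) / 15 for k in range(1, 101))
```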

To bound Pr(F) we use an entropy argument. In particular, we will describe and analyze an encoding scheme for a random permutation π of [n].

The encoder receives π and acts as Merlin in the game, using π as the secret permutation. It interacts with Arthur till the end of the game. If F does not happen, then the encoding is simply a bit 0 followed by π. On the other hand, if F happens, then consider the first round T ∈ [m] in which “yes” appeared; denote that request by Probe(i,s). The encoding starts with a bit 1, then the number T, and finishes with the ordering π′ of [n] ∖ {i} induced by π.

The decoder receives the encoding and tries to reconstruct π. First it reads the leading bit of the encoding. If the bit is 0 then π immediately follows. If the bit is 1 then the decoder recovers the number T. It pretends to be Merlin and starts a game with Arthur, using the same random tape as the encoder but without knowledge of π. For the first T−1 requests it just answers “no” to Arthur. Now comes the T-th request from Arthur, which must be a Probe(i,s) request with answer “yes”. This allows the decoder to deduce π(i) = s. The remaining entries of π can be fully restored from π′.
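A minimal sketch of the drop-and-reinsert step of this scheme (our own toy code; the interaction with Arthur is abstracted into the revealed pair (i, s)):

```python
from math import factorial, log2

def encode(pi, i):
    """Drop entry i of the permutation; the pair (i, pi[i]) will be
    recovered by replaying the game up to round T."""
    return [v for k, v in enumerate(pi) if k != i]

def decode(rest, i, s):
    """Reinsert the probed value s = pi[i] at position i."""
    pi = list(rest)
    pi.insert(i, s)
    return pi

pi = [3, 1, 4, 2, 5]
i = 2  # suppose the probe at round T revealed pi[2] = 4
assert decode(encode(pi, i), i, pi[i]) == pi

# Length accounting: storing an ordering of n-1 instead of n elements
# saves log2(n) bits, which must pay for the log2(m) = log2(n) - 9/delta
# bits spent on the round index T -- leaving a gain of 9/delta bits.
n = 5
saved = log2(factorial(n)) - log2(factorial(n - 1))
assert abs(saved - log2(n)) < 1e-9
```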

Writing p := Pr(F), the encoding length is at most

(1−p)·(1 + log(n!)) + p·(1 + log(m) + log((n−1)!))
< (1−p)·(log(n!) + 3) + p·(log(n) − 9/δ + log((n−1)!) + 3)
= log(n!) + 3 − 9p/δ,

where we used m = 2^(−9/δ)·n in the second line. Since the encoding length must be at least the Shannon entropy log(n!) of the encoded permutation π, we get 3 − 9p/δ ≥ 0 and conclude that Pr(F) = p ≤ δ/3.

Putting the two bounds together, we have Pr(E ∪ F) ≤ Pr(E) + Pr(F) ≤ δ/15 + δ/3 = 2δ/5.

Where does the game come into play? The reader might find it useful to recall Figure 4. The space ℤ² is split into ℓ blocks horizontally, where each block j ∈ [ℓ] holds the information about the j-th bit of the input vectors. Let us focus on a particular block j. As 𝒜 runs, some of its queries “hit” this block while others do not. Precisely, we say that a Sample(Xi) query hits j if the random block chosen in Algorithm 2 is j. Likewise, we say that a Contains((a,b), Xi) query hits j if ⌈a/n⌉ − 1 = j. Now we restrict ourselves to the queries hitting j and ignore all the others. Then, over time, 𝒜 (as Arthur) is exactly playing the game with us (as Merlin) on the secret permutation πj: Sample() queries correspond to Random() requests; Contains() queries correspond to Probe() requests; and Algorithm 2 branches to the “otherwise” clauses if and only if the answer is “yes”. Whatever information 𝒜 collects from queries hitting other blocks j′ ≠ j has no effect on the decisions in this game, as all other permutations are independent of πj, and as x, y do not play a role in this game.

With this in mind, let random variable Nj count the number of queries hitting j. Define

J := {j ∈ [ℓ] : xj is accessed and Nj ≤ m}  and  J′ := {j ∈ [ℓ] : Nj > m},

where m := 2^(−9/δ)·n as in Lemma 16. By Lemma 16, Pr(j ∈ J) ≤ 2δ/5 for every j ∈ [ℓ], so 𝔼[|J|] ≤ 2δℓ/5. On the other hand, since every query hits at most one block, the definition of J′ and our assumption that the number of queries is at most (δ/360)·2^(−9/δ)·n/ε² let us bound

|J′| < (1/m)·(δ/360)·2^(−9/δ)·n/ε² = δ/(360ε²) = δℓ/10.

So Algorithm 2 accesses fewer than 2δℓ/5 + δℓ/10 = δℓ/2 bits of x in expectation. Symmetrically, the same bound holds for y. Altogether it accesses fewer than δℓ bits of x, y in expectation. But as we argued, the output can be used to distinguish ⟨x,y⟩ ≥ 2√ℓ from ⟨x,y⟩ ≤ −2√ℓ. This contradicts Lemma 15.

5 Reduction from discrete to continuous space

Union volume estimation and Klee’s measure problem were formulated in the continuous space ℝ^d. Their analogs in the discrete space ℤ^d are only simpler, due to the following reduction. Consider the natural embedding φ that blows up each point x = (x1, …, xd) ∈ ℤ^d into a continuous cube φ(x) = [x1, x1+1] × ⋯ × [xd, xd+1] ⊆ ℝ^d. For any finite subset X ⊆ ℤ^d, denote its image by φ(X) := ⋃_{x ∈ X} φ(x) ⊆ ℝ^d. We have:

  • Volume(φ(X))=|X|.

  • To sample from φ(X) uniformly, we can first sample x ∈ X uniformly and then sample from φ(x) uniformly.

  • If X is a discrete box then φ(X) is a continuous box, and vice versa.
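The second bullet can be sketched as follows (illustrative code with our own names):

```python
import random

def sample_from_image(X, rng):
    """Uniform sample from phi(X): pick a lattice point of X uniformly,
    then add a uniform offset inside its unit cube."""
    x = rng.choice(list(X))
    return tuple(c + rng.random() for c in x)

rng = random.Random(0)
X = [(0, 0), (2, 3), (5, 1)]
p = sample_from_image(X, rng)
# The sample lies in the unit cube of some lattice point of X.
assert any(all(c <= q <= c + 1 for c, q in zip(x, p)) for x in X)
```

Since the cubes of distinct lattice points overlap only on measure-zero boundaries, this two-stage procedure is indeed uniform on φ(X).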

Given this correspondence, our algorithm for Klee’s measure problem can be made to handle discrete boxes in ℤ^d. On the other hand, the discrete objects in our lower bound construction can be realized as axis-aligned polygons in ℝ².

References

  • [1] Pankaj K. Agarwal. An improved algorithm for computing the volume of the union of cubes. In David G. Kirkpatrick and Joseph S. B. Mitchell, editors, Proceedings of the 26th ACM Symposium on Computational Geometry, Snowbird, Utah, USA, June 13-16, 2010, pages 230–239. ACM, 2010. doi:10.1145/1810959.1811000.
  • [2] Pankaj K. Agarwal, Haim Kaplan, and Micha Sharir. Computing the volume of the union of cubes. In Jeff Erickson, editor, Proceedings of the 23rd ACM Symposium on Computational Geometry, Gyeongju, South Korea, June 6-8, 2007, pages 294–301. ACM, 2007. doi:10.1145/1247069.1247121.
  • [3] Karl Bringmann. An improved algorithm for Klee’s measure problem on fat boxes. Comput. Geom., 45(5-6):225–233, 2012. doi:10.1016/j.comgeo.2011.12.001.
  • [4] Karl Bringmann and Tobias Friedrich. Approximating the volume of unions and intersections of high-dimensional geometric objects. Comput. Geom., 43(6-7):601–610, 2010. doi:10.1016/j.comgeo.2010.03.004.
  • [5] Karl Bringmann, Kasper Green Larsen, André Nusser, Eva Rotenberg, and Yanheng Wang. Approximating Klee’s measure problem and a lower bound for union volume estimation. CoRR, abs/2410.00996, 2024. doi:10.48550/arXiv.2410.00996.
  • [6] Ran Canetti, Guy Even, and Oded Goldreich. Lower bounds for sampling algorithms for estimating the average. Inf. Process. Lett., 53(1):17–25, 1995. doi:10.1016/0020-0190(94)00171-T.
  • [7] Nofar Carmeli, Shai Zeevi, Christoph Berkholz, Alessio Conte, Benny Kimelfeld, and Nicole Schweikardt. Answering (unions of) conjunctive queries using random access and random-order enumeration. ACM Trans. Database Syst., 47(3):9:1–9:49, 2022. doi:10.1145/3531055.
  • [8] Ruoxu Cen, William He, Jason Li, and Debmalya Panigrahi. Beyond the quadratic time barrier for network unreliability. In David P. Woodruff, editor, Proceedings of the 2024 ACM-SIAM Symposium on Discrete Algorithms, SODA 2024, Alexandria, VA, USA, January 7-10, 2024, pages 1542–1567. SIAM, 2024. doi:10.1137/1.9781611977912.62.
  • [9] Supratik Chakraborty, Kuldeep S. Meel, and Moshe Y. Vardi. A scalable approximate model counter. In Christian Schulte, editor, Principles and Practice of Constraint Programming - 19th International Conference, CP 2013, Uppsala, Sweden, September 16-20, 2013. Proceedings, volume 8124 of Lecture Notes in Computer Science, pages 200–216. Springer, 2013. doi:10.1007/978-3-642-40627-0_18.
  • [10] Timothy M. Chan. Geometric applications of a randomized optimization technique. Discret. Comput. Geom., 22(4):547–567, 1999. doi:10.1007/PL00009478.
  • [11] Timothy M. Chan. Semi-online maintenance of geometric optima and measures. SIAM J. Comput., 32(3):700–716, 2003. doi:10.1137/S0097539702404389.
  • [12] Timothy M. Chan. A (slightly) faster algorithm for Klee’s measure problem. Comput. Geom., 43(3):243–250, 2010. doi:10.1016/j.comgeo.2009.01.007.
  • [13] Timothy M. Chan. Klee’s measure problem made easy. In 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2013, 26-29 October, 2013, Berkeley, CA, USA, pages 410–419. IEEE Computer Society, 2013. doi:10.1109/FOCS.2013.51.
  • [14] Timothy M. Chan. Minimum L Hausdorff distance of point sets under translation: Generalizing Klee’s measure problem. In Erin W. Chambers and Joachim Gudmundsson, editors, 39th International Symposium on Computational Geometry, SoCG 2023, June 12-15, 2023, Dallas, Texas, USA, volume 258 of LIPIcs, pages 24:1–24:13. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.SOCG.2023.24.
  • [15] Nilesh N. Dalvi and Dan Suciu. Efficient query evaluation on probabilistic databases. VLDB J., 16(4):523–544, 2007. doi:10.1007/s00778-006-0004-3.
  • [16] David Eppstein and Jeff Erickson. Iterated nearest neighbors and finding minimal polytopes. Discret. Comput. Geom., 11:321–350, 1994. doi:10.1007/BF02574012.
  • [17] David R. Karger. A randomized fully polynomial time approximation scheme for the all-terminal network reliability problem. SIAM J. Comput., 29(2):492–514, 1999. doi:10.1137/S0097539796298340.
  • [18] Richard M. Karp and Michael Luby. Monte-Carlo algorithms for the planar multiterminal network reliability problem. J. Complex., 1(1):45–64, 1985. doi:10.1016/0885-064X(85)90021-4.
  • [19] Richard M. Karp, Michael Luby, and Neal Madras. Monte-Carlo approximation algorithms for enumeration problems. J. Algorithms, 10(3):429–448, 1989. doi:10.1016/0196-6774(89)90038-2.
  • [20] Benny Kimelfeld, Yuri Kosharovsky, and Yehoshua Sagiv. Query efficiency in probabilistic XML models. In Jason Tsong-Li Wang, editor, Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, pages 701–714. ACM, 2008. doi:10.1145/1376616.1376687.
  • [21] Victor Klee. Can the measure of ⋃_1^n [ai, bi] be computed in less than O(n log n) steps? The American Mathematical Monthly, 84(4):284–285, 1977. doi:10.1080/00029890.1977.11994336.
  • [22] Michael G. Luby. Monte-Carlo methods for estimating system reliability. Technical report, Report UCB/CSD 84/168, Computer Science Division, University of California, Berkeley, 1983.
  • [23] Kuldeep S. Meel, Sourav Chakraborty, and N. V. Vinodchandran. Estimation of the size of union of delphic sets: Achieving independence from stream size. In Leonid Libkin and Pablo Barceló, editors, PODS ’22: International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022, pages 41–52. ACM, 2022. doi:10.1145/3517804.3526222.
  • [24] Kuldeep S. Meel, Aditya A. Shrotri, and Moshe Y. Vardi. Not all FPRASs are equal: demystifying FPRASs for DNF-counting. Constraints An Int. J., 24(3-4):211–233, 2019. doi:10.1007/s10601-018-9301-x.
  • [25] Kuldeep S. Meel, N. V. Vinodchandran, and Sourav Chakraborty. Estimating the size of union of sets in streaming models. In Leonid Libkin, Reinhard Pichler, and Paolo Guagliardo, editors, PODS’21: Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Virtual Event, China, June 20-25, 2021, pages 126–137. ACM, 2021. doi:10.1145/3452021.3458333.
  • [26] Mark H. Overmars and Chee-Keng Yap. New upper bounds in Klee’s measure problem. SIAM J. Comput., 20(6):1034–1045, 1991. doi:10.1137/0220065.
  • [27] Aduri Pavan, N. V. Vinodchandran, Arnab Bhattacharyya, and Kuldeep S. Meel. Model counting meets F0 estimation. In Leonid Libkin, Reinhard Pichler, and Paolo Guagliardo, editors, PODS’21: Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Virtual Event, China, June 20-25, 2021, pages 299–311. ACM, 2021. doi:10.1145/3452021.3458311.
  • [28] Christopher Ré, Nilesh N. Dalvi, and Dan Suciu. Efficient top-k query evaluation on probabilistic data. In Rada Chirkova, Asuman Dogac, M. Tamer Özsu, and Timos K. Sellis, editors, Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007, The Marmara Hotel, Istanbul, Turkey, April 15-20, 2007, pages 886–895. IEEE Computer Society, 2007. doi:10.1109/ICDE.2007.367934.
  • [29] Qiaosheng Shi and Binay K. Bhattacharya. Application of computational geometry to network p-center location problems. In Proceedings of the 20th Annual Canadian Conference on Computational Geometry, Montréal, Canada, August 13-15, 2008, 2008.
  • [30] Srikanta Tirthapura and David P. Woodruff. Rectangle-efficient aggregation in spatial data streams. In Michael Benedikt, Markus Krötzsch, and Maurizio Lenzerini, editors, Proceedings of the 31st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2012, Scottsdale, AZ, USA, May 20-24, 2012, pages 283–294. ACM, 2012. doi:10.1145/2213556.2213595.
  • [31] Jan van Leeuwen and Derick Wood. The measure problem for rectangular ranges in d-space. J. Algorithms, 2(3):282–300, 1981. doi:10.1016/0196-6774(81)90027-4.
  • [32] Hakan Yildiz and Subhash Suri. On Klee’s measure problem for grounded boxes. In Tamal K. Dey and Sue Whitesides, editors, Proceedings of the 28th ACM Symposium on Computational Geometry, Chapel Hill, NC, USA, June 17-20, 2012, pages 111–120. ACM, 2012. doi:10.1145/2261250.2261267.