
Approximating Densest Subgraph in Geometric Intersection Graphs

Sariel Har-Peled, Department of Computer Science, University of Illinois at Urbana-Champaign, IL, USA
Saladi Rahul, Indian Institute of Science (IISc), Bangalore, India
Abstract

For an undirected graph 𝖦 = (𝖵, 𝖤), with n vertices and m edges, the densest subgraph problem is to compute a subset S ⊆ 𝖵 which maximizes the ratio |𝖤S|/|S|, where 𝖤S ⊆ 𝖤 is the set of all edges of 𝖦 with both endpoints in S. The densest subgraph problem is well studied in computer science. Existing exact and approximation algorithms for computing the densest subgraph require Ω(m) time. We present near-linear time (in n) approximation algorithms for the densest subgraph problem on implicit geometric intersection graphs, where the vertices are explicitly given but the edges are not. As a concrete example, we consider n disks in the plane with arbitrary radii, and present two different approximation algorithms.

As a by-product, we show a reduction from (shallow) range-reporting to approximate counting/sampling which seems to be new and is useful for other problems such as independent query sampling.

Keywords and phrases:
Geometric intersection graphs, Densest subgraph, Range searching, Approximation algorithms
Copyright and License:
© Sariel Har-Peled and Saladi Rahul; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Computational geometry
Related Version:
Full Version: https://www.arxiv.org/abs/2405.18337
Funding:
Research of Saladi Rahul is funded by Walmart Center for Tech Excellence.
Editors:
Olaf Beyersdorff, Michał Pilipczuk, Elaine Pimentel, and Nguyễn Kim Thắng

1 Introduction

Given an undirected graph 𝖦 = (𝖵, 𝖤), the density of a set S ⊆ V(𝖦) is |𝖤S|/|S|, where 𝖤S ⊆ 𝖤(𝖦) is the set of edges with both endpoints in S. In the densest subgraph problem, the goal is to find a subset of V(𝖦) with maximum density. Computing the densest subgraph is a primitive operation in large-scale graph processing, and has found applications in mining closely-knit communities [12], link-spam detection [18], and reachability and distance queries [14]. See [7] for a detailed discussion of the applications of densest subgraph, and [22] for a survey of the recent developments on densest subgraph.
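For concreteness, the density of a subset can be computed directly from an explicit edge list. The following Python sketch (our own helper, not from the paper) mirrors the definition |𝖤S|/|S|:

```python
# Illustrative helper (our naming, not the paper's): the density of a
# vertex subset S is the number of induced edges divided by |S|.
def density(edges, S):
    S = set(S)
    induced = sum(1 for u, v in edges if u in S and v in S)
    return induced / len(S)

# Example in the spirit of Figure 1.1: {a, b, c, d} spans 5 edges,
# so its density is 5/4.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")]
print(density(edges, "abcd"))  # 1.25
```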

Exact algorithms for densest subgraph.

Unlike the (related) problem of computing the largest clique (which is NP-hard), the densest subgraph can be computed exactly in polynomial time. Goldberg [19] showed how to reduce the problem to O(log n) instances of the s-t min-cut problem. Gallo et al. [17] improved the running time slightly by using parametric max-flow computation. Charikar [10] presented an LP-based solution to the problem, for which Khuller and Saha [21] gave a simpler rounding scheme.

Approximation algorithms for densest subgraph.

The exact algorithms described above require solving either an LP or an s-t min-cut instance, both of which are relatively expensive to compute. To obtain a faster algorithm, Charikar [10] analyzed a 2-approximation algorithm which repeatedly removes the vertex with the smallest degree, and tracks the density of the remaining graph. This algorithm runs in linear time. Later, Bahmani et al. [6] used the primal-dual framework to give a (1+ε)-approximation algorithm which runs in O((m/ε²) log n) time. The problem has also been studied in the streaming model, where the focus is on approximation using bounded space. McGregor et al. [24] and Esfandiari et al. [16] presented (1+ε)-approximation streaming algorithms using roughly Õ(|𝖵|) space. There has also been recent interest in designing dynamic algorithms for approximate densest subgraph, where an edge gets inserted or deleted in each time step [8, 26, 15].
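Charikar's peeling strategy is short enough to sketch. The Python version below (our illustration) uses a simple minimum-degree scan, which is quadratic; a bucket queue over degrees is what makes the real algorithm linear time:

```python
from collections import defaultdict

def greedy_densest(n, edges):
    """Charikar-style greedy 2-approximation: repeatedly delete a vertex
    of minimum degree, and return the best density seen along the way.
    (Quadratic-time sketch; a bucket queue makes this linear.)"""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    alive = set(range(n))
    m = len(edges)
    best = m / n
    while len(alive) > 1:
        u = min(alive, key=lambda x: len(adj[x]))  # minimum-degree vertex
        m -= len(adj[u])                           # drop its incident edges
        for w in adj[u]:
            adj[w].discard(u)
        del adj[u]
        alive.remove(u)
        best = max(best, m / len(alive))
    return best

# On K4 the maximum density is 6/4, achieved by the whole graph.
print(greedy_densest(4, [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]))  # 1.5
```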

Geometric intersection graphs.

The geometric intersection graph of a set 𝒟 of n objects is a graph G=(V,E), where each object in 𝒟 corresponds to a unique vertex in V, and an edge exists between two vertices if and only if the corresponding objects intersect. Further, in an implicit geometric intersection graph, the input is only the set of objects 𝒟 and not the edge set 𝖤, whose size could potentially be quadratic in terms of |𝖵|.
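A minimal Python sketch of an explicit intersection graph for disks, using the fact that two closed disks intersect iff the distance between their centers is at most the sum of their radii. The paper's algorithms specifically avoid this Θ(n²) materialization:

```python
import math

def disks_intersect(d1, d2):
    """Two closed disks (x, y, r) intersect iff the distance between the
    centers is at most the sum of the radii (containment included)."""
    (x1, y1, r1), (x2, y2, r2) = d1, d2
    return math.hypot(x1 - x2, y1 - y2) <= r1 + r2

def intersection_edges(disks):
    """Brute-force edge set of the intersection graph -- O(n^2) work,
    which is exactly what the implicit setting wants to avoid."""
    n = len(disks)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if disks_intersect(disks[i], disks[j])]

disks = [(0, 0, 1), (1.5, 0, 1), (10, 10, 1)]
print(intersection_edges(disks))  # [(0, 1)]
```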

Unlike general graphs, geometric intersection graphs typically have more structure, and as such, designing faster approximation algorithms [4, 11], or obtaining better approximation guarantees, for implicit geometric intersection graphs has been an active field of research.

One problem closely related to the densest subgraph problem is that of finding the maximum clique. Unlike the densest subgraph problem, the maximum clique problem is NP-hard for various geometric intersection graphs as well (e.g., segment intersection graphs [9]). However, for the case of unit-disk graphs, an elegant polynomial-time solution by Clark et al. [13] is known.

Figure 1.1: Five disks and their corresponding graph. The densest subgraph is {a,b,c,d} with density 5/4.
Motivation.

Densest subgraph computation on geometric intersection graphs can help detect regions which have strong cellular network coverage, or have many polluting factories. The region covered (resp., polluted) by cellular towers (resp., factories) can be represented by disks of varying radii.

1.1 Our results

Here, we study the densest subgraph problem on (implicit) geometric intersection graphs, and present near-linear time (in terms of |𝖵|) approximation algorithms.

From reporting to approximate counting/sampling.

We show a reduction from (shallow) range-reporting to approximate counting/sampling: given a set of objects 𝒟, and a reporting data-structure for 𝒟, we show how to build a data-structure, such that given a query object 𝗊, it returns an approximately-uniform object from 𝗊 ⊓ 𝒟 = { x ∈ 𝒟 ∣ x ∩ 𝗊 ≠ ∅ }, and also returns a (1±ε)-approximation for the size of this set. Previous work on closely related problems includes the work by Afshani and Chan [1] (that uses shallow counting queries), the work by Afshani et al. [2], and the work by Afshani and Philips [3]. For our application, this data-structure enables us to sample (1±ε)-uniformly a disk from the set of disks intersecting a given disk. See Section 3 and Theorem 12 for details.

The reduction seems to be new, and should be useful for other problems. For example, in databases, motivated by summarizing large query output size, and to incorporate fairness and diversity in the output results, independent query sampling (IQS) has received quite a bit of attention. In IQS the goal is to report a uniform sample of the query output (i.e., of 𝗊𝒟). Crucially, the query output should be independent of all the previous queries posed (for example, if the same query is posed again, then the sample reported should be independent of the previous sample). See the survey by Tao [27] on IQS (and the references within). Our reduction addresses two of the future directions proposed in the survey: (a) designing a generic IQS data structure which works for a broad class of geometric queries, and (b) approximate IQS where the probability of reporting an object is almost-uniform, which is usually good enough in practice.

The application.

For the sake of concreteness, we consider the case of n disks (with arbitrary radii) lying in the plane and present two different approximation algorithms. See Figure 1.1. However, our algorithms would work for other shapes with minor modifications.

A (𝟐+𝜺)-approximation.

Our first approximation algorithm uses the greedy strategy of removing disks of low degree from the intersection graph. By batching the queries, and using the above data-structure, we get a (2+ε)-approximation for the densest subset of disks in O_ε(n log n) time, where O_ε hides constants polynomial in 1/ε. Getting this running time requires some additional ideas, such as establishing the connection between the densest subgraph and the deepest point in the arrangement of the disks (defined later). The running time is optimal in terms of n in the comparison-based model, see Remark 20.

A (𝟏+𝜺)-approximation.

A more promising approach is to randomly sample edges from the intersection graph, and then apply known approximation algorithms. This requires some additional work since, unlike previous work, we can only sample approximately uniformly. See Section 5 for details. The running time of the new algorithm is O_ε(n log² n log log n), which is slower than the first (2+ε)-approximation algorithm. The results are summarized in Figure 1.2.

Approximation   Running time                                   Ref
2+ε             O((n log n)/ε⁴)                                Theorem 19
1+ε             O((n log² n)/ε² · (1/ε² + log log n))          Theorem 29
Figure 1.2: Our results.

2 Preliminaries

2.1 Definitions

In the following, 𝒟 denotes a (given) set of n objects (i.e., disks). For S ⊆ 𝒟 and u ∈ 𝒟, we use the shorthands S+u = S ∪ {u} and S−u = S ∖ {u}. For ε ∈ (0,1) and a real number α > 0, let (1±ε)α denote the interval ((1−ε)α, (1+ε)α). Throughout, a statement holds with high probability if it holds with probability at least 1 − 1/nᶜ, where c is a sufficiently large constant.

Observation 1.
  1. (I)

    For any ε, we have 1/(1+ε) ≥ 1−ε.

  2. (II)

    For ε ∈ (0,1/2), we have 1/(1±ε) = (1/(1+ε), 1/(1−ε)) ⊆ 1±2ε, since 1−ε ≤ 1/(1+ε) ≤ 1/(1−ε) ≤ 1+2ε.

  3. (III)

    For ε ∈ (0,1/3), we have (1±ε)² ⊆ (1±3ε).

  4. (IV)

    For ε ∈ (0,1) and constants c, c₁ and c₂, such that c ≥ c₁c₂, we have (1±c₁ε/c)(1±c₂ε/c) ⊆ 1 ± (c₁+c₂+1)ε/c. Indeed, (1+c₁ε/c)(1+c₂ε/c) ≤ 1 + (c₁/c + c₂/c + c₁c₂/c²)ε ≤ 1 + (c₁+c₂+1)ε/c.

Definition 2.

Given a set of objects 𝒟 (say, in ℝᵈ), their intersection graph has an edge between two objects if and only if they intersect. Formally,

𝖦𝒟 = (𝒟, { uv ∣ u, v ∈ 𝒟, and u ∩ v ≠ ∅ }).
Definition 3.

For a graph 𝖦, and a subset S ⊆ 𝖵(𝖦), let

𝖤S = 𝖤S(𝖦) = { uv ∈ 𝖤(𝖦) ∣ u, v ∈ S }.

The induced subgraph of 𝖦 over S is 𝖦S=(S,𝖤S), and let 𝗆(S)=|𝖤S| denote the number of edges in this subgraph.

Definition 4.

For a set S ⊆ 𝖵(𝖦), its density in 𝖦 is ρ(S) = ρ𝖦(S) = 𝗆(𝖦S)/|S|, where 𝗆(𝖦S) is the number of edges in 𝖦S. Similarly, for a set of objects 𝒟, and a subset S ⊆ 𝒟, the density of S is ρ(S) = 𝗆(𝖦S)/|S|.

Definition 5.

For a graph 𝖦, its max density is the quantity d(𝖦) = max_{S⊆𝖵(𝖦)} ρ(S), and analogously, for a set of objects 𝒟, its max density is d(𝒟) = max_{S⊆𝒟} ρ(S).

The problem at hand is to compute (or approximate) the maximum density of a set of objects 𝒟. If a subset S realizes this quantity, then it is the densest subset of 𝒟 (i.e., the induced subgraph (𝖦𝒟)S is the densest subgraph of 𝖦𝒟). One can make the densest subset unique, if there are several candidates, by asking for the lexicographically minimal set realizing the maximum density. For simplicity of exposition, we treat the densest subset as being unique.
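On tiny instances, the maximum density can be computed exactly by brute force, which is handy for sanity-checking the approximation algorithms that follow. A Python sketch (exponential time, illustration only):

```python
from itertools import combinations

def max_density(vertices, edges):
    """Exact maximum density d(G), by enumerating every non-empty subset.
    Exponential time -- only for sanity checks on very small graphs."""
    best = 0.0
    for k in range(1, len(vertices) + 1):
        for S in combinations(vertices, k):
            Ss = set(S)
            induced = sum(1 for u, v in edges if u in Ss and v in Ss)
            best = max(best, induced / k)
    return best

# A graph in the spirit of Figure 1.1: {a,b,c,d} spans 5 edges, e hangs off,
# so the maximum density is 5/4 (the whole graph has density 6/5 only).
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c"), ("a", "e")]
print(max_density("abcde", edges))  # 1.25
```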

Lemma 6.

Let 𝒪 ⊆ 𝒟 be the densest subset, and let ρ = ρ(𝒪). For u ∈ 𝒪, let d𝒪(u) = |u ⊓ (𝒪−u)|, where u ⊓ (𝒪−u) = { x ∈ 𝒪−u ∣ x ∩ u ≠ ∅ }. Then, for all u ∈ 𝒪, we have d𝒪(u) ≥ ρ.

Proof.

Observe that

ρ = |𝖤𝒪| / |𝒪| = (|𝖤_{𝒪−u}| + d𝒪(u)) / ((|𝒪|−1) + 1).

As such, if d𝒪(u) < ρ, then

d𝒪(u)/1 < ρ = (d𝒪(u) + |𝖤_{𝒪−u}|) / (1 + (|𝒪|−1)) < |𝖤_{𝒪−u}| / (|𝒪|−1),

as the mediant of two distinct fractions lies strictly between them.

But this implies that 𝒪u is denser than 𝒪, which is a contradiction.

2.2 Reporting all intersecting pairs of disks

The algorithm of Section 5 requires an efficient algorithm to report all the intersecting pairs of disks.

Lemma 7.

Given a set 𝒟 of n disks, all the intersecting pairs of disks of 𝒟 can be computed in O(nlogn+k) expected time, where k is the number of intersecting pairs.

Proof.

We break the boundary of each disk at its two x-extreme points, resulting in a set of 2n x-monotone curves. Computing the vertical decomposition of the arrangement 𝒜(𝒟) of these curves can be done in O(n log n + k) expected time [20]. See Figure 2.1 for an example. This readily gives us all the pairs of disks whose boundaries intersect.

As for the intersections that arise out of containment, perform a traversal of the dual graph of the vertical decomposition (i.e., each vertical trapezoid is a vertex, and two trapezoids are adjacent if they share a boundary edge). The dual graph is planar, with O(n+k) vertices and edges, and as such it can be traversed in O(n+k) time. During the traversal, by appropriate bookkeeping, it is straightforward to maintain the list of disks containing the current trapezoid, in O(1) time per edge traversed, as any edge traversed changes this set by at most one disk.

For a disk 𝖽, let p𝖽 be the rightmost point of 𝖽. For each disk 𝖽, pick any trapezoid Δ such that p𝖽 ∈ Δ, and either the top boundary of 𝖽 is the ceiling of Δ, or the bottom boundary of 𝖽 is the floor of Δ. Assign Δ𝖽 ← Δ as the representative trapezoid of 𝖽.

During the traversal of the dual graph, consider the case where we arrive at the representative trapezoid of a disk 𝖽. Let L(𝖽) be the list of disks containing Δ𝖽. Then scan L(𝖽) to report all the containing pairs of 𝖽. Each disk in L(𝖽) either intersects the boundary of 𝖽, or contains it. Therefore, the total time spent at the representative trapezoids is ∑_{𝖽∈𝒟} |L(𝖽)| = O(k).

Figure 2.1: Vertical decomposition of four disks.

3 From reporting to approximate sampling/counting

In this section, given a set of objects 𝒟, and a reporting data-structure for 𝒟, we show a reduction to building a data-structure, such that given a query object 𝗊, it returns an object from 𝗊 ⊓ 𝒟 = { x ∈ 𝒟 ∣ x ∩ 𝗊 ≠ ∅ }, with almost uniform probability, and also returns a (1±ε)-approximation for the size of this set.

3.1 The data-structure

The given reporting data-structure.

Let 𝒟 be a set of n objects, and assume that for any subset X ⊆ 𝒟 of size m, one can construct, in C(m) time, a data-structure that given a query object 𝖽, returns, in O(Q(m)+k) time, all the objects in X that intersect 𝖽, where k = |𝖽 ⊓ X|. Furthermore, we assume that if a parameter k′ is specified by the query, then the data-structure stops after O(Q(m)+k′) time if k > k′, and indicates that this is the case.

Example 8.

If 𝒟 is a set of disks, and the query 𝖽 is a disk, then this becomes a k-nearest-neighbors reporting query in an additively weighted Voronoi diagram. Liu [23] showed how to build such a reporting data-structure, with O(n log n) expected preprocessing time, and O(log n + k) query time.

Data-structure construction.

We build a random binary tree over the objects of 𝒟, by assigning each object of 𝒟, with equal probability, either a 0 or a 1 label. This partitions 𝒟 into two sets 𝒟0 (label 0) and 𝒟1 (label 1). Recursively build random trees T0 and T1 for 𝒟0 and 𝒟1, respectively, with the output tree having T0 and T1 as the two children of the root. The construction bottoms out when the set of objects is of size one. Let 𝒯 be the resulting tree, which has exactly n leaves. For every node u of 𝒯, we construct the reporting data-structure for the set of objects 𝒟(u) – that is, the set of objects stored in the subtree of u.

Finally, create an array Li for each level i of the tree 𝒯, containing pointers to all the nodes in this level of the tree.

Answering a query.

Given a query object 𝗊, and a parameter ε ∈ (0,1/2), the algorithm starts from an arbitrary leaf v of 𝒯. The leaf v has a unique root-to-v path, denoted by π = u0 u1 ⋯ ut, where u0 = root(𝒯) and ut = v. The algorithm performs a binary search on π, using the reporting data-structure associated with each node, to find the maximal i such that |𝗊 ⊓ 𝒟(ui)| > ψ = c log n, where c is a sufficiently large constant. Here, we use the property that one can abort a reporting query if the number of reported objects exceeds ψ. This implies that each such query takes O(Q(n) + log n) time (with no dependency on ε). Next, the algorithm computes the maximal j such that |𝗊 ⊓ 𝒟(uj)| > ψε = (c/ε²) log n (or sets j = 0 if no such j exists). This is done by going up the path π from ui, trying u_{i−1}, u_{i−2}, …, uj using the reporting data-structure, till the condition is fulfilled (one can also “jump” to level ≈ i − log₂(1/ε²), and do a local search there for j, but this “improvement” does not affect the performance). Next, the algorithm chooses a node u ∈ Lj uniformly at random. It computes the set S = 𝗊 ⊓ 𝒟(u) using the reporting data-structure. The algorithm then returns an object chosen from S uniformly at random, together with the number 2ʲ|S|. The first is a random element of 𝗊 ⊓ 𝒟, and the second quantity is an estimate for |𝗊 ⊓ 𝒟|.
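The construction and query above can be sketched in Python. In this illustration (names and simplifications are ours), a brute-force scan of each bucket stands in for the reporting data-structure, so the claimed query time is not preserved, and we assume level j holds 2ʲ nodes (true while the buckets are still large):

```python
import random

def _split(bucket):
    # Assign each object a random bit; 0 goes left, 1 goes right.
    left, right = [], []
    for obj in bucket:
        (left if random.getrandbits(1) else right).append(obj)
    return left, right

def build_levels(objects):
    """Random binary tree stored level by level; the buckets of level i
    are the sets D(u) of the nodes at depth i (halving in expectation)."""
    levels = [[list(objects)]]
    while any(len(b) > 1 for b in levels[-1]):
        nxt = []
        for b in levels[-1]:
            nxt.extend(_split(b) if len(b) > 1 else [b])
        levels.append(nxt)
    return levels

def sample_and_count(levels, hits, threshold):
    """hits(bucket) stands in for the reporting structure: it returns the
    sublist of the bucket meeting the query.  Walk one root-to-leaf path
    down to the deepest level j whose node still has more than
    `threshold` hits, pick a uniformly random node of level j, and return
    a uniform object of it together with the estimate 2^j * |S|."""
    j = 0
    for lvl in range(len(levels)):
        # levels[lvl][0] always lies on a single root-to-leaf path.
        if len(hits(levels[lvl][0])) > threshold:
            j = lvl
        else:
            break
    node = random.choice(levels[j])
    S = hits(node)
    return (random.choice(S) if S else None), (2 ** j) * len(S)

random.seed(1)
levels = build_levels(range(1 << 12))
evens = lambda bucket: [o for o in bucket if o % 2 == 0]  # query meets 2048 objects
obj, est = sample_and_count(levels, evens, threshold=64)
print(obj, est)  # est is a rough estimate of the 2048 matching objects
```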

3.2 Analysis

3.2.1 Correctness

Lemma 9.

Let ε ∈ (0,1/2), ψε = (c/ε²) log n, and let 𝗊 be a query object. Let M ≥ 1 be the integer such that ψε/16 ≤ |𝒟 ⊓ 𝗊|/2ᴹ < ψε/8. Then, for all nodes v at distance i ≤ M from the root of 𝒯, we have ℙ[ |𝒟(v) ⊓ 𝗊| ∈ (1±ε/2) |𝒟 ⊓ 𝗊|/2ⁱ ] ≥ 1 − 1/n^{Ω(c)}.

Proof.

Consider a node v at distance i from the root, and let Yv = |𝒟(v) ⊓ 𝗊|. Clearly, μv = 𝔼[Yv] = |𝒟 ⊓ 𝗊|/2ⁱ. Since i ≤ M, we have μv ≥ ψε/16. By Chernoff’s inequality, we have

ℙ[ Yv ∉ (1±ε/2)μv ] ≤ 2 exp(−ε²μv/3) ≤ 2 exp(−ε²ψε/48) ≤ 2 exp(−(c/48) log n) = 2n^{−c/48}.

The number of nodes in 𝒯 is O(n), and hence, by the union bound, for all nodes v at distance i ≤ M from the root, we have ℙ[ Yv ∈ (1±ε/2)μv ] ≥ 1 − 1/n^{Ω(c)}.

Observation 10.

Lemma 9 implies that for all nodes v at distance i > M from the root of 𝒯, we have ℙ[ |𝒟(v) ⊓ 𝗊| > ψε ] ≤ 1/n^{Ω(c)}. Indeed, Lemma 9 implies this for all nodes at distance M from the root, and the sizes of these sets are monotonically decreasing along any path down the tree.

Lemma 11.

Assume that the number of distinct sets 𝗊 ⊓ 𝒟, over all possible query objects 𝗊, is bounded by a polynomial O(nᵈ), where d is some constant. Then, for a query 𝗊, the probability that the algorithm returns a specific object 𝗈 ∈ 𝒟 ⊓ 𝗊 is in (1±ε)/β, where β = |𝒟 ⊓ 𝗊|. Similarly, the estimate that the algorithm outputs for β is in (1±ε)β. The answers are correct for all queries, with probability ≥ 1 − 1/n^{Ω(c)}, for a sufficiently large constant c.

Proof.

It is easy to verify that the algorithm works correctly if |𝗊 ⊓ 𝒟(uj)| < ψε. Otherwise, for the node uj computed by the algorithm, we have |𝗊 ⊓ 𝒟(uj)| > ψε = (c/ε²) log n. By Observation 10, with high probability, we have that j ≤ M. By Lemma 9, this implies that for any node u ∈ Lj, we have 2ʲ|𝒟(u) ⊓ 𝗊| ∈ (1±ε/2)|𝒟 ⊓ 𝗊| = (1±ε/2)β, which implies that the estimate for β is correct, as u ∈ Lj. This readily implies that the probability of returning a specific object 𝗈 ∈ 𝒟 ⊓ 𝗊 is in (1±ε)/β, since

(1−ε)/β ≤ 1/((1+ε/2)|𝒟 ⊓ 𝗊|) ≤ 1/(|Lj| · |𝒟(u) ⊓ 𝗊|) ≤ 1/((1−ε/2)|𝒟 ⊓ 𝗊|) ≤ (1+ε)/β,

as |Lj| = 2ʲ, and 𝗈 belongs to exactly one node u ∈ Lj.

As for the probabilities, there are O(n) nodes in 𝒯, and O(nᵈ) different queries, and thus the probability of failure is at most n^{d+1}/n^{Ω(c)} < 1/n^{Ω(c)}, by Lemma 9.

3.2.2 Running times

Query time.

The depth of 𝒯 is h = O(log n) with high probability (this follows readily from Chernoff’s inequality). Thus, the first stage (computing the maximal i) requires O(log log n) queries to the reporting data-structure, where each query takes O(Q(n) + log n) time. The second stage (finding the maximal j) takes expected time

τ = 𝔼[ ∑_{t=j}^{i} O( Q(n) + |𝒟(ut) ⊓ 𝗊| ) ].

Thus, we have

  1. (A)

If Q(n) = O(log n), then τ = O(ψε), as the cardinality of 𝒟(ut) ⊓ 𝗊 decreases by a factor of two (in expectation) as one moves downward along a path in the tree. Thus, τ is a geometric summation dominated by its largest term.

  2. (B)

If Q(n) = Ω(log n), we have (in expectation) that |i − j| = O(log(1/ε)), and thus τ = O(Q(n) log(1/ε) + ψε).

  3. (C)

If Q(n) = O(n^λ), for some constant 0 < λ ≤ 1, then the query time is dominated by the query time of the top node (i.e., uj) on this path, and τ = O(Q(n)), as can easily be verified.

Construction time.

A running time bound of the form O(C(n)) is well-behaved if, for any non-negative integers n1, n2, …, such that ∑ᵢ nᵢ = n, we have ∑ᵢ C(nᵢ) = O(C(n)). Under this assumption on the construction time, the total construction time is O(C(n) log n).

3.2.3 Summary

Theorem 12.

Let 𝒟 be a set of n objects, and assume we are given a well-behaved range-reporting data-structure that can be constructed in C(m) time, for m objects, and answers a reporting query 𝗊 in O(Q(m) + |𝗊 ⊓ 𝒟|) time. Then, one can construct a data-structure, in O(C(n) log n) time, such that given a query object 𝗊, it reports a (1±ε)-estimate for β = |𝗊 ⊓ 𝒟|, and also returns an object from 𝗊 ⊓ 𝒟, where each object is reported with probability (1±ε)/β. The data-structure answers all such queries correctly with probability ≥ 1 − 1/n^{Ω(1)}. The expected query time is:

  1. (i)

    O((ε⁻² + log log n) log n) if Q(m) = O(log m).

  2. (ii)

    O(Q(n)) if Q(m) = O(m^λ), for some constant λ > 0.

  3. (iii)

    O(ε⁻² log n + Q(n) log(ε⁻¹ log n)) otherwise.

Plugging in the data-structure of Liu [23] for disks, with C(m)=O(mlogm), and Q(m)=O(logm), in the above theorem, implies the following.

Corollary 13.

Let 𝒟 be a set of n disks in the plane. One can construct, in O(n log² n) time, a data-structure, such that given a query disk 𝗊 and a parameter ε ∈ (0,1/2), it outputs a (1±ε)-estimate for β = |𝗊 ⊓ 𝒟|, and also returns a disk from 𝗊 ⊓ 𝒟 with a probability that is (1±ε)-uniform. The expected query time is O((ε⁻² + log log n) log n), and the results returned are correct, with high probability, for all possible queries.

Approximate depth queries for a fixed threshold.

If we are interested only in a single threshold, the above data-structure is overkill, as one can directly construct a single sample and perform the query on this sample. This implies the following.

Corollary 14.

Let 𝒟 be a set of n disks in the plane, and let β and ε ∈ (0,1/2) be parameters. One can construct, in O(n log n) time, a data-structure, such that given a query disk 𝗊, it reports whether |𝗊 ⊓ 𝒟| ≥ β or |𝗊 ⊓ 𝒟| < β. The data-structure is allowed to make a mistake when |𝗊 ⊓ 𝒟| ∈ [(1−ε)β, (1+ε)β]. The data-structure answers the query correctly with probability ≥ 1 − 1/n^{Ω(1)}, and the query time is O(ε⁻² log n).

4 A (𝟐+𝜺)-approximation for the densest subset of disks

In this section, we design a (2+ε)-approximation algorithm that computes the densest subset of disks in O(ε⁻⁴ n log n) time.

4.1 Constant approximation via depth

The depth of a point in an arrangement of disks is the number of disks containing the point. The deepest point is the point of maximum depth in the plane. The depth of the deepest point in the arrangement of the disks, and the maximum density of their intersection graph, are the same up to a constant factor.

Lemma 15.

Let h be the depth of the deepest point in 𝒟, where 𝒟 is a set of disks in the plane, and let d be the maximum density of 𝒟. Then (h−1)/2 ≤ d ≤ 7h. In particular, in O(n log n) time, one can compute a quantity t, such that (t−1)/2 ≤ d ≤ 8t.

Proof.

Lemma 6 readily implies that d ≥ (h−1)/2, as this is the density provided by the disks containing the deepest point.

Figure 4.1: Defending a disk by a guard set made of 7 points.

As for the other direction, consider the densest subset 𝒪 ⊆ 𝒟, and let 𝖽 be the smallest disk in 𝒪. By Lemma 6, the set T = 𝖽 ⊓ (𝒪−𝖽) has size at least d. Let c be the center of 𝖽, and inscribe an equilateral triangle in 𝖽. Consider a tiling of 6 such triangles that are interior disjoint, all sharing a vertex at c. Let P be the set of 7 vertices used by these triangles (including c), see Figure 4.1. One can verify that any disk 𝖽′ at least as large as 𝖽 that intersects 𝖽 must contain at least one point of P. (Indeed, if the center of 𝖽′ is in the union of the seven copies of 𝖽 centered at the points of P, see Figure 4.1, then the disk 𝖽′ must contain one of the points of P. Otherwise, it can be verified that 𝖽 and 𝖽′ do not intersect.)

Thus, each disk in T is stabbed by at least one point of P, which readily implies that there is a point in P of depth at least |T|/|P| ≥ d/7.

As for the last part, a 1.1-approximation of the deepest point in the arrangement of the disks can be computed in O(nlogn) time [5]. The returned approximate deepest point provides the desired approximation.

4.2 The algorithm

The input is a set 𝒟 of n disks in the plane. Let ϑ = ε/15. The algorithm starts by computing a number ξ, such that d ∈ [(ξ−1)/2, 8ξ], using the algorithm of Lemma 15. The basic idea is to try a sequence of exponentially decaying candidate values for the optimal density d. To this end, in the ith round, the algorithm tries the degree threshold β = 8ξ(1−ϑ)ⁱ, for i = 1, …, ⌈log_{1+ϑ} 16⌉ = O(1/ε).

At the beginning of such a round, let ℛ ← 𝒟. During the round, the algorithm repeatedly removes “low-degree” objects, as follows:

  1. (I)

    Construct the data-structure of Corollary 14 on the objects of ℛ.

  2. (II)

    Let ℛ_< ⊆ ℛ be the set of objects whose degree in ℛ is smaller than (1+ϑ)β according to this data-structure (i.e., query all the objects of ℛ for their degree). Let ℛ′ = ℛ ∖ ℛ_<.

  3. (III)

    If ℛ′ is empty, then this round failed, and the algorithm continues to the next round.

  4. (IV)

    If |ℛ_<| < ϑ|ℛ|, then the algorithm returns ℛ′ as the desired approximate densest subset.

  5. (V)

    Otherwise, let ℛ ← ℛ′. The algorithm continues to the next iteration (still inside the same round).
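The round structure above can be sketched as follows. In this Python illustration (our simplification), exact degree counting stands in for the approximate tester of Corollary 14, and the round schedule is a plain geometric decay rather than the paper's bounded number of rounds:

```python
import math

def intersects(d1, d2):
    (x1, y1, r1), (x2, y2, r2) = d1, d2
    return math.hypot(x1 - x2, y1 - y2) <= r1 + r2

def one_round(disks, beta, theta):
    """One round with threshold beta: repeatedly strip objects of degree
    below (1+theta)*beta.  Returns the survivors if the stopping rule
    fires, or None if the round fails."""
    R = set(range(len(disks)))
    while R:
        deg = {i: sum(1 for j in R if j != i and intersects(disks[i], disks[j]))
               for i in R}
        low = {i for i in R if deg[i] < (1 + theta) * beta}
        if low == R:
            return None            # everything is low degree: round failed
        if len(low) < theta * len(R):
            return R - low         # stopping rule: output the survivors
        R -= low
    return None

def approx_densest(disks, xi, eps):
    """Try exponentially decaying thresholds beta = 8*xi*(1-theta)^i;
    xi is assumed to satisfy d in [(xi-1)/2, 8*xi]."""
    theta = eps / 15
    beta = 8 * xi
    while beta > 1e-3:
        S = one_round(disks, beta, theta)
        if S is not None:
            return S
        beta *= 1 - theta
    return set(range(len(disks)))  # graph is essentially edgeless

disks = [(0.1 * i, 0.0, 1.0) for i in range(6)] + [(100.0, 100.0, 1.0)]
print(sorted(approx_densest(disks, xi=2.5, eps=0.5)))  # the 6-disk cluster
```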

4.3 Analysis

Lemma 16.

When the algorithm terminates, we have β ≥ (1−ϑ)³d, with high probability, where d is the optimal density.

Proof.

Consider the round in which β ∈ [(1−ϑ)³d, (1−ϑ)²d]. By the definition of ℛ_<, all the objects in it have degree at most (1+ϑ)β ≤ (1+ϑ)(1−ϑ)²d ≤ (1−ϑ)d. By Lemma 6, all the objects of the optimal solution have degree at least d (when restricted to the optimal solution). Therefore, none of the objects of the optimal solution are in ℛ_<, and hence the set ℛ′ is not empty (and contains the optimal solution). Inside this round, the loop is performed at most O(ϑ⁻¹ log n) times, as every iteration of the loop shrinks ℛ by a factor of at least 1−ϑ. This implies that the algorithm must stop in (or before) this round.

Lemma 17.

The above algorithm returns a (2+ε)-approximation of the densest subset.

Proof.

Consider the set ℛ′ and the value of β at the time of termination. By Lemma 16, β ≥ (1−ϑ)³d, and by the stopping condition of the algorithm, we have |ℛ_<| < ϑ|ℛ|, which implies |ℛ′| > (1−ϑ)|ℛ|. In addition, all the objects in ℛ′ have degree at least (1−ϑ)β. Thus, the number of edges induced by ℛ′ is

𝗆(ℛ′) ≥ (1−ϑ)β|ℛ′|/2 > (1−ϑ)²β|ℛ|/2 ≥ (1−ϑ)⁵(d/2)|ℛ| ≥ (1−5ϑ)(d/2)|ℛ|.

Thus, ρ(ℛ′) = 𝗆(ℛ′)/|ℛ′| ≥ (1−5ϑ)d/2 = (1−ε/3)d/2, as ϑ = ε/15. Observe that 2/(1−ε/3) ≤ 2(1+ε/2) ≤ 2+ε.

Lemma 18.

The expected running time of the above algorithm is O(ε⁻⁴ n log n).

Proof.

The expected time spent in an iteration on steps (I) and (II) is O(ϑ⁻² n′ log n′), where n′ = |ℛ|, and this dominates the running time of the iteration. Let nᵢ be the size of ℛ in the ith iteration (inside the same round), and observe that nᵢ₊₁ ≤ (1−ϑ)nᵢ. Assume t iterations are performed inside this round. As ∑_{i=1}^{t} nᵢ = O(n/ϑ), the expected running time of the round is ∑_{i=1}^{t} O(ϑ⁻² nᵢ log nᵢ) = O(ϑ⁻³ n log n). The value of β changes O(1/ε) times during the algorithm, and hence the overall expected running time of the algorithm is O(ϑ⁻⁴ n log n), which implies the result since ϑ = Θ(ε).

Summarizing, we get the following.

Theorem 19.

Let 𝒟 be a set of n disks in the plane, and let ε ∈ (0,1/2) be a parameter. The overall algorithm computes, in O(ε⁻⁴ n log n) expected time, a (2+ε)-approximation of the densest subgraph of 𝖦𝒟, and the result is correct with high probability.

 Remark 20 (A lower bound).

Consider the element uniqueness problem – deciding if n real numbers are all distinct. It is known that, in the comparison model, this requires Ω(n log n) time. Treating each number as a disk of radius zero readily implies that there is a subgraph of non-zero density (e.g., density 1/2, from a single intersecting pair) if and only if some two of the numbers are identical. Thus, any approximation algorithm for the maximum density problem requires Ω(n log n) time in the comparison model.

5 A (𝟏+𝜺)-approximation for the densest subset of disks

Here, we present a (1+ε)-approximation algorithm for the densest subset of disks, based on the following intuitive idea: if the intersection graph is sparse, then the problem is readily solvable; if not, then one can sample a sparse subgraph, and run an approximation algorithm on the sampled graph.

5.1 Densest subgraph estimation via sampling

Let 𝖦 = (𝖵, 𝖤) be a graph with n vertices and m edges, with maximum subgraph density d. Let ϑ ∈ (0,1/6) be a parameter, and assume that m > c n ϑ⁻² log n, where c is some sufficiently large constant, which in particular implies that

d = d(𝖦) ≥ m/n ≥ c ϑ⁻² log n.

Assume we have an estimate m̄ ∈ (1±ϑ)m of m. For a constant c′ to be specified shortly, with 6c′ < 5c, let

ψ = c′ (n/m̄) ϑ⁻² log n ≤ c′/((1−ϑ)c) ≤ 6c′/(5c) < 1.

Let F = {e1, …, er} be a random sample of r = ψm̄ edges of 𝖦. Specifically, in the ith iteration, an edge ei is picked from the graph, where the probability of picking any specific edge is in (1±ϑ)/m. Let H = (V, F), and observe that H is a sparse graph with n vertices and r = O(ϑ⁻² n log n) edges. The claim is that the densest subset D ⊆ V in H, or even an approximate densest subset, is close to being the densest subset in 𝖦. The proof follows from previous work [24], but requires some modifications, since we only have an estimate of the number of edges m, and we are also interested in approximating the densest subgraph of the sampled graph. We include the details here so that the presentation is self-contained. The result is summarized in Lemma 24, for the reader uninterested in the (somewhat tedious) analysis.

5.1.1 Analysis

Lemma 21.

Let d = d(𝖦), and let U ⊆ V be an arbitrary set of k vertices. If ρ𝖦(U) ≤ d/60, then ℙ[ ρH(U) ≥ ψd/5 ] ≤ n⁻¹⁰⁰ᵏ.

Proof.

We have d ≥ m/n, where n = |𝖵(𝖦)| and m = |𝖤(𝖦)|, and thus

ψ = c′ (n/m̄) ϑ⁻² log n ≥ c′ n ϑ⁻² (log n)/((1+ϑ)m) ≥ (1/d) · (c′/(1+ϑ)) ϑ⁻² log n.

Let Xᵢ = 1 if the edge sampled in the ith round belongs to H_U, and Xᵢ = 0 otherwise. Let X = ∑ᵢ Xᵢ be the number of edges in H_U. Then

ℙ[Xᵢ = 1] ∈ (1±ϑ) |𝖤(𝖦U)|/m = (1±ϑ) k ρ𝖦(U)/m.

By linearity of expectation, and as m̄ ∈ (1±ϑ)m, we have

𝔼[X] ∈ (1±ϑ) ψ m̄ k ρ𝖦(U)/m ⊆ (1±ϑ)² ψ k ρ𝖦(U). (5.1)

By assumption, ρ𝖦(U) ≤ d/60, implying that 𝔼[X] ≤ (1+3ϑ)ψkd/60 ≤ ψkd/30, if ϑ ∈ (0,1/3), by Observation 1. Observe that (1+(2e−1))𝔼[X] ≤ ψkd/5. By Chernoff’s inequality, Lemma 30, we have

ℙ[ ρH(U) ≥ ψd/5 ] = ℙ[ X ≥ ψkd/5 ] ≤ 2^{−ψkd/5} ≤ 1/n¹⁰⁰ᵏ,

by picking c′ to be sufficiently large.

Lemma 22.

Let d = d(𝖦), and let U ⊆ V be an arbitrary set of k vertices. If ρ𝖦(U) ≥ d/60, then ℙ[ ρH(U) ∈ (1±ϑ)³ ψ ρ𝖦(U) ] ≥ 1 − n⁻¹⁰⁰ᵏ.

Proof.

Following the argument of Lemma 21, and as ρ𝖦(U) ≥ d/60, we have that 𝔼[X] = Ω(kψρ𝖦(U)) = Ω(kψd) = Ω(k c′ ϑ⁻² log n). Chernoff’s inequality, Theorem 31, then implies that X ∈ (1±ϑ)𝔼[X] with probability at least 1 − 2 exp(−ϑ²𝔼[X]/4) ≥ 1 − 1/n¹⁰⁰ᵏ, for n sufficiently large. The claim now readily follows from Eq. (5.1).

Lemma 23.

Let α ∈ (0,1/6) be a parameter. With high probability, for all sets U ⊆ V such that ρH(U) ≥ (1−α) d(H), we have that ρ𝖦(U) ≥ (1−6ϑ)(1−α) d.

Proof.

Let X be the densest subset in 𝖦. By Lemma 22, we have that

ρH(X) ∈ (1±ϑ)³ ψ d(𝖦) ⟹ d(H) ≥ (1−ϑ)³ ψ d ≥ ψd/2.

By Lemma 21, for all the sets T ⊆ 𝖵 with ρ𝖦(T) ≤ d/60, we have ρH(T) < ψd/5 < d(H)/2, and this happens with probability at least 1 − ∑_{k=2}^{n} ∑_{T⊆𝖵:|T|=k} 1/n¹⁰⁰ᵏ ≥ 1 − ∑_{k=2}^{n} (n choose k)/n¹⁰⁰ᵏ ≥ 1 − 1/n⁹⁹.

Thus, all the sets U ⊆ 𝖵 under consideration have ρ𝖦(U) > d/60. By Lemma 22, for all such sets, with probability at least 1 − 1/n⁹⁹, we have ρH(U) ∈ (1±ϑ)³ ψ ρ𝖦(U), which implies ρ𝖦(U) ≥ ρH(U)/((1+ϑ)³ψ). Thus, we have

ρ𝖦(U) ≥ ρH(U)/((1+ϑ)³ψ) ≥ (1−α)d(H)/((1+ϑ)³ψ) ≥ (1−α)(1−ϑ)³ψd/((1+ϑ)³ψ) ≥ (1−α)(1−6ϑ)d,

since 1/(1+ϑ) ≥ 1−ϑ, and (1−ϑ)⁶ ≥ 1−6ϑ.

5.1.2 Summary

Lemma 24.

Let ε ∈ (0,1) be a parameter, and let 𝖦 = (𝖵, 𝖤) be a graph with n vertices and m edges, with m = Ω(ε⁻² n log n). Furthermore, let m̄ be an estimate of m, such that m̄ ∈ (1±ϑ)m, where ϑ = ε/10. Let ψ = c′ (n/m̄) ϑ⁻² log n, and let F be a random sample of ψm̄ = O(ε⁻² n log n) edges, with repetition, where the probability of any specific edge being picked is in (1±ϑ)/m, and c′ is a sufficiently large constant. Let H = (𝖵, F) be the resulting graph, and let X ⊆ 𝖵 be a subset with ρH(X) ≥ (1−ε/6) d(H). Then, ρ(X) ≥ (1−ε) d(𝖦).

Proof.

This follows readily from the above, by setting α=ε/6, and using Lemma 23.
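The sparsify-and-solve scheme of Lemma 24 can be sketched as follows (a Python illustration: edges are sampled exactly uniformly, and an exhaustive search stands in for a (1−ε/6)-approximate solver on H, so it only runs on tiny instances):

```python
import random
from itertools import combinations

def sparsify_and_solve(n, edges, r, rng):
    """Draw r edges with repetition (uniformly here; the lemma tolerates
    (1 +- theta) fluctuations), then return a densest subset of the
    sampled multigraph H, found exhaustively (exponential time)."""
    F = [rng.choice(edges) for _ in range(r)]     # multigraph H = (V, F)
    best_set, best_val = None, -1.0
    for k in range(1, n + 1):
        for S in combinations(range(n), k):
            Ss = set(S)
            val = sum(1 for u, v in F if u in Ss and v in Ss) / k
            if val > best_val:
                best_set, best_val = Ss, val
    return best_set

def density(edges, S):
    return sum(1 for u, v in edges if u in S and v in S) / len(S)

# Planted instance: a 6-clique on {0..5} plus a sparse path on {6..11}.
edges = [(u, v) for u, v in combinations(range(6), 2)]
edges += [(i, i + 1) for i in range(6, 11)]
X = sparsify_and_solve(12, edges, r=300, rng=random.Random(5))
print(density(edges, X))  # close to d(G) = 15/6 = 2.5
```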

5.2 Random sampler

To implement the above algorithm, we need an efficient algorithm for sampling edges from the intersection graph of disks, which we describe next.

5.2.1 The algorithm

The algorithm consists of the following steps:

  1. (I)

    Build the data-structure of Corollary 13 on the disks of 𝒟 with error parameter ε/c, where c is a sufficiently large constant. Also, build the range-reporting data-structure of Liu [23] on the disks of 𝒟.

  2. (II)

For each object 𝗈 ∈ 𝒟, query the data-structure of Corollary 13 with 𝗈. Let the estimate returned be d′. If d′ < c′/ε (for a constant c′ ≤ c), then report 𝗈 ⊓ 𝒟 by querying the range-reporting data-structure with 𝗈, and set d𝗈 ← |𝗈 ⊓ 𝒟| − 1. Otherwise, set d𝗈 ← d′.

  3. (III)

We perform |F| iterations, and in each iteration we sample a random edge of 𝖦𝒟. In a given iteration, sample a disk 𝗈 ∈ 𝒟, where 𝗈 is sampled with probability d𝗈 / ∑_{𝗈′∈𝒟} d𝗈′. If d𝗈 < c′/ε, then report a disk from 𝗈 ⊓ (𝒟−𝗈) uniformly at random. Otherwise, query the data-structure of Corollary 13 with 𝗈, which returns a disk of 𝗈 ⊓ 𝒟 (keep querying till a disk other than 𝗈 is returned).
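The degree-proportional sampling of step (III) can be sketched as follows. In this Python illustration we use exact degrees, which makes the edge distribution exactly uniform; the paper only has the estimates d𝗈, which is what introduces the (1±ε) slack:

```python
import random

def make_edge_sampler(adj, rng):
    """Pick a vertex with probability proportional to its degree, then a
    uniform random neighbor.  With exact degrees, an edge uv comes up with
    probability d_u/(2m) * 1/d_u + d_v/(2m) * 1/d_v = 1/m exactly."""
    verts = sorted(adj)
    degs = [len(adj[v]) for v in verts]
    def sample():
        u = rng.choices(verts, weights=degs)[0]
        v = rng.choice(sorted(adj[u]))
        return (min(u, v), max(u, v))
    return sample

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}  # a triangle: m = 3 edges
sampler = make_edge_sampler(adj, random.Random(11))
counts = {}
for _ in range(3000):
    e = sampler()
    counts[e] = counts.get(e, 0) + 1
print(counts)  # each of the 3 edges appears roughly 1000 times
```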

5.2.2 Analysis

Lemma 25.

For each object $\mathsf{o} \in \mathcal{D}$, we have $d_{\mathsf{o}} \in (1 \pm \varepsilon/c'')\, |\mathsf{o} \cap (\mathcal{D} - \mathsf{o})|$ with high probability, where $c''$ is a constant with $c'' \leq c$.

Proof.

Fix an object $\mathsf{o}$. When $d' < c'/\varepsilon$, the statement holds trivially. Let $d = |\mathsf{o} \cap (\mathcal{D} - \mathsf{o})|$, and consider the case $d' \geq c'/\varepsilon$. We know that $d' \in (1 \pm \varepsilon/c)(d+1)$. Firstly, $d' \leq (1 + \varepsilon/c)(d+1) \leq (1 + \varepsilon/c'')\, d$ holds if

$$d \;\geq\; \frac{1}{\varepsilon} \cdot \frac{1 + \varepsilon/c}{1/c'' - 1/c},$$

and the right-hand side is at most $1/(2\varepsilon)$ for an appropriate choice of $c''$. Observe that $d \geq 1/(2\varepsilon)$, since $c'/\varepsilon \leq d' \leq (1 + \varepsilon/c)(d+1)$ implies that $d \geq \frac{c'}{\varepsilon(1 + \varepsilon/c)} - 1 \geq \frac{1}{2\varepsilon}$. Finally, $d' \geq (1 - \varepsilon/c)(d+1) \geq (1 - \varepsilon/c'')\, d$ holds trivially. Therefore,

$$d_{\mathsf{o}} = d' \in (1 \pm \varepsilon/c'')\, d.$$

Lemma 26.

In each iteration, the probability that any fixed edge of $\mathsf{G}_{\mathcal{D}}$ is sampled is $(1 \pm \varepsilon)/m$.

Proof.

An edge (u,v) in 𝖦𝒟 can get sampled in step (III) in two ways. In the first way, the disk corresponding to u gets sampled and then v gets reported as the random neighbor of u, and vice-versa for the second way.

Let $d = |\mathsf{o} \cap (\mathcal{D} - \mathsf{o})|$, where $\mathsf{o}$ is the disk corresponding to $u$. Consider the case where $d_{\mathsf{o}} \geq c'/\varepsilon$. As such, the first way of sampling the edge $(u,v)$ has probability:

$$\frac{d_{\mathsf{o}}}{\sum_{\mathsf{o}' \in \mathcal{D}} d_{\mathsf{o}'}} \cdot \underbrace{\frac{1 \pm \varepsilon/c}{d}}_{\text{almost-uniformity}} \;\in\; \underbrace{\frac{d_{\mathsf{o}}}{(1 \pm \varepsilon/c'')\, 2m}}_{\text{Lemma 25}} \cdot \frac{1 \pm \varepsilon/c}{d} \;\subseteq\; \frac{(1 \pm \varepsilon/c'')\, d}{(1 \pm \varepsilon/c'')\, 2m} \cdot \frac{1 \pm \varepsilon/c}{d} \;\subseteq\; \frac{1 \pm \varepsilon}{2m}.$$

Therefore, for the case $d_{\mathsf{o}} \geq c'/\varepsilon$, combining the two ways, the probability of sampling the edge $(u,v)$ is $(1 \pm \varepsilon)/m$. Similarly, the statement holds for the case $d_{\mathsf{o}} < c'/\varepsilon$.
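To see concretely why the two ways add up, one can redo the calculation with exact degrees, in which case each iteration outputs any fixed edge with probability exactly $1/m$. The helper below is purely illustrative and uses exact rational arithmetic:

```python
from fractions import Fraction

def edge_probability(deg, u, v):
    """Probability that one iteration of step (III) outputs the edge (u, v),
    assuming exact degrees d_o = deg[o]: either u is picked and v is its
    uniform random neighbor, or vice versa (the two ways in the proof)."""
    total = Fraction(sum(deg))                        # sum of degrees = 2m
    way1 = Fraction(deg[u]) / total * Fraction(1, deg[u])   # = 1/(2m)
    way2 = Fraction(deg[v]) / total * Fraction(1, deg[v])   # = 1/(2m)
    return way1 + way2                                # = 1/m

# A path 0-1-2: degrees (1, 2, 1), so m = 2 and every edge has probability 1/2.
assert edge_probability([1, 2, 1], 0, 1) == Fraction(1, 2)
```

The degree factors cancel, which is exactly the mechanism the $(1\pm\varepsilon)$ error terms perturb in the lemma.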

Lemma 27.

The expected running time of the algorithm is $O(\varepsilon^{-4} n \log^2 n + \varepsilon^{-2} n \log^2 n \log\log n)$.

Proof.

The first step of the algorithm takes $O(n \log^2 n)$ expected time. The second step takes $O(n (\varepsilon^{-2} + \log\log n) \log n)$ expected time. In the third step, each iteration requires querying the data-structure of Corollary 13 $O(1)$ times in expectation. Therefore, the third step takes $O(|F|) \cdot O((\varepsilon^{-2} + \log\log n) \log n) = O(\varepsilon^{-4} n \log^2 n + \varepsilon^{-2} n \log^2 n \log\log n)$ expected time.

The above implies the following result.

Lemma 28.

Let $\mathcal{D}$ be a collection of $n$ disks in the plane, and let $\mathsf{G}_{\mathcal{D}}$ be the corresponding geometric intersection graph with $m$ edges. Let $F$ be a random sample of $O(\varepsilon^{-2} n \log n)$ edges from $\mathsf{G}_{\mathcal{D}}$, with repetition, where the probability of any specific edge being picked is $(1 \pm \varepsilon)/m$, and the edges are all chosen independently into $F$. Then, the algorithm described above, for computing $F$, runs in $O(\varepsilon^{-4} n \log^2 n + \varepsilon^{-2} n \log^2 n \log\log n)$ expected time.

5.3 The result

Theorem 29.

Let $\mathcal{D}$ be a collection of $n$ disks in the plane. A $(1+\varepsilon)$-approximation to the densest subgraph of $\mathcal{D}$ can be computed in $O(\varepsilon^{-4} n \log^2 n + \varepsilon^{-2} n \log^2 n \log\log n)$ expected time. The correctness of the algorithm holds with high probability.

Proof.

The case of the intersection graph having $O(\varepsilon^{-2} n \log n)$ edges can be handled directly by computing the whole intersection graph in $O(\varepsilon^{-2} n \log n)$ expected time (using Lemma 7).

To handle the other case, we use Lemma 28 to generate a graph $H = (\mathsf{V}, F)$ in $O(\varepsilon^{-4} n \log^2 n + \varepsilon^{-2} n \log^2 n \log\log n)$ expected time. Since the intersection graph has $\Omega(\varepsilon^{-2} n \log n)$ edges, by Lemma 24 it suffices to compute a $(1 - \varepsilon/6)$-approximate densest subgraph of $H$, which can be done by the algorithm of [6] in $O(\varepsilon^{-4} n \log^2 n)$ expected time.
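As a toy stand-in for the last step of this pipeline, the sketch below runs a greedy peeling in the style of Charikar [10] (a 2-approximation, not the $(1-\varepsilon/6)$-approximation of [6]) on an explicit edge list such as the sampled set $F$; the function name and the simple-graph assumption are ours.

```python
import heapq
from collections import defaultdict

def densest_subgraph_peel(n, edges):
    """Greedy peeling: repeatedly delete a minimum-degree vertex (via a
    lazy-deletion heap) and return the intermediate vertex set of maximum
    density |E_S|/|S|, together with that density."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    deg = {v: len(adj[v]) for v in range(n)}
    cur_m = sum(deg.values()) // 2          # simple graph; duplicates ignored
    alive = set(range(n))
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    best_density, best = 0.0, set(alive)
    while alive:
        density = cur_m / len(alive)
        if density > best_density:
            best_density, best = density, set(alive)
        d, v = heapq.heappop(heap)
        if v not in alive or d != deg[v]:
            continue                        # stale heap entry, skip it
        alive.remove(v)
        cur_m -= deg[v]
        for w in adj[v]:
            if w in alive:
                deg[w] -= 1
                heapq.heappush(heap, (deg[w], w))
    return best_density, best
```

For example, on a 4-clique with a pendant vertex attached, the peeling recovers the clique (density $6/4$) rather than the whole graph (density $7/5$).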

6 Extension to other geometric intersection graphs

Although the focus of the paper has been on disks, the techniques described here extend to other geometric intersection graphs as well. For any geometric intersection graph with a well-behaved range-reporting data-structure that can be constructed in $C(n)$ time and answers queries in $O(Q(n) + |\mathsf{q} \cap \mathcal{D}|)$ time, the technique in Section 5 gives a $(1+\varepsilon)$-approximation algorithm with a running time of $\widetilde{O}(C(n) + n\, Q(n))$ (where the $\widetilde{O}$ notation hides poly-logarithmic factors in $n$). Therefore, if $C(n) = O(n^{2 - \lambda_1})$ and $Q(n) = O(n^{1 - \lambda_2})$, this leads to a $(1+\varepsilon)$-approximation algorithm with running time $O(n^{2 - \lambda_3})$, where $\lambda_1, \lambda_2, \lambda_3 \in (0,1)$. For example, data structures with such bounds exist for axis-aligned boxes in $d$ dimensions (with $n \log^{O(d)} n$ running time) and spheres in $d$ dimensions (via reduction to halfspace range reporting). This improves upon the running time of $\Omega(m) = \Omega(n^2)$ (in the worst case) obtained for general graphs where the edges are given explicitly.

The technique in Section 4 for disks performs $O(1/\varepsilon)$ rounds by establishing a connection between the deepest point in the arrangement of disks and the maximum density. This connection extends to spheres in $d$ dimensions, but it does not hold, for example, for axis-aligned boxes in $d$ dimensions (already in 2-d there are instances where the depth is at most two but the maximum density is $\Omega(n)$). In any case, the technique in Section 4 performs at most $O(\varepsilon^{-1} \log n)$ rounds if we start guessing the degree threshold from $n$. Then the running time of the algorithm is $\widetilde{O}(C(n) + n\, Q(n))$, where $C(n)$ and $Q(n)$ are the construction time and the query time, respectively, of the approximate depth data-structure for a fixed threshold (the analogous version of Corollary 14). Once again this leads to $n \log^{O(d)} n$ running time for axis-aligned boxes in $d$ dimensions and truly sub-quadratic running time for spheres in $d$ dimensions.

7 Conclusions

We presented two near-linear time approximation algorithms for computing the densest subgraph of the (implicit) geometric intersection graph of disks. We conclude with a few open problems. Are there implicit geometric intersection graphs, such as unit-disk graphs or, say, interval graphs, for which the exact densest subgraph can be computed in sub-quadratic time (in terms of $n$)? Finally, maintaining the approximate densest subgraph in sub-linear time (again in terms of $n$) under insertions and deletions of objects seems to be a challenging problem: in prior work on general graphs, an edge gets deleted or inserted, whereas in an intersection graph a vertex gets deleted or inserted.

References

  • [1] Peyman Afshani and Timothy M. Chan. Optimal halfspace range reporting in three dimensions. In Claire Mathieu, editor, Proc. 20th ACM-SIAM Sympos. Discrete Algs. (SODA), pages 180–186. SIAM, 2009. doi:10.1137/1.9781611973068.21.
  • [2] Peyman Afshani, Chris H. Hamilton, and Norbert Zeh. A general approach for cache-oblivious range reporting and approximate range counting. Comput. Geom., 43(8):700–712, 2010. doi:10.1016/J.COMGEO.2010.04.003.
  • [3] Peyman Afshani and Jeff M. Phillips. Independent range sampling, revisited again. In Proceedings of Symposium on Computational Geometry (SoCG), volume 129, pages 4:1–4:13, 2019. doi:10.4230/LIPICS.SOCG.2019.4.
  • [4] Pankaj K. Agarwal and Jiangwei Pan. Near-linear algorithms for geometric hitting sets and set covers. In Proceedings of Symposium on Computational Geometry (SoCG), page 271, 2014. doi:10.1145/2582112.2582152.
  • [5] Boris Aronov and Sariel Har-Peled. On approximating the depth and related problems. SIAM Journal of Computing, 38(3):899–921, 2008. doi:10.1137/060669474.
  • [6] Bahman Bahmani, Ashish Goel, and Kamesh Munagala. Efficient primal-dual graph algorithms for mapreduce. In Algorithms and Models for the Web Graph: 11th International Workshop, pages 59–78, 2014. doi:10.1007/978-3-319-13123-8_6.
  • [7] Bahman Bahmani, Ravi Kumar, and Sergei Vassilvitskii. Densest subgraph in streaming and mapreduce. Proceedings of the VLDB Endowment, 5(5):454–465, 2012. doi:10.14778/2140436.2140442.
  • [8] Sayan Bhattacharya, Monika Henzinger, Danupon Nanongkai, and Charalampos E. Tsourakakis. Space- and time-efficient algorithm for maintaining dense subgraphs on one-pass dynamic streams. In Proceedings of ACM Symposium on Theory of Computing (STOC), pages 173–182, 2015. doi:10.1145/2746539.2746592.
  • [9] Sergio Cabello, Jean Cardinal, and Stefan Langerman. The clique problem in ray intersection graphs. Discrete & Computational Geometry, 50(3):771–783, 2013. doi:10.1007/S00454-013-9538-5.
  • [10] Moses Charikar. Greedy approximation algorithms for finding dense components in a graph. In Proceedings of Approximation Algorithms for Combinatorial Optimization Problems (APPROX), pages 84–95, 2000. doi:10.1007/3-540-44436-X_10.
  • [11] Chandra Chekuri, Sariel Har-Peled, and Kent Quanrud. Fast LP-based approximations for geometric packing and covering problems. In Proc. 31st ACM-SIAM Sympos. Discrete Algs. (SODA), pages 1019–1038, 2020. doi:10.1137/1.9781611975994.62.
  • [12] Jie Chen and Yousef Saad. Dense subgraph extraction with application to community detection. IEEE Transactions on Knowledge and Data Engineering (TKDE), 24(7):1216–1230, 2012. doi:10.1109/TKDE.2010.271.
  • [13] Brent N. Clark, Charles J. Colbourn, and David S. Johnson. Unit disk graphs. Discrete Mathematics, 86(1-3):165–177, 1990. doi:10.1016/0012-365X(90)90358-O.
  • [14] Edith Cohen, Eran Halperin, Haim Kaplan, and Uri Zwick. Reachability and distance queries via 2-hop labels. SIAM Journal of Computing, 32(5):1338–1355, 2003. doi:10.1137/S0097539702403098.
  • [15] Alessandro Epasto, Silvio Lattanzi, and Mauro Sozio. Efficient densest subgraph computation in evolving graphs. In Proceedings of International World Wide Web Conferences (WWW), pages 300–310, 2015. doi:10.1145/2736277.2741638.
  • [16] Hossein Esfandiari, MohammadTaghi Hajiaghayi, and David P. Woodruff. Brief announcement: Applications of uniform sampling: Densest subgraph and beyond. In Proceedings of Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 397–399, 2016. doi:10.1145/2935764.2935813.
  • [17] Giorgio Gallo, Michael D. Grigoriadis, and Robert Endre Tarjan. A fast parametric maximum flow algorithm and applications. SIAM Journal of Computing, 18(1):30–55, 1989. doi:10.1137/0218003.
  • [18] David Gibson, Ravi Kumar, and Andrew Tomkins. Discovering large dense subgraphs in massive graphs. In Proceedings of Very Large Data Bases (VLDB), pages 721–732, 2005. URL: http://www.vldb.org/archives/website/2005/program/paper/thu/p721-gibson.pdf.
  • [19] A. V. Goldberg. Finding a maximum density subgraph. Technical report, University of California at Berkeley, Berkeley, CA, USA, 1984.
  • [20] Sariel Har-Peled. Geometric Approximation Algorithms, volume 173 of Math. Surveys & Monographs. Amer. Math. Soc., Boston, MA, USA, 2011. doi:10.1090/surv/173.
  • [21] Samir Khuller and Barna Saha. On finding dense subgraphs. In Proceedings of International Colloquium on Automata, Languages and Programming (ICALP), pages 597–608, 2009. doi:10.1007/978-3-642-02927-1_50.
  • [22] Tommaso Lanciano, Atsushi Miyauchi, Adriano Fazzone, and Francesco Bonchi. A survey on the densest subgraph problem and its variants. CoRR, abs/2303.14467, 2023. doi:10.48550/arXiv.2303.14467.
  • [23] Chih-Hung Liu. Nearly optimal planar k nearest neighbors queries under general distance functions. SIAM J. Comput., 51(3):723–765, 2022. doi:10.1137/20M1388371.
  • [24] Andrew McGregor, David Tench, Sofya Vorotnikova, and Hoa T. Vu. Densest subgraph in dynamic graph streams. In Proceedings of Mathematical Foundations of Computer Science (MFCS), pages 472–482, 2015. doi:10.1007/978-3-662-48054-0_39.
  • [25] Rajeev Motwani and Prabhakar Raghavan. Randomized Algorithms. Cambridge University Press, Cambridge, UK, 1995. doi:10.1017/CBO9780511814075.
  • [26] Saurabh Sawlani and Junxing Wang. Near-optimal fully dynamic densest subgraph. In Proceedings of ACM Symposium on Theory of Computing (STOC), pages 181–193, 2020. doi:10.1145/3357713.3384327.
  • [27] Yufei Tao. Algorithmic techniques for independent query sampling. In Proceedings of ACM Symposium on Principles of Database Systems (PODS), pages 129–138, 2022. doi:10.1145/3517804.3526068.

Appendix A Chernoff’s inequality

The following are standard forms of Chernoff’s inequality, see [25].

Lemma 30.

Let $X_1, \ldots, X_n$ be $n$ independent Bernoulli trials, where $\Pr[X_i = 1] = p_i$ and $\Pr[X_i = 0] = 1 - p_i$, for $i = 1, \ldots, n$. Let $X = \sum_{i=1}^{n} X_i$, and $\mu = \mathbb{E}[X] = \sum_i p_i$. For $\delta > 2e - 1$, we have $\Pr[X > (1+\delta)\mu] < 2^{-(1+\delta)\mu}$.

Theorem 31.

Let $\varepsilon \in (0,1)$ be a parameter. Let $X_1, \ldots, X_n \in \{0, 1\}$ be $n$ independent random variables, let $X = \sum_{i=1}^{n} X_i$, and let $\mu = \mathbb{E}[X]$. We have that

$$\max\bigl( \Pr[X < (1-\varepsilon)\mu],\; \Pr[X > (1+\varepsilon)\mu] \bigr) \;\leq\; \exp\bigl( -\varepsilon^2 \mu / 3 \bigr).$$
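As a quick numerical sanity check of this bound (not part of the paper), one can compare the exact binomial tails against $\exp(-\varepsilon^2 \mu/3)$ for i.i.d. fair coin flips:

```python
from math import comb, exp

def binom_tail_above(n, p, t):
    """Exact Pr[X > t] for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(t + 1, n + 1))

def binom_tail_below(n, p, t):
    """Exact Pr[X < t] for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(0, t))

# 200 fair coin flips: mu = 100; with eps = 0.3 the thresholds are
# (1+eps)*mu = 130 and (1-eps)*mu = 70 (integer thresholds written out
# explicitly to avoid floating-point truncation).
n, p, eps = 200, 0.5, 0.3
mu = n * p
bound = exp(-eps * eps * mu / 3)            # Theorem 31's bound, exp(-3)
upper = binom_tail_above(n, p, 130)         # Pr[X > (1+eps)*mu]
lower = binom_tail_below(n, p, 70)          # Pr[X < (1-eps)*mu]
assert max(upper, lower) <= bound
```

Both exact tails here are far below the bound, as expected for a Chernoff-type inequality.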