Dynamic clustering to minimize the sum of radii

In this paper, we study the problem of opening centers to cluster a set of clients in a metric space so as to minimize the sum of the costs of the centers and of the cluster radii, in a dynamic environment where clients arrive and depart, and the solution must be updated efficiently while remaining competitive with respect to the current optimal solution. We call this dynamic sum-of-radii clustering problem. We present a data structure that maintains a solution whose cost is within a constant factor of the cost of an optimal solution in metric spaces with bounded doubling dimension and whose worst-case update time is logarithmic in the parameters of the problem.


Introduction
The main goal of clustering is to partition a set of objects into homogeneous and well separated subsets (clusters). Clustering techniques have long been used in a wide variety of application areas, see for example the excellent surveys [25,21]. There are several ways to model clustering. Among them, the problem of sum-of-radii (or sum-of-diameter) clustering has been extensively studied: the clients are located in a metric space and one must open facilities to minimize facility opening cost (or keep the number of open facilities limited to at most k) plus the sum of the cluster radii (or, in other applications, cluster diameters). To give a concrete example, imagine a telecommunications agency setting up mobile towers that provide wireless access to selected clients, incurring costs for setting up towers as well as for configuring a tower to serve the customers lying within a certain distance, where that latter contribution to the cost increases with the maximum distance served by the tower.
Assume the number of facilities is limited to k. For sum-of-diameter clustering, Doddi, Marathe, Ravi, Taylor and Widmayer [15] prove hardness of approximation to within better than a factor of 2. More recently, NP-hardness was proved for the sum-of-radii problem even for shortest path metrics on weighted planar graphs [24], or, in the case of sum-of-diameters, even for metrics of constant doubling dimension [17]. Turning to approximation algorithms, Charikar and Panigraphy [13] design and analyze an O(1) approximation algorithm for sum-of-radii and for sum-of-diameter clustering with k clusters. They start from a linearprogramming relaxation, use a primal-dual type approach and, along the way, design a bicriteria algorithm. They also design an incremental algorithm that handles arrivals of clients, merging clusters as needed so that at any time the clustering has O(k) clusters and the cost is O(1) times the optimal cost for k clusters.
There have been many other papers on sum-of-radii or sum-of-diameters clustering. A few papers focus on the problems of partitioning the clients into a constant number of clusters as quickly as possible [20,24]. Some papers concern themselves with bicriteria results such as [2]. Consider the special case of a metric that is Euclidean in two dimensions. Lev-Tov and Peleg [23] give a polynomial time approximation scheme (PTAS) for the related problem of covering input clients by a min-cost set of disks centered at the servers, where both clients and potential servers are located in the Euclidean plane and are part of the input. Recently, Behsaz and Salavatipour [4] gave a PTAS for the minimum sum of diameters problem on the plane with Euclidean distances. See also [1,18] for other work on the two-dimensional geometric setting.
The problems are complicated by situations where the set of clients may change over time, for example documents in a very large database that must be efficiently searchable and maintained. This then leads to various models: online [14,16], incremental [13], streaming, or dynamic. The dynamic setting, where clients may not only arrive but also depart, has been empirically studied at least since 1993 [12], and is the focus of the present paper, with the joint goals of maintaining clusterings whose objective value is close to optimal, and of updating the cluster quickly after each event.
This paper can be interpreted as part of a recent focus on exploiting primal-dual techniques in the dynamic setting. In the online setting (where new elements arrive but never depart), primal-dual techniques are extremely successful [11]. Initially it seemed that such techniques were inherently restricted to settings with arrivals only, and no departures, but there recently has been exciting progress to handle the dynamic setting as well, starting with [8,7] and continuing with [3,9,26,10]. Of particular notice for us is recent work by Gupta et al [19] for the set cover problem (and Bhattacharya et al [6], restricted to vertex cover) in the dynamic setting where elements arrive and depart.
In this paper, we study the dynamic sum-of-radii clustering problem, defined as follows: The original input consists of a (possibly infinite) set V of potential clients or points, a finite set F ⊆ V of facilities with an opening cost f j for each facility j, and a metric d over V . For the online input, a set C of live clients evolves over time: at each timestep t, either a new client arrives and is added to C, or a client from C departs and is removed from C, or a query is made for the approximate cost of an optimal solution (cost query), or a query asks for the entire current solution (solution query). For the output, at each timestep the algorithm maintains a set of open facilities, each open facility j being associated to a radius R j , such that every client of C is covered, i.e. belongs to some open ball B(j, R j ), and the goal is to minimize the cost, namely, the sum over open facilities j of f j + R j .
The dynamic sum-of-radii clustering problem can actually be interpreted as a special case of dynamic set cover: our metric space is the universe, our clients are the elements, and for each center c and radius r, the ball B(c, r) defines a set of cost r consisting of those clients covered by the ball. The dynamic set cover algorithm from [19], specialized to our setting, maintains competitive ratio O(log n) and has update time O(f log n), where f is the maximum number of sets containing any element; [19] also gives an O(f 3 ) approximation in time O(f 2 ) for set cover, and for the related dynamic k-coverage problem, they give a constant approximation fully dynamic algorithm with O(f log n) update time.
The doubling dimension of a metric space (V, d) is said to be bounded by κ if any ball B(x, r) in (V, d) can be covered by 2 κ balls of radius r/2 [22]. For example, D-dimensional Euclidean space has doubling dimension Θ(D).
In this paper we give an algorithm for constant doubling dimension, that maintains a solution whose cost is within a constant factor of the optimal cost, and with logarithmic update times for arrival or departure events. The algorithm answers cost queries in constant time and solution queries in linear time in the size of the solution, up to a factor of log(W/f min ), where W is the diameter of the metric space and f min is the minimum opening cost of any facility (which is strictly positive without loss of generality, since facilities of cost 0 can remain open at all times). The universe is known ahead of time, and only the collection of active clients changes dynamically. Note that for the above mentioned algorithm by [19] f would be Θ(2 2κ log(W/f min )). The 2 6κ factor in the update time is due to fixed-radius nearest neighbor query that we solve using only the basic data structure already present in the algorithm. However, it can be improved -for example in case of finite metric spaces where a O(1) lookup table over the space is possible (e.g., graphs with the shortest path distance), the update time reduces to O(log(W/f min )) at the cost of additional preprocessing time. More generally, if the metric space allows to answer fixed-radius nearest neighbor queries in time O(d), then the update time becomes O(d log(W/f min )), at the cost of preprocessing time necessary to construct the oracle.
We show the following structural property for the sum-of-radii clustering problem (Theorem 8): there exists a collection Π of pairs j, r where j ∈ F and r is a non-negative integer, each with an associated area A(j, r) of V , and an abstract tree T over Π, with the following properties: 1. T has height O(log(W/f min )) and degree at most 2 4κ . 2. The collection A of areas is a laminar family, its laminar structure is given by T , and for each area, A(j, r) ⊆ B(j, 7 · 5 r ) 3. For any subset C of V , there exists a collection S of areas covering C and whose cost, Our algorithm has two phases. First, in the preprocessing phase (Section 2), the algorithm constructs Π, the laminar family of areas A and corresponding abstract tree T . Thanks to the last property above, it suffices to restrict attention to solutions that use only areas A(j, r) for coverage, with j, r ∈ Π. Second, in the dynamic phase (Section 3), while clients arrive and depart, the algorithm maintains an optimal set of pairs j, r of Π such that the corresponding areas A(j, r) cover all current clients. The hierarchical structure of T makes this simple, so that each update takes time proportional to the height times 2 6κ .
The main contribution of the paper is the definition of Π and the corresponding laminar family of areas A. The latter is reminiscent of the cover tree data structure of [5]. However, the cover tree is tailored to the nearest neighbor problem and its covers, to the best of our knowledge, lack the structural properties of areas that we need to prove the approximation factor for the sum-of-radii clustering problem. We expect that our new structure can be used for other clustering-type problems.

Discretization of radii
Given the set of clients C ⊆ V , let OP T denote the cost of an optimum solution for C.

Lemma 2.
For all C ⊆ V , there exists a solution such that every ball B(j, R) has f min ≤ f j ≤ R ≤ 5 · W , the radius R is an integer power of 5, and the cost is O(OP T ).
Proof. Consider the unknown optimal solution. If some ball is such that max(f j , R) > W then replace the entire solution by a ball centered at the facility of cost f min and of radius W . Else, for each ball B(j, R) of the optimal solution: if f j > R then increase the radius of the ball from R to f j Increase R to the smallest integer power of 5 that is greater than or equal to R. The new solution satisfies the desired constraints, and the cost has increased by a factor of 10 at most.
A logradius is an integer r such that f min ≤ 5 r ≤ 5 · W . Let ρ min = log 5 f min and ρ max = log 5 W . Then the number of different logradii, ρ max − ρ min + 1, is O(log(W/f min )).

Maximal subsets of distant facilities
We construct a set Π of pairs j, r where j is a facility and r is a logradius, satisfying the following properties: 1. (Covering) For every facility j ∈ J and every logradius r such that f j ≤ 5 r , there exists a facility j ∈ J r with d(j, j ) ≤ 5 r+1 and j , r ∈ Π. 2. (Separating) For all distinct j , r , j , r ∈ Π, we have d(j, j ) > 5 r+1 .
For each logradius r ∈ [ρ min , ρ max ]: let J r = {j ∈ F | f j ≤ 5 r }. let J r be a maximal subset of J r such that any two facilities in J r are at distance greater than 5 r+1 .
Note that for r = ρ max , the set J r contains just one facility.

Hierarchical decomposition of Π
Construct an abstract tree T over Π as follows (with ties broken arbitrarily): the root of T is the unique pair j, ρ max . for all r < ρ max and j ∈ J r : let j be the facility of J r+1 closest to j parent(j, r) ← j , r + 1 By construction, T has height at most ρ max − ρ min + 1 and the parent of a pair j, r is a pair of the form j , r + 1 .
The following Lemma is simple, but it captures the essential way in which using larger balls will greatly simplify the structure, and is the main step towards constructing a laminar set of areas for covering clients.

Lemma 4. For any point p and radius r, the set of pairs
has at most 2 (α+1)κ elements, where κ is the doubling dimension of the metric space.
Proof. By definition of doubling dimension, B(p, 2 α · 5 r+1 ) can be covered by a set of at most (2 κ ) α+1 balls of radius (1/2) · 5 r+1 . By the Separating property of Π, any two pairs j, r of Π(p, r) are at distance greater than 5 r+1 from each other, hence must belong to different balls of the set, and so Π(p, r) has cardinality at most (2 κ ) α+1 .

Lemma 5. A node j, r of T has at most 2 4κ children
Proof. Children of j, r have logradius r − 1, so by the Covering property of Π their distance to j is at most 5 r+1 , so they belong to Π(j, r − 1) for α = 3, and so Lemma 4 applies.

Hierarchical decomposition of V into a laminar family of areas
Recall that a collection A of sets is laminar if for any two A, B ∈ A, either A ∩ B = ∅ or A ⊆ B or B ⊆ A. We partition V into a laminar family of areas, denoted by A, such that no two same-logradius areas overlap.
For each j, r ∈ Π, initialize A(j, r) ← ∅. For each point p ∈ V : let r * be minimum such that there exists pairs j, r * with p ∈ B(j, 7 · 5 r * ). Among all such pairs, let j * , r * denote the one minimizing d(p, j * ). Add p to the set A(j * , r * ) and to every set A(j , r ) with (j , r ) ancestor of (j * , r * ) in T .

Lemma 6.
For every j, r ∈ Π, A(j, r) ⊆ B(j, 7 · 5 r ). A(j, r). Either it's been added directly, in which case it belongs to B(j, 7 · 5 r ), or it's been inherited, in which case it also belongs to it by Lemma 3.

Lemma 7.
For every subset C ⊆ V of clients there exists S ⊆ Π such that C is covered by ∪{A(j, r) : j, r ∈ S} and j,r ∈S (f j + 7 · 5 r ) = O(2 2κ · OP T ).
Proof. Let S * be a solution of cost O(OP T ) satisfying the properties of Lemma 2. For each ball B(j, 5 r ) of S * , put in S all the pairs j , r ∈ Π such that d(j, j ) ≤ 8 · 5 r . We claim that C is covered by ∪{A(j , r) : j , r ∈ S}. Indeed, consider a client p ∈ C and a ball B(j, 5 r ) of S * containing p. By the Covering property of Π, there exists j , r ∈ Π with d(j, j ) ≤ 5 r+1 . Then d(p, j ) ≤ 5 r+1 + 5 r < 7 · 5 r , and so in the definition of areas covering p we must have r * ≤ r. Along the path from j * , r * to the root of T , there exists a pair for logradius r, j , r . By definition of areas and by Lemma 6, p ∈ A(j , r) ⊆ B(j , 7 · 5 r ), so d(j, j ) ≤ d(j, p) + d(p, j ) ≤ 8 · 5 r , and therefore j , r ∈ S and p is covered.
In terms of costs, since all these areas are associated to pairs within distance 8·5 r < 2·5 r+1 from j, by Lemma 4 for α = 1, there are at most 2 2κ of them.
By Lemma 6 and the definition of areas, we note that ∪ Jr A(j, r) = ∪ Jr B(j, 7 · 5 r ), so we also give hereafter an equivalent description of the same laminar family, illustrating the way in which the parent-child relations in tree T and the proximity relations in the metric space are balanced against one another. (This also has the advantage of being constructive even if V is infinite). Partition ∪ Jρ min B(j, 7 · 5 ρmin ), using the facilities of J ρmin as centers, into Voronoi cells A(j, ρ min ). For r ∈ (ρ min , ρ max ]: Partition ∪ j∈Jr B(j, 7 · 5 r ) \ ∪ j∈Jr−1 B(j, 7 · 5 r−1 ), using the facilities of J r as centers, into Voronoi cells A(j, r).
The construction of this section can be summarized in the following structural Theorem.
Theorem 8. Let a metric space (V, d) of doubling dimension κ be given, as well as a subset F of elements of V called facilities, with an associated cost f j for each j ∈ F . Then there exists an abstract tree T whose nodes are indexed by facilities j ∈ F and non-negative integers r ≥ 0, and, for each node j, r , an associated area A(j, r) ⊆ V with the following properties 1. T has height O(log(W/f min )) and degree at most 2 4κ , and for each j, r ∈ T and its parent node j , r + 1 , B(j, 7 · 5 r ) ⊆ B(j , 7 · 5 r+1 ).

2.
A is a laminar family, its laminar structure is given by T , and for each area A(j, r), A(j, r) ⊆ B(j, 7 · 5 r )

3.
For any subset C of V , for any collection of balls B centered at facilities of F and covering C, there exists a collection S of areas covering C, such that j,r ∈S f j + 7 · 5 r = O(2 2κ ) B(j,R)∈B (f j + R).

3
Data structure

Solving the offline restricted problem
Given C ⊆ V , we wish to compute the solution of minimum cost among all solutions that are restricted to covering C using areas A(j, r) for j, r ∈ Π, where using area A(j, r) has cost f j + c 2 · 5 r . We call that the restricted problem. By Theorem 8 the optimal restricted cost is a O(2 2κ ) approximation of the optimal (unrestricted) cost.
Computing the optimal solution to the restricted problem in an offline manner is straightforward, thanks to the laminar structure of the candidate areas. We first compute, for each node j, r of T , the cost c j,r = f j + c 2 · 5 r of area A(j, r), as well as the number n j,r of clients that are in area A(j, r) but not in any of the areas of children nodes: since areas A(j , r−1) are all disjoint by laminarity, we have n j,r = |C ∩A(j, r)|−∪ j ,r−1 :parent(j ,r−1)=j |C ∩A(j , r −1)|. We then compute the optimal cost x j,r of covering the clients of C ∩ A(j, r) using only areas of the subtree of T rooted at j, r , using the following bottom-up recurrence: For j, r ∈ Π in bottom-up order in T : Indeed, if n j,r = 0 then the solution must use area A(j, r); but then by laminarity area A(j, r) covers all clients in that subtree, so no other area is needed in the solution, and the cost is exactly the cost c j,r of A(j, r). If on the other hand n j,r = 0, then we have an alternative possibility: we could do without A(j, r). Then, by disjointness of sibling areas the problem separates into independent subproblems, one for each child of j, r , hence the recurrence simply sums their costs.
The cost of the optimal restricted solution is then x j,ρmax for the root j, ρ max of T . Given c j,r and x j,r , computing the optimal restricted solution, a collection S of areas, is done recursively: Thus the algorithm to compute the optimal set S of areas covering C in the restricted problem, given the values of c j,r , x j,r explores a tree T that, as it is a partial subtree of T , also has height at most O(log(W/f min )) and degree at most 2 4κ ; moreover its internal nodes are all ancestors of areas added to the solution S, so the running time to compute S itself is O(2 4κ log(W/f min )|S|).

The dynamic data structure
The dynamic data structure supports insertions of clients, deletions of clients, queries for the cost of the optimal restricted solution, and queries for the set of open facilities and areas of the optimal restricted solution.
The algorithm will maintain two dynamic data structures: 1. a list of the currently existing clients C ⊆ V , with, for each client p, the j, r ∈ Π such that p ∈ A(j, r) and r is minimum; and 2. an annotated dependency tree T A , keeping for each node v = j, r the following additional information: a. its cost c v = f j + 7 · 5 r , b. the number n v of currently existing clients that belong to A(j, r) but not to any descendant area, c. the value x v , which is the minimum cost needed to cover all clients belonging to A(j, r) using only areas A(j , r ) for j , r ∈ Π, and d. the value y v = u child of v x u .
To initialize the data structures, from the preprocessing phase the algorithm is given the set Π of pairs j, r , as well as the laminar family of areas A with its dependency tree T using the following representation, which can be easily computed in time linear in its size: (1) An array of size ρ max − ρ min + 1, keeping for each logradius r ∈ [ρ min , ρ max ] a list of all the facilities of J r , and (2) An annotated tree data structure obtained from T by setting every n v , x v , y v equal to 0, and c j,r = f j + 7 · 5 r . The initial set of clients is C = ∅.
Answering queries is done as in Section 3.1. We next describe the client deletions. When a client p is deleted, we start from j, r in T A , such that p ∈ A(j, r) and r is minimum; we decrement n v and we traverse the path from j, r up to the root of T A , updating x v and y parent(v) for every node visited along the way using the recurrence from Section 3.1. This takes time proportional to the height of the tree, O(log(W/f min )). Similarly, when a client p is inserted, we first find j, r in T A , such that p ∈ A(j, r) and r is minimum, in a way to be described shortly; we increment n v , and then we traverse the path from j, r up to the root of T A , similarly updating x v and y parent(v) .
Thus, it only remains to determine the pair j * , r * with smallest logradius such that p ∈ A(j * , r * ). By Lemma 6, p ∈ B(j * , 7 · 5 r * ). Thus we will first find all pairs j, r such that p ∈ B(j, 7 · 5 r ), based on them determine r * , and then look for j * , r * in that set of balls. Thanks to Lemma 3, the first part can be done using a simple recursive algorithm starting from the root of T A (see below). The second part simply uses the definition of areas, i.e., it finds the pair j * , r * where j * has with minimum distance to p out of all pairs j, r * withp ∈ B(j, 7 · 5 r * ).
let r * be minimum such that there exists pairs j, r * in the set Pairs(p, j root , ρ max ). Among all such pairs, output the pair j * , r * minimizing d(p, j * ).
The running time is dominated by the first part, which is O(2 4κ ) times the number of pairs j, r such that p ∈ B(j, 7 · 5 r ). There are log(W/f min ) possible values of r. For each r, by Lemma 4 there are at most 2 2κ pairs j, r ∈ Π such that p ∈ B(j, c 2 · 5 r ) and the algorithm has to test the O(2 4κ ) children of each of them. Thus the running time to do an insertion is O(2 6κ log(W/f min )).