
Facility Location on High-Dimensional Euclidean Spaces

Euiwoong Lee (University of Michigan, Ann Arbor, MI, USA) and Kijun Shin (Seoul National University, South Korea)
Abstract

Recent years have seen great progress in the approximability of fundamental clustering and facility location problems on high-dimensional Euclidean spaces, including k-Means and k-Median. While they admit strictly better approximation ratios than their general metric versions, their approximation ratios are still higher than the hardness ratios for general metrics, leaving the possibility that the ultimate optimal approximation ratios will be the same between Euclidean and general metrics. Moreover, such an improved algorithm for Euclidean spaces is not known for Uncapacitated Facility Location (UFL), another fundamental problem in the area.

In this paper, we prove that for any $\gamma \ge 1.6774$ there exists $\varepsilon>0$ such that Euclidean UFL admits a $(\gamma,\,1+2e^{-\gamma}-\varepsilon)$-bifactor approximation algorithm, improving the result of Byrka and Aardal [3]. Together with the $(\gamma,\,1+2e^{-\gamma})$ NP-hardness in general metrics, this shows the first separation between general and Euclidean metrics for the aforementioned basic problems. We also present an $(\alpha_{\mathrm{Li}}-\varepsilon)$-(unifactor) approximation algorithm for UFL for some $\varepsilon>0$ in Euclidean spaces, where $\alpha_{\mathrm{Li}}\approx 1.488$ is the best-known approximation ratio for UFL by Li [15].

Keywords and phrases:
Approximation Algorithms, Clustering, Facility Location
Funding:
Euiwoong Lee: Supported by NSF grant CCF-2236669 and a gift from Google.
Copyright and License:
© Euiwoong Lee and Kijun Shin; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Facility location and clustering
Related Version:
Full Version: https://arxiv.org/abs/2501.18105 [14]
Editor:
Raghu Meka

1 Introduction

The (metric) Uncapacitated Facility Location (UFL) problem is one of the most fundamental problems in computer science and operations research. The input consists of a metric space $(X,d)$, a set of facility locations $\mathcal F\subseteq X$, a set of clients $\mathcal C\subseteq X$, as well as facility opening costs $\{f_i\}_{i\in\mathcal F}$. The goal is to open a subset of facilities $S\subseteq\mathcal F$ to minimize the sum of the opening cost ($\sum_{i\in S}f_i$) and the connection cost ($\sum_{j\in\mathcal C}\min_{i\in S}d(i,j)$). After intensive research efforts over the years [11, 12, 16, 15], the best approximation ratio is 1.488 [15] and the best hardness ratio is 1.463 [11].

As the objective function is the sum of two heterogeneous terms, the opening cost and the connection cost, the natural notion of bifactor approximation has been actively studied as well. Formally, given an instance of UFL, a solution $S$ is called a $(\lambda_f,\lambda_c)$-approximation for some $\lambda_f,\lambda_c\ge1$ if, for any solution $T$, the total cost of $S$ is at most $\lambda_fF+\lambda_cC$, where $F$ and $C$ denote the opening and connection cost of $T$ respectively. In particular, the case $\lambda_f=1$, also known as a $\lambda_c$-Lagrangian Multiplier Preserving (LMP) approximation, has been actively studied due to its connection to another fundamental clustering problem, $k$-Median. There is a $(2-\varepsilon)$-LMP approximation for some $\varepsilon>2\times10^{-7}$ [8], and any $\lambda_c$-LMP approximation for UFL can be translated to a $(1.307\,\lambda_c)$-approximation for $k$-Median [9].

Generalizing the hardness of Guha and Khuller [11], Jain, Mahdian and Saberi [12] proved that no polynomial-time $(\lambda_f,\lambda_c)$-approximation algorithm exists for $\lambda_c<1+2e^{-\lambda_f}$ unless $\mathbf{P}=\mathbf{NP}$. (Guha and Khuller's hardness ratio $\gamma\approx1.463$ is exactly the solution of $\gamma=1+2e^{-\gamma}$.) While the optimal value of $\lambda_c$ is not known for small values of $\lambda_f$, Byrka and Aardal gave an algorithm that achieves a $(\lambda_f,\,1+2e^{-\lambda_f})$-approximation for any $\lambda_f\ge1.6774$ [3].

Euclidean spaces are arguably the most natural metric spaces for facility location and clustering problems. Formally, Euclidean UFL is the special case of UFL where the underlying metric is $(\mathbb{R}^k,\|\cdot\|_2)$ for some dimension $k$. When $k=O(1)$, this problem admits a PTAS [5], while the problem remains APX-hard when $k$ is part of the input [7]. (While the cited paper only studies $k$-Median and $k$-Means, the soundness analysis in their Theorem 4.1 (of the arXiv version) can be directly extended to any number of open facilities $k$, implying APX-hardness of Euclidean UFL.)

Recent years have seen active studies of the related $k$-Means and $k$-Median on high-dimensional Euclidean spaces [1, 10, 4], so that the best-known approximation ratios for them are 5.912 and 2.406 respectively. While these are strictly lower than the best-known approximation ratios for general metric spaces (which are 9 and 2.613), they are still larger than the best-known hardness ratios for general metrics (which are $1+8/e\approx3.943$ and $1+2/e\approx1.736$) [12], which means that it is still plausible that the optimal approximation ratios for $k$-Means and $k$-Median are the same between Euclidean metrics and general metrics.

Our first result is the first strict separation between Euclidean and general metric spaces for UFL. In particular, we show that Euclidean UFL admits a $(1.6774,\,1+2e^{-1.6774}-\varepsilon)$-approximation for some universal constant $\varepsilon>0$, which is NP-hard to achieve in general metrics.

Theorem 1.

There exists a $(1.6774,\,1+2e^{-1.6774}-\varepsilon)$-approximation algorithm for Euclidean UFL for some $\varepsilon\ge3\times10^{-42}$.

By the scaling result of Mahdian et al. [16], Theorem 1 implies a $(\gamma,\,1+2e^{-\gamma}-\varepsilon\,e^{1.6774-\gamma})$-approximation for any $\gamma\ge1.6774$. Using this result, we are able to slightly improve on the best-known unifactor approximation ratio $\alpha_{\mathrm{Li}}\approx1.488$ of Li [15].

Theorem 2.

There exists an $(\alpha_{\mathrm{Li}}-\varepsilon)$-approximation algorithm for Euclidean UFL for some $\varepsilon\ge2\times10^{-45}$.

Recent years also have seen great progress on hardness of approximation for clustering problems in high-dimensional Euclidean spaces, including Euclidean k-Means and k-Median [6, 7]. We show that similar techniques extend to UFL as well, proving the APX-hardness.

Theorem 3.

Euclidean UFL is APX-hard.

2 High-level Plan

Our work is based on the framework of Byrka and Aardal [3], who achieved an optimal $(\lambda_f,\,1+2e^{-\lambda_f})$-bifactor approximation for $\lambda_f\ge1.6774$ in general metrics. We first review their framework. It is based on the following standard linear programming (LP) relaxation:

Minimize $\sum_{i\in\mathcal F}\sum_{j\in\mathcal C}d(i,j)\,x_{ij}+\sum_{i\in\mathcal F}f_i\,y_i$
subject to $\sum_{i\in\mathcal F}x_{ij}=1$ for all $j\in\mathcal C$,
$x_{ij}\le y_i$ for all $i\in\mathcal F,\ j\in\mathcal C$,
$x_{ij},y_i\in[0,1]$ for all $i\in\mathcal F,\ j\in\mathcal C$.

The dual formulation is as follows:

Maximize $\sum_{j\in\mathcal C}v_j$
subject to $\sum_{j\in\mathcal C}w_{ij}\le f_i$ for all $i\in\mathcal F$,
$v_j-w_{ij}\le d(i,j)$ for all $i\in\mathcal F,\ j\in\mathcal C$,
$w_{ij}\ge0$ for all $i\in\mathcal F,\ j\in\mathcal C$.
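To make the relaxation concrete, the following minimal sketch builds and solves the primal LP with scipy.optimize.linprog on a hypothetical toy instance (the distance matrix d and opening costs f are made up for illustration, and the flat variable ordering is an implementation choice, not something fixed by the paper). A solver that exposes duals would similarly yield the dual solution $(v,w)$.

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance (hypothetical): 3 facilities, 4 clients.
d = np.array([[1.0, 2.0, 3.0, 2.0],
              [2.0, 1.0, 2.0, 3.0],
              [3.0, 2.0, 1.0, 1.0]])   # d[i][j] = distance(facility i, client j)
f = np.array([2.0, 3.0, 2.5])          # facility opening costs
nf, nc = d.shape

# Variables: x[i][j] (flattened, nf*nc entries) followed by y[i] (nf entries).
c = np.concatenate([d.flatten(), f])

# Equality constraints: sum_i x[i][j] = 1 for every client j.
A_eq = np.zeros((nc, nf * nc + nf))
for j in range(nc):
    A_eq[j, [i * nc + j for i in range(nf)]] = 1.0
b_eq = np.ones(nc)

# Inequality constraints: x[i][j] - y[i] <= 0.
A_ub = np.zeros((nf * nc, nf * nc + nf))
for i in range(nf):
    for j in range(nc):
        A_ub[i * nc + j, i * nc + j] = 1.0
        A_ub[i * nc + j, nf * nc + i] = -1.0
b_ub = np.zeros(nf * nc)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0.0, 1.0)] * (nf * nc + nf))
x = res.x[:nf * nc].reshape(nf, nc)
y = res.x[nf * nc:]
print("LP value:", res.fun)  # = C + F of the fractional optimum
```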

A feasible solution $(x,y)$ induces the support graph, which is defined as the bipartite graph $G=((\mathcal F,\mathcal C),E)$ where nodes $i\in\mathcal F$ and $j\in\mathcal C$ are adjacent iff the corresponding LP variable $x_{ij}>0$. Two clients $j,j'\in\mathcal C$ are considered neighbors in $G$ if they are adjacent to a common facility.

Let $(x,y)$ be a fixed optimal solution to the primal program. The overall cost is divided into the facility cost $F=\sum_{i\in\mathcal F}f_iy_i$ and the connection cost $C=\sum_{i\in\mathcal F}\sum_{j\in\mathcal C}d(i,j)x_{ij}$. Our goal is to round this solution to obtain a solution $S$ whose total cost is at most $\lambda_fF+\lambda_cC$; it is well known that this implies the $(\lambda_f,\lambda_c)$-approximation defined in the introduction by scaling [3], so for the rest of the paper we redefine the $(\lambda_f,\lambda_c)$-approximation so that $S$ is $(\lambda_f,\lambda_c)$-approximate if its total cost is at most $\lambda_fF+\lambda_cC$.

The opening cost and connection cost of individual clients can be further divided using the optimal LP dual solution $(v,w)$. For each client $j\in\mathcal C$, the fractional connection cost is given by $C^*_j=\sum_{i\in\mathcal F}d(i,j)\,x_{ij}$, and the fractional facility cost is computed as $F_j=v_j-C^*_j$. (The close and distant facility sets $\mathcal C_j$, $\mathcal D_j$ and the average distance $d(j,\cdot)$ used below are defined in the next paragraphs.) The irregularity of the facilities surrounding $j\in\mathcal C$ is defined by

$r_\gamma(j)=\dfrac{d(j,\mathcal D_j)-d(j,\mathcal D_j\cup\mathcal C_j)}{F_j}.$

Similarly,

$r'_\gamma(j)=(\gamma-1)\,r_\gamma(j)=\dfrac{d(j,\mathcal D_j\cup\mathcal C_j)-d(j,\mathcal C_j)}{F_j}.$

If $\sum_{i\in\mathcal V}y_i=0$, we set $d(j,\mathcal V)=0$. Similarly, when $F_j=0$, we define $r_\gamma(j)=0$ and $r'_\gamma(j)=0$. According to these definitions, the following conditions hold: the irregularity satisfies $0\le r_\gamma(j)\le1$, the average distance to a close facility is $C_j=d(j,\mathcal C_j)=C^*_j-r'_\gamma(j)\,F_j$, and the average distance to a distant facility is $D_j=d(j,\mathcal D_j)=C^*_j+r_\gamma(j)\,F_j$. The maximum distance to a close facility is bounded as $M_j\le D_j$.

Clustering of [3].

At a high level, the clustering operates on the support graph $G=((\mathcal F,\mathcal C),E)$. For each $c\in\mathcal C$, let $N_c:=\{c'\in\mathcal C:\exists f\in\mathcal F\text{ such that }(c,f),(c',f)\in E\}$ be the set of neighbors of $c$. The clustering algorithm iteratively selects some client $c$ as a cluster center, puts all its neighbors into the cluster, and proceeds with the remaining clients. Eventually, every client is assigned to exactly one of these clusters. After this, for each cluster, exactly one facility adjacent to the cluster center is opened. This ensures that every client is connected to a facility that is not too far from it. Therefore, the criteria for choosing cluster centers and opening facilities determine the quality of the solution.

Starting with a fractional solution $(x,y)$ of the LP and a parameter $\gamma\in(1,2)$, [3] constructed the facility-augmented solution $(\bar x,\bar y)$, where each $y_i$ value is multiplied by $\gamma$ and each client $j\in\mathcal C$ reconfigures its $x_{ij}$ values so as to be fractionally connected to facilities as close as possible. (E.g., $\bar x_{ij}>0$ implies $x_{ij}>0$, but not vice versa.) With some postprocessing, one can also assume that $\bar x_{ij}\in\{0,\bar y_i\}$ for every $i\in\mathcal F$, $j\in\mathcal C$. Then one can categorize every facility near $j\in\mathcal C$ into two types: close facilities $\mathcal C_j=\{i\mid\bar x_{ij}>0\}$ and distant facilities $\mathcal D_j=\{i\mid\bar x_{ij}=0\text{ and }x_{ij}>0\}$. This implies that as $\gamma$ increases, the clusters become smaller, and more facilities are opened.

Let the average distance from $j\in\mathcal C$ to a set of facilities $\mathcal V\subseteq\mathcal F$ be defined as $d(j,\mathcal V)=\big(\sum_{i\in\mathcal V}d(i,j)\,y_i\big)/\big(\sum_{i\in\mathcal V}y_i\big)$. Then let $C_j:=d(j,\mathcal C_j)$, $M_j:=\max_{i\in\mathcal C_j}d(j,i)$, and $D_j:=d(j,\mathcal D_j)$. We have $C_j\le M_j\le D_j$.
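These quantities are easy to compute from a fractional solution. The sketch below is a simplified reading of the construction: it assumes the solution is already complete ($x_{ij}\in\{0,y_i\}$) and ignores the postprocessing that splits facilities so that the $\bar y$-mass of $\mathcal C_j$ is exactly 1; the helper name and signature are ours, not the paper's.

```python
import numpy as np

def close_distant_stats(d_j, y, supp, gamma):
    """For one client j: d_j[i] = distance to facility i, y = LP opening values,
    supp = indices with x_ij > 0.  Returns (C_j, M_j, D_j) under the scaled
    solution y_bar = gamma * y.  Assumes masses split exactly (a simplification)."""
    order = sorted(supp, key=lambda i: d_j[i])      # facilities by distance
    close, mass = [], 0.0
    for i in order:                                 # take closest until y_bar-mass 1
        close.append(i)
        mass += gamma * y[i]
        if mass >= 1.0 - 1e-12:
            break
    distant = [i for i in order if i not in close]
    w_close = np.array([gamma * y[i] for i in close])
    C_j = float(np.dot(w_close, [d_j[i] for i in close]) / w_close.sum())
    M_j = max(d_j[i] for i in close)
    # Convention from the text: the average distance to an empty set is 0.
    D_j = (float(np.dot([y[i] for i in distant], [d_j[i] for i in distant])
                 / sum(y[i] for i in distant)) if distant else 0.0)
    return C_j, M_j, D_j
```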

At this point, the support graph is defined by the $(\bar x,\bar y)$ solution. Intuitively, we repeatedly choose the unclustered client $j$ with the smallest $C_j+M_j$ as a new cluster center. Given this clustering, the standard randomized rounding procedure is as follows:

1. For each cluster center $j$, choose exactly one facility from its neighboring facility set $\{i:(i,j)\in E\}$ according to the $\bar y$ values. (Recall that the sum of these values is exactly 1.)

2. For any facility $i$ that is not adjacent to any cluster center in $G$, independently open $i$ with probability $\bar y_i$.

Algorithm 1 greedy: [3]’s clustering algorithm.
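Putting the pieces together, a hedged sketch of the clustering-plus-rounding pipeline might look as follows; the data structures (neighbors, center_facilities, CM) are hypothetical inputs standing in for the $(\bar x,\bar y)$ support, and tie-breaking and postprocessing details are omitted.

```python
import random

def cluster_and_round(clients, neighbors, center_facilities, y_bar, CM):
    """clients: client ids; neighbors[j]: set of clients sharing a facility with j
    in the (x_bar, y_bar) support; center_facilities[j]: {facility: y_bar value}
    for facilities adjacent to j (values sum to 1); CM[j] = C_j + M_j."""
    unclustered, centers, covered = set(clients), [], {}
    while unclustered:
        j = min(unclustered, key=lambda c: CM[c])        # greedy center choice
        centers.append(j)
        cluster = (neighbors[j] | {j}) & unclustered
        for c in cluster:
            covered[c] = j
        unclustered -= cluster
    open_facilities, touched = set(), set()
    for j in centers:                                    # step 1: one facility per center
        facs, probs = zip(*center_facilities[j].items())
        open_facilities.add(random.choices(facs, weights=probs)[0])
        touched |= set(facs)
    for i, yi in y_bar.items():                          # step 2: independent rounding
        if i not in touched and random.random() < yi:
            open_facilities.add(i)
    return open_facilities, covered
```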

Let us consider one client j𝒞 and see how its expected connection cost can be bounded under the above randomized rounding. Byrka and Aardal [3] proved the following properties.

• The probability that at least one facility in $\mathcal C_j$ is opened is at least $1-e^{-1}$.

• The probability that at least one facility in $\mathcal C_j\cup\mathcal D_j$ is opened is at least $1-e^{-\gamma}$.

• Let client $j'\in\mathcal C$ be a neighbor of $j$ in $G$. Then, either $\mathcal C_{j'}\setminus(\mathcal C_j\cup\mathcal D_j)=\emptyset$ or the rerouting cost satisfies $d(j,\mathcal C_{j'}\setminus(\mathcal C_j\cup\mathcal D_j))\le d(j,j')+d(j',\mathcal C_{j'}\setminus(\mathcal C_j\cup\mathcal D_j))\le D_j+C_{j'}+M_{j'}$. In particular, when $j'$ is the cluster center of $j$, this is at most $C_j+M_j+D_j$. (Li [15] refined this bound to $C_j+(3-\gamma)M_j+(\gamma-1)D_j$.)

Then, one can (at least informally) expect that the expected connection cost of $j$ is at most $(1-e^{-1})C_j+(e^{-1}-e^{-\gamma})D_j+e^{-\gamma}(D_j+C_j+M_j)$. It turns out that setting $\gamma\approx1.6774$ (the solution of $e^{-1}+e^{-\gamma}-(\gamma-1)(1-e^{-1}+e^{-\gamma})=0$) ensures that this value is at most $(1+2e^{-\gamma})C_j$, proving their $(\gamma,\,1+2e^{-\gamma})$-bifactor approximation. (See Section 6 for the formal treatment of their analysis as well as our improvement.)
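The threshold $\gamma\approx1.6774$ is easy to verify numerically; a few lines of bisection on the displayed equation, which is positive at $\gamma=1$ and negative at $\gamma=2$, suffice.

```python
import math

def g(gamma):
    # e^{-1} + e^{-gamma} - (gamma - 1) * (1 - e^{-1} + e^{-gamma})
    return (math.exp(-1) + math.exp(-gamma)
            - (gamma - 1) * (1 - math.exp(-1) + math.exp(-gamma)))

lo, hi = 1.0, 2.0
for _ in range(60):                 # bisection: g is decreasing on [1, 2]
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
print(round(lo, 4))                 # -> 1.6774
```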

Exploit the Geometry of Euclidean Spaces.

In order to strictly improve the approximation ratio, it is natural to attempt to find a cluster $N$ and its center $j$ for which the above inequality holds with some additional slack. Let $\mathrm{cost}_j(j')=d(j',\mathcal C_j\setminus(\mathcal C_{j'}\cup\mathcal D_{j'}))$. Intuitively, our goal is to find a cluster $N\subseteq\mathcal C$ with center $j$ such that

$\sum_{j'\in N}\mathrm{cost}_j(j')\ \le\ \sum_{j'\in N}\big((1-\varepsilon_1)C_{j'}+(3-\gamma)M_{j'}+(\gamma-1)D_{j'}\big).\qquad(1)$

The only requirement from the rounding algorithm is that $N_j\subseteq N$. Compared to [3]'s clustering, we want to shave $\varepsilon_1C_{j'}$ on average.

Let us consider the very special case where $C_j=M_j=D_j=1$ for every $j\in\mathcal C$; every facility serving $j$ in the original LP solution $(x,y)$ is at the same distance from $j$. Let $j\in\mathcal C$ be a cluster center and $j'\in\mathcal C$ be in the cluster of $j$. Then, a simple 3-hop triangle inequality (just using $\mathcal C_j\cap\mathcal C_{j'}\neq\emptyset$) ensures that $\mathrm{cost}_j(j')\le3$, and our goal is to improve it to $3-\varepsilon_1$. If $\mathrm{cost}_j(j')>3-\varepsilon_1$, how should the instance look around $j$?

It turns out that the instance around $j$ must exhibit a very specific structure in order for the 3-hop triangle inequality to be tight for almost every neighbor $j'\in N_j$. We must have almost every $j'\in N_j$ located around almost the same point at distance 2 from $j$, where almost all facility neighbors of $j$ are at the opposite end of the line connecting $N_j$ and $j$. See Figure 1 for an example. Intuitively, the existence of such a dense region of clients suggests that if we let a client $j''$ in the region be a new center, many of the 3-hop triangle inequalities cannot be tight, which implies an average rerouting cost $\mathrm{cost}_{j''}(j')\le3-\varepsilon_1$. If $j''$ is again problematic, we can repeat this procedure over and over.

Figure 1: A simple case where $\mathrm{cost}_j(j')\approx3$. There is a client-dense region on the left.

However, if we relax the condition to $C_j=M_j=1\le D_j$, certain exceptions begin to emerge. One possible scenario is as follows: since $\mathrm{cost}_j(j')=d(j',\mathcal C_j\setminus(\mathcal C_{j'}\cup\mathcal D_{j'}))$ captures the rerouting of $j'$ to $j$'s close facilities $\mathcal C_j$ except $j'$'s own facilities $\mathcal C_{j'}\cup\mathcal D_{j'}$, if $\mathcal C_{j'}\cup\mathcal D_{j'}$ is large enough to exclude the facilities of $\mathcal C_j$, then $\mathrm{cost}_j(j')$ might not behave as expected. However, a large volume of $\mathcal C_{j'}\cup\mathcal D_{j'}$ implies a low $C_{j'}/F_{j'}$ ratio. If the $C/F$ ratio is sufficiently low, then this facility-dominant instance is actually easier to handle with a completely different algorithm, the JMS algorithm [12], which is known to be a $(1.11,1.7764)$-approximation algorithm.

Therefore, from now on, assume that the cluster centered at $j$ is connection-dominant. More precisely, assume that for any neighbor $j'$ of the cluster center $j$, the set $\mathcal C_{j'}\cup\mathcal D_{j'}$ cannot cover half of the ball of radius 1 centered at $j$. At this point, we can finally assert that it is impossible to avoid the formation of a dense region of clients.

Assume towards contradiction that there is no dense region, and consider $j'$ in $j$'s cluster such that $\mathrm{cost}_j(j')>3-\varepsilon_1$. Almost all facilities of $j$ must be placed in one of two locations: on the opposite side of $j$ from $j'$, or within $\mathcal C_{j'}\cup\mathcal D_{j'}$. Since there is no dense region, there must be a neighbor $k$ of $j$ such that $\mathrm{cost}_j(k)>3-\varepsilon_1$ and $k$ is located in a different direction from $j'$. This implies that the facilities positioned on the opposite side of $j$ from $j'$ now help reduce $\mathrm{cost}_j(k)$, forcing them to be in $\mathcal C_k\cup\mathcal D_k$; since we assumed that $\mathcal C_k\cup\mathcal D_k$ cannot cover half of the unit ball around $j$, the angle $\angle j'jk$ must be strictly greater than $\pi/2$! Ultimately, this process can be reduced to the following situation: placing unit vectors on a unit sphere with every pairwise angle greater than $\pi/2+\varepsilon'$ for some constant $\varepsilon'>0$. It is well known in Euclidean geometry that there is an upper bound $f(\varepsilon')$ on the number of such vectors, and such an upper bound shows that one of the regions around $j'$ (or $k$) we considered must have been dense. See Figure 2 for an example.

Figure 2: $\mathcal C_k\cup\mathcal D_k$ contains the facilities positioned on the opposite side of $j$ from $j'$.

However, there are several technical barriers to extending this idea to the general case without restrictions on $C_j$, $M_j$, and $D_j$. In Section 3, we introduce these barriers and formalize the above concept. We also propose sufficient conditions for satisfying (1). From Section 3.1 to Section 5, we demonstrate how to remove these conditions, leaving only the connection-dominance assumption. In Section 6, we propose and analyze the full algorithm, which achieves an improved bifactor approximation.

3 Finding Good Center via Geometry

In this section, we exploit the geometry of Euclidean spaces to prove the existence of a cluster center strictly better than the greedy choice of [3] under certain conditions (Theorem 10). We first define several concepts and explain their motivation, including a sketch of our algorithm.

Recall that our goal is to find a cluster that satisfies (1). In all the following propositions, $\gamma$ is a fixed value in the range $\gamma\in(1.6,2)$.

Definition 4.

Suppose $j\in\mathcal C$ is a cluster center. Let $N_j$ be the set of neighbors of $j$, and let $\varepsilon_1=10^{-12}$. Additionally, define two more sets:

$N_j^-=\{j'\in N_j\mid\mathrm{cost}_j(j')>(1-\varepsilon_1)C_{j'}+(3-\gamma)M_{j'}+(\gamma-1)D_{j'}\}$
$N_j^+=\{j'\in\mathcal C\mid\mathrm{cost}_j(j')\le(1-\varepsilon_1)C_{j'}+(3-\gamma)M_{j'}+(\gamma-1)D_{j'}\}$

Moreover, the Saving and Spending of center $j$ are defined as

$\mathrm{Saving}(j)=\sum_{j'\in N_j^+}\big\{\big((1-\varepsilon_1)C_{j'}+(3-\gamma)M_{j'}+(\gamma-1)D_{j'}\big)-\mathrm{cost}_j(j')\big\},$
$\mathrm{Spending}(j)=\sum_{j'\in N_j^-}\big\{\mathrm{cost}_j(j')-\big((1-\varepsilon_1)C_{j'}+(3-\gamma)M_{j'}+(\gamma-1)D_{j'}\big)\big\}.$

With the goal (1) in mind, $N_j^+$ (resp. $N_j^-$) contains clients $j'$ that meet (resp. do not meet) this goal, and $\mathrm{Saving}(j)$ (resp. $\mathrm{Spending}(j)$) indicates how much $\mathrm{cost}_j(j')$ falls below (resp. exceeds) this goal in aggregate. Note that $N_j^-\subseteq N_j\subseteq N_j^+\cup N_j^-$ by definition; $N_j^+\cup N_j^-$ is the best cluster for center $j$, which contains all its neighbors (as required by the algorithm design) and possibly more clients to increase savings. Therefore, if $\mathrm{Saving}(j)\ge\mathrm{Spending}(j)$, then $j$ is considered a good center; otherwise, it is a bad center.
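In code, Definition 4 amounts to a single pass over the candidate members of a cluster. The sketch below assumes hypothetical accessors cost(j, jp) and per-client maps C, M, D; it is meant only to pin down the bookkeeping, not the paper's actual implementation.

```python
EPS1 = 1e-12

def saving_spending(j, candidates, Nj, cost, C, M, D, gamma):
    """cost(j, jp) = d(jp, C_j \\ (C_jp u D_jp)); C, M, D map clients to C_j, M_j, D_j.
    Returns (N_plus, N_minus, saving, spending) for center j."""
    n_plus, n_minus, saving, spending = [], [], 0.0, 0.0
    for jp in candidates:                      # any client may join N_j^+ ...
        goal = (1 - EPS1) * C[jp] + (3 - gamma) * M[jp] + (gamma - 1) * D[jp]
        slack = goal - cost(j, jp)
        if slack >= 0:
            n_plus.append(jp)
            saving += slack
        elif jp in Nj:                         # ... but only neighbors can be in N_j^-
            n_minus.append(jp)
            spending += -slack
    return n_plus, n_minus, saving, spending

# j is a good center iff saving >= spending.
```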

Now we explain some problematic situations that arise when extending the argument to the general case, i.e., without restrictions on the $C_j$, $M_j$, and $D_j$ values. The first simple concern is that the choice of the center $j$ will no longer be based solely on $C_j+M_j$ as in Algorithm 1, which breaks the previous arguments. Therefore, to gain more flexibility in selecting a new cluster center, it is beneficial to decompose the entire support graph into several layers, where each layer only concerns clients with roughly the same $C_j+M_j$ values.

Definition 5.

A network is a subgraph $((\mathcal F',\mathcal C'),E')$ of the support graph $G=((\mathcal F,\mathcal C),E)$. A network is called homogeneous if there exists $s\ge0$ such that for any client $j\in\mathcal C'$, $s\le C_j+M_j\le(1+\delta)s$ for $\delta=3\times10^{-23}$.

However, there are still two more scenarios where the above strategy might fail, as illustrated in Figure 3. In these scenarios, the neighbors in $N_j^-$ are all bad clients, but do not create a dense region.

I. If $C_j\ll s$, the facilities in $\mathcal C_j$ are concentrated near $j$. In this case, since all these facilities are very close to $j$, the 3-hop triangle inequality is almost tight for any $j'\in N_j$ regardless of where it is.

II. Recall that if $\mathcal C_{j'}\cup\mathcal D_{j'}$ is large enough to exclude the facilities of $\mathcal C_j$, then $\mathrm{cost}_j(j')$ might not behave as expected. In particular, a technical problem arises when two facilities from $\mathcal C_j$ that are almost antipodal with respect to $j$ are both contained in $\mathcal C_{j'}\cup\mathcal D_{j'}$, which is illustrated in Figure 3 (right).

Figure 3: Two examples in a homogeneous network where every neighbor of center $j$ belongs to $N_j^-$ without creating a dense region.

The following definition addresses Scenario I.

Definition 6.

For $K_6=1.302$, let $\theta=\frac{K_6+1-\gamma}{2K_6+2-\gamma}$. A client $j\in\mathcal C$ is normal if $C_j\ge\theta(C_j+M_j)$, and weird otherwise.

The following definition addresses Scenario II. Let $z_j(j')=d(j,\mathcal C_j\setminus(\mathcal C_{j'}\cup\mathcal D_{j'}))$, and note that $v_{j'}$ bounds the maximum distance between $j'$ and any facility in $\mathcal C_{j'}\cup\mathcal D_{j'}$. We are interested in the ball around $j$ of radius $0.998\,z_j(j')$ (for the sake of analysis), and $j'$ having a small remote arm with respect to $j$ means that $\mathcal C_{j'}\cup\mathcal D_{j'}$ cannot contain two antipodal points in our ball of interest (with some slack depending on $\alpha$). The right-hand side below is the squared length of the third side of a triangle whose other two sides have lengths $d(j,j')$ and $0.998\,z_j(j')$ with included angle $\frac{\pi}{2}-\alpha$.

Definition 7.

For a normal center $j$, a neighbor $j'$ of $j$ is said to have a small remote arm if the following holds for $\alpha=5\times10^{-4}$:

$(v_{j'})^2=(C^*_{j'}+F_{j'})^2<d(j,j')^2+\big(0.998\,z_j(j')\big)^2-2\,d(j,j')\cdot0.998\,z_j(j')\cos\Big(\frac{\pi}{2}-\alpha\Big).$

Otherwise, $j'$ is said to have a big remote arm.

In Section 3.1, we will show that these two scenarios are the only bad cases to worry about. For instance, if we assume that every client is normal and has a small remote arm, each candidate center $j\in\mathcal C$ is either good or has another candidate center $j''$ with $\mathrm{Saving}(j'')>\mathrm{Saving}(j)$ in its dense region.

How can we handle these two bad scenarios? In the following two lemmas, we prove that both a weird center and a big-remote-arm neighbor imply a high ratio of fractional facility cost $F_j$ to fractional connection cost $C^*_j$. The proofs appear in the extended version of this paper [14].

Lemma 8.

For any weird client $j$, $C^*_j\le K_6F_j$ holds.

Lemma 9.

When $1.6<\gamma<2$, for a normal center $j$, if $j'\in N_j^-$ has a big remote arm, then $C^*_{j'}\le K_6F_{j'}$.

Therefore, if we consider a (sub)instance that has a low facility-to-connection cost ratio, it is natural to expect that this argument applies, which ultimately leads to a (somewhat) good center. Throughout the paper, we will often express this low facility-to-connection ratio condition as $C>KF$ and use the following values for $K$: $K_1=1.3025>K_2=1.3024>K_3=1.3023>K_4=1.3022>K_5=1.3021>K_6=1.302$. The following theorem is the main result of the section.

Theorem 10.

Consider a homogeneous network and let $j$ be the normal center with the highest saving among all normal centers. Let $c=\sum_{j'\in N_j^+\cup N_j^-}C^*_{j'}$ and $f=\sum_{j'\in N_j^+\cup N_j^-}F_{j'}$. If $c>K_5f$, then this cluster is good on average for $\varepsilon_2=5\times10^{-18}$ and every $\gamma\in(1.6,2)$, i.e.,

$\sum_{j'\in N_j^+\cup N_j^-}\mathrm{cost}_j(j')\le\sum_{j'\in N_j^+\cup N_j^-}\big((1-\varepsilon_2)C_{j'}+(3-\gamma)M_{j'}+(\gamma-1)D_{j'}\big).$

3.1 Geometric Arguments

In this subsection, we prove Theorem 10 using properties of Euclidean geometry. As previously discussed, our goal is to show: When a bad cluster center j is normal and (many of) its neighbors have a small remote arm, it is possible to find a dense region of clients near j.

Figure 4: Intersection between $T_{jj'}$ and a remote arm $\mathcal C_{j'}\cup\mathcal D_{j'}$.

When we consider the rerouting of $j'$ to $j$'s close facilities $\mathcal C_j$, we can identify the worst facilities. In the definition below, $B_{jj'}\subseteq\mathcal C_j$ is the set of facilities at (almost) the worst distance from $j$, and $T_{jj'}\subseteq B_{jj'}$ is the set of facilities with both a worst distance and a worst angle.

Definition 11.

Let $r:=10^{-8}$, and let $\phi_r\approx2\times10^{-4}$ be the minimum angle that satisfies $1+x^2+2x\cos\phi_r\le(1+(1-r)x)^2$ for all $0\le x\le2$. For $j'\in\mathcal C$ and its center $j$, let

$B_{jj'}:=\{i\in\mathcal C_j\mid0.999\,z_j(j')\le d(i,j)\le M_j\}$
$T_{jj'}:=\{i\in B_{jj'}\mid\angle j'ji>\pi-\phi_r\}.$

The following lemmas show that if $j$ is normal and $j'\in N_j^-$ has a bad rerouting through $j$, then among the facilities in $B_{jj'}\setminus(\mathcal C_{j'}\cup\mathcal D_{j'})$, which are the rerouting candidates at the worst distance, more than half of the $y$-mass must have a bad angle as well. This formalizes the intuition illustrated in Figure 1.

Lemma 12.

For any $j'\in N_j^-$ of a normal center $j$, $z_j(j')>0.99\,C_j$.

Proof.

By the triangle inequality and the homogeneity condition,

$z_j(j')\ \ge\ \mathrm{cost}_j(j')-d(j,j')\ >\ (1-\varepsilon_1)C_{j'}+(3-\gamma)M_{j'}+(\gamma-1)D_{j'}-(M_j+M_{j'})$
$\ge\ (1-\varepsilon_1)C_{j'}+(2-\gamma)M_{j'}+(\gamma-1)D_{j'}-\big((1+\delta)s-C_j\big)$
$\ge\ -(\varepsilon_1+\delta)s+C_j\ \ge\ \Big(1-\frac{\varepsilon_1+\delta}{\theta}\Big)C_j\ \ge\ 0.99\,C_j.$

Lemma 13.

For a normal center $j$ and any $j'\in N_j^-$, let $G_1=T_{jj'}\cap\big(\mathcal C_j\setminus(\mathcal C_{j'}\cup\mathcal D_{j'})\big)$ and $G_2=(B_{jj'}\setminus T_{jj'})\cap\big(\mathcal C_j\setminus(\mathcal C_{j'}\cup\mathcal D_{j'})\big)$. Then the following holds:

$\sum_{i\in G_1}y_i>\sum_{i\in G_2}y_i.$

Proof.

Assume the nontrivial case: $\mathcal C_j\setminus(\mathcal C_{j'}\cup\mathcal D_{j'})\neq\emptyset$. Let $G_3=\big(\mathcal C_j\setminus(\mathcal C_{j'}\cup\mathcal D_{j'})\big)\setminus(G_1\cup G_2)$. Define the rerouting probability $p_n$ and rerouting length $l_n$ for $1\le n\le3$ as:

$p_n=\frac{\sum_{i\in G_n}y_i}{\sum_{i\in\mathcal C_j\setminus(\mathcal C_{j'}\cup\mathcal D_{j'})}y_i},\qquad l_n=\frac{\sum_{i\in G_n}y_i\,d(i,j)}{\sum_{i\in G_n}y_i}.$

Note that $p_1+p_2+p_3=1$ and $p_1l_1+p_2l_2+p_3l_3=z_j(j')$. Then the goal of this lemma can be written as $p_1>p_2$.

By Lemma 12, $z_j(j')\ge0.99\,C_j$ holds. We derive a lower bound for $p_1+p_2$ as follows:

$z_j(j')=p_1l_1+p_2l_2+p_3l_3\le(p_1+p_2)M_j+(1-p_1-p_2)\cdot0.999\,z_j(j')$

implies that

$p_1+p_2\ \ge\ \frac{0.001\,z_j(j')}{M_j-0.999\,z_j(j')}\ \ge\ \frac{0.001\cdot0.99\,C_j}{M_j}\ \ge\ 0.00099\,\frac{\theta}{1-\theta},$

since the right-hand side attains its minimum when $M_j/C_j$ is maximal, which is bounded because $j$ is normal.

Denote by $\vec v_u$ the position vector of a point $u$. From the definition of $\phi_r$, $\mathrm{cost}_j(j')$ is at most

$\mathrm{cost}_j(j')=\frac{\sum_{i\in\mathcal C_j\setminus(\mathcal C_{j'}\cup\mathcal D_{j'})}y_i\,\big\|(\vec v_j-\vec v_{j'})+(\vec v_i-\vec v_j)\big\|}{\sum_{i\in\mathcal C_j\setminus(\mathcal C_{j'}\cup\mathcal D_{j'})}y_i}$
$\le p_1\big(d(j,j')+l_1\big)+p_2\big(d(j,j')+(1-r)l_2\big)+p_3\big(d(j,j')+l_3\big)$
$=d(j,j')+z_j(j')-p_2\,r\,l_2$
$\le C_j+M_j+(2-\gamma)M_{j'}+(\gamma-1)D_{j'}-p_2\,r\,l_2$
$\le(1-0.999\cdot0.99\,p_2r)\,C_j+M_j+(2-\gamma)M_{j'}+(\gamma-1)D_{j'}.$

Therefore, since $j'\in N_j^-$, $(1-\varepsilon_1)C_{j'}+M_{j'}<(1-0.999\cdot0.99\,p_2r)C_j+M_j\le(1-0.98\,p_2r)C_j+M_j$ holds. It implies

$\Big(1-\frac{\varepsilon_1}{2}\Big)(C_{j'}+M_{j'})\ \le\ (1-\varepsilon_1)C_{j'}+M_{j'}\ <\ (1-0.98\,p_2r)C_j+M_j$
$\le\ \big((1-0.98\,p_2r)\theta+(1-\theta)\big)(C_j+M_j).$

From the homogeneity condition,

$\Big(1-\frac{\varepsilon_1}{2}\Big)<(1-0.98\,p_2r\,\theta)(1+\delta)$

which implies

$p_2\ <\ \frac{1}{0.98\,\theta r}\cdot\frac{\delta+\frac{\varepsilon_1}{2}}{1+\delta}\ <\ 0.00099\,\frac{\theta}{1-\theta}\ <\ p_1.$

The following argument demonstrates how the positional distribution of facilities in $\mathcal C_j$ restricts that of the neighbors in $N_j^-$. Consider two neighbors $j',k\in N_j^-$, both with small remote arms, separated by an angle greater than $2\phi_r$, say $\frac{1}{100}$. Then $T_{jj'}\cap T_{jk}=\emptyset$. However, facilities in $T_{jj'}$ reduce $\mathrm{cost}_j(k)$ and vice versa, which implies that either $T_{jj'}\cap(\mathcal C_k\cup\mathcal D_k)\neq\emptyset$ or $T_{jk}\cap(\mathcal C_{j'}\cup\mathcal D_{j'})\neq\emptyset$. As the small remote arm condition of $j'$ limits how much $\mathcal C_{j'}\cup\mathcal D_{j'}$ can intersect $\mathcal C_j$, it is natural to expect that the number of such pairs is small.

Theorem 14.

For a normal center $j$, let $S$ be a subset of $N_j^-$ consisting of clients with a small remote arm. Furthermore, let any two elements $j_1,j_2\in S$ be separated by an angle greater than $\frac{1}{100}$ with respect to center $j$, i.e., $\angle j_1jj_2>\frac{1}{100}$. Then the cardinality of $S$ is bounded by $M=5\times10^6$, independent of the dimension of the Euclidean space.

Proof.

Denote $S=\{j_1,j_2,\dots,j_{|S|}\}$. Without loss of generality,

$\sum_{i\in T_{jj_k}\cap(\mathcal C_j\setminus(\mathcal C_{j_k}\cup\mathcal D_{j_k}))}y_i\ \ge\ \sum_{i\in T_{jj_{k+1}}\cap(\mathcal C_j\setminus(\mathcal C_{j_{k+1}}\cup\mathcal D_{j_{k+1}}))}y_i$

for all $1\le k<|S|$. Suppose $T_{jj_n}\cap(\mathcal C_{j_{n'}}\cup\mathcal D_{j_{n'}})=\emptyset$ for some $n<n'$. Additionally, given $j_{n'}\in N_j^-$ and the small arm condition, it implies $T_{jj_{n'}}\cap(\mathcal C_{j_{n'}}\cup\mathcal D_{j_{n'}})=\emptyset$. Therefore, $T_{jj_n}\cup T_{jj_{n'}}\subseteq\mathcal C_j\setminus(\mathcal C_{j_{n'}}\cup\mathcal D_{j_{n'}})$. Moreover, $T_{jj_n}\cap T_{jj_{n'}}=\emptyset$ since $2\phi_r<\frac{1}{100}$. However, this contradicts Lemma 13.

We now show that two clients in $S$ with similar $z_j(\cdot)$ values form a large angle at $j$. Suppose $\angle j_{n'}jj_n=\beta<\frac{\pi}{2}+\alpha-\phi_r$ for some $n<n'$ with $\frac{z_j(j_{n'})}{z_j(j_n)}\in\big[\frac{1}{1.001},1.001\big]$. Then for any point $x\in T_{jj_n}$, it holds that $d(j,x)\ge0.999\,z_j(j_n)>0.998\,z_j(j_{n'})$ and $\pi-\angle j_{n'}jx<\beta+\phi_r<\frac{\pi}{2}+\alpha$, i.e., $\angle j_{n'}jx>\frac{\pi}{2}-\alpha$. Also, the quadratic function $t\mapsto t^2-2\,d(j,j_{n'})\,t\sin\alpha$ is non-decreasing in the relevant range of $t=d(j,x)$: this follows from Lemma 12, which ensures that $d(j,x)>0.998\,z_j(j_{n'})\ge0.998\cdot0.99\,C_j\ge0.98\,\theta s\ge2(1+\delta)s\sin\alpha\ge d(j,j_{n'})\sin\alpha$. Therefore, since $j_{n'}$ has a small remote arm,

$d(j_{n'},x)^2=d(j,j_{n'})^2+d(j,x)^2-2\,d(j,j_{n'})\,d(j,x)\cos\angle j_{n'}jx$
$>d(j,j_{n'})^2+d(j,x)^2-2\,d(j,j_{n'})\,d(j,x)\sin\alpha$
$\ge d(j,j_{n'})^2+\big(0.998\,z_j(j_{n'})\big)^2-2\,d(j,j_{n'})\cdot0.998\,z_j(j_{n'})\sin\alpha$
$>\big(C^*_{j_{n'}}+F_{j_{n'}}\big)^2=v_{j_{n'}}^2,$

which implies $T_{jj_n}\cap(\mathcal C_{j_{n'}}\cup\mathcal D_{j_{n'}})=\emptyset$, contradicting the result above. Refer to Figure 4.

According to Rankin [17], the maximum number of disjoint spherical caps, each with an angular radius of $\frac{\pi}{4}+\frac{\alpha-\phi_r}{2}$, is at most $1+\csc(\alpha-\phi_r)$ in any dimension. Since $0.99\,C_j\le z_j(j')\le M_j$ holds, it is feasible to segment this range into successive subranges $[0.99C_j,\,1.001\cdot0.99C_j]$, $[1.001\cdot0.99C_j,\,1.001^2\cdot0.99C_j]$, ..., up to $[M_j/1.001,\,M_j]$. Thus each subrange can only contain a finite number of clients with a small remote arm. Moreover, the normal center condition ensures a bounded number of such subranges. Consequently, the cardinality of $S$ is at most

$|S|\le\Big(1+\frac{1}{\sin(\alpha-\phi_r)}\Big)\cdot\frac{\log\frac{M_j}{0.99\,C_j}}{\log1.001}\le\Big(1+\frac{1}{\sin(\alpha-\phi_r)}\Big)\cdot\frac{\log\frac{1-\theta}{0.99\,\theta}}{\log1.001}.$

Therefore, if we have a large $S\subseteq N_j^-$ with small remote arms, there must exist a large subset of $S$ whose pairwise angles are small, creating a dense region.

Lemma 15.

For a normal center $j$, let $S$ be a set of clients from $N_j^-$ with a small remote arm. If $S\neq\emptyset$, then there exists a client $j'\in S$ for which $\mathrm{Saving}(j')\ge\frac{s}{125}\cdot\frac{|S|}{M}$.

Proof.

Let $S'$ be a maximal subset of $S$ in which any two clients are separated by an angle greater than $\frac{1}{100}$ with respect to the center $j$. Then, for any client $j'\in S$, there exists a client $z_{j'}\in S'$ such that $\angle j'jz_{j'}\le\frac{1}{100}$. Therefore, there exists a $z\in S'$ such that the number of clients $j'\in S$ for which $z_{j'}=z$ is at least $\frac{|S|}{|S'|}$.

For $1\le n\le5$, let $R_n$ be the region

$R_n=\Big\{x\in\mathbb{R}^l\ \Big|\ \angle zjx\le\frac{1}{100},\ \frac{2n-2}{5}(1+\delta)s\le d(j,x)\le\frac{2n}{5}(1+\delta)s\Big\}.$

Therefore, there exists an index $k$ such that at least $\frac{|S|}{5|S'|}$ of these clients lie in $R_k$, i.e., $|R_k|\ge\frac{|S|}{5|S'|}$. For any two clients $j_1,j_2\in R_k$, the distance $d(j_1,j_2)$ is bounded by the sum of their radial and angular differences. Hence, $d(j_1,j_2)\le\frac{2(1+\delta)s}{5}+\frac{2(1+\delta)s}{50}=\frac{11}{25}(1+\delta)s$. From the triangle inequality and the homogeneity condition,

$\mathrm{cost}_{j_1}(j_2)\ \le\ d(j_1,j_2)+M_{j_1}\ \le\ \frac{36}{25}(1+\delta)s\ \le\ \frac{3-\varepsilon_1}{2}(1+\delta)s$
$\le\ (1-\varepsilon_1)C_{j_2}+2M_{j_2}\ \le\ (1-\varepsilon_1)C_{j_2}+(3-\gamma)M_{j_2}+(\gamma-1)D_{j_2},$

which implies $j_2\in N_{j_1}^+$. By Theorem 14, $|S'|\le M$. Therefore, the saving of any client $j'\in R_k$ is at least

$\mathrm{Saving}(j')\ \ge\ \frac{|S|}{5|S'|}\Big((1-\varepsilon_1)C_{j_2}+(3-\gamma)M_{j_2}+(\gamma-1)D_{j_2}-\frac{36}{25}(1+\delta)s\Big)$
$\ge\ \frac{|S|}{5M}\cdot\frac{3-72\delta-25\varepsilon_1}{50}\,s\ \ge\ \frac{s}{125}\cdot\frac{|S|}{M}.$

Given these geometric tools, Theorem 10 follows, since the number of big-remote-arm neighbors of the chosen center $j$ can be bounded.

The proof of Theorem 10 appears in the extended version of this paper [14].

4 Clustering for Homogeneous Instances

In this section, we present an algorithm that operates on a connection-dominant homogeneous instance, ensuring strictly better performance than the naive greedy clustering strategy. Let $c(i)$ for $i\in\mathcal C$ denote the cluster center of $i$ when a clustering is given by the context.

Theorem 16.

Suppose a homogeneous network $G=((\mathcal F,\mathcal C),E)$ satisfies $C>K_4F$. Then the clustering produced by Algorithm 2 is good on average. Precisely, for $\varepsilon_3=3\times10^{-32}$ and every $\gamma\in(1.6,2)$, the following holds:

$\sum_{j\in\mathcal C}\mathrm{cost}_{c(j)}(j)\le\sum_{j\in\mathcal C}\big((1-\varepsilon_3)C_j+(3-\gamma)M_j+(\gamma-1)D_j\big).$
Algorithm 2 homogeneous: Homogeneous Clustering.

For a cluster $A$ produced by the algorithm, if the center of $A$ is weird, then the connection-to-facility ratio of $A$ is at most $K_6$, since all clients within it are weird. If $A$ is not "good", its connection-to-facility ratio is at most $K_5$ by Theorem 10. The assumed ratio $K_4$ in this theorem is larger than both of these values, implying that a constant proportion of the clusters are "good", since they satisfy the conditions of Theorem 10. Therefore, by scaling down the $\varepsilon$ value by that proportion, the desired result can be obtained.

Proof.

Divide $\mathcal C$ into three groups:

1. Clients clustered by a normal center, where the ratio of that cluster's connection cost to facility cost is greater than $K_5$.

2. Clients clustered by a normal center, where this ratio is at most $K_5$.

3. Clients clustered by a weird center.

Let $S_n$ be the set of clients in the $n$-th group ($1\le n\le3$). Here, $S_1$ and $S_2$ correspond to clusters formed by the if-branch of Algorithm 2, and $S_3$ to those formed by the else-branch. Define the following values:

$C_n=\sum_{j\in S_n}C^*_j,\qquad F_n=\sum_{j\in S_n}F_j.$

By Theorem 10, $\sum_{j\in S_1}\mathrm{cost}_{c(j)}(j)\le\sum_{j\in S_1}\big((1-\varepsilon_2)C_j+(3-\gamma)M_j+(\gamma-1)D_j\big)$. The rerouting cost for $S_2$ is bounded only via the homogeneity condition: for a client $j$, $\mathrm{cost}_{c(j)}(j)\le C_{c(j)}+M_{c(j)}+(2-\gamma)M_j+(\gamma-1)D_j\le(1+\delta)(C_j+M_j)+(2-\gamma)M_j+(\gamma-1)D_j$. Lastly, clients in $S_3$ are clustered through the greedy strategy; thus $\mathrm{cost}_{c(j)}(j)\le C_j+(3-\gamma)M_j+(\gamma-1)D_j$. Note that $S_3$ consists solely of weird clients, meaning $C_3\le K_6F_3\le K_5F_3$.

From the above, the total rerouting cost is bounded by:

$\sum_{j\in\mathcal C}\mathrm{cost}_{c(j)}(j)=\sum_{j\in S_1}\mathrm{cost}_{c(j)}(j)+\sum_{j\in S_2}\mathrm{cost}_{c(j)}(j)+\sum_{j\in S_3}\mathrm{cost}_{c(j)}(j)$
$\le\sum_{j\in S_1}\big((1-\varepsilon_2)C_j+(3-\gamma)M_j+(\gamma-1)D_j\big)$
$+\sum_{j\in S_2}\big((1+\delta)(C_j+M_j)+(2-\gamma)M_j+(\gamma-1)D_j\big)$
$+\sum_{j\in S_3}\big(C_j+(3-\gamma)M_j+(\gamma-1)D_j\big).$

Therefore, it is sufficient to show that

$(\delta+\varepsilon_3)\sum_{j\in S_2}(C_j+M_j)+\varepsilon_3\sum_{j\in S_3}C_j\ \le\ (\varepsilon_2-\varepsilon_3)\sum_{j\in S_1}C_j.$

Note that $C_2+C_3\le K_5(F_2+F_3)<\frac{K_5}{K_4}C$ holds, which means $C_1>(1-\frac{K_5}{K_4})C$. Also $C_1>K_4F_1$, since $C_2+C_3\le K_5(F_2+F_3)$. Then the following holds:

$(\delta+\varepsilon_3)\sum_{j\in S_2}(C_j+M_j)+\varepsilon_3\sum_{j\in S_3}C_j$
$\le(\delta+\varepsilon_3)\sum_{j\in S_2\cup S_3}(C_j+M_j)\le(\delta+\varepsilon_3)\sum_{j\in\mathcal C}(C_j+D_j)$
$\le(\delta+\varepsilon_3)\big(2C+(2-\gamma)F\big)$
$\le(\delta+\varepsilon_3)\,\frac{2K_4+2-\gamma}{K_4}\,C$
$\le(\varepsilon_2-\varepsilon_3)\,\frac{(K_5-\gamma+1)(K_4-K_5)}{K_4K_5}\,C\le(\varepsilon_2-\varepsilon_3)\,\frac{K_5-\gamma+1}{K_5}\,C_1$
$\le(\varepsilon_2-\varepsilon_3)\big(C_1-(\gamma-1)F_1\big)\le(\varepsilon_2-\varepsilon_3)\sum_{j\in S_1}C_j.$

5 Clustering for Connection-dominant Instances

In this section, we introduce an algorithm that operates on connection-dominant instances without a homogeneity condition. This algorithm uses Algorithm 1 and Algorithm 2 as subroutines. The theorem below states our main result.

Theorem 17.

For any connection-dominant instance, i.e., $C>K_1F$, there exists an algorithm that finds a clustering configuration whose rerouting cost is at most

$\sum_{j\in\mathcal C}\mathrm{cost}_{c(j)}(j)\le\sum_{j\in\mathcal C}\big((1-\varepsilon_5)C_j+(3-\gamma)M_j+(\gamma-1)D_j\big)$

for $\varepsilon_5=2\times10^{-41}$ and every $\gamma\in(1.6,2)$.

Definition 18.

Let $B_0$ be the set of clients for which $C_j+M_j=0$. Let $s=\min_{j\in\mathcal C\setminus B_0}(C_j+M_j)$. For $\delta=7\times10^{-32}$, let the block $B_n$ be the set of clients such that

$B_n=\{j\in\mathcal C\mid(1+\delta)^{n-1}s\le C_j+M_j<(1+\delta)^ns\}.$

Note that if $(1+\delta)^m\le1+3\times10^{-23}$ (the homogeneity parameter of Definition 5), then a network composed of at most $m$ consecutive blocks is still homogeneous. However, applying Algorithm 2 directly to each block would be impossible: a block may be facility-dominant, and neighbors of a center from some block may belong to a different block. Moreover, the following observation implies that the neighbor relationships between consecutive blocks are the main concern.
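Computing the block decomposition is one logarithm per client. A caveat for the sketch below: the paper's $\delta=7\times10^{-32}$ is far below double precision ($1+\delta$ rounds to $1$ in float64), so the demo uses a coarser, hypothetical $\delta$; the logic is otherwise the definition verbatim.

```python
import math

# The paper's delta = 7e-32 underflows in float64; use a coarser demo value.
DELTA = 1e-6

def blocks(CM):
    """CM[j] = C_j + M_j.  Returns {n: set of clients in block B_n}; B_0 holds
    clients with C_j + M_j = 0, and B_n = {j : (1+d)^(n-1) s <= CM[j] < (1+d)^n s}."""
    s = min(v for v in CM.values() if v > 0)
    out = {}
    for j, v in CM.items():
        n = 0 if v == 0 else 1 + math.floor(math.log(v / s) / math.log1p(DELTA))
        out.setdefault(n, set()).add(j)
    return out
```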

Observation 19.

If two neighbors $j,j'\in\mathcal C$ belong neither to the same block nor to consecutive blocks, so that $(1+\delta)(C_{j'}+M_{j'})\le C_j+M_j$, then $j\in N_{j'}^+$ holds whenever the parameter $\varepsilon$ in the $\mathrm{Saving}(\cdot)$ criterion satisfies $\varepsilon\le\frac{2\delta}{1+\delta}$.

Proof.

$\mathrm{cost}_{j'}(j)\ \le\ C_{j'}+M_{j'}+(2-\gamma)M_j+(\gamma-1)D_j\ \le\ \frac{C_j+M_j}{1+\delta}+(2-\gamma)M_j+(\gamma-1)D_j$
$\le\ \Big(1-\frac{2\delta}{1+\delta}\Big)C_j+(3-\gamma)M_j+(\gamma-1)D_j.$

A key idea is that with a sufficiently small $\delta$, the support graph can be segmented almost arbitrarily while still preserving the homogeneity condition, enabling the identification of weak connections between consecutive blocks. Notably, even if two clients $j\in B_n$ and $j'\in B_{n+1}$ come from consecutive blocks, $\mathrm{cost}_{j'}(j)$ is no worse than under the greedy strategy.

Definition 20.

An interval $I$ is a set of consecutive blocks, containing up to a maximum of $2L$ blocks for $L=2\times10^8$. Precisely, $I=\{B_i,\dots,B_{i+k}\}$ for $k<2L$, where $\sum_{j=i}^{i+k}C(B_j)-C(B_i)\ge K_3\sum_{j=i}^{i+k}F(B_j)$ (with $C(B)$ and $F(B)$ denoting the total fractional connection and facility cost of the clients in block $B$), except when $k=0$. The size of an interval $I$ is the number of blocks it contains, denoted $|I|$. The reward of interval $I$ is defined as $R(I)=\sum_{j=i}^{i+k}C(B_j)-C(B_i)$.

Note that a single block is the unique type of interval with size 1 (the exceptional case $k=0$). Also, $(1+\delta)^{2L}\le1+3\times10^{-23}$ holds, so every interval is homogeneous. An interval is the basic unit of clustering to which Algorithm 1 and Algorithm 2 are applied. Therefore, the reward of an interval represents the extent to which a guaranteed Saving can be obtained when the current interval can be clustered using Algorithm 2, regardless of how the preceding intervals have been clustered. From this perspective, the first block, which does not contribute to the reward, serves as a kind of "buffer".

Given the entire support graph $G=((\mathcal F,\mathcal C),E)$ and a subset $\mathcal C'\subseteq\mathcal C$, let $G_{\mathcal C'}$ be the subgraph (network) induced by $\mathcal C'$. Consequently, when we call Algorithm 1 or 2 on $G_{\mathcal C'}$, the calculations of Saving and Spending are performed solely with respect to the implicitly defined client set $\mathcal C'$. We denote Algorithm 1 as greedy and Algorithm 2 as homogeneous. For simplicity, when $I$ denotes an interval, we interpret the expression $j\in I$ as ranging over all clients within the blocks of $I$. The following lemma shows how the reward is related to Saving. The proof appears in the extended version of this paper [14].

Lemma 21.

Let $J$ be a set of non-overlapping intervals. Then the clustering produced by Algorithm 3 is good on average. Precisely, for $\varepsilon_4=2\times10^{-36}$,

$\sum_{j\in\mathcal C}\mathrm{cost}_{c(j)}(j)\le\sum_{j\in\mathcal C}\bigg(\Big(1-\frac{\sum_{I\in J}R(I)}{\sum_{j\in\mathcal C}C_j}\,\varepsilon_4\Big)C_j+(3-\gamma)M_j+(\gamma-1)D_j\bigg).$
Algorithm 3 conn: Connection-dominant Clustering.

In conclusion, it suffices to find a set of non-overlapping intervals $J$ with a high total reward $\sum_{I\in J}R(I)$. Here we briefly sketch the idea. We iterate through the blocks in reverse order ($B_N,\dots,B_0$), cutting out suitable ranges that satisfy the interval conditions. Fix a point $r$ to be the right end of the interval, and expand the left end of the current range one block at a time until the range is suitable for processing. Suppose, at some point, the $C/F$ value of the current range drops below $K_2$, which is strictly less than the ratio $K_1$ of the input instance. Then this range can be considered a minor part and can be excluded, as there is no need for it to become an interval. Therefore, we only consider cases where the current range has a $C/F$ value of at least $K_2$.

However, consider the situation where the current range's reward is small, meaning that the first block must account for most of the $C$. In this case, if we keep expanding the range to the left, we will eventually reach the initial block $B_0$, satisfying the interval condition. There remains a subtle issue that the range's length may exceed $2L$ during the expansion, but this implies that the $C$ value on the left side of the range keeps growing exponentially compared to the right side. This can be resolved by appropriately trimming the right side of the range.

Lemma 22.

For any support graph $G$ that is connection-dominant, i.e., $C>K_1F$, the set of non-overlapping intervals $J$ obtained by Algorithm 4 satisfies $\sum_{I\in J}R(I)\ge\frac{1}{10^5}\,C$.

Algorithm 4 cutinterval: Find a set of non-overlapping intervals J with large rewards.

The proof appears in the extended version of this paper [14].

By directly applying Lemma 21 and Lemma 22, Theorem 17 can be proved.

Algorithm 5 Overall bi-factor clustering process.

6 Improved Bifactor Approximation

In this section, we present an improved bifactor approximation algorithm for UFL, proving Theorem 1. To deal with facility-dominant instances, we employ the JMS algorithm [12], which is known to be a $(1.11,1.7764)$-approximation algorithm.

Now we show that Algorithm 5 satisfies Theorem 1. Let $\gamma_0\approx1.6774$ be a solution of the equation below:

$\frac{1}{e}+\frac{1}{e^{\gamma}}-(\gamma-1)\Big(1-\frac{1}{e}+(1-\varepsilon_5)\frac{1}{e^{\gamma}}\Big)=0.$

To prove Theorem 1, we consider two cases. When $C\le K_1F$, the inequality $1.11F+1.7764C\le1.6774F+(1+2e^{-1.6774}-\varepsilon_6)C$ holds. For the case $C>K_1F$, we can employ the same proof structure as the corresponding statement in [3] (Theorem 4.3 of [3]). The proof appears in the extended version of this paper [14].
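The facility-dominant case reduces to arithmetic. The following check, at the boundary $C=K_1F$ where the comparison is tightest, confirms the claimed inequality with room to spare (the slack is what absorbs $\varepsilon_6$):

```python
import math

K1 = 1.3025
lam_f, lam_c = 1.6774, 1 + 2 * math.exp(-1.6774)   # lam_c = 1.3735...
# On the worst-case boundary C = K1 * F (normalize F = 1):
F, C = 1.0, K1
jms = 1.11 * F + 1.7764 * C                        # JMS cost bound
ours = lam_f * F + lam_c * C                       # target bound at eps6 = 0
print(jms <= ours, ours - jms)                     # True, slack ~ 0.043
```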

7 Improved Unifactor Approximation

In this section, we present an algorithm with a better unifactor approximation guarantee than that of Li [15], proving Theorem 2.

Framework of [15].

Li showed that a hard instance for a certain γ might not be a hard instance for another value of γ. This suggests that selecting a γ value at random could improve the expected performance of the algorithm.

They introduced a characteristic function $h:(0,1]\to\mathbb{R}_{\ge0}$ to represent the distribution of distances between a client and its neighboring facilities. For a client $j\in\mathcal C$, assume $i_1,i_2,\dots,i_k$ are the facilities within $\mathcal C_j\cup\mathcal D_j$, ordered by increasing distance from $j$. Then, for $0<p\le1$, $h_j(p)$ is defined as $d(j,i_t)$, where $t$ is the smallest index satisfying $\sum_{l=1}^ty_{i_l}\ge p$. They also improved the bound on the rerouting cost to $d(j,\mathcal C_{j'}\setminus(\mathcal C_j\cup\mathcal D_j))\le(2-\gamma)M_j+(\gamma-1)D_j+C_{j'}+M_{j'}$, where $j'$ is the cluster center of $j$. For fixed $\gamma$, the expected connection cost of $j$ under the aforementioned algorithm of [3] is at most

$\mathbb{E}[C_j]\le\int_0^1h_j(p)\,e^{-\gamma p}\gamma\,dp+e^{-\gamma}\Big(\gamma\int_0^1h_j(p)\,dp+(3-\gamma)\,h_j\big(\tfrac{1}{\gamma}\big)\Big).$
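The characteristic function $h_j$ is simply a quantile function of the facility-distance distribution around $j$; a minimal sketch (with a made-up input format of (distance, y) pairs):

```python
def h_j(p, facilities):
    """facilities: list of (distance, y_i) pairs over C_j u D_j with total mass 1.
    h_j(p) = distance of the smallest prefix (sorted by distance) with mass >= p."""
    cum = 0.0
    for dist, y in sorted(facilities):
        cum += y
        if cum >= p - 1e-12:
            return dist
    return sorted(facilities)[-1][0]   # guard against rounding in the last step

# e.g. h_j(0.3, [(1.0, 0.5), (2.0, 0.5)]) -> 1.0 ; h_j(0.7, ...) -> 2.0
```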

Therefore, the analysis of the approximation ratio can be modeled as a zero-sum game. The characteristic function for the whole instance is given by $h(p)=\sum_{j\in\mathcal C}h_j(p)$. Assuming $h$ is normalized so that $\int_0^1h(p)\,dp=1$, the algorithm proceeds as follows: with probability $\kappa$, it employs the JMS algorithm [12]; Mahdian et al. [16] proved that the JMS algorithm achieves a $(1.11,1.7764)$-approximation. Otherwise, $\gamma$ is sampled randomly from a distribution $\mu:(1,\infty)\to\mathbb{R}_{\ge0}$, ensuring that $\kappa+\int_1^\infty\mu(\gamma)\,d\gamma=1$. Thus, the value of the zero-sum game, i.e., the approximation ratio of the algorithm under a fixed strategy $(\kappa,\mu)$, is calculated as follows:

$\nu(\kappa,\mu,h)=\max\Big\{\int_1^\infty\gamma\,\mu(\gamma)\,d\gamma+1.11\kappa,\ \int_1^\infty\alpha(\gamma,h)\,\mu(\gamma)\,d\gamma+1.7764\kappa\Big\}$

where

$\alpha(\gamma,h)=\int_0^1h(p)\,e^{-\gamma p}\gamma\,dp+e^{-\gamma}\Big(\gamma+(3-\gamma)\,h\big(\tfrac{1}{\gamma}\big)\Big).$

Moreover, for a given probability density function $\mu$ for $\gamma$, it can be shown that the characteristic function of the hardest instance is a threshold function, defined as

$h_q(p)=\begin{cases}\frac{1}{1-q}&\text{for }p>q\\0&\text{for }p\le q\end{cases}$

for some $0\le q<1$. This means that the final approximation ratio for a given $\mu$ is $\max_{0\le q<1}\nu(\kappa,\mu,h_q)$. In [15], the suggested distribution for $\alpha_{\mathrm{Li}}$ is $\mu(p)=\theta\,D(p-\gamma_1)+\frac{1-\kappa-\theta}{\gamma_2-\gamma_1}\cdot\mathbb{1}[\gamma_1<p<\gamma_2]$, where $D$ is the Dirac delta function, $\gamma_1=1.479311$, $\gamma_2=2.016569$, $\theta=0.503357$, and $\kappa=0.195583$.
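Plugging Li's strategy into the game value is a small numerical exercise. The sketch below evaluates $\nu(\kappa,\mu,h_q)$ over a grid of thresholds $q$, using the closed form of $\int_0^1h_q(p)e^{-\gamma p}\gamma\,dp$; with the stated constants the facility term alone is $\theta\gamma_1+\frac{1-\kappa-\theta}{\gamma_2-\gamma_1}\int_{\gamma_1}^{\gamma_2}\gamma\,d\gamma+1.11\kappa\approx1.488$, and the maximum over $q$ should come out near $\alpha_{\mathrm{Li}}\approx1.488$ (this is a sanity check of the framework, not of our improvement).

```python
import math
from scipy.integrate import quad

g1, g2, theta, kappa = 1.479311, 2.016569, 0.503357, 0.195583

def alpha(g, q):
    """alpha(gamma, h_q): closed form of the integral against the threshold h_q."""
    body = (math.exp(-g * q) - math.exp(-g)) / (1 - q)  # gamma * int_q^1 e^{-gp}/(1-q) dp
    tail = math.exp(-g) * (g + (3 - g) * (1 / (1 - q) if 1 / g > q else 0.0))
    return body + tail

def nu(q):
    uni = (1 - kappa - theta) / (g2 - g1)               # density of the uniform part
    fac = theta * g1 + uni * (g2 ** 2 - g1 ** 2) / 2 + 1.11 * kappa
    conn = (theta * alpha(g1, q)
            + uni * quad(lambda g: alpha(g, q), g1, g2)[0]
            + 1.7764 * kappa)
    return max(fac, conn)

print(max(nu(k / 1000) for k in range(1000)))           # ~ 1.488
```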

Our Improvement.

By using Algorithm 5, the expected connection cost for a given $\gamma$ and characteristic function $h$ becomes:

$\alpha(\gamma,h)=\begin{cases}\int_0^1h(p)\,e^{-\gamma p}\gamma\,dp+e^{-\gamma}\big(\gamma(1-\varepsilon_7)+(3-\gamma)h(\frac{1}{\gamma})\big)&\text{if }1.6<\gamma<2\\\int_0^1h(p)\,e^{-\gamma p}\gamma\,dp+e^{-\gamma}\big(\gamma+(3-\gamma)h(\frac{1}{\gamma})\big)&\text{otherwise.}\end{cases}$

Therefore, we define a new zero-sum game value as follows, assuming $h$ is scaled so that $\int_0^1h(p)\,dp=1$:

$\nu(\kappa,\mu,h)=\max\Big\{\int_1^\infty\gamma\,\mu(\gamma)\,d\gamma+1.11\kappa,\ \int_1^\infty\alpha(\gamma,h)\,\mu(\gamma)\,d\gamma+1.7764\kappa\Big\}.$

Note that $\alpha$ is still linear in $h$. Therefore, even though the game definition changes, the adversary's choice of the characteristic function $h$ remains a threshold function $h_q$ for some $0\le q<1$. Given that there is a positive probability of sampling $\gamma$ between 1.6 and 2, it is possible to achieve a lower cost.

Lemma 23.

Let $\mu_2(\gamma)=(1-\varepsilon_7)\mu_1(\gamma)+\varepsilon_7(1-\kappa_2)\,D(\gamma-\gamma_1)$, where $D$ is the Dirac delta function (and $\mu_1$, $\kappa_2$ form the strategy of [15] described above). Then the following holds:

$\max_{0\le q<1}\max\Big\{\int\gamma\,\mu_2(\gamma)\,d\gamma+1.11\kappa_2,\ \int\alpha(\gamma,h_q)\,\mu_2(\gamma)\,d\gamma+1.7764\kappa_2\Big\}$
$<\ \max_{0\le q<1}\max\Big\{\int\gamma\,\mu_1(\gamma)\,d\gamma+1.11\kappa_2,\ \int\alpha(\gamma,h_q)\,\mu_1(\gamma)\,d\gamma+1.7764\kappa_2\Big\}-\frac{\varepsilon_7}{1000}.$

The proof appears in the extended version of this paper [14].

Therefore, we present an improved unifactor approximation algorithm, Algorithm 6.

Algorithm 6 Overall uni-factor clustering process.

When $C\le K_1F$, the inequality $1.11F+1.7764C\le1.487F+1.487C$ is satisfied. For the case $C>K_1F$, by Lemma 23, Algorithm 6 yields an improved unifactor approximation with $\varepsilon_8=\frac{\varepsilon_7}{1000}=2\times10^{-45}$.

8 APX-Hardness

In this section, we prove that UFL is APX-hard in Euclidean spaces. We use the following result of Austrin, Khot, and Safra [2]. For $\rho\in[-1,+1]$ and $\mu\in[0,1]$, let $\Gamma_\rho(\mu):=\Pr[X\le\Phi^{-1}(\mu)\wedge Y\le\Phi^{-1}(\mu)]$, where $X,Y$ are standard Gaussian random variables with covariance $\rho$ and $\Phi$ is the cumulative distribution function of the standard normal distribution.

Theorem 24.

Assuming the Unique Games Conjecture, for any $q\in(0,1/2)$ and $\varepsilon>0$, it is NP-hard, given a graph $G=(V,E)$, to distinguish between the following two cases.

  • (Completeness) G contains an independent set of size q|V|.

• (Soundness) For any $T\subseteq V$, the number of edges with both endpoints in $T$ is at least $|E|\big(\Gamma_{q/(1-q)}(\mu)-\varepsilon\big)$, where $\mu=|T|/|V|$.

Fix an arbitrary $q\in(0,1/2)$. Without loss of generality, assume $V=[n]$. Also let $m:=|E|$. Our UFL instance has $V$ as the set of facilities and $E$ as the set of clients. The ambient Euclidean space is $\mathbb{R}^n$; let $e_i$ be the $i$th standard unit vector (i.e., $(e_i)_i=1$ and $(e_i)_j=0$ for every $j\neq i$). Each facility $i\in V$ is located at $e_i$ and each client $(i,j)\in E$ is located at $e_i+e_j$. Finally, let $\lambda$ be the common facility opening cost for every $i$, to be determined. This finishes the description of the UFL instance.
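The reduction is easy to instantiate for a concrete graph; the sketch below builds the point set and confirms that only the two distances $1$ and $\sqrt3$ occur ($\lambda$ is left as a parameter, chosen as $\Theta(m/n)$ in the analysis below).

```python
import numpy as np

def ufl_instance(n, edges, lam):
    """Facilities at e_i for i in [n]; one client at e_i + e_j per edge (i, j);
    every facility has opening cost lam."""
    facilities = np.eye(n)
    clients = np.array([facilities[i] + facilities[j] for i, j in edges])
    return facilities, clients, lam

# Example: a 4-cycle.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
F, Cl, lam = ufl_instance(4, edges, lam=0.5)
dists = {round(float(np.linalg.norm(c - f)), 6) for c in Cl for f in F}
print(dists)   # {1.0, 1.732051}: every client-facility distance is 1 or sqrt(3)
```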

In the completeness case, there is an independent set $U$ of size $qn$. We open $V\setminus U$. Since $V\setminus U$ is a vertex cover, every client in $E$ has an open facility at distance 1, so the total cost is

$\lambda(1-q)n+m.$

In the soundness case, consider any solution that opens $S\subseteq V$, and let $T=V\setminus S$ and $\mu=|T|/n$. By the soundness guarantee, at least $m\big(\Gamma_{q/(1-q)}(\mu)-\varepsilon\big)$ clients do not have an open facility at distance 1. Since every client-facility distance is either $1$ or $\sqrt3$, the total cost is at least

$\lambda(1-\mu)n+m\Big(1+(\sqrt3-1)\big(\Gamma_{q/(1-q)}(\mu)-\varepsilon\big)\Big).\qquad(2)$

For fixed $q$, the function $\Gamma_{q/(1-q)}(\mu)$ is a strictly convex function of $\mu$, so if we choose $\lambda$ such that

$-\lambda n+m(\sqrt3-1)\,\frac{d\Gamma_{q/(1-q)}(\mu)}{d\mu}\Big|_{\mu=q}=0,$

then (2) is minimized when $\mu=q$, where it becomes

$\lambda(1-q)n+m\Big(1+(\sqrt3-1)\big(\Gamma_{q/(1-q)}(q)-\varepsilon\big)\Big).$

Furthermore, we can note that

$\lambda=\frac{m}{n}(\sqrt3-1)\,\frac{d\Gamma_{q/(1-q)}(\mu)}{d\mu}\Big|_{\mu=q}=\Theta\Big(\frac{m}{n}\Big).$

Then one can see that the optimal value in the soundness case is at least $m(\sqrt3-1)\big(\Gamma_{q/(1-q)}(q)-\varepsilon\big)$ larger than the optimal value in the completeness case. For a fixed $q$, by choosing $\varepsilon$ sufficiently small, one can ensure that this excess is at least a $\delta$ fraction of the completeness-case optimal value for some constant $\delta>0$, which proves a $(1+\delta)$-hardness of approximation.

9 Conclusion

The most natural open problem is to obtain further improved bifactor or unifactor approximations for Euclidean UFL. Though we show a strict separation between general and Euclidean metrics for bifactor approximation in a certain regime, such a separation is not yet achieved for all regimes of bifactor approximation, nor for unifactor approximation.

Whereas our algorithm is based on the primal rounding approach of [3] and [15], it might be a fruitful research direction to design a variant of the Jain-Mahdian-Saberi (JMS) algorithm [12] (a greedy algorithm analyzed by the dual fitting method) or the Jain-Vazirani (JV) algorithm [13] (a primal-dual algorithm) for further improvement. In particular, as the best unifactor approximations for UFL in both general and Euclidean metrics employ the $(1.11,1.7764)$-approximation of the JMS algorithm as a black box, improving the JMS algorithm would directly yield a better unifactor approximation for UFL. The JV algorithm has already been improved in Euclidean spaces [1, 10, 4], but these improvements do not suffice for UFL.

References

  • [1] Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better guarantees for k-means and euclidean k-median by primal-dual algorithms. SIAM Journal on Computing, 49(4):FOCS17–97, 2019. doi:10.1137/18M1171321.
  • [2] Per Austrin, Subhash Khot, and Muli Safra. Inapproximability of vertex cover and independent set in bounded degree graphs. Theory of Computing, 7(1):27–43, 2011. doi:10.4086/TOC.2011.V007A003.
  • [3] Jaroslaw Byrka and Karen Aardal. An optimal bifactor approximation algorithm for the metric uncapacitated facility location problem. SIAM Journal on Computing, 39(6):2212–2231, 2010. doi:10.1137/070708901.
  • [4] Vincent Cohen-Addad, Hossein Esfandiari, Vahab Mirrokni, and Shyam Narayanan. Improved approximations for euclidean k-means and k-median, via nested quasi-independent sets. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, pages 1621–1628, 2022. doi:10.1145/3519935.3520011.
  • [5] Vincent Cohen-Addad, Andreas Emil Feldmann, and David Saulpic. Near-linear time approximation schemes for clustering in doubling metrics. Journal of the ACM (JACM), 68(6):1–34, 2021. doi:10.1145/3477541.
  • [6] Vincent Cohen-Addad and CS Karthik. Inapproximability of clustering in ℓp metrics. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 519–539. IEEE, 2019.
  • [7] Vincent Cohen-Addad, Karthik C S, and Euiwoong Lee. Johnson coverage hypothesis: Inapproximability of k-means and k-median in ℓp-metrics. In Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1493–1530. SIAM, 2022. doi:10.1137/1.9781611977073.63.
  • [8] Vincent Cohen-Addad Viallat, Fabrizio Grandoni, Euiwoong Lee, and Chris Schwiegelshohn. Breaching the 2 lmp approximation barrier for facility location with applications to k-median. In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 940–986. SIAM, 2023.
  • [9] Kishen N Gowda, Thomas Pensyl, Aravind Srinivasan, and Khoa Trinh. Improved bi-point rounding algorithms and a golden barrier for k-median. In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 987–1011. SIAM, 2023. doi:10.1137/1.9781611977554.CH38.
  • [10] Fabrizio Grandoni, Rafail Ostrovsky, Yuval Rabani, Leonard J Schulman, and Rakesh Venkat. A refined approximation for euclidean k-means. Information Processing Letters, 176:106251, 2022. doi:10.1016/J.IPL.2022.106251.
  • [11] Sudipto Guha and Samir Khuller. Greedy strikes back: Improved facility location algorithms. Journal of algorithms, 31(1):228–248, 1999. doi:10.1006/JAGM.1998.0993.
  • [12] Kamal Jain, Mohammad Mahdian, and Amin Saberi. A new greedy approach for facility location problems. In Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, pages 731–740, 2002. doi:10.1145/509907.510012.
  • [13] Kamal Jain and Vijay V Vazirani. Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and lagrangian relaxation. Journal of the ACM (JACM), 48(2):274–296, 2001. doi:10.1145/375827.375845.
  • [14] Euiwoong Lee and Kijun Shin. Facility location on high-dimensional euclidean spaces, 2024. arXiv:2501.18105.
  • [15] Shi Li. A 1.488 approximation algorithm for the uncapacitated facility location problem. Information and Computation, 222:45–58, 2013. doi:10.1016/J.IC.2012.01.007.
  • [16] Mohammad Mahdian, Yinyu Ye, and Jiawei Zhang. A 1.52-approximation algorithm for the uncapacitated facility location problem. In Proc. of APPROX, pages 229–242, 2002.
  • [17] Robert Alexander Rankin. The closest packing of spherical caps in n dimensions. Glasgow Mathematical Journal, 2(3):139–144, 1955.