
Approximating Dasgupta Cost in Sublinear Time
from a Few Random Seeds

Michael Kapralov (EPFL, Lausanne, Switzerland), Akash Kumar (Indian Institute of Technology Bombay, India), Silvio Lattanzi (Google Research, Zurich, Switzerland), Aida Mousavifar (Google, Zurich, Switzerland), Weronika Wrzos-Kaminska (EPFL, Lausanne, Switzerland)
Abstract

Testing graph cluster structure has been a central object of study in property testing since the foundational work of Goldreich and Ron [STOC’96] on expansion testing, i.e. the problem of distinguishing between a single cluster (an expander) and a graph that is far from a single cluster. More generally, a (k,ϵ)-clusterable graph G is a graph whose vertex set admits a partition into k induced expanders, each with outer conductance bounded by ϵ. A recent line of work initiated by Czumaj, Peng and Sohler [STOC’15] has shown how to test whether a graph is close to (k,ϵ)-clusterable, and to locally determine which cluster a given vertex belongs to with misclassification rate ϵ, but no sublinear time algorithms for learning the structure of inter-cluster connections are known. As a simple example, can one locally distinguish between the “cluster graph” forming a line and a clique?

In this paper, we consider the problem of testing the hierarchical cluster structure of (k,ϵ)-clusterable graphs in sublinear time. Our measure of hierarchical clusterability is the well-established Dasgupta cost, and our main result is an algorithm that approximates the Dasgupta cost of a (k,ϵ)-clusterable graph in sublinear time, using a small number of randomly chosen seed vertices for which cluster labels are known. Our main result is an O(√(log k)) approximation to the Dasgupta cost of G in n^{1/2+O(ϵ)} time using n^{1/3} seeds, effectively giving a sublinear time simulation of the algorithm of Charikar and Chatziafratis [SODA’17] on clusterable graphs. To the best of our knowledge, ours is the first result on approximating the hierarchical clustering properties of such graphs in sublinear time.

Keywords and phrases:
Sublinear algorithms, Hierarchical Clustering, Dasgupta’s Cost
Category:
Track A: Algorithms, Complexity and Games
Copyright and License:
© Michael Kapralov, Akash Kumar, Silvio Lattanzi, Aida Mousavifar, and Weronika Wrzos-Kaminska; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Streaming, sublinear and near linear time algorithms
Related Version:
Full Version: https://arxiv.org/abs/2207.02581
Editors:
Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis

1 Introduction

Graph clustering is a central problem in data analysis, with applications in a wide variety of scientific disciplines from data mining to social science, statistics and more. The overall objective in these problems is to partition the vertex set of the graph into disjoint “well connected” subgraphs which are sparsely connected to each other. It is quite common in the practice of graph clustering that, besides the graph itself, one is given a small list of vertices together with their correct cluster labels, and one must extend this limited amount of cleanly labelled data to a clustering of the entire graph. This corresponds to the widely used seeded model (see, e.g., [7] and numerous follow-up works [17, 29, 39, 4]). The central question that we consider in this paper is

What can be learned about the cluster structure of the input graph from a few seed nodes in sublinear time?

Formally, we work with the classical model for well-clusterable graphs [14], where the input graph G=(V,E) is assumed to admit a partitioning into a disjoint union of k induced expanders C_1,…,C_k with outer conductance bounded by ϵ ≪ 1 and inner conductance Ω(1). We refer to such instances as (k,Ω(1),ϵ)-clusterable graphs, or (k,ϵ)-clusterable graphs for short. Such graphs have been the focus of significant attention in the property testing literature [15, 27, 36], starting from the seminal work of [24]. A recent line of work has shown how to design nearly optimal sublinear time clustering oracles for such graphs, i.e. algorithms that can consistently answer clustering queries on such a graph using only local exploration. However, existing works do not show how to learn the structure of connections between the clusters. In particular, to the best of our knowledge, no approach in existing literature can resolve the following simple question:

Distinguish between the clusters being arranged in a line and the clusters forming an (appropriately subsampled) clique (See Fig. 1).

Figure 1: Clusters arranged in a line (Left); Clusters forming a clique (Right).

More generally, we would like to design a sublinear time algorithm that approximates the hierarchical clustering properties of k-clusterable graphs. Hierarchical clustering is a useful primitive in machine learning and data science with essential applications in information retrieval [32, 8], social networks [21] and phylogenetics [18]. Informally, in hierarchical clustering the objective is to construct a hierarchy of partitions that explain the cluster structure of the graph at different scales – note that such a partitioning looks very different in the two cases (line vs. clique) above. Formally, the quality of such a hierarchy of partitions is often evaluated using Dasgupta cost [16], and the main question studied in our paper is

Is it possible to approximate the Dasgupta cost of a (k,Ω(1),ϵ)-clusterable graph using few queries to the input graph and a few correctly clustered seed vertices?

In practice an algorithm operating in the seeded model [7] most often does not have full control over the seeds, but rather is given a list generated by some external process. To model this, we assume that the seed vertices are sampled independently from the input graph, with probability proportional to their degrees: we refer to this model as the random sample model.

The case 𝒌=𝟏, i.e. approximating Dasgupta cost of an expander

When k=1, the input is a single expander, i.e. a single cluster, and we approximate its Dasgupta cost in sublinear time using degree queries on the seeds. At first glance one might think that the Dasgupta cost of an expander can be approximated well simply as a function of its number of vertices and average degree, but this is only the case for regular expanders. The irregular case is nontrivial; a poly(1/φ) approximation was recently given by [31]. As our first result, we give an algorithm approximating the Dasgupta cost of an (irregular) φ-expander using n^{1/3} seed vertices (and degree queries on these vertices). This, somewhat surprisingly, turns out to be a tight bound. Specifically, we show

Theorem 1 (Approximating Dasgupta cost of an expander).

The Dasgupta cost of a φ-expander can be approximated to within a poly(1/φ) factor using degree queries on n^{1/3} seed vertices. Furthermore, the bound of n^{1/3} is tight up to polylogarithmic factors.

The case 𝒌>𝟏

For k>1 we leverage recent results on clustering oracles to decompose the problem of approximating the Dasgupta cost into two subproblems: (1) approximating the Dasgupta cost of the individual clusters and (2) approximating the Dasgupta cost of the contracted graph, in which each cluster is contracted into a supernode. Such a decomposition is only possible for bounded degree graphs (see Example 4.2 in [31]), so this is the setting we work in. We show that access to a few seed vertices is sufficient to obtain oracle access to the cut function (and, more generally, the quadratic form of the Laplacian) of the contracted graph in time n^{1/2+O(ϵ)}. Our main result is Theorem 2 below:

Theorem 2 (Informal version of Theorem 9).

There exists an algorithm that for every (k,Ω(1),ϵ)-clusterable bounded degree graph G=(V,E) estimates the Dasgupta cost of G up to an O(√(log k)) factor in the random sample model in time n^{1/2+O(ϵ)}·(d_max)^{O(1)}.

 Remark 3.

We remark that our algorithm for estimating the Dasgupta cost from Theorem 9 can be adapted to provide oracle access to a low cost hierarchical clustering tree.

 Remark 4.

One can verify, by adapting the lower bound of Ω(√n) on expansion testing due to Goldreich and Ron [23], that at least Ω(√(n/k)) queries are needed for an o(k/log k) approximation for constant k in this model. The proof is a rather direct adaptation of the classical result of Goldreich and Ron, and we therefore do not present it.

 Remark 5.

Recall that in our random sample model for seed vertices, the seeds are sampled independently with probability proportional to their degrees. This model matches quite closely what happens in practice, in the sense that the algorithm does not always have full control over the seeds [7]. One can also consider the stronger model in which the algorithm can ask for the correct label of any vertex of its choosing. This model is significantly stronger, and in particular one can design an algorithm that obtains the same approximation to the Dasgupta cost as our Theorem 2 above, but with time complexity polynomial in d, log n and 1/ϵ.

We note that the currently best known approximation to the Dasgupta cost of an n-vertex graph is O(√(log n)), achieved by the recursive sparsest cut algorithm of [11]. Our approximation is O(√(log k)), matching what the Charikar and Chatziafratis algorithm achieves on k-node graphs. In fact, our main technical contribution is an efficient way of simulating this algorithm in sublinear time on k-clusterable graphs.

Related work on (𝒌,𝛀(𝟏),ϵ)-clusterable graphs.

Such graphs have been extensively studied in the property testing framework as well as in local computation models. The testing version of the problem, where one essentially wants to determine k, the number of clusters in G, in sublinear time, generalizes the well-studied problem of testing graph expansion, where one wants to distinguish between an expander (i.e. a good single cluster) and a graph with a sparse cut (i.e. at least two clusters). [24] showed that expansion testing requires Ω(√n) queries; then [15, 27, 36] developed algorithms that distinguish an expander from a graph that is far from a graph with conductance ϵ in time n^{1/2+O(ϵ)}, which the recent work of [12] showed to be tight. The setting of k>2 has seen a lot of attention recently [14, 12, 37, 22], with close to information theoretically optimal clustering oracles, i.e. small space data structures that provide quick access to an approximate clustering, obtained in [22]. More recently, [31] studied hierarchical clustering of k-clusterable graphs and developed a nearly linear time algorithm that approximates the Dasgupta cost of the graph up to a constant factor. However, their algorithm requires significantly stronger assumptions on the input data, namely ϵ ≤ 1/k^{O(1)}, and it does not run in sublinear time. Note that the problem of estimating the Dasgupta cost becomes non-trivial when ϵ ≳ 1/k, i.e., when the Dasgupta cost of the graph is dominated by the outgoing edges between different clusters.¹

¹ For instance, in a d-regular, (k,φ,ϵ)-clusterable graph, one can easily show that the Dasgupta cost is at least Ω(φdn²/k), simply because of the contribution of the k induced φ-expanders. On the other hand, the total number of edges running between the clusters is bounded by ϵdn, and therefore their total contribution to the Dasgupta cost is O(ϵdn²). Thus, the problem becomes non-trivial when ϵ ≳ 1/k.

The work most closely related to our setting is [28], where the authors provide a sublinear algorithm for hierarchical clustering. However, their algorithm works under significantly stronger assumptions on the input instance. They introduce the notion of hierarchically clusterable graphs, which assumes a planted hierarchical clustering structure not only at the bottom level of the hierarchy but at every level, and their result relies on several properties of such graphs. In contrast, we only assume that the input graph is k-clusterable. For this reason we cannot use the techniques developed in [28], and we need to develop a completely new approach.

Very recently, [1, 6] considered the problem of hierarchical clustering under the Dasgupta objective in the streaming model. Both papers give a one-pass Õ(n)-memory streaming algorithm which finds a tree with Dasgupta cost within an O(√(log n)) factor of the optimum in polynomial time. Additionally, [1] also considers this problem in the query model and presents an O(√(log n))-approximate hierarchical clustering using Õ(n) queries, without making any clusterability assumptions on the input graph. On the other hand, our algorithms assume the graph is k-clusterable and approximate the Dasgupta cost within an O(√(log k)) factor in sublinear time.

Related work on hierarchical clustering.

We briefly review developments in the area of algorithms for hierarchical clustering since the introduction of Dasgupta’s objective function. Dasgupta designed an algorithm based on the recursive sparsest cut that provides an O(log^{3/2} n) approximation for his objective function. This was improved by Charikar and Chatziafratis, who showed that the recursive sparsest cut algorithm already returns a tree with approximation guarantee O(√(log n)) [11]. Furthermore, they showed that it is impossible to approximate the Dasgupta cost within a constant factor in general graphs under the Small-Set Expansion hypothesis. More recently, [13] studied this problem in a regime in which the input graph is sampled from a hierarchical stochastic block model, and constructed a tree in nearly linear time that approximates the Dasgupta cost of the graph up to a constant factor. The hierarchical stochastic block model of [13] generates close to regular expanders with high probability, and their analysis crucially relies on having dense clusters and large degrees. Our model allows for arbitrary expanders as opposed to dense random graphs and is more expressive in this sense.

Related work in semi-supervised active clustering.

We note that our model is also related to the semi-supervised active clustering framework (SSAC) introduced in [5]. In this model we are given a set X of n points and an oracle answering same-cluster queries of the form “are these two points in the same cluster?”. Thanks to its elegance and applications to crowdsourcing, the model has received a lot of attention and has been extensively studied both in theory [2, 3, 9, 10, 26, 33, 34, 35, 38, 42] and in practice [20, 25, 40, 41] – see also [19] for other types of queries.

1.1 Basic definitions

Definition 6 (Inner and outer conductance).

Let G=(V,E) be a graph. For a set C⊆V and a set S⊆C, let E(S, C∖S) be the set of edges with one endpoint in S and the other in C∖S. The conductance of S within C is φ_C^G(S) = |E(S, C∖S)| / vol(S). The outer conductance of C is defined to be φ_out^G(C) = φ_V^G(C) = |E(C, V∖C)| / vol(C). The inner conductance of C⊆V is defined to be φ_in^G(C) = min_{S⊆C, 0<vol(S)≤vol(C)/2} φ_C^G(S) if |C|>1, and 1 otherwise.
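
To make Definition 6 concrete, the following minimal Python sketch (our own illustration, not from the paper) computes these conductances by brute force for a small unweighted graph given as adjacency sets; the exponential-time inner conductance is for illustration only.

from itertools import combinations

def vol(G, S):
    return sum(len(G[x]) for x in S)

def phi(G, S, C):
    """Conductance of S within C: |E(S, C \\ S)| / vol(S)."""
    S, C = set(S), set(C)
    cut = sum(1 for x in S for y in G[x] if y in C - S)
    return cut / vol(G, S)

def phi_out(G, C):
    return phi(G, C, G.keys())

def phi_in(G, C):
    """Brute-force inner conductance (exponential; illustration only)."""
    C = list(C)
    if len(C) <= 1:
        return 1.0
    cands = [S for r in range(1, len(C)) for S in combinations(C, r)
             if 0 < vol(G, S) <= vol(G, C) / 2]
    return min(phi(G, S, C) for S in cands)

# Example: a triangle {0,1,2} with a pendant vertex 3 attached to 2.
# G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}; phi_out(G, {0, 1, 2}) == 1/7.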

We define k-clusterable graphs as a class of instances that can be partitioned into k expanders.

Definition 7 ((k,φ,ϵ)-clustering).

Let G=(V,E) be a graph. A (k,φ,ϵ)-clustering of G is a partition of the vertex set V into disjoint subsets C_1,…,C_k such that for all i∈[k], φ_in^G(C_i) ≥ φ and φ_out^G(C_i) ≤ ϵ, and for all i,j∈[k] one has vol(C_i)/vol(C_j) = O(1). A graph G is called (k,φ,ϵ)-clusterable if there exists a (k,φ,ϵ)-clustering of G.

For Theorem 2, we assume that the degree of every vertex is brought up to the maximum degree by adding self-loops, and we use the notion of conductance corresponding to the graph with the added self-loops.

Dasgupta cost.

Hierarchical clustering is the task of partitioning the vertices of a graph into nested clusters. The nested partitions can be represented by a rooted tree whose leaves correspond to the vertices of the graph, and whose internal nodes represent clusters of vertices. Dasgupta introduced a natural optimization framework for formulating hierarchical clustering tasks as an optimization problem [16]. We recall this framework now. Let T be any rooted tree whose leaves are the vertices of the graph. For any node x of T, let T[x] be the subtree rooted at x, and let leaves(T[x])⊆V denote the leaves of this subtree. For leaves x,y∈V, let LCA(x,y) denote the lowest common ancestor of x and y in T. In other words, T[LCA(x,y)] is the smallest subtree whose leaves contain both x and y.

Definition 8 (Dasgupta cost [16]).

The Dasgupta cost of the tree T for the graph G=(V,E) is defined to be COST_G(T) = ∑_{{x,y}∈E} |leaves(T[LCA(x,y)])|.
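
As an illustration, the following Python sketch (ours; the input conventions are our own) evaluates COST_G(T) directly from Definition 8 for a tree given by parent pointers.

def dasgupta_cost(edges, parent, leaf_of):
    """edges: graph edges (x, y); parent: tree node -> parent (root -> None);
    leaf_of: graph vertex -> the tree leaf representing it."""
    n_leaves = {}                      # number of graph vertices under each node
    for leaf in leaf_of.values():
        node = leaf
        while node is not None:
            n_leaves[node] = n_leaves.get(node, 0) + 1
            node = parent[node]
    def ancestors(node):               # path from a leaf up to the root
        path = []
        while node is not None:
            path.append(node)
            node = parent[node]
        return path
    cost = 0
    for x, y in edges:
        on_y_path = set(ancestors(leaf_of[y]))
        lca = next(a for a in ancestors(leaf_of[x]) if a in on_y_path)
        cost += n_leaves[lca]          # |leaves(T[LCA(x, y)])|
    return cost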

The random sample model for seed vertices.

We consider a random sample model for seed vertices, in which the algorithm is given a (multi)set S of seed vertices sampled independently with probability proportional to their degrees, together with their cluster labels.

2 Technical overview

In this section we give an overview of our main algorithmic result, stated below as Theorem 9 (the formal version of Theorem 2). It postulates a sublinear time algorithm for estimating the Dasgupta cost of k-clusterable graphs. Here, we use O^*-notation to suppress poly(k), poly(1/φ), poly(1/ϵ) and polylog(n) factors.

Theorem 9.

Let k ≥ 2, φ ∈ (0,1), and let ϵ/φ² be at most a sufficiently small constant. Let G=(V,E) be a bounded degree graph that admits a (k,φ,ϵ)-clustering C_1,…,C_k. Let |V|=n and |E|=m.

There exists an algorithm (EstimatedCost(G); Algorithm 1) that w.h.p. estimates the optimum Dasgupta cost of G within an O(√(log k)/φ^{O(1)}) factor in time O^*(n^{1/2+O(ϵ/φ²)}·(d_max)^{O(1)}), using O^*(n^{O(ϵ/φ²)}·(d_max)^{O(1)}) seed queries.

Our algorithm consists of two main parts. First, we estimate the contribution of the inter-cluster edges to the Dasgupta cost. A natural approach is to contract the clusters C_1,…,C_k into supernodes, and use the Dasgupta cost of the contracted graph (defined below) as a proxy.

Definition 10 (Contracted graph).

Let G=(V,E) be a graph and let 𝒞=(C1,,Ck) denote a partition of V into disjoint subsets. We say that the weighted graph H=([k],([k]2),W,w) is a contraction of G with respect to the partition 𝒞 if for every i,j[k] we have W(i,j)=|E(Ci,Cj)|, and for every i[k] we have w(i)=|Ci|. We denote the contraction of G with respect to the partition 𝒞 by H=G/𝒞.
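
In code, Definition 10 is a single pass over the edges; the Python sketch below (ours, with our own input conventions) makes this explicit. Of course, computing H this way requires reading the entire edge set, which is exactly what a sublinear algorithm cannot afford.

from collections import Counter

def contract(edges, label):
    """Contraction H = G / C of Definition 10. label: vertex -> cluster in [k].
    Returns W[(i, j)] = |E(C_i, C_j)| for i < j, and w[i] = |C_i|."""
    W = Counter()
    for u, v in edges:
        i, j = sorted((label[u], label[v]))
        if i != j:
            W[(i, j)] += 1
    w = Counter(label.values())
    return dict(W), dict(w)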

The problem is of course that it is not clear how to get access to this contracted graph in sublinear time, and our main contribution is a way of doing so. Our approach amounts to first obtaining access to the quadratic form of the Laplacian of the contracted graph H, and then using the hierarchical clustering algorithm of [11] on the corresponding approximation to the contracted graph. Thus, we essentially show how to simulate the algorithm of [11] in sublinear time on (k,φ,ϵ)-clusterable graphs.

Second, the procedure TotalClustersCost approximates the contribution of the internal cluster edges to the Dasgupta cost.

Algorithm 1 below presents our estimator for the Dasgupta cost of the graph.

Algorithm 1 EstimatedCost(G).                                              time m^{1/2+O(ϵ)}

Our algorithm uses a weighted version of the Dasgupta cost (Definition 24), which we denote WCOST, to relate the cost of G to that of the contracted graph H. Our estimate EST in Algorithm 1 then simply sums the contribution from the weighted Dasgupta cost of the tree T~ on the (approximate) contracted graph H~ with the contribution from the clusters. We want to ensure that the estimate always provides an upper bound on the optimal Dasgupta cost of G. To this end, we scale the weighted Dasgupta cost WCOST_{H~}(T~) up by a factor of O(1/φ²) (to account for the multiplicative error), and add a term on the order of ξmnk²/φ² (to account for the additive error). That way we obtain an estimate EST such that

COST(G) ≤ EST ≤ O(√(log k)/φ^{O(1)})·COST(G),

where COST(G) denotes the optimum Dasgupta cost of G.

We outline the main ideas behind accessing the contracted graph in Section 2.2, and present the complete analysis in the full version. The TotalClustersCost procedure simply outputs a fixed value that depends on n, d, and k. We provide more details on this in Section 2.1.

We remark that with a little post-processing, our algorithms for estimating the Dasgupta cost can be adapted to recover a low-cost hierarchical clustering tree. To construct such a tree, we first construct a tree T~ with k leaves on the contracted graph; the algorithm constructs T~ in sublinear time n^{1/2+O(ϵ)}. Then, for every cluster C_i one can construct a particular tree 𝒯_i^{deg} using Algorithm 1 of [31] on the vertices of C_i. Finally, we extend the leaf i of the tree T~ by adding the tree 𝒯_i^{deg} as its direct child. Note that constructing 𝒯_i^{deg} explicitly takes time O(|C_i|); however, this step is only required if one intends to output the full hierarchical clustering tree of G. Otherwise, for only estimating COST(G), we can estimate COST(𝒯_i^{deg}) as a function of the cluster size and the degree without explicitly constructing 𝒯_i^{deg}.

2.1 Estimating Dasgupta cost of an expander using seed queries

In this section, we design an algorithm for estimating the Dasgupta cost of an irregular φ-expander up to a poly(1/φ) factor using n^{1/3} seed queries. We also show that this is optimal (Theorem 13); the proof appears in the full version. Later in the paper, we approximate the contribution of the clusters to the Dasgupta cost of a d-regular (k,φ,ϵ)-clusterable graph, where a more basic approach suffices. In this section, we focus on a single, but irregular, φ-expander.

Theorem 11.

Let G=(V,E) be a φ-expander (possibly with self-loops). Let T denote the tree with optimum Dasgupta cost for G. Then the procedure ClusterCost (Algorithm 3) uses O(n^{1/3}) seed queries and with probability 1 − n^{−101} returns a value such that:

COST(T) ≤ ClusterCost(G) ≤ O(1/φ⁵)·COST(T).

We now outline the proof of Theorem 11.

Let G be a φ-expander, i.e., φ_in(G) ≥ φ. To estimate the Dasgupta cost of G, we use Theorem 12 from [31]. This result shows that there is a specific tree on G, called 𝒯^{deg}, that approximates the Dasgupta cost of G up to O(1/φ⁴). For completeness, we include the algorithm (Algorithm 2) for computing 𝒯^{deg} from [31]. Note that Algorithm 2 from [31] runs in time O(m + n log n); however, we do not need to construct 𝒯^{deg} explicitly. Instead, we design an algorithm that estimates the cost of 𝒯^{deg} in time n^{1/3}.

Algorithm 2 HCwithDegrees(G{V}) [31].
Theorem 12 (Theorem 3 in [31]).

Given any graph G=(V,E,w) with inner conductance φ as input, Algorithm 2 runs in O(m + n log n) time, and returns an HC tree 𝒯^{deg}(G) that satisfies COST_G(𝒯^{deg}(G)) = O(1/φ⁴)·OPT_G.

Our procedure for estimating the Dasgupta cost of the tree returned by Algorithm 2 is based on a simple expression for the (approximate) cost of this tree that we derive (and later show how to approximate by sampling).

Let G=(V,E) be an arbitrary expander with vertices x_1, x_2, …, x_n ordered such that d_1 ≥ d_2 ≥ ⋯ ≥ d_n, where d_i = deg(x_i). We denote by 𝒯^{deg} the Dasgupta tree returned by Algorithm 1 of [31]. Specifically, we show that the cost COST_G(𝒯^{deg}) is approximated to within an O(1/φ) factor by

∑_{i=1}^n i·d_i = ∑_{x∈V} rank(x)·deg(x), (1)

where deg(x) is the degree of x and rank(x) is the rank of x in the ordering of the vertices of V in non-increasing order of degrees. The proof is rather direct, and is presented in the full version. Our task therefore reduces to approximating (1) in sublinear time. To achieve this, we partition the vertices into buckets according to their degree: for every d between 1 and n/φ that is a power of 2, let B_d ≜ {x∈V : d ≤ deg(x) < 2d}. We will refer to B_d as the degree class of d. Let n_d ≜ |B_d| denote the size of the degree class, and let r_d denote the highest rank in B_d. Note that r_d is the number of vertices in G that have degree at least d, so we have r_d = ∑_{t≥d} n_t.

The vertices in B_d have ranks r_d, r_d−1, …, r_d−n_d+1 and degrees in [d, 2d), which gives the bounds

(d/2)·n_d·r_d ≤ d·∑_{i=r_d−n_d+1}^{r_d} i ≤ ∑_{x∈B_d} rank(x)·deg(x) ≤ 2d·∑_{i=r_d−n_d+1}^{r_d} i ≤ 2d·n_d·r_d, (2)

so our task is further reduced to estimating the quantity

∑_d d·n_d·r_d, (3)
where the sum ranges over the degree classes d.
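
As a sanity check of the reduction from (1) to (3), here is a short Python sketch (ours) that computes both quantities exactly for a given degree sequence of positive degrees; by (2), they agree up to a factor of 2.

from collections import Counter

def proxy_cost(degs):                        # formula (1): sum_x rank(x) * deg(x)
    return sum((i + 1) * d for i, d in enumerate(sorted(degs, reverse=True)))

def bucketed_proxy(degs):                    # quantity (3): sum_d d * n_d * r_d
    n = Counter(1 << (deg.bit_length() - 1) for deg in degs)   # n_d = |B_d|
    return sum(d * n_d * sum(n_t for t, n_t in n.items() if t >= d)
               for d, n_d in n.items())      # r_d = sum_{t >= d} n_t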

We do so by sampling: simply sample n^{1/3} vertices, and approximate the number of vertices n_d and the highest rank r_d of each degree class. This is summarized in Algorithm 3 below.

Algorithm 3 ClusterCost(G, S, m̂).
# S is a (multi)set of size s of vertices in G=(V,E)
# m̂ is a constant factor estimate of |E|
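
The following Python sketch (ours; thresholds and constants are simplified, and the real Algorithm 3 differs in its details) illustrates the idea behind the estimator: draw s ≈ n^{1/3} degree-proportional seeds, estimate n_t and r_t for the degree classes that are hit, and plug the estimates into (3).

import random
from collections import Counter

def estimate_proxy_cost(degs, m_hat, s):
    """degs: degrees (queried only on sampled vertices); m_hat: a constant
    factor estimate of |E|; s ~ n^{1/3} degree-proportional samples."""
    sample = random.choices(range(len(degs)), weights=degs, k=s)
    hits = Counter(1 << (degs[x].bit_length() - 1) for x in sample)
    # E[hits[t]] ~ s * n_t * t / (2m), so n_t is estimated by 2 * m * hits[t] / (s * t).
    n_est = {t: 2 * m_hat * c / (s * t) for t, c in hits.items()}
    est = 0.0
    for t, n_t in n_est.items():
        r_t = sum(n_u for u, n_u in n_est.items() if u >= t)
        est += t * n_t * r_t             # plug the estimates into (3)
    return est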

While the algorithm is simple, the analysis is quite interesting, and the bound of n^{1/3} on the number of seeds is tight! We now outline the main ideas behind the analysis of the algorithm.

Ideally, we would like to estimate the number of vertices n_d and the highest rank r_d of every degree class d. However, this is hard to achieve, as some degree classes may be small. The crux of the analysis is showing that with n^{1/3} samples, we can approximate n_d and r_d for every degree class that contributes at least a (1/log n)-fraction of the Dasgupta cost.

Recalling that our model assumes degree-proportional sampling, the expected number of samples from any degree class B_t is

(s/2m)·∑_{x∈B_t} deg(x) ≈ (s/m)·n_t·t,

where s is the total number of samples. Thus, we can estimate n_t and r_t whenever n_t·t ≥ Ω(m/s).

Now, consider the degree class d with the highest degree mass n_d·d. Since there are at most log(n/φ) different degree classes, we have n_d·d ≥ Ω(m/log n). Thus, we can estimate the contribution to the Dasgupta cost of any degree class B_t which satisfies

(n_d·d)/(n_t·t) ≤ O(s).

Using the degree class d as a reference, we show that any degree class t that has a significant contribution to the Dasgupta cost must have a sufficiently large degree mass compared to d.

Specifically, if B_t is a degree class that contributes at least a (1/log n)-fraction of the Dasgupta cost, i.e.

∑_{x∈B_t} rank(x)·deg(x) ≥ (1/log n)·∑_{x∈V} rank(x)·deg(x),

then, by Equation (2), we have

2t·r_t·n_t ≥ ∑_{x∈B_t} rank(x)·deg(x) ≥ (1/log n)·∑_{x∈V} rank(x)·deg(x) ≥ (1/log n)·∑_{i=1}^{r_t} i·t ≥ (1/log n)·(r_t²·t/2).

From this, we conclude that n_t ≳ r_t (up to a logarithmic factor), allowing us to use the quantity n_t²·t as a further proxy for the contribution of B_t to the Dasgupta cost.

Furthermore, we show that if d is our high-degree-mass reference class and t is any degree class that contributes at least a (1/log n)-fraction of the Dasgupta cost, then n_t²·t ≥ n_d²·d (up to logarithmic factors). Intuitively, this is because the contribution from B_t is no smaller than the contribution from B_d.

Therefore, the following optimization problem provides an upper bound on the sufficient number of samples.

max_{t, n_t, d, n_d}  (n_d·d)/(n_t·t)
such that
n_t²·t ≥ n_d²·d        # B_t has a large contribution to the Dasgupta cost
n_t, n_d ≤ n           # at most n vertices
n_t, t, d ≥ 1          # B_t is non-empty and degrees are non-zero
n_d ≥ 0.

However, the above optimization problem is too weak. For example, setting n_d = n^{1/2}, d = n, n_t = n, t = 1 gives a feasible solution with value n^{1/2}. But this solution would correspond to having n^{1/2} vertices of degree n and n vertices of degree 1, which is impossible in an actual graph. We remedy this by adding an additional constraint that encodes that t, n_t, d, n_d arise from a valid graph.

max_{t, n_t, d, n_d}  (n_d·d)/(n_t·t)
such that
n_t²·t ≥ n_d²·d        # B_t has a large contribution to the Dasgupta cost
d ≤ n_d                # B_d does not have too many edges to V∖B_d
n_t, n_d ≤ n           # at most n vertices
n_t, t, d ≥ 1          # B_t is non-empty and degrees are non-zero
n_d ≥ 0.

A priori, there is no reason why the constraint d ≤ n_d should be satisfied by our reference class B_d. However, we show that for any graph, it is possible to find a reference class B_d which satisfies d ≤ n_d and contributes a large fraction of the degree mass. Intuitively, this is because if all the high-degree-mass classes had d > n_d, then they would require too many edges to be routed outside of their degree class, eventually exhausting the available vertices. See the full version for the details.

Finally, we prove that the refined optimization problem has optimal value n^{1/3}. Therefore n^{1/3} samples suffice to discover any degree class t with a non-trivial contribution to the Dasgupta cost. The full analysis is presented in the full version.
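
The scaling of the optimum can also be checked numerically; the brute-force Python sketch below (ours) searches over powers of 2 and exhibits the n^{1/3} behaviour of the refined program.

def refined_opt(n):
    pows = [1 << i for i in range(n.bit_length())]
    best = 0.0
    for d in pows:
        for n_d in pows:
            if not (d <= n_d <= n):              # d <= n_d, at most n vertices
                continue
            for t in pows:
                for n_t in pows:
                    if n_t <= n and n_t * n_t * t >= n_d * n_d * d:
                        best = max(best, n_d * d / (n_t * t))
    return best

for n in (2 ** 12, 2 ** 15, 2 ** 18):
    print(n, refined_opt(n), round(n ** (1 / 3)))   # the optimum tracks n^{1/3}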

We also show that Ω(n^{1/3}) seeds are necessary to approximate ∑_{x∈V} rank(x)·deg(x) to within any constant factor:

Theorem 13.

For every constant α>1 and n sufficiently large, there exists a pair of expanders G and G′ such that ∑_{i=1}^n i·d_i ≤ n², ∑_{i=1}^n i·d′_i ≥ α·n², and at least Ω(n^{1/3}) vertices need to be queried in order to have probability above 2/3 of distinguishing between them (where d_1 ≥ ⋯ ≥ d_n is the degree sequence of G and d′_1 ≥ ⋯ ≥ d′_n is the degree sequence of G′).

Figure 2 below illustrates the graphs G and G′ from Theorem 13. Graph G consists of a set A of n^{2/3} vertices of degree n^{2/3}, with the remaining vertices having degree 1. Graph G′ has a set A′ of n^{2/3} vertices of degree n^{2/3}, but the remaining vertices have degree α. In order to distinguish the two graphs, we need to query a vertex outside of A or A′; since the vertices outside A carry only an ≈ n^{−1/3} fraction of the degree mass, this requires Ω(n^{1/3}) queries in expectation. The proof of Theorem 13 is straightforward, and is included in the full version.

Figure 2: Illustration of the two instances in Theorem 13.

Finally, we describe our procedure TotalClustersCost for approximating the total contribution of the clusters to the Dasgupta cost of a d-regular (k,φ,ϵ)-clusterable graph. To approximate the contribution of a single cluster C_i, it suffices to apply the formula ∑_{x∈C_i} rank(x)·d = (|C_i|(|C_i|+1)/2)·d ≈ |C_i|²·d/2. The procedure TotalClustersCost simply sums these contributions over all the clusters.
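
As a quick sketch (ours) of this computation, for cluster sizes |C_i| and common degree d:

def total_clusters_cost(cluster_sizes, d):
    # each cluster C_i contributes |C_i| * (|C_i| + 1) / 2 * d
    return sum(c * (c + 1) * d // 2 for c in cluster_sizes)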

2.2 Sublinear time access to the contracted graph

In this section, we consider a (k,φ,ϵ)-clusterable graph with bounded maximum degree. For simplicity of presentation, we assume without loss of generality that the graph is d-regular: by a standard reduction, we can convert a graph with maximum degree d into a d-regular graph by adding self-loops to each vertex.

We denote the Laplacian of G by ℒ_G and the normalized Laplacian of G by L_G. We will use the following notation for additive-multiplicative approximation.

Definition 14.

For x, y ∈ ℝ, write

x ≈_{a,b} y if (1/a)·y − b ≤ x ≤ a·y + b.

For matrices X, Y ∈ ℝ^{k×k}, write

X ≈_{a,b} Y if (1/a)·Y − b·m·I_k ⪯ X ⪯ a·Y + b·m·I_k.
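
In code, checking the matrix version of Definition 14 amounts to two positive semidefiniteness tests; a small numpy sketch (ours, with a tolerance of our choosing):

import numpy as np

def approx_eq(X, Y, a, b, m, tol=1e-9):
    """X ~_{a,b} Y iff Y/a - b*m*I <= X <= a*Y + b*m*I in the Loewner order."""
    I = np.eye(X.shape[0])
    psd = lambda A: np.linalg.eigvalsh(A).min() >= -tol
    return psd(a * Y + b * m * I - X) and psd(X - (Y / a - b * m * I))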

Let H = G/𝒞 be the contraction of G with respect to the underlying clustering 𝒞=(C_1, C_2, …, C_k) (Definition 10). We write H=([k],([k]2),W) to emphasize that the vertex set of the contracted graph H corresponds to the clusters of G, and for i,j∈[k], the pair (i,j) is an edge of H with weight W(i,j)=|E(C_i,C_j)|. If we were explicitly given the adjacency/Laplacian matrix of the contracted graph H, then finding a good Dasgupta tree for H could easily be done using the algorithm of [11], which gives an O(√(log k)) approximation to the optimal tree for H (and an O(√(log k)/φ^{O(1)}) approximation to the optimal tree for G, as shown in Theorem 9).

The problem is that we do not have explicit access to the Laplacian of the contracted graph (denoted by ℒ_H). However, to get a good approximation to the Dasgupta cost of H, it suffices to provide explicit access to a Laplacian ℒ~ (which corresponds to a graph H~) whose cuts approximate the sizes of the corresponding cuts in H in the following sense: there exist α>1, β>0 such that for all S⊆[k],

|E~(S, V∖S)| ≈_{α,β} |E(S, V∖S)|.

Motivated by this observation, we simulate this access approximately by constructing a matrix ℒ~ which spectrally approximates ℒ_H in the sense that

ℒ_H ≈_{a,ξ} ℒ~ (4)

in time m^{1/2+O(ϵ/φ²)}·poly(1/ξ) for some 0 < ξ < 1 < a (see Theorem A.2 in the full version). So, our immediate goal is to spectrally approximate ℒ_H. We describe this next.

Spectrally approximating ℒ_H.

The key insight behind our spectral approximation ℒ~ to ℒ_H comes from considering the case where the graph is a collection of k disjoint expanders, each on n/k vertices. To understand this better, let L_G = UΛU^T denote the eigendecomposition of the normalized Laplacian. Let M ≜ (1/2)·I + (1/2d)·A denote the lazy random walk matrix, and note that M = I − (1/2)·L_G. Let M = UΣU^T denote the eigendecomposition of the lazy random walk matrix. Letting U_[k] ∈ ℝ^{n×k} denote the matrix whose columns are the first k columns of U, we will use random sampling to obtain our spectral approximation ℒ~ to the matrix I − U_[k]Σ_[k]U_[k]^T. Indeed, for the instance consisting of k disjoint equal-sized expanders, note that I − U_[k]Σ_[k]U_[k]^T = Ū_[k]Σ̄_[k]Ū_[k]^T, where Ū_[k] ∈ ℝ^{n×(n−k)} is the matrix whose columns are the last n−k columns of U. Using the information that λ_n ≥ λ_{n−1} ≥ ⋯ ≥ λ_{k+1} ≥ φ²/2, one can compare the quadratic forms of I − U_[k]Σ_[k]U_[k]^T and L_G (the normalized Laplacian of G) to show that ℒ~ ≈_{O(1/φ²),ξ} ℒ_H.

We will now describe this in more detail. First, we show that the matrix I − U_[k]Σ_[k]U_[k]^T approximates the quadratic form of L_G multiplicatively. Then, we describe how this allows us to approximate the quadratic form of ℒ_H. Finally, we outline how to approximate the matrix I − U_[k]Σ_[k]U_[k]^T.

First, we introduce a definition central to this work, the notion of a spectral embedding.

Definition 15 (k-dimensional spectral embedding).

For every vertex x, we let f_x = U_[k]^T·𝟙_x be the k-dimensional spectral embedding of vertex x.

The spectral embeddings of the vertices of a graph provide rich geometric information which has been shown to be useful in graph clustering [30, 14, 12, 22]. The remark below asserts that the inner products between f_x and f_y are well-defined even though the choice of these vectors may not be basis-free. First, we need the following standard result on the eigenvalues of (k,φ,ϵ)-clusterable graphs [30, 12].

Lemma 16 (Lemma 3 in [22]).

Let G=(V,E) be a d-regular graph that admits a (k,φ,ϵ)-clustering. Let λ_1 ≤ ⋯ ≤ λ_n be the eigenvalues of L_G. Then we have λ_k ≤ 2ϵ and λ_{k+1} ≥ φ²/2.

 Remark 17.

Take a (k,φ,ϵ)-clusterable graph G where ϵ/φ² is smaller than a constant. Then the space spanned by the bottom k eigenvectors of the normalized Laplacian of G is uniquely defined, i.e. the choice of U_[k] is unique up to multiplication on the right by an orthonormal matrix R ∈ ℝ^{k×k}. Indeed, by Lemma 16 it holds that λ_k ≤ 2ϵ and λ_{k+1} ≥ φ²/2. Since we assume that ϵ/φ² is smaller than an absolute constant, we have 2ϵ < φ²/2, and thus the subspace spanned by the bottom k eigenvectors of the Laplacian, i.e. the column space of U_[k], is uniquely defined, as required. We note that while the choice of f_x for x∈V is not unique, the dot product between the spectral embeddings of x∈V and y∈V is well defined, since for every orthonormal R ∈ ℝ^{k×k} one has

⟨Rf_x, Rf_y⟩ = (Rf_x)^T(Rf_y) = f_x^T(R^TR)f_y = f_x^T f_y.

Since G is (k,φ,ϵ)-clusterable, by Remark 17 the space spanned by the top k eigenvectors of M is uniquely defined. Thus, for any z ∈ ℝⁿ, the quantity z^T(U_[k]Σ_[k]U_[k]^T)z is well defined.

Having observed this, we now show that the quadratic form of I − U_[k]Σ_[k]U_[k]^T approximates the quadratic form of L_G multiplicatively.

Lemma 18.

Suppose that G is d-regular, and let L_G and M denote the normalized Laplacian and the lazy random walk matrix of G. Let M = UΣU^T denote the eigendecomposition of M. Then for any vector z ∈ ℝⁿ with ‖z‖₂ = 1 we have

(1/2)·z^T L_G z ≤ z^T(I − U_[k]Σ_[k]U_[k]^T)z ≤ (3/φ²)·z^T L_G z.
Proof.

Recall that U_[k] ∈ ℝ^{n×k} is the matrix whose columns are the first k columns of U, and Σ_[k] ∈ ℝ^{k×k} is the submatrix of Σ formed by its first k rows and columns. Let Ū_[k] ∈ ℝ^{n×(n−k)} be the matrix whose columns are the last n−k columns of U, and let Σ̄_[k] ∈ ℝ^{(n−k)×(n−k)} be the submatrix of Σ formed by its last n−k rows and columns. Thus, the eigendecomposition of M is M = UΣU^T = U_[k]Σ_[k]U_[k]^T + Ū_[k]Σ̄_[k]Ū_[k]^T. Note that M = I − L_G/2; thus we have

z^T(U_[k]Σ_[k]U_[k]^T)z + z^T(Ū_[k]Σ̄_[k]Ū_[k]^T)z = z^T M z = 1 − (z^T L_G z)/2, (5)

which by rearranging gives

(z^T L_G z)/2 ≤ 1 − z^T(U_[k]Σ_[k]U_[k]^T)z = (z^T L_G z)/2 + z^T(Ū_[k]Σ̄_[k]Ū_[k]^T)z. (6)

The first inequality gives 1 − z^T(U_[k]Σ_[k]U_[k]^T)z ≥ (z^T L_G z)/2, as desired.

To establish the second inequality above, we will show z^T(Ū_[k]Σ̄_[k]Ū_[k]^T)z ≤ (2/φ²)·z^T L_G z. Let z = ∑_{i=1}^n α_i u_i be the expansion of z in the eigenbasis of L_G. Note that

z^T L_G z = ∑_{i=1}^n λ_i α_i² ≥ λ_{k+1}·∑_{i=k+1}^n α_i².

By Lemma 16 we have λ_{k+1} ≥ φ²/2. This gives

∑_{i=k+1}^n α_i² ≤ (z^T L_G z)/λ_{k+1} ≤ (2/φ²)·z^T L_G z. (7)

Finally, putting (6) and (7) together, and using that the eigenvalues of M are at most 1, we get

1 − z^T(U_[k]Σ_[k]U_[k]^T)z = (z^T L_G z)/2 + z^T(Ū_[k]Σ̄_[k]Ū_[k]^T)z
 ≤ z^T L_G z · (1/2 + 2/φ²)
 ≤ (3/φ²)·z^T L_G z.

Now, we apply Lemma 18 to estimate the quadratic form of ℒ_H on a vector z ∈ ℝᵏ. To that effect, for z ∈ ℝᵏ we define its natural extension z_ext ∈ ℝⁿ: for every x∈V, we let z_ext(x) = z_i, where C_i is the cluster that x belongs to.

Note that z^T ℒ_H z = z_ext^T ℒ_G z_ext = d·z_ext^T L_G z_ext. Thus, to estimate z^T ℒ_H z it suffices to design a good estimate for z_ext^T L_G z_ext, for which we use z_ext^T(I − U_[k]Σ_[k]U_[k]^T)z_ext, as per Lemma 18.
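
The identity z^T ℒ_H z = z_ext^T ℒ_G z_ext is easy to verify numerically; the following numpy sketch (ours; the indicator matrix P is our construction) contracts G explicitly and compares the two quadratic forms.

import numpy as np

def check_contraction_identity(A, label, k, z):
    """A: n x n symmetric 0/1 adjacency matrix; label: array with label[x] in
    range(k); z: vector in R^k. Checks z^T L_H z == z_ext^T L_G z_ext."""
    n = A.shape[0]
    L_G = np.diag(A.sum(axis=1)) - A                 # Laplacian of G
    P = np.zeros((n, k))                             # cluster indicator matrix
    P[np.arange(n), label] = 1.0
    W = P.T @ A @ P                                  # W(i, j) = |E(C_i, C_j)|
    W_off = W - np.diag(np.diag(W))                  # drop intra-cluster edges
    L_H = np.diag(W_off.sum(axis=1)) - W_off         # Laplacian of H = G / C
    z_ext = P @ z                                    # z_ext(x) = z_{label(x)}
    return np.isclose(z @ L_H @ z, z_ext @ L_G @ z_ext)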

Finally, we briefly discuss how to estimate the quantity z_ext^T(I − U_[k]Σ_[k]U_[k]^T)z_ext. We have

z_ext^T(I − U_[k]Σ_[k]U_[k]^T)z_ext = ‖z_ext‖₂² − z_ext^T U_[k]Σ_[k]U_[k]^T z_ext = ∑_{i∈[k]} |C_i|·z_i² − z_ext^T U_[k]Σ_[k]U_[k]^T z_ext.

Since the first term on the RHS can be easily approximated in the random sample model, we concentrate on obtaining a good estimate for the second term. We have

z_ext^T U_[k]Σ_[k]U_[k]^T z_ext = ∑_{i,j∈[k]} z_i·z_j ∑_{x∈C_i} ∑_{y∈C_j} ⟨f_x, Σ_[k] f_y⟩, (8)

and therefore, in order to estimate z_ext^T L_G z_ext, it suffices to use a few random samples to estimate the sum above, as long as one is able to compute, with high probability, high accuracy estimates of ⟨f_x, Σ_[k] f_y⟩ for x,y∈V. We refer to such a primitive as a weighted dot product oracle, since it computes a weighted dot product between the k-dimensional spectral embeddings f_x and f_y for x,y∈V. Assuming such an estimator, which we denote by WeightedDotProductOracle, our algorithm ApproxContractedGraph (Algorithm 4 below) obtains an approximation to the Laplacian of the contracted graph.

Algorithm 4 ApproxContractedGraph(G,ξ,𝒟).                                 time m^{1/2+O(ϵ)}·poly(1/ξ)
Estimating weighted dot products.

Our construction of WeightedDotProductOracle (Algorithm 8 in the full version) for estimating ⟨f_x, Σ_[k] f_y⟩ proceeds along the lines of [22]. We run short random walks of length t ≈ log n/φ² to obtain dot product access to the spectral embeddings of vertices. Given x∈V, let m_x denote the probability distribution of the endpoint of a t-step random walk started from x.

We first show that one can estimate ⟨m_x, m_y⟩ in time m^{1/2+O(ϵ/φ²)}·poly(1/ξ) with probability 1 − n^{−100k}. Then, we construct a Gram matrix 𝒢 ∈ ℝ^{s×s} such that 𝒢_{x,y} = ⟨m_x, m_y⟩ for every x,y∈S, where S is a small set of sampled vertices with |S| = s = m^{O(ϵ)}. Next, we apply an appropriate linear transformation to the Gram matrix 𝒢 and use it to estimate ⟨f_x, Σ_[k] f_y⟩ up to a very small additive error ξ/(n·poly(k)) (see Section C in the full version).
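
For intuition, ⟨m_x, m_y⟩ is the probability that two independent t-step lazy walks, one from x and one from y, end at the same vertex, and can be estimated empirically; the Python sketch below is our simplification, and the actual oracle of [22] is considerably more refined.

import random
from collections import Counter

def lazy_walk(adj, x, t):
    """One t-step lazy walk: stay put w.p. 1/2, else move to a uniformly
    random neighbour (adj: neighbour lists of a d-regular graph)."""
    for _ in range(t):
        if random.random() >= 0.5:
            x = random.choice(adj[x])
    return x

def est_dot_mx_my(adj, x, y, t, trials):
    """Estimate <m_x, m_y> from empirical endpoint distributions."""
    ex = Counter(lazy_walk(adj, x, t) for _ in range(trials))
    ey = Counter(lazy_walk(adj, y, t) for _ in range(trials))
    return sum(c * ey[z] for z, c in ex.items()) / trials ** 2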

Using semidefinite programming to round ℒ~.

As mentioned above, our proxy for the Laplacian ℒ_H is obtained via an approximation to I − U_[k]Σ_[k]U_[k]^T. However, this approximator might not even be a Laplacian. To remedy this, we first show that, using calls to the weighted dot product oracle, we can approximate all the entries of I − U_[k]Σ_[k]U_[k]^T to within very good precision. Starting from such an approximation, one can use semidefinite programming to round the intermediate approximator to a bona fide Laplacian ℒ~. In some more detail, we show the following.

Theorem 19 (Informal version of Theorem A.2 in the full version).

The algorithm ApproxContractedGraph (Algorithm 4), when given a (k,Ω(1),ϵ)-clusterable graph as input, uses a data structure 𝒟 obtained from an m^{1/2+O(ϵ)}-time preprocessing routine, runs in time m^{1/2+O(ϵ)}, and finds a graph H~ with Laplacian ℒ~ such that with probability 1 − n^{−100} it holds that ℒ~ ≈_{O(1/φ²),ξ} ℒ_H.

Approximating the Dasgupta cost of the contracted graph 𝑯~.

Consider the graph H~=([k],([k]2),W~,w~) returned by the ApproxContractedGraph procedure (Algorithm 4).

Once Theorem 19 has been established, our estimation primitive EstimatedCost (Algorithm 1) uses a simple vertex-weighted extension of a result of [11] to find a tree T~ on H~.

Specifically, we need the following definitions.

Definition 20 (Vertex-weighted sparsest cut problem).

Let H=(V,E,W,w) be a vertex and edge weighted graph. For every set S⊆V, we define the sparsity of the cut (S, V∖S) in the graph H as

Sparsity_H(S) = W(S, V∖S) / (w(S)·w(V∖S)),

where w(S) = ∑_{x∈S} w(x). The vertex-weighted sparsest cut of the graph H is the cut with minimum sparsity, i.e., argmin_{S⊆V} Sparsity_H(S).

Definition 21 (Vertex-weighted recursive sparsest cut algorithm (WRSC)).

Let α>1 and let H=(V,E,W,w) be a vertex and edge weighted graph. Let (S, V∖S) be the vertex-weighted sparsest cut of H. The vertex-weighted recursive sparsest cut algorithm on the graph H is a recursive algorithm that first finds a cut (T, V∖T) such that Sparsity_H(T) ≤ α·Sparsity_H(S), and then recurses on the subgraphs H[T] and H[V∖T].
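
Since WRSC is run on the k-node contracted graph, even an exact brute-force implementation is affordable for small k; the following Python sketch (ours) is one such instantiation.

from itertools import combinations

def wrsc(nodes, W, w):
    """Exact WRSC on a small weighted graph. nodes: list of vertices;
    W[(i, j)]: edge weights; w[i]: vertex weights. Returns a nested tuple."""
    if len(nodes) == 1:
        return nodes[0]
    best, best_S = float("inf"), None
    for r in range(1, len(nodes)):
        for S in map(set, combinations(nodes, r)):
            cross = sum(v for (i, j), v in W.items() if (i in S) != (j in S))
            spars = cross / (sum(w[i] for i in S) *
                             sum(w[i] for i in nodes if i not in S))
            if spars < best:
                best, best_S = spars, S
    A = sorted(best_S)
    B = [i for i in nodes if i not in best_S]
    sub = lambda part: {e: v for e, v in W.items() if set(e) <= set(part)}
    return (wrsc(A, sub(A), w), wrsc(B, sub(B), w))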

Next, we state results which bound the Dasgupta cost of the tree obtained by running the vanilla recursive sparsest cut algorithm on any graph. Then, in Corollary 25, we present the corresponding bound for vertex-weighted graphs.

Theorem 22 (Theorem 2.3 from [11]).

Let G=(V,E) be a graph. Suppose the RSC algorithm uses an α-approximation algorithm for the uniform sparsest cut. Then the RSC algorithm achieves an O(α) approximation of the optimal Dasgupta cost of G.

The following corollary from [11] follows by using the O(√(log |V|))-approximation algorithm for the uniform sparsest cut.

Corollary 23 ([11]).

Let G=(V,E) be a graph. Then the RSC algorithm achieves an O(√(log |V|)) approximation of the optimal Dasgupta cost of G.

Since the clusters of G have different sizes, and since the Dasgupta cost of a graph is a function of the size of the lowest common ancestor of the endpoints of the edges, we use weighted Dasgupta cost to relate the cost of G and the contracted graph H.

Definition 24 (Weighted Dasgupta cost).

Let G=(V,E,W,w) denote a vertex and edge weighted graph. For a tree T with |V| leaves (corresponding to vertices of G), we define the weighted Dasgupta cost of T on G as

WCOST_G(T) = ∑_{(x,y)∈E} W(x,y)·∑_{z∈leaves(T[LCA(x,y)])} w(z).
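
A short Python sketch (ours) computing WCOST for the nested-tuple trees produced by the wrsc sketch above:

def wcost(tree, W, w):
    """Weighted Dasgupta cost (Definition 24); W and w as in the wrsc sketch."""
    if not isinstance(tree, tuple):
        return 0
    def leaves(t):
        return {t} if not isinstance(t, tuple) else leaves(t[0]) | leaves(t[1])
    L, R = leaves(tree[0]), leaves(tree[1])
    mass = sum(w[i] for i in L | R)          # total w(z) below this LCA
    here = mass * sum(v for (i, j), v in W.items() if (i in L) != (j in L))
    sub = lambda part: {e: v for e, v in W.items() if set(e) <= part}
    return here + wcost(tree[0], sub(L), w) + wcost(tree[1], sub(R), w)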

We get the following guarantee on the weighted Dasgupta cost achieved by the WRSC algorithm.

Corollary 25.

Let H=(V,E,W,w) be a vertex and edge weighted graph. Then the WRSC algorithm achieves an O(√(log |V|)) approximation of the optimal weighted Dasgupta cost of H.

Letting T~ = WRSC(H~) be the tree computed by Algorithm 1 and using Corollary 25, we show that the estimate

EST = O(1/φ²)·WCOST_{H~}(T~) + TotalClustersCost(G) + O(ξmnk²/φ²)

computed by Algorithm 1 satisfies

COST(G) ≤ EST ≤ O(√(log k)/φ^{O(1)})·COST(G).

The details are presented in the full version.

References

  • [1] Arpit Agarwal, Sanjeev Khanna, Huan Li, and Prathamesh Patil. Sublinear algorithms for hierarchical clustering. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, NeurIPS, 2022. URL: http://papers.nips.cc/paper_files/paper/2022/hash/16466b6c95c5924784486ac5a3feeb65-Abstract-Conference.html.
  • [2] Nir Ailon, Anup Bhattacharya, and Ragesh Jaiswal. Approximate correlation clustering using same-cluster queries. In Proc. of LATIN, pages 14–27, 2018. doi:10.1007/978-3-319-77404-6_2.
  • [3] Nir Ailon, Anup Bhattacharya, Ragesh Jaiswal, and Amit Kumar. Approximate clustering with same-cluster queries. In Proc. of ITCS, volume 94, pages 40:1–40:21, 2018. doi:10.4230/LIPIcs.ITCS.2018.40.
  • [4] Hassan Ashtiani and Shai Ben-David. Representation learning for clustering: A statistical framework. In Marina Meila and Tom Heskes, editors, Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI 2015, July 12-16, 2015, Amsterdam, The Netherlands, pages 82–91. AUAI Press, 2015. URL: http://auai.org/uai2015/proceedings/papers/305.pdf.
  • [5] Hassan Ashtiani, Shrinu Kushagra, and Shai Ben-David. Clustering with same-cluster queries. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 3216–3224, 2016. URL: https://proceedings.neurips.cc/paper/2016/hash/9597353e41e6957b5e7aa79214fcb256-Abstract.html.
  • [6] Sepehr Assadi, Vaggos Chatziafratis, Jakub Lacki, Vahab Mirrokni, and Chen Wang. Hierarchical clustering in graph streams: Single-pass algorithms and space lower bounds. In COLT, volume 178 of Proceedings of Machine Learning Research, pages 4643–4702. PMLR, 2022. URL: https://proceedings.mlr.press/v178/assadi22a.html.
  • [7] Sugato Basu, Arindam Banerjee, and Raymond Mooney. Semi-supervised clustering by seeding. In In Proceedings of 19th International Conference on Machine Learning (ICML-2002. Citeseer, 2002.
  • [8] Pavel Berkhin. A survey of clustering data mining techniques. In Grouping Multidimensional Data, pages 25–71. Springer, 2006. doi:10.1007/3-540-28349-8_2.
  • [9] Marco Bressan, Nicolò Cesa-Bianchi, Silvio Lattanzi, and Andrea Paudice. Exact recovery of mangled clusters with same-cluster queries. In Advances in Neural Information Processing Systems, volume 33, pages 9324–9334, 2020.
  • [10] Marco Bressan, Nicolò Cesa-Bianchi, Silvio Lattanzi, and Andrea Paudice. Exact recovery of clusters in finite metric spaces using oracle queries. In Proc. of COLT, volume 134, pages 775–803, 2021. URL: http://proceedings.mlr.press/v134/bressan21a.html.
  • [11] Moses Charikar and Vaggos Chatziafratis. Approximate hierarchical clustering via sparsest cut and spreading metrics. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 841–854. SIAM, 2017. doi:10.1137/1.9781611974782.53.
  • [12] Ashish Chiplunkar, Michael Kapralov, Sanjeev Khanna, Aida Mousavifar, and Yuval Peres. Testing graph clusterability: Algorithms and lower bounds. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 497–508. IEEE, 2018. doi:10.1109/FOCS.2018.00054.
  • [13] Vincent Cohen-Addad, Varun Kanade, Frederik Mallmann-Trenn, and Claire Mathieu. Hierarchical clustering: Objective functions and algorithms. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 378–397. SIAM, 2018. doi:10.1137/1.9781611975031.26.
  • [14] Artur Czumaj, Pan Peng, and Christian Sohler. Testing cluster structure of graphs. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015, pages 723–732, 2015. doi:10.1145/2746539.2746618.
  • [15] Artur Czumaj and Christian Sohler. Testing expansion in bounded-degree graphs. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2007), October 20-23, 2007, Providence, RI, USA, Proceedings, pages 570–578. IEEE Computer Society, 2007. doi:10.1109/FOCS.2007.69.
  • [16] Sanjoy Dasgupta. A cost function for similarity-based hierarchical clustering. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 118–127, 2016. doi:10.1145/2897518.2897527.
  • [17] Ayhan Demiriz, Kristin P Bennett, and Mark J Embrechts. Semi-supervised clustering using genetic algorithms. Artificial neural networks in engineering (ANNIE-99), pages 809–814, 1999.
  • [18] Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95(25):14863–14868, 1998.
  • [19] Ehsan Emamjomeh-Zadeh and David Kempe. Adaptive hierarchical clustering using ordinal queries. In Proc. of ACM-SIAM SODA, pages 415–429. SIAM, 2018. doi:10.1137/1.9781611975031.28.
  • [20] Donatella Firmani, Sainyam Galhotra, Barna Saha, and Divesh Srivastava. Robust entity resolution using a crowd oracle. IEEE Data Eng. Bull., 41(2):91–103, 2018. URL: http://sites.computer.org/debull/A18june/p91.pdf.
  • [21] Frédéric Gilbert, Paolo Simonetto, Faraz Zaidi, Fabien Jourdan, and Romain Bourqui. Communities and hierarchical structures in dynamic social networks: analysis and visualization. Social Network Analysis and Mining, 1(2):83–95, 2011. doi:10.1007/S13278-010-0002-8.
  • [22] Grzegorz Gluch, Michael Kapralov, Silvio Lattanzi, Aida Mousavifar, and Christian Sohler. Spectral clustering oracles in sublinear time. In Dániel Marx, editor, Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms, SODA 2021, Virtual Conference, January 10 - 13, 2021, 2021. doi:10.1137/1.9781611976465.97.
  • [23] Oded Goldreich and Dana Ron. Property testing in bounded degree graphs. Algorithmica, 32(2):302–343, 2002. doi:10.1007/S00453-001-0078-7.
  • [24] Oded Goldreich and Dana Ron. On testing expansion in bounded-degree graphs. In Studies in Complexity and Cryptography. Miscellanea on the Interplay between Randomness and Computation - In Collaboration with Lidor Avigad, Mihir Bellare, Zvika Brakerski, Shafi Goldwasser, Shai Halevi, Tali Kaufman, Leonid Levin, Noam Nisan, Dana Ron, Madhu Sudan, Luca Trevisan, Salil Vadhan, Avi Wigderson, David Zuckerman, pages 68–75. Springer, 2011. doi:10.1007/978-3-642-22670-0_9.
  • [25] Anja Gruenheid, Besmira Nushi, Tim Kraska, Wolfgang Gatterbauer, and Donald Kossmann. Fault-tolerant entity resolution with the crowd. CoRR, abs/1512.00537, 2015. arXiv:1512.00537.
  • [26] Wasim Huleihel, Arya Mazumdar, Muriel Médard, and Soumyabrata Pal. Same-cluster querying for overlapping clusters. In Advances in Neural Information Processing Systems 32, pages 10485–10495, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/8a94ecfa54dcb88a2fa993bfa6388f9e-Abstract.html.
  • [27] Satyen Kale and C. Seshadhri. An expansion tester for bounded degree graphs. In Automata, Languages and Programming, 35th International Colloquium, ICALP 2008, Reykjavik, Iceland, July 7-11, 2008, Proceedings, Part I: Tack A: Algorithms, Automata, Complexity, and Games, pages 527–538, 2008. doi:10.1007/978-3-540-70575-8_43.
  • [28] Michael Kapralov, Akash Kumar, Silvio Lattanzi, and Aida Mousavifar. Learning hierarchical cluster structure of graphs in sublinear time. In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 925–939. SIAM, 2023. doi:10.1137/1.9781611977554.CH36.
  • [29] Brian Kulis, Sugato Basu, Inderjit Dhillon, and Raymond Mooney. Semi-supervised graph clustering: a kernel approach. Machine learning, 74(1):1–22, 2009. doi:10.1007/S10994-008-5084-4.
  • [30] James R Lee, Shayan Oveis Gharan, and Luca Trevisan. Multiway spectral partitioning and higher-order cheeger inequalities. Journal of the ACM (JACM), 61(6):37, 2014.
  • [31] Bogdan-Adrian Manghiuc and He Sun. Hierarchical clustering: O(1)-approximation for well-clustered graphs. In NeurIPS, pages 9278–9289, 2021. URL: https://proceedings.neurips.cc/paper/2021/hash/4d68e143defa221fead61c84de7527a3-Abstract.html.
  • [32] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to information retrieval. Cambridge University Press, 2008. doi:10.1017/CBO9780511809071.
  • [33] Arya Mazumdar and Soumyabrata Pal. Semisupervised Clustering, AND-Queries and Locally Encodable Source Coding. In Advances in Neural Information Processing Systems 30, pages 6489–6499, 2017. URL: https://proceedings.neurips.cc/paper/2017/hash/2131f8ecf18db66a758f718dc729e00e-Abstract.html.
  • [34] Arya Mazumdar and Barna Saha. Clustering with noisy queries. In Advances in Neural Information Processing Systems 30, pages 5788–5799, 2017. URL: https://proceedings.neurips.cc/paper/2017/hash/db5cea26ca37aa09e5365f3e7f5dd9eb-Abstract.html.
  • [35] Arya Mazumdar and Barna Saha. Query complexity of clustering with side information. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4682–4693. Curran Associates, Inc., 2017. URL: http://papers.nips.cc/paper/7054-query-complexity-of-clustering-with-side-information.pdf.
  • [36] Asaf Nachmias and Asaf Shapira. Testing the expansion of a graph. Inf. Comput., 208(4):309–314, 2010. doi:10.1016/j.ic.2009.09.002.
  • [37] Pan Peng. Robust clustering oracle and local reconstructor of cluster structure of graphs. In Shuchi Chawla, editor, Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA 2020, Salt Lake City, UT, USA, January 5-8, 2020, pages 2953–2972. SIAM, 2020. doi:10.1137/1.9781611975994.179.
  • [38] Barna Saha and Sanjay Subramanian. Correlation clustering with same-cluster queries bounded by optimal cost. CoRR, abs/1908.04976, 2019. arXiv:1908.04976.
  • [39] Janne Sinkkonen and Samuel Kaski. Clustering based on conditional distributions in an auxiliary space. Neural Computation, 14(1):217–239, 2002. doi:10.1162/089976602753284509.
  • [40] Vasilis Verroios and Hector Garcia-Molina. Entity resolution with crowd errors. In Proc. of IEEE ICDE, pages 219–230, 2015. doi:10.1109/ICDE.2015.7113286.
  • [41] Vasilis Verroios, Hector Garcia-Molina, and Yannis Papakonstantinou. Waldo: An adaptive human interface for crowd entity resolution. In Proc. of ACM SIGMOD, pages 1133–1148, 2017. doi:10.1145/3035918.3035931.
  • [42] Fabio Vitale, Anand Rajagopalan, and Claudio Gentile. Flattening a hierarchical clustering through active learning. In Advances in Neural Information Processing Systems 32, pages 15263–15273, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/03793ef7d06ffd63d34ade9d091f1ced-Abstract.html.