
Approximating Dasgupta Cost in Sublinear Time
from a Few Random Seeds

Michael Kapralov (EPFL, Lausanne, Switzerland), Akash Kumar (Indian Institute of Technology Bombay, India), Silvio Lattanzi (Google Research, Zurich, Switzerland), Aida Mousavifar (Google, Zurich, Switzerland), Weronika Wrzos-Kaminska (EPFL, Lausanne, Switzerland)
Abstract

Testing graph cluster structure has been a central object of study in property testing since the foundational work of Goldreich and Ron [STOC’96] on expansion testing, i.e. the problem of distinguishing between a single cluster (an expander) and a graph that is far from a single cluster. More generally, a (k,ϵ)-clusterable graph G is a graph whose vertex set admits a partition into k induced expanders, each with outer conductance bounded by ϵ. A recent line of work initiated by Czumaj, Peng and Sohler [STOC’15] has shown how to test whether a graph is close to (k,ϵ)-clusterable, and to locally determine which cluster a given vertex belongs to with misclassification rate ϵ, but no sublinear time algorithms for learning the structure of inter-cluster connections are known. As a simple example, can one locally distinguish between the “cluster graph” forming a line and a clique?

In this paper, we consider the problem of testing the hierarchical cluster structure of (k,ϵ)-clusterable graphs in sublinear time. Our measure of hierarchical clusterability is the well-established Dasgupta cost, and our main result is an algorithm that approximates the Dasgupta cost of a (k,ϵ)-clusterable graph in sublinear time, using a small number of randomly chosen seed vertices for which cluster labels are known. Our main result is an O(√(log k)) approximation to the Dasgupta cost of G in n^{1/2+O(ϵ)} time using n^{1/3} seeds, effectively giving a sublinear time simulation of the algorithm of Charikar and Chatziafratis [SODA’17] on clusterable graphs. To the best of our knowledge, ours is the first result on approximating the hierarchical clustering properties of such graphs in sublinear time.

Keywords and phrases:
Sublinear algorithms, Hierarchical Clustering, Dasgupta’s Cost
Category:
Track A: Algorithms, Complexity and Games
Copyright and License:
© Michael Kapralov, Akash Kumar, Silvio Lattanzi, Aida Mousavifar, and Weronika Wrzos-Kaminska; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Streaming, sublinear and near linear time algorithms
Related Version:
Full Version: https://arxiv.org/abs/2207.02581
Editors:
Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis

1 Introduction

Graph clustering is a central problem in data analysis, with applications in a wide variety of scientific disciplines from data mining to social science, statistics and more. The overall objective in these problems is to partition the vertex set of the graph into disjoint “well connected” subgraphs which are sparsely connected to each other. It is quite common in the practice of graph clustering that, besides the graph itself, one is given a small list of vertices together with their correct cluster labels, and one must extend this limited amount of cleanly labelled data to a clustering of the entire graph. This corresponds to the widely used seeded model (see, e.g., [7] and numerous follow-up works [17, 29, 39, 4]). The central question that we consider in this paper is

What can be learned about the cluster structure of the input graph from a few seed nodes in sublinear time?

Formally, we work with the classical model for well-clusterable graphs [14], where the input graph G=(V,E) is assumed to admit a partitioning into a disjoint union of k induced expanders C_1,…,C_k with outer conductance bounded by ϵ ≪ 1 and inner conductance Ω(1). We refer to such instances as (k,Ω(1),ϵ)-clusterable graphs, or (k,ϵ)-clusterable graphs for short. Such graphs have been the focus of significant attention in the property testing literature [15, 27, 36], starting from the seminal work of [24]. A recent line of work has shown how to design nearly optimal sublinear time clustering oracles for such graphs, i.e. algorithms that can consistently answer clustering queries on such a graph using only local exploration. However, existing works do not show how to learn the structure of connections between the clusters. In particular, to the best of our knowledge, no approach in existing literature can resolve the following simple question:

Distinguish between the clusters being arranged in a line and the clusters forming an (appropriately subsampled) clique (See Fig. 1).

Figure 1: Clusters arranged in a line (Left); Clusters forming a clique (Right).

More generally, we would like to design a sublinear time algorithm that approximates the hierarchical clustering properties of k-clusterable graphs. Hierarchical clustering is a useful primitive in machine learning and data science with essential applications in information retrieval [32, 8], social networks [21] and phylogenetics [18]. Informally, in hierarchical clustering the objective is to construct a hierarchy of partitions that explain the cluster structure of the graph at different scales – note that such a partitioning looks very different in the two cases (line vs. clique) above. Formally, the quality of such a hierarchy of partitions is often evaluated using Dasgupta cost [16], and the main question studied in our paper is

Is it possible to approximate the Dasgupta cost of a (k,Ω(1),ϵ)-clusterable graph using few queries to the input graph and a few correctly clustered seed vertices?

In practice an algorithm operating in the seeded model [7] most often does not have full control over the seeds, but rather is given a list generated by some external process. To model this, we assume that the seed vertices are sampled independently from the input graph, with probability proportional to their degrees: we refer to this model as the random sample model.

The case 𝒌=𝟏, i.e. approximating Dasgupta cost of an expander

When k=1, the input is a single expander, i.e. a single cluster, and we approximate its Dasgupta cost in sublinear time using degree queries on the seeds. At first glance one might think that the Dasgupta cost of an expander can be approximated well simply as a function of its number of vertices and average degree, but this is only the case for regular expanders. The irregular case is nontrivial; a poly(1/φ) approximation was recently given by [31]. As our first result, we give an algorithm approximating the Dasgupta cost of an (irregular) φ-expander using n^{1/3} seed vertices (and degree queries on these vertices). This, somewhat surprisingly, turns out to be a tight bound. Specifically, we show

Theorem 1 (Approximating Dasgupta cost of an expander).

The Dasgupta cost of a φ-expander can be approximated to within a poly(1/φ) factor using degree queries on n^{1/3} seed vertices. Furthermore, the bound of n^{1/3} is tight up to polylogarithmic factors.

The case 𝒌>𝟏

For k>1 we leverage recent results on clustering oracles to decompose the problem of approximating the Dasgupta cost into two subproblems: (1) approximating the Dasgupta cost of the individual clusters and (2) approximating the Dasgupta cost of the contracted graph, in which each cluster is contracted into a supernode. Such a decomposition is only possible for bounded degree graphs (see Example 4.2 in [31]), so this is the setting we work in. We show that access to a few seed vertices is sufficient to obtain oracle access to the cut function (and, more generally, the quadratic form of the Laplacian) of the contracted graph in time n^{1/2+O(ϵ)}. Our main result is Theorem 2 below:

Theorem 2 (Informal version of Theorem 9).

There exists an algorithm that for every (k,Ω(1),ϵ)-clusterable bounded degree graph G=(V,E) estimates the Dasgupta cost of G up to an O(√(log k)) factor in the random sample model in time n^{1/2+O(ϵ)}·(d_max)^{O(1)}.

 Remark 3.

We remark that our algorithm for estimating the Dasgupta cost from Theorem 9 can be adapted to provide oracle access to a low cost hierarchical clustering tree.

 Remark 4.

One can verify, by adapting the lower bound of Ω(√n) on expansion testing due to Goldreich and Ron [23], that at least Ω(√(n/k)) queries are needed for an o(k/log k) approximation for constant k in this model. The proof is a rather direct adaptation of the classical result of Goldreich and Ron, and we therefore do not present it.

 Remark 5.

Recall that in our random sample model for seed vertices, the seeds are sampled independently with probability proportional to their degrees. This model matches quite closely what happens in practice, in the sense that the algorithm does not always have full control over the seeds [7]. One can also consider the stronger model in which the algorithm can ask for the correct label of any vertex of its choosing. This model is significantly stronger, and in particular one can design an algorithm that obtains the same approximation to the Dasgupta cost as our Theorem 2 above, but with time complexity polynomial in d, log n and 1/ϵ.

We note that the currently best known approximation to the Dasgupta cost of an n-vertex graph is O(√(log n)), achieved by the recursive sparsest cut algorithm of [11]. Our approximation is O(√(log k)), matching what the Charikar and Chatziafratis algorithm achieves on k-node graphs. In fact, our main technical contribution is an efficient way of simulating this algorithm in sublinear time on k-clusterable graphs.

Related work on (𝒌,𝛀(𝟏),ϵ)-clusterable graphs.

Such graphs have been extensively studied in the property testing framework as well as in local computation models. The testing version of the problem, where one essentially wants to determine k, the number of clusters in G, in sublinear time, generalizes the well-studied problem of testing graph expansion, where one wants to distinguish between an expander (i.e. a good single cluster) and a graph with a sparse cut (i.e. at least two clusters). [24] showed that expansion testing requires Ω(√n) queries; then [15, 27, 36] developed algorithms that distinguish an expander from a graph that is far from a graph with conductance ϵ in time n^{1/2+O(ϵ)}, which the recent work of [12] showed to be tight. The setting of k>2 has seen a lot of attention recently [14, 12, 37, 22], with close to information theoretically optimal clustering oracles, i.e. small space data structures that provide quick access to an approximate clustering, obtained in [22]. More recently, [31] studied hierarchical clustering of k-clusterable graphs and developed a nearly linear time algorithm that approximates the Dasgupta cost of the graph up to a constant factor. However, their algorithm requires significantly stronger assumptions on the input data, namely ϵ ≤ 1/k^{O(1)}, and it does not run in sublinear time. Note that the problem of estimating the Dasgupta cost becomes non-trivial when ϵ ≳ 1/k, i.e., when the Dasgupta cost of the graph is dominated by the outgoing edges between different clusters.¹

¹ For instance, in a d-regular, (k,φ,ϵ)-clusterable graph, one can easily show that the Dasgupta cost is at least Ω(φdn²/k), simply because of the contribution of the k induced φ-expanders. On the other hand, the total number of edges running between the clusters is bounded by ϵdn, and therefore their total contribution to the Dasgupta cost is O(ϵdn²). Thus, the problem becomes non-trivial when ϵ ≳ 1/k.

The work most closely related to our setting is [28], where the authors provide a sublinear algorithm for hierarchical clustering. However, their algorithm works under significantly stronger assumptions on the input instance. They introduce the notion of hierarchically clusterable graphs, which assumes a planted hierarchical clustering structure not only at the bottom level of the hierarchy but at every level, and their result relies on several properties of such graphs. In contrast, we only assume that the input graph is k-clusterable. For this reason we cannot use the techniques developed in [28], and we need to develop a completely new approach.

Very recently, [1, 6] considered the problem of hierarchical clustering under the Dasgupta objective in the streaming model. Both papers give a one-pass Õ(n)-memory streaming algorithm which finds a tree with Dasgupta cost within an O(√(log n)) factor of the optimum in polynomial time. Additionally, [1] also considers this problem in the query model and presents an O(√(log n))-approximate hierarchical clustering using Õ(n) queries, without making any clusterability assumptions on the input graph. On the other hand, our algorithms assume the graph is k-clusterable and approximate the Dasgupta cost within an O(√(log k)) factor in sublinear time.

Related work on hierarchical clustering.

We briefly review developments in the area of algorithms for hierarchical clustering since the introduction of Dasgupta’s objective function. Dasgupta designed an algorithm based on the recursive sparsest cut that provides an O(log^{3/2} n) approximation for his objective function. This was improved by Charikar and Chatziafratis, who showed that the recursive sparsest cut algorithm already returns a tree with approximation guarantee O(√(log n)) [11]. Furthermore, they showed that it is impossible to approximate the Dasgupta cost within a constant factor in general graphs under the Small-Set Expansion hypothesis. More recently, [13] studied this problem in a regime in which the input graph is sampled from a hierarchical stochastic block model, and constructed a tree in nearly linear time that approximates the Dasgupta cost of the graph up to a constant factor. The hierarchical stochastic block model of [13] generates close to regular expanders with high probability, and their analysis crucially relies on having dense clusters and large degrees. Our model allows for arbitrary expanders as opposed to dense random graphs and is more expressive in this sense.

Related work in semi-supervised active clustering.

We note that our model is also related to the semi-supervised active clustering framework (SSAC) introduced in [5]. In this model we are given a set X of n points and an oracle answering same-cluster queries of the form “are these two points in the same cluster?”. Thanks to its elegance and applications to crowdsourcing, the model has received a lot of attention and has been extensively studied both in theory [2, 3, 9, 10, 26, 33, 34, 35, 38, 42] and in practice [20, 25, 40, 41] – see also [19] for other types of queries.

1.1 Basic definitions

Definition 6 (Inner and outer conductance).

Let G=(V,E) be a graph. For a set C⊆V and a set S⊆C, let E(S, C∖S) be the set of edges with one endpoint in S and the other in C∖S. The conductance of S within C is φ_C^G(S) = |E(S, C∖S)| / vol(S). The outer conductance of C is defined to be φ_out^G(C) = φ_V^G(C) = |E(C, V∖C)| / vol(C). The inner conductance of C⊆V is defined to be φ_in^G(C) = min_{S⊆C, 0<vol(S)≤vol(C)/2} φ_C^G(S) if |C|>1, and 1 otherwise.
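
To make Definition 6 concrete, the following minimal Python sketch (our own illustration, not from the paper) computes these conductances by brute force for a small unweighted graph given as adjacency sets; the exponential-time inner conductance is for illustration only.

from itertools import combinations

def vol(G, S):
    return sum(len(G[x]) for x in S)

def phi(G, S, C):
    """Conductance of S within C: |E(S, C \\ S)| / vol(S)."""
    S, C = set(S), set(C)
    cut = sum(1 for x in S for y in G[x] if y in C - S)
    return cut / vol(G, S)

def phi_out(G, C):
    return phi(G, C, G.keys())

def phi_in(G, C):
    """Brute-force inner conductance (exponential; illustration only)."""
    C = list(C)
    if len(C) <= 1:
        return 1.0
    cands = [S for r in range(1, len(C)) for S in combinations(C, r)
             if 0 < vol(G, S) <= vol(G, C) / 2]
    return min(phi(G, S, C) for S in cands)

# Example: a triangle {0,1,2} with a pendant vertex 3 attached to 2.
# G = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}; phi_out(G, {0, 1, 2}) == 1/7.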

We define k-clusterable graphs as a class of instances that can be partitioned into k expanders.

Definition 7 ((k,φ,ϵ)-clustering).

Let G=(V,E) be a graph. A (k,φ,ϵ)-clustering of G is a partition of the vertex set V into disjoint subsets C_1,…,C_k such that for all i∈[k], φ_in^G(C_i) ≥ φ and φ_out^G(C_i) ≤ ϵ, and for all i,j∈[k] one has vol(C_i)/vol(C_j) = O(1). A graph G is called (k,φ,ϵ)-clusterable if there exists a (k,φ,ϵ)-clustering of G.

For Theorem 2, we assume that the degree of every vertex is brought up to the maximum degree by adding self-loops, and we use the notion of conductance corresponding to the graph with the added self-loops.

Dasgupta cost.

Hierarchical clustering is the task of partitioning the vertices of a graph into nested clusters. The nested partitions can be represented by a rooted tree whose leaves correspond to the vertices of the graph, and whose internal nodes represent clusters of vertices. Dasgupta introduced a natural optimization framework for formulating hierarchical clustering tasks as an optimization problem [16]. We recall this framework now. Let T be any rooted tree whose leaves are the vertices of the graph. For any node x of T, let T[x] be the subtree rooted at x, and let leaves(T[x])⊆V denote the leaves of this subtree. For leaves x,y∈V, let LCA(x,y) denote the lowest common ancestor of x and y in T. In other words, T[LCA(x,y)] is the smallest subtree whose leaves contain both x and y.

Definition 8 (Dasgupta cost [16]).

The Dasgupta cost of the tree T for the graph G=(V,E) is defined to be COST_G(T) = ∑_{{x,y}∈E} |leaves(T[LCA(x,y)])|.
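
As an illustration, the following Python sketch (ours; the input conventions are our own) evaluates COST_G(T) directly from Definition 8 for a tree given by parent pointers.

def dasgupta_cost(edges, parent, leaf_of):
    """edges: graph edges (x, y); parent: tree node -> parent (root -> None);
    leaf_of: graph vertex -> the tree leaf representing it."""
    n_leaves = {}                      # number of graph vertices under each node
    for leaf in leaf_of.values():
        node = leaf
        while node is not None:
            n_leaves[node] = n_leaves.get(node, 0) + 1
            node = parent[node]
    def ancestors(node):               # path from a leaf up to the root
        path = []
        while node is not None:
            path.append(node)
            node = parent[node]
        return path
    cost = 0
    for x, y in edges:
        on_y_path = set(ancestors(leaf_of[y]))
        lca = next(a for a in ancestors(leaf_of[x]) if a in on_y_path)
        cost += n_leaves[lca]          # |leaves(T[LCA(x, y)])|
    return cost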

The random sample model for seed vertices.

We consider a random sample model for seed vertices, in which the algorithm is given a (multi)set S of seed vertices sampled independently with probability proportional to their degrees, together with their cluster labels.

2 Technical overview

In this section we give an overview of our main algorithmic result, stated below as Theorem 9 (the formal version of Theorem 2). It postulates a sublinear time algorithm for estimating the Dasgupta cost of k-clusterable graphs. Here, we use O^*-notation to suppress poly(k), poly(1/φ), poly(1/ϵ) and polylog(n) factors.

Theorem 9.

Let k ≥ 2, φ ∈ (0,1), and let ϵ/φ² be at most a sufficiently small constant. Let G=(V,E) be a bounded degree graph that admits a (k,φ,ϵ)-clustering C_1,…,C_k. Let |V|=n and |E|=m.

There exists an algorithm (EstimatedCost(G); Algorithm 1) that w.h.p. estimates the optimum Dasgupta cost of G within an O(√(log k)/φ^{O(1)}) factor in time O^*(n^{1/2+O(ϵ/φ²)}·(d_max)^{O(1)}), using O^*(n^{O(ϵ/φ²)}·(d_max)^{O(1)}) seed queries.

Our algorithm consists of two main parts. First, we estimate the contribution of the inter-cluster edges to the Dasgupta cost. A natural approach is to contract the clusters C_1,…,C_k into supernodes, and use the Dasgupta cost of the contracted graph (defined below) as a proxy.

Definition 10 (Contracted graph).

Let G=(V,E) be a graph and let 𝒞=(C1,,Ck) denote a partition of V into disjoint subsets. We say that the weighted graph H=([k],([k]2),W,w) is a contraction of G with respect to the partition 𝒞 if for every i,j[k] we have W(i,j)=|E(Ci,Cj)|, and for every i[k] we have w(i)=|Ci|. We denote the contraction of G with respect to the partition 𝒞 by H=G/𝒞.
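
In code, Definition 10 is a single pass over the edges; the Python sketch below (ours, with our own input conventions) makes this explicit. Of course, computing H this way requires reading the entire edge set, which is exactly what a sublinear algorithm cannot afford.

from collections import Counter

def contract(edges, label):
    """Contraction H = G / C of Definition 10. label: vertex -> cluster in [k].
    Returns W[(i, j)] = |E(C_i, C_j)| for i < j, and w[i] = |C_i|."""
    W = Counter()
    for u, v in edges:
        i, j = sorted((label[u], label[v]))
        if i != j:
            W[(i, j)] += 1
    w = Counter(label.values())
    return dict(W), dict(w)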

The problem is of course that it is not clear how to get access to this contracted graph in sublinear time, and our main contribution is a way of doing so. Our approach amounts to first obtaining access to the quadratic form of the Laplacian of the contracted graph H, and then using the hierarchical clustering algorithm of [11] on the corresponding approximation to the contracted graph. Thus, we essentially show how to simulate the algorithm of [11] in sublinear time on (k,φ,ϵ)-clusterable graphs.

Second, the procedure TotalClustersCost approximates the contribution of the internal cluster edges to the Dasgupta cost.

Algorithm 1 below presents our estimator for the Dasgupta cost of the graph.

Algorithm 1 EstimatedCost(G).                                              time m^{1/2+O(ϵ)}

Our algorithm uses a weighted version of the Dasgupta cost (Definition 24), which we denote WCOST, to relate the cost of G to that of the contracted graph H. Our estimate EST in Algorithm 1 then simply sums the contribution from the weighted Dasgupta cost of the tree T~ on the (approximate) contracted graph H~ with the contribution from the clusters. We want to ensure that the estimate always provides an upper bound on the optimal Dasgupta cost of G. To this end, we scale the weighted Dasgupta cost WCOST_{H~}(T~) up by a factor of O(1/φ²) (to account for the multiplicative error), and add a term on the order of ξmnk²/φ² (to account for the additive error). That way we obtain an estimate EST such that

COST(G) ≤ EST ≤ O(√(log k)/φ^{O(1)})·COST(G),

where COST(G) denotes the optimum Dasgupta cost of G.

We outline the main ideas behind accessing the contracted graph in Section 2.2, and present the complete analysis in the full version. The TotalClustersCost procedure simply outputs a fixed value that depends on n, d, and k. We provide more details on this in Section 2.1.

We remark that with a little post-processing, our algorithms for estimating the Dasgupta cost can be adapted to recover a low-cost hierarchical clustering tree. To construct such a tree, we first construct a tree T~ with k leaves on the contracted graph; the algorithm constructs T~ in sublinear time n^{1/2+O(ϵ)}. Then, for every cluster C_i one can construct a particular tree 𝒯_i^{deg} using Algorithm 1 of [31] on the vertices of C_i. Finally, we extend the leaf i of the tree T~ by adding the tree 𝒯_i^{deg} as its direct child. Note that constructing 𝒯_i^{deg} explicitly takes time O(|C_i|); however, this step is only required if one intends to output the full hierarchical clustering tree of G. Otherwise, for only estimating COST(G), we can estimate COST(𝒯_i^{deg}) as a function of the cluster size and the degree without explicitly constructing 𝒯_i^{deg}.

2.1 Estimating Dasgupta cost of an expander using seed queries

In this section, we design an algorithm for estimating the Dasgupta cost of an irregular φ-expander up to a poly(1/φ) factor using n^{1/3} seed queries. We also show that this is optimal (Theorem 13); the proof appears in the full version. Later in the paper, we approximate the contribution of the clusters to the Dasgupta cost of a d-regular (k,φ,ϵ)-clusterable graph, where a more basic approach suffices. In this section, we focus on a single, but irregular, φ-expander.

Theorem 11.

Let G=(V,E) be a φ-expander (possibly with self-loops). Let T denote the tree with optimum Dasgupta cost for G. Then the procedure ClusterCost (Algorithm 3) uses O(n^{1/3}) seed queries and with probability 1 − n^{−101} returns a value such that:

COST(T) ≤ ClusterCost(G) ≤ O(1/φ⁵)·COST(T).

We now outline the proof of Theorem 11.

Let G be a φ-expander, i.e., φ_in(G) ≥ φ. To estimate the Dasgupta cost of G, we use Theorem 12 from [31]. This result shows that there is a specific tree on G, called 𝒯^{deg}, that approximates the Dasgupta cost of G up to O(1/φ⁴). For completeness, we include the algorithm (Algorithm 2) for computing 𝒯^{deg} from [31]. Note that Algorithm 2 from [31] runs in time O(m + n log n); however, we do not need to construct 𝒯^{deg} explicitly. Instead, we design an algorithm that estimates the cost of 𝒯^{deg} in time n^{1/3}.

Algorithm 2 HCwithDegrees(G{V}) [31].
Theorem 12 (Theorem 3 in [31]).

Given any graph G=(V,E,w) with inner conductance φ as input, Algorithm 2 runs in O(m + n log n) time, and returns an HC tree 𝒯^{deg}(G) that satisfies COST_G(𝒯^{deg}(G)) = O(1/φ⁴)·OPT_G.

Our procedure for estimating the Dasgupta cost of the tree returned by Algorithm 2 is based on a simple expression for the (approximate) cost of this tree that we derive (and later show how to approximate by sampling).

Let G=(V,E) be an arbitrary expander with vertices x_1, x_2, …, x_n ordered such that d_1 ≥ d_2 ≥ ⋯ ≥ d_n, where d_i = deg(x_i). We denote by 𝒯^{deg} the Dasgupta tree returned by Algorithm 1 of [31]. Specifically, we show that the cost COST_G(𝒯^{deg}) is approximated to within an O(1/φ) factor by

∑_{i=1}^n i·d_i = ∑_{x∈V} rank(x)·deg(x), (1)

where deg(x) is the degree of x and rank(x) is the rank of x in the ordering of the vertices of V in non-increasing order of degrees. The proof is rather direct, and is presented in the full version. Our task therefore reduces to approximating (1) in sublinear time. To achieve this, we partition the vertices into buckets according to their degree: for every d between 1 and n/φ that is a power of 2, let B_d ≜ {x∈V : d ≤ deg(x) < 2d}. We will refer to B_d as the degree class of d. Let n_d ≜ |B_d| denote the size of the degree class, and let r_d denote the highest rank in B_d. Note that r_d is the number of vertices in G that have degree at least d, so we have r_d = ∑_{t≥d} n_t.

The vertices in B_d have ranks r_d, r_d−1, …, r_d−n_d+1 and degrees in [d, 2d), which gives the bounds

(d/2)·n_d·r_d ≤ d·∑_{i=r_d−n_d+1}^{r_d} i ≤ ∑_{x∈B_d} rank(x)·deg(x) ≤ 2d·∑_{i=r_d−n_d+1}^{r_d} i ≤ 2d·n_d·r_d, (2)

so our task is further reduced to estimating the quantity

∑_d d·n_d·r_d, (3)
where the sum ranges over the degree classes d.
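
As a sanity check of the reduction from (1) to (3), here is a short Python sketch (ours) that computes both quantities exactly for a given degree sequence of positive degrees; by (2), they agree up to a factor of 2.

from collections import Counter

def proxy_cost(degs):                        # formula (1): sum_x rank(x) * deg(x)
    return sum((i + 1) * d for i, d in enumerate(sorted(degs, reverse=True)))

def bucketed_proxy(degs):                    # quantity (3): sum_d d * n_d * r_d
    n = Counter(1 << (deg.bit_length() - 1) for deg in degs)   # n_d = |B_d|
    return sum(d * n_d * sum(n_t for t, n_t in n.items() if t >= d)
               for d, n_d in n.items())      # r_d = sum_{t >= d} n_t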

We do so by sampling: simply sample n^{1/3} vertices, and approximate the number of vertices n_d and the highest rank r_d of each degree class. This is summarized in Algorithm 3 below.

Algorithm 3 ClusterCost(G, S, m̂).
# S is a (multi)set of size s of vertices in G=(V,E)
# m̂ is a constant factor estimate of |E|
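
The following Python sketch (ours; thresholds and constants are simplified, and the real Algorithm 3 differs in its details) illustrates the idea behind the estimator: draw s ≈ n^{1/3} degree-proportional seeds, estimate n_t and r_t for the degree classes that are hit, and plug the estimates into (3).

import random
from collections import Counter

def estimate_proxy_cost(degs, m_hat, s):
    """degs: degrees (queried only on sampled vertices); m_hat: a constant
    factor estimate of |E|; s ~ n^{1/3} degree-proportional samples."""
    sample = random.choices(range(len(degs)), weights=degs, k=s)
    hits = Counter(1 << (degs[x].bit_length() - 1) for x in sample)
    # E[hits[t]] ~ s * n_t * t / (2m), so n_t is estimated by 2 * m * hits[t] / (s * t).
    n_est = {t: 2 * m_hat * c / (s * t) for t, c in hits.items()}
    est = 0.0
    for t, n_t in n_est.items():
        r_t = sum(n_u for u, n_u in n_est.items() if u >= t)
        est += t * n_t * r_t             # plug the estimates into (3)
    return est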

While the algorithm is simple, the analysis is quite interesting, and the bound of n^{1/3} on the number of seeds is tight! We now outline the main ideas behind the analysis of the algorithm.

Ideally, we would like to estimate the number of vertices n_d and the highest rank r_d of every degree class d. However, this is hard to achieve, as some degree classes may be small. The crux of the analysis is showing that with n^{1/3} samples, we can approximate n_d and r_d for every degree class that contributes at least a (1/log n)-fraction of the Dasgupta cost.

Recalling that our model assumes degree-proportional sampling, the expected number of samples from any degree class B_t is

(s/2m)·∑_{x∈B_t} deg(x) ≈ (s/m)·n_t·t,

where s is the total number of samples. Thus, we can estimate n_t and r_t whenever n_t·t ≥ Ω(m/s).

Now, consider the degree class d with the highest degree mass n_d·d. Since there are at most log(n/φ) different degree classes, we have n_d·d ≥ Ω(m/log n). Thus, we can estimate the contribution to the Dasgupta cost of any degree class B_t which satisfies

(n_d·d)/(n_t·t) ≤ O(s).

Using the degree class d as a reference, we show that any degree class t that has a significant contribution to the Dasgupta cost must have a sufficiently large degree mass compared to d.

Specifically, if B_t is a degree class that contributes at least a (1/log n)-fraction of the Dasgupta cost, i.e.

∑_{x∈B_t} rank(x)·deg(x) ≥ (1/log n)·∑_{x∈V} rank(x)·deg(x),

then, by Equation (2), we have

2t·r_t·n_t ≥ ∑_{x∈B_t} rank(x)·deg(x) ≥ (1/log n)·∑_{x∈V} rank(x)·deg(x) ≥ (1/log n)·∑_{i=1}^{r_t} i·t ≥ (1/log n)·(r_t²·t/2).

From this, we conclude that n_t ≳ r_t (up to a logarithmic factor), allowing us to use the quantity n_t²·t as a further proxy for the contribution of B_t to the Dasgupta cost.

Furthermore, we show that if d is our high-degree-mass reference class and t is any degree class that contributes at least a (1/log n)-fraction of the Dasgupta cost, then n_t²·t ≥ n_d²·d (up to logarithmic factors). Intuitively, this is because the contribution from B_t is no smaller than the contribution from B_d.

Therefore, the following optimization problem provides an upper bound on the sufficient number of samples.

max_{t, n_t, d, n_d}  (n_d·d)/(n_t·t)
such that
n_t²·t ≥ n_d²·d        # B_t has a large contribution to the Dasgupta cost
n_t, n_d ≤ n           # at most n vertices
n_t, t, d ≥ 1          # B_t is non-empty and degrees are non-zero
n_d ≥ 0.

However, the above optimization problem is too weak. For example, setting n_d = n^{1/2}, d = n, n_t = n, t = 1 gives a feasible solution with value n^{1/2}. But this solution would correspond to having n^{1/2} vertices of degree n and n vertices of degree 1, which is impossible in an actual graph. We remedy this by adding an additional constraint that encodes that t, n_t, d, n_d arise from a valid graph.

max_{t, n_t, d, n_d}  (n_d·d)/(n_t·t)
such that
n_t²·t ≥ n_d²·d        # B_t has a large contribution to the Dasgupta cost
d ≤ n_d                # B_d does not have too many edges to V∖B_d
n_t, n_d ≤ n           # at most n vertices
n_t, t, d ≥ 1          # B_t is non-empty and degrees are non-zero
n_d ≥ 0.

A priori, there is no reason why the constraint d ≤ n_d should be satisfied by our reference class B_d. However, we show that for any graph, it is possible to find a reference class B_d which satisfies d ≤ n_d and contributes a large fraction of the degree mass. Intuitively, this is because if all the high-degree-mass classes had d > n_d, then they would require too many edges to be routed outside of their degree class, eventually exhausting the available vertices. See the full version for the details.

Finally, we prove that the refined optimization problem has optimal value n^{1/3}. Therefore n^{1/3} samples suffice to discover any degree class t with a non-trivial contribution to the Dasgupta cost. The full analysis is presented in the full version.
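
The scaling of the optimum can also be checked numerically; the brute-force Python sketch below (ours) searches over powers of 2 and exhibits the n^{1/3} behaviour of the refined program.

def refined_opt(n):
    pows = [1 << i for i in range(n.bit_length())]
    best = 0.0
    for d in pows:
        for n_d in pows:
            if not (d <= n_d <= n):              # d <= n_d, at most n vertices
                continue
            for t in pows:
                for n_t in pows:
                    if n_t <= n and n_t * n_t * t >= n_d * n_d * d:
                        best = max(best, n_d * d / (n_t * t))
    return best

for n in (2 ** 12, 2 ** 15, 2 ** 18):
    print(n, refined_opt(n), round(n ** (1 / 3)))   # the optimum tracks n^{1/3}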

We also show that Ω(n^{1/3}) seeds are necessary to approximate ∑_{x∈V} rank(x)·deg(x) to within any constant factor:

Theorem 13.

For every constant α>1 and n sufficiently large, there exists a pair of expanders G and G′ such that ∑_{i=1}^n i·d_i ≤ n², ∑_{i=1}^n i·d′_i ≥ α·n², and at least Ω(n^{1/3}) vertices need to be queried in order to have probability above 2/3 of distinguishing between them (where d_1 ≥ ⋯ ≥ d_n is the degree sequence of G and d′_1 ≥ ⋯ ≥ d′_n is the degree sequence of G′).

Figure 2 below illustrates the graphs G and G′ from Theorem 13. Graph G consists of a set A of n^{2/3} vertices of degree n^{2/3}, with the remaining vertices having degree 1. Graph G′ has a set A′ of n^{2/3} vertices of degree n^{2/3}, but the remaining vertices have degree α. In order to distinguish the two graphs, we need to query a vertex outside of A or A′; since the vertices outside A carry only an ≈ n^{−1/3} fraction of the degree mass, this requires Ω(n^{1/3}) queries in expectation. The proof of Theorem 13 is straightforward, and is included in the full version.

Figure 2: Illustration of the two instances in Theorem 13.

Finally, we describe our procedure TotalClustersCost for approximating the total contribution of the clusters to the Dasgupta cost of a d-regular (k,φ,ϵ)-clusterable graph. To approximate the contribution of a single cluster C_i, it suffices to apply the formula ∑_{x∈C_i} rank(x)·d = (|C_i|(|C_i|+1)/2)·d ≈ |C_i|²·d/2. The procedure TotalClustersCost simply sums these contributions over all the clusters.
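
As a quick sketch (ours) of this computation, for cluster sizes |C_i| and common degree d:

def total_clusters_cost(cluster_sizes, d):
    # each cluster C_i contributes |C_i| * (|C_i| + 1) / 2 * d
    return sum(c * (c + 1) * d // 2 for c in cluster_sizes)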

2.2 Sublinear time access to the contracted graph

In this section, we consider a (k,φ,ϵ)-clusterable graph with bounded maximum degree. For simplicity of presentation, we assume without loss of generality that the graph is d-regular: by a standard reduction, we can convert a graph with maximum degree d into a d-regular graph by adding self-loops to each vertex.

We denote the Laplacian of G by ℒ_G and the normalized Laplacian of G by L_G. We will use the following notation for additive-multiplicative approximation.

Definition 14.

For x, y ∈ ℝ, write

x ≈_{a,b} y if (1/a)·y − b ≤ x ≤ a·y + b.

For matrices X, Y ∈ ℝ^{k×k}, write

X ≈_{a,b} Y if (1/a)·Y − b·m·I_k ⪯ X ⪯ a·Y + b·m·I_k.
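
In code, checking the matrix version of Definition 14 amounts to two positive semidefiniteness tests; a small numpy sketch (ours, with a tolerance of our choosing):

import numpy as np

def approx_eq(X, Y, a, b, m, tol=1e-9):
    """X ~_{a,b} Y iff Y/a - b*m*I <= X <= a*Y + b*m*I in the Loewner order."""
    I = np.eye(X.shape[0])
    psd = lambda A: np.linalg.eigvalsh(A).min() >= -tol
    return psd(a * Y + b * m * I - X) and psd(X - (Y / a - b * m * I))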

Let H = G/𝒞 be the contraction of G with respect to the underlying clustering 𝒞=(C_1, C_2, …, C_k) (Definition 10). We write H=([k],([k]2),W) to emphasize that the vertex set of the contracted graph H corresponds to the clusters of G, and for i,j∈[k], the pair (i,j) is an edge of H with weight W(i,j)=|E(C_i,C_j)|. If we were explicitly given the adjacency/Laplacian matrix of the contracted graph H, then finding a good Dasgupta tree for H could easily be done using the algorithm of [11], which gives an O(√(log k)) approximation to the optimal tree for H (and an O(√(log k)/φ^{O(1)}) approximation to the optimal tree for G, as shown in Theorem 9).

The problem is that we do not have explicit access to the Laplacian of the contracted graph (denoted by ℒ_H). However, to get a good approximation to the Dasgupta cost of H, it suffices to provide explicit access to a Laplacian ℒ~ (which corresponds to a graph H~) whose cuts approximate the sizes of the corresponding cuts in H in the following sense: there exist α>1, β>0 such that for all S⊆[k],

|E~(S, V∖S)| ≈_{α,β} |E(S, V∖S)|.

Motivated by this observation, we simulate this access approximately by constructing a matrix ℒ~ which spectrally approximates ℒ_H in the sense that

ℒ_H ≈_{a,ξ} ℒ~ (4)

in time m^{1/2+O(ϵ/φ²)}·poly(1/ξ) for some 0 < ξ < 1 < a (see Theorem A.2 in the full version). So, our immediate goal is to spectrally approximate ℒ_H. We describe this next.

Spectrally approximating ℒ_H.

The key insight behind our spectral approximation ℒ~ to ℒ_H comes from considering the case where the graph is a collection of k disjoint expanders, each on n/k vertices. To understand this better, let L_G = UΛU^T denote the eigendecomposition of the normalized Laplacian. Let M ≜ (1/2)·I + (1/2d)·A denote the lazy random walk matrix, and note that M = I − (1/2)·L_G. Let M = UΣU^T denote the eigendecomposition of the lazy random walk matrix. Letting U_[k] ∈ ℝ^{n×k} denote the matrix whose columns are the first k columns of U, we will use random sampling to obtain our spectral approximation ℒ~ to the matrix I − U_[k]Σ_[k]U_[k]^T. Indeed, for the instance consisting of k disjoint equal-sized expanders, note that I − U_[k]Σ_[k]U_[k]^T = Ū_[k]Σ̄_[k]Ū_[k]^T, where Ū_[k] ∈ ℝ^{n×(n−k)} is the matrix whose columns are the last n−k columns of U. Using the information that λ_n ≥ λ_{n−1} ≥ ⋯ ≥ λ_{k+1} ≥ φ²/2, one can compare the quadratic forms of I − U_[k]Σ_[k]U_[k]^T and L_G (the normalized Laplacian of G) to show that ℒ~ ≈_{O(1/φ²),ξ} ℒ_H.

We will now describe this in more detail. First, we show that the matrix I − U_[k]Σ_[k]U_[k]^T approximates the quadratic form of L_G multiplicatively. Then, we describe how this allows us to approximate the quadratic form of ℒ_H. Finally, we outline how to approximate the matrix I − U_[k]Σ_[k]U_[k]^T.

First, we introduce a definition central to this work, the notion of a spectral embedding.

Definition 15 (k-dimensional spectral embedding).

For every vertex x, we let f_x = U_[k]^T·𝟙_x be the k-dimensional spectral embedding of vertex x.

The spectral embeddings of the vertices of a graph provide rich geometric information which has been shown to be useful in graph clustering [30, 14, 12, 22]. The remark below asserts that the inner products between f_x and f_y are well-defined even though the choice of these vectors may not be basis-free. First, we need the following standard result on the eigenvalues of (k,φ,ϵ)-clusterable graphs [30, 12].

Lemma 16 (Lemma 3 in [22]).

Let G=(V,E) be a d-regular graph that admits a (k,φ,ϵ)-clustering. Let λ_1 ≤ ⋯ ≤ λ_n be the eigenvalues of L_G. Then we have λ_k ≤ 2ϵ and λ_{k+1} ≥ φ²/2.

 Remark 17.

Take a (k,φ,ϵ)-clusterable graph G where ϵ/φ² is smaller than a constant. Then the space spanned by the bottom k eigenvectors of the normalized Laplacian of G is uniquely defined, i.e. the choice of U_[k] is unique up to multiplication on the right by an orthonormal matrix R ∈ ℝ^{k×k}. Indeed, by Lemma 16 it holds that λ_k ≤ 2ϵ and λ_{k+1} ≥ φ²/2. Since we assume that ϵ/φ² is smaller than an absolute constant, we have 2ϵ < φ²/2, and thus the subspace spanned by the bottom k eigenvectors of the Laplacian, i.e. the column space of U_[k], is uniquely defined, as required. We note that while the choice of f_x for x∈V is not unique, the dot product between the spectral embeddings of x∈V and y∈V is well defined, since for every orthonormal R ∈ ℝ^{k×k} one has

⟨Rf_x, Rf_y⟩ = (Rf_x)^T(Rf_y) = f_x^T(R^TR)f_y = f_x^T f_y.

Since G is (k,φ,ϵ)-clusterable, by Remark 17 the space spanned by the top k eigenvectors of M is uniquely defined. Thus, for any z ∈ ℝⁿ, the quantity z^T(U_[k]Σ_[k]U_[k]^T)z is well defined.

Having observed this, we now show that the quadratic form of I − U_[k]Σ_[k]U_[k]^T approximates the quadratic form of L_G multiplicatively.

Lemma 18.

Suppose that G is d-regular, and let L_G and M denote the normalized Laplacian and the lazy random walk matrix of G. Let M = UΣU^T denote the eigendecomposition of M. Then for any vector z ∈ ℝⁿ with ‖z‖₂ = 1 we have

(1/2)·z^T L_G z ≤ z^T(I − U_[k]Σ_[k]U_[k]^T)z ≤ (3/φ²)·z^T L_G z.
Proof.

Recall that U_[k] ∈ ℝ^{n×k} is the matrix whose columns are the first k columns of U, and Σ_[k] ∈ ℝ^{k×k} is the submatrix of Σ formed by its first k rows and columns. Let Ū_[k] ∈ ℝ^{n×(n−k)} be the matrix whose columns are the last n−k columns of U, and let Σ̄_[k] ∈ ℝ^{(n−k)×(n−k)} be the submatrix of Σ formed by its last n−k rows and columns. Thus, the eigendecomposition of M is M = UΣU^T = U_[k]Σ_[k]U_[k]^T + Ū_[k]Σ̄_[k]Ū_[k]^T. Note that M = I − L_G/2; thus we have

z^T(U_[k]Σ_[k]U_[k]^T)z + z^T(Ū_[k]Σ̄_[k]Ū_[k]^T)z = z^T M z = 1 − (z^T L_G z)/2, (5)

which by rearranging gives

(z^T L_G z)/2 ≤ 1 − z^T(U_[k]Σ_[k]U_[k]^T)z = (z^T L_G z)/2 + z^T(Ū_[k]Σ̄_[k]Ū_[k]^T)z. (6)

The first inequality gives 1 − z^T(U_[k]Σ_[k]U_[k]^T)z ≥ (z^T L_G z)/2, as desired.

To establish the second inequality above, we will show z^T(Ū_[k]Σ̄_[k]Ū_[k]^T)z ≤ (2/φ²)·z^T L_G z. Let z = ∑_{i=1}^n α_i u_i be the expansion of z in the eigenbasis of L_G. Note that

z^T L_G z = ∑_{i=1}^n λ_i α_i² ≥ λ_{k+1}·∑_{i=k+1}^n α_i².

By Lemma 16 we have λ_{k+1} ≥ φ²/2. This gives

∑_{i=k+1}^n α_i² ≤ (z^T L_G z)/λ_{k+1} ≤ (2/φ²)·z^T L_G z. (7)

Finally, putting (6) and (7) together, and using that the eigenvalues of M are at most 1, we get

1 − z^T(U_[k]Σ_[k]U_[k]^T)z = (z^T L_G z)/2 + z^T(Ū_[k]Σ̄_[k]Ū_[k]^T)z
 ≤ z^T L_G z · (1/2 + 2/φ²)
 ≤ (3/φ²)·z^T L_G z.

Now, we apply Lemma 18 to estimate the quadratic form of ℒ_H on a vector z ∈ ℝᵏ. To that effect, for z ∈ ℝᵏ we define its natural extension z_ext ∈ ℝⁿ: for every x∈V, we let z_ext(x) = z_i, where C_i is the cluster that x belongs to.

Note that z^T ℒ_H z = z_ext^T ℒ_G z_ext = d·z_ext^T L_G z_ext. Thus, to estimate z^T ℒ_H z it suffices to design a good estimate for z_ext^T L_G z_ext, for which we use z_ext^T(I − U_[k]Σ_[k]U_[k]^T)z_ext, as per Lemma 18.
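
The identity z^T ℒ_H z = z_ext^T ℒ_G z_ext is easy to verify numerically; the following numpy sketch (ours; the indicator matrix P is our construction) contracts G explicitly and compares the two quadratic forms.

import numpy as np

def check_contraction_identity(A, label, k, z):
    """A: n x n symmetric 0/1 adjacency matrix; label: array with label[x] in
    range(k); z: vector in R^k. Checks z^T L_H z == z_ext^T L_G z_ext."""
    n = A.shape[0]
    L_G = np.diag(A.sum(axis=1)) - A                 # Laplacian of G
    P = np.zeros((n, k))                             # cluster indicator matrix
    P[np.arange(n), label] = 1.0
    W = P.T @ A @ P                                  # W(i, j) = |E(C_i, C_j)|
    W_off = W - np.diag(np.diag(W))                  # drop intra-cluster edges
    L_H = np.diag(W_off.sum(axis=1)) - W_off         # Laplacian of H = G / C
    z_ext = P @ z                                    # z_ext(x) = z_{label(x)}
    return np.isclose(z @ L_H @ z, z_ext @ L_G @ z_ext)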

Finally, we briefly discuss how to estimate the quantity z_ext^T(I − U_[k]Σ_[k]U_[k]^T)z_ext. We have

z_ext^T(I − U_[k]Σ_[k]U_[k]^T)z_ext = ‖z_ext‖₂² − z_ext^T U_[k]Σ_[k]U_[k]^T z_ext = ∑_{i∈[k]} |C_i|·z_i² − z_ext^T U_[k]Σ_[k]U_[k]^T z_ext.

Since the first term on the RHS can be easily approximated in the random sample model, we concentrate on obtaining a good estimate for the second term. We have

z_ext^T U_[k]Σ_[k]U_[k]^T z_ext = ∑_{i,j∈[k]} z_i·z_j ∑_{x∈C_i} ∑_{y∈C_j} ⟨f_x, Σ_[k] f_y⟩, (8)

and therefore, in order to estimate z_ext^T L_G z_ext, it suffices to use a few random samples to estimate the sum above, as long as one is able to compute, with high probability, high accuracy estimates of ⟨f_x, Σ_[k] f_y⟩ for x,y∈V. We refer to such a primitive as a weighted dot product oracle, since it computes a weighted dot product between the k-dimensional spectral embeddings f_x and f_y for x,y∈V. Assuming such an estimator, which we denote by WeightedDotProductOracle, our algorithm ApproxContractedGraph (Algorithm 4 below) obtains an approximation to the Laplacian of the contracted graph.

Algorithm 4 ApproxContractedGraph(G,ξ,𝒟).                                 time m^{1/2+O(ϵ)}·poly(1/ξ)
Estimating weighted dot products.

Our construction of WeightedDotProductOracle (Algorithm 8 in the full version) for estimating ⟨f_x, Σ_[k] f_y⟩ proceeds along the lines of [22]. We run short random walks of length t ≈ log n/φ² to obtain dot product access to the spectral embeddings of vertices. Given x∈V, let m_x denote the probability distribution of the endpoint of a t-step random walk started from x.

We first show that one can estimate ⟨m_x, m_y⟩ in time m^{1/2+O(ϵ/φ²)}·poly(1/ξ) with probability 1 − n^{−100k}. Then, we construct a Gram matrix 𝒢 ∈ ℝ^{s×s} such that 𝒢_{x,y} = ⟨m_x, m_y⟩ for every x,y∈S, where S is a small set of sampled vertices with |S| = s = m^{O(ϵ)}. Next, we apply an appropriate linear transformation to the Gram matrix 𝒢 and use it to estimate ⟨f_x, Σ_[k] f_y⟩ up to a very small additive error ξ/(n·poly(k)) (see Section C in the full version).
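
For intuition, ⟨m_x, m_y⟩ is the probability that two independent t-step lazy walks, one from x and one from y, end at the same vertex, and can be estimated empirically; the Python sketch below is our simplification, and the actual oracle of [22] is considerably more refined.

import random
from collections import Counter

def lazy_walk(adj, x, t):
    """One t-step lazy walk: stay put w.p. 1/2, else move to a uniformly
    random neighbour (adj: neighbour lists of a d-regular graph)."""
    for _ in range(t):
        if random.random() >= 0.5:
            x = random.choice(adj[x])
    return x

def est_dot_mx_my(adj, x, y, t, trials):
    """Estimate <m_x, m_y> from empirical endpoint distributions."""
    ex = Counter(lazy_walk(adj, x, t) for _ in range(trials))
    ey = Counter(lazy_walk(adj, y, t) for _ in range(trials))
    return sum(c * ey[z] for z, c in ex.items()) / trials ** 2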

Using semidefinite programming to round ℒ~.

As mentioned above, our proxy for the Laplacian ℒ_H is obtained via an approximation to I − U_[k]Σ_[k]U_[k]^T. However, this approximator might not even be a Laplacian. To remedy this, we first show that, using calls to the weighted dot product oracle, we can approximate all the entries of I − U_[k]Σ_[k]U_[k]^T to within very good precision. Starting from such an approximation, one can use semidefinite programming to round the intermediate approximator to a bona fide Laplacian ℒ~. In some more detail, we show the following.

Theorem 19 (Informal version of Theorem A.2 in the full version).

The algorithm ApproxContractedGraph (Algorithm 4), when given a (k,Ω(1),ϵ)-clusterable graph as input, uses a data structure 𝒟 obtained from an m^{1/2+O(ϵ)}-time preprocessing routine, runs in time m^{1/2+O(ϵ)}, and finds a graph H~ with Laplacian ℒ~ such that with probability 1 − n^{−100} it holds that ℒ~ ≈_{O(1/φ²),ξ} ℒ_H.

Approximating the Dasgupta cost of the contracted graph 𝑯~.

Consider the graph H~=([k],([k]2),W~,w~) returned by the ApproxContractedGraph procedure (Algorithm 4).

Once Theorem 19 has been established, our estimation primitive EstimatedCost (Algorithm 1) uses a simple vertex-weighted extension of a result of [11] to find a tree T~ on H~.

Specifically, we need the following definitions.

Definition 20 (Vertex-weighted sparsest cut problem).

Let H=(V,E,W,w) be a vertex and edge weighted graph. For every set S⊆V, we define the sparsity of the cut (S, V∖S) in the graph H as

Sparsity_H(S) = W(S, V∖S) / (w(S)·w(V∖S)),

where w(S) = ∑_{x∈S} w(x). The vertex-weighted sparsest cut of the graph H is the cut with minimum sparsity, i.e., argmin_{S⊆V} Sparsity_H(S).

Definition 21 (Vertex-weighted recursive sparsest cut algorithm (WRSC)).

Let α>1 and let H=(V,E,W,w) be a vertex and edge weighted graph. Let (S, V∖S) be the vertex-weighted sparsest cut of H. The vertex-weighted recursive sparsest cut algorithm on the graph H is a recursive algorithm that first finds a cut (T, V∖T) such that Sparsity_H(T) ≤ α·Sparsity_H(S), and then recurses on the subgraphs H[T] and H[V∖T].
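
Since WRSC is run on the k-node contracted graph, even an exact brute-force implementation is affordable for small k; the following Python sketch (ours) is one such instantiation.

from itertools import combinations

def wrsc(nodes, W, w):
    """Exact WRSC on a small weighted graph. nodes: list of vertices;
    W[(i, j)]: edge weights; w[i]: vertex weights. Returns a nested tuple."""
    if len(nodes) == 1:
        return nodes[0]
    best, best_S = float("inf"), None
    for r in range(1, len(nodes)):
        for S in map(set, combinations(nodes, r)):
            cross = sum(v for (i, j), v in W.items() if (i in S) != (j in S))
            spars = cross / (sum(w[i] for i in S) *
                             sum(w[i] for i in nodes if i not in S))
            if spars < best:
                best, best_S = spars, S
    A = sorted(best_S)
    B = [i for i in nodes if i not in best_S]
    sub = lambda part: {e: v for e, v in W.items() if set(e) <= set(part)}
    return (wrsc(A, sub(A), w), wrsc(B, sub(B), w))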

Next, we state results which bound the Dasgupta cost of the tree obtained by running the vanilla recursive sparsest cut algorithm on any graph. Then, in Corollary 25, we present the corresponding bound for vertex-weighted graphs.

Theorem 22 (Theorem 2.3 from [11]).

Let G=(V,E) be a graph. Suppose the RSC algorithm uses an α-approximation algorithm for the uniform sparsest cut. Then the RSC algorithm achieves an O(α) approximation of the optimal Dasgupta cost of G.

The following corollary from [11] follows by using the O(√(log |V|))-approximation algorithm for the uniform sparsest cut.

Corollary 23 ([11]).

Let G=(V,E) be a graph. Then the RSC algorithm achieves an O(√(log |V|)) approximation of the optimal Dasgupta cost of G.

Since the clusters of G have different sizes, and since the Dasgupta cost of a graph is a function of the size of the lowest common ancestor of the endpoints of the edges, we use weighted Dasgupta cost to relate the cost of G and the contracted graph H.

Definition 24 (Weighted Dasgupta cost).

Let G=(V,E,W,w) denote a vertex and edge weighted graph. For a tree T with |V| leaves (corresponding to vertices of G), we define the weighted Dasgupta cost of T on G as

WCOST_G(T) = ∑_{(x,y)∈E} W(x,y)·∑_{z∈leaves(T[LCA(x,y)])} w(z).
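
A short Python sketch (ours) computing WCOST for the nested-tuple trees produced by the wrsc sketch above:

def wcost(tree, W, w):
    """Weighted Dasgupta cost (Definition 24); W and w as in the wrsc sketch."""
    if not isinstance(tree, tuple):
        return 0
    def leaves(t):
        return {t} if not isinstance(t, tuple) else leaves(t[0]) | leaves(t[1])
    L, R = leaves(tree[0]), leaves(tree[1])
    mass = sum(w[i] for i in L | R)          # total w(z) below this LCA
    here = mass * sum(v for (i, j), v in W.items() if (i in L) != (j in L))
    sub = lambda part: {e: v for e, v in W.items() if set(e) <= part}
    return here + wcost(tree[0], sub(L), w) + wcost(tree[1], sub(R), w)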

We get the following guarantee on the weighted Dasgupta cost achieved by the WRSC algorithm.

Corollary 25.

Let H=(V,E,W,w) be a vertex and edge weighted graph. Then the WRSC algorithm achieves an O(√(log |V|)) approximation of the optimal weighted Dasgupta cost of H.

Letting T~ = WRSC(H~) be the tree computed by Algorithm 1 and using Corollary 25, we show that the estimate

EST = O(1/φ²)·WCOST_{H~}(T~) + TotalClustersCost(G) + O(ξmnk²/φ²)

computed by Algorithm 1 satisfies

COST(G) ≤ EST ≤ O(√(log k)/φ^{O(1)})·COST(G).

The details are presented in the full version.

References

  • [1] Arpit Agarwal, Sanjeev Khanna, Huan Li, and Prathamesh Patil. Sublinear algorithms for hierarchical clustering. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, NeurIPS, 2022. URL: http://papers.nips.cc/paper_files/paper/2022/hash/16466b6c95c5924784486ac5a3feeb65-Abstract-Conference.html.
  • [2] Nir Ailon, Anup Bhattacharya, and Ragesh Jaiswal. Approximate correlation clustering using same-cluster queries. In Proc. of LATIN, pages 14–27, 2018. doi:10.1007/978-3-319-77404-6_2.
  • [3] Nir Ailon, Anup Bhattacharya, Ragesh Jaiswal, and Amit Kumar. Approximate clustering with same-cluster queries. In Proc. of ITCS, volume 94, pages 40:1–40:21, 2018. doi:10.4230/LIPIcs.ITCS.2018.40.
  • [4] Hassan Ashtiani and Shai Ben-David. Representation learning for clustering: A statistical framework. In Marina Meila and Tom Heskes, editors, Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI 2015, July 12-16, 2015, Amsterdam, The Netherlands, pages 82–91. AUAI Press, 2015. URL: http://auai.org/uai2015/proceedings/papers/305.pdf.
  • [5] Hassan Ashtiani, Shrinu Kushagra, and Shai Ben-David. Clustering with same-cluster queries. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 3216–3224, 2016. URL: https://proceedings.neurips.cc/paper/2016/hash/9597353e41e6957b5e7aa79214fcb256-Abstract.html.
  • [6] Sepehr Assadi, Vaggos Chatziafratis, Jakub Lacki, Vahab Mirrokni, and Chen Wang. Hierarchical clustering in graph streams: Single-pass algorithms and space lower bounds. In COLT, volume 178 of Proceedings of Machine Learning Research, pages 4643–4702. PMLR, 2022. URL: https://proceedings.mlr.press/v178/assadi22a.html.
  • [7] Sugato Basu, Arindam Banerjee, and Raymond Mooney. Semi-supervised clustering by seeding. In In Proceedings of 19th International Conference on Machine Learning (ICML-2002. Citeseer, 2002.
  • [8] Pavel Berkhin. A survey of clustering data mining techniques. In Grouping Multidimensional Data, pages 25–71. Springer, 2006. doi:10.1007/3-540-28349-8_2.
  • [9] Marco Bressan, Nicolò Cesa-Bianchi, Silvio Lattanzi, and Andrea Paudice. Exact recovery of mangled clusters with same-cluster queries. In Advances in Neural Information Processing Systems, volume 33, pages 9324–9334, 2020.
  • [10] Marco Bressan, Nicolò Cesa-Bianchi, Silvio Lattanzi, and Andrea Paudice. Exact recovery of clusters in finite metric spaces using oracle queries. In Proc. of COLT, volume 134, pages 775–803, 2021. URL: http://proceedings.mlr.press/v134/bressan21a.html.
  • [11] Moses Charikar and Vaggos Chatziafratis. Approximate hierarchical clustering via sparsest cut and spreading metrics. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 841–854. SIAM, 2017. doi:10.1137/1.9781611974782.53.
  • [12] Ashish Chiplunkar, Michael Kapralov, Sanjeev Khanna, Aida Mousavifar, and Yuval Peres. Testing graph clusterability: Algorithms and lower bounds. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 497–508. IEEE, 2018. doi:10.1109/FOCS.2018.00054.
  • [13] Vincent Cohen-Addad, Varun Kanade, Frederik Mallmann-Trenn, and Claire Mathieu. Hierarchical clustering: Objective functions and algorithms. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 378–397. SIAM, 2018. doi:10.1137/1.9781611975031.26.
  • [14] Artur Czumaj, Pan Peng, and Christian Sohler. Testing cluster structure of graphs. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015, pages 723–732, 2015. doi:10.1145/2746539.2746618.
  • [15] Artur Czumaj and Christian Sohler. Testing expansion in bounded-degree graphs. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2007), October 20-23, 2007, Providence, RI, USA, Proceedings, pages 570–578. IEEE Computer Society, 2007. doi:10.1109/FOCS.2007.69.
  • [16] Sanjoy Dasgupta. A cost function for similarity-based hierarchical clustering. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 118–127, 2016. doi:10.1145/2897518.2897527.
  • [17] Ayhan Demiriz, Kristin P Bennett, and Mark J Embrechts. Semi-supervised clustering using genetic algorithms. Artificial neural networks in engineering (ANNIE-99), pages 809–814, 1999.
  • [18] Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, 95(25):14863–14868, 1998.
  • [19] Ehsan Emamjomeh-Zadeh and David Kempe. Adaptive hierarchical clustering using ordinal queries. In Proc. of ACM-SIAM SODA, pages 415–429. SIAM, 2018. doi:10.1137/1.9781611975031.28.
  • [20] Donatella Firmani, Sainyam Galhotra, Barna Saha, and Divesh Srivastava. Robust entity resolution using a crowd oracle. IEEE Data Eng. Bull., 41(2):91–103, 2018. URL: http://sites.computer.org/debull/A18june/p91.pdf.
  • [21] Frédéric Gilbert, Paolo Simonetto, Faraz Zaidi, Fabien Jourdan, and Romain Bourqui. Communities and hierarchical structures in dynamic social networks: analysis and visualization. Social Network Analysis and Mining, 1(2):83–95, 2011. doi:10.1007/S13278-010-0002-8.
  • [22] Grzegorz Gluch, Michael Kapralov, Silvio Lattanzi, Aida Mousavifar, and Christian Sohler. Spectral clustering oracles in sublinear time. In Dániel Marx, editor, Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms, SODA 2021, Virtual Conference, January 10 - 13, 2021, 2021. doi:10.1137/1.9781611976465.97.
  • [23] Oded Goldreich and Dana Ron. Property testing in bounded degree graphs. Algorithmica, 32(2):302–343, 2002. doi:10.1007/S00453-001-0078-7.
  • [24] Oded Goldreich and Dana Ron. On testing expansion in bounded-degree graphs. In Studies in Complexity and Cryptography. Miscellanea on the Interplay between Randomness and Computation - In Collaboration with Lidor Avigad, Mihir Bellare, Zvika Brakerski, Shafi Goldwasser, Shai Halevi, Tali Kaufman, Leonid Levin, Noam Nisan, Dana Ron, Madhu Sudan, Luca Trevisan, Salil Vadhan, Avi Wigderson, David Zuckerman, pages 68–75. Springer, 2011. doi:10.1007/978-3-642-22670-0_9.
  • [25] Anja Gruenheid, Besmira Nushi, Tim Kraska, Wolfgang Gatterbauer, and Donald Kossmann. Fault-tolerant entity resolution with the crowd. CoRR, abs/1512.00537, 2015. arXiv:1512.00537.
  • [26] Wasim Huleihel, Arya Mazumdar, Muriel Médard, and Soumyabrata Pal. Same-cluster querying for overlapping clusters. In Advances in Neural Information Processing Systems 32, pages 10485–10495, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/8a94ecfa54dcb88a2fa993bfa6388f9e-Abstract.html.
  • [27] Satyen Kale and C. Seshadhri. An expansion tester for bounded degree graphs. In Automata, Languages and Programming, 35th International Colloquium, ICALP 2008, Reykjavik, Iceland, July 7-11, 2008, Proceedings, Part I: Tack A: Algorithms, Automata, Complexity, and Games, pages 527–538, 2008. doi:10.1007/978-3-540-70575-8_43.
  • [28] Michael Kapralov, Akash Kumar, Silvio Lattanzi, and Aida Mousavifar. Learning hierarchical cluster structure of graphs in sublinear time. In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 925–939. SIAM, 2023. doi:10.1137/1.9781611977554.CH36.
  • [29] Brian Kulis, Sugato Basu, Inderjit Dhillon, and Raymond Mooney. Semi-supervised graph clustering: a kernel approach. Machine learning, 74(1):1–22, 2009. doi:10.1007/S10994-008-5084-4.
  • [30] James R Lee, Shayan Oveis Gharan, and Luca Trevisan. Multiway spectral partitioning and higher-order cheeger inequalities. Journal of the ACM (JACM), 61(6):37, 2014.
  • [31] Bogdan-Adrian Manghiuc and He Sun. Hierarchical clustering: O(1)-approximation for well-clustered graphs. In NeurIPS, pages 9278–9289, 2021. URL: https://proceedings.neurips.cc/paper/2021/hash/4d68e143defa221fead61c84de7527a3-Abstract.html.
  • [32] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to information retrieval. Cambridge University Press, 2008. doi:10.1017/CBO9780511809071.
  • [33] Arya Mazumdar and Soumyabrata Pal. Semisupervised Clustering, AND-Queries and Locally Encodable Source Coding. In Advances in Neural Information Processing Systems 30, pages 6489–6499, 2017. URL: https://proceedings.neurips.cc/paper/2017/hash/2131f8ecf18db66a758f718dc729e00e-Abstract.html.
  • [34] Arya Mazumdar and Barna Saha. Clustering with noisy queries. In Advances in Neural Information Processing Systems 30, pages 5788–5799, 2017. URL: https://proceedings.neurips.cc/paper/2017/hash/db5cea26ca37aa09e5365f3e7f5dd9eb-Abstract.html.
  • [35] Arya Mazumdar and Barna Saha. Query complexity of clustering with side information. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4682–4693. Curran Associates, Inc., 2017. URL: http://papers.nips.cc/paper/7054-query-complexity-of-clustering-with-side-information.pdf.
  • [36] Asaf Nachmias and Asaf Shapira. Testing the expansion of a graph. Inf. Comput., 208(4):309–314, 2010. doi:10.1016/j.ic.2009.09.002.
  • [37] Pan Peng. Robust clustering oracle and local reconstructor of cluster structure of graphs. In Shuchi Chawla, editor, Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA 2020, Salt Lake City, UT, USA, January 5-8, 2020, pages 2953–2972. SIAM, 2020. doi:10.1137/1.9781611975994.179.
  • [38] Barna Saha and Sanjay Subramanian. Correlation clustering with same-cluster queries bounded by optimal cost. CoRR, abs/1908.04976, 2019. arXiv:1908.04976.
  • [39] Janne Sinkkonen and Samuel Kaski. Clustering based on conditional distributions in an auxiliary space. Neural Computation, 14(1):217–239, 2002. doi:10.1162/089976602753284509.
  • [40] Vasilis Verroios and Hector Garcia-Molina. Entity resolution with crowd errors. In Proc. of IEEE ICDE, pages 219–230, 2015. doi:10.1109/ICDE.2015.7113286.
  • [41] Vasilis Verroios, Hector Garcia-Molina, and Yannis Papakonstantinou. Waldo: An adaptive human interface for crowd entity resolution. In Proc. of ACM SIGMOD, pages 1133–1148, 2017. doi:10.1145/3035918.3035931.
  • [42] Fabio Vitale, Anand Rajagopalan, and Claudio Gentile. Flattening a hierarchical clustering through active learning. In Advances in Neural Information Processing Systems 32, pages 15263–15273, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/03793ef7d06ffd63d34ade9d091f1ced-Abstract.html.