
Cluster Editing on Cographs and Related Classes

Manuel Lafond, Department of Computer Science, Université de Sherbrooke, Canada; Alitzel López Sánchez, Department of Computer Science, Université de Sherbrooke, Canada; Weidong Luo, Department of Computer Science, Université de Sherbrooke, Canada
Abstract

In the Cluster Editing problem, sometimes known as (unweighted) Correlation Clustering, we must insert and delete a minimum number of edges to achieve a graph in which every connected component is a clique. Owing to its applications in computational biology, social network analysis, machine learning, and others, this problem has been widely studied for decades and is still undergoing active research. There exist several parameterized algorithms for general graphs, but little is known about the complexity of the problem on specific classes of graphs.

Among the few important results in this direction, if only deletions are allowed, the problem can be solved in polynomial time on cographs, which are the P4-free graphs. However, the complexity of the broader editing problem on cographs is still open. We show that even on a very restricted subclass of cographs, the problem is NP-hard, W[1]-hard when parameterized by the number p of desired clusters, and that time n^{o(p/log p)} is forbidden under the ETH. This shows that the editing variant is substantially harder than the deletion-only case, and that hardness holds for the many superclasses of cographs (including graphs of clique-width at most 2, perfect graphs, circle graphs, and permutation graphs). On the other hand, we provide an almost tight upper bound of time n^{O(p)}, which is a consequence of a more general n^{O(cw·p)} time algorithm, where cw is the clique-width. Given that forbidding P4s maintains NP-hardness, we look at {P4, C4}-free graphs, also known as trivially perfect graphs, and provide a cubic-time algorithm for this class.

Keywords and phrases:
Cluster editing, cographs, parameterized algorithms, clique-width, trivially perfect graphs
Copyright and License:
© Manuel Lafond, Alitzel López Sánchez, and Weidong Luo; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Design and analysis of algorithms
Related Version:
Full Version: https://arxiv.org/abs/2412.12454
Acknowledgements:
We thank the anonymous reviewers for their valuable comments.
Editors:
Olaf Beyersdorff, Michał Pilipczuk, Elaine Pimentel, and Nguyễn Kim Thắng

1 Introduction

Clustering objects into groups of similarity is a ubiquitous task in computer science, with applications in computational biology [45, 40], social network analysis [46, 1, 2], machine learning [16, 18], and many others [4]. There are many interpretations of what a “good” clustering is, with one of the most simple, elegant, and useful being the Cluster Editing formulation – sometimes also known as (unweighted) Correlation Clustering [3]. In this graph-theoretical view, pairs of objects that are believed to be similar are linked by an edge, and non-edges correspond to dissimilar objects. If groups are perfectly separable, this graph should be a cluster graph, that is, a graph in which each connected component is a clique. However, due to noise and errors, this is almost never observed in practice. To remove such errors, Cluster Editing asks for a minimum number of edges to correct to obtain a cluster graph, where “correcting” means adding a non-existing edge or deleting an edge.

Owing to its importance, this APX-hard [12] (and of course NP-hard [3, 37]) problem has been widely studied in the parameterized complexity community. Let k be the number of required edge modifications. After a series of works, the problem can now be solved in time O(1.62^k) [6, 7, 8, 24, 25] and admits a 2k-vertex kernel [11, 13, 26]. In addition, if we require that the solution contains exactly p clusters, then the problem is NP-hard for every p ≥ 2 [45], but admits a PTAS [23], a (p+2)k+p kernel [26], and can be solved in 2^{O(√(pk))}·n^{O(1)} time. This is shown to be tight assuming the ETH, under which 2^{o(√(pk))}·n^{O(1)} is not possible [19].

Another angle, which we study in this paper, is to focus on specific classes of graphs. For example, restricting the input to bounded-degree graphs does not help, as Cluster Editing is NP-hard even on planar unit disk graphs with maximum degree 4 [35, 43]. In [5], the authors circumvent the APX-hardness of the problem by proposing a PTAS on planar graphs. A polynomial-time algorithm is provided for the problem on unit interval graphs [41], a subclass of unit disk graphs. The Cluster Deletion problem, in which only edge deletions are allowed, has received much more attention on restricted classes. It is polynomial-time solvable on graphs of maximum degree 3 and NP-hard for larger degrees [35]. Various results were also obtained on interval and split graphs [36], other subclasses of chordal graphs [9], and unit disk graphs [43]. In [31], graphs with bounded structural parameters are studied: the weighted variant is para-NP-hard parameterized by twin cover number, while the unweighted case is FPT.

If we forbid specific induced subgraphs, the reduction in [35] implies that Cluster Deletion is NP-hard on C4-free graphs (as observed in [21]). If instead we forbid P4s, i.e., induced paths on four vertices, we obtain the class of cographs, on which the deletion problem is remarkably shown to be polynomial-time solvable in [21] using a greedy max-clique strategy. However, the complexity of Cluster Editing on cographs has remained open. In addition, to our knowledge, there are no known non-trivial polynomial-time algorithms for Cluster Editing on specific graph classes with a finite set of forbidden induced subgraphs. It is not hard to obtain a polynomial-time algorithm for the problem on threshold graphs, i.e., the {P4, C4, 2K2}-free graphs, using standard dynamic programming on the cotree. However, it appears to be unknown whether removing any of the three induced subgraphs from this set leads to NP-hardness.

In this paper, we focus on open complexity questions for the Cluster Editing problem on cographs and related classes. It is worth mentioning that the cograph restriction is more than a mere complexity classification endeavor: it can be useful to determine how well an equivalence relation (i.e., a cluster graph) can approximate a different type of relation (see for example [47]). In the case of cograph-like relations, our motivations have roots in phylogenetics. In this field, gene pairs are classified into orthologs and paralogs, with orthology graphs known to correspond exactly to cographs [29, 38, 30]. However, as argued in [44], most orthology prediction software use clustering libraries and infer a cluster graph of orthologs. The question that arises is then "how much information is lost by predicting a cluster graph, knowing that the true graph is a cograph?" This requires finding a cluster graph that is the closest to a given cograph G, leading to Cluster Editing on cographs. Furthermore, researchers argue that social communities should sometimes be modeled as cographs [33] or trivially perfect graphs [42], as opposed to cluster graphs as is done in most community detection approaches. This leads to Cluster Editing on cographs or trivially perfect graphs. Additionally, an algorithm for the NP-hard Trivially Perfect Graph Editing problem, which scales to large real-world graphs in practice, is provided in [10].

Our contributions.

We first settle the complexity of Cluster Editing on cographs by showing that it is not only NP-hard, but also W[1]-hard when a parameter p is specified, representing the number of desired clusters. We use the Unary Bin Packing hardness results from [32], which also imply an Exponential Time Hypothesis (ETH) lower bound forbidding time n^{o(p/log p)} under this parameter. In fact, our hardness holds for very restricted classes of cographs, namely graphs obtained by taking two cluster graphs and adding all possible edges between them (this also corresponds to cographs that admit a cotree of height 3). Moreover, because cographs have clique-width (cw) at most 2, this also means that the problem is para-NP-hard in clique-width, and that a complexity of the form n^{g(cw)} is unlikely, for any function g (the same actually holds for the modular-width parameter and generalizations, see [20, 39]). In fact, the ETH forbids time f(p)·n^{g(cw)+o(p/log p)} for any functions f and g, which contrasts with the aforementioned subexponential bounds in √(pk) [19].

The hardness also extends to all superclasses of cographs, such as circle graphs, perfect graphs, and permutation graphs. On the other hand, we show that time n^{O(p)} can be achieved on any cograph, which is almost tight. This contrasts with the general Cluster Editing problem, which is NP-hard even when p = 2. In fact, this complexity follows from a more general algorithm for arbitrary graphs that runs in time n^{O(cw·p)}, which shows that Cluster Editing is XP in the parameter cw + p. Note that our hardness results imply that XP membership in either parameter individually is unlikely, so under standard assumptions both cw and p must contribute to the exponent of n.

Finally, we aim to find the largest subclass of cographs on which Cluster Editing is polynomial-time solvable. The literature mentioned above implies that such a class lies somewhere between P4-free and {P4,C4,2K2}-free graphs. We improve the latter by showing that Cluster Editing can be solved in time O(n3) on {P4,C4}-free graphs, also known as trivially perfect graphs (TPG). This result is achieved by a characterization of optimal clusterings on TPGs, which says that as we build a clustering going up in the cotree, only the largest cluster is allowed to become larger as we proceed.

Our results are summarized in the following, where n is the number of vertices of the graphs, and p-Cluster Editing is the variant in which the edge modifications must result in p connected components that are cliques. We treat p as a parameter specified in the input.

Theorem 1.

The following results hold:

  • Cluster Editing is NP-complete on cographs, and solvable in time O(n^3) on trivially perfect graphs.

  • p-Cluster Editing admits an n^{O(cw·p)} time algorithm if a cw-expression is given, but admits no f(p)·n^{g(cw)+o(p/log p)} time algorithm for any functions f and g unless ETH fails.

  • p-Cluster Editing on cographs is NP-complete, and W[1]-hard parameterized by p.

  • p-Cluster Editing on cographs admits an n^{O(p)} time algorithm, but admits no f(p)·n^{o(p/log p)} time algorithm for any function f unless ETH fails.

2 Preliminaries

We use the notation [n] = {1, …, n}. For two sets A and B, A △ B is the symmetric difference of A and B. For a graph G, V(G) and E(G) are the vertex and edge sets of G, respectively, and G[S] is the subgraph induced by S ⊆ V(G). The complement of G is denoted G̅. Given two graphs G and H, the disjoint union G ∪ H is the graph with vertex set V(G) ∪ V(H) and edge set E(G) ∪ E(H). The join G ∨ H of two graphs G and H is the graph obtained from G ∪ H by adding every possible edge uv with u ∈ V(G) and v ∈ V(H).

It will be useful to consider two equivalent views on the Cluster Editing problem: in terms of the edge operations to perform to achieve a cluster graph, and in terms of the resulting partition into clusters. A graph is a cluster graph if it is a disjoint union of complete graphs. Let G = (V, E) be a graph and F ⊆ V × V. If G′ = (V, E △ F) is a cluster graph, then F is called a cluster editing set. The edges of F can be divided into two types: those in F ∩ E(G) are called deleted edges, and those in F ∖ E(G) are called inserted edges; the deleted edges disconnect some adjacent vertices of G and the inserted edges connect some non-adjacent vertices of G to transform G into G′.

Note that the clusters of G′ form a partition of V(G). Conversely, given any partition 𝒞 of V(G), we can easily infer the editing set F that yields the clusters 𝒞: it consists of the non-edges within the clusters, and of the edges with endpoints in different clusters. To be precise, a clustering of G is a partition 𝒞 = {C_1, …, C_l} of V(G). The cluster editing set of 𝒞 is

edit(𝒞) := {uv ∈ E(G) : u ∈ C_i, v ∈ C_j, i ≠ j} ∪ ⋃_{i ∈ [l]} E(G[C_i]̅).

We define cost_G(𝒞) = |edit(𝒞)|. An element C ∈ 𝒞 is called a cluster, and the cardinality of 𝒞 is called the cluster number. An optimal cluster editing set for G is a cluster editing set for G of minimum size. An optimal clustering is a partition 𝒞 of V(G) of minimum cost.
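To make the partition view concrete, the following sketch (our own helper, not from the paper) computes cost_G(𝒞) directly from the definition: edges crossing clusters are deletions, and non-edges inside a cluster are insertions.

```python
from itertools import combinations

def clustering_cost(edges, clusters):
    """cost_G(C): edges between different clusters (deletions)
    plus non-edges inside a cluster (insertions)."""
    edge_set = {frozenset(e) for e in edges}
    where = {v: i for i, C in enumerate(clusters) for v in C}
    deletions = sum(1 for e in edge_set
                    if len({where[v] for v in e}) == 2)
    insertions = sum(1 for C in clusters
                     for u, v in combinations(sorted(C), 2)
                     if frozenset((u, v)) not in edge_set)
    return deletions + insertions

# Path a-b-c: both natural clusterings have cost 1
print(clustering_cost([('a', 'b'), ('b', 'c')], [{'a', 'b'}, {'c'}]))  # 1
print(clustering_cost([('a', 'b'), ('b', 'c')], [{'a', 'b', 'c'}]))    # 1
```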

A formal definition of the Cluster Editing problem is as follows.
Cluster Editing
Input: A graph G and an integer k.
Question: Is there a clustering of G with cost at most k?

A clustering with exactly p clusters is called a p-clustering. In p-Cluster Editing, the problem is the same, but we must find a p-clustering of cost at most k.

We will sometimes use the fact that twins, i.e., pairs of vertices with the same closed neighborhood, can be assumed to be in the same cluster. More generally, a critical clique of a graph G is a clique K such that all v ∈ K have the same neighbors in V(G) ∖ K, and K is maximal under this property.

Proposition 2 ([13, 26]).

Let K be a critical clique of G. For any optimal clustering 𝒞 of G, there is a C ∈ 𝒞 such that K ⊆ C.

Note that Proposition 2 does not always hold if we require a p-clustering instead of any optimal clustering. For example, if p is imposed and G is a complete graph with p vertices, then G consists of a single critical clique, which must be broken. We will ensure that this is not problematic in the results that follow.
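The critical-clique partition behind Proposition 2 can be computed by grouping true twins, i.e., vertices with identical closed neighborhoods. A minimal sketch (function names are ours, not the paper's):

```python
from collections import defaultdict

def critical_cliques(vertices, edges):
    """Partition vertices into critical cliques: maximal groups of
    vertices sharing the same closed neighborhood (true twins).
    Vertices with equal closed neighborhoods are pairwise adjacent,
    so each group is indeed a clique."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    groups = defaultdict(list)
    for v in vertices:
        groups[frozenset(adj[v] | {v})].append(v)
    return [sorted(g) for g in groups.values()]

# Triangle 0,1,2 with a pendant vertex 3 attached to 2:
# 0 and 1 are true twins, 2 and 3 are alone in their critical cliques.
print(sorted(critical_cliques([0, 1, 2, 3],
                              [(0, 1), (0, 2), (1, 2), (2, 3)])))
# [[0, 1], [2], [3]]
```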

Cographs and cotrees.

A cograph is a graph that can be constructed using the three following rules: (1) a single vertex is a cograph; (2) the disjoint union G ∪ H of cographs G, H is a cograph; (3) the join G ∨ H of cographs G, H is a cograph. Cographs are exactly the P4-free graphs, i.e., the graphs that do not contain a path on four vertices as an induced subgraph.

Cographs are also known for their tree representation. For a graph G, a cotree for G is a rooted tree T whose set of leaves, denoted L(T), satisfies L(T) = V(G). Moreover, every internal node v ∈ V(T) ∖ L(T) is one of two types, either a 0-node or a 1-node, such that uv ∈ E(G) if and only if the lowest common ancestor of u and v in T is a 1-node (to emphasize the distinction between general graphs and trees, we refer to vertices of a tree as nodes). It is well-known that G is a cograph if and only if there exists a cotree T for G. This can be seen from the intuition that leaves represent applications of Rule (1) above, 0-nodes represent disjoint unions, and 1-nodes represent joins.
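The cotree characterization can be illustrated by recovering a cograph's edges from its cotree: at each 1-node, all leaves of distinct subtrees become adjacent. A minimal sketch, using an ad-hoc nested-tuple encoding of cotrees that is our own, not the paper's notation:

```python
def cotree_leaves(t):
    """Leaves (vertices) below a cotree node."""
    if not isinstance(t, tuple):
        return [t]  # a bare value is a leaf, i.e., a vertex
    _, children = t
    return [v for c in children for v in cotree_leaves(c)]

def cotree_to_edges(t):
    """Edges of the cograph represented by a cotree ('0'/'1', children):
    u~v iff their lowest common ancestor is a 1-node, so a 1-node joins
    the leaf sets of its distinct children while a 0-node adds nothing."""
    if not isinstance(t, tuple):
        return set()
    op, children = t
    edges = set()
    for c in children:
        edges |= cotree_to_edges(c)
    if op == '1':
        parts = [cotree_leaves(c) for c in children]
        for i in range(len(parts)):
            for j in range(i + 1, len(parts)):
                for u in parts[i]:
                    for v in parts[j]:
                        edges.add(frozenset((u, v)))
    return edges

# P3 = join of b with the disjoint union of a and c: edges ab and bc only
t = ('1', ['b', ('0', ['a', 'c'])])
print(sorted(tuple(sorted(e)) for e in cotree_to_edges(t)))
# [('a', 'b'), ('b', 'c')]
```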

A trivially perfect graph (TPG) is, among several characterizations, a cograph G that has no induced cycle on four vertices, i.e., a {P4, C4}-free graph. TPGs are also exactly the chordal cographs. For our purposes, a TPG is a cograph that admits a cotree T in which every 1-node has at most one child that is not a leaf (see [28, Lemma 4.1]).

Clique-width and NLC-width.

Our clique-width (cw) algorithm does not use the notion of cw directly, but instead the analogous measure of NLC-width [28]. We only provide the definition of the latter. Let G = (V, E, lab) be a graph in which every vertex has one of k labels (a k-NL graph), where lab is a function from V to [k]. A single-vertex graph labeled t is denoted •_t.

Definition 3.

A k-node labeled controlled (k-NLC) graph is a k-NL graph defined recursively as follows.

  1. •_t is a k-NLC graph for every t ∈ [k].

  2. Let G_1 = (V_1, E_1, lab_1) and G_2 = (V_2, E_2, lab_2) be two k-NLC graphs, let S ⊆ [k]² be a relation, and let E_add = {uv : u ∈ V_1, v ∈ V_2, (lab_1(u), lab_2(v)) ∈ S}. The k-NL graph G = (V, E, lab), denoted G_1 ×_S G_2, is a k-NLC graph defined by V = V_1 ∪ V_2, E = E_1 ∪ E_2 ∪ E_add, and lab(u) = lab_1(u) for u ∈ V_1, lab(v) = lab_2(v) for v ∈ V_2.

  3. Let G = (V, E, lab) be a k-NLC graph and let R be a function from [k] to [k]. Then ∘_R(G) = (V, E, lab′) is a k-NLC graph, where lab′(v) = R(lab(v)) for all v ∈ V.

Intuitively, operation 2, denoted ×_S, takes the disjoint union of the graphs G_1, G_2, then adds all possible edges between vertices of G_1 labeled i and vertices of G_2 labeled i′, for each pair (i, i′) ∈ S. Operation 3, denoted ∘_R, relabels the vertices of a graph as prescribed by R. NLC_k denotes the class of all k-NLC graphs. The NLC-width of a labeled graph G is the smallest k such that G is a k-NLC graph. Furthermore, for a labeled graph G = (V, E, lab), the NLC-width of the graph G′ = (V, E), obtained from G by removing all labels, equals the NLC-width of G.
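These operations can be mimicked directly on labeled graphs represented as (V, E, lab) triples. A minimal sketch, with our own function names, assuming the two vertex sets are disjoint:

```python
def cross(G1, G2, S):
    """The x_S operation: disjoint union of G1 and G2, plus every edge
    u-v with u in G1 and v in G2 whose label pair (lab1(u), lab2(v))
    lies in the relation S."""
    V1, E1, lab1 = G1
    V2, E2, lab2 = G2
    E_add = {(u, v) for u in V1 for v in V2
             if (lab1[u], lab2[v]) in S}
    return (V1 | V2, E1 | E2 | E_add, {**lab1, **lab2})

def relabel(G, R):
    """The o_R operation: map every label through R: [k] -> [k]."""
    V, E, lab = G
    return (V, E, {v: R[lab[v]] for v in V})

def single(v, t):
    """A single vertex v labeled t."""
    return ({v}, set(), {v: t})

# Build P3 a-b-c as a 2-NLC graph: union a, c (label 1), then join b
# (label 2) to everything labeled 1 via S = {(1, 2)}.
G = cross(cross(single('a', 1), single('c', 1), set()),
          single('b', 2), {(1, 2)})
print(sorted(G[1]))  # [('a', 'b'), ('c', 'b')]
```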

A well-parenthesized expression X built with operations 1, 2, 3 is called a k-expression. The graph constructed by X is denoted G_X. We associate the k-expression tree T of G_X with X in the natural way: leaves of T correspond to the vertices of G_X, and the internal nodes of T correspond to the operations 2 and 3 of X. For each node u of T, the subtree rooted at u corresponds to a sub-k-expression of X denoted X_u, and the graph G_{X_u} constructed by X_u is also denoted G_u. Moreover, we say G_u is the related graph of u in T.

Let us briefly mention that a graph has clique-width at most k if it can be built using what we will call a cw(k)-expression, which has four operations instead: creating a single vertex with label t; taking the disjoint union; adding all edges between vertices of a specified label pair (i, j) with i ≠ j; and relabeling all vertices with label i to another label j. We do not detail this further, since the following allows us to use NLC-width instead of clique-width.

Proposition 4 ([27, 34]).

For every graph G, the clique-width of G is at least the NLC-width of G, and at most two times the NLC-width of G. Moreover, any given cw(k)-expression can be transformed into an equivalent k-expression in polynomial time (i.e., yielding the same graph).

We will assume that our k-expressions are derived from a given cw(k)-expression. Reference [34] is often cited for this transformation, but appears to be no longer available. The transformation can be done using the normal form of a cw(k)-expression described in [17].

3 Cluster editing on cographs

We first prove our hardness results for Cluster Editing and p-Cluster Editing on cographs, using a reduction from Unary Bin Packing. An instance of Unary Bin Packing consists of a unary-encoded integer multiset A = [a_1, …, a_n], which represents the sizes of n items, and integers b, k. We must decide whether the items can be packed into at most k bins that each have capacity b. Thus, we must partition A into k multisets A_1, …, A_k, some possibly empty, such that Σ_{a ∈ A_i} a ≤ b for each i ∈ [k].

For our purpose, we introduce a variant of this problem called Unary Perfect Bin Packing. The problem is identical, except that the partition of A into A_1, …, A_k must satisfy Σ_{a ∈ A_i} a = b for each i ∈ [k]. That is, we must fill all k bins to their maximum capacity b. Note that for this to be possible, Σ_{i=1}^n a_i = kb must hold. Note that these packing problems can be solved in polynomial time for any fixed k [32]. We assume henceforth that k, n ≥ 10, max_{i=1}^n a_i ≤ b, and Σ_{i=1}^n a_i ≤ kb, as otherwise the instances can be decided in polynomial time. It is known that Unary Bin Packing is NP-complete [22], W[1]-hard parameterized by the number of bins k [32], and does not admit an f(k)·|I|^{o(k/log k)} time algorithm for any function f unless the Exponential Time Hypothesis (ETH) fails [32], where |I| is the size of the instance (in unary). These results easily extend to the perfect version.
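The fixed-k tractability claim can be realized by a dynamic program over sorted bin-load vectors: there are at most (b+1)^k such vectors, which is polynomial in the unary instance size for fixed k. A minimal sketch of the perfect variant (our own implementation, not the algorithm of [32]):

```python
from functools import lru_cache

def perfect_pack(items, k, b):
    """Decide whether the items can be split into k bins, each filled
    to exactly b. State: the sorted tuple of current bin loads."""
    if sum(items) != k * b:
        return False

    @lru_cache(maxsize=None)
    def go(i, loads):
        if i == len(items):
            return all(l == b for l in loads)
        seen = set()
        for j, l in enumerate(loads):
            if l in seen or l + items[i] > b:
                continue  # skip symmetric or overfull placements
            seen.add(l)
            nxt = tuple(sorted(loads[:j] + (l + items[i],) + loads[j + 1:]))
            if go(i + 1, nxt):
                return True
        return False

    return go(0, (0,) * k)

print(perfect_pack([3, 3, 2, 2, 1, 1], 2, 6))  # True  (3+2+1 twice)
print(perfect_pack([5, 1, 1, 1], 2, 4))        # False (an item exceeds b)
```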

Proposition 5.

Unary Perfect Bin Packing is NP-complete, W[1]-hard parameterized by k, and has no f(k)·|I|^{o(k/log k)} time algorithm for any function f unless ETH fails.

Proof.

Clearly, Unary Perfect Bin Packing is in NP. Let (A, b, k) be an instance of Unary Bin Packing, where A = [a_1, …, a_n]. We assume that k is even, as otherwise we increase k by 1 and add to A an item of value b. If Σ_{i=1}^n a_i ≤ 0.1kb, we argue that we have a YES instance: we can pack the largest k/2 of the n items into k/2 bins, and pack the other n − k/2 items in the remaining bins, as follows. Assume w.l.o.g. that b ≥ a_1 ≥ … ≥ a_{k/2} ≥ a_{k/2+1} ≥ … ≥ a_n. We have (k/2)·a_{k/2} ≤ Σ_{i=1}^{k/2} a_i ≤ 0.1kb, so a_{k/2} ≤ 0.2b. This means that a_{k/2+1}, …, a_n ≤ 0.2b, and we can repeatedly select a batch of items with these sizes such that the total size of each batch is in the interval [0.8b, b], until no items are left (the last batch may have size less than 0.8b). Thus, we can pack the items with sizes a_{k/2+1}, …, a_n using at most 0.1kb/(0.8b) + 1 = 0.125k + 1 < k/2 bins.

So assume henceforth that Σ_{i=1}^n a_i > 0.1kb. Construct an instance (B, b, k) for Unary Perfect Bin Packing, where B = [a_1, …, a_n, 1, …, 1] consists of all integers of A together with kb − Σ_{i=1}^n a_i items of size 1. Since 0.1kb < Σ_{i=1}^n a_i ≤ kb, the new instance size is bounded by a linear function of the original instance size. Moreover, the new instance can be obtained in polynomial time. It is easy to verify that (A, b, k) is a YES instance of Unary Bin Packing if and only if (B, b, k) is a YES instance of Unary Perfect Bin Packing.

Lemma 6.

Given a unary-encoded integer multiset A = [a_1, …, a_n] for the sizes of n items, and integers b, k satisfying kb = Σ_{i=1}^n a_i, there is a polynomial-time algorithm which outputs a cograph G and an integer t such that,

  1. 1.

    for any optimal clustering 𝒞 of G, |𝒞|=k and the cost of 𝒞 is at least t;

  2. 2.

    the n items can be perfectly packed by k bins with capacity b if and only if there is a clustering of G with cost at most t.

Moreover, G is obtained by taking a join of two cluster graphs.

Proof.

For the remainder, let us denote a := Σ_{i=1}^n a_i and s := Σ_{i=1}^n a_i². Construct an instance (G, t) for Cluster Editing as follows. First, add two cluster graphs I = B_1 ∪ … ∪ B_k and J = A_1 ∪ … ∪ A_n into G, where B_i is a complete graph with h := (nka)^{10} vertices for every i ∈ [k], and A_j is a complete graph with a_j vertices for every j ∈ [n]. Then, connect each v ∈ V(I) to all vertices of V(J). See Figure 1. One can easily verify that G is a cograph obtained from the join of two cluster graphs. In addition, define t := (k−1)ah + ½(kb² − s). Clearly, t is an integer since s ≡ a ≡ kb ≡ kb² (mod 2). Let ℬ = {V(B_1), …, V(B_k)} and 𝒥 = {V(A_1), …, V(A_n)}. Clearly, every element of ℬ ∪ 𝒥 is a critical clique of G.

Figure 1: An illustration of the construction. In the subgraph I, each B_i is a "large enough" complete graph, and in the subgraph J, each A_j is a complete graph with a_j vertices. The wiggly line indicates that all possible edges between I and J are present (there are no edges between two B_i's, and no edges between two A_j's).
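For concreteness, the parameters of this construction, and the edit cost induced by a perfect packing (as used later in the (⇒) direction), can be sketched as follows; the function names are ours, and for a perfect packing the computed cost matches t exactly.

```python
def reduction_params(items, k, b):
    """Clique sizes of the two sides I and J of the join, the clique
    size h, and the target t = (k-1)ah + (kb^2 - s)/2 of the Lemma 6
    construction. Assumes sum(items) == k*b."""
    n, a = len(items), sum(items)
    s = sum(x * x for x in items)
    h = (n * k * a) ** 10          # "large enough" size of each B_i
    t = (k - 1) * a * h + (k * b * b - s) // 2
    return [h] * k, list(items), h, t

def packing_cost(bins, h, k):
    """Edit cost of the clustering built from a packing: (k-1)h edge
    deletions per J-vertex, plus one insertion per pair of J-vertices
    that share a cluster but come from different cliques A_j."""
    cost = 0
    for items_in_bin in bins:
        size = sum(items_in_bin)
        sq = sum(x * x for x in items_in_bin)
        cost += (k - 1) * h * size + (size * size - sq) // 2
    return cost

_, _, h, t = reduction_params([3, 2, 1, 3, 2, 1], 2, 6)
print(packing_cost([[3, 2, 1], [3, 2, 1]], h, 2) == t)  # True
```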
Claim 7.

Let 𝒞 be an optimal clustering of G, and let F be the corresponding cluster editing set. Then |𝒞| = k, and each element of 𝒞 is a superset of exactly one element of ℬ. Moreover, |F| ≥ t, and |F| = t if and only if all elements of 𝒞 have the same cardinality.

Proof.

Let 𝒞 = {P_1, …, P_l}. Since each element of ℬ ∪ 𝒥 is a critical clique of G, each such element is a subset of some cluster in 𝒞 by Proposition 2. Note that each cluster of 𝒞 could contain 0, 1, or more cliques of ℬ. We split the possibilities into three cases and provide bounds on |F| for each case. First, if there exists a cluster of 𝒞 that includes at least two critical cliques from ℬ, then F contains at least h² inserted edges to connect two critical cliques from ℬ, and thus |F| ≥ h². Assume instead that every cluster of 𝒞 includes at most one critical clique from ℬ. Then, we have l ≥ k. Suppose l ≥ k + 1. Then, there are l − k clusters in 𝒞 that do not contain any critical clique from ℬ. Let 𝒞′ ⊆ 𝒞 consist of these l − k clusters and let U = ⋃_{P ∈ 𝒞′} P. Then, for every v ∈ U, kh deleted edges are required in F to disconnect v from all vertices of V(I), and for every u ∈ V(J) ∖ U, (k−1)h deleted edges are required in F to disconnect u from all vertices of V(I) ∖ P, where P is the clique of ℬ contained in the same cluster as u. Therefore,

|F| ≥ kh·|U| + (k−1)h·|V(J) ∖ U|
    = h·|U| + (k−1)h·|U| + (k−1)h·|V(J) ∖ U|
    = h·|U| + (k−1)ah
    ≥ (k−1)ah + h.

Now assume that l = k. Then, every element of 𝒞 includes exactly one critical clique from ℬ. Consider each i ∈ [k] and assume w.l.o.g. that V(B_i) ⊆ P_i. Let W_i = P_i ∖ V(B_i) and let {𝒥_1, …, 𝒥_k} be a partition of 𝒥 such that, for each i ∈ [k], the union of all elements of 𝒥_i is W_i (such a partition exists because each element of 𝒥 is entirely contained in some W_i). Firstly, for each v ∈ W_i, (k−1)h deleted edges are required in F to disconnect v from all vertices of V(I) ∖ V(B_i). Secondly, for each i ∈ [k], ½ Σ_{S,S′ ∈ 𝒥_i, S ≠ S′} |S|·|S′| inserted edges are required in F to connect all vertices of W_i. One can easily check that, for each i ∈ [k], F accounts for every edge with one endpoint in P_i and one endpoint outside, and for all non-edges within P_i. Therefore, all possible edges of F are counted. Thus,

|F| = Σ_{i=1}^k (|W_i|(k−1)h + ½ Σ_{S,S′ ∈ 𝒥_i, S ≠ S′} |S|·|S′|)
    = |V(J)|(k−1)h + ½ Σ_{i=1}^k ((Σ_{S ∈ 𝒥_i} |S|)² − Σ_{S ∈ 𝒥_i} |S|²)
    = (k−1)ah + ½ Σ_{i=1}^k |W_i|² − ½ Σ_{S ∈ 𝒥} |S|²
    = (k−1)ah + ½ Σ_{i=1}^k |W_i|² − s/2.

We can now compare |F| with the lower bounds obtained in the previous two cases, as follows.

|F| = (k−1)ah + ½ Σ_{i=1}^k |W_i|² − s/2
    < (k−1)ah + Σ_{i=1}^k |W_i|² + 2 Σ_{1≤i<j≤k} |W_i|·|W_j|
    = (k−1)ah + (Σ_{i=1}^k |W_i|)²
    = (k−1)ah + a²
    < (k−1)ah + h < h².

The last line implies that having |𝒞| = k, with each element of 𝒞 a superset of exactly one element of ℬ, always achieves a lower cost than the other possibilities.

Next, consider the lower bound |F| ≥ t and the conditions for equality. Using the same starting point,

|F| = (k−1)ah + ½ Σ_{i=1}^k |W_i|² − s/2
    = (k−1)ah + (1/2k)·(Σ_{i=1}^k |W_i|²)(Σ_{i=1}^k 1²) − s/2
    ≥ (k−1)ah + (1/2k)·(Σ_{i=1}^k |W_i|)² − s/2   (1)
    = (k−1)ah + a²/(2k) − s/2 = t.

In (1), we used the Cauchy–Schwarz inequality; the two sides are equal if and only if |W_1| = ⋯ = |W_k|. In addition, |W_1| = ⋯ = |W_k| if and only if |P_1| = ⋯ = |P_k| (recall that W_i = P_i ∖ V(B_i) and |V(B_i)| = h for each i ∈ [k]). This proves every statement of the claim.

Armed with the above claim, we immediately deduce the first statement of the lemma. We now show the second part, i.e., the n items of A can fit perfectly into k bins of capacity b if and only if we can make G a cluster graph with at most t edge modifications.
(⇐): Assume there is a clustering 𝒞 of G with cost at most t. By Claim 7, the size of any optimal cluster editing set for G is at least t. Thus, the cost of 𝒞 must be equal to t, and 𝒞 must be optimal. Claim 7 then implies that 𝒞 = {P_1, …, P_k}, such that |P_1| = ⋯ = |P_k| and V(B_i) ⊆ P_i for each i ∈ [k] (w.l.o.g.). Moreover, each critical clique of 𝒥 is a subset of some element of 𝒞 by Proposition 2. So there is a partition 𝒥_1, …, 𝒥_k of 𝒥 such that P_i ∖ V(B_i) = ⋃_{S ∈ 𝒥_i} S for every i ∈ [k]. This means that 𝒥 can be partitioned into k groups such that the number of vertices in each group equals b (as the |P_i| are all equal and the sets P_i ∖ V(B_i) together contain a = kb vertices).
(⇒): Assume the n items can be perfectly packed into k bins of capacity b. We construct a cluster editing set F of size at most t. Consider each i ∈ [k]. Let S_i ⊆ A consist of the sizes of the items packed in the i-th bin, and define a corresponding cluster graph B′_i = ⋃_{a_j ∈ S_i} A_j. We put in F a set of deleted edges, by disconnecting each v ∈ V(B′_i) from every vertex of V(I) ∖ V(B_i) in G, at (k−1)h deletions per vertex. We then add to F the set of ½ Σ_{V(A_j),V(A_r) ⊆ V(B′_i), j ≠ r} a_j a_r inserted edges needed to make B′_i a complete graph, where j, r ∈ [n]. Applying the modifications in F results in a clustering 𝒞 = {P_1, …, P_k} such that P_i = V(B_i) ∪ V(B′_i) for each i ∈ [k]. Moreover, the cardinality of the cluster editing set is

|F| = Σ_{i=1}^k ((k−1)h·|V(B′_i)| + ½ Σ_{V(A_j),V(A_r) ⊆ V(B′_i), j ≠ r} a_j a_r)
    = (k−1)h Σ_{i=1}^n |V(A_i)| + ½ Σ_{i=1}^k ((Σ_{V(A_j) ⊆ V(B′_i)} a_j)² − Σ_{V(A_j) ⊆ V(B′_i)} a_j²)
    = (k−1)h·|V(J)| + ½ Σ_{i=1}^k (|V(B′_i)|² − Σ_{V(A_j) ⊆ V(B′_i)} a_j²)
    = (k−1)ah + ½ Σ_{i=1}^k b² − ½ Σ_{V(A_j) ⊆ V(J)} a_j²
    = (k−1)ah + ½kb² − ½s = t.

Thus, G has a clustering with cost at most t.

Since the reduction given in Lemma 6 is a parameter-preserving Karp reduction from Unary Perfect Bin Packing with parameter k to p-Cluster Editing on cographs with parameter p = k, we can make the following series of deductions.

Theorem 8.

The following hardness results hold:

  • Cluster Editing on cographs and p-Cluster Editing on cographs are NP-complete.

  • p-Cluster Editing on cographs is W[1]-hard parameterized by p.

  • p-Cluster Editing on cographs admits no f(p)·n^{o(p/log p)} time algorithm for any function f unless ETH fails, where n is the number of vertices of the input graph.

Since cographs have a constant clique-width [15], we have the following.

Corollary 9.

p-Cluster Editing admits no f(p)·n^{g(cw)+o(p/log p)} time algorithm for any functions f and g unless ETH fails.

An n^{O(p·cw)} time algorithm

We propose a dynamic programming algorithm over the k-expression tree T of G as described in the preliminaries, where k is at most twice the clique-width of G. The idea is that, for each node u of T and every way of distributing the vertices of V(Gu) among p clusters and k labels, we compute the minimum-cost clustering conditioned on such a distribution. The latter is described as a matrix of vertex counts per cluster per label, and the cost can be computed by combining the appropriate cost for matrices of the children of u.

The entries of all matrices discussed here are natural numbers between 0 and n. M_{i,:} and M_{:,j} denote the i-th row and j-th column of a matrix M, respectively. We write m_{i,j} for the entry of M in row i and column j. The sums of all entries in M_{i,:} and M_{:,j} are denoted sum{M_{i,:}} and sum{M_{:,j}}, respectively. We use {S_i}_{i=1}^n to denote a sequence (S_1, …, S_n).

Let M be a k×p matrix and G be a k-NL graph. An M-sequencing of G is a sequence {C_i}_{i=1}^p of subsets of V(G) such that

  1. 1.

    the non-empty subsets of this sequence form a partition 𝒞 of V(G);

  2. 2.

    the number of vertices in C_j labeled i is equal to m_{i,j}, for every i ∈ [k] and j ∈ [p].

The partition 𝒞 obtained from {C_i}_{i=1}^p is called an M-clustering, and the cost of the M-sequencing is the cost of the clustering 𝒞 of G, which is cost_G(𝒞).

Clearly, an M-sequencing of G exists if and only if the sum of all entries of M equals the number of vertices of G, and the sum of the entries of the i-th row of M equals the number of vertices of G labeled i, for every i ∈ [k]. The cost of an M-sequencing is defined as ∞ if it does not exist. In addition, we say M is a well-defined matrix for G if an M-sequencing of G exists. An optimal M-sequencing of G is an M-sequencing of G of minimum cost.
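The well-definedness condition is a simple row-sum check. A minimal sketch (our own helper), assuming labels are given as a dict from vertex to label in [k]; note that the row conditions already force the total sum to equal the number of vertices:

```python
from collections import Counter

def is_well_defined(M, lab, p):
    """M (k rows, p columns) is well-defined for a labeled graph iff
    row i sums to the number of vertices labeled i+1 (0-indexed rows)."""
    count = Counter(lab.values())
    return (all(len(row) == p for row in M)
            and all(sum(M[i]) == count.get(i + 1, 0)
                    for i in range(len(M))))

lab = {'a': 1, 'b': 1, 'c': 2}   # two vertices labeled 1, one labeled 2
print(is_well_defined([[1, 1], [0, 1]], lab, 2))  # True
print(is_well_defined([[2, 1], [0, 0]], lab, 2))  # False: row sums 3, 0
```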

Theorem 10.

p-Cluster Editing has an O(n^{2p·cw+4}) time algorithm if a k-expression is given, where k = cw and n is the number of vertices of the input graph.

Proof.

Let T be a k-expression tree of the k-NLC graph G = (V, E, lab). For every u ∈ V(T), let G_u = (V_u, E_u, lab_u) be the related graph of u, and let V_i^u denote the set of vertices of V_u labeled i, for each i ∈ [k]. Let M^u = (m_{i,j}^u) be a well-defined matrix for G_u. Assume u is a node of T corresponding to an operation ×_S, and v, w are the two children of u. For later use, we define h(M^v, M^w, S), which equals

Σ_{j ∈ [p]} sum{M^v_{:,j}}·sum{M^w_{:,j}} + Σ_{(i,i′) ∈ S} sum{M^v_{i,:}}·sum{M^w_{i′,:}} − 2 Σ_{(i,i′) ∈ S} Σ_{j ∈ [p]} m^v_{i,j}·m^w_{i′,j}.
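In matrix terms, h can be evaluated directly. A minimal sketch with 0-indexed labels and clusters (the paper's are 1-indexed):

```python
def h_cost(Mv, Mw, S, p):
    """h(M^v, M^w, S): cross-side vertex pairs sharing a cluster, plus
    edges created by S, minus twice the S-created edges that land
    inside a cluster (counted once in each of the first two sums)."""
    col = lambda M, j: sum(row[j] for row in M)
    return (sum(col(Mv, j) * col(Mw, j) for j in range(p))
            + sum(sum(Mv[i]) * sum(Mw[ip]) for (i, ip) in S)
            - 2 * sum(Mv[i][j] * Mw[ip][j]
                      for (i, ip) in S for j in range(p)))

# One label, full join: G_v has one vertex in each of clusters 0 and 1,
# G_w has one vertex in cluster 0. The join edge into cluster 1 must be
# deleted, the other join edge stays, and no insertion is needed.
print(h_cost([[1, 1]], [[1, 0]], {(0, 0)}, 2))  # 1
```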

Next, we will first provide the dynamic programming algorithm for this problem, and then prove its correctness.

Let the dynamic programming table D(u, M^u) denote the cost of an optimal M^u-sequencing of G_u. For a leaf u of T and every well-defined matrix M^u for G_u, we set D(u, M^u) := 0. For an internal node u of T with corresponding operation ×_S and children v, w, we have

D(u, M^u) := min_{M^u = M^v + M^w} {D(v, M^v) + D(w, M^w) + h(M^v, M^w, S)}.

For an internal node u of T with corresponding operation ∘_R and child v, we have

D(u, M^u) := min_{M^v} D(v, M^v),

where the minimization is taken over every well-defined matrix M^v for G_v that satisfies M^u_{i,:} = Σ_{i′ : R(i′) = i} M^v_{i′,:} for every i ∈ [k].

Since every entry of the k×p matrix M^u is at most n, each u ∈ V(T) has at most n^{pk} table entries. For the ×_S nodes, one entry D(u, M^u) can be computed in time O(n^{pk+3}) by enumerating all the possible O(n^{pk}) matrices M^v, from which M^w can be deduced as M^u − M^v in time O(pk) = O(n²), and computing h(M^v, M^w, S) in time O(n³) (because S has at most k² ≤ n² pairs, and we treat p and k as upper-bounded by n for simplicity). Thus the set of all entries for u can be computed in O(n^{2pk+3}) time if all tables of its children are given. For the ∘_R nodes, each entry can similarly be computed in time O(n^{pk+3}) by enumerating the M^v matrices and checking the sum conditions, for a total time of O(n^{2pk+3}) as well. Since T has O(n) nodes, we can compute all tables for all nodes of T, from leaves to root, in O(n^{2pk+4}) time, which is O(n^{2p·cw+4}) if we assume k = cw. Let r be the root of T. The output of our algorithm is the minimum D(r, M^r) such that M^r has no zero columns (note that if we require at most p clusters, we can simply tolerate zero columns).

We now prove that the recurrences are correct. The leaf case is easy to verify, so we assume inductively that the table is filled correctly for the children of an internal node u. Suppose that u corresponds to operation ×S, and consider the value we assign to an entry D(u,M^u). Let v,w be the two children of u. Suppose {C^u_i}_{i=1}^p is an optimal M^u-sequencing for Gu with M^u-clustering 𝒞u. Let 𝒞v = {C ∩ Vv : C ∈ 𝒞u} ∖ {∅} and 𝒞w = {C ∩ Vw : C ∈ 𝒞u} ∖ {∅}. Since Gu is the union of Gv and Gw with some edges added between them as dictated by S, we claim that the cost costGu(𝒞u) of the optimal M^u-sequencing equals

$$\sum_{y \in \{v,w\}} cost_{G_y}(\mathcal{C}_y) + \sum_{C \in \mathcal{C}_u} |C \cap V_v|\,|C \cap V_w| + \sum_{(i,i') \in S} |V^v_i|\,|V^w_{i'}| - 2 \sum_{(i,i') \in S} \sum_{C \in \mathcal{C}_u} |C \cap V^v_i|\,|C \cap V^w_{i'}|,$$

justified as follows. First, any inserted or deleted edge of 𝒞u whose endpoints are both in Vv, or both in Vw, is counted exactly once, namely in the first summation $\sum_{y \in \{v,w\}} cost_{G_y}(\mathcal{C}_y)$. Then, consider the inserted/deleted edges with one endpoint in Vv and the other in Vw. Observe that a non-edge between Vv and Vw is counted once in the second summation if and only if it is an inserted edge. That non-edge is not counted in the third summation, because the latter only counts edges of Gu, and it is not subtracted in the last term for the same reason. Thus the expression counts each inserted edge between Vv and Vw in 𝒞u exactly once, and no other such non-edge. Now, consider an edge e between Vv and Vw, which exists because of S. If e is a deleted edge, its endpoints are in different clusters and it is counted exactly once, in the third summation. If e is not deleted, its endpoints are in the same cluster and it is counted in both the second and third summations. On the other hand, that edge is subtracted twice in the last term, and so overall it is not counted. We deduce that the expression correctly represents the cost of 𝒞u in Gu.

For each y ∈ {v,w}, we further define M^y = (m^y_{i,j}) by m^y_{i,j} = |C^u_j ∩ V^y_i| for all i,j. Clearly, we have M^v + M^w = M^u. Now, we claim that {C^u_i ∩ Vy}_{i=1}^p is an optimal M^y-sequencing of Gy with M^y-clustering 𝒞y for every y ∈ {v,w}. Roughly speaking, this can be seen by observing that merging any M^v- and M^w-clusterings results, in the cost expression given above, in the same values for the last three summations. Since the choice of the M^y-clusterings only affects the first summation, they should be chosen to minimize it. To see this in more detail, assume for contradiction that {D^v_i}_{i=1}^p is an M^v-sequencing of Gv with M^v-clustering 𝒟v and {D^w_i}_{i=1}^p is an M^w-sequencing of Gw with M^w-clustering 𝒟w such that $\sum_{y \in \{v,w\}} cost_{G_y}(\mathcal{D}_y) < \sum_{y \in \{v,w\}} cost_{G_y}(\mathcal{C}_y)$. Then {D^v_i ∪ D^w_i}_{i=1}^p is an M^u-sequencing for Gu, since M^v + M^w = M^u and Vv ∪ Vw = Vu. Furthermore, the cost of the M^u-sequencing {D^v_i ∪ D^w_i}_{i=1}^p is $\sum_{y \in \{v,w\}} cost_{G_y}(\mathcal{D}_y) + h(M^v, M^w, S)$, where h(M^v,M^w,S) also equals

$$\sum_{C \in \mathcal{C}_u} |C \cap V_v|\,|C \cap V_w| + \sum_{(i,i') \in S} |V^v_i|\,|V^w_{i'}| - 2 \sum_{(i,i') \in S} \sum_{C \in \mathcal{C}_u} |C \cap V^v_i|\,|C \cap V^w_{i'}|$$

based on the definitions of M^v and M^w. As a result, $\sum_{y \in \{v,w\}} cost_{G_y}(\mathcal{D}_y) + h(M^v, M^w, S) < \sum_{y \in \{v,w\}} cost_{G_y}(\mathcal{C}_y) + h(M^v, M^w, S) = cost_{G_u}(\mathcal{C}_u)$, a contradiction. Therefore, 𝒞u is obtained by merging optimal M^v- and M^w-clusterings. Since our value of D(u,M^u) eventually considers D(v,M^v)+D(w,M^w), which by induction contain the optimal costs, D(u,M^u) is at most the cost of 𝒞u. It is also easy to see that each entry considered in the minimization corresponds to an M^u-clustering of Gu, which cannot be better than 𝒞u, and so D(u,M^u) is also at least the cost of 𝒞u. Thus our value of D(u,M^u) is correct.

Consider an internal node u with corresponding operation ∘R, and let v be the child of u. Suppose {C^u_i}_{i=1}^p is an optimal M^u-sequencing for Gu with M^u-clustering 𝒞u. We define M^v = (m^v_{i,j}) by m^v_{i,j} = |V^v_i ∩ C^u_j|. Then $M^u_{i',:} = \sum_{(i,i') \in R} M^v_{i,:}$ for each i′ ∈ [k]. Now, we claim that {C^u_i}_{i=1}^p is also an optimal M^v-sequencing of Gv with M^v-clustering 𝒞u. Otherwise, assume for contradiction that {C^v_i}_{i=1}^p is an M^v-sequencing of Gv with M^v-clustering 𝒞v such that costGv(𝒞v) < costGv(𝒞u). Then {C^v_i}_{i=1}^p is also an M^u-sequencing of Gu with M^u-clustering 𝒞v, because (1) the non-empty sets of {C^v_i}_{i=1}^p define a partition of Vv, and thus also a partition of Vu, since Vv = Vu; and (2) in the cluster C^v_j of Vv, the number of vertices labeled i is m^v_{i,j}; since Gu is obtained from Gv by relabeling vertices as dictated by the function R, the number of vertices of C^v_j labeled i′ in Vu is $\sum_{(i,i') \in R} m^v_{i,j}$, which equals m^u_{i′,j} for each i′ ∈ [k]. Thus the cost of the M^u-sequencing {C^v_i}_{i=1}^p is costGu(𝒞v) = costGv(𝒞v) < costGv(𝒞u) = costGu(𝒞u), a contradiction. As before, this shows that our value of D(u,M^u) is both an upper and a lower bound on the cost of 𝒞u, and is therefore correct.

Recall that cographs have clique-width at most 2. Moreover, a 2-expression can easily be derived from a binary cotree, since 0-nodes are simply union operations that add no edges, and 1-nodes can be expressed by joining graphs with two different labels and adding every edge between them (in fact, dynamic programming directly over the cotree could be more efficient). We therefore have the following consequence, which we note is not conditioned on receiving a k-expression as input.

Corollary 11.

p-Cluster Editing on cographs admits an O(n^{4p+4})-time algorithm.

4 Cluster Editing on Trivially Perfect Graphs

We now show that Cluster Editing can be solved in cubic time on TPGs, first providing some intuition. Consider a TPG G, and recall that it admits a cotree T in which every 1-node has at most one child that is not a leaf. Such a cotree can be found in linear time [14]. Let 𝒞 be an optimal clustering of G. Let v ∈ V(T) and let X ⊆ V(G) be the set of leaves that descend from v, which we call a clade. Suppose that there are two clusters C1,C2 ∈ 𝒞 that intersect with X, but that also satisfy C1 ⊄ X, C2 ⊄ X (we will say that X “grows” in C1 and C2). We can show that there is an alternate optimal clustering obtainable by moving C1 ∩ X into a new cluster by itself, and merging (C1 ∖ X) ∪ C2 into a single cluster. (Note that this is where the properties of TPGs are used: this works because if we consider the neighbors of X outside of X, they are all children of 1-nodes on the path from v to the root. They thus form a clique, which makes the merging of (C1 ∖ X) ∪ C2 advantageous, as it saves the deletions of edges from that clique. Also, we may need to exchange the roles of C1 and C2.) In this manner, X “grows” in one less cluster, and we can repeat this until it grows in at most one cluster. This property allows dynamic programming over the clades of the cotree, as we only need to memorize optimal solutions with respect to the size of the only cluster that can grow in the subsequent clades.

For two disjoint subsets of vertices X,Y, we denote E_G(X,Y) = {uv ∈ E(G) : u ∈ X, v ∈ Y} and e_G(X,Y) = |E_G(X,Y)|. We further denote ē_G(X,Y) = |X||Y| − e_G(X,Y), which is the number of non-edges between X and Y. We drop the subscript G from these notations if it is clear from the context. Two disjoint subsets X,Y ⊆ V(G) are neighbors if e(X,Y) = |X||Y|, and they are non-neighbors if ē(X,Y) = |X||Y|. That is, every possible edge is present, or absent, respectively. Note that using this notation, for a given clustering 𝒞 = {C1,…,Cl}, the set of inserted edges is $\bigcup_{C \in \mathcal{C}} E(\overline{G[C]})$ and the set of deleted edges is $\bigcup_{1 \le i < j \le l} E(C_i, C_j)$.
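For concreteness, the cost of a clustering under these definitions can be sketched as follows (a hypothetical helper of our own, not from the paper):

```python
def clustering_cost(edges, clusters):
    """edges: set of frozensets {u, v}; clusters: list of disjoint vertex sets.

    Insertions are non-edges inside a cluster; deletions are edges whose
    endpoints lie in different clusters.
    """
    insertions = sum(
        1
        for C in clusters
        for u in C for v in C
        if u < v and frozenset((u, v)) not in edges
    )
    deletions = sum(
        1 for e in edges
        if not any(e <= C for C in clusters)  # endpoints split across clusters
    )
    return insertions + deletions
```

On the path 1–2–3, for example, keeping everything in one cluster costs one insertion, while splitting off vertex 3 costs one deletion.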

Let G be a cograph with cotree T. For v ∈ V(T), we denote by L(v) the set of leaves that descend from v in T (note that L(v) ⊆ V(G)). We call L(v) the clade of v. We say that X ⊆ V(G) is a clade of T if X is the clade of some v ∈ V(T). In this case, we say that X is rooted at v. The set of clades of T is defined as clades(T) = {L(v) : v ∈ V(T)}.

We first show a technical property of optimal solutions that will be useful. The lemma statement is illustrated in Figure 2.

Figure 2: Illustration of Lemma 12. The tree shown is the cotree of G, with disc vertices being 1-nodes and square vertices 0-nodes. Here, Y = X ∩ C1 and Y′ = X ∩ C2. If |A| ≥ |B| and |A′| ≥ |B′|, then rearranging C1 and C2 in one of the two ways shown also gives an optimal clustering.
Lemma 12.

Let G be a TPG with cotree T, let u ∈ V(T), and let X = L(u). Let 𝒞 be an optimal clustering of G and let C1,C2 ∈ 𝒞 be distinct clusters that both intersect with X. Let U be the set of strict ancestors of u in T. Let A (resp. A′) be the set of vertices of C1 (resp. C2) that are children of 1-nodes in U, and let B = C1 ∖ (X ∪ A) (resp. B′ = C2 ∖ (X ∪ A′)).

If |A| ≥ |B| and |A′| ≥ |B′|, then at least one of the alternate clusterings (𝒞 ∖ {C1,C2}) ∪ {C1 ∪ A′ ∪ B′, C2 ∩ X} or (𝒞 ∖ {C1,C2}) ∪ {C1 ∩ X, C2 ∪ A ∪ B} has cost at most costG(𝒞).

Let G be a TPG with cotree T. Let 𝒞 = {C1,…,Cl} be a clustering of G. Let X be a clade of T. For Ci ∈ 𝒞, we say that X grows in Ci if Ci ∩ X ≠ ∅ and Ci ⊄ X. Note that this differs from the notion of overlapping, since X ⊆ Ci is possible. We then say that X has a single-growth in 𝒞 if X grows in at most one element of 𝒞. In other words, at most one cluster of 𝒞 containing elements of X also has elements outside of X, and the rest of X is split into clusters that are subsets of X. Note that X could grow in zero elements of 𝒞. Furthermore, we say that X has a heritable single-growth in 𝒞 if, for all clades Y of T such that Y ⊆ X, Y has a single-growth in 𝒞. For v ∈ V(T), we may also say that v grows in Ci if L(v) grows in Ci, or that v has a single-growth in 𝒞 if L(v) does. For brevity, we may write SG for single-growth, and HSG for heritable single-growth.
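These definitions translate directly into set predicates; a small sketch (illustrative helper names of our own, not from the paper):

```python
def grows_in(X, C):
    """Clade X grows in cluster C: C intersects X but C is not contained in X."""
    return bool(C & X) and not C <= X

def has_single_growth(X, clustering):
    """X has an SG in the clustering: X grows in at most one cluster."""
    return sum(1 for C in clustering if grows_in(X, C)) <= 1
```

For example, the clade {1,2,3} grows in the cluster {1,4} but not in the cluster {2,3}, which is a subset of the clade.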

Lemma 13.

Suppose that G is a TPG with cotree T. Then there exists an optimal clustering 𝒞 of G such that every clade X of T has an SG in 𝒞.

Proof.

Consider an optimal clustering 𝒞 = {C1,…,Cl} of G, chosen so that the number of clades of T that have an HSG in 𝒞 is maximum among all optimal clusterings. (We could attempt to choose 𝒞 to maximize the number of clades with an SG instead, but we will modify 𝒞 later on, and keeping track of changes in HSG clades is much easier than in SG clades.) If every clade of T has the HSG property, then we are done (since HSG implies SG), so we assume that not every clade has the HSG property. Clearly, in 𝒞, every leaf of T has an HSG, the root of T does not have an HSG, and there exists at least one internal node of T that does not have an SG. Choose u ∈ V(T) with the minimum number of descendants such that u does not have an SG in 𝒞. Then every descendant of u has an SG, and thus also an HSG, in 𝒞. Notice that u cannot be the root of T, because the root trivially has an SG in 𝒞.

Denote X = L(u). Since u does not have the SG property, X grows in at least two clusters of 𝒞, say C1 and C2. Thus X ∩ C1, C1 ∖ X, X ∩ C2, C2 ∖ X are all non-empty. Denote Y = X ∩ C1 and Y′ = X ∩ C2. We show that we can transform 𝒞 into another optimal clustering 𝒞* in which one of Y or Y′ is a cluster by itself, and such that the number of clades that have an HSG in 𝒞* is no less than in 𝒞.

As in Lemma 12, let U be the set of strict ancestors of u in T. Let A and A′ be the sets of vertices of C1 and C2, respectively, that are children of 1-nodes in U. Then let B = C1 ∖ (X ∪ A) and B′ = C2 ∖ (X ∪ A′). Note that because C1 ⊄ X and X does not intersect A or B, the set A ∪ B is non-empty. Likewise, A′ ∪ B′ is non-empty.

We argue that |A| ≥ |B|. Suppose instead that |A| < |B|. Consider the clustering 𝒞′ = (𝒞 ∖ {C1}) ∪ {Y, A ∪ B}. Then 𝒞′ has |A||Y| edge deletions that are not in 𝒞 (and no other edge modification is in 𝒞′ but not in 𝒞, since Y and B share no edge, as the lowest common ancestor of any vertex of X and any vertex of B is a 0-node in the cotree). On the other hand, 𝒞 has |B||Y| > |A||Y| edge additions that are not in 𝒞′. Hence, the cost of 𝒞′ is strictly lower than that of 𝒞, a contradiction. By the same argument, we get that |A′| ≥ |B′|.

We see that C1 and C2 satisfy all the requirements of Lemma 12, and so we may obtain an alternate optimal clustering 𝒞′ or 𝒞′′, where 𝒞′ is obtained from 𝒞 by replacing C1,C2 with Y, C2 ∪ A ∪ B, and 𝒞′′ is obtained by replacing C1,C2 with Y′, C1 ∪ A′ ∪ B′. Notice that X grows in fewer clusters in 𝒞′ than in 𝒞, and the same is true for 𝒞′′. Before proceeding, we need to argue that T has as many clades with an HSG in 𝒞′, and in 𝒞′′, as in 𝒞.

So, let w ∈ V(T) be such that w has an HSG in 𝒞. Observe that w cannot be in U ∪ {u}, since these nodes have u as a descendant and u does not have an SG in 𝒞. Therefore, w must either be: (1) a leaf child of a 1-node in U; (2) a descendant of u; or (3) a node whose first ancestor in U is a 0-node. Let v be a descendant of w, with v = w possible. By the definition of HSG, v has an SG in 𝒞. In all cases, we argue that v still has an SG in 𝒞′ and 𝒞′′.

  1. If w is the child of a 1-node of U, then w = v is a leaf and it trivially has an SG in 𝒞′ and 𝒞′′.

  2. Suppose that w, and thus v, is a descendant of u. If L(v) intersects neither C1 nor C2, then the clusters of 𝒞 that intersect L(v) are unaltered in 𝒞′ and 𝒞′′, and thus v also has an SG in 𝒞′ and in 𝒞′′. Thus suppose that L(v) intersects C1 ∪ C2.

     In that case, since v descends from u, we have L(v) ⊆ X, and L(v) must thus intersect Y ∪ Y′. If L(v) intersects Y, then v grows in C1 because A ∪ B ≠ ∅. Likewise, if L(v) intersects Y′, then v grows in C2. It follows that L(v) intersects exactly one of Y or Y′, and grows in exactly one of C1 or C2. If L(v) intersects Y, then in 𝒞′, L(v) may grow in the cluster Y, but it does not grow in C2 ∪ A ∪ B, since it does not intersect it, and does not grow in any other cluster of 𝒞′, since these were unaltered. Thus L(v) has an SG in 𝒞′. In 𝒞′′, L(v) grows in C1 ∪ A′ ∪ B′ but in no other cluster, for the same reason. Thus L(v) has an SG in 𝒞′′ as well. If L(v) intersects Y′ instead, the same arguments can be used to deduce that L(v) has an SG in 𝒞′ and 𝒞′′.

  3. Finally, suppose that w is such that its first ancestor in U is a 0-node. Again we may assume that L(v) intersects C1 ∪ C2. This implies that L(v) intersects B ∪ B′, and does not intersect Y ∪ Y′ ∪ A ∪ A′. If L(v) intersects B, then it grows in C1 since |A| ≥ |B|. Then in 𝒞′, v grows in C2 ∪ A ∪ B, but not in Y nor in any other unaltered cluster. In 𝒞′′, it grows in C1 ∪ A′ ∪ B′, but not in Y′ nor in any other unaltered cluster. If L(v) intersects B′, the same argument applies. Either way, L(v) has an SG in 𝒞′ and 𝒞′′.

Since any descendant of w has an SG in both 𝒞′ and 𝒞′′, we deduce that w has an HSG in both alternate clusterings.

We have thus found an optimal clustering 𝒞* ∈ {𝒞′, 𝒞′′} such that every w ∈ V(T) that has an HSG in 𝒞 also has an HSG in 𝒞*. Moreover, since 𝒞* “extracts” either Y or Y′ from its cluster, X grows in one less cluster of 𝒞* than of 𝒞. If X grows in only one such cluster, then u has an SG in 𝒞*, and therefore also an HSG in 𝒞*, by the choice of u. In this case, T has more clades with an HSG in 𝒞* than in 𝒞, which contradicts our choice of 𝒞. If X still grows in at least two clusters of 𝒞*, we may repeat the above modification as many times as needed until X grows in a single cluster, yielding the same contradiction. We deduce that every node of T has an HSG, and therefore an SG, in 𝒞.

Our goal is to use the above to perform dynamic programming over the cotree. For a node v of this cotree, we will store the value of a solution for the subgraph induced by L(v), and will need to determine which cluster of such a partial solution should grow. As it turns out, we should choose the largest cluster to grow.

Lemma 14.

Let G be a TPG with cotree T. Let 𝒞 be an optimal clustering of G such that every clade of T has an SG in 𝒞 and, among all such possible optimal clusterings, such that |𝒞| is maximum. Then for every clade X of T, one of the following holds:

  • X does not grow in any Ci ∈ 𝒞; or

  • X grows in one Ci ∈ 𝒞, and |X ∩ Ci| = max_{Cj ∈ 𝒞} |X ∩ Cj|.

Proof.

Let X be a clade of T and suppose that X grows in some Ci ∈ 𝒞. Note that the vertices of X share the same neighborhood outside of X. Thus Ci can be partitioned into {X ∩ Ci, A, B} such that X,A are neighbors and X,B are non-neighbors. Note that |A| ≥ |B|, as otherwise we could obtain an alternate clustering 𝒞′ by replacing Ci with {X ∩ Ci, A ∪ B} and save a cost of |X ∩ Ci|(|B| − |A|) > 0.

We also argue that |A| > |B|. This is because if |A| = |B|, the same clustering 𝒞′ has costG(𝒞′) = costG(𝒞), but has one more cluster. We now argue that every clade of T has an SG in 𝒞′, contradicting our choice of 𝒞. Let Y = X ∩ Ci and consider some v ∈ V(T). By assumption, v has an SG in 𝒞. If Ci ∩ L(v) = ∅, then the clusters of 𝒞 that intersect L(v) are unaltered in 𝒞′, and v also has an SG in 𝒞′. Suppose that Ci ∩ L(v) ≠ ∅. Apart from Ci = Y ∪ A ∪ B, the clusters of 𝒞 that intersect L(v) are unaltered in 𝒞′. Moreover, if L(v) does not grow in Y ∪ A ∪ B, then it grows in neither Y nor A ∪ B, and thus v also has an SG in 𝒞′. The remaining case is when L(v) grows in Ci (but in no other cluster of 𝒞). Let u ∈ V(T) be such that L(u) = X. If v = u, then L(v) does not grow in any cluster of 𝒞′. If v is a strict descendant of u, then L(v) can only grow in Y in 𝒞′. If v is a strict ancestor of u, then Y ⊆ L(v) and L(v) can only grow in A ∪ B in 𝒞′. If v is in the rest of V(T), then Y ∩ L(v) = ∅ and L(v) grows only in A ∪ B in 𝒞′. As a result, L(v) has an SG in 𝒞′. Since this holds for any v, every node has an SG in 𝒞′, which is a contradiction since |𝒞| < |𝒞′|. Therefore, |A| > |B|.

Now suppose that there is some Cj ≠ Ci such that |X ∩ Cj| > |X ∩ Ci|. Because X already grows in Ci, the SG property implies that Cj ⊆ X. Consider the alternate clustering 𝒞′ obtained from 𝒞 by replacing Ci,Cj with Cj ∪ A ∪ B and X ∩ Ci. The number of modifications in 𝒞′ but not in 𝒞 is |X ∩ Ci||A| + |Cj||B|, while the number of modifications in 𝒞 but not in 𝒞′ is |X ∩ Ci||B| + |Cj||A|. The difference between the latter and the former is (|Cj| − |X ∩ Ci|)(|A| − |B|). Since |A| > |B| and |Cj| = |X ∩ Cj| > |X ∩ Ci|, this is greater than 0, and thus 𝒞′ is a clustering of cost lower than that of 𝒞, a contradiction.

Our algorithm will search for an optimal clustering that satisfies all the requirements of Lemma 14. That is, for a TPG G with cotree T, a clustering 𝒞 is well-behaved if it is optimal, every clade of T has an SG in 𝒞, and among all such possible clusterings |𝒞| is maximum. As we know that such a 𝒞 exists, we will search for one using dynamic programming.

In the remainder, for an arbitrary clustering 𝒞 of G and X ⊆ V(G), define 𝒞|X = {Ci ∩ X : Ci ∈ 𝒞} ∖ {∅}, i.e., the restriction of 𝒞 to X. Note that 𝒞|X is a clustering of G[X], and we refer to costG[X](𝒞|X) as the cost of 𝒞 in G[X]. Although 𝒞|X is not necessarily an optimal clustering of G[X], we can deduce from the above that it has minimum cost among the clusterings with the same largest cluster size.
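The restriction operation itself is straightforward; a one-line sketch (a hypothetical helper name, for illustration only):

```python
def restrict(clustering, X):
    """C|_X: intersect every cluster with X and drop the empty sets."""
    return [C & X for C in clustering if C & X]
```

For instance, restricting the clustering {{1,2},{3,4}} to X = {1,3} yields the clustering {{1},{3}} of G[X].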

Corollary 15.

Let G be a TPG with cotree T, and let 𝒞 be a well-behaved clustering of G. Let u ∈ V(T) with clade X = L(u), and let ku = max_{Cj ∈ 𝒞} |X ∩ Cj| denote the size of the largest intersection of a cluster of 𝒞 with X.

Then 𝒞|X is a clustering of G[X] such that costG[X](𝒞|X) is minimum among all clusterings of G[X] whose largest cluster has size ku.

We define a 2-dimensional dynamic programming table D indexed by V(T)×[n], with the intent that D[u,k] has the cost of an optimal clustering of G[L(u)] in which the largest cluster has size k. Notice that this intent is mostly for intuition purposes, since we will not be able to guarantee that D[u,k] stores the correct value for each u,k. Indeed, if we require a clustering of G[L(u)] with largest cluster size k, such a clustering may not be optimal and the properties of the above lemmas may not hold. However, we will argue that when k is the size of a largest cluster in an optimal clustering, then the entries are correct, as we prove that they combine information from optimal clusterings at the children of u which may also be assumed to be correct.

We assume that the cotree T of G is binary (note that such a cotree always exists and, since the previous lemmas make no assumption on the structure of the cotree, this is without loss of generality). If u is a leaf, put D[u,1] = 0 and D[u,k] = ∞ for every k ≠ 1. For an internal node u, let u1,u2 be the two children of u in T. We put

$$D[u,k] = \min \begin{cases} D[u_1,k] + \min_{j \in [k]} D[u_2,j] + \mathbb{I}_u |L(u_1)||L(u_2)| \\ D[u_2,k] + \min_{j \in [k]} D[u_1,j] + \mathbb{I}_u |L(u_1)||L(u_2)| \\ \min_{j \in [k-1]} \left( D[u_1,j] + D[u_2,k-j] + \alpha(u) \right), \end{cases}$$

where 𝕀u=0 if u is a 0-node and 𝕀u=1 if it is a 1-node, and

$$\alpha(u) = \begin{cases} |L(u_1)||L(u_2)| - j(k-j) & \text{if } u \text{ is a 1-node} \\ j(k-j) & \text{otherwise,} \end{cases}$$

where j is the index from the third case of the minimization.

The recurrence for D[u,k] says that there are three ways to obtain a solution with a largest cluster of size k: either that cluster is taken directly from the solution at u1, or it is taken from the solution at u2, or we take a cluster of size j from u1 and a cluster of size k−j from u2 and merge them together.
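The three cases of the recurrence can be sketched as follows, with the binary cotree encoded as nested tuples ('0'|'1', left, right) and leaves as single objects. This encoding and the helper names are our own assumptions, not the paper's code; the answer is the minimum of D[root, k] over all k:

```python
import math

def prefix_min(D, n):
    """out[k] = min over j <= k of D[j] (infinity where undefined)."""
    out, cur = {}, math.inf
    for k in range(1, n + 1):
        cur = min(cur, D.get(k, math.inf))
        out[k] = cur
    return out

def solve(node):
    """Return (|L(u)|, D) where D[k] is the cost of a cheapest clustering
    of the subgraph induced by L(u) whose largest cluster has size k."""
    if not isinstance(node, tuple):      # leaf: D[u, 1] = 0
        return 1, {1: 0}
    op, left, right = node
    n1, D1 = solve(left)
    n2, D2 = solve(right)
    join = (op == '1')                   # 1-node: all cross edges exist
    cross = n1 * n2                      # vertex pairs between the two sides
    min1, min2 = prefix_min(D1, n1), prefix_min(D2, n2)
    D = {}
    for k in range(1, n1 + n2 + 1):
        best = math.inf
        # Cases 1-2: the largest cluster lies entirely in one child;
        # a 1-node then pays for deleting every cross edge.
        if k in D1:
            best = min(best, D1[k] + min2[min(k, n2)] + (cross if join else 0))
        if k in D2:
            best = min(best, D2[k] + min1[min(k, n1)] + (cross if join else 0))
        # Case 3: merge a size-j cluster of u1 with a size-(k-j) cluster of
        # u2; alpha is the cross cost excluding the merged pair.
        for j in range(1, k):
            if j in D1 and (k - j) in D2:
                alpha = cross - j * (k - j) if join else j * (k - j)
                best = min(best, D1[j] + D2[k - j] + alpha)
        if best < math.inf:
            D[k] = best
    return n1 + n2, D

def cluster_editing_tpg(cotree):
    """Minimum number of edits for the TPG described by the binary cotree."""
    _, D = solve(cotree)
    return min(D.values())
```

For example, the path a–b–c has cotree ('1', 'b', ('0', 'a', 'c')) and editing cost 1, while the star K1,3 has cost 2, matching a hand computation.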

Lemma 16.

Let 𝒞 be a well-behaved clustering of G. Let uV(T), and denote by ku the size of the largest cluster of 𝒞|L(u). Then both of the following hold:

  • for each k such that D[u,k] ≠ ∞, there exists a clustering of G[L(u)] of cost at most D[u,k] whose largest cluster has size k;

  • D[u,ku] is equal to costG[L(u)](𝒞|L(u)), the cost of 𝒞 restricted to G[L(u)].

Proof.

We prove the statement by induction on the cotree T. For a leaf u, it is clear that both statements hold with D[u,1]=0. Let u be an internal node of T and let u1,u2 be its children. Denote X=L(u),X1=L(u1),X2=L(u2).

We focus on the first part and assume that D[u,k] ≠ ∞. The value of D[u,k] is the minimum among three cases. If D[u,k] = D[u1,k] + min_{j∈[k]} D[u2,j] + 𝕀u|L(u1)||L(u2)|, then because D[u,k] ≠ ∞, we have D[u1,k] ≠ ∞ and D[u2,j] ≠ ∞ for the j chosen in the minimum expression. We can take the disjoint union of a clustering of G[L(u1)] whose largest cluster has size k (one of cost at most D[u1,k] exists by induction) and a clustering of G[L(u2)] whose largest cluster has size j ≤ k (one of cost at most D[u2,j] exists by induction). If u is a 1-node, all edges between L(u1) and L(u2) are present and 𝕀u|X1||X2| must be added to the cost (whereas if u is a 0-node, no additional cost is required). This confirms that, if the first case of the recurrence attains the minimum, there exists a clustering of cost at most D[u,k] whose largest cluster has size k. The same argument applies to the second case of the recurrence.

So assume that D[u,k] = min_{j∈[k−1]}(D[u1,j] + D[u2,k−j] + α(u)). This case corresponds to taking, by induction, clusterings of G[X1] and of G[X2] whose largest clusters have sizes j and k−j, respectively, and merging these two largest clusters into one cluster of size k (leaving the other clusters intact). If u is a 1-node, the number of deleted edges between the resulting clusters is |L(u1)||L(u2)| − j(k−j), and if u is a 0-node, we must pay j(k−j) edge insertions. This shows the first part of the statement.

For the second part, suppose that k = ku. Here, 𝒞 is the well-behaved clustering stated in the lemma. As we just argued, there is a clustering of G[X] of cost at most D[u,ku] whose largest cluster has size ku. By Corollary 15, 𝒞|X has minimum cost among such clusterings, and so D[u,ku] is at least costG[X](𝒞|X). We argue that the latter is also an upper bound on D[u,ku]. Let C1 ∈ 𝒞|X1, and note that if C1 ∉ 𝒞|X, then C1 was “merged” with some cluster of 𝒞|X2 to obtain 𝒞|X. In this case, u1 grows in some cluster of 𝒞|X, and therefore also grows in some cluster of 𝒞. By the SG property, this means that there is at most one C1 ∈ 𝒞|X1 such that C1 ∉ 𝒞|X. For the same reason, there is at most one C2 ∈ 𝒞|X2 that is not in 𝒞|X. This in turn implies that either 𝒞|X = 𝒞|X1 ∪ 𝒞|X2, or there is a unique C1 ∈ 𝒞|X1 and a unique C2 ∈ 𝒞|X2 such that C1 ∪ C2 is in 𝒞|X. We now consider both cases.

  • If 𝒞|X = 𝒞|X1 ∪ 𝒞|X2: since 𝒞|X has its largest cluster of size k = ku, one of 𝒞|X1 or 𝒞|X2 must have a largest cluster of size k, and the other a largest cluster of some size j ∈ [k]. Using induction on the second part of the lemma, the corresponding entries D[ui,k] and D[ui′,j] for {i,i′} = {1,2} store the costs of 𝒞|X1 and 𝒞|X2, and since all these possibilities are considered by the recurrence for D[u,ku], it is clear that D[u,ku] is at most costG[X](𝒞|X).

  • If C1 ∪ C2 ∈ 𝒞|X, we have 𝒞|X = ((𝒞|X1 ∪ 𝒞|X2) ∖ {C1,C2}) ∪ {C1 ∪ C2}. Observe that X1 grows in C1 ∪ C2. This means that X1 also grows in the cluster C′ ∈ 𝒞 that contains C1 ∪ C2. By Lemma 14, |X1 ∩ C′| ≥ |X1 ∩ C′′| for each C′′ ∈ 𝒞. Since C′ contains C1, we get that |X1 ∩ C′| = |C1|, and since {X1 ∩ C′′ : C′′ ∈ 𝒞} contains all the clusters of 𝒞|X1, it must be that C1 is the largest cluster of 𝒞|X1. By the same argument, C2 is the largest cluster of 𝒞|X2. Let j = |C1|. We have that 𝒞|X has largest cluster size k, and is obtained by taking a clustering of G[L(u1)] with largest cluster of size j and a clustering of G[L(u2)] with largest cluster of size k−j, and merging these two clusters. If u is a 1-node, the cost of this is the cost of 𝒞 in G[L(u1)], which is D[u1,j] by induction, plus the cost of 𝒞 in G[L(u2)], which is D[u2,k−j] by induction, plus the cost of all the edges between L(u1) and L(u2), excluding those between C1 and C2. If u is a 0-node, the cost is the same, except that we pay j(k−j) for the non-edges between C1 and C2. Either way, this case is considered by our recurrence, and we get D[u,k] ≤ costG[X](𝒞|X).

Having shown both the lower and upper bounds, we get that D[u,ku]=costG[X](𝒞|X).

Theorem 17.

The Cluster Editing problem can be solved in time O(n3) on trivially perfect graphs.

Open problems.

We observe that the structural properties shown for TPGs only work if the number of desired clusters is unrestricted. The complexity of p-Cluster Editing on TPGs is open (note that if p is constant, our n^{O(p)}-time algorithm on cographs yields a polynomial-time algorithm). Regarding our clique-width (or rather NLC-width) algorithm, it might be possible to improve the complexity, for example by achieving n^{O(p+cw)} instead of n^{O(p·cw)}.

We proved that the problem is in P on {P4,C4}-free graphs, but we do not know its complexity on {P4,2K2}-free graphs. The complements of such graphs are {P4,C4}-free, so the problem may be in P on them as well, but it is unclear whether the editing problem on the complement class can be solved using our techniques. More generally, it would be ideal to aim for a dichotomy theorem for forbidden induced subgraphs, that is, to characterize the sets of forbidden induced subgraphs that make Cluster Editing polynomial-time solvable, and the ones that make it NP-hard.

References

  • [1] Faisal N Abu-Khzam, Joseph R Barr, Amin Fakhereldine, and Peter Shaw. A greedy heuristic for cluster editing with vertex splitting. In 2021 4th International conference on artificial intelligence for industries (AI4I), pages 38–41. IEEE, 2021. doi:10.1109/AI4I51902.2021.00017.
  • [2] Emmanuel Arrighi, Matthias Bentert, Pål Grønås Drange, Blair D Sullivan, and Petra Wolf. Cluster editing with overlapping communities. In 18th International Symposium on Parameterized and Exact Computation (IPEC 2023). Schloss-Dagstuhl-Leibniz Zentrum für Informatik, 2023.
  • [3] Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation clustering. Machine learning, 56:89–113, 2004. doi:10.1023/B:MACH.0000033116.57574.95.
  • [4] Hila Becker. A survey of correlation clustering. Advanced Topics in Computational Learning Theory, pages 1–10, 2005.
  • [5] André Berger, Alexander Grigoriev, and Andrej Winokurow. A PTAS for the cluster editing problem on planar graphs. In International Workshop on Approximation and Online Algorithms, pages 27–39. Springer, 2016. doi:10.1007/978-3-319-51741-4_3.
  • [6] Sebastian Böcker. A golden ratio parameterized algorithm for cluster editing. J. Discrete Algorithms, 16:79–89, 2012. doi:10.1016/J.JDA.2012.04.005.
  • [7] Sebastian Böcker, Sebastian Briesemeister, Quang Bao Anh Bui, and Anke Truß. Going weighted: Parameterized algorithms for cluster editing. Theor. Comput. Sci., 410(52):5467–5480, 2009. doi:10.1016/J.TCS.2009.05.006.
  • [8] Sebastian Böcker and Peter Damaschke. Even faster parameterized cluster deletion and cluster editing. Inf. Process. Lett., 111(14):717–721, 2011. doi:10.1016/J.IPL.2011.05.003.
  • [9] Flavia Bonomo, Guillermo Duran, and Mario Valencia-Pabon. Complexity of the cluster deletion problem on subclasses of chordal graphs. Theoretical Computer Science, 600:59–69, 2015. doi:10.1016/J.TCS.2015.07.001.
  • [10] Ulrik Brandes, Michael Hamann, Ben Strasser, and Dorothea Wagner. Fast quasi-threshold editing. In Nikhil Bansal and Irene Finocchi, editors, Algorithms - ESA 2015 - 23rd Annual European Symposium, Patras, Greece, September 14-16, 2015, Proceedings, volume 9294 of Lecture Notes in Computer Science, pages 251–262. Springer, 2015. doi:10.1007/978-3-662-48350-3_22.
  • [11] Yixin Cao and Jianer Chen. Cluster editing: Kernelization based on edge cuts. Algorithmica, 64(1):152–169, 2012. doi:10.1007/S00453-011-9595-1.
  • [12] Moses Charikar, Venkatesan Guruswami, and Anthony Wirth. Clustering with qualitative information. In FOCS 2003, 11-14 October 2003, Cambridge, MA, USA, Proceedings, pages 524–533. IEEE Computer Society, 2003. doi:10.1109/SFCS.2003.1238225.
  • [13] Jianer Chen and Jie Meng. A 2k kernel for the cluster editing problem. J. Comput. Syst. Sci., 78(1):211–220, 2012. doi:10.1016/J.JCSS.2011.04.001.
  • [14] Derek G. Corneil, Yehoshua Perl, and Lorna K. Stewart. A linear recognition algorithm for cographs. SIAM J. Comput., 14(4):926–934, 1985. doi:10.1137/0214065.
  • [15] Bruno Courcelle and Stephan Olariu. Upper bounds to the clique width of graphs. Discrete Applied Mathematics, 101(1-3):77–114, 2000. doi:10.1016/S0166-218X(99)00184-5.
  • [16] Nameirakpam Dhanachandra, Khumanthem Manglem, and Yambem Jina Chanu. Image segmentation using k-means clustering algorithm and subtractive clustering algorithm. Procedia Computer Science, 54:764–771, 2015.
  • [17] Wolfgang Espelage, Frank Gurski, and Egon Wanke. Deciding clique-width for graphs of bounded tree-width. Journal of Graph Algorithms and Applications, 7(2):141–180, 2003. doi:10.7155/JGAA.00065.
  • [18] Absalom E Ezugwu, Abiodun M Ikotun, Olaide O Oyelade, Laith Abualigah, Jeffery O Agushaka, Christopher I Eke, and Andronicus A Akinyelu. A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications, taxonomy, challenges, and future research prospects. Engineering Applications of Artificial Intelligence, 110:104743, 2022. doi:10.1016/J.ENGAPPAI.2022.104743.
  • [19] Fedor V. Fomin, Stefan Kratsch, Marcin Pilipczuk, Michal Pilipczuk, and Yngve Villanger. Tight bounds for parameterized complexity of cluster editing with a small number of clusters. J. Comput. Syst. Sci., 80(7):1430–1447, 2014. doi:10.1016/J.JCSS.2014.04.015.
  • [20] Jakub Gajarskỳ, Michael Lampis, and Sebastian Ordyniak. Parameterized algorithms for modular-width. In IPEC 2013, Sophia Antipolis, France, September 4-6, 2013, Revised Selected Papers 8, pages 163–176. Springer, 2013.
  • [21] Yong Gao, Donovan R. Hare, and James Nastos. The cluster deletion problem for cographs. Discret. Math., 313(23):2763–2771, 2013. doi:10.1016/J.DISC.2013.08.017.
  • [22] M. R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.
  • [23] Ioannis Giotis and Venkatesan Guruswami. Correlation clustering with a fixed number of clusters. Theory Comput., 2(13):249–266, 2006. doi:10.4086/TOC.2006.V002A013.
  • [24] Jens Gramm, Jiong Guo, Falk Hüffner, and Rolf Niedermeier. Graph-modeled data clustering: Fixed-parameter algorithms for clique generation. In Rossella Petreschi, Giuseppe Persiano, and Riccardo Silvestri, editors, CIAC 2003, Rome, Italy, May 28-30, 2003, Proceedings, volume 2653 of Lecture Notes in Computer Science, pages 108–119. Springer, 2003. doi:10.1007/3-540-44849-7_17.
  • [25] Jens Gramm, Jiong Guo, Falk Hüffner, and Rolf Niedermeier. Automated generation of search tree algorithms for hard graph modification problems. Algorithmica, 39(4):321–347, 2004. doi:10.1007/S00453-004-1090-5.
  • [26] Jiong Guo. A more effective linear kernelization for cluster editing. Theor. Comput. Sci., 410(8-10):718–726, 2009. doi:10.1016/J.TCS.2008.10.021.
  • [27] Frank Gurski, Egon Wanke, and Eda Yilmaz. Directed NLC-width. Theor. Comput. Sci., 616:1–17, 2016. doi:10.1016/J.TCS.2015.11.003.
  • [28] P Heggernes and D Kratsch. Linear-time certifying algorithms for recognizing trivially perfect graphs. Reports In Informatics ISSN, pages 0333–3590, 2006.
  • [29] Marc Hellmuth, Maribel Hernandez-Rosales, Katharina T Huber, Vincent Moulton, Peter F Stadler, and Nicolas Wieseke. Orthology relations, symbolic ultrametrics, and cographs. Journal of mathematical biology, 66:399–420, 2013.
  • [30] Marc Hellmuth, Nicolas Wieseke, Marcus Lechner, Hans-Peter Lenhof, Martin Middendorf, and Peter F Stadler. Phylogenomics with paralogs. Proceedings of the National Academy of Sciences, 112(7):2058–2063, 2015.
  • [31] Giuseppe F Italiano, Athanasios L Konstantinidis, and Charis Papadopoulos. Structural parameterization of cluster deletion. In International Conference and Workshops on Algorithms and Computation, pages 371–383. Springer, 2023. doi:10.1007/978-3-031-27051-2_31.
  • [32] Klaus Jansen, Stefan Kratsch, Dániel Marx, and Ildikó Schlotter. Bin packing with fixed number of bins revisited. J. Comput. Syst. Sci., 79(1):39–49, 2013. doi:10.1016/J.JCSS.2012.04.004.
  • [33] Songwei Jia, Lin Gao, Yong Gao, James Nastos, Yijie Wang, Xindong Zhang, and Haiyang Wang. Defining and identifying cograph communities in complex networks. New Journal of Physics, 17(1):013044, 2015.
  • [34] Öjvind Johansson. Clique-decomposition, NLC-decomposition, and modular decomposition: relationships and results for random graphs. Congressus Numerantium, pages 39–60, 1998.
  • [35] Christian Komusiewicz and Johannes Uhlmann. Cluster editing with locally bounded modifications. Discrete Applied Mathematics, 160(15):2259–2270, 2012. doi:10.1016/J.DAM.2012.05.019.
  • [36] Athanasios L. Konstantinidis and Charis Papadopoulos. Cluster deletion on interval graphs and split related graphs. Algorithmica, 83(7):2018–2046, 2021. doi:10.1007/S00453-021-00817-8.
  • [37] Mirko Krivánek and Jaroslav Morávek. NP-hard problems in hierarchical-tree clustering. Acta Informatica, 23(3):311–323, 1986. doi:10.1007/BF00289116.
  • [38] Manuel Lafond and Nadia El-Mabrouk. Orthology and paralogy constraints: satisfiability and consistency. BMC Genomics, 15:1–10, 2014.
  • [39] Manuel Lafond and Weidong Luo. Parameterized complexity of domination problems using restricted modular partitions. In Jérôme Leroux, Sylvain Lombardy, and David Peleg, editors, MFCS 2023, volume 272 of Leibniz International Proceedings in Informatics (LIPIcs), pages 61:1–61:14, Dagstuhl, Germany, 2023. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.MFCS.2023.61.
  • [40] Manuel Lafond, Mona Meghdari Miardan, and David Sankoff. Accurate prediction of orthologs in the presence of divergence after duplication. Bioinformatics, 34(13):i366–i375, 2018. doi:10.1093/BIOINFORMATICS/BTY242.
  • [41] Bassel Mannaa. Cluster editing problem for points on the real line: A polynomial time algorithm. Information Processing Letters, 110(21):961–965, 2010. doi:10.1016/J.IPL.2010.08.002.
  • [42] James Nastos and Yong Gao. Familial groups in social networks. Soc. Networks, 35(3):439–450, 2013. doi:10.1016/J.SOCNET.2013.05.001.
  • [43] Sebastian Ochs. Cluster Deletion on Unit Disk Graphs. Master’s thesis, Philipps-Universität Marburg, 2023.
  • [44] Alitzel López Sánchez and Manuel Lafond. Colorful orthology clustering in bounded-degree similarity graphs. Journal of Bioinformatics and Computational Biology, 19(6):2140010, 2021.
  • [45] Ron Shamir, Roded Sharan, and Dekel Tsur. Cluster graph modification problems. Discret. Appl. Math., 144(1-2):173–182, 2004. doi:10.1016/J.DAM.2004.01.007.
  • [46] Nate Veldt, David F. Gleich, and Anthony Wirth. A correlation clustering framework for community detection. In Proceedings of the 2018 World Wide Web Conference, pages 439–448, 2018. doi:10.1145/3178876.3186110.
  • [47] Charles T. Zahn, Jr. Approximating symmetric relations by equivalence relations. Journal of the Society for Industrial and Applied Mathematics, 12(4):840–847, 1964.