
Deterministic Independent Sets in the Semi-Streaming Model

Daniel Ye, University of Waterloo, Canada
Abstract

We consider the independent set problem in the semi-streaming model. For any input graph $G=(V,E)$ with $n$ vertices, an independent set is a set of vertices with no edges between any two elements. In the semi-streaming model, $G$ is presented as a stream of edges and any algorithm must use $\widetilde{O}(n)$ bits of memory (throughout, $\widetilde{O}(\cdot)$ is used to hide polylog factors) to output a large independent set at the end of the stream.

Prior work has designed various semi-streaming algorithms for finding independent sets. Due to the hardness of finding maximum and maximal independent sets in the semi-streaming model, the focus has primarily been on finding independent sets in terms of certain parameters, such as the maximum degree $\Delta$. In particular, there is a simple randomized algorithm that obtains independent sets of size $\frac{n}{\Delta+1}$ in expectation, which can also be achieved with high probability using more complicated algorithms. For deterministic algorithms, the best bounds are significantly weaker. The best we know is a straightforward algorithm that finds an $\widetilde{\Omega}\!\left(\frac{n}{\Delta^2}\right)$ size independent set.

We show that this straightforward algorithm is nearly optimal by proving that any deterministic semi-streaming algorithm can only output an $\widetilde{O}\!\left(\frac{n}{\Delta^2}\right)$ size independent set. Our result proves a strong separation between the power of deterministic and randomized semi-streaming algorithms for the independent set problem.

Keywords and phrases:
Sublinear Algorithms, Derandomization, Semi-Streaming Algorithms
Category:
Track A: Algorithms, Complexity and Games
Copyright and License:
© Daniel Ye; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Lower bounds and information complexity
Acknowledgements:
I would like to thank Sepehr Assadi for introducing me to this problem and for his numerous discussions, comments, and feedback throughout the development of this work. Without him, this would not have been possible. I would also like to thank Vihan Shah, Janani Sundaresan, and Christopher Trevisan for their insightful comments and meticulous remarks on earlier versions of this work.
Funding:
Supported in part by Sepehr Assadi’s Sloan Research Fellowship and NSERC Discovery grant.
Editors:
Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis

1 Introduction

Finding an independent set of a graph is a classical problem in graph theory with wide-ranging applications. An independent set of $G=(V,E)$ is a subset $I\subseteq V$ such that no two vertices in $I$ have an edge between them. Being able to find large independent sets plays a crucial role in network scheduling, transportation management, and much more. Especially in recent years, there has been an increased need for handling massive graphs to solve various tasks. Consequently, there has been an increased interest in studying the independent set problem in modern models of computation, such as the semi-streaming model introduced in [14]. In this model, $G$ is presented as a stream of edges and the storage space of the algorithm is bounded by $\widetilde{O}(n)$, where $n=|V|$. Our result pertains to the independent set problem under the semi-streaming model.

It is known that the Maximum Independent Set problem is both NP-hard [18] and hard-to-approximate to within a factor of $n^{1-\delta}$ for any $\delta>0$ [20]. Not only that, it is hard-to-approximate in the semi-streaming model, regardless of the time complexity of the algorithm [17]. As such, one line of research has focused on obtaining a Maximal Independent Set. In this respect, [2] yields an $O(\log\log n)$-pass semi-streaming algorithm for this problem, and very recently, [5] proved this is the optimal number of passes.

Hence, to study algorithms in even fewer passes, the problem must be further relaxed to finding “combinatorially optimal” independent sets. On one hand, we can try finding an optimal bound with respect to degree sequences – the Caro-Wei theorem guarantees an independent set of size $\sum_{v}\frac{1}{1+\deg(v)}$, which is optimal in the sense that there are many graphs that do not admit larger independent sets. To achieve this bound in the semi-streaming model, [15] devises an algorithm for hypergraphs, which, when applied to graphs, yields an expected independent set of size $\Omega\!\left(\sum_{v}\frac{1}{\deg(v)}\right)$ with $O(n)$ memory and $O(1)$ time per edge.

On the other hand, if we consider bounds with respect to $n$ and $\Delta$ (the maximum degree in a graph), the Caro-Wei theorem guarantees independent sets of size $\frac{n}{\Delta+1}$, which is also tight. There is a very easy randomized semi-streaming algorithm for achieving this bound in expectation: randomly permute the vertices, and when an edge arrives in the stream, mark the endpoint that appears later. At the end, output all unmarked vertices. It is a standard result in graph theory that this algorithm outputs an independent set of expected size $\frac{n}{\Delta+1}$. We can even obtain such an independent set with high probability using the $(\Delta+1)$-coloring semi-streaming algorithm presented in [4].
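As a concrete illustration, the simple randomized algorithm just described can be written in a few lines (a sketch of our own in Python; the function name and interface are not from the paper):

```python
import random

def random_greedy_independent_set(n, edge_stream, seed=0):
    """One-pass sketch of the randomized algorithm above: fix a random
    permutation of the vertices, and when an edge arrives, mark the endpoint
    that appears later in the permutation. The unmarked vertices form an
    independent set of expected size at least n / (Delta + 1)."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)          # order[v] acts as a uniformly random priority for v
    marked = [False] * n        # O(n) bits of working memory
    for u, v in edge_stream:
        later = u if order[u] > order[v] else v
        marked[later] = True    # the later endpoint of every edge gets marked
    return [v for v in range(n) if not marked[v]]

# Example: a 4-cycle 0-1-2-3-0 (maximum degree 2); the output is always independent.
print(random_greedy_independent_set(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))
```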

Interestingly, these results all make heavy use of randomization. Thus, as with a plethora of other problems in the semi-streaming model, there is a particular interest in derandomizing algorithms to achieve similar performance. Some problems admit single-pass deterministic algorithms matching their randomized counterparts (e.g. connectivity and bipartiteness [1], maximum matching [19]). However, others have a proven discrepancy between the theoretical space complexities of deterministic algorithms and randomized ones under this model (e.g. triangle counting [8] and vertex coloring [3]).

For the independent set problem, the state-of-the-art has been a fairly simple deterministic algorithm for finding an independent set of size $\frac{n}{\Delta^2}$. We begin by choosing $\frac{n}{\Delta}$ vertices arbitrarily, then store all the edges between them (resulting in $O(\frac{n}{\Delta}\cdot\Delta)=O(n)$ edges). We can then calculate a maximal independent set of this subgraph, resulting in an independent set of size $\Omega\!\left(\frac{n}{\Delta^2}\right)$. There has been very little progress in constructing a deterministic algorithm that does better than this, which has led to the following open question:

Is there a single-pass semi-streaming deterministic algorithm that can match the performance of randomized ones? In particular, is there a deterministic algorithm that can find an independent set of size better than $\widetilde{O}\!\left(\frac{n}{\Delta^2}\right)$ in general graphs?
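For reference, the baseline deterministic algorithm from the preceding paragraph can be sketched as follows (our own illustration, assuming the vertices are labelled $0,\ldots,n-1$ and $\Delta$ is known in advance):

```python
def deterministic_baseline(n, delta, edge_stream):
    """Sketch of the straightforward deterministic algorithm: keep roughly
    n / Delta arbitrarily chosen vertices, store the O(n) edges among them,
    and output a maximal independent set of that induced subgraph, which has
    size at least |kept| / (Delta + 1), i.e. about n / Delta^2."""
    kept = set(range(max(1, n // max(1, delta))))   # an arbitrary fixed choice of ~n/Delta vertices
    adj = {v: set() for v in kept}                  # at most |kept| * Delta / 2 = O(n) stored edges
    for u, v in edge_stream:
        if u in kept and v in kept:
            adj[u].add(v)
            adj[v].add(u)
    independent, blocked = [], set()
    for v in sorted(kept):                          # greedy maximal independent set
        if v not in blocked:
            independent.append(v)
            blocked.update(adj[v])
    return independent

# Example: the path 0-1-2-3-4-5 with delta = 2; only vertices 0, 1, 2 are kept.
print(deterministic_baseline(6, 2, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]))
```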

1.1 Our Contributions

Our main result is a negative answer to the open question: the deterministic algorithm stated above is optimal up to polylog factors.

Result 1.

Any deterministic single-pass semi-streaming algorithm can only find an independent set of size at most $\widetilde{O}\!\left(\frac{n}{\Delta^2}\right)$ in a graph of maximum degree $\Delta$ (even when $\Delta$ is known).

To the best of our knowledge, no space lower bounds have been devised for deterministic algorithms solving the independent set problem in the semi-streaming model. This result provides a lower bound that is tight up to logarithmic factors, illustrating another significant separation between deterministic and randomized algorithms in the semi-streaming model.

1.2 Our Techniques

We give a summary of our techniques here.

We begin our proof of the space lower bound in Result 1 with a similar setup to [3], who derived a space lower bound for deterministic algorithms solving the vertex coloring problem. In our case, we consider the multi-party communication complexity of the Independent Set problem, where the edges of a graph are partitioned among some number of players. In a predefined order, each player may speak once (outputting $\widetilde{O}(n)$ bits) and is heard by all future players. The goal is for the last player to output a large independent set.

We will design an adversary that adaptively constructs a graph that forces the algorithm to output an independent set of size $\widetilde{O}\!\left(\frac{n}{\Delta^2}\right)$. To do so, it is useful to consider the graph of non-edges (this is slightly different from the missing graph we use in our formal proof). For each player, its non-edges are determined by the messages of all the previous players. We first consider the set of inputs consistent with the previous players’ messages, then define the non-edges as the set of all edges that were not sent to any previous player in any consistent input. This is useful because any independent set the last player outputs must be a clique in its graph of non-edges (otherwise, there would be some input graph consistent with all the messages that renders the algorithm incorrect).

Additionally, we make use of the compression lemma of [3]. In particular, on any arbitrary graph with degree at most $d$, we can create a distribution of graphs by sampling each edge with probability $p$ while removing results with degree at least $2pd$. Then, for any algorithm that compresses a graph in this distribution to an $s$-bit summary, the compression lemma finds a summary such that at most $O(s/p)$ edges are not in any graph mapped to that summary.

This is useful for [3] as their adversary can narrow its search to a small set of vertices that a deterministic algorithm cannot summarize well. In fact, by focusing on vertices with low non-edge degrees, it can sample each remaining edge with a higher probability and improve the bound given by the compression lemma. However, for the independent set problem, deterministic algorithms are free to choose vertices from any section of the graph (and do not need to deal with every vertex as in coloring). As such, our adversary does not have the luxury of searching for some useful set of vertices nor working with vertices with low non-edge degrees. Instead, it must remove large independent sets globally from the graph and account for vertices that might not be easy to work with. To achieve this, our adversary generates graphs that are similar in structure to a Turán graph. In particular, as a deterministic algorithm receives its edges, the overall graph will seem more like many densely-connected vertex-disjoint subgraphs. The key to this strategy is a new lemma in our paper that allows for destroying large independent sets (equivalently, cliques) by adding “few” edges (equivalently, removing in the case of cliques).

1.3 Related Works

A similar-in-spirit result for vertex coloring is proven in [3], which shows that deterministic algorithms cannot color a graph with $\exp(\Delta^{o(1)})$ colors in a single pass. Since vertex coloring is fairly hard, designing an adversary entails finding some set of vertices that is a clique from the perspective of a deterministic algorithm. With the independent set problem, however, we must design an adversary that removes all large independent sets from the perspective of a deterministic algorithm. Due to this difference, deterministic algorithms can easily find independent sets of size $\frac{n}{\mathrm{poly}(\Delta)}$ in our setting (whereas no deterministic algorithm can find a coloring using at most $\mathrm{poly}(\Delta)$ colors). We show that they cannot do better than quadratic, up to a polylog factor.

More generally, there has also been a significant interest in finding independent sets in graph streams [2, 5, 15, 9, 10, 7, 6]. Independent sets in the online streaming model are studied in [16]. Under this model, they devise a deterministic algorithm with performance ratio O(2Δ), which they prove is also tight. We provide an adjacent result in the semi-streaming model, which does not require an algorithm to maintain a feasible solution at all times. Additionally, [12] studied independent sets in vertex-arrival streams, where each element in the input stream is a vertex along with its incident edges to earlier vertices. They show that the maximum independent set problem in the vertex-arrival model is not much easier than the problem in the edge-streaming model.

Independent sets have also been studied in more tangential settings. [11] studied the problem of finding the Caro-Wei bound itself that other algorithms (such as the one in [15]) achieve, and [7] studied the geometric independent set problem in the streaming model.

2 Technical Overview

In this section, we provide an informal sketch of our strategy. For clarity, we omit many details and technicalities.

As stated in Section 1.2, we prove Result 1 by considering the multi-party communication complexity of the independent set problem. Critical to our approach is the idea of non-edges: the set of all edges that a player knows have not been sent to any previous player based on their messages. The goal of our adversary is to ensure that the last player’s graph of non-edges has small maximum clique size, which limits the largest independent set they can confidently output.

To do so, it will send a set of edges to each player sampled from a distribution that is chosen based on the behavior of previous players. As it progresses, the graph of non-edges will resemble a Turán graph. To achieve this, it will partition the vertex set into disjoint subsets. Within each subset, the adversary will send edges so that the non-edge graph induced by the subset is fairly sparse, resulting in a small clique number. Between subsets, there can be any number of edges, but the overall clique number is bounded above by the sum of the clique numbers of the non-edge subgraph induced by each subset.

2.1 The Adversary

Let $n$ be the number of vertices, $\Delta$ the maximum degree, and $s$ the memory size. Set the number of players $k$ to be $\log n$ and define $\ell=\Omega\!\left(\frac{s}{n}\right)$. For the first player, the graph of non-edges is a clique on $n$ vertices.

Figure 1: (a) Partition the graph into vertex-disjoint subgraphs. (b) Further divide each subgraph based on non-edge degree.
  1. We first partition the graph of non-edges into $\frac{n}{\Delta^2}$ vertex-disjoint subgraphs of size $\Delta^2$. Denote the union of these subgraphs as $H$. This is shown in Figure 1 (a).

  2. For Player 1, sample each edge in $H$ independently with probability $\frac{1}{\ell\Delta}$. Discard any sample in which the degree of some vertex reaches $\frac{2\Delta}{\ell}$. By the Compression Lemma (stated below), since each player is a compression algorithm, there exists a message $M_1$ such that Player 2's non-edge graph intersects $H$ in at most $\widetilde{O}(s\ell\Delta)$ edges.

  3. Let $H'$ be the graph containing these remaining non-edges. For each subgraph in $H'$, split the vertices into $P_j$ and $Q_j$, where $P_j$ consists of vertices with degree at most $\ell^2\Delta$ within the subgraph and $Q_j$ consists of the remaining vertices. This is shown in Figure 1 (b).

    (a) Since the degree of each vertex in $Q_j$ (w.r.t. $H'$) exceeds $\ell^2\Delta$ and there are $\widetilde{O}(s\ell\Delta)$ edges in $H'$, there are at most $O(s/\ell)$ vertices in the union of all $Q_j$. We can combine these vertices into a single set $Q$ of size $O(s/\ell)$. This is shown in Figure 2.

    (b) For each $P_j$, we can apply Lemma 1 (stated below) and remove at most $\frac{\Delta}{2}$ edges from each vertex in $P_j$ (by sending those edges to Player 2) to ensure the maximum clique size in $P_j$ is $\widetilde{O}(\ell^2)$. This is shown in Figure 2.

  4. Recurse on the non-edge graph induced on $Q$, which is now of size $O(s/\ell)<\frac{n}{c}$ for some $c>1$ (we choose $\ell$ to be a large enough multiple of $\frac{s}{n}$). Repeat these same operations with Player 2. This time, we can partition the graph into $\frac{n}{\Delta^2}$ subgraphs with $\frac{\Delta^2}{c}$ vertices. Then, apply the Compression Lemma with probability $\frac{c}{\ell\Delta}$ to bound the number of non-edges by $O\!\left(\frac{s\ell\Delta}{c}\right)$. This results in a new set of subgraphs $\mathcal{P}$ with clique number $\widetilde{O}(\ell^2)$ and a set $Q$ of $<\frac{n}{c^2}$ vertices.

  5. We repeat the process $k$ times. This is enough for the set $Q$ to have size at most $\frac{n}{\Delta^2}$. At this stage, the graph of non-edges is made up of many vertex-disjoint subgraphs, each with clique number $\widetilde{O}(\ell^2)$. The subgraphs are connected to each other through a (potentially) large number of edges. This is shown in Figure 3.

  6. The largest clique in the last player's graph of non-edges has size at most the sum of the clique numbers of these subgraphs (plus the size of the final set $Q$), which is at most $k\cdot\frac{n}{\Delta^2}\cdot\widetilde{O}(\ell^2)+\frac{n}{\Delta^2}=\widetilde{O}\!\left(\frac{n}{\Delta^2}\right)$. This implies that the largest independent set the last player can output has size $\widetilde{O}\!\left(\frac{n}{\Delta^2}\right)$. Note that $\ell=O(\mathrm{polylog}(n))$ in the semi-streaming model.

Figure 2: Combine the vertices with high degree into a single set.
Figure 3: The final graph of non-edges is made up of multiple vertex-disjoint subgraphs, each with a low clique number. The subgraphs are connected to each other through a (potentially) large number of edges.
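To make the bookkeeping in the sketch above concrete, the following toy Python script (our own illustration, not part of the paper) traces $\ell$, the shrinkage of $Q$, and the resulting clique bound; the specific constants mirror the ones derived formally in Section 5.

```python
import math

def sketch_adversary_bookkeeping(n, delta, s):
    """Trace the quantities in the sketch above: ell, the number of players k,
    how fast the leftover set Q shrinks, and the resulting clique bound
    k * (n / delta^2) * O~(ell^2) + n / delta^2.  (Illustration only.)"""
    ell = max(2 * math.e * math.log(2) * (s + 1) / n, 8 * math.log(n))
    k = math.ceil(math.log(n)) + 1                  # number of players
    q_size, rounds = float(n), 0
    while q_size > n / delta ** 2 and rounds < k:   # each round shrinks Q by a constant factor
        q_size /= math.e
        rounds += 1
    clique_per_part = 32 * ell ** 2 * math.log(n) + 10
    overall = (n / delta ** 2) * (1 + k * (96 * ell ** 2 * math.log(n) + 30))
    return {"ell": ell, "players": k, "rounds_to_shrink_Q": rounds,
            "final_Q_size": q_size, "clique_bound_per_part": clique_per_part,
            "overall_clique_bound": overall}

# Semi-streaming-style parameters: s is n times a polylog factor.
print(sketch_adversary_bookkeeping(n=10 ** 6, delta=500, s=20 * 10 ** 6))
```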

The compression lemma takes advantage of the bounded message size, allowing us to bound the number of non-edges without significantly increasing the degree of each vertex.

Compression Lemma: Let $H$ be any arbitrary graph with degree at most $d$. Consider the distribution of subgraphs obtained by sampling each edge in $H$ with probability $p$ while ignoring results that have degree at least $2pd$. For any compression algorithm that maps graphs sampled from this distribution into $s$-bit summaries, there is some summary such that at most $O(s/p)$ edges are not in any graph mapped to this summary.

The lemma below extends the compression lemma, allowing us to aggressively bound the maximum clique size in low-degree graphs without removing many edges from each vertex.

Lemma 1. Let $G$ be an arbitrary graph with maximum degree $\Delta$. For any positive integer $d$, there is some subgraph $H$ of $G$ with maximum degree at most $d$ such that the largest clique in $G\setminus H$ has size $\widetilde{O}\!\left(\frac{\Delta}{d}\right)$.

3 Preliminaries

Notation.

For an integer $k\ge 1$, we denote $[k]:=\{1,2,\ldots,k\}$. For a tuple $(X_1,\ldots,X_k)$ and any $i\in[k]$, we denote $X_{<i}$ as $(X_1,\ldots,X_{i-1})$. For any distribution $\mu$, we will denote $\mathrm{supp}(\mu)$ as the support of $\mu$.

For a graph $G=(V,E)$, we denote $\Delta(G)$ as the maximum degree of $G$, and for any $v\in V$, $\deg(v)$ as the degree of $v$ in $G$. For any vertex set $T\subseteq V$, we will denote the induced subgraph of $G$ on $T$ as $G[T]$. Often, we will also partition a set $S$ into a collection of subsets $\mathcal{P}$ and a subset $Q$. When we use this language, we are saying that $\mathcal{P}\cup\{Q\}$ is a partition of $S$.

One part of our strategy also involves partitioning the vertices into many small subsets. For any integer $g\ge 1$, we will denote $\mathrm{Partition}(S,g)$ as an arbitrary partition of $S$ into subsets of size $g$, except potentially for the last set, which has size $<g$.

Fact 0.

For any set $S$ and $g\ge 1$, $|\mathrm{Partition}(S,g)|\le\left\lceil\frac{|S|}{g}\right\rceil$.
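A minimal helper matching this notation could look as follows (a hypothetical implementation, purely for illustration):

```python
def partition(S, g):
    """Split the list S into consecutive groups of size g; only the last
    group may be smaller, mirroring the Partition(S, g) notation above."""
    return [S[i:i + g] for i in range(0, len(S), g)]

# A set of 10 vertices split into groups of size 4 gives ceil(10/4) = 3 parts.
print(partition(list(range(10)), 4))   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```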

Finally, we use the following standard Chernoff bound:

Proposition 1 (Chernoff bound; c.f. [13]).

Suppose $X_1,\ldots,X_m$ are $m$ independent random variables in the range $[0,1]$. Let $X:=\sum_{i=1}^{m}X_i$ and $\mu_L\le\mathbb{E}[X]\le\mu_H$. Then, for any $\epsilon>0$,

$\Pr\left(X>(1+\epsilon)\mu_H\right)\le\exp\left(-\frac{\epsilon^{2}\mu_H}{3+\epsilon}\right)\qquad\text{and}\qquad\Pr\left(X<(1-\epsilon)\mu_L\right)\le\exp\left(-\frac{\epsilon^{2}\mu_L}{2+\epsilon}\right).$

3.1 The Communication Complexity of Independent Sets

We prove our space lower bound in Result 1 through a communication complexity argument in the following communication game, which as stated in Section 1.2, is defined similarly to [3].

Definition 2.

For integers $n,\Delta,k,s\ge 1$, the Independent-Set$(n,\Delta,k,s)$ game is defined as:

  1. There are $k$ players $P_1,\ldots,P_k$. Each player $P_i$ knows the vertex set $V$ and receives a set $E_i$ of edges. Let $G=(V,E)$, where $E=E_1\cup\cdots\cup E_k$, and the players are guaranteed that $E_1,\ldots,E_k$ are disjoint. The players are also guaranteed that $\Delta(G)\le\Delta$, and their goal is to output an independent set of $G$.

  2. In order from $i=1$ to $i=k$, each player $P_i$ writes a public message $M_i$ of length at most $s$, based on $E_i$ and $M_{<i}$ (all the messages from the previous players).

  3. The goal of the players is to output an independent set of $G$, which $P_k$ outputs as its message $M_k$.

The following is standard:

Lemma 3.

Suppose there is a deterministic streaming algorithm that, on any n-vertex graph G with known maximum degree Δ, outputs an independent set of G with size r using s bits of space. Then, for any positive k, there also exists a deterministic protocol for Independent-Set(n,Δ,k,s) that outputs an independent set of size r.

Proof.

The players simply run the streaming algorithm on their input by writing the content of the memory of the algorithm from one player to the next on the blackboard, so that the next player can continue running the algorithm on their input. At the end, the last player computes the output of the streaming algorithm and writes it on the blackboard.
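The reduction can be pictured with the following sketch (our own; it assumes a hypothetical streaming-algorithm object exposing load_memory, process_edge, get_memory, and output, none of which are names from the paper):

```python
def simulate_protocol(algorithm_factory, edge_sets):
    """Sketch of the reduction: Player i resumes the streaming algorithm from
    the memory its predecessor wrote on the blackboard, feeds it E_i, and
    writes the updated memory; the last player writes the output instead."""
    blackboard = None
    for i, edges in enumerate(edge_sets):
        algo = algorithm_factory()             # a fresh copy of the deterministic algorithm
        if blackboard is not None:
            algo.load_memory(blackboard)       # continue from the previous player's state
        for u, v in edges:
            algo.process_edge(u, v)
        is_last = (i == len(edge_sets) - 1)
        blackboard = algo.output() if is_last else algo.get_memory()
    return blackboard                          # the independent set written by Player k
```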

3.2 The Missing Graph and Compression Lemma

As stated in Section 1.2, we use the compression lemma in [3]. We begin with two definitions that are similar to [3].

Definition 4.

For a base graph $G_{Base}=(V,E_{Base})$ and parameters $p\in(0,1]$, $d\ge 1$, we define the random graph distribution $\mathbb{G}=\mathbb{G}(G_{Base},p,d)$ as follows:

  1. Sample a graph $G$ on vertices $V$ and edges $E$ by picking each edge of $E_{Base}$ independently with probability $p$.

  2. Return $G$ if $\Delta(G)<2pd$. Otherwise, repeat the process.
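A direct, if inefficient, way to realize this distribution is rejection sampling, sketched below for illustration (the edge-list representation is our own choice):

```python
import random

def sample_from_G(base_edges, n, p, d, rng=random.Random(0)):
    """Sample from the distribution G(G_Base, p, d): keep each base edge
    independently with probability p, and reject (and retry) whenever the
    sampled graph has maximum degree >= 2 * p * d."""
    while True:
        degree = [0] * n
        sampled = []
        for (u, v) in base_edges:
            if rng.random() < p:
                sampled.append((u, v))
                degree[u] += 1
                degree[v] += 1
        if max(degree) < 2 * p * d:
            return sampled

# Example: a sparse random subgraph of a 6-cycle with p = 1/2 and d = 2.
cycle = [(i, (i + 1) % 6) for i in range(6)]
print(sample_from_G(cycle, n=6, p=0.5, d=2))
```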

To analyze an arbitrary deterministic algorithm, we will often consider the compression algorithm associated with it. We will represent the “information” available to the algorithm as a “missing graph” (the missing graph is not the same as the graph of non-edges, which pertains to all previous players; however, they are closely related, and the missing graph is one component of the graph of non-edges), which we define here:

Definition 5.

Consider $\mathbb{G}(G_{Base},p,d)$ for a base graph $G_{Base}=(V,E_{Base})$ and parameters $p\in(0,1]$, $d\ge 1$, and an integer $s\ge 1$. A compression algorithm with size $s$ is any function $\Phi:\mathrm{supp}(\mathbb{G})\to\{0,1\}^{s}$ that maps graphs sampled from $\mathbb{G}$ into $s$-bit strings. For any graph $G\in\mathrm{supp}(\mathbb{G})$, we refer to $\Phi(G)$ as the summary of $G$. For any summary $\phi\in\{0,1\}^{s}$, we define:

  1. $\mathbb{G}_{\phi}$ as the distribution of graphs mapped to $\phi$ by $\Phi$.

  2. $G_{Miss}(\phi)=(V,E_{Miss}(\phi))$, called the missing graph of $\phi$, as the graph on vertices $V$ whose edge set consists of the edges missed by all graphs in $\mathbb{G}_{\phi}$. In other words,

     $E_{Miss}(\phi)=\{(u,v)\in E_{Base}\mid\text{no graph in }\mathrm{supp}(\mathbb{G}_{\phi})\text{ contains the edge }(u,v)\}.$

The previous definitions are used extensively. The following lemma also plays a crucial role in our communication lower bound, bounding the number of conclusive missing edges that can be recovered from a compression algorithm of a given size. It is proven in Lemma 4.3 of [3].

Lemma 6 (Compression Lemma).

Let $G_{Base}=(V,E_{Base})$ be an $n$-vertex graph, $s\ge 1$ be an integer, and $p\in(0,1)$ and $d\ge 1$ be parameters such that $d\ge\max\left\{\Delta(G_{Base}),\frac{4\ln(2n)}{p}\right\}$. Consider the distribution $\mathbb{G}:=\mathbb{G}(G_{Base},p,d)$ and suppose $\Phi:\mathrm{supp}(\mathbb{G})\to\{0,1\}^{s}$ is a compression algorithm of size $s$ for $\mathbb{G}$. Then, there exists a summary $\phi\in\{0,1\}^{s}$ such that in the missing graph of $\phi$,

$|E_{Miss}(\phi)|\le\frac{\ln 2\cdot(s+1)}{p}.$

4 Removing Cliques in the Missing Graph

In this section, we introduce a key tool used by our adversary. More specifically, we develop a procedure that removes large cliques in the missing graph. This is useful because a deterministic algorithm that finds independent sets must be certain that none of the vertices in its output have an edge between them, which creates a clique of the same size in its graph of non-edges. To do so, we will design our adversary to generate a Turán-type graph, as mentioned in Section 1.2.

4.1 Removing Cliques in Low-Degree Graphs

The following lemma provides a method to bound the largest clique size in a bounded-degree graph by removing only a small number of edges per vertex.

Lemma 7.

Let $G$ be a graph with $n$ vertices such that $\Delta(G)\le\Delta$. For any positive integer $d$, there is some subgraph $H$ of $G$ such that:

  1. The degree of $H$ is at most $d$.

  2. The largest clique in $G\setminus H$ has size at most $\frac{16\ln(n)\Delta}{d}+10$.

Proof.

We will use a probabilistic argument.

Firstly, if $d\le 16\ln(n)$, then we let $H$ be the empty graph. Since $\Delta(G)\le\Delta$, the largest clique in $G$ has size at most $\Delta+1$. Indeed, $\frac{16\ln(n)\Delta}{d}+10\ge\frac{16\ln(n)\Delta}{16\ln(n)}+10=\Delta+10\ge\Delta+1$.

Similarly, if $\Delta\le d$, then take $H=G$. The largest clique size is $1$ and $\Delta(H)\le\Delta\le d$. Finally, if $n\le 10$, then the largest clique size is at most $10$, so we can let $H$ be the empty graph as well.

It remains to prove the lemma for $n>10$ and $16\ln(n)<d<\Delta$. In this case, $H$ is a random subgraph chosen by sampling each edge in $G$ with probability $\frac{d}{2\Delta}$. For any vertex $v$, we let $X_1,\ldots,X_{\deg(v)}$ be indicator variables for whether each incident edge is chosen. If we let $X:=\sum_{i=1}^{\deg(v)}X_i$, then $\mathbb{E}[X]\le\frac{d}{2}$. By Proposition 1 with $\epsilon=1$,

$\Pr(X>d)\le\Pr\left(X>2\,\mathbb{E}[X]\right)\le\exp\left(-\frac{d}{2\cdot 4}\right)<\exp\left(-\frac{16\ln n}{8}\right)=\exp(-2\ln n).$

Hence, by a union bound, $\Pr(\text{some vertex in }H\text{ has degree greater than }d)$ does not exceed

$n\cdot\exp(-2\ln n)=\exp(\ln n-2\ln n)=\exp(-\ln n)=\frac{1}{n}.$

For any group of $k$ vertices, where $k>\frac{16\ln(n)\Delta}{d}+10$: if it is not a clique in $G$, it certainly will not be a clique after removing the edges in $H$. Otherwise, it can only remain a clique if none of its edges are selected in $H$. For any such $k$-subset, the probability of this happening is:

$\Pr(\text{this }k\text{-subset is a clique}) =\left(1-\frac{d}{2\Delta}\right)^{k(k-1)/2}
\le\exp\left(-\frac{d}{2\Delta}\cdot\frac{k(k-1)}{2}\right)
<\exp\left(-\frac{d}{2\Delta}\cdot\frac{16\ln n\,\Delta}{d}\cdot\frac{k-1}{2}\right)
=\exp\left(-4\ln n\,(k-1)\right).$

Now, we apply a union bound over all $k$-subsets of vertices:

$\Pr(G\setminus H\text{ has a }k\text{-clique}) \le\binom{n}{k}\exp\left(-4\ln n\,(k-1)\right)
\le\left(\frac{en}{k}\right)^{k}\exp\left(-4\ln n\,(k-1)\right)
=\exp\left(k\ln(en/k)-4\ln(n)(k-1)\right)
=\exp\left(k\left(1+\ln(n)-\ln(k)-4\ln(n)\right)+4\ln n\right)
\le\exp\left(k\left(1-3\ln(n)\right)+4\ln(n)\right)
\le\exp\left(-2k\ln n+4\ln(n)\right)\qquad(\text{since }\ln(n)\ge 1)
\le\exp\left(-6\ln(n)+4\ln(n)\right)\qquad(\text{substituting }k>\tfrac{16\ln(n)\Delta}{d}\ge 3)
=\frac{1}{n^{2}}.$

Finally, we calculate the probability that this procedure yields a subgraph H satisfying Item 1 and Item 2. Through a union bound on the complement, when n>10, we get

$\Pr(H\text{ does not satisfy Item 1 or Item 2})\le\frac{1}{10}+\frac{1}{100}<1.$

Thus, there is some H satisfying the two conditions.
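The sampling step of this proof is easy to mirror in code (our own Monte Carlo illustration; it only enforces the degree condition of Item 1 by retrying, and relies on the union bound above for Item 2):

```python
import random
from collections import defaultdict

def sample_low_degree_subgraph(edges, max_deg, d, rng=random.Random(0)):
    """Mirror of the probabilistic construction above: include each edge of G
    independently with probability d / (2 * max_deg), and retry until no vertex
    of the sampled subgraph H exceeds degree d (Item 1). That G \\ H has no
    large clique (Item 2) holds with positive probability by the union bound
    in the proof, so this sketch returns the first H passing the degree check."""
    prob = d / (2 * max_deg)
    while True:
        degree = defaultdict(int)
        chosen = []
        for (u, v) in edges:
            if rng.random() < prob:
                chosen.append((u, v))
                degree[u] += 1
                degree[v] += 1
        if all(deg <= d for deg in degree.values()):
            return chosen

# Example: G is the complete graph on 12 vertices (max degree 11), d = 4.
K12 = [(u, v) for u in range(12) for v in range(u + 1, 12)]
H = sample_low_degree_subgraph(K12, max_deg=11, d=4)
print(len(H), "edges kept; every vertex has degree at most 4 in H")
```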

4.2 Removing Cliques in General Graphs

To use Lemma 7, we need graphs of low maximum degree. However, the Compression Lemma can only bound the total number of edges. Hence, to employ Lemma 7, we will partition a graph with few edges into many subgraphs with low degree and a small “remainder” subgraph. We will start with the following definition:

Definition 8.

For a positive integer $b$ and a graph $G=(V,E)$ with $m$ edges, we define $\mathrm{Split}(G,b)$ as a partition of $V$ into a pair of vertex sets $(P,Q)$ such that $\Delta(G[P])\le b$ and $|Q|\le\frac{2m}{b}$.

A split necessarily exists for regular graphs. In the general case, the following proposition ensures that a split exists:

Proposition 9.

Suppose we are given a graph G=(V,E) with n vertices and m edges. For every positive integer b, Split(G,b) exists.

Proof.

Let $P:=\{v\in V:\deg(v)\le b\}$. Then, $Q:=V\setminus P$. This is clearly a partition since each vertex is in exactly one of $P$ or $Q$.

We will now prove the bound on $G[P]$. For every vertex $v\in P$, $\deg(v)\le b$ in $G$. Since $G[P]$ is a subgraph of $G$, $\deg(v)\le b$ in $G[P]$ as well. Hence, $\Delta(G[P])\le b$.

Now for the bound on $|Q|$. It is known that the number of edges is equal to half the sum of the degrees. Hence, we have

$m=\frac{1}{2}\sum_{v\in V}\deg(v)\ge\frac{1}{2}\sum_{v\in V\setminus P}\deg(v)\ge\frac{1}{2}\sum_{v\in Q}b=\frac{1}{2}|Q|\,b.$

Then, we rearrange to get $|Q|\le\frac{2m}{b}$.
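The proof is constructive: the split can be computed with a single pass over the degree sequence, as in the following sketch (our own illustration):

```python
from collections import defaultdict

def split(edges, vertices, b):
    """Compute Split(G, b) as in Definition 8 / Proposition 9: P holds the
    vertices of degree at most b, Q the rest. |Q| <= 2m / b follows because
    every vertex of Q contributes more than b to the degree sum 2m."""
    degree = defaultdict(int)
    for (u, v) in edges:
        degree[u] += 1
        degree[v] += 1
    P = [v for v in vertices if degree[v] <= b]
    Q = [v for v in vertices if degree[v] > b]
    return P, Q

# Example: a star with center 0 and 5 leaves, threshold b = 2.
star = [(0, i) for i in range(1, 6)]
print(split(star, vertices=range(6), b=2))   # ([1, 2, 3, 4, 5], [0])
```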

To “remove” many large cliques in an arbitrary graph, we will run a two-step subalgorithm.

The first step of the subalgorithm involves finding a subgraph of our input graph $G_{Base}$, which we denote as $H_{Base}$. The graph $H_{Base}$ is chosen such that, for any algorithm compressing it, we can easily bound the number of edges in $H_{Miss}(\phi)$ (the missing graph of $H_{Base}$ for some message $\phi$) using the Compression Lemma. We will then prove that we can partition the vertices of $H_{Miss}(\phi)$ (which are the same as the vertices of $G_{Base}$) into a collection of vertex sets $\mathcal{P}$ and a vertex set $Q$ such that the maximum degree in $H_{Miss}(\phi)[P]$ is small for all $P\in\mathcal{P}$ and the size of $Q$ is small.

Lemma 10.

Let $G_{Base}=(V,E_{Base})$ be a graph with $n$ vertices. Let $g\ge 1$ (group size) and $s\ge 1$ (message size) be arbitrary integers. Let $d_{comp}\ge 4\ln(2n)$ (compression degree) and $d_{filter}\ge 1$ (filter degree) be arbitrary real numbers.

Then, there is a subgraph $H_{Base}\subseteq G_{Base}$ and a distribution $\mathbb{H}:=\mathbb{G}(H_{Base},p,d)$ (for some $p$ and $d$) such that, for every compression algorithm $\Phi:\mathrm{supp}(\mathbb{H})\to\{0,1\}^{s}$, we can find a message $\phi$ and a partition of $V$ into a collection of vertex sets $\mathcal{P}$ and a vertex set $Q$ satisfying:

  1. For all $H\in\mathrm{supp}(\mathbb{H})$, $\Delta(H)\le 2d_{comp}$.

  2. For all $P\in\mathcal{P}$, $G_{Base}[P]=H_{Base}[P]$.

  3. For any vertex set $P\in\mathcal{P}$, $\Delta(H_{Miss}(\phi)[P])\le d_{filter}$.

  4. The size of $\mathcal{P}$ is at most $\left\lceil\frac{n}{g}\right\rceil$.

  5. The size of $Q$ is at most $\frac{2\ln(2)(s+1)\,g}{d_{comp}\,d_{filter}}$.

Proof.

Let $\mathcal{H}:=\mathrm{Partition}(V,g)$ and $H_{Base}:=\bigcup_{S\in\mathcal{H}}G_{Base}[S]$. We prove this lemma through casework on $d_{comp}$.

If $d_{comp}<g$, then we define $\mathbb{H}:=\mathbb{G}\!\left(H_{Base},\frac{d_{comp}}{g},g\right)$.

For all $K\in\mathrm{supp}(\mathbb{H})$, $\Delta(K)<2\cdot\frac{d_{comp}}{g}\cdot g=2d_{comp}$ by construction, proving Item 1.

These values of $p=\frac{d_{comp}}{g}$ and $d=g$ satisfy the requirements of the Compression Lemma. In particular, since $4\ln(2n)\le d_{comp}<g$, we have $p=\frac{d_{comp}}{g}\in(0,1)$. Additionally, $d=g\ge\Delta(H_{Base})$ since each subgraph has at most $g$ vertices, and $d=g=\frac{d_{comp}}{d_{comp}/g}\ge\frac{4\ln(2n)}{p}$. Hence, the values of $p$ and $d$ satisfy the necessary constraints.

Therefore, there exists some message $\phi$ such that $H_{Miss}(\phi)=(V,E_{Miss}(\phi))$ satisfies

$|E_{Miss}(\phi)|\le\frac{\ln(2)(s+1)\,g}{d_{comp}}.$

Let $(L,Q):=\mathrm{Split}(H_{Miss}(\phi),d_{filter})$. By Proposition 9, $|Q|\le\frac{2\ln(2)(s+1)\,g}{d_{comp}\,d_{filter}}$, proving Item 5. Additionally, let $\mathcal{P}:=\{S\cap L:S\in\mathcal{H}\}$. Since $\mathcal{H}$ is a partition of $V$ and $Q=V\setminus L$, we know that $\mathcal{P}\cup\{Q\}$ is a partition of $V$. By the fact in Section 3, $|\mathcal{P}|=|\mathcal{H}|\le\left\lceil\frac{n}{g}\right\rceil$, proving Item 4.

We will now prove the conditions on all $P\in\mathcal{P}$. Let $X_P:=H_{Miss}(\phi)[P]$.

  • By Proposition 9, since $P\subseteq L$, $\Delta(X_P)\le\Delta(H_{Miss}(\phi)[L])\le d_{filter}$. This proves Item 3.

  • We know $P\subseteq S$ for some $S\in\mathcal{H}$. Additionally, $\mathcal{H}$ is a partition of $V$. So, by our construction of $H_{Base}$, we have $H_{Base}[S]=G_{Base}[S]$. Hence, since $P\subseteq S$, we have $H_{Base}[P]=G_{Base}[P]$. This proves Item 2.

If $d_{comp}\ge g$, then we let $\mathbb{H}=\mathbb{G}(H_{Base},1,|V|)$. We let $\mathcal{P}=\mathcal{H}$, so $Q=\emptyset$, which satisfies Item 5. The distribution $\mathbb{H}$ consists of the single graph $H_{Base}$. Since each subgraph in $H_{Base}$ has at most $g$ vertices, $\Delta(H_{Base})\le g\le 2d_{comp}$, proving Item 1. Since $|\mathcal{P}|=|\mathcal{H}|$, Item 4 is proven by the fact in Section 3. Since $\mathrm{supp}(\mathbb{H})=\{H_{Base}\}$, $H_{Miss}(\phi)$ is empty, proving Item 3. By how we defined $\mathcal{P}$ and $H_{Base}$ (note that $\mathcal{H}$ is a partition), $G_{Base}[P]=H_{Base}[P]$ for all $P\in\mathcal{P}$. This proves Item 2.

In the second step of the subalgorithm, we will apply Lemma 7 on $H_{Miss}(\phi)[P]$ for all $P\in\mathcal{P}$, storing the removed edges in a subgraph $R$ (which will have low degree). In the end, the maximum clique size in $H_{Miss}(\phi)\setminus R$ will be small since it cannot exceed the sum of the maximum clique sizes over the subgraphs induced by the vertex sets in $\mathcal{P}$.

Lemma 11.

Suppose we have a graph $H$ with $n$ vertices and a partition of its vertices into a collection of vertex sets $\mathcal{P}$ and a vertex set $Q$. Additionally, for all $P\in\mathcal{P}$, suppose that the degree of $H[P]$ does not exceed $d_{filter}$.

Then, for any integer $d_{remove}>0$ (removal degree), there is a subgraph $R\subseteq H$ such that:

  • For all $P\in\mathcal{P}$, the largest clique in $(H\setminus R)[P]$ has size at most $\frac{16\ln(n)\,d_{filter}}{d_{remove}}+10$.

  • None of the edges in $R$ is incident to a vertex in $Q$.

  • The degree of $R$ does not exceed $d_{remove}$.

Proof.

For all $P\in\mathcal{P}$, by Lemma 7 on $H[P]$ and $d_{remove}$, there is a graph $R_P$ with degree at most $d_{remove}$ such that the maximum clique size in $H[P]\setminus R_P$ is at most $\frac{16\ln(n)\,d_{filter}}{d_{remove}}+10$.

We take $R:=\bigcup_{P\in\mathcal{P}}R_P$. Since $Q$ is disjoint from all sets in $\mathcal{P}$, none of the edges in $R$ are incident to any vertex in $Q$. Additionally, for all $P\in\mathcal{P}$, since $R_P\subseteq R$ and the vertex sets in $\mathcal{P}$ are disjoint, we have $R_P=R[P]$. Thus, the largest clique in $(H\setminus R)[P]=H[P]\setminus R_P$ has size at most $\frac{16\ln(n)\,d_{filter}}{d_{remove}}+10$.

5 A Communication Lower Bound for Independent Set

We will prove our lower bound in Result 1 by showing that it is impossible for any set of players to output a large independent set, which is sufficient by Lemma 3. To do so, we will design an adversary that, for any large claimed independent set, can find an input graph and a set of edges to send to each player such that each player sends the same messages, yet the claimed independent set contains an edge. This will ensure that no deterministic algorithm can confidently output any large independent set.

5.1 The Adversary

Theorem 12.

There exist constants $\eta>0$ and $\eta_0>0$ such that: if $n$, $s$, and $\Delta$ are integers satisfying

$\eta<n\le s,\qquad\max\left\{\frac{\eta_0\,s\ln(n)}{n},\ \eta_0(\ln n)^{2}\right\}<\Delta<\sqrt{n},$

then the size of the largest independent set a deterministic algorithm using $s$ bits of memory can output for all graphs of size $n$ and maximum degree $\Delta$ does not exceed $\widetilde{O}\!\left(\frac{s}{\Delta^{2}}\cdot\frac{s}{n}\right)$.

Under the semi-streaming model, $s=\widetilde{O}(n)$. Consequently, the largest independent set that a deterministic algorithm can find under this model has size $\widetilde{O}\!\left(\frac{n}{\Delta^2}\right)$, which formalizes Result 1.

To begin with the proof of Theorem 12, we will let $\eta=e^{2}$ and $\eta_0=128$. We let $\ell=\max\left\{\frac{2e\ln(2)(s+1)}{n},\,8\ln n\right\}$. By our choice of $\eta$ and $\eta_0$, it is easy to prove that $\ell\le\frac{\Delta}{4\ln(2n)}$. We let $k=\lceil\ln n\rceil+1$. For our adversary, $k$ denotes the number of players, and we will generally send graphs of degree $\frac{\Delta}{\ell}$ to each player.

At a high level, our adversary adheres to the following structure:

  • For all $i=1,\ldots,k$, $G_{Base}(i)$ is the graph from which most edges for Player $i$ will be chosen.

  • Similarly, $R_i$ will be provided as an additional set of edges to send to Player $i$ to “manually” remove large cliques in the missing graph for the previous players’ messages.

  • For all $i\in[k]$, Player $i$ receives $R_i$ and a subgraph $H_i\subseteq G_{Base}(i)$.

  • Lemma 10 is used to find an adversarial distribution of subgraphs $\mathbb{H}_i$, which is used to obtain $H_i$. Lemma 11 is used to determine $R_{i+1}$, which is a subgraph of $G_{Base}(i)$.

  • All graph parameters are derived adversarially based on what each player communicates. The compression algorithm associated with each player also affects the input for future players by determining $G_{Base}(i+1)$ and $R_{i+1}$.

  • The adversarially generated input graph, $G_{input}$, is the union of all graphs $H_i$ and $R_i$.

An adversary that generates a “hard” input

  1. We start with $G_{Base}(1)$ as the clique on $n$ vertices and $R_1$ as an empty graph.

  2. For $i=1,\ldots,k$:

    (a) Let $n_i$ be the number of vertices in $G_{Base}(i)$, where $G_{Base}(i)=(V_{Base}(i),E_{Base}(i))$.

    (b) If $n_i<\frac{n}{\Delta^2}$, then terminate the algorithm, letting $\mathbb{H}_i=\emptyset$ and $R_{i+1}=\emptyset$. For the sake of our analysis, we let $G_{Base}(i+1)=G_{Base}(i)$ and run the adversary until $i=k$.

    (c) Otherwise, apply Lemma 10 with $G_{Base}(i)$, group size $\left\lfloor\frac{n_i\Delta^2}{n}\right\rfloor$, message size $s$, compression degree $\frac{\Delta}{\ell}$, and filter degree $\ell^2\Delta$.

      We obtain a distribution $\mathbb{H}_i$ with a base graph $H_{Base}(i)$, and we define $\Phi_i=\Phi(\mathbb{H}_i,M_{<i})$ as the compression algorithm for Player $i$ after receiving $R_i$.

      By Lemma 10, there is a message $M_i:=\phi$ and some partition of $V_{Base}(i)$ into a collection of vertex sets $\mathcal{P}_i$ and a vertex set $Q_i$ satisfying the conclusions of Lemma 10.

    (d) Apply Lemma 11 with $H_{Miss}(M_i)$ and removal degree $\frac{\Delta}{2}$ to find some $R_{i+1}$ such that the conclusions of Lemma 11 hold.

    (e) Let $G_{Base}(i+1)=\big(G_{Base}(i)\setminus(H_{Base}(i)\setminus H_{Miss}(M_i))\big)[Q_i]$.

  3. Finally, we generate the edges we send each player. For $i=1,\ldots,k$:

    (a) If $\mathbb{H}_i=\emptyset$, let $H_i=\emptyset$. Otherwise, choose $H_i\in\mathrm{supp}(\mathbb{H}_i)$ such that $\Phi_i(H_i)=M_i$.

    (b) We send Player $i$ the graph $H_i\cup R_i$.

    Then, the input graph is the union of these graphs. In particular, $G_{input}:=\bigcup_{i=1}^{k}(H_i\cup R_i)$.

To begin, we must ensure that the input graph Ginput is valid. The following lemma ensures that we are not sending a multigraph to the players.

Lemma 13.

No two players will receive the same edge (i.e. Ginput is not a multigraph).

Proof.

It is sufficient to prove the stronger statement that $\{H_i:i\in[k]\}\cup\{R_{i+1}:i\in[k]\}$ is a collection of pairwise edge-disjoint graphs. For each $i\in[k]$, by Lemma 10 and Lemma 11, $H_{Base}(i)\subseteq G_{Base}(i)$ and $R_{i+1}\subseteq H_{Miss}(M_i)\subseteq H_{Base}(i)\subseteq G_{Base}(i)$. Hence, both $H_i$ and $R_{i+1}$ are subgraphs of $G_{Base}(i)$.

Firstly, we claim that $H_i$ and $R_{i+1}$ are edge-disjoint. In particular, since $H_i$ is sampled from $\mathrm{supp}(\mathbb{H}_i)$ and $\Phi_i(H_i)=M_i$, by definition, we know that none of the edges in $H_i$ are in $H_{Miss}(M_i)$. However, $R_{i+1}\subseteq H_{Miss}(M_i)$, so none of the edges in $R_{i+1}$ are in $H_i$.

Next, we prove that $G_{Base}(i+1)$ is edge-disjoint with $H_i$ and $R_{i+1}$. Any edge in $H_i$ is in $H_{Base}(i)$ but not in $H_{Miss}(M_i)$. Hence, it is not in $G_{Base}(i)\setminus(H_{Base}(i)\setminus H_{Miss}(M_i))$. For $R_{i+1}$, by Lemma 11, none of its edges are incident to a vertex in $Q_i$. Thus, none of the edges in $R_{i+1}$ appear in $G_{Base}(i+1)=\big(G_{Base}(i)\setminus(H_{Base}(i)\setminus H_{Miss}(M_i))\big)[Q_i]$.

Finally, for any $j>i$, we know that $G_{Base}(j)\subseteq G_{Base}(i+1)$. Thus, for any edge $(u,v)$ in $H_i$ or $R_{i+1}$, we know that $(u,v)$ is not in $G_{Base}(j)$ (since $(u,v)\notin G_{Base}(i+1)$). Since $H_j$ and $R_{j+1}$ are subgraphs of $G_{Base}(j)$, $(u,v)$ will not be in $H_j$ or $R_{j+1}$ either.

Thus, an edge in $H_i$ or $R_{i+1}$ (it only appears in one of the two) will not be sent in any $H_j$ or $R_{j+1}$ for $j>i$. As such, $\{H_i:i\in[k]\}\cup\{R_{i+1}:i\in[k]\}$ is pairwise edge-disjoint. Since $R_1$ is empty, this collection contains all the graphs sent to any player. Hence, no two players will receive the same edge.

Next, we must also ensure that the maximum degree of the input graph agrees with the maximum degree given to each player.

Lemma 14.

For each vertex $v$ of $G_{input}$, $\deg(v)\le\Delta$.

Proof.

We consider the degree on each vertex $v$. Firstly, there is at most one value of $i$ for which $v$ is incident to some edge in $R_i$. After all, by Lemma 11, if an edge in $R_i$ is incident to $v$, then $v\notin Q_{i-1}$, and $v$ is not in $G_{Base}(i)$ nor any subsequent graphs. For this value of $i$, $\Delta(R_i)\le\frac{\Delta}{2}$. Furthermore, $\Delta(H_j)\le\frac{2\Delta}{\ell}$ for all $j\in[k]$ by Lemma 10. Thus, $\deg(v)$ cannot exceed $\frac{\Delta}{2}$ in $R_i$ nor $\frac{2\Delta}{\ell}$ in $H_j$ for any $j\in[k]$.

Since $G_{input}=\left(\bigcup_{j=1}^{k}H_j\right)\cup\left(\bigcup_{i=1}^{k}R_i\right)$, the degree of $v$ in $G_{input}$ is the sum of its degrees in each of these graphs. In particular, the degree of $v$ is at most $k\cdot\frac{2\Delta}{\ell}+\frac{\Delta}{2}\le\frac{4\ln(n)\Delta}{\ell}+\frac{\Delta}{2}\le\frac{\Delta}{2}+\frac{\Delta}{2}=\Delta$. The second last inequality is true by $k=\lceil\ln n\rceil+1$ and $n>\eta$. The last is true by $\ell\ge 8\ln n$.

Lemma 13 and Lemma 14 prove that the input graph is valid. The next step is to prove that the adversary terminates – for the sake of our analysis, we will instead show that the final base graph GBase(k) is small.

Lemma 15.

$n_k\le\frac{n}{\Delta^2}$.

Proof.

We begin by proving that $n_i\le\max\left(\frac{n}{\Delta^2},\frac{n}{e^{i-1}}\right)$, which we will show through induction.

For the base case, $n_1=n=\frac{n}{e^{0}}$.

For the inductive step, if $n_i\le\frac{n}{\Delta^2}$, then $n_{i+1}\le\frac{n}{\Delta^2}$ too. Otherwise, by Lemma 10,

$|Q_i|\le\frac{2\ln(2)(s+1)\left\lfloor\frac{n_i\Delta^2}{n}\right\rfloor}{\frac{\Delta}{\ell}\cdot\ell^{2}\Delta}.$

Dropping the floor, the above is at most $\frac{2\ln(2)(s+1)\cdot\frac{n_i\Delta^2}{n}}{\ell\Delta^{2}}=\frac{2\ln(2)(s+1)}{\ell n}\cdot n_i\le\frac{1}{e}\cdot n_i\le\frac{1}{e}\cdot\frac{n}{e^{i-1}}=\frac{n}{e^{i}}$.

The second last inequality is true by the bound on $\ell$, while the last is true by the inductive hypothesis. Note that $n_{i+1}=|Q_i|$. This concludes the inductive step.

To prove Lemma 15, we note that $\frac{n}{e^{k-1}}\le\frac{n}{e^{\ln n}}\le 1$. Since $\Delta<\sqrt{n}$, we have $\frac{n}{\Delta^2}>1\ge\frac{n}{e^{k-1}}$, so $\max\left(\frac{n}{\Delta^2},\frac{n}{e^{k-1}}\right)=\frac{n}{\Delta^2}$. We showed that $n_k\le\max\left(\frac{n}{\Delta^2},\frac{n}{e^{k-1}}\right)$, so $n_k\le\frac{n}{\Delta^2}$.

Having proven that the input graph is valid and the algorithm terminates, it remains to bound the largest clique at each iteration. Lemma 11 allows us to bound the largest clique in $H_{Miss}(\phi)[P]$ for all $P\in\mathcal{P}_i$. The following lemma bounds the size of $\mathcal{P}_i$, which will allow us to bound the size of the largest clique in the missing graph over the entirety of $\mathcal{P}_i$.

Lemma 16.

For all $i\in[k]$, the size of $\mathcal{P}_i$ is at most $\frac{3n}{\Delta^2}$.

Proof.

By Lemma 10, $|\mathcal{P}_i|\le\left\lceil\frac{n_i}{\lfloor n_i\Delta^2/n\rfloor}\right\rceil\le\left\lceil\frac{n_i}{n_i\Delta^2/(2n)}\right\rceil=\left\lceil\frac{2n}{\Delta^2}\right\rceil\le\frac{2n}{\Delta^2}+1\le\frac{3n}{\Delta^2}.$

The second inequality is true because $\frac{n_i\Delta^2}{n}\ge 1$, so $\lfloor\frac{n_i\Delta^2}{n}\rfloor\ge\frac{n_i\Delta^2}{2n}$. The last inequality is true by $\Delta<\sqrt{n}$.

Finally, we prove that Player k cannot conclusively find a large independent set by proving that the adversary can find a “breaking” graph if the output is ever too large.

Lemma 17.

Suppose Player $k$ outputs an independent set $A$ of size greater than

$\frac{n}{\Delta^2}+\frac{n}{\Delta^2}\cdot k\left(96\ell^{2}\ln(n)+30\right).$

Then, there is another graph $G_{input}$ and set of edges to send to each player such that each player outputs the same message, but an edge exists between some $u,v$ in the output of Player $k$.

Proof.

For each $v\in A$, either $v\in G_{Base}(k+1)$ or $v\in P$ for some $i\in[k]$ and $P\in\mathcal{P}_i$. We let $L_i:=\bigcup_{P\in\mathcal{P}_i}P$. Since $n_k\le\frac{n}{\Delta^2}$, more than $\frac{n\cdot k\left(96\ell^{2}\ln(n)+30\right)}{\Delta^2}$ vertices in $A$ are also in $L_1\cup\cdots\cup L_k$. By the Pigeonhole principle, for some $i\in[k]$, we have $|L_i\cap A|>\frac{n\left(96\ell^{2}\ln(n)+30\right)}{\Delta^2}$.

Since $\mathcal{P}_i$ is a partition of the vertices in $L_i$, and since $|\mathcal{P}_i|\le\frac{3n}{\Delta^2}$ by Lemma 16, there is some $P=(V_P,E_P)\in\mathcal{P}_i$ such that $|P\cap A|>32\ell^{2}\ln(n)+10$ (again, by the Pigeonhole principle).

By Lemma 10 and Lemma 11, the largest clique in $(H_{Miss}(M_i)\setminus R_{i+1})[P]$ has size at most

$\frac{16\ln(n)\,d_{filter}}{d_{removal}}+10=\frac{16\,\ell^{2}\Delta}{\Delta/2}\ln(n)+10=32\ell^{2}\ln(n)+10.$

Thus, there are some $u,v\in P\cap A$ such that $(u,v)\notin H_{Miss}(M_i)\setminus R_{i+1}$.

If $(u,v)\in R_{i+1}$, then we send the same graph and edges. Note that $R_{k+1}=\emptyset$ since $n_k\le\frac{n}{\Delta^2}$. Thus, Player $i+1$ exists and will receive $R_{i+1}$. As such, there is an edge between $u$ and $v$ in $G_{input}$.

Otherwise, we know there is some $j\in[i]$ s.t. $(u,v)\in H_{Base}(j)$ but $(u,v)\notin H_{Miss}(M_j)$. We prove this with two cases:

If $(u,v)\in G_{Base}(i)$, then let $j=i$. Since $H_{Base}(i)[P]=G_{Base}(i)[P]$, we have $(u,v)\in H_{Base}(i)$. Additionally, since $(u,v)\notin R_{i+1}$, we know that $(u,v)\notin H_{Miss}(M_j)$ as well.

Otherwise, $(u,v)\notin G_{Base}(i)$. Since $G_{Base}(i)\subseteq G_{Base}(1)$ and $(u,v)\in G_{Base}(1)$, we can find some $j<i$ s.t. $(u,v)\in G_{Base}(j)$ but $(u,v)\notin G_{Base}(j+1)$. In particular,

$(u,v)\notin G_{Base}(j+1)=\big(G_{Base}(j)\setminus(H_{Base}(j)\setminus H_{Miss}(M_j))\big)[Q_j].$

Since $u,v\in G_{Base}(i)$ and all vertices in $G_{Base}(i)$ are also in $Q_j$, we know that $u,v\in Q_j$. This allows us to conclude that

$(u,v)\notin G_{Base}(j)\setminus(H_{Base}(j)\setminus H_{Miss}(M_j)).$

We know that $(u,v)\in G_{Base}(j)$, so we must also have $(u,v)\in H_{Base}(j)\setminus H_{Miss}(M_j)$. Hence, $(u,v)\in H_{Base}(j)$ and $(u,v)\notin H_{Miss}(M_j)$.

Since $(u,v)\notin H_{Miss}(M_j)$ and $(u,v)\in H_{Base}(j)$, there is some graph $H\in\mathrm{supp}(\mathbb{H}_j)$ such that $(u,v)\in H$ and $\Phi_j(H)=M_j$. We substitute $H_j$ with $H$ when choosing $G_{input}$ and the edges we send each player. Each player sends the same messages, but $(u,v)\in G_{input}$, concluding the proof.

Lemma 17 leads naturally to a bound on the largest independent set a deterministic algorithm can find, which the following formalizes.

Lemma 18.

There does not exist a deterministic algorithm using at most $s$ bits of memory that outputs an independent set of size greater than $\frac{n}{\Delta^2}+\frac{n}{\Delta^2}\cdot k\left(96\ell^{2}\ln(n)+30\right)$ for all graphs with $n$ vertices and maximum degree $\Delta$.

Proof.

Suppose a deterministic algorithm exists. Then, by Lemma 3, there is a set of $k=\lceil\ln n\rceil+1$ players that can solve the Independent-Set$(n,\Delta,k,s)$ problem with the same independent set size. However, by Lemma 17, this is impossible. After all, if Player $k$ outputs an independent set of size greater than $\frac{n}{\Delta^2}+\frac{n}{\Delta^2}\cdot k\left(96\ell^{2}\ln(n)+30\right)$, then we can find another set of edges to send such that each player sends the same message but the independent set is no longer valid.

To conclude, we substitute $\ell=\max\left\{\frac{2e\ln(2)(s+1)}{n},\,8\ln n\right\}$ and $k=\lceil\ln n\rceil+1$ into the bounds we showed in Lemma 18.

Proof of Theorem 12.

This proof is simply a matter of loosening the bound in Lemma 18 until we get a “simpler” form. We will start with the choices of $k$ and $\ell$. For $k$, we have

$k=\lceil\ln n\rceil+1\le 2\ln n+1\le 3\ln n.$

For the first possible value of $\ell$, we have

$\frac{2e\ln(2)(s+1)}{n}\le\frac{4e\ln(2)\,s}{n}\le\frac{8e\ln(2)\,s}{n}.$

For the other potential value of $\ell$, we have

$8\ln n\le 16\ln n.$

Hence, $\ell\le\max\left\{\frac{8e\ln 2\cdot s}{n},\,16\ln n\right\}$. Since both terms are positive and greater than $1$, we can multiply both upper bounds to get

$\ell\le\frac{8e\ln 2\cdot s}{n}\cdot 16\ln n=\frac{128e\ln 2\cdot s\ln n}{n}.$

To simplify calculations, we will let $r=128e\ln(2)$, which is a constant.

By Lemma 18, no deterministic algorithm can find an independent set of size greater than

$\frac{n}{\Delta^2}+\frac{n}{\Delta^2}\cdot k\left(96\ell^{2}\ln(n)+30\right) \le\frac{n}{\Delta^2}\left(1+3\ln(n)\left[96\left(\frac{rs\ln n}{n}\right)^{2}\ln(n)+30\right]\right)$
$\le\frac{n}{\Delta^2}\cdot 2\cdot 6(\ln(n))^{2}\cdot 96\left(\frac{rs\ln n}{n}\right)^{2}$
$=\frac{1152\,r^{2}\ln^{4}(n)\,s^{2}}{n\Delta^{2}}$
$=\widetilde{O}\!\left(\frac{s}{\Delta^{2}}\cdot\frac{s}{n}\right).$

Hence, the size of the largest independent set that a deterministic algorithm can find for all graphs with $n$ vertices and maximum degree $\Delta$ does not exceed $\widetilde{O}\!\left(\frac{s}{\Delta^{2}}\cdot\frac{s}{n}\right)$.
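Purely as an illustration (not part of the proof), the closed-form bound in the last display can be evaluated numerically; doing so also makes clear that the hidden polylogarithmic factor is substantial:

```python
import math

def independent_set_upper_bound(n, delta, s):
    """Evaluate the closed form 1152 * r^2 * ln(n)^4 * s^2 / (n * delta^2)
    from the last display above, with r = 128 * e * ln(2)."""
    r = 128 * math.e * math.log(2)
    return 1152 * (r ** 2) * (math.log(n) ** 4) * (s ** 2) / (n * delta ** 2)

# With s = c * n (the semi-streaming regime), the bound exceeds n / delta^2 by
# exactly the factor 1152 * r^2 * c^2 * ln(n)^4 -- large, but only polylog in n.
n, delta, c = 10 ** 9, 10 ** 3, 4
print(independent_set_upper_bound(n, delta, s=c * n) / (n / delta ** 2))
```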

References

  • [1] Kook Jin Ahn, Sudipto Guha, and Andrew McGregor. Analyzing graph structure via linear measurements. In Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms, pages 459–467. SIAM, 2012. doi:10.1137/1.9781611973099.40.
  • [2] KookJin Ahn, Graham Cormode, Sudipto Guha, Andrew McGregor, and Anthony Wirth. Correlation clustering in data streams. In International Conference on Machine Learning, pages 2237–2246. PMLR, 2015.
  • [3] Sepehr Assadi, Andrew Chen, and Glenn Sun. Deterministic graph coloring in the streaming model. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, pages 261–274, 2022. doi:10.1145/3519935.3520016.
  • [4] Sepehr Assadi, Yu Chen, and Sanjeev Khanna. Sublinear algorithms for (Δ+1) vertex coloring. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 767–786. SIAM, 2019. doi:10.1137/1.9781611975482.48.
  • [5] Sepehr Assadi, Christian Konrad, Kheeran K Naidu, and Janani Sundaresan. O(log log n) passes is optimal for semi-streaming maximal independent set. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing, pages 847–858, 2024. doi:10.1145/3618260.3649763.
  • [6] Ainesh Bakshi, Nadiia Chepurko, and David P. Woodruff. Weighted maximum independent set of geometric objects in turnstile streams. In Jaroslaw Byrka and Raghu Meka, editors, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2020, August 17-19, 2020, Virtual Conference, volume 176 of LIPIcs, pages 64:1–64:22. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPICS.APPROX/RANDOM.2020.64.
  • [7] Sujoy Bhore, Fabian Klute, and Jelle J Oostveen. On streaming algorithms for geometric independent set and clique. In International Workshop on Approximation and Online Algorithms, pages 211–224. Springer, 2022. doi:10.1007/978-3-031-18367-6_11.
  • [8] Vladimir Braverman, Rafail Ostrovsky, and Dan Vilenchik. How hard is counting triangles in the streaming model? In Automata, Languages, and Programming: 40th International Colloquium, ICALP 2013, Riga, Latvia, July 8-12, 2013, Proceedings, Part I 40, pages 244–254. Springer, 2013. doi:10.1007/978-3-642-39206-1_21.
  • [9] Xiuge Chen, Rajesh Chitnis, Patrick Eades, and Anthony Wirth. Sublinear-space streaming algorithms for estimating graph parameters on sparse graphs. In Algorithms and Data Structures Symposium, pages 247–261. Springer, 2023. doi:10.1007/978-3-031-38906-1_17.
  • [10] Graham Cormode, Jacques Dark, and Christian Konrad. Independent set size approximation in graph streams. arXiv preprint arXiv:1702.08299, 2017. arXiv:1702.08299.
  • [11] Graham Cormode, Jacques Dark, and Christian Konrad. Approximating the caro-wei bound for independent sets in graph streams. In International Symposium on Combinatorial Optimization, pages 101–114. Springer, 2018. doi:10.1007/978-3-319-96151-4_9.
  • [12] Graham Cormode, Jacques Dark, and Christian Konrad. Independent sets in vertex-arrival streams. In Christel Baier, Ioannis Chatzigiannakis, Paola Flocchini, and Stefano Leonardi, editors, 46th International Colloquium on Automata, Languages, and Programming, ICALP 2019, July 9-12, 2019, Patras, Greece, volume 132 of LIPIcs, pages 45:1–45:14. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2019. doi:10.4230/LIPICS.ICALP.2019.45.
  • [13] Devdatt P Dubhashi and Alessandro Panconesi. Concentration of measure for the analysis of randomized algorithms. Cambridge University Press, 2009.
  • [14] Joan Feigenbaum, Sampath Kannan, Andrew McGregor, Siddharth Suri, and Jian Zhang. On graph problems in a semi-streaming model. Theoretical Computer Science, 348(2-3):207–216, 2005. doi:10.1016/J.TCS.2005.09.013.
  • [15] Bjarni V Halldórsson, Magnús M Halldórsson, Elena Losievskaja, and Mario Szegedy. Streaming algorithms for independent sets. In Automata, Languages and Programming: 37th International Colloquium, ICALP 2010, Bordeaux, France, July 6-10, 2010, Proceedings, Part I 37, pages 641–652. Springer, 2010. doi:10.1007/978-3-642-14165-2_54.
  • [16] Bjarni V Halldórsson, Magnús M Halldórsson, Elena Losievskaja, and Mario Szegedy. Streaming algorithms for independent sets in sparse hypergraphs. Algorithmica, 76:490–501, 2016. doi:10.1007/S00453-015-0051-5.
  • [17] Magnús M Halldórsson, Xiaoming Sun, Mario Szegedy, and Chengu Wang. Streaming and communication complexity of clique approximation. In International Colloquium on Automata, Languages, and Programming, pages 449–460. Springer, 2012. doi:10.1007/978-3-642-31594-7_38.
  • [18] Richard M Karp. Reducibility among combinatorial problems. Springer, 2010. doi:10.1007/978-3-540-68279-0_8.
  • [19] Ami Paz and Gregory Schwartzman. A (2+ε)-approximation for maximum weight matching in the semi-streaming model. ACM Transactions on Algorithms (TALG), 15(2):1–15, 2018. doi:10.1145/3274668.
  • [20] David Zuckerman. Linear degree extractors and the inapproximability of max clique and chromatic number. In Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, pages 681–690, 2006. doi:10.1145/1132516.1132612.