Abstract 1 Introduction 2 Preliminaries 3 The Triangles Approximation Algorithm 4 A Lower Bound for Testable Triangle Counting Algorithms 5 Testable Streaming Algorithm for Counting Triangles 6 Approximating the Number of Edges References

Testable Algorithms for Approximately Counting Edges and Triangles in Sublinear Time and Space

Talya Eden ORCID Bar-Ilan University, Ramat Gan, Israel Ronitt Rubinfeld Massachusetts Institute of Technology, Cambridge, MA, USA Arsen Vasilyan ORCID University of Texas at Austin, TX, USA
Abstract

We consider the fundamental problems of approximately counting the numbers of edges and triangles in a graph in sublinear time. Previous algorithms for these tasks are significantly more efficient under a promise that the arboricity of the graph is bounded by some parameter α¯. However, when this promise is violated, the estimates given by these algorithms are no longer guaranteed to be correct.

For the triangle counting task, we give an algorithm that requires no promise on the input graph G, and computes a (1±ϵ)-approximation for the number of triangles t in G in time O(mα(G)t+mt2/3), where α(G) is the arboricity of the graph. The algorithm can be used on any graph G (no prior knowledge of the arboricity α(G) is required), and the algorithm adapts its run-time on the fly based on the graph G.

We accomplish this by trying a sequence of candidate values α~ for α(G) and using a novel algorithm in the framework of testable algorithms. This ensures that wrong candidates α~ cannot lead to wrong estimates: if the advice is incorrect, the algorithm either succeeds despite this or detects this and continues with a new candidate. Once the algorithm accepts the candidate, its output is guaranteed to be correct with high probability.

We prove that this approach preserves – up to an additive overhead – the dramatic efficiency gains obtainable when good arboricity bounds are known in advance, while ensuring robustness against misleading advice. We further complement this result with a lower bound, showing that such an overhead is unavoidable whenever the advice may be faulty.

We further demonstrate implications of our results for triangle counting in the streaming model.

Keywords and phrases:
Sublinear Algorithms, Triangle Counting, Edge Counting, Arboricity
Funding:
Talya Eden: Talya gratefully acknowledges the support of the MIT International Science and Technology Initiatives (MISTI) and the MIT-Israel Zuckerman STEM Fund. Part of this work was conducted while the authors were visiting the Simons Institute for the Theory of Computing.
Ronitt Rubinfeld: Supported by NSF awards DMS-2022448 and CCF-2310818, and the MIT-Israel Zuckerman STEM Fund. Part of this work was conducted while the authors were visiting the Simons Institute for the Theory of Computing.
Arsen Vasilyan: Arsen gratefully acknowledges the support of the MIT International Science and Technology Initiatives (MISTI) and the MIT-Israel Zuckerman STEM Fund. Part of this work was conducted while the authors were visiting the Simons Institute for the Theory of Computing.
Copyright and License:
[Uncaptioned image] © Talya Eden, Ronitt Rubinfeld, and Arsen Vasilyan; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Streaming, sublinear and near linear time algorithms
Related Version:
Full Version: https://arxiv.org/abs/2509.20351 [30]
Acknowledgements:
We thank Sepehr Assadi for discussions that influenced this work.
Editor:
Shubhangi Saraf

1 Introduction

Counting triangles is a fundamental graph problem with applications in network analysis, database systems, and social network theory, among others (see [1] for a comprehensive survey). A multitude of algorithms for counting triangles have been developed which work well in sequential [53, 18, 40], streaming [42, 49, 17], sublinear-time [23, 29, 12, 25, 27, 3, 31], parallel and distributed models [11, 33, 15, 19, 54, 46, 55], both from theoretical and practical perspectives. The best known sublinear-time algorithm for estimating the number of triangles in general graphs in the augmented model111In the sublinear-time augmented query model, the algorithm can perform degree queries, neighbor queries and pair queries, as well as query for a uniformly distributed edge in the graph. See Section 2.2 for full details. runs in O(m3/2t) time [23], where m and t denote the number of edges and triangles in the input graph G, respectively. By the same result, this algorithm is optimal due to query complexity lower bounds.222The result of [23] is stated in the general query model, that only allows for degree, neighbor and pair queries, without uniform edge queries. Their complexity in this model is Θ(nt1/3+m3/2t). However, it directly follows from their work that in the augmented model the complexity is Θ(m3/2t) (that is, only the second term is needed for the upper bound, and the lower bound proof of the second term still holds for the stronger augmented model).

Given the importance of the triangle counting problem, there is much interest in even faster algorithms for typical real world graphs. These graphs often have structural properties that make it possible to perform faster computations. An important example of such a property is the graph’s arboricity, a sparsity parameter that corresponds (up to a constant factor of 2) to the maximum average degree over all subgraphs. The arboricity is typically small in many real-world networks [34, 41, 58, 20], including planar graphs and preferential attachment networks. When an upper bound on the arboricity is known, there are fast sublinear-time triangle-counting algorithms [28, 10] that circumvent the known lower bounds for general graphs.333Here, the work of [28] is in the general query model, and the work of [10] is augmented model.

The aforementioned arboricity-based algorithms require an upper bound α¯ on the actual arboricity of the graph, denoted α(G), as one of their input parameters. Those algorithms rely crucially on the correctness of this assumption. If the arboricity of the graph exceeds α¯, these algorithms are no longer guaranteed to give correct estimates for the number of triangles.

1.1 Our Contribution

In this work, we develop a novel algorithm in the augmented query access model that does not require any information about the arboricity of the input graph. For all graphs G, our algorithm gives a correct (1±ϵ)-approximation to the number of triangles in G, and has a run-time that scales with the arboricity of the input graph. Specifically, the run-time of our algorithm is O(mα(G)t+mt2/3).444We use the O() notation to hide poly(1/ϵ,logn) dependencies.

A naive approach would be to combine the sublinear-time arboricity estimator of [26] with the triangle counting results of [28, 10]. However, [26] runs in essentially optimal time Θ(n/α(G)), and so invoking it would dominate the complexity and substantially degrade the overall runtime, reflecting the inherently high query cost of arboricity estimation.

We avoid estimating arboricity directly, by first designing an algorithm in the formal framework of testable algorithms, introduced by [56] in the context of agnostic learning. A testable algorithm receives advice about the input that may or may not be accurate. The algorithm is allowed to either output an estimate or terminate early indicating “bad advice”. It must satisfy the following two properties: (1) Completeness: If the advice is correct, the algorithm outputs a correct estimate (with high probability). (2) Soundness: If the algorithm outputs an estimate (i.e., does not return “bad advice”), then that estimate is correct (with high probability), even if the advice was incorrect. This framework captures natural scenarios in which domain knowledge provides partial information about the input class, but the reliability of this information cannot be guaranteed.

We note that previous work on testable algorithms focused on settings in learning theory, and our central conceptual contribution is extending this framework to approximate counting and estimation tasks. Specifically, we design efficient testable algorithms for estimating the number of edges and triangles in a graph, given a potentially unreliable upper bound α~ on its arboricity as advice.555Throughout the paper we use α(G) to denote the actual arboricity of the graph G, α¯ to denote a guaranteed upper bound on the arboricity, and α~ to denote an unreliable upper bound. Our algorithms return a (1±ε)-approximation with high probability, provided it does not output “bad advice”. Moreover, if the given advice is accurate, then the algorithm is not likely to output “bad advice”.

Theorem 1 (Testable approximate triangle counting algorithm).

Let G be the input graph, and ϵand δ approximation and error parameters. Let m and t be the number of edges and triangles in G, respectively. There exists an algorithm that gets as input augmented query access to G, the number666We remark that the requirement on knowing m can be removed at an additive (essential) overhead of O(nα(G)m), as implied by Corollary 25 below or O(n1/4) due to [7]. However, for the sake of clarity, we chose to present the two results separately and focus here on the triangle estimation problem, as this is the main focus of our work. of edges m, as well as an advice: a positive integer α~. The algorithm has expected run-time and query complexity of O((mα~t+mt2/3)log1δ). Upon each invocation, the algorithm either returns “bad advice” or a number t^, satisfying

  • (Completeness) For every G, if α(G)α~, then the algorithm can output “bad advice” only with probability at most δ.

  • (Soundness) For every G and α~, w.p. at least 1δ the algorithm will either return “bad advice” or output t^(1±ϵ)t.

In order to eliminate the need for advice, we use this algorithm with a sequence of guesses serving as our advice value α~. We start with running the algorithm with a setting of α~=1, and continue invoking the algorithm while doubling the guessed value of α~ until the algorithm outputs an answer (rather than “bad advice”). By Soundness, the first accepting run outputs a correct estimate with high probability, and by Completeness, the number of guesses is bounded, so the overhead is only logarithmic. Consequently, we obtain an algorithm applicable to any graph G.

Corollary 2 (Instance-adaptive approximate triangle-counting algorithm.).

There exists an algorithm with the following guarantees. The algorithm receives augmented query access to a graph G, and the number of edges m in G together with parameters ϵ,δ(0,1). With probability at least 1δ, the algorithm produces an estimate t^ that with probability at least 1δ satisfies t^(1±ϵ)t, where t is the number of triangles in G. The expected run-time of the algorithm is O((mα(G)t+mt2/3)log(1δ)), where α(G) is the arboricity of G.

We refer to this algorithm as an instance-adaptive algorithm since it adapts “on the fly” to the structure of G, scaling with α(G) without requiring any a priori bound on it.

Optimality of the bound

The query complexity of our testable algorithm for approximating the number of triangles lies between two extremes: the known lower bound in the general setting (with no advice), and the more efficient algorithms achievable under a reliable arboricity promise. We prove that this tradeoff is inherent: any algorithm that uses significantly fewer queries may produce an underestimate of the triangle count when the advice is incorrect, thus establishing the optimality of our approach. The lower bound also establishes a strict separation between the instance-adaptive estimation setting we consider here, and the trusted-advice setting of [28]. Formally, we give an information-theoretic lower-bound showing that Ω(mt2/3) queries are necessary, which together with the Ω(mα(G)t) lower bound of [28] demonstrates the optimality of our algorithm.

Theorem 3.

Let 𝒜 be a testable algorithm for approximating the number of triangles in a graph G given arboricity advice α~ and augmented query access to G. Let n,m and t denote the number of nodes, edges and triangles in G. Finally, assume that the algorithm knows n and m, and that α~m/n. Then 𝒜 must perform Ω(mt2/3+mα(G)t) queries in order to succeed with high probability.

On the choice of the query model

As mentioned previously, the triangle-counting algorithm of [28] operates in the strictly less powerful general query model, which does not provide uniform edge samples. Under a promised arboricity bound α¯, their complexity is Θ(nα¯2t+mα¯t), in contrast to the lower bound of Ω(nt1/3+m3/2t) for general graphs [23]. If we adopt the same (general) query model in our setting, but without any bound on the arboricity, then a lower bound of Ω(n/t1/3) follows immediately by hiding a clique of size Θ(t1/3). Thus, in order to achieve a meaningful arboricity-dependent runtime we assume access to uniform edge samples, as is given by the augmented model.777For a more concrete sense on these bounds, consider a grid graph where each square is augmented with one diagonal edge. This graph has n vertices, m=Θ(n) edges, t=Θ(n) triangles, and arboricity O(1). In the general query model and assuming a bound on the arboricity, the result of [28] yields an upper bound of O(nα2t+mαt)=O(1) queries. Without any knowledge of the arboricity and still in the general model, any algorithm requires Ω(n/t1/3)=Ω(n2/3) queries. By contrast, allowing uniform edge samples (but without assuming any bound on the arboricity) enables our O(mt2/3+mα(G)t)=O(n1/3) algorithm.

Further results

Using a known reduction from sublinear-time algorithms in the augmented model to streaming algorithms in the edge arrival model (see, e.g., [26, 32]), we prove the following.

Theorem 4.

There exists a 9-pass O(mα~log1/δt+mlog1/δt2/3)-space algorithm for testable triangle-counting with arboricity advice (Definition 6) in the arbitrary-order insert-only edge arrival streaming model. As is standard and unavoidable in the streaming model, the algorithm requires a rough initial estimate t~=Θ(t) of the triangle-count t.888In our implementation we assume t~[t/4,t], but the same argument can be adapted to work with t~[1ct,ct] for any given known constant c.

In the guaranteed advice setting, Bera and Seshadhri [6] gave an arboricity-dependent 6-pass O(mα¯/t)-space streaming algorithm for triangle counting.

We note that the instance-adaptive version of our algorithm can also be implemented in the streaming model, however with a O(logn) round complexity, due to the need to search for the correct advice α¯. For further discussion see Section 5. We also note that lower bounds do not directly transfer between the two models, and so it remains open as to whether the space bounds and round complexity of Theorem 1 are optimal.

Finally, we state our result for edge estimation in the testable setting.

Theorem 5.

There exists a testable algorithm in the augmented query model (Definition 6) for (1±ϵ)-approximate counting of edges on a graph G with n vertices and m edges, that has an expected run-time of O(nα¯log1/δm).

Interestingly, this result does not incur any additive overhead over the O(nα¯m) results of [27]. It does, however, require the uniform edge samples of the stronger augmented query access model, as opposed to the algorithm of [27], which works in the general model. As in the case of triangles, this stronger query access is necessary if one wants to obtain an arboricity-dependent algorithm in the testable setting, as otherwise the presence of a small (hard to find) clique of size m can fool any edge estimation algorithm which uses at most o(n/m) vertex samples. For small-arboricity graphs, the resulting algorithm is much faster than the state-of-the-art edge-counting algorithm for general graphs in the augmented model [7]. In particular, for constant arboricity graphs, the running time of our algorithm is O(1), whereas the algorithm of [7], that does not assume any knowledge on the arboricity, runs in time O(n1/4), which is essentially optimal for general graphs.

As in the case of triangle counting, an easy consequence of our testable algorithms is an instance-adaptive algorithm for approximating the number of edges (using a similar doubling trick). Formally, the resulting instance-adaptive approximate edge counting algorithm runs in time O(nα(G)m).

Our results are summarized below.

Table 1: Summary of our results, compared to the state of the art in the same model for general graphs as well as bounded arboricity graphs. Our results can be seen as lying in between these two “extremes”. Here (and elsewhere) α(G) denotes the actual arboricity of the graph, α¯ a guaranteed upper bound on α(G), and α~ a potentially-incorrect advice on α(G).
Computational model General Graphs Guaranteed Bound on Arboricity Our Results (Testable Algorithms)
Triangles sublinear-time, augmented Θ(m3/2t) O(mα¯t) Θ(mα(G)t+mt2/3)
streaming, edge arrival Θ(min{m3/2t,mt}) O(mα~/t+mt2/3)
Edges augmented model Θ(n1/4) O(nα¯m) O(nα(G)m)

1.2 Overview of the algorithms and lower bounds for triangle estimation

1.2.1 Testable triangles algorithm

The ERS algorithm.

Our algorithm builds on ideas from the earlier by [28], henceforth ERS, and we begin with a brief overview of how ERS works. The ERS algorithm estimates the number of triangles by classifying edges as light or heavy and assigning each triangle to its first light edge, leaving only all-heavy triangles unassigned. It then samples a set R of uniformly random edges and attempts to find triangles incident to edges in R by sampling random neighboring edges. Whenever a triangle is found, the algorithm checks whether it is assigned to the sampled edge and uses these events to estimate the total number of assigned triangles.

ERS classifies an edge as heavy when it participates in more than τt triangles. This assignment rule yields a low-variance, nearly unbiased estimator. Consequently, sampling a set R of O(mτt/t) edges uniformly at random ensures that, with high probability, the number of triangles assigned to some edge in R is close to its expected value. To estimate this quantity, the algorithm repeatedly tries to sample triangles incident to edges of R, and whenever a triangle incident to some edge eR is found, it checks whether the triangle is indeed assigned to e.

ERS prove that estimating the number of triangles assigned to R requires only O(d(E)/t) triangle-sampling attempts in expectation, where d(E)=defeEd(e), and d(e) for e={u,v} is defined as min{d(u),d(v)}. Once a triangle is found, verifying whether it is assigned to an edge e can be done in O(d(e)/τt) queries. To bound this term, they define a degree-heaviness threshold τd beyond which edges are also classified as heavy (independent of their triangle count) and then ignored. Hence, the cost of an assignment verification step is O(τd/τt).

The arboricity guarantee then allows them to bound each of these three terms by O(mα¯/t), through the following arguments:

  1. (i)

    Triangle-heaviness: In bounded-arboricity graphs, edges with triangle load above τt=α¯/ϵ can be ignored, since they can only contribute a small fraction of triangles. This determines the sample size of R.

  2. (ii)

    Sum of edge degrees: In bounded-arboricity graphs, the sum of edge degrees satisfies d(E)=eEd(e)=O(mα¯), implying that the number of triangle sampling attempts is bounded.

  3. (iii)

    Degree-heaviness: In bounded-arboricity graphs, only few edges exceed τd=mα¯2/t, so their contribution to the total triangles count is small and assignment verification remains efficient.

Plugging in each of these bounds in their respective terms gives the O(mα¯/t) running time.

The testable algorithm.

Our testable algorithm uses the three properties above to verify the advice α~. If these properties hold, then the correctness does not require that the actual arboricity be small; rather, these properties suffice to guarantee the quality of the estimate.

However, since the advice is not guaranteed, setting the thresholds τt and τd as in ERS no longer ensures that the number of unassigned heavy triangles is small. We must therefore modify the thresholds and explicitly check that the number of heavy edges remains bounded. In ERS, the set of heavy edges H satisfies |H|=O(ϵt/α¯), and arboricity is used to argue that H induces at most O(α¯|H|)=O(ϵt) triangles. Without the arboricity guarantee, we instead rely on the general bound that any m-edge graph contains at most O(m3/2) triangles. Hence we would like to make sure that |H|=O((ϵt)2/3). Accordingly, we set

γ=cmax{α~,t1/3},τt=γ/ϵ,andτd=mγ2/t,

for some small constant c. We show that these settings ensure that if the advice is correct then |H| will be small. Therefore, we can estimate the size of H, and if |H|>(ϵt)2/3 we can safely output “bad advice”. Otherwise, if |H| does not exceed the bound (so we do not reject), then it is small enough that – even in high arboricity graphs – it cannot induce too many heavy triangles, thereby preserving Properties (i) and (iii). For Property (ii), once the set R is sampled, we explicitly verify that d(R)=defeRd(e)=O(α~|R|); if this fails, we reject the advice. These modifications yield the runtime bound

O(mτt/t+mα~/t+τd/τt)=O(m/t2/3+mα~/t).

1.2.2 The 𝛀(𝒎/𝒕𝟐/𝟑) lower bound

We prove a lower bound of Ω(m/t2/3+mα(G)/t) for any testable triangle-counting algorithm in the augmented model. The Ω(mα(G)/t) term follows from the no-advice lower bound of [28], and it remains to prove the first term, which is dominant whenever α~<t1/3.

Consider a testable algorithm invoked with advice α~<t1/3. The lower bound is shown by constructing two contrasting families of graphs.

In the first family, every graph has arboricity α~ and contains no triangles. The second family is obtained by taking graphs from the first family and augmenting them with a clique of size about t1/3. (In the formal proof, the construction is refined to ensure that graphs in both families have exactly the same numbers of vertices and edges.) Consequently, graphs in the second family contain Θ(t) triangles and have arboricity Θ(t1/3), and so the advice is incorrect.

By definition, any testable algorithm must, with high probability, output an estimate t^=0 on graphs from the first family, and either “bad advice” or an estimate t^(1±ϵ)t on graphs from the second family. Hence, the algorithm must be able to distinguish between the two families. This requires it to hit the planted clique with high probability, which in turn forces it to perform Ω(m/t2/3) samples.

1.3 Additional related work

Testable learning algorithms

Testable learning is a framework introduced by Rubinfeld and Vasilyan [56] to allow a user to safely use a learning algorithm that is designed for specific distributions on labelled examples (e.g., uniform). The difficulty is that ascertaining whether examples come from the assumed distribution requires way too many samples. The new framework suggests the design of a tester and learner as a pair, where the tester efficiently tests a weaker property of the distribution, chosen such that the weaker property is both easier to test and sufficient for proving that the algorithm’s output is correct. Their work, and several followups [45, 36, 59, 37, 38, 21, 22] have demonstrated such tester-learner pairs for a number of learning tasks. The work of [35] extends the testable learning framework to handle assumptions on the label noise in classification. A recent line of work on Testable Learning with Distribution Shift (TDS learning) [45, 44, 16, 36] tests the assumption that the data encountered by a classifier during deployment comes from the same distribution encountered during training.

In concurrent work, Marcussen, Rubinfeld and Sudan introduce an analogue of testable algorithms for the setting of random graphs [47], referred to as “quality control problems”. As in this work, the algorithms in [47] have the goal of reliably approximating the number of motifs in input graphs. However, the requirements of the testable algorithm differ in our setting and theirs, particularly with respect to the completeness guarantees of the algorithms. Completeness in their setting corresponds to algorithmic correctness with high probability over random graphs, while in our setting the algorithm must be correct with high probability on any low-arboricity input graph. Our work also uses very different techniques because low-arboricity graphs have disparate structural properties from random graphs.

Testing and estimating the arboricity in sublinear-time and space

The task of estimating the arboricity was studied in a series of works, both in the sublinear-time and the streaming settings [9, 48, 43, 5, 9]. Eden, Mossel and Ron [26] improved the running times and space complexity to O(n/α¯) at the cost of increasing the approximation parameter to O(log2n) and the round complexity to logarithmic. Eden, Levi, and Ron [24] presented an O(1/ϵ)O(log(1/ϵ))-time algorithm for testing whether a graph has bounded arboricity in the augmented model. However, being ϵ-close to a bounded-arboricity graph is not sufficient for applying ERS: even a small number of additional edges can still generate many triangles.

Sublinear-time and space subgraph counting algorithms

Sublinear time subgraph counting results have been extensively studied both in various query models [39, 23, 29, 25, 3, 31, 12, 2, 27, 10]. Estimating the number of edges in sublinear time, when given access to uniform edge samples as well as degree and neighbor queries, follows (implicitly) from the work of Motwani, Panigrahy and Xu [50] and from the work of Beretta and Tětek [8]. Both these results give an essentially optimal bound of Θ(n1/3). When pair queries are also allowed, implying full augmented query access, the bound was recently improved to Θ(n1/4) by Beretta, Chakrabarty and Seshadhri [7].

Sublinear arboricity-dependent bounds were studied in [27, 28, 4, 10, 6, 32].

Triangle counting with predictions

Algorithms with predictions, like testable algorithms, use auxiliary information to overcome worst-case lower bounds. However, the auxiliary information is often more informative (as it models advice given by a machine learning algorithm). In addition, such algorithms must always produce an output, even when the prediction is of low quality. The goal is that the performance improves with prediction accuracy, and, ideally, never falls below the baseline guarantee. Triangle counting in the streaming model using predictions was studied in several papers using various types of predictions [17, 13]. Common to all is that they require per-edge advice, rather than global information.

1.4 Organization

In Section 2, we formally define the notion of a testable algorithm, and the augmented query access model, and provide some useful notations and claims. In Section 3, we provide our sublinear time and space algorithms and lower bound for triangle counting, with some of the proofs deferred to Appendix B in the full version of this work [30]. The sublinear-time lower bound is proven in Section 4. The streaming result is discussed in Section 5. Finally, in Section 6, we provide the algorithm for edge counting.

2 Preliminaries

2.1 Testable algorithm definition

The following is our definition of testable algorithms for estimating the number of motifs with arboricity advice. The problems of estimating the number of edges and the number of triangles are a special case.

Definition 6 (Testable algorithm for motif counting with arboricity advice).

Let H be a fixed subgraph, and let t denote the number of copies of H in an input graph G. A testable arboricity advice algorithm for counting the number of copies of H in G (in the augmented query model) is an algorithm that receives: augmented query access to G, an approximation parameter ϵ, and a positive integer α~, which serves as a (possibly unreliable) upper bound on the arboricity of G.

The algorithm outputs either an estimate t^ or the comment “bad advice”, as follows:

  • Completeness: If α(G)α~, then whp 𝒜 outputs an estimate t^ (and does not output “bad advice”).

  • Soundness: For every graph G and any α~, if 𝒜 outputs a value t^, then whp t^(1±ϵ)t.

2.2 The augmented query model

In the augmented query model the algorithm interacts with an unknown simple graph G=([n],E) through the following oracles: Vertex access. Vertices are labeled [n], so the algorithm can specify any v[n] (equivalently, it can draw a uniform random vertex by choosing v uniformly from [n]). Degree query. On input v, return the degree of v, d(v). 𝒊th-neighbor query. On input (v,i), return the ith neighbor of v if 1id(v) (with respect to a fixed but arbitrary ordering of v’s incident edges), and a distinguished null symbol otherwise. Pair query. On input (u,v), return whether {u,v}E. Uniform edge sample. Return an edge drawn uniformly at random from E, independently at each call.

If no edge sample access is allowed, then this is referred to as the general access model.

2.3 Notations and useful inequalities

Definition 7 (Arboricity of a graph).

The arboricity of a graph G is the minimum number of forests required to cover its edge set. By [51, 52], it is also equivalent to

α(G)=maxSGm(S)n(S)1,

where n(S) and m(S) denote the number of vertices and edges, respectively, in the subgraph induced by the set S.

Theorem 8.

For any graph G, α(G)m and mnα(G).

The following theorem of Chiba and Nishizeki bounds the number of triangles by the arboricity.

Theorem 9 (Due to [18]).

Let G be a graph with arboricity at most α(G). Then

t(G)eEd(e)2mα(G)2m3/2.
Definition 10 (Degree, neighbors and triangles of an edge).

The degree of an edge e={u,v} is the minimum degree of its endpoints, d(e)=d({u,v})=min{d(u),d(v)}. The set of neighbors of an edge, is the set of vertices that are neighbors of its min degree endpoint.

Also let t(e) denote the number of triangles incident to the edge e.

Chernoff inequality and a useful corollary are deferred to Appendix A in the full version of this work [30].

3 The Triangles Approximation Algorithm

This section is devoted to proving the main theorem of the paper, a testable triangle counting algorithm with arboricity advice (Definition 6). We start with an algorithm that takes as an input a value t~ that serves as a rough guess of the true triangle count t. In Section B.3 of the full version of this work [30] we show how to use Theorem 16 together with the Search Theorem of [29] to search for a good guess t~ and obtain a testable triangle counting algorithm that does not require an input of t~.

Before presenting the algorithm, we provide some useful notations and structural claims.

3.1 Preliminaries for triangle counting

The algorithm distinguishes those edges which either have high degree or too many triangles incident to them:

Definition 11 (Heavy and light edges).

For a given advice α~, and a guessed value t~, let γ=max{α~,t~1/3}. Further let τd=8mγ2ϵt~ be a degree threshold, and τt=12γ/ϵ be a triangles-degree threshold. Let Hd denote the set of edges for which d(e)>τd, and Ht the set of edges for which t(e)>τt. We say that an edge e is degree- (triangles-) heavy, if either eHd (eHt). Otherwise, we say it is degree (triangles) light. Finally, we let H=HdHt and refer to it as the set of heavy edges, and to EH as the set of light edges.

The algorithm assigns each triangle to one of its edges as follows:

Definition 12 (Assigning triangles to edges).

Let P=(E0,E1) be a partition of the graphs’ edges. We assign each triangle to its first edge in E0 if such exists (according to some arbitrary predefined order). Otherwise, if all of its edges are in E1, then the triangle is not assigned to any edge. For an edge e, we let aP(e) denote the number of triangles assigned to e by partition P.

The algorithm partitions edges into two groups E0,E1. E0 contains all light edges. E1 contains all edges that are degree-heavy and all edges that are “extremely” triangles-heavy – that is, with a higher threshold than required to just be triangles-heavy. Edges that are degree-light, and triangles-heavy with τtt(e)2τt can be in either E0 or E1.

Definition 13 (A good partition).

We say that a partition of the graphs’ edges P=(E0,E1) is (τd,τt)-good if: (1) any light edge (according to Definition 11) is in E0, (2) every edge such that d(e)>τd or t(e)>2τt is in E1, and (3) all other edges can be either in E0 or E1.

Our first claim shows that if the advice is good, then we can ignore triangles for which all edges are heavy, because there are not many of them.

Claim 14.

If α~>α(G) and t~[t/4,t], then |Hd|(ϵt~)2/3 and |Ht|(ϵt~)2/3.

Proof.

By Theorem 9 and the assumption on α~, eEd(e)2mα(G)2mα~. Therefore, there could be at most 2mα~/τd edges with degree greater than τd. By the setting of τd=8mγ2ϵt~, we get

|Hd|2mα~τd=ϵα~t~4γ2ϵt~4γ14min{(ϵt~)2/3,ϵt~α~},

where the last two inequalities are since γ=max{α~,t~1/3}.

By the definition of Ht, it holds that eHtt(e)|Ht|τt and also, eHtt(e)3t since this holds for any set of edges, so in particular it holds for Ht. Hence, by the setting of τt=12γ/ϵ=12ϵmax{α~,t~1/3} and since t4t~, it follows that |Ht|3t/τtϵt/(4t~1/3)(ϵt~)2/3.

Claim 15.

For any partition P=(E0,E1), it holds that eEaP(e)t. Furthermore, if |E1|3(ϵt~)2/3 and t~[t/4,t], then eE0aP(e)(112ϵ)t.

Proof.

By the assignment rule in Definition 12, every triangle is assigned to at most one edge, so it always holds that eEaP(e)t.

We now prove the second part of the claim. If there are at most 3(ϵt~)2/3 edges in E1, by Theorem 9 these edges induce at most 2|E1|3/212ϵt~12ϵt triangles, where the last inequality is by the assumption t~t. That is, there are at most 12ϵt triangles with all three edges are in E1, and all the rest of the (at least) (112ϵ)t triangles will be assigned to one of the edges in E0. Hence, eE0t(e)>(112ϵ)t.

3.2 The algorithm

Approx-Triangles-With-Arboricity-Advice(t~,ϵ,δ,α~).

  1. 1.

    Set γ=max{α~,t~1/3}, τd=8mγ2ϵt~ and τt=12γ/ϵ.

  2. 2.

    Sample a set R of r=16mτtln(4/δ)ϵ2t~ edges uniformly at random.

  3. 3.

    Query the degrees of all edges in R, and let d(R)=eRd(e).
    If d(R)>|R|α~4δ then output “bad advice”

  4. 4.

    Invoke IsHeavy on all edges in R, and if the fraction of edges for which the algorithm returned “Heavy” is greater than 52(ϵt~)2/3m~ then output “bad advice”.

  5. ** The remaining steps of the algorithm essentially execute the algorithm of [28]. **

  6. 5.

    Set up a data structure to allow sampling each edge in R with probability d(e)/d(R).

  7. 6.

    For i=1 to s=d(R)|R|(t~/m)10ln(8/δ)ϵ2 do:

    1. (a)

      Sample an edge eR, so that each edge is sampled with probability d(e)/d(R).

    2. (b)

      Pick w to be a uniform random neighbor of e.

    3. (c)

      If e,w form a triangle, then invoke IsAssigned(e,w).

    4. (d)

      If the triangle (e,w) is assigned to e, then let χi=1. Otherwise, let χi=0.

  8. 7.

    Let χ=1si=1sχi

  9. 8.

    Return t^=defd(R)m|R|χ

IsAssigned(e=(u,v),w).

  1. 1.

    Invoke IsHeavy for each of the edges (u,v),(u,w),(v,w).

  2. 2.

    If e is the first edge in E0 then return YES. Otherwise, return NO.

IsHeavy(e=(u,v),τt,τd).

  1. 1.

    If d(e)>τd then return “Heavy”.

  2. 2.

    Sample k=18d(e)τtln10mδ neighbors of e uniformly at random (from its min-degree vertex), and for each check if it forms a triangle with e.
    ** If IsHeavy(e=(u,v),τt,τd) was called in the past, re-use the same random bits as used previously. **

  3. 3.

    If the number of witnessed triangles is greater than 1.5kτtd(e), then return “Heavy”. Otherwise, return “Not Heavy”.

As noted above, we begin with an algorithm that takes as input a value t~, intended to be a rough estimate of the true triangle count t. It is a standard technique in subgraph-counting algorithms to first design a procedure that succeeds under the assumption that a coarse approximation of the target quantity is provided, and then later eliminate this assumption by iterating over a sequence of candidate estimates. Formally, to accomplish this we invoke the Search Theorem of [29]. Applying this theorem requires us to decompose the Soundness condition of our algorithm into two cases – depending on whether the guess t~ is reasonably accurate uper bound on the true triangle count t or instead exceeds t.

Theorem 16.

Let G be the input graph and ϵ and δ be an approximation and error parameters. Let t be the number of triangles in G. Algorithm Approx-Triangles-With-Arboricity-Advice receives: augmented query access to G, the number of edges m, parameters ϵ and δ, and a pair of positive integers α~ and t~. The algorithm has an expected run-time and query complexity of O(mmax{α~,t~1/3}t~ln1δϵ3δlnmδ(1+t/t~)). Every time the algorithm is run, it can either return a number t^ or it can return “bad advice”, and

  • (Completeness) For every G and t~, if α(G)α~, then w.p. at least 1δ the algorithm Approx-Triangles-With-Arboricity-Advice will not output “bad advice”.

  • (Used for soundness) For every G and α~, if t~[t/4,t] then w.p. at least 1δ Approx-Triangles-With-Arboricity-Advice will either return “bad advice” or output t^(1±20ϵ)t.

  • (Used for soundness) For every G and α~, if t~>t then w.p. at least 5ϵ, the algorithm Approx-Triangles-With-Arboricity-Advice will either return “bad advice” or output a value of t^ such that t^<(1+20ϵ)t.

The proof of the run-time bound in Theorem 16 is deferred to Appendix B.2 in the full version of this work [30] and the correctness properties in Theorem are shown in Section 3.4 (this includes the completeness property, and the two properties used for soundness). Section 3.3 focuses on the sub-routine IsHeavy with the proof of Claim 17 deferred to Appendix B in the full version of this work [30].

As mentioned earlier, Theorem 16 is used in Section B.3 of the full version of this work [30] together with the Search Theorem of [29] to obtain a testable algorithm (Definition 6) in the augmented query model for (1±ϵ)-approximate counting of triangles (Theorem 1). The instance-adaptive algorithm (Corollary 2) is analyzed in Section 3.5.

3.3 Analyzing procedure IsHeavy

We rely on the following the guarantees of the procedure IsHeavy, and defer the proof to Section B.1 in the full version of this work [30].

Claim 17.

With probability at least 1δ/10 over all random coins, IsHeavy induces a (τd,τt)-good partition, as defined in Definition 13. The query complexity and the run-time of IsHeavy is O(min(τd,d(e))τtlnmδ).

3.4 Correctness of Approx-Triangles-With-Arboricity-Advice

In this section we prove the Completeness condition and the two Soundness conditions in Theorem 16. For the rest of the analysis we will use the following convention:

 Remark 18.

In this section we will consider as random variables various quantities in Algorithm Approx-Triangles-With-Arboricity-Advicesuch as t^ and χi. These random variables are well-defined even if the algorithm decides to terminate in steps 4 or 3 prior to computing these quantities. This is true because all these random variables are determined on (a) the partition defined by IsHeavy(b) the set R and (c) the edges sampled in Step 6.

3.4.1 Proof of Completeness

In this sub-section we prove that the algorithm Approx-Triangles-With-Arboricity-Advice satisfies the completeness condition in Theorem 16. Specifically, we show that for every G and t~, if α(G)α~, then w.p. at least 1δ the algorithm Approx-Triangles-With-Arboricity-Advice will not output “bad advice”. Recalling that the algorithm has the option to reject in Step 3 and Step 4, in this section we show that either is unlikely if α(G)α~. We start by considering Step 4.

Claim 19.

If the partition P=(E0,E1) induced by the procedure IsHeavy satisfies E12(ϵt~)2/3, then the algorithm outputs “bad advice” in Step 4 with probability only at most δ/4. Moreover, P=(E0,E1) satisfies |E1|3(ϵt~)2/3, then the algorithm will output “bad advice” in Step 4 with probability at least 1δ/4.

Proof.

In Step 4, the procedure IsHeavy outputs “Heavy” for en edge e if and only if eE1, where recall that P=(E0,E1) induced by the procedure IsHeavy. Therefore each edge e sampled in Step 4, has a probability of |E1|/m to be in E1 (so IsHeavy outputs “Heavy” for e).

Suppose E12(ϵt~)2/3, then by the multiplicative Chernoff bound, and by the setting of |R|m(ϵt~)2/330ln(4/δ),

Pr[eR𝟙eE1>52|R|(ϵt~)2/3m]<exp(|R|(ϵt~)2/324m)=δ/4,

proving the first part of the claim. Now, suppose that |E1|3(ϵt~)2/3. Then, again

Pr[eR𝟙d(e)>τd52|R|(ϵt~)2/3m]<exp(1242(ϵt~)2/3m|R|)=δ/4.

Building on the claim above, we now show that Step 4 is unlikely to output “bad advice” if α~α(G).

Claim 20.

If α~α(G) and the partition P=(E0,E1) induced by IsHeavy is a (τd,τt)-good partition (see Definition 13), then the algorithm can output “bad advice” in Step 4 only with probability at most δ/4.

Proof.

If P=(E0,E1) induced by IsHeavy is a (τd,τt)-good partition (Definition 13), then the E1 is a subset of HdHt. But if α~α(G), then by Claim 14, we see that |Hd|(ϵt~)2/3 and |Ht|(ϵt~)2/3, and therefore

|E1||Hd|+|Ht|2(ϵt~)2/3,

and thus by claim 19 the algorithm outputs “bad advice” in Step 4 with probability only at most δ/4.

Finally, we prove that if the advice α~ is good, then with high probability the algorithm will not reject due to Step 3.

Claim 21.

If α~α(G), then the algorithm can only return “bad advice” in Step 3 with probability at most δ/3.

Proof.

By Theorem 9, if the graph has arboricity at most α(G) and α~α(G), then

𝔼eE[d(e)]=1meEd(e)2α(G)2α~.

Hence, for every i, 𝔼[d(R)]|R|2α~, and by Markov’s inequality, Pr[d(R)>4α~|R|/δ]<δ/4.

3.4.2 Proof of Soundness Condition 1

In this sub-section we prove that the algorithm Approx-Triangles-With-Arboricity-Advice satisfies the first soundness condition in Theorem 16. Specifically, we prove that for every G and α~, if t~[t/4,t] then, with probability at least 1δ, Approx-Triangles-With-Arboricity-Advice will either return “bad advice” or output t^(1±20ϵ)t.

Let P=(E0,E1) be the partition induced by the procedure IsHeavy. By Claim 17, with probability at least 1δ/10, the partition P is a (τd,τt)-good (see Definition 13). Since the probability distribution of the partition P is (as a random variable) independent from the choice of set R and edges that Approx-Triangles-With-Arboricity-Advice picks in Step 6. Thus, if we condition on the event that P is a (τd,τt)-good partition, this does not change the distribution of R and edges in Step 6. Therefore, for the rest of this subsection we condition on the event that P is a (τd,τt)-good partition.

Now, suppose that |E1|>3(ϵt~)2/3. Claim 19 tells us that in this case the algorithm Approx-Triangles-With-Arboricity-Advice will with probability at least 1δ/4 output “bad advice”.

Thus, it remains to consider the case that |E1|3(ϵt~)2/3. We show in Claim 23 (provided t~[t/4,t]) in this case the random variable t^ will satisfy t^(1±20ϵ)t. This will finish the proof because, as follows from Remark 18, in the event that the random variable t^ satisfies t^(1±20ϵ)t then the algorithm Approx-Triangles-With-Arboricity-Advice either outputs such t^ or outputs “bad advice” at an earlier step.

We first need the following claim telling us that the number of triangles assigned to set R will be approximately proportional to the total number triangles t (with high probability):

Claim 22.

If P=(E0,E1) is a (τd,τt)-good partition (Definition 13), satisfies |E1|3(ϵt~)2/3, and t~[t/4,t], then with probability at least 1δ4, we have

aP(R)|R|[(113ϵ)tm,(1+ϵ)tm].
Proof.

If P is a (τd,τt)-good partition, then for any edge eE, aP(e)2τt. This is true since if t(e)>2τt, then eE1, and so aP(e)=0, and otherwise, if eE0, then aP(e)t(e)2τt. Furthermore, if t~t, then |R|16mτtln(4/δ)ϵ2t~m(2τt)t3ln(4/δ)ϵ2. Therefore, by the Chernoff bound, with probability at least 1δ4,

1reRaP(e)(1±ϵ)𝔼eE[aP(e)](1±ϵ)eE0aP(e)m.

Claim 15 tells us that if |E1|3(ϵt~)2/3 and t~[t/4,t], then teE0aP(e)(112ϵ)t, which combined with the bound above yields the conclusion of the claim.

Overall, recall that the partition P=(E0,E1) induced by the algorithm IsHeavy is a (τd,τt)-good partition with probability at least 1δ4 (Claim 17), this subsection has the premise that t~[t/4,t] and we are focusing on the case that |E1|>3(ϵt~)2/3. With all this in mind, the following claim finishes the proof in this subsection, showing that the estimate t^ will indeed be a good approximation to the true triangle count t.

Claim 23.

Suppose ϵ1/20 and the partition P=(E0,E1) induced by the algorithm IsHeavy is a (τd,τt)-good partition (Definition 13), and also suppose |E1|3(ϵt~)2/3 and t~[t/4,t], then with probability at least 1δ/4 the algorithm outputs a value t^ such that t^(1±20ϵ)t.

Proof.

Observe that conditioned on an edge eR being sampled in Step 6a, χi=1 iff the triangle (e,w) is assigned to e, which happens with probability Pr[χi=1e]=aP(e)d(e). Therefore,

𝔼[χi]=eRd(e)d(R)aP(e)d(e)=aP(R)d(R). (1)

By Claim 22, and since aP(R)R[(112ϵ)tm,(1+ϵ)tm], it is the case that

𝔼[χi][(112ϵ)|R|td(R)m,(1+ϵ)|R|td(R)m].

This, together with the premise that t~[t/4,t] and ϵ1/20 implies that

s=defd(R)|R|t~m10ln(8/δ)ϵ21𝔼[χi]3ln(8/δ¯)ϵ2,

and therefore the Chernoff bound allows us to conclude that

Pr[|1sχ𝔼[χi]|>ϵ𝔼[χi]]<δ4.

If this holds, then we have

t^=defd(R)m|R|χ=(1±13ϵ)t,

proving the claim.

3.4.3 Proof of Soundness Condition 2

In this sub-section we prove that the algorithm Approx-Triangles-With-Arboricity-Advice satisfies the second soundness condition in Theorem 16. This is accomplished by the following claim:

Claim 24.

For every G and α~, if t~>t then w.p. at least ϵ/4, the algorithm Approx-Triangles-With-Arboricity-Advicewill either return “bad advice” or output a value of t^ such that t^<(1+20ϵ)t.

Proof.

Let P=(E0,E1) again denote the partition induced by the algorithm IsHeavy. Conditioned on a specific choice of R, the expectation of each χi still satisfies Equation 1, and therefore averaging over R we have

𝔼[t^]=𝔼[d(R)m|R|χ]=𝔼[d(R)m|R|aP(R)d(R)]=m|R|𝔼[aP(R)].

By the assignment rule in Definition 12, and by the first part of Claim 15, we have
𝔼eE[aP(e)]=1meEaP(e)tm, and, by linearity of expectation, 𝔼[aP(R)]|R|tm. Combining with the inequality above, we get EX[t^]t. Therefore, by Markov’s inequality, we have

Pr[t^>(1+20ϵ)t]<11+ϵ<15ϵ.

Hence, with probability at least 5ϵ, the random variable t^ satisfies t^(1+20ϵ)t. Conditioned on this event, either the algorithm outputs such value of t^ satisfying t^(1+ϵ)t, or it terminates and returns “bad advice” at an earlier step.

3.5 From testable triangle-counting to instance-adaptive counting

In this section, we use Theorem 1 to obtain an instance-adaptive algorithm for approximate triangle counting (Corollary 2).

Proof of Corollary 2.

We run the algorithm in Theorem 1 with a parameter δ¯=δ/(10logm) and an increasing sequence of values α~=2j where j{1,2,3,,logm}. Once for some j the algorithm outputs value t^j (Rather than “bad advice”), we output t^j and terminate.

The soundness condition in (Definition 6), together with a union bound over j, implies that once we receive an estimate t^j at one of the steps j, this estimate will satisfy t^j(1±ϵ)t.

The true arboricity α(G) is at most 2m. For one of the iterations j0, we have α~j0=2j0[α(G),2α(G)]. The completeness condition tells us that for each jj0 with probability at least 1δ the algorithm from Theorem 1 terminates. Thus, the expected run-time of the algorithm is upper-bounded by

j0O(mα~log1/δt+mlog1/δt2/3)+j>j0δjj0O(mα~2jj0log1/δt+mlog1/δt2/3)j0O(mα~log1/δt+mlog1/δt2/3)+j>j00.1jj0O(mα~2jj0log1/δt+mlog1/δt2/3)O(mα~2jj0log1/δt+mlog1/δt2/3),

finishing the proof.

4 A Lower Bound for Testable Triangle Counting Algorithms

Proof of Theorem 3.

Note that if α~<m/n, then by Theorem 8, any testable algorithm can simply reject, so that proving any non-trivial lower bound is impossible. Hence, the lower bound only holds for the case where α~m/n.

Since any testable algorithm for triangle counting also implies a triangle-counting algorithm in the case where the graph is guaranteed to be low-arboricity, the lower bound of Ω(mα(G)t) by [28, 10] applies to testable algorithms, and so it remains to prove a lower bound of Ω(m/t2/3) for the regime α~<t1/3.

Consider the following two families of graphs. In the first family, 𝒢1, all graphs have Θ(n) nodes, Θ(m) edges, arboricity α~ and zero triangles. In the second family 𝒢2, all graphs have Θ(n) nodes, Θ(m) edges, arboricity t1/3 and Θ(t) triangles.

Observe that any testable algorithm must (whp) return t^=0 for graphs from the first family, and (whp) either output “bad advice” or a correct estimate of t for graphs from the second family. In particular, this implies that the algorithm must be able to distinguish between the two families with high probability.

More concretely, in the first family the graphs all consist of two disjoint sets. The first, A1, is a bipartite graph over n vertices where each vertex has degree 2m/n, and the second, A2 is a bipartite set over r=Θ(t2/3/α~) vertices so that each vertex has degree α~ and in total there are (t1/32)=Θ(t2/3) edges. The arboricity of these graphs is α~ due to the set A2.

In the second family the graphs are very similar, but the set A2 is a clique over t1/3 vertices, so that here too it contributes Θ(t2/3) edges, however, here it also contributes Θ(t) triangles. In addition, the graphs in this family have an additional independent set, A3 over rt1/3 vertices, so that both graph families have the same number of vertices and edges. The arboricity of these graphs is t1/3 due to the set A2. Within each family the graphs only differ on the labelings of the vertices.

In order to distinguish the two families with high probability, any testable algorithm must hit at least one vertex or edge from VA1. Hence, it must make

Ω(min{nr,mt2/3})=Ω(min{nα~t2/3,mt2/3})=Ω(mt2/3)

many uniform vertex or edge samples, where the last equality is by the assumption that mn<α~.

5 Testable Streaming Algorithm for Counting Triangles

In this section we prove that our testable estimation algorithm can be adapted to the streaming model, achieving 9 rounds and space complexity O(mt2/3+mα~t). The state-of-the-art arboricity-dependent bound is due to Bera and Seshadhri [6], which uses only 6 rounds and O(mα¯/t) space, but requires a trusted bound on α(G).

As shown in [26, 32], any sublinear-time algorithm in the augmented query model with O(q) queries and adaptivity depth k can be simulated in the arbitrary-order insert-only edge arrival streaming model using O(q)polylogn space and k rounds.999An algorithm 𝒜 in the augmented query model has adaptivity depth k if the following holds. For every execution of 𝒜, the set of queries Q performed by it can be partitioned into k sets Q1,,Qk, such that for each j[k], the queries in Qj can be determined solely from the responses to Q1,,Qj1. That is, query complexity translates into space complexity, while adaptivity depth translates into the number of rounds. This is true since neighbor and edge queries can be implemented with 0-samplers, degree queries with counters, and pair queries with a one-bit verifier. Hence, a set of non-adaptive queries can be simulated in a single pass. Inspecting the algorithm Approx-Triangles-With-Arboricity-Advice, one can verify that it has an adaptivity depth of 9.

Recall that Approx-Triangles-With-Arboricity-Advice receives both an advice value α~ for α(G) and a guessed value t~ for the number of triangles in the graph. In the sublinear setting, the latter can be removed using a search procedure. However, this search cannot be implemented in the streaming model with a constant number of passes; it would either introduce a polylogarithmic round overhead or, if executed in parallel, require linear space. This limitation is common in streaming counting algorithms. Consequently, the streaming literature typically assumes access to a coarse constant-factor estimate of the triangle count (see, e.g., [14, 49] for a detailed discussion on this assumption). Given such an estimate, Approx-Triangles-With-Arboricity-Advice yields a constant-round streaming testable algorithm for approximating the number of triangles, as stated in Corollary 4.

A similar polylogarithmic round overhead would also arise if the algorithm were not provided with the advice value α~ in advance, and instead had to adapt its space complexity to the true arboricity α(G) of the graph (cf. Corollary 2)

6 Approximating the Number of Edges

In this section we prove Theorem 5 and its following corollary.

Corollary 25.

There exists an algorithm with the following guarantees. The algorithm receives access to a graph G, and is given the number of vertices n in G together with parameters ϵ and δ(0,1). With probability at least 1δ, the algorithm produces an estimate m^ that with probability at least 1δ satisfies m^(1±ϵ)m, where m is the number of edges in G. The expected run-time of the algorithm is O(nα(G)log1/δt), where α(G) is the arboricity of G.

Recall that in this section too the algorithm relies on access to uniform vertex samples, degree queries and neighbor queries, as well as uniform edge samples. However, we remove the assumption on having prior knowledge of m, as this is the task at hand. Instead, the algorithm is given a guess m~, and we use the search theorem in order to output a good estimate of m (see Section 6.3).

We now present the algorithm which is a simple adaptation of [57], and the main theorem of this section.

Approx-Edges-With-Arboricity-Advice(m~,ϵ,α¯,δ).

  1. 1.

    Let δ=δ/2 and ϵ=ϵ/6.

  2. 2.

    Sample a set R of r=12ln(1/δ)ϵ2 edges uniformly at random.

  3. 3.

    For every eR, let xe=1 if d(e)>2α¯/ϵ.

  4. 4.

    If 1reRxe>2ϵ then output “bad advice”.

  5. 5.

    (Otherwise, continue with the bounded-arboricity algorithm of [57])

  6. 6.

    Let q=nα¯m~12ln(2/δ)ϵ3.

  7. 7.

    If qn, query d(v) for all vV and return vVd(v).

  8. 8.

    For i=1 to q:

    1. (a)

      Sample uV uniformly at random.

    2. (b)

      Sample vΓ(u) uniformly at random.

    3. (c)

      If d(u)2α/ϵ and uv, then let χi=d(u). Otherwise, let χi=0.

  9. 9.

    Return m^=nqiχi.

Theorem 26.

Let G be the input graph, and ϵ and δ the approximation and failure parameters. Let m be the number of edges in G. In addition to augmented query access to G the algorithm takes as an input a pair of positive integers α~ and m~. Then, the following holds.

  • For every G and m~, if α(G)α~, then w.p. at least 1δ the algorithm Approx-Edges-With-Arboricity-Advice will not output “bad advice”.

  • (used for soundness) For every G and α~, if m~<m then with probability at least 1δ Approx-Edges-With-Arboricity-Advice will either return “bad advice”or output m^(1±10ϵ)m.

  • (used for soundness) For every G and α~, if m~>m then w.p. at least ϵ/4, the algorithm Approx-Edges-With-Arboricity-Advice will either return “bad advice”or output a value of m^ such that m^<(1+ϵ)m.

  • The query complexity and running time are the minimum between O(nα¯m)poly(1/ϵ,ln(1/δ)) and O(n).

We start with some preliminary definitions and claims.

6.1 Preliminary notations and claims

Definition 27 (Heavy vertices and edges.).

Let H denote the set of all vertices with d(v)>2α¯/ϵ. Let m(H) denote the set of edges in the subgraph induced by H. I.e., edges with both endpoints having degree greater than 2α¯/ϵ (so that d(e)>2α¯/ϵ). Let L=VH.

Notation 27 (An ordering and outgoing edges).

We assume an ordering on the graphs’ vertices such that uv if d(u)<d(v) of if d(u)=d(v) and id(u)<id(v). We let d+(u) denote the number of neighbors of v such that uv and refer to this set as the set of outgoing neighbors of u. For a set S we let d+(S)=vSd+(S).

Observe that

vVd+(v)=d+(L)+d+(H)=m.
Observation 28.

Suppose α>α¯. Observe that since every vH contributes at least 2α/ϵ edges to m, it holds that |H|ϵm/(2α). Thus, by the Nash-Williams theorem, m(H)2|H|αϵm. Therefore,

m(H)=vHd+(v)<ϵm, so that d+(L)=defvLd+(v)[(1ϵ)m,m]

6.2 Proof of Theorem 26

Proof of Theorem 26.

We prove each of the items separately.

Proof of item 1.

By Observation 28, if α¯>α(G), then m(H)<ϵm. Hence, for every eE, PreE[d(e)>2α¯/ϵ]<ϵ. By Chernoff’s inequality, for r=12ln(1/δ)/ϵ2,

Pr[1reRxe>2ϵ]<δ.

Hence, the probability of rejecting in Step 4 is at most δ.

Proof of item 2.

If m(H)>4ϵm, then by Chernoff’s inequality,

Pr[1reRxe<2ϵ]<δ.

Therefore, with probability at least 1δ, the algorithm returns “Bad Advice” in Step 4 and the claim holds.

Now consider the case where m(H)<4ϵm. Then d+(L)=md+(H)[(14ϵ),m].

Let u denote the vertex sampled in Step 8a. If uL then 𝔼vΓ(u)[χiu,uL]=d+(u)d(u)d(u)=d+(u). Also, for any uH, 𝔼[χiu,uH]=0. Hence, by linearity of expectation,

𝔼uV,vΓ(u)[χi]=1n(uLd+(u)+uH0)=d+(L)n. (2)

Furthermore, for any u, χi2α¯/ϵ. Therefore, the χi’s are [0,B] random variables with B=2α¯/ϵ and μ=𝔼[χi]=d+(L)n12mn. Since r=n(2α¯/ϵ)12m3ln(2/ϵ)ϵ2Bμ3ln(2/ϵ)ϵ2, by the Chernoff bound, μ^=def1ri=1rχi(1±ϵ)d+(L)n(1±10ϵ)mn. Hence, the returned value m^=nri=1rχi in Step 9 is with probability at least 1δ in (1±10ϵ)m.

Proof of item 3.

By Equation 2 and linearity of expectation,

𝔼[1qiχi]=𝔼[χ1]=d+(L)nmn.

By Markov’s inequality,

Pr[1qiχi>(1+ϵ)m]<11+ϵ<1ϵ/4.
Proof of item 4.

The query complexity and running time are

r+minn,q=O(min{n,q})=min{n,nα¯mln(1/δ)ϵ3}.

6.3 Searching for the “right” 𝒎~ value

We refer the reader to the full version of this work [30] for details on searching for the right m~ value serving as a rough estimate of the true edge count m.

References

  • [1] Mohammad Al Hasan and Vachik S Dave. Triangle counting in large networks: a review. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(2):e1226, 2018. doi:10.1002/WIDM.1226.
  • [2] Maryam Aliakbarpour, Amartya Shankha Biswas, Themis Gouleakis, John Peebles, Ronitt Rubinfeld, and Anak Yodpinyanee. Sublinear-time algorithms for counting star subgraphs via edge sampling. Algorithmica, 80(2):668–697, 2018. doi:10.1007/S00453-017-0287-3.
  • [3] Sepehr Assadi, Michael Kapralov, and Sanjeev Khanna. A Simple Sublinear-Time Algorithm for Counting Arbitrary Subgraphs via Edge Sampling. In Innovations in Theoretical Computer Science Conference (ITCS), pages 6:1–6:20, 2019. doi:10.4230/LIPIcs.ITCS.2019.6.
  • [4] Sepehr Assadi and Hoai-an Nguyen. Asymptotically optimal bounds for estimating h-index in sublinear time with applications to subgraph counting. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2022, September 19-21, 2022, University of Illinois, Urbana-Champaign, USA (Virtual Conference), 2022. doi:10.4230/LIPIcs.APPROX/RANDOM.2022.48.
  • [5] Bahman Bahmani, Ravi Kumar, and Sergei Vassilvitskii. Densest subgraph in streaming and mapreduce. Proceedings of the VLDB Endowment, 5(5), 2012. doi:10.14778/2140436.2140442.
  • [6] Suman K. Bera and C. Seshadhri. How the degeneracy helps for triangle counting in graph streams. In Principles of Database Systems (PODS), pages 457–467, 2020. doi:10.1145/3375395.3387665.
  • [7] Lorenzo Beretta, Deeparnab Chakrabarty, and C Seshadhri. Faster estimation of the average degree of a graph using random edges and structural queries. arXiv preprint arXiv:2507.06925, 2025. doi:10.48550/arXiv.2507.06925.
  • [8] Lorenzo Beretta and Jakub Tětek. Better sum estimation via weighted sampling. ACM Transactions on Algorithms, 20(3):1–33, 2024.
  • [9] Sayan Bhattacharya, Monika Henzinger, Danupon Nanongkai, and Charalampos E Tsourakakis. Space-and time-efficient algorithm for maintaining dense subgraphs on one-pass dynamic streams. In Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 173–182. ACM, 2015. doi:10.1145/2746539.2746592.
  • [10] Arijit Bishnu, Debarshi Chanda, and Gopinath Mishra. Arboricity and random edge queries matter for triangle counting using sublinear queries. arXiv preprint arXiv:2502.15379, 2025. doi:10.48550/arXiv.2502.15379.
  • [11] Amartya Shankha Biswas, Talya Eden, Quanquan C Liu, Ronitt Rubinfeld, and Slobodan Mitrović. Massively parallel algorithms for small subgraph counting. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2022), pages 39:1–39:28. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022. doi:10.4230/LIPIcs.APPROX/RANDOM.2022.39.
  • [12] Amartya Shankha Biswas, Talya Eden, and Ronitt Rubinfeld. Towards a decomposition-optimal algorithm for counting and sampling arbitrary motifs in sublinear time. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPIcs.APPROX/RANDOM.2021.55.
  • [13] Cristian Boldrin and Fabio Vandin. Fast and accurate triangle counting in graph streams using predictions. In 2024 IEEE International Conference on Data Mining (ICDM), pages 31–40. IEEE, 2024. doi:10.1109/ICDM59182.2024.00010.
  • [14] Vladimir Braverman, Rafail Ostrovsky, and Dan Vilenchik. How hard is counting triangles in the streaming model? In International Colloquium on Automata, Languages, and Programming, pages 244–254. Springer, 2013. doi:10.1007/978-3-642-39206-1_21.
  • [15] Keren Censor-Hillel, Dean Leitersdorf, and David Vulakh. Deterministic near-optimal distributed listing of cliques. In Proceedings of the 2022 ACM Symposium on Principles of Distributed Computing, pages 271–280, 2022. doi:10.1145/3519270.3538434.
  • [16] Gautam Chandrasekaran, Adam R Klivans, Vasilis Kontonis, Konstantinos Stavropoulos, and Arsen Vasilyan. Efficient discrepancy testing for learning with distribution shift. Advances in Neural Information Processing Systems, 37, 2024.
  • [17] Justin Y Chen, Talya Eden, Piotr Indyk, Honghao Lin, Shyam Narayanan, Ronitt Rubinfeld, Sandeep Silwal, Tal Wagner, David P Woodruff, and Michael Zhang. Triangle and four cycle counting with predictions in graph streams. In International Conference on Learning Representations (ICLR), 2022.
  • [18] Norishige Chiba and Takao Nishizeki. Arboricity and subgraph listing algorithms. SIAM Journal on Computing, 14(1):210–223, 1985. doi:10.1137/0214017.
  • [19] Artur Czumaj and Christian Konrad. Detecting cliques in congest networks. Distributed Computing, 33(6):533–543, 2020. doi:10.1007/S00446-019-00368-W.
  • [20] Maximilien Danisch, Oana Denisa Balalau, and Mauro Sozio. Listing k-cliques in sparse real-world graphs. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, pages 589–598, 2018.
  • [21] Ilias Diakonikolas, Daniel Kane, Vasilis Kontonis, Sihan Liu, and Nikos Zarifis. Efficient testable learning of halfspaces with adversarial label noise. Advances in Neural Information Processing Systems, 36, 2023.
  • [22] Ilias Diakonikolas, Daniel Kane, Sihan Liu, and Nikos Zarifis. Testable learning of general halfspaces with adversarial label noise. In The Thirty Seventh Annual Conference on Learning Theory, pages 1308–1335. PMLR, 2024. URL: https://proceedings.mlr.press/v247/diakonikolas24a.html.
  • [23] Talya Eden, Amit Levi, Dana Ron, and C. Seshadhri. Approximately counting triangles in sublinear time. SIAM Journal on Computing, 46(5):1603–1646, 2017. doi:10.1137/15M1054389.
  • [24] Talya Eden, Reut Levi, and Dana Ron. Testing bounded arboricity. ACM Transactions on Algorithms (TALG), 16(2):1–22, 2020. doi:10.1145/3381418.
  • [25] Talya Eden, Reut Levi, Dana Ron, and Ronitt Rubinfeld. Approximately counting and sampling hamiltonian motifs in sublinear time. In Proceedings of the 57th Annual ACM Symposium on Theory of Computing, pages 1043–1054, 2025. doi:10.1145/3717823.3718160.
  • [26] Talya Eden, Saleet Mossel, and Dana Ron. Approximating the arboricity in sublinear time. In Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2404–2425. SIAM, 2022. doi:10.1137/1.9781611977073.96.
  • [27] Talya Eden, Dana Ron, and C. Seshadhri. Sublinear time estimation of degree distribution moments: The arboricity connection. SIAM Journal on Discrete Math, 33(4):2267–2285, 2019. doi:10.1137/17M1159014.
  • [28] Talya Eden, Dana Ron, and C. Seshadhri. Faster sublinear approximation of the number of k-cliques in low-arboricity graphs. In Symposium on Discrete Algorithms (SODA), pages 1467–1478. SIAM, 2020. doi:10.1137/1.9781611975994.89.
  • [29] Talya Eden, Dana Ron, and C. Seshadhri. On approximating the number of k-cliques in sublinear time. SIAM Journal on Computing, 49(4):747–771, 2020. doi:10.1137/18M1176701.
  • [30] Talya Eden, Ronitt Rubinfeld, and Arsen Vasilyan. Testable algorithms for approximately counting edges and triangles in sublinear time and space. Preprint ArXiv:2509.20351, 2025.
  • [31] Hendrik Fichtenberger, Mingze Gao, and Pan Peng. Sampling arbitrary subgraphs exactly uniformly in sublinear time. In 47th International Colloquium on Automata, Languages, and Programming, ICALP, 2020.
  • [32] Hendrik Fichtenberger and Pan Peng. Approximately counting subgraphs in data streams. In Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 413–425, 2022. doi:10.1145/3517804.3524145.
  • [33] Orr Fischer, Tzlil Gonen, Fabian Kuhn, and Rotem Oshman. Possibilities and impossibilities for distributed subgraph detection. In Proceedings of the 30th on Symposium on Parallelism in Algorithms and Architectures, pages 153–162, 2018. doi:10.1145/3210377.3210401.
  • [34] Gaurav Goel and Jens Gustedt. Bounded arboricity to determine the local structure of sparse graphs. In International Workshop on Graph-Theoretic Concepts in Computer Science, pages 159–167. Springer, 2006. doi:10.1007/11917496_15.
  • [35] Surbhi Goel, Adam R Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. Testing noise assumptions of learning algorithms. arXiv preprint arXiv:2501.09189, 2025. doi:10.48550/arXiv.2501.09189.
  • [36] Surbhi Goel, Abhishek Shetty, Konstantinos Stavropoulos, and Arsen Vasilyan. Tolerant algorithms for learning with arbitrary covariate shift. Advances in Neural Information Processing Systems, 37:124979–125018, 2024.
  • [37] Aravind Gollakota, Parikshit Gopalan, Adam Klivans, and Konstantinos Stavropoulos. Agnostically learning single-index models using omnipredictors. Advances in Neural Information Processing Systems, 36, 2024.
  • [38] Aravind Gollakota, Adam Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. Tester-learners for halfspaces: Universal algorithms. Advances in Neural Information Processing Systems, 36, 2023.
  • [39] Mira Gonen, Dana Ron, and Yuval Shavitt. Counting stars and other small subgraphs in sublinear-time. SIAM Journal on Discrete Mathematics, 25(3):1365–1411, 2011. doi:10.1137/100783066.
  • [40] Alon Itai and Michael Rodeh. Finding a minimum circuit in a graph. In Proceedings of the ninth annual ACM symposium on Theory of computing, pages 1–10, 1977.
  • [41] Shweta Jain and C. Seshadhri. A fast and provable method for estimating clique counts using Turán’s theorem. In Proceedings of the 26th international conference on world wide web, pages 441–449, 2017. doi:10.1145/3038912.3052636.
  • [42] John Kallaugher, Andrew McGregor, Eric Price, and Sofya Vorotnikova. The complexity of counting cycles in the adjacency list streaming model. In Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 119–133, 2019. doi:10.1145/3294052.3319706.
  • [43] Valerie King, Alex Thomo, and Quinton Yong. Computing (1+ epsilon)-approximate degeneracy in sublinear time. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, pages 2160–2168. International Joint Conferences on Artificial Intelligence Organization, 2023. doi:10.24963/IJCAI.2023/240.
  • [44] Adam Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. Learning intersections of halfspaces with distribution shift: Improved algorithms and sq lower bounds. In Shipra Agrawal and Aaron Roth, editors, Proceedings of Thirty Seventh Conference on Learning Theory, volume 247 of Proceedings of Machine Learning Research, pages 2944–2978. PMLR, 30 June–03 July 2024. URL: https://proceedings.mlr.press/v247/klivans24b.html.
  • [45] Adam Klivans, Konstantinos Stavropoulos, and Arsen Vasilyan. Testable learning with distribution shift. In Shipra Agrawal and Aaron Roth, editors, Proceedings of Thirty Seventh Conference on Learning Theory, volume 247 of Proceedings of Machine Learning Research, pages 2887–2943. PMLR, 30 June–03 July 2024. URL: https://proceedings.mlr.press/v247/klivans24a.html.
  • [46] Tamara G Kolda, Ali Pinar, Todd Plantenga, C Seshadhri, and Christine Task. Counting triangles in massive graphs with mapreduce. SIAM Journal on Scientific Computing, 36(5):S48–S77, 2014. doi:10.1137/13090729X.
  • [47] Cassandra Marcussen, Ronitt Rubinfeld, and Madhu Sudan. Quality control in sublinear time: a case study via random graphs. arXiv:2508.16531, 2025. doi:10.48550/arXiv.2508.16531.
  • [48] Andrew McGregor, David Tench, Sofya Vorotnikova, and Hoa T Vu. Densest subgraph in dynamic graph streams. In International Symposium on Mathematical Foundations of Computer Science, pages 472–482. Springer, 2015. doi:10.1007/978-3-662-48054-0_39.
  • [49] Andrew McGregor, Sofya Vorotnikova, and Hoa T Vu. Better algorithms for counting triangles in data streams. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 401–411, 2016. doi:10.1145/2902251.2902283.
  • [50] Rajeev Motwani, Rina Panigrahy, and Ying Xu. Estimating sum by weighted sampling. In International Colloquium on Automata, Languages, and Programming, pages 53–64. Springer, 2007. doi:10.1007/978-3-540-73420-8_7.
  • [51] C St JA Nash-Williams. Edge-disjoint spanning trees of finite graphs. Journal of the London Mathematical Society, 1(1):445–450, 1961.
  • [52] C St JA Nash-Williams. Decomposition of finite graphs into forests. Journal of the London Mathematical Society, 1(1):12–12, 1964.
  • [53] Jaroslav Nešetřil and Svatopluk Poljak. On the complexity of the subgraph problem. Comment. Math. Univ. Carol., 26(2):415–419, 1985.
  • [54] Rasmus Pagh and Charalampos E Tsourakakis. Colorful triangle counting and a mapreduce implementation. Information Processing Letters, 112(7):277–281, 2012. doi:10.1016/J.IPL.2011.12.007.
  • [55] Ha-Myung Park and Chin-Wan Chung. An efficient mapreduce algorithm for counting triangles in a very large graph. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management, pages 539–548, 2013. doi:10.1145/2505515.2505563.
  • [56] Ronitt Rubinfeld and Arsen Vasilyan. Testing distributional assumptions of learning algorithms. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, STOC 2023, Orlando, FL, USA, June 20-23, 2023, pages 1643–1656. ACM, 2023. doi:10.1145/3564246.3585117.
  • [57] C. Seshadhri. An extremely simple algorithm for approximate edge counting in bounded arboricity graphs. Unpublished manuscript, 2025.
  • [58] Kijung Shin, Tina Eliassi-Rad, and Christos Faloutsos. Patterns and anomalies in k-cores of real-world graphs with applications. Knowledge and Information Systems, 54(3):677–710, 2018. doi:10.1007/S10115-017-1077-6.
  • [59] Lucas Slot, Stefan Tiegel, and Manuel Wiedmer. Testably learning polynomial threshold functions. Advances in Neural Information Processing Systems, 37, 2024.