
Weighted Matching in a Poly-Streaming Model

Ahammed Ullah, Purdue University, West Lafayette, IN, USA
S M Ferdous, Pacific Northwest National Laboratory, Richland, WA, USA
Alex Pothen, Purdue University, West Lafayette, IN, USA
Abstract

We introduce the poly-streaming model, a generalization of streaming models of computation in which k processors process k data streams containing a total of N items. The algorithm is allowed 𝒪(f(k)M1) space, where M1 is either o(N) or the space bound for a sequential streaming algorithm. Processors may communicate as needed. Algorithms are assessed by the number of passes, per-item processing time, total runtime, space usage, communication cost, and solution quality.

We design a single-pass algorithm in this model for approximating the maximum weight matching (MWM) problem. Given k edge streams and a parameter ε>0, the algorithm computes a (2+ε)-approximate MWM. We analyze its performance in a shared-memory parallel setting: for any constant ε>0, it runs in time 𝒪~(Lmax+n), where n is the number of vertices and Lmax is the maximum stream length. It supports 𝒪(1) per-edge processing time using 𝒪~(kn) space. We further generalize the design to hierarchical architectures, in which k processors are partitioned into r groups, each with its own shared local memory. The total intergroup communication is 𝒪~(rn) bits, while all other performance guarantees are preserved.

We evaluate the algorithm on a shared-memory system using graphs with trillions of edges. It achieves substantial speedups as k increases and produces matchings with weights significantly exceeding the theoretical guarantee. On our largest test graph, it reduces runtime by nearly two orders of magnitude and memory usage by five orders of magnitude compared to an offline algorithm.

Keywords and phrases:
Streaming Algorithms, Matchings, Graphs, Parallel Algorithms
Funding:
Ahammed Ullah: Research supported by grant SC-0022260 of the Advanced Scientific Computing Research program of the U. S. Department of Energy.
S M Ferdous: Laboratory Directed Research and Development Program at PNNL.
Alex Pothen: Research supported by grant SC-0022260 of the Advanced Scientific Computing Research program of the U. S. Department of Energy.
Copyright and License:
© Ahammed Ullah, S M Ferdous, and Alex Pothen; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Mathematics of computing → Graph algorithms; Mathematics of computing → Approximation algorithms
Related Version:
Full Version: https://arxiv.org/abs/2507.14114
Editors:
Anne Benoit, Haim Kaplan, Sebastian Wild, and Grzegorz Herman

1 Introduction

Data-intensive computations arise in data science, machine learning, and science and engineering disciplines. These datasets are often massive, generated dynamically, and, when stored, kept in distributed formats on disks, making them amenable to processing as multiple data streams. Streaming algorithms designed for tightly coupled shared-memory and distributed-memory multiprocessors can exploit this modularity to efficiently solve large problem instances that offline algorithms cannot handle due to their high memory requirements. However, the design of parallel algorithms that process multiple data streams concurrently has not yet received much attention.

Current multicore shared-memory processors consist of up to a few hundred cores, organized hierarchically to share caches and memory controllers. These cores compute in parallel to achieve speedups over serial execution. With multiple memory controllers, I/O operations can also proceed in parallel, and this feature can be used to process multiple data streams concurrently. These I/O capabilities and the limitations of offline algorithms motivate a model of computation, illustrated in Figure 1 and discussed next.

Figure 1: A schematic diagram of the poly-streaming model for shared-memory parallel computers. Processors {P_ℓ}_{ℓ∈[k]} have access to 𝒪(f(k)M1) memory collectively, depicted with the rectangle connected to the processors.

The streaming model of computation allows o(N) space for a data stream of size N [2, 13]. For graphs, the semi-streaming model permits 𝒪(n polylog n) space for a graph with n vertices and an edge stream of arbitrary length [8]. Building on these space-constrained models, we introduce the poly-streaming model. The key aspects of our model are as follows.

We consider k data streams that collectively contain N items. An algorithm has access to k (abstract) processors, and is allowed 𝒪(f(k)M1) total space, where M1 is either o(N) or the space permitted to a single-stream algorithm. In each pass, each stream is assigned to one of the processors, and each processor independently reads one item at a time from its stream and processes it. Processors may communicate as needed, either via shared or remote memory access. Algorithms are assessed on several metrics: space complexity, number of passes, per-item processing time, total runtime, communication cost, and solution quality.
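
To make the execution pattern concrete, the following is a minimal C++/OpenMP sketch of the setup described above, with one thread per stream; the types and names (StreamItem, process_item, poly_stream) are illustrative placeholders rather than part of the model.

  #include <omp.h>
  #include <istream>
  #include <sstream>
  #include <string>
  #include <vector>

  // One item of a stream; for a graph stream this would be an edge (u, v, w).
  struct StreamItem { long u = 0, v = 0; double w = 0.0; };

  // Placeholder per-item handler; a concrete algorithm updates shared state here.
  inline void process_item(int /*processor*/, const StreamItem& /*item*/) {}

  void poly_stream(std::vector<std::istream*>& streams) {
    const int k = static_cast<int>(streams.size());
    // Shared state of total size O(f(k) * M1) would be allocated here and
    // captured by reference inside the parallel region.
    #pragma omp parallel num_threads(k)
    {
      const int l = omp_get_thread_num();          // processor P_l reads stream l
      std::string line;
      while (std::getline(*streams[l], line)) {    // one pass: one item at a time
        std::istringstream in(line);
        StreamItem it;
        if (in >> it.u >> it.v >> it.w) process_item(l, it);
      }
      // The implicit barrier at the end of the parallel region ends the pass;
      // processors may also communicate through shared memory as needed.
    }
  }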

In the poly-streaming model, we address the problem of approximating a maximum weight matching (MWM) in an edge-weighted graph, where the goal is to find a set of vertex-disjoint edges with maximum total weight. We design an algorithm for approximating an MWM when the graph is presented as multiple edge streams. Our design builds on the algorithm of [20] and adds support for handling multiple streams concurrently. We also generalize the design to NUMA (non-uniform memory access) multiprocessor architectures.

We summarize our contributions to the MWM problem as follows. Let Lmax and Lmin denote the maximum and minimum lengths of the input streams, respectively, and let n denote the number of vertices in a graph G. For any realization of the CREW PRAM model (such as in Figure 1), we have the following result.

Theorem 1.

For any constant ε>0, there exists a single-pass poly-streaming algorithm for the maximum weight matching problem that achieves a (2+ε)-approximation. It admits a CREW PRAM implementation with 𝒪~(Lmax+n) runtime, where 𝒪~(⋅) hides polylogarithmic factors. If Lmin=Ω(n), the algorithm achieves 𝒪(log n) amortized per-edge processing time using 𝒪~(k+n) space. For arbitrarily balanced streams, it uses either 𝒪~(k+n) space and 𝒪~(n) per-edge processing time, or 𝒪~(kn) space and 𝒪(1) per-edge processing time.

In NUMA architectures, memory access costs depend on a processor’s proximity to the target memory. We generalize the algorithm in Theorem 1 to account for these cost differences. In particular, we show that when k processors are partitioned into r groups, each with its own shared local memory, the total number of global memory accesses across all groups is 𝒪~(rn). This generalization preserves all other performance guarantees from Theorem 1, except that the 𝒪~(k+n) space bound becomes 𝒪~(k+rn). These results are formalized in Theorem 18 in Section 4. This design gives a memory-efficient algorithm for NUMA shared-memory multiprocessors, on which we report empirical results.

We have evaluated our algorithm on a NUMA machine using graphs with billions to trillions of edges. For most of these graphs, our algorithm uses space that is orders of magnitude smaller than that required by offline algorithms. For example, storing the largest graph in our evaluation would require more than 91,600GB (90TB), whereas our algorithm used less than 1GB. Offline matching algorithms typically require even more memory to accommodate their auxiliary data structures.

We employ approximate dual variables that correspond to a linear programming relaxation of MWM to obtain a posteriori upper bounds on the weights of optimal matchings. These bounds allow us to compare the weight of a matching produced by our algorithm with the optimal weight. Thus, we show that our algorithm produces matchings whose weights significantly exceed the approximation guarantee.

For k=128, our algorithm achieves runtime speedups of 16–83 across all graphs in our evaluation, on a NUMA machine with only 8 memory controllers. This is significant scaling for a poly-streaming algorithm, given that 8 memory controllers are not sufficient to serve the concurrent and random access requests of 128 processors without delays. Nevertheless, these speedups demonstrate the effectiveness of our design, which accounts for a processor’s proximity to the target memory. A metric less influenced by memory latency suggests that the algorithm would achieve even better speedups on architectures with more efficient memory access.

Note that Theorem 1 and Theorem 18 both guarantee 𝒪~(Lmax+n) runtime. For Lmax=Ω(n), this is tight up to polylogarithmic factors. However, by using 𝒪~(kn) space and 𝒪(1) per-edge processing time, we can even achieve 𝒪~(Lmax+n/k) runtime, which becomes polylogarithmic for large values of k (see the arXiv version).

Organization.

Section 2 introduces necessary background. Section 3 presents the design and analyses of our algorithm in Theorem 1. In Section 4, we extend the design to NUMA architectures. Section 5 summarizes the evaluation results. We conclude in Section 6 with a discussion of future research directions.

2 Preliminaries

For a graph G=(V,E), let n:=|V| and m:=|E| denote the number of vertices and edges, respectively. We denote an edge e:={u,v} by the unordered pair of its endpoints. For a weighted graph, let w_e denote the weight of edge e, and for any subset A⊆E, define w(A):=∑_{e∈A} w_e. For ℓ∈[k], let E_ℓ be the set of edges received in the ℓth stream. Define Lmax:=max_{ℓ∈[k]} |E_ℓ| and Lmin:=min_{ℓ∈[k]} |E_ℓ|.

A matching ℳ in a graph is a set of edges that do not share endpoints. A maximum weight matching (MWM) ℳ* is a matching with maximum total weight; that is, w(ℳ*) ≥ w(ℳ) for all matchings ℳ ⊆ E.

A ρ-approximation algorithm computes a solution whose value is within a factor ρ of optimal. The factor ρ is called the (worst-case) approximation ratio. We assume ρ ≥ 1 for both maximization and minimization problems. Thus, for maximization, a ρ-approximation guarantees a solution whose value is at least 1/ρ times the optimal.

We use the linear programming (LP) relaxation of the MWM problem, and its dual, shown in Figure 2. In the primal LP, each variable x_e is 1 if edge e is in the matching and 0 otherwise. Each y_u is a dual variable, and δ(u) denotes the set of edges incident on a vertex u. Let {x_e}_{e∈E} and {y_u}_{u∈V} be feasible solutions to the primal and dual LPs, respectively. By weak LP duality, we have ∑_{e∈E} w_e x_e ≤ ∑_{u∈V} y_u. If {x_e*}_{e∈E} is an optimal solution to the primal LP, then w(ℳ*) ≤ ∑_{e∈E} w_e x_e* ≤ ∑_{u∈V} y_u. The first inequality holds because the primal LP is a relaxation of the MWM problem.

Primal LP

  maximize   ∑_{e∈E} w_e x_e
  subject to ∑_{e∈δ(u)} x_e ≤ 1   for all u∈V,
             x_e ≥ 0              for all e∈E.

Dual LP

  minimize   ∑_{u∈V} y_u
  subject to ∑_{u∈e} y_u ≥ w_e    for all e∈E,
             y_u ≥ 0              for all u∈V.

Figure 2: The linear programming (LP) relaxation of the MWM problem and its dual.

3 Algorithms for Uniform Memory Access Cost

In this section, we present the design and analyses of the algorithm of Theorem 1, which assumes a uniform memory access cost.

3.1 The Algorithm

Several semi-streaming algorithms have been designed for the MWM problem [3, 4, 6, 8, 10, 11, 18, 20, 23] (see the arXiv version for brief descriptions of these algorithms). In this paper, we focus exclusively on the single-pass setting in the poly-streaming model. Our starting point is the algorithm of Paz and Schwartzman [20], which computes a (2+ε)-approximation of MWM. This is currently the best known guarantee in the single-pass setting under arbitrary or adversarial ordering of edges. (No single-pass algorithm can achieve an approximation ratio better than 1+ln 2 ≈ 1.7; see [15].) We extend a primal-dual analysis by Ghaffari and Wajc [11] to analyze our algorithm.

The algorithm of Paz and Schwartzman [20] proceeds as follows. Initialize an empty stack S and set α_u = 0 for each vertex u∈V. For each edge e={u,v} in the edge stream, skip e if w_e < (1+ε)(α_u+α_v). Otherwise, compute g_e = w_e − (α_u+α_v), push e onto the stack S, and increase both α_u and α_v by g_e. After processing all edges, compute a matching greedily by popping edges from S.
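
For concreteness, a short C++ sketch of this sequential procedure follows; the container choices and names are illustrative and not taken from [20].

  #include <stack>
  #include <tuple>
  #include <utility>
  #include <vector>

  struct StackedEdge { int u, v; double gain; };

  // Single-stream (2+eps)-approximate MWM in the style of Paz and Schwartzman [20].
  // Vertices are assumed to be 0..n-1; the stream is given as (u, v, w) triples.
  std::vector<std::pair<int,int>> sequential_mwm(
      int n, const std::vector<std::tuple<int,int,double>>& stream, double eps) {
    std::vector<double> alpha(n, 0.0);        // vertex potentials alpha_u
    std::stack<StackedEdge> S;                // stack of provisionally kept edges

    // Streaming phase: skip light edges, push the rest and raise both potentials.
    for (const auto& [u, v, w] : stream) {
      if (w < (1.0 + eps) * (alpha[u] + alpha[v])) continue;   // skip e
      const double g = w - (alpha[u] + alpha[v]);              // gain g_e
      S.push({u, v, g});
      alpha[u] += g;
      alpha[v] += g;
    }

    // Post-processing: unwind the stack and add edges greedily.
    std::vector<char> matched(n, 0);
    std::vector<std::pair<int,int>> M;
    while (!S.empty()) {
      const StackedEdge e = S.top();
      S.pop();
      if (!matched[e.u] && !matched[e.v]) {
        M.push_back({e.u, e.v});
        matched[e.u] = matched[e.v] = 1;
      }
    }
    return M;
  }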

Note that for each edge pushed onto the stack, the increment g_e = w_e − (α_u+α_v) satisfies g_e ≥ ε(α_u+α_v). This ensures that both α_u and α_v increase by a factor of at least 1+ε. Hence, the number of edges in the stack incident to any vertex is at most log_{1+ε}(W) = 𝒪(log W/ε), where W is the (normalized) maximum edge weight. Therefore, the total number of edges in the stack is 𝒪(n log W/ε) = 𝒪(n log n/ε). (Throughout the paper, we assume W = 𝒪(poly(n)). For arbitrary weights on edges, we can skip any edge whose weight is less than εWmax/(2(1+ε)n²), where Wmax denotes the maximum edge weight observed so far in the stream. This ensures that the (normalized) maximum weight the algorithm sees is 𝒪(n²/ε), while maintaining a 2(1+𝒪(ε)) approximation ratio; see [11] for details.)
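
The counting behind this bound can be made explicit; the following is a sketch of the standard argument, assuming edge weights normalized to lie in [1,W] and a constant ε>0.

\[
w_e > (1+\varepsilon)(\alpha_u+\alpha_v)
\;\Longrightarrow\;
g_e = w_e-(\alpha_u+\alpha_v) \ \ge\ \varepsilon(\alpha_u+\alpha_v) \ \ge\ \varepsilon\,\alpha_u ,
\]
so every stacked edge incident to $u$ updates $\alpha_u \leftarrow \alpha_u+g_e \ge (1+\varepsilon)\alpha_u$, while at the moment of each such push $\alpha_u \le \alpha_u+\alpha_v < w_e/(1+\varepsilon) \le W$. Since the first such push already raises $\alpha_u$ to at least $\varepsilon w_e/(1+\varepsilon) \ge \varepsilon/(1+\varepsilon)$, the number of stacked edges incident to $u$ is at most $\log_{1+\varepsilon}\!\big(W(1+\varepsilon)/\varepsilon\big) + \mathcal{O}(1) = \mathcal{O}(\log W/\varepsilon)$ for constant $\varepsilon$.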

PS-MWM(V, ℓ, ε):

/* each processor ℓ executes this algorithm concurrently */

  1. In parallel initialize lock_u, and set α_u and mark_u to 0 for all u∈V
     /* processor ℓ initializes or sets Θ(n/k) locks/variables */
  2. S_ℓ ← ∅ /* initialize an empty stack */
  3. for each edge e={u,v} in the ℓth stream do
     (a) Process-Edge(e, S_ℓ, ε)
  4. wait for all processors to complete execution of Step 3 /* a barrier */
  5. ℳ_ℓ ← Process-Stack(S_ℓ)
  6. return ℳ_ℓ

Figure 3: A poly-streaming matching algorithm.

To design a poly-streaming algorithm, we begin with a simple version and then refine it. All k processors share a global stack and a set of variables {αu}uV, and each processor runs the above sequential streaming algorithm on its respective stream. To complete and adapt this setup for efficient execution across multiple streams, we must address two interrelated issues: (1) concurrent edge arrivals across streams may lead to contention for the shared stack or variables, and (2) concurrent updates to the shared variables may lead to inconsistencies in their observed values.

A natural approach to addressing these issues is to enforce a fair sequential strategy, where processors access shared resources in a round-robin order. While this ensures progress, it incurs 𝒪(k) per-edge processing time, which scales poorly with increasing k. Instead, we adopt fine-grained contention resolution that avoids global coordination by allowing processors to operate asynchronously. However, under the initial setup, this leads to 𝒪~(n/ε) per-edge processing time: a processor may be blocked from accessing shared resources until the stack has accumulated its 𝒪~(n/ε) potential edges. We address these limitations with the following design choices.

For the first issue, we observe that a global ordering of edges, as used in the single-stack solution, is not necessary; local orderings within multiple stacks suffice. In particular, we can identify a subset of edges (later referred to as tight edges) for which maintaining local orderings is sufficient to compute a (2+ε)-approximate MWM. Hence, we can localize computation using k stacks, assigning one stack to each processor exclusively during the streaming phase. This design eliminates the 𝒪~(n/ε) contention associated with a shared stack.

However, contention still arises when updating the variables {αu}uV. It is unclear how to resolve this contention without using additional space. Hence, we consider two strategies for processing edge streams that illustrate the trade-off between space and per-edge processing time. In the first, which we call the non-deferrable strategy, the decision to include an edge in a stack is made immediately during streaming. In the second, which we call the deferrable strategy, this decision may be deferred to post-processing. The latter strategy requires more space but achieves 𝒪(1) per-edge processing time.

To address the second issue, which concerns the potential for inconsistencies due to concurrent updates to the variables {α_u}_{u∈V}, we observe that the variables are monotonically increasing and collectively require only 𝒪~(n/ε) updates. Thus, for most edges that are not eligible for the stacks, decisions can be made by simply reading the current values of the relevant variables. However, for the 𝒪~(n/ε) edges that are included in the stacks, we must update the corresponding variables. To ensure consistency of these updates, we associate a lock with each variable in {α_u}_{u∈V}. We maintain |V| exclusive locks and allow a variable to be updated only after acquiring its corresponding lock. (This corresponds to the concurrent-read exclusive-write (CREW) paradigm of the PRAM model.)

We now outline the non-deferrable strategy of our poly-streaming algorithm for the MWM problem (for the deferrable strategy see the arXiv version). For simplicity, we assume that if a processor attempts to release a lock it did not acquire, the operation has no effect. We also assume that any algorithmic step described with the “in parallel” construct includes an implicit barrier (or synchronization primitive) at the end, synchronizing the processors participating in that step.

Process-Edge(e={u,v}, S_ℓ, ε):

/* Assumes access to global variables {α_u}_{u∈V} and locks {lock_u}_{u∈V} */

  1. if w_e ≤ (1+ε)(α_u+α_v) then return
  2. repeatedly try to acquire lock_u and lock_v in lexicographic order of u and v, as long as w_e > (1+ε)(α_u+α_v)
  3. if w_e > (1+ε)(α_u+α_v) then
     (a) g_e ← w_e − (α_u+α_v)
     (b) increment α_u and α_v by g_e
     (c) add e to the top of S_ℓ along with g_e
  4. release lock_u and lock_v, and return

Figure 4: A subroutine used in algorithms PS-MWM and PS-MWM-LD.

The non-deferrable strategy is presented in Algorithm PS-MWM, with two subroutines used by PS-MWM described in Process-Edge (Figure 4) and Process-Stack (Figure 5). In PS-MWM, Steps 1–2 form the preprocessing phase, Steps 3–4 the streaming phase, and Step 5 the post-processing phase. Each processor ℓ∈[k] executes PS-MWM asynchronously, except that all processors begin the post-processing phase simultaneously (due to Step 4) and then resume asynchronous execution.

In the subroutine Process-Edge, Step 2 ensures that all edges are processed using the non-deferrable strategy: a processor repeatedly attempts to acquire the locks corresponding to the endpoints of an edge e={u,v} until it succeeds, or the edge becomes ineligible for inclusion in a stack. As a result, a processor executing Step 3 has a consistent view of the variables α_u and α_v. In Step 3(c), we store the gain g_e of an edge e along with the edge itself for use in the post-processing phase.
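
A C++ sketch of this subroutine is given below, under the assumptions that each lock_u is a std::mutex and S_ℓ is a processor-private vector. The unlocked early reads mirror the concurrent reads in the pseudocode; a production implementation would store the α values in std::atomic<double> objects to conform to the C++ memory model.

  #include <mutex>
  #include <utility>
  #include <vector>

  struct StackedEdge { int u, v; double gain; };

  // Sketch of Process-Edge (Figure 4): alpha[] and lock[] are shared by all
  // processors; S_l is the calling processor's private stack.
  void process_edge(int u, int v, double w, double eps,
                    std::vector<double>& alpha,
                    std::vector<std::mutex>& lock,
                    std::vector<StackedEdge>& S_l) {
    if (u > v) std::swap(u, v);             // lock in lexicographic order
    auto light = [&] { return w <= (1.0 + eps) * (alpha[u] + alpha[v]); };

    if (light()) return;                    // Step 1: discard without touching locks
    while (true) {                          // Step 2: retry while the edge stays heavy
      if (light()) return;
      if (lock[u].try_lock()) {
        if (lock[v].try_lock()) break;      // acquired both locks
        lock[u].unlock();                   // back off and retry
      }
    }
    if (!light()) {                         // Step 3: re-check under the locks
      const double g = w - (alpha[u] + alpha[v]);   // gain g_e
      alpha[u] += g;                        // consistent update: locks are held
      alpha[v] += g;
      S_l.push_back({u, v, g});             // private stack, no contention
    }
    lock[v].unlock();                       // Step 4: release and return
    lock[u].unlock();
  }

Acquiring the two locks in a fixed lexicographic order is what rules out the cyclic waiting analyzed in Lemma 6.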

Process-Stack(S_ℓ):

/* Assumes access to global variables {α_u}_{u∈V} and {mark_u}_{u∈V}, initialized in Algorithm PS-MWM (or PS-MWM-LD). */

  1. ℳ_ℓ ← ∅ /* initialize an empty matching */
  2. while S_ℓ ≠ ∅ do
     (a) remove the top edge e={u,v} of S_ℓ
     (b) if w_e + g_e < α_u + α_v then wait for e to be a tight edge
         /* e is a tight edge if w_e + g_e = α_u + α_v */
     (c) if both mark_u and mark_v are set to 0 then
         /* no locking is needed since e is a tight edge */
         i. ℳ_ℓ ← ℳ_ℓ ∪ {e}
         ii. set mark_u and mark_v to 1
     (d) decrement α_u and α_v by g_e
  3. return ℳ_ℓ

Figure 5: A subroutine used in algorithms PS-MWM and PS-MWM-LD.

When all k processors are ready to execute Step 5 of PS-MWM, the k stacks collectively contain all the edges needed to construct a (2+ε)-approximate MWM, which can be obtained in several ways. In the subroutine Process-Stack, we outline a simple approach based on local edge orderings. We define an edge e={u,v} in a stack to be a tight edge if w_e + g_e = α_u + α_v. Equivalently, an edge is tight if and only if all of its neighboring edges that were included after it in any stack have already been removed. Any set of tight edges can be processed concurrently, regardless of their positions in the stacks. In Process-Stack, we simultaneously process the tight edges that appear at the tops of the stacks.
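
The following C++ sketch mirrors Process-Stack under simplifying assumptions: integer weights and gains (so that the tightness test is exact), std::atomic counters for the shared α and mark values, and a plain busy-wait while an edge is not yet tight.

  #include <atomic>
  #include <cstdint>
  #include <vector>

  struct KeptEdge { int u, v; std::int64_t w, g; };

  // Sketch of Process-Stack (Figure 5): alpha[] and mark[] are shared; S_l and M_l
  // belong to the calling processor. Tight edges are vertex-disjoint (Lemma 10),
  // so the marks can be set without locks.
  void process_stack(std::vector<KeptEdge>& S_l,
                     std::vector<std::atomic<std::int64_t>>& alpha,
                     std::vector<std::atomic<char>>& mark,
                     std::vector<KeptEdge>& M_l) {
    while (!S_l.empty()) {
      const KeptEdge e = S_l.back();
      // Wait until e is tight: every neighboring edge stacked after e has been
      // processed, i.e., w_e + g_e equals alpha_u + alpha_v.
      while (e.w + e.g < alpha[e.u].load() + alpha[e.v].load()) { /* spin */ }
      S_l.pop_back();
      if (mark[e.u].load() == 0 && mark[e.v].load() == 0) {
        M_l.push_back(e);                  // add the tight edge to the matching
        mark[e.u].store(1);
        mark[e.v].store(1);
      }
      alpha[e.u].fetch_sub(e.g);           // undo e's contribution to the duals
      alpha[e.v].fetch_sub(e.g);
    }
  }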

3.2 Analyses

We now formally characterize several correctness properties of the algorithm and analyze its performance. These correctness properties include the absence of deadlock, livelock, and starvation. The performance metrics are space usage, approximation ratio, per-edge processing time, and total runtime.

To simplify the analysis, we assume that processors operate in a quasi-synchronous manner. In particular, to analyze Step 3 of Algorithm PS-MWM, we define an algorithmic superstep as a unit comprising a constant number of elementary operations.

Definition 2 (Superstep).

A processor takes one superstep for an edge if it executes Process-Edge with at most one iteration of the loop in Step 2 (i.e., without contention), requiring 𝒪(1) elementary operations. Each additional iteration of Step 2 due to contention adds one superstep, with each such iteration also requiring 𝒪(1) operations.

Definition 3 (Effective Iterations).

Effective iterations is the maximum number of supersteps taken by any processor during the execution of Step 3 of Algorithm PS-MWM.

Note that for k=1, the effective iterations is equal to the number of edges in the stream. Using this notion, we align the supersteps of different processors and define the following directed graph.

Definition 4 (G(t)).

For the tth effective iteration, consider the set of edges processed across all k streams. Let e_ℓ=(u,v) denote the edge processed in the ℓth stream, where u precedes v in the lexicographic ordering of the vertices. If processor ℓ is idle in the tth iteration, then e_ℓ=∅. Define G(t):=(V(t),E(t)), where E(t):={e_ℓ : ℓ∈[k]} and V(t):=∪_{(u,v)∈E(t)} {u,v}.

The following property of G(t) is straightforward to verify.

Proposition 5.

G(t) is a directed acyclic graph.

We show that Algorithm PS-MWM is free from deadlock, livelock, and starvation. Deadlock occurs when a set of processors forms a cyclic dependency, with each processor waiting for a resource held by another. Livelock occurs when a set of processors repeatedly form such a cycle, where each processor continually acquires and releases resources without making progress. Starvation occurs when a processor waits indefinitely for a resource because other processors repeatedly acquire it first. The following lemma shows that the streaming phase of Algorithm PS-MWM is free from deadlock, livelock, and starvation.

Lemma 6.

The concurrent executions of the subroutine Process-Edge are free from deadlock, livelock, and starvation.

Proof.

Since the variables {αu}uV are updated only while holding their corresponding locks, we treat the locks {locku}uV as the only shared resources in Process-Edge.

Let G(t) be the graph defined in Definition 4. By Proposition 5, G(t) is a directed acyclic graph (DAG), and hence each of its components is also a DAG.

To reason about cyclic dependencies, we focus on components of G(t) involving processors executing Step 2 of Process-Edge. Every DAG contains at least one vertex with no outgoing edges. Thus, each such component includes an edge e_ℓ=(u,v) such that only processor ℓ requests lock_v. This precludes the possibility of cyclic dependencies; that is, the concurrent executions of Process-Edge are free from deadlock and livelock.

To show that starvation does not occur, suppose an edge appears in every effective iteration t∈[a,b], that is, e_ℓ=(u,v)∈∩_{t∈[a,b]} E(t). We show that b−a=𝒪~(n/ε), which bounds the number of supersteps that processor ℓ may spend attempting to acquire locks for e_ℓ. Step 2 requires one superstep per iteration, while all other steps collectively require at most one. For each t∈(a,b], the component of G(t−1) containing e_ℓ has at least one vertex with no outgoing edge. This guarantees that at least one edge in that component acquires its locks and completes Step 3 during the (t−1)th effective iteration. Since Step 3 can increment the values in {α_u}_{u∈V} for at most 𝒪(n log_{1+ε} W)=𝒪~(n/ε) edges over the entire execution, the number of iterations for which e_ℓ may remain blocked is also bounded by 𝒪~(n/ε).

To analyze Step 5 of Algorithm PS-MWM, we adopt the same simplification: processors are assumed to operate in a quasi-synchronous manner. Accordingly, we define 𝒰(t) as the set of edges present in the stacks ∪_{ℓ∈[k]} S_ℓ at the beginning of iteration t of Step 2 in Process-Stack. The following definition is useful for characterizing tight edges via an equivalent notion.

Definition 7 (Follower).

An edge e_j∈𝒰(t) is a follower of an edge e_i∈𝒰(t) if e_i∩e_j≠∅ and e_j is added to some stack S_{ℓ_j} after e_i is added to some stack S_{ℓ_i}. We denote the set of followers of an edge e by ℱ(e).

The proofs of the following four lemmas are included in the arXiv version. The fourth lemma establishes that the post-processing phase of PS-MWM is free from deadlock, livelock, and starvation.

Lemma 8.

An edge e is a tight edge if and only if ℱ(e)=∅.

Lemma 9.

Let 𝒯(t) be the set of top edges in the stacks at the beginning of iteration t of Step 2 of Process-Stack. Then 𝒯(t) contains at least one tight edge.

Lemma 10.

The set of tight edges in 𝒰(t) is vertex-disjoint.

Lemma 11.

The concurrent executions of the subroutine Process-Stack are free from deadlock, livelock, and starvation.

We now analyze the performance metrics of the algorithm.

Lemma 12.

For any constant ε>0, the space complexity and per-edge processing time of Algorithm PS-MWM are 𝒪(k+nlogn) and 𝒪(nlogn), respectively. Furthermore, for Lmin=Ω(n), the amortized per-edge processing time of the algorithm is 𝒪(logn).

Proof.

The claimed space bound follows from three components: 𝒪(n) space for the variables and locks, 𝒪(nlogn) space for the stacked edges, and 𝒪(1) space per processor.

The worst-case per-edge processing time follows from the second part of the proof of Lemma 6.

Processor ℓ processes |E_ℓ| edges, each requiring at least one distinct effective iteration (see Definition 3). Additional iterations may arise when it repeatedly attempts to acquire locks in Step 2 of Process-Edge. From the second part of the proof of Lemma 6, the total number of such additional iterations is bounded by 𝒪(n log n). This implies that to process |E_ℓ| edges, a processor uses 𝒪(|E_ℓ|+n log n) supersteps. Therefore, the amortized per-edge processing time is

𝒪((|E_ℓ|+n log n)/|E_ℓ|) = 𝒪(n log n/|E_ℓ|) = 𝒪(n log n/Lmin) = 𝒪(log n).

Note that the amortized per-edge processing time is computed over the edges of an individual stream, not over the total number of edges across all streams. While both forms of amortization are meaningful for poly-streaming algorithms, our analysis is more practically relevant, as it reflects the cost incurred per edge arrival within a single stream.

Lemma 13.

For any constant ε>0, Algorithm PS-MWM takes 𝒪(Lmax+nlogn) time.

Proof.

The preprocessing phase (Steps 1–2) takes Θ(n/k) time.

To process |E_ℓ| edges, processor ℓ takes 𝒪(|E_ℓ|+n log n) supersteps (see the proof of Lemma 12). Since |E_ℓ| ≤ Lmax for all ℓ∈[k], the time required for Step 3 is 𝒪(Lmax+n log n).

At the beginning of Step 5, the total number of edges in the stacks is |𝒰(1)|=𝒪(n log n). By Lemma 9, iteration t of Process-Stack removes at least one edge from 𝒰(t). Hence, the time required for Step 5 is 𝒪(n log n).

The claim now follows by summing the time spent across all three phases.

Now, using the characterizations of tight edges, we extend the duality-based analysis of [11] to our algorithm. Let Δα_e denote the change in ∑_{u∈V} α_u resulting from processing an edge e∈E in Step 3 of Process-Edge. If an edge e∈E is not included in a stack S_ℓ, then Δα_e=0, either because it fails the condition in Step 1 or Step 3 of Process-Edge. If e is included in a stack, then Δα_e=2g_e, since both α_u and α_v increase by g_e. It follows that ∑_{e∈∪_{ℓ∈[k]} E_ℓ} Δα_e = ∑_{u∈V} α_u. For an edge e that is included in some stack S_{ℓ_i}, let 𝒫(e) denote the set of edges that share an endpoint with e and are included in some stack S_{ℓ_j} no later than e, including e itself. The following two results are immediate from Observation 3.2 and Lemma 3.4 of [11].

Proposition 14.

Any edge e added to some stack S_ℓ satisfies the inequality

w_e ≥ ∑_{e′∈𝒫(e)} g_{e′} = (1/2) ∑_{e′∈𝒫(e)} Δα_{e′}.
Proposition 15.

After all processors complete Step 3 of Algorithm PS-MWM, the variables {αu}uV, scaled by a factor of (1+ε), form a feasible solution to the dual LP in Figure 2.

Lemma 16.

Let ℳ* be a maximum weight matching in G. The matching ℳ := ∪_{ℓ∈[k]} ℳ_ℓ returned by Algorithm PS-MWM satisfies w(ℳ) ≥ w(ℳ*)/(2(1+ε)).

Proof.

We only process tight edges in Process-Stack. By Lemma 10, tight edges are vertex-disjoint, and hence their independent processing does not interfere with their inclusion in ℳ.

By Lemma 8, an edge e included in ℳ must satisfy ℱ(e)=∅. Consider any edge e′∈𝒫(e)∖{e}. Since e∈ℱ(e′), we have ℱ(e′)≠∅, which means e′ is not a tight edge before e is processed. Thus, when e is selected for inclusion in ℳ, none of the edges in 𝒫(e)∖{e} is tight. Hence, all edges of 𝒫(e) are in the stacks when we are about to process e. Therefore, the total gain contributed by edges of 𝒫(e) can be attributed to the weight of e, and by Proposition 14, we have

w(ℳ) = ∑_{e∈ℳ} w_e ≥ (1/2)(∑_{e∈ℳ} ∑_{e′∈𝒫(e)} Δα_{e′}) ≥ (1/2)(∑_{e∈∪_{ℓ∈[k]} S_ℓ} Δα_e) = (1/2)(∑_{e∈∪_{ℓ∈[k]} E_ℓ} Δα_e) = (1/2)(∑_{u∈V} α_u).

Let {x_e*}_{e∈E} be an optimal solution to the primal LP in Figure 2. By Proposition 15 and LP duality, we have

w(ℳ*) ≤ ∑_{e∈E} w_e x_e* ≤ (1+ε)(∑_{u∈V} α_u) ≤ 2(1+ε)·w(ℳ).

Algorithm PS-MWM uses only one pass over the streams. Theorem 1 now follows by combining the results in Lemma 12, Lemma 13, Lemma 16, and the analysis of the deferrable strategy sketched in the arXiv version.

4 Algorithms for Non-Uniform Memory Access Costs

In this section, we extend the algorithm from Section 3 to account for the non-uniform memory access (NUMA) costs present in real-world machines.

In a poly-streaming algorithm, each processor may receive an arbitrary subset of the input, making it difficult to maintain memory access locality. Modern shared-memory machines, as illustrated in Figure 1, have non-uniform memory access costs and far fewer memory controllers than processors [12]. As a result, memory systems with such limitations would struggle to handle the high volume of concurrent, random memory access requests generated by poly-streaming algorithms, leading to significant delays.

PS-MWM-LD(V, ℓ, j, ε):

  1. In parallel initialize lock_u, and set α_u and mark_u to 0 for all u∈V
     /* processor ℓ initializes or sets Θ(n/k) locks/variables */
  2. In parallel initialize lock_u^j, and set α_u^j to 0 for all u∈V
     /* processor ℓ initializes or sets Θ(n/(k/r)) locks/variables */
  3. In parallel initialize glock_j /* one processor initializes for group j */
  4. S_ℓ ← ∅ /* initialize an empty stack */
  5. for each edge e={u,v} in the ℓth stream do
     (a) Process-Edge-LD(e, S_ℓ, ε)
  6. wait for all processors to complete execution of Step 5 /* a barrier */
  7. ℳ_ℓ ← Process-Stack(S_ℓ)
  8. return ℳ_ℓ

Figure 6: A generalization of Algorithm PS-MWM using local dual variables.

We now describe a generalization of the algorithm from Section 3 that localizes a significant portion of each processor’s memory access to its near memory. This generalization applies to both edge-processing strategies introduced in Section 3.1. We focus on the non-deferrable strategy. (The deferrable strategy generalizes in the same way, following the same relationship between the two strategies as in the specialized case.)

The runtime of Process-Edge is dominated by the time to access the dual variables {αu}uV. By assigning a dedicated stack to each processor, we have substantially localized accesses associated with edges in that stack. However, since a large fraction of edges is typically not included in the stacks, the runtime remains dominated by accesses to dual variables associated with these discarded edges. We therefore describe an algorithm that localizes these accesses to memory near the processor.

To localize accesses to the dual variables {αu}uV, we observe that these variables increase monotonically during the streaming phase. This observation motivates a design in which a subset of processors maintains local copies of the variables and can discard a substantial number of edges without synchronizing with the global copy. When a processor includes an edge in its stack, it increments the corresponding dual variables in the global copy by the gain of the edge and synchronizes its local copy accordingly. As a result, some local copies may lag behind the global copy, but they can be synchronized when needed.

A general scheme for allocating dual variables is as follows. The set of k processors is partitioned into r groups. For simplicity, we assume that k is a multiple of r, so each group contains exactly k/r processors. For r>1, in addition to a global copy of dual variables, we maintain r local copies {α_u^j}_{u∈V}, one for each j∈[r]. Group j consists of the processors {ℓ∈[k] : ⌈ℓ/(k/r)⌉ = j}, and uses {α_u^j}_{u∈V} as its local copy of the dual variables. Algorithm PS-MWM is the special case r=1, where all processors operate using only the global copy of the dual variables.
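
As a concrete illustration of this partition (assuming 1-based processor indices and that k is a multiple of r):

  // Maps processor l in {1,...,k} to its group j in {1,...,r}: j = ceil(l / (k/r)).
  inline int group_of(int l, int k, int r) {
    const int group_size = k / r;               // k/r processors per group
    return (l + group_size - 1) / group_size;   // ceiling division
  }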

Algorithm PS-MWM-LD, along with its subroutine Process-Edge-LD, incorporates local dual variables in addition to global ones. In Step 2, processors in each group j∈[r] collectively initialize their group’s local copies of dual variables and locks, followed by initializing a group lock in Step 3. All other steps of the algorithm are identical to those in PS-MWM.

Process-Edge-LD(e={u,v}, S_ℓ, ε):

/* Assumes access to {α_u}_{u∈V}, {lock_u}_{u∈V}, {α_u^j}_{u∈V}, {lock_u^j}_{u∈V}, and glock_j */

  1. if w_e ≤ (1+ε)(α_u^j+α_v^j) then return
  2. repeatedly try to acquire lock_u^j and lock_v^j in lexicographic order of u and v, as long as w_e > (1+ε)(α_u^j+α_v^j)
  3. if w_e ≤ (1+ε)(α_u^j+α_v^j) then release lock_u^j and lock_v^j, and return
  4. repeatedly try to acquire glock_j
  5. execute Steps 2–4 of Process-Edge on the global variables {α_u}_{u∈V}, locks {lock_u}_{u∈V}, and stack S_ℓ
  6. α_u^j ← α_u and α_v^j ← α_v /* synchronization of local and global dual variables */
  7. release lock_u^j, lock_v^j, glock_j and return

Figure 7: A subroutine used in Algorithm PS-MWM-LD.

In the subroutine Process-Edge-LD, Step 5 implements the non-deferrable strategy. Steps 1–3 and Step 6 enforce the localization of access to dual variables. Steps 2–3 ensure that, at any given time, each global dual variable is accessed by at most one processor per group; we refer to this processor as the delegate of the group for that variable. Thus, a processor executing Steps 4–6 serves as a delegate of its group for that time. In Step 6, after completing updates to the global variables, the delegate synchronizes its group’s local copy in 𝒪(1) time. As a result, the waiting time on a local variable in Step 2 is bounded by the total time spent by the corresponding delegates, up to constant factors.
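
A compact C++ sketch of this locality idea follows, with the per-vertex and per-group locks elided; the names are illustrative, and the safety of the local fast path relies only on the fact that the local copies never exceed the monotonically increasing global values.

  #include <vector>

  struct GroupDuals {
    std::vector<double> glob;   // global copy {alpha_u}, possibly in far memory
    std::vector<double> loc;    // this group's copy {alpha_u^j}, in near memory
  };

  // Fast path: a discard decision based on the (lagging) local copy is always safe,
  // because local values never exceed the global ones and the duals only increase.
  inline bool discard_locally(const GroupDuals& d, int u, int v, double w, double eps) {
    return w <= (1.0 + eps) * (d.loc[u] + d.loc[v]);
  }

  // Executed by the group's delegate after it has updated the global copy under
  // locks (Step 6 of Process-Edge-LD): refresh the local copy for the two endpoints.
  inline void sync_local(GroupDuals& d, int u, int v) {
    d.loc[u] = d.glob[u];
    d.loc[v] = d.glob[v];
  }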

The delegates in each group handle vertex-disjoint edges, so concurrent executions of Step 6 would have been safe. However, the lock in Step 4 ensures that at most one delegate per group executes Step 5. Regardless of these design choices, the behavior of the delegates executing Step 5 concurrently mirrors that of processors competing for exclusive access to global dual variables in PS-MWM.

The following lemma highlights the benefit of using Algorithm PS-MWM-LD.

Lemma 17.

For any constant ε>0, in the streaming phase of Algorithm PS-MWM-LD, processors in all r groups collectively access global variables a total of 𝒪(rnlogn) times.

In contrast to the result in Lemma 17, the streaming phase of Algorithm PS-MWM accesses global variables Ω(m) times or up to 𝒪(m+knlogn) times.

Algorithm PS-MWM-LD, together with the generalization of the deferrable strategy, leads to the following result (a proof is included in the arXiv version).

Theorem 18.

Let k processors be partitioned into r groups, each with its own shared local memory. For any constant ε>0, there exists a single-pass poly-streaming algorithm for the maximum weight matching problem that achieves a (2+ε)-approximation. It admits a CREW PRAM implementation with 𝒪~(Lmax+n) runtime. If Lmin=Ω(n), the algorithm achieves 𝒪(logn) amortized per-edge processing time using 𝒪~(k+rn) space. For arbitrarily balanced streams, it uses either 𝒪~(k+rn) space and 𝒪~(n) per-edge processing time, or 𝒪~(kn) space and 𝒪(1) per-edge processing time. The processors collectively access the global memory 𝒪~(rn) times.

5 Empirical Evaluation

This section summarizes our evaluation results for Algorithm PS-MWM-LD. Detailed datasets, experimental setup, and additional comparisons (including with PS-MWM) are provided in the arXiv version.

5.1 Datasets

Table 1: Summary of datasets. Each collection contains eight graphs.

Graph Collection     # of Edges (in billions)
The SSW graphs       1.36–127.4
The BA graphs        4.64–550.1
The ER graphs        256–4096
The UA-dv graphs     275.2–550.1
The UA graphs        8.93–1100
The ER-dv graphs     32–4096

Table 1 summarizes our datasets. Each collection consists of eight graphs, with edge counts ranging from one billion to four trillion. To the best of our knowledge, these represent some of the largest graphs for which matchings have been reported in the literature. Exact and approximate offline MWM algorithms (see [22]) would exceed available memory on the larger graphs. The first class (SSW) consists of six of the largest graphs from the SuiteSparse Matrix collection [5] and two from the Web Data Commons [19], which includes the largest publicly available graph dataset. Other classes include synthetic graphs generated from the Barabási–Albert (BA), Uniform Attachment (UA), and Erdős–Rényi (ER) models [1, 7, 21].

5.2 Experimental Setup

We ran all experiments on a community cluster called Negishi [17], where each node has an AMD Milan processor with 128 cores running at 2.2 GHz, 256–1024 GB of memory, and the Rocky Linux 8 operating system version 8.8. The cores are organized in a hierarchy: groups of eight cores constitute a core complex that share an L3 cache. Eight core complexes form a socket, and they share four dual-channel memory controllers; two sockets constitute a Milan node [12]. Memory access within a socket is approximately three times faster than across sockets.

We implemented the algorithms in C++ and compiled the code using the gcc compiler (version 12.2.0) with the -O3 optimization flag. For shared-memory parallelism, we used the OpenMP library (version 4.5). All experiments used ε=10⁻⁶. Reported values are the average over five runs.

5.3 Space

Figure 8: Memory used by the algorithm and the corresponding graph size (space needed to store the entire graph in CSR format), shown for the SSW, BA, and ER graphs. Note that the y-axes are in a logarithmic scale.

Figure 8 summarizes the space usage of our algorithm. For k=1, i.e., the algorithm of Paz and Schwartzman [20], we store one copy of the dual variables, the stack, and the matching. For k>1, our algorithm stores r+1 copies of the dual variables (global and local), the stacks, the matching, and the locks. We choose the values of r based on the number of memory controllers and the number of streams.

The maximum space used by our algorithm is 223GB, for the web graph WDC_2012. In comparison, storing this graph in compressed sparse row (CSR) format would require over 2800GB. Storing the largest graph in our datasets (ER1_4096) in CSR would require more than 91,600GB (89.45TB), for which our algorithm used less than 0.8GB.

5.4 Solution Quality

Figure 9: Comparisons of min-OPT percent obtained by different algorithms on the SSW, BA, and ER graphs. ALG-d denotes the best results from four dual update rules, and ALG-s denotes the algorithm of Feigenbaum et al. [8].

min-OPT percent.

In the arXiv version, we describe different ways to get a posteriori upper bounds on the weight w(ℳ*) of an MWM, using the values of the dual variables. Let Ymin denote the minimum value of these upper bounds. If ℳ is a matching in the graph returned by any algorithm, then we have w(ℳ)/w(ℳ*) ≥ w(ℳ)/Ymin. Hence, (w(ℳ)/Ymin)×100 gives a lower bound on the percentage of the maximum weight w(ℳ*) obtained by ℳ. We use min-OPT percent to denote the fraction (w(ℳ)/Ymin)×100.

Figure 9 shows min-OPT percent obtained by different algorithms. In the arXiv version, we describe four dual update rules as alternatives to the default rule used in Steps 3(a)–(b) of Process-Edge. The values under k=1 and k=128 use the default rule, and the values under ALG-d use the best result among the four new dual update rules. For perspective, we include min-OPT percent obtained by the sequential 6-approximate streaming algorithm of Feigenbaum et al. [8], denoted ALG-s.

The results under k=1 and k=128 show that, in terms of solution quality, our poly-streaming algorithm is on par with the single-stream algorithm of [20]. The values under ALG-d indicate further potential improvements using alternative dual update rules. The comparison with ALG-s supports our choice of the algorithm from [20] over other simple algorithms, such as that of [8]. The arXiv version contains comparisons with an offline algorithm and details on the dual update rules.

5.5 Runtime

Figure 10: Speedup in runtime vs. k for the SSW, BA, and ER graphs. Note that both axes are in a logarithmic scale.

We report runtime-based speedups, computed as the total time across all three phases of PS-MWM-LD (preprocessing, streaming, and post-processing). Figure 10 shows these speedups. For k=128, we have speedups of 16–60, 37–73, and 68–83 for the SSW graphs, the BA graphs, and the ER graphs, respectively.

Due to the significant memory bottlenecks (discussed in Section 4), we also report speedups w.r.t. effective iterations (Definition 3), which are less affected by such bottlenecks. The speedup w.r.t. effective iterations is the ratio of the metric for one stream to that for k streams. Now for k=128, we obtain speedups of 112–127, 121–127, and 124–128 for the SSW graphs, the BA graphs, and the ER graphs, respectively. These results indicate that shared variable access incurs no noticeable contention among processors. As a result, we expect even better runtime improvements on systems with more memory controllers or better support for remote memory access.

Figure 11: Breakdown of runtime into three phases for k=1 and k=128. Note that the y-axis is in a logarithmic scale.

Figure 11 shows the runtimes for different graphs, decomposed into three phases, for k=1 and k=128. The plots report the absolute time savings achieved by processing multiple streams concurrently. For k=1 and k=128, the geometric means of the runtimes for these graphs are over 2350 seconds and under 45 seconds, respectively. For the largest graph (ER1_4096), single-stream processing took over 8000 seconds, whereas poly-stream processing reduced the time to under 100 seconds.

6 Conclusion

While numerous studies have focused on optimizing time (in parallel computing) or space (in streaming algorithms) in isolation, the poly-streaming model offers a practically relevant paradigm for exploring how time and space can be jointly optimized. It fills a gap by providing a formal framework for analyzing algorithmic design choices and their associated time-space trade-offs. Our study of matchings illustrates both the benefits of this paradigm and its practical relevance.

The simplicity of our matching algorithm and its generalization reflect our choice to adopt the design of [20]. We believe this principle will inspire the development of other poly-streaming algorithms. To this end, we note that [20] has also motivated simple algorithms for related problems, such as matchings with submodular objectives [16], b-matchings [14], and collections of disjoint matchings [9].

References

  • [1] Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47, 2002.
  • [2] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the Twenty-eighth Annual ACM Symposium on Theory of Computing, pages 20–29, 1996. doi:10.1145/237814.237823.
  • [3] Sepehr Assadi. A simple (1−ε)-approximation semi-streaming algorithm for maximum (weighted) matching. In 2024 Symposium on Simplicity in Algorithms (SOSA), pages 337–354. SIAM, 2024. doi:10.1137/1.9781611977936.31.
  • [4] Michael Crouch and Daniel M Stubbs. Improved streaming algorithms for weighted matching, via unweighted matching. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2014), pages 96–104. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2014. doi:10.4230/LIPICS.APPROX-RANDOM.2014.96.
  • [5] Timothy A Davis and Yifan Hu. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS), 38(1):1–25, 2011. doi:10.1145/2049662.2049663.
  • [6] Leah Epstein, Asaf Levin, Julián Mestre, and Danny Segev. Improved approximation guarantees for weighted matching in the semi-streaming model. SIAM Journal on Discrete Mathematics, 25(3):1251–1265, 2011. doi:10.1137/100801901.
  • [7] Paul Erdős and Alfréd Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci., 5:17–60, 1960.
  • [8] Joan Feigenbaum, Sampath Kannan, Andrew McGregor, Siddharth Suri, and Jian Zhang. On graph problems in a semi-streaming model. Theoretical Computer Science, 348(2-3):207–216, 2005. doi:10.1016/J.TCS.2005.09.013.
  • [9] S M Ferdous, Bhargav Samineni, Alex Pothen, Mahantesh Halappanavar, and Bala Krishnamoorthy. Semi-streaming algorithms for weighted k-disjoint matchings. In 32nd Annual European Symposium on Algorithms (ESA 2024), pages 53–1. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024.
  • [10] Buddhima Gamlath, Sagar Kale, Slobodan Mitrovic, and Ola Svensson. Weighted matchings via unweighted augmentations. In Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing, pages 491–500, 2019. doi:10.1145/3293611.3331603.
  • [11] Mohsen Ghaffari and David Wajc. Simplified and space-optimal semi-streaming (2+ε)-approximate matching. In Symposium on Simplicity in Algorithms, volume 69, 2019.
  • [12] NASA High-End Computing Capability (HECC). AMD Milan Processors, 2024. Accessed: 2025-07-07. URL: https://www.nas.nasa.gov/hecc/support/kb/amd-milan-processors_688.html.
  • [13] Monika Rauch Henzinger, Prabhakar Raghavan, and Sridhar Rajagopalan. Computing on data streams. External Memory Algorithms, 50:107–118, 1998. doi:10.1090/DIMACS/050/05.
  • [14] Chien-Chung Huang and François Sellier. Semi-streaming algorithms for submodular function maximization under b-matching, matroid, and matchoid constraints. Algorithmica, 86(11):3598–3628, 2024. doi:10.1007/S00453-024-01272-X.
  • [15] Michael Kapralov. Space lower bounds for approximating maximum matching in the edge arrival model. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1874–1893. SIAM, 2021. doi:10.1137/1.9781611976465.112.
  • [16] Roie Levin and David Wajc. Streaming submodular matching meets the primal-dual method. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1914–1933. SIAM, 2021. doi:10.1137/1.9781611976465.114.
  • [17] Gerry McCartney, Thomas Hacker, and Baijian Yang. Empowering faculty: A campus cyberinfrastructure strategy for research communities. Educause Review, 2014.
  • [18] Andrew McGregor. Finding graph matchings in data streams. In International Workshop on Approximation Algorithms for Combinatorial Optimization, pages 170–181. Springer, 2005. doi:10.1007/11538462_15.
  • [19] Robert Meusel, Sebastiano Vigna, Oliver Lehmberg, and Christian Bizer. The graph structure in the web–analyzed on different aggregation levels. The Journal of Web Science, 1, 2015. doi:10.1561/106.00000003.
  • [20] Ami Paz and Gregory Schwartzman. A (2+ ε)-approximation for maximum weight matching in the semi-streaming model. ACM Transactions on Algorithms (TALG), 15(2):1–15, 2018. doi:10.1145/3274668.
  • [21] Erol A Peköz, Adrian Röllin, and Nathan Ross. Total variation error bounds for geometric approximation. Bernoulli, 19(2):610–632, 2013.
  • [22] Alex Pothen, S M Ferdous, and Fredrik Manne. Approximation algorithms in combinatorial scientific computing. Acta Numerica, 28:541–633, 2019. doi:10.1017/S0962492919000035.
  • [23] Mariano Zelke. Weighted matching in the semi-streaming model. Algorithmica, 62(1-2):1–20, 2012. doi:10.1007/S00453-010-9438-5.