Weighted Matching in a Poly-Streaming Model
Abstract
We introduce the poly-streaming model, a generalization of streaming models of computation in which $k$ processors process $k$ data streams containing a total of $N$ items. The algorithm is allowed $O(k \cdot S)$ space, where $S$ is either $o(N)$ or the space bound for a sequential streaming algorithm. Processors may communicate as needed. Algorithms are assessed by the number of passes, per-item processing time, total runtime, space usage, communication cost, and solution quality.
We design a single-pass algorithm in this model for approximating the maximum weight matching (MWM) problem. Given $k$ edge streams and a parameter $\varepsilon > 0$, the algorithm computes a $(2+\varepsilon)$-approximate MWM. We analyze its performance in a shared-memory parallel setting: for any constant $\varepsilon > 0$, it runs in $\tilde{O}(L_{\max} + n)$ time, where $n$ is the number of vertices and $L_{\max}$ is the maximum stream length. It supports $O(1)$ amortized per-edge processing time using $\tilde{O}(n + k)$ space. We further generalize the design to hierarchical architectures, in which the $k$ processors are partitioned into $\lambda$ groups, each with its own shared local memory. The total intergroup communication is $\tilde{O}(\lambda \cdot n)$ bits, while all other performance guarantees are preserved.
We evaluate the algorithm on a shared-memory system using graphs with trillions of edges. It achieves substantial speedups as $k$ increases and produces matchings with weights significantly exceeding the theoretical guarantee. On our largest test graph, it reduces runtime by nearly two orders of magnitude and memory usage by five orders of magnitude compared to an offline algorithm.
Keywords and phrases: Streaming Algorithms, Matchings, Graphs, Parallel Algorithms
Funding: Ahammed Ullah: Research supported by grant SC-0022260 of the Advanced Scientific Computing Research program of the U.S. Department of Energy.
2012 ACM Subject Classification: Mathematics of computing → Graph algorithms; Mathematics of computing → Approximation algorithms
Editors: Anne Benoit, Haim Kaplan, Sebastian Wild, and Grzegorz Herman
Series and Publisher: Leibniz International Proceedings in Informatics (LIPIcs), Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Data-intensive computations arise in data science, machine learning, and science and engineering disciplines. These datasets are often massive, generated dynamically, and, when stored, kept in distributed formats on disks, making them amenable to processing as multiple data streams. This modular structure can be exploited by streaming algorithms designed for tightly coupled shared-memory and distributed-memory multiprocessors to efficiently solve large problem instances that offline algorithms cannot handle due to their high memory requirements. However, the design of parallel algorithms that process multiple data streams concurrently has not yet received much attention.
Current multicore shared-memory processors consist of up to a few hundred cores, organized hierarchically to share caches and memory controllers. These cores compute in parallel to achieve speedups over serial execution. With multiple memory controllers, I/O operations can also proceed in parallel, and this feature can be used to process multiple data streams concurrently. These I/O capabilities and the limitations of offline algorithms motivate a model of computation, illustrated in Figure 1 and discussed next.
The streaming model of computation allows $o(N)$ space for a data stream of size $N$ [2, 13]. For graphs, the semi-streaming model permits $O(n\,\mathrm{polylog}\,n)$ space for a graph with $n$ vertices and an edge stream of arbitrary length [8]. Building on these space-constrained models, we introduce the poly-streaming model. The key aspects of our model are as follows.
We consider $k$ data streams that collectively contain $N$ items. An algorithm has access to $k$ (abstract) processors, and is allowed $O(k \cdot S)$ total space, where $S$ is either $o(N)$ or the space permitted to a single-stream algorithm. In each pass, each stream is assigned to one of the $k$ processors, and each processor independently reads one item at a time from its stream and processes it. Processors may communicate as needed, either via shared or remote memory access. Algorithms are assessed on several metrics: space complexity, number of passes, per-item processing time, total runtime, communication cost, and solution quality.
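To make this setting concrete, the following is a minimal C++/OpenMP sketch of a single-pass poly-streaming computation: $k$ threads each consume one stream, one item at a time, and cooperate through shared state. The Stream and SharedState types and the process_item callback are illustrative placeholders, not part of the model.

```cpp
#include <omp.h>
#include <vector>

// Minimal sketch: k processors, k streams, one pass, shared state.
// Stream must provide `bool next(Item&)`; process_item may read/write shared
// state (subject to whatever synchronization the concrete algorithm uses).
template <typename Stream, typename SharedState, typename Item>
void poly_stream_pass(std::vector<Stream>& streams, SharedState& shared,
                      void (*process_item)(const Item&, int, SharedState&)) {
    const int k = static_cast<int>(streams.size());
    #pragma omp parallel num_threads(k)
    {
        const int i = omp_get_thread_num();   // processor i owns stream i
        Item item;
        while (streams[i].next(item)) {       // read one item at a time
            process_item(item, i, shared);
        }
        #pragma omp barrier                   // end of the pass
    }
}
```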
In the poly-streaming model, we address the problem of approximating a maximum weight matching (MWM) in an edge-weighted graph, where the goal is to find a set of vertex-disjoint edges with maximum total weight. We design an algorithm for approximating an MWM when the graph is presented as multiple edge streams. Our design builds on the algorithm of [20] and adds support for handling multiple streams concurrently. We also generalize the design to NUMA (non-uniform memory access) multiprocessor architectures.
We summarize our contributions to the MWM problem as follows. Let $L_{\max}$ and $L_{\min}$ denote the maximum and minimum lengths of the $k$ input streams, respectively, and let $n$ denote the number of vertices in a graph $G$. For any realization of the CREW PRAM model (such as in Figure 1), we have the following result.
Theorem 1.
For any constant $\varepsilon > 0$, there exists a single-pass poly-streaming algorithm for the maximum weight matching problem that achieves a $(2+\varepsilon)$-approximation. It admits a CREW PRAM implementation with $\tilde{O}(L_{\max} + n)$ runtime (here $\tilde{O}$ hides polylogarithmic factors). If the minimum stream length $L_{\min}$ is sufficiently large, the algorithm achieves $O(1)$ amortized per-edge processing time using $\tilde{O}(n + k)$ space. For arbitrarily balanced streams, it offers a trade-off between space and per-edge processing time, realized by the two edge-processing strategies of Section 3.1.
In NUMA architectures, memory access costs depend on a processor’s proximity to the target memory. We generalize the algorithm in Theorem 1 to account for these cost differences. In particular, we show that when the processors are partitioned into $\lambda$ groups, each with its own shared local memory, the total number of global memory accesses across all groups is $\tilde{O}(\lambda \cdot n)$. This generalization preserves all other performance guarantees from Theorem 1, except that the space bound becomes $\tilde{O}(\lambda \cdot n + k)$. These results are formalized in Theorem 18 in Section 4. This design gives a memory-efficient algorithm for NUMA shared-memory multiprocessors, on which we report empirical results.
We have evaluated our algorithm on a NUMA machine using graphs with billions to trillions of edges. For most of these graphs, our algorithm uses space that is orders of magnitude smaller than that required by offline algorithms. For example, storing the largest graph in our evaluation would require memory far beyond what is available on a single node of our test system, whereas our algorithm used five orders of magnitude less space. Offline matching algorithms typically require even more memory to accommodate their auxiliary data structures.
We employ approximate dual variables that correspond to a linear programming relaxation of MWM to obtain a posteriori upper bounds on the weights of optimal matchings. These bounds allow us to compare the weight of a matching produced by our algorithm with the optimal weight. Thus, we show that our algorithm produces matchings whose weights significantly exceed the approximation guarantee.
As $k$ increases, our algorithm achieves substantial runtime speedups across all graphs in our evaluation, on a NUMA machine with only eight memory controllers. This is significant scaling for a poly-streaming algorithm, given that eight memory controllers are not sufficient to serve the concurrent and random access requests of all processors without delays. Nevertheless, these speedups demonstrate the effectiveness of our design, which accounts for a processor’s proximity to the target memory. A metric less influenced by memory latency suggests that the algorithm would achieve even better speedups on architectures with more efficient memory access.
Note that Theorem 1 and Theorem 18 both guarantee $\tilde{O}(L_{\max} + n)$ runtime. For $L_{\max} = \Omega(n)$, this is tight up to polylogarithmic factors. However, by using more space and allowing higher per-edge processing time, we can even achieve $\tilde{O}(L_{\max})$ runtime, which becomes polylogarithmic for large values of $k$ (see the arXiv version).
2 Preliminaries
For a graph $G = (V, E)$, let $n$ and $m$ denote the number of vertices and edges, respectively. We denote an edge $e$ by the unordered pair $\{u, v\}$ of its endpoints. For a weighted graph, let $w(e)$ denote the weight of edge $e$, and for any subset $F \subseteq E$, define $w(F) = \sum_{e \in F} w(e)$. For $i \in \{1, \dots, k\}$, let $E_i$ be the set of edges received in the $i$th stream. Define $L_i = |E_i|$, and let $L_{\max} = \max_i L_i$ and $L_{\min} = \min_i L_i$.
A matching in a graph is a set of edges that do not share endpoints. A maximum weight matching (MWM) is a matching $M^*$ with maximum total weight; that is, $w(M^*) \ge w(M)$ for all matchings $M$ in $G$.
An $\alpha$-approximation algorithm computes a solution whose value is within a factor $\alpha$ of optimal. The factor $\alpha$ is called the (worst-case) approximation ratio. We assume $\alpha \ge 1$ for both maximization and minimization problems. Thus, for maximization, an $\alpha$-approximation guarantees a solution whose value is at least a $1/\alpha$ fraction of the optimal.
We use the linear programming (LP) relaxation of the MWM problem, and its dual, shown in Figure 2. In the primal LP, each variable $x_e$ is $1$ if edge $e$ is in the matching and $0$ otherwise. Each $y_v$ is a dual variable, and $\delta(v)$ denotes the set of edges incident on a vertex $v$. Let $x$ and $y$ be feasible solutions to the primal and dual LPs, respectively. By weak LP duality, we have $\sum_{e \in E} w(e)\,x_e \le \sum_{v \in V} y_v$. If $x^*$ is an optimal solution to the primal LP, then $w(M^*) \le \sum_{e \in E} w(e)\,x^*_e \le \sum_{v \in V} y_v$. The first inequality holds because the primal LP is a relaxation of the MWM problem.
Figure 2: The primal LP relaxation of the MWM problem and its dual.
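For completeness, the standard formulation of the LP relaxation and its dual depicted in Figure 2 is the following (with primal variables $x_e$ and dual variables $y_v$, as above):

\begin{align*}
\textbf{Primal LP:}\quad & \max \sum_{e \in E} w(e)\, x_e
  && \text{s.t.}\ \sum_{e \in \delta(v)} x_e \le 1 \ \ \forall v \in V,\quad x_e \ge 0 \ \ \forall e \in E,\\
\textbf{Dual LP:}\quad & \min \sum_{v \in V} y_v
  && \text{s.t.}\ y_u + y_v \ge w(e) \ \ \forall e = \{u, v\} \in E,\quad y_v \ge 0 \ \ \forall v \in V.
\end{align*}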
3 Algorithms for Uniform Memory Access Cost
In this section, we present the design and analysis of the algorithm of Theorem 1, which assumes a uniform memory access cost.
3.1 The Algorithm
Several semi-streaming algorithms have been designed for the MWM problem [3, 4, 6, 8, 10, 11, 18, 20, 23] (see the arXiv version for brief descriptions of these algorithms). In this paper, we focus exclusively on the single-pass setting in the poly-streaming model. Our starting point is the algorithm of Paz and Schwartzman [20], which computes a $(2+\varepsilon)$-approximation of MWM. This is currently the best known guarantee in the single-pass setting under arbitrary or adversarial ordering of edges. (No single-pass semi-streaming algorithm can achieve an approximation ratio better than $1 + \ln 2 \approx 1.69$; see [15].) We extend a primal-dual analysis by Ghaffari and Wajc [11] to analyze our algorithm.
The algorithm of Paz and Schwartzman [20] proceeds as follows. Initialize an empty stack $S$ and set $\phi_v \leftarrow 0$ for each vertex $v$. For each edge $e = \{u, v\}$ in the edge stream, skip $e$ if $w(e) \le (1+\varepsilon)(\phi_u + \phi_v)$. Otherwise, compute the gain $g(e) = w(e) - \phi_u - \phi_v$, push $e$ onto the stack $S$, and increase both $\phi_u$ and $\phi_v$ by $g(e)$. After processing all edges, compute a matching greedily by popping edges from $S$.
Note that for each edge $e = \{u, v\}$ pushed onto the stack, the increment satisfies $g(e) > \varepsilon(\phi_u + \phi_v)$. This ensures that both $\phi_u$ and $\phi_v$ increase by a factor of at least $1+\varepsilon$. Hence, the number of edges in the stack incident to any vertex is at most $O(\log_{1+\varepsilon} W) = O(\varepsilon^{-1}\log W)$, where $W$ is the (normalized) maximum edge weight. Therefore, the total number of edges in the stack is $O(n\,\varepsilon^{-1}\log n)$. (Throughout the paper, we assume the normalized maximum edge weight is $\mathrm{poly}(n)$. For arbitrary weights on edges, we can skip any edge whose weight is less than an $\varepsilon/n$ fraction of $\hat{w}$, where $\hat{w}$ denotes the maximum edge weight observed so far in the stream. This ensures that the (normalized) maximum weight the algorithm sees is $O(n/\varepsilon)$, while maintaining a $(2+O(\varepsilon))$ approximation ratio; see [11] for details.)
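Before turning to the poly-streaming version, the following is a minimal C++ sketch of this sequential algorithm. The EdgeStream interface, the data layout, and the assumption that weights are already normalized are illustrative.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

struct Edge { uint32_t u, v; double w; };
struct StackEntry { uint32_t u, v; double gain; };

// One-pass sketch of the sequential (2+eps)-approximate MWM algorithm
// described above; EdgeStream is any type providing `bool next(Edge&)`.
template <typename EdgeStream>
std::vector<std::pair<uint32_t, uint32_t>>
sequential_mwm(EdgeStream& stream, uint32_t n, double eps) {
    std::vector<double> phi(n, 0.0);      // dual variables phi_v
    std::vector<StackEntry> stack;        // holds the retained edges
    Edge e;
    while (stream.next(e)) {
        if (e.w <= (1.0 + eps) * (phi[e.u] + phi[e.v])) continue;  // skip e
        const double gain = e.w - phi[e.u] - phi[e.v];
        stack.push_back({e.u, e.v, gain});
        phi[e.u] += gain;
        phi[e.v] += gain;
    }
    // Post-processing: pop edges and greedily keep those with free endpoints.
    std::vector<char> free_v(n, 1);
    std::vector<std::pair<uint32_t, uint32_t>> matching;
    for (auto it = stack.rbegin(); it != stack.rend(); ++it) {
        if (free_v[it->u] && free_v[it->v]) {
            matching.emplace_back(it->u, it->v);
            free_v[it->u] = free_v[it->v] = 0;
        }
    }
    return matching;
}
```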
PS-MWM: /* each processor $i$ executes this algorithm concurrently */
1. In parallel initialize the matching $M \leftarrow \emptyset$ and the locks $\{\ell_v\}_{v \in V}$, and set $\phi_v \leftarrow 0$ and $\mathrm{free}_v \leftarrow$ true for all $v \in V$ /* processor $i$ initializes or sets its share of the locks/variables */
2. $S_i \leftarrow$ an empty stack /* initialize an empty stack */
3. for each edge $e$ in the $i$th stream do
   (a) Process-Edge($e$)
4. wait for all processors to complete execution of Step 3 /* a barrier */
5. Process-Stack($S_i$)
6. return $M$
To design a poly-streaming algorithm, we begin with a simple version and then refine it. All processors share a global stack and a set of dual variables $\{\phi_v\}$, and each processor runs the above sequential streaming algorithm on its respective stream. To complete and adapt this setup for efficient execution across multiple streams, we must address two interrelated issues: (1) concurrent edge arrivals across streams may lead to contention for the shared stack or variables, and (2) concurrent updates to the shared variables may lead to inconsistencies in their observed values.
A natural approach to addressing these issues is to enforce a fair sequential strategy, where processors access shared resources in a round-robin order. While this ensures progress, it incurs $\Omega(k)$ per-edge processing time, which scales poorly with increasing $k$. Instead, we adopt fine-grained contention resolution that avoids global coordination by allowing processors to operate asynchronously. However, under the initial setup, this can lead to per-edge processing time proportional to the size of the shared stack: a processor may be blocked from accessing shared resources until the stack has accumulated all of its potential edges. We address these limitations with the following design choices.
For the first issue, we observe that a global ordering of edges, as used in the single-stack solution, is not necessary; local orderings within multiple stacks suffice. In particular, we can identify a subset of edges (later referred to as tight edges) for which maintaining local orderings is sufficient to compute a $(2+\varepsilon)$-approximate MWM. Hence, we can localize computation using $k$ stacks, assigning one stack to each processor exclusively during the streaming phase. This design eliminates the contention associated with a shared stack.
However, contention still arises when updating the variables $\{\phi_v\}$. It is unclear how to resolve this contention without using additional space. Hence, we consider two strategies for processing edge streams that illustrate the trade-off between space and per-edge processing time. In the first, which we call the non-deferrable strategy, the decision to include an edge in a stack is made immediately during streaming. In the second, which we call the deferrable strategy, this decision may be deferred to post-processing. The latter strategy requires more space but achieves $O(1)$ per-edge processing time.
To address the second issue, which concerns the potential for inconsistencies due to concurrent updates to the variables $\{\phi_v\}$, we observe that the variables are monotonically increasing and collectively require only $O(n\,\varepsilon^{-1}\log n)$ updates. Thus, for most edges that are not eligible for the stacks, decisions can be made by simply reading the current values of the relevant variables. However, for the edges that are included in the stacks, we must update the corresponding variables. To ensure consistency of these updates, we associate a lock with each variable in $\{\phi_v\}$. We maintain exclusive locks and allow a variable to be updated only after acquiring its corresponding lock. (This corresponds to the concurrent-read exclusive-write (CREW) paradigm of the PRAM model.)
We now outline the non-deferrable strategy of our poly-streaming algorithm for the MWM problem (for the deferrable strategy see the arXiv version). For simplicity, we assume that if a processor attempts to release a lock it did not acquire, the operation has no effect. We also assume that any algorithmic step described with the “in parallel” construct includes an implicit barrier (or synchronization primitive) at the end, synchronizing the processors participating in that step.
Process-Edge($e = \{u, v\}$): /* Assumes access to the global variables $\{\phi_v\}$ and locks $\{\ell_v\}$ */
1. if $w(e) \le (1+\varepsilon)(\phi_u + \phi_v)$ then return
2. repeatedly try to acquire $\ell_u$ and $\ell_v$, in lexicographic order of $u$ and $v$, as long as $w(e) > (1+\varepsilon)(\phi_u + \phi_v)$
3. if $w(e) > (1+\varepsilon)(\phi_u + \phi_v)$ then
   (a) $g(e) \leftarrow w(e) - \phi_u - \phi_v$
   (b) increment $\phi_u$ and $\phi_v$ by $g(e)$
   (c) add $e$ to the top of $S_i$ along with $g(e)$
4. release $\ell_u$ and $\ell_v$, and return
The non-deferrable strategy is presented in Algorithm PS-MWM, with two subroutines used by PS-MWM described in Process-Edge (Figure 4) and Process-Stack (Figure 5). In PS-MWM, Steps 1–2 form the preprocessing phase, Steps 3–4 the streaming phase, and Step 5 the post-processing phase. Each processor executes PS-MWM asynchronously, except that all processors begin the post-processing phase simultaneously (due to Step 4) and then resume asynchronous execution.
In the subroutine Process-Edge, Step 2 ensures that all edges are processed using the non-deferrable strategy: a processor repeatedly attempts to acquire the locks corresponding to the endpoints of an edge until it succeeds, or the edge becomes ineligible for inclusion in a stack. As a result, a processor executing Step 3 has a consistent view of the variables $\phi_u$ and $\phi_v$. In Step 3(c), we store the gain of an edge along with the edge itself for use in the post-processing phase.
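The following C++ sketch illustrates this non-deferrable step. std::mutex stands in for the exclusive per-vertex locks, the names are illustrative, and the unsynchronized reads of the shared duals mirror the concurrent-read assumption of the model.

```cpp
#include <cstdint>
#include <mutex>
#include <utility>
#include <vector>

struct SharedMWM {
    std::vector<double> phi;        // shared dual variables phi_v
    std::vector<std::mutex> lock;   // one exclusive lock per vertex
};
struct StackEntry { uint32_t u, v; double gain; };

// Sketch of Process-Edge (non-deferrable strategy): locks are requested in
// lexicographic order of the endpoints, and the edge is re-checked while
// waiting so that an ineligible edge is abandoned without blocking.
void process_edge(uint32_t u, uint32_t v, double w, double eps,
                  SharedMWM& s, std::vector<StackEntry>& local_stack) {
    if (u > v) std::swap(u, v);                               // fix the order
    if (w <= (1.0 + eps) * (s.phi[u] + s.phi[v])) return;     // Step 1
    for (;;) {                                                // Step 2
        if (s.lock[u].try_lock()) {
            if (s.lock[v].try_lock()) break;                  // got both locks
            s.lock[u].unlock();
        }
        if (w <= (1.0 + eps) * (s.phi[u] + s.phi[v])) return; // no longer eligible
    }
    if (w > (1.0 + eps) * (s.phi[u] + s.phi[v])) {            // Step 3
        const double gain = w - s.phi[u] - s.phi[v];
        s.phi[u] += gain;                                     // Step 3(b)
        s.phi[v] += gain;
        local_stack.push_back({u, v, gain});                  // Step 3(c)
    }
    s.lock[v].unlock();                                       // Step 4
    s.lock[u].unlock();
}
```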
Process-Stack($S_i$): /* Assumes access to the global variables $M$, $\{\phi_v\}$, and $\{\mathrm{free}_v\}$, initialized in Algorithm PS-MWM (or PS-MWM-LD). */
1. if $S_i$ is empty then return
2. while $S_i$ is not empty do
   (a) remove the top edge $e = \{u, v\}$ of $S_i$
   (b) if $e$ is not a tight edge then wait for $e$ to become a tight edge /* $e$ is a tight edge if $\phi_u + \phi_v = w(e) + g(e)$ */
   (c) if both $\mathrm{free}_u$ and $\mathrm{free}_v$ are set to true then /* no locking is needed since $e$ is a tight edge */
      i. $M \leftarrow M \cup \{e\}$
      ii. set $\mathrm{free}_u$ and $\mathrm{free}_v$ to false
   (d) decrement $\phi_u$ and $\phi_v$ by $g(e)$
3. return
When all processors are ready to execute Step 5 of PS-MWM, the stacks collectively contain all the edges needed to construct a $(2+\varepsilon)$-approximate MWM, which can be obtained in several ways. In the subroutine Process-Stack, we outline a simple approach based on local edge orderings. We define an edge $e = \{u, v\}$ in a stack to be a tight edge if $\phi_u + \phi_v = w(e) + g(e)$. Equivalently, an edge is tight if and only if all of its neighboring edges that were included after it in any stack have already been removed. Any set of tight edges can be processed concurrently, regardless of their positions in the stacks. In Process-Stack, we simultaneously process the tight edges that appear at the tops of the stacks.
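A C++ sketch of Process-Stack for one stack follows. The tight-edge test uses the restored-dual characterization given above; the busy-wait and the plain shared accesses are simplifications of a real implementation (which would, for example, collect matched edges per processor and merge them at the end).

```cpp
#include <cstdint>
#include <utility>
#include <vector>

struct StackedEdge { uint32_t u, v; double w, gain; };

// Sketch of Process-Stack for one processor's stack S_i.  phi, free_v, and M
// are shared across processors; tight edges are vertex-disjoint, so matched
// vertices are touched by at most one processor at a time.
void process_stack(std::vector<StackedEdge>& S,
                   std::vector<double>& phi, std::vector<char>& free_v,
                   std::vector<std::pair<uint32_t, uint32_t>>& M) {
    auto is_tight = [&](const StackedEdge& e) {
        // all followers of e removed  <=>  duals restored to w(e) + g(e)
        return phi[e.u] + phi[e.v] <= e.w + e.gain;
    };
    while (!S.empty()) {
        StackedEdge e = S.back();
        S.pop_back();                            // top edge of S_i
        while (!is_tight(e)) { /* wait: followers remain in other stacks */ }
        if (free_v[e.u] && free_v[e.v]) {        // no locking: e is tight and
            M.emplace_back(e.u, e.v);            // tight edges are vertex-disjoint
            free_v[e.u] = free_v[e.v] = 0;
        }
        phi[e.u] -= e.gain;                      // undo e's contribution
        phi[e.v] -= e.gain;
    }
}
```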
3.2 Analyses
We now formally characterize several correctness properties of the algorithm and analyze its performance. These correctness properties include the absence of deadlock, livelock, and starvation. The performance metrics are space usage, approximation ratio, per-edge processing time, and total runtime.
To simplify the analysis, we assume that processors operate in a quasi-synchronous manner. In particular, to analyze Step 3 of Algorithm PS-MWM, we define an algorithmic superstep as a unit comprising a constant number of elementary operations.
Definition 2 (Superstep).
A processor takes one superstep for an edge if it executes Process-Edge with at most one iteration of the loop in Step 2 (i.e., without contention), requiring $O(1)$ elementary operations. Each additional iteration of Step 2 due to contention adds one superstep, with each such iteration also requiring $O(1)$ operations.
Definition 3 (Effective Iterations).
The number of effective iterations is the maximum number of supersteps taken by any processor during the execution of Step 3 of Algorithm PS-MWM.
Note that for $k = 1$, the number of effective iterations equals the number of edges in the stream. Using this notion, we align the supersteps of different processors and define the following directed graph.
Definition 4 ($D_t$).
For the $t$th effective iteration, consider the set of edges processed across all $k$ streams. Let $e_i^t = (u_i^t, v_i^t)$ denote the edge processed in the $i$th stream, where $u_i^t$ precedes $v_i^t$ in the lexicographic ordering of the vertices. If processor $i$ is idle in the $t$th iteration, then $e_i^t$ is undefined and excluded. Define $D_t = (V_t, A_t)$, where $V_t = \{e_i^t : i \in [k]\}$ and $A_t$ contains an arc from $e_i^t$ to $e_j^t$ whenever the two edges share an endpoint and $e_j^t$ precedes $e_i^t$ in the lexicographic ordering.
The following property of $D_t$ is straightforward to verify.
Proposition 5.
$D_t$ is a directed acyclic graph.
We show that Algorithm PS-MWM is free from deadlock, livelock, and starvation. Deadlock occurs when a set of processors forms a cyclic dependency, with each processor waiting for a resource held by another. Livelock occurs when a set of processors repeatedly form such a cycle, where each processor continually acquires and releases resources without making progress. Starvation occurs when a processor waits indefinitely for a resource because other processors repeatedly acquire it first. The following lemma shows that the streaming phase of Algorithm PS-MWM is free from deadlock, livelock, and starvation.
Lemma 6.
The concurrent executions of the subroutine Process-Edge are free from deadlock, livelock, and starvation.
Proof.
Since the variables $\{\phi_v\}$ are updated only while holding their corresponding locks, we treat the locks as the only shared resources in Process-Edge.
Let $D_t$ be the graph defined in Definition 4. By Proposition 5, $D_t$ is a directed acyclic graph (DAG), and hence each of its components is also a DAG.
To reason about cyclic dependencies, we focus on components of $D_t$ involving processors executing Step 2 of Process-Edge. Every DAG contains at least one vertex with no outgoing arcs. Thus, each such component includes an edge $e_i^t$ that waits on no other edge in the component, so processor $i$ can make progress. This precludes the possibility of cyclic dependencies; that is, the concurrent executions of Process-Edge are free from deadlock and livelock.
To show that starvation does not occur, suppose an edge $e = e_i^t$ appears in every effective iteration from $t$ onward; that is, processor $i$ remains blocked on $e$. We show that the number of such iterations is bounded, which bounds the number of supersteps that processor $i$ may spend attempting to acquire locks for $e$. Step 2 requires one superstep per iteration, while all other steps collectively require at most one. For each such iteration $t'$, the component of $D_{t'}$ containing $e$ has at least one vertex with no outgoing arc. This guarantees that at least one edge in that component acquires its locks and completes Step 3 during the $t'$th effective iteration. Since Step 3 can increment the values in $\{\phi_v\}$ for at most $O(n\,\varepsilon^{-1}\log n)$ edges over the entire execution, the number of iterations for which $e$ may remain blocked is also bounded by $O(n\,\varepsilon^{-1}\log n)$.
To analyze Step 5 of Algorithm PS-MWM, we adopt the same simplification: processors are assumed to operate in a quasi-synchronous manner. Accordingly, we define $S^{(t)}$ as the set of edges present in the stacks at the beginning of iteration $t$ of Step 2 in Process-Stack. The following definition is useful for characterizing tight edges via an equivalent notion.
Definition 7 (Follower).
An edge $e'$ is a follower of an edge $e$ if $e' \cap e \neq \emptyset$ and $e'$ is added to some stack $S_j$ after $e$ is added to some stack $S_i$. We denote the set of followers of an edge $e$ by $F(e)$.
The proofs of the following four lemmas are included in the arXiv version. The fourth lemma establishes that the post-processing phase of PS-MWM is free from deadlock, livelock, and starvation.
Lemma 8.
An edge $e \in S^{(t)}$ is a tight edge if and only if $F(e) \cap S^{(t)} = \emptyset$.
Lemma 9.
Let $T^{(t)} \subseteq S^{(t)}$ be the set of top edges in the stacks at the beginning of iteration $t$ of Step 2 of Process-Stack. Then $T^{(t)}$ contains at least one tight edge.
Lemma 10.
The set of tight edges in $S^{(t)}$ is vertex-disjoint.
Lemma 11.
The concurrent executions of the subroutine Process-Stack are free from deadlock, livelock, and starvation.
We now analyze the performance metrics of the algorithm.
Lemma 12.
For any constant $\varepsilon > 0$, the space complexity and worst-case per-edge processing time of Algorithm PS-MWM are $O(n\,\varepsilon^{-1}\log n + k)$ and $O(n\,\varepsilon^{-1}\log n)$, respectively. Furthermore, for $L_{\min} = \Omega(n\,\varepsilon^{-1}\log n)$, the amortized per-edge processing time of the algorithm is $O(1)$.
Proof.
The claimed space bound follows from three components: $O(n)$ space for the variables and locks, $O(n\,\varepsilon^{-1}\log n)$ space for the stacked edges, and $O(1)$ space per processor.
The worst-case per-edge processing time follows from the second part of the proof of Lemma 6.
Processor $i$ processes $L_i$ edges, each requiring at least one distinct effective iteration (see Definition 3). Additional iterations may arise when it repeatedly attempts to acquire locks in Step 2 of Process-Edge. From the second part of the proof of Lemma 6, the total number of such additional iterations is bounded by $O(n\,\varepsilon^{-1}\log n)$. This implies that to process $L_i$ edges, a processor uses $L_i + O(n\,\varepsilon^{-1}\log n)$ supersteps. Therefore, the amortized per-edge processing time is bounded as follows.
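Writing $B$ for the bound from the proof of Lemma 6 on the total number of additional (blocked) iterations of a processor, a sketch of the computation:

\[
\frac{L_i + O(B)}{L_i} \;=\; 1 + O\!\left(\frac{B}{L_i}\right) \;=\; O(1) \quad \text{whenever } L_i = \Omega(B), \qquad \text{where } B = O(n\,\varepsilon^{-1}\log n).
\]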
Note that the amortized per-edge processing time is computed over the edges of an individual stream, not over the total number of edges across all streams. While both forms of amortization are meaningful for poly-streaming algorithms, our analysis is more practically relevant, as it reflects the cost incurred per edge arrival within a single stream.
Lemma 13.
For any constant $\varepsilon > 0$, Algorithm PS-MWM takes $\tilde{O}(L_{\max} + n)$ time.
Proof.
The preprocessing phase (Steps 1–2) takes $O(n)$ time.
To process $L_i$ edges, processor $i$ takes $L_i + O(n\,\varepsilon^{-1}\log n)$ supersteps (see the proof of Lemma 12). Since $L_i \le L_{\max}$ for all $i$, the time required for Step 3 is $O(L_{\max} + n\,\varepsilon^{-1}\log n)$.
At the beginning of Step 5, the total number of edges in the stacks is $O(n\,\varepsilon^{-1}\log n)$. By Lemma 9, each iteration of Step 2 of Process-Stack removes at least one edge from the stacks. Hence, the time required for Step 5 is $O(n\,\varepsilon^{-1}\log n)$.
The claim now follows by summing the time spent across all three phases.
Now, using the characterizations of tight edges, we extend the duality-based analysis of [11] to our algorithm. Let $g(e)$ denote the change in $\phi_u$ (equivalently, $\phi_v$) resulting from processing an edge $e = \{u, v\}$ in Step 3 of Process-Edge. If an edge $e$ is not included in a stack, then $g(e) = 0$, either because it fails the condition in Step 1 or Step 3 of Process-Edge. It follows that $\sum_{v \in V} \phi_v = 2 \sum_{e \in E} g(e)$ after the streaming phase. For an edge $e$ that is included in some stack, let $P(e)$ denote the set of edges that share an endpoint with $e$ and are included in some stack no later than $e$ itself (including $e$). The following two results are immediate from Observation 3.2 and Lemma 3.4 of [11].
Proposition 14.
Any edge $e$ added to some stack satisfies the inequality $w(e) \ge \sum_{e' \in P(e)} g(e')$.
Proposition 15.
After all processors complete Step 3 of Algorithm PS-MWM, the variables $\{\phi_v\}$, scaled by a factor of $(1+\varepsilon)$, form a feasible solution to the dual LP in Figure 2.
Lemma 16.
Let $M^*$ be a maximum weight matching in $G$. The matching $M$ returned by Algorithm PS-MWM satisfies $w(M^*) \le 2(1+\varepsilon)\, w(M)$.
Proof.
We only process tight edges in Process-Stack. By Lemma 10, tight edges are vertex-disjoint, and hence their independent processing does not interfere with their inclusion in $M$.
By Lemma 8, an edge $e$ included in $M$ must satisfy $F(e) \cap S^{(t)} = \emptyset$ at the time it is processed. Consider any edge $e' \in P(e)$ with $e' \neq e$. Since $e$ is added to a stack after $e'$, we have $e \in F(e')$, which means $e'$ is not a tight edge before $e$ is processed. Thus, when $e$ is selected for inclusion in $M$, none of the edges in $P(e) \setminus \{e\}$ is tight. Hence, all edges of $P(e)$ are in the stacks when we are about to process $e$. Therefore, the total gain contributed by edges of $P(e)$ can be attributed to the weight of $e$, and by Proposition 14, we have $w(M) = \sum_{e \in M} w(e) \ge \sum_{e \in M} \sum_{e' \in P(e)} g(e') \ge \sum_{e \in E} g(e)$.
Let $x^*$ be an optimal solution to the primal LP in Figure 2. By Proposition 15 and LP duality, we obtain the chain of inequalities below.
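A sketch of the resulting chain, following the primal-dual analysis of [11] (the scaling factor is the one from Proposition 15, and $\varepsilon$ may be rescaled to match the stated approximation ratio):

\[
w(M^*) \;\le\; \sum_{e \in E} w(e)\, x^*_e \;\le\; (1+\varepsilon) \sum_{v \in V} \phi_v \;=\; 2(1+\varepsilon) \sum_{e \in E} g(e) \;\le\; 2(1+\varepsilon)\, w(M).
\]

The equality uses the fact that every stacked edge increases the duals of both of its endpoints by its gain (and $g(e) = 0$ for every other edge), and the last inequality combines Proposition 14 with the charging argument above.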
Algorithm PS-MWM uses only one pass over the streams. Theorem 1 now follows by combining the results in Lemma 12, Lemma 13, Lemma 16, and the analysis of the deferrable strategy sketched in the arXiv version.
4 Algorithms for Non-Uniform Memory Access Costs
In this section, we extend the algorithm from Section 3 to account for the non-uniform memory access (NUMA) costs present in real-world machines.
In a poly-streaming algorithm, each processor may receive an arbitrary subset of the input, making it difficult to maintain memory access locality. Modern shared-memory machines, as illustrated in Figure 1, have non-uniform memory access costs and far fewer memory controllers than processors [12]. As a result, memory systems with such limitations would struggle to handle the high volume of concurrent, random memory access requests generated by poly-streaming algorithms, leading to significant delays.
PS-MWM-LD: /* each processor $i$ in group $j$ executes this algorithm concurrently */
1. In parallel initialize the matching $M \leftarrow \emptyset$ and the global locks $\{\ell_v\}_{v \in V}$, and set $\phi_v \leftarrow 0$ and $\mathrm{free}_v \leftarrow$ true for all $v \in V$ /* processor $i$ initializes or sets its share of the locks/variables */
2. In parallel initialize the local locks $\{\ell^j_v\}_{v \in V}$, and set $\phi^j_v \leftarrow 0$ for all $v \in V$ /* processors in group $j$ initialize or set their group's locks/variables */
3. In parallel initialize the group lock $\ell^j$ /* one processor initializes $\ell^j$ for group $j$ */
4. $S_i \leftarrow$ an empty stack /* initialize an empty stack */
5. for each edge $e$ in the $i$th stream do
   (a) Process-Edge-LD($e$)
6. wait for all processors to complete execution of Step 5 /* a barrier */
7. Process-Stack($S_i$)
8. return $M$
We now describe a generalization of the algorithm from Section 3 that localizes a significant portion of each processor’s memory access to its near memory. This generalization applies to both edge-processing strategies introduced in Section 3.1. We focus on the non-deferrable strategy. (The deferrable strategy generalizes in the same way, following the same relationship between the two strategies as in the specialized case.)
The runtime of Process-Edge is dominated by the time to access the dual variables $\{\phi_v\}$. By assigning a dedicated stack to each processor, we have substantially localized the accesses associated with edges in that stack. However, since a large fraction of edges is typically not included in the stacks, the runtime remains dominated by accesses to dual variables associated with these discarded edges. We therefore describe an algorithm that localizes these accesses to memory near the processor.
To localize accesses to the dual variables $\{\phi_v\}$, we observe that these variables increase monotonically during the streaming phase. This observation motivates a design in which a subset of processors maintains local copies of the variables and can discard a substantial number of edges without synchronizing with the global copy. When a processor includes an edge in its stack, it increments the corresponding dual variables in the global copy by the gain of the edge and synchronizes its local copy accordingly. As a result, some local copies may lag behind the global copy, but they can be synchronized when needed.
A general scheme for allocating dual variables is as follows. The set of $k$ processors is partitioned into $\lambda$ groups. For simplicity, we assume that $k$ is a multiple of $\lambda$, so each group contains exactly $k/\lambda$ processors. In addition to a global copy $\phi$ of the dual variables, we maintain $\lambda$ local copies $\phi^1, \dots, \phi^\lambda$, one for each group. Group $j$ consists of the processors $(j-1)\,k/\lambda + 1, \dots, j\,k/\lambda$, and uses $\phi^j$ as its local copy of the dual variables. Algorithm PS-MWM is the special case $\lambda = 1$, where all processors operate using only the global copy of the dual variables.
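A minimal C++ sketch of this allocation scheme (0-based indices and the names are illustrative; on a NUMA machine each group would allocate its local copy in its own near memory, e.g., via first-touch placement):

```cpp
#include <cstdint>
#include <vector>

// One global copy of the dual variables plus one local copy per group.
struct DualLayout {
    std::vector<double> phi_global;              // shared by all k processors
    std::vector<std::vector<double>> phi_local;  // phi_local[j]: copy of group j
};

DualLayout make_dual_layout(uint32_t n, uint32_t lambda) {
    return DualLayout{std::vector<double>(n, 0.0),
                      std::vector<std::vector<double>>(lambda,
                          std::vector<double>(n, 0.0))};
}

// Processor i belongs to group i / (k / lambda), assuming lambda divides k.
inline uint32_t group_of(uint32_t i, uint32_t k, uint32_t lambda) {
    return i / (k / lambda);
}
```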
Algorithm PS-MWM-LD, along with its subroutine Process-Edge-LD, incorporates local dual variables in addition to global ones. In Step 2, processors in each group collectively initialize their group’s local copies of dual variables and locks, followed by initializing a group lock in Step 3. All other steps of the algorithm are identical to those in PS-MWM.
Process-Edge-LD($e = \{u, v\}$): /* Assumes access to $\{\phi_v\}$, $\{\ell_v\}$, $\{\phi^j_v\}$, $\{\ell^j_v\}$, and $\ell^j$ */
1. if $w(e) \le (1+\varepsilon)(\phi^j_u + \phi^j_v)$ then return
2. repeatedly try to acquire $\ell^j_u$ and $\ell^j_v$, in lexicographic order of $u$ and $v$, as long as $w(e) > (1+\varepsilon)(\phi^j_u + \phi^j_v)$
3. if $w(e) \le (1+\varepsilon)(\phi^j_u + \phi^j_v)$ then release $\ell^j_u$ and $\ell^j_v$, and return
4. repeatedly try to acquire $\ell^j$
5. Process-Edge($e$) /* the non-deferrable step on the global variables */
6. $\phi^j_u \leftarrow \phi_u$ and $\phi^j_v \leftarrow \phi_v$ /* synchronization of local and global dual variables */
7. release $\ell^j$, $\ell^j_u$, and $\ell^j_v$, and return
In the subroutine Process-Edge-LD, Step 5 implements the non-deferrable strategy. Steps 1–3 and Step 6 enforce the localization of access to the dual variables. Steps 2–3 ensure that, at any given time, each global dual variable is accessed by at most one processor per group; we refer to this processor as the delegate of the group for that variable. Thus, a processor executing Steps 4–6 serves as a delegate of its group for that time. In Step 6, after completing updates to the global variables, the delegate synchronizes its group's local copy in $O(1)$ time. As a result, the waiting time on a local variable in Step 2 is bounded by the total time spent by the corresponding delegates, up to constant factors.
The delegates in each group handle vertex-disjoint edges, so concurrent executions of Step 6 would have been safe. However, the lock in Step 4 ensures that at most one delegate per group executes Step 5. Regardless of these design choices, the behavior of the delegates executing Step 5 concurrently mirrors that of processors competing for exclusive access to global dual variables in PS-MWM.
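A C++ sketch of the delegate mechanism for one group is given below. std::scoped_lock stands in for the lock acquisitions, process_edge_global stands for the non-deferrable step of PS-MWM on the global variables (Step 5), and the names are illustrative.

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

// Per-group state: a local copy of the duals that may lag behind the global
// copy, per-vertex local locks, and a group lock for the delegate.
struct GroupState {
    std::vector<double> phi_local;
    std::vector<std::mutex> lock_local;
    std::mutex group_lock;
};

// Sketch of Process-Edge-LD for an edge {u, v} with u < v.
void process_edge_ld(uint32_t u, uint32_t v, double w, double eps,
                     GroupState& g, const std::vector<double>& phi_global,
                     void (*process_edge_global)(uint32_t, uint32_t, double)) {
    if (w <= (1.0 + eps) * (g.phi_local[u] + g.phi_local[v])) return;    // Step 1
    std::scoped_lock local(g.lock_local[u], g.lock_local[v]);            // Step 2
    if (w <= (1.0 + eps) * (g.phi_local[u] + g.phi_local[v])) return;    // Step 3
    {
        std::scoped_lock delegate(g.group_lock);                         // Step 4
        process_edge_global(u, v, w);                                    // Step 5
        g.phi_local[u] = phi_global[u];                                  // Step 6: sync
        g.phi_local[v] = phi_global[v];                                  //  local copy
    }
}                                                                        // Step 7: release
```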
The following lemma highlights the benefit of using Algorithm PS-MWM-LD.
Lemma 17.
For any constant $\varepsilon > 0$, in the streaming phase of Algorithm PS-MWM-LD, processors in all $\lambda$ groups collectively access global variables a total of $O(\lambda\, n\, \varepsilon^{-1}\log n)$ times.
In contrast to the result in Lemma 17, the streaming phase of Algorithm PS-MWM accesses global variables at least $N$ times, and potentially more under contention.
Algorithm PS-MWM-LD, together with the generalization of the deferrable strategy, leads to the following result (a proof is included in the arXiv version).
Theorem 18.
Let $k$ processors be partitioned into $\lambda$ groups, each with its own shared local memory. For any constant $\varepsilon > 0$, there exists a single-pass poly-streaming algorithm for the maximum weight matching problem that achieves a $(2+\varepsilon)$-approximation. It admits a CREW PRAM implementation with $\tilde{O}(L_{\max} + n)$ runtime. If the minimum stream length $L_{\min}$ is sufficiently large, the algorithm achieves $O(1)$ amortized per-edge processing time using $\tilde{O}(\lambda\, n + k)$ space. For arbitrarily balanced streams, it offers the same space versus per-edge processing time trade-off as the algorithm of Theorem 1. The processors collectively access the global memory $\tilde{O}(\lambda\, n)$ times.
5 Empirical Evaluation
This section summarizes our evaluation results for Algorithm PS-MWM-LD. Detailed datasets, experimental setup, and additional comparisons (including with PS-MWM) are provided in the arXiv version.
5.1 Datasets
Table 1: Summary of the graph collections used in our evaluation (each collection consists of eight graphs).

| Graph Collection | # of Edges (in billions) |
|---|---|
| The SSW graphs | |
| The BA graphs | |
| The ER graphs | |
| The UA-dv graphs | |
| The UA graphs | |
| The ER-dv graphs | |
Table 1 summarizes our datasets. Each collection consists of eight graphs, with edge counts ranging from one billion to four trillion. To the best of our knowledge, these represent some of the largest graphs for which matchings have been reported in the literature. Exact and approximate offline MWM algorithms (see [22]) would exceed available memory on the larger graphs. The first class (SSW) consists of six of the largest graphs from the SuiteSparse Matrix collection [5] and two from the Web Data Commons [19], which includes the largest publicly available graph dataset. Other classes include synthetic graphs generated from the Barabási–Albert (BA), Uniform Attachment (UA), and Erdős–Rényi (ER) models [1, 7, 21].
5.2 Experimental Setup
We ran all experiments on a community cluster called Negishi [17], where each node has an AMD Milan processor with 128 cores running at 2.2 GHz, 256–1024 GB of memory, and the Rocky Linux 8 operating system version 8.8. The cores are organized in a hierarchy: groups of eight cores constitute a core complex that share an L3 cache. Eight core complexes form a socket, and they share four dual-channel memory controllers; two sockets constitute a Milan node [12]. Memory access within a socket is approximately three times faster than across sockets.
We implemented the algorithms in C++ and compiled the code using the gcc compiler (version 12.2.0) with the -O3 optimization flag. For shared-memory parallelism, we used the OpenMP library. All experiments used one processor (thread) per stream. Reported values are the average over five runs.
5.3 Space
Figure 8: Space usage of our algorithm (panels: the SSW graphs, the BA graphs, the ER graphs).
Figure 8 summarizes the space usage of our algorithm. For $k = 1$, the algorithm of Paz and Schwartzman [20], we store one copy of the dual variables, one stack, and a matching. For $k > 1$, our algorithm stores $\lambda + 1$ copies of the dual variables (global and local), $k$ stacks, a matching, and the locks. We choose the values of $\lambda$ based on the number of memory controllers and the number of streams.
The maximum space used by our algorithm occurs for the web graph WDC_2012. In comparison, storing this graph in compressed sparse row (CSR) format would require far more memory; the gap is even larger for the largest graph in our datasets (ER1_4096), for which our algorithm used orders of magnitude less space than CSR storage would require.
5.4 Solution Quality
Figure 9: min-OPT percent obtained by different algorithms (panels: the SSW graphs, the BA graphs, the ER graphs).
In the arXiv version, we describe different ways to get a posteriori upper bounds on the weight $w(M^*)$ of an MWM, using the values of the dual variables. Let $U_{\min}$ denote the minimum value of these upper bounds. If $M$ is a matching in the graph returned by any algorithm, then we have $w(M) \le w(M^*) \le U_{\min}$. Hence, $w(M)/U_{\min}$ gives a lower bound on the fraction of the maximum weight obtained by $M$. We use min-OPT percent to denote the fraction $100 \cdot w(M)/U_{\min}$.
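One such bound follows from Proposition 15 and weak duality, evaluated with the final values of the dual variables after the streaming phase:

\[
w(M^*) \;\le\; (1+\varepsilon) \sum_{v \in V} \phi_v .
\]

$U_{\min}$ is then the smallest of the upper bounds available for a given run.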
Figure 9 shows the min-OPT percent obtained by different algorithms. In the arXiv version, we describe four dual update rules as alternatives to the default rule used in Steps 3(a)–(b) of Process-Edge. The values under $k = 1$ and $k > 1$ use the default rule, and the values under ALG-d use the best result among the four new dual update rules. For perspective, we include the min-OPT percent obtained by the sequential 6-approximate streaming algorithm of Feigenbaum et al. [8], denoted ALG-s.
The results under $k = 1$ and $k > 1$ show that, in terms of solution quality, our poly-streaming algorithm is on par with the single-stream algorithm of [20]. The values under ALG-d indicate further potential improvements using alternative dual update rules. The comparison with ALG-s supports our choice of the algorithm from [20] over other simple algorithms, such as that of [8]. The arXiv version contains comparisons with an offline algorithm and details on the dual update rules.
5.5 Runtime
Figure 10: Runtime speedups (panels: the SSW graphs, the BA graphs, the ER graphs).
We report runtime-based speedups, computed from the total time across all three phases of PS-MWM-LD (preprocessing, streaming, and post-processing). Figure 10 shows these speedups. For the largest value of $k$, we observe substantial speedups for the SSW graphs, the BA graphs, and the ER graphs.
Due to the significant memory bottlenecks (discussed in Section 4), we also report speedups w.r.t. effective iterations (Definition 3), which are less affected by such bottlenecks. The speedup w.r.t. effective iterations is the ratio of the metric for one stream to that for $k$ streams. For the largest value of $k$, we again observe substantial speedups for the SSW graphs, the BA graphs, and the ER graphs. These results indicate that shared variable access incurs no noticeable contention among processors. As a result, we expect even better runtime improvements on systems with more memory controllers or better support for remote memory access.
Figure 11 shows the runtimes for different graphs, decomposed into the three phases, for $k = 1$ and for the largest value of $k$. The plots report the absolute time savings achieved by processing multiple streams concurrently. The geometric mean of the runtimes across these graphs drops substantially when moving from single-stream to poly-stream processing; for the largest graph (ER1_4096), single-stream processing took far longer than poly-stream processing.
6 Conclusion
While numerous studies have focused on optimizing time (in parallel computing) or space (in streaming algorithms) in isolation, the poly-streaming model offers a practically relevant paradigm for exploring how time and space can be jointly optimized. It fills a gap by providing a formal framework for analyzing algorithmic design choices and their associated time-space trade-offs. Our study of matchings illustrates both the benefits of this paradigm and its practical relevance.
The simplicity of our matching algorithm and its generalization reflect our choice to adopt the design of [20]. We believe this principle will inspire the development of other poly-streaming algorithms. To this end, we note that [20] has also motivated simple algorithms for related problems, such as matchings with submodular objectives [16], $b$-matchings [14], and collections of disjoint matchings [9].
References
- [1] Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47, 2002.
- [2] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the Twenty-eighth Annual ACM Symposium on Theory of Computing, pages 20–29, 1996. doi:10.1145/237814.237823.
- [3] Sepehr Assadi. A simple $(1-\varepsilon)$-approximation semi-streaming algorithm for maximum (weighted) matching. In 2024 Symposium on Simplicity in Algorithms (SOSA), pages 337–354. SIAM, 2024. doi:10.1137/1.9781611977936.31.
- [4] Michael Crouch and Daniel M Stubbs. Improved streaming algorithms for weighted matching, via unweighted matching. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2014), pages 96–104. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2014. doi:10.4230/LIPICS.APPROX-RANDOM.2014.96.
- [5] Timothy A Davis and Yifan Hu. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS), 38(1):1–25, 2011. doi:10.1145/2049662.2049663.
- [6] Leah Epstein, Asaf Levin, Julián Mestre, and Danny Segev. Improved approximation guarantees for weighted matching in the semi-streaming model. SIAM Journal on Discrete Mathematics, 25(3):1251–1265, 2011. doi:10.1137/100801901.
- [7] Paul Erdős and Alfréd Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci., 5:17–60, 1960.
- [8] Joan Feigenbaum, Sampath Kannan, Andrew McGregor, Siddharth Suri, and Jian Zhang. On graph problems in a semi-streaming model. Theoretical Computer Science, 348(2-3):207–216, 2005. doi:10.1016/J.TCS.2005.09.013.
- [9] S M Ferdous, Bhargav Samineni, Alex Pothen, Mahantesh Halappanavar, and Bala Krishnamoorthy. Semi-streaming algorithms for weighted k-disjoint matchings. In 32nd Annual European Symposium on Algorithms (ESA 2024), pages 53–1. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024.
- [10] Buddhima Gamlath, Sagar Kale, Slobodan Mitrovic, and Ola Svensson. Weighted matchings via unweighted augmentations. In Proceedings of the 2019 ACM Symposium on Principles of Distributed Computing, pages 491–500, 2019. doi:10.1145/3293611.3331603.
- [11] Mohsen Ghaffari and David Wajc. Simplified and space-optimal semi-streaming $(2+\varepsilon)$-approximate matching. In Symposium on Simplicity in Algorithms, volume 69, 2019.
- [12] NASA High-End Computing Capability (HECC). AMD Milan Processors, 2024. Accessed: 2025-07-07. URL: https://www.nas.nasa.gov/hecc/support/kb/amd-milan-processors_688.html.
- [13] Monika Rauch Henzinger, Prabhakar Raghavan, and Sridhar Rajagopalan. Computing on data streams. External Memory Algorithms, 50:107–118, 1998. doi:10.1090/DIMACS/050/05.
- [14] Chien-Chung Huang and François Sellier. Semi-streaming algorithms for submodular function maximization under b-matching, matroid, and matchoid constraints. Algorithmica, 86(11):3598–3628, 2024. doi:10.1007/S00453-024-01272-X.
- [15] Michael Kapralov. Space lower bounds for approximating maximum matching in the edge arrival model. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1874–1893. SIAM, 2021. doi:10.1137/1.9781611976465.112.
- [16] Roie Levin and David Wajc. Streaming submodular matching meets the primal-dual method. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1914–1933. SIAM, 2021. doi:10.1137/1.9781611976465.114.
- [17] Gerry McCartney, Thomas Hacker, and Baijian Yang. Empowering faculty: A campus cyberinfrastructure strategy for research communities. Educause Review, 2014.
- [18] Andrew McGregor. Finding graph matchings in data streams. In International Workshop on Approximation Algorithms for Combinatorial Optimization, pages 170–181. Springer, 2005. doi:10.1007/11538462_15.
- [19] Robert Meusel, Sebastiano Vigna, Oliver Lehmberg, and Christian Bizer. The graph structure in the web–analyzed on different aggregation levels. The Journal of Web Science, 1, 2015. doi:10.1561/106.00000003.
- [20] Ami Paz and Gregory Schwartzman. A $(2+\varepsilon)$-approximation for maximum weight matching in the semi-streaming model. ACM Transactions on Algorithms (TALG), 15(2):1–15, 2018. doi:10.1145/3274668.
- [21] Erol A Peköz, Adrian Röllin, and Nathan Ross. Total variation error bounds for geometric approximation. Bernoulli, 19(2):610–632, 2013.
- [22] Alex Pothen, S M Ferdous, and Fredrik Manne. Approximation algorithms in combinatorial scientific computing. Acta Numerica, 28:541–633, 2019. doi:10.1017/S0962492919000035.
- [23] Mariano Zelke. Weighted matching in the semi-streaming model. Algorithmica, 62(1-2):1–20, 2012. doi:10.1007/S00453-010-9438-5.
