GreedyML: A Parallel Algorithm for Maximizing Constrained Submodular Functions
Abstract
We describe a parallel approximation algorithm for maximizing monotone submodular functions subject to hereditary constraints on distributed memory multiprocessors. Our work is motivated by the need to solve submodular optimization problems on massive data sets, for practical contexts such as data summarization, machine learning, and graph sparsification.
Our work builds on the randomized distributed RandGreeDI algorithm, proposed by Barbosa, Ene, Nguyen, and Ward (2015). This algorithm computes a distributed solution by randomly partitioning the data among all the processors and then employing a single accumulation step in which all processors send their partial solutions to one processor. However, for large problems, the accumulation step exceeds the memory available on a processor, and the processor which performs the accumulation becomes a computational bottleneck.
Hence we propose a generalization of the RandGreeDI algorithm that employs multiple accumulation steps to reduce the memory required. We analyze the approximation ratio and the time complexity of the algorithm (in the BSP model). We evaluate the new GreedyML algorithm on three classes of problems, and report results from large-scale data sets with millions of elements. The results show that the GreedyML algorithm can solve problems where the sequential Greedy and distributed RandGreeDI algorithms fail due to memory constraints. For certain computationally intensive problems, the GreedyML algorithm is faster than the RandGreeDI algorithm. The observed approximation quality of the solutions computed by the GreedyML algorithm closely matches those obtained by the RandGreeDI algorithm on these problems.
Keywords and phrases: Combinatorial optimization, submodular functions, distributed algorithms, approximation algorithms, data summarization
Funding: S M Ferdous: Laboratory Directed Research and Development Program at PNNL.
2012 ACM Subject Classification: Computing methodologies → Distributed algorithms; Theory of computation → Distributed algorithms; Theory of computation → Facility location and clustering; Theory of computation → Packing and covering problems; Theory of computation → Nearest neighbor algorithms; Theory of computation → Divide and conquer; Theory of computation → Sparsification and spanners; Theory of computation → Discrete optimization; Computing methodologies → Feature selection
Supplementary Material: Software (Source Code): https://github.com/smferdous1/Multi-RandGreedi.git, archived at swh:1:dir:ac12481ab5a7ec665040a687e43a3bd5e744c4ed
Editors: Petra Mutzel and Nicola Prezza
Series and Publisher: Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
We describe GreedyML, a parallel approximation algorithm for maximizing monotone submodular functions subject to hereditary constraints on distributed-memory multiprocessors. GreedyML builds on an earlier distributed approximation algorithm that has limited parallelism and higher memory requirements. Maximizing a submodular function under constraints is NP-hard, but a natural iterative Greedy algorithm exists that selects elements based on their marginal gain (defined later); it is $(1 - 1/e)$-approximate for cardinality constraints and $1/2$-approximate for matroid constraints, where $e$ is Euler's number.
Maximizing a submodular function (rather than a linear objective function) promotes diversity in the computed solution, since at each step the algorithm augments its current solution with an element that has the fewest properties in common with the elements already chosen. A broad collection of practical problems is modeled using submodular functions, including data and document summarization [23], load balancing parallel computations in quantum chemistry [9], sensor placement [6], resource allocation [28], active learning [11], interpretability of neural networks [8], influence maximization in social networks [13], and diverse recommendation [5]. Surveys discussing submodular optimization formulations, algorithms, and computational experiments include Tohidi et al. [29] and Krause and Golovin [14].
Our algorithm builds on the RandGreeDI framework [2], a state-of-the-art randomized distributed algorithm for monotone submodular function maximization under hereditary constraints, which has an approximation ratio half that of the Greedy algorithm. The RandGreeDI algorithm randomly partitions the data among all the processors, runs the standard Greedy algorithm on each partition independently in parallel, and then executes a single accumulation step in which all processors send their partial solutions to one processor. However, this accumulation step could exceed the memory available on a processor when the memory is small relative to the size of the data, or when solutions are large. Additionally, the accumulation serializes both the computation and communication and is a bottleneck when scaled to many machines.
Our GreedyML algorithm brings additional parallelism to this step and can lower the memory requirement and running time by introducing hierarchical accumulation organized through an accumulation tree. As in RandGreeDI, we randomly partition the data among all the processors, which constitute the leaves of the accumulation tree. We merge partial solutions at multiple levels of the tree, and the final solution is computed at the root. We prove that the GreedyML algorithm has a worst-case expected approximation guarantee of $\alpha/(m/b + 1)$, where $\alpha$ is the approximation guarantee of the Greedy algorithm, $b$ is the branching factor of the tree (the maximum number of children of an internal node), and $m$ is the number of machines (leaves of the accumulation tree). Using the BSP model, we also analyze the time and communication complexity of the GreedyML and RandGreeDI algorithms and show that the former has lower computation and communication costs than the latter.
We evaluate the parallel algorithms on three representative and practical submodular function maximization problems: the maximum $k$-set cover, the maximum $k$-vertex dominating set in graphs, and exemplar-based clustering (modeled by the $k$-medoid problem). We experiment on large data sets with millions of elements that exceed the memory constraints (a few GB) on a single processor, and discuss how to choose accumulation trees with more levels to adapt to the small memory available on a processor. This strategy also enables us to solve problems with larger values of the parameter $k$, the size of the solution sought. We also show that the number of function evaluations on the critical path of the accumulation tree, and hence the runtime, can be reduced by the parallel algorithm. In most cases, we find that the quality of the solutions computed by GreedyML closely matches that obtained by the RandGreeDI algorithm on these problems, despite GreedyML having a weaker expected approximation guarantee.
2 Background and Related Work
2.1 Submodular functions
A set function $f : 2^V \rightarrow \mathbb{R}_{\geq 0}$, defined on the power set of a ground set $V$, is submodular if it satisfies the diminishing marginal gain property. That is, for all $A \subseteq B \subseteq V$ and every $e \in V \setminus B$,
$$f(A \cup \{e\}) - f(A) \;\geq\; f(B \cup \{e\}) - f(B).$$
A submodular function is monotone if for every $A \subseteq B \subseteq V$, we have $f(A) \leq f(B)$. The constrained submodular maximization problem is defined as follows: given a family of feasible sets $\mathcal{C} \subseteq 2^V$, compute $\max \{ f(S) : S \in \mathcal{C} \}$.
We consider hereditary constraints: i.e., for every set $S \in \mathcal{C}$, every subset of $S$ is also in $\mathcal{C}$. The hereditary family of constraints includes various common ones such as cardinality constraints ($\mathcal{C} = \{S \subseteq V : |S| \leq k\}$) and matroid constraints ($\mathcal{C}$ corresponds to the collection of independent sets of a matroid).
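To make the diminishing-returns property concrete, here is a small worked example with a coverage function; the specific sets are our own illustration, not taken from the paper.

```latex
\[
\begin{aligned}
&\text{Let } f(S)=\Bigl|\bigcup_{X\in S} X\Bigr|,\quad
 S_1=\{1,2\},\; S_2=\{2,3\},\; S_3=\{3,4\},\\
&A=\{S_1\}\subseteq B=\{S_1,S_2\},\qquad e=S_3:\\
&f(A\cup\{e\})-f(A)=|\{1,2,3,4\}|-|\{1,2\}|=2,\\
&f(B\cup\{e\})-f(B)=|\{1,2,3,4\}|-|\{1,2,3\}|=1\;\le\;2 .
\end{aligned}
\]
```

The gain of $S_3$ shrinks as the current solution grows, exactly as submodularity requires.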
Lovász extension.
For the analysis of our algorithm, we use the Lovász extension [21], a relaxation of submodular functions. A submodular function $f$ can be viewed as a function defined over the vertices of the unit hypercube, $f : \{0,1\}^{|V|} \rightarrow \mathbb{R}$, by identifying each set $S \subseteq V$ with the binary vector of length $|V|$ whose $i$-th component is $1$ if $i \in S$, and $0$ otherwise. The Lovász extension [21] is a convex extension $\hat{f} : [0,1]^{|V|} \rightarrow \mathbb{R}$ that extends $f$ over the entire hypercube and is given by
$$\hat{f}(\mathbf{x}) = \mathbb{E}_{\theta}\bigl[\, f(\{ i : x_i \geq \theta \}) \,\bigr],$$
where $\theta$ is uniformly random in $[0,1]$. The Lovász extension satisfies the following properties [21]:
1. $\hat{f}(\mathbf{1}_S) = f(S)$ for all $S \subseteq V$, where $\mathbf{1}_S$ is the vector containing $1$ for the elements in $S$ and $0$ otherwise,
2. $\hat{f}$ is convex, and
3. $\hat{f}(c \cdot \mathbf{x}) \geq c \, \hat{f}(\mathbf{x})$ for any $c \in [0,1]$.
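As a small illustration of how properties 2 and 3 are used later (this display is ours, not from the paper): for any $S \subseteq V$ and vectors $\mathbf{x}, \mathbf{y} \in [0,1]^{|V|}$ with $\mathbf{x} + \mathbf{y} = \mathbf{1}_S$,

```latex
\[
\tfrac{1}{2}\,\hat{f}(\mathbf{x}) \;+\; \tfrac{1}{2}\,\hat{f}(\mathbf{y})
\;\ge\; \hat{f}\!\Bigl(\tfrac{1}{2}\,\mathbf{1}_{S}\Bigr)
\;\ge\; \tfrac{1}{2}\, f(S).
\]
```

The first inequality uses convexity (property 2) and the second uses property 3 with $c = \tfrac{1}{2}$ followed by property 1; the analysis in Section 4 applies the same two steps with unequal weights.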
An $\alpha$-approximation algorithm ($\alpha \in (0,1]$) for constrained submodular maximization produces a feasible solution $S \in \mathcal{C}$ satisfying $f(S) \geq \alpha \cdot f(OPT)$, where $OPT$ is an optimal solution.
2.2 Related Work
GreeDI and RandGreeDI.
The iterative Greedy algorithm for maximizing constrained submodular functions starts with an empty solution. Given any current solution $S$, an element is feasible if it can be added to the solution without violating the constraints. Given a dataset and a current solution $S$, the Greedy algorithm in each iteration chooses a feasible element $e$ that maximizes the marginal gain $f(S \cup \{e\}) - f(S)$. The algorithm terminates when the maximum marginal gain is zero or all feasible elements have been considered.
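The following C++ sketch illustrates this selection rule for a cardinality constraint; it is a minimal illustration with a placeholder value oracle, not the Lazy Greedy variant used in the paper's implementation.

```cpp
#include <functional>
#include <unordered_set>
#include <vector>

// Greedy for a cardinality constraint |S| <= k.
// f is a value oracle for the monotone submodular function.
std::vector<int> greedy(const std::vector<int>& ground_set, int k,
                        const std::function<double(const std::unordered_set<int>&)>& f) {
  std::unordered_set<int> solution;
  std::vector<int> order;                       // elements in the order they were picked
  while ((int)solution.size() < k) {
    int best = -1;
    double best_gain = 0.0;
    for (int e : ground_set) {                  // scan all feasible elements
      if (solution.count(e)) continue;
      std::unordered_set<int> trial = solution;
      trial.insert(e);
      double gain = f(trial) - f(solution);     // marginal gain of e w.r.t. S
      if (gain > best_gain) { best_gain = gain; best = e; }
    }
    if (best < 0) break;                        // no element has positive gain
    solution.insert(best);
    order.push_back(best);
  }
  return order;
}
```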
We now discuss the GreeDI and RandGreeDI algorithms, the state-of-the-art distributed algorithms for constrained submodular maximization. The GreeDI algorithm [23] partitions the data arbitrarily among the available machines and runs the Greedy algorithm independently on each machine to compute a local solution. These solutions are then sent to a single global machine, and the Greedy algorithm is executed again on the accumulated data to obtain a global solution. The final solution is the best solution among all the local and global solutions. For a cardinality constraint with solution size $k$, the GreeDI algorithm has a worst-case approximation guarantee of $\frac{1 - 1/e}{\min(\sqrt{k},\, m)}$, where $m$ is the number of machines.
Although GreeDI performs well in practice [23], its approximation ratio is not a constant but depends on $k$ and $m$. Improving on this work, Barbosa et al. [2] proposed the RandGreeDI algorithm, which partitions the data uniformly at random among the machines and achieves an expected approximation guarantee of $\frac{1}{2}(1 - 1/e)$ for cardinality constraints and $\frac{1}{4}$ for matroid constraints. In general, it has an approximation ratio of $\alpha/2$, where $\alpha$ is the approximation ratio of the Greedy algorithm run on the local and global machines. We present the pseudocode of the RandGreeDI framework in Algorithm 1.
Note that for a cardinality constraint, both GreeDI and RandGreeDI perform $O(nk/m + mk^2)$ calls to the objective function on the critical path and communicate $O(mk)$ elements to the global machine, where $n$ is the number of elements in the ground set, $m$ is the number of machines, and $k$ is the solution size.
Both GreeDI and RandGreeDI require a single global accumulation of the solutions generated on the local machines, which can quickly become dominant since the runtime, memory, and communication of this global aggregation grow linearly with the number of machines. We propose to alleviate this by introducing a hierarchical aggregation strategy that maintains an accumulation tree. Our GreedyML framework generalizes RandGreeDI from a single accumulation to a multi-level accumulation. The number of partial solutions to be aggregated at a node depends on the branching factor of the tree, which can be a constant. Thus, the number of accumulation levels grows logarithmically with the number of machines, and the aggregation is unlikely to become a memory, runtime, or communication bottleneck as the number of machines increases. We refer to Appendix A for a detailed complexity comparison of the RandGreeDI and GreedyML algorithms.
Other work.
Early approaches to distributed submodular maximization include the Sample and Prune algorithm by Kumar et al. [17], which requires $O(1/\delta)$ rounds assuming $O(k\, n^{\delta} \log n)$ memory per machine; here $\delta > 0$ is a user parameter. GreeDI [23] and RandGreeDI [2] have been shown to be more efficient in practice than the Sample and Prune algorithm.
More recent distributed approaches [4, 24, 25] use the multilinear extension to map the problem to a continuous function. They typically perform gradient ascent on each local machine and build a consensus solution in each round, which improves the approximation factor to $1 - 1/e$. However, we do not believe that these approaches are practical, since they involve expensive gradient computations (which could take exponential time). Most of these algorithms have not been implemented, and the one reported implementation solves problems with only a few hundred elements in the data set [25].
A shared-memory parallel algorithm, the FAST algorithm [3], uses adaptive sequencing to speed up the Greedy algorithm. Adaptive sequencing is a technique to add several elements in one round to the current solution. First, all elements with large marginal gains with respect to the current solution are selected. They are then randomly permuted, and prefixes with large average marginal gains are considered. The subset that is added is chosen to be a largest subset with sufficiently high average marginal gain. These sequencing rounds are repeated until the solution has the desired cardinality. However, this algorithm does not scale to larger numbers of machines and does not work in memory-restricted contexts. Parallelism in this algorithm occurs in setting a number of threshold values to determine what constitutes a sufficiently large marginal gain.
A more recent algorithm is the distributed DASH algorithm [7], which replaces the Greedy algorithm in the RandGreeDI framework with a similar adaptive sequencing algorithm. This algorithm has the same memory and processor bottlenecks as RandGreeDI. It can be used instead of the greedy algorithm in the GreedyML framework proposed here.
3 Description of Our Algorithm
We describe and analyze our algorithm that generalizes the RandGreeDI algorithm from a single accumulation step to multiple accumulation steps. Each accumulation step corresponds to a level in an accumulation tree, which we describe next.
3.1 Data Structure and Preliminaries
Accumulation tree.
Given a problem, an algorithm to solve it, its memory requirements (to store the data and associated data structures), and machines of a specified memory size, we choose the number of machines needed to solve the problem. We identify the machines by the set of ids $\{0, 1, \ldots, m-1\}$. These machines are organized into a tree structure that we call the accumulation tree, since it governs how local solutions are merged hierarchically to compute the final solution. Every machine is a leaf of the tree, and the interior nodes of the tree correspond to subsets of machines. The machine corresponding to an interior node is the one that corresponds to its left-most child. Figure 1 shows an example of a generic accumulation tree with $m$ leaves, where the maximum number of children of an interior node, its branching factor, is $b$. Each node in the tree is identified by a tuple: its level in the tree and the id of the machine corresponding to the node.
The problem data is partitioned randomly among the machines (the leaves of the tree). The size of the subset of the data assigned to a machine (and the associated data structures) must fit within the memory available on the machine. Each leaf computes a solution from its subset of the data and shares this solution with its parent node. Each interior node collects the solutions from its children, takes their union, and computes a new solution from this union; every interior node must therefore also have sufficient memory to hold the union of its children's solutions. It then chooses the best solution from among the solutions it received from its children and the solution it computed at this step, and communicates it to its parent. The root node finally reports the best solution from among the local solutions of its children and the solution it computed from the union of their solutions. Thus, the edges of the tree determine the accumulation pattern of the intermediate solutions. The number of accumulation levels (one less than the number of node levels in the tree), denoted by $L$, is $\lceil \log_b m \rceil$. When $m$ is less than $b^L$, nodes at the higher levels may have fewer than $b$ children. We characterize an accumulation tree by the triple $(m, L, b)$, where $m$ is the number of leaves (machines), $L$ is the number of accumulation levels, and $b$ is the branching factor.
Observe that the same machine may correspond to nodes at multiple levels of the tree and thus be involved in computations at multiple levels. For our analysis, we keep the branching factor constant across all levels.
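One concrete way to realize this bookkeeping (an indexing convention we assume for illustration; the released code may differ) is to make machine $i$ host a node at level $\ell$ exactly when $b^{\ell}$ divides $i$:

```cpp
#include <cstdint>

// Sketch of one possible accumulation-tree indexing with m machines and
// branching factor b: machine i hosts a node at level l iff b^l divides i,
// and the parent node is hosted by i rounded down to a multiple of b^(l+1).
// This is an illustrative convention, not necessarily the one in the paper's code.
struct TreeIndex {
  int64_t m, b;

  int64_t stride(int level) const {          // b^level
    int64_t s = 1;
    for (int i = 0; i < level; ++i) s *= b;
    return s;
  }
  bool active(int64_t machine, int level) const {
    return machine % stride(level) == 0 && machine < m;
  }
  int64_t parent(int64_t machine, int level) const {
    return machine - (machine % stride(level + 1));   // left-most child hosts the parent
  }
};
// Example: m = 8, b = 2. Machine 6 is active at levels 0 and 1; at level 1 its
// parent node is hosted by machine 4, and the root is always hosted by machine 0.
```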
Randomness.
The randomness in the algorithm lies only in the initial placement of the data on the machines, and we use a random tape to encapsulate this. The random tape has a random entry for each element of $V$ indicating the machine containing that element. Any expectation results proved henceforth are over the choice of this random tape. Moreover, if the data accessible to a node is $A \subseteq V$, we consider the randomness over just $A$; whenever the expectation is over $A$, we denote it by $\mathbb{E}_A$.
Recurrence relation.
Figure 2 shows the recurrence relation, defined for every node in the accumulation tree, that forms the basis of the multilevel distributed GreedyML algorithm. At level $0$ (the leaves), the recurrence returns the Greedy solution of the random subset of the data assigned to the leaf. At the other levels (internal nodes), it returns the better of the Greedy solution computed from the union of the solution sets received from the node's children and the node's own solution from the previous level. It is undefined for tuples that do not correspond to nodes in the tree (at higher levels). The detailed pseudocode of our algorithm is presented in Algorithm 2.
3.2 Pseudocode of GreedyML
Algorithm 2 describes our multilevel distributed algorithm using two procedures. The first procedure, GreedyML, is a wrapper function that sets up the environment to run the distributed algorithm. The second procedure is the iterative implementation of the recurrence relation that runs on each machine. The wrapper function partitions the data into subsets and assigns them to the machines (Line 2). Then each machine runs the second procedure on the subset assigned to it (Lines 5 and 7). The wrapper function returns the solution from machine $0$ (Line 8), since it is the root of the accumulation tree.
The second procedure is an iterative implementation of the recurrence relation of Figure 2 that runs on every machine. Each machine checks whether it needs to be active at a particular level (Line 5) and decides whether it needs to receive solutions from other machines (Line 11) or send its solution to another machine (Line 6). The procedure returns the solution computed at the last level at which the machine is active.
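To make this control flow concrete, the following single-process C++ sketch simulates the accumulation tree over the recurrence; the actual implementation replaces the in-memory loop with MPI sends and receives, and the `greedy` and `f` callables as well as the integer element type are placeholder assumptions of ours.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

using Solution = std::vector<int>;
using Oracle   = std::function<double(const Solution&)>;   // objective value f(S); assumes f({}) = 0
using Greedy   = std::function<Solution(const Solution&)>; // Greedy on a set of elements

// Serial simulation of the GreedyML recurrence on m leaf parts with branching
// factor b: solutions are merged b at a time, level by level, keeping at every
// node the better of the merged Greedy solution and the best child solution.
Solution greedyml_simulated(std::vector<Solution> parts, int b,
                            const Greedy& greedy, const Oracle& f) {
  // Level 0: solve each random part independently.
  std::vector<Solution> current;
  for (const auto& part : parts) current.push_back(greedy(part));

  while (current.size() > 1) {                       // one accumulation level per pass
    std::vector<Solution> next;
    for (size_t i = 0; i < current.size(); i += b) {
      Solution pooled, best_child;
      for (size_t j = i; j < std::min(current.size(), i + b); ++j) {
        pooled.insert(pooled.end(), current[j].begin(), current[j].end());
        if (f(current[j]) > f(best_child)) best_child = current[j];
      }
      Solution merged = greedy(pooled);              // Greedy over the pooled solutions
      next.push_back(f(merged) > f(best_child) ? merged : best_child);
    }
    current.swap(next);
  }
  return current.front();                            // solution at the root
}
```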
4 Analysis of Our Algorithm
We prove the expected approximation ratio of the GreedyML algorithm in Theorem 4 using three lemmas. We restate a result from [2] that applies to the leaves of our accumulation tree and characterizes elements that do not change the solution computed by the Greedy algorithm.
Lemma 1 ([2]).
Let $A, B \subseteq V$. If we have $\mathrm{Greedy}(A \cup \{e\}) = \mathrm{Greedy}(A)$ for each element $e \in B$, then $\mathrm{Greedy}(A \cup B) = \mathrm{Greedy}(A)$.
The next two Lemmas connect the quality of the computed solutions to the optimal solution at the internal nodes (in level one) of the accumulation tree. Lemma 2 provides a lower bound on the expected function value of the individual solutions of the Greedy algorithm received from the leaf nodes, while Lemma 3 analyzes the expected function value of the Greedy execution over the accumulated partial solutions.
Let $\mathbf{p} = (p_e)_{e \in OPT}$ be a vector of probabilities over the elements of $OPT$, and let $OPT(\mathbf{p})$ be a random subset of $OPT$ in which each element $e$ is independently present with probability $p_e$. The probability $p_e$ is defined as follows:
$$p_e \;=\; \Pr\bigl[\, e \in \mathrm{Greedy}(A \cup \{e\}) \,\bigr],$$
where $A$ is the random subset of $V$ assigned to a leaf node. Thus, for any leaf node, the distribution $\mathbf{p}$ gives the probability that each element of $OPT$ appears in the solution of the Greedy algorithm when that element is placed on the node.
Lemma 2.
Let $c$ be a leaf node of the accumulation tree, $V_c$ be the elements assigned to it, and $S_c$ be the solution computed from $V_c$. If Greedy is an $\alpha$-approximate algorithm, then
$$\mathbb{E}\bigl[f(S_c)\bigr] \;\geq\; \alpha \, \hat{f}\bigl(\mathbf{1}_{OPT} \circ (\mathbf{1} - \mathbf{p})\bigr),$$
where $\circ$ denotes the element-wise product and $\hat{f}$ is the Lovász extension of $f$.
Proof.
We first construct a subset of $OPT$ that contains all the elements that do not appear in the solution when added to the data of the leaf node $c$. Let $R_c = \{ e \in OPT : e \notin \mathrm{Greedy}(V_c \cup \{e\}) \}$ be this rejected set, which can be added to $V_c$ without changing $S_c$. Therefore, from Lemma 1, we know that $\mathrm{Greedy}(V_c \cup R_c) = \mathrm{Greedy}(V_c) = S_c$. Since the rejected set $R_c \subseteq OPT$ and the constraints are hereditary, $R_c$ is feasible (i.e., $R_c$ is a feasible solution for node $c$). Then, since Greedy is $\alpha$-approximate on the set $V_c \cup R_c$ and $R_c$ is a feasible subset of it, we have $f(S_c) \geq \alpha\, f(R_c)$. Each element $e \in OPT$ belongs to $R_c$ with probability $1 - p_e$ over the random placement, so taking expectations and using properties 1 and 2 of the Lovász extension,
$$\mathbb{E}\bigl[f(S_c)\bigr] \;\geq\; \alpha\, \mathbb{E}\bigl[\hat{f}(\mathbf{1}_{R_c})\bigr] \;\geq\; \alpha\, \hat{f}\bigl(\mathbb{E}[\mathbf{1}_{R_c}]\bigr) \;=\; \alpha\, \hat{f}\bigl(\mathbf{1}_{OPT} \circ (\mathbf{1} - \mathbf{p})\bigr).$$
Lemma 3.
Let $T$ be the union of all the solutions computed by the children of an internal node at level one of the accumulation tree, and let $S_T$ be the solution computed by the Greedy algorithm on the set $T$. If Greedy is an $\alpha$-approximate algorithm, then
$$\mathbb{E}\bigl[f(S_T)\bigr] \;\geq\; \alpha\, \hat{f}\Bigl(\tfrac{b}{m}\, \mathbf{1}_{OPT} \circ \mathbf{p}\Bigr).$$
Proof.
We first show a preliminary result on the union set $T$. Consider an element $e \in OPT$ that is present in some solution $S_c$ from a child $c$; this requires $e$ to be assigned to the part $V_c$ of that child. Then,
$$\Pr[e \in S_c] \;=\; \Pr[e \in V_c] \cdot \Pr\bigl[e \in \mathrm{Greedy}(V_c) \mid e \in V_c\bigr].$$
Since the distribution of $V_c$ conditioned on $e \in V_c$ is identical to the distribution of $A \cup \{e\}$, where $A$ is a random part of $V$, we have $\Pr[e \in \mathrm{Greedy}(V_c) \mid e \in V_c] = p_e$. Since this result holds for every child $c$, each element is assigned to a given child with probability $1/m$, and the parts mapped to the $b$ children are disjoint, we have
$$\Pr[e \in T] \;=\; \sum_{c} \Pr[e \in S_c] \;=\; \frac{b}{m}\, p_e .$$
Now we are ready to prove the lemma. The subset $T \cap OPT$ is feasible, since it is a subset of $OPT$ and the constraints are hereditary. Further, since the Greedy algorithm is $\alpha$-approximate and $T \cap OPT \subseteq T$, we have
$$\mathbb{E}\bigl[f(S_T)\bigr] \;\geq\; \alpha\, \mathbb{E}\bigl[f(T \cap OPT)\bigr] \;=\; \alpha\, \mathbb{E}\bigl[\hat{f}(\mathbf{1}_{T \cap OPT})\bigr] \;\geq\; \alpha\, \hat{f}\bigl(\mathbb{E}[\mathbf{1}_{T \cap OPT}]\bigr) \;=\; \alpha\, \hat{f}\Bigl(\tfrac{b}{m}\,\mathbf{1}_{OPT} \circ \mathbf{p}\Bigr). \tag{1}$$
Theorem 4.
Let $\mathcal{T}$ be an accumulation tree with $m$ leaves and branching factor $b$, $V$ be the ground set, and the elements of $V$ be mapped uniformly at random to the leaves of $\mathcal{T}$. Let $OPT$ be an optimal solution of the constrained maximization problem for the submodular function $f$. If Greedy is an $\alpha$-approximate algorithm, then
$$\mathbb{E}\bigl[f(\mathrm{GreedyML}(V))\bigr] \;\geq\; \frac{\alpha}{\,m/b + 1\,}\, f(OPT).$$
Proof.
We concentrate on a node at level $1$; after obtaining the partial solutions from its children, this node computes Greedy on the union of these partial solutions. Let $S_c$ be any one of the partial solutions, $T$ be the union of the partial solutions, and $\hat{S}$ be the solution the node returns (the best of $\mathrm{Greedy}(T)$ and the partial solutions). From Lemma 2 and Lemma 3,
$$\mathbb{E}\bigl[f(\hat{S})\bigr] \;\geq\; \alpha\, \hat{f}\bigl(\mathbf{1}_{OPT} \circ (\mathbf{1}-\mathbf{p})\bigr)
\qquad \text{and} \qquad
\mathbb{E}\bigl[f(\hat{S})\bigr] \;\geq\; \alpha\, \hat{f}\Bigl(\tfrac{b}{m}\,\mathbf{1}_{OPT} \circ \mathbf{p}\Bigr).$$
By multiplying the first inequality by $b/m$, adding it to the second, and using the convexity of $\hat{f}$ together with the property $\hat{f}(c \cdot \mathbf{x}) \geq c\, \hat{f}(\mathbf{x})$, we get
$$\Bigl(\tfrac{b}{m} + 1\Bigr)\, \mathbb{E}\bigl[f(\hat{S})\bigr] \;\geq\; \alpha\, \Bigl(\tfrac{b}{m} + 1\Bigr)\, \hat{f}\Bigl(\tfrac{b}{m+b}\, \mathbf{1}_{OPT}\Bigr) \;\geq\; \alpha\, \tfrac{b}{m}\, f(OPT),
\qquad\text{and hence}\qquad
\mathbb{E}\bigl[f(\hat{S})\bigr] \;\geq\; \frac{\alpha}{m/b + 1}\, f(OPT).$$
The theorem follows since the solution quality can only improve at higher levels of the tree.
5 Experimentation
5.1 Experimental setup
We conduct experiments to demonstrate that our algorithms are capable of overcoming the memory limitations of the Greedy and RandGreeDI algorithms and can solve large-scale constrained submodular maximization problems. We also compare these algorithms with respect to runtimes and the quality of solutions.
All the algorithms are executed on a cluster computer with nodes, each of which is an AMD EPYC node with GB of total memory shared by the cores. Each core operates at GHz frequency. The nodes are connected with a 100 Gbps HDR Infiniband network. To simulate a distributed environment on this cluster, we needed to ensure that the memory is not shared between nodes. Therefore, in what follows, a machine will denote one node with just one core assigned for computation, but having access to all GB of memory. We also found that this made the runtime results more reproducible.
For our experimental evaluation, we report the runtime and the quality of the algorithms being compared. For the runtime, we exclude the file reading time on each machine, and for the quality, we report the objective function value of the corresponding submodular function. Since RandGreeDI and GreedyML are distributed algorithms, we also report the number of function calls on the critical path of the accumulation tree, which represents the parallel runtime of the algorithm. Given an accumulation tree, the number of function calls on the critical path is the maximum number of function calls that the algorithm makes along a path from a leaf to the root. In our implementation, this quantity is captured by the number of function calls made on machine $0$, since this machine participates in the function calls at every level of the tree.
| Function | Dataset | $n$ | Total size | Avg. size |
|---|---|---|---|---|
| $k$-dominating set | AGATHA_2015 | 183,964,077 | 11,588,725,964 | 63.32 |
| | MOLIERE_2016 | 30,239,687 | 6,669,254,694 | 220.54 |
| | com-Friendster | 65,608,366 | 1,806,067,135 | 27.52 |
| | road_usa | 23,947,347 | 57,708,624 | 2.41 |
| | road_central | 14,081,816 | 33,866,826 | 2.41 |
| | belgium_osm | 1,441,295 | 3,099,940 | 2.14 |
| $k$-cover | webdocs | 1,692,082 | 299,887,139 | 177.22 |
| | kosarak | 990,002 | 8,018,988 | 8.09 |
| | retail | 88,162 | 908,576 | 10.31 |
| $k$-medoid | Tiny ImageNet | 100,000 | 1,228,800,000 | 12,288 |
Datasets.
In this paper, we limit our experiments to cardinality constraints using three different submodular functions described in detail in Appendix A. Other hereditary constraints add more computation to the problem and will greatly increase the run times of the experiments.
Our benchmark datasets are shown in Table 1. They are grouped by the objective function and sorted by total size within each group (see the table for the column definitions). For the $k$-dominating set, our testbed consists of the Friendster social network graph [31], a collection of road networks from the DIMACS10 dataset [1], and the Sybrandt datasets [26, 27]. We chose the road graphs since they have relatively small average vertex degrees, leading to large vertex dominating sets. We chose the Sybrandt collection since it contains huge machine-learning graphs. For the $k$-cover objective, we use popular set cover datasets from the Frequent Itemset Mining Dataset Repository [10]. For the $k$-medoid problem, we use the Tiny ImageNet dataset [18].
MPI Implementation.
GreedyML is implemented in C++11 and compiled with g++ 9.3.0 using the -O3 optimization flag. We use the Lazy Greedy [22] variant, which has the same approximation guarantee as Greedy but is faster in practice, since it potentially reduces the number of function evaluations needed to choose the next element (by exploiting the monotonically decreasing marginal gains of submodular functions). Our implementation of the GreedyML algorithm uses the OpenMPI library for inter-node communication. We use the MPI_Gather and MPI_Gatherv primitives to receive the solution sets from the children (Line 11 in Algorithm 2). We create custom MPI_Comm communicators with the MPI_Group primitives to enable this communication; customized communicators are required since every machine has different children at each level. Additionally, we use the MPI_Barrier primitive to synchronize the computations at each level.
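As an illustration of this gather step, here is a simplified sketch under assumed data layouts (solutions as arrays of 64-bit element ids, the parent at rank 0 of the custom communicator); it is not the exact code from the repository.

```cpp
#include <mpi.h>
#include <numeric>
#include <vector>

// Gather variable-length solutions (arrays of element ids) from the children
// onto the parent of one accumulation node. 'comm' is assumed to be a
// communicator containing the parent (rank 0) and its children.
std::vector<long long> gather_solutions(const std::vector<long long>& my_solution,
                                        MPI_Comm comm) {
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  int my_count = static_cast<int>(my_solution.size());
  std::vector<int> counts(size), displs(size);
  // The parent first learns how many elements each child will send.
  MPI_Gather(&my_count, 1, MPI_INT, counts.data(), 1, MPI_INT, 0, comm);

  std::vector<long long> pooled;
  if (rank == 0) {
    std::partial_sum(counts.begin(), counts.end() - 1, displs.begin() + 1);
    pooled.resize(std::accumulate(counts.begin(), counts.end(), 0));
  }
  // Then the variable-length solution arrays are concatenated on the parent.
  MPI_Gatherv(my_solution.data(), my_count, MPI_LONG_LONG,
              pooled.data(), counts.data(), displs.data(), MPI_LONG_LONG, 0, comm);
  return pooled;  // non-empty only on the parent (rank 0)
}
```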
5.2 Experimental Results
The experiments are executed with different accumulation trees that vary in the number of machines ($m$), the number of levels ($L$), and the branching factor ($b$) to assess their performance. We repeat each experiment six times and report the geometric mean of the results. Unless otherwise stated, a machine in our experiments is a node of the cluster with only one core assigned for computation. Whenever memory constraints allow, we compare our results with the sequential Greedy algorithm, which achieves a $(1 - 1/e)$-approximation guarantee.
5.2.1 Experiments with memory limit
Here we show the memory advantage of our GreedyML algorithm over RandGreeDI with two experiments. In the first, we impose a limit of 100 MB of memory on each node and vary the solution size $k$; this also simulates how the new algorithm can find applications in an edge-computing context. In the second, we fix $k$ and vary the memory limits, necessitating different numbers of nodes to fit the data in the leaves. We observe the quality and runtime of different accumulation tree structures in these two experiments. Both experiments are designed to show that the RandGreeDI algorithm quickly runs out of memory as $k$ grows and the per-node memory shrinks, and that by choosing an appropriate accumulation tree, our GreedyML algorithm can solve the problem with a negligible drop in accuracy. For these experiments, we choose the shortest accumulation tree that fits within the memory limit for the given value of $k$.
Varying k.
For this experiment, we use 16 machines with a limit of 100 MB of available memory per machine and vary the solution size $k$ for the $k$-dominating set problem on the road_usa [1] dataset. The small memory limit in this experiment can also be motivated by an edge-computing context.
The left plot in Figure 3 shows the number of function calls for varying values of $k$ for the Greedy (green bars) and GreedyML (blue bars) algorithms. Note that when $L = 1$ (the left-most bar in the figure), the GreedyML algorithm coincides with the RandGreeDI algorithm. For GreedyML (and RandGreeDI), we are interested in the number of function calls in the critical path since it represents the parallel runtime of the algorithm. With our memory limits, only the instances with the smallest values of $k$ can be solved using the RandGreeDI algorithm.
As we increase $k$, we can still generate solutions using GreedyML with different accumulation trees. The corresponding lowest-depth accumulation tree, with its number of levels $L$ and branching factor $b$, is shown on top of the blue bars. The results show that the number of function evaluations on the critical path of the GreedyML algorithm is smaller than the number of function evaluations of the sequential Greedy algorithm. While the number of function calls for accumulation trees with smaller branching factors is larger than for RandGreeDI, GreedyML can solve the problems with larger $k$ values on the same machine setup, which was not possible with RandGreeDI. This comes with a trade-off in parallel runtime: as we make the branching factor smaller, the number of function calls in the critical path increases, suggesting that it suffices to choose the accumulation tree with the largest branching factor (and thus the lowest depth) that the memory allows.
The right plot of Figure 3 shows the relative objective function value, i.e., the number of vertices covered by the dominating set relative to the Greedy algorithm, for varying $k$. The figure shows that the RandGreeDI and GreedyML algorithms attain quality only marginally lower than that of the serial Greedy algorithm. Similar trends are observed for the other datasets.
| Dataset | Alg. | Mem. Limit | $m$ | $b$ | $L$ | Rel. Func. (%) | Time (s) |
|---|---|---|---|---|---|---|---|
| Friendster | RG | 4GB | 8 | 8 | 1 | 99.959 | 61.994 |
| | GML | 2GB | 16 | 4 | 2 | 99.903 | 61.352 |
| | GML | 1GB | 32 | 2 | 5 | 99.793 | 79.997 |
| MOLIERE_2016 | RG | 8GB | 8 | 8 | 1 | 99.257 | 121.318 |
| | GML | 4GB | 16 | 4 | 2 | 99.106 | 108.764 |
| | GML | 2GB | 32 | 2 | 5 | 98.990 | 161.139 |
| AGATHA_2015 | RG | 12GB | 8 | 8 | 1 | 99.996 | 94.122 |
| | GML | 6GB | 16 | 4 | 2 | 99.995 | 99.574 |
| | GML | 3GB | 32 | 2 | 5 | 99.989 | 104.156 |
Varying memory limits.
This experiment demonstrates that the memory efficiency of the GreedyML algorithm enables us to solve problems on parallel machines that the RandGreeDI and Greedy algorithms cannot solve due to insufficient memory. Unlike the previous experiment (varying $k$), where we selected the accumulation trees based on $k$, here we fix $k$ and choose accumulation trees based on the memory available on the machines. We consider the $k$-dominating set problem and report results on the Friendster [31], AGATHA_2015 [27], and MOLIERE_2016 [26] datasets in Table 2. For the Friendster dataset, we choose $k$ such that the memory needed for the accumulated solutions is much smaller than that for the original graph. The RandGreeDI algorithm (the first row) can execute this problem only on 8 machines, each with 4 GB of memory, since in the accumulation step one machine receives the partial solutions from all 8 machines. The GreedyML algorithm, having multiple levels of accumulation, can run on 16 machines with only 2 GB of memory each, using $b = 4$ and $L = 2$. Furthermore, it can also run on 32 machines with only 1 GB of memory each, using $b = 2$ and $L = 5$. We repeat the same experiment for the other two datasets with these three machine configurations, with correspondingly scaled memory restrictions.
We show the relative quality and running time for the three datasets and these configurations in Table 2. Our results show that the function values computed by the GreedyML algorithm (the GML rows) are insensitive to the number of levels in the tree. As expected, increasing the number of levels in the accumulation tree increases the execution time due to the communication and synchronization costs involved. However, aggregating at multiple levels enables us to solve large problems by overcoming memory constraints. In this scenario, it therefore suffices to select the number of machines based on the size of the dataset and then select the branching factor so that the accumulation step does not exceed the memory limit. We also note that the RandGreeDI algorithm has an inherently serial accumulation step, and the GreedyML algorithm provides a mechanism to parallelize it.
5.2.2 Selecting the Accumulation tree
Now we show how the accumulation tree may be chosen to reduce runtime (or a proxy, the number of function calls in the critical path) when the number of machines is fixed.
In this experiment, we show results for the $k$-dominating set and $k$-cover problems by fixing the number of machines and varying the branching factor, the number of levels in the accumulation tree, and the solution size $k$.
In Figure 4, we provide summary results on the number of function evaluations in the critical path relative to the Greedy algorithm and the running times by taking a geometric mean over all nine datasets.
Three subfigures (top left, top right, and bottom left) of Figure 4 show the execution time in seconds for the GreedyML and RandGreeDI algorithms as the number of levels and the parameter $k$ are varied. When $k$ is small (top left), there is little variation in the execution time since the work performed on the leaves dominates the overall time. As $k$ increases (bottom left), the GreedyML algorithm becomes faster than the RandGreeDI algorithm ($L = 1$). Note that although Figure 4 presents geometric-mean results over all nine datasets, the runtimes and function values of the individual datasets follow the same trend. The largest and smallest reductions in runtime are observed on the belgium_osm and kosarak datasets, respectively (the latter around 1%), across all $k$ values.
The bottom right plot fixes $k$ and shows the number of function calls in the critical path of the accumulation tree, relative to the Greedy algorithm, for different $(L, b)$ pairs. Here, the leftmost bar represents the RandGreeDI algorithm. We observe that the relative number of function calls for RandGreeDI is around 70% of Greedy, and GreedyML with taller trees reduces it by a further 15 percent. From Table 5, the number of function calls at a leaf node is $O(nk/m)$, while at an accumulation node it is $O(mk^2)$ for the RandGreeDI algorithm. Hence, the accumulation node dominates the computation since it has a quadratic dependence on $k$, becoming a bottleneck for large $k$ values. This plot also shows that the number of function calls is a good indicator of the algorithm's runtime and that the cost of function evaluations dominates the overall time. The other factor affecting runtime is the communication cost, which is relatively small and grows with the number of levels when $k$ is very large.
We note (not shown in the figure) that the objective function values obtained by the GreedyML algorithm are generally not sensitive to the choice of the number of levels and the branching factor of the accumulation tree, and differ only marginally from the values of the RandGreeDI algorithm. For the webdocs $k$-coverage problem, however, the Greedy quality is noticeably higher than that of both RandGreeDI and GreedyML.
| $L$ | $b$ | First Accumulation | Final Accumulation |
|---|---|---|---|
| 3 | 4 | 0.74111 | 0.99994 |
| 2 | 8 | 0.82971 | 1.00005 |
| 2 | 16 | 0.91965 | 0.99994 |
| 1 | 32 | 1.00003 | 1.00003 |
In Table 3, we report the improvement in objective function value from accumulating over multiple levels, compared with stopping after the first level of accumulation. We use the maximum $k$-cover function on the Friendster dataset for different branching factors, and report the relative objective value after the first and the final accumulation levels. We observe that the objective values at the highest accumulation level are not very sensitive to the tree parameters, contrary to the sensitivity suggested by the approximation ratio derived in Theorem 4.
5.2.3 Scaling results
Here we perform a strong scaling experiment to show how computation and communication times vary for the RandGreeDI and GreedyML algorithms. For the latter algorithm, we use the tallest accumulation tree by using a branching factor of two, thereby increasing the number of accumulation steps. Our results will show that even though the RandGreeDI algorithm has a low asymptotic communication cost, it can become a bottleneck when scaled to a large number of machines, and that our algorithm alleviates this bottleneck.
Next, we show how the GreedyML algorithm alleviates the scaling bottlenecks of the RandGreeDI algorithm using the $k$-dominating set problem on the Friendster dataset. We set the branching factor $b = 2$ for the GreedyML algorithm since this gives the largest number of levels and thus the lowest approximation guarantee. We compare communication and computation times against the RandGreeDI algorithm on 8 to 128 machines.
In Figure 5, we plot the total execution time by stacking the communication and computation times of the two algorithms. For RandGreeDI, the communication time scales poorly since it increases linearly with the number of machines $m$ (see Table 5). For the GreedyML algorithm with a constant branching factor $b$, the communication cost is $O(Lbk\Delta)$, which grows only logarithmically with the number of machines. Figure 5 shows that the total communication time of the GreedyML algorithm stays nearly constant, whereas that of the RandGreeDI algorithm grows steadily with the number of machines. We observe that the computation times of RandGreeDI and GreedyML change similarly with $m$, indicating that the majority of the computational work is performed at the leaf nodes. For the computation time, we observe slightly worse scaling for RandGreeDI than for GreedyML, again because the central node becomes a computational bottleneck as $m$ increases. Similar to the other experiments, we observe (not shown in the plot) almost identical solution quality, with the GreedyML solution quality reduced only marginally from that of the RandGreeDI algorithm.
5.2.4 The k-medoid problem
| $L$ | $b$ | Rel. Func. Val. (%), Local Obj. | Speedup, Local Obj. | Rel. Func. Val. (%), Added Images | Speedup, Added Images |
|---|---|---|---|---|---|
| 5 | 2 | 92.22 | 2.00 | 93.69 | 2.01 |
| 3 | 4 | 92.21 | 1.96 | 92.70 | 1.94 |
| 2 | 8 | 92.73 | 1.95 | 92.77 | 1.93 |
| 2 | 16 | 92.22 | 1.49 | 93.34 | 1.44 |
In this final subsection, we perform experiments with the $k$-medoid objective function (which is computationally more expensive than the others used here) and show that taller accumulation trees can provide a significant speedup without loss in quality. The $k$-medoid function is extensively used in machine learning as a solution to exemplar-based clustering problems.
Our dataset is the Tiny ImageNet dataset [18] containing 100,000 images ($64 \times 64$ pixels) from 200 classes, with 500 images per class. We convert and normalize each image into a vector and use the Euclidean distance to measure dissimilarity. We define the auxiliary element as the all-zeros pixel vector. Note that, unlike the other two functions, the $k$-medoid function requires access to the full dataset to compute the function value. Since the dataset is distributed, this poses an issue for the experiment. To overcome it, following [23, 2], we calculate the objective function value using only the images available locally on each machine; that is, the ground set for each machine is just the images present on that machine. Additionally, these works [23, 2] add a subset of randomly chosen images to the central machine to improve quality in practice. We follow both techniques (local only, and local with additional images) in the experiments with our multilevel GreedyML algorithm.
In our experiments, we fix the solution size $k$, fix the number of machines ($m = 32$), and vary the accumulation trees by choosing different values of $L$ and $b$. For the variant with additional images, we add a set of randomly chosen images from the original dataset at each accumulation step.
In Table 4, we show the relative objective function values and speedups of different accumulation trees relative to the RandGreeDI algorithm. We observe that the objective function values of the GreedyML algorithm are almost identical to those of RandGreeDI. Our results show that the GreedyML algorithm becomes gradually faster as we increase the number of levels, with speedups ranging from about 1.4 to 2. This is because the $k$-medoid function is computationally intensive, and its computation cost increases quadratically with the number of images (Table 5). With a single accumulation level, the RandGreeDI algorithm pools $mk$ images at the root node but has only $n/m$ images at each leaf; thus, the computation at the root node dominates the cost. On the other hand, as we decrease the branching factor from 16 to 2, the number of images ($bk$) at the interior nodes of the GreedyML algorithm decreases proportionally. This gradual decrease in compute time is reflected in the total time and in the observed speedup.
6 Conclusion and Future work
We have developed a new distributed algorithm, GreedyML, that enhances the existing distributed algorithms for maximizing a constrained submodular function. We prove that GreedyML is $\alpha/(m/b + 1)$-approximate in expectation (where $\alpha$ is the approximation ratio of the Greedy algorithm, $m$ is the number of machines, and $b$ is the branching factor), but empirically demonstrate that its solution quality is close to that of the best approximation algorithms for several practical problems. Our algorithm alleviates the inherent serial computation and communication bottlenecks of the RandGreeDI algorithm while reducing memory requirements. This enables submodular maximization to be applied effectively to massive-scale problems.
Future work could include experiments with other hereditary constraints, such as matroid and $p$-system constraints. Another direction is to apply GreedyML to the closely related non-monotone and weakly submodular functions. Since our experiments suggest that GreedyML delivers higher-quality solutions than its expected approximation guarantee predicts, one could also investigate whether the approximation ratio can be improved.
References
- [1] David A Bader, Henning Meyerhenke, Peter Sanders, and Dorothea Wagner. Graph partitioning and graph clustering. In Contemporary Mathematics, volume 588. American Mathematical Society, 2012. 10th DIMACS Implementation Challenge Workshop.
- [2] Rafael da Ponte Barbosa, Alina Ene, Huy L. Nguyen, and Justin Ward. The power of randomization: Distributed submodular maximization on massive datasets. In Proceedings of the 32nd International Conference on Machine Learning, pages 1236–1244. JMLR.org, 2015. URL: http://proceedings.mlr.press/v37/barbosa15.html.
- [3] Adam Breuer, Eric Balkanski, and Yaron Singer. The FAST algorithm for submodular maximization. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1134–1143. PMLR, 13–18 July 2020. URL: https://proceedings.mlr.press/v119/breuer20a.html.
- [4] Chandra Chekuri and Kent Quanrud. Submodular function maximization in parallel via the multilinear relaxation. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 303–322. Society for Industrial and Applied Mathematics, 2019. doi:10.1137/1.9781611975482.20.
- [5] Laming Chen, Guoxin Zhang, and Eric Zhou. Fast greedy MAP inference for determinantal point process to improve recommendation diversity. In Advances in Neural Information Processing Systems, pages 5627–5638, 2018. URL: https://proceedings.neurips.cc/paper/2018/hash/dbbf603ff0e99629dda5d75b6f75f966-Abstract.html.
- [6] M. Coutino, S. P. Chepuri, and G. Leus. Submodular sparse sensing for Gaussian detection with correlated observations. IEEE Transactions on Signal Processing, 66:4025–4039, 2018. doi:10.1109/TSP.2018.2846220.
- [7] Tonmoy Dey, Yixin Chen, and Alan Kuhnle. Dash: A distributed and parallelizable algorithm for size-constrained submodular maximization. Proceedings of the AAAI Conference on Artificial Intelligence, 37(4):3941–3948, June 2023. doi:10.1609/aaai.v37i4.25508.
- [8] E. Elenberg, A. G. Dimakis, M. Feldman, and A. Karbasi. Streaming weak submodularity: Interpreting neural networks on the fly. In Advances in Neural Information Processing Systems, pages 4044–4054, 2017.
- [9] S M Ferdous, Alex Pothen, Arif Khan, Ajay Panyala, and Mahantesh Halappanavar. A parallel approximation algorithm for maximizing submodular $b$-matching. In SIAM Conference on Applied and Computational Discrete Algorithms (ACDA), pages 45–56, 2021. doi:10.1137/1.9781611976830.5.
- [10] FIMI. Frequent itemset mining dataset repository. http://fimi.uantwerpen.be/data/, 2003.
- [11] D. Golovin and A. Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research, 42:427–486, 2011. doi:10.1613/JAIR.3278.
- [12] Leonard Kaufman and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley, 1990. doi:10.1002/9780470316801.
- [13] David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Datamining, pages 137–146, 2003. doi:10.1145/956750.956769.
- [14] Andreas Krause and Daniel Golovin. Submodular function maximization. In Tractability: Practical Approaches to Hard Problems, pages 71–104. Cambridge University Press, 2014. doi:10.1017/CBO9781139177801.004.
- [15] Andreas Krause and Carlos Guestrin. Near-optimal observation selection using submodular functions. In Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, volume 7, pages 1650–1654, 2007. URL: http://www.aaai.org/Library/AAAI/2007/aaai07-265.php.
- [16] Andreas Krause, Jure Leskovec, Carlos Guestrin, Jeanne VanBriesen, and Christos Faloutsos. Efficient sensor placement optimization for securing large water distribution networks. Journal of Water Resources Planning and Management, 134(6):516–526, 2008.
- [17] Ravi Kumar, Benjamin Moseley, Sergei Vassilvitskii, and Andrea Vattani. Fast greedy algorithms in mapreduce and streaming. ACM Trans. Parallel Comput., 2(3):14:1–14:22, 2015. doi:10.1145/2809814.
- [18] Fei-Fei Li and Andrej Karpathy. Tiny imagenet challenge. http://cs231n.stanford.edu/tiny-imagenet-200.zip, 2017. [Online; last accessed 13-Mar-2024].
- [19] Hui Lin and Jeff Bilmes. Multi-document summarization via budgeted maximization of submodular functions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 912–920, 2010. URL: https://aclanthology.org/N10-1134/.
- [20] Hui Lin and Jeff Bilmes. A class of submodular functions for document summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 510–520, 2011. URL: https://aclanthology.org/P11-1052/.
- [21] L. Lovász. Submodular functions and convexity. In Achim Bachem, Bernhard Korte, and Martin Grötschel, editors, Mathematical Programming The State of the Art: Bonn 1982, pages 235–257. Springer Berlin Heidelberg, 1983. doi:10.1007/978-3-642-68874-4_10.
- [22] Michel Minoux. Accelerated greedy algorithms for maximizing submodular set functions. In J. Stoer, editor, Optimization Techniques, pages 234–243. Springer, 1978.
- [23] Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. Distributed submodular maximization: Identifying representative elements in massive data. In Advances in Neural Information Processing Systems, volume 26, pages 2049–2057, 2013. URL: https://proceedings.neurips.cc/paper/2013/hash/84d2004bf28a2095230e8e14993d398d-Abstract.html.
- [24] Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Decentralized submodular maximization: Bridging discrete and continuous settings. In Proceedings of the 35th International Conference on Machine Learning, pages 3616–3625, 2018.
- [25] Alexander Robey, Arman Adibi, Brent Schlotfeldt, Hamed Hassani, and George J. Pappas. Optimal algorithms for submodular maximization with distributed constraints. In Proceedings of the 3rd Conference on Learning for Dynamics and Control, pages 150–162, 2021. URL: http://proceedings.mlr.press/v144/robey21a.html.
- [26] Justin Sybrandt, Michael Shtutman, and Ilya Safro. Moliere: Automatic biomedical hypothesis generation system. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, pages 1633–1642, New York, NY, USA, 2017. Association for Computing Machinery. doi:10.1145/3097983.3098057.
- [27] Justin Sybrandt, Ilya Tyagin, Michael Shtutman, and Ilya Safro. Agatha: Automatic graph mining and transformer based hypothesis generation approach. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, CIKM ’20, pages 2757–2764, New York, NY, USA, 2020. Association for Computing Machinery. doi:10.1145/3340531.3412684.
- [28] K. Thekumparampil, A. Thangaraj, and R. Vaze. Combinatorial resource allocation using submodularity of waterfilling. IEEE Transactions on Wireless Communications, 15:206–216, 2016. doi:10.1109/TWC.2015.2469291.
- [29] Ehsan Tohidi, Rouhollah Amiri, Mario Coutino, David Gesbert, Geert Leus, and Amin Karbasi. Submodularity in action: From machine learning to signal processing applications. IEEE Signal Processing Magazine, 37(5):120–133, 2020. doi:10.1109/MSP.2020.3003836.
- [30] Leslie G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103–111, 1990. doi:10.1145/79173.79181.
- [31] Jaewon Yang and Jure Leskovec. Defining and evaluating network communities based on ground-truth. In Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics, pages 1–8, 2012.
Appendix A Submodular Functions and Complexity
Our algorithm can handle any hereditary constraint, but we consider only cardinality constraints in our experiments to keep the run times low. More general constraints involve additional computation to check whether an element can be added to the current solution while satisfying the constraints. They increase the computation time but not the communication time, and we believe the GreedyML algorithm will perform even better relative to the RandGreeDI algorithm for them. Cardinality constraints are widely used in applications such as sensor placement [16]; text, image, and document summarization [19, 20]; and information gathering [15]. The problem of maximizing a submodular function under a cardinality constraint can be expressed as follows:
$$\max \bigl\{\, f(S) \;:\; S \subseteq V,\ |S| \leq k \,\bigr\}.$$
Here $V$ is the ground set, $f$ is a non-negative monotone submodular function, and $k$ bounds the size of the solution set $S$.
In our experiments, we have considered the following three submodular functions.
-cover.
Given a universe $U$ of items, a collection $V = \{A_1, A_2, \ldots\}$ of subsets of $U$, and an integer $k$, the goal is to select a set $S$ containing $k$ of these subsets to maximize the number of items covered, $f(S) = \bigl|\bigcup_{A_i \in S} A_i\bigr|$.
-dominating set.
The $k$-dominating set problem is a special case of the $k$-cover problem defined on graphs, with the set of vertices as the ground set. We say that a vertex $v$ dominates all its adjacent vertices (denoted by $N(v)$). Our goal is to select a set $S$ of $k$ vertices that dominates as many vertices as possible, i.e., maximizes $f(S) = \bigl|\bigcup_{v \in S} N(v)\bigr|$. The marginal gain of a vertex is the number of vertices in its neighborhood that are not yet dominated. Therefore, the function exhibits diminishing marginal gains and is submodular.
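The following C++ sketch shows the marginal-gain computation for this objective; the CSR-style adjacency arrays and helper names are assumptions for illustration, not the paper's data structures.

```cpp
#include <vector>

// Marginal gain of adding vertex v to the dominating set: the number of
// neighbors of v that are not yet dominated by the current solution.
// The graph is stored in CSR form (offsets/targets); 'dominated[u]' marks
// whether u is already covered.  (Illustrative sketch, not the paper's code.)
int marginal_gain(int v, const std::vector<int>& offsets,
                  const std::vector<int>& targets,
                  const std::vector<char>& dominated) {
  int gain = 0;
  for (int idx = offsets[v]; idx < offsets[v + 1]; ++idx)
    if (!dominated[targets[idx]]) ++gain;
  return gain;
}

// After selecting v, its neighbors become dominated.
void commit(int v, const std::vector<int>& offsets,
            const std::vector<int>& targets, std::vector<char>& dominated) {
  for (int idx = offsets[v]; idx < offsets[v + 1]; ++idx)
    dominated[targets[idx]] = 1;
}
```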
-medoid problem.
The $k$-medoid problem [12] is used for exemplar-based clustering, which asks for a set of exemplars (cluster centers) representative of a large dataset. Given a collection of elements in a ground set $V$ and a dissimilarity measure $d(\cdot, \cdot)$, we define a loss function $L$ as the average dissimilarity between the exemplars $S$ and the elements of the data set, i.e., $L(S) = \frac{1}{|V|} \sum_{v \in V} \min_{s \in S} d(s, v)$. Following [23], we turn this loss minimization into a submodular maximization problem by setting $f(S) = L(\{e_0\}) - L(S \cup \{e_0\})$, where $e_0$ is an auxiliary element specific to the dataset. The goal is to select a subset $S \subseteq V$ of size $k$ that maximizes $f(S)$.
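The sketch below evaluates this objective over the locally available points, mirroring the local-objective variant used in the experiments of Section 5.2.4; the vector layout and helper names are our assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

using Point = std::vector<double>;   // one normalized feature vector (image)

double euclidean(const Point& a, const Point& b) {
  double d = 0.0;
  for (size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
  return std::sqrt(d);
}

// Average dissimilarity of the locally available data to its closest exemplar.
double loss(const std::vector<Point>& data, const std::vector<Point>& exemplars) {
  double total = 0.0;
  for (const Point& v : data) {
    double best = std::numeric_limits<double>::max();
    for (const Point& s : exemplars) best = std::min(best, euclidean(s, v));
    total += best;
  }
  return total / data.size();
}

// f(S) = L({e0}) - L(S u {e0}), where e0 is the all-zeros auxiliary vector.
double kmedoid_objective(const std::vector<Point>& data, std::vector<Point> S) {
  const Point e0(data.front().size(), 0.0);     // auxiliary element
  const double base = loss(data, {e0});
  S.push_back(e0);
  return base - loss(data, S);
}
```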
Next, we analyze the computational and communication complexity of our GreedyML algorithm using the bulk synchronous parallel (BSP) model of parallel computation [30]. We denote the number of elements in the ground set by $n$, the solution size by $k$, the number of machines by $m$, and the number of levels in the accumulation tree by $L$.
Computational Complexity.
The number of objective function calls made by the sequential Greedy algorithm is $O(nk)$, since $k$ elements are selected for the solution and we may need to compute the marginal gains of all $n$ elements for each of them. Each machine in the RandGreeDI algorithm makes $O(nk/m + mk^2)$ function calls, where the second term comes from the accumulation step. Each machine of the GreedyML algorithm with branching factor $b$ makes $O(nk/m + Lbk^2)$ calls. Recall that $L = \lceil \log_b m \rceil$.
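To see how the accumulation term can dominate, here is an illustrative calculation with round numbers chosen by us for exposition (they are not measurements from the paper):

```latex
\[
\begin{aligned}
&\text{Take } n = 6\times 10^{7},\; m = 32,\; k = 10^{4}:\\
&\text{RandGreeDI: } \tfrac{nk}{m} + mk^{2}
   \approx 1.9\times 10^{10} + 3.2\times 10^{9} \approx 2.2\times 10^{10},\\
&\text{GreedyML } (b=2,\ L=5):\; \tfrac{nk}{m} + Lbk^{2}
   \approx 1.9\times 10^{10} + 1.0\times 10^{9} \approx 2.0\times 10^{10},\\
&\text{so the accumulation work drops from } 3.2\times 10^{9}
   \text{ to } 10^{9} \text{ calls, while the leaf work is unchanged.}
\end{aligned}
\]
```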
We note that the time complexity of a function call depends on the specific function being computed. For example, in the $k$-cover and $k$-dominating set problems, a function evaluation costs $O(\Delta)$, where $\Delta$ is the size of the largest subset for $k$-cover and the maximum degree of a vertex for the vertex dominating set. In both cases, the runtime complexity is $O\bigl((nk/m + mk^2)\,\Delta\bigr)$ for the RandGreeDI algorithm and $O\bigl((nk/m + Lbk^2)\,\Delta\bigr)$ for the GreedyML algorithm. The $k$-medoid problem computes a local objective function value, and a single call costs $O(F\, n_{\ell})$, where $F$ is the number of features and $n_{\ell}$ is the number of elements present on the machine. For the leaves of the accumulation tree, $n_{\ell} = n/m$, while for the interior nodes $n_{\ell} = mk$ in RandGreeDI and $n_{\ell} = bk$ in GreedyML. Therefore, its complexity is $O\bigl(F(n^2k/m^2 + m^2k^3)\bigr)$ for the RandGreeDI algorithm and $O\bigl(F(n^2k/m^2 + L b^2 k^3)\bigr)$ for the GreedyML algorithm.
Communication Complexity.
Each edge in the accumulation tree represents communication from a machine at a lower level to one at a higher level and consists of four messages: the indices of the selected elements, of size $k$; the size of the data associated with each selected element (proportional to the size of its adjacency list, $O(\Delta)$); the total size of the data elements; and the data associated with each selected element. Therefore, the total volume of communication is $O(k\Delta)$ per child. Since at each level a parent node receives messages from $b$ children, the communication complexity is $O(bk\Delta)$ per parent. Therefore, the communication complexity of the RandGreeDI algorithm is $O(mk\Delta)$ and that of the GreedyML algorithm is $O(Lbk\Delta)$. We summarize these results in Table 5.
| Function | Metric | Greedy | RandGreeDI | GreedyML |
|---|---|---|---|---|
| All | Elements per leaf node | $n$ | $n/m$ | $n/m$ |
| | Calls per leaf node | $O(nk)$ | $O(nk/m)$ | $O(nk/m)$ |
| | Elements per interior node | 0 | $mk$ | $bk$ |
| | Calls per interior node | 0 | $O(mk^2)$ | $O(bk^2)$ |
| | Total function calls | $O(nk)$ | $O(nk/m + mk^2)$ | $O(nk/m + Lbk^2)$ |
| $k$-cover / $k$-dominating set | $\Delta$: subset size / number of neighbors | | | |
| | Cost per call | $O(\Delta)$ | $O(\Delta)$ | $O(\Delta)$ |
| | Computational complexity | $O(nk\Delta)$ | $O((nk/m + mk^2)\Delta)$ | $O((nk/m + Lbk^2)\Delta)$ |
| | Communication cost | 0 | $O(mk\Delta)$ | $O(Lbk\Delta)$ |
| $k$-medoid | $F$: number of features | | | |
| | Cost per call in leaf node | $O(Fn)$ | $O(Fn/m)$ | $O(Fn/m)$ |
| | Cost per call in interior node | 0 | $O(Fmk)$ | $O(Fbk)$ |
| | Computational complexity | $O(Fn^2k)$ | $O(F(n^2k/m^2 + m^2k^3))$ | $O(F(n^2k/m^2 + Lb^2k^3))$ |
| | Communication cost | 0 | $O(mkF)$ | $O(LbkF)$ |
| Parameter | Description | Parameter | Description |
|---|---|---|---|
| $\alpha$ | Approximation ratio of the Greedy algorithm | $U$ | Complete universe for the input dataset |
| $b$ | Branching factor of the accumulation tree | $\lvert U\rvert$ | Size of the universe |
| $m$ | Number of leaves in the accumulation tree | $V$ | Input dataset |
| $m$ | Number of machines used for computation | $n$ | Size of the input dataset |
| $L$ | Number of levels of the accumulation tree | $V_{\ell,i}$ | Dataset corresponding to a node of the tree |
| $\ell$ | Level identifier for a node | $S$ | Solution set |
| $i$ | Machine identifier for a node | $k$ | Size of the solution |
| $f$ | Submodular function | $OPT$ | The optimal solution |
| $\hat{f}$ | Lovász extension of the function $f$ | $p_e$ | Probability that $e \in OPT$ is selected by the Greedy algorithm when sampled from $V$ |
| $V_i$ | Part of the dataset assigned to machine $i$ | | |
Appendix B Cluster Centers (Images) selected by GreedyML and RandGreeDI algorithms
