Better Process Mapping and Sparse Quadratic Assignment

Communication and topology aware process mapping is a powerful approach to reduce communication time in parallel applications with known communication patterns on large, distributed memory systems. We address the problem as a quadratic assignment problem (QAP), and present algorithms to construct initial mappings of processes to processors as well as fast local search algorithms to further improve the mappings. By exploiting assumptions that typically hold for applications and modern supercomputer systems such as sparse communication patterns and hierarchically organized communication systems, we arrive at significantly more powerful algorithms for these special QAPs. Our multilevel construction algorithms employ recently developed, perfectly balanced graph partitioning techniques and excessively exploit the given communication system hierarchy. We present improvements to a local search algorithm of Brandfass et al., and decrease the running time by reducing the time needed to perform swaps in the assignment as well as by carefully constraining local search neighborhoods. Experiments indicate that our algorithms not only dramatically speed up local search, but due to the multilevel approach also find much better solutions~in~practice.


Introduction
distance between PE i and PE j. That is, the cost for communicating the amount C i,j between processors i and j is C i,j D i,j . We follow Brandfass et al. [5] and others, and model the embedding problem as a quadratic assignment problem (QAP): Find a one-to-one mapping Π of processes to PEs which minimizes the overall communication cost. More precisely, we want to minimize J(C, D, Π) := i,j C Π(i),Π(j) D i,j where the sum is over all PE pairs and k = Π(i) means that process k is assigned to PE i. Note that searching for the inverse permutation instead, i. e., assigning process i to PE Π −1 (i), results in the same assignment problem as Π is a one-to-one mapping. Throughout this work, we assume that C and D are symmetric -otherwise one can create equivalent QAP problems with symmetric inputs [5].
In this paper, we focus on sparse communication patterns, and therefore do not want to store the complete communication matrix but instead represent it more efficiently as a graph. On the other hand, typical system topologies feature a hierarchy that we can exploit. Hierarchy information, and in general D, is given implicitly and can be queried, and therefore does not have to be stored explicitly. Graph partitioning is a key component in our algorithms to find initial solutions. The graph partitioning problem looks for blocks of nodes V 1 ,. . . ,V k that partition V , i. e., V 1 ∪· · ·∪V k = V and V i ∩ V j = ∅ for i = j. The balancing constraint demands that ∀i ∈ 1..k : c(V i ) ≤ L max := (1 + ) c(V )/k for some parameter . In the perfectly balanced case the imbalance parameter is set to zero, i. e., no deviation from the average is allowed. One commonly used objective is to minimize the total cut i<j w(E ij ) where E ij := {{u, v} ∈ E : u ∈ V i , v ∈ V j }. A vertex v ∈ V i that has a neighbor w ∈ V j , i = j, is a boundary vertex.
Related Work. There has been a huge amount of research on GP, and we refer the reader to [4,6] for extensive material and references. All general-purpose methods that work well on large real-world graphs are based on the multilevel principle. The basic idea can be traced back to multigrid solvers for systems of linear equations [25]. Well-known MGP software packages include Jostle [28], Metis [16], and Scotch [20]. Jostle contains algorithms to compute processor assignments in scientific simulations. Jostle integrates local search into a multi-level method to partition the model of computation and communication. To do so, they solve the problem on the coarsest level and afterwards perform refinement that takes the user supplied network communication model into account. Scotch performs dual recursive bipartitioning to perform this task.
There is likewise a large literature on process mapping, often in the context of scientific applications using MPI (Message-Passing Interface). Hatazaki [13] was among the first authors to propose graph partitioning to solve the MPI process mapping problem for unweighted process topologies. Träff [26] used a similar approach, and gave one of the first non-trivial implementations for the NEC vector supercomputers. Mercier and Clet-Ortega and later Jeannot [17,18] simplify the mapping problem by only considering the topology inside the compute nodes themselves and ignoring the topology of the network. Multiple placement policies are investigated to enhance overall system performance. Yu et al. [29] discuss and implement embedding heuristics for the BlueGene 3d torus systems. Hoefler and Snir [15] optimize instead the congestion of the mapping, show that this problem is NP-complete, and give a corresponding heuristic with an experimental evaluation based on application data from the Florida Sparse Matrix Collection. Routing aware mapping heuristics taking the hierarchy of specific hardware topologies into account were discussed in [1]. Vogelstein et al. [27] concentrate on solving general quadratic assignment and graph matching problems. They propose a gradient based heuristic that involves solving assignment problems and give experimental evidence for better solution quality and speed compared to certain other heuristics. The worst-case complexity of their approach is high, O(n 3 ) steps.
Detailed Related Work. We now discuss related work by Müller-Merbach [19], Heider [14] and Brandfass et al. [5] as well as Glantz et al. [12] in greater detail since our work either makes use of the tools proposed by those authors or because we compare against their results. Müller-Merbach [19] proposes a greedy construction method to obtain an initial permutation for the QAP. The method roughly works as follows: Initially compute the total communication volume for each processor and also the total distance for each core. Note that this corresponds to the weighted degrees of the vertices in the communication and distance models, respectively. Afterwards, the process with the largest communication volume is assigned to the core with the smallest total distance. To build a complete assignment, the algorithm proceeds by looking at unassigned vertices and cores. For each of the unassigned processes the communication load to already assigned vertices is computed. For each core, the total distance to already assigned cores is computed. The process with the largest communication sum is assigned to the core with the smallest distance sum. Glantz et al. [12] note that the algorithm does not link the choices for the vertices and cores and hence propose a modification of this algorithm called GreedyAllC (the best algorithm in [12]). GreedyAllC links the mapping choices by scaling the distance with the amount of communication to be done. The algorithm has the same asymptotic complexity and memory requirements as the algorithm by Müller-Merbach. We also compare our proposed methods against GreedyAllC in Section 4.
Heider [14] proposes a method to improve an already given permutation/mapping. The method repeatedly tries to perform swaps in the assignment. To do so, the author defines a pair-exchange neighborhood N (Π) that contains all permutations that can be reached by swapping two elements in Π. Here, swapping two elements means that Π(i) will be assigned to processor j and Π(j) will be assigned to processor i after the swap is done. The algorithm then looks at the neighborhood in a cyclic manner. More precisely, in each step the current pair (i, j) is updated to (i, j + 1) if j < n, to (i + 1, i + 2) if j = n and i < n − 1, and lastly to (1,2) if j = n and i = n − 1. A swap is performed if it yields positive gain, i. e., the swap reduces the objective. The overall runtime of the algorithm is O(n 3 ). We denote the search space with N 2 . To reduce the runtime, Brandfass et al. [5] introduce a couple of modifications. First of all, only symmetric inputs are considered. If the input is not symmetric, the input is substituted by a symmetric one such that the output of the algorithm remains the same. Second, pairs (i, j) for which the objective cannot change, are not considered. For example, if two processes reside on the same compute node, swapping them will not change the objective. Lastly, the authors partition the neighborhood search space into s consecutive index blocks and only perform swaps inside those blocks. This reduces the number of possible pairs from O(n 2 ) to O(ns) overall pairs. We denote the search space with N p (pruned neighborhood). In addition, instead of starting from the identity permutation, the authors use the method of Müller-Merbach [19] to compute an initial solution. This improves runtime of the local search approach as well as the objective of the solution.

Rank Reordering Algorithms
We now present our main contributions and techniques. This includes algorithms to compute initial solutions, speeding up the local search algorithms for sparse communication patterns and defining new search spaces for the local search algorithm. Throughout this section, we assume that the input communication matrix is already given as a graph G C , i. e., no conversion of the matrix into a graph is necessary. More precisely, the graph representation is defined as is the edge set of the processes that need to communicate with each other. Note that the set contains forward and backward edges, and that the weights of the edges in the graph correspond to the entries in the matrix C.

Initial Solutions
We propose two strategies exploiting the hierarchy. Intuitively, we want to identify subgraphs in the communication graph of processes that have to communicate much with each other and then place such processes closely, i. e., on the same node, same rack and so forth. In the following, we assume a homogeneous hierarchy of the supercomputer, but our algorithms can be extended to heterogeneous hierarchies in a straightforward way. Let S = a 1 , a 2 , ..., a k be a sequence describing the hierarchy of the supercomputer. The sequence should be interpreted as each processor having a 1 cores, each node a 2 processors, each rack a 3 nodes, . . . . We propose two algorithms to compute initial mappings, a top down and a bottom up approach. The first one, top down, splits the communication graph recursively and the second one, builds a hierarchy bottom up.
The top down approach starts by computing a perfectly balanced partition of G C into a k blocks each having n/a k vertices (processes). The partitioning task is done using the techniques provided by Sanders and Schulz [22] which provide high quality partitions and guarantee that each block of the output partition has the specified amount of vertices. In principle, the nodes of each block will be assigned completely to one of the a k system entities. Each of the system entities provides precisely n/a k PEs. We then proceed recursively and partition each subgraph induced by a block into a k−1 blocks and so forth. The recursion stops as soon as the subgraphs have only a 1 vertices left. In the base case, we assign processes to permutation ranks.
The bottom up approach proceeds in the opposite order of the hierarchy. That means the communication graph G C is split first into k = n/a 1 blocks with precisely a 1 vertices each. Again, this is done using the perfectly balanced partitioning techniques mentioned above. Each block will later on be assigned to a unique system entity that is able to host a 1 processes, i. e., a node having a 1 cores. Then each of the blocks is contracted and we partition the contracted graph and so forth. In this case, if replacing edges of the form {u, w} , {v, w} would generate two parallel edges {x, w}, we insert a single edge with C x,w = C u,w + C v,w . This way, the correct sum of the distances are accounted for in later stages of the algorithm. The recursion stops as soon as the last hierarchy stage is reached, i. e., the last graph with n vertices has been partitioned into n /a k vertices with a k vertices each. Recall that vertices in the same block will be assigned to a specified subpart of the system. In this case, a vertex in the graph on the last level of the recursion represents a whole set of task with the property that the sum of the vertex weights of each block is precisely the amount of PEs that are present in the subsystem that they are assigned to. We then backtrack the recursion to construct the final mapping.

Faster Swapping
Initially computing as well as recomputing the objective function after a swap is performed is an expensive step in the algorithm of Brandfass et al. [5]. In their work, both the communication pattern as well as the distances between the PEs are given as complete matrices. These matrices have a quadratic number of elements and hence the initial computation of the objective function costs O(n 2 ) time. After a swap is performed, Brandfass et al. update the objective using the objective function value before the swap. This is done by looking at all elements in the corresponding columns of the communication and distance matrix. Overall, an update step in their algorithm takes O(n) time which is clearly a bottleneck for sparse communication patterns. We now describe how we speed up the initial computation as well as the update of the objective. As a first step, we rewrite the objective to work with the inverse of the permutation: with the interpretation that task u is assigned to PE Π −1 (u). This makes it easier to work with the graph representation of the communication matrix. We rewrite the objective to work with the graph representation instead of the complete communication pattern matrix C: The first observation is that given an initial mapping, we can compute the initial objective in O(n + m) time which is better for sparse graphs. Our next goal is to make the update of the objective fast after a swap has been performed. To do so, let be the contribution to the objective of a single vertex u given the current mapping. Note that by using Γ Π −1 , we can again rewrite the objective J(C, D, Π) := u∈V Γ Π −1 (u). Throughout the algorithm, the vertex contributions Γ are always kept up to date. Additionally, it is quite easy to see that performing a swap in the assignment only affects the nodes that are swapped themselves as well as their neighborhood in the communication graph. Hence, we only need to update the node contributions of those nodes and can update the objective accordingly. We update the node contributions as follows: Let u and v be the vertices to be swapped in their assignment Π −1 . We start by subtracting the node contributions of all affected nodes from the objective. Before we perform the swap, we iterate over the neighbors of u and v and subtract the contribution induced by the edge connecting the neighbor from its Γ value. We then set the node contributions of u and v to zero and perform the swap. Now we again iterate over all neighbors, basically recomputing the node contributions of u and v, and at the same time adding the new contribution induced by the edge connecting the neighbor to its Γ value. As a last step, we add the new node contributions of all affected nodes from the objective. Overall, this takes O(d u + d v ) time where d u and d v are the degrees of the vertices u and v in the communication graph.

Alternative Local Search Spaces
We now define swapping neighborhoods using the communication graph G C . In the simplest version, assignments are only allowed to be swapped if the processes are connected by an edge in the communication graph, i. e., the processes have to communicate with each other. We denote this neighborhood with N C . The size of the search space is O(m) since it contains exactly m pairs that may be swapped. Swaps are performed in random order. Local search terminates after m unsuccessful swaps, i. e., all pairs have been tried and no swap resulted in a gain in the objective. Note that this approach assumes that swaps with positive gain are close in terms of graph theoretic distance in the communication graph. We also define augmented neighborhoods in which swaps are allowed if two processes have distance less than d in the communication graph. We denote this neighborhood by N d C . Note that this creates a sequence of neighborhoods increasing in size N C ⊆ N 2 C ⊆ . . . ⊆ N n C = N 2 where N 2 is the largest neighborhood used by Brandfass et al. [5] (see Section 2). Our experimental section shows that performing swaps with small graph theoretic distance in the communication graph is sufficient to obtain good solutions.

Miscellanea
Constant Time Distance Oracle. Storing the complete distance matrix requires O(n 2 ) space. However, due to the problem structure it is not necessary to store the complete matrix. Instead one can build an interval tree over the PE given describing the hierarchy. The distance of two PEs can then be found by finding the lowest common ancestor in the tree. Such a query can be answered in constant time by investing O(n) preprocessing time [3].
We can use a simpler approach that obtains the distance of two PEs by a few, simple division operations. More precisely, for a hierarchy S = a 1 , a 2 , ..., a k we initially build an array describing the sizes of the intervals on the different levels of the hierarchy. A query then proceeds to scan the implicitly given intervals from top to bottom until the PEs are not on the same subsystem. We then return the corresponding distance.

Experiments
Methodology. We have implemented the algorithm described above within the KaHIP framework using C++ and compiled all algorithms using gcc 4.63 with full optimization's turned on (-O3 flag). We integrated our algorithms in KaHIP v1.00 graph partitioning framework. They will integrated into that framework and also released separately. The codes of Brandfass et al. [5] could not be made available to us, so that we implemented those algorithms in our framework as well. Our implementation also uses the sparse representation of the communication pattern. GreedyAllC [12] has been kindly provided by the authors. We also compare against the dual recursive bisection codes of Hofler and Snir [15] (LibTopoMap). Our experiments evaluate the objective of the quadratic assignment problem as well as the running time necessary to compute the solution. To keep the evaluation simple, we use mostly one system hierarchy configuration D. We perform ten repetitions of each algorithm using different random seeds for initialization. Unless otherwise mentioned, we use the geometric mean when reporting averages in order to give every instance the same influence on the final score. The system we are using to compute solutions has four Octa-core Intel Xeon E5-4640 processors (32 cores) which run at a clock speed of 2.4 GHz. It has 512 GB local memory.
Instances. We use graphs from various sources to test our algorithm. In Section 4.1, we use these graphs as input to a partitioning algorithm that partitions them into a given number of blocks and then computes the communication graph C which is the input to our mapping algorithms. We use the largest six graphs from Chris Walshaw's benchmark archive [24]. Graphs derived from sparse matrices have been taken from the Florida Sparse Matrix Collection [9]. We also use graphs from the 10th DIMACS Implementation Challenge [2] website. Here, rggX is a random geometric graph with 2 X nodes where nodes represent random points in the unit square and edges connect nodes whose Euclidean distance is below 0.55 ln n/n. The graph delX is a Delaunay triangulation of 2 X random points in the unit square. The graphs af_shell9, thermal2, and nlr are from the matrix and the numeric section of the DIMACS benchmark set. The graphs europe and deu are large road networks of Europe and Germany taken from [10]. Basic properties of the graphs under consideration can be found in Table A.3.

Sparse Quadratic Assignment Problem
In this section, we look at the impact of the various algorithmic components that we presented throughout the paper. In general, we use a hierarchy S = a 1 , . . . , a k describing the system hierarchy and communication parameters D = d 1 , . . . , d k describing the distances between various cores in the subsystems. More precisely, d i describes the distance of two cores that are in the same subsystems for i < i, and in different subsystems for i ≥ i. The total number of cores is then given by n = i a i . Here, we focus on two different system configurations to keep the evaluation simple. Our process in this section is as follows: Take the input graph, partition it into n blocks using the fast configuration of KaHIP, compute the communication graph induced by that (vertices represent blocks, edges are induced by connectivity between blocks, edge cut between two blocks is used as communication volume) and then compute the mapping of the communication graph to the specified system.  Table 1 Average running time and average speedup of local search for pruned search space Np. m/n is the average density of the generated instances, t LS the average running time of the algorithm using slow gain computations and t fastLS the average running time using fast gain computations.

Speed-Up of Local Search
We now take the algorithm configurations initially used by Brandfass et al. [5] and investigate the impact of our faster local search algorithms. The configurations are as follows: Use the greedy growing algorithm by Müller-Merbach (as described in Section 2) to provide initial solutions and use the pruned local search neighborhood N p by Brandfass et al. [5] (see Section 2 for details). We run two configurations: One in which computing the gain takes linear time (the old algorithm) and one with our improved algorithm. In this experiment, we use S = 4 : 16 : k, D = 1 : 10 : 100 with k = 2 i , i ∈ {1, ..., 9}. Note that the objective of the computed solutions by the algorithm using faster gain computations is precisely the same as their counter part, hence we do not report the value of the objective in this section. The results of the experiments are summarized in Figure 1 and Table 1. First, we observe that our new algorithm is always faster than the old algorithm. This is expected since the models of computation and communication that are mapped are indeed sparse. Table 1 shows that our fast local search algorithm scales almost linearly in n while the algorithm not using fast gain computations shows quadratic scaling behaviour. The table also already shows a dependency of our algorithm on the  density of the instances. This is due to the fact that the gain computation depends on the degrees of the vertices in the communication graph and is in alignment with our theoretical analysis. The expected dependency on the density of the instances can also be seen more clearly in Figure 1. The smallest algorithmic speedup obtained in this experiment is two and the largest speedup is approximately 3 596. We conclude that exploiting the sparsity of the application can improve the running time of local search significantly. We now always use fast gain computations.

Local Search Neighborhoods
In this section, we look at the influence of local search neighborhoods on final solution quality. The base configuration used here employs the greedy growing algorithm by Müller-Merbach for initialization. Afterwards local search is done using the specified local search neighborhood, i. e., the quadratic neighborhood N 2 , the pruned quadratic neighborhood  Figure 2 presents performance plots using all instances. A curve in a performance plot for algorithm X is obtained as follows: For each instance, we calculate the ratio between the objective or running time obtained by any of the considered algorithms and objective or running time of algorithm X. These values are then sorted. Additionally, we present average ratios of solution quality and running time in Table 2. First, the local search algorithm using the N 1 neighborhood appears to be the fastest algorithm but also the worst in terms of solution quality. Compared to the initial construction heuristic it takes roughly a factor 1.34 in running time while improving solution quality by roughly 4%. With increasing distance d for the local search neighborhood N d , solutions improve but also the running time increases. As expected, the local search algorithm using the largest local search neighborhood N 2 computes the best solutions. Here, solutions generated by the initial heuristic are improved by roughly 20%. However, this is also the slowest algorithm (a factor 443 slower than the initial construction heuristic). Also note that we are only able to evaluate the performance of the algorithm at that scale due to the fast gain computations introduced in this paper. Additionally, as n increases the algorithm becomes much slower. This is due to the fact that convergence of the algorithm takes more time for larger n. In contrast, the other local search neighborhoods show much better scaling behaviour as expected. The local search neighborhood N 10 is faster and computes solutions that are only slightly worse than N 2 . For example, for n = 32K the algorithm using N 10 is more than a factor nine faster and computes solutions that are only 5.5% worse.

Initial Heuristics and Their Scaling Behaviour
We now evaluate the different heuristics that can be used to create solutions. For the evaluation, we employ the algorithm of Müller-Merbach [19], GreedyAllC [12], LibTopoMap [15] (dual recursive bisectioning), Identity, Random, the Bottom-Up as well as the Top-Down and the Top-Down algorithm combined with local search that uses the N 10 neighborhood (Top-Down+N 10 ). The problems we look at are defined as S 1 = 4 : 16 : k, D 1 = 1 : 10 : 100 with k ∈ {1, . . . , 128}. We run the Bottom-Up algorithm only to k ≤ 50 due to its large running time. Figure 3 shows the average improvement over solutions obtained by the algorithm of Müller-Merbach and a performance plot for the different algorithms. Indeed, the random mapping algorithms performs worse than the algorithm of Müller-Merbach. On average, the objective computed by the algorithm is 67% worse than the solutions computed by the algorithm of Müller-Merbach. Our Top-Down algorithms yields the best solutions on most of the instances. On average, solutions computed by Top-Down are 52% better than the solutions computed by Müller-Merbach. Adding local search with the N 10 neighborhood to the algorithm yields additional 5.3% improvement on average. GreedyAllC only improves slightly, i. e., 1% on average, over the algorithm of Müller-Merbach. The identity mapping seems to be the best algorithm for powers of two. This is due to the way the input to the algorithms is constructed, i. e., blocks are initially assigned by KaHIP. To be more precise, KaHIP uses a recursive bisection algorithm on the input graph to compute a model of computation and communication (the input to our mapping algorithms). In each recursion it assigns consecutive blocks to the left side and to the right side. Hence, for powers of two, the identity mapping yields a strategy similar to using recursive bisection on the model to be mapped with good bisections. If the number of elements is not a power of two, then the bisections implied by the identity are not good and hence it performs worse.
LibTopoMap is somewhere in between. It mostly computes better solutions than the greedy algorithms but overall worse solutions than BottomUp and TopDown. On average, solutions are 8% better than the solutions computed by the greedy algorithm of Müller-Merbach. Interestingly, its achieved solution quality is better when the number of vertices in the instances is close to a power of two. This is due to the fact that the algorithm uses dual recursive bisection on the communication and processor graph. However, when the input size is not close to a power of two, there are no good bisections in the processor graph.
In our experiments, Bottom-Up is the slowest algorithm. This is due to the fact that on the coarsest level large partitioning problems have to be solved. The Top-Down algorithm does not have the problem, but is still slower than all other algorithms (except Bottom-Up). On average it is a factor 194 slower than the Müller-Merbach algorithm and a factor 40 slower than GreedyAllC. LibTopoMap is roughly a factor 18 slower than the algorithm of Müller-Merbach. However, the running time of Top-Down is on average only 80% of the time it takes to partition the input graph (using the fast configuration of KaHIP), i. e., the time it takes to create the model which is the input to the mapping algorithms. Adding local search with the N 10 neighborhood to the algorithm costs additional time, on average 64% of the time it takes to partition the graph. Considering also the high solution quality advantage, we believe that the algorithms are still highly useful in practice.
Scalability. We now scale the problem size to n = 2 19 processes/cores. We take the largest graph from our benchmark collection rgg24 and create mapping problems defined as S 1 = 4 : 16 : 128 : k, D 1 = 1 : 10 : 100 : 1000 with k ∈ 2 i , i ∈ {1, . . . , 8}. We run Müller-Merbach and the TopDown+N 1 algorithm once. Both algorithms work well on our machine until i = 4 (n = 2 17 ), at which point there is not sufficient memory available if the implementations use the full distance matrix. Note that the machine has 512GB of memory. Hence, we performed a second run of both algorithms computing distances online (as described in Section 3.4). Note that the version of the Müller-Merbach algorithm is only able to solve larger problem sizes due to both of our changes: the sparse representation of the communication pattern as well as online computation of distances. Computing distances online slows down Müller-Merbach roughly by a factor of five and local search by a factor of three. The running time of TopDown remains the same since it uses the provided hierarchy instead of the distance matrix. In turn the running time advantage of Müller-Merbach also decreases. This is also due to the fact that Müller-Merbachs algorithm is a quadratic time algorithm. For the largest mapping problem (n = 2 19 ), the Müller-Merbach algorithm takes a factor 1.64 longer than TopDown. Overall, computing distances online enables a potential user of the algorithms to tackle larger mapping problems.

Conclusion
In high performance systems, different cores that are on the same processor usually have the same communication link quality when they communicate with each other, as do cores that are on the same node but not on the same processor and so forth. Using these assumptions, we derived algorithms to create initial mappings as well as faster local search algorithms with alternative local search spaces. Overall, our algorithms drastically speedup local search and are able to compute high quality solutions.
Important future work includes deriving distributed parallel algorithms for the problem. Moreover, we want to investigate algorithms to create a hierarchy of the system if it is not provided as an input to our algorithm. It may be worth to look at more complex local search neighborhoods, e. g., local search spaces that allow to swap whole groups of assignments or allow swapping along cycles in the communication graph. We also want to study the impact of our process mapping on parallel application performance.