The Subset Assignment Problem for Data Placement in Caches

We introduce the subset assignment problem in which items of varying sizes are placed in a set of bins with limited capacity. Items can be replicated and placed in any subset of the bins. Each (item, subset) pair has an associated cost. Not assigning an item to any of the bins is not free in general and can potentially be the most expensive option. The goal is to minimize the total cost of assigning items to subsets without exceeding the bin capacities. This problem is motivated by the design of caching systems composed of banks of memory with varying cost/performance specifications. The ability to replicate a data item in more than one memory bank can benefit the overall performance of the system with a faster recovery time in the event of a memory failure. For this setting, the number $n$ of data objects (items) is very large and the number $d$ of memory banks (bins) is a small constant (on the order of $3$ or $4$). Therefore, the goal is to determine an optimal assignment in time that minimizes dependence on $n$. The integral version of this problem is NP-hard since it is a generalization of the knapsack problem. We focus on an efficient solution to the LP relaxation as the number of fractionally assigned items will be at most $d$. If the data objects are small with respect to the size of the memory banks, the effect of excluding the fractionally assigned data items from the cache will be small. We give an algorithm that solves the LP relaxation and runs in time $O({3^d \choose d+1} \text{poly}(d) n \log(n) \log(nC) \log(Z))$, where $Z$ is the maximum item size and $C$ the maximum storage cost.


Introduction
We define a combinatorial optimization problem which we call the subset assignment problem. An instance of this problem consists of $n$ items of varying sizes and $d$ bins of varying capacities. Any item can be replicated and assigned to multiple bins. A problem instance also includes $n \cdot 2^d$ cost parameters which denote, for each item and each subset of the bins, the cost of storing copies of the item on that subset of the bins. The objective is to find an assignment of items to subsets of bins which minimizes the total cost subject to the constraint that the sum of the sizes of items assigned to each bin does not exceed the capacity of the bin.
The costs do not necessarily exhibit any special properties, although we do assume that they are non-negative. For example, we do not assume that cost necessarily increases or decreases the more bins an item is assigned to. Assigning an item to the empty set, which corresponds to not assigning the item to any of the bins, is not free in general and can potentially be the most expensive option for an item.
The subset assignment problem is a natural generalization of the multiple knapsack problem (MKP), which corresponds to the restricted version in which each item can only be stored in a single bin. The book by Martello and Toth [16] and the more recent book by Kellerer et al. [13] both devote a chapter to MKP. MKP is known to be NP-complete but does have a polynomial time approximation scheme [5]. For the application we are interested in, the number of items $n$ is very large (on the order of billions) and the number of bins is a small constant (on the order of 3 or 4). Furthermore, the size of even the largest item is small with respect to the capacity of the bins. Since there is an optimal solution for the linear programming relaxation in which at most $d$ items are fractionally placed, the effect of excluding the fractionally assigned items from the cache is negligible. Therefore, we focus on an efficient solution to the linear programming relaxation.
The linear relaxation of MKP can be expressed as a minimum cost flow on a bipartite graph, a classic and well studied problem in the literature [1]. Tighter analysis for the case of minimum cost flow on an imbalanced bipartite graph (n >> d) is given in [10] and improved in [2]. Naturally, the goal with highly imbalanced bipartite graphs (which corresponds to the situation in the subset assignment problem in which the number of items is much larger than the number of bins) is to minimize dependence on n, even at the expense of greater dependence on d.
We propose an algorithm for the linear programming relaxation of the subset assignment problem that is similar in structure to cycle canceling algorithms for min-cost flow and is also inspired by the concept of a bipush, which is central to the tighter analysis in [10] and [2]. The analysis shows that our algorithm runs in time $O(f(d)\,\text{poly}(d)\, n \log(n) \log(nC) \log(Z))$, where $C$ is the maximum cost of storing an item on any subset of the bins and $Z$ is the maximum size of any item. The function $f(d)$ is defined to be the number of distinct sets of vectors $\{\vec v_1, \ldots, \vec v_r\}$ where $\vec v_i \in \{-1, 0, 1\}^d$ and the solution $\vec\alpha$ to the equation $\sum_{i=1}^{r} \alpha_i \vec v_i = \vec 0$ with $\alpha_1 = 1$ is unique and positive ($\vec\alpha > 0$). In order for the solution $\vec\alpha$ to be unique, the first $r - 1$ vectors must be linearly independent and therefore $r \le d + 1$. If $r < d + 1$, the set can be uniquely expanded to a set of size $d + 1$ such that the solution to $\sum_{i=1}^{d+1} \alpha_i \vec v_i = \vec 0$ with $\alpha_1 = 1$ is unique and non-negative ($\vec\alpha \ge 0$). This observation gives an upper bound of ${3^d \choose d+1}$ for $f(d)$, hence the running time of $O({3^d \choose d+1}\,\text{poly}(d)\, n \log(n) \log(nC) \log(Z))$. Numerical simulation has shown that $f(3) = 778$ and $f(4) = 531{,}319$. Since the problem specification requires $n \cdot 2^d$ cost parameters, an exponential dependence on $d$ is unavoidable. A direction for future research is to reduce the dependence on $d$ from exponential in $d^2$ to exponential in $d$.
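To give a sense of how loose the bound ${3^d \choose d+1}$ is relative to the simulated values of $f(d)$ quoted above, the following minimal Python sketch evaluates the bound for $d = 3$ and $d = 4$. The function name `profile_bound` is hypothetical; the values 778 and 531,319 are the ones reported in the text and are not recomputed here.

```python
from math import comb

def profile_bound(d: int) -> int:
    """Upper bound C(3^d, d+1) on the number of augmentation profiles f(d)."""
    return comb(3 ** d, d + 1)

# Values of f(d) reported in the text from numerical simulation.
reported_f = {3: 778, 4: 531_319}

for d, f_d in reported_f.items():
    print(f"d={d}: reported f(d)={f_d}, bound={profile_bound(d)}")
```

For these two values of $d$ the bound exceeds the reported count by a factor of roughly 20 to 50, which suggests the exhaustive search over profiles is cheaper in practice than the worst-case expression indicates.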
A reasonable assumption for the cache assignment problem (the motivating application for the subset assignment problem) is that it is never advantageous to replicate an item in more than two bins. Under this assumption, the vectors can have at most 4 non-zero entries. So the function $f(d)$ is bounded by $d^{O(d)}\,\text{poly}(d)$, resulting in an overall running time of $O(d^{O(d)}\,\text{poly}(d)\, n \log(n) \log(nC) \log(Z))$.
In a linear programming formulation of the problem there are $2^d n$ variables and $n + d$ constraints. The best polynomial time algorithms to solve a general instance of linear programming require time at least cubic in $n$, which for the values we consider is prohibitively large. (For example, Karmarkar's algorithm requires $O(N^{3.5} L)$ operations, in which $N$ is the number of variables and all input numbers can be encoded with $O(L)$ digits [12].) Our algorithm is closer to the Simplex algorithm in that after each iteration, the current solution is a basic feasible solution. However, the algorithm does not necessarily traverse the edges of the simplex. The algorithm selects an optimal local improvement which, in general, can result in a solution which is not a basic feasible solution, and then restores the solution to a basic feasible solution without increasing the cost. It is not clear how to bound the running time of any implementation of the Simplex algorithm in which the algorithm is bound to traverse edges of the simplex. Since the problem can be formulated as a packing problem, there is a randomized approximation algorithm whose solution is within a factor of $1 + \epsilon$ of optimal and whose running time is $O(2^d\,\text{poly}(d)\, n \log n / \epsilon^2)$ [15].

Motivation
The subset assignment problem is motivated by the problem of managing a multi-level cache. Although caches are used in many different contexts, we are particularly interested in the use of caches to augment database management systems. Query results (called key-value pairs) are stored in the cache so that the next time the query is issued, the result can be retrieved from memory instead of recomputed from scratch. Typically key-value pairs vary in size as they contain different types of data. Furthermore, the cost of re-computation can vary dramatically, depending on the application. In some applications, a key-value pair is the result of hours of data-intensive computation. If the key-value pair does not reside in the cache (corresponding to allocating the item to the empty set in our formulation), this computation cost would be paid every time the key-value pair is accessed. Memcached, currently the most popular key-value store manager, is used by companies such as Facebook [18], Twitter and Wikipedia. Today's memcached uses DRAM for fast storage and retrieval of key-value pairs. However, using a cache that consists of a collection of memory banks with different characteristics can potentially improve cost or performance [7].
We model a sequence of requests to data items (key-value pairs) in the cache as a stream of independent events, as do social networking benchmarks such as BG [4] and LinkBench [3]. Cache management (or paging) under i.i.d. request sequences is also well studied in the theory literature [19,11]. If query and update (read and write) statistics are known in advance, the optimal policy is a static placement of data items in the memory banks that minimizes the expected time to service each request. A static placement can perform much better than adaptive online algorithms if the request frequencies are stable [9,8]. Since the popularity of queries does vary over time, a static placement would need to be recomputed periodically based on recent statistics, followed by a reorganization of key-value pairs across memory banks.
With the advent of Non-Volatile Memory (NVM) such as PCM, STT-RAM, NAND Flash, and the (soon to be released) Intel X-Point, cache designers are provided with a wider selection of memory types with different performance, cost and reliability characteristics. The relative read/write latency and bandwidth for different memory types vary considerably. An important challenge in computer system design is how to effectively design caching middleware that leverages these new choices [14,7,17]. The survey in [17] makes the case that the advent of new storage technologies significantly changes the standard assumptions in system design and leveraging such technologies will require more sophisticated workload-aware storage tiering.
In this paper, the cache is composed of a small number of memory banks each of which is a different type of memory. The goal is to find an optimal placement of data items in the cache. Our model also takes into account that a memory bank can fail due to a power outage or hardware failure. (Nonvolatile memories do not lose their content during a power outage but they can experience hardware failures). If a memory bank fails, its contents must be restored, either all at once or over time. In this case, it may be advantageous to store a data item on more than one memory bank so that the data can be more easily recovered in the event that one or more memory banks fail. On the other hand, maintaining multiple copies of a key-value pair can be costly if they must be frequently updated. We express these different trade-offs in an optimization problem by allowing a key-value pair to be replicated and stored on any subset of the memory banks. The ∅ option represents not keeping the key-value pair in the cache at all and recomputing the result from the database at every query, an option that can be computationally very costly. Simulation results from [7] show that it can be advantageous to store copies of data items in more than one memory bank to speed up recovery time, although it depends on the read and write frequencies of the data items as well as the failure rates of the memory.
[7] gives a detailed description of how the memory parameters and request frequencies translate into costs and uses the model to study a closely related problem in which one is given a fixed budget as well as the price for the different types of memory. The goal is to determine the optimal amount of each type of memory to purchase as well as the optimal placement of key-value pairs to memory banks that minimizes expected service time subject to the overall budget constraint. The algorithm in [7] is implemented and evaluated using traces generated by a standard social networking benchmark [4]. In this paper, we consider the situation in which the design of the cache is already determined in that there is a set of memory banks whose capacities are given as part of the problem input. The goal is to place each item on a subset of the memory banks so that the capacity of each memory bank is not exceeded and the total cost is minimized. In both cases, the objective function is expected service time for all the items. The cost of serving an item p located on subset S of the memory banks is a sum of three terms: the total expected time to serve read requests to p, the total expected time to serve write requests to p, and the total expected time needed to restore p in the event that one or more memory banks in S fail. More explicitly, the terms are defined as follows. The functions read-freq(p) and write-freq(p) represent the probability of a read or write request to p. The function fail-freq(F) is the probability that all the memory banks in subset F fail. On a request to read an item p, item p can be obtained from any of the copies of p in the cache. Therefore, read-time(p, S) represents the time to read p from the memory bank in S that provides the fastest read time. The time to read a data item from a memory bank depends on the size of the item as well as the latency and bandwidth for reading from that type of memory.
Updating an item, on the other hand, requires updating every copy of that item in the cache. Therefore, write-time(p, S) is the maximum time to write p to any of the memory banks in S, assuming that writing p to its multiple destinations is done in parallel. (If writing is done sequentially, then write-time(p, S) is the sum of the write times over all the memory banks in S).
The time to write a data item to a memory bank depends on the size of the item as well as the latency and bandwidth for writing to that type of memory.
Recovering from failure could involve reading p from those memory banks that still have a copy and rewriting it to those that lost it. Since the relative read/write frequencies for items and read/write times for memories vary significantly, there is no useful structure to exploit in modeling the cost of assigning an item to a subset, which is why the costs are assumed to be arbitrary values given as part of the input to the subset assignment problem. However, based on the empirical failure rates of the memory technologies, it is reasonable to assume that two memory banks will never fail at the same time. Under this assumption, there is no need to keep more than two copies of an item in the cache and we can restrict the data placements for an item to subsets of size one or two. The problem is addressed in this paper in its full generality, although a better bound on the running time can be obtained with this restriction.
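The way the read and write terms described in this section combine can be sketched in a few lines of Python. This is an illustrative sketch only, not the cost model of [7]: the function name `placement_cost` and the parameters `recompute_time` and `expected_recovery` are hypothetical, the recovery term is passed in as a precomputed number because the text does not give a formula for it, and charging only recomputation for the empty placement is an assumption.

```python
def placement_cost(read_freq, write_freq, read_time, write_time, subset,
                   recompute_time, expected_recovery=0.0, parallel_writes=True):
    """Expected service time of one item stored on `subset` of the banks.

    read_time / write_time map each bank name to the time to read / write
    this item on that bank. `recompute_time` and `expected_recovery` are
    hypothetical inputs: the text does not give formulas for them.
    """
    if not subset:
        # Assumption: with the empty placement every read is recomputed
        # and there is nothing to write or restore in the cache.
        return read_freq * recompute_time
    # Reads are served by the fastest bank holding a copy.
    fastest_read = min(read_time[b] for b in subset)
    # Writes must update every copy: max if parallel, sum if sequential.
    writes = [write_time[b] for b in subset]
    write_cost = max(writes) if parallel_writes else sum(writes)
    return read_freq * fastest_read + write_freq * write_cost + expected_recovery


read_time = {"dram": 1.0, "flash": 4.0}
write_time = {"dram": 1.0, "flash": 10.0}
both = placement_cost(0.8, 0.2, read_time, write_time, {"dram", "flash"}, 100.0)
none = placement_cost(0.8, 0.2, read_time, write_time, set(), 100.0)
```

In this toy instance replication keeps reads fast but makes each write as slow as the slowest bank, which is the trade-off that motivates treating the costs as arbitrary inputs.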

Problem Definition
There are $n$ items and each item $p$ has a given size($p$). There are $d$ bins $B = \{b_1, \ldots, b_d\}$. Each bin $b$ has a given capacity($b$). An item can be replicated and placed on any subset of the memory banks $S \subseteq B$. We call $S$ a placement option for an item. Placing $p$ on $S$ has cost denoted by cost($p, S$) $\ge 0$. A placement of items to memory banks is described by a set of $n \cdot 2^d$ variables $x(p, S) \ge 0$ with the constraint that for each $p$,
$$\sum_{S \subseteq B} x(p, S) = \text{size}(p). \qquad (1)$$
Also the capacity of each bin cannot be exceeded, so for each $b$,
$$\sum_{S \ni b} \sum_{p} x(p, S) \le \text{capacity}(b). \qquad (2)$$
The goal is to minimize
$$\sum_{p} \sum_{S \subseteq B} \text{cost}(p, S)\, x(p, S)$$
subject to the condition that all $x(p, S) \ge 0$, (1) and (2) above. The placement option $\emptyset$, corresponding to not placing an item in any of the bins, is an option for every $p$, so the problem always has a feasible solution. For each bin $b$, we will add an extra item $p$ whose size is capacity($b$). For each added $p$, cost($p, \emptyset$) = cost($p, \{b\}$) = 0. For all other $S \subseteq B$, cost($p, S$) = $\infty$. We assume that the items are numbered so that the extra item for bin $b_i$ is $p_i$. With the additional items, we can assume that every solution under consideration has every bin filled exactly to capacity since any extra space in $b_i$ can be filled with $p_i$ without changing the cost of the solution. Therefore we require that for each $b$, $\sum_{S \ni b} \sum_{p} x(p, S) = \text{capacity}(b)$. An assignment which satisfies the equality constraints on the bins is called perfectly filled.
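The constraints and objective of this section can be checked mechanically on a small instance. The sketch below is illustrative: the names `is_feasible` and `total_cost`, the dictionary layout `x[(p, S)]`, and the cost values are all assumptions made for the example, and the objective is taken to be linear in the mass $x(p, S)$.

```python
def is_feasible(x, size, capacity, tol=1e-9):
    """Check non-negativity, the per-item constraint and the per-bin capacity.

    x maps (item, frozenset_of_bins) to the mass of the item on that subset.
    """
    if any(mass < -tol for mass in x.values()):
        return False
    for p, s in size.items():
        # Each item must be fully distributed over its placement options.
        placed = sum(mass for (q, _), mass in x.items() if q == p)
        if abs(placed - s) > tol:
            return False
    for b, cap in capacity.items():
        # The load of bin b counts every subset S that contains b.
        load = sum(mass for (_, S), mass in x.items() if b in S)
        if load > cap + tol:
            return False
    return True

def total_cost(x, cost):
    """Objective value: sum of cost(p, S) * x(p, S)."""
    return sum(cost[key] * mass for key, mass in x.items())


EMPTY, ONLY_B, ONLY_C, BOTH = (frozenset(), frozenset({"b"}),
                               frozenset({"c"}), frozenset({"b", "c"}))
size = {"p": 1.0, "q": 1.0}
capacity = {"b": 1.0, "c": 1.0}
# A fractional assignment: p split over {b, c} and the empty set,
# q split over {b} and {c}. The cost values are made up for illustration.
x = {("p", BOTH): 0.5, ("p", EMPTY): 0.5, ("q", ONLY_B): 0.5, ("q", ONLY_C): 0.5}
cost = {("p", BOTH): 0.0, ("p", EMPTY): 4.0, ("q", ONLY_B): 1.0, ("q", ONLY_C): 1.0}
print(is_feasible(x, size, capacity), total_cost(x, cost))
```

Both bins are filled exactly to capacity in this example, so the assignment is also perfectly filled in the sense defined above.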

Preliminaries
Our algorithm starts with a feasible, perfectly filled solution and improves the assignment in a series of small steps, called augmentations. The augmentations, a generalization of a negative cycle in min-cost flow, always maintain the condition that the current assignment is feasible and perfectly filled. In each iteration the algorithm finds an augmentation that approximates the best possible augmentation in terms of the overall improvement in cost. An augmentation is a linear combination of moves in which mass is moved from $x(p, S)$ to $x(p, T)$ for some item $p$. Each move gives rise to a $d$-dimensional vector over $\{-1, 0, 1\}$ that denotes the net increase or decrease to each bin as a result of the move. We require that the linear combination of vectors for an augmentation equal $\vec 0$ in order to maintain the condition that the bins are perfectly filled. The profile for an augmentation is the set of vectors corresponding to the moves in that augmentation. In order to find a good augmentation, we exhaustively search over all profiles and then find a good set of actual moves that correspond to each profile. Exhaustively searching over all profiles introduces a factor of $f(d)$, the number of distinct profiles, which is at most ${3^d \choose d+1}$. In order to bound the number of iterations, we also need to establish that there is an augmentation that improves the cost by a significant factor. For flows, this is accomplished by showing that the difference between the current solution and the optimal solution can be decomposed into at most $m$ simple cycles, where $m$ is the number of edges in the network. If $\Delta$ is the difference between the current and optimal cost, then there is a cycle that improves the cost by at least $\Delta/m$. We proceed in a similar way, showing that the difference between two assignments can be decomposed into at most $2(n + d)$ augmentations, any of which can be applied to the current assignment. Therefore there is an augmentation that improves the cost by at least $\Delta/(2(n + d))$.

Augmentations
For $S \subseteq B$, $\vec S$ is a $d$-dimensional vector whose $i$-th coordinate is 1 if $b_i \in S$ and is 0 otherwise. Let $\mathcal{V}$ be the set of all length $d$ vectors over $\{-1, 0, 1\}$. A set $V \subseteq \mathcal{V}$ is said to be minimally dependent if $V$ is linearly dependent and no proper subset of $V$ is linearly dependent. If $V = \{\vec v_1, \ldots, \vec v_r\}$ is minimally dependent, then the values $\alpha_1, \ldots, \alpha_r$ such that $\sum_{i=1}^{r} \alpha_i \vec v_i = \vec 0$ are unique up to a global constant factor. In order to make a unique vector $\vec\alpha$, we always maintain the convention that $\alpha_1 = 1$. A minimally dependent set $V$ is said to be positive if the associated vector $\vec\alpha > 0$.
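Whether a given set of vectors is minimally dependent and positive can be tested with exact rational arithmetic. The sketch below relies on the observation that a set is minimally dependent exactly when its space of dependencies is one-dimensional and the dependency has no zero coefficient. The function names `null_space` and `positive_alpha` are hypothetical, and this is not the Preprocess procedure of the paper.

```python
from fractions import Fraction

def null_space(vectors):
    """Basis of {alpha : sum_i alpha_i * vectors[i] = 0}, via exact row reduction."""
    r, d = len(vectors), len(vectors[0])
    # Row k holds coordinate k of every vector; column i is vector i.
    rows = [[Fraction(v[k]) for v in vectors] for k in range(d)]
    pivots, rank = [], 0
    for col in range(r):
        pr = next((i for i in range(rank, d) if rows[i][col] != 0), None)
        if pr is None:
            continue  # no pivot available: col is a free column
        rows[rank], rows[pr] = rows[pr], rows[rank]
        pivot = rows[rank][col]
        rows[rank] = [e / pivot for e in rows[rank]]
        for i in range(d):
            if i != rank and rows[i][col] != 0:
                factor = rows[i][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[rank])]
        pivots.append(col)
        rank += 1
    basis = []
    for free in (c for c in range(r) if c not in pivots):
        alpha = [Fraction(0)] * r
        alpha[free] = Fraction(1)
        for row_idx, pc in enumerate(pivots):
            alpha[pc] = -rows[row_idx][free]
        basis.append(alpha)
    return basis

def positive_alpha(vectors):
    """Return alpha with alpha[0] == 1 if the set is minimally dependent and positive."""
    basis = null_space(vectors)
    if len(basis) != 1:              # independent, or several dependencies
        return None
    alpha = basis[0]
    if any(a == 0 for a in alpha):   # some proper subset is already dependent
        return None
    alpha = [a / alpha[0] for a in alpha]
    return alpha if all(a > 0 for a in alpha) else None


print(positive_alpha([(1, 0), (0, 1), (-1, -1)]))
```

Using `Fraction` avoids the rounding issues that a floating-point rank computation would introduce when deciding whether a coefficient is exactly zero.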
A move is defined by a triplet $(p, S, T)$ that represents the possibility of moving mass from $x(p, S)$ to $x(p, T)$. The profile for a set of moves $\{(p_1, S_1, T_1), \ldots, (p_r, S_r, T_r)\}$ is the set of vectors $\{\vec T_1 - \vec S_1, \ldots, \vec T_r - \vec S_r\}$. Note that the vector $\vec T - \vec S$ represents the net increase or decrease to each bin that results from moving one unit of mass from $x(p, S)$ to $x(p, T)$ for some $p$. A set of moves is called an augmentation if the set of vectors in its profile is minimally dependent and positive. Note that an augmentation contains at most $d + 1$ moves.
An augmentation $A = \{(p_1, S_1, T_1), \ldots, (p_r, S_r, T_r)\}$ can be applied to a particular assignment $x$ if for every $i = 1, \ldots, r$, $x(p_i, S_i) > 0$. Let $\vec\alpha$ be the unique vector of values such that $\alpha_1 = 1$ and $\sum_{j=1}^{r} \alpha_j (\vec T_j - \vec S_j) = \vec 0$. If the augmentation is applied with magnitude $a$ to $x$, then for every $(p_j, S_j, T_j) \in A$, $x(p_j, S_j)$ is replaced with $x(p_j, S_j) - a\alpha_j$ and $x(p_j, T_j)$ is replaced with $x(p_j, T_j) + a\alpha_j$. The cost vector for an augmentation is $\vec c$, where $c_j = \text{cost}(p_j, T_j) - \text{cost}(p_j, S_j)$. The cost associated with applying the augmentation with magnitude $a$ is $a(\vec c \cdot \vec\alpha)$. Since the goal is to minimize the cost, we only apply augmentations whose cost is negative.
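Applying an augmentation, as just described, amounts to a handful of dictionary updates. The following sketch is a hedged illustration with hypothetical names (`max_magnitude`, `apply_augmentation`) and a made-up one-bin instance; it assumes the caller supplies moves and coefficients that form a valid augmentation.

```python
def max_magnitude(x, moves, alpha):
    """Largest magnitude a for which no source variable x(p_j, S_j) goes negative."""
    need = {}
    for (p, S, _), a_j in zip(moves, alpha):
        # Several moves may draw from the same (p, S); add up their coefficients.
        need[(p, S)] = need.get((p, S), 0) + a_j
    return min(x.get(key, 0) / total for key, total in need.items())

def apply_augmentation(x, cost, moves, alpha, a):
    """Shift a * alpha_j of mass along each move; return the cost change a * (c . alpha)."""
    delta = 0
    for (p, S, T), a_j in zip(moves, alpha):
        x[(p, S)] = x.get((p, S), 0) - a * a_j
        x[(p, T)] = x.get((p, T), 0) + a * a_j
        delta += a * a_j * (cost[(p, T)] - cost[(p, S)])
    return delta


# One bin b of capacity 1, initially filled by its extra item "e";
# the original item "p" of size 1 starts outside the bin.
B, EMPTY = frozenset({"b"}), frozenset()
x = {("e", B): 1, ("p", EMPTY): 1}
cost = {("e", B): 0, ("e", EMPTY): 0, ("p", EMPTY): 5, ("p", B): 1}
moves = [("p", EMPTY, B), ("e", B, EMPTY)]   # profile {(+1), (-1)}
alpha = [1, 1]
a = max_magnitude(x, moves, alpha)
delta = apply_augmentation(x, cost, moves, alpha, a)
print(a, delta)
```

The bin stays perfectly filled because the two moves have opposite vectors, and the cost drops because moving p into the bin is cheaper than leaving it out.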
The following lemma is analogous to the fact for flows that says there is always a cycle in the network representing the difference between two feasible flows. The proof is given in the Appendix.
Lemma 1. Let x and y be two feasible, perfectly filled assignments to the same instance of the subset assignment problem. Then there is an augmentation that can be applied to x that consists only of moves of the form (p, S, T ) where x(p, S) > y(p, S) and x(p, T ) < y(p, T ).

Basic Feasible Assignments
An item $p$ is said to be fractionally assigned if there are two subsets $S \ne S'$ such that $x(p, S) > 0$ and $x(p, S') > 0$. If items can only be assigned to single bins as in the standard assignment problem, then it follows from total unimodularity that the optimal solution is integral, assuming that all the input values are integers. For the subset assignment problem, the optimal solution may not be integral, even if all the input values are integers. Here is an example in which the item sizes and bin capacities are 1, but an optimal solution must have fractionally assigned items: we have two items p and q and two bins b and c with costs where C is a large number. The optimal assignment is to equally distribute p over {b, c} and ∅, and to equally distribute q over {b} and {c}.
The linear programming formulation of the subset assignment problem has $n + d$ constraints. The first $n$ constraints enforce that each $p$ must be assigned: $\sum_{S \subseteq B} x(p, S) = \text{size}(p)$. The other $d$ constraints say that each bin must be exactly filled to capacity. Therefore, any basic feasible solution to the linear programming formulation of the subset assignment problem has at most $n + d$ non-zero variables. Since for every $p$ there is at least one $S$ such that $x(p, S) > 0$, and $n \gg d$, we know at least $n - d$ of the items will not be fractionally assigned because they have only one $S$ such that $x(p, S) > 0$. The number of variables $x(p, S)$ such that $0 < x(p, S) < \text{size}(p)$ is at most $2d$, so the number of fractionally assigned items is at most $d$.
The criterion for a feasible solution to be a basic feasible solution is that once the variables that will be positive are chosen, there is exactly one way to assign values to those variables so that all the constraints are satisfied. Suppose we have a feasible assignment $x$. First place the items that are not fractionally assigned in their bins. If $x$ is a basic feasible solution, then there is a unique way to place the remaining items so that the bins are filled exactly to capacity. We rephrase the definition of a basic feasible solution in the language of the subset assignment problem and prove the same facts about the new definition.
Consider a feasible assignment $x$. Let $P_{frac}$ be the set of data items that are fractionally assigned. Let $X_{frac}$ be the set of variables $x(p, S)$ such that $0 < x(p, S) < \text{size}(p)$. Let $P_{int}$ be the set of items that are assigned to exactly one subset. That is, $p \in P_{int}$ if $x(p, S) \in \{0, \text{size}(p)\}$ for all $S$.
Definition 2. For each $p \in P_{frac}$ select one $S$ such that $x(p, S) > 0$. Denote the selected set for $p$ by $S_p$. Let $X$ be the set of variables $x(p, S)$ such that $S \ne S_p$ and $0 < x(p, S) < \text{size}(p)$. Let $V$ be the set of vectors $\vec S - \vec S_p$ for each $x(p, S) \in X$. Then $x$ is a basic feasible assignment (bfa) if and only if $V$ is linearly independent.
Although the definition for a basic feasible assignment was given in terms of a particular choice of $S_p$'s, the property of being a bfa does not depend on this choice.
Lemma 3. The condition of being a bfa does not depend on the choice of $S_p$, for $p \in P_{frac}$.
Proof. Let $p \in P_{frac}$ and let $\{S_1, \ldots, S_r\}$ be the subsets such that $x(p, S_j) > 0$. Suppose that $S_p$ is chosen to be $S_i$. Select any two $S_j \ne S_k$. Since $(\vec S_j - \vec S_k) = (\vec S_j - \vec S_p) - (\vec S_k - \vec S_p)$, the space spanned by all $(\vec S_j - \vec S_k)$ for $S_j \ne S_k$ is equal to the space spanned by all $(\vec S_j - \vec S_p)$ for $S_j \ne S_p$. The space spanned by all $(\vec S_j - \vec S_k)$ with $S_j \ne S_k$ is independent of the choice of $S_p$.
Lemma 4. If $x$ is a bfa, then the number of variables in $X_{frac}$ is at most $2d$ and the number of fractionally assigned items is at most $d$.
Proof. Since $|X| = |V|$, and $V$ must be linearly independent for any bfa, it must be that if $x$ is a bfa, then $|X| \le d$. The set of fractionally assigned variables ($X_{frac}$) includes all the $x(p, S_p)$ for $p \in P_{frac}$ and $X$. For each $x(p, S_p)$, there is at least one variable in $X$. Therefore the number of variables such that $0 < x(p, S) < \text{size}(p)$ in any bfa is at most $2d$.
The process Restore, given in the Appendix, takes an assignment $x$ which may not be a bfa and restores it to an assignment which is a bfa. The process maintains the condition that the current assignment is feasible and perfectly filled. If the set $V$ is linearly dependent, a linear combination of the moves $(p, S_p, S)$ is chosen for each $x(p, S) \in X$ such that applying the linear combination of moves keeps the bins perfectly filled. Since $p$ has some weight on $S_p$ and some weight on $S$, all the moves can be applied in either the forward or reverse direction. (A negative coefficient denotes applying a move in the reverse direction.) We choose a direction for the linear combination of moves such that the cost does not increase. The combination of moves is applied until either $x(p, S)$ or $x(p, S_p)$ becomes 0 for one of the moves represented in $V$. Thus, the cost of the assignment does not increase and the number of fractionally assigned variables decreases by at least one. The process continues until $V$ is linearly independent.
Lemma 5. There is an optimal solution that is also a bfa.
Proof. Start with an optimal assignment $x$ which may not be a bfa. Apply Restore to $x$. The resulting assignment is a bfa, and since Restore does not increase the cost, the resulting assignment is still optimal.

The Algorithm
The algorithm we present proceeds in a series of iterations. In each iteration, we apply an augmentation to the current assignment. Since the resulting assignment may no longer be a bfa, we then apply Restore to turn the solution back into a bfa. The initial assignment places each extra item in its own bin, $x(p_i, \{b_i\}) = \text{capacity}(b_i)$ for $i \le d$, and sets $x(p_j, \emptyset) = \text{size}(p_j)$ for $j > d$. (All the "original" items start outside the bins.) Note that it is possible to find an augmentation that moves from a bfa to another bfa directly. This is essentially what the simplex algorithm does. However, not every augmentation results in a bfa. The augmentations that do result in a bfa must include the moves that shift mass between the fractionally assigned items. (These moves correspond to the vectors $V$ described in the definition of a bfa.) Restricting the augmentation in this way may result in a sub-optimal augmentation. For example, those augmentations could require decreasing a variable that is already very small, in which case the augmentation cannot be applied with very large magnitude. So we allow the algorithm to select from the set of all augmentations to get as much benefit as possible, and then move the assignment to a bfa.

Finding an augmentation that is close to the best possible
The first step is a preprocessing step in which every possible augmentation profile is generated. This consists of generating every minimally dependent subset $V$ of $\mathcal{V}$ and its associated $\vec\alpha$. Preprocess (shown in the Appendix) runs in time $O({3^d \choose d+1}\,\text{poly}(d))$. The number of distinct augmentation profiles returned by the procedure is at most ${3^d \choose d+1}$. Given an augmentation profile $V = \{\vec v_1, \ldots, \vec v_r\}$, the goal is to find an augmentation whose profile matches $V$ and can be applied with a magnitude that gives close to the best possible improvement. For each vector $\vec v \in \mathcal{V}$, we maintain a data structure with every move $(p, S, T)$ such that $\vec T - \vec S = \vec v$ and $x(p, S) > 0$. We will call the set of all such moves Moves($\vec v$). The data structure should be able to answer queries of the form: given $x_0$, find the move $(p, S, T)$ such that $\text{cost}(p, T) - \text{cost}(p, S)$ is minimized subject to the condition that $x(p, S) \ge x_0$. These kinds of queries can be handled by an augmented binary search tree in logarithmic time [6].
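The query on Moves($\vec v$) can be illustrated with a simplified static index. The sketch below is not the augmented binary search tree of [6]: it sorts the moves by mass once and stores suffix minima, so queries take logarithmic time but updates would require rebuilding. The class name `MoveIndex` and the tuple layout are hypothetical.

```python
from bisect import bisect_left

class MoveIndex:
    """Static index over Moves(v): cheapest move whose source mass is at least x0."""

    def __init__(self, moves):
        # moves: iterable of (mass, cost_delta, move), where mass = x(p, S)
        # and cost_delta = cost(p, T) - cost(p, S).
        self.entries = sorted(moves, key=lambda e: e[0])
        self.masses = [e[0] for e in self.entries]
        # best[i] is the cheapest entry among entries[i:], i.e. among all
        # moves whose mass is at least masses[i].
        self.best = [None] * len(self.entries)
        current = None
        for i in range(len(self.entries) - 1, -1, -1):
            if current is None or self.entries[i][1] < current[1]:
                current = self.entries[i]
            self.best[i] = current

    def query(self, x0):
        """Return the (mass, cost_delta, move) of minimum cost_delta with mass >= x0."""
        i = bisect_left(self.masses, x0)   # first entry with mass >= x0
        return self.best[i] if i < len(self.best) else None


index = MoveIndex([(0.2, -5, "m1"), (0.7, -1, "m2"), (1.0, -3, "m3")])
print(index.query(0.1), index.query(0.5), index.query(1.5))
```

A dynamic structure is needed in the actual algorithm because every applied augmentation changes a few values $x(p, S)$; the static version only shows what the query returns.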
For a given bfa $x$ and augmentation $A$, one can calculate the maximum possible magnitude $a$ with which $A$ can be applied to $x$. We will make use of upper and lower bounds for the value $a$ for any augmentation and bfa combination. Call these values $a_{\max}$ and $a_{\min}$. Round $a_{\min}$ down so that $a_{\max}/a_{\min}$ is a power of 2. The while loop in procedure FindAugmentation runs for $\log(a_{\max}/a_{\min})$ iterations.
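The scan over magnitudes for a single profile can be sketched as follows. This is a guess at the shape of the inner loop based on the description in this section, not the FindAugmentation procedure from the Appendix: the function name `best_for_profile` is hypothetical, the per-vector query is done by a linear scan instead of a search tree, and the estimate $a \sum_i \alpha_i c_i$ is an assumed form of the quantity the text calls CurrentCost.

```python
def best_for_profile(alpha, candidates, a_max, a_min):
    """Try a = a_max, a_max/2, ..., a_min for one augmentation profile.

    candidates[i] lists (mass, cost_delta, move) for the moves in Moves(v_i).
    Returns (estimated_cost, a, chosen_moves) with the most negative estimate,
    or None if no magnitude yields a negative estimate.
    """
    best = None
    a = a_max
    while a >= a_min:
        chosen, weighted = [], 0.0
        for alpha_i, moves in zip(alpha, candidates):
            # Only moves with enough mass to give up alpha_i * a are usable.
            usable = [m for m in moves if m[0] >= alpha_i * a]
            if not usable:
                break
            pick = min(usable, key=lambda m: m[1])
            chosen.append(pick[2])
            weighted += alpha_i * pick[1]
        else:
            estimate = a * weighted
            if estimate < 0 and (best is None or estimate < best[0]):
                best = (estimate, a, chosen)
        a /= 2
    return best


alpha = [1, 1]
candidates = [
    [(1.0, -1.0, "A"), (0.25, -10.0, "B")],   # moves matching v_1
    [(1.0, 0.0, "E")],                        # moves matching v_2
]
print(best_for_profile(alpha, candidates, a_max=1.0, a_min=0.25))
```

In the toy input the cheapest move B has little mass, so the best estimate is reached at the smallest magnitude; halving the magnitude is what lets the search trade a larger step for a cheaper move.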
For an augmentation $A$ that can be applied to assignment $x$ with magnitude $a$, the total change in cost is denoted by $\text{cost}(A, x, a)$. Recall that since we are minimizing cost, we will only apply an augmentation if the total change in cost is less than 0.
Lemma 6. Let $A_1$ be the augmentation returned by FindAugmentation($x$) and $A_2$ be another augmentation. If $a_1$ and $a_2$ are the maximum magnitudes with which $A_1$ and $A_2$ can be applied to $x$, then $2d \cdot \text{cost}(A_1, x, a_1) \le \text{cost}(A_2, x, a_2)$.
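Proof. Let $V_2$ be the profile for $A_2$. Let $\vec\alpha$ be the vector associated with the profile $V_2$. Let $\bar a$ be the value of the form $a_{\max}/2^j$ such that $2\bar a > a_2 \ge \bar a$. There is an iteration inside the while loop of FindAugmentation($x$) in which the augmentation profile is $V_2$ and the value for $a$ is $\bar a$. The augmentation constructed in this iteration will be called $A_3$. The moves in $A_2$ and in $A_3$ are $\{(p_i^{(2)}, S_i^{(2)}, T_i^{(2)})\}_{i=1}^{r}$ and $\{(p_i^{(3)}, S_i^{(3)}, T_i^{(3)})\}_{i=1}^{r}$, respectively. Note that since $A_2$ can be applied to $x$ with magnitude $a_2$, it must be the case that for $i = 1, \ldots, r$, $x(p_i^{(2)}, S_i^{(2)}) \ge \alpha_i a_2$, because applying the moves involves removing $\alpha_i a_2$ from $x(p_i^{(2)}, S_i^{(2)})$. The move $(p_i^{(3)}, S_i^{(3)}, T_i^{(3)})$ is chosen to be the move with minimum cost such that $x(p_i^{(3)}, S_i^{(3)}) \ge \alpha_i \bar a$. Since $a_2 \ge \bar a$, the move $(p_i^{(2)}, S_i^{(2)}, T_i^{(2)})$ is a candidate for this choice. Therefore the cost of $(p_i^{(3)}, S_i^{(3)}, T_i^{(3)})$ is at most the cost of $(p_i^{(2)}, S_i^{(2)}, T_i^{(2)})$. The value of the variable CurrentCost for that iteration is
$$\text{CurrentCost}_3 = \bar a \sum_{i=1}^{r} \alpha_i \left[\text{cost}(p_i^{(3)}, T_i^{(3)}) - \text{cost}(p_i^{(3)}, S_i^{(3)})\right] \le \bar a \sum_{i=1}^{r} \alpha_i \left[\text{cost}(p_i^{(2)}, T_i^{(2)}) - \text{cost}(p_i^{(2)}, S_i^{(2)})\right] \le \tfrac{1}{2}\,\text{cost}(A_2, x, a_2),$$
where the last inequality uses $a_2 < 2\bar a$ and the fact that the cost of $A_2$ is negative. Let $\text{CurrentCost}_1$ be the value of the variable CurrentCost and $a'$ the value of the variable $a$ during the iteration in which the augmentation $A_1$ is considered. Since $A_1$ was selected by FindAugmentation, $\text{CurrentCost}_1 \le \text{CurrentCost}_3$. It remains to show that the maximum magnitude with which $A_1$ can be applied is at least $a'/d$ and therefore the actual change in cost is at most $\text{CurrentCost}_1/d$.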
Let $V_1$ be the profile for $A_1$ and let $\vec\beta$ be the vector associated with profile $V_1$. Since we are now only referring to one augmentation, we omit the subscripts and call the moves in $A = \{(p_1, S_1, T_1), \ldots, (p_r, S_r, T_r)\}$. We are guaranteed by the selection of the move $(p_i, S_i, T_i)$ that for every $i$, $x(p_i, S_i)/\beta_i \ge a'$. Let $\beta^{sum}_{p,S}$ and $\beta^{max}_{p,S}$ denote the sum and maximum over all $\beta_i$ such that $p_i = p$ and $S_i = S$. The value of $a_1$, the maximum value with which $A$ can be applied, is equal to $x(p, S)/\beta^{sum}_{p,S}$ for some pair $(p, S)$. We have

Number of iterations of the main loop
The procedure FindAugmentation takes a bfa $x$ and returns an augmentation that reduces the cost of the current solution by an amount which is within a factor $\Omega(1/d)$ of the best possible augmentation that can be applied to $x$. In order to bound the number of iterations of the main loop, we need to show that there always is a good augmentation that can be applied to $x$ that moves it towards an optimal solution. The idea is that for any two assignments $x$ and $y$, $x$ can be transformed into $y$ by applying a sequence of augmentations. Each augmentation reduces by one the number of variables on which $x$ and $y$ differ. Since the number of non-zero variables in any bfa is at most $n + d$, there are at most $2(n + d)$ augmentations in the sequence. Thus, if the difference in cost between $y$ and $x$ is $\Delta$, one of the augmentations will decrease the cost by at least $\Delta/(2(n + d))$.
The idea is analogous to partitioning the difference between two min-cost flows into a set of disjoint cycles. Some additional work is required to establish that the chosen augmentation can be applied with sufficient magnitude. The proofs of Lemmas 7, 8, 9 and 10 are given in the Appendix.
Lemma 7. Let $x$ be a bfa for an instance of the subset assignment problem and let $\Delta$ be the difference in the objective function between $x$ and the optimal solution. Then there is an augmentation $A$ such that when $A$ is applied to $x$ with the maximum possible magnitude, the cost drops by at least $\Delta/(2(n + d))$.
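The notion of applying an augmentation "with the maximum possible magnitude" can be illustrated with a minimal sketch. It assumes a hypothetical representation that is not taken from the paper: the assignment is a dictionary keyed by (item, subset) pairs, each move is a triple (p, S, T) with a coefficient, and the magnitude is limited only by the amount available at each source pair (p, S). The function name `max_magnitude` is illustrative.

```python
from collections import defaultdict
from fractions import Fraction


def max_magnitude(x, moves, coeffs):
    """Largest a such that removing a * coeffs[i] from x[(p_i, S_i)] for
    every move i keeps every source variable nonnegative.

    x      : dict mapping (item, subset) -> amount currently assigned
    moves  : list of (item, source_subset, target_subset) triples
    coeffs : positive coefficient of each move (the profile's vector)
    """
    # Several moves may share the same source pair (p, S); what matters
    # is the total amount they remove from it per unit of magnitude.
    coeff_sum = defaultdict(Fraction)
    for (p, S, _T), c in zip(moves, coeffs):
        coeff_sum[(p, S)] += Fraction(c)
    # The tightest source pair determines the maximum magnitude.
    return min(Fraction(x[key]) / total for key, total in coeff_sum.items())


# Hypothetical example: two moves share the source ("p1", {1}).
x = {("p1", frozenset({1})): 4, ("p2", frozenset({1, 2})): 3}
moves = [
    ("p1", frozenset({1}), frozenset({2})),
    ("p1", frozenset({1}), frozenset()),
    ("p2", frozenset({1, 2}), frozenset({1})),
]
coeffs = [1, 1, 2]
print(max_magnitude(x, moves, coeffs))
```

Exact rational arithmetic is used here only to keep the example free of rounding questions; it is a design choice of the sketch, not a claim about the paper's implementation.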
In order to bound the number of iterations in the main loop, we need to know the smallest difference in cost between two assignments that have different cost.

By Lemma 4, the bfa at the beginning of an iteration has at most $2d$ fractionally assigned variables. An augmentation consists of at most $d + 1$ moves and therefore changes the value of at most $2(d + 1)$ variables. Thus, the input to Restore is an assignment with $O(d)$ fractionally assigned variables. Each iteration of Restore reduces the number of fractionally assigned variables by at least one. Therefore, the number of iterations of Restore is bounded by $O(d)$ and the total time spent in Restore during an iteration is poly($d$).

Analysis of the running time
The inner loop of FindAugmentation requires $O(d)$ queries to one of the augmented binary search trees, resulting in $O(d^2 \log n)$ time for each iteration of the inner loop. The number of times the inner loop is executed is $\log(a_{\max}/a_{\min})$ times the number of augmentation profiles, $f(d)$. The running time of FindAugmentation therefore dominates the running time of an iteration of the main loop, which is $O(f(d)\, d^2 \log n \log(a_{\max}/a_{\min}))$. By Lemma 10, the number of iterations of the main loop is $O(nd^2 \log(dnC))$, and since $f(d) \le \binom{3^d}{d+1}$, the total running time is $O(\binom{3^d}{d+1}\, d^4\, n \log n\, (\log n + \log C) \log(a_{\max}/a_{\min}))$. We now bound $a_{\max}/a_{\min}$:

Lemma 11. The values of $a$ are bounded above by $a_{\max} = d^{d/2} Z$ and below by $a_{\min} = 1/d^{d/2+1}$, where $Z = \max_p \text{size}(p)$.
The proof of Lemma 11 is given in the Appendix. Hence, $\log(a_{\max}/a_{\min})$ is $O(d^2 \log d \log Z)$ and the total running time is $O(\binom{3^d}{d+1}\, \text{poly}(d)\, n \log(n) \log(nC) \log(Z))$.
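To give a sense of the size of the factor that depends only on $d$, the following short sketch evaluates the bound $\binom{3^d}{d+1}$ on the number of augmentation profiles for the values $d = 3$ and $d = 4$ mentioned in the abstract. The function name `profile_bound` is illustrative; the computation only evaluates the stated upper bound and does not enumerate profiles.

```python
import math


def profile_bound(d):
    """Upper bound C(3^d, d+1) on the number of augmentation profiles f(d)."""
    return math.comb(3 ** d, d + 1)


for d in (3, 4):
    # For a constant number of bins this factor is a constant,
    # independent of the number of items n.
    print(d, profile_bound(d))
```

For $d = 3$ the bound evaluates to 17550 and for $d = 4$ to 25621596, so the factor grows quickly with $d$ but does not depend on $n$.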

A The algorithm Restore
The algorithm Restore takes a feasible, perfectly filled assignment x and converts it to a bfa that is also perfectly filled. The cost of the resulting assignment is no larger than the cost of the input assignment. The number of iterations is bounded by the number of fractionally assigned variables in the input assignment.

Algorithm 3 Restore
  For each $p \in P_{\text{frac}}$, select an $S$ s.t. $x(p, S) > 0$. Call the chosen set $S_p$.
  Let $X$ be the set of variables $x(p, S)$ s.t. $S \ne S_p$ and $0 < x(p, S) < \text{size}(p)$.
  Order the variables in $X$: $\{x(p_1, S_1), \ldots, x(p_r, S_r)\}$.
  Let $V$ be the set of vectors $\vec{S} - \vec{S}_p$ for each $x(p, S) \in X$.
  while $V$ is linearly dependent
      Let $\beta_1, \ldots, \beta_r$, not all zero, be such that $\sum_{i=1}^{r} \beta_i (\vec{S}_i - \vec{S}_{p_i}) = 0$.
      Select an $x(p, S')$ from $X$ and remove it from $X$.
      $S_p$ becomes $S'$.
      Update vectors in $V$ with new $S_p$.
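The loop condition of Restore tests whether a set of vectors is linearly dependent and, if so, uses coefficients $\beta_1, \ldots, \beta_r$ of a dependence. A minimal sketch of that single step is given below, assuming the vectors are represented as tuples with one entry per bin. The function name `find_dependence` and the use of exact Gaussian elimination are choices made for this illustration and are not taken from the paper.

```python
from fractions import Fraction


def find_dependence(vectors):
    """Return coefficients beta (not all zero) with sum_i beta[i]*vectors[i] = 0,
    or None if the vectors are linearly independent."""
    r = len(vectors)
    if r == 0:
        return None
    d = len(vectors[0])
    # rows[i][j] holds component i of vector j (vectors are the columns).
    rows = [[Fraction(v[i]) for v in vectors] for i in range(d)]
    pivot_cols = []
    rank = 0
    for col in range(r):
        pivot = next((i for i in range(rank, d) if rows[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot available: this column is a free column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = 1 / rows[rank][col]
        rows[rank] = [e * inv for e in rows[rank]]
        # Eliminate the column from every other row (reduced echelon form).
        for i in range(d):
            if i != rank and rows[i][col] != 0:
                f = rows[i][col]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[rank])]
        pivot_cols.append(col)
        rank += 1
    free = [c for c in range(r) if c not in pivot_cols]
    if not free:
        return None
    # Set one free coefficient to 1 and solve for the pivot coefficients.
    beta = [Fraction(0)] * r
    beta[free[0]] = Fraction(1)
    for row_idx, col in enumerate(pivot_cols):
        beta[col] = -rows[row_idx][free[0]]
    return beta


# Hypothetical example with three bins: the third vector is the sum of the first two.
vecs = [(1, 0, -1), (0, 1, -1), (1, 1, -2)]
beta = find_dependence(vecs)
print(beta)
```

Since the vectors have $d$ components and $d$ is a small constant, the cost of this step is polynomial in $d$ and independent of $n$.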
B A preprocessing step for the generation of all augmentation profiles

  Reorder $\alpha$ to match $V$'s order.
  Rescale $\alpha$ so that the first component is 1.
  Add $(V, \alpha)$ to $P$.
  return $P$

C Proof of Lemma 1
Proof. The first step is to construct a linear combination of moves of the form $(p, S, T)$, where $x(p, S) > y(p, S)$ and $x(p, T) < y(p, T)$, that transforms $x$ into $y$. The profile for the set of moves must be linearly dependent because the net change to the load on each bin is 0. However, the resulting profile is not necessarily minimally dependent. The next step is to find a subset of those moves that can be applied to $x$ whose profile is minimally dependent and whose $\alpha$ has positive coefficients. The following procedure accomplishes the first step:

  Initialize $z = x$ and $j = 1$.
  while $z \ne y$
      Find a $(p, S, T)$ such that $z(p, S) > y(p, S)$ and $z(p, T) < y(p, T)$.
      $\beta_j = \min\{z(p, S) - y(p, S),\; y(p, T) - z(p, T)\}$
      $(p_j, S_j, T_j) = (p, S, T)$
      $z(p, S) = z(p, S) - \beta_j$
      $z(p, T) = z(p, T) + \beta_j$
      $j = j + 1$

Define $\mathcal{S}$ to be the set of pairs $(p, S)$ such that $z(p, S) > y(p, S)$ and $\mathcal{T}$ to be the set of pairs $(p, T)$ such that $z(p, T) < y(p, T)$. In each iteration, if $(p, S, T)$ is the selected move, then either $(p, S)$ drops out of $\mathcal{S}$ or $(p, T)$ drops out of $\mathcal{T}$. Therefore, the process is finite and a move is never selected twice. Let $t$ be the number of moves selected in the process. Applying each move $(p_j, S_j, T_j)$ with magnitude $\beta_j$ transforms $x$ into $y$. Since $x$ and $y$ are both perfectly filled, the net change to the load on each bin is 0: $\sum_{j=1}^{t} \beta_j (\vec{T}_j - \vec{S}_j) = 0$.
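The first-step procedure above can be sketched as follows. The sketch assumes a hypothetical dictionary representation in which assignments map (item, subset) pairs to amounts, subsets are sorted tuples of bin indices, and both assignments place the same total amount of every item. The names `decompose` and `apply_moves` are illustrative and not taken from the paper.

```python
from fractions import Fraction


def decompose(x, y):
    """Return moves (p, S, T, beta) that transform assignment x into y.

    Assumes that for every item p the amounts in x and in y sum to the
    same total; otherwise the search for a deficit pair can fail.
    """
    z = {k: Fraction(v) for k, v in x.items()}
    target = {k: Fraction(v) for k, v in y.items()}
    keys = sorted(set(z) | set(target))
    moves = []
    while True:
        # Pairs (p, S) where z still holds more than y does.
        surplus = [k for k in keys if z.get(k, 0) > target.get(k, 0)]
        if not surplus:
            break  # with equal per-item totals, z now equals y
        p, S = surplus[0]
        # A pair (p, T) for the same item where z holds less than y.
        T = next(t for (q, t) in keys
                 if q == p and z.get((q, t), 0) < target.get((q, t), 0))
        beta = min(z[(p, S)] - target.get((p, S), 0),
                   target[(p, T)] - z.get((p, T), 0))
        z[(p, S)] -= beta
        z[(p, T)] = z.get((p, T), Fraction(0)) + beta
        moves.append((p, S, T, beta))
    return moves


def apply_moves(x, moves):
    """Apply each move with its magnitude and drop zero entries."""
    z = {k: Fraction(v) for k, v in x.items()}
    for p, S, T, beta in moves:
        z[(p, S)] -= beta
        z[(p, T)] = z.get((p, T), Fraction(0)) + beta
    return {k: v for k, v in z.items() if v != 0}


# Hypothetical example with two items and two bins.
x = {("a", (1,)): 2, ("a", (2,)): 1, ("b", (1, 2)): 3}
y = {("a", (1,)): 1, ("a", ()): 2, ("b", (1,)): 3}
moves = decompose(x, y)
print(moves)
```

Each selected move either exhausts the surplus at its source pair or fills the deficit at its target pair, which mirrors the finiteness argument in the proof.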
In the second step, we adjust the linear combination of moves selected until its profile is minimally dependent. Let $B$ be the set of indices $j$ such that $\beta_j > 0$. Initially $B = \{1, \ldots, t\}$. Define $V_B = \{\vec{T}_j - \vec{S}_j : j \in B\}$. The following procedure accomplishes the second step:

  while there is a proper subset of $V_B$ that is linearly dependent
      Select a $\bar{B} \subseteq B$ such that $V_{\bar{B}}$ is minimally dependent.
      Let $\{\gamma_j : j \in \bar{B}\}$ be the unique set of values such that $\sum_{j \in \bar{B}} \gamma_j (\vec{T}_j - \vec{S}_j) = 0$ and $\min_{j \in \bar{B}} \gamma_j = 1$.
      if $\gamma_j > 0$ for every $j \in \bar{B}$
          return $\{(p_j, S_j, T_j) : j \in \bar{B}\}$
      else
          $c = \min_{j : \gamma_j < 0} \beta_j / (-\gamma_j)$
          for each $j \in \bar{B}$: $\beta_j = \beta_j + c\gamma_j$
  return $\{(p_j, S_j, T_j) : j \in B\}$

Note that since $\sum_{j \in \bar{B}} \gamma_j (\vec{T}_j - \vec{S}_j) = 0$, adding a constant multiple of the sum $\sum_{j \in \bar{B}} \gamma_j (\vec{T}_j - \vec{S}_j)$ to $\sum_{j \in B} \beta_j (\vec{T}_j - \vec{S}_j)$ maintains the condition that $\sum_{j \in B} \beta_j (\vec{T}_j - \vec{S}_j) = 0$.
the cost by at least $\Delta/(2(n+d))$ and Lemma 6 indicates that the augmentation returned by FindAugmentation reduces the cost by at least $1/(2d)$ times the best possible. Therefore, each iteration reduces the difference in cost between the current assignment and the optimal assignment by at least a factor of $1 - 1/(4d(n + d))$.
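The number of iterations is the smallest $t$ for which the remaining difference in cost falls below the smallest possible difference in cost between two assignments of different cost, which is $O(nd^2 \log(dnC))$.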
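The way a per-iteration factor of $1 - 1/(4d(n+d))$ turns into an iteration count can be sketched with a standard estimate. The symbols $\Delta_0$ (the initial difference in cost), $\Delta_t$ (the difference after $t$ iterations) and $\delta$ (the smallest positive difference in cost between two assignments) are introduced here for illustration only and are not the paper's notation.

```latex
\[
  \Delta_t \;\le\; \Bigl(1 - \tfrac{1}{k}\Bigr)^{t} \Delta_0
           \;\le\; e^{-t/k}\,\Delta_0,
  \qquad k = 4d(n+d),
\]
\[
  t \;>\; k \,\ln\!\bigl(\Delta_0/\delta\bigr)
  \quad\Longrightarrow\quad
  \Delta_t \;<\; \delta .
\]
```

Under this sketch the number of iterations is $O(d(n+d)\log(\Delta_0/\delta))$; the specific bounds on $\Delta_0$ and $\delta$ come from the lemmas proved in the Appendix.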

H Proof of Lemma 11
Proof. Recall that the entries of $\alpha$ were obtained as the solution to the equation $A\alpha = w$, where the columns of $A$ and the vector $w$ have entries in $\{-1, 0, 1\}$. By Lemma 8, $1/d^{d/2} \le \alpha_i \le d$.
An upper bound on $a$ is the ratio of the maximum possible value for $x(p, S)$ over the minimum possible value for $\alpha_i$. The highest value that $x(p, S)$ can achieve is $Z = \max_p \text{size}(p)$ because of (1). So let $a_{\max} = d^{d/2} Z$.
Similarly, a lower bound on $a$ is the ratio of the minimum possible value for $x(p, S)$ over the maximum possible value for $\alpha_i$. By Lemma 9, $x(p, S)$ is at least $1/d^{d/2}$. So we can take $a_{\min}$ to be $1/d^{d/2+1}$.
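As a worked check of how these two bounds enter the running time, the logarithm of their ratio can be expanded directly from the values of $a_{\max}$ and $a_{\min}$ given above:

```latex
\[
  \log\frac{a_{\max}}{a_{\min}}
  \;=\; \log\!\Bigl( d^{d/2} Z \cdot d^{\,d/2+1} \Bigr)
  \;=\; (d+1)\log d + \log Z .
\]
```

For $d \ge 2$ and $Z \ge 2$ this quantity is within the $O(d^2 \log d \log Z)$ bound used in the analysis of the running time.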