Evaluating and Tuning n-fold Integer Programming

In recent years, algorithmic breakthroughs in stringology, computational social choice, scheduling, and so on, were achieved by applying the theory of so-called n-fold integer programming. An n-fold integer program (IP) has a highly uniform block-structured constraint matrix. Hemmecke, Onn, and Romanchuk [Math. Program., 2013] showed an algorithm with runtime Δ^{O(rst+r²s)} · n³, where Δ is the largest coefficient, r, s, and t are the dimensions of the blocks of the constraint matrix, and n is the total dimension of the IP; that is, an algorithm that is efficient when the blocks are of small size and have small coefficients. The algorithm works by iteratively improving a feasible solution with augmenting steps, and n-fold IPs have the special property that augmenting steps are guaranteed to exist in a not-too-large neighborhood. However, this algorithm has never been implemented and evaluated. We have implemented the algorithm and learned the following along the way. The original algorithm is practically unusable, but we discover a series of improvements that make its evaluation possible. Crucially, we observe that a certain constant in the algorithm can be treated as a tuning parameter, which yields an efficient heuristic (essentially searching in a smaller-than-guaranteed neighborhood). Furthermore, the algorithm uses an overly expensive strategy to find a "best" step, while finding only an "approximately best" step is much cheaper, yet sufficient for quick convergence. Using this insight, we improve the asymptotic dependence on n from n³ to n² log n. Finally, we tested the behavior of the algorithm with various values of the tuning parameter and different strategies of finding improving steps. First, we show that decreasing the tuning parameter initially leads to an increased number of iterations needed for convergence and eventually to getting stuck in local optima, as expected.
However, surprisingly small values of the parameter already exhibit good behavior while significantly lowering the time the algorithm spends per single iteration. Second, our new strategy for finding “approximately best” steps wildly outperforms the original construction.


INTRODUCTION
In this article, we consider the general integer linear programming (ILP) problem in standard form,

    min { wx : Ax = b, l ≤ x ≤ u, x ∈ Z^n },    (ILP)

with A an integer m × n matrix, b ∈ Z^m, w ∈ Z^n, and l, u ∈ (Z ∪ {±∞})^n. It is well known to be strongly NP-hard, but it models many important problems in combinatorial optimization such as planning [31], scheduling [15], and transportation [4], and thus powerful generic solvers have been developed for it [28]. Still, theory is motivated to search for tractable special cases. One such special case is when the constraint matrix A has a so-called N-fold structure:

    E^(N) = ( E₁ E₁ ··· E₁ )
            ( E₂ 0  ··· 0  )
            ( 0  E₂ ··· 0  )
            ( ⋮       ⋱  ⋮ )
            ( 0  0  ··· E₂ )

Here, r, s, t, N ∈ N, u, l, w ∈ Z^{Nt}, b ∈ Z^{r+Ns}, E^(N) is an (r + Ns) × Nt matrix, E₁ ∈ Z^{r×t} is an r × t matrix, and E₂ ∈ Z^{s×t} is an s × t matrix. We call E^(N) the N-fold product of the (r + s) × t matrix E obtained by stacking E₁ above E₂, and denote by L the length of the binary encoding of the instance (A, w, b, l, u). Problem (ILP) with A = E^(N) is known as N-fold integer programming (N-fold IP). Hemmecke, Onn, and Romanchuk [18] prove the following.

Proposition 1 ([18, Theorem 6.2]). There is an algorithm that solves (ILP) with A = E^(N), encoded with L bits, in time Δ^{O(trs+t²s)} · n³L, where Δ = 1 + max{‖E₁‖∞, ‖E₂‖∞}.
The algorithm belongs to the larger family of augmentation (primal) algorithms. It starts with an initial feasible solution x^0 ∈ Z^{Nt} and produces a sequence of increasingly better solutions x^1, ..., x^σ (better means wx^σ < wx^{σ−1} < ··· < wx^0). It is guaranteed that the algorithm terminates, that x^σ is an optimal solution, and that the algorithm converges quickly, i.e., σ is polynomial in the length of the input. A key property of N-fold IPs is that, if an augmenting step exists, then it can be decomposed into a bounded number of elements of the so-called Graver basis of A, which we denote G(A). This decomposition in turn makes it possible to compute an augmenting step using dynamic programming [18, Lemma 3.1]. In a sense, this property makes the algorithm a local search algorithm, which is always guaranteed to find an improvement in a not-too-large neighborhood. The bound on the number of elements, or the size of the neighborhood that needs to be searched, is called the Graver complexity of E, denoted д(E). This, in turn, implies that, if an augmenting step exists, then there is always one with small ℓ₁-norm; for a matrix A, we denote this bound д₁(A) = max_{g∈G(A)} ‖g‖₁ [27, Theorem 4]. However, the algorithm has never been implemented and evaluated.

Our Contributions
We have implemented the algorithm and tested it on two problems for which N-fold formulations were known: makespan minimization on uniformly related machines (Q||Cmax) and Closest String; we have used randomly generated instances. The solver, tools, and many more plots can be accessed in a publicly accessible repository at https://github.com/katealtmanova/nfoldexperiment.
In the course of implementing the algorithm, we learned the following. The algorithm in its initial form is practically unusable due to an a priori construction of the Graver basis G(E₂), of size exponential in s, t, and Δ, and a related (even larger) set Z(E), whose size is exponential in r, s, t, and Δ. However, we discover a series of improvements (some building on recent insights [27]) that avoid the construction of these two sets. Moreover, we adjust the algorithm to treat д₁(A) as a tuning parameter g₁, which turns it into a heuristic (i.e., an optimal solution or polynomial runtime is no longer guaranteed; we shall discuss this topic in more detail later).
We also study the augmentation strategy, that is, the way the algorithm chooses an augmenting step among all the possible options. The original algorithm uses an overly expensive strategy to find a "best" step, which means that a large number of possible steps is evaluated in each iteration. We show that finding only an "approximately best" step is sufficient to obtain an asymptotically equivalent convergence rate, while the work per iteration decreases exponentially. Using this insight, we improve the asymptotic dependence on N from N³ to N² log N.
Finally, we evaluate the behavior of the algorithm. We ask how the performance of the algorithm (in terms of the number of dynamic programming calls and the quality of the returned solution) is influenced by
(1) the choice of the tuning parameter 1 < g₁ ≤ д₁(A), and
(2) the choice of the augmentation strategy among "best step", "approximate best step", and "any step".
As expected, with g₁ moving from д₁(A) to 1, we first see an increase in the number of iterations needed for convergence and eventually the algorithm gets stuck in a local optimum. However, surprisingly small values of the parameter (e.g., g₁ = 50 when д₁(A) > 10¹¹) already exhibit close to optimal behavior while significantly decreasing the time spent per iteration. Second, our new strategy for finding "approximately best" steps outperforms the original construction by orders of magnitude, while the naive "any step" strategy behaves erratically. We note that at this stage we are not (yet) interested in showing supremacy over existing algorithms; we simply want to understand the practical behavior of an algorithm whose theoretical importance was recently highlighted. For this reason, our experimental focus is on the two aforementioned questions rather than simply measuring the time. Unfortunately, our data does not indicate any slowdown of a commercial MILP solver as the number of bricks grows, which would be required to give the algorithm of Theorem 2 a chance to beat it.
Due to the rigid format of E^(N), we are limited to the few problems for which N-fold formulations are known. Regarding instances, for Closest String, we use the same approach as Chimani et al. [7]; for Makespan Minimization, we generate our own data, because standard benchmarks are not limited to short jobs or few types of jobs.

Related Work
Our work mainly relates to primal heuristics [3] for MIPs, which are used to help reach optimality faster and provide good feasible solutions early in the solution process. Specifically, our algorithm is a neighborhood (or local) search algorithm. The standard paradigm is Large Neighborhood Search (LNS) [30], with specializations such as Relaxation Induced Neighborhood Search (RINS) [8] and Feasibility Pump [2]. In terms of this paradigm, our proposed algorithm searches in the neighborhood induced by the ℓ₁-distance around the current feasible solution, and the search procedure is formulated as an ILP subproblem with the additional constraint ‖x‖₁ ≤ g₁. In this sense, the closest technique to ours is local branching [13], which also searches in the ℓ₁-neighborhood; however, we treat the discovered step as a direction and apply it exhaustively, so, unlike in local branching, we make long steps. Moreover, local branching was mainly applied to binary ILPs without any additional structure of the constraint matrix.
On the theoretical side, very recently Koutecký et al. [27] have studied parameterized strongly polynomial algorithms for various block-structured ILPs, not just N-fold IP. Eisenbrand et al. [10] independently (and using slightly different techniques) arrive at the same complexity of N-fold IP as our Theorem 2. Eisenbrand et al. [11] and Jansen et al. [21] both exhibit (different) near-linear time algorithms for N-fold IP. In particular, the approach of Eisenbrand et al. is relevant to implementations of an FPT algorithm for N-fold IP; however, due to our approach of using existing ILP solvers as a subroutine, we do not exploit it.

PRELIMINARIES
For positive integers m, n, we set [m, n] = {m, ..., n} and [n] = [1, n]. We write vectors in boldface (e.g., x, y) and their entries in normal font (e.g., the i-th entry of x is x_i). Given the problem (ILP), we say that x is feasible for (ILP) if Ax = b and l ≤ x ≤ u.

Graver Bases and Augmentation.
Let us now introduce Graver bases and discuss how they can be used for optimization. We also recall N -fold IPs; for background, we refer to the books of Onn [29] and De Loera et al. [9].
N-fold IP. The structure of E^(N) allows us to divide the Nt variables of x into N bricks of size t. We use subscripts to index within a brick and superscripts to denote the index of the brick, i.e., x^i_j is the j-th variable of the i-th brick, with j ∈ [t] and i ∈ [N].

Let x, y be n-dimensional integer vectors. We call x, y sign-compatible if they lie in the same orthant, that is, if for each i ∈ [n] it holds that x_i · y_i ≥ 0. We call a sum Σ_i g^i sign-compatible if all g^i are pairwise sign-compatible. Moreover, we write y ⊑ x if x and y are sign-compatible and |y_i| ≤ |x_i| for each i ∈ [n]. Clearly, ⊑ imposes a partial order, called the "conformal order", on n-dimensional vectors. For an integer matrix A ∈ Z^{m×n}, its Graver basis G(A) is the set of ⊑-minimal non-zero elements of the lattice of A, ker_Z(A) = {z ∈ Z^n | Az = 0}.

An important property of G(A) is the following. Let x be a feasible solution to (ILP). We call g an x-feasible step (or simply a feasible step if x is clear) if x + g is feasible for (ILP). Further, we call a feasible step g augmenting if w(x + g) < wx; note that g decreases the objective by wg. An augmenting step g and a step length λ ∈ N form an x-feasible step pair with respect to a feasible solution x if l ≤ x + λg ≤ u. A pair (λ, g) ∈ (N × ker_Z(A)) is a λ-Graver-best step pair, and λg is a λ-Graver-best step, if it is feasible and for every feasible step pair (λ, g′) with g′ ∈ G(A) we have wg ≤ wg′. An augmenting step g and a step length λ ∈ N form a Graver-best step pair if (λ, g) is λ-Graver-best and it minimizes wλ′g′ over all λ′ ∈ N, where (λ′, g′) is a λ′-Graver-best step pair. We say that λg is a Graver-best step if (λ, g) is a Graver-best step pair.
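The sign-compatibility and conformal order relations above can be stated directly in code. A minimal sketch (plain definitions on integer sequences, not an optimized implementation):

```python
def sign_compatible(x, y):
    """x and y lie in the same orthant: x_i * y_i >= 0 for every i."""
    return all(xi * yi >= 0 for xi, yi in zip(x, y))

def conforms(y, x):
    """y conforms to x (y ⊑ x): sign-compatible and |y_i| <= |x_i| for every i."""
    return sign_compatible(x, y) and all(abs(yi) <= abs(xi) for yi, xi in zip(y, x))

# Example: (1, -2) conforms to (2, -3); (1, 2) does not (second entries differ in sign).
```

With these two predicates, the Graver basis G(A) is exactly the set of ⊑-minimal non-zero elements of ker_Z(A).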
The Graver-best augmentation procedure for (ILP) with a given feasible solution x^0 and initial value i = 0 works as follows:
(1) If there is no Graver-best step for x^i, then return x^i as optimal.
(2) If a Graver-best step λg for x^i exists, then set x^{i+1} := x^i + λg, i := i + 1, and go to (1).
By standard techniques (detecting unboundedness, etc.), we can ensure that log M ≤ L, where M denotes the initial optimality gap (cf. Lemma 5).
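The procedure above can be sketched as a short loop. Here the step oracle is a brute force over an explicitly given candidate set G and a small range of step lengths; this is only an illustration of the control flow, since the actual algorithm replaces the oracle with the dynamic program (or, later, the (AugILP) subproblem):

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def feasible(x, l, u):
    return all(li <= xi <= ui for xi, li, ui in zip(x, l, u))

def graver_best_step(x, G, w, l, u, max_lambda=100):
    """Brute-force stand-in for the step oracle: best improving (lambda, g)."""
    best = None  # (objective change, lambda, g)
    for g in G:
        for lam in range(1, max_lambda + 1):
            y = [xi + lam * gi for xi, gi in zip(x, g)]
            if not feasible(y, l, u):
                break  # coordinates move monotonically, so larger lam stays infeasible
            val = lam * dot(w, g)
            if val < 0 and (best is None or val < best[0]):
                best = (val, lam, g)
    return best

def augment(x, G, w, l, u):
    """The Graver-best augmentation loop: apply best steps until none exists."""
    while True:
        step = graver_best_step(x, G, w, l, u)
        if step is None:
            return x  # no augmenting step: x is optimal
        _, lam, g = step
        x = [xi + lam * gi for xi, gi in zip(x, g)]
```

For example, minimizing w = (1, 2) over x₁ + x₂ = 5 with 0 ≤ x ≤ 5 and kernel directions (1, −1), (−1, 1) converges from (2, 3) to (5, 0) in one Graver-best step.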

APPROXIMATE GRAVER-BEST STEPS
In this section, we introduce the notion of a c-approximate Graver-best step (Definition 1), show that such steps exhibit good convergence (Lemma 5), can be easily obtained (Lemma 6), and result in a significant speed-up of the N-fold IP algorithm (Theorem 2).

Definition 1 (c-approximate Graver-best step). Let c ∈ R with c ≥ 1. Given an instance of (ILP) and a feasible solution x, we say that an x-feasible step h is a c-approximate Graver-best step for x if, for every x-feasible step pair (λ, g) ∈ (N × G(A)), we have wh ≤ (1/c) · λwg.

Recall the Graver-best augmentation procedure. We call its analogue, where we replace a Graver-best step with a c-approximate Graver-best step, the c-approximate Graver-best augmentation procedure.
Lemma 5 (c-approximate convergence bound). Given a feasible solution x^0 for (ILP), the c-approximate Graver-best augmentation procedure finds an optimum of (ILP) in at most c · 3n log M steps, where M = w(x^0 − x*) and x* is any minimizer of wx.
Proof. The proof is a straightforward adaptation of the proof of Proposition 4, which we first repeat here for convenience. Let x* be a minimizer and let h = x* − x^0. Since Ah = 0, h can be written as a sign-compatible sum h = Σ_{i=1}^{2n−2} λ_i g_i with λ_i ∈ N and g_i ∈ G(A). Thus, by an averaging argument, an x-feasible step pair (λ, g) such that λg is a Graver-best step must satisfy wλg ≤ (1/(2n−2)) · wh = −(1/(2n−2)) · M. In other words, any Graver-best step pair improves the objective function by at least a 1/(2n−2)-fraction of the total optimality gap M, and thus 3n log M steps suffice to reach an optimum (cf. [29, Lemma 3.10]).
It is straightforward to see that a c-approximate Graver-best step improves the objective by at least a 1/(c(2n−2))-fraction of the optimality gap M, and thus c · 3n log M steps suffice to reach an optimum. □

Lemma 6. Let c ∈ R with c ≥ 1, let Λ ∈ N be an upper bound on the length of any x-feasible step pair, and let Γ_c-apx be the set of all powers of c (rounded down) that are at most Λ. If (λ, g) minimizes wλ′g′ over all λ′ ∈ Γ_c-apx and all λ′-Graver-best steps g′, then λg is a c-approximate Graver-best step.

Proof. Let (λ, g) satisfy the assumptions, and let (λ̄, ḡ) ∈ (N × G(A)) be a Graver-best step pair. Let λ′ ∈ Γ_c-apx be the nearest power of c below λ̄, and observe that λ′ḡ is a c-approximate Graver-best step, because λ′ ≥ λ̄/c. However, since (λ, g) minimizes over Γ_c-apx, which contains λ′, we have wλg ≤ wλ′ḡ ≤ (1/c) · wλ̄ḡ, and thus λg is also a c-approximate Graver-best step. □

Remark 1. Lemma 5 extends naturally to separable convex objectives; see the original proof [29, Lemma 3.10]. Moreover, Lemma 6 also extends to separable convex objectives, as was recently shown by Eisenbrand et al. [10]. Thus, Theorem 2 (below) holds also for separable convex objectives.
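The argument above searches only over powers of c as step lengths. Generating this logarithmic set is a one-liner-style routine; a sketch (lam_max stands in for an assumed upper bound on any feasible step length, e.g., derived from ‖u − l‖∞):

```python
import math

def gamma_c_apx(c, lam_max):
    """Geometric step lengths {1, c, c^2, ...} up to lam_max, rounded to
    integers with strict growth enforced. Searching a lambda-Graver-best
    step only for these O(log lam_max) lengths yields a c-approximate
    Graver-best step, in the spirit of Lemma 6."""
    lengths = []
    lam = 1
    while lam <= lam_max:
        lengths.append(lam)
        lam = max(lam + 1, math.ceil(lam * c))  # next power, never stalling
    return lengths
```

For c = 2 and lam_max = 100, this produces [1, 2, 4, 8, 16, 32, 64]: seven oracle calls instead of up to one hundred.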
Proof. Recall that Δ = ‖A‖∞ + 1. Koutecký et al. [27, Theorem 4] show that a λ-Graver-best step can be found in time Δ^{O(r²s+rs²)} · Nt. Moreover, Hemmecke et al. [17] prove a proximity theorem that allows the reduction of an instance of (ILP) to an equivalent instance with new, bounded ranges of the lower and upper bounds; the precise inequality can be found in the proof of Reference [27, Theorem 4]. This bound implies that the set Γ_2-apx from Lemma 6 satisfies |Γ_2-apx| ∈ O(log(Nt)). By Lemma 6, finding a λ-Graver-best step for each λ ∈ Γ_2-apx and picking the minimum results in a 2-approximate Graver-best step, and can be done in time Δ^{O(r²s+rs²)} · (Nt) log(Nt). By Lemma 5, (4n − 4) log M steps suffice to reach the optimum.

IMPLEMENTATION
We first give an overview of the original algorithm, which is our starting point. Then, we discuss our specific improvements and mention a few details of the software implementation.

Overview of the Original Algorithm
Recall that any Nt-dimensional vector related to N-fold IP is naturally partitioned into N bricks of length t. In particular, this applies to the solution vector x and any augmenting step g. The key property of the N-fold product E^(N) is that, regardless of N ∈ N, the number of non-zero bricks of any g ∈ G(E^(N)) is bounded by a constant д(E) called the Graver complexity of E, and, moreover, that the sum of all non-zero bricks of g can be decomposed into at most д(E) elements of G(E₂) [18, Lemma 3.1]. This facilitates the following construction. Let Z(E) be the set of all sums of at most д(E) elements of G(E₂). Then, every brick of g ∈ G(E^(N)) is contained in Z(E), and a λ-Graver-best step, λ ∈ N, can be found using dynamic programming over the elements of Z(E).
To ensure that a Graver-best step is found, a set of step lengths Γ_best is constructed as follows. Observe that any Graver-best (and thus feasible) step pair (λ, g) ∈ (N × G(E^(N))) must be "tight" in at least one brick i ∈ [N]; that is, (λ, g) is x-feasible while (λ + 1, g) is not, specifically because l^i ≤ x^i + λg^i ≤ u^i holds but l^i ≤ x^i + (λ + 1)g^i ≤ u^i does not. Thus, for each z ∈ Z(E) and each i ∈ [N], we find all the potentially "tight" step lengths λ and add them to Γ_best.

ALGORITHM 1: Pseudocode of the algorithm of Hemmecke, Onn, and Romanchuk.
input: matrices E₁, E₂, positive integer N, and vectors b, l, u, w
output: an optimal solution to (ILP) with A = E^(N)

Notice that this approach does not work for separable convex objectives, for which a Graver-best step might not be tight in any coordinate.
For an overview of the algorithm as described by Hemmecke, Onn, and Romanchuk, see Algorithm 1.
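The "tight" step lengths used to build Γ_best can be computed coordinate-wise within a single brick. A minimal sketch (a hedged illustration on integer vectors, not the paper's code; x, z, l, u are one brick of the current solution, a candidate from Z(E), and the bounds):

```python
def tight_lengths(x, z, l, u):
    """For each coordinate with z_j != 0, the largest lam such that the
    coordinate still satisfies l_j <= x_j + lam * z_j <= u_j. Collecting
    these candidates over all z in Z(E) and all bricks yields Gamma_best."""
    lams = set()
    for xj, zj, lj, uj in zip(x, z, l, u):
        if zj > 0:
            lams.add((uj - xj) // zj)
        elif zj < 0:
            lams.add((xj - lj) // (-zj))
    return {lam for lam in lams if lam >= 1}
```

For x = (2, 3), z = (1, −1) inside the box [0, 5]², both coordinates become tight at λ = 3, so only this single candidate length is generated.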

Replacing Dynamic Programming with ILP
We have started off by implementing the algorithm exactly as it is described by Hemmecke et al. [18]. The first obstacle is encountered almost immediately: the constant д(E). This constant can be computed, but the computation is extremely expensive [12,16]. Another possibility is to estimate it, but the estimate is almost always larger than N and thus essentially meaningless. Finally, one can take the approach partially suggested in Reference [18, Section 7]: consider д(E) in the construction of Z(E) to be a tuning parameter and work with the approximate set Z_gc(E), gc ∈ N, obtained by taking sums of at most gc elements of G(E₂). This makes the algorithm more practical, but turns it into a heuristic. In spite of this sacrifice, already for small instances (r = 3, s = 1, t = 7, N = 10) and the extremely small value gc = 3, the dynamic programming based on the Z_gc(E) construction was taking an unreasonably long time (over one minute). Admittedly, this could be improved; however, already for gc > 5 it becomes infeasible to compute Z_gc(E), and for larger instances (r > 5, t > 12) it becomes very difficult to compute even G(E₂). For these reasons, we sought to completely replace the dynamic program involving Z(E).
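The combinatorial explosion of Z_gc(E) is easy to see from its definition. A minimal sketch (G2 is assumed to be the Graver basis G(E₂) given explicitly as a list of tuples):

```python
from itertools import combinations_with_replacement

def Z_gc(G2, gc):
    """The approximate set Z_gc(E): all sums of at most gc elements of
    G(E_2). Its size grows combinatorially in gc and |G(E_2)|, which is
    exactly why we abandon this construction in our implementation."""
    result = set()
    for k in range(1, gc + 1):
        for combo in combinations_with_replacement(G2, k):
            result.add(tuple(sum(col) for col in zip(*combo)))
    return result
```

Even for the toy basis {(1, −1), (−1, 1)} and gc = 2 the set already has 5 elements; for realistic G(E₂) of hundreds of elements, the number of combinations at gc = 5 is astronomically larger.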
Koutecký et al. [27] show that all instances of (ILP) in which the so-called dual treedepth td_D(A) of A is bounded and the largest coefficient ‖A‖∞ is bounded also have bounded д₁(A) = max_{g∈G(A)} ‖g‖₁, which implies that augmenting steps can be found efficiently. This class of ILPs contains N-fold IP.
The interpretation of the above fact is that, to solve (ILP), it is sufficient to repeatedly (for different x and λ) solve an auxiliary instance

    min { wh : Ah = 0, l ≤ x + λh ≤ u, ‖h‖₁ ≤ д₁(A), h ∈ Z^{Nt} }    (AugILP)

to find good augmenting steps; we note that the constraint ‖h‖₁ ≤ д₁(A) can be linearized [27, Lemma 25]. The heuristic approach outlined above transfers easily: We replace д₁(A) in (AugILP) with some integer g₁, 1 < g₁ ≤ д₁(A); this makes (AugILP) easier to solve at the cost of losing the guarantee that an augmenting step is found if one exists. In theory, solving (AugILP) should be easier than solving the original instance (ILP) due to the special structure of A [27, Lemma 25]. Our approach here is to simply invoke an industrial MILP solver on (AugILP) to find a λ-Graver-best step.

Note that the quantities д(E) and д₁(A) and the tuning parameters gc and g₁ are related but distinct. First, д(E) bounds the number of non-zero bricks of any element of G(A) and the number of elements of G(E₂) into which it decomposes, while д₁(A) bounds the ℓ₁-norm of any element of G(A). It can be seen that bounded д₁(A) implies bounded д(E) and vice versa. Second, gc and g₁ are tuning parameters derived from д(E) and д₁(A), respectively. The crucial distinction is that the tuning parameter g₁ translates naturally into a linear constraint of (AugILP), while gc only translates naturally to a construction of a restricted set of states Z_gc(E), which we are trying to avoid.
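The linearization of the ℓ₁-constraint follows the standard split h = h⁺ − h⁻ with h⁺, h⁻ ≥ 0 and Σ(h⁺ + h⁻) ≤ g₁. A tiny brute-force check (an illustration of the equivalence on a small box, not the solver model itself) that the split does not change the set of reachable steps:

```python
from itertools import product

def steps_abs(dim, g1, box):
    """All integer h in [-box, box]^dim with ||h||_1 <= g1."""
    return {h for h in product(range(-box, box + 1), repeat=dim)
            if sum(abs(hi) for hi in h) <= g1}

def steps_split(dim, g1, box):
    """The same set expressed through the linear split h = h_plus - h_minus
    with h_plus, h_minus >= 0 and sum(h_plus) + sum(h_minus) <= g1."""
    out = set()
    for hp in product(range(box + 1), repeat=dim):
        for hm in product(range(box + 1), repeat=dim):
            if sum(hp) + sum(hm) <= g1:
                out.add(tuple(p - m for p, m in zip(hp, hm)))
    return out
```

Taking h⁺ = max(h, 0) and h⁻ = max(−h, 0) componentwise shows every ℓ₁-bounded step is reachable, and conversely Σ|h⁺ − h⁻| ≤ Σ(h⁺ + h⁻), so the two feasible regions coincide.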

Augmentation Strategy: Step Lengths
Logarithmic Γ. The majority of algorithms based on Graver basis augmentation rely on the Graver-best augmentation procedure [6,9,18,23,24,29] and consequently require finding (exact) Graver-best steps. In the aforementioned algorithms, this is always done using the construction of the set Γ_best mentioned above, which is of size f(k) · n, where k is the relevant parameter. In our implementation, we instead use the logarithmic set Γ_c-apx of Lemma 6.

Exhausting λ. Moreover, we have noticed that the algorithm sometimes finds a step g for λ = 2^k that is not tight in any brick, and then repeatedly applies it for shorter step lengths λ′ < λ. In other words, the discovered direction g is not exhausted. Thus, for each λ ∈ N, upon finding the λ-Graver-best step g, we replace λ with the largest λ′ ≥ λ for which (λ′, g) is still x-feasible.
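The replacement of λ by the largest still-feasible λ′ is a coordinate-wise computation. A minimal sketch (a hedged illustration for integer vectors, assuming the current solution x is itself feasible):

```python
def exhaust(x, g, l, u):
    """Largest lam >= 0 with l <= x + lam*g <= u: the step length that
    fully 'exhausts' the discovered direction g from x within the box."""
    lam = None
    for xj, gj, lj, uj in zip(x, g, l, u):
        if gj > 0:
            cap = (uj - xj) // gj
        elif gj < 0:
            cap = (xj - lj) // (-gj)
        else:
            continue  # coordinate unchanged by g, imposes no cap
        lam = cap if lam is None else min(lam, cap)
    return lam  # None means g == 0, i.e., no coordinate restricts lam
```

For x = (2, 3), g = (1, −1) in the box [0, 5]², the direction is exhausted at λ = 3, reaching the boundary point (5, 0) in a single long step.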
Early Termination. Another observation is that, in any given iteration of the algorithm, if λ > 1, then some augmenting step has already been found; if the computation is taking too long, we may terminate it and simply apply the best step found so far.
Initialize Once. We have noticed that a large portion of time spent on computing a λ-Graver-best step is taken by the initialization of the MILP model, which is then solved very quickly. However, notice that in the formulation of (AugILP) the only changing parameters are the lower and upper bounds. This leads us to a practical improvement: Initialize the MILP model once in the beginning, and realize each (AugILP) call by changing the bounds and reoptimizing the model.
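The "initialize once" pattern can be sketched as follows. Here the expensive model construction is stood in for by enumerating a candidate step set once (a hypothetical stand-in for building the MILP model; in the actual implementation this is the solver's model object), while each (AugILP) call only supplies the current bounds and re-solves:

```python
class AugIPOracle:
    """Sketch of the 'initialize once' pattern: construction happens once
    in __init__; each call to solve() only updates x, lambda, and the
    bounds, mirroring how we reoptimize the MILP model with new bounds."""

    def __init__(self, candidates, w):
        self.candidates = list(candidates)  # expensive part, done once
        self.w = w

    def solve(self, x, lam, l, u):
        """Best improving step h among the candidates under current bounds."""
        best, best_val = None, 0
        for h in self.candidates:
            y = [xi + lam * hi for xi, hi in zip(x, h)]
            if all(li <= yi <= ui for yi, li, ui in zip(y, l, u)):
                val = sum(wi * hi for wi, hi in zip(self.w, h))
                if val < best_val:
                    best, best_val = h, val
        return best
```

In the real solver the analogue of `solve()` is changing the variable bounds of the already-built model and calling the solver's reoptimize routine, which skips model construction entirely.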
For an overview of the newly proposed algorithm, see Algorithm 2.

Software and Hardware
We have implemented our solver in the SageMath computer algebra system [34]. This was a convenient choice for several reasons. The SageMath system offers an interactive notebook-style webbased interface, which allows rapid prototyping and debugging. Data types for vectors and matrices, Graver basis algorithms [1], and a unified interface for MILP solvers are also readily available.
We have experimented with the open-source solvers GLPK [33] and Coin-OR CBC [32] and the commercial solver Gurobi [14], and have settled on the latter, since it performs best. The downside of SageMath is that an implementation of the original dynamic program is likely much slower than a similar implementation in C; however, this DP is impractical anyway, as explained in Section 4.2. Moreover, as we will evidence later, the overhead of SageMath in the construction of a MILP model is significant, and for smaller instances (where (AugILP) is not called many times) the time spent on constructing the MILP model dominates the runtime.

ALGORITHM 2: Pseudocode of our new heuristic algorithm. The algorithm is exact if g₁ ≥ д₁(A) = max_{g∈G(A)} ‖g‖₁. Note the two nested loops: We shall refer to them as the inner loop, which computes a c-approximate Graver-best step, and the outer loop, which repeatedly adds the computed step to the current solution x^{i−1}.
input: matrices E₁, E₂, positive integers N, c, and g₁, and vectors b, l, u, w
output: a feasible solution to (ILP) with A = E^(N)

For random instance generation and subsequent data evaluation and graphing, we have used the Jupyter notebook environment [22] and the Matplotlib and Seaborn libraries [19,35]. The computations were performed on a computer with an Intel Xeon E5-2630 v3 (2.40GHz) CPU and 128GB RAM.

Instances
We choose two problems for which N -fold IP formulations were shown in the literature, namely the Q ||C max scheduling problem [23] and the Closest String problem [24]. Here, we introduce both problems in their decision variants.
Uniformly related machines makespan minimization (Q||Cmax)
Input: A set M of m machines, each with a speed s_i ∈ N, a set J of n jobs, each with a processing time p_j ∈ N, and a target makespan B.
Question: Is there an assignment of the jobs J to the m machines such that the time when the last job finishes (the makespan) is at most B? Here, a job j scheduled on a machine i takes time p_j/s_i to execute.

Closest String
Input: A set of k strings s₁, ..., s_k of length L over an alphabet Σ and a positive integer d.
Question: Is there a string y ∈ Σ^L such that max_{i∈[k]} d_H(y, s_i) ≤ d, where d_H denotes the Hamming distance?

In the rest of this section, we present the N-fold IP models we used in our study and then describe how we generate random instances.

Scheduling
We observe that Q ||C max is equivalent to the multi-sized bin packing problem, where we have m bins of various capacities instead of m machines of different speeds, and we adopt this view as it is more convenient. We also view it as a high-multiplicity problem where the items are not given explicitly as a list of item sizes, but succinctly by a vector of item multiplicities. Because Algorithm 2 is primarily an optimization algorithm, we follow the standard approach [18,Lemma 3.8] and turn the feasibility problem into an auxiliary optimization instance in which finding a starting feasible solution is easy. However, the naive approach [18, Lemma 3.8] would almost double the dimension, which is not necessary in the specific case of Q ||C max . Instead, we introduce an auxiliary machine onto which all jobs are initially scheduled, and the objective is to minimize the number of jobs scheduled on this machine. If a solution is found with no jobs scheduled on this auxiliary machine, then it corresponds to an admissible schedule with makespan at most B.
N-fold IP Model. Let p = (p₁, ..., p_k) be the vector of item sizes, let n = (n₁, ..., n_k) be the vector of item multiplicities, n = Σ_{j=1}^k n_j, and let s₁, ..., s_m be the speeds of the machines in the instance of Q||Cmax. We use the following ILP model for Q||Cmax with fixed makespan B. We have km integral variables x_{ij}, with i ∈ [m] and j ∈ [k], to express the number of jobs of type j scheduled on machine i. Furthermore, we introduce a variable x_{0j} expressing the number of unscheduled jobs of type j, for j ∈ [k]. As already pointed out, we minimize the number of unscheduled jobs.
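Spelled out, the model described above can be written as follows. This is a reconstruction consistent with E₁ being the k × k identity and E₂ = p as stated below; c_i denotes the capacity of machine i at makespan B (e.g., c_i = s_i · B), c₀ the capacity of the penalty machine, and slack variables that bring the capacity constraints into the equality standard form are omitted:

```latex
\begin{aligned}
\min \;& \sum_{j=1}^{k} x_{0j} \\
\text{s.t.}\;& \sum_{i=0}^{m} x_{ij} = n_j && \forall j \in [k], \\
& \sum_{j=1}^{k} p_j \, x_{ij} \le c_i && \forall i \in [0,m], \\
& x_{ij} \ge 0 && \forall i \in [0,m],\ \forall j \in [k].
\end{aligned}
```

The first constraint group (every job type fully assigned) forms the linking rows E₁ = I_k, and each capacity constraint is one brick's row E₂ = p.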
Here, we have essentially added a "penalty machine," which runs fast enough that it is possible to schedule all of the given jobs on this extra machine. Now, it is straightforward to verify that this is indeed an N-fold IP model with N = m + 1, in which the matrix E₁ is the identity matrix of size k × k and E₂ = p. The input parameters of the instance generation are the number of bins (or machines) m, the smallest and largest capacities S and L, respectively, the item sizes p₁, ..., p_k with probability weights w₁, ..., w_k, and a slack ratio σ ∈ R with 0 ≤ σ ≤ 1. Let W = Σ_{i=1}^k w_i. The instance is then generated as follows. First, we choose m capacities from [S, L] uniformly at random. This determines the total available time C of the machines. The next goal is to generate items whose total size is roughly σ · C. We do this by repeatedly picking an item length from p₁, ..., p_k, where p_j is selected with probability w_j/W, until the total size of the items picked so far exceeds σ · C, at which point we terminate and return the generated instance.
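The generation procedure just described can be sketched in a few lines. A hedged illustration (function name, seeding, and return format are our choices, not the repository's exact interface):

```python
import random

def generate_qcmax(m, S, L, p, weights, sigma, seed=0):
    """Random Q||Cmax instance as described above: m capacities drawn
    uniformly from [S, L], then items drawn with probability w_j / W
    until their total size exceeds sigma * (total capacity C).
    Returns (capacities, item multiplicities n_1..n_k)."""
    rng = random.Random(seed)
    caps = [rng.randint(S, L) for _ in range(m)]
    C = sum(caps)
    mult = [0] * len(p)
    total = 0
    while total <= sigma * C:
        j = rng.choices(range(len(p)), weights=weights)[0]
        mult[j] += 1
        total += p[j]
    return caps, mult
```

By construction, the total item size slightly exceeds σ · C, so σ directly controls how tightly packed a feasible schedule must be.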
Batch Generation. We generate a batch of experimental instances from a list of parameters, which correspond to command line arguments of the batch generator. The generated batch is a cartesian product of all possible choices of the parameters.

p_s A list of integers, by default [5,6,7,8,9,10,11,12,13]. For each number ℓ ∈ p_s, we compute the first ℓ primes and randomly pick a subset of size k of them as the processing times p₁ ≤ ··· ≤ p_k. We set the weights w₁ ≥ ··· ≥ w_k to be p_k, ..., p₁, i.e., jobs of larger length occur with smaller probability. Note that max number_job_types ≤ min p_s must hold. (We pick processing times that are primes, because this easily guarantees that the set of p_i's is coprime and thus the instance cannot be trivially reduced to an instance with smaller p_max.)

count_for_each_p An integer, by default 3. For each choice of ℓ ∈ p_s, we make count_for_each_p independent choices of the size-k subset of the first ℓ primes.
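The prime-based choice of processing times is simple to reproduce. A minimal sketch (trial division is ample for the small counts involved; function names are ours):

```python
import random

def first_primes(count):
    """The first `count` primes, by trial division against earlier primes."""
    primes = []
    n = 2
    while len(primes) < count:
        if all(n % q for q in primes):
            primes.append(n)
        n += 1
    return primes

def pick_processing_times(ell, k, seed=0):
    """A random size-k subset of the first ell primes, sorted ascending,
    as processing times p_1 <= ... <= p_k; the weights are the same values
    reversed, so larger jobs occur with smaller probability."""
    rng = random.Random(seed)
    p = sorted(rng.sample(first_primes(ell), k))
    weights = p[::-1]
    return p, weights
```

Using primes guarantees the chosen processing times are coprime, matching the remark above about non-reducible instances.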

Closest String
The random instances are generated exactly as done by Chimani et al. [7]: First, we generate a random "target" string y ∈ Σ^L and create k copies s₁, ..., s_k of it; then, we make α random changes in s₁, ..., s_k. This way, we have an upper bound α on the optimum. The input parameters of the instance generation are thus k, L, Σ, the distance ratio r such that α = n/r, and a distance factor δ, 0 ≤ δ ≤ 1, such that we ask whether there exists a string at distance at most d = δ · n/r. Thus, for δ = 1, we are guaranteed that the answer is Yes, while for δ = 0 the answer is almost surely No. Again, we solve an auxiliary optimization instance where we essentially start with a string of "all blanks," setting the Hamming distance between the blank and any character in Σ to 0. Then, we try to fill in all the blanks while staying within the specified distance d; the objective is thus the remaining number of blanks.
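The Chimani et al.-style generation can be sketched as follows. A hedged illustration: we assume the α changes are spread over the k copies (each "change" picks a random copy, position, and replacement character, which may coincide with the original), so α remains an upper bound on every d_H(y, s_i):

```python
import random

def generate_closest_string(k, L, sigma, r, delta, seed=0):
    """Random Closest String instance: random target y over alphabet sigma,
    k copies, then alpha = L // r random changes spread over the copies;
    the distance bound is d = delta * L / r."""
    rng = random.Random(seed)
    y = [rng.choice(sigma) for _ in range(L)]
    strings = [list(y) for _ in range(k)]
    alpha = L // r
    for _ in range(alpha):
        i = rng.randrange(k)
        pos = rng.randrange(L)
        strings[i][pos] = rng.choice(sigma)  # may be a no-op change
    d = delta * L / r
    return ["".join(s) for s in strings], "".join(y), d
```

Since at most α positions are changed in total, the target y witnesses a Yes answer whenever δ = 1.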
N-fold IP Model. Let (s₁, ..., s_k, d) be an instance of the Closest String problem, where all of the strings s₁, ..., s_k are of length n and taken from an alphabet Σ. We assume the given instance is already preprocessed, that is, |Σ| ≤ k + 1 (the plus one comes from the presence of the blank symbol). We call a k-tuple of symbols in Σ a configuration and denote the set of all configurations C. An input position i ∈ [n] has configuration C ∈ C if s_j[i] = C[j] for all j = 1, ..., k. For a configuration C ∈ C, we denote by n_C the number of input positions having configuration C. Notice now that our task is to decide, for each configuration C ∈ C, how many times we are going to use a character σ ∈ Σ in the output string y. To that end, we introduce integral variables x_{C,σ} for each configuration C ∈ C and each character σ ∈ Σ. Then, we introduce some auxiliary variables (all of them set to 0 using the box constraints) to maintain the N-fold format and design a valid model with N = |C| ≤ k^k. To see this, notice that we have to compute the distance of y to every string s_i in the input. Let C ∈ C be a configuration and let D_C ∈ {0,1}^{k×|Σ|} be the matrix whose columns we index by elements of Σ, with D_C(i, σ) = d_H(C[i], σ); that is, the matrix D_C describes the contribution to the Hamming distances of the k input strings if we assign σ once at a position with configuration C. We stress here that, since Σ contains the blank symbol, D_C contains an all-zero column in the corresponding position. Finally, we let D = (D_{C₁} | ··· | D_{C_{|C|}}) be the matrix in which we collect all of the above defined distance matrices. Let t be the number of columns of the matrix D. For each configuration C ∈ C, we introduce a vector of variables x^C of length t whose entries we index x^C_{C′,σ}; we set the box constraints to 0 ≤ x^C_{C′,σ} for all C′ ∈ C and all σ ∈ Σ, and x^C_{C′,σ} ≤ 0 for all C′ ∈ C \ {C} and all σ ∈ Σ (together forcing the auxiliary entries to 0).
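The grouping of input positions by configuration, which yields the multiplicities n_C above, is easy to state in code. A minimal sketch (input strings given as equal-length Python strings):

```python
from collections import Counter

def configurations(strings):
    """Count input positions by configuration: position i has configuration
    C exactly when (s_1[i], ..., s_k[i]) == C. Returns the map C -> n_C
    used to set up the N-fold model, one brick per configuration."""
    return Counter(tuple(s[i] for s in strings) for i in range(len(strings[0])))
```

Note that the number of bricks is the number of distinct configurations actually occurring, which is typically far smaller than the worst-case bound on |C|.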
We generate an instance for each parameter tuple from the cartesian product of all the lists above.
The parameter logdir (by default logs) determines the target directory to store the logs. The directory will have subdirectories according to the dimension Nt on the first level, subdirectories according to different Δ (maximum coefficient) on the second level, and subdirectories for each problem instance on the third level. Finally, each instance directory contains one .log and one .pickle (protocol version 2) file for each choice of g₁ and Γ. The parameter instance_type is one of sched (default) or cs, for Q||Cmax or Closest String, respectively. Parameters augip_timelimit and milp_timelimit are both integers determining the time limit for the MILP solver, with the former applying to the (AugILP) instances and the latter to calls of the solver on the original (ILP) instance. Finally, passing --disable_nfold turns off the iterative algorithm and only uses the MILP solver to solve the original (ILP) instance.

EVALUATION
We first give an outline of the evaluation process, which is divided into three parts.
Qualitative Evaluation. In the first part, we ask two main questions: how is the performance of the algorithm (both in terms of the number of iterations and the quality of the returned solution) influenced by (1) the value of the tuning parameter g_1, and (2) the augmentation strategy Γ?
Regarding our first question, theoretically, we should see either an increase in the number of iterations, a decrease in the quality of the returned solution, or both. However, the range of the tuning parameter g_1 is quite large: any number between 2 and g_1(A) is a valid choice, and in all our scenarios the true value of g_1(A) exceeds 200. Thus, we are interested in the transition values of g_1 at which the algorithm no longer finds the true optimum or at which its convergence rate drops significantly.
Regarding our second question, there are two main candidates for the set of step-lengths Γ. We can either use the "best step" construction Γ_best of the original algorithm, which ensures that we always make a Graver-best step before moving to the next iteration, or we can use the "approximate best step" construction Γ_2-apx of Theorem 2, which provides a 2-approximate Graver-best step. To make this comparison more interesting, we also consider Γ_5-apx and Γ_10-apx, and also the trivial "any step" strategy where we always make the 1-Graver-best step, which corresponds to taking Γ_any = {1}. Recall that due to the trick of always exhausting the discovered direction, this strategy actually has a chance at quick convergence, unlike if we only made the step with λ = 1.
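Assuming the c-approximate construction is a geometric grid of step-lengths (so that trying every λ in the grid misses the Graver-best step by at most a factor of c), the candidate sets can be sketched as follows; the cap lam_max below is a hypothetical bound, not a value from the paper:

```python
def gamma_c_apx(c, lam_max):
    """Geometric grid {1, c, c^2, ...} of step-lengths up to lam_max."""
    grid, lam = [], 1
    while lam <= lam_max:
        grid.append(lam)
        lam *= c
    return grid

gamma_any = [1]                        # "any step": only lambda = 1
gamma_2apx = gamma_c_apx(2, 1000)      # 2-approximate Graver-best steps
gamma_best = list(range(1, 1001))      # every step-length: exact but expensive
```

Note that |Γ_c-apx| is logarithmic in the cap while |Γ_best| is linear, which is why the approximate strategies are so much cheaper per iteration.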
Quantitative Evaluation. Later, we will quantify the relationship of several instance parameters, such as the dimension, the largest coefficient Δ, the number of columns t of E_1, the number of rows r of E_1, the number of bricks N, and the tuning parameter g_1, to performance parameters such as the optimality gap or the convergence rate. Recall that in both our scenarios we have s = 1, and thus we do not mention this parameter further.
Toward Practical Applications. Finally, we explore possible avenues to transfer our ideas to practice. To that end, we ask "on which instances could n-fold IP beat Gurobi?" Due to the immense amount of attention dedicated to industrial MILP solvers, we do not expect our ideas to lead to significant improvements across many kinds of instances; however, we do expect that there exist some special instances on which Gurobi performs poorly and could be outperformed by a newer implementation of our solver.
To this end, we study the relationship of several time measures (total time, time spent on augmentation calls, time taken by Gurobi to solve the instance, etc.) to parameters such as dimension, Δ, r , and t.

Qualitative Evaluation
Here, we demonstrate the overall behavior of the algorithm on two selected instances (one for Q||C_max and one for Closest String); we encourage the reader to see the full data (including plots) at https://github.com/katealtmanova/nfoldexperiment.
We chose two instances among the tested ones as representatives of the overall behavior:
• A Closest String instance with parameters k = 3, |Σ| = 4, L = 8000, r = 4, and δ = 0.3.
Plots. We use two types of plots to visualize our data. First, and only for the scheduling instance, we hand-picked four "interesting" values of g_1, namely, g_1 = 25, 50, 150, and 1000, and we give a line plot for each such value of g_1 and each augmentation strategy Γ (Figures 1-3). The x axis of each line plot corresponds to inner iterations (computations of a λ-Graver-best step). The y axis corresponds to objective values. Each line plot contains two lines: a thin blue line marking each individual value computed in the inner loop, and a thick orange line marking the progress of the outer loop, i.e., the minimum over all steps computed in the individual outer iterations.
The second type of plot (Figures 4 and 5) is essentially obtained from the first type by considering all tested values of g_1 (not only the "interesting" values), discarding the thin (inner loop) lines, and stacking the remaining lines on top of each other, thus obtaining one line plot for each augmentation strategy Γ.
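The data behind the thick orange line is simply the per-outer-iteration minimum of the inner-loop values; a minimal sketch (our own illustration, independent of the plotting library used):

```python
def outer_progress(inner_values):
    """inner_values: one list of objective values per outer iteration.
    Returns the thick-line series: the minimum computed in each outer iteration."""
    return [min(vals) for vals in inner_values]
```

Stacking these series for all tested values of g_1 then yields the second type of plot.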
Conclusions. Our main takeaway regarding Question #1 is that, while the theoretical upper bounds for g_1(A) are huge, already small values of g_1 (g_1 > 10 for Closest String and g_1 > 150 for Makespan Minimization) are sufficient for convergence to global optima. We remark that, in the case of Closest String, this hints at the possibility that the maximum value of any feasible augmenting step g ∈ G(A) is bounded by k^{O(1)} rather than k^{O(k)}, which would imply an algorithm with runtime k^{O(k)} log L, while the currently best algorithm runs in time k^{O(k²)} log L [24].
Regarding Question #2, we see that Γ_2-apx converges similarly to Γ_best but is orders of magnitude cheaper to compute. The "any step" augmentation strategy Γ_any usually converges surprisingly quickly, but our results make it clear that its behavior is erratic and unpredictable. Specifically, with augmentation strategies such as Γ_2-apx, increasing the parameter g_1 reliably leads to faster convergence, while for Γ_any this is not the case. Consequently, beyond some value of g_1, strategies such as Γ_2-apx outperform Γ_any in absolute numbers of iterations.
The detailed Figures 1-3 reveal that the step that is eventually taken is often found at relatively large step-lengths λ; this explains why Γ_5-apx outperforms Γ_2-apx and is typically outperformed by Γ_10-apx, as Γ_c-apx spends less time on short step-lengths as c increases.

Quantitative Evaluation
In the second part of our evaluation, we relate several instance parameters to the two selected performance parameters. The instance parameters of our interest are:
• the number of columns of the E_1 block, that is, t,
• the number of rows of the E_1 block, that is, r,
• the number of bricks N.
As we have noted, for both problems the matrix E_2 has only one row, so the parameter s is always 1. Moreover, for Closest String the parameters dimension, N, t, and r are closely related, as the dimension is Nt with N = t² and t ≤ r^r. To simplify matters, all tests from now on were run with the augmentation strategy Γ_2-apx.
Regarding performance parameters, we wish to study the optimality gap, which is simply the difference between the optimum obtained by the algorithm and the exact optimum. Moreover, we wish to quantify the notion of a "convergence rate" in a normalized way that allows comparison across instances. To this end, fix an instance and denote by it(g_1) the number of (inner) iterations taken by the algorithm to reach the optimum (and +∞ if the optimality gap is positive), let it_min = min_{g_1} it(g_1), and finally let the convergence rate be c(g_1) = it_min / it(g_1). Thus, 0 ≤ c(g_1) ≤ 1, with c(g_1) = 0 if setting the tuning parameter to value g_1 does not make the algorithm find the optimum, and with larger values corresponding to faster convergence.
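The definition of the convergence rate translates directly into code; a small sketch of our own, with float('inf') encoding a run that ended with a positive optimality gap:

```python
def convergence_rate(iterations):
    """iterations: dict mapping g1 -> it(g1), the number of inner iterations
    needed to reach the optimum, or float('inf') if the run got stuck."""
    it_min = min(iterations.values())
    return {g1: (0.0 if it == float("inf") else it_min / it)
            for g1, it in iterations.items()}
```

For example, iteration counts {25: inf, 50: 40, 1000: 20} yield convergence rates 0.0, 0.5, and 1.0, respectively.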
The testing batches were generated with the following parameters:
• For Q||C_max, the command line was './nfold_sched_tester.sage --logdir 31012019 --machines 10
Plots. We visualize the relationships as follows: each plot in Figures 6 and 7 is a heatmap whose columns are increasing values of g_1, rows are increasing values of Δ or dimension (for Q||C_max), and cells are values of the optimality gap or the convergence rate. The color scheme is such that darker shades correspond to worse behavior, be it a larger optimality gap or a smaller convergence rate.

Conclusions.
Our results, now measured across many instances, confirm our previous hypotheses: increasing values of g_1 lead to decreasing optimality gaps and improving convergence rates. Moreover, the effect seems to correlate more with Δ than with the dimension Nt. To see this, observe in particular Figure 6, whose rows and columns look similar, indicating relatively small correlation with the dimension as compared with Δ.
Fig. 2. Augmentation strategy Γ_any on a Q||C_max instance (for interpretation cf. Figure 1).
This corresponds to the theoretical observation that the "true" value of g_1(A) is independent of t and N but depends on Δ and r. (Note that for Q||C_max we have t = r, so the parameter t is expected to have an effect; however, our tested values possibly cover too narrow a range.)

Toward Practical Applications
So far, we have been interested in parameters "internal" to the implemented algorithm. In particular, we have disregarded the actual time taken by the computation and any analysis of potential bottlenecks of the algorithm. The relevant time parameters that we study now are the following:
• the total time needed to run Algorithm 2, denoted total,
• the time required to initialize the MILP model of (AugILP), denoted augip init,
• the time consumed by solving (AugILP) excluding initialization, denoted augip total, and
• the time required to construct the MILP model of the whole instance (ILP) and solve it using Gurobi, denoted gurobi construct & solve.
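Time measures of this kind can be collected with a simple accumulating timer; the following sketch shows how one might instrument the phases (the phase names mirror the measures above, but the instrumentation itself is our own, not taken from the implementation):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timers = defaultdict(float)  # phase name -> accumulated wall-clock seconds

@contextmanager
def timed(name):
    """Accumulate the wall-clock time spent inside the with-block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timers[name] += time.perf_counter() - start

# hypothetical usage around the phases measured above:
with timed("augip_init"):
    pass  # build the (AugILP) MILP model here
with timed("augip_total"):
    pass  # solve (AugILP) here
```

Accumulating (rather than overwriting) matters because (AugILP) is built and solved once per iteration, so each phase is entered many times per run.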
Our initial observation during preliminary experiments was that the total required time grows significantly with increasing dimension. Thus, our goal was to determine potential instance parameters, such as dimension or Δ, which make an instance hard for Gurobi, with the hope that for such instances a good implementation of a parameterized N-fold IP algorithm would outperform Gurobi. However, a closer examination has revealed that the observed growth is caused by the increasing time taken by the model construction phase (augip init and the "construct" part of gurobi construct & solve).
Fig. 3. Augmentation strategy Γ_best on a Q||C_max instance (for interpretation cf. Figure 1).
Fig. 4. Q||C_max, stacked plots, left-to-right Γ_2-apx, Γ_5-apx, Γ_10-apx, Γ_best, and Γ_any.
Plots. We present our findings in two types of plots. The first one (Figures 8 and 9) is a line plot whose y axis is time and whose x axis is one of dimension, r, and Δ (only for Q||C_max), with individual lines corresponding to the different time parameters above. Semi-transparent bands around the lines correspond to 95% confidence intervals.
The second type (Figure 10), constructed only for Q||C_max, shows the individual time parameters with respect to dimension and Δ, in the form of heatmaps.
Conclusions. Unfortunately, we conclude that, at least in the case of our formulations of Closest String and Q||C_max, neither increasing Δ nor increasing the dimension creates an obstacle for Gurobi itself. Instead, the bottleneck lies in the overhead of SageMath and Python data structures.

OUTLOOK
We have initiated an experimental investigation of a certain subclass of ILPs with a block structured constraint matrix. Our results show that, as theory suggests, for such ILPs a primal algorithm always augmenting with steps of small ℓ_1-norm converges quickly. We close with a few interesting research directions.
First, in theory, the special structure of (AugILP) (in particular, an ℓ_1-norm bound on its solution) as compared with (ILP) means that (AugILP) can be solved faster than (ILP). However, in practice, this seems to have little to no effect. Thus, we ask: Is there a way to tune generic MILP solvers to solve (AugILP) significantly faster than (ILP)?
Second, what is the behavior of our algorithm on instances other than N-fold IPs? For example, how large does g_1 have to be to attain the optimum quickly for standard benchmark instances, e.g., from MIPLIB [26]?
Third, the approach of Koutecký et al. [27] suggests that a key property for the efficient solvability of (AugILP) is a certain "sparsity" and "shallowness" (formally captured by the graph parameter tree-depth) of graphs related to the constraint matrix. Thus, we ask what are "natural" instances with small tree-depth, and what is "typical" tree-depth of instances used in practice.