Two Proofs for Shallow Packings

We refine the bound on the packing number, originally shown by Haussler, for shallow geometric set systems. Specifically, let V be a finite set system defined over an n-point set X; we view V as a set of indicator vectors over the n-dimensional unit cube. A δ-separated set of V is a subcollection W such that the Hamming distance between each pair u, v ∈ W is greater than δ, where δ > 0 is an integer parameter. The δ-packing number is then defined as the cardinality of a largest δ-separated subcollection of V. Haussler showed an asymptotically tight bound of Θ((n/δ)^d) on the δ-packing number if V has VC-dimension (or primal shatter dimension) d. We refine this bound for the scenario where, for any subset X′ ⊆ X of size m ≤ n and for any parameter 1 ≤ k ≤ m, the number of vectors of length at most k in the restriction of V to X′ is only O(m^{d_1} k^{d−d_1}), for a fixed integer d > 0 and a real parameter 1 ≤ d_1 ≤ d (this generalizes the standard notion of bounded primal shatter dimension when d_1 = d). In this case, when V is "k-shallow" (all vector lengths are at most k), we show that its δ-packing number is O(n^{d_1} k^{d−d_1}/δ^d), matching Haussler's bound for the special cases d_1 = d or k = n. We present two proofs: the first is an extension of Haussler's approach, and the second extends the proof of Chazelle, originally presented as a simplification of Haussler's proof.

In geometric settings, the ground set is typically a set of points in R^d, and the sets are induced by simply shaped regions, e.g., halfspaces, balls, or simplices in R^d. In such cases, the primal shatter (and VC-) dimension is a function of d; see, e.g., [14] for more details. When we flip the roles of points and regions, we obtain the so-called dual set systems (where we refer to the former as primal set systems). In this case, the ground set is a collection S of algebraic surfaces in R^d, and V corresponds to faces of all dimensions in the arrangement A(S) of S, that is, the decomposition of R^d into connected open cells of dimensions 0, 1, . . . , d induced by S. Each cell is a maximal connected region that is contained in the intersection of a fixed number of the surfaces and avoids all other surfaces; in particular, the 0-dimensional cells of A(S) are called "vertices", and the d-dimensional cells are simply referred to as "cells"; see [28] for more details. The distinction between primal and dual set systems in geometry is essential, and set systems of both kinds appear in numerous geometric applications; see, once again, [14] and the references therein.

δ-Packing
The length ‖v‖ of a vector v ∈ V under the L_1 norm is defined as Σ_{i=1}^n |v_i|, where v_i is the ith coordinate of v, i = 1, . . . , n. The distance ρ(u, v) between a pair of vectors u, v ∈ V is defined as the L_1 norm of the difference u − v, that is, ρ(u, v) = Σ_{i=1}^n |u_i − v_i|. In other words, it is the symmetric-difference distance between the corresponding sets represented by u, v. Let δ > 0 be an integer parameter. We say that a subset of vectors W ⊆ {0, 1}^n is δ-separated if for each pair u, v ∈ W, ρ(u, v) > δ. The δ-packing number for V, denoted M(δ, V), is then defined as the cardinality of a largest δ-separated subset W ⊆ V. A key property, originally shown by Haussler [12] (see also [4,5,7,21,31]), is that set systems of bounded primal shatter dimension admit small δ-packing numbers. That is: Theorem 3 (Packing Lemma [12,21]) Let V ⊆ {0, 1}^n be a set of indicator vectors of primal shatter dimension d, and let 1 ≤ δ ≤ n be an integer parameter. Then M(δ, V) = O((n/δ)^d), where the constant of proportionality depends on d.
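To make the definitions concrete, the following sketch (in Python, with helper names of our own choosing) computes the symmetric-difference distance ρ(u, v) and determines M(δ, V) by brute force; it is exponential-time and meant only for tiny examples.

```python
from itertools import combinations

def rho(u, v):
    """Symmetric-difference (L1) distance between two 0/1 vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def is_delta_separated(W, delta):
    """Check that every pair of vectors in W is at distance > delta."""
    return all(rho(u, v) > delta for u, v in combinations(W, 2))

def packing_number(V, delta):
    """Exact M(delta, V): size of a largest delta-separated subcollection."""
    V = list(V)
    for r in range(len(V), 0, -1):
        if any(is_delta_separated(W, delta) for W in combinations(V, r)):
            return r
    return 0

# Indicator vectors over a 4-point ground set; every pair differs
# in at least 2 coordinates, so the whole collection is 1-separated.
V = [(0, 0, 0, 0), (1, 1, 0, 0), (0, 0, 1, 1), (1, 1, 1, 1)]
print(packing_number(V, 1))  # -> 4
```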
We note that in the original formulation in [12] the assumption is that the set system has finite VC-dimension. However, its formulation in [21], which is based on a simplification of Haussler's analysis by Chazelle [4], relies on the assumption that the primal shatter dimension is d, which is the actual bound that we state in Theorem 3 (we comment, though, that one component of the analysis uses the fact that the VC-dimension d_0 is finite, but this follows from the bound d_0 = O(d log d)). We also comment that a closer inspection of the analysis in [12] shows that this assumption can be replaced with that of having bounded primal shatter dimension (independent of the analysis in [4]). We describe these considerations in Sect. 2.1.

Previous Work
In his seminal work, Dudley [7] presented the first application of chaining, a proof technique due to Kolmogorov, to empirical process theory, where he showed the bound O((n/δ)^{d_0} log^{d_0}(n/δ)) on M(δ, V), with a constant of proportionality depending on the VC-dimension d_0 (see also previous work by Haussler [11] and Pollard [26] for an alternative proof and a specification of the constant of proportionality). This bound was later improved by Haussler [12], who showed M(δ, V) ≤ e(d_0 + 1)(2en/δ)^{d_0}, where e is the base of the natural logarithm (see also Theorem 3), and presented a matching lower bound, which leaves only a constant-factor gap that depends exponentially on d_0. In fact, the aforementioned bounds are more general, and can also be applied to classes of real-valued functions of finite "pseudo-dimension" (the special case of set systems corresponds to Boolean functions); see, e.g., [11]. However, we do not discuss this generalization in this paper and focus merely on set systems V of finite primal shatter (resp., VC-) dimension.
The bound of Haussler [12] (Theorem 3) is in fact a generalization of the so-called Sauer–Shelah Lemma [27,29], asserting that |V| ≤ (en/d_0)^{d_0}, and thus this bound is O(n^{d_0}). Indeed, when δ = 1, the corresponding δ-separated set includes all vectors in V, and then the bound of Haussler [12] becomes O(n^{d_0}), matching the Sauer–Shelah bound up to a constant factor that depends on d_0.
There have been several studies extending Haussler's bound or improving it in some special scenarios. We name only a few of them. Gottlieb et al. [10] presented a sharpening of this bound when δ is relatively large, i.e., δ is close to n/2, in which case the vectors are "nearly orthogonal". They also presented a tighter lower bound, which considerably simplifies the analysis of Bshouty et al. [3], who achieved the same bound.
A major application of packing is in obtaining improved bounds on the sample complexity in machine learning. This was studied by Li et al. [17] (see also [11]), who presented an asymptotically tight bound on the sample complexity, in order to guarantee a small "relative error." This problem has been revisited by Har-Peled and Sharir [15] in the context of geometric set systems, where they referred to a sample of the above kind as a "relative approximation", and showed how to integrate it into an approximate range counting machinery, which is a central application in computational geometry. The packing number has also been used by Welzl [31] in order to construct spanning trees of low crossing number (see also [21]) and by Matoušek [20,21] in order to obtain asymptotically tight bounds in geometric discrepancy.

Our Result
Shallow Packing Lemma In the sequel, we refine the bound in the Packing Lemma (Theorem 3) so that it becomes sensitive to the length of the vectors v ∈ V, based on an appropriate refinement of the underlying primal shatter function. This refinement has several geometric applications. Our ultimate goal is to show that when the set system is "shallow" (that is, the underlying vectors are short), the packing number becomes much smaller than the bound in Theorem 3. Nevertheless, we cannot always enforce such an improvement, as in some settings the worst-case asymptotic bound on the packing number is Ω((n/δ)^d) even when the set system is shallow. See Fig. 1 and the paragraph at the end of this section, where we give a more detailed description of this construction for the non-expert reader.
Therefore, in order to obtain an improvement on the packing number of shallow set systems, we may need further assumptions on the primal shatter function. Such assumptions stem from the random sampling technique of Clarkson and Shor [6], and are defined as follows. Let V be our set system. We assume that for any sequence I of m ≤ n indices, and for any parameter 1 ≤ k ≤ m, the number of vectors of length at most k in the projection of V onto I is O(m^{d_1} k^{d−d_1}), where d is the primal shatter dimension and 1 ≤ d_1 ≤ d is a real parameter. When k = m we obtain O(m^d) vectors in total, in accordance with the assumption that the primal shatter dimension is d, but the above bound is also sensitive to the length of the vectors as long as d_1 < d. From now on, we say that a primal shatter function of this kind has the (d, d_1) Clarkson–Shor property.
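As a toy illustration of the (d, d_1) Clarkson–Shor property (a sketch under our own modeling choices, not an example from the paper): for m points on a line with ranges induced by intervals, the induced non-empty subsets are exactly the contiguous blocks, so the primal shatter function is O(m^2) (i.e., d = 2), while the number of induced subsets of size at most k is at most mk (i.e., d_1 = 1).

```python
def interval_subsets(m):
    """All distinct non-empty subsets of m collinear points induced by
    intervals: exactly the contiguous blocks {i, ..., j - 1}."""
    return [frozenset(range(i, j)) for i in range(m) for j in range(i + 1, m + 1)]

def count_short(m, k):
    """Number of induced subsets of size at most k."""
    return sum(1 for s in interval_subsets(m) if len(s) <= k)

# Total induced subsets: m(m+1)/2 = O(m^2), i.e. d = 2; those of size <= k
# number sum_{s=1}^{k} (m - s + 1) <= m*k, i.e. d_1 = 1 in O(m^{d_1} k^{d-d_1}).
m, k = 20, 5
print(count_short(m, k), m * k)  # -> 90 100
```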
Let us now denote by M(δ, k, V) the δ-packing number of V, where the vector length of each element in V is at most k, for some integer parameter 1 ≤ k ≤ n. With this notation we can assume, without loss of generality, that k ≥ δ/2, as otherwise the distance between any two elements in V must be strictly less than δ, in which case the packing contains at most one element. In Sects. 2 and 3 we present two proofs for our main result, stated below: Theorem 4 (Shallow Packing Lemma) Let V ⊆ {0, 1}^n be a set of indicator vectors, whose primal shatter function has the (d, d_1) Clarkson–Shor property (where d is the primal shatter dimension and 1 ≤ d_1 ≤ d is a real parameter). Let δ ≥ 1 be an integer parameter, let k be an integer parameter between 1 and n, and suppose that k ≥ δ/2. Then M(δ, k, V) = O(n^{d_1} k^{d−d_1}/δ^d), where the constant of proportionality depends on d.
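As a sanity check, the bound in Theorem 4 specializes to Haussler's bound in the two extreme cases:

```latex
% d_1 = d: the Clarkson--Shor refinement is vacuous, and
M(\delta, k, \mathcal{V}) = O\!\left(\frac{n^{d}\,k^{\,d-d}}{\delta^{d}}\right)
                          = O\!\left(\left(\frac{n}{\delta}\right)^{d}\right);
% k = n: vectors of arbitrary length, and again
M(\delta, n, \mathcal{V}) = O\!\left(\frac{n^{d_1}\,n^{\,d-d_1}}{\delta^{d}}\right)
                          = O\!\left(\left(\frac{n}{\delta}\right)^{d}\right).
```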
The actual dependence of the constant of proportionality on d is analyzed at the end of Sect. 2 for the first proof, and at the end of Sect. 3 for the second proof. This problem was initially addressed by the second author in [9] as a major tool to obtain size-sensitive discrepancy bounds in set systems of this kind, where it was shown that M(δ, k, V) = O((n^{d_1} k^{d−d_1}/δ^d) log^d(n/δ)). The analysis in [9] is a refinement of the technique of Dudley [7] combined with the existence of small-size relative approximations (see [9] for more details). In the current analysis we completely remove the extra log^d(n/δ) factor appearing in the previous bound. In particular, when d_1 = d (where we just have the original assumption on the primal shatter function) or k = n (in which case each vector in V has an arbitrary length), our bound matches the tight bound of Haussler, and thus appears as a generalization of the Packing Lemma (when replacing VC-dimension by primal shatter dimension). We present two proofs for Theorem 4: the first is an extension of Haussler's approach (Sect. 2), and the second is an extension of Chazelle's proof [4] of the Packing Lemma (Sect. 3).
We note that after the submission of this paper we found a third proof by Mustafa [24], based on a simplification of Chazelle's proof [4] combined with Markov's inequality. While we find this proof simpler than ours, we emphasize the fact that the problem of shallow packings was initially posed by the second author in [9], and the preliminary version of this paper [8] is the first to resolve it (in the sense of Theorem 4). We thus believe the results in [24] were somewhat inspired by the preliminary version in [8].
Applications Theorem 4 implies smaller packing numbers for several natural geometric set systems under the shallowness assumption. These set systems are described in detail in Sect. 4.1. In Sect. 4.2 we show an application in geometric discrepancy, where the goal is to obtain discrepancy bounds that are sensitive to the length of the vectors in V. Due to the bound in Theorem 4 we obtain an improvement over the one presented in [9].

An Example of a Shallow Set System with Large Packing Numbers
Consider a ground set that is a collection of axis-parallel rectangles, and let the vectors V represent subsets of rectangles that cover a point in the plane. For simplicity of exposition, we define these vectors to represent all two-dimensional cells in the arrangement of the given rectangles. It is well known that the primal shatter function of V is quadratic (see, e.g., [14]), and therefore, by Theorem 3, the packing number is O((n/δ)^2). Nevertheless, we claim that even when the arrangement is shallow, the asymptotic bound on the packing number is not any better than Ω((n/δ)^2). Indeed, fix a positive even parameter δ > 0, and suppose, without loss of generality, that n/δ is an integer. Consider now an (n/δ) × (n/δ) grid of long and skinny rectangles, where each rectangle in the grid is duplicated δ/2 times (with a possibly infinitesimal perturbation), as illustrated in Fig. 1. Clearly, each (two-dimensional) cell in the arrangement is covered by at most δ rectangles, and thus the set system is δ-shallow. Consider now only the cells at "depth" δ (that is, cells covered by precisely δ rectangles), and let F ⊂ V be the set of their representative vectors. It is easy to verify that, for each pair v, v′ ∈ F, the distance ρ(v, v′) is at least δ (see once again Fig. 1), and thus F is both δ-separated and δ-shallow. However, by construction, we have |F| = Ω((n/δ)^2), and thus the δ-shallowness assumption does not yield an improvement over the general case.
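The lower-bound construction can be checked symbolically. Below is a sketch under our own encoding (not the paper's): each depth-δ cell (i, j) is covered by the δ/2 copies of the ith horizontal rectangle and the δ/2 copies of the jth vertical rectangle, and we verify the claimed cardinality and separation.

```python
from itertools import combinations

def depth_delta_cells(n, delta):
    """Indicator vectors of the (n/delta)^2 grid cells at depth delta.
    There are g = n/delta horizontal and g vertical rectangles, each
    duplicated delta/2 times, so 2 * g * (delta/2) = n rectangles in total."""
    g, c = n // delta, delta // 2
    vectors = []
    for i in range(g):
        for j in range(g):
            v = [0] * n
            for t in range(c):
                v[i * c + t] = 1            # copies of horizontal rectangle i
                v[g * c + j * c + t] = 1    # copies of vertical rectangle j
            vectors.append(tuple(v))
    return vectors

dist = lambda u, v: sum(abs(a - b) for a, b in zip(u, v))

n, delta = 24, 4
F = depth_delta_cells(n, delta)
# |F| = (n/delta)^2, every vector has length delta (delta-shallow),
# and every pair is at distance >= delta (delta-separated).
print(len(F), all(dist(u, v) >= delta for u, v in combinations(F, 2)))  # -> 36 True
```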

Relative Approximations and ε-Nets
We mentioned in the introduction the notion of relative (ε, η)-approximations. We now define them formally. Following the definition from [15], given a set system V ⊆ {0, 1}^n and two parameters 0 < ε < 1 and 0 < η < 1, we say that a subsequence I of indices is a relative (ε, η)-approximation if it satisfies, for each vector v ∈ V, | ‖v|_I‖/|I| − ‖v‖/n | ≤ η · max{‖v‖/n, ε}, where v|_I is the projection of v ∈ V onto I. As observed by Har-Peled and Sharir [15], the analysis of Li et al. [17] implies that if V has primal shatter dimension d, then a random sample of (cd log(1/ε))/(εη^2) indices (each of which is drawn independently) is a relative (ε, η)-approximation for V with constant probability, where c > 0 is an absolute constant. More specifically, success with probability at least 1 − q is guaranteed if one samples c(d log(1/ε) + log(1/q))/(εη^2) indices. It was also observed in [15] that ε-nets arise as a special case of relative (ε, η)-approximations. Specifically, an ε-net is a subsequence of indices I with the property that any vector v ∈ V with ‖v‖ ≥ nε satisfies ‖v|_I‖ ≥ 1. In other words, I is a hitting set for all the "long" vectors. In this case, if we set η to be some constant fraction, say, 1/4, then a relative (ε, 1/4)-approximation becomes an ε-net. Moreover, a random sample of O((d log(1/ε) + log(1/q))/ε) indices (with an appropriate choice of the constant of proportionality) is an ε-net for V with probability at least 1 − q; see [15] for further details.
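The two sampling properties can be phrased as simple predicates, as in the following sketch (the function names and the tiny example are ours):

```python
def length(v):
    """L1 norm of a 0/1 vector."""
    return sum(v)

def project(v, I):
    """Projection of v onto the index subsequence I."""
    return [v[i] for i in I]

def is_relative_approx(V, I, eps, eta, n):
    """I is a relative (eps, eta)-approximation for V."""
    return all(
        abs(length(project(v, I)) / len(I) - length(v) / n)
        <= eta * max(length(v) / n, eps)
        for v in V
    )

def is_eps_net(V, I, eps, n):
    """I hits every 'long' vector, i.e. every v with ||v|| >= n * eps."""
    return all(length(project(v, I)) >= 1
               for v in V if length(v) >= n * eps)

# Tiny deterministic example over n = 8 indices.
n = 8
V = [(1, 1, 1, 1, 0, 0, 0, 0), (0, 0, 0, 0, 1, 1, 1, 1)]
I = [0, 4]  # one index from each half
print(is_eps_net(V, I, 0.5, n), is_relative_approx(V, I, 0.5, 0.25, n))  # -> True True
```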

Overview of Haussler's Approach
For the sake of completeness, we repeat some of the details in the analysis of Haussler [12] and use similar notation for ease of presentation.
Let V ⊆ {0, 1}^n be a collection of indicator vectors of bounded primal shatter dimension d, and denote its VC-dimension by d_0. By the discussion above, d_0 = O(d log d). From now on we assume that V is δ-separated, and thus a bound on |V| is also a bound on the packing number of V. The analysis in [12] exploits the method of "conditional variance" in order to conclude Inequality (1), which bounds |V| in terms of Exp_I[|V|_I|], where I is a random sample of m − 1 indices and m is the choice in (2). We justify this choice in Appendix 1, as well as the facts that m ≤ n and that I consists of precisely m − 1 indices. Moreover, we refine Haussler's analysis to include two natural extensions (see Appendix 1 for details):

(i) Obtain a refined bound on Exp_I[|V|_I|]. In the analysis of Haussler, Exp_I[|V|_I|] is replaced by its upper bound O(m^d), resulting from the fact that the primal shatter dimension of V (and thus of V|_I) is d, from which we obtain, for any choice of I, |V|_I| = O(m^d), with a constant of proportionality that depends on d, and thus the packing number is O((n/δ)^d), as asserted in Theorem 3. However, in our analysis we would like to have a more subtle bound on the actual expected value of |V|_I|. In fact, the scenario imposed by our assumptions on the set system eventually yields a much smaller bound on the expectation of |V|_I|, and thus on |V| due to Inequality (1). We review this in more detail below.

(ii) Relaxing the bound on m. We show that Inequality (1) is still applicable when the sample I is slightly larger than the bound in (2). As a stand-alone relation this may result in a suboptimal bound on |V|; however, this property will assist us in obtaining local improvements over the bound on |V|, eventually yielding the bound in Theorem 4. Specifically, in our analysis we proceed in iterations, where at the first iteration we obtain a preliminary bound on |V| (Corollary 6), and then, at each subsequent iteration j > 1, we draw a sample I_j of m_j − 1 indices, with m_j = O(m log^{(j)}(n/δ)) as in (3), where m is our choice in (2) and log^{(j)}(·) is the jth iterated logarithm function.
Then, by a straightforward generalization of Haussler's analysis (described in Appendix 1), we obtain Inequality (4), relating |V| to Exp_{I_j}[|V|_{I_j}|], for each j = 2, . . . , log*(n/δ). We note that since the bounds (1)–(4) involve a dependency on the VC-dimension d_0, we will sometimes need to explicitly refer to this parameter in addition to the primal shatter dimension d. Nevertheless, throughout the analysis we exploit the relation d_0 = O(d log d).
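The iterated logarithm log^{(j)} and the log* bound on the number of iterations can be computed as follows (a small sketch; base-2 logarithms assumed):

```python
from math import log2

def iter_log(j, x):
    """log^{(j)}(x): the logarithm applied j times (log^{(0)} is the identity)."""
    for _ in range(j):
        x = log2(x)
    return x

def log_star(x):
    """log*(x): number of times log2 must be applied before the result is <= 1."""
    j = 0
    while x > 1:
        x = log2(x)
        j += 1
    return j

# The sample sizes shrink as j grows: m_j - 1 = O((n/delta) * log^{(j)}(n/delta)),
# and after log*(n/delta) iterations the extra factor becomes O(1).
print(log_star(65536), [iter_log(j, 65536) for j in range(1, 5)])  # -> 4 [16.0, 4.0, 2.0, 1.0]
```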

Overview of Our Approach
We next present the proof of Theorem 4. In what follows, we assume that V is δ-separated, and denote by d its primal shatter dimension and by d 0 its VC-dimension.
We first recall the assumption that the primal shatter function of V has the (d, d_1) Clarkson–Shor property, and that the length of each vector v ∈ V under the L_1 norm is at most k. This implies that V consists of O(n^{d_1} k^{d−d_1}) vectors.
Since the Clarkson–Shor property is hereditary, it also applies to any projection of V onto a subset of indices, implying that |V|_I| = O(m^{d_1} k^{d−d_1}), where I is a subset of m − 1 indices as above. However, due to our sampling scheme, we expect the length of each vector in V|_I to be much smaller than k (e.g., in expectation this value should not exceed k(m − 1)/n), from which we may conclude that the actual bound on |V|_I| is smaller, namely O(m^{d_1}(km/n)^{d−d_1}) = O(n^{d_1} k^{d−d_1}/δ^d), which matches our asymptotic bound in Theorem 4 (recall that m = O(n/δ)). However, this is likely to happen only in the case where the length of each vector in V|_I does not exceed its expected value, or where there are only a few vectors whose length deviates from its expected value by far, whereas in the worst case there might be many leftover "long" vectors in V|_I. Nevertheless, our goal is to show that, with some care, one can proceed in iterations, where initially I is a slightly larger sample, and then at each iteration we reduce its size, until eventually it becomes O(m) and we remain with only a few long vectors. At each such iteration V|_I is a random structure that depends on the choice of I and may thus contain long vectors; however, in expectation they will be scarce! Specifically, we proceed over at most log*(n/δ) iterations, where we perform local improvements over the bound on |V|, as follows. Let |V|^{(j)} be the bound on |V| after the jth iteration is completed, 1 ≤ j ≤ log*(n/δ). We first show in Corollary 6 that for the first iteration, |V| ≤ |V|^{(1)}, with a constant of proportionality that depends on d. Then, at each further iteration j ≥ 2, we select a set I_j of m_j − 1 = O(n log^{(j)}(n/δ)/δ) indices uniformly at random without replacement from [n] (see (3) for the bound on m_j).
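The back-of-the-envelope calculation behind this expectation, substituting m = O(n/δ), is:

```latex
m^{d_1}\left(\frac{k\,m}{n}\right)^{d-d_1}
  \;=\; \left(\frac{n}{\delta}\right)^{d_1}\left(\frac{k}{\delta}\right)^{d-d_1}
  \;=\; \frac{n^{d_1}\,k^{\,d-d_1}}{\delta^{d}},
```

which is exactly the bound asserted in Theorem 4.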
Our goal is to bound Exp_{I_j}[|V|_{I_j}|] using the bound |V|^{(j−1)} obtained at the previous iteration, which we assume by induction (see Lemma 8 for the recursive relation that we derive, as well as our observation that the coefficient of the recursive term is at most 1); the base case j = 2 is shown in Corollary 6.
A key property in the analysis is to show that the probability that the length of a vector v ∈ V|_{I_j} (after the projection of V onto I_j) deviates from its expectation decays exponentially (Lemma 7). Note that in our case this expectation is at most k(m_j − 1)/n. This, in particular, enables us to claim that in expectation the overall majority of the vectors in V|_{I_j} have length at most O(k(m_j − 1)/n), whereas the remaining longer vectors are scarce. This is the key idea in deriving a recursive inequality for |V|^{(j)} using the bound on Exp_{I_j}[|V|_{I_j}|] (Lemma 8). Roughly speaking, since the Clarkson–Shor property is hereditary, we apply it to V|_{I_j} and conclude a bound on the number of its vectors. Then we apply Inequality (4) in order to complete the inductive step, whence we obtain the bound on |V|^{(j)}, and thus on |V|. Note that the full analysis is somewhat more subtle, as we need to show that the actual coefficient of the recursive term |V|^{(j)} is at most 1 (Lemma 8). We emphasize the fact that the sample I_j is always chosen from the original ground set [n]; thus, at each iteration we construct a new sample from scratch, and then exploit our observation in Inequality (4).
In what follows, we also assume that δ < n/(2(d_0 + 1)) (where d_0 is the VC-dimension), as otherwise the bound on the packing number is a constant that depends on d and d_0 by the Packing Lemma (Theorem 3). This assumption is crucial for the recursive analysis presented in this section; see below.

The First Iteration
In order to show our bound on |V|^{(1)}, we form a subset I_1 of m_1 − 1 indices with the following two properties: (i) each vector in V is mapped to a distinct vector in V|_{I_1}, and (ii) the length of each vector in V|_{I_1} is O(k m_1/n). A set I_1 as above exists by the considerations in [9]. Specifically: Lemma 5 A sample I_1 as above satisfies properties (i)–(ii), with probability at least 1/2.
Proof We rely on the notions of "relative approximations" and "ε-nets", defined in Sect. 2.1. In order to show (i), we first form the set system corresponding to all symmetric-difference pairs induced by V. That is, we form the vector set D := {u ⊕ v : u, v ∈ V, u ≠ v}, where u ⊕ v is the vector whose ith coordinate is |u_i − v_i|. Since we assume that V is δ-separated, we have ‖w‖ ≥ δ for each w ∈ D. As observed in [7] (see also [14]), D has a finite VC-dimension.
We now construct an ε-net for D, with ε = δ/n. By our discussion above a sample I 1 of O(d(n/δ) log (n/δ)) indices has this property with probability greater than, say, 3/4 (for a sufficiently large constant of proportionality).
Thus, by definition, any vector w ∈ D (recall that its length is at least δ) must satisfy ‖w|_{I_1}‖ ≥ 1, where w|_{I_1} denotes the projection of w onto I_1. But this implies that we must have u|_{I_1} ≠ v|_{I_1} for each pair u, v ∈ V, and thus u, v must be mapped to distinct vectors in the projection of V onto I_1, from which property (i) follows.
In order to have property (ii), we observe that the same sample I_1 is a relative (δ/n, 1/4)-approximation for V with probability at least 3/4 (for an appropriate choice of the constant of proportionality). Given this property of I_1, any vector v ∈ V satisfies ‖v|_{I_1}‖/|I_1| ≤ ‖v‖/n + (1/4)·max{‖v‖/n, δ/n}. Since ‖v‖ ≤ k, and k ≥ δ/2 by assumption, it is easy to verify that ‖v|_{I_1}‖ = O(k m_1/n). Combining the two roles of I_1 (each holding with probability at least 3/4), it follows that I_1 is both a (δ/n)-net for D and a relative (δ/n, 1/4)-approximation for V with probability at least 1/2, and thus it satisfies properties (i)–(ii) with this probability. This completes the proof of the lemma.
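Spelled out, the length bound of property (ii) follows from the relative-approximation inequality together with ‖v‖ ≤ k and k ≥ δ/2 (so that δ/n ≤ 2k/n):

```latex
\frac{\bigl\| v|_{I_1} \bigr\|}{|I_1|}
  \;\le\; \frac{\|v\|}{n} + \frac{1}{4}\max\!\left\{\frac{\|v\|}{n}, \frac{\delta}{n}\right\}
  \;\le\; \frac{k}{n} + \frac{1}{4}\cdot\frac{2k}{n}
  \;=\; \frac{3k}{2n},
\qquad\text{hence}\qquad
\bigl\| v|_{I_1} \bigr\| = O\!\left(\frac{k\,m_1}{n}\right).
```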
We next apply Lemma 5 in order to bound |V|_{I_1}|. We first recall that the (d, d_1) Clarkson–Shor property of the primal shatter function of V is hereditary. Incorporating the bound on m_1 and property (ii), we conclude that |V|_{I_1}| = O(m_1^{d_1}(k m_1/n)^{d−d_1}), with a constant of proportionality that depends on d. Now, due to property (i), |V| ≤ |V|_{I_1}|; we thus conclude the bound stated in Corollary 6 below. Remark We note that the preliminary bound given in Corollary 6 is crucial for the analysis, as it constitutes the base for the iterative process described in Sect. 2.4. In fact, this step of the analysis alone bypasses our refinement of Haussler's approach, and instead exploits the approach of Dudley [7].

The Subsequent Iterations: Applying the Inductive Step
Let us now fix an iteration j ≥ 2. As noted above, we assume by induction on j that the bound |V| ≤ |V|^{(j−1)} holds after iteration j − 1. Let I_j be a subset of m_j − 1 indices, chosen uniformly at random without replacement from [n], with m_j given by (3). Let v ∈ V, and denote by v|_{I_j} its projection onto I_j. The expected length of v|_{I_j} is at most k(m_j − 1)/n. We next show: Lemma 7 For each v ∈ V, Prob[‖v|_{I_j}‖ > t·k(m_j − 1)/n] ≤ 2^{−t·k(m_j−1)/n}, where t ≥ 2e is a real parameter and e is the base of the natural logarithm.
Proof We first observe that the length of v|_{I_j} is a random variable with a hypergeometric distribution. Indeed, this is precisely the setting of uniformly choosing m_j − 1 elements at random (into our set I_j) from a given set of n elements without replacement, and then, for a given ‖v‖-element subset of the full set (recall that the length of v corresponds to the cardinality of an appropriate subset in the set system), asking how many of its elements have been chosen into I_j. Specifically, we have Exp[‖v|_{I_j}‖] = ‖v‖(m_j − 1)/n. Our goal is to show a Chernoff-type bound on the probability that ‖v|_{I_j}‖ deviates from its expectation. However, we face the difficulty that the corresponding indicator variables are not independent, and thus we cannot apply a Chernoff bound directly (see, e.g., [1]). Nevertheless, in our scenario a Chernoff bound is still applicable; this can be shown via various approaches, see, e.g., [2,23,25]. For the sake of completeness we describe the proof in detail, and rely on the analysis of Panconesi and Srinivasan [25], which implies that when the underlying indicator variables are "negatively correlated", one can still apply a Chernoff bound (see also [2]).
We enumerate the non-zero coordinates of v in an arbitrary order; let L = {l_1, …, l_{‖v‖}} be this set of indices (in this notation we ignore all the zero coordinates), and attach an indicator variable X_i to each index l_i ∈ L, defined to be one if and only if l_i ∈ I_j (in other words, the corresponding element in the underlying set induced by v has been chosen into the sample of the m_j - 1 elements). With this notation, ‖v|_{I_j}‖ is represented by the sum X = Σ_{i=1}^{‖v‖} X_i. The variables X_i are not independent due to our probabilistic model (that is, I_j is chosen without replacement); nevertheless, they are negatively correlated. This means that for each subset K ⊆ {1, …, ‖v‖} we have Prob[⋀_{i∈K} X_i = 1] ≤ Π_{i∈K} Prob[X_i = 1], as well as Prob[⋀_{i∈K} X_i = 0] ≤ Π_{i∈K} Prob[X_i = 0]. Indeed, following the considerations in [2], let us first show the former inequality. Put L_K := ⋃_{i∈K} {l_i}. Then Prob[⋀_{i∈K} X_i = 1] = Prob[L_K ⊆ I_j], and since I_j is uniformly chosen, in order to bound the latter we take the ratio between the number of subsets of size m_j - 1 that contain L_K and the total number of subsets of size m_j - 1 that can be chosen from an n-element set. Hence Prob[L_K ⊆ I_j] = (m_j - 1)(m_j - 2) ⋯ (m_j - |K|) / ( n(n - 1) ⋯ (n - |K| + 1) ), and the latter is smaller than ((m_j - 1)/n)^{|K|}, as is easily verified. Using similar arguments for the second correlation inequality, we obtain the claim. We are now ready to apply [25, Theorem 3.4], stating that if the indicator variables X_i are negatively correlated then (recall that X = Σ_{i=1}^{‖v‖} X_i)^6 Prob[X ≥ ρ·Exp[X]] ≤ (e^{ρ-1}/ρ^ρ)^{Exp[X]}, for any ρ > 1. In particular, when ρ ≥ 2e (where e is the base of the natural logarithm), the latter term is at most (e/ρ)^{ρ·Exp[X]} ≤ 2^{-ρ·Exp[X]}. Recall that we assumed ‖v‖ ≤ k, and thus Exp[X] = Exp[‖v|_{I_j}‖] ≤ k · (m_j - 1)/n. Thus, for any t ≥ 2e, setting ρ := t·k(m_j - 1)/(n·Exp[X]) ≥ t ≥ 2e, we obtain Prob[ ‖v|_{I_j}‖ ≥ t·k(m_j - 1)/n ] ≤ 2^{-ρ·Exp[X]} = 2^{-t·k(m_j-1)/n}, as asserted.
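The falling-factorial ratio above, and the product bound it implies, are easy to verify numerically. A minimal sketch (n = 100, m = 10 are arbitrary illustrative values, and `prob_subset_chosen` is our own helper), checking both the exact counting identity and the product bound underlying negative correlation:

```python
from math import comb

def prob_subset_chosen(n, m, size_k):
    """Exact P[L_K ⊆ I] for a uniformly random m-subset I of [n] and a
    fixed index set L_K with |K| = size_k: the falling-factorial ratio
    m(m-1)...(m-|K|+1) / ( n(n-1)...(n-|K|+1) )."""
    p = 1.0
    for i in range(size_k):
        p *= (m - i) / (n - i)
    return p

n, m = 100, 10
for size_k in range(1, m + 1):
    exact = prob_subset_chosen(n, m, size_k)
    # counting argument: m-subsets containing L_K over all m-subsets
    assert abs(exact - comb(n - size_k, m - size_k) / comb(n, m)) < 1e-15
    assert exact <= (m / n) ** size_k   # negative-correlation product bound
```

Each factor (m - i)/(n - i) is at most m/n, which is exactly why the product bound holds.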
^6 We note that in the original formulation in [25], one needs a set of independent random variables X̂_i that dominate the X_i; in the scenario of our problem, X̂_i is taken to be a Bernoulli indicator random variable, which takes value one with probability (m_j - 1)/n, in which case Exp[X] = Exp[X̂] = ‖v‖ · (m_j - 1)/n.

We now proceed as follows. Recall that we assume k ≥ δ/2, and by (3) we have m_j = O(d_0 n log^{(j)}(n/δ) / δ). Thus it follows from Lemma 7 that Prob[ ‖v|_{I_j}‖ > C·k(m_j - 1)/n ] ≤ 2^{-C·k(m_j-1)/n} ≤ 2^{-D·k·log^{(j)}(n/δ)/δ}, where C ≥ 2e is a sufficiently large constant, and D > d_0 is another constant whose choice depends on C and d_0, and can be made arbitrarily large. Since d_0 ≥ d we obviously have D > d. We next show:

Lemma 8 Under the assumption that k ≥ δ/2, we have, at any iteration j ≥ 2:

where |V|^{(l)} is the bound on |V| after the l-th iteration, and A > 0 is a constant that depends on d (and d_0) and on the constant of proportionality determined by the Clarkson-Shor property of V.
Proof We in fact show the corresponding bound on Exp_{I_j}[|V|_{I_j}|], and then exploit the relation |V| ≤ (d_0 + 1)·Exp_{I_j}[|V|_{I_j}|] (Inequality (4)) in order to prove Inequality (6). In order to obtain the first term in the bound on Exp_{I_j}[|V|_{I_j}|], we consider all vectors of length at most C·k(m_j - 1)/n (where C ≥ 2e is a sufficiently large constant as above) in the projection of V onto a subset I_j of m_j - 1 indices (in this part of the analysis I_j can be arbitrary). Since the primal shatter function of V has a (d, d_1) Clarkson-Shor property, which is hereditary, we obtain at most O( (m_j - 1)^{d_1} · (C·k(m_j - 1)/n)^{d-d_1} ) such vectors. It is easy to verify that the constant of proportionality A in the bound just obtained depends on d, d_0, and the constant of proportionality determined by the (d, d_1) Clarkson-Shor property of V (observe that A is at most (2(d_0 + 1))^d, since m_j depends linearly on d_0).
Next, in order to obtain the second term, we consider the vectors v ∈ V that are mapped to vectors v|_{I_j} ∈ V|_{I_j} with ‖v|_{I_j}‖ > C·k(m_j - 1)/n. By Inequality (5), the expected number of such vectors is at most 2^{-C·k(m_j-1)/n} · |V|^{(j-1)}, where |V|^{(j-1)} is the bound on |V| after the previous iteration j - 1. This completes the proof of the lemma.
Remark We note that the bound on Exp_{I_j}[|V|_{I_j}|] consists of the worst-case bound on the number of short vectors, of length at most C·k(m_j - 1)/n, obtained by the (d, d_1) Clarkson-Shor property, plus the expected number of long vectors.

Wrapping Up
We now complete the analysis and solve Inequality (6). Our initial assumption that δ < n/2^{d_0+1}, and the fact that D > d is sufficiently large, imply that the coefficient of the recursive term is smaller than 1, for any 2 ≤ j ≤ log*(n/δ) - log*(d_0 + 1). Then, using induction on j, one can verify the solution of the recurrence for any 2 ≤ j ≤ log*(n/δ) - log*(d_0 + 1). In particular, at the termination of the last iteration j* = log*(n/δ) - log*(d_0 + 1), we obtain the asserted bound |V| = O(n^{d_1} k^{d-d_1} / δ^d), with a constant of proportionality that depends exponentially on d (and d_0). Specifically, it is easy to verify that, due to our choice of j* and the fact that A is at most (2(d_0 + 1))^d, the resulting constant of proportionality in the bound on |V| remains exponential in d. This at last completes the proof of Theorem 4.

Second Proof: Refining Chazelle's Approach
In this section, we shall prove a size-sensitive version of Haussler's upper bound for a δ-separated set of vectors with bounded primal shatter dimension, building on Chazelle's presentation of Haussler's proof, as described in [21]. By Haussler's result [12], we know that the packing number is bounded; we write this bound as g · n^{d_1} k^{d-d_1} / δ^d, for an unknown "extra" factor g = g(n, k, δ). We shall show that the optimal (up to constants) upper bound for g is in fact a constant c*, where c* is independent of n, k and δ.

Overview of the Second Approach
Consider a preliminary attempt to extend Chazelle's proof to shallow packings. Like Chazelle, one chooses a random subsequence of indices I of size roughly n/δ, and estimates the number of projections onto I of δ-separated vectors of length at most k. Suppose that the projection of every vector in the δ-separated system V onto I were nearly equal to its expected length k/δ. By the definition of the (d, d_1) Clarkson-Shor property, the number of projections onto the subsequence I would then be bounded by c·(n/δ)^{d_1}·(k/δ)^{d-d_1}, for some constant c, and Chazelle's proof would go through in a straightforward manner. However, for a given vector, its projection onto I can be much larger than expected. To handle this, we choose I in such a way that the number of such "bad" vectors is at most a constant times their expected number. Also, we treat the unknown "extra" factor g (at most the Haussler packing bound divided by n^{d_1} k^{d-d_1} / δ^d) as a function of n, k, δ, whose value we show to be bounded by a constant independent of n, k, and δ (see Inequality (12) and the succeeding paragraph). This allows us to obtain the final bound in a single iteration.

Details of the Proof
Before giving the details of the second proof, we need the definition of the unit distance graph of a family of vectors, which plays a central role in the proof of the theorem.
Proof Consider a random subsequence of indices I = (i_1, …, i_s), obtained by selecting each i ∈ [n] independently with probability p = 11d_0K/δ, and then taking a random permutation of the selected indices. Here K = K(n, k, δ) ≥ 4 is a parameter to be fixed later. Define V_1 := V|_I, and consider the unit distance graph UD(V_1). For each vector v_1 ∈ V_1, define the weight w(v_1) of v_1 as the number of vectors of V that project onto v_1. We now need the following lemma, whose proof we include from Matoušek's book [21], in order to make the exposition self-contained.

Now define the weight of an edge {v_1, v_1′} of UD(V_1) as w({v_1, v_1′}) := min{w(v_1), w(v_1′)}, and let W denote the total weight of the edges of UD(V_1).
Proof The proof is based on the following lemma, proved by Haussler [12] for set systems with bounded VC-dimension; the following version appears in [21]:

Lemma 11 [12] Let V ⊆ {0, 1}^n be a set of indicator vectors with VC-dimension d_0. Then the unit-distance graph UD(V) has at most d_0|V| edges.
Since the VC-dimension of V_1 is bounded by d_0, by the hereditary property of VC-dimension, the lemma implies that there exists a vertex v_1 ∈ V_1 whose degree is at most 2d_0. Removing v_1, the total vertex weight is reduced by w(v_1), and the total edge weight is reduced by at most 2d_0·w(v_1), since each edge incident to v_1 has weight at most w(v_1). Continuing this process until all vertices are removed, we obtain the lemma.
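The removal argument above can be exercised on a toy family. The sketch below is entirely our own illustration (the 4-cube with d_0 = 4 is an arbitrary choice, and edge weights are taken as the minimum of the endpoint weights): it repeatedly peels a minimum-degree vertex and checks the resulting bound of 2·d_0 times the total vertex weight.

```python
from itertools import combinations
import random

def unit_distance_edges(vectors):
    """All pairs of 0/1 vectors at Hamming distance exactly one."""
    return [(u, v) for u, v in combinations(range(len(vectors)), 2)
            if sum(a != b for a, b in zip(vectors[u], vectors[v])) == 1]

def peeled_edge_weight(vectors, w, d0):
    """Mimic the lemma's proof: repeatedly remove a minimum-degree
    vertex v (its degree stays <= 2*d0, by Haussler's edge bound applied
    hereditarily); every incident edge has weight min(...) <= w[v]."""
    adj = {i: set() for i in range(len(vectors))}
    for u, v in unit_distance_edges(vectors):
        adj[u].add(v)
        adj[v].add(u)
    alive = set(adj)
    total = 0
    while alive:
        v = min(alive, key=lambda x: len(adj[x]))
        assert len(adj[v]) <= 2 * d0          # degree bound from Lemma 11
        total += sum(min(w[v], w[u]) for u in adj[v])
        for u in adj[v]:
            adj[u].discard(v)
        alive.discard(v)
    return total

# toy family: all vectors of the 4-dimensional cube (VC-dimension 4)
vecs = [tuple((i >> j) & 1 for j in range(4)) for i in range(16)]
random.seed(0)
weights = [random.randint(1, 10) for _ in vecs]
W = peeled_edge_weight(vecs, weights, d0=4)
assert W == sum(min(weights[u], weights[v]) for u, v in unit_distance_edges(vecs))
assert W <= 2 * 4 * sum(weights)              # the lemma's conclusion
```

Each edge is counted exactly once, when its first endpoint is removed, so the peeled total equals the direct sum of edge weights.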
Next, we shall prove a lower bound on the expectation Exp[W]. Let W_1 denote the total weight of the edges {v_1, v_1′} of UD(V_1) whose endpoints differ in the coordinate i_s; we need to lower bound Exp[W_1]. Given I, let Y denote the number of vectors in V whose length after projecting onto I is more than ekp (where e is the base of the natural logarithm); the last inequality follows from the bounds established above. Partition V into equivalence classes V_1, …, V_r according to their projection onto I. Here, c is the constant, independent of n and δ, occurring in the definition of the primal shatter dimension. Define B ⊂ [r] to be the set of indices of the classes whose common projection onto I has length more than ekp, and let G := [r] \ B. Since Nice holds, we have: We will estimate the contribution of the classes in G to the expected weight Exp[W_2 | I], that is, the contribution of the edges of UD(V_1) that have both of their endpoints in G. Consider a class V_i such that i ∈ G. Let V_i^1 ⊂ V_i be the set of those vectors in V_i with 1 in the i_s-th coordinate, let V_i^0 := V_i \ V_i^1, and put b_1 := |V_i^1|, b_2 := |V_i^0|. Then the edge {v, v′} ∈ E_1 formed by the projection of V_i onto I has weight min{b_1, b_2}. Observe that in Inequality (10) the expected contribution of (v, v′) to b_1·b_2 is at least δ/n, and this implies: where the inequality in the second line comes from Inequality (10). Hence, the expected contribution of G to the weight W_2 is: Observe that by the (d, d_1) Clarkson-Shor property and the fact that |I| < 3np/2, we have: where C is a constant. Substituting the bound on |G| into Inequality (11), and using the fact that Σ_{j∈B} |V_j| ≤ 8·Exp[Y] (see Inequality (9)), we get: Since the above holds for each I that satisfies Nice, we conclude that: Using Inequality (8), comparing with the upper bound on W, substituting the lower bound on Exp[W_2], and solving for M(δ, k, V), we get: Next, the following lemma connects the parameters K and g: Substituting the choice of K and the value of Exp[Y] from Lemma 12, we get that (where in the second step we used that 1 - 16/(33K) ≥ 1/2, and hence C_2 is a positive constant):
The last inequality above implies that g^d ≤ C_2 · max{4, 2d·log g/(11d_0)}^d, or, with C_3 := C_2^{1/d}, that g ≤ C_3 · max{4, 2d·log g/(11d_0)}. We claim that the above implies that g, and hence K, must be bounded from above by a constant independent of n, k, δ. If g is not increasing with n, k, and δ, then g is bounded, and we are done. Otherwise, for any g = g(n, k, δ) growing with n, k, or δ, we have g ≤ C_4·log g (where C_4 = 2d·C_3/(11d_0)) for sufficiently large n, k or δ. This inequality can only be satisfied when g is bounded by a constant function of n, k and δ, i.e., g(n, k, δ) ≤ c*, where c* is independent of n, k and δ. Thus, in either case, g = g(n, k, δ) is bounded by an absolute constant. Therefore, it suffices to take K ≥ 2d·log c*/(11d_0), where c* is the greatest positive value satisfying Inequality (12). The constants C_1, C_2 are of the order of d^{d_0}. Therefore the asserted bound follows. It only remains to prove Lemma 12.

Proof of Lemma 12 The proof follows easily from Chernoff bounds. Fix v ∈ V, and let Z := ‖v|_I‖; since each index is selected into I independently with probability p, Z is a binomial random variable. Let Ẑ be a binomial random variable with parameters k and p; since ‖v‖ ≤ k, Ẑ stochastically dominates Z. The relevant deviation ratio here is at least e, and, as can be verified by elementary calculus, for any β ≥ e, log β - 1 + 1/β ≥ 1/e. Hence the expected number Exp[Y] of vectors whose projections onto I have norm at least ekp = 11e·d_0·K·k/δ is at most the asserted bound, for K ≥ 2d·log g/(11d_0).
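The self-bounding step, namely that an inequality of the form g ≤ C_4·log g forces g below a fixed constant, can be checked numerically. A minimal sketch (C_4 = 10 is an arbitrary illustrative value standing in for 2d·C_3/(11d_0)):

```python
import math

C4 = 10.0     # stands in for 2*d*C_3/(11*d_0); arbitrary illustrative value
g = 2.0
# march upward while the self-bounding inequality g <= C4 * log(g) holds
while g <= C4 * math.log(g):
    g *= 1.001
# g now sits just past the crossover: the inequality fails for all larger g,
# since g grows linearly while C4 * log(g) grows only logarithmically,
# so any solution is bounded by a constant depending on C4 alone
assert 30 < g < 40               # crossover of g = 10*ln(g) is around 35.8
assert g > C4 * math.log(g)      # the inequality indeed fails from here on
```

The scan confirms that once the linear side overtakes the logarithmic side, it stays ahead, which is exactly why c* exists.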

Realization to Geometric Set Systems
We now incorporate the random sampling technique of Clarkson and Shor [6] with Theorem 4 in order to conclude that small shallow packings exist in several useful scenarios. This includes the case where V represents:

(i) A collection of halfspaces defined over an n-point set in d-space (that is, each set in the set system is the subset of points contained in a halfspace). In this case, for any integer parameter 0 ≤ k ≤ n, the number of halfspaces that contain at most k points is O(n^{⌊d/2⌋} k^{⌈d/2⌉}), and thus the primal shatter function has a (d, ⌊d/2⌋) Clarkson-Shor property.

(ii) A collection of balls defined over an n-point set in d-space. Here, the number of balls that contain at most k points is O(n^{⌊(d+1)/2⌋} k^{⌈(d+1)/2⌉}), and therefore the primal shatter function has a (d + 1, ⌊(d+1)/2⌋) Clarkson-Shor property.

(iii) A collection of parallel slabs (that is, each of these regions is enclosed between two parallel hyperplanes and has an arbitrary width), defined over an n-point set in d-space. The number of slabs that contain at most k points is O(n^d k).

(iv) A dual set system of points in d-space and a collection F of n (d - 1)-variate (not necessarily continuous or totally defined) functions of constant description complexity. Specifically, the graph of each function is a semi-algebraic set in R^d defined by a constant number of polynomial equalities and inequalities of constant maximum degree (see [28, Chapter 7] for a detailed description of these properties, which we omit here). In this case, V is represented by the cells (of all dimensions) in the arrangement of the graphs of the functions in F (see Sect. 1 for the definition) that lie below at most k function graphs. This portion of the arrangement is also referred to as the at most k-level, and its combinatorial complexity is O(n^{d-1+ε} k^{1-ε}), for any ε > 0, where the constant of proportionality depends on d and ε. Thus the primal shatter function has a (d, d - 1 + ε) Clarkson-Shor property.
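As a toy illustration of the Clarkson-Shor property itself (our own one-dimensional example, not among (i)-(iv)): for intervals over n collinear points, the primal shatter function is O(n^2), yet the number of distinct nonempty subsets of size at most k is only O(nk), i.e., a (2, 1) Clarkson-Shor property:

```python
def intervals_of_size_at_most(n, k):
    """Distinct subsets {i, ..., j} of n collinear points with j - i + 1 <= k."""
    return {frozenset(range(i, j + 1))
            for i in range(n) for j in range(i, min(i + k, n))}

n, k = 50, 7
small = intervals_of_size_at_most(n, k)
# exact count: sum over sizes s = 1..k of the n - s + 1 starting positions
assert len(small) == sum(n - s + 1 for s in range(1, k + 1))
assert len(small) <= n * k        # the (2, 1) Clarkson-Shor bound O(n * k)
```

The point of the definition is precisely this gap: the full shatter function is quadratic in n, but restricting to shallow (small) sets trades one factor of n for a factor of k.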
All bounds presented in (i)-(iv) are well known in computational geometry; we refer the reader to [6,22,28] for further details. We thus conclude:

Corollary 13

Let V ⊆ {0, 1}^n be a set of indicator vectors representing a set system of halfspaces defined over an n-point set in d-space, and let δ, k be two integer parameters as in Theorem 4. Then: where the constant of proportionality depends on d.

Corollary 14
Let V ⊆ {0, 1}^n be a set of indicator vectors representing a set system of balls defined over an n-point set in d-space, and let δ, k be two integer parameters as in Theorem 4. Then: where the constant of proportionality depends on d.

Corollary 15
Let V ⊆ {0, 1}^n be a set of indicator vectors representing a set system of parallel slabs defined over an n-point set in d-space, and let δ, k be two integer parameters as in Theorem 4. Then: where the constant of proportionality depends on d.
Corollary 16

Let V ⊆ {0, 1}^n be a set of indicator vectors representing a dual set system of n function graphs as in scenario (iv), and let δ, k be two integer parameters as in Theorem 4. Then: where the constant of proportionality depends on d and on ε.

Size-Sensitive Discrepancy Bounds
Suppose we are given a set system defined over an n-point set X (for simplicity of exposition, we refer to it as a collection of subsets of X rather than a collection of indicator vectors defined over the unit cube). We now wish to color the points of X with two colors, such that in each set of the system the deviation from an even split is as small as possible.
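In other words, writing the coloring as χ: X → {-1, +1}, the quantity to minimize is the maximum of |Σ_{x∈S} χ(x)| over the sets S. A minimal sketch (the tiny set system below is an arbitrary example of ours):

```python
def discrepancy(sets, chi):
    """Maximum deviation from an even split over all sets, for a
    coloring chi mapping each point to +1 or -1."""
    return max(abs(sum(chi[x] for x in s)) for s in sets)

system = [{0, 1}, {1, 2}, {2, 3}, {0, 1, 2, 3}]
chi = {0: +1, 1: -1, 2: +1, 3: -1}    # alternate colors along the line
assert discrepancy(system, chi) == 0  # every set here splits evenly
```

For this contrived system an alternating coloring achieves discrepancy zero; the theorems below bound how small the discrepancy can be made for geometric systems, as a function of the set sizes.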
In a previous work [9], the second author presented size-sensitive discrepancy bounds for set systems of halfspaces defined over n points in d-space. These bounds were achieved by combining the entropy method [18] with δ-packings, and, as observed in [9], they are optimal up to a polylogarithmic factor. Incorporating our bound in Theorem 4 into the analysis in [9], the discrepancy bounds improve by a logarithmic factor. Specifically, we obtain (see Appendix 1 for the proof):

Theorem 17 Let a set system defined over an n-point set be given, whose primal shatter function has a (d, d_1) Clarkson-Shor property. Then there is a two-coloring χ such that for each set S of the system: Such a two-coloring χ can be computed in expected polynomial time.

Concluding Remarks
The analogy between shallow packings for dual set systems and shallow cuttings [19] may cause one to interpret shallow packings as the "primal version" of shallow cuttings. In this paper we presented discrepancy applications for shallow packings. We hope to find additional applications in geometry and beyond.
A key property in the analysis of Haussler [12] lies in the density of a unit distance graph G = (V, E), defined over V, whose edges correspond to all pairs u, v ∈ V whose symmetric difference distance is (precisely) 1; in other words, u, v appear as neighbors on the unit cube {0, 1}^n. It has been shown by Haussler et al. [13] that the density of G is bounded by the VC-dimension of V, that is, |E|/|V| ≤ d_0; see also [12] for an alternative proof using the technique of "shifting". (We cannot guarantee such a relation when the VC-dimension d_0 is replaced by the primal shatter dimension d, and therefore we proceed with the analysis using this ratio.) This low-density property is then exploited in order to show that once we have chosen (n - 1) coordinates of the random variable V, the variance in the choice of the remaining coordinate is relatively small. As observed in [12], Lemma 18 continues to hold for any restriction of V to a sequence I = (i_1, …, i_m) of m ≤ n indices. Indeed, when projecting V onto I, the VC-dimension of the resulting set system remains at most d_0. Furthermore, the conditional variance is now defined w.r.t. the induced probability distribution on V|_I in the obvious way, where the probability to obtain a sequence of m values corresponds to an appropriate marginal distribution. With this observation, we can rewrite the inequality stated in Lemma 18 accordingly. (The distribution, however, may not remain uniform after the projection of V onto a proper subsequence I = (i_1, …, i_m) of m < n indices, as several vectors in V may be projected onto the same vector in V|_I.)
If I is a sequence chosen uniformly at random (over all such m-tuples), then, averaging over all choices of I, we clearly obtain the corresponding inequality by linearity of expectation. In fact, by symmetry of the random variables V_{i_j} (recall that I is a random m-tuple), each of the summands in the above inequality has an equal contribution, and thus, in particular (recall once again that the expectation is taken over all choices of I = (i_1, …, i_m)): where we write Exp_I[·] to emphasize the fact that the expectation is taken over all choices of I. The above bound is now integrated with the next key property: where the conditional variance is taken w.r.t. the distribution P, and the expectation is taken w.r.t. the random choice of i_m.
We now observe that when the entire sequence I = (i_1, …, i_m) is chosen uniformly at random, the bound in Lemma 19 continues to hold when averaging over the entire sequence I (rather than just over i_m); that is, we have:

Note The analysis of Haussler [12] then proceeds as follows. We assume that V is δ-separated as in Lemma 19 (so that a bound on |V| is the actual bound on the packing number), and then choose the parameters of the analysis accordingly.

The derivation of the discrepancy bound for the original sets S of the set system is straightforward by the analysis in [9], in which S is represented by the disjoint union of symmetric differences of pairs of canonical sets.
Assume without loss of generality that log n is an integer, and let k := log n. By Theorem 4 and the analysis in [9], we have, for each i = 1, …, k and j = i - 1, …, k: where C > 0 is an appropriate constant as stated in Theorem 4. By the construction in [9], each set F_{ij} ∈ F_{ij} satisfies |F_{ij}| = O(n/2^{i-1}), for a fixed index i and any j = i - 1, …, k.
Our discrepancy parameter for the pair (i, j) is chosen as follows: where B > 5 + log C is an appropriate constant, and A > 0 is a sufficiently large constant of proportionality, whose choice depends on B and will be determined shortly (note that all three constants A, B, and C depend on d).
In order to apply the constructive discrepancy minimization technique of Lovett and Meka [18], we need to show:

Lemma 20 Put s_j := n/2^{j-1}. The choice in (17), for A > 0 sufficiently large (whose choice depends on C and thus on d), satisfies: We next proceed almost verbatim as in [9]. We first note that at j = j_0 the above exponent becomes a constant, whereas for the bound C·2^{(d-d_1)(i-1)} at j = j_0 we obtain: as asserted. We now fix an index i, split the summation into the two parts j ≥ j_0 and i - 1 ≤ j < j_0, and then bound each part in turn. In the first part the exponent "takes over" the summation, in the sense that it decreases superexponentially, making the other factors (with j > j_0) insignificant, and in the second part the packing size decreases geometrically. Thus the "peak" of this summation is attained at j = j_0, and the terms decrease as we go above or below j_0.
For the first part, put j := j_0 + l, for an integer l ≥ 0; then: where the logarithmic factor is now eliminated due to the summation over i. The exponents in the above sum decrease superexponentially. Choosing A sufficiently large (say, A > 2^{6+(B+1)+log d}) and having B > 5 + log C as above, we can guarantee that the latter sum is strictly smaller than n/32. When j < j_0, put j := j_0 - l, with l > 0 as above. We now obtain, by bounding the exponent from above by 1 and using similar considerations as above: Once again, our choice of B guarantees that the above (geometrically decreasing) sum is strictly smaller than n/32. Thus the entire summation is bounded by n/16, as asserted.
This completes the proof of Theorem 17.