Restricted Isometry Property for General p-Norms

The restricted isometry property (RIP) is a fundamental property of a matrix that enables sparse recovery. Informally, an $m \times n$ matrix satisfies RIP of order $k$ for the $\ell_p$ norm if $\|Ax\|_p \approx \|x\|_p$ for every vector $x$ with at most $k$ non-zero coordinates. For every $1 \le p < \infty$, we obtain almost tight bounds on the minimum number of rows $m$ necessary for the RIP property to hold. Prior to this paper, only the cases $p = 1$, $p = 1 + 1/\log n$, and $p = 2$ were studied. Interestingly, our results show that the case $p = 2$ is a singularity point: the optimal number of rows $m$ is $\tilde\Theta(k^p)$ for all $p \in [1, \infty) \setminus \{2\}$, as opposed to $\tilde\Theta(k)$ for $p = 2$. We also obtain almost tight bounds for the column sparsity of RIP matrices and discuss the implications of our results for the stable sparse recovery problem as defined by Candès et al.


Introduction

In this paper, we study the Restricted Isometry Property for the $\ell_p$-norm (RIP-$p$).
Informally speaking, we are interested in a linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$ with $m \ll n$ that approximately preserves $\ell_p$-norms for all vectors that have only a few non-zero coordinates.

More precisely, an $m \times n$ matrix $A \in \mathbb{R}^{m \times n}$ is said to have the $(k, D)$-RIP-$p$ property for sparsity $k \in [n] \stackrel{\text{def}}{=} \{1, \ldots, n\}$, distortion $D > 1$, and the $\ell_p$-norm for $p \in [1, \infty)$, if for every vector $x \in \mathbb{R}^n$ with at most $k$ non-zero coordinates it satisfies
$$\|x\|_p \le \|Ax\|_p \le D \cdot \|x\|_p.$$
In this work we investigate the following question: fixing $p \in [1, \infty)$, $n \in \mathbb{N}$, $k \in [n]$, and $D > 1$, what is the smallest $m \in \mathbb{N}$ for which there exists a $(k, D)$-RIP-$p$ matrix $A \in \mathbb{R}^{m \times n}$? And, at the same time, can such a matrix $A$ be non-trivially sparse?
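For intuition, the definition can be tested directly on small instances. The following brute-force randomized checker is our own illustration (the function name is ours, not from the paper); it enumerates all supports, so it is feasible only for tiny $n$ and $k$, and a positive answer is statistical evidence rather than a certificate:

```python
import itertools
import numpy as np

def is_rip_p(A, k, D, p, trials=100, seed=0):
    """Randomized sanity check of the (k, D)-RIP-p condition
    ||x||_p <= ||Ax||_p <= D * ||x||_p for k-sparse x.
    Enumerates all (n choose k) supports and tries random Gaussian
    sign patterns on each -- exact verification is intractable."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    for support in itertools.combinations(range(n), k):
        for _ in range(trials):
            x = np.zeros(n)
            x[list(support)] = rng.standard_normal(k)
            xp = np.linalg.norm(x, ord=p)
            Axp = np.linalg.norm(A @ x, ord=p)
            if Axp < xp - 1e-9 or Axp > D * xp + 1e-9:
                return False
    return True
```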


Motivation

Why are RIP matrices useful? RIP-2 matrices were introduced by Candès and Tao [CT05] in order to decode an input vector $f$ from corrupted linear measurements $Bf + e$, under the assumption that $e$ is sufficiently sparse (has only a few non-zero entries). Later, Candès, Romberg and Tao [CRT06] used RIP-2 matrices to solve the stable sparse recovery problem, which has since found numerous applications in areas such as compressive sensing of signals [CRT06, Don06], genetic data analysis [KBG+10], and data stream algorithms [Mut05, GI10].

Informally speaking, in the stable sparse recovery problem, the input signal $x \in \mathbb{R}^n$ is assumed to be close to $k$-sparse, that is, to have most of its mass concentrated on $k$ coordinates. The goal is to design a set of $m$ linear measurements of $x$, representable as a single $m \times n$ matrix $A$, such that, given the sketch $y = Ax \in \mathbb{R}^m$, one can 'approximately' recover $x$. Formally, the recovered vector $\hat x \in \mathbb{R}^n$ is required to satisfy
$$\|\hat x - x\|_p \le C \cdot \min_{k\text{-sparse } x^*} \|x - x^*\|_q \qquad (1.1)$$
for some $C > 0$, $p, q \in [1, \infty)$, and $k \in [n]$. We refer to (1.1) as the $\ell_p/\ell_q$ guarantee. The parameters of interest include the number of measurements $m$, the approximation factor $C$, and the complexity of the recovery procedure. Ideally, we want $m$ to be not much larger than $k$.
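The benchmark on the right-hand side of (1.1) is easy to compute exactly: the minimizing $k$-sparse $x^*$ keeps the $k$ largest-magnitude coordinates of $x$, so the minimum is the $\ell_q$ norm of the remaining tail. A minimal helper (our own illustration, with a name of our choosing):

```python
import numpy as np

def best_k_sparse_tail(x, k, q):
    """The benchmark on the right-hand side of (1.1): the minimum of
    ||x - x*||_q over k-sparse x*, attained by keeping the k largest-
    magnitude coordinates of x; the optimum equals the l_q norm of
    the remaining 'tail' of x."""
    tail = np.sort(np.abs(x))[:-k] if k > 0 else np.abs(x)
    return np.linalg.norm(tail, ord=q)
```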

Candès, Romberg and Tao [CRT06] proved that if $A$ is $(O(k), 1+\varepsilon)$-RIP-2 for a sufficiently small $\varepsilon > 0$, then one can achieve the $\ell_2/\ell_1$ guarantee with $C = O(k^{-1/2})$ in polynomial time.

The construction of RIP-1 matrices was first studied by Berinde et al. [BGI+08]. It is a folklore result (formally recorded in [IR13]) that if $A$ is $(O(k), 1+\varepsilon)$-RIP-1 for a sufficiently small $\varepsilon > 0$, then one can achieve the $\ell_1/\ell_1$ guarantee with $C = O(1)$.
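The decoder behind such guarantees is $\ell_1$-minimization (basis pursuit): minimize $\|\hat x\|_1$ subject to $A\hat x = y$, which is a linear program. Below is a minimal sketch using `scipy.optimize.linprog`; the LP encoding is standard, but the concrete solver setup is our own illustration rather than the paper's implementation:

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    """Recover x from y = Ax by l1 minimization:
        min ||x||_1  subject to  Ax = y,
    encoded as an LP over (x, u) with -u <= x <= u."""
    m, n = A.shape
    c = np.concatenate([np.zeros(n), np.ones(n)])   # minimize sum(u)
    I = np.eye(n)
    A_ub = np.block([[I, -I], [-I, -I]])            # x - u <= 0, -x - u <= 0
    b_ub = np.zeros(2 * n)
    A_eq = np.hstack([A, np.zeros((m, n))])         # Ax = y
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=[(None, None)] * n + [(0, None)] * n)
    assert res.success
    return res.x[:n]
```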

Known constructions and limitations. Candès and Tao [CT05] proved that for every $\varepsilon > 0$, a matrix with $m = O(k \log(n/k)/\varepsilon^2)$ rows and $n$ columns whose entries are sampled from i.i.d. Gaussians is $(k, 1+\varepsilon)$-RIP-2 with high probability. Later, a simpler proof of the same result was discovered by Baraniuk et al. [BDDW08].1 Berinde et al. [BGI+08] showed that a (scaled) random sparse binary matrix with $m = O(k \log(n/k)/\varepsilon^2)$ rows is $(k, 1+\varepsilon)$-RIP-1 with high probability.

Since the number of measurements is very important in practice, it is natural to ask how optimal the dimension bound $m = O(k \log(n/k))$ achieved by the above constructions is. The results of [DIPW10] and Candès [Can08] imply the lower bound $m = \Omega(k \log(n/k))$ for $(k, 1+\varepsilon)$-RIP-$p$ matrices for $p \in \{1, 2\}$, provided that $\varepsilon > 0$ is sufficiently small.

Another important parameter of a measurement matrix $A$ is its column sparsity: the maximum number of non-zero entries in a single column of $A$. If $A$ has column sparsity $d$, then we can perform the multiplication $x \mapsto Ax$ in time $O(nd)$, as opposed to the naive $O(nm)$ bound. Moreover, for sparse matrices $A$, one can maintain the sketch $y = Ax$ very efficiently under updates to $x$.2 The aforementioned constructions of RIP matrices exhibit very different behavior with respect to column sparsity: RIP-2 matrices obtained from random Gaussian matrices are obviously dense, whereas the construction of RIP-1 matrices of Berinde et al. [BGI+08] gives very small column sparsity $d = O(\log(n/k)/\varepsilon)$. It is known that both sparsities are essentially tight.3

Another notable difference between RIP-1 and RIP-2 matrices is the following. The construction of Berinde et al. [BGI+08] provides RIP-1 matrices with non-negative entries, whereas Chandar [Cha10] proved that any RIP-2 matrix with non-negative entries must have $m = \Omega(k^2)$. In other words, negative signs are crucial in the construction of RIP-2 matrices, but not in the RIP-1 case.

In sum. Motivated by these discrepancies between the optimal constructions of RIP-$p$ matrices for $p \in \{1, 1 + \frac{1}{\log k}, 2\}$, we initiate the study of RIP-$p$ matrices for general $p \in [1, \infty)$. Bearing in mind that the upper bound $m = O(k \log(n/k)/\varepsilon^2)$ holds for $(k, 1+\varepsilon)$-RIP-$p$ for $p \in \{1, 1 + \frac{1}{\log k}, 2\}$, it would be natural to conjecture that the same bound holds at least for every $p \in (1, 2)$. As we will see, surprisingly, this conjecture is very far from being true. Another reason to study RIP-$p$ matrices for general $p$ is the potential for new applications to sparse recovery; indeed, we obtain new results in this direction.


Our results

Upper bounds. On the positive side, for all $\varepsilon > 0$ and all $p \in (1, \infty)$, we construct $(k, 1+\varepsilon)$-RIP-$p$ matrices with $m = \tilde O(k^p)$ rows. Here, we use the $\tilde O(\cdot)$-notation to hide factors that depend on $\varepsilon$ and $p$ and are polynomial in $\log n$. More precisely, we show that a (scaled) random sparse 0/1 matrix with $\tilde O(k^p)$ rows and column sparsity $\tilde O(k^{p-1})$ has the desired RIP property with high probability.
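A sketch of this construction follows, with illustrative constants and log factors of our own choosing (the tuned values appear in Theorems 2.2 and 3.1, not in this snippet):

```python
import numpy as np

def sparse_rip_candidate(n, k, p, eps, seed=0):
    """Sample the paper's candidate RIP-p matrix: each column gets
    d ~ k^(p-1) * polylog(n) entries equal to d^(-1/p), placed in
    rows chosen uniformly without replacement among m ~ k^p rows.
    The constants here are placeholders, not the theorem's values."""
    rng = np.random.default_rng(seed)
    logn = np.log(n)
    d = max(1, int(k ** (p - 1) * logn ** (p - 1) / eps))
    m = max(d, int(k ** p * logn ** (p - 1) / eps ** 2))
    A = np.zeros((m, n))
    for j in range(n):
        A[rng.choice(m, size=d, replace=False), j] = d ** (-1.0 / p)
    return A
```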

This construction essentially matches that of Berinde et al. [BGI+08] when $p$ approaches 1. At the same time, when $p = 2$, our result matches known constructions of non-negative RIP-2 matrices based on the incoherence argument.4

Lower bounds. Though the number of rows $m$ in our construction is clearly suboptimal for $p = 2$, surprisingly, we show that our upper bounds are almost optimal, both in terms of the dimension $m$ and the column sparsity $d$, for every constant $p \in (1, \infty)$ except 2! More formally, on the dimension side, for every $p \in (1, \infty) \setminus \{2\}$, distortion $D > 1$, and $(k, D)$-RIP-$p$ matrix $A \in \mathbb{R}^{m \times n}$, we show that $m = \Omega(k^p)$, where $\Omega(\cdot)$ hides factors that depend on $p$ and $D$. Note that it is not hard to extend an argument of Chandar [Cha10] and obtain a lower bound of $m = \Omega(k^{p-1})$.5 This additional factor of $k$ is exactly what makes our lower bound non-trivial and tight for $p \in (1, \infty) \setminus \{2\}$, and thus enables us to conclude that $p = 2$ is a 'singularity'.

As for the column sparsity, we present a simple extension of the argument of Chandar [Cha10] and prove that for every $p \in [1, \infty)$, any $(k, D)$-RIP-$p$ matrix must have column sparsity $d = \Omega(k^{p-1})$.

Implications for sparse recovery. Finally, we extend the results of Candès, Romberg and Tao [CRT06] and Candès [Can08] and show that, for every $1 \le q < p$, RIP-$p$ matrices provide stable sparse recovery with the $\ell_q/\ell_1$ guarantee, $m = \tilde O(k^p)$ measurements (and $d = \tilde O(k^{p-1})$ column sparsity), and approximation factor $C = O(k^{-1+1/q})$, in polynomial time. These extensions are quite straightforward and seem to be folklore but, to the best of our knowledge, they have not been recorded before. In particular:

• Using RIP-$p$ matrices with $1 < p < 2$, we obtain a tradeoff between the recovery quality and the column sparsity of a measurement matrix. (In this regime, the column sparsity can be less than $O(k)$, which is what one would get for $p \in \{1, 2\}$.)

• Using RIP-$p$ matrices with $2 < p < \infty$, we provide a recovery guarantee stronger than $\ell_2/\ell_1$, at the cost of increasing the column sparsity from $d = \tilde O(k)$ to $d = \tilde O(k^{p-1})$ and the total number of measurements from $m = \tilde O(k)$ to $m = \tilde O(k^p)$.

We provide the proof for the sake of completeness in Appendix A.


Overview of our proofs

Upper bounds. We construct RIP-$p$ matrices as follows. Beginning with a zero matrix $A$ with $m = \tilde O(k^p)$ rows and $n$ columns, independently for each column of $A$, we choose $d = \tilde O(k^{p-1})$ out of the $m$ entries uniformly at random (without replacement), and assign the value $d^{-1/p}$ to the selected entries. For this construction, we have two very different proofs of its correctness: one works only for $p \ge 2$, and the other works only for $1 < p < 2$.

For $p \ge 2$, we reduce the problem to a probabilistic question similar in spirit to the following "balls and bins" question. Consider $n$ bins in which we throw $n$ balls uniformly and independently. As a result, we get $n$ numbers $X_1, X_2, \ldots, X_n$, where $X_i$ is the number of balls falling into the $i$-th bin. We would like to upper bound the tail $\Pr\big[S \ge 1000 \cdot \mathbb{E}[S]\big]$ for the random variable $S = \sum_{i=1}^{n} X_i^{p-1}$.
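This tail is easy to explore by simulation. The sketch below is our own illustration, not the paper's analysis; with the factor 1000 from the text the empirical frequency is essentially always zero, so a smaller factor `c` is more informative in experiments:

```python
import numpy as np

def balls_in_bins_tail(n, p, c=10.0, trials=20000, seed=0):
    """Monte Carlo estimate of Pr[S >= c * E[S]] for
    S = sum_i X_i^(p-1), where X_i is the number of balls in bin i
    after throwing n balls into n bins uniformly at random."""
    rng = np.random.default_rng(seed)
    samples = np.empty(trials)
    for t in range(trials):
        counts = np.bincount(rng.integers(0, n, size=n), minlength=n)
        samples[t] = (counts.astype(float) ** (p - 1)).sum()
    return (samples >= c * samples.mean()).mean()
```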
The first challenge is that the $X_i$'s are not independent; fortunately, they are negatively associated, a notion introduced by Joag-Dev and Proschan [JDP83]. The second problem is that the random variables $X_i^{p-1}$ are heavy-tailed: they have tails of the form $\Pr[X_i^{p-1} \ge t] \approx \exp(-t^{1/(p-1)})$, so the standard technique of bounding the moment-generating function does not work. Instead, we bound the high moments of $S$ directly, which introduces certain technical challenges. Sums of heavy-tailed random variables were thoroughly studied by Nagaev [Nag69a, Nag69b], but it seems that for the results in these papers the independence of the summands is crucial.

4 That is, a (scaled) random $m \times n$ binary matrix with $m = O(\varepsilon^{-2} k^2 \log(n/k))$ rows and sparsity $d = O(\varepsilon^{-1} k \log(n/k))$ has the RIP-2 property with certain parameters.

5 Also, the same argument gives the lower bound $\Omega(k^p)$ for binary RIP-$p$ matrices for every $p \in [1, \infty)$.

One major reason the above approach fails to work for $1 < p < 2$ is that, in this range, even the best possible tail inequality for $S$ is too weak for our purposes. Another reason is that, for the 'lower tail' of $\|Ax\|_p^p$ (that is, to prove that $\|Ax\|_p \ge (1-\varepsilon)\|x\|_p$ holds for all $k$-sparse $x$), the simple proof that works for $p \ge 2$ no longer holds. Our solution to both problems is to instead build our RIP matrices from the following general notion of bipartite expanders.

Let $G = (U, V, E)$ with $|U| = n$, $|V| = m$, and $E \subseteq U \times V$ be a bipartite graph in which every vertex of $U$ has the same degree $d$. We say that $G$ is an $(\ell, d, \delta)$-expander if for every $S \subseteq U$ with $|S| \le \ell$ we have
$$\big|\{v \in V \mid \exists u \in S,\ (u, v) \in E\}\big| \ge (1 - \delta) d |S|.$$
It is known that random $d$-left-regular graphs are good expanders, and we can take the (scaled) adjacency matrix of such an expander as our RIP-$p$ matrix for $1 < p < 2$. Our argument can be seen as a subtle interpolation between the argument from [BGI+08], which proves that (scaled) adjacency matrices of $(k, d, \Theta(\varepsilon))$-expanders (with $\tilde O(k)$ rows) are $(k, 1+\varepsilon)$-RIP-1, and the incoherence argument,6 which shows that $(2, d, \Theta(\varepsilon/k))$-expanders give $(k, 1+\varepsilon)$-RIP-2 matrices (with $\tilde O(k^2)$ rows).
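For small graphs, the expansion property from the definition above can be checked by brute force. The following helper is our own illustration (exponential in $\ell$, so intended only for toy instances):

```python
import itertools

def is_expander(neighbors, ell, d, delta):
    """Brute-force check of the (ell, d, delta)-expansion property:
    every set S of at most ell left vertices must have at least
    (1 - delta) * d * |S| distinct right neighbors. `neighbors[u]`
    is the set of the d right neighbors of left vertex u."""
    left = list(neighbors)
    for s in range(1, ell + 1):
        for S in itertools.combinations(left, s):
            hit = set().union(*(neighbors[u] for u in S))
            if len(hit) < (1 - delta) * d * s:
                return False
    return True
```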

Lower bounds. Our dimension lower bound is technical, but its high-level idea can be described in four simple steps. Consider any $(k, D)$-RIP-$p$ matrix $A \in \mathbb{R}^{m \times n}$, and assume that $D$ is very close to 1 in this high-level description.

In the first three steps, we deduce from the RIP property that (a) the sum of the $p$-th powers of all entries of $A$ is approximately $n$, (b) the largest entries of $A$ are at most roughly $k^{1/p-1}$, and (c) the sum of the squares of all entries of $A$ is at least $n \cdot (k/m)^{2/p-1}$. In the fourth step, we combine (a), (b) and (c) by arguing about the relationships between the $\ell_p$, $\ell_\infty$ and $\ell_2$ norms of the entries of $A$, and prove the desired lower bound on $m$.

The sparsity lower bound $d = \Omega(k^{p-1})$ can be obtained by a simple extension of the argument of Chandar [Cha10]. It is possible to extend the techniques of Nelson and Nguyễn [NN13] to obtain a slightly better sparsity lower bound; however, since the resulting bound does not seem to be optimal and the argument is much more complicated, we do not include it.


RIP Construction for p ≥ 2

In this section, we construct $(k, 1+\varepsilon)$-RIP-$p$ matrices for $p \ge 2$ by proving the following theorem.

Definition 2.1. We say that an $m \times n$ matrix $A$ is a random binary matrix with sparsity $d \in [m]$, if $A$ is generated by assigning the value $d^{-1/p}$ to $d$ entries of each column, chosen uniformly at random out of the $m$ entries (without replacement), and assigning 0 to the remaining entries.


Theorem 2.2. For all $n \in \mathbb{Z}^+$, $k \in [n]$, $\varepsilon \in (0, \frac{1}{2})$ and $p \in [2, \infty)$, there exist $m, d \in \mathbb{Z}^+$ with
$$m = \frac{p^{O(p)} \cdot k^p}{\varepsilon^2} \cdot \log^{p-1} n \quad \text{and} \quad d = \frac{p^{O(p)} \cdot k^{p-1}}{\varepsilon} \cdot \log^{p-1} n \le m$$
such that, letting $A$ be a random binary $m \times n$ matrix of sparsity $d$, with probability at least 98%, $A$ satisfies $(1-\varepsilon)\|x\|_p^p \le \|Ax\|_p^p \le (1+\varepsilon)\|x\|_p^p$ for all $k$-sparse vectors $x \in \mathbb{R}^n$.
Our proof is divided into two steps: (1) the "lower-tail step", that is, with probability at least 0.99 we have $\|Ax\|_p^p \ge (1-\varepsilon)\|x\|_p^p$ for all $k$-sparse $x$; and (2) the "upper-tail step", that is, with probability at least 0.99 we have $\|Ax\|_p^p \le (1+\varepsilon)\|x\|_p^p$ for all $k$-sparse $x$.

The Lower-Tail Step. Suppose, without loss of generality, that $x$ is supported on $[k]$, and for $j \in [n]$ let $S_j \subseteq [m]$ denote the support of the $j$-th column of $A$. Intuitively, when only the first $k$ columns of $A$ are considered, every $S_j$ has to be almost disjoint from the supports of the other columns. This can be summarized by the following claim (proved in Appendix B).

Claim 2.3. If $d \ge C\varepsilon^{-1} k \log n$ and $m \ge 2dk/\varepsilon$, then with probability at least 0.99 we have $|S_i \cap S_j| \le \varepsilon d / k$ for all $1 \le i < j \le n$.

Whenever this event holds, the set $\bar S_j \subseteq S_j$ of rows shared with no other $S_{j'}$, $j' \in [k] \setminus \{j\}$, satisfies $|\bar S_j| \ge d - (k-1) \cdot \varepsilon d / k \ge (1-\varepsilon) d$ for every $j \in [k]$. Thus, we can lower bound $\|Ax\|_p^p$ as
$$\|Ax\|_p^p = \frac{1}{d} \sum_{i=1}^{m} \Big|\sum_{j \in [k]:\, i \in S_j} x_j\Big|^p \ge \frac{1}{d} \sum_{i=1}^{m} \sum_{j \in [k]:\, i \in \bar S_j} |x_j|^p = \frac{1}{d} \sum_{j \in [k]} |\bar S_j| \cdot |x_j|^p \ge (1-\varepsilon)\|x\|_p^p. \qquad (2.1)$$

Note that the above claim only works when $m = \Omega(k^2 \log n / \varepsilon^2)$, and therefore we cannot use it for the case $1 < p < 2$.
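The disjointness condition of Claim 2.3 is easy to probe empirically. The following sketch (our own illustration) samples the supports $S_j$ exactly as in Definition 2.1 and reports the worst pairwise overlap, which the claim predicts is at most $\varepsilon d / k$ for suitable $d$ and $m$:

```python
import numpy as np

def max_pairwise_overlap(m, n, d, seed=0):
    """Sample n column supports S_j (each a uniform d-subset of [m])
    and return the largest pairwise intersection |S_i ∩ S_j|."""
    rng = np.random.default_rng(seed)
    supports = [frozenset(rng.choice(m, size=d, replace=False))
                for _ in range(n)]
    return max(len(supports[i] & supports[j])
               for i in range(n) for j in range(i + 1, n))
```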


The Upper-Tail Step

Suppose again that $x$ is supported on $[k]$. Then, we upper bound $\|Ax\|_p^p$ as
$$\|Ax\|_p^p = \frac{1}{d} \sum_{i=1}^{m} \Big|\sum_{j \in [k]:\, i \in S_j} x_j\Big|^p \le \frac{1}{d} \sum_{i=1}^{m} \big|\{j \in [k] \mid i \in S_j\}\big|^{p-1} \cdot \sum_{j \in [k]:\, i \in S_j} |x_j|^p = \frac{1}{d} \sum_{j=1}^{k} |x_j|^p \cdot \sum_{i \in S_j} \big|\{j' \in [k] \mid i \in S_{j'}\}\big|^{p-1}, \qquad (2.2)$$
where the first inequality follows from the fact that $(a_1 + \cdots + a_N)^p \le N^{p-1} \cdot (a_1^p + \cdots + a_N^p)$ for any sequence of $N$ non-negative reals $a_1, \ldots, a_N$. Note that the quantity $|\{j \in [k] \mid i \in S_j\}|$ counts the non-zeros of $A$ in the $i$-th row and the first $k$ columns. From now on, in order to bound (2.2), it suffices to show that, for every fixed $j^* \in [k]$, with high probability
$$\sum_{i \in S_{j^*}} \big|\{j \in [k] \mid i \in S_j\}\big|^{p-1} \le (1+\varepsilon) d \qquad (2.3)$$
holds for $j = j^*$, and then take a union bound over the choices of $j^*$. Without loss of generality, assume that $S_{j^*} = \{1, 2, \ldots, d\}$, consisting of the first $d$ rows. For every $i \in S_{j^*}$, define a random variable
$$X_i \stackrel{\text{def}}{=} \big|\{j \in [k] \mid i \in S_j\}\big| - 1.$$
It is easy to see that $X_i$ is distributed as $\mathrm{Bin}(k-1, d/m)$, the binomial distribution that is the sum of $k-1$ i.i.d. random 0/1 variables, each being 1 with probability $d/m$. For notational simplicity, let us define $\delta \stackrel{\text{def}}{=} dk/m$. We will later choose $\delta < \varepsilon$ to be very small. Our goal in (2.3) can now be reformulated as follows: upper bound the probability
$$\Pr\Big[\sum_{i=1}^{d} \big((X_i + 1)^{p-1} - 1\big) > \varepsilon d\Big].$$
We begin with a lemma showing an upper bound on the moments of each $Y_i \stackrel{\text{def}}{=} (X_i + 1)^{p-1} - 1$.

Lemma 2.4. There exists a constant $C \ge 1$ such that, if $X$ is drawn from the binomial distribution $\mathrm{Bin}(k-1, \delta/k)$ for some $\delta < 1/(2e^2)$, and $p \ge 2$, then for any real $\ell \ge 1$,
$$\mathbb{E}\big[((X+1)^{p-1} - 1)^{\ell}\big] \le C \cdot \delta \cdot \big(\ell(p-1) + 1\big)^{\ell(p-1)+1}.$$

Next, we note that although the random variables $X_i$ are dependent, they can be verified to be negatively associated, a notion introduced by Joag-Dev and Proschan [JDP83]. This theory allows us to conclude the following bound on the moments.

Lemma 2.5. Let $\widetilde X_1, \ldots, \widetilde X_d$ be $d$ random variables, each drawn independently from $\mathrm{Bin}(k-1, \delta/k)$. Then, for every integer $t \ge 1$,
$$\mathbb{E}\Big[\Big(\sum_{i=1}^{d} \big((X_i + 1)^{p-1} - 1\big)\Big)^t\Big] \le \mathbb{E}\Big[\Big(\sum_{i=1}^{d} \big((\widetilde X_i + 1)^{p-1} - 1\big)\Big)^t\Big].$$
Now, using the moments of the random variables $Y_i = (X_i + 1)^{p-1} - 1$ from Lemma 2.4, as well as Lemma 2.5, we can bound the tail of the sum $\sum_{i=1}^{d} Y_i$. Our proof of the following lemma uses a result of Latała [Lat97].

Lemma 2.6. There exists a constant $C \ge 1$ such that, whenever $\delta \le \varepsilon / p^{Cp}$ and $d \ge p^{Cp}/\varepsilon$, we have
$$\Pr\Big[\sum_{i=1}^{d} \big((X_i + 1)^{p-1} - 1\big) > \varepsilon d\Big] \le e^{-\Omega\left(\frac{(\varepsilon d)^{1/(p-1)}}{p}\right)}.$$
Finally, we are ready to prove Theorem 2.2.


Proof of Theorem 2.2. We can choose
$$d = \Theta(p)^{p-1} \cdot \frac{k^{p-1}}{\varepsilon} \cdot \log^{p-1} n$$
so that $e^{-\Omega\left((\varepsilon d)^{1/(p-1)}/p\right)} < \frac{1}{100} \cdot \frac{1}{k \binom{n}{k}}$. Since our choice of $m = \frac{dk \cdot p^{\Theta(p)}}{\varepsilon}$ ensures that $\delta = dk/m \le \varepsilon/p^{Cp}$, Lemma 2.6 implies that with probability at least $1 - \frac{1}{100} \cdot \frac{1}{k \binom{n}{k}}$ one has
$$\sum_{i \in S_{j^*}} \big|\{j \in [k] \mid i \in S_j\}\big|^{p-1} = \sum_{i=1}^{d} (X_i + 1)^{p-1} \le (1+\varepsilon) d.$$
Therefore, by applying the union bound over all $j^* \in [k]$, we conclude that with probability at least $1 - \frac{1}{100} \cdot \frac{1}{\binom{n}{k}}$, the desired inequality (2.3) is satisfied for all $j \in [k]$.
Recall that, owing to (2.2), the inequality (2.3) implies the desired upper bound for all $x$ supported on $[k]$. Taking a further union bound over all $\binom{n}{k}$ choices of the support, we conclude that with probability at least 0.99, we have $\|Ax\|_p^p \le (1+\varepsilon)\|x\|_p^p$ for all $k$-sparse vectors $x$.

On the other hand, since our choice of $d$ and $m$ satisfies the assumptions $d \ge \Omega(k \log n / \varepsilon)$ and $m \ge 2dk/\varepsilon$ of Claim 2.3, the lower-tail bound (2.1) also holds with probability at least 0.99. This completes the proof of Theorem 2.2.

RIP Construction for 1 < p < 2

Throughout this section, we assume that $1 + \tau \le p \le 2 - \tau$ for some $\tau > 0$, and whenever we write $O_\tau(\cdot)$, we assume that some factor that depends on $\tau$ is hidden. (For instance, factors of $p/(1-p)$ may be hidden.)
Theorem 3.1. For every $n \in \mathbb{Z}^+$, $k \in [n]$, $0 < \varepsilon < 1/2$, and $p \in (1, 2)$, there exist $m, d \in \mathbb{Z}^+$ with $m = \tilde O_\tau(k^p)$ and $d = \tilde O_\tau(k^{p-1})$ such that, letting $A$ be a random binary $m \times n$ matrix of sparsity $d$, with probability at least 98%, $A$ satisfies $(1-\varepsilon)\|x\|_p^p \le \|Ax\|_p^p \le (1+\varepsilon)\|x\|_p^p$ for all $k$-sparse vectors $x \in \mathbb{R}^n$. Note that, when $k \ge \varepsilon^{-\frac{p(2-p)}{(p-1)^3}}$, the above bounds on $m$ and $d$ simplify.

Our construction is based on the following notion of bipartite expanders.

Definition 3.2. Let $G = (U, V, E)$ with $|U| = n$, $|V| = m$, and $E \subseteq U \times V$ be a bipartite graph in which every vertex of $U$ has the same degree $d$. We say that $G$ is an $(\ell, d, \delta)$-expander if for every $S \subseteq U$ with $|S| \le \ell$ we have
$$\big|\{v \in V \mid \exists u \in S,\ (u, v) \in E\}\big| \ge (1 - \delta) d |S|.$$

Lemma 3.3 ([BMRV02, Lemma 3.10]). For every $\delta \in (0, \frac{1}{2})$ and $\ell \in [n]$, there exists an $(\ell, d, \delta)$-expander with $d = O(\log(n/\ell)/\delta)$ and $m = O(\ell \cdot \log(n/\ell)/\delta^2)$.

In fact, the proof of Lemma 3.3 implies a simple probabilistic construction of such expanders: with high probability, a random graph with left degree $d$ is an expander with the above parameters. From now on, we assume that $A$ is the adjacency matrix of a $(2\ell, d, \delta)$-expander scaled by $d^{-1/p}$, for parameters $\ell$, $d$ and $\delta$ that we will specify at the end of the full proof in Appendix C.7

High-Level Proof Idea

The goal is to show that $\big|\|Ax\|_p^p - 1\big| \le \varepsilon$ for every $k$-sparse vector $x$ that satisfies $\|x\|_p = 1$. Without loss of generality, let us assume that $x$ is supported on $[k]$, the first $k$ coordinates among $[n]$, and that
$$|x_1| \ge |x_2| \ge \cdots \ge |x_k|.$$
We partition the $k$ columns into $k/\ell$ blocks, each of size $\ell$, and denote them by $B_1 = \{1, 2, \ldots, \ell\}$, $B_2 = \{\ell+1, \ell+2, \ldots, 2\ell\}$, and so on. With this definition, and using that every column of $A$ has exactly $d$ non-zero entries equal to $d^{-1/p}$ (so that $\sum_{i,j} |A_{ij} x_j|^p = \|x\|_p^p$), we can expand $\|Ax\|_p^p$ as follows:
$$\|Ax\|_p^p - 1 = \sum_{i=1}^{m} \Big|\sum_{j=1}^{k} A_{ij} x_j\Big|^p - \|x\|_p^p = \sum_{i=1}^{m} \Big|\sum_{j=1}^{k} A_{ij} x_j\Big|^p - \sum_{i=1}^{m} \sum_{j=1}^{k} |A_{ij} x_j|^p \le O(1) \cdot \sum_{i=1}^{m} \sum_{j=1}^{k} |A_{ij} x_j| \cdot \Big|\sum_{j'=j+1}^{k} A_{ij'} x_{j'}\Big|^{p-1} = O(1) \cdot \sum_{b=1}^{k/\ell} \sum_{i=1}^{m} \sum_{j \in B_b} |A_{ij} x_j| \cdot \Big|\sum_{j'=j+1}^{k} A_{ij'} x_{j'}\Big|^{p-1}, \qquad (3.1)$$
where the inequality follows from Claim C.1, a tight bound on the difference between 'the p-th power of the sum' and 'the sum of the p-th powers'.

To upper bound the right-hand side of (3.1), we fix a block $B_b = \{(b-1)\ell + 1, \ldots, b\ell\}$ and consider three groups of non-zero entries of $A$: 'primary', 'secondary' and 'tertiary' entries.

Let us first define primary and secondary entries; together they comprise all non-zero entries in the columns of $B_b$. We define the primary entries $L_b \subseteq [m] \times B_b$ using the following procedure: for every row of $A$ that has non-zero entries in the columns of $B_b$, we pick the non-zero entry with the smallest column index and add it to the set of primary entries $L_b$. We define the secondary entries $D_b \subseteq [m] \times B_b$ to be the remaining non-zero entries in the columns of $B_b$. Finally, we define the tertiary entries $\widetilde D_b \subseteq [m] \times (B_{b+1} \cup \cdots \cup B_{k/\ell})$ as the set of non-zero entries that lie in the same row as some primary entry from $L_b$ and in some block $B_{b'}$ with $b' > b$ (see Figure 1 for an illustration).

Next, let us sketch how we upper bound the right-hand side of (3.1). First, along the way we crucially use the simple estimate $|x_j| \le j^{-1/p}$ for every $j \in [k]$ (which holds because the coordinates are sorted, so $j \cdot |x_j|^p \le \|x\|_p^p = 1$). Second, we upper bound the following partial sum of (3.1) for each $b$ separately:
$$\sum_{i=1}^{m} \sum_{j \in B_b} |A_{ij} x_j| \cdot \Big|\sum_{j'=j+1}^{k} A_{ij'} x_{j'}\Big|^{p-1}.$$
We further decompose this sum with respect to the entries $(i, j)$ that are primary (i.e., in $L_b$), secondary (i.e., in $D_b$), or secondary or tertiary (i.e., in $D_b \cup \widetilde D_b$). The crucial ingredient is a careful application of Hölder's inequality. The details are somewhat lengthy: in particular, we have to treat the case $b = 1$ separately, and carefully choose the parameters $\ell$ and $\delta$.

Dimension Lower Bound

Theorem 4.1. Let $A \in \mathbb{R}^{m \times n}$ be a $(k, D)$-RIP-$p$ matrix with distortion $D > 1$. Then:

• If $1 < p < 2$, either $m \ge \Omega\Big(\frac{(2-p)n}{pD^2}\Big)^{p/2}$ or $m \ge \Omega\Big(\frac{k^p}{D^{2p/(2-p)}}\Big)$;

• If $p > 2$, either $m \ge \frac{n}{2k}$ or $m \ge \Omega\Big(\frac{k^p}{D^{p^2/(p-2)}}\Big)$.
We start with three auxiliary lemmas. The first one establishes bounds on the sum of $p$-th powers of the entries of $A$; it follows by applying the RIP property to the basis vectors $e_j$.

Lemma 4.2. For any column $j \in [n]$, the following holds: $1 \le \sum_{i=1}^{m} |A_{i,j}|^p \le D^p$. In particular, $\sum_{i,j} |A_{i,j}|^p \le n \cdot D^p$.

The second lemma bounds the individual entries of $A$; its proof applies the RIP property to $k$-sparse sign vectors $x \in \{-1, 1\}^n$. For a row $i \in [m]$, let $b_{i,1} \ge b_{i,2} \ge \cdots \ge b_{i,n}$ denote the absolute values of the entries of the $i$-th row of $A$, sorted in non-increasing order.

Lemma 4.3. For every row $i \in [m]$ and every $j \in [k]$, we have $b_{i,j} \le D \cdot j^{1/p-1}$.

The third lemma establishes a lower (or upper) bound on the sum of squares of the entries of $A$. The proof of this lemma relies on the RIP property $\|Ax\|_p \approx \|x\|_p$ examined on a random $k$-sparse vector $x$ whose non-zero entries are sampled from the uniform distribution over $\{-1, 1\}^k$.

Lemma 4.4. If $1 < p \le 2$ then $\sum_{i,j} A_{i,j}^2 \ge n \cdot \big(\frac{k}{m}\big)^{2/p-1}$; if $p \ge 2$ then $\sum_{i,j} A_{i,j}^2 \le n D^2 \cdot \big(\frac{k}{m}\big)^{2/p-1}$.
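The quantities controlled by these lemmas are easy to inspect numerically for a concrete matrix, e.g., the scaled sparse binary construction of Section 2. The helper below is our own illustration, not part of the paper; for a genuine $(k, D)$-RIP-$p$ matrix the first two values lie in $[1, D^p]$ and the third is at most $D$:

```python
import numpy as np

def lemma_quantities(A, p, k):
    """Return the quantities from Lemmas 4.2-4.4 for a matrix A:
    min/max per-column sums of p-th powers, the max of
    b_{i,j} / j^(1/p - 1) over the k largest entries of each row,
    and the total sum of squares of the entries."""
    col_p = (np.abs(A) ** p).sum(axis=0)
    b = -np.sort(-np.abs(A), axis=1)          # each row sorted descending
    ratio = b[:, :k] / (np.arange(1, k + 1) ** (1.0 / p - 1.0))
    return col_p.min(), col_p.max(), ratio.max(), (A ** 2).sum()
```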
Now we are ready to prove Theorem 4.1. In this section we do so only for the case $1 < p < 2$, and defer the other half, along with the proofs of the auxiliary lemmas, to Appendix D.

Proof of Theorem 4.1 for 1 < p < 2. Using Lemma 4.3 we can evaluate
$$\sum_{i=1}^{m} \sum_{j=1}^{k} b_{i,j}^2 \overset{x}{\le} \sum_{i=1}^{m} \sum_{j=1}^{k} \big(D \cdot j^{1/p-1}\big)^2 = mD^2 \sum_{j=1}^{k} j^{2/p-2} \le O\Big(\frac{p}{2-p}\Big) \cdot mD^2 \cdot k^{2/p-1} \qquad (4.1)$$
and
$$\sum_{i=1}^{m} \sum_{j=k+1}^{n} b_{i,j}^2 \overset{y}{\le} \big(D \cdot k^{1/p-1}\big)^{2-p} \cdot \sum_{i=1}^{m} \sum_{j=1}^{n} b_{i,j}^p \overset{z}{\le} n \cdot D^p \cdot \big(D \cdot k^{1/p-1}\big)^{2-p}, \qquad (4.2)$$
where x follows from Lemma 4.3, y follows from the definition of $b_{i,j}$ (for $j > k$ we have $b_{i,j}^2 = b_{i,j}^p \cdot b_{i,j}^{2-p} \le b_{i,j}^p \cdot (D k^{1/p-1})^{2-p}$, since $b_{i,j} \le b_{i,k} \le D k^{1/p-1}$), and z follows from Lemma 4.2. Adding (4.1) and (4.2) gives
$$\sum_{i,j} A_{i,j}^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} b_{i,j}^2 \le O\Big(\frac{p}{2-p}\Big) \cdot mD^2 \cdot k^{2/p-1} + n \cdot D^p \cdot \big(D \cdot k^{1/p-1}\big)^{2-p},$$
and using Lemma 4.4 we conclude that
$$n \cdot \Big(\frac{k}{m}\Big)^{2/p-1} \le O\Big(\frac{p}{2-p}\Big) \cdot mD^2 \cdot k^{2/p-1} + n \cdot D^p \cdot \big(D \cdot k^{1/p-1}\big)^{2-p}.$$
Therefore, either $n \cdot D^p \cdot (D \cdot k^{1/p-1})^{2-p}$ or $O\big(\frac{p}{2-p}\big) \cdot mD^2 \cdot k^{2/p-1}$ must be at least $\Omega\big(n \cdot (k/m)^{2/p-1}\big)$. These two cases exactly correspond (after rearranging terms) to the desired inequalities.

(We remark here that when $p = 2$, the factor $(k/m)^{2/p-1}$ on the left-hand side becomes 1, and therefore no interesting lower bound on $m$ can be deduced.)

Sparsity Lower Bound

Here we prove the column sparsity lower bound $d = \Omega(k^{p-1})$ by a simple extension of the argument of Chandar [Cha10]. We are aware of an alternative proof of a slightly stronger lower bound that extends the argument of Nelson and Nguyễn [NN13], but since the better bound does not seem to be optimal, and the argument is much more complicated, we decided not to include it.

Suppose that $A$ is a $(k, D)$-RIP-$p$ matrix with column sparsity $d$, and assume that $m \le n/k$. Since for every basis vector $e_j \in \mathbb{R}^n$ we have $\|Ae_j\|_p \ge 1$, it follows that every column of $A$ has an entry with absolute value at least $d^{-1/p}$ (a column with at most $d$ non-zero entries and $\ell_p$-norm at least 1 must contain such an entry). Thus, by the pigeonhole principle, there exists a row with at least $n/m \ge k$ such entries. Without loss of generality, let us assume that this is the first row and that the corresponding entries lie in the first $k$ columns. Consider the vector $x \in \mathbb{R}^n$ such that:

• for every $1 \le j \le k$ we have $x_j = \mathrm{sgn}(A_{1j}) \in \{-1, 1\}$;

• for every $j > k$ we have $x_j = 0$.

Then the first coordinate of the vector $Ax$ is at least $k \cdot d^{-1/p}$, whereas the RIP property gives $\|Ax\|_p \le D \cdot \|x\|_p = D \cdot k^{1/p}$. Thus, $k \cdot d^{-1/p} \le D \cdot k^{1/p}$, i.e., $d \ge k^{p-1}/D^p$.

Appendix


A Applications

In this section we extend the results from [CRT06] and [Can08] to the case of general $\ell_p$ norms. Namely, we show that RIP-$p$ matrices for $p > 1$ give rise to polynomial-time stable sparse recovery with the $\ell_q/\ell_1$ guarantee and approximation factor $C = O(k^{-1+1/q})$ for every $1 \le q \le p$.

Suppose that we are given an $m \times n$ measurement matrix $A$ and the sketch $y = Ax$. Our goal is to recover from $y$ a good approximation $\hat x$ to $x$. One of the standard approaches is to output an optimal solution $\hat x$ of the convex program
$$\min \|\hat x\|_1 \quad \text{subject to} \quad A\hat x = y. \qquad (A.1)$$
Let $S \subseteq [n]$ be the set of the $k$ largest (in absolute value) coordinates of $x$, and let $h = \hat x - x$ be the error vector. (Here $\bar S$ denotes the complement $[n] \setminus S$, and $x_T$ denotes the restriction of a vector $x$ to the coordinates in $T$.) For a parameter $\alpha > 0$ to be chosen later, let $T_0 \subseteq [n] \setminus S$ be the set of the $\alpha k$ largest (in absolute value) coordinates of $h$, let $T_1 \subseteq [n] \setminus (S \cup T_0)$ be the set of the $\alpha k$ next largest coordinates, and so on. We first state and prove some simple claims that hold for every measurement matrix $A$.


Claim A.1. We have $\|h_{\bar S}\|_1 \le \|h_S\|_1 + 2\|x_{\bar S}\|_1$.

Proof. Since $\hat x$ minimizes the $\ell_1$-norm among the feasible points of (A.1), and $x$ is feasible, we have
$$\|x_S\|_1 + \|x_{\bar S}\|_1 = \|x\|_1 \ge \|\hat x\|_1 = \|x + h\|_1 \ge \|x_S\|_1 - \|h_S\|_1 + \|h_{\bar S}\|_1 - \|x_{\bar S}\|_1,$$
and the claim follows by rearranging.

Claim A.2. For every $1 \le p \le \infty$ we have
$$\sum_{i \ge 1} \|h_{T_i}\|_p \le \frac{1}{\alpha^{1-1/p}} \cdot \Big(\|h_S\|_p + \frac{2\|x_{\bar S}\|_1}{k^{1-1/p}}\Big).$$

Proof. For every $i \ge 1$ we have $\|h_{T_i}\|_\infty \le \|h_{T_{i-1}}\|_1/(\alpha k)$ by the definition of $T_i$, implying
$$\|h_{T_i}\|_p \le \big(\alpha k \cdot \|h_{T_i}\|_\infty^p\big)^{1/p} \le \frac{\|h_{T_{i-1}}\|_1}{(\alpha k)^{1-1/p}}.$$
Summing over $i \ge 1$, we obtain
$$\sum_{i \ge 1} \|h_{T_i}\|_p \le \frac{\|h_{\bar S}\|_1}{(\alpha k)^{1-1/p}} \le \frac{\|h_S\|_1 + 2\|x_{\bar S}\|_1}{(\alpha k)^{1-1/p}} \le \frac{1}{\alpha^{1-1/p}} \cdot \Big(\|h_S\|_p + \frac{2\|x_{\bar S}\|_1}{k^{1-1/p}}\Big),$$
where the second inequality follows from Claim A.1 and the third inequality follows from the relation between the $\ell_1$ and $\ell_p$ norms, that is, $\|h_S\|_1 \le k^{1-1/p} \cdot \|h_S\|_p$.

Claim A.3. For every $1 \le p \le \infty$ we have
$$\|h_{\overline{S \cup T_0}}\|_p \le \frac{1}{\alpha^{1-1/p}} \cdot \Big(\|h_S\|_p + \frac{2\|x_{\bar S}\|_1}{k^{1-1/p}}\Big).$$

Proof. By the triangle inequality, $\|h_{\overline{S \cup T_0}}\|_p \le \sum_{i \ge 1} \|h_{T_i}\|_p$, and the bound follows from Claim A.2.

RIP-p matrices and ℓp/ℓ1 recovery

Here we prove that if $A$ is a matrix with the RIP-$p$ property, then the convex program (A.1) provides stable sparse recovery with the $\ell_p/\ell_1$ guarantee.

Lemma A.4. Let $\alpha \ge 1$ be such that $\alpha^{1-1/p} \ge 2D$, and let $A$ be an $((\alpha+1)k, D)$-RIP-$p$ matrix. Then
$$\|h_{S \cup T_0}\|_p \le \frac{4D \cdot \|x_{\bar S}\|_1}{(\alpha k)^{1-1/p}}.$$

Proof. Since $h_{S \cup T_0}$ is $(\alpha+1)k$-sparse, we can write
$$0 \overset{x}{=} \|Ah\|_p \ge \|Ah_{S \cup T_0}\|_p - \sum_{i \ge 1} \|Ah_{T_i}\|_p \overset{y}{\ge} \|h_{S \cup T_0}\|_p - D \sum_{i \ge 1} \|h_{T_i}\|_p \overset{z}{\ge} \|h_{S \cup T_0}\|_p - \frac{D}{\alpha^{1-1/p}} \cdot \Big(\|h_S\|_p + \frac{2\|x_{\bar S}\|_1}{k^{1-1/p}}\Big) \ge \Big(1 - \frac{D}{\alpha^{1-1/p}}\Big) \cdot \|h_{S \cup T_0}\|_p - \frac{2D \cdot \|x_{\bar S}\|_1}{(\alpha k)^{1-1/p}},$$
where equality x holds because both $x$ and $\hat x$ are feasible for (A.1) (so that $Ah = A\hat x - Ax = 0$), inequality y holds since $A$ satisfies the RIP-$p$ property, and inequality z is due to Claim A.2. Since $D/\alpha^{1-1/p} \le 1/2$, rearranging gives the desired bound.

Now we are ready to show that if $A$ has the RIP-$p$ property for $p > 1$, then one can perform stable sparse recovery with the $\ell_p/\ell_1$ guarantee.

Lemma A.5. If $A$ is a $((4D)^{p/(p-1)} k, D)$-RIP-$p$ matrix for some $p > 1$, then
$$\|h\|_p \le \frac{O(1)}{k^{1-1/p}} \cdot \|x_{\bar S}\|_1.$$

Proof. Setting $\alpha = (2D)^{p/(p-1)} > 2$, we have $(4D)^{p/(p-1)} k \ge 2^{p/(p-1)} \cdot \alpha \cdot k > (\alpha + 1) k$, and therefore the assumptions of Lemma A.4 hold. We proceed as follows:
$$\|h\|_p \le \|h_{S \cup T_0}\|_p + \|h_{\overline{S \cup T_0}}\|_p \overset{x}{\le} \Big(1 + \frac{1}{\alpha^{1-1/p}}\Big) \cdot \|h_{S \cup T_0}\|_p + \frac{2\|x_{\bar S}\|_1}{(\alpha k)^{1-1/p}} \overset{z}{\le} \Big(1 + \frac{1}{\alpha^{1-1/p}}\Big) \cdot \frac{2\|x_{\bar S}\|_1}{k^{1-1/p}} + \frac{2\|x_{\bar S}\|_1}{(\alpha k)^{1-1/p}} \le \frac{O(1)}{k^{1-1/p}} \cdot \|x_{\bar S}\|_1,$$
where x is due to Claim A.3 (together with $\|h_S\|_p \le \|h_{S \cup T_0}\|_p$), and z follows from Lemma A.4 and the equality $\alpha^{1-1/p} = 2D$.

The following lemma shows that, for every measurement matrix $A$, the $\ell_p/\ell_1$ guarantee with approximation factor $O(k^{-1+1/p})$ implies the $\ell_q/\ell_1$ guarantee with approximation factor $O(k^{-1+1/q})$ for every $1 \le q \le p$. Thus, the $\ell_p/\ell_1$ guarantee is the strongest guarantee as long as $q \le p$.

Lemma A.6. For every $1 \le q \le p$ we have
$$\|h\|_q \le 4k^{1/q-1/p} \cdot \|h_{S \cup T_0}\|_p + \frac{2\|x_{\bar S}\|_1}{k^{1-1/q}}.$$

Proof.
$$\|h\|_q \le \|h_{S \cup T_0}\|_q + \|h_{\overline{S \cup T_0}}\|_q \le 2 \cdot \|h_{S \cup T_0}\|_q + \frac{2\|x_{\bar S}\|_1}{k^{1-1/q}} \le 4k^{1/q-1/p} \cdot \|h_{S \cup T_0}\|_p + \frac{2\|x_{\bar S}\|_1}{k^{1-1/q}},$$
where the second inequality follows from Claim A.3 (applied with exponent $q$ and $\alpha \ge 1$), and the third inequality follows from the relation between the $\ell_p$ and $\ell_q$ norms.


B Missing Proofs in Section 2

Claim 2.3. If $d \ge C\varepsilon^{-1} k \log n$ and $m \ge 2dk/\varepsilon$, then
$$\Pr\Big[\forall\, 1 \le i < j \le n: \ |S_i \cap S_j| \le \frac{\varepsilon d}{k}\Big] \ge 0.99.$$

Proof. Let us first upper bound the probability that $|S_i \cap S_j| > \varepsilon d / k$ for a fixed pair of columns $i < j$. Condition on the choice of $S_i$, and for each $r \in S_i$ let the random variable $X_r$ be 1 if $S_j$ contains $r$, and 0 if not. Under this definition, $|S_i \cap S_j| = \sum_{r \in S_i} X_r$ follows the hypergeometric distribution, which is a Pólya frequency distribution arising from sampling without replacement; we want to show that the variables $\{X_r\}_{r \in S_i}$ are also negatively associated, which follows from the results of [JDP83]. Since $\mathbb{E}|S_i \cap S_j| = d^2/m \le \varepsilon d/(2k)$ by the assumption $m \ge 2dk/\varepsilon$, a Chernoff-type bound (which remains valid for negatively associated variables) gives $\Pr[|S_i \cap S_j| > \varepsilon d/k] \le \frac{1}{100 n^2}$ once $d \ge C\varepsilon^{-1} k \log n$, and the claim follows by a union bound over all pairs $i < j$.

C Missing Proofs in Section 3

The Case b > 0. Let us now handle the case $b > 0$. It is sufficient to check that for every $b \ge 0$ and $1 \le p \le 2$ we have $(1+b)^p \le 1 + b^p + p b^{p-1}$. This inequality is trivially true when $b = 0$; for $b > 0$, it suffices to verify that the derivative of the difference $(1+b)^p - 1 - b^p - p b^{p-1}$ is non-positive, that is,
$$p(1+b)^{p-1} - p b^{p-1} - p(p-1) b^{p-2} \le 0, \quad \text{or equivalently,} \quad \Big(1 + \frac{1}{b}\Big)^{p-1} \le 1 + \frac{p-1}{b}.$$
But the latter follows from the Bernoulli inequality. In particular, if $c_1 = \cdots = c_N = 1$, we have $\sum_{i=1}^{N} a_i^{p-1} \le N^{2-p} \cdot \big(\sum_{i=1}^{N} a_i\big)^{p-1}$.

We now prove Lemma C.3, the cardinality upper bound on the sets of secondary and tertiary entries. The inequality (C.3) for $t = 1$ is obvious, because in the column $(b-1)\ell + 1$ there are only primary entries but not secondary or tertiary ones (see Figure 1).

For any integer $t$ between 2 and $\ell$, we observe that the left-hand side of (C.3) consists only of secondary entries in $D_b$, and moreover,
$$\big|\{(i, j) \in D_b \mid j \le (b-1)\ell + t\}\big| = dt - \Big|\bigcup_{j=(b-1)\ell+1}^{(b-1)\ell+t} S_j\Big| \le dt - (1-\delta) dt = \delta dt \le 3\delta dt.$$

For any $t > \ell$, we argue as follows. The expander property, applied to the $2\ell$ columns $B_b \cup B_{b''}$, implies that every block $B_{b''}$ with $b'' > b$ contains at most $2\delta d \ell$ entries of $\widetilde D_b$. Since $(b-1)\ell + t \in B_{b'}$ for $b' = b + \lfloor \frac{t-1}{\ell} \rfloor$, we have
$$\big|\{(i, j) \in D_b \cup \widetilde D_b \mid j \le (b-1)\ell + t\}\big| \le |D_b| + (b' - b) \cdot 2\delta d \ell \le \delta d \ell + \frac{t-1}{\ell} \cdot 2\delta d \ell \le 3\delta d t.$$
This finishes all the cases of Lemma C.3.
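The scalar inequality $(1+b)^p \le 1 + b^p + p b^{p-1}$ used in the case analysis above is also easy to validate numerically. A minimal sanity check (ours; a check over a grid, not a proof):

```python
import numpy as np

# Check (1 + b)^p <= 1 + b^p + p * b^(p-1) for b > 0 and 1 <= p <= 2.
for p in np.linspace(1.0, 2.0, 21):
    b = np.linspace(1e-6, 50.0, 100000)
    assert np.all((1 + b) ** p <= 1 + b ** p + p * b ** (p - 1) + 1e-9)
```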

The expansion property implies the following useful inequality, which will be used extensively in the proof.

Lemma C.4. For every integer $1 \le b \le k/\ell$, we have
$$\sum_{(i,j) \in D_b \cup \widetilde D_b} A_{ij} |x_j| \le 3\delta (dk)^{1-1/p}.$$

Proof. Since every non-zero entry of $A$ equals $d^{-1/p}$, the left-hand side of the desired inequality is
$$d^{-1/p} \cdot \sum_{(i,j) \in D_b \cup \widetilde D_b} |x_j| = d^{-1/p} \cdot \sum_{j \ge (b-1)\ell + 1} |x_j| \cdot \big|\{i \mid (i, j) \in D_b \cup \widetilde D_b\}\big|.$$
Let us denote by $a_j = \big|\{i \mid (i, j) \in D_b \cup \widetilde D_b\}\big|$ the number of distinct non-zero elements in the $j$-th column of $A$ that share rows with the primary entries $L_b$ of the block $b$. Then, the above sum equals
$$d^{-1/p} \cdot \sum_{j \ge (b-1)\ell + 1} |x_j| \cdot a_j = d^{-1/p} \cdot \sum_{t \ge 1} \big|x_{(b-1)\ell + t}\big| \cdot a_{(b-1)\ell + t}.$$
We now observe that $a_{(b-1)\ell+1} + \cdots + a_{(b-1)\ell+t} \le 3\delta d t$ for every $t \ge 1$ according to Lemma C.3, while at the same time $|x_{(b-1)\ell+t}|$ is assumed to be non-increasing as $t$ increases. Therefore, one can see that the right-hand side of the above sum is maximized when
$$a_{(b-1)\ell+1} = \cdots = a_{(b-1)\ell+t} = \cdots = 3\delta d,$$
and therefore we conclude that
$$\sum_{(i,j) \in D_b \cup \widetilde D_b} A_{ij} |x_j| \le d^{-1/p} \cdot 3\delta d \cdot \|x\|_1 \le 3\delta (dk)^{1-1/p},$$
where the last inequality follows from $\|x\|_1 \le k^{1-1/p} \cdot \|x\|_p = k^{1-1/p}$.

C.3 Bounding Equation (3.1) for b > 1

The following estimate upper bounds the right-hand side of (3.1) for any block $b \ge 1$, but we will eventually use it only for $b > 1$. For $b = 1$, we will need a separate estimate.

Lemma C.5.For every int