
Optimal Oblivious Subspace Embeddings with Near-Optimal Sparsity

Shabarish Chenakkod, University of Michigan, Ann Arbor, MI, USA; Michał Dereziński, University of Michigan, Ann Arbor, MI, USA; Xiaoyu Dong, National University of Singapore, Singapore
Abstract

An oblivious subspace embedding is a random m×n matrix Π such that, for any d-dimensional subspace, with high probability Π preserves the norms of all vectors in that subspace within a 1±ϵ factor. In this work, we give an oblivious subspace embedding with the optimal dimension m=Θ(d/ϵ2) that has a near-optimal sparsity of O~(1/ϵ) non-zero entries per column of Π. This is the first result to nearly match the conjecture of Nelson and Nguyen [FOCS 2013] in terms of the best sparsity attainable by an optimal oblivious subspace embedding, improving on a prior bound of O~(1/ϵ6) non-zeros per column [Chenakkod et al., STOC 2024]. We further extend our approach to the non-oblivious setting, proposing a new family of Leverage Score Sparsified embeddings with Independent Columns, which yield faster runtimes for matrix approximation and regression tasks.

In our analysis, we develop a new method which uses a decoupling argument together with the cumulant method for bounding the edge universality error of isotropic random matrices. To achieve near-optimal sparsity, we combine this general-purpose approach with new trace inequalities that leverage the specific structure of our subspace embedding construction.

Keywords and phrases:
Randomized linear algebra, matrix sketching, subspace embeddings
Category:
Track A: Algorithms, Complexity and Games
Funding:
Shabarish Chenakkod: NSF grant DMS 20544.
Michał Dereziński: NSF grant CCF 2338655.
Copyright and License:
© Shabarish Chenakkod, Michał Dereziński, and Xiaoyu Dong; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Sketching and sampling
; Theory of computation Random projections and metric embeddings
Related Version:
Full Version: https://arxiv.org/abs/2411.08773 [7]
Acknowledgements:
The authors are grateful for the generous help and support of Mark Rudelson throughout the duration of this work.
Editors:
Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis

1 Introduction

Subspace embeddings are one of the most fundamental techniques in dimensionality reduction, with applications in linear regression [28], low-rank approximation [10], clustering [12], and many more (see [32] for an overview). The key idea is to construct a random linear transformation $\Pi\in\mathbb{R}^{m\times n}$ which maps from a large dimension $n$ to a small dimension $m$, while approximately preserving the geometry of all vectors in a low-dimensional subspace. In many applications, such embeddings must be constructed without the knowledge of the subspace they are supposed to preserve, in which case they are called oblivious subspace embeddings.

Definition 1.

A random matrix $\Pi\in\mathbb{R}^{m\times n}$ is an $(\varepsilon,\delta,d)$-oblivious subspace embedding (OSE) if for any $d$-dimensional subspace $T\subseteq\mathbb{R}^{n}$, it holds that

$\mathbb{P}\Big(\forall\,x\in T,\ \ (1-\varepsilon)\|x\|\ \le\ \|\Pi x\|\ \le\ (1+\varepsilon)\|x\|\Big)\ \ge\ 1-\delta.$

The two central concerns in constructing OSEs are: 1) how small can we make the embedding dimension $m$, and 2) how quickly can we apply $\Pi$ to a vector or a matrix. A popular way to address the latter is to use a sparse embedding matrix: If $\Pi$ has at most $s\le m$ non-zero entries per column, then the cost of computing $\Pi x$ equals $O(s\cdot\mathrm{nnz}(x))$, where $\mathrm{nnz}(x)$ denotes the number of non-zero coordinates in $x$. Designing oblivious subspace embeddings that simultaneously optimize the embedding dimension $m$ and the sparsity $s$ has been the subject of a long line of works [10, 25, 26, 3, 11, 6], aimed towards resolving the following conjecture of Nelson and Nguyen [26], which is supported by nearly-matching lower bounds [27, 23].

Conjecture 2 (Nelson and Nguyen, FOCS 2013 [26]).

For any $n\ge d$ and $\varepsilon,\delta\in(0,1)$, there is an $(\varepsilon,\delta,d)$-oblivious subspace embedding $\Pi\in\mathbb{R}^{m\times n}$ with dimension $m=O\big((d+\log 1/\delta)/\varepsilon^{2}\big)$ having $s=O\big(\log(d/\delta)/\varepsilon\big)$ non-zeros per column.

Nelson and Nguyen gave a simple construction that they conjectured would achieve these guarantees: For each column of $\Pi$, place scaled random signs $\pm 1/\sqrt{s}$ in $s$ random locations. They showed that this construction achieves dimension $m=O(d\,\mathrm{polylog}(d)/\varepsilon^{2})$ and sparsity $s=O(\mathrm{polylog}(d)/\varepsilon)$. A number of follow-up works [3, 11] improved on this; most notably, Cohen [11] showed that a sparse OSE can achieve $m=O(d\log(d)/\varepsilon^{2})$ with $s=O(\log(d)/\varepsilon)$. However, none of these guarantees recover the optimal embedding dimension $m=\Theta(d/\varepsilon^{2})$, with the extraneous $\log(d)$ factor arising due to a long-standing limitation in existing matrix concentration techniques [30].
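To make the construction concrete, the following is a minimal NumPy sketch of this conjectured embedding (a dense matrix with fully independent randomness, kept simple for readability); the function name and parameter choices are ours, not the paper's.

```python
import numpy as np

def nelson_nguyen_sketch(m, n, s, rng=np.random.default_rng()):
    """Sparse embedding: each column receives s entries equal to +-1/sqrt(s),
    placed in s distinct rows chosen uniformly at random."""
    Pi = np.zeros((m, n))
    for col in range(n):
        rows = rng.choice(m, size=s, replace=False)   # s random locations
        signs = rng.choice([-1.0, 1.0], size=s)       # random signs
        Pi[rows, col] = signs / np.sqrt(s)            # scaling gives unit column norms
    return Pi

# Example: embed vectors from R^20000 into R^800 with 8 non-zeros per column.
Pi = nelson_nguyen_sketch(m=800, n=20_000, s=8)
```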

This sub-optimality in dimension m was finally addressed in a recent work of Chenakkod, Dereziński, Dong and Rudelson [6], relying on a breakthrough in random matrix universality theory by Brailovskaya and van Handel [4]. They achieved m=Θ(d/ε2), but only with a significantly sub-optimal sparsity s=O~(1/ε6), which is a consequence of how the universality error is measured and analyzed in [4] (here, O~ hides polylogarithmic factors in d/εδ). This raises the following natural question:

Can the optimal dimension m=Θ(d/ε2) be achieved with the conjectured O~(1/ε) sparsity?

We give a positive answer to this question, thus matching Conjecture 2 in dimension m and nearly-matching it in sparsity s. To achieve this, we must substantially depart from the approach of Brailovskaya and van Handel, and as a by-product, develop a new set of tools for matrix universality which are likely of independent interest (see Section 4 for an overview). Remarkably, our result is attained by one of the simple constructions that were originally suggested by Nelson and Nguyen in their conjecture.

Theorem 3 (Oblivious Subspace Embedding).

For any $n\ge d$ and $\varepsilon,\delta\in(0,1)$ such that $\frac{1}{\varepsilon\delta}\le\mathrm{poly}(d)$, there is an $(\varepsilon,\delta,d)$-oblivious subspace embedding $\Pi\in\mathbb{R}^{m\times n}$ with $m=O(d/\varepsilon^{2})$ having $s=\tilde O(1/\varepsilon)$ non-zeros per column.

Many applications of subspace embeddings arise in matrix approximation [32] where, given a large tall matrix $A\in\mathbb{R}^{n\times d}$, we seek a smaller $\tilde A\in\mathbb{R}^{m\times d}$ such that $\|\tilde Ax\|=(1\pm\varepsilon)\|Ax\|$ for all $x\in\mathbb{R}^{d}$. Naturally, this can be accomplished with an $(\varepsilon,\delta,d)$-OSE matrix $\Pi\in\mathbb{R}^{m\times n}$, by computing $\tilde A=\Pi A$ in time $\tilde O(\mathrm{nnz}(A)/\varepsilon)$ and considering the column subspace of $A$. However, given direct access to $A$, one may hope to get true input sparsity time $O(\mathrm{nnz}(A))$ by leveraging the fact that the embedding need not be oblivious.

To that end, we adapt our subspace embedding construction, so that it can be made even sparser given additional information about the leverage scores of matrix $A$. The $i$th leverage score of $A$ is defined as the squared norm of the $i$th row of the matrix obtained by orthonormalizing the columns of $A$ [21]. We show that if the $i$th leverage score of $A$ is bounded by $l_{i}\in[0,1]$, then the $i$th column of $\Pi$ needs only $\max\{1,\tilde O(l_{i}/\varepsilon)\}$ non-zero entries. Since the leverage scores of $A$ can be approximated quickly [19], this leads to our new algorithm, Leverage Score Sparsified embedding with Independent Columns (LESS-IC), which is inspired by related constructions that use LESS with independent rows [17, 16, 15].
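As a concrete illustration of these quantities (our own sketch, not an algorithm from the paper), the exact leverage scores of $A$ are the squared row norms of an orthonormal basis for its column span, and a per-column non-zero budget of the form $\max\{1, O(l_i/\varepsilon)\}$ can then be assigned; the constant below is a placeholder and polylogarithmic factors are omitted.

```python
import numpy as np

def leverage_scores(A):
    """Exact leverage scores: squared row norms of an orthonormal basis Q
    of the column span of A (here computed via a thin QR decomposition)."""
    Q, _ = np.linalg.qr(A)
    return np.sum(Q**2, axis=1)        # l_i = ||i-th row of Q||^2; sums to rank(A)

def column_budgets(levs, eps, c=2.0):
    """Hypothetical per-column non-zero budgets max{1, O(l_i / eps)}."""
    return np.maximum(1, np.ceil(c * levs / eps)).astype(int)

rng = np.random.default_rng(0)
A = rng.standard_normal((100_000, 20))
levs = leverage_scores(A)
print(levs.sum())                            # ~ d = 20
print(column_budgets(levs, eps=0.1).sum())   # total non-zeros ~ n + O(d / eps)
```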

Just like recent prior works [8, 9, 6], our algorithm for constructing a subspace embedding from a matrix A incurs a preprocessing cost of O(nnz(A)+dω) required for approximating the leverage scores (here, ω is the matrix multiplication exponent). However, our approach significantly improves on these prior works in the poly(d/ε) embedding cost, leading to matching speedups in downstream applications such as constrained/regularized least squares [6].

Theorem 4 (Fast Subspace Embedding).

Given $A\in\mathbb{R}^{n\times d}$, $\varepsilon,\gamma\in(0,1)$ and $1/\varepsilon\le\mathrm{poly}(d)$, in

$O\big(\gamma^{-1}\,\mathrm{nnz}(A)+d^{\omega}+\varepsilon^{-1}d^{2+\gamma}\,\mathrm{polylog}(d)\big)$ time

we can compute $\tilde A\in\mathbb{R}^{m\times d}$ such that $m=O(d/\varepsilon^{2})$ and, with probability 0.99,

$(1-\varepsilon)\|Ax\|\ \le\ \|\tilde Ax\|\ \le\ (1+\varepsilon)\|Ax\|\qquad\forall\,x\in\mathbb{R}^{d}.$

This is a direct improvement over the previous best known runtime for constructing an optimal subspace embedding [6], which suffers an additional $\tilde O(d^{2+\gamma}/\varepsilon^{6})$ cost due to their sub-optimal sparsity. Remarkably, our result is also the first to achieve $\tilde O(d^{2+\gamma}/\varepsilon)$ dependence even if we allow a sub-optimal dimension, i.e., $m=O(d\log(d)/\varepsilon^{2})$. Here, the previous best time [8, 9] has an additional $\tilde O(d^{2+\gamma}/\varepsilon^{2})$ cost, due to using a two-stage leverage score sampling scheme in place of a sparse embedding matrix. Our new LESS-IC embedding is crucial in achieving the right dependence on $\varepsilon$, as neither of the previous constructions appears capable of overcoming the $\Omega(d^{2+\gamma}/\varepsilon^{2})$ barrier.

As an example application of our results, we show how our fast subspace embedding construction can be used to speed up reductions for a wide class of optimization problems based on constrained or regularized least squares regression, including Lasso regression [3]. The following corollary follows immediately from Theorem 4, and is a direct improvement over Theorem 1.8 of [6] in terms of the runtime dependence on $\epsilon$, from $\tilde O(d^{2+\gamma}/\epsilon^{6})$ to $\tilde O(d^{2+\gamma}/\epsilon)$, while achieving a matching $O(d/\epsilon^{2})\times d$ reduction.

Corollary 5 (Fast reduction for constrained least squares).

Given $A\in\mathbb{R}^{n\times d}$, $b\in\mathbb{R}^{n}$, $\epsilon>0$, a function $g:\mathbb{R}^{d}\to\mathbb{R}_{\ge 0}$ and a set $\mathcal{C}\subseteq\mathbb{R}^{d}$, consider an $n\times d$ problem $\mathrm{LS}_{\mathcal{C},g}(A,b,\epsilon)$:

Find $\tilde x$ such that $f(\tilde x)\le(1+\epsilon)\min_{x\in\mathcal{C}}f(x)$, where $f(x)=\|Ax-b\|_{2}^{2}+g(x)$.

There is an algorithm that reduces this problem to an $O(d/\epsilon^{2})\times d$ instance $\mathrm{LS}_{\mathcal{C},g}(\tilde A,\tilde b,0.1\epsilon)$ in $O\big(\gamma^{-1}\,\mathrm{nnz}(A)+d^{\omega}+\epsilon^{-1}d^{2+\gamma}\,\mathrm{polylog}(d)\big)$ time.

2 Related Work

Subspace embeddings have played a central role in the area of randomized linear algebra ever since the work of Sarlos [28] (for an overview, see the following surveys and monographs [32, 20, 24, 18]). Initially, these approaches focused on leveraging fast Hadamard transforms [2, 29] to achieve improved time complexity for linear algebraic tasks such as linear regression and low-rank approximation. Clarkson and Woodruff [10] were the first to propose a sparse subspace embedding matrix, the CountSketch, which has exactly one non-zero entry per column but does not recover the optimal embedding dimension guarantee. Before this, the idea of using a sparse random matrix for dimensionality reduction was successfully employed in the context of Johnson-Lindenstrauss embeddings [14, 22], which seek to preserve the geometry of a finite set, as opposed to an entire subspace.

In addition to the aforementioned efforts in improving sparse subspace embeddings [10, 25, 26, 3, 11, 6], some works have aimed to develop fast subspace embeddings that achieve optimal embedding dimension either without sparsity [8, 9], under additional assumptions [5], or with one-sided embedding bounds [31]. Our time complexity result, Theorem 4, improves on all of these in terms of the dependence on ε, thanks to a combination of our new analysis techniques and the new LESS-IC construction.

3 Main Results

In this section, we define the subspace embedding constructions used in our results, and provide detailed statements of our theorems.

As is customary in the literature, we shall work with an equivalent form of the subspace embedding guarantee from Definition 1, which frames this problem as a characterization of the extreme singular values of a class of random matrices. Namely, consider a deterministic $n\times d$ matrix $U$ with orthonormal columns that form the basis of a $d$-dimensional subspace $T$. Then, a random matrix $\Pi\in\mathbb{R}^{m\times n}$ is an $(\varepsilon,\delta,d)$-subspace embedding for $T$ if and only if all of the singular values of the matrix $\Pi U$ lie in $[1-\varepsilon,\,1+\varepsilon]$ with probability $1-\delta$, i.e.,

$\Pr\big(1-\varepsilon\ \le\ s_{\min}(\Pi U)\ \le\ s_{\max}(\Pi U)\ \le\ 1+\varepsilon\big)\ \ge\ 1-\delta,$ (1)

where smin and smax denote the smallest and largest singular values. To ensure that Π is an oblivious subspace embedding, we must therefore ensure (1) for the family of all random matrices of the form ΠU, where U is any n×d matrix with orthonormal columns.
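For completeness, the calculation behind this equivalence is immediate: every $x\in T$ can be written as $x=Uz$ with $\|x\|=\|z\|$ (since $U$ has orthonormal columns), so

$\frac{\|\Pi x\|}{\|x\|}\ =\ \frac{\|\Pi Uz\|}{\|z\|}\ \in\ \big[s_{\min}(\Pi U),\,s_{\max}(\Pi U)\big]\qquad\text{for all }x=Uz\in T\setminus\{0\},$

and both endpoints are attained for suitable $z$, so the event in Definition 1 holds exactly when all singular values of $\Pi U$ lie in $[1-\varepsilon,\,1+\varepsilon]$.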

3.1 Oblivious Subspace Embeddings

Our subspace embedding guarantees are achieved by a family of OSEs which have a fixed number of non-zero entries in each column, a key property that was also required of the sparse OSE distributions called OSNAP described by Nelson and Nguyen [26]. As we explain later, our analysis techniques apply to other natural families of sparse embedding distributions, including those with i.i.d. entries [1]; however, the OSNAP-style construction is crucial for achieving the near-optimal sparsity $s=\tilde O(1/\varepsilon)$.

In our construction of the $m\times n$ OSE matrix $\Pi$, we start by defining an unscaled version of the matrix, called $S$, which has entries in $\{-1,0,1\}$. We then scale $S$ to appropriately normalize the entry-wise variances, obtaining $\Pi$. Concretely, we wish to obtain an $m\times n$ sparse random matrix $S$ which has exactly $s$ non-zero $\pm 1$ entries in each column. Assume $s$ exactly divides $m$. Then we can divide each column of $S$ into $s$ subcolumns and randomly populate one entry in each subcolumn by a Rademacher random variable (see Figure 1). We call this family of distributions (unscaled) OSNAP, carrying over Nelson and Nguyen's terminology (technically, their definition is somewhat broader than ours).

Figure 1: An example of a column divided into s=3 subcolumns with each subcolumn having exactly one non-zero entry in a random position.

Each non-zero entry in the matrix $S$ can be identified by a tuple $(l,\gamma)\in[n]\times[s]$, where $l$ identifies the column of the non-zero entry and $\gamma$ is the index of the entry in that column. Thus the $(l,\gamma)$th non-zero entry in $S$ is located in column $l$ and row $\mu(l,\gamma)$, where $\mu(l,\gamma)$ is a uniformly chosen integer from the interval $[(m/s)(\gamma-1)+1:(m/s)\gamma]$. For example, the $(1,1)$th non-zero entry in $S$ is located in column 1 and some row in the interval $[1:m/s]$. An $m\times n$ matrix with a non-zero entry in column $l$ and row $\mu(l,\gamma)$ is given by $e_{\mu(l,\gamma)}e_{l}^{T}$, where $e_{\mu(l,\gamma)}$ and $e_{l}$ represent standard basis vectors in $\mathbb{R}^{m}$ and $\mathbb{R}^{n}$ respectively, and for $S$ we wish to place a random sign $\xi_{(l,\gamma)}$ at this position. This motivates our formal definition for OSNAP,

Definition 6 (OSNAP).

An $m\times n$ random matrix $S$ is called an unscaled oblivious sparse norm-approximating projection with $K$-wise independent subcolumns ($K$-wise independent unscaled OSNAP) with parameters $p,\varepsilon,\delta\in(0,1]$ such that $s=pm$ divides $m$ if,

$S\ =\ \sum_{l=1}^{n}\sum_{\gamma=1}^{s}\xi_{(l,\gamma)}\,e_{\mu(l,\gamma)}e_{l}^{T}$

where,

  • $\{\xi_{(l,\gamma)}\}_{l\in[n],\gamma\in[s]}$ is a collection of $K$-wise independent Rademacher random variables.

  • $\{\mu(l,\gamma)\}_{l\in[n],\gamma\in[s]}$ is a collection of $K$-wise independent random variables such that each $\mu(l,\gamma)$ is uniformly distributed in $[(m/s)(\gamma-1)+1:(m/s)\gamma]$.

  • The collection $\{\xi_{(l,\gamma)}\}_{l\in[n],\gamma\in[s]}$ is independent from the collection $\{\mu(l,\gamma)\}_{l\in[n],\gamma\in[s]}$.

In this case, $\Pi=(1/\sqrt{pm})\,S$ is called a $K$-wise independent OSNAP with parameters $p,\varepsilon,\delta$. In addition, if all the random variables in the collections $\{\xi_{(l,\gamma)}\}_{l\in[n],\gamma\in[s]}$ and $\{\mu(l,\gamma)\}_{l\in[n],\gamma\in[s]}$ are fully independent, then $S$ is called a fully independent unscaled OSNAP and $\Pi$ is called a fully independent OSNAP.

Thus, each column of the OSNAP matrix $\Pi$ has $s=pm$ many non-zero entries, and the sparsity level can be varied by setting the parameter $p\in[0,1]$ appropriately. With the distribution formally defined, we now provide the full statement of our subspace embedding guarantee for OSNAP,
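For concreteness, the following NumPy sketch (our own illustration, assuming fully independent randomness and that $s$ divides $m$) samples an unscaled OSNAP $S$ and the corresponding $\Pi=(1/\sqrt{pm})S=S/\sqrt{s}$ from Definition 6.

```python
import numpy as np

def osnap(m, n, s, rng=np.random.default_rng()):
    """Fully independent unscaled OSNAP: each column is split into s subcolumns
    of height m/s, and each subcolumn gets one Rademacher entry in a uniformly
    random row. Returns S and Pi = S / sqrt(pm) = S / sqrt(s)."""
    assert m % s == 0, "s must divide m"
    h = m // s                                    # subcolumn height m/s
    S = np.zeros((m, n))
    for l in range(n):                            # column index l
        for gamma in range(s):                    # subcolumn index gamma
            mu = gamma * h + rng.integers(h)      # row mu(l, gamma) inside the subcolumn
            S[mu, l] = rng.choice([-1.0, 1.0])    # Rademacher sign xi(l, gamma)
    return S, S / np.sqrt(s)

S, Pi = osnap(m=600, n=5_000, s=6)
assert (np.count_nonzero(S, axis=0) == 6).all()   # exactly s non-zeros per column
```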

Theorem 7 (Subspace Embedding Guarantee for OSNAP).

Let $\Pi=(1/\sqrt{pm})\,S$ be an $m\times n$ matrix distributed according to the $8\log(\frac{d}{\varepsilon\delta})$-wise independent OSNAP distribution with parameter $p$. Let $U$ be an arbitrary $n\times d$ deterministic matrix such that $U^{T}U=I$. Then, there exist positive constants $c_{7.1}$ and $c_{7.2}$ such that for any $0<\delta,\varepsilon<1$ and $d>10$, we have

$\mathbb{P}\big(1-\varepsilon\ \le\ s_{\min}(\Pi U)\ \le\ s_{\max}(\Pi U)\ \le\ 1+\varepsilon\big)\ \ge\ 1-\delta$

if the embedding dimension satisfies $m\ge c_{7.1}\big(d+\log(1/\delta\varepsilon)\big)/\varepsilon^{2}$ and the sparsity $s=pm$ satisfies $s\ge\min\big\{c_{7.2}\big(\log^{2}(\frac{d}{\varepsilon\delta})/\varepsilon+\log^{3}(\frac{d}{\varepsilon\delta})\big),\,m\big\}$ non-zeros per column.

 Remark 8.

We note that if $1/\varepsilon$ is polynomial in $d/\delta$, i.e., $\varepsilon^{-1}\le(d/\delta)^{K}$ for some absolute constant $K\ge 1$, then the $\log(1/\varepsilon)$ term in $\log(\frac{d}{\varepsilon\delta})=\log(d/\delta)+\log(1/\varepsilon)$ is dominated by $\log(d/\delta)$. In this case, our requirement will become

$pm\ \ge\ \min\Big\{C(K)\Big(\frac{(\log(d/\delta))^{2}}{\varepsilon}+(\log(d/\delta))^{3}\Big),\ m\Big\}$

for some constant $C(K)$ depending only on $K$. A weaker lower bound on $\varepsilon$, namely $\varepsilon>1/e^{d}$, is sufficient to reduce the requirement on $m$ to:

$m\ \ge\ 2c_{7.1}\,\frac{d+\log(1/\delta)}{\varepsilon^{2}}.$

This is a direct improvement over Theorem 1.2 of [6], which requires sparsity $s\ge c\log^{4}(d/\delta)/\varepsilon^{6}$, where $c$ is an absolute constant, with the same condition on $m$. The primary gain lies in the polynomial dependence on $1/\varepsilon$, but we note that our result also achieves a better logarithmic dependence on $d$, which means that an improvement is obtained even for $\varepsilon=\Theta(1)$.

Our techniques can be used to obtain a similar result for a simple OSE model with i.i.d. sparse Rademacher entries [1], which was also considered by [6]. However, in this case, we need an additional requirement of $s=pm\ge c\log(\frac{d}{\varepsilon\delta})/\varepsilon^{2}$ for the sparsity (see Section 9 of the technical report for details; this is again a direct improvement over a result of [6]).

 Remark 9.

The $1/\varepsilon^{2}$ factor in the column sparsity of an OSE model with i.i.d. entries is unavoidable. To see why, let

$U=\begin{bmatrix}I_{d}\\ 0\end{bmatrix},\qquad \Pi=\frac{1}{\sqrt{pm}}\,S,$

and note that $\sigma_{\min}(\Pi U),\sigma_{\max}(\Pi U)\in[1-\varepsilon,\,1+\varepsilon]$ forces, for every $j\le d$,

$\big|\,\|\Pi e_{j}\|_{2}^{2}-1\,\big|\ =\ \Big|\frac{N_{j}}{pm}-1\Big|\ \le\ \varepsilon,\qquad N_{j}:=\mathrm{nnz}(Se_{j})\sim\mathrm{Binomial}(m,p).$ (2)

Set

$Z:=\frac{N_{j}-pm}{\sqrt{mp(1-p)}},\qquad a:=\frac{\varepsilon\sqrt{mp}}{\sqrt{1-p}}\ \le\ \sqrt{2}\,\varepsilon\sqrt{mp}\quad(\text{for }p\le\tfrac{1}{2}).$

Condition (2) is equivalent to $|Z|\le a$. With $F_{Z}(x)=\Pr[Z\le x]$ and $\Phi$ the standard normal cumulative distribution function, the Berry-Esseen theorem gives

$\sup_{x}\,|F_{Z}(x)-\Phi(x)|\ \le\ \frac{6}{\sqrt{mp}}.$

Hence

$\Pr[|Z|\le a]\ =\ F_{Z}(a)-F_{Z}(-a)\ \le\ \big(\Phi(a)-\Phi(-a)\big)+\frac{12}{\sqrt{mp}}.$

Using $\Phi(a)-\Phi(-a)=2\int_{0}^{a}\phi(t)\,dt\le a\sqrt{2/\pi}$ and the bound on $a$, we have

$\Pr\Big(\big|\,\|\Pi e_{j}\|_{2}^{2}-1\,\big|\le\varepsilon\Big)\ \le\ a\sqrt{2/\pi}+\frac{12}{\sqrt{mp}}\ \le\ \frac{2}{\sqrt{\pi}}\,\varepsilon\sqrt{mp}+\frac{12}{\sqrt{mp}}.$

By general lower bounds for OSEs, we know that, when $\varepsilon\to 0$, we need $pm\to\infty$, and therefore $\frac{12}{\sqrt{mp}}\to 0$.

Therefore, for small enough $\varepsilon$, if $pm<c/\varepsilon^{2}$ with $c:=\frac{1}{81}$, the right-hand side is $<\frac{1}{3}$. Thus any OSE-IE that succeeds with constant probability must satisfy $pm=\Omega(\varepsilon^{-2})$.
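A quick simulation (our own illustration) makes the obstruction visible: in the i.i.d. model, the number of non-zeros per column is $\mathrm{Binomial}(m,p)$, so $\|\Pi e_{j}\|_{2}^{2}$ stays within $\varepsilon$ of 1 with constant probability only once $pm$ reaches the order of $1/\varepsilon^{2}$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, eps, trials = 100_000, 0.05, 20_000            # here 1/eps^2 = 400

for pm in (50, 400, 3200):
    N = rng.binomial(m, pm / m, size=trials)      # non-zeros in one column, N_j ~ Binomial(m, p)
    frac = np.mean(np.abs(N / pm - 1) <= eps)     # fraction with | ||Pi e_j||_2^2 - 1 | <= eps
    print(f"pm = {pm:4d}:  P(column norm within eps of 1) ~ {frac:.2f}")
```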

3.2 Characterization via a Moment Property

Our proof techniques for Theorem 7 are based on the moment method, and thus, they naturally imply the following slightly stronger moment-based characterization of an oblivious subspace embedding, which was proposed by [13] as an extension of the corresponding moment-based characterization of a Johnson-Lindenstrauss embedding [22].

Definition 10.

A distribution $\mathcal D$ over $\mathbb{R}^{m\times n}$ has $(\varepsilon,\delta,d,\ell)$-OSE moments if, for all matrices $U\in\mathbb{R}^{n\times d}$ with orthonormal columns,

$\mathbb{E}_{\Pi\sim\mathcal D}\,\big\|(\Pi U)^{T}(\Pi U)-I\big\|^{\ell}\ <\ \varepsilon^{\ell}\,\delta.$

Note that a simple application of Markov's inequality recovers the guarantee in Definition 1 from the $(\varepsilon,\delta,d,\ell)$-OSE moments property with any $\ell\ge 1$. Moreover, [13] showed that this moment-based OSE characterization implies several other desirable guarantees of embedding matrices in the context of approximate matrix multiplication, generalized regression and low-rank approximation.
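Concretely, for any $\ell\ge 1$, Markov's inequality applied to the $\ell$-th power of the embedding error gives

$\mathbb{P}\Big(\big\|(\Pi U)^{T}(\Pi U)-I\big\|\ \ge\ \varepsilon\Big)\ \le\ \frac{\mathbb{E}_{\Pi\sim\mathcal D}\big\|(\Pi U)^{T}(\Pi U)-I\big\|^{\ell}}{\varepsilon^{\ell}}\ <\ \delta,$

and $\|(\Pi U)^{T}(\Pi U)-I\|\le\varepsilon$ implies that all singular values of $\Pi U$ lie in $[\sqrt{1-\varepsilon},\sqrt{1+\varepsilon}]\subseteq[1-\varepsilon,1+\varepsilon]$, which is the guarantee (1).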

As an immediate consequence of our analysis, we obtain the following OSE moment guarantee for the OSNAP distribution.

Corollary 11.

Let $\Pi$ be an $m\times n$ matrix with an OSNAP distribution having sparsity $s$. Let $0<\delta,\varepsilon<1$ and $d>10$. Then $\Pi$ has $(\varepsilon,\delta,d,\ell)$-OSE moments with $\ell=16\log(\frac{d}{\varepsilon\delta})$ when $m\ge c_{11.1}\big(d+\log(1/\delta\varepsilon)\big)/\varepsilon^{2}$ and $s\ge\min\big\{c_{11.2}\big(\log^{2}(\frac{d}{\varepsilon\delta})/\varepsilon+\log^{3}(\frac{d}{\varepsilon\delta})\big),\,m\big\}$.

 Remark 12.

$\Pi$ can be applied to a matrix $A$ in time $O\big(\mathrm{nnz}(A)\big(\log^{2}(\frac{d}{\varepsilon\delta})/\varepsilon+\log^{3}(\frac{d}{\varepsilon\delta})\big)\big)$. As noted by [13, Remark 3], such runtimes can be further refined by chaining together several embeddings with an OSE moment property. For example, [11] showed that OSNAP with $m=O(d\log(d/\delta)/\varepsilon^{2})$ and $s=O(\log(d/\delta)/\varepsilon)$ has $(\varepsilon,\delta,d,\log(d/\delta))$-OSE moments. Thus, letting $\varepsilon=\Theta(1)$ for simplicity, we can combine an $O(d\log(d/\delta))\times n$ OSNAP matrix $\Pi_{1}$ having sparsity $s=O(\log(d/\delta))$ together with an $O(d+\log(1/\delta))\times O(d\log(d/\delta))$ OSNAP matrix $\Pi_{2}$ having sparsity $s=O(\log^{3}(d/\delta))$ to obtain $\Pi=\Pi_{2}\Pi_{1}$ with $(\Theta(1),\delta,d,\log(d/\delta))$-OSE moments, which can be applied to a matrix $A\in\mathbb{R}^{n\times d}$ in time $O\big(\mathrm{nnz}(A)\log(d/\delta)+d^{2}\log^{4}(d/\delta)\big)$.

3.3 Leverage Score Sparsified Embedding with Independent Columns

In a related problem, we seek to embed a subspace given by a fixed Un×d, with information about the squared row norms of U being used to define the distribution of non-zero entries in Π. Such distributions for Π are called non-oblivious (a.k.a. data-aware) subspace embeddings. Previous work [6] has dealt with one such family of distributions termed LESS embeddings [17, 16, 15], showing that they require O~(1/ε4) non-zero entries per row of Π to obtain an ε-embedding guarantee. Since the embedding matrix is very wide, this leads to a much sparser embedding (sparser than any OSE) that can be applied in time sublinear in the input size, leading to fast subspace embedding algorithms.

In this work, we show that our new techniques also extend to LESS embeddings and enable us to prove sharper sparsity estimates than [6]. To fully leverage our approach, we define a new type of sparse embedding (LESS-IC), which can be viewed as a cross between CountSketch and LESS. Here, IC stands for independent columns. At a high level, the CountSketch part ensures that we can use our decoupling method to achieve optimal dependence on 1/ε, while the LESS part enables adaptivity to a fixed subspace.

Specifically, a LESS-IC embedding matrix Π has a fixed number of non-zero entries in each column, chosen so that it is proportional to the leverage score (i.e. the squared row norm) of the corresponding row of U. This is achieved by modifying the OSNAP distribution such that the number of subcolumns is no longer the same in each column. For columns corresponding to very small leverage scores, we only have one “subcolumn”. Thus, each column has at least one non-zero entry. This means that the cost of applying LESS-IC to an n×d matrix A can no longer be sublinear (like it can in the existing LESS embedding constructions), but rather has a fixed linear term of O(nnz(A)), plus an additional sublinear term. Given that the preprocessing step of approximating the leverage scores has to take at least nnz(A) time, the linear term in the cost of applying LESS-IC is negligible.

To generate an embedding matrix with the LESS-IC distribution, it suffices to have a good enough approximation for the leverage scores of the matrix U, in the following sense.

Definition 13 (Approximate Leverage Scores).

Given a matrix $U\in\mathbb{R}^{n\times d}$ with orthonormal columns and $\beta_{1}\ge 1,\beta_{2}\ge 1$, a tuple $(l_{1},\dots,l_{n})\in[0,1]^{n}$ of numbers are $(\beta_{1},\beta_{2})$-approximate leverage scores for $U$ if, for $1\le i\le n$,

$\|e_{i}^{T}U\|^{2}\ \le\ \beta_{1}l_{i}\qquad\text{and}\qquad \sum_{i=1}^{n}l_{i}\ \le\ \beta_{2}\sum_{i=1}^{n}\|e_{i}^{T}U\|^{2}\ =\ \beta_{2}\,d.$

We say that the numbers $(l_{1},\dots,l_{n})\in[0,1]^{n}$ are $\beta$-approximations of the leverage scores (i.e. squared row norms) of $U$ with $\beta=\beta_{1}\beta_{2}$.

To see how approximate leverage scores determine the distribution of entries in the LESS-IC distribution, let us first consider a simpler distribution, LESS-IE from [6], based on a similar construction first proposed by [17]. Here, we once again start by defining an unscaled matrix S, which is then normalized to obtain the subspace embedding matrix Π.

Definition 14 (LESS-IE).

An $m\times n$ random matrix $S$ is called an unscaled leverage score sparsified embedding with independent entries (unscaled LESS-IE), and also $\Pi=(1/\sqrt{pm})\,S$ is called a LESS-IE, corresponding to $(\beta_{1},\beta_{2})$-approximate leverage scores $(z_{1},\dots,z_{n})$ with parameter $p$, if $S$ has entries $s_{i,j}=\frac{1}{\sqrt{\beta_{1}z_{j}}}\,\delta_{i,j}\,\xi_{i,j}$, where $\delta_{i,j}$ are independent Bernoulli random variables taking value 1 with probability $p_{ij}=\beta_{1}z_{j}p$, whereas $\xi_{i,j}$ are i.i.d. Rademacher random variables.

In the LESS-IE model, we have $\beta_{1}pmz_{j}$ many non-zero entries in column $j$ in expectation. However, to achieve $1/\varepsilon$ dependency of the sparsity, we need to have exactly $\beta_{1}pmz_{j}$ many non-zero entries in the column in the LESS-IC model, to fully take advantage of the error cancellation that occurs in our decoupling argument (see Section 7.2 and Section 9.1 of the technical report; though these sections deal with oblivious subspace embeddings, the same arguments still apply in the LESS case). This is done by modifying the OSNAP construction so that the size (and consequently, the number) of subcolumns is different across columns.

Notice that to have $\beta_{1}pmz_{j}$ many non-zero entries in column $j$, we would need $\beta_{1}pmz_{j}$ many subcolumns in column $j$, each with one non-zero entry in a random position. This means that the size of each subcolumn needs to be $m/(\beta_{1}pmz_{j})=1/(\beta_{1}pz_{j})$. However, since $1/(\beta_{1}pz_{j})$ may not be an integer, we consider subcolumns of size $b_{j}:=\max\{\lfloor 1/(\beta_{1}pz_{j})\rfloor,\,1\}$.

In column j, we stack subcolumns of size bj until we fill up all the rows up to m. Let sj be the smallest number of subcolumns to do this. Then, it may happen that the row indices of the bottom-most subcolumn exceed m. For example, consider the distribution on the first column of Π when m=70, and b1=15. In this case s1=5, so we can stack four subcolumns of size 15 and the 5th subcolumn only spans row indices [61:70]. In each subcolumn, we randomly choose a row to place a non-zero entry, which would be a Rademacher random variable. (See Figure 2). The non-zero entries are appropriately scaled so that all entries of the matrix have the same variance (See Section 8 of the technical report for the full definition).
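The following NumPy sketch (our own illustration; the scaling of the non-zero entries, which is specified in Section 8 of the technical report, is omitted) lays out a single LESS-IC column: compute the subcolumn height $b_{j}$, stack $s_{j}$ subcolumns with the last one truncated at row $m$, and place one random sign per subcolumn.

```python
import numpy as np

def less_ic_column(m, p, z_j, beta1=1.0, rng=np.random.default_rng()):
    """Row positions and signs of the non-zeros in column j of the unscaled
    LESS-IC matrix, given the approximate leverage score z_j (entry scaling omitted)."""
    b = max(int(np.floor(1.0 / (beta1 * p * z_j))), 1)   # subcolumn height b_j
    s_j = int(np.ceil(m / b))                            # smallest number of subcolumns covering m rows
    rows, signs = [], []
    for gamma in range(s_j):
        lo, hi = gamma * b, min((gamma + 1) * b, m)      # the last subcolumn is truncated at m
        rows.append(rng.integers(lo, hi))                # one non-zero per subcolumn
        signs.append(rng.choice([-1.0, 1.0]))            # Rademacher sign
    return np.array(rows), np.array(signs)

# The example from the text: m = 70 and b_1 = 15 give s_1 = 5 subcolumns,
# with the fifth spanning rows 61..70 (0-indexed 60..69 here).
rows, signs = less_ic_column(m=70, p=1/15, z_j=1.0)
print(len(rows), rows)
```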

Figure 2: In the LESS-IC distribution, column j is filled with sj many subcolumns, with the bottom-most subcolumn truncated to fit the size of Π. Each subcolumn has one non-zero entry. Notice that as the leverage scores decrease, the number of subcolumns decreases and the matrix becomes sparser. However, each column always has at least one non-zero entry.

For the LESS-IC distribution, we show the following subspace embedding guarantee. The structure of the proof is similar to the case of OSNAP, and only the specific expressions change due to the different distribution.

Theorem 15 (Subspace Embedding Guarantee for LESS-IC).

Let $\Pi=(1/\sqrt{pm})\,S$ be an $m\times n$ matrix distributed according to the $8\log(\frac{d}{\varepsilon\delta})$-wise independent LESS-IC distribution with parameter $p$ for some fixed $n\times d$ matrix $U$ satisfying $U^{T}U=I$ with given $(\beta_{1},\beta_{2})$-approximate leverage scores. Then, there exist positive constants $c_{15.1}$ and $c_{15.2}$ such that for any $0<\varepsilon,\delta<1$ and $d>10$, we have

$\mathbb{P}\big(1-\varepsilon\ \le\ s_{\min}(\Pi U)\ \le\ s_{\max}(\Pi U)\ \le\ 1+\varepsilon\big)\ \ge\ 1-\delta$

when $m\ \ge\ c_{15.1}\Big(\frac{d+\log^{2}(d/\delta)+\log(1/\varepsilon)}{\varepsilon^{2}}+\frac{\log^{3}(d/\delta)}{\varepsilon}\Big)$ and

$c_{15.2}\,\max\Big\{\frac{(\log(\frac{d}{\varepsilon\delta}))^{2.5}}{\varepsilon},\ \big(\log(\tfrac{d}{\varepsilon\delta})\big)^{3}\Big\}\ \le\ pm\ \le\ m.$

The matrix $\Pi$ has $O(n+\beta pmd)$ many non-zero entries and can be applied to an $n\times d$ matrix $A$ in $O(\mathrm{nnz}(A)+\beta pmd^{2})$ time, where $\beta=\beta_{1}\beta_{2}$ is the leverage score approximation factor.

 Remark 16.

When $\delta=d^{-O(1)}$, we recover the optimal dimension $m=\Theta(d/\varepsilon^{2})$ while showing that one can apply the LESS-IC embedding in time $O(\mathrm{nnz}(A))+\tilde O(\beta d^{2}/\varepsilon)$. In comparison, [6] showed that a corresponding LESS-IE embedding can be applied in $\tilde O(\beta d^{2}/\varepsilon^{6})$ time. Using our techniques, one could improve the runtime of LESS-IE to $\tilde O(\beta d^{2}/\varepsilon^{2})$, but our new LESS-IC construction appears necessary to recover the best dependence on $1/\varepsilon$.

3.4 Fast Subspace Embedding (Proof of Theorem 4)

Here, we briefly outline how our LESS-IC embedding yields a fast subspace embedding construction to recover the time complexity claimed in Theorem 4. This follows analogously to the construction from Theorem 1.6 of [6], and our improvement in the dependence on 1/ε compared to their result (from 1/ε6 to 1/ε) stems from the improved sparsity of our LESS-IC embedding.

The key preprocessing step for applying the LESS-IC embedding is approximating the leverage scores of the matrix $A$. Using Lemma 5.1 in [6] (adapted from Lemma 7.2 in [8]), we can construct coarse approximations of all leverage scores so that $\beta_{1}=O(n^{\gamma})$ and $\beta_{2}=O(1)$ in time $O(\gamma^{-1}(\mathrm{nnz}(A)+d^{2})+d^{\omega})$. Applying LESS-IC (Theorem 15) with these leverage scores and parameters $\beta_{1},\beta_{2}$, computing $\Pi A$ takes $O\big(\mathrm{nnz}(A)+n^{\gamma}d^{2}\log^{3}(\frac{d}{\varepsilon\delta})/\varepsilon\big)$ time, where the $\mathrm{nnz}(A)$ term comes from the fact that every column of $\Pi$ has at least one non-zero, while the second term accounts for the additional $O(\beta d\log^{3}(\frac{d}{\varepsilon\delta})/\varepsilon)$ non-zeros.

Thus, if $d\ge n^{c}$ for, say, $c=0.1$, then we conclude the claim by appropriately scaling $\gamma$ by a constant factor. Now, suppose otherwise. First, note that without loss of generality we can assume that $\gamma<0.1$ (through scaling the time complexity by a constant factor), $\mathrm{nnz}(A)\ge n$ (by removing empty rows) and $\varepsilon\ge\sqrt{d/n}$ (because otherwise $m\ge n$ and we could use $\tilde A=A$). Thus, under our assumption that $d<n^{c}$, we have $n^{\gamma}d^{2}/\varepsilon\le n^{0.5+\gamma+2c}\le n^{0.8}\le\mathrm{nnz}(A)$, and the time complexity is dominated by the $O(\gamma^{-1}\mathrm{nnz}(A))$ term.

Finally, we note that Corollary 5 follows simply by constructing a subspace embedding $\Pi$ via Theorem 4 with respect to the matrix $[A\;\;b]$, and computing $\tilde A=\Pi A$, $\tilde b=\Pi b$. The proof of the claim is identical to the proof of Theorem 1.8 in [6]. Our improvement comes directly from the faster runtime of our subspace embedding construction.
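As a usage illustration of this reduction (our own sketch; a plain Gaussian sketching matrix stands in for the fast LESS-IC construction, so the runtime is not representative), the downstream solver only ever sees the small $m\times d$ instance:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 20_000, 20, 0.5
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)

m = int(8 * d / eps**2)                               # embedding dimension O(d / eps^2)
Pi = rng.standard_normal((m, n)) / np.sqrt(m)         # stand-in dense sketch (not LESS-IC)
A_sk, b_sk = Pi @ A, Pi @ b                           # reduced m x d least-squares instance

x_sk = np.linalg.lstsq(A_sk, b_sk, rcond=None)[0]     # solve the small problem
x_opt = np.linalg.lstsq(A, b, rcond=None)[0]
ratio = np.linalg.norm(A @ x_sk - b) / np.linalg.norm(A @ x_opt - b)
print(f"f(x_sk) / f(x*) = {ratio:.3f}")               # close to 1 + O(eps)
```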

3.5 Outline of the Paper

Section 4 provides a high level overview of the ideas used in the proofs of our main results, Theorem 7 and Theorem 15. Section 5 provides a sketch of the proof of Theorem 7, listing the main technical steps, leaving the full proof with all technical details to Section 7 of the technical report. The proof of Theorem 15 follows similarly and is covered in Section 8 of the technical report. The subspace embedding guarantee for a sparse matrix with independent entries is proved in Section 9 of the technical report.

3.6 Notation

The following notation and terminology will be used in the paper. The notation $[n]$ is used for the set $\{1,2,\dots,n\}$ and the notation $P([n])$ denotes the set of all partitions of $[n]$. Also, for two integers $a$ and $b$ with $a\le b$, we use the notation $[a:b]$ for the set $\{k\in\mathbb{Z}: a\le k\le b\}$. For $x\in\mathbb{R}$, we use the notation $\lfloor x\rfloor$ to denote the greatest integer less than or equal to $x$ and $\lceil x\rceil$ to denote the least integer greater than or equal to $x$. In $\mathbb{R}^{n}$ (or $\mathbb{R}^{m}$ or $\mathbb{R}^{d}$), the $l$th coordinate vector is denoted by $e_{l}$. All matrices considered in this paper are real valued and the space of $m\times n$ matrices with real valued entries is denoted by $M_{m\times n}(\mathbb{R})$. Also, for a matrix $X\in M_{d\times d}(\mathbb{R})$, the notation $\mathrm{Tr}(X)$ denotes the trace of the matrix $X$, and $\operatorname{tr}(X)=\frac{1}{d}\mathrm{Tr}(X)$ denotes the normalized trace. We write the operator norm of a matrix $X$ as $\|X\|$, and it is also denoted by $\|X\|_{\mathrm{op}}$ in some places where other norms appear, for clarity. The spectrum of a matrix $X$ is denoted by $\mathrm{spec}(X)$. The standard probability measure is denoted by $\mathbb{P}$, and the symbol $\mathbb{E}$ means taking the expectation with respect to this standard probability measure. To simplify the notation, we follow the convention from [4] and use the notation $\mathbb{E}[X]^{\alpha}$ for $(\mathbb{E}(X))^{\alpha}$, i.e., when a functional is followed by square brackets, it is applied before any other operations. The covariance of two random variables $X$ and $Y$ is denoted by $\mathrm{Cov}(X,Y)$. The standard $L_{q}$ norm of a random variable $\xi$ is denoted by $\|\xi\|_{q}$, for $1\le q\le\infty$. Throughout the paper, the symbols $c_{1},c_{2},\dots$ and $\mathrm{Const},\mathrm{Const}',\dots$ denote absolute constants.

4 Main Ideas

We next outline our new techniques which are needed to establish the main results, Theorems 7 and 15. Here, for notational convenience, we will refer to the unscaled random matrix $S$, as opposed to the subspace embedding matrix $\Pi=(1/\sqrt{pm})\,S$ (see Definition 6).

Note that due to the equivalent characterization of the OSE property in (1), all we need to show is that the singular values of $SU$ are clustered around $\sqrt{pm}$ at distance $O(\sqrt{pm}\,\varepsilon)$. In other words, we need to show that the difference between the spectrum of $SU$ and the spectrum of $\sqrt{pm}\,\mathrm{Id}$ is small, of the order $O(\sqrt{pm}\,\varepsilon)$.

In all our models, the entries of $S$ are uncorrelated with mean 0 and variance $p$, and therefore the entries of $SU$ are uncorrelated with uniform variance. If we consider a random matrix $G$ with Gaussian entries which keeps the covariance profile of the entries of $SU$, then this Gaussian random matrix $G$ has independent Gaussian entries with variance $p$. Using classical results about singular values of Gaussian random matrices, it can be shown that the singular values of $G$ are sufficiently clustered around $\sqrt{pm}$ with high probability for $m=\Omega(d/\varepsilon^{2})$. Thus, it suffices to find conditions under which the singular values of $SU$ are sufficiently close to the singular values of $G$. This is the phenomenon of universality, whereby random systems show predictable (in this case Gaussian) behavior under certain limits.

Failure of black-box universality.

Recent work by Brailovskaya-van Handel [4] on universality for certain random matrix models developed tools to bound the distance between the spectrum of a random matrix model obtained as a sum of independent random matrices and the spectrum of a Gaussian random matrix with the same covariance profile. Using these tools, [6] achieved optimal embedding dimension $m=O(d/\varepsilon^{2})$ for OSEs by using the bound in [4, Theorem 2.6] to estimate the Hausdorff distance (a concept of distance between two subsets of $\mathbb{R}$; $A,B\subseteq\mathbb{R}$ are said to be $\varepsilon$-close in Hausdorff distance if $A$ is in the $\varepsilon$-neighborhood of $B$ and $B$ is in the $\varepsilon$-neighborhood of $A$) between the spectra of

$\mathrm{sym}(SU)=\begin{bmatrix}0 & (SU)^{T}\\ SU & 0\end{bmatrix}\quad\text{ and }\quad \mathrm{sym}(G)=\begin{bmatrix}0 & G^{T}\\ G & 0\end{bmatrix}.$

This distance is shown to be $\big(O(\sqrt{pm})\big)^{2/3}$, which is of order $\sqrt{pm}\,\varepsilon$ only when $pm$ has $1/\varepsilon^{6}$ dependence. Thus, [6] did not obtain the conjectured dependency of the sparsity on $\varepsilon$, which requires $pm$ to only have $1/\varepsilon$ dependency. To get better $\varepsilon$ dependency, we would either need a sharper bound on the Hausdorff distance, or have the distance decrease with $\varepsilon$. For example, if the $\big(O(\sqrt{pm})\big)^{2/3}$ bound was improved to $\big(O(\sqrt{pm})\big)^{1/2}$, we would only need $(\sqrt{pm})^{1/2}\lesssim\sqrt{pm}\,\varepsilon$, which can be achieved when $pm$ has $1/\varepsilon^{4}$ dependence. On the other hand, if the $\big(O(\sqrt{pm})\big)^{2/3}$ bound was improved to $\big(O(\sqrt{pm})\big)^{2/3}\varepsilon^{1/2}$, we would only need $pm$ to have $1/\varepsilon^{3}$ dependence.

Key idea: Universality of centered moments.

One can instead look at a different approach to characterize the clustering of singular values. To show that the singular values of $\Pi U$ are between $1\pm\varepsilon$, it is enough to show that $\|(\Pi U)^{T}\Pi U-\mathrm{Id}\|\le\varepsilon$, or $\|(SU)^{T}SU-pm\,\mathrm{Id}\|\le pm\,\varepsilon$ (note that $S=\sqrt{pm}\,\Pi$). One way to achieve this bound with high probability is to use the moment method, i.e., to show that (see proof of Theorem 7 in Section 5):

$\mathbb{E}\big[\operatorname{tr}\big((SU)^{T}(SU)-pm\,\mathrm{Id}\big)^{2q}\big]^{\frac{1}{2q}}\ =\ O(pm\,\varepsilon).$

In this case, standard calculations on Gaussian random matrices (see Section 6 of the technical report) show that $\big(\mathbb{E}\big[\operatorname{tr}(G^{T}G-pm\,\mathrm{Id}_{d\times d})^{2q}\big]\big)^{\frac{1}{2q}}\le c\,pm\sqrt{\frac{d}{m}}=O(pm\,\varepsilon)$ when $m=\Omega(d/\varepsilon^{2})$ and $G$ has the covariance profile of $SU$. So it is enough to show that

$\mathbb{E}\big[\operatorname{tr}\big((SU)^{T}(SU)-pm\,\mathrm{Id}\big)^{2q}\big]^{\frac{1}{2q}}-\mathbb{E}\big[\operatorname{tr}\big(G^{T}G-pm\,\mathrm{Id}\big)^{2q}\big]^{\frac{1}{2q}}\ =\ O(pm\,\varepsilon),$

where we recall the notation $\mathbb{E}[\operatorname{tr}(X)^{2q}]^{\frac{1}{2q}}=\big(\mathbb{E}\operatorname{tr}(X)^{2q}\big)^{1/(2q)}$.
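The arithmetic behind requiring $m=\Omega(d/\varepsilon^{2})$ here is elementary:

$c\,pm\sqrt{\tfrac{d}{m}}\ \le\ pm\,\varepsilon\quad\Longleftrightarrow\quad\sqrt{\tfrac{d}{m}}\ \le\ \tfrac{\varepsilon}{c}\quad\Longleftrightarrow\quad m\ \ge\ \tfrac{c^{2}d}{\varepsilon^{2}},$

so the Gaussian term is of the desired order $O(pm\,\varepsilon)$ precisely at the optimal embedding dimension, and the whole difficulty lies in the universality error above.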

Now, [4, Proposition 9.12] does take a similar approach of comparing $(SU)^{T}(SU)-pm\,\mathrm{Id}$ and $G^{T}G-pm\,\mathrm{Id}$, by relying on an interpolation argument, where one defines a mixture $S(t)=\sqrt{t}\,S+\sqrt{1-t}\,G$ and controls the change in the moments along the trajectory specified by $t\in[0,1]$. Unfortunately, using that result gives a larger power of $pm$ in the bound than desired, resulting again in a worse $\varepsilon$ dependence.

One can also, by viewing $(SU)^{T}(SU)-pm\,\mathrm{Id}=\sum_{i=1}^{m}\big(U^{T}s_{i}s_{i}^{T}U-p\,\mathrm{Id}\big)$, get a random matrix model which is a sum of independent random matrices (this is not true for OSNAP, but it is for some other models of OSEs), and then compare $\mathbb{E}\big[\operatorname{tr}\big((SU)^{T}(SU)-pm\,\mathrm{Id}\big)^{2q}\big]^{\frac{1}{2q}}$ with $\mathbb{E}[\operatorname{tr}(H)^{2q}]^{\frac{1}{2q}}$, where $H$ is the Gaussian model for $(SU)^{T}(SU)-pm\,\mathrm{Id}$. This is the approach of [4, Proposition 9.15], but it fails to obtain the optimal embedding dimension $m=\Theta(d/\varepsilon^{2})$.

Key technique: Decoupling.

To overcome these obstacles, we develop a fresh analysis while still using the ideas of [4]. Our first step is to observe that, due to the property of $S$ having a fixed number of non-zero entries in each column for the OSNAP distribution, all quadratic terms in $(SU)^{T}(SU)-pm\,\mathrm{Id}$ are square-free, and this allows us to use the decoupling technique to reduce the problem of controlling the moments of $(SU)^{T}(SU)-pm\,\mathrm{Id}$ to controlling the moments of $(S_{1}U)^{T}(S_{2}U)+(S_{2}U)^{T}(S_{1}U)$, where $S_{1}$ and $S_{2}$ are independent copies of $S$ (see the proof of Lemma 17 in Section 5).

We still have to separate bounding $\mathbb{E}\big[\operatorname{tr}\big((S_{1}U)^{T}(S_{2}U)+(S_{2}U)^{T}(S_{1}U)\big)^{2q}\big]^{\frac{1}{2q}}$ into two parts: bounding $\mathbb{E}\big[\operatorname{tr}\big(G_{1}^{T}G_{2}+G_{2}^{T}G_{1}\big)^{2q}\big]^{\frac{1}{2q}}$ for the Gaussian model, and the difference

$\mathbb{E}\big[\operatorname{tr}\big((S_{1}U)^{T}(S_{2}U)+(S_{2}U)^{T}(S_{1}U)\big)^{2q}\big]^{\frac{1}{2q}}-\mathbb{E}\big[\operatorname{tr}\big(G_{1}^{T}G_{2}+G_{2}^{T}G_{1}\big)^{2q}\big]^{\frac{1}{2q}},$

which is called the universality error.

By standard calculations, we have $\mathbb{E}\big[\operatorname{tr}\big(G_{1}^{T}G_{2}+G_{2}^{T}G_{1}\big)^{2q}\big]^{\frac{1}{2q}}\le c\sqrt{pm}\,\sqrt{pd}=O(pm\,\varepsilon)$, and the main task is still to bound the universality error. The advantage of the decoupling idea is that, informally speaking, since $S_{1}$ and $S_{2}$ are independent, we can condition on one of them, e.g., $S_{1}$. For fixed $S_{1}$, the random matrix $(S_{1}U)^{T}(S_{2}U)$ (where all randomness comes from $S_{2}$) can be viewed as a sum of independent random matrices, with the individual summands having moments of smaller order than in the previous approach. We can then use an interpolation argument to bound the trace universality error for $q=\log(\frac{d}{\varepsilon\delta})$ as follows:

$\Big|\,\mathbb{E}\big[\operatorname{tr}\big((S_{1}U)^{T}(S_{2}U)+(S_{2}U)^{T}(S_{1}U)\big)^{2q}\big]^{\frac{1}{2q}}-\mathbb{E}\big[\operatorname{tr}\big(G_{1}^{T}G_{2}+G_{2}^{T}G_{1}\big)^{2q}\big]^{\frac{1}{2q}}\,\Big|\ \le\ \mathrm{polylog}\Big(\frac{d}{\varepsilon\delta}\Big).$ (3)

Notice that there is no $pm$ dependence on the right-hand side. So our requirement that this quantity be bounded by $pm\,\varepsilon$ is satisfied when $pm\ge\mathrm{polylog}(\frac{d}{\varepsilon\delta})/\varepsilon$, achieving the conjectured $1/\varepsilon$ dependence.

Nevertheless, the conditioning argument cannot be done directly because

$\mathbb{E}\big[\operatorname{tr}\big((S_{1}U)^{T}(S_{2}U)+(S_{2}U)^{T}(S_{1}U)\big)^{2q}\big]^{\frac{1}{2q}}\ \ne\ \mathbb{E}_{S_{1}}\Big[\mathbb{E}_{S_{2}}\big[\operatorname{tr}\big((S_{1}U)^{T}(S_{2}U)+(S_{2}U)^{T}(S_{1}U)\big)^{2q}\big]^{\frac{1}{2q}}\Big].$

Key technique: 2D interpolation via chain rule.

So, instead we develop a new approach which incorporates the conditioning step directly into a two-dimensional interpolation argument, through the use of the chain rule (see Figure 3). Define

$S_{1}(t_{1})=\sqrt{t_{1}}\,S_{1}+\sqrt{1-t_{1}}\,G_{1},\qquad S_{2}(t_{2})=\sqrt{t_{2}}\,S_{2}+\sqrt{1-t_{2}}\,G_{2}.$

We start from (G1,G2) at (t1,t2)=(0,0) and move to (S1,S2) at (1,1), interpolating between the easier-to-analyze Gaussian matrices (G1,G2) and the true random matrices (S1,S2) of interest and controlling the changes in their moments (or the error terms) step by step.

Figure 3: Two-dimensional interpolation in (t1,t2)[0,1]2, decomposed using the chain rule.

Defining $f(M_{1},M_{2})=\operatorname{tr}\big((M_{1}U)^{T}(M_{2}U)+(M_{2}U)^{T}(M_{1}U)\big)^{2q}$, and applying the chain rule on the diagonal $t_{1}=t_{2}=t$, we obtain:

$\frac{d}{dt}\,\mathbb{E}\big[f(S_{1}(t),S_{2}(t))\big]\ =\ \frac{\partial}{\partial t_{1}}\,\mathbb{E}\big[f(S_{1}(t_{1}),S_{2}(t_{2}))\big]\Big|_{t_{1}=t,\,t_{2}=t}\ +\ \frac{\partial}{\partial t_{2}}\,\mathbb{E}\big[f(S_{1}(t_{1}),S_{2}(t_{2}))\big]\Big|_{t_{1}=t,\,t_{2}=t}.$

By independence of $S_{1}$ and $S_{2}$, we can condition on $S_{2}$ when we bound the partial derivative $\frac{\partial}{\partial t_{1}}\,\mathbb{E}\big[f(S_{1}(t_{1}),S_{2}(t_{2}))\big]\big|_{t_{1}=t_{2}=t}$, and do similar calculations for the other term. The benefit of doing this is that we can now fine-tune the techniques of [4] to get a differential inequality (Lemma 20) that leads to inequality (3). In doing so, we are able to find the optimal bounds and exponents in the differential inequality.

5 Proof Sketch for the Oblivious Subspace Embedding

We now sketch the proof of our main subspace embedding guarantee, Theorem 7 for OSNAP. The full proof can be found in Section 7 of the technical report. The proof of the subspace embedding guarantee for LESS-IC, Theorem 15 is similar and can be found in Section 8 of the technical report.

Proof sketch of Theorem 7.

Let $X:=\frac{1}{\sqrt{pm}}\,SU$. We first assume that the collection of all the random variables $\{\xi_{(l,\gamma)},\mu(l,\gamma)\}_{l\in[n],\gamma\in[s]}$ in the unscaled OSNAP construction is fully independent, and later we will check the minimum amount of independence that is actually needed.

We observe that to prove the theorem, it is enough to show that

$\mathbb{P}\big(\|X^{T}X-\mathrm{Id}\|\le\varepsilon\big)\ \ge\ 1-\delta.$

We call the quantity $X^{T}X-\mathrm{Id}$ the embedding error. By Markov's inequality, we have

$\mathbb{P}\Big(\|X^{T}X-\mathrm{Id}\|\ \ge\ \delta^{-\frac{1}{2q}}\,\mathbb{E}\big[d\,\operatorname{tr}(X^{T}X-\mathrm{Id})^{2q}\big]^{\frac{1}{2q}}\Big)\ \le\ \delta,$

which, after simplification, becomes

$\mathbb{P}\Big(\|X^{T}X-\mathrm{Id}\|\ \ge\ (d/\delta)^{\frac{1}{2q}}\,\mathbb{E}\big[\operatorname{tr}(X^{T}X-\mathrm{Id})^{2q}\big]^{\frac{1}{2q}}\Big)\ \le\ \delta.$

For $q>\log(d/\delta)$, we have

$(d/\delta)^{\frac{1}{2q}}\ =\ \exp\Big(\log(d/\delta)\cdot\frac{1}{2q}\Big)\ \le\ \exp\Big(\log(d/\delta)\cdot\frac{1}{2\log(d/\delta)}\Big)\ \le\ e.$

Therefore, we have

$\mathbb{P}\Big(\|X^{T}X-\mathrm{Id}\|\ \ge\ e\,\mathbb{E}\big[\operatorname{tr}(X^{T}X-\mathrm{Id})^{2q}\big]^{\frac{1}{2q}}\Big)\ \le\ \delta.$

Thus we need to control moments of order 2q of the embedding error for q>log(d/δ), and this is done in the following lemma.

Lemma 17 (Trace Moments of Embedding Error for OSNAP, see Section 7 of the technical report).

For $X$ as above, there exist constants $c_{17.1},c_{17.2},c_{17.3}>0$ such that for $q$ satisfying $2\le q\le m$, we have

$\mathbb{E}\big[\operatorname{tr}(X^{T}X-\mathrm{Id})^{2q}\big]^{\frac{1}{2q}}\ \le\ \varepsilon,$
when $\quad m\ \ge\ c_{17.1}\,\frac{d+q}{\varepsilon^{2}}$ (4)
and $\quad pm\ \ge\ \Big(\max\Big\{\frac{c_{17.2}\,q^{2}}{\varepsilon},\ c_{17.3}\,q^{3}\Big\}\Big)^{1+\frac{2}{q-2}}.$ (5)

Applying Lemma 17 (with appropriately adjusted ε) implies

$\mathbb{P}\big(\|X^{T}X-\mathrm{Id}\|\ge\varepsilon\big)\ \le\ \delta$

when combined with the previous calculations.

It remains to check that conditions (4) and (5) are satisfied for $q=2\log(\frac{d}{\varepsilon\delta})+2$ by requiring $m\ge c_{7.1}\big(d+\log(1/\delta\varepsilon)\big)/\varepsilon^{2}$ and $s=pm\ge c_{7.2}\big(\log^{2}(\frac{d}{\varepsilon\delta})/\varepsilon+\log^{3}(\frac{d}{\varepsilon\delta})\big)$, and this is done in the full version of the proof in Section 7 of the technical report.

Note that the expression for 𝔼[tr(XTXId)2q] depends only on 2q fold products of the entries of X. So, the quantity 𝔼[tr(XTXId)2q] remains unchanged if we only assume that subsets of the entries of X of size 2q are independent instead of arbitrary subsets of the entries of X being independent. Since it suffices to choose q=2log(dεδ)+2, we only need S to be an O(log(d/εδ))-wise independent unscaled OSNAP. Finally, we show how the above arguments also imply the OSE moment property (Definition 10).

Proof of Corollary 11.

By the proof of Theorem 7, we see that when $m\ge c_{7.1}\big(d+\log(1/\delta\varepsilon)\big)/\varepsilon^{2}$ and $s\ge\min\big\{c_{7.2}\big(\log^{2}(\frac{d}{\varepsilon\delta})/\varepsilon+\log^{3}(\frac{d}{\varepsilon\delta})\big),\,m\big\}$, then $\mathbb{E}\big[\operatorname{tr}(X^{T}X-\mathrm{Id})^{2q}\big]\le\varepsilon^{2q}$, for $q=8\log(\frac{d}{\varepsilon\delta})$ (the proof originally has $q=2\log(\frac{d}{\varepsilon\delta})+2$, but upon going through the proof we see that $q=8\log(\frac{d}{\varepsilon\delta})$ also works). To get $\mathbb{E}\big[\operatorname{tr}(X^{T}X-\mathrm{Id})^{2q}\big]\le\varepsilon^{2q}\delta/d$, it suffices for $m$ and $s$ to satisfy the same lower bounds, but with $\varepsilon$ replaced by $\varepsilon(\delta/d)^{\frac{1}{2q}}\ge c\varepsilon$ for some $c>0$, since $q\ge\log(d/\delta)$. These new lower bounds can be achieved by lower bounds of the same form as Theorem 7, but with different constants. The claim follows, since $\|X^{T}X-\mathrm{Id}\|^{2q}\le d\operatorname{tr}(X^{T}X-\mathrm{Id})^{2q}$.

5.1 Controlling Trace Moments of the Embedding Error

We now sketch the proof of Lemma 17, which obtains the moment bound for XTXI used in the previous proof. The full proof can be found in Section 7 of the technical report.

Proof sketch of Lemma 17.

Our first step is to observe that due to the property of $S$ having a fixed number of non-zero entries in a column, all quadratic terms in $(SU)^{T}(SU)-pm\,\mathrm{Id}$ are square-free, and this allows us to use the decoupling technique to reduce the problem of controlling the moments of $(SU)^{T}(SU)-pm\,\mathrm{Id}$ to controlling the moments of $(S_{1}U)^{T}(S_{2}U)+(S_{2}U)^{T}(S_{1}U)$, where $S_{1}$ and $S_{2}$ are independent copies. This is shown in the following claim, with the proof deferred to Section 7 of the technical report.

Lemma 18 (Decoupling).

When S has the fully independent unscaled OSNAP distribution, we have

$\mathbb{E}\big[\operatorname{tr}\big(U^{T}S^{T}SU-pm\,\mathrm{Id}\big)^{2q}\big]\ =\ \mathbb{E}\Big[\operatorname{tr}\Big(\sum_{i=1}^{m}\sum_{\substack{j,j'=1\\ j\ne j'}}^{n}s_{ij}\,s_{ij'}\,u_{j}u_{j'}^{T}\Big)^{2q}\Big],$

where $\{u_{j}^{T}\}_{j\in[n]}$ denote the rows of $U$. Consequently, we have

$\mathbb{E}\big[\operatorname{tr}\big(U^{T}S^{T}SU-pm\,\mathrm{Id}\big)^{2q}\big]\ \le\ \mathbb{E}_{S_{1},S_{2}}\Big[\operatorname{tr}\Big(2\big((S_{1}U)^{T}S_{2}U+(S_{2}U)^{T}S_{1}U\big)\Big)^{2q}\Big],$

where S2 is an independent copy of S1.

To estimate the moments of $(S_{1}U)^{T}S_{2}U$, we compare them to moments from the Gaussian case, i.e., the moments of $(G_{1}U)^{T}G_{2}U$, where the entries of $G_{1}$ and $G_{2}$ are independent normal random variables with variance $p$ (since the entries of $S_{1}$ and $S_{2}$ are also uncorrelated with mean 0 and variance $p$; see Section 6 of the technical report). In this case, due to orthogonal invariance of the Gaussian distribution, the matrices $G_{1}U$ and $G_{2}U$ are distributed as $\sqrt{p}\,H_{1}$ and $\sqrt{p}\,H_{2}$, where $H_{1}$ and $H_{2}$ are $m\times d$ matrices with independent standard normal entries. Thus, we can rely on the following bound, which uses standard results about the norms of Gaussian random matrices with independent entries.

Lemma 19 (Trace Moment of Embedding Error for Decoupled Gaussian Model, see Section 6 of the technical report ).

Let H1 and H2 be independent m×d random matrices with i.i.d. Gaussian entries. Then for any positive integer q, there exists c19>0 such that

$\mathbb{E}\big[\operatorname{tr}\big(H_{1}^{T}H_{2}+H_{2}^{T}H_{1}\big)^{2q}\big]^{\frac{1}{2q}}\ \le\ c_{19}\,\sqrt{\max\{d,q\}}\,\sqrt{\max\{m,q\}}.$

To formally compare the moments of $(S_{1}U)^{T}S_{2}U$ and $(G_{1}U)^{T}G_{2}U$, we define the interpolating matrices $S_{1}(t),S_{2}(t)$ for $t\in[0,1]$ as described in Section 4:

$S_{1}(t)=\sqrt{t}\,S_{1}+\sqrt{1-t}\,G_{1},\qquad S_{2}(t)=\sqrt{t}\,S_{2}+\sqrt{1-t}\,G_{2}.$ (6)

Let $\Gamma(M_{1},M_{2})=(M_{1}U)^{T}(M_{2}U)+(M_{2}U)^{T}(M_{1}U)$ and $\Gamma(t)=\Gamma(S_{1}(t),S_{2}(t))$. Then, due to the decoupling lemma (Lemma 18), to prove Lemma 17 it is enough to show that $\mathbb{E}[\operatorname{tr}(\Gamma(1))^{2q}]^{\frac{1}{2q}}\le pm\,\varepsilon/2$. Now, by Lemma 19, we know that:

$\mathbb{E}[\operatorname{tr}(\Gamma(0))^{2q}]^{\frac{1}{2q}}\ =\ \mathbb{E}[\operatorname{tr}(\Gamma(G_{1},G_{2}))^{2q}]^{\frac{1}{2q}}\ \le\ c_{19}\,p\,\sqrt{\max\{d,q\}}\,\sqrt{\max\{m,q\}}.$

Since we want to find the conditions for which $\mathbb{E}[\operatorname{tr}(\Gamma(0))^{2q}]^{\frac{1}{2q}}\le pm\,\varepsilon/4$, it is enough to ensure that $c_{19}\,p\,\sqrt{\max\{d,q\}}\,\sqrt{\max\{m,q\}}\le pm\,\varepsilon/4$. Clearly, this can only happen when $q\le m$, and in this case the inequality holds when $m\ge c(d+q)/\varepsilon^{2}$. Thus, it suffices to show

$\mathbb{E}\big[\operatorname{tr}\Gamma(1)^{2q}\big]^{\frac{1}{2q}}-\mathbb{E}\big[\operatorname{tr}\Gamma(0)^{2q}\big]^{\frac{1}{2q}}\ \le\ \frac{1}{4}\,pm\,\varepsilon.$ (7)

For this, we look to estimate the derivative $\frac{d}{dt}\,\mathbb{E}[\operatorname{tr}\Gamma(t)^{2q}]$, and we obtain the following estimate in Lemma 20 using the 2D interpolation idea mentioned in Section 4.

Lemma 20 (Differential Inequality).

For $\Gamma(t)$ as defined above, there exists a constant $c_{20}$ such that, for any $q\ge 2$, we have

$\frac{d}{dt}\,\mathbb{E}\big[\operatorname{tr}\Gamma(t)^{2q}\big]\ \le\ \max_{4\le k\le 2q}\,(c_{20}\,q)^{k}\,\Big((pm)^{\frac{1}{q}}\max\{pd,\,q\}\Big)^{\frac{q-k}{2q}\cdot\frac{q}{q-1}}\,\mathbb{E}\big[\operatorname{tr}\Gamma(t)^{2q}\big]^{\,1-\frac{k-2}{2q-2}}.$

This differential inequality can be separated into two distinct cases: $pd\le q$ and $pd>q$. When $pd\le q$, we can simplify the expression on the right using convexity arguments, and use Lemma 6.6 from [4] to solve the differential inequality and obtain the following bound:

$\mathbb{E}\big[\operatorname{tr}\Gamma(1)^{2q}\big]^{\frac{1}{2q}}-\mathbb{E}\big[\operatorname{tr}\Gamma(0)^{2q}\big]^{\frac{1}{2q}}\ \le\ c_{7}\,(pm)^{\frac{1}{q}}\,q^{2}$

for some $c_{7}>0$ (this is done in the full proof of Lemma 17 in the technical report). Thus, inequality (7) is satisfied when $c_{7}\,(pm)^{\frac{1}{q}}\,q^{2}<pm\,\varepsilon/4$, or

$pm\ \ge\ \frac{4\,c_{7}\,(pm)^{\frac{1}{q}}\,q^{2}}{\varepsilon}.$

When $pd>q$, the expression on the right of the above differential inequality has some $pd$ factors. We replace these $pd$ factors by terms involving only $pm$ and $\mathbb{E}[\operatorname{tr}\Gamma(t)^{2q}]$, and similarly obtain:

$\mathbb{E}\big[\operatorname{tr}\Gamma(1)^{2q}\big]^{\frac{1}{2q}}-\mathbb{E}\big[\operatorname{tr}\Gamma(0)^{2q}\big]^{\frac{1}{2q}}\ \le\ c_{13}\,q^{3}\,(pm)^{\frac{2}{q}}\Big(\frac{d}{m}\Big)^{\frac{1}{2}}$

for some $c_{13}>0$. In this case, inequality (7) is satisfied when

$c_{13}\,q^{3}\,(pm)^{\frac{2}{q}}\Big(\frac{d}{m}\Big)^{\frac{1}{2}}\ \le\ \frac{1}{4}\,pm\,\varepsilon.$

Since we have $m\ge c_{14}\,d\,\varepsilon^{-2}$ for some constant $c_{14}$, we have $\varepsilon\ge\sqrt{c_{14}\,d/m}$, so it suffices to require

$c_{13}\,q^{3}\,(pm)^{\frac{2}{q}}\Big(\frac{d}{m}\Big)^{\frac{1}{2}}\ \le\ \frac{1}{4}\,pm\,\sqrt{\frac{c_{14}\,d}{m}},$
or, $pm\ \ge\ c_{15}\,(pm)^{\frac{2}{q}}\,q^{3}$

for some c15>0.

Combining the analysis for the two cases, it suffices to require

$pm\ \ge\ (pm)^{\frac{2}{q}}\,\max\Big\{\frac{c_{16}\,q^{2}}{\varepsilon},\ c_{17}\,q^{3}\Big\}$

This requirement is equivalent to

$pm\ \ge\ \Big(\max\Big\{\frac{c_{16}\,q^{2}}{\varepsilon},\ c_{17}\,q^{3}\Big\}\Big)^{\frac{1}{1-2/q}},$

which concludes the proof of Lemma 17 (see remaining details in Section 7 of the technical report).

5.2 Obtaining the differential inequality in Lemma 20

We now discuss the proof of the technical part of our argument in the previous proof, which is to control the derivative of the interpolant. The full proof can be found in Section 7 of the technical report.

Sketch of proof of Lemma 20.

There are two main ideas for obtaining this differential inequality. First, we use the cumulant method as in [4] to transform the derivative in t to matrix directional derivatives. Then, we bound the resulting terms in the expression by delicately using the matrix Hölder’s inequality.

Fix $M_{2}$ and define $f_{1,M_{2}}(M_{1}):=\operatorname{tr}\big(\Gamma(M_{1},M_{2})^{2q}\big)$ as a function of $M_{1}$. We shall first obtain an expression for $\frac{d}{dt}\,\mathbb{E}[f_{1,M_{2}}(S_{1}(t))]$. To see why this is sufficient, note that the derivative we are interested in is the directional derivative along the path $t\mapsto(t,t)$ for the multivariate function $(t_{1},t_{2})\mapsto\mathbb{E}\big[\operatorname{tr}\big(\Gamma(S_{1}(t_{1}),S_{2}(t_{2}))^{2q}\big)\big]$, and by the chain rule (as mentioned in Section 4),

$\frac{d}{dt}\,\mathbb{E}\big[\operatorname{tr}\Gamma(t)^{2q}\big]\ =\ \frac{\partial}{\partial t_{1}}\,\mathbb{E}\big[\operatorname{tr}\big(\Gamma(S_{1}(t_{1}),S_{2}(t_{2}))^{2q}\big)\big]\Big|_{t_{1}=t_{2}=t}\ +\ \frac{\partial}{\partial t_{2}}\,\mathbb{E}\big[\operatorname{tr}\big(\Gamma(S_{1}(t_{1}),S_{2}(t_{2}))^{2q}\big)\big]\Big|_{t_{1}=t_{2}=t}.$

Now, recall that $S_{1}$ can be written in the form $\sum_{(l,\gamma)\in\Xi}Z_{(l,\gamma)}$, where $\Xi=[n]\times[pm]$ and $Z_{(l,\gamma)}=\xi_{(l,\gamma)}\,e_{\mu(l,\gamma)}e_{l}^{T}$ (see Definition 6). We then have the following lemma.

Lemma (Based on Corollary 6.1, [4]).

For any polynomial $\phi:M_{m\times d}(\mathbb{R})\to\mathbb{R}$, we have

$\frac{d}{dt}\,\mathbb{E}[\phi(S_{1}(t))]\ =\ \frac{1}{2}\sum_{k\ge 4}\frac{t^{\frac{k}{2}-1}}{(k-1)!}\sum_{\pi\in P([k])}(-1)^{|\pi|-1}(|\pi|-1)!\ \mathbb{E}\Big[\sum_{(l,\gamma)\in\Xi}\partial_{Z_{(l,\gamma),1|\pi}}\cdots\partial_{Z_{(l,\gamma),k|\pi}}\phi(S_{1}(t))\Big],$

where $\partial_{Z}\phi$ denotes the directional derivative of $\phi$ in the direction $Z\in M_{m\times d}(\mathbb{R})$.

Here, $P([k])$ denotes the set of all partitions of $[k]$, and $Z_{(l,\gamma),1|\pi},\dots,Z_{(l,\gamma),k|\pi}$ are random matrices distributed as $Z_{(l,\gamma)}$. Crucially, those are independent of $S_{1},G_{1},S_{2}$ and $G_{2}$ (but not necessarily from each other). Further details are given in the full proof in Section 7 of the technical report.

Applying this lemma to $\frac{\partial}{\partial t_{1}}\,\mathbb{E}\big[\operatorname{tr}\big(\Gamma(S_{1}(t_{1}),S_{2}(t_{2}))^{2q}\big)\big]=\frac{\partial}{\partial t_{1}}\,\mathbb{E}\big[f_{1,S_{2}(t_{2})}(S_{1}(t_{1}))\big]$, we need to deal with the directional derivatives of $f_{1,S_{2}(t_{2})}$ along $Z_{(l,\gamma),1|\pi},\dots,Z_{(l,\gamma),k|\pi}$. Using a general expression for derivatives of multinomials via the product rule, we have, for any deterministic $m\times d$ matrices $B_{1},\dots,B_{k},M_{1}$ and $M_{2}$,

$\partial_{B_{1}}\cdots\partial_{B_{k}}\,f_{1,M_{2}}(M_{1})\ =\ \sum_{\sigma\in\mathrm{sym}(k)}\ \sum_{\substack{r_{1},\dots,r_{k+1}\ge 0\\ r_{1}+\cdots+r_{k+1}=2q-k}}\operatorname{tr}\Big(\Gamma(M_{1},M_{2})^{r_{1}}\big((B_{\sigma(1)}U)^{T}M_{2}U+(M_{2}U)^{T}B_{\sigma(1)}U\big)\,\Gamma(M_{1},M_{2})^{r_{2}}\big((B_{\sigma(2)}U)^{T}M_{2}U+(M_{2}U)^{T}B_{\sigma(2)}U\big)\cdots\Gamma(M_{1},M_{2})^{r_{k}}\big((B_{\sigma(k)}U)^{T}M_{2}U+(M_{2}U)^{T}B_{\sigma(k)}U\big)\,\Gamma(M_{1},M_{2})^{r_{k+1}}\Big).$

In our case, for each fixed $(l,\gamma)$, we have to analyze $\partial_{Z_{(l,\gamma),1|\pi}}\cdots\partial_{Z_{(l,\gamma),k|\pi}}\,f_{1,S_{2}(t_{2})}(S_{1}(t_{1}))$, which means that we have $B_{\lambda}=Z_{(l,\gamma),\lambda|\pi}$ for $\lambda\in[k]$, and $M_{2}=S_{2}(t)$. So terms of the form $(B_{\lambda}U)^{T}M_{2}U$ become $(Z_{(l,\gamma),\lambda|\pi}U)^{T}S_{2}(t)U$. Crucially, $(Z_{(l,\gamma),\lambda|\pi}U)^{T}S_{2}(t)U$ is a rank-one matrix, so it can be written as an outer product of the form $\Theta_{(l,\gamma),\lambda,1}^{T}\Theta_{(l,\gamma),\lambda,2}$.

Then, estimating

$\mathbb{E}\Big[\sum_{(l,\gamma)\in\Xi}\partial_{Z_{(l,\gamma),1|\pi}}\cdots\partial_{Z_{(l,\gamma),k|\pi}}\,f_{1,S_{2}(t_{2})}(S_{1}(t_{1}))\Big]$

for $t_{1}=t_{2}=t$ boils down to estimating terms of the form

$\mathbb{E}\Big[\operatorname{tr}\,\Gamma(t)^{r_{1}}\,\Theta_{(l,\gamma),\sigma(1),\tau_{1}(1)}^{T}\Theta_{(l,\gamma),\sigma(1),\tau_{1}(2)}\,\Gamma(t)^{r_{2}}\cdots\Gamma(t)^{r_{k}}\,\Theta_{(l,\gamma),\sigma(k),\tau_{k}(1)}^{T}\Theta_{(l,\gamma),\sigma(k),\tau_{k}(2)}\,\Gamma(t)^{r_{k+1}}\Big],$

where $\tau_{i}\in\mathrm{sym}(\{1,2\})$ are permutations of the set $\{1,2\}$.

For the remainder of the proof, we delicately analyze terms of this form using the matrix Hölder’s inequality, and appropriately estimate the terms that arise.

Lemma (Matrix Hölder’s inequality, Lemma 5.3 in [4]).

Let $1\le\beta_{1},\dots,\beta_{k}\le\infty$ satisfy $\sum_{i=1}^{k}\frac{1}{\beta_{i}}=1$. Then

$\big|\mathbb{E}[\operatorname{tr}\,Y_{1}\cdots Y_{k}]\big|\ \le\ \|Y_{1}\|_{\beta_{1}}\cdots\|Y_{k}\|_{\beta_{k}}$

for any d×d random matrices Y1,,Yk.

This analysis based on matrix Hölder's inequality is done in Section 7 of the technical report. One important observation we use in this lemma (among many others) is that $\Theta_{(l,\gamma),\lambda,1}^{T}\Theta_{(l,\gamma),\lambda,2}$ are rank-one matrices, which allows us to bound $\big\|\Theta_{(l,\gamma),\lambda,1}^{T}\Theta_{(l,\gamma),\lambda,2}\big\|_{q}$ with $pd$ instead of $pm$. For further details, please refer to the full proof in Section 7 of the technical report.

6 Conclusions

We give an oblivious subspace embedding with optimal embedding dimension that achieves near-optimal sparsity, thus nearly matching a conjecture of Nelson and Nguyen in terms of the best sparsity attainable by an optimal oblivious subspace embedding. We also propose a fast algorithm for constructing low-distortion subspace embeddings, based on a new family of Leverage Score Sparsified embeddings with Independent Columns (LESS-IC). This new algorithm leads to speedups in downstream applications such as optimization problems based on constrained or regularized least squares. As a by-product of our analysis, we develop a new set of tools for matrix universality, combining a decoupling argument with a two-dimensional interpolation method, which are likely of independent interest.

References

  • [1] Dimitris Achlioptas. Database-friendly random projections: Johnson-lindenstrauss with binary coins. Journal of computer and System Sciences, 66(4):671–687, 2003. doi:10.1016/S0022-0000(03)00025-4.
  • [2] Nir Ailon and Bernard Chazelle. The fast johnson–lindenstrauss transform and approximate nearest neighbors. SIAM Journal on computing, 39(1):302–322, 2009. doi:10.1137/060673096.
  • [3] Jean Bourgain, Sjoerd Dirksen, and Jelani Nelson. Toward a unified theory of sparse dimensionality reduction in euclidean space. In Proceedings of the forty-seventh annual ACM symposium on Theory of Computing, pages 499–508, 2015. doi:10.1145/2746539.2746541.
  • [4] Tatiana Brailovskaya and Ramon van Handel. Universality and sharp matrix concentration inequalities. Geometric and Functional Analysis, pages 1–105, 2024.
  • [5] Coralia Cartis, Jan Fiala, and Zhen Shao. Hashing embeddings of optimal dimension, with applications to linear least squares. arXiv preprint arXiv:2105.11815, 2021. arXiv:2105.11815.
  • [6] Shabarish Chenakkod, Michał Dereziński, Xiaoyu Dong, and Mark Rudelson. Optimal embedding dimension for sparse subspace embeddings. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing, pages 1106–1117, 2024.
  • [7] Shabarish Chenakkod, Michał Dereziński, and Xiaoyu Dong. Optimal oblivious subspace embeddings with near-optimal sparsity, 2024. doi:10.48550/arXiv.2411.08773.
  • [8] Nadiia Chepurko, Kenneth L Clarkson, Praneeth Kacham, and David P Woodruff. Near-optimal algorithms for linear algebra in the current matrix multiplication time. In Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 3043–3068. SIAM, 2022. doi:10.1137/1.9781611977073.118.
  • [9] Yeshwanth Cherapanamjeri, Sandeep Silwal, David P Woodruff, and Samson Zhou. Optimal algorithms for linear algebra in the current matrix multiplication time. In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 4026–4049. SIAM, 2023. doi:10.1137/1.9781611977554.CH154.
  • [10] Kenneth L Clarkson and David P Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the forty-fifth annual ACM symposium on Theory of Computing, pages 81–90, 2013. doi:10.1145/2488608.2488620.
  • [11] Michael B Cohen. Nearly tight oblivious subspace embeddings by trace inequalities. In Proc. of the 27th annual ACM-SIAM Symposium on Discrete Algorithms, pages 278–287. SIAM, 2016. doi:10.1137/1.9781611974331.CH21.
  • [12] Michael B Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 163–172, 2015. doi:10.1145/2746539.2746569.
  • [13] Michael B Cohen, Jelani Nelson, and David P Woodruff. Optimal approximate matrix product in terms of stable rank. In International Colloquium on Automata, Languages, and Programming. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, 2016.
  • [14] Anirban Dasgupta, Ravi Kumar, and Tamás Sarlós. A sparse johnson: Lindenstrauss transform. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 341–350, 2010. doi:10.1145/1806689.1806737.
  • [15] Michał Dereziński. Algorithmic gaussianization through sketching: Converting data into sub-gaussian random designs. In The Thirty Sixth Annual Conference on Learning Theory, pages 3137–3172. PMLR, 2023.
  • [16] Michał Dereziński, Jonathan Lacotte, Mert Pilanci, and Michael W Mahoney. Newton-less: Sparsification without trade-offs for the sketched newton update. Advances in Neural Information Processing Systems, 34:2835–2847, 2021.
  • [17] Michał Dereziński, Zhenyu Liao, Edgar Dobriban, and Michael Mahoney. Sparse sketches with small inversion bias. In Conference on Learning Theory, pages 1467–1510. PMLR, 2021.
  • [18] Michał Dereziński and Michael W Mahoney. Recent and upcoming developments in randomized numerical linear algebra for machine learning. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 6470–6479, 2024.
  • [19] Petros Drineas, Malik Magdon-Ismail, Michael W Mahoney, and David P Woodruff. Fast approximation of matrix coherence and statistical leverage. The Journal of Machine Learning Research, 13(1):3475–3506, 2012. doi:10.5555/2503308.2503352.
  • [20] Petros Drineas and Michael W Mahoney. Randnla: randomized numerical linear algebra. Communications of the ACM, 59(6):80–90, 2016. doi:10.1145/2842602.
  • [21] Petros Drineas, Michael W Mahoney, and S Muthukrishnan. Sampling algorithms for 2 regression and applications. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 1127–1136, 2006.
  • [22] Daniel M Kane and Jelani Nelson. Sparser johnson-lindenstrauss transforms. Journal of the ACM (JACM), 61(1):1–23, 2014. doi:10.1145/2559902.
  • [23] Yi Li and Mingmou Liu. Lower bounds for sparse oblivious subspace embeddings. In Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 251–260, 2022. doi:10.1145/3517804.3526224.
  • [24] Per-Gunnar Martinsson and Joel A Tropp. Randomized numerical linear algebra: Foundations and algorithms. Acta Numerica, 29:403–572, 2020. doi:10.1017/S0962492920000021.
  • [25] Xiangrui Meng and Michael W Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 91–100, 2013. doi:10.1145/2488608.2488621.
  • [26] Jelani Nelson and Huy L Nguyên. Osnap: Faster numerical linear algebra algorithms via sparser subspace embeddings. In 2013 ieee 54th annual symposium on foundations of computer science, pages 117–126. IEEE, 2013.
  • [27] Jelani Nelson and Huy L Nguyên. Lower bounds for oblivious subspace embeddings. In International Colloquium on Automata, Languages, and Programming, pages 883–894. Springer, 2014. doi:10.1007/978-3-662-43948-7_73.
  • [28] Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In 2006 47th annual IEEE symposium on foundations of computer science (FOCS’06), pages 143–152. IEEE, 2006.
  • [29] Joel A Tropp. Improved analysis of the subsampled randomized hadamard transform. Advances in Adaptive Data Analysis, 3(01n02):115–126, 2011. doi:10.1142/S1793536911000787.
  • [30] Joel A. Tropp. An introduction to matrix concentration inequalities. Found. Trends Mach. Learn., 8(1–2):1–230, May 2015. doi:10.1561/2200000048.
  • [31] Joel A Tropp. Comparison theorems for the minimum eigenvalue of a random positive-semidefinite matrix. arXiv preprint arXiv:2501.16578, 2025. doi:10.48550/arXiv.2501.16578.
  • [32] David P Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends® in Theoretical Computer Science, 10(1–2):1–157, 2014. doi:10.1561/0400000060.