Optimal Oblivious Subspace Embeddings with Near-Optimal Sparsity
Abstract
An oblivious subspace embedding is a random matrix $\Pi$ such that, for any $d$-dimensional subspace, with high probability $\Pi$ preserves the norms of all vectors in that subspace within a $1\pm\varepsilon$ factor. In this work, we give an oblivious subspace embedding with the optimal dimension $m = \Theta(d/\varepsilon^2)$ that has a near-optimal sparsity of $\widetilde{O}(1/\varepsilon)$ non-zero entries per column of $\Pi$. This is the first result to nearly match the conjecture of Nelson and Nguyen [FOCS 2013] in terms of the best sparsity attainable by an optimal oblivious subspace embedding, improving on the prior bound of Chenakkod et al. [STOC 2024], whose sparsity has a substantially larger polynomial dependence on $1/\varepsilon$. We further extend our approach to the non-oblivious setting, proposing a new family of Leverage Score Sparsified embeddings with Independent Columns, which yield faster runtimes for matrix approximation and regression tasks.
In our analysis, we develop a new method which uses a decoupling argument together with the cumulant method for bounding the edge universality error of isotropic random matrices. To achieve near-optimal sparsity, we combine this general-purpose approach with new trace inequalities that leverage the specific structure of our subspace embedding construction.
Keywords and phrases: Randomized linear algebra, matrix sketching, subspace embeddings
Category: Track A: Algorithms, Complexity and Games
Funding: Shabarish Chenakkod: NSF grant DMS 20544.
2012 ACM Subject Classification: Theory of computation → Sketching and sampling; Theory of computation → Random projections and metric embeddings
Acknowledgements: The authors are grateful for the generous help and support of Mark Rudelson throughout the duration of this work.
Editors: Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis
1 Introduction
Subspace embeddings are one of the most fundamental techniques in dimensionality reduction, with applications in linear regression [28], low-rank approximation [10], clustering [12], and many more (see [32] for an overview). The key idea is to construct a random linear transformation $\Pi$ which maps from a large dimension $n$ to a small dimension $m \ll n$, while approximately preserving the geometry of all vectors in a low-dimensional subspace. In many applications, such embeddings must be constructed without knowledge of the subspace they are supposed to preserve, in which case they are called oblivious subspace embeddings.
Definition 1.
A random matrix $\Pi \in \mathbb{R}^{m\times n}$ is an $(\varepsilon,\delta,d)$-oblivious subspace embedding (OSE) if, for any $d$-dimensional subspace $W \subseteq \mathbb{R}^n$, it holds that
$$\Pr\Big(\forall\, x \in W:\ (1-\varepsilon)\,\|x\|_2 \ \le\ \|\Pi x\|_2 \ \le\ (1+\varepsilon)\,\|x\|_2\Big)\ \ge\ 1-\delta.$$
The two central concerns in constructing OSEs are: 1) how small can we make the embedding dimension $m$, and 2) how quickly can we apply $\Pi$ to a vector or a matrix. A popular way to address the latter is to use a sparse embedding matrix: if $\Pi$ has at most $s$ non-zero entries per column, then the cost of computing $\Pi x$ is $O(s\cdot \mathrm{nnz}(x))$, where $\mathrm{nnz}(x)$ denotes the number of non-zero coordinates in $x$. Designing oblivious subspace embeddings that simultaneously optimize the embedding dimension $m$ and the sparsity $s$ has been the subject of a long line of works [10, 25, 26, 3, 11, 6], aimed towards resolving the following conjecture of Nelson and Nguyen [26], which is supported by nearly-matching lower bounds [27, 23].
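To make the cost claim concrete, here is a minimal Python sketch (our illustration, not part of the paper's constructions; the function and variable names are ours). It stores a column-sparse sign matrix by the $s$ row positions and signs of each column and applies it to a vector by touching only the non-zero coordinates of the input, so the work is proportional to $s\cdot\mathrm{nnz}(x)$.

```python
import numpy as np

def apply_sparse_sketch(rows, signs, x, m):
    """Apply an m x n sketch with s non-zeros per column to a vector x.

    rows[i] and signs[i] hold the s row indices and (scaled) signs of
    column i.  Only the non-zero entries of x are touched, so the cost
    is O(s * nnz(x)) rather than O(m * n).
    """
    y = np.zeros(m)
    for i in np.flatnonzero(x):          # iterate over non-zeros of x only
        y[rows[i]] += signs[i] * x[i]    # s updates per non-zero coordinate
    return y

# toy example: n = 6 columns, m = 4 rows, s = 2 non-zeros per column
rng = np.random.default_rng(0)
n, m, s = 6, 4, 2
rows = [rng.choice(m, size=s, replace=False) for _ in range(n)]
signs = [rng.choice([-1.0, 1.0], size=s) / np.sqrt(s) for _ in range(n)]
x = np.array([0.0, 3.0, 0.0, -1.0, 0.0, 2.0])    # nnz(x) = 3
print(apply_sparse_sketch(rows, signs, x, m))
```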
Conjecture 2 (Nelson and Nguyen, FOCS 2013 [26]).
For any $\varepsilon, \delta \in (0,1)$ and $d$, there is an $(\varepsilon,\delta,d)$-oblivious subspace embedding with dimension $m = O\big((d + \log(1/\delta))/\varepsilon^2\big)$ having $s = O\big(\log(d/\delta)/\varepsilon\big)$ non-zeros per column.
Nelson and Nguyen gave a simple construction that they conjectured would achieve these guarantees: for each column of $\Pi$, place $s$ scaled random signs in random locations. They showed that this construction achieves a non-trivial dimension–sparsity trade-off, and a number of follow-up works [3, 11] improved on it; most notably, Cohen [11] showed that a sparse OSE can achieve $m = O(d\log d/\varepsilon^2)$ with $s = O(\log d/\varepsilon)$. However, none of these guarantees recover the optimal embedding dimension $m = O(d/\varepsilon^2)$, with the extraneous $\log d$ factor arising due to a long-standing limitation in existing matrix concentration techniques [30].
This sub-optimality in dimension was finally addressed in a recent work of Chenakkod, Dereziński, Dong and Rudelson [6], relying on a breakthrough in random matrix universality theory by Brailovskaya and van Handel [4]. They achieved $m = O(d/\varepsilon^2)$, but only with a significantly sub-optimal sparsity, whose polynomial dependence on $1/\varepsilon$ is much larger than conjectured; this is a consequence of how the universality error is measured and analyzed in [4] (here and below, $\widetilde O$ hides polylogarithmic factors). This raises the following natural question:
Can the optimal dimension be achieved with the conjectured sparsity?
We give a positive answer to this question, thus matching Conjecture 2 in dimension and nearly matching it in sparsity, up to polylogarithmic factors. To achieve this, we must substantially depart from the approach of Brailovskaya and van Handel, and as a by-product, we develop a new set of tools for matrix universality which are likely of independent interest (see Section 4 for an overview). Remarkably, our result is attained by one of the simple constructions that were originally suggested by Nelson and Nguyen in their conjecture.
Theorem 3 (Oblivious Subspace Embedding).
For any $\varepsilon, \delta \in (0,1)$ and $d$ satisfying a mild compatibility condition, there is an $(\varepsilon,\delta,d)$-oblivious subspace embedding with $m = O\big((d+\log(1/\delta))/\varepsilon^2\big)$ having $s = \widetilde O(1/\varepsilon)$ non-zeros per column.
Many applications of subspace embeddings arise in matrix approximation [32] where, given a large tall matrix $A \in \mathbb{R}^{n\times d}$, we seek a smaller $\tilde A \in \mathbb{R}^{m\times d}$ such that $\|\tilde A x\|_2 = (1\pm\varepsilon)\,\|Ax\|_2$ for all $x \in \mathbb{R}^d$. Naturally, this can be accomplished with an $(\varepsilon,\delta,d)$-OSE matrix $\Pi$, by computing $\tilde A = \Pi A$ in time $O(s\cdot \mathrm{nnz}(A))$ and considering the column subspace of $A$. However, given direct access to $A$, one may hope to get true input sparsity time by leveraging the fact that the embedding need not be oblivious.
To that end, we adapt our subspace embedding construction so that it can be made even sparser given additional information about the leverage scores of the matrix $A$. The $i$th leverage score of $A$ is defined as the squared norm of the $i$th row of the matrix obtained by orthonormalizing the columns of $A$ [21]. We show that if the $i$th leverage score of $A$ is bounded by $\ell_i$, then the $i$th column of $\Pi$ needs only a number of non-zero entries roughly proportional to $\ell_i$. Since the leverage scores of $A$ can be approximated quickly [19], this leads to our new algorithm, Leverage Score Sparsified embedding with Independent Columns (LESS-IC), which is inspired by related constructions that use LESS with independent rows [17, 16, 15].
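Since leverage scores drive the data-aware sparsification described here, we include a small Python sketch of the textbook definition above (squared row norms of an orthonormal basis of the column span). The paper relies on fast approximation algorithms [19] rather than this exact QR-based computation, so this snippet is only an illustration.

```python
import numpy as np

def leverage_scores(A):
    """Exact leverage scores of a tall matrix A (n x d, n >= d).

    The i-th leverage score is the squared norm of the i-th row of an
    orthonormal basis Q of the column span of A; the scores sum to rank(A).
    """
    Q, _ = np.linalg.qr(A, mode="reduced")   # Q is n x d with orthonormal columns
    return np.sum(Q * Q, axis=1)

rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 5))
ell = leverage_scores(A)
print(ell.sum())   # approximately d = 5
```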
Just like recent prior works [8, 9, 6], our algorithm for constructing a subspace embedding from a matrix $A$ incurs a preprocessing cost of $\widetilde O(\mathrm{nnz}(A) + d^\omega)$ required for approximating the leverage scores (here, $\omega$ is the matrix multiplication exponent). However, our approach significantly improves on these prior works in the embedding cost, leading to matching speedups in downstream applications such as constrained/regularized least squares [6].
Theorem 4 (Fast Subspace Embedding).
Given $A \in \mathbb{R}^{n\times d}$, $\varepsilon > 0$ and $\delta > 0$, in time
$$O(\mathrm{nnz}(A))\ +\ \widetilde O\big(d^\omega + d^2/\varepsilon\big)$$
we can compute $\tilde A = \Pi A \in \mathbb{R}^{m\times d}$ with $m = O\big((d+\log(1/\delta))/\varepsilon^2\big)$ such that $\|\tilde A x\|_2 = (1\pm\varepsilon)\,\|Ax\|_2$ for all $x \in \mathbb{R}^d$ with probability $1-\delta$.
This is a direct improvement over the previous best known runtime for constructing an optimal subspace embedding [6], which suffers an additional cost due to their sub-optimal sparsity. Remarkably, our result is also the first to achieve this dependence on $1/\varepsilon$ even if we allow a sub-optimal embedding dimension. Here, the previous best time [8, 9] has an additional lower-order cost, due to using a two-stage leverage score sampling scheme in place of a sparse embedding matrix. Our new LESS-IC embedding is crucial in achieving the right dependence on $1/\varepsilon$, as neither of the previous constructions appears capable of overcoming this barrier.
As an example application of our results, we show how our fast subspace embedding construction can be used to speed up reductions for a wide class of optimization problems based on constrained or regularized least squares regression, including Lasso regression [3]. The following corollary follows immediately from Theorem 4, and is a direct improvement over Theorem 1.8 of [6] in terms of the runtime dependence on from to , while achieving a matching reduction.
Corollary 5 (Fast reduction for constrained least squares).
Given $A \in \mathbb{R}^{n\times d}$, $b \in \mathbb{R}^n$, $\varepsilon > 0$, a function $f$ and a set $C \subseteq \mathbb{R}^d$, consider a constrained/regularized least squares problem of the form
$$\min_{x \in C}\ \|Ax - b\|_2^2 + f(x).$$
There is an algorithm that reduces this problem to an instance with $O(d/\varepsilon^2)$ rows, whose solution is a $(1+\varepsilon)$-approximation of the original, in $O(\mathrm{nnz}(A)) + \widetilde O(d^\omega + d^2/\varepsilon)$ time.
2 Related Work
Subspace embeddings have played a central role in the area of randomized linear algebra ever since the work of Sarlos [28] (for an overview, see the following surveys and monographs [32, 20, 24, 18]). Initially, these approaches focused on leveraging fast Hadamard transforms [2, 29] to achieve improved time complexity for linear algebraic tasks such as linear regression and low-rank approximation. Clarkson and Woodruff [10] were the first to propose a sparse subspace embedding matrix, the CountSketch, which has exactly one non-zero entry per column but does not recover the optimal embedding dimension guarantee. Before this, the idea of using a sparse random matrix for dimensionality reduction was successfully employed in the context of Johnson-Lindenstrauss embeddings [14, 22], which seek to preserve the geometry of a finite set, as opposed to an entire subspace.
In addition to the aforementioned efforts in improving sparse subspace embeddings [10, 25, 26, 3, 11, 6], some works have aimed to develop fast subspace embeddings that achieve optimal embedding dimension either without sparsity [8, 9], under additional assumptions [5], or with one-sided embedding bounds [31]. Our time complexity result, Theorem 4, improves on all of these in terms of the dependence on , thanks to a combination of our new analysis techniques and the new LESS-IC construction.
3 Main Results
In this section, we define the subspace embedding constructions used in our results, and provide detailed statements of our theorems.
As is customary in the literature, we shall work with an equivalent form of the subspace embedding guarantee from Definition 1, which frames this problem as a characterization of the extreme singular values of a class of random matrices. Namely, consider a deterministic $n\times d$ matrix $U$ with orthonormal columns that form a basis of a $d$-dimensional subspace $W \subseteq \mathbb{R}^n$. Then, a random matrix $\Pi$ is an $(\varepsilon,\delta)$-subspace embedding for $W$ if and only if all of the singular values of the matrix $\Pi U$ lie in $[1-\varepsilon, 1+\varepsilon]$ with probability $1-\delta$, i.e.,
$$\Pr\Big(1-\varepsilon \ \le\ \sigma_{\min}(\Pi U) \ \le\ \sigma_{\max}(\Pi U) \ \le\ 1+\varepsilon\Big)\ \ge\ 1-\delta, \qquad (1)$$
where $\sigma_{\min}$ and $\sigma_{\max}$ denote the smallest and largest singular values. To ensure that $\Pi$ is an oblivious subspace embedding, we must therefore ensure (1) for the family of all random matrices of the form $\Pi U$, where $U$ is any $n\times d$ matrix with orthonormal columns.
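As a quick numerical illustration of this criterion (our addition, using a dense Gaussian sketch as a stand-in for the sparse $\Pi$), one can check that all singular values of $\Pi U$ fall in the required interval once $m$ is on the order of $d/\varepsilon^2$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, eps = 2000, 10, 0.25
m = int(4 * d / eps**2)                      # embedding dimension ~ d / eps^2

# orthonormal basis U of a random d-dimensional subspace of R^n
U, _ = np.linalg.qr(rng.standard_normal((n, d)))

# dense Gaussian sketch with entry variance 1/m (stand-in for the sparse Pi)
Pi = rng.standard_normal((m, n)) / np.sqrt(m)

sv = np.linalg.svd(Pi @ U, compute_uv=False)
print(sv.min(), sv.max())                    # typically within [1 - eps, 1 + eps]
```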
3.1 Oblivious Subspace Embeddings
Our subspace embedding guarantees are achieved by a family of OSEs which have a fixed number of non-zero entries in each column, a key property that was also required of the sparse OSE distributions, called OSNAP, described by Nelson and Nguyen [26]. As we explain later, our analysis techniques apply to other natural families of sparse embedding distributions, including those with i.i.d. entries [1]; however, the OSNAP-style construction is crucial for achieving the near-optimal sparsity.
In our construction of the OSE matrix $\Pi$, we start by defining an unscaled version of the matrix, which has entries in $\{-1, 0, +1\}$. We then scale it by $1/\sqrt{s}$ to appropriately normalize the entry-wise variances, obtaining $\Pi$. Concretely, we wish to obtain an $m\times n$ sparse random matrix which has exactly $s$ non-zero entries in each column. Assume $s$ exactly divides $m$. Then we can divide each column into $s$ subcolumns of length $m/s$ and randomly populate one entry in each subcolumn by a Rademacher random variable (see Figure 1). We call this family of distributions (unscaled) OSNAP, carrying over Nelson and Nguyen's terminology (technically, their definition is somewhat broader than ours).
Each non-zero entry in the matrix can be identified by a tuple $(i,j)$, where $i \in [n]$ identifies the column of the non-zero entry and $j \in [s]$ is the index of the entry in that column. Thus the $j$th non-zero entry in column $i$ is located in row $\nu_{ij}$, where $\nu_{ij}$ is a uniformly chosen integer from the interval $\big[(j-1)\tfrac{m}{s}+1,\ j\tfrac{m}{s}\big]$, i.e., some row within the $j$th subcolumn. An $m\times n$ matrix with a single non-zero entry in column $i$ and row $k$ is given by $e_k e_i^\top$, where $e_k$ and $e_i$ represent standard basis vectors in $\mathbb{R}^m$ and $\mathbb{R}^n$ respectively, and at this position we wish to place a random sign $r_{ij}$. This motivates our formal definition of OSNAP.
Definition 6 (OSNAP).
An $m\times n$ random matrix $\bar\Pi$ is called an unscaled oblivious sparse norm-approximating projection with $k$-wise independent subcolumns ($k$-wise independent unscaled OSNAP) with parameters $(m, n, s)$ such that $s$ divides $m$ if
$$\bar\Pi \ =\ \sum_{i=1}^{n}\sum_{j=1}^{s} r_{ij}\, e_{\nu_{ij}}\, e_i^\top,$$
where:
- $\{r_{ij}\}_{i\in[n],\,j\in[s]}$ is a collection of $k$-wise independent Rademacher random variables;
- $\{\nu_{ij}\}_{i\in[n],\,j\in[s]}$ is a collection of $k$-wise independent random variables such that each $\nu_{ij}$ is uniformly distributed in $\big\{(j-1)\tfrac{m}{s}+1,\dots, j\tfrac{m}{s}\big\}$;
- the collection $\{r_{ij}\}$ is independent from the collection $\{\nu_{ij}\}$.
In this case, $\Pi = \frac{1}{\sqrt{s}}\,\bar\Pi$ is called a $k$-wise independent OSNAP with parameters $(m,n,s)$. In addition, if all the random variables in the collections $\{r_{ij}\}$ and $\{\nu_{ij}\}$ are fully independent, then $\bar\Pi$ is called a fully independent unscaled OSNAP and $\Pi$ is called a fully independent OSNAP.
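The construction in Definition 6 can be summarized by the following Python sketch (our illustration with fully independent randomness; the function name and variable names are ours, and the scaling follows the $1/\sqrt{s}$ normalization described above):

```python
import numpy as np
from scipy.sparse import csc_matrix

def osnap(m, n, s, rng):
    """Fully independent OSNAP with s dividing m.

    Each column is split into s subcolumns of height m // s; one uniformly
    random position inside each subcolumn receives a random sign, and the
    matrix is scaled by 1/sqrt(s).
    """
    assert m % s == 0
    block = m // s
    cols = np.repeat(np.arange(n), s)                  # column index of each non-zero
    offsets = rng.integers(0, block, size=n * s)       # position inside each subcolumn
    rows = np.tile(np.arange(s) * block, n) + offsets  # global row index
    signs = rng.choice([-1.0, 1.0], size=n * s) / np.sqrt(s)
    return csc_matrix((signs, (rows, cols)), shape=(m, n))

rng = np.random.default_rng(3)
Pi = osnap(m=600, n=5000, s=6, rng=rng)
print(Pi.shape, Pi.nnz)   # exactly s = 6 non-zeros in each of the n columns
```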
Thus, each column of the OSNAP matrix has $s$ non-zero entries, and the sparsity level can be varied by setting the parameter $s$ appropriately. With the distribution formally defined, we now provide the full statement of our subspace embedding guarantee for OSNAP.
Theorem 7 (Subspace Embedding Guarantee for OSNAP).
Let $\Pi$ be an $m\times n$ matrix distributed according to the $k$-wise independent OSNAP distribution with sparsity parameter $s$, for a sufficiently large wise-independence parameter $k$ (a logarithmic amount of independence suffices). Let $U$ be an arbitrary deterministic $n\times d$ matrix such that $U^\top U = I_d$. Then, there exist positive constants $C_1$ and $C_2$ such that for any $\varepsilon \in (0,1)$ and $\delta \in (0,1)$, we have
$$\Pr\Big(1-\varepsilon \ \le\ \sigma_{\min}(\Pi U) \ \le\ \sigma_{\max}(\Pi U) \ \le\ 1+\varepsilon\Big)\ \ge\ 1-\delta$$
if the embedding dimension satisfies $m \ge C_1\,\frac{d+\log(1/\delta)}{\varepsilon^2}$ and the sparsity satisfies $s \ge C_2\,\frac{\mathrm{polylog}(d/(\varepsilon\delta))}{\varepsilon}$ non-zeros per column.
Remark 8.
We note that if $1/\delta$ is polynomial in $d$, i.e., $\delta \ge d^{-c}$ for some absolute constant $c$, then the $\log(1/\delta)$ term in the requirement on $m$ is dominated by $d$. In this case, our requirement on the sparsity becomes $s \gtrsim \mathrm{polylog}(d)/\varepsilon$, with the constant depending only on $c$. Under additional assumptions on the parameters, the requirement on $s$ can be reduced further.
This is a direct improvement over Theorem 1.2 of [6], which requires a sparsity with a substantially larger polynomial dependence on $1/\varepsilon$, under the same condition on $m$. The primary gain lies in the polynomial dependence on $1/\varepsilon$, but we note that our result also achieves a better logarithmic dependence, which means that an improvement is obtained even for constant $\varepsilon$.
Our techniques can be used to obtain a similar result for a simple OSE model with i.i.d. sparse Rademacher entries [1], which was also considered by [6]. However, in this case, we need an additional lower-bound requirement on the sparsity (see Section 9 of the technical report for details; this is again a direct improvement over a result of [6]).
Remark 9.
The $\varepsilon^{-2}$ factor in the column sparsity of an OSE model with i.i.d. entries is unavoidable. To see why, note that the subspace embedding guarantee applied to a standard basis vector $e_1$ in the embedded subspace forces
$$\big|\,\|\Pi e_1\|_2^2 - 1\,\big| \ \le\ 2\varepsilon + \varepsilon^2. \qquad (2)$$
In the i.i.d.-entries model, $\|\Pi e_1\|_2^2$ equals $1/s$ times the number of non-zeros in the first column, which is a binomial random variable with $m$ trials and success probability $s/m$, so Condition (2) is equivalent to a deviation bound for this binomial around its mean $s$. With $\Phi$ the standard normal cumulative distribution function, the Berry–Esseen theorem shows that the probability of violating (2) is at least $2\big(1-\Phi(c\,\varepsilon\sqrt{s})\big)$, up to an additive error of order $1/\sqrt{s}$, for an absolute constant $c$. By general lower bounds for OSEs, we know that $m$ must be much larger than $s$ in the relevant regime, so the variance of this binomial is indeed of order $s$ and the normal approximation is accurate. Therefore, for small enough $\varepsilon$, if $s = o(1/\varepsilon^2)$, then the right-hand side is bounded away from $0$. Thus any OSE with i.i.d. entries (OSE-IE) that succeeds with constant probability must satisfy $s = \Omega(1/\varepsilon^2)$.
3.2 Characterization via a Moment Property
Our proof techniques for Theorem 7 are based on the moment method, and thus, they naturally imply the following slightly stronger moment-based characterization of an oblivious subspace embedding, which was proposed by [13] as an extension of the corresponding moment-based characterization of a Johnson-Lindenstrauss embedding [22].
Definition 10.
A distribution over $m\times n$ matrices $\Pi$ has $(\varepsilon,\delta,d,\ell)$-OSE moments if, for all $n\times d$ matrices $U$ with orthonormal columns,
$$\mathbb{E}\,\big\|(\Pi U)^\top \Pi U - I_d\big\|^{\ell}\ \le\ \varepsilon^{\ell}\cdot \delta.$$
Note that a simple application of Markov's inequality recovers the guarantee in Definition 1 from the $(\varepsilon,\delta,d,\ell)$-OSE moments property with any moment order $\ell \ge 1$. Moreover, [13] showed that this moment-based OSE characterization implies several other desirable guarantees of embedding matrices in the context of approximate matrix multiplication, generalized regression and low-rank approximation.
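For concreteness, the Markov step can be spelled out as follows (a short derivation we include for exposition, using the moment property in the form stated above). If $\mathbb{E}\,\|(\Pi U)^\top\Pi U - I_d\|^{\ell} \le \varepsilon^{\ell}\,\delta$ for all orthonormal $U$, then
$$\Pr\Big(\big\|(\Pi U)^\top\Pi U - I_d\big\| > \varepsilon\Big)\ \le\ \frac{\mathbb{E}\,\big\|(\Pi U)^\top\Pi U - I_d\big\|^{\ell}}{\varepsilon^{\ell}}\ \le\ \delta,$$
and on the complementary event every unit vector $x$ in the subspace satisfies $\big|\,\|\Pi U x\|_2^2 - 1\,\big| \le \varepsilon$, so that $\|\Pi U x\|_2 \in \big[\sqrt{1-\varepsilon},\,\sqrt{1+\varepsilon}\,\big] \subseteq [1-\varepsilon,\,1+\varepsilon]$, which is the guarantee of Definition 1.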
As an immediate consequence of our analysis, we obtain the following OSE moment guarantee for the OSNAP distribution.
Corollary 11.
Let $\Pi$ be an $m\times n$ matrix with an OSNAP distribution having sparsity $s$. Then $\Pi$ has $(\varepsilon,\delta,d,\ell)$-OSE moments for a suitable moment order $\ell$ when $m$ and $s$ satisfy lower bounds of the same form as in Theorem 7 (with adjusted constants).
Remark 12.
$\Pi$ can be applied to a matrix $A$ in time $O(s\cdot\mathrm{nnz}(A))$. As noted by [13, Remark 3], such runtimes can be further refined by chaining together several embeddings with an OSE moment property. For example, [11] showed that OSNAP with a larger embedding dimension and smaller sparsity also has OSE moments. Thus, we can first apply such a low-sparsity OSNAP matrix and then apply a second OSNAP matrix that reduces the dimension the rest of the way, obtaining a composed embedding with OSE moments which can be applied to a matrix faster than applying a single optimal-dimension OSNAP directly.
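For completeness, the elementary composition fact behind such chaining arguments is the following standard observation (stated here in our own words, independent of the specific parameters above). If $\Pi_1 \in \mathbb{R}^{m_1\times n}$ is an $(\varepsilon_1,\delta_1,d)$-OSE and $\Pi_2 \in \mathbb{R}^{m_2\times m_1}$ is an $(\varepsilon_2,\delta_2,d)$-OSE, then for any $d$-dimensional subspace $W \subseteq \mathbb{R}^n$ and any $x \in W$,
$$\|\Pi_2\Pi_1 x\|_2\ \in\ (1\pm\varepsilon_2)\,\|\Pi_1 x\|_2\ \subseteq\ (1\pm\varepsilon_1)(1\pm\varepsilon_2)\,\|x\|_2$$
with failure probability at most $\delta_1+\delta_2$ (apply $\Pi_2$ to the at most $d$-dimensional subspace $\Pi_1 W$ and take a union bound). Hence $\Pi_2\Pi_1$ is an $\big(O(\varepsilon_1+\varepsilon_2),\,\delta_1+\delta_2,\,d\big)$-OSE, while the cost of applying it is only that of the two sparse multiplications.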
3.3 Leverage Score Sparsified Embedding with Independent Columns
In a related problem, we seek to embed a subspace given by a fixed matrix $U$ with orthonormal columns, with information about the squared row norms of $U$ being used to define the distribution of non-zero entries in $\Pi$. Such distributions for $\Pi$ are called non-oblivious (a.k.a. data-aware) subspace embeddings. Previous work [6] has dealt with one such family of distributions, termed LESS embeddings [17, 16, 15], showing that a bounded number of non-zero entries per row of $\Pi$ suffices to obtain a subspace embedding guarantee. Since the embedding matrix is very wide, this leads to a much sparser embedding (sparser than any OSE) that can be applied in time sublinear in the input size, leading to fast subspace embedding algorithms.
In this work, we show that our new techniques also extend to LESS embeddings and enable us to prove sharper sparsity estimates than [6]. To fully leverage our approach, we define a new type of sparse embedding (LESS-IC), which can be viewed as a cross between CountSketch and LESS. Here, IC stands for independent columns. At a high level, the CountSketch part ensures that we can use our decoupling method to achieve optimal dependence on , while the LESS part enables adaptivity to a fixed subspace.
Specifically, a LESS-IC embedding matrix has a fixed number of non-zero entries in each column, chosen so that it is proportional to the leverage score (i.e., the squared row norm of $U$) of the corresponding row of the data matrix. This is achieved by modifying the OSNAP distribution so that the number of subcolumns is no longer the same in each column. For columns corresponding to very small leverage scores, we only have one "subcolumn". Thus, each column has at least one non-zero entry. This means that the cost of applying LESS-IC to an $n\times d$ matrix can no longer be sublinear (as it can be in the existing LESS embedding constructions), but rather has a fixed linear term of $O(\mathrm{nnz}(A))$, plus an additional sublinear term. Given that the preprocessing step of approximating the leverage scores has to take at least $\Omega(\mathrm{nnz}(A))$ time, the linear term in the cost of applying LESS-IC is negligible.
To generate an embedding matrix with the LESS-IC distribution, it suffices to have a good enough approximation for the leverage scores of the matrix , in the following sense.
Definition 13 (Approximate Leverage Scores).
Given a matrix $U \in \mathbb{R}^{n\times d}$ with orthonormal columns and $\alpha \ge 1$, a tuple of numbers $(\tilde\ell_1,\dots,\tilde\ell_n)$ are $\alpha$-approximate leverage scores for $U$ if, for all $i \in [n]$,
$$\|u_i\|_2^2\ \le\ \tilde\ell_i\ \le\ \alpha\,\|u_i\|_2^2,$$
where $u_i$ denotes the $i$th row of $U$. We say that such numbers are $\alpha$-approximations of the leverage scores (i.e., squared row norms) of any matrix $A$ whose column span has $U$ as an orthonormal basis.
To see how approximate leverage scores determine the distribution of entries in the LESS-IC distribution, let us first consider a simpler distribution, LESS-IE from [6], based on a similar construction first proposed by [17]. Here, we once again start by defining an unscaled matrix, which is then normalized to obtain the subspace embedding matrix $\Pi$.
Definition 14 (LESS-IE).
An $m\times n$ random matrix is called an unscaled leverage score sparsified embedding with independent entries (unscaled LESS-IE), and its normalization is called a LESS-IE, corresponding to $\alpha$-approximate leverage scores $(\tilde\ell_1,\dots,\tilde\ell_n)$, if its entries have the form $b_{ki}\, r_{ki}$, where the $b_{ki}$ are independent Bernoulli random variables taking value 1 with probability proportional to the approximate leverage score $\tilde\ell_i$ of the corresponding column, whereas the $r_{ki}$ are i.i.d. Rademacher random variables; the normalization is chosen so that all entries of the resulting matrix have the same variance.
In the LESS-IE model, column $i$ has a certain expected number of non-zero entries, proportional to $\tilde\ell_i$. However, to achieve the near-optimal $1/\varepsilon$ dependency of the sparsity, we need column $i$ to have exactly that many non-zero entries in the LESS-IC model, in order to fully take advantage of the error cancellation that occurs in our decoupling argument (see Section 7.2 and Section 9.1 of the technical report; though these sections deal with oblivious subspace embeddings, the same arguments still apply in the LESS case). This is done by modifying the OSNAP construction so that the size (and consequently, the number) of subcolumns differs across columns.
Notice that to have $s_i$ non-zero entries in column $i$, we would need $s_i$ subcolumns in column $i$, each with one non-zero entry in a random position. This means that the size of each subcolumn needs to be $m/s_i$. However, since $m/s_i$ may not be an integer, we consider subcolumns of size $\lceil m/s_i\rceil$.
In column $i$, we stack subcolumns of size $\lceil m/s_i\rceil$ until we fill up all the rows up to $m$, using the smallest number of subcolumns needed to do this. It may then happen that the nominal row indices of the bottom-most subcolumn exceed $m$, in which case it only spans the remaining rows. For example, if the parameters in the first column give a subcolumn size of 15 while $m$ is not a multiple of 15, we can stack four full subcolumns of size 15, and the fifth subcolumn only spans the remaining row indices up to $m$. In each subcolumn, we randomly choose a row to place a non-zero entry, which is a Rademacher random variable (see Figure 2). The non-zero entries are appropriately scaled so that all entries of the matrix have the same variance (see Section 8 of the technical report for the full definition).
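The following Python sketch (a schematic illustration of ours; the per-column sparsity target and the final normalization are simplified relative to the full definition in Section 8 of the technical report) shows how the subcolumn layout just described can be generated from approximate leverage scores.

```python
import numpy as np
from scipy.sparse import csc_matrix

def less_ic_pattern(m, lev, target, rng):
    """Schematic LESS-IC sparsity pattern (unnormalized entries).

    lev[i] is an (approximate) leverage score of the i-th row of the data
    matrix; target controls the overall sparsity scale.  Column i gets
    s_i = max(1, ceil(target * lev[i])) subcolumns of height ceil(m / s_i);
    each subcolumn holds one random sign, and the bottom-most subcolumn is
    truncated at row m.
    """
    rows, cols, vals = [], [], []
    for i, ell in enumerate(lev):
        s_i = max(1, int(np.ceil(target * ell)))
        height = int(np.ceil(m / s_i))
        start = 0
        while start < m:                           # stack subcolumns up to row m
            top = min(start + height, m)           # last subcolumn may be shorter
            rows.append(rng.integers(start, top))  # one non-zero per subcolumn
            cols.append(i)
            vals.append(rng.choice([-1.0, 1.0]))
            start = top
    # every column gets at least one non-zero, so applying the matrix incurs
    # at least a linear term in the number of columns of the embedding
    return csc_matrix((vals, (rows, cols)), shape=(m, len(lev)))

rng = np.random.default_rng(4)
lev = rng.uniform(0, 0.01, size=3000)              # toy approximate leverage scores
pattern = less_ic_pattern(m=400, lev=lev, target=200, rng=rng)
print(pattern.nnz)                                 # at least the number of columns
```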
For the LESS-IC distribution, we show the following subspace embedding guarantee. The structure of the proof is similar to the case of OSNAP, and only the specific expressions change due to the different distribution.
Theorem 15 (Subspace Embedding Guarantee for LESS-IC).
Let $\Pi$ be an $m\times n$ matrix distributed according to the $k$-wise independent LESS-IC distribution, for a sufficiently large $k$, constructed for a fixed $n\times d$ matrix $U$ satisfying $U^\top U = I_d$ with given $\alpha$-approximate leverage scores. Then, there exist positive constants $C_1$ and $C_2$ such that for any $\varepsilon \in (0,1)$ and $\delta \in (0,1)$, we have
$$\Pr\Big(1-\varepsilon \ \le\ \sigma_{\min}(\Pi U) \ \le\ \sigma_{\max}(\Pi U) \ \le\ 1+\varepsilon\Big)\ \ge\ 1-\delta$$
when $m \ge C_1\,\frac{d+\log(1/\delta)}{\varepsilon^2}$ and the per-column sparsities are at least $C_2\,\frac{\mathrm{polylog}(d/(\varepsilon\delta))}{\varepsilon}$ times the corresponding approximate leverage scores (and at least one).
The matrix $\Pi$ has $O(n) + \widetilde O(\alpha\, d/\varepsilon)$ non-zero entries and can be applied to an $n\times d$ matrix $A$ in $O(\mathrm{nnz}(A)) + \widetilde O(\alpha\, d^2/\varepsilon)$ time, where $\alpha$ is the leverage score approximation factor.
Remark 16.
When $\alpha = O(1)$, we recover the optimal dimension $m = O(d/\varepsilon^2)$ while showing that one can apply the LESS-IC embedding in time $O(\mathrm{nnz}(A)) + \widetilde O(d^2/\varepsilon)$. In comparison, [6] showed that a corresponding LESS-IE embedding can be applied in time with a substantially larger polynomial dependence on $1/\varepsilon$. Using our techniques, one could also improve the runtime of LESS-IE, but our new LESS-IC construction appears necessary to recover the best dependence on $1/\varepsilon$.
3.4 Fast Subspace Embedding (Proof of Theorem 4)
Here, we briefly outline how our LESS-IC embedding yields a fast subspace embedding construction recovering the time complexity claimed in Theorem 4. This follows analogously to the construction from Theorem 1.6 of [6], and our improvement in the dependence on $1/\varepsilon$ compared to their result stems from the improved sparsity of our LESS-IC embedding.
The key preprocessing step for applying the LESS-IC embedding is approximating the leverage scores of the matrix $A$. Using Lemma 5.1 in [6] (adapted from Lemma 7.2 in [8]), we can construct coarse approximations of all leverage scores, accurate up to a constant factor, in time $\widetilde O(\mathrm{nnz}(A) + d^\omega)$. Applying LESS-IC (Theorem 15) with these leverage scores, computing $\Pi A$ takes $O(\mathrm{nnz}(A)) + \widetilde O(d^2/\varepsilon)$ time, where the first term comes from the fact that every column of $\Pi$ has at least one non-zero, while the second term accounts for the additional non-zeros.
Thus, if the overall cost is dominated by the $O(\mathrm{nnz}(A))$ term, then we conclude the claim by appropriately adjusting the constant factors. Now, suppose otherwise. First, note that without loss of generality we can assume that the leading constants are absorbed into the time complexity (through scaling by a constant factor), that $A$ has no empty rows (by removing them), and that $n \ge d$ (because otherwise no dimension reduction is needed and we could take $\Pi$ to be the identity). Under these assumptions, the time complexity is dominated by the $\widetilde O(d^\omega + d^2/\varepsilon)$ terms.
Finally, we note that Corollary 5 follows simply by constructing a subspace embedding via Theorem 4 with respect to the matrix $[A,\, b]$, and computing $\Pi A$ and $\Pi b$. The proof of the claim is identical to the proof of Theorem 1.8 in [6]. Our improvement comes directly from the faster runtime of our subspace embedding construction.
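As a simple illustration of the reduction pattern behind Corollary 5 (a generic sketch-and-solve example that we add for exposition, using a dense Gaussian sketch and unconstrained least squares rather than our LESS-IC construction):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, eps = 20000, 20, 0.25
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# sketch [A, b] so that residual norms of the reduced problem are preserved
m = int(4 * (d + 1) / eps**2)
Pi = rng.standard_normal((m, n)) / np.sqrt(m)
A_s, b_s = Pi @ A, Pi @ b

x_full = np.linalg.lstsq(A, b, rcond=None)[0]     # original n x d problem
x_red = np.linalg.lstsq(A_s, b_s, rcond=None)[0]  # reduced m x d problem

# residual of the sketched solution on the original problem; the ratio is
# 1 + O(eps) when Pi embeds the column span of [A, b]
print(np.linalg.norm(A @ x_red - b) / np.linalg.norm(A @ x_full - b))
```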
3.5 Outline of the Paper
Section 4 provides a high level overview of the ideas used in the proofs of our main results, Theorem 7 and Theorem 15. Section 5 provides a sketch of the proof of Theorem 7, listing the main technical steps, leaving the full proof with all technical details to Section 7 of the technical report. The proof of Theorem 15 follows similarly and is covered in Section 8 of the technical report. The subspace embedding guarantee for a sparse matrix with independent entries is proved in Section 9 of the technical report.
3.6 Notation
The following notation and terminology will be used in the paper. The notation $[n]$ is used for the set $\{1,\dots,n\}$, and $\mathcal{P}([n])$ denotes the set of all partitions of $[n]$. Also, for two integers $a$ and $b$ with $a \le b$, we use the notation $[a,b]$ for the set $\{a, a+1, \dots, b\}$. For $x \in \mathbb{R}$, we use $\lfloor x\rfloor$ to denote the greatest integer less than or equal to $x$ and $\lceil x\rceil$ to denote the least integer greater than or equal to $x$. In $\mathbb{R}^n$ (or $\mathbb{R}^m$ or $\mathbb{R}^d$), the $i$th coordinate vector is denoted by $e_i$. All matrices considered in this paper are real valued, and the space of $m\times n$ matrices with real-valued entries is denoted by $\mathbb{R}^{m\times n}$. Also, for a matrix $M$, the notation $\mathrm{tr}[M]$ denotes the trace of the matrix $M$, and $\overline{\mathrm{tr}}[M]$ denotes the normalized trace. We write the operator norm of a matrix $M$ as $\|M\|$, and it is also denoted by $\|M\|_{\mathrm{op}}$ in some places where other norms appear, for clarity. The spectrum of a matrix $M$ is denoted by $\mathrm{spec}(M)$. The standard probability measure is denoted by $\mathbb{P}$, and the symbol $\mathbb{E}$ means taking the expectation with respect to this standard probability measure. To simplify the notation, we follow the convention from [4] and use the notation $\mathbb{E}[X]\,Y$ for $(\mathbb{E}[X])\,Y$, i.e., when a functional is followed by square brackets, it is applied before any other operations. The covariance of two random variables $X$ and $Y$ is denoted by $\mathrm{Cov}(X,Y)$. The standard $L^p$ norm of a random variable $X$ is denoted by $\|X\|_{L^p}$, for $p \ge 1$. Throughout the paper, the symbols $c$, $C$ and $C'$ denote absolute constants.
4 Main Ideas
We next outline our new techniques which are needed to establish the main results, Theorems 7 and 15. Here, for notational convenience, we will often refer to the unscaled random matrix $\bar\Pi$, as opposed to the scaled subspace embedding matrix $\Pi$ (see Definition 6).
Note that due to the equivalent characterization of the OSE property in (1), all we need to show is that the singular values of $\Pi U$ are clustered around $1$ at distance at most $\varepsilon$. In other words, we need to show that the difference between the spectrum of $(\Pi U)^\top \Pi U$ and the spectrum of the identity $I_d$ is small, of the order $\varepsilon$.
In all our models, the entries of the unscaled matrix are uncorrelated with mean $0$, and after scaling, the entries of $\Pi$ are uncorrelated with uniform variance $1/m$. If we consider a random matrix $G$ with Gaussian entries which keeps the covariance profile of the entries of $\Pi$, then this Gaussian random matrix has independent Gaussian entries with variance $1/m$. Using classical results about singular values of Gaussian random matrices, it can be shown that the singular values of $GU$ are sufficiently clustered around $1$ with high probability once $m \gtrsim (d+\log(1/\delta))/\varepsilon^2$. Thus, it suffices to find conditions under which the singular values of $\Pi U$ are sufficiently close to the singular values of $GU$. This is the phenomenon of universality, whereby random systems show predictable (in this case Gaussian) behavior under certain limits.
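For reference, the classical estimate we have in mind here is the following standard Gaussian singular value bound (of Davidson–Szarek type; we state it only for intuition, and its constants are not tied to our theorems). If $G$ has i.i.d. $N(0,1/m)$ entries and $U$ has orthonormal columns, then $GU$ is an $m\times d$ matrix with i.i.d. $N(0,1/m)$ entries, and for any $t>0$,
$$\Pr\Big(1-\sqrt{d/m}-t\ \le\ \sigma_{\min}(GU)\ \le\ \sigma_{\max}(GU)\ \le\ 1+\sqrt{d/m}+t\Big)\ \ge\ 1-2e^{-mt^2/2},$$
so taking $m \gtrsim (d+\log(1/\delta))/\varepsilon^2$ places all singular values of $GU$ in $[1-\varepsilon, 1+\varepsilon]$ with probability at least $1-\delta$.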
Failure of black-box universality.
Recent work by Brailovskaya and van Handel [4] on universality for certain random matrix models developed tools to bound the distance between the spectrum of a random matrix model obtained as a sum of independent random matrices and the spectrum of a Gaussian random matrix with the same covariance profile. Using these tools, [6] achieved optimal embedding dimension for OSEs by using the bound in [4, Theorem 2.6] to estimate the Hausdorff distance (a notion of distance between two subsets of $\mathbb{R}$; two sets are said to be $\varepsilon$-close in Hausdorff distance if each is contained in the $\varepsilon$-neighborhood of the other) between the spectrum of $(\Pi U)^\top \Pi U$ and the spectrum of its Gaussian counterpart.
This distance is shown to be of order $\varepsilon$ only when the sparsity $s$ has a substantially super-linear dependence on $1/\varepsilon$. Thus, [6] did not obtain the conjectured dependency of the sparsity on $1/\varepsilon$, which requires $s$ to have only a (near-)linear dependence on $1/\varepsilon$. To get a better dependency, we would either need a sharper bound on the Hausdorff distance, or have the distance decrease with $s$. For example, if the bound were improved by a suitable polynomial factor, we would only need a correspondingly weaker condition, which can be achieved when $s$ has a lower-order dependence on $1/\varepsilon$; a further improvement of the bound would reduce the requirement on $s$ even more.
Key idea: Universality of centered moments.
One can instead look at a different approach to characterize the clustering of singular values. To show that the singular values of $\Pi U$ lie in $[1-\varepsilon, 1+\varepsilon]$, it is enough to show that $\|(\Pi U)^\top \Pi U - I_d\| \le \varepsilon$, or equivalently that all eigenvalues of $(\Pi U)^\top\Pi U$ lie in $[1-\varepsilon, 1+\varepsilon]$ (note that the squared singular values of $\Pi U$ are exactly these eigenvalues). One way to achieve this bound with high probability is to use the moment method, i.e., to bound the trace moment
$$\mathbb{E}\,\mathrm{tr}\Big[\big((\Pi U)^\top\Pi U - I_d\big)^{2p}\Big]$$
for a suitably large power $p$ (see proof of Theorem 7 in Section 5).
In this case, standard calculations on Gaussian random matrices (see Section 6 of the technical report) show that the corresponding Gaussian trace moment is sufficiently small when $m \gtrsim (d+\log(1/\delta))/\varepsilon^2$ and $G$ has the covariance profile of $\Pi$. So it is enough to show that the difference
$$\Big|\ \mathbb{E}\,\mathrm{tr}\Big[\big((\Pi U)^\top\Pi U - I_d\big)^{2p}\Big]\ -\ \mathbb{E}\,\mathrm{tr}\Big[\big((G U)^\top G U - I_d\big)^{2p}\Big]\ \Big|$$
is also sufficiently small, where we recall the convention that an expectation followed by square brackets is applied before any other operations.
Now, [4, Proposition 9.12] does take a similar approach of comparing the trace moments of the sparse and Gaussian models, by relying on an interpolation argument, where one defines a mixture of the two models and controls the change in the moments along the trajectory specified by the interpolation parameter. Unfortunately, using that result gives a larger power in the bound than desired, resulting again in a worse dependence on $1/\varepsilon$.
One can also, by viewing the embedding as a sum over its independent components, get a random matrix model which is a sum of independent random matrices (this is not the case for OSNAP, but it is for some other models of OSEs), and then compare its trace moments with those of the corresponding Gaussian model. This is the approach of [4, Proposition 9.15], but it fails to obtain the optimal embedding dimension $m = O(d/\varepsilon^2)$.
Key technique: Decoupling.
To overcome these obstacles, we develop a fresh analysis while still using the ideas of [4]. Our first step is to observe that, due to the property of having a fixed number of non-zero entries in each column of the OSNAP distribution, all quadratic terms in $(\Pi U)^\top\Pi U - I_d$ are square-free, and this allows us to use the decoupling technique to reduce the problem of controlling the moments of this matrix to controlling the moments of a decoupled version, in which the two factors of the quadratic form come from independent copies of the embedding matrix (see proof of Lemma 17 in Section 5).
We still have to separate the bound into two parts: a bound on the trace moments of the decoupled Gaussian model, and a bound on the difference between the decoupled sparse and Gaussian trace moments,
which is called the universality error.
By standard calculations, the Gaussian part is under control, and the main task is still to bound the universality error. The advantage of the decoupling idea is that, informally speaking, since the two copies are independent, we can condition on one of them. For a fixed realization of the first copy, the decoupled random matrix (where all remaining randomness comes from the second copy) can be viewed as a sum of independent random matrices, with the individual summands having moments of smaller order than in the previous approach. We can then use an interpolation argument to bound the trace universality error by an expression (inequality (3) in the full argument) whose right-hand side does not involve the problematic parameter dependence that limited the earlier approach. As a result, the requirement that this quantity be sufficiently small is satisfied when the sparsity $s$ scales only like $1/\varepsilon$, up to logarithmic factors, achieving the conjectured dependence.
Nevertheless, the conditioning argument cannot be applied directly, since both independent copies ultimately need to be compared to their Gaussian counterparts, not just the one left random after conditioning.
Key technique: 2D interpolation via chain rule.
So, instead, we develop a new approach which incorporates the conditioning step directly into a two-dimensional interpolation argument, through the use of the chain rule (see Figure 3). Define a two-parameter family of interpolating matrices $X(s,t)$, $(s,t)\in[0,1]^2$, in which the parameter $s$ interpolates the first decoupled copy between its Gaussian counterpart and the true matrix, and the parameter $t$ does the same for the second copy.
We start from the fully Gaussian model at $(s,t)=(0,0)$ and move to the true model of interest at $(s,t)=(1,1)$, interpolating between the easier-to-analyze Gaussian matrices and the true random matrices of interest and controlling the changes in their moments (or the error terms) step by step.
Defining $\psi(s,t)$ to be the trace moment of interest evaluated at $X(s,t)$, and applying the chain rule on the diagonal $s=t$, we obtain:
$$\frac{d}{dt}\,\psi(t,t)\ =\ \partial_s\,\psi(s,t)\Big|_{s=t}\ +\ \partial_t\,\psi(s,t)\Big|_{s=t}.$$
By independence of the two copies, we can condition on one of them when we bound the partial derivative with respect to the other parameter, and do similar calculations for the other term. The benefit of doing this is that we can now fine-tune the techniques of [4] to get a differential inequality (Lemma 20) that leads to inequality (3). In doing so, we are able to find the optimal bounds and exponents in the differential inequality.
5 Proof Sketch for the Oblivious Subspace Embedding
We now sketch the proof of our main subspace embedding guarantee, Theorem 7, for OSNAP. The full proof can be found in Section 7 of the technical report. The proof of the subspace embedding guarantee for LESS-IC, Theorem 15, is similar and can be found in Section 8 of the technical report.
Proof sketch of Theorem 7.
We first assume that the collection of all the random variables in the unscaled OSNAP construction is fully independent; later, we will check the minimum amount of independence needed.
We observe that to prove the theorem, it is enough to show that
$$\Pr\Big(\big\|(\Pi U)^\top\Pi U - I_d\big\| > \varepsilon\Big)\ \le\ \delta.$$
We call the quantity $E := (\Pi U)^\top\Pi U - I_d$ the embedding error. By Markov's inequality, we have, for any positive integer $p$,
$$\Pr\big(\|E\| > \varepsilon\big)\ \le\ \frac{\mathbb{E}\,\|E\|^{2p}}{\varepsilon^{2p}}.$$
For the symmetric matrix $E$ and the even power $2p$, we have $\|E\|^{2p} \le \mathrm{tr}\big[E^{2p}\big]$. Therefore, we have
$$\Pr\big(\|E\| > \varepsilon\big)\ \le\ \frac{\mathbb{E}\,\mathrm{tr}\big[E^{2p}\big]}{\varepsilon^{2p}}.$$
Thus we need to control moments of order $2p$ of the embedding error for a suitably large $p$, and this is done in the following lemma.
Lemma 17 (Trace Moments of Embedding Error for OSNAP, see Section 7 of the technical report).
For the embedding error $E$ defined above, there exist absolute constants such that, for any moment order $p$ in a suitable range, we have
$$\mathbb{E}\,\mathrm{tr}\big[E^{2p}\big]\ \le\ \varepsilon^{2p}\,\delta,$$
provided that $m$ and $s$ satisfy two conditions, referred to as (4) and (5), which amount to lower bounds of the form appearing in Theorem 7 (see Section 7 of the technical report for the precise expressions).
Applying Lemma 17 (with appropriately adjusted constants) implies $\Pr(\|E\| > \varepsilon) \le \delta$
when combined with the previous calculations.
It remains to check that conditions (4) and (5) are satisfied for our choice of $p$ under the assumed lower bounds on $m$ and $s$, and this is done in the full version of the proof in Section 7 of the technical report.
Note that the expression for $\mathbb{E}\,\mathrm{tr}[E^{2p}]$ depends only on products of at most $4p$ entries of the unscaled matrix. So, the quantity remains unchanged if we only assume that subsets of the entries of size $O(p)$ are independent, instead of arbitrary subsets of the entries being independent. Since it suffices to choose $p$ of logarithmic order, we only need an $O(p)$-wise independent unscaled OSNAP. Finally, we show how the above arguments also imply the OSE moment property (Definition 10).
Proof of Corollary 11.
By the proof of Theorem 7, we see that when $m$ and $s$ satisfy the stated lower bounds, the trace moment of the embedding error is appropriately small for the chosen moment order (the proof is stated for one particular choice of this order, but upon going through the argument we see that nearby choices also work). To obtain the OSE moment property, it suffices for $m$ and $s$ to satisfy the same lower bounds, but with the failure probability replaced by a slightly smaller quantity. These new lower bounds can be achieved by lower bounds of the same form as in Theorem 7, but with different constants. The claim follows.
5.1 Controlling Trace Moments of the Embedding Error
We now sketch the proof of Lemma 17, which provides the moment bound for the embedding error used in the previous proof. The full proof can be found in Section 7 of the technical report.
Proof sketch of Lemma 17.
Our first step is to observe that, due to the property of having a fixed number of non-zero entries in each column, all quadratic terms in the embedding error are square-free, and this allows us to use the decoupling technique to reduce the problem of controlling the moments of $E$ to controlling the moments of a decoupled version, in which the two factors of the quadratic form come from independent copies of the unscaled matrix. This is shown in the following claim, with the proof deferred to Section 7 of the technical report.
Lemma 18 (Decoupling).
When the unscaled matrix $\bar\Pi$ has the fully independent unscaled OSNAP distribution, the quadratic form defining the embedding error can be rewritten as a square-free bilinear form in the rows of $\bar\Pi$. Consequently, the trace moments of the embedding error are bounded, up to factors depending only on the moment order, by the corresponding trace moments of the decoupled bilinear form, in which one of the two factors of $\bar\Pi$ is replaced by an independent copy $\bar\Pi'$.
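For intuition, the classical decoupling inequality underlying this step can be recalled as follows (a standard result, stated here in a simplified scalar form with an unoptimized constant). If $x_1,\dots,x_n$ are independent mean-zero random variables, $(x'_i)$ is an independent copy of $(x_i)$, and $F$ is convex, then for any coefficients $(a_{ij})$,
$$\mathbb{E}\,F\Big(\sum_{i\ne j} a_{ij}\,x_i x_j\Big)\ \le\ \mathbb{E}\,F\Big(C\sum_{i\ne j} a_{ij}\,x_i x'_j\Big)$$
for an absolute constant $C$. The square-free structure of the OSNAP quadratic form (no diagonal terms appear, precisely because each column has a fixed number of non-zeros) is what makes this type of reduction available, with matrix-valued coefficients in our setting.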
To estimate the moments of the decoupled quantity, we compare them to the moments from the Gaussian case, i.e., the moments obtained when the two independent copies are replaced by matrices whose entries are independent normal random variables with matching variance (recall that the entries of the unscaled copies are also uncorrelated with mean 0; see Section 6 of the technical report). In this case, due to orthogonal invariance of the Gaussian distribution, the Gaussian copies multiplied by $U$ are distributed as (suitably scaled) $m\times d$ matrices with independent standard normal entries. Thus, we can rely on the following bound, which uses standard results about the norms of Gaussian random matrices with independent entries.
Lemma 19 (Trace Moment of Embedding Error for Decoupled Gaussian Model, see Section 6 of the technical report).
Let $G$ and $G'$ be independent $m\times d$ random matrices with i.i.d. standard Gaussian entries. Then for any positive integer $p$, there exists an absolute constant $C > 0$ such that
$$\mathbb{E}\,\mathrm{tr}\Big[\Big(\tfrac{1}{m}\,G^\top G'\Big)^{2p}\Big]\ \le\ d\,\Big(C\,\tfrac{\sqrt{d}+\sqrt{p}}{\sqrt{m}}\ +\ C\,\tfrac{d+p}{m}\Big)^{2p}.$$
To formally compare the moments of the decoupled OSNAP and Gaussian models, we define the interpolating matrices $X(s,t)$ for $(s,t)\in[0,1]^2$ as described in Section 4, so that $X(0,0)$ corresponds to the pair of independent Gaussian copies and $X(1,1)$ to the pair of independent OSNAP copies (this is equation (6) in the technical report).
Let $\psi(s,t)$ denote the trace moment of the decoupled embedding error evaluated at $X(s,t)$, so that $\psi(0,0)$ corresponds to the Gaussian model and $\psi(1,1)$ to the OSNAP model. Then, due to the decoupling lemma (Lemma 18), to prove Lemma 17 it is enough to bound $\psi(1,1)$. Now, by Lemma 19, we know that $\psi(0,0)$ is sufficiently small under the assumption on $m$. Since we want $\psi(1,1)$ to be small as well, it is enough to ensure that the universality error $|\psi(1,1) - \psi(0,0)|$ is at most a constant fraction of the target bound; we refer to this condition as inequality (7).
For this, we look to estimate the derivative $\frac{d}{dt}\,\psi(t,t)$, and we obtain the following estimate in Lemma 20 using the 2D interpolation idea mentioned in Section 4.
Lemma 20 (Differential Inequality).
For $\psi$ as defined above, there exists a constant such that, for any $t \in (0,1)$, the derivative $\frac{d}{dt}\,\psi(t,t)$ is bounded by an expression involving $\psi(t,t)$ itself and the parameters $m$, $s$, $d$ and $p$ (see Section 7 of the technical report for the precise form).
This differential inequality can be separated into two distinct cases, depending on the relative size of the parameters. In the first case, we can simplify the expression on the right using convexity arguments, and use Lemma 6.6 from [4] to solve the differential inequality and obtain a bound on the universality error $|\psi(1,1)-\psi(0,0)|$ (this is done in the full proof of Lemma 17 in the technical report). Thus, inequality (7) is satisfied under a corresponding lower bound on the sparsity $s$.
In the second case, the expression on the right of the above differential inequality has some additional factors. We replace these factors by terms involving only $\psi$ and the model parameters, and similarly obtain a bound on the universality error; in this case, inequality (7) is satisfied under a second lower bound on $s$.
Combining the analysis for the two cases, it suffices to require lower bounds on $m$ and $s$ of the form stated in Theorem 7, which concludes the proof of Lemma 17 (see the remaining details in Section 7 of the technical report).
5.2 Obtaining the differential inequality in Lemma 20
We now discuss the proof of the technical part of our argument in the previous proof, which is to control the derivative of the interpolant. The full proof can be found in Section 7 of the technical report.
Sketch of proof of Lemma 20.
There are two main ideas for obtaining this differential inequality. First, we use the cumulant method as in [4] to transform the derivative of the trace moment into matrix directional derivatives. Then, we bound the resulting terms in the expression by a delicate use of the matrix Hölder inequality.
Fix $t$ and view $\psi(s,t)$ as a function of $s$ alone. We shall first obtain an expression for the partial derivative $\partial_s\,\psi(s,t)$. To see why this is sufficient, note that the derivative we are interested in is the derivative along the diagonal path $s=t$ of the two-variable function $\psi$, and by the chain rule (as mentioned in Section 4) it splits into the two partial derivatives.
Now, recall that the unscaled OSNAP matrix can be written as a sum of independent terms of the form $r_{ij}\, e_{\nu_{ij}} e_i^\top$, where the $r_{ij}$ are Rademacher signs and the $\nu_{ij}$ are the random positions (see Definition 6). We then have the following lemma.
Lemma (Based on Corollary 6.1, [4]).
For any polynomial $q$ of the matrix entries, the derivative of $\mathbb{E}\,q$ along the interpolation can be expressed as a sum, indexed by partitions, of expectations of directional derivatives of $q$, taken in directions given by independent copies of the individual summands of the model (the expression arises from a cumulant expansion).
Here, the directional derivative of $q$ in a direction $H$ is the usual directional derivative, and the direction matrices are random matrices distributed as the summands of the OSNAP model; crucially, those copies are independent of the interpolating matrices (but not necessarily of each other). Further details are given in the full proof in Section 7 of the technical report.
Applying this lemma to our trace moment, we need to deal with the directional derivatives of the matrix power along directions given by the individual summands. Using a general expression for derivatives of products via the product rule, the directional derivative of a product of matrices in a direction $H$ is a sum of terms in which one factor at a time is replaced by $H$.
In our case, for each fixed summand, the direction $H$ is a copy of a single OSNAP term, which is a rank-one matrix; hence it can be written as an outer product of the form $uv^\top$.
Then, estimating the resulting expectations boils down to estimating trace terms in which products of the interpolating matrices are interleaved with the rank-one directions, with the various terms indexed by permutations of the factors.
For the remainder of the proof, we delicately analyze terms of this form using the matrix Hölder inequality, and appropriately estimate the terms that arise.
Lemma (Matrix Hölder’s inequality, Lemma 5.3 in [4]).
Let $p_1,\dots,p_k \in [1,\infty]$ satisfy $\sum_{i=1}^{k} \frac{1}{p_i} = 1$. Then
$$\big|\,\mathbb{E}\,\mathrm{tr}\big[X_1 X_2\cdots X_k\big]\,\big|\ \le\ \prod_{i=1}^{k}\Big(\mathbb{E}\,\mathrm{tr}\,\big|X_i\big|^{p_i}\Big)^{1/p_i}$$
for any random matrices $X_1,\dots,X_k$.
This analysis based on the matrix Hölder inequality is done in Section 7 of the technical report. One important observation we use in this analysis (among many others) is that the direction matrices are rank one, so their Schatten norms coincide with their operator norm, which allows us to avoid paying an extra dimensional factor. For further details, please refer to the full proof in Section 7 of the technical report.
6 Conclusions
We give an oblivious subspace embedding with optimal embedding dimension that achieves near-optimal sparsity, thus nearly matching a conjecture of Nelson and Nguyen in terms of the best sparsity attainable by an optimal oblivious subspace embedding. We also propose a fast algorithm for constructing low-distortion subspace embeddings, based on a new family of Leverage Score Sparsified embeddings with Independent Columns (LESS-IC). This new algorithm leads to speedups in downstream applications such as optimization problems based on constrained or regularized least squares. As a by-product of our analysis, we develop a new set of tools for matrix universality, combining a decoupling argument with a two-dimensional interpolation method, which are likely of independent interest.
References
- [1] Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003. doi:10.1016/S0022-0000(03)00025-4.
- [2] Nir Ailon and Bernard Chazelle. The fast Johnson–Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39(1):302–322, 2009. doi:10.1137/060673096.
- [3] Jean Bourgain, Sjoerd Dirksen, and Jelani Nelson. Toward a unified theory of sparse dimensionality reduction in Euclidean space. In Proceedings of the forty-seventh annual ACM symposium on Theory of Computing, pages 499–508, 2015. doi:10.1145/2746539.2746541.
- [4] Tatiana Brailovskaya and Ramon van Handel. Universality and sharp matrix concentration inequalities. Geometric and Functional Analysis, pages 1–105, 2024.
- [5] Coralia Cartis, Jan Fiala, and Zhen Shao. Hashing embeddings of optimal dimension, with applications to linear least squares. arXiv preprint arXiv:2105.11815, 2021. arXiv:2105.11815.
- [6] Shabarish Chenakkod, Michał Dereziński, Xiaoyu Dong, and Mark Rudelson. Optimal embedding dimension for sparse subspace embeddings. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing, pages 1106–1117, 2024.
- [7] Shabarish Chenakkod, Michał Dereziński, and Xiaoyu Dong. Optimal oblivious subspace embeddings with near-optimal sparsity, 2024. doi:10.48550/arXiv.2411.08773.
- [8] Nadiia Chepurko, Kenneth L Clarkson, Praneeth Kacham, and David P Woodruff. Near-optimal algorithms for linear algebra in the current matrix multiplication time. In Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 3043–3068. SIAM, 2022. doi:10.1137/1.9781611977073.118.
- [9] Yeshwanth Cherapanamjeri, Sandeep Silwal, David P Woodruff, and Samson Zhou. Optimal algorithms for linear algebra in the current matrix multiplication time. In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 4026–4049. SIAM, 2023. doi:10.1137/1.9781611977554.CH154.
- [10] Kenneth L Clarkson and David P Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the forty-fifth annual ACM symposium on Theory of Computing, pages 81–90, 2013. doi:10.1145/2488608.2488620.
- [11] Michael B Cohen. Nearly tight oblivious subspace embeddings by trace inequalities. In Proc. of the 27th annual ACM-SIAM Symposium on Discrete Algorithms, pages 278–287. SIAM, 2016. doi:10.1137/1.9781611974331.CH21.
- [12] Michael B Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 163–172, 2015. doi:10.1145/2746539.2746569.
- [13] Michael B Cohen, Jelani Nelson, and David P Woodruff. Optimal approximate matrix product in terms of stable rank. In International Colloquium on Automata, Languages, and Programming. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, 2016.
- [14] Anirban Dasgupta, Ravi Kumar, and Tamás Sarlós. A sparse Johnson–Lindenstrauss transform. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 341–350, 2010. doi:10.1145/1806689.1806737.
- [15] Michał Dereziński. Algorithmic gaussianization through sketching: Converting data into sub-gaussian random designs. In The Thirty Sixth Annual Conference on Learning Theory, pages 3137–3172. PMLR, 2023.
- [16] Michał Dereziński, Jonathan Lacotte, Mert Pilanci, and Michael W Mahoney. Newton-less: Sparsification without trade-offs for the sketched newton update. Advances in Neural Information Processing Systems, 34:2835–2847, 2021.
- [17] Michał Dereziński, Zhenyu Liao, Edgar Dobriban, and Michael Mahoney. Sparse sketches with small inversion bias. In Conference on Learning Theory, pages 1467–1510. PMLR, 2021.
- [18] Michał Dereziński and Michael W Mahoney. Recent and upcoming developments in randomized numerical linear algebra for machine learning. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 6470–6479, 2024.
- [19] Petros Drineas, Malik Magdon-Ismail, Michael W Mahoney, and David P Woodruff. Fast approximation of matrix coherence and statistical leverage. The Journal of Machine Learning Research, 13(1):3475–3506, 2012. doi:10.5555/2503308.2503352.
- [20] Petros Drineas and Michael W Mahoney. RandNLA: Randomized numerical linear algebra. Communications of the ACM, 59(6):80–90, 2016. doi:10.1145/2842602.
- [21] Petros Drineas, Michael W Mahoney, and S Muthukrishnan. Sampling algorithms for regression and applications. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 1127–1136, 2006.
- [22] Daniel M Kane and Jelani Nelson. Sparser Johnson-Lindenstrauss transforms. Journal of the ACM (JACM), 61(1):1–23, 2014. doi:10.1145/2559902.
- [23] Yi Li and Mingmou Liu. Lower bounds for sparse oblivious subspace embeddings. In Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 251–260, 2022. doi:10.1145/3517804.3526224.
- [24] Per-Gunnar Martinsson and Joel A Tropp. Randomized numerical linear algebra: Foundations and algorithms. Acta Numerica, 29:403–572, 2020. doi:10.1017/S0962492920000021.
- [25] Xiangrui Meng and Michael W Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 91–100, 2013. doi:10.1145/2488608.2488621.
- [26] Jelani Nelson and Huy L Nguyên. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 117–126. IEEE, 2013.
- [27] Jelani Nelson and Huy L Nguyên. Lower bounds for oblivious subspace embeddings. In International Colloquium on Automata, Languages, and Programming, pages 883–894. Springer, 2014. doi:10.1007/978-3-662-43948-7_73.
- [28] Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In 2006 47th annual IEEE symposium on foundations of computer science (FOCS’06), pages 143–152. IEEE, 2006.
- [29] Joel A Tropp. Improved analysis of the subsampled randomized hadamard transform. Advances in Adaptive Data Analysis, 3(01n02):115–126, 2011. doi:10.1142/S1793536911000787.
- [30] Joel A. Tropp. An introduction to matrix concentration inequalities. Found. Trends Mach. Learn., 8(1–2):1–230, May 2015. doi:10.1561/2200000048.
- [31] Joel A Tropp. Comparison theorems for the minimum eigenvalue of a random positive-semidefinite matrix. arXiv preprint arXiv:2501.16578, 2025. doi:10.48550/arXiv.2501.16578.
- [32] David P Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends® in Theoretical Computer Science, 10(1–2):1–157, 2014. doi:10.1561/0400000060.