
Guessing Efficiently for Constrained Subspace Approximation

Aditya Bhaskara, University of Utah, Salt Lake City, UT, USA; Sepideh Mahabadi, Microsoft Research, Redmond, WA, USA; Madhusudhan Reddy Pittu, Carnegie Mellon University, Pittsburgh, PA, USA; Ali Vakilian, Toyota Technological Institute at Chicago, IL, USA; David P. Woodruff, Carnegie Mellon University, Pittsburgh, PA, USA
Abstract

In this paper we study the constrained subspace approximation problem. Given a set of n points {a_1, …, a_n} in ℝ^d, the goal of the subspace approximation problem is to find a k-dimensional subspace that best approximates the input points. More precisely, for a given p ≥ 1, we aim to minimize the pth power of the ℓ_p norm of the error vector (‖a_1 − P a_1‖, …, ‖a_n − P a_n‖), where P denotes the projection matrix onto the subspace and the norms are Euclidean. In constrained subspace approximation (CSA), we additionally have constraints on the projection matrix P. In its most general form, we require P to belong to a given subset S that is described explicitly or implicitly.

We introduce a general framework for constrained subspace approximation. Our approach, which we term coreset-guess-solve, yields either (1+ε)-multiplicative or ε-additive approximations for a variety of constraints. We show that it provides new algorithms for partition-constrained subspace approximation with applications to fair subspace approximation, k-means clustering, and projected non-negative matrix factorization, among others. Specifically, while we recover the best known bounds for k-means clustering in Euclidean spaces, we improve the known results for the remaining problems.

Keywords and phrases:
parameterized complexity, low rank approximation, fairness, non-negative matrix factorization, clustering
Category:
Track A: Algorithms, Complexity and Games
Funding:
Aditya Bhaskara: Supported by NSF CCF-2047288.
David P. Woodruff: Supported by a Simons Investigator Award and Office of Naval Research award number N000142112647.
Copyright and License:
© Aditya Bhaskara, Sepideh Mahabadi, Madhusudhan Reddy Pittu, Ali Vakilian, and David P. Woodruff; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Continuous optimization
Related Version:
Full Version: https://arxiv.org/abs/2504.20883
Editors:
Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis

1 Introduction

Large data sets, often represented as collections of high-dimensional points, naturally arise in fields such as machine learning, data mining, and computational geometry. Despite their high-dimensional nature, these points typically exhibit low intrinsic dimensionality. Identifying (or summarizing) this underlying low-dimensional structure is a fundamental algorithmic question with numerous applications to data analysis. We study a general formulation, that we call the subspace approximation problem.

In subspace approximation, given a set of n points {a_1, …, a_n} ⊆ ℝ^d and a rank parameter k, we consider the problem of “best approximating” the input points with a k-dimensional subspace in a high-dimensional space. Here the goal is to find a rank-k projection P that minimizes the projection costs ‖a_i − P a_i‖, aggregated over i ∈ [n]. The choice of aggregation leads to different well-studied formulations. In the ℓ_p subspace approximation problem, the objective is (∑_i ‖a_i − P a_i‖_2^p)^{1/p}. Formally, denoting by A the d×n matrix whose ith column is a_i, the ℓ_p-subspace approximation problem asks to find a rank-k projection matrix P ∈ ℝ^{d×d} that minimizes ‖A − PA‖_{2,p}^p := ∑_{i=1}^n ‖a_i − P a_i‖_2^p. For different choices of p, ℓ_p-subspace approximation captures some well-studied problems, notably the median hyperplane problem (when p = 1), the principal component analysis (PCA) problem (when p = 2), and the center hyperplane problem (when p = ∞).

Subspace approximation for general p turns out to be NP-hard for all p ≠ 2. For p > 2, semidefinite programming helps achieve a constant factor approximation (for fixed p) for the problem [19]. Matching hardness results were also shown for the case p > 2, first assuming the Unique Games Conjecture [19], and then based only on P ≠ NP [23]. For p < 2, hardness results were first shown in the work of [13].

Due to the ubiquitous applications of subspace approximation in various domains, several “constrained” versions of the problem have been extensively studied as well [20, 44, 33, 2, 8, 14]. In the most general setting of the constrained p-subspace approximation problem, we are additionally given a collection 𝒮 of rank-k projection matrices (specified either explicitly or implicitly) and the goal is to find a projection matrix 𝑷𝒮 minimizing the objective. I.e.,

min_{P ∈ S} ‖A − PA‖_{2,p}^p. (1)

Some examples of problems in constrained subspace approximation include the well-studied column subset selection [7, 39, 18, 11, 24, 6, 1], where the projection matrices are constrained to project onto the span of k of the original vectors; (k,z)-means clustering, in which the set of projection matrices can be specified by the partitioning of the points into k clusters (see [14] for a reference); and many more which we will describe in this paper.

1.1 Our Contributions and Applications

In this paper, we provide a general algorithmic framework for constrained ℓ_p-subspace approximation that yields either (1+ε)-multiplicative or ε-additive error approximations to the objective (depending on the setting), with running time exponential in k. We apply the framework to several classes of constrained subspace approximation, leading to new results or results matching the state-of-the-art for these problems. Note that since the problems we consider are typically APX-hard (including k-means, and even the unconstrained version of ℓ_p-subspace approximation for p > 2), a running time exponential in k is necessary for our results, assuming the Exponential Time Hypothesis; see the discussion in Section 2. Before presenting our results, we start with an informal description of the framework.

Overview of Approach.

Our approach is based on coresets [21] (see also [16, 15, 27] and references therein), but turns out to be different from the standard approach in a subtle yet important way. Recall that a (strong) coreset for an optimization problem O on a set of points A is a subset B such that for any solution for O, the cost on B is approximately the same as the cost on A, up to an appropriate scaling. In the formulation of ℓ_p-subspace approximation above, a coreset for a dataset A would be a subset B of its columns, with far fewer than n columns, such that for all k-dimensional subspaces, each defined by some P, ‖B − PB‖_{2,p}^p ≈ ‖A − PA‖_{2,p}^p, up to scaling. Thus the goal becomes to minimize the former quantity.

In the standard coreset approach, first a coreset is obtained, and then a problem-specific enumeration procedure is used to find a near optimal solution 𝑷. For example, for the k-means clustering objective, one can consider all the k-partitions of the points in the coreset 𝑩; each partition leads to a set of candidate centers, and the best of these candidate solutions will be an approximate solution to the full instance. Similarly for (unconstrained) p-subspace approximation, one observes that for an optimal solution, the columns of 𝑷 must lie in the span of the vectors of 𝑩, and thus one can enumerate over the combinations of the vectors of 𝑩. Each combination gives a candidate 𝑷, and the best of these candidate solutions is an approximate solution to the full instance.

However, this approach does not work in general for constrained subspace approximation. To see this, consider the very simple constraint of having the columns of 𝑷 coming from some given subspace S. Here, the coreset for p-subspace approximation on 𝑨 will be some set 𝑩 that is “oblivious” of the subspace S. Thus, enumerating over combinations of 𝑩 may not yield any vectors in S!

Our main idea is to avoid enumeration over candidate solutions; instead, we view the solution (a d×k matrix of basis vectors for the subspace) as simply a set of variables. We then note that since the goal is to use P to approximate B, there must be some combination of the basis vectors (equivalently, a set of k coefficients) that approximates each vector a_i in B. If the coreset has r columns, there are only kr coefficients in total, and we can thus hope to enumerate these coefficients in time exponential in kr. For every given choice of coefficients, we can then solve an optimization problem to find the optimal P. For the constraints we consider (including the simple example above), this problem turns out to be convex, and can thus be solved efficiently!
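For concreteness, the size of this search space can be computed directly; the parameter values below are illustrative only (the precise net construction appears in Lemma 8).

```python
# Size of the search space when guessing the k x r coefficient matrix C on a
# delta-net: each of the r columns lies in a k-dimensional ball and is
# discretized into O((1/delta)^k) net points, for O((1/delta)^(k*r)) guesses
# in total. The parameter values below are illustrative only.
k, r, delta = 2, 8, 0.5
net_points_per_column = (1 / delta) ** k
total_guesses = net_points_per_column ** r
print(total_guesses)   # (1/delta)^(k*r) = 2^16 = 65536.0
```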

This simple idea yields ε-additive approximation guarantees for a range of problems. We then observe that in specific settings of interest, we can obtain (1+ε)-multiplicative approximations by avoiding guessing of the coefficients. In these settings, once the coefficients have been guessed, there is a closed form for the optimal basis vectors, in the form of low degree polynomials of the coefficients. We can then use the literature on solving polynomial systems of equations (viewing the coefficients as variables) to obtain algorithms that are more efficient than guessing. The framework is described more formally in Section 3.

We believe our general technique of using coresets to reduce the number of coefficients needed, in order to turn a constrained non-convex optimization problem into a convex one, may be of broader applicability. We note it is fundamentally different from the “guess a sketch” technique for variable reduction in [34, 3, 4, 30] and the techniques for reducing variables in non-negative matrix factorization [32]. To support this statement, the guess-a-sketch technique requires the existence of a small sketch, and consequently has only been applied to approximation with entrywise ℓ_p-norms and weighted variants [34, 3, 30], whereas our technique applies to a much wider family of norms.

Relation to Prior Work.

We briefly discuss the connection to prior work on binary matrix factorization using coresets. The work of [40] considers binary matrix factorization by constructing a strong coreset that reduces the number of distinct rows via importance sampling, leveraging the discrete structure of binary inputs. Our framework generalizes these ideas to continuous settings: we use strong coresets not merely to reduce distinct rows, but to reduce the number of variables in a polynomial system for solving continuous constrained optimization problems. This enables us to extend the approach to real-valued matrices and to more general loss functions. Our framework can be seen as a generalization and unification of prior coreset-based “guessing” strategies, adapting them to significantly broader settings.

Applications.

We apply our framework to the following applications. Each of these settings can be viewed as subspace approximation with a constraint on the subspace (i.e., on the projection matrix), or on properties of the associated basis vectors. Below we describe these applications, mention how they can be formulated as Constrained Subspace Approximation, and state our results for them. See Table 1 for a summary.

Table 1: Summary of the upper bound results we get using our framework. In the approximation column, we use superscripts *, +, and ⋄ to indicate whether the guarantee is a multiplicative, additive, or multiplicative-additive approximation, respectively. In the prior-work column, we use a tilde (∼) to indicate that no prior theoretical guarantees are known (only heuristics) and a hyphen (−) to indicate that the problem is new.
Problem | Running Time | Approx. | Prior Work
PC-ℓ_p-Subspace Approx. | (κ/ε)^{poly(k/ε)} · poly(n) (Theorem 21) | (O(εp)·‖A‖_{2,p}^p)^+ | −
PC-ℓ_p-Subspace Approx. | n^{O(k²/ε)} · poly(H) (Theorem 22) | (1+ε)^* | −
Constrained Subspace Est. | poly(n) · (1/δ)^{O(k²/ε)} (Corollary 18) | (1+ε, O(δ‖A‖_F²))^⋄ | ∼
Constrained Subspace Est. | O(ndγ/ε)^{O(k³/ε)} (Theorem 19) | (1+ε)^* | ∼
PNMF | O(dk²/ε) · (1/δ)^{O(k²/ε)} (Theorem 1) | (1+ε, O(δ‖A‖_F²))^⋄ | ∼
PNMF | (ndγ/ε)^{O(k³/ε)} (Theorem 2) | (1+ε)^* | ∼
k-Means Clustering | O(nnz(A) + 2^{Õ(k/ε)} + n^{o(1)}) (Theorem 3) | (1+ε)^* | [21]
Sparse PCA | d^{O(k³/ε²)} · k³/ε (Theorem 4) | (ε‖A − A_k‖_F²)^+ | [17]

1.1.1 Subspace Approximation with Partition Constraints

First, we study a generalization of ℓ_p-subspace approximation, where we have partition constraints on the subspace. More specifically, we consider PC-ℓ_p-subspace approximation, where besides the point set {a_1, …, a_n} ⊆ ℝ^d, we are given subspaces S_1, …, S_ℓ along with capacities k_1, …, k_ℓ such that ∑_{i=1}^ℓ k_i = k. Now the set of valid projections S is implicitly defined to be the set of projections onto the subspaces that are obtained by selecting k_i vectors from S_i for each i ∈ [ℓ] and taking their span.

PC-ℓ_p-subspace approximation can be viewed as a variant of data summarization with “fair representation”. Specifically, when S_i is the span of the vectors (or points) in group i, then by setting the k_i values properly (depending on the application or the choice of policy makers), PC-ℓ_p-subspace approximation captures the problem of finding a summary of the input data in which groups are fairly represented. This corresponds to the equitable representation criterion, a popular notion studied extensively in the fairness of algorithms, e.g., clustering [29, 28, 10, 26].¹ We show the following results for PC-subspace approximation:

¹ We note that the fair representation definitions differ from those in the line of work on fair PCA and column subset selection [35, 38, 31, 37], where the objective contributions (i.e., projection costs) of different groups must either be equal (if possible) or the maximum incurred cost must be minimized. We focus on the question of groups having equal, or appropriately bounded, representation among the chosen low-dimensional subspace (i.e., directions). This distinction is also found in algorithmic fairness studies of other problems, such as clustering.

  • First, in Theorem 21, we show for any p ≥ 1 an algorithm for PC-ℓ_p-subspace approximation with runtime (κ/ε)^{poly(k/ε)} · poly(n) that returns a solution with additive error at most O(εp)·‖A‖_{2,p}^p, where κ is the condition number of the optimal choice of vectors from the given subspaces.

  • For p = 2, which is one of the most common loss functions for PC-ℓ_p-subspace approximation, we also present a multiplicative approximation guarantee. There exists a (1+ε)-approximation algorithm running in time s^{O(k²/ε)} · poly(H), where H is the bit complexity of each element in the input and s is the sum of the dimensions of the input subspaces S_1, …, S_ℓ, i.e., s = ∑_{j=1}^ℓ dim(S_j). The formal statement is in Theorem 22.

1.1.2 Constrained Subspace Estimation

The constrained subspace estimation problem originates from the signal processing community [36], and aims to find a subspace V of dimension k that best approximates a collection of experimentally measured subspaces T_1, …, T_m, with the constraint that it intersects a model-based subspace W in at least a predetermined number of dimensions ℓ, i.e., dim(V ∩ W) ≥ ℓ. This problem arises in applications such as beamforming, where the model-based subspace is used to encode the available prior information about the problem. The paper [36] formulates and motivates the problem, and further presents an algorithm based on a semidefinite relaxation of this non-convex problem, whose performance is only demonstrated via numerical simulation.

We show in Section 4.1 that this problem can be reduced to at most k instances of PC-ℓ_2-subspace approximation in which the number of parts is ℓ = 2. This gives us the following results for the constrained subspace estimation problem.

  • In Corollary 18, we show a (1+ε, δ‖A‖_F²)-multiplicative-additive approximation in time poly(n) · (1/δ)^{O(k²/ε)}.

  • In Theorem 19, we show a (1+ε) multiplicative approximation in time O(ndγ/ε)^{O(k³/ε)}, where we assume A has integer entries of absolute value at most γ. We assume that γ = poly(n).

1.1.3 Projective Non-Negative Matrix Factorization

Projective Non-Negative Matrix Factorization (PNMF) [45] (see also [46, 43]) is a variant of Non-Negative Matrix Factorization (NMF), used for dimensionality reduction and data analysis, particularly for datasets with non-negative values such as images and text. In NMF, a non-negative matrix X is factorized into the product of two non-negative matrices W and H such that X ≈ WH, where W contains basis vectors and H represents coefficients. In PNMF, the aim is to approximate the data matrix by projecting it onto a subspace spanned by non-negative vectors, similar to NMF. However, in PNMF, the factorization is constrained to be projective.

Formally, PNMF can be formulated as a constrained ℓ_2-subspace approximation as follows: the set of feasible projection matrices S consists of all matrices that can be written as P = UU^T, where U is a d×k matrix with orthonormal columns and all non-negative entries.
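As a quick illustration of this feasible set, the following numpy sketch (the toy U and A are ours, purely for illustration) checks that a non-negative matrix with orthonormal columns induces a valid projection matrix and evaluates the PNMF objective:

```python
import numpy as np

# A hypothetical feasible point for PNMF: U is d x k, entrywise non-negative,
# with orthonormal columns (non-negativity plus orthogonality forces the
# columns of U to have disjoint supports).
U = np.array([[0.6, 0.0],
              [0.8, 0.0],
              [0.0, 1.0]])
assert np.all(U >= 0) and np.allclose(U.T @ U, np.eye(2))

P = U @ U.T                      # the induced projection matrix in S
assert np.allclose(P @ P, P)     # idempotent, hence a projection

A = np.array([[3.0, 1.0],
              [4.0, 2.0],
              [0.0, 5.0]])       # toy non-negative data matrix (d x n)
cost = np.linalg.norm(A - P @ A, 'fro') ** 2
```

Here `cost` is the PNMF objective ‖A − UU^T A‖_F² for this particular feasible U (0.16 on this toy data).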

We show the following results:

  • In Theorem 1, we show a (1+ε, δ‖A‖_F²)-multiplicative-additive approximation in time O(dk²/ε) · (1/δ)^{O(k²/ε)}.

  • In Theorem 2, we show a (1+ε) multiplicative approximation in time (ndγ/ε)^{O(k³/ε)}, where we assume A has integer entries of absolute value at most γ.

Theorem 1 (Additive approximation for NMF).

Given an instance A ∈ ℝ^{d×n} of non-negative matrix factorization and any 0 < δ < 1, there is an algorithm that computes a U ∈ ℝ_{≥0}^{d×k} with U^T U = I_k such that

‖A − UU^T A‖_F² ≤ (1+ε)·OPT + O(δ‖A‖_F²)

in time O(dk²/ε) · (1/δ)^{O(k²/ε)}.

Theorem 2 (Multiplicative approximation for NMF).

Given an instance A ∈ ℝ^{d×n} of non-negative matrix factorization with integer entries of absolute value at most γ, there is an algorithm that computes a U ∈ ℝ_{≥0}^{d×k} with U^T U = I_k such that

‖A − UU^T A‖_F² ≤ (1+ε)·OPT

in time (ndγ/ε)^{O(k³/ε)}.

1.1.4 k-Means Clustering

k-means is a popular clustering objective widely used in data analysis and machine learning. Given a set of n vectors a_1, …, a_n and a parameter k, the goal of k-means clustering is to partition the vectors into k clusters {C_1, …, C_k} so as to minimize the sum of squared distances of all points to their corresponding cluster centers, ∑_{i=1}^n ‖a_i − μ_{C(a_i)}‖_2², where C(a_i) denotes the cluster that a_i belongs to and μ_{C(a_i)} denotes its center. It is an easy observation that once the clustering is determined, the cluster centers must be the centroids of the points in each cluster. It is shown in [14] that this problem is an instance of constrained subspace approximation. More precisely, the set of valid projection matrices consists of all those that can be written as P = X_C X_C^T, where X_C is an n×k matrix with X_C(i,j) = 1/√|C_j| if C(a_i) = j and 0 otherwise. Note that X_C has orthonormal columns and thus X_C X_C^T is an orthogonal projection matrix. Further, note that in our language we need to apply constrained subspace approximation to the matrix A^T, i.e., min_{P∈S} ‖A^T − P A^T‖_F². In Theorem 3, we show a (1+ε)-approximation algorithm for k-means that runs in O(nnz(A) + 2^{Õ(k/ε)} + n^{o(1)}) time, whose dependence on k and ε matches that of [21].

Theorem 3.

Given an instance A ∈ ℝ^{n×d} of k-means, there is an algorithm that computes a (1+ε)-approximate solution to k-means in O(nnz(A) + 2^{Õ(k/ε)} + n^{o(1)}) time.
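The reduction above can be sanity-checked numerically; the following sketch (the points and the fixed partition are illustrative) verifies that the constrained-subspace-approximation objective, written with the points as rows, matches the k-means cost:

```python
import numpy as np

# Toy check that the k-means cost equals the constrained-subspace objective
# ||A - X_C X_C^T A||_F^2, with the points as rows of A and
# X_C(i, j) = 1/sqrt(|C_j|) when point i is assigned to cluster j.
A = np.array([[0.0, 0.0],
              [2.0, 0.0],
              [10.0, 0.0],
              [12.0, 0.0]])            # n = 4 points in R^2, as rows
clusters = [[0, 1], [2, 3]]             # a fixed 2-partition

X = np.zeros((4, 2))
for j, C in enumerate(clusters):
    X[C, j] = 1.0 / np.sqrt(len(C))
assert np.allclose(X.T @ X, np.eye(2))  # X_C has orthonormal columns

csa_cost = np.linalg.norm(A - X @ X.T @ A, 'fro') ** 2
kmeans_cost = sum(np.sum((A[C] - A[C].mean(axis=0)) ** 2) for C in clusters)
assert np.isclose(csa_cost, kmeans_cost)   # both equal 4.0 here
```

The intermediate product X_C X_C^T A replaces each row of A by its cluster centroid, which is exactly why the two costs coincide.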

1.1.5 Sparse PCA

The goal of Principal Component Analysis (PCA) is to find k linear combinations of the d features (dimensions), called principal components, that capture most of the mass of the data. As mentioned earlier, PCA is the subspace approximation problem with p = 2. However, the obtained principal components are typically linear combinations of all coordinates, which makes interpreting the components more difficult. As such, Sparse PCA, the optimization problem obtained from PCA by adding a sparsity constraint on the principal components, has been introduced to provide higher data interpretability [17, 47, 9, 25, 5].

Sparse PCA can be formulated as a constrained subspace approximation problem in which the set of projection matrices is constrained to those that can be written as P = UU^T, where U is a d×k matrix with orthonormal columns such that the total number of non-zero entries in U is at most s, for a given parameter s.

max: ⟨AA^T, UU^T⟩ (sparse-PCA-max)
s.t. U^T U = I_k, ∑_{j∈[k]} ‖U_{·,j}‖_0 ≤ s. (2)

Program sparse-PCA-max can also be formulated as a minimization version

min: ‖A − UU^T A‖_F² (sparse-PCA-min)
s.t. U^T U = I_k, ∑_{j∈[k]} ‖U_{·,j}‖_0 ≤ s. (3)

We give an algorithm that runs in time d^{O(k³/ε²)} · (dk³/ε + d log d) and computes an ε‖A − A_k‖_F² additive approximate solution, which translates to a (1+ε)-multiplicative approximate solution to one formulation of the problem.

Theorem 4.

Given an instance (A ∈ ℝ^{d×n}, k, s) of sparse-PCA, there is an algorithm that runs in time

d^{O(kr²)} · (dkr² + d log d) (4)

with r = k + ⌈k/ε⌉ that computes an ε‖A − A_k‖_F² additive approximate solution to both sparse-PCA-max and sparse-PCA-min. This is guaranteed to be a (1+ε)-approximate solution to sparse-PCA-min because ‖A − A_k‖_F² is a lower bound on the optimum of sparse-PCA-min.

1.1.6 Column Subset Selection with a Partition Constraint

Column subset selection (CSS) is a popular data summarization technique [8, 14, 1], where given a matrix A, the goal is to find k columns of A that best approximate all columns of A. Since in CSS a subset of the columns of the matrix is picked as the summary of A, enforcing partition constraints naturally captures the problem of column subset selection with fair representation. More formally, in column subset selection with partition constraints (PC-column subset selection), given a partitioning of the columns of A into ℓ groups A^{(1)}, …, A^{(ℓ)}, along with capacities k_1, …, k_ℓ, where ∑_i k_i = k, the set of valid subspaces is obtained by picking k_i columns from A^{(i)} and projecting onto the span of these k columns of A.

In Theorem 5, we show that PC-column subset selection is hard to approximate to any factor f in polynomial time, even if there are only two groups, and even when we allow the capacity constraint to be violated by a factor of o(log n). This is in sharp contrast with the standard column subset selection problem, for which efficient algorithms with tight guarantees are known.

Theorem 5.

Assuming SAT ∉ DTIME(n^{O(log log n)}), the PC-column subset selection problem is hard to approximate to any multiplicative factor f, even in the following special cases:

(i) The case of ℓ = 2 groups, where the capacities on all the groups are the same parameter s.

(ii) The case where the capacities on all the groups are the same parameter s, and we allow a solution to violate the capacity by a factor g(n) = o(log n), where n is the total number of columns in the instance.

2 Preliminaries

We will heavily use standard notation for vector and matrix quantities. For a matrix M, we denote by M_{·,i} the ith column of M and by M_{i,·} the ith row. We denote by ‖M‖_F the Frobenius norm, which is simply √(∑_{i,j} m_{ij}²), where m_{ij} is the entry in the ith row and jth column of M. We also use mixed norms, where ‖M‖_{2,p} = (∑_i ‖M_{·,i}‖_2^p)^{1/p}; i.e., it is the ℓ_p norm of the vector whose entries are the ℓ_2 norms of the columns of M.
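As a quick illustration, the mixed norm can be computed in a few lines of numpy (the helper name is ours):

```python
import numpy as np

def mixed_norm(M, p):
    """||M||_{2,p}: the l_p norm of the vector of column l_2 norms."""
    col_norms = np.linalg.norm(M, axis=0)   # l_2 norm of each column
    return np.linalg.norm(col_norms, ord=p)

M = np.array([[3.0, 0.0],
              [0.0, 4.0]])
# Column norms are (3, 4): ||M||_{2,1} = 7 and ||M||_{2,2} = ||M||_F = 5.
```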

We also use σ_min(M) to denote the least singular value of a matrix, and σ_max(M) to denote the largest singular value. The value κ(M) denotes the condition number, which is the ratio of the largest to the smallest singular value.

In analyzing the running times of our algorithms, we will use the following basic primitives, the running times of which we denote as T0 and T1 respectively. These are standard results from numerical linear algebra; while there are several improvements using randomization, these bounds will not be the dominant ones in our running time, so we do not optimize them.

Lemma 6 (SVD Computation; see [22]).

Given A ∈ ℝ^{d×n}, computing the reduced matrix B as in Lemma 12 takes time T_0 := H · min{O(nd²), O(ndk/ε)}, where H is the maximum bit complexity of any element of A.

Lemma 7 (Least Squares Regression; see [22]).

Given A ∈ ℝ^{d×n} and a target matrix B with r columns, the optimization problem min_C ‖B − AC‖_F² can be solved in time T_1 := O(nrd²H), where H is the maximum bit length of any entry in A, B.

Remark on the Exponential-in-k Running Times.

In all of our results, it is natural to ask if the exponential dependence on k is necessary. We note that many of the problems we study are APX-hard, and thus obtaining multiplicative (1+ε) factors will necessarily require exponential time in the worst case. For problems that generalize ℓ_p-subspace approximation (e.g., the PC-ℓ_p-subspace approximation problem, Section 4.2), the work of [23] showed APX-hardness for p > 2 while [13] showed NP-hardness. In our reductions, we in fact have the stronger property that the YES and NO instances differ in objective value by (1/poly(k)) · ‖A‖_{2,p}^p, where A is the matrix used in the reduction. Thus, assuming the Exponential Time Hypothesis, even the additive error guarantee in general requires an exponential dependence on either k or 1/ε, at least for p > 2.

3 Framework for Constrained Subspace Approximation

Given a d×n matrix A and a special collection S of rank-k projection matrices, we are interested in selecting the projection matrix P ∈ S that minimizes the sum of projection costs (raised to the pth power) of the columns of A onto P. More compactly, the optimization problem is

min_{P∈S}: ‖A − PA‖_{2,p}^p. (CSA)

A more geometric and equivalent interpretation is that we have a collection of n data points {a_1, a_2, …, a_n} ⊆ ℝ^d and we would like to approximate these data points by a subspace while satisfying certain constraints on the subspace:

min: ∑_{i=1}^n ‖a_i − â_i‖_2^p (CSA-geo)
s.t. â_i ∈ ColumnSpan(P)
P ∈ S.

See Lemma 9 for a proof of the equivalence. We provide a unified framework to obtain approximately optimal solutions for various special collections of 𝒮. In our framework, there are three steps to obtaining an approximate solution to any instance of CSA.

  1. Build a coreset: Reduce the size of the problem by replacing A with a different matrix B ∈ ℝ^{d×r} with fewer columns, typically r = poly(k, 1/ε). The property we need to guarantee is that the projection cost is approximately preserved, possibly with an additive shift c ≥ 0 independent of P:

    ‖B − PB‖_{2,p}^p ∈ [1, 1+ε] · ‖A − PA‖_{2,p}^p − c,  ∀P with rank at most k. (5)

    Such a B (for p = 2) has been referred to as a projection-cost-preserving sketch with one-sided error in [14]. See Definition 10, Theorem 11, and Lemma 12 for results obtaining such a B for various 1 ≤ p < ∞. Lemma 14 shows that approximate solutions to reduced instances (B, S) satisfying Equation 5 are also approximate solutions to the original instance (A, S).

  2. Guess coefficients: Since the projection matrix P has rank k, it can be represented as UU^T with U^T U = I_k. Using this, observe that the residual matrix

    B − PB = B − U(U^T B)

    can be represented as B − UC, where C = U^T B is a k×r matrix. The norm of the ith column of C is bounded by ‖b_i‖_2, the norm of the ith column of B. This allows us to guess every column of C inside a k-dimensional ball of radius at most the norm of the corresponding column of B. Using a net with appropriate granularity, we guess the optimal C up to an additive error.

  3. Solve: For every fixed C in the search space above, we solve the constrained regression problem

    min_{U ∈ ℝ^{d×k}: UU^T ∈ S} ‖B − UC‖_{2,p}^p

    exactly. If Ĉ is the C matrix that induces the minimum cost, and Û is the corresponding minimizer of the constrained regression problem, we return the projection matrix ÛÛ^T.
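The three steps above can be sketched in code. The following is a minimal numpy illustration for p = 2, where for simplicity we take S to be the set of all rank-k projections, so that the solve step becomes an orthogonal Procrustes problem with a closed-form SVD solution; the function names and the crude box-shaped net are ours, not the paper's implementation:

```python
import itertools
import numpy as np

def solve_step(B, C):
    # Exact solve of min_{U : U^T U = I_k} ||B - U C||_F^2. With S equal to
    # all rank-k projections this is an orthogonal Procrustes problem: if
    # B C^T = W Sigma V^T is a thin SVD, the minimizer is U = W V^T.
    W, _, Vt = np.linalg.svd(B @ C.T, full_matrices=False)
    return W @ Vt

def coreset_guess_solve(B, k, delta):
    """Guess C = U^T B column by column on a delta-net, then solve for U."""
    d, r = B.shape
    norms = np.linalg.norm(B, axis=0)
    # Column i of the optimal C has norm at most ||b_i||_2, so we discretize
    # a box containing that ball with granularity ||b_i||_2 * delta.
    nets = [list(itertools.product(
                np.arange(-norms[i], norms[i] + 1e-12,
                          max(norms[i] * delta, 1e-12)),
                repeat=k))
            for i in range(r)]
    best_cost, best_P = np.inf, None
    for cols in itertools.product(*nets):
        C = np.array(cols).T                      # one guessed k x r matrix
        U = solve_step(B, C)
        cost = np.linalg.norm(B - U @ (U.T @ B), 'fro') ** 2
        if cost < best_cost:
            best_cost, best_P = cost, U @ U.T
    return best_P, best_cost

B = np.array([[3.0, 0.0],
              [0.0, 2.0],
              [0.0, 0.0]])
P_hat, cost = coreset_guess_solve(B, k=1, delta=0.25)
```

On this toy instance the returned cost is σ₂(B)² = 4, the optimal rank-1 projection cost; for a genuinely constrained S, `solve_step` would instead call the appropriate convex solver.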

The following lemma formalizes the framework above and can be used as a black box application for several specific instances of CSA.

Lemma 8.

Given an instance (A, S) of CSA, for 1 ≤ p < ∞:

  1. Let T_s be the time taken to obtain a smaller instance (B, S) such that the approximate-cost property in Equation 5 is satisfied and the number of columns in B is r.

  2. Let T_r be the time taken to solve the constrained regression problem, for any fixed B ∈ ℝ^{d×r} and C ∈ ℝ^{k×r},

    min_{U ∈ ℝ^{d×k}: UU^T ∈ S} ‖UC − B‖_{2,p}^p. (6)

Then for any granularity parameter 0 < δ < 1, we obtain a solution P ∈ S such that

    ‖A − PA‖_{2,p}^p ≤ (1+ε)·OPT + Δ (7)

in time T_s + T_r · O((1/δ)^{kr}). Here, Δ = (1+ε)·‖A‖_{2,p}^p·((1+δ)^p − 1) and OPT = min_{P∈S} ‖A − PA‖_{2,p}^p.

Proof.

Let the optimal solution to the instance (A, S) be P* = U*(U*)^T and let C* = (U*)^T B. Since the columns of U* are unit vectors, the norm of the ith column of C* is at most ‖b_i‖_2, the norm of the ith column of B. We will try to approximately guess the columns of C* using epsilon-nets. For each i, we search for the ith column of C* using a (‖b_i‖_2 δ)-net inside a k-dimensional ball of radius ‖b_i‖_2 centered at the origin. The size of the net for each column of C* is O((1/δ)^k) and hence the total search space over matrices C has O((1/δ)^{kr}) possibilities.

For each C, we solve the constrained regression problem in Equation 6. Let Ĉ be the matrix for which the cost is minimized and Û the corresponding minimizer of the constrained regression problem. Consider the solution P̂ = ÛÛ^T. The cost of this solution on the reduced instance (B, S) is

‖B − ÛÛ^T B‖_{2,p}^p ≤ ‖B − ÛĈ‖_{2,p}^p. (8)

Let C̄ be the matrix in the search space such that ‖C̄_{·,i} − C*_{·,i}‖_2 ≤ ‖b_i‖_2 δ for every i ∈ [r]. Using the cost minimality of Ĉ, the above cost is

≤ min_{U ∈ ℝ^{d×k}: UU^T ∈ S} ‖B − UC̄‖_{2,p}^p (9)
≤ ‖B − U*C̄‖_{2,p}^p. (10)

It remains to upper bound the difference Δ = ‖B − U*C̄‖_{2,p}^p − ‖B − U*C*‖_{2,p}^p. If we let b_i* := (U*C*)_{·,i} and b̄_i := (U*C̄)_{·,i} for i ∈ [r], then

Δ = ∑_{i=1}^r (‖b_i − b̄_i‖_2^p − ‖b_i − b_i*‖_2^p). (11)

Using the fact that ‖C̄_{·,i} − C*_{·,i}‖_2 ≤ ‖b_i‖_2 δ, we know that

‖b̄_i − b_i*‖_2 = ‖U*(C̄_{·,i} − C*_{·,i})‖_2 ≤ ‖C̄_{·,i} − C*_{·,i}‖_2 ≤ ‖b_i‖_2 δ. (12)

This implies that each error term satisfies

Δ_i := ‖b_i − b̄_i‖_2^p − ‖b_i − b_i*‖_2^p (13)
    ≤ (‖b_i − b_i*‖_2 + ‖b_i* − b̄_i‖_2)^p − ‖b_i − b_i*‖_2^p   (triangle inequality)
    ≤ (‖b_i − b_i*‖_2 + ‖b_i‖_2 δ)^p − ‖b_i − b_i*‖_2^p   (since ‖b̄_i − b_i*‖_2 ≤ ‖b_i‖_2 δ)
    ≤ ‖b_i‖_2^p ((1+δ)^p − 1).   (since (x+δ)^p − x^p is increasing and ‖b_i − b_i*‖_2 ≤ ‖b_i‖_2)

Summing up, the total error Δ is at most ‖B‖_{2,p}^p ((1+δ)^p − 1) = O(δp)·‖B‖_{2,p}^p for δ ≤ 1/p. This implies that

‖B − P̂B‖_{2,p}^p ≤ ‖B − P*B‖_{2,p}^p + ‖B‖_{2,p}^p ((1+δ)^p − 1). (14)

Using the property of B from Equation 5, we conclude that

‖A − P̂A‖_{2,p}^p ≤ (1+ε)·‖A − P*A‖_{2,p}^p + ‖B‖_{2,p}^p ((1+δ)^p − 1). (15)

Setting P = 0 in Equation 5 and using the fact that c ≥ 0 gives ‖B‖_{2,p}^p ≤ (1+ε)·‖A‖_{2,p}^p. Plugging this into the inequality above gives

‖A − P̂A‖_{2,p}^p ≤ (1+ε)·‖A − P*A‖_{2,p}^p + (1+ε)·‖A‖_{2,p}^p ((1+δ)^p − 1). (16)

The total time taken by the algorithm is T_s + T_r · O((1/δ)^{kr}).

Lemma 9.

The mathematical programs CSA and CSA-geo are equivalent to the following “constrained factorization” problem:

min_{UU^T ∈ S, H ∈ ℝ^{k×n}} ‖A − UH‖_{2,p}^p. (CSA-fac)
Proof.

First, we will prove the equivalence between CSA and CSA-fac.

  1. The easier direction to see is min_{UU^T∈S, H} ‖A − UH‖_{2,p}^p ≤ min_{UU^T∈S} ‖A − UU^T A‖_{2,p}^p, because setting H = U^T A in CSA-fac gives CSA.

  2. For the other direction, it suffices to show that for any fixed choice of U such that UU^T ∈ S, an optimal choice of H is U^T A. To see this, observe that

    min_H ‖A − UH‖_{2,p}^p = min_H ∑_{i=1}^n ‖a_i − U h_i‖_2^p (17)

    where a_i and h_i are the ith columns of A and H respectively. Since the cost function decomposes into separate problems for each column, we can push the minimization inside:

    = ∑_{i=1}^n (min_{h_i} ‖a_i − U h_i‖_2)^p. (18)

    Using the normal equations, the optimal choice of h_i satisfies U^T U h_i = U^T a_i. Since the columns of U are orthonormal, this implies that h_i = U^T a_i for each i ∈ [n], and hence H = U^T A.

Now we show the equivalence between CSA-fac and CSA-geo. Observe that CSA-geo can be rewritten as

min: ∑_{i=1}^n ‖a_i − â_i‖_2^p
s.t. â_i ∈ ColumnSpan(U)
UU^T ∈ S,

because the column span of P = UU^T is identical to the column span of U. Replacing â_i ∈ ColumnSpan(U) by â_i = U h_i gives CSA-fac.
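The normal-equations step used above (that H = U^T A is optimal when U has orthonormal columns) can be checked numerically against a generic least-squares solver; the random instance below is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Random U with orthonormal columns (via QR) and a random data matrix A.
U, _ = np.linalg.qr(rng.standard_normal((6, 3)))
A = rng.standard_normal((6, 5))

# The least-squares optimal H in min_H ||A - U H||_F^2, column by column...
H_ls = np.linalg.lstsq(U, A, rcond=None)[0]
# ...coincides with the closed form H = U^T A from the normal equations.
assert np.allclose(H_ls, U.T @ A)
```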

Definition 10 (Strong coresets; as defined in [41]).

Let 1 ≤ p < ∞ and 0 < ε < 1, and let A ∈ ℝ^{d×n}. A diagonal matrix S ∈ ℝ^{n×n} is a (1±ε) strong coreset for ℓ_p subspace approximation if for all rank-k projection matrices P_F, we have

‖(I − P_F)AS‖_{2,p}^p ∈ (1±ε)·‖(I − P_F)A‖_{2,p}^p. (19)

The number of non-zero entries nnz(S) of S will be referred to as the size of the coreset.

Theorem 11 (Theorems 1.3 and 1.4 of [42]).

Let p ∈ [1,2) ∪ (2,∞) and ε > 0 be given, and let A ∈ ℝ^{d×n}. There is an algorithm running in Õ(nnz(A) + d^ω) time which, with probability at least 1−δ, constructs a strong coreset S that satisfies Definition 10 and has size:

nnz(S) = (k/ε^{4/p})·(log(k/(εδ)))^{O(1)} if p ∈ [1,2), and (k^{p/2}/ε^p)·(log(k/(εδ)))^{O(p²)} if p ∈ (2,∞). (20)
Remark.

Note that for any S that satisfies the property in Definition 10, we can scale it up to satisfy ‖(I − P_F)AS‖_{2,p}^p ∈ [1, 1+ε]·‖(I − P_F)A‖_{2,p}^p, matching the condition in Equation 5.

For many of the applications, we have p=2. For this case, the choice of the reduced matrix 𝑩 that replaces 𝑨 is simply the matrix of scaled left singular vectors of 𝑨. More formally,

Lemma 12.

When p=2, if 𝐀=Σi=1nσipiqiT is the singular value decomposition of 𝐀 (where σi is the ith largest singular value and pi∈ℝd, qi∈ℝn are the left and right singular vectors corresponding to σi), then 𝐁=Σi=1rσipiqiT satisfies Equation 5 for r=k+k/ε.

The proof is deferred to Appendix A.1.
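Lemma 12's construction, and the guarantee it provides, can be sanity-checked numerically. The sketch below (assuming numpy; parameters are illustrative) builds 𝑩 from the scaled left singular vectors and verifies that 𝑨−𝑷𝑨F2−𝑩−𝑷𝑩F2 varies by at most ε𝑨−𝑨kF2 across random rank-k projections, which is exactly the content of the proof in Appendix A.1:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, k, eps = 10, 30, 2, 0.5
r = k + int(np.ceil(k / eps))   # reduced rank from Lemma 12

A = rng.standard_normal((d, n))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
# Scaled left singular vectors; right-multiplying by the orthonormal Vt[:r]
# changes no (I - P)-Frobenius norm, so this d x r matrix is equivalent to
# sum_{i<=r} sigma_i p_i q_i^T for our purposes.
B = U[:, :r] * s[:r]
tail = np.sum(s[k:] ** 2)       # ||A - A_k||_F^2

def gap(P):
    return np.linalg.norm(A - P @ A) ** 2 - np.linalg.norm(B - P @ B) ** 2

gaps = []
for _ in range(200):
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    gaps.append(gap(Q @ Q.T))   # random rank-k projection

# Equation 49: the gap is confined to an interval of length eps * ||A - A_k||_F^2.
assert max(gaps) - min(gaps) <= eps * tail + 1e-8
```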

 Remark 13.

Notice that when p=2, Lemma 12 proves the condition in Equation 49:

𝑩−𝑷𝑩F2∈(0,ε)𝑨−𝑨kF2+𝑨−𝑷𝑨F2−c

which is stronger than the condition in Equation 5.

Lemma 14.

If (𝐀,𝒮) is an instance of CSA and 𝐁d×r is a matrix that satisfies Equation 5, and

𝑷^:=argmin𝑷∈𝒮𝑩−𝑷𝑩2,pp, 𝑷∗:=argmin𝑷∈𝒮𝑨−𝑷𝑨2,pp, (21)

then 𝐏^ is a (1+ε)-approximate solution to the instance (𝐀,𝒮), i.e.,

𝑨−𝑷^𝑨2,pp≤(1+ε)𝑨−𝑷∗𝑨2,pp. (22)
  1. 1.

    More generally, if 𝑷^ is an approximate solution to (𝑩,𝒮) such that

𝑩−𝑷^𝑩2,pp≤α𝑩−𝑷𝑩2,pp+β ∀𝑷∈𝒮,

    for some α1,β0, then we have

𝑨−𝑷^𝑨2,pp≤α(1+ε)𝑨−𝑷∗𝑨2,pp+β.
  2. 2.

    For the specific case when p=2, if 𝑷^ is an exact solution to (𝑩,𝒮), then we have

𝑨−𝑷^𝑨F2≤𝑨−𝑷∗𝑨F2+ε𝑨−𝑨kF2.
Proof.
  1. 1.

    Using the approximate optimality of 𝑷^ for the instance (𝑩,𝒮), we have

    𝑩−𝑷^𝑩2,pp ≤α𝑩−𝑷∗𝑩2,pp+β. (23)
    Using the lower bound and upper bound from Equation 5 for the LHS and RHS respectively, we get
    𝑨−𝑷^𝑨2,pp−c ≤α(1+ε)𝑨−𝑷∗𝑨2,pp−αc+β. (24)
    Since α≥1 and c≥0, we get
    𝑨−𝑷^𝑨2,pp ≤α(1+ε)𝑨−𝑷∗𝑨2,pp+β. (25)
  2. 2.

    Using the optimality of 𝑷^ for the instance (𝑩,𝒮) with p=2, we have

    𝑩−𝑷^𝑩F2≤𝑩−𝑷∗𝑩F2. (26)

    Using Remark 13, we know that 𝑩−𝑷𝑩F2∈(0,ε)𝑨−𝑨kF2+𝑨−𝑷𝑨F2−c for any rank k projection matrix 𝑷, for some c≥0 independent of 𝑷 (see Equation 49). Using this, we get

    𝑨−𝑷^𝑨F2−c≤𝑩−𝑷^𝑩F2≤𝑩−𝑷∗𝑩F2≤𝑨−𝑷∗𝑨F2+ε𝑨−𝑨kF2−c.

    Canceling out the c gives the inequality we claimed.
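The guarantee of Lemma 14 (part 2) can be illustrated on a toy constrained family. Here 𝒮 is taken, purely for illustration, to be the axis-aligned rank-k projections, so that both the reduced and the original instance can be solved exactly by brute force (a sketch assuming numpy):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
d, n, k, eps = 8, 20, 2, 0.5
r = k + int(np.ceil(k / eps))

A = rng.standard_normal((d, n))
U, s, _ = np.linalg.svd(A, full_matrices=False)
B = U[:, :r] * s[:r]            # reduced matrix from Lemma 12
tail = np.sum(s[k:] ** 2)       # ||A - A_k||_F^2

def cost(M, axes):
    P = np.zeros((d, d))
    P[list(axes), list(axes)] = 1.0     # projection onto the chosen axes
    return np.linalg.norm(M - P @ M) ** 2

S = list(combinations(range(d), k))     # toy constraint family
ax_hat = min(S, key=lambda ax: cost(B, ax))   # exact solution on B
ax_star = min(S, key=lambda ax: cost(A, ax))  # exact solution on A

# Lemma 14, part 2: solving the reduced instance loses at most eps * ||A - A_k||_F^2.
assert cost(A, ax_hat) <= cost(A, ax_star) + eps * tail + 1e-8
```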

Lemma 15 (Lemma 4.1 in [12]).

If an n×d matrix 𝐀 has integer entries bounded in magnitude by γ, and has rank ρ≥k, then the kth singular value σk of 𝐀 satisfies |logσk|=O(log(ndγ)). This implies that 𝐀F/Δk≤(ndγ)O(k/(ρ−k)), where Δk:=𝐀−𝐀kF.

4 Applications

We present two applications to illustrate our framework. The remaining applications, as well as our hardness result for fair column-based approximation, are deferred to the full version.

4.1 Constrained Subspace Estimation [36]

In constrained subspace estimation, we are given a collection of target subspaces T1,T2,…,Tm and a model subspace W. The goal is to find a subspace V of dimension k with dim(V∩W)≥ℓ that maximizes the average overlap between the subspace V and T1,…,Tm. More formally, the problem can be formulated as the mathematical program:

max :𝑷¯T,𝑷V (CSE-max)
dim(V)=k, dim(V∩W)≥ℓ, (27)
𝑷¯T=1mi=1m𝑷Ti, (28)
𝑷Ti and 𝑷V are the projection matrices onto the subspaces Ti and V respectively. (29)

Let us assume that the constraint dim(V∩W)≥ℓ is actually an exact constraint dim(V∩W)=ℓ, because we can solve k−ℓ+1 different cases dim(V∩W)=i, one for each ℓ≤i≤k. Since 𝑷¯T is a PSD matrix, we can write it as 𝑨𝑨T for some 𝑨∈ℝd×d. Changing the optimization problem from a maximization to a minimization, we get

min :𝑨𝑨T,𝑰−𝑷V=𝑨−𝑷V𝑨F2 (CSE-min)
𝑷V is the projection matrix onto V (30)
dim(V)=k, dim(V∩W)=ℓ. (31)
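The reduction from the maximization to the minimization rests on the identity 𝑨𝑨T,𝑰−𝑷V=𝑨−𝑷V𝑨F2, which the following sketch verifies on random data (assuming numpy; the PSD square root plays the role of 𝑨):

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, k = 7, 4, 2

def rand_proj(dim):
    Q, _ = np.linalg.qr(rng.standard_normal((d, dim)))
    return Q @ Q.T

# Average of projections onto random target subspaces: a PSD matrix.
P_T_bar = sum(rand_proj(3) for _ in range(m)) / m

# PSD square root, so that P_T_bar = A A^T.
w, V = np.linalg.eigh(P_T_bar)
A = V * np.sqrt(np.clip(w, 0.0, None))

P_V = rand_proj(k)
overlap = np.trace(P_T_bar @ P_V)              # objective of CSE-max
residual = np.linalg.norm(A - P_V @ A) ** 2    # objective of CSE-min
# <A A^T, I - P_V> = tr(P_T_bar) - overlap, so maximizing one minimizes the other.
assert abs(residual - (np.trace(P_T_bar) - overlap)) < 1e-8
```

Since tr(𝑷¯T) is a constant of the instance, the maximizer of the overlap and the minimizer of the residual coincide.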
Lemma 16.

The CSE-min problem is a special case of CSA.

Proof.

Setting p=2 and 𝒮 as the set of k-dimensional projection matrices 𝑷V such that dim(V∩W)=ℓ in CSA gives CSE-min.

Let 𝑩∈ℝd×r, r=k+k/ε, be the reduced matrix obtained as in Lemma 12. Using Lemma 14, it suffices to focus on the reduced instance with 𝑨 replaced by 𝑩.

Any subspace V such that dim(V)=k, dim(V∩W)=ℓ can be represented equivalently as

V =Span(u1,u2,…,uℓ,v1,v2,…,vk−ℓ)
ui∈W, vj∈W⊥ ∀i∈[ℓ], j∈[k−ℓ].

Using these observations and Lemma 9, we can focus on the following subspace estimation program

min :𝑩−𝑼𝑪F2 (32)
𝑼 is an orthogonal basis for Span(u1,…,uℓ,v1,…,vk−ℓ) (33)
ui∈W, vj∈W⊥ ∀i∈[ℓ], j∈[k−ℓ]. (34)
Since 𝑪 is unconstrained, we can replace the condition in Equation 33 with the much simpler condition 𝑼=[u1,…,uℓ,v1,…,vk−ℓ]. This gives
min :𝑩−𝑼𝑪F2 (CSE-min-reduced)
𝑼=[u1,…,uℓ,v1,…,vk−ℓ] (35)
ui∈W, vj∈W⊥ ∀i∈[ℓ], j∈[k−ℓ]. (36)
Lemma 17.

For any fixed 𝐁∈ℝd×r and 𝐂∈ℝk×r, Equation CSE-min-reduced can be solved exactly in poly(n) time.

Proof.

For fixed 𝑩 and 𝑪, the objective is a convex quadratic in 𝑼 and the constraints on 𝑼 are linear. A convex quadratic program with linear constraints can be solved efficiently.
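Lemma 17 can also be realized directly: since each column of 𝑼 is linearly parametrized over a basis of W or W⊥, the program is an unconstrained least-squares problem in the parameters. A sketch under this parametrization (assuming numpy; the subspace dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
d, r, k, l = 8, 5, 3, 1                 # l columns in W, k - l in its complement

# Orthonormal bases for a 3-dimensional W and for W-perp (dimensions arbitrary).
M = np.linalg.qr(rng.standard_normal((d, d)))[0]
W, Wp = M[:, :3], M[:, 3:]

B = rng.standard_normal((d, r))
C = rng.standard_normal((k, r))

# Column t of U is base_t @ z_t, so vec(U) = blkdiag(bases) @ z for one long z.
bases = [W] * l + [Wp] * (k - l)
p = sum(b.shape[1] for b in bases)
blk = np.zeros((d * k, p))
row = col = 0
for b in bases:
    blk[row:row + d, col:col + b.shape[1]] = b
    row += d
    col += b.shape[1]

# vec(UC) = (C^T kron I_d) vec(U): an ordinary least-squares problem in z.
design = np.kron(C.T, np.eye(d)) @ blk
z, *_ = np.linalg.lstsq(design, B.T.reshape(-1), rcond=None)
U = (blk @ z).reshape(k, d).T

# Feasibility: the first l columns lie in W, the rest in W-perp.
assert np.allclose(Wp.T @ U[:, :l], 0.0, atol=1e-8)
assert np.allclose(W.T @ U[:, l:], 0.0, atol=1e-8)

# Global optimality over the feasible set: no random feasible U does better.
best = np.linalg.norm(B - U @ C)
rand = np.linalg.norm(B - (blk @ rng.standard_normal(p)).reshape(k, d).T @ C)
assert best <= rand + 1e-9
```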

Corollary 18 (Additive approximation for CSE).

Using Lemma 8, we can get a subspace V such that dim(V)=k, dim(V∩W)=ℓ and

𝑨−𝑷V𝑨F2≤(1+ε)OPT+O(δ𝑨F2)

for any choice of 0<δ<1 in time poly(n)(1/δ)O(k2/ε).

Lemma 15 gives a lower bound for OPT when the entries of the input matrix 𝑨 are integers bounded in magnitude by γ.

Theorem 19 (Multiplicative approximation for CSE).

Given an instance (𝐀∈ℝd×n,k,ℓ,W) of constrained subspace estimation with integer entries of absolute value at most γ in 𝐀, there is an algorithm that obtains a subspace V such that dim(V)=k, dim(V∩W)=ℓ and

𝑨−𝑷V𝑨F2≤(1+ε)OPT

in O(ndγ/ε)O(k3/ε) time.

Proof.

Using Lemma 15, we know that 𝑨F2/𝑨−𝑨kF2≤(ndγ)O(k). Setting δ=ε𝑨−𝑨kF2/𝑨F2≥ε(ndγ)−O(k) in Corollary 18 gives the desired time complexity.

4.2 Partition Constrained 𝒑-Subspace Approximation

We now consider the PC-p-subspace approximation problem, which generalizes the subspace approximation and subspace estimation problems.

Definition 20 (Partition Constrained p-Subspace Approximation).

In the PC-p-subspace approximation problem, we are given a set of target vectors {a1,a2,…,an}⊆ℝd as columns of a matrix 𝐀∈ℝd×n, a set of subspaces S1,…,Sℓ⊆ℝd, and a sequence of capacity constraints k1,…,kℓ where k1+⋯+kℓ=k. The goal is to select k vectors in total, ki from subspace Si, such that their span captures as much of 𝐀 as possible. Formally, the goal is to select vectors {vi,ti}i∈[ℓ],ti∈[ki], such that for every i, vi,1,…,vi,ki∈Si, so as to minimize Σi∈[n]ai−projspan({vi,ti}i∈[ℓ],ti∈[ki])(ai)2p.

Our results will give algorithms with running times exponential in poly(k) for PC-p-subspace approximation. Given this goal, we can focus on the setting where ki=1, since we can replace each Si in the original formulation with ki copies of Si, with a budget of 1 for each copy.
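The reduction to unit capacities is mechanical; a minimal sketch (the helper name is ours, not from the paper):

```python
import numpy as np

def to_unit_capacity(subspaces, capacities):
    """Replace each (S_i, k_i) pair with k_i unit-capacity copies of S_i."""
    expanded = []
    for S, cap in zip(subspaces, capacities):
        expanded.extend([S] * cap)
    return expanded

# Two subspaces of R^4 with capacities (2, 1) become three unit-capacity subspaces.
S1, S2 = np.eye(4)[:, :2], np.eye(4)[:, 2:]
expanded = to_unit_capacity([S1, S2], [2, 1])
assert len(expanded) == 3 and expanded[0] is S1 and expanded[2] is S2
```

Any unit-capacity solution picks one vector per copy, i.e., ki vectors from each original Si, so the two formulations have the same optimum.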

PC-p-subspace approximation with Unit Capacity.

Given a set of vectors {a1,a2,…,an}⊆ℝd as columns of a matrix 𝑨∈ℝd×n and subspaces S1,…,Sk⊆ℝd, select a vector vi∈Si for i∈[k] in order to minimize Σi∈[n]ai−projspan(v1,…,vk)(ai)2p, where p≥1 is a given parameter. A more compact formulation is

min :Σi=1nai−a^i2p (PC-p-SA-geo)
a^i∈Span(v1,…,vk) ∀i∈[n] (37)
vj∈Sj ∀j∈[k]. (38)
Using Lemma 9, the two other equivalent formulations are
min :𝑨−𝑼𝑼T𝑨2,pp (PC-p-SA)
𝑼 is an orthogonal basis for Span(v1,v2,…,vk) (39)
vi∈Si ∀i∈[k]. (40)
min :𝑨−𝑽𝑪2,pp (PC-p-SA-fac)
𝑽=[v1,…,vk] (41)
vi∈Si ∀i∈[k]. (42)

In what follows, we thus focus on the unit capacity version. We can use our general framework to derive an additive error approximation, for any p.

Theorem 21.

There exists an algorithm for PC-p-subspace approximation with runtime (κ/ε)poly(k/ε)poly(n) which returns a solution with additive error at most O(εp)Ap,2p, where κ is the condition number of an optimal solution 𝐕=[v1,v2,,vk] for the PC-subspace approximation problem PC-p-SA-fac.

For the special case of p=2, it turns out that we can obtain a (1+ε)-multiplicative approximation, using a novel idea. We now outline this approach. As described in our framework, we start by constructing the reduced instance 𝑩,𝒮, where 𝑩={b1,b2,,br}d is a set of target vectors and 𝒮={S1,S2,,Sk} is the given collection of subspaces of d. We define 𝑷j to be some fixed orthonormal basis for the space Sj. Recall that any solution to PC-2-subspace approximation is defined by (a) the vector xj that expresses the chosen vj as vj=𝑷jxj (we have one xj for each j[k]), and (b) a set of combination coefficients cij used to represent the vectors bi using the vectors {vj}j=1k. We collect the vectors xj into one long vector 𝒙 and the coefficients cij into a matrix 𝑪.

Theorem 22.

Let 𝐁,𝒮 be an instance of PC-2-subspace approximation, where 𝐁={b1,b2,…,br}, and suppose that the bit complexity of each element in the input is bounded by H. Suppose there exists an (approximately) optimal solution defined by the pair (𝐱,𝐂) with bit complexity poly(n,H). There exists an algorithm that runs in time nO(k2/ε)poly(H) and outputs a solution whose objective value is within a (1+ε) factor of the optimum objective value. We denote sj=dim(Sj) and s=Σj=1ksj; n for this result can be set to max(s,d,k/ε).

Algorithm Overview.

Recall that 𝑷j specifies an orthonormal basis for Sj. Let 𝑷ij:=cij𝑷j, where the cij are variables. Define 𝑷 to be the rd×s matrix consisting of r×k blocks, where the (i,j)th block is 𝑷ij, and let 𝒙 and 𝒃 be the vectors obtained by stacking the xj and the bi vertically, respectively, as shown below:

𝑷=[𝑷1,1𝑷1,2𝑷1,k𝑷2,1𝑷2,2𝑷2,k𝑷r,1𝑷r,2𝑷r,k],𝒙=[x1x2xk],𝒃=[b1b2br].

The problem PC-2-subspace approximation can now be expressed as the regression problem:

min𝑪,𝒙:𝑷𝒙−𝒃22. (43)

Written this way, it is clear that for any 𝑪, the optimization problem with respect to 𝒙 is simply a regression problem. For the sake of exposition, suppose that for the optimal solution (𝑪,𝒙), the matrix 𝑷 turns out to have full column rank (i.e., 𝑷T𝑷 is invertible). In this case, we can write down the normal equations 𝑷T𝑷𝒙=𝑷T𝒃 and solve them using Cramer’s rule! More specifically, let 𝑫=𝑷T𝑷 and let 𝑫j(i) be the matrix obtained by replacing the ith column in the jth column block of 𝑫 with the column 𝑷T𝒃, for j∈[k],i∈[sj]. Using Cramer’s rule, we have xj(i)=det(𝑫j(i))/det(𝑫).

The key observation now is that substituting this back into the objective yields an optimization problem over (the variables) 𝑪. First, observe that using the normal equations, the objective can be simplified as

𝑷𝒙−𝒃22=𝒙T𝑷T𝑷𝒙−𝒙T𝑷T𝒃−𝒃T𝑷𝒙+𝒃22=𝒃22−𝒃T𝑷𝒙.
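The assembly of 𝑷, the normal equations, the Cramer's-rule formula for the coordinates of 𝒙, and the objective simplification above can all be checked on a small random instance (a sketch assuming numpy; dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
d, r, k = 5, 3, 2
sj = [2, 2]                      # dim(S_j); s = sum(sj) columns in total

Pj = [np.linalg.qr(rng.standard_normal((d, s)))[0] for s in sj]  # bases P_j
C = rng.standard_normal((r, k))  # combination coefficients c_ij
b = rng.standard_normal(r * d)   # the stacked vector of the b_i

# The rd x s block matrix P with (i, j) block c_ij * P_j.
P = np.block([[C[i, j] * Pj[j] for j in range(k)] for i in range(r)])

# Normal equations for the regression min_x ||P x - b||^2.
D = P.T @ P
rhs = P.T @ b
x = np.linalg.solve(D, rhs)

# Cramer's rule: coordinate t equals det(D_t) / det(D), where D_t has
# column t replaced by P^T b.
t = 1
Dt = D.copy()
Dt[:, t] = rhs
assert abs(x[t] - np.linalg.det(Dt) / np.linalg.det(D)) < 1e-8

# At the optimum, ||P x - b||^2 = ||b||^2 - b^T P x.
assert abs(np.linalg.norm(P @ x - b) ** 2 - (b @ b - b @ (P @ x))) < 1e-8
```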

Suppose t is a real valued parameter that is a guess for the objective value. We then consider the following feasibility problem:

𝑷𝒙−𝒃22=𝒃22−𝒃T𝑷𝒙≤t (44)
𝒃22−t≤Σj∈[k],i∈[sj](𝒃T𝑷)j(i)det(𝑫j(i)), det(𝑫)=1. (45)

The idea is to solve this feasibility problem using the literature on solving polynomial systems. This leaves two main gaps: guessing t, and handling the case of 𝑷 not having a full column rank in the optimal solution. We handle the former issue using known quantitative bounds on the solution value to polynomial systems, and the latter using a pre-multiplication with random matrices of different sizes.

References

  • [1] Jason Altschuler, Aditya Bhaskara, Gang Fu, Vahab Mirrokni, Afshin Rostamizadeh, and Morteza Zadimoghaddam. Greedy column subset selection: New bounds and distributed algorithms. In International Conference on Machine Learning, pages 2539–2548, 2016. URL: http://proceedings.mlr.press/v48/altschuler16.html.
  • [2] Megasthenis Asteris, Dimitris Papailiopoulos, and Alexandros Dimakis. Nonnegative sparse pca with provable guarantees. In International Conference on Machine Learning, pages 1728–1736. PMLR, 2014. URL: http://proceedings.mlr.press/v32/asteris14.html.
  • [3] Frank Ban, Vijay Bhattiprolu, Karl Bringmann, Pavel Kolev, Euiwoong Lee, and David P. Woodruff. A PTAS for p-low rank approximation. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 747–766, 2019.
  • [4] Frank Ban, David P. Woodruff, and Qiuyi (Richard) Zhang. Regularized weighted low rank approximation. CoRR, abs/1911.06958, 2019. arXiv:1911.06958.
  • [5] Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Sparse features for pca-like linear regression. Advances in Neural Information Processing Systems, 24, 2011.
  • [6] Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Near-optimal column-based matrix reconstruction. SIAM Journal on Computing, 43(2):687–717, 2014. doi:10.1137/12086755X.
  • [7] Christos Boutsidis, Michael W Mahoney, and Petros Drineas. An improved approximation algorithm for the column subset selection problem. In Proceedings of the twentieth annual ACM-SIAM symposium on Discrete algorithms, pages 968–977, 2009. doi:10.1137/1.9781611973068.105.
  • [8] Christos Boutsidis, Anastasios Zouzias, Michael W Mahoney, and Petros Drineas. Randomized dimensionality reduction for k-means clustering. IEEE Transactions on Information Theory, 61(2):1045–1062, 2014. doi:10.1109/TIT.2014.2375327.
  • [9] Jorge Cadima and Ian T Jolliffe. Loading and correlations in the interpretation of principle compenents. Journal of applied Statistics, 22(2):203–214, 1995.
  • [10] Ashish Chiplunkar, Sagar Kale, and Sivaramakrishnan Natarajan Ramamoorthy. How to solve fair k-center in massive data models. In International Conference on Machine Learning, pages 1877–1886, 2020.
  • [11] Ali Civril and Malik Magdon-Ismail. Column subset selection via sparse approximation of SVD. Theoretical Computer Science, 421:1–14, 2012. doi:10.1016/J.TCS.2011.11.019.
  • [12] Kenneth L Clarkson and David P Woodruff. Numerical linear algebra in the streaming model. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 205–214, 2009. doi:10.1145/1536414.1536445.
  • [13] Kenneth L. Clarkson and David P. Woodruff. Input sparsity and hardness for robust subspace approximation. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 310–329, 2015. doi:10.1109/FOCS.2015.27.
  • [14] Michael B Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 163–172, 2015. doi:10.1145/2746539.2746569.
  • [15] Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, Chris Schwiegelshohn, and Omar Ali Sheikh-Omar. Improved coresets for euclidean k-means. In Proceedings of the 36th International Conference on Neural Information Processing Systems, 2024.
  • [16] Vincent Cohen-Addad, David Saulpic, and Chris Schwiegelshohn. A new coreset framework for clustering. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 169–182, 2021. doi:10.1145/3406325.3451022.
  • [17] Alberto Del Pia. Sparse pca on fixed-rank matrices. Mathematical Programming, 198(1):139–157, 2023. doi:10.1007/S10107-022-01769-9.
  • [18] Amit Deshpande and Luis Rademacher. Efficient volume sampling for row/column subset selection. In 2010 ieee 51st annual symposium on foundations of computer science, pages 329–338, 2010. doi:10.1109/FOCS.2010.38.
  • [19] Amit Deshpande, Madhur Tulsiani, and Nisheeth K. Vishnoi. Algorithms and hardness for subspace approximation. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, pages 482–496, USA, 2011. doi:10.1137/1.9781611973082.39.
  • [20] Petros Drineas, Alan Frieze, Ravi Kannan, Santosh Vempala, and Vishwanathan Vinay. Clustering large graphs via the singular value decomposition. Machine learning, 56:9–33, 2004. doi:10.1023/B:MACH.0000033113.59016.96.
  • [21] Dan Feldman, Morteza Monemizadeh, and Christian Sohler. A ptas for k-means clustering based on weak coresets. In Proceedings of the twenty-third annual symposium on Computational geometry, pages 11–18, 2007. doi:10.1145/1247069.1247072.
  • [22] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 4th edition, 2013.
  • [23] Venkatesan Guruswami, Prasad Raghavendra, Rishi Saket, and Yi Wu. Bypassing ugc from some optimal geometric inapproximability results. ACM Trans. Algorithms, 12(1), February 2016. doi:10.1145/2737729.
  • [24] Venkatesan Guruswami and Ali Kemal Sinop. Optimal column-based low-rank matrix reconstruction. In Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms, pages 1207–1214, 2012. doi:10.1137/1.9781611973099.95.
  • [25] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical learning with sparsity. Monographs on statistics and applied probability, 143(143):8, 2015.
  • [26] Sedjro Salomon Hotegni, Sepideh Mahabadi, and Ali Vakilian. Approximation algorithms for fair range clustering. In International Conference on Machine Learning, pages 13270–13284. PMLR, 2023.
  • [27] Lingxiao Huang, Jian Li, and Xuan Wu. On optimal coreset construction for euclidean (k,z)-clustering. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing, pages 1594–1604, 2024. doi:10.1145/3618260.3649707.
  • [28] Matthew Jones, Huy Nguyen, and Thy Nguyen. Fair k-centers via maximum matching. In International Conference on Machine Learning, pages 4940–4949, 2020.
  • [29] Matthäus Kleindessner, Pranjal Awasthi, and Jamie Morgenstern. Fair k-center clustering for data summarization. In International Conference on Machine Learning, pages 3448–3457, 2019. URL: http://proceedings.mlr.press/v97/kleindessner19a.html.
  • [30] Arvind V. Mahankali and David P. Woodruff. Optimal 1 column subset selection and a fast PTAS for low rank approximation. CoRR, abs/2007.10307, 2020.
  • [31] Antonis Matakos, Bruno Ordozgoiti, and Suhas Thejaswi. Fair column subset selection. arXiv preprint arXiv:2306.04489, 2023. doi:10.48550/arXiv.2306.04489.
  • [32] Ankur Moitra. An almost optimal algorithm for computing nonnegative rank. SIAM J. Comput., 45(1):156–173, 2016. doi:10.1137/140990139.
  • [33] Dimitris Papailiopoulos, Alexandros Dimakis, and Stavros Korokythakis. Sparse pca through low-rank approximations. In International Conference on Machine Learning, pages 747–755. PMLR, 2013. URL: http://proceedings.mlr.press/v28/papailiopoulos13.html.
  • [34] Ilya Razenshteyn, Zhao Song, and David P. Woodruff. Weighted low rank approximations with provable guarantees. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pages 250–263, 2016. doi:10.1145/2897518.2897639.
  • [35] Samira Samadi, Uthaipon Tantipongpipat, Jamie H Morgenstern, Mohit Singh, and Santosh Vempala. The price of fair PCA: One extra dimension. In Advances in neural information processing systems, pages 10976–10987, 2018.
  • [36] Ignacio Santamaria, Javier Vía, Michael Kirby, Tim Marrinan, Chris Peterson, and Louis Scharf. Constrained subspace estimation via convex optimization. In 2017 25th European Signal Processing Conference (EUSIPCO), pages 1200–1204. IEEE, 2017. doi:10.23919/EUSIPCO.2017.8081398.
  • [37] Zhao Song, Ali Vakilian, David Woodruff, and Samson Zhou. On socially fair regression and low-rank approximation. In Advances in Neural Information Processing Systems, 2024.
  • [38] Uthaipon Tantipongpipat, Samira Samadi, Mohit Singh, Jamie H Morgenstern, and Santosh Vempala. Multi-criteria dimensionality reduction with applications to fairness. In Advances in Neural Information Processing Systems, pages 15135–15145, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/2201611d7a08ffda97e3e8c6b667a1bc-Abstract.html.
  • [39] Joel A Tropp. Column subset selection, matrix factorization, and eigenvalue optimization. In Proceedings of the twentieth annual ACM-SIAM symposium on Discrete algorithms, pages 978–986. SIAM, 2009. doi:10.1137/1.9781611973068.106.
  • [40] Ameya Velingker, Maximilian Vötsch, David P. Woodruff, and Samson Zhou. Fast (1+ε)-approximation algorithms for binary matrix factorization. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
  • [41] David P. Woodruff and Taisuke Yasuda. Nearly linear sparsification of p subspace approximation, 2024. doi:10.48550/arXiv.2407.03262.
  • [42] David P. Woodruff and Taisuke Yasuda. Ridge leverage score sampling for p subspace approximation, 2025. arXiv:2407.03262.
  • [43] Zhirong Yang and Erkki Oja. Linear and nonlinear projective nonnegative matrix factorization. IEEE Transactions on Neural Networks, 21:734–749, 2010. doi:10.1109/TNN.2010.2041361.
  • [44] Xiao-Tong Yuan and Tong Zhang. Truncated power method for sparse eigenvalue problems. Journal of Machine Learning Research, 14(4), 2013. doi:10.5555/2567709.2502610.
  • [45] Zhijian Yuan and Erkki Oja. Projective nonnegative matrix factorization for image compression and feature extraction. In Image Analysis: 14th Scandinavian Conference, SCIA 2005, Joensuu, Finland, June 19-22, 2005. Proceedings 14, pages 333–342. Springer, 2005. doi:10.1007/11499145_35.
  • [46] Zhijian Yuan, Zhirong Yang, and Erkki Oja. Projective nonnegative matrix factorization: Sparseness, orthogonality, and clustering, 2009.
  • [47] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of computational and graphical statistics, 15(2):265–286, 2006.

Appendix A Missing Proofs

A.1 Proof of Lemma 12

Proof.

For any two arbitrary projection matrices 𝑷 and 𝑷′ of rank at most k, consider the difference

(𝑨−𝑷𝑨F2−𝑩−𝑷𝑩F2)−(𝑨−𝑷′𝑨F2−𝑩−𝑷′𝑩F2) (46)
=𝑨𝑨T,𝑰−𝑷−𝑩𝑩T,𝑰−𝑷−𝑨𝑨T,𝑰−𝑷′+𝑩𝑩T,𝑰−𝑷′ (47)
=𝑨𝑨T−𝑩𝑩T,𝑷′−𝑨𝑨T−𝑩𝑩T,𝑷 (48)
≤𝑨𝑨T−𝑩𝑩T,𝑷′ (𝑨𝑨T−𝑩𝑩T⪰0, 𝑷⪰0)
≤Σi=r+1r+kσi2 (rank of 𝑷′ ≤ k)
≤kσr+12 (σi≤σr+1 for i≥r+1)
≤(k/(r−k))Σi=k+1rσi2 (σr+1≤σi for k+1≤i≤r)
≤(k/(r−k))𝑨−𝑨kF2=ε𝑨−𝑨kF2. (𝑨−𝑨kF2=Σi=k+1dσi2)

If we let c:=maxrank(𝑷)≤k(𝑨−𝑷𝑨F2−𝑩−𝑷𝑩F2), then we have

c−ε𝑨−𝑨kF2≤𝑨−𝑷𝑨F2−𝑩−𝑷𝑩F2≤c

for any projection matrix 𝑷 of rank at most k. This can be rewritten as

𝑩−𝑷𝑩F2∈(0,ε)𝑨−𝑨kF2+𝑨−𝑷𝑨F2−c. (49)

Using the fact that 𝑨−𝑨kF2≤𝑨−𝑷𝑨F2, we get

𝑩−𝑷𝑩F2∈(1,1+ε)𝑨−𝑷𝑨F2−c.

The fact that c≥0 follows from the fact that

𝑨−𝑷𝑨F2−𝑩−𝑷𝑩F2 =𝑨𝑨T−𝑩𝑩T,𝑰−𝑷 (50)
≥0. (𝑨𝑨T−𝑩𝑩T⪰0, 𝑰−𝑷⪰0)