
Guessing Efficiently for Constrained Subspace Approximation

Aditya Bhaskara, University of Utah, Salt Lake City, UT, USA; Sepideh Mahabadi, Microsoft Research, Redmond, WA, USA; Madhusudhan Reddy Pittu, Carnegie Mellon University, Pittsburgh, PA, USA; Ali Vakilian, Toyota Technological Institute at Chicago, IL, USA; David P. Woodruff, Carnegie Mellon University, Pittsburgh, PA, USA
Abstract

In this paper we study the constrained subspace approximation problem. Given a set of n points {a_1, …, a_n} in ℝ^d, the goal of the subspace approximation problem is to find a k-dimensional subspace that best approximates the input points. More precisely, for a given p ≥ 1, we aim to minimize the pth power of the ℓ_p norm of the error vector (‖a_1 − P a_1‖, …, ‖a_n − P a_n‖), where P denotes the projection matrix onto the subspace and the norms are Euclidean. In constrained subspace approximation (CSA), we additionally have constraints on the projection matrix P. In its most general form, we require P to belong to a given subset S that is described explicitly or implicitly.

We introduce a general framework for constrained subspace approximation. Our approach, which we term coreset-guess-solve, yields either (1+ε)-multiplicative or ε-additive approximations for a variety of constraints. We show that it provides new algorithms for partition-constrained subspace approximation with applications to fair subspace approximation, k-means clustering, and projected non-negative matrix factorization, among others. Specifically, while we recover the best known bounds for k-means clustering in Euclidean spaces, we improve the known results for the remaining problems.

Keywords and phrases:
parameterized complexity, low rank approximation, fairness, non-negative matrix factorization, clustering
Category:
Track A: Algorithms, Complexity and Games
Funding:
Aditya Bhaskara: Supported by NSF CCF-2047288.
David P. Woodruff: Supported by a Simons Investigator Award and Office of Naval Research award number N000142112647.
Copyright and License:
© Aditya Bhaskara, Sepideh Mahabadi, Madhusudhan Reddy Pittu, Ali Vakilian, and David P. Woodruff; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Continuous optimization
Related Version:
Full Version: https://arxiv.org/abs/2504.20883
Editors:
Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis

1 Introduction

Large data sets, often represented as collections of high-dimensional points, naturally arise in fields such as machine learning, data mining, and computational geometry. Despite their high-dimensional nature, these points typically exhibit low intrinsic dimensionality. Identifying (or summarizing) this underlying low-dimensional structure is a fundamental algorithmic question with numerous applications to data analysis. We study a general formulation, that we call the subspace approximation problem.

In subspace approximation, given a set of n points {a_1, …, a_n} ⊆ ℝ^d and a rank parameter k, we consider the problem of “best approximating” the input points with a k-dimensional subspace in a high-dimensional space. Here the goal is to find a rank-k projection P that minimizes the projection costs ‖a_i − P a_i‖, aggregated over i ∈ [n]. The choice of aggregation leads to different well-studied formulations. In the ℓ_p subspace approximation problem, the objective is (∑_i ‖a_i − P a_i‖_2^p)^{1/p}. Formally, denoting by A the d×n matrix whose ith column is a_i, the ℓ_p-subspace approximation problem asks to find a rank-k projection matrix P ∈ ℝ^{d×d} that minimizes ‖A − PA‖_{2,p}^p := ∑_{i=1}^n ‖a_i − P a_i‖_2^p. For different choices of p, ℓ_p-subspace approximation captures some well-studied problems, notably the median hyperplane problem (when p = 1), the principal component analysis (PCA) problem (when p = 2), and the center hyperplane problem (when p = ∞).

Subspace approximation for general p turns out to be NP-hard for all p ≠ 2. For p > 2, semidefinite programming helps achieve a constant factor approximation (for fixed p) for the problem [19]. Matching hardness results were also shown for the case p > 2, first assuming the Unique Games Conjecture [19], and then based only on P ≠ NP [23]. For p < 2, hardness results were first shown in the work of [13].

Due to the ubiquitous applications of subspace approximation in various domains, several “constrained” versions of the problem have been extensively studied as well [20, 44, 33, 2, 8, 14]. In the most general setting of the constrained p-subspace approximation problem, we are additionally given a collection 𝒮 of rank-k projection matrices (specified either explicitly or implicitly) and the goal is to find a projection matrix 𝑷𝒮 minimizing the objective. I.e.,

min_{P ∈ S} ‖A − PA‖_{2,p}^p. (1)

Some examples of problems in constrained subspace approximation include the well-studied column subset selection [7, 39, 18, 11, 24, 6, 1], where the projection matrices are constrained to project onto the span of k of the original vectors; (k,z)-means clustering, in which the set of projection matrices can be specified by the partitioning of the points into k clusters (see [14] for a reference); and many more which we will describe in this paper.

1.1 Our Contributions and Applications

In this paper, we provide a general algorithmic framework for constrained ℓ_p-subspace approximation that yields either (1+ε)-multiplicative or ε-additive error approximations to the objective (depending on the setting), with running time exponential in k. We apply the framework to several classes of constrained subspace approximation, leading to new results or results matching the state-of-the-art for these problems. Note that since the problems we consider are typically APX-hard (including k-means, and even the unconstrained version of ℓ_p-subspace approximation for p > 2), a running time exponential in k is necessary for our results, assuming the Exponential Time Hypothesis; see the discussion in Section 2. Before presenting our results, we start with an informal description of the framework.

Overview of Approach.

Our approach is based on coresets [21] (see also [16, 15, 27] and references therein), but turns out to be different from the standard approach in a subtle yet important way. Recall that a (strong) coreset for an optimization problem O on a set of points A is a subset B such that for any solution for O, the cost on B is approximately the same as the cost on A, up to an appropriate scaling. In the formulation of ℓ_p-subspace approximation above, a coreset for a dataset A would be a subset B of its columns, with far fewer than n columns, such that for all k-dimensional subspaces, each defined by some P, ‖B − PB‖_{2,p}^p ≈ ‖A − PA‖_{2,p}^p, up to scaling. Thus the goal becomes to minimize the former quantity.

In the standard coreset approach, first a coreset is obtained, and then a problem-specific enumeration procedure is used to find a near optimal solution 𝑷. For example, for the k-means clustering objective, one can consider all the k-partitions of the points in the coreset 𝑩; each partition leads to a set of candidate centers, and the best of these candidate solutions will be an approximate solution to the full instance. Similarly for (unconstrained) p-subspace approximation, one observes that for an optimal solution, the columns of 𝑷 must lie in the span of the vectors of 𝑩, and thus one can enumerate over the combinations of the vectors of 𝑩. Each combination gives a candidate 𝑷, and the best of these candidate solutions is an approximate solution to the full instance.

However, this approach does not work in general for constrained subspace approximation. To see this, consider the very simple constraint of having the columns of 𝑷 coming from some given subspace S. Here, the coreset for p-subspace approximation on 𝑨 will be some set 𝑩 that is “oblivious” of the subspace S. Thus, enumerating over combinations of 𝑩 may not yield any vectors in S!

Our main idea is to avoid enumeration over candidate solutions; instead, we view the solution (a d×k matrix of basis vectors for the subspace) as simply a set of variables. We then note that since the goal is to use P to approximate B, there must be some combination of the basis vectors (equivalently, a set of k coefficients) that approximates each vector a_i in B. If the coreset has r columns, there are only kr coefficients in total, and we can thus hope to enumerate these coefficients in time exponential in kr. For every given choice of coefficients, we can then solve an optimization problem to find the optimal P. For the constraints we consider (including the simple example above), this problem turns out to be convex, and can thus be solved efficiently!
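For concreteness, the size of this search space can be computed directly; the parameter values below are illustrative only (the precise net construction appears in Lemma 8).

```python
# Size of the search space when guessing the k x r coefficient matrix C on a
# delta-net: each of the r columns lies in a k-dimensional ball and is
# discretized into O((1/delta)^k) net points, for O((1/delta)^(k*r)) guesses
# in total. The parameter values below are illustrative only.
k, r, delta = 2, 8, 0.5
net_points_per_column = (1 / delta) ** k
total_guesses = net_points_per_column ** r
print(total_guesses)   # (1/delta)^(k*r) = 2^16 = 65536.0
```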

This simple idea yields ε-additive approximation guarantees for a range of problems. We then observe that in specific settings of interest, we can obtain (1+ε)-multiplicative approximations by avoiding guessing of the coefficients. In these settings, once the coefficients have been guessed, there is a closed form for the optimal basis vectors, in the form of low degree polynomials of the coefficients. We can then use the literature on solving polynomial systems of equations (viewing the coefficients as variables) to obtain algorithms that are more efficient than guessing. The framework is described more formally in Section 3.

We believe our general technique of using coresets to reduce the number of coefficients needed, in order to turn a constrained non-convex optimization problem into a convex one, may be of broader applicability. We note it is fundamentally different from the “guess a sketch” technique for variable reduction in [34, 3, 4, 30] and the techniques for reducing variables in non-negative matrix factorization [32]. To support this statement, the guess-a-sketch technique requires the existence of a small sketch, and consequently has only been applied to approximation with entrywise ℓ_p-norms and weighted variants [34, 3, 30], whereas our technique applies to a much wider family of norms.

Relation to Prior Work.

We briefly discuss the connection to prior work on binary matrix factorization using coresets. The work of [40] considers binary matrix factorization by constructing a strong coreset that reduces the number of distinct rows via importance sampling, leveraging the discrete structure of binary inputs. Our framework generalizes these ideas to continuous settings: we use strong coresets not merely to reduce distinct rows, but to reduce the number of variables in a polynomial system for solving continuous constrained optimization problems. This enables us to extend the approach to real-valued matrices and to more general loss functions. Our framework can be seen as a generalization and unification of prior coreset-based “guessing” strategies, adapting them to significantly broader settings.

Applications.

We apply our framework to the following applications. Each of these settings can be viewed as subspace approximation with a constraint on the subspace (i.e., on the projection matrix), or on properties of the associated basis vectors. Below we describe these applications, mention how they can be formulated as Constrained Subspace Approximation, and state our results for them. See Table 1 for a summary.

Table 1: Summary of the upper bound results we get using our framework. In the approximation column, we use superscripts *, +, and ⋄ to indicate whether the guarantee is a multiplicative, additive, or multiplicative-additive approximation, respectively. In the prior-work column, we use a tilde (∼) to indicate that no prior theoretical guarantees are known (only heuristics) and a hyphen (−) to indicate that the problem is new.
Problem | Running Time | Approx. | Prior Work
PC-ℓ_p-Subspace Approx. | (κ/ε)^{poly(k/ε)} · poly(n) (Theorem 21) | (O(εp)·‖A‖_{2,p}^p)^+ | −
PC-ℓ_p-Subspace Approx. | n^{O(k²/ε)} · poly(H) (Theorem 22) | (1+ε)^* | −
Constrained Subspace Est. | poly(n) · (1/δ)^{O(k²/ε)} (Corollary 18) | (1+ε, O(δ‖A‖_F²))^⋄ | ∼
Constrained Subspace Est. | O(ndγ/ε)^{O(k³/ε)} (Theorem 19) | (1+ε)^* | ∼
PNMF | O(dk²/ε) · (1/δ)^{O(k²/ε)} (Theorem 1) | (1+ε, O(δ‖A‖_F²))^⋄ | ∼
PNMF | (ndγ/ε)^{O(k³/ε)} (Theorem 2) | (1+ε)^* | ∼
k-Means Clustering | O(nnz(A) + 2^{Õ(k/ε)} + n^{o(1)}) (Theorem 3) | (1+ε)^* | [21]
Sparse PCA | d^{O(k³/ε²)} · k³/ε (Theorem 4) | (ε‖A − A_k‖_F²)^+ | [17]

1.1.1 Subspace Approximation with Partition Constraints

First, we study a generalization of ℓ_p-subspace approximation, where we have partition constraints on the subspace. More specifically, we consider PC-ℓ_p-subspace approximation, where besides the point set {a_1, …, a_n} ⊆ ℝ^d, we are given subspaces S_1, …, S_ℓ along with capacities k_1, …, k_ℓ such that ∑_{i=1}^ℓ k_i = k. Now the set of valid projections S is implicitly defined to be the set of projections onto the subspaces that are obtained by selecting k_i vectors from S_i for each i ∈ [ℓ] and taking their span.

PC-ℓ_p-subspace approximation can be viewed as a variant of data summarization with “fair representation”. Specifically, when S_i is the span of the vectors (or points) in group i, then by setting the k_i values properly (depending on the application or the choice of policy makers), PC-ℓ_p-subspace approximation captures the problem of finding a summary of the input data in which groups are fairly represented. This corresponds to the equitable representation criterion, a popular notion studied extensively in the fairness of algorithms, e.g., clustering [29, 28, 10, 26].¹ We show the following results for PC-subspace approximation:

¹ We note that the fair representation definitions differ from those in the line of work on fair PCA and column subset selection [35, 38, 31, 37], where the objective contributions (i.e., projection costs) of different groups must either be equal (if possible) or the maximum incurred cost must be minimized. We focus on the question of groups having equal, or appropriately bounded, representation among the chosen low-dimensional subspace (i.e., directions). This distinction is also found in algorithmic fairness studies of other problems, such as clustering.

  • First, in Theorem 21, we show for any p ≥ 1 an algorithm for PC-ℓ_p-subspace approximation with runtime (κ/ε)^{poly(k/ε)} · poly(n) that returns a solution with additive error at most O(εp)·‖A‖_{2,p}^p, where κ is the condition number of the optimal choice of vectors from the given subspaces.

  • For p = 2, which is one of the most common loss functions for PC-ℓ_p-subspace approximation, we also present a multiplicative approximation guarantee. There exists a (1+ε)-approximation algorithm running in time s^{O(k²/ε)} · poly(H), where H is the bit complexity of each element in the input and s is the sum of the dimensions of the input subspaces S_1, …, S_ℓ, i.e., s = ∑_{j=1}^ℓ dim(S_j). The formal statement is in Theorem 22.

1.1.2 Constrained Subspace Estimation

The constrained subspace estimation problem originates from the signal processing community [36], and aims to find a subspace V of dimension k that best approximates a collection of experimentally measured subspaces T_1, …, T_m, with the constraint that it intersects a model-based subspace W in at least a predetermined number of dimensions ℓ, i.e., dim(V ∩ W) ≥ ℓ. This problem arises in applications such as beamforming, where the model-based subspace is used to encode the available prior information about the problem. The paper [36] formulates and motivates the problem, and further presents an algorithm based on a semidefinite relaxation of this non-convex problem, whose performance is only demonstrated via numerical simulation.

We show in Section 4.1 that this problem can be reduced to at most k instances of PC-ℓ_2-subspace approximation in which the number of parts is ℓ = 2. This gives us the following results for the constrained subspace estimation problem.

  • In Corollary 18, we show a (1+ε, δ‖A‖_F²)-multiplicative-additive approximation in time poly(n) · (1/δ)^{O(k²/ε)}.

  • In Theorem 19, we show a (1+ε) multiplicative approximation in time O(ndγ/ε)^{O(k³/ε)}, where we assume A has integer entries of absolute value at most γ. We assume that γ = poly(n).

1.1.3 Projective Non-Negative Matrix Factorization

Projective Non-Negative Matrix Factorization (PNMF) [45] (see also [46, 43]) is a variant of Non-Negative Matrix Factorization (NMF), used for dimensionality reduction and data analysis, particularly for datasets with non-negative values such as images and text. In NMF, a non-negative matrix X is factorized into the product of two non-negative matrices W and H such that X ≈ WH, where W contains basis vectors and H represents coefficients. In PNMF, the aim is to approximate the data matrix by projecting it onto a subspace spanned by non-negative vectors, similar to NMF. However, in PNMF, the factorization is constrained to be projective.

Formally, PNMF can be formulated as a constrained ℓ_2-subspace approximation as follows: the set of feasible projection matrices S consists of all matrices that can be written as P = UU^T, where U is a d×k matrix with orthonormal columns and all non-negative entries.
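As a quick illustration of this feasible set, the following numpy sketch (the toy U and A are ours, purely for illustration) checks that a non-negative matrix with orthonormal columns induces a valid projection matrix and evaluates the PNMF objective:

```python
import numpy as np

# A hypothetical feasible point for PNMF: U is d x k, entrywise non-negative,
# with orthonormal columns (non-negativity plus orthogonality forces the
# columns of U to have disjoint supports).
U = np.array([[0.6, 0.0],
              [0.8, 0.0],
              [0.0, 1.0]])
assert np.all(U >= 0) and np.allclose(U.T @ U, np.eye(2))

P = U @ U.T                      # the induced projection matrix in S
assert np.allclose(P @ P, P)     # idempotent, hence a projection

A = np.array([[3.0, 1.0],
              [4.0, 2.0],
              [0.0, 5.0]])       # toy non-negative data matrix (d x n)
cost = np.linalg.norm(A - P @ A, 'fro') ** 2
```

Here `cost` is the PNMF objective ‖A − UU^T A‖_F² for this particular feasible U (0.16 on this toy data).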

We show the following results:

  • In Theorem 1, we show a (1+ε, δ‖A‖_F²)-multiplicative-additive approximation in time O(dk²/ε) · (1/δ)^{O(k²/ε)}.

  • In Theorem 2, we show a (1+ε) multiplicative approximation in time (ndγ/ε)^{O(k³/ε)}, where we assume A has integer entries of absolute value at most γ.

Theorem 1 (Additive approximation for NMF).

Given an instance A ∈ ℝ^{d×n} of non-negative matrix factorization and any 0 < δ < 1, there is an algorithm that computes a U ∈ ℝ_{≥0}^{d×k} with U^T U = I_k such that

‖A − UU^T A‖_F² ≤ (1+ε)·OPT + O(δ‖A‖_F²)

in time O(dk²/ε) · (1/δ)^{O(k²/ε)}.

Theorem 2 (Multiplicative approximation for NMF).

Given an instance A ∈ ℝ^{d×n} of non-negative matrix factorization with integer entries of absolute value at most γ, there is an algorithm that computes a U ∈ ℝ_{≥0}^{d×k} with U^T U = I_k such that

‖A − UU^T A‖_F² ≤ (1+ε)·OPT

in time (ndγ/ε)^{O(k³/ε)}.

1.1.4 k-Means Clustering

k-means is a popular clustering objective widely used in data analysis and machine learning. Given a set of n vectors a_1, …, a_n and a parameter k, the goal of k-means clustering is to partition the vectors into k clusters {C_1, …, C_k} so as to minimize the sum of squared distances of all points to their corresponding cluster centers, ∑_{i=1}^n ‖a_i − μ_{C(a_i)}‖_2², where C(a_i) denotes the cluster that a_i belongs to and μ_{C(a_i)} denotes its center. It is an easy observation that once the clustering is determined, the cluster centers must be the centroids of the points in each cluster. It is shown in [14] that this problem is an instance of constrained subspace approximation. More precisely, the set of valid projection matrices consists of all those that can be written as P = X_C X_C^T, where X_C is an n×k matrix with X_C(i,j) = 1/√|C_j| if C(a_i) = j and 0 otherwise. Note that X_C has orthonormal columns and thus X_C X_C^T is an orthogonal projection matrix. Further, note that in our language we need to apply constrained subspace approximation to the matrix A^T, i.e., min_{P∈S} ‖A^T − P A^T‖_F². In Theorem 3, we show a (1+ε)-approximation algorithm for k-means that runs in O(nnz(A) + 2^{Õ(k/ε)} + n^{o(1)}) time, whose dependence on k and ε matches that of [21].

Theorem 3.

Given an instance A ∈ ℝ^{n×d} of k-means, there is an algorithm that computes a (1+ε)-approximate solution to k-means in O(nnz(A) + 2^{Õ(k/ε)} + n^{o(1)}) time.
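The reduction above can be sanity-checked numerically; the following sketch (the points and the fixed partition are illustrative) verifies that the constrained-subspace-approximation objective, written with the points as rows, matches the k-means cost:

```python
import numpy as np

# Toy check that the k-means cost equals the constrained-subspace objective
# ||A - X_C X_C^T A||_F^2, with the points as rows of A and
# X_C(i, j) = 1/sqrt(|C_j|) when point i is assigned to cluster j.
A = np.array([[0.0, 0.0],
              [2.0, 0.0],
              [10.0, 0.0],
              [12.0, 0.0]])            # n = 4 points in R^2, as rows
clusters = [[0, 1], [2, 3]]             # a fixed 2-partition

X = np.zeros((4, 2))
for j, C in enumerate(clusters):
    X[C, j] = 1.0 / np.sqrt(len(C))
assert np.allclose(X.T @ X, np.eye(2))  # X_C has orthonormal columns

csa_cost = np.linalg.norm(A - X @ X.T @ A, 'fro') ** 2
kmeans_cost = sum(np.sum((A[C] - A[C].mean(axis=0)) ** 2) for C in clusters)
assert np.isclose(csa_cost, kmeans_cost)   # both equal 4.0 here
```

The intermediate product X_C X_C^T A replaces each row of A by its cluster centroid, which is exactly why the two costs coincide.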

1.1.5 Sparse PCA

The goal of Principal Component Analysis (PCA) is to find k linear combinations of the d features (dimensions), called principal components, that capture most of the mass of the data. As mentioned earlier, PCA is the subspace approximation problem with p = 2. However, the obtained principal components are typically linear combinations of all coordinates, which makes interpreting the components more difficult. As such, Sparse PCA, the optimization problem obtained from PCA by adding a sparsity constraint on the principal components, has been introduced to provide higher data interpretability [17, 47, 9, 25, 5].

Sparse PCA can be formulated as a constrained subspace approximation problem in which the set of projection matrices is constrained to those that can be written as P = UU^T, where U is a d×k matrix with orthonormal columns such that the total number of non-zero entries in U is at most s, for a given parameter s.

max: ⟨AA^T, UU^T⟩ (sparse-PCA-max)
s.t. U^T U = I_k, ∑_{j∈[k]} ‖U_{·,j}‖_0 ≤ s. (2)

Program sparse-PCA-max can also be formulated as a minimization version

min: ‖A − UU^T A‖_F² (sparse-PCA-min)
s.t. U^T U = I_k, ∑_{j∈[k]} ‖U_{·,j}‖_0 ≤ s. (3)

We give an algorithm that runs in time d^{O(k³/ε²)} · (dk³/ε + d log d) and computes an ε‖A − A_k‖_F² additive approximate solution, which translates to a (1+ε)-multiplicative approximate solution to one formulation of the problem.

Theorem 4.

Given an instance (A ∈ ℝ^{d×n}, k, s) of sparse-PCA, there is an algorithm that runs in time

d^{O(kr²)} · (dkr² + d log d) (4)

with r = k + ⌈k/ε⌉ that computes an ε‖A − A_k‖_F² additive approximate solution to both sparse-PCA-max and sparse-PCA-min. This is guaranteed to be a (1+ε)-approximate solution to sparse-PCA-min because ‖A − A_k‖_F² is a lower bound on the optimum of sparse-PCA-min.

1.1.6 Column Subset Selection with a Partition Constraint

Column subset selection (CSS) is a popular data summarization technique [8, 14, 1], where given a matrix A, the goal is to find k columns of A that best approximate all columns of A. Since in CSS a subset of the columns of the matrix is picked as the summary of A, enforcing partition constraints naturally captures the problem of column subset selection with fair representation. More formally, in column subset selection with partition constraints (PC-column subset selection), given a partitioning of the columns of A into ℓ groups A^{(1)}, …, A^{(ℓ)}, along with capacities k_1, …, k_ℓ, where ∑_i k_i = k, the set of valid subspaces is obtained by picking k_i columns from A^{(i)} and projecting onto the span of these k columns of A.

In Theorem 5, we show that PC-column subset selection is hard to approximate to any factor f in polynomial time, even if there are only two groups, and even when we allow the capacity constraint to be violated by a factor of o(log n). This is in sharp contrast with the standard column subset selection problem, for which efficient algorithms with tight guarantees are known.

Theorem 5.

Assuming SAT ∉ DTIME(n^{O(log log n)}), the PC-column subset selection problem is hard to approximate to any multiplicative factor f, even in the following special cases:

(i) The case of ℓ = 2 groups, where the capacities on all the groups are the same parameter s.

(ii) The case where the capacities on all the groups are the same parameter s, and we allow a solution to violate the capacity by a factor g(n) = o(log n), where n is the total number of columns in the instance.

2 Preliminaries

We will heavily use standard notation for vector and matrix quantities. For a matrix M, we denote by M_{·,i} the ith column of M and by M_{i,·} the ith row. We denote by ‖M‖_F the Frobenius norm, which is simply √(∑_{i,j} m_{ij}²), where m_{ij} is the entry in the ith row and jth column of M. We also use mixed norms, where ‖M‖_{2,p} = (∑_i ‖M_{·,i}‖_2^p)^{1/p}; i.e., it is the ℓ_p norm of the vector whose entries are the ℓ_2 norms of the columns of M.
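As a quick illustration, the mixed norm can be computed in a few lines of numpy (the helper name is ours):

```python
import numpy as np

def mixed_norm(M, p):
    """||M||_{2,p}: the l_p norm of the vector of column l_2 norms."""
    col_norms = np.linalg.norm(M, axis=0)   # l_2 norm of each column
    return np.linalg.norm(col_norms, ord=p)

M = np.array([[3.0, 0.0],
              [0.0, 4.0]])
# Column norms are (3, 4): ||M||_{2,1} = 7 and ||M||_{2,2} = ||M||_F = 5.
```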

We also use σ_min(M) to denote the least singular value of a matrix, and σ_max(M) to denote the largest singular value. The value κ(M) denotes the condition number, which is the ratio of the largest to the smallest singular value.

In analyzing the running times of our algorithms, we will use the following basic primitives, the running times of which we denote as T0 and T1 respectively. These are standard results from numerical linear algebra; while there are several improvements using randomization, these bounds will not be the dominant ones in our running time, so we do not optimize them.

Lemma 6 (SVD Computation; see [22]).

Given A ∈ ℝ^{d×n}, computing the reduced matrix B as in Lemma 12 takes time T_0 := H · min{O(nd²), O(ndk/ε)}, where H is the maximum bit complexity of any element of A.

Lemma 7 (Least Squares Regression; see [22]).

Given A ∈ ℝ^{d×n} and a target matrix B with r columns, the optimization problem min_C ‖B − AC‖_F² can be solved in time T_1 := O(nrd²H), where H is the maximum bit length of any entry in A, B.

Remark on the Exponential-in-k Running Times.

In all of our results, it is natural to ask if the exponential dependence on k is necessary. We note that many of the problems we study are APX-hard, and thus obtaining multiplicative (1+ε) factors will necessarily require exponential time in the worst case. For problems that generalize ℓ_p-subspace approximation (e.g., the PC-ℓ_p-subspace approximation problem, Section 4.2), the work of [23] showed APX-hardness for p > 2 while [13] showed NP-hardness. In our reductions, we in fact have the stronger property that the YES and NO instances differ in objective value by (1/poly(k)) · ‖A‖_{2,p}^p, where A is the matrix used in the reduction. Thus, assuming the Exponential Time Hypothesis, even the additive error guarantee in general requires an exponential dependence on either k or 1/ε, at least for p > 2.

3 Framework for Constrained Subspace Approximation

Given a d×n matrix A and a special collection S of rank-k projection matrices, we are interested in selecting the projection matrix P ∈ S that minimizes the sum of projection costs (raised to the pth power) of the columns of A onto P. More compactly, the optimization problem is

min_{P∈S}: ‖A − PA‖_{2,p}^p. (CSA)

A more geometric and equivalent interpretation is that we have a collection of n data points {a_1, a_2, …, a_n} ⊆ ℝ^d and we would like to approximate these data points by a subspace while satisfying certain constraints on the subspace:

min: ∑_{i=1}^n ‖a_i − â_i‖_2^p (CSA-geo)
s.t. â_i ∈ ColumnSpan(P)
P ∈ S.

See Lemma 9 for a proof of the equivalence. We provide a unified framework to obtain approximately optimal solutions for various special collections of 𝒮. In our framework, there are three steps to obtaining an approximate solution to any instance of CSA.

  1. Build a coreset: Reduce the size of the problem by replacing A with a different matrix B ∈ ℝ^{d×r} with fewer columns, typically r = poly(k, 1/ε). The property we need to guarantee is that the projection cost is approximately preserved, possibly with an additive shift c ≥ 0 independent of P:

    ‖B − PB‖_{2,p}^p ∈ [1, 1+ε] · ‖A − PA‖_{2,p}^p − c,  ∀P with rank at most k. (5)

    Such a B (for p = 2) has been referred to as a projection-cost-preserving sketch with one-sided error in [14]. See Definition 10, Theorem 11, and Lemma 12 for results obtaining such a B for various 1 ≤ p < ∞. Lemma 14 shows that approximate solutions to reduced instances (B, S) satisfying Equation 5 are also approximate solutions to the original instance (A, S).

  2. Guess coefficients: Since the projection matrix P has rank k, it can be represented as UU^T with U^T U = I_k. Using this, observe that the residual matrix

    B − PB = B − U(U^T B)

    can be represented as B − UC, where C = U^T B is a k×r matrix. The norm of the ith column of C is bounded by ‖b_i‖_2, the norm of the ith column of B. This allows us to guess every column of C inside a k-dimensional ball of radius at most the norm of the corresponding column of B. Using a net with appropriate granularity, we guess the optimal C up to an additive error.

  3. Solve: For every fixed C in the search space above, we solve the constrained regression problem

    min_{U ∈ ℝ^{d×k}: UU^T ∈ S} ‖B − UC‖_{2,p}^p

    exactly. If Ĉ is the C matrix that induces the minimum cost, and Û is the corresponding minimizer of the constrained regression problem, we return the projection matrix ÛÛ^T.
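The three steps above can be sketched in code. The following is a minimal numpy illustration for p = 2, where for simplicity we take S to be the set of all rank-k projections, so that the solve step becomes an orthogonal Procrustes problem with a closed-form SVD solution; the function names and the crude box-shaped net are ours, not the paper's implementation:

```python
import itertools
import numpy as np

def solve_step(B, C):
    # Exact solve of min_{U : U^T U = I_k} ||B - U C||_F^2. With S equal to
    # all rank-k projections this is an orthogonal Procrustes problem: if
    # B C^T = W Sigma V^T is a thin SVD, the minimizer is U = W V^T.
    W, _, Vt = np.linalg.svd(B @ C.T, full_matrices=False)
    return W @ Vt

def coreset_guess_solve(B, k, delta):
    """Guess C = U^T B column by column on a delta-net, then solve for U."""
    d, r = B.shape
    norms = np.linalg.norm(B, axis=0)
    # Column i of the optimal C has norm at most ||b_i||_2, so we discretize
    # a box containing that ball with granularity ||b_i||_2 * delta.
    nets = [list(itertools.product(
                np.arange(-norms[i], norms[i] + 1e-12,
                          max(norms[i] * delta, 1e-12)),
                repeat=k))
            for i in range(r)]
    best_cost, best_P = np.inf, None
    for cols in itertools.product(*nets):
        C = np.array(cols).T                      # one guessed k x r matrix
        U = solve_step(B, C)
        cost = np.linalg.norm(B - U @ (U.T @ B), 'fro') ** 2
        if cost < best_cost:
            best_cost, best_P = cost, U @ U.T
    return best_P, best_cost

B = np.array([[3.0, 0.0],
              [0.0, 2.0],
              [0.0, 0.0]])
P_hat, cost = coreset_guess_solve(B, k=1, delta=0.25)
```

On this toy instance the returned cost is σ₂(B)² = 4, the optimal rank-1 projection cost; for a genuinely constrained S, `solve_step` would instead call the appropriate convex solver.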

The following lemma formalizes the framework above and can be used as a black box application for several specific instances of CSA.

Lemma 8.

Given an instance (A, S) of CSA, for 1 ≤ p < ∞:

  1. Let T_s be the time taken to obtain a smaller instance (B, S) such that the approximate-cost property in Equation 5 is satisfied and the number of columns in B is r.

  2. Let T_r be the time taken to solve the constrained regression problem, for any fixed B ∈ ℝ^{d×r} and C ∈ ℝ^{k×r},

    min_{U ∈ ℝ^{d×k}: UU^T ∈ S} ‖UC − B‖_{2,p}^p. (6)

Then for any granularity parameter 0 < δ < 1, we obtain a solution P ∈ S such that

    ‖A − PA‖_{2,p}^p ≤ (1+ε)·OPT + Δ (7)

in time T_s + T_r · O((1/δ)^{kr}). Here, Δ = (1+ε)·‖A‖_{2,p}^p·((1+δ)^p − 1) and OPT = min_{P∈S} ‖A − PA‖_{2,p}^p.

Proof.

Let the optimal solution to the instance (A, S) be P* = U*(U*)^T and let C* = (U*)^T B. Since the columns of U* are unit vectors, the norm of the ith column of C* is at most ‖b_i‖_2, the norm of the ith column of B. We will try to approximately guess the columns of C* using epsilon-nets. For each i, we search for the ith column of C* using a (‖b_i‖_2 δ)-net inside a k-dimensional ball of radius ‖b_i‖_2 centered at the origin. The size of the net for each column of C* is O((1/δ)^k) and hence the total search space over matrices C has O((1/δ)^{kr}) possibilities.

For each C, we solve the constrained regression problem in Equation 6. Let Ĉ be the matrix for which the cost is minimized and Û the corresponding minimizer of the constrained regression problem. Consider the solution P̂ = ÛÛ^T. The cost of this solution on the reduced instance (B, S) is

‖B − ÛÛ^T B‖_{2,p}^p ≤ ‖B − ÛĈ‖_{2,p}^p. (8)

Let C̄ be the matrix in the search space such that ‖C̄_{·,i} − C*_{·,i}‖_2 ≤ ‖b_i‖_2 δ for every i ∈ [r]. Using the cost minimality of Ĉ, the above cost is

≤ min_{U ∈ ℝ^{d×k}: UU^T ∈ S} ‖B − UC̄‖_{2,p}^p (9)
≤ ‖B − U*C̄‖_{2,p}^p. (10)

It remains to upper bound the difference Δ = ‖B − U*C̄‖_{2,p}^p − ‖B − U*C*‖_{2,p}^p. If we let b_i* := (U*C*)_{·,i} and b̄_i := (U*C̄)_{·,i} for i ∈ [r], then

Δ = ∑_{i=1}^r (‖b_i − b̄_i‖_2^p − ‖b_i − b_i*‖_2^p). (11)

Using the fact that ‖C̄_{·,i} − C*_{·,i}‖_2 ≤ ‖b_i‖_2 δ, we know that

‖b̄_i − b_i*‖_2 = ‖U*(C̄_{·,i} − C*_{·,i})‖_2 ≤ ‖C̄_{·,i} − C*_{·,i}‖_2 ≤ ‖b_i‖_2 δ. (12)

This implies that each error term satisfies

Δ_i := ‖b_i − b̄_i‖_2^p − ‖b_i − b_i*‖_2^p (13)
    ≤ (‖b_i − b_i*‖_2 + ‖b_i* − b̄_i‖_2)^p − ‖b_i − b_i*‖_2^p   (triangle inequality)
    ≤ (‖b_i − b_i*‖_2 + ‖b_i‖_2 δ)^p − ‖b_i − b_i*‖_2^p   (since ‖b̄_i − b_i*‖_2 ≤ ‖b_i‖_2 δ)
    ≤ ‖b_i‖_2^p ((1+δ)^p − 1).   (since (x+δ)^p − x^p is increasing and ‖b_i − b_i*‖_2 ≤ ‖b_i‖_2)

Summing up, the total error Δ is at most ‖B‖_{2,p}^p ((1+δ)^p − 1) = O(δp)·‖B‖_{2,p}^p for δ ≤ 1/p. This implies that

‖B − P̂B‖_{2,p}^p ≤ ‖B − P*B‖_{2,p}^p + ‖B‖_{2,p}^p ((1+δ)^p − 1). (14)

Using the property of B from Equation 5, we conclude that

‖A − P̂A‖_{2,p}^p ≤ (1+ε)·‖A − P*A‖_{2,p}^p + ‖B‖_{2,p}^p ((1+δ)^p − 1). (15)

Setting P = 0 in Equation 5 and using the fact that c ≥ 0 gives ‖B‖_{2,p}^p ≤ (1+ε)·‖A‖_{2,p}^p. Plugging this into the inequality above gives

‖A − P̂A‖_{2,p}^p ≤ (1+ε)·‖A − P*A‖_{2,p}^p + (1+ε)·‖A‖_{2,p}^p ((1+δ)^p − 1). (16)

The total time taken by the algorithm is T_s + T_r · O((1/δ)^{kr}).

Lemma 9.

The mathematical programs CSA and CSA-geo are equivalent to the following “constrained factorization” problem:

min_{UU^T ∈ S, H ∈ ℝ^{k×n}} ‖A − UH‖_{2,p}^p. (CSA-fac)
Proof.

First, we will prove the equivalence between CSA and CSA-fac.

  1. The easier direction to see is min_{UU^T∈S, H} ‖A − UH‖_{2,p}^p ≤ min_{UU^T∈S} ‖A − UU^T A‖_{2,p}^p, because setting H = U^T A in CSA-fac gives CSA.

  2. For the other direction, it suffices to show that for any fixed choice of U such that UU^T ∈ S, an optimal choice of H is U^T A. To see this, observe that

    min_H ‖A − UH‖_{2,p}^p = min_H ∑_{i=1}^n ‖a_i − U h_i‖_2^p (17)

    where a_i and h_i are the ith columns of A and H respectively. Since the cost function decomposes into separate problems for each column, we can push the minimization inside:

    = ∑_{i=1}^n (min_{h_i} ‖a_i − U h_i‖_2)^p. (18)

    Using the normal equations, the optimal choice of h_i satisfies U^T U h_i = U^T a_i. Since the columns of U are orthonormal, this implies that h_i = U^T a_i for each i ∈ [n], and hence H = U^T A.

Now we show the equivalence between CSA-fac and CSA-geo. Observe that CSA-geo can be rewritten as

min: ∑_{i=1}^n ‖a_i − â_i‖_2^p
s.t. â_i ∈ ColumnSpan(U)
UU^T ∈ S,

because the column span of P = UU^T is identical to the column span of U. Replacing â_i ∈ ColumnSpan(U) by â_i = U h_i gives CSA-fac.
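The normal-equations step used above (that H = U^T A is optimal when U has orthonormal columns) can be checked numerically against a generic least-squares solver; the random instance below is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Random U with orthonormal columns (via QR) and a random data matrix A.
U, _ = np.linalg.qr(rng.standard_normal((6, 3)))
A = rng.standard_normal((6, 5))

# The least-squares optimal H in min_H ||A - U H||_F^2, column by column...
H_ls = np.linalg.lstsq(U, A, rcond=None)[0]
# ...coincides with the closed form H = U^T A from the normal equations.
assert np.allclose(H_ls, U.T @ A)
```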

Definition 10 (Strong coresets; as defined in [41]).

Let 1 ≤ p < ∞ and 0 < ε < 1, and let A ∈ ℝ^{d×n}. A diagonal matrix S ∈ ℝ^{n×n} is a (1±ε) strong coreset for ℓ_p subspace approximation if for all rank-k projection matrices P_F, we have

‖(I − P_F)AS‖_{2,p}^p ∈ (1±ε)·‖(I − P_F)A‖_{2,p}^p. (19)

The number of non-zero entries nnz(S) of S will be referred to as the size of the coreset.

Theorem 11 (Theorems 1.3 and 1.4 of [42]).

Let p ∈ [1,2) ∪ (2,∞) and ε > 0 be given, and let A ∈ ℝ^{d×n}. There is an algorithm running in Õ(nnz(A) + d^ω) time which, with probability at least 1−δ, constructs a strong coreset S that satisfies Definition 10 and has size:

nnz(S) = (k/ε^{4/p})·(log(k/(εδ)))^{O(1)} if p ∈ [1,2), and (k^{p/2}/ε^p)·(log(k/(εδ)))^{O(p²)} if p ∈ (2,∞). (20)
Remark.

Note that for any S that satisfies the property in Definition 10, we can scale it up to satisfy ‖(I − P_F)AS‖_{2,p}^p ∈ [1, 1+ε]·‖(I − P_F)A‖_{2,p}^p, matching the condition in Equation 5.

For many of the applications, we have p=2. For this case, the choice of the reduced matrix 𝑩 that replaces 𝑨 is simply the matrix of scaled left singular vectors of 𝑨. More formally,

Lemma 12.

When p=2, if 𝐀=Σi=1nσipiqiT is the singular value decomposition of 𝐀 (where σi is the ith largest singular value and pi∈ℝd, qi∈ℝn are the left and right singular vectors corresponding to σi), then 𝐁=Σi=1rσipiqiT satisfies Equation 5 for r=k+k/ε.

The proof is deferred to Appendix A.1.
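Lemma 12's construction, and the guarantee it provides, can be sanity-checked numerically. The sketch below (assuming numpy; parameters are illustrative) builds 𝑩 from the scaled left singular vectors and verifies that 𝑨−𝑷𝑨F2−𝑩−𝑷𝑩F2 varies by at most ε𝑨−𝑨kF2 across random rank-k projections, which is exactly the content of the proof in Appendix A.1:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, k, eps = 10, 30, 2, 0.5
r = k + int(np.ceil(k / eps))   # reduced rank from Lemma 12

A = rng.standard_normal((d, n))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
# Scaled left singular vectors; right-multiplying by the orthonormal Vt[:r]
# changes no (I - P)-Frobenius norm, so this d x r matrix is equivalent to
# sum_{i<=r} sigma_i p_i q_i^T for our purposes.
B = U[:, :r] * s[:r]
tail = np.sum(s[k:] ** 2)       # ||A - A_k||_F^2

def gap(P):
    return np.linalg.norm(A - P @ A) ** 2 - np.linalg.norm(B - P @ B) ** 2

gaps = []
for _ in range(200):
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    gaps.append(gap(Q @ Q.T))   # random rank-k projection

# Equation 49: the gap is confined to an interval of length eps * ||A - A_k||_F^2.
assert max(gaps) - min(gaps) <= eps * tail + 1e-8
```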

 Remark 13.

Notice that when p=2, Lemma 12 proves the condition in Equation 49:

𝑩−𝑷𝑩F2∈(0,ε)𝑨−𝑨kF2+𝑨−𝑷𝑨F2−c

which is stronger than the condition in Equation 5.

Lemma 14.

If (𝐀,𝒮) is an instance of CSA and 𝐁d×r is a matrix that satisfies Equation 5, and

𝑷^:=argmin𝑷∈𝒮𝑩−𝑷𝑩2,pp, 𝑷∗:=argmin𝑷∈𝒮𝑨−𝑷𝑨2,pp, (21)

then 𝐏^ is a (1+ε)-approximate solution to the instance (𝐀,𝒮), i.e.,

𝑨−𝑷^𝑨2,pp≤(1+ε)𝑨−𝑷∗𝑨2,pp. (22)
  1. 1.

    More generally, if 𝑷^ is an approximate solution to (𝑩,𝒮) such that

𝑩−𝑷^𝑩2,pp≤α𝑩−𝑷𝑩2,pp+β ∀𝑷∈𝒮,

    for some α1,β0, then we have

𝑨−𝑷^𝑨2,pp≤α(1+ε)𝑨−𝑷∗𝑨2,pp+β.
  2. 2.

    For the specific case when p=2, if 𝑷^ is an exact solution to (𝑩,𝒮), then we have

𝑨−𝑷^𝑨F2≤𝑨−𝑷∗𝑨F2+ε𝑨−𝑨kF2.
Proof.
  1. 1.

    Using the approximate optimality of 𝑷^ for the instance (𝑩,𝒮), we have

    𝑩−𝑷^𝑩2,pp ≤α𝑩−𝑷∗𝑩2,pp+β. (23)
    Using the lower bound and upper bound from Equation 5 for the LHS and RHS respectively, we get
    𝑨−𝑷^𝑨2,pp−c ≤α(1+ε)𝑨−𝑷∗𝑨2,pp−αc+β. (24)
    Since α≥1 and c≥0, we get
    𝑨−𝑷^𝑨2,pp ≤α(1+ε)𝑨−𝑷∗𝑨2,pp+β. (25)
  2. 2.

    Using the optimality of 𝑷^ for the instance (𝑩,𝒮) with p=2, we have

    𝑩−𝑷^𝑩F2≤𝑩−𝑷∗𝑩F2. (26)

    Using Remark 13, we know that 𝑩−𝑷𝑩F2∈(0,ε)𝑨−𝑨kF2+𝑨−𝑷𝑨F2−c for any rank k projection matrix 𝑷, for some c≥0 independent of 𝑷 (see Equation 49). Using this, we get

    𝑨−𝑷^𝑨F2−c≤𝑩−𝑷^𝑩F2≤𝑩−𝑷∗𝑩F2≤𝑨−𝑷∗𝑨F2+ε𝑨−𝑨kF2−c.

    Canceling out the c gives the inequality we claimed.
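The guarantee of Lemma 14 (part 2) can be illustrated on a toy constrained family. Here 𝒮 is taken, purely for illustration, to be the axis-aligned rank-k projections, so that both the reduced and the original instance can be solved exactly by brute force (a sketch assuming numpy):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
d, n, k, eps = 8, 20, 2, 0.5
r = k + int(np.ceil(k / eps))

A = rng.standard_normal((d, n))
U, s, _ = np.linalg.svd(A, full_matrices=False)
B = U[:, :r] * s[:r]            # reduced matrix from Lemma 12
tail = np.sum(s[k:] ** 2)       # ||A - A_k||_F^2

def cost(M, axes):
    P = np.zeros((d, d))
    P[list(axes), list(axes)] = 1.0     # projection onto the chosen axes
    return np.linalg.norm(M - P @ M) ** 2

S = list(combinations(range(d), k))     # toy constraint family
ax_hat = min(S, key=lambda ax: cost(B, ax))   # exact solution on B
ax_star = min(S, key=lambda ax: cost(A, ax))  # exact solution on A

# Lemma 14, part 2: solving the reduced instance loses at most eps * ||A - A_k||_F^2.
assert cost(A, ax_hat) <= cost(A, ax_star) + eps * tail + 1e-8
```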

Lemma 15 (Lemma 4.1 in [12]).

If an n×d matrix 𝐀 has integer entries bounded in magnitude by γ, and has rank ρ≥k, then the kth singular value σk of 𝐀 satisfies |logσk|=O(log(ndγ)). This implies that 𝐀F/Δk≤(ndγ)O(k/(ρ−k)), where Δk:=𝐀−𝐀kF.

4 Applications

We present two applications to illustrate our framework. The remaining applications, as well as our hardness result for fair column-based approximation, are deferred to the full version.

4.1 Constrained Subspace Estimation [36]

In constrained subspace estimation, we are given a collection of target subspaces T1,T2,…,Tm and a model subspace W. The goal is to find a subspace V of dimension k with dim(V∩W)≥ℓ that maximizes the average overlap between the subspace V and T1,…,Tm. More formally, the problem can be formulated as the mathematical program:

max :𝑷¯T,𝑷V (CSE-max)
dim(V)=k, dim(V∩W)≥ℓ, (27)
𝑷¯T=1mi=1m𝑷Ti, (28)
𝑷Ti and 𝑷V are the projection matrices onto the subspaces Ti and V respectively. (29)

Let us assume that the constraint dim(V∩W)≥ℓ is actually an exact constraint dim(V∩W)=ℓ, because we can solve k−ℓ+1 different cases dim(V∩W)=i, one for each ℓ≤i≤k. Since 𝑷¯T is a PSD matrix, we can write it as 𝑨𝑨T for some 𝑨∈ℝd×d. Changing the optimization problem from a maximization to a minimization, we get

min :𝑨𝑨T,𝑰−𝑷V=𝑨−𝑷V𝑨F2 (CSE-min)
𝑷V is the projection matrix onto V (30)
dim(V)=k, dim(V∩W)=ℓ. (31)
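The reduction from the maximization to the minimization rests on the identity 𝑨𝑨T,𝑰−𝑷V=𝑨−𝑷V𝑨F2, which the following sketch verifies on random data (assuming numpy; the PSD square root plays the role of 𝑨):

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, k = 7, 4, 2

def rand_proj(dim):
    Q, _ = np.linalg.qr(rng.standard_normal((d, dim)))
    return Q @ Q.T

# Average of projections onto random target subspaces: a PSD matrix.
P_T_bar = sum(rand_proj(3) for _ in range(m)) / m

# PSD square root, so that P_T_bar = A A^T.
w, V = np.linalg.eigh(P_T_bar)
A = V * np.sqrt(np.clip(w, 0.0, None))

P_V = rand_proj(k)
overlap = np.trace(P_T_bar @ P_V)              # objective of CSE-max
residual = np.linalg.norm(A - P_V @ A) ** 2    # objective of CSE-min
# <A A^T, I - P_V> = tr(P_T_bar) - overlap, so maximizing one minimizes the other.
assert abs(residual - (np.trace(P_T_bar) - overlap)) < 1e-8
```

Since tr(𝑷¯T) is a constant of the instance, the maximizer of the overlap and the minimizer of the residual coincide.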
Lemma 16.

The CSE-min problem is a special case of CSA.

Proof.

Setting p=2 and 𝒮 as the set of k-dimensional projection matrices 𝑷V such that dim(V∩W)=ℓ in CSA gives CSE-min.

Let 𝑩∈ℝd×r, r=k+k/ε, be the reduced matrix obtained as in Lemma 12. Using Lemma 14, it suffices to focus on the reduced instance with 𝑨 replaced by 𝑩.

Any subspace V such that dim(V)=k, dim(V∩W)=ℓ can be represented equivalently as

V =Span(u1,u2,…,uℓ,v1,v2,…,vk−ℓ)
ui∈W, vj∈W⊥ ∀i∈[ℓ], j∈[k−ℓ].

Using these observations and Lemma 9, we can focus on the following subspace estimation program

min :𝑩−𝑼𝑪F2 (32)
𝑼 is an orthogonal basis for Span(u1,…,uℓ,v1,…,vk−ℓ) (33)
ui∈W, vj∈W⊥ ∀i∈[ℓ], j∈[k−ℓ]. (34)
Since 𝑪 is unconstrained, we can replace the condition in Equation 33 with the much simpler condition 𝑼=[u1,…,uℓ,v1,…,vk−ℓ]. This gives
min :𝑩−𝑼𝑪F2 (CSE-min-reduced)
𝑼=[u1,…,uℓ,v1,…,vk−ℓ] (35)
ui∈W, vj∈W⊥ ∀i∈[ℓ], j∈[k−ℓ]. (36)
Lemma 17.

For any fixed 𝐁∈ℝd×r and 𝐂∈ℝk×r, Equation CSE-min-reduced can be solved exactly in poly(n) time.

Proof.

For fixed 𝑩 and 𝑪, the objective is a convex quadratic in 𝑼 and the constraints on 𝑼 are linear. A convex quadratic program with linear constraints can be solved efficiently.
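Lemma 17 can also be realized directly: since each column of 𝑼 is linearly parametrized over a basis of W or W⊥, the program is an unconstrained least-squares problem in the parameters. A sketch under this parametrization (assuming numpy; the subspace dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
d, r, k, l = 8, 5, 3, 1                 # l columns in W, k - l in its complement

# Orthonormal bases for a 3-dimensional W and for W-perp (dimensions arbitrary).
M = np.linalg.qr(rng.standard_normal((d, d)))[0]
W, Wp = M[:, :3], M[:, 3:]

B = rng.standard_normal((d, r))
C = rng.standard_normal((k, r))

# Column t of U is base_t @ z_t, so vec(U) = blkdiag(bases) @ z for one long z.
bases = [W] * l + [Wp] * (k - l)
p = sum(b.shape[1] for b in bases)
blk = np.zeros((d * k, p))
row = col = 0
for b in bases:
    blk[row:row + d, col:col + b.shape[1]] = b
    row += d
    col += b.shape[1]

# vec(UC) = (C^T kron I_d) vec(U): an ordinary least-squares problem in z.
design = np.kron(C.T, np.eye(d)) @ blk
z, *_ = np.linalg.lstsq(design, B.T.reshape(-1), rcond=None)
U = (blk @ z).reshape(k, d).T

# Feasibility: the first l columns lie in W, the rest in W-perp.
assert np.allclose(Wp.T @ U[:, :l], 0.0, atol=1e-8)
assert np.allclose(W.T @ U[:, l:], 0.0, atol=1e-8)

# Global optimality over the feasible set: no random feasible U does better.
best = np.linalg.norm(B - U @ C)
rand = np.linalg.norm(B - (blk @ rng.standard_normal(p)).reshape(k, d).T @ C)
assert best <= rand + 1e-9
```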

Corollary 18 (Additive approximation for CSE).

Using Lemma 8, we can get a subspace V such that dim(V)=k, dim(V∩W)=ℓ and

𝑨−𝑷V𝑨F2≤(1+ε)OPT+O(δ𝑨F2)

for any choice of 0<δ<1 in time poly(n)(1/δ)O(k2/ε).

Lemma 15 gives a lower bound for OPT when the entries of the input matrix 𝑨 are integers bounded in magnitude by γ.

Theorem 19 (Multiplicative approximation for CSE).

Given an instance (𝐀∈ℝd×n,k,ℓ,W) of constrained subspace estimation with integer entries of absolute value at most γ in 𝐀, there is an algorithm that obtains a subspace V such that dim(V)=k, dim(V∩W)=ℓ and

𝑨−𝑷V𝑨F2≤(1+ε)OPT

in O(ndγ/ε)O(k3/ε) time.

Proof.

Using Lemma 15, we know that 𝑨F2/𝑨−𝑨kF2≤(ndγ)O(k). Setting δ=ε𝑨−𝑨kF2/𝑨F2≥ε(ndγ)−O(k) in Corollary 18 gives the desired time complexity.

4.2 Partition Constrained 𝒑-Subspace Approximation

We now consider the PC-p-subspace approximation problem, which generalizes the subspace approximation and subspace estimation problems.

Definition 20 (Partition Constrained p-Subspace Approximation).

In the PC-p-subspace approximation problem, we are given a set of target vectors {a1,a2,…,an}⊆ℝd as columns of a matrix 𝐀∈ℝd×n, a set of subspaces S1,…,Sℓ⊆ℝd, and a sequence of capacity constraints k1,…,kℓ where k1+⋯+kℓ=k. The goal is to select k vectors in total, ki from subspace Si, such that their span captures as much of 𝐀 as possible. Formally, the goal is to select vectors {vi,ti}i∈[ℓ],ti∈[ki], such that for every i, vi,1,…,vi,ki∈Si, so as to minimize Σi∈[n]ai−projspan({vi,ti}i∈[ℓ],ti∈[ki])(ai)2p.

Our results will give algorithms with running times exponential in poly(k) for PC-p-subspace approximation. Given this goal, we can focus on the setting where ki=1, since we can replace each Si in the original formulation with ki copies of Si, with a budget of 1 for each copy.
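The reduction to unit capacities is mechanical; a minimal sketch (the helper name is ours, not from the paper):

```python
import numpy as np

def to_unit_capacity(subspaces, capacities):
    """Replace each (S_i, k_i) pair with k_i unit-capacity copies of S_i."""
    expanded = []
    for S, cap in zip(subspaces, capacities):
        expanded.extend([S] * cap)
    return expanded

# Two subspaces of R^4 with capacities (2, 1) become three unit-capacity subspaces.
S1, S2 = np.eye(4)[:, :2], np.eye(4)[:, 2:]
expanded = to_unit_capacity([S1, S2], [2, 1])
assert len(expanded) == 3 and expanded[0] is S1 and expanded[2] is S2
```

Any unit-capacity solution picks one vector per copy, i.e., ki vectors from each original Si, so the two formulations have the same optimum.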

PC-p-subspace approximation with Unit Capacity.

Given a set of vectors {a1,a2,…,an}⊆ℝd as columns of a matrix 𝑨∈ℝd×n and subspaces S1,…,Sk⊆ℝd, select a vector vi∈Si for i∈[k] in order to minimize Σi∈[n]ai−projspan(v1,…,vk)(ai)2p, where p≥1 is a given parameter. A more compact formulation is

min :Σi=1nai−a^i2p (PC-p-SA-geo)
a^i∈Span(v1,…,vk) ∀i∈[n] (37)
vj∈Sj ∀j∈[k]. (38)
Using Lemma 9, the two other equivalent formulations are
min :𝑨−𝑼𝑼T𝑨2,pp (PC-p-SA)
𝑼 is an orthogonal basis for Span(v1,v2,…,vk) (39)
vi∈Si ∀i∈[k]. (40)
min :𝑨−𝑽𝑪2,pp (PC-p-SA-fac)
𝑽=[v1,…,vk] (41)
vi∈Si ∀i∈[k]. (42)

In what follows, we thus focus on the unit capacity version. We can use our general framework to derive an additive error approximation, for any p.

Theorem 21.

There exists an algorithm for PC-p-subspace approximation with runtime (κ/ε)poly(k/ε)poly(n) which returns a solution with additive error at most O(εp)Ap,2p, where κ is the condition number of an optimal solution 𝐕=[v1,v2,,vk] for the PC-subspace approximation problem PC-p-SA-fac.

For the special case of p=2, it turns out that we can obtain a (1+ε)-multiplicative approximation, using a novel idea. We now outline this approach. As described in our framework, we start by constructing the reduced instance 𝑩,𝒮, where 𝑩={b1,b2,,br}d is a set of target vectors and 𝒮={S1,S2,,Sk} is the given collection of subspaces of d. We define 𝑷j to be some fixed orthonormal basis for the space Sj. Recall that any solution to PC-2-subspace approximation is defined by (a) the vector xj that expresses the chosen vj as vj=𝑷jxj (we have one xj for each j[k]), and (b) a set of combination coefficients cij used to represent the vectors bi using the vectors {vj}j=1k. We collect the vectors xj into one long vector 𝒙 and the coefficients cij into a matrix 𝑪.

Theorem 22.

Let 𝐁,𝒮 be an instance of PC-2-subspace approximation, where 𝐁={b1,b2,…,br}, and suppose that the bit complexity of each element in the input is bounded by H. Suppose there exists an (approximately) optimal solution defined by the pair (𝐱,𝐂) with bit complexity poly(n,H). There exists an algorithm that runs in time nO(k2/ε)poly(H) and outputs a solution whose objective value is within a (1+ε) factor of the optimum objective value. We denote sj=dim(Sj) and s=Σj=1ksj; n for this result can be set to max(s,d,k/ε).

Algorithm Overview.

Recall that 𝑷j specifies an orthonormal basis for Sj. Let 𝑷ij:=cij𝑷j, where the cij are variables. Define 𝑷 to be the rd×s matrix consisting of r×k blocks, where the (i,j)th block is 𝑷ij, and let 𝒙 and 𝒃 be the vectors obtained by stacking the xj and the bi vertically, respectively, as shown below:

𝑷=[𝑷1,1𝑷1,2𝑷1,k𝑷2,1𝑷2,2𝑷2,k𝑷r,1𝑷r,2𝑷r,k],𝒙=[x1x2xk],𝒃=[b1b2br].

The problem PC-2-subspace approximation can now be expressed as the regression problem:

min𝑪,𝒙:𝑷𝒙−𝒃22. (43)

Written this way, it is clear that for any 𝑪, the optimization problem with respect to 𝒙 is simply a regression problem. For the sake of exposition, suppose that for the optimal solution (𝑪,𝒙), the matrix 𝑷 turns out to have full column rank (i.e., 𝑷T𝑷 is invertible). In this case, we can write down the normal equations 𝑷T𝑷𝒙=𝑷T𝒃 and solve them using Cramer’s rule! More specifically, let 𝑫=𝑷T𝑷 and let 𝑫j(i) be the matrix obtained by replacing the ith column in the jth column block of 𝑫 with the column 𝑷T𝒃, for j∈[k],i∈[sj]. Using Cramer’s rule, we have xj(i)=det(𝑫j(i))/det(𝑫).

The key observation now is that substituting this back into the objective yields an optimization problem over (the variables) 𝑪. First, observe that using the normal equations, the objective can be simplified as

𝑷𝒙−𝒃22=𝒙T𝑷T𝑷𝒙−𝒙T𝑷T𝒃−𝒃T𝑷𝒙+𝒃22=𝒃22−𝒃T𝑷𝒙.
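The assembly of 𝑷, the normal equations, the Cramer's-rule formula for the coordinates of 𝒙, and the objective simplification above can all be checked on a small random instance (a sketch assuming numpy; dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
d, r, k = 5, 3, 2
sj = [2, 2]                      # dim(S_j); s = sum(sj) columns in total

Pj = [np.linalg.qr(rng.standard_normal((d, s)))[0] for s in sj]  # bases P_j
C = rng.standard_normal((r, k))  # combination coefficients c_ij
b = rng.standard_normal(r * d)   # the stacked vector of the b_i

# The rd x s block matrix P with (i, j) block c_ij * P_j.
P = np.block([[C[i, j] * Pj[j] for j in range(k)] for i in range(r)])

# Normal equations for the regression min_x ||P x - b||^2.
D = P.T @ P
rhs = P.T @ b
x = np.linalg.solve(D, rhs)

# Cramer's rule: coordinate t equals det(D_t) / det(D), where D_t has
# column t replaced by P^T b.
t = 1
Dt = D.copy()
Dt[:, t] = rhs
assert abs(x[t] - np.linalg.det(Dt) / np.linalg.det(D)) < 1e-8

# At the optimum, ||P x - b||^2 = ||b||^2 - b^T P x.
assert abs(np.linalg.norm(P @ x - b) ** 2 - (b @ b - b @ (P @ x))) < 1e-8
```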

Suppose t is a real valued parameter that is a guess for the objective value. We then consider the following feasibility problem:

𝑷𝒙−𝒃22=𝒃22−𝒃T𝑷𝒙≤t (44)
𝒃22−t≤Σj∈[k],i∈[sj](𝒃T𝑷)j(i)det(𝑫j(i)), det(𝑫)=1. (45)

The idea is to solve this feasibility problem using the literature on solving polynomial systems. This leaves two main gaps: guessing t, and handling the case of 𝑷 not having a full column rank in the optimal solution. We handle the former issue using known quantitative bounds on the solution value to polynomial systems, and the latter using a pre-multiplication with random matrices of different sizes.

References

  • [1] Jason Altschuler, Aditya Bhaskara, Gang Fu, Vahab Mirrokni, Afshin Rostamizadeh, and Morteza Zadimoghaddam. Greedy column subset selection: New bounds and distributed algorithms. In International Conference on Machine Learning, pages 2539–2548, 2016. URL: http://proceedings.mlr.press/v48/altschuler16.html.
  • [2] Megasthenis Asteris, Dimitris Papailiopoulos, and Alexandros Dimakis. Nonnegative sparse pca with provable guarantees. In International Conference on Machine Learning, pages 1728–1736. PMLR, 2014. URL: http://proceedings.mlr.press/v32/asteris14.html.
  • [3] Frank Ban, Vijay Bhattiprolu, Karl Bringmann, Pavel Kolev, Euiwoong Lee, and David P. Woodruff. A PTAS for p-low rank approximation. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 747–766, 2019.
  • [4] Frank Ban, David P. Woodruff, and Qiuyi (Richard) Zhang. Regularized weighted low rank approximation. CoRR, abs/1911.06958, 2019. arXiv:1911.06958.
  • [5] Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Sparse features for pca-like linear regression. Advances in Neural Information Processing Systems, 24, 2011.
  • [6] Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Near-optimal column-based matrix reconstruction. SIAM Journal on Computing, 43(2):687–717, 2014. doi:10.1137/12086755X.
  • [7] Christos Boutsidis, Michael W Mahoney, and Petros Drineas. An improved approximation algorithm for the column subset selection problem. In Proceedings of the twentieth annual ACM-SIAM symposium on Discrete algorithms, pages 968–977, 2009. doi:10.1137/1.9781611973068.105.
  • [8] Christos Boutsidis, Anastasios Zouzias, Michael W Mahoney, and Petros Drineas. Randomized dimensionality reduction for k-means clustering. IEEE Transactions on Information Theory, 61(2):1045–1062, 2014. doi:10.1109/TIT.2014.2375327.
  • [9] Jorge Cadima and Ian T Jolliffe. Loading and correlations in the interpretation of principle compenents. Journal of applied Statistics, 22(2):203–214, 1995.
  • [10] Ashish Chiplunkar, Sagar Kale, and Sivaramakrishnan Natarajan Ramamoorthy. How to solve fair k-center in massive data models. In International Conference on Machine Learning, pages 1877–1886, 2020.
  • [11] Ali Civril and Malik Magdon-Ismail. Column subset selection via sparse approximation of SVD. Theoretical Computer Science, 421:1–14, 2012. doi:10.1016/J.TCS.2011.11.019.
  • [12] Kenneth L Clarkson and David P Woodruff. Numerical linear algebra in the streaming model. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 205–214, 2009. doi:10.1145/1536414.1536445.
  • [13] Kenneth L. Clarkson and David P. Woodruff. Input sparsity and hardness for robust subspace approximation. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 310–329, 2015. doi:10.1109/FOCS.2015.27.
  • [14] Michael B Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 163–172, 2015. doi:10.1145/2746539.2746569.
  • [15] Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, Chris Schwiegelshohn, and Omar Ali Sheikh-Omar. Improved coresets for euclidean k-means. In Proceedings of the 36th International Conference on Neural Information Processing Systems, 2024.
  • [16] Vincent Cohen-Addad, David Saulpic, and Chris Schwiegelshohn. A new coreset framework for clustering. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 169–182, 2021. doi:10.1145/3406325.3451022.
  • [17] Alberto Del Pia. Sparse pca on fixed-rank matrices. Mathematical Programming, 198(1):139–157, 2023. doi:10.1007/S10107-022-01769-9.
  • [18] Amit Deshpande and Luis Rademacher. Efficient volume sampling for row/column subset selection. In 2010 ieee 51st annual symposium on foundations of computer science, pages 329–338, 2010. doi:10.1109/FOCS.2010.38.
  • [19] Amit Deshpande, Madhur Tulsiani, and Nisheeth K. Vishnoi. Algorithms and hardness for subspace approximation. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, pages 482–496, USA, 2011. doi:10.1137/1.9781611973082.39.
  • [20] Petros Drineas, Alan Frieze, Ravi Kannan, Santosh Vempala, and Vishwanathan Vinay. Clustering large graphs via the singular value decomposition. Machine learning, 56:9–33, 2004. doi:10.1023/B:MACH.0000033113.59016.96.
  • [21] Dan Feldman, Morteza Monemizadeh, and Christian Sohler. A ptas for k-means clustering based on weak coresets. In Proceedings of the twenty-third annual symposium on Computational geometry, pages 11–18, 2007. doi:10.1145/1247069.1247072.
  • [22] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 4th edition, 2013.
  • [23] Venkatesan Guruswami, Prasad Raghavendra, Rishi Saket, and Yi Wu. Bypassing ugc from some optimal geometric inapproximability results. ACM Trans. Algorithms, 12(1), February 2016. doi:10.1145/2737729.
  • [24] Venkatesan Guruswami and Ali Kemal Sinop. Optimal column-based low-rank matrix reconstruction. In Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms, pages 1207–1214, 2012. doi:10.1137/1.9781611973099.95.
  • [25] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical learning with sparsity. Monographs on statistics and applied probability, 143(143):8, 2015.
  • [26] Sedjro Salomon Hotegni, Sepideh Mahabadi, and Ali Vakilian. Approximation algorithms for fair range clustering. In International Conference on Machine Learning, pages 13270–13284. PMLR, 2023.
  • [27] Lingxiao Huang, Jian Li, and Xuan Wu. On optimal coreset construction for euclidean (k,z)-clustering. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing, pages 1594–1604, 2024. doi:10.1145/3618260.3649707.
  • [28] Matthew Jones, Huy Nguyen, and Thy Nguyen. Fair k-centers via maximum matching. In International Conference on Machine Learning, pages 4940–4949, 2020.
  • [29] Matthäus Kleindessner, Pranjal Awasthi, and Jamie Morgenstern. Fair k-center clustering for data summarization. In International Conference on Machine Learning, pages 3448–3457, 2019. URL: http://proceedings.mlr.press/v97/kleindessner19a.html.
  • [30] Arvind V. Mahankali and David P. Woodruff. Optimal 1 column subset selection and a fast PTAS for low rank approximation. CoRR, abs/2007.10307, 2020.
  • [31] Antonis Matakos, Bruno Ordozgoiti, and Suhas Thejaswi. Fair column subset selection. arXiv preprint arXiv:2306.04489, 2023. doi:10.48550/arXiv.2306.04489.
  • [32] Ankur Moitra. An almost optimal algorithm for computing nonnegative rank. SIAM J. Comput., 45(1):156–173, 2016. doi:10.1137/140990139.
  • [33] Dimitris Papailiopoulos, Alexandros Dimakis, and Stavros Korokythakis. Sparse pca through low-rank approximations. In International Conference on Machine Learning, pages 747–755. PMLR, 2013. URL: http://proceedings.mlr.press/v28/papailiopoulos13.html.
  • [34] Ilya Razenshteyn, Zhao Song, and David P. Woodruff. Weighted low rank approximations with provable guarantees. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pages 250–263, 2016. doi:10.1145/2897518.2897639.
  • [35] Samira Samadi, Uthaipon Tantipongpipat, Jamie H Morgenstern, Mohit Singh, and Santosh Vempala. The price of fair PCA: One extra dimension. In Advances in neural information processing systems, pages 10976–10987, 2018.
  • [36] Ignacio Santamaria, Javier Vía, Michael Kirby, Tim Marrinan, Chris Peterson, and Louis Scharf. Constrained subspace estimation via convex optimization. In 2017 25th European Signal Processing Conference (EUSIPCO), pages 1200–1204. IEEE, 2017. doi:10.23919/EUSIPCO.2017.8081398.
  • [37] Zhao Song, Ali Vakilian, David Woodruff, and Samson Zhou. On socially fair regression and low-rank approximation. In Advances in Neural Information Processing Systems, 2024.
  • [38] Uthaipon Tantipongpipat, Samira Samadi, Mohit Singh, Jamie H Morgenstern, and Santosh Vempala. Multi-criteria dimensionality reduction with applications to fairness. In Advances in Neural Information Processing Systems, pages 15135–15145, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/2201611d7a08ffda97e3e8c6b667a1bc-Abstract.html.
  • [39] Joel A Tropp. Column subset selection, matrix factorization, and eigenvalue optimization. In Proceedings of the twentieth annual ACM-SIAM symposium on Discrete algorithms, pages 978–986. SIAM, 2009. doi:10.1137/1.9781611973068.106.
  • [40] Ameya Velingker, Maximilian Vötsch, David P. Woodruff, and Samson Zhou. Fast (1+ε)-approximation algorithms for binary matrix factorization. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
  • [41] David P. Woodruff and Taisuke Yasuda. Nearly linear sparsification of p subspace approximation, 2024. doi:10.48550/arXiv.2407.03262.
  • [42] David P. Woodruff and Taisuke Yasuda. Ridge leverage score sampling for p subspace approximation, 2025. arXiv:2407.03262.
  • [43] Zhirong Yang and Erkki Oja. Linear and nonlinear projective nonnegative matrix factorization. IEEE Transactions on Neural Networks, 21:734–749, 2010. doi:10.1109/TNN.2010.2041361.
  • [44] Xiao-Tong Yuan and Tong Zhang. Truncated power method for sparse eigenvalue problems. Journal of Machine Learning Research, 14(4), 2013. doi:10.5555/2567709.2502610.
  • [45] Zhijian Yuan and Erkki Oja. Projective nonnegative matrix factorization for image compression and feature extraction. In Image Analysis: 14th Scandinavian Conference, SCIA 2005, Joensuu, Finland, June 19-22, 2005. Proceedings 14, pages 333–342. Springer, 2005. doi:10.1007/11499145_35.
  • [46] Zhijian Yuan, Zhirong Yang, and Erkki Oja. Projective nonnegative matrix factorization: Sparseness, orthogonality, and clustering, 2009.
  • [47] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of computational and graphical statistics, 15(2):265–286, 2006.

Appendix A Missing Proofs

A.1 Proof of Lemma 12

Proof.

For any two arbitrary projection matrices 𝑷 and 𝑷′ of rank at most k, consider the difference

(𝑨−𝑷𝑨F2−𝑩−𝑷𝑩F2)−(𝑨−𝑷′𝑨F2−𝑩−𝑷′𝑩F2) (46)
=𝑨𝑨T,𝑰−𝑷−𝑩𝑩T,𝑰−𝑷−𝑨𝑨T,𝑰−𝑷′+𝑩𝑩T,𝑰−𝑷′ (47)
=𝑨𝑨T−𝑩𝑩T,𝑷′−𝑨𝑨T−𝑩𝑩T,𝑷 (48)
≤𝑨𝑨T−𝑩𝑩T,𝑷′ (𝑨𝑨T−𝑩𝑩T⪰0, 𝑷⪰0)
≤Σi=r+1r+kσi2 (rank of 𝑷′ ≤ k)
≤kσr+12 (σi≤σr+1 for i≥r+1)
≤(k/(r−k))Σi=k+1rσi2 (σr+1≤σi for k+1≤i≤r)
≤(k/(r−k))𝑨−𝑨kF2=ε𝑨−𝑨kF2. (𝑨−𝑨kF2=Σi=k+1dσi2)

If we let c:=maxrank(𝑷)≤k(𝑨−𝑷𝑨F2−𝑩−𝑷𝑩F2), then we have

c−ε𝑨−𝑨kF2≤𝑨−𝑷𝑨F2−𝑩−𝑷𝑩F2≤c

for any projection matrix 𝑷 of rank at most k. This can be rewritten as

𝑩−𝑷𝑩F2∈(0,ε)𝑨−𝑨kF2+𝑨−𝑷𝑨F2−c. (49)

Using the fact that 𝑨−𝑨kF2≤𝑨−𝑷𝑨F2, we get

𝑩−𝑷𝑩F2∈(1,1+ε)𝑨−𝑷𝑨F2−c.

The fact that c≥0 follows from the fact that

𝑨−𝑷𝑨F2−𝑩−𝑷𝑩F2 =𝑨𝑨T−𝑩𝑩T,𝑰−𝑷 (50)
≥0. (𝑨𝑨T−𝑩𝑩T⪰0, 𝑰−𝑷⪰0)