
Relational Approximations for Subspace Primitives

Xiang Liu, Department of Computer Science, University of Iowa, Iowa City, IA, USA
Kasturi Varadarajan, Department of Computer Science, University of Iowa, Iowa City, IA, USA
Abstract

We explore fundamental geometric computations on point sets that are given to the algorithm implicitly. In particular, we are given a database which is a collection of tables with numerical values, and the geometric computation is to be performed on the join of the tables. Explicitly computing this join can take time exponential in the size of the tables. We are therefore interested in geometric problems that can be solved by algorithms whose running time is a polynomial in the size of the tables. Such relational algorithms are typically not able to explicitly compute the join.

To avoid the NP-completeness bottleneck, researchers assume that the tables have a tractable combinatorial structure, like being acyclic. Even with this assumption, simple geometric computations turn out to be non-trivial and sometimes intractable. In this article, we study the problem of computing the maximum distance of a point in the join to a given subspace, and develop approximation algorithms for this NP-hard problem.

Keywords and phrases:
relational algorithm, Euclidean distance, subspace approximation
Category:
APPROX
Copyright and License:
© Xiang Liu and Kasturi Varadarajan; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Computational geometry
Acknowledgements:
The authors thank Kirk Pruhs for discussions that led to the problems studied here, and the organizers of the 2023 Workshop on Fine-Grained Complexity, Logic, and Query Evaluation at the Simons Institute for the Theory of Computing for facilitating these discussions.
Editors:
Alina Ene and Eshan Chattopadhyay

1 Introduction

In this paper, we consider certain fundamental geometric computations that are trivially solvable in linear or quadratic time in the size of the input point set. In our setting, the input point set is not given explicitly, but rather in an implicit form as described below. An explicit representation can be exponentially larger than the implicit one, and the question we pursue is whether these geometric computations can be performed in time polynomial in the size of the implicit representation.

We are given a set $\{C_1, C_2, \ldots, C_m\}$ where each $C_j \subseteq \mathbb{Z}^+$ is a subset of the positive integers. We interpret each element of $C_j$ as a coordinate axis or feature. Furthermore, for each $1 \le j \le m$, we are given a table (two-dimensional matrix) $T_j$ with at least one row and exactly $|C_j|$ columns, where each column corresponds to a feature in $C_j$. The values in the table $T_j$ are real numbers. Given a row $r$ in $T_j$, we let $r[i]$ (or $r_i$) denote the value corresponding to feature $i \in C_j$. We assume tables don't have duplicate rows, and thus a table specifies a set of row vectors.

We define the join $T_1 \bowtie T_2 \bowtie \cdots \bowtie T_m$ of $m \ge 1$ such tables as a table $J$ whose columns correspond to the union $C = \bigcup_{j=1}^m C_j$ of the individual feature sets. Thus each row in the join is a vector with $|C|$ real components. Such a vector $q$ belongs to the join $J$ if in each table $T_j$, there is a row $r^j$ such that $q$ and $r^j$ agree on the features in $C_j$: for each $i \in C_j$, $q[i] = r^j[i]$.

Thus, a row $q$ in the join $J$ is generated by picking rows $r^1 \in T_1, r^2 \in T_2, \ldots, r^m \in T_m$ that are pairwise compatible, that is, agreeing on the value on common features; and then "concatenating" these rows. That is, for each feature $i \in C$, we find an arbitrary $C_j$ containing feature $i$, and let $q[i] := r^j[i]$. The join $J$ consists of the set of all rows that can be generated in this way. An example of a join with feature set $\{a, b, c\}$ is shown below:

     T1          T2        J = T1 ⋈ T2
    a  b        b  c         a  b  c
    1  1        1  2         1  1  2
    1  2        2  3         2  1  2
    2  1        5  6         1  2  3
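The brute-force construction of the join described above can be sketched as follows. This is a toy implementation for illustration only (function and variable names are ours); a relational algorithm would of course avoid this explicit enumeration:

```python
from itertools import product

def join(tables):
    """Natural join of tables, each given as (features, rows).

    features: tuple of feature names; rows: list of value tuples.
    Returns (sorted feature list, set of join rows). Brute force:
    exponential in the number of tables, unlike a relational algorithm.
    """
    all_feats = sorted({f for feats, _ in tables for f in feats})
    out = set()
    for combo in product(*[rows for _, rows in tables]):
        q, ok = {}, True
        for (feats, _), row in zip(tables, combo):
            for f, v in zip(feats, row):
                if q.setdefault(f, v) != v:   # compatibility check
                    ok = False
                    break
            if not ok:
                break
        if ok:
            out.add(tuple(q[f] for f in all_feats))
    return all_feats, out

# The example above: T1 over (a, b), T2 over (b, c).
T1 = (("a", "b"), [(1, 1), (1, 2), (2, 1)])
T2 = (("b", "c"), [(1, 2), (2, 3), (5, 6)])
feats, J = join([T1, T2])
```

Note that the row (5, 6) of T2 is incompatible with every row of T1 and so contributes nothing to the join.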

A relational database stores its data in the form of multiple tables as above. However, typical data analysis techniques such as clustering only work when the input data is in explicit form, as in the join $J$. Standard practice is thus to compute the join, and then run the data analysis algorithm on it. Let $d = |C|$ denote the total number of features and $n$ the maximum number of rows in any given table. The space needed to store all the tables is easily upper bounded by $O(mnd)$, where we recall that $m$ is the number of tables. However, the number of rows in the join $J = T_1 \bowtie T_2 \bowtie \cdots \bowtie T_m$ is $\Theta(n^m)$ in the worst case, which can be exponentially larger. This naturally raises the following question: what properties of the join $J$ can we compute using relational algorithms, defined as algorithms with running time polynomial in $n$, $m$, and $d$? See [2, 1, 9, 12] and the references therein. Notice that a relational algorithm typically is not able to explicitly construct the join.

Before describing prior work on relational algorithms, we point out the geometric nature of the join. The rows of each table $T_j$ can be viewed as a set of points lying in the subspace spanned by the features/coordinate axes in $C_j$. The join $J = T_1 \bowtie T_2 \bowtie \cdots \bowtie T_m$ is the set of all points $p$, in the subspace spanned by features in $C$, such that the projection of $p$ onto the subspace spanned by $C_j$ is an element of $T_j$, for each $j$.

1.1 Prior Work

Perhaps the most basic algorithmic question to ask about the join J of the tables T1,T2,,Tm is whether it is non-empty – does it have at least one row/point in it? This turns out to be NP-Complete [7, 10, 9]. Indeed, a simple reduction from 3CNF-SAT has the property that the number of rows in the join is exactly the number of satisfying assignments to the input formula. (This is a good spot to emphasize that in this paper, the number d of features and the number m of tables are not viewed as constants.)

Given this state of affairs, the approach taken by papers in this area is to assume that the join has some additional structure that makes such basic algorithmic tasks tractable. The most common assumption is that the join is acyclic – this is a combinatorial assumption about how the features are distributed across the tables, and we define it in the next section. Assuming acyclic joins lets us use dynamic programming to solve certain algorithmic questions in polynomial time [2, 1]:

  1. Compute the number of rows in the join $J$.

  2. Compute the maximum of the squared $\ell_2$ norm of rows in the join, that is, $\max_{q \in J} \sum_{i \in C} q_i^2$.

These algorithmic tasks are special cases of a sum-product query. This is a query $Q(J)$ over the join $J$ of the form

$$Q(J) = \bigoplus_{q \in J} \bigotimes_{i \in C} F_i(q_i),$$

where $(R, \oplus, \otimes)$ is a commutative semiring over a set $R$. Corresponding to feature $i \in C$, $F_i : \mathbb{R} \to R$ is an easy to compute function with range $R$. For example, for the query asking for the number of rows in the join, $R$ is the set of integers, $F_i$ is the constant function that takes on the value 1, $\otimes$ is multiplication over the integers, and $\oplus$ is addition over the integers. For computing the maximum of the squared $\ell_2$ norm of the rows, $R$ is the set of real numbers, $F_i(x) = x^2$, $\otimes$ is addition over the reals, and $\oplus$ is the max function.
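The semiring formulation can be made concrete by evaluating both example queries over a small explicit join. This is a brute-force sketch with illustrative names; a relational algorithm would instead push the ⊕ and ⊗ operators through a join tree:

```python
def sum_product(J, F, oplus, otimes, one):
    """Evaluate Q(J) = oplus_{q in J} otimes_{i in C} F(q_i).

    `one` is the identity of otimes. Brute force over an explicit join.
    """
    acc = None
    for q in J:
        term = one
        for x in q:
            term = otimes(term, F(x))
        acc = term if acc is None else oplus(acc, term)
    return acc

J = [(1, 1, 2), (2, 1, 2), (1, 2, 3)]  # the join from the introduction

# Count of rows: R = integers, F_i = 1, otimes = *, oplus = +.
count = sum_product(J, lambda x: 1, lambda a, b: a + b, lambda a, b: a * b, 1)

# Max squared l2 norm: R = reals, F_i(x) = x^2, otimes = +, oplus = max.
max_sq = sum_product(J, lambda x: x * x, max, lambda a, b: a + b, 0)
```

Here `count` comes out to 3 and `max_sq` to 14, the squared norm of the row (1, 2, 3).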

The authors in [9] study sum-product queries under constraints on the rows of the join. They allow constraints of the form $\sum_{i \in C} g_i(q_i) \le \beta$, where $g_i : \mathbb{R} \to \mathbb{R}$ is an easy-to-compute function, and $\beta$ is a constant. They consider the problem of evaluating a sum-product query over those rows in the join that satisfy the given set of constraints. They show that with two constraints, such problems are not only NP-hard but also hard to approximate to within any multiplicative factor. Concretely, they show it is NP-hard to determine if there is a row $q$ in the join $J$ that satisfies the linear constraint $\sum_i q_i = 0$. This is via a reduction from the Partition problem, where we want to determine if a given set of positive integers can be partitioned into two sets whose sums are equal. Thus, it is a weak NP-hardness result [6]. The linear constraint $\sum_i q_i = 0$ is equivalent to the two linear inequality constraints $\sum_i q_i \ge 0$ and $\sum_i q_i \le 0$.

On the other hand, the authors in [9] show that with just one inequality constraint, and some additional technical assumptions, there is a randomized approximation scheme for sum-product queries that guarantees a multiplicative $(1+\epsilon)$-approximation, for any given parameter $\epsilon > 0$. For instance, they use this general result to get a polynomial-time $(1+\epsilon)$-approximation for the following counting problems:

  1. Count the number of rows in the join that lie in a given halfspace $\sum_i a_i q_i \le \beta$.

  2. Count the number of rows in the join that lie within a given ball $\sum_i (q_i - y_i)^2 \le r^2$ (centered at point $y$ and having radius $r \ge 0$).

The authors in [12] study the $k$-means problem, and show how the $k$-means++ algorithm [3] for constructing a coreset can be implemented in the relational setting, thus deriving an $O(1)$-approximation for the $k$-means problem. Their core technical result is a polynomial-time sampling algorithm that, given a set $c_1, c_2, \ldots, c_j$ of centers in $\mathbb{R}^{|C|}$, outputs a row $q \in J$ with probability proportional to the square of the Euclidean distance of $q$ to the nearest center. This sampling result is surprising given the hardness of closely related problems. In particular, the problem of counting the number of rows in the join whose closest center is $c_1$ is NP-hard even for $j = 2$, and hard to approximate to within any multiplicative factor for $j = 3$. The technique of rejection sampling plays a key role in their sampling result.

Very recently, the authors in [5] present fast deterministic and randomized relational approximation algorithms for the k-median and k-means clustering problems assuming the dimension d is a constant. The authors in [4] present a method for constructing a coreset for certain optimization problems in machine learning. The work that we have reviewed builds on a body of theoretical and experimental research on relational algorithms. We refer the reader to [9, 12, 5, 4] for surveys of this prior work.

1.2 Our Contribution

From prior work, it is evident that in the relational setting, the study of the complexity of very fundamental geometric problems yields surprising answers. More of the terrain needs to be explored to form a better picture of the computational complexity. Motivated by this, we consider the following Distance to $k$-Subspace problem: Given a set $S = \{s_1, s_2, \ldots, s_k\}$ of orthogonal unit vectors, compute the maximum Euclidean distance of a row in the join to $\mathrm{span}(S)$, the subspace spanned by $S$. As a reminder, we are assuming an acyclic database. The Distance to Subspace problem is a basic primitive for other tasks including the subspace approximation problem [8]. We obtain the following results.

  1. We show, in Section 4, that the Distance to $k$-Subspace problem is NP-hard even for $k = 1$.

  2. We present a polynomial-time $\sqrt{d}$-approximation for the Distance to $k$-Subspace problem. To do so, we introduce a generalization of the join and compute measures on that generalization. This is likely to be an algorithmic tool of independent interest. As a consequence of our technique, we show that the row rank of the join $J$ can be computed exactly in polynomial time. These results are presented in Section 3.

  3. Next, we ask if for constant $k$, we can get a multiplicative $(1+\epsilon)$-approximation for the Distance to $k$-Subspace problem, for any $\epsilon > 0$. The distance of a row $q$ in the join from the given $k$-subspace can be written as $\sqrt{A(q) - B(q)}$, where $A(q)$ and $B(q)$ are both non-negative polynomials. $A(q)$ and $B(q)$ can individually be maximized/minimized using prior work on sum-product queries, but obviously this does not, by itself, give us a way of maximizing the difference $A(q) - B(q)$.

    Instead, we keep track of all possible values of the pair $(A(q), B(q))$ in our dynamic program. However, there can be an exponential number of such pairs, as is to be expected for an NP-hard problem. We can keep track of suitably discretized values of $(A(q), B(q))$, but to get a polynomial-time algorithm using this approach, we need the instance to be well-conditioned in the following sense: the maximum 2-norm of rows in the join (that is, the distance to the 0-subspace) is at most polynomially larger than the distance to the given $k$-subspace.

    Our main technical contribution here is to show that the Distance to $k$-Subspace problem can be reduced to polynomially many instances of the same problem that are well-conditioned. To obtain such a reduction, one tool we develop is a proof that there exists a $k$-subspace spanned by $k$ of the standard basis vectors that is "close" to any given $k$-subspace. Overall, our reduction gives us a $(1+\epsilon)$-approximation (Section 4) for the Distance to $k$-Subspace problem.

  4. We observe, in Section 4, that our $(1+\epsilon)$-approximation to Distance to $k$-Subspace can be used as a primitive to obtain a $(2(1+\epsilon))^k$-approximation for finding a $k$-subspace that minimizes the maximum Euclidean distance to the join. When the point set is given explicitly (as opposed to implicitly via the join), this is well known as the $\ell_\infty$-subspace approximation problem [14]. The running time of our algorithm is polynomial for constant $k$. Our approximation factor for subspace approximation is large, and this result serves to illustrate the usefulness of the Distance to Subspace approximation primitive, which is the main focus of this work.

We emphasize that our work addresses the regime where the number $d$ of features and the number $m$ of tables are part of the input, and are not treated as constants. We are primarily focused on the approximation guarantee that we can get with polynomial running time; we do not explore further optimizations of the running time.

2 Preliminaries

In this section, we present essential preliminaries. The database that is input to each of the problems considered here consists of a collection of tables $T_1, T_2, \ldots, T_m$. The columns of table $T_j$ correspond to a subset $C_j \subseteq \mathbb{Z}^+$ of features, and thus each row in $T_j$ is a vector with $|C_j|$ real-valued components. We use $n$ to denote the maximum number of rows in any table. Let $C = \bigcup_{j=1}^m C_j$ denote the set of all features. We will assume that $C = \{1, 2, \ldots, d\}$, where $d$ denotes $|C|$. We assume that the columns in any table are ordered according to the natural ordering $1, 2, \ldots, d$ of the features in $C$. Thus, we can represent a row $r$ in table $T_j$ unambiguously as a point in $\mathbb{R}^{|C_j|}$. Given such a row $r$ and feature $i \in C_j$, we will use $r_i$ or $r[i]$ to denote the component of $r$ corresponding to feature $i$.

Given two row vectors $r'$ and $r''$ over feature sets $C'$ and $C''$, we say that $r'$ and $r''$ are compatible if for each feature $i \in C' \cap C''$, $r'[i] = r''[i]$. We use the notation $r' \sim r''$ to denote that $r'$ and $r''$ are compatible. The concatenation of compatible rows $r'$ and $r''$ is the row vector over $C' \cup C''$ formed by combining the components of $r'$ and $r''$ in the obvious way.

2.1 Acyclic Databases

We say that our database of m tables is acyclic if there is a tree τ with O(m) nodes, each associated with one of our input tables, that has the following properties:

  1. Each table in the database is associated with at least one node.

  2. For each feature $i \in C$, the set of nodes whose tables contain $i$ induces a connected subgraph of the tree.

Note that for technical reasons, we allow a table to be associated with more than one node of the tree τ. Given an acyclic database, such a tree τ can be computed in polynomial time; see for example [9].

Note that each row of the join $J = T_1 \bowtie T_2 \bowtie \cdots \bowtie T_m$ is formed by picking one row from the table associated with each node of $\tau$, such that the chosen rows are pairwise compatible, and then concatenating the chosen rows. For an acyclic database, it suffices to check compatibility between pairs of rows chosen from nodes that are neighbors in the tree $\tau$. This follows from the properties of an acyclic database. This is the crucial property of acyclic databases that allows one to evaluate certain queries on the join efficiently via dynamic programming.
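To illustrate, here is a minimal sketch of such a dynamic program for the simplest acyclic shape: a chain of two-column tables in which consecutive tables share one feature. It counts the rows of the join without materializing it; the chain restriction and all names are ours, while the paper's dynamic programs run over a general join tree:

```python
from collections import defaultdict

def count_chain_join(tables):
    """Count rows of T1 ⋈ T2 ⋈ ... for a chain join.

    tables[i] is a list of (left, right) value pairs; the right column
    of tables[i] is joined with the left column of tables[i + 1].
    Runs in time polynomial in the table sizes.
    """
    # ways[v] = number of suffix-join rows whose leading value is v
    ways = defaultdict(int)
    for left, _ in tables[-1]:
        ways[left] += 1
    # sweep from the last table toward the first
    for t in reversed(tables[:-1]):
        nxt = defaultdict(int)
        for left, right in t:
            nxt[left] += ways[right]   # extend only compatible rows
        ways = nxt
    return sum(ways.values())

T1 = [(1, 1), (1, 2), (2, 1)]   # columns (a, b), as in the introduction
T2 = [(1, 2), (2, 3), (5, 6)]   # columns (b, c)
```

On the introduction's example, `count_chain_join([T1, T2])` returns 3, matching the three rows of the join, while the work done is linear in the table sizes rather than in the join size.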

For the purpose of dynamic programming, it will be convenient to view the tree $\tau$ as rooted. Furthermore, we can assume that each internal node of $\tau$ has exactly two children. This is without loss of generality: If there is a node $\alpha$ with exactly one child $\alpha'$, we can make $\alpha'$ the left child of $\alpha$ and add a right child whose table is a copy of $\alpha$'s. If a node $\alpha$ has $j > 2$ children, then replace $\alpha$ with a complete binary tree with $j - 1$ internal nodes and $j$ leaves, which will now correspond to the original $j$ children of $\alpha$. The table at each of the $j - 1$ internal nodes will be a copy of the table at $\alpha$.

After applying these two operations, we have that each internal node of τ has exactly two children. The total number of nodes in τ is O(m). The join is unchanged: we have merely added copies of some of the original tables. Finally, τ still witnesses the fact that the database is acyclic, as is readily checked. We will refer to τ as the join tree of the database.

For a node $\alpha$ in the tree $\tau$, we use $T_\alpha$ to denote the table at node $\alpha$, and $J_\alpha$ to denote the join of all the tables in the subtree rooted at $\alpha$. We "assign" each feature $i \in C$ to the highest node in tree $\tau$ whose table contains feature $i$ as a column. By highest, we mean the node closest to the root of $\tau$. Let $\hat{C}_\alpha$ denote the set of features assigned to node $\alpha$, and $C_\alpha$ denote the set of features assigned to nodes in the subtree rooted at $\alpha$. For a row $r$ in table $T_\alpha$, we denote the set of rows in the join $J_\alpha$ that are compatible with $r$ by $J_\alpha^r$. The notation is reasonable as this set of rows is indeed the result of a join of $J_\alpha$ and a table containing the single row $r$.

3 Computing the Distance to a given k-Subspace

Assume we are given an acyclic database consisting of tables $T_1, T_2, \ldots, T_m$ and a set of $k$ orthogonal unit vectors $S = \{s_i \mid 1 \le i \le k\} \subseteq \mathbb{R}^d$, where (as we recall) the set of features in the database is $C = \{1, 2, \ldots, d\}$. The Distance to $k$-Subspace problem is that of computing the maximum distance of a row in the join $J = T_1 \bowtie T_2 \bowtie \cdots \bowtie T_m$ from the subspace spanned by $S$. That is, we want to compute

$$\max_{r \in J} \left\| r - \sum_{i=1}^k (r \cdot s_i) s_i \right\|_2.$$

The main result of this section is a $\sqrt{d}$-approximation to this problem. We begin with a scheme for implicitly representing functions $V : J \to \mathbb{R}^\mu$, and posing algorithmic problems concerning this representation.

3.1 An Implicit Representation

Suppose that for each feature $i \in C$, we are given a function $V_i : \mathbb{R} \to \mathbb{R}^\mu$, where $\mu \ge 1$ is identical across all features. We assume that these functions are easily computed. In fact, for the application here, we assume that $V_i(x)$ is given explicitly for each value $x$ of feature $i$ that occurs in any of the input tables. For any row $r$ in the join $J$, let $V(r) = \sum_{i \in C} V_i(r_i)$. Finally, let $V(J) = \{V(r) \mid r \in J\}$.

One illustrative example is to let $V_i(x) = (0, \ldots, 0, \underbrace{x}_{i\text{th entry}}, 0, \ldots, 0) \in \mathbb{R}^d$, where we recall that the feature set $C$ is $\{1, 2, \ldots, d\}$. Then for any row $r$ in the join $J$ of the tables, $V(r) = \sum_{i \in C} V_i(r_i) = r$. Thus, $V(J) = J$. We will see more interesting examples shortly.

We are interested in computing measures of the point set $V(J)$. But this seems generally harder for $V(J)$ than for $J$. For example, there is a polynomial-time algorithm for computing $\max_{r \in J} \|r\|_2^2$ using sum-product queries. However, computing $\max_{q \in V(J)} \|q\|_2^2$ seems harder. Fortunately, we can show a positive result for the $\ell_\infty$ norm.

Lemma 1.

There is a polynomial-time algorithm for computing $\max_{q \in V(J)} \|q\|_\infty$.

Proof.

For $1 \le j \le \mu$, let

$$\Psi_j^{\max} = \max_{q \in V(J)} q_j = \max_{r \in J} \sum_{i \in C} V_i(r_i)[j],$$

and

$$\Psi_j^{\min} = \min_{q \in V(J)} q_j = \min_{r \in J} \sum_{i \in C} V_i(r_i)[j].$$

That is, $\Psi_j^{\max}$ (resp. $\Psi_j^{\min}$) is the maximum (minimum) value of the $j$-th coordinate over points in $V(J)$. As $\Psi_j^{\max} = \max_{r \in J} \sum_{i \in C} V_i(r_i)[j]$, it is a sum-product query over the join $J$ where the commutative semiring is over the set of real numbers, the operator $\oplus$ is max, and the operator $\otimes$ is addition over the reals. It follows that $\Psi_j^{\max}$ and $\Psi_j^{\min}$ can be computed in polynomial time. Let $\Psi_j = \max\{|\Psi_j^{\max}|, |\Psi_j^{\min}|\}$.

Finally, we note that $\max_{q \in V(J)} \|q\|_\infty$ is $\max_{j=1}^\mu \Psi_j$.
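A quick brute-force sanity check of the identity underlying Lemma 1. The $V_i$'s and the tiny join below are illustrative; in the relational setting each coordinate extreme would be computed by a sum-product query rather than by enumerating $J$:

```python
def linf_via_extremes(J, V_list):
    """max over V(J) of the l-infinity norm, computed coordinate-wise
    as max_j max(|Psi_j_max|, |Psi_j_min|)."""
    mu = len(V_list[0](0.0))
    best = 0.0
    for j in range(mu):
        vals = [sum(V_list[i](q[i])[j] for i in range(len(q))) for q in J]
        best = max(best, max(abs(max(vals)), abs(min(vals))))
    return best

J = [(1.0, -3.0), (2.0, 1.0)]                 # a toy two-row join
V_list = [lambda x: (x, 2 * x),               # illustrative V_1 (mu = 2)
          lambda x: (-x, x)]                  # illustrative V_2

lhs = linf_via_extremes(J, V_list)
# Direct evaluation: the l-infinity norm of each point of V(J).
rhs = max(max(abs(sum(V_list[i](q[i])[j] for i in range(2)))
              for j in range(2)) for q in J)
```

Here $V(J) = \{(4, -1), (1, 5)\}$, so both computations give 5.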

3.2 A √d-approximation

We begin by showing that using the above implicit representation, we can efficiently maximize the $\ell_\infty$ norm of the projection onto the orthogonal complement of the given $k$-subspace.

Lemma 2.

Suppose that for $1 \le k \le d$ we are given a set of unit vectors $S = \{s_1, s_2, \ldots, s_k\} \subseteq \mathbb{R}^d$ that are pairwise orthogonal. We can compute

$$\max_{r \in J} \left\| r - \sum_{i=1}^k (r \cdot s_i) s_i \right\|_\infty$$

in polynomial time.

Proof.

For $y \in \mathbb{R}^d$, let $y^\perp$ denote the projection of $y$ onto the orthogonal complement of the subspace spanned by $S$. Thus, $y^\perp = y - \sum_{i=1}^k (y \cdot s_i) s_i$. Viewing $y^\perp$ as a function of $y$, we see that $y^\perp = \sum_{i=1}^d y_i b_i$, where each $b_i \in \mathbb{R}^d$ is a constant vector: it depends on $S$ but not on $y$. (In other words, $y^\perp = By$, where the projection matrix $B$ depends only on $S$.)

For each $1 \le i \le d$, define the function $V_i : \mathbb{R} \to \mathbb{R}^d$ by $V_i(x) = x b_i$. It follows that for any row $r$ in the join $J$,

$$V(r) = \sum_{i=1}^d V_i(r_i) = \sum_{i=1}^d r_i b_i = r^\perp.$$

Thus, $V(J) = \{r^\perp \mid r \in J\}$. From Lemma 1, there is a polynomial-time algorithm to compute

$$\max_{q \in V(J)} \|q\|_\infty = \max_{r \in J} \|r^\perp\|_\infty.$$

For any $y \in \mathbb{R}^d$, we have $\|y\|_\infty \le \|y\|_2 \le \sqrt{d}\, \|y\|_\infty$. Thus we obtain the main result of this section:

Theorem 3.

Suppose that for $1 \le k \le d$ we are given a set of unit vectors $S = \{s_1, s_2, \ldots, s_k\} \subseteq \mathbb{R}^d$ that are pairwise orthogonal. There is a polynomial-time algorithm to compute a $\sqrt{d}$-approximation to

$$\max_{r \in J} \left\| r - \sum_{i=1}^k (r \cdot s_i) s_i \right\|_2.$$

Next, we consider the algorithmic task of computing the row rank of the join $J = T_1 \bowtie T_2 \bowtie \cdots \bowtie T_m$ of the input acyclic database consisting of tables $T_1, T_2, \ldots, T_m$. That is, we want to compute the size of a maximal linearly independent set [13] of row vectors in $J$. An algorithm for this follows from Lemma 2.

Theorem 4.

There is a polynomial-time algorithm that, given an acyclic database as input, computes the row rank of the join $J$.

Proof.

We build an orthonormal basis one vector at a time. Using prior work on sum-product queries, we compute a vector $s_1' \in J$ that maximizes the square of the Euclidean norm. If $s_1'$ is the zero vector, the rank of $J$ is 0. Otherwise, let $s_1$ be the unit vector corresponding to $s_1'$, and we initialize $S = \{s_1\}$.

Suppose that we have computed a set $S = \{s_1, s_2, \ldots, s_j\}$ of unit vectors in the row space of $J$ that are pairwise orthogonal. Using Lemma 2, we compute $s_{j+1}' \in J$ that achieves

$$\max_{r \in J} \left\| r - \sum_{i=1}^j (r \cdot s_i) s_i \right\|_\infty.$$

If this maximum is 0, then the rank of $J$ is $j$ and $S$ is a basis for the row space of $J$. Otherwise, we add to $S$ the unit vector $s_{j+1}$ along the component of $s_{j+1}'$ orthogonal to $\mathrm{span}(S)$, and continue.
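On an explicit point set, the greedy loop of Theorem 4 looks as follows. This is a sketch with hypothetical names; for simplicity we maximize the Euclidean norm of the residual, whereas the theorem uses Lemma 2 and the $\ell_\infty$ norm, which equally well detects a nonzero residual:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def residual(p, basis):
    """Component of p orthogonal to span(basis); basis is orthonormal."""
    r = list(p)
    for s in basis:
        c = dot(r, s)
        r = [x - c * y for x, y in zip(r, s)]
    return r

def rank(points, eps=1e-9):
    """Row rank via greedy maximization of the residual norm."""
    basis = []
    while True:
        r_best = max((residual(p, basis) for p in points),
                     key=lambda r: dot(r, r))
        nrm = math.sqrt(dot(r_best, r_best))
        if nrm <= eps:          # every point lies in span(basis)
            return len(basis)
        basis.append([x / nrm for x in r_best])

pts = [(1, 0, 0), (0, 1, 0), (1, 1, 0), (2, 3, 0)]
```

Here `rank(pts)` is 2: all four points lie in the $xy$-plane. The relational algorithm replaces the `max` over `points` by the implicit maximization of Lemma 2.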

4 Maximum Distance to a k-Subspace: A PTAS

Assume we are given an acyclic database consisting of tables $T_1, T_2, \ldots, T_m$ and an orthonormal set of $k$ vectors $S = \{s_i \mid 1 \le i \le k\} \subseteq \mathbb{R}^d$, where (as we recall) the set of features in the database is $C = \{1, 2, \ldots, d\}$. We would like to compute the maximum distance of a row in the join $J = T_1 \bowtie T_2 \bowtie \cdots \bowtie T_m$ from the subspace spanned by $S$. The distance of a point $q \in \mathbb{R}^d$ from the subspace spanned by $S$ is $\mathrm{dist}(q, \mathrm{span}(S)) := \| q - \sum_{i=1}^k (q \cdot s_i) s_i \|_2$. Thus, our goal is to compute $\max_{q \in J} \mathrm{dist}(q, \mathrm{span}(S))$. Let $\hat{d}(q, \mathrm{span}(S)) := \|q\|_2^2 - \sum_{i=1}^k (q \cdot s_i)^2$. As $\hat{d}(q, \mathrm{span}(S)) = \mathrm{dist}^2(q, \mathrm{span}(S))$, we reformulate this as computing $D(J, \mathrm{span}(S)) := \max_{q \in J} \hat{d}(q, \mathrm{span}(S))$. The main result of this section is a polynomial-time approximation scheme (PTAS) for the problem for constant $k$.

We begin by showing that the Distance to k-Subspace problem is NP-hard even for k=1.

Theorem 5.

If there is a polynomial-time algorithm for computing $D(J, \mathrm{span}(\{p\}))$ given the input acyclic database and vector $p \in \mathbb{R}^d$, then P = NP.

Proof.

The proof is by reduction from the Partition problem. Given a set of positive integers $F = \{f_1, f_2, \ldots, f_d\}$, the goal here is to determine if $F$ can be partitioned into two parts $F'$ and $F''$ such that $\mathrm{diff}(F', F'') := \sum_{f \in F'} f - \sum_{f \in F''} f = 0$. Given such an input, we compute an acyclic database with tables $T_1, T_2, \ldots, T_{2d}$ and a vector $p$ as follows. The sequence of features is $c_1, v_1, c_2, v_2, \ldots, c_d, v_d, c_{d+1}$. For $1 \le i \le d$, the columns of table $T_{2i-1}$ are $c_i$ and $v_i$, and the columns of table $T_{2i}$ are $v_i$ and $c_{i+1}$. Each table has two rows, whose values are determined as follows.

     T1              T2
   c1    v1        v1    c2
   0     f1        f1    0
   0    -f1       -f1    0

     T3              T4
   c2    v2        v2    c3
   0     f2        f2    0
   0    -f2       -f2    0

   ...

    T_{2d-1}         T_{2d}
   cd    vd        vd    c_{d+1}
   0     fd        fd    0
   0    -fd       -fd    0

It is clear that the database is acyclic, and in fact, a tree that witnesses this is a path. We set $p = (1, 1, \ldots, 1) \in \mathbb{R}^{2d+1}$. Let $\ell(p) = \mathrm{span}(\{p\})$ denote the line through $p$. Let $J$ denote the join of the tables. If $D(J, \ell(p)) = \sum_{f \in F} f^2$, we declare that the input Partition instance has a solution; otherwise, we declare that it doesn't.

The reduction runs in polynomial time, and we now argue that it is correct. For any partition $(F', F'')$ of $F$, we define $\chi(F', F'')$ to be the vector $(c_1, v_1, \ldots, c_d, v_d, c_{d+1})$ where $c_i = 0$ for $1 \le i \le d+1$, and for each $1 \le i \le d$, $v_i = f_i$ if $f_i \in F'$ and $v_i = -f_i$ if $f_i \in F''$. Note that $\chi$ is a bijection from the set of partitions of $F$ to the set of rows of the join $J$. Furthermore, $\mathrm{diff}(F', F'') = \chi(F', F'') \cdot p$. Note that $\|p\|_2^2 = 2d+1$, and for any row $q \in J$, we have $\|q\|_2^2 = \sum_{f \in F} f^2$; all rows in $J$ have the same 2-norm. Thus, for any $q \in J$,

$$\hat{d}(q, \ell(p)) = \|q\|_2^2 - \frac{(q \cdot p)^2}{\|p\|_2^2} = \sum_{f \in F} f^2 - \frac{(q \cdot p)^2}{2d+1}.$$

We conclude that for any partition $(F', F'')$ of $F$,

$$\mathrm{diff}(F', F'') = 0 \iff \chi(F', F'') \cdot p = 0 \iff \hat{d}(\chi(F', F''), \ell(p)) = \sum_{f \in F} f^2.$$
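The reduction can be checked end-to-end on a tiny instance (hypothetical helper names; we enumerate the join directly, which the hardness result says cannot be done efficiently in general):

```python
from itertools import product

def reduction_rows(F):
    """Enumerate the join of the 2d tables: every sign pattern on the
    v-coordinates, with all c-coordinates pinned to 0."""
    rows = []
    for signs in product((1, -1), repeat=len(F)):
        q = []
        for s, f in zip(signs, F):
            q += [0, s * f]          # (c_i, v_i)
        q.append(0)                  # c_{d+1}
        rows.append(q)
    return rows

def max_sq_dist_to_line(rows, p):
    """D(J, line through p): max of ||q||^2 - (q . p)^2 / ||p||^2."""
    pp = sum(x * x for x in p)
    return max(sum(x * x for x in q)
               - sum(a * b for a, b in zip(q, p)) ** 2 / pp
               for q in rows)

F = [3, 1, 2]                        # 3 = 1 + 2, a yes-instance
p = [1] * (2 * len(F) + 1)
rows = reduction_rows(F)
has_partition = max_sq_dist_to_line(rows, p) == sum(f * f for f in F)
```

For `F = [3, 1, 2]` the row with signs (+, -, -) is orthogonal to `p`, so the maximum squared distance equals $\sum_f f^2 = 14$ and `has_partition` is true; for a no-instance such as `[1, 1, 3]` the maximum falls strictly below $\sum_f f^2$.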

Our PTAS for Distance to a $k$-Subspace has three major steps. Let $E = \{e_i \mid 1 \le i \le d\} \subseteq \mathbb{R}^d$ denote the unit vectors in the standard basis. We want to find a $k$-subset $E_k \subseteq E$ such that $\mathrm{span}(E_k)$ is "close to" $\mathrm{span}(S)$. This first step, along with the properties of $\mathrm{span}(E_k)$, is described in Section 4.1. In the second step (Section 4.2), we use $E_k$ to reduce the Distance to $k$-Subspace instance to polynomially many well-conditioned instances of the same problem. In the third step (Section 4.3), we use discretization and dynamic programming to solve a well-conditioned instance approximately in polynomial time. We conclude in Section 4.4 with an application of the PTAS to subspace approximation.

For a set $X \subseteq \mathbb{R}^d$, we denote the orthogonal complement of $X$ by $X^\perp$. That is, $X^\perp = \{y \in \mathbb{R}^d \mid y \cdot x = 0 \text{ for all } x \in X\}$.

4.1 Finding a Close k-Subspace

Our algorithm for computing $E_k$ is as follows. Let $\mathrm{proj}(a, \mathrm{span}(S))$ be the projection of a vector $a$ on the subspace spanned by $S$.

Algorithm 1 Compute $E_k$: Initialize $E_k \leftarrow \emptyset$. For $i = 1, \ldots, k$: let $W_i = \mathrm{span}(S) \cap [\mathrm{span}(\{p_1, \ldots, p_{i-1}\})]^\perp$; pick $e_i' \in E$ maximizing the norm of $p_i' := \mathrm{proj}(e_i', W_i)$; let $p_i$ be the unit vector along $p_i'$, and add $e_i'$ to $E_k$.

We state some useful properties of this algorithm:

Lemma 6.

(a) The set $\{p_1, p_2, \ldots, p_k\}$ is an orthonormal basis for $\mathrm{span}(S)$; (b) For any $1 \le j < i \le k$, we have $e_j' \cdot p_i = 0$; (c) We have $|E_k| = k$, that is, our algorithm never attempts to add a duplicate element to $E_k$; (d) For each $1 \le i \le k$, we have $e_i' \cdot p_i \ge \frac{1}{\sqrt{d}}$.

Proof.

Part (a) is evident. As for (b), fix $1 \le j < i \le k$. We have $p_i \cdot p_j = 0$, and $p_i \in W_j = \mathrm{span}(S) \cap [\mathrm{span}(\{p_1, \ldots, p_{j-1}\})]^\perp$. Write $e_j' = p_j' + (e_j' - p_j')$, where $p_j' = \mathrm{proj}(e_j', W_j)$ is parallel to $p_j$ and $e_j' - p_j'$ is orthogonal to $W_j$. Since $p_i \in W_j$ and $p_i \cdot p_j = 0$, we get $e_j' \cdot p_i = 0$.

For (c), we pick $e_i' \in E$ such that $\|p_i'\| = \|\mathrm{proj}(e_i', W_i)\|$ is maximum. Any unit vector $w \in W_i$ satisfies $\sum_{j=1}^d (e_j \cdot w)^2 = 1$, so some $e_j \in E$ has $\|\mathrm{proj}(e_j, W_i)\| \ge |e_j \cdot w| \ge \frac{1}{\sqrt{d}}$; hence $e_i' \cdot p_i = \|p_i'\| \ge \frac{1}{\sqrt{d}}$. From part (b), for all $1 \le j \le i-1$, we have $e_j' \cdot p_i = 0$, so $e_i' \notin \{e_j' \mid 1 \le j \le i-1\}$. Thus $|E_k| = k$. We have already shown part (d) in the process.
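Under our reading of the algorithm analyzed in Lemma 6, the greedy computation of $E_k$ can be sketched as follows, using pure-Python linear algebra and names of our choosing:

```python
import math

def dot(a, b): return sum(x * y for x, y in zip(a, b))
def scale(a, c): return [c * x for x in a]
def sub(a, b): return [x - y for x, y in zip(a, b)]
def norm(a): return math.sqrt(dot(a, a))

def proj_onto(v, onb):
    """Projection of v onto the span of the orthonormal list onb."""
    out = [0.0] * len(v)
    for w in onb:
        c = dot(v, w)
        out = [o + c * wi for o, wi in zip(out, w)]
    return out

def compute_Ek(S, d):
    """Greedy E_k: at step i, project every standard basis vector onto
    the part of span(S) orthogonal to p_1, ..., p_{i-1}, and keep the
    index whose projection is largest."""
    W = [list(s) for s in S]          # orthonormal basis of span(S)
    Ek, P = [], []
    for _ in range(len(S)):
        best, best_p = None, None
        for i in range(d):
            e = [1.0 if j == i else 0.0 for j in range(d)]
            pr = proj_onto(e, W)
            if best is None or norm(pr) > norm(best_p):
                best, best_p = i, pr
        p = scale(best_p, 1.0 / norm(best_p))
        Ek.append(best); P.append(p)
        # remove the p direction from W and re-orthonormalize
        newW = []
        for w in W:
            r = sub(w, scale(p, dot(w, p)))
            for u in newW:
                r = sub(r, scale(u, dot(r, u)))
            if norm(r) > 1e-9:
                newW.append(scale(r, 1.0 / norm(r)))
        W = newW
    return Ek, P

inv = 1 / math.sqrt(2)
S = [[inv, inv, 0.0, 0.0], [0.0, 0.0, inv, -inv]]   # orthonormal in R^4
Ek, P = compute_Ek(S, 4)
```

On this orthonormal pair the algorithm selects indices 0 and 2, and one can check that each selected coordinate of the corresponding $p_i$ is at least $1/\sqrt{d} = 1/2$, matching Lemma 6(d).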

The key property of the set Ek computed above is that the k-subspaces span(Ek) and span(S) are close in the sense of the following theorem. For other measures of closeness of subspaces, see [11].

Theorem 7.

For any unit vector $v \in \mathrm{span}(S)$, there exists a unit vector $u \in \mathrm{span}(E_k)$ such that $|v \cdot u| \ge \frac{1}{(\sqrt{d})^k 2^k}$. (We assume that the dimension $d \ge 36$ to get this simplified bound.)

Proof.

A high-level overview of the proof is that we construct a “best-response” unit vector uspan(Ek) for each unit vector vspan(S), depending on the relative magnitudes of components of v. Now we proceed to the formal proof.

We represent the unit vectors as $v = x_1 p_1 + x_2 p_2 + \cdots + x_k p_k$ and $u = y_1 e_1' + y_2 e_2' + \cdots + y_k e_k'$, where $\sum_{i=1}^k x_i^2 = \sum_{i=1}^k y_i^2 = 1$. Let $x = (x_1, x_2, \ldots, x_k)$ and $y = (y_1, y_2, \ldots, y_k)$ denote the corresponding unit vectors.

Let $P = [p_1\ p_2\ \cdots\ p_k] \in \mathbb{R}^{d \times k}$, $U = [e_1'\ e_2'\ \cdots\ e_k'] \in \mathbb{R}^{d \times k}$, and $A = P^T U$. We have $v \cdot u = x^T A y$. We have the following claim for $A$.

Claim 8.

$A$ is an upper triangular matrix. Each diagonal element $a_{ii} \ge \frac{1}{\sqrt{d}}$. For the non-diagonal elements $a_{ij}$, where $i < j$, we have $|a_{ij}| \le 1$.

Proof.

For $1 \le j < i \le k$, we have $a_{ij} = p_i \cdot e_j' = 0$ by part (b) of Lemma 6. For $1 \le i \le k$, we have $a_{ii} = p_i \cdot e_i' \ge \frac{1}{\sqrt{d}}$ by part (d). For the non-diagonal elements $a_{ij}$ with $i < j$: given $p_i$ is a unit vector, $|a_{ij}| = |p_i \cdot e_j'| \le 1$.

Let $z = x^T A$. So $v \cdot u = z \cdot y$ and

$$z = \left( a_{11} x_1,\ a_{12} x_1 + a_{22} x_2,\ \ldots,\ a_{1k} x_1 + a_{2k} x_2 + \cdots + a_{kk} x_k \right).$$

We want to pick a good vector $y$ such that $|z \cdot y|$ is large enough. Let $c$ be a small positive constant. We discuss different cases for $x$:

  • Case 1: $|x_1| \ge \frac{c}{(\sqrt{d})^{k-1}}$

  • Case 2: $|x_1| < \frac{c}{(\sqrt{d})^{k-1}}$, $|x_2| \ge \frac{2c}{(\sqrt{d})^{k-2}}$

  • Case $j$: $|x_1| < \frac{c}{(\sqrt{d})^{k-1}}$, $|x_2| < \frac{2c}{(\sqrt{d})^{k-2}}$, …, $|x_{j-1}| < \frac{2^{j-2} c}{(\sqrt{d})^{k-j+1}}$, $|x_j| \ge \frac{2^{j-1} c}{(\sqrt{d})^{k-j}}$

  • Case $k$: $|x_1| < \frac{c}{(\sqrt{d})^{k-1}}$, $|x_2| < \frac{2c}{(\sqrt{d})^{k-2}}$, …, $|x_{k-1}| < \frac{2^{k-2} c}{\sqrt{d}}$, $|x_k| \ge 2^{k-1} c$

In Case 1, we pick $y_1 = 1$ and leave all other components 0. Then $|z \cdot y| = |a_{11} x_1| \ge \frac{1}{\sqrt{d}} \cdot \frac{c}{(\sqrt{d})^{k-1}} = \frac{c}{(\sqrt{d})^k}$.

In Case 2, we pick $y_2 = 1$ and leave all other components 0. Then $|z \cdot y| = |a_{12} x_1 + a_{22} x_2| \ge a_{22} |x_2| - |a_{12} x_1| \ge \frac{1}{\sqrt{d}} \cdot \frac{2c}{(\sqrt{d})^{k-2}} - \frac{c}{(\sqrt{d})^{k-1}} = \frac{c}{(\sqrt{d})^{k-1}}$.

Similarly, in Case $j$, we pick $y_j = 1$ and leave all other components 0. Then

$$|z \cdot y| = |a_{1j} x_1 + a_{2j} x_2 + \cdots + a_{jj} x_j| \ge a_{jj} |x_j| - \sum_{i=1}^{j-1} |a_{ij} x_i| \ge \frac{2^{j-1} c}{(\sqrt{d})^{k-j+1}} - \sum_{i=1}^{j-1} \frac{2^{i-1} c}{(\sqrt{d})^{k-i}} \ge \frac{2^{j-1} c}{(\sqrt{d})^{k-j+1}} - \frac{2^{j-2} c}{(\sqrt{d})^{k-j+1}} \cdot \frac{1}{1 - \frac{2}{\sqrt{d}}} = \frac{2^{j-2} c}{(\sqrt{d})^{k-j+1}} \cdot \frac{\sqrt{d} - 4}{\sqrt{d} - 2}.$$

In Case $k$, we pick $y_k = 1$ and leave all other components 0. Then $|z \cdot y| = |a_{1k} x_1 + a_{2k} x_2 + \cdots + a_{kk} x_k| \ge a_{kk} |x_k| - \sum_{i=1}^{k-1} |a_{ik} x_i|$. Similar to Case $j$, the lower bound is $\frac{2^{k-2} c}{\sqrt{d}} \cdot \frac{\sqrt{d} - 4}{\sqrt{d} - 2}$.

Therefore, given that $d \ge 36$, the smallest lower bound for $|z \cdot y|$ occurs in Case 1, where $|z \cdot y| \ge \frac{c}{(\sqrt{d})^k}$. We now argue that at least one of the cases must hold: if none held, we would have $\sum_{i=1}^k x_i^2 < \sum_{i=1}^k \left( \frac{2^{i-1} c}{(\sqrt{d})^{k-i}} \right)^2$, so it suffices that

$$\sum_{i=1}^k \left( \frac{2^{i-1} c}{(\sqrt{d})^{k-i}} \right)^2 \le \sum_{i=1}^k x_i^2 = 1.$$

The inequality holds if $c = \frac{1}{2^k}$. Thus, $|v \cdot u| \ge \frac{c}{(\sqrt{d})^k} = \frac{1}{(\sqrt{d})^k 2^k}$.

For each $1 \le i \le k$, let $\beta_i$ denote the coordinate corresponding to $e_i'$, i.e., $e_i' = e_{\beta_i}$. Let $I = \{\beta_1, \beta_2, \ldots, \beta_k\} \subseteq [d]$ be the set of indices corresponding to the standard basis vectors in $E_k$. The following claim is shown by using observations made in the proof of Theorem 7.

Lemma 9.

For any $k$-dimensional vector $a = (a_1, a_2, \ldots, a_k) \in \mathbb{R}^k$, we can efficiently compute a point $q \in \mathrm{span}(S)$ such that $q_{\beta_i} = a_i$, for all $1 \le i \le k$.

Proof.

For any $1 \le i \le k$, let $w_i \in \mathbb{R}^k$ be $p_i$ restricted to the indices in $I$. Similarly, let $e_i'' \in \mathbb{R}^k$ be $e_i'$ restricted to the indices in $I$.

Let $M = [w_1\ w_2\ \cdots\ w_k] \in \mathbb{R}^{k \times k}$. We want to show that there exists a vector $x \in \mathbb{R}^k$ such that $Mx = a$. We have the following claim for $M$.

Claim 10.

rank(M)=k.

Proof.

Let $N = [e_1''\ e_2''\ \cdots\ e_k''] \in \mathbb{R}^{k \times k}$. We can observe that $p_i \cdot e_j' = w_i \cdot e_j''$ holds for $1 \le i, j \le k$, since $e_j'$ is supported on the indices in $I$. Thus, $M^T N = P^T U = A$. From Claim 8, we know $\mathrm{rank}(A) = k$. Thus $\mathrm{rank}(M) = k$.

Given that $M$ is a full rank square matrix, there exists a vector $x \in \mathbb{R}^k$ such that $Mx = a$. Then we can compute $q = Px$.

We conclude with the following implication of the closeness in Theorem 7 of span(S) and span(Ek).

Lemma 11.

Let $w$ be an arbitrary unit vector in the $(d-k)$-subspace $\mathrm{span}(E \setminus E_k)$. Let $w'$ be a unit vector along the projection of $w$ on $\mathrm{span}(S)$. Then $|w \cdot w'| \le \sqrt{1 - \frac{1}{R}}$, where $R = d^k 2^{2k}$.

Proof.

Let $v$ be the projection of $w'$ on the subspace spanned by $E_k$. Let $u = w' - v$. According to Theorem 7, $\|v\|_2 \ge \frac{1}{(\sqrt{d})^k 2^k}$, and hence $\|u\|_2 \le \sqrt{1 - \frac{1}{d^k 2^{2k}}}$.

Then, noting that $w \cdot v = 0$ as $w$ is orthogonal to $\mathrm{span}(E_k)$,

$$|w \cdot w'| = |w \cdot (u + v)| = |w \cdot u| \le \|w\|_2 \|u\|_2 \le \sqrt{1 - \frac{1}{d^k 2^{2k}}} = \sqrt{1 - \frac{1}{R}}.$$

4.2 Reduction to Well-Conditioned Instances

We will use the new orthonormal basis $\mathcal{P} = \{p_1, p_2, \ldots, p_k\}$ for $\mathrm{span}(S)$. Therefore we will use the notation $\mathrm{span}(\mathcal{P})$ for the subspace $\mathrm{span}(S)$ in the following sections.

Recall that $I \subseteq [d]$ is the set of indices corresponding to the standard basis vectors in $E_k$. For each $i \in I$, let $\alpha_i$ be the node in the join tree such that $i \in \hat{C}_{\alpha_i}$. See Section 2 for a reminder of these concepts. Let $\mathcal{A} = \{\alpha_i \mid i \in I\}$; we use $\mathcal{A} = \{\alpha_1, \alpha_2, \ldots, \alpha_{|\mathcal{A}|}\}$ to denote the elements of the set. We can write the join $J$ as

$$J = \bigcup_{r_1 \in T_{\alpha_1},\, r_2 \in T_{\alpha_2},\, \ldots,\, r_{|\mathcal{A}|} \in T_{\alpha_{|\mathcal{A}|}}} J^{r_1 r_2 \cdots r_{|\mathcal{A}|}}.$$

We therefore compute $D(J^{r_1 r_2 \cdots r_{|\mathcal{A}|}}, \mathrm{span}(\mathcal{P}))$ for each row combination $\{r_1, r_2, \ldots, r_{|\mathcal{A}|}\}$, and return the maximum of these quantities. The number of row combinations is $O(n^k)$, which is polynomial as $k$ is a constant. In the remainder of Section 4, we fix a single row combination $\{r_1, r_2, \ldots, r_{|\mathcal{A}|}\}$, and explain our approximation scheme for calculating $D(J^{r_1 r_2 \cdots r_{|\mathcal{A}|}}, \mathrm{span}(\mathcal{P}))$. We will continue to use $J$ to denote the new join. For each $1 \le i \le |\mathcal{A}|$, we delete all rows in the table $T_{\alpha_i}$ except $r_i$. Let $r$ denote the concatenation of $r_1, r_2, \ldots, r_{|\mathcal{A}|}$.

For any $q \in J$, we have $\hat{d}(q, \mathrm{span}(\mathcal{P})) = \|q\|_2^2 - \sum_{i=1}^k (q \cdot p_i)^2$. However, the individual terms $\|q\|_2^2$ and $\sum_{i=1}^k (q \cdot p_i)^2$ can be much larger than $\hat{d}(q, \mathrm{span}(\mathcal{P}))$. We therefore apply a translation that preserves $\hat{d}(\cdot, \mathrm{span}(\mathcal{P}))$ but makes these two terms small.

Let $\bar{r} \in \mathbb{R}^k$ be the vector formed from $r$ by restricting to the indices in $I$. Using Lemma 9, we compute a vector $h \in \mathrm{span}(\mathcal{P})$ such that $h_I = \bar{r}$. We apply the translation $q' = q - h$. As we translate by a vector in $\mathrm{span}(\mathcal{P})$, the distance to $\mathrm{span}(\mathcal{P})$ is preserved:

Lemma 12.

For any $q \in J$, $\hat{d}(q - h, \mathrm{span}(\mathcal{P})) = \hat{d}(q, \mathrm{span}(\mathcal{P}))$.

Also, we can observe that for any $q \in J$, the translated row $q' = q - h$ satisfies $q'_I = q_I - h_I = \mathbf{0}$.

To effect the translation, we have to modify the tables. Thus, for each node $\alpha$ in the join tree, for each row $r \in T_\alpha$ and each feature $i$ of $T_\alpha$, we set $r_i^{\mathrm{new}} = r_i - h_i$. We henceforth refer to $r_i^{\mathrm{new}}$ as just $r_i$ and $q'$ as $q$.

Lemma 13.

After the modification of tables, for any $q \in J$, $q \in \mathrm{span}(E \setminus E_k)$ as $q_I = \mathbf{0}$. Therefore by Lemma 11, $\|q\|_2^2 \le R \cdot \hat{d}(q, \mathrm{span}(\mathcal{P}))$.

This bound on $\|q\|_2^2$ is what we mean by saying that the join $J$ is well-conditioned.

4.3 Algorithm for a Well-Conditioned Instance

We can use dynamic programming over the join tree to explicitly compute the set $\{\hat{d}(q, \mathrm{span}(\mathcal{P})) \mid q \in J\}$. However, this set can have exponential size. Our approximation algorithm will therefore work with a rounded version of $\hat{d}(q, \mathrm{span}(\mathcal{P}))$. We observe that

$$\hat{d}(q, \mathrm{span}(\mathcal{P})) = \|q\|_2^2 - \sum_{i=1}^k (q \cdot p_i)^2 = \sum_{j=1}^d q_j^2 - \sum_{i=1}^k \Big(\sum_{j=1}^d p_{ij} q_j\Big)^2 \tag{1}$$

where $p_{ij}$ is the $j$-th coordinate of $p_i$. By Lemma 13, we know that $\|q\|_2^2 \leq R \cdot \hat{d}(q, \mathrm{span}(\mathcal{P}))$ for any $q \in J$; the join is well-conditioned.
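Equation 1 is the Pythagorean identity: the squared distance to $\mathrm{span}(\mathcal{P})$ is the squared norm of $q$ minus the squared norm of its projection onto $\mathrm{span}(\mathcal{P})$. A small check with made-up values (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 5, 2
P = np.linalg.qr(rng.normal(size=(d, k)))[0]  # orthonormal p_1, ..., p_k as columns
q = rng.normal(size=d)

# Right-hand side of Equation 1.
dhat = np.sum(q**2) - sum((P[:, i] @ q) ** 2 for i in range(k))

# Direct squared distance: subtract the orthogonal projection onto span(P).
proj = P @ (P.T @ q)
assert abs(dhat - np.sum((q - proj) ** 2)) < 1e-9
```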

Let $\beta = \max_{q \in J} \|q\|_2$. Consider a node $\alpha$ of the join tree such that its table $T_\alpha$ contains a row $r$ with an entry whose absolute value exceeds $\beta$. Clearly, such a row $r$ does not contribute to the eventual join $J$, so we delete such rows. Let $\eta = 4R(k+1)d^2/\epsilon$, where $\epsilon > 0$ is the error parameter for the approximation scheme, and let $\gamma = \beta/\eta$. Notice from Equation 1 that $\hat{d}(q, \mathrm{span}(\mathcal{P}))$ is built up from terms of the form $p_{ij} q_j$ and $q_j$. For $x \in \mathbb{R}$, let $[x] = \mathrm{sign}(x) \lfloor |x|/\gamma \rfloor$. We will use $\gamma [p_{ij} q_j]$ and $\gamma [q_j]$ as our rounded versions of $p_{ij} q_j$ and $q_j$, respectively; essentially, we round to the nearest integer multiple of $\gamma$ that makes the absolute value smaller. The rounded version we use for $\hat{d}(q, \mathrm{span}(\mathcal{P}))$ will then be

$$\mathrm{apxdist}(q, \mathrm{span}(\mathcal{P})) := \gamma^2 \sum_{j=1}^d [q_j]^2 - \gamma^2 \sum_{i=1}^k \Big(\sum_{j=1}^d [p_{ij} q_j]\Big)^2 \tag{2}$$
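The rounding operator and Equation 2 can be sketched as follows (the instance is made up; the asserted error bound is the per-term bound $2\gamma\beta$ from Section 4.3.5, summed over the $(k+1)d^2$ terms):

```python
import numpy as np

def rnd(x, gamma):
    """[x]: round x toward zero to an integer multiple of gamma, returned as an integer."""
    return int(np.sign(x) * np.floor(abs(x) / gamma))

def apxdist(q, P, gamma):
    """Rounded version of Equation 1, as in Equation 2."""
    d, k = P.shape
    a = sum(rnd(qj, gamma) ** 2 for qj in q)
    b = [sum(rnd(P[j, i] * q[j], gamma) for j in range(d)) for i in range(k)]
    return gamma**2 * a - gamma**2 * sum(bi**2 for bi in b)

rng = np.random.default_rng(2)
d, k = 4, 2
P = np.linalg.qr(rng.normal(size=(d, k)))[0]
q = rng.normal(size=d)

# For this single row, ||q||_2 bounds every |q_j| and |p_ij q_j|,
# playing the role of beta in the paper.
beta = np.linalg.norm(q)
gamma = beta / 100.0  # i.e. eta = 100, chosen for illustration
exact = np.sum(q**2) - sum((P[:, i] @ q) ** 2 for i in range(k))
assert abs(apxdist(q, P, gamma) - exact) <= (k + 1) * d * d * 2 * gamma * beta
```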

We now describe our dynamic programming algorithm that uses the $[p_{ij} q_j]$'s to compute an approximation to $D(J, \mathrm{span}(\mathcal{P}))$.

4.3.1 Data Structures for the DP

For a node $\alpha$ in the join tree and a row $q \in J_\alpha$, let $A_\alpha(q) = \sum_{j \in C_\alpha} [q_j]^2$, and $B_\alpha(q) = (B_\alpha(q)[1], B_\alpha(q)[2], \ldots, B_\alpha(q)[k])$, where $B_\alpha(q)[i] = \sum_{j \in C_\alpha} [p_{ij} q_j]$ for $1 \leq i \leq k$. Note that if $\alpha$ is the root of the join tree, then $\gamma^2 A_\alpha(q) - \gamma^2 \sum_{i=1}^k (B_\alpha(q)[i])^2$ is $\mathrm{apxdist}(q, \mathrm{span}(\mathcal{P}))$ (Equation 2).

For each node $\alpha$ in the join tree $\tau$, our algorithm stores an array $\alpha.M[\,]$ indexed by the rows in table $T_\alpha$. For a row $r \in T_\alpha$, the entry $\alpha.M[r]$ will eventually store the set

$$\{(A_\alpha(q), B_\alpha(q)) \mid q \in J_\alpha^r\}$$

That is, a signature $(A_\alpha(q), B_\alpha(q))$ is stored in $\alpha.M[r]$ for each row $q$ in the join $J_\alpha^r$. Using the fact that the join is well-conditioned, we show later that the size of this set of signatures is bounded by a polynomial, even though the join itself can have an exponential number of rows.

4.3.2 The DP

To fill in the $\alpha.M[\,]$ arrays, the algorithm does a post-order traversal of the join tree $\tau$, processing each node $\alpha$ of $\tau$ in that order.

  • If $\alpha$ is a leaf node, then for each row $r$ in $T_\alpha$, we set $\alpha.M[r]$ to $\{(A_\alpha(r), B_\alpha(r))\}$. Note that $A_\alpha(r) = \sum_{j \in \hat{C}_\alpha} [r_j]^2$, and $B_\alpha(r) = (B_\alpha(r)[1], B_\alpha(r)[2], \ldots, B_\alpha(r)[k])$, where $B_\alpha(r)[i] = \sum_{j \in \hat{C}_\alpha} [p_{ij} r_j]$. If $\hat{C}_\alpha = \emptyset$, then the above summations evaluate to $0$, and $\alpha.M[r] = \{(0, [0, 0, \ldots, 0])\}$.

  • If $\alpha$ is an internal node, let $\lambda$ and $\rho$ denote its left and right child. For each row $r \in T_\alpha$, we initialize $\alpha.M[r] = \emptyset$ and compute $\alpha.M[r]$ as follows:

    Algorithm 2 Compute α.M[r].
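The body of Algorithm 2 is not reproduced above; the following Python sketch of its combine step is reconstructed from the description in Sections 4.3.1 and 4.3.3 (function and variable names are ours, not the paper's). For each pair of child rows $r_\lambda, r_\rho$ consistent with $r$, the DP would add every combined signature from $\lambda.M[r_\lambda] \times \rho.M[r_\rho]$ to $\alpha.M[r]$:

```python
def combine(sigs_lambda, sigs_rho, delta_a, delta_b):
    """Combine the signature sets of the two children with the contribution
    of row r's own features: delta_a is the sum of [r_j]^2 over j in C^_alpha,
    and delta_b[i] is the sum of [p_ij r_j] over j in C^_alpha."""
    out = set()
    for a_l, b_l in sigs_lambda:
        for a_r, b_r in sigs_rho:
            a = a_l + a_r + delta_a
            b = tuple(x + y + z for x, y, z in zip(b_l, b_r, delta_b))
            out.add((a, b))
    return out

# Toy example with k = 2: two signatures from the left child, one from the right.
left = {(1, (2, 0)), (4, (1, 1))}
right = {(0, (0, 3))}
assert combine(left, right, delta_a=2, delta_b=(1, -1)) == {(3, (3, 2)), (6, (2, 3))}
```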

After the DP completes, we use the array $\alpha.M[\,]$ at the root $\alpha$ of the join tree to output this estimate for $D(J, \mathrm{span}(\mathcal{P}))$:

$$\max_{r \in T_\alpha} \; \max_{(a, b) \in \alpha.M[r]} \Big( \gamma^2 a - \gamma^2 \sum_{i=1}^k b[i]^2 \Big)$$

4.3.3 Correctness of the DP

Let $\alpha$ be any node of the join tree, and $r$ a row in the table $T_\alpha$. We argue that for any row $q$ in the join $J_\alpha^r$, $(A_\alpha(q), B_\alpha(q)) \in \alpha.M[r]$ after the DP completes processing node $\alpha$. If $\alpha$ is a leaf, this is obvious. Assume that $\alpha$ is an internal node. Let $\lambda$ and $\rho$ be the left and right child of $\alpha$, respectively. There is exactly one row $r_\lambda \in T_\lambda$ (resp. $r_\rho \in T_\rho$) that is consistent with $q$. Clearly, $r_\lambda$ and $r_\rho$ are consistent with $r$. Thus, $q$ is a concatenation of 3 rows: (a) a row $q_\lambda \in J_\lambda^{r_\lambda}$; (b) a row $q_\rho \in J_\rho^{r_\rho}$; and (c) $r$ itself.

Furthermore,

$$A_\alpha(q) = A_\lambda(q_\lambda) + A_\rho(q_\rho) + \Delta_A,$$

where $\Delta_A = \sum_{j \in \hat{C}_\alpha} [r_j]^2$. And

$$B_\alpha(q)[i] = B_\lambda(q_\lambda)[i] + B_\rho(q_\rho)[i] + \Delta_B[i],$$

where $\Delta_B[i] = \sum_{j \in \hat{C}_\alpha} [p_{ij} r_j]$. Thus the pair $(A_\alpha(q), B_\alpha(q))$ is added to $\alpha.M[r]$ by the DP when it examines the rows $r_\lambda \in T_\lambda$ and $r_\rho \in T_\rho$.

Conversely, we can also argue that if the pair $(a, b)$ is added by the DP to $\alpha.M[r]$, then $a = A_\alpha(q)$ and $b = B_\alpha(q)$ for some $q \in J_\alpha^r$.

4.3.4 Running Time

We now show that our algorithm runs in polynomial time. Consider the term $p_{ij} r_j$ for some $1 \leq i \leq k$ and $1 \leq j \leq d$, where $r$ is some row of some table that contains $j$ as a feature. Given that $p_i$ is a unit vector, $|p_{ij} r_j| \leq |r_j| \leq \max_{q \in J} \|q\|_2 = \beta$. As $|[p_{ij} r_j]| \leq |p_{ij} r_j| / \gamma$ and $\gamma = \beta/\eta$, we have

$$|[p_{ij} r_j]| \leq |p_{ij} r_j| / \gamma \leq \beta / \gamma = \eta$$

Note that $[p_{ij} r_j]$ is an integer. Now fix a node $\alpha$ in the join tree and consider any row $q \in J_\alpha$. It follows that for each $i \in [k]$, $B_\alpha(q)[i] = \sum_{j \in C_\alpha} [p_{ij} q_j]$ is an integer whose absolute value is at most $d\eta$. Similarly, $A_\alpha(q) = \sum_{j \in C_\alpha} [q_j]^2$ is an integer whose absolute value is at most $d^2 \eta^2$. The cardinality of the set $\{(A_\alpha(q), B_\alpha(q)) \mid q \in J_\alpha\}$ is therefore $O(d^{k+2} \eta^{k+2})$. Thus, the space used by the $\alpha.M[\,]$ array is $O(n \cdot d^{k+2} \eta^{k+2})$. Given that $\eta = 4R(k+1)d^2/\epsilon$, it is now easy to see that the overall running time is a polynomial.
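To make the parameter arithmetic concrete, here is the signature-count bound traced for illustrative (made-up) values of $k$, $d$, $R$, and $\epsilon$:

```python
# Illustrative parameter values, not taken from the paper.
k, d, R, eps = 2, 10, 16, 0.5

eta = 4 * R * (k + 1) * d * d / eps  # eta = 4R(k+1)d^2 / eps

# Each rounded term [p_ij q_j] is an integer with absolute value at most eta,
# so each B_alpha(q)[i] lies in [-d*eta, d*eta] and A_alpha(q) in [0, d^2*eta^2].
# The number of distinct signatures is therefore O(d^{k+2} * eta^{k+2}).
signature_bound = (d**2 * eta**2 + 1) * (2 * d * eta + 1) ** k

assert eta == 38400.0
assert signature_bound > 0
```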

4.3.5 Approximation Guarantee

Fix $q \in J$. We note that for all $1 \leq i \leq k$ and $1 \leq j \leq d$, $|p_{ij} q_j - \gamma [p_{ij} q_j]| \leq \gamma$ by design. Now, for $1 \leq i \leq k$ and $1 \leq j, j' \leq d$, we have

$$\begin{aligned} |(p_{ij} q_j)(p_{ij'} q_{j'}) - \gamma^2 [p_{ij} q_j][p_{ij'} q_{j'}]| &\leq |(p_{ij} q_j)(p_{ij'} q_{j'}) - \gamma [p_{ij} q_j](p_{ij'} q_{j'})| \\ &\quad + |\gamma [p_{ij} q_j](p_{ij'} q_{j'}) - \gamma^2 [p_{ij} q_j][p_{ij'} q_{j'}]| \\ &\leq \gamma |p_{ij'} q_{j'}| + \gamma |\gamma [p_{ij} q_j]| \\ &\leq 2\gamma\beta = 2\beta^2/\eta \end{aligned}$$

Here, the last inequality holds because we only round the $p_{ij} q_j$ toward zero, so $|\gamma [p_{ij} q_j]| \leq |p_{ij} q_j| \leq \beta$. Now, $\hat{d}(q, \mathrm{span}(\mathcal{P}))$ (Equation 1) and $\mathrm{apxdist}(q, \mathrm{span}(\mathcal{P}))$ (Equation 2) each have $(k+1)d^2$ corresponding terms, and the difference in absolute value between corresponding terms is bounded by $2\beta^2/\eta$ as above. Thus,

$$|\hat{d}(q, \mathrm{span}(\mathcal{P})) - \mathrm{apxdist}(q, \mathrm{span}(\mathcal{P}))| \leq 2(k+1)d^2 \cdot \frac{\beta^2}{\eta} = \frac{\epsilon}{2R} \beta^2 \leq \frac{\epsilon}{2} D(J, \mathrm{span}(\mathcal{P})) \tag{3}$$

Here the last step is due to Lemma 13. Let $q^* \in J$ satisfy $\hat{d}(q^*, \mathrm{span}(\mathcal{P})) = D(J, \mathrm{span}(\mathcal{P}))$. The algorithm returns $\mathrm{apxdist}(q', \mathrm{span}(\mathcal{P}))$ for a row $q' \in J$ such that

$$\begin{aligned} \hat{d}(q', \mathrm{span}(\mathcal{P})) &\geq \mathrm{apxdist}(q', \mathrm{span}(\mathcal{P})) - \frac{\epsilon}{2} D(J, \mathrm{span}(\mathcal{P})) \\ &\geq \mathrm{apxdist}(q^*, \mathrm{span}(\mathcal{P})) - \frac{\epsilon}{2} D(J, \mathrm{span}(\mathcal{P})) \\ &\geq \hat{d}(q^*, \mathrm{span}(\mathcal{P})) - \epsilon D(J, \mathrm{span}(\mathcal{P})) = (1 - \epsilon) D(J, \mathrm{span}(\mathcal{P})) \end{aligned}$$
Theorem 14.

There is a polynomial-time algorithm that, given an acyclic database with tables $T_1, T_2, \ldots, T_m$ over a total of $d$ features, a set of orthogonal unit vectors $\mathcal{P} = \{p_1, p_2, \ldots, p_k\}$ (where $k$ is a constant), and a parameter $0 < \epsilon < 1$, computes a $(1 - \epsilon)$-approximation to the farthest distance of a row in the join from the subspace spanned by $\mathcal{P}$.

4.4 Subspace Approximation

Theorem 14 can be used as a primitive to compute a $k$-subspace that approximately minimizes the maximum Euclidean distance of a row in the join to the $k$-subspace. This problem is well known in the literature as $\ell_\infty$ subspace approximation; see [14] for references. Consider the following algorithm:

It follows from (the proof of) Lemma 5.2 of [8] that for the set $S$ computed by the algorithm and sufficiently small $0 < \epsilon < 1$, we have $D(J, \mathrm{span}(S)) \leq (2(1+\epsilon))^{2k} D(J, \mathcal{F})$ for any $k$-subspace $\mathcal{F}$. That is, for constant $k$, we get an algorithm that runs in time polynomial in the size of the acyclic database and returns a $(2(1+\epsilon))^k$-approximation for the $\ell_\infty$ subspace approximation problem.

References

  • [1] Mahmoud Abo Khamis, Ryan R. Curtin, Benjamin Moseley, Hung Q. Ngo, XuanLong Nguyen, Dan Olteanu, and Maximilian Schleich. On functional aggregate queries with additive inequalities. In Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS ’19, pages 414–431, New York, NY, USA, 2019. Association for Computing Machinery. doi:10.1145/3294052.3319694.
  • [2] Mahmoud Abo Khamis, Hung Q. Ngo, and Atri Rudra. FAQ: Questions asked frequently. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS ’16, pages 13–28, New York, NY, USA, 2016. Association for Computing Machinery. doi:10.1145/2902251.2902280.
  • [3] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’07, pages 1027–1035, USA, 2007. Society for Industrial and Applied Mathematics. URL: http://dl.acm.org/citation.cfm?id=1283383.1283494.
  • [4] Jiaxiang Chen, Qingyuan Yang, Ruomin Huang, and Hu Ding. Coresets for relational data and the applications. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2024. Curran Associates Inc.
  • [5] Aryan Esmailpour and Stavros Sintos. Improved approximation algorithms for relational clustering. Proceedings of the ACM on Management of Data, 2(5):1–27, 2024. doi:10.1145/3695831.
  • [6] Michael R. Garey and David S. Johnson. Computers and Intractability; A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., USA, 1990.
  • [7] Martin Grohe. The structure of tractable constraint satisfaction problems. In Proceedings of the 31st International Conference on Mathematical Foundations of Computer Science, MFCS’06, pages 58–72, Berlin, Heidelberg, 2006. Springer-Verlag. doi:10.1007/11821069_5.
  • [8] Sariel Har-Peled and Kasturi R. Varadarajan. High-dimensional shape fitting in linear time. Discret. Comput. Geom., 32(2):269–288, 2004. doi:10.1007/S00454-004-1118-2.
  • [9] Mahmoud Abo Khamis, Sungjin Im, Benjamin Moseley, Kirk Pruhs, and Alireza Samadian. Approximate aggregate queries under additive inequalities. In Michael Schapira, editor, 2nd Symposium on Algorithmic Principles of Computer Systems, APOCS 2020, Virtual Conference, January 13, 2021, pages 85–99. SIAM, 2021. doi:10.1137/1.9781611976489.7.
  • [10] Dániel Marx. Tractable hypergraph properties for constraint satisfaction and conjunctive queries. J. ACM, 60(6), November 2013. doi:10.1145/2535926.
  • [11] Jianming Miao and Adi Ben-Israel. On principal angles between subspaces in ℝⁿ. Linear Algebra and its Applications, 171:81–98, 1992. doi:10.1016/0024-3795(92)90251-5.
  • [12] Benjamin Moseley, Kirk Pruhs, Alireza Samadian, and Yuyan Wang. Relational algorithms for k-means clustering. In Nikhil Bansal, Emanuela Merelli, and James Worrell, editors, 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021, July 12-16, 2021, Glasgow, Scotland (Virtual Conference), volume 198 of LIPIcs, pages 97:1–97:21. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.ICALP.2021.97.
  • [13] Gilbert Strang. Linear Algebra and Its Applications, 3rd Edition. Harcourt, Inc., 1988.
  • [14] David P Woodruff and Taisuke Yasuda. New subset selection algorithms for low rank approximation: Offline and online. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 1802–1813, 2023. doi:10.1145/3564246.3585100.