
Relational Approximations for Subspace Primitives

Xiang Liu, Department of Computer Science, University of Iowa, Iowa City, IA, USA
Kasturi Varadarajan, Department of Computer Science, University of Iowa, Iowa City, IA, USA
Abstract

We explore fundamental geometric computations on point sets that are given to the algorithm implicitly. In particular, we are given a database which is a collection of tables with numerical values, and the geometric computation is to be performed on the join of the tables. Explicitly computing this join can take time exponential in the size of the tables. We are therefore interested in geometric problems that can be solved by algorithms whose running time is a polynomial in the size of the tables. Such relational algorithms are typically not able to explicitly compute the join.

To avoid the NP-completeness bottleneck, researchers assume that the tables have a tractable combinatorial structure, like being acyclic. Even with this assumption, simple geometric computations turn out to be non-trivial and sometimes intractable. In this article, we study the problem of computing the maximum distance of a point in the join to a given subspace, and develop approximation algorithms for this NP-hard problem.

Keywords and phrases:
relational algorithm, Euclidean distance, subspace approximation
Category:
APPROX
Copyright and License:
© Xiang Liu and Kasturi Varadarajan; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Computational geometry
Acknowledgements:
The authors thank Kirk Pruhs for discussions that led to the problems studied here, and the organizers of the 2023 Workshop on Fine-Grained Complexity, Logic, and Query Evaluation at the Simons Institute for the Theory of Computing for facilitating these discussions.
Editors:
Alina Ene and Eshan Chattopadhyay

1 Introduction

In this paper, we consider certain fundamental geometric computations that are trivially solvable in linear or quadratic time in the size of the input point set. In our setting, the input point set is not given explicitly, but rather in an implicit form as described below. An explicit representation can be exponentially larger than the implicit one, and the question we pursue is whether these geometric computations can be performed in time polynomial in the size of the implicit representation.

We are given a set $\{C_1, C_2, \ldots, C_m\}$ where each $C_j \subseteq \mathbb{Z}^+$ is a subset of the positive integers. We interpret each element of $C_j$ as a coordinate axis or feature. Furthermore, for each $1 \le j \le m$, we are given a table (two-dimensional matrix) $T_j$ with at least one row and exactly $|C_j|$ columns, where each column corresponds to a feature in $C_j$. The values in the table $T_j$ are real numbers. Given a row $r$ in $T_j$, we let $r[i]$ (or $r_i$) denote the value corresponding to feature $i \in C_j$. We assume tables don't have duplicate rows, and thus a table specifies a set of row vectors.

We define the join $T_1 \bowtie T_2 \bowtie \cdots \bowtie T_m$ of $m \ge 1$ such tables as a table $J$ whose columns correspond to the union $C = \bigcup_{j=1}^m C_j$ of the individual feature sets. Thus each row in the join is a vector with $|C|$ real components. Such a vector $q$ belongs to the join $J$ if in each table $T_j$, there is a row $r^j$ such that $q$ and $r^j$ agree on the features in $C_j$: for each $i \in C_j$, $q[i] = r^j[i]$.

Thus, a row $q$ in the join $J$ is generated by picking rows $r^1 \in T_1, r^2 \in T_2, \ldots, r^m \in T_m$ that are pairwise compatible, that is, agreeing on the value on common features; and then "concatenating" these rows. That is, for each feature $i \in C$, we find an arbitrary $C_j$ containing feature $i$, and let $q[i] := r^j[i]$. The join $J$ consists of the set of all rows that can be generated in this way. An example of a join with feature set $\{a, b, c\}$ is shown below:

     T1          T2        J = T1 ⋈ T2
    a  b        b  c         a  b  c
    1  1        1  2         1  1  2
    1  2        2  3         2  1  2
    2  1        5  6         1  2  3
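The brute-force construction of the join described above can be sketched as follows. This is a toy implementation for illustration only (function and variable names are ours); a relational algorithm would of course avoid this explicit enumeration:

```python
from itertools import product

def join(tables):
    """Natural join of tables, each given as (features, rows).

    features: tuple of feature names; rows: list of value tuples.
    Returns (sorted feature list, set of join rows). Brute force:
    exponential in the number of tables, unlike a relational algorithm.
    """
    all_feats = sorted({f for feats, _ in tables for f in feats})
    out = set()
    for combo in product(*[rows for _, rows in tables]):
        q, ok = {}, True
        for (feats, _), row in zip(tables, combo):
            for f, v in zip(feats, row):
                if q.setdefault(f, v) != v:   # compatibility check
                    ok = False
                    break
            if not ok:
                break
        if ok:
            out.add(tuple(q[f] for f in all_feats))
    return all_feats, out

# The example above: T1 over (a, b), T2 over (b, c).
T1 = (("a", "b"), [(1, 1), (1, 2), (2, 1)])
T2 = (("b", "c"), [(1, 2), (2, 3), (5, 6)])
feats, J = join([T1, T2])
```

Note that the row (5, 6) of T2 is incompatible with every row of T1 and so contributes nothing to the join.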

A relational database stores its data in the form of multiple tables as above. However, typical data analysis techniques such as clustering only work when the input data is in explicit form, as in the join $J$. Standard practice is thus to compute the join, and then run the data analysis algorithm on it. Let $d = |C|$ denote the total number of features and $n$ the maximum number of rows in any given table. The space needed to store all the tables is easily upper bounded by $O(mnd)$, where we recall that $m$ is the number of tables. However, the number of rows in the join $J = T_1 \bowtie T_2 \bowtie \cdots \bowtie T_m$ is $\Theta(n^m)$ in the worst case, which can be exponentially larger. This naturally raises the following question: what properties of the join $J$ can we compute using relational algorithms, defined as algorithms with running time polynomial in $n$, $m$, and $d$? See [2, 1, 9, 12] and the references therein. Notice that a relational algorithm typically is not able to explicitly construct the join.

Before describing prior work on relational algorithms, we point out the geometric nature of the join. The rows of each table $T_j$ can be viewed as a set of points lying in the subspace spanned by the features/coordinate axes in $C_j$. The join $J = T_1 \bowtie T_2 \bowtie \cdots \bowtie T_m$ is the set of all points $p$, in the subspace spanned by features in $C$, such that the projection of $p$ onto the subspace spanned by $C_j$ is an element of $T_j$, for each $j$.

1.1 Prior Work

Perhaps the most basic algorithmic question to ask about the join J of the tables T1,T2,,Tm is whether it is non-empty – does it have at least one row/point in it? This turns out to be NP-Complete [7, 10, 9]. Indeed, a simple reduction from 3CNF-SAT has the property that the number of rows in the join is exactly the number of satisfying assignments to the input formula. (This is a good spot to emphasize that in this paper, the number d of features and the number m of tables are not viewed as constants.)

Given this state of affairs, the approach taken by papers in this area is to assume that the join has some additional structure that makes such basic algorithmic tasks tractable. The most common assumption is that the join is acyclic – this is a combinatorial assumption about how the features are distributed across the tables, and we define it in the next section. Assuming acyclic joins lets us use dynamic programming to solve certain algorithmic questions in polynomial time [2, 1]:

  1. Compute the number of rows in the join $J$.

  2. Compute the maximum of the squared $\ell_2$ norm of rows in the join, that is, $\max_{q \in J} \sum_{i \in C} q_i^2$.

These algorithmic tasks are special cases of a sum-product query. This is a query $Q(J)$ over the join $J$ of the form

$$Q(J) = \bigoplus_{q \in J} \bigotimes_{i \in C} F_i(q_i),$$

where $(R, \oplus, \otimes)$ is a commutative semiring over a set $R$. Corresponding to feature $i \in C$, $F_i : \mathbb{R} \to R$ is an easy to compute function with range $R$. For example, for the query asking for the number of rows in the join, $R$ is the set of integers, $F_i$ is the constant function that takes on the value 1, $\otimes$ is multiplication over the integers, and $\oplus$ is addition over the integers. For computing the maximum of the squared $\ell_2$ norm of the rows, $R$ is the set of real numbers, $F_i(x) = x^2$, $\otimes$ is addition over the reals, and $\oplus$ is the max function.
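The semiring formulation can be made concrete by evaluating both example queries over a small explicit join. This is a brute-force sketch with illustrative names; a relational algorithm would instead push the ⊕ and ⊗ operators through a join tree:

```python
def sum_product(J, F, oplus, otimes, one):
    """Evaluate Q(J) = oplus_{q in J} otimes_{i in C} F(q_i).

    `one` is the identity of otimes. Brute force over an explicit join.
    """
    acc = None
    for q in J:
        term = one
        for x in q:
            term = otimes(term, F(x))
        acc = term if acc is None else oplus(acc, term)
    return acc

J = [(1, 1, 2), (2, 1, 2), (1, 2, 3)]  # the join from the introduction

# Count of rows: R = integers, F_i = 1, otimes = *, oplus = +.
count = sum_product(J, lambda x: 1, lambda a, b: a + b, lambda a, b: a * b, 1)

# Max squared l2 norm: R = reals, F_i(x) = x^2, otimes = +, oplus = max.
max_sq = sum_product(J, lambda x: x * x, max, lambda a, b: a + b, 0)
```

Here `count` comes out to 3 and `max_sq` to 14, the squared norm of the row (1, 2, 3).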

The authors in [9] study sum-product queries under constraints on the rows of the join. They allow constraints of the form $\sum_{i \in C} g_i(q_i) \le \beta$, where $g_i : \mathbb{R} \to \mathbb{R}$ is an easy-to-compute function, and $\beta$ is a constant. They consider the problem of evaluating a sum-product query over those rows in the join that satisfy the given set of constraints. They show that with two constraints, such problems are not only NP-hard but also hard to approximate to within any multiplicative factor. Concretely, they show it is NP-hard to determine if there is a row $q$ in the join $J$ that satisfies the linear constraint $\sum_i q_i = 0$. This is via a reduction from the Partition problem, where we want to determine if a given set of positive integers can be partitioned into two sets whose sums are equal. Thus, it is a weak NP-hardness result [6]. The linear constraint $\sum_i q_i = 0$ is equivalent to the two linear inequality constraints $\sum_i q_i \ge 0$ and $\sum_i q_i \le 0$.

On the other hand, the authors in [9] show that with just one inequality constraint, and some additional technical assumptions, there is a randomized approximation scheme for sum-product queries that guarantees a multiplicative $(1+\epsilon)$-approximation, for any given parameter $\epsilon > 0$. For instance, they use this general result to get a polynomial-time $(1+\epsilon)$-approximation for the following counting problems:

  1. Count the number of rows in the join that lie in a given halfspace $\sum_i a_i q_i \le \beta$.

  2. Count the number of rows in the join that lie within a given ball $\sum_i (q_i - y_i)^2 \le r^2$ (centered at point $y$ and having radius $r \ge 0$).

The authors in [12] study the $k$-means problem, and show how the $k$-means++ algorithm [3] for constructing a coreset can be implemented in the relational setting, thus deriving an $O(1)$-approximation for the $k$-means problem. Their core technical result is a polynomial-time sampling algorithm that, given a set $c_1, c_2, \ldots, c_j$ of centers in $\mathbb{R}^{|C|}$, outputs a row $q \in J$ with probability proportional to the square of the Euclidean distance of $q$ to the nearest center. This sampling result is surprising given the hardness of closely related problems. In particular, the problem of counting the number of rows in the join whose closest center is $c_1$ is NP-hard even for $j = 2$, and hard to approximate to within any multiplicative factor for $j = 3$. The technique of rejection sampling plays a key role in their sampling result.

Very recently, the authors in [5] present fast deterministic and randomized relational approximation algorithms for the k-median and k-means clustering problems assuming the dimension d is a constant. The authors in [4] present a method for constructing a coreset for certain optimization problems in machine learning. The work that we have reviewed builds on a body of theoretical and experimental research on relational algorithms. We refer the reader to [9, 12, 5, 4] for surveys of this prior work.

1.2 Our Contribution

From prior work, it is evident that in the relational setting, the study of the complexity of very fundamental geometric problems yields surprising answers. More of the terrain needs to be explored to form a better picture of the computational complexity. Motivated by this, we consider the following Distance to $k$-Subspace problem: Given a set $S = \{s_1, s_2, \ldots, s_k\}$ of orthogonal unit vectors, compute the maximum Euclidean distance of a row in the join to $\mathrm{span}(S)$, the subspace spanned by $S$. As a reminder, we are assuming an acyclic database. The Distance to Subspace problem is a basic primitive for other tasks including the subspace approximation problem [8]. We obtain the following results.

  1. We show, in Section 4, that the Distance to $k$-Subspace problem is NP-hard even for $k = 1$.

  2. We present a polynomial-time $\sqrt{d}$-approximation for the Distance to $k$-Subspace problem. To do so, we introduce a generalization of the join and compute measures on that generalization. This is likely to be an algorithmic tool of independent interest. As a consequence of our technique, we show that the row rank of the join $J$ can be computed exactly in polynomial time. These results are presented in Section 3.

  3. Next, we ask if for constant $k$, we can get a multiplicative $(1+\epsilon)$-approximation for the Distance to $k$-Subspace problem, for any $\epsilon > 0$. The distance of a row $q$ in the join from the given $k$-subspace can be written as $\sqrt{A(q) - B(q)}$, where $A(q)$ and $B(q)$ are both non-negative polynomials. $A(q)$ and $B(q)$ can individually be maximized/minimized using prior work on sum-product queries, but obviously this does not, by itself, give us a way of maximizing the difference $A(q) - B(q)$.

    Instead, we keep track of all possible values of the pair $(A(q), B(q))$ in our dynamic program. However, there can be an exponential number of such pairs, as is to be expected for an NP-hard problem. We can keep track of suitably discretized values of $(A(q), B(q))$, but to get a polynomial-time algorithm using this approach, we need the instance to be well-conditioned in the following sense: the maximum 2-norm of rows in the join (that is, the distance to the 0-subspace) is at most polynomially larger than the distance to the given $k$-subspace.

    Our main technical contribution here is to show that the Distance to $k$-Subspace problem can be reduced to polynomially many instances of the same problem that are well-conditioned. To obtain such a reduction, one tool we develop is a proof that there exists a $k$-subspace spanned by $k$ of the standard basis vectors that is "close" to any given $k$-subspace. Overall, our reduction gives us a $(1+\epsilon)$-approximation (Section 4) for the Distance to $k$-Subspace problem.

  4. We observe, in Section 4, that our $(1+\epsilon)$-approximation to Distance to $k$-Subspace can be used as a primitive to obtain a $(2(1+\epsilon))^k$-approximation for finding a $k$-subspace that minimizes the maximum Euclidean distance to the join. When the point set is given explicitly (as opposed to implicitly via the join), this is well known as the $\ell_\infty$-subspace approximation problem [14]. The running time of our algorithm is polynomial for constant $k$. Our approximation factor for subspace approximation is large, and this result serves to illustrate the usefulness of the Distance to Subspace approximation primitive, which is the main focus of this work.

We emphasize that our work addresses the regime where the number $d$ of features and the number $m$ of tables are part of the input, and are not treated as constants. We are primarily focused on the approximation guarantee that we can get with polynomial running time; we do not explore further optimizations of the running time.

2 Preliminaries

In this section, we present essential preliminaries. The database that is input to each of the problems considered here consists of a collection of tables $T_1, T_2, \ldots, T_m$. The columns of table $T_j$ correspond to a subset $C_j \subseteq \mathbb{Z}^+$ of features, and thus each row in $T_j$ is a vector with $|C_j|$ real-valued components. We use $n$ to denote the maximum number of rows in any table. Let $C = \bigcup_{j=1}^m C_j$ denote the set of all features. We will assume that $C = \{1, 2, \ldots, d\}$, where $d$ denotes $|C|$. We assume that the columns in any table are ordered according to the natural ordering $1, 2, \ldots, d$ of the features in $C$. Thus, we can represent a row $r$ in table $T_j$ unambiguously as a point in $\mathbb{R}^{|C_j|}$. Given such a row $r$ and feature $i \in C_j$, we will use $r_i$ or $r[i]$ to denote the component of $r$ corresponding to feature $i$.

Given two row vectors $r'$ and $r''$ over feature sets $C'$ and $C''$, we say that $r'$ and $r''$ are compatible if for each feature $i \in C' \cap C''$, $r'[i] = r''[i]$. We use the notation $r' \sim r''$ to denote that $r'$ and $r''$ are compatible. The concatenation of compatible rows $r'$ and $r''$ is the row vector over $C' \cup C''$ formed by combining the components of $r'$ and $r''$ in the obvious way.

2.1 Acyclic Databases

We say that our database of m tables is acyclic if there is a tree τ with O(m) nodes, each associated with one of our input tables, that has the following properties:

  1. Each table in the database is associated with at least one node.

  2. For each feature $i \in C$, the set of nodes whose tables contain $i$ induces a connected subgraph of the tree.

Note that for technical reasons, we allow a table to be associated with more than one node of the tree τ. Given an acyclic database, such a tree τ can be computed in polynomial time; see for example [9].

Note that each row of the join $J = T_1 \bowtie T_2 \bowtie \cdots \bowtie T_m$ is formed by picking one row from the table associated with each node of $\tau$, such that the chosen rows are pairwise compatible, and then concatenating the chosen rows. For an acyclic database, it suffices to check compatibility between pairs of rows chosen from nodes that are neighbors in the tree $\tau$. This follows from the properties of an acyclic database. This is the crucial property of acyclic databases that allows one to evaluate certain queries on the join efficiently via dynamic programming.
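To illustrate, here is a minimal sketch of such a dynamic program for the simplest acyclic shape: a chain of two-column tables in which consecutive tables share one feature. It counts the rows of the join without materializing it; the chain restriction and all names are ours, while the paper's dynamic programs run over a general join tree:

```python
from collections import defaultdict

def count_chain_join(tables):
    """Count rows of T1 ⋈ T2 ⋈ ... for a chain join.

    tables[i] is a list of (left, right) value pairs; the right column
    of tables[i] is joined with the left column of tables[i + 1].
    Runs in time polynomial in the table sizes.
    """
    # ways[v] = number of suffix-join rows whose leading value is v
    ways = defaultdict(int)
    for left, _ in tables[-1]:
        ways[left] += 1
    # sweep from the last table toward the first
    for t in reversed(tables[:-1]):
        nxt = defaultdict(int)
        for left, right in t:
            nxt[left] += ways[right]   # extend only compatible rows
        ways = nxt
    return sum(ways.values())

T1 = [(1, 1), (1, 2), (2, 1)]   # columns (a, b), as in the introduction
T2 = [(1, 2), (2, 3), (5, 6)]   # columns (b, c)
```

On the introduction's example, `count_chain_join([T1, T2])` returns 3, matching the three rows of the join, while the work done is linear in the table sizes rather than in the join size.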

For the purpose of dynamic programming, it will be convenient to view the tree $\tau$ as rooted. Furthermore, we can assume that each internal node of $\tau$ has exactly two children. This is without loss of generality: If there is a node $\alpha$ with exactly one child $\alpha'$, we can make $\alpha'$ the left child of $\alpha$ and add a right child whose table is a copy of $\alpha$'s. If a node $\alpha$ has $j > 2$ children, then replace $\alpha$ with a complete binary tree with $j - 1$ internal nodes and $j$ leaves, which will now correspond to the original $j$ children of $\alpha$. The table at each of the $j - 1$ internal nodes will be a copy of the table at $\alpha$.

After applying these two operations, we have that each internal node of τ has exactly two children. The total number of nodes in τ is O(m). The join is unchanged: we have merely added copies of some of the original tables. Finally, τ still witnesses the fact that the database is acyclic, as is readily checked. We will refer to τ as the join tree of the database.

For a node $\alpha$ in the tree $\tau$, we use $T_\alpha$ to denote the table at node $\alpha$, and $J_\alpha$ to denote the join of all the tables in the subtree rooted at $\alpha$. We "assign" each feature $i \in C$ to the highest node in tree $\tau$ whose table contains feature $i$ as a column. By highest, we mean the node closest to the root of $\tau$. Let $\hat{C}_\alpha$ denote the set of features assigned to node $\alpha$, and $C_\alpha$ denote the set of features assigned to nodes in the subtree rooted at $\alpha$. For a row $r$ in table $T_\alpha$, we denote the set of rows in the join $J_\alpha$ that are compatible with $r$ by $J_\alpha^r$. The notation is reasonable as this set of rows is indeed the result of a join of $J_\alpha$ and a table containing the single row $r$.

3 Computing the Distance to a given k-Subspace

Assume we are given an acyclic database consisting of tables $T_1, T_2, \ldots, T_m$ and a set of $k$ orthogonal unit vectors $S = \{s_i \mid 1 \le i \le k\} \subseteq \mathbb{R}^d$, where (as we recall) the set of features in the database is $C = \{1, 2, \ldots, d\}$. The Distance to $k$-Subspace problem is that of computing the maximum distance of a row in the join $J = T_1 \bowtie T_2 \bowtie \cdots \bowtie T_m$ from the subspace spanned by $S$. That is, we want to compute

$$\max_{r \in J} \left\| r - \sum_{i=1}^k (r \cdot s_i) s_i \right\|_2.$$

The main result of this section is a $\sqrt{d}$-approximation to this problem. We begin with a scheme for implicitly representing functions $V : J \to \mathbb{R}^\mu$, and posing algorithmic problems concerning this representation.

3.1 An Implicit Representation

Suppose that for each feature $i \in C$, we are given a function $V_i : \mathbb{R} \to \mathbb{R}^\mu$, where $\mu \ge 1$ is identical across all features. We assume that these functions are easily computed. In fact, for the application here, we assume that $V_i(x)$ is given explicitly for each value $x$ of feature $i$ that occurs in any of the input tables. For any row $r$ in the join $J$, let $V(r) = \sum_{i \in C} V_i(r_i)$. Finally, let $V(J) = \{V(r) \mid r \in J\}$.

One illustrative example is to let $V_i(x) = (0, \ldots, 0, \underbrace{x}_{i\text{th entry}}, 0, \ldots, 0) \in \mathbb{R}^d$, where we recall that the feature set $C$ is $\{1, 2, \ldots, d\}$. Then for any row $r$ in the join $J$ of the tables, $V(r) = \sum_{i \in C} V_i(r_i) = r$. Thus, $V(J) = J$. We will see more interesting examples shortly.

We are interested in computing measures of the point set $V(J)$. But this seems generally harder for $V(J)$ than for $J$. For example, there is a polynomial-time algorithm for computing $\max_{r \in J} \|r\|_2^2$ using sum-product queries. However, computing $\max_{q \in V(J)} \|q\|_2^2$ seems harder. Fortunately, we can show a positive result for the $\ell_\infty$ norm.

Lemma 1.

There is a polynomial-time algorithm for computing $\max_{q \in V(J)} \|q\|_\infty$.

Proof.

For $1 \le j \le \mu$, let

$$\Psi_j^{\max} = \max_{q \in V(J)} q_j = \max_{r \in J} \sum_{i \in C} V_i(r_i)[j],$$

and

$$\Psi_j^{\min} = \min_{q \in V(J)} q_j = \min_{r \in J} \sum_{i \in C} V_i(r_i)[j].$$

That is, $\Psi_j^{\max}$ (resp. $\Psi_j^{\min}$) is the maximum (minimum) value of the $j$-th coordinate over points in $V(J)$. As $\Psi_j^{\max} = \max_{r \in J} \sum_{i \in C} V_i(r_i)[j]$, it is a sum-product query over the join $J$ where the commutative semiring is over the set of real numbers, the operator $\oplus$ is max, and the operator $\otimes$ is addition over the reals. It follows that $\Psi_j^{\max}$ and $\Psi_j^{\min}$ can be computed in polynomial time. Let $\Psi_j = \max\{|\Psi_j^{\max}|, |\Psi_j^{\min}|\}$.

Finally, we note that $\max_{q \in V(J)} \|q\|_\infty$ is $\max_{j=1}^\mu \Psi_j$.
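A quick brute-force sanity check of the identity underlying Lemma 1. The $V_i$'s and the tiny join below are illustrative; in the relational setting each coordinate extreme would be computed by a sum-product query rather than by enumerating $J$:

```python
def linf_via_extremes(J, V_list):
    """max over V(J) of the l-infinity norm, computed coordinate-wise
    as max_j max(|Psi_j_max|, |Psi_j_min|)."""
    mu = len(V_list[0](0.0))
    best = 0.0
    for j in range(mu):
        vals = [sum(V_list[i](q[i])[j] for i in range(len(q))) for q in J]
        best = max(best, max(abs(max(vals)), abs(min(vals))))
    return best

J = [(1.0, -3.0), (2.0, 1.0)]                 # a toy two-row join
V_list = [lambda x: (x, 2 * x),               # illustrative V_1 (mu = 2)
          lambda x: (-x, x)]                  # illustrative V_2

lhs = linf_via_extremes(J, V_list)
# Direct evaluation: the l-infinity norm of each point of V(J).
rhs = max(max(abs(sum(V_list[i](q[i])[j] for i in range(2)))
              for j in range(2)) for q in J)
```

Here $V(J) = \{(4, -1), (1, 5)\}$, so both computations give 5.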

3.2 A √d-approximation

We begin by showing that using the above implicit representation, we can efficiently maximize the $\ell_\infty$ norm of the projection onto the orthogonal complement of the given $k$-subspace.

Lemma 2.

Suppose that for $1 \le k \le d$ we are given a set of unit vectors $S = \{s_1, s_2, \ldots, s_k\} \subseteq \mathbb{R}^d$ that are pairwise orthogonal. We can compute

$$\max_{r \in J} \left\| r - \sum_{i=1}^k (r \cdot s_i) s_i \right\|_\infty$$

in polynomial time.

Proof.

For $y \in \mathbb{R}^d$, let $y^\perp$ denote the projection of $y$ onto the orthogonal complement of the subspace spanned by $S$. Thus, $y^\perp = y - \sum_{i=1}^k (y \cdot s_i) s_i$. Viewing $y^\perp$ as a function of $y$, we see that $y^\perp = \sum_{i=1}^d y_i b_i$, where each $b_i \in \mathbb{R}^d$ is a constant vector: it depends on $S$ but not on $y$. (In other words, $y^\perp = By$, where the projection matrix $B$ depends only on $S$.)

For each $1 \le i \le d$, define the function $V_i : \mathbb{R} \to \mathbb{R}^d$ by $V_i(x) = x b_i$. It follows that for any row $r$ in the join $J$,

$$V(r) = \sum_{i=1}^d V_i(r_i) = \sum_{i=1}^d r_i b_i = r^\perp.$$

Thus, $V(J) = \{r^\perp \mid r \in J\}$. From Lemma 1, there is a polynomial-time algorithm to compute

$$\max_{q \in V(J)} \|q\|_\infty = \max_{r \in J} \|r^\perp\|_\infty.$$

For any $y \in \mathbb{R}^d$, we have $\|y\|_\infty \le \|y\|_2 \le \sqrt{d}\, \|y\|_\infty$. Thus we obtain the main result of this section:

Theorem 3.

Suppose that for $1 \le k \le d$ we are given a set of unit vectors $S = \{s_1, s_2, \ldots, s_k\} \subseteq \mathbb{R}^d$ that are pairwise orthogonal. There is a polynomial-time algorithm to compute a $\sqrt{d}$-approximation to

$$\max_{r \in J} \left\| r - \sum_{i=1}^k (r \cdot s_i) s_i \right\|_2.$$

Next, we consider the algorithmic task of computing the row rank of the join $J = T_1 \bowtie T_2 \bowtie \cdots \bowtie T_m$ of the input acyclic database consisting of tables $T_1, T_2, \ldots, T_m$. That is, we want to compute the size of a maximal linearly independent set [13] of row vectors in $J$. An algorithm for this follows from Lemma 2.

Theorem 4.

There is a polynomial-time algorithm that, given an acyclic database as input, computes the row rank of the join $J$.

Proof.

We build an orthonormal basis one vector at a time. Using prior work on sum-product queries, we compute a vector $s_1' \in J$ that maximizes the square of the Euclidean norm. If $s_1'$ is the zero vector, the rank of $J$ is 0. Otherwise, let $s_1$ be the unit vector corresponding to $s_1'$, and we initialize $S = \{s_1\}$.

Suppose that we have computed a set $S = \{s_1, s_2, \ldots, s_j\}$ of unit vectors in the row space of $J$ that are pairwise orthogonal. Using Lemma 2, we compute $s_{j+1}' \in J$ that achieves

$$\max_{r \in J} \left\| r - \sum_{i=1}^j (r \cdot s_i) s_i \right\|_\infty.$$

If this maximum is 0, then the rank of $J$ is $j$ and $S$ is a basis for the row space of $J$. Otherwise, we add to $S$ the unit vector $s_{j+1}$ along the component of $s_{j+1}'$ orthogonal to $\mathrm{span}(S)$, and continue.
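On an explicit point set, the greedy loop of Theorem 4 looks as follows. This is a sketch with hypothetical names; for simplicity we maximize the Euclidean norm of the residual, whereas the theorem uses Lemma 2 and the $\ell_\infty$ norm, which equally well detects a nonzero residual:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def residual(p, basis):
    """Component of p orthogonal to span(basis); basis is orthonormal."""
    r = list(p)
    for s in basis:
        c = dot(r, s)
        r = [x - c * y for x, y in zip(r, s)]
    return r

def rank(points, eps=1e-9):
    """Row rank via greedy maximization of the residual norm."""
    basis = []
    while True:
        r_best = max((residual(p, basis) for p in points),
                     key=lambda r: dot(r, r))
        nrm = math.sqrt(dot(r_best, r_best))
        if nrm <= eps:          # every point lies in span(basis)
            return len(basis)
        basis.append([x / nrm for x in r_best])

pts = [(1, 0, 0), (0, 1, 0), (1, 1, 0), (2, 3, 0)]
```

Here `rank(pts)` is 2: all four points lie in the $xy$-plane. The relational algorithm replaces the `max` over `points` by the implicit maximization of Lemma 2.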

4 Maximum Distance to a k-Subspace: A PTAS

Assume we are given an acyclic database consisting of tables $T_1, T_2, \ldots, T_m$ and an orthonormal set of $k$ vectors $S = \{s_i \mid 1 \le i \le k\} \subseteq \mathbb{R}^d$, where (as we recall) the set of features in the database is $C = \{1, 2, \ldots, d\}$. We would like to compute the maximum distance of a row in the join $J = T_1 \bowtie T_2 \bowtie \cdots \bowtie T_m$ from the subspace spanned by $S$. The distance of a point $q \in \mathbb{R}^d$ from the subspace spanned by $S$ is $\mathrm{dist}(q, \mathrm{span}(S)) := \| q - \sum_{i=1}^k (q \cdot s_i) s_i \|_2$. Thus, our goal is to compute $\max_{q \in J} \mathrm{dist}(q, \mathrm{span}(S))$. Let $\hat{d}(q, \mathrm{span}(S)) := \|q\|_2^2 - \sum_{i=1}^k (q \cdot s_i)^2$. As $\hat{d}(q, \mathrm{span}(S)) = \mathrm{dist}^2(q, \mathrm{span}(S))$, we reformulate this as computing $D(J, \mathrm{span}(S)) := \max_{q \in J} \hat{d}(q, \mathrm{span}(S))$. The main result of this section is a polynomial-time approximation scheme (PTAS) for the problem for constant $k$.

We begin by showing that the Distance to k-Subspace problem is NP-hard even for k=1.

Theorem 5.

If there is a polynomial-time algorithm for computing $D(J, \mathrm{span}(\{p\}))$ given the input acyclic database and vector $p \in \mathbb{R}^d$, then P = NP.

Proof.

The proof is by reduction from the Partition problem. Given a set of positive integers $F = \{f_1, f_2, \ldots, f_d\}$, the goal here is to determine if $F$ can be partitioned into two parts $F'$ and $F''$ such that $\mathrm{diff}(F', F'') := \sum_{f \in F'} f - \sum_{f \in F''} f = 0$. Given such an input, we compute an acyclic database with tables $T_1, T_2, \ldots, T_{2d}$ and a vector $p$ as follows. The sequence of features is $c_1, v_1, c_2, v_2, \ldots, c_d, v_d, c_{d+1}$. For $1 \le i \le d$, the columns of table $T_{2i-1}$ are $c_i$ and $v_i$, and the columns of table $T_{2i}$ are $v_i$ and $c_{i+1}$. Each table has two rows, whose values are determined as follows.

     T1              T2
   c1    v1        v1    c2
   0     f1        f1    0
   0    -f1       -f1    0

     T3              T4
   c2    v2        v2    c3
   0     f2        f2    0
   0    -f2       -f2    0

   ...

    T_{2d-1}         T_{2d}
   cd    vd        vd    c_{d+1}
   0     fd        fd    0
   0    -fd       -fd    0

It is clear that the database is acyclic, and in fact, a tree that witnesses this is a path. We set $p = (1, 1, \ldots, 1) \in \mathbb{R}^{2d+1}$. Let $\ell(p) = \mathrm{span}(\{p\})$ denote the line through $p$. Let $J$ denote the join of the tables. If $D(J, \ell(p)) = \sum_{f \in F} f^2$, we declare that the input Partition instance has a solution; otherwise, we declare that it doesn't.

The reduction runs in polynomial time, and we now argue that it is correct. For any partition $(F', F'')$ of $F$, we define $\chi(F', F'')$ to be the vector $(c_1, v_1, \ldots, c_d, v_d, c_{d+1})$ where $c_i = 0$ for $1 \le i \le d+1$, and for each $1 \le i \le d$, $v_i = f_i$ if $f_i \in F'$ and $v_i = -f_i$ if $f_i \in F''$. Note that $\chi$ is a bijection from the set of partitions of $F$ to the set of rows of the join $J$. Furthermore, $\mathrm{diff}(F', F'') = \chi(F', F'') \cdot p$. Note that $\|p\|_2^2 = 2d+1$, and for any row $q \in J$, we have $\|q\|_2^2 = \sum_{f \in F} f^2$; all rows in $J$ have the same 2-norm. Thus, for any $q \in J$,

$$\hat{d}(q, \ell(p)) = \|q\|_2^2 - \frac{(q \cdot p)^2}{\|p\|_2^2} = \sum_{f \in F} f^2 - \frac{(q \cdot p)^2}{2d+1}.$$

We conclude that for any partition $(F', F'')$ of $F$,

$$\mathrm{diff}(F', F'') = 0 \iff \chi(F', F'') \cdot p = 0 \iff \hat{d}(\chi(F', F''), \ell(p)) = \sum_{f \in F} f^2.$$
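The reduction can be checked end-to-end on a tiny instance (hypothetical helper names; we enumerate the join directly, which the hardness result says cannot be done efficiently in general):

```python
from itertools import product

def reduction_rows(F):
    """Enumerate the join of the 2d tables: every sign pattern on the
    v-coordinates, with all c-coordinates pinned to 0."""
    rows = []
    for signs in product((1, -1), repeat=len(F)):
        q = []
        for s, f in zip(signs, F):
            q += [0, s * f]          # (c_i, v_i)
        q.append(0)                  # c_{d+1}
        rows.append(q)
    return rows

def max_sq_dist_to_line(rows, p):
    """D(J, line through p): max of ||q||^2 - (q . p)^2 / ||p||^2."""
    pp = sum(x * x for x in p)
    return max(sum(x * x for x in q)
               - sum(a * b for a, b in zip(q, p)) ** 2 / pp
               for q in rows)

F = [3, 1, 2]                        # 3 = 1 + 2, a yes-instance
p = [1] * (2 * len(F) + 1)
rows = reduction_rows(F)
has_partition = max_sq_dist_to_line(rows, p) == sum(f * f for f in F)
```

For `F = [3, 1, 2]` the row with signs (+, -, -) is orthogonal to `p`, so the maximum squared distance equals $\sum_f f^2 = 14$ and `has_partition` is true; for a no-instance such as `[1, 1, 3]` the maximum falls strictly below $\sum_f f^2$.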

Our PTAS for Distance to a $k$-Subspace has three major steps. Let $E = \{e_i \mid 1 \le i \le d\} \subseteq \mathbb{R}^d$ denote the unit vectors in the standard basis. We want to find a $k$-subset $E_k \subseteq E$ such that $\mathrm{span}(E_k)$ is "close to" $\mathrm{span}(S)$. This first step, along with the properties of $\mathrm{span}(E_k)$, is described in Section 4.1. In the second step (Section 4.2), we use $E_k$ to reduce the Distance to $k$-Subspace instance to polynomially many well-conditioned instances of the same problem. In the third step (Section 4.3), we use discretization and dynamic programming to solve a well-conditioned instance approximately in polynomial time. We conclude in Section 4.4 with an application of the PTAS to subspace approximation.

For a set $X \subseteq \mathbb{R}^d$, we denote the orthogonal complement of $X$ by $X^\perp$. That is, $X^\perp = \{y \in \mathbb{R}^d \mid y \cdot x = 0 \text{ for all } x \in X\}$.

4.1 Finding a Close k-Subspace

Our algorithm for computing $E_k$ is as follows. Let $\mathrm{proj}(a, \mathrm{span}(S))$ be the projection of a vector $a$ on the subspace spanned by $S$.

Algorithm 1 Compute $E_k$: Initialize $E_k \leftarrow \emptyset$. For $i = 1, \ldots, k$: let $W_i = \mathrm{span}(S) \cap [\mathrm{span}(\{p_1, \ldots, p_{i-1}\})]^\perp$; pick $e_i' \in E$ maximizing the norm of $p_i' := \mathrm{proj}(e_i', W_i)$; let $p_i$ be the unit vector along $p_i'$, and add $e_i'$ to $E_k$.

We state some useful properties of this algorithm:

Lemma 6.

(a) The set $\{p_1, p_2, \ldots, p_k\}$ is an orthonormal basis for $\mathrm{span}(S)$; (b) For any $1 \le j < i \le k$, we have $e_j' \cdot p_i = 0$; (c) We have $|E_k| = k$, that is, our algorithm never attempts to add a duplicate element to $E_k$; (d) For each $1 \le i \le k$, we have $e_i' \cdot p_i \ge \frac{1}{\sqrt{d}}$.

Proof.

Part (a) is evident. As for (b), fix $1 \le j < i \le k$. We have $p_i \cdot p_j = 0$, and $p_i \in W_j = \mathrm{span}(S) \cap [\mathrm{span}(\{p_1, \ldots, p_{j-1}\})]^\perp$. Write $e_j' = p_j' + (e_j' - p_j')$, where $p_j' = \mathrm{proj}(e_j', W_j)$ is parallel to $p_j$ and $e_j' - p_j'$ is orthogonal to $W_j$. Since $p_i \in W_j$ and $p_i \cdot p_j = 0$, we get $e_j' \cdot p_i = 0$.

For (c), we pick $e_i' \in E$ such that $\|p_i'\| = \|\mathrm{proj}(e_i', W_i)\|$ is maximum. Any unit vector $w \in W_i$ satisfies $\sum_{j=1}^d (e_j \cdot w)^2 = 1$, so some $e_j \in E$ has $\|\mathrm{proj}(e_j, W_i)\| \ge |e_j \cdot w| \ge \frac{1}{\sqrt{d}}$; hence $e_i' \cdot p_i = \|p_i'\| \ge \frac{1}{\sqrt{d}}$. From part (b), for all $1 \le j \le i-1$, we have $e_j' \cdot p_i = 0$, so $e_i' \notin \{e_j' \mid 1 \le j \le i-1\}$. Thus $|E_k| = k$. We have already shown part (d) in the process.
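Under our reading of the algorithm analyzed in Lemma 6, the greedy computation of $E_k$ can be sketched as follows, using pure-Python linear algebra and names of our choosing:

```python
import math

def dot(a, b): return sum(x * y for x, y in zip(a, b))
def scale(a, c): return [c * x for x in a]
def sub(a, b): return [x - y for x, y in zip(a, b)]
def norm(a): return math.sqrt(dot(a, a))

def proj_onto(v, onb):
    """Projection of v onto the span of the orthonormal list onb."""
    out = [0.0] * len(v)
    for w in onb:
        c = dot(v, w)
        out = [o + c * wi for o, wi in zip(out, w)]
    return out

def compute_Ek(S, d):
    """Greedy E_k: at step i, project every standard basis vector onto
    the part of span(S) orthogonal to p_1, ..., p_{i-1}, and keep the
    index whose projection is largest."""
    W = [list(s) for s in S]          # orthonormal basis of span(S)
    Ek, P = [], []
    for _ in range(len(S)):
        best, best_p = None, None
        for i in range(d):
            e = [1.0 if j == i else 0.0 for j in range(d)]
            pr = proj_onto(e, W)
            if best is None or norm(pr) > norm(best_p):
                best, best_p = i, pr
        p = scale(best_p, 1.0 / norm(best_p))
        Ek.append(best); P.append(p)
        # remove the p direction from W and re-orthonormalize
        newW = []
        for w in W:
            r = sub(w, scale(p, dot(w, p)))
            for u in newW:
                r = sub(r, scale(u, dot(r, u)))
            if norm(r) > 1e-9:
                newW.append(scale(r, 1.0 / norm(r)))
        W = newW
    return Ek, P

inv = 1 / math.sqrt(2)
S = [[inv, inv, 0.0, 0.0], [0.0, 0.0, inv, -inv]]   # orthonormal in R^4
Ek, P = compute_Ek(S, 4)
```

On this orthonormal pair the algorithm selects indices 0 and 2, and one can check that each selected coordinate of the corresponding $p_i$ is at least $1/\sqrt{d} = 1/2$, matching Lemma 6(d).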

The key property of the set Ek computed above is that the k-subspaces span(Ek) and span(S) are close in the sense of the following theorem. For other measures of closeness of subspaces, see [11].

Theorem 7.

For any unit vector $v \in \mathrm{span}(S)$, there exists a unit vector $u \in \mathrm{span}(E_k)$ such that $|v \cdot u| \ge \frac{1}{(\sqrt{d})^k 2^k}$. (We assume that the dimension $d \ge 36$ to get this simplified bound.)

Proof.

A high-level overview of the proof is that we construct a “best-response” unit vector uspan(Ek) for each unit vector vspan(S), depending on the relative magnitudes of components of v. Now we proceed to the formal proof.

We represent the unit vectors as $v = x_1 p_1 + x_2 p_2 + \cdots + x_k p_k$ and $u = y_1 e_1' + y_2 e_2' + \cdots + y_k e_k'$, where $\sum_{i=1}^k x_i^2 = \sum_{i=1}^k y_i^2 = 1$. Let $x = (x_1, x_2, \ldots, x_k)$ and $y = (y_1, y_2, \ldots, y_k)$ denote the corresponding unit vectors.

Let $P = [p_1\ p_2\ \cdots\ p_k] \in \mathbb{R}^{d \times k}$, $U = [e_1'\ e_2'\ \cdots\ e_k'] \in \mathbb{R}^{d \times k}$, and $A = P^T U$. We have $v \cdot u = x^T A y$. We have the following claim for $A$.

Claim 8.

$A$ is an upper triangular matrix. Each diagonal element $a_{ii} \ge \frac{1}{\sqrt{d}}$. For the non-diagonal elements $a_{ij}$, where $i < j$, we have $|a_{ij}| \le 1$.

Proof.

For $1 \le j < i \le k$, we have $a_{ij} = p_i \cdot e_j' = 0$ by part (b) of Lemma 6. For $1 \le i \le k$, we have $a_{ii} = p_i \cdot e_i' \ge \frac{1}{\sqrt{d}}$ by part (d). For the non-diagonal elements $a_{ij}$ with $i < j$: given $p_i$ is a unit vector, $|a_{ij}| = |p_i \cdot e_j'| \le 1$.

Let $z = x^T A$. So $v \cdot u = z \cdot y$ and

$$z = \left( a_{11} x_1,\ a_{12} x_1 + a_{22} x_2,\ \ldots,\ a_{1k} x_1 + a_{2k} x_2 + \cdots + a_{kk} x_k \right).$$

We want to pick a good vector $y$ such that $|z \cdot y|$ is large enough. Let $c$ be a small positive constant. We discuss different cases for $x$:

  • Case 1: $|x_1| \ge \frac{c}{(\sqrt{d})^{k-1}}$

  • Case 2: $|x_1| < \frac{c}{(\sqrt{d})^{k-1}}$, $|x_2| \ge \frac{2c}{(\sqrt{d})^{k-2}}$

  • Case $j$: $|x_1| < \frac{c}{(\sqrt{d})^{k-1}}$, $|x_2| < \frac{2c}{(\sqrt{d})^{k-2}}$, …, $|x_{j-1}| < \frac{2^{j-2} c}{(\sqrt{d})^{k-j+1}}$, $|x_j| \ge \frac{2^{j-1} c}{(\sqrt{d})^{k-j}}$

  • Case $k$: $|x_1| < \frac{c}{(\sqrt{d})^{k-1}}$, $|x_2| < \frac{2c}{(\sqrt{d})^{k-2}}$, …, $|x_{k-1}| < \frac{2^{k-2} c}{\sqrt{d}}$, $|x_k| \ge 2^{k-1} c$

In Case 1, we pick $y_1 = 1$ and leave all other components 0. Then $|z \cdot y| = |a_{11} x_1| \ge \frac{1}{\sqrt{d}} \cdot \frac{c}{(\sqrt{d})^{k-1}} = \frac{c}{(\sqrt{d})^k}$.

In Case 2, we pick $y_2 = 1$ and leave all other components 0. Then $|z \cdot y| = |a_{12} x_1 + a_{22} x_2| \ge a_{22} |x_2| - |a_{12} x_1| \ge \frac{1}{\sqrt{d}} \cdot \frac{2c}{(\sqrt{d})^{k-2}} - \frac{c}{(\sqrt{d})^{k-1}} = \frac{c}{(\sqrt{d})^{k-1}}$.

Similarly, in Case $j$, we pick $y_j = 1$ and leave all other components 0. Then

$$|z \cdot y| = |a_{1j} x_1 + a_{2j} x_2 + \cdots + a_{jj} x_j| \ge a_{jj} |x_j| - \sum_{i=1}^{j-1} |a_{ij} x_i| \ge \frac{2^{j-1} c}{(\sqrt{d})^{k-j+1}} - \sum_{i=1}^{j-1} \frac{2^{i-1} c}{(\sqrt{d})^{k-i}} \ge \frac{2^{j-1} c}{(\sqrt{d})^{k-j+1}} - \frac{2^{j-2} c}{(\sqrt{d})^{k-j+1}} \cdot \frac{1}{1 - \frac{2}{\sqrt{d}}} = \frac{2^{j-2} c}{(\sqrt{d})^{k-j+1}} \cdot \frac{\sqrt{d} - 4}{\sqrt{d} - 2}.$$

In Case $k$, we pick $y_k = 1$ and leave all other components 0. Then $|z \cdot y| = |a_{1k} x_1 + a_{2k} x_2 + \cdots + a_{kk} x_k| \ge a_{kk} |x_k| - \sum_{i=1}^{k-1} |a_{ik} x_i|$. Similar to Case $j$, the lower bound is $\frac{2^{k-2} c}{\sqrt{d}} \cdot \frac{\sqrt{d} - 4}{\sqrt{d} - 2}$.

Therefore, given that $d \ge 36$, the smallest lower bound for $|z \cdot y|$ occurs in Case 1, where $|z \cdot y| \ge \frac{c}{(\sqrt{d})^k}$. We now argue that at least one of the cases must hold: if none held, we would have $\sum_{i=1}^k x_i^2 < \sum_{i=1}^k \left( \frac{2^{i-1} c}{(\sqrt{d})^{k-i}} \right)^2$, so it suffices that

$$\sum_{i=1}^k \left( \frac{2^{i-1} c}{(\sqrt{d})^{k-i}} \right)^2 \le \sum_{i=1}^k x_i^2 = 1.$$

The inequality holds if $c = \frac{1}{2^k}$. Thus, $|v \cdot u| \ge \frac{c}{(\sqrt{d})^k} = \frac{1}{(\sqrt{d})^k 2^k}$.

For each $1 \le i \le k$, let $\beta_i$ denote the coordinate corresponding to $e_i'$, i.e., $e_i' = e_{\beta_i}$. Let $I = \{\beta_1, \beta_2, \ldots, \beta_k\} \subseteq [d]$ be the set of indices corresponding to the standard basis vectors in $E_k$. The following claim is shown by using observations made in the proof of Theorem 7.

Lemma 9.

For any $k$-dimensional vector $a = (a_1, a_2, \ldots, a_k) \in \mathbb{R}^k$, we can efficiently compute a point $q \in \mathrm{span}(S)$ such that $q_{\beta_i} = a_i$, for all $1 \le i \le k$.

Proof.

For any $1 \le i \le k$, let $w_i \in \mathbb{R}^k$ be $p_i$ restricted to the indices in $I$. Similarly, let $e_i'' \in \mathbb{R}^k$ be $e_i'$ restricted to the indices in $I$.

Let $M = [w_1\ w_2\ \cdots\ w_k] \in \mathbb{R}^{k \times k}$. We want to show that there exists a vector $x \in \mathbb{R}^k$ such that $Mx = a$. We have the following claim for $M$.

Claim 10.

rank(M)=k.

Proof.

Let $N = [e_1''\ e_2''\ \cdots\ e_k''] \in \mathbb{R}^{k \times k}$. We can observe that $p_i \cdot e_j' = w_i \cdot e_j''$ holds for $1 \le i, j \le k$, since $e_j'$ is supported on the indices in $I$. Thus, $M^T N = P^T U = A$. From Claim 8, we know $\mathrm{rank}(A) = k$. Thus $\mathrm{rank}(M) = k$.

Given that $M$ is a full rank square matrix, there exists a vector $x \in \mathbb{R}^k$ such that $Mx = a$. Then we can compute $q = Px$.

We conclude with the following implication of the closeness in Theorem 7 of span(S) and span(Ek).

Lemma 11.

Let $w$ be an arbitrary unit vector in the $(d-k)$-subspace $\mathrm{span}(E \setminus E_k)$. Let $w'$ be a unit vector along the projection of $w$ on $\mathrm{span}(S)$. Then $|w \cdot w'| \le \sqrt{1 - \frac{1}{R}}$, where $R = d^k 2^{2k}$.

Proof.

Let $v$ be the projection of $w'$ on the subspace spanned by $E_k$. Let $u = w' - v$. According to Theorem 7, $\|v\|_2 \ge \frac{1}{(\sqrt{d})^k 2^k}$, and hence $\|u\|_2 \le \sqrt{1 - \frac{1}{d^k 2^{2k}}}$.

Then, noting that $w \cdot v = 0$ as $w$ is orthogonal to $\mathrm{span}(E_k)$,

$$|w \cdot w'| = |w \cdot (u + v)| = |w \cdot u| \le \|w\|_2 \|u\|_2 \le \sqrt{1 - \frac{1}{d^k 2^{2k}}} = \sqrt{1 - \frac{1}{R}}.$$

4.2 Reduction to Well-Conditioned Instances

We will use the new orthonormal basis $\mathcal{P} = \{p_1, p_2, \ldots, p_k\}$ for $\mathrm{span}(S)$. Therefore we will use the notation $\mathrm{span}(\mathcal{P})$ for the subspace $\mathrm{span}(S)$ in the following sections.

Recall that $I \subseteq [d]$ is the set of indices corresponding to the standard basis vectors in $E_k$. For each $i \in I$, let $\alpha_i$ be the node in the join tree such that $i \in \hat{C}_{\alpha_i}$. See Section 2 for a reminder of these concepts. Let $\mathcal{A} = \{\alpha_i \mid i \in I\}$; we use $\mathcal{A} = \{\alpha_1, \alpha_2, \ldots, \alpha_{|\mathcal{A}|}\}$ to denote the elements of the set. We can write the join $J$ as

$$J = \bigcup_{r_1 \in T_{\alpha_1},\, r_2 \in T_{\alpha_2},\, \ldots,\, r_{|\mathcal{A}|} \in T_{\alpha_{|\mathcal{A}|}}} J^{r_1 r_2 \cdots r_{|\mathcal{A}|}}.$$

We therefore compute $D(J^{r_1 r_2 \cdots r_{|\mathcal{A}|}}, \mathrm{span}(\mathcal{P}))$ for each row combination $\{r_1, r_2, \ldots, r_{|\mathcal{A}|}\}$, and return the maximum of these quantities. The number of row combinations is $O(n^k)$, which is polynomial as $k$ is a constant. In the remainder of Section 4, we fix a single row combination $\{r_1, r_2, \ldots, r_{|\mathcal{A}|}\}$, and explain our approximation scheme for calculating $D(J^{r_1 r_2 \cdots r_{|\mathcal{A}|}}, \mathrm{span}(\mathcal{P}))$. We will continue to use $J$ to denote the new join. For each $1 \le i \le |\mathcal{A}|$, we delete all rows in the table $T_{\alpha_i}$ except $r_i$. Let $r$ denote the concatenation of $r_1, r_2, \ldots, r_{|\mathcal{A}|}$.

For any $q \in J$, we have $\hat{d}(q, \mathrm{span}(\mathcal{P})) = \|q\|_2^2 - \sum_{i=1}^k (q \cdot p_i)^2$. However, the individual terms $\|q\|_2^2$ and $\sum_{i=1}^k (q \cdot p_i)^2$ can be much larger than $\hat{d}(q, \mathrm{span}(\mathcal{P}))$. We therefore apply a translation that preserves $\hat{d}(\cdot, \mathrm{span}(\mathcal{P}))$ but makes these two terms small.

Let $\bar{r} \in \mathbb{R}^k$ be the vector formed from $r$ by restricting to the indices in $I$. Using Lemma 9, we compute a vector $h \in \mathrm{span}(\mathcal{P})$ such that $h_I = \bar{r}$. We apply the translation $q' = q - h$. As we translate by a vector in $\mathrm{span}(\mathcal{P})$, the distance to $\mathrm{span}(\mathcal{P})$ is preserved:

Lemma 12.

For any $q \in J$, $\hat{d}(q - h, \mathrm{span}(\mathcal{P})) = \hat{d}(q, \mathrm{span}(\mathcal{P}))$.

Also, we can observe that for any $q \in J$, the translated row $q' = q - h$ satisfies $q'_I = q_I - h_I = \mathbf{0}$.

To effect the translation, we have to modify the tables. Thus, for each node $\alpha$ in the join tree, for each row $r \in T_\alpha$ and each feature $i$ of $T_\alpha$, we set $r_i^{\mathrm{new}} = r_i - h_i$. We henceforth refer to $r_i^{\mathrm{new}}$ as just $r_i$ and $q'$ as $q$.

Lemma 13.

After the modification of tables, for any $q \in J$, $q \in \mathrm{span}(E \setminus E_k)$ as $q_I = \mathbf{0}$. Therefore by Lemma 11, $\|q\|_2^2 \le R \cdot \hat{d}(q, \mathrm{span}(\mathcal{P}))$.

This bound on $\|q\|_2^2$ is what we mean by saying that the join $J$ is well-conditioned.

4.3 Algorithm for a Well-Conditioned Instance

We can use dynamic programming over the join tree to explicitly compute the set $\{\hat{d}(q, \mathrm{span}(\mathcal{P})) \mid q \in J\}$. However, this set can have exponential size. Our approximation algorithm will therefore work with a rounded version of $\hat{d}(q, \mathrm{span}(\mathcal{P}))$. We observe that

$$\hat{d}(q, \mathrm{span}(\mathcal{P})) = \|q\|_2^2 - \sum_{i=1}^k (q \cdot p_i)^2 = \sum_{j=1}^d q_j^2 - \sum_{i=1}^k \Big(\sum_{j=1}^d p_{ij} q_j\Big)^2 \tag{1}$$

where $p_{ij}$ is the $j$-th coordinate of $p_i$. By Lemma 13, we know that $\|q\|_2^2 \leq R \cdot \hat{d}(q, \mathrm{span}(\mathcal{P}))$ for any $q \in J$; the join is well-conditioned.
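Equation 1 is the Pythagorean identity: the squared distance to $\mathrm{span}(\mathcal{P})$ is the squared norm of $q$ minus the squared norm of its projection onto $\mathrm{span}(\mathcal{P})$. A small check with made-up values (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 5, 2
P = np.linalg.qr(rng.normal(size=(d, k)))[0]  # orthonormal p_1, ..., p_k as columns
q = rng.normal(size=d)

# Right-hand side of Equation 1.
dhat = np.sum(q**2) - sum((P[:, i] @ q) ** 2 for i in range(k))

# Direct squared distance: subtract the orthogonal projection onto span(P).
proj = P @ (P.T @ q)
assert abs(dhat - np.sum((q - proj) ** 2)) < 1e-9
```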

Let $\beta = \max_{q \in J} \|q\|_2$. Consider a node $\alpha$ of the join tree such that its table $T_\alpha$ contains a row $r$ with an entry whose absolute value exceeds $\beta$. Clearly, such a row $r$ does not contribute to the eventual join $J$, so we delete such rows. Let $\eta = 4R(k+1)d^2/\epsilon$, where $\epsilon > 0$ is the error parameter for the approximation scheme, and let $\gamma = \beta/\eta$. Notice from Equation 1 that $\hat{d}(q, \mathrm{span}(\mathcal{P}))$ is built up from terms of the form $p_{ij} q_j$ and $q_j$. For $x \in \mathbb{R}$, let $[x] = \mathrm{sign}(x) \lfloor |x|/\gamma \rfloor$. We will use $\gamma [p_{ij} q_j]$ and $\gamma [q_j]$ as our rounded versions of $p_{ij} q_j$ and $q_j$, respectively; essentially, we round to the nearest integer multiple of $\gamma$ that makes the absolute value smaller. The rounded version we use for $\hat{d}(q, \mathrm{span}(\mathcal{P}))$ will then be

$$\mathrm{apxdist}(q, \mathrm{span}(\mathcal{P})) := \gamma^2 \sum_{j=1}^d [q_j]^2 - \gamma^2 \sum_{i=1}^k \Big(\sum_{j=1}^d [p_{ij} q_j]\Big)^2 \tag{2}$$
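The rounding operator and Equation 2 can be sketched as follows (the instance is made up; the asserted error bound is the per-term bound $2\gamma\beta$ from Section 4.3.5, summed over the $(k+1)d^2$ terms):

```python
import numpy as np

def rnd(x, gamma):
    """[x]: round x toward zero to an integer multiple of gamma, returned as an integer."""
    return int(np.sign(x) * np.floor(abs(x) / gamma))

def apxdist(q, P, gamma):
    """Rounded version of Equation 1, as in Equation 2."""
    d, k = P.shape
    a = sum(rnd(qj, gamma) ** 2 for qj in q)
    b = [sum(rnd(P[j, i] * q[j], gamma) for j in range(d)) for i in range(k)]
    return gamma**2 * a - gamma**2 * sum(bi**2 for bi in b)

rng = np.random.default_rng(2)
d, k = 4, 2
P = np.linalg.qr(rng.normal(size=(d, k)))[0]
q = rng.normal(size=d)

# For this single row, ||q||_2 bounds every |q_j| and |p_ij q_j|,
# playing the role of beta in the paper.
beta = np.linalg.norm(q)
gamma = beta / 100.0  # i.e. eta = 100, chosen for illustration
exact = np.sum(q**2) - sum((P[:, i] @ q) ** 2 for i in range(k))
assert abs(apxdist(q, P, gamma) - exact) <= (k + 1) * d * d * 2 * gamma * beta
```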

We now describe our dynamic programming algorithm that uses the $[p_{ij} q_j]$'s to compute an approximation to $D(J, \mathrm{span}(\mathcal{P}))$.

4.3.1 Data Structures for the DP

For a node $\alpha$ in the join tree and a row $q \in J_\alpha$, let $A_\alpha(q) = \sum_{j \in C_\alpha} [q_j]^2$, and $B_\alpha(q) = (B_\alpha(q)[1], B_\alpha(q)[2], \ldots, B_\alpha(q)[k])$, where $B_\alpha(q)[i] = \sum_{j \in C_\alpha} [p_{ij} q_j]$ for $1 \leq i \leq k$. Note that if $\alpha$ is the root of the join tree, then $\gamma^2 A_\alpha(q) - \gamma^2 \sum_{i=1}^k (B_\alpha(q)[i])^2$ is $\mathrm{apxdist}(q, \mathrm{span}(\mathcal{P}))$ (Equation 2).

For each node $\alpha$ in the join tree $\tau$, our algorithm stores an array $\alpha.M[\,]$ indexed by the rows in table $T_\alpha$. For a row $r \in T_\alpha$, the entry $\alpha.M[r]$ will eventually store the set

$$\{(A_\alpha(q), B_\alpha(q)) \mid q \in J_\alpha^r\}$$

That is, a signature $(A_\alpha(q), B_\alpha(q))$ is stored in $\alpha.M[r]$ for each row $q$ in the join $J_\alpha^r$. Using the fact that the join is well-conditioned, we show later that the size of this set of signatures is bounded by a polynomial, even though the join itself can have an exponential number of rows.

4.3.2 The DP

To fill in the $\alpha.M[\,]$ arrays, the algorithm does a post-order traversal of the join tree $\tau$, processing each node $\alpha$ of $\tau$ in that order.

  • If $\alpha$ is a leaf node, then for each row $r$ in $T_\alpha$, we set $\alpha.M[r]$ to $\{(A_\alpha(r), B_\alpha(r))\}$. Note that $A_\alpha(r) = \sum_{j \in \hat{C}_\alpha} [r_j]^2$, and $B_\alpha(r) = (B_\alpha(r)[1], B_\alpha(r)[2], \ldots, B_\alpha(r)[k])$, where $B_\alpha(r)[i] = \sum_{j \in \hat{C}_\alpha} [p_{ij} r_j]$. If $\hat{C}_\alpha = \emptyset$, then the above summations evaluate to $0$, and $\alpha.M[r] = \{(0, [0, 0, \ldots, 0])\}$.

  • If $\alpha$ is an internal node, let $\lambda$ and $\rho$ denote its left and right child. For each row $r \in T_\alpha$, we initialize $\alpha.M[r] = \emptyset$ and compute $\alpha.M[r]$ as follows:

    Algorithm 2 Compute α.M[r].
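The body of Algorithm 2 is not reproduced above; the following Python sketch of its combine step is reconstructed from the description in Sections 4.3.1 and 4.3.3 (function and variable names are ours, not the paper's). For each pair of child rows $r_\lambda, r_\rho$ consistent with $r$, the DP would add every combined signature from $\lambda.M[r_\lambda] \times \rho.M[r_\rho]$ to $\alpha.M[r]$:

```python
def combine(sigs_lambda, sigs_rho, delta_a, delta_b):
    """Combine the signature sets of the two children with the contribution
    of row r's own features: delta_a is the sum of [r_j]^2 over j in C^_alpha,
    and delta_b[i] is the sum of [p_ij r_j] over j in C^_alpha."""
    out = set()
    for a_l, b_l in sigs_lambda:
        for a_r, b_r in sigs_rho:
            a = a_l + a_r + delta_a
            b = tuple(x + y + z for x, y, z in zip(b_l, b_r, delta_b))
            out.add((a, b))
    return out

# Toy example with k = 2: two signatures from the left child, one from the right.
left = {(1, (2, 0)), (4, (1, 1))}
right = {(0, (0, 3))}
assert combine(left, right, delta_a=2, delta_b=(1, -1)) == {(3, (3, 2)), (6, (2, 3))}
```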

After the DP completes, we use the array $\alpha.M[\,]$ at the root $\alpha$ of the join tree to output this estimate for $D(J, \mathrm{span}(\mathcal{P}))$:

$$\max_{r \in T_\alpha} \; \max_{(a, b) \in \alpha.M[r]} \Big( \gamma^2 a - \gamma^2 \sum_{i=1}^k b[i]^2 \Big)$$

4.3.3 Correctness of the DP

Let $\alpha$ be any node of the join tree, and $r$ a row in the table $T_\alpha$. We argue that for any row $q$ in the join $J_\alpha^r$, $(A_\alpha(q), B_\alpha(q)) \in \alpha.M[r]$ after the DP completes processing node $\alpha$. If $\alpha$ is a leaf, this is obvious. Assume that $\alpha$ is an internal node. Let $\lambda$ and $\rho$ be the left and right child of $\alpha$, respectively. There is exactly one row $r_\lambda \in T_\lambda$ (resp. $r_\rho \in T_\rho$) that is consistent with $q$. Clearly, $r_\lambda$ and $r_\rho$ are consistent with $r$. Thus, $q$ is a concatenation of 3 rows: (a) a row $q_\lambda \in J_\lambda^{r_\lambda}$; (b) a row $q_\rho \in J_\rho^{r_\rho}$; and (c) $r$ itself.

Furthermore,

$$A_\alpha(q) = A_\lambda(q_\lambda) + A_\rho(q_\rho) + \Delta_A,$$

where $\Delta_A = \sum_{j \in \hat{C}_\alpha} [r_j]^2$. And

$$B_\alpha(q)[i] = B_\lambda(q_\lambda)[i] + B_\rho(q_\rho)[i] + \Delta_B[i],$$

where $\Delta_B[i] = \sum_{j \in \hat{C}_\alpha} [p_{ij} r_j]$. Thus the pair $(A_\alpha(q), B_\alpha(q))$ is added to $\alpha.M[r]$ by the DP when it examines the rows $r_\lambda \in T_\lambda$ and $r_\rho \in T_\rho$.

Conversely, we can also argue that if the pair $(a, b)$ is added by the DP to $\alpha.M[r]$, then $a = A_\alpha(q)$ and $b = B_\alpha(q)$ for some $q \in J_\alpha^r$.

4.3.4 Running Time

We now show that our algorithm runs in polynomial time. Consider the term $p_{ij} r_j$ for some $1 \leq i \leq k$ and $1 \leq j \leq d$, where $r$ is some row of some table that contains $j$ as a feature. Given that $p_i$ is a unit vector, $|p_{ij} r_j| \leq |r_j| \leq \max_{q \in J} \|q\|_2 = \beta$. As $|[p_{ij} r_j]| \leq |p_{ij} r_j| / \gamma$ and $\gamma = \beta/\eta$, we have

$$|[p_{ij} r_j]| \leq |p_{ij} r_j| / \gamma \leq \beta / \gamma = \eta$$

Note that $[p_{ij} r_j]$ is an integer. Now fix a node $\alpha$ in the join tree and consider any row $q \in J_\alpha$. It follows that for each $i \in [k]$, $B_\alpha(q)[i] = \sum_{j \in C_\alpha} [p_{ij} q_j]$ is an integer whose absolute value is at most $d\eta$. Similarly, $A_\alpha(q) = \sum_{j \in C_\alpha} [q_j]^2$ is an integer whose absolute value is at most $d^2 \eta^2$. The cardinality of the set $\{(A_\alpha(q), B_\alpha(q)) \mid q \in J_\alpha\}$ is therefore $O(d^{k+2} \eta^{k+2})$. Thus, the space used by the $\alpha.M[\,]$ array is $O(n \cdot d^{k+2} \eta^{k+2})$. Given that $\eta = 4R(k+1)d^2/\epsilon$, it is now easy to see that the overall running time is a polynomial.
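To make the parameter arithmetic concrete, here is the signature-count bound traced for illustrative (made-up) values of $k$, $d$, $R$, and $\epsilon$:

```python
# Illustrative parameter values, not taken from the paper.
k, d, R, eps = 2, 10, 16, 0.5

eta = 4 * R * (k + 1) * d * d / eps  # eta = 4R(k+1)d^2 / eps

# Each rounded term [p_ij q_j] is an integer with absolute value at most eta,
# so each B_alpha(q)[i] lies in [-d*eta, d*eta] and A_alpha(q) in [0, d^2*eta^2].
# The number of distinct signatures is therefore O(d^{k+2} * eta^{k+2}).
signature_bound = (d**2 * eta**2 + 1) * (2 * d * eta + 1) ** k

assert eta == 38400.0
assert signature_bound > 0
```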

4.3.5 Approximation Guarantee

Fix $q \in J$. We note that for all $1 \leq i \leq k$ and $1 \leq j \leq d$, $|p_{ij} q_j - \gamma [p_{ij} q_j]| \leq \gamma$ by design. Now, for $1 \leq i \leq k$ and $1 \leq j, j' \leq d$, we have

$$\begin{aligned} |(p_{ij} q_j)(p_{ij'} q_{j'}) - \gamma^2 [p_{ij} q_j][p_{ij'} q_{j'}]| &\leq |(p_{ij} q_j)(p_{ij'} q_{j'}) - \gamma [p_{ij} q_j](p_{ij'} q_{j'})| \\ &\quad + |\gamma [p_{ij} q_j](p_{ij'} q_{j'}) - \gamma^2 [p_{ij} q_j][p_{ij'} q_{j'}]| \\ &\leq \gamma |p_{ij'} q_{j'}| + \gamma |\gamma [p_{ij} q_j]| \\ &\leq 2\gamma\beta = 2\beta^2/\eta \end{aligned}$$

Here, the last inequality holds because we only round the $p_{ij} q_j$ toward zero, so $|\gamma [p_{ij} q_j]| \leq |p_{ij} q_j| \leq \beta$. Now, $\hat{d}(q, \mathrm{span}(\mathcal{P}))$ (Equation 1) and $\mathrm{apxdist}(q, \mathrm{span}(\mathcal{P}))$ (Equation 2) each have $(k+1)d^2$ corresponding terms, and the difference in absolute value between corresponding terms is bounded by $2\beta^2/\eta$ as above. Thus,

$$|\hat{d}(q, \mathrm{span}(\mathcal{P})) - \mathrm{apxdist}(q, \mathrm{span}(\mathcal{P}))| \leq 2(k+1)d^2 \cdot \frac{\beta^2}{\eta} = \frac{\epsilon}{2R} \beta^2 \leq \frac{\epsilon}{2} D(J, \mathrm{span}(\mathcal{P})) \tag{3}$$

Here the last step is due to Lemma 13. Let $q^* \in J$ satisfy $\hat{d}(q^*, \mathrm{span}(\mathcal{P})) = D(J, \mathrm{span}(\mathcal{P}))$. The algorithm returns $\mathrm{apxdist}(q', \mathrm{span}(\mathcal{P}))$ for a row $q' \in J$ such that

$$\begin{aligned} \hat{d}(q', \mathrm{span}(\mathcal{P})) &\geq \mathrm{apxdist}(q', \mathrm{span}(\mathcal{P})) - \frac{\epsilon}{2} D(J, \mathrm{span}(\mathcal{P})) \\ &\geq \mathrm{apxdist}(q^*, \mathrm{span}(\mathcal{P})) - \frac{\epsilon}{2} D(J, \mathrm{span}(\mathcal{P})) \\ &\geq \hat{d}(q^*, \mathrm{span}(\mathcal{P})) - \epsilon D(J, \mathrm{span}(\mathcal{P})) = (1 - \epsilon) D(J, \mathrm{span}(\mathcal{P})) \end{aligned}$$
Theorem 14.

There is a polynomial-time algorithm that, given an acyclic database with tables $T_1, T_2, \ldots, T_m$ over a total of $d$ features, a set of orthogonal unit vectors $\mathcal{P} = \{p_1, p_2, \ldots, p_k\}$ (where $k$ is a constant), and a parameter $0 < \epsilon < 1$, computes a $(1 - \epsilon)$-approximation to the farthest distance of a row in the join from the subspace spanned by $\mathcal{P}$.

4.4 Subspace Approximation

Theorem 14 can be used as a primitive to compute a $k$-subspace that approximately minimizes the maximum Euclidean distance of a row in the join to the $k$-subspace. This problem is well known in the literature as $\ell_\infty$ subspace approximation; see [14] for references. Consider the following algorithm:

It follows from (the proof of) Lemma 5.2 of [8] that for the set $S$ computed by the algorithm and sufficiently small $0 < \epsilon < 1$, we have $D(J, \mathrm{span}(S)) \leq (2(1+\epsilon))^{2k} D(J, \mathcal{F})$ for any $k$-subspace $\mathcal{F}$. That is, for constant $k$, we get an algorithm that runs in time polynomial in the size of the acyclic database and returns a $(2(1+\epsilon))^k$-approximation for the $\ell_\infty$ subspace approximation problem.

References

  • [1] Mahmoud Abo Khamis, Ryan R. Curtin, Benjamin Moseley, Hung Q. Ngo, XuanLong Nguyen, Dan Olteanu, and Maximilian Schleich. On functional aggregate queries with additive inequalities. In Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS ’19, pages 414–431, New York, NY, USA, 2019. Association for Computing Machinery. doi:10.1145/3294052.3319694.
  • [2] Mahmoud Abo Khamis, Hung Q. Ngo, and Atri Rudra. FAQ: Questions asked frequently. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS ’16, pages 13–28, New York, NY, USA, 2016. Association for Computing Machinery. doi:10.1145/2902251.2902280.
  • [3] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’07, pages 1027–1035, USA, 2007. Society for Industrial and Applied Mathematics. URL: http://dl.acm.org/citation.cfm?id=1283383.1283494.
  • [4] Jiaxiang Chen, Qingyuan Yang, Ruomin Huang, and Hu Ding. Coresets for relational data and the applications. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2024. Curran Associates Inc.
  • [5] Aryan Esmailpour and Stavros Sintos. Improved approximation algorithms for relational clustering. Proceedings of the ACM on Management of Data, 2(5):1–27, 2024. doi:10.1145/3695831.
  • [6] Michael R. Garey and David S. Johnson. Computers and Intractability; A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., USA, 1990.
  • [7] Martin Grohe. The structure of tractable constraint satisfaction problems. In Proceedings of the 31st International Conference on Mathematical Foundations of Computer Science, MFCS’06, pages 58–72, Berlin, Heidelberg, 2006. Springer-Verlag. doi:10.1007/11821069_5.
  • [8] Sariel Har-Peled and Kasturi R. Varadarajan. High-dimensional shape fitting in linear time. Discret. Comput. Geom., 32(2):269–288, 2004. doi:10.1007/S00454-004-1118-2.
  • [9] Mahmoud Abo Khamis, Sungjin Im, Benjamin Moseley, Kirk Pruhs, and Alireza Samadian. Approximate aggregate queries under additive inequalities. In Michael Schapira, editor, 2nd Symposium on Algorithmic Principles of Computer Systems, APOCS 2020, Virtual Conference, January 13, 2021, pages 85–99. SIAM, 2021. doi:10.1137/1.9781611976489.7.
  • [10] Dániel Marx. Tractable hypergraph properties for constraint satisfaction and conjunctive queries. J. ACM, 60(6), November 2013. doi:10.1145/2535926.
  • [11] Jianming Miao and Adi Ben-Israel. On principal angles between subspaces in ℝⁿ. Linear Algebra and its Applications, 171:81–98, 1992. doi:10.1016/0024-3795(92)90251-5.
  • [12] Benjamin Moseley, Kirk Pruhs, Alireza Samadian, and Yuyan Wang. Relational algorithms for k-means clustering. In Nikhil Bansal, Emanuela Merelli, and James Worrell, editors, 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021, July 12-16, 2021, Glasgow, Scotland (Virtual Conference), volume 198 of LIPIcs, pages 97:1–97:21. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.ICALP.2021.97.
  • [13] Gilbert Strang. Linear Algebra and Its Applications, 3rd Edition. Harcourt, Inc., 1988.
  • [14] David P Woodruff and Taisuke Yasuda. New subset selection algorithms for low rank approximation: Offline and online. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 1802–1813, 2023. doi:10.1145/3564246.3585100.