Relational Approximations for Subspace Primitives
Abstract
We explore fundamental geometric computations on point sets that are given to the algorithm implicitly. In particular, we are given a database, which is a collection of tables with numerical values, and the geometric computation is to be performed on the join of the tables. Explicitly computing this join can take time exponential in the size of the tables. We are therefore interested in geometric problems that can be solved by algorithms whose running time is polynomial in the size of the tables. Such relational algorithms are typically not able to explicitly compute the join.
To avoid the NP-completeness bottleneck, researchers assume that the tables have a tractable combinatorial structure, like being acyclic. Even with this assumption, simple geometric computations turn out to be non-trivial and sometimes intractable. In this article, we study the problem of computing the maximum distance of a point in the join to a given subspace, and develop approximation algorithms for this NP-hard problem.
Keywords and phrases: relational algorithm, Euclidean distance, subspace approximation
Category: APPROX
2012 ACM Subject Classification: Theory of computation → Computational geometry
Acknowledgements: The authors thank Kirk Pruhs for discussions that led to the problems studied here, and the organizers of the 2023 Workshop on Fine-Grained Complexity, Logic, and Query Evaluation at the Simons Institute for the Theory of Computing for facilitating these discussions.
Editors: Alina Ene and Eshan Chattopadhyay
Series and Publisher: Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
In this paper, we consider certain fundamental geometric computations that are trivially solvable in linear or quadratic time in the size of the input point set. In our setting, the input point set is not given explicitly, but rather in an implicit form as described below. An explicit representation can be exponentially larger than the implicit one, and the question we pursue is whether these geometric computations can be performed in time polynomial in the size of the implicit representation.
We are given a set where each is a subset of the positive integers. We interpret each element of as a coordinate axis or feature. Furthermore, for each , we are given a table (two-dimensional matrix) with at least one row and exactly columns, where each column corresponds to a feature in . The values in the table are real numbers. Given a row in , we let (or ) denote the value corresponding to feature . We assume tables do not contain duplicate rows, and thus a table specifies a set of row vectors.
We define the join of such tables as a table whose columns correspond to the union of the individual feature sets. Thus each row in the join is a vector with real components. Such a vector belongs to the join if in each table , there is a row such that and agree on the features in – for each , .
Thus, a row in the join is generated by picking rows that are pairwise compatible, that is, agreeing on the value on common features; and then “concatenating” these rows. That is, for each feature , we find an arbitrary containing feature , and let . The join consists of the set of all rows that can be generated in this way. An example of a join with feature set is shown below:
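The generation procedure just described can be sketched as a small brute-force routine (our own illustration; the table contents below are hypothetical, and computing the join this way incurs exactly the exponential blow-up discussed next):

```python
from itertools import product

def compatible(r1, r2):
    """Two rows agree on every feature they share."""
    return all(r1[f] == r2[f] for f in r1.keys() & r2.keys())

def join(tables):
    """Brute-force join of tables given as lists of {feature: value} dicts.
    Exponential in the number of tables -- the blow-up relational algorithms avoid."""
    rows = [{}]  # the empty row is compatible with everything
    for table in tables:
        rows = [{**r, **s} for r, s in product(rows, table) if compatible(r, s)]
    return rows

# Two tables sharing feature 2: T1 over features {1, 2}, T2 over {2, 3}.
T1 = [{1: 0.0, 2: 1.0}, {1: 5.0, 2: 2.0}]
T2 = [{2: 1.0, 3: 7.0}, {2: 2.0, 3: 8.0}, {2: 2.0, 3: 9.0}]
print(join([T1, T2]))  # three rows: the compatible concatenations
```

Each output row is the concatenation of pairwise-compatible rows, one per table, exactly as in the definition above.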
A relational database stores its data in the form of multiple tables as above. However, typical data analysis techniques such as clustering only work when the input data is in explicit form, as in the join . Standard practice is thus to compute the join, and then run the data analysis algorithm on it. Let denote the total number of features and the maximum number of rows in any given table. The space needed to store all the tables is easily upper bounded by , where we recall that is the number of tables. However, the number of rows in the join is in the worst case, which can be exponentially larger. This naturally raises the following question: what properties of the join can we compute using relational algorithms, defined as algorithms with running time polynomial in , , and ? See [2, 1, 9, 12] and the references therein. Notice that a relational algorithm typically is not able to explicitly construct the join.
Before describing prior work on relational algorithms, we point out the geometric nature of the join. The rows of each table can be viewed as a set of points lying in the subspace spanned by the features/coordinate axes in . The join is the set of all points , in the subspace spanned by features in , such that the projection of onto the subspace spanned by is an element of , for each .
1.1 Prior Work
Perhaps the most basic algorithmic question to ask about the join of the tables is whether it is non-empty – does it have at least one row/point in it? This turns out to be NP-Complete [7, 10, 9]. Indeed, a simple reduction from 3CNF-SAT has the property that the number of rows in the join is exactly the number of satisfying assignments to the input formula. (This is a good spot to emphasize that in this paper, the number of features and the number of tables are not viewed as constants.)
Given this state of affairs, the approach taken by papers in this area is to assume that the join has some additional structure that makes such basic algorithmic tasks tractable. The most common assumption is that the join is acyclic – this is a combinatorial assumption about how the features are distributed across the tables, and we define it in the next section. Assuming acyclic joins lets us use dynamic programming to solve certain algorithmic questions in polynomial time [2, 1]:
1. Compute the number of rows in the join.
2. Compute the maximum of the squared norm of rows in the join, that is, .
These algorithmic tasks are special cases of a sum-product query. This is a query over the join of the form
where is a commutative semiring over a set . Corresponding to feature , is an easy to compute function with range . For example, for the query asking for the number of rows in the join, is the set of integers, is the constant function that takes on the value , is multiplication over the integers, and is addition over the integers. For computing the maximum of the squared norm of the rows, is the set of real numbers, , is addition over the reals, and is the function.
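As a sanity check on the semiring formulation, here is a toy evaluator (ours, operating on an explicit join rather than relationally) that instantiates both of the queries above:

```python
from functools import reduce

def sum_product(rows, features, f, oplus, otimes):
    """Evaluate: oplus over rows x of (otimes over features j of f(j, x[j]))."""
    per_row = [reduce(otimes, (f(j, x[j]) for j in features)) for x in rows]
    return reduce(oplus, per_row)

rows = [{1: 3.0, 2: 4.0}, {1: 1.0, 2: 1.0}]
features = [1, 2]

# Count rows: f is the constant 1, the "product" is integer *, the "sum" is +.
count = sum_product(rows, features, lambda j, v: 1,
                    lambda a, b: a + b, lambda a, b: a * b)

# Max squared norm: f(j, v) = v*v, the "product" is +, the "sum" is max.
max_sq = sum_product(rows, features, lambda j, v: v * v,
                     max, lambda a, b: a + b)

print(count, max_sq)  # 2 25.0
```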
The authors in [9] study sum-product queries under constraints on the rows of the join. They allow constraints of the form , where is an easy-to-compute function, and is a constant. They consider the problem of evaluating a sum-product query over those rows in the join that satisfy the given set of constraints. They show that with two constraints, such problems are not only NP-hard but also hard to approximate to within any multiplicative factor. Concretely, they show it is NP-hard to determine if there is a row in the join that satisfies the linear constraint . This is via a reduction from the Partition problem, where we want to determine if a given set of positive integers can be partitioned into two sets whose sums are equal. Thus, it is a weak NP-hardness result [6]. The linear constraint is equivalent to the two linear inequality constraints and .
On the other hand, the authors in [9] show that with just one inequality constraint, and some additional technical assumptions, there is a randomized approximation scheme for sum-product queries that guarantees a multiplicative -approximation, for any given parameter . For instance, they use this general result to get a polynomial-time -approximation for the following counting problems:
1. Count the number of rows in the join that lie in a given half space .
2. Count the number of rows in the join that lie within a given ball (centered at point and having radius ).
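For reference, the exact versions of these two counting problems are immediate on an explicit join (our brute-force sketch; the relational algorithms of [9] achieve the approximations without materializing the join):

```python
import math

def count_in_halfspace(rows, a, b):
    """Rows x of the explicit join with <a, x> <= b."""
    return sum(1 for x in rows if sum(a[j] * x[j] for j in a) <= b)

def count_in_ball(rows, c, r):
    """Rows x with ||x - c||_2 <= r."""
    keys = sorted(c)
    return sum(1 for x in rows
               if math.dist([x[j] for j in keys], [c[j] for j in keys]) <= r)

rows = [{1: 0.0, 2: 0.0}, {1: 3.0, 2: 4.0}]
print(count_in_halfspace(rows, {1: 1.0, 2: 1.0}, 1.0))  # 1
print(count_in_ball(rows, {1: 0.0, 2: 0.0}, 5.0))       # 2
```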
The authors in [12] study the -means problem, and show how the -means++ algorithm [3] for constructing a coreset can be implemented in the relational setting, thus deriving an -approximation for the -means problem. Their core technical result is a polynomial-time sampling algorithm that, given a set of centers in , outputs a row with probability proportional to the square of the Euclidean distance of to the nearest center. This sampling result is surprising given the hardness of closely related problems. In particular, the problem of counting the number of rows in the join whose closest center is is NP-hard even for , and hard to approximate to within any multiplicative factor for . The technique of rejection sampling plays a key role in their sampling result.
Very recently, the authors in [5] present fast deterministic and randomized relational approximation algorithms for the -median and -means clustering problems assuming the dimension is a constant. The authors in [4] present a method for constructing a coreset for certain optimization problems in machine learning. The work that we have reviewed builds on a body of theoretical and experimental research on relational algorithms. We refer the reader to [9, 12, 5, 4] for surveys of this prior work.
1.2 Our Contribution
From prior work, it is evident that in the relational setting, the study of the complexity of very fundamental geometric problems yields surprising answers. More of the terrain needs to be explored to form a better picture of the computational complexity. Motivated by this, we consider the following Distance to -Subspace problem: Given a set of orthogonal unit vectors, compute the maximum Euclidean distance of a row in the join to , the subspace spanned by . As a reminder, we are assuming an acyclic database. The Distance to Subspace problem is a basic primitive for other tasks including the subspace approximation problem [8]. We obtain the following results.
1. We show, in Section 4, that the Distance to -Subspace problem is NP-hard even for .
2. We present a polynomial-time -approximation for the Distance to -Subspace problem. To do so, we introduce a generalization of the join and compute measures on that generalization. This is likely to be an algorithmic tool of independent interest. As a consequence of our technique, we show that the row rank of the join can be computed exactly in polynomial time. These results are presented in Section 3.
3. Next, we ask if for constant , we can get a multiplicative -approximation for the Distance to -Subspace problem, for any . The distance of a row in the join from the given -subspace can be written as , where and are both non-negative polynomials. and can individually be maximized/minimized using prior work on sum-product queries, but obviously this does not, by itself, give us a way of maximizing the difference .
Instead, we keep track of all possible values of the pair in our dynamic program. However, there can be an exponential number of such pairs, as is to be expected for an NP-hard problem. We can keep track of suitably discretized values of , but to get a polynomial-time algorithm using this approach, we need the instance to be well-conditioned in the following sense: the maximum 2-norm of rows in the join (that is, Distance to the -Subspace) is at most polynomially larger than the Distance to the given -subspace.
Our main technical contribution here is to show that the Distance to -Subspace problem can be reduced to polynomially many instances of the same problem that are well-conditioned. To obtain such a reduction, one tool we develop is a proof that there exists a -subspace spanned by of the standard basis vectors that is “close” to any given -subspace. Overall, our reduction gives us a -approximation (Section 4) for the Distance to -Subspace problem.
4. We observe, in Section 4, that our -approximation to Distance to -Subspace can be used as a primitive to obtain a -approximation for finding a -subspace that minimizes the maximum Euclidean distance to the join. When the point set is given explicitly (as opposed to implicitly via the join), this is well known as the -subspace approximation problem [14]. The running time of our algorithm is polynomial for constant . Our approximation factor for subspace approximation is large, and this result serves to illustrate the usefulness of the Distance to Subspace approximation primitive, which is the main focus of this work.
We emphasize that our work addresses the regime where the number of features and the number of tables are part of the input, and are not treated as constants. We are primarily focused on the approximation guarantee that we can get with polynomial running time; we do not explore further optimizations of the running time.
2 Preliminaries
In this section, we present essential preliminaries. The database that is input to each of the problems considered here consists of a collection of tables . The columns of table correspond to a subset of features, and thus each row in is a vector with real-valued components. We use to denote the maximum number of rows in any table. Let denote the set of all features. We will assume that , where denotes . We assume that the columns in any table are ordered according to the natural ordering of the features in . Thus, we can represent a row in Table unambiguously as a point in . Given such a row and feature , we will use or to denote the component of corresponding to feature .
Given two row vectors and over feature sets and , we say that and are compatible if for each feature , . We use the notation to denote that and are compatible. The concatenation of compatible rows and is the row vector over formed by combining the components of and in the obvious way.
2.1 Acyclic Databases
We say that our database of tables is acyclic if there is a tree with nodes, each associated with one of our input tables, that has the following properties:
1. Each table in the database is associated with at least one node.
2. For each feature , the set of nodes whose tables contain induces a connected subgraph of the tree.
Note that for technical reasons, we allow a table to be associated with more than one node of the tree . Given an acyclic database, such a tree can be computed in polynomial time; see for example [9].
Note that each row of the join is formed by picking one row from each table associated with each node of , such that the chosen rows are pairwise compatible, and then concatenating the chosen rows. For an acyclic database, it suffices to check compatibility between pairs of rows chosen from nodes that are neighbors in the tree . This follows from the properties of an acyclic database. This is the crucial property of acyclic databases that allows one to evaluate certain queries on the join efficiently via dynamic programming.
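The connectivity property from the definition can be checked directly on a candidate tree; the sketch below is our own (node and feature names are hypothetical):

```python
from collections import deque

def is_acyclic_witness(adj, feats):
    """Check the defining property: for every feature, the nodes whose tables
    contain it induce a connected subgraph of the tree.
    adj: node -> set of neighbors (a tree); feats: node -> set of features."""
    for f in set().union(*feats.values()):
        holders = {v for v in adj if f in feats[v]}
        start = next(iter(holders))
        seen, queue = {start}, deque([start])
        while queue:  # BFS restricted to nodes whose tables contain f
            v = queue.popleft()
            for w in adj[v]:
                if w in holders and w not in seen:
                    seen.add(w)
                    queue.append(w)
        if seen != holders:
            return False
    return True

# Path a - b - c; feature 2 at a and c but not at b violates the property.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(is_acyclic_witness(adj, {"a": {1, 2}, "b": {1}, "c": {2}}))      # False
print(is_acyclic_witness(adj, {"a": {1, 2}, "b": {1, 2}, "c": {2}}))  # True
```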
For the purpose of dynamic programming, it will be convenient to view the tree as rooted. Furthermore, we can assume that each internal node of has exactly two children. This is without loss of generality: If there is a node with exactly one child , we can make the left child of and add a right child whose table is a copy of ’s. If a node has children, then replace with a complete binary tree with internal nodes and leaves, which will now correspond to the original children of . The table at each of the internal nodes will be a copy of the table at .
After applying these two operations, we have that each internal node of has exactly two children. The total number of nodes in is . The join is unchanged: we have merely added copies of some of the original tables. Finally, still witnesses the fact that the database is acyclic, as is readily checked. We will refer to as the join tree of the database.
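The two rewriting operations can be sketched as follows (our code; for brevity it combines children pairwise rather than building the complete binary tree from the text, which changes neither the join nor the acyclicity):

```python
class Node:
    def __init__(self, table, children=None):
        self.table = table          # name/contents of the table at this node
        self.children = children or []

def binarize(node):
    """Give every internal node exactly two children, as described above.
    A one-child node gets a sibling copy of its own table; a node with more
    than two children has them combined pairwise under internal copies."""
    node.children = [binarize(c) for c in node.children]
    if len(node.children) == 1:
        node.children.append(Node(node.table))      # leaf copy of the table
    while len(node.children) > 2:
        right, left = node.children.pop(), node.children.pop()
        node.children.append(Node(node.table, [left, right]))
    return node
```

After `binarize`, every node has zero or two children, and only copies of original tables were added, so the join is unchanged.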
For a node in the tree , we use to denote the table at node , and to denote the join of all the tables in the subtree rooted at . We “assign” each feature to the highest node in tree whose table contains feature as a column. By highest, we mean the node closest to the root of . Let denote the set of features assigned to node , and denote the set of features assigned to nodes in the subtree rooted at . For a row in table , we denote the set of rows in the join that are compatible with by . The notation is reasonable as this set of rows is indeed the result of a join of and a table containing the single row .
3 Computing the Distance to a given -Subspace
Assume we are given an acyclic database consisting of tables and a set of orthogonal unit vectors , where (as we recall) the set of features in the database is . The Distance to -Subspace problem is that of computing the maximum distance of a row in the join from the subspace spanned by . That is, we want to compute
The main result of this section is a -approximation to this problem. We begin with a scheme for implicitly representing functions , and posing algorithmic problems concerning this representation.
3.1 An Implicit Representation
Suppose that for each feature , we are given a function , where is identical across all features. We assume that these functions are easily computed. In fact, for the application here, we assume that is given explicitly for each value of feature that occurs in any of the input tables. For any row in the join , let . Finally, let .
One illustrative example is to let , where we recall that the feature set is . Then for any row in the join of the tables, . Thus, . We will see more interesting examples shortly.
We are interested in computing measures of the point set . But this seems generally harder for than for . For example, there is a polynomial-time algorithm for computing using sum-product queries. However, computing seems harder. Fortunately, we can show a positive result for the norm.
Lemma 1.
There is a polynomial-time algorithm for computing .
Proof.
For , let
and
That is, (resp. ) is the maximum (minimum) value of the -th coordinate over points in . As , it is a sum-product query over the join where the commutative semiring is the set of real numbers, the operator is , and the is addition over the reals. It follows that and can be computed in polynomial time. Let .
Finally, we note that is .
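The identity behind this proof, namely that the per-coordinate maxima and minima determine the maximum ℓ∞ value, can be checked on an explicit join (our brute-force sketch; the point of Lemma 1 is that each per-coordinate extremum is itself computable relationally):

```python
def linf_max(rows, features, f):
    """max over rows x of ||F(x)||_inf, where F(x)_j = f(j, x[j]).
    Computed coordinate-wise: with M_j the max and m_j the min of f(j, x[j])
    over the join, the answer is max_j max(M_j, -m_j), since the two maxima
    commute."""
    best = 0.0
    for j in features:
        vals = [f(j, x[j]) for x in rows]
        best = max(best, max(vals), -min(vals))
    return best

rows = [{1: -3.0, 2: 2.0}, {1: 1.0, 2: -5.0}]
print(linf_max(rows, [1, 2], lambda j, v: v))  # 5.0
```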
3.2 A -approximation
We begin by showing that using the above implicit representation, we can efficiently maximize the norm of the projection onto the orthogonal complement of the given -subspace.
Lemma 2.
Suppose that for we are given a set of unit vectors that are pairwise orthogonal. We can compute
in polynomial time.
Proof.
For , let denote the projection onto the orthogonal complement of the subspace spanned by . Thus, . Viewing as a function of , we see that , where each is a constant vector – it depends on but not . (In other words, , where the projection matrix depends only on .)
For each , define the function by . It follows that for any row in the join ,
Thus, . From Lemma 1, there is a polynomial-time algorithm to compute
For any , we have . Thus we obtain the main result of this section:
Theorem 3.
Suppose that for we are given a set of unit vectors that are pairwise orthogonal. There is a polynomial-time algorithm to compute a -approximation to
Next, we consider the algorithmic task of computing the row rank of the join of the input acyclic database consisting of tables . That is, we want to compute the size of a maximal linearly independent set [13] of row vectors in . An algorithm for this follows from Lemma 2.
Theorem 4.
There is a polynomial-time algorithm that, given an acyclic database as input, computes the row rank of the join .
Proof.
We build an orthonormal basis one vector at a time. Using prior work on sum-product queries, we compute a vector that maximizes the square of the Euclidean norm. If is the zero vector, the rank of is . Otherwise, let be the unit vector corresponding to , and we initialize .
Suppose that we have computed a set of unit vectors in the row space of that are pairwise orthogonal. Using Lemma 2, we compute that achieves
If this maximum is , then the rank of is and is a basis for the row space of . Otherwise, we add the unit vector corresponding to to and continue.
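On an explicitly given point set, the greedy scheme in this proof looks as follows (our sketch; in the relational setting the maximization in the loop is the Lemma 2 computation, not a scan over points):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def residual(x, basis):
    """Component of x orthogonal to the span of an orthonormal basis."""
    r = list(x)
    for v in basis:
        c = dot(r, v)
        r = [ri - c * vi for ri, vi in zip(r, v)]
    return r

def row_rank(points, eps=1e-9):
    """Greedy scheme from the proof: repeatedly take the point farthest from
    the span of the basis built so far; stop when that distance is zero."""
    basis = []
    while True:
        far = max(points, key=lambda x: dot(residual(x, basis), residual(x, basis)))
        r = residual(far, basis)
        n2 = dot(r, r)
        if n2 <= eps:
            return len(basis)
        basis.append([ri / n2 ** 0.5 for ri in r])

print(row_rank([[1, 0, 0], [0, 2, 0], [1, 2, 0], [3, 0, 0]]))  # 2
```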
4 Maximum Distance to a -Subspace: A PTAS
Assume we are given an acyclic database consisting of tables and an orthonormal set of vectors , where (as we recall) the set of features in the database is . We would like to compute the maximum distance of a row in the join from the subspace spanned by . The distance of a point from the subspace spanned by , is . Thus, our goal is to compute . Let . As , we reformulate this as computing . The main result of this section is a polynomial-time approximation scheme (PTAS) for the problem for constant .
We begin by showing that the Distance to -Subspace problem is NP-hard even for .
Theorem 5.
If there is a polynomial-time algorithm for computing given the input acyclic database and vector , then P = NP.
Proof.
The proof is by reduction from the Partition problem. Given a set of positive integers , the goal is to determine whether can be partitioned into two parts and such that . Given such an input, we compute an acyclic database with tables and a vector as follows. The sequence of features is . For , the columns of table are and , and the columns of table are and . Each table has two rows, whose values are determined as follows.
⋮
It is clear that the database is acyclic, and in fact, a tree that witnesses this is a path. We set . Let denote the line through . Let denote the join of the tables. If , we declare that the input Partition instance has a solution; otherwise, we declare that it doesn’t.
The reduction runs in polynomial time, and we now argue that it is correct. For any partition of , we define to be the vector where for , and for each , if and if . Note that is a bijection from the set of partitions of to the set of rows of the join . Furthermore, . Note that , and for any row , we have ; all rows in have the same 2-norm. Thus, for any ,
We conclude that for any partition of ,
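The arithmetic behind this reduction can be checked directly. In the sketch below (our reconstruction: the rows are taken to be sign vectors with entries ±a_i and the line direction to be the normalized all-ones vector), the maximum squared distance to the line equals the full squared norm exactly when a balanced partition exists:

```python
import itertools, math

def max_sq_dist_to_line(a):
    """Max over sign vectors x (x_i = +/- a_i) of the squared distance from x
    to the line spanned by u = (1, ..., 1)/sqrt(n):
    ||x||^2 - <x, u>^2 = sum(a_i^2) - (sum of signed a_i)^2 / n.
    This equals sum(a_i^2) exactly when a balanced partition of a exists."""
    n = len(a)
    best = 0.0
    for signs in itertools.product((1, -1), repeat=n):
        x = [s * ai for s, ai in zip(signs, a)]
        proj = sum(x) / math.sqrt(n)
        best = max(best, sum(xi * xi for xi in x) - proj * proj)
    return best

a = [3, 1, 1, 2, 2, 1]  # 3 + 2 = 1 + 1 + 2 + 1: a balanced partition exists
print(max_sq_dist_to_line(a) == sum(ai * ai for ai in a))  # True
```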
Our PTAS for Distance to a -Subspace has three major steps. Let denote the unit vectors in the standard basis. We want to find a -subset such that is “close to” . This first step, along with the properties of , is described in Section 4.1. In the second step (Section 4.2), we use to reduce the Distance to -Subspace instance to polynomially many well-conditioned instances of the same problem. In the third step (Section 4.3), we use discretization and dynamic programming to solve a well-conditioned instance approximately in polynomial time. We conclude in Section 4.4 with an application of the PTAS to subspace approximation.
For a set , we denote the orthogonal complement of by . That is, .
4.1 Finding a Close -Subspace
Our algorithm for computing is as follows. Let be the projection of a vector on the subspace spanned by .
We state some useful properties of this algorithm:
Lemma 6.
(a) The set is an orthonormal basis for ; (b) For any , we have ; (c) We have , that is, our algorithm never attempts to add a duplicate element to ; (d) For each , we have .
Proof.
Part (a) is evident. As for (b), fix . We have . We have , and . Given , we have .
For (c), we pick such that is maximum. So . Given that , and , we have . From part (b), for all , we have , so . Thus . We have already shown part (d) in the process.
The key property of the set computed above is that the -subspaces and are close in the sense of the following theorem. For other measures of closeness of subspaces, see [11].
Theorem 7.
For any unit vector , there exists a unit vector such that . (We assume that the dimension to get this simplified bound.)
Proof.
A high-level overview of the proof is that we construct a “best-response” unit vector for each unit vector , depending on the relative magnitudes of components of . Now we proceed to the formal proof.
We represent the unit vectors as and , where . Let and denote the corresponding unit vectors.
Let , , and . We have . We have the following claim for .
Claim 8.
A is an upper triangular matrix. Each diagonal element . For the off-diagonal elements , where , we have .
Proof.
For , we have by part (b) of Lemma 6. For , we have by part (d). For the off-diagonal elements , where : since is a unit vector, .
Let . So and
We want to pick a good vector such that is large enough. Let be a small positive constant. We discuss different cases for :
- Case 1:
- Case 2:
- Case : , …,
- Case : , …,
In case 1, we pick and leave all other components 0. Then .
In case 2, we pick and leave all other components 0. Then .
Similarly, in the case , we pick and leave all other components 0. Then .
In case , we pick and leave all other components 0. Then . Similar to the case , the lower bound is .
Therefore, given that , the smallest lower bound for occurs in Case 1, where . We now argue that at least one of the cases must hold. It suffices that
The inequality holds if . Thus, .
For each , let denote the coordinate corresponding to , i.e., . Let be the set of indices corresponding to the standard basis vectors in . The following claim is shown by using observations made in the proof of Theorem 7.
Lemma 9.
For any -dimensional vector , we can efficiently compute a point such that , for all .
Proof.
For any , let be restricted to the indices in . Similarly, let be restricted to indices in .
Let . We want to show that there exists a vector such that . We have the following claim for .
Claim 10.
.
Proof.
Let . We can observe that holds for . Thus, . From Claim 8, we know . Thus .
Since is a full-rank square matrix, there exists a vector such that . Then we can compute .
We conclude with the following implication of the closeness in Theorem 7 of and .
Lemma 11.
Let be an arbitrary unit vector in the -subspace . Let be a unit vector along the projection of on . Then , where .
Proof.
Let be the projection of on the subspace spanned by . Let . According to Theorem 7, ; then .
Then
4.2 Reduction to Well-Conditioned Instances
We will use the new orthonormal basis for . Therefore we will use the notation for the subspace in the following sections.
Recall that is the set of indices corresponding to the standard basis vectors in . For each , let be the node in the join tree such that . See Section 2 for a reminder of these concepts. Let ; we use to denote the elements of the set. We can write the join as
We therefore compute for each row combination , and return the maximum of these quantities. The number of row combinations is , which is polynomial as is a constant. In the remainder of Section 4, we fix a single row combination , and explain our approximation scheme for calculating . For each , we delete all rows in the table except , and we continue to use to denote the resulting join. Let .
For any , we have . However, the individual terms and can be much larger than . We therefore apply a translation that preserves but makes these two terms small.
Let be the vector formed from by restricting to the indices in . Using Lemma 9, we compute a vector such that . We apply the translation . As we translate by a vector in , the distance to is preserved:
Lemma 12.
For any ,
Also, we can observe that for any , .
To effect the translation, we have to modify the tables. Thus, for each node in the join tree, and for each row and each feature of , we set . We henceforth refer to the as just and as .
Lemma 13.
After the modification of tables, for any , as . Therefore by Lemma 11, .
This bound on is what we mean by saying that the join is well-conditioned.
4.3 Algorithm for a Well-Conditioned Instance
We can use dynamic programming over the join tree to explicitly compute the set
. However, this set can have exponential size. Our approximation algorithm will therefore work with a rounded version of . We observe that
(1)
where is the -th element of . By Lemma 13, we know that for any ; the join is well-conditioned.
Let . Consider a node of the join tree such that its table contains a row with an entry whose absolute value exceeds . Clearly, such a row does not contribute to the eventual join , so we delete such rows. Let , where is the error parameter for the approximation scheme, and . Notice from Equation 1 that is built up from terms of the form and . For , let . We will use and as our rounded version of and accordingly – essentially we round to the nearest integer multiple of that makes the absolute value smaller. The rounded version we use for will then be
(2)
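The rounding just described, to the nearest integer multiple of the step size that does not increase the absolute value, can be written as follows (the helper name and step value are ours):

```python
import math

def round_toward_zero(v, delta):
    """Round v to the integer multiple of delta obtained by shrinking |v|,
    i.e. copysign(floor(|v| / delta) * delta, v)."""
    return math.copysign(math.floor(abs(v) / delta) * delta, v)

print(round_toward_zero(7.9, 0.5), round_toward_zero(-7.9, 0.5))  # 7.5 -7.5
```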
We now describe our dynamic programming algorithm to use the ’s to compute an approximation to .
4.3.1 Data Structures for the DP
For node in the join tree, and row , let , and
, where , for . Note that if is the root of the join tree, then
is (Equation 2).
For each node in the join tree , our algorithm stores an array indexed by the rows in table . For a row , the entry will eventually store the set
That is, a signature is stored in for each row in the join . Using the fact that the join is well-conditioned, we show later that the size of this set of signatures is bounded by a polynomial, even though the join itself can have an exponential number of rows.
4.3.2 The DP
To fill in the arrays, the algorithm does a post-order traversal of the join tree , and solves for each node of in that order.
- If is a leaf node, then for each row in , we set to . Note that , and , where . If , then the above summations evaluate to , and .
- If is an internal node, let and denote its left and right child. For each row , we initialize and compute as follows:
Algorithm 2 Compute .
After the DP completes, we use the tables at the root of the join tree to output this estimate for :
4.3.3 Correctness of the DP
Let be any node of the join tree, and a row in the table . We argue that for any row in the join , after the DP completes processing node . If is a leaf, this is obvious. Assume that is an internal node. Let and be the left and right child of , respectively. There is exactly one row (resp. ) such that (resp. ). Clearly, and . Thus, is a concatenation of rows: (a) a row in (b) a row in ; and (c) itself.
Furthermore,
where . And
where . Thus the pair is added to by the DP when looking at rows and .
Conversely, we can also argue that if the pair is added by the DP to , then and for some .
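The internal-node combination step argued here can be sketched as follows (our code; signatures are the pairs of rounded sums, and `compatible` checks a child row against the fixed row w, which by acyclicity is the only check needed):

```python
def merge_signatures(w_sig, left, right, compatible):
    """Internal-node DP step: for a fixed row w with its own contribution
    w_sig = (a_w, b_w), combine the stored signature sets of the two children
    by coordinate-wise addition, over child rows compatible with w."""
    a_w, b_w = w_sig
    out = set()
    for wl, sigs_l in left.items():
        if not compatible(wl):
            continue
        for wr, sigs_r in right.items():
            if not compatible(wr):
                continue
            out |= {(a_w + al + ar, b_w + bl + br)
                    for (al, bl) in sigs_l for (ar, br) in sigs_r}
    return out

# Toy example: one compatible row in each child.
left = {"l1": {(1, 2)}, "l2": {(7, 7)}}
right = {"r1": {(10, 20)}}
print(merge_signatures((100, 200), left, right, lambda r: r != "l2"))
# {(111, 222)}
```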
4.3.4 Running Time
We now show that our algorithm runs in polynomial time. Consider the term for some where is some row of some table that contains as a feature. Given that is a unit vector, . As , and , we have
Note that is an integer. Now fix a node in the join tree and consider any row . It follows that , is an integer whose absolute value is at most . Similarly, is an integer whose absolute value is at most . Similarly, is an integer whose absolute value is at most . The cardinality of the set is therefore . Thus, the space used by the array is . Given that , it is now easy to see that the overall running time is a polynomial.
4.3.5 Approximation Guarantee
Fix . We note that for all , by design. Now for , we have
Here, the last inequality is because we only round down the . Now, (Equation 1) and (Equation 2) each have corresponding terms, and the difference in absolute value between the corresponding terms is bounded by above. Thus,
| (3) |
Here the last step is due to Lemma 13. Let satisfy . The algorithm returns for a row such that
Theorem 14.
There is a polynomial-time algorithm that, given an acyclic database with tables over a total of features, a set of orthogonal unit vectors (where is a constant), and a parameter , computes a -approximation to the farthest distance of a row in the join from the subspace spanned by .
4.4 Subspace Approximation
Theorem 14 can be used as a primitive to compute a -subspace that approximately minimizes the maximum Euclidean distance of a row in the join to the -subspace. This problem is well known in the literature as -subspace approximation; see [14] for references. Consider the following algorithm:
It follows from (the proof of) Lemma 5.2 of [8] that for the set computed by the algorithm and sufficiently small , we have , for any -subspace . That is, for constant , we get an algorithm that runs in time polynomial in the size of the acyclic database and returns a -approximation for the -subspace approximation problem.
References
- [1] Mahmoud Abo Khamis, Ryan R. Curtin, Benjamin Moseley, Hung Q. Ngo, XuanLong Nguyen, Dan Olteanu, and Maximilian Schleich. On functional aggregate queries with additive inequalities. In Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS ’19, pages 414–431, New York, NY, USA, 2019. Association for Computing Machinery. doi:10.1145/3294052.3319694.
- [2] Mahmoud Abo Khamis, Hung Q. Ngo, and Atri Rudra. FAQ: Questions asked frequently. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS ’16, pages 13–28, New York, NY, USA, 2016. Association for Computing Machinery. doi:10.1145/2902251.2902280.
- [3] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’07, pages 1027–1035, USA, 2007. Society for Industrial and Applied Mathematics. URL: http://dl.acm.org/citation.cfm?id=1283383.1283494.
- [4] Jiaxiang Chen, Qingyuan Yang, Ruomin Huang, and Hu Ding. Coresets for relational data and the applications. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2024. Curran Associates Inc.
- [5] Aryan Esmailpour and Stavros Sintos. Improved approximation algorithms for relational clustering. Proceedings of the ACM on Management of Data, 2(5):1–27, 2024. doi:10.1145/3695831.
- [6] Michael R. Garey and David S. Johnson. Computers and Intractability; A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., USA, 1990.
- [7] Martin Grohe. The structure of tractable constraint satisfaction problems. In Proceedings of the 31st International Conference on Mathematical Foundations of Computer Science, MFCS’06, pages 58–72, Berlin, Heidelberg, 2006. Springer-Verlag. doi:10.1007/11821069_5.
- [8] Sariel Har-Peled and Kasturi R. Varadarajan. High-dimensional shape fitting in linear time. Discret. Comput. Geom., 32(2):269–288, 2004. doi:10.1007/S00454-004-1118-2.
- [9] Mahmoud Abo Khamis, Sungjin Im, Benjamin Moseley, Kirk Pruhs, and Alireza Samadian. Approximate aggregate queries under additive inequalities. In Michael Schapira, editor, 2nd Symposium on Algorithmic Principles of Computer Systems, APOCS 2020, Virtual Conference, January 13, 2021, pages 85–99. SIAM, 2021. doi:10.1137/1.9781611976489.7.
- [10] Dániel Marx. Tractable hypergraph properties for constraint satisfaction and conjunctive queries. J. ACM, 60(6), November 2013. doi:10.1145/2535926.
- [11] Jianming Miao and Adi Ben-Israel. On principal angles between subspaces in R^n. Linear Algebra and its Applications, 171:81–98, 1992. doi:10.1016/0024-3795(92)90251-5.
- [12] Benjamin Moseley, Kirk Pruhs, Alireza Samadian, and Yuyan Wang. Relational algorithms for k-means clustering. In Nikhil Bansal, Emanuela Merelli, and James Worrell, editors, 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021, July 12-16, 2021, Glasgow, Scotland (Virtual Conference), volume 198 of LIPIcs, pages 97:1–97:21. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.ICALP.2021.97.
- [13] Gilbert Strang. Linear Algebra and Its Applications, 3rd Edition. Harcourt, Inc., 1988.
- [14] David P Woodruff and Taisuke Yasuda. New subset selection algorithms for low rank approximation: Offline and online. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 1802–1813, 2023. doi:10.1145/3564246.3585100.
