
Faster & Deterministic FPT Algorithm for Worst-Case Tensor Decomposition

Vishwas Bhargava, California Institute of Technology, Pasadena, CA, USA
Devansh Shringi, University of Toronto, Canada
Abstract

We present a deterministic $2^{k^{O(1)}}\,\mathrm{poly}(n,d)$ time algorithm for decomposing $d$-dimensional, width-$n$ tensors of rank at most $k$ over $\mathbb{R}$ and $\mathbb{C}$. This improves upon the previous randomized $2^{k^{k^{O(k)}}}\,\mathrm{poly}(n,d)$ time algorithm of Peleg, Shpilka, and Volk (ITCS '24) and the deterministic $n^{k^k}$ time algorithms of Bhargava, Saraf, and Volkovich (STOC '21).

Our work resolves an open question asked by Peleg, Shpilka, and Volk (ITCS '24) on whether a deterministic Fixed Parameter Tractable (FPT) algorithm exists for worst-case tensor decomposition. We also make substantial progress on the fundamental problem of how the tractability of tensor decomposition varies as the tensor rank increases. Our result implies that we can achieve deterministic polynomial-time decomposition as long as the rank of the tensor is at most $(\log n)^{1/C}$, where $C$ is some fixed constant independent of $n$ and $d$. Further, we note that there cannot exist a polynomial-time algorithm for $k = \omega(\log n)$ unless ETH fails. Our algorithm works for all fields; however, the time complexity worsens to $2^{k^{k^{O(1)}}}$ and randomization is required over finite fields of large characteristic. Both conditions are provably necessary unless there are improvements in the state of the art for polynomial system solving over the corresponding fields.

Our approach achieves this by designing a proper learning (reconstruction) algorithm for set-multilinear depth-3 arithmetic circuits. On a technical note, we design a “partial” clustering algorithm for set-multilinear depth-3 arithmetic circuits that lets us isolate a cluster from any set-multilinear depth-3 circuit while preserving the structure of the circuit.

Keywords and phrases:
Algebraic circuits, Deterministic algorithms, FPT algorithm, Learning circuits, Reconstruction, Tensor Decomposition, Tensor Rank
Category:
Track A: Algorithms, Complexity and Games
Copyright and License:
© Vishwas Bhargava and Devansh Shringi; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Algebraic complexity theory; Theory of computation → Pseudorandomness and derandomization
Related Version:
Full Version: https://eccc.weizmann.ac.il/report/2024/123/
Editors:
Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis

1 Introduction

Tensors, higher-dimensional analogs of matrices, are multi-dimensional arrays with entries from a field $\mathbb{F}$. For instance, a 3-dimensional tensor can be written as $\mathcal{T} = (\alpha_{i,j,k}) \in \mathbb{F}^{n_1\times n_2\times n_3}$. Tensor rank and tensor decomposition have become fundamental tools in various branches of modern science, with applications in machine learning, statistics, signal processing, and computational complexity. Even within machine learning theory, tensor decomposition algorithms are leveraged to give efficient algorithms for phylogenetic reconstruction [40], topic modeling [2], community detection [3], independent component analysis [39], and learning various mixture models [27, 30]. For more details on the applications of tensor decomposition and the rich theory of tensors in general, we refer the reader to the detailed monograph by Landsberg [37] and the references therein.

We will work with general $d$-dimensional tensors $\mathcal{T} = (\alpha_{j_1,j_2,\ldots,j_d}) \in \mathbb{F}^{n_1\times\cdots\times n_d}$. The rank of a tensor $\mathcal{T}$ is defined as the smallest $r$ for which $\mathcal{T}$ can be written as a sum of $r$ tensors of rank 1, where a rank-1 tensor is a tensor of the form $v_1 \otimes \cdots \otimes v_d$ with $v_i \in \mathbb{F}^{n_i}$. Here, $\otimes$ denotes the Kronecker (outer) product, also known as the tensor product. The expression of $\mathcal{T}$ as a sum of such rank-1 tensors over the field $\mathbb{F}$ is called $\mathbb{F}$-tensor decomposition, or simply tensor decomposition.
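To make the definitions concrete, here is a minimal numpy illustration (ours, not from the paper) of rank-1 tensors as outer products and of building a tensor of rank at most 2:

```python
import numpy as np

# A rank-1 tensor is an outer product v1 (x) v2 (x) v3 of vectors.
v1, v2, v3 = np.array([1., 2.]), np.array([0., 1., 1.]), np.array([3., 1.])
rank1 = np.einsum('i,j,k->ijk', v1, v2, v3)       # shape (2, 3, 2)

# A sum of two rank-1 terms gives a tensor of rank at most 2.
w1, w2, w3 = np.array([1., 0.]), np.array([1., 1., 0.]), np.array([0., 2.])
T = rank1 + np.einsum('i,j,k->ijk', w1, w2, w3)
print(T.shape)                                    # (2, 3, 2)
```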

Tensor decomposition is a notoriously difficult problem (the seminal work of Hillar and Lim [26] aptly justifies the use of this adjective). Håstad [25] showed that determining the tensor rank is NP-hard over $\mathbb{Q}$ and NP-complete over finite fields. The follow-up work of Schaefer and Stefankovic [42] further improved our understanding by giving a very tight reduction from algebraic system solving. Formally, they show that for any field $\mathbb{F}$, given a system $S$ of algebraic equations over $\mathbb{F}$, we can in polynomial time construct a 3-dimensional tensor $\tau_S$ and an integer $k$ such that $S$ has a solution in $\mathbb{F}$ iff $\tau_S$ has rank at most $k$ over $\mathbb{F}$.

Despite the known hardness results, tensor decomposition remains widely studied due to its broad applicability. There is a rich history of tensor decomposition algorithms in the literature. Most of these algorithms are average-case (i.e., they work when the entries are sampled randomly from a distribution), or they work when the input is non-degenerate (i.e., when the tensor components are chosen from a Zariski open set), or they are heuristic. See, for instance, [14, 4, 36]. This area also includes interesting active research directions, such as making these algorithms noise-resilient and studying their smoothed analysis [6, 43, 16].
On the other hand, tensor decomposition in the worst-case scenario has only recently started to receive attention. First, the work of Bhargava, Saraf, and Volkovich [11] provided an $n^{k^k}$ time algorithm to decompose rank-$k$ tensors, showing that an efficient algorithm exists to decompose fixed-rank tensors. The next natural step in furthering the understanding of this NP-hard problem is studying its Fixed Parameter Tractability (FPT). In the FPT setting, the input instance comes with a parameter $k$ quantifying some specific quantity (e.g., the optimum treewidth of a graph). In our case of tensor decomposition, this parameter is the tensor rank. This setting treats $k$ as a parameter much smaller than the instance size $n$, thus relaxing the required runtime of the algorithm from $n^{O(1)}$ to $f(k)\cdot n^{O(1)}$, for any computable function $f$. The class FPT comprises parameterized problems that admit an algorithm with this running time.

Peleg, Shpilka, and Volk [41] designed a randomized FPT algorithm for decomposing rank-$k$ tensors. Specifically, they provided a randomized $k^{k^{k^{O(k)}}}\,\mathrm{poly}(n,d)$ time algorithm for decomposing rank-$k$ tensors. As a direct consequence, they demonstrated that tensors can be decomposed in polynomial time even beyond constant tensor rank, up to $k = O(\log\log\log n/\log\log\log\log n)$.

Both these works use a standard connection between tensor decomposition and reconstruction of a special case of depth-3 circuits. We start by describing this connection below.

Tensors as depth-3 arithmetic circuits

For a tensor $\mathcal{T} = (\alpha_{j_1,j_2,\ldots,j_d}) \in \mathbb{F}^{n_1\times\cdots\times n_d}$, consider the following polynomial:

$$f_{\mathcal{T}}(X) \;\stackrel{\Delta}{=}\; \sum_{(j_1,\ldots,j_d)\,\in\,[n_1]\times\cdots\times[n_d]} \alpha_{j_1,j_2,\ldots,j_d}\; x_{1,j_1}\, x_{2,j_2}\cdots x_{d,j_d}.$$

Let $C(X) = \sum_{i=1}^{k}\prod_{j=1}^{d} \ell_{i,j}(X_j)$ be a set-multilinear depth-3 circuit over $\mathbb{F}$ respecting the partition $\sqcup_{j\in[d]} X_j$ (denoted by $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(k)$), and computing $f_{\mathcal{T}}(X)$. Then observe that

$$\mathcal{T} = \sum_{i=1}^{k} \mathbf{v}^{(i,1)} \otimes\cdots\otimes \mathbf{v}^{(i,d)},$$

where $\mathbf{v}^{(i,j)}$ corresponds to the linear form $\ell_{i,j}$ viewed as an $n_j$-dimensional vector over $\mathbb{F}$. Indeed, it is easy to see that a tensor $\mathcal{T} = (\alpha_{j_1,j_2,\ldots,j_d}) \in \mathbb{F}^{n_1\times\cdots\times n_d}$ has rank at most $r$ if and only if $f_{\mathcal{T}}(X)$ can be computed by a $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(r)$ circuit. Therefore, the rank of $\mathcal{T}$ is the smallest $k$ for which $f_{\mathcal{T}}(X)$ can be computed by a $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(k)$ circuit. We say that a $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}$ circuit has width $w$ if $|X_j| \le w$ for all $j$.

Now that we understand the direct correspondence between tensors and set-multilinear depth-3 arithmetic circuits, we will, for simplicity of analysis, stick to this polynomial representation of tensors. We emphasize that all operations on the polynomial associated with a tensor can be interpreted as standard efficient operations on tensors themselves. We will describe these basic operations used in our algorithm in terms of tensor notation, which follows directly from the above equivalence.

For instance, any partial evaluation of $f_\tau$ on a subset of the variable partition (say, $X_1 = \mathbf{a}_1, X_3 = \mathbf{a}_3, X_8 = \mathbf{a}_8$) is simply the weighted contraction of the tensor along those directions ($1, 3, 8$) with weights $\mathbf{a}_1, \mathbf{a}_3$, and $\mathbf{a}_8$, respectively. A complete evaluation is simply the dot product of $\tau$ and the outer product of the evaluation points, that is, $f_\tau(\mathbf{a}_1,\ldots,\mathbf{a}_d) = \langle \tau, \otimes_{i\in[d]} \mathbf{a}_i \rangle$. Note that the linear forms in the circuit representation of $f_\tau$ correspond to vectors in the tensor decomposition.
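The following small numpy sketch (our illustration, for $d = 3$ over $\mathbb{R}$) shows both operations: a complete evaluation as the dot product with the outer product of the evaluation points, and a partial evaluation as a weighted contraction along one direction:

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.standard_normal((2, 3, 2))              # a 3-dimensional tensor
a1, a2, a3 = rng.standard_normal(2), rng.standard_normal(3), rng.standard_normal(2)

# Complete evaluation: f_T(a1, a2, a3) = <T, a1 (x) a2 (x) a3>.
full = np.einsum('ijk,i,j,k->', T, a1, a2, a3)

# Partial evaluation X_1 = a1: contract T along direction 1 with weights a1.
contracted = np.einsum('ijk,i->jk', T, a1)      # a tensor in the remaining directions
assert np.isclose(full, np.einsum('jk,j,k->', contracted, a2, a3))
```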

Keeping this connection in mind, we get that if we can learn the optimal representation of $f_\tau$ as a $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}$ circuit, that will directly give an optimal tensor decomposition of $\tau$ as well. The problem of learning a circuit representation of a polynomial from black-box (a.k.a. oracle/membership query) access to the polynomial is referred to as arithmetic circuit reconstruction. This problem is the algebraic analog of exact learning in Boolean circuit complexity [5]. In recent years, much attention has been focused on reconstruction algorithms for various interesting subclasses of arithmetic circuits, both in the worst-case setting [7, 35, 34, 17, 10] and in the average-case setting [23, 24, 33, 19, 9]. Given the natural expressibility of polynomials (and algebraic circuits), arithmetic circuit reconstruction connects to many other learning problems: it is not restricted to tensor decomposition, but extends to other fundamental learning problems like learning mixtures of Gaussians and subspace clustering [31, 20, 12]. Finding other such connections is a very interesting research direction.

Returning to the aforementioned connection, we now focus on learning these special depth-3 circuits. A set-multilinear depth-3 circuit with respect to a partition $X$ and top fan-in $k$, denoted by $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(k)$, computes a set-multilinear polynomial of the form:

$$C(X) = \sum_{i=1}^{k}\prod_{j=1}^{d} \ell_{i,j}(X_j),$$

where $\ell_{i,j}(X_j)$ is a linear form in $\mathbb{F}[X_j]$. A polynomial $P \in \mathbb{F}[X]$ is called set-multilinear with respect to the partition $X = \sqcup_j X_j$ if every monomial that appears in $P$ is of the form $x_{i_1} x_{i_2}\cdots x_{i_d}$, where $x_{i_j} \in X_j$. The relation between tensors and depth-3 set-multilinear circuits is discussed in detail in the full version of the paper. We assume black-box (or oracle/membership query) access to the polynomial $f$ rather than access to the full tensor $\tau$. Simply reading the entries of the tensor would require $n^d$ time. However, in the low-rank case, we can learn the decomposition with far fewer measurements. This approach, known as studying or decomposing tensors using evaluations of $f$ or “measurements” of the tensor $\tau$, has been extensively studied in the literature [18, 13].

Non-uniqueness of tensor decomposition.

One reason for the hardness of tensor decomposition is its non-uniqueness. This is evident even in the average-case setting due to the limitations of any tensor decomposition algorithm beyond Kruskal’s uniqueness bounds [38, 47]. Naturally, in the worst-case setting, things become even more convoluted.

$$\begin{pmatrix}1\\0\end{pmatrix}\!\otimes\!\begin{pmatrix}1\\0\end{pmatrix}\!\otimes\!\begin{pmatrix}1\\0\\0\end{pmatrix} - 5\begin{pmatrix}1\\1\end{pmatrix}\!\otimes\!\begin{pmatrix}1\\1\end{pmatrix}\!\otimes\!\begin{pmatrix}1\\1\\1\end{pmatrix} + 10\begin{pmatrix}1\\2\end{pmatrix}\!\otimes\!\begin{pmatrix}1\\2\end{pmatrix}\!\otimes\!\begin{pmatrix}1\\2\\4\end{pmatrix} \;=\; 10\begin{pmatrix}1\\3\end{pmatrix}\!\otimes\!\begin{pmatrix}1\\3\end{pmatrix}\!\otimes\!\begin{pmatrix}1\\3\\9\end{pmatrix} - 5\begin{pmatrix}1\\4\end{pmatrix}\!\otimes\!\begin{pmatrix}1\\4\end{pmatrix}\!\otimes\!\begin{pmatrix}1\\4\\16\end{pmatrix} + \begin{pmatrix}1\\5\end{pmatrix}\!\otimes\!\begin{pmatrix}1\\5\end{pmatrix}\!\otimes\!\begin{pmatrix}1\\5\\25\end{pmatrix}$$

Figure 1: An example of non-uniqueness of tensor decomposition by Derksen [15].
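Derksen's identity can be checked numerically; the following short script (ours) builds both sides with $v(t) = (1,t)\otimes(1,t)\otimes(1,t,t^2)$ and verifies that they agree, exhibiting two distinct rank-3 decompositions of the same tensor:

```python
import numpy as np

def v(t):
    # (1, t) (x) (1, t) (x) (1, t, t^2)
    return np.einsum('i,j,k->ijk', np.array([1., t]), np.array([1., t]),
                     np.array([1., t, t * t]))

lhs = v(0) - 5 * v(1) + 10 * v(2)
rhs = 10 * v(3) - 5 * v(4) + v(5)
print(np.allclose(lhs, rhs))                    # True
```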

Fortunately, thanks to strong structural results on polynomial identities of depth-3 circuits, we can obtain strong almost-uniqueness results for such circuit representations. These are structural results for identically zero $\Sigma\Pi\Sigma(k)$ circuits, and they essentially show that under some mild conditions, any $\Sigma\Pi\Sigma(k)$ circuit that computes the identically zero polynomial must have its linear forms contained in a “low-dimensional” space. In the case of set-multilinear depth-3 circuits, this implies that the circuit is of “low degree” after stripping the linear factors. Observe that two different ways of writing a polynomial constitute an elaborate polynomial identity. This directly yields that two $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}$ decompositions of $f_\tau$ can differ on only $\mathrm{poly}(k)$ linear forms. For more details on these structural results for $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}$ decompositions, we refer the reader to [11, Section 3.4.2].

2 Our Results

As described in the previous section, the problem of worst-case tensor decomposition, or equivalently, proper learning of $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(k)$ circuits, was studied in [11], where the authors presented a deterministic algorithm with a running time of $\mathrm{poly}(d^{k^3}, k^{k^{k^{10}}}, n)$. This algorithm runs in polynomial time when $k$ is a constant. This was extended beyond constant rank in the recent work of [41], where they proposed a randomized algorithm with a running time of $k^{k^{k^{O(k)}}}\,\mathrm{poly}(n,d)$. This algorithm runs in randomized polynomial time for tensor rank $k = O\!\left(\frac{\log\log\log n}{\log\log\log\log n}\right)$.

In [41, Section 1.4], two open problems were posed: Firstly, to understand how the tractability of tensor decomposition changes as tensor rank increases; and secondly, whether a deterministic fixed-parameter tractable (FPT) algorithm exists for worst-case tensor decomposition.

We make significant progress in answering the first question and completely resolve the second. We now state our result for learning set-multilinear ΣΠΣ(k) arithmetic circuits.

Theorem 1 (Learning $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(k)$ circuits).

Given black-box access to a degree-$d$, $n$-variate polynomial $f$ computable by a set-multilinear $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(k)$ circuit $C$ with top fan-in $k$ over $\mathbb{F} = \mathbb{R}$ or $\mathbb{C}$, there exists a deterministic algorithm that outputs a set-multilinear $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(k)$ circuit over $\mathbb{F}$ with top fan-in $k$ computing $f$ in time $F(k,n,d) = 2^{k^{O(1)}}\,\mathrm{poly}(n,d)$.

As a direct consequence of the above theorem, we get the following corollary.

Corollary 2 (Decomposing rank-k tensors).

Let $\mathcal{T} \in \mathbb{F}^{n_1\times\cdots\times n_d}$ be a $d$-dimensional tensor of rank at most $k$, with $\mathbb{F} = \mathbb{R}$ or $\mathbb{C}$. Let $n = \sum_{i=1}^{d} n_i$. Given black-box access to measurements of $\mathcal{T}$ (equivalently, to evaluations of $f_{\mathcal{T}}$), there exists a deterministic $\mathrm{poly}(2^{k^{O(1)}}, d, n)$ time algorithm for computing a decomposition of $\mathcal{T}$ as a sum of at most $k$ rank-1 tensors.

Note that just reading the entries of the tensor would require $n^d$ time. However, the above corollary states that in the low-rank case, we can learn the decomposition with far fewer measurements of the tensor.

As a direct consequence of the above result, we can perform tensor decomposition for any tensor of tensor rank $k = O((\log n)^{1/C})$ in $\mathrm{poly}(n,d)$ time, for a fixed constant $C$. This substantially improves the previously known tensor rank bound of $k = O\!\left(\frac{\log\log\log n}{\log\log\log\log n}\right)$.

 Remark 3.

The only point at which the field plays a crucial role in our work is in the polynomial system-solving process (which is provably necessary; see Discussion 2). Aside from this, our techniques are field-independent and deterministic. The computational model in Theorem 1 assumes input entries are integers (or rationals); without loss of generality, we can maintain this assumption by approximating or truncating any given real/complex numbers to a specific number of bits.

The coefficients in our output are derived from solutions to polynomial systems of equations. These solutions lie in an algebraic extension over the base field, and their δ-rational approximations can be computed with an additional factor of poly(log(1/δ)) in the time complexity; see, for instance, [21, Remark on page 2].

On the intractability/hardness side, using Håstad's [25] tight reduction between tensor rank and SAT, we see that even testing tensor rank requires $2^{\Omega(k)}\,\mathrm{poly}(n,d)$ time assuming the Exponential Time Hypothesis (ETH, [29]). This demonstrates that our time complexity is somewhat tight (with respect to the parameter $k$). We formalize this in the observation below.

Observation 4.

Given a $d$-dimensional tensor $\tau \in \mathbb{F}^{n\times n\times\cdots\times n}$ (with $d$ dimensions) and $k \in \mathbb{N}$, if there exists a $2^{o(k)}\,\mathrm{poly}(n,d)$ time algorithm for testing whether $\mathrm{rank}(\tau) \le k$, then ETH is false.

Proof.

Note that Håstad ([25, Lemma 2]) converts a 3-SAT instance with $n$ variables and $m$ clauses to a tensor $\tau \in \mathbb{F}^{(2+n+2m)\times 3n\times(3n+m)}$ that has rank $k = 4n+2m$ if and only if the 3-SAT instance is satisfiable (we can assume $n = m$ by an easy reduction from general 3-SAT). Thus, if there exists a $2^{o(k)}\,\mathrm{poly}(n,d)$ time algorithm for testing tensor rank, this would yield a $2^{o(n)}$-time algorithm for 3-SAT, since the reduction incurs only a linear blow-up, thus refuting ETH.

Comparison with previous works

Two main works [11, 41] have previously addressed this problem using similar approaches; below, we describe why exactly the approaches in these two works fail to achieve the required deterministic running time of $2^{k^{O(1)}}$.

  • In [11], the authors develop a way of computing almost all of the linear forms (i.e., the last layer) in the circuit $C$. After this, they have to brute-force search over all subsets of the variable partition $\{\sqcup_j X_j\}$ of size $O(k)$ to obtain linear forms from each of the gates. This violates the running-time requirement for an FPT algorithm, as it adds a $d^{O(k)}$ factor to the running time.

  • In [41], the authors focus on the reconstruction of multilinear circuits and consequently provide a reconstruction algorithm for set-multilinear circuits. Their approach hinges on obtaining black-box access to a cluster of multilinear (multiplication) gates such that the cluster representation is unique. The benefit of this approach is that, thanks to its uniqueness, we can aim to obtain black-box access to a cluster of gates. However, even after removing the linear factors from these clusters, the degree can be as high as $k^{k^{O(k)}}$. To learn this linear-factor-free part of the cluster (referred to as the simple part), they have to solve a system of polynomial equations with $k^{k^{O(k)}}$ variables, which takes $k^{k^{k^{O(k)}}}$ time over the fields $\mathbb{F} = \mathbb{R}$ or $\mathbb{C}$. Although fixed-parameter tractable (FPT), this leads to a very inefficient exponential-tower dependence on $k$. Another important distinction is that the algorithm in [41] is randomized, placing tensor rank in randomized FPT time. As observed in [41, Remark 7.3], in the process of getting black-box access to a cluster, the authors have to do polynomial identity testing of a degree-$n^{k^{k^{O(k)}}}$ polynomial, which is well beyond current PIT techniques to do deterministically in $T(k)\,\mathrm{poly}(n)$ time. Therefore, there is no easy derandomization of their algorithm.

We summarize the comparison in the table below.

Results | Algorithm Type | Running Time | Key Bottleneck
[11] | Deterministic | $\mathrm{poly}(d^{k^3}, k^{k^{k^{10}}}, n)$ | Brute-force search in a $d^{O(k)}$-sized list
[41] | Randomized | $k^{k^{k^{O(k)}}}\,\mathrm{poly}(n,d)$ | System solving with $k^{k^{O(k)}}$ unknowns
This Work | Deterministic | $2^{k^{O(1)}}\,\mathrm{poly}(n,d)$ | System solving with $k^{O(1)}$ unknowns
Can we improve the time complexity further to $2^{O(k)}\,\mathrm{poly}(n,d)$?

Our approach is based on using “uniqueness” results (or rank bounds, in the algebraic complexity literature) for tensor decomposition, which require $d = \Omega(k)$. If we use polynomial system solving to learn this low-degree set-multilinear circuit, it creates a system of equations in $\Omega(k^2)$ variables. With these parameters, the currently known best techniques for solving a system of polynomial equations require $2^{k^{O(1)}}$ running time, making it difficult to directly improve the running time to $2^{O(k)}\,\mathrm{poly}(n,d)$. An improvement to $2^{O(k)}\,\mathrm{poly}(n,d)$ would either require a new approach to learning low-degree circuits or a detailed study of these algebraic systems of equations to achieve better time complexity in finding their solutions. However, we cannot rule out the possibility of the decision version of this problem being solved in $2^{O(k)}\,\mathrm{poly}(n)$ time. We leave this as an open problem.

Over other fields

Our algorithm works for all fields; however, the time complexity worsens to $2^{k^{k^{O(1)}}}$ and randomization is required over finite fields of large characteristic. Both conditions are provably necessary, as we will discuss now. Let us start by stating our result in full generality, for all fields.

Theorem 5 (Learning $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(k)$ over a general field $\mathbb{F}$).

Given black-box access to a degree-$d$, $n$-variate polynomial $f$ computable by a set-multilinear $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(k)$ circuit $C$ over $\mathbb{F}$, there exists an algorithm that outputs a set-multilinear $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(k)$ circuit computing $f$ over a $\mathrm{poly}(k^{k^{O(1)}})$-degree extension of $\mathbb{F}$, in time $F(k,n,d) = 2^{k^{k^{O(1)}}}\,\mathrm{poly}(n,d)\cdot c_{\mathbb{F}}(2^{k^{O(1)}})$.

Here, $c_{\mathbb{F}}(D)$ denotes the time complexity of factorizing a univariate polynomial of degree $D$ over $\mathbb{F}$.

The reason for the different time complexities over different fields lies in the difficulty of system solving over these fields. Our algorithm uses as a subroutine an algorithm for solving algebraic systems with $k^{O(1)}$ unknowns. The above result follows directly from using the best algorithm for algebraic system solving over the given field $\mathbb{F}$. This sensitivity to the underlying field is natural in tensor decomposition, as different fields can significantly affect the computational complexity. For example, over the rationals ($\mathbb{Q}$), determining the exact tensor rank (even for constant tensor rank) is believed to be undecidable [44, 42].

In fact, a reduction in [42] solidifies this connection, demonstrating a tight correspondence between the feasibility of algebraic systems over $\mathbb{F}$ and computing tensor rank over $\mathbb{F}$:

Theorem 6 ([42]).

For any field $\mathbb{F}$, given a system of $m$ algebraic equations $S$ over $\mathbb{F}$, we can construct in polynomial time a 3-dimensional tensor $\mathcal{T}_S$ of shape $[3m]\times[3m]\times[n+1]$ and an integer $k = 2m+n$ such that $S$ has a solution in $\mathbb{F}$ if and only if $\mathcal{T}_S$ has rank at most $2m+n$ over $\mathbb{F}$.

Even restricting ourselves to finite fields, the time complexity of Theorem 5 depends on the efficiency of univariate polynomial factorization over the field. Efficient randomized algorithms exist for this task, but derandomizing them remains a notoriously difficult problem, as noted in [1, Problem 15].

Interestingly, the hardness of derandomizing univariate polynomial factorization over 𝔽 directly impacts tensor decomposition, as observed by Volkovich [46]. Specifically, there is no known way to derandomize Theorem 5 over finite fields unless we can derandomize the factorization of univariate (even quadratic) polynomials over finite fields.

Theorem 7 ([46, Theorem 5]).

If the class of (set-)multilinear $\Sigma\Pi\Sigma(2)$ circuits over the field $\mathbb{F}$ is learnable as multilinear $\Sigma\Pi\Sigma(k)$ circuits of polynomial size with $k = O(1)$, then, in deterministic $\mathrm{poly}(\log|\mathbb{F}|)$ time, the learning algorithm can be converted to an algorithm that computes the square root of any element in $\mathbb{F}$, if it exists.

3 Proof Idea

As the introduction mentions, we will use the connection between tensor rank and set-multilinear depth-3 circuits. For readers who are not very familiar with the algebraic circuit representation of tensors, we emphasize that we are simply performing standard efficient operations on tensors and use this notation solely for ease of analysis. We refer the reader to the corresponding discussion in Section 1 to familiarize themselves with this notation. Overall, we are simply performing weighted contractions (partial evaluations) on our tensor after a change of basis (variable reduction) to first learn most of the vectors in the decomposition. We then use the uniqueness of low-rank tensor decomposition in high-dimensional settings to learn the remaining vectors in the decomposition. We now give a detailed sketch of our algorithm in the terminology of set-multilinear depth-3 circuits.

Recall that we only need black-box access to $f_\tau$, i.e., we can query evaluations of $f_\tau$ at fixed points in $\mathbb{F}^n$ in constant time, rather than requiring a dense representation of the tensor/polynomial.

Through some preprocessing, we can assume the following:

  • We know $k$, the rank of the tensor, or equivalently the minimum $k$ such that $f_\tau$ is computable by a $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(k)$ circuit. Indeed, we can guess the value of $k$ and output the first $k$ for which the learning algorithm works, since we can always test whether our output is correct deterministically using an existing black-box PIT algorithm. This affects the time complexity by a factor of at most $k$.

  • We can strip off any linear factors of $f_\tau$ (namely $\mathrm{Lin}(f_\tau) := \prod\{\ell : \ell \mid f_\tau\}$), since any optimal decomposition (with tensor rank $k$) of $\mathrm{NonLin}(f_\tau) = f_\tau/\mathrm{Lin}(f_\tau)$ gives an optimal decomposition for $f_\tau$ as well. This comes from a result on factoring of $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(k)$ circuits from [45], described in Lemma 10, which shows that if $f_\tau$ is computable by a $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(k)$ circuit, then every irreducible factor of $f_\tau$ (and hence $\mathrm{NonLin}(f_\tau)$, as a product of such factors) can also be computed by a $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(k)$ circuit. It also describes how to obtain black-box access to these irreducible factors in $2^{O(\log^2 k)}\,\mathrm{poly}(n,d)$ time.

  • We can assume that $|X_j| \le k$. This follows from the width reduction step; see [11, Section 5.1] and Lemma 11. In the low-degree reconstruction of Lemma 12, the system of polynomial equations has $kwd$ variables, which go into the exponent of the running time required to find a solution. Therefore, we must be able to reduce the width to $k$ so that we can obtain an FPT algorithm.

  • Note that there can be multiple optimal $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}$ decompositions of $f_\tau$, but for the sake of argument, we will fix one representation $C$ and argue the proof using this fixed representation $C$.

Low Degree Reconstruction using system solving

If $d < k^5$, then we can simply reduce the problem to finding a solution of a $k^{O(1)}$-variate algebraic system. Indeed, if the degree is small, the number of monomials appearing in $f$ is small, and the total number of variables appearing in $f$ is small. One can invoke black-box reconstruction algorithms for sparse polynomials [35, 8] to learn $f$ as a sum of monomials. Then, keeping the coefficients of the $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}$ representation as unknowns, we set up a system of polynomial equations in $\mathrm{poly}(k)$ variables such that every solution to the system corresponds to a $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(k)$ representation of $f$.
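A toy illustration of this reduction (our own sketch, with hypothetical variable names, for $k = d = 2$ and width 2 over $\mathbb{Q}$): the coefficients of $f$ come from sparse interpolation, one polynomial equation is written per monomial, and the resulting system would then be handed to a $\mathrm{Sys}_{\mathbb{F}}$ solver. Here we only set up the system and verify that a known decomposition solves it:

```python
import sympy as sp

k, w = 2, 2
# Coefficients of f = 2*x1*y1 - x1*y2 + x2*y1 on the monomials x_m * y_n.
coeff = {(0, 0): 2, (0, 1): -1, (1, 0): 1, (1, 1): 0}

# Unknown circuit: sum_i (a_i0*x1 + a_i1*x2) * (b_i0*y1 + b_i1*y2).
A = [[sp.Symbol(f'a_{i}{m}') for m in range(w)] for i in range(k)]
B = [[sp.Symbol(f'b_{i}{n}') for n in range(w)] for i in range(k)]

# One polynomial equation per monomial: coefficient matching.
eqs = [sum(A[i][m] * B[i][n] for i in range(k)) - coeff[(m, n)]
       for m in range(w) for n in range(w)]

# f = (x1 + x2)*y1 + x1*(y1 - y2) is a valid rank-2 decomposition:
sol = {A[0][0]: 1, A[0][1]: 1, B[0][0]: 1, B[0][1]: 0,
       A[1][0]: 1, A[1][1]: 0, B[1][0]: 1, B[1][1]: -1}
assert all(sp.expand(e.subs(sol)) == 0 for e in eqs)
```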

Our computational budget

Note that for low-degree learning (of degree $k^{O(1)}$), we require at least $k^{k^{O(1)}}$ time to learn the low-degree gates, as it requires solving a system of equations with at least $k^{O(1)}$ variables. Our goal is to ensure that all algebraic manipulations necessary to learn the full circuit fit within this time budget. It is unclear whether the clustering-based learning approach in the style of [32] and [41], which requires solving a system of polynomial equations in $k^{k^{O(k)}}$ variables and therefore $k^{k^{k^{O(k)}}}$ time, can be performed within this time constraint. Therefore, we combine ideas from both [11] and [41] to adopt partial clustering: just enough to isolate a gate (or its corresponding cluster). Subsequently, we can subtract this cluster and reduce to a $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(k-1)$ circuit, which we can then learn by induction.

Learning almost all the gates of the circuit

We adopt the same approach as in [11, Sections 5.3 and 5.4] to learn almost all the gates of the circuit. The exact result we need is described in Lemma 17. The idea is to use almost-uniqueness for $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}$ circuits twice. First, we project the circuit to a low degree (approximately $k^2$) and then learn this projection using low-degree reconstruction. Note that the representation we learn will have some linear forms that are the same as in the original representation $C$, by almost-uniqueness. In fact, we can ensure that we obtain two distinct linear forms supported on the same variable part.

In the next step, we will use these linear forms to learn most of the linear forms of $C$. Once we learn $\ell_1$ and $\ell_2$ appearing in $C$, we try to learn more linear forms as follows. The algorithm applies a suitable setting of the variables of $\ell_1$ in the polynomial $f$ that makes $\ell_1$ evaluate to 0, resulting in a circuit with fewer than $k$ multiplication gates. Call the restricted polynomial $f_{R_1}$ and let $C_{f_{R_1}}$ be the restricted version of $C_f$. By the inductive hypothesis, we can learn a $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(k-1)$ representation of $f_{R_1}$, which will be close to the original representation by almost-uniqueness. We repeat the same with $\ell_2$. Once we have this, by iterating over all ways of matching up the multiplication gates (this matching step involves a $k^{O(k)}$ brute-force matching, which might seem wasteful, but it fits within our computational budget, so we do not need to optimize it) and choices of overlap, we can generate a list of $k$ gates $T'_1,\ldots,T'_k$ such that $\Delta(T_i, T'_i) < 2k$. Here, $\Delta(T_i, T'_i) := \deg\!\left(T_i/\gcd(T_i, T'_i)\right)$ is a measure of how many linear forms are different in $T_i$ and $T'_i$ (a distance measure between two gates). We refer to $\sum_{i\in[k]} T'_i$ as the almost circuit.

3.1 Using linear forms learned in Almost Circuit

Now, we have reconstructed an almost circuit $C' \equiv \sum_{i\in[k]} T'_i$ such that $\Delta(T_i, T'_i) < 2k$. We know that at most $k$ linear forms from each $T'_i$ are incorrect in the representation $C'$ we learned. Therefore, there are at most $k^2$ incorrect linear forms, and so there are at least $d - k^2$ variable parts on which we know all the linear forms in $C'$ are correct. We call this set of partitions $\mathrm{Consistent\text{-}VarPart}(C, C')$:

$$\mathrm{Consistent\text{-}VarPart}(C, C') := \{\, j\in[d] \;:\; \forall i\in[k],\ \ell_{i,j} \in T_i \cap T'_i \,\}$$

As discussed, $|\mathrm{Consistent\text{-}VarPart}(C, C')| \ge d - k^2$. In [11], the authors guessed this set, which introduced a $d^k$-type dependence, but we will do something different. Note that for any subset $P \subseteq [d]$ with $|P| \ge k^2 + 1$, at least one variable part of $P$ lies in $\mathrm{Consistent\text{-}VarPart}(C, C')$.

For any variable part $j \in \mathrm{Consistent\text{-}VarPart}(C, C')$, we observe that if there exists some $i\in[k]$ such that $\ell_{i,j} \notin \mathrm{span}(\ell_{1,j},\ldots,\ell_{i-1,j},\ell_{i+1,j},\ldots,\ell_{k,j})$, then we can reconstruct $T_i$ exactly. To do this, we substitute $X_j = \boldsymbol\alpha$ such that $\ell_{1,j}(\boldsymbol\alpha) = \cdots = \ell_{i-1,j}(\boldsymbol\alpha) = \ell_{i+1,j}(\boldsymbol\alpha) = \cdots = \ell_{k,j}(\boldsymbol\alpha) = 0$ and $\ell_{i,j}(\boldsymbol\alpha) \ne 0$. Now, using black-box factoring on $C$ after the substitution, we can learn $T_i|_{X_j=\boldsymbol\alpha}$, as all other terms vanish. Due to unique factorization, we can say that we learn the projection correctly, and we can recover $T_i$ simply as $T_i = T_i|_{X_j=\boldsymbol\alpha}\cdot \frac{\ell_{i,j}}{\ell_{i,j}(\boldsymbol\alpha)}$. Once we have learned $T_i$ exactly, we can just subtract it from the rest of $C$ and learn $C - T_i$ as a set-multilinear $\Sigma\Pi\Sigma(k-1)$ circuit.
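Over $\mathbb{R}$, the substitution step can be sketched numerically (our illustration): the point $\boldsymbol\alpha$ is any vector in the common null space of the other gates' linear forms on $X_j$ that avoids the kernel of $\ell_{i,j}$.

```python
import numpy as np

TOL = 1e-9

def isolating_point(L, i):
    """L: k x |X_j| matrix whose rows are l_{1,j}, ..., l_{k,j}.
    Returns alpha with l_{i',j}(alpha) = 0 for all i' != i and l_{i,j}(alpha) != 0,
    or None if l_{i,j} lies in the span of the other rows."""
    others = np.delete(L, i, axis=0)
    _, s, vt = np.linalg.svd(others)            # null-space basis via SVD
    rank = int((s > TOL).sum())
    for alpha in vt[rank:]:
        if abs(L[i] @ alpha) > TOL:
            return alpha
    return None

L = np.array([[1., 0., 1.], [0., 1., 0.], [1., 1., 0.]])
alpha = isolating_point(L, 0)                   # proportional to (0, 0, 1)
print(L @ alpha)                                # only the isolated gate's form survives
```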

What if there is no variable part for which the above condition holds? That is, for all $j\in[d]$ and $i\in[k]$, $\ell_{i,j} \in \mathrm{span}(\ell_{1,j},\ldots,\ell_{i-1,j},\ell_{i+1,j},\ldots,\ell_{k,j})$. One could try to use the above technique to iteratively decrease the top fan-in using variable parts in $\mathrm{Consistent\text{-}VarPart}$: pick a variable part $j \in \mathrm{Consistent\text{-}VarPart}(C, C')$ and fix the variables to a value $\boldsymbol\alpha_j \in \mathbb{F}^{|X_j|}$ such that one linear form depending on $X_j$ (other than $\ell_{1,j}$) in $C'$ (and hence also in $C$, as $j \in \mathrm{Consistent\text{-}VarPart}(C, C')$) is set to zero while keeping $T_1$ nonzero (by ensuring $\ell_{1,j}(\boldsymbol\alpha_j) \ne 0$). This decreases the fan-in by at least one, until we are left with our target gate $T_1$, which we can learn using the above-mentioned technique. This approach also fails, as there may be other gates that differ from $T_1$ on only a few variable parts, none of which are in $\mathrm{Consistent\text{-}VarPart}(C, C')$, and hence cannot be differentiated using just $C'$. To avoid this issue, we will focus on learning a cluster of gates, that is, multiplication gates (the $T_i$'s in the circuit) that differ on only a few ($\mathrm{poly}(k)$) linear forms, instead of learning just one multiplication gate. See Lemma 15 for the formal definition of clusters.

3.2 Set-Multilinear Clustering

Karnin and Shpilka [32] introduced the notion of clustering the multiplication gates in any depth-3 circuit. Just like clustering points in space, where close points form a cluster and distant points form different clusters, clustering multiplication gates with $\Delta(T_i, T_j)$ as a distance metric ensures that gates in one cluster differ on a few linear forms, while gates in different clusters differ on substantially more linear forms. One significant benefit of studying the clustered representation is that it is unique! Furthermore, if we can get black-box access to a single cluster of low simple rank, then we can learn a circuit representation of it. Indeed, we can strip off the gcd among the different multiplication gates in the cluster, and what is left is just a low-degree circuit. In fact, this exact approach has been used in [32], [11], and [41] for learning (multilinear) $\Sigma\Pi\Sigma$ circuits.

However, one major drawback of the multilinear clustering used in [41] and [32] is that the rank of the simple part (the polynomial resulting from stripping off the linear factors) of a cluster has an upper bound of $k^{k^{O(k)}}$. When we try to learn this simple part as a low-degree $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}$ circuit, it requires solving a system of equations in $k^{k^{O(k)}}$ variables. This was one of the main culprits behind the exponential-tower dependence on $k$ in [41].

We develop a partial cluster representation (Lemma 15) specifically designed for isolating a single cluster from a $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(k)$ circuit. For instance, if we want to isolate a cluster containing the gate $T_1$, then our clustering algorithm will output a set $A \subseteq [k]$ (the partial cluster containing 1) such that the degree of the simple part of the cluster $C_A := \sum_{i\in A} T_i$ is at most $k^4 + k^3$, while ensuring that the “distance” between the isolated cluster and the other gates is high enough. Formally, $\Delta_A(C_A, T_i) \ge k^2 + k$ for $i \notin A$.

Note that if we can isolate the cluster $C_A$, that is, get black-box access to $C_A$, we can reconstruct it using low-degree reconstruction, as the degree of the simple part of $C_A$ is less than $2k^4$. Our clustering mechanism ensures that $\Delta_A(C_A, T_i) \ge k^2 + k$ for $i \notin A$. Since $C$ and $C'$ disagree on at most $k^2$ linear forms in total, this further implies that $\Delta_A(C'_A, T'_i) \ge k$.

A natural approach would be to use these $k$ variable parts and $T'_i$ to set all the gates not in $A$ to zero while ensuring that $T_1$ does not vanish. However, we have no idea what these $k$ variable parts are, and any brute-force search for them will have a $d^k$-type dependence. Furthermore, our approach should also ensure that the projection kills neither $T_1$ nor $\mathrm{sim}(C_A)$.

3.3 Good $T_1$-isolating Projections

We will now describe how to handle the above issues. First, we will ensure that the variable parts we use are not only in $\mathrm{Consistent\text{-}VarPart}(C, C')$, but also that the linear forms depending on these variable parts are linear factors of $C_A$; i.e., we choose variable parts from $\mathrm{Consistent\text{-}VarPart}(C, C') \cap \mathrm{Support}(\mathrm{Lin}(C_A))$. Fixing variables belonging to variable parts from the set $\mathrm{Consistent\text{-}VarPart}(C, C') \cap \mathrm{Support}(\mathrm{Lin}(C_A))$ will give us $C_A$ up to a few linear forms. This fixing of variables is what we call a good $T_1$-isolating projection, as defined in Definition 19.

Second, in Section 7.1, we describe an algorithm that searches for those $k$ variable parts on which the linear forms in the other gates differ from $\mathrm{Lin}(C_A)$, in time $k^{O(k)}\,\mathrm{poly}(n,d)$. We design a recursive approach that, using the structural guarantees of $C'$, outputs a list of $k^{O(k)}$ candidates for these variable parts, with a guarantee that one of them will help us project to the cluster $C_A$. The reduction of the search time for these $k$ variable parts from $d^k$ to $k^{O(k)}$ is the main technical contribution of our work.

We now elaborate on the structural guarantees of $C'$ that let us do this. Since the rank of $\mathrm{sim}(C_A)$ is less than $2k^4$, the size of $\mathrm{Support}(\mathrm{Lin}(C_A))$ is at least $d - 2k^4$, and therefore $|\mathrm{Consistent\text{-}VarPart}(C, C') \cap \mathrm{Support}(\mathrm{Lin}(C_A))| \ge d - 2k^4 - k^2$. Therefore, our algorithm will pick $k^5 + 1$ variable parts, one of which is guaranteed to be in $\mathrm{Consistent\text{-}VarPart}(C, C') \cap \mathrm{Support}(\mathrm{Lin}(C_A))$, such that the linear forms depending on those parts in the gates of $C'$ not yet set to 0 span a space of dimension at least 2. Then, for each of these parts $j$, we set a linear form not in the span of $\ell_{1,j}$ to 0 while keeping $\ell_{1,j}$ nonzero, decreasing the top fan-in, and then recursively call the function. The distance condition in Lemma 15 ensures that if there is a gate not in $A$ that has not yet been set to zero, we will be able to find such a variable part. Therefore, we have $k^{O(k)}$ recursive calls, one of which will be such that the only remaining gates are in $A$ and the variable parts that were fixed all belong to $\mathrm{Consistent\text{-}VarPart}(C, C') \cap \mathrm{Support}(\mathrm{Lin}(C_A))$; thus we can learn $C_A$.

For the correct choice of these variable parts, we can ensure that the linear forms in $T_1$ depending on the part do not vanish. This ensures that no other gate in $A$ vanishes either, as these linear forms are part of their gcd (the initial representation is chosen so that $\gcd(C_A) = \mathrm{Lin}(C_A)$). Also, due to unique factorization, and since all the linear forms that got fixed to nonzero constants are linear factors of $C_A$, we can simply multiply back the linear forms we learned in $T'_1$ for those variable parts (these are the same as in $T_1$, since $j \in \mathrm{Consistent\text{-}VarPart}(C, C')$, and the same as in the other gates of $A$, since the linear forms are part of $\mathrm{Lin}(C_A)$).

Once we have learned $C_A$, we subtract it from $C$ and learn the smaller top fan-in circuit. To ensure that the output circuit computes the correct polynomial, we use the efficient $2^{O(\log^2 k)}\,\mathrm{poly}(n)$ time deterministic PIT for $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(2k)$ circuits from Corollary 9 at the end, which ensures that the only circuit output is a correct one.

4 Preliminaries

In this section, we define notation and develop the basic preliminaries in algebraic complexity and reconstruction required to understand this work. An experienced reader can presumably skip to Section 5. A more detailed set of preliminaries, with notation, can be found in the full version.

Complexity of Solving a System of Polynomial Equations

Theorem 8.

Let $f_1, f_2, \ldots, f_m \in \mathbb{F}[x_1,\ldots,x_n]$ be $n$-variate polynomials of degree at most $d$. Then, the complexity of finding a single solution to the system $f_1(x) = 0, \ldots, f_m(x) = 0$ (if one exists) over various fields is as follows:

  1. [21] For $\mathbb{F} = \mathbb{R}$, we have $\mathrm{Sys}_{\mathbb{F}}(n,m,d) = \mathrm{poly}((md)^{n^2})$ deterministic time. Here the authors assumed that the constants appearing in the system are integers (or rationals). Note that for all computational applications we can WLOG assume this by simply approximating/truncating a given real number at some number of bits.

  2. [28] For $\mathbb{F} = \mathbb{C}$ (or any algebraically closed field), $\mathrm{Sys}_{\mathbb{F}}(n,m,d) = (mn)^{O(n)}\, d^{O(n^2)}$ deterministic time.

  3. For all fields $\mathbb{F}$, $\mathrm{Sys}_{\mathbb{F}}(n,m,d) = \mathrm{poly}((nmd)^{3^n})\cdot c_{\mathbb{F}}(d^{2^n})$. Here, $c_{\mathbb{F}}(N)$ denotes the time complexity of factorizing a univariate polynomial of degree $N$ over $\mathbb{F}$, randomized or deterministic. This follows from standard techniques in elimination theory. For a detailed sketch of the argument and a bound on the size of the extension, see [11, Appendix A].

Polynomial Identity Testing

Corollary 9 ([22, Theorem 3]).

Over any field $\mathbb{F}$, there is a deterministic $2^{O(\log^2 k)}\,\mathrm{poly}(n)$ time PIT algorithm for the class of set-multilinear polynomials computed by depth-3 set-multilinear circuits of degree $d$ and top fan-in $k$.

The result in [22] is quantitatively much stronger, but we restate only what is most suited for our application.

Factoring: structural and algorithmic results

In their work, Shpilka and Volkovich [45] showed that for any class $\mathcal{C}$ of multilinear polynomials, one can derandomize polynomial factoring using PIT algorithms for $\mathcal{C}$. Using that fact directly with the PIT algorithm of the previous subsection, we get the following.

Lemma 10 ([45, Corollary 1.2]).

There is a deterministic algorithm that, given black-box access to a $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(k)$ circuit $C$, outputs black-boxes for the irreducible factors of $C$ in time $2^{O(\log^2 k)}\,\mathrm{poly}(n)$. In addition, each such irreducible factor is computable by a $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}$ circuit. There is also a deterministic algorithm that outputs linear functions $L_1,\ldots,L_r$ and black-box access to a simple set-multilinear $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(k)$ circuit $\hat{C}$ such that $C = \prod_{i=1}^{r} L_i \cdot \hat{C}$, in time $2^{O(\log^2 k)}\,\mathrm{poly}(n)$.

Width Reduction for set-multilinear $\Sigma\Pi\Sigma(k)$

In [11], the authors presented an algorithm for learning set-multilinear ΣΠΣ(k) circuits of arbitrary width w in roughly the same amount of time it takes to learn set-multilinear ΣΠΣ(k) circuits of width k. The following lemma follows directly from the result of [11, Corollary 5.3] and the fast PIT algorithm of [22].

Lemma 11 ([11, Corollary 5.3]).

Suppose $A$ is an algorithm with the following behavior: on input black-box access to a degree-$d$, $n$-variate polynomial $f \in \mathbb{F}[X]$ such that $f$ is computable by a width-$k$ $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(k)$ circuit $C_f$ over the field $\mathbb{F}$, it runs in deterministic time $\mathcal{A}(n,d,k)$ and outputs a $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(k)$ circuit computing $f$. Then there is another algorithm $A'$ with the following behavior: on input black-box access to a degree-$d$, $n$-variate polynomial $f \in \mathbb{F}[X]$ such that $f$ is computable by an arbitrary-width $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(k)$ circuit $C_f$ over the field $\mathbb{F}$, it runs in deterministic $2^{O(\log^2 k)}\,\mathrm{poly}(n,d)\cdot\mathcal{A}(n,d,k)$ time and outputs a $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(k)$ circuit computing $f$.

We refer the reader to Section 5.1 and Section 5.6 of [11] for the details.

Low Degree Reconstruction

In [11], the authors gave an algorithm that reduces the reconstruction of a low-degree set-multilinear $\Sigma\Pi\Sigma(k)$ circuit to solving a system of polynomial equations with few variables. We will use it mainly in the case where $d \le 2k^4$. The result is as follows.

Lemma 12 ([11], Lemma 5.5).

Given black-box access to a degree-$d$ polynomial $f \in \mathbb{F}[X]$ such that $f$ is computable by a width-$w$ $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(k)$ circuit $C_f$ over the field $\mathbb{F}$, there is a deterministic $\mathrm{Sys}_{\mathbb{F}}(kwd, w^d, d) + \mathrm{poly}(w,k,d)$ time algorithm that outputs a $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(k)$ circuit computing $f$. Further, using Lemma 11, we can reduce the width to $k$ and get a running time of $\mathrm{Sys}_{\mathbb{F}}(k^2 d, k^d, d)\cdot\mathrm{poly}(n,d)$.

 Remark 13.

When $\mathbb{F} = \mathbb{R}$ or $\mathbb{C}$, the output circuit is over the same underlying field $\mathbb{F}$. In general, the output circuit might be over a $2^{k^{O(1)}}$-degree algebraic extension of $\mathbb{F}$.

5 Partial Set-Multilinear Cluster Representation

We will be clustering all gates close to the gate $T_1$ to create a representation with a cluster of gates containing $T_1$, such that the remaining gates are far from the cluster. The gates clubbed together will be represented by a set $A \subseteq [k]$, with $C_A = \sum_{i\in A} T_i$, fulfilling the properties described in Lemma 15.

For any set of gates $A \subseteq [k]$ of the circuit $C$, we may consider only initial representations of $C$ such that $\mathrm{Lin}(C_A)$ and the gcd of the gates in $A$ coincide, i.e., $\mathrm{sim}(C_A)$ has no linear factors. Such a representation, a depth-3 set-multilinear circuit with the same top fan-in computing the polynomial of $C_A$, always exists, as can be inferred from Lemma 10.

Next, we introduce the following definition of distance that we will be using to define the cluster representation.

Definition 14 (Cluster Distance).

For a circuit $C = T_1 + \cdots + T_k$ with a cluster $A \subseteq [k]$, and a representation of $C$ such that $\mathrm{sim}(C_A)$ has no linear factors, we define $\Delta_A(C_A, T_i)$ for $i \notin A$, i.e., the cluster distance between the cluster $C_A$ and the gate $T_i$, as the number of variable parts on which the linear forms in $\mathrm{Lin}(C_A)$ and $T_i$ differ. In mathematical notation:

$$\Delta_A(C_A, T_i) = \left|\{\, j\in[d] \;:\; \ell_{1,j} \in \mathrm{Lin}(C_A) \text{ and } \ell_{1,j} \nmid T_i \,\}\right|$$
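For explicit representations over $\mathbb{R}$, the cluster distance can be computed directly from the definition; a small sketch (ours, with an assumed encoding of the linear forms as coefficient vectors):

```python
import numpy as np

def cluster_distance(lin_CA, gate):
    """Delta_A(C_A, T_i): lin_CA[j] is the coefficient vector of the linear form
    of Lin(C_A) on part j (None if Lin(C_A) has no factor there); gate[j] is the
    form of T_i on part j. Counts parts where Lin(C_A) has a factor that T_i
    does not share up to scaling."""
    return sum(1 for j, l in enumerate(lin_CA)
               if l is not None
               and np.linalg.matrix_rank(np.stack([l, gate[j]])) > 1)
```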

We will use the cluster representation with the properties described in the following lemma.

Lemma 15.

For a polynomial $f$ computed by a set-multilinear $\Sigma\Pi\Sigma(k)$ circuit $C = T_1 + \cdots + T_k$, there exists a set $A \subseteq [k]$ such that, for $C_A = \sum_{i\in A} T_i$ and $C$ being a representation such that $\mathrm{sim}(C_A)$ does not have any linear factors, we have:

  • $1 \in A$;

  • $\Delta(C_A) = \mathrm{rank}(\mathrm{sim}(C_A)) \le k^4 + k^3$;

  • for all remaining gates $T_i$, $\mathrm{Lin}(C_A)$ and $T_i$ differ on at least $k^2 + k$ variable parts, i.e.,
    $$\forall i \in [k]\setminus A: \quad \left|\{\, j\in[d] : \ell_{1,j}\in\mathrm{Lin}(C_A) \text{ and } \ell_{1,j}\nmid T_i \,\}\right| \ge k^2 + k,$$
    i.e., $\forall i\in[k]\setminus A:\ \Delta_A(C_A, T_i) \ge k^2 + k$.

Due to space constraints, we defer the formal proof of the above lemma to the full version of the paper.

 Remark 16.

Note that, as described in the proof of Lemma 15, the output cluster $A$ of our clustering algorithm is unique for a given representation $C$. From now on, we will assume that $A$ refers to this cluster containing $T_1$, with all the properties described in Lemma 15.

See the full version for a comparison with previously known clusterings.

6 Almost circuit, Learning most gates

Given a $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}(k)$ circuit $C = T_1 + T_2 + \cdots + T_k = \sum_{i\in[k]}\prod_{j\in[d]} \ell_{i,j}$ computing a polynomial $f$, we can use the techniques in [11] to learn a circuit $C' = T'_1 + \cdots + T'_k$ such that $\forall i\in[k]:\ \Delta(T_i, T'_i) < 2k$.

We briefly explain how such a $C'$ is obtained in time $k^{O(k)}\cdot\mathrm{Sys}_{\mathbb{F}}(k^{O(1)})\cdot\mathrm{poly}(n,d)$. We will use Lemma 12 to do deterministic reconstruction when the degree is $d \le k^3$, in time $\mathrm{Sys}_{\mathbb{F}}(k^2 d, k^d, d)$. Now, similar to Lemma 5.6 in [11], we set all but $k^2$ variable parts to some random values and reconstruct the polynomial, which will be close to $C$ (on the variable parts that were not fixed), to obtain two independent linear forms $\ell_1, \ell_2 \in C$ supported on the same variable part $X_i$. By setting $\ell_1, \ell_2$ to 0 one at a time, and recursively calling reconstruction for set-multilinear $\Sigma\Pi\Sigma(k-1)$ circuits, we obtain representations close to $C$. Each multiplication gate of these close representations is close to a multiplication gate of $C$, due to rank bounds, as the gates that contain $\ell_1$ are obtained when we go modulo $\ell_2$, and vice versa. As described in Lemma 5.7 of [11], we obtain a set $S = \{M_1,\ldots,M_{|S|}\}$ of at most $2k^2$ $\Pi\Sigma$ circuits such that for every $i\in[k]$ there is a $j\in[2k^2]$ with $\Delta(T_i, M_j) < 2k$. Working through the $k^{2k^2}$ possibilities, we get at least one circuit $C' = T'_1 + \cdots + T'_k$ such that $\forall i\in[k]:\ \Delta(T_i, T'_i) < 2k$.

Lemma 17.

Given black-box access to a set-multilinear $\Sigma\Pi\Sigma(k)$ circuit $C = T_1 + T_2 + \cdots + T_k$ computing $f$, there exists an algorithm that runs in time $2k^3\left(2F(n,d,k-1) + k^{2k^2}\right) + \mathrm{Sys}_{\mathbb{F}}(2k^4, k^{2k^2}, 2k^2) + \mathrm{poly}(k,d)\cdot 2^{\log^2 k}$ and outputs a list of $k^{O(k)}$ set-multilinear $\Sigma\Pi\Sigma(k)$ circuits such that one of the circuits, $C' = T'_1 + T'_2 + \cdots + T'_k$, has the property that $\forall i\in[k]:\ \Delta(T_i, T'_i) < 2k$.

7 Finding a good Projection

The focus of this section will be on learning the cluster $C_A$ (which contains $T_1$). Due to our clustering algorithm (we never actually run the clustering algorithm; it is used only for existential arguments), we know there exists a representation $C = T_1 + \cdots + T_k = C_A + \sum_{i\notin A} T_i$ such that $\Delta(C_A) \le k^4 + k^3$ and $\Delta_A(C_A, T_i) \ge k^2 + k$.

We now define our useful variable parts by excluding from $\mathrm{Consistent\text{-}VarPart}(C, C')$ the partitions whose linear forms in $C_A$ are not part of $\mathrm{Lin}(C_A)$. Therefore, we define

$$S(C, C') := \mathrm{Consistent\text{-}VarPart}(C, C') \cap \mathrm{Support}(\mathrm{Lin}(C_A)).$$

As $\mathrm{rk}(\mathrm{sim}(C_A)) \le k^4 + k^3$, we have $|S(C, C')| \ge d - k^4 - k^3 - k^2$. Our target remains to find a set of variable parts in $S$ such that we can fix them to values that make the other gates vanish while $C_A$ remains nonzero. We define any such fixing of variables $X_j$ to $\boldsymbol\alpha_j$, contained as tuples in a set denoted by $\mathcal{R}$, that keeps a set of gates $G$ (containing $T_1$) alive and sets the rest to 0 as a $T_1$-isolating projection.

Definition 18 ($T_1$-isolating Projection).

A $T_1$-isolating projection is a tuple $(G, \mathcal{R})$ such that $G \subseteq [k]$ with $1 \in G$, and $\mathcal{R}$ is a set of tuples $(j, \boldsymbol\alpha_j)$, where $j \in [d]$ is a variable part and $\boldsymbol\alpha_j \in \mathbb{F}^{|X_j|}$ is an assignment to the variables of the $j$-th variable part, such that all gates except those in $G$ vanish after the substitution of variables according to $\mathcal{R}$, i.e.,

$$C\big|_{X_j=\boldsymbol\alpha_j\ \forall (j,\boldsymbol\alpha_j)\in\mathcal{R}} \;=\; \Big(\sum_{i\in G} T_i\Big)\Big|_{X_j=\boldsymbol\alpha_j\ \forall (j,\boldsymbol\alpha_j)\in\mathcal{R}} \;\ne\; 0.$$

The goal of our computation is to obtain black-box access to $C_A$; thus we define a good $T_1$-isolating projection, which lets us compute black-box access to $C_A$.

Definition 19 (Good $T_1$-isolating Projection).

A $T_1$-isolating projection $(G, \mathcal{R})$ is good for circuits $C, C'$ if $G = A$ as in Lemma 15 and, for all $(j, \boldsymbol\alpha_j) \in \mathcal{R}$, $j \in S(C, C')$. Here we refer to the unique cluster $A$ for the circuit $C$, as described in Remark 16.

7.1 Computing a good $T_1$-isolating projection

This subsection describes how we can compute a good $T_1$-isolating projection using white-box access to $C'$. Our idea is straightforward. We know that for each $i \in [k]\setminus A$, $T_i$ differs from $\mathrm{Lin}(C_A)$ on at least $k^2 + k$ variable parts. By the “closeness” of $C$ and $C'$, we know that even in $T'_i$ there will be at least $k$ variable parts that differ from $\mathrm{Lin}(C_A)$. If we knew exactly what these variable parts are for each $i \in [k]\setminus A$, then, using the fact that the corresponding linear forms are not the same (up to scalar multiples), there is an assignment that kills at least one linear form of $T_i$ (and thus $T_i$ itself) for each $i \in [k]\setminus A$ while keeping $\mathrm{Lin}(C_A)$ alive. However, we have no idea what these $k$ variable parts are, and any brute-force search for them will have a $d^k$-type dependence. We design a recursive approach that, using the structural guarantees of $C'$, outputs a small ($k^{O(k)}$) list of candidates for these variable parts, with a guarantee that one of them will help us project to the cluster $C_A$.

We now elaborate on the structural guarantees of $C'$ that let us do this. We pick a set of at most $k^5$ variable parts (for concreteness, Algorithm 1 uses $k^4 + k^3 + k^2 + 1$) such that for each chosen variable part, the span of the linear forms on that part in the gates of $C'$ indexed by $G$ has dimension at least 2. This means that for each variable part $j$ in this set, we can pick at least one linear form $\ell'_{i,j}$ from some $T'_i$ such that $\ell'_{i,j}$ is not a scalar multiple of $\ell'_{1,j}$. Since $\ell'_{i,j}$ and $\ell'_{1,j}$ are linearly independent, we can find an assignment of variables $X_j = \boldsymbol\alpha_j$ such that $\ell'_{i,j}(\boldsymbol\alpha_j) = 0$ (and therefore $T'_i|_{X_j=\boldsymbol\alpha_j} = 0$) while keeping $\ell'_{1,j}(\boldsymbol\alpha_j)$, and hence $T'_1|_{X_j=\boldsymbol\alpha_j}$, nonzero. Thus, we have found a projection for $C'$ with the top fan-in decreased by at least 1. The projection $(j, \boldsymbol\alpha_j)$ is added to $\mathcal{R}$, and all gates of $C'$ that are set to zero are removed from $G$ for the execution that continues after fixing this variable part. We do this recursively until we reach an execution level where we cannot find any variable parts for our set, which can happen if $|G| = 1$ or if the remaining gates of $C'$ have the same linear forms (up to scalar multiples) as $T'_1$ on all the variable parts that have not been fixed. After doing this for all variable parts in the (at most $k^5$-sized) set, the algorithm adds the surviving gates and the projection $(G, \mathcal{R})$ to a global list GoodProjList.

Algorithm 1 Computing a List of Candidate Good Projections.

Input: Black-box access to a $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}$ circuit $C = T_1 + \cdots + T_k$ and white-box access to $C' = T'_1 + \cdots + T'_k$ such that $\Delta(T_i, T'_i) < 2k$.
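The pseudocode of Algorithm 1 appears in the full version; the following is a runnable sketch of its recursion as we read the description above (over $\mathbb{R}$, using only the white-box gates of $C'$, with gate 0 playing the role of $T'_1$; the actual algorithm also maintains black-box access to $C$):

```python
import numpy as np

TOL = 1e-9

def kill_point(li, l1):
    """alpha with li(alpha) = 0 but l1(alpha) != 0, for independent li, l1."""
    _, s, vt = np.linalg.svd(li.reshape(1, -1))
    for alpha in vt[1:]:                        # null space of the single form li
        if abs(l1 @ alpha) > TOL:
            return alpha

def good_proj_candidates(gates, G, R, budget, out):
    """gates[i][j]: coefficient vector of the form of T'_i on part X_j.
    Recursively collects candidate (G, R) pairs in `out`; on some execution
    path G = A and every fixed part lies in S(C, C')."""
    out.append((set(G), list(R)))
    if len(G) == 1:
        return
    fixed = {j for j, _ in R}
    # Unfixed parts on which the surviving gates' forms span dimension >= 2.
    parts = [j for j in range(len(gates[0])) if j not in fixed and
             np.linalg.matrix_rank(np.array([gates[i][j] for i in G])) >= 2]
    for j in parts[:budget]:                    # at most ~k^5 branches per level
        # One form l'_{i,j} that is not a scalar multiple of l'_{1,j} ...
        i = next(i for i in G if i != 0 and
                 np.linalg.matrix_rank(np.stack([gates[i][j], gates[0][j]])) == 2)
        alpha = kill_point(gates[i][j], gates[0][j])
        # ... is killed by X_j = alpha; gates whose form vanishes drop out of G.
        survivors = {g for g in G if abs(gates[g][j] @ alpha) > TOL}
        good_proj_candidates(gates, survivors, R + [(j, alpha)], budget, out)
```

A run starts with `good_proj_candidates(gates, set(range(k)), [], k**5, out)`; the recursion depth is at most $k-1$ because $|G|$ strictly decreases, matching the $k^{O(k)}$ bound of Lemma 20.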

Now, to see that this set contains a good $T_1$-isolating projection, observe that at each level either we picked $k^5$ variable parts, or Algorithm 1 could not find $k^5$ variable parts with the required property. In the former case, using $|S(C, C')| \ge d - k^4 - k^3 - k^2$, at least one variable part in our set came from $S(C, C')$. We use the distance property of the clustering to show, in the proof of Lemma 20, that if there is a gate $i \in G\setminus A$, then there is a variable part $j \in S(C, C')$ on which the gate $T'_i$ (and $T_i$, as $j \in S(C, C')$) has a linear form independent of the one in $\mathrm{Lin}(C_A)$, and therefore of the one in $T'_1$ (and $T_1$). Therefore, in the latter case as well, either we pick a variable part in $S(C, C')$ or $G \subseteq A$.

We focus our attention on the path of execution where each variable part picked was from $S(C, C')$. Since all variable parts in $S(C, C')$ are also in $\mathrm{Consistent\text{-}VarPart}(C, C')$, setting $T'_i|_{X_j=\boldsymbol\alpha_j} = 0$ using $j \in \mathrm{Consistent\text{-}VarPart}(C, C')$ also sets $T_i|_{X_j=\boldsymbol\alpha_j} = 0$, as the linear forms depending on that variable part are the same in $C$ and $C'$ (as described in the definition of $\mathrm{Consistent\text{-}VarPart}(C, C')$); likewise, keeping $T'_1|_{X_j=\boldsymbol\alpha_j}$ nonzero keeps $T_1|_{X_j=\boldsymbol\alpha_j}$ nonzero. Also, as all variable parts in $S(C, C')$ are also in $\mathrm{Support}(\mathrm{Lin}(C_A))$ and the circuit is set-multilinear, keeping $T_1|_{X_j=\boldsymbol\alpha_j}$ nonzero also keeps $T_i|_{X_j=\boldsymbol\alpha_j}$ nonzero for all $i \in A$. Therefore, along this execution path, at each recursive level at least one gate not in $A$ gets set to zero, while all the gates in $A$ are kept nonzero by fixing variable parts $X_j$ with $j \in S(C, C')$, until $G = A$. In the execution call on this path with $G = A$, the pair $(G, \mathcal{R})$ that is added to GoodProjList is a good $T_1$-isolating projection, and it occurs at a recursion depth of at most $k - |A| \le k - 1$.

This computation can be seen as a $k^5$-ary tree of depth at most $k - 1$, with each node a recursive call of the function. We start with $G = [k]$ and make at most $k^5$ choices of variable parts, at least one of which is in $S(C, C')$. In the execution of the algorithm along this path, at each step at least one gate not in $A$ is set to zero by fixing the chosen variable part in $S(C, C')$ while keeping all gates in $A$ nonzero, until we reach $G = A$, which adds a good $T_1$-isolating projection to the list.

Lemma 20.

Algorithm 1, when given black-box access to a set-multilinear $\Sigma\Pi\Sigma(k)$ circuit $C$ and white-box access to an almost circuit $C'$ from Lemma 17, computes a list of $T_1$-isolating projections for $C$ of size at most $k^{O(k)}$ that contains at least one good $T_1$-isolating projection for $C, C'$, in deterministic $k^{O(k)}\,\mathrm{poly}(n,d)$ time.

Due to space constraints, we defer the formal proof of the above lemma to the full version of the paper.

8 Reconstruction using a good Projection

Now, we describe how to obtain a $k^{O(k)}$-sized list of candidate circuits for $C_A$ using the list computed in Lemma 20, one of which will compute $C_A$.

Algorithm 2 Candidate Circuits using Good Partitions.

Input: Black-box access to a set-multilinear $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}$ circuit $C = T_1 + \cdots + T_k$, white-box access to $C' = T'_1 + \cdots + T'_k$ such that $\Delta(T_i, T'_i) < 2k$, and a list containing a good projection.
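The core black-box manipulation in Algorithm 2 is applying a candidate projection to $C$; here is a sketch of that interface (ours; per the text above, the algorithm then black-box factors the restriction, learns $\mathrm{sim}(C_A)$ by low-degree reconstruction, and multiplies $\mathrm{Lin}(C_A)$ back):

```python
import numpy as np

def restrict_blackbox(f, proj, part_slices):
    """f: black-box taking the full variable vector (all parts concatenated).
    proj: a T_1-isolating projection R = [(j, alpha_j), ...].
    part_slices[j]: the slice of the full vector occupied by part X_j.
    Returns a black-box in the remaining (unfixed) variables."""
    fixed = dict(proj)

    def g(y):
        x = np.empty(part_slices[-1].stop)
        pos = 0
        for j, sl in enumerate(part_slices):
            if j in fixed:
                x[sl] = fixed[j]                # substitute X_j = alpha_j
            else:
                width = sl.stop - sl.start
                x[sl] = y[pos:pos + width]      # pass remaining variables through
                pos += width
        return f(x)

    return g
```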

Lemma 21.

Given the list of $T_1$-isolating projections from Lemma 20, black-box access to a set-multilinear circuit $C$ computing $f$ with top fan-in $k$, and white-box access to a circuit $C'$ as in Lemma 17, Algorithm 2 computes a $k^{O(k)}$-sized list of depth-3 set-multilinear circuits $\Phi(C)$ such that at least one of them computes $C_A$, in time $k^{O(k)}\cdot\mathrm{Sys}_{\mathbb{F}}(2k^6, k^{2k^4}, 2k^4)\cdot\mathrm{poly}(n,d)$.

Due to space constraints, we defer the formal proof of the above lemma to the full version of the paper. We conclude by presenting the algorithm mentioned in our main theorem (Theorem 1). The proof of its correctness can be found in the full version of the paper.

Algorithm 3 Reconstruction of set-multilinear ΣΠΣ(k) circuits.

Input: Black-box access to a set-multilinear $\Sigma\Pi\Sigma_{\{\sqcup_j X_j\}}$ circuit $C = T_1 + \cdots + T_k$.
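The body of Algorithm 3 is in the full version; its control flow, as we read it, is the following sketch (the helper interfaces `learn_candidates`, `subtract`, and `is_zero` are hypothetical stand-ins for Lemma 21, black-box subtraction, and the PIT of Corollary 9):

```python
def reconstruct(f, k, learn_candidates, subtract, is_zero):
    """Returns a list of gates computing f with top fan-in <= k, or None."""
    if is_zero(f):                   # PIT check; verifies the whole subtraction chain
        return []
    if k == 0:
        return None
    for cluster in learn_candidates(f, k):       # k^{O(k)} candidates (Lemma 21)
        rest = reconstruct(subtract(f, cluster), k - len(cluster),
                           learn_candidates, subtract, is_zero)
        if rest is not None:
            return cluster + rest                # f - cluster - rest == 0 verified
    return None
```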

References

  • [1] L. M. Adleman and K. S. McCurley. Open problems in number theoretic complexity, II. In Algorithmic Number Theory, First International Symposium, ANTS-I, Ithaca, NY, USA, May 6-9, 1994, Proceedings, pages 291–322, 1994. doi:10.1007/3-540-58691-1.
  • [2] Anima Anandkumar, Dean P Foster, Daniel J Hsu, Sham M Kakade, and Yi-Kai Liu. A spectral algorithm for Latent Dirichlet Allocation. In Advances in Neural Information Processing Systems, pages 917–925, 2012.
  • [3] Animashree Anandkumar, Rong Ge, Daniel Hsu, and Sham M Kakade. A tensor approach to learning mixed membership community models. The Journal of Machine Learning Research, 15(1):2239–2312, 2014. doi:10.5555/2627435.2670323.
  • [4] Animashree Anandkumar, Rong Ge, and Majid Janzamin. Learning overcomplete latent variable models through tensor methods. In Conference on Learning Theory, pages 36–112. PMLR, 2015. URL: http://proceedings.mlr.press/v40/Anandkumar15.html.
  • [5] D. Angluin. Queries and concept learning. Machine Learning, 2:319–342, 1988.
  • [6] Boaz Barak, Jonathan A Kelner, and David Steurer. Dictionary learning and tensor decomposition via the sum-of-squares method. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 143–151, 2015. doi:10.1145/2746539.2746605.
  • [7] A. Beimel, F. Bergadano, N. H. Bshouty, E. Kushilevitz, and S. Varricchio. Learning functions represented as multiplicity automata. J. ACM, 47(3):506–530, 2000. doi:10.1145/337244.337257.
  • [8] M. Ben-Or and P. Tiwari. A deterministic algorithm for sparse multivariate polynominal interpolation. In Proceedings of the 20th Annual ACM Symposium on Theory of Computing (STOC), pages 301–309, 1988.
  • [9] Vishwas Bhargava, Ankit Garg, Neeraj Kayal, and Chandan Saha. Learning generalized depth three arithmetic circuits in the non-degenerate case. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2022). Schloss-Dagstuhl-Leibniz Zentrum für Informatik, 2022.
  • [10] Vishwas Bhargava, Shubhangi Saraf, and Ilya Volkovich. Deterministic factorization of sparse polynomials with bounded individual degree. Journal of the ACM (JACM), 67(2):1–28, 2020. doi:10.1145/3365667.
  • [11] Vishwas Bhargava, Shubhangi Saraf, and Ilya Volkovich. Reconstruction algorithms for low-rank tensors and depth-3 multilinear circuits. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 809–822, 2021. doi:10.1145/3406325.3451096.
  • [12] Pritam Chandra, Ankit Garg, Neeraj Kayal, Kunal Mittal, and Tanmay Sinha. Learning Arithmetic Formulas in the Presence of Noise: A General Framework and Applications to Unsupervised Learning. In 15th Innovations in Theoretical Computer Science Conference (ITCS 2024), volume 287, pages 25:1–25:19, 2024. doi:10.4230/LIPIcs.ITCS.2024.25.
  • [13] S. Chen and R. Meka. Learning polynomials in few relevant dimensions. In Jacob D. Abernethy and Shivani Agarwal, editors, Conference on Learning Theory, COLT 2020, 9-12 July 2020, Virtual Event [Graz, Austria], volume 125 of Proceedings of Machine Learning Research, pages 1161–1227. PMLR, 2020. URL: http://proceedings.mlr.press/v125/chen20a.html.
  • [14] Lieven De Lathauwer, Joséphine Castaing, and Jean-François Cardoso. Fourth-order cumulant-based blind identification of underdetermined mixtures. IEEE Transactions on Signal Processing, 55(6):2965–2973, 2007. doi:10.1109/TSP.2007.893943.
  • [15] Harm Derksen. Kruskal’s uniqueness inequality is sharp. Linear Algebra and its Applications, 438(2):708–712, 2013.
  • [16] Jingqiu Ding, Tommaso d’Orsi, Chih-Hung Liu, David Steurer, and Stefan Tiegel. Fast algorithm for overcomplete order-3 tensor decomposition. In Conference on Learning Theory, pages 3741–3799. PMLR, 2022. URL: https://proceedings.mlr.press/v178/ding22a.html.
  • [17] M. A. Forbes and A. Shpilka. Quasipolynomial-time identity testing of non-commutative and read-once oblivious algebraic branching programs. Electronic Colloquium on Computational Complexity (ECCC), 19:115, 2012.
  • [18] Michael A Forbes and Amir Shpilka. On identity testing of tensors, low-rank recovery and compressed sensing. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pages 163–172, 2012. doi:10.1145/2213977.2213995.
  • [19] A. Garg, N. Kayal, and C. Saha. Learning sums of powers of low-degree polynomials in the non-degenerate case. arXiv preprint arXiv:2004.06898, 2020.
  • [20] Ankit Garg, Neeraj Kayal, and Chandan Saha. Learning sums of powers of low-degree polynomials in the non-degenerate case. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 889–899. IEEE, 2020. doi:10.1109/FOCS46700.2020.00087.
  • [21] D. Yu. Grigor’ev and N.N. Vorobjov. Solving systems of polynomial inequalities in subexponential time. Journal of Symbolic Computation, 5(1):37–64, 1988. doi:10.1016/S0747-7171(88)80005-1.
  • [22] Zeyu Guo and Rohit Gurjar. Improved explicit hitting-sets for ROABPs. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2020). Schloss-Dagstuhl-Leibniz Zentrum für Informatik, 2020.
  • [23] A. Gupta, N. Kayal, and S. V. Lokam. Efficient reconstruction of random multilinear formulas. In IEEE 52nd Annual Symposium on Foundations of Computer Science, FOCS, pages 778–787, 2011. doi:10.1109/FOCS.2011.70.
  • [24] A. Gupta, N. Kayal, and Y. Qiao. Random arithmetic formulas can be reconstructed efficiently. Computational Complexity, 23(2):207–303, 2014. doi:10.1007/s00037-014-0085-0.
  • [25] Johan Håstad. Tensor rank is NP-complete. J. Algorithms, 11(4):644–654, 1990. doi:10.1016/0196-6774(90)90014-6.
  • [26] Christopher J. Hillar and Lek-Heng Lim. Most tensor problems are NP-hard. J. ACM, 60(6), November 2013. doi:10.1145/2512329.
  • [27] Daniel Hsu and Sham M Kakade. Learning mixtures of spherical gaussians: moment methods and spectral decompositions. arXiv preprint arXiv:1306.0021, 2013. Presented at the 4th Conference on Innovations in Theoretical Computer Science (ITCS).
  • [28] D. Ierardi. Quantifier elimination in the theory of an algebraically-closed field. In Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing, STOC ’89, pages 138–147, New York, NY, USA, 1989. Association for Computing Machinery. doi:10.1145/73007.73020.
  • [29] Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-SAT. Journal of Computer and System Sciences, 62(2):367–375, 2001. doi:10.1006/JCSS.2000.1727.
  • [30] Prateek Jain and Sewoong Oh. Learning mixtures of discrete product distributions using spectral decompositions. arXiv preprint arXiv:1404.4604, 2014. Presented at the Conference on Learning Theory (COLT).
  • [31] Nathaniel Johnston, Benjamin Lovitz, and Aravindan Vijayaraghavan. Computing linear sections of varieties: quantum entanglement, tensor decompositions and beyond. In 2023 IEEE 64th Annual Symposium on Foundations of Computer Science (FOCS), pages 1316–1336. IEEE, 2023. doi:10.1109/FOCS57990.2023.00079.
  • [32] Z. S. Karnin and A. Shpilka. Reconstruction of generalized depth-3 arithmetic circuits with bounded top fan-in. In Proceedings of the 24th Annual IEEE Conference on Computational Complexity (CCC), pages 274–285, 2009.
  • [33] N. Kayal and C. Saha. Reconstruction of non-degenerate homogeneous depth three circuits. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019., pages 413–424, 2019. doi:10.1145/3313276.3316360.
  • [34] A. Klivans and A. Shpilka. Learning restricted models of arithmetic circuits. Theory of computing, 2(10):185–206, 2006. doi:10.4086/TOC.2006.V002A010.
  • [35] A. Klivans and D. Spielman. Randomness efficient identity testing of multivariate polynomials. In Proceedings of the 33rd Annual ACM Symposium on Theory of Computing (STOC), pages 216–223, 2001.
  • [36] Pravesh K. Kothari, Ankur Moitra, and Alexander S. Wein. Overcomplete tensor decomposition via koszul-young flattenings, 2024. doi:10.48550/arXiv.2411.14344.
  • [37] J. Landsberg. Tensors: geometry and applications. Representation theory, 381(402):3, 2012.
  • [38] Benjamin Lovitz and Fedor Petrov. A generalization of kruskal’s theorem on tensor decomposition. In Forum of Mathematics, Sigma, volume 11, page e27. Cambridge University Press, 2023.
  • [39] Ankur Moitra and Alexander S Wein. Spectral methods from tensor networks. arXiv preprint arXiv:1811.00944, 2018. arXiv:1811.00944.
  • [40] Elchanan Mossel and Sébastien Roch. Learning nonsingular phylogenies and hidden Markov models. arXiv preprint arXiv:1506.08512, 2005. Presented at the Thirty-Seventh Annual ACM Symposium on Theory of Computing (STOC).
  • [41] Shir Peleg, Amir Shpilka, and Ben Lee Volk. Tensor Reconstruction Beyond Constant Rank. In 15th Innovations in Theoretical Computer Science Conference (ITCS 2024), volume 287, pages 87:1–87:20, 2024. doi:10.4230/LIPIcs.ITCS.2024.87.
  • [42] M. Schaefer and D. Stefankovic. The complexity of tensor rank. CoRR, abs/1612.04338, 2016. arXiv:1612.04338.
  • [43] Tselil Schramm and David Steurer. Fast and robust tensor decomposition with applications to dictionary learning. In Conference on Learning Theory, pages 1760–1793. PMLR, 2017. URL: http://proceedings.mlr.press/v65/schramm17a.html.
  • [44] Y. Shitov. How hard is the tensor rank? arXiv preprint arXiv:1611.01559, 2016.
  • [45] A. Shpilka and I. Volkovich. On the relation between polynomial identity testing and finding variable disjoint factors. In Automata, Languages and Programming, 37th International Colloquium (ICALP), pages 408–419, 2010. Full version at https://eccc.weizmann.ac.il/report/2010/036.
  • [46] Ilya Volkovich. A guide to learning arithmetic circuits. In Conference on Learning Theory, pages 1540–1561. PMLR, 2016. URL: http://proceedings.mlr.press/v49/volkovich16.html.
  • [47] Alexander S Wein. Average-case complexity of tensor decomposition for low-degree polynomials. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 1685–1698, 2023. doi:10.1145/3564246.3585232.