A Universal Uniform Approximation Theorem for Neural Networks
Abstract
We show the existence of a fixed recurrent network capable of approximating any computable function with arbitrary precision, provided that an encoding of the function is given in the initial input. While uniform approximation over a compact domain is a well-known property of neural networks, we go further by proving that our network ensures effective uniform approximation – simultaneously ensuring:
- Uniform approximation in the sup-norm sense, guaranteeing precision across the whole compact domain;
- Uniformity in the sense of computability theory (also referred to as effectivity or universality), meaning the same network works for all computable functions.
Our result is obtained constructively, using original arguments. Moreover, our construction bridges computation theory with neural network approximation, providing new insights into the fundamental connections between circuit complexity and function representation.
Furthermore, this connection extends beyond computability to complexity theory. The obtained network is efficient: if a function is computable or approximable in polynomial time in the Turing machine model, then the network requires only a polynomial number of recurrences or iterations to achieve the same level of approximation, and conversely. Moreover, the recurrent network can be assumed to be very narrow, strengthening the link between our results and existing models of very deep learning, where uniform approximation properties have already been established.
Keywords and phrases:
Models of computation, Complexity theory, Formal neural networks
2012 ACM Subject Classification:
Theory of computation → Models of computation; Theory of computation → Computability; Theory of computation → Complexity classes; Computing methodologies → Neural networks
Funding:
O. Bournez and J. Cohen received support from the Programme Gaspard Monge (PGMO). O. Bournez received support from ANR Project OTC ANR-24-CE48-0335. A. Wurm received partial support from the BTU Graduate Research School (Conference Travel Grant).
Editors:
Paweł Gawrychowski, Filip Mazowiecki, and Michał Skrzypczak
Series and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Deep learning models have revolutionized machine learning. One of the best-known fundamental results in neural network theory is that they are universal approximators: for any continuous function, any compact domain, and any precision ε > 0, there exists a neural network within distance at most ε from the function over this compact domain. Mathematically, uniform approximation states that the class of functions realized by neural networks is dense in the set of continuous functions over compact domains. This holds under minimal assumptions (a single hidden layer and a non-polynomial activation function suffice [15, 29]).
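In dimension one, this density is easy to make concrete for ReLU networks. The following sketch (our own illustration, not taken from the paper) builds a one-hidden-layer ReLU network that interpolates a target function at equispaced knots; refining the grid drives the sup-norm error to zero:

```python
import math

def relu(t):
    return max(0.0, t)

def interpolating_relu_net(f, m):
    """Build a one-hidden-layer ReLU network interpolating f at the m + 1
    equispaced knots k/m on [0, 1]; returned as a plain Python function."""
    xs = [k / m for k in range(m + 1)]
    ys = [f(x) for x in xs]
    slopes = [(ys[k + 1] - ys[k]) * m for k in range(m)]  # slope on [x_k, x_{k+1}]
    # hidden-layer weights: the change of slope at each knot
    coeffs = [slopes[0]] + [slopes[k] - slopes[k - 1] for k in range(1, m)]
    def net(x):
        return ys[0] + sum(c * relu(x - xs[k]) for k, c in enumerate(coeffs))
    return net

f = lambda x: math.sin(math.pi * x)
net = interpolating_relu_net(f, 100)
err = max(abs(net(x / 1000) - f(x / 1000)) for x in range(1001))
```

Note that the width (here, the number of hidden ReLU units) grows with the requested precision and depends on the target function; this is exactly the dependence that the fixed universal network of this paper avoids.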
We relate the questions of uniform approximation to questions from computability and complexity theory. Due to space constraints, we make the editorial choice of assuming that our readers are MFCS attendees with background in these fields. For other potential readers, we do not claim that our results provide practical tools for training very deep networks or large language models, etc. Pursuing such directions would require a different focus and a separate work.
However, we believe that our results clarify the intrinsic difficulty of the associated problems. In particular, we think that the gap between practical challenges studied in deep learning and their theoretical counterparts in computability and complexity theory remains largely unexplored. We agree that some of the proofs of uniform approximation (though not all, including those mentioned above) can be considered constructive. However, we also believe our results help clarify the limits of constructivity in this context. Furthermore, our results demonstrate that a fixed, universal neural network is sufficient to obtain uniform approximation, a result that had not been established before.
The natural measures of neural network complexity include their depth (number of layers), size (number of neurons), and width (number of neurons per hidden layer). Neural networks can be viewed as a special class of circuits and described using tools from circuit complexity. Unlike Boolean circuits, they operate over real numbers, and each neuron acts as a gate applying an activation function to an affine weighted combination of its inputs. Many classical results in circuit complexity can thus be revisited in this context – an idea that dates back at least to [43, 52]. Associated questions, such as training, are consequently at least as hard as their counterparts in the Boolean circuit setting. The intractability of various neural network problems – especially training – remains a highly active area of research; see, for example, the recent works [5, 8, 10, 17, 18, 19, 20, 22, 42].
Uniform approximation by various classes of networks is also a highly active research area in the neural network literature. As expected, it connects to how continuous functions can be approximated over a compact domain by different function classes. The width of a neural network naturally reflects the “complexity” of the functions being approximated, which can be measured in various ways, such as through the coefficients of their Taylor or Fourier series, their polynomial approximation via Weierstrass’s theorem, or their decomposition via the Kolmogorov–Arnold theorem for multivariate functions. Rather than surveying these developments, we refer the reader to [4] for a recent overview.
In contrast, our focus takes an orthogonal perspective: uniform approximation by a fixed network – that is, universal uniform approximation.
We can also add that, in the neural networks literature, uniform approximation typically falls into two fundamental scenarios. The most classical approach considers networks with arbitrary width but a limited number of hidden layers – referred to here as shallow and wide networks. In many statements, including the historical results [15, 29] mentioned above, arbitrary width is essential to achieve function density. The dual perspective involves networks with arbitrary depth but limited width, known as deep narrow networks, or sometimes very deep learning, in contrast to deep learning.
We focus on this second scenario – networks with limited width and arbitrary depth. We prove that uniform approximation holds with a single fixed network of small width.
On the deep narrow scenario.
As mentioned, we do not attempt a full survey; we only briefly highlight why deep narrow networks have gained attention only recently. Increasing depth to boost performance is intuitive, but it became practical only after [27]. The main obstacle was the use of gradient-based training: as depth grows – especially beyond hundreds of layers – gradients vanish [21], leaving only the final layers effectively updated during backpropagation.
A key breakthrough was the introduction of skip connections, which allow layers to connect beyond their immediate predecessor. Skip connections were popularized about nine years ago by Residual Networks (ResNets), introduced in [27], and by Highway Nets, introduced in [51]. It has been shown that this structure allows ResNets, and related architectures, to be interpreted as discrete approximations of continuous-time dynamical systems, particularly through the forward Euler method. As a result, such networks inherit desirable properties from numerical methods, including stability, convergence, and robustness. It was later mathematically proven that many efficient models are, in fact, reformulations of discretisation schemes for ODEs. For example, following [39], PolyNet [58] can be viewed as an approximation to the backward Euler scheme for solving the underlying ODE. FractalNet [36] follows a well-known Runge-Kutta scheme. ResNet [23] corresponds to a forward Euler discretization of a simple dynamical system.
This perspective has also inspired models such as Neural Ordinary Differential Equations [14], in which the dynamics evolve continuously and are governed by the network parameters: dh(t)/dt = f(h(t), t, θ). The parameters θ can be trained using methods such as the adjoint state method, which serves as a continuous-time analogue of backpropagation [14]: see e.g. [30] for a survey. All these models fall within the deep narrow framework considered here. Such approaches have contributed significantly to the widespread use of very deep networks today.
What is known about uniform approximation in the deep narrow scenario.
It is known that a deep neural network requires a minimum width of at least the input dimension plus one to ensure uniform approximation [26]. The uniform norm requirement can also be relaxed: a width of n + 4 is known to be sufficient for approximating functions of n variables in the L1-norm [40]. Deep ReLU networks approximate smooth functions more efficiently than shallow networks, as demonstrated by adaptive depth-6 architectures that outperform standard shallow designs [54]. Furthermore, recent work establishes connections between network depth and width [28], while also revealing that neural networks possess even greater approximation power than traditional nonlinear approximation methods [16]. ResNet, even with just one neuron per layer, has been proved to approximate any Lebesgue-integrable function arbitrarily well as the number of tunable weights goes to infinity [37]. Furthermore, [38] establishes that ResNets can be reformulated as feedforward neural networks, enabling the derivation of lower bounds on the number of tunable parameters required for ResNets to approximate different function classes. Universal approximation with general activation functions has been studied in [13, 16, 31, 32, 35, 55, 56, 57], focusing on which types of functions allow for approximation and, in some cases, achieving optimal bounds.
What is known about the computational capabilities of neural networks.
In the mid-1990s, recurrent neural networks (RNNs) were widely used. It is known that if an ideal sigmoid function – often modeled as the saturated linear function, defined by σ(x) = 0 for x < 0, σ(x) = x for 0 ≤ x ≤ 1, and σ(x) = 1 for x > 1 – is allowed, then such networks can simulate Turing machines using rational weights [47]. The computational power of these networks has also been extensively investigated in the case where real or even non-computable weights are allowed, leading to connections with non-uniform complexity classes such as P/poly [48]. Refer to [46] for an overview. Since the 1990s, research has shifted toward alternative network architectures. Today, architectures using the ReLU activation, defined as ReLU(x) = max(0, x), are widely adopted. It is easy to see that deep ReLU networks can simulate ideal sigmoid networks and conversely, as illustrated by the identity σ(x) = ReLU(x) − ReLU(x − 1). Moreover, any continuous piecewise linear function can be represented by a neural network with ReLU activation [3]. Hence, the previous statements about the possibility of simulating Turing machines clearly hold for such “ideal” activation functions.
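The equivalence between the two activations is elementary; as a quick sanity check (our own snippet), the saturated linear "ideal sigmoid" can be written with two ReLU units:

```python
def relu(t):
    return max(0.0, t)

def sat(t):
    # saturated linear function: 0 below 0, identity on [0, 1], 1 above 1,
    # expressed as the difference of two shifted ReLU units
    return relu(t) - relu(t - 1.0)
```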
While it is known that neural networks can simulate Turing machines – implying universality in the computational sense (at least for ideal sigmoids) – this does not provide statements about their universal approximation capabilities. The point is that uniform approximation is about “every input”, while these statements about computational capabilities are about “some inputs”. Existing results only show that certain inputs can encode arbitrary Turing machine computations.
Our results.
In this paper, we investigate neural networks with matching input and output dimensions, focusing on their behaviour when repeatedly applied to the same input x, as is done in all the above contexts of the deep narrow scenario. We prove that uniform approximation is also achievable in the sense of computation theory. Note that we are playing with two distinct notions of uniformity: one refers to the uniform norm, as in the classical uniform approximation theorem for neural networks; the other stems from computability theory, where uniformity means that the approximation is effective – that is, computable. We establish that both hold: we have universality, in the sense that a single, fixed neural network can serve as a universal approximator.
Neural networks intrinsically compute over real numbers. This leads to several ways of considering computability when stating results: either by restricting to networks with rational coefficients and seeing them as functions over the rational numbers, or, more generally, as functions from the reals to the reals, possibly with real coefficients. In the first case, we are in the framework of classical computability over finite words. In the second case, we are in the framework of computable analysis [11, 12, 33, 53]. Our results hold in both settings, including their time complexity aspects.
2 Our contributions
We give precise definitions in Section 3, but at this point, it may help to realise that a neural network with activation function ReLU, with n inputs and m outputs, defines a particular continuous piecewise affine map from R^n to R^m.
Classical approximation theorem.
First, we recall more formally the uniform approximation theorem in the context of neural networks:
Theorem 1 (Classical Uniform Approximation Theorem [15, 29, 44]).
Let σ : R → R be a continuous, non-polynomial function. For any continuous function f : [0, 1]^n → R^n and for any ε > 0, there exists a neural network with one hidden layer and activation function σ, computing some function g, such that for all x ∈ [0, 1]^n, ‖f(x) − g(x)‖ ≤ ε.
In this statement, the width of the hidden layer can be arbitrarily large and depends on both the function being approximated and the precision parameter ε. This result holds in particular for ReLU-networks: a ReLU-Neural-Net or ReLU-Net is a neural network whose activation function is ReLU(x) = max(0, x). In this article, we mainly focus on such networks, and for conciseness, we then simply write Net for ReLU-Net. Note that Theorem 1 does not actually require the output and input dimensions to be equal. However, we have chosen this version to be more comparable to our setting (and both problems are mutually reducible). A dyadic is a number that has a finite binary expansion, meaning it can be expressed as m/2^p for some integers m and p ≥ 0. In this case, we call p the precision of the dyadic. The binary length of a dyadic rational number m/2^p is defined as the binary length of m plus the binary length of p, in consistency with the standard convention for rational numbers.
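For illustration (a helper of our own, not from the paper), the precision of a dyadic can be read off its reduced fraction: the denominator must be a power of two.

```python
from fractions import Fraction

def dyadic_precision(q):
    """Return the smallest p such that q = m / 2**p for an integer m,
    or None if q is not dyadic (denominator not a power of two)."""
    q = Fraction(q)        # Fraction normalizes to lowest terms
    d = q.denominator
    p = d.bit_length() - 1
    return p if d == 1 << p else None
```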
Our contributions.
Let CPA (respectively: CPA_D) denote the set of continuous piecewise affine functions with rational coefficients (resp. with dyadic coefficients, preserving dyadics). We fix a representation for functions in CPA (resp. CPA_D): this is a surjective function from words over a finite alphabet to CPA (resp. CPA_D), such that, as expected, given a rational q and a name w of a function f, we can easily compute f(q) (say, in polynomial time). We need only this; still, for the coming discussions about circuit complexity, instead of seeing a name merely as a description of a function, it may help to see it also as a description of a circuit computing that function.
Theorem 2 (Our Main theorem 1: Uniform Computability for Classical Complexity).
For any fixed integer d, there exists a fixed ReLU-net N of bounded width such that, for any function f of the above class over [0, 1]^d, for any name w of f, and for any precision n, there exists T such that, for all x,

‖π(N^[T](x, w, n)) − f(x)‖ ≤ 2^{−n}.  (1)

Here, N^[T] denotes the neural network obtained by composing N with itself T times, and π is the projection onto the first coordinates. Furthermore, the involved T has a simple complexity interpretation: it is polynomially related to the time required to evaluate f on x. On dyadic inputs, we can even guarantee exact equality, with T polynomial in the precision n and in the sizes of x and w.
We emphasize that while it is known that any such function can be approximated by a neural network, we state here that a stronger statement holds: a single neural network architecture is sufficient to approximate all such functions. We thus have a universal uniform approximation theorem.
Notice that no encoding is involved in the variable x: we state that f(x) is uniformly approximated as a function of x itself. Assuming only this (with no a priori encoding of x) is precisely one of the main difficulties we solve.
Moreover, this result can be further strengthened within the framework of computable analysis: the class of continuous piecewise affine functions is dense in the space of continuous functions with respect to the uniform norm. That is, any continuous function f admits a sequence (f_n) of such functions with ‖f − f_n‖ ≤ 2^{−n} for all n. As a result, any continuous function can be encoded as an infinite word concatenating names of the functions f_n. This infinite word, when interpreted as the binary expansion of a real number in [0, 1], is computable as an infinite word if and only if the function is computable in the sense of computable analysis [11, 12, 33, 53]. Hence, it can be viewed as a (possibly irrational) real number in [0, 1] whose digits encode f, and this number is computable if and only if f is computable.
Theorem 3 (Main theorem 2: Uniform computability for computable analysis).
For any fixed integer d, there exists a fixed ReLU-net N such that, for any name of a continuous function f, for any precision parameter n, there exists T such that, for all x, Equation (1) holds.
In other words, this result establishes uniform approximation with uniformity in the sense of computability for computable analysis: given any description of a function – even if the function itself is not computable – the neural network will approximate it to any prescribed precision, as determined by an input parameter. This goes beyond classical approximation: it guarantees that a single, fixed network architecture can approximate any described function with explicit, parameter-driven control over the precision parameter.
From a complexity-theoretic perspective, beyond mere computability, we prove that the parameter T admits a natural interpretation: for suitable functions, T corresponds – up to a polynomial factor – to the size of the family of Boolean circuits that compute the function over the dyadic rationals. In this sense, T captures the inherent computational complexity of the function. In the next statement, we say that f is t-computable if, in time t(n), we can produce some 2^{−n}-approximation of f by some continuous piecewise affine function:
Theorem 4 (Main theorem 3: Computational Complexity Approximation Theorem).
For any fixed integer d, there exists a fixed ReLU-net N such that, for any name w of a t-computable function f, for any precision parameter n, there exists T such that, for all x,

‖π(N^[T](x, w, n)) − f(x)‖ ≤ 2^{−n},  (2)

where T is polynomial in t(n) and n.
Unlike previous deep learning results, this is established by computability-theoretic arguments, specifically by simulating circuit machines. Furthermore, for more general functions, we provably relate T to the complexity of the function to be approximated. Using recent results, this can be connected to the Kolmogorov complexity of the function [1, 2].
Main technical contribution with a high potential of new applications.
Our results are obtained by programming with ReLU-nets. While the final conclusions may seem intuitive – essentially that ReLU-nets can be viewed as specific circuit models, and that we have universality – one might wonder why such statements had not been established before. We now outline the exact issues we addressed and why they were not immediately apparent from existing work.
In short, we provide a method to map from the reals to the world of Turing machines, and conversely, using only ReLU-net functions. The difficulty lies in uniform approximation, which requires approximation over all inputs (real, rational, and irrational alike). Restricting to a suitable subset would have been much easier, as existing results from the 1990s already address this case. However, achieving true uniform approximation demands mapping all inputs into a domain where Turing machines can operate effectively.
More precisely, one has to realise first that functions constructed by a ReLU-net are necessarily of a specific form. In particular, even when iterated a bounded number of times, they always remain Lipschitz continuous. This makes some simple functions impossible to obtain. For example, the Heaviside function, being discontinuous, is impossible to compute with a ReLU-net in a finite number of iterations. This limitation extends beyond Heaviside, excluding many other functions as well:
Example 5.
Consider any with values when and when . This function cannot be computed within a bounded number of iterations. Indeed, it can be extended to a Lipschitz-continuous function on . However, this extension fails if we drop the restriction and instead allow , as it would no longer be Lipschitz-continuous in . This occurs because at some point, making it unbounded. Consequently, while a network could compute the function, the number of iterations required cannot be bounded. As a result, it cannot be directly related to the number of steps required by the algorithm.
That stated, we indeed rely on the well-established result from the 1990s that Turing machines can be simulated by ReLU-nets. However, this leads us to discuss some coding issues, and the need to encode and decode inputs, if we want to reach a setting where we can use these results for uniform approximation.
This issue arises from the nature of the input data type: the input of a network is a real number, while the Turing machine expects a discrete input (e.g. a word encoding x). A central first idea to bridge this gap is to use the binary expansion of x. The Turing machine would then process the input bit by bit. However, extracting the first bit of a real number is generally impossible due to discontinuity. For example, consider x = 1/2 + ε, which has first bit 1, while x = 1/2 − ε yields first bit 0. For ε → 0, both values converge to 1/2, but their first bits remain distinct. Thus, a network extracting the first bit of every input cannot exist, as it would necessarily compute a discontinuous function – an operation fundamentally incompatible with the continuity and piecewise affine structure enforced by ReLU-nets. A trick, adapted from Turing machine simulations of the 1990s [47, 48], uses base 4. The discontinuity issue disappears if we encode each bit b as the base-4 digit 2b + 1 ∈ {1, 3}, turning the input into a 1,3-quadradic. In this encoding, the first bit can be extracted by a function that equals 0 on the range of values whose first base-4 digit is 1 and equals 1 on the range whose first digit is 3. Since this function is not constrained on the gap in between, it can be extended arbitrarily on this domain as a continuous, semilinear function – thereby making it compatible with computation by a neural network.
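To make the trick concrete, here is a small illustration of our own (the paper's actual construction is more involved): bits are mapped to the base-4 digits 1 and 3, and the first bit is then recovered by a two-ReLU function that is constant on each of the two digit ranges [1/4, 1/2] and [3/4, 1], using the gap (1/2, 3/4) to interpolate continuously.

```python
def relu(t):
    return max(0.0, t)

def encode_quadradic(bits):
    """Encode a bit string as a 1,3-quadradic in base 4: bit b -> digit 2*b + 1."""
    x = 0.0
    for i, b in enumerate(bits, start=1):
        x += (2 * b + 1) * 4.0 ** (-i)
    return x

def first_bit(x):
    """Continuous piecewise-linear map: 0 on [1/4, 1/2] (first digit 1),
    1 on [3/4, 1] (first digit 3), affine on the gap where no encoding lives."""
    return relu(4.0 * x - 2.0) - relu(4.0 * x - 3.0)
```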
Such an encoding using 1,3-quadradics is employed throughout the Turing machine simulations presented in [47, 48, 46], where all the problems mentioned above were already present.
In other words, while it is well understood that, within the Turing machine framework (computability theory and computable analysis), we can achieve uniform approximation by mapping an encoding of x to an arbitrarily precise encoding of f(x), this does not, however, immediately imply that we can compute f(x) from x itself, for all x, with a neural network. The key difficulty lies in converting a real or dyadic number into a suitable 1,3-quadradic encoding that a Turing machine can process: this is exactly what we solve.
Remark 6.
We are not the first to face this issue of needing conversions by continuous functions. One point is that many approaches, such as [45, 24, 25], work over unbounded domains, which makes life considerably simpler, as they can use functions that essentially round to integers. Here, everything must be done on a bounded domain, and furthermore, with ReLU only. The closest to our approach is [6], where discrete ODEs and analytic functions are used. However, their framework differs fundamentally from ours: they do not work with neural networks, and their constructions rely on tools outside the scope of ReLU-nets. For example, they employ a function that mimics a sigmoid as a parameter grows. This construction requires a multiplication by that parameter (even if obtained recursively there). This seems impossible to obtain using finitely many iterations of a ReLU-net. In ReLU-nets, multiplications are not permitted, and all intermediate variables must remain bounded, whereas their approach relies on multiplications with potentially unbounded weights.
This transformation uses only ReLU-nets. Our main technical result is stated below:
Proposition 7 (From the reals to the world of TMs).
There exists a ReLU-net such that, given x as a numerical value and a precision parameter, the network returns the corresponding 1,3-quadradic encoding of a dyadic of that precision approximating x.
The proof of the following proposition is similar and simpler.
Proposition 8 (From the world of TMs to the reals).
There exists a ReLU-net such that, given as a numerical value a 1,3-quadradic of precision n, it returns the corresponding dyadic.
These constructions yield both encoders and decoders between dyadics and 1,3-quadradics, with a major advantage: the latter encoding can be tailored to be robust against local errors. One important remaining issue is that they work for most dyadics, but not all.
Remark 9.
This limitation is unavoidable. Due to continuity constraints, it is impossible to map every dyadic to its encoding, as no continuous function can map an interval onto a discrete set. The functions constructed in the proof of Proposition 7 exhibit “jumps”, where the output is necessarily incorrect.
We overcome this limitation by introducing an error indicator, allowing two decoders to run in parallel and always ensuring a correct result. Formally:
Proposition 10.
The networks of Propositions 7 and 8 can be designed to be robust against local errors. Specifically, if the input is within a small distance of a given encodable value, then the encoder still correctly outputs its encoding, and the decoder still correctly returns the corresponding dyadic.
Therefore, for any arbitrary real number x, the network outputs the base-4 encoding of a dyadic close to x, up to the prescribed precision. Furthermore, running two suitably shifted copies of the encoder guarantees that at least one of them returns the base-4 encoding of a dyadic close to x. Additionally, we can include an error indicator in a second dimension to signal potential errors, set to a designated value only if the dyadic computation is correct. All of the above operations can be performed on dyadic (or base-4) vectors componentwise in parallel.
Discussions.
Notably, as a side effect, we also prove that universality can be achieved with small constant width (Theorem 23). To the best of our knowledge, this result is original. Furthermore, the ability to map neural computations onto the Turing machine framework gives rise to numerous variants of the previous statements. In essence, many of these variations become valid as soon as the corresponding operations are feasible on a Turing machine.
Interpretation of T as a measure of the complexity of the functions.
Coming back to classical complexity, our results give a way to represent continuous functions using circuits, where the time T is polynomially related to the circuit size. In this sense, T measures the circuit complexity of the uniform approximation of a function; see [52] for a comprehensive reference on circuit complexity. More generally, several natural measures can be used to quantify the complexity of a function – among them, Kolmogorov complexity. It is commonly defined via the function K(w), which denotes the length of the shortest binary description p such that a fixed universal Turing machine U satisfies U(p) = w. If we adopt the slightly different definitions proposed in [2], which depart from the classical formulation, then Kolmogorov complexity can be directly related to circuit complexity. In this sense, we establish that the time parameter T is polynomially related both to the size of the circuits computing the function (possibly relativized) and to its Kolmogorov complexity; the known relations hold up to polynomials, with the halting problem as oracle [2, 1].
3 Basic definitions and statements
The networks that we consider are either feedforward or built by iterating feedforward layers. In their most general form, they process real-valued inputs. We use only rational weights – making them both expressive and analytically tractable.
Definition 11 (Feedforward neural network).
A feedforward neural network (NN) is a layered graph that represents a function of the form f : R^{n_0} → R^{n_L}, for some L. The first layer, with label 0, is called the input layer and consists of n_0 nodes called input nodes. Each node i receives an input x_i, which is also taken as its output: y_i^(0) = x_i. Layers indexed by 1 ≤ ℓ < L are hidden layers, each containing n_ℓ computation nodes. The output of the i-th node in layer ℓ is given by y_i^(ℓ) = σ(Σ_j w_{i,j}^(ℓ) y_j^(ℓ−1) + b_i^(ℓ)). Here, σ is a (typically nonlinear) activation function and the sum runs over all nodes j in the previous layer ℓ − 1. The w_{i,j}^(ℓ) are rational constants called weights, and b_i^(ℓ) is a rational constant called bias. The outputs of layer ℓ are collectively denoted y^(ℓ). The final layer L is the output layer and consists of n_L output nodes. The i-th output node computes its output in the same way as a node in a hidden layer. The vector y^(L) is the output of the network when given input x, denoted N(x).
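Definition 11 amounts to alternating affine maps and the activation, applied layer by layer. A minimal sketch of our own (pure Python, weight matrices as nested lists):

```python
def relu(v):
    return [max(0.0, t) for t in v]

def affine(W, b, y):
    # one affine combination per node: row of W gives its weights, b its bias
    return [sum(wij * yj for wij, yj in zip(row, y)) + bi
            for row, bi in zip(W, b)]

def forward(layers, x):
    """Evaluate a feedforward net given as a list of (W, b) pairs; the
    activation is applied in every layer, output layer included, as in
    Definition 11."""
    y = list(x)
    for W, b in layers:
        y = relu(affine(W, b, y))
    return y
```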
We consider classes of functions described by feedforward neural networks with the following structure: a given number of input neurons, of output neurons, and of hidden layers, each containing at most a given number of neurons; all hidden neurons use the activation function σ. We use a wildcard symbol to indicate that no constraint is imposed on some parameter. One can denote in this way, for example, the feedforward neural networks with activation function σ, n neurons in the input layer, one neuron in the output layer, and one hidden layer containing an arbitrary number of neurons.
Definition 12 (Composition of Neural Networks).
Let N1 and N2 be two neural networks such that the output dimension of N1 equals the input dimension of N2. The notation N2 ∘ N1 denotes both the resulting network formed by composing the two networks – i.e., by connecting the output nodes of N1 to the corresponding input nodes of N2 – and the function it computes. Moreover, in the case where the input and output dimensions of a network N coincide, we inductively define N^[1] = N and N^[k+1] = N ∘ N^[k] for all k.
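Iterated self-composition is the workhorse of the recurrent viewpoint; as a trivial sketch of our own, for a network given as a plain function:

```python
def iterate(net, x, k):
    """Compute net^[k](x): apply net to its own output k times."""
    for _ in range(k):
        x = net(x)
    return x
```

For instance, iterating the contraction x ↦ 0.5·x + 1 converges towards its fixed point 2.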
As defined above, all considered networks exclusively use coefficients in Q and, from now on, employ the ReLU activation function. To avoid redundancy, we omit explicitly writing "ReLU" and "rational" in every instance. We assume a fixed notation for these networks, given by a surjective function from finite words to networks; thus, every such network has a name, meaning there exists a finite word denoting it. Note that these objects represent functions from the reals to the reals, even though their coefficients are restricted to rational numbers.
The following theorem is already well established:
Theorem 13 ([3]).
Every continuous piecewise affine function with rational coefficients can be represented by a ReLU-net.
For vectors u and v of the same dimension, we write u ≤ v (or u < v) if the inequality holds component-wise. The binary size of an object is denoted as usual; for example, the binary length of an integer is the length of its binary representation. A dyadic number of precision p in [0, 1] can be represented in binary by its digits b_1, …, b_p, with each b_i ∈ {0, 1}, corresponding to the value Σ_{i=1}^{p} b_i 2^{−i}. To establish our results, we introduce the concept of 1,3-quadradics, defined as the subset of real numbers whose base-4 expansion consists exclusively of the digits 1 and 3.
Definition 14 (-quadradics).
We denote by D_{1,3} the set of dyadic numbers whose base-4 expansion consists of finitely many digits, each equal to 1 or 3, followed by an infinite sequence of zeros, and by R_{1,3} the set of numbers whose base-4 expansion is infinite with only 1's and 3's. That is, the elements of D_{1,3} are of the form Σ_{i=1}^{p} d_i 4^{−i} for some p, and those of R_{1,3} of the form Σ_{i=1}^{∞} d_i 4^{−i}, with each d_i ∈ {1, 3}.
Remark 15 (Simple key observation).
While the dyadics are dense in [0, 1], the 1,3-quadradics are not – they exhibit gaps, a crucial property we will use throughout this work. For example, no 1,3-quadradic lies within the interval (1/2, 3/4). Moreover, the set of infinite 1,3-quadradics is homeomorphic to the Cantor set.
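The gap is easy to check by brute force (a snippet of our own): enumerating every finite 1,3-quadradic up to a given expansion length, none falls in (1/2, 3/4), since a first digit 1 forces a value in [1/4, 1/2) and a first digit 3 a value in [3/4, 1).

```python
from itertools import product

def finite_quadradics(depth):
    """All nonzero finite 1,3-quadradics with base-4 expansions of length <= depth."""
    vals = []
    for p in range(1, depth + 1):
        for digits in product((1, 3), repeat=p):
            vals.append(sum(d * 4.0 ** -(i + 1) for i, d in enumerate(digits)))
    return vals
```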
4 Programming with Neural Networks
To construct neural networks tailored to specific tasks – effectively programming with neural networks – it is useful to describe them using a simple programming language. We outline the structure of such programs, showing that these networks can simulate algorithms involving loops and conditional statements. We consider algorithms composed of commands, while loops, and conditions. A command, written x_i := g(x_1, …, x_d) where g is some function, corresponds, as expected, to the function that maps (x_1, …, x_d) to (x_1, …, x_{i−1}, g(x_1, …, x_d), x_{i+1}, …, x_d). A command is said to be affine if the function g is affine.
Theorem 13, as stated above, asserts that any continuous piecewise affine function can be programmed/computed by a feedforward network. However, careful attention is needed regarding the involved functions, and the potential introduction of new variables or gap-related issues in the construction. This is why we need to come back to very concrete programs and functions. To address this, we present a simple yet useful observation on transforming affine-programmable functions into ReLU-nets.
Lemma 16 (From a program to a Neural Net).
Let be an affine-programmable function such that every if- and while-condition of the form satisfies a gap condition: there exists such that is guaranteed, with . Additionally, assume that all intermediate values computed by the algorithm remain within the interval . Then there exists a network computing such that for the smallest satisfying . Moreover, the number of iterations required by the network is linear in the runtime complexity of the algorithm.
Proof.
Let all basic commands be labeled with indices according to the order in which they appear in the algorithm. We first define a command-finding function which, given the index of the current command and the current state, returns the index of the next command to execute.
- Case 1:
-
The return-command terminates the network, meaning for all .
- Case 2:
-
Let be the index of the last command before entering a while-loop with condition , so is the index of the first command in that loop. Let be the index of the first command after leaving the loop, making the last command in the loop. The command-finding function in this case is defined as:
- Case 3:
-
If-conditions follow the same structure as in Case 2.
- Case 4:
-
All other commands are mapped to their immediate successor, that is, .
The function defined above can be extended to a continuous piecewise affine map (CPAM) and is therefore computable by a -net by Theorem 13.
Second, we construct the network . Given input , first computes as well as the outputs of all commands . It then subtracts from the result of each command , where denotes the Kronecker delta. In the first layer, this yields the vector . Since for and for , the first layer simplifies to . In the output layer, a component-wise sum over all ensures that the final output is .

The above statement and its proof may at first appear to be a trivial observation. However, the restriction to the domain instead of and the requirement for gaps in conditional queries are essential for the construction to be valid. In this sense, the difficulty lies in ensuring that all computations are carried out using functions that can be explicitly implemented by -nets.
The above proof does not aim to optimize the width of the constructed -nets. We present it in this form as it facilitates the understanding of subsequent arguments. Later, we will apply several techniques to reduce the network width (e.g., in the proof of Theorem 23). Essentially, if we had applied the same optimizations in the above proof, we would obtain feasibility within a width that is at most the maximum width of the involved commands, as established by Lemma 21, by executing the commands sequentially.
To achieve width reduction, we will also leverage techniques similar to those used in the proof of the following statement, sometimes in an even more optimized form.
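The dispatch mechanism of the proof of Lemma 16 can be mimicked in ordinary code. The following Python sketch is purely illustrative (the toy program, command indices, and the threshold are our own assumptions, not taken from the paper): in the network, all command outputs are computed in parallel and the inactive ones are cancelled via the Kronecker-delta trick; here we simply select the active one.

```python
# Toy affine program: while x < 8: x = 2 * x; return x.
# Command 1 is the loop test, command 2 the loop body, command 3 the return.
commands = {1: lambda s: s, 2: lambda s: 2 * s, 3: lambda s: s}

def next_command(k, s):
    """Command-finding function: index of the next command to execute.
    On the reachable states (powers of 2), the test s < 8 stays well away
    from the threshold, as the gap condition of Lemma 16 requires."""
    if k == 1:
        return 2 if s < 8 else 3
    if k == 2:
        return 1          # jump back to the loop test
    return 3              # the return-command maps to itself

def run(s, k=1):
    while k != 3:
        s = commands[k](s)      # in the network: all outputs at once,
        k = next_command(k, s)  # inactive ones cancelled by the delta trick
    return s

print(run(1))  # 1 -> 2 -> 4 -> 8, then the test fails: returns 8
```

The point of the construction is that both the command table and the command-finding function extend to continuous piecewise affine maps, which is what makes the whole loop implementable by a network.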
Proposition 17 (Width efficient computation of max-min string, [26, Proposition 2]).
A function is a max-min string of length on input variables and output variables if there exist affine functions : such that
where each is either a coordinate-wise max or a min.
For every max-min string of length on input variables and output variables, and for every compact , there exists a -net with input dimension , hidden layer width , output dimension , and depth that computes for all .
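As an illustration of the definition (with our own toy affine maps, not taken from the paper), a max-min string simply alternates affine functions with coordinate-wise max/min combinations:

```python
# A toy max-min string of length 3 on one input variable:
# f(x) = min( max( l1(x), l2(x) ), l3(x) ).

def affine(a, b):
    return lambda x: a * x + b

l1 = affine(1.0, 0.0)    # x
l2 = affine(-1.0, 1.0)   # 1 - x
l3 = affine(0.0, 0.75)   # constant 3/4

def f(x):
    return min(max(l1(x), l2(x)), l3(x))

# The result is continuous piecewise affine: a "tent" clipped at 3/4.
print([f(x) for x in (0.0, 0.5, 1.0)])  # [0.75, 0.5, 0.75]
```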
5 From the world of real numbers to Turing machines and back
5.1 Turing machines
We assume our readers are familiar with Turing machines and their classical properties; we refer to [50] for a complete introduction. Following [50], we define a (one-tape) Turing machine as a 7-tuple where are finite non-empty sets, is a blank symbol such that is the set of states; is the set of input symbols, with . is the set of tape symbols, with ; is the transition function, with the left shift and the right shift. If is not defined on some , then the machine halts; is the initial state; is the accept state; is the reject state, with . The machine halts on input if it eventually reaches either state or . A configuration (or instantaneous description (ID)) represents the current state of the machine: the state , the tape content, and the head location.
The space of IDs of a Turing machine can be defined as , where denotes infinite words over alphabet . The initial configuration associated with input , denoted by , is given by . A configuration represents a machine state where the tape content is , with each , the head positioned over symbol , and the internal state equal to . We write if configuration is the immediate successor of configuration under the transition function .
Remark 18.
Notice that we define instead of in order to possibly include infinite words, i.e., the fact that there might be an infinite word on the tape. Notice also that, under this convention, the content to the right of the head is written in the standard order, from left to right, whereas the content to the left of the head is written in reverse order, from right to left.
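A small Python sketch may help fix this convention (the symbol names and the blank marker 'B' are our own illustrative choices): since the left half of the tape is stored in reversed order, a move to the right simply transfers the freshly written symbol onto the front of the left half.

```python
# One Turing-machine move to the right, with the left half of the tape
# kept in reversed order (closest-to-head symbol first), as in Remark 18.

def step_right(left_rev, state, right, write, new_state):
    """Overwrite the scanned symbol right[0] with `write`, then move right."""
    rest = right[1:] or ['B']          # 'B' stands in for the blank symbol
    return [write] + left_rev, new_state, rest

cfg = step_right([], 'q0', list('abb'), write='c', new_state='q1')
print(cfg)  # (['c'], 'q1', ['b', 'b'])
```

A move to the left is symmetric: the symbol closest to the head pops off the reversed left half and is prepended to the right half.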
5.2 Simulating TMs with
It is well established that Turing machines can be simulated by PAMs [34, 41, 49]. This can be formalized as follows (see, for instance, [7]):
Lemma 19 ([7, Lemma 4.1]).
Let be a Turing machine with , and , and let be its configuration space. There exists a piecewise affine function and an encoding function such that the following diagram commutes:
(i.e. for all configurations , with ).
For an explicit proof, see [7]. The proof there essentially follows [34], itself building on the ideas from [41, 49].
Idea of the proof.
It is helpful to briefly explain the idea behind the proof of Lemma 19. First, we can assume that the Turing machine’s alphabet is , where the blank symbol is . In this setting, any finite word over alphabet can be considered as a -quadradic. Namely, any word can be viewed as the base-4 number , that is, the element of given by with . This function maps into the subset of corresponding to .
So, a configuration can be seen as element
A simple observation is that constructing a piecewise affine function satisfying the desired commutative diagram becomes straightforward if we replace the role of in this diagram with . Assuming , then lives in , and we get the above diagram where is replaced by . To reduce it back to , one can apply the encoding trick described in Lemma 21.
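The affine flavor of this construction can be sketched in a few lines of Python (again assuming the digit set {1, 3}; the maps below are illustrative reconstructions, not the paper's exact formulas): popping the digit nearest the head and pushing a digit onto a tape half are each affine on the relevant branch, and the gap (1/2, 3/4) is what lets a piecewise affine test tell the branches apart.

```python
def encode(digits):
    """Base-4 encoding with the assumed digit set {1, 3}."""
    return sum(d * 4.0 ** -(i + 1) for i, d in enumerate(digits))

def pop(y):
    """Read and remove the leading digit: affine in y on each branch."""
    d = 1 if y < 0.5 else 3    # the gap (1/2, 3/4) separates the branches
    return d, 4.0 * y - d

def push(x, d):
    """Prepend digit d: a single affine map."""
    return x / 4.0 + d / 4.0

y = encode([3, 1, 1])
d, rest = pop(y)
print(d, rest == encode([1, 1]))  # 3 True
```

A full machine step combines one pop, one push, and a state update, all of which are affine once the branch (the scanned digit and the current state) is fixed.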
Remark 20.
The main advantage of using -quadradics instead of the binary representation is that we only need to specify the function on the image of . As this image is a subset of , it contains gaps (see Remark 15), meaning we can entirely ignore how behaves inside them. Outside of the gaps, operations such as moving the tape left or right, or replacing a symbol, are easily seen to correspond to particular affine functions. Moreover, since the function is only required on the structured subset , we do not need to specify its values inside the gaps: we can even take the function to be continuous and, in some cases, smooth. This aspect is discussed in references such as [7, 41, 34, 9, 46].
In general, we can encode a real, and some finite information into a unique real:
Lemma 21 (Encoding finite information into a real).
Let . Then can be represented as a -quadradic of finite precision : we can view as , an element of , whose base-4 expansion consists of exactly digits, each in . Using this, we can encode any pair via the affine function with gaps: . Recovering the pair from can then be achieved using an affine function with gaps. We write for this reverse function, and for its restriction to the interval .
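The trick can be sketched as follows (a hedged reconstruction: the header precision P, the digit set {1, 3}, and the bit-to-digit map below are our own illustrative assumptions): a fixed-length base-4 header stores the finite information, the tail stores the real, and both directions are affine with gaps.

```python
P = 3  # header precision: enough for integers n < 2**P (assumption)

def header(n):
    """Write n < 2**P with P base-4 digits, bit b -> digit 1 + 2*b."""
    digits = [1 + 2 * ((n >> (P - 1 - i)) & 1) for i in range(P)]
    return sum(d * 4.0 ** -(i + 1) for i, d in enumerate(digits))

def pair(n, x):
    return header(n) + x * 4.0 ** -P   # affine in x for each fixed n

def unpair(z):
    """Recover (n, x) by P gap-separated digit extractions (affine steps)."""
    n = 0
    for _ in range(P):
        b = 0 if z < 0.5 else 1        # the gap keeps the branches apart
        n = 2 * n + b
        z = 4.0 * z - (1 + 2 * b)      # strip the digit: affine per branch
    return n, z

print(unpair(pair(5, 0.3125)))  # (5, 0.3125)
```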
These are the classical techniques for simulating a Turing machine using piecewise affine functions over . Then, using Theorem 13, we can implement this simulation within a -net. Actually, we improve this idea in order to obtain low-width -nets.
Remark 22.
We could have used Proposition 17, but it would result in a width-4 construction and would still require proving that max-min strings can be implemented.
Theorem 23.
The piecewise affine function of Lemma 19 can be implemented by a -net of width .
Proof.
Indeed, the program of the Turing machine consists of instructions of the form with . This corresponds to the high-level instruction of the form:
“line : If one reads Then write ; shift the tape left or right depending on ,
Goto line ”
For example, if , , this translates into a line of the form:
“line : If Then ; ;
(here, and represent and of Lemma 19, encoding in -quadradics the left and right parts of the tape), where is as follows (we do not write , because it is symmetric; the last line is there to deal with the blank symbol ):
-
If Then ;
-
If Then ;
-
If Then ; ;
-
Goto line
This involves lines all of the form, say, If Then . Because of the use of -quadradics and the presence of gaps, this assignment still works if written as , for any function that equals when , takes the value on the interval , and returns again when . We could take .
Then, this computation can be implemented using four layers of width in a -network. The first two neurons are used to store the values of and , while the third neuron serves as a working neuron, incrementally computing each term in the expression defining the function . Concretely, the first layer maps to ; the second layer maps to ; the third layer computes ; and finally, the fourth layer maps to , forwarding the result.
6 Main Technical Statement 1: Proofs of Propositions 7 and 10
Proof.
Observe that the function , defined by , can be computed by a -network. This function satisfies for all , and for all . Define the function . The function satisfies the property that if and only if . Both and are computable by neural networks of width 2.
We now describe precisely how to transform a dyadic number in binary form into a -quadradic of the same precision.
The step of ensures that the new value of the dyadic lies in the middle of the interval of all real numbers whose binary expansion starts with .
The above algorithm returns , but with the digits in reversed order. This issue is easily corrected, as the POP-operation can be performed in base in constant time. By Lemma 16, the code suffices to show that a recurrent network can implement this transformation.
Note that the above algorithm computes the first digits, so it would have returned the same result as long as digits to do not all have the same value. This holds for any input in the interval . Therefore, if the algorithm fails for a non-dyadic , it will still return a correct result for a nearby value , because .
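A simplified version of this conversion can be sketched in Python (bit b mapped to the assumed digit 1 + 2b, ignoring the digit-order issue discussed above; the digit set is again our assumption):

```python
# Convert the first p binary digits of a dyadic d in [0, 1) into a
# quadradic at the same precision, one digit extraction per iteration.

def to_quadradic(d, p):
    x = 0.0
    for i in range(1, p + 1):
        b = 0 if d < 0.5 else 1    # leading binary digit of d
        d = 2.0 * d - b            # shift it out: affine on each branch
        x += (1 + 2 * b) * 4.0 ** -i
    return x

print(to_quadradic(0.625, 3))  # 0.101 in binary -> digits 3, 1, 3 -> 0.859375
```

Each iteration is an affine step with a gap at the branch test, which is exactly what Lemma 16 requires of the pseudocode.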
7 Proof of Theorems 2, 3 and 4
Proof.
For lack of space, we assume that has dimension . Furthermore, we assume that is Lipschitz-continuous with constant . If , we replace by the largest power of 2 smaller than . We also assume that ends with a unique symbol # whose binary encoding is metrically isolated, in the sense that for every other string that does not contain #. We now argue that the following pseudocode can be implemented within the constraints of Lemma 16 (we assume the decoding back into the dyadic numbers to be done implicitly in the return command; the same holds for the next pseudocode).
Note that at least one of and is correctly encoded; without loss of generality, assume it is for . This ensures that at all times. Since the computation of eventually halts, becomes static and is -close to . For , two cases arise:
- Case 1:
-
is also correctly encoded in base . In this case, and by continuity of , is also -close to . Consequently is -close to as well.
- Case 2:
-
is incorrectly transformed into base 4, which means that . After iterations of the last loop, , leading to . This ensures that , which is again -close to .
However, if an error remains nonzero, there is no bound on the runtime: for example, if is very small, it may require an arbitrarily long time before successive multiplications by make it exceed .
To establish a polynomial bound on the number of network iterations when is a dyadic number, consider the following simplified version of the above algorithm:
To transform into base 4, the outer loop runs times, so the inner loop executes a total of iterations. The transformation of into base requires polynomially many steps in the precision of that we need. The final loop depends polynomially on the complexity of computing . Thus, Theorem 3 follows directly: instead of extracting the function immediately, we delete the first bits of until reaching the -th , and then extract , which, due to fast convergence, is -close to the desired function. We then proceed as in Theorem 2. This program, as written above, has variables (including the index of the current command). It can hence be implemented by a width -net by techniques similar to the proof of Theorem 23.
8 Conclusion
We established a framework for approximating functions using recurrent networks and showed that universal approximation is achievable with the activation function. Beyond proving the existence of a universal network, we also proved that the number of network recurrences corresponds to the complexity of computing the approximated function. Our approach is guaranteed to be precise for functions in and dyadic numbers, while efficiently approximating any continuous function on a bounded domain to arbitrary precision. Additionally, we examined the link between Turing machines and functions. We extended it to a connection between Turing machines and neural networks and also provided a method for translating pseudocode into equivalent network computations.
Natural further open questions are:
-
1.)
Do the above statements hold for other activation functions? First, note that we have restricted our analysis to bounded domains. In particular, when evaluating , every intermediate result was assumed to lie in , where bounds the largest weight added to the largest bias in the network. Any function that coincides with in a neighborhood of zero would therefore work as well, because such a function can be linearly stretched into a function that coincides with on . For example, this applies to the linear sigmoid mentioned earlier, as well as to piecewise linear rescalings such as if , if , if . Clearly, since any non-affine function can be linearly transformed into a function that coincides with the function on an interval around zero, the same conclusion holds for any rational continuous piecewise affine activation function.
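The stretching argument can be made concrete with a toy piecewise linear activation (the clipping shape, the agreement radius eps, and the scale c below are illustrative assumptions, not constants from the paper):

```python
def sigma(t, eps=0.04):
    """Toy activation that agrees with the identity only on [-eps, eps]."""
    return max(-eps, min(eps, t))

def stretched(t, c=100.0):
    """Conjugating by the scale c turns agreement with the identity on
    [-eps, eps] into agreement on [-c*eps, c*eps]: here [-4, 4]."""
    return c * sigma(t / c)

# On the bounded range the stretched activation acts as the identity,
# which is all the simulation needs once intermediate values are bounded.
print(all(abs(stretched(t) - t) < 1e-9 for t in (-4.0, -1.5, 0.0, 2.0, 4.0)))  # True
```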
-
2.)
Can the size of the overall network be bounded? What about its width and depth? Universality is known to hold with width 3 for Turing machine simulation, but whether it holds with width 2 remains open. After several attempts, we believe this question is related to the well-known open problem of whether universality can be achieved using one-dimensional systems, such as piecewise affine functions. In our construction, the encoding and decoding mechanisms, as described in this paper, require a network of width ; this part was not conceived to be width-minimal. As a result, the total width required by our universal construction is currently the maximum between (for encoding/decoding) and (for simulation), making the encoding/decoding part the main bottleneck. A natural question thus arises: what is the minimal width required for encoding and decoding?
References
- [1] Eric Allender. Circuit complexity, Kolmogorov complexity, and prospects for lower bounds. In DCFS, pages 7–13. Citeseer, 2008.
- [2] Eric Allender, Harry Buhrman, Michal Kouckỳ, Dieter Van Melkebeek, and Detlef Ronneburger. Power from random strings. SIAM Journal on Computing, 35(6):1467–1493, 2006. doi:10.1137/050628994.
- [3] Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units. arXiv preprint arXiv:1611.01491, 2016. arXiv:1611.01491.
- [4] Midhun T Augustine. A survey on universal approximation theorems. arXiv preprint arXiv:2407.12895, 2024. doi:10.48550/arXiv.2407.12895.
- [5] Daniel Bertschinger, Christoph Hertrich, Paul Jungeblut, Tillmann Miltzow, and Simon Weber. Training Fully Connected Neural Networks is -Complete. Advances in Neural Information Processing Systems (NeurIPS 2023), 36:36222–36237, 2023.
- [6] Manon Blanc and Olivier Bournez. A characterisation of functions computable in polynomial time and space over the reals with discrete ordinary differential equations: Simulation of Turing machines with analytic discrete ODEs. In Jérôme Leroux, Sylvain Lombardy, and David Peleg, editors, 48th International Symposium on Mathematical Foundations of Computer Science, MFCS 2023, August 28 to September 1, 2023, Bordeaux, France, volume 272 of LIPIcs, pages 21:1–21:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPIcs.MFCS.2023.21.
- [7] Vincent D. Blondel, Olivier Bournez, Pascal Koiran, and John Tsitsiklis. The stability of saturated linear dynamical systems is undecidable. Journal of Computer and System Science, 62(3):442–462, May 2001. doi:10.1006/jcss.2000.1737.
- [8] Digvijay Boob, Santanu S Dey, and Guanghui Lan. Complexity of training ReLU neural network. Discrete Optimization, 44:100620, 2022. doi:10.1016/J.DISOPT.2020.100620.
- [9] Olivier Bournez and Michel Cosnard. On the computational power of dynamical systems and hybrid systems. Theoretical Computer Science, 168(2):417–459, November 1996. doi:10.1016/S0304-3975(96)00086-2.
- [10] Cornelius Brand, Robert Ganian, and Mathis Rocton. New complexity-theoretic frontiers of tractability for neural network training. Advances in Neural Information Processing Systems, 36:56456–56468, 2023.
- [11] Vasco Brattka. Computability & complexity in analysis. Tutorial donné à Computability in Europe (CIE’2005), 2005. doi:10.1016/j.jco.2006.10.001.
- [12] Vasco Brattka, Peter Hertling, and Klaus Weihrauch. A tutorial on computable analysis. In New computational paradigms, pages 425–491. Springer, 2008. doi:10.1007/978-0-387-68546-5_18.
- [13] Yongqiang Cai. Achieve the minimum width of neural networks for universal approximation. In The Eleventh International Conference on Learning Representations, 2023. URL: https://openreview.net/forum?id=hfUJ4ShyDEU.
- [14] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018. URL: https://proceedings.neurips.cc/paper/2018/hash/69386f6bb1dfed68692a24c8686939b9-Abstract.html.
- [15] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989. doi:10.1007/BF02551274.
- [16] Ingrid Daubechies, Ronald DeVore, Simon Foucart, Boris Hanin, and Guergana Petrova. Nonlinear approximation and (deep) ReLU networks. Constructive Approximation, 55(1):127–172, 2022.
- [17] Vincent Froese and Christoph Hertrich. Training neural networks is np-hard in fixed dimension. Advances in Neural Information Processing Systems, 36:44039–44049, 2023.
- [18] Vincent Froese, Christoph Hertrich, and Rolf Niedermeier. The computational complexity of ReLU network training parameterized by data dimensionality. Journal of Artificial Intelligence Research, 74:1775–1790, 2022. doi:10.1613/JAIR.1.13547.
- [19] Vincent Froese, Christoph Hertrich, and Rolf Niedermeier. The computational complexity of ReLU network training parameterized by data dimensionality. Journal of Artificial Intelligence Research, 74:1775–1790, 2022. doi:10.1613/JAIR.1.13547.
- [20] Robert Ganian, Mathis Rocton, and Simon Wietheger. Training one-dimensional graph neural networks is NP-hard. In The Thirteenth International Conference on Learning Representations (ICLR 2025), 2025.
- [21] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010. URL: http://proceedings.mlr.press/v9/glorot10a.html.
- [22] Surbhi Goel, Adam Klivans, Pasin Manurangsi, and Daniel Reichman. Tight hardness results for training depth-2 ReLU networks. In 12th Innovations in Theoretical Computer Science Conference (ITCS 2021), volume 185 of Leibniz International Proceedings in Informatics, pages 22:1–22:14, 2021. doi:10.4230/LIPICS.ITCS.2021.22.
- [23] Aidan N. Gomez, Mengye Ren, Raquel Urtasun, and Roger B. Grosse. The reversible residual network: Backpropagation without storing activations. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, volume 30, pages 2214–2224, 2017. URL: https://proceedings.neurips.cc/paper/2017/hash/f9be311e65d81a9ad8150a60844bb94c-Abstract.html.
- [24] Riccardo Gozzi. Analog Characterization of Complexity Classes. PhD thesis, Instituto Superior Técnico, Lisbon, Portugal and University of Algarve, Faro, Portugal, 2022.
- [25] Daniel S. Graça, Manuel L. Campagnolo, and J. Buescu. Robust simulations of Turing machines with analytic maps and flows. In S. B. Cooper, B. Löwe, and L. Torenvliet, editors, CiE 2005: New Computational Paradigms, LNCS 3526, pages 169–179. Springer, 2005. doi:10.1007/11494645_21.
- [26] Boris Hanin and Mark Sellke. Approximating continuous functions by ReLU nets of minimal width. arXiv preprint arXiv:1710.11278, 2017. arXiv:1710.11278.
- [27] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. doi:10.1109/cvpr.2016.90.
- [28] Ruiyang Hong and Anastasis Kratsios. Bridging the gap between approximation and learning via optimal approximation by ReLU MLPS of maximal regularity. arXiv preprint arXiv:2409.12335, 2024. doi:10.48550/arXiv.2409.12335.
- [29] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257, 1991. doi:10.1016/0893-6080(91)90009-T.
- [30] Patrick Kidger. On Neural Differential Equations. PhD thesis, Mathematical Institute, University of Oxford, 2022. doi:10.48550/arXiv.2202.02435.
- [31] Patrick Kidger and Terry Lyons. Universal approximation with deep narrow networks. In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 2306–2327. PMLR, 2020. URL: http://proceedings.mlr.press/v125/kidger20a.html.
- [32] Namjun Kim, Chanho Min, and Sejun Park. Minimum width for universal approximation using ReLU networks on compact domain. In The Twelfth International Conference on Learning Representations, 2024. URL: https://openreview.net/forum?id=dpDw5U04SU.
- [33] Ker-I Ko. Complexity Theory of Real Functions. Progress in Theoretical Computer Science. Birkhaüser, Boston, 1991. doi:10.1007/978-1-4684-6802-1.
- [34] Pascal Koiran, Michel Cosnard, and Max Garzon. Computability with low-dimensional dynamical systems. Theoretical Computer Science, 132(1-2):113–128, September 1994. doi:10.1016/0304-3975(94)90229-1.
- [35] Anastasis Kratsios and Léonie Papon. Universal approximation theorems for differentiable geometric deep learning. Journal of Machine Learning Research, 23(196):1–73, 2022. URL: https://jmlr.org/papers/v23/21-0716.html.
- [36] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Fractalnet: Ultra-deep neural networks without residuals. ICLR, 2016.
- [37] Hongzhou Lin and Stefanie Jegelka. Resnet with one-neuron hidden layers is a universal approximator. Advances in neural information processing systems, 31, 2018.
- [38] Chenghao Liu, Enming Liang, and Minghua Chen. Characterizing ResNet’s universal approximation capability. In Forty-first International Conference on Machine Learning, 2024.
- [39] Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Workshop Track Proceedings, pages 3276–3285. PMLR, OpenReview.net, 2018. URL: https://openreview.net/forum?id=B1LatDVUM.
- [40] Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The expressive power of neural networks: A view from the width. Advances in neural information processing systems, 30, 2017.
- [41] Cristopher Moore. Generalized shifts: unpredictability and undecidability in dynamical systems. Nonlinearity, 4(3):199–230, 1991. doi:10.1088/0951-7715/4/2/002.
- [42] Anirbit Mukherjee, Amitabh Basu, Raman Arora, and Poorya Mianjy. Understanding deep neural networks with rectified linear units. In International Conference on Learning Representations (ICLR 2018), 2018.
- [43] Ian Parberry. Circuit complexity and neural networks. MIT Press, 1994. doi:10.7551/mitpress/1836.001.0001.
- [44] Allan Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143–195, 1999.
- [45] Amaury Pouly. Continuous models of computation: from computability to complexity. PhD thesis, Ecole Polytechnique and Universidade do Algarve, defended on July 6, 2015. https://pastel.archives-ouvertes.fr/tel-01223284, Prix de Thèse de l’Ecole Polytechnique 2016, Ackermann Award 2017.
- [46] Hava T. Siegelmann. Neural networks and analog computation - beyond the Turing limit. Progress in theoretical computer science. Birkhäuser, 1999.
- [47] Hava T. Siegelmann and Eduardo D. Sontag. Turing computability with neural nets. Applied Mathematics Letters, 4(6):77–80, 1991. doi:10.1016/0893-9659(91)90080-f.
- [48] Hava T. Siegelmann and Eduardo D. Sontag. Analog computation via neural networks. Theoretical Computer Science, 131(2):331–360, September 1994. doi:10.1016/0304-3975(94)90178-3.
- [49] Hava T. Siegelmann and Eduardo D. Sontag. On the computational power of neural nets. Journal of Computer and System Sciences, 50(1):132–150, February 1995. doi:10.1006/jcss.1995.1013.
- [50] Michael Sipser. Introduction to the Theory of Computation. PWS Publishing Company, 1997.
- [51] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015. arXiv:1505.00387.
- [52] Heribert Vollmer. Introduction to circuit complexity: a uniform approach. Springer Science & Business Media, 1999.
- [53] Klaus Weihrauch. Computable Analysis - An Introduction. Texts in Theoretical Computer Science. An EATCS Series. Springer, 2000. doi:10.1007/978-3-642-56999-9.
- [54] Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural networks, 94:103–114, 2017. doi:10.1016/J.NEUNET.2017.07.002.
- [55] Dmitry Yarotsky. Elementary superexpressive activations. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 11932–11940. PMLR, 18–24 July 2021. URL: https://proceedings.mlr.press/v139/yarotsky21a.html.
- [56] Dmitry Yarotsky. Structure of universal formulas. Advances in Neural Information Processing Systems, 36:54876–54902, 2023.
- [57] Shijun Zhang, Jianfeng Lu, and Hongkai Zhao. Deep network approximation: Beyond ReLU to diverse activation functions. Journal of Machine Learning Research, 25(35):1–39, 2024. URL: https://jmlr.org/papers/v25/23-0912.html.
- [58] Xingcheng Zhang, Zhizhong Li, Chen Change Loy, and Dahua Lin. Polynet: A pursuit of structural diversity in very deep networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 3900–3908. IEEE Computer Society, 2017. doi:10.1109/CVPR.2017.415.
