King’s College London
Department of Mathematics Academic year 2019–2020, Spring term Dr Blanka Horvath
7CCMFM18 Machine Learning Problem sheet 2
In this problem sheet, we will study the universal approximation property of feedforward neural networks with one input, one output and one hidden layer with ReLU activation. More specifically, we will prove in four steps a constructive version of the result, which determines a sufficient number of hidden-layer units ensuring a given precision of approximation. For simplicity, we work with functions on the unit interval [0, 1].
We need two new concepts. Firstly, the number of units in the network will depend on the “wiggliness” of the function we are trying to approximate. To assess wiggliness we introduce:
Definition 1. A function g : [0,1] → R is Hölder continuous with exponent α ∈ (0,1] and constant K > 0 if

|g(x_1) − g(x_2)| ≤ K |x_1 − x_2|^α for all x_1, x_2 ∈ [0,1].
Remark 1. (i) Hölder continuity implies uniform continuity and, ultimately, continuity.

(ii) The exponent α determines how much wiggliness is allowed. The larger α is, the less room there is for wiggling. In the most restrictive case α = 1 we recover Lipschitz continuity, familiar, e.g., from the theory of differential equations.

(iii) For instance, for γ ∈ (0,1), the function x ↦ x^γ, x ∈ [0,1], is Hölder continuous with exponent α ∈ (0, γ] but not with exponent α ∈ (γ, 1]. Any continuously differentiable function on [0,1] is Hölder continuous with exponent α = 1. The sample path of a standard Brownian motion can be shown to be, with probability one, Hölder continuous with exponent α ∈ (0, 1/2) but not with α ∈ [1/2, 1].
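Remark 1(iii) can be illustrated numerically. The sketch below (plain Python; the function name, grid size and tolerance are our own choices, not part of the sheet) checks the Hölder inequality for g(x) = √x on a finite grid: it holds with α = 1/2 and K = 1, but fails for an exponent above 1/2, e.g. α = 0.9.

```python
import itertools
import math

def holder_holds(g, alpha, K, grid):
    """Check |g(x1) - g(x2)| <= K*|x1 - x2|**alpha on all grid pairs."""
    return all(
        abs(g(x1) - g(x2)) <= K * abs(x1 - x2) ** alpha + 1e-12
        for x1, x2 in itertools.product(grid, repeat=2)
    )

grid = [i / 200 for i in range(201)]
# sqrt is 1/2-Hölder with K = 1 on [0, 1] ...
print(holder_holds(math.sqrt, 0.5, 1.0, grid))
# ... but not 0.9-Hölder: near 0, sqrt(x) grows faster than x**0.9
print(holder_holds(math.sqrt, 0.9, 1.0, grid))
```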
Secondly, to construct the approximating neural network, the following concept will be particularly helpful.
Definition 2. A function f : [0,1] → R is piecewise affine with knots 0 = x_0 < x_1 < ··· < x_n = 1 if there exist α_1,…,α_n ∈ R and β_1,…,β_n ∈ R such that

f(x) = α_1 + β_1 x,  x ∈ [0, x_1),
       α_2 + β_2 x,  x ∈ [x_1, x_2),
       ⋮
       α_{n−1} + β_{n−1} x,  x ∈ [x_{n−2}, x_{n−1}),
       α_n + β_n x,  x ∈ [x_{n−1}, 1],    (⋆)

and

α_i + β_i x_i = α_{i+1} + β_{i+1} x_i  for any i = 1,…,n−1.  (⋆⋆)

Remark 2. The property (⋆⋆) simply ensures that the piecewise definitions in (⋆) agree at the knots, so that the resulting function is continuous.
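Definition 2 translates directly into code. The sketch below (a minimal Python illustration; the function names are ours, not the sheet's) evaluates a function of the form (⋆) and checks the continuity condition (⋆⋆) at the interior knots.

```python
import bisect

def piecewise_affine(alphas, betas, knots):
    """Return f as in (⋆): on [x_{i-1}, x_i) use alpha_i + beta_i * x.

    knots = [x_0, ..., x_n] with x_0 = 0 and x_n = 1.
    """
    def f(x):
        # piece index i such that x in [x_{i-1}, x_i); x = 1 goes to the last piece
        i = min(bisect.bisect_right(knots, x), len(alphas))
        return alphas[i - 1] + betas[i - 1] * x
    return f

def is_continuous(alphas, betas, knots, tol=1e-12):
    """Check (⋆⋆): adjacent pieces agree at each interior knot."""
    return all(
        abs((alphas[j] + betas[j] * knots[j + 1])
            - (alphas[j + 1] + betas[j + 1] * knots[j + 1])) <= tol
        for j in range(len(alphas) - 1)
    )

# Example: two pieces meeting continuously at the knot x_1 = 0.5
f = piecewise_affine([0.0, 0.5], [1.0, 0.0], [0.0, 0.5, 1.0])
print(f(0.25), f(0.75), is_continuous([0.0, 0.5], [1.0, 0.0], [0.0, 0.5, 1.0]))
```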
In what follows, u : [0,1] → R stands for the function we are trying to approximate by a neural network; u is assumed to be Hölder continuous with exponent α ∈ (0,1] and constant K > 0.
Problem 1
Let n ∈ N. Determine the (unique) piecewise affine function u_n with knots x_i = i/n, i = 0,1,…,n, such that

u_n(x_i) = u(x_i) for any i = 0,1,…,n.  (⋆⋆⋆)

Hint: Work out the parameters α_i and β_i for any i = 1,…,n using (⋆), (⋆⋆) and (⋆⋆⋆).
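As a numerical sanity check on Problem 1 (not a substitute for the derivation the hint asks for), the chord of u through (x_{i−1}, u(x_{i−1})) and (x_i, u(x_i)) yields one candidate for the parameters on each piece; the Python sketch below (all names are ours) builds it and verifies the interpolation property (⋆⋆⋆) at the knots.

```python
import bisect

def interpolant(u, n):
    """Candidate piecewise affine interpolant u_n with knots x_i = i/n.

    On [x_{i-1}, x_i] take the chord of u: slope beta_i = n*(u(x_i) - u(x_{i-1})),
    intercept alpha_i = u(x_{i-1}) - beta_i * x_{i-1}.
    """
    knots = [i / n for i in range(n + 1)]
    betas = [n * (u(knots[i]) - u(knots[i - 1])) for i in range(1, n + 1)]
    alphas = [u(knots[i - 1]) - betas[i - 1] * knots[i - 1]
              for i in range(1, n + 1)]

    def u_n(x):
        # piece index: x in [x_{i-1}, x_i), with x = 1 assigned to the last piece
        i = min(bisect.bisect_right(knots, x), n)
        return alphas[i - 1] + betas[i - 1] * x

    return u_n

# Interpolate u(x) = sqrt(x) with n = 4 pieces; u_n matches u at every knot i/4
u_n = interpolant(lambda x: x ** 0.5, 4)
print([abs(u_n(i / 4) - (i / 4) ** 0.5) < 1e-12 for i in range(5)])
```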
Problem 2
Show that the function u_n in Problem 1 satisfies

∥u_n − u∥_{sup,[0,1]} ≤ 2K/n^α.

Hint: Study separately for each i = 1,…,n the behaviour of u_n(x) − u(x) for x ∈ [(i−1)/n, i/n], using Hölder continuity and the property (⋆⋆⋆).
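The bound in Problem 2 can be probed numerically. The sketch below (our own conventions) uses u(x) = √x, which by Remark 1(iii) is Hölder continuous with α = 1/2 and K = 1, so the bound predicts a sup error of at most 2/√n; the empirical error on a fine grid sits comfortably inside it.

```python
import math

def un_at(u, n, x):
    """Value at x of the piecewise affine interpolant of u with knots i/n."""
    i = min(int(x * n), n - 1)   # piece: x in [i/n, (i+1)/n)
    t = x * n - i                # position within the piece
    return (1 - t) * u(i / n) + t * u((i + 1) / n)

# Empirical sup error on a fine grid vs. the theoretical bound 2K/n**alpha
for n in (4, 16, 64):
    err = max(abs(un_at(math.sqrt, n, j / 5000) - math.sqrt(j / 5000))
              for j in range(5001))
    print(n, err, 2 / math.sqrt(n))
```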
Problem 3
Show that there is f_n ∈ N_2(1, n, 1; ReLU, Id) such that f_n(x) = u_n(x) for any x ∈ [0,1], where u_n is as in Problems 1 and 2.

Hint: Write down an expression for a general f_n ∈ N_2(1, n, 1; ReLU, Id) and determine its weights and biases so that it equals u_n.
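The construction behind Problem 3 can be sketched numerically. A general element of N_2(1, n, 1; ReLU, Id) is x ↦ c + Σ_i w_i ReLU(a_i x + b_i); the sketch below (our own naming, with all input weights a_i = 1) is one way to choose the parameters: each hidden unit switches on at a knot, its output weight adds the slope change β_i − β_{i−1} of u_n there, and the output bias is u(0). The result agrees with the interpolant from Problem 1.

```python
def relu(z):
    return max(z, 0.0)

def build_network(u, n):
    """One-hidden-layer ReLU network x -> c + sum_i w_i * relu(x + b_i).

    Hidden unit i fires on [x_{i-1}, 1] via b_i = -x_{i-1}; its output weight
    w_i = beta_i - beta_{i-1} (with beta_0 = 0) is the slope change of the
    interpolant u_n at the knot x_{i-1}; the output bias c is u(0).
    """
    knots = [i / n for i in range(n + 1)]
    betas = [0.0] + [n * (u(knots[i]) - u(knots[i - 1])) for i in range(1, n + 1)]
    b = [-knots[i - 1] for i in range(1, n + 1)]            # hidden biases
    w = [betas[i] - betas[i - 1] for i in range(1, n + 1)]  # output weights
    c = u(0.0)                                              # output bias

    def f(x):
        return c + sum(wi * relu(x + bi) for wi, bi in zip(w, b))

    return f

# The network reproduces u(x) = sqrt(x) exactly at the knots i/4
f = build_network(lambda x: x ** 0.5, 4)
print([abs(f(i / 4) - (i / 4) ** 0.5) < 1e-9 for i in range(5)])
```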
Problem 4
Using Problems 2 and 3, deduce that for any ε > 0 there is f ∈ N_2(1, n, 1; ReLU, Id) such that

∥u − f∥_{sup,[0,1]} < ε,

provided n ≥ (2K/ε)^{1/α}.
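Concretely, Problem 4 gives a recipe for sizing the hidden layer. A small helper (our own, not part of the sheet) computes the smallest integer n satisfying the condition and checks that it drives the Problem 2 bound down to ε.

```python
import math

def units_needed(K, alpha, eps):
    """Smallest integer n with n >= (2*K/eps)**(1/alpha),
    which makes the Problem 2 bound 2*K/n**alpha at most eps."""
    return math.ceil((2 * K / eps) ** (1 / alpha))

# Example: u(x) = sqrt(x), i.e. K = 1, alpha = 1/2, target eps = 0.1
n = units_needed(1.0, 0.5, 0.1)
print(n, 2 * 1.0 / n ** 0.5)   # n = (2/0.1)**2 = 400; bound = 2/sqrt(400) = 0.1
```

Note how the required width grows quickly as α shrinks: wigglier functions need far more hidden units for the same precision.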