COMP5400: Biological and Bio-Inspired Computation Steepest Gradient Descent
Marc de Kamps March 2, 2020
1 Small Digression on Curve Fitting
Often we want to reduce a set of data points to its essentials and produce a simple function which captures the most important aspects. Consider the data of Figure 1. It appears to be approximately
Copyright By PowCoder代写 加微信 powcoder
Figure 1: A data set which appears to be approximately linear.
linear. We often encounter two situations:
1. The data express a relationship that we know from first principles should be linear and the deviations that we see are due to noise in the observations. We want to uncover the linear relationship as well as possible.
2. We decide by eye that the data is approximately linear and want the best line that describes the data.
Either way we somehow want to find the ’best line’ that ’fits’ the data. How do we go about it? For simplicity we assume that there is no error in the x values. If we assume any linear relationship:
f (x) = ax + b
and we pick some value for a and b, then we can draw the line in the same figure as the data set. Obviously, we have not done too bad a job, and it is possible to guess values for a and b by eye. But how can we systematically determine a and b? For any data point in our set we can determine the so-called residu (see Figure 2). It is the square of the distance by which the straight line misses the
1 1 (x, y)
Figure 2: Defintion of residues.
data point. More precisely, if data point i (one out of a set of N) is given by (xi, yi) then the residu for this point is given by:
For our straight line this means
ri = (yi − f(xi))2. (1) ri = (yi − (axi + b))2.
Note for future reference that Equation 1 is a useful definition for other, non linear, functions as well.
Summing over all data points we find the total sum of residues one gets: R=(yi −(axi +b)2)
R is now a function of a and b only, and it seems quite reasonable to see if there is a minimum (amin,bmin) for the function R ≡ R(a,b), and if there is one to take (amin,bmin) as estimates for the linear relationship. It is in fact more than reasonable, it is optimal in a sense to be discussed in section 3. R is also called the error function, for obvious reasons.
For a minimum the following conditions must hold:
∂R = 0, (2)
For R such a minimum exists and can be calculated. The precise expression for a and b is not very hard to find, but not relevant for the rest of the story. This works out nicely for linear relationships and numerical software usually contains subroutines to find a and b in terms of the data points. This method for determining the optimal line that fits a data set is called the least squares method.
The method can also be used for non linear relationships. Just as Equation 1 gives the residu for a straight line, the following equation would give the residu for a parabola:
R=(yi −[a+bxi +cxi2])2
In this particular case, it would be harder to find the minimum of R explicitly and a simpler method is required.
The function f is called a model, in this case a and b are the parameters.
2 Steepest Gradient Descent
Let us re-examine Equation 3. It is not so hard what the terms are:
∂ R = −2(yi −(axi +b))xi (3)
∂ R = −2(yi −(axi +b)) ∂b
This is the so-called gradient of R in terms of a, b. In general the gradient of an N-dimensional function f (x0, · · · , xN−1) is defined as an N-dimensional vector:
∇f ≡(∂f ,···, ∂f ) ∂x0 ∂xN−1
In general the gradient is not hard to calulate, as the example for R shows. An important property of the gradient is that is the direction in which function f increases most rapidly. As a consequence the function decreases most rapidly in the direction of minus the gradient. An often used pragmatic approach for finding minima would work as follows for the residu.
1. Pick a point(a0 , b0 ) in (a, b) space.
2. Evaluate the residu in this point, call this R0 = R(a0 , b0 )
3. Calculate the gradient in point (a0, b0) using Equation 4.
4. Pick a new point (a1, b1) that is lies a small step away from (a0, b0) in the opposite direction of the gradient. In other words:
(a1,b1)=(a0,b0)−h∇R=(a0,b0)−h( ∂ R|(a0,b0), ∂ R|(a0,b0))
Since we move in the direction in which the function decreases most rapidly, for a continuous function this means if we chose h small enough, then R is guaranteed to be smaller in (a1 , b1 ) than it was in (a0 , b0 ). By repeating this procedure iteratively, we will arrive at a minimum for R. In this particular example we would indeed arrive at the minimum; there is only one.
This technique is called steepest gradient descent. It is widely used to determine the minima of functions. A drawback is that it may converge on a local minimum of a function and not necessarily on the global one. We will apply the technique in neural networks, first to find an improved version of the perceptron called ADALINE, and then on so-called multi-layer perceptrons.
3 Motivating the Error term
It is possible to define an error sum for the perceptron by defining a so-called squashing function: f(x) = 1 (4)
1+e−βx and by modifying the perceptron definition as follows:
Note that although this looks very different from the original perceptron, it gives approximately the same output, only when the input is around zero, the output becomes graded. If a close resemblance to the original perceptron definition is required, then a large value for β can be chosen.
4. Explain this.
We can now again define an error function, conform Equation 1 where f (x) is now given by Equation
5. The parameters to be adapted are now the weights, wi. For a particular input pattern ⃗xi, we can
wi xi − θ) (5)
Figure 3: A so-called sigmoid squashing function.
evaluate Equation 5 and obtain the observed output pattern oi. If we denote the desired output by di
the error function is:
E = 21 (oi − di)2 (6) i
E is a function of the weights only, just as the residu function in the example above was a function of a and b. We are now again calculating the gradient of the error function with respect to w⃗ . As a small step, let us first establish:
d f(x)=f(1−f) (7) dx
This is a remarkable result: usually it is not easy to find the derivative of a function, even if you know the function value. This is an important reason for picking this particular function. The gradient is
now calculated as follows:
∂ E=(oi−di)∂oi, (8) ∂wj i ∂wj
which follows from the chain rule. Now we have to remember that in the data point (xi,yi), xo is shorthand for an input vector:
xi =(x0,i,x1,i,···,xN−1,i)
Differentiating further we find that
∂oi = ∂ f(wkxk,i)= f′(wkxk,i)xj,i =oi(1−oi)xj,i
∂wj ∂wj k k
Putting the result together we can define the so-called δ-rule:
∂Ei =∆xj (9) ∂wj
(oi − di)oi(1 − oi)
where δ refers to the difference between desired and actual output. To find the total gradient, we need to add all the terms of Equation 9.
To calculate the weight change we perfrom steepest gradient descent now that we have calculated the gradient:
wi →wi −h∂E (10) ∂w
Here is a small parameter which must be chosen, it is called the learning rate. Despite the complex manipulations, the rule for weight change is not radically different from the perceptron learning rule. Since it minimises the error function, it finds a reasonable compromise, even for non linearly separable data.
Technically, one should calculate the sum of the gradients for all data points and then minimise. This is called batch learning. However, in practice one often calculates the gradient for each data point, cf Equation 9 and uses it immediately to adapt the weights. For infinitesimally small h this amounts to the same thing. For finite values uased in practice, it gives different results. This is called online learning.
4 Derivation of Backpropagation
I will use the following notation for a three layer network: o(0) will denote the i-th neuron of the i
input layer, o(2) will denote the k-th neuron of the output layer and o(1) is the j-th neuron of the kj
hidden layer. The weight between the q-th neuron of the input layer and the p-th neuron of the hidden layer is denoted by w(0) so that:
o(1) = f(w(0)o(0)) p pkk
and the weight between the r-th neuron of the output layer and the s-th neuron of the hidden layer is denoted by w(1) so that:
o(2) = f(w(1)o(1)) (11) p pkk
Let us consider a single input pattern ⃗x, and let us assume that the desired output pattern associ-
ated with this pattern is d⃗. If we enter ⃗x into the input layer of the network and evolve the network using the equations above we get an output pattern ⃗o in the output layer of the network. As before, when d⃗ ⃗0 we have a non-zero error function:
E ≡ (ok − dk)2 ≡ (o(2) − dk)2 (12) kk
where the sum runs over all nodes in the output layer. As before, the error function for given patterns
⃗o and d⃗ is a function of the weights only. As before, calculating the gradient helps to find (local)
minima of the error function. The gradient of the error function with respect to the weights w(1) can pq
be found in a similar way to the steepest gradient perceptron rule:
∂E =1 ∂ (o(2)−dk)2 ∂w(1) 2 ∂w(1) k
As before:
pq pq l pqm l
(o(2) −dk) k
∂w(1) pqk pq
∂o(2) ∂f(lw(1)o(1)) (1) (1) ∂ (1)
k = kl l =f′( w o ) w o(1)=f′( w o )δo(1)
(1) (1) p ∂w(1) ∂w(1) kl l ∂w(1) km m kl l k q
withδkl theKroneckerδdefinedby:
k=1:1 δkl= kl:0
Substituting the last result into Equation 13 gives
This is written as:
∂E =(o(2) −d )f′(w(1)o(1))o(1) (16) ∂w(1)pp pllq
∂E = ∆(1)o(1) (17) ∂w(1) p q
∆(1) ≡(o(2) −d )f′(w(1)o(1)) (18) p p p pll
This is completely equivalent to the steepest gradient perceptron rule. The gradient with respect to the weights w(0) is more complicated:
(o(2) −dk) k
o(2) = f ( w(1)o(1)) = f ( w(1) f ( w(0)o(0))) k kll kl lmm
so that the chain rule will have to be applied twice:
∂w(0) pqk pq
k = f ′( w(1)o(1)) w(1) f ′( w(0)o(0))δp o(0) (20)
∂w(0) kll km mtt mq pq l m t
= f′(w(1)o(1))w(1) f(w(0)o(0))
which leads to:
kl l km∂w(0)
pq l m pq t
Substituting Equation 20 into 19 gives:
∂E = (o(2) − d ) f ′( w(1)o(1)) w(1) f ′( w(0)o(0))δp o(0) (21) ∂w(0) kk kll km mttmq
∂E = (o(2) − d ) f ′( w(1)o(1))w(1) f ′( w(0)o(0))o(0) (22)
∂w(0) k k kl l kp pqk l t
Now applying Equation 18, it follows that
∂E = ∆(1)w(1) f ′( w(0)o(0))o(0) ∂w(0) k kp pt t q
∆(0) ≡∆(1)w(1)f′(w(0)o(0)) p kkp ptt
This completes the proof.
5 A Comment on Loss Functions
The error function is in some sense optimal, not just a nice idea. Assume that we are studying data which can be well described by a linear equation but where the y values have been distorted by Gaussian noise. By this we mean that there is a true value yˆi for which indeed:
but the value we observe is:
yˆi = axi + b (24) y i = yˆ i + ε i
where εi is drawn from a Gaussian distribution. Assume that this distribution has mean 0 and vari- ance σ. This means that we can not predict individual εi, but give the probability that it lies in a
certain interval:
y 1 −ε2/σ2 probεi in[x,y]= √ e
The problem is that we want to fit a line to the true underlying relationship, whilst having distorted data points. Nevertheless, we can estimate the line using probabilistic arguments. We know that the probability for finding an observed value yi is given by:
−(yi −yˆi )2 p(yi|a, b) ∝ e σ2
and since all data points are assumed to be idd, the total probability is −(yi −yˆi )2
Πi p(yi|a, b) ∝ Πe σ2 (25)
For the true y values, yˆ, Equation 24 holds for some unknown a and b. We will now estimate a and b from the data by finding the values aML, bML that maximize the total probability of observing the data that we consider, i.e. aML, bML maximize;
−(yi−(axi+b))2
Πie σ2 (26)
This is somewhat unwieldy. If we take the logarithm of this product and maximize with respect to that, we would still find aML, bML.
In other words we want to find the a, b that maximize: which is the same as maximising: −(yi −(axi +b))2
Under the assumption of Gaussian noise, minimising the error function amounts to maximising the log likelihood by changing the weights. In this sense, neural network training is not so different from standard regression.
In tensorflow, or whatever else you prefer to use, the error function described here is usually called a loss function, and the one described here is often called, or very close to a loss function called MSE loss.
For classification, there are other, more suitable loss functions that we will describe later.
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com