
Backpropagation of Error
Marc de Kamps
Institute for Artificial and Biological Intelligence, University of Leeds


Multilayer-Perceptrons
Single Perceptron: important, powerful, especially as part of more complex tools:
Multi-layered networks
Support Vector Machines
Fails on relatively simple problems: XOR
We have already established that a three-layer network can solve XOR, using hand-picked weights (see the sketch below)
Hand-picking weights is very laborious: we need an automatic procedure
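To make the hand-picked-weights point concrete, here is a minimal Python/NumPy sketch of a three-layer (input, hidden, output) network of Heaviside perceptrons wired by hand to compute XOR. The particular weights and thresholds are illustrative choices, not values taken from the lecture.

```python
import numpy as np

def heaviside(z):
    """Step activation: 1 where z >= 0, else 0."""
    return (z >= 0).astype(float)

# Hand-picked weights (illustrative values, not taken from the lecture).
# Hidden unit 1 computes x1 OR x2, hidden unit 2 computes x1 AND x2;
# the output unit fires when OR is on and AND is off, i.e. XOR.
V = np.array([[1.0, 1.0],      # weights into hidden unit 1 (OR)
              [1.0, 1.0]])     # weights into hidden unit 2 (AND)
v_bias = np.array([-0.5, -1.5])
W = np.array([[1.0, -1.0]])    # output unit: h1 - h2
w_bias = np.array([-0.5])

def xor_net(x):
    h = heaviside(V @ x + v_bias)      # hidden layer
    return heaviside(W @ h + w_bias)   # output layer

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", int(xor_net(np.array(x, dtype=float))[0]))
# expected: (0,0)->0, (0,1)->1, (1,0)->1, (1,1)->0
```

Finding such weights by inspection is feasible for XOR, but it clearly does not scale; hence the need for an automatic procedure.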
Backpropagation
Distinguish between the network and the algorithm:
the multi-layer perceptron is a network architecture;
backpropagation is an algorithm for setting its weights.
Modern deep learners use clever ways to pre-train,
but backpropagation is still the most widely used method to finish training, even in modern architectures.

A Graded Perceptron: Continuous Dependency on Weights
The classic perceptron can be described using the Heaviside function:
o = H\left( \sum_i w_i x_i \right)
Here we used the Heaviside function as the squashing function:

H(x) = \begin{cases} 0 & x < 0 \\ 1 & x \geq 0 \end{cases}

Instead, we will now use the sigmoid:

f(x) = \frac{1}{1 + e^{-\beta x}}

Typically, β is fixed. It can be set to give a hard or a soft decision.

Graded Response
Rather than a hard decision, a perceptron using a sigmoid produces a graded response.
This can be used to:
express uncertainty, e.g. a probability
approximate functions

A Perceptron Network
Perceptrons can be grouped together, forming a two-layer network.
This network can be trained by a vector version of the perceptron algorithm.
There are no lateral connections in this network, so each node can be trained individually.

A Perceptron Network: Matrix-Vector Representation
Represent the output by a vector \vec{o}. We now need a weight matrix W:

\vec{o} = f(W \cdot \vec{x})

Its components are w_{ij}, where i runs over output nodes and j over input nodes.

Different Ways of Looking at the Network
1 - As a matrix-vector operation:

\vec{o} = W \vec{x}

2 - In terms of components:

o_i = \sum_j w_{ij} x_j

3 - As a numerical computation:

\begin{pmatrix} o_1 \\ o_2 \end{pmatrix} =
\begin{pmatrix} w_{11} & w_{12} & w_{13} & w_{14} \\ w_{21} & w_{22} & w_{23} & w_{24} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}

Direction is Important! Features in Backpropagation
In the forward direction the weight matrix is applied to the inputs:

o_i = \sum_j w_{ij} x_j

In the backward direction the transposed weight matrix is applied to the outputs:

x_i = \sum_j w_{ji} o_j

The Error Function (aka Loss Function)
We are still considering supervised learning.
Data points are presented as vectors x^{(i)}.
They come with a desired classification (function value) d^{(i)}.
A network is a machine that transforms an input vector x^{(i)} into an output o^{(i)} (the observed value).

E = \frac{1}{2} \sum_i \sum_k \left( d_k^{(i)} - o_k^{(i)} \right)^2

E is a measure of how well the machine approximates the function that produced the data.
Given a fixed data set, E is a function of the weights only.
The sum over k runs over all output nodes; the sum over i runs over all data points.

The Error Function: A Single Perceptron
The equation of a single perceptron is:

o = f\left( \sum_k w_k x_k \right)

Therefore the error function is:

E = \frac{1}{2} \sum_i \left( d^{(i)} - f\left( \sum_k w_k x_k^{(i)} \right) \right)^2

or, written out in full:

E = \frac{1}{2} \sum_i \left( d^{(i)} - \frac{1}{1 + e^{-\sum_k w_k x_k^{(i)}}} \right)^2

The latter form is a bit impractical.

The Data Matrix
We have treated a single data point as a vector, e.g.:

\vec{x}^{(i)} = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}

The vector is sometimes written as X^{(i)}. Aligning all vectors produces a matrix:

X = \left[ X^{(1)} \, X^{(2)} \cdots X^{(m)} \right]

Here x_{ik} are the components of X, n is the dimension of the data points and m is the number of data points:
i labels the component of data point k (the row), k labels the data point (the column).

Neural Networks Perform Regression
Regression requires a model f(x). The residual of a single data point is:

r_i = \left( y^{(i)} - f(x^{(i)}) \right)^2

Linear regression uses a line as the regression model:

r_i = \left( y^{(i)} - (a x^{(i)} + b) \right)^2

For a minimum of the total residual R the following conditions must hold:

\frac{\partial R}{\partial a} = 0, \quad \frac{\partial R}{\partial b} = 0

The gradient is:

\frac{\partial R}{\partial a} = -2 \sum_i \left( y^{(i)} - (a x^{(i)} + b) \right) x^{(i)}
\frac{\partial R}{\partial b} = -2 \sum_i \left( y^{(i)} - (a x^{(i)} + b) \right)

Two equations with two unknowns: the least-squares method.

Steepest Gradient Descent: Recap
Calculate the gradient \partial E / \partial w_j, then update:

w_j \rightarrow w_j - \lambda \frac{\partial E}{\partial w_j}

where λ is the learning rate. Repeat.

Steepest Gradient Descent: A Layer of Sigmoid Perceptrons
For a single data point the error is:

E = \frac{1}{2} \sum_k (o_k - d_k)^2, \quad o_k = f\left( \sum_l w_{kl} x_l \right)

The gradient is:

\frac{\partial E}{\partial w_{pq}} = \sum_k (o_k - d_k) \, f'\left( \sum_l w_{kl} x_l \right) \frac{\partial \sum_m w_{km} x_m}{\partial w_{pq}}

Using f' = f(1 - f):

\frac{\partial E}{\partial w_{pq}} = (o_p - d_p)\, o_p (1 - o_p)\, x_q

This gives the so-called delta rule.
This is for a single data point! What is the only change if there are more?
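As a concrete illustration of the delta rule, here is a minimal NumPy sketch that applies the single-layer update one data point at a time. Everything in it is illustrative: β is taken as 1, the data set is generated from a made-up "teacher" weight matrix so that the targets are actually representable, and the learning rate and epoch count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z, beta=1.0):
    return 1.0 / (1.0 + np.exp(-beta * z))

# Toy problem (made up for illustration): the desired outputs come from a
# "teacher" weight matrix, so a single layer of sigmoid units can fit them.
n_in, n_out, m = 4, 2, 100
X = rng.normal(size=(m, n_in))
W_true = rng.normal(size=(n_out, n_in))
D = sigmoid(X @ W_true.T)                      # desired outputs d^(i)

W = np.zeros((n_out, n_in))                    # weight matrix w_pq, to be learned
lam = 0.5                                      # learning rate lambda

def error(W):
    return 0.5 * np.sum((sigmoid(X @ W.T) - D) ** 2)

print("error before training:", error(W))
for epoch in range(2000):
    for x, d in zip(X, D):                     # one data point at a time
        o = sigmoid(W @ x)                     # forward pass: o_p = f(sum_q w_pq x_q)
        delta = (o - d) * o * (1.0 - o)        # delta rule: (o_p - d_p) o_p (1 - o_p)
        W -= lam * np.outer(delta, x)          # dE/dw_pq = delta_p x_q, steepest descent
print("error after training: ", error(W))      # should be much smaller
```

With more than one data point, the only change is that the per-point gradients are summed over the data set; the inner loop above applies them one at a time instead.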
Multi-layer Perceptron: Steepest Gradient Descent
Input vector \vec{x}, hidden layer \vec{h}, output vector \vec{o}, and two weight matrices, V and W:

\vec{o} = f(W \cdot f(V \cdot \vec{x}))

that is, \vec{o} = f(W \cdot \vec{h}), with \vec{h} = f(V \cdot \vec{x}).

We must determine \partial E / \partial w_{ij} and \partial E / \partial v_{kl}. The full derivation is in the hand-out.

Backpropagation of Error: A Single Data Point

E = \frac{1}{2} \left( \vec{d} - \vec{o} \right)^2, \quad \vec{o} = f(W \cdot f(V \cdot \vec{x}))

Backpropagation of Error: Output Layer
Define:

\Delta^{(2)}_p = (o_p - d_p)\, o_p (1 - o_p)

Then:

\frac{\partial E}{\partial w_{pq}} = (\vec{o} - \vec{d}) \cdot \frac{\partial \vec{o}}{\partial w_{pq}}
= (\vec{o} - \vec{d}) \cdot f'(W \cdot \vec{h}) \, \frac{\partial (W \cdot \vec{h})}{\partial w_{pq}}
= \sum_k (o_k - d_k)\, o_k (1 - o_k) \frac{\partial \sum_l w_{kl} h_l}{\partial w_{pq}}
= (o_p - d_p)\, o_p (1 - o_p)\, h_q
= \Delta^{(2)}_p h_q

Backpropagation of Error: Hidden Layer
The hidden layer is more work:

\frac{\partial o_k}{\partial v_{pq}} = f'\left( \sum_l w_{kl} h_l \right) w_{kp} \, f'\left( \sum_m v_{pm} x_m \right) x_q
= o_k (1 - o_k)\, w_{kp}\, h_p (1 - h_p)\, x_q

so that:

\frac{\partial E}{\partial v_{pq}} = \sum_k (o_k - d_k)\, o_k (1 - o_k)\, w_{kp}\, h_p (1 - h_p)\, x_q
= \sum_k \Delta^{(2)}_k w_{kp}\, h_p (1 - h_p)\, x_q

This is again of the form \Delta^{(1)}_p x_q, with:

\Delta^{(1)}_p = \sum_k \Delta^{(2)}_k w_{kp}\, h_p (1 - h_p)

Backpropagation of Error: Interpretation
We can find the gradient of the weight between node p in layer i+1 and node q in layer i as follows:

\frac{\partial E}{\partial w^{(i)}_{pq}} = \Delta^{(i+1)}_p x_q

For the output layer we have:

\Delta^{(out)}_p = (o_p - d_p)\, o_p (1 - o_p)

For the layers below:

\Delta^{(i)}_p = \sum_k \Delta^{(i+1)}_k w_{kp}\, h_p (1 - h_p)

Observe the summation order: the deltas are propagated backwards through the weights, hence backpropagation.

Backpropagation of Error: First Step
Calculate the error at the output:

\Delta^{(w)}_p = (o_p - d_p)\, o_p (1 - o_p)

and from it the gradient:

\frac{\partial E}{\partial w_{pq}} = \Delta^{(w)}_p h_q

This is the perceptron learning rule!

Backpropagation of Error: Second Step
Backpropagate the error:

\Delta^{(v)}_p = \sum_k \Delta^{(w)}_k w_{kp}\, h_p (1 - h_p)

Apart from the extra factor h_p (1 - h_p), this amounts to backpropagation of the error.

Backpropagation of Error: Third Step
Calculate the gradient of the V matrix:

\frac{\partial E}{\partial v_{kl}} = \Delta^{(v)}_k x_l

Backpropagation of Error: Fourth Step
Update the weights:

W \rightarrow W - \lambda \frac{\partial E}{\partial W}, \quad V \rightarrow V - \lambda \frac{\partial E}{\partial V}
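The four steps translate almost line for line into code. Below is a minimal NumPy sketch of the two-layer network trained on XOR. It makes some assumptions not present in the derivation above: a constant input of 1 is appended so the hidden units get bias weights (a common convenience), and the layer sizes, random seed, learning rate and epoch count are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# XOR as training data; a constant 1 is appended to each input so that the
# hidden units get bias weights (not shown in the derivation above).
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)

n_in, n_hidden, n_out = 3, 3, 1
V = rng.normal(scale=0.5, size=(n_hidden, n_in))   # input -> hidden weight matrix
W = rng.normal(scale=0.5, size=(n_out, n_hidden))  # hidden -> output weight matrix
lam = 0.5                                          # learning rate lambda

for epoch in range(10000):
    for x, d in zip(X, D):
        # forward pass: h = f(V x), o = f(W h)
        h = sigmoid(V @ x)
        o = sigmoid(W @ h)
        # step 1: output error  Delta^(w)_p = (o_p - d_p) o_p (1 - o_p)
        delta_w = (o - d) * o * (1.0 - o)
        # step 2: backpropagate  Delta^(v)_p = sum_k Delta^(w)_k w_kp h_p (1 - h_p)
        delta_v = (W.T @ delta_w) * h * (1.0 - h)
        # step 3: gradients  dE/dw_pq = Delta^(w)_p h_q,  dE/dv_kl = Delta^(v)_k x_l
        grad_W = np.outer(delta_w, h)
        grad_V = np.outer(delta_v, x)
        # step 4: steepest-descent update
        W -= lam * grad_W
        V -= lam * grad_V

# After training the outputs should approach 0, 1, 1, 0; the exact values
# depend on the random initialisation and the learning rate.
print(np.round(sigmoid(W @ sigmoid(V @ X.T)), 3))
```

As in the derivation, each gradient here is computed for a single data point; batch training would simply accumulate the gradients over the data set before updating.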