https://xkcd.com/720/
Announcements
Quiz 1 recap – today after the main lecture
Assignment 1 due next Monday
(Extra support sessions this week will be announced soon)
Neural networks
Neural network as adaptive basis functions
– Weight-space symmetries
Network training
– parameter optimisation
– gradient descent
Error backpropagation and automatic differentiation
Regularisation
Feed-forward network functions
Can have skip connections; connections can be sparse;
Require a feed-forward structure: no directed cycles.
Convention: the number of layers refers to the number of layers of weights (rather than layers of nodes).
Two key ideas:
1) Generalised linear model with a nonlinear activation
2) Composition of these models in layers
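A minimal numpy sketch of these two ideas for a two-layer network (assuming tanh hidden units and a linear output, as in the regression case mentioned below); the sizes and weights are purely illustrative.

```python
import numpy as np

def two_layer_net(x, W1, b1, W2, b2):
    """Composition of two generalised linear models:
    hidden layer = nonlinear activation of a linear map,
    output layer = linear map of the hidden activations."""
    a1 = W1 @ x + b1          # first linear model (pre-activations)
    z1 = np.tanh(a1)          # nonlinear activation -> adaptive basis functions
    return W2 @ z1 + b2       # second linear model (linear output for regression)

# Example: 2 inputs, 3 hidden units, 1 output
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
print(two_layer_net(np.array([0.5, -1.0]), W1, b1, W2, b2))
```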
Universal approximation
(regression) A two-layer network with linear outputs can uniformly approximate any continuous function on a compact input domain to arbitrary accuracy, provided the network has a sufficiently large number of hidden units.
(classification) A two-layer network can uniformly approximate any discriminative function on a compact input domain to arbitrary accuracy, provided the network has a sufficiently large number of hidden units.
What if all activation functions are linear?
What if the number of hidden units is smaller than the number of input dimensions?
Neural network as variable basis functions
(the source of these figures is lost — let me know if you encounter them, with apologies to the author)
Weight-space symmetries
For a neural net with M hidden units and weights W, there is a whole set of weights W′ that map any input to the same output as W.
For tanh hidden units, flipping the signs of all weights into and out of a hidden unit (2^M choices) and permuting the hidden units (M! orderings) both leave the network function unchanged, giving M!·2^M equivalent weight vectors.
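A small numpy check of these symmetries for a tanh network with a single linear output; the network sizes and the particular permutation are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # 3 inputs, M = 4 tanh hidden units
w2, b2 = rng.normal(size=4), rng.normal()               # single linear output

def net(x, W1, b1, w2, b2):
    return w2 @ np.tanh(W1 @ x + b1) + b2

x = rng.normal(size=3)

# Symmetry 1: flip the sign of all weights into and out of hidden unit 0
W1f, b1f, w2f = W1.copy(), b1.copy(), w2.copy()
W1f[0] *= -1; b1f[0] *= -1; w2f[0] *= -1

# Symmetry 2: permute the hidden units
perm = [2, 0, 3, 1]
W1p, b1p, w2p = W1[perm], b1[perm], w2[perm]

print(np.allclose(net(x, W1, b1, w2, b2), net(x, W1f, b1f, w2f, b2)))   # True
print(np.allclose(net(x, W1, b1, w2, b2), net(x, W1p, b1p, w2p, b2)))   # True
```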
we covered
Neural network as adaptive basis functions
– Weight-space symmetries
Network training
– parameter optimisation
– gradient descent
Error backpropagation and automatic differentiation
Regularisation
Neural network training — objectives
Regression (linear activation at output)
Binary classification (logistic output activation), also (4.90)
Different versions of classification
– Binary classification (logistic output activation)
– Multiple/independent binary classification
– Multi-class classification
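A hedged sketch of the error functions behind these objectives; the function names are illustrative, and the independent-binary case is just the binary cross-entropy summed over the output units.

```python
import numpy as np

def sum_of_squares(y, t):
    """Regression objective (linear output activation)."""
    return 0.5 * np.sum((y - t) ** 2)

def binary_cross_entropy(y, t):
    """Binary classification (logistic output activation); y = predicted probability,
    t in {0, 1}. Summing this over several output units gives the
    multiple/independent binary classification objective."""
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def multiclass_cross_entropy(Y, T):
    """Multi-class classification (softmax output activation); T is one-hot."""
    return -np.sum(T * np.log(Y))
```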
Recap: Convex functions
Examples: x^2, x ln x (x > 0), …, x^T S x (S positive semi-definite)
Jensen's inequality
Non-linear optimisation
Task: find the weight vector w that minimises the function E(w)
Non-linear optimisation (a local view)
Goal: figure out a good direction ∆w (and how far to move along it)
Taylor expansion: E(w + ∆w) ≈ E(w) + ∆w^T ∇E(w) + (1/2) ∆w^T H ∆w, where H is the Hessian of E at w
H positive definite (at a stationary point) → local minimum. Many other geometric structures are possible: maxima, saddle points, ring structures, …
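A small sketch of this local view: classifying a stationary point by the eigenvalues of its Hessian, under the assumption that the gradient has already been checked to vanish there.

```python
import numpy as np

def classify_stationary_point(H, tol=1e-8):
    """Classify a stationary point (gradient = 0) from the Hessian's eigenvalues."""
    eig = np.linalg.eigvalsh(H)          # H is symmetric
    if np.all(eig > tol):
        return "local minimum"           # H positive definite
    if np.all(eig < -tol):
        return "local maximum"
    if np.any(eig > tol) and np.any(eig < -tol):
        return "saddle point"
    return "degenerate (flat directions, e.g. a ring of minima)"

print(classify_stationary_point(np.array([[2.0, 0.0], [0.0, 3.0]])))    # local minimum
print(classify_stationary_point(np.array([[2.0, 0.0], [0.0, -1.0]])))   # saddle point
```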
Gradient descent: the computational argument
The Cheap Gradient Principle (Griewank 2008) – the computational cost of computing the gradient of a scalar-valued function is nearly the same (often within a small constant factor) as that of simply computing the function itself – is of central importance in optimization; it allows us to quickly obtain (high-dimensional) gradients of scalar loss functions, which are subsequently used in black-box gradient-based optimization procedures.
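An illustrative comparison (not from the lecture) for a simple quadratic loss: the analytic, reverse-mode-style gradient costs roughly one extra matrix-vector product, while finite differences need about n + 1 evaluations of the function.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
A, b = rng.normal(size=(n, n)), rng.normal(size=n)

def f(w):                       # one matrix-vector product
    r = A @ w - b
    return 0.5 * r @ r

def grad_analytic(w):           # ~two matrix-vector products: comparable cost to f itself
    return A.T @ (A @ w - b)

def grad_finite_diff(w, eps=1e-6):   # n + 1 evaluations of f: cost grows with the dimension
    f0, g = f(w), np.empty(n)
    for i in range(n):
        e = np.zeros(n); e[i] = eps
        g[i] = (f(w + e) - f0) / eps
    return g

w = rng.normal(size=n)
print(np.allclose(grad_analytic(w), grad_finite_diff(w), atol=1e-3))   # same gradient, very different cost
```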
Limitations of gradient-based methods
– Long valleys
– Saddle points
https://jermwatt.github.io/machine_learning_refined/notes/3_First_order_methods/3_7_Problems.html
“On the saddle point problem for non-convex optimization”, Pascanu et al., https://arxiv.org/abs/1405.4604
Follow-up topics:
* Newton methods
* Quasi-Newton methods, e.g. conjugate gradient descent
we covered
Neural network as adaptive basis functions
– Weight-space symmetries
Network training
– parameter optimisation
– gradient descent
Error backpropagation and automatic differentiation
Regularisation
Gradient descent
Online version: stochastic gradient optimisation
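A minimal sketch contrasting the batch update with the online/stochastic version on a toy linear least-squares problem; the learning rates and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                   # toy inputs
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def grad(w, Xb, tb):
    """Gradient of the sum-of-squares error 0.5 * ||Xb w - tb||^2."""
    return Xb.T @ (Xb @ w - tb)

w_batch, w_sgd = np.zeros(3), np.zeros(3)

for epoch in range(200):
    # Batch gradient descent: one step per pass over the whole training set
    w_batch -= 0.001 * grad(w_batch, X, t)

    # Online / stochastic version: one step per (shuffled) training point
    for i in rng.permutation(len(X)):
        w_sgd -= 0.01 * grad(w_sgd, X[i:i+1], t[i:i+1])

print(w_batch)   # both estimates approach the true weights [1, -2, 0.5]
print(w_sgd)
```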
backpropagation: linear version
Forward propagation in general
Backpropagation for 1
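A minimal sketch of forward propagation and error backpropagation for a network with one tanh hidden layer, a single linear output, and a sum-of-squares error; the finite-difference check at the end is only an illustration.

```python
import numpy as np

def forward_backward(x, t, W1, b1, W2, b2):
    """One forward and one backward pass for a tanh network with a single
    linear output and sum-of-squares error E = 0.5 * (y - t)^2."""
    # Forward pass: compute and cache the activations
    a1 = W1 @ x + b1
    z1 = np.tanh(a1)
    y = W2 @ z1 + b2

    # Backward pass: propagate the error signal (delta) layer by layer
    delta2 = y - t                              # dE/dy for linear output + squared error
    dW2, db2 = np.outer(delta2, z1), delta2
    delta1 = (1 - z1 ** 2) * (W2.T @ delta2)    # tanh'(a1) = 1 - tanh(a1)^2
    dW1, db1 = np.outer(delta1, x), delta1

    return 0.5 * ((y - t) ** 2).item(), (dW1, db1, dW2, db2)

# Gradient check against a finite difference on one weight (illustration only)
rng = np.random.default_rng(0)
x, t = rng.normal(size=2), np.array([0.3])
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
E, (dW1, db1, dW2, db2) = forward_backward(x, t, W1, b1, W2, b2)
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
Ep, _ = forward_backward(x, t, W1p, b1, W2, b2)
print(dW1[0, 0], (Ep - E) / eps)   # should agree closely
```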
“Yes you should understand backprop”
It’s a leaky abstraction – it is easy to fall into the trap of abstracting away the learning process, believing that you can simply stack arbitrary layers together and backprop will “magically make them work” on your data.
https://xkcd.com/1838/
https://cs231n.github.io/optimization-2/, also see MML Sec 5.6, esp 5.6.2
Implementing Backprop
Automatic Differentiation
Have differentials of elementary functions
Use the chain rule to compose the stage-wise gradients
Caching – store the differentials and function values as they are computed
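A toy reverse-mode automatic differentiation sketch along these lines: each node caches its value and the local derivatives to its parents, and the backward pass composes them with the chain rule. A real implementation would traverse the graph in reverse topological order and support many more operations.

```python
import math

class Node:
    """Minimal reverse-mode autodiff node: caches the forward value and the local
    derivatives to its parents, then accumulates gradients via the chain rule."""
    def __init__(self, value, parents=()):
        self.value = value          # cached function value from the forward pass
        self.parents = parents      # list of (parent_node, local_derivative)
        self.grad = 0.0

    def __mul__(self, other):
        return Node(self.value * other.value,
                    [(self, other.value), (other, self.value)])

    def __add__(self, other):
        return Node(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def tanh(self):
        v = math.tanh(self.value)
        return Node(v, [(self, 1.0 - v * v)])

    def backward(self, seed=1.0):
        self.grad += seed
        for parent, local in self.parents:   # chain rule: compose stage-wise gradients
            parent.backward(seed * local)

# Example: y = tanh(w * x + b); compute dy/dw, dy/dx, dy/db
w, x, b = Node(0.5), Node(2.0), Node(-1.0)
y = (w * x + b).tanh()
y.backward()
print(y.value, w.grad, x.grad, b.grad)
```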
we covered
Neural network as adaptive basis functions
– Weight-space symmetries
Network training
– parameter optimisation
– gradient descent
Error backpropagation and automatic differentiation
Regularisation in neural networks
The number of input and output nodes is determined by the application.
The number of hidden nodes is a free parameter.
Recipe: use the L2 regularisation (weight decay) that we covered earlier.
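A minimal sketch of that recipe, assuming you already have the data term of the error and its gradient; lam is the regularisation coefficient.

```python
import numpy as np

def regularised_loss_and_grad(w, data_loss, data_grad, lam):
    """Add the L2 (weight-decay) penalty 0.5 * lam * ||w||^2 to an existing
    data loss and its gradient."""
    return data_loss + 0.5 * lam * w @ w, data_grad + lam * w

# Usage inside a gradient-descent step (eta = learning rate):
#   E, g = regularised_loss_and_grad(w, E_data, g_data, lam=1e-3)
#   w -= eta * g          # the lam * w term shrinks ("decays") the weights each step
```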
Regularisation by early-stopping
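A generic sketch of the early-stopping procedure named above; step and val_error are hypothetical helpers standing in for one training update and the held-out validation error.

```python
import numpy as np

def train_with_early_stopping(w, step, val_error, max_epochs=1000, patience=20):
    """Keep the weights that achieved the lowest validation error and stop once
    it has not improved for `patience` epochs.
    `step(w)` performs one training update; `val_error(w)` evaluates held-out error."""
    best_w, best_err, since_best = w.copy(), np.inf, 0
    for epoch in range(max_epochs):
        w = step(w)
        err = val_error(w)
        if err < best_err:
            best_w, best_err, since_best = w.copy(), err, 0
        else:
            since_best += 1
            if since_best >= patience:
                break                      # validation error stopped improving
    return best_w, best_err
```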
Invariances
+ Augment training set with perturbed/transformed training patterns (Fig 5.14)
+ Preprocess input by normalising against transformations (e.g. rectifying faces)
+ Build important invariance into network structure — e.g. Convolutional Nets
– Explicitly allow invariances using improper priors (as regulariser) – 5.5.1, tangent propagation (5.5.4)
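A minimal sketch of the augmentation approach in the list above; additive noise stands in for the task-specific transformations (shifts, rotations, etc.) one would really use.

```python
import numpy as np

def augment(X, t, n_copies=5, noise_scale=0.1, seed=0):
    """Augment a training set with randomly perturbed copies of each pattern
    (targets unchanged), encouraging the network to be invariant to the
    perturbation."""
    rng = np.random.default_rng(seed)
    X_aug = [X] + [X + noise_scale * rng.normal(size=X.shape) for _ in range(n_copies)]
    t_aug = [t] * (n_copies + 1)
    return np.concatenate(X_aug), np.concatenate(t_aug)
```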
Neural networks
Neural network as adaptive basis functions
– Weight-space symmetries
Network training
– parameter optimisation
– gradient descent
Error backpropagation and automatic differentiation
Regularisation
https://xkcd.com/2173/
https://xkcd.com/2265/