
Neural Learning
COMP9417 Machine Learning and Data Mining
Term 2, 2020
COMP9417 ML & DM Neural Learning Term 2, 2020 1 / 66

Acknowledgements
Material derived from slides for the book
“Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009) http://statweb.stanford.edu/~tibs/ElemStatLearn/
Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by P. Murphy MIT Press (2012)
http://www.cs.ubc.ca/~murphyk/MLbook
Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. Barber Cambridge University Press (2012) http://www.cs.ucl.ac.uk/staff/d.barber/brml
Material derived from slides for the book “Machine Learning” by T. Mitchell McGraw-Hill (1997)
http://www-2.cs.cmu.edu/~tom/mlbook.html
Material derived from slides for the course “Machine Learning” by A. Srinivasan BITS Pilani, Goa, India (2016)
COMP9417 ML & DM Neural Learning Term 2, 2020 2 / 66

Aims
Aims
This lecture will enable you to describe and reproduce machine learning approaches to the problem of neural (network) learning. Following it you should be able to:
• describe Perceptrons and how to train them
• relate neural learning to optimization in machine learning
• outline the problem of neural learning
• derive the method of gradient descent for linear models
• describe the problem of non-linear models with neural networks
• outline the method of back-propagation training of a multi-layer perceptron neural network
• describe the application of neural learning for classification
• outline back-propagation in terms of computational graphs
• describe some issues arising when training deep neural networks
COMP9417 ML & DM Neural Learning Term 2, 2020 3 / 66

Perceptrons
What are perceptrons ?
Perceptron
A linear classifier that can achieve perfect separation on linearly separable data is the perceptron, originally proposed as a simple neural network by F. Rosenblatt in the late 1950s.
COMP9417 ML & DM Neural Learning Term 2, 2020 4 / 66

Perceptrons
What are perceptrons ?
Perceptron
Originally implemented in software (based on the McCulloch-Pitts neuron from the 1940s), then in hardware as a 20×20 visual sensor array with potentiometers for adaptive weights.
Source http://en.wikipedia.org/w/index.php?curid=47541432
COMP9417 ML & DM Neural Learning Term 2, 2020 5 / 66

Perceptrons
What are perceptrons ?
Perceptron
Output o is the thresholded sum of products of inputs and their weights:
o(x1, . . . , xn) = +1 if w0 + w1x1 + · · · + wnxn > 0, and −1 otherwise.
COMP9417 ML & DM Neural Learning Term 2, 2020 6 / 66

Perceptrons
What are perceptrons ?
Perceptron
Or in vector notation:
o(x) = 1 if w · x > 0, and −1 otherwise.
COMP9417 ML & DM Neural Learning Term 2, 2020 7 / 66

Perceptrons
Perceptrons are linear classifiers
Decision Surface of a Perceptron
[Figure: two datasets plotted on axes x1 and x2, each containing + and − examples; in (a) the classes can be separated by a perceptron's linear decision surface, while in (b) (an XOR-like arrangement) they cannot.]
Represents some useful functions
• What weights represent o(x1, x2) = AND(x1, x2)? (see the sketch below)
• What weights represent o(x1, x2) = XOR(x1, x2)?
COMP9417 ML & DM Neural Learning Term 2, 2020 8 / 66
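To make the first question concrete, here is a minimal Python sketch (the particular weight values, w0 = −0.8 and w1 = w2 = 0.5, are just one illustrative choice, not taken from the slides):

```python
import numpy as np

def perceptron_output(w, x):
    """Threshold unit: +1 if w0 + w1*x1 + ... + wn*xn > 0, else -1."""
    return 1 if w[0] + np.dot(w[1:], x) > 0 else -1

# One illustrative choice of weights representing AND(x1, x2)
# (inputs encoded as 0/1, output as -1/+1): w0 = -0.8, w1 = w2 = 0.5.
w_and = np.array([-0.8, 0.5, 0.5])

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), perceptron_output(w_and, np.array([x1, x2])))
# Only (1, 1) pushes the weighted sum above 0, so the unit outputs +1 exactly
# when both inputs are 1. No weights exist for XOR(x1, x2): its positive and
# negative examples cannot be separated by a single line.
```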

Perceptrons
How to train
Perceptron learning
Key idea:
Learning is “finding a good set of weights”
Perceptron learning is simply an iterative weight-update scheme:
wi ← wi + ∆wi
where the weight update ∆wi depends only on misclassified examples and is modulated by a “smoothing” parameter η typically referred to as the “learning rate”.
COMP9417 ML & DM Neural Learning Term 2, 2020 9 / 66

Perceptrons
How to train
Perceptron learning
The perceptron iterates over the training set, updating the weight vector every time it encounters an incorrectly classified example.
• For example, let xi be a misclassified positive example, then we have yi = +1 and w · xi < t. We therefore want to find w′ such that w′ · xi > w · xi, which moves the decision boundary towards and hopefully past xi.
• This can be achieved by calculating the new weight vector as
w′ = w + ηxi, where 0 < η ≤ 1 is the learning rate (again, assume set to 1). We then have w′ · xi = w · xi + ηxi · xi > w · xi as required.
• Similarly, if xj is a misclassified negative example, then we have
yj = −1 and w · xj > t. In this case we calculate the new weight vector as w′ = w − ηxj, and thus w′ · xj = w · xj − ηxj · xj < w · xj.
COMP9417 ML & DM Neural Learning Term 2, 2020 10 / 66

Perceptrons
How to train
Perceptron learning
• The two cases can be combined in a single update rule:
w′ = w + ηyixi
• Here yi acts to change the sign of the update, corresponding to whether a positive or negative example was misclassified
• This is the basis of the perceptron training algorithm for linear classification
• The algorithm just iterates over the training examples, applying the weight update rule until all the examples are correctly classified
• If there is a linear model that separates the positive from the negative examples, i.e., the data is linearly separable, it can be shown that the perceptron training algorithm will converge in a finite number of steps.
COMP9417 ML & DM Neural Learning Term 2, 2020 11 / 66

Perceptrons
How to train
Perceptron training algorithm
Algorithm Perceptron(D, η) // perceptron training for linear classification
Input: labelled training data D in homogeneous coordinates; learning rate η.
Output: weight vector w defining classifier ŷ = sign(w · x).
  w ← 0  // other initialisations of the weight vector are possible
  converged ← false
  while converged = false do
    converged ← true
    for i = 1 to |D| do
      if yi w · xi ≤ 0 then  // i.e., ŷi ≠ yi
        w ← w + ηyixi
        converged ← false  // we changed w, so haven't converged yet
      end
    end
  end
COMP9417 ML & DM Neural Learning Term 2, 2020 12 / 66
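A minimal NumPy sketch of the Perceptron(D, η) algorithm above; the function name and toy data are illustrative only, and examples are assumed to be in homogeneous coordinates (a constant 1 prepended to each x) with labels in {−1, +1}:

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=1000):
    """Perceptron(D, eta): X has one example per row in homogeneous
    coordinates (first column all 1s); y holds labels in {-1, +1}."""
    w = np.zeros(X.shape[1])          # other initialisations are possible
    for _ in range(max_epochs):       # guard in case the data is not separable
        converged = True
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:     # i.e. predicted label != y_i
                w = w + eta * yi * xi       # update rule w' = w + eta * y_i * x_i
                converged = False
        if converged:
            break
    return w

# Toy linearly separable data (illustrative only).
X = np.array([[1, 2.0, 1.0], [1, 1.5, 2.0], [1, -1.0, -1.5], [1, -2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
w = perceptron_train(X, y)
print(w, np.sign(X @ w))              # predictions should match y
```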
Perceptrons
How to train
Perceptron training – varying learning rate
(left) A perceptron trained with a small learning rate (η = 0.2). The circled examples are the ones that trigger the weight update. (middle) Increasing the learning rate to η = 0.5 leads in this case to a rapid convergence. (right) Increasing the learning rate further to η = 1 may lead to too aggressive weight updating, which harms convergence. The starting point in all three cases was the basic linear classifier.
COMP9417 ML & DM Neural Learning Term 2, 2020 13 / 66

Perceptrons
How to train
Perceptron Convergence
Perceptron training will converge (under some mild assumptions) for linearly separable classification problems.
A labelled data set is linearly separable if there is a linear decision boundary that separates the classes.
COMP9417 ML & DM Neural Learning Term 2, 2020 14 / 66

Perceptrons
How to train
Perceptron Convergence
Assume:
• Dataset D = {(x1, y1), . . . , (xn, yn)}
• At least one example in D is labelled +1, and one is labelled −1.
• R = maxi ||xi||2
• A weight vector w∗ exists s.t. ||w∗||2 = 1 and ∀i yi w∗ · xi ≥ γ
Perceptron Convergence Theorem (Novikoff, 1962)
The number of mistakes made by the perceptron is at most (R/γ)².
γ is typically referred to as the "margin".
COMP9417 ML & DM Neural Learning Term 2, 2020 15 / 66

Perceptrons
How to train
Decision Surface of a Perceptron
Unfortunately, as a linear classifier the perceptron is limited in expressive power
So some functions are not representable
• e.g., not linearly separable
For non-linearly separable data we'll need something else
However, with a fairly minor modification many perceptrons can be combined together to form one model
• multilayer perceptrons, the classic "neural network"
COMP9417 ML & DM Neural Learning Term 2, 2020 16 / 66

Optimization
Optimization
Studied in many fields such as engineering, science, economics, . . .
A general optimization algorithm:¹
1. start with initial point x = x0
2. select a search direction p, usually to decrease f(x)
3. select a step length η
4. set s = ηp
5. set x = x + s
6. go to step 2, unless convergence criteria are met
For example, could minimize a real-valued function f : Rⁿ → R
Note: convergence criteria will be problem-specific.
¹ B. Ripley (1996) "Pattern Recognition and Neural Networks", CUP.
COMP9417 ML & DM Neural Learning Term 2, 2020 17 / 66

Optimization
Optimization
Usually, we would like the optimization algorithm to quickly reach an answer that is close to being the right one.
• typically, need to minimize a function
• e.g., error or loss
• optimization is known as gradient descent or steepest descent
• sometimes, need to maximize a function
• e.g., probability or likelihood
• optimization is known as gradient ascent or steepest ascent
COMP9417 ML & DM Neural Learning Term 2, 2020 18 / 66

Neural Learning
Connectionist Models
Consider humans:
• Neuron switching time ≈ .001 second
• Number of neurons ≈ 10¹⁰
• Connections per neuron ≈ 10⁴–10⁵
• Scene recognition time ≈ .1 second
• 100 inference steps doesn't seem like enough
→ much parallel computation
COMP9417 ML & DM Neural Learning Term 2, 2020 19 / 66

Neural Learning
Connectionist Models
Properties of artificial neural nets (ANNs):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed process
• Emphasis on tuning weights automatically
COMP9417 ML & DM Neural Learning Term 2, 2020 20 / 66

Neural Learning
When to Consider Neural Networks
• Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
• Output can be discrete or real-valued
• Output can be a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant
Examples:
• Speech recognition (now the standard method)
• Image classification (also now the standard method)
• many others . . .
COMP9417 ML & DM Neural Learning Term 2, 2020 21 / 66

Training a Linear Unit by Gradient Descent
Gradient Descent
To understand, consider the simpler linear unit, where
o = w0 + w1x1 + · · · + wnxn
Let's learn wi's that minimize the squared error
E[w] ≡ (1/2) Σ_{d∈D} (td − od)²
where D is the set of training examples.
COMP9417 ML & DM Neural Learning Term 2, 2020 22 / 66

Training a Linear Unit by Gradient Descent
Gradient Descent
[Figure: the error surface E[w] plotted as a function of the weights w0 and w1 — for a linear unit it is a paraboloid with a single global minimum.]
COMP9417 ML & DM Neural Learning Term 2, 2020 23 / 66

Training a Linear Unit by Gradient Descent
Gradient Descent
Gradient
∇E[w] ≡ [ ∂E/∂w0, ∂E/∂w1, · · · , ∂E/∂wn ]
Gradient vector gives direction of steepest increase in error E
Negative of the gradient, i.e., steepest decrease, is what we want
Training rule:
∆w = −η∇E[w]
i.e.,
∆wi = −η ∂E/∂wi
COMP9417 ML & DM Neural Learning Term 2, 2020 24 / 66

Training a Linear Unit by Gradient Descent
Gradient Descent
∂E/∂wi = ∂/∂wi (1/2) Σ_d (td − od)²
= (1/2) Σ_d ∂/∂wi (td − od)²
= (1/2) Σ_d 2(td − od) ∂/∂wi (td − od)
= Σ_d (td − od) ∂/∂wi (td − w · xd)
∂E/∂wi = Σ_d (td − od)(−xi,d)
COMP9417 ML & DM Neural Learning Term 2, 2020 25 / 66

Training a Linear Unit by Gradient Descent
Gradient Descent
Gradient-Descent(training examples, η)
Each training example is a pair ⟨x, t⟩, where x is the vector of input values, and t is the target output value. η is the learning rate (e.g., .05).
  Initialize each wi to some small random value
  Until the termination condition is met, Do
    Initialize each ∆wi to zero
    For each ⟨x, t⟩ in training examples, Do
      Input the instance x to the unit and compute the output o
      For each linear unit weight wi
        ∆wi ← ∆wi + η(t − o)xi
    For each linear unit weight wi
      wi ← wi + ∆wi
COMP9417 ML & DM Neural Learning Term 2, 2020 26 / 66
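A minimal NumPy sketch of the Gradient-Descent(training examples, η) procedure above for a linear unit o = w · x (with x0 = 1 playing the role of the bias input); the toy data, function name and random initialisation are illustrative only:

```python
import numpy as np

def gradient_descent_linear_unit(X, t, eta=0.05, epochs=100):
    """Batch gradient descent for a linear unit o = w . x.
    X: one example per row, first column all 1s (so w[0] acts as w0).
    t: vector of target outputs."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])   # small random initial weights
    for _ in range(epochs):
        o = X @ w                      # outputs o_d for all examples
        delta_w = eta * X.T @ (t - o)  # accumulates eta * (t_d - o_d) * x_{i,d}
        w = w + delta_w                # i.e. w <- w - eta * dE/dw
    return w

# Toy regression data generated from a known linear target (illustrative only).
X = np.c_[np.ones(20), np.linspace(-1, 1, 20)]
t = 2.0 + 3.0 * X[:, 1]
print(gradient_descent_linear_unit(X, t))   # should approach [2.0, 3.0]
```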
Training a Linear Unit by Gradient Descent
Training Perceptron vs. Linear unit
Perceptron training rule guaranteed to succeed if
• Training examples are linearly separable
• Sufficiently small learning rate η
Linear unit training rule uses gradient descent
• Guaranteed to converge to hypothesis with minimum squared error
• Given sufficiently small learning rate η
• Even when training data contains noise
• Even when training data not separable by H
COMP9417 ML & DM Neural Learning Term 2, 2020 27 / 66

Training a Linear Unit by Gradient Descent
Incremental (Stochastic) Gradient Descent
Batch mode Gradient Descent:
Do until satisfied
• Compute the gradient ∇ED[w]
• w ← w − η∇ED[w]
Incremental mode Gradient Descent:
Do until satisfied
• For each training example d in D
• Compute the gradient ∇Ed[w]
• w ← w − η∇Ed[w]
COMP9417 ML & DM Neural Learning Term 2, 2020 28 / 66

Training a Linear Unit by Gradient Descent
Incremental (Stochastic) Gradient Descent
Batch: ED[w] ≡ (1/2) Σ_{d∈D} (td − od)²
Incremental: Ed[w] ≡ (1/2) (td − od)²
Incremental or Stochastic Gradient Descent (SGD) can approximate Batch Gradient Descent arbitrarily closely, if η made small enough
Very useful for training large networks, or online learning from data streams
Stochastic implies examples should be selected at random
COMP9417 ML & DM Neural Learning Term 2, 2020 29 / 66

The Multi-layer Perceptron
Multilayer Networks of Sigmoid Units
COMP9417 ML & DM Neural Learning Term 2, 2020 30 / 66

The Multi-layer Perceptron
ALVINN drives 70 mph on highways
COMP9417 ML & DM Neural Learning Term 2, 2020 31 / 66

The Multi-layer Perceptron
ALVINN
[Figure: the ALVINN network — a 30x32 sensor input retina feeding 4 hidden units, which feed 30 output units ranging from Sharp Left through Straight Ahead to Sharp Right.]
COMP9417 ML & DM Neural Learning Term 2, 2020 32 / 66

The Multi-layer Perceptron
A Multilayer Perceptron for Speech Recognition: Model
[Figure: a network with inputs F1 and F2 and one output per vowel sound ("head", "hid", "who'd", "hood", . . .).]
COMP9417 ML & DM Neural Learning Term 2, 2020 33 / 66

The Multi-layer Perceptron
A Multilayer Perceptron for Speech Recognition: Decision Boundaries
COMP9417 ML & DM Neural Learning Term 2, 2020 34 / 66

The Multi-layer Perceptron
Sigmoid Unit
COMP9417 ML & DM Neural Learning Term 2, 2020 35 / 66

The Multi-layer Perceptron
Sigmoid Unit
Same as a perceptron except that the step function has been replaced by a smoothed version, a sigmoid function.
Note: in practice, particularly for deep networks, sigmoid functions are less common than other non-linear activation functions that are easier to train, but sigmoids are mathematically convenient.
COMP9417 ML & DM Neural Learning Term 2, 2020 36 / 66

The Multi-layer Perceptron
Sigmoid Unit
Why use the sigmoid function σ(x) = 1 / (1 + e^(−x)) ?
Nice property: dσ(x)/dx = σ(x)(1 − σ(x))
We can derive gradient descent rules to train
• One sigmoid unit
• Multilayer networks of sigmoid units → Backpropagation
Start by assuming we want to minimize the squared error (1/2) Σ_{d∈D} (td − od)² over a set of training examples D.
COMP9417 ML & DM Neural Learning Term 2, 2020 37 / 66
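A small sketch of the sigmoid and the derivative property above, with a numerical check (purely illustrative):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    """d sigma / dx = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Numerical check of the derivative identity at a few points (illustrative).
xs = np.array([-2.0, 0.0, 3.0])
numeric = (sigmoid(xs + 1e-6) - sigmoid(xs - 1e-6)) / 2e-6
print(np.allclose(numeric, sigmoid_derivative(xs)))   # True
```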
The Multi-layer Perceptron
Error Gradient for a Sigmoid Unit
∂E/∂wi = ∂/∂wi (1/2) Σ_{d∈D} (td − od)²
= (1/2) Σ_d ∂/∂wi (td − od)²
= (1/2) Σ_d 2(td − od) ∂/∂wi (td − od)
= Σ_d (td − od) (−∂od/∂wi)
= − Σ_d (td − od) (∂od/∂netd)(∂netd/∂wi)
COMP9417 ML & DM Neural Learning Term 2, 2020 38 / 66

The Multi-layer Perceptron
Error Gradient for a Sigmoid Unit
But we know:
∂od/∂netd = ∂σ(netd)/∂netd = od(1 − od)
∂netd/∂wi = ∂(w · xd)/∂wi = xi,d
So:
∂E/∂wi = − Σ_{d∈D} (td − od) od(1 − od) xi,d
COMP9417 ML & DM Neural Learning Term 2, 2020 39 / 66

The Multi-layer Perceptron
Backpropagation Algorithm
Initialize all weights to small random numbers.
Until satisfied, Do
  For each training example, Do
    Input the training example to the network and compute the network outputs
    For each output unit k
      δk ← ok(1 − ok)(tk − ok)
    For each hidden unit h
      δh ← oh(1 − oh) Σ_{k∈outputs} wkh δk
    Update each network weight wji
      wji ← wji + ∆wji, where ∆wji = ηδjxji
COMP9417 ML & DM Neural Learning Term 2, 2020 40 / 66
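A minimal NumPy sketch of the backpropagation algorithm above for a network with one hidden layer of sigmoid units; the network sizes, learning rate and XOR data are illustrative only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_epoch(X, T, W_hid, W_out, eta=0.5):
    """One pass of the backpropagation algorithm above (squared error),
    for one hidden layer of sigmoid units and sigmoid output units.
    X: inputs, one row per example, first column all 1s (bias input x0 = 1).
    T: targets, one row per example.
    W_hid: (n_hidden, n_inputs); W_out: (n_outputs, n_hidden + 1)."""
    for x, t in zip(X, T):
        # Forward pass: compute unit outputs.
        h = sigmoid(W_hid @ x)                    # hidden unit outputs o_h
        h_ext = np.r_[1.0, h]                     # prepend bias input for output layer
        o = sigmoid(W_out @ h_ext)                # output unit outputs o_k
        # Backward pass: error terms.
        delta_k = o * (1 - o) * (t - o)                      # output units
        delta_h = h * (1 - h) * (W_out[:, 1:].T @ delta_k)   # hidden units
        # Weight updates: w_ji <- w_ji + eta * delta_j * x_ji.
        W_out += eta * np.outer(delta_k, h_ext)
        W_hid += eta * np.outer(delta_h, x)
    return W_hid, W_out

# Tiny illustrative run on XOR (data, sizes and learning rate are illustrative;
# initial weights are small random numbers, as on the slide).
rng = np.random.default_rng(0)
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
T = np.array([[0.0], [1.0], [1.0], [0.0]])
W_hid = rng.normal(scale=0.5, size=(3, 3))
W_out = rng.normal(scale=0.5, size=(1, 4))
for _ in range(5000):
    W_hid, W_out = backprop_epoch(X, T, W_hid, W_out)
H = np.c_[np.ones(4), sigmoid(X @ W_hid.T)]
print(np.round(sigmoid(H @ W_out.T), 2))  # usually close to [[0], [1], [1], [0]],
                                          # though backprop can reach a local minimum
```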
The Multi-layer Perceptron
More on Backpropagation
A solution for learning highly complex models . . .
• Gradient descent over entire network weight vector
• Easily generalized to arbitrary directed graphs
• Can learn probabilistic models by maximising likelihood
Minimizes error over all training examples
• Training can take thousands of iterations → slow!
• Using network after training is very fast
COMP9417 ML & DM Neural Learning Term 2, 2020 41 / 66

The Multi-layer Perceptron
More on Backpropagation
Will converge to a local, not necessarily global, error minimum
• May be many such local minima
• In practice, often works well (can run multiple times)
• Often include weight momentum α: ∆wji(n) = ηδjxji + α∆wji(n − 1)
• Stochastic gradient descent using "mini-batches"
Nature of convergence
• Initialize weights near zero
• Therefore, initial networks near-linear
• Increasingly non-linear functions possible as training progresses
COMP9417 ML & DM Neural Learning Term 2, 2020 42 / 66

The Multi-layer Perceptron
More on Backpropagation
Models can be very complex
• Will network generalize well to subsequent examples?
• may underfit by stopping too soon
• may overfit . . .
Many ways to regularize network, making it less likely to overfit
• Add term to error that increases with magnitude of weight vector:
E(w) ≡ (1/2) Σ_{d∈D} Σ_{k∈outputs} (tkd − okd)² + γ Σ_{i,j} wji²
• Other ways to penalize large weights, e.g., weight decay
• Using "tied" or shared set of weights, e.g., by setting all weights to their mean after computing the weight updates
• Many other ways . . .
COMP9417 ML & DM Neural Learning Term 2, 2020 43 / 66

The Multi-layer Perceptron
Expressive Capabilities of ANNs
Boolean functions:
• Every Boolean function can be represented by a network with a single hidden layer
• but might require exponential (in number of inputs) hidden units
Continuous functions:
• Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
• Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].
Being able to approximate any function is one thing, being able to learn it is another . . .
COMP9417 ML & DM Neural Learning Term 2, 2020 44 / 66

The Multi-layer Perceptron
How complex should the model be ?
With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.
John von Neumann
COMP9417 ML & DM Neural Learning Term 2, 2020 45 / 66

The Multi-layer Perceptron
"Goodness of fit" in ANNs
Can neural networks overfit/underfit ?
Next two slides: plots of "learning curves" for error as the network learns (shown by number of weight updates) on two different robot perception tasks.
Note the difference between training set and off-training set (validation set) error on both tasks !
Note also that on the second task the validation set error continues to decrease after an initial increase — any regularisation (network simplification, or weight reduction) strategies need to avoid early stopping (underfitting).
COMP9417 ML & DM Neural Learning Term 2, 2020 46 / 66

The Multi-layer Perceptron
Overfitting in ANNs
[Figure: "Error versus weight updates (example 1)" — training set error and validation set error plotted against the number of weight updates (0 to 20000); the training error keeps falling while the validation error eventually rises (overfitting).]
COMP9417 ML & DM Neural Learning Term 2, 2020 47 / 66

The Multi-layer Perceptron
Underfitting in ANNs
[Figure: "Error versus weight updates (example 2)" — training set error and validation set error plotted against the number of weight updates (0 to 6000); the validation error increases initially and then continues to decrease.]
COMP9417 ML & DM Neural Learning Term 2, 2020 48 / 66

The Multi-layer Perceptron
Neural networks for classification
Sigmoid unit computes output o(x) = σ(w · x)
Output ranges from 0 to 1
Example: binary classification
predict class 1 if o(x) ≥ 0.5, predict class 0 otherwise.
Questions:
• what error (loss) function should be used ?
• how can we train such a classifier ?
COMP9417 ML & DM Neural Learning Term 2, 2020 49 / 66

The Multi-layer Perceptron
Neural networks for classification
Minimizing squared error (as before) does not work so well for classification
If we take the output o(x) as the probability of the class of x being 1, the preferred loss function is the cross-entropy
− Σ_{d∈D} [ td log od + (1 − td) log(1 − od) ]
where: td ∈ {0, 1} is the class label for training example d, and od is the output of the sigmoid unit, interpreted as the probability of the class of training example d being 1.
To train sigmoid units for classification using this setup, can use gradient ascent with a similar weight update rule as that used to train neural networks by gradient descent – this will yield the maximum likelihood solution.
COMP9417 ML & DM Neural Learning Term 2, 2020 50 / 66
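A minimal sketch of training a single sigmoid unit with the cross-entropy loss above, using the standard fact that the gradient of the log-likelihood with respect to wi is Σd (td − od) xi,d; the toy data and function name are illustrative only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(t, o):
    """- sum_d [ t_d log o_d + (1 - t_d) log(1 - o_d) ]"""
    return -np.sum(t * np.log(o) + (1 - t) * np.log(1 - o))

def train_sigmoid_classifier(X, t, eta=0.1, epochs=500):
    """Gradient ascent on the log-likelihood (equivalently, gradient descent
    on the cross-entropy) for a single sigmoid unit o(x) = sigma(w . x).
    X: one example per row with first column all 1s; t: labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = sigmoid(X @ w)
        w = w + eta * X.T @ (t - o)    # dL/dw_i = sum_d (t_d - o_d) x_{i,d}
    return w

# Toy binary classification data (illustrative only).
X = np.array([[1, -2.0], [1, -1.0], [1, 1.0], [1, 2.0]])
t = np.array([0, 0, 1, 1])
w = train_sigmoid_classifier(X, t)
o = sigmoid(X @ w)
print(np.round(o, 2), cross_entropy(t, o))   # predictions near 0, 0, 1, 1
```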
The Multi-layer Perceptron
A practical application: Face Recognition
Dataset: 624 images of faces of 20 different people.
• image size 120x128 pixels
• grey-scale, 0-255 pixel value range
• different poses
• different expressions
• wearing sunglasses or not
Raw images compressed to 30x32 pixels, each the mean of 4x4 pixels.
MLP structure: 960 inputs × 3 hidden nodes × 4 output nodes.
COMP9417 ML & DM Neural Learning Term 2, 2020 51 / 66

The Multi-layer Perceptron
Neural Nets for Face Recognition - Task
[Figure: example face images for the four pose classes: left, straight, right, up.]
Four pose classes: looking left, straight ahead, right or upwards.
Use a 1-of-n encoding: more parameters; can give confidence of prediction.
Selected single hidden layer with 3 nodes by experimentation.
COMP9417 ML & DM Neural Learning Term 2, 2020 52 / 66

The Multi-layer Perceptron
Neural Nets for Face Recognition - after 1 epoch
[Figure: learned weights visualised after 1 training epoch (pose classes: left, straight, right, up).]
COMP9417 ML & DM Neural Learning Term 2, 2020 53 / 66

The Multi-layer Perceptron
Neural Nets for Face Recognition - after 100 epochs
[Figure: learned weights visualised after 100 training epochs (pose classes: left, straight, right, up).]
COMP9417 ML & DM Neural Learning Term 2, 2020 54 / 66

The Multi-layer Perceptron
Neural Nets for Face Recognition - Results
Each output unit (left, straight, right, up) has four weights, shown by dark (negative) and light (positive) blocks.
Leftmost block corresponds to the bias (threshold) weight.
Weights from each of the 30x32 image pixels into each hidden unit are plotted in the position of the corresponding image pixel.
Classification accuracy: 90% on test set (default: 25%)
Question: what has the network learned ?
For code, data, etc. see http://www.cs.cmu.edu/~tom/faces.html
COMP9417 ML & DM Neural Learning Term 2, 2020 55 / 66

The Multi-layer Perceptron
Deep Learning
Y. Lecun et al. (2015) Nature (521) 436–444.
COMP9417 ML & DM Neural Learning Term 2, 2020 56 / 66

The Multi-layer Perceptron
Deep Learning
Deep learning is a vast area that has exploded in the last 15 years.
Beyond the scope of this course to cover in detail. See: "Deep Learning" I. Goodfellow et al. (2017) – there is an online copy freely available.
Course COMP9444 Neural Networks (next semester).
COMP9417 ML & DM Neural Learning Term 2, 2020 57 / 66

The Multi-layer Perceptron
Deep Learning
Question: How much of what we have seen carries over to deep networks ?
Answer: Most of the basic concepts. We mention some important issues that differ in deep networks.
COMP9417 ML & DM Neural Learning Term 2, 2020 58 / 66

The Multi-layer Perceptron
Deep Learning: Architectures
Most successful deep networks do not use the fully connected network architecture we outlined above. Instead, they use more specialised architectures for the application of interest (inductive bias).
Example: Convolutional neural nets (CNNs) have an alternating layer-wise architecture inspired by the brain's visual cortex. Works well for image processing tasks, but also for applications like text processing.
Example: Long short-term memory (LSTM) networks have a recurrent network structure designed to capture long-range dependencies in sequential data, as found, e.g., in natural language (although now often superseded by transformer architectures).
Example: Autoencoders are a kind of unsupervised learning method. They learn a mapping from input examples to the same examples as output via a compressed (lower dimension) hidden layer, or layers.
COMP9417 ML & DM Neural Learning Term 2, 2020 59 / 66

The Multi-layer Perceptron
Deep convolutional networks learn features
H. Lee et al. (2009) ICML.
COMP9417 ML & DM Neural Learning Term 2, 2020 60 / 66

The Multi-layer Perceptron
Deep convolutional networks learn features
H. Lee et al. (2009) ICML.
COMP9417 ML & DM Neural Learning Term 2, 2020 61 / 66

The Multi-layer Perceptron
Deep convolutional networks learn features
M. Zeiler & R. Fergus (2014) ECCV.
COMP9417 ML & DM Neural Learning Term 2, 2020 62 / 66

The Multi-layer Perceptron
Deep Learning: Activation Functions
Problem: in very large networks, sigmoid activation functions can saturate, i.e., can be driven close to 0 or 1, and then the gradient becomes almost 0 – effectively halting updates and hence learning for those units.
Solution: use activation functions that are non-saturating, e.g., the "Rectified Linear Unit" or ReLU, defined as f(x) = max(0, x).
Problem: sigmoid activation functions are not zero-centred, which can cause gradients and hence weight updates to become "non-smooth".
Solution: use a zero-centred activation function, e.g., tanh, with range [−1, +1]. Note that tanh is essentially a re-scaled sigmoid.
The derivative of a ReLU is simply ∂f/∂x = 0 if x ≤ 0, and 1 otherwise.
COMP9417 ML & DM Neural Learning Term 2, 2020 63 / 66

The Multi-layer Perceptron
Deep Learning: Regularization
Deep networks can have millions or billions of parameters. Hard to train, prone to overfit.
What techniques can help ?
Example: dropout (a small sketch follows below)
• for each unit u in the network, with probability p, "drop" it, i.e., ignore it and its adjacent edges during training
• this will simplify the network and prevent overfitting
• can take longer to converge
• but will be quicker to update on each epoch
• also forces exploration of different sub-networks formed by removing p of the units on any training run
COMP9417 ML & DM Neural Learning Term 2, 2020 64 / 66
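A minimal sketch of the dropout idea above; rescaling the surviving activations by 1/(1 − p) ("inverted dropout") is a common implementation detail assumed here, not something stated on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p, training=True):
    """Randomly 'drop' each unit with probability p during training.
    Surviving activations are rescaled by 1/(1-p) ('inverted dropout') so
    that no rescaling is needed at test time."""
    if not training or p == 0.0:
        return activations
    keep = rng.random(activations.shape) >= p     # True with probability 1-p
    return activations * keep / (1.0 - p)

h = np.array([0.2, 1.5, -0.7, 3.0, 0.9])          # e.g. hidden-layer activations
print(dropout(h, p=0.5))                          # roughly half the units zeroed
print(dropout(h, p=0.5, training=False))          # unchanged at test time
```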
The Multi-layer Perceptron
Back-propagation and computational graphs
See the accompanying hand-written notes, based on Strang (2019).
COMP9417 ML & DM Neural Learning Term 2, 2020 65 / 66

Summary
Summary
• ANNs since 1940s; popular in 1980s, 1990s; recently a revival
• Complex function fitting. Generalise core techniques from machine learning and statistics based on linear models for regression and classification.
• Learning is typically stochastic gradient descent. Networks are too complex to fit otherwise.
• Many open problems remain. How are these networks actually learning ? How can they be improved ? What are the limits to neural learning ?
COMP9417 ML & DM Neural Learning Term 2, 2020 66 / 66

References
Strang, G. (2019). Linear Algebra and Learning from Data. Wellesley-Cambridge Press.
COMP9417 ML & DM Neural Learning Term 2, 2020 66 / 66