Neural Learning
COMP9417 Machine Learning & Data Mining
Term 1, 2022
Adapted from slides by Dr Michael
This lecture will develop your understanding of Neural Network Learning & will extend that to Deep Learning
– describe Perceptrons and how to train them
– relate neural learning to optimization in machine learning
– outline the problem of neural learning
– derive the Gradient Descent training rule for linear models
– describe the problem of non-linear models with neural networks
– outline the method of back-propagation training of a multi-layer network
– understand Convolutional Neural Networks (CNNs)
– understand the main differences between CNNs and regular NNs
– know the basics of training CNNs
Artificial Neural Networks
Artificial Neural Networks are inspired by the human nervous system
NNs are composed of a large number of interconnected processing elements known as neurons
They use supervised error-correcting rules with back-propagation to learn a specific task
http://statsmaths.github.io/stat665/lectures/lec12/lecture12.pdf
Perceptron
A linear classifier that can achieve perfect separation on linearly separable data is the perceptron – a simplified neuron, originally proposed as a simple neural network by F. Rosenblatt in the late 1950s.
Perceptron
Originally implemented in software (based on the McCulloch-Pitts neuron from the 1940s), then in hardware as a 20×20 visual sensor array with potentiometers for adaptive weights.
Source http://en.wikipedia.org/w/index.php?curid=47541432
Perceptron
Each neuron has multiple dendrites and a single axon. The neuron receives its inputs from its dendrites and transmits its output through its axon. Both inputs and outputs take the form of electrical impulses. The neuron sums up its inputs, and if the total electrical impulse strength exceeds the neuron’s firing threshold, the neuron fires off a new impulse along its single axon. The axon, in turn, distributes the signal along its branching synapses which collectively reach thousands of neighboring neurons.
https://towardsdatascience.com/from-fiction-to-reality-a-beginners-guide-to-artificial-neural-networks-d0411777571b
Perceptron
Output o is thresholded sum of products of inputs and their weights:
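(Reconstructing the standard formula referred to here, with a bias weight $w_0$ and output in $\{-1, +1\}$:)

$$o(x_1, \ldots, x_n) = \begin{cases} +1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}$$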
Perceptron
Or in vector notation:
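(In the usual vector notation, with a constant input $x_0 = 1$ so the bias is folded into $\mathbf{w}$:)

$$o(\mathbf{x}) = \operatorname{sgn}(\mathbf{w} \cdot \mathbf{x})$$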
Decision Surface of a Perceptron
• Perceptron is able to represent some useful functions which are linearly separable (a)
• But functions that are not linearly separable are not representable (e.g. (b) XOR)
Perceptron Learning
Learning is “finding a good set of weights”
Perceptron learning is simply an iterative weight-update scheme:
wi ←wi+∆wi
where the weight update ∆wi depends only on misclassified examples and is modulated by a “smoothing” parameter η typically referred to as the “learning rate”.
Perceptron Learning
The perceptron iterates over the training set, updating the weight vector every time it encounters an incorrectly classified example.
§ Let xi be a misclassified positive example, then we have yi = +1 and w · xi < 0. We therefore want to find w′ such that w′ · xi > w · xi, which moves the decision boundary towards and hopefully past xi.
§ This can be achieved by calculating the new weight vector as
w′ = w + ηxi, where 0 < η ≤ 1 is the learning rate (again, assume set to 1). We then have w′ · xi = w · xi + ηxi · xi > w · xi as required.
§ Similarly, if xj is a misclassified negative example, then we have yj = −1 and w · xj > 0. In this case we calculate the new weight vector as
w′ = w − ηxj, and thus w′ · xj = w · xj − ηxj · xj < w · xj.
Perceptron Learning
§ The two cases can be combined in a single update rule: w' = w + η yi xi
§ Here yi acts to change the sign of the update, corresponding to whether a positive or negative example was misclassified
§ This is the basis of the perceptron training algorithm for linear classification
§ The algorithm just iterates over the training examples applying the weight update rule until all the examples are correctly classified
§ If there is a linear model that separates the positive from the negative examples, i.e., the data is linearly separable, it can be shown that the perceptron training algorithm will converge in a finite number of steps.
Training Perceptron
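The algorithm itself is not reproduced in the text; a minimal Python sketch of the training loop described above (function and variable names are illustrative, labels are assumed to be in {−1, +1}, and a constant 1 is appended to each example for the bias) might look like:

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron training: iterate over examples, updating the weights
    whenever an example is misclassified. Assumes y[i] in {-1, +1} and
    that a constant 1 has been appended to each row of X for the bias."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        converged = True
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:      # misclassified (or on the boundary)
                w = w + eta * yi * xi        # w' = w + eta * y_i * x_i
                converged = False
        if converged:                        # no mistakes in a full pass
            break
    return w

# Example: learn the OR function (linearly separable)
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1])
w = train_perceptron(X, y)
print(w, np.sign(X @ w))   # predictions should match y
```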
Perceptron Learning Rate
(left) A perceptron trained with a small learning rate (η = 0.2). The circled examples are the ones that trigger the weight update.
(middle) Increasing the learning rate to η = 0.5 leads in this case to a rapid convergence.
(right) Increasing the learning rate further to η = 1 may lead to too aggressive weight updating, which harms convergence.
Perceptron Convergence
Perceptron training will converge (under some mild assumptions) for linearly separable classification problems
A labelled data set is linearly separable if there is a linear decision boundary that separates the classes
Perceptron Convergence
Dataset D = {(x1, y1), . . . , (xn, yn)}
At least one example in D is labelled +1, and one is labelled -1.
A weight vector w∗ exists s.t. ||w∗||2 = 1 and ∀i, yi (w∗ · xi) ≥ γ
γ is typically referred to as the “margin”
Perceptron Convergence
Perceptron Convergence Theorem (Novikoff, 1962)
Let R = maxi ||xi||2
The number of mistakes made by the perceptron is at most (R/γ)^2
Decision Surface of a Perceptron
§ Unfortunately, as linear classifiers, perceptrons are limited in expressive power
§ So some functions are not representable, e.g., those that are not linearly separable
§ For non-linearly separable data we’ll need something else
§ However, with a relatively minor modification many perceptrons can be combined to form one model
§ multilayer perceptrons, the classic “neural network”
Optimisation
Studied in many fields such as engineering, science, economics, . . .
For example, we could minimise a real-valued function f.
A general optimisation algorithm:¹
1) start with initial point x = x0
2) select a search direction p, usually to decrease f(x)
3) select a step length η
4) set s = ηp
5) set x = x + s
6) go to step 2, unless convergence criteria are met
Note: convergence criteria will be problem-specific.
¹ B. Ripley (1996) “Pattern Recognition and Neural Networks”, CUP.
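As an illustration only (not from the slides), the loop above can be written directly; here f is a toy one-dimensional quadratic and the search direction is the negative gradient, i.e., steepest descent:

```python
def minimise(f, grad, x0, eta=0.1, tol=1e-8, max_iter=1000):
    """Generic descent loop: pick a direction p that decreases f,
    take a step of length eta, repeat until the step is tiny."""
    x = x0
    for _ in range(max_iter):
        p = -grad(x)          # search direction: steepest descent
        s = eta * p           # step
        x = x + s
        if abs(s) < tol:      # convergence criterion (problem-specific)
            break
    return x

# Toy example: minimise f(x) = (x - 3)^2, minimum at x = 3
f = lambda x: (x - 3) ** 2
grad = lambda x: 2 * (x - 3)
print(minimise(f, grad, x0=0.0))   # approximately 3.0
```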
Optimisation
Usually, we would like the optimisation algorithm to quickly reach an answer that is close to being the right one.
§ typically, we need to minimise a function
§ e.g., error or loss
§ optimisation is known as gradient descent or steepest descent
§ sometimes, we need to maximise a function
§ e.g., probability or likelihood
§ optimisation is known as gradient ascent or steepest ascent
Gradient Descent
To understand, consider the simple linear unit, where
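(Reconstructing the standard formula: the linear unit is a perceptron without the threshold,)

$$o = w_0 + w_1 x_1 + \cdots + w_n x_n = \mathbf{w} \cdot \mathbf{x}$$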
Let’s learn 𝑤𝑖 that minimise the squared error
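(Again the standard form,)

$$E(\mathbf{w}) \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$

with $t_d$ the target output and $o_d$ the unit's output for training example $d$.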
Where D is the set of training samples
Gradient Descent
Gradient Descent
Gradient vector gives direction of steepest increase in error E
Negative of the gradient, i.e., steepest decrease, is what we want
Training rule:
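(The standard forms, in vector notation and then componentwise:)

$$\Delta \mathbf{w} = -\eta \nabla E(\mathbf{w}), \qquad \nabla E(\mathbf{w}) \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$$

i.e., componentwise,

$$\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$$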
Gradient Descent
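(The derivation shown on this slide is standard; for the linear unit and squared error defined above,)

$$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 = \sum_{d \in D} (t_d - o_d)\, \frac{\partial}{\partial w_i} (t_d - \mathbf{w} \cdot \mathbf{x}_d) = \sum_{d \in D} (t_d - o_d)(-x_{i,d})$$

so the weight update becomes

$$\Delta w_i = \eta \sum_{d \in D} (t_d - o_d)\, x_{i,d}$$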
Gradient Descent
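(A minimal numpy sketch of batch gradient descent for a linear unit, using the update just derived; the toy data and names are illustrative, not from the slides:)

```python
import numpy as np

def gradient_descent_linear_unit(X, t, eta=0.01, epochs=1000):
    """Batch gradient descent for a linear unit o = w . x with squared error.
    X: (n_examples, n_features) with a leading column of 1s for the bias w0.
    t: (n_examples,) target outputs."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                      # outputs for all training examples
        delta_w = eta * X.T @ (t - o)  # eta * sum_d (t_d - o_d) * x_d
        w = w + delta_w
    return w

# Toy usage: fit t = 1 + 2*x from noisy samples
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
X = np.column_stack([np.ones_like(x), x])
t = 1 + 2 * x + 0.1 * rng.standard_normal(50)
print(gradient_descent_linear_unit(X, t))   # roughly [1.0, 2.0]
```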
Perceptron vs. Linear Unit
Perceptron training rule guaranteed to succeed if
§ Training examples are linearly separable
§ Sufficiently small learning rate η
Linear unit training rule uses gradient descent
§ Guaranteed to converge to hypothesis with minimum squared error
§ Given sufficiently small learning rate η
§ Even when training data contains noise
§ Even when training data not separable by H
Stochastic (Incremental) Gradient Descent
Batch mode Gradient Descent:
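(Standard form of the missing update: compute the gradient over the entire training set D, then update:)

$$E_D(\mathbf{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2, \qquad \mathbf{w} \leftarrow \mathbf{w} - \eta \nabla E_D(\mathbf{w})$$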
Stochastic (incremental) mode Gradient Descent:
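(Similarly: compute the gradient and update after each individual training example d:)

$$E_d(\mathbf{w}) = \frac{1}{2} (t_d - o_d)^2, \qquad \mathbf{w} \leftarrow \mathbf{w} - \eta \nabla E_d(\mathbf{w})$$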
Stochastic Gradient Descent
§ Stochastic Gradient Descent (SGD) can approximate Batch Gradient Descent arbitrarily closely, if 𝜂 made small enough
§ Very useful for training large networks, or online learning from data streams
§ Stochastic implies examples should be selected at random
Multilayer Networks
Multilayer Networks
§ Multi-layer networks can represent arbitrary functions, but an effective learning algorithm for such networks was thought to be difficult
§ A typical multi-layer network consists of an input, hidden and output layer, each fully connected to the next (with activation feeding forward)
§ The weights determine the function computed.
§ Given an arbitrary number of hidden units, any Boolean function can be computed with a single hidden layer
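For instance (an illustration, not from the slides), XOR, which a single perceptron cannot represent, needs only two hidden threshold units, one computing OR and one computing NAND, whose conjunction is taken by the output unit:

```python
import numpy as np

step = lambda z: (z > 0).astype(float)   # threshold activation

def xor_net(x1, x2):
    """Hand-chosen weights for a 2-2-1 threshold network computing XOR."""
    h1 = step(x1 + x2 - 0.5)        # hidden unit 1: OR(x1, x2)
    h2 = step(-x1 - x2 + 1.5)       # hidden unit 2: NAND(x1, x2)
    return step(h1 + h2 - 1.5)      # output unit: AND(h1, h2) = XOR(x1, x2)

x1, x2 = np.array([0, 0, 1, 1]), np.array([0, 1, 0, 1])
print(xor_net(x1, x2))              # [0. 1. 1. 0.]
```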
General Structure of ANN
General Structure of ANN
Properties of Artificial Neural Networks (ANNs):
§ Many neuron-like threshold switching units
§ Many weighted interconnections among units
§ Highly parallel, distributed process
§ Emphasis on tuning weights automatically
Artificial Neural Network (Source: VIASAT)
When to Consider Neural Networks
§ Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
§ Output can be discrete or real-valued
§ Output can be a vector of values
§ Possibly noisy data
§ Form of target function is unknown
§ Interpretability of result is not important
§ Speech recognition
§ Image classification
§ many others . . .
ALVINN drives 70 mph on highways
MLP Speech Recognition
Decision Boundaries
Sigmoid Unit
Sigmoid Unit
§ Same as a perceptron except that the step function has been replaced by a nonlinear sigmoid function.
§ The nonlinearity makes it easier for the model to generalise or adapt to a variety of data and to differentiate between outputs.
Sigmoid Unit
Sigmoid Unit
Why use the sigmoid function σ (x) ?
Nice property:
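(With $\sigma(x) = \dfrac{1}{1 + e^{-x}}$, the property referred to is the sigmoid's simple derivative:)

$$\frac{d\sigma(x)}{dx} = \sigma(x)\,\big(1 - \sigma(x)\big)$$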
We can derive gradient descent rules to train
• One sigmoid unit
• Multilayer networks of sigmoid units → Backpropagation
Note: in practice, particularly for deep networks, sigmoid functions are less common than other non-linear activation functions that are easier to train, but sigmoids are mathematically convenient.
Error Gradient of Sigmoid Unit
Start by assuming we want to minimise the squared error E(w), as defined earlier, over a set of training examples D.
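(The derivation itself is not reproduced here; applying the chain rule with $o_d = \sigma(\mathbf{w} \cdot \mathbf{x}_d)$ gives the standard result)

$$\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d)\, o_d\, (1 - o_d)\, x_{i,d}$$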
Error Gradient of Sigmoid Unit
Backpropagation Algorithm
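(The algorithm on this slide is not reproduced in the text; a minimal numpy sketch of stochastic backpropagation for a network with one hidden layer of sigmoid units and squared error, following the standard update rules, is given below. Names and the toy XOR data are illustrative, not from the slides.)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_train(X, T, n_hidden=3, eta=0.5, epochs=5000, seed=0):
    """Stochastic backpropagation for a 1-hidden-layer sigmoid network.
    X: (n, n_in) inputs, T: (n, n_out) targets in [0, 1]."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    W_h = rng.uniform(-0.05, 0.05, (n_hidden, n_in + 1))   # hidden weights (+bias)
    W_o = rng.uniform(-0.05, 0.05, (n_out, n_hidden + 1))  # output weights (+bias)
    for _ in range(epochs):
        for x, t in zip(X, T):
            # forward pass
            x1 = np.append(x, 1.0)                 # add constant bias input
            h = sigmoid(W_h @ x1)
            h1 = np.append(h, 1.0)
            o = sigmoid(W_o @ h1)
            # backward pass: error terms for output and hidden units
            delta_o = o * (1 - o) * (t - o)
            delta_h = h * (1 - h) * (W_o[:, :-1].T @ delta_o)
            # weight updates: w_ji <- w_ji + eta * delta_j * x_ji
            W_o += eta * np.outer(delta_o, h1)
            W_h += eta * np.outer(delta_h, x1)
    return W_h, W_o

# Toy usage: train on XOR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W_h, W_o = backprop_train(X, T)
```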
More on Backpropagation
A solution for learning highly complex models . . .
§ Gradient descent over entire network weight vector
§ Easily generalised to arbitrary directed graphs
§ Can learn probabilistic models by maximising likelihood
Minimises error over all training examples
§ Training can take thousands of iterations → slow!
§ Using network after training is very fast
More on Backpropagation
Will converge to a local, not necessarily global, error minimum
§ Might exist many such local minima
§ In practice, often works well (can run multiple times)
§ Often include weight momentum α
∆𝑤𝑗𝑖(𝑛) = 𝜂𝛿𝑗𝑥𝑗𝑖 + 𝛼 ∆𝑤𝑗𝑖(𝑛 − 1)
§ Stochastic gradient descent using “mini-batches”
Nature of convergence
§ Initialise weights near zero
§ Therefore, initial networks near-linear
§ Increasingly non-linear functions become possible as training progresses
More on Backpropagation
Models can be very complex
§ Will network generalise well to subsequent examples?
• may underfit by stopping too soon
• may overfit . . .
Many ways to regularise network, making it less likely to overfit
§ Add term to error that increases with magnitude of weight vector
§ Other ways to penalise large weights, e.g., weight decay
§ Using “tied” or shared set of weights, e.g., by setting all weights to their mean after computing the weight updates
§ Many other ways . . .
Expressive Capabilities of ANNs
Boolean functions:
§ Every Boolean function can be represented by a network with single hidden layer
§ but might require exponential (in number of inputs) hidden units
Continuous functions:
§ Every bounded continuous function can be approximated with arbitrarily small error, by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
§ Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
Being able to approximate any function is one thing, being able to learn it is another . . .
How complex should the model be?
“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”
John von Neumann
“Goodness of fit” in ANNs
Can neural networks overfit/underfit ?
Next two slides: plots of “learning curves” for error as the network learns (shown by number of weight updates) on two different robot perception tasks.
Note the difference between training set and off-training set (validation set) error on both tasks !
Note also that on the second task the validation set error continues to decrease after an initial increase, so any regularisation (network simplification, or weight reduction) strategy needs to avoid stopping too early (underfitting).
Overfitting in ANNs
Underfitting in ANNs
Neural Networks for Classification
Sigmoid unit computes output o(x) = σ(w · x)
Output ranges from 0 to 1
Example: binary classification
Questions:
§ what error (loss) function should be used?
§ how can we train such a classifier?
Neural Networks for Classification
Minimizing square error (as before) does not work so well for classification
If we take the output o(x) as the probability of the class of x being 1, the preferred
loss function is the cross-entropy
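(The standard cross-entropy loss over the training set D:)

$$E = -\sum_{d \in D} \Big( t_d \log o_d + (1 - t_d) \log (1 - o_d) \Big)$$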
td ∈ {0, 1} is the class label for training example d, and od is the output of the sigmoid unit, interpreted as the probability of the class of training example d being 1.
To train sigmoid units for classification using this setup, one can use gradient descent and backpropagation algorithm – this will yield the maximum likelihood solution.
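(Not on the slide, but a useful detail: for a single sigmoid unit with cross-entropy loss the $o_d(1 - o_d)$ factor from the sigmoid derivative cancels, leaving the simple gradient)

$$\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d)\, x_{i,d}$$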
Application: Face Pose Recognition
Dataset: 624 images of faces of 20 different people
§ image size 120x128 pixels
§ grey-scale, 0-255 intensity value range
§ different poses
§ different expressions
§ wearing sunglasses or not
Raw images compressed to 30x32 pixels
MLP structure: 960 inputs × 3 hidden nodes × 4 output nodes
Application: Face Pose Recognition
left straight right up
Four pose classes: looking left, straight ahead, right or upwards
Use a 1-of-n encoding: more parameters; can give confidence of prediction
Selected single hidden layer with 3 nodes by experimentation
Application: Face Pose Recognition
After 1 epoch
left straight right up
Application: Face Pose Recognition
After 100 epochs
left straight right up
Application: Face Recognition
Each output unit (left, straight, right, up) has four weights, shown by dark (negative) and light (positive) blocks.
Leftmost block corresponds to the bias (threshold) weight
Weights from each of 30x32 image pixels into each hidden unit are plotted in position of corresponding image pixel.
Classification accuracy: 90% on test set (default: 25%)
Question: what has the network learned?
For code, data, etc. see http://www.cs.cmu.edu/~tom/faces.html
Deep Learning: Convolutional Neural Networks
A Bit of History
§ Earliest studies about the visual mechanisms of animals emphasised the importance of edge detection for solving Computer Vision problems
§ Early processing in cat visual cortex looks like it is performing convolutions that are looking for oriented edges and blobs
§ Certain cells are looking for edges with a particular orientation at a particular spatial location in the visual field
§ This inspired convolutional neural networks but was limited by the lack of computational power
§ Hinton et al. reinvigorated research into deep learning and proposed a greedy layer-wise training technique
§ In 2012, Krizhevsky et al. won the ImageNet challenge and proposed the well-recognised AlexNet Convolutional Neural Network
A Bit of History
• The very first CNN
• Handwritten digit recognition
A Bit of History
§ AlexNet (2012)
§ ImageNet classification
§ Images showing 1000 object categories
Deep Learning
§ Deep learning is a collection of artificial neural network techniques that are widely used at present
§ Predominantly, deep learning techniques rely on large amounts of data and deeper learning architectures
§ Some well-known paradigms:
§ Convolutional Neural Networks (CNNs)
§ Recurrent Neural Networks
§ Auto-encoders
§ Restricted Boltzmann Machines
§ CNNs are very similar to regular Neural Networks
§ Made up of neurons with learnable weights
§ CNN architecture assumes that inputs are images
§ So that we have local features
§ Which allows us to
§ encode certain properties in the architecture that makes the forward pass more efficient and
§ significantly reduces the number of parameters needed for the network
The problem with regular NNs is that they do not scale well with dimensions (i.e. larger images)
§ E.g.: a 32x32 image with 3 channels (RGB) – a neuron in the first hidden layer would have 32x32x3 = 3,072 weights: manageable.
§ E.g.: a 200x200 image with 3 channels – a neuron in the first hidden layer would have 200x200x3 = 120,000 weights, and we need at least several of these neurons, which makes the number of weights explode.
What is different?
§ In contrast, CNNs consider 3-D volumes of neurons and propose a parameter sharing scheme that minimises the number of parameters required by the network.
§ CNN neurons are arranged in 3 dimensions: Width, Height and Depth.
§ Neurons in a layer are only connected to a small region of the layer before it (hence not fully connected)
What is different?
CNN architecture
Main layers:
§ Convolutional
§ Fully-connected
§ Drop-out
§ Output layers
Convolutional Layer
Suppose we want to classify an image as a bird, sunset, dog, cat, etc.
If we can identify features such as feather, eye, or beak which provide useful information in one part of the image, then those features are likely to also be relevant in another part of the image.
We can exploit this regularity by using a convolution layer which applies the same weights to different parts of the image.
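(As an illustration, not from the slides: a single convolution filter sliding over a grey-scale image can be written directly in numpy; the same 3x3 weights are applied at every image location, which is the parameter sharing described above.)

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D convolution (really cross-correlation, as in most CNNs):
    slide the same kernel weights over every position of the image."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

# Example: a vertical-edge detector applied to a toy 6x6 image
image = np.zeros((6, 6))
image[:, 3:] = 1.0                  # dark left half, bright right half
edge_kernel = np.array([[-1., 0., 1.],
                        [-1., 0., 1.],
                        [-1., 0., 1.]])
print(conv2d(image, edge_kernel))   # large responses along the vertical edge
```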
Convolutional Layer
Convolutional Layer
Convolution
Convolutional Layer
Convolutional Layer
Original link: https://i.stack.imgur.com/nOLCe.gif
Convolutional Layer
§ The output of the Conv layer can be interpreted as holding neurons arranged in a 3D volume.
§ The Conv layer's parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume.
§ During the forward pass, each filter is slid (convolved) across the width and height of the input volume, producing a 2-dimensional activation map of that filter.