WiSe 2021/22
Machine Learning 1/1-X
Lecture 6 Neural Networks 1
� Bayes Optimal Classifier
� Maximum Mean Separation
� Fisher Discriminant
� Artificial Neural Network
� The Perceptron
� Forward propagation
� Optimizing neural networks
� Error backpropagation
Recap: Bayes Optimal Classifier
� Assume our data is generated for each class ωj according to the multivariate Gaussian distribution p(x|ωj ) = N (μj , Σ) and with class priors P (ωj ). The Bayes optimal classifier is derived as
$$\arg\max_j \{ P(\omega_j \mid x) \}$$
$$= \arg\max_j \{ \log p(x \mid \omega_j) + \log P(\omega_j) \}$$
$$= \arg\max_j \Bigl\{ x^\top \Sigma^{-1} \mu_j - \tfrac{1}{2} \mu_j^\top \Sigma^{-1} \mu_j + \log P(\omega_j) \Bigr\}$$
� Given our generative assumptions, there is no more accurate classifier than the one above.
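A minimal NumPy sketch of this decision rule (the function and variable names are illustrative and not part of the lecture): given estimated class means, a shared covariance matrix, and class priors, each class is scored by the discriminant above and the maximum is returned.

```python
import numpy as np

def bayes_optimal_classifier(x, means, shared_cov, priors):
    """Linear discriminant assuming Gaussian classes with a shared covariance.

    Scores each class j by  x^T Sigma^{-1} mu_j - 0.5 mu_j^T Sigma^{-1} mu_j + log P(omega_j)
    and returns the index of the highest-scoring class.
    """
    cov_inv = np.linalg.inv(shared_cov)
    scores = [x @ cov_inv @ mu - 0.5 * mu @ cov_inv @ mu + np.log(p)
              for mu, p in zip(means, priors)]
    return int(np.argmax(scores))

# Toy usage with two 2D Gaussian classes
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
cov = np.eye(2)
priors = [0.5, 0.5]
print(bayes_optimal_classifier(np.array([1.8, 0.9]), means, cov, priors))  # -> 1
```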
Recap: Bayes Optimal Classifier
Limitations:
� In practice, we don’t know the data generating distributions and only have the data.
� Estimating the data-generating distribution from a limited number of observations is difficult (e.g. it is hard to estimate the covariance of a Gaussian distribution in a way that the covariance remains invertible).
Recap: Mean Separation Criterion
� We want to learn a projection of the data $z_k = w^\top x_k$ with $\|w\| = 1$ such that the means of the classes in the projected space are as distant as possible.
� First, we compute the means in projected space for the two classes
$$\mu_1 = \frac{1}{N_1} \sum_{k \in C_1} z_k \qquad \mu_2 = \frac{1}{N_2} \sum_{k \in C_2} z_k$$
� Then we find w that maximizes the difference of means, i.e. we express the means as a function of w and pose the optimization problem:
$$\arg\max_w \; |\mu_2(w) - \mu_1(w)| \quad \text{with } \|w\| = 1$$
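As a side note, this criterion has a simple closed-form maximizer: since $\mu_2(w) - \mu_1(w) = w^\top(m_2 - m_1)$ with $m_1, m_2$ the input-space class means, the optimal unit-norm $w$ is the normalized mean difference. A short sketch (names are my own):

```python
import numpy as np

def max_mean_separation_direction(X1, X2):
    """Direction w (unit norm) maximizing |mu_2(w) - mu_1(w)|.

    X1, X2: arrays of shape (N1, d) and (N2, d) with the samples of each class.
    The projected mean difference is w^T (m2 - m1), so the maximizer under
    ||w|| = 1 is simply the normalized difference of the input-space means.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    d = m2 - m1
    return d / np.linalg.norm(d)

# z_k = w @ x_k then gives the 1D projection used for classification
```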
Recap: Mean Separation Criterion
Limitations:
� There is a significant class overlap in projected space.
� A better classifier seems achievable if we rotate the projection by a few degrees clockwise.
� Making means distant may not be sufficient to induce class separability in projected space.
Recap: Fisher Discriminant
� In addition to maximizing the separation between class means in projected space, also consider reducing the within-class variance.
$$\mu_1 = \frac{1}{|C_1|} \sum_{k \in C_1} z_k \qquad \mu_2 = \frac{1}{|C_2|} \sum_{k \in C_2} z_k$$
$$s_1 = \sum_{k \in C_1} (z_k - \mu_1)^2 \qquad s_2 = \sum_{k \in C_2} (z_k - \mu_2)^2$$
� Maximizing distance between means while minimizing within-class variance can be formulated as:
$$\arg\max_w \; \frac{(\mu_2(w) - \mu_1(w))^2}{s_1(w) + s_2(w)}$$
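A brief sketch under the standard closed-form solution (the closed form itself is not shown on this slide; names are illustrative): the maximizer is proportional to $S_W^{-1}(m_2 - m_1)$, where $S_W$ is the within-class scatter matrix in input space.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher discriminant direction for two classes.

    Maximizes (mu_2(w) - mu_1(w))^2 / (s_1(w) + s_2(w)); the well-known
    solution is w proportional to S_W^{-1} (m2 - m1), with S_W the
    within-class scatter matrix.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_w, m2 - m1)
    return w / np.linalg.norm(w)
```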
Recap: Means vs. Fisher
(Comparison figure: projections obtained with Maximum Mean Separation vs. the Fisher Discriminant)
� Fisher Discriminant leads (in general) to better class separability, and therefore, better classification accuracy.
� Fisher Discriminant requires inversion of a covariance matrix (only tractable for low-dimensional data).
Recap: Fisher Discriminant
Limitations:
� The resulting decision boundary can become suboptimal when the data is not Gaussian.
� In particular, like principal component analysis, Fisher Discriminant is not robust to outliers.
� When the distribution is non-Gaussian, the model does not focus on optimizing the classification error directly.
ML1 Roadmap
The Perceptron
F. Rosenblatt (1928–1971)
� Proposed by F. Rosenblatt in 1958.
� Classifier that perfectly separates the training data (if the data is linearly separable).
� Trained using a simple and cheap iterative procedure.
� The perceptron gave rise to artificial neural networks.
The Perceptron Algorithm
� Consider our linear model
$$z_k = w^\top x_k + b \qquad y_k = \operatorname{sign}(z_k)$$
and let $t_k$ be $1$ and $-1$ when the true class of $x_k$ is $\omega_1$ and $\omega_2$, respectively.
� Iterate over $k = 1, \dots, N$ (multiple times).
� If $x_k$ is correctly classified ($y_k = t_k$), continue.
� If $x_k$ is wrongly classified ($y_k \neq t_k$), apply:
$$w \leftarrow w + \eta \cdot x_k t_k \qquad b \leftarrow b + \eta \cdot t_k$$
where η is a learning rate.
� Stop once all examples are correctly classified.
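A minimal NumPy sketch of this procedure (the function name, the convention sign(0) = +1, and the max_epochs safeguard are my additions, not part of the algorithm as stated):

```python
import numpy as np

def train_perceptron(X, t, eta=1.0, max_epochs=100):
    """Perceptron training loop as described above.

    X: array of shape (N, d); t: array of labels in {+1, -1}.
    Weights and bias are updated only on misclassified examples; training
    stops once a full pass makes no mistakes (or max_epochs is reached).
    """
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        errors = 0
        for k in range(N):
            z = w @ X[k] + b
            y = 1 if z >= 0 else -1           # sign with sign(0) := +1
            if y != t[k]:                     # misclassified: apply the update
                w += eta * X[k] * t[k]
                b += eta * t[k]
                errors += 1
        if errors == 0:                       # all examples correctly classified
            break
    return w, b
```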
The Perceptron: Optimization View
The perceptron can be seen as performing gradient descent on the error function
$$E(w, b) = \frac{1}{N} \sum_{k=1}^{N} \underbrace{\max(0, -z_k t_k)}_{E_k(w, b)}$$
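To make the connection explicit (this step is only implicit on the slide), the gradient of the per-example error $E_k$ is
$$\frac{\partial E_k}{\partial w} = \begin{cases} -x_k t_k & \text{if } z_k t_k < 0 \\ 0 & \text{if } z_k t_k > 0 \end{cases} \qquad \frac{\partial E_k}{\partial b} = \begin{cases} -t_k & \text{if } z_k t_k < 0 \\ 0 & \text{if } z_k t_k > 0 \end{cases}$$
(at $z_k t_k = 0$ one takes a subgradient), so a stochastic gradient step with learning rate $\eta$ on a misclassified example reproduces exactly the perceptron update $w \leftarrow w + \eta\, x_k t_k$, $b \leftarrow b + \eta\, t_k$.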
Perceptron at Work
Nonlinear Classification
Observation:
� Mean separation, Fisher discriminant, and the perceptron build a decision function that is linear in input space. In practice, the data may not be linearly separable.
� Transform the data nonlinearly through some function Φ before applying the linear decision function.
$$f(x) = w^\top \Phi(x) + b$$
� Example: $\Phi(x) = [x_1,\, x_2,\, x_1^2,\, x_2^2,\, x_1 x_2]$ and $w \in \mathbb{R}^5$.
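A short sketch of this construction (the names phi and f are illustrative): the model stays linear in its parameters, only the input is replaced by the transformed features.

```python
import numpy as np

def phi(x):
    """Quadratic feature map Phi(x) = [x1, x2, x1^2, x2^2, x1*x2]."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

def f(x, w, b):
    """Linear decision function applied in the nonlinear feature space."""
    return w @ phi(x) + b

# w has 5 entries, one per feature of Phi(x); the boundary f(x) = 0 can now
# be a curved surface (e.g. an ellipse) in the original 2D input space.
```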
Artificial Neural Networks
� Models that are inspired by the way the brain represents sensory input and learn from repeated stimuli.
� Neuron activations can be seen as a nonlinear transformation of the sensory input (similar to Φ(x)).
� The neural representation adapts itself after repeated exposure to certain stimuli (plasticity).
The Biological Neuron
� Highly sophisticated physical system with complex spatio-temporal dynamics that transfers signal received by dendrites to the axon.
� The human brain is composed of a very large number of neurons (approx. 86 billion) that are interconnected (approx. 150 trillion synapses).
The Artificial Neuron
� Simple multivariate, nonlinear and differentiable function.
� Ultra-simplification of the biological neuron that retains two key properties: (1) the ability to produce complex nonlinear representations when many neurons are interconnected, and (2) the ability to learn from the data.
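A minimal sketch of such an artificial neuron (the choice of tanh and all names are mine, not prescribed by the lecture): it computes a weighted sum of its inputs, adds a bias, and applies a nonlinearity.

```python
import numpy as np

def neuron(x, w, b, g=np.tanh):
    """A single artificial neuron: a weighted sum followed by a nonlinearity g.

    x: input vector, w: weights, b: bias. tanh is used here as one common
    choice of nonlinearity; others (e.g. sigmoid, ReLU, step) work the same way.
    """
    return g(w @ x + b)

# Example: a neuron with 3 inputs
print(neuron(np.array([1.0, -2.0, 0.5]), np.array([0.4, 0.1, -0.3]), b=0.2))
```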
Examples of Nonlinear Functions
Example of a Neural Network
� Artificial neural networks are typically connected in some regular manner, e.g. sequences of layers.
� The number of neurons in a neural network varies from a few for simple tasks up to millions for image classifiers.
The Forward Pass
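The slide's diagram is not reproduced here; the following sketch shows what a forward pass through a small fully connected network computes (the layer sizes, the tanh nonlinearity, and the convention of a linear output layer are my assumptions):

```python
import numpy as np

def forward(x, layers, g=np.tanh):
    """Forward pass through a sequence of fully connected layers.

    layers: list of (W, b) pairs; each hidden layer computes g(W z + b), and
    the last layer returns the pre-activation z_out used by the classifier.
    """
    z = x
    for W, b in layers[:-1]:
        z = g(W @ z + b)          # hidden layers: linear map + nonlinearity
    W_out, b_out = layers[-1]
    return W_out @ z + b_out      # output layer: linear map only

# Example: 2 inputs -> 3 hidden neurons -> 1 output
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(3, 2)), np.zeros(3)),
          (rng.normal(size=(1, 3)), np.zeros(1))]
print(forward(np.array([0.5, -1.0]), layers))
```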
Universal Approximation Theorem (1)
Theorem (simplified): With sufficiently many neurons, neural networks can approximate any nonlinear function.
“Visual Proof”:
Universal Approximation Theorem (2)
Theorem (simplified): With sufficiently many neurons, neural networks can approximate any nonlinear function.
Proof sketch taken from Bishop '95, Neural Networks for Pattern Recognition, pp. 130–131 (after Jones '90 and Blum & Li '91):
� Consider the special class of functions $y : \mathbb{R}^2 \to \mathbb{R}$ whose input variables are called $x_1, x_2$.
� We will show that any two-layer network with threshold functions as nonlinearity can approximate y(x1,x2) up to arbitrary accuracy.
� We first observe that any function of x2 (with x1 fixed) can be approximated as an infinite Fourier series.
$$y(x_1, x_2) \simeq \sum_s A_s(x_1) \cos(s x_2)$$
Universal Approximation Theorem (3)
� We first observe that any function of x2 (with x1 fixed) can be approximated as an infinite Fourier series.
$$y(x_1, x_2) \simeq \sum_s A_s(x_1) \cos(s x_2)$$
� Similarly, the coefficients themselves can be expressed as an infinite Fourier series:
$$y(x_1, x_2) \simeq \sum_s \sum_l A_{sl} \cos(l x_1) \cos(s x_2)$$
� We now make use of a trigonometric identity to write the function above as a sum of cosines:
$$\cos(\alpha)\cos(\beta) = \tfrac{1}{2}\cos(\alpha + \beta) + \tfrac{1}{2}\cos(\alpha - \beta)$$
� Thus, the function to approximate can be written as a sum of cosines, where each of them receives a linear combination of the input variables:
$$y(x_1, x_2) \simeq \sum_{j=1}^{\infty} v_j \cos(x_1 w_{1j} + x_2 w_{2j})$$
Universal Approximation Theorem (4)
� Thus, the function to approximate can be written as a sum of cosines, where each of them receives a linear combination of the input variables:
$$y(x_1, x_2) \simeq \sum_{j=1}^{\infty} v_j \cos(x_1 w_{1j} + x_2 w_{2j})$$
� This is a two-layer neural network, except for the cosine nonlinearity. The latter can however be approximated by a superposition of a large number of step functions.
$$\cos(z) = \lim_{\tau \to 0} \sum_i \underbrace{\bigl[\cos(\tau \cdot (i+1)) - \cos(\tau \cdot i)\bigr]}_{\text{constant}} \cdot \underbrace{1_{z > \tau \cdot (i+1)}}_{\text{step function}} + \text{const.}$$
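A small numerical check of this construction (the finite range [0, 2π], the value of τ, and all names are my additions; the slide takes the limit τ → 0):

```python
import numpy as np

def cos_from_steps(z, tau=0.01, z_max=2 * np.pi):
    """Approximate cos(z) on [0, z_max] by a sum of weighted step functions.

    Each term [cos(tau*(i+1)) - cos(tau*i)] * 1[z > tau*(i+1)] is a scaled
    threshold unit; the sum telescopes to cos(z) - cos(0), so adding the
    constant cos(0) = 1 recovers cos(z) as tau -> 0.
    """
    i = np.arange(int(np.ceil(z_max / tau)))
    weights = np.cos(tau * (i + 1)) - np.cos(tau * i)           # constant per unit
    steps = (np.asarray(z)[..., None] > tau * (i + 1)).astype(float)  # threshold units
    return steps @ weights + 1.0

z = np.linspace(0, 2 * np.pi, 7)
print(np.max(np.abs(cos_from_steps(z) - np.cos(z))))            # small for small tau
```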
Training a Neural Network
� Use the same error function as the perceptron, but replace the perceptron output $z$ by the neural network output $z_{\text{out}}$:
$$E(\theta) = \frac{1}{N} \sum_{k=1}^{N} \underbrace{\max\bigl(0,\, -z_{\text{out}}^{(k)} t^{(k)}\bigr)}_{E^{(k)}(\theta)}$$
and compute the gradient of the error function w.r.t. the parameters $\theta$ of the neural network.
Question:
� How to compute the gradient of the error function?
Error Backpropagation
� The gradient can be expressed in terms of the gradients in the higher layers using the multivariate chain rule.
$$\frac{\partial E}{\partial z_i} = \sum_j \frac{\partial z_j}{\partial z_i} \frac{\partial E}{\partial z_j}$$
� Similarly, one can then extract the gradient w.r.t. the parameters of the model as:
$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial z_j}{\partial w_{ij}} \frac{\partial E}{\partial z_j}$$
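A minimal sketch of these two equations applied to a two-layer tanh network trained with the loss from the previous slide (the architecture, the tanh nonlinearity, and all names are my choices, not fixed by the lecture):

```python
import numpy as np

def backprop_step(x, t, W1, b1, w2, b2, eta=0.1):
    """One stochastic gradient step for a two-layer tanh network trained
    with the perceptron-style loss E = max(0, -z_out * t).

    Shapes: W1 (h, d), b1 (h,), w2 (h,), b2 scalar. Returns updated parameters.
    """
    # Forward pass
    a = W1 @ x + b1            # hidden pre-activations
    h = np.tanh(a)             # hidden activations
    z_out = w2 @ h + b2        # network output

    # Backward pass (chain rule, from the output towards the input)
    dE_dz = -t if z_out * t < 0 else 0.0       # gradient of max(0, -z_out * t)
    dE_dw2 = dE_dz * h
    dE_db2 = dE_dz
    dE_dh = dE_dz * w2
    dE_da = dE_dh * (1.0 - h ** 2)             # tanh'(a) = 1 - tanh(a)^2
    dE_dW1 = np.outer(dE_da, x)
    dE_db1 = dE_da

    # Gradient descent update
    return (W1 - eta * dE_dW1, b1 - eta * dE_db1,
            w2 - eta * dE_dw2, b2 - eta * dE_db2)
```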
Error Backpropagation
Neural Network at Work
Neural Networks
Remaining questions:
� How to ensure the perceptron and the neural network learn solutions that are simple and generalize well to new data?
� How hard is it to optimize a neural network? Are we guaranteed to converge to a local minimum?
� How to learn multiclass classifiers?
� How to implement neural networks?
(These questions will be addressed in the next lectures.)
Summary
� The perceptron and the neural network enable training classifiers on more complex distributions by focusing on what is critical for classification, i.e. the boundary between classes.
� The neural network enables learning nonlinear decision boundaries. This is useful when the problem is complex (most practical problems are nonlinear).
� The gradient of a neural network required for learning can be easily and quickly computed using the method of error backpropagation.
� The perceptron and the neural network do not have closed form solutions but can be trained iteratively using gradient descent.