
1b: Multi Layer Networks and Backpropagation

Multi Layer Perceptrons

Limitations of Perceptrons

The main problem with Perceptrons is that many useful functions are not linearly separable.

The simplest example of a logical function that is not linearly separable is the Exclusive OR or XOR
function. Some languages have distinct words for inclusive and exclusive OR; for example, Latin
has “aut” and “vel”. In other languages, like English, this distinction needs to be inferred from the
context. Consider these two sentences:

1. “All of his movies are either too long or too silly.” [They could be both too long and too silly]

2. “Either you give me the money, or I will punch you in the face.” [We understand from the
context that one of these things will happen, but not both]

Multi Layer Perceptrons

One solution to this problem is to rewrite XOR in terms of linearly separable functions such as AND,
OR and NOR, and then arrange several Perceptrons into a network which combines them in the same way.
For example,

XOR(x₁, x₂) can be written as (x₁ AND x₂) NOR (x₁ NOR x₂). We can therefore arrange three
perceptrons in the following way to compute XOR:
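A minimal sketch of this arrangement in Python (not from the original notes; it assumes step-activation perceptrons with False = 0 and True = 1, and the weights shown are one valid choice among many):

# Three step-activation perceptrons arranged to compute
# XOR(x1, x2) = (x1 AND x2) NOR (x1 NOR x2).

def perceptron(weights, bias, inputs):
    """Classic perceptron: outputs 1 iff the weighted sum plus bias exceeds 0."""
    s = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if s > 0 else 0

def AND(x1, x2):    # fires only when both inputs are 1
    return perceptron([1.0, 1.0], -1.5, [x1, x2])

def NOR(x1, x2):    # fires only when both inputs are 0
    return perceptron([-1.0, -1.0], 0.5, [x1, x2])

def XOR(x1, x2):
    return NOR(AND(x1, x2), NOR(x1, x2))

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", XOR(x1, x2))    # 0 0 -> 0, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0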

In the following exercise, you will show how this method can be used to construct a two-layer
neural network which computes any given logical function.

Exercise: MLPs


Conjunctive Normal Form

Before beginning this exercise, we note that any logical function can be converted into Conjunctive
Normal Form (CNF), meaning it is a conjunction of terms where each term is a disjunction of (possibly
negated) literals. This is an example of an expression in CNF:

(A ∨ B) ∧ (¬B ∨ C ∨ ¬D) ∧ (D ∨ ¬E)
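As a concrete (hypothetical) illustration, a CNF expression can be represented as a list of clauses, each holding (variable, negated?) literals; the expression is true exactly when every clause contains at least one satisfied literal:

# Hypothetical encoding: each clause is a list of (variable, negated?) pairs.
# This encodes (A ∨ B) ∧ (¬B ∨ C ∨ ¬D) ∧ (D ∨ ¬E).
cnf = [
    [("A", False), ("B", False)],
    [("B", True), ("C", False), ("D", True)],
    [("D", False), ("E", True)],
]

def evaluate(cnf, assignment):
    """True iff every clause has at least one literal satisfied by the assignment."""
    return all(
        any(assignment[var] != negated for var, negated in clause)
        for clause in cnf
    )

print(evaluate(cnf, {"A": True, "B": False, "C": True, "D": False, "E": False}))  # True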

Computing any Logical Function with a two-layer network

Assuming False = 0 and True = 1, explain how each of the following could be constructed.

You should include the bias for each node, and the value of all the weights (input-to-output or input-
to-hidden, hidden-to-output, as appropriate).

Question 1: Perceptron to compute the OR function of m inputs.

Question 2: Perceptron to compute the AND function of n inputs.

Question 3: Two-layer Neural Network to compute the function (A ∨ B) ∧ (¬B ∨ C ∨ ¬D) ∧ (D ∨ ¬E).

Question 4: With reference to this example, explain how a two-layer neural network could be
constructed to compute any (given) logical expression, assuming it is written in Conjunctive
Normal Form.

Hint: first consider how to construct a Perceptron to compute the OR function of m inputs, with k of
the inputs negated.

Bonus challenge: Can you construct a two-layer Neural Network to compute XOR which has only
one hidden unit, but also includes shortcut connections from the two inputs directly to the (one)
output?

Hint: start with a network that computes inclusive OR, and then try to think how it could be modified.

Gradient Descent

We saw previously how a Multi Layer Perceptron can be built to implement any logical function.
However, normally we need to deal with raw data rather than an explicit logical expression. What we
really want is a method, analogous to the perceptron learning algorithm, which can learn the weights
of a neural network, based on a set of training items.

As early as the 1960s, engineers understood how to use Gradient Descent to optimize over a family
of functions that are continuous and di�erentiable. The basic idea is this:

We define an error function or loss function E to be (half) the sum, over all input items, of the square
of the difference between the actual output and the target output:

E = ½ ∑ᵢ (zᵢ − tᵢ)²

If we think of E as height, this defines an error landscape on the weight space. The aim is to find a set
of weights for which E is very low.

If the functions involved are smooth, we can use multi-variable calculus to adjust the weights in such
a way as to take us in the steepest downhill direction.

w ← w − η ∂E/∂w

The parameter η is called the learning rate.
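As a minimal sketch of this rule in action (not from the original notes; the one-weight model z = w·x, the data and the learning rate are illustrative assumptions):

# Gradient descent on E = 0.5 * sum((w*x_i - t_i)**2) for a one-weight model.
xs = [1.0, 2.0, 3.0]
ts = [2.0, 4.0, 6.0]        # targets generated by the "true" weight w = 2

w = 0.0
eta = 0.05                  # learning rate

for step in range(100):
    grad = sum((w * x - t) * x for x, t in zip(xs, ts))   # dE/dw
    w = w - eta * grad                                    # w <- w - eta * dE/dw

print(w)                    # converges close to 2.0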

Gradient Descent for Neural Networks

Although Gradient Descent was already a well-established technique in the 1960s, somehow no-one
thought to apply it to Neural Networks until many years later. One reason for this was politics.
Marvin Minsky and Seymour Papert criticised neural networks in their 1969 book “Perceptrons” and
lobbied the US Government to redirect research funding away from neural networks and into
symbolic methods such as expert systems.

The other reason was a technical obstacle. If we use the step function as our transfer function, the
landscape will not be smooth but will instead consist almost entirely of flat local regions and
“shoulders”, with occasional discontinuous jumps.

For the perceptron this didn’t matter, because it only had one layer, but for networks with two or
more layers, it becomes a big problem.

In order for gradient descent to be applied successfully, neural networks would need to be
redesigned so that the function from input to output becomes smooth and differentiable. This was
achieved by Paul Werbos in 1975 and, more famously, in (Rumelhart et al., 1986).

Continuous Activation Functions

The key idea is to replace the (discontinuous) step function with a differentiable function, such as the
sigmoid:

g(s) = 1 / (1 + e⁻ˢ)

or hyperbolic tangent:

g(s) = tanh(s) = (eˢ − e⁻ˢ) / (eˢ + e⁻ˢ) = 2 / (1 + e⁻²ˢ) − 1

Backpropagation

We now describe how the partial derivatives of the loss function with respect to each weight can be
computed. We consider the case of a 2-layer neural network with sigmoid activation at the hidden
and output nodes. For simplicity, we present the case with 2 inputs, 2 hidden units and 1 output:

u₁ = b₁ + w₁₁x₁ + w₁₂x₂
y₁ = g(u₁)
s = c + v₁y₁ + v₂y₂
z = g(s)

(with u₂ = b₂ + w₂₁x₁ + w₂₂x₂ and y₂ = g(u₂) defined analogously)

We sometimes use w as a shorthand for any of the trainable weights
c, v₁, v₂, b₁, b₂, w₁₁, w₂₁, w₁₂, w₂₂.

Chain Rule (6.5.2)

If y = y(u, v), where u = u(x) and v = v(x), then

∂y/∂x = (∂y/∂u)(∂u/∂x) + (∂y/∂v)(∂v/∂x)

This principle can be used to compute the partial derivatives in an efficient and localized manner.
Note that the transfer function must be differentiable (usually sigmoid or tanh).
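For example, chaining these factors along the path from w₁₁ to the output (anticipating the derivatives computed below) gives:

∂E/∂w₁₁ = (∂E/∂z)(dz/ds)(∂s/∂y₁)(dy₁/du₁)(∂u₁/∂w₁₁) = (z − t) · z(1 − z) · v₁ · y₁(1 − y₁) · x₁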

Note: if z(s) = 1 / (1 + e⁻ˢ), then z′(s) = z(1 − z);

if z(s) = tanh(s), then z′(s) = 1 − z².
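These identities are easy to verify numerically. A quick sketch (not in the original notes) comparing each formula against a central finite-difference estimate:

import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

eps = 1e-6
for s in (-2.0, 0.0, 1.5):
    # Sigmoid: z'(s) should equal z(1 - z)
    z = sigmoid(s)
    numeric = (sigmoid(s + eps) - sigmoid(s - eps)) / (2 * eps)
    print(abs(numeric - z * (1 - z)) < 1e-8)      # True

    # Tanh: z'(s) should equal 1 - z^2
    z = math.tanh(s)
    numeric = (math.tanh(s + eps) - math.tanh(s - eps)) / (2 * eps)
    print(abs(numeric - (1 - z * z)) < 1e-8)      # True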

Forward Pass

u₁ = b₁ + w₁₁x₁ + w₁₂x₂
y₁ = g(u₁)
s = c + v₁y₁ + v₂y₂
z = g(s)
E = ½ ∑ (z − t)²

Backward Pass

Partial Derivatives

∂E/∂z = z − t
dz/ds = g′(s) = z(1 − z)
∂s/∂y₁ = v₁
dy₁/du₁ = y₁(1 − y₁)

Useful notation

δ_out = ∂E/∂s,   δ₁ = ∂E/∂u₁,   δ₂ = ∂E/∂u₂

Then

δ_out = (z − t) z(1 − z)
∂E/∂v₁ = δ_out y₁
δ₁ = δ_out v₁ y₁(1 − y₁)
∂E/∂w₁₁ = δ₁ x₁

Partial derivatives can be calculated efficiently by backpropagating the deltas through the network.
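As a minimal sketch (assumptions: sigmoid at both layers, a single training item, plain Python rather than any particular library), the forward and backward passes above can be implemented directly:

import math

def g(s):                                  # sigmoid transfer function
    return 1.0 / (1.0 + math.exp(-s))

def forward_backward(x1, x2, t, b1, b2, w11, w12, w21, w22, c, v1, v2):
    # Forward pass
    u1 = b1 + w11 * x1 + w12 * x2
    u2 = b2 + w21 * x1 + w22 * x2
    y1, y2 = g(u1), g(u2)
    s = c + v1 * y1 + v2 * y2
    z = g(s)
    E = 0.5 * (z - t) ** 2

    # Backward pass: deltas as defined above
    d_out = (z - t) * z * (1 - z)          # delta_out = dE/ds
    d1 = d_out * v1 * y1 * (1 - y1)        # delta_1 = dE/du1
    d2 = d_out * v2 * y2 * (1 - y2)        # delta_2 = dE/du2

    # Gradient of E with respect to each trainable weight
    grads = {
        "c": d_out, "v1": d_out * y1, "v2": d_out * y2,
        "b1": d1, "w11": d1 * x1, "w12": d1 * x2,
        "b2": d2, "w21": d2 * x1, "w22": d2 * x2,
    }
    return E, grads

E, grads = forward_backward(1.0, 0.0, 1.0, 0.1, 0.1, 0.2, -0.3, 0.4, 0.5, 0.0, 0.6, -0.4)
print(E, grads["w11"])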

References

Rumelhart, D.E., Hinton, G.E., & Williams, R.J., 1986. Learning representations by back-propagating
errors. Nature, 323(6088), 533-536.

Further Reading

Textbook Deep Learning (Goodfellow, Bengio, Courville, 2016):

Continuous Activation Functions (3.10)

Gradient Descent (4.3)

Backpropagation (6.5.2)

Neural Network Training

Example: Pima Indians Diabetes Dataset

One example of the kind of task for which two-layer neural networks might be applied is the Pima
Indians Diabetes Dataset. For each patient, the 8 input attributes shown in this table are provided.
The network is trained to output a number between 0 and 1, indicating the probability that the patient
is positive for diabetes.

   Attribute                                          mean    stdv
1. Number of times pregnant                            3.8     3.4
2. Plasma glucose concentration                      120.9    32.0
3. Diastolic blood pressure (mm Hg)                   69.1    19.4
4. Triceps skin fold thickness (mm)                   20.5    16.0
5. 2-Hour serum insulin (mu U/ml)                     79.8   115.2
6. Body mass index (weight in kg/(height in m)²)      32.0     7.9
7. Diabetes pedigree function                          0.5     0.3
8. Age (years)                                        33.2    11.8
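Because these attributes have very different scales (compare serum insulin with the pedigree function), inputs are typically rescaled before training. A minimal sketch (not from the original notes) standardizing one record using the means and standard deviations in the table; the example record is hypothetical:

# Standardize a patient record: rescaled = (value - mean) / stdv,
# so every attribute contributes on a comparable scale.
means = [3.8, 120.9, 69.1, 20.5, 79.8, 32.0, 0.5, 33.2]
stdvs = [3.4, 32.0, 19.4, 16.0, 115.2, 7.9, 0.3, 11.8]

def standardize(record):
    return [(x - m) / s for x, m, s in zip(record, means, stdvs)]

patient = [6, 148, 72, 35, 0, 33.6, 0.627, 50]   # hypothetical record
print(standardize(patient))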

Online, Batch and Minibatch Learning

You will notice that the loss function includes a summation over training items:

E = ½ ∑ᵢ (zᵢ − tᵢ)²

In some cases, the gradients for all training items are computed and added together, and this overall
gradient is then used to update the weights in one step. This is known as Batch Learning. At the
other extreme, we could compute the gradient for one item at a time as they are presented, and
update the weights as we go, based on the gradient for each individual item. This is called Online
Learning. To take advantage of parallel hardware, an intermediate approach known as Minibatch
Learning is often employed, where the training items are divided randomly into minibatches of
roughly equal size. The gradients are computed in parallel for all items in the minibatch, and the
weights are updated accordingly. Because each minibatch represents only a portion of the full loss
function, its gradient can be thought of as an approximation to the full gradient, with some implicit
noise added. For this reason, online and minibatch learning are sometimes collectively referred to as
Stochastic Gradient Descent (SGD).
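The three regimes differ only in how many items contribute to each weight update. Here is an illustrative sketch (the toy one-weight model, learning rate and batch size are assumptions, not from the notes):

import random

data = [(x, 2.0 * x) for x in range(1, 9)]   # toy items with true weight 2
eta = 0.001                                  # learning rate

def grad(w, items):                          # dE/dw summed over the given items
    return sum((w * x - t) * x for x, t in items)

# Batch learning: one update per epoch, using the gradient over all items.
w = 0.0
for epoch in range(200):
    w -= eta * grad(w, data)

# Online learning: update after every individual item.
w_online = 0.0
for epoch in range(200):
    for item in data:
        w_online -= eta * grad(w_online, [item])

# Minibatch (SGD): shuffle, split into small batches, update once per batch.
w_mini = 0.0
for epoch in range(200):
    random.shuffle(data)                     # helps avoid temporal correlations
    for i in range(0, len(data), 4):
        w_mini -= eta * grad(w_mini, data[i:i + 4])

print(w, w_online, w_mini)                   # all close to 2.0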

In order for learning to be successful, we should try to avoid temporal correlations in the training
process. In other words, each minibatch (or a set of consecutively presented items in the case of
online learning) should contain a variety of different inputs and corresponding outputs, rather than
containing very similar inputs with the same target output.

Autonomous Driving and Hinton Diagrams

Neural networks with one or two hidden layers found a diverse range of applications throughout the
1990s, including medical diagnosis, game playing, credit card fraud detection, handwriting
recognition, financial prediction, and many others.

We will describe ALVINN (Autonomous Land Vehicle In a Neural Network; Pomerleau, 1990) because
it is a good early example of Behavioural Cloning, hidden unit visualization, Data Augmentation and
Experience Replay. ALVINN was an early autonomous driving system which used a 2-layer neural
network to determine the steering angle. The network takes as input a 30 × 32 sensor input retina and
computes 30 outputs corresponding to different steering directions ranging from Sharp Left to Sharp
Right. When the trained network is driving the vehicle, the steering direction is chosen as a weighted
average over these 30 outputs.

For image processing or spatial tasks like this, the weights coming into and out of a particular hidden
node can be visualized using a Hinton Diagram.

The dots in the top row correspond to hidden-to-output weights; those in the lower square
correspond to input-to-hidden weights. Black and white dots represent positive and negative weight
values, respectively.

Behavioral Cloning, Data Augmentation and Experience Replay

ALVINN was initially trained by a process known as Behavioral Cloning. The vehicle is driven for some
time by a human, and a database is collected of sensor inputs and the corresponding actions chosen
by the human driver. The network is trained on this database, and is then invited to take over the
controls.

It was found that the network could not achieve proficiency in the task when trained on the human
data alone. This is because, when the network was driving, the vehicle would start to veer off the
road, and would then encounter situations that never occurred when the human was driving and
therefore did not appear in the database.

(Pomerleau, 1990) solved this problem by using Data Augmentation, which means using domain
knowledge to create additional training data. Every original image (collected while the human is
driving) gets shifted and rotated in 14 different ways to create 14 additional training items. The
steering angle is adjusted to compensate for the shift and rotation.

The other problem they encountered is that many similar images occur in rapid succession, with
similar steering angles, leading to the temporal correlations discussed above. To avoid this, an
Experience Replay buffer of 200 items is maintained. At each timestep, 15 old items are removed
from the replay buffer and replaced with 15 newly generated items. The network is then trained on
all 200 items currently in the replay buffer.
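A minimal sketch of this replay scheme (not ALVINN's actual code; the random removal policy and the train_on_batch callback are assumptions for illustration):

import random

BUFFER_SIZE = 200        # items kept in the replay buffer
REPLACE_PER_STEP = 15    # old items swapped out at each timestep

replay_buffer = []

def replay_step(new_items, train_on_batch):
    """One timestep: swap in 15 new items, then train on the whole buffer.

    new_items: a list of 15 freshly generated (input, target) pairs.
    train_on_batch: hypothetical callback that runs one training pass.
    """
    # Once the buffer is full, remove old items to make room (removal here
    # is random; the original's exact replacement policy may differ).
    while len(replay_buffer) + len(new_items) > BUFFER_SIZE:
        replay_buffer.pop(random.randrange(len(replay_buffer)))
    replay_buffer.extend(new_items)

    # Train on all items currently in the buffer.
    train_on_batch(replay_buffer)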

With these enhancements, ALVINN was able to steer autonomously across the United States in 1995
(although, for safety reasons, the accelerator and brake were still human controlled).

In Week 5, we will see how Reinforcement Learning can be used as an alternative to Behavioral
Cloning for autonomous control tasks, or for learning to play board games or video games.

References

Pomerleau, D.A., 1990. Rapidly adapting arti�cial neural networks for autonomous navigation. In
Proceedings of the 3rd International Conference on Neural Information Processing Systems (pp. 429-435).

Quiz 1: Perceptrons and Backprop


This is a Quiz to test your understanding of the material from Week 1.

You must attempt to answer each question yourself, before looking at the sample answer.

Question 1: What class of functions can be learned by a Perceptron?

Question 2: Explain the difference between Perceptron Learning and Backpropagation.

Question 3: When training a Neural Network by Backpropagation, what happens if the Learning Rate
is too low? What happens if it is too high?

Question 4: Explain why rescaling of inputs is sometimes necessary for Neural Networks.

Question 5: What is the difference between Online Learning, Batch Learning, Mini-Batch Learning and
Experience Replay? Which of these methods are referred to as “Stochastic Gradient Descent”?
