
4b: Image Processing

Weight Initialization

The aim of weight initialization is to choose the size of the initial weights in each layer of a deep
neural network in such a way that the gradients will remain in a healthy range and not vanish or
explode.

Consider a neural network with $D$ layers and let the activations at each layer $(i)$ be $\{x_k^{(i)}\}$, $1 \le k \le n^{(i)}$, where $x^{(1)} = x$ is the input and $z = x^{(D+1)}$ is the output. If $w_{jk}^{(i)}$ is the weight connecting node $k$ in layer $(i)$ to node $j$ in layer $(i+1)$, then

$$x_j^{(i+1)} = g\left(\sum_{k=1}^{n^{(i)}} w_{jk}^{(i)} x_k^{(i)}\right)$$

where $g()$ is the transfer function. If we assume that the activations (and the weights) are statistically independent and identically distributed (i.i.d.) we can estimate the mean and variance of $x^{(i+1)}$ based on those of $x_k^{(i)}$ and $w_{jk}^{(i)}$. The bias weights are initialised to 0, and we also assume (for now) that the activation function is $\tanh()$, which is symmetric about the origin.

Recall that the mean and variance of a set of $n$ samples $x_1, \ldots, x_n$ are given by

$$\mathrm{Mean}[x] = \frac{1}{n}\sum_{k=1}^{n} x_k$$

$$\mathrm{Var}[x] = \frac{1}{n}\sum_{k=1}^{n} \left(x_k - \mathrm{Mean}[x]\right)^2 = \left(\frac{1}{n}\sum_{k=1}^{n} x_k^2\right) - \mathrm{Mean}[x]^2$$

If $s = \sum_{k=1}^{n} x_k$ where the $x_k$ are i.i.d. then $\mathrm{Mean}[s] = n\,\mathrm{Mean}[x]$ and $\mathrm{Var}[s] = n\,\mathrm{Var}[x]$.

More generally, if $y = \sum_{k=1}^{n} w_k x_k$ where the $w_k, x_k$ are i.i.d. with $\mathrm{Mean}[w] = \mathrm{Mean}[x] = 0$, then

$$\mathrm{Var}[y] = n\,\mathrm{Var}[w]\,\mathrm{Var}[x]$$
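As a quick numerical check of this identity, here is a small sketch in NumPy (the sample size, variances and use of Gaussian samples are arbitrary illustrative choices):

```python
import numpy as np

# Check Var[y] = n Var[w] Var[x] for y = sum_k w_k x_k with zero-mean i.i.d. w and x.
rng = np.random.default_rng(0)
n, trials = 100, 50_000

w = rng.normal(0.0, 0.3, size=(trials, n))   # Var[w] = 0.09
x = rng.normal(0.0, 2.0, size=(trials, n))   # Var[x] = 4.0
y = (w * x).sum(axis=1)

print(y.var())          # empirical Var[y], close to the prediction
print(n * 0.09 * 4.0)   # predicted n * Var[w] * Var[x] = 36
```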

Statistics Example: Coin Tossing

Suppose we toss a coin once, and count the number of heads. The mean and variance of this value
are

$$\mu = \tfrac{1}{2}(0 + 1) = 0.5$$

$$\sigma^2 = \tfrac{1}{2}\left((0 - 0.5)^2 + (1 - 0.5)^2\right) = 0.25$$

If we toss the coin 100 times, the mean and variance will be $\mu = 100 \times 0.5 = 50$ and $\sigma^2 = 100 \times 0.25 = 25$. If we toss it 10,000 times, they will be $\mu = 10{,}000 \times 0.5 = 5000$ and $\sigma^2 = 10{,}000 \times 0.25 = 2500$.

Note that instead of the variance we often think in terms of the standard deviation, which in this case would be $\sigma = 0.5$, $5$ and $50$, respectively. This means that the number of heads as a fraction of the total number of coin tosses will get steadily closer to $0.5$ as the number of tosses increases.
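The same behaviour can be checked empirically with a short simulation (a sketch; the number of repetitions is an arbitrary choice):

```python
import numpy as np

# Count heads in n tosses of a fair coin, repeated many times, and compare the
# empirical mean and variance with n * 0.5 and n * 0.25.
rng = np.random.default_rng(1)
for n in (1, 100, 10_000):
    heads = rng.binomial(n, 0.5, size=100_000)
    print(n, heads.mean(), heads.var())
```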

Weight Initialisation

Returning to our neural network example, we have

$$x_j^{(i+1)} = g\left(\sum_{k=1}^{n^{(i)}} w_{jk}^{(i)} x_k^{(i)}\right)$$

$$\mathrm{Var}\left[\sum_{k=1}^{n^{(i)}} w_{jk}^{(i)} x_k^{(i)}\right] = n^{(i)}\,\mathrm{Var}[w^{(i)}]\,\mathrm{Var}[x^{(i)}]$$

So

$$\mathrm{Var}[x^{(i+1)}] \simeq G_0\, n^{(i)}\,\mathrm{Var}[w^{(i)}]\,\mathrm{Var}[x^{(i)}]$$

where $G_0$ is a constant whose value is estimated to take account of the transfer function. By multiplying across all layers from input $x = x^{(1)}$ to output $z = x^{(D+1)}$, we get

$$\mathrm{Var}[z] \simeq \left(\prod_{i=1}^{D} G_0\, n^{(i)}\,\mathrm{Var}[w^{(i)}]\right) \mathrm{Var}[x]$$

When we apply gradient descent through backpropagation, the differentials will follow a similar pattern:

$$\mathrm{Var}\left[\frac{\partial}{\partial x}\right] \simeq \left(\prod_{i=1}^{D} G_1\, n^{(i+1)}\,\mathrm{Var}[w^{(i)}]\right) \mathrm{Var}\left[\frac{\partial}{\partial z}\right]$$

where $G_1$ is a constant whose value is estimated to take account of the derivative of the transfer function. If some layers are not fully connected, we can replace $n^{(i)}$ in the above equations with the average number of incoming connections to each node at layer $(i+1)$, and replace $n^{(i+1)}$ with the average number of outgoing connections from each node at layer $(i)$.
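The effect of this product over many layers can be illustrated with a small forward-pass simulation (a sketch in PyTorch; the depth, width and weight scales are arbitrary illustrative choices):

```python
import torch

def final_activation_variance(depth=50, width=256, weight_std=0.05, batch=1024):
    """Propagate random inputs through `depth` tanh layers and return Var[x^(D+1)]."""
    x = torch.randn(batch, width)
    for _ in range(depth):
        w = torch.randn(width, width) * weight_std   # Var[w] = weight_std ** 2
        x = torch.tanh(x @ w)
    return x.var().item()

# Too small a weight scale makes the activations vanish; too large saturates the
# tanh units; a scale of 1/sqrt(width) keeps the variance in a healthy range.
for std in (0.01, 1.0 / 256 ** 0.5, 0.2):
    print(std, final_activation_variance(weight_std=std))
```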

In order to have healthy forward and backward propagation, we need to choose the initial weights $\{w_{jk}^{(i)}\}$ in each layer $(i)$ such that all terms in the product are approximately equal to $1$. Any deviation from this could cause the differentials to either decay or explode exponentially. If the transfer function is $\tanh()$, this is normally achieved using Xavier initialisation, where the weights are chosen from a uniform distribution bounded between these values:

$$\pm\sqrt{\frac{6}{n^{(i)} + n^{(i+1)}}}$$
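In PyTorch this can be done by hand or with the built-in initialiser (a sketch; the layer sizes are arbitrary):

```python
import math
import torch
import torch.nn as nn

layer = nn.Linear(in_features=400, out_features=120)

# Manual Xavier (Glorot) initialisation: uniform in (-b, b) with b = sqrt(6 / (n_in + n_out)).
bound = math.sqrt(6.0 / (layer.in_features + layer.out_features))
with torch.no_grad():
    layer.weight.uniform_(-bound, bound)
    layer.bias.zero_()

# Equivalent built-in helper:
nn.init.xavier_uniform_(layer.weight)
```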

For Rectified Linear Units (ReLU) the above analysis is essentially still valid, with $G_0 = G_1 = \frac{1}{2}$, although we need to be mindful of the fact that the mean of the activations $x^{(i)}$ is not zero. In this case we normally use Kaiming initialisation, where the weights are chosen from a Gaussian distribution with mean 0 and standard deviation

$$\sqrt{\frac{2}{n^{(i)}}}$$
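A corresponding sketch for a ReLU layer (again the layer size is arbitrary); PyTorch's built-in helper uses the same std = sqrt(2 / fan_in):

```python
import math
import torch
import torch.nn as nn

layer = nn.Linear(in_features=256, out_features=256)

# Manual Kaiming (He) initialisation: Gaussian with mean 0 and std = sqrt(2 / n_in).
std = math.sqrt(2.0 / layer.in_features)
with torch.no_grad():
    layer.weight.normal_(0.0, std)
    layer.bias.zero_()

# Equivalent built-in helper:
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
```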

These figures from (He et al., 2015) illustrate the benefit of Kaiming initialisation on the ImageNet classification task. For a 22-layer ReLU network (left), $\mathrm{Var}[w] = \frac{2}{n}$ converges faster than $\mathrm{Var}[w] = \frac{1}{n}$. For a 30-layer ReLU network (right), $\mathrm{Var}[w] = \frac{2}{n}$ is successful while $\mathrm{Var}[w] = \frac{1}{n}$ fails to learn at all.

References

He, K., Zhang, X., Ren, S., & Sun, J., 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1026-1034).

Batch Normalization

Batch Normalisation (Ioffe & Szegedy, 2015) serves a similar purpose to Weight Initialisation, but is applied throughout the training process rather than just at the beginning.

The activations $x_k^{(i)}$ of node $k$ in layer $(i)$ can be normalised relative to their mean and variance across a mini-batch of training items:

$$\hat{x}_k^{(i)} = \frac{x_k^{(i)} - \mathrm{Mean}[x_k^{(i)}]}{\sqrt{\mathrm{Var}[x_k^{(i)}]}}$$

These activations can then be shifted and re-scaled to have mean $\beta_k^{(i)}$ and standard deviation $\gamma_k^{(i)}$:

$$y_k^{(i)} = \beta_k^{(i)} + \gamma_k^{(i)}\, \hat{x}_k^{(i)}$$

Inspired by Weight Initialisation, we might at first think that $\beta_k^{(i)}$ and $\gamma_k^{(i)}$ should be fixed in advance (for example, if $\beta_k^{(i)} = 0$ and $\gamma_k^{(i)} = 1$, then $y_k^{(i)}$ would be equal to $\hat{x}_k^{(i)}$). However, it turns out that better results can be obtained if $\beta_k^{(i)}$ and $\gamma_k^{(i)}$ are treated as additional parameters for each node, which can be trained by backpropagation along with the other parameters (weights) in the network. In this way, the network retains the same flexibility it would have had without Batch Normalization, but the dynamics of the backpropagation are changed in a beneficial way.

After training is complete, $\mathrm{Mean}[x_k^{(i)}]$ and $\mathrm{Var}[x_k^{(i)}]$ can either be pre-computed on the entire training set, or updated using running averages.
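A minimal sketch of the training-time computation for one fully connected layer (the small constant eps, added for numerical stability, and the batch and feature sizes are illustrative; nn.BatchNorm1d provides the same behaviour, including the running averages):

```python
import torch

def batch_norm_train(x, beta, gamma, eps=1e-5):
    """x: (batch, features). Normalise each feature over the mini-batch, then
    shift and re-scale by the learnable parameters beta and gamma."""
    mean = x.mean(dim=0)
    var = x.var(dim=0, unbiased=False)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return beta + gamma * x_hat

x = torch.randn(32, 10)
beta = torch.zeros(10, requires_grad=True)    # trained by backpropagation
gamma = torch.ones(10, requires_grad=True)    # trained by backpropagation
y = batch_norm_train(x, beta, gamma)
print(y.mean(dim=0), y.std(dim=0))            # approximately beta and gamma
```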

References

Ioffe, S., & Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (pp. 448-456).

ResNets and DenseNets

Residual Networks

Very deep networks (greater than 30 layers) can be trained successfully by introducing skip
connections to form a residual network. The idea is to take any two consecutive stacked layers in a
deep network and add a “skip” connection which bypasses these layers and is added to their output.

In this way, the preceding layers attempt to do the "whole" job, making $x$ as close as possible to the target output of the entire network. $F(x)$ is a residual component which corrects the errors from previous layers, or provides additional details which the previous layers were not powerful enough to compute.

These graphs from (He et al., 2016) demonstrate the effectiveness of residual networks. When the skip connections are absent (left) the test error for a 34-layer network is higher than for an 18-layer network. When the skip connections are included (right) both the training error (thin) and test error (thick) are lower for the 34-layer network than for the 18-layer network.

In order to train a network with more than 100 layers, the transfer function (ReLU) needs to be
applied before adding the residual instead of afterwards. This is called an identity skip connection.
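A minimal sketch of a residual block in PyTorch (the channel count, the use of batch normalisation, and the placement of the final ReLU follow the basic post-activation block; this is illustrative rather than the exact architecture from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two stacked 3x3 convolutions with a skip connection: output = ReLU(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(residual + x)   # the skip connection adds x to F(x)

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])
```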

Dense Networks

Good results on ImageNet have also been achieved using networks with densely connected blocks.
Within each block, every layer is connected by shortcut connections to all the preceding layers
(Huang, 2017).
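A minimal sketch of a densely connected block (the growth rate and number of layers are arbitrary; the DenseNet paper also uses batch normalisation and 1x1 bottleneck convolutions, omitted here for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseBlock(nn.Module):
    """Each layer receives the concatenation of all preceding feature maps in the block."""
    def __init__(self, in_channels, growth_rate=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv2d(in_channels + i * growth_rate, growth_rate, kernel_size=3, padding=1)
            for i in range(num_layers)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(F.relu(layer(torch.cat(features, dim=1))))
        return torch.cat(features, dim=1)

block = DenseBlock(in_channels=16)
print(block(torch.randn(1, 16, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])
```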

References

He, K., Zhang, X., Ren, S., & Sun, J., 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).

Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K.Q., 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4700-4708).

Neural Style Transfer

Convolutional Neural Networks can be used for many tasks other than object classification. One such task is Neural Style Transfer (Gatys et al., 2016), which aims to combine the content of one image with the style of another, as shown here.

The process relies on a fixed CNN such as VGG-19 which has been pre-trained on ImageNet. If $F_{ik}^{l}$ denotes the activation of the $i$th convolutional filter at spatial location $k$ in layer $l$, then it is natural to minimise the $L_2$ distance between $F_{ik}^{l}(x)$ and $F_{ik}^{l}(x_c)$, where $x_c$ is the content image and $x$ is the synthetic image being generated. Moreover, classical work on texture synthesis would suggest that the visual "style" of an image is somehow captured in the Gram matrices

$$G_{ij}^{l} = \sum_{k} F_{ik}^{l} F_{jk}^{l}$$

Neural Style Transfer therefore aims to minimise this loss function:

$$E_{\mathrm{total}} = E_{\mathrm{content}} + E_{\mathrm{style}} = \frac{\alpha}{2}\sum_{i,k}\left(F_{ik}^{l}(x) - F_{ik}^{l}(x_c)\right)^2 + \beta\sum_{l=0}^{L}\frac{w_l}{4 N_l^2 M_l^2}\sum_{i,j}\left(G_{ij}^{l} - A_{ij}^{l}\right)^2$$

where $x$ is the generated image, $x_c$ is the content image, $F_{ik}^{l}$ is the $i$th filter at position $k$ in layer $l$, $N_l, M_l$ are the number of filters and the size (area) of hidden layer $l$, $w_l$ is a weighting factor for layer $l$, and $G_{ij}^{l}$, $A_{ij}^{l}$ are the Gram matrices for the generated image and the style image, respectively.

Note that in this case, gradient descent is applied not to the weights of the network (which remain fixed) but rather to the R,G,B values of the pixels in the image itself. This figure from Gatys et al. (2016) shows a single content image in combination with five different style images.
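A sketch of the Gram matrix and the style term for a single layer (the feature shapes are illustrative; in practice the activations F would come from a fixed, pre-trained VGG-19, and gradients would be propagated back to the pixels of the generated image):

```python
import torch

def gram_matrix(F):
    """F: (N_l, M_l) matrix of N_l filter responses, each flattened over M_l spatial positions.
    Returns G with G[i, j] = sum_k F[i, k] * F[j, k]."""
    return F @ F.t()

def style_term(F_gen, F_style, w_l=1.0):
    """One layer's contribution to E_style: w_l / (4 N_l^2 M_l^2) * sum_ij (G_ij - A_ij)^2."""
    N_l, M_l = F_gen.shape
    G = gram_matrix(F_gen)     # Gram matrix of the generated image
    A = gram_matrix(F_style)   # Gram matrix of the style image
    return w_l * ((G - A) ** 2).sum() / (4 * N_l ** 2 * M_l ** 2)

# Illustrative feature maps: 64 filters over a 32x32 layer, flattened to (64, 1024).
F_gen = torch.randn(64, 1024, requires_grad=True)
F_style = torch.randn(64, 1024)
loss = style_term(F_gen, F_style)
loss.backward()
```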

References

Gatys, L.A., Ecker, A.S., & Bethge, M., 2016. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2414-2423).

Quiz 5: Image Processing

Question 1

Explain the problem of vanishing and exploding gradients, and how Weight Initialization can help to prevent it.

Question 2

Describe the Batch Normalization algorithm.

Question 3

Explain the difference between a Residual Network and a Dense Network.

Coding Exercise: Image Classification

In this exercise, you will practice using PyTorch to create a neural network model named LeNet-5, one of the most famous convolutional neural network models, proposed by Yann LeCun et al. in 1998. Please implement LeNet-5 according to the following description:

LeNet-5 consists of seven layers:

layer 1: Convolution, input channels = 1, output channels = 6, kernel size = 5, activation = ReLU.

layer 2: Max Pooling, kernel size = 2.

layer 3: Convolution, input channels = 6, output channels = 16, kernel size = 5, activation = ReLU.

layer 4: Max Pooling, kernel size = 2.

layer 5: Linear, output size 120, activation = ReLU. (Calculate the input size yourself.)

layer 6: Linear, input size 120, output size 84, activation = ReLU.

layer 7: Linear, input size 84, output size 10, no activation.

Please download the notebook and implement/run it locally.
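For reference, here is one possible sketch of the model described above (it assumes 32x32 single-channel inputs, so the flattened size entering layer 5 is 16 x 5 x 5 = 400; you should still verify this calculation and complete the notebook yourself):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5)    # layer 1
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)   # layer 3
        self.fc1 = nn.Linear(16 * 5 * 5, 120)          # layer 5 (assumes 32x32 inputs)
        self.fc2 = nn.Linear(120, 84)                  # layer 6
        self.fc3 = nn.Linear(84, num_classes)          # layer 7

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)     # layers 1 and 2
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)     # layers 3 and 4
        x = torch.flatten(x, start_dim=1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)                             # no activation on the output

model = LeNet5()
print(model(torch.randn(1, 1, 32, 32)).shape)          # torch.Size([1, 10])
```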
