
COMP5329 week4 notes

Week4 lecture notes
Written by Gary Jiajun Huang, who tried his best to complete these notes.

Why regularisation
Regularisation helps us prevent overfitting.

So what is overfitting?

Overfitting is a situation where the network fits the training data too closely, so that the knowledge it learns applies only to the training data and cannot be generalised to unseen data.

The figure above is an example of overfitting. The blue dots are the training data, and the green line is the true distribution of the whole dataset, which is:

While the network has learned a different distribution from the training data, shown by the red line, which is:

(Note: the point is that we are using a formula with many terms to fit a distribution that actually has only a few terms.)

As we can see, the network learns to predict the training data perfectly, but it does not work for the whole dataset. In other words, it cannot represent the whole distribution.

How can we detect overfitting?

It is very easy to detect overfitting!

During the training process, if we observe that the prediction error (the loss/objective function) on the training data approaches a very low level (this does not mean the error will be close to 0, just relatively very low) while the testing error becomes very high, then the network is suffering from overfitting.

How to prevent overfitting?

1. Use a larger dataset. Problem: sometimes we cannot access a larger one and only have limited data.

2. Reduce parameters:
As in the example shown above, we use too many terms to learn a simple two-term distribution, which makes it very easy to overfit.

Therefore, if we can reduce some terms, we can overcome the problem easily.

Problem: in real situations, we don't know how complicated the task is (how many parameters the real distribution has), so it is impractical to tune the number of parameters one by one to achieve optimal performance.


3. Classical regularisation methods: adding constraints on the parameters

L2 regularisation (weight decay): a very classical, simple way to regularise.

It is so easy that we only need to add a second term to our final objective function. Theta denotes the network weights (the parameters of the network), and alpha is called the weight decay parameter, which controls the strength of the constraint.
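As a sketch of the standard form (the notation here is mine and may differ slightly from the slides), the regularised objective is

    \tilde{J}(\theta) = J(\theta) + \frac{\alpha}{2} \lVert \theta \rVert_2^2

where J(θ) is the original loss, θ are the network weights and α is the weight decay parameter.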

Why does L2 regularisation work?

With this extra term, the objective function penalises large network weights, so the weights tend to become very small and are shrunk toward zero. Keeping the weights small limits the effective capacity of the network, which behaves much like removing some of the weights (parameters). (Strictly speaking, L2 pushes weights close to zero but rarely to exactly zero; it is the L1 penalty that tends to produce truly sparse, exactly-zero weights.)
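A minimal sketch in plain NumPy of one gradient step on this regularised objective (the function name and default values are mine, not from the unit code):

    import numpy as np

    def sgd_weight_decay_step(theta, grad_loss, lr=0.01, alpha=1e-4):
        """One SGD step on J(theta) + (alpha/2) * ||theta||^2.

        grad_loss is the gradient of the unregularised loss J at theta;
        the extra alpha * theta term is what shrinks ("decays") the weights.
        """
        return theta - lr * (grad_loss + alpha * theta)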

Why L2 regularisation, and why not L1?

L1 regularisation is the version without the square: it penalises the absolute values of the weights instead of their squares.

Computational difficulty:

Thanks to the square, L2 leads to a closed-form solution (for example, in linear regression). (Note: a closed-form expression is one that can be written with a finite number of standard operations.) However, L1 does not have a closed-form solution in general, because the absolute value makes the penalty a non-differentiable piecewise function. L1 regularisation is therefore computationally more difficult, and in practice we usually rely on approximations or iterative methods to optimise it.
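As an illustration of this difference for linear regression, here is a minimal NumPy sketch with made-up toy data; the soft-thresholding scheme is one standard iterative approach, not necessarily the one covered in lectures:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                    # toy inputs
    y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5])     # toy targets
    alpha = 0.1                                      # penalty strength

    # L2 (ridge): the squared penalty keeps the objective differentiable,
    # so the minimiser has a closed form: (X^T X + alpha I)^{-1} X^T y.
    w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

    # L1 (lasso): no closed form in general; a common choice is an
    # iterative proximal-gradient ("soft-thresholding") scheme.
    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    w_lasso = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2           # 1 / Lipschitz constant
    for _ in range(500):
        grad = X.T @ (X @ w_lasso - y)               # gradient of 0.5 * ||Xw - y||^2
        w_lasso = soft_threshold(w_lasso - step * grad, step * alpha)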

4. Other regularisation methods:

4.1 Adding noise

Assume we are doing a binary classification task. In this case, the gap between the positive-class and negative-class distributions is quite large, so the decision boundary could be w1, w2, w3, or any other line between the two distributions. w2 is the best of the three candidates (note: this doesn't mean it is the best one globally!). Why?

Consider the drawn blue and red dots (sorry for the bad drawing 🙂): if we pick w1, the blue dots will be misclassified as negative. If we pick w3, the red dots will be misclassified as positive. However, if we pick w2, all the drawn dots are still classified correctly.

A well-chosen boundary improves generalisation to unseen data.

If we add some noise, e.g., Gaussian noise based on the class distribution, it expands the training set distribution, so the network can learn a better boundary!

However, if the added noise follows a bad distribution, e.g., a Gaussian with a large variance, the noisy samples might fall inside the other class's distribution. In that case, the noise becomes real "noise" that prevents the network from learning an appropriate boundary.

Adding noise is equivalent to adding L2 regularisation:

Note: the noise, denoted epsilon, follows the normal (Gaussian) distribution N(0, λI), i.e. it has zero mean and covariance λI. Lambda is a scalar controlling the strength of the noise, similar to the weight decay parameter, and I is the identity matrix.

If we add such noise to each sample x ==> (x + epsilon), it is equivalent to adding L2 regularisation.
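A minimal sketch of adding such noise to the inputs during training (the function name and default values are mine; for a linear model with squared-error loss, the expected loss over this noise picks up an extra λ·||w||^2 term, which is exactly an L2 penalty):

    import numpy as np

    def add_input_noise(x, lam=0.01, rng=np.random.default_rng()):
        """Return x + epsilon with epsilon ~ N(0, lam * I).

        lam plays a role similar to the weight decay parameter alpha.
        """
        eps = rng.normal(loc=0.0, scale=np.sqrt(lam), size=x.shape)
        return x + eps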

4.2 Data augmentation:

Recall that training on a larger dataset can prevent overfitting. Data augmentation is a useful method to expand the current dataset!

For example, we can perform horizontal flips, crops, or rotations (with small angles) to slightly change the images and introduce more examples, thereby increasing the size of the dataset.
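A minimal sketch using torchvision (assuming torchvision is available; the image size and rotation angle are illustrative, not values from the unit):

    from torchvision import transforms

    augment = transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),   # horizontal flip
        transforms.RandomCrop(32, padding=4),     # random crop of a 32x32 image
        transforms.RandomRotation(degrees=10),    # rotation with a small angle
        transforms.ToTensor(),
    ])
    # augmented = augment(pil_image)  # apply to a PIL image when loading data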

4.3 Early stopping:

Recall what happens when the model is overfitting: the training loss decreases to a very low level, but the testing loss gets larger and larger.

We can stop the training early before we approach overfitting!

One key problem of this method is that we need validation data to determine the early stopping point. This leaves less data for training, which hurts especially when the dataset is small.

But we can reuse these validation data!

We need to train the model twice. The first time, we use the validation data to find the early stopping point and record it. Then we train the model again, this time merging the training and validation data into the training set (reusing the validation data for training), and we stop training once we reach the pre-recorded early stopping point.
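A minimal sketch of the first run (the callables train_one_epoch and evaluate are hypothetical placeholders, not helpers from the unit):

    def find_early_stopping_point(train_one_epoch, evaluate, max_epochs=100, patience=5):
        """Return the epoch index with the best validation loss.

        train_one_epoch() trains the model for one epoch on the training data;
        evaluate() returns the current loss on the validation data.
        Training stops once the validation loss has not improved for
        `patience` consecutive epochs.
        """
        best_val, best_epoch, wait = float("inf"), 0, 0
        for epoch in range(max_epochs):
            train_one_epoch()
            val_loss = evaluate()
            if val_loss < best_val:
                best_val, best_epoch, wait = val_loss, epoch, 0
            else:
                wait += 1
                if wait >= patience:
                    break
        return best_epoch

The second run would then train a fresh model on the merged training + validation data for best_epoch + 1 epochs.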

4.5 Dropout:

Dropout is an efficient way to prevent overfitting.

We temporarily remove (drop out) some of the units in each iteration during the training process. Temporary removal means we do not run the forward or backward pass through these units in that iteration.

During testing, we use all units for testing or prediction.

The figure above shows an example of dropping out 50% of the units.

How to implement it?
The idea is that we block the outputs of some nodes to the next layer.

Consider r_i as the mask for layer i. Bernoulli(p) means: given p (the probability that a unit is blocked), it returns a mask of 0s and 1s drawn according to p.

We then multiply the mask with the outputs, so the outputs of the selected units are set to 0 and are blocked in that iteration.
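A minimal NumPy sketch of a dropout layer, following the notes' convention that p is the probability of a unit being blocked. (This is the common "inverted dropout" variant, which rescales the kept units by 1/(1-p) during training so that no change is needed at test time; the original paper instead rescales at test time.)

    import numpy as np

    def dropout_forward(h, p=0.5, training=True, rng=np.random.default_rng()):
        """Apply dropout to the layer outputs h.

        During training, a Bernoulli mask zeroes out roughly a fraction p of
        the units; at test time all units are used unchanged.
        """
        if not training:
            return h
        mask = (rng.random(h.shape) >= p).astype(h.dtype)   # 1 = keep, 0 = drop
        return h * mask / (1.0 - p)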

Why dropout?

When we train a large network, it sometimes tends to rely on only a few weights to learn the task, so a few units produce high-value outputs while the rest stay very small. This is effectively training a smaller network, because most of the outputs are close to zero!

Dropout prevents the network from relying on a few units for prediction, reducing the co-dependence between units. As a result, the network learns to use more of its weights for prediction. Alternatively, we can view dropout as training multiple subnetworks at the same time and combining the power of all of them to make decisions.

In practice, setting p = 0.5 (half of the units disabled) often achieves the best results (but it is also important to explore other settings in your assignment).

4.6 Dropconnect:

DropConnect is a generalisation of dropout. Instead of removing units, it disables individual weights by setting them to 0.
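A minimal NumPy sketch of the idea, masking the weight matrix instead of the unit outputs (the function name and drop probability are mine):

    import numpy as np

    def dropconnect_forward(x, W, p=0.5, rng=np.random.default_rng()):
        """Compute x @ W with a random fraction p of the weights set to 0."""
        mask = (rng.random(W.shape) >= p).astype(W.dtype)
        return x @ (W * mask)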

Normalization:
1. Classical normalisation

We convert the input features into the range [0, 1].

Why do we need to do that?

Consider a data sample with two features. The first feature has the range [0, 1] and the second has the range [0, 1000]. The second feature could then have much more impact than the first, even though in reality they could contribute equally to the prediction. So it is better to scale all features into the same range, such as [0, 1].

Alternatively, convert each feature to zero mean and unit variance (standardisation):
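A minimal NumPy sketch of both options, applied feature-wise (the small epsilon is mine, just to avoid division by zero):

    import numpy as np

    def min_max_scale(X):
        """Scale each feature (column) of X into the range [0, 1]."""
        x_min, x_max = X.min(axis=0), X.max(axis=0)
        return (X - x_min) / (x_max - x_min + 1e-8)

    def standardise(X):
        """Shift and scale each feature to zero mean and unit variance."""
        return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)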

Internal covariate shift – a problem that can occur during training.

When we pass inputs through a layer, its outputs follow some distribution. Internal covariate shift means that this output distribution keeps changing during training (due to changes in the input data or in the previous layers). The next layer then has to keep adapting to a new distribution, which slows down the training process and prevents the network from achieving better performance.

==> it would be better for each layer's outputs to keep a stable distribution, e.g., the normal distribution.

==> use batch normalization!

2. Batch normalization

For each layer, except the output layer, we add one batch normalization layer right after the activation function output.

Given an activation output x, we normalize it using the mean and variance calculated over all outputs of the current mini-batch in this layer. This forces the layer outputs to follow a normal-like distribution (zero mean and unit variance)!

Gamma and beta are parameters learned during training; they scale and shift the normalized output.
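A minimal NumPy sketch of the batch-norm forward pass for a fully connected layer (running statistics for test time are omitted to keep it short):

    import numpy as np

    def batch_norm_forward(x, gamma, beta, eps=1e-5):
        """Batch normalization for a mini-batch x of shape (N, D).

        Each feature is normalised with the mini-batch mean and variance,
        then scaled by gamma and shifted by beta (both learned parameters).
        """
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mu) / np.sqrt(var + eps)
        return gamma * x_hat + beta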

Batch normalization is scale-invariant:

About the scale: a large learning rate can increase the weight values (the scale of the layer parameters), so that gradient explosion can happen!

When we use BN, each layer's output is normalized regardless of that scale, which prevents gradient explosion.

The impact of batch size in BN: a larger batch size usually gives better performance, because the mini-batch statistics estimate the true mean and variance more accurately.


Group normalization (GN):

Rather than using the mini-batch to calculate the mean and variance for normalization, which is affected by the batch size, GN divides the channels into groups and calculates the mean and variance within each group.

Channels are a notion introduced by networks operating on 3D feature maps, such as convolutional neural networks (CNNs). The number of channels in each layer is a constant, which means GN does not need to care about the batch size.
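A minimal NumPy sketch of group normalization for a 4-D feature map (the number of groups is illustrative; it must divide the number of channels C):

    import numpy as np

    def group_norm_forward(x, gamma, beta, num_groups=8, eps=1e-5):
        """Group normalization for x of shape (N, C, H, W).

        Channels are split into num_groups groups; the mean and variance are
        computed per sample and per group, independent of the batch size N.
        gamma and beta are learned per-channel parameters of shape (1, C, 1, 1).
        """
        N, C, H, W = x.shape
        xg = x.reshape(N, num_groups, C // num_groups, H, W)
        mu = xg.mean(axis=(2, 3, 4), keepdims=True)
        var = xg.var(axis=(2, 3, 4), keepdims=True)
        x_hat = ((xg - mu) / np.sqrt(var + eps)).reshape(N, C, H, W)
        return gamma * x_hat + beta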
