Machine Learning II
Lecture 8 – Fine-tuning parameters and hyperparameters
Introduction
How to initialize parameters and hyperparameters?
Weight initialization
Weight decay and momentum
Cross-entropy
Adam optimization
Example
The Fashion MNIST dataset contains 70,000 grayscale images in 10 categories. The images show individual articles of clothing at low resolution (28 by 28 pixels).
Loading data
The classes are shown in the following table:
Label	Class
0	T-shirt/top
1	Trouser
2	Pullover
3	Dress
4	Coat
5	Sandal
6	Shirt
7	Sneaker
8	Bag
9	Ankle boot
The dataset can be loaded as below:
# Load dataset
from tensorflow import keras
fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
Loading data
Each image is mapped to a single label. Since the class names are not included with the dataset, we store them here to use later when plotting the images:
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
We can display an image from the dataset:
import matplotlib.pyplot as plt

plt.figure()
plt.imshow(train_images[0])
plt.colorbar()
plt.grid(False)
plt.show()
Pre-processing
Normalize the data as we did before.
train_images = train_images / 255.0  # divide by the maximum pixel value (255) to normalize
test_images = test_images / 255.0
Plot the first 25 training images with their class labels:
plt.figure(figsize=(10,10))
for i in range(25):
    plt.subplot(5, 5, i + 1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i], cmap=plt.cm.binary)
    plt.xlabel(class_names[train_labels[i]])
plt.show()
Model design
num_pixels = train_images.shape[1] * train_images.shape[2] #28*28 = 784
X_train = train_images.reshape(train_images.shape[0], num_pixels)
X_test = test_images.reshape(test_images.shape[0], num_pixels)
# normalize inputs from 0-255 to 0-1
X_train = X_train / 255
X_test = X_test / 255
Y_test = test_labels
from tensorflow.keras.utils import to_categorical
# one hot encode outputs
y_train = to_categorical(train_labels)
y_test = to_categorical(test_labels)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import optimizers
hidden_nodes = 128
num_classes = y_test.shape[1]
def baseline_model():
    # create model
    model = Sequential()
    model.add(Dense(num_pixels, input_dim=784, activation='relu'))
    model.add(Dense(hidden_nodes, activation='relu'))
    model.add(Dense(num_classes, activation='softmax'))  # softmax works well for multiple classes
    sgd = optimizers.SGD(lr=0.01)
    # Compile model
    model.compile(loss='mean_squared_error', optimizer=sgd, metrics=['accuracy'])
    return model
Train and test
Accuracy: 24.30%
model = baseline_model()
# Fit the model
nn_simple = model.fit(X_train, y_train, validation_split=0.2, epochs=80, batch_size=200)
(The validation split is used to assess how accurate the model is as we change the parameters. The batch size speeds up training but does not, or should not, affect accuracy.)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test)
print("Accuracy: %.2f%%" % (scores[1] * 100))
Possible solution: we didn't pre-define the weights. Let's do that, but how?
Parameters – Weight initialization can handle the following two issues
Issues caused by choosing the wrong weights:
Vanishing gradients: in deep networks, for typical activation functions, the gradient magnitudes |dW| get smaller and smaller as we move backwards through the layers during back-propagation. The earliest layers are then the slowest to train.
Exploding gradients: this is the exact opposite of vanishing gradients. Suppose the weights are non-negative and large while the activations are small (as can be the case for sigmoid(z)). When these weights are multiplied along the layers, they cause large changes in the cost. This may result in oscillating around the minimum, or repeatedly overshooting the optimum, so the model never learns.
Best Weights
The weights are chosen from a normal distribution with mean 0 and standard deviation:
For ReLU (He initialization): $\sigma = \sqrt{2/n}$
For tanh (Xavier initialization): $\sigma = \sqrt{1/n}$
where $n$ = #neurons in the layer.
Another approach:
Gradient clipping: we set a threshold value, and if a chosen function of the gradient is larger than this threshold, we set it to another (clipped) value.
If we initialize the weights this way, they are assigned with mean = 0 and the standard deviation given above; a Keras sketch follows below.
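As an illustration only (this is not the lecture's code), Keras provides built-in initializers that follow these rules, and its optimizers accept gradient-clipping arguments. The layer sizes below reuse the lecture's 784/128/10 architecture; the clipnorm threshold of 1.0 is an assumed example value.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import optimizers

model = Sequential()
# 'he_normal' draws weights from a (truncated) normal with std sqrt(2 / fan_in)
model.add(Dense(128, input_dim=784, kernel_initializer='he_normal', activation='relu'))
model.add(Dense(128, kernel_initializer='he_normal', activation='relu'))
# 'glorot_normal' (Xavier) is the usual choice for tanh or softmax layers
model.add(Dense(10, kernel_initializer='glorot_normal', activation='softmax'))
# clipnorm rescales any gradient whose L2 norm exceeds the threshold
sgd = optimizers.SGD(lr=0.01, clipnorm=1.0)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])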
New weight initialization
def baseline_model():
    # create model
    model = Sequential()
    model.add(Dense(hidden_nodes, input_dim=num_pixels, kernel_initializer='normal', activation='relu'))
    model.add(Dense(128, kernel_initializer='normal', activation='relu'))
    model.add(Dense(num_classes, kernel_initializer='normal', activation='softmax'))
    sgd = optimizers.SGD(lr=0.01)
    # Compile model
    model.compile(loss='mean_squared_error', optimizer=sgd, metrics=['accuracy'])
    return model
Accuracy: 32.97%
Hyperparameters – Weight Decay
As we discussed last week, we can use L2 regularization to prevent overfitting: after each update, the weights are effectively multiplied by a factor slightly less than 1. Note that the decay argument of Keras' SGD below actually decays the learning rate over updates; an explicit L2 penalty can also be attached to the layers directly, as sketched after the code.
sgd = optimizers.SGD(lr=0.01, decay=0.01)
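A minimal sketch of attaching an explicit L2 penalty to a layer instead; this is a fragment meant for inside baseline_model(), and the factor 0.01 is an assumed example value, not taken from the slides.
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense

# L2 penalty on the layer's weights: classic weight decay
model.add(Dense(hidden_nodes, input_dim=num_pixels, activation='relu',
                kernel_regularizer=regularizers.l2(0.01)))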
Momentum (serves a similar purpose to the learning rate)
In an optimization problem we want to find the global minimum.
However, in the real world the loss surface usually contains many local minima, and gradient descent can get stuck in one of them.
Momentum, a value between 0 and 1, increases the size of the steps taken towards the minimum and helps the optimizer jump out of such local minima.
Momentum
If the momentum term is large then the learning rate should be kept smaller.
A large value of momentum also means that the convergence will happen fast.
If both the momentum and learning rate are kept at large values, then you might skip the minimum with a huge step.
A small value of momentum cannot reliably avoid local minima, and can also slow down the training of the system.
Mathematically, momentum can be seen as a type of weighted average: we take a weighted (exponential moving) average of all the previous steps, as written below.
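A common formulation of the momentum update, with learning rate $\alpha$ and momentum coefficient $\gamma$ (e.g. 0.9), is:
$v_t = \gamma\, v_{t-1} + \alpha\, \nabla_w J(w_{t-1}), \qquad w_t = w_{t-1} - v_t$
Expanding the recursion shows that $v_t$ is an exponentially weighted sum of all past gradients, which is why momentum is described as a weighted average of the steps.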
Momentum
(Alpha is the learning rate.)
Accuracy: 35.20%
sgd = optimizers.SGD(lr=0.01, momentum=0.9)
Cross-entropy
So far we have measured the error with a quadratic loss:
$E = \frac{1}{2}\sum_{i}(y_i - \hat{y}_i)^2$ or $E = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
This MSE is better suited to regression problems than to classification (MSE is not a good fit for categorical outputs), and classification is the problem our DNN is solving.
For DNNs we generally use cross-entropy, which was introduced in Lecture 2 and is recalled below.
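For reference, the categorical cross-entropy over one-hot targets $y$ and predicted class probabilities $\hat{y}$ (for $n$ samples and $K$ classes) is:
$L = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik}\,\log \hat{y}_{ik}$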
Accuracy: 48.87%
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
Adam optimizer
The Adam (ADAptive Moment estimation) optimization algorithm is an extension of stochastic gradient descent that has seen broad adoption for deep learning applications in computer vision and natural language processing; it was introduced in 2015.
The main difference between Adam and plain gradient descent is that in Adam the learning rate is not a fixed number: an effective step size is maintained and updated separately for each parameter.
It combines ideas from two other algorithms that we do not cover here:
Adaptive Gradient Algorithm (AdaGrad): maintains a per-parameter learning rate, which improves performance on problems with sparse gradients.
Root Mean Square Propagation (RMSProp): also maintains per-parameter learning rates, adapted based on the average of recent gradient magnitudes for each weight.
Adam optimizer
$m_t = \beta_1\, m_{t-1} + (1-\beta_1)\, g_t \qquad \hat{m}_t = \dfrac{m_t}{1-\beta_1^t}$
$v_t = \beta_2\, v_{t-1} + (1-\beta_2)\, g_t^2 \qquad \hat{v}_t = \dfrac{v_t}{1-\beta_2^t}$
$\theta_t = \theta_{t-1} - \dfrac{\alpha\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$
$m_t$ and $v_t$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively; $g_t$ is the gradient at step $t$.
$\alpha$ (learning rate or step size): the proportion by which the weights are updated (e.g. 0.001). Larger values (e.g. 0.3) result in faster initial learning; smaller values (e.g. 1.0E-5) slow learning right down during training.
$\beta_1$: the exponential decay rate for the first-moment estimates.
$\beta_2$: the exponential decay rate for the second-moment estimates.
$\epsilon$: a very small number to prevent any division by zero.
These hyperparameters map directly onto the Keras Adam constructor, as sketched below.
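A sketch only (the lecture itself just passes the string 'adam' to compile); the values shown are the tf.keras defaults, and the argument names follow recent tf.keras versions.
from tensorflow.keras import optimizers

# Adam with its default hyperparameters made explicit
adam = optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7)
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])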
Adam optimizer
Pros:
Straightforward to implement.
Computationally efficient.
Little memory requirements.
Well suited for problems that are large in terms of data and/or parameters.
Appropriate for non-stationary objectives.
Appropriate for problems with very noisy and/or sparse gradients.
Hyperparameters have an intuitive interpretation and typically require little tuning.
Accuracy: 88.20%
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Output analysis
When we classify with Keras, we get 10 outputs for each test instance. The class with the highest value is taken as the prediction (see the argmax sketch after the code below).
(This vector gives the probability of each of the 10 classes; here the last value, 99.49%, means the image is most likely a sandal.)
There is code uploaded on Moodle that can show you the output:
(The red bar indicates 94% probability for 'Sneaker'; the remaining bars show the probabilities of the other classes.)
predictions = model.predict(X_test)
predictions[0]
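For example, the predicted class can be read off with argmax; a small sketch using numpy (not part of the Moodle code):
import numpy as np

# Index of the largest probability = predicted label for the first test image
predicted_label = np.argmax(predictions[0])
print(class_names[predicted_label], class_names[test_labels[0]])  # predicted vs. true class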
Assignment 4
Learning curves have an important role in developing any machine learning algorithm. To understand the effect of changing the hyperparameters of the deep network that we designed, draw two learning curves of test accuracy vs. epochs. Repeat this analysis two times (a minimal plotting sketch follows the list below):
With Adam algorithm
With SGD
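A minimal plotting sketch, assuming the History objects returned by model.fit are kept for the two runs (the variable names history_adam and history_sgd are hypothetical):
import matplotlib.pyplot as plt

# history_adam and history_sgd are assumed to be returned by model.fit(...)
# for the Adam and SGD runs; use 'val_acc' instead on older Keras versions.
plt.plot(history_adam.history['val_accuracy'], label='Adam')
plt.plot(history_sgd.history['val_accuracy'], label='SGD')
plt.xlabel('Epoch')
plt.ylabel('Validation accuracy')
plt.legend()
plt.show()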