
Deep Learning
By Majid Babaei

Architecture of Keras
Keras API can be divided into three main categories
• Model
• Sequential Model − a linear composition of Layers
• Functional API
• Layers
• Core Layers
• Convolution Layers
• Pooling Layers
• Recurrent Layers
• Core Modules
• Activations module
• Loss module
• Optimizer module
• Regularizers

High vs. Low learning rate

Learning Rate and Gradient Descent
• The amount that the weights are updated during training is referred to as the step size or the “learning rate.”
• The learning rate is a configurable hyperparameter used in the training of neural networks
• The learning rate controls how quickly the model is adapted to the problem.
• Smaller learning rates require more training epochs given the smaller changes made to the weights each update
• Larger learning rates result in rapid changes and require fewer training epochs.

If you have time to tune only one hyperparameter, tune the learning rate.

An Optimizer : Stochastic Gradient Descent
◎ It is a variant of gradient descent designed to improve the rate of convergence.
◎ In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters in an iterative manner to minimize an error function.
◎ In GD, you have to run through ALL the samples in your training set to do a single update.
◎ In SGD, on the other hand, you use ONLY ONE sample or a SUBSET of samples from your training set to do the update

An Optimizer : Stochastic Gradient Descent
◎ Saves a lot of time compared to summing over all data
◎ SGD often converges much faster compared to GD but the error function is not as well minimized as in the case of GD.
◎ In most cases, the close approximation you get in SGD for the parameter values is enough, because the parameters reach values near the optimum and keep oscillating there.

Stochastic Gradient Descent in Keras
• Keras provides the SGD class that implements the stochastic gradient descent optimizer with a learning rate and momentum.
• In Keras, we create the optimizer and pass it via the “optimizer” argument when calling the compile() function.
from keras.optimizers import SGD …
opt = SGD()
model.compile(…, optimizer=opt)

Stochastic Gradient Descent in Keras
• The learning rate can be specified via the “lr” argument and the momentum can be specified via the “momentum” argument.
from keras.optimizers import SGD …
opt = SGD(lr=0.01, momentum=0.9)
model.compile(…, optimizer=opt)
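For completeness, here is a minimal, self-contained sketch of compiling a small model with SGD; the layer sizes and loss are placeholders chosen only to make the snippet runnable, not values from the slides.

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

# a tiny placeholder model so the snippet runs end to end
model = Sequential()
model.add(Dense(10, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# SGD with a learning rate of 0.01 and a momentum of 0.9
# (note: recent Keras/TensorFlow versions use the argument name learning_rate instead of lr)
opt = SGD(lr=0.01, momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])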

What is the momentum in SGD?

Momentum in very simple terms!
• Think of the loss as a hilly roller-coaster terrain; a ball sitting on it has potential energy U.
• U (potential energy) = mgh

Momentum in very simple terms!
• What do we want?!
• We want it to roll to the bottom faster.
• At the bottom of the curve, we want it to slow down so that it does not overshoot the minimum!

Momentum
Momentum is a method which helps accelerate gradient vectors in the right directions.

Momentum
• Suppose we have some sequence S which is noisy.
• For this example, we plot a cosine function and add some Gaussian noise.

What do we want of this example?
• We want some kind of ‘moving’ function which would ‘denoise’ the data and bring it closer to the original function.
• Exponentially weighed averages can help us!

Exponentially Weighted Average
• This algorithm is mostly used to reduce noise in time-series data.
• It’s also called “smoothing” the data.
• We achieve this by weighting recent observations more heavily and taking their average.

Exponentially Weighted Average
• Exponentially weighted averages define a new sequence V with the following equation:
V(t) = β · V(t−1) + (1 − β) · S(t)

Exponentially Weighted Average
Momentum
• The sequence V is the one plotted in yellow in the figure.
• Beta is another hyperparameter, which takes values from 0 to 1.

Momentum
We get a much smoother line, which is closer to the original function than the noisy data we had.

Let’s see how the choice of beta affects our new sequence V.

• With smaller values of beta, the new sequence fluctuates a lot (why?!)
• With bigger values of beta, like beta=0.98, we get a much smoother curve, but it’s a little bit shifted to the right (why?!)
• Beta = 0.9 provides a good balance between these two extremes.
Exponentially weighted averages for different values of beta
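A minimal NumPy sketch of this smoothing is shown below; the cosine signal, noise level, and beta values are illustrative assumptions, not the exact ones behind the plots.

import numpy as np

# a noisy cosine sequence S (signal and noise level chosen for illustration)
t = np.linspace(0, 4 * np.pi, 200)
S = np.cos(t) + np.random.normal(scale=0.3, size=t.shape)

def ewa(S, beta):
    # exponentially weighted average: V(t) = beta * V(t-1) + (1 - beta) * S(t)
    V = np.zeros_like(S)
    for i in range(len(S)):
        V[i] = beta * V[i - 1] + (1 - beta) * S[i] if i > 0 else (1 - beta) * S[i]
    return V

# smaller beta follows the noise; larger beta gives a smoother but lagged curve
for beta in (0.5, 0.9, 0.98):
    print('beta=%.2f' % beta, np.round(ewa(S, beta)[:3], 3))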

Momentum helps accelerate gradients in the right direction
Left — SGD without momentum; right — SGD with momentum.

The effect of learning rate on model performance

A multi-class classification problem
• We will use a small multi-class classification problem as the basis to demonstrate the effect of learning rate on model performance.
• The scikit-learn library provides the make_blobs() function, which can be used to create a multi-class classification problem with a configurable:
• Number of samples
• Input variables
• Classes
• Variance of samples within a class.

A multi-class classification problem
• The problem has two input variables (to represent the x and y coordinates of the points)
• A standard deviation of 2.0 for points within each group.
• We will use the same random state (seed for the pseudorandom number generator) to ensure that we always get the same data points.
# generate 2d classification dataset
X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)

A multi-class classification problem
• In order to get a feeling for the complexity of the problem, we can plot each point on a two-dimensional scatter plot and color each point by class value.
• If you run multiclass_classification_problem.py, you will see …
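The script itself is not reproduced in the slides; a plausible sketch of what multiclass_classification_problem.py contains is shown below (the plotting details are assumptions).

from sklearn.datasets import make_blobs
from matplotlib import pyplot
from numpy import where

# generate 2d classification dataset
X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)
# scatter plot of the points, colored by class value
for class_value in range(3):
    # select indices of points with this class label
    row_ix = where(y == class_value)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show plot
pyplot.show()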

Multiclass classification problem

Learning Rate Dynamics

Investigate the effect of different learning rates
We will develop a Multilayer Perceptron (MLP) model to address the classification problem and investigate the effect of different learning rates.

Initial steps
First, we split the dataset into train and test datasets.
Additionally, we must one hot encode the target variable. We can do that using the to_categorical() function, as shown in the sketch below.
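A minimal sketch of these initial steps is given here; the even 500/500 split mirrors the split used later for the regression problem and is an assumption for this example.

from sklearn.datasets import make_blobs
from keras.utils import to_categorical

# generate 2d classification dataset
X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)
# one hot encode the target variable
y = to_categorical(y)
# split into train and test sets (an even split is assumed)
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]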

Define MLP model
• We will define a simple MLP model that expects two input variables from the problem.
• It has a single hidden layer with 50 nodes.
• An output layer with three nodes to predict the probability for each of the three classes.

Define MLP model: Code
# define model
model = Sequential()
model.add(Dense(50, input_dim=2, activation='relu'))
model.add(Dense(3, activation='softmax'))

Compile model
• We will use the stochastic gradient descent (SGD) optimizer
• SGD requires that the learning rate be specified so that we can evaluate
different rates.
• The model will be trained to minimize the value of the loss function.
# compile model
opt = SGD(lr=lrate)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

Fit model
• The model will be fit for 200 training epochs, found with a little trial and error.
• The test set will be used as the validation dataset.
• So we can get an idea of the generalization error of the model during training.
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0)

Plot the accuracy of the model
• Once fit, we will plot the accuracy of the model on the train and test sets over the training epochs.
# plot learning curves
pyplot.plot(history.history['accuracy'], label='train')
pyplot.plot(history.history['val_accuracy'], label='test')
pyplot.title('lrate=' + str(lrate), pad=-50)

We can now investigate the dynamics of different learning rates on the train and test accuracy of the model.

Evaluate learning rates on a logarithmic scale
• We will evaluate learning rates on a logarithmic scale.
• Create line plots for each learning rate by calling the fit_model() function.

Code
• In LearningRate_dynamics.py
• We call the fit_model() function for different learning rates.
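The exact contents of LearningRate_dynamics.py are not shown in the slides; the loop probably looks roughly like the sketch below, assuming fit_model() and the train/test arrays are defined as in the previous slides and that the eight rates span 1E-0 to 1E-7.

from matplotlib import pyplot

# create learning curves for different learning rates
learning_rates = [1E-0, 1E-1, 1E-2, 1E-3, 1E-4, 1E-5, 1E-6, 1E-7]
for i in range(len(learning_rates)):
    # determine the plot number (a 4x2 grid of subplots)
    plot_no = 420 + (i + 1)
    pyplot.subplot(plot_no)
    # fit model and plot learning curves for a learning rate
    fit_model(trainX, trainy, testX, testy, learning_rates[i])
# show learning curves
pyplot.show()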

Results
• Running the example creates a single figure that contains eight line plots for the eight different evaluated learning rates.
• Classification accuracy on the training dataset is marked in blue, whereas accuracy on the test dataset is marked in orange.
• Your specific results may vary given the stochastic nature of the learning algorithm. Consider running the example a few times.

Discussion
• The plots show oscillations in behavior for the too-large learning rate of 1.0.
• The model is unable to learn anything with the too-small learning rates of 1E-6 and 1E-7.
• We can see that the model was able to learn the problem well with the learning rates 1E-1, 1E-2 and 1E-3.

Momentum Dynamics

Momentum
• Momentum can smooth the progression of the learning algorithm that, in turn, can accelerate the training process.
• We can adapt the example from the previous section to evaluate the effect of momentum with a fixed learning rate.
• In this case, we will choose the learning rate of 0.01 that in the previous section converged to a reasonable solution.

Update fit_model()
The fit_model() function can be updated to take a “momentum” argument instead of a learning rate argument.

The updated version of the fit_model() function
# fit a model and plot learning curve
def fit_model(trainX, trainy, testX, testy, momentum):
    # define model
    model = Sequential()
    model.add(Dense(50, input_dim=2, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    # compile model
    opt = SGD(lr=0.01, momentum=momentum)
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0)
    # plot learning curves
    pyplot.plot(history.history['accuracy'], label='train')
    pyplot.plot(history.history['val_accuracy'], label='test')
    pyplot.title('momentum=' + str(momentum), pad=-80)

It is common to use momentum values close to 1.0, such as 0.9 and 0.99.

Momentum Dynamics
• In this example, we will demonstrate the dynamics of the model without momentum compared to models with momentum values of 0.5, 0.9, and 0.99.
# create learning curves for different momentums
momentums = [0.0, 0.5, 0.9, 0.99]
for i in range(len(momentums)):
    # determine the plot number
    plot_no = 220 + (i + 1)
    pyplot.subplot(plot_no)
    # fit model and plot learning curves for a momentum
    fit_model(trainX, trainy, testX, testy, momentums[i])
# show learning curves
pyplot.show()

Run the code
• If you run the Momentum_dynamics.py, you will see …

Momentum Dynamics
• Momentum_dynamics.py

Results
• Running the example creates a single figure that contains four line plots for the different evaluated momentum values.
• Classification accuracy on the training dataset is marked in blue, whereas accuracy on the test dataset is marked in orange.
• Your specific results may vary given the stochastic nature of the learning algorithm.

Discussion
• We can see that the addition of momentum does accelerate the training of the model.
• Momentum values of 0.9 and 0.99 achieve reasonable train and test accuracy within about 50 training epochs as opposed to 200 training epochs when momentum is not used.
• In all cases where momentum is used, the accuracy of the model on the holdout test dataset appears to be more stable.

Step 2: Compilation
Loss

A Loss Function
• As part of the optimization algorithm, the error for the current state of the model must be estimated repeatedly
• This requires the choice of an error function, conventionally called a loss function
• The loss function can be used to estimate the loss of the model so that the weights can be updated to reduce the loss on the next evaluation.

How can we choose a suitable loss function for our problem?

A Loss Function
• The choice of loss function must match our modeling problem, such as classification or regression.

Regression Loss Functions
• There is an important difference between classification and regression problems (classification predicts a label; regression predicts a quantity).
• Regression predictive modeling is the task of approximating a mapping function (f) from input variables (X) to a continuous output variable (y).

What are the characteristics of appropriate loss functions for regression problems?
First, let’s create a regression problem.

Our regression problem
We will use a standard regression problem generator provided by the scikit-learn library
The make_regression() function will generate examples from a simple regression problem with a given number of input variables, statistical noise, and other properties.

A regression problem
• We will use this function to define a problem that has 20 input features
• A total of 1,000 examples will be randomly generated.
• The pseudorandom number generator will be fixed to ensure that we get the same 1,000 examples each time the code is run.
# generate regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)

A regression problem
• For this problem, each of the input variables and the target variable have a Gaussian distribution; therefore, standardizing the data in this case is desirable.
• We can achieve this using the StandardScaler transformer class also from the scikit-learn library.
# standardize dataset
X = StandardScaler().fit_transform(X)
y = StandardScaler().fit_transform(y.reshape(len(y),1))[:,0]

“Rescaling” vs. “Normalizing” vs. “Standardizing”

Rescaling
• “Rescaling” a vector means to add or subtract a constant and then multiply or divide by a constant.
• As you would do to change the units of measurement of the data.
• For example, to convert a temperature from Celsius to Fahrenheit.

Normalizing
• “Normalizing” a vector often means dividing by a norm of the vector.
• In order to make all the elements lie between 0 and 1.
• Bringing all the values of numeric columns in the dataset to a common scale.
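As a small illustration (with made-up data), min-max scaling with scikit-learn brings every column into the [0, 1] range:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# two columns on very different scales (made-up values)
data = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
scaled = MinMaxScaler().fit_transform(data)
print(scaled)  # every column now lies between 0 and 1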

Standardizing
• “Standardizing” a vector often means subtracting a measure of location and dividing by a measure of scale.
• For example, if the vector contains values with a Gaussian distribution, you might subtract the mean and divide by the standard deviation.
• Obtaining a “standard normal” variable with mean 0 and standard deviation 1.

When should we be using Normalization and Standardization?

When do we normalize?
• Normalization is a good technique to use when we do not know the distribution of our data.
• or when our data has varying scales and the algorithm we are using does not make assumptions about the distribution of our data.
• Such as k-nearest neighbors and artificial neural networks.

When do we standardize?
• Standardization assumes that our data has a Gaussian distribution.
• This does not strictly have to be true, but the technique is more effective if your attribute distribution is Gaussian.
• Such as linear regression, logistic regression.

Let’s get back to our regression problem!

Initial step
• Once scaled, the data will be split evenly into train and test sets.
# split into train and test
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]

Define MLP model
• A small Multilayer Perceptron (MLP) model will be defined to address this problem and provide the basis for exploring different loss functions.
# define model
model = Sequential()
model.add(Dense(25, input_dim=20, activation='relu'))
model.add(Dense(1, activation='linear'))

Compile the model
• The model will be fit with stochastic gradient descent (SGD) with a learning rate of 0.01 and a momentum of 0.9.
• Training will be performed for 100 epochs.
• The test set will be evaluated at the end of each epoch so that we can plot learning curves at the end of the run.
opt = SGD(lr=0.01, momentum=0.9)
model.compile(loss='…', optimizer=opt)
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)

Now that we have the basis of a problem and model, we can take a look at some loss functions.

Loss Functions for regression problems

Mean Squared Error
The Mean Squared Error, or MSE, loss is the default loss to use for regression problems.
Mean squared error is calculated as the average of the squared differences between the predicted and actual values.
The result is always positive regardless of the sign of the predicted and actual values.
A perfect value is 0.0.

Mean Squared Error
The squaring means that larger mistakes result in more error than smaller mistakes.
In Keras, specify 'mse' or 'mean_squared_error' as the loss function when compiling the model.
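Putting this together with the compile/fit skeleton from earlier, a minimal sketch for MSE looks like the following (the model, data, and imports are assumed to be defined as in the previous slides):

# compile and fit the regression model with mean squared error loss
opt = SGD(lr=0.01, momentum=0.9)
model.compile(loss='mean_squared_error', optimizer=opt)
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)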

Mean Squared Logarithmic Error loss
There may be regression problems in which the target value has a wide spread of values.
When predicting a large value, you may not want to punish the model as heavily as MSE does.
Instead, you can first calculate the natural logarithm of each of the predicted values, then calculate the mean squared error.
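A hedged sketch of how lossFunction_MSLEL.py might compile the model: MSLE as the loss, with MSE tracked as a metric so the two can be compared (model, data, and imports assumed from the earlier slides).

# compile the model with MSLE loss and track MSE as a metric
opt = SGD(lr=0.01, momentum=0.9)
model.compile(loss='mean_squared_logarithmic_error', optimizer=opt, metrics=['mse'])
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)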

To compare MSLE with MSE please run : lossFunction_MSLEL.py
A line plot is also created showing the MSE and MSLE loss functions for both the train (blue) and test (orange) datasets.

Mean Squared Logarithmic Error loss vs Mean Squared Error
• As you can see, the MSLE loss converged well over the 100 training epochs.
• It appears that the MSE may be showing signs of overfitting, dropping fast and starting to rise from epoch 20 onwards.

Mean Absolute Error
On some regression problems, the distribution of the target variable may be mostly Gaussian, but may have outliers.
The MAE loss function is more robust to outliers.
It is calculated as the average of the absolute difference between the actual and predicted values.
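Analogously, a minimal sketch for MAE (again assuming the regression model, data, and imports from the earlier slides):

# compile the model with MAE loss and track MSE as a metric
opt = SGD(lr=0.01, momentum=0.9)
model.compile(loss='mean_absolute_error', optimizer=opt, metrics=['mse'])
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)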

To compare MAE with MSE please run : lossFunction_MAE.py

Mean Absolute Error loss vs Mean Squared Error
• MAE does converge but shows a bumpy course, although the dynamics of MSE don’t appear greatly affected.
• We know that the target variable is a standard Gaussian with no large outliers, so MAE would not be a good fit in this case.

Loss Functions for binary classification problems

Binary Classification Loss Functions
Binary classification problems are those predictive modeling problems where examples are assigned one of two labels.

A binary classification problem
• We will generate examples from the circles test problem in scikit-learn as the basis for this investigation.
• The circles problem involves samples drawn from two concentric circles on a two-dimensional plane.
• Points on the outer circle belong to class 0 and points on the inner circle belong to class 1.
• Statistical noise is added to the samples to add ambiguity and make the problem more challenging to learn.
# generate circles
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)

A binary classification problem
• We can create a scatter plot of the dataset to get an idea of the problem we are modeling.
• If you want to see a scatter plot of the dataset, run binary_classification_problem.py

A binary classification problem

Binary Cross-Entropy
Cross-entropy is the default loss function to use for binary classification problems.
Cross-entropy will calculate a score that summarizes the average difference between the actual and predicted probability distributions
The score is minimized in the learning process
A perfect cross-entropy value is 0

Binary Cross-Entropy in Keras
It can be specified as the loss function in Keras by specifying 'binary_crossentropy' when compiling the model.
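A minimal sketch for the circles problem is given below; the hidden layer size of 50 mirrors the earlier MLPs and is an assumption, and the Keras imports from the earlier snippets are assumed.

# model with a single sigmoid output node for binary classification
model = Sequential()
model.add(Dense(50, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
opt = SGD(lr=0.01, momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])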

Please run : lossFunction_BCE.py

Binary Cross-Entropy Loss and Accuracy
• The training process converged well.
• The plot for loss is smooth, given the continuous nature of the error.
• Whereas the line plot for accuracy shows bumps.
• In the given binary example, a case can only be predicted as correct or incorrect.

Hinge Loss function
Developed for use with Support Vector Machine (SVM) models.
Target values must be in the set {-1, 1}
Assigning more error when there is a difference in the sign between the actual and predicted class values
The output layer of the network must be configured to use an activation function like tanh.
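A hedged sketch of the changes needed for the hinge loss, relative to the cross-entropy model above (numpy's where and the Keras imports from earlier are assumed):

# change y from {0, 1} to {-1, 1} as required by the hinge loss
y[where(y == 0)] = -1
# model with a tanh output node
model = Sequential()
model.add(Dense(50, input_dim=2, activation='relu'))
model.add(Dense(1, activation='tanh'))
opt = SGD(lr=0.01, momentum=0.9)
model.compile(loss='hinge', optimizer=opt, metrics=['accuracy'])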

Please run : lossFunction_HL.py

Hinge Loss and Accuracy
• The plot of hinge loss shows that the model has converged and has reasonable loss on both datasets.
• The plot of classification accuracy also shows signs of convergence, albeit at a lower level of skill than may be desirable on this problem.

Loss Functions for multiclass classification problems

Multiclass Classification Loss Functions
Multi-class classification problems are those predictive modeling problems where examples are assigned one of more than two classes.

A multiclass classification problem
• We will use the blobs problem as the basis for the investigation.
• The make_blobs() function provided by scikit-learn offers a way to generate examples given a specified number of classes and input features.
• We will use this function to generate 1,000 examples for a 3-class classification problem with 2 input variables.
# generate dataset
X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)

A multiclass classification problem
• The example in multiclass_classification_problem.py creates a scatter plot of the entire dataset coloring points by their class membership.

Multiclass Cross Entropy
It is the default loss function to use for multi-class classification problems.
Target values must be in the set {0, 1, 2, …, n}.
Cross-entropy will calculate a score that summarizes the average difference between the actual and predicted probability distributions for all classes in the problem
The score is minimized in the learning process.
A perfect cross-entropy value is 0.

Multiclass Cross Entropy
It requires that the output layer is configured with n nodes (one for each class).
Multiclass cross-entropy can be specified as the loss function in Keras by specifying 'categorical_crossentropy' when compiling the model.

Please run : lossFunction_MCCE.py

Multiclass Cross Entropy Loss and Accuracy
• The line plots for both cross-entropy and accuracy show good convergence, although somewhat bumpy.
• The model may be well configured given no sign of over or under fitting.
• The learning rate or batch size may be tuned to smooth out the convergence in this case.

Sparse Multiclass Cross Entropy
Consider classification problems with a large number of labels.
The target element of each training example may require a one hot encoded vector with many zero values, requiring significant memory.
Sparse cross-entropy addresses this by performing the same cross-entropy calculation of error, without requiring that the target variable be one hot encoded prior to training.

Sparse Multiclass Cross Entropy in Keras
It can be specified as the loss function in Keras by specifying 'sparse_categorical_crossentropy' when compiling the model.
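A minimal sketch: the integer class labels are passed directly (no to_categorical step), and only the loss string changes (model sizes, data, and imports are assumed from the earlier slides).

# targets stay as integer class labels; no one hot encoding is needed
trainy, testy = y[:n_train], y[n_train:]
model = Sequential()
model.add(Dense(50, input_dim=2, activation='relu'))
model.add(Dense(3, activation='softmax'))
opt = SGD(lr=0.01, momentum=0.9)
model.compile(loss='sparse_categorical_crossentropy', optimizer=opt, metrics=['accuracy'])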

Please run : lossFunction_SMCE.py

Sparse Multiclass Cross Entropy Loss and Accuracy
In this case, the plot shows good convergence of the model over training with regard to loss and classification accuracy.
