COSC2779LabExercises_W4_solutions

COSC 2779 | Deep Learning

Week 4 Lab Exercises: **Feed-forward Neural Networks**

Introduction
This lab is aimed at understanding the different elements of, and debugging, simple feed-forward neural networks. During this lab you will:
Try different activations
Try different models with varying capacities
Try different optimisation techniques
Experiment with regularisation
This notebook is designed to run on Google Colab. If you would like to run it on your local machine, make sure that you have TensorFlow version 2.0 installed.
HIGGS Data Set Description
The dataset used for this lab is from the paper cited below. This is a classification problem to distinguish between a signal process which produces Higgs bosons and a background process which does not.
The data has been produced using Monte Carlo simulations. The first 21 features (columns 2-22) are kinematic properties measured by the particle detectors in the accelerator. The last seven features are functions of the first 21 features; these are high-level features derived by physicists to help discriminate between the two classes. There is an interest in using deep learning methods to obviate the need for physicists to manually develop such features. Benchmark results using Bayesian Decision Trees from a standard physics package and 5-layer neural networks are presented in the original paper. The last 500,000 examples are used as a test set.
Baldi, P., P. Sadowski, and D. Whiteson. “Searching for Exotic Particles in High-energy Physics with Deep Learning.” Nature Communications 5 (July 2, 2014).
Setting up the Notebook
Let's first load the packages we need.
In [ ]:
import tensorflow as tf
import numpy as np
import pandas as pd
In [ ]:
import pathlib
import shutil
import tempfile
from IPython import display
from matplotlib import pyplot as plt
from sklearn.metrics import roc_auc_score
We can use TensorBoard to view the learning curves. Let's first set it up.
In [ ]:
logdir = pathlib.Path(tempfile.mkdtemp())/"tensorboard_logs"
shutil.rmtree(logdir, ignore_errors=True)

# Load the TensorBoard notebook extension
%load_ext tensorboard

# Open an embedded TensorBoard viewer
%tensorboard --logdir {logdir}/models
We can also write our own function to plot the model's training history once training has completed.
In [ ]:
from itertools import cycle

def plotter(history_hold, metric='binary_crossentropy', ylim=[0.0, 1.0]):
    # Plot the training and validation curves of the given metric for every model in history_hold
    cycol = cycle('bgrcmk')
    for name, item in history_hold.items():
        y_train = item.history[metric]
        y_val = item.history['val_' + metric]
        x_train = np.arange(0, len(y_val))
        c = next(cycol)
        plt.plot(x_train, y_train, c + '-', label=name + '_train')
        plt.plot(x_train, y_val, c + '--', label=name + '_val')
    plt.legend()
    plt.xlim([1, max(plt.xlim())])
    plt.ylim(ylim)
    plt.xlabel('Epoch')
    plt.ylabel(metric)
    plt.grid(True)
Load the dataset
Let's load the dataset from the internet and set it up to be used with deep learning models.
In [ ]:
data = pd.read_csv("http://mlphysics.ics.uci.edu/data/higgs/HIGGS.csv.gz", header=None)
When developing machine learning models one would usually do some data exploration at this point. I encourage you to use your machine learning skills to explore the HIGGS dataset.
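For example, a quick look at the dataset size, the class balance of the target (column 0) and the per-feature summary statistics might look like the sketch below (these particular checks are only suggestions, not part of the original lab):
In [ ]:
# Quick sanity checks on the raw dataframe (column 0 is the label).
print(data.shape)                                        # number of rows and columns
print(data[0].value_counts(normalize=True))              # class balance of the target
print(data.describe().T[['mean', 'std', 'min', 'max']])  # per-feature summary statistics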
We can create a TensorFlow dataset from numpy arrays using tf.data.Dataset.from_tensor_slices. Print some data to see if everything is working. More information on creating datasets and their operations is available in the TensorFlow Documentation.
In [ ]:
targets = data.loc[:, 0]
dataX = data.loc[:, 1:]

HIGGS_dataset = tf.data.Dataset.from_tensor_slices((dataX.values, targets.values))

for feat, targ in HIGGS_dataset.take(1):
    print('Features: {}, Target: {}'.format(feat, targ))
Let's now split the data into training and validation sets. It is good practice to hold out a test set for final evaluation, but we will ignore it for now because this lab is just for experimenting.
We are going to use 1,050,000 data points for training and 500,000 data points for validation (N_TRAIN and N_VAL below). The TensorFlow Dataset API provides an interface to batch data, and we have used it below.
Flip the variable small_dataset to either use the complete dataset (takes time) or only a small portion of it (for testing code).
In [ ]:
small_dataset = True  # If True, only a small proportion of the data is used for a quick check

if small_dataset:
    N_TRAIN = 10000
    N_VAL = 5000
    MAX_EPOCH = 100
else:
    N_TRAIN = 1050000
    N_VAL = 500000
    MAX_EPOCH = 10000

BATCH_SIZE = 500
BUFFER_SIZE = int(10000)

train_ds = HIGGS_dataset.take(N_TRAIN).cache()
validate_ds = HIGGS_dataset.skip(N_TRAIN).take(N_VAL).cache()

# Create batches
validate_ds = validate_ds.batch(BATCH_SIZE)
train_ds = train_ds.shuffle(BUFFER_SIZE).repeat().batch(BATCH_SIZE)
Many models train better if you gradually reduce the learning rate during training. Use optimizers.schedules to reduce the learning rate over time. Here we are going to decay the learning rate inversely proportional to the iteration number. You can explore the other learning rate schedules available in TensorFlow in the TensorFlow Documentation.
In [ ]:
STEPS_PER_EPOCH = N_TRAIN//BATCH_SIZE

lr_schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
    0.001,
    decay_steps=STEPS_PER_EPOCH*1000,
    decay_rate=1,
    staircase=False)
Next, set up some callback functions to be used during training. The TensorBoard callback is something we have already used; it will enable us to monitor the loss curves while training. The EarlyStopping callback is included to avoid long and unnecessary training times. Note that this callback is set to monitor val_binary_crossentropy, not val_loss. The parameter patience is the number of epochs with no improvement after which training will be stopped.
In [ ]:
def get_callbacks(name):
    return [
        # Stop training when the validation cross-entropy has not improved for 20 epochs
        tf.keras.callbacks.EarlyStopping(monitor='val_binary_crossentropy', patience=20),
        # Write logs that can be visualised in TensorBoard
        tf.keras.callbacks.TensorBoard(logdir/name),
    ]
Next we can create a function to compile and train a given model. This function will be called whenever we want a model to be trained with a different configuration, which will improve code reuse in the notebook.
In [ ]:
def compile_and_fit(model, name, optimizer=None, max_epochs=MAX_EPOCH):
    # Default to the Adam optimiser with the learning rate schedule defined above
    if optimizer is None:
        optimizer = tf.keras.optimizers.Adam(lr_schedule)

    model.compile(optimizer=optimizer,
                  loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                  metrics=[
                      tf.keras.losses.BinaryCrossentropy(
                          from_logits=True, name='binary_crossentropy'),
                      'accuracy'])
    model.summary()

    history = model.fit(
        train_ds,
        steps_per_epoch=STEPS_PER_EPOCH,
        epochs=max_epochs,
        validation_data=validate_ds,
        callbacks=get_callbacks(name),
        verbose=0)
    return history
Simple Model
Let's start with a very simple model. This model will have a single hidden layer with 16 units, each with a sigmoid activation.
In [ ]:
FEATURES = 28

tiny_model_sigmoid = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='sigmoid', input_shape=(FEATURES,)),
    tf.keras.layers.Dense(1)
])
In [ ]:
m_histories = {}
In [ ]:
m_histories['Tiny_sigmoid'] = compile_and_fit(tiny_model_sigmoid, 'models/Tiny_sigmoid')
In [ ]:
plotter(m_histories, ylim=[0.5, 0.8])
**TODO:** Change the activation type to `relu`.
In [ ]:
tiny_model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(FEATURES,)),
    tf.keras.layers.Dense(1)
])

m_histories['Tiny_relu'] = compile_and_fit(tiny_model, 'models/Tiny_relu')
In [ ]:
plotter(m_histories, ylim=[0.5, 0.8])
**TODO:** Change the activation type to `elu`. Also try leaky ReLU to see if there is any difference (a sketch is given after the plot below).
In [ ]:
tiny_model_elu = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='elu', input_shape=(FEATURES,)),
    tf.keras.layers.Dense(1)
])

m_histories['Tiny_elu'] = compile_and_fit(tiny_model_elu, 'models/Tiny_elu')
In [ ]:
plotter(m_histories, ylim=[0.5, 0.8])
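Leaky ReLU is not used as a plain activation string here; a common approach is to add it as its own layer after a linear Dense layer. The sketch below does this for the tiny model (the slope alpha=0.2 is an arbitrary choice, not a value given in the lab):
In [ ]:
# Leaky ReLU variant of the tiny model (sketch): the Dense layer outputs linear
# activations and a LeakyReLU layer is applied on top of it.
tiny_model_lrelu = tf.keras.Sequential([
    tf.keras.layers.Dense(16, input_shape=(FEATURES,)),
    tf.keras.layers.LeakyReLU(alpha=0.2),
    tf.keras.layers.Dense(1)
])

m_histories['Tiny_leaky_relu'] = compile_and_fit(tiny_model_lrelu, 'models/Tiny_leaky_relu')
plotter(m_histories, ylim=[0.5, 0.8])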
Did you observe any overfitting or underfitting? What can be done to improve the model?
This is apparent if you plot and compare the validation metrics to the training metrics.
It’s normal for there to be a small difference.
If both metrics are moving in the same direction and the gap is small, everything is fine.
If the validation metric begins to stagnate while the training metric continues to improve, you are probably close to overfitting.
If the validation metric is going in the wrong direction, the model is clearly overfitting.
Let's use the elu model as the baseline model.
In [ ]:
r_histories = {}
r_histories['Tiny_elu'] = m_histories['Tiny_elu']
Complex model
Next, let's try a very large model with much more capacity, hopefully more than is required for the task.
In [ ]:
large_model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation='elu', input_shape=(FEATURES,)),
    tf.keras.layers.Dense(512, activation='elu'),
    tf.keras.layers.Dense(512, activation='elu'),
    tf.keras.layers.Dense(512, activation='elu'),
    tf.keras.layers.Dense(1)
])

r_histories['Large_elu'] = compile_and_fit(large_model, 'models/Large_elu')
In [ ]:
plotter(r_histories, ylim=[0.5, 0.8])
Regularization
As we see overfitting in the complex model, let's try to use some regularization to reduce the capacity to a more optimal level.
Constrain weights
A common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights to take only small values, which makes the distribution of weight values more “regular”. This is called “weight regularization”, and it is done by adding to the loss function of the network a cost associated with having large weights. This cost comes in two flavors:
L1 regularization, where the cost added is proportional to the absolute value of the weight coefficients (i.e. to what is called the “L1 norm” of the weights).
L2 regularization, where the cost added is proportional to the square of the value of the weight coefficients (i.e. to what is called the squared “L2 norm” of the weights). L2 regularization is also called weight decay in the context of neural networks. Don’t let the different name confuse you: weight decay is mathematically the exact same as L2 regularization.
L1 regularization pushes weights towards exactly zero, encouraging a sparse model. L2 regularization penalizes the weight parameters without making them sparse, since the penalty goes to zero for small weights; this is one reason why L2 is more common.
In tf.keras, weight regularization is added by passing weight regularizer instances to layers as keyword arguments. For example, L2 regularization can be added to the first layer of the above model using:
tf.keras.layers.Dense(512, activation='elu',
                      kernel_regularizer=tf.keras.regularizers.l2(0.001),
                      input_shape=(FEATURES,))
**TODO:** Add L2 regularization (lambda = 0.001) to the above large model.
Try L1 regularisation yourself later (a sketch is given after the L2 results below).
In [ ]:
l2_model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation='elu',
                          kernel_regularizer=tf.keras.regularizers.l2(0.001),
                          input_shape=(FEATURES,)),
    tf.keras.layers.Dense(512, activation='elu',
                          kernel_regularizer=tf.keras.regularizers.l2(0.001)),
    tf.keras.layers.Dense(512, activation='elu',
                          kernel_regularizer=tf.keras.regularizers.l2(0.001)),
    tf.keras.layers.Dense(512, activation='elu',
                          kernel_regularizer=tf.keras.regularizers.l2(0.001)),
    tf.keras.layers.Dense(1)
])

r_histories['l2'] = compile_and_fit(l2_model, "models/regularizers_l2")
In [ ]:
plotter(r_histories, ylim=[0.5, 0.8])
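As suggested above, an L1-regularized variant can be built in the same way by swapping the regularizer. Below is a minimal sketch; the lambda value of 0.001 is simply carried over from the L2 example rather than tuned:
In [ ]:
# L1 regularization (sketch): identical architecture, but with an L1 penalty on the weights.
l1_model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation='elu',
                          kernel_regularizer=tf.keras.regularizers.l1(0.001),
                          input_shape=(FEATURES,)),
    tf.keras.layers.Dense(512, activation='elu',
                          kernel_regularizer=tf.keras.regularizers.l1(0.001)),
    tf.keras.layers.Dense(512, activation='elu',
                          kernel_regularizer=tf.keras.regularizers.l1(0.001)),
    tf.keras.layers.Dense(512, activation='elu',
                          kernel_regularizer=tf.keras.regularizers.l1(0.001)),
    tf.keras.layers.Dense(1)
])

r_histories['l1'] = compile_and_fit(l1_model, "models/regularizers_l1")
plotter(r_histories, ylim=[0.5, 0.8])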
Add dropout
Dropout is one of the most effective and most commonly used regularization techniques for neural networks, developed by Hinton and his students at the University of Toronto.
The intuitive explanation for dropout is that because individual nodes in the network cannot rely on the output of the others, each node must output features that are useful on their own.
Dropout, applied to a layer, consists of randomly “dropping out” (i.e. setting to zero) a number of output features of the layer during training. Let’s say a given layer would normally have returned a vector [0.2, 0.5, 1.3, 0.8, 1.1] for a given input sample during training; after applying dropout, this vector will have a few zero entries distributed at random, e.g. [0, 0.5, 1.3, 0, 1.1].
The “dropout rate” is the fraction of the features that are being zeroed-out; it is usually set between 0.2 and 0.5. At test time, no units are dropped out, and instead the layer’s output values are scaled down by a factor equal to the dropout rate, so as to balance for the fact that more units are active than at training time.
In tf.keras you can introduce dropout in a network via the Dropout layer, which gets applied to the output of the layer right before it.
Let's add Dropout layers to our network to see how well they do at reducing overfitting.
Dropout can be added as a layer:
tf.keras.layers.Dropout(0.5)
**TODO:** Add Dropout regularization to the above large model (without L2 regularization).
In [ ]:
dropout_model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation='elu', input_shape=(FEATURES,)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(512, activation='elu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(512, activation='elu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(512, activation='elu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1)
])

r_histories['dropout'] = compile_and_fit(dropout_model, "models/regularizers_dropout")
In [ ]:
plotter(r_histories, ylim=[0.5, 0.8])
Combined L2 + dropout
In [ ]:
combined_model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, kernel_regularizer=tf.keras.regularizers.l2(0.0001),
                          activation='elu', input_shape=(FEATURES,)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(512, kernel_regularizer=tf.keras.regularizers.l2(0.0001),
                          activation='elu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(512, kernel_regularizer=tf.keras.regularizers.l2(0.0001),
                          activation='elu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(512, kernel_regularizer=tf.keras.regularizers.l2(0.0001),
                          activation='elu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1)
])

r_histories['combined'] = compile_and_fit(combined_model, "models/regularizers_combined")
In [ ]:
plotter(r_histories, ylim=[0.5, 0.8])
Optimisation Algorithm
So far we have been using the Adam optimiser with an inverse-time-decay learning rate schedule. Try the following optimisers: SGD and RMSprop.
In [ ]:
o_histories = {}
o_histories['Adam'] = r_histories['combined']
Since we are going to use the same model architecture defined above for this step, we need to run clear_session(). Otherwise, training would start from the weights learned in the last step.
In [ ]:
tf.keras.backend.clear_session()

combined_model_SGD = tf.keras.Sequential([
    tf.keras.layers.Dense(512, kernel_regularizer=tf.keras.regularizers.l2(0.0001),
                          activation='elu', input_shape=(FEATURES,)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(512, kernel_regularizer=tf.keras.regularizers.l2(0.0001),
                          activation='elu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(512, kernel_regularizer=tf.keras.regularizers.l2(0.0001),
                          activation='elu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(512, kernel_regularizer=tf.keras.regularizers.l2(0.0001),
                          activation='elu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1)
])

opt = tf.keras.optimizers.SGD(lr_schedule)
o_histories['SGD'] = compile_and_fit(combined_model_SGD, "models/Optimisation_SGD", optimizer=opt)
In [ ]:
plotter(o_histories, ylim=[0.5, 0.8])
**TODO:** Change the optimization to RMSProp.
In [ ]:
tf.keras.backend.clear_session()

combined_model_RMSprop = tf.keras.Sequential([
    tf.keras.layers.Dense(512, kernel_regularizer=tf.keras.regularizers.l2(0.0001),
                          activation='elu', input_shape=(FEATURES,)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(512, kernel_regularizer=tf.keras.regularizers.l2(0.0001),
                          activation='elu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(512, kernel_regularizer=tf.keras.regularizers.l2(0.0001),
                          activation='elu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(512, kernel_regularizer=tf.keras.regularizers.l2(0.0001),
                          activation='elu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1)
])

opt = tf.keras.optimizers.RMSprop(lr_schedule)
o_histories['RMSprop'] = compile_and_fit(combined_model_RMSprop, "models/Optimisation_RMSprop", optimizer=opt)
In [ ]:
plotter(o_histories, ylim=[0.5, 0.8])
Change the learning rate schedule and the initial learning rate and observe how the training behaviour changes.
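For example, an exponential decay schedule could be tried in place of the inverse-time decay. The sketch below is only a starting point; the initial rate of 0.001 and the decay settings are arbitrary choices, not values given in the lab:
In [ ]:
# An alternative learning rate schedule (sketch): exponential decay instead of inverse-time decay.
lr_schedule_exp = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.001,
    decay_steps=STEPS_PER_EPOCH*100,  # decay every 100 epochs' worth of steps
    decay_rate=0.5,                   # halve the learning rate at each decay boundary
    staircase=True)

# It can then be passed to any optimiser and used through compile_and_fit's optimizer argument, e.g.
# opt = tf.keras.optimizers.Adam(lr_schedule_exp)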
Exercises
Can you get the model to perform well with only the first 21 features (columns 2-22), without the last seven features (the high-level features derived by physicists to help discriminate between the two classes)? A starting point for this and the next exercise is sketched after the list.
Is accuracy a good measure for this dataset?
Check the network proposed in the original paper and see if you can improve it. The network configuration and hyperparameters are provided in its Supplementary Information.
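As a starting point for the first two exercises, the sketch below shows how the 21 low-level feature columns could be selected, and how roc_auc_score (already imported above) could be used alongside accuracy. The names dataX_low and low_level_ds and the use of combined_model_RMSprop (the last model trained above) are assumptions for illustration only:
In [ ]:
# Exercise starting points (sketch).
# 1) Keep only the 21 low-level features (dataframe columns 1-21; column 0 is the label).
dataX_low = data.loc[:, 1:21]
low_level_ds = tf.data.Dataset.from_tensor_slices((dataX_low.values, targets.values))

# 2) ROC AUC is often more informative than accuracy for this kind of problem.
#    Score the validation set with one of the models trained above.
val_logits = combined_model_RMSprop.predict(validate_ds).ravel()
val_labels = np.concatenate([y.numpy() for _, y in validate_ds])
print('Validation ROC AUC:', roc_auc_score(val_labels, val_logits))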