Activity 5.1 Neural Network¶
In this activity we build a simple three-layer neural network for a classification problem and study the effect of the hidden-layer size on model selection.
Linearly Separable Classes¶
Synthetic Data Generation¶
Here we follow an approach similar to the one we used to produce the datasets for the classification problems in Module 3.
In [8]:
# Data Generation
## Libraries:
library(mvtnorm) # generates multivariate Gaussian samples and calculates their densities
library(ggplot2)
library(reshape2)
## Setting parameters
set.seed(12345) # set random seed
N <- 100 # total number of samples
D <- 2 # number of dimensions
## Initialization
c0 <- '0'; c1 <- '1' # class labels
mu0 <- c(1.0, 4.0); p0 <- 0.60
mu1 <- c(4.5, 0.5); p1 <- 1 - p0
sigma <- matrix(c(1, 0, 0, 1), nrow=2, ncol=2, byrow = TRUE) # shared covariance matrix
sigma0 <- sigma; sigma1 <- sigma
data <- data.frame(x1=double(), x2=double(), label=double()) # empty data.frame
## Generate class labels
data[1:N,'label'] <- sample(c(c0,c1), N, replace = TRUE, prob = c(p0, p1))
## calculate the size of each class
N0 <- sum(data[1:N,'label']==c0); N1 <- N - N0
## Sample from the Gaussian distributions according to the class labels and statistics.
data[data$label==c0, c('x1', 'x2')] <- rmvnorm(n = N0, mu0, sigma0)
data[data$label==c1, c('x1', 'x2')] <- rmvnorm(n = N1, mu1, sigma1)
## Split data to train and test datasets
train.len <- round(N/2)
train.index <- sample(N, train.len, replace = FALSE)
train.data <- data[train.index, c('x1', 'x2')]; test.data <- data[-train.index, c('x1', 'x2')]
train.label <- data[train.index, 'label']; test.label <- data[-train.index, 'label']
## Take a look at the data set
ggplot(data=data[1:N,], aes(x=x1, y=x2, color=label, label=ifelse(label==c0, '0', '1'))) +
geom_point(x=mu0[1], y=mu0[2], size=4, color = 'black') +
geom_point(x=mu1[1], y=mu1[2], size=4, color = 'black') +
geom_text(size = 5, alpha=0.5) + ggtitle ('Data set') + theme_minimal()
Neural Network¶
In this example we implement a simple 3-layer (i.e., the input layer, one hidden layer and the output layer) network with sigmoid as the activation function.
Auxiliary Functions¶
We need to define some auxiliary functions. In particular, we implement the activation function h and its derivative h.d. We then define the output probability function, which calculates for each point the probability of being a member of class c1 (in the output layer). Finally, we implement the prediction and accuracy functions, which calculate the class labels and the accuracy percentage, respectively.
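For reference, writing the sigmoid with a slope parameter $s$ (the implementation below uses $s = 3$), the activation function and its derivative are

$$h(z) = \frac{1}{1 + e^{-s z}}, \qquad h'(z) = s\,h(z)\bigl(1 - h(z)\bigr).$$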
Note 1: Since the output of our neural network is the probability of class c1, we classify points with a probability larger than 0.5 as members of c1. All other points are labeled as c0.
Note 2: The structure list that is defined at the end of the following piece of code helps us define functions that return multiple variables at the same time. We will use this structure in the feedforward and backpropagation functions.
In [9]:
# auxiliary functions
## the activation function (a sigmoid with slope factor 3)
h <- function(z) {
return (1/(1+exp(-3*z)))
}
## the derivative of the activation function
## (the factor 3 comes from the chain rule applied to the slope factor above)
h.d <- function(z) {
return (3*h(z)*(1-h(z)))
}
## Class Probabilities
probability <- function(X, W1, W2, b1, b2){
a2 <- h(sweep(W1 %*% X, 1, b1,'+' ))
a3 <- h(sweep(W2 %*% a2, 1, b2,'+' ))
return (a3)
}
## prediction
prediction <- function(X, W1, W2, b1, b2, threshold=0.5){
return (ifelse(probability(X, W1, W2, b1, b2)>=threshold, 1, 0))
}
## Accuracy
accuracy <- function(Y, T){
return (sum(Y==T)/length(T)*100)
}
## The following structure helps us to have functions with multiple outputs
### credit: https://stat.ethz.ch/pipermail/r-help/2004-June/053343.html
list <- structure(NA,class="result")
"[<-.result" <- function(x,...,value) {
args <- as.list(match.call())
args <- args[-c(1:2,length(args))]
length(value) <- length(args)
for(i in seq(along=args)) {
a <- args[[i]]
if(!missing(a)) eval.parent(substitute(a <- v,list(a=a,v=value[[i]])))
}
x
}
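As a small usage sketch of this structure (with a hypothetical helper min.max), two return values can be unpacked in a single assignment, just as the training loop below does with the outputs of feedforward and backpropagation:

## hypothetical helper that returns two values at once
min.max <- function(v) list(min(v), max(v))
list[lo, hi] <- min.max(c(3, 1, 4, 1, 5)) # assigns lo <- 1 and hi <- 5
print(c(lo, hi))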
Name Conventions¶
To make the code more readable, let's adopt some naming conventions. In particular, we choose variable names that are close to the mathematical terms in Chapter 2. From now on, N refers to the number of training samples. X1 and T1 refer to the training data and labels, while X2 and T2 refer to the test data and labels, respectively.
Wn and bn refer to the weight matrix and bias term of layer n (here n is either 1 or 2), and .d stands for delta. Finally, zn and an denote the weighted input sum of layer n and its activation value, respectively (here n is in [1:3]).
The following code does nothing but some data conversions.
In [10]:
# Some conversions:
## rename just for convenience
N <- train.len
## convert data and labels to matrices
X1 <- t(unname(data.matrix(train.data)))
T1 <- t(data.matrix(as.numeric(train.label)))
X2 <- t(unname(data.matrix(test.data)))
T2 <- t(data.matrix(as.numeric(test.label)))
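As a quick sanity check, the data matrices should be D x N and the label matrices 1 x N; with the 50/50 split above both halves contain 50 points:

## quick shape check: data matrices are D x N, label matrices are 1 x N
dim(X1); dim(T1) # 2 50 and 1 50
dim(X2); dim(T2) # 2 50 and 1 50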
Feedforward Function¶
The following function performs one feedforward pass through the 3-layer network. We will call this function in a loop to train our network.
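In the notation above, a single forward pass for one input $x$ computes

$$a_1 = x,\qquad z_2 = W_1 a_1 + b_1,\qquad a_2 = h(z_2),\qquad z_3 = W_2 a_2 + b_2,\qquad a_3 = h(z_3).$$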
In [11]:
feedforward <- function(Xi, Ti, W1, b1, W2, b2){
### 1st (input) layer
a1 <- Xi
y <- Ti
### 2nd (hidden) layer
z2 <- W1 %*% a1 + b1
a2 <- h(z2)
### 3rd (output) layer
z3 <- W2 %*% a2 + b2
a3 <- h(z3)
return(list(a1, a2, a3, y, z2, z3))
}
Backpropagation Function¶
The following function performs one backpropagation pass for the 3-layer feedforward network. We will call this function after each feedforward step.
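For reference, the deltas below are what the chain rule gives for a squared-error loss $E = \tfrac{1}{2}(t - a_3)^2$, which is the loss implied by the $-(t - a_3)$ factor in the code:

$$\delta_3 = -(t - a_3)\,h'(z_3), \qquad \delta_2 = W_2^{\top}\,\delta_3 \odot h'(z_2).$$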
In [12]:
backpropagation <- function(Ti, W2, z2, z3, a3){
### 3rd (output) layer
d3 <- -(Ti-a3) * h.d(z3)
### 2nd (hidden) layer
d2 <- t(W2)%*%d3 * h.d (z2)
return(list(d2,d3))
}
Initialization¶
Like any other algorithm, NN starts with some parameter settings and initializations. Here, K indicates the number of units in the hidden layer.
In [13]:
# Setting parameters
K <- 3 # number of units in the hidden layer
epoch.max <- 1000 # maximum number of iterations
eta <- 0.1 # learning rate
lambda <- 0.0001 # regularization term
# initialization
epoch <- 1 # epoch (iteration) counter
terminate <- FALSE # termination criteria
## weight vectors/matrices initialization
### w stands for weight and b for bias
### the numbers after the letters indicate the layer number
W1 <- matrix(0.01*rnorm(D*K, sd=0.5), nrow=K, ncol=D)
b1 <- matrix(0*rnorm(1*K), nrow=K, ncol=1)
W2 <- matrix(0.01*rnorm(K*1, sd=0.5), nrow=1, ncol=K)
b2 <- matrix(0*rnorm(1*1), nrow=1, ncol=1)
## tracing accuracy of the model
train.accuracy <- matrix(0,nrow=epoch.max, ncol=1)
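As a quick sanity check, the helpers defined earlier give the accuracy of the freshly initialized (untrained) network; since the initial weights are close to zero, the output is nearly constant and the accuracy roughly reflects the class proportions:

## accuracy before any training (baseline for the learning curve below)
accuracy(prediction(X1, W1, W2, b1, b2), T1)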
Main Loop¶
Now it's time to iteratively run the feed-forward and back-propagation passes.
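In each epoch, the inner loop accumulates the per-sample gradients, and the weights are then updated with one batch gradient-descent step that includes an $L_2$ penalty on the weights (the biases are not regularized):

$$W_1 \leftarrow W_1 - \eta\Bigl(\tfrac{1}{N}\textstyle\sum_i \delta_2^{(i)}\,{a_1^{(i)}}^{\top} + \lambda W_1\Bigr),\qquad b_1 \leftarrow b_1 - \eta\,\tfrac{1}{N}\textstyle\sum_i \delta_2^{(i)},$$
$$W_2 \leftarrow W_2 - \eta\Bigl(\tfrac{1}{N}\textstyle\sum_i \delta_3^{(i)}\,{a_2^{(i)}}^{\top} + \lambda W_2\Bigr),\qquad b_2 \leftarrow b_2 - \eta\,\tfrac{1}{N}\textstyle\sum_i \delta_3^{(i)}.$$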
In [14]:
# main loop
while (!terminate){
## delta vectors/matrices initialization (for batch backpropagation)
### .d stands for delta
W1.d <- W1 *0
b1.d <- b1 *0
W2.d <- W2 *0
b2.d <- b2 *0
## inner loop for each train sample
for (i in 1:N){
## Feedforward:
list[a1, a2, a3, y, z2, z3] <- feedforward(X1[,i], T1[i], W1, b1, W2, b2)
## Backpropagation:
list[d2, d3] <- backpropagation(T1[i], W2, z2, z3, a3)
## calculate the delta values
### 1st layer
W1.d <- W1.d + d2 %*% t(a1)
b1.d <- b1.d + d2
### 2nd layer
W2.d <- W2.d + d3 %*% t(a2)
b2.d <- b2.d + d3
}
## update weight vectors and matrices
### 1st (input) layer
W1 <- W1 - eta * (W1.d/N + lambda*W1)
b1 <- b1 - eta * (b1.d/N)
### 2nd (hidden) layer
W2 <- W2 - eta * (W2.d/N + lambda*W2)
b2 <- b2 - eta * (b2.d/N)
## trace train accuracy
train.accuracy[epoch]<- accuracy(prediction(X1, W1, W2, b1, b2), T1)
## increase the iteration counter
epoch <- epoch + 1
## check the termination criteria
if (epoch > epoch.max) {terminate <- TRUE}
}
print('Done!')
[1] "Done!"
Evaluation¶
Let's visualize the classification accuracy trend as the training goes on:
In [15]:
## Train Accuracy
plot(train.accuracy, main='Train Accuracy', xlab ='epoch', ylab='Accuracy(%)')
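Since the test set was never used during training, the same helpers also give an estimate of the generalization accuracy:

## accuracy of the trained network on the held-out test set
accuracy(prediction(X2, W1, W2, b1, b2), T2)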
Linearly Non-separable Classes¶
Synthetic Data Generation¶
The dataset we generated for the previous experiments is rather easily separable by a linear model. One can modify the data-generation code and repeat the experiments on more difficult samples (e.g., closer $\mu$, non-shared or non-identity $\Sigma$, or very imbalanced class sizes). Another way to challenge our NN is to apply it to linearly non-separable classes, where at least one of the classes is multi-modal (like two or more separate islands). The code below does the job.
In [16]:
## Initialization
set.seed(12345)
N <- 500
c0 <- '0'; c1 <- '1'; c2 <- '2' # class labels
mu0 <- c(1.0, 4.0); p0 <- 0.30
mu1 <- c(4.5, 0.5); p1 <- 0.50
mu2 <- c(3.0, -3.0); p2 <- 1 - p0 - p1
sigma <- matrix(c(1, 0, 0, 1), nrow=2, ncol=2, byrow = TRUE) # shared covariance matrix
sigma0 <- sigma; sigma1 <- sigma; sigma2 <- sigma
data <- data.frame(x1=double(), x2=double(), label=double()) # empty data.frame
## Generate class labels
data[1:N,'label'] <- sample(c(c0,c1,c2), N, replace = TRUE, prob = c(p0, p1, p2))
## calculate the size of each class
N0 <- sum(data[1:N,'label']==c0);
N1 <- sum(data[1:N,'label']==c1); N2 <- N - N0 - N1
## Sample from the Gaussian distributions according to the class labels and statistics.
data[data[1:N,'label']==c0, c('x1', 'x2')] <- rmvnorm(n = N0, mu0, sigma0)
data[data[1:N,'label']==c1, c('x1', 'x2')] <- rmvnorm(n = N1, mu1, sigma1)
data[data[1:N,'label']==c2, c('x1', 'x2')] <- rmvnorm(n = N2, mu2, sigma2)
data[data[1:N,'label']==c2, 'label'] <- c0
## Take a look at the data set
ggplot(data=data, aes(x=x1, y=x2, color=label, label=ifelse(label==c0, '0', '1'))) +
geom_text(size = 5, alpha=0.5) +
ggtitle ('Data set') +
theme_minimal()
N <- nrow(data)
train.len <- round(N/2)
train.index <- sample(N, train.len, replace = FALSE)
train.data <- data[train.index, c('x1', 'x2')]
test.data <- data[-train.index, c('x1', 'x2')]
train.label <- data[train.index, 'label']
test.label <- data[-train.index, 'label']
# Some conversions:
## rename just for convenience
N <- train.len
## convert data and labels to matrices
X1 <- t(unname(data.matrix(train.data)))
T1 <- t(data.matrix(as.numeric(train.label)))
X2 <- t(unname(data.matrix(test.data)))
T2 <- t(data.matrix(as.numeric(test.label)))
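As a quick check that the relabeling worked, the label vectors should now contain only '0' and '1', with class 0 being the (bimodal) merger of the two islands:

## class counts after folding class '2' into class '0'
table(train.label)
table(test.label)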
Now, we can repeat the above experiments on this linearly non-separable dataset. However, let's first put the pieces of the above code together to define a function that trains a 3-layer NN with K units in the hidden layer and returns the training and testing accuracies.
In [17]:
NN <- function(K, N, X1, T1, X2, T2, tracing=FALSE, epoch.max=500){
# Setting parameters
D <- 2
eta <- 0.1 # learning rate
lambda <- 0.0001 # regularization term
# initialization
if (tracing==TRUE) {train.accuracy <- matrix(0,nrow=epoch.max, ncol=1); test.accuracy <- train.accuracy}
epoch <- 1 # epoch (iteration) counter
terminate <- FALSE # termination criteria
W1 <- matrix(0.01*rnorm(D*K, sd=0.5), nrow=K, ncol=D)
b1 <- matrix(0*rnorm(1*K), nrow=K, ncol=1)
W2 <- matrix(0.01*rnorm(K*1, sd=0.5), nrow=1, ncol=K)
b2 <- matrix(0*rnorm(1*1), nrow=1, ncol=1)
# main loop
while (!terminate){
## delta vectors/matrices initialization
W1.d <- W1 *0
b1.d <- b1 *0
W2.d <- W2 *0
b2.d <- b2 *0
## inner loop for each train sample
for (i in 1:N){
## Feedforward:
list[a1, a2, a3, y, z2, z3] <- feedforward(X1[,i], T1[i], W1, b1, W2, b2)
## Backpropagation:
list[d2, d3] <- backpropagation(T1[i], W2, z2, z3, a3)
## calculate the delta values
### 1st layer
W1.d <- W1.d + d2 %*% t(a1)
b1.d <- b1.d + d2
### 2nd layer
W2.d <- W2.d + d3 %*% t(a2)
b2.d <- b2.d + d3
}
## update weight vectors and matrices
W1 <- W1 - eta * (W1.d/N + lambda*W1)
b1 <- b1 - eta * (b1.d/N)
W2 <- W2 - eta * (W2.d/N + lambda*W2)
b2 <- b2 - eta * (b2.d/N)
## record the errors
if (tracing==TRUE){
train.accuracy[epoch]<- accuracy(prediction(X1, W1, W2, b1, b2), T1)
test.accuracy[epoch]<- accuracy(prediction(X2, W1, W2, b1, b2), T2)
}
## increase the iteration counter
epoch <- epoch + 1
## check the termination criteria
if (epoch > epoch.max) {terminate <- TRUE}
}
if (tracing==FALSE){
train.accuracy <- accuracy(prediction(X1, W1, W2, b1, b2), T1)
test.accuracy <- accuracy(prediction(X2, W1, W2, b1, b2), T2)
}
return(cbind(train.accuracy,test.accuracy))
}
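With tracing = FALSE (the default), NN returns only the final accuracies as a 1 x 2 matrix, which is exactly what the model-selection loop below stores; with tracing = TRUE it returns the full per-epoch traces. A minimal usage sketch:

## one run with the default settings: returns the final train and test accuracy
NN(3, N, X1, T1, X2, T2)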
Now we train our network and plot the training accuracy.
In [18]:
results <- NN(3, N, X1, T1, X2, T2, tracing = TRUE, epoch.max = 1000)
plot(results[,1], main='Train Accuracy', xlab ='epoch', ylab='Accuracy(%)')
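The second column of the returned matrix holds the per-epoch test accuracy, so we can plot it in the same way:

plot(results[,2], main='Test Accuracy', xlab ='epoch', ylab='Accuracy(%)')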
Model Selection¶
In Modules 2 and 3 we learned that models may have different levels of generalization depending on their complexity. We also know that the number of units in the hidden layer has a direct effect on the model complexity. Therefore, we study the effect of the hidden-layer size on our neural network's performance. To perform such a study, we repeat the above experiment several times (each time with a different subset of the training samples) for different values of $K$. Then we draw boxplots to show the results.
In [19]:
K <- c(1, 5, 10, 50, 100) # hidden-layer sizes to compare
T <- 1:5 # repetition indices (note: this masks the usual shorthand T for TRUE)
results <- data.frame(k=rep(K, times = length(T), each = 1), t=rep(T, times = 1, each = length(K)),
train=rep(0, length(K) * length(T)), test=rep(0, length(K) * length(T)))
for (k in K){
for (t in T){
indx <- sample(1:N, round(N/2), replace = TRUE)
results[results$k==k & results$t==t, c('train', 'test')] <- NN(k, round(N/2), X1[,indx], T1[indx], X2, T2)
}
}
ggplot(data=results, aes(factor(k), train)) + geom_boxplot() + ggtitle('Train Accuracy') + theme_minimal()
ggplot(data=results, aes(factor(k), test)) + geom_boxplot() + ggtitle('Test Accuracy') + theme_minimal()
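To complement the boxplots, one optional way to summarize the same results is to average the accuracies over the repetitions for each value of $K$:

## mean train/test accuracy per hidden-layer size
aggregate(cbind(train, test) ~ k, data = results, FUN = mean)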
Discussions¶