MATH 208 Assignment 3
The assignment contains one question with 5 parts (a)-(e), each worth 10 points, for a total of 50 points. Your
answers must be submitted as a PDF that includes your answers to the question along with the R code and
output used to generate them.
The background
Neural networks have become a popular tool for analyzing datasets where the goal is to develop a complex
prediction model which takes a set of input features and tries to predict the result of an outcome (or target)
variable. These models work well in situations where the relationship between the input features and the
outcome variable is highly nonlinear.
The basic structure of a neural network is as follows:
We assume a set of input features.
We choose a number of hidden (unobserved) layers.
For each layer, we choose a number of nodes.
The inputs, layers, and outcome are connected by edges which have weights which need to be
estimated.
Every node also has its own bias node that is used to help adjust the linear combinations to improve
prediction (similar to an intercept term in linear regression).
Each node in each hidden layer contains a linear combination of the values of the previous nodes, which then
gets passed through the network to produce a final prediction.
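To make the structure above concrete, here is a minimal base R sketch of how one observation passes through a network with two inputs and two hidden layers of two nodes each. The function names (logistic, forward_pass) and all weight values are hypothetical and chosen only for illustration; they are not fitted to any data. The sketch assumes the logistic function is applied at every layer, including the output, which corresponds to using linear.output=FALSE in neuralnet.

```r
# Logistic (sigmoid) activation: maps any real number into (0, 1)
logistic <- function(z) 1 / (1 + exp(-z))

# Forward pass for one observation.
# Each weight matrix has one row per node in the next layer; the first column
# holds the bias weights and the remaining columns hold the edge weights.
forward_pass <- function(x, weights) {
  a <- x
  for (W in weights) {
    a <- logistic(as.vector(W %*% c(1, a)))  # prepend 1 for the bias term
  }
  a
}

# Hypothetical weights, chosen only for illustration (not fitted to any data)
toy_weights <- list(
  matrix(c( 0.1,  0.5, -0.3,
           -0.2,  0.4,  0.8), nrow = 2, byrow = TRUE),  # inputs  -> layer 1
  matrix(c( 0.0,  1.0, -1.0,
            0.3, -0.5,  0.5), nrow = 2, byrow = TRUE),  # layer 1 -> layer 2
  matrix(c(-0.1,  0.7,  0.2), nrow = 1)                 # layer 2 -> output
)

p <- forward_pass(c(0.5, -1.2), toy_weights)
p                          # a single value strictly between 0 and 1
sum(lengths(toy_weights))  # total number of weights and bias terms to estimate
```

Fitting the network amounts to choosing the entries of these matrices; in this layout there are 6 + 6 + 3 = 15 of them.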
Here is an example of such a network for the palmerpenguins data from the quizzes, where we try to
predict the sex of the penguin from bill length and body mass. Setting linear.output=FALSE and using the
logistic function (the act.fct argument) converts the output of the neural network into a value between 0
and 1, which can be interpreted as a probability. The length of the hidden argument vector is
the number of hidden layers. The value of each element of the hidden argument vector is the number of
hidden nodes in the respective layer, so hidden=c(a,b) means 2 hidden layers with a nodes in the first
layer and b nodes in the second layer.
library(tidyverse) ## needs to be installed; provides %>%, drop_na and mutate
library(palmerpenguins) ## needs to be installed
library(neuralnet) ## needs to be installed
penguins_example <- penguins %>%
  drop_na() %>% # Drop rows with missing values
  # Make Females=1, Males=0
  mutate(sex=ifelse(sex=="female",1,0))
We will use this dataset for the rest of the assignment.
Additionally, the two input features need to be scaled to have average value 0 and standard deviation 1 in
order to use the neuralnet function (from the neuralnet package) to fit the models without lots of extra
work.
penguins_with_bin_sex_no_na <-
  ## Select columns we need
  penguins_example[,c("sex",c("body_mass_g","bill_length_mm"))] %>%
  # Scale the columns: apply scale() to each column in vars and save under the same name
  mutate_at(~scale(.),.vars=vars(c("body_mass_g","bill_length_mm")))
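As a quick illustration of what scaling does, the following base R sketch applies scale to a small made-up data frame and checks that each column ends up with mean 0 and standard deviation 1. The data frame toy and its values are hypothetical, not taken from the penguins data.

```r
# Toy feature values (made up for illustration)
toy <- data.frame(body_mass_g    = c(3750, 3800, 3250, 4675, 5400),
                  bill_length_mm = c(39.1, 39.5, 40.3, 46.5, 50.0))

# scale() subtracts each column's mean and divides by its standard deviation;
# it returns a matrix, so we convert the result back to a data frame
toy_scaled <- as.data.frame(scale(toy))

colMeans(toy_scaled)    # numerically zero for each column
sapply(toy_scaled, sd)  # 1 for each column
```

The same check can be run on the scaled penguins columns to confirm the scaling step worked as intended.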
Now we can fit the neural network:
nn_penguins <- neuralnet(sex~body_mass_g+bill_length_mm,linear.output = FALSE,
                         act.fct="logistic",data=penguins_with_bin_sex_no_na,
                         hidden=c(2,2))
plot(nn_penguins)
Above, we can see the structure of the network when there are two hidden layers and two hidden nodes per
layer. The weights of the edges are in black; the blue edges and nodes are the bias terms. The weights and
bias terms are chosen to minimize the sum of squared errors (shown above), i.e.
$$\sum_{i=1}^{\#\text{ of units}} (y_i - \hat{y}_i)^2,$$
where $y_i$ is the binary sex value for the penguin in row $i$ of the data and $\hat{y}_i$ is the predicted
probability from the neural network that the penguin in row $i$ has $y_i = 1$. Other error functions are
possible in neuralnet, but the default is fine for this assignment. Note that a better measure is the average
error per observation, which we will compute later.
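The difference between the sum of squared errors and the average squared error per observation can be seen in a small base R sketch. The vectors y and y_hat below are made-up values used only for illustration.

```r
y     <- c(1, 0, 1, 1, 0)            # observed binary outcomes (made up)
y_hat <- c(0.9, 0.2, 0.6, 0.4, 0.1)  # predicted probabilities (made up)

sse     <- sum((y - y_hat)^2)   # sum of squared errors: 0.58
avg_err <- mean((y - y_hat)^2)  # average squared error per observation: 0.116
```

The sum grows with the number of rows, while the average does not, which is why the average is easier to compare between a training sample and a smaller test sample.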
In general, our neural network error tends to be under-estimated, because the weights are optimized on the
same data that is used to compute the error. So we often split our data into a training sample and a test
sample to evaluate the error independently from our estimated weights. I can do this randomly by doing:
# Note, I do this so you and I get the same results
# You do not have to set this for your assignment
set.seed(1101)
## The line below creates a vector with Training and Test labels
## for each row of our data, approx. 80% train, 20% test
split_labels <- sample(c("Training","Test"),prob=c(0.8,0.2), replace=TRUE,
                       size=nrow(penguins_with_bin_sex_no_na))
## How many of each kind?
table(split_labels)
split_labels
Test Training
62 271
# Create Training Data
Training_sample <- penguins_with_bin_sex_no_na %>%
  filter(split_labels=="Training")
nrow(Training_sample)
[1] 271
# Create Test Data
Test_sample <- penguins_with_bin_sex_no_na %>%
  filter(split_labels=="Test")
nrow(Test_sample)
[1] 62
Now I can fit the neural network to the training data and compute predictions and average squared error for
the training data.
## Run neural network on training data
train_penguins <- neuralnet(sex~.,linear.output = FALSE,
act.fct="logistic",data=Training_sample, hidden=c(2,2))
## Compute predictions and error for training using predict function on neuralnet object
train_penguins_predict <- predict(train_penguins,newdata=Training_sample)
Training_sample %>% mutate(train_error_sq=(sex-train_penguins_predict)^2) %>%
summarize(Avg_Error_train=mean(train_error_sq))
Avg_Error_train
0.171879
1 row
Now I compute the predictions and error for the test data.
## Compute predictions and error for test data using predict function on neuralnet object
test_penguins_predict <- predict(train_penguins,newdata=Test_sample)
## Compute test sums of squared error divided by number of test samples
Test_sample %>% mutate(test_error_sq=(sex-test_penguins_predict)^2) %>%
summarize(Avg_error_test =mean(test_error_sq))
Avg_error_test
0.1853412
1 row
The actual assignment
The goal of this assignment is to create a set of functions that will do ALL of the tasks in the code of the
previous section for ANY dataset that contains a single outcome column and a set of possible input features.
This is similar to how one might construct an R package to do all of the steps from before. TAKE A DEEP
BREATH. I will break it all down for you.
a. First, write a function that requires three arguments:
i. A data frame or tibble
ii. A length 1 character vector indicating the name of the outcome column in the dataset.
iii. A character vector of unspecified length containing the names of the input features to be selected
and scaled.
and returns a tibble containing only the outcome vector, which should be renamed outcome, and the
feature vectors, each of which has been scaled using the scale function.
You CAN assume that the outcome column is already binary (contains 0 and 1). You should NOT assume that
all of the columns in the original data frame or tibble will be used in the network, so you will need to choose
the right ones using the appropriate argument. Demonstrate that your function works by running it on the
penguins_example tibble from the Background section for the two features used in the Background,
body_mass_g and bill_length_mm .
b. Write a function to randomly split a data frame or tibble into Training and Test that requires two
arguments:
i. A data frame or tibble
ii. The fraction (between 0 and 1) of the total number of rows that should be in the Training data
and returns a list which has two elements, one that is the Training data and the other is the Test data.
Demonstrate that your function works by running it on the tibble that you generated in part (a) with
training fraction equal to 0.7.
c. Write a function that takes in the following arguments:
i. A data frame or tibble with a column named outcome and other columns that are all scaled
feature vectors
ii. A vector of integers that can be used as the hidden argument to the neuralnet function, i.e. the
numbers of nodes in each of the hidden layers of a neural network
to return a neuralnet object that is the result of running the neuralnet function on the data frame/tibble
with the hidden nodes specified from the second argument and the following other arguments:
linear.output = FALSE, act.fct="logistic", and using the outcome variable as the outcome in the
formula argument. Demonstrate that your function works by running it on the Training Data that you
generated in part (b).
d. Write a function that takes the following arguments:
i. A neuralnet object
ii. A data frame/tibble containing Training Data
iii. A data frame/tibble containing Test Data
and returns a vector containing the average training squared error and the average test squared error using
the neuralnet object, where average squared error is as defined in the background section. In other words,
your function should compute both the average squared error for the Training data and the average squared
error for the Test data, both using the neuralnet object to find the predictions. The vector returned by your
function should be named with the first element named “Training_Error” and the second element named
“Test_Error”. Demonstrate that your function works by running it on the Training and Test Data that you
generated in part (b) and the neuralnet object from part (c).
e. Write a function that takes the following arguments:
i. A data frame or tibble
ii. A length 1 character vector indicating the name of the outcome column in the dataset.
iii. A character vector of unspecified length containing the names of the input features to be selected
and scaled.
iv. The fraction (between 0 and 1) of the total number of rows in the data frame or tibble that should be
used in the training data.
and returns a tibble where each row contains the Average Training and Average Test squared error for fitting a
two-layer neural network at all possible combinations of numbers of hidden nodes at each layer (1 through 3).
Your returned tibble should look like this:
as_tibble(expand.grid(`First layer`=c(1,2,3),`Second layer`=c(1,2,3),
                      `Training Error`=NA, `Test error`=NA))

First layer  Second layer  Training Error  Test error
          1             1              NA          NA
          2             1              NA          NA
          3             1              NA          NA
          1             2              NA          NA
          2             2              NA          NA
          3             2              NA          NA
          1             3              NA          NA
          2             3              NA          NA
          3             3              NA          NA
9 rows

where the NA’s are replaced with the values for your runs. Hint: You can use the expand.grid function
above to create a data.frame/tibble whose rows you can iterate over, applying the functions from parts (a)
through (d) using the strategies that we have learned in class. Anything that works is acceptable (no need to
optimize the speed).
Demonstrate that your function works by running it on the penguins_example tibble from the
Background section for the two features used in the Background, body_mass_g and bill_length_mm .
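The hint in part (e) about iterating over the rows of an expand.grid result can be sketched in base R as follows. The function toy_errors is a hypothetical stand-in that returns made-up numbers rather than real model errors; in your own solution it would be replaced by calls to the functions from the earlier parts. Using mapply is only one of several possible iteration strategies.

```r
# Hypothetical stand-in for "fit a network with hidden = c(first, second) and
# return its two errors"; the values are made up, not real model errors.
toy_errors <- function(first, second) {
  c(Training_Error = 1 / (first + second),
    Test_Error     = 1 / (first * second + 1))
}

# All 9 combinations of 1 to 3 nodes in each of the two hidden layers
grid <- expand.grid(`First layer` = 1:3, `Second layer` = 1:3)

# Apply the function to each row of the grid; because toy_errors returns a
# named vector of length 2, mapply simplifies the results to a 2 x 9 matrix
errs <- mapply(toy_errors, grid$`First layer`, grid$`Second layer`)

grid$`Training Error` <- errs["Training_Error", ]
grid$`Test error`     <- errs["Test_Error", ]
grid
```

The result has one row per combination and the four columns shown in the template above.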