Module Code: MATH5301
Module Title: Learning from Data
School of Mathematics Semester One 2021/22
Assignment 1
Instructions:
• There are 3 pages to this examination.
• Answer all questions.
• You must type your report using either Word or LaTeX (pdf submission). Your report
must be in academic style and include all information needed for someone who is
competent but not familiar with the questions to be able to understand.
• You must show all your calculations and you must use R to solve the computational
parts.
• There is no page limit. However, unnecessarily long reports (i.e. including irrelevant
content) will bear a penalty.
• You should include your code as an appendix.
1. In this task you will simulate data from a data generating process with one response and
two features. Then you will try to explain the response Y given the two features X1
and X2.
(a) Simulate 1,000,000 values from each of the following standard Uniform variables
X1 ∼ U(0, 1) and X2 ∼ U(0, 1). Use these as covariates to simulate 1,000,000
values from the following response Y (this is the data generating process):
$y_i = (\lfloor 10 x_1 + 1 \rfloor + \lfloor 10 x_2 + 1 \rfloor) \bmod 2,$
where $\lfloor \cdot \rfloor$ is the floor function: $\lfloor x \rfloor = \max\{m \in \mathbb{Z} \mid m \le x\}$.
(b) Convert the variable y into a factor type and create a data frame with y, x1, x2.
(c) Plot the data on a scatter plot, where X1 is on the horizontal axis, X2 is on the
vertical axis, and the colour of the data is based on the class Y .
(d) Can you see a pattern based on the plot of Q1.c? Discuss how the data generating
process produces this pattern.
(e) Grow a classification tree using the CART algorithm to describe the relationship
between Y (the response) and X1, X2 (the features). You can decide on the
parameters. Can the tree learn the DGP (i.e. can it partition the feature space
correctly)? Explain the reasoning behind your findings.
(f) Grow a Random Forest to answer the same problem as above. Does the forest do
better than the single tree in correctly partitioning the feature space? Explain the
reasoning behind your findings.
(g) Now try to solve the same problem using Boosting (gradient boosting machines).
Compare its performance with the single tree and the random forest.
(h) Try changing the parameters (number of trees and maximum depth of tree) for
the random forest and report the effect on the accuracy of the model. Use plots
to support your arguments.
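The data generating process in parts (a) and (b) can be sketched in base R as follows. This is a minimal illustration rather than a model answer: the seed and the reduced sample size `n = 10000` are assumptions chosen to keep the example quick, and the object name `dat` is hypothetical.

```r
# Minimal sketch of the data generating process in Q1(a)-(b).
# Assumptions: the seed and n = 10,000 (smaller than the 1,000,000
# requested) are illustrative choices only.
set.seed(1)
n  <- 10000
x1 <- runif(n)   # X1 ~ U(0, 1)
x2 <- runif(n)   # X2 ~ U(0, 1)

# y_i = (floor(10 * x1 + 1) + floor(10 * x2 + 1)) mod 2
y <- (floor(10 * x1 + 1) + floor(10 * x2 + 1)) %% 2

# Q1(b): store the response as a factor alongside the two features
dat <- data.frame(y = factor(y), x1 = x1, x2 = x2)

# Worked check at a single point, x1 = 0.25 and x2 = 0.73:
# floor(3.5) + floor(8.3) = 3 + 8 = 11, and 11 mod 2 = 1
check <- (floor(10 * 0.25 + 1) + floor(10 * 0.73 + 1)) %% 2
print(check)  # 1
```

Setting `n <- 1000000` reproduces the sample size requested in part (a); the rest of the code is unchanged.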
2. In this question you will work with a real dataset from Kaggle, the Kaggle Banking Dataset.
A copy of the csv file can be found . The problem is to model the probability
that a customer will sign up to a long-term deposit, given the different characteristics
described in 15 features. The response variable is "y". In all the models that you are
asked to fit below, you can decide the parameters after some trial runs. You need to
explain how you chose the parameters.
(a) Introduce the dataset and perform an exploratory data analysis. Try to link your
analysis to the question at hand.
(b) Create a training sample of size n = 2/3N , where N is the number of instances.
The remaining data will form your testing set. Explain how you selected these sets.
(c) Fit a logistic regression model to predict the probability of signing up to a long-term
deposit (Y = Yes). Evaluate the fit.
(d) Fit a Random Forest model to predict the probability of signing up to a long-term
deposit. Evaluate the fit.
(e) Fit a generalised additive model using cubic splines as smoothers to predict the
probability of signing up to a long-term deposit. Evaluate the fit.
(f) Compare all the above models based on their predictive performance on the testing
set. Be clear on what measures you are using for comparison.
(g) Fit a logistic regression using only the 3 most important variables as indicated by
the random forest model. Explain how you can get this information. Compare this
logistic model to the one you fitted using all the variables. When would such an
approach be beneficial?
(h) (Bonus question!) Plot a partial dependence plot of the predictions of the random
forest with respect to age.
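The train/test split in part (b) of Question 2, followed by a logistic fit of the kind asked for in part (c), might look like the sketch below. It uses base R only. The data frame `bank` is a small simulated stand-in, not the Kaggle data: its size, the single feature `age`, the class proportions, and the seed are all assumptions made for illustration.

```r
# Sketch of a random 2/3 : 1/3 train/test split and a logistic fit.
# Assumption: `bank` is a toy stand-in for the real dataset; its
# columns, size and class balance are hypothetical.
set.seed(42)
bank <- data.frame(
  age = sample(18:90, 300, replace = TRUE),
  y   = factor(sample(c("no", "yes"), 300, replace = TRUE,
                      prob = c(0.9, 0.1)))
)

N <- nrow(bank)
n <- floor(2 * N / 3)                      # training size n = 2/3 N

train_idx <- sample(seq_len(N), size = n)  # rows drawn without replacement
train <- bank[train_idx, ]
test  <- bank[-train_idx, ]

# Compare the class balance of the two sets
print(prop.table(table(train$y)))
print(prop.table(table(test$y)))

# Logistic regression for P(y = "yes"); "no" is the reference level
fit <- glm(y ~ age, data = train, family = binomial)

# Predicted probabilities on the held-out testing set
p_test <- predict(fit, newdata = test, type = "response")
```

A simple random split is only one option; with a rare positive class, a split stratified on `y` may be preferable, and the choice should be justified in the report.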