2018S2 QBUS6850 Page 1 of 4
Notes to Students
1. The assignment MUST be submitted electronically to Turnitin through the QBUS6850
Canvas site. Please do NOT submit a zipped file.
2. The assignment is due at 17:00 on Monday, 3 September 2018. The late penalty
for the assignment is 10% of the assigned mark per day, starting after 17:00 on the
due date. The closing date, Monday, 10 September 2018, 17:00, is the last date on
which an assessment will be accepted for marking.
3. Your answers must be provided as a word-processed report giving a full explanation
and interpretation of any results you obtain. Output without explanation will receive
zero marks.
4. Be warned that plagiarism between individuals is always obvious to the markers of
the assignment and can be easily detected by Turnitin.
5. The data sets for this assignment can be downloaded from Canvas.
6. Presentation of the assignment is part of the assignment. Markers may deduct up to 10%
of the mark for poor clarity of writing and presentation. It is recommended that you
include your Python code as an appendix to your report; however, you may insert
small sections of your code into the report for better interpretation when necessary.
Think about the best and most structured way to present your work, summarise the
procedures implemented, support your results/findings and prove the originality of
your work.
7. Numbers with decimals should be reported to three decimal places.
8. The report should NOT be more than 10 pages, including everything (text, figures,
tables, small sections of inserted code, etc.) but excluding the appendix containing
Python code.
Tasks
Question 1 (50 Marks)
You will work on the UCI ML housing dataset
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/.
A template Python program has been prepared for you. The program can
help you get the dataset from the scikit-learn dataset repository. Please test and play with the
template program to fully understand the dataset.
For further information, please visit
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names.
(a) Suppose you are interested in using the house age AGE (proportion of
owner-occupied units built prior to 1940) as the first feature $x_1$ and the full-value
property-tax rate TAX as the second feature $x_2$ to predict MEDV (median
value of owner-occupied homes in \$1000's) as the target $t$. Write code to extract
these two features and the target from the dataset.
Use the dataset (two chosen features and one target) to plot the loss function
$$L(\boldsymbol{\beta}) = \frac{1}{2N} \sum_{n=1}^{N} \left(f(\mathbf{x}_n, \boldsymbol{\beta}) - t_n\right)^2$$
with $f(\mathbf{x}_n, \boldsymbol{\beta}) = \beta_1 x_1 + \beta_2 x_2$.
That is, we are using a linear regression model without the intercept term $\beta_0$.
Hint: This is a 3D plot and you will need to iterate over a range of $\beta_1$ and $\beta_2$
values.
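The hint above can be sketched as follows. This is only an illustration under stated assumptions: the arrays `x1`, `x2` and `t` are synthetic stand-ins (not the housing data), and the grid ranges are arbitrary choices that would need adjusting for the real features.

```python
import numpy as np

# Synthetic stand-ins for AGE, TAX and MEDV (assumption: NOT the real data).
rng = np.random.default_rng(0)
x1 = rng.uniform(0.0, 100.0, size=200)    # plays the role of AGE
x2 = rng.uniform(200.0, 700.0, size=200)  # plays the role of TAX
t = 0.05 * x1 + 0.03 * x2 + rng.normal(0.0, 1.0, size=200)

def loss(beta1, beta2):
    """L(beta) = 1/(2N) * sum_n (beta1*x1_n + beta2*x2_n - t_n)^2."""
    residual = beta1 * x1 + beta2 * x2 - t
    return np.sum(residual ** 2) / (2 * len(t))

# Evaluate the loss on a grid of (beta1, beta2) values.
beta1_values = np.linspace(-0.5, 0.5, 41)
beta2_values = np.linspace(-0.1, 0.1, 41)
B1, B2 = np.meshgrid(beta1_values, beta2_values)
loss_grid = np.empty_like(B1)
for i in range(B1.shape[0]):
    for j in range(B1.shape[1]):
        loss_grid[i, j] = loss(B1[i, j], B2[i, j])
```

The arrays `B1`, `B2` and `loss_grid` can then be passed to a 3D surface routine such as Matplotlib's `plot_surface` on a 3D axes.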
(b) Use the LinearRegression model in the scikit-learn package to fit two linear
regression models that predict the target, one with and one without the
intercept term. You may use 90% of the data as your training data, and the
remaining 10% as your testing data. Compare the performance of the two models and
explain the importance of the intercept term.
Hint: The fit_intercept argument of LinearRegression controls
whether an intercept term is included in the model: fit_intercept = True
or fit_intercept = False.
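To see what including or excluding the intercept changes, here is a minimal NumPy sketch using ordinary least squares directly rather than scikit-learn. The data are synthetic (an assumption for illustration, with a deliberately large true intercept), and `fit_and_mse` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 300
X = rng.uniform(0.0, 10.0, size=(N, 2))
# Synthetic target with a true intercept of 20 (assumption for illustration).
t = 20.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.5, size=N)

# 90% / 10% train/test split on shuffled indices.
idx = rng.permutation(N)
n_train = int(0.9 * N)
train, test = idx[:n_train], idx[n_train:]

def fit_and_mse(with_intercept):
    """Least-squares fit on the training rows; returns (beta, test MSE)."""
    def design(A):
        if with_intercept:
            return np.column_stack([np.ones(len(A)), A])  # prepend column of 1s
        return A
    beta, *_ = np.linalg.lstsq(design(X[train]), t[train], rcond=None)
    pred = design(X[test]) @ beta
    return beta, np.mean((pred - t[test]) ** 2)

beta_with, mse_with = fit_and_mse(True)
beta_without, mse_without = fit_and_mse(False)
print("with intercept:   ", beta_with, mse_with)
print("without intercept:", beta_without, mse_without)
```

On data like this, the model without an intercept is forced through the origin, so its test error is noticeably larger.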
(c) Take 90% of the data as training data. Construct the centred training dataset by
carrying out the following steps in your Python code:
(i) Take the mean of all the training target values, then subtract this mean from
each training target value MEDV. Take the resulting target values as the new
training target values $\mathbf{t}_{\text{new}}$;
(ii) In the training data, take the mean of all the first feature values AGE, then
subtract this mean from each of the first feature values. Take the result as the new
first feature values $\mathbf{x}_{\text{new},1}$;
(iii) In the training data, do the same for the second feature TAX. The result is
$\mathbf{x}_{\text{new},2}$;
Now build linear regressions with and without the intercept to fit to the new
training data. Report and compare the coefficients and the intercept. Compare the
performance of the two models over the testing data. Note that, when you feed your
testing data into the model to calculate performance scores, you must subtract the
relevant training means from the testing features and targets.
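The centring steps above can be sketched as follows, assuming a generic feature matrix `X` with two columns and a target vector `t`. The data here are synthetic placeholders and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(0.0, 100.0, size=(50, 2))  # synthetic stand-in for (AGE, TAX)
t = 10.0 + 0.2 * X[:, 0] + rng.normal(0.0, 1.0, size=50)

# 90% training, 10% testing.
n_train = int(0.9 * len(t))
X_train, X_test = X[:n_train], X[n_train:]
t_train, t_test = t[:n_train], t[n_train:]

# Means are computed from the TRAINING data only.
x_mean = X_train.mean(axis=0)
t_mean = t_train.mean()

# Centred training data.
X_train_new = X_train - x_mean
t_train_new = t_train - t_mean

# The testing data are shifted by the same training means.
X_test_new = X_test - x_mean
t_test_new = t_test - t_mean
```

Reusing the training means on the testing data keeps both sets in the same coordinate system, so a model fitted on the centred training data can be scored on the shifted testing data.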
(d) Consider the closed-form solution of the linear regression below (see slide 25 of
Lecture 2; the slide number may change),
$$\boldsymbol{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{t}$$
where $\mathbf{X}$ is the design (data) matrix whose first column is all 1s, and the first
component in $\boldsymbol{\beta}$ is the intercept. Suppose that the data are centred (refer to (c)).
Now prove that, in the case of centred data, the intercept $\beta_0$ in the solution above
is zero.
Hint: You may need the following fact:
Andy Sun
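$$\begin{pmatrix} \mathbf{A} & 0 \\ 0 & \mathbf{B} \end{pmatrix}^{-1} = \begin{pmatrix} \mathbf{A}^{-1} & 0 \\ 0 & \mathbf{B}^{-1} \end{pmatrix}$$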
where both matrices A and B are invertible.
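As a numerical sanity check (not a proof), the claim can be tested on synthetic data. The sketch below assumes randomly generated features and targets, centres them, and evaluates the closed-form solution with a column of 1s in the design matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100
features = rng.normal(5.0, 2.0, size=(N, 2))  # synthetic, uncentred features
t = 3.0 + features @ np.array([1.0, -0.5]) + rng.normal(0.0, 0.1, size=N)

# Centre the features and the target.
features_c = features - features.mean(axis=0)
t_c = t - t.mean()

# Design matrix whose first column is all 1s.
X = np.column_stack([np.ones(N), features_c])
XtX = X.T @ X

# Closed-form solution, computed by solving (X^T X) beta = X^T t.
beta = np.linalg.solve(XtX, X.T @ t_c)

print("intercept:", beta[0])
print("first row of X^T X:", XtX[0])
```

The first row of `XtX` shows that the off-diagonal blocks vanish (up to rounding) for centred features, which is the structure the hint refers to.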
Question 2 (50 Marks)
Use Logistic Regression to predict the diagnosis of breast cancer patients on the Breast Cancer
Wisconsin (Diagnostic) Dataset (wdbc.data). See Section About Datasets. This question
aims to test your ability to program matrix operations for Logistic Regression.
(a) Write Python code to load the data into your program. For the target feature
Diagnosis, change its literal M (malignant) to 0 and B (benign) to 1. Split the data
into training and validation sets (80%, 20% split). Then define and train a logistic
regression model by using scikit-learn’s LogisticRegression model.
(b) Using the logistic regression model function below and the estimated parameters
from your model, calculate the probability of sample ID 8510426 (20th sample)
having a benign diagnosis.
$$f(\mathbf{x}_n, \boldsymbol{\beta}) = \frac{1}{1 + e^{-\mathbf{x}_n^T \boldsymbol{\beta}}}$$
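A minimal sketch of evaluating this function for a single sample is shown below. The parameter and feature values are hypothetical placeholders, not estimates from the Wisconsin data. With a fitted scikit-learn LogisticRegression, the intercept would come from the `intercept_` attribute and the remaining coefficients from `coef_`.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters (beta_0, beta_1, beta_2); NOT fitted values.
beta = np.array([0.5, -1.0, 2.0])

# Hypothetical raw features of one sample; prepend the intercept feature 1.
x_raw = np.array([1.2, 0.3])
x_n = np.concatenate([[1.0], x_raw])

# f(x_n, beta) is the modelled probability of the class coded as 1.
p_class_1 = sigmoid(x_n @ beta)
print(round(p_class_1, 3))  # 0.475
```

Because benign is coded as 1 in part (a), the value of the function is the probability of a benign diagnosis under the model.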
(c) The objective of logistic regression is defined as (see slide 17 of Lecture 3; the
slide number may change),
$$L(\boldsymbol{\beta}) = -\frac{1}{N} \sum_{n=1}^{N} \left[ t_n \log\left(f(\mathbf{x}_n, \boldsymbol{\beta})\right) + (1 - t_n) \log\left(1 - f(\mathbf{x}_n, \boldsymbol{\beta})\right) \right]$$
where both the parameter $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_d)^T$ and the sample
$\mathbf{x}_n = (x_{n0}, x_{n1}, \ldots, x_{nd})^T$ are $(d+1)$-dimensional vectors, with the intercept feature
$x_{n0} = 1$. For the Wisconsin dataset, $d = 30$. It is easy to prove that (you don't need
to prove this)
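$$\frac{\partial L(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \frac{1}{N} \mathbf{X}^T \left(\mathbf{f}(\mathbf{X}, \boldsymbol{\beta}) - \mathbf{t}\right)$$
where $\mathbf{f}(\mathbf{X}, \boldsymbol{\beta}) = \left(f(\mathbf{x}_1, \boldsymbol{\beta}), f(\mathbf{x}_2, \boldsymbol{\beta}), \ldots, f(\mathbf{x}_N, \boldsymbol{\beta})\right)^T$
and $\mathbf{t} = (t_1, t_2, \ldots, t_N)^T$.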
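The derivative formula can be checked numerically against central finite differences of the loss. The sketch below uses small synthetic data; the function names `loss` and `gradient` and the step size `eps` are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(beta, X, t):
    """Negative mean log-likelihood L(beta)."""
    f = sigmoid(X @ beta)
    return -np.mean(t * np.log(f) + (1 - t) * np.log(1 - f))

def gradient(beta, X, t):
    """Matrix form of the derivative: (1/N) X^T (f(X, beta) - t)."""
    return X.T @ (sigmoid(X @ beta) - t) / len(t)

# Small synthetic problem (assumption: random data, first column of X is all 1s).
rng = np.random.default_rng(4)
N, d = 40, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, d))])
t = rng.integers(0, 2, size=N).astype(float)
beta = rng.normal(scale=0.1, size=d + 1)

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.zeros_like(beta)
for j in range(len(beta)):
    step = np.zeros_like(beta)
    step[j] = eps
    numeric[j] = (loss(beta + step, X, t) - loss(beta - step, X, t)) / (2 * eps)

max_abs_diff = np.max(np.abs(numeric - gradient(beta, X, t)))
print(max_abs_diff)
```

A very small `max_abs_diff` indicates that the matrix formula and the loss definition agree.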
Write your own Python code that uses this derivative formula to implement the
gradient descent algorithm for logistic regression. You may write a Python
function named, for example, myLogisticGD, which accepts a data matrix X, an
initial parameter beta_0, a number of GD iterations T and any other arguments
you see fit. Your function should return the learned parameter $\boldsymbol{\beta}$.
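One possible shape for such a function is sketched below under stated assumptions: the name `logistic_gd_sketch`, the fixed learning rate `alpha`, and the synthetic data are all illustrative choices, not a prescribed solution. The target vector `t` is passed as one of the additional arguments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd_sketch(X, t, beta_0, T, alpha=0.1):
    """Plain gradient descent with a fixed step size alpha (an assumption).

    X is the N x (d+1) data matrix whose first column is all 1s, t holds the
    0/1 targets, beta_0 is the initial parameter, T the number of iterations.
    """
    beta = np.asarray(beta_0, dtype=float).copy()
    N = len(t)
    for _ in range(T):
        grad = X.T @ (sigmoid(X @ beta) - t) / N  # derivative formula
        beta -= alpha * grad                      # descent step
    return beta

# Synthetic check (assumption: data generated from a known logistic model).
rng = np.random.default_rng(5)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
true_beta = np.array([0.5, 2.0, -1.0])
t = (rng.uniform(size=N) < sigmoid(X @ true_beta)).astype(float)

beta_hat = logistic_gd_sketch(X, t, beta_0=np.zeros(3), T=2000, alpha=0.5)
print(beta_hat)
```

On real data, features on very different scales usually call for standardisation or a smaller step size before a fixed-step scheme like this behaves well.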
Hint: In Python, you can use the following way to get the vector $\mathbf{F} = \mathbf{f}(\mathbf{X}, \boldsymbol{\beta})$.
First define the sigmoid function by
QBUS6850 Assignment 1:
About Datasets
Breast Cancer Wisconsin (Diagnostic): wdbc.data