QBUS2820 Predictive Analytics
Semester 2, 2021
Individual Assignment 1
Key information
1. Required submissions (through Canvas/Assignments/Individual Assignment 1)
a. ONE written report (Word or PDF format) for both tasks.
b. ONE Jupyter Notebook .ipynb file for Task A.
2. Due date/time and closing date/time: See Canvas. The late penalty for the
assignment is 5% of the assigned mark per day, starting after 4pm on the due date.
The closing date/time is the last date/time on which an assessment will be accepted
for marking.
3. Weight: 30% of the total mark.
4. Length: The main text of your report (including Task A and Task B) should have a
maximum of 15 pages with the usual font size 11-12. For Task A, you should write a
complete report including sections such as business context, problem formulation,
data processing, EDA, methodology, analysis, conclusions and limitations, etc.
5. If you wish to include additional material, you can do so by creating an appendix.
There is no page limit for the appendix. Keep in mind that making good use of your
audience’s time is an essential business skill. Every sentence, table and figure has
to count. Extraneous and/or wrong material will reduce your mark no matter the
quality of the assignment.
6. Anonymous marking: In line with the University's anonymous marking policy, please
only include your student ID in the submitted report, and do NOT include your
name. The file names of your report and code file should follow the format below.
Replace “SID” with your Student ID. Example: SID_Qbus2820_Assignment1.
7. Presentation is part of the assignment. Markers will allocate 5
marks for clarity of writing and presentation. Numbers with decimals should be
reported to four decimal places.
Key rules:
Carefully read the requirements for each part of the assignment.
Please follow any further instructions announced on Canvas.
You must use Python for the assignment. Use “random_state=1” when needed, e.g.
when using the “train_test_split” function. For all other parameters that are not
specified in the questions, use the default values of the corresponding Python functions.
Reproducibility is fundamental in data analysis, so you are required to submit a
Jupyter Notebook that generates your results. Not submitting your code will lead to a
loss of 50% of the assignment marks.
Failure to read information and follow instructions may lead to a loss of marks.
Furthermore, note that it is your responsibility to be informed of the University of
Sydney and Business School rules and guidelines, and follow them.
Referencing: Harvard Referencing System. (You may find the details at:
http://libguides.library.usyd.edu.au/c.php?g=508212&p=3476130)
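The reproducibility rule above (“random_state=1”, defaults elsewhere) can be sketched as follows. This is a minimal illustration only: the arrays X and y are synthetic stand-ins rather than the assignment data, and the names X_valid and same_split are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration only);
# in the assignment these would come from the provided CSV files.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# random_state=1 as required; every other parameter is left at its default
# (the default test_size is 0.25 when train_size is also unspecified).
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=1)

# Repeating the call with the same seed reproduces the same partition.
X_train_again, _, _, _ = train_test_split(X, y, random_state=1)
same_split = np.array_equal(X_train, X_train_again)
print(X_train.shape, X_valid.shape, same_split)
```

Fixing the seed means anyone who reruns the notebook obtains the same partition and therefore the same reported numbers.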
Task A (Lab): 70 Marks
You will work on the NBA salary dataset.
Note: This task does not require prior knowledge of basketball. You should not add any
personal subjective assumptions about these data based on your existing knowledge, as this
can lead to inaccurate results. You should use the techniques taught in this unit, and those
you discover yourself, to complete the prediction task.
1. Problem description
You are a consultant working for a sports analytics company. The NBA has approached you to
develop models that predict NBA player salaries using data analysis techniques. To
enable this task, you have been provided with a dataset containing highly detailed performance
statistics of NBA players.
Build two models (optionally three) to predict NBA player salary from performance statistics.
These models are:
a linear regression model,
a kNN regression model.
The third model is optional, and you may earn up to 10 bonus marks for
it. This model can be any model of your choice (for example a kernel regression
model, or even a model not covered in the QBUS2820 unit). This is to encourage
you to self-explore and self-study, since the ability to self-study is critical in the
rapidly evolving field of machine learning.
As part of the contract, you need to write a report according to the details below.
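As a rough illustration of the two required model types, the sketch below fits a linear regression and a kNN regression with scikit-learn, leaving every parameter at its default. The data are synthetic stand-ins, and the names models and predictions are hypothetical; it is not a prescribed solution.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (an assumption for illustration only).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
y_train = (5.0 + X_train @ np.array([2.0, -1.0, 0.5, 0.0])
           + rng.normal(scale=0.5, size=200))
X_test = rng.normal(size=(50, 4))

# Both estimators use their default settings
# (for KNeighborsRegressor the default is n_neighbors=5).
models = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(),
}

# Fit each model on the training data and predict on the test predictors.
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)

print({name: pred.shape for name, pred in predictions.items()})
```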
1. Understanding the data
You can download the “NBA_Train.csv” and “NBA_test.csv” data from Canvas. The
response is the SALARY($Millions) column in the dataset.
The NBA glossary link below and the glossary table at the end of Task A can help you
better understand the meaning of the variables:
https://stats.nba.com/help/glossary
You should use the given test set to evaluate the performance of your work. The
performance/scoring metric is the Root Mean Squared Error (RMSE) on the test set.
Your target is a test set RMSE of less than 4.1 ($Millions).
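For reference, RMSE is the square root of the mean squared difference between observed and predicted values, sqrt((1/n) * sum((y_i - yhat_i)^2)). The sketch below computes it with NumPy on made-up numbers; the helper name rmse and the toy values are assumptions for illustration, not assignment data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values (hypothetical): three exact predictions and one error of 2.
y_test = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.0, 2.0, 3.0, 6.0]

score = rmse(y_test, y_hat)
# Reported to four decimal places, as the presentation rule requires.
print(f"Test RMSE: {score:.4f}")  # prints "Test RMSE: 1.0000"
```

Because the response is measured in $Millions, the RMSE is in $Millions as well, which is how it compares directly with the 4.1 target.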
2. Written report
The purpose of the report is to describe, explain, and justify your solution to the client with
polished presentation. Be concise and objective. Find ways to say more with less. When in
doubt, put it in the appendix. You can refer to the file “TaskA_instructions” for more
detailed instructions on how to work on Task A, including writing the report.
The NBA glossary table (source: Sports Reference LLC, 2016a).
Task B (essay): 30 marks
Question 1
a) Sometimes an observed data set is partitioned into a training set and a validation
set. What is the purpose of using these two sets?
b) Often, the partition is done randomly, i.e., observations are randomly assigned
to the training set and the validation set. What is the purpose of this random
partition?
Question 2
The 2001 demographic data of the United Nations contain PPgdp, the 2001
gross national product per person (in dollars), and Fertility, the birth rate per 1000 females
in the population in the year 2000. The data are for 193 localities among UN member
countries. The data were collected from http://unstats.un.org/unsd/demographic.
(a) The figure below gives the scatter plot of Fertility versus PPgdp:
The left panel in the figure below gives the scatter plot of log(Fertility) versus log(PPgdp):
We want to explain Fertility based on PPgdp using a simple linear regression model.
Would you prefer working on the original scale or log-scale? Explain.
(b) The UN’s 2001 demographic data also contain Purban, the percentage of the population
who live in an urban area. The scatter plot of log(Fertility) versus Purban is given in the figure
above. The correlation coefficient between Purban and log(PPgdp) is 0.78.
Below is the output of the linear regression model of log(Fertility) using the two predictors
log(PPgdp) and Purban:
Comment on this situation: what would you do next? Would you remove Purban from the model or
keep it? Or what else would you do?