BUSS6002 Assignment 2
October 10, 2022
Instructions
• Due: at 23:59 on Friday, October 28, 2022 (end of week 12).
• You must submit a written report (in PDF) with the following filename format, replacing
STUDENTID with your own student ID: BUSS6002 A2 STUDENTID.pdf.
• You must also submit a Jupyter Notebook (.ipynb) file with the following filename format,
replacing STUDENTID with your own student ID: BUSS6002 A2 STUDENTID.ipynb.
• There is a limit of 2000 words for your report (excluding equations, tables, and captions).
• All plots, computational tasks, and results must be completed using Python.
• Each section of your report must be clearly labelled with a heading.
• Do not include any Python code as part of your report.
• All figures must be appropriately sized and have readable axis labels and legends (where applicable).
• The submitted .ipynb file must contain all the code used in the development of your report.
• The submitted .ipynb file must be free of any errors, and the results must be reproducible.
• You may submit multiple times but only your last submission will be marked.
• A late penalty applies if you submit your assignment late without a successful special consideration. See the Unit Outline for more details.
This assignment is worth 20% of the unit’s marks. The assessment is designed to test your technical ability and statistical knowledge in modelling a real-world dataset, as well as your communication skills in writing a concise and coherent report presenting your approach and results.
Assessment Item                  Goal                                         Marks
Introduction                                                                  3
Candidate models                                                              10
Model estimation and selection                                                12
Model evaluation                                                              8
Conclusion                                                                    3
Overall Presentation             Clear, concise, coherent, and professional   4
Table 1: Assessment Items and Marks
Being able to accurately predict the sale prices of residential properties is crucial to many aspects of the economy. Some companies base their entire business models on providing their clients with predictions of property sale prices. As a data scientist, you are asked to build a model to predict sale prices using data on residential home sales in Ames, a city in the state of Iowa in the United States. The dataset contains the sale prices of all residential properties sold in Ames between 2006 and 2010, as well as many numerical and categorical features (i.e., variables) associated with each dwelling. The following downloadable files are available on Canvas.
File                     Description
AmesHousing.txt          Data file containing 2,930 observations and 82 variables
DataDocumentation.txt    Data dictionary containing description of each variable
AmesResidential.pdf      A map of Ames
Table 2: Files Provided
Place the data file AmesHousing.txt in the same location (i.e., directory) as your Jupyter Notebook file (.ipynb), and then read the data into a pandas DataFrame object using exactly the following code.
import pandas as pd
data = pd.read_csv(
    'AmesHousing.txt', sep='\t', keep_default_na=False, na_values=[''])
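The two NA-related arguments in the code above matter because a field containing the text NA may be a genuine category label rather than a missing value (check DataDocumentation.txt for the variables where this applies). The sketch below uses a small made-up table, not the real file, to show the difference between the pandas defaults and the settings required here. The column values are hypothetical.

```python
import io
import pandas as pd

# A tiny made-up tab-separated table (not the real data): 'NA' in the
# Alley column is a category label, while the empty field is truly missing.
raw = "PID\tAlley\tSalePrice\n1\tNA\t200000\n2\tPave\t\n"

# pandas defaults: the string 'NA' is treated as a missing value.
default = pd.read_csv(io.StringIO(raw), sep='\t')

# The assignment's settings: only empty fields are treated as missing.
as_instructed = pd.read_csv(
    io.StringIO(raw), sep='\t', keep_default_na=False, na_values=[''])

print(default['Alley'].isna().tolist())            # is the label lost?
print(as_instructed['Alley'].tolist())             # is the label kept?
print(as_instructed['SalePrice'].isna().tolist())  # is the empty field missing?
```

With the required settings, the label NA survives as an ordinary string, and only the empty field is recorded as missing.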
1 Introduction
In this section, you should:
• provide a brief project background so that the reader of your report can understand the general problem that you are solving;
• state the aim of your project;
• briefly describe the dataset;
• briefly summarise your key results.
2 Candidate models
Propose at least three candidate models for predicting the response variable ‘SalePrice’. For $i \in \{1, 2, 3\}$, each candidate model should take the form
$$y = f_i(x_i; \beta_i) + \varepsilon_i,$$
where $y$ is the sale price of a property, and $x_i$, $\beta_i$, and $\varepsilon_i$ are the predictor vector, parameter vector, and the error term of the $i$-th model, respectively. The set of variables chosen for the feature vector $x_i$ should be a subset (or constructed from a subset) of the 81 predictors in the provided dataset. You may label your models M1, M2, and M3. The proposed models should be different in terms of model complexity (i.e., number of parameters). For each proposed model, you should:
• clearly define the function $f_i$, which can be either linear or nonlinear with respect to $x_i$;
• clearly define the feature vector $x_i$;
• justify your choices of $f_i$ and $x_i$;
• state any assumptions on the error term $\varepsilon_i$;
• discuss how the model parameters $\beta_i$ can be estimated.
Hint: one effective way to motivate/justify your choices of $f_i$ and $x_i$ is to present the relevant evidence in the data.
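As an illustration of the notation only, the sketch below sets up one hypothetical linear candidate, f(x; β) = β0 + β1·area + β2·quality, and estimates β by ordinary least squares. The data are simulated, and the names area and quality, the coefficient values, and the noise level are all assumptions made for the example. They are not a recommendation of which predictors or functional form to use.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated stand-ins for two predictors (hypothetical, not the real data).
area = rng.uniform(50, 300, size=n)      # e.g. a living-area measure
quality = rng.integers(1, 11, size=n)    # e.g. a 1-10 quality score

# Design matrix: a column of ones for the intercept, then the features.
X = np.column_stack([np.ones(n), area, quality])

# Simulated response y = f(x; beta) + eps with a known (made-up) beta.
beta_true = np.array([20000.0, 800.0, 15000.0])
price = X @ beta_true + rng.normal(0, 10000, size=n)

# OLS estimate: minimises the sum of squared residuals ||y - X beta||^2.
beta_hat, *_ = np.linalg.lstsq(X, price, rcond=None)

# Training residuals, which can later be inspected for misspecification.
residuals = price - X @ beta_hat
print(beta_hat)
print(residuals.mean())
```

Because the design matrix includes an intercept column, the OLS residuals average to zero on the training data up to rounding error. Any remaining pattern in the residuals (for example, against fitted values) is what a specification check would look for.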
3 Model estimation and selection
Select the best model from the set of candidate models proposed in Section 2 using the “validation set” approach. In this section, you should:
• include a description of the model selection procedure that you adopted;
• report and discuss the estimation results (based on the training set) of each candidate model;
• discuss whether each candidate model is correctly specified based on residuals (obtained from fitting each model to the training set);
• report the validation performance (MSE) of each candidate model;
• identify the best model;
• discuss the complexity of the selected model in terms of bias-variance tradeoff.
The description of the model selection procedure (first point above) should provide enough details so that the reader is able to implement exactly what you have done by following your description.
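A minimal sketch of the validation set approach is given below, under stated assumptions: the data are simulated, the three candidates are polynomials of degree 1, 2, and 8 in a single made-up predictor, and the 60/20/20 split is an arbitrary choice that the assignment does not prescribe. The helper names design and mse are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 600

# Simulated data (hypothetical): the response is quadratic in one predictor.
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(0, 0.5, size=n)

# Shuffle the indices once, then split 60/20/20 into train/validation/test.
# The test indices are set aside and not used anywhere in model selection.
idx = rng.permutation(n)
train, val, test = idx[:360], idx[360:480], idx[480:]

def design(x, degree):
    """Polynomial design matrix with columns 1, x, ..., x**degree."""
    return np.vander(x, degree + 1, increasing=True)

def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return float(np.mean((y_true - y_pred) ** 2))

val_mse = {}
for degree in (1, 2, 8):  # three candidates of increasing complexity
    # Estimate each candidate on the training set only.
    beta, *_ = np.linalg.lstsq(design(x[train], degree), y[train], rcond=None)
    # Score it on the validation set, which played no part in estimation.
    val_mse[degree] = mse(y[val], design(x[val], degree) @ beta)

# The selected model is the one with the smallest validation MSE.
best = min(val_mse, key=val_mse.get)
print(val_mse, best)
```

Fixing the random seed and splitting once, before any model is fitted, is one way to keep the procedure reproducible. The degree-1 candidate is too simple for this simulated response and should show a clearly larger validation MSE than the other two.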
4 Model evaluation
Evaluate the generalisation performance of the selected model in Section 3 against two benchmark models. The generalisation performance should be measured by the observed MSE calculated using the test set. The two benchmark models are specified as follows.
• Let C be the set constructed by combining (or concatenating) the observed sale prices in the training and validation sets. The first benchmark model (BM1) is the “constant mean” model given by
$$\hat{y}_{\mathrm{BM1}} := \frac{1}{m} \sum_{y \in C} y,$$
where m > 0 is the size of the set C. That is, BM1 will always give the sample mean of C as its prediction, regardless of the values of any predictors.
• Let N(x) be the subset of C that contains only the sale prices from the neighbourhood x. E.g., N(‘OldTown’) contains the sale prices in C that are associated with the neighbourhood ‘OldTown’. The second benchmark model (BM2) is the “neighbourhood mean” model given by
$$\hat{y}_{\mathrm{BM2}} := \frac{1}{m(x)} \sum_{y \in N(x)} y,$$
where m(x) < m is the size of the set N(x). That is, BM2 predicts the sale price by the average price of the corresponding neighbourhood.
In this section, you should:
• combine the training and validation sets and re-estimate the selected model on the combined set;
• describe the model evaluation procedure;
• describe the two benchmark models;
• report and discuss the generalisation (i.e., test set) performance of the selected model against the two benchmark models.
5 Conclusion
In this section, you should:
• discuss your findings;
• discuss any limitations of your project;
• suggest any potential extensions for future work.
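Returning to the two benchmark models of Section 4, the sketch below shows one possible way to compute BM1 and BM2 and their test MSE with pandas. The prices are made up, and the column name Neighborhood is an assumption about how the neighbourhood variable is labelled, so check it against DataDocumentation.txt. The helper name mse is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up sale prices (not the real data). C is the training and validation
# sets combined; the test set is kept separate.
C = pd.DataFrame({
    'Neighborhood': ['OldTown', 'OldTown', 'NAmes', 'NAmes'],
    'SalePrice': [100000, 120000, 150000, 170000],
})
test = pd.DataFrame({
    'Neighborhood': ['OldTown', 'NAmes'],
    'SalePrice': [115000, 150000],
})

# BM1: the sample mean of C, used for every test observation.
bm1_pred = np.full(len(test), C['SalePrice'].mean())

# BM2: the mean of C within each neighbourhood, looked up per test row.
nbhd_mean = C.groupby('Neighborhood')['SalePrice'].mean()
bm2_pred = test['Neighborhood'].map(nbhd_mean).to_numpy()

def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return float(np.mean((np.asarray(y_true) - y_pred) ** 2))

bm1_mse = mse(test['SalePrice'], bm1_pred)
bm2_mse = mse(test['SalePrice'], bm2_pred)
print(bm1_mse, bm2_mse)
```

Note that both benchmarks are computed from C only and the test prices are used solely for scoring. If a neighbourhood in the test set never appears in C, the lookup above returns a missing value, and how to handle that case is a choice you would need to make and describe.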