RESEARCH SCHOOL OF FINANCE, ACTUARIAL STUDIES AND STATISTICS
REGRESSION MODELLING
(STAT2008/STAT2014/STAT4038/STAT6014/STAT6038)
Assignment 2 for Semester 2, 2021
INSTRUCTIONS:
• This assignment is marked out of 100 and is worth 15% of your overall grade for this course.
• Please submit your assignment in the Assignment section on Wattle using the Turnitin
submission link. When uploading to Wattle you must submit the following, combined
into a single PDF document:
1. Your assignment/report.
2. All the R code you used for the assignment, added as an appendix at the
end of the report. Failure to upload the R code will result in a penalty.
• Assignment solutions should be typed. Your assignment may include some carefully
edited R output (e.g. graphs, tables) showing the results of your data analysis and a
discussion of these results, as well as some carefully selected code. Please be selective
about what you present and only include as much R output as necessary to justify
your solution. It is important to be concise in your discussion of the results. Clearly
label each part of your report with the part of the question that it refers to.
• Unless otherwise advised, use a significance level of 5%.
• Marks may be deducted if these instructions are not strictly adhered to, and marks
will certainly be deducted if the report is of an unreasonable length, i.e. more
than 15 pages including graphs and tables. You must also include an appendix containing
all your R code; the appendix does not count towards the page limit. The appendix
will not be marked, but marks will be deducted if the R code is not provided.
The R code is required in case the markers have any questions about the
work you have submitted.
• You may ask me (Abhinav Mehta) questions about this assignment up to 24 hours
before the submission time. This will allow me enough time to respond to your
questions. The tutors will not answer any questions about the assignment other than
those about troubleshooting R code.
• Late submissions will attract a penalty of 5% of your mark for each day of delay. No
assignments will be accepted 10 days beyond the due date.
• Extensions will usually be granted on medical or compassionate grounds on production
of appropriate evidence, but must have my permission by no later than 24 hours before
the submission date. If you are granted an extension and submit your assignment after
the extended deadline then the late submission penalty will still apply.
Assignment 2 – Sem 2, 2021 Page 1 of 4
Question 1 [50 Marks]
A group of researchers in the US investigated pollution-related factors affecting
mortality. Sixty US cities were sampled. Total age-adjusted mortality from all causes
(mortality), in deaths per 100,000 population, was measured, along with the following
covariates: mean annual precipitation, in inches (precipitation); median number
of school years completed for persons aged 25 years or older (education); percentage of
population that is non-white (nonwhite); relative pollution potential of oxides of nitrogen
(nox); and relative pollution potential of sulphur dioxide (so2). “Relative pollution
potential” is the product of tons emitted per day per square kilometre and a factor
correcting for the city dimension and exposure. The data are available in the .csv file
pollution.
(a) [8 marks] Fit a multiple linear regression (MLR) model with Mortality as the
response variable and all other covariates as predictors. Is the regression model
significant?
(b) [12 marks] What are the estimated coefficients of the MLR model in part (a),
and what are the confidence intervals for these coefficients at a joint confidence level
of 95%? Interpret the values of these estimated coefficients with regard to model
specification.
(c) [6 marks] There is a t-test associated with each of these coefficients. Briefly explain
what these tests can and cannot be used for. In your answer, be sure to mention
the appropriate hypotheses that can be assessed using these t-tests.
(d) [8 marks] Construct an appropriate test of the hypothesis that education and nox
are not significant contributors to the model. That is, test β_education = β_nox = 0.
(e) [10 marks] A researcher from this group suggested that they have been using a
model with coefficients: β_precipitation = 2, β_education = −10, β_nonwhite = 3, β_nox = 0,
and β_so2 = 1. Can you test whether this existing model is consistent with the new
model you have fit? Write down appropriate full and reduced models for carrying
out such a test. Perform the test and comment on the results.
(f) [6 marks] One of the researchers is from the city of San Antonio, and has recorded
a new set of measurements on each of the predictors. The precipitation is 33,
education is 11.5, nonwhite is 17.2, and nox and so2 are each 1. What do you
predict the mortality rate to be? Find a 99% interval for this prediction.
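As a syntax hint only, the kinds of calculation asked for in parts (a) to (f) can be sketched in R as follows. This is a minimal sketch on R's built-in mtcars data, not the pollution data; the variables, the hypothesised slope values and the new observation are all made up for illustration, and the choice of g (the number of intervals covered by the joint statement) is an assumption you should justify for your own model.

```r
# Illustrative only: built-in mtcars data, NOT the assignment's pollution data.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Overall F-test and the individual t-tests
summary(fit)

# Bonferroni joint 95% intervals for g coefficients:
# each individual interval is computed at level 1 - 0.05 / g.
g <- 3  # assumed here: the three slopes only
bonf_ci <- confint(fit, parm = c("wt", "hp", "qsec"), level = 1 - 0.05 / g)

# Partial F-test of H0: beta_hp = beta_qsec = 0 (reduced model vs full model)
reduced <- lm(mpg ~ wt, data = mtcars)
partial_f <- anova(reduced, fit)

# F-test of fully specified slopes (made-up values), fixing them through an
# offset so that only the intercept is estimated in the reduced model.
fixed <- lm(mpg ~ 1 + offset(-4 * wt - 0.02 * hp + 0.5 * qsec), data = mtcars)
fixed_f <- anova(fixed, fit)

# 99% prediction interval for a single new observation (made-up values)
new_obs <- data.frame(wt = 3, hp = 120, qsec = 18)
pred <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.99)
```

Comparing nested models with anova() is one common way to carry out a general linear F-test; other approaches that give the same test are equally acceptable.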
Question 2 [50 Marks]
The Minnesota Department of Revenue has collected data on nearly every agricultural land
sale in the six major agricultural regions of Minnesota for the period 2002-2011. The
data have been provided to you by your boss and are available in the alr4 package in the
dataset MinnLand. The variables in the dataset are described in the table below:
Variable Name    Description
acrePrice        Sale price in dollars per acre, adjusted to a common date within each year
region           A factor with levels giving the geographic names of six economic regions of Minnesota
year             Year of sale
acres            Size of property, in acres
tillable         Percentage of farm that is rated arable
improvements     Percentage of property value due to buildings and other improvements
financing        Type of financing: either a title transfer or seller finance
crp              CRP if any part of the acreage is enrolled in the U.S. Conservation Reserve Program (CRP), and none otherwise
crpPct           Percentage of land in CRP
productivity     A numeric score between 1 and 100, with larger values indicating more productive land, calculated by the University of Minnesota
Your boss has asked you to investigate the effects of various predictors on the sale price
of land. You have been asked to build a regression model which can estimate the sale price
of any piece of agricultural land in Minnesota once the characteristics of the land for sale
are provided.
(a) [8 marks] Based on the data description, which variables are qualitative variables?
Read the whole data set into R. Are these qualitative variables shown as factor
objects in R? If not, manually convert them to factor objects. For each qualitative
variable, how many observations does each group have? For each of these qualitative
variables, provide boxplots of the price of land sales, acrePrice, across the
different groups. Compare the differences between groups and
summarize your findings.
(b) [6 marks] Provide any comments on the shape of the distribution of the sale price
variable. Does the response variable need to be transformed to meet the assumptions
of a linear regression model? Suggest a transformation and test it out with
supporting plots.
(c) [8 marks] You have been told that the price of land is only influenced by the year of
sale, year, the size of the land, acres, and whether there are any improvements made
on the land, improvements. Fit a regression model with these predictors, using the
sale price, with any transformation you chose in part (b), as the response
variable. Show the summary table of the fitted results. Is the model significant?
Interpret all the estimated coefficients except for the intercept.
(d) [10 marks] You believe that region should be included in the regression model, as
well as the percentage of the land that is in CRP, crpPct. Fit a regression model by
adding these variables and using the 'Central' region as the reference category for
the region variable. Is this model significant? How can you assess whether adding
these variables has improved the regression model? Produce the plots or summary
statistics which can answer this question.
(e) [8 marks] Produce the Bonferroni 95% joint confidence intervals for the coefficients
of only the quantitative variables in the model fitted in part (d).
(f) [10 marks] There is a piece of land in the South Central region with these
characteristics:
• Year of sale is 2010 and the size of property is 150 acres
• 5% of the property value is due to improvements and CRP percentage is 4%
• the land sale is financed by seller
What is the expected sale price for this piece of land? Construct a 95%
confidence interval for the expected sale price. Also, provide the 90% prediction
interval for this same piece of land.
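As a syntax hint only, the steps in parts (a) to (f) can be sketched in R as follows. This is a minimal sketch on R's built-in mtcars data, not MinnLand; the log transformation, the choice of reference level and the new observation are assumptions made for illustration and are not the expected answer.

```r
# Illustrative only: built-in mtcars data, NOT the MinnLand data.
cars <- mtcars

# Convert a numerically coded qualitative variable to a factor and check it
cars$cyl <- factor(cars$cyl)
is.factor(cars$cyl)

# Number of observations in each group
group_counts <- table(cars$cyl)

# Boxplots of the response across the groups of the qualitative variable
boxplot(mpg ~ cyl, data = cars, xlab = "cyl", ylab = "mpg")

# Shape of the response before and after a candidate (log) transformation
hist(cars$mpg)
hist(log(cars$mpg))

# Choose the reference category, then compare nested models
cars$cyl <- relevel(cars$cyl, ref = "6")
small <- lm(log(mpg) ~ wt + hp, data = cars)
large <- lm(log(mpg) ~ wt + hp + cyl, data = cars)
nested_f <- anova(small, large)
adj_r2 <- c(small = summary(small)$adj.r.squared,
            large = summary(large)$adj.r.squared)

# Bonferroni joint 95% intervals for the two quantitative predictors
bonf_ci <- confint(large, parm = c("wt", "hp"), level = 1 - 0.05 / 2)

# Intervals for a new observation (made-up values), back-transformed from the
# log scale. Note: the back-transformed confidence interval describes the
# median, not the mean, of the response on the original scale.
new_car <- data.frame(wt = 3, hp = 120,
                      cyl = factor("6", levels = levels(cars$cyl)))
mean_ci <- exp(predict(large, newdata = new_car,
                       interval = "confidence", level = 0.95))
pred_pi <- exp(predict(large, newdata = new_car,
                       interval = "prediction", level = 0.90))
```

If you transform the response, state clearly which scale each of your reported intervals is on.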