Coursework 3 – Regression and Goodness of fit

MATH20811 Practical Statistics: Coursework 3

(December 2021)

The marks awarded for this coursework constitute 40% of the total assessment for the module.

Your solution to the coursework should be fairly concise (maximum of about 12 pages) and it
should take, on average, about 20 hours to complete.

Please read all the instructions and advice given below carefully.

The submission deadline is 10:00 am on Tuesday 4 January 2022.

Late Submission of Work: Any student’s work that is submitted after the given deadline will
be classed as late, unless an extension has already been agreed via mitigating circumstances or a
DASS extension.

The following rules for the application of penalties for late submission are quoted from the
University guidance on late submission document, version 1.3 (dated July 2019):

“Any work submitted at any time within the first 24 hours following the published submission
deadline will receive a penalty of 10% of the maximum amount of marks available. Any work
submitted at any time between 24 hours and up to 48 hours late will receive a deduction of 20%
of the marks available, and so on, at the rate of an additional 10% of available marks deducted
per 24 hours, until the assignment is submitted or no marks remain.”

Your submitted solutions should all be in one document which must be prepared
using LaTeX. A 10-mark penalty will be imposed if this is not adhered to. For each
part of the project you should provide explanations as to how you completed what is required,
show your workings and also comment on computational results, where applicable.

When you include a plot, be sure to give it a title and label the axes correctly.

When you have written or used R code to answer any of the parts, then you should list this R code
after the particular written answer to which it applies. This may be the R code for a function you
have written and/or code you have used to produce numerical results, plots and tables. R code
should also be clearly annotated.

Do not use screenshots of R code/output in your report. Instead, to include R code
use the verbatim environment and summarise R output in tables using the table environment, as
demonstrated in the solution of Example Sheet 2.
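
As an illustration of the instruction above, a minimal LaTeX sketch of the two environments might look like the following. The R call, caption, label and column headings are placeholders chosen for illustration, and the layout used in the solution of Example Sheet 2 may differ.

```latex
% R code is listed with the verbatim environment (no screenshots).
\begin{verbatim}
# Scatterplot with a title and labelled axes (placeholder call)
plot(x, y, main = "Title", xlab = "x label", ylab = "y label")
\end{verbatim}

% R output is summarised in a table rather than pasted as an image.
\begin{table}[ht]
  \centering
  \caption{Placeholder caption describing the table.}
  \label{tab:placeholder}
  \begin{tabular}{lr}
    \hline
    Quantity & Value \\
    \hline
    (name of quantity) & (value from R) \\
    \hline
  \end{tabular}
\end{table}
```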

Your file should be submitted through the Turnitin assessment called “PS CW3
2021” in the folder “MATH20811 CW3” under Assessment & Feedback on Blackboard
by the deadline given above. Work will be marked anonymously on Blackboard, so please
ensure that your filename is clear but does not contain your name or student ID number.
Similarly, do not include your name or ID number in the document itself.

There is a basic LaTeX template file on Blackboard which you may choose to use for typing-up
your solutions. The file is called CW3_submitted_work.tex.

Turnitin will generate a similarity report for your submitted document and indicate matches to
other sources, including billions of internet documents (both live and archived), a subscription
repository of periodicals, journals and publications, as well as submissions from other students.
Please ensure that the document you upload represents your own work and is written in your own
words. The Turnitin report will be available for you to see shortly after the due date.

This coursework should hopefully help to reinforce some of the methodology you have been
studying, as well as the skills in R you have been developing in the module. Correct interpretation and
meaningful discussion of the results (i.e. attempt to put the results into context) are important
in order to achieve a high mark for the coursework.

The data for this coursework relate to auto insurance claims in Sweden over a particular period of
time. The data were collected by the Swedish Committee on Analysis of Risk Premium in Motor
Insurance. They are in the file sweden_ins_data.txt on Blackboard, which contains two columns:

claims = number of claims – this will be the predictor / independent variable (or covariate)
in the regression model. We can denote it by x.

payment = total payment for all the claims in thousands of Swedish Kronor for geographical
zones in Sweden – this will be the dependent / response variable in the regression model. We
denote this variable by y.

There are n = 63 observations in the dataset. The simple linear regression model for the data is
given by:

y_i = α + βx_i + ε_i,    i = 1, …, n,

where α and β are parameters with unknown values, and the random errors ε_1, …, ε_n are assumed
to be independent N(0, σ²) random variables, where the value of σ² is also unknown.

1. Produce a scatterplot of the data and comment on any evident features, and also on the
apparent suitability (or not) of the above regression model for the data. [3]

2. Write your own function in R to fit, using least squares, a simple linear regression model
to a set of data comprising a response variable, y, and a single covariate, x. Your function
should have only two arguments – a vector of data for y and a vector of data for x.

The output from your function should be in the form of a list object which comprises:

• a vector containing the parameter estimates α̂ and β̂;
• a vector containing the fitted values;
• a vector of the estimated errors (or residuals);
• a scalar giving the value of the residual degrees of freedom.

The four items in the output should all be calculated manually (DIY) within the function
itself using the input data.

To complete this part, run your function using the Swedish insurance data. [8]

Please note that,

• full marks are obtained in part 2 for writing a function which uses DIY calculations
and can successfully output the four items listed in the question;

• if you are unable to complete a function to do all the calculations then you may write
code to calculate these items separately outside of a function, but there will be a mark
penalty for this;

• all the items that your DIY regression function should output are available in an object
that can be created by running the lm function. If this is your only means of obtaining
the results specified in part 2, then you will receive only 1 of the 8 marks for it. However,
you will then have the necessary results available to do the subsequent parts of this
coursework.
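
For reference, the least squares quantities listed in part 2 have standard closed forms for the simple linear regression model. The shorthand S_xx and S_xy is notation introduced here for convenience and is not taken from the coursework text; the display assumes the amsmath package.

```latex
\begin{align*}
  S_{xx} &= \sum_{i=1}^{n} (x_i - \bar{x})^2, &
  S_{xy} &= \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}), \\
  \hat{\beta} &= \frac{S_{xy}}{S_{xx}}, &
  \hat{\alpha} &= \bar{y} - \hat{\beta}\,\bar{x}, \\
  \hat{y}_i &= \hat{\alpha} + \hat{\beta} x_i, &
  e_i &= y_i - \hat{y}_i.
\end{align*}
```

The residual degrees of freedom are n − 2, since two parameters are estimated from the n observations.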

3. Using the output generated in part 2:

(i) Report the estimated values of α and β that have been calculated. Superimpose the
fitted regression line on to a scatterplot of the data and comment on the results. [3]

(ii) Using an unbiased estimator, estimate the value of the error variance, σ². [2]

(iii) Use an F-test to test H0 : E[Y |x] = α vs H1 : E[Y |x] = α + βx at the 5% significance
level. Report your conclusions. [2]

(iv) Calculate a 95% confidence interval for E[Y |x = 80]. [2]
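
The quantities in part 3 can be written in terms of the residuals e_i and fitted values ŷ_i from part 2. The forms below are the standard textbook ones, given as a reminder rather than as the required presentation. Here x_0 denotes the covariate value of interest and S_xx = Σ(x_i − x̄)² is shorthand introduced for this sketch; the display assumes the amsmath package.

```latex
\begin{align*}
  s^2 &= \frac{1}{n-2} \sum_{i=1}^{n} e_i^2
      && \text{(unbiased estimator of } \sigma^2\text{)}, \\
  F &= \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 / 1}{s^2}
      \sim F_{1,\,n-2} \text{ under } H_0, \\
  \hat{\alpha} + \hat{\beta} x_0
    &\pm t_{n-2,\,0.975}\; s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}
      && \text{(95\% CI for } E[Y \mid x = x_0]\text{)}.
\end{align*}
```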

In the next parts we will use the residuals to examine the plausibility of the assumptions
made about our regression model. It can be shown theoretically that the estimated errors
from a fitted model (the residuals) have differing variances which depend on the values of
the covariates. Consequently, we will work with the standardised residuals so that they are
all on the same scale. To obtain the standardised residuals from the fitted model object
created in R we can use the rstandard function which performs the standardisation in a
special way that recognises the unequal variances.
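
The standardisation referred to above can be stated explicitly for simple linear regression. In the sketch below, h_ii (the leverage of observation i), s² (the unbiased estimate of σ²) and S_xx = Σ(x_i − x̄)² are notation introduced here. The formula for r_i is the internally studentised form that rstandard computes for a fitted lm object.

```latex
\begin{align*}
  \operatorname{Var}(e_i) &= \sigma^2 (1 - h_{ii}), &
  h_{ii} &= \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}, &
  r_i &= \frac{e_i}{s \sqrt{1 - h_{ii}}}.
\end{align*}
```

Observations with x_i far from x̄ have larger leverage and hence smaller residual variance, which is why dividing every residual by the same s would not put them on a common scale.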

4. Using diagnostic plots of the standardised residuals against the fitted values and against the
predictor, make comments about the validity of the assumptions that were made regarding
the model that has been fitted to the data. [4]

5. Manually construct (rather than using an existing R function for this purpose) a Normal
quantile-quantile plot of the standardised residuals and superimpose a suitable reference
line to help gauge Normality. Comment on the form of your plot and say whether you think
that Normality was a tenable assumption or not. [3]
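
One common way to set up the plot in part 5 is sketched below. The plotting positions p_i are a convention assumed here (other choices exist), and r_(i) denotes the i-th smallest standardised residual.

```latex
\begin{align*}
  p_i &= \frac{i - 1/2}{n}, &
  q_i &= \Phi^{-1}(p_i), &
  \text{plot } & (q_i,\, r_{(i)}), \quad i = 1, \dots, n.
\end{align*}
```

If the standardised residuals were exactly N(0, 1), the points would lie close to the line with intercept 0 and slope 1.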

6. We now wish to carry out a Kolmogorov-Smirnov (KS) test to assess whether the distribution
of the standardised residuals is N(0, 1). Find the value of the KS test statistic and also the
standardised residual value at which the absolute difference between the empirical and N(0, 1)
cdfs is a maximum. [3]
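
The KS statistic in part 6 compares the empirical cdf F_n of the standardised residuals with the N(0, 1) cdf Φ. Because F_n is a step function, the largest gap occurs at, or just before, one of the ordered residuals r_(1) ≤ … ≤ r_(n), which gives the computational form below. The symbols D_n, F_n and r_(i) are notation introduced here.

```latex
\begin{align*}
  D_n &= \sup_{x} \left| F_n(x) - \Phi(x) \right| \\
      &= \max_{1 \le i \le n}
         \max\left\{ \frac{i}{n} - \Phi\big(r_{(i)}\big),\;
                     \Phi\big(r_{(i)}\big) - \frac{i-1}{n} \right\}.
\end{align*}
```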

7. Produce a plot containing the empirical cdf of the standardised residuals and the N(0, 1) cdf,
and indicate on it the point at which the maximum difference between the curves occurs. [3]

8. Write a function in R to simulate the sampling distribution of the Kolmogorov-Smirnov test
statistic when the N(0, 1) null distribution is true using the same sample size as that of the
Swedish insurance data. [3]

Run your function and use the results to plot a histogram of the estimated sampling
distribution with a superimposed kernel density estimate of this distribution. [2]

Use your simulated test statistic values to obtain an estimated 5% critical value for your
test. Compare your observed value with this and report your conclusions. [2]
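
The simulation-based test described in part 8 can be summarised as follows. The number of replicates B is a hypothetical choice left to the reader, and D^(b) denotes the KS statistic computed from the b-th simulated N(0, 1) sample of size n = 63. The display assumes the amsmath package.

```latex
\begin{align*}
  &\text{for } b = 1, \dots, B: \quad
     Z^{(b)}_1, \dots, Z^{(b)}_n \overset{\text{iid}}{\sim} N(0, 1), \qquad
     D^{(b)} = \sup_{x} \left| F^{(b)}_n(x) - \Phi(x) \right|, \\
  &\hat{c}_{0.05} = \text{empirical 0.95 quantile of } D^{(1)}, \dots, D^{(B)}, \\
  &\text{reject } H_0 \text{ at the 5\% level if } D_{\text{obs}} > \hat{c}_{0.05}.
\end{align*}
```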