King's College London
This paper is part of an examination of the College counting towards the award of a degree. Examinations are governed by the College Regulations under the authority of the Academic Board.
Degree Programmes: MSc
Module Code: 7CCMMS61T
Module Title: Statistics for Data Analysis
Examination Period: January 2018 (Period 1)
Time Allowed: Two hours
Rubric: ANSWER ALL QUESTIONS. ANSWER EACH QUESTION ON A NEW PAGE OF YOUR ANSWER BOOK AND WRITE ITS NUMBER IN THE SPACE PROVIDED. A FORMULA SHEET IS PROVIDED.
Calculators: Calculators may be used. The following models are permitted: Casio fx83, Casio fx85.
Notes: Books, notes or other written material may not be brought into this examination.
PLEASE DO NOT REMOVE THIS PAPER FROM THE EXAMINATION ROOM
2018 King's College London
SOLUTIONS

January 2018 7CCMMS61T
1. A researcher measured the concentration of potassium in the blood of 50 patients after receiving a new drug. The following table summarises the data:
Interval            3-4   4-4.5  4.5-5  5-5.5  5.5-6
Absolute frequency  7     12     18     8      5
Answer
PARTIAL MARKS ARE AWARDED FOR WORKING, THROUGHOUT. Syllabus topic: descriptive statistics. Teaching outcome: similar problems have been seen in tutorials.
a. Calculate the relative frequency table and the relative cumulative frequency table for these data.
4 marks
Answer
2 marks for each.
Interval                       3-4   4-4.5  4.5-5  5-5.5  5.5-6
Relative frequency             0.14  0.24   0.36   0.16   0.10
Relative cumulative frequency  0.14  0.38   0.74   0.90   1.00
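The two tables can be reproduced with a few lines of R. This is only a minimal sketch: the variable names (breaks, counts, freq_table) are illustrative and not part of the original solution.

```r
# Class boundaries and absolute frequencies copied from the data table
breaks <- c(3, 4, 4.5, 5, 5.5, 6)
counts <- c(7, 12, 18, 8, 5)

rel_freq <- counts / sum(counts)   # relative frequencies
cum_rel_freq <- cumsum(rel_freq)   # relative cumulative frequencies

# Hypothetical helper table, one row per class
freq_table <- data.frame(
  lower = head(breaks, -1),
  upper = tail(breaks, -1),
  rel_freq = rel_freq,
  cum_rel_freq = cum_rel_freq
)
print(freq_table)
```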
b. Draw appropriate graphical representations for the relative frequency table and the cumulative relative frequencies.
4 marks
Answer
2 marks for each.
[Figure: Empirical distribution function for Potassium data. x axis from 3.0 to 6.0; y axis: relative cumulative frequency, from 0.0 to 1.0.]
c. Determine the modal class and the intervals containing the first quartile, median and third quartile.
3 marks
Answer
1 mark for the modal class, 2 marks for the quantile classes. Modal class: 4.5-5. Q1: 4-4.5, Q2: 4.5-5, Q3: 5-5.5.
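As a check, the classes can be located from the cumulative relative frequencies. The sketch below assumes the class labels are written as "lower-upper"; the helper class_of is hypothetical.

```r
breaks <- c(3, 4, 4.5, 5, 5.5, 6)
counts <- c(7, 12, 18, 8, 5)
labels <- paste(head(breaks, -1), tail(breaks, -1), sep = "-")

# Modal class: class with the largest absolute frequency
# (the frequency density counts / diff(breaks) selects the same class here)
modal_class <- labels[which.max(counts)]

cum_rel_freq <- cumsum(counts) / sum(counts)

# First class whose cumulative relative frequency reaches p
class_of <- function(p) labels[which(cum_rel_freq >= p)[1]]
quartile_classes <- sapply(c(Q1 = 0.25, Q2 = 0.50, Q3 = 0.75), class_of)
print(modal_class)
print(quartile_classes)
```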
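d. Calculate approximate values for the median, the mean and the variance of the data.
4 marks
Answer
1 mark for the median, 1 mark for the mean, 2 marks for the variance.
$$Q_2 \approx 4.5 + 0.5 \cdot \frac{0.5 - 0.38}{0.36} = 4.\overline{6} \approx 4.67$$
$$\bar{x} \approx 3.5 \cdot 0.14 + 4.25 \cdot 0.24 + 4.75 \cdot 0.36 + 5.25 \cdot 0.16 + 5.75 \cdot 0.1 = 4.635$$
$$\sigma^2 \approx \frac{1}{50}\left[7(3.5 - 4.635)^2 + 12(4.25 - 4.635)^2 + 18(4.75 - 4.635)^2 + 8(5.25 - 4.635)^2 + 5(5.75 - 4.635)^2\right] = 0.405525$$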
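The grouped-data approximations can be verified numerically. This is a sketch only, assuming class midpoints represent each class and linear interpolation is used inside the median class; the variable names are illustrative.

```r
breaks <- c(3, 4, 4.5, 5, 5.5, 6)
counts <- c(7, 12, 18, 8, 5)
n <- sum(counts)
rel_freq <- counts / n
mid <- (head(breaks, -1) + tail(breaks, -1)) / 2   # class midpoints

# Median: linear interpolation inside the median class 4.5-5 (width 0.5)
cum_before <- sum(rel_freq[1:2])                   # cumulative frequency up to 4.5
median_approx <- 4.5 + 0.5 * (0.5 - cum_before) / rel_freq[3]

# Mean and variance with every observation placed at its class midpoint
mean_approx <- sum(mid * rel_freq)
var_approx <- sum(counts * (mid - mean_approx)^2) / n

print(c(median = median_approx, mean = mean_approx, variance = var_approx))
```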
2. The distance X, measured in thousands of kilometers, that a model of electric car can travel with a newly charged battery is a random variable with density function
$$f_X(x) = \frac{c}{(1+x)^2}\, I_{[0,\infty)}(x) = \begin{cases} \dfrac{c}{(1+x)^2}, & \text{if } x \ge 0, \\ 0, & \text{if } x < 0. \end{cases}$$
Answer
Syllabus topic: distributions and random variables. Teaching outcome: the application is new but similar exercises have been done.
a. Find the values of parameter c for which $f_X(x)$ is a valid probability density function.
4 marks
Answer
2 marks for stating the condition, 2 marks for finding the value.
$$\int_0^{\infty} \frac{c}{(1+x)^2}\,dx = 1 \iff \left[-\frac{c}{1+x}\right]_0^{\infty} = 1 \iff c = 1.$$
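b. Calculate the distribution function $F_X(x)$.
4 marks
Answer
2 marks for the expression of the integral, 2 marks for solving it.
$$F_X(x) = P(X \le x) = \begin{cases} 0, & \text{if } x < 0, \\ \displaystyle\int_0^x \frac{1}{(1+y)^2}\,dy = 1 - \frac{1}{1+x}, & \text{if } x \ge 0. \end{cases}$$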
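The normalising constant and the distribution function can be checked by numerical integration. The sketch below assumes c = 1 and evaluates the distribution function at an arbitrary point x = 2; the function names f_X and F_X are illustrative.

```r
# Density with c = 1 and the distribution function from part b
f_X <- function(x) ifelse(x >= 0, 1 / (1 + x)^2, 0)
F_X <- function(x) ifelse(x >= 0, 1 - 1 / (1 + x), 0)

# The density should integrate to 1 over [0, Inf)
total_mass <- integrate(f_X, lower = 0, upper = Inf)$value

# Compare the closed form of F_X with numerical integration at x = 2
F_numeric <- integrate(f_X, lower = 0, upper = 2)$value

print(c(total_mass = total_mass, F_numeric = F_numeric, F_closed = F_X(2)))
```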
c. Calculate the probability that the electric car can travel at least 1 thousand kilometers.
4 marks
Answer
$$P(X \ge 1) = 1 - P(X < 1) = 1 - F_X(1) = \frac{1}{1+1} = 0.5$$
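d. Calculate the first quartile Q1 and the third quartile Q3 of X.
6 marks
Answer
3 marks each. For $0 < \alpha < 1$ the quantile $q_\alpha$ is given by $F_X(q_\alpha) = \alpha$, i.e. $1 - \frac{1}{1+q_\alpha} = \alpha$, therefore $q_\alpha = \frac{\alpha}{1-\alpha}$. Then, $Q_1 = q_{0.25} = \frac{1}{3}$ and $Q_3 = q_{0.75} = 3$.
e. Calculate the probability that the distance travelled with a newly charged battery is between Q1 and Q3.
2 marks
Answer
$P(Q_1 \le X \le Q_3) = 0.5$ by definition.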
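Parts c to e can be checked with the closed-form distribution and quantile functions. A minimal sketch, assuming c = 1; the names F_X and q_X are illustrative.

```r
F_X <- function(x) ifelse(x >= 0, 1 - 1 / (1 + x), 0)
q_X <- function(alpha) alpha / (1 - alpha)   # inverse of F_X for 0 < alpha < 1

p_at_least_1 <- 1 - F_X(1)        # part c
Q1 <- q_X(0.25)                   # part d, first quartile
Q3 <- q_X(0.75)                   # part d, third quartile
p_between <- F_X(Q3) - F_X(Q1)    # part e

print(c(p_at_least_1 = p_at_least_1, Q1 = Q1, Q3 = Q3, p_between = p_between))
```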
3. Consider the probability density function
$$f_X(x) = \begin{cases} \theta x^{\theta - 1}, & \text{for } 0 < x < 1, \\ 0, & \text{else,} \end{cases}$$
with $\theta > 0$ an unknown parameter, and let X be a random variable with probability density function $f_X(x)$.
Answer
Syllabus topic: distributions, point estimation. Teaching outcome: The expression of the distribution is new but similar examples have been seen.
a. Calculate the expected value of X.
4 marks
Answer
2 marks for the expression of the integral, 2 marks for the value.
$$E[X] = \int_0^1 x f_X(x)\,dx = \int_0^1 \theta x^{\theta}\,dx = \frac{\theta}{\theta + 1}.$$
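b. Calculate the variance of X.
4 marks
Answer
2 marks for the expression of the integral, 2 marks for the value.
$$E[X^2] = \int_0^1 x^2 f_X(x)\,dx = \int_0^1 \theta x^{\theta + 1}\,dx = \frac{\theta}{\theta + 2}$$
and
$$\mathrm{Var}(X) = E[X^2] - (E[X])^2 = \frac{\theta}{\theta + 2} - \frac{\theta^2}{(\theta + 1)^2} = \frac{\theta}{(\theta + 2)(\theta + 1)^2}.$$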
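The closed forms for the mean and variance can be compared with numerical integration. The sketch writes theta for the unknown parameter and uses theta = 2, a value picked only for illustration, not one given in the paper.

```r
theta <- 2   # hypothetical parameter value, chosen only for this check
f_X <- function(x) theta * x^(theta - 1)   # density on (0, 1)

m1 <- integrate(function(x) x * f_X(x), lower = 0, upper = 1)$value     # E[X]
m2 <- integrate(function(x) x^2 * f_X(x), lower = 0, upper = 1)$value   # E[X^2]
var_numeric <- m2 - m1^2

# Closed forms from parts a and b
mean_formula <- theta / (theta + 1)
var_formula <- theta / ((theta + 2) * (theta + 1)^2)

print(c(m1 = m1, mean_formula = mean_formula,
        var_numeric = var_numeric, var_formula = var_formula))
```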
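Let $X_1, \dots, X_n$ be a random sample from a population with probability density function $f_X(x)$.
c. Define the sample mean $\bar{X}$ and compute its expected value as a function of $\theta$.
3 marks
Answer
1 mark for the definition, 2 marks for the expected value.
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \quad \text{and} \quad E[\bar{X}] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{\theta}{\theta + 1}.$$
d. Find the expression of the estimator for $\theta$ based on the method of moments.
3 marks
Answer
$$\bar{X} = \frac{\hat{\theta}}{1 + \hat{\theta}} \;\Rightarrow\; \hat{\theta} = \frac{\bar{X}}{1 - \bar{X}}.$$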
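A small simulation illustrates the method of moments estimator. Everything here is an assumption made for the illustration: theta denotes the unknown parameter, its true value is set to 2, the sample size is 10000, and the sample is generated by inverse transform sampling.

```r
set.seed(1)
theta_true <- 2   # hypothetical true parameter
n <- 10000        # hypothetical sample size

# If U ~ Uniform(0, 1), then U^(1/theta) has distribution function x^theta
# on (0, 1), i.e. density theta * x^(theta - 1)
x <- runif(n)^(1 / theta_true)

x_bar <- mean(x)
theta_mom <- x_bar / (1 - x_bar)   # method of moments estimator
print(theta_mom)
```

With a sample of this size the estimate should fall close to the assumed true value.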
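e. Find the expression of the maximum likelihood estimator for $\theta$.
6 marks
Answer
2 marks for the likelihood, 3 marks for the maximization, 1 mark for checking that it is a maximum.
$$L(\theta; X_1, \dots, X_n) = \prod_{i=1}^{n} \theta x_i^{\theta - 1} \quad \text{and} \quad l(\theta) = \ln L = n \ln \theta + (\theta - 1)\sum_{i=1}^{n} \ln x_i.$$
Then,
$$\frac{dl}{d\theta} = \frac{n}{\theta} + \sum_{i=1}^{n} \ln x_i = 0,$$
and this implies $\hat{\theta} = -\dfrac{n}{\sum_{i=1}^{n} \ln x_i}$. Moreover, $\dfrac{d^2 l}{d\theta^2} = -\dfrac{n}{\theta^2} < 0$, therefore the stationary point is a maximum and $\hat{\theta}$ is the maximum likelihood estimator.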
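The closed-form maximum likelihood estimator can be compared with a direct numerical maximisation of the log-likelihood. This is a sketch on simulated data; the true value theta = 2, the sample size and the search interval are assumptions made only for the illustration.

```r
set.seed(1)
theta_true <- 2   # hypothetical true parameter
n <- 10000        # hypothetical sample size
x <- runif(n)^(1 / theta_true)   # inverse transform sample from the density

# Closed-form maximum likelihood estimator
theta_mle <- -n / sum(log(x))

# Numerical maximisation of the log-likelihood over an assumed search interval
log_lik <- function(theta) n * log(theta) + (theta - 1) * sum(log(x))
theta_num <- optimize(log_lik, interval = c(0.01, 20), maximum = TRUE)$maximum

print(c(closed_form = theta_mle, numerical = theta_num))
```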
4. A company is investigating the production of one of their factories that produces metallic laminates. They measure the tensile strength of n samples of metallic laminates. They can assume that the measurements (in pounds per square inch, psi) $X_1, \dots, X_n$ are independent and normally distributed random variables with unknown mean $\mu$ and variance $\sigma^2 = 16\ \text{psi}^2$, known from previous experiments.
Answer
Syllabus topic: confidence intervals, hypothesis testing. Teaching outcome: the application is seen for the first time and it requires basic knowledge of confidence intervals and testing procedure.
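a. What is the minimum value of n such that the width of the 95% confidence interval for $\mu$ is not larger than 4 psi?
6 marks
Answer
3 marks for the expression of the width of the interval, 3 marks for the correct n. $X_1, \dots, X_n \sim N(\mu, \sigma^2)$, with $\sigma^2 = 16$. The 95% confidence interval for $\mu$ is $\left[\bar{x} - z_{1-0.975}\,\sigma/\sqrt{n},\ \bar{x} + z_{1-0.975}\,\sigma/\sqrt{n}\right]$, where $\bar{x}$ is the sample mean and $z_{1-0.975} = 1.96$ is the 0.975 quantile of the standard normal distribution. Its width is therefore $2 z_{1-0.975}\,\sigma/\sqrt{n} = 2 \cdot 1.96 \cdot 4/\sqrt{n}$, and $2 \cdot 1.96 \cdot 4/\sqrt{n} \le 4$ implies $n \ge 15.37$. The smallest n that satisfies this condition is n = 16.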
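The sample size calculation can be reproduced as follows. A minimal sketch with illustrative variable names, using the exact normal quantile rather than the rounded value 1.96.

```r
sigma <- 4          # known standard deviation, sqrt(16)
max_width <- 4      # largest admissible width of the interval, in psi
z <- qnorm(0.975)   # 0.975 quantile of the standard normal, about 1.96

# Width 2 * z * sigma / sqrt(n) <= max_width  <=>  n >= (2 * z * sigma / max_width)^2
n_bound <- (2 * z * sigma / max_width)^2
n_min <- ceiling(n_bound)
print(c(n_bound = n_bound, n_min = n_min))
```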
The employee tasked with the investigation decided to collect n = 20 measurements and they obtained a sample mean of $\bar{x} = 91$ psi.
b. Calculate the realization of the 95% confidence interval for $\mu$.
3 marks
Answer
$91 \pm 1.96 \cdot 4/\sqrt{20} = 91 \pm 1.75$, i.e. [89.25, 92.75].
c. The company guarantees to their client that the mean tensile strength of their product is 95 psi. Describe the appropriate hypothesis test to check this hypothesis. Could a client complain on the basis of the observed measurements?
5 marks
Answer
3 marks for the appropriate test, 2 marks for carrying out the test.
$H_0: \mu = 95$ vs $H_1: \mu \neq 95$; the null hypothesis can be rejected at the 5% level because $\mu_0 = 95$ is outside of the 95% confidence interval, so a client would have grounds to complain.
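The known-variance interval and the corresponding two-sided z test can be computed directly. The sketch assumes a two-sided alternative at the 5% level, matching the confidence interval argument; the variable names are illustrative.

```r
x_bar <- 91   # observed sample mean, in psi
sigma <- 4    # known standard deviation, sqrt(16)
n <- 20
mu0 <- 95     # mean strength guaranteed by the company

# 95% confidence interval with known variance
half_width <- qnorm(0.975) * sigma / sqrt(n)
ci <- x_bar + c(-1, 1) * half_width

# Two-sided z test of H0: mu = mu0
z_obs <- (x_bar - mu0) / (sigma / sqrt(n))
p_value <- 2 * pnorm(-abs(z_obs))
reject_h0 <- p_value < 0.05

print(ci)
print(c(z_obs = z_obs, p_value = p_value))
```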
d. Describe how the expression of the 95% confidence interval would change in the case where the variance of $X_1, \dots, X_{20}$ is unknown. Compute the realization of the 95% confidence interval in this case, knowing that the employee reported a sample variance of $17\ \text{psi}^2$.
6 marks
Answer
3 marks for the expression of the interval, 3 marks for the realization. If the measurements are independent normal random variables with unknown mean and variance, then
$$\frac{\bar{X} - \mu}{s/\sqrt{20}} \sim t_{19}, \quad \text{where } s^2 = \frac{1}{19}\sum_{i=1}^{20}\left(X_i - \bar{X}\right)^2.$$
The expression of the 95% confidence interval is $\bar{X} \pm t_{0.025,19}\, s/\sqrt{20}$, where $t_{0.025,19}$ is the 0.975 quantile of the Student t distribution with 19 degrees of freedom.
The realization of the confidence interval is $91 \pm 2.093 \cdot \sqrt{17}/\sqrt{20}$ psi $= 91 \pm 1.93$ psi.
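The t-based interval can be computed in R. This sketch assumes the reported sample variance uses the divisor n - 1 = 19, so that the standard error of the mean is s / sqrt(n) with n = 20; the variable names are illustrative.

```r
x_bar <- 91   # observed sample mean, in psi
s2 <- 17      # reported sample variance, in psi^2 (assumed divisor n - 1)
n <- 20

t_crit <- qt(0.975, df = n - 1)           # 0.975 quantile of t with 19 df
half_width_t <- t_crit * sqrt(s2 / n)     # t_crit * s / sqrt(n)
ci_t <- x_bar + c(-1, 1) * half_width_t

print(c(t_crit = t_crit, half_width = half_width_t))
print(ci_t)
```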
5. A researcher is studying the speed of growth for some plant species. Data are collected by measuring the size of plants (size, in cm) after they have been allowed to grow in a laboratory for a certain number of days (days). Data are collected in three different laboratories and this is denoted in R with a factor lab with levels A, B and C.
Answer
Syllabus topic: linear regression. Teaching outcome: the application is new but similar examples have been seen.
a. The researcher decides to fit a linear model to the data using R:
plant1 <- lm(size ~ days + days:lab)
summary(plant1)

Call:
lm(formula = size ~ days + days:lab)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.6464     5.3745   0.120    0.905
days         10.1404     0.3677  27.576  < 2e-16
days:labB     5.0770     0.3739  13.580 2.58e-13
days:labC    -3.1296     0.3739  -8.371 7.48e-09

Residual standard error: 15.09 on 26 degrees of freedom
Multiple R-squared: 0.9858, Adjusted R-squared: 0.9842
F-statistic: 601.5 on 3 and 26 DF, p-value: < 2.2e-16
Write down the mathematical expression of the linear regression model. What is the estimated daily increase in size for laboratory A, B and C? What is the estimate of the error variance?
6 marks
Answer
4 marks for the model, 2 marks for the estimated parameters. Let $Y_{ij}$ be the size and $X_{ij}$ be the number of days for the ith observation of laboratory j, $i = 1, \dots, n_j$, $j = A, B, C$. The model is then
$$Y_{ij} = \beta_0 + \beta_1 X_{ij} + \gamma_j X_{ij} + \varepsilon_{ij}$$
under the corner point constraint $\gamma_A = 0$, where $\varepsilon_{ij}$ are i.i.d. $N(0, \sigma^2)$.
The estimated daily increase in size for laboratory A is then $\hat{\beta}_1 = 10.14$. The ones for laboratories B and C are $\hat{\beta}_1 + \hat{\gamma}_B = 10.14 + 5.08 = 15.22$ and $\hat{\beta}_1 + \hat{\gamma}_C = 10.14 - 3.13 = 7.01$ respectively.
The estimate of the error variance is $s^2 = 15.09^2 = 227.71$.
Alternatively, the model can be defined using two dummy variables. Let $Y_k$ be the size of the kth observation, $Z_{k1}$ be the number of days for the kth observation, $Z_{k2}$ be equal to 1 if the kth observation comes from laboratory B and 0 else, and $Z_{k3}$ be equal to 1 if the kth observation comes from laboratory C and 0 else, for $k = 1, \dots, 30$. Then, the model is
$$Y_k = \beta_0 + \beta_1 Z_{k1} + \gamma_B Z_{k1} Z_{k2} + \gamma_C Z_{k1} Z_{k3} + \varepsilon_k,$$
where $\varepsilon_k$ are i.i.d. $N(0, \sigma^2)$.
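The way R parametrises this model can be seen on simulated data. The original data are not available, so the data frame plants, the assumed slopes and the noise level below are all hypothetical; only the model formula comes from the question.

```r
set.seed(42)
# Hypothetical data: 10 plants per laboratory, measured after 1 to 10 days
plants <- data.frame(
  days = rep(1:10, times = 3),
  lab  = factor(rep(c("A", "B", "C"), each = 10))
)
slope_true <- c(A = 10, B = 15, C = 7)   # assumed daily growth per laboratory
plants$size <- 0.5 +
  unname(slope_true[as.character(plants$lab)]) * plants$days +
  rnorm(nrow(plants), sd = 1)

# Common intercept, laboratory-specific slopes (laboratory A is the baseline)
fit <- lm(size ~ days + days:lab, data = plants)
b <- coef(fit)

# Daily increase per laboratory: baseline slope plus the interaction term
slopes <- c(A = unname(b["days"]),
            B = unname(b["days"] + b["days:labB"]),
            C = unname(b["days"] + b["days:labC"]))
sigma2_hat <- summary(fit)$sigma^2   # estimate of the error variance

print(slopes)
print(sigma2_hat)
```

The coefficient names mirror those in the summary output of the question: the slope for laboratory A is the days coefficient, and the slopes for B and C add the corresponding interaction terms.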
b. Provide a 95% confidence interval for the daily increase in size in laboratory A.
4 marks
Answer
3 marks for the expression of the interval, 1 mark for the computation. Using model plant1, the daily increase in laboratory A is the parameter $\beta_1$ and a 95% confidence interval for $\beta_1$ is given by
$$\hat{\beta}_1 \pm t_{0.025;26}\, \mathrm{se}(\hat{\beta}_1) = 10.1404 \pm 2.056 \cdot 0.3677 = 10.1404 \pm 0.756.$$
c. Provide an estimate for the size of a plant that has been allowed to grow for 10 days in laboratory B.
2 marks
Answer
Using model plant1, $\hat{Y} = 0.6464 + (10.1404 + 5.077) \cdot 10 = 152.82$.
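Both the interval of part b and the prediction of part c follow from the reported coefficients. A minimal sketch: the numbers are copied from the summary(plant1) output, and the variable names are illustrative.

```r
# Estimates and standard error copied from the summary(plant1) output
b0 <- 0.6464
b_days <- 10.1404
b_days_labB <- 5.0770
se_days <- 0.3677
df_resid <- 26

# Part b: 95% confidence interval for the daily increase in laboratory A
t_crit <- qt(0.975, df = df_resid)
ci_days <- b_days + c(-1, 1) * t_crit * se_days

# Part c: estimated size after 10 days in laboratory B
size_hat <- b0 + (b_days + b_days_labB) * 10

print(ci_days)
print(size_hat)
```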
d. Check the diagnostic plots in Figure 1 and define the quantities that appear on the x and y axes of these plots. Do you spot any problems with the model assumptions?
6 marks
Answer
2 marks each for the plots description, 1 mark each for the comments.
The fitted values are defined as $\hat{Y}_k = \hat{\beta}_0 + \hat{\beta}_1 Z_{k1} + \hat{\gamma}_B Z_{k1} Z_{k2} + \hat{\gamma}_C Z_{k1} Z_{k3}$ and the residuals are $\hat{\varepsilon}_k = Y_k - \hat{Y}_k$, for $k = 1, \dots, 30$. The plot of the residuals vs fitted values may suggest the presence of a quadratic trend in the residuals.
The Q-Q plot compares the sorted standardized residuals (y axis) with the corresponding theoretical quantiles of a standard normal distribution (x axis). The Q-Q plot does not highlight any problem with the normality assumption.
Figure 1: Diagnostics plots.
e. The researcher then tries to fit a second model which allows for different intercepts for the different laboratories:
plant2 <- lm(size ~ days * lab)
anova(plant1, plant2)
Analysis of Variance Table

Model 1: size ~ days + days:lab
Model 2: size ~ days * lab
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     26 5921.7
2     24 5103.0  2    818.72 1.9253 0.1677
Explain the test that is carried out by the anova command, specifying the null and the alternative hypothesis, the expression of the test statistic and how the p-value is computed. Which model is preferable?
7 marks
Answer
2 marks for the null and alternative hypothesis, 2 marks for the test statistic, 2 marks for the p-value, 1 mark for the conclusion. Let $M_0$ be the model fitted in R as plant1 and $M_1$ the model fitted as plant2. The anova command carries out a hypothesis test where $H_0$: data are generated from model $M_0$ vs $H_1$: data are generated from model $M_1$. The test statistic is
$$F_0 = \frac{(RSS_0 - RSS_1)/2}{RSS_1/24},$$
where $RSS_i$ denotes the residual sum of squares for model $M_i$, and under the null hypothesis $F_0 \sim F_{2,24}$. The p-value is computed as $P(F_{2,24} > F_0) = P(F_{2,24} > 1.9253) = 0.1677$. For all the usual levels, we do not have evidence to reject the null hypothesis and therefore we prefer model $M_0$ (plant1).
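The F statistic and its p-value can be recomputed from the residual sums of squares and residual degrees of freedom printed in the analysis of variance table. A minimal sketch with illustrative variable names; small differences from the printed values are due to rounding of the reported RSS.

```r
# Residual sums of squares and residual degrees of freedom from the anova table
rss0 <- 5921.7; df0 <- 26   # plant1 (null model)
rss1 <- 5103.0; df1 <- 24   # plant2 (alternative model)

# F statistic for comparing the nested models
F0 <- ((rss0 - rss1) / (df0 - df1)) / (rss1 / df1)

# p-value: upper tail of the F distribution with (df0 - df1, df1) degrees of freedom
p_value <- pf(F0, df1 = df0 - df1, df2 = df1, lower.tail = FALSE)

print(c(F0 = F0, p_value = p_value))
```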