ST221 Lecture Notes

Contents

1 Introduction to Linear Models
  1.1 Simple linear regression
  1.2 Terminology and notation
  1.3 From simple to multiple regression
  1.4 The class of linear models
  1.5 Summary of Chapter 1
2 Categorical predictor variables
  2.1 A single categorical predictor
  2.2 Adding a quantitative predictor
  2.3 Summary of Chapter 2

3 Least squares estimation
  3.1 The residual sum of squares
  3.2 Least squares estimation
  3.3 Example: simple linear regression
  3.4 Example: linear transformations
  3.5 A strategy for statistical modelling
  3.6 Summary of Chapter 3
4 Linearity
  4.1 Anscombe's Quartet
  4.2 Polynomial regression
  4.3 A linear model with a log-transformed predictor
  4.4 A model with an interaction
  4.5 Summary of Chapter 4
5 Residual analysis
  5.1 Model assumptions
  5.2 Residual plots
  5.3 The log-transformation
  5.4 The mammals dataset
  5.5 The trees dataset
  5.6 Summary of Chapter 5
6 The least squares estimator and the hat matrix
  6.1 Properties of the least squares estimator
  6.2 The hat matrix
  6.3 Properties of the residuals and the fitted values
  6.4 Summary of Chapter 6
7 Measures of influence
  7.1 Leverages
  7.2 Outliers
  7.3 Influence
  7.4 What to do with influential data points
  7.5 Summary of Chapter 7
8 The Gauss-Markov Theorem
  8.1 The Gauss-Markov Theorem
  8.2 The proof of the Gauss-Markov Theorem
  8.3 Summary of Chapter 8
9 The Normal Linear Model
  9.1 The Maximum Likelihood Estimator
  9.2 Unbiased estimator for the error variance
  9.3 Summary of Chapter 9
10 The T-statistic for normal linear models
  10.1 Useful results for the multivariate normal distribution
  10.2 Distributional properties of β̂ and s²
  10.3 Deriving the T-statistic
  10.4 Confidence intervals
  10.5 Summary of Chapter 10
11 The t-test for normal linear models
  11.1 Hypothesis testing
  11.2 A simple example in R
  11.3 Comparison with other t-tests
  11.4 Estimation and prediction
  11.5 Summary of Chapter 11
12 Review on the normal linear model
  12.1 The normal linear model
  12.2 Assessing normality
  12.3 Limitations of the t-test
  12.4 Summary of Chapter 12
13 The F-test and ANOVA
  13.1 The decomposition of the total sum of squares
  13.2 ANOVA table for the existence of regression
  13.3 Illustration in R
  13.4 The distribution of the F-statistic
  13.5 Summary of Chapter 13
14 More on F-tests and ANOVA
  14.1 Sequential ANOVA
  14.2 Factor variables
  14.3 Test for non-linearity
  14.4 Summary of Chapter 14
15 Model selection criteria and variable selection
  15.1 What makes a good model?
  15.2 Parsimony and plausibility
  15.3 The model hierarchy
  15.4 Model selection statistics
  15.5 Variable selection
  15.6 Multicollinearity
  15.7 Summary of Chapter 15
16 The General Linear Model
  16.1 The generalised least squares estimator
  16.2 Weighted regression
  16.3 AR(1) Errors
  16.4 Introduction to Generalised Linear Models
  16.5 Summary of Chapter 16
  16.6 Recommended reading for Chapter 16
17 Generalised Linear Models
  17.1 GLMs for binary data
  17.2 Hypothesis testing
  17.3 Poisson GLM
  17.4 Influence and multi-collinearity in GLMs
  17.5 Summary of Chapter 17
  17.6 Recommended reading for Chapter 17
Appendix
  Appendix A - Linear algebra and multivariable calculus
  Appendix B - Generalised expectation
  Appendix C - M-estimators and robust regression (optional material)
  Appendix D - Solutions to Exercises (Chapters 1 to 17)

All rights are reserved.
To obtain a pdf version of these notes, click on the pdf icon in the top bar.
These materials are solely for your own use and you must not distribute these in any format. Do not upload these materials to the internet or any filesharing sites nor provide them to any third party or forum.
If you find any typos in these notes, please inform the module leader by submitting the details to the ST221 anonymous submission form.

1 Introduction to Linear Models
Linear models are statistical models that describe the relationship between a single, quantitative outcome variable and one or more explanatory variables. Statistical modelling often uses linear models because, as we will see, they comprise a surprisingly large class of useful models. The mathematical theory underpinning linear models is well developed, as it is analytically tractable, and we will spend a substantial amount of this module deriving this theory. To do this we will use tools from linear algebra, as is common for machine learning algorithms in general (a fact that is alluded to in this comic by xkcd.com).
Linear models have many nice properties, some of which we will learn about in later chapters. And finally, it is numerically relatively easy to fit linear models, and we will use well-established functions in R to do so.
The module ST104 Statistical Laboratory introduced the simplest form of linear model, namely the simple linear regression model. In the current chapter we will start with simple linear regression and then explain how to extend it to regression models with several explanatory variables. The focus of the initial sections of this chapter is to give an intuitive understanding and overview of the subject. This means that we will defer some of the more technical details until later.
The following videos discuss material from Chapter 1.
Sections 1.1 and 1.2:
– Introduction to simple linear regression [15 min]
– Example: Solitaire diamond rings [15 min]
Sections 1.3 and 1.4:
– From simple to multiple linear regression [13 min]
– Multiple regression with quantitative predictors [18 min]

1.1 Simple linear regression
Let's start with an example relating to data on 48 Solitaire gold rings. A Solitaire ring is a ring mounted with a single diamond. The data were originally collected by Dr Singfat Chu¹ and are available as the dataset diamond in the R library UsingR. (For a more thorough analysis of this dataset, including a review of simple linear regression, please see this revision tutorial.)
Figure 1.1 shows a plot of the price of each ring in Singapore Dollars against the weight of its diamond. The weight of a diamond is measured in carat, where 1 carat corresponds to 20 milligrams.
Figure 1.1: Scatterplot of the diamond data (price of ring in S$ against weight of diamond in carat).
Scatterplots like these are helpful in illustrating the relationship between two quantitative variables. We observe that the price of a Solitaire ring depends on the weight of its diamond: the heavier the diamond, the more expensive the ring tends to be. No surprise there! While the figure visually illustrates how the price of a ring depends on the weight of its diamond, it would certainly be useful to have a more concise mathematical representation of that relationship.
Considering the scatterplot again, we observe that the relationship between price and weight is linear, that is, the points in the scatterplot lie close to a straight line. In Figure 1.2 I have added the line of best fit to the scatterplot, thus illustrating the linearity of the relationship between the price of a ring and the weight of its diamond.
However, we also note that the relationship is not deterministic, that is, the points in the plot do not lie exactly on the straight line, but instead are scattered around it. Or, in other words, given the weight of the diamond we can use the line to produce a reasonable estimate of the price of the corresponding Solitaire ring, but we cannot predict the price exactly.

Figure 1.2: Scatterplot of the diamond data with fitted regression line.

¹ Chu, S. (1996) Diamond ring pricing using linear regression. Journal of Statistics Education, 4(3).
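If you would like to reproduce a figure like Figure 1.2 yourself, a minimal sketch in base R is given below. The axis labels are my own wording, and abline() adds the line of best fit that we compute formally later in the module.

library(UsingR)   # provides the diamond dataset

# Scatterplot of price against weight, with the line of best fit added
plot(price ~ carat, data = diamond,
     xlab = "weight of diamond (in carat)",
     ylab = "price of ring (in S$)")
abline(lm(price ~ carat, data = diamond))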
In the above, we have implicitly decomposed the relationship between the price of a ring and the weight of its diamond into two parts:
• a deterministic or systematic part, in this example a straight line; and
• a random error that describes how the observations scatter around the systematic part.

Let

h(weight) = β0 + β1 weight

define the straight line that forms the systematic component of the relationship between price and weight. Furthermore, let ε denote the random error; then the simple linear regression model for the price of a Solitaire ring is defined by the equation
price = β0 + β1 weight + ε. (1.1)
We assume that the error term is zero on average, which means that the average price of a ring is determined by the systematic component and thus by the weight of its diamond. We will make this definition more precise later on, but for now it serves as an illustration of how a linear model comprises a systematic component and a random component.
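To see the systematic and random components separately, here is a small simulation sketch. The parameter values and error spread are chosen purely for illustration; they are my own choices, not estimates from the diamond data.

set.seed(1)                                  # make the simulation reproducible
beta0 <- -260                                # illustrative intercept
beta1 <- 3720                                # illustrative slope
weight <- runif(48, min = 0.15, max = 0.35)  # 48 diamond weights in carat
epsilon <- rnorm(48, mean = 0, sd = 30)      # random error, zero on average
price <- beta0 + beta1 * weight + epsilon    # systematic part plus random error
plot(weight, price)
abline(beta0, beta1)                         # the true systematic straight line

Running this a few times with different seeds shows how the points scatter around the same fixed straight line, which is exactly the decomposition described above.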
The intercept β0 and the slope β1 in (1.1) are referred to as the parameters of the model. To be able to make practical use of the model, we need to determine reasonable values for β0 and β1. We can find these by determining the line of best fit. (We will discuss later how this line is determined mathematically. For now, rest assured that R will do all the hard work for you!)
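For the curious, a preview of the mathematics deferred to Chapter 3: for simple linear regression, the least squares slope is the sample covariance of predictor and response divided by the sample variance of the predictor, and the intercept then follows from the sample means. The sketch below computes these directly for the diamond data; treat it as a peek ahead rather than part of the main development.

library(UsingR)   # for the diamond dataset

# Closed-form least squares estimates for simple linear regression
b1 <- cov(diamond$carat, diamond$price) / var(diamond$carat)  # slope
b0 <- mean(diamond$price) - b1 * mean(diamond$carat)          # intercept
c(intercept = b0, slope = b1)   # agrees with coef(lm(price ~ carat, data = diamond))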
To determine the line of best fit in R we use the command

lm(response ~ predictor, data = dataset)
Then the summary function summary(model) gives us the details of the fitted model, in particular the estimated values for the parameters β0 and β1. The R code given below starts by loading the package UsingR that provides the dataset diamond. (If you try this code and get an error message, then this is likely due to not having installed the package UsingR. Use the command install.packages(“UsingR”) to install it, then try again.) Note that in the dataset the weight variable is called carat.
library(UsingR)
SLR.diamond <- lm(price ~ carat, data = diamond)
summary(SLR.diamond)

Call:
lm(formula = price ~ carat, data = diamond)

Residuals:
    Min      1Q  Median      3Q     Max
-85.159 -21.448  -0.869  18.972  79.370

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -259.63      17.32  -14.99   <2e-16 ***
carat        3721.02      81.79   45.50   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 31.84 on 46 degrees of freedom
Multiple R-squared:  0.9783, Adjusted R-squared:  0.9778
F-statistic:  2070 on 1 and 46 DF,  p-value: < 2.2e-16

Substantial output is produced and we will learn in this module what it all means. For now, just consider the column Estimate in the middle table, which lists information about the coefficients. These are the intercept and slope of the line of best fit, with the intercept given as -259.63 and the slope as 3721.02.

So what does this tell us about the relationship between the price of a Solitaire ring and the weight of its diamond? A naive interpretation of the type that you may have learned in school is as follows.

• Intercept: the average price of a Solitaire ring whose diamond has zero weight is -260 Singapore Dollars.

• Slope: On average, the price of a Solitaire ring increases by 3721 Singapore Dollars for every additional carat of weight of its diamond.

(Note that I have rounded the coefficients to something that seems sensible in this context. And, before you ask, I will not give you hard and fixed rules about rounding, but instead expect you to do something that is sensible in a given context.)

I see two issues with this interpretation. Firstly, the heaviest diamond in the dataset is 0.35 carat, so considering a 1 carat increase seems rather extreme. Secondly, it does not seem entirely sensible to consider the price of a Solitaire ring whose diamond has zero weight, but even so, what does a negative average price mean?

To address the second of these two concerns we might consider fitting a line that is forced to go through the origin and thus avoids producing negative prices. We can do this in R using the command

lm(response ~ 0 + predictor, data = dataset)

But when we do this we observe that the fitted line (depicted in Figure 1.3 below as a solid red line) provides a worse fit to the data than the original fitted line (depicted as a dashed blue line).

Figure 1.3: Comparison between a regression through the origin (solid red line) and a regression with arbitrary intercept (dashed blue line), both fitted to the diamond data.

So let's consider an alternative approach. Suppose avg_weight denotes the average weight of the diamonds in the dataset and we specify the model as

price = β0 + β1 (weight − avg_weight) + ε. (1.2)

So instead of weight we are now using the difference between weight and the average weight as our predictor variable. We can implement this in R using the command

lm(response ~ I(predictor - mean(predictor)), data = dataset)

The I() tells R to evaluate the expression in brackets before using it as an explanatory variable. We then apply the command

coef(model)

to obtain the estimated coefficients.

slr.diamond2 <- lm(price ~ I(carat - mean(carat)), data = diamond)
coef(slr.diamond2)

           (Intercept) I(carat - mean(carat))
              500.0833              3721.0249

Note how the estimate of the intercept has changed but the estimate of the slope has remained the same.² So how do we interpret this model? The predictor variable is now the difference between the weight of the diamond under consideration and the average weight of diamonds.
So this predictor is zero when we consider a diamond of average weight, which, in this dataset, is 0.204 carat. This leads to the following interpretation of the intercept in this new model:

• The expected price of a Solitaire ring with a diamond of average weight, that is 0.204 carat, is approximately 500 Singapore Dollars.

Of course, if we do consider a Solitaire ring with a diamond of zero weight, then the model still gives us a negative price. However, aside from this not making any practical sense, when considering a zero weight we are extrapolating substantially beyond the range of the data, and we simply cannot assume that the relationship described by the fitted model extends beyond the range of the data.

Another interesting property to note is that if we compute the sample mean of price then we find that it is equal to the above intercept of 500.08 Singapore Dollars.³

Next consider the interpretation of the slope coefficient. Earlier we criticised that using a 1 carat increase in weight leads to extrapolation, so let's choose the more sensible increase of 0.1 carat.

• On average, the price of a Solitaire ring increases by 372 Singapore Dollars for every additional 0.1 carat of weight.

So far we have covered material that most of you will be familiar with. However, the above discussion exemplifies a skill that you need to have or develop to do well in this module. And that is, you will need to be able to move confidently between a mathematical description of a model, its implementation in R and its interpretation within the application context.

² We will learn later in the module why this is the case!

³ This is related to the fact that in simple linear regression, the fitted line goes through the point whose x-coordinate is the sample mean of the explanatory variable and whose y-coordinate is the sample mean of the response variable. We will see a formal proof of this later in the module.

Before we move on, below are some exercises to help you assimilate and practice the material discussed so far.

Note: As part of this module we will be looking at various real-life data sets. Many of them will be historical data sets collected at a time when a binary (female/male) classification of gender was standard. In recent years, data collection on gender identity often provides options beyond a binary classification; see for example the changes made for the 2021 UK Census as discussed by the Office for National Statistics. But it may take a little while until such datasets become more widely available for teaching purposes.

1. Show numerically that the fitted model in (1.2) leads to the same model equation as the original model in (1.1). We say that the model in (1.2) is a reparameterisation of the model in (1.1). (One possible numerical check is sketched after these exercises.)

2. Fit a simple linear regression model to the father.son dataset.
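For Exercise 1, the following sketch shows one possible starting point for the numerical check; it is an illustration of the idea, not a model solution, and the object names fit1 and fit2 are my own.

library(UsingR)                                              # diamond dataset

fit1 <- lm(price ~ carat, data = diamond)                    # model (1.1)
fit2 <- lm(price ~ I(carat - mean(carat)), data = diamond)   # model (1.2)

# Expanding (1.2): price = (intercept - slope * avg_weight) + slope * weight + error,
# so the implied intercept and slope should match those of model (1.1):
b2 <- coef(fit2)
c(implied.intercept = unname(b2[1] - b2[2] * mean(diamond$carat)),
  implied.slope     = unname(b2[2]))
coef(fit1)   # the same values, up to rounding

Comparing the two printed vectors confirms numerically that the centred model is a reparameterisation of the original one.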