Statistical Modelling
1.0 – Statistical modelling
Resources for this section
Cf. “R basics” appendix for a brief recap of R basics
So many resources are available online… you could start with:
CRAN documents (manuals, contributed docs):
https://cran.r-project.org/manuals.html
https://cran.r-project.org/other-docs.html
Hadley Wickham’s website: http://hadley.nz/
Garrett Grolemund (RStudio)’s Twitter account: https://twitter.com/statgarrett?lang=en
Go through the ST4060/ST6015/ST604 warm-up workout: https://eric-wolsztynski.shinyapps.io/learnr0_basicR
Motivation for modelling
There are broadly two kinds of objectives in modelling:
Describe and characterize a random behaviour within the data
Describe and characterize patterns of interest within the data
1.1 – Distribution models
Modelling a random variable…
[Histogram: Baby data, fathers’ ages (N=1236); x-axis: father’s age, y-axis: density]
[Histogram: Baby data, birth weights (N=1236); x-axis: birth weight in ounces, y-axis: density]
Dataset: http://www.stat.berkeley.edu/users/statlabs/labs.html.
Stat Labs: Mathematical Statistics through Applications, Springer-Verlag (2001), by Deborah Nolan and Terry Speed.
Campbell, N.A. and Mahon, R.J. (1974) A multivariate study of variation in two species of rock crab of genus Leptograpsus. Australian Journal of Zoology 22, 417–425
[Histogram: carapace length of blue crabs (N=100); x-axis: carapace length (mm), y-axis: density]
[Histogram: carapace length of orange crabs (N=100); x-axis: carapace length (mm), y-axis: density]
[Histogram: sepal width of Iris setosa (N=50); x-axis: sepal width (cm), y-axis: density]
[Histogram: sepal length of Iris setosa (N=50); x-axis: sepal length (cm), y-axis: density]
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
Distributions in R
Synopsis (where distr abbreviates a chosen distribution):
rdistr(n, …) – random generation of n realizations
ddistr(x, …) – probability density function f(x)
pdistr(q, …) – cumulative distribution function F(q)
qdistr(p, …) – quantile function F⁻¹(p)
Example with the Normal distribution:
rnorm(10) randomly generates 10 realizations of N (0, 1)
dnorm(0) == 1/sqrt(2*pi) should return TRUE
pnorm(1.645) returns 0.9500151
qnorm(0.95) returns 1.644854
?Distributions for a list of available distributions: dbeta,
dbinom, dexp, dpois, dlnorm, dunif, dt, dcauchy, etc.
See also:
http://cran.r-project.org/web/views/Distributions.html
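These four functions are mutually consistent; a minimal sketch checking this numerically, using only base R (the exponential example simply illustrates that the same synopsis applies to other distributions):

```r
dnorm(0) == 1/sqrt(2*pi)            # TRUE: density of N(0,1) at its mode
qnorm(0.95)                         # 1.644854: the 95% quantile of N(0,1)
pnorm(qnorm(0.95))                  # 0.95: pdistr and qdistr are inverses

# rdistr agrees with qdistr: about 95% of draws fall below the 95% quantile
mean(rnorm(1e5) <= qnorm(0.95))     # close to 0.95

# Same synopsis for the exponential distribution
pexp(qexp(0.5, rate=2), rate=2)     # 0.5
```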
Random generation
rnorm(2,1,3) generates 2 realizations of N(µ = 1, σ = 3)
The following loop:
par(mfrow=c(1,3))
for(i in 1:3){v=rnorm(50); hist(v)}
plots the histograms of 3 samples of 50 realizations of N(0, 1)
This is how new samples can be generated in a loop
set.seed(s) may be used to fix pseudo-random generation:
set.seed(1); v=rnorm(10);
set.seed(1); w=rnorm(10);
(using seed s=1) will ensure that v=w
This is particularly useful to reproduce simulated data exactly
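A minimal sketch of how seeding controls the stream of pseudo-random numbers (base R only):

```r
set.seed(1); v <- rnorm(10)
set.seed(1); w <- rnorm(10)   # resetting the seed replays the same stream
identical(v, w)               # TRUE

v2 <- rnorm(10)               # continuing the stream gives new values
identical(v, v2)              # FALSE
```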
Examining the distribution of a dataset
summary(object) returns statistics for the input dataset:
R> summary( rnorm(10) )
may return something like
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.56300 -0.30350 -0.11910 -0.01992 0.44800 1.10500
summary() also returns a summary of the output of many
fitting functions
Plots: hist(), stem(), density(), boxplot()…
QQ-plots: qqnorm(), qqplot()
shapiro.test() (Shapiro-Wilk normality test)
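A sketch of a typical workflow combining these tools on a simulated sample (all functions from base R / stats):

```r
set.seed(1)
x <- rnorm(100)

summary(x)                 # five-number summary plus the mean
hist(x, freq=FALSE)        # histogram on the density scale
lines(density(x))          # overlay a smooth density estimate
qqnorm(x); qqline(x)       # sample quantiles vs theoretical N(0,1) quantiles
shapiro.test(x)            # a large p-value gives no evidence against normality
```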
A simple example: empirical CDF
x <- rnorm(30)                  # [from addictedtor.free.fr]
F10 <- ecdf(x)
plot(F10, verticals=TRUE, do.points=FALSE, lwd=2)
curve(pnorm, from=-5, to=5, add=TRUE, col="gray70")
rug(x)                          # plots the locations of x’s below the curve

[Plot: ecdf(x) with the N(0, 1) CDF overlaid; x-axis: x, y-axis: Fn(x)]

1.2 – Linear regression

Linear regression is a central tool of statistical analysis. It is used extensively and is often a key component within a more complex analytical procedure. The approach was notably extended to generalized linear modelling (GLM) and nonlinear regression, to address a large range of problems.
Regression analysis is critical to a large number of methodological aspects, including:
trend estimation and prediction
diagnostic tests for model selection
benchmarking and (methodological) explorative analysis

Linear regression
In linear regression models, observations depend linearly on the parameters of interest:
Y = θ0 + θ1X + ε
where ε are typically i.i.d. realizations of a Gaussian r.v.

[Scatterplot: distance vs speed, cars data]
If εi ∼ N(0, σ²), then Yi|Xi ∼ N(θ0 + θ1Xi, σ²)
Least squares is usually used to estimate θ = (θ0, θ1)
When the error term ε is not assumed Gaussian, maximum likelihood estimation is the more generic framework
MLE = OLS when ε ∼ N(0, σ²)

Regression analysis with R
Linear regression via least squares is performed in R using lm:
lm(y~x) implements fitting of y = β0 + β1x + ε
lm(y~x+0) implements fitting of y = β1x + ε
Example: compare
summary(lm(dist~speed, data=cars))
summary(lm(dist~speed+0, data=cars))
Example: the following least squares estimation
x = c(1:10); y = 3*x + rnorm(10); lm(y~x)
may return estimates β̂0 = −0.8818 and β̂1 = 3.0975

One can capture the output of lm, inspect it, and use it:
out = lm(y~x)        # capture output
summary(out)         # inspect output
plot(out$residuals)  # plot residuals from model fit
acf(out$residuals)   # check these residuals!
etc.

1.3 – Extensions of the linear model

Generalised Linear Models (GLM)
GLM is an extension of the simple linear model allowing for:
discrete response variable Y (e.g. Poisson observations)
non-Gaussian variable Y (by adapting the distribution used in the linear model via a link function)
categorical covariates X (e.g. binary variables)
Example: Y = θ1X1 + θ2X2 + ε, where
Y ∈ R: patient survival time
X1 = white blood cell counts
X2 = presence of a morphological characteristic (ag) in leukemia patients
In R:
library(MASS) # contains the leuk dataset
summary(glm(time~log(wbc)+ag, family="Gamma", data=leuk))

Logistic regression
Logistic regression models binary response variables Y
Particular case of GLM: it allows for categorical covariates
Logistic regression is therefore a classification model
Example: Y = θ1X1 + θ2X2 + ε, where
Y ∈ {0, 1}: Y = 0 if subject is a blue crab, Y = 1 if orange
X1 = carapace length (CL, in mm)
X2 = carapace width (CW, in mm)
In R (sp corresponds to crab species Y):
library(MASS) # contains the crabs dataset
summary(glm(sp~CL+CW, family="binomial", data=crabs))

1.4 – Nonparametric density estimation

Nonparametric Statistics?
Nonparametric may mean that:
the method does not assume an underlying distribution (e.g. rank statistics)
the model is allowed to change with the data (e.g. nonparametric regression)
Some nonparametric methods are also sometimes termed adaptive, because they adapt to the underlying distribution of the data.

Nonparametric density estimation
Estimate a density f(x) with a smooth object f̂(x), using e.g. a Kernel Density Estimator (KDE):
Consider sample points as “kernels”
Centre a given “template” density of size h at each kernel:
Kh(u − xi) = (1/h) K((u − xi)/h)
Sum the contributions of all these kernel densities:
f̂(u) = (1/N) Σ_{i=1}^N Kh(u − xi)

[Figure: KDE built from N = 8 sample points xi (∗), showing the kernels K((x − xi)/h) and the resulting estimate (1/N) Σ_{i=1}^N Kh(x − xi)]

Kernel Density Estimation
In R, estimate the pdf of a sample x by density(x). Example:
xs = rnorm(10)
fx = density(xs)
names(fx)
plot(fx)
points( density(xs, bw=2*fx$bw) )
lines(fx$x, fx$y, col='blue', lwd=2)

Visualize the influence of the choice of bandwidth h: kernel densities are in blue, the final density estimate is in red.
[Figure: two KDE panels with N = 8 sample points, h = 0.065 (left) and h = 0.13 (right)]

Other common nonparametric density estimators
Naive estimators, using F_N(u) = (1/N) Σ_{i=1}^N 1(Xi ≤ u):
f_N(u) = 1/(Nh) Σ_{i=1}^N W((u − Xi)/h)
where W(u) = 1(|u| < 1)/2 is a rectangular kernel
k-th nearest neighbour estimators (k ≥ 2):
f̂_N(u) = k/(2N dk(u)) = 1/(N dk(u)) Σ_{i=1}^N K((u − Xi)/dk(u))
where the dk(u) = |u − X[k]| are in increasing order, and ∫K(u)du = 1
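The kernel density estimator f̂(u) = (1/N) Σ Kh(u − xi) defined above can be implemented directly and checked against R’s density(); a sketch using a Gaussian kernel (K = dnorm) and the bandwidth chosen by density():

```r
set.seed(1)
x  <- rnorm(8)                       # N = 8 sample points, as in the figures
fx <- density(x, kernel="gaussian")  # reference KDE from base R
h  <- fx$bw                          # bandwidth selected by density()

# f.hat(u) = (1/N) * sum_i (1/h) * K((u - x_i)/h), with K = dnorm
kde   <- function(u) mean(dnorm((u - x)/h) / h)
f.hat <- sapply(fx$x, kde)

# density() uses a fast binned approximation, so agreement is close
# but not exact
max(abs(f.hat - fx$y))               # small
```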