Smoothing
Jennifer Wilcock STAT221 2020 S2
Reminder: Density estimation
We focussed on single variables, and looked at two distinct approaches to exploring and understanding the structure of the distribution of the variable being considered:
1. we considered histograms and some methods for ‘smoothing’ histograms, in particular frequency polygons and ASH plots,
2. we then considered kernel density estimation, which takes a distinctly different approach of fitting a kernel around each individual datapoint, and then summing the contribution to the density of each of these kernels.
set.seed(35)
x = rnorm(1000, 0, 1)
k = density(x) # default is Gaussian/Normal kernel
plot(k)
rug(x)
[Figure: the kernel density estimate of x, with a rug of the data points (N = 1000, bandwidth = 0.2275).]
Both a kernel density estimator and a smoothed histogram give a smoothed picture of the density of the variable, using the individual data points in the observed or simulated sample.
The goal of such a smoother is to smooth out the fine details of the sample, which we can assume are the result of sample-to-sample variability, while allowing the general structure of the sample to remain. We then assume that this general structure is essentially that of the population from which the sample was drawn.
Smoothing in the context of regression
Another context in which smoothing is important is regression-type problems. In a simple regression-type problem, there are two variables:
• an explanatory (or independent) variable, usually called x
• a response (or dependent) variable, usually called y.
In regression we use the explanatory variable x to predict the response variable y, and we usually do this
by assuming some form of relationship between x and y.
The simplest relationship is linear, where we assume a known form for the relationship between x and y, for example a straight line:
$$y_i = \alpha + \beta x_i + e_i$$
where α and β are unknown parameters that we estimate using the data, and the errors ei are assumed
to come from a Normal distribution. You will have looked at this in STAT101.
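As a reminder, this straight-line model can be fitted in R in one line. A minimal sketch, using the mtcars data that we will use below (the choice of variables is ours):

data(mtcars)
# Fit the straight line mpg = alpha + beta * wt + e by least squares
fit = lm(mpg ~ wt, data = mtcars)
coef(fit) # the estimates of alpha and beta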
In general, however, the regression relationship can be any function, which we can write as

$$y_i = g(x_i) + e_i$$
and g is an unknown function that we estimate using the data. We can still assume the errors come from a Normal distribution, if we wish – what we will focus on here is that it is the function itself (rather than some parameters of a previously assumed function) that we estimate using the data.
This is called nonparametric regression.
The type of regression considered in STAT101, STAT201, STAT202, STAT318, at school, etc., is called parametric regression, and is based on analytical or mathematical approaches. Nonparametric regression is an essentially computational approach – it would not exist without computers.
With a smoother, we assume that this unknown regression function is smooth, in the sense that it is differentiable. In practice, however, only some smoothing methods make use of derivatives; others do not.
Scatter plots in regression
Before doing a regression, we usually plot the two variables against each other in a scatter plot. For example, we will think about the following data:
data(mtcars)
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon")
[Figure: scatter plot of Miles per gallon against Weight for the mtcars data.]
The running mean smoother
The running mean/running average/moving average is the simplest smoother.
For each given point x, as we move across the range of the x variable, we average the values of $y_i$ for each $x_i$ that lies within a distance h of x.
As before, h is called the bandwidth or smoothing parameter. We call the interval (x − h, x + h) the
smoothing window.
So we gradually move the smoothing window across the plot, taking the average $y_i$ value for all of the points that lie within the window. These average values, when plotted, form a 'smoothed' line.
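To make the mechanics concrete, here is a minimal hand-rolled sketch of the running mean (the function name running_mean, the evaluation grid, and the half-width h are our own choices for illustration):

# Running mean: at each grid point x0, average the y values whose x lies within h of x0
running_mean = function(x, y, h, n.out = 100) {
  x.out = seq(min(x), max(x), length.out = n.out)
  y.out = sapply(x.out, function(x0) mean(y[abs(x - x0) <= h]))
  list(x = x.out, y = y.out) # windows containing no data give NaN, leaving gaps
}
# e.g. lines(running_mean(mtcars$wt, mtcars$mpg, h = 0.25)) adds it to a plot

In practice we can use R's built-in ksmooth() function, which implements this smoother. (Note that ksmooth() scales its bandwidth so that the kernel quartiles sit at ±0.25 × bandwidth, so bandwidth = 0.5 in ksmooth() corresponds to a window half-width of 0.25 in the sketch above.)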
data(mtcars)
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon")
k = ksmooth(mtcars$wt, mtcars$mpg, bandwidth = 0.5)
lines(k, col = 1)
[Figure: the running mean smooth with bandwidth 0.5 overlaid on the scatter plot.]
Here the bandwidth was 0.5, which is quite wide relative to the range of the data for Weight. Even so, it doesn’t look very smooth.
We could try making the bandwidth wider:
data(mtcars)
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon")
k = ksmooth(mtcars$wt, mtcars$mpg, bandwidth = 1)
lines(k, col = 1)
[Figure: the running mean smooth with bandwidth 1 overlaid on the scatter plot.]
It’s ‘smoother’, but still not smooth.
There is a key problem with this smoother, which is why it doesn’t really get used in practice.
The reason it isn't smooth is that we are using an all-or-nothing / in-or-out decision about which $y_i$ values to use in the average. A given $y_i$ value is either included or not included, and immediately switches from being in to being out of the average calculation (or vice versa). As a result the average value suddenly jumps, as data points abruptly enter or leave the smoothing window.
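A tiny made-up example shows the jump (the numbers are invented purely for illustration):

x = c(1, 2, 3); y = c(10, 20, 60); h = 1
mean(y[abs(x - 1.99) <= h]) # window (0.99, 2.99) holds y = 10, 20: average 15
mean(y[abs(x - 2.01) <= h]) # window (1.01, 3.01) holds y = 20, 60: average 40

Moving the window only a tiny distance, from 1.99 to 2.01, changes the average from 15 to 40, because one point leaves the window and another enters.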
This is because the kernel being used here is the 'box' kernel, which is like the rectangular kernel in kernel density estimation – it's never going to be truly smooth.
General kernel smoothing
Instead, we use kernel smoothers that are similar to those used in kernel density estimation. We use these kernels to calculate a ‘weighted’ average.
For example, we can use a Gaussian kernel with the original bandwidth of 0.5:
data(mtcars)
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon")
k = ksmooth(mtcars$wt, mtcars$mpg, "normal", bandwidth = 0.5)
lines(k, col = 1)
[Figure: the Gaussian kernel smooth with bandwidth 0.5 overlaid on the Weight vs Miles per gallon scatter plot.]

The resulting curve is now 'smooth'.
We estimate g(x) by the kernel regression estimate
$$\hat{g}(x) = \frac{\sum_{i=1}^{n} y_i \, w\left(\frac{x - x_i}{h}\right)}{\sum_{i=1}^{n} w\left(\frac{x - x_i}{h}\right)}$$
where w is a known, fixed non-negative function that is symmetric about zero, called the kernel. (Recall that to find a kernel density estimate we called the kernels k, but they had the same properties.)
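As a sketch of how the formula can be applied directly, we can compute the estimate at a single point using the standard Normal density dnorm() as the kernel w (this illustrates the estimator above, and does not reproduce the internal bandwidth scaling used by ksmooth()):

# Kernel regression estimate g(x0) at a single point x0
g.hat = function(x0, x, y, h) {
  w = dnorm((x0 - x) / h) # kernel weights w((x - x_i) / h)
  sum(y * w) / sum(w) # weighted average of the y_i
}
g.hat(3, mtcars$wt, mtcars$mpg, h = 0.25) # estimated mpg at Weight = 3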
How does it work?
This is an example of a linear smoother.
Each $\hat{g}(x)$ is a weighted average of the $y_i$, which we can write as

$$p_1 y_1 + \cdots + p_n y_n$$

where the $p_i$ sum to one. The $p_i$ are the 'weights'

$$p_i = \frac{w\left(\frac{x - x_i}{h}\right)}{\sum_{j=1}^{n} w\left(\frac{x - x_j}{h}\right)}.$$
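Continuing the sketch above, we can compute these weights directly and check that they behave as claimed:

# Weights p_i for the estimate at x0 = 3 (Gaussian kernel, as in g.hat above)
w = dnorm((3 - mtcars$wt) / 0.25)
p = w / sum(w)
sum(p) # equals 1
sum(p * mtcars$mpg) # the same value as g.hat(3, mtcars$wt, mtcars$mpg, h = 0.25)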
When the kernel is the 'box' function, so that the kernel smoother is the running average we looked at earlier, the kernel w is defined as

$$w(x) = \begin{cases} 1, & -1 \le x \le 1 \\ 0, & \text{otherwise.} \end{cases}$$
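Plugging this box kernel into the weighted-average formula recovers the running mean, since every point inside the window gets equal weight (a closing sketch; the names w.box and g.box are our own):

w.box = function(u) as.numeric(abs(u) <= 1) # 1 inside the window, 0 outside
g.box = function(x0, x, y, h) {
  w = w.box((x0 - x) / h)
  sum(y * w) / sum(w) # equal weights inside the window: a plain average
}
g.box(3, mtcars$wt, mtcars$mpg, h = 0.25) # mean mpg for wt within 0.25 of 3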