Kernel Regression
Advanced Methods for Data Analysis (36-402/36-608) Spring 2014

1 Linear smoothers and kernels
• Recall our basic setup: we are given i.i.d. samples (x_i, y_i), i = 1, ..., n from the model

      y_i = r(x_i) + \epsilon_i,  i = 1, ..., n,

  and our goal is to estimate r with some function r̂. Assume for now that each x_i ∈ R (i.e., the predictors are 1-dimensional)

• We talked about considering r̂ in the class of linear smoothers, so that

      \hat{r}(x) = \sum_{i=1}^n w(x, x_i) \cdot y_i,    (1)

  for some choice of weights w(x, x_i). Indeed, both linear regression and k-nearest-neighbors are special cases of this
• Here we will examine another important linear smoother, called kernel smoothing or kernel regression. We start by defining a kernel function K : R → R, satisfying

      \int K(x) \, dx = 1, \qquad K(x) = K(-x).

  Three common examples are the box kernel:

      K(x) = 1/2 \ \text{if } |x| \le 1, \ \text{and } 0 \ \text{otherwise},

  the Gaussian kernel:

      K(x) = \frac{1}{\sqrt{2\pi}} \exp(-x^2/2),

  and the Epanechnikov kernel:

      K(x) = \frac{3}{4}(1 - x^2) \ \text{if } |x| \le 1, \ \text{and } 0 \ \text{else}

• Given a choice of kernel K, and a bandwidth h, kernel regression is defined by taking

      w(x, x_i) = \frac{K\big(\frac{x_i - x}{h}\big)}{\sum_{j=1}^n K\big(\frac{x_j - x}{h}\big)}

  in the linear smoother form (1). In other words, the kernel regression estimator is

      \hat{r}(x) = \frac{\sum_{i=1}^n K\big(\frac{x_i - x}{h}\big) \cdot y_i}{\sum_{i=1}^n K\big(\frac{x_i - x}{h}\big)}
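To make these formulas concrete, here is a minimal Python/NumPy sketch (my own illustration, not part of the original notes); the data, bandwidth, and helper names such as kernel_smooth are arbitrary choices. It implements the three kernels above and the kernel regression estimator (often called the Nadaraya-Watson estimator) as a weighted average of the y_i.

    import numpy as np

    # The three kernels above: each integrates to 1 and is symmetric, K(x) = K(-x)
    def box_kernel(t):
        return np.where(np.abs(t) <= 1, 0.5, 0.0)

    def gaussian_kernel(t):
        return np.exp(-t ** 2 / 2) / np.sqrt(2 * np.pi)

    def epanechnikov_kernel(t):
        return np.where(np.abs(t) <= 1, 0.75 * (1 - t ** 2), 0.0)

    def kernel_smooth(x0, x, y, h, K=gaussian_kernel):
        """Kernel regression estimate at x0: a weighted average of the y_i with
        weights w(x0, x_i) = K((x_i - x0)/h) / sum_j K((x_j - x0)/h)."""
        k = K((x - x0) / h)          # kernel heights at each x_i
        w = k / np.sum(k)            # normalized weights; they sum to 1
        return np.sum(w * y)

    # Tiny usage example on synthetic data (all specific choices here are arbitrary)
    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0, 1, 100))
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 100)
    print(kernel_smooth(0.5, x, y, h=0.1))   # estimate of r(0.5), which is 0 here

Note that with a compactly supported kernel (box or Epanechnikov) and a very small h, it can happen that no x_i falls within distance h of the query point, in which case the denominator above is 0 and the estimate is undefined; the Gaussian kernel avoids this.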
• What is this doing? This is a weighted average of the y_i values. Think about laying down a Gaussian kernel around a specific query point x, and evaluating its height at each x_i in order to determine the weight associated with y_i

• Because these weights are smoothly varying with x, the kernel regression estimator r̂(x) itself is also smoothly varying with x; compare this to k-nearest-neighbors regression

• What's in the choice of kernel? Different kernels can give different results. But many of the common kernels tend to produce similar estimators; e.g., Gaussian vs. Epanechnikov, there's not a huge difference

• A much bigger difference comes from choosing different bandwidth values h. What's the tradeoff present when we vary h? Hint: as we've mentioned before, you should always keep these two quantities in mind ...
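To illustrate these last two points, here is a rough, self-contained sketch (again my own, not from the notes; the data-generating function and the particular h values are arbitrary): it compares Gaussian and Epanechnikov fits at the same bandwidth, and then Gaussian fits at a very small versus a very large bandwidth.

    import numpy as np

    def nw_fit(x_grid, x, y, h, K):
        """Kernel regression evaluated on a grid of query points."""
        fits = []
        for x0 in x_grid:
            k = K((x - x0) / h)
            fits.append(np.sum(k * y) / np.sum(k))
        return np.array(fits)

    def gauss(t):
        return np.exp(-t ** 2 / 2) / np.sqrt(2 * np.pi)

    def epan(t):
        return np.where(np.abs(t) <= 1, 0.75 * (1 - t ** 2), 0.0)

    rng = np.random.default_rng(1)
    x = np.sort(rng.uniform(0, 1, 200))
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 200)
    grid = np.linspace(0.05, 0.95, 50)

    # Different kernels, same bandwidth: the fits differ a bit (the kernels'
    # effective widths are not identical), but the difference is modest
    fit_g = nw_fit(grid, x, y, h=0.1, K=gauss)
    fit_e = nw_fit(grid, x, y, h=0.1, K=epan)
    print("max |Gaussian - Epanechnikov| difference:", np.max(np.abs(fit_g - fit_e)))

    # Same kernel, very different bandwidths: very different fits
    fit_small = nw_fit(grid, x, y, h=0.01, K=gauss)   # wiggly (follows the noise)
    fit_large = nw_fit(grid, x, y, h=1.0,  K=gauss)   # nearly flat (oversmoothed)
    print("max |h=0.01 vs h=1.0| difference:", np.max(np.abs(fit_small - fit_large)))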
2 Bias and variance of kernels

• At a fixed query point x, recall our fundamental decomposition

      E[\mathrm{TestErr}(\hat{r}(x))] = E\big[ (Y - \hat{r}(x))^2 \,\big|\, X = x \big]
                                      = \sigma^2 + \mathrm{Bias}(\hat{r}(x))^2 + \mathrm{Var}(\hat{r}(x)).

  So what is the bias and variance of the kernel regression estimator?
• Fortunately, these can actually be worked out theoretically, at least roughly, under some smoothness assumptions on r (and other assumptions). We can show that

      \mathrm{Bias}(\hat{r}(x))^2 = \big( E[\hat{r}(x)] - r(x) \big)^2 \le C_1 h^2
      \quad \text{and} \quad
      \mathrm{Var}(\hat{r}(x)) \le \frac{C_2}{nh},

  for some constants C_1 and C_2. Does this make sense? What happens to the bias and variance as h shrinks? As h grows? This means that

      E[\mathrm{TestErr}(\hat{r}(x))] = \sigma^2 + C_1 h^2 + \frac{C_2}{nh}.
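These bounds can be checked qualitatively by simulation. The sketch below (my own illustration, not part of the notes) repeatedly simulates data from a known regression function, computes the Gaussian kernel estimate at a fixed query point, and reports the empirical squared bias and variance for a few bandwidths; the specific choices of r, sigma, n, and the h grid are all arbitrary.

    import numpy as np

    rng = np.random.default_rng(2)
    r = lambda x: np.sin(2 * np.pi * x)       # true regression function (known here)
    n, sigma, x0 = 200, 0.3, 0.25             # sample size, noise level, query point

    def nw_at(x0, x, y, h):
        k = np.exp(-((x - x0) / h) ** 2 / 2)  # Gaussian kernel weights
        return np.sum(k * y) / np.sum(k)

    for h in [0.02, 0.1, 0.5]:
        est = []
        for _ in range(500):                  # 500 simulated data sets
            x = rng.uniform(0, 1, n)
            y = r(x) + rng.normal(0, sigma, n)
            est.append(nw_at(x0, x, y, h))
        est = np.array(est)
        print(f"h={h}: squared bias ~ {(est.mean() - r(x0)) ** 2:.5f}, "
              f"variance ~ {est.var():.5f}")
    # Expected pattern: squared bias grows with h, variance shrinks (roughly like 1/(nh))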
• We can find the best bandwidth h, i.e., the one minimizing test error, by differentiating and setting equal to 0: this yields

      h = \Big( \frac{C_2}{2 C_1 n} \Big)^{1/3}.
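Spelling out the differentiation step (a standard calculation, written out here for completeness):

      \frac{d}{dh}\Big( \sigma^2 + C_1 h^2 + \frac{C_2}{nh} \Big)
        = 2 C_1 h - \frac{C_2}{n h^2} = 0
      \quad\Longrightarrow\quad
      h^3 = \frac{C_2}{2 C_1 n}
      \quad\Longrightarrow\quad
      h = \Big( \frac{C_2}{2 C_1 n} \Big)^{1/3}.

Plugging this h back in, both the C_1 h^2 term and the C_2/(nh) term scale like n^{-2/3}, so under these bounds the best achievable excess test error decays at the rate n^{-2/3}.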
• Is this a realistic choice for the bandwidth? Problem is that we don't know C_1 and C_2! (And even if we did, it may not be a good idea to use this ... why?)
3 Practical considerations, multiple dimensions

• In practice, we tend to select h by, you guessed it, cross-validation
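As a concrete sketch of what this looks like (my own illustration, not from the notes): leave-one-out cross-validation over a small grid of candidate bandwidths, with a Gaussian kernel; the grid, the data, and the function names are arbitrary.

    import numpy as np

    def nw_predict(x0, x, y, h):
        """Gaussian kernel regression estimate at a single query point x0."""
        k = np.exp(-((x - x0) / h) ** 2 / 2)
        return np.sum(k * y) / np.sum(k)

    def loocv_error(x, y, h):
        """Leave-one-out CV estimate of prediction error for bandwidth h."""
        errs = []
        for i in range(len(x)):
            x_tr = np.delete(x, i)        # hold out the i-th observation
            y_tr = np.delete(y, i)
            errs.append((y[i] - nw_predict(x[i], x_tr, y_tr, h)) ** 2)
        return np.mean(errs)

    rng = np.random.default_rng(3)
    x = rng.uniform(0, 1, 150)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 150)

    grid = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5]      # candidate bandwidths
    cv = [loocv_error(x, y, h) for h in grid]
    best = grid[int(np.argmin(cv))]
    print("CV errors:", dict(zip(grid, np.round(cv, 4))))
    print("selected bandwidth:", best)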
• Kernels can actually suffer bad bias at the boundaries ... why? Think of the asymmetry of the weights
• In multiple dimensions, say, each x_i ∈ R^p, we can easily use kernels: we just replace x_i − x in the kernel argument by ∥x_i − x∥_2, so that the multivariate kernel regression estimator is

      \hat{r}(x) = \frac{\sum_{i=1}^n K\big(\frac{\|x_i - x\|_2}{h}\big) \cdot y_i}{\sum_{i=1}^n K\big(\frac{\|x_i - x\|_2}{h}\big)}
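A minimal sketch of the multivariate version (my own, with a Gaussian kernel as an arbitrary choice):

    import numpy as np

    def nw_multivariate(x0, X, y, h):
        """Kernel regression in R^p: same formula as before, with x_i - x replaced
        by the Euclidean norm ||x_i - x0||_2 (Gaussian kernel used here)."""
        d = np.linalg.norm(X - x0, axis=1)        # distances ||x_i - x0||_2
        k = np.exp(-(d / h) ** 2 / 2)
        return np.sum(k * y) / np.sum(k)

    rng = np.random.default_rng(4)
    X = rng.uniform(0, 1, size=(200, 3))          # n = 200 points in p = 3 dimensions
    y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, 200)   # r depends on X[:, 0] only
    print(nw_multivariate(np.array([0.5, 0.5, 0.5]), X, y, h=0.3))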
• The same calculations as those that went into producing the bias and variance bounds above can be done in this multivariate case, showing that

      \mathrm{Bias}(\hat{r}(x))^2 \le \tilde{C}_1 h^2
      \quad \text{and} \quad
      \mathrm{Var}(\hat{r}(x)) \le \frac{\tilde{C}_2}{n h^p}.

  Why is the variance so strongly affected now by the dimension p? What is the optimal h, now?
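For reference (the same differentiation as in the one-dimensional case, not spelled out in the notes), minimizing the resulting bound \tilde{C}_1 h^2 + \tilde{C}_2/(n h^p) over h gives

      \frac{d}{dh}\Big( \tilde{C}_1 h^2 + \frac{\tilde{C}_2}{n h^p} \Big)
        = 2 \tilde{C}_1 h - \frac{p \, \tilde{C}_2}{n h^{p+1}} = 0
      \quad\Longrightarrow\quad
      h = \Big( \frac{p \, \tilde{C}_2}{2 \tilde{C}_1 n} \Big)^{1/(p+2)},

so the optimal bandwidth shrinks only like n^{-1/(p+2)}, and the corresponding excess test error is on the order of n^{-2/(p+2)}: as p grows, the amount of data needed for a given accuracy grows very quickly.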
• A little later we'll see an alternative extension to higher dimensions that doesn't suffer nearly the same variance; this is called an additive model