Kernel and Local Regression
MAST90083 Computational Statistics and Data Mining

School of Mathematics & Statistics, The University of Melbourne

Outline

§5.1 Introduction
§5.2 One Dimensional Kernel
§5.3 Local Polynomial Regression
§5.4 Generalized Additive Models

Introduction

Fitting a good linear model often involves considerable time to adequately model:

- Nonlinear dependencies
- Significant and insignificant variables
- Interactions between variables

Various methods have been proposed to overcome these limitations, among them spline regression. Here we look at an alternative to linear and spline regression that overcomes the issue of nonlinearities.

Introduction

- We discuss an alternative regression technique for estimating a regression function f(x) over a domain in R^p
- The approximation is realized by fitting a simple model at each point x_i, i = 1,...,n
- At each point x_i, the model makes use of the training samples close to x_i, producing a smooth estimate f̂(x) in R^p
- The selection of the training samples is realized using a weighting function known as a kernel, K_h(x_i, x_j)

Introduction

- K_h(x_i, x_j) assigns a weight to x_j based on its scaled distance to x_i, where the scale is controlled by a parameter h
- The scale h controls the size of the effective neighborhood used for the estimation
- These methods differ in the shape of the kernel function and do not require training
- The only parameter that needs to be tuned using training samples is the kernel width h

Introduction

Kernel regression has been around since the 1960s, and is one of the most popular methods for "nonparametrically" fitting a model to data. We work here in the regression context, but there exist extensions to classification models via logistic regression.

We will focus on the most popular variants: kernel regression and local polynomial regression.

One Dimensional Kernel

- Consider the regression model

  y_i = f(x_i) + ε_i,   E(ε_i) = 0

- We are interested in estimating the regression function f(x) = E(y | x)
- using a training set (x_i, y_i), i = 1,...,n
- The relationship between x and y is likely to be nonlinear

One Dimensional Kernel

- A direct method: the k-nearest-neighbor average. Use the average of the observations in a defined neighborhood N_k(x) of x to build the estimator of f(x):

  $$\hat{f}(x) = \frac{1}{k}\sum_{x_i \in N_k(x)} y_i = \mathrm{Ave}\big(y_i \mid x_i \in N_k(x)\big)$$

- N_k(x) consists of the k points x_i closest to x in the training sample, which are the points selected for the estimation

One Dimensional Kernel

- The average changes in a discrete way as x moves, leading to a discontinuous f̂(x) (a code sketch follows below)

[Figure: k-nearest-neighbor average fit to data simulated from y = sin(4x) + ε, x ~ U[0,1], ε ~ N(0, 1/3)]
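
A minimal sketch (assuming NumPy) of the k-nearest-neighbor average applied to data simulated from the model above; the function name `knn_average`, the choice k = 30, and reading 1/3 as the noise variance are our assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the example data: y = sin(4x) + eps, x ~ U[0,1]; we read N(0, 1/3) as variance 1/3
n = 100
x = rng.uniform(0.0, 1.0, size=n)
y = np.sin(4.0 * x) + rng.normal(0.0, np.sqrt(1.0 / 3.0), size=n)

def knn_average(x0, x, y, k=30):
    """k-nearest-neighbor estimate of f(x0): average the y's of the k closest x's."""
    idx = np.argsort(np.abs(x - x0))[:k]   # indices of the k nearest training points
    return y[idx].mean()

# Evaluate on a grid; the resulting fit is piecewise constant, hence discontinuous
grid = np.linspace(0.0, 1.0, 200)
fhat = np.array([knn_average(x0, x, y) for x0 in grid])
```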

One Dimensional Kernel

- Problem: the k-nearest-neighbor estimator gives the same weight to all the points in the neighborhood used for the estimation of f̂(x)
- Alternative: make the weights attributed to the points used in the estimation decrease smoothly with their distance from the point of estimation interest

Nadaraya-Watson Kernel

The Nadaraya-Watson kernel leads to a weighted-average estimate (a code sketch follows below)

  $$\hat{f}(x_0) = \frac{\sum_{i=1}^{N} K_h(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_h(x_0, x_i)} \quad \text{if } \sum_{i=1}^{N} K_h(x_0, x_i) \neq 0$$

and f̂(x_0) = 0 if Σ_{i=1}^{N} K_h(x_0, x_i) = 0.
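
A minimal sketch of the Nadaraya-Watson estimator in NumPy, using a Gaussian kernel; the function names are our own, and the zero-denominator convention from the slide is handled explicitly.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def nadaraya_watson(x0, x, y, h, kernel=gaussian_kernel):
    """Weighted average: fhat(x0) = sum_i K_h(x0, x_i) y_i / sum_i K_h(x0, x_i)."""
    w = kernel((x - x0) / h)       # K_h(x0, x_i) for every training point
    denom = w.sum()
    if denom == 0.0:               # convention from the slide: fhat(x0) = 0
        return 0.0
    return np.dot(w, y) / denom
```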

Kernel Function

- The kernel function plays a central role in the fitting and is defined by

  $$K_h(x_0, x) = K\!\left(\frac{x - x_0}{h}\right)$$

- K(x) needs to be smooth, maximal at 0, symmetric around 0 and decreasing with respect to |x|
- It is also common to require

  $$\int K(u)\, du = 1 \qquad \text{and} \qquad \int u\, K(u)\, du = 0$$

  (a code sketch of some common kernels follows below)
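
The kernels tabulated on the next slide can be written down directly; a short sketch assuming NumPy, with function names of our own choosing.

```python
import numpy as np

def epanechnikov(x):
    x = np.asarray(x, dtype=float)
    return 0.75 * (1.0 - x**2) * (np.abs(x) < 1)

def gaussian(x):
    x = np.asarray(x, dtype=float)
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def triweight(x):
    x = np.asarray(x, dtype=float)
    return (35.0 / 32.0) * (1.0 - x**2)**3 * (np.abs(x) < 1)

def tricube(x):
    x = np.asarray(x, dtype=float)
    return (70.0 / 81.0) * (1.0 - np.abs(x)**3)**3 * (np.abs(x) < 1)

def K_h(x0, x, h, kernel=epanechnikov):
    """Scaled kernel K_h(x0, x) = K((x - x0) / h)."""
    return kernel((np.asarray(x, dtype=float) - x0) / h)
```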

Kernel Functions

Common kernel functions are

  Name          K(x)                               Support
  Epanechnikov  (3/4)(1 − x²) I{|x|<1}             [−1, 1]
  Gaussian      (2π)^(−1/2) exp(−x²/2)             (−∞, ∞)
  Biweight      (15/16)(1 − x²)² I{|x|<1}          [−1, 1]
  Triweight     (35/32)(1 − x²)³ I{|x|<1}          [−1, 1]
  Uniform       (1/2) I{|x|<1}                     [−1, 1]
  Tricube       (70/81)(1 − |x|³)³ I{|x|<1}        [−1, 1]

[Figure: Triweight kernel K_h(x) for various choices of h (h = 0.5, 1, 2), plotted for x ∈ [−2, 2]]

Kernel Functions

- The Gaussian kernel is a non-compact kernel, where σ² plays the role of the window size

Nadaraya-Watson Kernel

Example: application of the Epanechnikov quadratic kernel. The contribution of each point (its weight in the estimation) starts at zero and then increases smoothly as the approximation evolves.

Example

- The nearest-neighbor average corresponds to K(x) = (1/2) I{|x| < 1}
- In this case f̂(x) is the average of the y_i's such that x_i ∈ [x − h, x + h], i.e. |x_i − x| ≤ h

Example

There are two extreme cases

- h → ∞: f̂ is independent of x (high bias case)

  $$\hat{f}(x) = \frac{1}{N}\sum_{i=1}^{N} y_i = \text{const.}$$

- h → 0, h < min_{i≠j} |x_i − x_j| (high variance case)

  $$\hat{f}(x_i) = y_i \quad \text{and} \quad \hat{f}(x) = 0 \ \text{ for } x \neq x_i$$

- The estimator reproduces the data y_i at the x_i and is zero at other points
- The optimal h lies between these two extremes and provides the appropriate compromise between bias and variance

Linear Estimator

- The Nadaraya-Watson estimator can be written as a weighted sum

  $$\hat{f}(x) = \sum_{i=1}^{N} y_i\, W_i(x)$$

- where the weights

  $$W_i(x) = \frac{K_h(x, x_i)}{\sum_{j=1}^{N} K_h(x, x_j)}\, I\!\left\{\sum_{j=1}^{N} K_h(x, x_j) \neq 0\right\}$$

- are independent of the responses y_i (a code sketch follows below)
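
A brief sketch of this linear-estimator view, assuming NumPy: the weight vector W_i(x) is computed from the x's alone, and f̂(x) is then the weighted sum of the responses. The function name and the Gaussian kernel choice are ours; any kernel from the table above would do.

```python
import numpy as np

def nw_weights(x0, x, h):
    """Nadaraya-Watson weights W_i(x0); they do not involve the responses y."""
    k = np.exp(-0.5 * ((x - x0) / h)**2)   # Gaussian kernel values K_h(x0, x_i)
    s = k.sum()
    if s == 0.0:
        return np.zeros_like(k)            # indicator term: all weights zero
    return k / s                           # otherwise the weights sum to one

# Usage: fhat(x0) = sum_i W_i(x0) * y_i
# w = nw_weights(0.3, x, h=0.2); fhat = np.dot(w, y)
```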
Justification or Interpretation

Let (x, y) be a pair of random variables in R² with density p(x, y) and marginal density p(x) = ∫ p(x, y) dy > 0. Then

  $$f(x) = E[y \mid x] = \int y\, p(y \mid x)\, dy = \int y\, \frac{p(x, y)}{p(x)}\, dy = \frac{1}{p(x)}\int y\, p(x, y)\, dy = \frac{\int y\, p(x, y)\, dy}{\int p(x, y)\, dy}$$

If we replace p(x, y) by p̂(x, y) (its estimator) and p(x) by p̂(x), we recover f̂(x).

Note: the plug-in step is sketched below.
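
The plug-in step is not spelled out on the slide; a sketch of one standard way to carry it out, using a product-kernel density estimator for p(x, y) (this particular choice of estimator is our assumption, not stated in the slides):

```latex
\begin{align*}
\hat{p}(x,y) &= \frac{1}{Nh^{2}}\sum_{i=1}^{N} K\!\Big(\tfrac{x-x_i}{h}\Big)K\!\Big(\tfrac{y-y_i}{h}\Big),
\qquad
\hat{p}(x) = \frac{1}{Nh}\sum_{i=1}^{N} K\!\Big(\tfrac{x-x_i}{h}\Big),\\[4pt]
\int y\,\hat{p}(x,y)\,dy
  &= \frac{1}{Nh}\sum_{i=1}^{N} K\!\Big(\tfrac{x-x_i}{h}\Big)\int (y_i + hu)\,K(u)\,du
   = \frac{1}{Nh}\sum_{i=1}^{N} y_i\,K\!\Big(\tfrac{x-x_i}{h}\Big),\\[4pt]
\hat{f}(x) &= \frac{\int y\,\hat{p}(x,y)\,dy}{\hat{p}(x)}
            = \frac{\sum_{i=1}^{N} y_i\,K\!\big(\tfrac{x-x_i}{h}\big)}{\sum_{i=1}^{N} K\!\big(\tfrac{x-x_i}{h}\big)}
\end{align*}
```

Here the second line uses the substitution u = (y − y_i)/h together with the conditions ∫ K(u) du = 1 and ∫ u K(u) du = 0 from §5.2, and the last line is exactly the Nadaraya-Watson estimator.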

Justification or Interpretation

If the density p(x) is assumed uniform then

  $$\hat{f}(x) = \sum_{i=1}^{N} y_i\, K\!\left(\frac{x_i - x}{h}\right)$$

Properties

- The width h of the local neighborhood plays the role of the smoothing parameter
- Large values of h imply lower variance (more samples are used for the estimation) but higher bias (the function is assumed to be roughly constant within the window)
- For k-nearest neighborhoods, the neighborhood size k plays the role of the window size h, with h_k(x_i) = |x_i − x_[k]|, where x_[k] is the kth closest x_j to x_i
- An adaptive width h(x) can also be considered instead of a constant width h(x) = h; the kernel is then (a code sketch follows below)

  $$K_h(x_i, x) = K\!\left(\frac{|x - x_i|}{h(x_i)}\right)$$
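
A brief sketch of the adaptive-width idea with a k-nearest-neighbor bandwidth h_k(x_i), assuming NumPy; the function names and the unnormalized tricube-shaped weight are our choices (any normalizing constant cancels in the Nadaraya-Watson weights).

```python
import numpy as np

def knn_bandwidth(xi, x, k):
    """h_k(xi): distance from xi to its k-th closest training point."""
    d = np.sort(np.abs(x - xi))
    return d[k - 1]        # k-th smallest distance (d[0] may be 0 if xi is a training point)

def adaptive_kernel_weight(xi, x0, x, k,
                           kernel=lambda u: np.maximum(1.0 - np.abs(u)**3, 0.0)**3):
    """K(|x0 - xi| / h(xi)) with a per-point bandwidth h(xi) = knn_bandwidth(xi, x, k)."""
    h_i = knn_bandwidth(xi, x, k)
    return kernel(np.abs(x0 - xi) / h_i)
```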

Local Polynomial Regression

- The kernel fit can still have problems, due to the asymmetry of the kernel at the boundaries
- or in the interior, if the x values are not equally spaced
- Locally weighted linear regression provides an alternative local approximation


Local Linear Regression

- It is obtained by solving a weighted least squares criterion at each target point x_0 (a code sketch follows below)

  $$\min_{\alpha(x_0),\, \beta(x_0)} \sum_{i=1}^{N} K_h(x_0, x_i)\, \big[y_i - \alpha(x_0) - \beta(x_0)\, x_i\big]^2$$

- and the estimate at x_0 is given by

  $$\hat{f}(x_0) = \hat{\alpha}(x_0) + \hat{\beta}(x_0)\, x_0$$
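
A minimal sketch of local linear regression at a single target point x_0, solving the weighted least squares problem above directly with NumPy; the function name and the Gaussian kernel choice are ours.

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Fit y ~ alpha + beta * x by weighted least squares with weights K_h(x0, x_i),
    and return fhat(x0) = alpha_hat + beta_hat * x0."""
    w = np.exp(-0.5 * ((x - x0) / h)**2)              # kernel weights K_h(x0, x_i)
    B = np.column_stack([np.ones_like(x), x])         # regression matrix with rows (1, x_i)
    WB = w[:, None] * B                               # W(x0) B
    coef = np.linalg.solve(B.T @ WB, B.T @ (w * y))   # solves (B^T W B) coef = B^T W y
    return coef[0] + coef[1] * x0                     # alpha_hat + beta_hat * x0
```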


Local Linear Regression

Let B be the N × 2 regression matrix with ith row b(x_i)ᵀ = (1, x_i), and let W(x_0) be the N × N diagonal matrix with ith diagonal element K_h(x_0, x_i). Then

  $$\hat{f}(x_0) = b(x_0)^\top \big(B^\top W(x_0)\, B\big)^{-1} B^\top W(x_0)\, y = \sum_{i=1}^{N} l_i(x_0)\, y_i$$

where the l_i(x_0)'s do not involve y (a code sketch follows below).

Local linear regression tends to be biased in curved regions of the true function.
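
The same fit written as a linear smoother: a sketch, assuming NumPy, that computes the weights l_i(x_0) explicitly so that f̂(x_0) = Σ_i l_i(x_0) y_i; the function name and Gaussian kernel are ours.

```python
import numpy as np

def local_linear_weights(x0, x, h):
    """Equivalent-kernel weights l_i(x0) of local linear regression; they do not involve y."""
    w = np.exp(-0.5 * ((x - x0) / h)**2)        # diagonal of W(x0)
    B = np.column_stack([np.ones_like(x), x])   # rows b(x_i)^T = (1, x_i)
    b0 = np.array([1.0, x0])                    # b(x0)
    # l(x0)^T = b(x0)^T (B^T W B)^{-1} B^T W
    M = np.linalg.solve(B.T @ (w[:, None] * B), B.T * w)
    return b0 @ M

# fhat(x0) = local_linear_weights(x0, x, h) @ y  reproduces the weighted-LS fit above
```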

An example of local linear regression

[Figure: scatter plot of (x, y) data, x ∈ [−3, 3], y ∈ [0, 0.4], with a local linear regression fit]



An example of local linear regression

[Figure: local linear regression fit with h = 1; scatter of (x, y) data, x ∈ [−3, 3], y ∈ [0, 0.4]]

Overfitting and underfitting

The choice of the bandwidth h directly controls the bias-variance tradeoff. Choosing h too small will tend to give overfitted models (high variance, low bias), while choosing h too large will give underfitted models (high bias, low variance).

In practice we can employ methods like cross-validation, or even plug-in estimates, to decide on an appropriate value of h (a sketch of leave-one-out cross-validation follows below).
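
A sketch of choosing h by leave-one-out cross-validation for the Nadaraya-Watson fit (the same idea applies to local linear regression); the candidate grid of bandwidths and the function names are our own choices.

```python
import numpy as np

def nw_fit(x0, x, y, h):
    w = np.exp(-0.5 * ((x - x0) / h)**2)
    s = w.sum()
    return np.dot(w, y) / s if s > 0 else 0.0

def loocv_score(h, x, y):
    """Mean squared leave-one-out prediction error for bandwidth h."""
    errs = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i          # drop the i-th observation
        pred = nw_fit(x[i], x[mask], y[mask], h)
        errs.append((y[i] - pred)**2)
    return np.mean(errs)

def choose_h(x, y, grid=np.linspace(0.05, 1.0, 20)):
    """Return the bandwidth in `grid` with the smallest leave-one-out error."""
    scores = [loocv_score(h, x, y) for h in grid]
    return grid[int(np.argmin(scores))]
```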

Underfitted local linear regression

[Figure: local linear regression fit with h = 5 (underfit); scatter of (x, y) data, x ∈ [−3, 3], y ∈ [0, 0.4]]

Overfitted local linear regression

[Figure: local linear regression fit with h = 0.4 (overfit); scatter of (x, y) data, x ∈ [−3, 3], y ∈ [0, 0.4]]

Adaptive choices of h

A common alternative to using a fixed h is to vary it with respect to x. The most common example of this is the nearest-neighbor bandwidth, where h_x is chosen so that the window always contains a fixed proportion t of the data,

  $$\frac{1}{n}\sum_{i} I\{|X_i - x| \le h_x\} = t$$

(a code sketch follows below).
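
A sketch of this nearest-neighbor bandwidth, assuming NumPy: h_x is taken as the smallest distance within which a proportion t of the X_i fall; the function name and the ceiling convention are our assumptions.

```python
import numpy as np

def nn_bandwidth(x0, x, t=0.3):
    """Smallest h such that a proportion t of the X_i satisfy |X_i - x0| <= h."""
    d = np.sort(np.abs(x - x0))
    k = int(np.ceil(t * len(x)))   # number of points the window must contain
    return d[k - 1]

# Usage: plug h = nn_bandwidth(x0, x, t) into the kernel or local linear fit at x0.
```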