
Getting nonlinear
Data Mining
Prof. Dr. Matei Demetrescu
Statistics and Econometrics (CAU Kiel), Summer 2020

Get more out of the data?
We used linearity as a starting point rather than truth carved in stone.
When a linear approximation is not good enough,¹
some alternative approaches may offer a lot of flexibility, without losing (much of) the interpretability of linear models.
Splines, local regression, and classification & regression trees fit here. (Today)
Other alternatives are a bit more of a black box: ensemble methods or artificial neural networks.
(Next weeks)
¹ The decision to go nonlinear can be taken using model selection techniques.

Today’s outline
Getting nonlinear
1 Polynomials, step functions and splines
2 Local regression and GAMs
3 Classification and regression trees
4 Up next

Polynomials, step functions and splines
Outline
1 Polynomials, step functions and splines
2 Local regression and GAMs
3 Classification and regression trees
4 Up next

Polynomials, step functions and splines
Should be familiar: Polynomial Regression
May use
Yi = β0 + β1 Xi + β2 Xi^2 + β3 Xi^3 + · · · + βd Xi^d + εi
for linear and logit regression.
[Figure: degree-4 polynomial fit of Wage on Age (left) and of Pr(Wage > 250 | Age) (right). Wage data set from the ISLR package; see the textbook James et al. (2013).]

Polynomials, step functions and splines
Details
Not interested in the βˆl themselves, but rather in the fitted function values at any x0:
fˆ(x0) = βˆ0 + βˆ1 x0 + βˆ2 x0^2 + βˆ3 x0^3 + βˆ4 x0^4.
Since fˆ(x0) is a linear function of the βˆl, you can get a simple expression for the pointwise variance Var(fˆ(x0)) at any value x0.
Model degrees of freedom are d + 1 in the simple regression case. We either fix the degree d at some reasonably low value,
… or else use, say, cross-validation to choose d.
(For multiple regression, you may even add interactions.)
Logistic regression follows naturally; here,
P(Yi > 250 | Xi) = exp(β0 + β1 Xi + β2 Xi^2 + · · · + βd Xi^d) / (1 + exp(β0 + β1 Xi + β2 Xi^2 + · · · + βd Xi^d)).
To get confidence intervals, compute upper and lower bounds on the logit scale, and then transform them to the probability scale.
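A minimal R sketch of these fits, assuming the Wage data from the ISLR package referenced in the figure (columns wage and age); the ± 2 SE bands are pointwise.

```r
library(ISLR)   # provides the Wage data set used in the slides

# Degree-4 polynomial regression of wage on age
fit  <- lm(wage ~ poly(age, 4), data = Wage)
grid <- data.frame(age = seq(min(Wage$age), max(Wage$age), length.out = 100))
pred <- predict(fit, newdata = grid, se.fit = TRUE)
band <- cbind(fit = pred$fit,
              low = pred$fit - 2 * pred$se.fit,   # pointwise +/- 2 SE
              up  = pred$fit + 2 * pred$se.fit)

# Logistic polynomial regression for Pr(wage > 250 | age):
# bounds are computed on the logit scale, then mapped to probabilities.
lfit  <- glm(I(wage > 250) ~ poly(age, 4), data = Wage, family = binomial)
lpred <- predict(lfit, newdata = grid, se.fit = TRUE)        # logit scale
prob  <- plogis(cbind(fit = lpred$fit,
                      low = lpred$fit - 2 * lpred$se.fit,
                      up  = lpred$fit + 2 * lpred$se.fit))
```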

Polynomials, step functions and splines
An alternative: many steps…
Step Functions
Polynomials have notorious tail behavior — very bad for extrapolation.
Another way of creating transformations of a variable: cut the variable into distinct regions.
To avoid this, we may partition the range of the regressor, e.g. for X = Age,
C1(X) = I(X < 35),  C2(X) = I(35 ≤ X < 50),  . . . ,  CK(X) = I(X ≥ 65).
[Figure: piecewise-constant fit of Wage on Age (left) and of Pr(Wage > 250 | Age) (right).]

Polynomials, step functions and splines
Step functions
Easy to work with:
create a series of dummy variables representing each region, and use them instead of the original predictor.
Useful way of creating interactions that are easy to interpret. For example, interaction effect of Year and Age:
I(Year < 2005) · Age and I(Year ≥ 2005) · Age would allow for different linear functions of Age in each period.
The (number of) cutpoints (or knots) is usually very important:
Choice of cutpoints can be problematic (but sometimes there are plausible choices, say children vs. adults).
The jumps at the cutpoints may be a problem (smoother alternatives may work better).

Polynomials, step functions and splines
Piecewise Polynomials
May combine step functions with polynomials. E.g. for one knot ξ1 and d = 3,
Yi = β01 + β11 Xi + β21 Xi^2 + β31 Xi^3 + εi  if Xi < ξ1,
Yi = β02 + β12 Xi + β22 Xi^2 + β32 Xi^3 + εi  if Xi ≥ ξ1.
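A hedged R sketch of both ideas on the (assumed) Wage data: step functions via cut(), and a discontinuous piecewise cubic via an interaction with an indicator; the cutpoints and the knot at age 50 are illustrative choices, not those used in the slides.

```r
library(ISLR)

# Step functions: cut() turns age into a factor over the chosen regions;
# lm() then creates the corresponding dummy variables automatically.
step_fit <- lm(wage ~ cut(age, breaks = c(-Inf, 35, 50, 65, Inf)), data = Wage)

# A (discontinuous) piecewise cubic with one knot at age 50, written as an
# interaction between a cubic polynomial and an indicator.
pw_fit <- lm(wage ~ poly(age, 3) * I(age >= 50), data = Wage)
```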


Polynomials, step functions and splines
Cubic Splines
Now we have truncated power basis functions,
Yi = β0 + β1 b1(Xi) + β2 b2(Xi) + · · · + βK+3 bK+3(Xi) + εi,
where
b1(x) = x,  b2(x) = x^2,  b3(x) = x^3,
and, for k = 1, . . . , K, bk+3(x) = (x − ξk)^3_+.
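As a sketch of how this basis can be built by hand (Wage data assumed again; the knot locations are illustrative), note that K knots give K + 3 basis columns plus the intercept, i.e. K + 4 coefficients in total.

```r
library(ISLR)

# Truncated power basis: x, x^2, x^3 and (x - xi_k)^3_+ for each knot xi_k
trunc_power_basis <- function(x, knots) {
  B <- cbind(x, x^2, x^3)
  for (k in knots) B <- cbind(B, pmax(x - k, 0)^3)
  B
}

B   <- trunc_power_basis(Wage$age, knots = c(35, 50, 65))  # K = 3 knots
fit <- lm(Wage$wage ~ B)   # K + 4 = 7 coefficients, intercept included
```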
[Figure: a truncated power basis function b(x) and the resulting fit f(x), plotted over x ∈ [0, 1].]

Polynomials, step functions and splines
Different degrees of smoothness
[Figure: fits of Wage on Age with increasing smoothness constraints — Piecewise Cubic, Continuous Piecewise Cubic, Cubic Spline, and Linear Spline.]


Polynomials, step functions and splines
Natural Cubic Splines
A natural cubic spline extrapolates linearly beyond the boundary knots.
This adds 4 = 2 × 2 extra constraints (zero second-order derivatives), and allows us to put more internal knots for the same flexibility (model degrees of freedom) as a regular cubic spline.
[Figure: natural cubic spline and cubic spline fits of Wage on Age.]

Polynomials, step functions and splines
Now for all data.
Fitting splines in R is easy: bs(x, ...) for splines of any degree, and ns(x, ...) for natural cubic splines, in package splines.
[Figure: natural cubic spline fit of Wage on Age (left) and of Pr(Wage > 250 | Age) (right).]
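A brief sketch of the bs()/ns() usage mentioned above, again assuming the Wage data; the knots and df values are illustrative.

```r
library(ISLR)
library(splines)

cub <- lm(wage ~ bs(age, knots = c(35, 50, 65)), data = Wage)  # cubic spline
nat <- lm(wage ~ ns(age, df = 4), data = Wage)                 # natural cubic spline

# Natural cubic spline inside a logistic regression for Pr(wage > 250 | age)
lnat <- glm(I(wage > 250) ~ ns(age, df = 4), data = Wage, family = binomial)
```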

Polynomials, step functions and splines
Knot placement
One strategy is to decide K, the number of knots, and then place them at appropriate quantiles of the observed X.
A cubic spline with K knots has K + 4 parameters/model degrees of freedom.
A natural spline with K knots has K degrees of freedom.
Direct comparison: degree-14 polynomial vs. natural cubic spline. Both have 15 d.o.f., but note the extrapolating behavior of the polynomial and its wiggly behavior.
[Figure: natural cubic spline and degree-14 polynomial fits of Wage on Age.]
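A hedged sketch of this direct comparison on the (assumed) Wage data: the coefficient counts match the d.o.f. statement, and predictions near the boundary of the observed ages illustrate the polynomial's erratic behavior there.

```r
library(ISLR)
library(splines)

poly14 <- lm(wage ~ poly(age, 14), data = Wage)     # degree-14 polynomial
nat14  <- lm(wage ~ ns(age, df = 14), data = Wage)  # natural cubic spline
length(coef(poly14))   # 15 coefficients (including the intercept)
length(coef(nat14))    # 15 coefficients as well

# Compare the two fits near the boundary of the observed ages
grid <- data.frame(age = c(18:25, 72:80))
cbind(poly = predict(poly14, grid), ns = predict(nat14, grid))
```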

Polynomials, step functions and splines
Smoothing Splines
Consider this criterion for fitting a smooth function g(x) to some data:
min_{g ∈ S}  Σ_{i=1}^n (Yi − g(Xi))^2 + λ ∫ g′′(t)^2 dt
The first term is RSS, and tries to match g(x) to the data.
The second is a roughness penalty and controls how wiggly g(x) is.
It is modulated by the tuning parameter λ ≥ 0.
The smaller λ, the more wiggly the function, eventually interpolating yi when λ = 0.
As λ → ∞ the function g(x) becomes linear.
Might see a parallel to the LASSO or ridge regression.

Polynomials, step functions and splines
Smoothing Splines cont’d
The solution is a particular natural cubic spline, with a knot at every unique value of Xi. The roughness penalty still controls roughness via λ.
Smoothing splines avoid the knot-selection issue, leaving a single λ to be chosen.
The vector of n fitted values can be written as gˆ = Sλ y, where Sλ is an n × n matrix (determined by the sample Xi and λ).
The effective degrees of freedom are given by
dfλ = tr(Sλ) = Σ_{i=1}^n [Sλ]ii.
This is analogous to linear regression, where yˆ = X(X′X)−1X′y with tr(X(X′X)−1X′) = p + 1.

Polynomials, step functions and splines
Choosing λ
The leave-one-out (LOO) cross-validated error is given by
RSScv(λ) = Σ_{i=1}^n (yi − gˆλ^(−i)(xi))^2 = Σ_{i=1}^n [ (yi − gˆλ(xi)) / (1 − {Sλ}ii) ]^2.
Still, K-fold cross-validation works a tick more reliably.
[Figure: smoothing spline fits of Wage on Age with 16 degrees of freedom and with 6.8 effective degrees of freedom chosen by LOOCV.]
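A minimal sketch with smooth.spline(), which can either fix the effective d.o.f. or select λ by ordinary leave-one-out cross-validation (Wage data assumed).

```r
library(ISLR)

ss_fixed <- smooth.spline(Wage$age, Wage$wage, df = 16)    # fix effective d.o.f.
ss_cv    <- smooth.spline(Wage$age, Wage$wage, cv = TRUE)  # choose lambda by LOOCV
ss_cv$df      # effective degrees of freedom selected by LOOCV (about 6.8 here)
ss_cv$lambda  # the corresponding smoothing parameter
```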

Local regression and GAMs
Outline
1 Polynomials, step functions and splines
2 Local regression and GAMs
3 Classification and regression trees
4 Up next

Local regression and GAMs
Local linear Regression
With a sliding weight function, we fit separate linear fits over the range of X by weighted least squares.
[Figure: local regression on simulated data; at each target point the fit is a weighted linear fit of nearby observations.]
This is nonparametric in nature: we try to estimate E(Y | X) with no model assumptions (except smoothness of the regression function). See Advanced Statistics III.
See text for more details, and the loess() function in R.
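A short loess() sketch (Wage data assumed for concreteness): span plays the role of the bandwidth, degree that of the local polynomial order.

```r
library(ISLR)

# span controls the bandwidth (fraction of points used locally),
# degree the order of the local polynomial.
lo1 <- loess(wage ~ age, data = Wage, span = 0.2, degree = 1)
lo2 <- loess(wage ~ age, data = Wage, span = 0.5, degree = 2)
predict(lo2, newdata = data.frame(age = 40))
```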

Local regression and GAMs
Example: Real estate data
The shape of the weight function influences the bias-variance trade-off.
And so does the so-called bandwidth!
We may have local polynomial regression, but beware of the increasing number of model degrees of freedom.
[Figure: local polynomial fits of order 1 and order 2 of Price on House age, for small, moderate, and large bandwidths.]

Local regression and GAMs
Some details
Analogously to splines, the local linear regression is a linear smoother.
It is however customary to specify the bandwidth rather than the (effective) d.o.f.
The optimal bandwidth is typically found by minimizing cross-validated errors.
Like any flexible estimator, the local regression estimator runs into problems for large, or even moderate, p.
It can be analyzed analytically, and fˆ(x) − f(x) = Op(n^(−2/(4+d))).
GAMs try to side-step this issue.

Local regression and GAMs
Generalized additive models
Allows for flexible nonlinearities in several variables, but retains the additive structure of linear models.
Yi =β0 +f1(Xi1)+f2(Xi2)+···+fp(Xip)+εi.
[Figure: estimated partial-response functions for distance to public transportation, house age, and number of stores (real estate data).]

Local regression and GAMs
Details
The functions fl may be fitted using so-called backfitting (sketched below):
Start with initial guesses of the fˆl;
Re-fit one fˆl at a time while keeping the others fixed, and iterate;
Stop when a convergence criterion has been met.
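A minimal backfitting sketch for two smooth terms, using smooth.spline() as the component smoother and a fixed number of iterations instead of a formal convergence check; y, x1, x2 are assumed numeric vectors.

```r
backfit <- function(y, x1, x2, n_iter = 25) {
  beta0 <- mean(y)
  f1 <- rep(0, length(y))
  f2 <- rep(0, length(y))
  for (it in seq_len(n_iter)) {
    # re-fit f1 on partial residuals, holding f2 fixed
    f1 <- predict(smooth.spline(x1, y - beta0 - f2), x1)$y
    f1 <- f1 - mean(f1)              # center for identifiability
    # re-fit f2 on partial residuals, holding f1 fixed
    f2 <- predict(smooth.spline(x2, y - beta0 - f1), x2)$y
    f2 <- f2 - mean(f2)
  }
  list(beta0 = beta0, f1 = f1, f2 = f2)
}
```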
But splines are more popular, since their use leads to linear regression.
Either way, the coefficients are not that interesting; the fitted functions are.
Can mix terms: some linear, some nonlinear.
Can include low-order synergies, e.g. with bivariate smoothers or regression interactions.
Can even classify:
log( P(Y | X) / (1 − P(Y | X)) ) = β0 + f1(X1) + f2(X2) + · · · + fp(Xp).
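A hedged sketch using gam() from the mgcv package; the variable names (houses, price, dist, age, stores) are placeholders, not the lecture's actual real-estate data.

```r
library(mgcv)

# `houses`, `price`, `dist`, `age`, `stores` are illustrative names only.
# Smooth terms for two predictors, a linear term for a third.
fit <- gam(price ~ s(dist) + s(age) + stores, data = houses)
plot(fit, pages = 1)     # estimated partial-response functions

# A logistic GAM for classification
cfit <- gam(I(price > 40) ~ s(dist) + s(age), data = houses, family = binomial)
```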

Classification and regression trees
Outline
1 Polynomials, step functions and splines
2 Local regression and GAMs
3 Classification and regression trees
4 Up next

Classification and regression trees
Multidimensional step functions?
Let us consider a particular way of employing and describing step functions: stratifying or segmenting the predictor space into “simpler” regions.
Some eyeballing:
Predict one price for Location > 1.
Predict another price for Location < 1.
Refine the prediction if Age < 10.
Refine the prediction if Age > 10.
This looks like the beginning of a nice decision tree.
[Figure: the housing data plotted against house age and distance to public transportation, with the eyeballed splits indicated.]

Classification and regression trees
Housing data cont’d
You may go on like this …
[Figure: fitted regression tree for the housing data, splitting on house.loc and house.age, with the terminal-node predictions and the induced partition of the predictor space.]
It’s just a matter of finding splitting rules!

Classification and regression trees
Classification and regression trees
Trees are an over-simplification, but compared to a regression model, a CART is easy to display and explain.
In keeping with the tree analogy, the regions R1, R2, and R3 are known as terminal nodes.
Decision trees are typically drawn upside down, in the sense that the leaves are at the bottom of the tree.
The points along the tree where the predictor space is split are referred to as internal nodes.
The final regions are terminal nodes, or leaves.
Alone, trees are not very powerful prediction/classification tools. We’ll use
ensemble methods to improve (often a lot!) on this.

Classification and regression trees
Details of the tree-building process
In general, we partition the predictor space into J (distinct and non-overlapping) regions, R1, R2, . . . , RJ .
In theory, the regions could have any shape.
For simplicity (and for ease of interpretation), we choose to divide the
predictor space into high-dimensional rectangles, or boxes.
The goal is to find boxes R1, . . . , RJ that minimize the RSS, given by
Σ_{j=1}^J Σ_{i ∈ Rj} (Yi − YˆRj)^2,
where YˆRj is the mean response for the (training) observations in the jth box.
Unfortunately, it is computationally infeasible to consider every possible
partition of the feature space into J boxes.
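As a sketch of the basic ingredient used by the greedy algorithm described next, here is an exhaustive search for the single RSS-minimizing split of one numeric predictor (x and y are assumed vectors).

```r
# Find the cutpoint s that minimizes the RSS of the two resulting regions
best_split <- function(x, y) {
  cand <- sort(unique(x))
  # midpoints between consecutive distinct values as candidate cutpoints
  cuts <- (head(cand, -1) + tail(cand, -1)) / 2
  rss  <- sapply(cuts, function(s) {
    left  <- y[x <  s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(cutpoint = cuts[which.min(rss)], rss = min(rss))
}
```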

Classification and regression trees
Greedy tree growing
Starting with one region covering the whole predictor space, the algorithm splits in iteration t one of the t existing regions into two subregions:
For each of the current regions, say Rm, m = 1, . . . , t,
take each predictor Xj, j = 1, . . . , p;
find the cutpoint smj such that splitting region Rm into the regions
{X ∈ Rm | Xj < smj} and {X ∈ Rm | Xj ≥ smj} leads to the greatest possible reduction in RSS;
Among the p predictors, pick the one whose split smj leads to the greatest possible reduction in RSS. (This predictor will likely differ from region to region.)
Finally, see for which of the current regions the splits found above lead to the greatest possible reduction in RSS.
Iterate until a stopping criterion is reached (say a maximal number of regions J, or a minimal number of data points in each region).

Classification and regression trees
Example
Recursive binary splitting is
Top-down: it begins at the top of the tree and then successively splits the predictor space.
Greedy: it looks for the largest RSS decrease in the current iteration rather than optimizing over all partitions.
The tree only depicts the final partition, not the order in which splits occurred (except the first).
[Figure 8.3 from James et al. (2013): Top left: a partition of two-dimensional feature space that could not result from recursive binary splitting. Top right: the output of recursive binary splitting on a two-dimensional example, with splits at X1 ≤ t1, X2 ≤ t2, X1 ≤ t3, X2 ≤ t4 giving regions R1–R5. Bottom left: a tree corresponding to the partition in the top right panel. Bottom right: a perspective plot of the corresponding prediction surface.]

Classification and regression trees
Preventing overfitting
The process described above may produce good predictions on the training set, but is likely to overfit the data, leading to poor test set performance. Why?
A smaller tree with fewer splits (that is, fewer regions R1, . . . , RJ) might lead to lower variance and better interpretation at the cost of a little bias.
One possible alternative to the process described above is to grow the tree only so long as the decrease in the RSS due to each split exceeds some (high) threshold.
This strategy will result in smaller trees, but is too short-sighted: a seemingly worthless split early on in the tree might be followed by a very good split, that is, a split that leads to a large reduction in RSS later on.

Classification and regression trees
Pruning a tree
A better strategy is to grow a very large tree T0, and then prune it back in order to obtain a subtree.
Cost complexity pruning, also known as weakest link pruning, is often used to do this.
We consider a sequence of trees indexed by a nonnegative tuning parameter α. For each value of α there corresponds a subtree T ⊂ T0 such that
Σ_{m=1}^{|T|} Σ_{i: Xi ∈ Rm} (Yi − YˆRm)^2 + α|T|
is as small as possible.
Here |T| indicates the number of terminal nodes of the tree T, Rm is the rectangle (i.e. the subset of predictor space) corresponding to the mth terminal node, and YˆRm is the mean of the training observations in Rm.

Classification and regression trees
Choosing the best subtree
The tuning parameter α controls a trade-off between the subtree’s complexity and its fit to the training data.
We select an optimal value αˆ using cross-validation.
We then return to the full data set and obtain the subtree corresponding to αˆ.
(Lots of variations exist in practice.)
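A hedged sketch with the rpart package: grow a deep tree, inspect the cross-validated cost-complexity table, and prune. rpart's cp parameter plays the role of α (rescaled); the variable names follow the slides' housing example (houses, price, house.loc, house.age) but are assumptions about the actual data.

```r
library(rpart)

# Grow a deliberately large tree (small cp, small minsplit)
big <- rpart(price ~ house.loc + house.age, data = houses, method = "anova",
             control = rpart.control(cp = 0.001, minsplit = 10))
printcp(big)   # cross-validated error for each value of cp

# Prune back to the cp value with the smallest cross-validated error
best_cp <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]
pruned  <- prune(big, cp = best_cp)

# For a classification tree, use method = "class"; rpart then splits on the
# Gini index by default.
```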
Classification and regression trees
Summary: tree algorithm
1 Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
2 Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of α.
3 Use K-fold cross-validation to choose α. For each k = 1, . . . , K:
1 Repeat Steps 1 and 2 on the (K − 1)/K fraction of the training data, excluding the kth fold.
2 Evaluate the mean squared prediction error on the data in the left-out kth fold, as a function of α.
Average the results, and pick α to minimize the average error.
4 Return the subtree from Step 2 that corresponds to the chosen value of α.

Classification and regression trees
Classification trees
Predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs.
However, RSS cannot be used as a criterion for binary splitting. A natural alternative to RSS is the classification error rate
Em = 1 − max_k pˆmk.
Here pˆmk represents the proportion of training observations in the mth region that are from the kth class.
This is not sufficiently sensitive for tree-growing, so use either of
the Gini index, G = Σ_{k=1}^K pˆmk (1 − pˆmk), and
the cross-entropy, D = − Σ_{k=1}^K pˆmk log pˆmk.
They tend to deliver similar results in practice (and both are small if nodes contain most observations from the same class).

Classification and regression trees
To sum up: Trees Versus Linear Models
[Figure: top row: true linear boundary; bottom row: true non-linear boundary; left column: linear model; right column: tree-based model.]

Classification and regression trees
Advantages and disadvantages of trees
+ Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression!
+ Some people believe that decision trees more closely mirror human decision-making than linear or logit regression.
+ Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small).
+ Trees can easily handle qualitative predictors without the need to create dummy variables.
− Unfortunately, trees generally do not have the same level of predictive accuracy as the other regression/classification approaches we discussed.
+ May aggregate many smaller trees cleverly grown on the same data. Such ideas apply more generally, in fact, and can improve a lot.

Up next
Outline
1 Polynomials, step functions and splines
2 Local regression and GAMs
3 Classification and regression trees
4 Up next

Up next
Coming up
Ensemble methods