Springer Series in Statistics
Trevor Tibshirani Jerome Elements of
Statistical Learning
Data Mining, Inference, and Prediction
Copyright By PowCoder代写 加微信 powcoder
Second Edition
To our parents:
Valerie and Vera and Florence and
and to our families:
Samantha, Timothy, and , Ryan, Julie, and Cheryl Melanie, Dora, Monika, and Ildiko
This is page v Printer: Opaque this
Preface to the Second Edition
In God we trust, all others bring data.
– Deming (1900-1993)1
We have been gratified by the popularity of the first edition of The Elements of Statistical Learning. This, along with the fast pace of research in the statistical learning field, motivated us to update our book with a second edition.
We have added four new chapters and updated some of the existing chapters. Because many readers are familiar with the layout of the first edition, we have tried to change it as little as possible. Here is a summary of the main changes:
1On the Web, this quote has been widely attributed to both Deming and . Hayden; however Professor Hayden told us that he can claim no credit for this quote, and ironically we could find no “data” confirming that Deming actually said this.
This is page vii Printer: Opaque this
viii Preface to the Second Edition
1. Introduction
2. Overview of Supervised Learning 3. Linear Methods for Regression
4. Linear Methods for Classification 5. Basis Expansions and Regulariza- tion
6. Kernel Smoothing Methods
7. Model Assessment and Selection
8. Model Inference and Averaging 9. Additive Models, Trees, and Related Methods
10. Boosting and Additive Trees
11. Neural Networks
12. Support Vector Machines and Flexible Discriminants
What’s new
LAR algorithm and generalizations of the lasso
Lasso path for logistic regression Additional illustrations of RKHS
Strengths and pitfalls of cross- validation
New example from ecology; some material split off to Chapter 16. Bayesian neural nets and the NIPS 2003 challenge
Path algorithm for SVM classifier
Spectral clustering, kernel PCA, sparse PCA, non-negative matrix factorization archetypal analysis, nonlinear dimension reduction, Google page rank algorithm, a direct approach to ICA
13. Prototype Methods Nearest-Neighbors
14. Unsupervised Learning
15. Random Forests
16. Ensemble Learning
17. Undirected Graphical Models 18. High-Dimensional Problems
Some further notes:
• Our first edition was unfriendly to colorblind readers; in particular, we tended to favor red/green contrasts which are particularly trou- blesome. We have changed the color palette in this edition to a large extent, replacing the above with an orange/blue contrast.
• We have changed the name of Chapter 6 from “Kernel Methods” to “Kernel Smoothing Methods”, to avoid confusion with the machine- learning kernel method that is discussed in the context of support vec- tor machines (Chapter 11) and more generally in Chapters 5 and 14.
• In the first edition, the discussion of error-rate estimation in Chap- ter 7 was sloppy, as we did not clearly differentiate the notions of conditional error rates (conditional on the training set) and uncondi- tional rates. We have fixed this in the new edition.
Preface to the Second Edition ix
• Chapters 15 and 16 follow naturally from Chapter 10, and the chap- ters are probably best read in that order.
• In Chapter 17, we have not attempted a comprehensive treatment of graphical models, and discuss only undirected models and some new methods for their estimation. Due to a lack of space, we have specifically omitted coverage of directed graphical models.
• Chapter 18 explores the “p ≫ N” problem, which is learning in high- dimensional feature spaces. These problems arise in many areas, in- cluding genomic and proteomic studies, and document classification.
We thank the many readers who have found the (too numerous) errors in the first edition. We apologize for those and have done our best to avoid er- rors in this new edition. We thank , , and for comments on some of the new chapters, and many Stanford graduate and post-doctoral students who offered comments, in particular Quraishi, , , , Mahon, , , , and . We thank for his patience in guiding us through this new edition. RT dedicates this edition to the memory of Phee.
Trevor Tibshirani
Stanford, California August 2008
x Preface to the Second Edition
Preface to the First Edition
We are drowning in information and starving for knowledge.
– . field of Statistics is constantly challenged by the problems that science and industry brings to its door. In the early days, these problems often came from agricultural and industrial experiments and were relatively small in scope. With the advent of computers and the information age, statistical problems have exploded both in size and complexity. Challenges in the areas of data storage, organization and searching have led to the new field of “data mining”; statistical and computational problems in biology and medicine have created “bioinformatics.” Vast amounts of data are being generated in many fields, and the statistician’s job is to make sense of it all: to extract important patterns and trends, and understand “what the data says.” We call this learning from data.
The challenges in learning from data have led to a revolution in the sta- tistical sciences. Since computation plays such a key role, it is not surprising that much of this new development has been done by researchers in other fields such as computer science and engineering.
The learning problems that we consider can be roughly categorized as either supervised or unsupervised. In supervised learning, the goal is to pre- dict the value of an outcome measure based on a number of input measures; in unsupervised learning, there is no outcome measure, and the goal is to describe the associations and patterns among a set of input measures.
This is page xi Printer: Opaque this
xii Preface to the First Edition
This book is our attempt to bring together many of the important new ideas in learning, and explain them in a statistical framework. While some mathematical details are needed, we emphasize the methods and their con- ceptual underpinnings rather than their theoretical properties. As a result, we hope that this book will appeal not just to statisticians but also to researchers and practitioners in a wide variety of fields.
Just as we have learned a great deal from researchers outside of the field of statistics, our statistical viewpoint may help others to better understand different aspects of learning:
There is no true interpretation of anything; interpretation is a vehicle in the service of human comprehension. The value of interpretation is in enabling others to fruitfully think about an idea.
We would like to acknowledge the contribution of many people to the conception and completion of this book. , , , , , , , and have greatly influenced our careers. Balasub- ramanian Narasimhan gave us advice and help on many computational problems, and maintained an excellent computing environment. Shin-Ho Bang helped in the production of a number of the figures. gave valuable tips on color production. , , , , , , , Bog- dan Popescu, , , , , Mu Zhu, two reviewers and many students read parts of the manuscript and offered helpful suggestions. was supportive, patient and help- ful at every phase; Mary and headed a superb production team at Springer. would like to thank the statis- tics department at the University of Cape Town for their hospitality during the final stages of this book. We gratefully acknowledge NSF and NIH for their support of this work. Finally, we would like to thank our families and our parents for their love and support.
Trevor Tibshirani
Stanford, California May 2001
The quiet statisticians have changed our world; not by discov- ering new facts or technical developments, but by changing the ways that we reason, experiment and form our opinions ….
Preface to the Second Edition vii Preface to the First Edition xi 1 Introduction 1
2 Overview of Supervised Learning 9
2.1 Introduction ……………………. 9
2.2 VariableTypesandTerminology………….. 9
2.3 Two Simple Approaches to Prediction:
LeastSquaresandNearestNeighbors . . . . . . . . . . . 11
2.3.1 LinearModelsandLeastSquares . . . . . . . . 11
2.3.2 Nearest-NeighborMethods ………… 14
2.3.3 From Least Squares to Nearest Neighbors . . . . 16
2.4 StatisticalDecisionTheory …………….. 18
2.5 LocalMethodsinHighDimensions. . . . . . . . . . . . . 22
2.6 Statistical Models, Supervised Learning
andFunctionApproximation ……………. 28
2.6.1 A Statistical Model
for the Joint Distribution Pr(X, Y ) ……. 28
2.6.2 SupervisedLearning……… ……. 29
2.6.3 Function Approximation . . . . . . ……. 29
2.7 StructuredRegressionModels …………… 32 2.7.1 DifficultyoftheProblem …………. 32
This is page xiii Printer: Opaque this
2.8 ClassesofRestrictedEstimators . . . . . . . . . . .
2.8.1 Roughness Penalty and Bayesian Methods
2.8.2 Kernel Methods and Local Regression . . .
2.8.3 Basis Functions and Dictionary Methods .
2.9 Model Selection and the Bias–Variance Tradeoff . . BibliographicNotes……………………. 39 Exercises…………………………. 39
Linear Methods for Regression 43
3.1 Introduction ……………………. 43
3.2 Linear Regression Models and Least Squares . . . . . . . 44
3.2.1 Example:ProstateCancer ………… 49
3.2.2 TheGauss–MarkovTheorem……….. 51
3.2.3 Multiple Regression
from Simple Univariate Regression . . . . . . . . 52
3.2.4 MultipleOutputs …………….. 56
3.3 SubsetSelection ………………….. 57
3.3.1 Best-SubsetSelection …………… 57
3.3.2 Forward- and Backward-Stepwise Selection .
3.3.3 Forward-Stagewise Regression . . . . . . . .
3.3.4 Prostate Cancer Data Example (Continued)
3.4 ShrinkageMethods………………..
3.4.1 RidgeRegression …………….. 61
3.4.2 TheLasso ………………… 68
3.4.3 Discussion: Subset Selection, Ridge Regression
andtheLasso ………………. 69
3.4.4 LeastAngleRegression ………….. 73
3.5 Methods Using Derived Input Directions . . . . . . . . . 79
3.5.1 Principal Components Regression . . . . . . . . 79
3.5.2 PartialLeastSquares …………… 80
3.6 Discussion: A Comparison of the Selection andShrinkageMethods ………………. 82
3.7 Multiple Outcome Shrinkage and Selection . . . . . . . . 84
3.8 More on 3.8.1
3.8.2 3.8.3 3.8.4 3.8.5 3.8.6
the Lasso and Related Path Algorithms . . . . . 86 Incremental Forward Stagewise Regression . . . 86 Piecewise-Linear Path Algorithms . . . . . . . . 89 TheDantzigSelector …………… 89 TheGroupedLasso ……………. 90 FurtherPropertiesoftheLasso. . . . . . . . . . 91 Pathwise Coordinate Optimization . . . . . . . . 92
3.9 ComputationalConsiderations …………… 93 BibliographicNotes……………………. 94 Exercises…………………………. 94
… 33 … 34 … 34 … 35 … 37
.. 58 .. 60 .. 61 .. 61
4 Linear Methods for Classification
Contents xv
4.1 Introduction ……………………. 101
4.2 Linear Regression of an Indicator Matrix . . . . . . . . . 103
4.3 LinearDiscriminantAnalysis……………. 106
4.3.1 Regularized Discriminant Analysis . . . . . . . . 112
4.3.2 ComputationsforLDA ………….. 113
4.3.3 Reduced-Rank Linear Discriminant Analysis . . 113
4.4 LogisticRegression…………………. 119
4.4.1 Fitting Logistic Regression Models . . . . . . . . 120
4.4.2 Example: South African Heart Disease . . . . . 122
4.4.3 Quadratic Approximations and Inference . . . . 124
4.4.4 L1 Regularized Logistic Regression . . . . . . . . 125
4.4.5 LogisticRegressionorLDA? . . . . . . . . . . . 127
4.5 SeparatingHyperplanes………………. 129
4.5.1 Rosenblatt’s Perceptron Learning Algorithm . . 130
4.5.2 Optimal Separating Hyperplanes . . . . . . . . . 132
BibliographicNotes……………………. 135 Exercises…………………………. 135
5 Basis Expansions and Regularization 139
5.1 Introduction ……………………. 139
5.2 PiecewisePolynomialsandSplines . . . . . . . . . . . . . 141
5.2.1 NaturalCubicSplines…………… 144
5.2.2 Example: South African Heart Disease (Continued)146
5.2.3 Example: Phoneme Recognition . . . . . . . . . 148
5.3 FilteringandFeatureExtraction . . . . . . . . . . . . . . 150
5.4 SmoothingSplines…………………. 151
5.4.1 Degrees of Freedom and Smoother Matrices . . . 153
5.5 Automatic Selection of the Smoothing Parameters . . . . 156
5.5.1 FixingtheDegreesofFreedom . . . . . . . . . . 158
5.5.2 TheBias–VarianceTradeoff. . . . . . . . . . . . 158
5.6 NonparametricLogisticRegression . . . . . . . . . . . . . 161
5.7 MultidimensionalSplines ……………… 162
5.8 Regularization and Reproducing Kernel . 167
5.8.1 Spaces of Functions Generated by Kernels . . . 168
5.8.2 ExamplesofRKHS ……………. 170
5.9 WaveletSmoothing ………………… 174
5.9.1 Wavelet Bases and the Wavelet Transform . . . 176
5.9.2 AdaptiveWaveletFiltering . . . . . . . . . . . . 179
BibliographicNotes……………………. 181 Exercises…………………………. 181 Appendix: Computational Considerations for Splines . . . . . . 186 Appendix:B-splines………………… 186 Appendix: Computations for Smoothing Splines . . . . . 189
xvi Contents
6 Kernel Smoothing Methods 191
6.1 One-DimensionalKernelSmoothers . . . . . . . . . . . . 192
6.1.1 LocalLinearRegression………….. 194
6.1.2 LocalPolynomialRegression . . . . . . . . . . . 197
6.2 SelectingtheWidthoftheKernel . . . . . . . . . . . . . 198
6.3 LocalRegressioninIRp ………………. 200
6.4 Structured Local Regression Models in IRp . . . . . . . . 201
6.4.1 StructuredKernels…………….. 203
6.4.2 Structured Regression Functions . . . . . . . . . 203
6.5 LocalLikelihoodandOtherModels . . . . . . . . . . . . 205
6.6 Kernel Density Estimation and Classification . . . . . . . 208
6.6.1 KernelDensityEstimation ………… 208
6.6.2 KernelDensityClassification . . . . . . . . . . . 210
6.6.3 TheNaiveBayesClassifier ………… 210
6.7 RadialBasisFunctionsandKernels . . . . . . . . . . . . 212
6.8 Mixture Models for Density Estimation and Classification 214
6.9 ComputationalConsiderations …………… 216
BibliographicNotes……………………. 216 Exercises…………………………. 216
7 Model Assessment and Selection 219
7.1 Introduction ……………………. 219
7.2 Bias,VarianceandModelComplexity . . . . . . . . . . . 219
7.3 TheBias–VarianceDecomposition . . . . . . . . . . . . . 223
7.3.1 Example: Bias–Variance Tradeoff . . . . . . . . 226
7.4 OptimismoftheTrainingErrorRate . . . . . . . . . . . 228
7.5 Estimates of In-Sample Prediction Error . . . . . . . . . . 230
7.6 TheEffectiveNumberofParameters. . . . . . . . . . . . 232
7.7 TheBayesianApproachandBIC………….. 233
7.8 MinimumDescriptionLength……………. 235
7.9 Vapnik–ChervonenkisDimension………….. 237
7.9.1 Example(Continued) …………… 239
7.10 Cross-Validation ………………….. 241 7.10.1 K-FoldCross-Validation …………. 241
7.10.2 The Wrong and Right Way toDoCross-validation…………… 245
7.10.3 Does Cross-Validation Really Work? . . . . . . . 247
7.11 BootstrapMethods ………………… 249 7.11.1 Example(Continued) …………… 252
7.12 ConditionalorExpectedTestError?. . . . . . . . . . . . 254
BibliographicNotes……………………. 257 Exercises…………………………. 257
8 Model Inference and Averaging 261
8.1 Introduction ……………………. 261
Contents xvii
8.2 The Bootstrap and Maximum Likelihood Methods . . . . 261
8.2.1 ASmoothingExample ………….. 261
8.2.2 MaximumLikelihoodInference………. 265
8.2.3 Bootstrap versus Maximum Likelihood . . . . . 267
8.3 BayesianMethods …………………. 267
8.4 Relationship Between the Bootstrap
andBayesianInference ………………. 271
8.5 TheEMAlgorithm ………………… 272
8.5.1 Two-Component Mixture Model . . . . . . . . . 272
8.5.2 TheEMAlgorithminGeneral . . . . . . . . . . 276
8.5.3 EM as a Maximization–Maximization Procedure 277
8.6 MCMCforSamplingfromthePosterior . . . . . . . . . . 279
8.7 Bagging………………………. 282
8.7.1 Example: Trees with Simulated Data . . . . . . 283
8.8 ModelAveragingandStacking …………… 288
8.9 StochasticSearch:Bumping…………….. 290
BibliographicNotes……………………. 292 Exercises…………………………. 293
9 Additive Models, Trees, and Related Methods 295
9.1 GeneralizedAdditiveModels ……………. 295
9.1.1 FittingAdditiveModels………….. 297
9.1.2 Example: Additive Logistic Regression . . . . . 299
9.1.3 Summary…………………. 304
9.2 Tree-BasedMethods………………… 305
9.2.1 Background ……………….. 305
9.2.2 RegressionTrees……………… 307
9.2.3 ClassificationTrees ……………. 308
9.2.4 OtherIssues ……………….. 310
9.2.5 SpamExample(Continued)……….. 313
9.3 PRIM:BumpHunting……………….. 317
9.3.1 SpamExample(Continued) . . . . . . . . . . . 320
9.4 MARS: Multivariate Adaptive Regression Splines . . . . . 321
9.4.1 SpamExample(Continued)……….. 326
9.4.2 Example(SimulatedData) ………… 327
9.4.3 OtherIssues ……………….. 328
9.5 HierarchicalMixturesofExperts . . . . . . . . . . . . . . 329
9.6 MissingData……………………. 332
9.7 ComputationalConsiderations …………… 334
BibliographicNotes……………………. 334 Exercises…………………………. 335
10 Boosting and Additive Trees 337
10.1 BoostingMethods …………………. 337 10.1.1 OutlineofThisChapter………….. 340
10.2 BoostingFitsanAdditiveModel………….. 341
10.3 Forward Stagewise Additive Modeling . . . . . . . . . . . 342
10.4 ExponentialLossandAdaBoost ………….. 343
10.5 WhyExponentialLoss? ………………. 345
10.6 LossFunctionsandRobustness…………… 346
10.7 “Off-the-Shelf” Procedures for Data Mining . . . . . . . . 350
10.8 Example:SpamData ……………….. 352
10.9 BoostingTrees …………………… 353
10.10 Numerical Optimization via Gradient Boosting . . . . . . 358
10.10.1 SteepestDescent……………… 358 10.10.2 GradientBoosting…………….. 359 10.10.3 Implementations of Gradient Boosting . . . . . . 360
10.11Right-SizedTreesforBoosting …………… 361 10.12Regularization …………………… 364 10.12.1 Shrinkage…………………. 364 10.12.2 Subsampling ……………….. 365 10.13Interpretation …………………… 367 10.13.1 Relative Importance of Predictor Variables . . . 367 10.13.2 PartialDependencePlots…………. 369 10.14Illustrations…………………….. 371 10.14.1 CaliforniaHousing…………….. 371 10.14.2 NewZealandFish …………….. 375 10.14.3 DemographicsData ……………. 379 BibliographicNotes……………………. 380 Exercises…………………………. 384
11 Neural Networks 389
11.1 Introduction ……………………. 389
11.2 ProjectionPursuitRegression …………… 389
11.3 NeuralNetworks………………….. 392
11.4 FittingNeuralNetworks………………. 395
11.5 Some Issues in Training Neural Networks . . . . . . . . . 397
11.5.1 Starting
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com