Introduction to Machine Learning
Sarat C. Dass
Department of Mathematical and Computer Sciences Heriot-Watt University Malaysia Campus
What is Statistical Learning?
Let’s begin with a simple example. Suppose that you are a consultant hired to provide advice on how to improve sales of a particular product.
The Advertising data set consists of the sales of a product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper.
The data are displayed below:
Figure: Scatter plots of Sales against the TV, Radio, and Newspaper advertising budgets for the 200 markets.
What is Statistical Learning? (cont.)
What do you see from the plots? What do you conclude?
It is not possible for our client to directly increase sales of the product. On the other hand, they can control the advertising expenditure in each of the three media.
If we determine that there is an association between advertising and sales, then we can instruct our client to adjust advertising budgets, thereby indirectly increasing sales.
In other words, our goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets.
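To preview where we are heading, here is a minimal R sketch of such a model; the file name Advertising.csv and the column names TV, Radio, Newspaper and Sales are assumptions made for illustration, not taken from the slides.

#Read the Advertising data (file name and column names assumed)
advertising <- read.csv("Advertising.csv")

#Fit a linear model predicting Sales from the three media budgets
ad_fit <- lm(Sales ~ TV + Radio + Newspaper, data = advertising)

#Inspect the estimated associations between the budgets and sales
summary(ad_fit)

#Predict sales for a hypothetical allocation of advertising budgets
predict(ad_fit, data.frame(TV = 100, Radio = 20, Newspaper = 10))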
Terms and Symbols Used for Statistical Learning
In this setting, the advertising budgets are input variables while sales is an output variable.
The inputs are typically denoted using the variable symbol X with subscripts to distinguish them, e.g., X1, X2, X3, etc.
So X1 might be the TV budget, X2 the radio budget, and X3 the newspaper budget.
The output variable is usually denoted by the symbol Y . Here, Y = sales.
The inputs go by different names, such as predictors, independent variables, features, or sometimes just variables.
The output variable is often called the response or dependent variable.
Throughout the lectures and the textbook, we will use all of these terms interchangeably.
The Statistical Model
More generally, suppose that we observe a quantitative response Y and p different predictors, X1, X2, · · · , Xp.
We assume that there is some relationship between Y and X = (X_1, X_2, ..., X_p), which can be written in the very general form

\[
Y = f(X) + \varepsilon
\]

Here f(X) is some fixed but unknown function of X, and ε is a random error term which is independent of X and has mean zero. In this formulation, f(X) represents the systematic information that X provides about Y.
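To make the formulation concrete, here is a small R sketch that simulates data from this model; the particular f and noise level are made up purely for illustration.

set.seed(1)
n <- 100
x <- runif(n, 0, 10)                #inputs X
f <- function(x) 5 + 2*sin(x)       #a hypothetical fixed but unknown f
eps <- rnorm(n, mean = 0, sd = 1)   #errors with mean zero, independent of X
y <- f(x) + eps                     #responses generated as Y = f(X) + error

#Plot the simulated data together with the true f
plot(x, y)
curve(f, add = TRUE, col = "blue", lwd = 2)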
Let’s look at another example …
Another example: Example 2
As another example, consider Y = income and X = years of education for 30 individuals. The plot is given below:
The plot suggests that one might be able to predict income using years of education.
However, the function Y = f(X) that connects X to Y is generally unknown. In this situation, one must estimate f based on the observed points; call this estimate f̂(X).
To explain how this estimation will be performed, we use simulated data where the true f(X) ≡ E(Y|X) is known. The true f(X) is shown by the blue curve in the right-hand panel of the figure on the next slide.
Figures for Example 2
Figure: Income versus Education for the 30 observations; the right-hand panel shows the true f(X) as the blue curve, with vertical lines representing the errors ε.
The vertical lines represent the error terms ε. We note that some of the 30 observations lie above the blue curve and some lie below it; overall, the errors have approximately mean zero.
More general f's
In general, the function f(X) may involve more than one input variable.
In Figure 2.3 of the textbook, we plot income as a function of years of education and seniority: f(X) is now a 2D surface that must be estimated based on the observed data, but the errors have the same characteristics as before.
Figure: This figure is Figure 2.3 taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani
The best regressor: formulation and derivation
Suppose we wish to predict Y by f(X).

Let (X, Y) ∼ π(x, y), where the marginal of X and the conditional of Y given X are denoted by π(x) and π(y|x), respectively. The mean squared error (MSE) is given by

\[
\begin{aligned}
\mathrm{MSE} &\equiv E_{\pi(x,y)}\,[Y - f(X)]^2\\
&= E_{\pi(x,y)}\,[Y - E(Y|X) + E(Y|X) - f(X)]^2\\
&= E_{\pi(x,y)}\,[Y - E(Y|X)]^2 + E_{\pi(x,y)}\,[E(Y|X) - f(X)]^2\\
&= E_{\pi(x)}E_{\pi(y|x)}\,[Y - E(Y|X)]^2 + E_{\pi(x)}\,[E(Y|X) - f(X)]^2\\
&= E_{\pi(x)}\,[\mathrm{Var}(Y|X)] + E_{\pi(x)}\,[E(Y|X) - f(X)]^2
\end{aligned}
\]

Thus, the MSE is minimized when f(X) = E(Y|X), the conditional expectation of Y given X calculated with respect to π(y|x).

E(Y|X) is the best regressor, or best predictor, of Y based on X.
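This claim can be checked numerically. The sketch below simulates from one joint distribution where E(Y|X) is known and compares Monte Carlo estimates of the MSE for the conditional mean and for two other candidate predictors; the chosen distribution is an assumption made only for this illustration.

set.seed(1)
N <- 1e5
x <- runif(N, 0, 2)
y <- x^2 + rnorm(N, sd = 0.5)     #here E(Y|X) = X^2 and Var(Y|X) = 0.25

mse <- function(pred) mean((y - pred)^2)
mse(x^2)         #best regressor E(Y|X): MSE close to 0.25
mse(x)           #another function of X: larger MSE
mse(mean(y))     #a constant predictor: larger still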
Steps of the proof
To get the third equality from the second, note that
\[
\begin{aligned}
&E_{\pi(x,y)}\,[Y - E(Y|X) + E(Y|X) - f(X)]^2\\
&= E_{\pi(x,y)}\Bigl[(Y - E(Y|X))^2 + 2\,(Y - E(Y|X))(E(Y|X) - f(X)) + (E(Y|X) - f(X))^2\Bigr]\\
&= E_{\pi(x,y)}\,(Y - E(Y|X))^2 + E_{\pi(x,y)}\,(E(Y|X) - f(X))^2
\end{aligned}
\]

since the cross term

\[
\begin{aligned}
E_{\pi(x,y)}\,[(Y - E(Y|X))(E(Y|X) - f(X))]
&= E_{\pi(x)}E_{\pi(y|x)}\,[(Y - E(Y|X))(E(Y|X) - f(X))]\\
&= E_{\pi(x)}\,\bigl[(E(Y|X) - f(X))\,E_{\pi(y|x)}(Y - E(Y|X))\bigr]
\quad\text{(since $E(Y|X) - f(X)$ is a function of $X$ only)}\\
&= E_{\pi(x)}\,[(E(Y|X) - f(X))(E(Y|X) - E(Y|X))] = 0
\end{aligned}
\]
Steps of the proof (cont.)
To get the fourth and fifth equalities from the third, note that
\[
\begin{aligned}
&E_{\pi(x,y)}\,(Y - E(Y|X))^2 + E_{\pi(x,y)}\,(E(Y|X) - f(X))^2\\
&= E_{\pi(x)}E_{\pi(y|x)}\,(Y - E(Y|X))^2 + E_{\pi(x)}E_{\pi(y|x)}\,(E(Y|X) - f(X))^2\\
&= E_{\pi(x)}\,\mathrm{Var}(Y|X) + E_{\pi(x)}\,(E(Y|X) - f(X))^2
\end{aligned}
\]

since E(Y|X) − f(X) does not depend on Y (it depends on X only), and hence

\[
E_{\pi(x)}E_{\pi(y|x)}\,(E(Y|X) - f(X))^2
= E_{\pi(x)}\,\bigl[(E(Y|X) - f(X))^2\, E_{\pi(y|x)}(1)\bigr]
= E_{\pi(x)}\,(E(Y|X) - f(X))^2 ,
\]

and Var(Y|X) ≡ E_{π(y|x)}(Y − E(Y|X))² is the definition of the conditional variance of Y given X.
The learning target
Recall the decomposition of the MSE as

\[
\mathrm{MSE} = E_{\pi(x)}\,[\mathrm{Var}(Y|X)] + E_{\pi(x)}\,[E(Y|X) - f(X)]^2 \qquad (1)
\]

It follows that

\[
\mathrm{MSE} \ge E_{\pi(x)}\,[\mathrm{Var}(Y|X)],
\]

with equality if and only if f(X) = E(Y|X), the best predictor of Y given X under the MSE criterion.

E(Y|X) is the target of our learning procedure. We want an f(X) that is a good approximation of E(Y|X), or even f(X) = E(Y|X) exactly if possible.
The challenges involved
The joint pdf π(x,y) will usually be unknown and therefore E(Y|X) is also unknown.
The first error term on the RHS of (1) is the variance of ε, which is inherent noise and hence irreducible.
The second error term on the RHS of (1) is a measure of how well f(X) approximates E(Y|X).
This error can be reduced by estimating f from a suitably selected class of functions which mimics the unknown form of E(Y|X).
This component is called the reducible error component; it is the focus for us.
Let’s go back to the problem of estimation
Recall the difficulties:
The joint pdf π(x, y) is unknown and hence E(Y|X) is unknown.
We need to choose a class of functions C which can mimic the form of the unknown E(Y|X).
We need an error criterion to perform the estimation of f within C.

Here are the solutions:
Obtain a training set (x_i, y_i), i = 1, 2, ..., n, iid from π(x, y).
Choose a class of functions C which can reasonably model E(Y|X), i.e., the relationship between x and y in the training data set.
In the current context, choose the empirical MSE as the criterion for estimating f ∈ C.
Measuring the quality of fit: The empirical MSE criterion
In the regression setting, the most commonly used measure is the (empirical) mean squared error (MSE), given by

\[
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - f(x_i)\bigr)^2 \qquad (2)
\]

based on a training dataset (x_i, y_i), i = 1, 2, ..., n, of size n, which is assumed to be iid from π(x, y).

Note that the above (empirical) MSE is an estimate of the population MSE given in (1) since

\[
\text{pop. MSE} \equiv E_{\pi(x,y)}\,(Y - f(X))^2
\approx E_{\hat\pi(x,y)}\,(Y - f(X))^2
= \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - f(x_i)\bigr)^2
\equiv \text{emp. MSE}
\]

where π̂(x, y) is the empirical distribution which puts mass 1/n on each point (x_i, y_i), i = 1, 2, ..., n.
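For instance, the empirical MSE in (2) can be computed directly in R for any candidate f; the data and the candidate functions below are assumptions used only to illustrate the calculation.

#Empirical MSE of a candidate f over a training set (x, y)
emp_mse <- function(f, x, y) mean((y - f(x))^2)

#Toy training data (simulated for illustration)
set.seed(1)
x <- runif(50)
y <- 3*x + rnorm(50, sd = 0.2)

emp_mse(function(x) 3*x, x, y)   #close to the error variance 0.04
emp_mse(function(x) 2*x, x, y)   #a worse candidate gives a larger empirical MSE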
Next, choose a class C
In order to estimate f , we have to choose a class of functions C that can reasonably model the relationship between X and Y that we observe in the training dataset.
In Example 2, we can choose C = C1 to be the class of linear functions f(X) = β0 + β1X, where β0 and β1 are unknown parameters that have to be estimated in order to obtain f̂.
The MSE criterion to obtain f̂ becomes

\[
\hat f(X) = \arg\min_{f \in \mathcal{C}_1}\ \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - f(x_i)\bigr)^2 = \hat\beta_0 + \hat\beta_1 X
\]

where

\[
(\hat\beta_0, \hat\beta_1) = \arg\min_{\beta_0,\beta_1}\ \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - \beta_1 x_i\bigr)^2
\]
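As a sanity check, this arg min can also be computed numerically. The sketch below minimizes the empirical MSE over (β0, β1) with optim() for the Income1 data and should approximately reproduce the least squares coefficients; this check is an illustration and not part of the original slides.

#Read the Income1 data (also used later in the slides)
income1 <- read.csv("Income1.csv")

#Empirical MSE as a function of (beta0, beta1) for the linear class C1
emp_mse_lin <- function(beta, x, y) mean((y - beta[1] - beta[2]*x)^2)

#Numerical minimization over (beta0, beta1)
opt <- with(income1,
            optim(c(0, 0), emp_mse_lin, x = Education, y = Income,
                  method = "BFGS"))
opt$par                                       #numerical minimizer
coef(lm(Income ~ Education, data = income1))  #least squares fit for comparison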
Least squares regression
Recall that

\[
(\hat\beta_0, \hat\beta_1) = \arg\min_{\beta_0,\beta_1}\ \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - \beta_1 x_i\bigr)^2
\]

is precisely the least squares criterion you learnt previously for simple linear regression.

The estimates β̂0 and β̂1 are the least squares estimates

\[
\hat\beta_1 = \frac{\sum_{i=1}^{n}(y_i - \bar y)(x_i - \bar x)}{\sum_{i=1}^{n}(x_i - \bar x)^2}
\qquad\text{and}\qquad
\hat\beta_0 = \bar y - \hat\beta_1 \bar x .
\]

Thus, the predictor of Y at a given X = x is

\[
\hat f(x) = \hat\beta_0 + \hat\beta_1 x .
\]

In other words, the unknown E(Y|X) is approximated by a linear function of X.
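These closed-form expressions are easy to verify in R against lm(); the check below is illustrative and assumes the Income1 data frame read from Income1.csv, as elsewhere in the slides.

income1 <- read.csv("Income1.csv")

#Least squares estimates from the closed-form formulas
beta1_hat <- with(income1,
                  sum((Income - mean(Income)) * (Education - mean(Education))) /
                    sum((Education - mean(Education))^2))
beta0_hat <- with(income1, mean(Income) - beta1_hat * mean(Education))
c(beta0_hat, beta1_hat)

#The same estimates from lm()
coef(lm(Income ~ Education, data = income1))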
Least squares simple linear regression: Example 2
Let’s fit a least squares regression line to the plot of Y versus X in Example 2.
Figure: Income versus Education with the fitted least squares regression line.
Ask yourself: Is the fit good? Your answer will determine whether C1 is a reasonable class for the unknown E(Y|X).
R codes
Here are the R codes used to generate the previous figures:
#Example 2: Income versus years of education
#Fit least squares regression line to scatter plot
library(splines)
#Need this library for residual diagnostics

#Read data
income1 <- read.csv("Income1.csv")

#Obtain scatter plot
#Check out the trend
with(income1,
     plot(Education, Income, type = "p", col = "brown",
          xlab = "Education", ylab = "Income", cex = 2, lwd = 3)
)
R codes (cont.)
#Fit least squares regression line
lm_fit <- lm(Income ~ Education, data = income1)
#lm_fit is a lm object which will be used subsequently

#Summary of fit analysis
summary(lm_fit)

#Draw the regression line
abline(lm_fit, col = "blue", lwd = 3)
Output of lm fit
##
## Call:
## lm(formula = Income ~ Education, data = income1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.046 -2.293 0.472 3.288 10.110
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39.4463 4.7248 -8.349 4.4e-09 ***
## Education 5.5995 0.2882 19.431 < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.653 on 28 degrees of freedom
## Multiple R-squared:  0.931,  Adjusted R-squared:  0.9285
## F-statistic: 377.6 on 1 and 28 DF,  p-value: < 2.2e-16

#Create training and validation datasets
library(dplyr)   #needed for %>%, sample_frac() and setdiff()
#Training dataset data.frame
train <- income1 %>% sample_frac(0.7)
#Validation dataset data.frame
valid <- income1 %>% setdiff(train)
R codes (cont.)
#Determine class of learners (polynomial regression with degree 3)
poly3_train_fit <- lm(Income ~ poly(Education, 3),
                      data = train)
poly3_train_predict <- predict(poly3_train_fit, train)
poly3_valid_predict <- predict(poly3_train_fit, valid)
MSE_train <- with(train,
                  mean((Income - poly3_train_predict)^2))
MSE_train
## [1] 16.47067
MSE_valid <- with(valid,
mean((Income - poly3_valid_predict)^2))
MSE_valid
## [1] 14.2355
Now the full CV with K = 50 and P = 6
#Now let's do full CV
K = 50;
P = 6;
MSE_train_mat <- vector("list", P)
MSE_valid_mat <- vector("list", P)
for (k in 1:K){
  #Training dataset data.frame
  train <- income1 %>% sample_frac(0.7)
  #Validation dataset data.frame
  valid <- income1 %>% setdiff(train)
Full CV with K = 50 and P = 6 (cont.)
#Determine class of learners which are polynomials
#from degree 1 to 6
  for (p in 1:P){
    poly_train_fit <- lm(Income ~ poly(Education, p), data = train)
    poly_train_predict <- predict(poly_train_fit, train)
    poly_valid_predict <- predict(poly_train_fit, valid)
    MSE_train_mat[[p]][k] <- with(train,
                                  mean((Income - poly_train_predict)^2))
    MSE_valid_mat[[p]][k] <- with(valid,
                                  mean((Income - poly_valid_predict)^2))
  }
}
MSE_train_p <- sapply(MSE_train_mat, mean)
MSE_valid_p <- sapply(MSE_valid_mat, mean)
plot(seq(1,P,1), MSE_valid_p, type="l", col="blue", ylim = c(0, 50), lwd=3)
lines(seq(1,P,1), MSE_train_p, type="l",col="red", lwd=3)
Plot of MSE_Train(C_p) (red) and MSE_Valid(C_p) (blue) versus p
Figure: Training MSE (red) and validation MSE (blue) as functions of the polynomial degree p = 1, ..., 6.

Note that the best fit is at p* = 4.
Best fit plot with p∗ = 4: R codes
#Best fit is p*=4
poly_best_fit <- lm(Income ~ poly(Education,4), data = income1)
with(income1,
plot(Education, Income, type = "p", col = "brown",
xlab = "Education", ylab="Income", cex = 2, lwd=3)
)
#This part is to fit fhat to the scatter plot
xpoints = with(income1,
seq(min(Education), max(Education),0.5))
#prediction using fhat at xpoints
ypoints <- predict(poly_best_fit,
data.frame(Education=xpoints))
#Plot the points on scatter plot
lines(xpoints, ypoints, col = "blue", lwd=3)
Best fit plot with p∗ = 4
Figure: Income versus Education with the fitted degree-4 polynomial f̂ shown as the blue curve.
Choice of Class of Functions: Prediction Accuracy versus Model Interpretability
Some classes are less flexible meaning that they produce a small range of shapes for f , e.g., linear regression.
Other methods are considerably more flexible because they can generate a much wider range of possible shapes to estimate f .
Why would we ever choose to use a more restrictive method instead of a very flexible approach?
One answer that you have seen is to avoid overfitting.
Another answer is that if you want to interpret the model parameters in a certain way which is related to the real problem, it is better to choose a less flexible method.
In linear regression, we have an interpretation for the slope and the intercept but as we move to higher order polynomial functions, the interpretation of coefficients associated with higher powers of x is less clear.
For example, if x is years of education, what does x^5 mean for the real problem?
Choice of Class of Functions: Prediction vs. Inference
Why estimate f? Two reasons: prediction and inference.

Prediction: We obtain Ŷ = f̂(X), and f̂ is treated as a black box. We are only interested in how well f̂ predicts future Y's. Here, we are not concerned with the form of f that results, and can select a more flexible class.
On the other hand, inference means that we seek to understand how X affects Y .
For example, we may want to know which independent variables are most associated with the response, whether these relationships are positive or negative, or what effect an increase in X will have on Y.
In these scenarios, f̂ cannot be treated as a black box, and simpler model choices will help to answer such questions.
Bias versus Variance Trade Off: Introduction
The CV procedure explained previously calculates MSE_Valid(C) based on a validation dataset V after f has been trained on a training dataset T.
The sets V and T change at each cycle of the CV procedure.
Bias versus Variance Trade Off (cont.)
The CV procedure thus tries to estimate the population version of MSE_Valid(C), which is given by

\[
E_{\pi(x_0, y_0)}\, E_{\pi^n(\mathbf{x}, \mathbf{y})}\,\Bigl[\, y_0 - \hat f(x_0\,;\,\mathbf{x}, \mathbf{y}) \Bigr]^2
\]

where

π(x, y) is the joint pdf of (x, y), and π^n(x, y) is the pdf of the iid training samples (x, y) ≡ {(x_i, y_i), i = 1, 2, ..., n} under π(x, y);
E_{π(x_0, y_0)} is the expectation with respect to a new unseen sample arising from π(x, y);
f̂(x_0 ; x, y) is the estimated f based on the training set (x, y).

We emphasize the dependence of f̂(x_0 ; x, y) on (x, y), which can change when the training set changes.
Bias versus Variance Trade Off (cont.)
Define E(Y|X = x_0) ≡ f_0(x_0), let E^n be the expectation calculated with respect to π^n(x, y), and let E^n(f̂(x_0 ; x, y)) ≡ f̄_0(x_0).

Using the same arguments as before, we can show that

\[
\begin{aligned}
&E_{\pi(x_0,y_0)}\, E_{\pi^n(\mathbf{x},\mathbf{y})}\,\bigl[\, y_0 - \hat f(x_0\,;\,\mathbf{x},\mathbf{y}) \bigr]^2\\
&= E_{\pi(x_0,y_0)}\, E^n\bigl[\, y_0 - f_0(x_0) + f_0(x_0) - \bar f_0(x_0) + \bar f_0(x_0) - \hat f(x_0\,;\,\mathbf{x},\mathbf{y}) \bigr]^2\\
&= E_{\pi(x_0,y_0)}\,\bigl(y_0 - f_0(x_0)\bigr)^2
 + E_{\pi(x_0,y_0)}\,\bigl(f_0(x_0) - \bar f_0(x_0)\bigr)^2
 + E_{\pi(x_0,y_0)}\, E^n\bigl(\bar f_0(x_0) - \hat f(x_0\,;\,\mathbf{x},\mathbf{y})\bigr)^2
\end{aligned}
\]

Recall that the first term in the last equality,

\[
E_{\pi(x_0,y_0)}\,\bigl(y_0 - f_0(x_0)\bigr)^2 = E_{\pi(x_0)}\,\mathrm{Var}(Y|X = x_0),
\]

is the irreducible error.
Bias versus Variance Trade Off (cont.)
The second term, E_{π(x_0,y_0)}( f_0(x_0) − f̄_0(x_0) )², is the expected square of the bias term B ≡ f̄_0(x_0) − f_0(x_0).

This term measures how well, on average, the trainings of f using class C are able to approximate the true f_0(x_0). It will become large if C is a more restrictive class which is not able to approximate f_0(x_0) accurately.

The third term, E_{π(x_0,y_0)} E^n( f̄_0(x_0) − f̂(x_0 ; x, y) )², measures the variability of each training of f from its average over all trainings. It will become large if C is a more flexible class, which results in overfitting.
Bias versus Variance Trade Off (cont.)
Thus, we have

MSE_Valid(C) = Irreducible Error + (Training Bias)² + Training Variance

For a general family of classes C_p indexed by p, where larger p indicates a more flexible class, we expect

(Training Bias(C_p))² ↓ and Training Variance(C_p) ↑ as p increases.
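Because the bias and variance terms require the true f_0, they can only be computed in a simulation. The sketch below uses a made-up f_0 and repeatedly refits polynomials of increasing degree to illustrate how the squared bias tends to shrink while the variance grows with p; all settings here are illustrative assumptions.

set.seed(1)
f0 <- function(x) sin(2*x)            #assumed true regression function
x0 <- 1.3                             #evaluation point
n <- 50; K <- 200; P <- 6             #sample size, replications, maximum degree
fhat_x0 <- matrix(NA, K, P)           #fitted values at x0 over the replications

for (k in 1:K){
  x <- runif(n, 0, 3)
  y <- f0(x) + rnorm(n, sd = 0.3)
  for (p in 1:P){
    fit <- lm(y ~ poly(x, p))
    fhat_x0[k, p] <- predict(fit, data.frame(x = x0))
  }
}

bias2    <- (colMeans(fhat_x0) - f0(x0))^2   #(training bias)^2 at x0
variance <- apply(fhat_x0, 2, var)           #training variance at x0
rbind(bias2, variance)                       #bias^2 falls and variance rises with p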
Bias versus Variance Trade Off: Example 2
This explains why we observed the U-shaped behaviour of MSE_Valid(C_p) in Example 2.
When p is small, the bias term is large but the variance term is small. As p increases, the bias term becomes smaller but the variance term becomes larger.
Thus, the sum of the two terms first decreases, achieves a minimum and then increases.
Calculating the bias term requires knowledge of E(Y|X), which is unknown, so the bias term cannot be calculated for real datasets. But this theoretical study is useful for understanding the behaviour of MSE_Valid(C_p) in all situations.