
10-601 Introduction to Machine Learning
Machine Learning Department School of Computer Science Carnegie Mellon University
Stochastic Gradient Descent
+
Probabilistic Learning (Binary Logistic Regression)
Matt Gormley Lecture 9 Mar. 01, 2021
1

Reminders
• Homework 3: KNN, Perceptron, Lin.Reg.
– Out: Mon, Feb. 22
– Due: Mon, Mar. 01 at 11:59pm
– IMPORTANT: you may only use 2 grace days on Homework 3 (last possible moment to submit HW3: Wed, Mar. 03 at 11:59pm)
• Practice for Exam
– Mock Exam 1
• Wed, Mar. 03 at 7:00pm – 9:00pm
• See @261 for participation point details
– Practice Problems 1A (Gradescope)
– Practice Problems 1B (PDF)
• Midterm Exam 1
– Sat, Mar. 06, 10:30am – 12:30pm EST
2

MIDTERM EXAM LOGISTICS
3

Midterm Exam – Saturday Exam
• Time / Location
– Time: Saturday, March 6, 10:30am – 12:30pm EST
– Location: We will contact you with additional details about how to join the appropriate Zoom meeting.
– Seats: There will be assigned Zoom rooms. Please arrive online early.
– Please watch Piazza carefully for announcements.
• Logistics
– Covered material: Lecture 1 – Lecture 8
– Format of questions:
• Multiple choice
• True / False (with justification)
• Derivations
• Short answers
• Interpreting figures
• Implementing algorithms on paper
– No electronic devices
4

Midterm Exam
• How to Prepare
– Attend the midterm review lecture
(right now!)
– Participate in the Mock Exam
– Review exam practice problems (we’ll post them)
– Review this year’s homework problems
– Consider whether you have achieved the “learning objectives” for each lecture / section
5

Midterm Exam
• Advice (for during the exam)
– Solve the easy problems first
(e.g. multiple choice before derivations)
• if a problem seems extremely complicated you’re likely missing something
– Don’t leave any answer blank!
– If you make an assumption, write it down
– If you look at a question and don’t know the answer:
• we probably haven’t told you the answer
• but we’ve told you enough to work it out
• imagine arguing for some answer and see if you like it
6

Topics for Midterm 1
• Foundations
– Probability, Linear Algebra, Geometry, Calculus
– Optimization
• Important Concepts
– Overfitting
– Experimental Design
• Classification
– Decision Tree
– KNN
– Perceptron
• Regression
– Linear Regression
7

SAMPLE QUESTIONS
8

Sample Questions

[Figure: a decision tree question with two subtrees (subtree 1 and subtree 2), annotated with the entropy calculations:
−0.5 log2(0.5) − 0.5 log2(0.5) = 1
−log2(0.75) ≈ 0.4    −log2(0.25) = 2 ]

10

Sample Questions
4 K-NN [12 pts]
In this problem, you will be tested on your knowledge of K-Nearest Neighbors (K-NN), where k indicates the number of nearest neighbors.
Now we will apply K-Nearest Neighbors using Euclidean distance to a binary classification task. We assign the class of the test point to be the class of the majority of the k nearest neighbors. A point can be its own neighbor.
1. [3 pts] For K-NN in general, are there any cons of using very large k values? Select one. Briefly justify your answer.
(a) Yes (b) No
2. [3 pts] For K-NN in general, are there any cons of using very small k values? Select one. Briefly justify your answer.
(a) Yes (b) No
Figure 5
3. [2 pts] What value of k minimizes leave-one-out cross-validation error for the dataset shown in Figure 5? What is the resulting error?
11

Sample Questions
4 SVM, Perceptron and Kernels [20 pts. + 4 Extra Credit]
4.1 True or False
Answer each of the following questions with T or F and provide a one line justification.
(a) [2 pts.] Consider two datasets D^(1) and D^(2) where D^(1) = {(x_1^(1), y_1^(1)), ..., (x_n^(1), y_n^(1))} and D^(2) = {(x_1^(2), y_1^(2)), ..., (x_m^(2), y_m^(2))} such that x_i^(1) ∈ R^{d_1}, x_i^(2) ∈ R^{d_2}. Suppose d_1 > d_2 and n > m. Then the maximum number of mistakes a perceptron algorithm will make is higher on dataset D^(1) than on dataset D^(2).
(b) [2 pts.] Suppose φ(x) is an arbitrary feature mapping from input x ∈ X to φ(x) ∈ R^N and let K(x, z) = φ(x) · φ(z). Then K(x, z) will always be a valid kernel function.
(c) [2 pts.] Given the same training data, in which the points are linearly separable, the margin of the decision boundary produced by SVM will always be greater than or equal to the margin of the decision boundary produced by Perceptron.
12

Sample Questions

3 Linear and Logistic Regression [20 pts. + 2 Extra Credit]
3.1 Linear regression
Given that we have an input x and we want to estimate an output y, in linear regression we assume the relationship between them is of the form y = wx + b + ε, where w and b are real-valued parameters we estimate and ε represents the noise in the data. When the noise is Gaussian, maximizing the likelihood of a dataset S = {(x_1, y_1), ..., (x_n, y_n)} to estimate the parameters w and b is equivalent to minimizing the squared error:

    argmin_w Σ_{i=1}^n (y_i − (w x_i + b))^2.

Consider the dataset S plotted in Fig. 1 along with its associated regression line. For each of the altered data sets S_new plotted in Fig. 3, indicate which regression line (relative to the original one) in Fig. 2 corresponds to the regression line for the new data set. Write your answers in the table below.

Dataset          (a)  (b)  (c)  (d)  (e)
Regression line

Figure 1: An observed data set and its associated regression line.
Figure 2: New regression lines for altered data sets S_new.
Figure 3 (panels shown so far): (a) Adding one outlier to the original data set. (b) Adding two outliers to the original data set.

13
A

Sample Questions

[Same linear regression question as the previous slide. Fig. 3 now also shows: (c) Adding three outliers to the original data set, two on one side and one on the other side.]

14

Sample Questions

[Same linear regression question. Fig. 3 now also shows: (d) Duplicating the original data set.]

15

Sample Questions

[Same linear regression question. Fig. 3 now also shows: (e) Duplicating the original data set and adding four points that lie on the trajectory of the original regression line.]

Figure 3: New data sets S_new.

16

Q&A
23

Q&A
Q: Is there one recitation timeslot or two for this class?
A:
Back to just one, i.e. Friday, same time as lecture.
We tried hosting a Thursday evening recitation, but attendance remained around half a dozen students. So we are not hosting it anymore.
24

Q&A
Q: Why did we focus mostly on the Perceptron mistake bound for linearly separable data; isn’t that an unrealistic setting?
A:
Not at all! Even if your data isn’t linearly separable to begin with, we can often add features to make it so.
x1   x2   y
+1   +1   +
+1   −1   −
−1   +1   −
−1   −1   +
Exercise: Add another feature to transform this nonlinearly separable data into linearly separable data.
25
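One way to approach the exercise (a minimal sketch of my own, not the official solution): add the product feature x1*x2 and run a plain perceptron to confirm that the augmented data is linearly separable.

```python
import numpy as np

# The four points from the table, with labels y in {+1, -1}.
X = np.array([[+1, +1], [+1, -1], [-1, +1], [-1, -1]], dtype=float)
y = np.array([+1, -1, -1, +1])

# Add the product feature x1*x2 (one possible answer to the exercise).
X_aug = np.column_stack([X, X[:, 0] * X[:, 1]])

# A plain perceptron (with bias) to check linear separability.
w, b = np.zeros(X_aug.shape[1]), 0.0
for _ in range(100):
    mistakes = 0
    for x_i, y_i in zip(X_aug, y):
        if y_i * (w @ x_i + b) <= 0:      # mistake: update
            w += y_i * x_i
            b += y_i
            mistakes += 1
    if mistakes == 0:                     # a mistake-free pass => linearly separable
        break
print(w, b, mistakes)   # mistakes == 0; the learned w puts all its weight on x1*x2
```

With the extra feature the labels equal the sign of x1*x2 exactly, so the perceptron converges after a couple of passes.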

CLOSED FORM SOLUTION FOR LINEAR REGRESSION
27

Computational Complexity of OLS
To solve the Ordinary Least Squares problem we compute the normal-equations solution θ = (X^T X)^(−1) X^T y.
The resulting shapes of the matrices: X is N × M, so X^T X is M × M and X^T y is M × 1.
Computational Complexity of OLS:
– Linear in # of examples, N
– Polynomial in # of features, M
31
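A minimal NumPy sketch of the closed-form computation described above. The synthetic data and the use of np.linalg.solve (instead of an explicit matrix inverse) are my choices; the slide only gives the normal-equations formula.

```python
import numpy as np

# Closed-form OLS via the normal equations.
# X: (N, M) design matrix (prepend a column of ones for a bias term).
# y: (N,) vector of targets.
def ols_closed_form(X, y):
    # Solve (X^T X) theta = X^T y; solving is numerically preferable to inverting X^T X.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Tiny usage example on synthetic data.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])  # bias + 2 features
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta + 0.1 * rng.normal(size=100)
print(ols_closed_form(X, y))  # close to [1, 2, -3]
```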

Gradient Descent
Cases in which to consider gradient descent:
1. What if we cannot find a closed-form solution?
2. What if we can, but it's inefficient to compute?
3. What if we can, but it's numerically unstable to compute?
32

Empirical Convergence
[Figure: log-log plot of training Mean Squared Error vs. epochs, comparing Gradient Descent, SGD, and the closed-form (normal equations) solution]
• Def: an epoch is a single pass through the training data
1. For GD, only one update per epoch
2. For SGD, N updates per epoch, where N = (# train examples)
• SGD reduces MSE much more rapidly than GD
• For GD / SGD, training MSE is initially large due to uninformed initialization
33

LINEAR REGRESSION: SOLUTION UNIQUENESS
34

Linear Regression: Uniqueness
Question:
Consider a 1D linear regression model trained to minimize MSE.
How many solutions (i.e. sets of parameters w,b) are there for the given dataset?
y
Two Points (Case 1)
x
35

Linear Regression: Uniqueness
Question:
Consider a 1D linear regression model trained to minimize MSE.
How many solutions (i.e. sets of parameters w,b) are there for the given dataset?
y
One Point
x
36

Linear Regression: Uniqueness
Question:
Consider a 1D linear regression model trained to minimize MSE.
How many solutions (i.e. sets of parameters w,b) are there for the given dataset?
y
Two Points (Case 2)
Answer:
A: 0 B: 1 C: 2 D: +∞
x
37

Linear Regression: Uniqueness
Question:
• Consider a 2D linear regression model trained to minimize MSE
• How many solutions (i.e. sets of parameters w1, w2, b) are there for the given dataset?
y
Points on a Line
x2
x1
39

Linear Regression: Uniqueness
Question:
• Consider a 2D linear regression model trained to minimize MSE
• How many solutions (i.e. sets of parameters w1, w2, b) are there for the given dataset?
y
Points on a Line
x2
x1
40

Linear Regression: Uniqueness
Question:
• Consider a 2D linear regression model trained to minimize MSE
• How many solutions (i.e. sets of parameters w1, w2, b) are there for the given dataset?
y
Points on a Line
x2
x1
41

Linear Regression: Uniqueness
To solve the Ordinary Least Squares problem we compute θ = (X^T X)^(−1) X^T y.
These geometric intuitions align with the linear algebraic intuitions we can derive from the normal equations.
1. If X^T X is invertible, then there is exactly one solution.
2. If X^T X is not invertible, then there are either no solutions or infinitely many solutions.
42

Linear Regression: Uniqueness
To solve the Ordinary Least Squares problem we compute θ = (X^T X)^(−1) X^T y.
These geometric intuitions align with the linear algebraic intuitions we can derive from the normal equations.
1. If X^T X is invertible, then there is exactly one solution.
2. If X^T X is not invertible, then there are either no solutions or infinitely many solutions.
Note: Invertibility of X^T X is equivalent to X being full rank. That is, there is no feature that is a linear combination of the other features.
43
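A small NumPy sketch (my own example) illustrating case 2: when one feature is a linear combination of another, X^T X is rank-deficient and infinitely many parameter vectors achieve the minimum MSE.

```python
import numpy as np

# One feature is an exact multiple of another, so X^T X is singular.
rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
X = np.column_stack([np.ones(50), x1, 2.0 * x1])  # third column = 2 * second column
y = 3.0 + 1.5 * x1 + 0.1 * rng.normal(size=50)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))   # 2 < 3, so X^T X is not invertible

# lstsq still returns *a* minimizer (the minimum-norm one), but every pair
# (w1, w2) with w1 + 2*w2 held fixed achieves the same minimal MSE.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)
```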

OPTIMIZATION METHOD #3: STOCHASTIC GRADIENT DESCENT
44

Gradient Descent

Algorithm 1 Gradient Descent
1: procedure GD(D, θ^(0))
2:     θ ← θ^(0)
3:     while not converged do
4:         θ ← θ − γ ∇_θ J(θ)
5:     return θ

45
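A minimal Python sketch of the gradient descent procedure above. The step size gamma and the gradient-norm stopping test are illustrative choices of mine; the pseudocode only says "while not converged".

```python
import numpy as np

# Batch gradient descent on an objective given only through its gradient.
def gradient_descent(grad_J, theta0, gamma=0.01, tol=1e-6, max_iter=10000):
    theta = np.array(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad_J(theta)
        if np.linalg.norm(g) < tol:      # simple "not converged" check
            break
        theta = theta - gamma * g        # step opposite the gradient
    return theta

# Usage: minimize J(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3).
print(gradient_descent(lambda th: 2 * (th - 3.0), theta0=[0.0]))  # ~[3.0]
```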

Stochastic Gradient Descent (SGD)
We need a per-example objective:
Let J(θ) = Σ_{i=1}^N J^(i)(θ), where J^(i)(θ) = (1/2)(θ^T x^(i) − y^(i))^2.
46

Stochastic Gradient Descent (SGD)
We need a per-example objective:
Let J(θ) = Σ_{i=1}^N J^(i)(θ), where J^(i)(θ) = (1/2)(θ^T x^(i) − y^(i))^2.

In practice, it is common to implement SGD using sampling without replacement (i.e. shuffle({1, 2, ..., N})), even though most of the theory is for sampling with replacement (i.e. Uniform({1, 2, ..., N})).

47
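A minimal sketch of SGD with shuffling (sampling without replacement), as described above. The per-example gradient grad_Ji, the step size, and the epoch count are placeholders of mine.

```python
import numpy as np

# SGD over per-example objectives J^(i); grad_Ji(theta, i) returns the gradient
# of J^(i) at theta.
def sgd(grad_Ji, theta0, N, gamma=0.01, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for _ in range(epochs):
        for i in rng.permutation(N):            # shuffle({1, ..., N}) each epoch
            theta -= gamma * grad_Ji(theta, i)  # step on a single example's gradient
    return theta

# Usage: per-example objective J^(i)(theta) = 1/2 (theta - c_i)^2, so its gradient is (theta - c_i).
c = np.array([1.0, 2.0, 3.0, 4.0])
print(sgd(lambda th, i: th - c[i], theta0=[0.0], N=len(c), gamma=0.05, epochs=200))
# hovers near mean(c) = 2.5 (with a constant step size, SGD oscillates around the minimizer)
```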

49

Expectations of Gradients
50
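The slide's title refers to a standard property; here is a short derivation in my notation, assuming the example index i is drawn uniformly from {1, ..., N}: the expected stochastic gradient is the full-batch gradient scaled by 1/N, so the SGD step follows the true gradient direction in expectation.

```latex
\mathbb{E}_{i \sim \mathrm{Uniform}(\{1,\dots,N\})}\!\left[ \nabla_{\theta} J^{(i)}(\theta) \right]
  = \sum_{i=1}^{N} \frac{1}{N}\, \nabla_{\theta} J^{(i)}(\theta)
  = \frac{1}{N}\, \nabla_{\theta} \sum_{i=1}^{N} J^{(i)}(\theta)
  = \frac{1}{N}\, \nabla_{\theta} J(\theta).
```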

LINEAR REGRESSION: PRACTICALITIES
51

Empirical Convergence
[Figure: log-log plot of training Mean Squared Error vs. epochs, comparing Gradient Descent, SGD, and the closed-form (normal equations) solution]
• Def: an epoch is a single pass through the training data
1. For GD, only one update per epoch
2. For SGD, N updates per epoch, where N = (# train examples)
• SGD reduces MSE much more rapidly than GD
• For GD / SGD, training MSE is initially large due to uninformed initialization
52

Convergence of Optimizers
53

SGD FOR
LINEAR REGRESSION
54

Linear Regression as Function Approximation
55

Gradient Calculation for Linear Regression
∇_θ J(θ) = Σ_{i=1}^N (θ^T x^(i) − y^(i)) x^(i)    [used by Gradient Descent]
∇_θ J^(i)(θ) = (θ^T x^(i) − y^(i)) x^(i)    [used by SGD]
57

SGD for Linear Regression
SGD applied to Linear Regression is called the “Least Mean Squares” algorithm
58
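A self-contained sketch of LMS on synthetic data (my own example): each update uses the single-example gradient (θ^T x^(i) − y^(i)) x^(i) with an illustrative step size and epoch count.

```python
import numpy as np

# Least Mean Squares: SGD on J^(i)(theta) = 1/2 (theta^T x^(i) - y^(i))^2.
def lms(X, y, gamma=0.01, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):                    # shuffle each epoch
            theta -= gamma * (theta @ X[i] - y[i]) * X[i]    # per-example gradient step
    return theta

# Synthetic data: y = X @ [1, -2, 0.5] + noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
print(lms(X, y))   # close to [1.0, -2.0, 0.5]
```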

GD for Linear Regression
Gradient Descent for Linear Regression repeatedly takes steps opposite the gradient of the objective function

Algorithm 1 GD for Linear Regression
1: procedure GDLR(D, θ^(0))
2:     θ ← θ^(0)                                    ▷ Initialize parameters
3:     while not converged do
4:         g ← Σ_{i=1}^N (θ^T x^(i) − y^(i)) x^(i)  ▷ Compute gradient
5:         θ ← θ − γ g                              ▷ Update parameters
6:     return θ
59

Optimization Objectives
You should be able to…
• Apply gradient descent to optimize a function
• Apply stochastic gradient descent (SGD) to optimize a function
• Apply knowledge of zero derivatives to identify a closed-form solution (if one exists) to an optimization problem
• Distinguish between convex, concave, and nonconvex functions
• Obtain the gradient (and Hessian) of a (twice) differentiable function
62

Linear Regression Objectives
You should be able to…
• Design k-NN Regression and Decision Tree Regression
• Implement learning for Linear Regression using three optimization techniques: (1) closed form, (2) gradient descent, (3) stochastic gradient descent
• Choose a Linear Regression optimization technique that is appropriate for a particular dataset by analyzing the tradeoff of computational complexity vs. convergence speed
• Distinguish the three sources of error identified by the bias-variance decomposition: bias, variance, and irreducible error.
63

PROBABILISTIC LEARNING
64

Probabilistic Learning
Function Approximation
Previously, we assumed that our output was generated using a deterministic target function: y = c*(x)
Our goal was to learn a hypothesis h(x) that best approximates c*(x)
Probabilistic Learning
Today, we assume that our output is sampled from a conditional probability distribution: y ~ p*(y | x)
Our goal is to learn a probability distribution p(y|x) that best approximates p*(y|x)
65

Robotic Farming
Classification (binary output):
– Deterministic: Is this a picture of a wheat kernel?
– Probabilistic: Is this plant drought resistant?
Regression (continuous output):
– Deterministic: How many wheat kernels are in this picture?
– Probabilistic: What will the yield of this plant be?
66

Bayes Optimal Classifier
Whiteboard
– Bayes Optimal Classifier
– Reducible / irreducible error
– Ex: Bayes Optimal Classifier for 0/1 Loss
67

Bayes Optimal Classifier
68
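The derivation itself was done on the whiteboard; as a reference, here is the standard result for 0/1 loss in my notation, where h* denotes the Bayes optimal classifier and its error is the irreducible error.

```latex
h^{*}(x) = \operatorname*{argmax}_{y}\; p^{*}(y \mid x)
\qquad\text{with Bayes (irreducible) error}\qquad
\mathrm{err}(h^{*}) = \mathbb{E}_{x}\!\left[\, 1 - \max_{y} p^{*}(y \mid x) \,\right].
```

For any learned classifier h, the gap err(h) − err(h*) ≥ 0 is the reducible error.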

MLE
Suppose we have data D = {x^(i)}_{i=1}^N.

Principle of Maximum Likelihood Estimation:
Choose the parameters that maximize the likelihood of the data.

Maximum Likelihood Estimate (MLE):
    θ_MLE = argmax_θ ∏_{i=1}^N p(x^(i) | θ)

Maximum a posteriori (MAP) estimate:
    θ_MAP = argmax_θ ∏_{i=1}^N p(x^(i) | θ) p(θ)

[Figure: likelihood surfaces L(θ) and L(θ1, θ2) with the maximizer θ_MLE marked]
69

MLE
What does maximizing likelihood accomplish?
• There is only a finite amount of probability mass (i.e. sum-to-one constraint)
• MLE tries to allocate as much probability mass as possible to the things we have observed…
…at the expense of the things we have not observed
70
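A concrete example of this point (my own, not from the slides): for N coin flips x^(i) ∈ {0, 1} modeled as Bernoulli(θ), the MLE puts exactly the observed fraction of probability mass on heads.

```latex
\theta_{\mathrm{MLE}}
  = \operatorname*{argmax}_{\theta}\; \prod_{i=1}^{N} \theta^{x^{(i)}} (1-\theta)^{1-x^{(i)}}
  = \operatorname*{argmax}_{\theta}\; \Big(\textstyle\sum_{i} x^{(i)}\Big)\log\theta
      + \Big(N - \textstyle\sum_{i} x^{(i)}\Big)\log(1-\theta)
  \;\Rightarrow\;
  \theta_{\mathrm{MLE}} = \frac{1}{N}\sum_{i=1}^{N} x^{(i)}.
```

If every flip came up heads, the MLE is θ = 1, allocating no mass at all to tails, which is exactly the "at the expense of the things we have not observed" behavior described above.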

Maximum Likelihood Estimation
71