程序代写代做代考 algorithm GPU decision tree 10-601 Introduction to Machine Learning

10-601 Introduction to Machine Learning
Machine Learning Department School of Computer Science Carnegie Mellon University
Logistic Regression + Feature Engineering + Regularization
Matt Gormley Lecture 11 Mar. 3, 2021
1

Reminders
• Homework 3: KNN, Perceptron, Lin.Reg.
– Out: Mon, Feb. 22
– Due: Mon, Mar. 01 at 11:59pm
– IMPORTANT: you may only use 2 grace days on Homework 3 (last possible moment to submit HW3: Wed, Mar. 03 at 11:59pm)
• Practice for Exam – Mock Exam 1
• Wed, Mar. 03 at 7:00pm – 9:00pm
• See @261 for participation point details
– Practice Problems 1A (Gradescope) – Practice Problems 1B (PDF)
• Midterm Exam 1
– Saturday, March 6, at 10:30am – 12:30pm EST
3

Reminders
4

PROBABILISTIC LEARNING
5

MLE
􏰀􏰁􏰂􏰂􏰃􏰄􏰅 􏰆􏰅 􏰇􏰈􏰉􏰅 􏰊􏰈􏰋􏰈 D = {x(i)}Ni=1
Principle of Maximum Likelihood Estimation: N
= 􏰏􏰐􏰑􏰒􏰏􏰓 p(􏰓 |)
􏰌􏰍􏰎 (i)
Choose the parameters that maximize the likelihood
of the data.
N
􏰌􏰍􏰎 = 􏰏􏰐􏰑􏰒􏰏􏰓 p(􏰓(i)|)

N
i=1
􏰌􏰔􏰕 = 􏰏􏰐􏰑􏰒􏰏􏰓 p(􏰓(i)|)p()
Maximum Likelihood Estimate (MLE)

i=1
i=1
θ2 θMLE
L(θ) θMLE
L(θ1, θ2) θ1
6

MLE
What does maximizing likelihood accomplish?
• There is only a finite amount of probability mass (i.e. sum-to-one constraint)
• MLE tries to allocate as much probability mass as possible to the things we have observed…
…at the expense of the things we have not observed
7

Maximum Likelihood Estimation
8

Learning from Data (Frequentist)
Whiteboard
– Principle of Maximum Likelihood Estimation (MLE)
– Strawmen:
• Example: Bernoulli
• Example: Gaussian
• Example: Conditional #1
(Bernoulli conditioned on Gaussian)
• Example: Conditional #2
(Gaussians conditioned on Bernoulli)
9

MOTIVATION: LOGISTIC REGRESSION
11

Example: Image Classification
• ImageNet LSVRC-2010 contest:
– Dataset: 1.2 million labeled images, 1000 classes
– Task: Given a new image, label it with the correct class – Multiclass classification problem
• Examples from http://image-net.org/
14

15

16

17

Example: Image Classification
CNN for Image Classification
(Krizhevsky, Sutskever & Hinton, 2011)
17.5% error on ImageNet LSVRC-2010 contest
• Five convolutional layers (w/max-pooling)
• Three fully connected layers
Input image (pixels)
1000-way softmax
Figure 2: An illustration of the architecture of our CNN, explicitly showing the delineation of responsibilities
18
between the two GPUs. One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts

Example: Image Classification
CNN for Image Classification
(Krizhevsky, Sutskever & Hinton, 2011)
17.5% error on ImageNet LSVRC-2010 contest
• Five convolutional layers (w/max-pooling)
• Three fully connected layers
Input image (pixels)
1000-way softmax
The rest is just some fancy feature extraction (discussed later in the course)
This “softmax” layer is Logistic Regression!
Figure 2: An illustration of the architecture of our CNN, explicitly showing the delineation of responsibilities
19
between the two GPUs. One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts

LOGISTIC REGRESSION
20

Logistic Regression
Data: Inputs are continuous vectors of length M. Outputs are discrete.
We are back to classification.
Despite the name logistic regression.
21

Linear Models for Classification
Key idea: Try to learn this hyperplane directly
Directly modeling the hyperplane would use a decision function:
h(􏰓) = 􏰄􏰖􏰗􏰘(T 􏰓) for:
y {1, +1}
Looking ahead:
• We’ll see a number of commonly used Linear Classifiers
• These include:
– Perceptron
– LogisticRegression
– NaïveBayes(under
certain conditions)
– SupportVector Machines
Recall…

Background: Hyperplanes Hyperplane (Definition 1):
Notation Trick: fold the
bias b and the weights w
into a single vector θ by
prepending a constant to
x and increasing w
dimensionality by one to get x’!
Half-spaces:
H = {x : wT x = b} Hyperplane (Definition 2):
’’ 1’
1 1
Recall…

Using gradient ascent for linear
classifiers Key idea behind today’s lecture:
1. Define a linear classifier (logistic regression)
2. Define an objective function (likelihood)
3. Optimize it with gradient descent to learn parameters
4. Predict the class with highest probability under the model
24

Using gradient ascent for linear classifiers
This decision function isn’t differentiable:
h(􏰓) = 􏰄􏰖􏰗􏰘(T 􏰓)
sign(x)
Use a differentiable function instead: 1
p(y = 1|􏰓) = 1 + 􏰙􏰓􏰚(T 􏰓)
logistic(u) ≡ 1 1+e−u
25

Using gradient ascent for linear classifiers
This decision function isn’t differentiable:
h(􏰓) = 􏰄􏰖􏰗􏰘(T 􏰓)
sign(x)
Use a differentiable function instead: 1
p(y = 1|􏰓) = 1 + 􏰙􏰓􏰚(T 􏰓)
logistic(u) ≡ 1 1+e−u
26

Logistic Regression
Data: Inputs are continuous vectors of length M. Outputs are discrete.
Model: Logistic function applied to dot product of
1
parameters with input vector.
p(y = 1|􏰓) =
1 + 􏰙􏰓􏰚(T 􏰓)
Learning: finds the parameters that minimize some
objective function.
=argminJ()
Prediction: Output is the most probable class.
yˆ = 􏰏 􏰐 􏰑 􏰒 􏰏 􏰓 p ( y | 􏰓 ) y{0,1}
27

Logistic Regression
Whiteboard
– Logistic Regression Model – Decision boundary
29

Learning for Logistic Regression
Whiteboard
– Partial derivative for Logistic Regression – Gradient for Logistic Regression
30

LOGISTIC REGRESSION ON GAUSSIAN DATA
31

Logistic Regression
32

Logistic Regression
33

Logistic Regression
34

LEARNING LOGISTIC REGRESSION
35

Maximum Conditional Likelihood Estimation
Learning: finds the parameters that minimize some objective function.
=argminJ()
We minimize the negative log conditional likelihood:
Why?
1. We can’t maximize likelihood (as in Naïve Bayes)
because we don’t have a joint model p(x,y)
2. It worked well for Linear Regression (least squares is
MCLE)
J() = 􏰛􏰜􏰑
N i=1
p(y(i)|􏰓(i))
36

Maximum Conditional Likelihood Estimation
Learning: Four approaches to solving Approach 1: Gradient Descent
=argminJ()
(take larger – more certain – steps opposite the gradient)
Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient)
Approach 3: Newton’s Method
(use second derivatives to better follow curvature)
Approach 4: Closed Form???
(set derivatives equal to zero and solve for parameters)
37

Maximum Conditional Likelihood Estimation
Learning: Four approaches to solving Approach 1: Gradient Descent
=argminJ()
(take larger – more certain – steps opposite the gradient)
Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient)
Approach 3: Newton’s Method
(use second derivatives to better follow curvature)
Approach 4: Closed Form???
(set derivatives equal to zero and solve for parameters)
Logistic Regression does not have a closed form solution for MLE parameters.
38

SGD for Logistic Regression
Question:
Which of the following is a correct description of SGD for Logistic Regression?
Answer:
At each step (i.e. iteration) of SGD for Logistic Regression we…
A. (1) compute the gradient of the log-likelihood for all examples (2) update all the parameters using the gradient
B. (1) ask Matt for a description of SGD for Logistic Regression, (2) write it down, (3) report that answer
C. (1) compute the gradient of the log-likelihood for all examples (2) randomly pick an example (3) update only the parameters for that example
D. (1) randomly pick a parameter, (2) compute the partial derivative of the log- likelihood with respect to that parameter, (3) update that parameter for all examples
E. (1) randomly pick an example, (2) compute the gradient of the log-likelihood for that example, (3) update all the parameters using that gradient
F. (1) randomly pick a parameter and an example, (2) compute the gradient of the log-likelihood for that example with respect to that parameter, (3) update that parameter using that gradient
39

Gradient Descent
In order to apply GD to Logistic Regression all we need is the gradient of the objective function (i.e. vector of partial derivatives).
d d1
J()
.
d J() = d2
J()

d dN
. J ()
40
Algorithm 1 Gradient Descent
1:
2:
3: 4:
5:
procedure GD(D, (0)) (0)
while not converged do
+ J()
return


Recall…

Stochastic Gradient Descent (SGD)

We can also apply SGD to solve the MCLE problem for Logistic Regression.
We need a per-example objective:
􏰍􏰅􏰋 J() = Ni=1 J(i)() 􏰆􏰇􏰅􏰝􏰅 J(i)() = 􏰛􏰜􏰑 p(yi|􏰓i)􏰞
41
Recall…

Logistic Regression vs. Perceptron
Question:
True or False: Just like Perceptron, one step (i.e. iteration) of SGD for Logistic Regression will result in a change to the parameters only if the current example is incorrectly classified.
Answer:
42

Matching Game
Goal: Match the Algorithm to its Update Rule
1. SGD for Logistic Regression
h(x) = p(y|x)
2. Least Mean Squares
h(x) = T x
3. Perceptron
h (x) = sign(T x)
4.
k k + (h(x(i)) y(i))
5. k k + 1 1+exp(h(x(i))y(i))
6.
k k + (h(x(i)) y(i))x(i) k
A. 1=5, 2=4, 3=6 B. 1=5, 2=6, 3=4 C. 1=6, 2=4, 3=4 D. 1=5, 2=6, 3=6
E. 1=6, 2=6, 3=6 F. 1=6, 2=5, 3=5 G. 1=5, 2=5, 3=5 H. 1=4, 2=5, 3=6
43

OPTIMIZATION METHOD #4: MINI-BATCH SGD
44

Mini-Batch SGD
• Gradient Descent:
Compute true gradient exactly from all N examples
• Stochastic Gradient Descent (SGD): Approximate true gradient by the gradient of one randomly chosen example
• Mini-Batch SGD:
Approximate true gradient by the average gradient of K randomly chosen examples
45

Mini-Batch SGD
Three variants of first-order optimization:
46

Summary
1. Discriminative classifiers directly model the
conditional, p(y|x)
2. Logistic regression is a simple linear classifier, that retains a probabilistic semantics
3. ParametersinLRarelearnedbyiterative optimization (e.g. SGD)
55

Logistic Regression Objectives
You should be able to…
• Apply the principle of maximum likelihood estimation (MLE) to learn the parameters of a probabilistic model
• Given a discriminative probabilistic model, derive the conditional log-likelihood, its gradient, and the corresponding Bayes Classifier
• Explain the practical reasons why we work with the log of the likelihood
• Implement logistic regression for binary or multiclass classification
• Prove that the decision boundary of binary logistic regression is linear
• For linear regression, show that the parameters which minimize squared error are equivalent to those that maximize conditional likelihood
56

FEATURE ENGINEERING
57

Handcrafted Features
born-in
LOC PER
p(y|x) ∝ ( )
exp(ΘyŸf
)
S
NP
VP
ADJP NP VP
NNP : VBN NNP VBD egypt – born proyas direct
Egypt – born Proyas directed
58

Where do features come from?
First word before M1 Second word before M1 Bag-of-words in M1
Head word of M1
Other word in between First word after M2 Second word after M2 Bag-of-words in M2
Head word of M2
Bigrams in between
Words on dependency path Country name list Personal relative triggers Personal title list
WordNet Tags
Heads of chunks in between Path of phrase labels Combination of entity types
hand-crafted features
Sun et al., 2011
Zhou et al., 2005
Feature Learning
59
Feature Engineering

Where do features come from?
hand-crafted features
Sun et al., 2011
Zhou et al., 2005
Look-up table
Classifier
input (context words)
embeddin g
missing word
unsupervised learning
0.11
.23

-.45
similar words, similar embeddings
cat:
0.13
.26

-.52
dog:
CBOW model in Mikolov et al. (2013)
word
embeddings
Mikolov et al., 2013
Feature Learning
60
Feature Engineering

Where do features come from?
The [movie] showed [wars]
Recursive Auto Encoder
(Socher 2011)
RAE
pooling
hand-crafted features
T
Convolutional Neural Networks
(Collobert and Weston 2008)
Sun et al.
20
11
,
e
h
[movie] showed [wars]
Zhou et al., 2005
CNN
word
embeddings
Mikolov et al., 2013
string embeddings
Socher, 2011 Collobert & Weston,
2008
Feature Learning
61
Feature Engineering

Where do features come from?
hand-crafted features
S
WNP,VP
NP VP
Sun et al., 2011
WDT,NN
WV,NN
tree embeddings
Socher et al.,
2013 Hermann & Blunsom,
2013
string embeddings
Socher, 2011 Collobert & Weston,
2008
The [movie] showed [wars]
Zhou et al., 2005
word
embeddings
Mikolov et al., 2013
Feature Learning
62
Feature Engineering

Where do features come from?
hand-crafted features
Sun et al., 2011
word embedding features
Turian et al. 2010
Koo et al. 2008
Hermann et al. 2014
Zhou et al., 2005
word
embeddings
Mikolov et al., 2013
tree embeddings
Socher et al.,
2013 Hermann & Blunsom,
2013
string embeddings
Socher, 2011 Collobert & Weston,
2008
Feature Learning
63
Refine embedding features with semantic/syntactic info
Feature Engineering

Where do features come from?
hand-crafted features
word embedding features
best of both worlds?
tree embeddings
Socher et al.,
2013 Hermann & Blunsom,
2013
string embeddings
Socher, 2011 Collobert & Weston,
2008
Sun et al., 2011
Turian et al. 2010
Koo et al. 2008
Hermann et al. 2014
Zhou et al., 2005
word
embeddings
Mikolov et al., 2013
Feature Learning
64
Feature Engineering

Feature Engineering for NLP
Suppose you build a logistic regression model to predict a part-of-speech (POS) tag for each word in a sentence.
What features should you use?
deter. noun noun verb verb noun
The movie I watched depicted hope
65

Feature Engineering for NLP
Per-word Features:
x(4) x(5) x(6)
x(1) x(2) x(3)
1
1
0
0
0
0

0
1
0
0
0
1

1
0
0
0
0
0

0
0
1
1
0
0

0
1
0
0
0
0

0
0
1
1
0
0

is-capital(wi) endswith(wi,“e”) endswith(wi,“d”) endswith(wi,“ed”) wi == “aardvark” wi == “hope”

deter. noun noun verb
verb noun
The movie I watched depicted hope
66

Feature Engineering for NLP
Context Features:
x(4) x(5) x(6)
x(1) x(2) x(3)

0
0
0
0
0


0
0
0
0
1


0
1
0
0
0


0
0
1
0
0


0
0
0
1
0


1
0
0
0
0


wi == “watched” wi+1 == “watched” wi-1 == “watched” wi+2 == “watched” wi-2 == “watched”

deter. noun noun verb
verb noun
The movie I watched depicted hope
67

Feature Engineering for NLP
Context Features:
x(4) x(5) x(6)
x(1) x(2) x(3)

0
0
0
1
0


0
0
0
0
0


1
0
0
0
0


0
0
0
0
1


0
1
0
0
0


0
0
1
0
0


wi == “I” wi+1 == “I” wi-1 == “I” wi+2 == “I” wi-2 == “I”

deter. noun noun verb
verb noun
The movie I watched depicted hope
68

Development 19-21 5,527 131,768 4,467
Test 22-24 5,462 129,654 3,649
Table from Manning (2011)
Feature Engineering for NLP
Table 3. Tagging accuracies with different feature templates and other changes on the WSJ 19-21 development set.
Model
3gramMemm naacl 2003 Replication Replication′ 5w 5wShapes 5wShapesDS
Feature Templates
See text
See text and [1]
See text and [1] +rareFeatureThresh = 5 +⟨t0, w−2⟩, ⟨t0, w2⟩
+⟨t0, s−1⟩, ⟨t0, s0⟩, ⟨t0, s+1⟩ + distributional similarity
# Sent. Feats Acc. 248,798 52.07% 460,552 55.31% 460,551 55.62% 482,364 55.67% 730,178 56.23% 731,661 56.52% 737,955 56.79%
Token Unk. Acc. Acc. 96.92% 88.99% 97.15% 88.61% 97.18% 88.92% 97.19% 88.96% 97.20% 89.03% 97.25% 89.81% 97.28% 90.46%
3gramMemm shows the performance of a straightforward, fast, discrimina-
tive sequence model tagger. It uses the templates ⟨t0, w−1⟩, ⟨t0, w0⟩, ⟨t0, w+1⟩,
⟨t0, t−1⟩, ⟨t0, t−2, t−1⟩ and the unknown word features from [1]. The higher
performance naacl 2003 tagger numbers come from use of a bidirectional deter. noun noun verb verb noun
cyclic dependency network tagger, which adds the feature templates ⟨t0,t+1⟩, ⟨t0, t+1, t+2⟩, ⟨t0, t−1, t+1⟩, ⟨t0, t−1, w0⟩, ⟨t0, t+1, w0⟩, ⟨t0, w−1, w0⟩, ⟨t0, w0, w+1⟩
The movie I watched depicted hope
The next line shows results from an attempt to replicate those numbers in 2010. The results are similar but a fraction better.4 The line after that shows that
69
the numbers are pushed up a little by lowering the support threshold for in-

Feature Engineering for CV Edge detection (Canny)
Corner Detection (Harris)
Figures from http://opencv.org
74

Feature Engineering for CV Scale Invariant Feature Transform (SIFT)
Figure from Lowe (1999) and Lowe (2004)
75

NON-LINEAR FEATURES
77

Nonlinear Features
• aka. “nonlinear basis functions”
• So far, input was always
• Key Idea: let input be some function of x
– original input:
– new input: – define
• Examples: (M = 1)
For a linear model:
still a linear function of b(x) even though a nonlinear function of x
Examples:
– Perceptron
– Linear regression
– Logistic regression
78

Example: Linear Regression
Goal: Learn y = wT f(x) + b where f(.) is a pol
basis function
y
x
2.0
1.2
1.3
1.7
0.1
2.7
1.1
1.9
ynomial
y
x
true “unknown” target function is linear with negative slope and gaussian noise
79

Example: Linear Regression
Goal: Learn y = wT f(x) + b where f(.) is a pol
basis function
y
x
2.0
1.2
1.3
1.7
0.1
2.7
1.1
1.9
ynomial
y
x
true “unknown” target function is linear with negative slope and gaussian noise
80

Example: Linear Regression
Goal: Learn y = wT f(x) + b where f(.) is a pol
basis function
y
x
x2
2.0
1.2
(1.2)2
1.3
1.7
(1.7)2
0.1
2.7
(2.7)2
1.1
1.9
(1.9)2
ynomial
y
x
true “unknown” target function is linear with negative slope and gaussian noise
81

Example: Linear Regression
Goal: Learn y = wT f(x) + b where f(.) is a pol
basis function
y
x
x2
x3
2.0
1.2
(1.2)2
(1.2)3
1.3
1.7
(1.7)2
(1.7)3
0.1
2.7
(2.7)2
(2.7)3
1.1
1.9
(1.9)2
(1.9)3
ynomial
y
x
true “unknown” target function is linear with negative slope and gaussian noise
82

Example: Linear Regression
Goal: Learn y = wT f(x) + b where f(.) is a pol
basis function
ynomial
y
2.0
1.3
0.1
y
1.1
x
1.2
1.7
2.7
1.9
x2
(1.2)2
(1.7)2
(2.7)2
(1.9)2





x5
(1.2)5
(1.7)5
(2.7)5
(1.9)5
true “unknown” target function is linear with negative slope and gaussian noise
x
83

Example: Linear Regression
Goal: Learn y = wT f(x) + b where f(.) is a pol
basis function
ynomial
y
2.0
1.3
0.1
y
1.1
x
1.2
1.7
2.7
1.9
x2
(1.2)2
(1.7)2
(2.7)2
(1.9)2





x8
(1.2)8
(1.7)8
(2.7)8
(1.9)8
true “unknown” target function is linear with negative slope and gaussian noise
x
84

Example: Linear Regression
Goal: Learn y = wT f(x) + b where f(.) is a pol
basis function
ynomial
y
2.0
1.3
0.1
1.1
x
1.2
1.7
2.7
1.9
x2
(1.2)2
(1.7)2
(2.7)2
(1.9)2





x9
(1.2)9
(1.7)9
(2.7)9
y
(1.9)9
true “unknown” target function is linear with negative slope and gaussian noise
x
85

Over-fitting
Root-Mean-Square (RMS) Error:
Slide courtesy of William Cohen

Polynomial Coefficients
Slide courtesy of William Cohen

Example: Linear Regression
Goal: Learn y = wT f(x) + b where f(.) is a pol
basis function
i
1
2

ynomial
• WithjustN=10 points we overfit!
• ButwithN=100 points, the
overfitting (mostly) disappears
• Takeaway: more data helps
prevent overfitting
y
10
y
2.0
1.3

1.1
x
1.2
1.7

1.9





x9
(1.2)9
(1.7)9

(1.9)9
x
88

Example: Linear Regression
Goal: Learn y = wT f(x) + b where f(.) is a pol
basis function
ynomial
• WithjustN=10 points we overfit!
• ButwithN=100 points, the
overfitting (mostly) disappears
• Takeaway: more data helps
prevent overfitting
i
1
2
3
4

y
2.0
1.3
0.1
1.1

x
1.2
1.7
2.7
1.9







x9
(1.2)9
(1.7)9
(2.7)9
(1.9)9

y
…………… true “unknown”
target function is ……………
linear with
98 … … … negative slope
a
nd g
aus
… sian
99


noise 100 0.9
1.5



(1.5)9
x
89

REGULARIZATION
90

Overfitting
Definition: The problem of overfitting is when the model captures the noise in the training data instead of the underlying structure
Overfitting can occur in all the models we’ve seen so far:
– Decision Trees (e.g. when tree is too deep)
– KNN (e.g. when k is small)
– Perceptron (e.g. when sample isn’t representative) – Linear Regression (e.g. with nonlinear features)
– Logistic Regression (e.g. with many rare features)
91

Motivation: Regularization
Example: Stock Prices
• Suppose we wish to predict Google’s stock price at time t+1
• What features should we use? (putting all computational concerns aside)
– Stock prices of all other stocks at times t, t-1, t-2, …, t – k
– Mentions of Google with positive / negative sentiment words in all newspapers and social media outlets
• Do we believe that all of these features are going to be useful?
92

Motivation: Regularization
• Occam’s Razor: prefer the simplest hypothesis
• What does it mean for a hypothesis (or model) to be simple?
1. small number of features (model selection)
2. small number of “important” features (shrinkage)
93

Regularization
• Given objective function: J(θ)
• Goal is to find:
• Key idea: Define regularizer r(θ) s.t. we tradeoff between fitting the data and keeping the model simple
• Choose form of r(θ):
– Example: q-norm (usually p-norm)
94

Regularization
Question:
Suppose we are minimizing J’(θ) where
As λ increases, the minimum of J’(θ) will…
A. …move towards the midpoint between J’(θ) and r(θ)
B. …move towards the minimum of J(θ)
C. …move towards the minimum of r(θ)
D. …move towards a theta vector of positive infinities
E. …move towards a theta vector of negative infinities
F. …stay the same
96

Regularization Exercise
In-class Exercise
1. Plot train error vs. regularization weight (cartoon)
2. Plot test error vs . regularization weight (cartoon)
regularization weight
98
error

Regularization
Question:
Suppose we are minimizing J’(θ) where
As we increase λ from 0, the the validation error will…
A. …increase
B. …decrease
C. …first increase, then decrease
D. …first decrease, then increase
E. …stay the same
99

Regularization
Don’t Regularize the Bias (Intercept) Parameter!
• In our models so far, the bias / intercept parameter is usually denoted by !” — that is, the parameter for which we fixed #” = 1
• Regularizers always avoid penalizing this bias / intercept parameter
• Why? Because otherwise the learning algorithms wouldn’t be invariant to a shift in the y-values
Whitening Data
• It’s common to whiten each feature by subtracting its mean and dividing by its variance
• For regularization, this helps all the features be penalized in the same units
(e.g. convert both centimeters and kilometers to z-scores)
100

Training Data
Example: Logistic Regression
• For this example, we construct nonlinear features (i.e. feature engineering)
• Specifically, we add polynomials up to order 9 of the two original features x1 and x2
• Thus our classifier is linear in the high-dimensional feature space, but the decision boundary is nonlinear when visualized in low-dimensions (i.e. the original two dimensions)
Test Data
105

Example: Logistic Regression
lambda
106
error

Example: Logistic Regression
107

Example: Logistic Regression
108

Example: Logistic Regression
109

Example: Logistic Regression
110

Example: Logistic Regression
111

Example: Logistic Regression
112

Example: Logistic Regression
113

Example: Logistic Regression
114

Example: Logistic Regression
115

Example: Logistic Regression
116

Example: Logistic Regression
117

Example: Logistic Regression
118

Example: Logistic Regression
119

Example: Logistic Regression
lambda
120
error

Regularization as MAP
• L1 and L2 regularization can be interpreted as maximum a-posteriori (MAP) estimation of the parameters
• To be discussed later in the course…
121

Takeaways
1. Nonlinear basis functions allow linear models (e.g. Linear Regression, Logistic Regression) to capture nonlinear aspects of the original input
2. Nonlinear features are require no changes to the model (i.e. just preprocessing)
3. Regularization helps to avoid overfitting
4. Regularization and MAP estimation are equivalent for appropriately chosen priors
123

Feature Engineering / Regularization
Objectives
You should be able to…
• Engineer appropriate features for a new task
• Use feature selection techniques to identify and
remove irrelevant features
• Identify when a model is overfitting
• Add a regularizer to an existing objective in order to combat overfitting
• Explain why we should not regularize the bias term
• Convert linearly inseparable dataset to a linearly
separable dataset in higher dimensions
• Describe feature engineering in common application areas
124