Ve492: Introduction to Artificial Intelligence
Discriminative Learning
UM-SJTU Joint Institute
Some Slides adapted from http://ai.berkeley.edu, AIMA, UM
Learning Objectives
❖ What is discriminative learning?
❖ How to perform classification with this approach?
❖ What is the perceptron model?
❖ What is logistic regression? How is it related to the perceptron model?
Outline
❖ Discriminative model
❖ Perceptron
❖ Multi-class classification
❖ Logistic regression
Discriminative Model
❖ Generative model P(X,Y) encodes how data is generated
❖ May be useful if a priori domain knowledge is available
❖ More flexible:
❖ Can account for missing observations
❖ Generative model could be used for other tasks
❖ Model may be hard to learn/specify
❖ Discriminative model directly predicts labels
❖ Decision rule
❖ Directly learns a boundary between classes
❖ May be more efficient for classification
Feature Vectors
Do you want free
printr cartriges?
Why pay more when
you can get them
ABSOLUTELY FREE!
❖ Spam example: # free : 2, YOUR_NAME : 0, MISSPELLED : 2, FROM_FRIEND : 0, …
❖ Digit example: PIXEL-7,12 : 1, PIXEL-7,13 : 0, …, NUM_LOOPS : 1, …
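A minimal sketch of how such a feature vector might be built (the function name, lexicon, and text cleanup are illustrative assumptions, not from the slides); features are stored as a dict from feature name to value:

```python
# Sketch (illustrative, not from the slides): build a sparse feature
# vector for an email as a dict from feature name to value.

def extract_features(email_text, lexicon):
    """Map raw email text to a feature dict; 'lexicon' is a set of known
    words used as a crude stand-in for a real misspelling detector."""
    words = email_text.lower().replace("?", " ").replace("!", " ").split()
    return {
        "# free": words.count("free"),
        "MISSPELLED": sum(w not in lexicon for w in words),
    }

lexicon = {"do", "you", "want", "free", "why", "pay", "more", "when",
           "can", "get", "them", "absolutely"}
text = "Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE!"
print(extract_features(text, lexicon))  # {'# free': 2, 'MISSPELLED': 2}
```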
Some (Simplified) Biology
❖ Very loose inspiration: human neurons
Perceptron
❖ Inputs are feature values
❖ Each feature has a weight
❖ Sum is the activation
activation_𝐰(𝐱) = ∑_i 𝑤_i 𝜑_i(𝐱) = 𝐰 ⋅ 𝜑(𝐱)
❖ If the activation is:
❖ Positive, output +1
❖ Negative, output -1
❖ Linear classifier
❖ Binary case: compare features to a weight vector
❖ Learning: figure out the weight vector from examples
❖ Example weight vector 𝐰:
# free : 4
YOUR_NAME : -1
MISSPELLED : 1
FROM_FRIEND : -3
❖ Feature vector 𝜑(𝐱) of one email (partial):
MISSPELLED : 2
FROM_FRIEND : 0
❖ Feature vector 𝜑(𝐱) of another email:
# free : 0
YOUR_NAME : 1
MISSPELLED : 1
FROM_FRIEND : 1
A positive dot product 𝐰 ⋅ 𝜑(𝐱) means the positive class (see the sketch below)
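A minimal sketch of this decision rule with dict-based sparse vectors (the helper names are assumptions, not from the slides):

```python
# Sketch: binary perceptron decision rule over sparse dict vectors.

def dot(w, phi):
    """Dot product of two sparse vectors stored as dicts (missing keys = 0)."""
    return sum(w.get(f, 0.0) * v for f, v in phi.items())

def predict(w, phi):
    """Output +1 if the activation is non-negative, else -1."""
    return +1 if dot(w, phi) >= 0 else -1

w = {"# free": 4, "YOUR_NAME": -1, "MISSPELLED": 1, "FROM_FRIEND": -3}
phi = {"# free": 0, "YOUR_NAME": 1, "MISSPELLED": 1, "FROM_FRIEND": 1}
print(predict(w, phi))  # activation = 0 - 1 + 1 - 3 = -3 < 0, so -1
```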
Binary Decision Rule
❖ In the space of feature vectors
❖ Examples are points
❖ Any weight vector defines a hyperplane (where 𝐰 ⋅ 𝜑(𝐱) = 0)
❖ One side corresponds to Y=+1
❖ Other corresponds to Y=-1
❖ Example weights: BIAS : -3, free : 4, money : 2
❖ Decision boundary: the hyperplane 𝐰 ⋅ 𝜑(𝐱) = 0 (+1 = SPAM on one side, −1 = HAM on the other)
Learning: Binary Perceptron
❖ Start with weights = 0
❖ For each training instance (x, y*):
❖ Classify with current weights: ŷ = +1 if 𝐰 ⋅ 𝜑(𝐱) ≥ 0, ŷ = −1 if 𝐰 ⋅ 𝜑(𝐱) < 0
❖ If correct (i.e., ŷ = y*), no change!
❖ If wrong: adjust the weight vector by adding or subtracting the feature vector (subtract if y* is −1), as sketched below:
𝐰 = 𝐰 + y* ⋅ 𝜑(𝐱)
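A runnable sketch of the whole algorithm under this update rule (the data format and pass count are assumptions):

```python
# Sketch: binary perceptron training; data is a list of (phi, y_star)
# pairs with phi a feature dict and y_star in {+1, -1}.

def dot(w, phi):
    return sum(w.get(f, 0.0) * v for f, v in phi.items())

def train_binary_perceptron(data, num_passes=10):
    w = {}  # start with all weights = 0 (missing keys read as 0)
    for _ in range(num_passes):
        for phi, y_star in data:
            y_hat = +1 if dot(w, phi) >= 0 else -1
            if y_hat != y_star:  # wrong: w = w + y* * phi(x)
                for f, v in phi.items():
                    w[f] = w.get(f, 0.0) + y_star * v
    return w

data = [({"BIAS": 1, "free": 2}, +1), ({"BIAS": 1, "free": 0}, -1)]
print(train_binary_perceptron(data))  # separates the two examples
```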
Example: Perceptron Algorithm
❖ Separable Case
Multiclass Decision Rule
❖ If we have multiple classes:
❖ A weight vector for each class:
❖ Score (activation) of a class y: 𝐰_y ⋅ 𝜑(𝐱)
❖ Prediction: the highest score wins: ŷ = argmax_y 𝐰_y ⋅ 𝜑(𝐱)
(Figure: feature space partitioned into one region per class; each region is where that class's score 𝐰_y ⋅ 𝜑 is biggest)
Binary Classification As Multiclass Decision Rule
❖ Multiclass decision rule
ŷ = argmax_y 𝐰_y ⋅ 𝜑(𝐱)
❖ Denote 𝐰 the weight vector of the positive class.
❖ What could be the weight vector of the negative class?
Learning: Multiclass Perceptron
❖ Start with all weights = 0
❖ Pick training examples one by one
❖ Predict with current weights: ŷ = argmax_y 𝐰_y ⋅ 𝜑(𝐱)
❖ If correct, no change!
❖ If wrong: lower the score of the wrong answer, raise the score of the right answer (a sketch follows below):
𝐰_ŷ = 𝐰_ŷ − 𝜑(𝐱)
𝐰_y* = 𝐰_y* + 𝜑(𝐱)
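A sketch of these updates (the class labels and data format are assumptions for illustration):

```python
# Sketch: multiclass perceptron; one weight dict per class.

def train_multiclass_perceptron(data, classes, num_passes=10):
    """data: list of (phi, y_star) pairs; classes: list of labels."""
    w = {y: {} for y in classes}  # all weights start at 0

    def dot(wy, phi):
        return sum(wy.get(f, 0.0) * v for f, v in phi.items())

    for _ in range(num_passes):
        for phi, y_star in data:
            y_hat = max(classes, key=lambda y: dot(w[y], phi))
            if y_hat != y_star:
                for f, v in phi.items():
                    w[y_hat][f] = w[y_hat].get(f, 0.0) - v   # lower wrong answer
                    w[y_star][f] = w[y_star].get(f, 0.0) + v  # raise right answer
    return w

# Hypothetical class labels for the sentence example below:
data = [({"BIAS": 1, "win": 1, "the": 1, "vote": 1}, "politics"),
        ({"BIAS": 1, "win": 1, "the": 1, "game": 1}, "sports")]
print(train_multiclass_perceptron(data, ["politics", "sports"]))
```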
Example: Multiclass Perceptron
❖ Training sentences: “win the vote”, “win the election”, “win the game”
❖ One weight vector per class, each over the same features:
𝐰_1: BIAS : 1, win : 0, game : 0, vote : 0, the : 0, ...
𝐰_2: BIAS : 0, win : 0, game : 0, vote : 0, the : 0, ...
𝐰_3: BIAS : 0, win : 0, game : 0, vote : 0, the : 0, ...
Properties of Perceptrons
❖ Separability: holds if some parameter setting classifies the training set perfectly
❖ Convergence: if the training set is separable, perceptron will eventually converge (binary case)
❖ Mistake bound: the maximum number of mistakes (binary case) is related to the margin, i.e., the degree of separability (see the bound below)
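For reference, the standard bound (Novikoff's result, stated here for completeness rather than taken from the slide):

```latex
% If \|\varphi(\mathbf{x})\| \le R for every training example and some unit
% vector \mathbf{w}^* separates the data with margin \gamma, i.e.
% y^* \, (\mathbf{w}^* \cdot \varphi(\mathbf{x})) \ge \gamma > 0 for all
% (\mathbf{x}, y^*), then the binary perceptron makes at most
\[
  \left( \frac{R}{\gamma} \right)^{2} \quad \text{mistakes.}
\]
```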
Non-Separable
Problems with the Perceptron
❖ Noise: if the data isn’t separable, weights might thrash
❖ Averaging weight vectors over time can help (averaged perceptron)
❖ Mediocre generalization: finds a “barely” separating solution
❖ Overtraining: test/validation accuracy usually rises, then falls
❖ Overtraining is a kind of overfitting
Probabilistic Classification
❖ Naïve Bayes provides probabilistic classification
(e.g., a posterior over digit classes: 1 : 0.001, 2 : 0.001, ...)
❖ Perceptron just gives us a class prediction
❖ Can we get it to give us probabilities?
(e.g., 1 : 0.001, 2 : 0.703, ..., 6 : 0.264, ...)
❖ Turns out this also makes it easier to train!
Note: To simplify notation, “𝑥” denotes “𝜑(𝐱)” from now on
A Probabilistic Perceptron
❖ Called Logistic Regression
❖ Model: P(Y = +1 | x; 𝐰) = 1 / (1 + e^(−𝐰⋅x)), and P(Y = −1 | x; 𝐰) = 1 − P(Y = +1 | x; 𝐰)
A 1D Example
❖ (Figure: points on a line; far on one side: definitely blue, near the boundary: not sure, far on the other side: definitely red)
❖ The unnormalized score e^(𝐰⋅x) increases exponentially as we move away from the boundary; the normalizer converts these scores into probabilities
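A minimal sketch of this model (dict-based vectors as before; the helper names are assumptions):

```python
import math

def sigmoid(z):
    """Logistic function: squashes an activation into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def prob_positive(w, phi):
    """P(Y = +1 | x; w) for logistic regression over sparse dict vectors."""
    activation = sum(w.get(f, 0.0) * v for f, v in phi.items())
    return sigmoid(activation)

# Near the boundary the model is unsure; far away it is nearly certain.
print(sigmoid(-4.0), sigmoid(0.0), sigmoid(4.0))  # ~0.018, 0.5, ~0.982
```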
How to Learn?
❖ Maximum likelihood estimation
❖ Maximum conditional likelihood estimation
❖ What assumptions are we making?
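A sketch of the objective in standard notation (consistent with the model above; recall that x stands for 𝜑(𝐱)):

```latex
% Maximum conditional likelihood estimation: choose w to make the observed
% labels as likely as possible given the inputs.
\[
  \max_{\mathbf{w}} \; ll(\mathbf{w})
  \;=\; \max_{\mathbf{w}} \sum_{i} \log P\!\left(y^{(i)} \mid x^{(i)}; \mathbf{w}\right),
  \qquad
  P(+1 \mid x; \mathbf{w}) \;=\; \frac{1}{1 + e^{-\mathbf{w} \cdot x}}
\]
```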
Local Search
❖ Simple, general idea:
❖ Start wherever
❖ Repeat: move to the best neighboring state
❖ If no neighbors better than current, quit
❖ Neighbors = small perturbations of w
Our Problem
❖ Our objective: maximize the conditional log-likelihood ll(𝐰) = ∑_i log P(y^(i) | x^(i); 𝐰)
❖ Challenge: how to find a good 𝐰?
❖ Equivalently: minimize the negative log-likelihood, a measure of loss on the data
1D Optimization
❖ Could evaluate g(w + h) and g(w − h) for a small step h
❖ Then step in the best direction
❖ Or, evaluate the derivative dg(w)/dw
❖ which tells us in which direction to step
❖ Example in 2D
2D Optimization
❖ Extends to higher dimensions
Steepest Descent
❖ Start somewhere
❖ Repeat: Take a step in the steepest descent direction
❖ Steepest descent direction = the direction opposite to the gradient
Figure source: Mathworks
How to Learn in Logistic Regression?
Optimization Procedure: Gradient Descent
❖ Repeat: 𝐰 ← 𝐰 − 𝛼 ∇g(𝐰), where g is the loss to minimize (a runnable sketch follows below)
❖ 𝛼: learning rate, a hyperparameter that needs to be chosen carefully
❖ What is the issue if it is too small? Too big?
❖ How to set it? Try multiple choices
❖ Crude rule of thumb: each update should change 𝐰 by about 0.1 to 1%
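A runnable sketch of the procedure on a toy 1D problem (the function g and the hyperparameter values are illustrative assumptions):

```python
def gradient_descent(grad, w0, alpha=0.1, num_iters=100):
    """Generic gradient descent: repeatedly step against the gradient.
    grad(w) returns dg/dw at w; alpha is the learning rate."""
    w = w0
    for _ in range(num_iters):
        w = w - alpha * grad(w)
    return w

# Toy example: minimize g(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
print(gradient_descent(lambda w: 2 * (w - 3), w0=0.0))  # converges to ~3.0
```

On this toy g, a tiny 𝛼 crawls toward the minimum, while 𝛼 ≥ 1 fails to converge, which is exactly the issue raised above.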
Stochastic Gradient Descent
❖ Update the weights using the gradient of a single randomly drawn training example at a time
❖ Compare this to the multiclass perceptron: the same kind of raise-right/lower-wrong update, but weighted by the probability of the incorrect answer (probabilistic weighting!); see the sketch below
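A sketch of SGD for binary logistic regression that makes the probabilistic weighting explicit (dense list vectors, labels in {+1, −1}; the hyperparameters are assumptions):

```python
import math
import random

def sgd_logistic_regression(data, alpha=0.1, num_passes=20):
    """data: list of (x, y_star) with x a list of floats, y_star in {+1, -1}.
    Per-example gradient-ascent step on log P(y* | x; w):
        w += alpha * (1 - P(y* | x; w)) * y* * x
    i.e., the update is scaled by the probability of the incorrect answer."""
    dim = len(data[0][0])
    w = [0.0] * dim
    for _ in range(num_passes):
        order = list(range(len(data)))
        random.shuffle(order)  # visit examples in random order
        for i in order:
            x, y_star = data[i]
            act = sum(wj * xj for wj, xj in zip(w, x))
            p_wrong = 1.0 / (1.0 + math.exp(y_star * act))  # = 1 - P(y*|x;w)
            for j in range(dim):
                w[j] += alpha * p_wrong * y_star * x[j]
    return w

data = [([1.0, 2.0], +1), ([1.0, 0.0], -1)]  # [BIAS, feature] toy set
print(sgd_logistic_regression(data))
```

Note that when the model is confident and correct, p_wrong is near 0 and the update nearly vanishes, unlike the perceptron's all-or-nothing update.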
Logistic Regression Demo! https://playground.tensorflow.org/
Concluding Remarks
❖ Generative vs Discriminative Approaches
❖ Perceptron and Logistic Regression
❖ Error-driven classification:
❖ Minimize loss (some measure of error on data)
❖ Learns by (stochastic) gradient descent
❖ Generalizes to any supervised learning task
❖ For more information:
❖ AIMA, Chapter 18 for Learning from Examples