
Ve492: Introduction to Artificial Intelligence
Discriminative Learning

UM-SJTU Joint Institute


Some Slides adapted from http://ai.berkeley.edu, AIMA, UM

Learning Objectives
❖ What is discriminative learning?
❖ How to perform classification with this approach?
❖ What is the perceptron model?
❖ What is logistic regression? How is it related to the perceptron model?

❖ Discriminative model
❖ Perceptron
❖ Multi-class classification
❖ Logistic regression

Discriminative Model
❖ Generative model P(X,Y) encodes how data is generated
❖ May be useful if a priori domain knowledge is available
❖ More flexible:
❖ Can account for missing observations
❖ Generative model could be used for other tasks
❖ Model may be hard to learn/specify
❖ Discriminative model directly predicts labels
❖ Decision rule
❖ Directly learns a boundary between classes
❖ May be more efficient for classification

Feature Vectors
Do you want free
printr cartriges?
Why pay more when
you can get them
ABSOLUTELY FREE!
Email features: # free : 2, YOUR_NAME : 0, MISSPELLED : 2, FROM_FRIEND : 0, …
Digit-image features: PIXEL-7,12 : 1, PIXEL-7,13 : 0, …, NUM_LOOPS : 1, …
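As a concrete illustration (not from the slides), here is a minimal sketch of how such a feature vector could be computed from an email. The feature names match the slide; the tiny misspelling list is an assumption made only for this example.

```python
from collections import Counter

def email_features(text, misspellings={"printr", "cartriges"}):
    """Turn an email into a sparse feature vector (a dict of counts)."""
    words = text.lower().replace("?", " ").replace("!", " ").split()
    counts = Counter(words)
    return {
        "# free": counts["free"],                            # how often "free" appears
        "MISSPELLED": sum(counts[w] for w in misspellings),  # count of known misspellings
    }

print(email_features("Do you want free printr cartriges? "
                     "Why pay more when you can get them ABSOLUTELY FREE!"))
# -> {'# free': 2, 'MISSPELLED': 2}
```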

Some (Simplified) Biology
❖ Very loose inspiration: human neurons

Perceptron
❖ Inputs are feature values
❖ Each feature has a weight
❖ Sum is the activation
activation_𝐰(𝐱) = ∑ᵢ 𝑤ᵢ 𝜑ᵢ(𝐱) = 𝐰 ⋅ 𝜑(𝐱)
❖ If the activation is:
❖ Positive, output +1
❖ Negative, output -1
❖ Linear classifier
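As a sketch (not from the slides), the decision rule above in a few lines of Python; the sparse dictionary format for feature vectors follows the examples used on these slides, and ties at zero activation are sent to +1, as in the learning slides later on.

```python
def activation(w, phi_x):
    """Compute w · φ(x) for sparse feature dicts (missing features count as 0)."""
    return sum(w.get(f, 0.0) * v for f, v in phi_x.items())

def classify(w, phi_x):
    """Perceptron decision rule: +1 if the activation is non-negative, else -1."""
    return +1 if activation(w, phi_x) >= 0 else -1
```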

❖ Binary case: compare features to a weight vector
❖ Learning: figure out the weight vector from examples
Weight vector 𝐰:
# free : 4
YOUR_NAME : -1
MISSPELLED : 1
FROM_FRIEND : -3
Feature vector 𝜑(𝐱) for the spam email from the previous slide:
# free : 2
YOUR_NAME : 0
MISSPELLED : 2
FROM_FRIEND : 0
Feature vector 𝜑(𝐱) for a second, non-spam email:
# free : 0
YOUR_NAME : 1
MISSPELLED : 1
FROM_FRIEND : 1
Dot product 𝐰 ⋅ 𝜑(𝐱) positive means the positive class
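For example, with the weight vector above and the second feature vector, the activation is 𝐰 ⋅ 𝜑(𝐱) = 4·0 + (−1)·1 + 1·1 + (−3)·1 = −3 < 0, so that email is assigned to the negative (non-spam) class.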

Binary Decision Rule
❖ In the space of feature vectors
❖ Examples are points
❖ Any weight vector defines a hyperplane
❖ One side corresponds to Y=+1
❖ Other corresponds to Y=-1
+1 = SPAM, −1 = HAM
Example weight vector 𝐰:
BIAS : -3
free : 4
money : 2
Decision boundary: 𝐰 ⋅ 𝜑(𝐱) = 0
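For instance, with these weights, a hypothetical email whose feature vector is (BIAS: 1, free: 1, money: 1) scores −3·1 + 4·1 + 2·1 = 3 ≥ 0 and lands on the SPAM side of the hyperplane, while one with (BIAS: 1, free: 0, money: 1) scores −3 + 0 + 2 = −1 < 0 and does not. (The feature values here are made up purely for illustration.)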

Learning: Binary Perceptron
❖ Start with weights = 0
❖ For each training instance (x, y*):
❖ Classify with current weights
❖ If correct (i.e., 𝑦̂ = 𝑦∗), no change!
❖ If wrong: adjust the weight vector
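A minimal sketch of this training loop in Python (the precise classification and update rules are spelled out on the next slide); the sparse-dict representation and the single-pass default are assumptions made for the example.

```python
def train_perceptron(data, num_passes=1):
    """Binary perceptron: data is a list of (phi_x, y_star) pairs, y_star in {-1, +1}."""
    w = {}  # start with all weights = 0 (missing keys are treated as 0)
    for _ in range(num_passes):
        for phi_x, y_star in data:
            score = sum(w.get(f, 0.0) * v for f, v in phi_x.items())
            y_hat = +1 if score >= 0 else -1
            if y_hat != y_star:  # mistake: adjust the weight vector
                for f, v in phi_x.items():
                    w[f] = w.get(f, 0.0) + y_star * v  # w = w + y* · φ(x)
    return w
```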

Learning: Binary Perceptron
❖ Start with weights = 0
❖ For each training instance (x, y*):
❖ Classify with current weights: 𝑦̂ = +1 if 𝐰 ⋅ 𝜑(𝐱) ≥ 0, and 𝑦̂ = −1 if 𝐰 ⋅ 𝜑(𝐱) < 0
❖ If correct (i.e., 𝑦̂ = 𝑦∗), no change!
❖ If wrong: adjust the weight vector by adding or subtracting the feature vector (subtract if 𝑦∗ is −1):
𝐰 = 𝐰 + 𝑦∗ ⋅ 𝜑(𝐱)

Example: Perceptron Algorithm
❖ Separable case

Multiclass Decision Rule
❖ If we have multiple classes:
❖ A weight vector 𝐰_𝑦 for each class 𝑦
❖ Score (activation) of a class 𝑦: 𝐰_𝑦 ⋅ 𝜑(𝐱)
❖ Prediction: the highest score wins, 𝑦̂ = argmax_𝑦 𝐰_𝑦 ⋅ 𝜑(𝐱)
(Figure: the feature space is partitioned into regions, one per class, where that class's score 𝐰_𝑦 ⋅ 𝜑 is biggest.)

Binary Classification as a Multiclass Decision Rule
❖ Multiclass decision rule: 𝑦̂ = argmax_𝑦 𝐰_𝑦 ⋅ 𝜑(𝐱)
❖ Denote by 𝐰 the weight vector of the positive class.
❖ What could be the weight vector of the negative class?

Learning: Multiclass Perceptron
❖ Start with all weights = 0
❖ Pick up training examples one by one
❖ Predict with current weights: 𝑦̂ = argmax_𝑦 𝐰_𝑦 ⋅ 𝜑(𝐱)
❖ If correct, no change!
❖ If wrong: lower the score of the wrong answer, raise the score of the right answer:
𝐰_𝑦̂ = 𝐰_𝑦̂ − 𝜑(𝐱)
𝐰_𝑦∗ = 𝐰_𝑦∗ + 𝜑(𝐱)

Example: Multiclass Perceptron
Training sentences: "win the vote", "win the election", "win the game"
Initial weight vectors (one per class):
BIAS : 1, win : 0, game : 0, vote : 0, the : 0, ...
BIAS : 0, win : 0, game : 0, vote : 0, the : 0, ...
BIAS : 0, win : 0, game : 0, vote : 0, the : 0, ...

Properties of Perceptrons
❖ Separability: true if some parameters get the training set perfectly correct
❖ Convergence: if the training set is separable, the perceptron will eventually converge (binary case)
❖ Mistake bound: the maximum number of mistakes (binary case) is related to the margin, i.e., the degree of separability

Non-Separable Problems with the Perceptron
❖ Noise: if the data isn't separable, the weights might thrash
❖ Averaging weight vectors over time can help (averaged perceptron)
❖ Mediocre generalization: finds a "barely" separating solution
❖ Overtraining: test/validation accuracy usually rises, then falls
❖ Overtraining is a kind of overfitting

Probabilistic Classification
❖ Naïve Bayes provides probabilistic classification, e.g., a posterior over classes such as (1: 0.001, 2: 0.001, ...)
❖ The perceptron just gives us a class prediction
❖ Can we get it to give us probabilities, e.g., (1: 0.001, 2: 0.703, ..., 6: 0.264, ...)?
❖ It turns out this also makes it easier to train!
Note: To simplify notation, "𝑥" denotes "𝜑(𝐱)" from now on.

A Probabilistic Perceptron
❖ Called logistic regression
❖ Model: P(𝑦 = +1 | 𝑥; 𝐰) = 1 / (1 + e^(−𝐰⋅𝑥)) and P(𝑦 = −1 | 𝑥; 𝐰) = 1 − P(𝑦 = +1 | 𝑥; 𝐰)

A 1D Example
(Figure: along a single feature axis, points range from "definitely blue" through "not sure" to "definitely red"; the probability increases exponentially as we move away from the boundary, and the normalizer in the denominator keeps the two class probabilities summing to 1.)

How to Learn?
❖ Maximum likelihood estimation
❖ Maximum conditional likelihood estimation
❖ What assumptions are we making?

Local Search
❖ Simple, general idea:
❖ Start wherever
❖ Repeat: move to the best neighboring state
❖ If no neighbors are better than the current one, quit
❖ Neighbors = small perturbations of 𝐰

Our Problem
❖ Our objective: maximize the conditional likelihood of the training labels (equivalently, its logarithm)
❖ Challenge: how to find a good 𝐰?

1D Optimization
❖ Could evaluate the objective at nearby points and then step in the best direction
❖ Or, evaluate the derivative, which tells us which direction to step in
❖ Example in 2D

2D Optimization
❖ Extends to higher dimensions

Steepest Descent
❖ Start somewhere
❖ Repeat: take a step in the steepest direction
❖ The steepest ascent direction is the gradient (for descent, step against it)
(Figure source: Mathworks)

How to Learn in Logistic Regression?

Optimization Procedure: Gradient Descent
❖ Repeatedly update 𝐰 with a step of size 𝛼 along the gradient of the objective
❖ 𝛼: learning rate, a hyperparameter that needs to be chosen carefully
❖ What's the issue if it is too small? Too big?
❖ How to set it? Try multiple choices
❖ Crude rule of thumb: each update should change 𝑤 by about 0.1-1%

Stochastic Gradient Descent
❖ Compare this to the multiclass perceptron: the update is a probabilistic weighting!
❖ The update for each example is scaled by the probability of the incorrect answer
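To tie the pieces together, here is a minimal sketch of logistic regression trained by stochastic gradient ascent on the conditional log-likelihood. The sigmoid form of P(y | x) is standard, but this code (the NumPy representation, the learning rate, the epoch count) is an illustrative assumption rather than the course's reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, alpha=0.1, num_epochs=20):
    """X: (n, d) array of feature vectors x = φ(x); y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(num_epochs):
        for x_i, y_i in zip(X, y):
            # Probability of the *incorrect* answer under the current weights
            p_wrong = sigmoid(-y_i * np.dot(w, x_i))
            # Perceptron-like update, weighted by that probability
            w += alpha * p_wrong * y_i * x_i
    return w

def predict_proba(w, x):
    """P(y = +1 | x; w)."""
    return sigmoid(np.dot(w, x))
```

Note how the update has the same shape as the perceptron's w = w + y∗ ⋅ φ(x), except that it is weighted by the probability of the incorrect answer, which is exactly the comparison made on the stochastic gradient descent slide above.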
Logistic Regression Demo!
https://playground.tensorflow.org/

Concluding Remarks
❖ Generative vs discriminative approaches
❖ Perceptron and logistic regression
❖ Error-driven classification:
❖ Minimize a loss (some measure of error on the data)
❖ Learn by (stochastic) gradient descent
❖ Generalizes to other supervised learning tasks
❖ For more information:
❖ AIMA, Chapter 18, Learning from Examples