Perceptron learning algorithm

The multiclass perceptron learning algorithm

 1: procedure Perceptron(x(1:N), y(1:N))
 2:   t ← 0
 3:   θ(0) ← 0
 4:   repeat
 5:     t ← t + 1
 6:     Select an instance i                          ▷ Online training
 7:     ŷ ← argmax_y θ(t−1) · f(x(i), y)
 8:     if ŷ ≠ y(i) then
 9:       θ(t) ← θ(t−1) + f(x(i), y(i)) − f(x(i), ŷ)
10:     else
11:       θ(t) ← θ(t−1)
12:     end if
13:   until tired
14: end procedure
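As a concrete sketch, the pseudocode above might be implemented as follows in Python. The feature function f(x, y) conjoins bag-of-words indicators (plus a bias) with the label, so θ has one weight block per label. All names here (VOCAB, LABELS, feat, train) are illustrative, not part of the slides.

```python
import numpy as np

# Illustrative vocabulary and label set (matching the later worked example)
VOCAB = ["not", "funny", "painful", "ok", "overall", "story", "good", "jokes"]
LABELS = ["POS", "NEG", "NEU"]
D = len(VOCAB) + 1                        # +1 for a bias feature

def feat(x, y):
    """f(x, y): word counts plus a bias, placed in label y's block of theta."""
    f = np.zeros(len(LABELS) * D)
    block = LABELS.index(y) * D
    for w in x.split():
        if w in VOCAB:
            f[block + VOCAB.index(w)] += 1
    f[block + D - 1] = 1                  # bias fires for every instance
    return f

def predict(theta, x):
    """y_hat = argmax_y theta . f(x, y)"""
    return max(LABELS, key=lambda y: theta @ feat(x, y))

def train(data, epochs=5):
    theta = np.zeros(len(LABELS) * D)
    for _ in range(epochs):               # "until tired"
        for x, y in data:                 # select an instance i
            y_hat = predict(theta, x)
            if y_hat != y:                # error-driven update
                theta += feat(x, y) - feat(x, y_hat)
    return theta

data = [("not funny at all", "NEG"),
        ("painful not funny", "NEG"),
        ("good jokes overall", "POS")]
theta = train(data)
```

Note the weights change only on a mistake; a correctly classified instance leaves θ untouched.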
The averaged perceptron algorithm

 1: procedure AvgPerceptron(x(1:N), y(1:N))
 2:   t ← 0
 3:   θ(0) ← 0
 4:   m ← 0
 5:   repeat
 6:     t ← t + 1
 7:     Select an instance i                          ▷ Online training
 8:     ŷ ← argmax_y θ(t−1) · f(x(i), y)
 9:     if ŷ ≠ y(i) then
10:       θ(t) ← θ(t−1) + f(x(i), y(i)) − f(x(i), ŷ)
11:     else
12:       θ(t) ← θ(t−1)
13:     end if
14:     m ← m + θ(t)
15:   until tired
16:   θ̄ ← (1/t) m
17:   return θ̄
18: end procedure
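A sketch of the averaging step in Python, assuming a feature function feat(x, y) like the one for the basic perceptron (all names hypothetical): the running sum m accumulates θ after every instance, and the average m/t is returned in place of the final weights.

```python
import numpy as np

def avg_perceptron(data, feat, labels, dim, epochs=5):
    """Averaged perceptron: returns theta_bar = m / t."""
    theta = np.zeros(dim)
    m = np.zeros(dim)
    t = 0
    for _ in range(epochs):                              # "until tired"
        for x, y in data:                                # select an instance i
            t += 1
            y_hat = max(labels, key=lambda c: theta @ feat(x, c))
            if y_hat != y:
                theta += feat(x, y) - feat(x, y_hat)
            m += theta                                   # accumulate after every step
    return m / t                                         # averaged weights
```

Averaging damps the influence of the last few updates, which typically improves accuracy on unseen data.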
Making predictions
Assuming that you have trained weights θ, finding the best-scoring label for a data instance x(i) is straightforward:

  ŷ = argmax_y θ · f(x(i), y)
Note that the score cannot be interpreted probabilistically.
What is the loss function of the Perceptron?

• Hinge loss:

  ℓ_PERCEPTRON(θ; x(i), y(i)) = max_{y∈Y} θ · f(x(i), y) − θ · f(x(i), y(i))

• When ŷ = y(i), the loss is zero; otherwise, it increases linearly with the gap between the score for the predicted label ŷ and the score for the true label y(i).
• Plotting the loss against this gap gives a hinge shape, hence the name hinge loss.
What is the loss function of the Perceptron?

• Hinge loss:

  ℓ_PERCEPTRON(θ; x(i), y(i)) = max_{y∈Y} θ · f(x(i), y) − θ · f(x(i), y(i))

• The derivative of the loss function:

  ∂/∂θ ℓ_PERCEPTRON(θ; x(i), y(i)) = f(x(i), ŷ) − f(x(i), y(i))

• Updating: at each instance, the perceptron algorithm takes a step of magnitude one in the opposite direction of this gradient:

  θ = θ − ∇_θ ℓ_PERCEPTRON = θ − f(x(i), ŷ) + f(x(i), y(i))
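The loss and its (sub)gradient can be computed directly from the definitions above. This is a sketch assuming a feature function feat(x, y) and a finite label set (names illustrative):

```python
import numpy as np

def hinge_loss_and_grad(theta, feat, labels, x, y_true):
    """Perceptron hinge loss and its (sub)gradient for one instance."""
    scores = {y: theta @ feat(x, y) for y in labels}
    y_hat = max(scores, key=scores.get)        # argmax_y theta . f(x, y)
    loss = scores[y_hat] - scores[y_true]      # zero when y_hat == y_true
    grad = feat(x, y_hat) - feat(x, y_true)    # zero vector when correct
    return loss, grad

# The perceptron update is one unit step against this gradient:
#   theta = theta - grad
```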
When do you stop the training?

How many iterations do you have to train for, and when do you stop?

• Ideally you want to stop where the model has the best performance on previously unseen data.
• One way is to check the difference between the average weight vectors after each pass over the data. If the norm of the difference falls below a predefined threshold, stop training.
• Early stopping: hold out a proportion of the training data; when the accuracy on this held-out data starts to decrease, the model has begun to overfit the training set, and it's time to stop training.
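Early stopping can be sketched as an outer loop over training epochs; one_epoch and accuracy here are hypothetical callbacks (not functions defined in the slides), standing in for one perceptron pass and held-out evaluation:

```python
import numpy as np

def train_with_early_stopping(train_data, dev_data, one_epoch, accuracy,
                              theta, max_epochs=20):
    """Stop as soon as held-out accuracy starts to decrease."""
    best_theta, best_acc = theta.copy(), accuracy(theta, dev_data)
    for _ in range(max_epochs):
        theta = one_epoch(theta, train_data)   # one pass of perceptron updates
        acc = accuracy(theta, dev_data)
        if acc < best_acc:                     # held-out accuracy decreasing:
            break                              # the model has begun to overfit
        best_theta, best_acc = theta.copy(), acc
    return best_theta
```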
Multiclass Perceptron: Parameter estimation

θ:

       not  funny  painful  ok  overall  story  good  jokes  bias
  POS    0      0        0   0        0      0     0      0     1
  NEG    0      0        0   0        0      0     0      0     0
  NEU    0      0        0   0        0      0     0      0     0

Training instance: y = NEG, x = "not funny at all"

score(y, x) = θ · f

score(y = POS, x) = 1×0 + 1×0 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 1×1 = 1
score(y = NEG, x) = 1×0 + 1×0 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 1×0 = 0
score(y = NEU, x) = 1×0 + 1×0 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 1×0 = 0
Multiclass Perceptron: Parameter estimation

θ:

       not  funny  painful  ok  overall  story  good  jokes  bias
  POS    0      0        0   0        0      0     0      0     1
  NEG    0      0        0   0        0      0     0      0     0
  NEU    0      0        0   0        0      0     0      0     0

Training instance: y = NEG, x = "not funny at all"

  ŷ = argmax_y score(y, x) = POS

ŷ ≠ y(i), so update.
Multi-class Perceptron: parameter estimation

Updated θ:

       not  funny  painful  ok  overall  story  good  jokes  bias
  POS   -1     -1        0   0        0      0     0      0     0
  NEG    1      1        0   0        0      0     0      0     1
  NEU    0      0        0   0        0      0     0      0     0

Updating:
• Add the feature vector of the instance x(i) to the weights for the NEG class and subtract it from the weights for the POS class.
• The weights for NEU are not updated. Why not?
Multi-class Perceptron: parameter estimation

Updated θ:

       not  funny  painful  ok  overall  story  good  jokes  bias
  POS   -1     -1        0   0        0      0     0      0     0
  NEG    1      1        0   0        0      0     0      0     1
  NEU    0      0        0   0        0      0     0      0     0

New training instance: y = NEG, x = "painful, not funny"

score(y, x) = θ · f

score(y = POS, x) = 1×(−1) + 1×(−1) + 1×0 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 1×0 = −2
score(y = NEG, x) = 1×1 + 1×1 + 1×0 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 1×1 = 3
score(y = NEU, x) = 1×0 + 1×0 + 1×0 + 0×0 + 0×0 + 0×0 + 0×0 + 0×0 + 1×0 = 0

ŷ = NEG, correct, so no update.
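The worked example above can be checked mechanically. The following sketch reproduces the scores and the single update (names illustrative; the comma in "painful, not funny" is dropped so whitespace tokenization matches the vocabulary):

```python
import numpy as np

VOCAB = ["not", "funny", "painful", "ok", "overall", "story", "good", "jokes"]
LABELS = ["POS", "NEG", "NEU"]

def f(x):
    """Per-class feature vector: word indicators plus a bias."""
    return np.array([1.0 if w in x.split() else 0.0 for w in VOCAB] + [1.0])

theta = {y: np.zeros(len(VOCAB) + 1) for y in LABELS}
theta["POS"][-1] = 1.0                           # initial POS bias weight

x1 = "not funny at all"                          # true label: NEG
scores = {y: theta[y] @ f(x1) for y in LABELS}   # POS: 1, NEG: 0, NEU: 0
y_hat = max(scores, key=scores.get)              # POS -- wrong
theta["NEG"] += f(x1)                            # add f to the true class
theta[y_hat] -= f(x1)                            # subtract from the predicted class

x2 = "painful not funny"                         # true label: NEG
scores2 = {y: theta[y] @ f(x2) for y in LABELS}  # POS: -2, NEG: 3, NEU: 0
```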
Perceptron versus Naïve Bayes

• Both ℓ_NB and ℓ_PERCEPTRON are convex, making them relatively easy to optimize. ℓ_NB can be optimized in closed form, while ℓ_PERCEPTRON requires iterating over the training set multiple times.
• ℓ_NB can suffer infinite loss on a single example, since the logarithm of zero probability is negative infinity. Naïve Bayes will therefore overemphasize some examples and underemphasize others.
• The Naïve Bayes classifier assumes that the observed features are conditionally independent given the label, and the performance of the classifier depends on the extent to which this assumption holds. The perceptron requires no such assumption.
• ℓ_PERCEPTRON treats all correct answers equally. Even if θ gives the correct answer by only a tiny margin, the loss is still zero.
Perceptron vs Logistic Regression

• Inference: both the Perceptron and Logistic Regression are discriminative models, but the score for Logistic Regression can be interpreted probabilistically, while the score for the Perceptron cannot.
• Parameter estimation: the parameters of both models are estimated iteratively by gradient descent. The difference in parameter estimation between them follows from their loss functions.
• For the Perceptron, the features whose weights are updated are those for the correct label and for the label that receives the highest score, when the two differ. For Logistic Regression, the weights for features that fire under all labels are updated.
• In the Perceptron, incorrect features are penalized by 1, while in Logistic Regression incorrect features are penalized in proportion to how far off the prediction is.
Online large margin classification

• For the perceptron, if the classifier gets the correct answer on a training example by only a small margin, it may give a different answer on a nearby test instance.
• To address this, we need the answer to be correct by a large margin. The margin can be formalized as:

  γ(θ; x(i), y(i)) = θ · f(x(i), y(i)) − max_{y≠y(i)} θ · f(x(i), y)

• The margin represents the difference between the score of the correct label and the score for the highest-scoring incorrect label.
• The intuition is that it is not enough to label the training data correctly: the correct label should be separated from the other labels by a comfortable margin.
Margin Loss

  ℓ_MARGIN(θ; x(i), y(i)) = { 0,                      γ(θ; x(i), y(i)) ≥ 1
                              1 − γ(θ; x(i), y(i)),   otherwise

                           = (1 − γ(θ; x(i), y(i)))₊

where (x)₊ = max(0, x). The margin loss is zero when the margin between the score for the true label and the best alternative ŷ is at least 1.
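A direct transcription of the margin and the margin loss, assuming a feature function feat(x, y) and a finite label set (names illustrative):

```python
import numpy as np

def margin(theta, feat, labels, x, y_true):
    """gamma: score of the true label minus the best incorrect score."""
    true_score = theta @ feat(x, y_true)
    best_other = max(theta @ feat(x, y) for y in labels if y != y_true)
    return true_score - best_other

def margin_loss(theta, feat, labels, x, y_true):
    """(1 - gamma)_+ : zero once the margin reaches 1."""
    return max(0.0, 1.0 - margin(theta, feat, labels, x, y_true))
```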
Minimizing the margin loss

The margin can be enforced by adding a cost to the score:

  ŷ = argmax_{y∈Y} θ · f(x(i), y) + c(y(i), y)

where the cost function can be defined as

  c(y(i), y) = { 1,  y ≠ y(i)
                 0,  otherwise

Note:
• The cost function is only used to compute ŷ during training, since the true label is unknown for unseen data. At test time, the best label is simply the one with the highest score.
• For the purposes of training, don't simply use the label with the highest score. Instead, choose the one with the highest score plus cost.
• Even with the cost added, the correct label may still be chosen: if its score is higher than every (incorrect) alternative's by at least 1, it wins the cost-augmented argmax, and in that case the loss is 0.
The update rule for SVM

Finding the gradient of the loss function involves constrained optimization with Lagrange multipliers. The final update rule is:

  θ(t) ← (1 − λ)θ(t−1) + f(x(i), y(i)) − f(x(i), ŷ)

• The previous weights θ(t−1) are scaled by 1 − λ, where λ ∈ (0, 1).
• This pulls the weights back towards zero, and in this sense it serves as a form of regularization that prevents overfitting.
• How is this different from L2 regularization?
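One online update under this rule might look like the following sketch (lam stands in for λ; feat and labels are illustrative). The cost-augmented prediction from the previous slide is reused here, which is an assumption about how ŷ is computed:

```python
import numpy as np

def svm_update(theta, feat, labels, x, y_true, lam=0.01):
    """theta(t) = (1 - lam) * theta(t-1) + f(x, y_true) - f(x, y_hat)."""
    # cost-augmented prediction: each incorrect label's score gets +1
    y_hat = max(labels, key=lambda y: theta @ feat(x, y)
                + (0.0 if y == y_true else 1.0))
    # shrink the old weights towards zero, then apply the correction
    # (which is the zero vector when y_hat == y_true)
    return (1.0 - lam) * theta + feat(x, y_true) - feat(x, y_hat)
```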
Linear algorithms: a summary

• Naïve Bayes. Pros: easy to implement; estimation is fast, requiring only a single pass over the training data; assigns probabilities to predicted labels; controls overfitting with parameter smoothing. Cons: often has poor accuracy, especially with correlated features.
• Perceptron. Pros: easy to implement; online; error-driven learning typically produces good accuracy, especially after averaging. Cons: not probabilistic; hard to know when to stop learning; lack of margin can lead to overfitting.
• Support vector machine. Pros: optimizes an error-based metric, usually resulting in good accuracy; overfitting is controlled by a regularization parameter. Cons: not probabilistic.
• Logistic regression. Pros: error-driven and probabilistic; overfitting is controlled by a regularization parameter. Cons: batch learning requires black-box optimization; logistic loss can "overtrain" on correctly labeled examples.