Announcements
• Reminder: pset 5 out, due Thursday 11/5
• pset 4 grades up on Blackboard by Monday 11/9
Midterm grades out!
Unweighted midterm grade distribution (median = 78)
Graduate students did better overall
• Graduate students: median = 82
• Undergraduates: median = 72
Class grading
• 20% midterm
• 20% final
• 15% class challenge
• 45% homeworks
Example: a student with 72% on the midterm and final and 85% on homework/challenge earns ≈ 80% (B-).
With 72% on the midterm, 85% on the final, and 95% on homework/challenge, the total is ≈ 88% (B+).
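As a quick sanity check on these numbers, here is a minimal Python sketch of the weighted-average computation; the weights come from the grading breakdown above, and the function name is only illustrative.

```python
# Weights from the class grading breakdown above.
WEIGHTS = {"midterm": 0.20, "final": 0.20, "challenge": 0.15, "homework": 0.45}

def course_grade(midterm, final, challenge, homework):
    """Weighted course grade; all inputs are percentages."""
    scores = {"midterm": midterm, "final": final,
              "challenge": challenge, "homework": homework}
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

print(course_grade(72, 72, 85, 85))  # ~79.8 -> B-
print(course_grade(72, 85, 95, 95))  # ~88.4 -> B+
```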
Midterm regrade
• Two questions on which less than 50% of the points were awarded have been retroactively converted to bonus points
• Unweighted midterm grades: median = 78; weighted midterm grades: median = 85
• New median for graduate students: 89%
• New median for undergraduates: 78%
• With the same class weights as above: 78% on the midterm and final with 85% on homework/challenge gives ≈ 82% (B-/B); 78% on the midterm, 88% on the final, and 95% on homework/challenge gives ≈ 90% (A-)

Soft-margin SVM

Max Margin Classifier
• "Expand" the decision boundary to include a margin, until we hit the first point on either side
• Use a margin of 1
• Inputs that fall inside the margin are of unknown class

Dual vs. Primal SVM
• Primal: minimize ½||w||² over w, b subject to y_i(w·x_i + b) ≥ 1 for all i
• Dual: maximize Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j (x_i·x_j) over {α_i} subject to α_i ≥ 0 and Σ_i α_i y_i = 0

What if the data is not linearly separable?
• Introduce slack variables ξ_i ≥ 0 and relax the constraints (for all i) to y_i(w·x_i + b) ≥ 1 − ξ_i, minimizing ½||w||² + C Σ_i ξ_i
• An example lies on the wrong side of the hyperplane when ξ_i > 1, so Σ_i ξ_i is an upper bound on the number of training errors
• C trades off training error versus model complexity
• This is known as the soft-margin extension

Multi-class SVMs (multi-class classification)

Kernel SVM

Non-linear decision boundaries
• Note that both the learning objective and the decision function depend only on dot products between patterns
• How do we form non-linear decision boundaries in input space?
• Basic idea:
1. Map the data into a feature space
2. Replace dot products between inputs with dot products between feature vectors
3. Find a linear decision boundary in the feature space
• Problem: what is a good feature function φ(x)?

Kernel Trick
• Kernel trick: dot products in feature space can be computed as a kernel function, K(x_i, x_j) = φ(x_i)·φ(x_j)
• Idea: work directly on x and avoid ever computing φ(x)
• Example: K(x, z) = (x·z)² equals φ(x)·φ(z) for the feature map φ(x) consisting of all pairwise products x_i x_j

Input transformation
Mapping to a feature space can produce problems:
• High computational burden due to high dimensionality
• Many more parameters
The SVM solves these two issues simultaneously:
• The kernel trick produces efficient classification
• The dual formulation only assigns parameters to samples, not features

Kernels

Kernels as Similarity Functions
• Given x, compute new features f_i depending on proximity to "landmarks" l^(1), l^(2), l^(3), …
• Example: Gaussian (RBF) kernel: f_i = exp(−||x − l^(i)||² / (2σ²))
• The similarity f_i is high (≈ 1) when x is near landmark l^(i) and low (≈ 0) when x is far from it
• Predict label "1" when the weighted combination of the features f = (f_1, f_2, f_3, …) is above threshold, i.e., w·f + b ≥ 0
• So the new features f measure how close the example is to each "landmark" point
• Where do the landmarks come from?

Landmarks for Gaussian kernel
• Where do the landmarks come from? They are the training points!
• Training points become landmarks
• So the "landmarks" are points we can use to compute a new feature representation for a point x, by representing x as its similarity to each landmark point (measured using a Gaussian centered at the landmark)
• In SVMs with RBF (Gaussian) kernels, we place a Gaussian centered at each training point to compute the non-linear features; this is equivalent to using all of the training data as landmarks

SVM with Kernels
• Given training examples (x^(1), y^(1)), …, (x^(m), y^(m)), choose the landmarks l^(i) = x^(i)
• Given an example x, compute f_i = exp(−||x − l^(i)||² / (2σ²)) for i = 1, …, m, and train the SVM on the feature vector f (a numpy sketch of this computation appears after this section)

Kernels
Examples of kernels (kernels measure similarity):
1. Polynomial: K(x, z) = (1 + x·z)^d
2. Gaussian: K(x, z) = exp(−||x − z||² / (2σ²))
3. Sigmoid: K(x, z) = tanh(a x·z + b)
Each kernel computation corresponds to a dot-product calculation for a particular mapping φ(x): the kernel implicitly maps to a high-dimensional space.
Why is this useful?
1. We rewrite the training examples using more complex features
2. A dataset that is not linearly separable in the original space may be linearly separable in the higher-dimensional feature space

Classification with non-linear SVMs
• Non-linear SVM using a kernel function K(): maximize L_K(α) = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j) w.r.t. {α_i}, under the constraints α_i ≥ 0
• Unlike the linear SVM, we cannot explicitly compute w (a combination of the mapped support vectors), so we must retain the support vectors themselves to classify new examples
• Final decision function: f(x) = sign(Σ_i α_i y_i K(x_i, x) + b)
• Decision boundary in a Gaussian-kernel SVM: its flexibility is controlled by γ ∝ 1/σ², the inverse kernel width (figure source: towardsdatascience.com; a scikit-learn sketch follows below)
Kernel SVM Summary
Advantages:
• Kernels allow very flexible hypotheses
• Poly-time exact optimization methods rather than approximate methods
• The soft-margin extension permits mis-classified examples
• Variable-sized hypothesis space
• Excellent results (1.1% error rate on handwritten digits vs. LeNet's 0.9%)
Disadvantages:
• Must choose the kernel and its hyperparameters
• Very large problems are computationally intractable
• Batch algorithm

Kernel Functions
• Mercer's Theorem (1909): any reasonable kernel corresponds to some feature space
• "Reasonable" means that the Gram matrix is positive semi-definite
• The feature space can be very large; e.g., the polynomial kernel (1 + x_i·x_j)^d corresponds to a feature space whose dimension is exponential in d
• Linear separators in these very high-dimensional spaces correspond to highly non-linear decision boundaries in the input space

Kernelizing
• A popular way to make an algorithm more powerful is to develop a kernelized version of it
• We can rewrite many algorithms so that they are defined only in terms of inner products
• For example, k-nearest neighbors: distances in feature space can be computed from kernel values alone, since ||φ(x) − φ(z)||² = K(x, x) − 2K(x, z) + K(z, z)
• Techniques for constructing valid kernels: e.g., sums, positive scalings, and products of valid kernels are themselves valid kernels

Summary of SVMs
Software:
• A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
• Some implementations (such as LIBSVM) can handle multi-class classification
• SVMLight is among the earliest implementations
• Several Matlab toolboxes for SVM are also available
Key points:
• Difference between logistic regression and SVMs
• Maximum margin principle
• Target function for SVMs
• Slack variables for mis-classified points
• The kernel trick allows non-linear generalizations

History of SVMs
• The original SVM algorithm was invented by Vladimir Vapnik and Alexey Chervonenkis in 1963.
• In 1992, Bernhard E. Boser, Isabelle M. Guyon and Vladimir Vapnik suggested a way to create nonlinear classifiers by applying the kernel trick to maximum-margin hyperplanes.[13]
• The soft margin was proposed by Corinna Cortes and Vapnik in 1993 and published in 1995.[1]
• SVMs were very popular in the 1990s and 2000s until neural networks took over circa 2012
https://en.wikipedia.org/wiki/Support-vector_machine

Next Class
Reinforcement Learning I: reinforcement learning; Markov Decision Processes (MDPs); policies, value functions, Q-learning