Linear Classification
Introduction to Machine Learning: Lecture II Nebu John Mathai
Recap
Linear Regression
Regression: Why?
• What problem is regression solving?
• Modelling a phenomenon
• Predicting the behavior of a phenomenon, based on this model
• Consider a phenomenon, P, with some I/O behavior that we can investigate and tabulate: inputs (x), outputs (t)
• Regression: postulate a tunable model
• That is, a model with various adjustable parameters
• Adjust these parameters based on the data (fit)
• Validate the adjustments on novel data (generalization)
Regression: Setup
• Consider an N-element dataset containing input (data, x) and output (target, t) characteristics for some phenomenon, P, of interest
• We want to develop a model, M, of P so that we can predict its behavior
• That is, given existing or novel values of x, accurately predict what P would output, given that x
• Recall: fit versus generalization
• Fit: Given values of x from our dataset, our model’s output (y) is an accurate
prediction of the associated t
• Generalization: Given new values of x that we have not seen yet, the model’s output is an accurate prediction of the associated t
Regression: Plan
• Here is the supervised learning plan of attack of this problem
• 0. We separate the dataset into:
• Training Set
• Validation Set
• 1. We propose a tunable (i.e., parametrized) computational model, M, that has the same
I/O characteristics of P (i.e., the dataset)
• Input type and dimensionality of P and M are identical (ref: 𝒙 ∈ ℝᴰ)
• Output type and dimensionality of P and M are identical (ref: 𝑡 ∈ ℝ)
• 2. We define the Loss function, L
• From this we can derive the performance surface E
• 3. Learning: We tune the parameters of M using (a) the Training Set, and (b) Gradient Descent (i.e., we learn using the Training Set to achieve fit)
• 4. Validation: Once (3) has converged, we validate the model using the Validation Set
• This is novel data: thus this step is to measure generalization
Linear Regression: Details
• Linear Regression is a special case of the regression problem
• M is chosen to be a linear model: the prediction here is a weighted linear combination of the input vector x
• 𝑀: 𝑦 = 𝒘ᵀ𝒙 + 𝑏
• The N-element dataset, {(𝒙⁽ⁿ⁾, 𝑡⁽ⁿ⁾)}ₙ₌₁ᴺ, is partitioned into TrS (~70% of N) and VaS (~30% of N)
Linear Regression: Details
• Now we define the error function (loss function) that expresses our measure of how “off” a prediction is:
• 𝐿 = ½ (𝑦 − 𝑡)²
• With this loss function we can define a performance surface:
• 𝐸(𝒘, 𝑏) = (1/𝑁) ∑ₙ₌₁ᴺ 𝐿(𝑦⁽ⁿ⁾, 𝑡⁽ⁿ⁾)
• This performance surface is a scalar valued function over the hypothesis space (i.e., the space of all hypotheses (w, b))
• It defines the average loss of M(x; w′, b′)
• M(x; w′, b′) is the particular instance of M with parameters set to (w′, b′)
Linear Regression: Details
• With E defined, we now just need to find the lowest point (the global minimum) on this surface E, and use the value of (w, b) at this minimum
• For this we employ gradient descent
• Start with some setting of the hypothesis
• Calculate the gradient of E with respect to the hypothesis space, at this value of the hypothesis
• Move in the opposite direction of the gradient to find the minimum
• For the linear regression problem, the performance surface E is guaranteed to be convex, and thus the single extremum of this function (i.e., the point on E where the gradient is zero) is a minimum
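A minimal gradient-descent sketch for this LR setup (the learning rate, step count, and toy dataset are illustrative choices, not prescribed by the lecture):

```python
import numpy as np

# Gradient descent on E(w, b) = (1/N) * sum( 1/2 (w.x + b - t)^2 ):
#   dE/dw = (1/N) * sum( (y - t) x ),   dE/db = (1/N) * sum( y - t )
def fit_linear(X, T, lr=0.1, steps=1000):
    w = np.zeros(X.shape[1])      # initial hypothesis
    b = 0.0
    for _ in range(steps):
        err = X @ w + b - T       # y - t for every training element
        w -= lr * (X.T @ err) / len(T)   # step against the gradient
        b -= lr * err.mean()
    return w, b

# Toy 1-D dataset generated from t = 3x + 1 (hypothetical)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
T = 3.0 * X[:, 0] + 1.0
w, b = fit_linear(X, T)
print(w, b)  # approximately [3.], 1.0
```

Because E is convex here, this descent converges to the global minimum regardless of the starting hypothesis (for a small enough learning rate).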
Linear Regression: Details
• After identifying the minimum of E with respect to TrS, we have a setting for the hypothesis that enables M to fit TrS
• We now need to validate M for its ability to generalize to new data (i.e., data that it was not trained on)
• For this reason, we take VaS and calculate E
• If E(VaS) <= E(TrS) then we have an empirical indication that the tuned model generalizes
Linear Regression: Details
• This is all empirical!
• It is of course possible that our dataset was biased, and also that our
partitioning of DS into TrS and VaS was done in a biased manner.
• We thus have NO theoretical guarantee that our LR tuned model will generalize to all possible data
• We have no guarantee as to M’s validity
• Hence: in employing a ML model, we must be always vigilant in validating the model continuously to see if there is any indication that our model is invalid
• ML is thus very different from automata theory
• In the latter we have theoretical guarantees that we have solved a problem per the theory
• Why is ML used? Sometimes we have to cope with the absence of a theory for our problem.
Extensions
• LR allows us to model functions (or “learn functions”) of form 𝑓: ℝᴰ → ℝ using a linear model that is tuned
• Can extend this to learning functions of form 𝑓: ℝ → ℝ using a polynomial (nonlinear) model that is tuned
• No new development was needed
• We used feature functions to transform the original scalar data set into a multidimensional one (powers of the original data were mapped to new dimensions), and then used standard LR
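The feature-function trick can be sketched as follows (the helper name `polynomial_features` is my own):

```python
import numpy as np

def polynomial_features(x, degree):
    # Map a scalar dataset to a multidimensional one: x -> (x, x^2, ..., x^degree).
    # Standard LR on these features then fits a degree-`degree` polynomial.
    return np.column_stack([x ** d for d in range(1, degree + 1)])

x = np.array([1.0, 2.0, 3.0])
Phi = polynomial_features(x, 3)
print(Phi)  # row n holds (x_n, x_n^2, x_n^3)
```

The transformed dataset Phi is then handed to the unchanged LR machinery; no new development is needed, exactly as stated above.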
• Exercise: Show how the above techniques can be applied to the modelling of general multidimensional systems via linear or polynomial regression
Hyperparameters
• We also covered the notion of hyperparameters: parameters in the process that are outside of the optimization loop
• C.f. w,b which are parameters, and are being actively optimized
• The learning rate is one such hyperparameter
• The maximum degree for polynomial regression is another
• The degree is special, because we can put an additional loop around our optimization loop so that when validation fails, we can adjust the hyperparameter to increase model complexity
• We discussed the conflict between fit and generality
• The next slide is illustrative
[Figure: two candidate models fit to the same t-versus-x dataset]
• Which model is more appropriate?
• How would they both fare with new data (in red)?
Expectations
• At this point, you must know:
• The setup of the regression problem
• The full development of the LR solution
• How to extend LR to polynomial regression
• In multiple dimensions
• The role of the data set and its partitions { TrS, VaS, TeS }
• What hyper-parameters are and how they differ from parameters (i.e., the
hypothesis space)
• How to code an LR solver and how to use it
• Training vs Validation vs Testing
• Ask if any of this is unclear
Classification
Problem Setup; Linear Binary Classification
Classification
• What is the classification problem?
• Suppose we have medical case data that shows how various symptoms and measurements of an organism relates to the presence or absence of a medical condition
• A classification system tuned on this data would allow us to take in a new input vector (symptoms, measurements) and indicate whether it signals the presence or absence of the disease
• Suppose we have a set of pictures with labels that indicate the object being pictured
• A classification system tuned on this data would allow us to take in a new input vector (image pixels) and indicate what the imaged object is
Classification
• Like with regression, our input is a vector (possibly multidimensional)
• E.g., a vector of measurements, pixels, etc.
• Unlike with regression, our output (target) is not a real-valued variable but rather a class: some discrete variable with a fixed number of options
Classification vs Regression
• To understand classification (and indeed, any ML technique) we can start by looking at the datasets involved
• Regression: {(𝒙⁽ⁿ⁾, 𝑡⁽ⁿ⁾)}ₙ₌₁ᴺ where 𝒙 ∈ ℝᴰ and 𝑡 ∈ ℝ
• Our goal was to design a model that can predict real-valued targets, t, given the input x
• Classification: {(𝒙⁽ⁿ⁾, 𝑡⁽ⁿ⁾)}ₙ₌₁ᴺ where 𝒙 ∈ ℝᴰ and 𝑡 ∈ ℤ
• Our goal here is to also design a model that will predict targets
• Now, however, we call the target a class and the class is not real-valued, but a discrete space (i.e., an integer space)
Classification
• Classification: {(𝒙⁽ⁿ⁾, 𝑡⁽ⁿ⁾)}ₙ₌₁ᴺ where 𝒙 ∈ ℝᴰ and 𝑡 ∈ ℤ
• This change of the nature of the target set from real-valued to integral makes a major difference
• You can recall Lecture 1’s discussion of the implications of our using an integer computer to approximate real-valued computations to refresh in your mind that real numbers and integers are very different spaces
• We will see in the next lecture why we can not simply rehash our LR development to develop a classification ML system
Classification
• Classification: {(𝒙⁽ⁿ⁾, 𝑡⁽ⁿ⁾)}ₙ₌₁ᴺ where 𝒙 ∈ ℝᴰ and 𝑡 ∈ ℤ
• This change of the nature of the target set from real-valued to integral makes a major difference
• You can recall Lecture 1’s discussion of the implications of our using an integer computer to approximate real-valued computations to refresh in your mind that real numbers and integers are very different spaces
• We will not be able to look at the t-vs-x plot in the same manner as for LR
• t is a discrete space, and so our real-variables calculus tools can not be used directly
• We will see in the next lecture why we can not simply rehash our LR development to develop a classification ML system
Setup: Binary Classification
• We will start our development by choosing the most basic, non-trivial classification space: the discrete set { -1, 1 }
• Classification: {(𝒙⁽ⁿ⁾, 𝑡⁽ⁿ⁾)}ₙ₌₁ᴺ where 𝒙 ∈ ℝᴰ and 𝑡 ∈ {−1, 1}
• As stated previously, we will not visualize this data set by plotting t- versus-x, as we would normally be able to do if t were real-valued
• Instead, we will plot the data in x’s space and color the data points based on the class of the point (i.e., either -1 or 1)
• We need to determine a method to classify input vectors x into either the class -1 or the class 1: this is called Binary Classification
Setup: Binary Classification
• If we are not viewing the data as t-vs-x (as we did in LR), then we are not able to use our LR development as is to learn the classification function
• Instead, we will try to learn the coefficients for a hypersurface in ℝᴰ (i.e., the space from which the data vectors, 𝒙, are derived) such that the elements of the two classes are on different sides of the hypersurface (if this is possible, we are said to have a linearly separable dataset)
• Recall from linear algebra that the equation for such a hypersurface has form: 𝒘ᵀ𝒙 + 𝑏 = 0
• This hypersurface will split the space into two half-spaces:
• Points u on one side of the hypersurface would obey: 𝒘ᵀ𝒖 + 𝑏 < 0
• Points v on the other side of the hypersurface would obey: 𝒘ᵀ𝒗 + 𝑏 > 0
Setup: Binary Classification
• Summarizing:
• Dataset: {(𝒙⁽ⁿ⁾, 𝑡⁽ⁿ⁾)}ₙ₌₁ᴺ where 𝒙 ∈ ℝᴰ and 𝑡 ∈ {−1, 1}
• Model:
• 𝑧 = 𝒘ᵀ𝒙 + 𝑏
• 𝑦 = 1 if 𝑧 ≥ 0; −1 if 𝑧 < 0
• As with LR, y is the output of the model, the predictor of the class
• As with LR, y is the output of the model, the predictor of the class
• This prediction is computed by looking at z to determine which side of the hypersurface we’re on
• Note: you may notice some deficiencies due to the fact that the predictor can not, on the surface, distinguish between being on the surface, and on the positive side of the surface; we’ll address this soon.
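This model is a one-liner to sketch in Python (the separator values below are hypothetical, chosen for illustration):

```python
import numpy as np

def predict(w, b, x):
    # The sign of z = w.x + b tells us which side of the
    # hypersurface w.x + b = 0 the point x lies on
    z = np.dot(w, x) + b
    return 1 if z >= 0 else -1

# Hypothetical separator in R^2: x1 + x2 - 1 = 0
w = np.array([1.0, 1.0])
b = -1.0
print(predict(w, b, np.array([2.0, 2.0])))  # 1  (positive side)
print(predict(w, b, np.array([0.0, 0.0])))  # -1 (negative side)
```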
Binary Linear Classifier
• Dataset: {(𝒙⁽ⁿ⁾, 𝑡⁽ⁿ⁾)}ₙ₌₁ᴺ where 𝒙 ∈ ℝᴰ and 𝑡 ∈ {−1, 1}
• Model: 𝑧 = 𝒘ᵀ𝒙 + 𝑏
• 𝑦 = 1 if 𝑧 ≥ 0; −1 if 𝑧 < 0
• The above model is parametrized by w and b
• We now discuss how to determine the values of the parameters
• As with LR, and indeed with all supervised learning techniques, we will be adjusting these parameters based on the data and some feedback as to our model’s instantaneous correctness
Perceptron
• To make the development simpler to present we are going to absorb b into the w vector, by modifying the dataset as follows
• Original Dataset: {(𝒖⁽ⁿ⁾, 𝑡⁽ⁿ⁾)}ₙ₌₁ᴺ where 𝒖 ∈ ℝᴰ and 𝑡 ∈ {−1, 1}
• Modified Dataset: {(𝒙⁽ⁿ⁾, 𝑡⁽ⁿ⁾)}ₙ₌₁ᴺ where 𝒙 = (1, 𝒖) ∈ ℝᴰ⁺¹ and 𝑡 ∈ {−1, 1}
• That is, the modified data set is the original data set to which an extra dummy dimension has been added with a value of 1 for all data vectors
• Since this transform can always be done, we will assume it has already been done in the next slides
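A sketch of this transform (the helper name `absorb_bias` is illustrative):

```python
import numpy as np

def absorb_bias(U):
    # Prepend a dummy feature fixed at 1 to every data vector u, giving
    # x = (1, u1, ..., uD), so that z = w.x realizes the old w'.u + b
    return np.hstack([np.ones((U.shape[0], 1)), U])

U = np.array([[2.0, 3.0], [4.0, 5.0]])
X = absorb_bias(U)
print(X)  # rows: (1, 2, 3) and (1, 4, 5)
```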
Perceptron
• With this modified dataset, we have the model:
• 𝑧 = 𝒘ᵀ𝒙
• 𝑦 = 1 if 𝑧 ≥ 0; −1 if 𝑧 < 0
• Note: 𝑧 = 𝒘ᵀ𝒙 = (𝑤₀, 𝑤₁, …, 𝑤_D) · (1, 𝑥₁, …, 𝑥_D) = 𝑤₀ + 𝑤₁𝑥₁ + ⋯ + 𝑤_D 𝑥_D, so 𝑧 = 0 still defines a hypersurface in ℝᴰ (i.e., we have not altered the dimensionality nor the solution by our absorption of b into the weight vector)
• The above model is parametrized by the (D+1)-dimensional 𝒘
Perceptron
• If tuned properly, the model should classify data points from the data set correctly:
• ∀ 1 ≤ j ≤ N: 𝑡⁽ʲ⁾ = +1 ⟹ 𝑦(𝒙⁽ʲ⁾) = +1
• ∀ 1 ≤ j ≤ N: 𝑡⁽ʲ⁾ = −1 ⟹ 𝑦(𝒙⁽ʲ⁾) = −1
• How do we achieve this proper tuning?
• It turns out there is a very simple algorithm that is guaranteed to converge in finite time if the dataset is linearly separable
Perceptron Learning Rule (PLR)
• With thanks to Rosenblatt, 1958
• Let’s consider the j-th element of our dataset
• For 𝑡⁽ʲ⁾ = +1 we want 𝑦(𝒙⁽ʲ⁾) = +1
• If the tuned model is producing a correct prediction then 𝑧(𝒙⁽ʲ⁾) ≥ 0
• We will strengthen this criterion and demand: 𝑡⁽ʲ⁾ = +1 ⟹ 𝑧(𝒙⁽ʲ⁾) > 0
• This criterion prevents us from allowing data points on the hypersurface
• Similarly, we demand: 𝑡⁽ʲ⁾ = −1 ⟹ 𝑧(𝒙⁽ʲ⁾) < 0
• If this were not the case, then we would have to adjust w
• How?
Perceptron Learning Rule
• Based on the prior development we only have two cases:
• 𝑡⁽ʲ⁾ 𝑧(𝒙⁽ʲ⁾) > 0 ⟹ NO ACTION NEEDED
• 𝑡⁽ʲ⁾ 𝑧(𝒙⁽ʲ⁾) ≤ 0 ⟹ 𝒘 MUST BE TUNED
• If the product is strictly positive then we know:
• Both target and prediction have the correct sense
• Prediction is not on the hypersurface (i.e., the = 0 case is excluded)
Perceptron Learning Rule
• If 𝑡⁽ʲ⁾ 𝑧(𝒙⁽ʲ⁾) ≤ 0:
• For positive targets, this means that 𝑧 = 𝒘ᵀ𝒙⁽ʲ⁾ is negative and must be increased
• Let’s propose we add 𝒙⁽ʲ⁾ to w (i.e., 𝒘′ ← 𝒘 + 𝒙⁽ʲ⁾)
• Then: 𝑧′ = 𝒘′ᵀ𝒙⁽ʲ⁾ = (𝒘 + 𝒙⁽ʲ⁾)ᵀ𝒙⁽ʲ⁾ = 𝒘ᵀ𝒙⁽ʲ⁾ + ‖𝒙⁽ʲ⁾‖² > 𝑧
• Recall: 𝒂ᵀ𝒂 = ‖𝒂‖² > 0
• Similarly, for negative targets: 𝒘′ ← 𝒘 − 𝒙⁽ʲ⁾
• Combining the two cases: 𝒘′ ← 𝒘 + 𝑡⁽ʲ⁾𝒙⁽ʲ⁾
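The effect of the update can be checked numerically; the weight and data values below are made up for illustration:

```python
import numpy as np

# A misclassified positive example (t = +1, but z < 0)
w = np.array([0.5, -1.0, 0.25])
x = np.array([1.0, 2.0, -1.0])
t = 1

z_before = w @ x          # -1.75: wrong side of the hypersurface
w_new = w + t * x         # the update w' <- w + t x
z_after = w_new @ x
print(z_before, z_after)  # z grows by exactly ||x||^2 = 6
```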
Perceptron Learning Rule
• Estimate w
• While (True):
• updated ← False
• ∀ (𝒙⁽ⁿ⁾, 𝑡⁽ⁿ⁾) ∈ TrS (i.e., for all elements in our training set):
• Calculate 𝑧(𝒙⁽ⁿ⁾)
• If 𝑡⁽ⁿ⁾ 𝑧(𝒙⁽ⁿ⁾) ≤ 0:
• 𝒘 ← 𝒘 + 𝑡⁽ⁿ⁾𝒙⁽ⁿ⁾
• updated ← True
• Break if updated == False (i.e., a full pass made no corrections)
• Note: if the training set (TrS) is linearly separable then this is guaranteed to converge
• Since we may not know TrS is linearly separable, you should augment this loop to have a maximum iteration count (to prevent infinite loops). Do this as an exercise.
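One possible sketch of this loop in Python, including a maximum-iteration guard of the kind suggested above (the function name and the OR-gate example are illustrative):

```python
import numpy as np

def perceptron_learning_rule(X, T, max_iters=1000):
    # X: augmented dataset (first column all 1s), T: targets in {-1, 1}.
    # Returns a weight vector, or None if max_iters passes are exhausted
    # (a guard against non-separable data causing an infinite loop).
    w = np.zeros(X.shape[1])        # initial estimate of w
    for _ in range(max_iters):
        updated = False
        for x, t in zip(X, T):
            z = w @ x
            if t * z <= 0:          # wrong side, or on the hypersurface
                w = w + t * x       # w <- w + t x
                updated = True
        if not updated:             # a full pass with no corrections
            return w
    return None

# OR on {-1, 1}^2, with the dummy 1 already prepended to each vector
X = np.array([[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]], dtype=float)
T = np.array([-1, 1, 1, 1], dtype=float)
w = perceptron_learning_rule(X, T)
print(w)  # a separating hyperplane for OR
```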
Exercises
• Consider the following Boolean functions: • OR, AND, NOT and XOR
• Recall: for x and y in {-1, 1}
• OR(x, y) returns a 1 for all cases except when x = y = -1
• AND(x, y) returns a -1 for all cases except when x = y = 1
• NOT(x) returns a -1 when x = 1, and a +1 otherwise
• XOR(x, y) returns a -1 when x = y, and a 1 otherwise
• What is the structure of the Perceptron that you would require to learn each of those functions? (i.e., write the w and x vectors)
• Derive the weights by hand for each function.
• Now code up the PLR in Python and see what it computes for w
• Test what you got: were (a) your hand calculations and (b) the output of the PLR correct?
Limits of the PLR
• The Perceptron Learning Rule is only guaranteed to converge if the dataset is linearly separable
• Given a dataset with a binary target (in { -1, 1 }), let’s call the data vectors with class = 1 the positive examples
• Likewise, the data vectors with class=-1 are the negative examples
• A linearly separable dataset is one where a hyperplane separates the
positive and negative examples
• In such a case, the PLR is guaranteed to converge to the hyperplane coefficients
Limits of the PLR
• Exercise: graph the aforementioned Boolean functions and attempt to draw a hyperplane (a line in 2D) between the positive and negative examples. What do you observe? Compare this with the results of the computation of w for each case that you did earlier.
• As you can see, a very simple function (the XOR) produces results that are not linearly separable, and thus can not be learned via a perceptron. This is a strong indictment of the Perceptron, given that XOR is trivially constructed from two ANDs and one OR!
• What else can one not do with a Perceptron?
Limits of the PLR
• Let’s first introduce the concept of a convex set
• A convex set, S, is one wherein given any two points within the set, the line segment joining these two points (i.e., the set of points on the segment) is also within the set; equivalently:
• S is convex ⟺ for all {𝒔₁, …, 𝒔ₖ} ⊆ S: ∑ᵢ₌₁ᵏ 𝜆ᵢ𝒔ᵢ ∈ S
• For 𝜆ᵢ ≥ 0 and ∑ᵢ₌₁ᵏ 𝜆ᵢ = 1
• For a linearly separable problem, each class lies in a convex half-space, so the convex hull of the positive examples and the convex hull of the negative examples cannot intersect
Limits of the PLR
• Let’s consider the XOR function.
• Let’s suppose it was linearly separable. Then we ought to be able to draw convex sets for the regions of positive examples (and likewise for negative examples).
• If we draw the line segment between the two -1s, we ought to have a convex set lying in one half-space (if XOR were linearly separable).
• Likewise for the two 1s.
• Note, however, that the two sets (the two line segments) intersect, meaning that the point of intersection must belong to both sets, which is nonsense.
• Hence XOR is not linearly separable, as the convex hulls of the positive examples and of the negative examples intersect
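This intersection argument can be checked numerically with a small illustrative sketch:

```python
import numpy as np

# XOR on {-1, 1}^2: negative examples where x = y, positive where x != y
neg = np.array([[-1.0, -1.0], [1.0, 1.0]])
pos = np.array([[-1.0, 1.0], [1.0, -1.0]])

# Convex combinations with lambda = 1/2 (the midpoints of the two segments)
mid_neg = neg.mean(axis=0)
mid_pos = pos.mean(axis=0)
print(mid_neg, mid_pos)  # both midpoints are the origin: the segments intersect
```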
Limits of the Perceptron
• In looking at the limits of LC we see that the lack of linear separability (LS) in the dataset is the problem
• Two cases:
• Dataset is LS but noisy
• Dataset is NOT LS
Addressing the Limits
• The PLR is great for linearly separable (LS) datasets (DS)
• What if our dataset is contaminated with noise?
• Then a fundamentally LS DS will not look LS!
• In this case we will need to optimize: find the best linear separator for most
of the data
• This will involve learning
• I.e., loss functions and gradient descent
• The next lecture will deal with this
• Recall our comments that classification, being a mapping from a real data space to a discrete target space, is crucially different from the regression problem
• We’ll see why soon
Addressing the Limits
• Though useful at learning linearly separable datasets, the Perceptron is limited and can’t learn even simple nonlinear functions
• Exercise: show that XOR is nonlinear
• Recall how LR was extended to polynomial regression via the feature
functions
• We’ll see that we can employ a similar tactic to make the Perceptron learn the XOR
• But more than that, we’ll uncover a fundamental principle that will guide us into more interesting ML territory: neural networks
Map so far
• Linear Regression (LR)
• Dataset characteristics
• Tunable model
• Optimization framework for the solution of LR: learning
• Linear Classification (LC)
• Dataset characteristics
• Tunable model (Perceptron)
• Perceptron Learning Rule: non-optimization framework for the solution of LC
• Next: Learning and Classification
• + Optimization framework for the solution of linear classification with noise
• Next next: Learning non-linear classification functions
• + Generalization of the model
• + How to optimize with this more general model
Linear Classification: Terminology
• Classification
• Linear Classification
• Binary Classification
• Model
• Perceptron Model
• Negative Examples
• Positive Examples
• Hypersurface
• Convex Subset
• Linearly Separable Problems
Required Reading
• N/A
Useful Reference Material
• An excellent reference for AI, ML, linear regression, and other related topics is found in Artificial Intelligence: A Modern Approach (Third Edition; by Russell and Norvig)
• Professor Grosse (Computer Science) has further excellent material:
• http://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/readings/L03%20Linear%20Classifiers.pdf
• Neural Networks – A Systematic Introduction, Raul Rojas, Springer-Verlag, Berlin, New York, 1996 (502 p., 350 illustrations)
• Chapters 3 and 4