Machine Learning
Harish Tayyar Madabushi
www.harishmadabushi.com
Contents

Introduction
Supervised Learning
    The elements of Supervised Learning
Regression
    Univariate Linear Regression
        Hypothesis Function
        Cost Function
        Gradient Descent for Univariate Linear Regression
    Regression with Multiple Variables and Polynomial Terms
        Hypothesis Functions
        Cost Function
        Gradient Descent
These notes provide an overview of the topics covered in class and are not meant to replace the lectures. They are intended as supplementary material which will hopefully make it easier for you to revise what was taught.
I will be referring to sections of "Artificial Intelligence: A Modern Approach, Global Edition" by Stuart Russell and Peter Norvig (2016), Chapter 18, "Learning from Examples". You can access an online version of this through the resource list.
Introduction
Learning algorithms can broadly be classified into three kinds:
1. Unsupervised
2. Supervised
3. Reinforcement
Unsupervised algorithms are those that group input data based on certain inherent properties of that data. An example of such an algorithm is the k-means clustering algorithm. This section of your course does not deal with these algorithms.
Supervised algorithms are those that require annotated data to learn the trend or classes of the input data before making predictions about previously unseen input. In this section of the course we study Linear and Non-Linear Regression, Logistic Regression and Neural Networks, all of which are supervised learning algorithms (because they need training data).
Reinforcement Learning algorithms are required to achieve a specific task in a predefined environment and learn to do so based on "rewards" that inform the algorithm of how well it is doing in terms of getting closer to achieving that task. We briefly study these algorithms in this section.
Further Reading: Artificial Intelligence: A Modern Approach, Global Edition by Stuart Russell, and Peter Norvig (2016). [Section 18.1]
Supervised Learning
The majority of what we will discuss is related to supervised learning. Supervised learning is the process of learning some aspects of data that is annotated. This annotation tells us what the value of the dependent variable "y" is for some (set of) input variables, which are independent.
Independent variables take on different values based on the environment. Dependent variables take on values that depend on other variables, often the independent variables.
These dependent variables can take on continuous values, as in the case of the relation between temperature in degrees Fahrenheit and temperature in degrees Celsius (F = 1.8C + 32), or discrete values, as in the case of whether or not someone has the flu. The prediction of continuous values is called "Regression" and the prediction of which class an element belongs to is called "Classification".
The data that is used to train a model is called the training data. In the context of supervised learning, training data always has the associated value of the dependent variable, usually referred to as "y".
We use the following notation to represent the training data:
(X_1, y_1), (X_2, y_2), (X_3, y_3), (X_4, y_4), …, (X_N, y_N)
What this means is that for some input X_1, the corresponding output is y_1, for X_2 it is y_2, and so on for the N training data elements that we have available. Notice how the X in the above is capitalised, whereas the y is not. This is to highlight the fact that "X" can consist of more than one element (multivariate regression or classification, discussed later).
The test data is an independent set of data that we test our model on. While it shares the characteristics of the training data, its elements are different from those in the training data.
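As a concrete illustration (the values here simply follow F = 1.8C + 32 and are not from the lectures), the Celsius-to-Fahrenheit relation mentioned above could be stored as a small training set of (X, y) pairs in Python, with a couple of pairs held back as test data:

    # Illustrative Celsius-to-Fahrenheit readings, written as (x, y) pairs.
    training_data = [(0.0, 32.0), (10.0, 50.0), (20.0, 68.0), (30.0, 86.0)]

    # A separate set of previously unseen pairs, held back to evaluate the trained model.
    test_data = [(15.0, 59.0), (25.0, 77.0)]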
The elements of Supervised Learning
Regardless of whether we are doing classification or regression, supervised learning models consist of the following elements:
1. Hypothesis Function
2. Cost Function
3. The (partial) differential of the cost function
The hypothesis function represents the hypothesis we have about how the input data ("X" or the independent variable) affects the value (or class) of the output data ("y" or the dependent variable). It is therefore an equation that relates these two and, for regression, is of the general form:
h_w(x) = w_0 + w_1x_1 + w_2x_2 + … + w_kx_k    (Equation 1)
You might see the function h_w(x) written as f(x) or h(x). In the above equation we assume that there are going to be an arbitrary number of terms k. Remember that this does not necessarily mean that there are k independent variables; it just means that there are k terms, some of which might be polynomial terms of the form x^2, x_1x_2 and so on.
In the above equation, w_0, w_1, … are referred to as weights and take on some numeric values. The purpose of all supervised learning algorithms is to find those values of w for which the hypothesis function best estimates the training data. This is also true of classification, although the hypothesis function is of a slightly different form in that case.
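As a rough sketch (not code from the lectures), the general hypothesis function of Equation 1 could be written in Python as a function of a weight list w = [w_0, …, w_k] and a list of term values x = [x_1, …, x_k]:

    def hypothesis(w, x):
        # h_w(x) = w0 + w1*x1 + ... + wk*xk, where x may contain polynomial terms.
        return w[0] + sum(w_i * x_i for w_i, x_i in zip(w[1:], x))

    # For example:
    # hypothesis([1.0, 2.0, 3.0], [4.0, 5.0]) == 1.0 + 2.0*4.0 + 3.0*5.0 == 24.0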
We measure how good or bad a hypothesis function is using the cost function. When the cost function is at its lowest, the cost of the hypothesis function is at its minimum and the hypothesis function is therefore said to "fit" the training data well.
Recall that the derivative of a function gives the equation of the slope of that function. Where this slope is zero, the original function has a stationary point (such as a minimum). This was discussed in class using a slide; you might want to listen to the associated part of the lecture.
As finding the minimum of the cost function is not always trivial (as in the case of Non-Linear Multivariate Regression/Classification), we make use of an algorithm to do this. This algorithm, which utilises the (partial) derivative of the cost function to find those values of w_0, w_1, w_2, … at which the cost is minimum, is called Gradient Descent.
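Written compactly (this general form is standard, though the lectures may present it differently), Gradient Descent repeatedly updates every weight by a small step in the direction that reduces the cost J, where α is a small positive constant called the learning rate:

    w_j := w_j − α · ∂J/∂w_j    (simultaneously for every weight w_j)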
Further Reading: Artificial Intelligence: A Modern Approach, Global Edition by Stuart Russell, and Peter Norvig (2016). [Section 18.2]
Regression
Regression involves the use of training data to create a model that captures a trend in the training data, which can then be used to predict values of "y" given previously unseen values of "X".
An example of this was shown in the lectures.
Univariate Linear Regression
Let us first consider the case where we have one independent variable “x” related to the dependent variable linearly. This problem is called Univariate Linear Regression.
Notice how, in this case, what we are trying to fit to the training data is a straight line. The equation of a straight line is y = mx + c, where m is the slope and c the y-intercept. This equation now represents our hypothesis because we know that some straight line “fits” our training data. The aim of Univariate Linear Regression is to find the specific line that fits the training data and this requires us to find the values of m and c.
We translate the equation of a straight line into the form of the general hypothesis function given by Equation 1 by replacing c with w_0 and m with w_1.
Hypothesis Function
For Univariate Linear Regression, the hypothesis function is of the form:
h_w(x) = w_0 + w_1x    (Equation 2)
Cost Function
As discussed in the previous section, we must now establish how “bad” a particular line (defined by specific values of w0 and w1) is. This is done using the Cost Function.
There are two cost functions associated with Regression. The first is the average, over all training examples, of the absolute value of the difference between the predicted and observed values of "y". The second is the average of the squares of the same difference.
Mathematically, these are given by:

(1/m) Σ_{i=1}^{m} | y_i − h_w(x_i) |    (Equation 3)

(1/m) Σ_{i=1}^{m} ( y_i − h_w(x_i) )^2    (Equation 4)
In the above equations, m is the number of training examples (the size of the training set), y_i and x_i are the i-th output and input respectively, and h_w is the hypothesis function parametrized by some values of w. Notice that as we change the values
of w, in an attempt to find the line that better fits the training data, the corresponding cost will change.
Equation 4 represents the more commonly used Cost and is called the L2-Loss. While there is a subtle difference between “Loss” and “Cost”, we will use them interchangeably for the purpose of this course. Equation 4 is more common than Equation 3 as it more heavily penalises the hypothesis function for points that are “very far” from it.
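As a sketch (assuming the hypothetical hypothesis(w, x) helper from earlier, and a training set in which each x is the list of term values for one example), Equations 3 and 4 could be computed in Python as:

    def l1_cost(w, data):
        # Equation 3: average absolute difference between observed and predicted y.
        return sum(abs(y - hypothesis(w, x)) for x, y in data) / len(data)

    def l2_cost(w, data):
        # Equation 4: average squared difference (the L2-Loss).
        return sum((y - hypothesis(w, x)) ** 2 for x, y in data) / len(data)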
Gradient Descent for Univariate Linear Regression
Gradient descent is an algorithm that progressively updates the values of w so that the cost of the hypothesis function, parametrized by w, reduces.
The general form of Gradient Descent is:
while not converged:
    for example j in observations/training data:
        update all w using loss on j (simultaneously for all w)
What this says about the algorithm is that the values of w are updated based on the loss on each training element, j. This iteration over all training elements is called an epoch.
We run through this process multiple times until we reach a state where the algorithm is said to have converged. This is the state where the values of w represent a particular parameterization of the hypothesis function at which the cost is minimal. One simple way to check for convergence is to wait until the difference in cost between consecutive epochs is very small.
In the case of Univariate Linear Regression, Gradient Descent is as follows:
while not converged:
    for example j in observations:
        w_1 = w_1 + α · (y_j − h_w(x_j)) · x_j
        w_0 = w_0 + α · (y_j − h_w(x_j))
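The following is a minimal Python sketch of this procedure (the learning rate, tolerance and function name are illustrative, not taken from the lectures); it fits h_w(x) = w_0 + w_1x to a list of (x, y) pairs and stops once the cost barely changes between consecutive epochs:

    def univariate_gradient_descent(data, alpha=0.001, tolerance=1e-6, max_epochs=10000):
        w0, w1 = 0.0, 0.0
        previous_cost = float("inf")
        for epoch in range(max_epochs):
            for x, y in data:                        # one pass over the data = one epoch
                error = y - (w0 + w1 * x)            # y_j - h_w(x_j), using the current weights
                w0, w1 = w0 + alpha * error, w1 + alpha * error * x   # simultaneous update
            cost = sum((y - (w0 + w1 * x)) ** 2 for x, y in data) / len(data)
            if abs(previous_cost - cost) < tolerance:
                break                                # converged: cost barely changed this epoch
            previous_cost = cost
        return w0, w1

In practice the learning rate α usually needs tuning for the data at hand: too large and the cost oscillates or grows, too small and convergence is very slow.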
Regression with Multiple Variables and Polynomial Terms
There is no reason to limit our regression to one variable or to a linear hypothesis function. You are encouraged to pay specific attention to what these functions look like in the corresponding lectures (specifically Week 8, Lecture 1).
Hypothesis Functions
For Multivariate Linear Regression, the hypothesis function is of the form:
h_w(x) = w_0 + w_1x_1 + w_2x_2    (Equation 5)
For Univariate Non-Linear Regression, the hypothesis function is of the form:
h_w(x) = w_0 + w_1x + w_2x^2 + w_3x^3 + …    (Equation 6)
Notice that the above equation can be quadratic, cubic, or contain higher-order polynomial terms. We will talk about how to choose a hypothesis function later in this document.
Finally, for Multivariate Non-Linear Regression, the hypothesis function is of the form:
h_w(x) = w_0 + w_1x_1 + w_2x_2 + w_3x_1^2 + …    (Equation 7)
Notice that the above equation can similarly contain multiple polynomial terms. Here are some examples of different polynomial terms: x_1^2, x_2^3, x_1x_2.
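For instance (a hypothetical helper, not from the lectures), the term values for a two-variable non-linear hypothesis could be built in Python as:

    def polynomial_terms(x1, x2):
        # Each returned term gets its own weight w1, w2, ... in the hypothesis function.
        return [x1, x2, x1 ** 2, x2 ** 3, x1 * x2]

    # polynomial_terms(2.0, 3.0) == [2.0, 3.0, 4.0, 27.0, 6.0]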
Cost Function
The Cost function for Regression is always the same. We continue to use the L2-Loss function which was given by Equation 4:
(1/m) Σ_{i=1}^{m} ( y_i − h_w(x_i) )^2    (Equation 4)
The difference, however, is that the function used to calculate the predicted values (hw) changes based on the specific hypothesis function used.
Gradient Descent
Gradient Descent is also the same, with two modifications: a) as with the cost function, the hypothesis function used must be the one being optimised, and b) multiple w values might need to be updated.
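A minimal sketch of this general case (assuming, as before, that each training example is a pair of a term list [x_1, …, x_k] and an observed y, and that the learning rate and epoch count are illustrative):

    def gradient_descent(data, num_terms, alpha=0.001, epochs=1000):
        w = [0.0] * (num_terms + 1)                      # w0 plus one weight per term
        for _ in range(epochs):
            for terms, y in data:
                prediction = w[0] + sum(wi * xi for wi, xi in zip(w[1:], terms))
                error = y - prediction                   # computed once, so all updates use the old weights
                w[0] = w[0] + alpha * error
                for j, xj in enumerate(terms, start=1):
                    w[j] = w[j] + alpha * error * xj     # each weight moves in proportion to its own term
        return w

The form of these updates (the error multiplied by the corresponding term) comes from taking the partial derivative of the L2-Loss with respect to each weight, exactly as in the univariate case.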