CSC 311: Introduction to Machine Learning – Lecture 1 – Introduction
CSC 311: Introduction to Machine Learning
Lecture 1 – Introduction
Intro ML (UofT) CSC311-Lec1 1 / 53
This course
Broad introduction to machine learning
I First half: algorithms and principles for supervised learning
I nearest neighbors, decision trees, ensembles, linear regression,
logistic regression
I Unsupervised learning: PCA, K-means, mixture models
I Basics of reinforcement learning
Coursework is aimed at advanced undergrads. We will use
multivariate calculus, probability, and linear algebra.
Intro ML (UofT) CSC311-Lec1 2 / 53
Do I have the appropriate background?
This is where we use all the math that you learned, and then some!
Linear algebra (MAT223/240): vector/matrix manipulations,
properties.
Calculus (MAT232): partial derivatives/gradient
Probability, Statistics (STA256): common distributions, Bayes
Rule; expectation, variance, covariance, median; maximum
likelihood.
Geometry and geometric intuition.
Recommend preparation: Numerical Methods (CSC338) or
Computational Statistics
Mathematical maturity will be assumed: you should be able to write
and understand rigorous proofs.
Intro ML (UofT) CSC311-Lec1 3 / 53
Course Information
Recommended readings will be given for each lecture. But the
following will be useful throughout the course:
Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of
Statistical Learning, Second Edition, 2009.
Christopher M. Bishop, Pattern Recognition and Machine Learning, 2006
Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An
Introduction, Second Edition, 2018.
Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning, 2016
Kevin Murphy, Machine Learning: A Probabilistic Perspective, 2012.
Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An
Introduction to Statistical Learning, 2017.
Shai Shalev-Shwartz and Shai Ben-David, Understanding Machine Learning:
From Theory to Algorithms, 2014.
David MacKay, Information Theory, Inference, and Learning Algorithms, 2003.
There are lots of freely available, high-quality ML resources.
Intro ML (UofT) CSC311-Lec1 4 / 53
https://web.stanford.edu/~hastie/ElemStatLearn/
https://web.stanford.edu/~hastie/ElemStatLearn/
https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf
http://incompleteideas.net/book/the-book-2nd.html
http://incompleteideas.net/book/the-book-2nd.html
http://www.deeplearningbook.org
http://faculty.marshall.usc.edu/gareth-james/ISL/
http://faculty.marshall.usc.edu/gareth-james/ISL/
https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/index.html
https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/index.html
http://www.inference.org.uk/mackay/itila/book.html
Requirements and Marking
Three (3) assignments
I Combination of pen & paper derivations and programming exercises
I Due Oct 8, Nov 8, Dec 7 at 11:59pm
Midterm
I October 25, 8–9pm.
Final Exam
I Two hours
I Date and time TBA
Intro ML (UofT) CSC311-Lec1 5 / 53
More on Assignments
Collaboration on the assignments is not allowed. Each student is responsible for their own
work. Discussion of assignments should be limited to clarification of the handout itself, and
should not involve any sharing of pseudocode or code or simulation results. Violation of this
policy is grounds for a semester grade of F, in accordance with university regulations.
The schedule of assignments will be posted on the course webpage.
Extensions will be granted only in special situations, and you will need a Student Medical
Certificate or a written request approved by the course coordinator at least one week before
the due date.
Intro ML (UofT) CSC311-Lec1 6 / 53
Related Courses and a Note on the Content
More advanced ML courses such as CSC413 (Neural Networks
and Deep Learning) and CSC412 (Probabilistic Learning and
Reasoning) both build upon the material in this course.
If you’ve already taken an applied statistics course, there will be
some overlap.
Warning: This is not a very difficult course, but it has a lot of
content. To be successful, you need to work hard. For example,
don’t start working on your homework assignments just two days
before the deadline.
Intro ML (UofT) CSC311-Lec1 7 / 53
What is learning?
“The activity or process of gaining knowledge or skill by studying,
practicing, being taught, or experiencing something.”
Merriam Webster dictionary
“A computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P, if its performance
at tasks in T, as measured by P, improves with experience E.”
Tom Mitchell
Intro ML (UofT) CSC311-Lec1 8 / 53
What is machine learning?
For many problems, it is difficult to program the correct behaviour
by hand
I recognizing people and objects
I understanding human speech
Machine learning approach: program an algorithm to
automatically learn from data, or from experience
Why might you want to use a learning algorithm?
I hard to code up a solution by hand (e.g. vision, speech)
I system needs to adapt to a changing environment (e.g. spam
detection)
I want the system to perform better than the human programmers
I privacy/fairness (e.g. ranking search results)
Intro ML (UofT) CSC311-Lec1 9 / 53
What is machine learning?
It is similar to statistics…
I Both fields try to uncover patterns in data
I Both fields draw heavily on calculus, probability, and linear algebra,
and share many of the same core algorithms
But it is not statistics!
I Stats is more concerned with helping scientists and policymakers
draw good conclusions; ML is more concerned with building
autonomous agents
I Stats puts more emphasis on interpretability and mathematical
rigor; ML puts more emphasis on predictive performance,
scalability, and autonomy
Intro ML (UofT) CSC311-Lec1 10 / 53
Relations to AI
Nowadays, “machine learning” is often brought up with “artificial
intelligence” (AI)
AI does not often imply a learning based system
I Symbolic reasoning
I Rule based system
I Tree search
I etc.
Learning based system → learned based on the data → more
flexibility, good at solving pattern recognition problems.
Intro ML (UofT) CSC311-Lec1 11 / 53
Relations to human learning
Human learning is:
I Very data efficient
I An entire multitasking system (vision, language, motor control, etc.)
I Takes at least a few years 🙂
For serving specific purposes, machine learning doesn’t have to
look like human learning in the end.
It may borrow ideas from biological systems, e.g., neural networks.
It may perform better or worse than humans.
Intro ML (UofT) CSC311-Lec1 12 / 53
What is machine learning?
Types of machine learning
I Supervised learning: access to labeled examples of the correct
behaviour
I Reinforcement learning: learning system (agent) interacts with
the world and learn to maximize a reward signal
I Unsupervised learning: no labeled examples – instead, looking
for “interesting” patterns in the data
Intro ML (UofT) CSC311-Lec1 13 / 53
History of machine learning
ML conferences selling out like Beyonce tickets.
Source: medium.com/syncedreview/
Intro ML (UofT) CSC311-Lec1 14 / 53
History of machine learning
ML conferences selling out like Beyonce tickets.
Source: medium.com/syncedreview/
Intro ML (UofT) CSC311-Lec1 14 / 53
Computer vision: Object detection, semantic segmentation, pose
estimation, and almost every other task is done with ML.
Instance segmentation – Link
Intro ML (UofT) CSC311-Lec1 15 / 53
https://drive.google.com/file/d/0Byy_mRDnLTHIYzVHN3lITzFhUkU/view
Speech: Speech to text, personal assistants, speaker identification…
Intro ML (UofT) CSC311-Lec1 16 / 53
NLP: Machine translation, sentiment analysis, topic modeling, spam
filtering.
Intro ML (UofT) CSC311-Lec1 17 / 53
Playing Games
DOTA2 – Link
Intro ML (UofT) CSC311-Lec1 18 / 53
E-commerce & Recommender Systems : Amazon,
Netflix, …
Intro ML (UofT) CSC311-Lec1 19 / 53
Why this class?
Why not jump straight to CSC412/413, and learn neural nets first?
The principles you learn in this course will be essential to really
understand neural nets.
The techniques in this course are still the first things to try for a
new ML problem.
I For example, you should try applying logistic regression before
building a deep neural net!
There is a whole world of probabilistic graphical models.
Intro ML (UofT) CSC311-Lec1 20 / 53
Why this class?
2017 Kaggle survey of data science and ML practitioners: what data
science methods do you use at work?
Intro ML (UofT) CSC311-Lec1 21 / 53
Implementing machine learning systems
You will often need to derive an algorithm (with pencil and
paper), and then translate the math into code.
Array processing (NumPy)
I vectorize computations (express them in terms of matrix/vector
operations) to exploit hardware efficiency
I This also makes your code cleaner and more readable!
Intro ML (UofT) CSC311-Lec1 22 / 53
Preliminaries and Nearest Neighbourhood Methods
Intro ML (UofT) CSC311-Lec1 23 / 53
Introduction
Today (and for the next 5-6 lectures) we focus on supervised learning.
This means we are given a training set consisting of inputs and
corresponding labels, e.g.
Task Inputs Labels
object recognition image object category
image captioning image caption
document classification text document category
speech-to-text audio waveform text
…
…
…
Intro ML (UofT) CSC311-Lec1 24 / 53
Supervised Learning Example
What an image looks like to the computer:
[Image credit: Andrej Karpathy]
Intro ML (UofT) CSC311-Lec1 25 / 53
Representing the Input
Machine learning algorithms need to handle lots of types of data:
images, text, audio waveforms, credit card transactions, etc.
Common strategy: represent the input as an input vector in Rd
I Representation = mapping to another space that is easy to
manipulate
I Vectors are a great representation since we can do linear algebra
Intro ML (UofT) CSC311-Lec1 26 / 53
Supervised Learning Setup
Mathematically, our training set consists of a collection of pairs of an
input vector x ∈ Rd and its corresponding target, or label, t
I Regression: t is a real number (e.g. stock price)
I Classification: t is an element of a discrete set {1, . . . , C}
I These days, t is often a highly structured object (e.g. image)
Denote the training set {(x(1), t(1)), . . . , (x(N), t(N))}
I Note: these superscripts have nothing to do with exponentiation!
Intro ML (UofT) CSC311-Lec1 27 / 53
Input Vectors
Can use raw pixels:
Can do much better if you compute a vector of meaningful features.
Intro ML (UofT) CSC311-Lec1 28 / 53
Nearest Neighbour
Our first supervised learning model!
Suppose we’re given a novel input vector x we’d like to classify.
The idea: find the nearest input vector to x in the training set and
copy its label.
Intro ML (UofT) CSC311-Lec1 29 / 53
Nearest Neighbor: Example
Example: We would like to classify which Beetle is in this image x:
We find the closest image x(i) in the training set, and look at its label.
But how do we determine what is the “closest” image?
Intro ML (UofT) CSC311-Lec1 30 / 53
Nearest Neighbor: Example
Example: We would like to classify which Beetle is in this image x:
We find the closest image x(i) in the training set, and look at its label.
But how do we determine what is the “closest” image?
Intro ML (UofT) CSC311-Lec1 30 / 53
Nearest Neighbor: Example
Example: We would like to classify which Beetle is in this image x:
We find the closest image x(i) in the training set, and look at its label.
But how do we determine what is the “closest” image?
Intro ML (UofT) CSC311-Lec1 30 / 53
Nearest Neighbor: Example
Example: We would like to classify which Beetle is in this image x:
We find the closest image x(i) in the training set, and look at its label.
But how do we determine what is the “closest” image?
Intro ML (UofT) CSC311-Lec1 30 / 53
The 1-Nearest Neighbor Algorithm
Can formalize “nearest” in terms of Euclidean distance
||x(a) − x(b)||2 =
√√√√ d∑
j=1
(x
(a)
j − x
(b)
j )
2
Algorithm:
1. Given
I input x (that we would like to classify),
I training set (x(1), t(1)), (x(2), t(2)), . . . .
Find an example (x(∗), t(∗)) in the training set closest to x,
i.e. that minimizes distance(x(∗),x).
2. Output y = t(∗)
Note: we do not need to compute the square root. Why?
Intro ML (UofT) CSC311-Lec1 31 / 53
Nearest Neighbors: Decision Boundaries
We can visualize the behaviour in the classification setting using a Voronoi
diagram.
Here, our data set x ∈ R2 (e.g. a 2-pixel image), and there are two possible
labels t either red or blue.
Intro ML (UofT) CSC311-Lec1 32 / 53
Nearest Neighbors: Decision Boundaries
Decision boundary: the boundary between regions of input space assigned to
different categories.
Intro ML (UofT) CSC311-Lec1 33 / 53
Nearest Neighbors: Decision Boundaries
Example: Decision boundary in a 3D data set.
Intro ML (UofT) CSC311-Lec1 34 / 53
Estimating Model Accuracy
Typically, we want to have some sense of how well our model performs.
For example, we want to use this model in a real-world application, and
want to know how reliable the predictions will be.
There are several ways to measure model performance. The simplest is
the accuracy mesaure:
Accuracy =
Number of Correct Predictions
Total Number of Predictions
Intro ML (UofT) CSC311-Lec1 35 / 53
The Test Set
What might be an issue with computing the accuracy using the training set?
We will always get perfect performance in our 1-nearest neighbor model
(a data point is always closest to itself!)
Not a realistic representation of how our model will perform on unseen
data.
Solution: set aside a test set (from our set of labeled data) so we can estimate
how well our model will generalize.
Intro ML (UofT) CSC311-Lec1 36 / 53
The Test Set
What might be an issue with computing the accuracy using the training set?
We will always get perfect performance in our 1-nearest neighbor model
(a data point is always closest to itself!)
Not a realistic representation of how our model will perform on unseen
data.
Solution: set aside a test set (from our set of labeled data) so we can estimate
how well our model will generalize.
Intro ML (UofT) CSC311-Lec1 36 / 53
The Test Set
What might be an issue with computing the accuracy using the training set?
We will always get perfect performance in our 1-nearest neighbor model
(a data point is always closest to itself!)
Not a realistic representation of how our model will perform on unseen
data.
Solution: set aside a test set (from our set of labeled data) so we can estimate
how well our model will generalize.
Intro ML (UofT) CSC311-Lec1 36 / 53
Nearest Neighbors
[Pic by Olga Veksler]
Nearest neighbors sensitive to noise or mis-labeled data (“class noise”).
Solution?
Smooth by having k nearest neighbors vote
Intro ML (UofT) CSC311-Lec1 37 / 53
k-Nearest Neighbors
[Pic by Olga Veksler]
Nearest neighbors sensitive to noise or mis-labeled data (“class noise”).
Solution?
Smooth by having k nearest neighbors vote
Intro ML (UofT) CSC311-Lec1 37 / 53
the K-Nearest neighbors Algorithm
Algorithm (kNN):
1. Find k examples {x(i), t(i)} closest to the test instance x
2. Classification output is majority class
y = argmax
t(z)
k∑
i=1
I{t(z) = t(i)}
I{statement} is the identity function and is equal to one whenever the
statement is true. We could also write this as δ(t(z), t(i)) with δ(a, b) = 1 if
a = b, 0 otherwise. I{1}.
Intro ML (UofT) CSC311-Lec1 38 / 53
K-Nearest neighbors (k=1)
[Image credit: ”The Elements of Statistical Learning”]
Intro ML (UofT) CSC311-Lec1 39 / 53
K-Nearest neighbors (k=15)
[Image credit: ”The Elements of Statistical Learning”]
Intro ML (UofT) CSC311-Lec1 40 / 53
How to choose k?
Small k
I Good at capturing fine-grained patterns
I May overfit, i.e. be sensitive to random idiosyncrasies in the
training data
Large k
I Makes stable predictions by averaging over lots of examples
I May underfit, i.e. fail to capture important regularities
Balancing k:
I The optimal choice of k depends on the number of data points n.
I Nice theoretical properties if k →∞ and k
n
→ 0.
I Rule of thumb: Choose k = n
2
2+d .
We will explain another way to choose k using our data.
Intro ML (UofT) CSC311-Lec1 41 / 53
Underfitting and Overfitting
[Image credit: https://cambridgecoding.wordpress.com/2016/03/24/
misleading-modelling-overfitting-cross-validation-and-the-bias-variance-trade-off/]
Intro ML (UofT) CSC311-Lec1 42 / 53
Misleading modelling: overfitting, cross-validation, and the bias-variance trade-off
Misleading modelling: overfitting, cross-validation, and the bias-variance trade-off
Choice of k: Generalization
We would like our algorithm to generalize to data it hasn’t seen before.
We can measure the generalization error (error rate on new examples)
using the test set.
[Image credit: ”The Elements of Statistical Learning”]
Intro ML (UofT) CSC311-Lec1 43 / 53
The problem with using the test set
Here, k is an example of a hyperparameter, something we can’t fit as part of
the learning algorithm itself.
What are potential issues with using the test set to choose k?
The test set is meant to simulate unseen data, in order to estimate how
well the model generalizes.
We therefore shouldn’t use the test set to make any decisions about any
aspects of the model!
Keep the test set unseen, so that we can estimate model accuracy.
What can we do instead?
Intro ML (UofT) CSC311-Lec1 44 / 53
The problem with using the test set
Here, k is an example of a hyperparameter, something we can’t fit as part of
the learning algorithm itself.
What are potential issues with using the test set to choose k?
The test set is meant to simulate unseen data, in order to estimate how
well the model generalizes.
We therefore shouldn’t use the test set to make any decisions about any
aspects of the model!
Keep the test set unseen, so that we can estimate model accuracy.
What can we do instead?
Intro ML (UofT) CSC311-Lec1 44 / 53
The problem with using the test set
Here, k is an example of a hyperparameter, something we can’t fit as part of
the learning algorithm itself.
What are potential issues with using the test set to choose k?
The test set is meant to simulate unseen data, in order to estimate how
well the model generalizes.
We therefore shouldn’t use the test set to make any decisions about any
aspects of the model!
Keep the test set unseen, so that we can estimate model accuracy.
What can we do instead?
Intro ML (UofT) CSC311-Lec1 44 / 53
Validation and Test Sets
Partition the data into three subsets
Training set: used to train the model
Validation set: used to tune hyperparameters
Test set: used to evaluate final performance once, after hyperparameters
are chosen
Intro ML (UofT) CSC311-Lec1 45 / 53
Pitfalls: Normalization
Nearest neighbors can be sensitive to the ranges of different features.
Often, the units are arbitrary:
Simple fix: normalize each dimension to be zero mean and unit variance.
I.e., compute the mean µj and standard deviation σj , and take
x̃j =
xj − µj
σj
Caution: depending on the problem, the scale might be important!
Intro ML (UofT) CSC311-Lec1 46 / 53
Pitfalls: Computational Cost
Number of computations at training time: 0
Number of computations at test time, per query (näıve algorithm)
I Calculuate D-dimensional Euclidean distances with N data points:
O(ND)
I Sort the distances: O(N logN)
This must be done for each query, which is very expensive by the
standards of a learning algorithm!
Need to store the entire dataset in memory!
Tons of work has gone into algorithms and data structures for efficient
nearest neighbors with high dimensions and/or large datasets.
Intro ML (UofT) CSC311-Lec1 47 / 53
Example: Digit Classification
Decent performance when lots of data
[Slide credit: D. Claus]Intro ML (UofT) CSC311-Lec1 48 / 53
Example: Digit Classification
KNN can perform a lot better with a good similarity measure.
Example: shape contexts for object recognition. In order to achieve
invariance to image transformations, they tried to warp one image to
match the other image.
I Distance measure: average distance between corresponding points
on warped images
Achieved 0.63% error on MNIST, compared with 3% for Euclidean KNN.
Competitive with conv nets at the time, but required careful engineering.
[Belongie, Malik, and Puzicha, 2002. Shape matching and object recognition using shape
contexts.]
Intro ML (UofT) CSC311-Lec1 49 / 53
Example: 80 Million Tiny Images
80 Million Tiny Images was
the first extremely large
image dataset. It consisted of
color images scaled down to
32× 32.
With a large dataset, you can
find much better semantic
matches, and KNN can do
some surprising things.
Note: this required a carefully
chosen similarity metric.
[Torralba, Fergus, and Freeman, 2007. 80 Million Tiny Images.]
Intro ML (UofT) CSC311-Lec1 50 / 53
Example: Colorizing 80 Million Tiny Images
[Torralba, Fergus, and Freeman, 2007. 80 Million Tiny Images.]
Intro ML (UofT) CSC311-Lec1 51 / 53
Conclusions
Simple algorithm that does all its work at test time — in a sense,
no learning!
Can be used for regression too, which we encounter later.
Can control the complexity by varying k
Next time: decision trees, another approach to regression and
classification
Intro ML (UofT) CSC311-Lec1 52 / 53
Questions?
?
Intro ML (UofT) CSC311-Lec1 53 / 53
Administration
Intro
Examples
Formulation
Introduction