CS代考 CSC 311: Introduction to Machine Learning

CSC 311: Introduction to Machine Learning
Lecture 1 – Introduction
Anthony Bonner &
Based on slides by Amir-massoud Farahmand & Emad A.M. Andrews
Intro ML (UofT) CSC311-Lec1 1 / 53

This course
Broad introduction to machine learning
􏰀 First half: algorithms and principles for supervised learning
􏰀 nearest neighbors, decision trees, ensembles, linear regression, logistic regression
􏰀 Unsupervised learning: PCA, K-means, mixture models
􏰀 Basics of reinforcement learning
Coursework is aimed at advanced undergrads. We will use multivariate calculus, probability, and linear algebra.
Intro ML (UofT) CSC311-Lec1

Do I have the appropriate background?
This is where we use all the math that you learned, and then some!
Linear algebra (MAT223/240): vector/matrix manipulations, properties.
Calculus (MAT232): partial derivatives/gradient
Probability, Statistics (STA256): common distributions, Bayes Rule; expectation, variance, covariance, median; maximum likelihood.
Geometry and geometric intuition.
Recommend preparation: Numerical Methods (CSC338) or Computational Statistics
Mathematical maturity will be assumed: you should be able to write and understand rigorous proofs.
Intro ML (UofT) CSC311-Lec1

Course Information
Recommended readings will be given for each lecture. But the following will be useful throughout the course:
, , and , The Elements of Statistical Learning, Second Edition, 2009.
. Bishop, Pattern Recognition and Machine Learning, 2006
. Sutton and . Barto, Reinforcement Learning: An Introduction, Second Edition, 2018.
, and , Deep Learning, 2016 , Machine Learning: A Probabilistic Perspective, 2012.
, , , and , An Introduction to Statistical Learning, 2017.
-Shwartz and -David, Understanding Machine Learning: From Theory to Algorithms, 2014.
Kay, Information Theory, Inference, and Learning Algorithms, 2003.
There are lots of freely available, high-quality ML resources.
Intro ML (UofT) CSC311-Lec1 4 / 53

Requirements and Marking
Three (3) assignments
􏰀 Combination of pen & paper derivations and programming exercises
􏰀 Due Oct 8, Nov 8, Dec 7 at 11:59pm
􏰀 October 25, 8–9pm.
Final Exam
􏰀 Two hours
􏰀 Date and time TBA
Intro ML (UofT) CSC311-Lec1 5 / 53

More on Assignments
Collaboration on the assignments is not allowed. Each student is responsible for their own work. Discussion of assignments should be limited to clarification of the handout itself, and should not involve any sharing of pseudocode or code or simulation results. Violation of this policy is grounds for a semester grade of F, in accordance with university regulations.
The schedule of assignments will be posted on the course webpage.
Extensions will be granted only in special situations, and you will need a Student Medical Certificate or a written request approved by the course coordinator at least one week before the due date.
Intro ML (UofT) CSC311-Lec1 6 / 53

Related Courses and a Note on the Content
More advanced ML courses such as CSC413 (Neural Networks and Deep Learning) and CSC412 (Probabilistic Learning and Reasoning) both build upon the material in this course.
If you’ve already taken an applied statistics course, there will be some overlap.
Warning: This is not a very difficult course, but it has a lot of content. To be successful, you need to work hard. For example, don’t start working on your homework assignments just two days before the deadline.
Intro ML (UofT) CSC311-Lec1 7 / 53

What is learning?
“The activity or process of gaining knowledge or skill by studying, practicing, being taught, or experiencing something.”
dictionary
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

Intro ML (UofT) CSC311-Lec1 8 / 53

What is machine learning?
For many problems, it is difficult to program the correct behaviour by hand
􏰀 recognizing people and objects
􏰀 understanding human speech
Machine learning approach: program an algorithm to automatically learn from data, or from experience Why might you want to use a learning algorithm?
􏰀 hard to code up a solution by hand (e.g. vision, speech)
􏰀 system needs to adapt to a changing environment (e.g. spam
detection)
􏰀 want the system to perform better than the human programmers
􏰀 privacy/fairness (e.g. ranking search results)
Intro ML (UofT) CSC311-Lec1

What is machine learning?
It is similar to statistics…
􏰀 Both fields try to uncover patterns in data
􏰀 Both fields draw heavily on calculus, probability, and linear algebra,
and share many of the same core algorithms But it is not statistics!
􏰀 Stats is more concerned with helping scientists and policymakers draw good conclusions; ML is more concerned with building autonomous agents
􏰀 Stats puts more emphasis on interpretability and mathematical rigor; ML puts more emphasis on predictive performance, scalability, and autonomy
Intro ML (UofT) CSC311-Lec1

Relations to AI
Nowadays, “machine learning” is often brought up with “artificial intelligence” (AI)
AI does not often imply a learning based system
􏰀 Symbolic reasoning
􏰀 Rule based system
􏰀 Tree search 􏰀 etc.
Learning based system → learned based on the data → more flexibility, good at solving pattern recognition problems.
Intro ML (UofT) CSC311-Lec1

Relations to human learning
Human learning is:
􏰀 Very data efficient
􏰀 An entire multitasking system (vision, language, motor control, etc.)
􏰀 Takes at least a few years 🙂
For serving specific purposes, machine learning doesn’t have to look like human learning in the end.
It may borrow ideas from biological systems, e.g., neural networks. It may perform better or worse than humans.
Intro ML (UofT) CSC311-Lec1 12 / 53

What is machine learning?
Types of machine learning
􏰀 Supervised learning: access to labeled examples of the correct
􏰀 Reinforcement learning: learning system (agent) interacts with
the world and learn to maximize a reward signal
􏰀 Unsupervised learning: no labeled examples – instead, looking
for “interesting” patterns in the data
Intro ML (UofT) CSC311-Lec1 13 / 53

History of machine learning
ML conferences selling out like Beyonce tickets.
Source: medium.com/syncedreview/
Intro ML (UofT) CSC311-Lec1 14 / 53

History of machine learning
ML conferences selling out like Beyonce tickets.
Intro ML (UofT) CSC311-Lec1 14 / 53

Computer vision: Object detection, semantic segmentation, pose estimation, and almost every other task is done with ML.
Instance segmentation –
Intro ML (UofT)
CSC311-Lec1 15 / 53

Speech: Speech to text, personal assistants, speaker identification…
Intro ML (UofT) CSC311-Lec1 16 / 53

NLP: Machine translation, sentiment analysis, topic modeling, spam filtering.
Intro ML (UofT) CSC311-Lec1 17 / 53

Playing Games
Intro ML (UofT) CSC311-Lec1 18 / 53

E-commerce & Recommender Systems : Amazon, Netflix, …
Intro ML (UofT) CSC311-Lec1 19 / 53

Why this class?
Why not jump straight to CSC412/413, and learn neural nets first?
The principles you learn in this course will be essential to really understand neural nets.
The techniques in this course are still the first things to try for a new ML problem.
􏰀 For example, you should try applying logistic regression before building a deep neural net!
There is a whole world of probabilistic graphical models.
Intro ML (UofT) CSC311-Lec1 20 / 53

Why this class?
2017 Kaggle survey of data science and ML practitioners: what data science methods do you use at work?
Intro ML (UofT) CSC311-Lec1 21 / 53

Implementing machine learning systems
You will often need to derive an algorithm (with pencil and paper), and then translate the math into code.
Array processing (NumPy)
􏰀 vectorize computations (express them in terms of matrix/vector operations) to exploit hardware efficiency
􏰀 This also makes your code cleaner and more readable!
Intro ML (UofT) CSC311-Lec1 22 / 53

Preliminaries and Nearest Neighbourhood Methods
Intro ML (UofT) CSC311-Lec1 23 / 53

Introduction
Today (and for the next 5-6 lectures) we focus on supervised learning. This means we are given a training set consisting of inputs and
corresponding labels, e.g.
object recognition image captioning document classification speech-to-text
Inputs image image
text audio waveform
Labels object category caption document category text
Intro ML (UofT)
CSC311-Lec1

Supervised Learning Example
What an image looks like to the computer:
[Image credit: ]
Intro ML (UofT) CSC311-Lec1 25 / 53

Representing the Input
Machine learning algorithms need to handle lots of types of data: images, text, audio waveforms, credit card transactions, etc.
Common strategy: represent the input as an input vector in Rd
􏰀 Representation = mapping to another space that is easy to manipulate
􏰀 Vectors are a great representation since we can do linear algebra
Intro ML (UofT) CSC311-Lec1

Supervised Learning Setup
Mathematically, our training set consists of a collection of pairs of an input vector x ∈ Rd and its corresponding target, or label, t
􏰀 Regression: t is a real number (e.g. stock price)
􏰀 Classification: t is an element of a discrete set {1, . . . , C }
􏰀 These days, t is often a highly structured object (e.g. image)
Denote the training set {(x(1), t(1)), . . . , (x(N), t(N))}
􏰀 Note: these superscripts have nothing to do with exponentiation!
Intro ML (UofT) CSC311-Lec1 27 / 53

Input Vectors
Can use raw pixels:
Can do much better if you compute a vector of meaningful features.
Intro ML (UofT) CSC311-Lec1 28 / 53

Nearest Neighbour
Our first supervised learning model!
Suppose we’re given a novel input vector x we’d like to classify.
The idea: find the nearest input vector to x in the training set and copy its label.
Intro ML (UofT) CSC311-Lec1 29 / 53

Nearest Neighbor: Example
Example: We would like to classify which Beetle is in this image x:
We find the closest image x(i) in the training set, and look at its label.
Intro ML (UofT) CSC311-Lec1 30 / 53

Nearest Neighbor: Example
Example: We would like to classify which Beetle is in this image x:
We find the closest image x(i) in the training set, and look at its label.
Intro ML (UofT) CSC311-Lec1 30 / 53

Nearest Neighbor: Example
Example: We would like to classify which Beetle is in this image x:
We find the closest image x(i) in the training set, and look at its label.
Intro ML (UofT) CSC311-Lec1 30 / 53

Nearest Neighbor: Example
Example: We would like to classify which Beetle is in this image x:
We find the closest image x(i) in the training set, and look at its label.
But how do we determine what is the “closest” image?
Intro ML (UofT) CSC311-Lec1 30 / 53

The 1-Nearest Neighbor Algorithm
Can formalize “nearest” in terms of Euclidean distance
||x(a) − x(b)||2 = 􏰤􏰣􏰋(x(a) − x(b))2
Algorithm:
􏰀 input x (that we would like to classify),
􏰀 training set (x(1), t(1)), (x(2), t(2)), . . . .
Find an example (x(∗),t(∗)) in the training set closest to x,
i.e. that minimizes distance(x(∗), x). 2. Output y = t(∗)
Note: we do not need to compute the square root. Why?
Intro ML (UofT) CSC311-Lec1

Nearest Neighbors: Decision Boundaries
We can visualize the behaviour in the classification setting using a Voronoi diagram.
Here, our data set x ∈ R2 (e.g. a 2-pixel image), and there are two possible labels t either red or blue.
Intro ML (UofT) CSC311-Lec1 32 / 53

Nearest Neighbors: Decision Boundaries
Decision boundary: the boundary between regions of input space assigned to different categories.
Intro ML (UofT) CSC311-Lec1 33 / 53

Nearest Neighbors: Decision Boundaries
Example: Decision boundary in a 3D data set.
Intro ML (UofT) CSC311-Lec1 34 / 53

Estimating Model Accuracy
Typically, we want to have some sense of how well our model performs.
For example, we want to use this model in a real-world application, and want to know how reliable the predictions will be.
There are several ways to measure model performance. The simplest is the accuracy mesaure:
Accuracy = Number of Correct Predictions Total Number of Predictions
Intro ML (UofT) CSC311-Lec1 35 / 53

The Test Set
What might be an issue with computing the accuracy using the training set?
Intro ML (UofT) CSC311-Lec1 36 / 53

The Test Set
What might be an issue with computing the accuracy using the training set? We will always get perfect performance in our 1-nearest neighbor model
(a data point is always closest to itself!)
Not a realistic representation of how our model will perform on unseen data.
Intro ML (UofT) CSC311-Lec1 36 / 53

The Test Set
What might be an issue with computing the accuracy using the training set? We will always get perfect performance in our 1-nearest neighbor model
(a data point is always closest to itself!)
Not a realistic representation of how our model will perform on unseen data.
Solution: set aside a test set (from our set of labeled data) so we can estimate how well our model will generalize.
Intro ML (UofT) CSC311-Lec1 36 / 53

Nearest Neighbors
[Pic by ]
Nearest neighbors sensitive to noise or mis-labeled data (“class noise”). Solution?
Intro ML (UofT) CSC311-Lec1 37 / 53

k-Nearest Neighbors
[Pic by ]
Nearest neighbors sensitive to noise or mis-labeled data (“class noise”). Solution?
Smooth by having k nearest neighbors vote
Intro ML (UofT) CSC311-Lec1 37 / 53

the K-Nearest neighbors Algorithm
Algorithm (kNN):
1. Find k examples {x(i),t(i)} closest to the test instance x
2. Classification output is majority class k
y = argmax 􏰋 I{t(z) = t(i)}
I{statement} is the identity function and is equal to one whenever the statement is true. We could also write this as δ(t(z), t(i)) with δ(a, b) = 1 if a = b, 0 otherwise. I{1}.
Intro ML (UofT) CSC311-Lec1 38 / 53

K-Nearest neighbors (k=1)
[Image credit: ”The Elements of Statistical Learning”]
Intro ML (UofT) CSC311-Lec1 39 / 53

K-Nearest neighbors (k=15)
[Image credit: ”The Elements of Statistical Learning”]
Intro ML (UofT) CSC311-Lec1 40 / 53

How to choose k? Small k
􏰀 Good at capturing fine-grained patterns
􏰀 May overfit, i.e. be sensitive to random idiosyncrasies in the
training data Large k
􏰀 Makes stable predictions by averaging over lots of examples
􏰀 May underfit, i.e. fail to capture important regularities
Balancing k:
􏰀 The optimal choice of k depends on the number of data points n.
􏰀 Nice theoretical properties if k → ∞ and nk → 0. 2
􏰀 Rule of thumb: Choose k = n2+d .
We will explain another way to choose k using our data.
Intro ML (UofT) CSC311-Lec1 41 / 53

Underfitting and Overfitting
[Image credit: https://cambridgecoding.wordpress.com/2016/03/24/
misleading- modelling- overfitting- cross- validation- and- the- bias- variance- trade- off/
Intro ML (UofT) CSC311-Lec1 42 / 53

Choice of k: Generalization
We would like our algorithm to generalize to data it hasn’t seen before.
We can measure the generalization error (error rate on new examples) using the test set.
[Image credit: ”The Elements of Statistical Learning”]
Intro ML (UofT) CSC311-Lec1 43 / 53

The problem with using the test set
Here, k is an example of a hyperparameter, something we can’t fit as part of the learning algorithm itself.
What are potential issues with using the test set to choose k?
Intro ML (UofT) CSC311-Lec1 44 / 53

The problem with using the test set
Here, k is an example of a hyperparameter, something we can’t fit as part of the learning algorithm itself.
What are potential issues with using the test set to choose k?
The test set is meant to simulate unseen data, in order to estimate how
well the model generalizes.
We therefore shouldn’t use the test set to make any decisions about any aspects of the model!
Keep the test set unseen, so that we can estimate model accuracy.
Intro ML (UofT) CSC311-Lec1 44 / 53

The problem with using the test set
Here, k is an example of a hyperparameter, something we can’t fit as part of the learning algorithm itself.
What are potential issues with using the test set to choose k?
The test set is meant to simulate unseen data, in order to estimate how
well the model generalizes.
We therefore shouldn’t use the test set to make any decisions about any aspects of the model!
Keep the test set unseen, so that we can estimate model accuracy. What can we do instead?
Intro ML (UofT) CSC311-Lec1 44 / 53

Validation and Test Sets
Partition the data into three subsets
Training set: used to train the model
Validation set: used to tune hyperparameters
Test set: used to evaluate final performance once, after hyperparameters are chosen
Intro ML (UofT) CSC311-Lec1 45 / 53

Pitfalls: Normalization
Nearest neighbors can be sensitive to the ranges of different features. Often, the units are arbitrary:
Simple fix: normalize each dimension to be zero mean and unit variance. I.e., compute the mean μj and standard deviation σj, and take
x ̃ j = x j − μ j σj
Caution: depending on the problem, the scale might be important!
Intro ML (UofT) CSC311-Lec1 46 / 53

Pitfalls: Computational Cost
Number of computations at training time: 0
Number of computations at test time, per query (na ̈ıve algorithm)
􏰀 Calculuate D-dimensional Euclidean distances with N data points: O(ND)
􏰀 Sort the distances: O(N log N )
This must be done for each query, which is very expensive by the
standards of a learning algorithm!
Need to store the entire dataset in memory!
Tons of work has gone into algorithms and data structures for efficient nearest neighbors with high dimensions and/or large datasets.
Intro ML (UofT) CSC311-Lec1 47 / 53

Example: Digit Classification
Decent performance when lots of data
[SlideIcnrterodiMt:LD(.UCoflTau)s] CSC311-Lec1 48/53

Example: Digit Classification
KNN can perform a lot better with a good similarity measure.
Example: shape contexts for object recognition. In order to achieve invariance to image transformations, they tried to warp one image to match the other image.
􏰀 Distance measure: average distance between corresponding points on warped images
Achieved 0.63% error on MNIST, compared with 3% for Euclidean KNN. Competitive with conv nets at the time, but required careful engineering.
[Belongie, Malik, and Puzicha, 2002. Shape matching and object recognition using shape contexts.]
Intro ML (UofT) CSC311-Lec1 49 / 53

Example: 80 Million Tiny Images
80 Million Tiny Images was the first extremely large image dataset. It consisted of color images scaled down to 32×32.
With a large dataset, you can find much better semantic matches, and KNN can do some surprising things.
Note: this required a carefully chosen similarity metric.
[Torralba, Fergus, and Freeman, 2007. 80 Million Tiny Images.]
Intro ML (UofT) CSC311-Lec1 50 / 53

Example: Colorizing 80 Million Tiny Images
[Torralba, Fergus, and Freeman, 2007. 80 Million Tiny Images.]
Intro ML (UofT) CSC311-Lec1 51 / 53

Conclusions
Simple algorithm that does all its work at test time — in a sense, no learning!
Can be used for regression too, which we encounter later. Can control the complexity by varying k
Next time: decision trees, another approach to regression and classification
Intro ML (UofT) CSC311-Lec1 52 / 53

Questions?
Intro ML (UofT) CSC311-Lec1 53 / 53