
CSC 311: Introduction to Machine Learning
Lecture 1 – Introduction

Intro ML (UofT) CSC311-Lec1 1 / 53

This course

Broad introduction to machine learning
I First half: algorithms and principles for supervised learning
  I nearest neighbors, decision trees, ensembles, linear regression, logistic regression
I Unsupervised learning: PCA, K-means, mixture models
I Basics of reinforcement learning

Coursework is aimed at advanced undergrads. We will use
multivariate calculus, probability, and linear algebra.

Intro ML (UofT) CSC311-Lec1 2 / 53

Do I have the appropriate background?

This is where we use all the math that you learned, and then some!

Linear algebra (MAT223/240): vector/matrix manipulations,
properties.

Calculus (MAT232): partial derivatives/gradient

Probability, Statistics (STA256): common distributions, Bayes
Rule; expectation, variance, covariance, median; maximum
likelihood.

Geometry and geometric intuition.

Recommended preparation: Numerical Methods (CSC338) or Computational Statistics

Mathematical maturity will be assumed: you should be able to write
and understand rigorous proofs.

Intro ML (UofT) CSC311-Lec1 3 / 53

Course Information

Recommended readings will be given for each lecture. But the
following will be useful throughout the course:

Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of
Statistical Learning, Second Edition, 2009.

Christopher M. Bishop, Pattern Recognition and Machine Learning, 2006

Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An
Introduction, Second Edition, 2018.

Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning, 2016

Kevin Murphy, Machine Learning: A Probabilistic Perspective, 2012.

Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An
Introduction to Statistical Learning, 2017.

Shai Shalev-Shwartz and Shai Ben-David, Understanding Machine Learning:
From Theory to Algorithms, 2014.

David MacKay, Information Theory, Inference, and Learning Algorithms, 2003.

There are lots of freely available, high-quality ML resources.

Intro ML (UofT) CSC311-Lec1 4 / 53

https://web.stanford.edu/~hastie/ElemStatLearn/
https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf
http://incompleteideas.net/book/the-book-2nd.html
http://www.deeplearningbook.org
http://faculty.marshall.usc.edu/gareth-james/ISL/
https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/index.html
http://www.inference.org.uk/mackay/itila/book.html

Requirements and Marking

Three (3) assignments
I Combination of pen & paper derivations and programming exercises
I Due Oct 8, Nov 8, Dec 7 at 11:59pm

Midterm
I October 25, 8–9pm.

Final Exam
I Two hours
I Date and time TBA

Intro ML (UofT) CSC311-Lec1 5 / 53

More on Assignments

Collaboration on the assignments is not allowed. Each student is responsible for their own
work. Discussion of assignments should be limited to clarification of the handout itself, and
should not involve any sharing of pseudocode or code or simulation results. Violation of this
policy is grounds for a semester grade of F, in accordance with university regulations.

The schedule of assignments will be posted on the course webpage.

Extensions will be granted only in special situations, and you will need a Student Medical
Certificate or a written request approved by the course coordinator at least one week before
the due date.

Intro ML (UofT) CSC311-Lec1 6 / 53

Related Courses and a Note on the Content

More advanced ML courses such as CSC413 (Neural Networks
and Deep Learning) and CSC412 (Probabilistic Learning and
Reasoning) both build upon the material in this course.

If you’ve already taken an applied statistics course, there will be
some overlap.

Warning: This is not a very difficult course, but it has a lot of
content. To be successful, you need to work hard. For example,
don’t start working on your homework assignments just two days
before the deadline.

Intro ML (UofT) CSC311-Lec1 7 / 53

What is learning?

“The activity or process of gaining knowledge or skill by studying,
practicing, being taught, or experiencing something.”

Merriam Webster dictionary

“A computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P, if its performance
at tasks in T, as measured by P, improves with experience E.”

Tom Mitchell

Intro ML (UofT) CSC311-Lec1 8 / 53

What is machine learning?

For many problems, it is difficult to program the correct behaviour
by hand

I recognizing people and objects
I understanding human speech

Machine learning approach: program an algorithm to
automatically learn from data, or from experience

Why might you want to use a learning algorithm?
I hard to code up a solution by hand (e.g. vision, speech)
I system needs to adapt to a changing environment (e.g. spam detection)
I want the system to perform better than the human programmers
I privacy/fairness (e.g. ranking search results)

Intro ML (UofT) CSC311-Lec1 9 / 53

What is machine learning?

It is similar to statistics…
I Both fields try to uncover patterns in data
I Both fields draw heavily on calculus, probability, and linear algebra, and share many of the same core algorithms

But it is not statistics!
I Stats is more concerned with helping scientists and policymakers draw good conclusions; ML is more concerned with building autonomous agents

I Stats puts more emphasis on interpretability and mathematical
rigor; ML puts more emphasis on predictive performance,
scalability, and autonomy

Intro ML (UofT) CSC311-Lec1 10 / 53

Relations to AI

Nowadays, “machine learning” is often mentioned alongside “artificial
intelligence” (AI)

AI does not necessarily imply a learning-based system:
I Symbolic reasoning
I Rule based system
I Tree search
I etc.

Learning-based system → learned from data → more flexibility, good at solving pattern recognition problems.

Intro ML (UofT) CSC311-Lec1 11 / 53

Relations to human learning

Human learning is:
I Very data efficient
I An entire multitasking system (vision, language, motor control, etc.)
I Takes at least a few years 🙂

Since machine learning serves specific purposes, it doesn’t have to look like human learning in the end.

It may borrow ideas from biological systems, e.g., neural networks.

It may perform better or worse than humans.

Intro ML (UofT) CSC311-Lec1 12 / 53

What is machine learning?

Types of machine learning
I Supervised learning: access to labeled examples of the correct behaviour
I Reinforcement learning: learning system (agent) interacts with the world and learns to maximize a reward signal
I Unsupervised learning: no labeled examples – instead, looking for “interesting” patterns in the data

Intro ML (UofT) CSC311-Lec1 13 / 53

History of machine learning

ML conferences selling out like Beyonce tickets.

Source: medium.com/syncedreview/
Intro ML (UofT) CSC311-Lec1 14 / 53


Computer vision: Object detection, semantic segmentation, pose
estimation, and almost every other task is done with ML.

Instance segmentation – Link

Intro ML (UofT) CSC311-Lec1 15 / 53

https://drive.google.com/file/d/0Byy_mRDnLTHIYzVHN3lITzFhUkU/view

Speech: Speech to text, personal assistants, speaker identification…

Intro ML (UofT) CSC311-Lec1 16 / 53

NLP: Machine translation, sentiment analysis, topic modeling, spam
filtering.

Intro ML (UofT) CSC311-Lec1 17 / 53

Playing Games

DOTA2 – Link

Intro ML (UofT) CSC311-Lec1 18 / 53

E-commerce & Recommender Systems : Amazon,
Netflix, …

Intro ML (UofT) CSC311-Lec1 19 / 53

Why this class?

Why not jump straight to CSC412/413, and learn neural nets first?

The principles you learn in this course will be essential to really
understand neural nets.

The techniques in this course are still the first things to try for a
new ML problem.

I For example, you should try applying logistic regression before
building a deep neural net!

There is a whole world of probabilistic graphical models.

Intro ML (UofT) CSC311-Lec1 20 / 53

Why this class?

2017 Kaggle survey of data science and ML practitioners: what data
science methods do you use at work?

Intro ML (UofT) CSC311-Lec1 21 / 53

Implementing machine learning systems

You will often need to derive an algorithm (with pencil and
paper), and then translate the math into code.

Array processing (NumPy)
I vectorize computations (express them in terms of matrix/vector operations) to exploit hardware efficiency (see the sketch below)
I This also makes your code cleaner and more readable!
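To make the vectorization point concrete, here is a minimal sketch (my own illustrative example, not from the slides; the array sizes are made up) comparing a Python loop with the equivalent NumPy matrix-vector product:

    import numpy as np

    N, d = 1000, 20               # hypothetical sizes for the example
    X = np.random.randn(N, d)     # N input vectors, stacked as rows
    w = np.random.randn(d)

    # Loop version: one inner product at a time (slow in Python)
    y_loop = np.empty(N)
    for i in range(N):
        y_loop[i] = sum(X[i, j] * w[j] for j in range(d))

    # Vectorized version: a single matrix-vector product
    y_vec = X @ w

    assert np.allclose(y_loop, y_vec)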

Intro ML (UofT) CSC311-Lec1 22 / 53

Preliminaries and Nearest Neighbour Methods

Intro ML (UofT) CSC311-Lec1 23 / 53

Introduction

Today (and for the next 5-6 lectures) we focus on supervised learning.

This means we are given a training set consisting of inputs and
corresponding labels, e.g.

Task                      Inputs            Labels
object recognition        image             object category
image captioning          image             caption
document classification   text              document category
speech-to-text            audio waveform    text

Intro ML (UofT) CSC311-Lec1 24 / 53

Supervised Learning Example

What an image looks like to the computer:

[Image credit: Andrej Karpathy]

Intro ML (UofT) CSC311-Lec1 25 / 53

Representing the Input

Machine learning algorithms need to handle lots of types of data:
images, text, audio waveforms, credit card transactions, etc.

Common strategy: represent the input as an input vector in R^d

I Representation = mapping to another space that is easy to
manipulate

I Vectors are a great representation since we can do linear algebra
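As a small illustration (my own example, not from the slides), a colour image can be flattened into such a vector:

    import numpy as np

    # hypothetical 32 x 32 RGB image with pixel values in [0, 1]
    image = np.random.rand(32, 32, 3)

    # flatten into an input vector in R^d, here d = 32 * 32 * 3 = 3072
    x = image.reshape(-1)
    print(x.shape)   # (3072,)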

Intro ML (UofT) CSC311-Lec1 26 / 53

Supervised Learning Setup

Mathematically, our training set consists of a collection of pairs of an
input vector x ∈ R^d and its corresponding target, or label, t

I Regression: t is a real number (e.g. stock price)
I Classification: t is an element of a discrete set {1, . . . , C}
I These days, t is often a highly structured object (e.g. image)

Denote the training set {(x(1), t(1)), . . . , (x(N), t(N))}
I Note: these superscripts have nothing to do with exponentiation!
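In code, a common convention (my own illustrative choice, not mandated by the course) is to store the training set as an N x d input matrix and a length-N target vector, so that x^(i) is row i:

    import numpy as np

    N, d = 5, 3                       # hypothetical sizes
    X = np.random.randn(N, d)         # row i is the input vector x^(i)
    t = np.random.randint(0, 2, N)    # t[i] is the target t^(i)

    x_first, t_first = X[0], t[0]     # the first training pair (x^(1), t^(1))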

Intro ML (UofT) CSC311-Lec1 27 / 53

Input Vectors

Can use raw pixels:

Can do much better if you compute a vector of meaningful features.

Intro ML (UofT) CSC311-Lec1 28 / 53

Nearest Neighbour

Our first supervised learning model!

Suppose we’re given a novel input vector x we’d like to classify.

The idea: find the nearest input vector to x in the training set and
copy its label.

Intro ML (UofT) CSC311-Lec1 29 / 53

Nearest Neighbor: Example

Example: We would like to classify which Beetle is in this image x:

We find the closest image x(i) in the training set, and look at its label.

But how do we determine what is the “closest” image?

Intro ML (UofT) CSC311-Lec1 30 / 53


The 1-Nearest Neighbor Algorithm

Can formalize “nearest” in terms of Euclidean distance

\|x^{(a)} - x^{(b)}\|_2 = \sqrt{\sum_{j=1}^{d} \left(x_j^{(a)} - x_j^{(b)}\right)^2}

Algorithm:

1. Given
I input x (that we would like to classify),
I training set (x(1), t(1)), (x(2), t(2)), . . . .

Find an example (x(∗), t(∗)) in the training set closest to x,
i.e. that minimizes distance(x(∗),x).

2. Output y = t(∗)

Note: we do not need to compute the square root. Why?
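A minimal NumPy sketch of this algorithm (illustrative only; the function name and the convention of stacking the training inputs as rows of an N x d array are my own choices):

    import numpy as np

    def nearest_neighbour_predict(X_train, t_train, x):
        # squared Euclidean distances to every training point;
        # the square root is skipped because it does not change which point is closest
        dists = np.sum((X_train - x) ** 2, axis=1)
        return t_train[np.argmin(dists)]

    # tiny made-up example
    X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
    t_train = np.array([0, 1, 1])
    print(nearest_neighbour_predict(X_train, t_train, np.array([0.9, 1.2])))   # -> 1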

Intro ML (UofT) CSC311-Lec1 31 / 53

Nearest Neighbors: Decision Boundaries

We can visualize the behaviour in the classification setting using a Voronoi
diagram.

Here, our inputs are x ∈ R^2 (e.g. a 2-pixel image), and there are two possible labels t, either red or blue.

Intro ML (UofT) CSC311-Lec1 32 / 53

Nearest Neighbors: Decision Boundaries

Decision boundary: the boundary between regions of input space assigned to
different categories.

Intro ML (UofT) CSC311-Lec1 33 / 53

Nearest Neighbors: Decision Boundaries

Example: Decision boundary in a 3D data set.

Intro ML (UofT) CSC311-Lec1 34 / 53

Estimating Model Accuracy

Typically, we want to have some sense of how well our model performs.

For example, we want to use this model in a real-world application, and
want to know how reliable the predictions will be.

There are several ways to measure model performance. The simplest is
the accuracy measure:

Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)
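A one-line check of this definition (the arrays are made up for illustration):

    import numpy as np

    predictions = np.array([1, 0, 1, 1, 0])
    targets     = np.array([1, 0, 0, 1, 1])

    accuracy = np.mean(predictions == targets)   # correct predictions / total predictions
    print(accuracy)   # 0.6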

Intro ML (UofT) CSC311-Lec1 35 / 53

The Test Set

What might be an issue with computing the accuracy using the training set?

We will always get perfect performance in our 1-nearest neighbor model
(a data point is always closest to itself!)

Not a realistic representation of how our model will perform on unseen
data.

Solution: set aside a test set (from our set of labeled data) so we can estimate
how well our model will generalize.

Intro ML (UofT) CSC311-Lec1 36 / 53


k-Nearest Neighbors

[Pic by Olga Veksler]

Nearest neighbors sensitive to noise or mis-labeled data (“class noise”).
Solution?

Smooth by having k nearest neighbors vote

Intro ML (UofT) CSC311-Lec1 37 / 53

The k-Nearest Neighbors Algorithm

Algorithm (kNN):

1. Find k examples {x(i), t(i)} closest to the test instance x

2. Classification output is majority class

y = \arg\max_{t^{(z)}} \sum_{i=1}^{k} \mathbb{I}\{t^{(z)} = t^{(i)}\}

I{statement} is the indicator function: it is equal to one whenever the statement is
true, and zero otherwise. We could also write this as δ(t^{(z)}, t^{(i)}), with δ(a, b) = 1
if a = b and 0 otherwise.
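A minimal sketch of the kNN classifier (illustrative only; the function name is my own, and ties in the vote are broken arbitrarily rather than in any prescribed way):

    import numpy as np

    def knn_predict(X_train, t_train, x, k):
        dists = np.sum((X_train - x) ** 2, axis=1)    # squared Euclidean distances
        nearest = np.argsort(dists)[:k]               # indices of the k closest training points
        labels, counts = np.unique(t_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]              # majority class among the k neighbours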

Intro ML (UofT) CSC311-Lec1 38 / 53

K-Nearest neighbors (k=1)

[Image credit: ”The Elements of Statistical Learning”]

Intro ML (UofT) CSC311-Lec1 39 / 53

K-Nearest neighbors (k=15)

[Image credit: ”The Elements of Statistical Learning”]

Intro ML (UofT) CSC311-Lec1 40 / 53

How to choose k?

Small k

I Good at capturing fine-grained patterns
I May overfit, i.e. be sensitive to random idiosyncrasies in the training data

Large k

I Makes stable predictions by averaging over lots of examples
I May underfit, i.e. fail to capture important regularities

Balancing k:

I The optimal choice of k depends on the number of data points n.
I Nice theoretical properties if k → ∞ and k/n → 0.
I Rule of thumb: choose k = n^{2/(2+d)}.

We will explain another way to choose k using our data.
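As a quick sanity check on the rule of thumb (the numbers are made up for illustration):

    n, d = 10_000, 2
    k = round(n ** (2 / (2 + d)))
    print(k)   # 100, since 10000^(1/2) = 100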

Intro ML (UofT) CSC311-Lec1 41 / 53

Underfitting and Overfitting

[Image credit: https://cambridgecoding.wordpress.com/2016/03/24/
misleading-modelling-overfitting-cross-validation-and-the-bias-variance-trade-off/]

Intro ML (UofT) CSC311-Lec1 42 / 53


Choice of k: Generalization

We would like our algorithm to generalize to data it hasn’t seen before.

We can measure the generalization error (error rate on new examples)
using the test set.

[Image credit: ”The Elements of Statistical Learning”]

Intro ML (UofT) CSC311-Lec1 43 / 53

The problem with using the test set

Here, k is an example of a hyperparameter, something we can’t fit as part of
the learning algorithm itself.

What are potential issues with using the test set to choose k?

The test set is meant to simulate unseen data, in order to estimate how
well the model generalizes.

We therefore shouldn’t use the test set to make any decisions about any
aspects of the model!

Keep the test set unseen, so that we can estimate model accuracy.

What can we do instead?

Intro ML (UofT) CSC311-Lec1 44 / 53


Validation and Test Sets

Partition the data into three subsets

Training set: used to train the model

Validation set: used to tune hyperparameters

Test set: used to evaluate final performance once, after hyperparameters are chosen
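A minimal sketch of such a three-way split (the 70/15/15 proportions, shuffling, and function name are my own illustrative choices):

    import numpy as np

    def train_val_test_split(X, t, val_frac=0.15, test_frac=0.15, seed=0):
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X))              # shuffle before splitting
        n_test = int(len(X) * test_frac)
        n_val = int(len(X) * val_frac)
        test_idx = idx[:n_test]
        val_idx = idx[n_test:n_test + n_val]
        train_idx = idx[n_test + n_val:]
        return (X[train_idx], t[train_idx]), (X[val_idx], t[val_idx]), (X[test_idx], t[test_idx])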

Intro ML (UofT) CSC311-Lec1 45 / 53

Pitfalls: Normalization

Nearest neighbors can be sensitive to the ranges of different features.

Often, the units are arbitrary:

Simple fix: normalize each dimension to be zero mean and unit variance.
I.e., compute the mean µj and standard deviation σj , and take

\tilde{x}_j = \frac{x_j - \mu_j}{\sigma_j}

Caution: depending on the problem, the scale might be important!
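A minimal sketch of this normalization (illustrative; note that μ_j and σ_j are computed on the training data and then reused to transform validation/test data):

    import numpy as np

    def standardize(X_train, X_other):
        mu = X_train.mean(axis=0)
        sigma = X_train.std(axis=0)     # assumes no constant features (sigma > 0)
        return (X_train - mu) / sigma, (X_other - mu) / sigma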

Intro ML (UofT) CSC311-Lec1 46 / 53

Pitfalls: Computational Cost

Number of computations at training time: 0

Number of computations at test time, per query (naïve algorithm)

I Calculate D-dimensional Euclidean distances with N data points: O(ND)
I Sort the distances: O(N log N)

This must be done for each query, which is very expensive by the
standards of a learning algorithm!

Need to store the entire dataset in memory!

Tons of work has gone into algorithms and data structures for efficient
nearest neighbors with high dimensions and/or large datasets.
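One small, illustrative point about the cost (a sketch of my own, not from the slides): a full O(N log N) sort is not actually needed per query, since np.argpartition can select the k nearest indices in O(N) average time:

    import numpy as np

    def k_nearest_indices(X_train, x, k):
        dists = np.sum((X_train - x) ** 2, axis=1)   # O(ND) distance computation
        return np.argpartition(dists, k)[:k]         # O(N) average-case selection (unsorted)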

Intro ML (UofT) CSC311-Lec1 47 / 53

Example: Digit Classification

Decent performance when lots of data

[Slide credit: D. Claus]

Intro ML (UofT) CSC311-Lec1 48 / 53

Example: Digit Classification

KNN can perform a lot better with a good similarity measure.

Example: shape contexts for object recognition. In order to achieve
invariance to image transformations, they tried to warp one image to
match the other image.

I Distance measure: average distance between corresponding points
on warped images

Achieved 0.63% error on MNIST, compared with 3% for Euclidean KNN.

Competitive with conv nets at the time, but required careful engineering.

[Belongie, Malik, and Puzicha, 2002. Shape matching and object recognition using shape
contexts.]

Intro ML (UofT) CSC311-Lec1 49 / 53

Example: 80 Million Tiny Images

80 Million Tiny Images was
the first extremely large
image dataset. It consisted of
color images scaled down to
32× 32.

With a large dataset, you can
find much better semantic
matches, and KNN can do
some surprising things.

Note: this required a carefully
chosen similarity metric.

[Torralba, Fergus, and Freeman, 2007. 80 Million Tiny Images.]

Intro ML (UofT) CSC311-Lec1 50 / 53

Example: Colorizing 80 Million Tiny Images

[Torralba, Fergus, and Freeman, 2007. 80 Million Tiny Images.]

Intro ML (UofT) CSC311-Lec1 51 / 53

Conclusions

Simple algorithm that does all its work at test time — in a sense,
no learning!

Can be used for regression too, which we encounter later.

Can control the complexity by varying k

Next time: decision trees, another approach to regression and
classification

Intro ML (UofT) CSC311-Lec1 52 / 53

Questions?

?

Intro ML (UofT) CSC311-Lec1 53 / 53
