CSC 311: Introduction to Machine Learning
Lecture 5 – Linear Models III, Neural Nets I
Roger G., University of Toronto, Fall 2021
Multiclass Classification and Softmax Regression
Classification: predicting a discrete-valued target
- Binary classification: predicting a binary-valued target
- Multiclass classification: predicting a discrete (> 2)-valued target
Examples of multi-class classification
- predict the value of a handwritten digit
- classify e-mails as spam, travel, work, personal
Multiclass Classification
Classification tasks with more than two categories:
"It is very hard to say what makes a 2."
[Figure: sample handwritten digits; some examples from an earlier version of the net]
Multiclass Classification
Targets form a discrete set {1,…,K}.
It’s often more convenient to represent them as one-hot vectors, or
a one-of-K encoding:
t = (0, …, 0, 1, 0, …, 0) ∈ R^K,   where entry k is 1
Multiclass Linear Classification
We can start with a linear function of the inputs.
Now there are D input dimensions and K output dimensions, so
we need K × D weights, which we arrange as a weight matrix W. Also, we have a K-dimensional vector b of biases.
A linear function of the inputs:
zk = Σ_j wkj xj + bk    for k = 1, 2, …, K
We can eliminate the bias b by taking W ∈ R^{K×(D+1)} and adding a dummy variable x0 = 1. So, vectorized:
z = Wx + b,   or with dummy x0 = 1:   z = Wx
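As a quick illustration of this vectorized form (not from the slides), here is a minimal NumPy sketch that builds the logits for a small batch; all sizes are made-up assumptions.

```python
import numpy as np

# Hypothetical sizes for illustration: D = 4 features, K = 3 classes, N = 5 examples.
D, K, N = 4, 3, 5
rng = np.random.default_rng(0)

X = rng.normal(size=(N, D))                 # raw inputs, one example per row
X_dummy = np.hstack([np.ones((N, 1)), X])   # prepend the dummy x0 = 1 to absorb the bias

W = rng.normal(size=(K, D + 1))             # weight matrix, biases folded into column 0

Z = X_dummy @ W.T                           # logits: Z[i, k] = w_k . x^(i), shape (N, K)
print(Z.shape)                              # (5, 3)
```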
Multiclass Linear Classification
How can we turn this linear prediction into a one-hot prediction?
We can interpret the magnitude of zk as a measure of how much the model prefers k as its prediction. If we do this, we should set
yi = 1 if i = argmaxk zk, and yi = 0 otherwise.
Exercise: how does the case of K = 2 relate to the prediction rule in binary linear classifiers?
Softmax Regression
We need to soften our predictions for the sake of optimization.
We want soft predictions that are like probabilities, i.e., 0 ≤ yk ≤ 1 and Σ_k yk = 1.
A natural activation function to use is the softmax function, a
multivariable generalization of the logistic function:
yk = softmax(z1, …, zK)k = e^{zk} / Σ_{k′} e^{zk′}
- Outputs can be interpreted as probabilities (positive and sum to 1).
- If zk is much larger than the others, then softmax(z)k ≈ 1 and it behaves like argmax.
- Exercise: how does the case of K = 2 relate to the logistic function?
The inputs zk are called the logits.
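As a concrete sketch (not lecture code) of the softmax, with the standard max-subtraction trick for numerical stability; the shift doesn't change the output, since softmax is invariant to adding a constant to every logit.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis of z (the logits)."""
    z = z - np.max(z, axis=-1, keepdims=True)   # shift for numerical stability
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

z = np.array([1.0, 2.0, 5.0])
y = softmax(z)
print(y, y.sum())        # probabilities summing to 1; the largest logit dominates
```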
Softmax Regression
If a model outputs a vector of class probabilities, we can use cross-entropy as the loss function:
LCE(y, t) = − Σ_{k=1}^K tk log yk
          = −t⊤(log y),
where the log is applied elementwise.
Just like with logistic regression, we typically combine the softmax
and cross-entropy into a softmax-cross-entropy function.
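A sketch of the combined softmax-cross-entropy (not lecture code), working directly on the logits via log-softmax so we never take the log of an underflowed probability; the example logits and one-hot target are made up.

```python
import numpy as np

def softmax_cross_entropy(z, t):
    """Cross-entropy of softmax(z) against a one-hot target t, computed from the logits z."""
    z = z - np.max(z, axis=-1, keepdims=True)                     # stability shift
    log_y = z - np.log(np.sum(np.exp(z), axis=-1, keepdims=True)) # log-softmax
    return -np.sum(t * log_y, axis=-1)

z = np.array([2.0, -1.0, 0.5])
t = np.array([1.0, 0.0, 0.0])          # one-hot target for class 0
print(softmax_cross_entropy(z, t))
```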
Softmax Regression
Softmax regression (with dummy x0 = 1):
z = Wx
y = softmax(z)
LCE = −t⊤(log y)
Gradient descent updates can be derived for each row of W:
∂LCE/∂wk = (∂LCE/∂zk) · (∂zk/∂wk) = (yk − tk) · x
wk ← wk − α Σ_i (yk^(i) − tk^(i)) x^(i)
Similar to the linear/logistic regression updates (no coincidence). (Exercise: verify the update.)
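A sketch of batch gradient descent for softmax regression matching the update above. The data, learning rate, and iteration count are made-up assumptions, and the gradient here is averaged over the batch, which only rescales α relative to the summed update on the slide.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

# Toy data: N examples, D features (plus dummy), K classes (all assumed for the example).
rng = np.random.default_rng(0)
N, D, K = 100, 4, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])   # dummy x0 = 1
labels = rng.integers(K, size=N)
T = np.eye(K)[labels]                                       # one-hot targets, shape (N, K)

W = np.zeros((K, D + 1))
alpha = 0.1
for _ in range(200):
    Y = softmax(X @ W.T)                # predictions, shape (N, K)
    grad = (Y - T).T @ X / N            # dLCE/dW, averaged over the batch
    W = W - alpha * grad                # gradient descent step on every row w_k
```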
When are critical points optimal?
[Figure: example loss curves with their critical points marked]
Gradient descent finds a critical point, but it may be a local optimum.
Convexity is a property that guarantees that all critical points are global minima.
Convex Sets
A set S is convex if any line segment connecting points in S lies entirely within S. Mathematically,
x1, x2 ∈ S  =⇒  λx1 + (1 − λ)x2 ∈ S   for 0 ≤ λ ≤ 1.
A simple inductive argument shows that for x1, …, xN ∈ S, weighted averages, or convex combinations, lie within the set:
λ1 x1 + · · · + λN xN ∈ S   for λi > 0,  λ1 + · · · + λN = 1.
Convex Functions
A function f is convex if for any x0, x1 in the domain of f and any 0 ≤ λ ≤ 1,
f((1 − λ)x0 + λx1) ≤ (1 − λ)f(x0) + λf(x1)
Equivalently, the set of points lying above the graph of f is convex.
Intuitively: the function is bowl-shaped.
Convex Functions
We just saw that the least-squares loss (1/2)(y − t)^2 is convex as a function of y.
For a linear model,
z = w⊤x + b is a linear function of w and b. If the loss function is convex as a function of z, then it is convex as a function of w and b.
Tracking model performance
Progress during learning
Recall we introduced training curves as a way to track progress during learning.
The training criterion (e.g. squared error, cross-entropy) is chosen partly to be easy to optimize.
We may wish to track other metrics which better match what we're interested in, even if we can't directly optimize them.
Metrics for Binary classification
Recall that the average of 0–1 loss is the error rate, or fraction incorrectly classified.
- We noted we couldn't optimize it, but it's still a useful metric to track.
- Equivalently, we can track the accuracy, or fraction correct.
- Typically, the error rate behaves similarly to the cross-entropy loss, but this isn't always the case.
Another way to break down the accuracy:
- P = num positive; N = num negative; TP = true positives; TN = true negatives
- FP = false positives, or type I errors; FN = false negatives, or type II errors
Acc = (TP + TN) / (P + N) = (TP + TN) / (TP + TN + FP + FN)
Discuss: When might accuracy present a misleading picture of performance?
The limitations of accuracy
Accuracy is highly sensitive to class imbalance.
I Suppose you’re trying to screen patients for a particular disease,
and under the data generating distribution, 1% of patients have
that disease.
I How can you achieve 99% accuracy?
I You are able to observe a feature which is 10x more likely in a
patient who has cancer. Does this improve your accuracy?
Sensitivity and specificity are useful metrics even under class imbalance.
- Sensitivity = TP / (TP + FN)   [true positive rate]
- Specificity = TN / (TN + FP)   [true negative rate]
- What happens if our classification problem is not truly (log-)linearly separable?
- How do we pick a threshold for y = σ(x)? (See the sketch below.)
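To make these metrics concrete (as referenced in the list above), here is a small NumPy sketch with made-up data and thresholds that computes accuracy, sensitivity, and specificity from a vector of predicted probabilities; note how accuracy looks good under class imbalance while sensitivity does not.

```python
import numpy as np

def binary_metrics(y_prob, t, threshold=0.5):
    """Accuracy, sensitivity, and specificity of thresholded probabilities."""
    pred = (y_prob >= threshold).astype(int)
    tp = np.sum((pred == 1) & (t == 1))
    tn = np.sum((pred == 0) & (t == 0))
    fp = np.sum((pred == 1) & (t == 0))
    fn = np.sum((pred == 0) & (t == 1))
    acc = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)        # true positive rate
    specificity = tn / (tn + fp)        # true negative rate
    return acc, sensitivity, specificity

# Class-imbalanced toy example: mostly negatives.
t = np.array([0] * 95 + [1] * 5)
y_prob = np.concatenate([np.full(95, 0.1), np.full(5, 0.4)])
print(binary_metrics(y_prob, t, threshold=0.5))   # high accuracy, zero sensitivity
print(binary_metrics(y_prob, t, threshold=0.3))   # lower threshold catches the positives
```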
Designing diagnostic tests
You’ve developed a binary prediction model to indicate whether someone has a specific disease
What happens to sensitivity and specificity as you slide the threshold from left to right?
Sensitivity and specificity
Tradeoff between sensitivity and specificity
Receiver Operating Characteristic (ROC) curve
y axis: sensitivity
x axis: 100 − specificity (i.e., the false positive rate, as a percentage)
Area under the ROC curve (AUC) is a useful metric for tracking whether a binary classifier achieves a good tradeoff between sensitivity and specificity.
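A rough sketch (not lecture code) of how the ROC curve and AUC can be traced by sweeping the decision threshold; in practice you would use a library routine such as scikit-learn's roc_curve or roc_auc_score, so treat this purely as an illustration of the idea.

```python
import numpy as np

def roc_curve_points(y_prob, t):
    """Sweep thresholds over the predicted probabilities and return (FPR, TPR) points."""
    thresholds = np.sort(np.unique(y_prob))[::-1]
    fpr, tpr = [0.0], [0.0]
    P, N = np.sum(t == 1), np.sum(t == 0)
    for thr in thresholds:
        pred = (y_prob >= thr).astype(int)
        tpr.append(np.sum((pred == 1) & (t == 1)) / P)   # sensitivity
        fpr.append(np.sum((pred == 1) & (t == 0)) / N)   # 1 - specificity
    return np.array(fpr), np.array(tpr)

rng = np.random.default_rng(0)
t = rng.integers(2, size=200)
y_prob = np.clip(0.3 * rng.normal(size=200) + 0.4 * t + 0.3, 0, 1)  # noisy made-up scores
fpr, tpr = roc_curve_points(y_prob, t)
auc = np.trapz(tpr, fpr)              # area under the ROC curve
print(f"AUC = {auc:.3f}")
```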
Metrics for Multi-Class classification
You might also be interested in how frequently certain classes are confused.
Confusion matrix: K × K matrix; rows are true labels, columns are predicted labels, entries are frequencies
Question: what does the confusion matrix look like if the classifier is perfect?
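A minimal sketch (not lecture code) of building a K × K confusion matrix with rows as true labels and columns as predicted labels; the example labels are made up. For a perfect classifier, all counts land on the diagonal.

```python
import numpy as np

def confusion_matrix(t, pred, K):
    """Rows: true labels. Columns: predicted labels. Entries: counts."""
    C = np.zeros((K, K), dtype=int)
    for true_label, pred_label in zip(t, pred):
        C[true_label, pred_label] += 1
    return C

t    = np.array([0, 1, 2, 2, 1, 0, 2])
pred = np.array([0, 2, 2, 2, 1, 0, 1])
print(confusion_matrix(t, pred, K=3))
```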
Limits of Linear Classification
Limits of Linear Classification
Some datasets are not linearly separable, e.g. XOR
Visually obvious, but how to show this?
Limits of Linear Classification
Showing that XOR is not linearly separable (proof by contradiction)
If two points lie in a half-space, the line segment connecting them also lies in the same half-space.
Suppose there were some feasible weights (hypothesis). If the positive examples are in the positive half-space, then the green line segment must be as well.
Similarly, the red line segment must lie within the negative half-space.
But the intersection can’t lie in both half-spaces. Contradiction!
Limits of Linear Classification
Sometimes we can overcome this limitation using feature maps, just like for linear regression. E.g., for XOR:
ψ(x) = (x1, x2, x1 x2)

x1  x2  ψ1(x)  ψ2(x)  ψ3(x)  t
 0   0    0      0      0    0
 0   1    0      1      0    1
 1   0    1      0      0    1
 1   1    1      1      1    0
This is linearly separable. (Try it!)
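To actually try it, here is a small check (not from the slides) that a single linear threshold on ψ(x) separates XOR; the particular weights (1, 1, −2) and threshold 0.5 are just one choice that happens to work.

```python
import numpy as np

# psi(x) = (x1, x2, x1*x2) for the four XOR inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 1, 1, 0])                       # XOR targets
psi = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])

w, b = np.array([1.0, 1.0, -2.0]), -0.5          # one linear separator in feature space
pred = (psi @ w + b > 0).astype(int)
print(pred, np.array_equal(pred, t))             # [0 1 1 0] True
```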
Designing feature maps can be hard. Can we learn them?
Neural Networks
Inspiration: The Brain
Neurons receive input signals and accumulate voltage. After some threshold they will fire spiking responses.
[Pic credit: www.moleculardevices.com]
Inspiration: The Brain
For neural nets, we use a much simpler model neuron, or unit:
Compare with logistic regression: y = σ(w⊤x + b)
By throwing together lots of these incredibly simplistic neuron-like processing units, we can do some powerful computations!
We can connect lots of units together into a directed acyclic graph.
Typically, units are grouped into layers.
This gives a feed-forward neural network.
Each hidden layer i connects Ni−1 input units to Ni output units.
In a fully connected layer, all input units are connected to all output units.
Note: the inputs and outputs for a layer are distinct from the inputs and outputs to the network.
If we need to compute M outputs from N inputs, we can do so using matrix multiplication. This means we'll be using an M × N matrix W.
The outputs are a function of the input units: y = f (x) = φ (Wx + b)
φ is typically applied component-wise.
A multilayer network consisting of fully connected layers is called a multilayer perceptron.
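A minimal sketch of the layer equation above (not lecture code); the sizes and the choice of φ = ReLU are assumptions for the example.

```python
import numpy as np

def dense_layer(x, W, b, phi=lambda z: np.maximum(0.0, z)):
    """Fully connected layer: maps N inputs to M outputs via an M x N matrix W."""
    return phi(W @ x + b)

rng = np.random.default_rng(0)
N_in, M_out = 4, 3
W = rng.normal(size=(M_out, N_in))
b = np.zeros(M_out)
x = rng.normal(size=N_in)
print(dense_layer(x, W, b))          # M_out activations
```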
Some activation functions:
Identity:  y = z
Rectified Linear Unit (ReLU):  y = max(0, z)
Softplus (soft ReLU):  y = log(1 + e^z)
Some activation functions:
Hard Threshold:  y = 1 if z > 0,  y = 0 if z ≤ 0
Hyperbolic Tangent (tanh):  y = (e^z − e^{−z}) / (e^z + e^{−z})
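For reference, here are these activations as one-liners in NumPy (a sketch, not lecture code; the names follow the slides above).

```python
import numpy as np

def identity(z):        return z
def relu(z):            return np.maximum(0.0, z)
def soft_relu(z):       return np.log1p(np.exp(z))           # y = log(1 + e^z)
def hard_threshold(z):  return (z > 0).astype(float)         # 1 if z > 0 else 0
def tanh(z):            return np.tanh(z)                    # (e^z - e^-z) / (e^z + e^-z)

z = np.linspace(-2, 2, 5)
for f in (identity, relu, soft_relu, hard_threshold, tanh):
    print(f.__name__, f(z))
```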
Each layer computes a function, so the network computes a composition of functions:
h(1) = f(1)(x) = φ(W(1)x + b(1))
h(2) = f(2)(h(1)) = φ(W(2)h(1) + b(2))
y = f(L)(h(L−1))
Or more simply:
y = f(L) ◦ · · · ◦ f(1)(x).
Neural nets provide modularity: we can implement each layer’s computations as a black box.
Feature Learning
Last layer:
If task is regression: choose
y = f(L)(h(L−1)) = (w(L))⊤h(L−1) + b(L)
If task is binary classification: choose
y = f(L)(h(L−1)) = σ((w(L))⊤h(L−1) + b(L))
So neural nets can be viewed as a way of learning features:
Feature Learning
Suppose we’re trying to classify images of handwritten digits. Each image is represented as a vector of 28 × 28 = 784 pixel values.
Each first-layer hidden unit computes φ(wi⊤x). It acts as a feature detector.
We can visualize w by reshaping it into an image. Here’s an example that responds to a diagonal stroke.
Feature Learning
Here are some of the features learned by the first hidden layer of a handwritten digit classifier:
Unlike hard-coded feature maps (e.g., in polynomial regression), features learned by neural networks adapt to patterns in the data.
Expressivity
In Lecture 4, we introduced the idea of a hypothesis space H, which is the set of input-output mappings that can be represented by some model. Suppose we are deciding between two models A, B with hypothesis spaces HA,HB.
If HB ⊆ HA, then A is more expressive than B.
A can represent any function f in HB.
Some functions (XOR) can’t be represented by linear classifiers. Are deep networks more expressive?
Expressivity—Linear Networks
Suppose a layer’s activation function was the identity, so the layer just computes a affine transformation of the input
I We call this a linear layer
Any sequence of linear layers can be equivalently represented with
a single linear layer.
y = W(3)W(2)W(1) x = W′x,   where W′ ≜ W(3)W(2)W(1)
- Deep linear networks can only represent linear functions.
- Deep linear networks are no more expressive than linear regression. (See the numerical check below.)
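A quick numerical check of this claim (an illustration with arbitrary made-up sizes, not lecture code): composing three linear layers gives exactly the same map as the single collapsed matrix W′ = W(3)W(2)W(1).

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 4))
W2 = rng.normal(size=(6, 5))
W3 = rng.normal(size=(3, 6))
x = rng.normal(size=4)

deep = W3 @ (W2 @ (W1 @ x))          # three linear layers, no nonlinearity
single = (W3 @ W2 @ W1) @ x          # one equivalent linear layer W' = W3 W2 W1
print(np.allclose(deep, single))     # True
```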
Expressive Power—Non-linear Networks
Multilayer feed-forward neural nets with nonlinear activation functions are universal function approximators: they can approximate any function arbitrarily well, i.e., for any f : X → T there is a sequence fi ∈ H with fi → f.
This has been shown for various activation functions (thresholds, logistic, ReLU, etc.)
- Even though ReLU is "almost" linear, it's nonlinear enough.
Designing a network to classify XOR:
Assume hard threshold activation function
h1 computes I[x1 + x2 − 0.5 > 0]
- i.e. x1 OR x2
h2 computes I[x1 + x2 − 1.5 > 0]
- i.e. x1 AND x2
y computes I[h1 − h2 − 0.5 > 0] ≡ I[h1 + (1 − h2) − 1.5 > 0]
- i.e. h1 AND (NOT h2) = x1 XOR x2
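To verify the construction, here is a tiny sketch (not from the slides) that evaluates the network on all four inputs using a hard threshold unit; the weights are exactly the ones read off the indicators above.

```python
def step(z):
    """Hard threshold unit: I[z > 0]."""
    return 1 if z > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        h1 = step(x1 + x2 - 0.5)         # x1 OR x2
        h2 = step(x1 + x2 - 1.5)         # x1 AND x2
        y = step(h1 - h2 - 0.5)          # h1 AND (NOT h2) = x1 XOR x2
        print(x1, x2, "->", y)
```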
Expressivity
Universality for binary inputs and targets:
Hard threshold hidden units, linear output
Strategy: 2^D hidden units, each of which responds to one particular input configuration
Only requires one hidden layer, though it needs to be extremely wide.
Expressivity
What about the logistic activation function?
You can approximate a hard threshold by scaling up the weights and biases:
[Figure: plots of y = σ(x) and y = σ(5x)]
This is good: logistic units are differentiable, so we can train them
with gradient descent.
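A tiny numerical illustration of this point (not from the slides): scaling the logistic's input pushes its outputs toward 0/1, i.e. toward a hard threshold.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-1.0, -0.2, 0.2, 1.0])
print(sigmoid(z))        # fairly soft: values well inside (0, 1)
print(sigmoid(5 * z))    # sharper: close to 0 or 1, approximating a hard threshold
print(sigmoid(50 * z))   # nearly a step function
```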
Expressivity—What is it good for?
Universality is not necessarily a golden ticket.
- You may need a very large network to represent a given function.
- How can you find the weights that represent a given function?
Expressivity can be bad: if you can learn any function, overfitting is potentially a serious concern!
- Recall the polynomial feature mappings from Lecture 2. Expressivity increases with the degree M, eventually allowing multiple perfect fits to the training data.
This motivated L2 regularization.
Do neural networks overfit and how can we regularize them?
Regularization and Overfitting for Neural Networks
The topic of overfitting (when & how it happens, how to regularize, etc.) for neural networks is not well-understood, even by researchers!
- In principle, you can always apply L2 regularization.
- You will learn more in CSC413.
A common approach is early stopping, or stopping training early, because overfitting typically increases as training progresses.
Unlike L2 regularization, we don't add an explicit R(θ) term to our cost.
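A sketch of early stopping as a training-loop pattern, not a real API: train_step and validation_error are placeholder callables assumed for the example, and only the control flow is the point. Keep the weights from the epoch with the best validation error and stop once it hasn't improved for a while.

```python
import copy

def train_with_early_stopping(model, train_step, validation_error,
                              max_epochs=200, patience=10):
    """Stop when validation error hasn't improved for `patience` epochs."""
    best_err = float("inf")
    best_model = copy.deepcopy(model)
    epochs_since_improvement = 0

    for epoch in range(max_epochs):
        train_step(model)                      # one pass over the training data
        err = validation_error(model)          # the metric we actually care about
        if err < best_err:
            best_err = err
            best_model = copy.deepcopy(model)
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
            if epochs_since_improvement >= patience:
                break                          # overfitting has likely set in

    return best_model

# Toy usage with stand-in callables (just to show the control flow):
errs = iter([1.0, 0.8, 0.7, 0.72, 0.74, 0.75, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3])
model = {"weights": None}
best = train_with_early_stopping(model, train_step=lambda m: None,
                                 validation_error=lambda m: next(errs),
                                 max_epochs=12, patience=3)
```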
Conclusion
Multi-class classification
Convexity of loss functions
Selecting good metrics to track performance in models
From linear to non-linear models