CSCI 567: Machine Learning
Lecture 6, Sep 29
Administrivia
• Quiz 1 next week during lecture hours (5pm-7:20pm)
• In-person. Please go to your respective room as follows:
Last name from A-L: THH 201
Last name from M-Z: SGM 123
• Quiz will be closed-book.
• Based on material covered up to the last lecture (so up to and including SVMs).
• HW1 grades will be released on Monday.
• HW3 will be released a few days after Quiz 1.
Multiclass classification
Recall the setup:
• input (feature vector): x ∈ R^d
• output (label): y ∈ [C] = {1, 2, · · · , C}
• goal: learn a mapping f : R^d → [C]
• recognizing digits (C = 10) or letters (C = 26 or 52)
• predicting weather: sunny, cloudy, rainy, etc
• predicting image category: ImageNet dataset (C ≈ 20K)
1.2 Linear models: Binary to multiclass
Step 1: What should a linear model look like for multiclass tasks?
Note: a linear model for binary tasks (switching the labels from {−1, +1} to {1, 2})

f(x) = 1 if w^T x ≥ 0, and f(x) = 2 if w^T x < 0,

can be written as

f(x) = argmax_{k∈{1,2}} w_k^T x

for any w_1, w_2 s.t. w = w_1 − w_2. Think of w_k^T x as a score for class k.

1.3 Function class: Linear models for multiclass classification

F = { f(x) = argmax_{k∈[C]} w_k^T x | w_1, . . . , w_C ∈ R^d }
  = { f(x) = argmax_{k∈[C]} (Wx)_k | W ∈ R^{C×d} }

Next, let's try to generalize the loss functions. Focus on the logistic loss today.

1.4 Multinomial logistic regression: a probabilistic view

Observe: for binary logistic regression, with w = w_1 − w_2:

Pr(y = 1 | x; w) = σ(w^T x) = 1 / (1 + e^{−w^T x}) = e^{w_1^T x} / (e^{w_1^T x} + e^{w_2^T x})

Naturally, for multiclass:

Pr(y = k | x; W) = e^{w_k^T x} / Σ_{k'∈[C]} e^{w_{k'}^T x}

This is called the softmax function.

1.5 Let's find the MLE

Maximize the probability of seeing labels y_1, . . . , y_n given x_1, . . . , x_n:

P(W) = ∏_{i=1}^n Pr(y_i | x_i; W)

Taking the negative log, this is equivalent to minimizing

F(W) = Σ_{i=1}^n ln( Σ_{k∈[C]} e^{(w_k − w_{y_i})^T x_i} )

This is the multiclass logistic loss. It is an upper bound on the 0-1 misclassification loss:

I[f(x) ≠ y] ≤ log_2( Σ_{k∈[C]} e^{(w_k − w_y)^T x} )

When C = 2, the multiclass logistic loss is the same as the binary logistic loss (let's verify).

1.6 Next, optimization

Apply SGD: what is the gradient of F_i(W) = ln( Σ_{k∈[C]} e^{(w_k − w_{y_i})^T x_i} )?

It's a C × d matrix. Let's focus on the k-th row:

if k ≠ y_i:  ∇_{w_k} F_i(W) = [ e^{(w_k − w_{y_i})^T x_i} / Σ_{k'∈[C]} e^{(w_{k'} − w_{y_i})^T x_i} ] x_i^T = Pr(y = k | x_i; W) x_i^T

if k = y_i:  ∇_{w_{y_i}} F_i(W) = (Pr(y = y_i | x_i; W) − 1) x_i^T

SGD for multinomial logistic regression: initialize W = 0 (or randomly). Repeat:
1. pick i ∈ [n] uniformly at random
2. update the parameters W ← W − η G_i, where the k-th row of G_i is Pr(y = k | x_i; W) x_i^T, except the y_i-th row, which is (Pr(y = y_i | x_i; W) − 1) x_i^T

Think about why the algorithm makes sense intuitively.

1.7 Probabilities → Prediction
Having learned W , we can either
• make a deterministic prediction argmax_{k∈[C]} w_k^T x
• make a randomized prediction according to Pr(y = k | x; W) ∝ e^{w_k^T x}
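To make the SGD algorithm and these two prediction rules concrete, here is a minimal NumPy sketch (not from the slides: the function names, the learning rate, and the use of 0-based labels are our own choices):

import numpy as np

def softmax(scores):
    # subtract the max for numerical stability; the probabilities are unchanged
    e = np.exp(scores - scores.max())
    return e / e.sum()

def sgd_multinomial_logistic(X, y, C, eta=0.1, steps=10000, seed=0):
    """X: (n, d) features; y: (n,) labels in {0, ..., C-1}; returns W of shape (C, d)."""
    n, d = X.shape
    W = np.zeros((C, d))
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        i = rng.integers(n)                 # pick i uniformly at random
        p = softmax(W @ X[i])               # Pr(y = k | x_i; W) for all k
        p[y[i]] -= 1.0                      # the y_i-th row of the gradient has the extra -1
        W -= eta * np.outer(p, X[i])        # W <- W - eta * G_i
    return W

def predict(W, x, randomized=False, rng=None):
    p = softmax(W @ x)
    if randomized:                          # sample a label according to Pr(y = k | x; W)
        return (rng or np.random.default_rng()).choice(len(p), p=p)
    return int(np.argmax(p))                # deterministic: argmax_k w_k^T x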
1.8 Beyond linear models
Suppose we have any model f (not necessarily linear) which gives some score f_k(x) for
the datapoint x having the k-th label.
How can we convert these scores to probabilities? Use the softmax function!
f̃_k(x) = Pr(y = k | x; f) = e^{f_k(x)} / Σ_{k'∈[C]} e^{f_{k'}(x)}
Once we have probability estimates, what is a suitable loss function to train the model? Use the log loss, also known as the cross-entropy loss.
Log Loss/Cross-entropy loss: Binary case
Let’s start with binary classification again. Consider a model which predicts f̃(x) as the probability of the label being 1 for a labelled datapoint (x, y). The log loss is defined as

LogLoss = 1(y = 1) ln(1/f̃(x)) + 1(y = −1) ln(1/(1 − f̃(x)))
        = −1(y = 1) ln(f̃(x)) − 1(y = −1) ln(1 − f̃(x)).
When the model is linear, this reduces to the logistic regression loss we defined before!
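As a quick numerical check of this claim (our own sketch, not from the slides): with a linear model and a sigmoid, i.e. f̃(x) = σ(w^T x), the log loss above equals the logistic loss ln(1 + e^{−y w^T x}):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(p1, y):
    # p1 = predicted probability of label +1; y in {+1, -1}
    return -np.log(p1) if y == +1 else -np.log(1.0 - p1)

def logistic_loss(w, x, y):
    return np.log(1.0 + np.exp(-y * np.dot(w, x)))

rng = np.random.default_rng(0)
w, x = rng.normal(size=3), rng.normal(size=3)
for y in (+1, -1):
    assert np.isclose(log_loss(sigmoid(np.dot(w, x)), y), logistic_loss(w, x, y))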
This generalizes easily to the multiclass case. For a datapoint (x, y), if f̃_k(x) is the predicted probability of label k,

LogLoss = Σ_{k∈[C]} 1(y = k) ln(1/f̃_k(x)) = −Σ_{k∈[C]} 1(y = k) ln(f̃_k(x)).
When the model is linear, this also reduces to the multiclass logistic regression loss we
defined earlier today.
Log Loss/Cross-entropy loss: Multiclass case
By combining the softmax and the log loss, we have a general loss ℓ(f(x), y) which we can use to train a multiclass classification model that assigns scores f_k(x) to the k-th class. (These scores f_k(x) are sometimes referred to as logits.)

ℓ(f(x), y) = −Σ_{k∈[C]} 1(y = k) ln( e^{f_k(x)} / Σ_{k'∈[C]} e^{f_{k'}(x)} ) = ln( Σ_{k∈[C]} e^{f_k(x) − f_y(x)} )
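A small sketch of this loss computed directly from the scores (logits), using the second form above; the function name and the 0-based label are our own:

import numpy as np

def cross_entropy(logits, y):
    # logits: (C,) scores f_k(x); y: integer label
    # loss = ln( sum_k exp(f_k(x) - f_y(x)) )
    return np.log(np.exp(logits - logits[y]).sum())

logits = np.array([2.0, -1.0, 0.5])
print(cross_entropy(logits, y=0))   # small: the true class has the highest score
print(cross_entropy(logits, y=1))   # large: the true class has the lowest score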
1.8 Other techniques for multiclass classification
Cross-entropy is the most popular, but there are other black-box techniques to convert
multiclass classification to binary classification.
• one-versus-all (one-versus-rest, one-against-all, etc.)
• one-versus-one (all-versus-all, etc.)
• Error-Correcting Output Codes (ECOC)
• tree-based reduction
1.9 One-versus-all
Idea: train C binary classifiers to learn “is class k or not?” for each k.
Training: for each class k ! [C],
• relabel examples with class k as +1, and all others as −1
• train a binary classifier hk using this new dataset
(Picture credit: http://rob.schapire.net/talks/ecoc-icml10.pdf)
Prediction: for a new example x
• ask each h_k: does this belong to class k? (i.e., compute h_k(x))
• randomly pick among all k’s s.t. h_k(x) = +1.
Issue: will (probably) make a mistake as long as one of the h_k errs.
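A sketch of the reduction (assuming some black-box binary learner train_binary(X, y) that returns a classifier with a .predict(x) method; both names are placeholders, not a specific library API):

import numpy as np

def train_one_vs_all(X, y, C, train_binary):
    # one classifier per class k: "is class k or not?"
    classifiers = []
    for k in range(C):
        y_k = np.where(y == k, +1, -1)       # relabel: class k -> +1, all others -> -1
        classifiers.append(train_binary(X, y_k))
    return classifiers

def predict_one_vs_all(classifiers, x, rng=None):
    votes = [k for k, h in enumerate(classifiers) if h.predict(x) == +1]
    if not votes:                            # no classifier claims x; fall back to all classes
        votes = list(range(len(classifiers)))
    return (rng or np.random.default_rng()).choice(votes)   # random pick among claimed classes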
1.10 One-versus-one

Idea: train C(C − 1)/2 binary classifiers, one for each pair of classes, to learn “is class k or k′?”.
Training: for each pair (k, k′),
• relabel examples with class k as +1 and examples with class k′ as −1
• discard all other examples
• train a binary classifier h_(k,k′) using this new dataset
(Picture credit: http://rob.schapire.net/talks/ecoc-icml10.pdf)
Prediction: for a new example x
• ask each classifier h_(k,k′) to vote for either class k or class k′
• predict the class with the most votes (break tie in some way)
More robust than one-versus-all, but slower in prediction.
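And a matching sketch for one-versus-one (same hypothetical train_binary helper as in the one-versus-all sketch):

from collections import Counter
import numpy as np

def train_one_vs_one(X, y, C, train_binary):
    # C(C-1)/2 classifiers, one per pair (k, k')
    classifiers = {}
    for k in range(C):
        for kp in range(k + 1, C):
            mask = (y == k) | (y == kp)                # discard all other examples
            y_pair = np.where(y[mask] == k, +1, -1)    # class k -> +1, class k' -> -1
            classifiers[(k, kp)] = train_binary(X[mask], y_pair)
    return classifiers

def predict_one_vs_one(classifiers, x):
    votes = Counter()
    for (k, kp), h in classifiers.items():
        votes[k if h.predict(x) == +1 else kp] += 1    # each classifier votes for k or k'
    return votes.most_common(1)[0][0]                  # class with the most votes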
Other techniques such as tree-based methods and error-correcting codes can achieve
intermediate tradeoffs.
Neural Networks
Linear models aren’t always enough. As we discussed, we can use a nonlinear mapping
and learn a linear model in the feature space:
φ(x) : x ∈ R^d → z ∈ R^M
But what kind of nonlinear mapping φ should be used?
Can we just learn the nonlinear mapping itself?
Linear -> Fixed non-linear -> Learned non-linear map
Supervised learning in one slide
Loss function: What is the right loss function for the task?
Representation: What class of functions should we use?
Optimization: How can we efficiently solve the empirical risk
minimization problem?
Generalization: Will the predictions of our model transfer
gracefully to unseen examples?
All related! And the fuel which powers everything is data.
2.1 Loss function
For a model which makes predictions f(x) on a labelled datapoint (x, y), we can use the following losses.
Regression:
ℓ(f(x), y) = (f(x) − y)²
Classification:
ℓ(f(x), y) = ln( Σ_{k∈[C]} e^{f_k(x) − f_y(x)} )
There may be other, more suitable options for the problem at hand, but these are the most popular for supervised problems.
To create non-linearity, we can use some nonlinear (differentiable) activation function (a short code sketch follows this list):
• Rectified Linear Unit (ReLU): h(a) = max{0, a}
• Sigmoid function: h(a) = 1 / (1 + e^{−a})
• Tanh: h(a) = (e^a − e^{−a}) / (e^a + e^{−a})
• many more
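A short sketch of these activations and their derivatives (the derivatives will be needed later for backprop; the names are our own):

import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def relu_grad(a):
    return (a > 0).astype(float)      # (sub)gradient of max{0, a}

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_grad(a):
    s = sigmoid(a)
    return s * (1.0 - s)

def tanh(a):
    return np.tanh(a)                 # (e^a - e^-a) / (e^a + e^-a)

def tanh_grad(a):
    return 1.0 - np.tanh(a) ** 2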
2.2 Representation: Defining neural networks
For a linear model, h(a) = a.
Linear model as a one-layer neural network:
Figure 13.2 from PML
W ∈ R^{4×3}, h : R^4 → R^4 so h(a) = (h_1(a_1), h_2(a_2), h_3(a_3), h_4(a_4))
Can think of this as a nonlinear mapping: φ(x) = h(Wx)
Adding a layer
We now have a network:
• each node is called a neuron
• h is called the activation function
• can use h(a) = 1 for one neuron in each layer to
incorporate bias term
• output neuron can use h(a) = a
• #layers refers to #hidden layers (plus 1 or 2 for the input/output layers)
• deep neural nets can have many layers and millions of
parameters
• this is a feedforward, fully connected neural net; there are many variants (convolutional nets, residual nets, recurrent nets, etc.)
Putting things together: a neural network
An L-layer neural net can be written as
f(x) = h_L(W_L h_{L−1}(W_{L−1} · · · h_1(W_1 x))).
• W_ℓ ∈ R^{d_ℓ × d_{ℓ−1}} is the weight matrix between layer ℓ − 1 and layer ℓ
• d_0 = d, d_1, . . . , d_L are the numbers of neurons at each layer
• a_ℓ ∈ R^{d_ℓ} is the input to layer ℓ
• o_ℓ ∈ R^{d_ℓ} is the output of layer ℓ
• h_ℓ : R^{d_ℓ} → R^{d_ℓ} is the activation function at layer ℓ
Now, for a given input x, we have recursive relations:
o_0 = x,  a_ℓ = W_ℓ o_{ℓ−1},  o_ℓ = h_ℓ(a_ℓ),  (ℓ = 1, . . . , L).
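A direct transcription of these recursions (a sketch under our own naming: Ws is the list [W_1, ..., W_L] and hs the list [h_1, ..., h_L]):

import numpy as np

def forward(x, Ws, hs):
    """Return the outputs o_0, ..., o_L and pre-activations a_1, ..., a_L."""
    outputs, preacts = [x], []          # o_0 = x
    for W, h in zip(Ws, hs):
        a = W @ outputs[-1]             # a_l = W_l o_{l-1}
        outputs.append(h(a))            # o_l = h_l(a_l)
        preacts.append(a)
    return outputs, preacts             # cached for use in backprop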
2.3 Optimization
Our optimization problem is to minimize,
F(W_1, . . . , W_L) = (1/n) Σ_{i=1}^n F_i(W_1, . . . , W_L), where

F_i(W_1, . . . , W_L) = ‖f(x_i) − y_i‖²_2 for regression, or
F_i(W_1, . . . , W_L) = ln( Σ_{k∈[C]} e^{f_k(x_i) − f_{y_i}(x_i)} ) for classification.
How to solve this? Apply SGD!
To compute the gradient efficiently, we use backpropagation. More on this soon.
2.4 Generalization
Overfitting is a concern for such a complex model, but there are ways to handle it.
For example, we can add ℓ2 regularization.

ℓ2 regularization: minimize

G(W_1, . . . , W_L) = F(W_1, . . . , W_L) + λ Σ_{w: all weights in the network} w²
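In terms of the gradient, the regularizer simply adds 2λw to the gradient of every weight ("weight decay"); a one-function sketch of the resulting SGD step (eta and lam are illustrative names for the step size and λ):

import numpy as np

def sgd_step_with_l2(W, grad_W, eta, lam):
    # gradient of G = F + lam * sum(w^2) is grad_W + 2 * lam * W
    return W - eta * (grad_W + 2.0 * lam * W)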
http://playground.tensorflow.org/
Neural Networks: Diving deeper
Universal approximation theorem (Cybenko, 89; Hornik, 91):
A feedforward neural net with a single hidden layer can approximate any continuous function arbitrarily well.
It might need a huge number of neurons though, and depth helps!
Choosing the network architecture is important.
• for feedforward network, need to decide number of hidden layers, number of
neurons at each layer, activation functions, etc.
Designing the architecture can be complicated, though various standard choices exist.
3.1 Representation: Very powerful function class!
3.2 Optimization: Computing gradients efficiently using Backprop
To run SGD, need gradients of Fi(W 1, . . . ,W L) with respect to all the weights in all
the layers. How do we get the gradient?
Here’s a naive way to compute gradients. For some function F (w) of a univariate
parameter w,
F′(w) ≈ [F(w + ε) − F(w − ε)] / (2ε) for a small ε > 0.
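A sketch of this finite-difference estimate; it is handy for checking a backprop implementation on a few parameters, but it needs two loss evaluations per parameter, which is exactly why backprop is preferred (loss_fn, eps, and the example are our own choices):

import numpy as np

def numerical_grad(loss_fn, w, eps=1e-5):
    """Central-difference estimate of the gradient of loss_fn at a parameter vector w."""
    g = np.zeros_like(w)
    for j in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[j] += eps
        w_minus[j] -= eps
        g[j] = (loss_fn(w_plus) - loss_fn(w_minus)) / (2.0 * eps)
    return g

# example: F(w) = ||w||^2 has gradient 2w
w = np.array([1.0, -2.0, 0.5])
assert np.allclose(numerical_grad(lambda v: np.sum(v**2), w), 2 * w, atol=1e-4)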
Backpropagation: a very efficient way to compute gradients of neural networks using an application of the chain rule (similar to dynamic programming).
Chain rule:
• for a composite function f(g(w)): df/dw = (df/dg)(dg/dw)
• for a composite function f(g_1(w), . . . , g_d(w)): df/dw = Σ_{i=1}^d (∂f/∂g_i)(dg_i/dw)
• the simplest example: f(g_1(w), g_2(w)) = g_1(w) g_2(w), where df/dw = g_1′(w) g_2(w) + g_1(w) g_2′(w)
Backprop: Intuition
Drop the subscript ℓ for layer for simplicity. For this derivation, refer to the loss function as F_m (instead of F_i) for convenience.
Find the derivative of F_m w.r.t. w_{ij}.
Backprop: Derivation
Adding the subscript for layer:

∂F_m/∂w_{ℓ,ij} = (∂F_m/∂a_{ℓ,i}) o_{ℓ−1,j},  with  ∂F_m/∂a_{ℓ,i} = ( Σ_k (∂F_m/∂a_{ℓ+1,k}) w_{ℓ+1,ki} ) h′_{ℓ,i}(a_{ℓ,i}).

For the last layer, for the square loss,

∂F_m/∂a_{L,i} = ∂(h_{L,i}(a_{L,i}) − y_m)² / ∂a_{L,i} = 2(h_{L,i}(a_{L,i}) − y_m) h′_{L,i}(a_{L,i}).
Exercise: try to do it for logistic loss yourself.
Using matrix notation greatly simplifies presentation and implementation:
∂F_m/∂a_ℓ = (W_{ℓ+1}^T ∂F_m/∂a_{ℓ+1}) ◦ h′_ℓ(a_ℓ) if ℓ < L,  and  ∂F_m/∂a_L = 2(h_L(a_L) − y_m) ◦ h′_L(a_L),

so that ∂F_m/∂W_ℓ = (∂F_m/∂a_ℓ) o_{ℓ−1}^T,

where v_1 ◦ v_2 = (v_{11}v_{21}, · · · , v_{1d}v_{2d}) is the element-wise product (a.k.a. Hadamard product). Verify yourself!

The backpropagation algorithm (Backprop)

Initialize W_1, . . . , W_L randomly. Repeat:
1. randomly pick one data point i ∈ [n]
2. forward propagation: for each layer ℓ = 1, . . . , L
• compute a_ℓ = W_ℓ o_{ℓ−1} and o_ℓ = h_ℓ(a_ℓ) (with o_0 = x_i)
3. backward propagation: for each ℓ = L, . . . , 1
• compute ∂F_i/∂a_ℓ = (W_{ℓ+1}^T ∂F_i/∂a_{ℓ+1}) ◦ h′_ℓ(a_ℓ) if ℓ < L, and ∂F_i/∂a_L = 2(h_L(a_L) − y_i) ◦ h′_L(a_L)
• update weights W_ℓ ← W_ℓ − η (∂F_i/∂a_ℓ) o_{ℓ−1}^T
(Important: should W_ℓ be overwritten immediately in the last step?)
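A sketch of one Backprop update for the square loss, following the matrix form above (vector input and output; Ws, hs, h_grads are lists of per-layer weight matrices, activations, and activation derivatives, e.g. the relu/relu_grad pair sketched earlier; all names are our own):

import numpy as np

def backprop_step(x, y, Ws, hs, h_grads, eta):
    """One SGD step on F_i = ||f(x) - y||^2 for an L-layer fully connected net."""
    # forward propagation, caching o_0, ..., o_L and a_1, ..., a_L
    o, a = [x], []
    for W, h in zip(Ws, hs):
        a.append(W @ o[-1])
        o.append(h(a[-1]))
    L = len(Ws)
    # backward propagation: delta = dF_i/da_l, starting from the last layer
    delta = 2.0 * (o[L] - y) * h_grads[L - 1](a[L - 1])
    grads = [None] * L
    for l in range(L - 1, -1, -1):
        grads[l] = np.outer(delta, o[l])                           # dF_i/dW_{l+1} = delta o_l^T
        if l > 0:
            delta = (Ws[l].T @ delta) * h_grads[l - 1](a[l - 1])   # dF_i/da_l, using the old W_{l+1}
    for l in range(L):                                             # update only after the backward pass
        Ws[l] -= eta * grads[l]
    return Ws

This also suggests an answer to the question above: the backward recursion for ∂F_i/∂a_ℓ uses W_{ℓ+1}, so the weights should not be overwritten before that quantity has been computed.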
Non-saturating activation functions (see Figure 13.2 from PML).

Modern networks are huge, and training can take time

From https://openai.com/blog/ai-and-compute/: "...since 2012, the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.4-month doubling time (by comparison, Moore's Law had a 2-year doubling period). Since 2012, this metric has grown by more than 300,000x (a 2-year doubling period would yield only a 7x increase)."

(Y-axis of that plot: petaflop/s-days (pfs-day); one pfs-day consists of performing 10^15 neural net operations per second for one day, or a total of about 10^20 operations.)

See also Scaling Laws for Neural Language Models [Kaplan et al. '20] and https://huggingface.co/blog/large-language-models for the growth of model sizes.

Optimization: Variants on SGD

• mini-batch: randomly sample a batch of examples to form a stochastic gradient (common batch size: 32, 64, 128, etc.)
• adaptive learning rate tuning: choose a different learning rate for each parameter (and vary this across iterations), based on the magnitude of previous gradients for that parameter (used in Adagrad, RMSProp)
• momentum: add a "momentum" term to encourage the model to continue along the previous gradient direction
• many other variants and tricks, such as batch normalization: normalize the inputs of each layer over the mini-batch (to zero mean and unit variance; like we did in HW1)

Mini-batch

Consider F(w) = (1/n) Σ_{i=1}^n F_i(w), where F_i(w) is the loss function for the i-th datapoint. Recall that ∇F̃(w) is a stochastic gradient of F(w) if E[∇F̃(w)] = ∇F(w).

Mini-batch SGD (also known as mini-batch GD): sample S ⊆ {1, . . . , n} at random, and estimate the average gradient over this batch of |S| samples:

∇F̃(w) = (1/|S|) Σ_{i∈S} ∇F_i(w)

Common batch size: 32, 64, 128, etc.

Adaptive learning rate tuning

"The learning rate is perhaps the most important hyperparameter. If you have time to tune only one hyperparameter, tune the learning rate." — Deep Learning (book by Goodfellow, Bengio, Courville)

We often use a learning rate schedule (see PML for some common schedules). Adaptive learning rate methods (Adagrad, RMSProp) scale the learning rate of each parameter based on some moving average of the magnitude of its gradients.

Momentum

"Move faster along directions that were previously good, and slow down along directions where the gradient has suddenly changed, just like a ball rolling downhill." [PML]

Initialize w_0 and (velocity) v = 0. For t = 1, 2, . . .
• estimate a stochastic gradient g
• update v ← βv + g for some discount factor β ∈ (0, 1)
• update the weights w_t ← w_{t−1} − η v

Updates for the first few rounds:
• w_1 = w_0 − η g_1
• w_2 = w_1 − βη g_1 − η g_2
• w_3 = w_2 − β²η g_1 − βη g_2 − η g_3

See https://distill.pub/2017/momentum/ for an interactive exposition.

3.3 Generalization: Preventing Overfitting

Overfitting can be a major concern since neural nets are very powerful. Methods to overcome overfitting:
• data augmentation
• regularization
• early stopping

Preventing overfitting: Data augmentation

The best way to prevent overfitting? Get more samples. What if you cannot get access to more samples? Exploit prior knowledge to add more training data (e.g., transformed copies of existing examples).

Preventing overfitting: Regularization & Dropout

We can use regularization techniques such as ℓ2 regularization:

G(W_1, . . . , W_L) = F(W_1, . . . , W_L) + λ Σ_{w: all weights in the network} w²

A very popular technique is Dropout. Here, we independently delete each neuron with a fixed probability (say 0.1) during each iteration of Backprop (only for training, not for testing). Very effective and popular in practice!

Preventing overfitting: Early stopping

Stop training when the performance on the validation set stops improving.

(Figure 7.3 from Goodfellow et al., Deep Learning: learning curves of a maxout network on MNIST; the training loss decreases consistently over time, but the validation loss eventually begins to increase again, forming an asymmetric U-shaped curve.)

We can therefore obtain a model with better validation error (and thus, hopefully, better test error) by returning to the parameter setting at the point in time with the lowest validation error: every time the error on the validation set improves, we store a copy of the parameters.
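A minimal sketch of early stopping wrapped around a generic training loop (train_one_epoch, validation_loss, and the patience parameter are placeholders, not from the slides):

import copy

def train_with_early_stopping(model, train_one_epoch, validation_loss,
                              max_epochs=200, patience=10):
    best_loss, best_model, since_best = float("inf"), copy.deepcopy(model), 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val = validation_loss(model)
        if val < best_loss:                    # validation improved: store a copy of the parameters
            best_loss, best_model, since_best = val, copy.deepcopy(model), 0
        else:
            since_best += 1
            if since_best >= patience:         # no improvement for `patience` epochs: stop
                break
    return best_model                          # the parameters with the lowest validation loss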