
Machine Learning
Lecture 8: Deep Learning II
Prof. Dr. Günnemann
Data Analytics and Machine Learning Group, Technische Universität München


Winter term 2020/2021

Structured data
Training deep neural networks
Deep learning frameworks
Modern architectures & tricks

Structured data

Different layers
So far we’ve seen only fully-connected feed-forward layers and our (deep) NNs were obtained by stacking them.
There are many more types of layers for specific tasks/data:
• Convolution layer (typically used for images)
• Recurrent layer (typically used for sequences, discussed in-depth in our MLGS lecture)
• Graph convolutional layers (covered in our MLGS lecture)
These layers leverage the known structure of data and provide an
inductive bias.
Think of them as building blocks (i.e. lego pieces) that you can compose.

Neural networks for images
from https://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/
Suppose we have an image with 100 × 100 pixels and want to process it with a neural network with a single hidden layer with 1,000 units.
In a feed-forward neural network this results in 10 million parameters (weights)!
We can solve this problem by using the convolution operation to build neural networks. This exploits the high local correlation of pixel values in natural images.
For example, 1,000 (learnable) 5×5 convolutional filters only require 25,000 parameters.
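As a sanity check, here is a minimal sketch of the two parameter counts (assuming PyTorch; the layer sizes are the ones from this slide):

```python
import torch.nn as nn

# Fully connected: 100x100 pixel input, 1,000 hidden units
fc = nn.Linear(100 * 100, 1000, bias=False)
print(sum(p.numel() for p in fc.parameters()))    # 10,000,000 weights

# 1,000 learnable 5x5 filters on a single input channel
conv = nn.Conv2d(in_channels=1, out_channels=1000, kernel_size=5, bias=False)
print(sum(p.numel() for p in conv.parameters()))  # 25,000 weights
```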

CNN: Convolution
Continuous convolution is defined as
(x ∗ k)(t) = ∫ x(τ) k(t − τ) dτ.
This can be intuitively described as a weighted average of the input signal x using the weights (or kernel, filter) k at each point in time t.
CNNs use the discrete variant
(x ∗ k)(t) = Σ_{τ=−∞}^{∞} x(τ) k(t − τ).

CNN: Convolution
CNNs for images are based on a 2D convolution
(x ∗ k)(i, j) = Σ_l Σ_m x(l, m) k(i − l, j − m)
However, when talking about convolution in CNNs we usually mean cross-correlation.
This is what is usually implemented in libraries. It is similar to convolution, but the summation indices are swapped:
x̂(i, j) = Σ_{l=1}^{L} Σ_{m=1}^{M} x(i + l, j + m) k(l, m)
Convolutions usually act on multiple channels. Filters therefore have L × M × Cin × Cout parameters.
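A minimal sketch of the 2D cross-correlation above for a single channel (plain NumPy, valid region only; the function name is ours, not from any library):

```python
import numpy as np

def cross_correlate2d(x, k):
    """x_hat(i, j) = sum_{l,m} x(i+l, j+m) * k(l, m) over the valid region."""
    H, W = x.shape
    L, M = k.shape
    out = np.zeros((H - L + 1, W - M + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + L, j:j + M] * k)
    return out

x = np.arange(25.0).reshape(5, 5)     # toy 5x5 "image"
k = np.ones((3, 3)) / 9.0             # 3x3 averaging filter
print(cross_correlate2d(x, k).shape)  # (3, 3)
```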

CNN: Convolution
Weights of the sliding window (kernel) are shared for every patch.
from http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution

CNN: Convolution
http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/

CNN: Padding
Inputs are finite. What happens at the boundary?
We can either reduce the output size (by not applying the filter at the boundary) or pad the input (e.g. with zeros, constant values, or by reflecting or repeating the image).
Padding schemes (the resulting output sizes are sketched in the example below):
• VALID: Do not use padding; the output size shrinks to D_{l+1} = (D_l − K) + 1, where D_l is the input size along a dimension and K is the kernel width (the width of its "receptive field").
• SAME (half) padding: Add padding so that the input size is preserved, i.e. add P = ⌊K/2⌋ values on each side.
• FULL: Add K − 1 values on each side, increasing the output size.
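A small sketch of the resulting output sizes for stride 1 (the helper function and the values D = 100, K = 5 are illustrative assumptions, not from any library):

```python
def output_size(D, K, scheme):
    """Output size along one dimension for kernel width K (odd K assumed for SAME)."""
    P = {"valid": 0, "same": K // 2, "full": K - 1}[scheme]
    return D + 2 * P - K + 1

for scheme in ["valid", "same", "full"]:
    print(scheme, output_size(D=100, K=5, scheme=scheme))  # 96, 100, 104
```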

CNN: Strides
A stride S is the distance between the positions at which the kernel is applied. This changes the output size to
D_{l+1} = ⌊(D_l + 2P − K) / S⌋ + 1
Strides S > 1 are a way of downsampling the signal. They can be used to handle inputs that differ in size.
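A quick check of this formula against a framework convolution (a sketch assuming PyTorch; the concrete values of D, K, P, S are arbitrary):

```python
import torch
import torch.nn as nn

D, K, P, S = 100, 5, 2, 3                      # arbitrary example values
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=K, padding=P, stride=S)
out = conv(torch.randn(1, 3, D, D))            # batch of one RGB-like image

print(out.shape[-1])                           # 34
print((D + 2 * P - K) // S + 1)                # 34, matching the formula
```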

CNN: Pooling
Calculate summary statistics in a sliding window. This introduces invariance to small movements.
Variants: Max pooling, mean pooling, Lp norm pooling
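A minimal pooling example (assuming PyTorch; max and mean pooling over 2×2 windows of a 4×4 input):

```python
import torch
import torch.nn.functional as F

x = torch.arange(1., 17.).view(1, 1, 4, 4)     # one 4x4 single-channel input

print(F.max_pool2d(x, kernel_size=2))          # [[6, 8], [14, 16]]
print(F.avg_pool2d(x, kernel_size=2))          # [[3.5, 5.5], [11.5, 13.5]]
```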

CNN: Convolutional Neural Network

Architectures for other types of structured data
Sequential data (e.g., text, time series)
• Recurrent neural networks (RNNs)
• Transformers
Graph data (e.g., social networks, molecules)
• Graph neural networks (GNNs)
More about these topics in IN2323 in the Summer semester!

Training deep neural networks

Training deep neural networks
Training a neural network means optimizing the loss with SGD, Adam, AMSgrad, or some other optimizer.
From which point do we start optimizing? Weight initialization can be crucial for successful training.
There are two essential issues with naïve weight initialization:
1. Weight symmetry
2. Weight scale

Weight symmetry
If two hidden units have exactly the same bias and exactly the same incoming and outgoing weights, they will always get exactly the same gradient.
• So they can never learn to extract different features.
• We break symmetry by initializing the weights to have small random values.

Weight scale
If a hidden unit has a big fan-in, small changes on many of its incoming weights can cause learning to overshoot (take a huge update step).
If it has a large fan-out, the same thing can happen during the backward pass.
Wrong weight scales (mean and variance) can therefore lead to vanishing or exploding gradients.

Xavier initialization
Q: How can we preserve the mean and variance of an incoming i.i.d. signal with zero mean in both forward and backward pass?
A: By controlling the statistical moments of the weight distribution.
Mean and standard deviation of activation. Note saturation of layer 4.
Activation histogram. Top: Unnormalized. Bottom: Glorot init.
from Glorot, Bengio. Understanding the difficulty of training deep feedforward neural networks. 2010

Xavier initialization
We can preserve the signal’s properties by using weight matrices with zero mean and variance
Var(W) = 2 / (fan-in + fan-out).
This is Xavier (or Glorot) initialization. An example weight distribution fulfilling this is
W ∼ Uniform(−√(6 / (fan-in + fan-out)), +√(6 / (fan-in + fan-out))).
Note that the output layer for regression tasks is usually best initialized taking into account the target scale.
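A minimal sketch of Xavier/Glorot initialization in NumPy (the fan-in/fan-out values are arbitrary; frameworks ship this ready-made, e.g. torch.nn.init.xavier_uniform_):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out):
    """Uniform weights with zero mean and Var(W) = 2 / (fan_in + fan_out)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, size=(fan_in, fan_out))

W = xavier_uniform(784, 256)
print(W.var(), 2.0 / (784 + 256))  # both roughly 0.0019
```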

Vanishing and exploding gradients
Deep networks can suffer from the vanishing / exploding gradient problem.
Toy example: We multiply the input t times by a matrix W:
W = V diag(D) V⁻¹ ⇒ W^t = V (diag(D))^t V⁻¹
The gradient w.r.t. the elements of D will likely vanish or explode:
• if D_ii < 1.0, then D_ii^t will be near zero and the gradients will become small; the network will take a long time to converge
• if D_ii > 1.0, then D_ii^t will explode and the computations become unstable
Gradients also vanish when using saturating activations (e.g., sigmoid).
Solutions:
• Change the architecture (e.g., use batch norm, change activations)
• Gradient clipping (a sketch follows below)
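A sketch of gradient clipping with PyTorch (the model and hyperparameters are illustrative assumptions; clip_grad_norm_ rescales the gradients so their global norm stays bounded):

```python
import torch
import torch.nn as nn

# A deliberately deep stack of linear layers (toy setup)
model = nn.Sequential(*[nn.Linear(64, 64) for _ in range(20)])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(8, 64), torch.randn(8, 64)
loss = nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()
# Rescale all gradients so that their global L2 norm is at most 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```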

Regularization
Recall that models with high capacity (like NNs) are prone to overfitting. We need regularization to prevent this.
Typically, we use the familiar L2 parameter norm penalty (or weight decay, which is mostly equivalent). Sometimes we also use L1 norm to e.g. promote sparsity.
We can combine it with other regularization methods:
• Dataset augmentation: e.g. rotate/translate/skew/change lighting of images
• Injecting noise
• Parameter tying and sharing

Dropout
Let’s look at a neural network with two hidden layers.
Each time a training sample is processed, we randomly set each hidden unit to 0 with probability p (usually 0.5).
We are therefore randomly sampling from 2^H different architectures (where H is the number of hidden units), but they all share the same weights.
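A minimal sketch of (inverted) dropout in NumPy; the rescaling by 1/(1−p) is how common framework layers such as torch.nn.Dropout implement it, so nothing needs to change at test time:

```python
import numpy as np

def dropout(h, p=0.5, training=True):
    """Set each unit to 0 with probability p; rescale survivors by 1/(1-p)
    (inverted dropout) so the expected activation stays unchanged."""
    if not training:
        return h
    mask = (np.random.rand(*h.shape) >= p).astype(h.dtype)
    return h * mask / (1.0 - p)

h = np.random.randn(4, 10)   # activations of one hidden layer, batch of 4
print(dropout(h, p=0.5))
```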

Hyperparameter optimization
To squeeze every bit of performance out of your NN, you need to tune:
• number of hidden layers (1, 2, 3, ...)
• number of hidden units (50, 100, 200, ...)
• type of activation function (sigmoid, ReLU, Swish, ...)
• optimizer (SGD, Adam, ADADELTA, Rprop, ...)
• learning rate schedule (warmup, decay, cyclic)
• data preprocessing/augmentation
We often start by finding these by “playing around” with some reasonable estimates. A better way of finding a good set is hyperparameter optimization; random search (sketched below) and Bayesian optimization are both viable candidates.
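A sketch of random search over a hypothetical search space; train_and_evaluate is a placeholder for your own training/validation routine:

```python
import numpy as np

def train_and_evaluate(cfg):
    """Placeholder: train a model with cfg and return its validation score."""
    return np.random.rand()   # stand-in; replace with a real training run

def sample_config(rng):
    # Hypothetical search space mirroring the list above
    return {
        "num_layers": int(rng.choice([1, 2, 3])),
        "hidden_units": int(rng.choice([50, 100, 200])),
        "learning_rate": 10 ** rng.uniform(-4, -1),   # log-uniform
    }

rng = np.random.default_rng(0)
configs = [sample_config(rng) for _ in range(20)]
best = max(configs, key=train_and_evaluate)
print(best)
```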

Side note: gradient-based hyperparameter optimization
Instead of “searching” for good hyperparameters, can’t we… learn them, e.g., with gradient descent? It turns out we actually can (for continuous hyperparameters)!
We can backpropagate through the training procedure to compute the gradient of the final loss w.r.t. hyperparameters, e.g. the learning rate. We can then perform a ‘meta update’ on the hyperparameter and repeat.
Some researchers even use this technique to learn the initial weights to enable a model to adapt to different tasks with few training examples (“few-shot learning”).
For large neural networks this is however extremely expensive since we have to store all parameters at each training iteration.
from arxiv.org/pdf/1502.03492.pdf

Deep learning frameworks

Deep learning frameworks
Programming your own NN in Python is quite simple, but not efficient. For efficient code, use one of many open source libraries, e.g.
• TensorFlow
• PyTorch
You will then find a number of implementations based on such libraries.

Static vs. dynamic computation graphs
Deep learning frameworks build a computation graph that defines in which order the operations are performed.
from https://medium.com/intuitionmachine/pytorch-dynamic-computational-graphs-and-modular-deep-learning-7e7f89f18d1

Static vs. dynamic computation graphs
Deep learning frameworks build a computation graph, primarily for calculating gradients.
In the static variant we first define the computational graph to later execute it with actual data (“Define-and-Run”).
In the dynamic variant the graph is defined by executing the desired operations (“Define-by-Run”).
Static computation graphs can be optimized by the framework, similar to a compiler optimizing source code. However, we cannot change a static graph at runtime.
One example is RNNs: with a static computation graph, an RNN gets explicitly unrolled for a specified number of time steps. This means that it cannot process sequences of varying length.
The major frameworks have now moved to dynamic computation graphs, since they are more natural to work with. Static graphs can then be generated via JIT compilation of annotated functions.
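A small illustration (a sketch assuming PyTorch): the function below is defined by simply running Python code, and torch.jit.script compiles it into a static-ish graph representation while keeping its data-dependent control flow:

```python
import torch

def f(x):
    # Define-by-run: the graph is built while this Python code executes,
    # so data-dependent control flow is fine.
    if x.sum() > 0:
        return x * 2
    return -x

scripted = torch.jit.script(f)   # JIT-compile into a graph representation
x = torch.randn(3, requires_grad=True)
scripted(x).sum().backward()     # gradients still flow through the scripted function
print(x.grad)
```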

Modern architectures & tricks

Batch normalization
Widely used method for improving the training of deep neural networks.
Goal: Stabilize the distribution of each layer’s activations.
Ideal: Whiten each activation independently. This is too expensive at each training step, so we use minibatch statistics instead (or a moving average):
x̂ = (x − E_B[x]) / √(Var_B[x] + ε)
To maintain the model’s representational power we also introduce a scale parameter γ and an offset β. These are learned via backpropagation.
y = γ x̂ + β
This is done after each layer (before the non-linearity). x is the batch norm layer’s input, y its output
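A minimal sketch of the batch-norm forward pass for fully-connected activations (plain NumPy; the moving-average statistics used at test time are omitted):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature with minibatch statistics, then scale and shift.
    x: (batch_size, num_features); gamma, beta: (num_features,)"""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 4) * 5 + 3                       # unnormalized activations
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 and ~1 per feature
```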

Batch normalization: Why does it work?
Loss landscape and gradient change per optimization step. From: Santurkar et al. How Does Batch Normalization Help Optimization? 2018
Batch normalization smooths the loss landscape.
This causes more predictive and stable gradients, which helps training.
(Note: The original explanation based on internal covariate shift turned out to be wrong.)

Skip connections
Improve information flow in deep neural networks.
Highway networks: Add layer input to its output, weighted by gate T :
y = f(x, W) · T(x, W_T) + x · (1 − T(x, W_T))
Residual connections: Add previous layers’ output to a subsequent layer, skipping multiple (usually 2) layers. Usually no gating or other transformation.
Residual connection. From He et al. Deep Residual Learning for Image Recognition. 2016
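A minimal residual block sketch (assuming PyTorch; two linear layers plus an identity skip, loosely following the He et al. structure but without convolutions):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two layers f(x) plus an identity skip connection: y = f(x) + x."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return nn.functional.relu(self.f(x) + x)

x = torch.randn(8, 64)
print(ResidualBlock(64)(x).shape)   # torch.Size([8, 64])
```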

Tips and tricks
• Use only differentiable operations. Non-differentiable operations are e.g. arg max and sampling from a distribution (see our MLGS lecture).
• Always try to overfit your model to a single training batch or sample to make sure it is correctly ‘wired’ (see the sketch after this list).
• Start with small models and gradually add complexity while monitoring how the performance improves.
• Be aware of the properties of activation functions, e.g. no sigmoid output when doing regression.
• Monitor the training procedure and use early stopping.
• See also: A Recipe for Training Neural Networks https://karpathy.github.io/2019/04/25/recipe/
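A sketch of the “overfit a single batch” sanity check (assuming PyTorch; model, data, and hyperparameters are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(32, 20), torch.randn(32, 1)   # one fixed batch

for step in range(2000):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()

print(loss.item())   # should be close to zero if the model is wired correctly
```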

Summary
• CNNs: Convolution, padding, strides, pooling
• Weight initialization, regularization (dropout), hyperparameter optimization
• deep learning frameworks; static & dynamic computation graphs
• batch normalization, skip connections, tips & tricks

Reading material
• Goodfellow, Deep Learning: chapters 7, 8, 9, 11
• Karpathy, A Recipe for Training Neural Networks https://karpathy.github.io/2019/04/25/recipe/
