Accelerate updates in the relevant direction and dampen oscillations
Penalize the global learning rate, per parameter, with G_t,ii (the accumulated sum of past squared gradients)
exponentially decaying average over past squared gradients
Consider both the gradient and the learning rate
1. Exponentially decaying average over past squared gradients; 2. the update is rescaled by a decaying average of past squared updates, keeping the units consistent (no global learning rate needed).
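A minimal NumPy sketch of the per-parameter update rules summarized above; the hyperparameter values, function names, and state-passing convention are illustrative assumptions, not taken from the notes.

    import numpy as np

    lr, eps = 0.01, 1e-8

    def adagrad_step(w, g, G):
        # Adagrad: accumulate all past squared gradients and use them to
        # penalize the global learning rate per parameter
        G = G + g * g
        return w - lr * g / (np.sqrt(G) + eps), G

    def rmsprop_step(w, g, Eg2, rho=0.9):
        # RMSProp: exponentially decaying average over past squared gradients
        Eg2 = rho * Eg2 + (1 - rho) * g * g
        return w - lr * g / (np.sqrt(Eg2) + eps), Eg2

    def adam_step(w, g, m, v, t, beta1=0.9, beta2=0.999):
        # Adam: decaying average of gradients (momentum) plus decaying average of
        # squared gradients (adaptive learning rate), with bias correction
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)   # t is the step count, starting at 1
        v_hat = v / (1 - beta2 ** t)
        return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v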
Go through the entire dataset to compute each gradient update
Go through one mini batch at a time for each update
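A minimal sketch of a mini-batch SGD training loop; grad_fn, the batch size, and the learning rate are illustrative placeholders, and setting batch_size to the full dataset size recovers plain gradient descent.

    import numpy as np

    def sgd(X, y, w, grad_fn, lr=0.01, batch_size=32, epochs=10):
        # Mini-batch SGD: shuffle once per epoch, then update on one mini batch at a time
        n = X.shape[0]
        for _ in range(epochs):
            order = np.random.permutation(n)
            for start in range(0, n, batch_size):
                batch = order[start:start + batch_size]
                w = w - lr * grad_fn(w, X[batch], y[batch])
        return w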
Limiting the growth of the weights in the network.
Early stopping rules provide guidance as to how many iterations can be run before the
learner begins to over-fit; the algorithm is stopped at that point.
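A sketch of a patience-based early stopping rule; fit_one_epoch and validation_loss are hypothetical callables assumed to be supplied by the training code.

    def train_with_early_stopping(fit_one_epoch, validation_loss, patience=5, max_epochs=200):
        # Stop training once the validation loss has not improved for `patience` epochs
        best, wait = float("inf"), 0
        for _ in range(max_epochs):
            fit_one_epoch()
            loss = validation_loss()
            if loss < best:
                best, wait = loss, 0        # still improving: reset the counter
            else:
                wait += 1
                if wait >= patience:        # likely starting to over-fit: stop here
                    break
        return best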
A regularization technique for reducing overfitting in neural networks by preventing
complex co-adaptations on training data. It is a very efficient way of performing model
averaging with neural networks. The term “dropout” refers to dropping out units (both
hidden and visible) in a neural network.
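A minimal sketch of (inverted) dropout applied to an activation array; the drop probability p = 0.5 is an illustrative choice.

    import numpy as np

    def dropout(activations, p=0.5, train=True):
        # Inverted dropout: randomly zero a fraction p of the units during training and
        # rescale the survivors so the expected activation is unchanged
        if not train:
            return activations            # test time: use the full network (model averaging)
        mask = (np.random.rand(*activations.shape) > p) / (1.0 - p)
        return activations * mask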
Artificially enlarge the dataset using label-preserving transformations
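A small sketch of two common label-preserving transformations (random horizontal flip and random crop) for an H×W×C image array; the crop size is an arbitrary example value.

    import numpy as np

    def augment(img, crop_size=24):
        # Label-preserving transformations: random horizontal flip + random crop
        if np.random.rand() < 0.5:
            img = img[:, ::-1, :]                          # flip along the width axis
        h, w = img.shape[:2]
        top = np.random.randint(0, h - crop_size + 1)
        left = np.random.randint(0, w - crop_size + 1)
        return img[top:top + crop_size, left:left + crop_size, :]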
A normalization method/layer for neural networks. BN reduces internal covariate shift.
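A sketch of the training-time batch-normalization forward pass for a (batch, features) array; the running statistics used at inference time are omitted for brevity.

    import numpy as np

    def batchnorm_forward(x, gamma, beta, eps=1e-5):
        # Normalize each feature over the mini batch, then scale and shift
        # with the learned parameters gamma and beta
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mu) / np.sqrt(var + eps)
        return gamma * x_hat + beta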
You cannot do this: with all-zero weights every unit computes the same output and receives the same gradient (symmetry is never broken), and the gradients propagated backwards vanish.
In this network, the information moves in only one direction, forward, from the input nodes,
through the hidden nodes (if any) and to the output nodes. There are no cycles or loops in
the network.
The inputs are fed directly to the outputs via a series of weights.
This class of networks consists of multiple layers of computational units, usually
interconnected in a feed-forward way. Each neuron in one layer has directed connections to
the neurons of the subsequent layer. In many applications the units of these networks apply
a sigmoid function as an activation function.
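A minimal sketch of such a feed-forward pass with sigmoid activations; the layer sizes and random weights are illustrative only.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def mlp_forward(x, layers):
        # Information flows in one direction only: each layer feeds the next, no cycles
        a = x
        for W, b in layers:
            a = sigmoid(a @ W + b)
        return a

    # Example: 2 inputs -> 3 hidden units -> 1 output
    rng = np.random.default_rng(0)
    layers = [(rng.standard_normal((2, 3)), np.zeros(3)),
              (rng.standard_normal((3, 1)), np.zeros(1))]
    y = mlp_forward(np.array([[0.5, -1.0]]), layers)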
the region of the input space that
affects a particular unit of the network
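The receptive field of a unit can be computed with the standard recursion sketched below (layer list and names are illustrative); note how three stacked 3×3 convolutions already cover a 7×7 input region, which is the reasoning behind the "stack of small filters" point in the VGG notes further down.

    def receptive_field(layers):
        # layers: list of (kernel_size, stride) from the input towards the unit of interest
        r, j = 1, 1                  # receptive field size and "jump" between adjacent units
        for k, s in layers:
            r += (k - 1) * j
            j *= s
        return r

    print(receptive_field([(3, 1), (3, 1), (3, 1)]))   # -> 7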
ReLU, Data Augmentation, Dropout, Local Response
Normalization, Overlapping Pooling; 11×11 first-layer filters; 8
layers (5 convolutional + 3 fully connected)
7×7 Filter
3×3 Filter; Plain network structure; Prefer a
stack of small filters; 16 or 19 layers
22 layers; Inception
module; 1×1 filter
all 3×3 conv (almost); use network layers to fit a
residual mapping instead of directly trying to fit
a desired underlying mapping
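A minimal sketch of a residual block, using fully connected layers in place of the 3×3 convolutions for brevity; W1 and W2 are assumed to preserve the feature dimension so the skip addition is valid.

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def residual_block(x, W1, W2):
        # Fit the residual F(x) = H(x) - x; the output is F(x) + x, so the identity
        # shortcut carries the signal (and the gradient) through unchanged
        f = relu(x @ W1)
        f = f @ W2
        return relu(f + x)           # skip connection: add the input back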
An extreme version of inception module; one
spatial convolution per output channel of the 1×1
convolution
decrease depth and increase width of residual
networks
Repeating a building block that aggregates a set of
transformations with the same topology.
Each layer has direct access to the gradients from
the loss function and the original input signal
The outputs of the hidden layer are stored
in the memory; the memory can be
considered as another input.
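A minimal sketch of a vanilla recurrent step, where the stored hidden state re-enters the cell as an extra input; names and shapes are illustrative.

    import numpy as np

    def rnn_step(x_t, h_prev, Wx, Wh, b):
        # The stored hidden state ("memory") enters the cell as another input
        return np.tanh(x_t @ Wx + h_prev @ Wh + b)

    def rnn_forward(xs, h0, Wx, Wh, b):
        h, outputs = h0, []
        for x_t in xs:               # one step per element of the sequence
            h = rnn_step(x_t, h, Wx, Wh, b)
            outputs.append(h)        # hidden outputs kept as the memory trace
        return outputs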
If we are only interested in encoding the input (e.g., training a language-model-style
representation of the input to be used by some other task), then we do not need the Decoder
of the transformer; that gives us BERT.
If we have no separate input and just want to model the “next word”, we can get rid of the Encoder
side of the transformer and output the “next word” one at a time. This gives us GPT.
Chebyshev expansion
No eigen decomposition
Gradient vanishing
A VAE is an autoencoder whose encoding distribution is regularised during training in
order to ensure that its latent space has good properties, allowing us to generate some new
data.
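A sketch of the two VAE-specific pieces, the reparameterization trick and the KL regularizer towards a standard normal prior; NumPy, with illustrative names and log_var denoting log sigma^2.

    import numpy as np

    def reparameterize(mu, log_var):
        # Sample z ~ N(mu, sigma^2) as z = mu + sigma * eps so sampling stays differentiable
        eps = np.random.standard_normal(mu.shape)
        return mu + np.exp(0.5 * log_var) * eps

    def kl_to_standard_normal(mu, log_var):
        # The regularizer: pull the encoder's distribution towards N(0, I), which is
        # what gives the latent space the "good properties" needed for generation
        return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))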
Autoencoders can reconstruct data, and can learn
features to initialize a supervised model. Features
capture factors of variation in training data.
Reduce object detection to image classification (classify region proposals)
ROI Pooling
Region Proposal Network
ROI Align
A transformer uses the Encoder stack to model the input, and uses the Decoder stack to model the output
(using input information from the encoder side).
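A minimal sketch of the scaled dot-product attention that both stacks are built from, for single-head 2-D query/key/value matrices; the mask argument is an illustrative way to express the decoder's causal masking.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V, mask=None):
        # Each query is compared with every key; the resulting weights mix the values
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)
        if mask is not None:
            scores = np.where(mask, scores, -1e9)   # e.g. causal mask on the decoder side
        return softmax(scores) @ V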
Deep Learning
Feed Forward
CNN
RNN
LSTM
Transformer
GCN
GAN
VAE
CV Applications
Regularisation
Optimization
Gradient Descent
Stochastic Gradient Descent
Gradient
Learning Rate
Adagrad
Momentum
NAG
RMSProp
AdaDelta
Adam
Weight Decay
Early Stopping
Dropout
Add Noise
Noise to the input
Noise to the weights
Data Augmentation
Batch Normalization
Initialization
Optimizers
All Zero Initialization
Calibrating the Variances
Xavier
Kaiming
1. The mean of the activations should be zero. 2. The variance of the activations
should stay the same across every layer.
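A sketch of the two initialization schemes implied by these criteria; the exact scaling conventions vary across references, so the factors below are one common choice.

    import numpy as np

    def xavier_init(fan_in, fan_out):
        # Keeps the activation variance roughly constant across layers (tanh/sigmoid units);
        # another common variant uses 2 / (fan_in + fan_out)
        return np.random.randn(fan_in, fan_out) * np.sqrt(1.0 / fan_in)

    def kaiming_init(fan_in, fan_out):
        # Same idea corrected for ReLU, which zeroes roughly half of the activations
        return np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)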
Single-Layer Perceptron
Multi-Layer Perceptron
Convolution
Pooling
Stride
Padding
Receptive field
Architectures
AlexNet
ZFNet
GoogLeNet
ResNet
VGGNet
GoogLeNet variants
Xception
Wide ResNet
ResNeXt
DenseNet
Efficient CNNs
SqueezeNet
MobileNet
ShuffleNet
Vanishing Gradients
Exploding Gradients
Prevent vanishing gradient
Gated Recurrent Unit
Reset Gate
Update Gate (forget gate + input gate)
Self-Attention
Multi-head
BERT
GPT
query
key
value
Spatial
Spectral
JS Divergence
Wasserstein GANs
Auto-encoder
Detection
RCNN
Fast-RCNN
Faster-RCNN
Mask-RCNN
Deep Neural Networks for different types of data (e.g., image, sequential data and graph data) and
different tasks (image classification, image generation, word embedding, translation and generation, etc.)
How to (better) train deep neural networks.