
Accelerate in the relevant direction and dampen oscillations
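A minimal NumPy-style sketch of the momentum update this note refers to (function and variable names are illustrative, not from the course material):

```python
def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """One momentum update: accumulate a velocity, then step along it."""
    v = beta * v + lr * grad   # velocity is a decaying sum of past gradients
    w = w - v                  # step in the accelerated (consistent) direction
    return w, v
```

NAG differs only in evaluating the gradient at the look-ahead point `w - beta * v`.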

Scale (penalize) the global learning rate per parameter by 1/sqrt(G_t,ii),
the accumulated sum of squared gradients, so frequently updated parameters take smaller steps
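A sketch of the Adagrad rule being described, assuming element-wise NumPy arrays (names are illustrative):

```python
import numpy as np

def adagrad_step(w, G, grad, lr=0.01, eps=1e-8):
    """Adagrad: divide the global learning rate by the root of accumulated squared gradients."""
    G = G + grad ** 2                          # G_t,ii: running sum of squared gradients
    w = w - lr * grad / (np.sqrt(G) + eps)     # frequently updated parameters get smaller steps
    return w, G
```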

exponentially decaying average over past squared gradients
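RMSProp replaces Adagrad's ever-growing sum with the decaying average mentioned here; a hedged sketch with illustrative names:

```python
import numpy as np

def rmsprop_step(w, s, grad, lr=0.001, rho=0.9, eps=1e-8):
    """RMSProp: exponentially decaying average of past squared gradients."""
    s = rho * s + (1 - rho) * grad ** 2        # decaying average instead of a full sum
    w = w - lr * grad / (np.sqrt(s) + eps)
    return w, s
```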

Consider both the gradient (a decaying average of past gradients, i.e. momentum) and the
per-parameter learning rate (a decaying average of past squared gradients)
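A sketch of the Adam update this note appears to describe; `t` is the 1-based step count, other names are illustrative:

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum-style first moment plus RMSProp-style second moment, with bias correction."""
    m = beta1 * m + (1 - beta1) * grad         # first moment: decaying average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2    # second moment: decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```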

1. An exponentially decaying average over past squared gradients; 2. a second-order-style
correction (a decaying average over past squared updates) to keep the units of the update consistent.
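A sketch of the AdaDelta rule: the ratio of the two running RMS terms gives the update the same units as the parameters, so no global learning rate is needed (illustrative names):

```python
import numpy as np

def adadelta_step(w, s_g, s_dw, grad, rho=0.95, eps=1e-6):
    """AdaDelta: decaying averages of squared gradients and of squared updates."""
    s_g = rho * s_g + (1 - rho) * grad ** 2                   # RMS of gradients
    dw = -np.sqrt(s_dw + eps) / np.sqrt(s_g + eps) * grad     # update has parameter units
    s_dw = rho * s_dw + (1 - rho) * dw ** 2                   # RMS of updates
    return w + dw, s_g, s_dw
```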

Go through the entire dataset

Go through mini-batches
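A sketch contrasting the two: full-batch gradient descent uses all of X for every step, while mini-batch SGD (below) updates on small random subsets; `loss_grad` is an assumed callback returning the gradient on a batch:

```python
import numpy as np

def sgd_train(X, y, w, loss_grad, lr=0.01, batch_size=32, epochs=10):
    """Mini-batch SGD: each update sees only a small random subset of the data."""
    n = X.shape[0]
    for _ in range(epochs):
        order = np.random.permutation(n)               # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            w = w - lr * loss_grad(w, X[batch], y[batch])
    return w
```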

Limiting the growth of the weights in the network.

Early stopping rules provide guidance on how many iterations can be run before the
learner begins to overfit (typically by monitoring validation error), and training is stopped at that point.
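A minimal sketch of such a rule, assuming hypothetical `train_step` and `val_loss` callbacks: stop once validation loss has not improved for `patience` checks.

```python
def train_with_early_stopping(train_step, val_loss, max_iters=1000, patience=10):
    """Stop training once the validation loss stops improving."""
    best, wait = float("inf"), 0
    for _ in range(max_iters):
        train_step()
        loss = val_loss()
        if loss < best:
            best, wait = loss, 0     # improvement: reset the patience counter
        else:
            wait += 1
            if wait >= patience:     # no improvement for `patience` checks: stop
                break
    return best
```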

A regularization technique for reducing overfitting in neural networks by preventing
complex co-adaptations on the training data. It is a very efficient way of performing model
averaging with neural networks. The term "dropout" refers to randomly dropping out units (both
hidden and visible) in a neural network during training.
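An inverted-dropout sketch in NumPy (the common implementation; scaling at train time keeps test-time behaviour unchanged):

```python
import numpy as np

def dropout(h, p_drop=0.5, training=True):
    """Randomly zero units during training; rescale so the expected activation is unchanged."""
    if not training or p_drop == 0.0:
        return h                                # at test time all units are kept
    mask = np.random.rand(*h.shape) > p_drop    # keep each unit with probability 1 - p_drop
    return h * mask / (1.0 - p_drop)            # inverted scaling
```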

Artificially enlarge the dataset using label-preserving transformations

A normalization method/layer for neural networks. BN reduces internal covariate shift.
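A training-mode sketch of the BN transform (at test time running averages of the batch statistics are used instead; `gamma` and `beta` are the learnable scale and shift):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization: normalize each feature over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)                        # per-feature batch mean
    var = x.var(axis=0)                        # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)      # zero mean, unit variance
    return gamma * x_hat + beta                # restore representational power
```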

You cannot do this because of the vanishing-gradient problem

In this network, the information moves in only one direction, forward, from the input nodes,
through the hidden nodes (if any) and to the output nodes. There are no cycles or loops in
the network.

The inputs are fed directly to the outputs via a series of weights.

This class of networks consists of multiple layers of computational units, usually
interconnected in a feed-forward way. Each neuron in one layer has directed connections to
the neurons of the subsequent layer. In many applications the units of these networks apply
a sigmoid function as an activation function.
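A two-layer forward pass matching this description (sigmoid hidden units; shapes and names are illustrative):

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Feed-forward pass: input -> hidden (sigmoid) -> output, no cycles or loops."""
    h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))   # hidden layer with sigmoid activation
    return h @ W2 + b2                         # output layer (e.g., logits)
```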

the region of the input space that
affects a particular unit of the network

ReLU, Data Augmentation, Dropout, Local Response
Normalization, Overlapping Pooling. 11×11 filters in the first conv layer;
8 layers (5 conv + 3 fully connected)

7×7 Filter

3×3 Filter; Plain network structure; Prefer a
stack of small filters; 16 or 19 layers

22 layers; Inception
module; 1×1 filter

all 3×3 conv (almost); use network layers to fit a
residual mapping instead of directly trying to fit
a desired underlying mapping
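A sketch of one residual block, assuming the input and output dimensions match so the identity shortcut can be added directly:

```python
import numpy as np

def residual_block(x, W1, W2):
    """The stacked layers fit the residual F(x); the block outputs relu(F(x) + x)."""
    relu = lambda z: np.maximum(z, 0)
    f = relu(x @ W1) @ W2      # F(x): what the layers actually have to learn
    return relu(f + x)         # identity shortcut adds the input back
```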

An extreme version of inception module; one
spatial convolution per output channel of the 1×1
convolution

decrease depth and increase width of residual
networks

Repeating a building block that aggregates a set of
transformations with the same topology.

Each layer has direct access to the gradients from
the loss function and the original input signal

The outputs of the hidden layer are stored
in the memory; the memory can be
considered as another input.
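A vanilla RNN step showing the "memory as another input" idea (illustrative names):

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, b):
    """The previous hidden state (the 'memory') is fed in alongside the current input."""
    return np.tanh(x_t @ Wx + h_prev @ Wh + b)
```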

If we are only interested in training a language model over the input for some other task,
then we do not need the Decoder of the Transformer; keeping only the Encoder gives us BERT.

If we have no separate input and just want to model the "next word", we can get rid of the Encoder
side of the Transformer and output the "next word" one by one. This gives us GPT.

Chebyshev expansion

No eigen decomposition

Gradient vanishing

A VAE is an autoencoder whose encoding distribution is regularised during training to
ensure that its latent space has good properties, allowing us to generate new data.
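A sketch of the two ingredients implied here: reparameterised sampling of z, and a loss that combines reconstruction error with a KL term regularising the encoding distribution towards N(0, I) (names are illustrative):

```python
import numpy as np

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps so gradients can flow through the encoder."""
    eps = np.random.randn(*mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def vae_loss(x, x_recon, mu, log_var):
    """Reconstruction error plus KL(q(z|x) || N(0, I))."""
    recon = np.sum((x - x_recon) ** 2)
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
    return recon + kl
```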

Autoencoders can reconstruct data, and can learn
features to initialize a supervised model. Features
capture factors of variation in training data.

Reduce object detection to image classification over region proposals

ROI Pooling

Region Proposal Network

ROI Align

A Transformer uses an Encoder stack to model the input and a Decoder stack to model the output
(attending to information from the Encoder side).
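Both stacks are built on scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V; a NumPy sketch for 2-D Q, K, V:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V: each query takes a weighted sum of the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                             # query-key similarity
    scores = scores - scores.max(axis=-1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ V
```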

Deep Learning

Feed Forward

CNN

RNN

LSTM

Transformer

GCN

GAN

VAE

CV Applications

Regularisation

Optimization

Gradient Descent

Stochastic Gradient Descent

Gradient

Learning Rate

Adagrad

Momentum

NAG

RMSProp

AdaDelta

Adam

Weight Decay

Early Stopping

Dropout

Add Noise

Noise to the input

Noise to the weights

Data Augmentation

Batch Normalization

Initialization

Optimizers

All Zero Initialization

Calibrating the Variances

Xavier

Kaiming

1. The mean of the activations should be zero. 2. The variance of the activations
should stay the same across every layer.
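Xavier and Kaiming initialization both calibrate the weight variance to meet these two criteria; a sketch (fan_in/fan_out are the layer's input/output sizes):

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    """Xavier/Glorot: Var(W) = 2 / (fan_in + fan_out), suited to tanh/sigmoid units."""
    return np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / (fan_in + fan_out))

def kaiming_init(fan_in, fan_out):
    """Kaiming/He: Var(W) = 2 / fan_in, compensating for ReLU zeroing half the activations."""
    return np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)
```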

Single-Layer Perceptron

Multi-Layer Perceptron

Convolution

Pooling

Stride

Padding

Receptive field

Architectures

AlexNet

ZFNet

GoogLeNet

ResNet

VGGNet

GoogLeNet variants

Xception

Wide ResNet

ResNeXt

DenseNet

Efficient CNNs

SqueezeNet

MobileNet

ShuffleNet

Vanishing Gradients

Exploding Gradients

Prevent vanishing gradient

Gated Recurrent Unit (GRU)

Reset Gate

Update Gate (forget gate + input gate)

Self-Attention

Multi-head

BERT

GPT

query

key

value

Spatial

Spectral

JS Divergence

Wasserstein GANs

Auto-encoder

Detection

RCNN

Fast-RCNN

Faster-RCNN

Mask-RCNN

Deep Neural Networks for different types of data (e.g., image, sequential data and graph data) and
different tasks (image classification, image generation, word embedding, translation and generation, etc.)

How to (better) train deep neural networks.