
CMPSC442-Wk12-Mtg35

Introduction to Deep Learning and Neural
Networks
AIMA 21.1 – 21.6

CMPSC 442
Week 12, Meeting 35, Three Segments

Outline

● Intro to Deep Learning and Neural Networks
● Computation Graphs
● Convolutional Networks versus Recurrent Networks


CMPSC 442
Week 12, Meeting 35, Segment 1 of 3: Intro to Deep
Learning and Neural Networks


Deep Learning
● The most widely used family of ML techniques

○ Deep: computational infrastructure based on complex algebraic circuits with many layers
○ Layer: a stage of computation that transforms the weighted outputs of the previous layer (or the input) using an activation function (usually non-linear)


[Figure: ImageNet dataset for visual recognition. Russakovsky, Olga, et al. "ImageNet large scale visual recognition challenge." International Journal of Computer Vision 115.3 (2015): 211-252.]

[Figure: Language model used by the Google search engine. Src:
http://web.stanford.edu/class/cs224n/slides/cs224n-2020-lecture06-rnnlm.pdf]

Neural Networks

● Can handle large input representations by parallelizing computation across matrices (GPU architectures)

● The computation path multiplies weights into each layer and aggregates each layer's outputs

● Network architecture affects the expressive power of the network (e.g., feedforward, convolutional, recurrent)


Logistic Regression vs. Decision Tree vs. Deep Neural Network

● Parameters w_{i,j}: the weight between unit i and unit j
● Unit: a node of the network
● Activation function: defines the output of a unit given its input
● Non-linear activation functions mean a network can learn complex non-linear relationships between inputs and outputs

Simple Feed Forward Network


Universal Approximation Theorem

● An MLP1 (multilayer perceptron with 1 hidden layer) is a network with two layers
of computational units

○ First layer with nonlinear activation
○ Second layer with linear activation

● Universal approximator
○ Can approximate any continuous function on a closed, bounded subset of ℝ^n
○ Given the non-linearity, a sufficiently large network can approximate an arbitrary continuous function

● In practice, it can be difficult to learn the parameters for a given MLP1


Activation Functions

● Activation Function: introduces non-linearity into network


[Plots: the logistic (sigmoid), ReLU, and tanh activation functions]
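A minimal sketch of these three activations in NumPy (the function names are mine, not from the slides):

```python
import numpy as np

def sigmoid(x):
    # logistic: squashes input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # rectified linear unit: max(0, x); zero gradient for x < 0
    return np.maximum(0.0, x)

def tanh(x):
    # hyperbolic tangent: squashes input into (-1, 1)
    return np.tanh(x)
```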


Non-linearity

● Neural networks can learn non-linear decision boundaries


Loss Functions and Gradient Descent

● Loss function: measures the network's prediction error against the training data
● Compute the gradients of the loss function w.r.t. the weights, and update the weights in the direction opposite the gradient to reduce the loss during training
● Given an input training sample x (see the sketch below):

○ Network prediction ŷ
○ True value y
○ Compute the loss, e.g., L2 loss: L = (y − ŷ)²
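A minimal sketch of one gradient-descent step on the L2 loss for a single linear unit, in NumPy (the data, weights, and learning rate are illustrative assumptions):

```python
import numpy as np

# one training sample: input x, true value y (illustrative values)
x = np.array([1.0, 2.0])
y = 3.0

w = np.zeros(2)            # weights to learn
lr = 0.1                   # learning rate (illustrative choice)

y_hat = w @ x              # prediction of a linear unit
loss = (y - y_hat) ** 2    # L2 loss

grad = -2.0 * (y - y_hat) * x   # dLoss/dw by the chain rule
w = w - lr * grad               # step against the gradient to reduce the loss
```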


From Input to Output

● Defining a network with one hidden layer of two units and one output layer
(MLP1)


[Figure: schematic representation of the network, with bias inputs and computation operations]

From Input to Output

● The output of unit 5 is a function of its inputs (the sum of its bias, the weighted input from unit 3, and the weighted input from unit 4): a_5 = g(w_{0,5} + w_{3,5} a_3 + w_{4,5} a_4), and likewise for the units before it . . .
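A minimal forward-pass sketch of this MLP1 in NumPy (two hidden units and one output unit; the tanh activation and all numeric values are illustrative assumptions):

```python
import numpy as np

def forward(x, W1, b1, w2, b2):
    a_hidden = np.tanh(W1 @ x + b1)   # units 3 and 4: weighted inputs plus bias, then activation
    y_hat = w2 @ a_hidden + b2        # unit 5: bias plus weighted outputs of units 3 and 4
    return y_hat

x  = np.array([1.0, 2.0])                 # inputs (units 1 and 2)
W1 = np.array([[0.1, -0.2], [0.3, 0.4]])  # weights into units 3 and 4
b1 = np.array([0.0, 0.0])                 # biases of units 3 and 4
w2 = np.array([0.5, -0.5])                # weights from units 3 and 4 into unit 5
b2 = 0.1                                  # bias of unit 5

print(forward(x, W1, b1, w2, b2))
```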


Back-Propagation

● Loss at the output must be back-propagated along each computation path

Given the network definition and the loss function, compute gradients by following the chain rule (taking derivatives of the loss).

Gradients are computed by automatic differentiation in deep learning frameworks (e.g., TensorFlow, PyTorch), as in the sketch below.
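A minimal sketch of automatic differentiation in PyTorch (the scalar example is mine; backward() applies the chain rule over the recorded computation graph):

```python
import torch

w = torch.tensor([0.5, -0.5], requires_grad=True)   # weights to learn
x = torch.tensor([1.0, 2.0])
y = torch.tensor(3.0)

y_hat = w @ x               # forward pass records the computation graph
loss = (y - y_hat) ** 2     # L2 loss

loss.backward()             # back-propagation via the chain rule
print(w.grad)               # dLoss/dw
```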


Back-Propagation

● Gradients define the direction of weight updates
○ ∂Loss/∂w_{i,j} is the gradient with respect to weight w_{i,j}
○ If ∂Loss/∂w_{i,j} is positive, w_{i,j} is too big (the update decreases it)

● Vanishing gradients: derivatives are (very close to) zero
○ Neurons die out (become insensitive to the input data)
○ Some activation functions work better than others at preventing vanishing gradients


CMPSC 442
Week 12, Meeting 35, Segment 2 of 3: Computation Graphs


Computation Graphs

● Input layer
○ Input encoding: converting the input to a numerical representation
○ Continuous (e.g., word vectors) or one-hot encoding (n bits for n category labels; see the sketch after this list)

● Hidden layers of units
○ Deeper networks have more layers, hence more parameters to learn
○ The number of layers (parameters) is limited by the size of the training data and by computational resources

● Output layer and loss function at the output
○ A layer for prediction (e.g., classification, regression)
○ Select a loss function that fits the problem
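A minimal one-hot encoding sketch in NumPy (the label set is an illustrative assumption):

```python
import numpy as np

labels = ["cat", "dog", "bird"]                       # n = 3 category labels
index = {label: i for i, label in enumerate(labels)}

def one_hot(label):
    # n bits for n category labels: a single 1 at the label's index
    v = np.zeros(len(labels))
    v[index[label]] = 1.0
    return v

print(one_hot("dog"))   # [0. 1. 0.]
```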


Training and Computation Graphs

● A computation graph is a DAG whose nodes are mathematical operations or variables, and whose edges indicate where each node's input values come from

● E.g.: (a * b + 1) * (a * b + 2); see the sketch below

Goldberg, Chapter 5
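A minimal sketch of this computation graph in PyTorch: autograd records the DAG (note the shared a * b node, computed once and used twice) and back-propagates along its edges. The input values are illustrative:

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)

ab = a * b                  # shared node in the DAG
y = (ab + 1) * (ab + 2)     # the example expression

y.backward()                # gradients flow back along every edge
print(y.item(), a.grad.item(), b.grad.item())   # 56.0 45.0 30.0
```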


Computation Graph: MLP with Softmax

● Oval nodes are mathematical operations or
functions (e.g., ADD, tanh)

● Shaded rectangles represent parameters
(bound variables)

● Output of each node is a matrix, with
dimensionality shown above the node

● Softmax turns output into a probability
distribution

Goldberg, Chapter 5, Figure 5.1 (a)

Loss Functions


● Negative log likelihood (recall AIMA Chapter 20)
○ Find the parameters that maximize the probability of the data
○ Equivalent to minimizing a negative log loss

● Cross-entropy loss measures the dissimilarity between the predicted and true distributions (see the sketch below)
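A minimal sketch of softmax followed by cross-entropy (the negative log probability of the true class) in NumPy; the logits and the true class are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    # turn raw scores into a probability distribution
    e = np.exp(z - z.max())      # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(probs, true_class):
    # negative log likelihood of the true class under the predicted distribution
    return -np.log(probs[true_class])

logits = np.array([2.0, 0.5, -1.0])   # network outputs for 3 classes
probs = softmax(logits)
print(probs, cross_entropy(probs, true_class=0))
```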


Select Loss Function Based on Output Distribution


src: https://www.deeplearningbook.org/slides/06_mlp.pdf


Learning Algorithms


● Gradient-based learning is an optimization problem
○ Minimize the loss
○ Tune the learning rate to control the step size of weight updates toward a local optimum

■ Advanced optimization algorithms: stochastic gradient descent (SGD), Newton's method, etc.

■ Optimization algorithms are often provided by DL frameworks (see the sketch after this list)
● Other techniques that help convergence

○ Batch normalization
○ Tuning the optimizer (learning rate, regularization, dropout, etc.)

● Hyper-parameter tuning is very important for neural networks!
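A minimal sketch of a training loop with a framework-provided optimizer (PyTorch SGD; the toy data, model size, and learning rate are illustrative assumptions, not from the slides):

```python
import torch

# toy regression data
X = torch.randn(32, 4)
y = X.sum(dim=1, keepdim=True)

model = torch.nn.Linear(4, 1)                        # one linear layer
loss_fn = torch.nn.MSELoss()                         # L2-style loss
opt = torch.optim.SGD(model.parameters(), lr=0.05)   # learning rate is a hyper-parameter

for step in range(100):
    opt.zero_grad()               # clear old gradients
    loss = loss_fn(model(X), y)   # forward pass + loss
    loss.backward()               # back-propagate gradients
    opt.step()                    # update weights along the negative gradient
```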



CMPSC 442
Week 12, Meeting 35, Segment 3 of 3: CNNs & RNNS

Convolutional Neural Network


Among the most powerful and most common network architectures for vision tasks, e.g., image recognition

src: http://cs231n.stanford.edu/


Convolutional Neural Network

● Convolution: applies a kernel of learned weights across the input, implemented as matrix multiplication
● Kernel: the pattern of weights convolved over each region of the matrix

● For each output position i, we take the dot product between the kernel and a snippet of x of width l, as in the sketch below

(AIMA 21.8)
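A minimal sketch of this 1-D convolution in NumPy (stride 1; the kernel and input values are illustrative):

```python
import numpy as np

def conv1d(x, k):
    # for each output position i, dot product of the kernel with
    # the width-l snippet of x starting at position i (stride 1)
    l = len(k)
    return np.array([k @ x[i:i + l] for i in range(len(x) - l + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
k = np.array([0.5, 0.0, -0.5])    # a width l = 3 kernel
print(conv1d(x, k))               # [-1. -1. -1.]
```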


Convolutional Neural Network


(AIMA Figure 21.5)


Sequence modeling: Recurrent Neural Network

● Sequential data:
○ Time series data (e.g. stock market, speech signal)
○ Language model
○ Part-of-Speech tagging

● Example: language model


“Every morning I take my dog for a __walk__ ”

[Diagram: memory of the preceding words supports prediction of the next word]


Recall: Markov Assumptions

● The state at the current time step depends on a bounded number of states from previous time steps
○ Keeping track of the time step
○ Remembering previous states (memory)

● Recurrent neural network
○ Based on the Markov assumption
○ Hidden states are computed from the features at each time step
○ Back-propagation through time
○ Weights are shared across all time steps of the network


Recurrent Neural Network


● RNN formulation
○ Input sequence of vectors: x_1, ..., x_T
○ Observed outputs: y_1, ..., y_T

○ Hidden state at time step t: h_t = g(W_hh h_{t-1} + W_xh x_t)

○ Predicted output at time step t: ŷ_t = f(W_hy h_t)
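A minimal sketch of one recurrent step in NumPy, matching the formulation above (the weight shapes, the tanh activation, and the linear output are illustrative assumptions):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)   # hidden state mixes memory and input
    y_hat_t = W_hy @ h_t + b_y                        # predicted output from the hidden state
    return h_t, y_hat_t

d_in, d_hid, d_out = 3, 4, 2
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(d_hid, d_in))
W_hh = rng.normal(size=(d_hid, d_hid))    # the same weights are shared across time steps
W_hy = rng.normal(size=(d_out, d_hid))
b_h, b_y = np.zeros(d_hid), np.zeros(d_out)

h = np.zeros(d_hid)
for x_t in rng.normal(size=(5, d_in)):    # a length-5 input sequence
    h, y_hat = rnn_step(x_t, h, W_xh, W_hh, W_hy, b_h, b_y)
```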


Long-short Term Memory


● A modification of RNNs
● Motivation:

○ Vanishing gradients in RNNs: avoid direct back-propagation through the cell state
○ Increase the ability to capture longer sequences

● Controlling information flow with gating units (element-wise multiplication); see the sketch below
■ Forget gate
■ Input gate
■ Output gate
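A minimal sketch of one LSTM cell step in NumPy, using the common gate formulation (the per-gate parameter layout and all numeric values are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b hold per-gate parameters: 'f' forget, 'i' input, 'o' output, 'g' candidate
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate
    g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])   # candidate cell update
    c_t = f * c_prev + i * g          # element-wise gating of the cell state
    h_t = o * np.tanh(c_t)            # gated hidden state
    return h_t, c_t

d_in, d_hid = 3, 4
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(d_hid, d_in)) for k in "fiog"}
U = {k: rng.normal(size=(d_hid, d_hid)) for k in "fiog"}
b = {k: np.zeros(d_hid) for k in "fiog"}

h, c = np.zeros(d_hid), np.zeros(d_hid)
for x_t in rng.normal(size=(5, d_in)):    # a length-5 input sequence
    h, c = lstm_step(x_t, h, c, W, U, b)
```

The additive update of the cell state (c_t = f * c_prev + i * g) is what lets gradients flow over longer spans than in a plain RNN.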



Long-short Term Memory

src: https://colah.github.io/posts/2015-08-Understanding-LSTMs/ (a very nice walkthrough of LSTMs; recommended!)



Summary, One

● Deep learning is the most widely used family of ML techniques
○ Deep means many layers
○ A neural network is the computational infrastructure that realizes DL, built from neurons (computational circuits)

● A basic NN has three parts:
○ Input encoding, hidden layers with activation functions, and an output layer
○ A loss function is used to compute gradients for updating the weights
○ Gradients are back-propagated to update the network

● NN architectures differ substantially:
○ Feedforward
○ Convolutional NN
○ RNN and LSTM


Summary, Two
● CNNs are often used for vision tasks
● RNNs are for sequence tasks, based on the Markov assumption

○ Hidden states are computed over time
○ LSTM is a variant of the RNN, with the benefits of

■ Avoiding vanishing gradients
■ Greater ability to capture longer sequences

● The choice of architecture depends on the problem
● Hyper-parameter tuning is critical when training neural networks
