CMPSC442-Wk12-Mtg35
Introduction to Deep Learning and Neural
Networks
AIMA 21.1 – 21.6
CMPSC 442
Week 12, Meeting 35, Three Segments
Outline
● Intro to Deep Learning and Neural Networks
● Computation Graphs
● Convolutional Networks versus Recurrent Networks
CMPSC 442
Week 12, Meeting 35, Segment 1 of 3: Intro to Deep
Learning and Neural Networks
Deep Learning
● Most widely used family of ML techniques
○ Deep: computational infrastructure based on complex algebraic circuits with many layers
○ Layer: a computation step that transforms the weighted output of the previous layer (or the input)
using an activation function (usually non-linear)
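As a concrete illustration of a single layer (a sketch only; the function name, shapes, and tanh activation are illustrative assumptions, not from the slides), the computation multiplies weights into the incoming values, adds a bias, and applies a non-linear activation:

import numpy as np

def dense_layer(x, W, b):
    # One layer: weighted input transformed by a non-linear activation (tanh here)
    return np.tanh(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # 3 inputs
W = rng.normal(size=(4, 3))     # weights from 3 inputs into 4 units
b = np.zeros(4)                 # one bias per unit
h = dense_layer(x, W, b)        # output of the layer, one value per unit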
ImageNet dataset for visual recognition: Russakovsky, Olga, et al. "ImageNet large scale visual recognition challenge." International Journal of Computer Vision 115.3 (2015): 211-252.
Language model used by the Google search engine. Src: http://web.stanford.edu/class/cs224n/slides/cs224n-2020-lecture06-rnnlm.pdf
Neural Networks
● Can handle large input representations by parallelizing matrix computations (e.g., on GPU architectures)
● Computation path involves multiplication of weights into each layer and aggregation
of output from each layer
● Network architecture affects the expressive power of the network (e.g., feed forward,
convolutional network, recurrent network)
Logistic Regression vs. Decision Tree vs. Deep Neural Network
● Parameters: $w_{i,j}$ denotes the weight between unit $i$ and unit $j$
● Unit: a node of the network
● Activation function: defines the output of a unit given its input
● Non-linear activation functions mean a network can learn complex non-linear relationships
between inputs and outputs
Simple Feed Forward Network
Universal Approximation Theorem
● An MLP1 (multilayer perceptron with 1 hidden layer) is a network with two layers
of computational units
○ First layer with nonlinear activation
○ Second layer with linear activation
● Universal approximator
○ Can approximate any continuous function on a closed, bounded subset of $\mathbb{R}^n$
○ Given the non-linearity, a sufficiently large network can approximate an arbitrary continuous function
● In practice, it can be difficult to learn the parameters for a given MLP1
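A minimal NumPy sketch of an MLP1 forward pass, assuming a tanh hidden activation (names and shapes are illustrative, not from the slides):

import numpy as np

def mlp1_forward(x, W1, b1, W2, b2):
    h = np.tanh(W1 @ x + b1)   # first layer: non-linear activation
    return W2 @ h + b2         # second layer: linear activation

# With enough hidden units, functions of this form can approximate any
# continuous function on a closed, bounded subset of R^n; the hard part
# in practice is finding good values for W1, b1, W2, b2.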
Activation Functions
● Activation Function: introduces non-linearity into network
(Plots of common activation functions: logistic (sigmoid), ReLU, tanh)
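A minimal NumPy sketch of these three activation functions (definitions only; the function names are illustrative):

import numpy as np

def sigmoid(z):            # logistic: squashes values into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):               # rectified linear unit: max(0, z)
    return np.maximum(0.0, z)

def tanh(z):               # hyperbolic tangent: squashes values into (-1, 1)
    return np.tanh(z)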
Non-linearity
● Neural networks can learn non-linear decision boundaries
Loss Functions and Gradient Descent
● Loss function: measures the network's prediction error against the training data
● Calculate the gradients of the loss function w.r.t. the weights, and update the weights against
the gradient direction to reduce the loss during training
● Given an input training sample $x$:
○ Network prediction $\hat{y}$
○ True value $y$
○ Compute the loss function, e.g., the L2 loss: $L_2(y, \hat{y}) = (y - \hat{y})^2$
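A hedged NumPy sketch of the L2 loss and one gradient-descent step for a single linear unit (the learning rate alpha and all names are illustrative assumptions):

import numpy as np

def l2_loss(y_true, y_pred):
    return np.sum((y_true - y_pred) ** 2)

def sgd_step(w, x, y_true, alpha=0.1):
    # One gradient-descent step for the linear prediction y_pred = w @ x
    y_pred = w @ x
    grad = -2.0 * (y_true - y_pred) * x   # dL/dw for the L2 loss
    return w - alpha * grad               # move against the gradient to reduce the loss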
From Input to Output
● Defining a network with one hidden layer of two units and one output layer
(MLP1)
Schematic Representation
(with bias inputs and computation operations shown)
From Input to Output
● Output of unit 5 is a function of its inputs (the sum of its bias, the weighted input from unit 3, and the weighted input from unit 4): $a_5 = g(w_{0,5} + w_{3,5}\,a_3 + w_{4,5}\,a_4)$
Back-Propagation
● Loss at output must be back-propagated along each computation path
Given the network definition and the loss function, compute the gradients by following the chain rule (taking derivatives of the loss with respect to each weight).
Gradients are computed by automatic differentiation in deep learning frameworks (e.g., TensorFlow, PyTorch).
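A minimal PyTorch sketch (assuming PyTorch is available) of automatic differentiation: the forward pass records the computation, and backward() back-propagates the loss to produce the gradients. Values are illustrative.

import torch

x = torch.tensor([1.0, 2.0])
y = torch.tensor(3.0)
w = torch.tensor([0.5, -0.5], requires_grad=True)

y_hat = w @ x            # forward pass builds the computation path
loss = (y - y_hat) ** 2  # L2 loss
loss.backward()          # back-propagation via the chain rule
print(w.grad)            # dL/dw, computed automatically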
Back-Propagation
● Gradients define the direction of weight updates
○ $\partial L / \partial w$ is the gradient of the loss with respect to weight $w$
○ A positive $\partial L / \partial w$ means $w$ is too big, so the update decreases it (move against the gradient)
● Vanishing gradients: derivatives are (very close to) zero
○ Neurons die out (insensitive to input data)
○ Some activation functions work better than others to prevent vanishing
gradients
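A small illustrative sketch (an assumption, not from the slides) of why saturating activations can produce vanishing gradients: the sigmoid's derivative never exceeds 0.25, so the chain rule across many layers multiplies many small factors.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)      # maximum value is 0.25, at z = 0

# Best case for 20 stacked sigmoid units, multiplied by the chain rule:
print(0.25 ** 20)              # ~9.1e-13: the gradient has effectively vanished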
CMPSC 442
Week 12, Meeting 35, Segment 2 of 3: Computation Graphs
Computation Graphs
● Input layer
○ Input encoding: converting input to numerical representation
○ Continuous (e.g. word vectors) or one-hot encoding (n bits for n category labels; see the sketch after this list)
● Hidden layers of units
○ Deeper networks have more layers, hence more parameters to learn
○ Number of layers (parameters) is limited by size of training data and computation
resources
● Output layer and loss functions at output
○ Layer for prediction (e.g. classification, regression)
○ Select loss function that fits the problem
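A hedged NumPy sketch of the one-hot encoding mentioned above (the category labels and helper names are illustrative):

import numpy as np

labels = ["cat", "dog", "bird"]                  # illustrative category labels
index = {label: i for i, label in enumerate(labels)}

def one_hot(label, n=len(labels)):
    v = np.zeros(n)
    v[index[label]] = 1.0                        # exactly one bit set per category
    return v

print(one_hot("dog"))                            # [0. 1. 0.]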
Training and Computation Graphs
● A computation graph is a DAG where nodes are mathematical operations or
variables and edges correspond to source of input values
● E.g., (a * b + 1) * (a * b + 2), where the shared subexpression a * b is a single node with two outgoing edges
Goldberg, Chapter 5
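A minimal PyTorch sketch of this example graph (values chosen for illustration): the shared node a * b is computed once and feeds both branches, and backward() traverses the DAG to produce gradients.

import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)

ab = a * b                  # shared internal node of the DAG
out = (ab + 1) * (ab + 2)   # two edges out of the same node
out.backward()
# For a = 2, b = 3: out = 56, a.grad = 45, b.grad = 30
print(out.item(), a.grad.item(), b.grad.item())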
Computation Graph: MLP with Softmax
● Oval nodes are mathematical operations or
functions (e.g., ADD, tanh)
● Shaded rectangles represent parameters
(bound variables)
● Output of each node is a matrix, with
dimensionality shown above the node
● Softmax turns output into a probability
distribution
Goldberg, Chapter 5, Figure 5.1 (a)
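A hedged NumPy sketch of the softmax step that turns the output scores into a probability distribution (the max-subtraction is a standard numerical-stability trick, not from the slides):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))          # non-negative values that sum to 1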
Loss Functions
● Negative log likelihood (recall from AIMA section 20)
○ Find the parameters that maximize the probability of the data
○ Same as minimizing a negative log loss
● Cross-entropy loss measures dissimilarity between the predicted and true
distributions
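A hedged NumPy sketch of cross-entropy; for a one-hot true distribution it reduces to the negative log probability assigned to the correct class (values are illustrative):

import numpy as np

def cross_entropy(p_true, p_pred):
    return -np.sum(p_true * np.log(p_pred))

p_true = np.array([0.0, 1.0, 0.0])        # one-hot true label
p_pred = np.array([0.2, 0.7, 0.1])        # predicted distribution (e.g., from softmax)
print(cross_entropy(p_true, p_pred))      # = -log(0.7), the negative log likelihood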
Select Loss Function Based on Output Distribution
src: https://www.deeplearningbook.org/slides/06_mlp.pdf
Learning Algorithms
● Gradient learning is an optimization problem
○ Minimize the loss
○ Tune the learning rate to control the size of each weight update when searching for a (local) optimum
■ Common optimization algorithms: Stochastic Gradient Descent (SGD), Newton's method, etc.
■ Optimization algorithms are often provided in DL frameworks
● Other techniques that help convergence
○ Batch normalization
○ Tuning optimizer (learning rate, regularization, dropout, etc.)
● Hyper-parameter tuning is very important for neural networks!
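A minimal PyTorch training-loop sketch of the ideas on this slide (model size, learning rate, loss, and data are illustrative assumptions; the SGD optimizer is provided by the framework):

import torch

model = torch.nn.Linear(3, 1)                              # tiny illustrative model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # learning rate is a hyper-parameter
loss_fn = torch.nn.MSELoss()

x = torch.randn(8, 3)                                      # toy batch of inputs
y = torch.randn(8, 1)                                      # toy targets

for _ in range(100):
    optimizer.zero_grad()         # clear old gradients
    loss = loss_fn(model(x), y)   # forward pass + loss
    loss.backward()               # back-propagate gradients
    optimizer.step()              # update weights against the gradient direction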
CMPSC 442
Week 12, Meeting 35, Segment 3 of 3: CNNs & RNNs
Convolutional Neural Network
Most powerful and common network for vision tasks, e.g. image recognition
src: http://cs231n.stanford.edu/
Convolutional Neural Network
● Convolution: a linear operation, implemented as matrix multiplication, with learned weights
● Kernel: the matrix of weights applied to each local region of the input
● For each output position i, we take the dot product between the kernel and a snippet of x with width l (AIMA Eq. 21.8)
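A hedged NumPy sketch of the 1-D case: for each output position i, take the dot product of the kernel with the width-l snippet of x starting there (indexing conventions vary; this sketch uses a "valid" convolution with stride 1, and the values are illustrative):

import numpy as np

def conv1d(x, k):
    l = len(k)
    # dot product of the kernel with each width-l snippet of x
    return np.array([np.dot(k, x[i:i + l]) for i in range(len(x) - l + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
k = np.array([1.0, 0.0, -1.0])        # width-l kernel (l = 3) of learned weights
print(conv1d(x, k))                   # [-2. -2. -2.]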
Convolutional Neural Network
(AIMA Figure 21.5)
Sequence modeling: Recurrent Neural Network
● Sequential data:
○ Time series data (e.g. stock market, speech signal)
○ Language model
○ Part-of-Speech tagging
● Example: a language model
“Every morning I take my dog for a __walk__”
(the preceding words provide the memory; the blank is the prediction)
Recall: Markov Assumptions
● The state at the current time step depends on a bounded number of states from previous time steps
○ Keeping track of the time step
○ Remembering previous states (memory)
● Recurrent neural network
○ Based on the Markov assumptions
○ Hidden states are computed from the features at each time step
○ Back-propagation through time
○ Weights are shared across all time steps
Recurrent Neural Network
● RNN formulation
○ Input sequence of vectors: $x_1, \ldots, x_T$
○ Observed outputs: $y_1, \ldots, y_T$
○ Hidden state at time step t: $h_t = g(W_{hh}\,h_{t-1} + W_{xh}\,x_t)$
○ Predicted output at time step t: $\hat{y}_t = f(W_{hy}\,h_t)$
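A hedged NumPy sketch of a single recurrent step matching the formulation above (the tanh hidden activation and the weight names W_xh, W_hh, W_hy are illustrative assumptions); the same weights are reused at every time step:

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)   # new hidden state (the "memory")
    y_t = W_hy @ h_t                            # prediction at time step t
    return h_t, y_t

# The same weight matrices are shared across the whole sequence:
# for x_t in sequence: h, y = rnn_step(x_t, h, W_xh, W_hh, W_hy)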
Long Short-Term Memory
● A modification of RNNs
● Motivation:
○ Vanishing gradients in RNNs: the cell state carries information across time steps without repeated squashing, so gradients back-propagate more directly
○ Increased ability to capture longer sequences
● Controlling information flow by gating units
○ Element-wise multiplication
■ Forget gate
■ Input gate
■ Output gate
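A minimal PyTorch sketch using the framework's built-in LSTM cell; the forget, input, and output gates listed above are implemented inside torch.nn.LSTMCell (the sizes and loop length are illustrative):

import torch

cell = torch.nn.LSTMCell(input_size=4, hidden_size=8)   # gates are learned inside the cell
x_t = torch.randn(1, 4)                                 # input at one time step
h_t = torch.zeros(1, 8)                                 # hidden state
c_t = torch.zeros(1, 8)                                 # cell state (the longer-term memory)

for _ in range(5):                                      # unroll a few time steps
    h_t, c_t = cell(x_t, (h_t, c_t))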
Long Short-Term Memory
src: https://colah.github.io/posts/2015-08-Understanding-LSTMs/ (a very nice walkthrough of LSTMs, recommended!)
CNNs & RNNs
Summary, One
● Deep learning is the most widely used family of ML techniques
○ Deep means many layers
○ Neural networks are the computational infrastructure of DL: circuits of simple computational units (neurons)
● A basic NN has three kinds of layers:
○ Input encoding, hidden units with (non-linear) activation functions, and an output layer
○ A loss function is used to compute gradients for updating the weights
○ Gradients are back-propagated to update the network
● NN architectures differ substantially:
○ Feedforward
○ Convolutional NN
○ RNN and LSTM
Summary, Two
● CNNs are often used for vision tasks
● RNNs are used for sequence tasks, based on the Markov assumption
○ Hidden states are computed over time
○ LSTMs are a variant of RNNs, with the benefits of
■ Mitigating vanishing gradients
■ Greater ability to capture longer sequences
● Selection of architecture is dependent on the problem
● Hyper-parameter tuning is critical in training neural networks