Introduction to Deep Learning and Neural Networks
AIMA 21.1 – 21.6
● Intro to Deep Learning and Neural Networks
● Computation Graphs
● Convolutional Networks versus Recurrent Networks
Deep Learning
● Among the most widely used techniques in ML
○ Deep: computation is organized as a complex algebraic circuit with many layers
○ Layer: a stage of computation that takes the output of the previous layer (or the input),
weights it, and transforms it with an activation function (usually non-linear)
ImageNet dataset for visual recognition. Russakovsky, Olga,
et al. “ImageNet Large Scale Visual Recognition Challenge.”
International Journal of Computer Vision 115.3 (2015): 211-252.
Language model used by Google search engine. Src:
Neural Networks
● Can handle large input representations by parallelizing matrix computations
(e.g., on GPU architectures)
● Each step of the computation path multiplies a layer's inputs by its weights and
aggregates the results into that layer's output
● Network architecture affects the expressive power of the network (e.g., feed forward,
convolutional network, recurrent network)
Logistic Regression vs. Decision Tree vs. Deep Neural Network
● Parameters: $w_{i,j}$ is the weight on the link from unit i to unit j
● Unit: a node of the network
● Activation function: defines the output of a unit given its input
● Non-linear activation functions mean a network can learn complex non-linear relationships
between inputs and outputs
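The definitions above can be written as one formula. This is a sketch in AIMA-style notation, which is assumed here rather than taken from the slide: $g_j$ is the activation function of unit j, $in_j$ is its weighted input sum, $a_i$ is the output of a unit i that feeds into j, and the bias is treated as a weight $w_{0,j}$ on a constant input $a_0 = 1$.

```latex
a_j = g_j(in_j), \qquad in_j = \sum_i w_{i,j}\, a_i
```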
Simple Feed Forward Network
Universal Approximation Theorem
● An MLP1 (multilayer perceptron with 1 hidden layer) is a network with two layers
of computational units
○ First layer with nonlinear activation
○ Second layer with linear activation
● Universal approximator
○ Can approximate any continuous function on a closed, bounded subset of $\mathbb{R}^n$
○ Given the non-linearity, a sufficiently large network can approximate such a function to any desired accuracy
● In practice, it could be difficult to learn the parameters for a given MLP1
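To make the MLP1 structure concrete, here is a minimal sketch of a network with one ReLU hidden layer and a linear output layer that computes XOR, a function no single linear unit can represent. The weights are hand-picked for illustration, not learned, and the name `mlp1` is only a label for this example. It illustrates expressive power on one small function; it is not a demonstration of the theorem itself.

```python
def relu(z):
    return max(0.0, z)

def mlp1(x1, x2):
    """One hidden layer (ReLU) followed by a linear output layer."""
    # Hidden layer: two units with hand-picked (not learned) weights and biases.
    h1 = relu(1.0 * x1 + 1.0 * x2 + 0.0)
    h2 = relu(1.0 * x1 + 1.0 * x2 - 1.0)
    # Output layer: linear combination of the hidden activations.
    return 1.0 * h1 - 2.0 * h2

# Evaluate the network on all four binary inputs.
xor_table = {(x1, x2): mlp1(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
```

Finding such weights by hand is easy here; finding them by training is the hard part in general.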
Activation Functions
● Activation Function: introduces non-linearity into network
Logistic (sigmoid), ReLU, tanh
● Neural networks can learn non-linear decision boundaries
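The three activation functions named above can be sketched in a few lines of standard-library Python. The loop prints their values at a few sample inputs chosen only for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, z)

def tanh(z):
    """Hyperbolic tangent: output in (-1, 1)."""
    return math.tanh(z)

for z in (-2.0, 0.0, 2.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.3f}  relu={relu(z):.1f}  tanh={tanh(z):+.3f}")
```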
Loss Functions and Gradient Descent
● Loss function: measures the prediction error of the neural network on the training data
● Calculate the gradients of the loss function w.r.t. the weights, and update the weights in the
direction opposite to the gradient to reduce the loss during training
● Given an input training sample $\mathbf{x}$,
○ Network prediction $\hat{y}$
○ True value $y$
○ Compute loss function, e.g., L2: $Loss = (y - \hat{y})^2$
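A minimal sketch of one loss computation and one weight update. It assumes a hypothetical one-weight model `y_pred = w * x` and made-up numbers, since the slide does not specify a model.

```python
def l2_loss(y_true, y_pred):
    """Squared error between the true value and the prediction."""
    return (y_true - y_pred) ** 2

# Hypothetical one-weight model: y_pred = w * x.
w, x, y_true = 0.5, 2.0, 3.0
learning_rate = 0.1

y_pred = w * x
loss_before = l2_loss(y_true, y_pred)

# d/dw (y - w*x)^2 = -2 * x * (y - w*x)
grad_w = -2.0 * x * (y_true - y_pred)

# Step against the gradient to reduce the loss.
w_new = w - learning_rate * grad_w
loss_after = l2_loss(y_true, w_new * x)
```

The gradient is negative here, so the update increases `w`, and the loss on this example drops.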
From Input to Output
● Defining a network with one hidden layer of two units and one output layer
[Figure panels: schematic representation; with bias inputs and computation operations]
From Input to Output
● Output of unit 5 is a function of its
inputs (sum of its bias, weighted input
from unit 3, and weighted input from
unit 4) . . .
● Loss at output must be back-propagated along each computation path
● Given the network definition and loss function, compute the gradients with the chain rule
(taking derivatives of the loss with respect to each weight)
● Gradients are computed by automatic differentiation in deep learning
frameworks (e.g., TensorFlow, PyTorch)
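The chain-rule computation can be sketched by hand for a small network with two inputs, hidden units 3 and 4, and output unit 5. Several things are assumptions made for this example: sigmoid activations at every unit, the squared loss, the weight values, and the dictionary layout `w[(i, j)]` with index 0 standing for the bias. A finite-difference check is included to confirm the analytic gradients; frameworks automate this bookkeeping.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x1, x2):
    """Forward pass; w maps (i, j) to the weight from unit i to unit j (0 = bias)."""
    a3 = sigmoid(w[0, 3] + w[1, 3] * x1 + w[2, 3] * x2)
    a4 = sigmoid(w[0, 4] + w[1, 4] * x1 + w[2, 4] * x2)
    y_hat = sigmoid(w[0, 5] + w[3, 5] * a3 + w[4, 5] * a4)
    return a3, a4, y_hat

def loss(w, x1, x2, y):
    return (y - forward(w, x1, x2)[2]) ** 2

def backprop(w, x1, x2, y):
    """Gradients of the squared loss for two of the weights, via the chain rule."""
    a3, a4, y_hat = forward(w, x1, x2)
    # Error signal at the output unit; sigmoid'(in) = a * (1 - a).
    delta5 = -2.0 * (y - y_hat) * y_hat * (1.0 - y_hat)
    # Propagate the signal back through w[3, 5] and the hidden unit's sigmoid.
    delta3 = delta5 * w[3, 5] * a3 * (1.0 - a3)
    return {(3, 5): delta5 * a3, (1, 3): delta3 * x1}

# Hypothetical weights and training example.
w = {(0, 3): 0.1, (1, 3): 0.4, (2, 3): -0.3,
     (0, 4): -0.2, (1, 4): 0.2, (2, 4): 0.5,
     (0, 5): 0.05, (3, 5): 0.7, (4, 5): -0.6}
x1, x2, y = 1.0, 0.5, 1.0
grads = backprop(w, x1, x2, y)

def numeric_grad(key, eps=1e-6):
    """Central finite difference, used only to check the analytic gradient."""
    w_plus, w_minus = dict(w), dict(w)
    w_plus[key] += eps
    w_minus[key] -= eps
    return (loss(w_plus, x1, x2, y) - loss(w_minus, x1, x2, y)) / (2 * eps)
```

Calling `numeric_grad((3, 5))` and comparing it with `grads[(3, 5)]` should show agreement to several decimal places.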
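● Gradients define the direction of weight updates
○ $\partial Loss / \partial w$ is the gradient of the loss with respect to weight $w$
○ If $\partial Loss / \partial w$ is positive, $w$ is too big: decreasing it reduces the loss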
● Vanishing gradients: derivatives are (very close to) zero
○ Neurons die out (become insensitive to input data)
○ Some activation functions work better than others at preventing vanishing gradients
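A rough illustration of why the choice of activation matters. Back-propagation multiplies one activation derivative per layer, so the sketch below multiplies the derivative across a hypothetical depth of 10 layers. It deliberately ignores the weights, which also enter the product, so it shows only the activation's contribution.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for active units, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

# Best case for the sigmoid: its derivative peaks at 0.25 when z = 0.
depth = 10
sigmoid_chain = sigmoid_grad(0.0) ** depth   # 0.25 ** 10
relu_chain = relu_grad(1.0) ** depth         # stays 1.0 for active units
```

Even in the sigmoid's best case the product shrinks below one in a million after 10 layers. ReLU passes the gradient through unchanged for active units, although a unit whose input stays negative receives a zero gradient.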
Computation Graphs
● Input layer
○ Input encoding: converting input to numerical representation
○ Continuous (e.g. word vectors) or one-hot encoding (n bits for n category labels)
● Hidden layers of units
○ Deeper networks have more layers, hence more parameters to learn
○ Number of layers (parameters) is limited by size of training data and computation
● Output layer and loss functions at output
○ Layer for prediction (e.g. classification, regression)
○ Select loss function that fits the problem
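A minimal sketch of one-hot input encoding, using a hypothetical three-label category set. The helper name `one_hot` is chosen for this example.

```python
def one_hot(label, categories):
    """Return a list with a single 1 at the index of `label` and 0 elsewhere."""
    return [1 if c == label else 0 for c in categories]

categories = ["cat", "dog", "bird"]   # hypothetical label set
encoded = one_hot("dog", categories)
```

With n category labels the encoding has n positions, exactly one of which is set.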
Training and Computation Graphs
● A computation graph is a DAG in which nodes are mathematical operations or
variables, and edges indicate where each node's input values come from
● Example: (a * b + 1) * (a * b + 2)
Goldberg, Chapter 5
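The example expression (a * b + 1) * (a * b + 2) can be evaluated as a graph by hand. In this sketch each assignment in the forward pass is one node, and the backward pass visits the nodes in reverse order. The function name and the input values are chosen for illustration.

```python
def forward_backward(a, b):
    # Forward pass: each line is one node of the graph.
    ab = a * b          # shared by both branches
    left = ab + 1
    right = ab + 2
    out = left * right

    # Backward pass: visit nodes in reverse order, applying the chain rule.
    d_left = right              # d out / d left
    d_right = left              # d out / d right
    d_ab = d_left + d_right     # ab feeds both branches, so contributions add
    d_a = d_ab * b
    d_b = d_ab * a
    return out, d_a, d_b

out, d_a, d_b = forward_backward(2.0, 3.0)
```

Because the node `a * b` is computed once and reused, its gradient is the sum of the gradients arriving from both branches.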
Computation Graph: MLP with Softmax
● Oval nodes are mathematical operations or
functions (e.g., ADD, tanh)
● Shaded rectangles represent parameters
(bound variables)
● Output of each node is a matrix, with
dimensionality shown above the node
● Softmax turns the output into a probability distribution
Goldberg, Chapter 5, Figure 5.1 (a)
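A minimal sketch of the softmax step on a plain list of scores. The input scores are made up, and subtracting the maximum before exponentiating is a common numerical-stability choice rather than something stated on the slide.

```python
import math

def softmax(scores):
    """Map real-valued scores to probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```

Larger scores receive larger probabilities, and the ordering of the scores is preserved.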
Loss Functions
● Negative log likelihood (recall from AIMA Chapter 20)
○ Find the parameters that maximize the probability of the data
○ Same as minimizing the negative log likelihood
● Cross-entropy loss measures dissimilarity between the predicted and true distributions
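A small sketch of cross-entropy for one classification example, with made-up distributions. When the true distribution is one-hot, the cross-entropy reduces to the negative log probability assigned to the correct class, which connects it to the negative log likelihood.

```python
import math

def cross_entropy(true_dist, predicted_dist):
    """H(p, q) = -sum_k p_k * log(q_k); terms with p_k = 0 contribute nothing."""
    return -sum(p * math.log(q) for p, q in zip(true_dist, predicted_dist) if p > 0)

y_true = [0.0, 1.0, 0.0]            # one-hot: class 1 is correct
good_pred = [0.1, 0.8, 0.1]
bad_pred = [0.6, 0.2, 0.2]

loss_good = cross_entropy(y_true, good_pred)   # equals -log(0.8)
loss_bad = cross_entropy(y_true, bad_pred)     # equals -log(0.2)
```

The prediction that puts more probability on the correct class gets the lower loss.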
Select Loss Function Based on Output Distribution
Learning Algorithms
● Gradient learning is an optimization problem
○ Minimize the loss
○ Tune the learning rate, which controls the size of each weight update, to reach a local optimum
■ Optimization algorithms: Stochastic Gradient Descent (SGD),
Newton’s method, etc.
■ Optimization algorithms are often provided in DL frameworks
● Other techniques that help convergence
○ Batch normalization
○ Tuning the optimizer and regularization (learning rate, regularization strength, dropout, etc.)
● Hyper-parameter tuning is very important for neural networks!
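A minimal sketch of stochastic gradient descent, assuming a toy problem that is not from the slides: fitting `y = w*x + b` to noise-free data with the squared loss. The learning rate, epoch count, and seed are arbitrary illustrative choices and are exactly the kind of hyper-parameters that need tuning on real problems.

```python
import random

def sgd_linear_fit(xs, ys, learning_rate=0.1, epochs=200, seed=0):
    """Fit y = w*x + b by per-example gradient steps on the squared loss."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)               # "stochastic": visit examples in random order
        for x, y in data:
            error = (w * x + b) - y
            w -= learning_rate * 2.0 * error * x   # d/dw of error**2
            b -= learning_rate * 2.0 * error       # d/db of error**2
    return w, b

xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [2.0 * x + 1.0 for x in xs]        # noise-free toy data with w = 2, b = 1
w, b = sgd_linear_fit(xs, ys)
```

A much larger learning rate would make the updates overshoot and diverge, while a much smaller one would need many more epochs.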
Convolutional Neural Network
Among the most effective and widely used architectures for vision tasks, e.g. image recognition
Convolutional Neural Network
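● Convolution: can be written as a matrix multiplication whose learned weights are shared across positions
● Kernel $\mathbf{k}$: the pattern of weights applied to each local region of the input
● For each output position i, we take the dot product between the kernel and a
snippet of x with width l, centered on $x_i$:
$z_i = \sum_{j=1}^{l} k_j \, x_{j+i-(l+1)/2}$ (AIMA 21.8)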
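The per-position dot product can be sketched for a one-dimensional input. Zero padding at the borders, the `stride` parameter, an odd kernel width, and the sample values are assumptions made for this example.

```python
def conv1d(x, kernel, stride=1):
    """1D convolution (cross-correlation form) with zero padding.

    Output position i is the dot product of the kernel with the
    length-l window of x centered on i; l is assumed odd.
    """
    l = len(kernel)
    half = l // 2
    padded = [0.0] * half + list(x) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(0, len(x), stride)
    ]

x = [5.0, 6.0, 6.0, 2.0, 5.0, 6.0, 5.0]   # illustrative 1D input
kernel = [1.0, -1.0, 1.0]                  # illustrative kernel, l = 3
z = conv1d(x, kernel)                      # [1.0, 5.0, 2.0, 9.0, 3.0, 4.0, 1.0]
z_strided = conv1d(x, kernel, stride=2)    # [1.0, 2.0, 3.0, 1.0]
```

The same three kernel weights are reused at every position, so the layer has far fewer parameters than a fully connected one. The largest response sits at the position where the input dips.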
Convolutional Neural Network
(AIMA Figure 21.5)
Sequence modeling: Recurrent Neural Network
● Sequential data:
○ Time series data (e.g. stock market, speech signal)
○ Language model
○ Part-of-Speech tagging
● Example: language model
“Every morning I take my dog for a __walk__ ”
Recall: Markov Assumptions
● The state at the current time step depends on a bounded number of states
from previous time steps
○ Keeping track of time step
○ Remembering previous states (memory)
● Recurrent neural network
○ Based on a Markov assumption: the hidden state summarizes the inputs seen so far
○ Hidden states are computed from the input features at each time step
○ Back-propagation through time
○ Weights are shared across all time steps
Recurrent Neural Network
● RNN formulation
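○ Input sequence vector: $\mathbf{x}_1, \ldots, \mathbf{x}_T$
○ Observed output: $\mathbf{y}_1, \ldots, \mathbf{y}_T$
○ Hidden state at time step t: $\mathbf{z}_t$
○ Predicted output at time step t: $\hat{\mathbf{y}}_t$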
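A minimal sketch of the RNN recurrence with scalar states, to show how the hidden state and the predicted output are produced at each time step. The tanh hidden activation, the linear output, the weight names `w_xz`, `w_zz`, `w_zy`, and all values are assumptions for illustration, since the slide's equations did not survive.

```python
import math

def rnn_forward(xs, w_xz, w_zz, w_zy, z0=0.0):
    """Run a scalar RNN over a sequence; the same three weights are reused at every step."""
    z = z0
    hidden_states, outputs = [], []
    for x_t in xs:
        z = math.tanh(w_zz * z + w_xz * x_t)   # hidden state at time step t
        y_hat_t = w_zy * z                     # predicted output at time step t (linear)
        hidden_states.append(z)
        outputs.append(y_hat_t)
    return hidden_states, outputs

# Hypothetical weights and input sequence.
hidden_states, outputs = rnn_forward([1.0, 0.0, 0.0], w_xz=1.0, w_zz=0.5, w_zy=2.0)
```

Only the first input is non-zero, yet the later hidden states are still positive: the hidden state carries information forward in time.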
Long Short-Term Memory
● Modification of RNNs
● Motivation:
○ Vanishing gradients in RNNs: the memory cell carries information across time steps without repeated back-propagation through the recurrent weights
○ Increased ability to capture dependencies over longer sequences
● Controlling information flow by gating units
○ Element-wise multiplication
■ Forget gate
■ Input gate
■ Output gate
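The three gates can be sketched in a scalar LSTM step, following the commonly used formulation with sigmoid gates and a tanh candidate. The parameter layout `p[name] = (input weight, recurrent weight, bias)` and all values are hypothetical. With vectors, each product below becomes an element-wise multiplication.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_step(x_t, z_prev, c_prev, p):
    """One scalar LSTM step; p maps each gate name to (input weight, recurrent weight, bias)."""
    def pre(name):
        w_x, w_z, b = p[name]
        return w_x * x_t + w_z * z_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much of the candidate to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    candidate = math.tanh(pre("cell"))
    c = f * c_prev + i * candidate    # gating is multiplication (element-wise for vectors)
    z = o * math.tanh(c)
    return z, c

# Hypothetical parameters: a large forget-gate bias keeps that gate nearly open,
# and a large negative input-gate bias keeps new writes nearly shut out.
params = {"forget": (0.0, 0.0, 10.0), "input": (0.0, 0.0, -10.0),
          "output": (0.0, 0.0, 0.0), "cell": (1.0, 0.0, 0.0)}
z, c = 0.0, 1.0
for x_t in [0.3, -0.2, 0.5]:
    z, c = lstm_step(x_t, z, c, params)
```

With these settings the cell state stays close to its initial value across steps, which is the mechanism that lets an LSTM remember over longer sequences.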
Long Short-Term Memory
src: (Very nice walkthrough of LSTM, recommended!)
Summary, One
● Deep learning is among the most widely used families of ML techniques
○ Deep means many layers
○ A neural network is the computational structure that realizes DL with neurons
(computational circuits)
● A basic NN has three parts:
○ Input encoding, hidden layers with activation functions, and an output layer
○ The loss function is used to compute gradients for updating the weights
○ Gradients are back-propagated to update the network
● NN architectures differ considerably
○ Feedforward
○ Convolutional NN
○ RNN and LSTM
Summary, Two
● CNN is often used in vision tasks
● RNN is for sequence tasks, based on the Markov assumption
○ Hidden states are computed over time
○ LSTM is a variant of RNN with the benefits of
■ Mitigating vanishing gradients
■ Greater ability to model longer sequences
● Selection of architecture is dependent on the problem
● Hyper-parameter tuning is critical in training neural networks
