
Machine Learning 10-601/301
Tom M. Mitchell
Machine Learning Department Carnegie Mellon University
March 15, 2021
This section:
• Convolutional neural nets
• Recurrent neural nets
• LSTMs
• Sequence to sequence models
Reading:
• optional: Mitchell: Chapter 4
• Note Mitchell book now downloadable

Convolutional Neural Nets

A Convolutional Neural Net for Handwritten
Digit recognition: LeNet5*
[LeCun, et al., 1998]
* In the 1998 LeNet5 paper output layer was a Gaussian RBF layer, though today we would use Softmax to obtain probabilities as outputs

Convolution Layer
[Figure: Input I is convolved with Kernel K (the learned parameters) to produce Result S; from Goodfellow et al.]

Convolution: yields invariance to input translation

Input I             Kernel K          Result S
| 0  1  2 |         | 1  2 |          |  5   9 |
| 3  4  5 |    *    | 1  0 |    =     | 17  21 |
| 6  7  8 |

Kernel K: trained parameters.  Result S: output activations.
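As a check on the arithmetic above, here is a minimal NumPy sketch (not course code). Note that, as is common in deep learning, the operation computed is technically a cross-correlation: the kernel is slid over the input without being flipped.

import numpy as np

# Input I, kernel K, and result S from the slide
I = np.array([[0, 1, 2],
              [3, 4, 5],
              [6, 7, 8]])
K = np.array([[1, 2],
              [1, 0]])

S = np.zeros((2, 2), dtype=int)
for i in range(2):            # slide the kernel over every valid position
    for j in range(2):
        S[i, j] = np.sum(I[i:i+2, j:j+2] * K)

print(S)    # [[ 5  9]
            #  [17 21]]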

Convolution as parameter sharing

(Same Input I, Kernel K, and Result S as on the previous slide.)

Every entry of Result S (the output activations) is computed with the same kernel weights (the trained parameters), so the parameters are shared across input positions.

Result S: S(i,j) = Σm Σn I(i+m, j+n) · K(m,n)



How do we calculate gradient components for training example d?

(Same Input I, Kernel K, and Result S as above: Kernel K holds the trained parameters, Result S the output activations.)
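A hedged sketch of the answer (the symbol J for the training loss is an assumption, not necessarily the slides' notation): because the same kernel weight K(m,n) is used at every output position, its gradient sums the contributions from all entries of S, i.e. ∂J/∂K(m,n) = Σi Σj ∂J/∂S(i,j) · I(i+m, j+n). In NumPy:

import numpy as np

I = np.array([[0, 1, 2],
              [3, 4, 5],
              [6, 7, 8]])

# Suppose backprop has already produced dJ/dS for training example d
# (these values are made up purely for illustration).
dJ_dS = np.array([[0.1, -0.2],
                  [0.3,  0.4]])

dJ_dK = np.zeros((2, 2))
for m in range(2):
    for n in range(2):
        # across all output positions, K(m,n) multiplied the inputs I[m:m+2, n:n+2]
        dJ_dK[m, n] = np.sum(dJ_dS * I[m:m+2, n:n+2])

print(dJ_dK)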

Maxpool Layer
out = max(a, b, e, f)
What is the derivative of out with respect to the inputs?
e.g., if a=2, b=3, e=2, f=4
[from Goodfellow et al.]
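A sketch of the answer: max passes the gradient only to the input that achieved the maximum. With a=2, b=3, e=2, f=4 we have out = 4, so ∂out/∂f = 1 and the derivatives with respect to a, b, e are 0 (ties would be broken arbitrarily). In NumPy:

import numpy as np

inputs = np.array([2.0, 3.0, 2.0, 4.0])   # a, b, e, f from the slide
out = inputs.max()                        # forward pass: out = 4.0

# backward pass: derivative is 1 for the argmax input, 0 for the others
grad = (inputs == out).astype(float)
print(out, grad)                          # 4.0 [0. 0. 0. 1.]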

Subsampling Layer (in LeNet)
out = sigmoid(w0 + w1(a+b+e+f))
What is the derivative of out with respect to the inputs?
[from Goodfellow et al.]
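A sketch of the answer, using sigmoid'(z) = sigmoid(z)(1 − sigmoid(z)): with s = a+b+e+f, we get ∂out/∂a = out(1 − out)·w1 (and likewise for b, e, f), ∂out/∂w1 = out(1 − out)·s, and ∂out/∂w0 = out(1 − out). The parameter values below are made up for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a, b, e, f = 2.0, 3.0, 2.0, 4.0     # example inputs
w0, w1 = 0.1, 0.5                   # the layer's two trainable parameters

s = a + b + e + f
out = sigmoid(w0 + w1 * s)

d_pre = out * (1.0 - out)           # derivative of sigmoid at w0 + w1*s
d_a  = d_pre * w1                   # same value for d_b, d_e, d_f
d_w1 = d_pre * s
d_w0 = d_pre
print(d_a, d_w0, d_w1)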

A Convolutional Neural Net for Handwritten Digit recognition: LeNet5*

LeNet5 details
• [LeCun et al., 1998]
• C1 is a convolution layer using 6 distinct 5×5 kernels, stride 1, creating 6 distinct channels of 28×28 feature maps, each based on one kernel. Total trainable parameters: 156
• S2 is a subsampling layer, creating 6 channels, one each from the corresponding channel of C1. Values are based on a 2×2 input kernel, stride 2 (so no overlap) and the value output to the S2 map is out = sigmoid(w0+w1(x1+x2+x3+x4)), where xi’s are the four inputs to the 2×2 kernel. Total trainable parameters:
• C3 is a convolutional layer, using 16 kernels to produce 16 feature maps. Each kernel is connected to several 5×5 neighborhoods at identical locations in a subset of the 6 channels of S2 as shown below. Total trainable parameters: 1,516
• S4 subsamples C3, just like S2 subsamples C1
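A sketch of where these counts come from. C1's 156 follows directly from the slide; the breakdown of C3's 1,516 assumes the connection table of the original LeCun et al. paper (6 maps see 3 S2 channels, 9 maps see 4, and 1 map sees all 6), which the slide only alludes to.

# C1: 6 kernels of size 5x5, each with one bias term
c1 = 6 * (5 * 5 + 1)                                          # 156

# C3: one 5x5 kernel per S2 channel a map sees, plus one bias per map
c3 = 6 * (3 * 25 + 1) + 9 * (4 * 25 + 1) + 1 * (6 * 25 + 1)   # 1516
print(c1, c3)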

LeNet5 details
• [LeCun et al., 1998]
• C1 is a convolution layer using 6 distinct 5×5 kernels, stride 1, creating 6 distinct channels of 28×28 feature maps, each based on one kernel. Total trainable parameters: 156
• S2 is a subsampling layer, creating 6 channels, one each from the corresponding channel of C1. Values are based on a 2×2 input kernel, stride 2 (so no overlap) and the value output to the S2 map is out = sigmoid(w0+w1(x1+x2+x3+x4)), where xi’s are the four inputs to the 2×2 kernel. Total trainable parameters:
• C3 is a convolutional layer, using 16 kernels to produce 16 feature maps. Each kernel is connected to several 5×5 neighborhoods at identical locations in a subset of the 6 channels of S2 as shown below. Total trainable parameters: 1,516
• S4 subsamples C3, just like S2 subsamples C1
Poll Question 2: How many total trainable parameters are in layer S2?
Answer:


LeNet5 (1998) vs. a more typical 2021 convolutional net:
[Figure: the modern net replaces LeNet5's sigmoid subsampling layers with max-pool layers, uses sigmoid, linear, or ReLU units and fully connected layers, and ends in a Softmax output layer]

Softmax Layer: Predict Probability Distribution over discrete-valued labels
• Logistic Regression: when Y has two possible values
• Softmax: when Y has R values {y1 … yR}, then learn R sets of weights to predict R output probabilities
Note neural network now has R outputs instead of just 1
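A minimal NumPy sketch of a softmax output layer with R classes (the weight and dimension names here are assumptions, not the course's notation):

import numpy as np

def softmax(z):
    z = z - z.max()               # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

R, H = 10, 100                    # R output classes, H hidden units (assumed)
W = np.random.randn(R, H) * 0.01  # R sets of weights, one per class
b = np.zeros(R)

h = np.random.randn(H)            # activations of the last hidden layer
p = softmax(W @ h + b)            # R probabilities, summing to 1
print(p.sum())                    # ~1.0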

A Convolutional Neural Net for Handwritten Digit recognition: LeNet
• Shrinking size of feature maps
• Multiple channels
• LeNet-5 Demos:
http://yann.lecun.com/exdb/lenet/index.html
• Vary scale
• Vary stroke width
• Squeeze
• Noisy-2, Noisy-4

[from Goodfellow et al.]

Convolutional networks for time series → invariance across time
[from Margarita Granat]

[Abdel-Hamid, et al., Convolutional Neural Networks for Speech Recognition, IEEE, 2014]

Convolutional Neural Nets
• Convolution across space, time
• Parameter sharing
• Translation invariance
• Scaling
• Multiple channels of “feature maps”
• Architecture with multiple types of layers
• Popular for perception problems

Recurrent Neural Nets for Sequential Data

Sequences
● Words, Letters
● Speech
● Images, Videos
● Programs
● Sequential Decision Making (RL)

Recurrent Networks
• Key idea: recurrent network uses (part of) its state at t as input for t+1
[Figure: the hidden state at the previous time step is combined with the current input and passed through a nonlinearity to produce the new hidden state]
[Goodfellow et al., 2016]


Recurrent Networks
• Key idea: recurrent network uses (part of) its state at t as input for t+1
Another example of parameter sharing, like CNNs
[Goodfellow et al., 2016]

Training Recurrent Networks
Key principle for training:
1. Treat as if unfolded in time, resulting in directed acyclic graph
2. Note shared parameters in the unfolded net → sum the gradients
[Goodfellow et al., 2016]
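In symbols (a standard formulation; W, U, V name the shared weight matrices and are not necessarily the slides' notation), the unfolded network computes

h_t = tanh(W h_{t-1} + U x_t + b),    o_t = softmax(V h_t + c)

and because the same W, U, V appear at every time step, the gradient of the loss with respect to each shared parameter is the sum, over time steps, of the gradients computed for its unfolded copies:

∂J/∂W = Σ_{t=1..T} (∂J/∂W restricted to the copy of W used at step t)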

Example: RNN to predict next character in string
• Train on entire works of Shakespeare
• 5,448,482 characters, 84 unique
• Python code online with today’s slides

Example: RNN to predict next character in string
• xt : input character, encode 1-hot, 84 dimensions
• ht : hidden layer, 100 dimensions
• ot : predicted next character, softmax, 84 dimensions
84 unique characters in this dataset
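A minimal NumPy sketch of one forward step with these dimensions (this is not the Python code posted with the slides; the weight-matrix names are made up):

import numpy as np

V, H = 84, 100                       # vocabulary size, hidden layer size

Wxh = np.random.randn(H, V) * 0.01   # input-to-hidden weights
Whh = np.random.randn(H, H) * 0.01   # hidden-to-hidden weights (shared over time)
Why = np.random.randn(V, H) * 0.01   # hidden-to-output weights
bh, by = np.zeros(H), np.zeros(V)

def step(char_index, h_prev):
    x = np.zeros(V)
    x[char_index] = 1.0                          # 1-hot encoding of input xt
    h = np.tanh(Wxh @ x + Whh @ h_prev + bh)     # hidden state ht
    logits = Why @ h + by
    p = np.exp(logits - logits.max())
    p /= p.sum()                                 # softmax over the next character
    return p, h

p, h = step(0, np.zeros(H))                      # one step from a zero initial state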

Example: RNN to predict next character in string
[Figure: the RNN unrolled over the characters of an input string (characters shown include N, i, c, e), with each step's softmax output predicting the next character]

Training loss
[Figure: training loss plotted over training iterations]

Generated strings at different stages of training
0 iterations:
2000 iterations:
200000 iterations:

Example: Language Models to Predict next word
Slide Credit: Piotr Mirowski

Chain Rule
Slide Credit: Piotr Mirowski
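The chain rule here is presumably the usual factorization of a word sequence's probability, which the language model approximates one conditional at a time:

P(w_1, …, w_T) = Π_{t=1..T} P(w_t | w_1, …, w_{t-1})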

• Forward Pass
Slide Credit: Piotr Mirowski

• Backward Pass
* problem: vanishing and/or exploding gradients
Slide Credit: Piotr Mirowski
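One standard way to see the problem (an added note, not from the slide): with h_t = tanh(W h_{t-1} + U x_t + b), backpropagating from step T to step t multiplies a long chain of Jacobians,

∂h_T/∂h_t = Π_{k=t+1..T} ∂h_k/∂h_{k-1} = Π_{k=t+1..T} diag(1 − h_k ⊙ h_k) · W,

whose norm tends to shrink (vanish) or grow (explode) geometrically as T − t increases.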

• Learned hidden representations of context useful for:
• part of speech labeling
• sentiment analysis
• information extraction
• Predict label for each word, instead of predicting next word
Slide Credit: Piotr Mirowski

Example: Opinion Mining
Label opinion segments by labeling each word:
o = outside
b = beginning of segment
i = inside segment
Label: o b i i i i o
Trump [has come a long way] from
[Irsoy & Cardie, 2014]
h summarizes earlier words

Deep Bidirectional Recurrent Network
[Irsoy & Cardie, 2014]
Two additional ideas:
• Multiple layers to compute y from x
• A left-to-right RNN, plus right-to-left RNN
Example:
• Y label values {begin, inside, outside} for each word, to label
contiguous text segments indicating opinions. [Irsoy & Cardie, 2014]

Deep Bidirectional Recurrent Network: Opinion Mining
Mr. Stoiber [has come a long way] from his refusal to …
Correct labels Y: o o b i i i i o o o o
(o = outside, b = begin, i = inside)
[Irsoy & Cardie, 2014]


Long Short Term Memory (LSTMs)

[Figures: an LSTM layer unrolled over inputs x1, x2, x3, producing hidden states h1, h2, h3; successive slides build up the cell's internal gates, which are applied by element-wise multiplication]
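For reference, a sketch of one common formulation of the LSTM cell equations that these figures build up (the notation may differ from the slides; ⊙ is element-wise multiplication, σ the sigmoid, and [h_{t-1}, x_t] the concatenation of the previous hidden state and the current input):

f_t = σ(W_f [h_{t-1}, x_t] + b_f)      (forget gate)
i_t = σ(W_i [h_{t-1}, x_t] + b_i)      (input gate)
g_t = tanh(W_g [h_{t-1}, x_t] + b_g)   (candidate cell value)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t        (cell state)
o_t = σ(W_o [h_{t-1}, x_t] + b_o)      (output gate)
h_t = o_t ⊙ tanh(c_t)                  (hidden state)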

Bi-directional Recurrent Neural Networks
• Key idea: processing of word at position t can depend on following words too, not just preceding words
[Goodfellow et al., 2016]

Deep Bidirectional LSTM Network
[“Hybrid Speech Recognition with Deep Bidirectional LSTM,” Graves et al., 2013]

Gated Recurrent Units (GRUs)
[Figure: GRU cell; its gates are applied by element-wise multiplication]
GRUs have fewer parameters than LSTMs and were found equally effective in some experiments involving:
• speech recognition
• music analysis
see [Chung et al., 2014]
Optional material – won’t be on exam
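For comparison (also optional), one common formulation of the GRU update, as in the cited Chung et al. paper: it uses a reset gate r and an update gate z, has no separate cell state, and therefore has fewer parameters than an LSTM.

r_t = σ(W_r x_t + U_r h_{t-1})           (reset gate)
z_t = σ(W_z x_t + U_z h_{t-1})           (update gate)
n_t = tanh(W x_t + U (r_t ⊙ h_{t-1}))    (candidate state)
h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ n_t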

Sequence to Sequence Learning
[Figure: an Encoder maps the Input Sequence to a Learned Representation; a Decoder maps that representation to the Output Sequence]
• RNN Encoder-Decoders for Machine Translation (Sutskever et al. 2014; Cho et al. 2014; Kalchbrenner et al. 2013; Srivastava et al. 2015)

Seq2Seq
[Figure: the unrolled model reads the input sequence "A B C D __ X Y Z" and is trained to emit the target sequence "X Y Z Q" at the decoder steps]
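A minimal Keras sketch of an RNN encoder-decoder in this spirit (not from the cited papers or the course; the vocabulary sizes, dimensions, and use of one-hot inputs are all assumptions):

from tensorflow import keras
from tensorflow.keras import layers

src_vocab, tgt_vocab, hidden = 84, 84, 128          # assumed sizes

# Encoder: read the input sequence and keep only its final state
enc_inputs = keras.Input(shape=(None, src_vocab))
_, state_h, state_c = layers.LSTM(hidden, return_state=True)(enc_inputs)

# Decoder: generate the target sequence, initialized with the encoder's state
dec_inputs = keras.Input(shape=(None, tgt_vocab))
dec_hidden = layers.LSTM(hidden, return_sequences=True)(
    dec_inputs, initial_state=[state_h, state_c])
probs = layers.Dense(tgt_vocab, activation="softmax")(dec_hidden)

model = keras.Model([enc_inputs, dec_inputs], probs)
model.compile(optimizer="adam", loss="categorical_crossentropy")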

Sequence to Sequence Models
• Natural language processing is concerned with tasks involving language data
Andrej Karpathy. The Unreasonable Effectiveness of Recurrent Neural Networks

Programming Frameworks for Deep Nets
• Pytorch (Facebook)
• TensorFlow (Google)
• TFLearn (runs on top of TensorFlow, but simpler to use)
• Theano (University of Montreal)
• CNTK (Microsoft)
• Keras (can run on top of Theano, CNTK, TensorFlow)
Many support use of Graphics Processing Units (GPUs), a major factor in the dissemination of deep network technology

TensorFlow example
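The code shown on this slide is not reproduced in these notes. Purely as an illustration of the kind of example such a slide might contain, a minimal tf.keras convolutional classifier for 28×28 digit images (every detail below is assumed, not the slide's code):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(6, 5, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(16, 5, activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),   # probabilities over 10 digits
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=5)   # given MNIST-style training data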

Modern Deep Networks: 2021 vs 1987
• vastly more online data
• GPUs, TPUs
• Heterogeneous units
– ReLU, sigmoid, tanh, linear
• including memory units – LSTM, GRU, …
• wild new architectures
– 100 layers deep, bidirectional LSTMs, Convolutional nets widespread …
• new ideas for gradient descent
– dropout, batch normalization, weight initialization, …
• unification with probabilistic models – train to output probabilities
• frameworks like TensorFlow

What you should know:
• Representation learning
– Hidden layers re-represent inputs in form to predict outputs
– Autoencoders
– Sometimes reused widely (e.g., word2vec word embeddings)
• Convolutional neural networks
– Convolution provides translation invariance
– Network stages with decreasing spatial resolution, multiple channels, …
• Recurrent neural networks
– Learn to represent history in time series
– Backpropagation as unfolding in time
– LSTM memory units
• Neural architectures
– Shared parameters across multiple computations
– Layers with different structures/functions
– Probabilistic classification → output Softmax layer