Lecture 4: Feedforward Neural Network
Instructor:
Outline of this lecture
} From Neuron to Feedforward Neural Network
} Prediction Function: Network Architecture
} Cost Function
} Optimization
} Case Study
Inspired by Human Brain
• Our brain has lots of neurons connected together and the strength of the connection between neurons represents long-term knowledge
• Visual Cortex in Brains:
• The first hierarchy of neurons that receives information in the visual cortex is sensitive to specific edges
• Brain regions further down the visual pipeline are sensitive to more
complex structures such as faces, textures, etc.
What’s a Neural Network (as a newspaper reporter would put it)
} Part of the machine learning field; exceptionally effective at learning knowledge/patterns from a large amount of annotated data
} Provided a sufficient number of training samples, the system begins to understand things the way human beings do.
} Works as the computer brain behind many eye-catching systems: self-driving vehicles, intelligent diagnosis, data science, AlphaGo
} GPUs + tons of human annotations
} Money!
What’s a Neural Network (in this course)
} A machine learning model applicable to regression, classification, unsupervised learning, etc.
} Making predictions by a composition of functions (a function of functions)
} Multiple layers
} Inputs can be high-dimensional features
} Outputs are typically low-dimensional labels (e.g., binary labels)
} The name reflects the design principles of the prediction function
Nonlinear Transformation
} Linear Regression
h_θ(x) = θᵀx = θ₀ + θ₁x₁ + θ₂x₂ + ⋯
} Logistic Regression
h_θ(x) = g(θᵀx)
g(x) = 1 / (1 + e^(−x))
Neuron in Brain
Neuron Unit
g(∙): activation function
[Figure: a single neuron unit with inputs x₀, x₁, x₂, x₃ and weights θ₀, θ₁, θ₂, θ₃]
θᵀx = θ₀x₀ + θ₁x₁ + θ₂x₂ + θ₃x₃
h_θ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))
Activation functions
} ReLU is popular
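To make this concrete, here is a minimal NumPy sketch (not from the slides) of the sigmoid and ReLU activations mentioned above:

```python
import numpy as np

def sigmoid(z):
    # squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # keeps positive inputs and zeroes out negative ones
    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z))
print(relu(z))
```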
Neural Network: Deeper
[Figure: inputs x₀, x₁, x₂, x₃ feed hidden units a₀, a₁, a₂, a₃, which feed the output unit]
Layer 1: Input
Layer 2: Hidden Layer
Layer 3: Output
Neural Network: Architecture
} Multiple layers
How do neural networks work?
Example 1: neural network for XOR
} XOR: exclusive OR gate
XOR = [x₁ and (not x₂)] or [(not x₁) and x₂]
Nonlinear Boundary
} XOR = [x₁ and (not x₂)] or [(not x₁) and x₂]
} Three basic operators:
} or
} x₁ and (not x₂)
} (not x₁) and x₂
} Idea: construct the basic operators first and then compose them into XOR
Operator: Or? And ? Not? else?
h_θ(x) = g(−30×1 + 20×x₁ + 20×x₂)
Prediction
Answer: and
Operator: Or? And ? Not? else?
h_θ(x) = g(30×1 − 40×x₁)
Prediction
Answer: not gate
Quiz: design neuron unit for OR
h_θ(x) = g(?)
Prediction
Operator: OR
h_θ(x) = g(−30×1 + 40×x₁ + 40×x₂)
Prediction
Assume parameters have been learnt (introduced later)
Operator: x₁ and (not x₂)
h_θ(x) = g(−30×1 + 50×x₁ − 30×x₂)
Prediction
Operator: (not x₁) and x₂
h_θ(x) = g(−30×1 − 30×x₁ + 50×x₂)
Prediction
Assume parameters have been learnt (introduced later)
Composite operator: XOR
[Figure: the units above, each with inputs 1, x₁, x₂, composed into one network: a hidden layer computing "x₁ and (not x₂)" and "(not x₁) and x₂", followed by an OR output unit]
XOR = [x₁ and (not x₂)] or [(not x₁) and x₂]
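A small Python sketch of this composition, reusing the weights from the previous slides (the function name and the use of a sigmoid g are illustrative assumptions):

```python
import numpy as np

def g(z):
    # sigmoid activation, as in the neuron units above
    return 1.0 / (1.0 + np.exp(-z))

def xor_net(x1, x2):
    a1 = g(-30 + 50 * x1 - 30 * x2)    # hidden unit: x1 AND (NOT x2)
    a2 = g(-30 - 30 * x1 + 50 * x2)    # hidden unit: (NOT x1) AND x2
    return g(-30 + 40 * a1 + 40 * a2)  # output unit: a1 OR a2

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(xor_net(x1, x2)))  # prints 0, 1, 1, 0
```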
How to design a neural network for XNOR?
XNOR = (x₁ and x₂) or [(not x₁) and (not x₂)]
Multi-class Neural Network
Training samples: (xⁱ, yⁱ), i = 1, ..., m
yⁱ ∈ R³:
yⁱ = (1, 0, 0) if the i-th sample is sea bass
yⁱ = (0, 1, 0) if the i-th sample is salmon
yⁱ = (0, 0, 1) if the i-th sample is cyprinoid
[Figure: network with inputs x₀, x₁, x₂, x₃, hidden activations a₁⁽²⁾, a₂⁽²⁾, a₃⁽²⁾, and three output units]
Architecture
Outline of this lecture
} From Neuron to Feedforward Neural Network
} Prediction Function: Network Architecture
} Cost Function
} Optimization
} Case Study
Network-1: two-layer network, Binary class
Given the network parameters, how do we get predictions?
z₁⁽²⁾ = [θ₁₁⁽¹⁾, θ₁₂⁽¹⁾, θ₁₃⁽¹⁾]ᵀ [x₁, x₂, x₃]
h_θ(x) = a₁⁽²⁾ = g(z₁⁽²⁾)
Network-2: two layer, multi-class
The network parameters are provided; how do we get predictions?
[Figure: network with three output units h_θ(x)₁, h_θ(x)₂, h_θ(x)₃]
Make Prediction
} z₁⁽²⁾ = [θ₁₁⁽¹⁾, θ₁₂⁽¹⁾, θ₁₃⁽¹⁾]ᵀ [x₁, x₂, x₃]
} h_θ(x)₁ = a₁⁽²⁾ = g(z₁⁽²⁾)
} a₂⁽²⁾ = ?
} a₃⁽²⁾ = ?
Network-3: more layers
For one training sample 𝑥
a⁽¹⁾ = [x₁, x₂, x₃]
z₁⁽²⁾ = Σⱼ θ₁ⱼ⁽¹⁾ aⱼ⁽¹⁾
z₂⁽²⁾ = Σⱼ θ₂ⱼ⁽¹⁾ aⱼ⁽¹⁾
z₃⁽²⁾ = Σⱼ θ₃ⱼ⁽¹⁾ aⱼ⁽¹⁾
z⁽²⁾ = (θ⁽¹⁾)ᵀ a⁽¹⁾
Make Predictions: more layers
𝑎’ = 𝑔(𝑧’)
h!(𝑥) = 𝑎(
= (𝜃(‘))′𝑎(‘) 𝑎( = 𝑔(𝑧()
𝑎!(“) 𝑎”(“) 𝑎#(“)
Forward propagation:
z⁽²⁾ = θ⁽¹⁾ a⁽¹⁾
a⁽²⁾ = g(z⁽²⁾)
z⁽³⁾ = θ⁽²⁾ a⁽²⁾
h_θ(x) = a⁽³⁾ = g(z⁽³⁾)
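A minimal NumPy sketch of this forward propagation for a 3-4-1 network; the weight values are random placeholders and the bias units are omitted for brevity:

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid activation

rng = np.random.default_rng(0)
theta1 = rng.normal(size=(4, 3))   # θ(1): 4 hidden units, 3 inputs
theta2 = rng.normal(size=(1, 4))   # θ(2): 1 output unit

x = np.array([0.5, -1.0, 2.0])     # one training sample

a1 = x                             # a(1) = x
z2 = theta1 @ a1                   # z(2) = θ(1) a(1)
a2 = g(z2)                         # a(2) = g(z(2))
z3 = theta2 @ a2                   # z(3) = θ(2) a(2)
h = g(z3)                          # h_θ(x) = a(3) = g(z(3))
print(h)
```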
Outline of this lecture
} From Neuron to Feedforward Neural Network
} Prediction Function: Network Architecture
} Cost Function
} Optimization
} Case Study
Cost Function
} How good is the prediction?
} Review:
} Residual Sum of Squares (RSS)
} Cross-entropy function
Training set: (xⁱ, yⁱ), i = 1, 2, 3, ..., m
Binary class: yⁱ = 0 or 1, one output unit
Multi-class: yⁱ = (1, 0, 0), (0, 1, 0), or (0, 0, 1); K (= 3) output units
Cost Function
} Network-1: two-layer network, binary class
z₁⁽²⁾ = [θ₁₀⁽¹⁾, θ₁₁⁽¹⁾, θ₁₂⁽¹⁾, θ₁₃⁽¹⁾]ᵀ [x₀, x₁, x₂, x₃]
h_θ(x) = a₁⁽²⁾ = g(z₁⁽²⁾)
F(θ) = 1/(2m) Σᵢ₌₁ᵐ (yⁱ − h_θ(xⁱ))²
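A minimal sketch of evaluating this cost in NumPy; the predictions below are made-up numbers standing in for h_θ(xⁱ):

```python
import numpy as np

y = np.array([1.0, 0.0, 1.0, 0.0])            # labels for m = 4 samples
predictions = np.array([0.9, 0.2, 0.7, 0.1])  # h_θ(x_i), e.g. from a forward pass

m = len(y)
F = np.sum((y - predictions) ** 2) / (2 * m)  # F(θ) = 1/(2m) Σ (y_i − h_θ(x_i))²
print(F)
```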
Cost Function
} Network 2: two-layer network, multi-class
h_θ(xⁱ), yⁱ ∈ Rᴷ, K (= 3) output units
F(θ) = 1/(2m) Σᵢ₌₁ᵐ Σₖ₌₁ᴷ (yⁱ[k] − h_θ(xⁱ)[k])²
} Network 3: multi-layer network, multi-class
Same cost function as two-layer network
Cost Function
} Example 4: multi-layer network, binary class
Outline of this lecture
} From Neuron to Feedforward Neural Network
} Prediction Function: Network Architecture
} Cost Function
} Optimization
} Case Study
Optimization: gradient descent
For each parameter,
θⱼ = θⱼ − α ∂F(θ)/∂θⱼ
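A minimal sketch of this update loop; gradient_of_F is a hypothetical function returning ∂F/∂θ for all parameters, and how those gradients are obtained is the topic of the rest of this section:

```python
import numpy as np

def gradient_descent(theta, gradient_of_F, alpha=0.1, num_iters=100):
    for _ in range(num_iters):
        theta = theta - alpha * gradient_of_F(theta)  # θ_j := θ_j − α ∂F/∂θ_j
    return theta

# Toy example: minimize F(θ) = ||θ||², whose gradient is 2θ
print(gradient_descent(np.array([3.0, -2.0]), lambda t: 2 * t))
```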
How do we calculate the gradients?
Network-1: two-layer, binary class
F(θ) = 1/(2m) Σᵢ₌₁ᵐ (yⁱ − h_θ(xⁱ))²
Derivatives w.r.t parameters
∂F(θ)/∂θ₁₀⁽¹⁾ = ?   ∂F(θ)/∂θ₁₁⁽¹⁾ = ?   ∂F(θ)/∂θ₁₂⁽¹⁾ = ?   ∂F(θ)/∂θ₁₃⁽¹⁾ = ?
Derivative w.r.t. θ₁ⱼ⁽¹⁾:
∂F(θ)/∂θ₁ⱼ⁽¹⁾ ⇒ (h_θ(xⁱ) − yⁱ) g′(z₁⁽²⁾) · ∂z₁⁽²⁾/∂θ₁ⱼ⁽¹⁾
(error received from the upper layer) × (gradient of the current neuron unit)
} Gradient Descent Method: Repeat
θ₁₀⁽¹⁾ = θ₁₀⁽¹⁾ − α ∂F(θ)/∂θ₁₀⁽¹⁾
θ₁₁⁽¹⁾ = θ₁₁⁽¹⁾ − α ∂F(θ)/∂θ₁₁⁽¹⁾
θ₁₂⁽¹⁾ = θ₁₂⁽¹⁾ − α ∂F(θ)/∂θ₁₂⁽¹⁾
θ₁₃⁽¹⁾ = θ₁₃⁽¹⁾ − α ∂F(θ)/∂θ₁₃⁽¹⁾
How About Multi-layer Networks?
} How to calculate gradients for every parameter?
g(Σᵢ wᵢxᵢ + b)
} g(): activation function
Network Architectures
Network: Prediction
} Forward pass: for one training sample x
Layer 1 → Layer 2:
a⁽¹⁾ = [x₀, x₁, x₂, x₃]
z₁⁽²⁾ = Σⱼ θ₁ⱼ⁽¹⁾ aⱼ⁽¹⁾
z₂⁽²⁾ = Σⱼ θ₂ⱼ⁽¹⁾ aⱼ⁽¹⁾
z⁽²⁾ = θ⁽¹⁾ a⁽¹⁾
Layer 2 → Layer 3:
a⁽²⁾ = g(z⁽²⁾)
z⁽³⁾ = (θ⁽²⁾)ᵀ a⁽²⁾
a⁽³⁾ = g(z⁽³⁾)
h_θ(x) = a⁽³⁾
Network: Cost Function
F(θ) = 1/(2m) Σᵢ₌₁ᵐ (yⁱ − h_θ(xⁱ))²
Gradients: last layer
F(W) = 1/(2m) Σᵢ (h_W(xⁱ) − yⁱ)²
h_W(xⁱ) = g(W xⁱ) = g(w₁x₁ⁱ + w₂x₂ⁱ + ⋯)
Gradients: other layers
a = g(Σᵢ wᵢxᵢ + b)
Local gradients
∂F/∂w₁ = (∂F/∂a)(∂a/∂w₁)
∂F/∂w₂ = (∂F/∂a)(∂a/∂w₂)
Example: Gradients for a simple network
Gradient Descent
𝐹(𝑥, 𝑦, 𝑧)
x = x − α ∂F/∂x
y = y − α ∂F/∂y
z = z − α ∂F/∂z
Example: Gradients for a simple network
𝐹(𝑥, 𝑦, 𝑧)
∂F/∂x = (∂F/∂q)(∂q/∂x)
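For instance, taking q = x + y and F = q · z (an assumed example with arbitrary values), the chain rule gives:

```python
x, y, z = -2.0, 5.0, -4.0
q = x + y                 # forward: q = 3
F = q * z                 # forward: F = -12

dF_dq = z                 # ∂F/∂q = z
dF_dx = dF_dq * 1.0       # ∂F/∂x = ∂F/∂q · ∂q/∂x = -4
dF_dy = dF_dq * 1.0       # ∂F/∂y = -4
dF_dz = q                 # ∂F/∂z = 3
print(F, dF_dx, dF_dy, dF_dz)
```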
Example: A Neuron
f(w, x) = 1 / (1 + e^(−(w₀x₀ + w₁x₁ + w₂)))
Set x = (−1.0, −2.0), y = 1, w = (2.0, −3.0, −3.0). How do we calculate the gradients for w?
[Figure: computational graph of f with gates +, ×(−1), exp, +1, 1/x and forward values 1.0, −1.0, 0.37, 1.37, 0.73]
Set dF/dF = 1.0
f(w, x) = 1 / (1 + e^(−(w₀x₀ + w₁x₁ + w₂)))
x = (−1.0, −2.0), y = 1, w = (2.0, −3.0, −3.0)
Gate a(x) = 1/x: da/dx = −1/x²
dF/da · da/dx = 1.0 × (−1/1.37²) = −0.53
f(w, x) = 1 / (1 + e^(−(w₀x₀ + w₁x₁ + w₂)))
x = (−1.0, −2.0), y = 1, w = (2.0, −3.0, −3.0)
Gate a(x) = 1 + x: da/dx = 1
dF/da · da/dx = 1 × (−0.53) = −0.53
f(w, x) = 1 / (1 + e^(−(w₀x₀ + w₁x₁ + w₂)))
x = (−1.0, −2.0), y = 1, w = (2.0, −3.0, −3.0)
Gate a(x) = exp(x): da/dx = exp(x)
dF/da · da/dx = exp(−1.0) × (−0.53) ≈ −0.2
f(w, x) = 1 / (1 + e^(−(w₀x₀ + w₁x₁ + w₂)))
x = (−1.0, −2.0), y = 1, w = (2.0, −3.0, −3.0)
Gate a(x) = −x: da/dx = −1
dF/da · da/dx = (−1) × (−0.2) = 0.2
f(w, x) = 1 / (1 + e^(−(w₀x₀ + w₁x₁ + w₂)))
x = (−1.0, −2.0), y = 1, w = (2.0, −3.0, −3.0)
Add gate: [local gradient] × [upstream gradient]
branch w₀x₀: 1 × [0.2] = 0.2
branch w₁x₁ + w₂: 1 × [0.2] = 0.2
f(w, x) = 1 / (1 + e^(−(w₀x₀ + w₁x₁ + w₂)))
x = (−1.0, −2.0), y = 1, w = (2.0, −3.0, −3.0)
Mul gate: [local gradient] × [upstream gradient]
w₀: x₀ × [0.2] = −1.0 × 0.2 = −0.2
x₀: w₀ × [0.2] = 2.0 × 0.2 = 0.4
"ih*+ , "ih*+
= (1 − g x )(g x )
1 + 𝑒@ A%B%CA&B&CA'
= 𝑔 𝑤$𝑥$ + 𝑤!𝑥! + 𝑤"
ef g = h*+ =("ih*+j") "
Sigmoid gate
[Figure: the same graph with the gates after the weighted sum collapsed into a single sigmoid gate]
Local gradient of the sigmoid gate: g(z)(1 − g(z)) = 0.73 × (1 − 0.73) ≈ 0.2
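A short Python sketch that reproduces the numbers in this worked example, using the sigmoid shortcut for the backward pass:

```python
import numpy as np

w = np.array([2.0, -3.0, -3.0])
x = np.array([-1.0, -2.0])

z = w[0] * x[0] + w[1] * x[1] + w[2]   # forward: z = 1.0
f = 1.0 / (1.0 + np.exp(-z))           # forward: f ≈ 0.73

dz = f * (1.0 - f)                     # sigmoid gate: df/dz = f(1 − f) ≈ 0.2
dw = np.array([x[0] * dz, x[1] * dz, 1.0 * dz])  # ≈ [-0.2, -0.4, 0.2]
dx = np.array([w[0] * dz, w[1] * dz])            # ≈ [0.4, -0.6]
print(f, dz, dw, dx)
```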
Backward Propagation
} Add gate: gradient distributor
} Mul gate: gradient switcher
} Quiz: How is the gradient calculated for a Max gate?
} Max gate: gradient router
max(w₀, x₀)
[Figure: the computational graph from the neuron example, annotated with forward values and backward gradients]
Learning: forward & backward
} Forward Propagation
} Start from the first layer
} Propagate predictions of hidden layers
} Backward Propagation
} Start from the last layer
} Propagate errors made for a training sample
} Complete training procedure
} Forward + backward
❑ Input: training set
❑ Output: Δᵢⱼ⁽ˡ⁾, gradients for θᵢⱼ⁽ˡ⁾, l = 1, 2, ..., L−1
❑ Set Δᵢⱼ⁽ˡ⁾ = 0
❑ For each sample x with label y:
  ✓ Set a⁽¹⁾ = x
  ✓ Forward propagation to get a⁽ˡ⁾ for all layers l
  ✓ For the last layer, compute δ⁽ᴸ⁾ = a⁽ᴸ⁾ − y
  ✓ Back propagation to get δ⁽ᴸ⁻¹⁾, δ⁽ᴸ⁻²⁾, ...
  ✓ Δᵢⱼ⁽ˡ⁾ = Δᵢⱼ⁽ˡ⁾ + aⱼ⁽ˡ⁾ g′(z⁽ˡ⁾) δᵢ⁽ˡ⁺¹⁾
} Forward propagation
} Backward propagation
} Graphical representation for gradient calculation
} Computational Graph! An important concept used in most deep learning toolboxes (e.g., TensorFlow)
Outline of this lecture
} From Neuron to Feedforward Neural Network
} Prediction Function: Network Architecture
} Cost Function
} Optimization
} Case Study
Environment
} Jupyter Notebook
} Google Colab
} Free to use (12 hours)
} Access to GPUs
} Loading files from Google Drive
} Full support of Python 3, TensorFlow, etc.
} Welcome To Colaboratory - Colaboratory (google.com)
} Machine Learning Examples
Example: Text Classification
} IMDB Database: 50,000 movie reviews
} 50% training samples
} 50% testing samples
} Labels: Positive/Negative
Review: “This was an absolutely terrible movie. Don't be lured in by or . Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like 's good name. I could barely sit through it.”
Label: Negative (0)
Example: Text Classification
} IMDB Database: 50,000 movie reviews
} 50% training samples
} 50% testing samples
} Labels: Positive/Negative
Review: “This is the kind of film for a snowy Sunday afternoon when the rest of the world can go ahead with its own business as you descend into a big arm-chair and mellow for a couple of hours. Wonderful performances from Cher and (as always) gently row the plot along. There are no rapids to cross, no dangerous waters, just a warm and witty paddle through life at its best. A family film in every sense and one that deserves the praise it received. .”
Label: Positive (1)
Google Co-Lab
} https://colab.research.google.com/drive/1VvpND8QXdrLLsAzGP02IKqU1JPm187fH?usp=sharing
• TensorFlow is a deep learning framework that supports a wide variety of neural networks
• Computational graph for gradient calculation
• Supports GPUs
What is TensorFlow?
} Open APIs: graph-based computation
} Consider a = (b + c) * (c + 2); we might break it into the following components:
} d = b + c
} e = c + 2
} a = d * e
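A minimal TensorFlow 2 sketch of this example (assuming eager execution with GradientTape); the same graph of operations also yields the gradients automatically:

```python
import tensorflow as tf

b = tf.Variable(1.0)
c = tf.Variable(2.0)

with tf.GradientTape() as tape:
    d = b + c        # d = b + c
    e = c + 2.0      # e = c + 2
    a = d * e        # a = d * e

grads = tape.gradient(a, [b, c])               # ∂a/∂b and ∂a/∂c via the computational graph
print(a.numpy(), [gr.numpy() for gr in grads]) # 12.0, [4.0, 7.0]
```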
Why graph computation?
} Parallel computing
} Download the database from the internet!
} Dump it into the Google Colab space (limited for each account)
} One might upload the data file to Google Drive and locate the file's path in Colab
} Exploring the dataset by printing out samples
Model Summary
• Feature extraction: to represent a chunk of sentences, one needs to use a text embedding
• Transform a sentence into a feature vector
• E.g., bag of words
• Pre-trained embedding models
Model Summary
• One should customize the number of layers and the number of hidden units for each layer.
• Specify the following settings
• Loss function
• Optimizer
• Execute the training
• Training data
• Training labels
• Batch_size
• Validation data
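A minimal Keras sketch of such a setup; the layer sizes, optimizer, and the (hypothetical) names of the data variables are illustrative rather than the exact configuration of the Colab notebook:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(50,)),  # assumes 50-d sentence embeddings
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),                   # positive / negative
])

model.compile(loss="binary_crossentropy",   # loss function
              optimizer="adam",             # optimizer
              metrics=["accuracy"])
model.summary()

# history = model.fit(train_features, train_labels,   # training data and labels (hypothetical names)
#                     batch_size=512, epochs=20,
#                     validation_data=(val_features, val_labels))
```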
Training History: Convergence
Training History: Accuracy
Outline of this lecture
} From Neuron to Feedforward Neural Network
} Prediction Function: Network Architecture
} Cost Function
} Optimization
} Case Study
Others Not Covered
} Loss Functions
} Variants of Gradient Descent Methods