Neural Networks
Data Mining
Mariia Okuneva, M.Sc.
Statistics and Econometrics, CAU Kiel, Summer 2020
Today’s outline
1 Neural Networks
2 Training Neural Networks
3 Special Architectures
Neural Networks
Introduction
A neural network is a universal approximator, a model that with enough data could learn any smooth predictive relationship. The central idea of neural networks is to:
extract linear combinations of the inputs as derived features,
model the target as a nonlinear function of these features.
Neural networks can be scaled up and generalized in a variety of ways: many hidden units in a layer, multiple hidden layers, and many forms of regularization.
NNs work well in practice because they compactly express smooth functions that fit the statistical properties of data we encounter in practice, and because they are easy to train with optimization algorithms (e.g., gradient descent).
Feed-forward neural network
[Figure: single hidden layer, feed-forward neural network]
$$h = F(x, W^1), \qquad y = F(h, W^2), \qquad y = F(x, W)$$
Artificial neuron
Each artificial neuron performs a dot product with the input and its weights, adds the bias and applies the non-linearity (or activation function).
The intercept terms $W^1_0$, $W^2_0$ are called biases. The functions $\varphi$ and $g$ are activation functions.
$$h_m = \varphi\Big(\sum_{i=1}^{N} x_i W^1_{im} + W^1_{0m}\Big)$$
$$y_k = g\Big(\sum_{j=1}^{M} h_j W^2_{jk} + W^2_{0k}\Big)$$
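A minimal NumPy sketch of this forward pass (tanh and the identity are illustrative choices for $\varphi$ and $g$; all variable names are ours, not from the slides):

import numpy as np

def forward(x, W1, b1, W2, b2):
    # x: (N,) inputs; W1: (N, M), b1: (M,); W2: (M, K), b2: (K,)
    h = np.tanh(x @ W1 + b1)   # hidden features: h_m = phi(sum_i x_i W1_im + W1_0m)
    y = h @ W2 + b2            # outputs: y_k = g(sum_j h_j W2_jk + W2_0k), g = identity here
    return h, y

# example with N = 2 inputs, M = 3 hidden units, K = 1 output
rng = np.random.default_rng(0)
x = np.array([0.5, -1.0])
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)
h, y = forward(x, W1, b1, W2, b2)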
Training Neural Networks
Training Neural Networks: feedforward part
Step 1: Finding h. Step 2: Finding y.
$x = [x_1, x_2]$; in general, $x = [x_1, \dots, x_N]$.
In our example:
$$W^1 = \begin{pmatrix} W^1_{11} & W^1_{12} & W^1_{13} \\ W^1_{21} & W^1_{22} & W^1_{23} \end{pmatrix}$$
In general, an $N \times M$ matrix:
$$W^1 = \begin{pmatrix} W^1_{11} & W^1_{12} & \dots & W^1_{1M} \\ W^1_{21} & W^1_{22} & \dots & W^1_{2M} \\ \vdots & \vdots & & \vdots \\ W^1_{N1} & W^1_{N2} & \dots & W^1_{NM} \end{pmatrix}$$
Training Neural Networks: feedforward part
$$h_1 = \varphi\Big(\sum_{i=1}^{2} x_i W^1_{i1}\Big), \qquad \text{in general: } h_m = \varphi\Big(\sum_{i=1}^{N} x_i W^1_{im}\Big), \qquad h = \varphi(xW^1)$$
Activation functions:
1 Hyperbolic tangent function: $f(x) = \tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}$
2 Sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$
3 ReLU function: $f(x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } x \ge 0 \end{cases}$
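For reference, the three activation functions as a small NumPy sketch (the function names are ours):

import numpy as np

def tanh(x):       # hyperbolic tangent
    return np.tanh(x)

def sigmoid(x):    # logistic function
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):       # rectified linear unit
    return np.maximum(0.0, x)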
Training Neural Networks: feedforward part
[Figure: activation functions. ReLU is a rectified linear unit.]
Training Neural Networks: feedforward part
$h = [h_1, h_2, h_3]$; in general, $h = [h_1, \dots, h_M]$.
In our example:
$$W^2 = \begin{pmatrix} W^2_1 \\ W^2_2 \\ W^2_3 \end{pmatrix}$$
In general, an $M \times K$ matrix:
$$W^2 = \begin{pmatrix} W^2_{11} & W^2_{12} & \dots & W^2_{1K} \\ W^2_{21} & W^2_{22} & \dots & W^2_{2K} \\ \vdots & \vdots & & \vdots \\ W^2_{M1} & W^2_{M2} & \dots & W^2_{MK} \end{pmatrix}$$
In case of an identity activation function and no bias:
$$y = \sum_{i=1}^{M} h_i W^2_i, \qquad y = hW^2$$
Training Neural Networks: feedforward part
The choice of the activation function for the output layer depends on the problem:
in a regression problem: identity or linear function
in a classification problem: logistic (if binary response variable), softmax (if multiclass response variable)
Softmax activation function:
$$f(x)_j = \frac{\exp(x_j)}{\sum_{k=1}^{K} \exp(x_k)}, \qquad \text{for } j \in \{1, \dots, K\}$$
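A small sketch of the softmax; subtracting $\max(x)$ before exponentiating is a standard numerical-stability trick not shown on the slide:

import numpy as np

def softmax(x):
    # shifting by max(x) leaves the result unchanged but avoids overflow in exp
    z = np.exp(x - np.max(x))
    return z / z.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # K = 3 class probabilities, sum to one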
Neural Network vs Polynomial (one covariate)
Let's compare polynomial regression and a neural network with one regressor $x$. For $M = 3$:
$$y = \beta_0 + \sum_{j=1}^{M} \beta_j x^j$$
$$y = W^2_0 + \sum_{j=1}^{M} W^2_j \, \varphi(x W^1_j + W^1_0)$$
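The two functional forms written as code (a sketch; the coefficient values in the example are arbitrary):

import numpy as np

def poly(x, beta):
    # y = beta_0 + sum_j beta_j x^j
    return sum(b * x**j for j, b in enumerate(beta))

def nn_1d(x, W1, b1, W2, b2, phi=np.tanh):
    # y = W2_0 + sum_j W2_j * phi(x * W1_j + W1_0)
    return b2 + np.sum(W2 * phi(x * W1 + b1))

x = 0.7
print(poly(x, [1.0, -0.5, 0.3, 0.1]))                                               # M = 3 polynomial
print(nn_1d(x, np.array([1.0, -2.0, 0.5]), 0.1, np.array([0.3, 0.2, -0.4]), 0.05))  # M = 3 hidden units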
Neural Network vs Polynomial (one covariate)
[Figure: polynomial fit (M = 3) vs. neural network fit (M = 3)]
Neural networks can capture nonlinearity.
Training Neural Networks: the main idea
Neural networks are trained by minimizing the empirical risk (the average loss over the training sample, an estimate of the expected loss).
Start with random weights. For each training instance:
1 Feedforward part: the algorithm makes a prediction.
2 Backpropagation part:
1 Measure the error.
2 Go through each layer in reverse to measure the error contribution from each connection.
3 Slightly change the connection weights to reduce the error (gradient descent step).
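A compact sketch of this loop for a two-layer regression network with a tanh hidden layer, identity output, and no biases; the gradient expressions anticipate the backpropagation formulas derived below (all names are ours):

import numpy as np

def train_online(X, Y, W1, W2, alpha=0.01, n_epochs=10):
    # stochastic (online) training: one gradient descent step per training instance
    for _ in range(n_epochs):
        for x, y in zip(X, Y):
            a = x @ W1                                  # hidden pre-activations
            h = np.tanh(a)                              # 1. feedforward part ...
            y_hat = h @ W2                              # ... makes a prediction
            err = y - y_hat                             # 2.1 measure the error
            dW2 = err * h                               # 2.2 error contribution, layer 2
            dW1 = err * np.outer(x, W2 * (1 - h**2))    #     and layer 1 (tanh' = 1 - h^2)
            W1 += alpha * dW1                           # 2.3 gradient descent step
            W2 += alpha * dW2
    return W1, W2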
Training Neural Networks: backpropagation part
Let us consider the loss function of a single training example. In the regression case it can be defined as the squared error over the neurons in the output layer (MSE); in the classification case, as the negative log-likelihood, also known as the cross-entropy loss.
In our example:
$$L = \frac{(y - \hat{y})^2}{2}$$
Then, to measure the loss of all training examples, we can simply compute the average of the loss calculated for each training example.
The goal is to find parameters that minimize this loss.
To minimize the loss function, we can rely on a gradient descent algorithm.
Training Neural Networks: backpropagation part
$$w_{\text{new}} = w_{\text{previous}} + \alpha\Big(-\frac{dL}{dw}\Big),$$
where $\alpha$ is the learning rate. Per weight and in matrix form:
$$W'^{\,l}_{ij} = W^l_{ij} + \alpha\Big(-\frac{\partial L}{\partial W^l_{ij}}\Big), \qquad W' = W + \alpha \nabla_W(-L)$$
Training Neural Networks: backpropagation part
Backpropagation:
1. $\Delta W^l_{ij} = -\alpha\Big(\frac{\partial L}{\partial W^l_{ij}}\Big)$
2. $W'^{\,l}_{ij} = W^l_{ij} + \Delta W^l_{ij}$
$$\Delta W^l_{ij} = -\alpha\Big(\frac{\partial L}{\partial W^l_{ij}}\Big) = -\alpha\,\frac{\partial \frac{(y - \hat{y})^2}{2}}{\partial W^l_{ij}} = \alpha\,(y - \hat{y})\,\frac{\partial \hat{y}}{\partial W^l_{ij}}$$
$$\delta^l_{ij} = \frac{\partial \hat{y}}{\partial W^l_{ij}}$$
Training Neural Networks: backpropagation part
Step 1. Update the weights of layer 2.
$$\delta^2_i = \frac{\partial \hat{y}}{\partial W^2_i} = \frac{\partial\big(\sum_{i=1}^{3} h_i W^2_i\big)}{\partial W^2_i} = h_i$$
$$\delta^2_1 = h_1, \qquad \delta^2_2 = h_2, \qquad \delta^2_3 = h_3$$
$$\Delta W^2_i = \alpha\,(y - \hat{y})\,\delta^2_i = \alpha\,(y - \hat{y})\,h_i$$
$$\Delta W^2_1 = \alpha\,(y - \hat{y})\,h_1, \qquad \Delta W^2_2 = \alpha\,(y - \hat{y})\,h_2, \qquad \Delta W^2_3 = \alpha\,(y - \hat{y})\,h_3$$
Training Neural Networks: backpropagation part
Step 2. Update the weights of layer 1.
$$\delta^1_{ij} = \sum_{p=1}^{M=3} \frac{\partial \hat{y}}{\partial h_p}\,\frac{\partial h_p}{\partial W^1_{ij}}$$
$$\frac{\partial \hat{y}}{\partial h_j} = \frac{\partial\big(\sum_{i=1}^{3} h_i W^2_i\big)}{\partial h_j} = W^2_j$$
$$\frac{\partial h_j}{\partial W^1_{ij}} = \frac{\partial \varphi\big(\sum_{i=1}^{2} x_i W^1_{ij}\big)}{\partial \sum_{i=1}^{2} x_i W^1_{ij}}\,\frac{\partial \sum_{i=1}^{2} x_i W^1_{ij}}{\partial W^1_{ij}} = \varphi'_j\,x_i$$
$$\delta^1_{ij} = W^2_j\,\varphi'_j\,x_i$$
$$\Delta W^1_{ij} = \alpha\,(y - \hat{y})\,\delta^1_{ij} = \alpha\,(y - \hat{y})\,W^2_j\,\varphi'_j\,x_i$$
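A numerical sanity check of these formulas on one training pair, comparing the analytic $\Delta W^1_{ij}$ (up to the factor $\alpha$) with a finite-difference gradient of $-L$ (a sketch; the values are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
x, y = np.array([0.5, -1.0]), 0.3                     # one training pair, N = 2
W1, W2 = rng.normal(size=(2, 3)), rng.normal(size=3)  # M = 3 hidden units, no biases

a = x @ W1; h = np.tanh(a); y_hat = h @ W2            # feedforward
err = y - y_hat
dW2 = err * h                                         # (y - y_hat) * delta^2_i
dW1 = err * np.outer(x, W2 * (1 - np.tanh(a)**2))     # (y - y_hat) * W2_j * phi'_j * x_i

eps, num = 1e-6, np.zeros_like(W1)                    # finite-difference check of dW1
L0 = 0.5 * (y - np.tanh(x @ W1) @ W2) ** 2
for i in range(2):
    for j in range(3):
        Wp = W1.copy(); Wp[i, j] += eps
        Lp = 0.5 * (y - np.tanh(x @ Wp) @ W2) ** 2
        num[i, j] = -(Lp - L0) / eps                  # approximates -dL/dW1_ij
print(np.allclose(dW1, num, atol=1e-5))               # should print True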
Training Neural Networks: backpropagation part
In this example, we compute the gradient of the loss function at a single, generic pair $(x, y)$ and update the weights for each new input (stochastic/online learning).
$$W'^{\,l}_{ij} = W^l_{ij} + \Delta W^l_{ij}$$
We could go through all training examples and average all N (= sample size) weight changes (batch training). But this is computationally very expensive!
$$\Delta W^l_{ij} = -\alpha\,\frac{1}{N}\sum_{p=1}^{N} \frac{\partial L_p}{\partial W^l_{ij}}$$
Mini-batch training: update the weights once per mini-batch of training examples (e.g., every 128 observations). This reduces the cost of training, since fewer updates are computed, and averaging multiple, possibly noisy changes to the weights yields a less noisy correction.
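A sketch of one epoch of mini-batch training for the same toy network as above (no biases, tanh hidden layer, identity output; the batch size and all names are ours):

import numpy as np

def minibatch_epoch(X, Y, W1, W2, alpha=0.01, batch_size=128):
    # average the single-example weight changes over each mini-batch, then update once
    for start in range(0, len(X), batch_size):
        xb, yb = X[start:start + batch_size], Y[start:start + batch_size]
        A = xb @ W1                                                # (B, M) pre-activations
        H = np.tanh(A)                                             # (B, M) hidden features
        err = yb - H @ W2                                          # (B,) errors y - y_hat
        dW2 = (err[:, None] * H).mean(axis=0)                      # averaged Delta W2
        dW1 = (xb.T @ (err[:, None] * W2 * (1 - H**2))) / len(xb)  # averaged Delta W1
        W1 += alpha * dW1
        W2 += alpha * dW2
    return W1, W2

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(1000, 2)), rng.normal(size=1000)           # toy data, N = 2 inputs
W1, W2 = minibatch_epoch(X, Y, rng.normal(size=(2, 3)), rng.normal(size=3))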
Early Stopping
One epoch is one pass of the entire dataset both forward and backward through the neural network.
When training large models with sufficient representational capacity to overfit the task, we often observe that training error decreases steadily over time, but validation set error begins to rise.
Every time the error on the validation set improves, we store a copy of the model parameters; when training stops, we return this stored copy rather than the latest parameters.
This strategy is called early stopping.
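A schematic sketch of early stopping; train_one_epoch and validation_loss are hypothetical callables standing in for one pass of the training loop above and an evaluation on the validation set, and the patience counter is a common practical extension not mentioned on the slide:

import copy

def train_with_early_stopping(model, train_one_epoch, validation_loss, max_epochs=100, patience=10):
    # store a copy of the parameters whenever the validation error improves
    best_loss, best_model, bad_epochs = float("inf"), copy.deepcopy(model), 0
    for epoch in range(max_epochs):
        model = train_one_epoch(model)
        val = validation_loss(model)
        if val < best_loss:
            best_loss, best_model, bad_epochs = val, copy.deepcopy(model), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:       # stop once validation error keeps rising
                break
    return best_model                        # return the stored copy, not the last model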
Hyperparameters: Learning Rate
A good starting point is a learning rate of 0.01.
With low learning rates, the improvements in the loss are roughly linear (slow learning).
With high learning rates, the loss initially decays faster (the curves look more exponential), but training can get stuck at worse values of the loss (green line).
Hyperparameters: Minibatch Size
Source: pdf
Commonly used mini-batch sizes: 32, 64, 128, 256.
The mini-batch size is always a trade-off between computational efficiency and accuracy.
Large mini-batch sizes lead to a quite significant decrease in performance, but faster training.
Hyperparameters: Number of hidden units/layers
Neural Networks with more neurons can express more complicated functions. However, this is both a blessing (since we can learn to classify more complicated data) and a curse (since it is easier to overfit the training data).
In practice, it is better to use larger networks and use regularization techniques to control overfitting. The reason behind this is that smaller networks are harder to train with Gradient Descent.
It is often the case that 3-layer neural networks will outperform 2-layer nets, but going even deeper (4,5,6-layer) rarely helps much more.
Regularization (Shrinkage): Dropout
Dropout is an extremely effective, simple, and recently introduced regularization technique that complements the other methods (L1, L2). During training, dropout keeps each neuron active only with some probability p (a hyperparameter) and sets it to zero otherwise.
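A sketch of (inverted) dropout applied to a layer's activations during training; the rescaling by 1/p is a standard implementation detail not mentioned on the slide:

import numpy as np

def dropout(h, p=0.5, training=True, rng=None):
    # keep each neuron active with probability p, otherwise set it to zero
    if not training:
        return h                            # at test time the full network is used
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(h.shape) < p          # Bernoulli(p) keep/drop mask
    return h * mask / p                     # rescaling keeps the expected activation unchanged

print(dropout(np.array([0.2, -1.3, 0.7, 0.5]), p=0.5))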
Special Architectures
Handwritten Digit Problem: MNIST data set
Convolutional Neural Networks
CNNs use sparsely connected layers and therefore contain fewer parameters.
CNNs make the explicit assumption that the inputs are images and accept matrices as input.
CNNs take into account spatial relationship between the pixels.
In this example, the neuron weights are [1, 0, −1] and the bias is zero. These weights are shared across all (yellow) neurons in the figure.
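A one-dimensional sketch of this shared-weight idea, sliding the kernel [1, 0, −1] with zero bias across an input (a 'valid' convolution: no padding, stride 1; strictly speaking this is the cross-correlation used in CNNs):

import numpy as np

def conv1d_valid(x, w, b=0.0):
    # every output neuron applies the same weights w and bias b to its input window
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) + b for i in range(len(x) - k + 1)])

x = np.array([0, 0, 1, 1, 1, 0, 0], dtype=float)
w = np.array([1, 0, -1], dtype=float)     # shared neuron weights from the example
print(conv1d_valid(x, w))                 # [-1. -1.  0.  1.  1.]: responds to the edges in x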
Convolutional Neural Networks
Break the image into overlapping tiles and feed each tile into a small neural network with the same weights.
Using the same small NN reduces the number of weights.
It is common to refer to the set of weights as a filter (or kernel) that is convolved with the input.
website
Recurrent Neural Networks
Useful for sequential data (e.g., a sentence).
Instead of training a network using a single input and a single output at each time step, we train with sequences since previous inputs matter.
RNNs contain memory elements.
Applications: sentiment analysis, speech recognition.
pdf
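A minimal sketch of the memory element: the hidden state at time t depends on the current input and on the previous state (shapes and names are ours):

import numpy as np

def rnn_forward(X, Wx, Wh, b, h0=None):
    # process the sequence step by step; h carries information from previous inputs
    h = np.zeros(Wh.shape[0]) if h0 is None else h0
    states = []
    for x_t in X:                              # X: sequence of input vectors
        h = np.tanh(x_t @ Wx + h @ Wh + b)     # new state = f(current input, previous state)
        states.append(h)
    return np.array(states)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                    # a sequence of 5 time steps, 3 features each
H = rnn_forward(X, rng.normal(size=(3, 4)), rng.normal(size=(4, 4)), np.zeros(4))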
References
CS231n: Convolutional Neural Networks for Visual Recognition, Stanford course. website
Efron, B., and T. Hastie (2016). Computer Age Statistical Inference. Cambridge University Press, ch. 18. pdf
Flachaire, E. (2019). Econometrics & Machine Learning. pdf
Gallic, E. (2018). Machine Learning and Statistical Learning, Chapter 5: Deep Learning. pdf
Goodfellow, I., Y. Bengio, and A. Courville (2016). Deep Learning. Cambridge, MA: MIT Press. website