
Neural Networks
Data Mining
Mariia Okuneva, M.Sc.
Statistics and Econometrics, CAU Kiel, Summer 2020

Today’s outline
Neural Networks
1 Neural Networks
2 Training Neural Networks
3 Special Architectures

Neural Networks
Introduction
A neural network is a universal approximator: a model that, with enough data, can learn any smooth predictive relationship. The central idea of neural networks is to:
extract linear combinations of the inputs as derived features,
model the target as a nonlinear function of these features.
Neural networks can be scaled up and generalized in a variety of ways: many hidden units in a layer, multiple hidden layers, and many forms of regularization.
As a result, NNs work well in practice: they compactly express smooth functions that fit the statistical properties of the data we typically encounter, and they are easy to train with optimization algorithms (e.g., gradient descent).

Neural Networks
Feed-forward neural network
$h_i = F(x, W^1)$
$y_i = F(h, W^2)$
$y = F(x, W)$
Single hidden layer, feed-forward neural network

Neural Networks
Artificial neuron
Each artificial neuron computes the dot product of the input with its weights, adds the bias, and applies a non-linearity (or activation function).
The intercept terms $W^1_0$ and $W^2_0$ are called biases. The functions $\phi$ and $g$ are activation functions.
$h_m = \phi\left( \sum_{i=1}^{N} x_i W^1_{im} + W^1_{0m} \right)$

$y_k = g\left( \sum_{j=1}^{M} h_j W^2_{jk} + W^2_{0k} \right)$
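To make these two formulas concrete, here is a minimal sketch of the forward pass in Python/NumPy (not part of the original slides; the function name, the tanh hidden activation, the identity output activation and the toy dimensions N = 2, M = 3, K = 1 are illustrative assumptions):

```python
import numpy as np

def forward(x, W1, b1, W2, b2, phi=np.tanh, g=lambda z: z):
    """Forward pass of a single-hidden-layer network:
    h_m = phi(sum_i x_i W1[i, m] + b1[m]),  y_k = g(sum_j h_j W2[j, k] + b2[k])."""
    h = phi(x @ W1 + b1)   # hidden layer: linear combinations + bias, then non-linearity
    y = g(h @ W2 + b2)     # output layer: linear combinations + bias, then output activation
    return h, y

# Toy example with N = 2 inputs, M = 3 hidden units, K = 1 output
rng = np.random.default_rng(0)
x = np.array([0.5, -1.0])
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)
h, y = forward(x, W1, b1, W2, b2)
print(h.shape, y.shape)    # (3,) (1,)
```

Here `x @ W1` computes all the linear combinations at once, so the bias and the activation are applied element-wise to the whole layer.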

Training Neural Networks
Training Neural Networks: feedforward part
Step 1: Finding h. Step 2: Finding y.
x = [x1, x2], in general: x = [x1, …, xN ].
$W^1 = \begin{pmatrix} W^1_{11} & W^1_{12} & W^1_{13} \\ W^1_{21} & W^1_{22} & W^1_{23} \end{pmatrix}$

In general, an $N \times M$ matrix:

$W^1 = \begin{pmatrix} W^1_{11} & W^1_{12} & \dots & W^1_{1M} \\ W^1_{21} & W^1_{22} & \dots & W^1_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ W^1_{N1} & W^1_{N2} & \dots & W^1_{NM} \end{pmatrix}$

Training Neural Networks
Training Neural Networks: feedforward part
In the two-input example, $h_1 = \phi\left( \sum_{i=1}^{2} x_i W^1_{i1} \right)$; in general, $h_m = \phi\left( \sum_{i=1}^{N} x_i W^1_{im} \right)$, i.e. $h = \phi(xW^1)$.

Activation functions:
1 Hyperbolic tangent function: $f(x) = \tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}$
2 Sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$
3 ReLU function: $f(x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } x \geq 0 \end{cases}$
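A minimal sketch of these three activation functions in NumPy (illustrative only, not from the slides):

```python
import numpy as np

def tanh_act(x):
    return np.tanh(x)                # (exp(x) - exp(-x)) / (exp(x) + exp(-x)), range (-1, 1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # squashes inputs into (0, 1)

def relu(x):
    return np.maximum(0.0, x)        # 0 for x < 0, x for x >= 0

z = np.array([-2.0, 0.0, 2.0])
print(tanh_act(z), sigmoid(z), relu(z))
```

Applied element-wise to $xW^1$, any of these gives the hidden layer $h = \phi(xW^1)$.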
[Figure: plots of the activation functions. ReLU is a rectified linear unit.]

$h = [h_1, h_2, h_3]$, in general: $h = [h_1, \dots, h_M]$.

$W^2 = \begin{pmatrix} W^2_1 \\ W^2_2 \\ W^2_3 \end{pmatrix}$

In general, an $M \times K$ matrix:

$W^2 = \begin{pmatrix} W^2_{11} & W^2_{12} & \dots & W^2_{1K} \\ W^2_{21} & W^2_{22} & \dots & W^2_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ W^2_{M1} & W^2_{M2} & \dots & W^2_{MK} \end{pmatrix}$

In case of an identity activation function and no bias:

$y = \sum_{i=1}^{M} h_i W^2_i$, i.e. $y = hW^2$

The choice of the activation function for the output layer depends on the problem:
in a regression problem: identity or linear function
in a classification problem: logistic (if binary response variable), softmax (if multiclass response variable)

Softmax activation function:
$f(x)_j = \frac{\exp(x_j)}{\sum_{k=1}^{K} \exp(x_k)}, \quad \text{for } j \in \{1, \dots, K\}$
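As a quick illustration of the softmax (a sketch, not part of the original slides), note that subtracting the maximum before exponentiating is a common numerical-stability choice and does not change the result:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))  # shift by the max to avoid overflow in exp()
    return z / z.sum()         # f(x)_j = exp(x_j) / sum_k exp(x_k)

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))         # class probabilities that sum to 1
```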
Neural Network vs Polynomial (one covariate)

Let's compare polynomial regression and a neural network with one regressor x. For M = 3:

Polynomial: $y = \beta_0 + \sum_{j=1}^{M} \beta_j x^j$

Neural network: $y = W^2_0 + \sum_{j=1}^{M} W^2_j \, \phi(xW^1_j + W^1_{0j})$

[Figure: fitted curves for the polynomial (M = 3) and the neural network (M = 3).]
Neural networks can capture nonlinearity.

Training Neural Networks: the main idea

Neural networks are trained by minimizing the empirical risk (the sample analogue of the expected loss). Start with random weights. For each training instance:
1 Feedforward part: the algorithm makes a prediction.
2 Backpropagation part:
  1 Measure the error.
  2 Go through each layer in reverse to measure the error contribution of each connection.
  3 Slightly change the connection weights to reduce the error (gradient descent step).

Training Neural Networks: backpropagation part

Let us consider the loss function of a single training example. In the regression case it can be defined as the MSE over all neurons in the output layer; in the classification case, as the negative log-likelihood, also known as the cross-entropy loss. In our example:

$L = \frac{(y - \hat{y})^2}{2}$

Then, to measure the loss over all training examples, we simply average the loss computed for each training example. The goal is to find parameters that minimize this loss. To minimize the loss function, we rely on a gradient descent algorithm:

$w_{\text{new}} = w_{\text{previous}} + \alpha \left( -\frac{dL}{dw} \right)$, where $\alpha$ is the learning rate.

$W'^{\,l}_{ij} = W^l_{ij} + \alpha \left( -\frac{\partial L}{\partial W^l_{ij}} \right)$

$W' = W + \alpha \nabla_W (-L)$

Backpropagation:
1. $\Delta W^l_{ij} = -\alpha \frac{\partial L}{\partial W^l_{ij}}$
2. $W'^{\,l}_{ij} = W^l_{ij} + \Delta W^l_{ij}$

$\Delta W^l_{ij} = -\alpha \frac{\partial L}{\partial W^l_{ij}} = -\alpha \frac{\partial \frac{(y-\hat{y})^2}{2}}{\partial W^l_{ij}} = \alpha (y - \hat{y}) \frac{\partial \hat{y}}{\partial W^l_{ij}}, \qquad \delta^l_{ij} = \frac{\partial \hat{y}}{\partial W^l_{ij}}$

Step 1. Update the weights of layer 2.

$\delta^2_i = \frac{\partial \hat{y}}{\partial W^2_i} = \frac{\partial \left( \sum_{i=1}^{3} h_i W^2_i \right)}{\partial W^2_i} = h_i$, so $\delta^2_1 = h_1$, $\delta^2_2 = h_2$, $\delta^2_3 = h_3$.

$\Delta W^2_i = \alpha (y - \hat{y}) \, \delta^2_i = \alpha (y - \hat{y}) \, h_i$, so $\Delta W^2_1 = \alpha(y-\hat{y})h_1$, $\Delta W^2_2 = \alpha(y-\hat{y})h_2$, $\Delta W^2_3 = \alpha(y-\hat{y})h_3$.

Step 2. Update the weights of layer 1 (M = 3).

$\delta^1_{ij} = \sum_{p=1}^{M} \frac{\partial \hat{y}}{\partial h_p} \frac{\partial h_p}{\partial W^1_{ij}}$

Only the p = j term is non-zero, since $W^1_{ij}$ enters only $h_j$:

$\frac{\partial \hat{y}}{\partial h_j} = \frac{\partial \left( \sum_{i=1}^{3} h_i W^2_i \right)}{\partial h_j} = W^2_j$

$\frac{\partial h_j}{\partial W^1_{ij}} = \frac{\partial \phi\left( \sum_{i=1}^{2} x_i W^1_{ij} \right)}{\partial \sum_{i=1}^{2} x_i W^1_{ij}} \cdot \frac{\partial \sum_{i=1}^{2} x_i W^1_{ij}}{\partial W^1_{ij}} = \phi'_j \, x_i$

$\delta^1_{ij} = W^2_j \, \phi'_j \, x_i, \qquad \Delta W^1_{ij} = \alpha(y - \hat{y}) \, \delta^1_{ij} = \alpha(y - \hat{y}) \, W^2_j \, \phi'_j \, x_i$
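Putting the feedforward and backpropagation parts together, here is a minimal sketch of one stochastic (online) update for the 2-3-1 example network, with identity output, squared loss and no biases (not part of the original slides; the function name, the tanh hidden activation and the learning rate value are illustrative assumptions):

```python
import numpy as np

def sgd_step(x, y, W1, W2, alpha=0.01, phi=np.tanh):
    """One online update: loss L = (y - y_hat)^2 / 2, identity output activation."""
    # Feedforward part
    a = x @ W1                # pre-activations of the hidden layer
    h = phi(a)                # h_j = phi(sum_i x_i W1[i, j])
    y_hat = h @ W2            # y_hat = sum_j h_j W2[j]

    # Backpropagation part
    err = y - y_hat                                   # (y - y_hat)
    dW2 = alpha * err * h                             # Delta W2_j = alpha (y - y_hat) h_j
    phi_prime = 1.0 - h**2                            # tanh'(a) = 1 - tanh(a)^2
    dW1 = alpha * err * np.outer(x, W2 * phi_prime)   # Delta W1_ij = alpha (y - y_hat) W2_j phi'_j x_i
    return W1 + dW1, W2 + dW2

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(2, 3)), rng.normal(size=3)
x, y = np.array([0.5, -1.0]), 0.7
W1, W2 = sgd_step(x, y, W1, W2)
```

Repeating this update over many (x, y) pairs is exactly the stochastic/online learning described next; averaging `dW1` and `dW2` over several examples before updating gives batch or mini-batch training.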
In this example, we compute the gradient of the loss function at a single generic pair (x, y) and update the weights for each new input (stochastic/online learning):

$W'^{\,l}_{ij} = W^l_{ij} + \Delta W^l_{ij}$

We could instead go through all training examples and average the N (N = sample size) weight changes (batch training), but this is computationally very expensive:

$\Delta W^l_{ij} = -\alpha \frac{1}{N} \sum_{p=1}^{N} \frac{\partial L_p}{\partial W^l_{ij}}$

Mini-batch training: update the weights once per mini-batch of observations (e.g., 128). This reduces the complexity of the training process, since fewer computations are required, and averaging multiple, possibly noisy changes to the weights yields a less noisy correction.

Early Stopping

One epoch is one pass of the entire dataset forward and backward through the neural network. When training large models with sufficient representational capacity to overfit the task, we often observe that the training error decreases steadily over time, while the validation-set error begins to rise. Every time the error on the validation set improves, we store a copy of the model parameters. This strategy is called early stopping.

Hyperparameters: Learning Rate

A good starting point is 0.01. With low learning rates, the improvements in the loss are linear (slow learning); with high learning rates they start to look more exponential. Higher learning rates decay the loss faster, but they get stuck at worse values of the loss (the green line in the corresponding plot).

Hyperparameters: Minibatch Size

Source: pdf
Commonly used mini-batch sizes: 32, 64, 128, 256. The mini-batch size is always a trade-off between computational efficiency and accuracy: large mini-batch sizes lead to a noticeable drop in performance, but faster training.

Hyperparameters: Number of hidden units/layers

Neural networks with more neurons can express more complicated functions. However, this is both a blessing (since we can learn to classify more complicated data) and a curse (since it is easier to overfit the training data). In practice, it is better to use larger networks and to rely on regularization techniques to control overfitting; the reason is that smaller networks are harder to train with gradient descent. It is often the case that 3-layer neural networks outperform 2-layer nets, but going even deeper (4, 5, 6 layers) rarely helps much more.

Regularization (Shrinkage): Dropout

Dropout is an extremely effective, simple and recently introduced regularization technique that complements the other methods (L1, L2). While training, dropout is implemented by keeping a neuron active only with some probability p (a hyperparameter), and setting it to zero otherwise.

Special Architectures

Handwritten Digit Problem: MNIST data set
[Figure: examples of handwritten digits from the MNIST data set.]

Convolutional Neural Networks

CNNs use sparsely connected layers and therefore contain fewer parameters. CNNs make the explicit assumption that the inputs are images and accept matrices as input. CNNs take the spatial relationship between the pixels into account. The neuron weights in this example are [1, 0, −1], and the bias is zero. These weights are shared across all yellow neurons.

Break the image into overlapping image tiles and feed each tile into a small neural network with the same weights. Using the same small NN reduces the number of weights. It is common to refer to the set of weights as a filter (or a kernel) that is convolved with the input. website

Recurrent Neural Networks

Useful for sequential data (e.g., a sentence). Instead of training the network with a single input and a single output at each time step, we train with sequences, since previous inputs matter. RNNs contain memory elements. Applications: sentiment analysis, speech recognition. pdf

References

CS231n: Convolutional Neural Networks for Visual Recognition, Stanford course notes. website
Efron, B., and T. Hastie (2016). Computer Age Statistical Inference. Cambridge University Press, ch. 18. pdf
Flachaire, E. (2019). Econometrics & Machine Learning. pdf
Gallic, E. (2018). Machine Learning and Statistical Learning, Chapter 5: Deep Learning. pdf
Goodfellow, I., Y. Bengio, and A. Courville (2016). Deep Learning. Cambridge, MA: MIT Press. website