
COMP5329 Deep Learning Week 2 – Multilayer Neural Networks

1. Perceptron
Despite the fantastic name, deep learning builds on a field that is not new at all: neural networks. The Perceptron laid the foundations for the neural networks of the 1980s. It was developed to solve binary classification problems.
Figure 1. A Perceptron module.
We consider the input signal to be represented by a d-dimensional feature vector x = [x_1, x_2, · · · , x_d]. In a Perceptron (see Fig. 1), we aim to learn a linear function parameterized by w to accomplish the classification task,
(1) \hat{y} = \sum_{i=1}^{d} w_i x_i + b,
where w_i is the weight on the i-th dimension, b is the bias, and \hat{y} is the prediction score of the example x. The binary prediction can then be obtained easily with the help of the sign function,

(2) o = \mathrm{sign}(\hat{y}) = \begin{cases} +1, & \hat{y} \ge 0, \\ -1, & \hat{y} < 0. \end{cases}

The Perceptron simply uses target values y = +1 for the positive class and y = -1 for the negative class. According to the above analysis, we find that if an example can be correctly classified by the Perceptron, we have

(3) y(w^T x + b) > 0,
otherwise
(4) y(w^T x + b) < 0.

Therefore, to maximize the prediction accuracy (i.e., to minimize the cost function of the Perceptron), the objective function of the Perceptron can be written as

(5) \min_{w,b} L(w, b) = -\sum_{x_i \in M} y_i (w^T x_i + b),

where M stands for the set of mis-classified examples. Gradient descent can be applied to optimize Problem (5). By taking the partial derivatives, we can calculate the gradients of the objective function,

(6) \nabla_w L(w, b) = -\sum_{x_i \in M} y_i x_i,

(7) \nabla_b L(w, b) = -\sum_{x_i \in M} y_i.

Gradient descent can then be used to update the parameters w and b until convergence.

2. Multilayer Neural Networks

The Perceptron is only capable of separating data points with a linear classifier, and cannot even handle the simple XOR problem. The solution to this problem is to include an additional layer - known as a hidden layer. This kind of feed-forward network is a multilayer perceptron, as shown in Fig. 2.

Figure 2. A three-layer neural network and the notation used.

Figure 2 shows a simple three-layer neural network, which consists of an input layer, a hidden layer, and an output layer, interconnected by modifiable weights, represented by links between layers. Each hidden unit computes the weighted sum of its inputs to form its scalar net activation, which is denoted simply as net. That is, the net activation is the inner product of the inputs with the weights at the hidden unit. Thus, it can be written

(8) net_j = \sum_{i=1}^{d} x_i w_{ji} = w_j^T x,

where the subscript i indexes units in the input layer and w_{ji} denotes the input-to-hidden layer weights at the hidden unit j. Each hidden unit emits an output that is a nonlinear function of its activation, f(net), that is,

(9) y_j = f(net_j).

This f(·) is called the activation function or nonlinearity of a unit. Each output unit computes its net activation based on the hidden unit signals as

(10) net_k = \sum_{j=0}^{n_H} y_j w_{kj} = w_k^T y,

where the subscript k indexes units in the output layer and n_H denotes the number of hidden units.
An output unit computes the nonlinear function of its net activation, emitting

(11) z_k = f(net_k).

3. Activation Functions

Activation functions are the functions used in neural networks to decide whether a neuron fires or not.

3.1. Sigmoid Function. The Sigmoid is a non-linear activation function used mostly in feedforward neural networks. It is a bounded, differentiable real function, defined for all real input values, with positive derivative everywhere and some degree of smoothness. The Sigmoid function is given by

(12) f(x) = \frac{1}{1 + e^{-x}}.

However, the Sigmoid activation function suffers from major drawbacks:

• Sigmoids saturate and kill gradients. A very undesirable property of the sigmoid neuron is that when the neuron’s activation saturates at either tail of 0 or 1, the gradient in these regions is almost zero. Recall that during backpropagation, this (local) gradient is multiplied by the gradient of the objective with respect to this gate’s output. Therefore, if the local gradient is very small, it will effectively “kill” the gradient and almost no signal will flow through the neuron to its weights and, recursively, to its data. Additionally, one must take extra care when initializing the weights of sigmoid neurons to prevent saturation. For example, if the initial weights are too large, then most neurons become saturated and the network will barely learn.

• Sigmoid outputs are not zero-centered. This is undesirable since neurons in later layers of processing in a neural network would receive data that is not zero-centered. This has implications for the dynamics during gradient descent, because if the data coming into a neuron is always positive (e.g., x > 0 elementwise in f = w^T x + b), then during backpropagation the gradients on the weights w will become either all positive or all negative (depending on the gradient of the whole expression f). This can introduce undesirable zig-zagging dynamics in the gradient updates for the weights. However, notice that once these gradients are summed across a batch of data, the final update for the weights can have variable signs, somewhat mitigating this issue. Therefore, this is an inconvenience, but it has less severe consequences than the saturated activation problem above.
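To make the saturation problem concrete, the sketch below (plain Python, with function names of my own choosing) evaluates the Sigmoid of Eq. (12) and its derivative f'(x) = f(x)(1 - f(x)), showing how the gradient almost vanishes in the tails:

```python
import math

def sigmoid(x):
    """Sigmoid activation: maps any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: f'(x) = f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at x = 0 and vanishes in the saturated tails.
print(sigmoid_grad(0.0))   # 0.25, the maximum possible value
print(sigmoid_grad(10.0))  # ~4.5e-5: almost no gradient flows back
```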
3.2. Hyperbolic Tangent Function (Tanh). The hyperbolic tangent function, known as the tanh function, is a smoother, zero-centred function whose range lies between -1 and 1. The tanh function is given by

(13) f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}.
The tanh function became preferred over the sigmoid function because it gives better training performance for multi-layer neural networks. However, the tanh function cannot solve the vanishing gradient problem suffered by the sigmoid function either, since it also saturates at both tails. The main advantage of the function is that it produces zero-centred output, thereby aiding the back-propagation process. The tanh function has been used mostly in recurrent neural networks for natural language processing.
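As a quick illustration (a plain-Python sketch, not part of the lecture), the tanh of Eq. (13) can be computed directly from exponentials; it is a rescaled, shifted sigmoid, tanh(x) = 2σ(2x) - 1, with outputs symmetric about zero:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Hyperbolic tangent via its exponential definition, Eq. (13)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

# tanh is a rescaled, shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1.
for x in (-2.0, 0.0, 2.0):
    assert abs(tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) < 1e-12

# Zero-centred: outputs are symmetric around 0, unlike the sigmoid.
print(tanh(-1.0), tanh(0.0), tanh(1.0))
```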

3.3. Rectified Linear Unit (ReLU) Function. The rectified linear unit (ReLU) activation function was proposed by Nair and Hinton (2010) and has since been the most widely used activation function for deep learning applications, with state-of-the-art results to date. It offers better performance and generalization in deep learning compared to the Sigmoid and tanh activation functions. The ReLU activation function performs a threshold operation on each input element, where values less than zero are set to zero; thus the ReLU is given by
(14) f (x) = max(0, x).
The main advantage of rectified linear units is faster computation, since they require no exponentials or divisions, so the overall speed of computation is enhanced. Another property of the ReLU is that it introduces sparsity in the hidden units, as it clamps all negative values to zero.
The ReLU has a significant limitation: it is sometimes fragile during training, which can cause some of the gradients to die. This leads to some neurons being dead as well, so their weights are never updated on future data points, hindering learning, since dead neurons give zero activation. To resolve the dead-neuron issue, the leaky ReLU was proposed.
The leaky ReLU introduces a small negative slope to the ReLU to sustain and keep the weight updates alive during the entire propagation process. The slope parameter α was introduced as a solution to the ReLU’s dead-neuron problem, so that the gradients will not be zero at any time during training. The LReLU uses a very small constant value (around 0.01) for the negative slope α; thus the LReLU is computed as
(15) f(x) = \begin{cases} x, & x > 0, \\ \alpha x, & x \le 0. \end{cases}
The LReLU gives results nearly identical to the standard ReLU, with the exception that it has non-zero gradients over the entire input range; this suggests there is no significant improvement over the standard ReLU and tanh functions except in sparsity and dispersion.
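The two variants can be sketched in a few lines of plain Python (α = 0.01 is the illustrative default mentioned above); note that the leaky version keeps a small non-zero gradient on the negative side:

```python
def relu(x):
    """Standard ReLU, Eq. (14): thresholds negative inputs at zero."""
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU, Eq. (15): keeps a small slope alpha for negative inputs."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Gradient is 1 for positive inputs and alpha (not 0) otherwise."""
    return 1.0 if x > 0 else alpha

print(relu(-3.0), relu(2.0))   # 0.0 2.0
print(leaky_relu(-3.0))        # -0.03
print(leaky_relu_grad(-3.0))   # 0.01: the gradient never dies
```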
4. Backpropagation
Backpropagation is one of the simplest and most general methods for the supervised training of multilayer neural networks. Networks have two primary modes of operation: feed-forward and learning. Feed-forward operation consists of presenting a pattern to the input units and passing the signals through the network in order to yield outputs from the output units. Supervised learning consists of presenting an input pattern and changing the network parameters to bring the actual outputs closer to the desired teaching or target values.
We consider the training error on a pattern to be the sum over output units of the squared difference between the desired output t_k and the actual output z_k:

(16) J(w) = \frac{1}{2} \| t - z \|^2,
where t and z are the target and the network output vectors and w represents all the weights in the network.
The backpropagation learning rule is based on gradient descent. The weights are initialized with random values, and then they are changed in a direction that will reduce the error:
(17) \Delta w = -\eta \frac{\partial J}{\partial w},
where η is the learning rate and indicates the relative size of the change in weights. Eq. (17) demands that we take a step in weight space that lowers the criterion function. It is clear from Eq. (16) that the criterion function can never be negative, so the learning rule guarantees that learning will stop. This iterative algorithm requires taking a weight vector at iteration m and updating it as
(18) w(m + 1) = w(m) + ∆w.
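As a minimal illustration of Eqs. (16)-(18), the sketch below (a toy one-parameter linear unit; all values are illustrative, not from the lecture) applies the update rule repeatedly until the error is driven toward zero:

```python
# Gradient descent on a toy one-parameter problem: a linear unit z = w * x
# with squared error J(w) = 0.5 * (t - z)**2, following Eqs. (16)-(18).
x, t = 2.0, 6.0   # a single training pattern (input, target); illustrative values
w = 0.0           # initial weight (random in practice)
eta = 0.05        # learning rate

for m in range(200):
    z = w * x                 # feed-forward pass
    grad = -(t - z) * x       # dJ/dw for J = 0.5 * (t - z)^2
    w = w + (-eta * grad)     # Eq. (18): w(m+1) = w(m) + Delta_w

print(w)  # converges towards t / x = 3.0
```

Each step lowers J(w), and since J is bounded below by zero the updates settle at the minimizer, exactly as the text argues.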
We now turn to the problem of evaluating Eq. (17) for a three-layer net. Consider first the hidden-to-output weights, wkj. Because the error is not explicitly dependent upon wkj, we must use the chain rule for differentiation:
(19) \frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial net_k} \frac{\partial net_k}{\partial w_{kj}} = -\delta_k \frac{\partial net_k}{\partial w_{kj}},
where the sensitivity of unit k is defined to be
(20) \delta_k = -\frac{\partial J}{\partial net_k},
and describes how the overall error changes with the unit’s net activation; it determines the direction of search in weight space for the weights wkj. Assuming that the activation function f(·) is differentiable, we differentiate Eq. (16) and find that for such an output unit, δk is simply
(21) \delta_k = -\frac{\partial J}{\partial net_k} = -\frac{\partial J}{\partial z_k} \frac{\partial z_k}{\partial net_k} = (t_k - z_k) f'(net_k).
Taken together, these results give the weight update or learning rule for the hidden-to-output weights:
(22) \Delta w_{kj} = \eta \delta_k y_j = \eta (t_k - z_k) f'(net_k) y_j.
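The sensitivity of Eq. (21) can be sanity-checked numerically: for a sigmoid output unit, the analytic δk should match −∂J/∂netk computed by finite differences (a plain-Python sketch with arbitrary illustrative values):

```python
import math

# Check the output-unit sensitivity of Eq. (21),
# delta_k = (t_k - z_k) * f'(net_k), for a sigmoid output unit.
def f(net):
    return 1.0 / (1.0 + math.exp(-net))

def f_prime(net):
    s = f(net)
    return s * (1.0 - s)

net_k, t_k = 0.7, 1.0          # arbitrary illustrative values
z_k = f(net_k)
delta_analytic = (t_k - z_k) * f_prime(net_k)

# delta_k = -dJ/dnet_k with J = 0.5 * (t_k - z_k)^2; central differences:
eps = 1e-6
J = lambda net: 0.5 * (t_k - f(net)) ** 2
delta_numeric = -(J(net_k + eps) - J(net_k - eps)) / (2.0 * eps)

print(abs(delta_analytic - delta_numeric))  # tiny: the two agree
```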
The learning rule for the input-to-hidden units is subtle. From Eq. (17), and again using the
chain rule, we calculate
(23) \frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial y_j} \frac{\partial y_j}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}}.

The first term can be calculated as

(24) \frac{\partial J}{\partial y_j} = \sum_{k=1}^{c} \frac{\partial J}{\partial z_k} \frac{\partial z_k}{\partial y_j} = -\sum_{k=1}^{c} (t_k - z_k) \frac{\partial z_k}{\partial net_k} \frac{\partial net_k}{\partial y_j} = -\sum_{k=1}^{c} (t_k - z_k) f'(net_k) w_{kj} = -\sum_{k=1}^{c} \delta_k w_{kj}.
For the step above, we had to use the chain rule again. The final sum over output units in Eq. (24) expresses how the hidden unit output yj affects the error at each output unit. This will allow us to compute an effective target activation for each hidden unit. We use Eq. (24) to define the sensitivity of a hidden unit as
(25) \delta_j = f'(net_j) \sum_{k=1}^{c} w_{kj} \delta_k.
The sensitivity at a hidden unit is simply the sum of the individual sensitivities at the output
units weighted by the hidden-to-output weights wkj, all multiplied by f′(netj). Thus the learning rule for the input-to-hidden weights is
(26) \Delta w_{ji} = \eta x_i \delta_j = \eta f'(net_j) x_i \sum_{k=1}^{c} w_{kj} \delta_k.
Hence, we conclude the backpropagation algorithm, or more specifically the “backpropagation of errors” algorithm. Backpropagation is just gradient descent in layered models, where application of the chain rule through continuous functions allows the computation of derivatives of the criterion function with respect to all model weights.
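The whole procedure can be sketched end-to-end. The code below (a minimal plain-Python sketch; the layer sizes, learning rate, random initialization, and epoch count are illustrative choices, not prescribed values) trains a three-layer network like that of Fig. 2 on the XOR problem mentioned in Section 2, using the sensitivities of Eqs. (21) and (25) and the update rules of Eqs. (22) and (26):

```python
import math
import random

random.seed(0)

def f(net):                       # sigmoid activation, Eq. (12)
    return 1.0 / (1.0 + math.exp(-net))

def f_prime(net):                 # its derivative f(net) * (1 - f(net))
    s = f(net)
    return s * (1.0 - s)

d, n_h, c = 2, 3, 1               # input, hidden, and output units
eta = 0.5                         # learning rate

# Bias terms are folded in as an extra input / hidden unit fixed at 1.
w_ji = [[random.uniform(-1, 1) for _ in range(d + 1)] for _ in range(n_h)]
w_kj = [[random.uniform(-1, 1) for _ in range(n_h + 1)] for _ in range(c)]

patterns = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]

def forward(x):
    xb = x + [1]                                                   # bias input
    net_j = [sum(w * xi for w, xi in zip(ws, xb)) for ws in w_ji]  # Eq. (8)
    y = [f(n) for n in net_j] + [1]                                # Eq. (9)
    net_k = [sum(w * yj for w, yj in zip(ws, y)) for ws in w_kj]   # Eq. (10)
    z = [f(n) for n in net_k]                                      # Eq. (11)
    return xb, net_j, y, net_k, z

def total_error():                # Eq. (16), summed over all patterns
    return sum(0.5 * sum((tk - zk) ** 2 for tk, zk in zip(t, forward(x)[4]))
               for x, t in patterns)

err_before = total_error()
for epoch in range(5000):
    for x, t in patterns:
        xb, net_j, y, net_k, z = forward(x)
        delta_k = [(t[k] - z[k]) * f_prime(net_k[k]) for k in range(c)]  # Eq. (21)
        delta_j = [f_prime(net_j[j]) *                                   # Eq. (25)
                   sum(w_kj[k][j] * delta_k[k] for k in range(c))
                   for j in range(n_h)]
        for k in range(c):                     # Eq. (22): hidden-to-output
            for j in range(n_h + 1):
                w_kj[k][j] += eta * delta_k[k] * y[j]
        for j in range(n_h):                   # Eq. (26): input-to-hidden
            for i in range(d + 1):
                w_ji[j][i] += eta * delta_j[j] * xb[i]
err_after = total_error()

print(err_before, "->", err_after)  # the training error decreases
for x, t in patterns:
    print(x, forward(x)[4][0], t[0])
```

Each pass presents a pattern (feed-forward mode), then propagates the output errors backwards through the hidden-to-output weights to obtain the hidden sensitivities, exactly as in the derivation above.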