
3b: Hidden Unit Dynamics

Hidden Unit Dynamics

Encoder Networks

The Encoder task is a simple supervised learning task which has been designed to help us
understand the hidden unit dynamics of neural networks, and also serves as a simplified version of
the Autoencoders we will meet in Week 6.

For this task, known as the N−K−N Encoder, the jth item follows a one-hot encoding, with its jth
input equal to 1 and all other inputs equal to 0. The target output is exactly the same as the
input. However, the challenge lies in the fact that the input must be “compressed” to fit through
the bottleneck of the K hidden nodes, where K < N, and then reconstructed from this lower
dimensional space.
Please run the following PyTorch code and then click on "encoder.png" to visualize the hidden unit space for a 9−2−9 Encoder network.

# encoder_main.py
# COMP9444, CSE, UNSW

from __future__ import print_function

import torch
import torch.utils.data
import torch.nn.functional as F
import matplotlib.pyplot as plt
#import numpy as np

class EncModel(torch.nn.Module):
    # fully connected two-layer network
    def __init__(self, num_input, num_hid, num_out):
        super(EncModel, self).__init__()
        self.in_hid  = torch.nn.Linear(num_input, num_hid)   # input layer to hidden bottleneck
        self.hid_out = torch.nn.Linear(num_hid, num_out)     # hidden bottleneck to output layer

    def forward(self, input):
        hidden = torch.tanh(self.in_hid(input))       # hidden unit activations (plotted in encoder.png)
        return torch.sigmoid(self.hid_out(hidden))    # output activations, thresholded at 0.5 in the plot

Each dot in this image corresponds to a particular input pattern, and shows us the activations of the two hidden nodes (horizontal and vertical axis) when this input pattern is fed to the network. Each line in the image corresponds to a particular output node, and shows the dividing line between those points in hidden unit space for which this output is less than 0.5, and those for which it is greater than 0.5.

Weight Space Symmetry

- swap any pair of hidden nodes, and the overall function will be the same
- on any hidden node, reverse the sign of all incoming and outgoing weights (assuming a symmetric transfer function)
- hidden nodes with identical input-to-hidden weights would, in theory, never separate; so they all have to begin with different (small) random weights
- in practice, all hidden nodes try to do a similar job at first, then gradually specialize

Controlled Nonlinearity

- for small weights, each layer implements an approximately linear function, so multiple layers also implement an approximately linear function
- for large weights, the transfer function approximates a step function, so computation becomes digital and learning becomes very slow
- with typical weight values, a two-layer neural network implements a function which is close to linear, but takes advantage of a limited degree of nonlinearity

Further Reading

Lister, R., 1993. Visualizing weight dynamics in the N-2-N encoder. In IEEE International Conference on Neural Networks (pp. 684-689).

Textbook: Deep Learning (Goodfellow, Bengio, Courville, 2016): Geometry of hidden unit activations (8.2)

Deeper Networks

Vanishing / Exploding Gradients

In general, neural networks with more hidden layers are able to implement a wider class of functions. For example, the Twin Spirals problem cannot be learned with a 2-layer sigmoidal network, but it can be learned with a 3-layer network (Lang & Witbrock, 1988). The first hidden layer learns features that are linearly separable, the second hidden layer learns features that we could informally describe as "convex", and the third (output) layer learns the target function, which we could say has a more "concave" appearance.

It is tempting to think that any function could be learned, simply by adding more hidden layers to the network. However, it turns out that training networks with many hidden layers by backpropagation is not so easy, due to the problem of Vanishing or Exploding Gradients.

When the weights are small, the differentials become smaller and smaller as we backpropagate through the layers, and end up having no effect. When the weights are large, the activations in the higher layers may saturate to extreme values; as a result, the gradients at those layers become very small, and are not propagated back to the earlier layers. When the weights have intermediate values, the differentials can sometimes get amplified in places where the transfer function is steep, causing them to blow up to large values.
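To make the vanishing gradient effect concrete, here is a minimal sketch (an assumed setup, not part of the course code) which stacks 20 sigmoid layers with small random weights and compares the gradient magnitude reaching the first and last layers after a single backward pass:

import torch

torch.manual_seed(0)
depth, width = 20, 50
layers = torch.nn.ModuleList(torch.nn.Linear(width, width) for _ in range(depth))
for layer in layers:
    torch.nn.init.normal_(layer.weight, std=0.1)   # small initial weights
    torch.nn.init.zeros_(layer.bias)

h = torch.randn(1, width)
for layer in layers:
    h = torch.sigmoid(layer(h))                    # sigmoid transfer function at every layer
h.sum().backward()

print("gradient norm at first layer:", layers[0].weight.grad.norm().item())
print("gradient norm at last  layer:", layers[-1].weight.grad.norm().item())

On a typical run, the gradient norm at the first layer comes out many orders of magnitude smaller than at the last layer, which is exactly the vanishing effect described above.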
Dealing with Deep Networks

We will explore a number of enhancements which have been introduced in recent years to allow successful training of deeper networks; the main ones are summarised in this table:

4 - 9 layers:          New Activation Functions (ReLU, SELU)
10 - 30 layers:        Weight Initialization, Batch Normalisation
30 - 100 layers:       Skip Connections (Residual Networks)
more than 100 layers:  Identity Skip Connections, Dense Networks

Activation Functions

Sigmoid, Hyperbolic Tangent, Rectified Linear Unit (ReLU), Scaled Exponential Linear Unit (SELU)

The sigmoid and hyperbolic tangent were traditionally used for 2-layer networks, but suffer from the vanishing gradient problem in deeper networks. Rectified Linear Units (ReLUs) have become popular since 2012 for deep networks, including convolutional networks. The gradients are multiplied by either 0 or 1 at each node (depending on the activation) and are therefore less likely to vanish or explode; a short numerical sketch at the end of this page illustrates this. Scaled Exponential Linear Units (SELUs) are a more recent innovation, which seem to also work well for deep networks.

References

Lang, K.J., & Witbrock, M.J., 1988. Learning to tell two spirals apart. In Proceedings of the 1988 Connectionist Models Summer School (pp. 52-59).

Further Reading

Textbook: Deep Learning (Goodfellow, Bengio, Courville, 2016): Activation Functions (6.3)

Exercise: Hidden Unit Dynamics

Question 1

Consider a fully connected feedforward neural network with 6 inputs, 2 hidden units and 3 outputs, using tanh activation at the hidden units and sigmoid at the outputs. Suppose this network is trained on the following data, and that the training is successful.

Item    Inputs    Outputs
1.      100000    000
2.      010000    001
3.      001000    010
4.      000100    100
5.      000010    101
6.      000001    110

Draw a diagram showing:
- for each input, a point in hidden unit space corresponding to that input, and
- for each output, a line dividing the hidden unit space into regions for which the value of that output is greater/less than one half.

Question 2

Consider an 8−3−8 encoder with its three-dimensional hidden unit space. What shape would be formed by the 8 points representing the input-to-hidden weights for the 8 input units? What shape would be formed by the planes representing the hidden-to-output weights for each output unit?

Hint: think of two platonic solids which are "dual" to each other.

Week 3 Thursday video
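The following minimal sketch (assumed sample values, not part of the course material) backs up the earlier claim about ReLU gradients: it evaluates each transfer function's derivative at a few points, showing that ReLU gradients are exactly 0 or 1, while sigmoid and tanh gradients shrink towards 0 for large |z|.

import torch

# Derivatives of the four transfer functions at a few sample points.
z = torch.tensor([-5.0, -1.0, 0.5, 5.0], requires_grad=True)
for name, f in [("sigmoid", torch.sigmoid), ("tanh", torch.tanh),
                ("relu", torch.relu), ("selu", torch.selu)]:
    (grad,) = torch.autograd.grad(f(z).sum(), z)
    print(name, [round(g, 4) for g in grad.tolist()])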