Chapter 1 Introduction
In recent years, machine learning techniques, especially deep learning, which uses multiple layers of artificial neural networks, have achieved remarkable breakthroughs in many fields. From image classification to the Go-playing AI AlphaGo [1], deep learning exhibits the best performance.
At the same time, more and more people use smartphones. Undoubtedly, AI techniques such as deep learning will make smartphones even smarter. Functions such as face recognition, audio recognition and image classification will be added to many mobile apps.
The training of a deep learning model can be done offline in server clusters. For the inference part, we can send the data through a network to a server, which makes the prediction and replies with the result. In some cases, however, the data is sensitive and the client may not wish to send it to servers; one example is a bank card number recognition application. Even without security concerns, network traffic can be slow and expensive, and building reliable servers increases the operation cost.
Thus, if we can perform prediction on the smartphone, there is no data security concern, no network traffic delay and cost, and no need to maintain a potentially expensive server cluster. However, this approach also has its drawbacks: it requires storing the model in the smartphone's limited storage, and inference on the mobile device can be slow and consume battery power.
A deep neural network typically has many layers with a large number of parameters. It requires large storage and a large number of mathematical operations. For example, the traditional image classification model VGG [2] has about 100 million parameters, needs more than 1 GB to store and takes more than 10,000 million multiply-add operations. It is therefore not suitable for mobile phones.
To use deep learning models on mobile phones, we must find a way to significantly decrease the model size and the number of computing operations, so that the model file is reasonably small and the computation is fast and power-efficient. In the meantime, the accuracy should be kept as high as possible. We need to find a suitable trade-off between them.
1.2 Objective
MobileNet [3] is a deep neural network model proposed by Google that is specially designed for mobile and embedded devices using approximate computing techniques. Although the experiments in its paper show that it has strong performance compared to other popular models on ImageNet [4] classification, a useful model should also perform well on new datasets using the transfer learning technique.
This project compares MobileNet with other popular models in terms of accuracy, model size and inference time on mobile devices, to investigate whether the approximate computing used in MobileNet can achieve a better trade-off between accuracy and efficiency and thus be suitable for mobile devices. It also investigates how the two hyperparameters of MobileNet, the width multiplier and the resolution multiplier, affect accuracy, model size and inference time.
1.3 Achieved results
MobileNets with different width multipliers and resolution multipliers are successfully trained on CIFAR-100 using transfer learning with models pre-trained on ImageNet. GoogLeNet Inception V3 [5] and ResNet [6] models are also trained on CIFAR-100 using transfer learning. Top-1 and top-5 accuracy on the test set are computed for each model, the size of the model file to be deployed in the mobile app is recorded, and the inference time of each model on an Android device is measured. The comparison shows that MobileNet with width multiplier 1 and resolution multiplier 1 achieves a speedup of more than 17× and shrinks the model file by more than 6× compared with both GoogLeNet Inception V3 and ResNet. It has an 18.3% loss in top-1 accuracy and an 8.5% loss in top-5 accuracy compared with GoogLeNet Inception V3, and almost no loss in either top-1 or top-5 accuracy compared with ResNet. The results also show that as the width multiplier decreases, the model size becomes smaller and the inference time shorter, whereas decreasing the resolution multiplier does not affect the model size.
1.4 Dissertation outline
Chapter 2 will introduce various approximate computing techniques for deep learning, which can be divided into three general categories: low rank approximation, to which the technique used in this project belongs, network pruning and quantization. An introduction to Tensorflow [7], the deep learning framework used in this project, is also included in Chapter 2.
Chapter 3 will elaborate both the theory and the implementation of the deep learning models in detail, including the loss function, optimization algorithms, regularization methods, the various kinds of layers used, transfer learning and the particular approximate computing technique used in this project: approximating the traditional convolutional layer with a depthwise separable convolution layer.
Chapter 4 describes the experimental results and analysis. Chapter 5 gives the conclusion and discussion, followed by future work.
Chapter 2 Background
2.1 Relevant work
2.1.1 Deep Learning
Deep learning techniques have achieved state-of-the-art results in many areas of machine learning. The achievements are remarkable, especially the success of deep convolutional neural networks (CNNs) in image classification. CNNs have the best results on all the standard image datasets, such as MNIST [8], CIFAR-10 [9], CIFAR-100 [9] and ImageNet [4]. Many different CNN models have been developed, such as ResNet, VGG and Inception. Because convolutional layers can make better use of image spatial information, these models typically have a sequence of many convolutional layers.
2.1.2 Approximate Computing
Until recently, deep learning researchers have been primarily focused on improving a model's accuracy. However, the use of multiple convolutional layers also results in a large number of parameters, requiring large memory for model storage, and increases the computational cost.
With the widespread use of mobile devices and the application of deep learning in mobile apps, more and more researchers are now aware that to provide a good mobile user experience, accuracy is not enough; the model must also be efficient: less memory, quicker inference and less energy consumption. Mobile users do not want a single app to take too much of the limited memory, and they want the app to respond instantly.
Researchers therefore resort to approximate computing techniques to make a better trade-off between accuracy and efficiency. The goal is to make the model smaller and the inference quicker, so that it is suitable for mobile devices, while keeping as much accuracy as possible.
[10] shows that significant redundancy often exists in deep learning models. Through approximate computing, we can remove the redundancy, saving both memory and computation cost. Approximate computing for deep learning can be divided into roughly three general approaches: pruning, quantization and low rank approximation.
2.1.2.1 Low Rank Approximation of Filters
This approach decomposes the filters in convolutional layers into a series of separable smaller filters, which are a low-rank approximation of the original filters and reduce the time complexity. The optimal decomposition can be found by minimizing the reconstruction error of the filters or of the layer output. Since convolutional layers are the most time-consuming parts of CNNs, this low-rank decomposition generates a significant speedup.
[11] uses SVD decomposition to make convolutional layers 1.6 times faster while sacrificing 1% accuracy. [12] exploits cross-channel or filter redundancy to construct a low-rank basis of filters that are rank-1 in the spatial domain, which achieves a speedup of factor 2.5 without sacrificing accuracy, and of factor 4.5 with less than 1% accuracy decrease for a text character recognition network. [11] and [12] can only decompose linear filters in a single layer. [13] further develops this method to take into account the nonlinearity, such as Rectified Linear Units (ReLU), which makes the approximation more accurate. It also introduces new algorithms to optimize the whole network to reduce the accumulated errors when approximating multiple convolutional layers. It achieves a speedup of factor 4 on a large model pre-trained on ImageNet with only a 0.9% top-5 error rate increase.
Instead of finding low-rank approximations of the convolutional layers of pre-trained networks, some papers replace traditional convolutional layers with layers that have a similar function but a smaller computation cost. Flattened networks [14] replace the 3D filters in conventional convolutional networks with consecutive sequences of 1D filters in all three dimensions, which reduces the parameters significantly and makes the feedforward computation about 2 times faster. Factorized networks [15] factor the convolution operation by unravelling the convolutional layer with a sequence of combinations of single-channel convolution and linear channel projection. They achieve similar accuracy but with much less computation compared with traditional deep convolutional neural network models. MobileNet [3] uses an approach similar to those of flattened networks [14] and factorized networks [15]. Its model is based on depthwise separable convolutions, which separate traditional convolutions into depthwise convolutions that apply a single filter to each input channel and pointwise convolutions that combine the results linearly. The MobileNet model has a smaller size and comparable accuracy compared with models such as GoogLeNet [5] and VGG-16 [2]. It provides two hyperparameters, the width multiplier and the resolution multiplier, to adjust the trade-off between latency and accuracy.
2.1.2.2 Network Pruning
This approach tries to remove the parts of a model that are not important, to reduce the number of parameters and the computation.
[16] first learns the importance of network connections and removes the unimportant ones, then retrains the network to learn the weights of the remaining connections. Its experiments show that this method can reduce the number of parameters of the VGG-16 model by 13× and of the AlexNet [17] model by 9× with no loss of accuracy.
[18] and [19] aim to prune whole filters together instead of individual weights, which can induce more speedup in the convolutional layers. [18] reports that inference time decreases by 34% for VGG-16 and 38% for ResNet-110 on CIFAR-10, almost without loss of accuracy. [19] reports a 3.31× FLOPs reduction and 16.63× compression on VGG-16 with a 0.52% top-5 accuracy drop.
The pruning algorithm of [20] aims specifically at reducing the energy consumption of CNNs instead of the computation and memory cost. It reports that energy consumption decreases by 3.7× for AlexNet and by 1.6× for GoogLeNet, both with less than a 1% drop in top-5 accuracy.
2.1.2.3 Network Quantization
Network quantization quantizes the parameters of neural network models and encodes them with fewer bits to reduce the memory storage required by the models. For example, using 8 bits instead of 32 bits will require only about 25% of the previously needed storage. Another benefit of quantization is to make the inference computation faster and use less power, because using fewer bits saves memory bandwidth, saves RAM access time and allows more operations to be done in one cycle with SIMD instructions.
During the training phase, in each step the parameters of the neural network are adjusted a little using backpropagation and the gradient descent algorithm, which requires a high-precision number format such as 32-bit floating point. Thus, instead of training a quantized model from scratch, we usually quantize a pre-trained model.
Quantization of deep networks typically does not decrease the inference accuracy much, because deep networks are often very robust and good at ignoring noise, including the precision error noise introduced by quantization.
One simple way to quantize is to store the minimum and maximum values of the set of floating point numbers, then use an integer to represent each floating point number. For example, if we use 8 bits to represent floating point numbers in the range [-20.0, 50.0], then 0 represents -20.0, 255 represents 50.0, 128 represents approximately 15.1, and so on.
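As a minimal illustration of this linear quantization scheme, the following NumPy sketch (with made-up example values) maps floating point numbers in a given range onto 8-bit codes and back:

    import numpy as np

    def quantize_uint8(values, v_min, v_max):
        # Map floats in [v_min, v_max] linearly onto the integers 0..255.
        scale = (v_max - v_min) / 255.0
        return np.clip(np.round((values - v_min) / scale), 0, 255).astype(np.uint8)

    def dequantize_uint8(codes, v_min, v_max):
        # Recover approximate float values from the 8-bit codes.
        scale = (v_max - v_min) / 255.0
        return codes.astype(np.float32) * scale + v_min

    weights = np.array([-20.0, -3.5, 0.0, 14.9, 50.0], dtype=np.float32)
    codes = quantize_uint8(weights, weights.min(), weights.max())
    approx = dequantize_uint8(codes, weights.min(), weights.max())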
[21] uses the k-means clustering algorithm and product quantization to quantize the network parameters layer by layer. It achieves 16-24 times compression of a state-of-the-art CNN on ImageNet with a 1% loss of accuracy.
[22] uses Hessian-weighted k-means clustering and fixed-length binary encoding to perform the quantization. Hessian-weighting also takes into account the across-layer impact of quantization errors in addition to the within-layer impact, and thus can quantize the whole network at once. This paper also employs Huffman coding to further compress the network. It reports that the quantized models are 1.95%, 4.51% and 2.46% of the original model sizes for LeNet, ResNet and AlexNet respectively, at no or marginal performance loss.
Network quantization can also be combined with other approximate computing techniques. Deep compression [23] combines network pruning and quantization. It first prunes the model connections and keeps only the most important connections, reducing the number of parameters by 9-13 times. Then it quantizes the weights so that only 5 bits instead of 32 bits are needed to represent a weight. Finally, it uses Huffman coding to compress the model further. This method compresses the AlexNet model by 35 times, from 240 MB to 6.9 MB, increases the speed by 3-4 times and uses 3-7 times less power.
2.2 Tensorflow
2.2.1 Introduction
Tensorflow is the second generation machine learning system published by Google. It is the successor to Google's previous DistBelief system. Its computation is based on a dataflow graph that takes mathematical operations as nodes, with multidimensional data arrays (tensors) flowing through the edges.
It is open source and can be used on a single machine or on multi-server clusters. It can run on CPUs, GPUs and even specialized computation devices such as TPUs (Tensor Processing Units), which are used at Google. It enables researchers to easily implement various deep learning algorithms and has attracted much attention from research communities.
The main components of Tensorflow consist of the client, the master and the working processes. The client sends a request to the master, and the master schedules the working processes to do computation on the available devices. Tensorflow can be used both on a single machine and on distributed clusters, where the client, master and working processes run on different machines.
2.2.2 Advantage
One of its many useful features is that Tensorflow can differentiate symbolic expressions and derive the backpropagation automatically for neural network training, which greatly reduces the work of the programmer and the chance of making mistakes.
Tensorflow is designed around a dataflow-graph model. It provides Python and C++ interfaces for programmers to easily construct the graph, which makes experimenting with architectures, algorithms and parameters very easy.
After the user constructs the dataflow graph, the Tensorflow system optimizes the graph and then actually executes the operations on machines. Through this approach of first constructing the graph and then executing it, Tensorflow knows the whole computation before execution and thus can optimize it as much as possible.
All computations are encoded as nodes in a data graph; the dependency of the data between different operations is explicitly encoded in the graph, so Tensorflow can partition the graph according to the dependencies and run the subgraph computations in parallel on different devices.
Tensorflow allows the user to specify the subgraphs that need to be computed. The user can feed tensors to none or some of the input placeholders. The Tensorflow system only runs the computation that is necessary and prunes the irrelevant parts of the graph away.
Tensorflow's data graph model not only makes it easy to run computations concurrently but also makes it easy to distribute computation to multiple devices.
In Tensorflow, the data flowing through the graph are called tensors. A tensor is a multi-dimensional array of a primitive type such as int32. It represents the inputs and the outputs of the operations, which are represented by the vertices. Every operation has a type and zero or more attributes. An operation that contains mutable state is called a stateful operation; Variable is one such operation. Another special operation is the queue operation.
Users can use Tensorflow's checkpointing to periodically save training models to files and reload the models later. This facility not only improves fault tolerance, it can also be used for transfer learning.
2.2.3 Architecture
Tensorflow adopts a layered architecture. On the top level are the training and inference libraries. The next level is the Python and C++ APIs, which are built on the C API. Below the C API level are the distributed master and the dataflow executor.
The distributed master accepts a dataflow graph as input; it prunes the unnecessary parts of the graph and divides the graph into subgraphs to distribute computation to different devices. Many optimizations, such as constant folding and subexpression elimination, are done by the master.
The dataflow executor's task is to execute the computation of the subgraphs distributed by the distributed master.
The next level is the kernel implementations, which include more than 200 operations, including frequently used operations such as Const, Var, MatMul, Conv2D and ReLU.
Apart from the above core components, the Tensorflow system also includes several useful tools, such as a dashboard to visualize the dataflow graph and training progress, and a profiler that shows the running time of different tasks on different devices.
2.2.4 Performance
In Chintala's benchmark of convolutional models, the results show that Tensorflow has a shorter training step time than Caffe and one similar to Torch. Experiments have shown that Tensorflow can scale well in problems such as image classification and language modelling.
Chapter 3 Methods
3.1 Network Architecture
The neural network consists of layers of neurons. The first layer is called the input layer and the last layer is called the output layer. The layers between the input layer and the output layer are called hidden layers. Figure 3.1 shows a simple neural network with one hidden layer.
3.1.1 Activation Function
Each neuron is a computing unit that applies a linear transformation to its inputs, followed by an activation function (Figure 3.2).
Figure 3.2: Computation in a single neuron.
The computation can be written as f(w^T x + b), where w is the weight vector, b is the bias and f is the activation function.
We typically use a non-linear function as the activation function, because if the activation function were linear, it could be incorporated into the previous linear transformation. There are many different activation functions; the most commonly used are the sigmoid, tanh and rectified linear unit (ReLU). In deep neural networks, ReLU is found to give better results than sigmoid and tanh.
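For illustration, the following NumPy sketch (with made-up weights and inputs) shows the computation f(w^T x + b) in a single neuron together with the three common activation functions:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def tanh(z):
        return np.tanh(z)

    def relu(z):
        return np.maximum(0.0, z)

    def neuron_output(w, x, b, activation=relu):
        # f(w^T x + b): linear transformation followed by the activation function.
        return activation(np.dot(w, x) + b)

    x = np.array([0.5, -1.2, 3.0])   # inputs to the neuron
    w = np.array([0.1, 0.4, -0.2])   # weights
    b = 0.05                         # bias
    print(neuron_output(w, x, b))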
3.1.2 Fully Connected Layer
In a fully connected layer, every neuron is connected to each neuron in the previous layer. If the two layers have M and N neurons respectively, then there are M × N connections between them, each with a different weight parameter. This is the traditional layer type often used in regular neural networks. An example is given in Figure 3.6.
3.1.3 Convolutional Layer
For images and other high-dimensional data, a convolutional layer is often preferable to a fully connected layer, because a fully connected layer creates too many connections and thus has many more parameters, which can be slow to train and easy to overfit. For example, if the input image is 30x30x3, each neuron in the first fully connected hidden layer connects to 30x30x3 = 2700 neurons in the input layer. For such a small image this may not be a problem, but for a larger image such as 300x300x3 there will be 270,000 connections for a single neuron, which is difficult to handle. Another problem is that high-dimensional data such as images often have an inherent spatial structure, but for the fully connected layer the input is just a vector of pixel values; the relative positions of the pixels have no effect, so the spatial structure information is lost.
To address these problems, the convolutional layer was invented. To be suitable for image data, the layout of neurons in a convolutional layer is 3-dimensional instead of the 1-dimensional layout in the fully connected layer. The three dimensions are called width, height and depth respectively. Each neuron in the convolutional layer now only connects to a small region of neurons in the previous layer. The region is small in width and height but includes the whole depth. The width and height of the region is called the receptive field or filter size, so the receptive field controls how large the connection region will be. In this way, we reduce the number of connections dramatically. For example, if the receptive field is 3×3 and the input volume is 300x300x3, then one neuron will connect to 3x3x3 = 27 neurons of the previous layer instead of 270,000 in a fully connected layer. Apart from the benefit of reducing the number of connections, it is also helpful for learning local features of the image.
To reduce the number of parameters further, the convolutional layer lets neurons in the same depth slice share the same weights, which are called a filter. Thus, for different positions in the image, the filter uses the same weights to extract features, which makes the feature extraction translation invariant.
During the forward propagation phase, we slide a window, whose size is defined by the receptive field, over the input volume and compute the dot product of the filter weights and the pixel values in the window to get a single number in the output volume. The dot products of all positions constitute the activation map, and the activation maps of all filters are stacked in the depth dimension to constitute the total output volume.
In summary, by arranging the neurons in 3D space, constraining the connections to local regions and sharing the weights, the convolutional layer can make better use of spatial information with far fewer parameters. The local connection and weight sharing are illustrated in Figure 3.7. An example of a 3D convolutional layer is given in Figure 3.8.
Figure 3.7: An example of a 1D convolutional layer. For illustration purposes, the graph shows the connections of a 1-dimensional convolutional layer instead of the usual 3D convolutional layer used for image data. The filter size is 1. The connections with the same colour share the same weight parameters.
Figure 3.8: An example of a 3D convolutional layer. The input size is 32 × 32 × 3. There are five filters. The connection is local in the width and height dimensions but spans the entire depth dimension.
3.1.3.1 Converting a Fully Connected Layer to a Convolutional Layer
A fully connected layer can be converted to a convolutional layer. For example, if the fully connected layer accepts a 5 × 5 × 128 input volume and outputs a 1 × 1 × 10 volume, then a convolutional layer with 10 filters of size 5 × 5 will give the same effect. Replacing the fully connected layer with a convolutional layer has the advantage that when the input image is larger than the training images, we can run inference over multiple areas of the input image in a single forward pass, instead of multiple forward passes, to get multiple class score vectors; the final prediction can then be their average, which can improve the prediction accuracy.
3.1.4 Pooling layer
The pooling layer can be used to reduce the spatial size of the representation and the number of parameters. It works by sliding a small window over the input volume and using a non-linear function to compute a number from the values in the window. The computation is conducted for each input depth slice independently. The most commonly used non-linear function is the max function; other functions, such as average (Figure 3.9) and L2-norm, are also used. By reducing multiple values in a local region to a single number, the pooling layer extracts more abstract features and helps the model to generalize and reduce overfitting. The pooling layer introduces no additional parameters, and it reduces the width and height by a factor of 2 or more with the depth unchanged. Thus, the numbers of parameters of the later layers are reduced. The most often used filter size is 2×2, which results in an output volume a quarter of the input volume size. Larger filter sizes are rarely used, because they discard too much information and often result in bad performance.
Figure 3.9: An example of average pooling operation for a single depth slice with a 2×2 filter and stride of 2.
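The following NumPy sketch illustrates the 2×2 average pooling with stride 2 of Figure 3.9 on a single depth slice; the input values are made up for illustration:

    import numpy as np

    def average_pool_2x2(slice_2d):
        # Average pooling with a 2x2 window and stride 2 on one depth slice.
        h, w = slice_2d.shape
        out = np.zeros((h // 2, w // 2), dtype=slice_2d.dtype)
        for i in range(0, h - 1, 2):
            for j in range(0, w - 1, 2):
                out[i // 2, j // 2] = slice_2d[i:i + 2, j:j + 2].mean()
        return out

    x = np.array([[1., 2., 3., 4.],
                  [5., 6., 7., 8.],
                  [9., 10., 11., 12.],
                  [13., 14., 15., 16.]])
    print(average_pool_2x2(x))   # 2x2 output, each value the mean of a 2x2 block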
3.2 Loss function
Suppose we have n classes. For a sample x_i, we have computed a score vector f of n elements, where f_j is the class score of sample x_i for class j. A larger score indicates that x_i is more likely to belong to that class. The loss function takes the score vector as input and outputs a single number indicating how well the score outcome matches the true class label. Intuitively, if the score for the true class is relatively higher than the others', the loss function value should be smaller.
3.2.1 Cross Entropy loss
We can use the softmax function to convert the class score vector into a class probability vector, with each value in the range [0, 1] and a total sum of 1.
The probability of data sample x_i belonging to class k given the class score vector f is P(y_i = k | x_i) = e^{f_k} / Σ_j e^{f_j}.
That is, for each score, we take its exponential and then divide by the sum of the exponentials to normalize the value to the range 0-1. We want the loss to be small when the predicted probability for the correct class is large, so we take the negative log of P(y_i | x_i), where y_i is the correct class for x_i. The loss for sample x_i is therefore L_i = -log P(y_i | x_i).
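A minimal NumPy sketch of the softmax probabilities and the cross-entropy loss for one sample (the score values are made up for illustration):

    import numpy as np

    def softmax(scores):
        # Subtract the maximum for numerical stability before exponentiating.
        exp_scores = np.exp(scores - np.max(scores))
        return exp_scores / np.sum(exp_scores)

    def cross_entropy_loss(scores, true_class):
        # L_i = -log P(y_i | x_i), where P comes from the softmax of the scores.
        probs = softmax(scores)
        return -np.log(probs[true_class])

    f = np.array([2.0, 1.0, 0.1])     # class scores for one sample
    print(softmax(f))                 # class probabilities, summing to 1
    print(cross_entropy_loss(f, 0))   # loss when the true class is class 0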
3.2.2 Hinge Loss
Another commonly used loss function is the hinge loss. The loss for sample (x_i, y_i) given the class score vector f is:
L_i = Σ_{j ≠ y_i} max(0, f_j − f_{y_i} + 1)    (3.6)
Intuitively, this loss function wants the score for the true class to be larger than the other scores by at least 1; otherwise, the loss increases for each violation.
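The corresponding NumPy sketch of the hinge loss in equation (3.6) for one sample, using the same illustrative scores:

    import numpy as np

    def hinge_loss(scores, true_class, margin=1.0):
        # Sum of max(0, f_j - f_{y_i} + margin) over all classes j != y_i.
        margins = np.maximum(0.0, scores - scores[true_class] + margin)
        margins[true_class] = 0.0
        return np.sum(margins)

    f = np.array([2.0, 1.0, 0.1])
    print(hinge_loss(f, 0))   # 0.0: the true class outscores the others by at least 1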
3.2.3 Loss Functions Comparison
The cross-entropy loss, unlike the hinge loss, provides a probability for each class, which is easier for humans to interpret than a raw class score. Another difference is that once the margins between the true class score and the other class scores are large enough, the hinge loss becomes 0 and cannot decrease further, whereas the cross-entropy loss can always decrease. The hinge loss and the cross-entropy loss often have similar performance.
3.3 Optimization
3.3.1 Mini-batch gradient descent
The training process uses an optimization algorithm to update the parameters so that the loss is minimized. The most commonly used optimization algorithm for neural networks is gradient descent, whose basic update rule is θ ← θ − η∇L(θ).
Here θ is the parameter vector, L(θ) is the loss, ∇L(θ) is its gradient and η is the learning rate. Gradient descent is an iterative algorithm that updates the parameters in the negative direction of the gradient at each iteration, with the step size controlled by the learning rate.
When the training data set is huge, for example ImageNet with over 10 million images, computing the gradient using the entire data set is costly. In this situation, we use mini-batch gradient descent. In this method, we take a small subset of samples (a mini-batch) from the data set at each step and use this mini-batch instead of the whole data set to compute the gradient and update the parameters. Due to the correlation between samples in the training data set, the gradient of the loss over the mini-batch is often a good approximation of the gradient of the loss over the whole training data set. Since the computation cost of each parameter update is much lower in mini-batch gradient descent than in standard gradient descent, many more updates can be performed and the loss can converge much more quickly.
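A schematic NumPy sketch of the mini-batch gradient descent loop; the loss_gradient function and the data arrays are placeholders for illustration:

    import numpy as np

    def minibatch_sgd(theta, X, y, loss_gradient, learning_rate=0.01,
                      batch_size=128, num_steps=1000):
        # At each step, sample a mini-batch and update the parameters with its gradient.
        n = X.shape[0]
        for _ in range(num_steps):
            idx = np.random.choice(n, batch_size, replace=False)
            grad = loss_gradient(theta, X[idx], y[idx])
            theta = theta - learning_rate * grad
        return theta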
The learning rate in the mini-batch gradient descent algorithm is very important. When the learning rate is very small, although the loss is guaranteed to decrease, the convergence may be too slow. We can increase the learning rate to speed up the learning, but this may lead to overstepping, which makes the loss increase. It is difficult to set a suitable learning rate: different datasets or network architectures may require different learning rates, and we may need different learning rates for different parameters and different training phases. Learning rate decay and extensions of the mini-batch gradient descent algorithm can be used to address this problem.
3.3.2 Learning Rate Decay
At the start of training, we may want a relatively large learning rate so that the loss decreases quickly. In the later stages, with the improvement getting smaller at each step, we may want to decay the learning rate to avoid overstepping and to fine-tune the parameters. We can decay the learning rate according to some rule, for example multiplying it by 0.9 every epoch, or set the decay manually, for example halving the learning rate when the training loss no longer decreases.
Let η0 be the initial learning rate, k the decay rate and t the number of training steps. Three commonly used rules can be expressed as follows.
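The exact formulas are not reproduced in the text above; the following sketch shows typical forms of three commonly used schedules in this notation (step decay, exponential decay and 1/t decay), with illustrative constants:

    import numpy as np

    def step_decay(eta0, t, drop=0.94, steps_per_drop=2000):
        # Multiply the learning rate by a constant factor every fixed number of steps.
        return eta0 * (drop ** (t // steps_per_drop))

    def exponential_decay(eta0, t, k=1e-4):
        # eta = eta0 * exp(-k * t)
        return eta0 * np.exp(-k * t)

    def inverse_time_decay(eta0, t, k=1e-4):
        # eta = eta0 / (1 + k * t)
        return eta0 / (1.0 + k * t)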
3.3.3 Mini-batch gradient descent extensions
Many extensions have been proposed to improve on the basic mini-batch gradient descent algorithm. Algorithms such as Adagrad and RMSProp try to set the learning rate adaptively during training. Algorithms such as Momentum and Nesterov Momentum try to adjust the parameter update direction to reduce oscillations.
3.3.3.1 Adagrad
The Adagrad algorithm adapts the learning rate for each parameter automatically. It accumulates the squared gradients, C ← C + ∇L(θ) ⊙ ∇L(θ), and updates the parameters as θ ← θ − η∇L(θ)/(√C + ε). Here ε is used to avoid division by zero and is set to a very small value such as 1e−6.
The above operations are element-wise for each parameter, so each parameter has its own effective learning rate. AdaGrad keeps track of the sum of squared gradients and uses it to adjust the learning rate.
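A NumPy sketch of one Adagrad update step in this notation (the default values are illustrative):

    import numpy as np

    def adagrad_update(theta, grad, cache, learning_rate=0.01, eps=1e-6):
        # Accumulate the squared gradients and scale each parameter's step by them.
        cache = cache + grad ** 2
        theta = theta - learning_rate * grad / (np.sqrt(cache) + eps)
        return theta, cache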
3.3.3.2 RMSProp
One problem with Adagrad is that the effective learning rate η/√(C + ε) is always decreasing; when it approaches 0, the algorithm stops learning. Another algorithm, called RMSProp, tries to solve this problem.
Here γ is the decay rate. RMSProp makes a simple change: C becomes a moving average of the squared gradients, C ← γC + (1 − γ)∇L(θ) ⊙ ∇L(θ), instead of the accumulated sum used in Adagrad. Now the effective learning rate is no longer always decreasing.
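The corresponding RMSProp update, where Adagrad's accumulated sum is replaced by a moving average (sketch only):

    import numpy as np

    def rmsprop_update(theta, grad, cache, learning_rate=0.01, gamma=0.9, eps=1e-6):
        # Moving average of squared gradients instead of Adagrad's growing sum.
        cache = gamma * cache + (1.0 - gamma) * grad ** 2
        theta = theta - learning_rate * grad / (np.sqrt(cache) + eps)
        return theta, cache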
3.3.3.3 Momentum
Here γ is another hyperparameter called the momentum and v is the velocity. We integrate the previous velocity with the gradient to get the current velocity, v ← γv − η∇L(θ), and then use the velocity to update the parameters, θ ← θ + v. This differs from basic gradient descent, where we directly update the parameters using the gradient. This algorithm helps to reduce oscillation and speeds up convergence.
3.3.3.4 Nesterov Momentum
Nesterov momentum uses the gradient at the next (look-ahead) position instead of the current position and achieves better results than standard momentum.
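Sketches of the momentum and Nesterov momentum updates; the gradient function passed to the Nesterov version is a placeholder for illustration:

    import numpy as np

    def momentum_update(theta, grad, v, learning_rate=0.01, gamma=0.9):
        # Integrate the previous velocity with the current gradient.
        v = gamma * v - learning_rate * grad
        return theta + v, v

    def nesterov_update(theta, v, grad_fn, learning_rate=0.01, gamma=0.9):
        # Evaluate the gradient at the look-ahead position theta + gamma * v.
        grad_ahead = grad_fn(theta + gamma * v)
        v = gamma * v - learning_rate * grad_ahead
        return theta + v, v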
3.3.4 Forward Propagation and Backpropagation
Let a_i represent the activation values of layer i. For the input layer, the values come directly from the input x, so a_1 = x. We can compute all neurons' values layer-by-layer from the input layer to the output layer.
From the output layer’s values, we can compute the loss that measures the error between the model predicted value and the actual target value.
In the training process, we use the gradient descent algorithm to update the parameters to reduce the loss. Backpropagation makes use of the chain rule to compute the gradients of all parameters with respect to the output efficiently. Backpropagation is applied on the computation graph from the last output node backward to all other nodes. During backpropagation, at each node and for each of its inputs, we multiply the gradient of the node's output with respect to that input by the gradient of the final output with respect to the node's output, which is received from the later node; the result is then passed on to the corresponding input node, and the process continues.
3.3.4.1 Chain Rule
The chain rule is used to compute the derivative of composite functions. For example, if variable x is a function of y, which in turn is a function of z, then according to the chain rule, dx/dz = (dx/dy)(dy/dz).
3.3.4.2 Example
The following illustrates the forward propagation and backpropagation process of feeding one data sample to a neural network that has one hidden layer with ReLU activation and uses the cross-entropy loss.
W, b, W′, b′ are the weights and biases of the hidden layer and the output layer respectively. x, y are the sample data and its class label.
From the above, we can see that during backpropagation we reuse many intermediate results computed in forward propagation. Thus we often save the needed intermediate values during forward propagation to save computation time by avoiding duplicate computation in backpropagation.
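A compact NumPy sketch of this forward and backward pass for one sample, a single ReLU hidden layer and the cross-entropy loss; all shapes and variable names are illustrative:

    import numpy as np

    def forward_backward(x, y, W, b, W2, b2):
        # Forward propagation, saving intermediate values for reuse in backpropagation.
        z1 = W.dot(x) + b          # hidden layer pre-activation
        a1 = np.maximum(0.0, z1)   # ReLU activation
        z2 = W2.dot(a1) + b2       # output layer scores
        p = np.exp(z2 - z2.max()); p /= p.sum()   # softmax probabilities
        loss = -np.log(p[y])       # cross-entropy loss

        # Backpropagation using the saved intermediate values.
        dz2 = p.copy(); dz2[y] -= 1.0         # gradient of loss w.r.t. output scores
        dW2 = np.outer(dz2, a1); db2 = dz2
        da1 = W2.T.dot(dz2)
        dz1 = da1 * (z1 > 0)                  # ReLU gate
        dW = np.outer(dz1, x); db = dz1
        return loss, (dW, db, dW2, db2)

    x = np.random.randn(4); y = 1
    W, b = np.random.randn(5, 4), np.zeros(5)
    W2, b2 = np.random.randn(3, 5), np.zeros(3)
    loss, grads = forward_backward(x, y, W, b, W2, b2)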
Although the above example is for a simple neural network, it can easily be extended to more complex networks. During the forward propagation and backpropagation process, the computation is local to each layer. Each layer only needs to know the values propagated to it, compute its own values and propagate them to other layers; it does not need to care about how other layers do their computation. Thus, different layers and operations can be used as components to construct deep and very complex neural networks in many different combinations.
3.4 Regularization
We often use regularization to reduce overfitting. One way of regularizing is to add a weight penalty to the loss. The new loss is the sum of the original data loss and the added regularization loss. The regularization parameter lambda controls the regularization strength: a large lambda puts more weight on the regularization loss and thus gives stronger regularization, while a small lambda puts more weight on the data loss and thus gives weaker regularization. Different datasets or network architectures may require very different values of lambda, and there is no simple way to decide a suitable one; it is usually set through cross-validation. Adding a regularization loss that penalizes large weights helps produce networks with smaller weights.
Small weights mean that small changes of the inputs will not change the output of the network too much, and a few outliers will not matter much for the regularized network, which makes it less sensitive to the noise in the data. On the other hand, a small change in some of the inputs may change the output of a network with large weights a lot, so large weights make the model adapt easily to all the training data, including the noise.
In summary, regularized networks with small weights tend to be simpler, robust to noise, less likely to overfit and better at generalizing. Unregularized networks with large weights tend to be more complex, learn the noise more easily and be more likely to overfit.
The L2 regularization and L1 regularization are similar. Both penalize large weights.
However, they have different forms of weight update in the gradient descent algorithm. For L2 regularization, the additional update of w due to the added regularization loss is proportional to w (of the form −ηλw); for L1 regularization, it is a constant step in the direction opposite to the sign of w (of the form −ηλ·sign(w)).
From the above, we can see that the update amount is constant for L1 regularization and proportional to w for L2 regularization. Thus, the penalty is much larger for L2 regularization when |w| is large and much larger for L1 regularization when |w| is small. The effect is that the weights in L1 are sparse, with a small number of relatively large weights and the others driven to 0, whereas L2-regularized weights are more diffuse. The sparsity feature of L1 regularization makes it a better choice for feature selection purposes. In other situations, L2 regularization is usually found to be better than L1 regularization.
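A small sketch of the extra weight update contributed by each regularizer; the learning rate and lambda values are illustrative:

    import numpy as np

    def l2_regularization_step(w, learning_rate=0.1, lambda_=0.01):
        # Extra update is proportional to w (weight decay).
        return w - learning_rate * lambda_ * w

    def l1_regularization_step(w, learning_rate=0.1, lambda_=0.01):
        # Extra update is a constant amount in the direction of -sign(w).
        return w - learning_rate * lambda_ * np.sign(w)

    w = np.array([0.5, -0.02, 3.0])
    print(l2_regularization_step(w))
    print(l1_regularization_step(w))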
We can also combine these two regularizations, which is called Elastic net regularization.
Apart from adding a regularization loss, another way to avoid weights with too large a magnitude is max norm regularization. This method updates the weights as normal using the gradient descent algorithm and then clips the weights if needed to ensure that each weight vector's norm stays below a preset maximum value.
3.4.3 Dropout Layer
Dropout is a method to reduce overfitting. In the training stage, we randomly drop out neurons and their associated connections with probability 1 − p (Figure 3.10). This has the effect of sampling from a large number of sub-networks. In the testing stage, we do not drop out neurons; instead, we use the full network but with each neuron's output weighted by p. In this way, we approximately compute the average output of all the sub-networks.
By randomly dropping out neurons, the dropout technique trains an exponentially large number of sub-networks and uses their average prediction, which is a kind of ensemble learning; it reduces overfitting and also increases the speed of training.
Figure 3.10: An example of the dropout operation. The first and third neurons and their associated connections are dropped out.
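A NumPy sketch of the dropout scheme described above, in which neurons are kept with probability p during training and the full activations are scaled by p at test time (illustrative only):

    import numpy as np

    def dropout_train(activations, p=0.5):
        # Keep each neuron with probability p, drop it (set to 0) with probability 1 - p.
        mask = (np.random.rand(*activations.shape) < p).astype(activations.dtype)
        return activations * mask

    def dropout_test(activations, p=0.5):
        # Use all neurons but weight their outputs by p to approximate the ensemble average.
        return activations * p

    a = np.random.randn(6)
    print(dropout_train(a))   # some entries zeroed out
    print(dropout_test(a))    # all entries scaled by p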
3.4.4 Batch Normalization
During neural network training, the parameter changes of one layer will change the distribution of the inputs of the layers after it. This phenomenon, called internal covariate shift, is especially problematic for deep neural networks, since the impact is amplified by multiple layers. To adapt to the input distribution change, training usually requires a low learning rate, which makes it slow.
To solve this problem, we can transform the inputs of the layer to have mean 0 and variance 1. This transformation is called whitening. To make the computation fast and also differentiable, as required by backpropagation, we can whiten each dimension of the input independently.
Here x is one dimension of the input, which is a scalar.
To avoid changing the layer’s representation, we add a linear transformation after the whitening transformation.
The two transformations together are called batch normalization.
During training, the mean and variance of x are estimated from the mini-batch samples. The population mean and variance are also estimated by taking moving averages of the mini-batch statistics during training. During inference, the fixed population mean and variance are used so that the output is determined only by the input.
For a layer z = g(Wu + b) in the original network, we can apply batch normalization as z = g(BN(Wu)).
The reason for removing b is that it can be cancelled by the β parameter in the batch normalization.
In a convolutional layer, the activation map is obtained by applying the same filter to different locations of the previous layer. When we use batch normalization for a convolutional layer, we normalize all the activations in an activation map together across the mini-batch. Thus, if the activation map has size p × q and the batch size is m, the normalization is applied over the p × q × m values. Just as the activation map shares the same weights, we use the same parameters γ and β for an activation map.
Batch normalization can reduce the change in layer input distributions and make the gradients less sensitive to parameter scales; thus, a higher learning rate can be used to speed up training.
During training, batch normalization depends on the whole mini-batch, so the output for one training sample is no longer deterministic. In this way, batch normalization has a regularization effect and can replace other regularization methods such as dropout.
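A NumPy sketch of the batch normalization transformation for one input dimension over a mini-batch; the γ, β and ε values are illustrative:

    import numpy as np

    def batch_norm_train(x_batch, gamma=1.0, beta=0.0, eps=1e-5):
        # Whiten with the mini-batch mean and variance, then scale and shift.
        mean = x_batch.mean()
        var = x_batch.var()
        x_hat = (x_batch - mean) / np.sqrt(var + eps)
        return gamma * x_hat + beta, mean, var

    def batch_norm_inference(x, pop_mean, pop_var, gamma=1.0, beta=0.0, eps=1e-5):
        # At inference time, use the fixed population statistics instead.
        x_hat = (x - pop_mean) / np.sqrt(pop_var + eps)
        return gamma * x_hat + beta

    batch = np.array([0.2, 1.5, -0.7, 0.9])
    normalized, m, v = batch_norm_train(batch)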
3.5 Depthwise Separable Convolution
Depthwise separable convolution factorizes the conventional convolution (Figure 3.11) into a depthwise convolution (Figure 3.12) followed by a pointwise convolution (Figure 3.13).
Figure 3.13: Pointwise convolution example
The depthwise convolution is done independently for each channel of the input, where a single filter is applied. The pointwise convolution is the same as a conventional convolution but with kernel size 1×1, which is why it is called pointwise. It combines the features from the depthwise convolution linearly to create new features.
Thus, depthwise separable convolution has the effect of filtering the input channels through the depthwise convolution and then combining the features to create new ones through the pointwise convolution. The effect is exactly the same as that of conventional convolution; the difference is that conventional convolution achieves it in a single step, whereas depthwise separable convolution uses two separate steps.
Through the separation of feature filtering and feature combining, depthwise sepa- rable convolution reduces the amount of computation tremendously.
Assume the input I has size W × H × M, where W is the input width, H is the height and M is the number of input channels. The filter F has size w × h and the number of filters is N. With a stride of 1 and zero padding, the output O of the conventional convolution will have size W × H × N. The elements of O are computed as follows:
It takes O(W · H · M · N · w · h) time.
For depthwise convolution, we use one filter for each input channel. The filter has size w × h × M. The output of the depthwise convolution has size W × H × M.
It is computed as follows:
It takes O(W · H · M · w · h) time.
Then the 1 × 1 pointwise convolution uses N filters, takes the output of the depthwise convolution and generates output of size W × H × N; it takes O(W · H · M · N) time. In total, depthwise separable convolution takes O(W · H · M · w · h + W · H · M · N) = O(W · H · M · (w · h + N)) time.
The time ratio between depthwise separable convolution and conventional convolution is 1/N + 1/(wh). For a typical convolution, where w = 3, h = 3 and N > 100, we can achieve about a 9-fold increase in speed.
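The following short sketch counts the multiply-add operations of the two schemes under the assumptions above (stride 1, zero padding) and confirms the 1/N + 1/(wh) ratio; the sizes are illustrative:

    def conventional_conv_ops(W, H, M, N, w, h):
        # Each of the W*H*N outputs needs a w*h*M dot product.
        return W * H * M * N * w * h

    def depthwise_separable_ops(W, H, M, N, w, h):
        # Depthwise step: W*H*M*w*h, pointwise step: W*H*M*N.
        return W * H * M * w * h + W * H * M * N

    W, H, M, N, w, h = 32, 32, 128, 128, 3, 3
    ratio = depthwise_separable_ops(W, H, M, N, w, h) / conventional_conv_ops(W, H, M, N, w, h)
    print(ratio, 1 / N + 1 / (w * h))   # the two numbers agree (about 0.119)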
3.6 Transfer Learning
Training a good deep convolutional neural network model usually requires large computational resources and a long time. For example, training a deep convolutional neural network model on ImageNet may take weeks even with GPU clusters. If we cannot afford the computational resources or time, we can use transfer learning. We can take a pre-trained model (many state-of-the-art trained models are freely available on the internet), replace the last fully connected layer and retrain it. The previous layers of the neural network model can be seen as a feature extractor, and the last fully connected layer computes the class scores using the extracted features. We can use the same features as the pre-trained model, but because the classes are often different from those of the pre-trained model, we need to replace and retrain the last layer. If retraining only the last layer does not give satisfactory performance, we may also need to fine-tune the previous layers: initializing the weights with the pre-trained model and updating them during training with a smaller learning rate. The reason for using a smaller learning rate is that we expect the weights of the pre-trained model to be not far from the final optimized weights, and we want to update them little-by-little without overstepping. Whether fine-tuning is needed often depends on the similarity between the new dataset and the dataset used by the pre-trained model, in terms of both the image data and the class labels. If they are very similar, the kind of features extracted by the layers before the last layer of the pre-trained model are likely to also suit the new model, and retraining only the last layer may be enough.
Apart from saving much training time and computational resources, transfer learning often gives better results.
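A schematic Tensorflow 1.x sketch of the two options discussed above, retraining only the replaced last layer and then fine-tuning all layers with a smaller learning rate; the toy model, scope names and learning rates are illustrative, not the actual training code of this project:

    import tensorflow as tf

    # A toy stand-in for a pre-trained feature extractor plus a replaced last layer.
    inputs = tf.placeholder(tf.float32, [None, 1024])
    labels = tf.placeholder(tf.int64, [None])

    with tf.variable_scope("features"):        # layers initialized from a pre-trained model
        features = tf.layers.dense(inputs, 256, activation=tf.nn.relu)
    with tf.variable_scope("new_logits"):      # new last fully connected layer
        logits = tf.layers.dense(features, 100)

    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

    # Stage 1: retrain only the variables of the new last layer.
    last_layer_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="new_logits")
    train_last = tf.train.RMSPropOptimizer(0.01).minimize(loss, var_list=last_layer_vars)

    # Stage 2: fine-tune all layers with a smaller learning rate.
    train_all = tf.train.RMSPropOptimizer(0.0001).minimize(loss)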
Chapter 4 Results and Evaluation
4.1 Resource and tools
The model training and evaluation are implemented in Python with the Tensorflow framework 1.0 on an Ubuntu Linux system. An Amazon Elastic Compute Cloud (EC2) G2 instance, which uses NVIDIA GRID K520 GPUs, is used for model training.
The image classification app on the mobile device is implemented in Android Java with the Tensorflow Mobile library. Currently the Tensorflow Mobile library supports three platforms: Android, iOS and Raspberry Pi. The library provides APIs that let a mobile app easily load a pre-trained model and do inference with it.
The Android image classification app is developed with Android Studio, which is the official IDE for Android.
4.1.1 Checkpoint File
During training, we can use the Tensorflow API to save the learned model parameters periodically to binary checkpoint files. Thereby, the model parameters are backed up and can later be restored by loading data from a checkpoint file.
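A minimal Tensorflow 1.x sketch of saving and restoring checkpoint files with tf.train.Saver; the variable, path and step number are illustrative:

    import tensorflow as tf

    w = tf.get_variable("w", shape=[10], initializer=tf.zeros_initializer())
    saver = tf.train.Saver()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # ... training steps would run here ...
        saver.save(sess, "/tmp/model.ckpt", global_step=1000)   # periodic backup

    with tf.Session() as sess:
        saver.restore(sess, "/tmp/model.ckpt-1000")             # reload the saved parameters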
4.1.2 Model File
The model file is in the Protocol Buffers format, which can be saved and loaded using many different languages. Thus, we can save the model file using Python and load it using Java in the Android app.
The Graph object contains all the information about the model graph. The graph consists of nodes. Each node stores various information including node name, operation such as “Add” and “Conv2D”, input nodes and other attributes such as filter size for “Conv2D”.
To make it suitable for deployment, we can use the Tensorflow tool freeze_graph.py to combine the graph definition file and the checkpoint file, which contains the learned parameters, into a single model file. The tool achieves this by replacing each Variable node with a Const node that contains the parameters; it also removes nodes unnecessary for inference to simplify the graph and decrease the file size.
The resulting model file can then be shipped with the Android app. In the Android app, upon starting, we first load the model file using the Tensorflow Mobile Java API. Then we can do inference using the loaded model.
4.2 Dataset
4.2.1 CIFAR-100
The CIFAR-100 dataset contains 60,000 small images of size 32 × 32. They belong to 100 different classes, with each class containing 600 images. A sample of 100 images from this dataset is shown in Figure 4.1.
4.3 Experimental Setup
4.3.1 Training set and test set
The CIFAR-100 dataset is divided into a training set containing 50,000 images and a test set containing 10,000 images.
4.3.2 Preprocessing
During training, each image is randomly transformed before being fed to the neural networks. In this way, the neural networks train on multiple versions of the same image, and the effective training data set is much larger than the original data set. This helps the model generalize better and reduces overfitting.
4.3.2.1 Randomly Shift the Image
First pad the image and then randomly crop it. In this way, the image randomly shifts in the four directions.
4.3.2.2 Randomly Flip the Image
The image is flipped left to right with 0.5 probability.
4.3.2.3 Randomly Adjust the Image Brightness
This randomly adds a value between -63 and 63 to all RGB components of every pixel.
4.3.2.4 Randomly Change the Image Contrast
Randomly choose a contrast factor 0.2 ≤ f ≤ 1.8. For each RGB channel, compute the mean m and update the corresponding component of each pixel with:
(x−m)× f +m
After the above random transformation steps, we finally normalize the image data to have zero mean and unit norm.
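A sketch of this augmentation pipeline using Tensorflow 1.x image operations, mirroring the steps above for 32×32 CIFAR images; the padding amount and function name are illustrative rather than the exact code used in the experiments:

    import tensorflow as tf

    def augment(image):
        # image: a 32x32x3 float tensor for one training example.
        image = tf.pad(image, [[4, 4], [4, 4], [0, 0]])                 # pad, then
        image = tf.random_crop(image, [32, 32, 3])                      # randomly shift
        image = tf.image.random_flip_left_right(image)                  # random horizontal flip
        image = tf.image.random_brightness(image, max_delta=63)         # random brightness
        image = tf.image.random_contrast(image, lower=0.2, upper=1.8)   # random contrast
        image = tf.image.per_image_standardization(image)               # zero mean, unit norm
        return image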
4.3.3 MobileNet
Hyperparameters
• Batch Size: 128
• Momentum: 0.9
• Initial learning rate: 0.01
• Learning rate decay: decay with factor 0.94 every 2 epochs
• Weight decay parameter: 0.00004
• Optimizer: RMSProp optimization algorithm with decay rate of 0.9
The initial weights are loaded from the MobileNet model pre-trained on ImageNet. In the first stage, only the last fully connected layer is trained, keeping the parameters of the previous layers unchanged; this phase trains for 25,000 steps. Then all layers are trained to fine-tune the model; this phase trains for 55,000 steps. During training, random minor changes are applied to the images to augment the data set.
After training finishes, the test set is used to evaluate the performance. Note that the prediction on each image is done just once. If the average prediction over multiple transformed versions of an image were used, the performance would likely improve.
The models are exported to Tensorflow model files. In the Android mobile image classification app, the model file is loaded, and the inference time is computed by dividing the time it takes to classify 100 images one by one by 100. The mobile inference time is measured on a Nexus 6 Android phone.
The experiments are done for width multipliers 1.0, 0.75, 0.5 and 0.25 and image sizes 32, 24 and 16. Thus, the above steps are done for a total of 12 models.
The changes of the losses with training steps for the model with width multiplier 1.0 and image size 32 are shown below; the others are similar. The red line is for the first stage and the green line for the second stage.
Figures 4.2, 4.3 and 4.4 show the changes of the total loss, cross entropy loss and regularization loss with the training steps in both stages.
4.3.4 Inception V3
Figure 4.4: Regularization Loss
The Google Inception V3 model is proposed in [5]. It adds an auxiliary logits layer in addition to the usual logits layer to speed up convergence during training. For this model, the images are scaled from 32×32 to 128×128 in the experiment. The first stage trains the auxiliary logits layer and the logits layer for 15,000 steps with a fixed learning rate of 0.01. The second stage trains all layers for 30,000 steps with a smaller fixed learning rate of 0.0001. Both stages use a weight decay of 0.00004.
Figures 4.5, 4.6 and 4.7 show the changes of the total loss, cross entropy loss and regularization loss with the training steps in both training stages for the Inception V3 model.
4.3.5 ResNet
The ResNet model is proposed in [6]. For this model, the experiment uses the same training process as for the Inception V3 model.
Figures 4.8, 4.9 and 4.10 show the changes of the total loss, cross entropy loss and regularization loss with the training steps in both training stages for the ResNet model.
4.4 Metrics
4.4.1 Top-1 Accuracy
The ratio between the number of images that are predicted correctly and the total num- ber of images in the test set.
4.4.2 Top-5 Accuracy
Like top-1 accuracy, it is the ratio between the number of correct predictions and the total number of images; the difference is the meaning of a correct prediction. For top-5 accuracy, the classifier gives five candidate guesses instead of one. If the correct label is one of the five guesses, the prediction is considered correct.
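A small NumPy sketch that computes top-1 and top-5 accuracy from a matrix of class scores; the random scores and labels are placeholders for the real predictions:

    import numpy as np

    def top_k_accuracy(scores, labels, k=1):
        # scores: [num_images, num_classes], labels: [num_images]
        top_k = np.argsort(scores, axis=1)[:, -k:]            # k highest-scoring classes
        correct = [labels[i] in top_k[i] for i in range(len(labels))]
        return np.mean(correct)

    scores = np.random.randn(10000, 100)   # e.g. CIFAR-100 test predictions
    labels = np.random.randint(0, 100, size=10000)
    print(top_k_accuracy(scores, labels, k=1))   # top-1 accuracy
    print(top_k_accuracy(scores, labels, k=5))   # top-5 accuracy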
4.4.3 Inference Time
The average time the model takes to classify a single image.
4.4.4 Model File Size
The size of the Tensorflow model file for deployment. The model file size is mainly determined by the number of parameters and the number of bits used to encode each parameter.
4.5 Results
Table 4.1 shows the performance of MobileNets with various width multipliers and resolution multipliers. Table 4.2 shows the performance of the full MobileNet, Inception V3 and ResNet.
Table 4.1: Performance for Different Width Multipliers and Resolution Multipliers
Table 4.2: Performance of Different Models
4.6 Analysis
Table 4.3: Relative Performance
For comparison purposes, the accuracy loss, inference time speedup and model size compression ratio of the MobileNet model over Inception V3 and ResNet are computed in Table 4.3.
We can see that MobileNet has significant inference speedup and model size compression over Inception and ResNet. Its accuracy is similar to ResNet's, with a relatively large loss compared with Inception.
We can also see that a smaller width multiplier decreases the inference time, model size and accuracy, while a smaller resolution multiplier does not affect the model size but decreases the inference time and accuracy. This is because a smaller width multiplier decreases the number of channels used in the filters, which decreases the number of parameters, so the model file shrinks. A smaller resolution multiplier decreases the input image size; thus, the amount of computation decreases but the number of parameters stays the same, so it speeds up inference but does not shrink the model file.
The results also show that it is better to decrease the width multiplier than the resolution multiplier to speed up inference and shrink the model file. For example, using width multiplier 0.75 and resolution multiplier 1.0 gives higher accuracy, quicker inference and smaller model size than using width multiplier 1.0 and resolution multiplier 0.75.
Chapter 5 Conclusion and Discussion
5.1 Remarks and observations
This project implements the MobileNet model using the Tensorflow framework. The approximate computing technique of approximating the traditional convolutional layer with a depthwise separable convolution layer is used. An Android mobile image classification app is built to test the real inference time of each model. In the experiments, MobileNets with various width multipliers and resolution multipliers are successfully trained on the CIFAR-100 dataset to compare these two hyperparameters' effect on the performance, which shows that by adjusting them we can get different trade-offs between accuracy and efficiency. Decreasing the width multiplier and the resolution multiplier leads to smaller model size and quicker image classification on mobile devices, but with greater accuracy loss. Thus, mobile developers can adjust them to find the best trade-off for their applications. Comparisons with other models such as Inception and ResNet are also made in the experiments, which show that MobileNet has a large speedup in inference time and a smaller model size with a reasonable accuracy sacrifice. The resulting model is more suitable for mobile deployment, taking much less memory space and inference time.
5.2 Limitations and Further Work
5.2.1 More approximate computing techniques
Currently, the approximate computing technique used is depthwise separable convolution, which is an approximation to traditional convolution. In future work, we would like to apply network pruning and quantization techniques to the resulting models to further decrease the model size and inference time.
5.2.2 More Extensive Experiments
In this project, due to computing resource and time constraints, we use one dataset, CIFAR-100, and two popular traditional models, Inception and ResNet, in the comparison. In future work, we will use more datasets and more models to do a more extensive evaluation.
5.2.3 Application into Practice
In future work, we would like to put the approximate computing techniques used in this project into real practice. Many mobile applications would benefit from them. Two examples are bank card number recognition and handwritten Chinese character recognition. The first can be used in a payment app that lets users avoid the hassle of entering a card number manually. The second can be used in a Chinese input app. The computing techniques used in this project would make the recognition in these two applications much faster and the apps less memory-consuming.
5.2.4 Model Architecture Improvement
Although MobileNet achieves a significant inference speedup and model size reduction, it has a relatively large accuracy loss compared with the Inception model. Thus, we would like to adjust the model architecture to improve its accuracy in future work.