Approximate Computing for Deep Learning in
TensorFlow
Chiang Chi-An
T
H
E
U
N I V E R
S
I
T
Y
O
F
E
D I N B
U
R
G
H
Master of Science
School of Informatics
University of Edinburgh
2017
Abstract
Deep learning techniques have achieved remarkable breakthroughs in many areas
such as image classificatin, audio recognization and object detection in recent years.
At the same time, with the widespread of smartphones, there is a growing demand
for mobile applications to deploy deep learning models that can make smartphones
even smarter. However, deep neural networks are typically computationally expensive
and memory-consuming which makes them unsuitable for smartphones. Depthwise
separable convolution is a powerful approximate computing technique for deep learn-
ing to address this problem. It is an approximation of conventional convolution but
with much fewer parameters and less inference time. This project investigates this
technique and the MobileNet mobel which makes heavily use of depthwise separable
convolutions. Experiments are conducted using TensorFlow library in this project to
show whether MobileNet mobel can decrease model size and increase inference speed
in mobile devices, while at the same time keep decent accuracy compared with tradi-
tional deep learning models such as Inception V3 and ResNet. The experiments of this
project also demonstrate how the two hyperparameters width multiplier and resolution
multiplier in MobileNet impact the trade off between model size, inference speed and
classification accuracy.
i
Acknowledgements
First of all, I would like to thank my dissertation supervisor, Dr. Pramod Bhatotia, for
teaching me how to conduct rigorous research, organize my thoughts, and produce a
well-structured thesis. From beginning the proposal to finishing the dissertation, He
provided me with precious and invaluable intellectual advice. Thanks to him, I did not
have to learn through trial and error, and had the freedom to explore the areas that I
found most interesting and satisfying while I was doing my dissertation. I would also
like to thank Dr. Bob Fisher, my personal tutor. In both our one-on-one meetings and
our group meetings, he demonstrated his investment in me. In addition, the courses
that he advised me to take were very practical and provided me with a comprehensive
course schedule, enabling me to learn all that I could in this MSc program. Finally,
I would like to thank my mother, whose financial and mental support allowed me to
focus on my studies without worry.
ii
Declaration
I declare that this thesis was composed by myself, that the work contained herein is
my own except where explicitly stated otherwise in the text, and that this work has not
been submitted for any other degree or professional qualification except as specified.
(Chiang Chi-An)
iii
Table of Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Achieved results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Dissertation outline . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Background 4
2.1 Relevant work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.2 Approximate Computing . . . . . . . . . . . . . . . . . . . . 4
2.2 TensorFlow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.2 Advantage . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.3 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.4 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3 Methods 11
3.1 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.1 Activation Function . . . . . . . . . . . . . . . . . . . . . . . 12
3.1.2 Fully Connected Layer . . . . . . . . . . . . . . . . . . . . . 14
3.1.3 Convolutional Layer . . . . . . . . . . . . . . . . . . . . . . 15
3.1.4 Pooling layer . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 Loss function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.1 Cross Entropy loss . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.2 Hinge Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.3 Loss Functions Comparison . . . . . . . . . . . . . . . . . . 19
3.3 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
iv
3.3.1 L1 and L2 Regularization . . . . . . . . . . . . . . . . . . . 19
3.3.2 Dropout Layer . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.3 Batch Normalization . . . . . . . . . . . . . . . . . . . . . . 22
3.4 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4.1 Mini-batch gradient descent . . . . . . . . . . . . . . . . . . 23
3.4.2 Learning Rate Decay . . . . . . . . . . . . . . . . . . . . . . 24
3.4.3 Mini-batch gradient descent extensions . . . . . . . . . . . . 25
3.4.4 Forward Propagation and Backpropagation . . . . . . . . . . 27
3.5 Depthwise Separable Convolution . . . . . . . . . . . . . . . . . . . 29
3.6 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4 Results and Evaluation 33
4.1 Resource and tools . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1.1 Software and Hardware Environments . . . . . . . . . . . . . 33
4.1.2 Checkpoint File . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1.3 Model File . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3.1 Training set and test set . . . . . . . . . . . . . . . . . . . . . 35
4.3.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.3.3 Neural Networks Training . . . . . . . . . . . . . . . . . . . 36
4.4 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.6 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5 Conclusion and Discussion 46
5.1 Remarks and observations . . . . . . . . . . . . . . . . . . . . . . . 46
5.2 Limitation and Further work . . . . . . . . . . . . . . . . . . . . . . 46
5.2.1 More approximate computing techniques . . . . . . . . . . . 46
5.2.2 More extensive Experiment . . . . . . . . . . . . . . . . . . 47
5.2.3 Application into Practice . . . . . . . . . . . . . . . . . . . . 47
5.2.4 Model Architecture Improvement . . . . . . . . . . . . . . . 47
Bibliography 48
v
List of Figures
3.1 A simple neural network. . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Computation in a single neuron. wi is the weight for input xi, b is the
bias and f is the activation function. . . . . . . . . . . . . . . . . . . 12
3.3 sigmoid plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.4 Tanh plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.5 ReLU plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.6 An example of fully connected layer . . . . . . . . . . . . . . . . . . 15
3.7 An example of 1D convolutional layer that illustrates local connection
and weights sharing. The connections with the same color share the
same weight. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.8 An example of average pooling operation for a single depth slice with
a 2×2 filter and stride of 2. . . . . . . . . . . . . . . . . . . . . . . . 18
3.9 An example of dropout operation. The two neurons colored grey and
their associated connections are dropped out. . . . . . . . . . . . . . 22
3.10 Conventional convolution [34] . . . . . . . . . . . . . . . . . . . . . 29
3.11 Depthwise convolution [34] . . . . . . . . . . . . . . . . . . . . . . . 29
3.12 Pointwise convolution [34] . . . . . . . . . . . . . . . . . . . . . . . 30
4.1 A sample of 100 images from CIFAR-100 . . . . . . . . . . . . . . . 35
4.2 The architectures of MobileNet, Inception V3 and ResNet. . . . . . . 37
4.3 Total Loss for MobileNet . . . . . . . . . . . . . . . . . . . . . . . . 39
4.4 Cross Entropy Loss for MobileNet . . . . . . . . . . . . . . . . . . . 39
4.5 Regularization Loss for MobileNet . . . . . . . . . . . . . . . . . . . 39
4.6 Total Loss for Inception V3 . . . . . . . . . . . . . . . . . . . . . . . 40
4.7 Cross Entropy Loss for Inception V3 . . . . . . . . . . . . . . . . . . 40
4.8 Regularization Loss for Inception V3 . . . . . . . . . . . . . . . . . . 40
4.9 Total Loss for ResNet . . . . . . . . . . . . . . . . . . . . . . . . . . 41
vi
4.10 Cross Entropy Loss for ResNet . . . . . . . . . . . . . . . . . . . . . 41
4.11 Regularization Loss for ResNet . . . . . . . . . . . . . . . . . . . . . 41
vii
List of Tables
4.1 Performance for different width multipliers and resolution multipliers. 43
4.2 Performance of Different Models . . . . . . . . . . . . . . . . . . . . 43
4.3 Relative Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 44
viii
Chapter 1
Introduction
1.1 Motivation
In recent years, machine learning techniques, especially deep learning which uses mul-
tiple layers of artificial neural networks, have achieved remarkable breakthroughs in
many fields. From image classification to Go game AI player AlphaGo [1], deep learn-
ing exhibits the best performance.
At the same time, more and more people use smartphone. Undoubtedly, AI tech-
niques such as deep learning will make smartphones even smarter. Functions such
as face recognition, audio recognition and image classification will be added to many
mobile apps.
The deep learning model training part can be done offline in the server clusters.
For the inference part, although we can send the data to the server, and let the server
do the prediction job and reply with the result. In some cases, if the data is sensitive,
the client may wish not to send out to servers. One example is the bank card number
recognition application. Even without privacy or security concerns, network traffic can
be slow and expensive and building reliable servers increases the operation cost.
Thus, if the prediction computation is done locally on the smart phone, then there
is no data security concern, no network traffic delay and cost and no need to maintain
a potentially expensive server cluster. However, this approach also has its drawbacks.
It requires the model to be stored in the smart phone’s limited storage and inference
computing in the mobile can be slow and cost battery power.
A deep neural network typically has many layers with a large number of parame-
ters. It requires large storage and a large number of math operations. For example, one
traditional image classification model VGG [2] has about 100 million parameters, need
1
Chapter 1. Introduction 2
more than 1GB to store the model and takes more than 10000 million math operations.
Thus it is not fit in the mobile phone.
To use deep learning models in the mobile phone, we must find a way to signifi-
cantly decrease the model size and the number of computing operations to make the
model file reasonably small and computing fast with less power. In the meantime, per-
formance should be maintained as high as possible. We need to find a suitable trade-off
between them.
1.2 Objective
MobileNet [3] is a new deep neural network model proposed by Google that are spe-
cially designed for mobile and embedded devices using approximate computing tech-
niques. Although the experiments in its paper show that it has strong performance com-
pared to other popular models on ImageNet [4] classification, a useful model should
also have good performance on new datasets using the transfer learning technique.
In this project, we will compare MobileNet with other popular models in accuracy,
model size and inference time in mobile devices to investigate whether approximate
computing used in MobileNet can achieve a better trade off between accuracy and
efficiency to be suitable for mobile device. We will also investigate how the two pa-
rameters width multiplier and resolution multiplier of MobileNet affect the accuracy,
model size and inference time.
1.3 Achieved results
MobileNets with different width multipliers and resolution multipliers are success-
fully trained on the CIFAR-100 [5] using transfer learning with pre-trained model on
ImageNet. GoogLeNet Inception V3 [6] and ResNet [7] models are also trained on
the CIFAR-100 using transfer learning. The following metrics are computed for each
model: top-1 and top-5 accuracy on test set, number of parameters, size of model
file and inference time in mobile device. Comparison of the results shows that Mo-
bileNet with width multiplier 1 and resolution multiplier 1 has speedup more than 17×
and shrinks the model file more than 6× both compared with GoogLeNet Inception
V3 and ResNet models. It has 18.3% loss in top-1 accuracy and 8.5% loss in top-5
accuracy compared with GoogLeNet Inception V3 and with almost no loss in both
top-1 and top-5 accuracy compared with ResNet. The results also show that as we
Chapter 1. Introduction 3
decrease width multiplier, model size becomes smaller and inference time quicker but
with more accuracy loss. The resolution multiplier has the similar effect except that it
doesn’t affect model size.
1.4 Dissertation outline
Chapter 2 will introduce various approximate computing techniques for deep learn-
ing which can be divided into 3 general categories such as low rank approximation
to which techniques used in this project belong, network pruning and quantization.
The introduction of Tensorflow [8] which is the deep learning framework used in this
project is also included in Chapter 1.
Chapter 3 will elaborate both the theory and implementation of the deep learning
models in detail. They include loss function, optimization algorithm, regularization
method, various kinds of layers used, transfer learning and the particular approxi-
mate computing technique used in this project: approximating traditional convolu-
tional layer with depth-wise separable convolution layer.
Chapter 4 describes experiment setup, results and analysis.
Chapter 5 gives the project conclusion and discussion, then future work.
Chapter 2
Background
2.1 Relevant work
2.1.1 Deep Learning
Deep learning techniques have achieved state-of-art results in many areas of machine
learning. The achievements are remarkable, especially for the success of deep convo-
lutional neural network (CNNs) in image classification. CNNs have the best results
in all the standard image datasets such as MNIST [9], CIFAR-10 [5], CIFAR-100 [5]
and ImageNet [4]. Many different CNNs models such as VGG [2], ResNet [7] and
Inception V3 [6] have been developed. Because convolutional layers can make bet-
ter use of image spatial information, these models typically have a sequence of many
convolutional layers.
2.1.2 Approximate Computing
Until recently, deep learning researchers have been primarily focused on improving
model’s accuracy. However, the use of multiple convolutional layers also results in
large number of parameters requiring large memory for model storage and increases
the computational cost.
With the widespread use of mobile devices and the application of deep learning
in mobile apps, more and more researchers are aware that to have a good mobile user
experience, accuracy is not enough – the model must also be efficient: less memory,
quicker inference and less energy consumption. Because mobile consumers do not
want a single app to take too much space of limited memory and want the app to
respond instantly.
4
Chapter 2. Background 5
They resort to approximate computing techniques to make a better trade-off be-
tween accuracy and efficiency. The goal is to make model size smaller and inference
time quicker to be suitable for mobile device while at the same keep as much accuracy
as possible.
[10] shows that significant redundancy often exists in deep learning models. Through
approximate computing, we can remove the redundancy, saving both memory and
computation cost. The approximate computing for deep learning can be divided into
roughly three general approaches: low rank approximation, network pruning and quan-
tization.
2.1.2.1 Low Rank Approximation of Filters
This approach decomposes the filters in convolutional layers into a series of separable
smaller filters which are a low-rank approximation of original filters and reduce time
complexity. The optimal decomposition can be found by minimizing the reconstruc-
tion error of the filters or the layer output. Since convolutional layers are the most
time-consuming parts in CNNs, this low-rank decomposition will generate significant
speed up.
[11] uses SVD decomposition to make convolutional layers 1.6 times faster while
sacrificing 1% accuracy. [12] uses rank 1 filter basis approximation that achieves
speedup by factor 2.5 without sacrifice of accuracy and by factor 4.5 with less than
1% accuracy decrease for a text character recognition network. [11] and [12] can only
decompose linear filters in a single layer. [13] further develops this method to take
into account the nonlinearity, such as Rectified Linear Units (ReLU), which makes the
approximation more accurate. It also invents new algorithms to optimize the whole
network to reduce the accumulated errors when approximating multiple convolutional
layers. It achieves speed up of factor 4 on a large pre-trained model on ImageNet with
only 0.9% top-5 error rate increase.
Instead of modifying pre-trained networks by finding low-rank approximation of
the convolutional layers, some papers invent new architectures with traditional con-
volutional layers replaced by layers that has similar function but with smaller com-
putation cost and then train from scratch. Flattened networks [14] replaces 3D filters
in conventional convolutional networks with consecutive sequence of 1-D filters in all
3 dimensions which reduces the parameters significantly and make the feedforward
computation about 2 times faster. Factorized networks [15] factor the convolution
operation by unravelling the convolutional layer with a sequence of combination of
Chapter 2. Background 6
single channel convolution and linear channel projection. It achieves similar accuracy
but with much less computation compared with traditional deep convolutional neural
networks models. MobileNet [3] uses a similar approach to those of flattened networks
[14] and factorized networks [15]. Its model is based on depthwise separable convo-
lutions which separate traditional convolutions into depthwise convolutions that apply
a single filter for each input channel and pointwise convolutions that combine the re-
sults linearly. The MobileNet model has smaller size and comparable accuracy with
models such as GoogleNet [16] and VGG-16 [2]. It provides two hyperparameters
width multiplier and resolution multiplier to adjust the trade-off between latency and
accuracy.
2.1.2.2 Network Pruning
This approach tries to remove parts of the models that are not important to reduce
number of parameters and computation.
[17] first learns the importance of network connections and removes those that are
not important, then retrains the connections left. Its experiments show that the number
of parameters in VGG-16 model can be reduced by 13× and AlexNet [18] model by
9× with no loss of accuracy using this method.
[19] and [20] aim to prune whole filters together instead of weights, which can
induce more speedup in the convolutional layers. [19] reports inference time
decreases by 34% for VGG-16 and 38% for ResNet-110 on CIFAR-10 almost without
loss of accuracy. [20] reports 3.31× FLOPs reduction and 16.63× compression on
VGG-16 with 0.52% top-5 accuracy drop.
[21]’s pruning algorithm aims specifically at reducing energy consumption of CNNs
instead of computation and memory cost. It reports energy consumption for AlexNet
decreases by 3.7× and GoogLeNet decreases 1.6× both with top-5 accuracy loss less
than 1%.
2.1.2.3 Network Quantization
Network Quantization quantizes the parameters of neural network models and encodes
them with fewer bits to reduce their memory storage. For example, using 8 bits instead
of 32 bits will require only about 25% of previously needed storage. Another benefit of
quantization is to make the inference computation faster and use less power. Because
using fewer bits saves memory bandwidth and RAM access time and allows more
Chapter 2. Background 7
operations done in one cycle for SIMD instructions.
During the training phase, in each step, the parameters of neural networks adjust
a little using back propagation and gradient descent algorithm, which requires high-
precision number format such as 32 bits floating number. Thus, instead of training a
quantized model from scratch, we usually quantize a pre-trained model.
Quantization for deep networks typically doesn’t decrease the accuracy of infer-
ence too much. Because deep networks are often very robust and good at ignoring the
noise including the precision error noise introduced by quantization.
One simple way to quantize is to store the minimum and maximum values of the
floating numbers set, then using an integer to represent the floating number. For ex-
ample, if we use 8 bits to represent floating numbers in the range [-20.0, 50], then 0
represents -20.0, 255 represents 50.0, 128 represents 35.0 and so on.
[22] uses k-means clustering algorithm and product quantization method to quan-
tize the network parameters layer by layer. It achieves 16-24 times compression of
CNN on ImageNet with 1% loss of accuracy.
[23] uses Hessian-weighted k-means clustering and fixed-length binary encoding
to do the quantization. Hessian-weighting also takes into account the across layers im-
pact of quantization errors aside from within impact and thus can quantize the whole
network at once. This paper also employs Huffman coding to further compress the net-
work. It reports that the quantize models are 1.95%, 4.51% and 2.46% respectively of
the original model sizes for LeNet, ResNet and AlexNet at no or marginal performance
loss.
Network quantization can also be combined with other approximate computing
techniques. Deep compression [24] combines network pruning and quantization. It
first prunes the model connections and only keeps most important connections to re-
duce parameters by 9-13 times. Then it quantizes the weights so that we can use only
5 bits to represent a weight instead of 32 bits. Finally it uses Huffman coding to reduce
the model size further. This method compresses AlexNet model by 35 times, from
240MB to 6.9MB, increases the speed by 3-4 times and costs 3-7 times less power.
Chapter 2. Background 8
2.2 TensorFlow
2.2.1 Introduction
TensorFlow [8] is the second generation machine learning system published by Google.
It is a successor for Google’s previous DistBelief system [25]. Its computation is based
on data flow graph which takes math operations as nodes and multidimensional data
arrays (tensors) flows through edges.
It is open-sourced and can be run in CPU, GPU and TPU (Tensor Processing Unit)
which is ,specialized computation device developed by Google. It enables the re-
searchers to easily implement various deep learning algorithms and has attracted much
attention from research communities.
The main components of TensorFlow consist of client, master and working pro-
cesses. Client sends request to the master and the master schedules the working pro-
cesses to do computation in available devices. TensorFlow can be used either in a
single machine or distributed clusters where client, master and working processes run
in different machines .
2.2.2 Advantage
One of the many useful features is that TensorFlow can differentiate symbolic expres-
sions and derive the backpropagation automatically for neural network training which
greatly reduce the workload on programmers and the chance to make mistakes.
The TensorFlow is designed based on dataflow graph model. It provides python
and C++ interface for programmers to easily construct the graph, which greatly facili-
tates the experimentations of network architecture, optimization algorithm and network
parameters.
After the user constructs the dataflow graph, the TensorFlow system will optimize
the graph and actually execute the operations in machines. Through this approach of
first constructing the graph and then actually executing the operations, it enables the
TensorFlow to know the whole information before execution and thus can do optimiza-
tion as much as possible.
All computations are encoded as nodes in a data graph. The dependency of the
data between different operations are explicitly encoded in the graph. Thus, the Ten-
sorFlow can partition the graph according to the dependencies and run the subgraph
computations in parallel in different devices. The TensorFlow allows the user to spec-
Chapter 2. Background 9
ify the subgraphs that need to be computed. The user can feed tensors to none or some
of the input place holders. The TensorFlow system only runs the computation that is
necessary and prune the irrelevant graph away.
The tensor flow’s data graph model not only makes it easy to run concurrently and
also easy to distribute computation to multiple devices. In TensorFlow, the data flowing
through graph are called tensors. A tensor is a multi-dimensional array of primitive
types such as int32. It represents the input and the output of the operations, which are
represented in the vertices. Every operation has a type and none or more attributes. An
operation that contains mutable state is called a stateful operation. Variable is one of
such kind of operation. Another special operation is queue operation
User can use TensorFlow’s checkpoint files to periodically save training models
and reload the model later. This facility not only improves the fault tolerance, it also
can be used for transfer learning.
2.2.3 Architecture
The TensorFlow adopts a layered architecture. On the top level are training and infer-
ence libraries. The next level is python and C++ API which are built on the C API.
Below C API level are distributed master and dataflow executor.
The distributed master accepts a data flow graph as input, it will prune the unnec-
essary part of the graph and divide the graph into subgraphs to distribute computation
to different devices. Many optimizations, such as constant folding and subexpression
elimination, are done by it. The dataflow executor’s task is to execute the computation
of the subgraph distributed by the distributed master.
The next level is kernel implementations which have more than 200 operations
implemented including most commonly used operations such as Const, Var, MatMul,
Conv2D and ReLU.
Apart from above core components, the TensorFlow system also includes several
useful tools such as a dashboard to visualize the data flow graph and training progress
and a profiler that shows the running time of different tasks in different devices.
2.2.4 Performance
In Chintala’s benchmark of convolutional models testing, the results show that Ten-
sorFlow has shorter training step time than Caffe [26] and with a similar one to Torch
Chapter 2. Background 10
[27]. Experiments have shown that TensorFlow can scale well in problems such as
image classification and language modelling.
Chapter 3
Methods
3.1 Network Architecture
The neural networks are organized as layers of neurons with input layer as the first
layer, output layer as the last and hidden layers between them. A simple neural network
is shown in Figure 3.1.
Figure 3.1: A simple neural network.
11
Chapter 3. Methods 12
3.1.1 Activation Function
Each neuron is a computing unit that applies linear transformation to its inputs, fol-
lowed by an activation function to generate the output (Figure 3.2).
Figure 3.2: Computation in a single neuron. wi is the weight for input xi, b is the bias
and f is the activation function.
We typically use a non-linear function as the activation function, because if the acti-
vation function is linear, it can be incorporated into the previous linear transformation.
There are many different activation functions. The most commonly used are sigmoid,
tanh and rectified linear unit (ReLU). Sigmoid (Equation 3.1) and tanh (Equation 3.2)
functions involve exponential operation and division, whereas ReLU (Equation 3.3)
only involves max operation. Thus, it is much easier and faster to compute ReLU
activation function value and its derivative during backpropagation. ReLU can also
reduce the vanishing gradient problem (when x is too large or too small, the gradient
of sigmoid or tanh is approximate to 0). [18] found that it is much faster to train deep
neural networks using ReLU as activation function than sigmoid or tanh. Thus, in this
project activation functions of all neurons are ReLU.
• Sigmoid
f (x) =
1
1+ e−x
(3.1)
Chapter 3. Methods 13
Figure 3.3: sigmoid plot
• Tanh
f (x) = tanh(x) =
ex− e−x
ex + e−x
(3.2)
Figure 3.4: Tanh plot
• ReLU
Chapter 3. Methods 14
f (x) = max(0,x) (3.3)
Figure 3.5: ReLU plot
3.1.2 Fully Connected Layer
The neurons in the neighbouring layers are fully connected, which means every pair
of neurons has a connection (Figure 3.6). If the two layers have M neurons and N
neurons respectively, then there are M×N connections between them each with dif-
ferent weight parameters. This is the traditional layer type often used in regular neural
network. In this project, we use the fully connected layer as the last layer to compute
class scores in the deep convolutional neural networks.
Chapter 3. Methods 15
Figure 3.6: An example of fully connected layer
3.1.3 Convolutional Layer
This project stacks multiple convolutional layers in the deep neural networks to learn
features automatically from image data. Because for image and other high-dimensional
data, convolutional layer is often preferable to fully connected layer which can be
slow to train and easy to overfit due to the large number of connections and parame-
ters. For example, if the input image is 30x30x3 and fully connected layer is used as
the first hidden layer, then every neuron in the fully connected layer will connect to
30x30x3=2700 neurons in the input layer. For such small image, it may not be a prob-
lem. But for larger image such as 300x300x3, there will be 270000 connections for a
single neuron which is difficult to handle. In addition, high-dimensional data such as
image often has inherent spatial structure. But for fully connected layer, the input is
just a vector of pixel values. The relative position of the pixels has no effect and the
spatial structure information is lost.
To address these problems, convolutional layer is invented. To be suitable for image
data, the layout of neurons in convolutional layer is three-dimensional instead of one-
dimensional in the fully connected layer. The three dimensions are width, height and
depth respectively. Each neuron in the convolutional layer now only connects to a
small regions of neurons of previous layer. The small region is small in width and
Chapter 3. Methods 16
height but includes all depth. The width and height of the region is called receptive
field or filter size. So the receptive field controls how large the connection region will
be. In this way, we reduce the connections dramatically. For example, If the receptive
field is 3×3, the input volume is 300x300x3, then one neuron will connect to 3x3x3=27
neurons of the previous layer instead of 270000 in a fully connected layer. Apart from
the benefit of reducing the number of connections, it is also helpful for learning the
local features of the image.
To reduce the number of parameters further, the convolutional layer let neurons
in the same depth slice share the same weights. Thus, for different positions in the
image, the filter uses the same weights to extract the features, which makes the feature
extracting translation invariant.
During forward propagation phase, we slide a window of size defined by receptive
field over all the input volume and compute the dot product of filter weights and the
pixel values in the window to get a single number in the output volume. The dot
products of all positions constitute the activation map. And the activation maps for all
filters are stacked in the depth dimension to constitute the total output volume.
In summary, by arranging layer of neurons in 3D space, constraining the connec-
tions to local area and sharing the weights (Figure 3.7), convolutional layer can make
better use of spatial information with much less parameters.
Chapter 3. Methods 17
Figure 3.7: An example of 1D convolutional layer that illustrates local connection and
weights sharing. The connections with the same color share the same weight.
3.1.3.1 Convert Fully connected layer to Convolutional Layer
Fully connected layer can be converted to convolutional layer. For example, if the fully
connected layer accepts 5×5×128 input volume and outputs 1×1×10 volume, then
a convolutional layer with 10 filters of size 5×5 will give the same effect. Replacing
the fully connected layer with a convolutional layer has the advantage that when the
input image has a larger size than the trained image, we can process multiple areas
of the input image in a single forward pass instead of multiple forward passes to get
multiple class score vectors and the final prediction can be done using their average,
which can improve the prediction accuracy.
3.1.4 Pooling layer
Pooling layer is often used in the convolutional neural networks and can decrease the
size of features significantly. In this project, the average pooling layers are used before
the fully connected layers. It works by sliding a small window over input volume, us-
ing a non-linear function to computing a number with the values in the small window
Chapter 3. Methods 18
as input. The computation is conducted for each input depth independently. The most
commonly used non-linear function is max function. Other functions, such as average
(Figure 3.8) and L2-norm, are also used. By reducing multiple values in a local re-
gion to only one number, the pooling layer has the effect of extracting more abstract
features which helps the model to generalize and reduce overfitting. The pooling layer
introduces no additional parameters and it will reduce the width and height by factor
2 or more with depth unchanged. Thus, the number of parameters of the later layers is
reduced. The most commonly used filter size is 2×2, which results in output volume
of 1/4 input volume size. Larger filter size is rarely used, because it will discard too
much information and often result in bad performance.
Figure 3.8: An example of average pooling operation for a single depth slice with a 2×2
filter and stride of 2.
3.2 Loss function
Suppose we have n classes, for sample x, we have computed a score vector f of n
elements. f j is the class score of sample xi for class j. Larger score indicates it is more
likely for xi to belong that class. The loss function takes the score vector as input and
outputs a single number to indicate how well the score outcome matches with the true
class label. Intuitively, if the score for the true class is relatively higher than others’,
then the loss function value should be smaller.
3.2.1 Cross Entropy loss
We can use the softmax function (Equation 3.4) to convert class score vector to class
probability vector with each value in range [0,1] and the total sum as 1. The probability
of data sample xi belonging to class k given the class score vector f is:
Chapter 3. Methods 19
P(y = k|xi) =
e fk
∑ j e
f j
(3.4)
That is, for each score, take its exponentiation and then divided by sum of exponen-
tiations to normalize the value to 0-1. We want the loss to be small when the predicted
probability for the correct class is larger. Thus, we take negative log of P(yi|xi) where
yi is the correct class for xi to get the loss. The loss for sample xi is:
Li =− log(P(yi|xi)) =− log(
e fyi
∑ j e
f j
) =− fyi + log∑
j
e f j (3.5)
3.2.2 Hinge Loss
Another commonly used loss function is hinge loss. The loss for sample (xi,yi) given
class score vector f is:
Li = ∑
j 6=yi
max(0, f j− fi +1) (3.6)
Intuitively, this loss function wants the score for the true class to be larger than
others at least by 1. Otherwise, the loss will increase for each violation.
3.2.3 Loss Functions Comparison
The cross-entropy unlike hinge loss provides probability for each class which is easier
for human to interpret than raw class score. Another difference is that, once the margins
between score of true class and scores of other classes are large enough, the hinge loss
becomes zero and cannot decrease further, whereas the cross-entropy loss can always
decrease. The hinge loss and cross-entropy loss often have similar performance. The
cross-entropy loss is used for all the models in this project.
3.3 Regularization
3.3.1 L1 and L2 Regularization
We often use regularization methods to reduce overfitting. One way of regularization
is to add weight penalty to the loss. The new loss is the sum of original data loss and
the added regularization loss. The most commonly used are L1 (Equation 3.7) and L2
(Equation 3.8) regularization.
Chapter 3. Methods 20
L =
1
N ∑i
Li︸ ︷︷ ︸
data loss
+ λ∑
k
∑
l
|Wk,l|︸ ︷︷ ︸
regularization loss
(3.7)
L =
1
N ∑i
Li︸ ︷︷ ︸
data loss
+
1
2
λ∑
k
∑
l
W 2k,l︸ ︷︷ ︸
regularization loss
(3.8)
The regularization parameter λ controls the regularization strength. A large λ will
put more weight to regularization loss and thus stronger regularization. Small λ will
put more weight to data loss and thus weaker regularization. Different datasets or
network architectures may require very different value of λ. There is no simple way to
decide suitable λ. It is usually set through cross-validation. By adding regularization
loss which penalizes large weights, it helps to result in networks with smaller weights.
Smaller weights mean a few change of the inputs won’t change the output of the
network too much. A few outliers won’t matter too much for the regularized networks
which make the network less sensitive to the noise in the data. On the other hand, a
little change on some of the inputs may cause the output of network with large weights
change a lot. So large weights will make the model easily adapt to all the training data
including noise.
In summary, unregularized networks with large weights tend to be more complex,
easy to learn the noise and more likely to overfit. In contrast, regularized networks
with small weights tend to be simpler, robust to noise, less likely to overfit and better
to generalize.
The L2 regularization and L1 regularization are similar. Both penalize large weights.
But they have different form of weight updating in gradient descent algorithm. For L2
regularization, the additional update of w because of added regularization loss is
w = w−ηλw (3.9)
For L1 regularization, it is
w = w−ηλ sign(w) (3.10)
From above we can see that the updating amount is constant for L1 regularization
and proportional to w for L2 regularization. Thus, the penalty is much larger for L2
regularization when |w| is large and much larger for L1 regularization when |w| is
small. The effect is that weights in L1 are sparse with a small number of relatively
large weights and others driven to 0. On the other hand, L2 regularization weights
Chapter 3. Methods 21
are more diffuse. The sparsity feature of L1 regularization makes L1 a better choice
for feature selection purpose. In other situations, L2 regularization is found usually
better than L1 regularization. Thus, we use L2 regularization in this project to reduce
overfitting.
We can also combine these two regularizations (Equation 3.11) which is called
elastic net regularization [28].
L =
1
N ∑i
Li︸ ︷︷ ︸
data loss
+∑
k
∑
l
λ1|Wk,l|+
1
2
λ2W
2
k,l︸ ︷︷ ︸
regularization loss
(3.11)
Apart from adding regularization loss, another way to avoid weights with too large
magnitude is to enforce max norm constraints [29]. This method does the weights
updating as usual using gradient descent algorithm and then clips the weights if needed
to ensure each weight vector norm is below a preset maximum value.
3.3.2 Dropout Layer
Dropout [30] is another regularization method to reduce overfitting. In the training
stage, we randomly drop out the neurons and the associated connections according to
probability 1− p (Figure 3.9). This has the effect of sampling from a large number of
sub-networks. In the testing stage, we do not drop out neurons. Instead, we use the full
networks but with the neuron’s output weighted with p. In this way, we compute the
average output of all the sub-networks approximately.
By randomly dropping out neurons, the dropout techniques trains over exponen-
tially large number of sub-networks, and using the average prediction of them which is
like a kind of ensemble learning, it reduces the overfitting and also increases the speed
of training.
Chapter 3. Methods 22
Figure 3.9: An example of dropout operation. The two neurons colored grey and their
associated connections are dropped out.
3.3.3 Batch Normalization
During neural network training, the parameters change of one layer will change the
distribution of inputs of the layers after it. This phenomenon called internal covariate
shift is especially true for deep neural network, where the impact will be amplified by
multiple layers. To adapt to the input distribution change, it usually requires a lower
learning rate, which makes the training slow.
To solve this problem, we can transform the layer’s input to make it have mean 0
and variance 1. This transformation is called whitening. To make the computation fast
and also differentiable required by the backpropagation, we can whiten each dimension
of the input independently as in Equation 3.12 where the scalar x is one dimension of
the input.
x =
x−E[x]√
Var[x]
(3.12)
To avoid changing the layer’s representation, we add a linear transformation (Equa-
tion 3.13) after the whitening transformation. The two transformations together are
called batch normalization [31].
Chapter 3. Methods 23
y = γx+β (3.13)
During training, the mean and variance of x are estimated from mini-batch samples.
The population means and variances are also estimated by taking moving averages of
mini-batch statistics during training. During inference, the fixed population means and
variances are used so that the output is only determined by the input.
For a layer in the original network:
z = g(Wu+b) (3.14)
We can apply batch normalization in this way:
z = g(BN(Wu)) (3.15)
The reason to remove b is that it can be cancelled by parameter β in the batch
normalization.
In the convolutional layer, the activation map is got by using the same filter applied
on different locations of previous layer. When we use batch normalization for the
convolutional layer, we will normalize all the activations in the activation map together
in the mini-batch. Thus, if the activation map has size p× q and the batch size is m,
then the normalization is applied over p× q×m values. Just like the activation map
shares the same weights, we use the same parameter γ and β for a activation map.
The batch normalization can reduce layer input distribution change and make the
gradients less sensitive to parameter scales, thus higher learning rate can be used to
speed up the training.
During training, the batch normalization depends on the whole mini-batch sam-
ples, the output of one training sample is not deterministic any more. In this way,
batch normalization has the effect of regularization and can remove other regulariza-
tion methods such as dropout. In the experiment, batch normalization is used in all
models. In particular, for MobileNet model, it is used after each hidden layer.
3.4 Optimization
3.4.1 Mini-batch gradient descent
The training process is to use optimization algorithm to update the parameters so that
the loss is minimized. The most commonly used optimization algorithm for neural
network is gradient descent.
Chapter 3. Methods 24
θn+1 = θn−η∇L(θn) (3.16)
θ is the parameter vector, L(θ) is the loss, ∇L(θ) is its gradient and η is the learning
rate. The gradient descent is an iterative algorithm that updates the parameters though
the negative direction of gradient at each iteration and the step size is controlled by the
learning rate.
When the training data is huge, for example ImageNet has over 10 million images,
computing the gradient using the entire data set is costly. In this situation, we need to
use mini-batch gradient descent. In this method, we take a small subset of samples (a
mini-batch) from the data set at each step and then use this mini-batch instead of the
whole data set in normal gradient descent algorithm to compute the gradient and do the
parameter updating. Due to the correlation between samples in the training data set,
the gradient of the loss function over the mini-batch is often very approximate to the
gradient of the loss function over the whole training data set. Since the computation
cost is much cheaper in the mini-batch gradient descent algorithm than the normal
gradient descent algorithm at each parameter updating step, much more updates can
be performed and thus, the loss function can converge much more quickly in mini-
batch gradient descent algorithm
The learning rate in the mini-batch gradient descent algorithm is very important.
When the learning rate is very small, although the loss is guaranteed to decrease, the
converging speed may be too slow. We can increase the learning rate to speed up the
learning, but this may lead to overstep that makes the loss increase. It is very difficult
to set a suitable learning rate. Different datasets or different network architectures may
require different learning rate. We may need to set different learning rate for different
parameters and in different training phases. Learning rate decay and extensions of
mini-batch gradient descent algorithms can be used to solve this problem.
3.4.2 Learning Rate Decay
At the start of training, we may want a relatively larger learning rate so that the loss
function value can decrease quicker. In the later stage, with the improvement getting
smaller in each step, we may want to decay the learning rate so that it can avoid over-
stepping and fine-tune the parameters. We can set the learning rate decay according to
some rule – for example, multiply 0.9 every 1 epoch. Or set the decay manually, for
example, when we see the training loss doesn’t decrease any more, we can try to half
Chapter 3. Methods 25
the learning rate.
Let η0 is the initial learning rate, k is decay rate and t is the number of training
steps. Three commonly used rules can be expressed as follows.
• Natural Exponential decay
η = η0e
−kt (3.17)
• Exponential decay
η = η0k
t (3.18)
• Inverse Time Decay
η =
η0
1+ kt
(3.19)
3.4.3 Mini-batch gradient descent extensions
Many extensions are proposed to improve over the basic mini-batch gradient descent
algorithm. Algorithms such as Adagrad and RMSProp try to setting the learning rate
adaptively during training. Algorithms such as Momentum and Nesterov Momentum
try to adjust the parameter updating direction to reduce oscillations.
3.4.3.1 Adagrad
Adagrad [32] algorithm can adapt the learning rate for each parameter automatically.
C =C+d2
θ
(3.20)
θ = θ−
η
√
C+ ε
dθ (3.21)
ε is used to avoid dividing 0 and it is set to a very small value such as 1e−6.
The above formulae operations are element-wise for each parameter. So each pa-
rameter has its own effective learning rate. AdaGrad keeps track of the sum of gradients
and uses it to adjust the learning rate.
Chapter 3. Methods 26
3.4.3.2 RMSProp
One problem of Adagrad is that the effective learning rate η√
C+ε
is always decreasing,
when it is approximate to 0, then the algorithm stops learning.
Another algorithm called RMSProp [33] tries to solve this problem.
C = γC+(1− γ)d2
θ
(3.22)
θ = θ−
η
√
C+ ε
dθ (3.23)
γ is the decay rate. RMSProp makes a simple change which makes C as the moving
average of gradient square instead of accumulated sum in the Adagrad. Now the effec-
tive learning rate is no longer always decreasing. We train the models using RMSProp
algorithm in this project.
3.4.3.3 Momentum
v = γv−ηdθ (3.24)
θ = θ− v (3.25)
γ is another hyperparameter called momentum. v is the velocity. We integrate
previous velocity with gradient to get the current velocity and then using the velocity
to update the θ. This is different from basic gradient descent where we directly update
the parameters using gradient. This algorithm is helpful to reduce oscillating and speed
up convergence.
3.4.3.4 Nesterov Momentum
The Nesterov momentum uses the gradient of the next position instead of current po-
sition and achieves better results over momentum.
θ
′ = θ+ γv (3.26)
v = γv−ηdθ′ (3.27)
θ = θ− v (3.28)
Chapter 3. Methods 27
3.4.4 Forward Propagation and Backpropagation
Let ai represents the activation values of layer i. For the input layer, the values are
directly from input x, so we have a1 = x . We can compute all neurons’ value layer by
layer from input layer until output layer.
ai+1 = fi(Wiai +bi) (3.29)
From the output layer’s values, we can compute the loss that measures the error
between the model predicted value and the actual target value.
In the training process, we need to use gradient descent algorithm to update the
parameters to reduce the loss. Backpropagation makes use of chain rule to compute
gradients of all parameters with respect to the output efficiently.
3.4.4.1 Chain Rule
The derivative of composite functions can be computed using chain rule method. For
example, if variable x is a function of y which in turn is a function of z, then according
to the chain rule:
dx
dz
=
dx
dy
.
dy
dz
(3.30)
3.4.4.2 Example
The following illustrates the forward propagation and backpropagation process of feed-
ing one sample data to a neural network that has one hidden layer with ReLU activation
and uses cross-entropy loss. W,b,W ′,b′ are the weights and biases for hidden layer and
output layer respectively. X ,y are the sample data and class label.
• Forward propagation
Compute the affine transform for hidden layer.
Z =W T X +b (3.31)
Compute the ReLU activation for hidden layer.
H = max(Z,0) (3.32)
Compute the affine transform for output layer which is the class score.
S =W ′T H +b′ (3.33)
Chapter 3. Methods 28
Convert class score to probability using softmax function.
Pk =
eSy
∑ j e
S j
(3.34)
Compute the loss
L =− logPy (3.35)
• Backpropagation
Compute gradient of class score.
∂L
∂Sk
= pk−1(y = k) (3.36)
Compute gradient of weight w′.
∂L
∂W ′
= H
∂L
∂S
T
(3.37)
Compute gradient of bias b′.
∂L
∂b′k
=
∂L
∂Sk
(3.38)
Backpropagate to hidden layer.
∂L
∂H
=W
∂L
∂S
(3.39)
Set non-positive elements to 0 in ∂L
∂H . Because
∂max(x,0)
∂x = 1 if x > 0 and 0 if
x≤ 0.
∂L
∂Z
=
∂L
∂H
�1(H > 0) (3.40)
Compute gradient of weight w.
∂L
∂W
= X
∂L
∂Z
T
(3.41)
Compute gradient of bias b.
∂L
∂bk
=
∂L
∂Zk
(3.42)
From above, we can see that during backpropagation, we use many intermediate
results computed in forward propagation. Thus we often save the needed intermediate
values in forward propagation to save computation time by avoiding duplicate compu-
tation in backpropagation.
Chapter 3. Methods 29
Although above example is just for a simple neural network, it can be easily ex-
tended to a more complex network. During the forward propagation and backpropaga-
tion process, the computation is local to each layer. Each layer only needs to know the
value propagated to it, compute the values and propagate the values to other layers. It
doesn’t need to care about how other layers do the computation. Thus, different layers
and operations can be used as components to construct deep and very complex neural
networks in many different ways of combination.
3.5 Depthwise Separable Convolution
The depthwise separable convolutions [3] factorize the conventional convolution (Fig-
ure 3.10) with a depthwise convolution (Figure 3.11) followed by a pointwise convo-
lution (Figure 3.12).
Figure 3.10: Conventional convolution [34]
Figure 3.11: Depthwise convolution [34]
Chapter 3. Methods 30
Figure 3.12: Pointwise convolution [34]
The depthwise convolution is done independently for each channel of the input
where a single filter is applied. The pointwise convolution is the same with con-
ventional convolution operation but with kernel size 1×1, which is why it is called
pointwise. It combines the features from depthwise convolution linearly to create new
features. Thus the depth separable convolution has the same effect with conventional
convolution. The difference is that conventional convolution achieves this using a sin-
gle step, whereas depth separable convolution uses two separate steps. Through the
separation of feature filtering and feature combining, depthwise separable convolution
reduces the amount of computation tremendously.
Assume the input I has size W ×H×M where W is the input width, H is the height
and M is the number of input channels. The filer F has size w× h and the number of
filters is N. With stride as 1 and zero padding, the output of conventional convolution
O will have size W ×H ×N. The elements of O are computed as in Equation 3.43
which takes O(W ·H ·M ·N ·w ·h).
Oi, j,n = ∑
u,v,m
Ii+u, j+v,m ·Fu,v,m,n (3.43)
For depthwise convolution, we use one filter for each input channel. The filter has
size w× h×M. The output of the depthwise convolution has size W ×H ×M. It is
computed as in 3.44 which takes O(W ·H ·M ·w ·h):
Oi, j,m = ∑
u,v
Ii+u, j+v,m ·Fu,v,m (3.44)
Then for the 1× 1 pointwise convolution, it uses N filters, takes the output of
depthwise convolution and generates output of size W ×H×N which takes O(W ·H ·
M ·N). In total, depthwise separable convolution takes O(W ·H ·M ·w ·h+W ·H ·M ·
N) = O(W ·H ·M · (w ·h+N)).
Chapter 3. Methods 31
The time ratio between depthwise separable convolution and conventional convo-
lution is W ·H·M·(w·h+N)W ·H·M·N·w·h =
1
N +
1
w·h . For a typical convolution, where w = 3,h = 3,N >
100, we achieve about a 9-fold increase in speed.
Each filter in conventional convolution has w · h ·M weights. Thus, conventional
convolution has w ·h ·M ·N total number of weights for N filters. The total number of
weights of depthwise separable convolution is the sum of w ·h ·M for depthwise convo-
lution and M ·N for 1×1 pointwise convolution. Thus the ratio of number of weights
between depthwise separable convolution and conventional convolution is w·h·M+M·Nw·h·M·N
which can be reduced to 1N +
1
w·h . Thus, the depthwise separable convolution has the
same ratio in reducing number of parameters as in reducing computation cost.
In summary, depthwise separable convolution has almost the same effect of fea-
ture extraction with conventional convolution, but has much less computation cost and
fewer parameters. It is heavily used in MobileNet model and its effectiveness is evalu-
ated in the experiment.
3.6 Transfer Learning
Training a deep convolutional neural network model usually requires large computa-
tion resource and takes long time. For example, training a deep convolutional neural
network model on ImageNet may takes weeks even with GPU clusters. If we cannot
afford the computation resource or time, we can use transfer learning method. In this
method, we take a pre-trained model (there are already many state of the art trained
models available free from internet), replace the last fully-connected layer and retrain
it. The previous layers of neural network model can be seen as a feature extractor. The
last fully connected layer is used to compute class scores using extracted features. We
can use the same features as the pre-trained model, but the classes are often different
from pre-trained model, so we need to replace and retrain the last layer. If retrain-
ing only the last layer doesn’t have a satisfactory performance, we may also need to
fine-tune previous layers: initializing weights with pre-trained model and updating
them during training with smaller learning rate. The reason to use smaller learning
rate is that we expect the weights of pre-trained model are not far from the final opti-
mized weights and we want to update them little by little and not to overstep. Whether
fine-tuning is needed often depends on the similarity between the new dataset and the
dataset used by the pre-trained model in terms of both image data and class labels. If
they are very similar, the kind of features extracted by the layers before last layer in
Chapter 3. Methods 32
the pre-trained model are likely to also suit the new model, and retraining only the last
layer may be enough.
Apart from saving much training time and computation resources using transfer
learning, it often has better results as reported in [35]. Thus, the transfer learning
technique is used to train the models in this project.
Chapter 4
Results and Evaluation
4.1 Resource and tools
4.1.1 Software and Hardware Environments
The model training and accuracy evaluation is implemented using python with Ten-
sorFlow framework 1.0 on Ubuntu Linux system. We use Amazon Elastic Compute
Cloud (EC2) G2 instance equipped with NVIDIA GRID K520 GPUs for training the
deep neural networks.
The image classification Android mobile app is implemented in java with Tensor-
Flow mobile library using Android Studio which is the official IDE. The TensorFlow
mobile library provides APIs that let mobile app easily load pre-trained model and do
inference with it. We use the image classification mobile app to measure the inference
time in Nexus 6 Android phone.
4.1.2 Checkpoint File
During training, we use TensorFlow API to save the learned model parameters period-
ically to binary checkpoint files. Thereby, the model parameters are backed up. Next
time, the model parameters can be restored by loading data from checkpoint file.
4.1.3 Model File
The model file is in Protocol Buffers format which can be saved and loaded using many
different languages. Thus, we can save the model file using python and load the model
using java in Android app.
33
Chapter 4. Results and Evaluation 34
The mobile file contains all the information about the model graph. Each node of
the model graph stores various information including operation name such as “Add”
and “Conv2D”, input nodes and other attributes such as filter size for “Conv2D”.
To make it suitable for deployment, we use the tool freeze graph.py provided by
TensorFlow to combine the graph definition file and checkpoint file containing learned
parameters into a single model file. The tool achieves this by replacing Variable node
with Const node that contains the parameters and it also removes nodes unnecessary
for inference to simplify graph and decreases file size.
The resulting model file can then be shipped with Android app. In the Android
app, upon starting, we first load the model file using TensorFlow Mobile java API and
then do inference using the loaded model.
4.2 Dataset
We use CIFAR-100 dataset [5] in the experiment. The CIFAR-100 dataset contains
60000 small images of size 32× 32. They belong to 100 different classes, with each
class containing 600 images. A sample of 100 images of this dataset is shown in Figure
4.1.
Chapter 4. Results and Evaluation 35
Figure 4.1: A sample of 100 images from CIFAR-100
4.3 Experimental Setup
4.3.1 Training set and test set
The CIFAR-100 dataset is divided into training set which contains 50000 images and
test set which contains 10000 images.
4.3.2 Preprocessing
During the training, an image is randomly transformed before feeding to the neural
networks. In this way, the neural networks will train on multiple versions of the same
image and the actual training data set size is much larger than original data set size.
This will make the model better generalize and reduce overfitting.
• Randomly Shift the Image
Chapter 4. Results and Evaluation 36
First pad the image, and then randomly crop the image. In this way, the image
will randomly shift in the four directions.
• Randomly Flip the Image
The image is flipped left to right with 0.5 probability.
• Randomly adjust the image brightness
Randomly add a value between -63 and 63 to all RGB components of every
pixel.
• Randomly change the image contrast
Randomly choose a contrast factor 0.2≤ f ≤ 1.8. For each RBG channel, com-
pute the mean m and update the corresponding component of each pixel with:
(x−m)× f +m
After above randomly changing steps of the image, lastly we normalize the image
data to make it have zero mean and unit norm.
4.3.3 Neural Networks Training
The deep neural network models compared in the experiment are MobileNet [3], In-
ception V3 [6] and ResNet [7]. MobileNet model heavily makes use of depth separable
convolution. As described in section 3.5, the depth separable convolution approximates
conventional convolution but with fewer parameters and less computation cost. In this
experiment, we will investigate whether this approximate computing technique used
in MobileNet can achieve a better trade off between accuracy and efficiency than the
Inception V3 and ResNet model.
To further reduce the computation cost, MobileNet uses two hyper-parameters
width multiplier α and resolution multiplier β to control the model size. The Mo-
bileNet model with α = 1 and β = 1 is called base model. Models with smaller width
multiplier will use fewer input and output channels in each layer. And models with
smaller resolution multiplier will use input image of smaller size. For a layer with in-
put size W×H×M and output size R×S×N in the base model, the input size becomes
βW ×βH×αM and output size becomes βR×βS×αN in the shrunk model. We will
also investigate how the two parameters width multiplier and resolution multiplier of
MobileNet affect the accuracy, model size and inference time.
Chapter 4. Results and Evaluation 37
Figure 4.2 shows the architectures of the three models used in this experiment. The
training process of each model is described in following sections 4.3.3.1, 4.3.3.2 and
4.3.3.3.
Figure 4.2: The architectures of MobileNet, Inception V3 and ResNet.
4.3.3.1 MobileNet
Hyperparameters
Chapter 4. Results and Evaluation 38
• Batch Size: 128
• Momentum: 0.9
• Initial learning rate: 0.01
• Learning rate decay: decay with factor 0.94 every 2 epochs
• Weight decay parameter: 0.00004
• Optimizer: RMSProp optimization algorithm with decay rate of 0.9
The initial weights are loaded from MobileNet pre-trained model on ImageNet. In
the first stage, train only on the last fully connected layer and keeping the parameters
of previous layers unchanged. It trains 25000 steps in this phase. Then train all layers
to fine-tune the model. It trains 55000 steps in this phase. During training, random
minor changes are applied on the images to augment the data set.
After training finishes, we use the test set to evaluate the performance. Note that the
prediction on each image is just done once. If average prediction of multiple randomly
changes on an image is used, the performance is likely to improve.
The models are exported to TensorFlow model file. In the Android mobile image
classification app, the model file is loaded and the inference time is computed by di-
viding the time it takes to classify 100 images one by one with 100. The inference time
is measured in Nexus 6 Android phone.
The experiments are done for width multipliers 1.0, 0.75, 0.5 and 0.25, image sizes
32, 24 and 16. Thus, the above steps are done for a total of 12 models.
The change of losses with training steps for model with width multiplier 1.0 and
image size 32 are in figures 4.3, 4.4 and 4.5. Loss changes for other models are similar.
The red line is for the first stage and the green line for the second stage.
Figures 4.3, 4.4 and 4.5 show the change of total loss, cross entropy loss and regu-
larization loss with the training steps in both stages.
Chapter 4. Results and Evaluation 39
Figure 4.3: Total Loss for MobileNet
Figure 4.4: Cross Entropy Loss for MobileNet
Figure 4.5: Regularization Loss for MobileNet
4.3.3.2 Inception V3
Google Inception V3 model is proposed in [6]. It adds an auxiliary logits layer in
addition to usual logits layer to speedup convergence during training. The first stage
Chapter 4. Results and Evaluation 40
trains on the auxiliary logits layer and logits layer 15000 steps with a fixed learning
rate of 0.01. The second stage trains 30000 steps on all layers with a smaller fixed
learning rate of 0.0001. Both stages use weight decay of 0.00004.
Figures 4.6, 4.7 and 4.8 show the change of total loss, cross entropy loss and reg-
ularization loss with the training steps in both training stages for the Inception V3
model.
Figure 4.6: Total Loss for Inception V3
Figure 4.7: Cross Entropy Loss for Inception V3
Figure 4.8: Regularization Loss for Inception V3
Chapter 4. Results and Evaluation 41
4.3.3.3 ResNet
The ReNet model is proposed in [7]. For this model in the experiment, it undergoes
the same process with the Inception V3 model during training.
Figures 4.9, 4.10 and 4.11 shows the change of total loss, cross entropy loss and
regularization loss with the training steps in both training stages for the ResNet model.
Figure 4.9: Total Loss for ResNet
Figure 4.10: Cross Entropy Loss for ResNet
Figure 4.11: Regularization Loss for ResNet
Chapter 4. Results and Evaluation 42
4.4 Metrics
• Top-1 Accuracy
The ratio between the number of images that are predicted correctly and the total
number of images in the test set.
• Top-5 Accuracy
Same with top-1 accuracy, it is the ratio between the number of correct predic-
tions and the total number of images. The difference is the meaning of correct
prediction. For top-5 accuracy, classifier gives five candidate guesses instead of
one guess. If the correct label is one of the five guesses, then the prediction is
considered correct.
• Inference Time
The average time model takes to classify a single image in mobile devices.
• Model File Size
The size of the model file in TensorFlow for deployment. The model file size is
mainly determined by the number of parameters and the number of bits used to
encode each parameter.
4.5 Results
Table 4.1 shows the performance for MobileNets with various width multipliers and
resolution multipliers. Table 4.2 shows performance for full MobileNet, Inception V3
and ResNet.
Chapter 4. Results and Evaluation 43
Table 4.1: Performance for different width multipliers and resolution multipliers.
Table 4.2: Performance of Different Models
Chapter 4. Results and Evaluation 44
4.6 Analysis
Table 4.3: Relative Performance
For comparison purposes, the accuracy loss, inference time speedup and model size
compression ratio of MobileNet model over Inception 3 and ResNet are computed in
Table 4.3.
Table 4.3 shows that the MobileNet have significant inference speed up and model
size compression over Inception V3 and ResNet. The reason for this is that depthwise
separable convolutions used in MobileNet reduce computation cost and parameters
dramatically compared with conventional convolution as analyzed in section 3.5. Its
accuracy is similar with ResNet and have a relatively larger loss compared with Incep-
tion V3.
Table 4.1 shows that we can achieve fast inference at the cost of greater accuracy
loss by decreasing width multiplier and resolution multiplier. We can also have fewer
number of parameters and smaller model file size by decreasing resolution multiplier.
Models with smaller width multiplier will have fewer filters used in the convolution,
which results in decreased feature extracting power and increased accuracy loss. Mod-
els with smaller resolution multiplier take smaller input image size. Thus, with less
information used, their accuracies also decrease. The following quantitative analysis
shows why width multiplier and resolution multiplier can reduce computation cost and
number of parameters.
Since most computation and parameters are in depth separable convolutions for
MobileNet, we can focus on the change in the depth separable convolutions. For the
base model, as stated in section 3.5 the computation cost is W ·H ·M · (w · h + N)
and the number of weights is w · h ·M +M ·N. For model with width multiplier α
and resolution multiplier β, the number of input and output channels become αM and
αN, and the input width and height become βW and βH. Thus, the computation cost
becomes
Chapter 4. Results and Evaluation 45
βW ·βH ·αM · (w ·h+αN) (4.1)
and the number of weights becomes
w ·h ·αM+αM ·αN (4.2)
Equation 4.1 shows that decreasing width multiplier α and resolution multiplier β
leads to less computation cost and Equation 4.2 shows that decreasing width multiplier
α also reduces the number of parameters and resolution multiplier β doesn’t affect the
number of parameters.
Table 4.1 also shows that it is better to decrease width multiplier than resolution
multiplier to speed up inference and shrink model file. For example, using width mul-
tiplier 0.75 and resolution multiplier 1.0 have higher accuracy, quicker inference and
smaller model size than using width multiplier 1.0 and resolution multiplier 0.75.
Chapter 5
Conclusion and Discussion
5.1 Remarks and observations
This project implements the MobileNet model using the TensorFlow framework. The
approximate computing techniques, approximating traditional convolutional layer with
depth-wise separable convolution layer, are used. An Android mobile image classifi-
cation app is built to test the real inference time of each model. In the experiment,
MobileNets with various width multipliers and resolution multipliers are successfully
trained on the CIFAR-100 dataset to compare these two hyperparameters’ effect on
the performance, which show that by adjusting them we can get different trade-offs
between accuracy and efficiency. The decrease of width multiplier and resolution
multiplier lead to smaller model size and quicker image classification on mobiles
with greater accuracy loss. Thus, mobile developers can adjust them to find the best
trade-off for their applications. Comparison with other models such as Inception V3
and ResNet are also done in the experiment, which shows that MobileNet has much
speedup in inference time and smaller mobile size with reasonable accuracy sacrifice.
The resulting model is more suitable for mobile deployment which takes much less
memory space and inference time.
5.2 Limitation and Further work
5.2.1 More approximate computing techniques
Currently, the approximate computing technique used is depth-wise separable convolu-
tion which is approximation to traditional convolution. We would like to apply network
46
Chapter 5. Conclusion and Discussion 47
pruning and quantization techniques on the resulting models to further decrease model
size and inference time in future work.
5.2.2 More extensive Experiment
In this project, due to computing resource and time constraint, we use one dataset
CIFAR-100 and two traditional popular models Inception V3 and ResNet in compari-
son. In future work, we will use more datasets and more models to do more extensive
evaluation.
5.2.3 Application into Practice
In future work, we would like to put the approximate computing techniques used in
this project into real practice. Many mobile applications would benefit from approxi-
mate computing techniques used in this project. Two examples are bank card number
recognition and handwritten Chinese character recognition. The first one can be used
in a payment app that let users avoid the hassle of entering card number manually. The
second one can be used in Chinese input app. The computing techniques used in this
project would make the recognition in the two applications much faster and the apps
less memory-consuming.
5.2.4 Model Architecture Improvement
Although the MobileNet achieves significant inference speedup and model size shrink-
ing, it has a relatively large accuracy loss compared with the Inception V3 model.
Thus, we would like to adjust the model architecture to improve its accuracy in future
work.
Bibliography
[1] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George
Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershel-
vam, Marc Lanctot, et al. Mastering the game of go with deep neural networks
and tree search. Nature, 529(7587):484–489, 2016.
[2] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for
large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[3] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun
Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Effi-
cient convolutional neural networks for mobile vision applications. arXiv preprint
arXiv:1704.04861, 2017.
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Ima-
genet: A large-scale hierarchical image database. In Computer Vision and Pat-
tern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE,
2009.
[5] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from
tiny images. 2009.
[6] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew
Wojna. Rethinking the inception architecture for computer vision. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
2818–2826, 2016.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learn-
ing for image recognition. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 770–778, 2016.
48
Bibliography 49
[8] Martı́n Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen,
Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al.
Tensorflow: Large-scale machine learning on heterogeneous distributed systems.
arXiv preprint arXiv:1603.04467, 2016.
[9] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-
based learning applied to document recognition. Proceedings of the IEEE,
86(11):2278–2324, 1998.
[10] Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al. Predict-
ing parameters in deep learning. In Advances in Neural Information Processing
Systems, pages 2148–2156, 2013.
[11] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus.
Exploiting linear structure within convolutional networks for efficient evaluation.
In Advances in Neural Information Processing Systems, pages 1269–1277, 2014.
[12] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up
convolutional neural networks with low rank expansions. arXiv preprint
arXiv:1405.3866, 2014.
[13] Xiangyu Zhang, Jianhua Zou, Xiang Ming, Kaiming He, and Jian Sun. Efficient
and accurate approximations of nonlinear convolutional networks. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
1984–1992, 2015.
[14] Jonghoon Jin, Aysegul Dundar, and Eugenio Culurciello. Flattened convolutional
neural networks for feedforward acceleration. arXiv preprint arXiv:1412.5474,
2014.
[15] Min Wang, Baoyuan Liu, and Hassan Foroosh. Factorized convolutional neural
networks. arXiv preprint arXiv:1608.04337, 2016.
[16] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed,
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabi-
novich. Going deeper with convolutions. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 1–9, 2015.
Bibliography 50
[17] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and
connections for efficient neural network. In Advances in Neural Information
Processing Systems, pages 1135–1143, 2015.
[18] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification
with deep convolutional neural networks. In Advances in neural information
processing systems, pages 1097–1105, 2012.
[19] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Prun-
ing filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
[20] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method
for deep neural network compression. arXiv preprint arXiv:1707.06342, 2017.
[21] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient
convolutional neural networks using energy-aware pruning. arXiv preprint
arXiv:1611.05128, 2016.
[22] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compress-
ing deep convolutional networks using vector quantization. arXiv preprint
arXiv:1412.6115, 2014.
[23] Yoojin Choi, Mostafa El-Khamy, and Jungwon Lee. Towards the limit of network
quantization. arXiv preprint arXiv:1612.01543, 2016.
[24] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing
deep neural networks with pruning, trained quantization and huffman coding.
arXiv preprint arXiv:1510.00149, 2015.
[25] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark
Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale
distributed deep networks. In Advances in neural information processing systems,
pages 1223–1231, 2012.
[26] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long,
Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional
architecture for fast feature embedding. In Proceedings of the 22nd ACM inter-
national conference on Multimedia, pages 675–678. ACM, 2014.
Bibliography 51
[27] Ronan Collobert, Samy Bengio, and Johnny Mariéthoz. Torch: a modular ma-
chine learning software library. Technical report, Idiap, 2002.
[28] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic
net. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
67(2):301–320, 2005.
[29] Andrej Karpathy. Cs231n: Convolutional neural networks for visual recognition.
Neural networks, 1, 2016.
[30] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Rus-
lan Salakhutdinov. Dropout: a simple way to prevent neural networks from over-
fitting. Journal of machine learning research, 15(1):1929–1958, 2014.
[31] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep
network training by reducing internal covariate shift. In International Conference
on Machine Learning, pages 448–456, 2015.
[32] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for
online learning and stochastic optimization. Journal of Machine Learning Re-
search, 12(Jul):2121–2159, 2011.
[33] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient
by a running average of its recent magnitude. COURSERA: Neural networks for
machine learning, 4(2):26–31, 2012.
[34] Matthijs Hollemans. MobileNets on the iPhone. http://machinethink.net/
blog/googles-mobile-net-architecture-on-iphone/, 2017. Accessed:
2017-08-01.
[35] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson.
Cnn features off-the-shelf: an astounding baseline for recognition. In Proceed-
ings of the IEEE conference on computer vision and pattern recognition work-
shops, pages 806–813, 2014.