Approximate Computing for Deep Learning in

TensorFlow

Chiang Chi-An

Master of Science

School of Informatics

University of Edinburgh

2017

Abstract

Many machine learning techniques are now applied on smartphones for tasks such as image
classification, audio recognition and object detection, making smartphones even smarter.
Because deep learning has achieved the best results in many of these fields, more and more
people want to run deep neural network models on smartphones. However, deep neural network
models can be large and require a large amount of computation, which costs too much time and
power. Several approximate computing methods have been proposed in recent years to address
this problem. The method I use in this dissertation is the MobileNet model implemented in
TensorFlow, which was published by Google this year. I conduct experiments to show whether
MobileNet can decrease model size and increase speed while keeping decent accuracy. I compare
the metrics of MobileNet with those of other traditional models such as VGG. I also show how
the width multiplier and resolution multiplier parameters affect the trade-off between model
size, speed and accuracy.

Acknowledgements

Many thanks to my mummy for the numerous packed lunches; and of course to Igor,

my faithful lab assistant.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is

my own except where explicitly stated otherwise in the text, and that this work has not

been submitted for any other degree or professional qualification except as specified.

(Chiang Chi-An)

Table of Contents

1 Introduction
  1.1 Motivation
  1.2 Objective
  1.3 Achieved results
  1.4 Dissertation outline

2 Background
  2.1 Relevant work
    2.1.1 Deep Learning
    2.1.2 Approximate Computing
  2.2 TensorFlow
    2.2.1 Introduction
    2.2.2 Advantage
    2.2.3 Architecture
    2.2.4 Performance

3 Methods
  3.1 Network Architecture
    3.1.1 Activation Function
    3.1.2 Fully Connected Layer
    3.1.3 Convolutional Layer
    3.1.4 Pooling layer
  3.2 Loss function
    3.2.1 Cross Entropy loss
    3.2.2 Hinge Loss
    3.2.3 Loss Functions Comparison
  3.3 Optimization
    3.3.1 Mini-batch gradient descent
    3.3.2 Learning Rate Decay
    3.3.3 Mini-batch gradient descent extensions
    3.3.4 Forward Propagation and Backpropagation
  3.4 Regularization
    3.4.1 L2 regularization
    3.4.2 L1 regularization
    3.4.3 Dropout Layer
    3.4.4 Batch Normalization
  3.5 Depthwise Separable Convolution
  3.6 Transfer Learning

4 Results and Evaluation
  4.1 Resource and tools
    4.1.1 Checkpoint File
    4.1.2 Model File
  4.2 Dataset
    4.2.1 CIFAR 100
  4.3 Experimental Setup
    4.3.1 Training set and test set
    4.3.2 Preprocessing
    4.3.3 Mobilenet
    4.3.4 Inception V3
    4.3.5 ResNet
  4.4 Metrics
    4.4.1 Top-1 Accuracy
    4.4.2 Top-5 Accuracy
    4.4.3 Inference Time
    4.4.4 Model File Size
  4.5 Results
  4.6 Analysis

5 Conclusion and Discussion
  5.1 Remarks and observations
  5.2 Limitation and Further work
    5.2.1 More approximate computing techniques
    5.2.2 More extensive Experiment
    5.2.3 Application into Practice
    5.2.4 Model Architecture Improvement

Bibliography

Chapter 1

Introduction

1.1 Motivation

In recent years, machine learning techniques, especially deep learning, which uses multiple
layers of artificial neural networks, have achieved remarkable breakthroughs in many fields.
From image classification to the Go-playing AI AlphaGo [1], deep learning has delivered the
best performance.

At the same time, more and more people use smartphones. AI techniques such as deep learning
will undoubtedly make smartphones even smarter. Functions such as face recognition, audio
recognition and image classification will be added to many mobile apps.

The training of a deep learning model can be done offline on server clusters. For inference,
we can send the data over the network to a server, which makes the prediction and replies
with the result. In some cases, however, the data is sensitive and the client may not wish
to send it to servers; one example is a bank card number recognition application. Even
without security concerns, network traffic can be slow and expensive, and building reliable
servers increases the operating cost.

If we can do prediction on the smartphone itself, there is no data security concern, no
network delay or cost, and no need to maintain a potentially expensive server cluster. But
this approach also has drawbacks: the model must be stored in the smartphone's limited
storage, and inference on the device can be slow and drain the battery.

A deep neural network typically has many layers with a large number of parameters. It needs
large storage and a large number of arithmetic operations. For example, VGG [2], a
traditional image classification model, has about 100 million parameters, needs more than
1GB to store the model and takes more than 10,000 million Mult-Add operations. It is
therefore not a good fit for a mobile phone.

To use deep learning models on a mobile phone, we must significantly decrease the model size
and the number of operations, so that the model file is reasonably small and computation is
fast and uses less power. At the same time, we do not want accuracy to degrade too much. We
need to find a suitable trade-off between them.

1.2 Objective

MobileNet [3] is a new deep neural network model proposed by Google that is specially
designed for mobile and embedded devices using approximate computing techniques. Although
the experiments in its paper show that it performs strongly compared to other popular models
on ImageNet [4] classification, a useful model should also perform well on new datasets
using transfer learning.

In this project, I compare MobileNet with other popular models in accuracy, model size and
inference time on a mobile device, to investigate whether the approximate computing used in
MobileNet achieves a better trade-off between accuracy and efficiency and is thus suitable
for mobile devices. I also investigate how MobileNet's two parameters, the width multiplier
and the resolution multiplier, affect accuracy, model size and inference time.

1.3 Achieved results

I successfully trained MobileNets with different width multipliers and resolution
multipliers on CIFAR-100 using transfer learning from a model pre-trained on ImageNet.
GoogLeNet Inception V3 [5] and ResNet [6] models were also trained on CIFAR-100 using
transfer learning. Top-1 and top-5 accuracy on the test set were computed for each model,
the size of the model files to be deployed in a mobile app was recorded, and the inference
time of each model on an Android device was measured. The comparison shows that MobileNet
with width multiplier 1 and resolution multiplier 1 is more than 17× faster and has a model
file more than 6× smaller than both GoogLeNet Inception V3 and ResNet. It loses 18.3% in
top-1 accuracy and 8.5% in top-5 accuracy compared with GoogLeNet Inception V3, and almost
nothing in either top-1 or top-5 accuracy compared with ResNet. The results also show that
as the width multiplier decreases, the model becomes smaller and inference faster, at the
cost of more accuracy loss. The resolution multiplier has a similar effect, except that it
does not affect model size.

1.4 Dissertation outline

Chapter 2 introduces various approximate computing techniques for deep learning, which can
be divided into three general categories: low-rank approximation (to which the techniques
used in this project belong), network pruning and quantization. Chapter 2 also introduces
TensorFlow [7], the deep learning framework used in this project.

Chapter 3 elaborates both the theory and the implementation of the deep learning models in
detail. This includes the loss function, optimization algorithm, regularization methods, the
various kinds of layers used, transfer learning and the particular approximate computing
technique used in this project: approximating a traditional convolutional layer with a
depthwise separable convolution layer.

Chapter 4 describes experiment results and analysis.

Chapter 5 gives the project conclusion and discussion.

Chapter 2

Background

2.1 Relevant work

2.1.1 Deep Learning

Deep learning techniques have achieved state-of-the-art results in many areas of machine
learning. The success of deep convolutional neural networks (CNNs) in image classification
is especially remarkable. CNNs have the best results on all the standard image datasets such
as MNIST [8], CIFAR-10 [9], CIFAR-100 [9] and ImageNet [4]. Many different CNN models have
been developed, such as ResNet, VGG and Inception. Because convolutional layers make better
use of the spatial information in an image, these models typically consist of a sequence of
many convolutional layers.

2.1.2 Approximate Computing

Until recently, deep learning researchers focused primarily on improving model accuracy.
However, the use of multiple convolutional layers also results in a large number of
parameters, which requires a large amount of memory for model storage and increases the
computational cost.

With the widespread use of mobile devices and the application of deep learning in mobile
apps, more and more researchers are aware that accuracy alone is not enough for a good
mobile user experience. The model must also be efficient: less memory, quicker inference and
less energy consumption. Mobile consumers do not want a single app to take too much of their
limited memory, and they want the app to respond instantly.

Researchers resort to approximate computing techniques to make a better trade-off between
accuracy and efficiency. The goal is to make the model smaller and inference faster, so that
it is suitable for mobile devices, while keeping as much accuracy as possible.

[10] shows that significant redundancy often exists in deep learning models. Through
approximate computing, we can remove the redundancy to save both memory and computation
cost. Approximate computing for deep learning can be divided into roughly three general
approaches: pruning, quantization and low-rank approximation.

2.1.2.1 Low Rank Approximation Of Filters

This approach decomposes the filters in convolutional layers into a series of smaller
separable filters, which form a low-rank approximation of the original filters and reduce
time complexity. The optimal decomposition can be found by minimizing the reconstruction
error of the filters or of the layer output. Since convolutional layers are the most
time-consuming parts of CNNs, this low-rank decomposition yields a significant speedup.

[11] uses singular value decomposition (SVD) to make convolutional layers 1.6 times faster
while sacrificing 1% accuracy. [12] exploits cross-channel or filter redundancy to construct
a low-rank basis of filters that are rank-1 in the spatial domain. It achieves a speedup of
2.5× with no loss of accuracy, and of 4.5× with less than 1% accuracy decrease, for a text
character recognition network. [11] and [12] can only decompose linear filters in a single
layer. [13] develops this method further to take into account nonlinearities such as
Rectified Linear Units (ReLU), which makes the approximation more accurate. It also
introduces new algorithms that optimize the whole network to reduce the errors accumulated
when approximating multiple convolutional layers. It achieves a 4× speedup on a large model
pre-trained on ImageNet with only a 0.9% increase in top-5 error rate.

Instead of finding low-rank approximations of the convolutional layers of pre-trained
networks, some papers replace traditional convolutional layers with layers that have a
similar function but a smaller computation cost. Flattened networks [14] replace the 3D
filters in conventional convolutional networks with consecutive sequences of 1D filters in
all three dimensions, which reduces the number of parameters significantly and makes the
feedforward computation about 2 times faster. Factorized networks [15] factor the
convolution operation by unravelling the convolutional layer into a sequence of combinations
of single-channel convolutions and linear channel projections. They achieve similar accuracy
with much less computation than traditional deep convolutional neural network models.
MobileNets [3] use an approach similar to flattened networks [14] and factorized networks
[15]. The model is based on depthwise separable convolutions, which separate a traditional
convolution into a depthwise convolution that applies a single filter to each input channel
and a pointwise convolution that combines the results linearly. The MobileNet model is
smaller than models such as GoogLeNet [5] and VGG-16 [2] and has comparable accuracy. It
provides two hyperparameters, the width multiplier and the resolution multiplier, to adjust
the trade-off between latency and accuracy.
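To make the saving concrete, the following Python sketch counts parameters and multiply-adds for one standard convolutional layer and for its depthwise separable replacement. It is only an illustration under stated assumptions: stride 1, padding that preserves the spatial size, no bias terms, and a hypothetical layer with a 3×3 filter, 64 input channels, 128 output channels and a 56×56 feature map. The function names are my own and are not part of any library.

```python
def standard_conv_cost(k, c_in, c_out, h, w):
    """Parameters and multiply-adds of a standard k x k convolution."""
    params = k * k * c_in * c_out
    mult_adds = params * h * w  # every filter is applied at every output position
    return params, mult_adds


def depthwise_separable_cost(k, c_in, c_out, h, w):
    """Same quantities for a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise_params = k * k * c_in   # one k x k filter per input channel
    pointwise_params = c_in * c_out   # 1 x 1 filters that mix the channels
    params = depthwise_params + pointwise_params
    mult_adds = params * h * w
    return params, mult_adds


# Hypothetical layer sizes, chosen only for illustration.
std_params, std_ops = standard_conv_cost(3, 64, 128, 56, 56)
sep_params, sep_ops = depthwise_separable_cost(3, 64, 128, 56, 56)
reduction = std_params / sep_params
print(std_params, sep_params, round(reduction, 2))
```

Under these assumptions the separable layer needs about 8.4 times fewer parameters and multiply-adds. In general the reduction factor is 1 / (1/N + 1/k^2) for N output channels and a k×k filter.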

2.1.2.2 Network Pruning

This approach tries to remove the parts of a model that are not important, in order to
reduce the number of parameters and the amount of computation.

[16] first learns the importance of network connections and removes the unimportant ones,
then retrains the network to learn the weights of the remaining connections. Its experiments
show that this method can reduce the number of parameters of the VGG-16 model by 13× and of
the AlexNet [17] model by 9× with no loss of accuracy.

[18] and [19] prune whole filters instead of individual weights, which can yield more
speedup in the convolutional layers. [18] reports that inference time decreases by 34% for
VGG-16 and 38% for ResNet-110 on CIFAR-10 with almost no loss of accuracy. [19] reports a
3.31× FLOPs reduction and 16.63× compression on VGG-16 with a 0.52% top-5 accuracy drop.

The pruning algorithm of [20] aims specifically at reducing the energy consumption of CNNs
rather than computation and memory cost. It reports that energy consumption decreases by
3.7× for AlexNet and by 1.6× for GoogLeNet, both with less than 1% drop in top-5 accuracy.
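The basic idea of connection pruning can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm of any of the cited papers: it assumes that the absolute value of a weight is a reasonable proxy for its importance, it works on a flat list of weights, and it omits the retraining step. The function name `prune_by_magnitude` is hypothetical.

```python
def prune_by_magnitude(weights, sparsity):
    """Set the given fraction of weights with the smallest absolute value to zero."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from least to most important (assumed: importance = |weight|).
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
pruned = prune_by_magnitude(weights, 0.5)
print(pruned)
```

The zeroed connections no longer need to be stored or multiplied, provided a sparse representation is used; in practice the remaining weights would then be retrained.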

2.1.2.3 Network Quantization

Network quantization quantizes the parameters of a neural network model and encodes them
with fewer bits to reduce the storage required by the model. For example, using 8 bits
instead of 32 bits requires only about 25% of the storage previously needed. Another benefit
of quantization is that inference becomes faster and uses less power, because fewer bits
save memory bandwidth and RAM access time and allow SIMD instructions to perform more
operations in one cycle.

During the training phase, the parameters of a neural network are adjusted a little in each
step by backpropagation and gradient descent, which requires a high-precision number format
such as 32-bit floating point. So instead of training a quantized model from scratch, we
usually quantize a pre-trained model.

Quantization typically does not noticeably decrease the inference accuracy of deep
networks, because deep networks are often very robust and good at ignoring noise, including
the precision errors introduced by quantization.

One simple way to quantize is to store the minimum and maximum of the set of floating-point
values and then represent each value by an integer code spaced linearly between them. For
example, if we use 8 bits to represent floating-point numbers in the range [-20.0, 50.0],
then 0 represents -20.0, 255 represents 50.0, 128 represents about 15.1, and so on.
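The min/max scheme can be written down directly. The following Python sketch assumes a linear mapping between the stored minimum and maximum, rounding to the nearest code and clipping of out-of-range values; the function names are hypothetical and the code is not taken from any particular framework.

```python
def quantize(value, lo, hi, bits=8):
    """Map a float in [lo, hi] to an integer code in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    clipped = min(max(value, lo), hi)  # values outside the range are clipped
    return round((clipped - lo) / (hi - lo) * levels)


def dequantize(code, lo, hi, bits=8):
    """Map an integer code back to the float it represents."""
    levels = (1 << bits) - 1
    return lo + code / levels * (hi - lo)


lo, hi = -20.0, 50.0
code = quantize(3.3, lo, hi)
restored = dequantize(code, lo, hi)
print(code, restored, dequantize(128, lo, hi))
```

With 8 bits over this range, neighbouring codes are 70/255 ≈ 0.27 apart, so the round-trip error of any in-range value is at most half of that step.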

[21] uses the k-means clustering algorithm and product quantization to quantize the network
parameters layer by layer. It achieves 16-24 times compression of a state-of-the-art CNN on
ImageNet with 1% loss of accuracy.

[22] uses Hessian-weighted k-means clustering and fixed-length binary encoding to do the
quantization. Hessian weighting accounts for the impact of quantization errors across layers
as well as within a layer, and thus the whole network can be quantized at once. The paper
also employs Huffman coding to compress the network further. It reports that the quantized
models are 1.95%, 4.51% and 2.46% of the original model sizes for LeNet, ResNet and AlexNet
respectively, at no or marginal performance loss.

Network quantization can also be combined with other approximate computing techniques. Deep
compression [23] combines network pruning and quantization. It first prunes the model's
connections, keeping only the most important ones, to reduce the number of parameters by
9-13 times. Then it quantizes the weights so that only 5 bits are needed to represent a
weight instead of 32. Finally it uses Huffman coding to shrink the model further. This
method compresses the AlexNet model by 35 times, from 240MB to 6.9MB, increases speed by 3-4
times and uses 3-7 times less energy.

2.2 TensorFlow

2.2.1 Introduction

TensorFlow is the second-generation machine learning system published by Google. It is the
successor to Google's earlier DistBelief system. Its computation is based on a dataflow
graph, in which nodes are mathematical operations and multidimensional data arrays (tensors)
flow along the edges.

It is open source and can be used on a single machine or on multi-server clusters. It can
run on CPUs, GPUs and even specialized computation devices such as the TPUs (Tensor
Processing Units) used at Google. It enables researchers to implement various deep learning
algorithms easily and has attracted much attention from the research community.

The main components of TensorFlow are the client, the master and the worker processes. The
client sends requests to the master, and the master schedules the worker processes to do
computation on the available devices. TensorFlow can be used both on a single machine and on
distributed clusters, where the client, master and worker processes run on different
machines.

2.2.2 Advantage

One of TensorFlow's many useful features is that it can differentiate symbolic expressions
and derive the backpropagation automatically for neural network training, which greatly
reduces the work of the programmer and the chance of making mistakes.

TensorFlow is designed around the dataflow graph model. It provides Python and C++
interfaces for constructing the graph, which makes it very easy to experiment with
architectures, algorithms and parameters.

After the user constructs the dataflow graph, the TensorFlow system optimizes the graph and
then executes the operations on the machines. Because the graph is constructed first and
executed afterwards, TensorFlow knows the whole computation before executing it and can
therefore optimize as much as possible.

All computations are encoded as nodes in the dataflow graph, and the data dependencies
between operations are encoded explicitly in the graph, so TensorFlow can partition the
graph according to the dependencies and run the subgraph computations in parallel on
different devices.

TensorFlow allows the user to specify the subgraph that needs to be computed. The user can
feed tensors to zero or more of the input placeholders. The TensorFlow system runs only the
computation that is necessary and prunes the irrelevant parts of the graph away.
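The construct-first, execute-later model and the pruning of unneeded nodes can be illustrated with a toy dataflow graph in plain Python. This sketch deliberately does not use the TensorFlow API: the `Node` class and the `run` function are hypothetical stand-ins that only mimic the idea of fetching one node while feeding values to placeholders.

```python
class Node:
    """A graph vertex: an operation plus the nodes whose outputs it consumes."""

    def __init__(self, name, op=None, inputs=()):
        self.name = name
        self.op = op              # None marks a placeholder that must be fed
        self.inputs = tuple(inputs)


def run(fetch, feed):
    """Evaluate `fetch`, executing only the operations it depends on.

    Returns the fetched value and the names of the operations that ran.
    """
    cache = {}
    executed = []

    def evaluate(node):
        if node.name not in cache:
            if node.op is None:
                cache[node.name] = feed[node.name]  # value supplied by the user
            else:
                args = [evaluate(dep) for dep in node.inputs]
                cache[node.name] = node.op(*args)
                executed.append(node.name)
        return cache[node.name]

    return evaluate(fetch), executed


# Build the graph first; nothing is computed at this point.
x = Node("x")
y = Node("y")
doubled = Node("doubled", lambda a: 2 * a, [x])
total = Node("total", lambda a, b: a + b, [doubled, y])

# Fetching `doubled` needs only x, so y is not fed and `total` never runs.
value, ran = run(doubled, {"x": 3})
print(value, ran)
```

Fetching `total` instead would require both placeholders to be fed and would execute both operations.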

TensorFlow's dataflow graph model makes it easy not only to run computations concurrently
but also to distribute them across multiple devices.

In TensorFlow, the data flowing through the graph are called tensors. A tensor is a
multidimensional array of a primitive type such as int32. Tensors are the inputs and outputs
of the operations, which are represented by the vertices of the graph. Every operation has a
type and zero or more attributes. An operation that contains mutable state is called a
stateful operation; Variable is one such operation. Another special kind of operation is the
queue operation.

Users can use TensorFlow's checkpointing to periodically save models to a file during
training and reload them later. This facility not only improves fault tolerance but can also
be used for transfer learning.

2.2.3 Architecture

TensorFlow adopts a layered architecture. At the top level are the training and inference
libraries. The next level is the Python and C++ APIs, which are built on the C API. Below
the C API level are the distributed master and the dataflow executor.

The distributed master accepts a dataflow graph as input, prunes the unnecessary parts of
the graph and divides it into subgraphs to distribute the computation to different devices.
It also performs many optimizations such as constant folding and subexpression elimination.

The dataflow executor's task is to execute the subgraph computations assigned to it by the
distributed master.

The next level is the kernel implementations, which comprise more than 200 operations,
including frequently used ones such as Const, Var, MatMul, Conv2D and ReLU.

Apart from the above core components, the TensorFlow system also includes several useful
tools, such as a dashboard that visualizes the dataflow graph and training progress and a
profiler that shows the running time of different tasks on different devices.

2.2.4 Performance

In Chintala's benchmark of convolutional models, TensorFlow has a shorter training step time
than Caffe and a similar one to Torch. Experiments have shown that TensorFlow scales well on
problems such as image classification and language modeling.

Chapter 3

Methods

3.1 Network Architecture

A neural network consists of layers of neurons. The first layer is called the input layer
and the last layer is called the output layer. The layers between them are called hidden
layers. Figure 3.1 shows a simple neural network with one hidden layer.

Figure 3.1: A simple neural network with one hidden layer.

3.1.1 Activation Function

Each neuron is a computing unit that applies a linear transformation to its inputs, followed
by an activation function (Figure 3.2).

Figure 3.2: Computation in a single neuron.

The computation can be written as $f(w^T x + b)$, where $w$ is the weight vector, $b$ is the
bias and $f$ is the activation function.

We typically use a non-linear function as the activation function, because a linear
activation function could simply be absorbed into the preceding linear transformation. There
are many different activation functions. The most commonly used are sigmoid, tanh and the
rectified linear unit (ReLU). In deep neural networks, ReLU has been found to give better
results than sigmoid and tanh.

• Sigmoid
\[ f(x) = \frac{1}{1 + e^{-x}} \tag{3.1} \]

Figure 3.3: sigmoid plot

• Tanh
\[ f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{3.2} \]

Figure 3.4: Tanh plot

• ReLU
\[ f(x) = \max(0, x) \tag{3.3} \]

Figure 3.5: ReLU plot
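The three activation functions and the computation of a single neuron can be sketched in plain Python as follows. The helper name `neuron` and the example weights are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(x):
    """Equation 3.1: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Equation 3.2: squashes any real number into (-1, 1)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def relu(x):
    """Equation 3.3: passes positive values through and clips negatives to 0."""
    return max(0.0, x)


def neuron(weights, inputs, bias, activation):
    """Compute f(w^T x + b) for one neuron."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Example values: w^T x + b = 1*3 + (-2)*1 + 0.5 = 1.5
out = neuron([1.0, -2.0], [3.0, 1.0], 0.5, relu)
print(out, sigmoid(0.0), tanh(1.0))
```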

3.1.2 Fully Connected Layer

In a fully connected layer, every neuron is connected to every neuron in the previous layer.
If the two layers have M and N neurons respectively, then there are M × N connections
between them, each with its own weight parameter. This is the traditional layer type used in
regular neural networks. An example is given in Figure 3.6.

Figure 3.6: An example of fully connected layer

3.1.3 Convolutional Layer

For images and other high-dimensional data, a convolutional layer is often preferable to a
fully connected layer. A fully connected layer creates too many connections and thus has
many more parameters, which makes it slow to train and prone to overfitting. For example, if
the input image is 30x30x3, each neuron in the first fully connected hidden layer connects
to 30x30x3=2700 neurons in the input layer. For such a small image this may not be a
problem, but for a larger image such as 300x300x3 there are 270000 connections for a single
neuron, which is difficult to handle. Another problem is that high-dimensional data such as
images often have inherent spatial structure, but for a fully connected layer the input is
just a vector of pixel values: the relative position of the pixels has no effect, so the
spatial structure information is lost.

The convolutional layer was invented to address these problems. To suit image data, the
neurons in a convolutional layer are laid out in 3 dimensions instead of the 1 dimension of
a fully connected layer. The 3 dimensions are called width, height and depth. Each neuron in
the convolutional layer connects only to a small region of neurons in the previous layer.
The region is small in width and height but spans the full depth. The width and height of
the region are called the receptive field or filter size, so the receptive field controls
how large the connected region is. In this way, we reduce the number of connections
dramatically. For example, if the receptive field is 3×3 and the input volume is 300x300x3,
then one neuron connects to 3x3x3=27 neurons of the previous layer instead of 270000 in a
fully connected layer. Apart from reducing the number of connections, this also helps the
layer learn local features of the image.

To reduce the number of parameters further, the convolutional layer lets all neurons in the
same depth slice share the same weights, which are called a filter. The filter therefore
uses the same weights to extract features at every position in the image, which makes the
feature extraction translation invariant.

During forward propagation, we slide a window of the size defined by the receptive field
over the whole input volume and compute the dot product of the filter weights and the values
in the window to get a single number in the output volume. The dot products at all positions
constitute the activation map of that filter, and the activation maps of all filters are
stacked along the depth dimension to form the output volume.
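The sliding-window computation can be sketched in plain Python on nested lists. This is a deliberately naive illustration, assuming stride 1 and no padding; it is far slower than a real implementation, and the function names are hypothetical. The input volume is indexed as `volume[row][column][channel]` and each filter as `kernel[row][column][channel]`.

```python
def conv2d_single_filter(volume, kernel, bias=0.0):
    """Slide one k x k x depth filter over an H x W x depth volume."""
    height, width, depth = len(volume), len(volume[0]), len(volume[0][0])
    k = len(kernel)
    activation_map = []
    for i in range(height - k + 1):
        row = []
        for j in range(width - k + 1):
            # Dot product of the filter and the window at (i, j), across the full depth.
            total = bias
            for di in range(k):
                for dj in range(k):
                    for c in range(depth):
                        total += kernel[di][dj][c] * volume[i + di][j + dj][c]
            row.append(total)
        activation_map.append(row)
    return activation_map


def conv2d(volume, kernels):
    """Stack one activation map per filter along the depth axis."""
    maps = [conv2d_single_filter(volume, kernel) for kernel in kernels]
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[[m[i][j] for m in maps] for j in range(cols)] for i in range(rows)]


# A 3 x 3 x 1 input holding 1..9 and two 2 x 2 x 1 filters (all ones, all twos).
volume = [[[1], [2], [3]], [[4], [5], [6]], [[7], [8], [9]]]
ones_kernel = [[[1], [1]], [[1], [1]]]
twos_kernel = [[[2], [2]], [[2], [2]]]
single_map = conv2d_single_filter(volume, ones_kernel)
output = conv2d(volume, [ones_kernel, twos_kernel])
print(single_map)
```

Note that the same filter weights are reused at every window position, which is the weight sharing described above.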

In summary, by arranging the neurons of a layer in 3D space, constraining the connections to
a local area and sharing weights, a convolutional layer makes better use of spatial
information with far fewer parameters. Local connections and weight sharing are illustrated
in Figure 3.7. An example of a 3D convolutional layer is given in Figure 3.8.

Figure 3.7: An example of a 1D convolutional layer. For illustration purposes, the graph
shows the connections of a 1-dimensional convolutional layer instead of the usual
3-dimensional convolutional layer used for image data. The filter size is 1. The connections
with the same color share the same weight parameters.

Figure 3.8: An example of a 3D convolutional layer. The input size is 32×32×3. There are 5
filters. The connections are local in the width and height dimensions but span the full
depth.

3.1.3.1 Convert Fully connected layer to Convolutional Layer

A fully connected layer can be converted to a convolutional layer. For example, if the fully
connected layer accepts a 5×5×128 input volume and outputs a 1×1×10 volume, then a
convolutional layer with 10 filters of size 5×5 has the same effect. Replacing the fully
connected layer with a convolutional layer has an advantage when the input image is larger
than the training images: we can run inference on multiple areas of the input image in a
single forward pass, instead of multiple forward passes, to get multiple class score
vectors. The final prediction can then be made from their average, which can improve
prediction accuracy.

3.1.4 Pooling layer

The pooling layer can be used to reduce the spatial size of the representation and the
number of parameters. It works by sliding a small window over the input volume and applying
a function to the values in the window to compute a single number. The computation is
carried out for each depth slice of the input independently. The most often used function is
the max function. Other functions such as the average (Figure 3.9) and the L2-norm are also
used. By reducing multiple values in a local region to a single number, the pooling layer
extracts more abstract features, helps the model generalize and reduces overfitting. The
pooling layer introduces no additional parameters, and it reduces the width and height by a
factor of 2 or more while leaving the depth unchanged, so the number of parameters in later
layers is reduced. The most often used filter size is 2×2 with a stride of 2, which results
in an output volume 1/4 the size of the input volume. Larger filter sizes are rarely used,
because they discard too much information and often result in bad performance.

Figure 3.9: An example of average pooling operation for a single depth slice with a 2×2

filter and stride of 2.

3.2 Loss function

Suppose we have $n$ classes and, for a sample $x_i$, we have computed a score vector $f$ of $n$ elements, where $f_j$ is the score of $x_i$ for class $j$. A larger score indicates that $x_i$ is more likely to belong to that class. The loss function takes the score vector as input and outputs a single number that indicates how well the scores match the true class label. Intuitively, the higher the score of the true class is relative to the others, the smaller the loss should be.

3.2.1 Cross Entropy loss

We can use the softmax function to convert the class score vector into a class probability vector in which each value lies in the range [0, 1] and the values sum to 1.

The probability that data sample $x_i$ belongs to class $k$, given the class score vector $f$, is:

$$P(y = k \mid x_i) = \frac{e^{f_k}}{\sum_j e^{f_j}} \tag{3.4}$$

That is, each score is exponentiated and then divided by the sum of the exponentials, which normalizes the values to the range 0 to 1. We want the loss to be small when the predicted probability of the correct class is large. We can take the negative log of $P(y_i \mid x_i)$, where $y_i$ is the correct class for $x_i$, to get the loss. The loss for sample $x_i$ is:

$$L_i = -\log P(y_i \mid x_i) = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right) = -f_{y_i} + \log \sum_j e^{f_j} \tag{3.5}$$
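A short NumPy sketch of the softmax and cross-entropy computation follows. Subtracting the maximum score before exponentiating is a common numerical-stability trick that is assumed here rather than taken from the text; it leaves the result unchanged because the shift cancels in the ratio.

```python
import numpy as np

def softmax(f):
    """Convert a score vector into probabilities that sum to 1."""
    shifted = f - np.max(f)        # shift for numerical stability
    e = np.exp(shifted)
    return e / e.sum()

def cross_entropy(f, y):
    """Loss -f_y + log(sum_j exp(f_j)) for true class index y."""
    shifted = f - np.max(f)
    return -shifted[y] + np.log(np.sum(np.exp(shifted)))

f = np.array([2.0, 1.0, 0.1])      # example class scores
p = softmax(f)
loss = cross_entropy(f, 0)         # true class is class 0
```

The two forms of the loss agree: `loss` equals `-np.log(p[0])`.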


3.2.2 Hinge Loss

Another commonly used loss function is the hinge loss. The loss for sample $(x_i, y_i)$ given the class score vector $f$ is:

$$L_i = \sum_{j \neq y_i} \max(0, f_j - f_{y_i} + 1) \tag{3.6}$$

Intuitively, this loss function wants the score of the true class to exceed every other score by at least 1. Each class that violates this margin adds to the loss.
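The hinge loss can be sketched in a few lines of NumPy. The function name and the example scores are illustrative assumptions.

```python
import numpy as np

def hinge_loss(f, y, margin=1.0):
    """Sum of margin violations over all classes except the true class y."""
    margins = np.maximum(0.0, f - f[y] + margin)
    margins[y] = 0.0               # the true class is excluded from the sum
    return margins.sum()

# Class 2 is within the margin of the true class 0, class 1 is not.
violated = hinge_loss(np.array([3.0, 1.0, 2.5]), 0)
# All other scores are at least 1 below the true class score.
satisfied = hinge_loss(np.array([5.0, 1.0, 2.0]), 0)
```

In the first call only class 2 contributes, with max(0, 2.5 - 3.0 + 1) = 0.5. In the second call the loss is 0.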

3.2.3 Loss Functions Comparison

Unlike the hinge loss, the cross-entropy loss provides a probability for each class, which is easier for humans to interpret than a raw class score. Another difference is that once the margins between the true class score and the other class scores are large enough, the hinge loss becomes 0 and cannot decrease further, whereas the cross-entropy loss can always decrease. In practice, the hinge loss and the cross-entropy loss often perform similarly.

3.3 Optimization

3.3.1 Mini-batch gradient descent

Training uses an optimization algorithm to update the parameters so that the loss is minimized. The most commonly used optimization algorithm for neural networks is gradient descent:

$$\theta_{n+1} = \theta_n - \eta \nabla L(\theta_n) \tag{3.7}$$

Here $\theta$ is the parameter vector, $L(\theta)$ is the loss, $\nabla L(\theta)$ is its gradient and $\eta$ is the learning rate. Gradient descent is an iterative algorithm that, at each iteration, moves the parameters in the direction of the negative gradient, with the step size controlled by the learning rate.

When the training data set is huge (ImageNet, for example, has over 10 million images), computing the gradient over the entire data set is costly. In this situation we use mini-batch gradient descent. At each step, this method takes a small subset of samples (a mini-batch) from the data set and uses it, instead of the whole data set as in normal gradient descent, to compute the gradient and update the parameters. Because the samples in the training data set are correlated, the gradient of the loss over the mini-batch is often a good approximation of the gradient over the whole training data set. Since each parameter update is much cheaper in mini-batch gradient descent than in normal gradient descent, many more updates can be performed in the same time, and the loss can therefore converge much more quickly.
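The procedure can be sketched on a toy problem. The example below is an assumption-laden illustration, not the training code of this project: it fits a noise-free linear model with a squared-error loss, and the function name, learning rate and batch size are chosen only for demonstration.

```python
import numpy as np

def minibatch_gd(X, y, lr=0.1, batch_size=32, epochs=20, seed=0):
    """Mini-batch gradient descent on the loss 0.5 * mean((X @ theta - y)^2)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(n)               # reshuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ theta - y[idx]
            grad = X[idx].T @ residual / len(idx)  # gradient on the mini-batch only
            theta = theta - lr * grad
    return theta

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 3))
true_theta = np.array([1.5, -2.0, 0.5])
y = X @ true_theta                               # noise-free targets
theta = minibatch_gd(X, y)
```

Each update touches only 32 of the 1000 samples, yet the parameters still approach `true_theta`, because every mini-batch gradient points in roughly the same direction as the full gradient.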

The learning rate in mini-batch gradient descent is very important. When the learning rate is very small, each step is unlikely to overshoot, but convergence may be too slow. We can increase the learning rate to speed up learning, but this may lead to overstepping, which makes the loss increase. Setting a suitable learning rate is difficult: different data sets or network architectures may require different learning rates, and we may need different learning rates for different parameters and in different training phases. Learning rate decay and extensions of mini-batch gradient descent can be used to address this problem.

3.3.2 Learning Rate Decay

At the start of training, we may want a relatively large learning rate so that the loss decreases quickly. In later stages, as the improvement per step gets smaller, we may want to decay the learning rate to avoid overstepping and to fine-tune the parameters. We can decay the learning rate according to a rule, for example multiplying it by 0.9 after every epoch. Alternatively, we can set the decay manually: for example, when the training loss stops decreasing, we can try halving the learning rate.

Let $\eta_0$ be the initial learning rate, $k$ the decay rate and $t$ the number of training steps. Three commonly used rules can be expressed as follows.

3.3.2.1 Natural Exponential decay

$$\eta = \eta_0 e^{-kt} \tag{3.8}$$

3.3.2.2 Exponential decay

$$\eta = \eta_0 k^t \tag{3.9}$$

3.3.2.3 Inverse Time Decay

$$\eta = \frac{\eta_0}{1 + kt} \tag{3.10}$$
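The three decay rules translate directly into code. This is a plain Python sketch; the function names are chosen for this example.

```python
import math

def natural_exp_decay(eta0, k, t):
    """Natural exponential decay: eta0 * exp(-k * t)."""
    return eta0 * math.exp(-k * t)

def exponential_decay(eta0, k, t):
    """Exponential decay: eta0 * k ** t, with 0 < k < 1."""
    return eta0 * k ** t

def inverse_time_decay(eta0, k, t):
    """Inverse time decay: eta0 / (1 + k * t)."""
    return eta0 / (1.0 + k * t)

# Learning rate after two decay steps with an initial rate of 0.1.
rates = {
    "natural": natural_exp_decay(0.1, 0.5, 2),
    "exponential": exponential_decay(0.1, 0.9, 2),
    "inverse": inverse_time_decay(0.1, 1.0, 2),
}
```

All three rules start at the initial rate when t = 0 and decrease monotonically, but at different speeds.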


3.3.3 Mini-batch gradient descent extensions

Many extensions have been proposed to improve on basic mini-batch gradient descent. Algorithms such as Adagrad and RMSProp set the learning rate adaptively during training. Algorithms such as Momentum and Nesterov Momentum adjust the direction of the parameter update to reduce oscillations.

3.3.3.1 Adagrad

The Adagrad algorithm adapts the learning rate for each parameter automatically. With $d\theta$ denoting the gradient of the loss with respect to $\theta$:

$$C = C + (d\theta)^2 \tag{3.11}$$

$$\theta = \theta - \frac{\eta}{\sqrt{C} + \epsilon}\, d\theta \tag{3.12}$$

$\epsilon$ avoids division by zero and is set to a very small value such as $10^{-6}$. The operations above are element-wise, so each parameter has its own effective learning rate. Adagrad keeps track of the sum of squared gradients and uses it to adjust the learning rate.

3.3.3.2 RMSProp

One problem with Adagrad is that the effective learning rate $\frac{\eta}{\sqrt{C} + \epsilon}$ never increases and keeps shrinking as squared gradients accumulate. Once it is close to 0, the algorithm stops learning.

Another algorithm, RMSProp, tries to solve this problem.

$$C = \gamma C + (1 - \gamma)(d\theta)^2 \tag{3.13}$$

$$\theta = \theta - \frac{\eta}{\sqrt{C} + \epsilon}\, d\theta \tag{3.14}$$

$\gamma$ is the decay rate. RMSProp makes one simple change: $C$ becomes a moving average of the squared gradients instead of the accumulated sum used in Adagrad. The effective learning rate is therefore no longer always decreasing.
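The difference between the two update rules shows up clearly under a constant gradient. The sketch below is illustrative only. It assumes the small constant is added outside the square root, and the helper `effective_rates` is a hypothetical name introduced for this example.

```python
import numpy as np

def adagrad_step(theta, grad, cache, lr=0.01, eps=1e-6):
    """Adagrad: accumulate the sum of squared gradients."""
    cache = cache + grad ** 2
    theta = theta - lr * grad / (np.sqrt(cache) + eps)
    return theta, cache

def rmsprop_step(theta, grad, cache, lr=0.01, decay=0.9, eps=1e-6):
    """RMSProp: keep a moving average of squared gradients instead."""
    cache = decay * cache + (1.0 - decay) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(cache) + eps)
    return theta, cache

def effective_rates(step_fn, steps=100, lr=0.01, eps=1e-6):
    """Track lr / (sqrt(C) + eps) while feeding a constant gradient of 1."""
    theta, cache = np.zeros(1), np.zeros(1)
    rates = []
    for _ in range(steps):
        theta, cache = step_fn(theta, np.ones(1), cache, lr=lr, eps=eps)
        rates.append(lr / (np.sqrt(cache[0]) + eps))
    return rates

adagrad_rates = effective_rates(adagrad_step)
rmsprop_rates = effective_rates(rmsprop_step)
```

After 100 steps the Adagrad rate has fallen to about one tenth of the base learning rate and keeps falling, while the RMSProp rate levels off near the base learning rate.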

3.3.3.3 Momentum

$$v = \gamma v - \eta\, d\theta \tag{3.15}$$

$$\theta = \theta + v \tag{3.16}$$

Here $\gamma$ is another hyperparameter called the momentum, and $v$ is the velocity. We combine the previous velocity with the gradient to get the current velocity and then use the velocity to update $\theta$. Because $v$ already contains the negative gradient step, it is added to $\theta$. This differs from basic gradient descent, where the parameters are updated directly with the gradient. The algorithm helps to reduce oscillation and speed up convergence.

3.3.3.4 Nesterov Momentum

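Nesterov momentum evaluates the gradient at the look-ahead position $\theta'$ instead of the current position, and it often achieves better results than standard momentum. Here $d\theta'$ denotes the gradient evaluated at $\theta'$.

$$\theta' = \theta + \gamma v \tag{3.17}$$

$$v = \gamma v - \eta\, d\theta' \tag{3.18}$$

$$\theta = \theta + v \tag{3.19}$$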
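Both momentum variants can be sketched for a scalar parameter. The code below is an illustration under stated assumptions: the velocity holds the negative gradient step and is therefore added to the parameter, and the quadratic loss 0.5 θ² (whose gradient is θ) is chosen only as a test problem.

```python
def momentum_step(theta, v, grad_fn, lr=0.1, gamma=0.9):
    """Momentum: blend the previous velocity with the current gradient."""
    v = gamma * v - lr * grad_fn(theta)
    return theta + v, v

def nesterov_step(theta, v, grad_fn, lr=0.1, gamma=0.9):
    """Nesterov momentum: evaluate the gradient at the look-ahead position."""
    lookahead = theta + gamma * v
    v = gamma * v - lr * grad_fn(lookahead)
    return theta + v, v

def grad(theta):
    """Gradient of the toy loss 0.5 * theta ** 2."""
    return theta

def run(step_fn, steps):
    theta, v = 1.0, 0.0
    for _ in range(steps):
        theta, v = step_fn(theta, v, grad)
    return theta

momentum_two_steps = run(momentum_step, 2)    # 1.0 -> 0.9 -> 0.72
nesterov_two_steps = run(nesterov_step, 2)    # 1.0 -> 0.9 -> 0.729
momentum_final = run(momentum_step, 200)
nesterov_final = run(nesterov_step, 200)
```

The two methods agree on the first step, when the velocity is still 0, and differ from the second step on, because the look-ahead gradient already anticipates the move. Both approach the minimum at 0.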

3.3.4 Forward Propagation and Backpropagation

Let $a_i$ represent the activation values of layer $i$. For the input layer, the values come directly from the input $x$, so $a_1 = x$. We can compute the values of all neurons layer by layer, from the input layer to the output layer:

$$a_{i+1} = f_i(W_i a_i + b_i) \tag{3.20}$$

From the values of the output layer, we can compute the loss, which measures the error between the model's predicted value and the actual target value.

During training, we use gradient descent to update the parameters and reduce the loss. Backpropagation uses the chain rule to efficiently compute the gradient of the final output (the loss) with respect to every parameter. It is applied on the computation graph, from the final output node backward to all other nodes. During backpropagation, each node multiplies, for each of its inputs, the local gradient of the node's output with respect to that input by the gradient of the final output with respect to the node's output, which it receives from the later node. The process then continues at each input node.


3.3.4.1 Chain Rule

The chain rule is used to compute the derivative of a composite function. For example, if variable $x$ is a function of $y$, which in turn is a function of $z$, then according to the chain rule:

$$\frac{dx}{dz} = \frac{dx}{dy} \cdot \frac{dy}{dz} \tag{3.21}$$

3.3.4.2 Example

The following illustrates the forward propagation and backpropagation process of feeding one data sample to a neural network that has one hidden layer with ReLU activation and uses the cross-entropy loss.

$W, b$ and $W', b'$ are the weights and biases of the hidden layer and the output layer respectively. $X, y$ are the data sample and its class label.

Forward propagation

Compute the affine transform for the hidden layer.

$$Z = W^T X + b \tag{3.22}$$

Compute the ReLU activation for the hidden layer.

$$H = \max(Z, 0) \tag{3.23}$$

Compute the affine transform for the output layer, which gives the class scores.

$$S = {W'}^T H + b' \tag{3.24}$$

Convert the class scores to probabilities using the softmax function.

$$P_k = \frac{e^{S_k}}{\sum_j e^{S_j}} \tag{3.25}$$

Compute the loss.

$$L = -\log P_y \tag{3.26}$$

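Backpropagation

Compute the gradient with respect to the class scores, where $1(\cdot)$ is the indicator function.

$$\frac{\partial L}{\partial S_k} = P_k - 1(y = k) \tag{3.27}$$

Compute the gradient with respect to the weights $W'$.

$$\frac{\partial L}{\partial W'} = H \left(\frac{\partial L}{\partial S}\right)^T \tag{3.28}$$

Compute the gradient with respect to the bias $b'$.

$$\frac{\partial L}{\partial b'_k} = \frac{\partial L}{\partial S_k} \tag{3.29}$$

Backpropagate to the hidden layer.

$$\frac{\partial L}{\partial H} = W' \frac{\partial L}{\partial S} \tag{3.30}$$

Backpropagate through the ReLU by setting to 0 the elements of $\frac{\partial L}{\partial H}$ at positions where $H$ is not positive, because $\frac{\partial \max(x, 0)}{\partial x} = 1$ if $x > 0$ and $0$ if $x \le 0$. Here $\odot$ denotes the element-wise product.

$$\frac{\partial L}{\partial Z} = \frac{\partial L}{\partial H} \odot 1(H > 0) \tag{3.31}$$

Compute the gradient with respect to the weights $W$.

$$\frac{\partial L}{\partial W} = X \left(\frac{\partial L}{\partial Z}\right)^T \tag{3.32}$$

Compute the gradient with respect to the bias $b$.

$$\frac{\partial L}{\partial b_k} = \frac{\partial L}{\partial Z_k} \tag{3.33}$$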

From the above, we can see that backpropagation uses many intermediate results computed during forward propagation. We therefore often save the needed intermediate values during forward propagation, which saves computation time by avoiding duplicate computation in backpropagation.
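The example network can be written out in NumPy and checked against numerical gradients. This is a sketch under assumptions, not the project code: the sample is a 1-D vector, the hidden gradient is carried back through the output-layer weights (called `W2` here), and the helper `numeric_grad` and all sizes are illustrative.

```python
import numpy as np

def forward_backward(X, y, W, b, W2, b2):
    """One sample X of shape (d,), label y; W is (d, h) and W2 is (h, c)."""
    # Forward pass; Z, H and P are kept for reuse in the backward pass.
    Z = W.T @ X + b
    H = np.maximum(Z, 0.0)
    S = W2.T @ H + b2
    P = np.exp(S - S.max())
    P /= P.sum()
    loss = -np.log(P[y])

    # Backward pass.
    dS = P.copy()
    dS[y] -= 1.0               # P_k - 1(y = k)
    dW2 = np.outer(H, dS)      # H (dL/dS)^T
    db2 = dS
    dH = W2 @ dS               # output-layer weights carry the gradient back
    dZ = dH * (H > 0)          # ReLU passes gradient only where H > 0
    dW = np.outer(X, dZ)       # X (dL/dZ)^T
    db = dZ
    return loss, {"W": dW, "b": db, "W2": dW2, "b2": db2}

def numeric_grad(loss_fn, param, eps=1e-5):
    """Central-difference gradient of loss_fn with respect to param (in place)."""
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + eps
        plus = loss_fn()
        param[idx] = old - eps
        minus = loss_fn()
        param[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X, y = rng.standard_normal(4), 2
params = {"W": rng.standard_normal((4, 5)), "b": rng.standard_normal(5),
          "W2": rng.standard_normal((5, 3)), "b2": rng.standard_normal(3)}

def loss_only():
    return forward_backward(X, y, **params)[0]

loss, grads = forward_backward(X, y, **params)
max_err = max(np.abs(grads[k] - numeric_grad(loss_only, params[k])).max()
              for k in params)
```

A small `max_err` indicates that the analytic gradients agree with finite differences. Note how the backward pass reuses `H` and `P` from the forward pass instead of recomputing them.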

Although the example above is for a simple neural network, it extends easily to more complex networks. During forward propagation and backpropagation, the computation is local to each layer. Each layer only needs to know the values propagated to it, compute its own values and propagate them to other layers. It does not need to know how other layers do their computation. Different layers and operations can therefore be used as components and combined in many different ways to construct deep and very complex neural networks.

3.4 Regularization

We often use regularization to reduce overfitting. One way to regularize is to add a weight penalty to the loss. The new loss is the sum of the original data loss and the added regularization loss. The regularization parameter $\lambda$ controls the regularization strength. A large $\lambda$ puts more weight on the regularization loss and thus gives stronger regularization. A small $\lambda$ puts more weight on the data loss and thus gives weaker regularization. Different data sets or network architectures may require very different values of $\lambda$. There is no simple way to choose a suitable $\lambda$, so it is usually set through cross-validation. Because the regularization loss penalizes large weights, adding it leads to networks with smaller weights.

Small weights mean that a small change in the inputs will not change the output of the network much. A few outliers will not matter much to a regularized network, which makes it less sensitive to noise in the data. By contrast, a small change in some of the inputs can change the output of a network with large weights considerably, so large weights let the model adapt easily to all of the training data, including the noise.

In summary, regularized networks with small weights tend to be simpler, more robust to noise, less likely to overfit and better at generalizing. Unregularized networks with large weights tend to be more complex, prone to learning the noise and more likely to overfit.

3.4.1 L2 regularization

$$L = \underbrace{\frac{1}{N} \sum_i L_i}_{\text{data loss}} + \underbrace{\frac{1}{2} \lambda \sum_k \sum_l W_{k,l}^2}_{\text{regularization loss}} \tag{3.34}$$

3.4.2 L1 regularization

$$L = \underbrace{\frac{1}{N} \sum_i L_i}_{\text{data loss}} + \underbrace{\lambda \sum_k \sum_l |W_{k,l}|}_{\text{regularization loss}} \tag{3.35}$$

L2 regularization and L1 regularization are similar in that both penalize large weights, but they lead to different weight updates in gradient descent. For L2 regularization, the additional update of $w$ caused by the regularization loss is

$$w = w - \eta \lambda w \tag{3.36}$$

For L1 regularization, it is

$$w = w - \eta \lambda\, \mathrm{sign}(w) \tag{3.37}$$

From the above, we can see that the size of the update is constant for L1 regularization and proportional to $w$ for L2 regularization. The penalty is therefore much larger for L2 regularization when $|w|$ is large and much larger for L1 regularization when $|w|$ is small. The effect is that L1 weights are sparse, with a small number of relatively large weights and the others driven to 0, whereas L2 weights are more diffuse. The sparsity of L1 regularization makes it a better choice for feature selection. In other situations, L2 regularization is usually found to work better than L1 regularization.

We can also combine the two, which is called elastic net regularization.

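$$L = \underbrace{\frac{1}{N} \sum_i L_i}_{\text{data loss}} + \underbrace{\sum_k \sum_l \left( \lambda_1 |W_{k,l}| + \frac{1}{2} \lambda_2 W_{k,l}^2 \right)}_{\text{regularization loss}} \tag{3.38}$$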

Apart from adding a regularization loss, another way to avoid weights with too large a magnitude is max norm regularization. This method updates the weights as normal with gradient descent and then clips them if needed, so that the norm of each weight vector stays below a preset maximum value.
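The L2 and L1 regularization updates and max norm clipping can be sketched as follows. The function names are illustrative, and treating each column of the weight matrix as one neuron's weight vector is an assumption about the layout.

```python
import numpy as np

def l2_update(w, lr, lam):
    """Extra L2 step: shrink each weight in proportion to its value."""
    return w - lr * lam * w

def l1_update(w, lr, lam):
    """Extra L1 step: move each weight toward 0 by a constant amount."""
    return w - lr * lam * np.sign(w)

def max_norm_clip(W, max_norm):
    """Rescale each column (assumed to be one weight vector) to norm <= max_norm."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return W * scale

w = np.array([2.0, 0.01, -0.5])
w_l2 = l2_update(w, lr=0.1, lam=0.1)    # each weight scaled by 0.99
w_l1 = l1_update(w, lr=0.1, lam=0.1)    # each weight moved by 0.01 toward 0

W = np.array([[3.0, 0.3],
              [4.0, 0.4]])              # column norms 5.0 and 0.5
W_clipped = max_norm_clip(W, max_norm=1.0)
```

The small weight 0.01 is driven to 0 by the L1 step but only shrinks by 1% under the L2 step, which illustrates why L1 regularization tends to produce sparse weights. Clipping rescales only the column whose norm exceeds the limit.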

3.4.3 Dropout Layer

Dropout is a method to reduce overfitting. In the training stage, we randomly drop out neurons and their associated connections, each with probability $1 - p$ (Figure 3.10). This has the effect of sampling from a large number of sub-networks. In the testing stage, we do not drop out neurons. Instead, we use the full network but with each neuron's output scaled by $p$. In this way, we approximately compute the average output of all the sub-networks.

By randomly dropping out neurons, dropout trains an exponentially large number of sub-networks and uses their average prediction, which resembles ensemble learning. This reduces overfitting and may also speed up training.
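A minimal NumPy sketch of the two stages follows, with $p$ as the probability of keeping a neuron. The function names are assumptions made for this example.

```python
import numpy as np

def dropout_train(h, p, rng):
    """Training: keep each activation with probability p, drop it otherwise."""
    mask = rng.random(h.shape) < p
    return h * mask

def dropout_test(h, p):
    """Testing: keep every activation but scale it by p."""
    return h * p

rng = np.random.default_rng(0)
h = np.ones(100000)                        # dummy activations
p = 0.8
train_mean = dropout_train(h, p, rng).mean()
test_mean = dropout_test(h, p).mean()
```

On average the training-time output matches the scaled test-time output, which is why scaling by `p` approximates averaging over the sampled sub-networks. Many frameworks instead divide by `p` during training (inverted dropout) so that no scaling is needed at test time.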


Figure 3.10: An example of the dropout operation. The first and third neurons and their associated connections are dropped out.

3.4.4 Batch Normalization

During neural network training, a change in the parameters of one layer changes the distribution of the inputs to the layers after it. This phenomenon, called internal covariate shift, is especially pronounced in deep neural networks, where the impact is amplified across multiple layers. Adapting to the changing input distribution usually requires a small learning rate, which makes training slow.

To solve this problem, we can transform the inputs to a layer to have mean 0 and variance 1. This transformation is called whitening. To keep the computation fast and differentiable, as backpropagation requires, we can whiten each dimension of the input independently.

$$\hat{x} = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]}} \tag{3.39}$$

Here $x$ is one dimension of the input, which is a scalar.

To avoid changing what the layer can represent, we add a linear transformation after the whitening transformation, with learned parameters $\gamma$ and $\beta$.

$$y = \gamma \hat{x} + \beta \tag{3.40}$$

The two transformations together are called batch normalization.

During training, the mean and variance of x are estimated from mini-batch samples.

The population means and variances are also estimated by taking moving average of

mini-batch statistics during training. During inference, the fixed population means and

variances are used so that the output is only determined by the input.

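Consider a layer in the original network, where $g$ is the activation function:

$$z = g(Wu + b) \tag{3.41}$$

We can apply batch normalization in this way:

$$z = g(\mathrm{BN}(Wu)) \tag{3.42}$$

The bias $b$ is removed because its effect is canceled by the mean subtraction, and its role is taken over by the $\beta$ parameter of batch normalization.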

In a convolutional layer, an activation map is obtained by applying the same filter at different locations of the previous layer. When we use batch normalization in a convolutional layer, we normalize all the activations of an activation map jointly across the mini-batch. So if the activation map has size $p \times q$ and the batch size is $m$, the normalization is applied over $p \times q \times m$ values. Just as an activation map shares the same weights, we use the same parameters $\gamma$ and $\beta$ for the whole activation map.

The batch normalization can reduce layer input distribution change and make the

gradients less sensitive to parameter scales, thus higher learning rate can be used to

speed up the training.

During training, batch normalization depends on all the samples in the mini-batch, so the output for one training sample is no longer deterministic. In this way, batch normalization has a regularizing effect and can reduce the need for other regularization methods such as dropout.
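The difference between training and inference behaviour can be sketched with a small NumPy class. This is an illustrative sketch for fully connected inputs of shape (batch, features). The class name, the moving-average `momentum` and the small `eps` added under the square root for numerical safety are assumptions, not details taken from the text.

```python
import numpy as np

class BatchNorm1D:
    """Batch normalization over the batch axis of a (batch, features) array."""

    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(dim)           # learned scale
        self.beta = np.zeros(dim)           # learned shift
        self.running_mean = np.zeros(dim)   # population estimates
        self.running_var = np.ones(dim)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x, training):
        if training:
            # Statistics come from the current mini-batch.
            mean, var = x.mean(axis=0), x.var(axis=0)
            m = self.momentum
            self.running_mean = m * self.running_mean + (1 - m) * mean
            self.running_var = m * self.running_var + (1 - m) * var
        else:
            # Fixed population statistics: output depends only on the input.
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 4)) * 3.0 + 5.0
bn = BatchNorm1D(4)
train_out = bn(x, training=True)
single = bn(x[:1], training=False)          # one sample on its own
in_batch = bn(x, training=False)[:1]        # same sample inside a batch
```

In training mode each feature of `train_out` has mean close to 0 and variance close to 1. In inference mode `single` and `in_batch` are identical, because the output no longer depends on the other samples in the batch.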

3.5 Depthwise Separable Convolution

Depthwise separable convolution factorizes the conventional convolution (Figure 3.11) into a depthwise convolution (Figure 3.12) followed by a pointwise convolution (Figure 3.13).


Figure 3.11: Conventional convolution example

Figure 3.12: Depthwise convolution example


Figure 3.13: Pointwise convolution example

The depthwise convolution is performed independently for each channel of the input, with a single filter applied per channel. The pointwise convolution is the same as the conventional convolution operation but with a kernel size of 1×1, which is why it is called pointwise. It linearly combines the features from the depthwise convolution to create new features.

Depthwise separable convolution thus filters each input channel through the depthwise convolution and then combines the features to create new ones through the pointwise convolution. The overall effect is similar to that of conventional convolution. The difference is that conventional convolution filters and combines in a single step, whereas depthwise separable convolution uses two separate steps.

By separating feature filtering from feature combining, depthwise separable convolution greatly reduces the amount of computation.

Assume the input $I$ has size $W \times H \times M$, where $W$ is the input width, $H$ is the height and $M$ is the number of input channels. Each filter in $F$ has spatial size $w \times h$ and spans all $M$ input channels, and the number of filters is $N$. With a stride of 1 and zero padding, the output $O$ of the conventional convolution has size $W \times H \times N$. The elements of $O$ are computed as follows:

$$O_{i,j,n} = \sum_{u,v,m} I_{i+u,\,j+v,\,m} \cdot F_{u,v,m,n} \tag{3.43}$$

This takes $O(W \cdot H \cdot M \cdot N \cdot w \cdot h)$ time.

For the depthwise convolution, we use one filter per input channel, so the filter bank has size $w \times h \times M$. The output of the depthwise convolution has size $W \times H \times M$. It is computed as follows:

$$O_{i,j,m} = \sum_{u,v} I_{i+u,\,j+v,\,m} \cdot F_{u,v,m} \tag{3.44}$$

This takes $O(W \cdot H \cdot M \cdot w \cdot h)$ time.

The $1 \times 1$ pointwise convolution then uses $N$ filters, takes the output of the depthwise convolution and generates an output of size $W \times H \times N$. This takes $O(W \cdot H \cdot M \cdot N)$ time.

In total, depthwise separable convolution takes $O(W \cdot H \cdot M \cdot w \cdot h + W \cdot H \cdot M \cdot N) = O(W \cdot H \cdot M \cdot (w \cdot h + N))$ time.

The ratio between the cost of depthwise separable convolution and that of conventional convolution is $\frac{1}{N} + \frac{1}{wh}$. For a typical convolution, where $w = 3$, $h = 3$ and $N > 100$, this gives a speedup of roughly 8 to 9 times.
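The cost ratio can be verified with a few lines of arithmetic. The layer shape below (32×32 input, 64 input channels, 128 filters of size 3×3) is an illustrative assumption, not a layer taken from the experiments.

```python
def conventional_cost(W, H, M, N, w, h):
    """Multiply-accumulate count of a conventional convolution."""
    return W * H * M * N * w * h

def separable_cost(W, H, M, N, w, h):
    """Depthwise cost plus 1x1 pointwise cost."""
    depthwise = W * H * M * w * h
    pointwise = W * H * M * N
    return depthwise + pointwise

shape = dict(W=32, H=32, M=64, N=128, w=3, h=3)
ratio = separable_cost(**shape) / conventional_cost(**shape)
speedup = 1.0 / ratio
expected_ratio = 1 / shape["N"] + 1 / (shape["w"] * shape["h"])
```

For this shape the ratio equals 1/128 + 1/9, so the speedup is a little over 8 times. As the number of filters grows, the 1/N term vanishes and the speedup approaches the filter area of 9.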

3.6 Transfer Learning

Training a good deep convolutional neural network model usually requires substantial computational resources and a long time. For example, training a deep convolutional neural network on ImageNet may take weeks even with GPU clusters. If we cannot afford the computational resources or the time, we can use transfer learning. We take a pre-trained model (many state-of-the-art trained models are freely available on the internet), replace the last fully connected layer and retrain it. The earlier layers of the model can be seen as a feature extractor, and the last fully connected layer computes class scores from the extracted features. We can reuse the features of the pre-trained model, but the classes usually differ from those of the pre-trained model, so we need to replace and retrain the last layer. If retraining only the last layer does not give satisfactory performance, we may also need to fine-tune the earlier layers: we initialize their weights from the pre-trained model and update them during training with a smaller learning rate. The smaller learning rate is used because we expect the pre-trained weights to be close to the final optimized weights, so we want to update them little by little without overstepping. Whether fine-tuning is needed often depends on how similar the new data set is to the data set used by the pre-trained model, in terms of both image data and class labels. If they are very similar, the kinds of features extracted by the layers before the last layer of the pre-trained model are likely to suit the new model as well, and retraining only the last layer may be enough.

Apart from saving a great deal of training time and computational resources, transfer learning often gives better results.

Chapter 4

Results and Evaluation

4.1 Resource and tools

Model training and evaluation are implemented in Python with the TensorFlow framework 1.0 on Ubuntu Linux. I use an Amazon Elastic Compute Cloud (EC2) G2 instance, which uses NVIDIA GRID K520 GPUs, for model training.

The image classification app on the mobile device is implemented in Java for Android with the TensorFlow Mobile library. The TensorFlow Mobile library currently supports 3 platforms: Android, iOS and Raspberry Pi. The library provides APIs that let a mobile app easily load a pre-trained model and run inference with it.

The Android image classification app is developed with Android Studio, which is the official IDE for Android.

4.1.1 Checkpoint File

During training, we can use the TensorFlow API to periodically save the learned model parameters to binary checkpoint files. In this way, the model parameters are backed up. Later, the model parameters can be restored by loading them from a checkpoint file.

4.1.2 Model File

The model file is in Protocol Buffers format, which can be saved and loaded from many different languages. So we can save the model file using Python and load the model using Java in the Android app.

The Graph object contains all the information about the model graph. The graph consists of nodes. Each node stores various information, including the node name, the operation (such as “Add” or “Conv2D”), the input nodes and other attributes such as the filter size for “Conv2D”.

To make the model suitable for deployment, we can use the TensorFlow tool freeze_graph.py to combine the graph definition file and the checkpoint file that contains the learned parameters into a single model file. The tool does this by replacing each Variable node with a Const node that contains the parameters. It also removes nodes that are unnecessary for inference, which simplifies the graph and decreases the file size.

The resulting model file can then be shipped with the Android app. When the Android app starts, it first loads the model file using the TensorFlow Mobile Java API. It can then run inference with the loaded model.

4.2 Dataset

4.2.1 CIFAR 100

Figure 4.1: A sample of 100 images from CIFAR-100


The CIFAR-100 dataset contains 60000 small images of size 32×32. They belong to 100 different classes, with each class containing 600 images. A sample of 100 images from this dataset is shown in Figure 4.1.

4.3 Experimental Setup

4.3.1 Training set and test set

The CIFAR-100 dataset is divided into a training set, which contains 50000 images, and a test set, which contains 10000 images.

4.3.2 Preprocessing

During training, each image is randomly transformed before being fed to the neural network. In this way, the network trains on multiple versions of the same image, and the effective training data set is much larger than the original data set. This helps the model generalize better and reduces overfitting.

4.3.2.1 Randomly Shift the Image

First pad the image, and then randomly crop it back to the original size. In this way, the image is randomly shifted in the 4 directions.

4.3.2.2 Randomly Flip the Image

The image is flipped left to right with probability 0.5.

4.3.2.3 Randomly adjust the image brightness

This randomly adds a value between -63 and 63 to all RGB components of every pixel.

4.3.2.4 Randomly change the image contrast

Randomly choose a contrast factor $0.2 \le f \le 1.8$. For each RGB channel, compute the mean $m$ and update the corresponding component $x$ of each pixel with:

$$(x - m) \times f + m$$

After the random transformations above, we finally normalize the image data so that it has zero mean and unit norm.
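The preprocessing steps above can be sketched in NumPy as follows. This is an illustration under assumptions rather than the pipeline used in the experiments: the padding width of 4 pixels is hypothetical, and the final normalization is interpreted as scaling to zero mean and unit standard deviation.

```python
import numpy as np

def augment(image, rng, pad=4, max_delta=63.0, lower=0.2, upper=1.8):
    """Randomly shift, flip, brighten and re-contrast an (H, W, 3) float image."""
    h, w, _ = image.shape

    # Random shift: zero-pad, then crop back to the original size.
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    out = padded[top:top + h, left:left + w, :]

    # Random left-right flip with probability 0.5.
    if rng.random() < 0.5:
        out = out[:, ::-1, :]

    # Random brightness: one offset added to every component.
    out = out + rng.uniform(-max_delta, max_delta)

    # Random contrast: scale each channel around its own mean.
    f = rng.uniform(lower, upper)
    mean = out.mean(axis=(0, 1), keepdims=True)
    out = (out - mean) * f + mean

    # Normalization (assumed here: zero mean, unit standard deviation).
    return (out - out.mean()) / max(out.std(), 1e-8)

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 255.0, size=(32, 32, 3))   # stand-in for a CIFAR image
augmented = augment(image, rng)
```

Calling `augment` repeatedly on the same image yields a different variant each time, which is how one training image turns into many.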


4.3.3 Mobilenet

Hyperparameters

• Batch Size: 128

• Momentum: 0.9

• Initial learning rate: 0.01

• Learning rate decay: decay with factor 0.94 every 2 epochs

• Weight decay parameter: 0.00004

• Optimizer: RMSProp optimization algorithm with decay rate of 0.9

The initial weights are loaded from a MobileNet model pre-trained on ImageNet. In the first stage, only the last fully connected layer is trained, and the parameters of the previous layers are kept unchanged. This stage runs for 25000 steps. In the second stage, all layers are trained to fine-tune the model. This stage runs for 55000 steps. During training, random minor changes are applied to the images to augment the data set.

After training finishes, we use the test set to evaluate performance. Note that the prediction for each image is made only once. If the average prediction over multiple randomly changed versions of an image were used, performance would likely improve.

The models are exported to TensorFlow model files. In the Android image classification app, the model file is loaded, and the inference time is computed by dividing the time taken to classify 100 images one by one by 100. The inference time is measured on a Nexus 6 Android phone.

The experiments are run for width multipliers 1.0, 0.75, 0.5 and 0.25 and image sizes 32, 24 and 16, so the steps above are carried out for a total of 12 models.

The changes in the losses over the training steps for the model with width multiplier 1.0 and image size 32 are shown below. The others are similar. The red line is for the first stage and the green line for the second stage.

Figures 4.2, 4.3 and 4.4 show the change in total loss, cross-entropy loss and regularization loss over the training steps in both stages.


Figure 4.2: Total Loss

Figure 4.3: Cross Entropy Loss


Figure 4.4: Regularization Loss

4.3.4 Inception V3

The Google Inception V3 model is proposed in [5]. It adds an auxiliary logits layer alongside the usual logits layer to speed up convergence during training. For this model, the images are scaled from 32×32 to 128×128 in the experiment. The first stage trains the auxiliary logits layer and the logits layer for 15000 steps with a fixed learning rate of 0.01. The second stage trains all layers for 30000 steps with a smaller fixed learning rate of 0.0001. Both stages use a weight decay of 0.00004.

Figures 4.5, 4.6 and 4.7 show the change in total loss, cross-entropy loss and regularization loss over the training steps in both training stages for the Inception V3 model.


Figure 4.5: Total Loss

Figure 4.6: Cross Entropy Loss

Figure 4.7: Regularization Loss


4.3.5 ResNet

The ResNet model is proposed in [6]. In the experiment, this model goes through the same training process as the Inception V3 model.

Figures 4.8, 4.9 and 4.10 show the change in total loss, cross-entropy loss and regularization loss over the training steps in both training stages for the ResNet model.

Figure 4.8: Total Loss

Figure 4.9: Cross Entropy Loss


Figure 4.10: Regularization Loss

4.4 Metrics

4.4.1 Top-1 Accuracy

The ratio between the number of images that are predicted correctly and the total num-

ber of images in the test set.

4.4.2 Top-5 Accuracy

As with top-1 accuracy, this is the ratio between the number of correct predictions and the total number of images. The difference lies in what counts as a correct prediction. For top-5 accuracy, the classifier gives 5 candidate guesses instead of 1. If the correct label is among the 5 guesses, the prediction is considered correct.
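Both accuracy metrics can be computed with one helper. The following NumPy sketch uses a tiny made-up score matrix with 3 classes, so it evaluates top-1 and top-2 rather than top-5. The function name and the data are illustrative.

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    top_k = np.argsort(scores, axis=1)[:, -k:]       # indices of the k best classes
    hits = (top_k == labels[:, None]).any(axis=1)
    return hits.mean()

scores = np.array([[0.1, 0.7, 0.2],    # predicted class 1, true class 1
                   [0.5, 0.3, 0.2],    # predicted class 0, true class 1
                   [0.2, 0.3, 0.5]])   # predicted class 2, true class 0
labels = np.array([1, 1, 0])

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

Only the first sample is correct at top-1. The second sample becomes correct at top-2 because its true class has the second-highest score, so the accuracy rises from 1/3 to 2/3.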

4.4.3 Inference Time

The average time the model takes to classify a single image.

4.4.4 Model File Size

The size of the TensorFlow model file used for deployment. The model file size is mainly determined by the number of parameters and the number of bits used to encode each parameter.


4.5 Results

Table 4.1 shows the performance of MobileNets with various width multipliers and resolution multipliers. Table 4.2 shows the performance of the full MobileNet, Inception V3 and ResNet.

Table 4.1: Performance for Different Width Multipliers and Resolution Multipliers

Table 4.2: Performance of Different Models


4.6 Analysis

Table 4.3: Relative Performance

For comparison, the accuracy loss, inference time speedup and model size compression ratio of the MobileNet model relative to Inception V3 and ResNet are computed in Table 4.3.

We can see that MobileNet achieves a significant inference speedup and model size compression over Inception and ResNet. Its accuracy is similar to that of ResNet and shows a relatively large loss compared with Inception.

We can also see that a smaller width multiplier decreases inference time, model size and accuracy. A smaller resolution multiplier does not affect model size but decreases inference time and accuracy. This is because a smaller width multiplier decreases the number of channels used in the filters, which decreases the number of parameters, so the model file shrinks. A smaller resolution multiplier decreases the input image size, so the amount of computation decreases, but the number of parameters stays the same. It therefore speeds up inference but does not shrink the model file.

The results also show that decreasing the width multiplier is better than decreasing the resolution multiplier for speeding up inference and shrinking the model file. For example, using width multiplier 0.75 and resolution multiplier 1.0 gives higher accuracy, quicker inference and a smaller model than using width multiplier 1.0 and resolution multiplier 0.75.

Chapter 5

Conclusion and Discussion

5.1 Remarks and observations

This project implements the MobileNet model using the TensorFlow framework. The approximate computing technique used is to approximate each traditional convolutional layer with a depthwise separable convolution layer. An Android image classification app was built to measure the real inference time of each model. In the experiments, MobileNets with various width multipliers and resolution multipliers were trained on the CIFAR-100 dataset to compare the effect of these two hyperparameters on performance. The results show that adjusting them yields different trade-offs between accuracy and efficiency. Decreasing the width multiplier gives a smaller model and faster image classification on mobile, and decreasing the resolution multiplier gives faster classification without changing the model size; both come with greater accuracy loss. Mobile developers can therefore adjust them to find the best trade-off for their applications. The experiments also compare MobileNet with other models such as Inception and ResNet, and show that MobileNet has much faster inference and a much smaller model size at a reasonable cost in accuracy. The resulting model takes much less memory and inference time, and is therefore more suitable for mobile deployment.
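The saving from replacing a traditional convolution with a depthwise separable one can be illustrated with a short calculation. The sketch below follows the multiply-add counts given in the MobileNets paper; the function names and the example layer shape are assumptions made for illustration, not values taken from the experiments.

```python
def standard_conv_mult_adds(kernel, in_ch, out_ch, fmap):
    """Multiply-adds of a traditional convolution (stride 1, same padding)."""
    return kernel * kernel * in_ch * out_ch * fmap * fmap


def separable_conv_mult_adds(kernel, in_ch, out_ch, fmap):
    """Multiply-adds of a depthwise convolution followed by a 1x1 pointwise one."""
    depthwise = kernel * kernel * in_ch * fmap * fmap
    pointwise = in_ch * out_ch * fmap * fmap
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 64 -> 128 channels, 32x32 feature map.
standard = standard_conv_mult_adds(3, 64, 128, 32)
separable = separable_conv_mult_adds(3, 64, 128, 32)
ratio = separable / standard

# The ratio equals 1/out_ch + 1/kernel**2, independent of in_ch and fmap.
print("separable / standard = %.4f" % ratio)
```

For a 3x3 kernel the ratio is dominated by the 1/9 term, so the separable layer needs roughly eight to nine times fewer multiply-adds than the traditional layer.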

5.2 Limitation and Further work

5.2.1 More approximate computing techniques

Currently, the only approximate computing technique used is depthwise separable convolution, which approximates traditional convolution. We would like to apply network


pruning and quantization techniques on the resulting models to further decrease model

size and inference time in future work.
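As a rough indication of what these two techniques do to a layer's weights, the sketch below applies magnitude pruning and linear 8-bit quantization to a small list of numbers. It is a generic illustration under simplifying assumptions, not the method of any cited paper and not something evaluated in this project; all function names and the example weights are hypothetical.

```python
def prune_by_magnitude(weights, threshold):
    """Set weights whose absolute value is below the threshold to zero."""
    return [0.0 if abs(w) < threshold else w for w in weights]


def quantize_uint8(weights):
    """Map floats linearly onto the integers 0..255.

    Returns the integer codes together with the scale and minimum
    needed to reconstruct approximate float values.
    """
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 if all weights are equal
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Reconstruct approximate float weights from 8-bit codes."""
    return [lo + c * scale for c in codes]


# Hypothetical weights used only for illustration.
weights = [-1.0, -0.02, 0.0, 0.01, 0.5, 1.0]
pruned = prune_by_magnitude(weights, threshold=0.05)
codes, scale, lo = quantize_uint8(pruned)
restored = dequantize(codes, scale, lo)
```

Storing one byte per weight instead of a 32-bit float would reduce the stored size of these weights by about a factor of four, at the cost of a reconstruction error of at most half a quantization step per weight.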

5.2.2 More extensive Experiment

In this project, due to limited computing resources and time, we use only one dataset, CIFAR-100, and two popular traditional models, Inception and ResNet, for comparison. In future work, we will use more datasets and more models for a more extensive evaluation.

5.2.3 Application into Practice

In future work, we would like to put the approximate computing techniques used in this project into practice. Many mobile applications would benefit from them. Two examples are bank card number recognition and handwritten Chinese character recognition. The first can be used in a payment app to spare users the hassle of entering a card number manually. The second can be used in a Chinese input app. The techniques used in this project would make recognition in both applications much faster and the apps less memory-consuming.

5.2.4 Model Architecture Improvement

Although MobileNet achieves a significant inference speedup and reduction in model size, it has a relatively large accuracy loss compared with the Inception model. We would like to adjust the model architecture to improve its accuracy in future work.

Bibliography

[1] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[2] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for

large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[3] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.

[5] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[7] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.


[8] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[9] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from

tiny images. 2009.

[10] Misha Denil, Babak Shakibi, Laurent Dinh, Nando de Freitas, et al. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pages 2148–2156, 2013.

[11] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus.

Exploiting linear structure within convolutional networks for efficient evaluation.

In Advances in Neural Information Processing Systems, pages 1269–1277, 2014.

[12] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up

convolutional neural networks with low rank expansions. arXiv preprint

arXiv:1405.3866, 2014.

[13] Xiangyu Zhang, Jianhua Zou, Xiang Ming, Kaiming He, and Jian Sun. Efficient and accurate approximations of nonlinear convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1984–1992, 2015.

[14] Jonghoon Jin, Aysegul Dundar, and Eugenio Culurciello. Flattened convolutional

neural networks for feedforward acceleration. arXiv preprint arXiv:1412.5474,

2014.

[15] Min Wang, Baoyuan Liu, and Hassan Foroosh. Factorized convolutional neural

networks. arXiv preprint arXiv:1608.04337, 2016.

[16] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and

connections for efficient neural network. In Advances in Neural Information

Processing Systems, pages 1135–1143, 2015.

[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.


[18] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.

[19] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. arXiv preprint arXiv:1707.06342, 2017.

[20] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient

convolutional neural networks using energy-aware pruning. arXiv preprint

arXiv:1611.05128, 2016.

[21] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.

[22] Yoojin Choi, Mostafa El-Khamy, and Jungwon Lee. Towards the limit of network

quantization. arXiv preprint arXiv:1612.01543, 2016.

[23] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.