Practical Block-wise Neural Network Architecture Generation
Zhao Zhong1,3∗, Junjie Yan2, Wei Wu2, Jing Shao2, Cheng-Lin Liu1,3,4
1 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences  2 SenseTime Research  3 University of Chinese Academy of Sciences
4 CAS Center for Excellence in Brain Science and Intelligence Technology
Email: {zhao.zhong, liucl}@nlpr.ia.ac.cn, {yanjunjie, wuwei, shaojing}@sensetime.com
Abstract
Convolutional neural networks have gained remarkable success in computer vision. However, most usable network architectures are hand-crafted and usually require expertise and elaborate design. In this paper, we provide a block-wise network generation pipeline called BlockQNN which automatically builds high-performance networks using the Q-learning paradigm with an epsilon-greedy exploration strategy. The optimal network block is constructed by the learning agent, which is trained sequentially to choose component layers. We then stack the block to construct the whole auto-generated network. To accelerate the generation process, we also propose a distributed asynchronous framework and an early stop strategy. The block-wise generation brings unique advantages: (1) it achieves competitive results in comparison to the hand-crafted state-of-the-art networks on image classification; in addition, the best network generated by BlockQNN achieves a 3.54% top-1 error rate on CIFAR-10, which beats all existing auto-generated networks; (2) it offers a tremendous reduction of the search space for designing networks, spending only 3 days with 32 GPUs; and (3) it has strong generalizability: the network built on CIFAR also performs well on the larger-scale ImageNet dataset.
1. Introduction
During the last decades, Convolutional Neural Networks (CNNs) have shown remarkable potential in almost every field of the computer vision community [17]. For example, thanks to the network evolution from AlexNet [16], VGG [25], Inception [30] to ResNet [10], the top-5 performance on the ImageNet challenge has steadily increased from 83.6% to 96.43%. However, as the performance gain usually requires an increasing network capacity, a high-
∗This work was done when Zhao Zhong worked as an intern at SenseTime Research.
performance network architecture generally possesses a tremendous number of possible configurations regarding the number of layers, the hyperparameters in each layer and the type of each layer. It is hence infeasible to search exhaustively by hand, and the design of successful hand-crafted networks relies heavily on expert knowledge and experience. Therefore, constructing networks in a smart and automatic manner remains an open problem.
Although some recent works have attempted computer-aided or automated network design [2, 37], several challenges remain unsolved: (1) Modern neural networks always consist of hundreds of convolutional layers, each of which has numerous options in type and hyperparameters. This creates a huge search space and heavy computational costs for network generation. (2) A typically designed network is usually limited to a specific dataset or task, and thus is hard to transfer to other tasks or to generalize to another dataset with different input data sizes. In this paper, we provide a solution to the aforementioned challenges with a novel fast Q-learning framework, called BlockQNN, which automatically designs the network architecture, as shown in Fig. 1.
Particularly, to make the network generation efficient and generalizable, we introduce block-wise network generation, i.e., we construct the network architecture as a flexible stack of personalized blocks rather than by tedious per-layer network piling. Indeed, most modern CNN architectures such as Inception [30, 14, 31] and the ResNet series [10, 11] are assembled as stacks of basic block structures. For example, the inception and residual blocks shown in Fig. 1 are repeatedly concatenated to construct the entire network. With such a block-wise architecture, the generated network owns a powerful generalization ability to other task domains and different datasets.
In comparison to previous methods like NAS [37] and MetaQNN [2], as depicted in Fig. 1, we present a simpler and more elegant model generation method specifically designed for block-wise generation. Motivated by the unsupervised reinforcement learning paradigm, we employ the well-known Q-learning [33] with experience
Figure 1. The proposed BlockQNN (right, in the red box) compared with the hand-crafted networks marked in yellow and the existing auto-generated networks in green. Automatically generating plain networks [2, 37], marked in blue, needs large computational costs for searching optimal layer types and hyperparameters for each single layer, while the block-wise network heavily reduces the cost to searching structures only for one block. The entire network is then constructed by stacking the generated blocks. A similar block concept has demonstrated its superiority in hand-crafted networks, such as the inception block and residual block marked in red.
replay [19] and the epsilon-greedy strategy [21] to effectively and efficiently search for the optimal block structure. The network block is constructed by the learning agent, which is trained sequentially to choose component layers. Afterwards, we stack the block to construct the whole auto-generated network. Moreover, we propose an early stop strategy to enable efficient search with fast convergence. A novel reward function is designed to ensure that the accuracy of the early-stopped network correlates positively with that of the converged network; using this property, we can pick out good blocks in reduced training time. With this acceleration strategy, we can construct a Q-learning agent that learns the optimal block-wise network structures for a given task with limited resources (e.g., a few GPUs or a short time period). The generated architectures are thus succinct and have a powerful generalization ability compared to the networks produced by other automatic network generation methods.
The proposed block-wise network generation brings a few advantages as follows:
• Effective. The automatically generated networks present performance comparable to hand-crafted networks designed with human expertise. The proposed method is also superior to existing works and achieves state-of-the-art performance on CIFAR-10 with a 3.54% error rate.
• Efficient. We are the first to consider a block-wise setup in automatic network generation. Combined with the proposed early stop strategy, the method results in a fast search process. The network generation for the CIFAR task reaches convergence with only 32 GPUs in 3 days, which is much more efficient than NAS [37] with 800 GPUs in 28 days.
• Transferable. It offers surprisingly superior transferable ability: the network generated for CIFAR can
be transferred to ImageNet with little modification but still achieve outstanding performance.
2. Related Work
Early works, from the 1980s, have made efforts to automate neural network design, often searching for good architectures with genetic or other evolutionary algorithms [24, 27, 26, 28, 23, 7, 34]. Nevertheless, these works, to our best knowledge, cannot perform competitively compared with hand-crafted networks. Recent works, i.e., Neural Architecture Search (NAS) [37] and MetaQNN [2], adopted reinforcement learning to automatically search for a good network architecture. Although they can yield good performance on small datasets such as CIFAR-10 and CIFAR-100, the direct use of MetaQNN or NAS for architecture design on big datasets like ImageNet [6] is computationally expensive due to searching in a huge space. Besides, the network generated by this kind of method is task-specific or dataset-specific, that is, it cannot be well transferred to other tasks or datasets with different input data sizes. For example, a network designed for CIFAR-10 cannot be generalized to ImageNet.
Instead, our approach aims to design the network block architecture with an efficient search method, using a distributed asynchronous Q-learning framework as well as an early-stop strategy. The block design conception follows modern convolutional neural networks such as Inception [30, 14, 31] and ResNet [10, 11]. The Inception-based networks construct the inception blocks via a hand-crafted multi-level feature extractor strategy by computing 1 × 1, 3 × 3, and 5 × 5 convolutions, while ResNet uses residual blocks with shortcut connections that make it easier to represent the identity mapping, which allows a very deep network. The blocks automatically generated by our
Name              Index  Type  Kernel Size  Pred1  Pred2
Convolution       T      1     1, 3, 5      K      0
Max Pooling       T      2     1, 3         K      0
Average Pooling   T      3     1, 3         K      0
Identity          T      4     0            K      0
Elemental Add     T      5     0            K      K
Concat            T      6     0            K      K
Terminal          T      7     0            0      0
Table 1. Network Structure Code Space. The space contains seven types of commonly used layers. The layer index stands for the position of the current layer in a block; its range is T = {1, 2, 3, ..., max layer index}. Three kernel sizes are considered for the convolution layer and two for the pooling layers. Pred1 and Pred2 refer to the predecessor parameters, which represent the indices of a layer's predecessors; the allowed range is K = {1, 2, ..., current layer index − 1}.
Figure 3. Auto-generated networks on CIFAR-10 (left) and ImageNet (right). Each network starts with a few convolution layers to learn low-level features, followed by multiple repeated blocks with several pooling layers inserted to downsample.
approach have similar structures, e.g., some blocks contain shortcut connections and Inception-like multi-branch combinations. We will discuss the details in Section 5.1.
Figure 2. Representative block exemplars with their Network Structure Codes (NSC): the block with multi-branch connections (left) and the block with shortcut connections (right).
Another line of related work includes hyper-parameter optimization [3], meta-learning [32] and learning-to-learn methods [12, 1]. However, the goal of these works is to use meta-data to improve the performance of existing algorithms, such as finding the optimal learning rate of an optimization method or the optimal number of hidden layers to construct a network. In this paper, we focus on learning the entire topological architecture of network blocks to improve the performance.
3. Methodology
3.1. Convolutional Neural Network Blocks
The modern CNNs, e.g., Inception and ResNet, are designed by stacking several blocks, each of which shares a similar structure but with different weights and filter numbers. With this block-wise design, the network not only achieves high performance but also has powerful generalization ability to different datasets and tasks. Unlike previous research on automating neural network design, which generates the entire network directly, we aim at designing the block structure.
As a CNN contains a feed-forward computation procedure, we represent it by a directed acyclic graph (DAG), where each node corresponds to a layer in the CNN and directed edges stand for data flow from one layer to another. To turn such a graph into a uniform representation, we propose a novel layer representation called the Network Structure Code (NSC), as shown in Table 1. Each block is then depicted by a set of 5-D NSC vectors. In an NSC, the first three numbers stand for the layer index, operation type and kernel size. The last two are predecessor parameters, which refer to the positions of a layer's predecessor layers in the structure codes. The second predecessor (Pred2) is set only for layers with two predecessors; for layers with a single predecessor, Pred2 is set to zero. This design is motivated by the current powerful hand-crafted networks like Inception and ResNet, whose special block structures share properties such as more complex connections, e.g., shortcut connections or multi-branch connections, than the simple connections of plain networks like AlexNet. Thus, the proposed NSC can encode complex architectures, as shown in Fig. 2. In addition, all layers without a successor in the block are concatenated together to provide the final output.
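To make the NSC representation concrete, the following is a minimal sketch (the `NSC` tuple name is illustrative, not from the paper's released code) that writes down the shortcut-connection exemplar of Fig. 2 as a list of 5-D structure codes:

```python
from collections import namedtuple

# One Network Structure Code entry: a 5-D vector per layer (see Table 1).
# Layer types: 1 Convolution, 2 Max Pooling, 3 Average Pooling, 4 Identity,
#              5 Elemental Add, 6 Concat, 7 Terminal.
NSC = namedtuple("NSC", ["index", "type", "kernel", "pred1", "pred2"])

# The shortcut-connection block exemplar of Fig. 2, written as NSC vectors.
shortcut_block = [
    NSC(1, 4, 0, 0, 0),  # identity on the block input
    NSC(2, 1, 3, 1, 0),  # 3x3 convolution taking layer 1 as predecessor
    NSC(3, 1, 3, 2, 0),  # 3x3 convolution taking layer 2 as predecessor
    NSC(4, 5, 0, 1, 3),  # elemental add of layers 1 and 3 (the shortcut)
    NSC(5, 7, 0, 0, 0),  # terminal code marking the end of the block
]
```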
Figure 4. Q-learning process illustration. (a) The state transition process by different action choices. The block structure in (b) is generated by the red solid line in (a). (c) The flow chart of the Q-learning procedure.
Note that each convolution operation, the same as the declaration in ResNet [11], refers to a Pre-activation Convolutional Cell (PCC) with three components, i.e., ReLU, convolution and batch normalization. This results in a smaller search space than searching over the three components separately, and hence, with the PCC, we can get a better initialization for searching and generating the optimal block structure with a quick training process.
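As a sketch (not the authors' released code), the PCC can be expressed as a small PyTorch module; the component order follows the listing in the text (ReLU, convolution, batch normalization), and the "same" padding is an assumption for odd kernel sizes:

```python
import torch.nn as nn

class PCC(nn.Sequential):
    """Pre-activation Convolutional Cell: every 'Convolution' NSC entry expands
    to ReLU -> Conv -> BatchNorm instead of a bare convolution."""

    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__(
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, out_channels, kernel_size,
                      padding=kernel_size // 2, bias=False),  # 'same' padding assumed
            nn.BatchNorm2d(out_channels),
        )
```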
Based on the blocks defined above, we construct the complete network by stacking these block structures sequentially, which turns a common plain network into its block-wise counterpart. Two representative auto-generated networks for the CIFAR and ImageNet tasks are shown in Fig. 3. There is no down-sampling operation within each block; we perform down-sampling directly by pooling layers. If the size of the feature map is halved by a pooling operation, the block's weights will be doubled. The architecture for ImageNet contains more pooling layers than that for CIFAR because of the different input sizes, i.e., 224 × 224 for ImageNet and 32 × 32 for CIFAR. More importantly, the block can be repeated N times to fulfill different demands, and the blocks can even be placed in other manners, such as inserting a block into the Network-in-Network [20] framework or setting shortcut connections between different blocks.
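The assembly described above can be sketched as follows (illustrative only; `make_block` is a hypothetical factory that builds one auto-generated block for the given input/output widths, and the stem, pooling type and classifier head here are assumptions):

```python
import torch.nn as nn

def build_network(make_block, n_repeats, widths, num_classes=100):
    """Stack N repeated blocks per stage, with pooling between stages to halve
    the feature map while the filter number grows, as in Fig. 3 (CIFAR-style)."""
    layers = [nn.Conv2d(3, widths[0], 3, padding=1)]      # stem convolution
    in_ch = widths[0]
    for stage, width in enumerate(widths):
        for _ in range(n_repeats):                        # Block x N
            layers.append(make_block(in_ch, width))
            in_ch = width
        if stage < len(widths) - 1:
            layers.append(nn.MaxPool2d(2))                # down-sample between stages (pooling type assumed)
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(),
               nn.Linear(widths[-1], num_classes)]        # classifier head (assumed)
    return nn.Sequential(*layers)

# e.g. build_network(my_block_factory, n_repeats=4, widths=[32, 64, 128]) for CIFAR
```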
3.2. Designing Network Blocks With Q-Learning
Albeit we squeeze the search space of the entire network design by focusing on constructing network blocks, there is still a large number of possible structures to explore. Therefore, we employ reinforcement learning rather than random sampling for automatic design. Our method is based on Q-learning, a kind of reinforcement learning, which concerns how an agent ought to take actions so as to maximize the cumulative reward. The Q-learning model consists of an agent, states and a set of actions.
In this paper, the state s ∈ S represents the status of the current layer, defined by the Network Structure Code (NSC) introduced in Section 3.1, i.e., the 5-D vector {layer index, layer type, kernel size, pred1, pred2}. The action a ∈ A is the decision for the next successive layer. Thanks to the defined NSC set with a limited number of choices, both the state and action spaces are finite and discrete, which ensures a relatively small search space. The state transition process (s_t, a(s_t)) → s_{t+1} is shown in Fig. 4(a), where t refers to the current layer. The block example in Fig. 4(b) is generated by the red solid lines in Fig. 4(a). The learning agent is given the task of sequentially picking the NSCs of a block. The structure of a block can be considered as an action selection trajectory τ_{a_{1:T}}, i.e., a sequence of NSCs. We model the layer selection process as a Markov Decision Process, with the assumption that a layer performing well in one block should also perform well in another block [2]. To find the optimal architecture, we ask our agent to maximize its expected reward over all possible trajectories, denoted by R_τ,
R_\tau = \mathbb{E}_{P(\tau_{a_{1:T}})}[R],  (1)
where R is the cumulative reward. For this maximization problem, it is common to use the recursive Bellman equation to solve for optimality. Given a state s_t ∈ S and a subsequent action a ∈ A(s_t), we define the maximum total expected reward to be Q^*(s_t, a), which is known as the Q-value of the state-action pair. The recursive Bellman equation can then be written as
Q^{*}(s_t, a) = \mathbb{E}_{s_{t+1}|s_t,a}\Big[ \mathbb{E}_{r|s_t,a,s_{t+1}}[r \mid s_t, a, s_{t+1}] + \gamma \max_{a' \in \mathcal{A}(s_{t+1})} Q^{*}(s_{t+1}, a') \Big].  (2)
Empirically, the above quantity can be solved by formulating it as an iterative update:
Q(s_T, a) = 0,  (3)
Q(s_{T-1}, a_T) = (1-\alpha)\,Q(s_{T-1}, a_T) + \alpha\, r_T,  (4)
Q(s_t, a) = (1-\alpha)\,Q(s_t, a) + \alpha \big[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') \big], \quad t \in \{1, 2, \ldots, T-2\},  (5)
where α is the learning rate, which determines how much the newly acquired information overrides the old information, and γ is the discount factor, which measures the importance of future rewards. r_t denotes the intermediate reward observed
Figure 5. Comparison of Q-learning with and without the shaped intermediate reward r_t. With our shaped reward, the learning process converges faster than without it, starting from the same exploration.
for the current state s_t, and s_T refers to the final state, i.e., the terminal layer. r_T is the validation accuracy of the corresponding network trained to convergence on the training set, obtained for a_T, i.e., the action leading to the final state. Since the intermediate reward r_t cannot be explicitly measured in our task, we use reward shaping [22] to speed up training. The shaped intermediate reward is defined as:
r_t = \frac{r_T}{T}.  (6)
Previous works [2] ignore these rewards in the iterative process, i.e., set them to zero, which may cause slow convergence in the beginning. This is known as the temporal credit assignment problem, which makes RL time consuming [29]. In that case, the Q-value of s_T is much higher than the others in the early stage of training, which leads the agent to prefer stopping the search at the very beginning, i.e., to build small blocks with fewer layers. We show a comparison in Fig. 5: the learning process of the agent with our shaped reward r_t converges much faster than the previous method.
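The update rule of Eqs. (3)-(5) with the shaped reward of Eq. (6) can be sketched as a tabular Q-learner (a minimal illustration, not the authors' implementation; the class and method names are ours, and the hyperparameter values follow Section 4.2):

```python
from collections import defaultdict

ALPHA, GAMMA = 0.01, 1.0   # learning rate and discount factor (Section 4.2)

class QLearner:
    """Tabular Q-learning over NSC states/actions, following Eqs. (3)-(6)."""

    def __init__(self):
        # q[state][action] defaults to 0, which also covers Q(s_T, a) = 0 in Eq. (3).
        self.q = defaultdict(lambda: defaultdict(float))

    def update(self, trajectory, acc):
        """trajectory: [(s_0, a_0), ..., (s_{T-1}, a_{T-1})] for one sampled block;
        acc: early-stop validation accuracy of the network built from that block."""
        T = len(trajectory)
        r_T = acc
        r_t = r_T / T                                     # shaped intermediate reward, Eq. (6)
        for i in reversed(range(T)):
            s, a = trajectory[i]
            if i == T - 1:                                # action leading to the terminal state, Eq. (4)
                target = r_T
            else:                                         # intermediate transitions, Eq. (5)
                s_next = trajectory[i + 1][0]
                best_next = max(self.q[s_next].values(), default=0.0)
                target = r_t + GAMMA * best_next
            self.q[s][a] = (1 - ALPHA) * self.q[s][a] + ALPHA * target
```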
We summarize the learning procedure in Fig. 4(c). The agent first samples a set of structure codes to build the block architecture, based on which the entire network is constructed by stacking these blocks sequentially. We then train the generated network on a certain task, and the validation accuracy is regarded as the reward to update the Q-value. Afterwards, the agent picks another set of structure codes to get a better block structure.
3.3. Early Stop Strategy
Introducing block-wise generation indeed increases the efficiency. However, it is still time consuming to complete the search process. To further accelerate the learning process, we introduce an early stop strategy. As is well known, stopping the training process early might result in poor accuracy. Fig. 6 shows an example, where the early-stop accuracy in the yellow line is much lower than the final accuracy in the orange line, which means that some good blocks unfortunately perform worse than bad blocks when training is stopped
Figure 6. The performance of early stop training is poorer than the final accuracy of a complete training. With the help of FLOPs and Density, the redefined reward function closes the gap to the final accuracy.
early. Meanwhile, we notice that the FLOPs and density of the corresponding blocks have a negative correlation with the final accuracy. Thus, we redefine the reward function as
\text{reward} = \text{ACC}_{\text{EarlyStop}} - \mu \log(\text{FLOPs}) - \rho \log(\text{Density}),  (7)
where FLOPs [8] refers to an estimate of the computational complexity of the block, and Density is the number of edges divided by the number of nodes in the DAG of the block. Two hyperparameters, μ and ρ, balance the weights of FLOPs and Density. With the redefined reward function, the reward is more relevant to the final accuracy.
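Eq. (7) translates directly into code; a minimal sketch (the function name is ours, and μ = 1, ρ = 8 follow Section 4.2):

```python
import math

MU, RHO = 1.0, 8.0   # balancing weights for FLOPs and Density (Section 4.2)

def redefined_reward(early_stop_acc, flops, density, mu=MU, rho=RHO):
    """Early-stop reward of Eq. (7): the early-stop accuracy is penalized by the
    log of the block's estimated FLOPs and by the log of its DAG density
    (number of edges divided by number of nodes)."""
    return early_stop_acc - mu * math.log(flops) - rho * math.log(density)
```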
With this early stop strategy and the small search space of network blocks, it costs only 3 days to complete the search process with 32 GPUs, which is far superior to [37], which spends 28 days with 800 GPUs to achieve the same performance.
4. Framework and Training Details
4.1. Distributed Asynchronous Framework
To speed up the learning of the agent, we use a distributed asynchronous framework as illustrated in Fig. 7. It consists of three parts: a master node, a controller node and compute nodes. The agent first samples a batch of block structures on the master node. Afterwards, we store them in the controller node, which uses the block structures to build the entire networks and allocates these networks to compute nodes. It can be regarded as a simplified parameter-server [5, 18]. Specifically, the networks are trained in parallel on the compute nodes, which return the validation accuracy as the reward through the controller node to update the agent. With this framework, we
Figure 7. The distributed asynchronous framework. It contains three parts: master node, controller node and compute nodes.
ε       1.0   0.9   0.8   0.7   0.6   0.5   0.4   0.3   0.2   0.1
Iters   95    7     7     7     10    10    10    10    10    12
Table 2. Epsilon schedule. The number of iterations for which the agent trains at each epsilon (ε) value.
can generate networks efficiently on multiple machines with multiple GPUs.
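A highly simplified sketch of one search iteration under this framework follows (illustrative only: the agent interface `sample_block`/`update` and the compute-node routine are hypothetical stand-ins for the components of Fig. 7, and the real system communicates across machines rather than through a local process pool):

```python
from multiprocessing import Pool

def train_block_early_stop(block_codes):
    # Stand-in for a compute node: build the network from `block_codes`, train it
    # for 12 epochs with early stop, and return the validation accuracy (reward).
    raise NotImplementedError

def search_iteration(agent, num_workers=32, batch_size=64):
    batch = [agent.sample_block() for _ in range(batch_size)]    # master node samples codes
    with Pool(processes=num_workers) as pool:                    # compute nodes train in parallel
        rewards = pool.map(train_block_early_stop, batch)
    for codes, acc in zip(batch, rewards):                       # feedback updates the agent
        agent.update(codes, acc)
```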
4.2. Training Details
Epsilon-greedy Strategy. The agent is trained using Q-learning with experience replay [19] and the epsilon-greedy strategy [21]. With the epsilon-greedy strategy, a random action is taken with probability ε and the greedy action is chosen with probability 1 − ε. We decrease epsilon from 1.0 to 0.1 following the epsilon schedule shown in Table 2, so that the agent can transition smoothly from exploration to exploitation. We find that the results improve with a longer exploration period, since the search scope becomes larger and the agent can see more block structures during random exploration.
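A sketch of the schedule and the epsilon-greedy choice (the function names are ours; the (ε, iterations) pairs are those of Table 2):

```python
import random

# Epsilon schedule of Table 2: (epsilon, number of training iterations at that epsilon).
EPSILON_SCHEDULE = [(1.0, 95), (0.9, 7), (0.8, 7), (0.7, 7), (0.6, 10),
                    (0.5, 10), (0.4, 10), (0.3, 10), (0.2, 10), (0.1, 12)]

def epsilon_at(iteration):
    """Return the epsilon value for a 0-based training iteration (178 in total)."""
    remaining = iteration
    for eps, iters in EPSILON_SCHEDULE:
        if remaining < iters:
            return eps
        remaining -= iters
    return EPSILON_SCHEDULE[-1][0]   # stay at the final epsilon afterwards

def epsilon_greedy(q_values, legal_actions, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(legal_actions)
    return max(legal_actions, key=lambda a: q_values.get(a, 0.0))
```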
Experience Replay. Following [2], we employ a replay memory to store the validation accuracy and block description after each iteration. Within a given interval, i.e., at each training iteration, the agent samples 64 blocks with their corresponding validation accuracies from the memory and updates the Q-values 64 times.
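A minimal sketch of such a replay memory (whether the original samples with or without replacement is not stated, so sampling with replacement here is an assumption):

```python
import random

class ReplayMemory:
    """Stores (block description, validation accuracy) pairs; after every training
    iteration the agent draws 64 of them and replays the Q-value update for each."""

    def __init__(self):
        self.memory = []

    def add(self, block_codes, accuracy):
        self.memory.append((block_codes, accuracy))

    def sample(self, k=64):
        # Sampling with replacement keeps replay usable even when fewer than k
        # blocks have been stored early in the search.
        return [random.choice(self.memory) for _ in range(k)]
```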
BlockQNN Generation.
In the Q-learning update process, the learning rate α is set to 0.01 and the discount factor γ is 1. We set the hyperparameters μ and ρ in the redefined reward function to 1 and 8, respectively. The agent samples 64 sets of NSC vectors at a time to compose a mini-batch, and the maximum layer index for a block is set to 23. We train the agent for 178 iterations, i.e., sampling 11,392 blocks in total.
During the block searching phase, the compute nodes train each generated network for a fixed 12 epochs on CIFAR-100 using the early stop strategy described in Section 3.3. CIFAR-100 contains 60,000 samples with 100 classes, divided into training and test sets with a
ratio of 5 : 1. We train the networks without any data augmentation. The batch size is set to 256. We use the Adam optimizer [15] with β1 = 0.9, β2 = 0.999, ε = 10−8. The initial learning rate is set to 0.001 and is reduced by a factor of 0.2 every 5 epochs. All weights are initialized as in [9]. If the training result after the first epoch is worse than random guessing, we reduce the learning rate by a factor of 0.4 and restart training, with a maximum of 3 restarts.
After obtaining an optimal block structure, we build the whole network with stacked blocks and train it until convergence; the validation accuracy is used as the criterion to pick the best network. In this phase, we augment the data with random 32 × 32 cropping and horizontal flipping. All models use the SGD optimizer with the momentum rate set to 0.9 and weight decay set to 0.0005. We start with a learning rate of 0.1 and train the models for 300 epochs, reducing the learning rate at the 150th and 225th epochs. The batch size is set to 128 and all weights are initialized with MSRA initialization [9].
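In PyTorch terms, the full-training schedule above corresponds roughly to the following optimizer setup (a sketch: `model` is a placeholder, and the factor by which the learning rate is reduced at epochs 150 and 225 is not stated in the text, so the common choice of 0.1 is an assumption here):

```python
import torch

model = torch.nn.Linear(3 * 32 * 32, 10)   # placeholder; stands in for the stacked-block network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=0.0005)
# Learning rate reduced at the 150th and 225th epochs; gamma=0.1 is an assumed factor.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150, 225], gamma=0.1)
```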
Transferable BlockQNN. We also evaluate the transferability of the best auto-generated block structure searched on CIFAR-100 to a smaller dataset, CIFAR-10, with only 10 classes, and to a larger dataset, ImageNet, containing 1.2M images with 1000 classes. All experimental settings are the same as those on CIFAR-100 stated above. The ImageNet training is conducted with a mini-batch size of 256, where each image is augmented with random cropping and flipping, and is optimized with SGD. The initial learning rate, weight decay and momentum are set to 0.1, 0.0001 and 0.9, respectively. We divide the learning rate by 10 twice, at the 30th and 60th epochs. The network is trained for a total of 90 epochs. We evaluate the accuracy on the test images with a center crop.
Our framework is implemented on the PyTorch scientific computing platform. We use the CUDA backend and the cuDNN accelerated library for high-performance GPU acceleration. Our experiments are carried out on 32 NVIDIA TitanX GPUs and took about 3 days to complete the search.
5. Results
5.1. Block Searching Analysis
Fig. 8(a) shows the early stop accuracy over 178 batches on CIFAR-100, averaged over the 64 auto-generated block-wise network candidates within each mini-batch. After random exploration, the early stop accuracy grows steadily until it converges. The mean accuracy within the period of random exploration is 56%, while it finally reaches 65% in the last stage with ε = 0.1. We choose the top-100 block candidates and train their respective networks to verify the best block structure.
Figure 8. (a) Q-learning performance on CIFAR-100. The accuracy goes up as epsilon decreases, and the top models are all found in the final stage, showing that our agent can learn to generate better block structures rather than searching randomly. (b-c) Topologies of the top-2 block structures generated by our approach, called Block-QNN-A and Block-QNN-B. (d) Topology of the best block structure generated with limited parameters, named Block-QNN-S.
Figure 9. Q-learning results with different NSCs on CIFAR-100. The red line refers to searching with the PCC, i.e., the combination of ReLU, Conv and BN. The blue line stands for searching ReLU, BN and Conv separately. The red line is better than the blue from the beginning, with a large gap.
We show the top-2 block structures in Fig. 8(b-c), denoted as Block-QNN-A and Block-QNN-B. As shown in Fig. 8(a), both top-2 blocks are found in the final stage of the Q-learning process, which proves the effectiveness of the proposed method in searching optimal block structures rather than randomly sampling a large number of models. Furthermore, we observe that the generated blocks share similar properties with state-of-the-art hand-crafted networks: for example, Block-QNN-A and Block-QNN-B contain shortcut connections and multi-branch structures, which have been manually designed in residual-based and inception-based networks. Compared to other automatic generation methods, the networks generated by our approach are more elegant and can automatically and effectively reveal the beneficial properties of optimal network structures.
To squeeze the search space, as stated in Section 3.1, we define a Pre-activation Convolutional Cell (PCC) consisting of three components, i.e., ReLU, convolution and
Method                        Depth  Para   C-10  C-100
VGG [25]                      –      –      7.25  –
ResNet [10]                   110    1.7M   6.61  –
Wide ResNet [36]              28     36.5M  4.17  20.5
ResNet (pre-activation) [11]  1001   10.2M  4.62  22.71
DenseNet (k = 12) [13]        40     1.0M   5.24  24.42
DenseNet (k = 12) [13]        100    7.0M   4.10  20.20
DenseNet (k = 24) [13]        100    27.2M  3.74  19.25
DenseNet-BC (k = 40) [13]     190    25.6M  3.46  17.18
MetaQNN (ensemble) [2]        –      –      7.32  –
MetaQNN (top model) [2]       –      11.2M  6.92  27.14
NAS v1 [37]                   15     4.2M   5.50  –
NAS v2 [37]                   20     2.5M   6.01  –
NAS v3 [37]                   39     7.1M   4.47  –
NAS v3 more filters [37]      39     37.4M  3.65  –
Block-QNN-A, N=4              25     –      3.60  18.64
Block-QNN-B, N=4              37     –      3.80  18.72
Block-QNN-S, N=2              19     6.1M   4.38  20.65
Block-QNN-S more filters      22     39.8M  3.54  18.06
Table 3. Block-QNN's results (error rates) compared with state-of-the-art methods on the CIFAR-10 (C-10) and CIFAR-100 (C-100) datasets.
batch normalization (BN). Fig. 9 shows the superiority of the PCC, i.e., searching the combination of the three components, compared to searching each component separately. Searching the three components separately is more likely to generate "bad" blocks and also needs a larger search space and more time to find "good" blocks.
5.2. Results on CIFAR
Due to the small size of images (i.e., 32 × 32) in CIFAR, we set the block stack number to N = 4. We compare our generated best architectures with the state-of-the-art hand-crafted
networks and auto-generated networks in Table 3.
Comparison with hand-crafted networks – Our Block-QNN networks outperform most hand-crafted networks. DenseNet-BC [13] uses additional 1 × 1 convolutions in each composite function and a compressive transition layer to reduce parameters and improve performance, which is not adopted in our design; our performance could be further improved by using this prior knowledge.
Comparison with auto-generated networks – Our approach achieves a significant improvement over MetaQNN [2], and is even better than NAS's best model (i.e., NASv3 with more filters) [37] proposed by Google Brain, which incurs expensive costs in time and GPU resources. As shown in Table 4, NAS trains the whole system on 800 GPUs for 28 days, while we only need 32 GPUs for 3 days to reach state-of-the-art performance.
Transfer block from CIFAR-100 to CIFAR-10 – We transfer the top blocks learned on CIFAR-100 to the CIFAR-10 dataset, with all experimental settings kept the same. As shown in Table 3, the blocks also achieve state-of-the-art results on CIFAR-10 with a 3.60% error rate, which proves that Block-QNN networks have a powerful transferable ability.
Analysis of network parameters – The networks generated by our method might be complex with a large number of parameters, since we do not add any constraints during training. We further conduct an experiment on searching networks with limited parameters and adaptive block numbers. We set the maximal parameter number to 10M and obtain an optimal block (i.e., Block-QNN-S) which outperforms NASv3 with fewer parameters, as shown in Fig. 8(d). In addition, when involving more filters in each convolutional layer (e.g., from [32,64,128] to [80,160,320]), we achieve an even better result (3.54%).
5.3. Transfer to ImageNet
To demonstrate the generalizability of our approach, we transfer the block structure learned on CIFAR to the ImageNet dataset.
For the ImageNet task, we set the block repeat number to N = 3 and add more down-sampling operations before the blocks; the filter numbers for the convolution layers in the different-level blocks are [64,128,256,512]. We use the best block structure learned on CIFAR-100 directly, without any fine-tuning, and the generated network is initialized with MSRA initialization as above. The experimental results are shown in Table 5. The network generated by our framework achieves competitive results compared with other human-designed models. Recently proposed methods such as Xception [4] and ResNext [35] use special depth-wise convolution operations to reduce their total number of parameters and to improve performance. In our work, we do not use this new convolution operation, so it cannot be compared
Method        Best Model on CIFAR-10  GPUs  Time (days)
MetaQNN [2]   6.92                    10    10
NAS [37]      3.65                    800   28
Our approach  3.54                    32    3
Table 4. The computing resources and time required by our approach compared with other automatic network design methods.
Method                    Input Size  Depth  Top-1  Top-5
VGG [25]                  224×224     16     28.5   9.90
Inception V1 [30]         224×224     22     27.8   10.10
Inception V2 [14]         224×224     22     25.2   7.80
ResNet-50 [11]            224×224     50     24.7   7.80
ResNet-152 [11]           224×224     152    23.0   6.70
Xception (our test) [4]   224×224     50     23.6   7.10
ResNext-101 (64x4d) [35]  224×224     101    20.4   5.30
Block-QNN-B, N=3          224×224     38     24.3   7.40
Block-QNN-S, N=3          224×224     38     22.6   6.46
Table 5. Block-QNN's results (single-crop error rates) compared with modern methods on the ImageNet-1K dataset.
fairly, and we will consider this in our future work to further improve the performance.
As far as we know, most previous works on automatic network generation did not report competitive results on large-scale image classification datasets. With the concept of block learning, we can easily transfer an architecture learned on small datasets to a big dataset like ImageNet. In future experiments, we will try to apply the generated blocks to other tasks such as object detection and semantic segmentation.
6. Conclusion
In this paper, we show how to efficiently design high-performance network blocks with Q-learning. We use a distributed asynchronous Q-learning framework and an early stop strategy focusing on fast block structure search. We applied the framework to automatic block generation for constructing good convolutional networks. Our Block-QNN networks outperform modern hand-crafted networks as well as other auto-generated networks on image classification tasks. The best block structure, which achieves state-of-the-art performance on CIFAR, can be easily transferred to the large-scale ImageNet dataset and also yields competitive performance compared with the best hand-crafted networks. We show that searching with the block design strategy yields more elegant and interpretable network architectures. In the future, we will continue to improve the proposed framework in different aspects, such as using more powerful convolution layers and making the search process faster. We will also try to search blocks with
limited FLOPs and conduct experiments on other tasks such as detection and segmentation.
Acknowledgments
This work has been supported by the National Natural Science Foundation of China (NSFC) Grants 61721004 and 61633021.
Appendix
A. Efficiency of BlockQNN
We demonstrate the effectiveness of the proposed BlockQNN for network architecture generation on the CIFAR-100 dataset, compared to random search given an equivalent number of training iterations, i.e., the number of sampled networks. We define the effectiveness of a network architecture auto-generation algorithm as the increase in top auto-generated network performance from the initial random exploration to exploitation, since we aim at obtaining the optimal auto-generated network rather than improving the average performance.
Figure 10 shows the performance of BlockQNN and random search (RS) over a complete training process, i.e., sampling 11,392 blocks in total. The best model generated by BlockQNN is markedly better than the best model found by RS, by over 1%, in the exploitation phase on CIFAR-100. We observe the same in the mean performance of the top-5 models generated by BlockQNN compared to RS. Note that the compared random search method starts from the same exploration phase as BlockQNN for fairness.
Figure 11 shows the performance of BlockQNN with limited parameters and adaptive block numbers (BlockQNN-L) and random search under the same constraints (RS-L) over a complete training process. We see the same phenomenon: BlockQNN-L outperforms RS-L by over 1% in the exploitation phase. These results prove that our BlockQNN can learn to generate better network architectures rather than searching randomly.
B. Evolutionary Process of Auto-Generated Blocks
We sample the block structures with median performance generated by our approach at different stages, i.e., at iterations [1, 30, 60, 90, 110, 130, 150, 170], to show the evolutionary process. As illustrated in Figure 12 and Figure 13, i.e., for BlockQNN and BlockQNN-L respectively, the block structures generated in the random exploration stage are much simpler than the structures generated in the exploitation stage.
In the exploitation stage, multi-branch structures appear frequently. Note that the number of connections
Figure 10. Measuring the efficiency of BlockQNN compared to random search (RS) for learning neural architectures. The x-axis measures the training iterations (batch size is 64), i.e., the total number of architectures sampled, and the y-axis is the early stop performance after 12 epochs of CIFAR-100 training. Each pair of curves measures the mean accuracy across the top-ranking models generated by each algorithm. Best viewed in color.
Figure 11. Measuring the efficiency of BlockQNN with limited parameters and adaptive block numbers (BlockQNN-L) compared to random search with limited parameters and adaptive block numbers (RS-L) for learning neural architectures. The x-axis measures the training iterations (batch size is 64), i.e., the total number of architectures sampled, and the y-axis is the early stop performance after 12 epochs of CIFAR-100 training. Each pair of curves measures the mean accuracy across the top-ranking models generated by each algorithm. Best viewed in color.
gradually increases, and the blocks tend to choose "Concat" as the last layer. We also find that shortcut connections and elemental add layers are common in the exploitation stage. Additionally, blocks generated by BlockQNN-L have fewer "Conv,5" layers, i.e., convolution layers with kernel size 5, because of the parameter limit.
This shows that our approach can learn universal design concepts for good network blocks. Compared to other automatic network architecture design methods, our generated networks are more elegant and interpretable.
Figure 12. Evolutionary process of blocks generated by BlockQNN. We sample the block structures with median performance at iteration [1, 30, 60, 90, 110, 130, 150, 170] to compare the difference between the blocks in the random exploration stage and the blocks in the exploitation stage.
Figure 13. Evolutionary process of blocks generated by BlockQNN with limited parameters and adaptive block numbers (BlockQNN-L). We sample the block structures with median performance at iteration [1, 30, 60, 90, 110, 130, 150, 170] to compare the difference between the blocks in the random exploration stage and the blocks in the exploitation stage.
C. Additional Experiment
We also use BlockQNN to generate an optimal model for the person keypoint task. The training process is conducted on the MPII dataset, and we then transfer the best model found on MPII to the COCO challenge. It costs 5 days to complete the search process. The auto-generated network for the keypoint task outperforms the state-of-the-art 2-stack hourglass network, i.e., 70.5 AP compared to 70.1 AP on the COCO validation set.
References
[1] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016. 3
[2] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neu- ral network architectures using reinforcement learning. In 6th International Conference on Learning Representations, 2017. 1,2,4,5,6,7,8
[3] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pages 2546–2554, 2011. 3
[4] F. Chollet. Xception: Deep learning with depthwise separa- ble convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2017. 8
[5] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale dis- tributed deep networks. In Advances in neural information processing systems, pages 1223–1231, 2012. 5
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei- Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009. 2
[7] T. Domhan, J. T. Springenberg, and F. Hutter. Speeding up automatic hyperparameter optimization of deep neural net- works by extrapolation of learning curves. In IJCAI, pages 3460–3468, 2015. 2
[8] K. He and J. Sun. Convolutional neural networks at con- strained time cost. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5353– 5360, 2015. 5
[9] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international con- ference on computer vision, pages 1026–1034, 2015. 6
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learn- ing for image recognition. In Proceedings of the IEEE con- ference on computer vision and pattern recognition, pages 770–778, 2016. 1, 2, 7
[11] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Com- puter Vision, pages 630–645. Springer, 2016. 1, 2, 3, 7, 8
[12] S. Hochreiter, A. S. Younger, and P. R. Conwell. Learning to learn using gradient descent. In International Conference on
Artificial Neural Networks, pages 87–94. Springer, 2001. 3
[13] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In Proceed- ings of the IEEE conference on computer vision and pattern
recognition, 2017. 7, 8
[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating
deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448– 456, 2015. 1, 2, 8
[15] D. Kingma and J. Ba. Adam: A method for stochastic opti- mization. In 3rd International Conference for Learning Rep- resentations, 2015. 6
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Proc. Advances in Neural Information Processing Systems, pages 1097–1105, 2012. 1
[17] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015. 1
[18] M. Li, L. Zhou, Z. Yang, A. Li, F. Xia, D. G. Andersen, and A. Smola. Parameter server for distributed machine learning. In Big Learning NIPS Workshop, volume 6, page 2, 2013. 5
[19] L.-J. Lin. Reinforcement learning for robots using neural networks. Technical report, Carnegie-Mellon Univ Pitts- burgh PA School of Computer Science, 1993. 1, 6
[20] M. Lin, Q. Chen, and S. Yan. Network in network. In In- ternational Conference on Learning Representations, 2013. 4
[21] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep rein- forcement learning. Nature, 518(7540):529–533, 2015. 1, 6
[22] A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287, 1999. 5
[23] S. Saxena and J. Verbeek. Convolutional neural fabrics. In Advances in Neural Information Processing Systems, pages 4053–4061, 2016. 2
[24] J. D. Schaffer, D. Whitley, and L. J. Eshelman. Combinations of genetic algorithms and neural networks: A survey of the state of the art. In Combinations of Genetic Algorithms and Neural Networks, 1992., COGANN-92. International Workshop on, pages 1–37. IEEE, 1992. 2
[25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In 3rd Interna- tional Conference for Learning Representations, 2015. 1, 7, 8
[26] K. O. Stanley, D. B. D’Ambrosio, and J. Gauci. A hypercube-based encoding for evolving large-scale neural networks. Artificial life, 15(2):185–212, 2009. 2
[27] K. O. Stanley and R. Miikkulainen. Evolving neural net- works through augmenting topologies. Evolutionary compu- tation, 10(2):99–127, 2002. 2
[28] M. Suganuma, S. Shirakawa, and T. Nagao. A genetic pro- gramming approach to designing convolutional neural net- work architectures. In Proceedings of the Genetic and Evo- lutionary Computation Conference, pages 497–504, 2017. 2
[29] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998. 5
[30] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015. 1, 2, 8
[31] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016. 1, 2
[32] R. Vilalta and Y. Drissi. A perspective view and survey of meta-learning. Artificial Intelligence Review, 18(2):77–95, 2002. 3
[33] C. J. C. H. Watkins. Learning from delayed rewards. PhD thesis, King's College, Cambridge, 1989. 1
[34] L. Xie and A. Yuille. Genetic CNN. In Proceedings of the International Conference on Computer Vision, 2017. 2
[35] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5987–5995. IEEE, 2017. 8
[36] S. Zagoruyko and N. Komodakis. Wide residual networks. In British Machine Vision Conference, 2016. 7
[37] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In 6th International Conference on Learning Representations, 2017. 1, 2, 5, 7, 8