Convolutional Neural Networks for No-Reference Image Quality Assessment
Le Kang1, Peng Ye1, Yi Li2, and David Doermann 1
1University of Maryland, College Park, MD, USA
2NICTA and ANU, Canberra, Australia
1 {lekang,pengye,doermann}@umiacs.umd.edu 2 yi.li@cecs.anu.edu.au
Abstract
In this work we describe a Convolutional Neural Network (CNN) to accurately predict image quality without a reference image. Taking image patches as input, the CNN works in the spatial domain without using hand-crafted features that are employed by most previous methods. The network consists of one convolutional layer with max and min pooling, two fully connected layers and an output node. Within the network structure, feature learning and regression are integrated into one optimization process, which leads to a more effective model for estimating image quality. This approach achieves state of the art performance on the LIVE dataset and shows excellent generalization ability in cross dataset experiments. Further experiments on images with local distortions demonstrate the local quality estimation ability of our CNN, which is rarely reported in previous literature.
1. Introduction
This paper presents a Convolutional Neural Network (CNN) that can accurately predict the quality of distorted images with respect to human perception. The work focuses on the most challenging category of objective image quality assessment (IQA) tasks: general-purpose No-Reference IQA (NR-IQA), which evaluates the visual quality of digital images without access to reference images and without prior knowledge of the types of distortions present.
Visual quality is a very complex yet inherent characteristic of an image. In principle, it is the measure of the distortion compared with an ideal imaging model or perfect reference image. When reference images are available, Full Reference (FR) IQA methods [14, 22, 16, 17, 19] can be applied to directly quantify the differences between distorted images and their corresponding ideal versions. State of the art FR measures, such as VIF [14] and FSIM [22], achieve a very high correlation with human perception.
The partial support of this research by DARPA through BBN/DARPA Award HR0011-08-C-0004 under subcontract 9500009235, and by the US Government through NSF Awards IIS-0812111 and IIS-1262122, is gratefully acknowledged.
However, in many practical computer vision applications there do not exist perfect versions of the distorted images, so NR-IQA is required. NR-IQA measures can directly quantify image degradations by exploiting features that are discriminant for image degradations. Most successful approaches use Natural Scene Statistics (NSS) based features. Typically, NSS based features characterize the distributions of certain filter responses. Traditional NSS based features are extracted in image transformation domains using, for example, the wavelet transform [10] or the DCT transform [13]. These methods are usually very slow due to the use of computationally expensive image transformations. Recent developments in NR-IQA methods – CORNIA [20, 21] and BRISQUE [9] – promote extracting features from the spatial domain, which leads to a significant reduction in computation time. CORNIA demonstrates that it is possible to learn discriminant image features directly from the raw image pixels, instead of using handcrafted features.
Based on these observations, we explore using a Convolutional Neural Network (CNN) to learn discriminant features for the NR-IQA task. Recently, deep neural networks have gained researchers' attention and achieved great success on various computer vision tasks. Specifically, CNN has shown superior performance on many standard object recognition benchmarks [6, 7, 4]. One of CNN's advantages is that it can take raw images as input and incorporate feature learning into the training process. With a deep structure, the CNN can effectively learn complicated mappings while requiring minimal domain knowledge.
To the best of our knowledge, CNN has not been applied to general-purpose NR-IQA. The primary reason is that the original CNN is not designed for capturing image quality features. In the object recognition domain, good features generally encode local invariant parts; however, for the NR-IQA task, good features should be able to capture NSS properties. This difference between NR-IQA and object recognition makes the application of CNN nonintuitive. One of our contributions is that we modify the network structure so that it can learn image quality features more effectively and estimate image quality more accurately.
Another contribution of our paper is that we propose a novel framework that allows learning and prediction of image quality on local regions. Previous approaches typically accumulate features over the entire image to obtain statistics for estimating overall quality, and have rarely shown the ability to estimate local quality, except for a simple example in [18]. By contrast, our method can estimate quality on small patches (such as 32 × 32). Local quality estimation is important for image denoising and reconstruction problems, where enhancement can be applied only where required.
We show experimentally that the proposed method advances the state of the art. On the LIVE dataset our CNN outperforms CORNIA and BRISQUE, and achieves results comparable to state of the art FR measures such as FSIM [22]. In addition to the superior overall performance, we also show qualitative results that demonstrate the local quality estimation ability of our method.
2. Related Work
Previously, researchers have attempted to use neural networks for NR-IQA. Li et al. [8] applied a general regression neural network that takes as input perceptual features including phase congruency, entropy and image gradients. Chetouani et al. [3] used a neural network to combine multiple distortion-specific NR-IQA measures. These methods require pre-extracted handcrafted features and only use neural networks to learn the regression function. Thus they do not have the advantage of learning features and regression models in a holistic way, and they are inferior to state of the art approaches. In contrast, our method does not require any handcrafted features and directly learns discriminant features from normalized raw image pixels to achieve much better performance.
The use of convolutional neural networks is partly motivated by the feature learning framework introduced in CORNIA [20, 21]. First, the CORNIA features are learned directly from normalized raw image patches. This implies that it is possible to extract discriminative features from the spatial domain without complicated image transformations. Second, supervised CORNIA [21] employs a two-layer structure which learns the filters and the weights in the regression model simultaneously, based on an EM-like approach. This structure can be viewed as an empirical implementation of a two-layer neural network. However, it does not utilize the full power of neural networks.
Our approach integrates feature learning and regression into the general CNN framework. The advantages are twofold. First, making the network deeper raises the learning capacity significantly [1]. In the following sections we will see that with fewer filters/features than CORNIA, we are able to achieve state of the art results. Second, in the CNN framework, training the network as a whole using a simple method like backpropagation makes it convenient to incorporate recent techniques designed to improve learning, such as dropout [5] and rectified linear units [7]. Furthermore, once we build the bridge between NR-IQA and CNN, the rapidly developing deep learning community will be a significant source of novel techniques for advancing NR-IQA performance.
3. CNN for NR-IQA
The proposed framework for using a CNN for image quality estimation is as follows. Given a grayscale image, we first perform contrast normalization, then sample non-overlapping patches from it. We use a CNN to estimate the quality score for each patch and average the patch scores to obtain a quality estimate for the image.
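To make this pipeline concrete, the following minimal Python sketch shows how an image-level score could be assembled from patch-level predictions. Here `predict_patch` and `local_normalize` are hypothetical callables standing in for the trained CNN and the contrast normalization of Section 3.2; they are not part of the original paper.

```python
import numpy as np

def predict_image_score(image, predict_patch, local_normalize, patch_size=32):
    """Average patch-level CNN scores over non-overlapping patches.

    `predict_patch` and `local_normalize` are hypothetical stand-ins
    for the trained CNN and the normalization of Sec. 3.2.
    """
    normalized = local_normalize(image)
    H, W = normalized.shape
    scores = []
    for i in range(0, H - patch_size + 1, patch_size):
        for j in range(0, W - patch_size + 1, patch_size):
            patch = normalized[i:i + patch_size, j:j + patch_size]
            scores.append(predict_patch(patch))
    return float(np.mean(scores))  # image-level quality estimate
```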
3.1. Network Architecture
The proposed network consists of five layers. Figure 1 shows the architecture of our network, which is a 32×32 − 26×26×50 − 2×50 − 800 − 800 − 1 structure. The input consists of locally normalized 32 × 32 image patches. The first layer is a convolutional layer which filters the input with 50 kernels, each of size 7 × 7, with a stride of 1 pixel. The convolutional layer produces 50 feature maps, each of size 26 × 26, followed by a pooling operation that reduces each feature map to one max and one min value. Two fully connected layers of 800 nodes each come after the pooling. The last layer is a simple linear regression with a one-dimensional output that gives the score.
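As an illustration only, a modern re-implementation of this architecture might look as follows in PyTorch (a framework that postdates this paper). The layer sizes follow the 32×32 − 26×26×50 − 2×50 − 800 − 800 − 1 structure above; initialization and the exact dropout placement are assumptions, and the linear (non-ReLU) convolution units and single dropout layer anticipate the choices described in Sections 3.4 and 3.5.

```python
import torch
import torch.nn as nn

class IQACNN(nn.Module):
    """Sketch of the patch-level IQA network: conv + min/max pooling + two FC layers."""
    def __init__(self, n_kernels=50):
        super().__init__()
        # 32x32 input -> 50 feature maps of 26x26 (7x7 kernels, stride 1)
        self.conv = nn.Conv2d(1, n_kernels, kernel_size=7, stride=1)
        self.fc1 = nn.Linear(2 * n_kernels, 800)
        self.fc2 = nn.Linear(800, 800)
        self.out = nn.Linear(800, 1)    # linear regression to a quality score
        self.dropout = nn.Dropout(0.5)  # dropout at the second FC layer (Sec. 3.5)

    def forward(self, x):               # x: (batch, 1, 32, 32)
        fm = self.conv(x)               # linear conv units, no ReLU (Sec. 3.4)
        flat = fm.flatten(2)            # (batch, 50, 26*26)
        pooled = torch.cat([flat.max(dim=2).values,
                            flat.min(dim=2).values], dim=1)  # (batch, 100)
        h = torch.relu(self.fc1(pooled))
        h = self.dropout(torch.relu(self.fc2(h)))
        return self.out(h)              # (batch, 1) predicted score
```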
3.2. Local Normalization
Previous NR-IQA methods, such as BRISQUE and CORNIA, typically apply a contrast normalization. In this work, we employ a simple local contrast normalization method similar to [9]. Suppose the intensity value of a pixel at location $(i, j)$ is $I(i, j)$; we compute its normalized value $\hat{I}(i, j)$ as follows:

$$\hat{I}(i, j) = \frac{I(i, j) - \mu(i, j)}{\sigma(i, j) + C}$$

$$\mu(i, j) = \sum_{p=-P}^{P} \sum_{q=-Q}^{Q} I(i+p, j+q)$$

$$\sigma(i, j) = \sqrt{\sum_{p=-P}^{P} \sum_{q=-Q}^{Q} \left( I(i+p, j+q) - \mu(i, j) \right)^2} \qquad (1)$$

where $C$ is a positive constant that prevents division by zero, and $P$ and $Q$ are the normalization window sizes.
In [9], it was shown that a smaller normalization window size improves performance. In practice we pick $P = Q = 3$, so the window size is much smaller than the input image patch. Note that with this local normalization, each pixel may have a different local mean and variance.
Local normalization is important. We observe that using larger normalization windows leads to worse performance. Specifically, a uniform normalization, which applies the mean and variance of the entire image patch to each pixel, causes about a 3% drop in performance.
It is worth noting that when using a CNN for object recognition, a global contrast normalization is usually applied to the entire image. The normalization not only alleviates the saturation problem common in early work that used sigmoid neurons, but also makes the network robust to illumination and contrast variation. For the NR-IQA problem, contrast normalization should be applied locally. Additionally, although luminance and contrast change can be considered distortions in some applications, we mainly focus on distortions arising from image degradations, such as blur, compression and additive noise.
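A minimal NumPy sketch of Eq. (1) follows, computed with SciPy's uniform box filter. The local mean and standard deviation over the $(2P+1) \times (2Q+1)$ window are assumed to use uniform weighting, and the constant $C = 1$ is an illustrative choice, since the paper does not fix its value.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_normalize(image, P=3, C=1.0):
    """Local contrast normalization in the spirit of Eq. (1).

    Uses the local mean and standard deviation over a (2P+1)x(2P+1)
    window; uniform weighting and C=1 are assumptions.
    """
    img = image.astype(np.float64)
    size = 2 * P + 1                              # window spans [-P, P] per axis
    mu = uniform_filter(img, size=size)           # local mean mu(i, j)
    mu_sq = uniform_filter(img * img, size=size)  # local mean of squares
    sigma = np.sqrt(np.maximum(mu_sq - mu**2, 0.0))  # local std sigma(i, j)
    return (img - mu) / (sigma + C)
```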
3.3. Pooling
In the convolutional layer, the locally normalized image patches are convolved with 50 filters and each filter generates a feature map. We then apply pooling to each feature map to reduce the filter responses to a lower dimension. Specifically, each feature map is pooled into one max value and one min value, which is similar to CORNIA. Let $R_k(i, j)$ denote the response at location $(i, j)$ of the feature map obtained by the $k$-th filter; the pooled values $u_k$ and $v_k$ are given by

$$u_k = \max_{i,j} R_k(i, j), \qquad v_k = \min_{i,j} R_k(i, j) \qquad (2)$$

where $k = 1, 2, \ldots, K$ and $K$ is the number of kernels. The pooling procedure reduces each feature map to a 2-dimensional feature vector; therefore, each node of the next fully connected layer takes an input of size $2 \times K$. It is worth noting that although max pooling already works well, introducing min pooling boosts performance by about 2%.

In the object recognition scenario, pooling is typically performed on every 2 × 2 cell. In that case, selecting a representative filter response from each small cell may keep some location information while achieving robustness to translation. This operation is particularly helpful for object recognition, since objects can typically be modeled as multiple parts organized in a certain spatial order. For the NR-IQA task, however, we observe that image distortions are often locally (if not globally) homogeneous, i.e. the same level of distortion occurs at all locations of a 32 × 32 patch, for example. The lack of obvious global spatial structure in image distortions enables pooling without keeping locations, which reduces the cost of computation.
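For clarity, the pooling of Eq. (2) reduces each of the $K$ feature maps to a single (max, min) pair; a minimal NumPy sketch, assuming the feature maps arrive as a $K \times H \times W$ array:

```python
import numpy as np

def minmax_pool(feature_maps):
    """Eq. (2): pool each of K feature maps (K x H x W) to its max and min.

    Returns a length-2K vector [u_1..u_K, v_1..v_K] that feeds the
    fully connected layers.
    """
    K = feature_maps.shape[0]
    flat = feature_maps.reshape(K, -1)
    u = flat.max(axis=1)  # u_k = max over (i, j)
    v = flat.min(axis=1)  # v_k = min over (i, j)
    return np.concatenate([u, v])
```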
Figure 1: The architecture of our CNN.

3.4. ReLU Nonlinearity
Instead of traditional sigmoid or tanh neurons, we use Rectified Linear Units (ReLUs) [11] in the two fully connected layers. [7] demonstrated in a deep CNN that ReLUs enable the network to train several times faster compared to using tanh units. Here we give a brief description of ReLUs. ReLUs take a simple form of nonlinearity by applying a thresholding function to the input, in place of the sigmoid or tanh transform. Let $g$, $w_i$ and $a_i$ denote the output of the ReLU, the weights of the ReLU and the outputs of the previous layer, respectively; then the ReLU can be mathematically described as $g = \max(0, \sum_i w_i a_i)$.
Note that ReLUs only allow nonnegative signals to pass through. Due to this property, we do not use ReLUs but instead use linear neurons (identity transform) in the convolutional and pooling layers. The reason is that min pooling typically produces negative values, and we do not want to block the information in these negative pooling outputs.
3.5. Learning
We train our network on non-overlapping 32 × 32 patches taken from large images. For training, we assign each patch a quality score equal to its source image's ground truth score. We can do this because the training images in our experiments have homogeneous distortions. During the test stage, we average the predicted patch scores for each image to obtain the image-level quality score. By taking small patches as input, we have a much larger number of training samples compared to using the whole image on a given dataset, which particularly meets the needs of CNNs.
Let $x_n$ and $y_n$ denote the input patch and its ground truth score respectively, and let $f(x_n; w)$ be the predicted score of $x_n$ with network weights $w$. Support Vector Regression (SVR) with $\varepsilon$-insensitive loss has been successfully applied to learn the regression function for NR-IQA in previous work [21, 9]. We adopt a similar objective function:

$$L = \frac{1}{N} \sum_{n=1}^{N} \left\| f(x_n; w) - y_n \right\|_{\ell_1}, \qquad w' = \arg\min_w L \qquad (3)$$

Note that the above loss function is equivalent to the loss function used in $\varepsilon$-SVR with $\varepsilon = 0$. Stochastic gradient descent (SGD) and backpropagation are used to solve this problem. A validation set is used to select parameters of the trained model and prevent overfitting. In experiments we perform SGD for 40 epochs during training and keep the model parameters that generate the highest Linear Correlation Coefficient (LCC) on the validation set.

Recent successful neural network methods [7, 5] report that dropout and momentum improve learning. In our experiments we also find that these two techniques boost performance.

Dropout is a technique that prevents overfitting when training neural networks. Typically the outputs of neurons are set to zero with a probability of 0.5 in the training stage and divided by 2 in the test stage. By randomly masking out neurons, dropout is an efficient approximation of training many different networks with shared weights. In our experiments, since applying dropout to all layers significantly increases the time to reach convergence, we only apply dropout at the second fully connected layer.

Updating the network weights with momentum is a widely adopted strategy. We update the weights in the following form:

$$\Delta w_t = r_t \Delta w_{t-1} - (1 - r_t)\, \varepsilon_t \nabla_w L$$
$$w_t = w_{t-1} + \Delta w_t$$
$$\varepsilon_t = \varepsilon_0 \, d^{\,t}, \qquad r_t = \begin{cases} \frac{t}{T} r_e + \left(1 - \frac{t}{T}\right) r_s, & t \le T \\ r_e, & t > T \end{cases} \qquad (4)$$

where $w_t$ is the weight at epoch $t$, $\varepsilon_0 = 0.1$ is the learning rate, $d = 0.9$ is the decay for the learning rate, $r_s = 0.9$ and $r_e = 0.5$ are the starting and ending momentums respectively, and $T = 10$ is a threshold that controls how the momentum changes with the number of epochs. Note that unlike [5], where momentum starts off at a value of 0.5 and stays at 0.99, we use a large momentum at the beginning and reduce it as training progresses. We found through experiments that this setting achieves better performance.
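A small Python sketch of this update rule, using the hyperparameters quoted above; the gradient `grad` is assumed to come from backpropagation of the $\ell_1$ loss in Eq. (3), and the linear momentum schedule for $t \le T$ is as reconstructed in Eq. (4).

```python
def momentum_at_epoch(t, T=10, r_s=0.9, r_e=0.5):
    """Momentum schedule of Eq. (4): starts at r_s, reaches r_e at epoch T."""
    return (t / T) * r_e + (1.0 - t / T) * r_s if t <= T else r_e

def sgd_momentum_step(w, delta_w_prev, grad, t, eps0=0.1, d=0.9):
    """One update: delta_w_t = r_t * delta_w_{t-1} - (1 - r_t) * eps_t * grad."""
    r_t = momentum_at_epoch(t)
    eps_t = eps0 * d**t  # exponentially decayed learning rate
    delta_w = r_t * delta_w_prev - (1.0 - r_t) * eps_t * grad
    return w + delta_w, delta_w
```

The arguments may be scalars or NumPy arrays; in practice one such step would be applied per weight tensor per SGD iteration.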
4. Experiment
4.1. Experimental Protocol
Datasets: The following two datasets are used in our experiments.
(1) LIVE [15]: A total of 779 distorted images with five different distortions – JP2K compression (JP2K), JPEG compression (JPEG), white Gaussian noise (WN), Gaussian blur (BLUR) and fast fading (FF) – at 7-8 degradation levels derived from 29 reference images. Differential Mean Opinion Scores (DMOS) are provided for each image, roughly in the range [0, 100]. Higher DMOS indicates lower quality.
(2) TID2008 [12]: 1700 distorted images with 17 different distortions derived from 25 reference images at 4 degradation levels. In our experiments, we consider only the four common distortions that are shared with the LIVE dataset, i.e. JP2K, JPEG, WN and BLUR. Each image is associated with a Mean Opinion Score (MOS) in the range [0, 9]. Contrary to DMOS, higher MOS indicates higher quality.

Evaluation: Two measures are used to evaluate the performance of IQA algorithms: 1) the Linear Correlation Coefficient (LCC) and 2) the Spearman Rank Order Correlation Coefficient (SROCC). LCC measures the linear dependence between two quantities, and SROCC measures how well one quantity can be described as a monotonic function of another. We report results obtained from 100 train-test iterations, where in each iteration we randomly select 60% of the reference images and their distorted versions as the training set, 20% as the validation set, and the remaining 20% as the test set.
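For reference, both evaluation measures are available in SciPy; a minimal sketch with hypothetical score lists:

```python
from scipy.stats import pearsonr, spearmanr

def evaluate_iqa(predicted, ground_truth):
    """Compute LCC (Pearson) and SROCC (Spearman) between score lists."""
    lcc, _ = pearsonr(predicted, ground_truth)
    srocc, _ = spearmanr(predicted, ground_truth)
    return lcc, srocc

# Example with made-up scores:
# evaluate_iqa([30.1, 55.2, 70.3], [28.0, 60.0, 72.0])
```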
4.2. Evaluation on LIVE
On the LIVE dataset, for the distortion-specific experiments we train and test on each of the five distortions: JP2K, JPEG, WN, BLUR and FF. For the non-distortion-specific experiment, images of all five distortions are trained and tested together without providing a distortion type.
Table 1 shows the results of the two experiments compared with previous state of the art NR-IQA methods as well as FR-IQA methods. Results of the best performing NR-IQA systems are in bold. The FR-IQA measures are evaluated by using 80% of the data for fitting a non-linear logistic function, then testing on 20% of the data. We can see from Table 1 that our approach works well on each of