Hallucinated-IQA: No-Reference Image Quality Assessment via Adversarial Learning
Kwan-Yee Lin¹ and Guanxiang Wang²
¹Department of Information Science, School of Mathematical Sciences, Peking University
²Department of Mathematics, School of Mathematical Sciences, Peking University
¹linjunyi@pku.edu.cn  ²gxwang@math.pku.edu.cn

Abstract
No-reference image quality assessment (NR-IQA) is a fundamental yet challenging task in the low-level computer vision community. The difficulty is particularly pronounced because of the limited available information: the corresponding reference for comparison is typically absent. Although various feature extraction mechanisms have been leveraged in previous methods, from natural scene statistics to deep neural networks, the performance bottleneck still exists.
In this work, we propose a hallucination-guided quality regression network to address the issue. We first generate a hallucinated reference, constrained on the distorted image, to compensate for the absence of the true reference. Then, we pair the information of the hallucinated reference with the distorted image and forward them to the regressor, which learns the perceptual discrepancy with the guidance of an implicit ranking relationship within the generator and therefore produces a precise quality prediction. To demonstrate the effectiveness of our approach, comprehensive experiments are conducted on four popular image quality assessment benchmarks. Our method significantly outperforms all previous state-of-the-art methods by large margins. The code and model are publicly available on the project page: https://kwanyeelin.github.io/projects/HIQA/HIQA.html.
1. Introduction
Image quality assessment (IQA) refers to the challenging task of automatically predicting the perceptual quality of a distorted image. IQA serves as a key component in the low-level computer vision community and has a wide range of applications [13, 26, 49].
IQA algorithms can be classified into three categories: full-reference IQA (FR-IQA) [50, 24, 19], reduced-reference IQA (RR-IQA) [11], and general-purpose no-reference IQA (NR-IQA) [46, 17, 42, 51, 44, 20, 25].
Figure 1: An illustration of our motivation. The first column is the ground-truth reference image, which is undistorted. The second column shows several kinds of distortion that easily occur. The third column demonstrates the hallucinated reference images generated by our approach. The fourth column is the discrepancy map, which captures rich information that can be utilized to guide the learning of the quality regression network toward highly accurate results.
Although FR-IQA and RR-IQA metrics have achieved remarkable results over the decades, their precondition of requiring a corresponding non-distorted reference image for comparison during the quality prediction process makes these metrics infeasible in practical applications, since it is hard, and in most cases even impossible, to obtain an ideal reference image. In contrast, NR-IQA, which takes only the distorted image to be assessed as input without any additional information, is more realistic and has therefore received substantial attention in recent years. However, the ill-posed definition makes it highly challenging for NR-IQA to produce a good image quality prediction.
The ill-posed nature of the underdetermined NR-IQA problem is particularly pronounced because of the limited information: the form of distortion and the
corresponding non-distorted reference image are typically absent. This is counter-intuitive, since the human visual system (HVS) needs a reference to quantify the perceptual discrepancy, comparing the distorted image either directly with the original undistorted image or implicitly with a hallucinated scene in mind, as demonstrated in Figure 1(a). The ill-posed definition has become the most essential issue of the NR-IQA task and has caused the performance bottleneck of the last decade.
Numerous efforts have been made to ease this problem by designing powerful feature representation models. Traditional methods commonly use manually designed statistical representations, and hence lack the diversity and flexibility to model the multiple complex distortion types¹ and the large span of image contents (e.g., human, animal, plant, cityscape, transportation, etc.) in NR-IQA. In recent years, the promising results of Deep Neural Networks (DNNs) in many computer vision tasks [8, 5, 43] have encouraged researchers to exploit their formidable feature representation power for the NR-IQA task. Nevertheless, the extremely limited number of annotated samples in public datasets greatly limits the advantage of DNNs in NR-IQA. To better leverage the power of DNNs, previous works usually employ sophisticated multi-task and data augmentation strategies with extra annotated rankings, proxy quality scores, or distortion information, which are unavailable in practical NR-IQA applications, and hence lack feasibility for modeling unknown, multiple distortion types. Some works attempt to transfer general image feature representations from a model pre-trained on ImageNet [6] to quality prediction; however, the low correlation and similarity between NR-IQA and the image classification task reduce the effectiveness of such transfer learning.
In this work, a Hallucination-Guided Quality Regression Network is proposed to simulate the behaviour of the human visual system (HVS), making precise predictions by leveraging perceptual discrepancy information between the distorted image and a hallucinated reference. As shown in Fig. 2, a high-resolution scene hallucination is first generated from the distorted image. Then, the discrepancy map, which naturally encodes the difference between the distorted image and the hallucinated reference, is obtained to guide the learning of the regression network. With this strong and clearly defined discrepancy information incorporated, the ill-posed nature of NR-IQA can be dramatically overcome. Therefore, even with common data augmentation, our approach leads to better performance than all of the conventional sophisticated methods.
A straightforward way to generate the hallucinated reference is to leverage state-of-the-art image
¹An image can be distorted at any stage of its life cycle, from acquisition to storage, and will therefore undergo diverse distortions, such as noise corruption, compression artifacts, transmission errors, under-/over-exposure, etc. For more details, please refer to [35].
super-resolution [23, 22, 40], blind deblurring [33], or inpainting [34] methods to reconstruct images from the distorted ones. However, since an image can be corrupted by multiple unknown distortions, which breaks the basic assumptions² of these related fields, it is impractical to use them to obtain a reconstructed image that qualifies as the agent reference for the NR-IQA task. To this end, a Quality-Aware Generative Network is proposed to generate the hallucinated reference with a novel quality-aware perceptual term designed specifically for the NR-IQA task at hand.
While the Quality-Aware Generative Network is robust to most distortion types and levels, it is still very challenging for a method under the DCNN framework to reconstruct high-frequency details with realistic texture when the distorted image lacks structure information, as shown in Fig. 1(c). Since the hallucinated reference is crucial for the final prediction, a bad hallucination will introduce a large bias and lead the regression results to suboptimal values. We propose to tackle this problem with the following two mechanisms, from low level to high level. (1) We introduce the adversarial learning idea to hallucinated reference generation and quality prediction with a novel IQA-discriminator, which, on the one hand, encourages the generated hallucinated scene to be perceptually hard to distinguish from true reference images and, on the other hand, at a low semantic level, constrains the influence of bad hallucinations on the quality regression network. (2) A novel high-level semantic fusion mechanism is introduced to further reduce the instability of the quality regression network caused by the hallucination model. It explores the implicit ranking relationship within the hallucination network as guidance to help the regression network adjust the image quality prediction in an adaptive manner. The quality-aware generative network, hallucination-guided quality regression network, and IQA-discriminator can be jointly optimised in an end-to-end manner.
The main contributions of this work are three-fold:
A novel Hallucination-Guided Quality Regression Network is proposed to incorporate perceptual discrepancy information into network learning, which overcomes the ill-posed nature of NR-IQA and significantly improves prediction precision and robustness.
A Quality-Aware Generative Network together with a quality-aware perceptual loss is proposed, in which both texture feature similarity and quality feature similarity are taken into consideration in a complementary manner to help generate qualified hallucinated references.
²For example, super-resolution methods usually assume the blur kernel or form is known.
Since the hallucinated reference is crucial for the final prediction, an IQA-Discriminator and an implicit ranking relationship fusion scheme are introduced to better guide the learning of the generator and to suppress the negative influence of scene hallucination on quality regression, in a low-level to high-level manner.
We evaluate the proposed method on four broadly used image quality assessment benchmarks, including LIVE [41], CSIQ [21], TID2008 [36], and TID2013 [35]. Our approach shows superior performance over all of the state-of-the-art NR-IQA methods by significant margins. A comprehensive ablation study further demonstrates the effectiveness of each component.
2. Related Work
No-reference Image Quality Assessment. In the literature of NR-IQA, besides classic methods [31, 38, 29, 47] and their improved versions [51, 44, 27], significant progress has recently been achieved by exploring DNNs for better feature representation [17, 18, 45, 20, 25, 48]. For example, Kang et al. [17] introduce a shallow ConvNet to model quality prediction. This approach is refined to a multi-task CNN [18], where the network learns both distortion type and quality score simultaneously. Bianco et al. [2] use a pre-trained DCNN fine-tuned on an IQA dataset to extract features, then map them to IQA scores with an SVR model. Zeng et al. [48] also propose to extract features with a pre-trained ResNet [14]; instead of learning IQA scores directly, they fine-tune the network to learn a probabilistic representation of distorted images. According to the distortion types and levels in particular datasets, Liu et al. [25] synthesize masses of ranked images to train a Siamese network to learn rankings for NR-IQA. Liang et al. [24] propose to use a non-aligned similar scene as a reference. Kim and Lee [20] apply state-of-the-art FR-IQA methods to generate proxy scores on patches as the ground truth to pre-train the model, and then fine-tune it for NR-IQA.
In this work, we propose a unique approach that addresses the ill-posed problem by compensating for the absent reference information without any extra data annotation or prior knowledge, which therefore offers greater flexibility and feasibility than other methods.
Generative Adversarial Networks. GANs [12] and their variants [37, 1, 28] flourish in generating natural images such as human faces [3] and indoor scenes [7]. However, generating high-resolution images (e.g., 256×256) leads GANs to training instability and sometimes nonsensical outputs, as shown in [16]. Since our ultimate goal is NR-IQA, and the performance of the quality regression network is closely related to the output of the generator, instead of applying the original discriminator we tailor the adversarial learning scheme for image quality assessment by introducing an effective IQA-discriminative network.
3. Our Approach
In this section, we introduce our approach for NR-IQA. An overview of our framework is illustrated in Fig. 2. The model consists of three parts, i.e., the quality-aware generative network G, the IQA-discriminative network D, and the hallucination-guided quality regression network R. The generative network produces hallucinated references as compensatory information for the distorted images. The discriminative network is trained with G in an adversarial manner to help G produce more qualified results and to constrain the negative effects of bad ones on R. We define the objective discrepancy, i.e., the pixel-wise difference between a distorted image and the corresponding scene hallucination, as the discrepancy map³. The quality regression network takes the distorted images and corresponding discrepancy maps as inputs, with the guidance of the implicit ranking relationships in G, to exploit the perceptual discrepancy and produce the predicted quality scores as outputs.
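To make the data flow of Fig. 2 concrete, the following is a minimal PyTorch-style sketch of the forward pass through G and R (the paper's implementation uses Caffe; the wiring and module interfaces below are illustrative assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

class HallucinatedIQAPipeline(nn.Module):
    """Illustrative wiring of the generator G and regressor R from Fig. 2."""

    def __init__(self, generator: nn.Module, regressor: nn.Module):
        super().__init__()
        self.G = generator   # quality-aware generative network (Sec. 3.1)
        self.R = regressor   # hallucination-guided quality regression network (Sec. 3.3)

    def forward(self, distorted: torch.Tensor) -> torch.Tensor:
        # 1) hallucinate a reference conditioned on the distorted input
        hallucinated = self.G(distorted)
        # 2) discrepancy map: pixel-wise difference between the distorted image
        #    and its hallucinated reference
        discrepancy = distorted - hallucinated
        # 3) regress the quality score from the (distorted, discrepancy) pair
        return self.R(distorted, discrepancy)
```

The IQA-discriminative network D participates only during training (Sec. 3.2), so it is omitted from this inference-time sketch.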
3.1. Quality-Aware Generative Network
As mentioned in the previous sections, the function of the hallucinated reference is to compensate for the absence of the true reference image for the distorted image; the smaller the gap between the hallucination and the true reference, the more precisely the quality regression network will perform. Therefore, the aim of G is to generate a high-resolution hallucinated image $I_{sh}$ conditioned on the distorted image $I_d$. Toward this end, we adopt a stacked hourglass [32] as the baseline of the generative network.
A straightforward way to learn the generating function $G(I_d)$ is to enforce the output of the generator to be both pixel-wise and perception-wise close to the true reference. Therefore, given a set of distorted images $\{I_d^i, i=1,2,\dots,N\}$ and corresponding true reference images $\{I_r^i, i=1,2,\dots,N\}$, we solve

$$\hat{\theta}_G = \arg\min_{\theta_G} \frac{1}{N}\sum_{i=1}^{N}\Big[\, l_p\big(G(I_d^i), I_r^i\big) + l_s\big(G(I_d^i), I_r^i\big) \Big], \qquad (1)$$

where $l_p$ penalizes the pixel-wise differences between the output and the ground truth with pixel-level error measurements, such as MSE, to generate holistic content, and $l_s$ penalizes the perception-wise differences to achieve sharper local results. We adopt a feature-space loss term [9] as the perception constraint, defined as

$$l_s\big(G(I_d^i), I_r^i\big) = \big\|\phi\big(G(I_d^i)\big) - \phi\big(I_r^i\big)\big\|_2^2, \qquad (2)$$

where $\phi(\cdot)$ represents a feature transformation.

³This is different from the concept of an error map, which is used in FR-IQA to represent the pixel-wise error between the distorted image and the true reference.
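A minimal sketch of the generator objective in Eqs. (1)-(2), assuming a fixed feature extractor `phi` (e.g., intermediate VGG activations) stands in for the feature transformation (illustrative PyTorch-style code, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def generator_loss(G, phi, I_d, I_r):
    """Pixel-wise term l_p plus feature-space term l_s, as in Eq. (1)."""
    I_sh = G(I_d)                          # hallucinated reference
    l_p = F.mse_loss(I_sh, I_r)            # pixel-level MSE for holistic content
    l_s = F.mse_loss(phi(I_sh), phi(I_r))  # perception term of Eq. (2)
    return l_p + l_s
```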
Figure 2: An illustration of our proposed Hallucinated-IQA framework. It consists of three strongly related subnets. (a) The Quality-Aware Generative Network is used to generate hallucinated reference images; in order to obtain high-resolution hallucinated images, a quality-aware loss is introduced into the learning process. (b) The Hallucination-Guided Quality Regression Network incorporates the discrepancy information between the hallucinated image and the distorted image, encoded in the discrepancy map; the incorporated discrepancy information, together with high-level semantic fusion from the generative network, supplies the regression network with rich information and greatly guides its learning. (c) Since the hallucinated image is crucial for the final prediction, the IQA-Discriminator is proposed to further refine the hallucinated image.
Intuitively, a pre-trained network such as VGG19 could be utilized to calculate the perception term. This is reasonable in most cases, because VGG19 is trained for semantic classification and the features of its intermediate layers are therefore invariant to the noise of the input [4, 10]. Consequently, these layers provide structure and texture information to the generator for inferring more accurate results. However, this invariance property also leads the perception term to ignore hard cases in which the output of the generator still contains a certain degree of distortion information, as demonstrated in Fig. 3. To ease this problem, we propose a quality-aware perceptual loss, which incorporates the features of the deep regression network R dynamically. The loss function in Equation (2) becomes

$$l_s\big(G(I_d^i), I_r^i\big) = l_v\big(G(I_d^i), I_r^i\big) + \lambda_2\, l_q\big(G(I_d^i), I_r^i\big), \qquad (3)$$

where

$$l_v = \frac{1}{C_v W_j H_j}\sum_{c}\sum_{x}\sum_{y}\big\|\phi_j\big(G(I_d^i)\big)(x,y) - \phi_j\big(I_r^i\big)(x,y)\big\|^2, \qquad (4)$$

and

$$l_q = \frac{1}{C_q W_k H_k}\sum_{c}\sum_{x}\sum_{y}\big\|\psi_k\big(G(I_d^i)\big)(x,y) - \psi_k\big(I_r^i\big)(x,y)\big\|^2, \qquad (5)$$

where $\phi_j(\cdot)$ denotes the feature map at the $j$-th layer of VGG19, $\psi_k(\cdot)$ denotes the feature map at the $k$-th layer of R, $W$ and $H$ represent the dimensions of the feature map, and $C$ represents the number of feature maps at a particular layer. Since the VGG19 network and R are trained for different tasks, the kernels within the two networks also tend to preserve different information. The activations from the layers of a pre-trained⁴ NR-IQA regression network capture the distortion information of the input, which ensures a quality similarity measurement between the output of G and the ground truth. The activations from the layers of the VGG19 network ensure a semantic similarity measurement. Based on the respective representing capabilities of the two networks, incorporating both the $l_v$ and $l_q$ losses into the perception term lets them complement each other and therefore helps the generator produce better results jointly.

⁴It should be noted that the pre-trained quality regression model refers to one trained from scratch on an IQA dataset.

Figure 3: An illustration of the effectiveness of the quality-aware loss and IQA-GAN. With the quality-aware loss and the IQA-GAN scheme added (columns: Baseline Generator; Baseline + Quality-Aware Loss; Baseline + Quality-Aware Loss + IQA-GAN), the hallucinated images become increasingly clear and plausible. The last column shows the discrepancy map obtained from our model, which captures the type and location information of the distortion well and is demonstrated to be very helpful for our IQA task.
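A sketch of the quality-aware perceptual term of Eqs. (3)-(5), assuming `vgg_feat` returns the $j$-th-layer VGG19 feature map and `r_feat` returns the $k$-th-layer feature map of a pre-trained quality regressor R (the layer choices and the weight `lambda_q` are placeholders, not values from the paper):

```python
import torch
import torch.nn.functional as F

def quality_aware_perceptual_loss(G, vgg_feat, r_feat, I_d, I_r, lambda_q=1.0):
    """l_s = l_v + lambda_q * l_q (Eq. (3)); mse_loss averages over C, W, H as in Eqs. (4)-(5)."""
    I_sh = G(I_d)
    l_v = F.mse_loss(vgg_feat(I_sh), vgg_feat(I_r))  # semantic similarity (VGG19 features)
    l_q = F.mse_loss(r_feat(I_sh), r_feat(I_r))      # quality similarity (features of R)
    return l_v + lambda_q * l_q
```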
3.2. IQA-Discriminative Network
To ensure that the generator produces outputs of high perceptual quality with realistic high-frequency details, especially for samples that severely lack structure and texture information due to the distortion type (e.g., local block-wise distortions of different intensity, transmission errors) or the distortion level, an adversarial learning mechanism is introduced into our work.
The original manner of adversarial learning is to train G to generate images that fool D, while D is in contrast trained to distinguish fake reference images $I_{sh}$ from real reference images $I_r$. However, since GANs are limited in the resolution of the generator, and the distorted images forwarded to a quality network are usually of large size to maintain sufficient contextual information, directly providing $I_{sh}$ as fake images to the discriminator introduces instability into the optimization procedure and sometimes leads to nonsensical results. More importantly, our ultimate goal is improving the performance of the deep regression network R: even when G fails to generate a high-resolution hallucinated image, the predicted score of R should still be a reasonable value, so the influence of bad hallucinated images on R should be suppressed. Thus, we propose an IQA-Discriminator (i.e., D) to ease the above problems by discriminating fake samples from real samples according to their positive or negative influence on R. If G generates a hallucinated reference that helps improve the precision of R, this hallucination is defined as a real sample for D; otherwise, the hallucination is a fake sample. This can be formulated as
$$\max_{\theta_D}\; \mathbb{E}\big[\log D_{\theta_D}(I_r)\big] + \mathbb{E}\big[\log\big(1 - D_{\theta_D}(G(I_d))\big)\cdot d_{fake}\big], \qquad (6)$$

where $d_{fake}$ denotes the ground-truth influence label with the definition

$$d_{fake}^i = \begin{cases} 1, & \text{if } \big\|R(I_d^i, I_{sh}^i) - s^i\big\|_F > \varepsilon, \\ 0, & \text{if } \big\|R(I_d^i, I_{sh}^i) - s^i\big\|_F \le \varepsilon, \end{cases} \qquad (7)$$

where $s^i$ is the ground-truth quality score of $I_d^i$ and $\varepsilon$ denotes the threshold parameter. The general idea behind this formulation is that it leverages the property of the quality regression loss, which is an explicit index that directly reflects the impact of G on R, to enforce D to penalize only samples with a negative influence. Therefore, it can also be regarded as a relaxation strategy to stabilize the adversarial learning process.

Thus, G is eventually optimised to fool the discriminator D by generating qualified hallucinated scenes that are beneficial for R. The adversarial loss of G is formulated as

$$L_{adv} = -\,\mathbb{E}\big[\log D_{\theta_D}\big(G(I_d)\big)\big], \qquad (8)$$

and the overall loss function of G over all training samples is given by

$$L_G = \lambda_1 L_p + \lambda_2 L_s + \lambda_3 L_{adv}, \qquad (9)$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ represent the parameters that trade off the three loss components.
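The gating idea of Eqs. (6)-(7) can be sketched as follows, where the regression error decides whether a hallucination counts as a fake sample; `epsilon` is the threshold hyper-parameter (its value is not restated here), and the exact loss form is an interpretation of the equations above (illustrative PyTorch-style code):

```python
import torch

def fake_label(R, I_d, I_sh, s, epsilon):
    """Eq. (7): a hallucination is labeled fake only if it degrades the regressor R."""
    with torch.no_grad():
        err = torch.norm(R(I_d, I_sh) - s)   # regression error when using the hallucination
    return torch.tensor(1.0 if err > epsilon else 0.0)

def discriminator_objective(D, I_r, I_sh, d_fake):
    """Eq. (6): the fake term is gated by d_fake, so harmless hallucinations are not penalized."""
    real_term = torch.log(D(I_r)).mean()
    fake_term = (torch.log(1.0 - D(I_sh)) * d_fake).mean()
    return -(real_term + fake_term)          # maximizing Eq. (6) == minimizing this value
```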
3.3. Hallucination-Guided Quality Regression Network
Given the hallucinated scene generated by G, we are able to provide agent references to the quality regression network to compensate for the absence of true reference information. In order to incorporate the hallucinated reference information effectively, the concept of a discrepancy map is introduced. To further stabilize the optimization procedure of R, a high-level semantic fusion scheme is proposed.
Discrepancy Map. Given a set of distorted images to be assessed, previous CNN-based NR-IQA methods learn a mapping function $R(I_d)$ to predict the quality scores. In contrast, we consider the distorted images and their discrepancy maps as pairs $\{(I_d^i, I_{map}^i)\}_{i=1}^{N}$ to train a deep regression network by solving

$$\hat{\theta}_R = \arg\min_{\theta_R} \frac{1}{N}\sum_{i=1}^{N} l_r\big(R(I_d^i, I_{map}^i),\, s^i\big), \qquad (10)$$

where $I_{map}^i = I_d^i - G(I_d^i)$ denotes the discrepancy map. The formulation shows that the discrepancy map can virtually be regarded as prior information that tells the network what the distortion looks like.
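A minimal sketch of one regression update built on these (distorted image, discrepancy map) pairs, with an L1 loss assumed for $l_r$ (the paper does not restate the exact form of $l_r$ at this point):

```python
import torch
import torch.nn.functional as F

def regression_step(G, R, I_d, s):
    """Eq. (10): train R on (distorted image, discrepancy map) pairs."""
    with torch.no_grad():            # G is held fixed while R is updated
        I_map = I_d - G(I_d)         # discrepancy map
    pred = R(I_d, I_map)             # predicted quality score
    return F.l1_loss(pred.squeeze(), s)   # l_r; the exact regression loss is an assumption
```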
It is interesting that, so far, the holistic mechanism functions in a mutually reinforced way: during the training stage of R, G is used to produce auxiliary hallucinated references, while during the training stage of G, R is in turn introduced to help generate better hallucinations. In essence, G and R are mutually correlated and thus reinforce each other.
High-level Semantic Fusion. As mentioned in previous sections, the precision of R greatly depends on the eligibility of the hallucinated scene. To be specific, a qualified hallucination used as the agent reference helps R explore the correct perceptual discrepancy of the distorted image, while an unqualified one conversely introduces a large bias to R by improperly narrowing the distortion information. Hence, a constraint scheme is needed to stabilize the quality regression process.
Assume G has been trained; the feature maps after the $m$-th residual block in the encoder part of its $n$-th stack are considered as $H^{mn}(I_d) = \{H_c^{mn}(I_d)\}_{c=1}^{C_{mn}}$. We fuse the ones after the last encoder residual block of the second stack with the feature maps after the last block of R, which gives the fusion term

$$F = f\big(H^{5,2}(I_d)\big) \oplus \tilde{R}\big(I_d, I_{map}\big), \qquad (11)$$

where $f$ is a linear projection to ensure that the dimensions of $H$ and $\tilde{R}$ are equal, $\tilde{R}$ denotes the feature extraction before the fully connected layers $R_2$ of R, and $\oplus$ denotes the concatenation operation. Thus, the loss of R can be formulated as

$$L_R = \frac{1}{T}\sum_{t=1}^{T}\Big\| R_2\Big(f\big(H^{5,2}(I_d^t)\big) \oplus \tilde{R}\big(I_d^t, I_{map}^t\big)\Big) - s^t \Big\|. \qquad (12)$$

The form of the loss $L_R$ allows the high-level semantic information of G to participate in the optimization procedure of R. As discussed in the introduction, the fusion term $F$ explores the implicit ranking relationship⁵ within G as guidance to help R adjust the quality prediction in an adaptive manner. Specifically, if G is optimal, the solvers may simply drive the weights of the neurons in $R_2$ that connect with $f$ toward zero to approach identity mappings. Otherwise, the eligibility of the hallucinated scene is materially a reflection of the quality of the input distorted image, which can be leveraged as guidance to correct the prediction, and therefore improves the precision of R in a high-level semantic manner. Meanwhile, the IQA-discriminator can be regarded as a low-level semantic scheme for R, since it encourages G to generate useful hallucination inputs for R. Therefore, our model has schemes at multiple semantic levels to stabilize the quality regression process.

⁵G serves not only as a generator, but also as an encoder-decoder mechanism. Thus, the difference information between images distorted to different degrees is encoded compactly at the end of the encoder part. We refer to this difference information as the implicit ranking relationship of distorted images in this work.
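The fusion of Eqs. (11)-(12) can be sketched as a 1×1 projection of the generator's encoder features concatenated with the regressor's last-block features; the projection $f$, the pooling, and the feature shapes below are assumptions for illustration:

```python
import torch
import torch.nn as nn

class SemanticFusionHead(nn.Module):
    """Fuse G's encoder feature H^{5,2}(I_d) with R's features before the final layers R_2."""

    def __init__(self, g_channels: int, r_channels: int):
        super().__init__()
        self.f = nn.Conv2d(g_channels, r_channels, kernel_size=1)  # linear projection f in Eq. (11)
        self.r2 = nn.Linear(2 * r_channels, 1)                     # stands in for the final layers R_2

    def forward(self, h_g: torch.Tensor, h_r: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.f(h_g), h_r], dim=1)   # concatenation in Eq. (11)
        pooled = fused.mean(dim=(2, 3))                # global average pooling (assumed)
        return self.r2(pooled)                         # predicted quality score
```

If G is optimal, the weights connected to the projected branch can be driven toward zero, recovering the plain regression path, which matches the adaptive behaviour discussed above.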
3.4. Training Strategy
Since all of the operations in G and R are differentiable, these two sub-networks can be trained in an end-to-end manner. To better optimize the generation and quality regression in a mutually reinforced way, we take an alternating training strategy in practice. Please refer to the supplementary material, where Algorithm 1 presents the whole training process of our approach in pseudocode.
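Since Algorithm 1 is deferred to the supplementary material, the following is only a rough sketch of one alternating round, assuming separate optimizers and already-defined loss helpers for the three sub-networks (the exact update schedule is an assumption):

```python
def train_round(batch, G, D, R, opt_g, opt_d, opt_r,
                d_loss_fn, g_loss_fn, r_loss_fn):
    """One alternating update of D, G, and R (illustrative only)."""
    I_d, I_r, s = batch          # distorted image, true reference, quality score

    # 1) update D: real references vs. hallucinations gated by d_fake (Sec. 3.2)
    opt_d.zero_grad()
    d_loss_fn(D, G, R, I_d, I_r, s).backward()
    opt_d.step()

    # 2) update G: pixel + quality-aware perceptual + adversarial terms (Eq. (9))
    opt_g.zero_grad()
    g_loss_fn(G, D, R, I_d, I_r).backward()
    opt_g.step()

    # 3) update R: (distorted, discrepancy map) pairs with high-level fusion (Eq. (12))
    opt_r.zero_grad()
    r_loss_fn(G, R, I_d, s).backward()
    opt_r.step()
```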
3.5. Weakly-Supervised Quality Assessment
In this section, we discuss some extensions that further uncover the potential of our framework.
To advance the development of the IQA task, various benchmarks have been released in recent years. However, a significant issue follows as well. As shown in Table 1, there are huge gaps in the distortion quality definitions, types, and levels among the datasets. While NR-IQA models are commonly trained on one specific dataset, these gaps easily lead the models to suffer from overfitting and a lack of generalization ability. Learning from cross-datasets is an alternative way to ease the problem. Previous methods usually transfer the definition of quality scores by nonlinear mappings learned from the distributions of the datasets, which may introduce bias into the models.
In contrast, as a by-product of our work, the hallucinated scene can be regarded as a universal medium among different datasets that helps the training process of a particular one without losing precision, since the hallucination is only constrained on the distorted image and serves as the fundamental agent reference information of image quality. Meanwhile, the detachable training process of our framework provides an alternative in which the R used in the stage of training G and the final quality regression model can be different. Based on the above, as long as a hallucination generator is trained, either on one specific dataset with multiple complex distortions or on multiple datasets at once, it can be used to help the training process of any other dataset as a plug-and-play module in a weakly-supervised manner. Moreover, the module can also be used as a data augmentation or initialization mechanism without any extra annotation or artificial prior knowledge. We evaluate the above discussion in Sec. 4.1.

3.6. Implementation Details

All the training samples are 256 × 256 pixel patches randomly sampled from the original images. A common data augmentation is then performed with random rotation and flipping. We train our models with Caffe [15] on Titan X GPUs with a mini-batch size of 32, and all of them are trained from scratch. Stochastic gradient descent (SGD) is used to optimise the networks, with an initial learning rate of 10⁻⁵ for the generation network and 10⁻² for the regression network, dropped by a factor of 0.1 every 20K iterations. The weight decay is 0.0005 and the momentum is 0.9. During testing, we extract overlapping image patches at a fixed stride from each testing image and simply average all predicted scores as the final whole-image quality score.
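A sketch of this test-time protocol, extracting overlapping 256×256 patches at a fixed stride and averaging the patch scores (the stride value below is an assumption; the paper only states that a fixed stride is used):

```python
import torch

def predict_image_score(model, image: torch.Tensor, patch: int = 256, stride: int = 128) -> float:
    """Average patch-level predictions into a whole-image quality score.

    `image` is a 4-D tensor of shape (1, C, H, W) with H, W >= patch.
    """
    _, _, h, w = image.shape
    scores = []
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            crop = image[:, :, top:top + patch, left:left + patch]
            with torch.no_grad():
                scores.append(model(crop).item())
    return sum(scores) / len(scores)
```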
Table 1: Summary of the databases evaluated in the experiments.

Database   # of Ref. Images   # of Dist. Images   # of Dist. Types   Score Type   Score Range
LIVE              29                  779                  5            DMOS        [1, 100]
CSIQ              30                  866                  6            DMOS        [0, 1]
TID2008           25                 1700                 17            MOS         [0, 9]
TID2013           25                 3000                 24            MOS         [0, 9]
4. Experiments
Datasets. We perform experiments on four widely used benchmark datasets: LIVE [41], CSIQ [21], TID2008 [36], and TID2013 [35]. The detailed information is summarized in Table 1.
Evaluation Metrics. Following most previous works, two evaluation criteria are adopted in our paper: the Spearman Rank Order Correlation Coefficient (SROCC) and the Linear Correlation Coefficient (LCC). SROCC measures the monotonic relationship between the ground truth and the model prediction, while LCC measures the linear correlation between them. The detailed definitions are given in the supplementary material.
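Both criteria can be computed directly with SciPy; a minimal example with made-up placeholder values:

```python
from scipy.stats import spearmanr, pearsonr

ground_truth = [30.2, 55.1, 72.8, 41.0, 63.5]   # subjective (D)MOS values (placeholders)
predictions  = [28.9, 57.3, 70.1, 44.2, 61.0]   # model outputs (placeholders)

srocc, _ = spearmanr(ground_truth, predictions)  # rank-order (monotonic) correlation
lcc, _ = pearsonr(ground_truth, predictions)     # linear correlation
print(f"SROCC={srocc:.3f}  LCC={lcc:.3f}")
```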
Table 2: Performance evaluation (SROCC) on the entire TID2013 database. SROCC on the full set (ALL) is listed below; the table additionally reports SROCC on each of the 24 individual distortion types.

Method             SROCC (ALL)
BLIINDS-II [39]       0.550
CORNIA-10K [47]       0.651
HOSA [44]             0.728
RankIQA [25]          0.780
Ours                  0.879
Ours-Oracle           0.935
4.1. Comparisons with the state-of-the-art
To validate our approach, we conduct extensive evaluations in which ten state-of-the-art NR-IQA methods are compared. We follow the experimental protocol used in three of the most recent algorithms, i.e., HOSA [44], BIECON [20], and RankIQA [25], where the reference images are randomly divided into two subsets, 80% for training and 20% for testing, and the corresponding distorted images are divided in the same way to ensure there is no overlapping image content between the two sets. All experiments use ten random train-test splits, and the median SROCC and LCC values are reported as the final statistics.
Single-dataset evaluations. We first analyze the experimental results on TID2013. The SROCC values for our approach and the compared state-of-the-art methods on the entire TID2013 dataset are reported in Table 2. Our method significantly outperforms previous methods by a large margin: we achieve about a 13% relative improvement over the strongest state-of-the-art method, RankIQA, on the entire dataset with all distortion types considered at once. For individual distortions, due to the normalization operation in the network, the performance on a small number of types such as intensity shift and change of colour saturation is lower than that of some methods, while we generally achieve the highest accuracies on most of the distortion subsets. Specifically, the significant improvements on distortion types such as #4 (masked noise) and #14 (non-eccentricity pattern noise) quantitatively demonstrate the effectiveness of our hallucinated reference compensation mechanism, and improvements on types such as #9 (image denoising) and #22 (multiplicative Gaussian noise) verify the capacity of our G component as a single model that hallucinates images under multiple distortions effectively.
Table 3 shows the performance evaluation on the entire LIVE database. Our method outperforms all of the state-of-the-art methods on both the SROCC and LCC evaluations. Among the compared methods, the three most recent state-of-the-art methods explore different strategies to
better leverage the power of DNNs and achieve promising results: BIECON uses FR-IQA methods to generate proxy quality scores, RankIQA synthesizes masses of ranked images to train the network, and PQR takes advantage of a pre-trained ResNet-50 network. Our method achieves a 2% improvement over BIECON, 2% SROCC and 1% LCC improvements over PQR, and slight improvements of about 0.1% over RankIQA while training from scratch. These observations demonstrate that our mechanisms increase the model capacity effectively from a new perspective.
On the TID2008 dataset, our approach also achieves the highest performance compared with all of the state-of-the-art methods, and we likewise reach the best performance on the CSIQ dataset. To save space, the detailed results and discussion for these two datasets are given in the supplementary material.
We also list the results of using the ground-truth reference in the above experiments as theoretical bounds, referred to as "Ours-Oracle", to further verify the effectiveness and potential of the proposed hallucinated references for NR-IQA. The oracle outperforms all methods on all datasets by large margins. These results demonstrate the effectiveness of the hallucinated information and show a great potential performance gain if the hallucinated information can be well generated.
Cross-dataset evaluations. Here, we perform two types of cross-dataset evaluations to further verify the merits of our approach. Table 4 shows the results of a cross-dataset test in which the models are trained on the LIVE dataset and tested on the TID2008 dataset. We follow the common experimental setting of testing on the subset of TID2008 that includes four distortion types (i.e., JPEG, JPEG2K, WN, and BLUR), and a logistic regression is applied to map the predicted DMOS to MOS values. The promising results demonstrate the generalization ability of our approach.
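Such a mapping is commonly realized with a monotonic logistic fit between the predicted scores and the target subjective scale; the four-parameter form below is a common choice and an assumption here, since the exact parametric form used by the authors is not specified in this section:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic4(x, a, b, c, d):
    # monotonic 4-parameter logistic mapping raw predictions onto the MOS scale
    return (a - b) / (1.0 + np.exp(-(x - c) / d)) + b

def fit_and_map(pred, mos):
    """Fit the logistic on (prediction, MOS) pairs, then map the predictions."""
    pred, mos = np.asarray(pred, float), np.asarray(mos, float)
    p0 = [mos.max(), mos.min(), pred.mean(), 1.0]          # rough initial guess
    params, _ = curve_fit(logistic4, pred, mos, p0=p0, maxfev=10000)
    return logistic4(pred, *params)
```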
To evaluate the by-product of our work, in which the model can be leveraged in a weakly-supervised manner to handle cross-dataset quality assessment, we train the generator on different datasets and use the LIVE dataset to train the regression network. Table 5 reports the results. "L", the baseline of the experiment, indicates that the hallucination generator is trained on the training set of LIVE; "T08" represents training the generator on TID2008, and "T08+T13" is the version trained on both TID2008 and TID2013. It can be clearly observed that with more IQA datasets aggregated in the generator, the regression network reaches higher SROCC and LCC performance, approaching the oracle.
Table 3: Performance evaluation (both SROCC and LCC) on the entire LIVE database (ALL); the table additionally reports results on the individual distortion subsets JP2K, JPEG, WN, BLUR, and FF.

Method          SROCC (ALL)   LCC (ALL)
BRISQUE [30]       0.940        0.942
CORNIA [47]        0.942        0.935
CNN [17]           0.956        0.953
SOM [51]           0.964        0.962
BIECON [20]        0.961        0.962
RankIQA [25]       0.981        0.982
PQR [48]           0.965        0.971
Ours               0.982        0.982
Ours-Oracle        0.983        0.989

Table 4: Cross-dataset evaluation. The models are trained on the LIVE database and tested on the subset of TID2008.

Method         SROCC    LCC
CORNIA [47]    0.892   0.880
CNN [17]       0.920   0.903
SOM [51]       0.923   0.899
Ours           0.934   0.917
Ours-Oracle    0.939   0.920

Table 5: SROCC and LCC results of models on the LIVE database with the generator trained on different datasets.

           L       T08     T08+T13   Ours-Oracle
SROCC    0.982    0.982     0.983       0.983
LCC      0.982    0.985     0.988       0.989

Figure 4: Ablation results (SROCC and LCC) on the entire TID2008 dataset.
4.2. Ablation study

To investigate the efficacy of the key components of our model, we conduct ablation experiments on the TID2008 dataset. The overall results are shown in Figure 4. We use a modified ResNet-18 network with only distorted images as inputs as our baseline model (BL), and analyze each proposed component on top of this baseline by comparing both SROCC and LCC results.

Hallucinated reference compensation. We first evaluate the hallucinated reference compensation mechanism. By adding a holistic hallucination model that provides hallucinated references paired with the distorted images as inputs to the ResNet-18 network (BL+HCM), we obtain a 0.859 SROCC value and a 0.800 PLCC value, up to 4% and 8% improvements over the baseline model, respectively.

Quality-aware perceptual loss. By adding the feature matching loss w.r.t. quality similarity to the training process of the hallucination model (BL+HCM+QPL), our model obtains a further 0.5% improvement on SROCC and 2% on LCC.

Adversarial learning. To explore the effect of the proposed IQA-Discriminative network for quality assessment, we further compare models with the adversarial learning mechanism under the original definition (BL+HCM+QPL+ADV) and under our definition (BL+HCM+QPL+QADV). Adding the original adversarial learning mechanism leads to a 3% improvement on SROCC and 3% on LCC, while our method obtains a further 2% and about 1% improvement on SROCC and LCC, respectively.

Multi-level semantic fusion. We also show the improvements brought by the multi-level semantic fusion mechanism. We fuse the feature maps of the generator from stack two with those of the same size in the quality regression network, and obtain the highest values of 0.914 SROCC and 0.949 LCC.
5. Conclusion

In this paper, we propose to solve the ill-posed nature of NR-IQA from a new perspective. We introduce a hallucination-guided quality regression network to capture the perceptual discrepancy between the distorted images and the hallucinated images, and therefore predict precise perceptual quality results. We generate the hallucinations with a novel quality-aware generation network, aided by a specially designed IQA-discriminator under the adversarial learning scheme. The proposed network does not require any extra annotations or artificial prior knowledge for training and can be trained end-to-end. Extensive experiments demonstrate its superior performance on the NR-IQA task.
References
[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
[2] S. Bianco, L. Celona, P. Napoletano, and R. Schettini. On the use of deep learning for blind image quality assessment. CoRR, 2016.
[3] X. Chen, X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.
[4] D. Cho, J. Park, T. Oh, Y. Tai, and I. S. Kweon. Weakly- and self-supervised learning for content-aware deep image retargeting. In ICCV, 2017.
[5] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In NIPS, 2016.
[6] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[7] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, 2015.
[8] J. Fu, H. Zheng, and T. Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In CVPR, 2017.
[9] L. A. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In NIPS, 2015.
[10] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.
[11] S. A. Golestaneh and L. J. Karam. Reduced-reference quality assessment based on the entropy of DWT coefficients of locally weighted gradient magnitudes. TIP, 25(11):5293-5303, 2016.
[12] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial networks. CoRR, 2014.
[13] J. Guo and H. Chao. Building an end-to-end spatial-temporal convolutional network for video super-resolution. In AAAI, 2017.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[15] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[16] C. Kaae Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár. Amortised MAP inference for image super-resolution. In ICLR, 2017.
[17] L. Kang, P. Ye, Y. Li, and D. Doermann. Convolutional neural networks for no-reference image quality assessment. In CVPR, 2014.
[18] L. Kang, P. Ye, Y. Li, and D. S. Doermann. Simultaneous estimation of image quality and distortion via multi-task convolutional neural networks. In ICIP, 2015.
[19] J. Kim and S. Lee. Deep learning of human visual sensitivity in image quality assessment framework. In CVPR, 2017.
[20] J. Kim and S. Lee. Fully deep blind image quality predictor. J. Sel. Topics Signal Processing, 11(1):206-220, 2017.
[21] E. C. Larson and D. M. Chandler. Most apparent distortion: full-reference image quality assessment and the role of strategy. Journal of Electronic Imaging, 19(1):011006, 2010.
[22] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
[23] Y. Li, W. Dong, X. Xie, G. Shi, X. Li, and D. Xu. Learning parametric sparse models for image super-resolution. In NIPS, 2016.
[24] Y. Liang, J. Wang, X. Wan, Y. Gong, and N. Zheng. Image quality assessment using similar scene as reference. In ECCV, 2016.
[25] X. Liu, J. van de Weijer, and A. D. Bagdanov. RankIQA: Learning from rankings for no-reference image quality assessment. In ICCV, 2017.
[26] Y. Liu, J. Yan, and W. Ouyang. Quality aware network for set to set recognition. In CVPR, 2017.
[27] K. Ma, W. Liu, T. Liu, Z. Wang, and D. Tao. dipIQ: Blind image quality assessment by learning-to-rank discriminable image pairs. TIP, pages 3951-3964, 2017.
[28] M. Mirza and S. Osindero. Conditional generative adversarial nets. CoRR, 2014.
[29] A. Mittal, A. K. Moorthy, and A. C. Bovik. No-reference image quality assessment in the spatial domain. TIP, pages 4695-4708, 2012.
[30] A. Mittal, A. K. Moorthy, and A. C. Bovik. No-reference image quality assessment in the spatial domain. TIP, pages 4695-4708, 2012.
[31] A. K. Moorthy and A. C. Bovik. Blind image quality assessment: From natural scene statistics to perceptual quality. TIP, 20(12):3350-3364, 2011.
[32] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
[33] J. Pan, Z. Lin, Z. Su, and M. H. Yang. Robust kernel estimation with outliers handling for image deblurring. In CVPR, 2016.
[34] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
[35] N. Ponomarenko, O. Ieremeiev, V. Lukin, K. Egiazarian, L. Jin, J. Astola, B. Vozel, K. Chehdi, M. Carli, and F. Battisti. Color image database TID2013: Peculiarities and preliminary results. In European Workshop on Visual Information Processing, pages 106-111, 2013.
[36] N. Ponomarenko, V. Lukin, A. Zelensky, K. Egiazarian, M. Carli, and F. Battisti. TID2008 - a database for evaluation of full-reference visual quality assessment metrics. Advances of Modern Radioelectronics, 10:30-45, 2009.
[37] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, 2015.
[38] M. A. Saad, A. C. Bovik, and C. Charrier. DCT statistics model-based blind image quality assessment. In ICIP, 2011.
[39] M. A. Saad, A. C. Bovik, and C. Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. TIP, pages 3339-3352, 2012.
[40] M. S. M. Sajjadi, B. Schölkopf, and M. Hirsch. EnhanceNet: Single image super-resolution through automated texture synthesis. In ICCV, 2017.
[41] H. R. Sheikh, M. F. Sabir, and A. C. Bovik. A statistical evaluation of recent full reference image quality assessment algorithms. TIP, 15(11):3440-3451, 2006.
[42] H. Tang, N. Joshi, and A. Kapoor. Blind image quality assessment using semi-supervised rectifier networks. In CVPR, 2014.
[43] S. Xie and Z. Tu. Holistically-nested edge detection. In ICCV, 2015.
[44] J. Xu, P. Ye, Q. Li, H. Du, Y. Liu, and D. Doermann. Blind image quality assessment based on high order statistics aggregation. TIP, pages 4444-4457, 2016.
[45] L. Xu, J. Li, W. Lin, Y. Zhang, L. Ma, Y. Fang, and Y. Yan. Multi-task rank learning for image quality assessment. IEEE Trans. Circuits Syst. Video Techn., pages 1833-1843, 2017.
[46] P. Ye, J. Kumar, and D. S. Doermann. Beyond human opinion scores: Blind image quality assessment based on synthetic scores. In CVPR, 2014.
[47] P. Ye, J. Kumar, L. Kang, and D. Doermann. Unsupervised feature learning framework for no-reference image quality assessment. In CVPR, 2012.
[48] H. Zeng, L. Zhang, and A. C. Bovik. A probabilistic quality representation approach to deep blind image quality prediction. CoRR, 2017.
[49] K. Zhang, W. Zuo, S. Gu, and L. Zhang. Learning deep CNN denoiser prior for image restoration. In CVPR, 2017.
[50] L. Zhang and H. Li. SR-SIM: A fast and high performance IQA index based on spectral residual. In ICIP, 2012.
[51] P. Zhang, W. Zhou, L. Wu, and H. Li. SOM: Semantic obviousness metric for image quality assessment. In CVPR, 2015.