
Hallucinated-IQA: No-Reference Image Quality Assessment via Adversarial Learning
Kwan-Yee Lin¹ and Guanxiang Wang²
¹Department of Information Science, School of Mathematical Sciences, Peking University
²Department of Mathematics, School of Mathematical Sciences, Peking University
¹linjunyi@pku.edu.cn  ²gxwang@math.pku.edu.cn

Abstract
No-reference image quality assessment (NR-IQA) is a fundamental yet challenging task in the low-level computer vision community. The difficulty is particularly pronounced because only limited information is available: the corresponding reference image for comparison is typically absent. Although various feature extraction mechanisms, from natural scene statistics to deep neural networks, have been leveraged in previous methods, a performance bottleneck remains.
In this work, we propose a hallucination-guided quality regression network to address this issue. We first generate a hallucinated reference conditioned on the distorted image to compensate for the absence of the true reference. We then pair the information of the hallucinated reference with the distorted image and forward them to the regressor, which learns the perceptual discrepancy under the guidance of an implicit ranking relationship within the generator and therefore produces a precise quality prediction. To demonstrate the effectiveness of our approach, we conduct comprehensive experiments on four popular image quality assessment benchmarks. Our method significantly outperforms all previous state-of-the-art methods by large margins. The code and model are publicly available on the project page https://kwanyeelin.github.io/projects/HIQA/HIQA.html.
1. Introduction
Image quality assessment (IQA) refers to the challenging task of automatically predicting the perceptual quality of a distorted image. IQA serves as a key component in the low-level computer vision community and has a wide range of applications [13, 26, 49].
IQA algorithms can be classified into three categories: full-reference IQA (FR-IQA) [50, 24, 19], reduced-reference IQA (RR-IQA) [11], and general-purpose no-reference IQA (NR-IQA) [46, 17, 42, 51, 44, 20, 25].
Figure 1: An illustration of our motivation. The first column shows the ground-truth reference images, which are undistorted. The second column shows several kinds of distortion that commonly occur. The third column demonstrates the hallucinated reference images generated by our approach. The fourth column shows the discrepancy maps, which capture rich information that can be utilized to guide the learning of the quality regression network toward highly accurate results. Panels (a)-(c) show three example cases.
Although FR-IQA and RR-IQA metrics have achieved remarkable results over the decades, their precondition of requiring a corresponding non-distorted reference image for comparison during the quality prediction process makes them infeasible in practical applications, since it is hard, and in most cases even impossible, to obtain an ideal reference image. In contrast, NR-IQA, which takes only the distorted image to be assessed as input without any additional information, is more realistic and has therefore received substantial attention in recent years. However, its ill-posed definition makes it highly challenging for NR-IQA to produce a good image quality prediction.
The ill-posed nature of the underdetermined NR-IQA problem is particularly pronounced because of the limited available information: the form of distortion and the corresponding

non-distorted reference image are typically absent. This is counter-intuitive, since the human visual system (HVS) needs a reference to quantify the perceptual discrepancy, comparing the distorted image either directly with the original undistorted image or implicitly with a hallucinated scene in mind, as demonstrated in Figure 1(a). This ill-posed definition is the most essential issue of the NR-IQA task and has led to the performance bottleneck of the last decade.
Numerous efforts have been made to ease this problem by designing powerful feature representation models. Traditional methods commonly use manually designed statistical representations, and hence lack the diversity and flexibility needed to model the many complex distortion types¹ and the large span of image contents (e.g., human, animal, plant, cityscape, transportation, etc.) in NR-IQA. In recent years, the promising results of Deep Neural Networks (DNNs) in many computer vision tasks [8, 5, 43] have encouraged researchers to exploit their formidable feature representation power for the NR-IQA task. Nevertheless, the extremely limited number of annotated samples in public datasets greatly limits the advantage of DNNs in NR-IQA. To better leverage the power of DNNs, previous works usually rely on sophisticated multi-task and data augmentation strategies with extra annotated rankings, proxy quality scores, or distortion information, which are unavailable in practical NR-IQA applications and hence lack feasibility for modeling unknown, multiple distortion types. Some works attempt to transfer general image feature representations from a model pre-trained on ImageNet [6] to quality prediction. However, the weak correlation and similarity between NR-IQA and the image classification task reduce the effectiveness of such transfer learning.
In this work, a Hallucination-Guided Quality Regression Network is proposed to simulate the behaviour of the human visual system, making precise predictions by leveraging the perceptual discrepancy information between the distorted image and a hallucinated reference. As shown in Fig. 2, a high-resolution scene hallucination is first generated from the distorted image. Then, the discrepancy map, which naturally encodes the difference between the distorted image and the hallucinated reference, is obtained to guide the learning of the regression network. With this strong and clearly defined discrepancy information incorporated, the ill-posed nature of NR-IQA can be largely overcome. Therefore, even with common data augmentation, our approach leads to better performance than all of the conventional, more sophisticated methods.
A straightforward way to generate the hallucinated reference is to leverage state-of-the-art image super-resolution [23, 22, 40], blind deblurring [33], or inpainting [34] methods to reconstruct images from the distorted ones. However, since an image can be corrupted by multiple unknown distortions, which breaks the basic assumptions² of these related fields, it is impractical to use them to obtain a reconstructed image qualified to serve as the agent reference for the NR-IQA task. To this end, a Quality-Aware Generative Network is proposed to generate the hallucinated reference with a novel quality-aware perceptual term designed specifically for the NR-IQA task at hand.

¹An image can be distorted at any stage of its lifecycle from acquisition to storage, and may therefore undergo diverse distortions such as noise corruption, compression artifacts, transmission errors, and under-/over-exposure. For more details, please refer to [35].

²For example, super-resolution methods usually assume that the blur kernel or its form is known.
While the Quality-Aware Generative Network is robust to most distortion types and levels, it is still very challenging for a method built on a DCNN to reconstruct high-frequency details with realistic texture when the distorted image lacks structure information, as shown in Fig. 1(c). Since the hallucinated reference is crucial for the final prediction, a bad hallucination will introduce large bias and lead the regression results to sub-optimal values. We propose to tackle this problem with the two following mechanisms, from the low semantic level to the high one. (1) We introduce the idea of adversarial learning to hallucinated reference generation and quality prediction with a novel IQA-discriminator which, on the one hand, encourages the generated hallucinated scene to be perceptually hard to distinguish from true reference images and, on the other hand, at a low semantic level, constrains the influence of bad hallucinations on the quality regression network. (2) A novel high-level semantic fusion mechanism is introduced to further reduce the instability of the quality regression network caused by the hallucination model. It explores the implicit ranking relationship within the hallucination network as a guidance to help the regression network adjust the image quality prediction in an adaptive manner. The quality-aware generative network, the hallucination-guided quality regression network, and the IQA-discriminator can be jointly optimised in an end-to-end manner.
The main contributions of this work are three-fold:
• A novel Hallucination-Guided Quality Regression Network is proposed to incorporate perceptual discrepancy information into network learning, which overcomes the ill-posed nature of NR-IQA and significantly improves prediction precision and robustness.
• A Quality-Aware Generative Network, together with a quality-aware perceptual loss, is proposed, in which both texture feature similarity and quality feature similarity are taken into consideration in a complementary manner to help generate qualified hallucinated references.

• Since the hallucinated reference is crucial for the final prediction, an IQA-Discriminator and an implicit ranking relationship fusion scheme are introduced to better guide the learning of the generator and to suppress the negative influence of bad scene hallucinations on quality regression in a low-level to high-level manner.
We evaluate the proposed method on four broadly used image quality assessment benchmarks: LIVE [41], CSIQ [21], TID2008 [36], and TID2013 [35]. Our approach outperforms all of the state-of-the-art NR-IQA methods by significant margins. A comprehensive ablation study further demonstrates the effectiveness of each component.
2. Related Work
No-reference Image Quality Assessment. In the NR-IQA literature, besides classic methods ([31, 38, 29, 47]) and their improved versions ([51, 44, 27]), significant progress has recently been achieved by exploring DNNs for better feature representation [17, 18, 45, 20, 25, 48]. For example, Kang et al. [17] introduce a shallow ConvNet to model the quality prediction. This approach is refined to a multi-task CNN [18], where the network learns both the distortion type and the quality score simultaneously. Bianco et al. [2] use a pre-trained DCNN fine-tuned on an IQA dataset to extract features, then map them to IQA scores with an SVR model. Zeng et al. [48] also propose to extract features with a pre-trained ResNet [14]. Instead of learning IQA scores directly, they fine-tune the network to learn a probabilistic representation of distorted images. According to the distortion types and levels in particular datasets, Liu et al. [25] synthesize masses of ranked images to train a Siamese network to learn the rankings for NR-IQA. Liang et al. [24] propose to use a non-aligned similar scene as a reference. Kim and Lee [20] apply state-of-the-art FR-IQA methods to generate proxy scores on patches as the ground truth to pre-train the model, and then fine-tune it for NR-IQA.
In this work, we propose a unique approach that addresses the ill-posed problem by compensating for the absent reference information without any extra data annotation or prior knowledge, which therefore offers greater flexibility and feasibility than other methods.
Generative Adversarial Network. GANs [12] and their variants [37, 1, 28] flourish in generating natural images such as human faces [3] and indoor scenes [7]. However, generating high-resolution images (e.g., 256×256) leads GANs to training instability and sometimes nonsensical outputs, as shown in [16]. Since our ultimate goal is NR-IQA, and the performance of the quality regression network is closely related to the output of the generator, instead of applying the original discriminator we tailor the adversarial learning scheme for image quality assessment by introducing an effective IQA-discriminative network.
3. Our Approach
In this section, we introduce our approach for NR-IQA. An overview of our framework is illustrated in Fig. 2. The model consists of three parts, i.e., the quality-aware generative network G, the IQA-discriminative network D, and the hallucination-guided quality regression network R. The generative network produces a hallucinated reference as compensatory information for the distorted image. The discriminative network is trained with G in an adversarial manner to help G produce more qualified results and to constrain the negative effects of bad ones on R. We define the objective discrepancy (i.e., the pixel-wise differences) between a distorted image and the corresponding scene hallucination as the discrepancy map³. The quality regression network takes the distorted images and the corresponding discrepancy maps as inputs, with the guidance of implicit ranking relationships in G, to exploit the perceptual discrepancy and produce the predicted quality scores as outputs.
3.1. Quality-Aware Generative Network
As mentioned in the previous sections, the role of the hallucinated reference is to compensate for the absence of the true reference image, and the smaller the gap between the hallucination and the true reference, the more precisely the quality regression network will perform. Therefore, the aim of G is to generate a high-resolution hallucinated image $I_{sh}$ conditioned on the distorted image $I_d$. Toward this end, we adopt a stacked hourglass [32] as the baseline of the generative network.
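The generator baseline is the stacked hourglass of [32]. For concreteness, the sketch below shows one plausible PyTorch rendering of a single recursive hourglass module; the depth, channel width, and head/tail convolutions are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch (PyTorch) of one recursive hourglass module in the spirit of [32].
# Channel width, depth and the head/tail convolutions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Residual(nn.Module):
    """Pre-activation residual unit used inside the hourglass."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True), nn.Conv2d(ch, ch, 3, padding=1),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True), nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class Hourglass(nn.Module):
    """Recursive encoder-decoder with a skip branch at every resolution."""
    def __init__(self, depth=4, ch=64):
        super().__init__()
        self.skip = Residual(ch)                      # branch kept at the current resolution
        self.down = Residual(ch)                      # processed after 2x down-sampling
        self.inner = Hourglass(depth - 1, ch) if depth > 1 else Residual(ch)
        self.up = Residual(ch)                        # processed before 2x up-sampling

    def forward(self, x):
        skip = self.skip(x)
        y = self.up(self.inner(self.down(F.max_pool2d(x, 2))))
        y = F.interpolate(y, scale_factor=2, mode="nearest")
        return skip + y                               # entrywise sum of the two branches

if __name__ == "__main__":
    head = nn.Conv2d(3, 64, 3, padding=1)             # to feature space
    tail = nn.Conv2d(64, 3, 3, padding=1)             # back to an RGB hallucination
    hg = Hourglass(depth=4, ch=64)
    distorted = torch.randn(1, 3, 256, 256)
    hallucinated = tail(hg(head(distorted)))
    print(hallucinated.shape)                         # torch.Size([1, 3, 256, 256])
```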
A straightforward way to learn the generating function $G_\theta(I_d)$ is to enforce the output of the generator to be both pixel-wise and perception-wise close to the true reference. Therefore, given a set of distorted images $\{I_d^i,\ i = 1, 2, \dots, N\}$ and corresponding true reference images $\{I_r^i,\ i = 1, 2, \dots, N\}$, we solve

$$\hat{\theta} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} l_p\big(G_\theta(I_d^i), I_r^i\big) + l_s\big(G_\theta(I_d^i), I_r^i\big), \qquad (1)$$

where $l_p$ penalizes the pixel-wise differences between the output and the ground truth with pixel-level error measurements, such as MSE, to generate holistic content, and $l_s$ penalizes the perception-wise differences to achieve sharper local results. We adopt a feature-space loss term [9] as the perception constraint, which is defined as

$$l_s\big(G_\theta(I_d^i), I_r^i\big) = \big\| \phi\big(G_\theta(I_d^i)\big) - \phi(I_r^i) \big\|_2^2, \qquad (2)$$

where $\phi$ represents a feature transformation.

³This is different from the concept of the error map, which is used in FR-IQA to represent the pixel-wise error between the distorted image and the true reference.

[Figure 2 schematic: the distorted image passes through the Quality-Aware Generative Network (stacked hourglass modules with convolution, residual units, down-/up-sampling) to produce the generated hallucinated reference; subtraction yields the discrepancy map, which, together with the distorted image and a fused high-level feature, feeds the Hallucination-Guided Quality Regression Network (entrywise sum, fully connected layers) to output the predicted quality score; the IQA-Discriminative Network receives the generated hallucinated reference and the ground-truth reference and outputs real/fake.]
Figure 2: An illustration of our proposed Hallucinated-IQA framework. It consists of three strongly related subnets. (a) The Quality-Aware Generative Network is used to generate hallucinated reference images; in order to obtain high-resolution hallucinated images, a quality-aware loss is introduced into the learning process. (b) The Hallucination-Guided Quality Regression Network incorporates the discrepancy information between the hallucinated image and the distorted image, encoded in the discrepancy map. The incorporated discrepancy information, together with the high-level semantic fusion from the generative network, supplies the regression network with rich information and greatly guides its learning. (c) Since the quality of the hallucinated image is crucial for the final prediction, the IQA-Discriminator is proposed to further refine the hallucinated image.
Intuitively, a pre-trained network such as VGG-19 could be utilized to calculate the perception term. This is reasonable in most cases, because VGG-19 is trained for semantic classification and the features of its intermediate layers are therefore invariant to the noise of the input [4, 10]. Consequently, these layers provide structure and texture information to the generator for inferring more accurate results. However, this invariance also leads the perception term to ignore hard cases in which the output of the generator still contains a certain degree of distortion, as demonstrated in Fig. 3. To ease this problem, we propose a quality-aware perceptual loss, which incorporates the features of the deep regression network R dynamically. The loss function in Equation (2) becomes

$$l_s\big(G_\theta(I_d^i), I_r^i\big) = l_v\big(G_\theta(I_d^i), I_r^i\big) + \lambda_2\, l_q\big(G_\theta(I_d^i), I_r^i\big), \qquad (3)$$

where

$$l_v = \frac{1}{c_v W_j H_j} \sum_{c_v=1}^{C_v} \sum_{x=1}^{W_j} \sum_{y=1}^{H_j} \big\| \phi_j\big(G_\theta(I_d^i)\big)_{x,y} - \phi_j\big(I_r^i\big)_{x,y} \big\|^2, \qquad (4)$$

and

$$l_q = \frac{1}{c_q W_k H_k} \sum_{c_q=1}^{C_q} \sum_{x=1}^{W_k} \sum_{y=1}^{H_k} \big\| \psi_k\big(G_\theta(I_d^i)\big)_{x,y} - \psi_k\big(I_r^i\big)_{x,y} \big\|^2, \qquad (5)$$

where $\phi_j(\cdot)$ denotes the feature map at the j-th layer of VGG-19, $\psi_k(\cdot)$ denotes the feature map at the k-th layer of R, W and H represent the dimensions of the feature map, and C represents the number of feature maps at a particular layer. Since the VGG-19 network and R are trained for different tasks, the kernels within the two networks also tend to preserve different information. The activations from the layers of a pre-trained⁴ NR-IQA regression network capture the distortion information of the input, which ensures the quality similarity measurement between the output of G and the ground truth. The activations from the layers of the VGG-19 network ensure the semantic similarity measurement. Based on the respective representation capabilities of the two networks, incorporating both the $l_v$ and $l_q$ losses into the perception term lets them complement each other and therefore helps the generator produce better results jointly.

Figure 3: An illustration of the effectiveness of the quality-aware loss and IQA-GAN (columns: Distorted Image, Baseline Generator, Baseline + Quality-Aware Loss, Baseline + Quality-Aware Loss + IQA-GAN, Discrepancy Map). With the quality-aware loss and the IQA-GAN scheme added, the hallucinated images become increasingly clear and plausible. The last column shows the discrepancy map obtained from our model, which can be seen to capture well the type and location of the distortion. The map is demonstrated to be very helpful for our IQA task.

⁴It should be noted that the pre-trained quality regression model refers to one trained from scratch on an IQA dataset.
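To make Eqs. (3)-(5) concrete, the sketch below assembles the quality-aware perceptual term in PyTorch. Here `iqa_net` stands for a feature extractor taken from a pre-trained quality regression network R, and the chosen VGG-19 layer index and the weight `lambda_2` are illustrative assumptions rather than the paper's exact settings.

```python
# Sketch (PyTorch) of the quality-aware perceptual loss of Eqs. (3)-(5).
# `iqa_net` is assumed to return intermediate features of a pre-trained regression network R.
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

class QualityAwarePerceptualLoss(torch.nn.Module):
    def __init__(self, iqa_net, vgg_layer=26, lambda_2=1.0):
        super().__init__()
        self.vgg = vgg19(weights="IMAGENET1K_V1").features[:vgg_layer].eval()  # phi_j
        self.iqa_net = iqa_net.eval()                                           # psi_k
        self.lambda_2 = lambda_2
        for p in self.parameters():
            p.requires_grad_(False)        # both feature extractors stay frozen

    def forward(self, hallucinated, reference):
        # Eq. (4): semantic/texture similarity in VGG-19 feature space
        # (mse_loss averages over C, W and H, matching the 1/(C W H) normalisation).
        l_v = F.mse_loss(self.vgg(hallucinated), self.vgg(reference))
        # Eq. (5): quality similarity in the regression network's feature space.
        l_q = F.mse_loss(self.iqa_net(hallucinated), self.iqa_net(reference))
        # Eq. (3): the two terms complement each other.
        return l_v + self.lambda_2 * l_q
```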

3.2. IQA-Discriminative Network
To ensure that the generator produces perceptually convincing outputs with realistic high-frequency details, especially for samples that seriously lack structure and texture information due to the distortion type (e.g., local block-wise distortions of different intensity, transmission errors) or the distortion level, an adversarial learning mechanism is introduced into our work.
The original manner of adversarial learning is to train G to generate images that fool D, while D is trained to distinguish the fake reference images $I_{sh}$ from the real reference images $I_r$. However, since GANs are limited by the resolution of the generator, and the distorted images forwarded to a quality network are usually of large size to maintain sufficient contextual information, directly providing $I_{sh}$ as fake images to the discriminator introduces instability to the optimization procedure and sometimes leads to nonsensical results. More importantly, our ultimate goal is to improve the performance of the deep regression network R. Even when G fails to generate a high-resolution hallucinated image, the predicted score of R should still be a reasonable value, so the influence of bad hallucinated images on R should be suppressed. Thus, we propose an IQA-Discriminator (i.e., D) to ease the above problems by discriminating fake samples from real samples according to their positive or negative influence on R. If G generates a hallucinated reference that helps improve the precision of R, this hallucination is defined as a real sample for D; otherwise, the hallucination is a fake sample. This can be formulated as
$$\max_{\Omega}\ \mathbb{E}\big[\log D_\Omega(I_r)\big] + \mathbb{E}\big[\log\big(1 - D_\Omega\big(G_\theta(I_d)\big)\big)\cdot d_{fake}\big], \qquad (6)$$

where $d_{fake}$ denotes the ground-truth influence label with the definition

$$d^i_{fake} = \begin{cases} 1, & \text{if } \big\| R\big(I_d^i, I_{sh}^i\big) - s^i \big\| \geq \varepsilon \\ 0, & \text{if } \big\| R\big(I_d^i, I_{sh}^i\big) - s^i \big\| < \varepsilon \end{cases} \qquad (7)$$

where $s^i$ is the ground-truth quality score of $I_d^i$ and $\varepsilon$ denotes the threshold parameter. The general idea behind this formulation is that it leverages the property of the quality regression loss, which is an explicit index directly reflecting the impact of G on R, to enforce D to penalize only samples with a negative influence. Therefore, it can also be regarded as a relaxation strategy to stabilize the adversarial learning process.
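The influence label of Eq. (7) and the resulting discriminator objective of Eq. (6) translate directly into code; the minimal sketch below assumes a batched setting, and the threshold value and function names are illustrative.

```python
# Sketch of the IQA-discriminator targets: a hallucination counts as "fake" only if it
# harms the regressor, i.e., if R's error exceeds a threshold (Eq. 7). Values are assumptions.
import torch

def influence_labels(pred_scores, gt_scores, epsilon=0.1):
    """d_fake = 1 for harmful hallucinations, 0 for helpful ones (Eq. 7)."""
    return ((pred_scores - gt_scores).abs() >= epsilon).float()

def discriminator_loss(d_real, d_fake_out, d_fake_label, eps=1e-8):
    """Negative of the objective in Eq. (6): true references are pushed towards "real",
    and only hallucinations labelled harmful contribute to the "fake" term."""
    loss_real = -torch.log(d_real + eps).mean()
    loss_fake = -(torch.log(1.0 - d_fake_out + eps) * d_fake_label).mean()
    return loss_real + loss_fake
```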
Thus, G is eventually optimised to fool the discriminator D by generating qualified hallucinated scenes that are beneficial for R. The adversarial loss of G is formulated as

$$\mathcal{L}_{adv} = \mathbb{E}\big[\log\big(1 - D_\Omega\big(G_\theta(I_d)\big)\big)\big], \qquad (8)$$

and the overall loss function of G over all training samples is given by

$$\mathcal{L}_G = \lambda_1 \mathcal{L}_p + \lambda_2 \mathcal{L}_s + \lambda_3 \mathcal{L}_{adv}, \qquad (9)$$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ represent the parameters that trade off the three loss components.

3.3. Hallucination-Guided Quality Regression Network

Given the hallucinated scene generated by G, we are able to provide agent references to the quality regression network to compensate for the absence of the true reference information. In order to incorporate the hallucinated reference information effectively, the concept of the discrepancy map is introduced. To further stabilize the optimization procedure of R, a high-level semantic fusion scheme is proposed.

Discrepancy Map. Given a set of distorted images to be assessed, previous CNN-based NR-IQA methods learn a mapping function $R(I_d)$ to predict the quality scores. In contrast, we consider the distorted images and their discrepancy maps as pairs $\{I_d^i, I_{map}^i\}_{i=1}^{N}$ to train a deep regression network by solving

$$\hat{\omega} = \arg\min_{\omega} \frac{1}{N} \sum_{i=1}^{N} l\big(R_\omega\big(I_d^i, I_{map}^i\big), s^i\big), \qquad (10)$$

where $I_{map} = I_d - G_{\hat{\theta}}(I_d)$ denotes the discrepancy map. This formulation shows that the discrepancy map can virtually be regarded as prior information that tells the network what the distortion looks like.
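A minimal sketch of how the discrepancy map of Eq. (10) can be formed and paired with the distorted image for the regressor is given below; the channel-wise stacking and the L1 choice for l(·) are illustrative assumptions.

```python
# Sketch of the discrepancy-map input of Eq. (10): R sees the distorted patch together
# with I_map = I_d - G(I_d). The pairing by channel concatenation is an assumption.
import torch
import torch.nn.functional as F

def regression_input(generator, distorted):
    with torch.no_grad():                        # G only provides an auxiliary reference here
        hallucinated = generator(distorted)
    discrepancy_map = distorted - hallucinated   # I_map in Eq. (10)
    return torch.cat([distorted, discrepancy_map], dim=1)

def regression_loss(regressor, generator, distorted, gt_scores):
    pred = regressor(regression_input(generator, distorted)).squeeze(-1)
    return F.l1_loss(pred, gt_scores)            # one possible choice for l(.) in Eq. (10)
```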
Interestingly, the holistic mechanism functions in a mutually reinforcing way: during the training stage of R, G is used to produce auxiliary hallucinated references, while during the training stage of G, R is in turn introduced to help generate better hallucinations. In essence, G and R are mutually correlated and thus reinforce each other.

High-level Semantic Fusion. As mentioned in previous sections, the precision of R depends greatly on the eligibility of the hallucinated scene. To be specific, a qualified hallucination used as the agent reference helps R explore the correct perceptual discrepancy of the distorted image, while an unqualified one conversely introduces large bias to R by improperly narrowing the distortion information. Hence, a constraint scheme is needed to stabilize the quality regression process.

Assume G has been trained, and let the feature maps after the m-th residual block in the encoder part of its n-th stack be denoted as $\{H_{mn}^{c_{mn}}(I_d)\}_{c_{mn}=1}^{C_{mn}}$. We fuse the ones after the last encoder residual block of the second stack with the feature maps after the last block of R, giving the fusion term

$$F = f\big(H_{5,2}(I_d)\big) \oplus \tilde{R}\big(I_d, I_{map}\big), \qquad (11)$$

where $f$ is a linear projection that ensures the dimensions of $H$ and $\tilde{R}$ are equal, $\tilde{R}$ denotes the feature extraction before the fully connected layers ($R_2$) of R, and $\oplus$ denotes the concatenation operation. Thus, the loss of R can be formulated as

$$\mathcal{L}_R = \frac{1}{T} \sum_{t=1}^{T} \big\| R_2\big(f\big(H_{5,2}(I_d)\big) \oplus \tilde{R}\big(I_d, I_{map}\big)\big) - s^t \big\|. \qquad (12)$$

The form of the loss $\mathcal{L}_R$ allows the high-level semantic information of G to participate in the optimization procedure of R. As discussed in the introduction, the fusion term F explores the implicit ranking relationship⁵ within G as a guidance to help R adjust the quality prediction in an adaptive manner. Specifically, if G is optimal, the solver may simply drive the weights of the neurons in $R_2$ that connect with $f$ toward zero to approach an identity mapping. Otherwise, the eligibility of the hallucinated scene is materially a reflection of the quality of the input distorted image, which can be leveraged as a guidance to correct the prediction, and therefore improve the precision of R at a high semantic level. Meanwhile, the IQA-discriminator can be regarded as a low-level semantic scheme for R, since it encourages G to generate useful hallucination inputs for R. Therefore, our model has schemes at multiple semantic levels to stabilize the quality regression process.

⁵G serves not only as a generator, but also as an encoder-decoder mechanism. Thus, the difference information between images distorted to different degrees is encoded compactly at the end of the encoder part. We refer to this difference information as the implicit ranking relationship of distorted images in this work.
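A sketch of the fusion of Eqs. (11) and (12): generator encoder features are projected by a linear layer f, concatenated with the regressor's pre-FC features, and passed through the final fully connected head R2. The dimensions, and the assumption that the generator features are already globally pooled, are illustrative.

```python
# Sketch (PyTorch) of the high-level semantic fusion of Eqs. (11)-(12). Dimensions,
# pooling of the generator features and the head structure are assumptions.
import torch
import torch.nn as nn

class FusedRegressor(nn.Module):
    def __init__(self, r_backbone, r_feat_dim=512, g_feat_dim=256, hidden=512):
        super().__init__()
        self.r_backbone = r_backbone                  # R~: features before the FC layers
        self.f = nn.Linear(g_feat_dim, r_feat_dim)    # linear projection f in Eq. (11)
        self.r2 = nn.Sequential(                      # R2: fully connected regression head
            nn.Linear(2 * r_feat_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )

    def forward(self, distorted, discrepancy_map, g_encoder_feat):
        # g_encoder_feat: pooled H_{5,2}(I_d) features from G's encoder, shape [B, g_feat_dim].
        r_feat = self.r_backbone(distorted, discrepancy_map)         # R~(I_d, I_map)
        fused = torch.cat([self.f(g_encoder_feat), r_feat], dim=1)   # Eq. (11): F = f(H) (+) R~
        return self.r2(fused)                                        # predicted quality score
```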
3.4. Training Strategy

Since all of the operations in G and R are differentiable, these two sub-networks can be trained in an end-to-end manner. To better optimize the generation and the quality regression in a mutually reinforcing way, we adopt an alternating training strategy in practice. Please refer to the supplementary material, where Algorithm 1 presents the whole training procedure of our approach in pseudo code.
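The exact pseudo code (Algorithm 1) is given in the supplementary material; the loop below is only a rough sketch of one plausible alternation between D, G, and R, and all function names, loss weights, and the update order are assumptions.

```python
# Rough sketch of one alternating training step for D, G and R; not the paper's Algorithm 1.
import torch

def train_step(G, D, R, batch, opt_G, opt_D, opt_R,
               pixel_loss, perceptual_loss, adv_loss_G, adv_loss_D, regression_loss,
               lambdas=(1.0, 1.0, 1e-3)):
    distorted, reference, score = batch
    l1, l2, l3 = lambdas

    # (1) Discriminator step: separate harmful hallucinations from true references (Eqs. 6-7).
    halluc = G(distorted).detach()
    opt_D.zero_grad()
    adv_loss_D(D, reference, halluc, R, score).backward()
    opt_D.step()

    # (2) Generator step: pixel + quality-aware perceptual + adversarial terms (Eq. 9).
    halluc = G(distorted)
    opt_G.zero_grad()
    (l1 * pixel_loss(halluc, reference)
     + l2 * perceptual_loss(halluc, reference)
     + l3 * adv_loss_G(D, halluc)).backward()
    opt_G.step()

    # (3) Regressor step: learn from the distorted image and its discrepancy map (Eq. 12).
    opt_R.zero_grad()
    regression_loss(R, distorted, distorted - halluc.detach(), score).backward()
    opt_R.step()
```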
3.5. Weakly-Supervised Quality Assessment

In this section, we discuss some extensions that further uncover the potential of our framework.

To advance the development of the IQA task, various benchmarks have been released in recent years. However, a significant issue follows as well. As shown in Table 1, there are huge gaps in the distorted-quality definitions, types, and levels among the datasets. Since NR-IQA models are commonly trained on one specific dataset, these gaps easily lead the models to suffer from over-fitting and a lack of generalization ability. Learning from cross-datasets is an alternative way to ease the problem. Previous methods usually transfer the definition of quality scores through non-linear mappings learned from the distributions of the datasets, which may introduce bias into the models.

In contrast, as a by-product of our work, the hallucinated scene can be regarded as a universal medium among different datasets that helps the training process on a particular one without losing precision, since the hallucination is constrained only on the distorted image and serves as the fundamental agent reference information of image quality. Meanwhile, the detachable training process of our framework provides an alternative in which the R used in the stage of training G and the final quality regression model can be different. Based on the above, as long as a hallucination generator is trained, either on one specific dataset with multiple complex distortions or on multiple datasets at once, it can be used to help the training process on any other dataset as a plug-and-play module in a weakly-supervised manner. Moreover, the module can also be used as a data augmentation or initialization mechanism without any extra annotation or artificial prior knowledge. We evaluate the above discussion in Sec. 4.1.
3.6. Implementation Details

All training samples are 256×256 pixel patches randomly sampled from the original images. A common data augmentation is then performed with random rotation and flipping. We train our models with Caffe [15] on Titan X GPUs with a mini-batch size of 32, and all of them are trained from scratch. Stochastic gradient descent (SGD) is used to optimise the networks, with separate initial learning rates for the generation and regression networks, dropped by a factor of 10 every 20K iterations. The weight decay is 0.0005 and the momentum is 0.9. During testing, we extract overlapped image patches at a fixed stride from each testing image and simply average all predicted scores as the final whole-image quality score.
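The test-time procedure described above can be sketched as follows; the stride value and the way the regressor consumes the discrepancy map are assumptions.

```python
# Sketch of patch-based inference: overlapped patches are cropped at a fixed stride and
# their predicted scores averaged into the whole-image score. Stride is an assumption.
import torch

def predict_image_score(regressor, generator, image, patch=256, stride=128):
    """image: tensor of shape [1, 3, H, W]; returns the mean predicted quality score."""
    _, _, h, w = image.shape
    scores = []
    for top in range(0, max(h - patch, 0) + 1, stride):
        for left in range(0, max(w - patch, 0) + 1, stride):
            crop = image[:, :, top:top + patch, left:left + patch]
            with torch.no_grad():
                halluc = generator(crop)
                score = regressor(crop, crop - halluc)   # distorted patch + discrepancy map
            scores.append(score.item())
    return sum(scores) / len(scores)
```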
Databases            LIVE     CSIQ    TID2008   TID2013
# of Ref. Images     29       30      25        25
# of Dist. Images    779      866     1700      3000
# of Dist. Types     5        6       17        24
Score Type           DMOS     DMOS    MOS       MOS
Score Range          [1,100]  [0,1]   [0,9]     [0,9]

Table 1: Summary of the databases evaluated in the experiments.
4. Experiments
Datasets. We perform experiments on four widely used benchmark datasets: LIVE [41], CSIQ [21], TID2008 [36], and TID2013 [35]. The detailed information is summarized in Table 1.
Evaluation Metrics. Following most previous works, two evaluation criteria are adopted in our paper: the Spearman Rank Order Correlation Coefficient (SROCC) and the Linear Correlation Coefficient (LCC). SROCC measures the monotonic relationship between the ground truth and the model prediction, while LCC measures their linear correlation. The detailed definitions are given in the supplementary material.
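Both criteria are standard; the snippet below is a SciPy-based reference implementation for clarity, not code from the paper.

```python
# Reference implementations of the two evaluation criteria used in the experiments.
import numpy as np
from scipy import stats

def srocc(gt, pred):
    """Spearman rank-order correlation between subjective scores and predictions."""
    return stats.spearmanr(gt, pred).correlation

def lcc(gt, pred):
    """Pearson linear correlation coefficient."""
    return stats.pearsonr(gt, pred)[0]

if __name__ == "__main__":
    gt = np.array([10.0, 25.0, 40.0, 60.0, 85.0])
    pred = np.array([12.0, 20.0, 45.0, 55.0, 90.0])
    print(srocc(gt, pred), lcc(gt, pred))
```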

Method             ALL (SROCC)
BLIINDS-II [39]    0.550
CORNIA-10K [47]    0.651
HOSA [44]          0.728
RankIQA [25]       0.780
Ours               0.879
Ours+Oracle        0.935

Table 2: Performance evaluation (SROCC) on the entire TID2013 database. The full table reports results for each distortion type #1-#24; the overall ALL column is shown here.
4.1. Comparisons with the state-of-the-arts
To validate our approach, we conduct extensive evaluations in which ten state-of-the-art NR-IQA methods are compared. We follow the experimental protocol used in the three most recent algorithms (i.e., HOSA [44], BIECON [20], and RankIQA [25]), where the reference images are randomly divided into two subsets, 80% for training and 20% for testing, and the corresponding distorted images are divided in the same way to ensure that there is no overlapping image content between the two sets. All experiments are repeated with ten random train-test splits, and the median SROCC and LCC values are reported as the final statistics.
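The protocol splits by reference image so that no content is shared between the training and testing sets; a minimal sketch of such a content-independent split is given below (all names are illustrative).

```python
# Sketch of a content-independent 80/20 split: reference images are divided first and
# every distorted image follows its own reference image.
import random

def split_by_reference(ref_ids, dist_to_ref, train_ratio=0.8, seed=0):
    """ref_ids: iterable of reference ids; dist_to_ref: {distorted_id: reference_id}."""
    rng = random.Random(seed)
    refs = list(ref_ids)
    rng.shuffle(refs)
    cut = int(round(train_ratio * len(refs)))
    train_refs = set(refs[:cut])
    train = [d for d, r in dist_to_ref.items() if r in train_refs]
    test = [d for d, r in dist_to_ref.items() if r not in train_refs]
    return train, test
```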
Single dataset evaluations. We first analyze the experimental results on TID2013. The SROCC values for our approach and the compared state-of-the-art methods on the entire TID2013 dataset are reported in Table 2. Our method significantly outperforms previous methods by a large margin: we achieve about a 13% relative improvement over the strongest prior method, RankIQA, on the entire dataset with all distortion types considered at once. For individual distortions, due to the normalization operation in the network, the performance on a small number of types, such as intensity shift and change of colour saturation, is lower than that of some methods, while we generally achieve the highest accuracy on most distortion types (over 60% of the subsets). Specifically, the significant improvements on distortion types such as #4 (masked noise) and #14 (non-eccentricity pattern noise) quantitatively demonstrate the effectiveness of our hallucinated reference compensation mechanism, and the improvements on types such as #9 (image denoising) and #22 (multiplicative Gaussian noise) verify the capacity of our G component as a single model that effectively hallucinates images under multiple distortions.
Table 3 shows the performance evaluation on the entire LIVE database. Our method outperforms all of the state-of-the-art methods on both the SROCC and LCC evaluations. Among the compared methods, the three most recent ones explore different strategies to
better leverage the power of DNNs and achieve promising results: BIECON uses FR-IQA methods to generate proxy quality scores, RankIQA synthesizes masses of ranked images to train the network, and PQR takes advantage of a pre-trained ResNet-50. Our method achieves about a 2% improvement over BIECON, 2% SROCC and 1% LCC improvements over PQR, and a slight 0.1% improvement over RankIQA, while training from scratch. These observations demonstrate that our mechanisms effectively increase the model capacity from a new perspective.
On the TID2008 dataset, our approach also achieves the highest performance compared with all of the state-of-the-art methods, and we likewise reach the best performance on the CSIQ dataset. To save space, the detailed results and discussion for these two datasets are provided in the supplementary material.
We also list the results of using the ground-truth reference in the above experiments as theoretical bounds, referred to as "Ours+Oracle", to further verify the effectiveness and potential of the proposed hallucinated references for NR-IQA. The oracle outperforms all methods on all datasets by large margins. These results demonstrate the effectiveness of the hallucinated information and show the great potential performance gain attainable if the hallucinated information can be well generated.
Cross-dataset evaluations. Here, we perform two types of cross-dataset evaluations to further verify the merits of our approach. Table 4 shows the results of a cross-dataset test where the models are trained on the LIVE dataset and tested on the TID2008 dataset. We follow the common experimental setting and test on the subset of TID2008 that includes four distortion types (i.e., JPEG, JPEG2K, WN, and BLUR), and a logistic regression is applied to map the predicted DMOS to MOS values. The promising results demonstrate the generalization ability of our approach.
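The exact regression used to map predicted DMOS to MOS is not spelled out in the text; one common choice in the IQA literature is a five-parameter logistic fit, sketched below with SciPy as an illustrative assumption.

```python
# Sketch of a nonlinear mapping from predicted DMOS to MOS using the five-parameter
# logistic commonly used in IQA evaluation; the exact form used in the paper may differ.
import numpy as np
from scipy.optimize import curve_fit

def logistic5(x, b1, b2, b3, b4, b5):
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (x - b3)))) + b4 * x + b5

def fit_mapping(pred_dmos, mos):
    p0 = [np.max(mos), 1.0, np.mean(pred_dmos), 0.0, np.mean(mos)]
    params, _ = curve_fit(logistic5, pred_dmos, mos, p0=p0, maxfev=20000)
    return lambda x: logistic5(np.asarray(x, dtype=float), *params)
```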
To evaluate the by-product of our work, in which the model can be leveraged in a weakly-supervised manner to handle cross-dataset quality assessment, we train the generator on different datasets and use the LIVE dataset to train the regression network. Table 5 reports the results. The "L" entry, as the baseline of this experiment, represents a hallucination generator trained on the training set of LIVE; "T08" represents training the generator on TID2008, and "T08+T13" is the version trained on both TID2008 and TID2013. It can be clearly observed that, with more IQA datasets aggregated in the generator, the regression network reaches higher SROCC and LCC performance, approaching the oracle.

SROCC             JP2K    JPEG    WN      BLUR    FF      ALL
BRISQUE [30]      0.914   0.965   0.979   0.951   0.877   0.940
CORNIA [47]       0.943   0.955   0.976   0.969   0.906   0.942
CNN [17]          0.952   0.977   0.978   0.962   0.908   0.956
SOM [51]          0.947   0.952   0.984   0.976   0.937   0.964
BIECON [20]       0.952   0.974   0.980   0.956   0.923   0.961
RankIQA [25]      0.970   0.978   0.991   0.988   0.954   0.981
PQR [48]          -       -       -       -       -       0.965
Ours              0.983   0.961   0.984   0.983   0.989   0.982
Ours+Oracle       0.978   0.960   0.993   0.988   0.968   0.983

LCC               JP2K    JPEG    WN      BLUR    FF      ALL
BRISQUE [30]      0.923   0.973   0.985   0.951   0.903   0.942
CORNIA [47]       0.951   0.965   0.987   0.968   0.917   0.935
CNN [17]          0.953   0.981   0.984   0.953   0.933   0.953
SOM [51]          0.952   0.961   0.991   0.974   0.954   0.962
BIECON [20]       0.965   0.987   0.970   0.945   0.931   0.962
RankIQA [25]      0.975   0.986   0.994   0.988   0.960   0.982
PQR [48]          -       -       -       -       -       0.971
Ours              0.977   0.984   0.993   0.990   0.960   0.982
Ours+Oracle       0.989   0.985   0.997   0.992   0.988   0.989

Table 3: Performance evaluation (both SROCC and LCC) on the entire LIVE database.
          CORNIA [47]   CNN [17]   SOM [51]   Ours    Ours+Oracle
SROCC     0.892         0.920      0.923      0.934   0.939
LCC       0.880         0.903      0.899      0.917   0.920

Table 4: Cross-dataset evaluation (SROCC). The models are trained on the LIVE database and tested on the subset of TID2008.
          L       T08     T08+T13   Ours+Oracle
SROCC     0.982   0.982   0.983     0.983
LCC       0.982   0.985   0.988     0.989

Table 5: SROCC and LCC results of models on the LIVE database with the generator trained on different datasets.
4.2. Ablation study

To investigate the efficacy of the key components of our model, we conduct ablation experiments on the TID2008 dataset. The overall results are shown in Figure 4. We use a modified ResNet-18 network with only distorted images as inputs as our baseline model, and analyze each proposed component on top of this baseline network (BL) by comparing both SROCC and LCC results.

[Figure 4: Ablation results (SROCC and LCC) on the entire TID2008 dataset for the configurations BL, BL+HCM, BL+HCM+QPL, BL+HCM+QPL+ADV, BL+HCM+QPL+QADV, and BL+HCM+QPL+QADV+HSF.]

Hallucinated reference compensation. We first evaluate the hallucinated reference compensation mechanism.
By adding a holistic hallucination model that provides hallucinated references paired with the distorted images as inputs to the ResNet-18 network ("BL+HCM"), we obtain a 0.859 SROCC value and a 0.800 PLCC value, up to 14% and 8% improvements over the baseline model, respectively.

Quality-aware perceptual loss. By adding the feature matching loss w.r.t. quality similarity to the training process of the hallucination model ("BL+HCM+QPL"), our model obtains a further 0.5% improvement in SROCC and 2% in LCC.

Adversarial learning. To explore the effect of the proposed IQA-Discriminative network for quality assessment, we further compare models with the adversarial learning mechanism under the original definition ("BL+HCM+QPL+ADV") and under our definition ("BL+HCM+QPL+QADV"). Adding the original adversarial learning mechanism leads to a 3% improvement in SROCC and 3% in LCC, while our method obtains a further 2% and about 1% improvement in SROCC and LCC, respectively.

Multi-level semantic fusion. We also show the improvements brought by the multi-level semantic fusion mechanism. We fuse the feature maps of the generator from stack two with those of the same size in the quality regression network, and obtain the highest values of 0.94 SROCC and 0.949 LCC.

5. Conclusion

In this paper, we propose to solve the ill-posed nature of NR-IQA from a new perspective. We introduce a hallucination-guided quality regression network to capture the perceptual discrepancy between the distorted images and the hallucinated images, and therefore predict precise perceptual quality. We generate the hallucinations with a novel quality-aware generative network, aided by a specially designed IQA-discriminator under the adversarial learning scheme. The proposed network does not require any extra annotations or artificial prior knowledge for training and can be trained end-to-end. Extensive experiments demonstrate its superior performance on the NR-IQA task.

References
[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gener- ative adversarial networks. In ICML, 2017.
[2] S. Bianco, L. Celona, P. Napoletano, and R. Schettini. On the use of deep learning for blind image quality assessment. CoRR, 2016.
[3] X. Chen, X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable repre- sentation learning by information maximizing generative ad- versarial nets. In NIPS, 2016.
[4] D. Cho, J. Park, T. Oh, Y. Tai, and I. S. Kweon. Weakly- and self-supervised learning for content-aware deep image retargeting. In ICCV, 2017.
[5] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: object detection via region-based fully convolutional networks. In NIPS, 2016.
[6] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[7] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, 2015.
[8] J. Fu, H. Zheng, and T. Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In CVPR, 2017.
[9] L. A. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In NIPS, 2015.
[10] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016.
[11] S. A. Golestaneh and L. J. Karam. Reduced-reference quality assessment based on the entropy of DWT coefficients of locally weighted gradient magnitudes. TIP, 25(11):5293–5303, 2016.
[12] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial networks. CoRR, 2014.
[13] J. Guo and H. Chao. Building an end-to-end spatial-temporal convolutional network for video super-resolution. In AAAI, 2017.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[15] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[16] C. Kaae Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár. Amortised MAP inference for image super-resolution. In ICLR, 2017.
[17] L. Kang, P. Ye, Y. Li, and D. Doermann. Convolutional neural networks for no-reference image quality assessment. In CVPR, 2014.
[18] L. Kang, P. Ye, Y. Li, and D. S. Doermann. Simultaneous estimation of image quality and distortion via multi-task convolutional neural networks. In ICIP, 2015.
[19] J. Kim and S. Lee. Deep learning of human visual sensitivity in image quality assessment framework. In CVPR, 2017.
[20] J. Kim and S. Lee. Fully deep blind image quality predictor. J. Sel. Topics Signal Processing, 11(1):206–220, 2017.
[21] E. C. Larson and D. M. Chandler. Most apparent distortion: full-reference image quality assessment and the role of strategy. Journal of Electronic Imaging, 19(1):011006, 2010.
[22] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
[23] Y. Li, W. Dong, X. Xie, G. Shi, X. Li, and D. Xu. Learning parametric sparse models for image super-resolution. In NIPS, 2016.
[24] Y. Liang, J. Wang, X. Wan, Y. Gong, and N. Zheng. Image quality assessment using similar scene as reference. In ECCV, 2016.
[25] X. Liu, J. van de Weijer, and A. D. Bagdanov. RankIQA: Learning from rankings for no-reference image quality assessment. In ICCV, 2017.
[26] Y. Liu, J. Yan, and W. Ouyang. Quality aware network for set to set recognition. In CVPR, 2017.
[27] K. Ma, W. Liu, T. Liu, Z. Wang, and D. Tao. dipIQ: Blind image quality assessment by learning-to-rank discriminable image pairs. TIP, pages 3951–3964, 2017.
[28] M. Mirza and S. Osindero. Conditional generative adversarial nets. CoRR, 2014.
[29] A. Mittal, A. K. Moorthy, and A. C. Bovik. No-reference image quality assessment in the spatial domain. TIP, pages 4695–4708, 2012.
[30] A. Mittal, A. K. Moorthy, and A. C. Bovik. No-reference image quality assessment in the spatial domain. TIP, pages 4695–4708, 2012.
[31] A. K. Moorthy and A. C. Bovik. Blind image quality assessment: From natural scene statistics to perceptual quality. TIP, 20(12):3350–3364, 2011.
[32] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
[33] J. Pan, Z. Lin, Z. Su, and M.-H. Yang. Robust kernel estimation with outliers handling for image deblurring. In CVPR, 2016.
[34] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
[35] N. Ponomarenko, O. Ieremeiev, V. Lukin, K. Egiazarian, L. Jin, J. Astola, B. Vozel, K. Chehdi, M. Carli, and F. Battisti. Color image database TID2013: Peculiarities and preliminary results. In European Workshop on Visual Information Processing, pages 106–111, 2013.
[36] N. Ponomarenko, V. Lukin, A. Zelensky, K. Egiazarian, M. Carli, and F. Battisti. TID2008 – a database for evaluation of full-reference visual quality assessment metrics. Advances of Modern Radioelectronics, 10:30–45, 2009.
[37] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, 2015.
[38] M. A. Saad, A. C. Bovik, and C. Charrier. DCT statistics model-based blind image quality assessment. In ICIP, 2011.
[39] M. A. Saad, A. C. Bovik, and C. Charrier. Blind image quality assessment: A natural scene statistics approach in the DCT domain. TIP, pages 3339–3352, 2012.

[40] M. S. M. Sajjadi, B. Schölkopf, and M. Hirsch. EnhanceNet: Single image super-resolution through automated texture synthesis. In ICCV, 2017.
[41] H. R. Sheikh, M. F. Sabir, and A. C. Bovik. A statistical evaluation of recent full reference image quality assessment algorithms. TIP, 15(11):3440–3451, 2006.
[42] H. Tang, N. Joshi, and A. Kapoor. Blind image quality assessment using semi-supervised rectifier networks. In CVPR, 2014.
[43] S. Xie and Z. Tu. Holistically-nested edge detection. In ICCV, 2015.
[44] J. Xu, P. Ye, Q. Li, H. Du, Y. Liu, and D. Doermann. Blind image quality assessment based on high order statistics aggregation. TIP, pages 4444–4457, 2016.
[45] L. Xu, J. Li, W. Lin, Y. Zhang, L. Ma, Y. Fang, and Y. Yan. Multi-task rank learning for image quality assessment. IEEE Trans. Circuits Syst. Video Techn., pages 1833–1843, 2017.
[46] P. Ye, J. Kumar, and D. S. Doermann. Beyond human opinion scores: Blind image quality assessment based on synthetic scores. In CVPR, 2014.
[47] P. Ye, J. Kumar, L. Kang, and D. Doermann. Unsupervised feature learning framework for no-reference image quality assessment. In CVPR, 2012.
[48] H. Zeng, L. Zhang, and A. C. Bovik. A probabilistic quality representation approach to deep blind image quality prediction. CoRR, 2017.
[49] K. Zhang, W. Zuo, S. Gu, and L. Zhang. Learning deep CNN denoiser prior for image restoration. In CVPR, 2017.
[50] L. Zhang and H. Li. SR-SIM: A fast and high performance IQA index based on spectral residual. In ICIP, 2012.
[51] P. Zhang, W. Zhou, L. Wu, and H. Li. SOM: Semantic obviousness metric for image quality assessment. In CVPR, 2015.