Presentation PowerPoint
Single Image Super Resolution
2
What is Super Resolution?
Applications of Super Resolution
Deep Learning for Single Image Super Resolution
Some Issues for Super Resolution
What is Super Resolution?
Super Resolution
Restore a high-resolution (HR) image (or video) from a low-resolution (LR) image (or video)
According to the number of input LR images, SR is classified as single image SR (SISR) or multi-image SR (MISR)
Single Image Super Resolution
3
What is Super Resolution?
Single Image Super Resolution
Restore a high-resolution (HR) image (or video) from a low-resolution (LR) image (or video)
Ill-posed inverse problem: many different HR images map to the same LR image, so we cannot recover a unique ground truth from the LR image alone
(Figure: a single LR image corresponds to multiple plausible HR images: HR 1, HR 2, HR 3)
4
What is Super Resolution?
Interpolation-based Single Image Super Resolution
For the image upscaling task, bicubic, bilinear, or Lanczos interpolation is usually used.
Fast and easy, but low quality.
(Figure: bilinear upscaling vs. deep SR result)
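As a concrete baseline, here is a minimal sketch of interpolation-based upscaling in PyTorch (the input shape and the ×4 factor are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

lr = torch.rand(1, 3, 64, 64)  # dummy low-resolution RGB image (N, C, H, W)

# Classic interpolation-based upscaling: fast and simple, but blurry.
sr_bilinear = F.interpolate(lr, scale_factor=4, mode="bilinear", align_corners=False)
sr_bicubic = F.interpolate(lr, scale_factor=4, mode="bicubic", align_corners=False)
print(sr_bicubic.shape)  # torch.Size([1, 3, 256, 256])
```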
5
What is Super Resolution?
6
Single Image Super Resolution algorithms
Interpolation-based method
Reconstruction-based method
(Deep) Learning-based method
Today, I will cover the learning-based methods
Applications of Super Resolution
Satellite image processing
Medical image processing
Multimedia Industry and Video Enhancement
HD (1280×720), FHD (1920×1080)
Reference: “Super Resolution Applications in Modern Digital Image Processing”, 2016 IJCA
7
TV & Monitor
UHD(3840×2160)
Deep Learning for Single Image Super Resolution
Learning-based Single Image Super Resolution
To tackle this ill-posed inverse problem, almost all methods use this paradigm:
[HR (GT) image] + [distortion & down-sampling] → [LR (input) image]
This is a limitation of SISR training:
the overall restoration quality depends on the chosen distortion & down-sampling method
Reference: “Deep Learning for Single Image Super-Resolution: A Brief Review”, 2018 IEEE Transactions on Multimedia (TMM)
8
Deep Learning for Single Image Super Resolution
Learning-based Single Image Super Resolution
[HR (GT) image] + [distortion & down-sampling] → [LR (input) image]
In the CVPR 2017 SR challenge, many teams showed significant degradation of the quality metrics
Reference: http://www.vision.ee.ethz.ch/~timofter/publications/NTIRE2017SRchallenge_factsheets.pdf
9
Deep Learning for Single Image Super Resolution
First Deep Learning architecture for Single Image Super Resolution
SRCNN(2014) – three-layer CNN, MSE Loss, Early upsampling
Compared to traditional methods, it shows excellent performance.
Reference: “Image Super-Resolution Using Deep Convolutional Networks”, 2014 ECCV
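A minimal sketch of an SRCNN-style model, assuming the commonly used 9-5-5 kernel sizes and 64/32 channel widths; training details are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRCNN(nn.Module):
    """Three-layer CNN in the spirit of SRCNN: patch extraction,
    non-linear mapping, reconstruction."""
    def __init__(self, channels=1):
        super().__init__()
        self.extract = nn.Conv2d(channels, 64, kernel_size=9, padding=4)
        self.map = nn.Conv2d(64, 32, kernel_size=5, padding=2)
        self.recon = nn.Conv2d(32, channels, kernel_size=5, padding=2)

    def forward(self, x):
        # x is the bicubic-upsampled LR image ("early upsampling"),
        # so the network runs at HR resolution end to end.
        x = F.relu(self.extract(x))
        x = F.relu(self.map(x))
        return self.recon(x)

model = SRCNN()
lr_up = torch.rand(1, 1, 128, 128)                              # bicubic-upsampled luminance input
loss = F.mse_loss(model(lr_up), torch.rand(1, 1, 128, 128))     # MSE loss, as in the paper
```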
10
Deep Learning for Single Image Super Resolution
Efficient Single Image Super Resolution
FSRCNN (2016), ESPCN (2016)
Use late upsampling with a deconvolution or sub-pixel convolutional layer
(Early upsampling, as in SRCNN, is inefficient in memory and FLOPs)
Reference: “Image Super-Resolution Using Deep Convolutional Networks”, 2014 ECCV
11
Deep Learning for Single Image Super Resolution
FSRCNN(Fast Super-Resolution Convolutional Neural Network)
Use Deconvolution layer instead of pre-processing(upsampling)
Faster and more accurate than SRCNN
Reference: “Accelerating the Super-Resolution Convolutional Neural Network”, 2016 ECCV
12
Deep Learning for Single Image Super Resolution
ESPCN(Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel
Convolutional Neural Network)
Use a sub-pixel convolutional layer (pixel shuffle, also known as depth_to_space)
This sub-pixel convolutional layer is used in recent SR models
Reference: “Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network”, 2016 CVPR Code: https://github.com/leftthomas/ESPCN
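A minimal sketch of the sub-pixel (pixel shuffle) upsampling idea; the channel counts and the ×4 scale are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SubPixelUpsample(nn.Module):
    """Sub-pixel convolution (ESPCN-style): a conv producing C*r^2 channels
    followed by PixelShuffle (TensorFlow's depth_to_space)."""
    def __init__(self, in_channels=64, out_channels=3, scale=4):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels * scale ** 2, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)  # (N, C*r^2, H, W) -> (N, C, H*r, W*r)

    def forward(self, x):
        return self.shuffle(self.conv(x))

features = torch.rand(1, 64, 32, 32)            # LR-resolution feature maps
print(SubPixelUpsample()(features).shape)       # torch.Size([1, 3, 128, 128])
```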
13
Deep Learning for Single Image Super Resolution
Deeper Networks for Super-Resolution
SRCNN, FSRCNN, and ESPCN are shallow networks. Why not a deep network?
Early attempts failed to train deeper models, so shallow networks were used. How can we make deeper networks work?
Reference: “Image Super-Resolution Using Deep Convolutional Networks”, 2014 ECCV
14
Deep Learning for Single Image Super Resolution
VDSR(Accurate Image Super-Resolution Using Very Deep Convolutional Networks)
VGG-based deeper model (20 layers) for super-resolution → large receptive field
Residual learning & High learning rate with gradient clipping
MSE Loss, Early upsampling
Reference: “Accurate Image Super-Resolution Using Very Deep Convolutional Networks”, 2016 CVPR
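A hedged sketch of the VDSR idea (a deep stack of 3×3 convs that predicts the residual, trained with gradient clipping); the depth, width, and clipping threshold here are assumptions:

```python
import torch
import torch.nn as nn

class VDSR(nn.Module):
    """VDSR-style model: predicts the residual between the bicubic-upsampled
    input and the HR target, then adds it back to the input."""
    def __init__(self, depth=20, channels=1, width=64):
        super().__init__()
        layers = [nn.Conv2d(channels, width, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(width, channels, 3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return x + self.body(x)  # residual learning: predict (HR - bicubic)

model = VDSR()
out = model(torch.rand(1, 1, 128, 128))
out.mean().backward()
# A high learning rate is made stable by clipping gradients; the threshold is illustrative.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.4)
```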
15
Deep Learning for Single Image Super Resolution
Deeper Networks for Super-Resolution after VDSR
DRCN(Deeply-recursive Convolutional network), 2016 CVPR
SRResNet, 2017 CVPR
DRRN(Deep Recursive Residual Network), 2017 CVPR
Reference: “Deep Learning for Single Image Super-Resolution: A Brief Review”, 2018 IEEE Transactions on Multimedia (TMM)
16
Deep Learning for Single Image Super Resolution
Deeper Networks for Super-Resolution after VDSR
EDSR, MDSR (Enhanced Deep Residual Network, Multi Scale EDSR), 2017 CVPRW
DenseSR, 2017 CVPR
MemNet, 2017 CVPR
Reference: “Deep Learning for Single Image Super-Resolution: A Brief Review”, 2018 IEEE Transactions on Multimedia (TMM)
17
Deep Learning for Single Image Super Resolution
Generative Adversarial Network(GAN) for Super-Resolution
SRGAN(Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network)
First GAN-based SR model; MSE loss → blurry output; GAN loss + content loss = perceptual loss
Reference: “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network”, 2017 CVPR
18
Deep Learning for Single Image Super Resolution
Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network
MSE loss → blurry output; GAN loss + content loss = perceptual loss
Replace the MSE loss with a VGG feature (content) loss, as used in style transfer, and add an adversarial loss
Reference: “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network”, 2017 CVPR
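A hedged sketch of an SRGAN-style perceptual loss; the VGG feature slice and the adversarial weight loosely follow the paper but should be treated as assumptions:

```python
import torch
import torch.nn as nn
import torchvision

class PerceptualLoss(nn.Module):
    """VGG feature (content) loss plus adversarial loss for the generator."""
    def __init__(self, adv_weight=1e-3):
        super().__init__()
        # In practice, load ImageNet weights, e.g. vgg19(weights="IMAGENET1K_V1").
        vgg = torchvision.models.vgg19(weights=None).features[:36].eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg, self.adv_weight = vgg, adv_weight
        self.mse = nn.MSELoss()

    def forward(self, sr, hr, disc_logits_on_sr):
        content = self.mse(self.vgg(sr), self.vgg(hr))  # VGG loss instead of pixel-wise MSE
        adversarial = nn.functional.binary_cross_entropy_with_logits(
            disc_logits_on_sr, torch.ones_like(disc_logits_on_sr))  # generator wants D(sr) -> "real"
        return content + self.adv_weight * adversarial
```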
19
Deep Learning for Single Image Super Resolution
Generative Adversarial Network(GAN) for Super-Resolution
SRGAN, EnhanceNet, SRFeat, ESRGAN
Reference: “A Deep Journey into Super-resolution: A survey”, 2019 arXiv
20
Deep Learning for Single Image Super Resolution
Reference: “A Deep Journey into Super-resolution: A survey”, 2019 arXiv
21
Deep Learning for Single Image Super Resolution
Reference: “A Deep Journey into Super-resolution: A survey”, 2019 arXiv
22
Deep Learning for Single Image Super Resolution
23
Some Issues for Super Resolution
Checkerboard artifact
A deconvolution (transposed convolution) layer can easily produce “uneven overlap”
Simple solution: use “resize + conv” or “sub-pixel convolutional layer”
Reference: “Deconvolution and Checkerboard Artifacts”, distill blog(https://distill.pub/2016/deconv-checkerboard/)
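A minimal sketch of the “resize + conv” alternative to transposed convolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResizeConvUpsample(nn.Module):
    """Resize-then-convolve upsampling: every output pixel is covered by the same
    number of kernel taps, which avoids the uneven-overlap checkerboard pattern."""
    def __init__(self, in_channels, out_channels, scale=2):
        super().__init__()
        self.scale = scale
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=self.scale, mode="nearest")
        return self.conv(x)

print(ResizeConvUpsample(64, 64)(torch.rand(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 64, 64])
```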
24
Some Issues for Super Resolution
Loss function
Proposes and compares various loss functions for image restoration tasks
Reports the best results when using a mixed loss: MS-SSIM loss + ℓ1 loss
Reference: “Loss Functions for Image Restoration with Neural Networks”, 2016 IEEE TCI
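A hedged sketch of such a mixed loss, assuming a differentiable MS-SSIM helper such as the one in the third-party pytorch-msssim package (the Gaussian weighting of the ℓ1 term used in the paper is omitted):

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ms_ssim  # assumed third-party helper; any differentiable MS-SSIM works

def mixed_loss(pred, target, alpha=0.84):
    """alpha * (1 - MS-SSIM) + (1 - alpha) * L1, in the spirit of Zhao et al. 2016.
    The 0.84 weight follows the paper's reported setting; treat it as an assumption."""
    ms_ssim_term = 1.0 - ms_ssim(pred, target, data_range=1.0)
    l1_term = F.l1_loss(pred, target)
    return alpha * ms_ssim_term + (1 - alpha) * l1_term

loss = mixed_loss(torch.rand(2, 3, 256, 256), torch.rand(2, 3, 256, 256))
```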
25
Some Issues for Super Resolution
Loss function
Most recent papers use the ℓ1 loss
Reference: “A Deep Journey into Super-resolution: A survey”, 2019 arXiv
26
Some Issues for Super Resolution
Metric (Distortion measure)
Almost all papers use distortion metrics (PSNR and SSIM) as performance metrics
However, high PSNR/SSIM does not guarantee results that look good to humans
Reference: “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network”, 2017 CVPR
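For reference, PSNR is computed directly from the MSE; a minimal sketch:

```python
import torch

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio: PSNR = 10 * log10(MAX^2 / MSE).
    A pure distortion metric; it does not always track perceived quality."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

print(psnr(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)))
```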
27
Some Issues for Super Resolution
Metric (Human opinion score)
However, high PSNR/SSIM does not guarantee results that look good to humans
In the SRGAN paper, Mean Opinion Score (MOS) testing was done to quantify perceptual quality
26 raters, scores from 1 (bad) to 5 (excellent)
The raters were calibrated on the NN (score 1) and HR (score 5) versions of 20 images from the BSD300 dataset
28
Some Issues for Super Resolution
Metric Paper (The Perception-Distortion Tradeoff, 2018 CVPR)
Analyzes distortion measures (PSNR, SSIM, etc.) vs. human opinion scores (perceptual quality)
Good supplementary video: https://www.youtube.com/watch?v=6Yid4dituqo (PR-12)
Reference: “The Perception-Distortion Tradeoff”, 2018 CVPR
29
Through-Wall Human Pose Estimation Using Radio Signals
CVPR 2018
Mingmin Zhao, Tianhong Li, Mohammad Abu Alsheikh
Yonglong Tian, Hang Zhao, Antonio Torralba, Dina Katabi
Problem & Motivation
Estimate human poses through walls and occlusions
Visible light blocked by walls and opaque objects
Radio frequency signals traverse walls and occlusions
Background: RF Signals
Properties
Human body is specular
Resolution is low: ~10cm
Design Considerations
Challenges
Radio signal: Hard to label
Key idea: Using Vision modality to teach RF modality
Design Considerations
Invariant to space/time translation: spatiotemporal conv as building blocks (C3D, TSN)
RF has low spatial resolution, human body is specular: multi-frames as input
Need to transform RF heatmap view to camera view: encoder + decoder
Technical Details: Model
Teacher Network: OpenPose
Output: Human keypoint confidence map
Student Network
Input: 100 RF frames (3.3 s)
Encoder: strided conv (1×2×2 strides), 10 layers: removes the spatial dimensions and summarizes the information
Decoder: fractionally strided conv (1×0.5×0.5 strides), 4 layers
Task: Minimize the difference between two networks’ prediction (Heatmaps)
Loss: Binary Cross-Entropy
Keypoint association: NMS over keypoint candidates, then the relaxation method from OpenPose
Input & Output of OpenPose
Remove the spatial dimensions and summarize the information from the original (RF) view, then map the RF heatmaps into a representation that is no longer tied to the original heatmap view (a rough sketch of this encoder-decoder follows below)
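A rough, hedged sketch of the student network's overall shape (spatiotemporal convs, a strided-conv encoder, a fractionally strided decoder, and a BCE loss against the teacher's heatmaps); the channel counts, kernel sizes, layer counts, and keypoint number are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class RFPoseStudent(nn.Module):
    """Encoder: strided 3D convs (1,2,2) over RF frames; decoder: transposed
    3D convs (1,2,2) producing keypoint heatmap logits in the camera view."""
    def __init__(self, keypoints=14):
        super().__init__()
        enc, ch = [], 1
        for out_ch in [32, 64, 128, 256]:  # fewer layers than the 10 on the slide
            enc += [nn.Conv3d(ch, out_ch, kernel_size=3, stride=(1, 2, 2), padding=1),
                    nn.ReLU(inplace=True)]
            ch = out_ch
        self.encoder = nn.Sequential(*enc)
        dec = []
        for out_ch in [128, 64, keypoints]:
            dec += [nn.ConvTranspose3d(ch, out_ch, kernel_size=(3, 4, 4),
                                       stride=(1, 2, 2), padding=1),
                    nn.ReLU(inplace=True)]
            ch = out_ch
        self.decoder = nn.Sequential(*dec[:-1])  # no ReLU on the output logits

    def forward(self, rf_frames):                # (N, 1, T, H, W), e.g. T = 100
        return self.decoder(self.encoder(rf_frames))

student = RFPoseStudent()
rf = torch.rand(1, 1, 100, 64, 64)
teacher_heatmaps = torch.rand(1, 14, 100, 32, 32)  # OpenPose confidence maps (illustrative shape)
loss = nn.functional.binary_cross_entropy_with_logits(student(rf), teacher_heatmaps)
```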
Experiment: Settings
Data Collection
Synchronized data from web camera and RF sensor
More than 50 hours and 50 environments
More than 1k different people
Number of people in each frame: 0 to 14 – Various activities
Ground truth: human labelling for visible scenes; a multi-camera 3D skeleton for through-wall scenes
Metric: Average Precision (AP) over different Object Keypoint Similarity (OKS) thresholds
Experiment: Results
Low OKS threshold: RF better!
(OpenPose has higher false alarms: poster, image in mirror)
Body parts: Large reflected area, slow motion
Outperforms OpenPose at AP@50 (OKS threshold 0.5)
Experiments: Analysis
Guided backpropagation: compute the gradient, zero out negative values, and backpropagate, to understand which image features a neuron detects
Spatial attention
Temporal attention
raise forearm, backarm
Take-home Messages
Sensing modalities are not limited to RGB-D
Cross modal learning: transfers discriminative knowledge from well established model (e.g. visual recognition) to other modalities with a bridge
This work: visual pose estimation for RF
SoundNet (NIPS 2016): object identification based solely on audio. Visual object detection for audio
Deep Image Prior
Dmitry Ulyanov, Andrea Vedaldi, Victor Lempitsky
Part of this slide is based on the presentation given by Dmitry Ulyanov on this paper
Background and Motivation
State-of-the-art ConvNets for image restoration and generation are almost invariably trained on large datasets of images. One may thus assume that their excellent performance is due to their ability to learn realistic image priors from a large number of example images.
However, learning alone is insufficient to explain the good performance of deep networks.
Recent research has shown that generalization requires the structure of the network to “resonate” with the structure of the data.
What did they do?
In this paper, they show that, contrary to the belief that learning is necessary for building good image priors, a great deal of image statistics are captured by the structure of a convolutional image generator independent of learning.
They cast reconstruction as a conditional image generation problem and show that the only information required to solve it is contained in the single degraded input image and the handcrafted structure of the network used for reconstruction.
Instead of trying to beat state-of-the-art neural networks, they aim to show that the structure of the network itself imposes a strong prior.
Result
Image restoration – Method
x – clean image
x0 – corrupted/degraded image (observed)
x* – restored image
Degradation for denoising: x0 = x + noise
Restoration model (MAP): x* = argmax_x p(x0 | x) p(x)
If there is no preference for a prior, the prior p(x) is a constant. Then
=> the best estimate of the clean image is the corrupted image itself, x* = x0
x – clean image
x0 – corrupted image (observed)
x* – restored image
Expressed as an energy minimization problem:
x* = argmin_x E(x; x0) + R(x)
where E(x; x0) is a task-dependent data term and R(x) is a regularizer (prior), e.g. E(x; x0) = ||x − x0||² with R(x) = total variation (TV)
Deep Image Prior
x0 – corrupted image (observed)
Parametrization: x = f_θ(z)
Interpreting the neural network as a parametrization of the image:
θ* = argmin_θ E(f_θ(z); x0), x* = f_θ*(z)
f_θ – convolutional network with parameters θ; z – fixed input
In particular, most of their experiments are performed using a U-Net type “hourglass” architecture (also known as “encoder-decoder”) with skip-connections, where z and x have the same spatial size.
Deep Image Prior, step by step
x0 – corrupted image (observed)
x* – restored image
1. Initialize z, for example fill it with uniform noise U(−1, 1)
2. Solve θ* = argmin_θ E(f_θ(z); x0) with any favorite gradient-based method
3. Get the solution x* = f_θ*(z)
The network has high impedance to noise and low impedance to signal. Therefore, for most applications, they restrict the optimization to a fixed number of iterations (early stopping).
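A minimal sketch of this loop for denoising; build_hourglass_net is a placeholder for the hourglass generator, and the learning rate and iteration budget are assumptions:

```python
import torch
import torch.nn as nn

def deep_image_prior(x0, build_hourglass_net, num_iters=2400, lr=0.01):
    """Fit a randomly initialized generator f_theta(z) to the corrupted image x0
    and rely on early stopping as the implicit prior."""
    f_theta = build_hourglass_net()                  # randomly initialized, never pretrained
    z = torch.rand_like(x0).uniform_(-1.0, 1.0)      # 1. fixed input z ~ U(-1, 1)
    optimizer = torch.optim.Adam(f_theta.parameters(), lr=lr)
    data_term = nn.MSELoss()
    for _ in range(num_iters):                       # early stopping: fixed iteration budget
        optimizer.zero_grad()
        loss = data_term(f_theta(z), x0)             # 2. E(f_theta(z); x0)
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return f_theta(z)                            # 3. x* = f_theta*(z)
```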
Data term E(x; x0)
x – clean image
x0 – corrupted image (observed)
m – binary mask
Objective: θ* = argmin_θ E(f_θ(z); x0)
Denoising: E(x; x0) = ||x − x0||² (needs early stopping!)
Inpainting: E(x; x0) = ||(x − x0) ⊙ m||², where ⊙ is Hadamard's (element-wise) product and m is the binary mask
Super-resolution: E(x; x0) = ||d(x) − x0||², where d(·) is a downsampling operator that resizes the image
Feature inversion: E(x; x0) = ||Φ(x) − Φ(x0)||², where Φ is the first several layers of a neural network trained to perform, e.g., classification
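For example, the inpainting data term only measures the error on observed pixels; a minimal sketch:

```python
import torch

def inpainting_data_term(fx, x0, m):
    """E = ||(f_theta(z) - x0) * m||^2 with a binary mask m (1 = known pixel, 0 = hole):
    the reconstruction is only penalized where the corrupted image is observed."""
    return torch.sum(((fx - x0) * m) ** 2)

fx = torch.rand(1, 3, 64, 64)                   # current network output f_theta(z)
x0 = torch.rand(1, 3, 64, 64)                   # corrupted image
m = (torch.rand(1, 3, 64, 64) > 0.5).float()    # e.g. 50% of pixels dropped at random
loss = inpainting_data_term(fx, x0, m)
```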
Experiments
Denoising and generic reconstruction
The Deep Image Prior approach can restore an image with a complex degradation (JPEG compression in this case). As the optimization progresses, the deep image prior recovers most of the signal while getting rid of halos and blockiness (after 2400 iterations), before eventually overfitting to the input (at 50K iterations).
The deep image prior is successful at recovering both man-made and natural patterns.
Super-resolution
use a scaling factor of 4 to compare to other works
fix the number of optimization steps to be 2000 for every image
Inpainting
Text inpainting
54
Image restoration
The mask is sampled to drop 50% of the pixels at random
(g) is the result from the comparison with Shepard networks
Inpainting of large holes
The deep image prior utilizes context of the image and interpolates the unknown region with textures from the known part. Such behaviour highlights the relation between the deep image prior and traditional self-similarity priors
Inpainting of large holes
Inpainting using different depths and architectures.
The figure shows that much better inpainting results can be obtained with deeper random networks. However, adding skip connections (as in U-Net) is highly detrimental for this task.
Feature-inversion (AlexNet Inversion)
Given the image on the left, it shows the natural pre-image obtained by inverting different layers of AlexNet using three different regularizers.
The deep image prior results in inversions at least as interpretable as the ones of [8].
58
Flash/No Flash
The proposed approach can be extended to the restoration of multiple images.
Flash/no-flash image-pair restoration aims to obtain an image of the scene with lighting similar to the no-flash image, while using the flash image as a guide to reduce the noise level.
The deep image prior yields a low-noise reconstruction with lighting very close to the no-flash image.
Colorful Image Colorization
Richard Zhang, Phillip Isola, Alexei (Alyosha) Efros
richzhang.github.io/colorization
60
Ansel Adams, Yosemite Valley Bridge
Consider this iconic photograph of Yosemite Valley from Ansel Adams. So how would it look in COLOR? Now, on the face of it, the problem is rather underconstrained. We are looking to go from a 1-dimensional signal to a 3-dimensional signal. However, you and I have seen many color images and have no trouble doing this. We know that the sky is probably blue, the mountain is likely brown, and the trees are most definitely green.
61
Ansel Adams, Yosemite Valley Bridge – Our Result
So this clearly calls for the use of data, and we can use machine learning to help solve the problem.
62
Grayscale image: L channel
Color information: ab channels
ab
L
So formally, we are working in the Lab color space. The grayscale information is contained in the L, or lightness channel of the image, and is the input to our system.
The output is the ab, or color channels.
We’re looking to learn the mapping from L to ab using a CNN.
We can then take the predicted ab channels, concatenate them with the input, and hopefully get a plausible colorization of the input image. This is the graphics benefit of this problem.
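A minimal sketch of this L/ab decomposition using scikit-image; data.astronaut() is just a convenient sample image and the all-zero "prediction" is a placeholder for the CNN output:

```python
import numpy as np
from skimage import color, data

# Split an RGB image into the L (lightness) input and the ab (color) target.
rgb = data.astronaut() / 255.0
lab = color.rgb2lab(rgb)                  # H x W x 3, channels = (L, a, b)
L, ab = lab[..., :1], lab[..., 1:]        # network input / network target

# Training pairs come "for free": every color image supervises itself.
pred_ab = np.zeros_like(ab)               # stand-in for the CNN's predicted ab channels
recolored = color.lab2rgb(np.concatenate([L, pred_ab], axis=-1))  # concatenate (L, ab), convert back
```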
63
ab
L
Concatenate (L,ab)
Grayscale image: L channel
“Free” supervisory signal
Semantics? Higher-level abstraction?
We note that any image can be broken up into its grayscale and color components, and in this manner, can serve as a free supervisory signal for training a CNN. So perhaps by learning to color, we can achieve a deep representation which has higher level abstractions, or semantics.
Now, this learning problem is less straightforward than one may expect.
64
Inherent Ambiguity
Grayscale
For example, consider this grayscale image.
65
Inherent Ambiguity
Our Output
Ground Truth
This is the output after passing it through our system. Now, it seems to look plausible. Now here is the ground truth. So notice that these two look very different. But even though red and blue are far apart in ab space, we are just as happy with the red colorization as we are with the blue, and perhaps the red is even better…
66
Colors in ab space
(continuous)
Better Loss Function
Regression with L2 loss inadequate
Use multinomial classification
Class rebalancing to encourage learning of rare colors
This indicates that any loss which assumes a unimodal output distribution, such as an L2 regression loss, is likely to be inadequate.
67
Better Loss Function
Colors in ab space
(discrete)
Regression with L2 loss inadequate
Use multinomial classification
Class rebalancing to encourage learning of rare colors
We reformulate the problem as multinomial classification. We divide the output ab space into discrete bins of size 10.
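A minimal sketch of the ab quantization, assuming a simple full grid of 10×10 bins rather than the paper's exact 313 in-gamut bins:

```python
import numpy as np

AB_MIN, AB_MAX, BIN_SIZE = -110, 110, 10
bins_per_axis = (AB_MAX - AB_MIN) // BIN_SIZE      # 22 bins per axis

def ab_to_class(ab):
    """ab: (..., 2) array of a,b values -> integer bin index per pixel."""
    idx = np.clip((ab - AB_MIN) // BIN_SIZE, 0, bins_per_axis - 1).astype(int)
    return idx[..., 0] * bins_per_axis + idx[..., 1]  # flatten (a_bin, b_bin) to one class label

labels = ab_to_class(np.random.uniform(-80, 80, size=(256, 256, 2)))
print(labels.min(), labels.max())                  # class indices for the cross-entropy loss
```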
68
So given the input grayscale image of the bird on the upper left, here is a predicted distribution from our system. Each tile corresponds to one output color bin in our multinomial classification problem (subsampled for visual clarity). For each tile, spatial regions with high lightness values indicate high probability for that color.
Note that the foreground object, the bird, has been predicted to be blue, purple, or perhaps red.
Meanwhile, the background vegetation has been classified to be either green, yellow, or perhaps brown.
69
Histogram over ab space
log10 probability
Regression with L2 loss inadequate
Use multinomial classification
Class rebalancing to encourage learning of rare colors
Better Loss Function
But this is not the end of the story. We also have to take into consideration the statistics of natural images. To the right, we have a 2d histogram showing the occurrences of colors over all pixels in ImageNet on a log scale. You can see that most of the distribution is at the center of the gamut, where colors are desaturated and bland. This is because almost all pixels in images belong to background. For example, if you look around this room, almost every pixel will be white.
Without taking this into account, given any uncertainty, the predictions will tend to be desaturated.
70
Regression with L2 loss inadequate
Use multinomial classification
Class rebalancing to encourage learning of rare colors
Better Loss Function
Histogram over ab space
log10 probability
As such, we add a class rebalancing term in the training objective, effectively oversampling rarer, more vibrant colors relative to their representation in the training set.
It is the combination of these modifications, using classification instead of regression, and adding the class-rebalancing, that makes our results qualitatively more colorful than previous and concurrent work, hence the name of our project, “Colorful Image Colorization”
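A hedged sketch of such a rebalancing weight, following the paper's w ∝ ((1−λ)p + λ/Q)⁻¹ form with λ = 0.5; the smoothing of the empirical distribution is omitted here:

```python
import numpy as np

def rebalancing_weights(empirical_p, lam=0.5):
    """Per-bin weights that up-weight rare (vibrant) colors, normalized so the
    expected weight under the empirical distribution is 1. lam = 0.5 follows the
    paper; treat the simplifications here as assumptions."""
    Q = empirical_p.shape[0]
    w = 1.0 / ((1 - lam) * empirical_p + lam / Q)
    return w / np.sum(empirical_p * w)             # E_p[w] = 1

p = np.random.dirichlet(np.ones(313))              # stand-in for the ImageNet ab-bin histogram
w = rebalancing_weights(p)                         # multiply each pixel's loss by w[its ab bin]
```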
71
Hertzmann et al. In SIGGRAPH, 2001.
Welsh et al. In TOG, 2002.
Irony et al. In Eurographics, 2005.
Liu et al. In TOG, 2008.
Chia et al. In ACM 2011.
Gupta et al. In ACM, 2012.
Larsson et al. In ECCV 2016. [Concurrent]
Dahl. Jan 2016. Iizuka et al. In SIGGRAPH, 2016.
Deshpande et al., Cheng et al. In ICCV, 2015.
Charpiat et al. In ECCV 2008.
(Taxonomy axes: Non-parametric vs. Parametric; Hand-engineered Features vs. Deep Networks; L2 Regression vs. Classification)
Upcoming Oral O-3A-04
Tomorrow, 9–10 AM
So how does our system compare to previous work in this problem?
Much previous work has focused on using non-parametric approaches. Generally, a reference color image is first obtained, and the colors are then transferred over to the grayscale image. These can work very well, but often times do not generalize, and obtaining the reference image may be slow or require user intervention.
Many previous parametric approaches have used L2 regression as a loss function, both before
and after the deep learning era.
Now we are not the first to have the insight of using classification. There is actually some older work from Charpiat et al which introduced using classification for colorization, which inspired us.
More recently, concurrent work by Larsson et al trained a deep network using a classification loss. They will be presenting in tomorrow morning’s oral session.
72
Network Architecture
(Figure: VGG-style backbone on the 224×224 lightness (L) input; conv1–conv5 with spatial resolutions 224→112→56→28→14 and channel widths 64→128→256→512→512, followed by fc6/fc7 with 4096 units; the output is the ab color prediction (313 bins), concatenated back with L)
So how do we minimize the loss function that we have formulated?
Well first, since we have converted this problem into pixel-wise classification, this allows us to draw on the insights and advances made in the semantic segmentation literature.
We start with the architecture of a VGG network,
remove the fc layers
73
Network Architecture
(Figure: modified VGG on the 224×224 lightness input; conv1–conv5 with 64→128→256→512→512 channels and resolutions 224→112→56→28; conv5 and conv6 use à trous [1]/dilated [2] convolutions to stay at 28×28; conv7 and conv8 follow, and the network outputs a distribution over 313 ab color bins, concatenated back with L)
[1] Chen et al. In arXiv, 2016.
[2] Yu and Koltun. In ICLR, 2016.
add additional spatial resolution using à trous (dilated) convolutions, add additional convolutional layers on top, and produce a color distribution for each pixel.
There is a final step, where we must go from a predicted distribution into a single point estimate. We do this by simply interpolating between the mean and the mode of the distribution, allowing us to keep vibrancy in the output while maintaining spatial consistency.
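A hedged sketch of this "annealed mean": re-normalize the predicted distribution with a temperature and take its expectation over the ab bin centers (T = 0.38 follows the paper; the bin centers here are placeholders):

```python
import numpy as np

def annealed_mean(probs, bin_centers, T=0.38):
    """Interpolate between the distribution's mode (T -> 0) and mean (T = 1) by
    temperature-scaling the probabilities, then take the expected ab value."""
    logits = np.log(probs + 1e-8) / T
    annealed = np.exp(logits - logits.max(axis=-1, keepdims=True))
    annealed /= annealed.sum(axis=-1, keepdims=True)
    return annealed @ bin_centers                  # (..., Q) x (Q, 2) -> (..., 2) ab estimate

probs = np.random.dirichlet(np.ones(313), size=(256, 256))  # per-pixel predicted distribution
centers = np.random.uniform(-110, 110, size=(313, 2))       # stand-in for the real bin centers
ab_pred = annealed_mean(probs, centers)                      # per-pixel ab point estimate
```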
74
GT
Class w/ Rebalancing
L2 Regression
So how do we do? Well, consider the image of the boat. L2 regression fares quite well here: since the sky is always blue and the vegetation is always green, there is only one plausible color, so L2 regression works just fine. Our full system does just as well.
However, consider this bird. Because of the multimodality of the output, L2 regression gives us a sepia result, whereas our full system results in a bright blue bird with a yellow belly.
Now you may be wondering what the ground truth colors look like. Notice that the bird is actually yellow in ground truth! Even though the L2 distance between the blue bird and the yellow bird is rather far, we are satisfied with the output, since blue is a plausible colorization of the bird.
75
Failure Cases
The system does have some interesting failure cases. We find that many man-made objects can be multiple colors. The system sometimes has a difficult time deciding which one to go with, leading to this type of tie-dye effect.
76
Biases
Also, we find other curious behaviors and biases. For example, when the system sees a dog, it sometimes expects a tongue underneath. Even when there is none, it will just go ahead and hallucinate one for us anyways.
77
Evaluation
Visual Quality
  Quantitative: per-pixel accuracy, perceptual realism, semantic interpretability
  Qualitative: low-level stimuli, legacy grayscale photos
Representation Learning
  Task generalization: ImageNet classification
  Task & dataset generalization: PASCAL classification, detection, segmentation
  Hidden unit activations
One of our contributions is to really consider how to properly evaluate the colorization task. Previous works have focused on measures such as per-pixel accuracy. We evaluate these in our paper as well, but this metric does not take into account the joint interaction between pixels and does not speak towards the perceptual realism or visual plausibility of the synthesized colors
Of course no metric is perfect, so we propose a few tests which shed some light on the performance of our system. Note that our problem falls under the general problem of image synthesis, and these evaluations may be applicable to related synthesis tasks as well.
We also test the capability for colorization to learn representations.
78
Evaluation
Due to time constraints, we will not be able to discuss all of the tests, but please come by our poster for more details.
79
Evaluation
So to really get at the perceptual realism of our method, we ran a Amazon Mechanical Turk test using real human judgements.
80
Perceptual Realism / Amazon Mechanical Turk Test
We would like to invite all of you to participate in a version of this test now. We will first show you two images, one ground truth and one synthesized. You should decide which of the two is the FAKE, or SYNTHESIZED, image.
81
Right is fake
82
clap if “fake”
clap if “fake”
Now clap if you believe the left one to be fake.
Now clap if you believe the right image to be fake.
83
Fake, 0% fooled
Very good, most of you were able to correctly identify the right image as FAKE. The lack of spatial consistency on the truck results in this smudging effect, which serves as a dead giveaway on an otherwise good colorization. This is one of the failure modes of our system.
84
Left is fake
85
clap if “fake”
clap if “fake”
Now clap if the left is fake.
Now clap if the right is fake.
86
Fake, 55% fooled
In this case, our synthesized result looks almost identical to the ground truth. In these cases, the Turkers are fooled about 50% of the time. Let’s try one final Turk test.
87
Left is fake
88
clap if “fake”
clap if “fake”
Clap if the left one is fake.
Clap if the right one is fake.
89
Fake, 58% fooled
Here, the ground truth image is of a blue chameleon, which is rather unusual. Our system learns that chameleons are usually green. There are times like these where our system will actually predict a more prototypical appearance than the ground truth, which can lead to fooling rates above 50%.
90
from Reddit /u/SherySantucci
We have also found some additional examples of this behavior. This example was submitted from a reddit user.
91
Recolorized by Reddit ColorizeBot
Our algorithm makes him normal colored. This was processed by the Reddit ColorizeBot, which is running our system under the hood.
92
Photo taken by Reddit /u/Timteroo,
Mural from street artist Eduardo Kobra
Colorfully tiled Yoda becomes
93
Recolorized by Reddit ColorizeBot
Normal green yoda.
94
Perceptual Realism Test: AMT “Labeled Real” [%] (1600 images tested per algorithm)
Ground Truth: 50
Ours (full): 32.3
Larsson et al. [Concurrent]: 27.2
Ours (class): 23.9
Ours (L2): 21.2
Random: 13.0
So how does our system fare quantitatively?
If we were to produce ground truth colorizations, we would fool participants 50% of the time by definition, and so this acts as a soft ceiling for this metric.
We tried a random baseline, where we added the colors from a random image. This fooled users about 13% of the time.
We achieve 21.2% with a L2 regression loss.
With classification, we go up to 24%.
With the class rebalancing term, we get over 32%.
The concurrent method from Larsson et al uses a classification loss, without rebalancing, and achieves 27%. This indicates that our choices to go to classification, and then to add the rebalancing term, resulted in more plausible colorizations
95
Input
Ground Truth
Output
So let’s try to get more intuition on how the network is performing this task. [pause]
If we colorize these two equiluminant vegetables, the network has little trouble doing so.
What about the Macbeth chart that it has never seen before? It fails.
This suggests to us that rather than exploiting low-level cues, the network is perhaps actually recognizing the objects.
96