COMP9517: Computer Vision
Deep Learning
Week 8 COMP9517 2021 T3 1
Recap: CNNs for supervised image classification
• Beyond classification
• Beyond single image input
• Beyond strong supervision
Vision Beyond Classification
• An image is worth a thousand words
• Classification models learn only a few
• ResNet-50: bicycle, garden
• Holy grail: a model that achieves human-level scene understanding
Vision Beyond Classification
• Object Detection
• Semantic Segmentation
• Scene Understanding
• Pose Estimation
References and further reading: https://github.com/kjw0612/awesome-deep-vision
Vision Beyond Classification
Identified tasks
• Object Detection
• Semantic Segmentation
• Instance Segmentation
Figures from Microsoft COCO: Common Objects in Context, Lin et al, 2014
Object Detection
• Multi-task: classification + localization
• Input: an RGB image
• Output: class label + bounding box
Image from COCO dataset – Microsoft COCO: Common Objects in Context, Lin et al, 2014
Object Detection
How to learn to predict class label + bounding box?
• Classification: softmax + cross-entropy loss
• Regression: quadratic loss, $l(x, y) = |y - x|^2$
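The two losses named on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not a detector's actual training code: the logits, the box format (four coordinates), and the unweighted sum of the two terms are assumptions made for the example.

```python
import numpy as np

def softmax(logits):
    # Subtract the maximum for numerical stability; the result sums to 1.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(logits, class_index):
    # Negative log-probability that the softmax assigns to the true class.
    return -np.log(softmax(logits)[class_index])

def quadratic_loss(pred_box, true_box):
    # l(x, y) = |y - x|^2, summed over the box coordinates.
    diff = np.asarray(true_box, dtype=float) - np.asarray(pred_box, dtype=float)
    return float(np.sum(diff ** 2))

# Hypothetical prediction for one object: 3 class scores and one box.
logits = np.array([2.0, 0.5, -1.0])
cls_loss = cross_entropy(logits, 0)
box_loss = quadratic_loss([10, 20, 50, 80], [12, 18, 50, 84])

# Assumption: a multi-task loss formed as a plain sum of both terms.
total_loss = cls_loss + box_loss
```

In practice the two terms are usually weighted against each other; the plain sum above is only the simplest choice.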
Object Detection
Summary: classification vs regression
Classification | Regression
Map inputs to predefined classes | Map inputs to continuous values
Discrete values | Continuous values
Unordered data | Ordered data
Object Detection
How to learn to predict class label + bounding box?
• Classification then regression
Object Detection
Two categories of deep-learning-based methods
Two-stage methods:
• R-CNN
• Fast R-CNN
• Faster R-CNN
One-stage methods:
• RetinaNet
Object Detection
Faster R-CNN
• Two-stage detector
  i. Identify candidate bounding boxes (region proposals)
  ii. Classify and refine
• Architecture: Region Proposal Network (RPN) + Fast R-CNN
Figure from Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, Ren et al, 2016
Object Detection
Faster R-CNN
• Region Proposal Network (RPN)
Figure from Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, Ren et al, 2016
Object Detection
RetinaNet
• One-stage detector
• Architecture: ResNet + FPN + two task-specific subnets (classification and box regression)
Figure from Focal Loss for Dense Object Detection, Lin et al, 2017
Object Detection
• FPN (Feature Pyramid Network)
Figure from Feature Pyramid Networks for Object Detection, Lin et al, 2017
Object Detection
Issue with one-stage detectors
• Most candidate bounding boxes are background (easy negatives), so they dominate the training loss
Figure from Focal Loss for Dense Object Detection, Lin et al, 2017
Object Detection
RetinaNet solution
• Use Focal Loss (FL), which down-weights the loss of well-classified examples
Figure from Focal Loss for Dense Object Detection, Lin et al, 2017
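As a rough illustration of how focal loss handles the background-dominated setting, the sketch below implements the binary, alpha-balanced form FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t) from Lin et al, 2017, with that paper's reported defaults gamma = 2 and alpha = 0.25. The example probabilities and labels are made up for the demonstration.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    # p: predicted foreground probabilities, y: binary labels (1 = object, 0 = background).
    p = np.clip(np.asarray(p, dtype=float), 1e-7, 1 - 1e-7)
    y = np.asarray(y)
    # p_t is the probability assigned to the true class of each example.
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    # The factor (1 - p_t)^gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# Hypothetical candidates: one object, one easy background, one hard background.
probs = np.array([0.9, 0.01, 0.6])
labels = np.array([1, 0, 0])
losses = focal_loss(probs, labels)
```

With gamma = 0 and alpha = 0.5 the expression reduces to half the ordinary binary cross-entropy, which makes the effect of the modulating factor easy to compare.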
Semantic Segmentation
• Input: an RGB image
• Output: class label for every pixel
• Dense prediction problem
Image from COCO dataset – Microsoft COCO: Common Objects in Context, Lin et al, 2014
Semantic Segmentation
UpSampling
Recap: pooling computes the mean or max over small windows to reduce resolution.
Semantic Segmentation
UpSampling: increases resolution; here with a 2×2 kernel.
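To make the pooling and upsampling operations concrete, here is a small NumPy sketch. It assumes a single-channel feature map with even height and width, and it uses nearest-neighbour upsampling, which is only one of several options (the slide's figure may show a different variant, such as max-unpooling or transposed convolution).

```python
import numpy as np

def max_pool_2x2(x):
    # Split the map into non-overlapping 2x2 windows and keep each window's maximum.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample_nearest_2x2(x):
    # Copy every value into a 2x2 block, doubling height and width.
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

x = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool_2x2(x)                 # shape (2, 2)
restored = upsample_nearest_2x2(pooled)  # shape (4, 4)
```

Upsampling restores the resolution but not the detail discarded by pooling, which is one motivation for the skip connections in encoder-decoder networks.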
Semantic Segmentation
U-Net
Figure from U-Net: Convolutional Networks for Biomedical Image Segmentation, Ronneberger et al, 2015
Semantic Segmentation
Recall RetinaNet: its ResNet + FPN backbone also has a U shape
Figure from Focal Loss for Dense Object Detection, Lin et al, 2017
Instance Segmentation
• Input: an RGB image
• Output: class label and mask for every instance
• Object detection + segmentation
Image from COCO dataset – Microsoft COCO: Common Objects in Context, Lin et al, 2014
Instance Segmentation
Two categories of methods
• Two-stage methods
  • Top-down (‘detect-then-segment’): Mask R-CNN
  • Bottom-up: semantic segmentation + instance embedding
• Single-stage methods: PolarMask, AdaptIS
Instance Segmentation
Mask R-CNN
• Faster R-CNN + mask head
• RoIAlign instead of the RoI pooling used in Faster R-CNN
Image from Mask R-CNN, He et al, 2018
Instance Segmentation
Mask R-CNN
• RoIAlign
The dashed grid represents a feature map, the solid lines an RoI (with 2×2 bins in this example), and the dots the 4 sampling points in each bin. RoIAlign computes the value of each sampling point by bilinear interpolation from the nearby grid points on the feature map. No quantization is performed on any coordinates involved in the RoI, its bins, or the sampling points.
Image from Mask R-CNN, He et al, 2018
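The RoIAlign description above can be sketched as follows. This is an illustrative single-channel version under stated assumptions: the RoI is given in continuous feature-map coordinates as (y_start, x_start, y_end, x_end), all sampling points fall inside the feature map, each bin uses 2×2 regularly spaced sampling points, and the points are averaged (the paper also allows max). The function names are hypothetical.

```python
import numpy as np

def bilinear_sample(feature, y, x):
    # Interpolate the feature map at a continuous location (y, x).
    h, w = feature.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feature[y0, x0] + dx * feature[y0, x1]
    bottom = (1 - dx) * feature[y1, x0] + dx * feature[y1, x1]
    return (1 - dy) * top + dy * bottom

def roi_align(feature, roi, out_size=2, samples=2):
    # No coordinate is rounded: bin edges and sampling points stay continuous.
    y_start, x_start, y_end, x_end = roi
    bin_h = (y_end - y_start) / out_size
    bin_w = (x_end - x_start) / out_size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            values = []
            for a in range(samples):
                for b in range(samples):
                    y = y_start + (i + (a + 0.5) / samples) * bin_h
                    x = x_start + (j + (b + 0.5) / samples) * bin_w
                    values.append(bilinear_sample(feature, y, x))
            out[i, j] = np.mean(values)  # assumption: average the sampling points
    return out

# Toy feature map whose value at (y, x) is 10*y + x.
feature = np.add.outer(10.0 * np.arange(5), np.arange(5.0))
aligned = roi_align(feature, (0.5, 0.5, 2.5, 3.5))
```

RoI pooling would instead round the RoI and bin boundaries to integer grid positions, which is the quantization that RoIAlign avoids.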
Instance Segmentation
SOLO (Segmenting Objects by Locations)
• Box-free
• Introduces the notion of “instance categories”, i.e., the quantized center locations and object sizes
Image from SOLO: Segmenting Objects by Locations, Wang et al, 2020
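One way to picture the "quantized center location" idea is to map an object's center to a cell of an S×S grid laid over the image. The sketch below is a simplified illustration only: the function name, the image size, and the grid size are assumptions, and the handling of object sizes (which SOLO assigns to different feature pyramid levels) is left out.

```python
def center_to_cell(cx, cy, width, height, s):
    # Quantize a continuous center (cx, cy) in pixels to a cell of an s x s grid.
    col = min(int(cx / width * s), s - 1)
    row = min(int(cy / height * s), s - 1)
    # Flattened cell index, used here as the location-based "instance category".
    return row, col, row * s + col

# Hypothetical object centered at (320, 100) in a 640x480 image, with a 5x5 grid.
row, col, category = center_to_cell(320, 100, 640, 480, 5)
```

Each grid cell then predicts the mask of the object whose center falls into it, so no bounding boxes are needed.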
Evaluation Metrics
Classification
• Accuracy: percentage of correct predictions
Object detection & segmentation
• Intersection-over-union (IoU); recall the image segmentation lecture in Week 5
• IoU is not directly differentiable, so it is typically used for evaluation rather than as a training loss
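IoU can be computed for both bounding boxes and segmentation masks. The sketch below assumes boxes in (x1, y1, x2, y2) corner format and masks as boolean NumPy arrays of equal shape; both conventions are choices made for this example.

```python
import numpy as np

def box_iou(a, b):
    # Intersection rectangle; width and height are clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mask_iou(m1, m2):
    # Pixel-wise intersection and union of two boolean masks.
    inter = np.logical_and(m1, m2).sum()
    union = np.logical_or(m1, m2).sum()
    return inter / union if union > 0 else 0.0

iou_boxes = box_iou((0, 0, 2, 2), (1, 1, 3, 3))

m1 = np.zeros((4, 4), dtype=bool)
m2 = np.zeros((4, 4), dtype=bool)
m1[:2, :] = True   # top two rows
m2[1:3, :] = True  # middle two rows
iou_masks = mask_iou(m1, m2)
```

A detection is commonly counted as correct when its IoU with a ground-truth box exceeds a chosen threshold such as 0.5.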
Beyond single image input
Motion helps object recognition when learning to see.
• Motion provides cues for object recognition during learning
• Natural data augmentation: translation, scale, 3D rotation, camera motion, light changes
Beyond single image input
Identified tasks
• Image-pair input: optical flow estimation
• Video input: target tracking, action recognition
Optical Flow Estimation
• Input: a pair of RGB images
• Output: dense flow map (real values), i.e. a 2D translation displacement per pixel
Optical Flow Estimation
• Encoder-decoder architecture (similar to U-Net)
• Supervised training
• Loss: Euclidean distance between predicted and ground-truth flow vectors (endpoint error)
Image from FlowNet: Learning Optical Flow with Convolutional Networks, Dosovitskiy et al, 2015
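The Euclidean-distance loss for optical flow can be sketched as the average per-pixel distance between predicted and ground-truth flow vectors. The (H, W, 2) array layout, with the last axis holding the horizontal and vertical displacement, is an assumption made for this example.

```python
import numpy as np

def endpoint_error(flow_pred, flow_true):
    # Euclidean distance between flow vectors at each pixel, averaged over the image.
    per_pixel = np.linalg.norm(flow_pred - flow_true, axis=-1)
    return float(per_pixel.mean())

# Toy example: the network predicts no motion, the true motion is (3, 4) everywhere.
flow_pred = np.zeros((2, 2, 2))
flow_true = np.tile(np.array([3.0, 4.0]), (2, 2, 1))
loss = endpoint_error(flow_pred, flow_true)
```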
Video input
Video models using 3D convolutions
Stack frames into a T×H×W×3 volume
Video input
Recap 2D convolution operation
• The kernel slides across spatial dimensions.
Video input
3D convolution operation
• The kernel slides across the spatial and temporal dimensions to generate spatio-temporal feature maps.
• 3D convolutions are non-causal: each output depends on both past and future frames.
• Strided, dilated, and padded convolutions also apply in 3D.
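A naive single-channel 3D convolution makes the sliding over time and space explicit. This is a sketch for illustration only: it assumes stride 1 and no padding, uses a greyscale T×H×W volume instead of T×H×W×3, and, like most deep learning frameworks, computes cross-correlation (no kernel flip).

```python
import numpy as np

def conv3d_valid(video, kernel):
    # video: (T, H, W), kernel: (kt, kh, kw); output shrinks along all three axes.
    T, H, W = video.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                # Each output mixes kt consecutive frames, including later ones,
                # which is why the operation is non-causal.
                window = video[t:t + kt, i:i + kh, j:j + kw]
                out[t, i, j] = np.sum(window * kernel)
    return out

video = np.ones((8, 6, 6))     # 8 frames of 6x6 pixels
kernel = np.ones((3, 3, 3))    # spans 3 frames and a 3x3 spatial window
features = conv3d_valid(video, kernel)
```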
Action Recognition
• Input: RGB video (optionally + flow map)
• Output: action label (one-hot over classes), e.g. cricket shot
Video from Kinetics dataset, Carreira et al, 2017
Action Recognition
• SlowFast
Image from SlowFast Networks for Video Recognition, Feichtenhofer et al, 2019
Transfer Learning for Video Input
• Intuition: a 2D image is a video of a static scene
• Inflating 2D kernels into 3D
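One common way to inflate a 2D kernel, used for example in the I3D model of Carreira and Zisserman (2017), is to repeat it along the time axis and divide by the number of repeats. The sketch below checks the intuition from the first bullet: on a video of a static scene, the inflated kernel gives the same response as the 2D kernel on a single frame. The kernel size, frame count, and random values are arbitrary choices for the demonstration.

```python
import numpy as np

def inflate_kernel(kernel_2d, t):
    # Repeat the 2D kernel t times along a new time axis and rescale by 1/t.
    kernel_3d = np.repeat(kernel_2d[np.newaxis, :, :], t, axis=0)
    return kernel_3d / t

rng = np.random.default_rng(0)
kernel_2d = rng.normal(size=(3, 3))
image = rng.normal(size=(3, 3))

# A "video" of a static scene: the same frame repeated 4 times.
static_video = np.repeat(image[np.newaxis, :, :], 4, axis=0)
kernel_3d = inflate_kernel(kernel_2d, 4)

response_2d = np.sum(image * kernel_2d)
response_3d = np.sum(static_video * kernel_3d)
```

This lets a 3D video network start from weights pretrained on images rather than from random initialization.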
Beyond Strong Supervision
• Why? – Labelling is tedious.
• Self-supervision – Metric learning
Image from COCO dataset and CTC dataset, respectively.
Beyond Strong Supervision
• Recap: standard losses (e.g. cross-entropy, mean squared error)
  • Learn a mapping between input(s) and output distribution / value(s)
• Metric learning
  • Learn to predict distances between inputs given some similarity measure (e.g. same person or not)
Image from VGGFace2: A dataset for recognising faces across pose and age, Cao et al, 2018
Beyond Strong Supervision
Metric Learning
• Contrastive loss (margin loss)
• Self-supervised representation, e.g. dimensionality reduction [1]
• Difficult to choose the margin
• Triplet loss
• Information retrieval [2]
• Hard negative mining to select informative triplets
• State-of-the-art representation learning
• Low-shot face recognition [3]
[1] Dimensionality reduction by learning an invariant mapping, Hadsell et al, 2006
[2] Learning to Learn from Web Data through Deep Semantic Embeddings, Gomez et al, 2018
[3] VGGFace2: A dataset for recognising faces across pose and age, Cao et al, 2018
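The two metric-learning losses listed above can be sketched on embedding vectors as follows. This is a minimal illustration: the contrastive loss follows the form in Hadsell et al [1], the triplet loss uses plain (unsquared) Euclidean distances, and the margin values are arbitrary assumptions.

```python
import numpy as np

def contrastive_loss(a, b, similar, margin=1.0):
    # Pull similar pairs together; push dissimilar pairs apart until they exceed the margin.
    d = np.linalg.norm(a - b)
    if similar:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2

def triplet_loss(anchor, positive, negative, margin=0.2):
    # The positive should be closer to the anchor than the negative by at least the margin.
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

anchor = np.array([0.0, 0.0])
positive = np.array([0.0, 1.0])
easy_negative = np.array([3.0, 4.0])   # far away: contributes no loss
hard_negative = np.array([0.0, 0.5])   # closer than the positive: large loss

easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
```

The easy triplet yields zero loss and therefore no gradient, which is why hard negative mining is used to select informative triplets.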
Beyond Strong Supervision
State-of-the-art representation learning
• Composition of data augmentations
• Learnable non-linear transformation
• Larger mini-batches and longer training
Image from A Simple Framework for Contrastive Learning of Visual Representations, Chen et al, 2020
Beyond Strong Supervision
Generative adversarial networks (GAN)
Image from https://wiki.pathmind.com/generative-adversarial-network-gan
[Figure credit, partially recoverable: Aggarwal, an overview of generative adversarial ... applications]
References
• Some slides were adapted from the class notes of the Stanford course CS231n
• Some slides were adapted from the DeepMind deep learning lecture series 2020