Deep Learning Applications
Dr Chang Xu
School of Computer Science, The University of Sydney
Applications
Introducing Deep Learning for Computer Vision
The deep convolutional neural network (DCNN) is the key concept behind introducing deep learning to computer vision. By loosely imitating biological nervous systems, deep neural networks provide an unprecedented ability to interpret complicated data patterns and thus effectively tackle a variety of computer vision tasks.
[Diagram: images and videos are fed into a deep neural network, which supports classification, detection, segmentation, and tracking.]
Classification
Goal: Assign a label to an input image from a fixed set of categories.
Credit To: https://medium.com/@tifa2up/image-classification-using-deep-neural-networks-a-beginner-friendly-approach-using-tensorflow-94b0a090ccd4
Deep Learning-based Classification
Suppose 𝑊 is the parameter set of a deep neural network. Given an input image 𝐼, classification can be tackled by:
𝑦 = 𝑓(𝑊, 𝐼)
where 𝑦 is the predicted class label and 𝑓 represents a series of operations, parameterized by 𝑊, for processing the input image.
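As an illustration of 𝑦 = 𝑓(𝑊, 𝐼) (not from the slides): a hedged sketch using a standard torchvision classifier, where the model choice, normalization constants, and image path are assumptions for the example.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# A sketch of y = f(W, I): W is the network's learned parameters,
# I is the input image, y is the predicted class label.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # standard ImageNet stats
                         std=[0.229, 0.224, 0.225]),
])

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.eval()

image = Image.open("cat.jpg")               # I: the input image (path assumed)
x = preprocess(image).unsqueeze(0)          # add a batch dimension
with torch.no_grad():
    logits = model(x)                       # f(W, I): the forward pass
y = logits.argmax(dim=1).item()             # y: predicted class index
```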
Computer Vision Tasks
Pedestrian Detection
Car Detection
Other Applications
Localize objects with regression
Classification + Localization: treat localization as a regression problem!
– A shared backbone produces a single feature vector (4096-d).
– Classification head: fully connected, 4096 → 1000 class scores (Cat: 0.9, Dog: 0.05, Car: 0.01, …), trained with a softmax loss against the correct label (Cat).
– Localization head: fully connected, 4096 → 4 box coordinates (x, y, w, h), trained with an L2 loss against the correct box (x’, y’, w’, h’).
– The two losses are added into a single multitask loss.
Credit To: Fei-Fei Li, Justin Johnson & Serena Yeung, CS231n Lecture 11 (May 10, 2017). This image is CC0 public domain.
It is assumed that there is only one object in an input image.
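A minimal sketch of this two-head design (the 4096-d feature, 1000 classes, and loss choices follow the slide; the random features and targets are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassifyAndLocalize(nn.Module):
    """Shared features feed two heads: class scores and one box per image.
    Assumes exactly one object per image, as stated on the slide."""
    def __init__(self, num_classes=1000, feat_dim=4096):
        super().__init__()
        self.cls_head = nn.Linear(feat_dim, num_classes)  # 4096 -> 1000 scores
        self.box_head = nn.Linear(feat_dim, 4)            # 4096 -> (x, y, w, h)

    def forward(self, feat):
        return self.cls_head(feat), self.box_head(feat)

model = ClassifyAndLocalize()
feat = torch.randn(8, 4096)               # placeholder backbone features
scores, boxes = model(feat)

labels = torch.randint(0, 1000, (8,))     # correct labels (e.g., "cat")
gt_boxes = torch.rand(8, 4)               # correct boxes (x', y', w', h')

# Multitask loss: softmax (cross-entropy) loss + L2 loss, simply added.
loss = F.cross_entropy(scores, labels) + F.mse_loss(boxes, gt_boxes)
loss.backward()
```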
Detection
Goal: Detect semantic objects of certain classes in images.
– Region CNN (R-CNN)
– SPPNet
– Fast R-CNN
– Faster R-CNN
– Mask R-CNN
Typical architecture
1. Region proposal: Given an input image, find all possible places where objects could be located. The output of this stage is a list of bounding boxes over likely object positions; these are often called region proposals or regions of interest.
2. Final classification: For every region proposal from the previous stage, decide whether it belongs to one of the target classes or to the background. Here we could use a deep convolutional network.
Deep Learning-based Detection
Given a deep neural network parameterized by 𝑊, the goal of object detection is to predict a set of bounding boxes that may contain objects, together with the objects’ categories:
{(𝑦ᵢ, (𝑥₁, 𝑦₁, 𝑥₂, 𝑦₂)ᵢ)} = 𝑓(𝑊, 𝐼)
where 𝑦ᵢ is the predicted class label and (𝑥₁, 𝑦₁, 𝑥₂, 𝑦₂)ᵢ describes the four coordinates (i.e. the top-left corner and the bottom-right corner) of the estimated bounding box for the 𝑖-th result.
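A schematic sketch of that two-stage mapping; `propose_regions` and `classify_region` are hypothetical helpers standing in for the proposal stage and the per-region CNN classifier:

```python
def detect(image, propose_regions, classify_region, score_threshold=0.5):
    """Sketch of {(y_i, (x1, y1, x2, y2)_i)} = f(W, I) as a two-stage detector.
    propose_regions returns candidate boxes (stage 1); classify_region
    returns (label, score) for the crop inside one box (stage 2)."""
    detections = []
    for box in propose_regions(image):               # stage 1: region proposals
        label, score = classify_region(image, box)   # stage 2: classification
        if label != "background" and score >= score_threshold:
            detections.append((label, box))          # (y_i, four coordinates)
    return detections
```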
Region CNN (R-CNN)
Rich feature hierarchies for accurate object detection and semantic segmentation
Tech report (v5)
Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik
UC Berkeley
{rbg,jdonahue,trevor,malik}@eecs.berkeley.edu

Abstract: Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012, achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also compare R-CNN to OverFeat, a recently proposed sliding-window detector based on a similar CNN architecture. We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.
Figure 1: Object detection system overview. The system (1) takes an input image, (2) extracts around 2000 bottom-up region proposals, (3) computes features for each warped region using a large convolutional neural network (CNN), and then (4) classifies each region (aeroplane? no. … person? yes. tvmonitor? no.) using class-specific linear SVMs. R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010. For comparison, [39] reports 35.1% mAP using the same region proposals, but with a spatial pyramid and bag-of-visual-words approach. The popular deformable part models perform at 33.4%. On the 200-class ILSVRC2013 detection dataset, R-CNN’s mAP is 31.4%, a large improvement over OverFeat [34], which had the previous best result at 24.3%.
[Section 1 (Introduction) of the paper’s first page is cropped on the slide.]
Object Detection to Image Classification
Pre-trained on ImageNet
[Girshick14] R. Girshick, J. Donahue, T. Darrell, J. Malik: Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014
Region CNN (R-CNN)
Credit To: towardsdatascience.com
– Duplicated computation: the CNN runs independently on every region proposal
– Not trained in an end-to-end manner (multi-stage pipeline)
– Slow
– Fixed input image size (227×227)
*Selective Search: generates region proposals by greedily merging regions of an initial over-segmentation, based on color, texture, size, and fill similarity.
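As a hedged sketch of how such proposals can be generated in practice: OpenCV's contrib module ships a selective search implementation (requires the opencv-contrib-python package; the image path is an assumption for illustration):

```python
import cv2

# Selective search: start from an over-segmentation, then greedily merge
# similar regions; every intermediate region becomes a proposal box.
image = cv2.imread("input.jpg")                      # hypothetical image path
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()                     # fast mode (vs. quality mode)
rects = ss.process()                                 # proposals as (x, y, w, h)
print(len(rects), "region proposals")                # R-CNN keeps ~2k of these
```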
*Bounding Box Regression
A proposal is parameterized as (P_x, P_y, P_w, P_h): its center coordinates, width, and height.
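The R-CNN paper's regression targets map the proposal P toward the ground-truth box G using scale-invariant shifts and log-space size ratios; a small sketch:

```python
import math

def bbox_regression_targets(P, G):
    """R-CNN-style regression targets from proposal P to ground truth G.
    Both boxes are (center_x, center_y, width, height)."""
    Px, Py, Pw, Ph = P
    Gx, Gy, Gw, Gh = G
    tx = (Gx - Px) / Pw      # horizontal shift, relative to proposal width
    ty = (Gy - Py) / Ph      # vertical shift, relative to proposal height
    tw = math.log(Gw / Pw)   # log-space width scaling
    th = math.log(Gh / Ph)   # log-space height scaling
    return tx, ty, tw, th
```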
Fast R-CNN
Drawbacks of R-CNN and the corresponding modifications:
1. Multi-stage training → end-to-end joint training.
2. Expensive in space and time → convolutional layer sharing.
3. Slow test-time detection → single-scale testing.
[Architecture diagrams: R-CNN vs. Fast R-CNN]
Fast R-CNN
Region of Interest Pooling
Input to the RoI pooling layer:
1. A fixed-size feature map
2. A list of regions of interest
A type of max-pooling whose output always has the same size. [Example: a region proposal on an 8×8 feature map is pooled to a 2×2 output.]
– Used for object detection tasks
– Reuses the feature map from the CNN
– Speeds up both training and test time
– Lets the detection model be trained in an end-to-end manner (see the sketch below)
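torchvision exposes this op directly; a minimal sketch matching the 8×8 feature map and 2×2 output above (the channel count and the region are made-up values):

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 256, 8, 8)        # one image, 256 channels, 8x8
# Regions of interest as (batch_index, x1, y1, x2, y2) in feature-map coords.
rois = torch.tensor([[0.0, 0.0, 3.0, 7.0, 8.0]])
# Every RoI, whatever its size, is max-pooled to the same 2x2 output.
pooled = roi_pool(feature_map, rois, output_size=(2, 2), spatial_scale=1.0)
print(pooled.shape)                            # torch.Size([1, 256, 2, 2])
```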
Fast R-CNN
[Comparison figure: R-CNN vs. Fast R-CNN]
Credit To: http://www.robots.ox.ac.uk/~tvg/publications/talks/fast-rcnn-slides.pdf
Faster R-CNN
Faster R-CNN replaces the slow selective search algorithm with a fast neural network: the region proposal network (RPN). The rest of the architecture follows Fast R-CNN.
Region Proposal Network
Image Credit to [Li Wang et al. 2017]
[Figure: a 3×3 sliding window moves over the feature map; at each position, default bounding boxes (anchors) are classified as to whether they contain an object, and offsets of their coordinates are regressed.]
Region Proposal Network
– The input feature map has 256 channels.
– Each anchor has 2 neurons to represent the fg/bg label.
– Each anchor also has 4 neurons to represent offsets of coordinates.
– Anchors are pre-determined boxes whose sizes range from 32² to 512² and whose aspect ratios range from 0.5 to 2.
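A sketch of the RPN head matching these numbers (256 input channels; 2 scores and 4 offsets per anchor; the anchor count of 9, i.e. 3 scales × 3 aspect ratios, is an assumption):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """3x3 sliding window over the feature map, then two 1x1 conv branches:
    per anchor, 2 outputs for the fg/bg label and 4 for coordinate offsets."""
    def __init__(self, in_channels=256, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.cls = nn.Conv2d(in_channels, num_anchors * 2, 1)  # fg/bg scores
        self.reg = nn.Conv2d(in_channels, num_anchors * 4, 1)  # box offsets

    def forward(self, feature_map):
        h = torch.relu(self.conv(feature_map))   # the 3x3 sliding window
        return self.cls(h), self.reg(h)

head = RPNHead()
fmap = torch.randn(1, 256, 38, 50)               # made-up feature map size
scores, offsets = head(fmap)                     # [1, 18, 38, 50], [1, 36, 38, 50]
```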
Faster R-CNN
Faster R-CNN
= RPN + Fast R-CNN.
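For completeness, a hedged usage sketch with torchvision's built-in Faster R-CNN (the weights argument follows recent torchvision releases; the random image stands in for real input):

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")  # RPN + Fast R-CNN head
model.eval()

images = [torch.rand(3, 480, 640)]                  # list of CHW tensors in [0, 1]
with torch.no_grad():
    outputs = model(images)
# Each output dict holds 'boxes' (x1, y1, x2, y2), 'labels', and 'scores'.
print(outputs[0]["boxes"].shape)
```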
Mask R-CNN
Image instance segmentation identifies, at a pixel level, the different objects in a scene.
Mask R-CNN
For each detected object, Mask R-CNN predicts a binary mask that says whether or not a given pixel is part of that object.
RoIAlign
Realigning RoI Pooling to be More Accurate.
Image credit to Ardian Umam
RoIAlign
Credit To: https://www.slideshare.net/windmdk/mask-rcnn
Feature values at the sampled points are computed by bilinear interpolation.
RoIAlign
Realigning RoI Pooling to be More Accurate.
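torchvision also exposes RoIAlign as an op; this sketch mirrors the earlier roi_pool example, with aligned=True applying the half-pixel correction (argument names follow recent torchvision):

```python
import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 256, 8, 8)
rois = torch.tensor([[0.0, 0.0, 3.0, 7.0, 8.0]])    # (batch_idx, x1, y1, x2, y2)

# Unlike RoI pooling, RoIAlign does not snap the region to the feature grid:
# it samples exact (fractional) locations via bilinear interpolation.
pooled = roi_align(feature_map, rois, output_size=(2, 2),
                   spatial_scale=1.0, sampling_ratio=2, aligned=True)
print(pooled.shape)                                  # torch.Size([1, 256, 2, 2])
```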
Mask R-CNN
Credit To: https://www.slideshare.net/windmdk/mask-rcnn
Mask R-CNN = Instance Segmentation + Faster R-CNN.
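A matching hedged sketch with torchvision's Mask R-CNN, which adds the per-instance mask head on top of Faster R-CNN:

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(weights="DEFAULT")    # Faster R-CNN + mask head
model.eval()

images = [torch.rand(3, 480, 640)]
with torch.no_grad():
    outputs = model(images)
# 'masks' holds one soft mask per detected instance (N x 1 x H x W);
# thresholding (e.g., at 0.5) gives the final binary per-pixel mask.
binary_masks = outputs[0]["masks"] > 0.5
print(binary_masks.shape)
```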
Demo
Mask R-CNN
From R-CNN to Mask R-CNN
Credit To: https://www.slideshare.net/windmdk/mask-rcnn