The University of Sydney Page 1

Deep Learning Applications

Dr Chang Xu

School of Computer Science

Applications

Introducing Deep Learning for Computer Vision

The deep convolutional neural network (DCNN) is the key concept behind
the introduction of deep learning to computer vision. By loosely imitating
biological nervous systems, deep neural networks can interpret complicated
data patterns and thus effectively tackle various computer vision tasks.

[Figure: images and videos are fed into a deep neural network, which supports classification, detection, segmentation, and tracking.]

Classification
Goal: Assign a label to an input image based on a fixed set of
categories.

Credit To: https://medium.com/@tifa2up/image-classification-using-deep-neural-networks-a-beginner-friendly-approach-using-tensorflow-94b0a090ccd4

Deep Learning-based Classification

Suppose $W$ is the parameter set of a deep neural network. Given an input
image $I$, classification can be tackled by:

$$y = f(W, I)$$

where $y$ is the predicted class label and $f$ represents a series of
operations, parameterized by $W$, for processing the input image.
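As a rough illustration of y = f(W, I), the sketch below uses a single linear layer followed by a softmax as a stand-in for the deep network. The function names, the toy 8×8 image, and the choice of 3 categories are assumptions made for this example only.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def f(W, I):
    """Hypothetical stand-in for the network: one linear layer plus softmax.

    Returns the predicted class index y and the class probabilities.
    """
    scores = W @ I.ravel()          # one score per category
    probs = softmax(scores)
    return int(np.argmax(probs)), probs

rng = np.random.default_rng(0)
I = rng.random((8, 8))              # toy grayscale "image"
W = rng.normal(size=(3, I.size))    # parameters for 3 assumed categories
y, probs = f(W, I)
```

A real classifier replaces the single matrix product with many stacked layers, but the interface stays the same: parameters and an image go in, a label from a fixed set comes out.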

Computer Vision Tasks

Pedestrian Detection

Car Detection

Other Applications

Localize objects with regression

Classification with Localization

Treat localization as a regression problem!

[Figure: Classification + Localization. A 4096-dimensional feature vector feeds two fully connected heads:
– Fully connected, 4096 to 1000: class scores (Cat: 0.9, Dog: 0.05, Car: 0.01), trained with a softmax loss against the correct label (Cat).
– Fully connected, 4096 to 4: box coordinates (x, y, w, h), trained with an L2 loss against the correct box (x’, y’, w’, h’).
The two losses are added to form the multitask loss. This image is CC0 public domain.]

Credit To: Fei-Fei Li, Justin Johnson & Serena Yeung, Lecture 11, May 10, 2017 (Fei-Fei Li’s deep learning lecture slides).

It is assumed that there is only one object in an input image.
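The multitask loss for classification with localization can be sketched as a softmax (cross-entropy) loss on the class scores plus an L2 loss on the box coordinates. The `box_weight` parameter and the example numbers below are assumptions for illustration; the slide only states that the two losses are added.

```python
import numpy as np

def softmax_cross_entropy(scores, label):
    """Softmax loss: negative log-probability of the correct label."""
    shifted = scores - scores.max()                 # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[label])

def l2_loss(box, target):
    """Sum of squared differences between predicted and correct (x, y, w, h)."""
    diff = np.asarray(box, dtype=float) - np.asarray(target, dtype=float)
    return float(np.sum(diff ** 2))

def multitask_loss(scores, label, box, target, box_weight=1.0):
    """Classification loss plus (optionally weighted) localization loss."""
    return softmax_cross_entropy(scores, label) + box_weight * l2_loss(box, target)

scores = np.array([2.0, 0.5, -1.0])                 # raw scores for cat, dog, car
loss = multitask_loss(scores, label=0,
                      box=[10, 20, 50, 40], target=[12, 18, 50, 44])
```

Because both terms are differentiable, a single backward pass can update the shared features and both heads at once.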

Detection
Goal: Detect semantic objects of certain classes in images.

– Region CNN (R-CNN)
– SPPNet
– Fast R-CNN
– Faster R-CNN
– Mask R-CNN

Typical architecture

1. Region proposal: given an input image, find all possible places where
objects can be located. The output of this stage is a list of bounding
boxes of likely object positions. These are often called region proposals
or regions of interest.

2. Final classification: for every region proposal from the previous stage,
decide whether it belongs to one of the target classes or to the
background. A deep convolutional network can be used here.
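The two stages can be sketched as a toy pipeline. Everything here is a simplification chosen for illustration: `propose_regions` just tiles the image with fixed windows, and `classify_region` thresholds mean brightness, whereas a real detector would use a proposal algorithm and a deep convolutional network.

```python
import numpy as np

def propose_regions(image, size=4, stride=4):
    """Stage 1 (toy): every size x size window on a regular grid is a proposal."""
    h, w = image.shape
    return [(x, y, x + size, y + size)
            for y in range(0, h - size + 1, stride)
            for x in range(0, w - size + 1, stride)]

def classify_region(image, box, threshold=0.5):
    """Stage 2 (toy): a bright patch is an 'object', anything else is background."""
    x1, y1, x2, y2 = box
    return "object" if image[y1:y2, x1:x2].mean() > threshold else "background"

def detect(image):
    """Run both stages and keep only the non-background proposals."""
    results = []
    for box in propose_regions(image):
        label = classify_region(image, box)
        if label != "background":
            results.append((label, box))
    return results

image = np.zeros((8, 8))
image[4:8, 0:4] = 1.0               # bright square in the bottom-left quadrant
detections = detect(image)
```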

Deep Learning-based Detection

Given a deep neural network parameterized by $W$, the goal of object
detection is to predict a set of bounding boxes that may contain objects,
together with the objects’ categories:

$$\{y_i, (x_1, y_1, x_2, y_2)_i\} = f(W, I)$$

where $y_i$ is the predicted class label and $(x_1, y_1, x_2, y_2)_i$ describes
the four coordinates (i.e. the top-left corner and the bottom-right corner)
of the estimated bounding box for the $i$-th result.
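One possible way to hold a detector's output is a list of records, each with a label and the two box corners. The `Detection` class and its helper methods are hypothetical names for this sketch; the conversion from (x, y, w, h) assumes that (x, y) is the top-left corner.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One detection result: class label plus top-left and bottom-right corners."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float

    @classmethod
    def from_xywh(cls, label, x, y, w, h):
        # Assumption: (x, y) is the top-left corner, (w, h) the box size.
        return cls(label, x, y, x + w, y + h)

    def area(self):
        """Box area; degenerate boxes count as zero."""
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)

# The detector returns a set of results, one per detected object.
detections = [
    Detection("cat", 10, 20, 110, 220),
    Detection.from_xywh("dog", 5, 5, 30, 40),
]
```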

Region CNN (R-CNN)

Rich feature hierarchies for accurate object detection and semantic segmentation

Tech report (v5)

Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik
UC Berkeley
{rbg,jdonahue,trevor,malik}@eecs.berkeley.edu

Abstract

Object detection performance, as measured on the canonical PASCAL VOC
dataset, has plateaued in the last few years. The best-performing methods
are complex ensemble systems that typically combine multiple low-level
image features with high-level context. In this paper, we propose a simple
and scalable detection algorithm that improves mean average precision
(mAP) by more than 30% relative to the previous best result on VOC 2012,
achieving a mAP of 53.3%. Our approach combines two key insights: (1) one
can apply high-capacity convolutional neural networks (CNNs) to bottom-up
region proposals in order to localize and segment objects and (2) when
labeled training data is scarce, supervised pre-training for an auxiliary
task, followed by domain-specific fine-tuning, yields a significant
performance boost. Since we combine region proposals with CNNs, we call
our method R-CNN: Regions with CNN features. We also compare R-CNN to
OverFeat, a recently proposed sliding-window detector based on a similar
CNN architecture. We find that R-CNN outperforms OverFeat by a large
margin on the 200-class ILSVRC2013 detection dataset. Source code for the
complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.

1. Introduction

Features matter. The last decade of progress on various visual recognition
tasks has been based considerably on the use of SIFT [29] and HOG [7]. But
if we look at performance on the canonical visual recognition task, PASCAL
VOC object detection [15], it is generally acknowledged that progress has
been slow during 2010-2012, with small gains obtained by building ensemble
systems and employing minor variants of successful methods.

SIFT and HOG are blockwise orientation histograms, a representation we
could associate roughly with complex cells in V1, the first cortical area
in the primate visual pathway. But we also know that recognition occurs
several stages downstream, which suggests that there might be hierarchical,
multi-stage processes for computing features that are even more
informative for visual recognition.

Fukushima’s “neocognitron” [19], a biologically-inspired hierarchical and
shift-invariant model for pattern recognition, was an early attempt at just
such a process. The neocognitron, however, lacked a supervised training
algorithm. Building on Rumelhart et al. [33], LeCun et al. [26] showed that
stochastic gradient descent via backpropagation was effective for training
convolutional neural networks (CNNs), a class of models that extend the
neocognitron.

CNNs saw heavy use in the 1990s (e.g., [27]), but then fell out of fashion
with the rise of support vector machines. In 2012, Krizhevsky et al. [25]
rekindled interest in CNNs by showing substantially higher image
classification accuracy on the ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) [9, 10]. Their success resulted from training a large
CNN on 1.2 million labeled images, together with a few twists on LeCun’s
CNN (e.g., max(x, 0) rectifying non-linearities and “dropout”
regularization).

The significance of the ImageNet result was vigorously

[Figure panels: R-CNN: Regions with CNN features. 1. Input image; 2. Extract region proposals (~2k); 3. Compute CNN features (warped region, CNN); 4. Classify regions (aeroplane? no. person? yes. tvmonitor? no.)]

Figure 1: Object detection system overview. Our system (1) takes an input
image, (2) extracts around 2000 bottom-up region proposals, (3) computes
features for each proposal using a large convolutional neural network
(CNN), and then (4) classifies each region using class-specific linear
SVMs. R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC
2010. For comparison, [39] reports 35.1% mAP using the same region
proposals, but with a spatial pyramid and bag-of-visual-words approach.
The popular deformable part models perform at 33.4%. On the 200-class
ILSVRC2013 detection dataset, R-CNN’s mAP is 31.4%, a large improvement
over OverFeat [34], which had the previous best result at 24.3%.
arXiv:1311.2524v5 [cs.CV] 22 Oct 2014

Object Detection to Image Classification

Pre-trained on ImageNet

[Girshick14] R. Girshick, J. Donahue, T. Darrell, J. Malik: Rich Feature Hierarchies for
Accurate Object Detection and Semantic Segmentation, CVPR 2014

Region CNN (R-CNN)

Credit To: towardsdatascience.com

– Duplicated computation: CNN features are computed separately for each region proposal
– Not trained end to end
– Slow
– Fixed input size: each region is warped to 227×227

*Selective Search

*Bounding Box Regression

$(P_x, P_y, P_w, P_h)$
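A minimal sketch of bounding box regression, assuming the parameterization used in the R-CNN paper: a proposal P is given by its centre (P_x, P_y) and size (P_w, P_h), and the regressor predicts four offsets that shift the centre in proportion to the box size and rescale width and height in log space. The function names and the example boxes are hypothetical.

```python
import math

def regression_targets(P, G):
    """Offsets (t_x, t_y, t_w, t_h) that map proposal P onto ground-truth box G.

    Both boxes are (centre_x, centre_y, width, height).
    """
    Px, Py, Pw, Ph = P
    Gx, Gy, Gw, Gh = G
    return ((Gx - Px) / Pw, (Gy - Py) / Ph,
            math.log(Gw / Pw), math.log(Gh / Ph))

def apply_offsets(P, d):
    """Apply predicted offsets (d_x, d_y, d_w, d_h) to proposal P."""
    Px, Py, Pw, Ph = P
    dx, dy, dw, dh = d
    return (Pw * dx + Px, Ph * dy + Py,
            Pw * math.exp(dw), Ph * math.exp(dh))

P = (50.0, 50.0, 20.0, 40.0)        # proposal
G = (54.0, 46.0, 30.0, 40.0)        # ground-truth box
t = regression_targets(P, G)        # what the regressor is trained to predict
refined = apply_offsets(P, t)       # a perfect prediction recovers G
```

Normalizing the centre shift by the proposal size and using log-scale factors makes the targets independent of the absolute size of the box.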

Fast R-CNN
Drawbacks of R-CNN and the corresponding modifications:
1. Multi-stage training. → End-to-end joint training.
2. Expensive in space and time. → Convolutional layer sharing.
3. Test-time detection is slow. → Single-scale testing.

R-CNN

Fast R-CNN

Fast R-CNN

Region of Interest Pooling

Input to the RoI pooling layer:
1. A fixed-size feature map
2. A list of regions of interest

[Figure: a region proposal on an 8×8 feature map is pooled to a 2×2 output.]

RoI pooling is a type of max-pooling. The output always has the same size,
whatever the size of the region proposal.

– Used for object detection tasks
– Reuses the feature map from the CNN
– Speeds up both training and testing
– Allows the detection model to be trained end to end
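RoI pooling can be sketched as follows, using the 8×8 feature map and 2×2 output from the slide. This is a single-channel simplification: the region is given in whole feature-map cells with exclusive end coordinates, and it is assumed to be at least as large as the output grid. The function name and the example region are made up for illustration.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool one region of interest to a fixed out_h x out_w grid.

    roi = (x1, y1, x2, y2) in feature-map cells, with x2 and y2 exclusive.
    Assumes the region spans at least out_h rows and out_w columns.
    """
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    h, w = region.shape
    # Split rows and columns into nearly equal bins using integer arithmetic.
    row_edges = [(i * h) // out_h for i in range(out_h + 1)]
    col_edges = [(j * w) // out_w for j in range(out_w + 1)]
    out = np.empty((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cell = region[row_edges[i]:row_edges[i + 1],
                          col_edges[j]:col_edges[j + 1]]
            out[i, j] = cell.max()  # max-pooling inside each bin
    return out

fmap = np.arange(64, dtype=float).reshape(8, 8)   # toy 8x8 feature map
pooled = roi_max_pool(fmap, roi=(0, 3, 7, 8))     # region 7 cells wide, 5 tall
```

A region of a different shape, for example `roi=(2, 2, 4, 4)`, still produces a 2×2 output, which is what lets the following fully connected layers accept proposals of any size.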

Fast R-CNN

VS

Credit To: http://www.robots.ox.ac.uk/~tvg/publications/talks/fast-rcnn-slides.pdf

Faster R-CNN

Replace the slow selective search algorithm with a fast neural network:
the region proposal network (RPN).

Fast R-CNN

Region Proposal Network

Image Credit to [Li Wang et al. 2017]

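[Figure: a 3×3 sliding window moves over the feature map. At each position, the default bounding boxes are scored for whether they contain an object, and offsets of their coordinates are predicted.]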

Region Proposal Network

– The input feature map has 256 channels.
– Each anchor has 2 neurons to represent the fg/bg (foreground/background) label.
– Each anchor also has 4 neurons to represent the offsets of its coordinates.
– Anchors are predetermined boxes whose areas range from $32^2$ to $512^2$
  and whose aspect ratios range from 0.5 to 2.
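A sketch of how anchors at one sliding-window position could be generated. The specific size and ratio lists below are assumptions that merely cover the stated ranges (areas from 32² to 512², aspect ratios from 0.5 to 2), and the ratio is taken here to mean height divided by width.

```python
import numpy as np

def make_anchors(cx, cy, sizes=(32, 64, 128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Anchors centred on (cx, cy), as rows of (x1, y1, x2, y2).

    Each anchor has area size**2 and aspect ratio height / width = ratio.
    """
    anchors = []
    for s in sizes:
        for r in ratios:
            w = s / np.sqrt(r)      # keeps w * h == s**2
            h = s * np.sqrt(r)      # keeps h / w == r
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(anchors)

anchors = make_anchors(cx=300, cy=200)
k = len(anchors)                    # anchors per sliding-window position
num_cls_outputs = 2 * k             # 2 neurons (fg/bg) per anchor
num_reg_outputs = 4 * k             # 4 coordinate offsets per anchor
```

With these assumed lists there are 15 anchors per position, so the RPN head would emit 30 classification outputs and 60 regression outputs at every location of the feature map.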

Faster R-CNN

Faster R-CNN
= RPN + Fast R-CNN.

Mask R-CNN

Instance segmentation identifies, at the pixel level, what the different
objects in a scene are.

Mask R-CNN

Mask R-CNN predicts a binary mask that says whether or not a given pixel
is part of an object.
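As a small illustration, a binary mask can be obtained by thresholding per-pixel probabilities. The 0.5 threshold and the toy probability map are assumptions for this sketch, not values taken from the slides.

```python
import numpy as np

def to_binary_mask(prob, threshold=0.5):
    """True where a pixel is considered part of the object, False elsewhere."""
    return prob >= threshold

# Toy 2x2 map of per-pixel object probabilities.
prob = np.array([[0.1, 0.7],
                 [0.9, 0.4]])
mask = to_binary_mask(prob)
```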

RoIAlign

Realigning RoI Pooling to be More Accurate.

Image credit to Ardian Umam

RoIAlign

Credit To: https://www.slideshare.net/windmdk/mask-rcnn

bilinear
interpolation
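RoIAlign avoids rounding region coordinates to whole feature-map cells. Instead it reads the feature map at fractional positions using bilinear interpolation. The sketch below shows only that sampling step on a single-channel map; the function name and the clamping at the border are choices made for this example.

```python
import numpy as np

def bilinear_sample(fmap, x, y):
    """Value of fmap at the continuous position (x, y).

    x indexes columns and y indexes rows. Positions outside the map are
    clamped to the border.
    """
    h, w = fmap.shape
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0         # fractional distance from the top-left cell
    top = (1 - ax) * fmap[y0, x0] + ax * fmap[y0, x1]
    bottom = (1 - ax) * fmap[y1, x0] + ax * fmap[y1, x1]
    return (1 - ay) * top + ay * bottom

fmap = np.array([[0.0, 10.0],
                 [20.0, 30.0]])
value = bilinear_sample(fmap, 0.5, 0.5)   # centre of the four cells
```

Sampling at the centre of the four cells returns their average, and sampling exactly on a cell returns that cell's value, so no information is lost to quantization.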

Mask R-CNN

Credit To: https://www.slideshare.net/windmdk/mask-rcnn

Mask R-CNN
= Instance Segmentation + Faster R-CNN.

Demo

Mask R-CNN

From R-CNN to Mask R-CNN

Credit To: https://www.slideshare.net/windmdk/mask-rcnn