
Image Segmentation
How many zebras?
From Sandlot Science


Why is context important?
What is this?

Why is this a car?

…because it’s on the road!
Why is this a road?
Context is very important!

Same problem in real scenes

From images to objects
What defines an object?
• Subjective problem, but has been well-studied
• Proximity, similarity, continuation…

Extracting objects
How could we do this automatically (or at least semi-automatically)?

Semi-automatic binary segmentation

Simplifying the user interaction
Grabcut [Rother et al., SIGGRAPH 2004]

Source: K. Grauman
Auto segmentation: toy example
[Figure: input image and its intensity histogram (pixel count), with labeled peaks including black pixels and white pixels]
• These intensities define the three groups.
• We could label every pixel in the image according to which of these primary intensities it is.
• i.e., segment the image based on the intensity feature.
• But … the image isn’t quite so simple …

[Figure: input image and its intensity histogram (pixel count)]
• Now how to determine the three main intensities that define our groups?
• We need to cluster.
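One standard way to find those dominant intensities is k-means clustering. Below is a minimal 1-D NumPy sketch; the function name, quantile initialization, and toy data are my own illustration, not the lecture's code.

```python
import numpy as np

def kmeans_1d(values, k, iters=20):
    """Minimal 1-D k-means for pixel intensities (illustrative sketch)."""
    # Initialize centers at evenly spaced quantiles of the data.
    centers = np.quantile(values, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        # Assign each value to its nearest center.
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        # Move each center to the mean of its assigned values.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = values[labels == j].mean()
    return centers, labels

# Toy "image" with three dominant intensities: dark, mid-gray, bright.
rng = np.random.default_rng(1)
pixels = np.concatenate([rng.normal(5, 3, 100),
                         rng.normal(130, 3, 100),
                         rng.normal(250, 3, 100)])
centers, labels = kmeans_1d(pixels, k=3)
print(np.sort(centers))  # roughly the three dominant intensities
```

Each pixel's `labels` entry is then its segment: the intensity-based segmentation described above.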

Deep Learning
• Semantic Segmentation: pixel-level labels (GRASS, CAT, TREE, SKY); no instances
• Classification + Localization: single object (CAT)
• Object Detection: multiple objects (DOG, DOG, CAT)
• Instance Segmentation: segmentation + classification of multiple objects (DOG, DOG, CAT)

Fei-Fei Li & Justin Johnson & Serena Yeung, Lecture 11, May 10, 2017

Semantic Segmentation
Label each pixel in the image with a category label
Don’t differentiate instances, only care about pixels

Semantic Segmentation Idea: Fully Convolutional
Design a network as a bunch of convolutional layers to make predictions for all pixels at once!

Scores: C x H x W
Predictions: H x W

Convolutions
Each output channel corresponds to a class: C channels → C classes
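Turning a C-channel score volume into a per-pixel label map is just an argmax over the channel axis. A small sketch, with made-up shapes and random scores standing in for a real network's output:

```python
import numpy as np

# Hypothetical C x H x W score volume from a fully convolutional network;
# shapes and values are illustrative, not from the lecture.
C, H, W = 4, 3, 5
scores = np.random.default_rng(0).normal(size=(C, H, W))

# Per-pixel prediction: argmax across the C class channels yields an
# H x W map of class indices, i.e. the segmentation.
pred = scores.argmax(axis=0)
print(pred.shape)  # (3, 5)
```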

Semantic Segmentation Idea: Fully Convolutional
Design the network as a bunch of convolutional layers, with downsampling and upsampling inside the network!
High-res: D1 x H/2 x W/2 → Med-res: D2 x H/4 x W/4 → Low-res: D3 x H/4 x W/4 → Med-res: D2 x H/4 x W/4 → High-res: D1 x H/2 x W/2 → Predictions: H x W

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al., “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015

Semantic Segmentation Idea: Fully Convolutional
Design the network as a bunch of convolutional layers, with downsampling and upsampling inside the network!
Downsampling: pooling, strided convolution
Upsampling: ???

In-Network upsampling: “Unpooling”
Nearest Neighbor: Input 2 x 2 → Output 4 x 4
“Bed of Nails”: Input 2 x 2 → Output 4 x 4
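The two unpooling schemes can be sketched in a few lines of NumPy. The 2 x 2 input values below are made up (the slide's example matrices did not survive extraction):

```python
import numpy as np

def nn_unpool(x, s=2):
    """Nearest-neighbor unpooling: repeat each value in an s x s block."""
    return np.repeat(np.repeat(x, s, axis=0), s, axis=1)

def bed_of_nails_unpool(x, s=2):
    """'Bed of nails': each value goes to the top-left of its s x s block,
    zeros everywhere else."""
    out = np.zeros((x.shape[0] * s, x.shape[1] * s), dtype=x.dtype)
    out[::s, ::s] = x
    return out

x = np.array([[1, 2],
              [3, 4]])
print(nn_unpool(x))            # each value copied into a 2 x 2 block
print(bed_of_nails_unpool(x))  # values at block corners, zeros elsewhere
```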

In-Network upsampling: “Max Unpooling”
Max Pooling: remember which element was max! Input 4 x 4 → Output 2 x 2
Max Unpooling: use positions from the pooling layer. Input 2 x 2 → Output 4 x 4
Corresponding pairs of downsampling and upsampling layers, with the rest of the network in between.
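Max unpooling can be sketched by having the pooling step also record where each maximum came from. A minimal NumPy version follows; the 4 x 4 input values are illustrative, since the slide's matrices are not in the text:

```python
import numpy as np

def max_pool_with_indices(x, s=2):
    """s x s max pooling that also records the position of each max."""
    H, W = x.shape
    out = np.zeros((H // s, W // s), dtype=x.dtype)
    idx = np.zeros((H // s, W // s, 2), dtype=int)
    for i in range(H // s):
        for j in range(W // s):
            block = x[i*s:(i+1)*s, j*s:(j+1)*s]
            r, c = np.unravel_index(block.argmax(), block.shape)
            out[i, j] = block[r, c]
            idx[i, j] = (i*s + r, j*s + c)
    return out, idx

def max_unpool(y, idx, shape):
    """Place each value back where its max came from; zeros elsewhere."""
    out = np.zeros(shape, dtype=y.dtype)
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            r, c = idx[i, j]
            out[r, c] = y[i, j]
    return out

x = np.array([[1, 2, 6, 3],
              [3, 5, 2, 1],
              [1, 2, 2, 1],
              [7, 3, 4, 8]])
pooled, idx = max_pool_with_indices(x)
print(pooled)  # [[5 6] [7 8]]
print(max_unpool(pooled, idx, x.shape))  # maxima restored to original positions
```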

3 x 3 transpose convolution, stride 2, pad 1
Input: 2 x 2 → Output: 4 x 4

3 x 3 transpose convolution, stride 2, pad 1
Input: 2 x 2 → Output: 4 x 4
Input gives weight for the filter

3 x 3 transpose convolution, stride 2, pad 1 (Input: 2 x 2 → Output: 4 x 4)
Input gives weight for the filter; sum where outputs overlap.
The filter moves 2 pixels in the output for every one pixel in the input.
Stride gives the ratio between movement in the output and movement in the input.

Transpose Convolution: 1D Example
The output contains copies of the filter weighted by the input, summing where they overlap in the output.
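The 1D case above can be written directly: every input value stamps a scaled copy of the filter into the output, striding by 2, and overlaps add. The input and filter values below are my own example, not the slide's:

```python
import numpy as np

def transpose_conv1d(x, w, stride=2):
    """1D transpose convolution (no padding): each input value stamps a
    copy of the filter into the output, scaled by that value; overlapping
    contributions are summed."""
    out = np.zeros(stride * (len(x) - 1) + len(w))
    for i, xi in enumerate(x):
        out[i * stride : i * stride + len(w)] += xi * w
    return out

x = np.array([2.0, 3.0])       # input
w = np.array([1.0, 2.0, 1.0])  # 3-tap filter
print(transpose_conv1d(x, w))  # [2. 4. 5. 6. 3.]
```

Note the middle entry 5 = 2·1 + 3·1: the two stamped filter copies overlap there and their contributions sum, exactly as described above.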

Object Detection as Regression?
CAT: (x, y, w, h)
DOG: (x, y, w, h), DOG: (x, y, w, h), CAT: (x, y, w, h)
DUCK: (x, y, w, h), DUCK: (x, y, w, h), … 16 numbers … many numbers!
Problem: each image needs a different number of outputs!

Object Detection as Classification: Sliding Window
Apply a CNN to many different crops of the image; the CNN classifies each crop as object or background:
• Dog? NO, Cat? NO, Background? YES
• Cat? NO, Background? NO
• Dog? NO, Cat? YES, Background? NO
Problem: need to apply the CNN to a huge number of locations and scales, which is very computationally expensive!
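The crop-count explosion is easy to see with a sketch. Here `classify` is a trivial stand-in for the CNN (the window size, stride, and helper names are mine, for illustration only):

```python
import numpy as np

def sliding_window_detect(image, classify, win=64, stride=32):
    """Naive sliding-window detection: run a classifier on every crop
    at a single scale and keep the non-background hits."""
    detections = []
    n_crops = 0
    H, W = image.shape[:2]
    for y in range(0, H - win + 1, stride):
        for x in range(0, W - win + 1, stride):
            n_crops += 1
            label = classify(image[y:y+win, x:x+win])
            if label != "background":
                detections.append((x, y, win, win, label))
    return detections, n_crops

img = np.zeros((256, 256))
dets, n = sliding_window_detect(img, lambda crop: "background")
print(n)  # 49 crops at one scale; multiply by scales and aspect ratios
```

Even this tiny 256 x 256 image yields 49 crops at a single scale and stride; a real detector would multiply that by many scales and aspect ratios, each requiring a full CNN forward pass.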

Region Proposals
● Find image regions that are likely to contain objects
● Relatively fast to run; e.g. Selective Search gives 1000 region proposals in a few seconds on CPU

Alexe et al., “Measuring the objectness of image windows”, TPAMI 2012
Uijlings et al., “Selective Search for Object Recognition”, IJCV 2013
Cheng et al., “BING: Binarized normed gradients for objectness estimation at 300fps”, CVPR 2014
Zitnick and Dollár, “Edge boxes: Locating object proposals from edges”, ECCV 2014

Alexe et al., CVPR 2010

[R-CNN figures] Girshick et al., “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation”, CVPR 2014
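The proposal-based pipeline of Girshick et al. can be caricatured as: crop each proposed region, warp it to a fixed size, and classify it. This is only a structural sketch; the `warp` and `classify` stand-ins below are my own (a real system uses a CNN feature extractor and learned classifiers):

```python
import numpy as np

def warp(crop, size=8):
    """Crude nearest-neighbor resize to a fixed size, standing in for the
    fixed-size warp a CNN input requires."""
    ys = (np.arange(size) * crop.shape[0] / size).astype(int)
    xs = (np.arange(size) * crop.shape[1] / size).astype(int)
    return crop[np.ix_(ys, xs)]

def proposal_detect(image, proposals, classify):
    """For each proposed (x, y, w, h) region: crop, warp, classify."""
    detections = []
    for (x, y, w, h) in proposals:
        label = classify(warp(image[y:y+h, x:x+w]))
        if label != "background":
            detections.append((x, y, w, h, label))
    return detections

# Toy image: one bright square our stand-in classifier calls "object".
img = np.zeros((64, 64))
img[10:30, 10:30] = 1.0
proposals = [(8, 8, 24, 24), (40, 40, 20, 20)]
dets = proposal_detect(img, proposals,
                       lambda c: "object" if c.mean() > 0.5 else "background")
print(dets)  # only the proposal covering the bright square survives
```

The point of the design: only ~1000-2000 proposals get the expensive per-crop evaluation, instead of every sliding-window location and scale.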

Detection without Proposals: YOLO
Input image: 3 x H x W
Divide the image into a 7 x 7 grid
Imagine a set of B base boxes centered at each grid cell (here B = 3)
Within each grid cell:
• Regress from each of the B base boxes to a final box with 5 numbers: (dx, dy, dh, dw, confidence)
• Predict scores for each of C classes (including background as a class)
Output: 7 x 7 x (5 * B + C)

Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016
Liu et al., “SSD: Single-Shot MultiBox Detector”, ECCV 2016

This parameterization fixes the output size.
Each cell predicts:
• For each bounding box:
  - 4 coordinates (x, y, w, h)
  - 1 confidence value
• Some number of class probabilities
For Pascal VOC: 2 bounding boxes / cell, 20 classes
7 x 7 x (2 x 5 + 20) = 7 x 7 x 30 tensor = 1470 outputs
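The arithmetic above can be checked in two lines, using the Pascal VOC settings from the slide:

```python
# Pascal VOC settings from the slide: S = 7 grid, B = 2 boxes/cell, C = 20 classes.
S, B, C = 7, 2, 20
per_cell = B * 5 + C    # 5 numbers per box (x, y, w, h, confidence) + C class probabilities
total = S * S * per_cell
print(per_cell, total)  # 30 1470
```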
Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016

Split the image into a grid

Each cell predicts boxes and confidences: P(Object)

Each cell also predicts a conditional class probability P(Class | Object), e.g. “Dining Table”

Combine the box and class predictions

Finally do non-maximum suppression and threshold detections
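Non-maximum suppression is the standard greedy procedure: keep the highest-scoring box, discard boxes that overlap it too much (by intersection-over-union), and repeat. A self-contained sketch with made-up boxes:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression: keep the top-scoring box, drop
    boxes overlapping it by more than `thresh`, repeat on the rest."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        i = order[0]
        keep.append(int(i))
        order = order[1:][[iou(boxes[i], boxes[j]) <= thresh for j in order[1:]]]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 too much and is suppressed
```

A score threshold is then applied on top of this to discard low-confidence survivors, as the slide describes.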

It also generalizes well to new domains
