Image Segmentation
How many zebras?
From Sandlot Science
Why is context important?
What is this?
Why is this a car?
…because it’s on the road!
Why is this a road?
Context is very important!
Same problem in real scenes
From images to objects
What defines an object?
• Subjective problem, but has been well-studied
• Proximity, similarity, continuation…
Extracting objects
How could we do this automatically (or at least semi-automatically)?
Semi-automatic binary segmentation
Simplifying the user interaction
GrabCut [Rother et al., SIGGRAPH 2004]
Source: K. Grauman
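To make the GrabCut pointer concrete, here is a minimal sketch using OpenCV's implementation; the file name and rectangle are hypothetical, and the box is the only user interaction.

```python
import numpy as np
import cv2  # assumption: OpenCV, which ships a GrabCut implementation

image = cv2.imread("photo.jpg")                 # hypothetical file name
mask = np.zeros(image.shape[:2], np.uint8)
bgd_model = np.zeros((1, 65), np.float64)       # internal GMM state for background
fgd_model = np.zeros((1, 65), np.float64)       # internal GMM state for foreground

rect = (50, 50, 300, 200)                       # user interaction: a box around the object
cv2.grabCut(image, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Pixels marked definite or probable foreground form the binary segmentation.
binary = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
foreground = image * binary[:, :, None]
```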
Auto segmentation: toy example
[Figure: input image and its intensity histogram (pixel count vs. intensity), with distinct peaks for black pixels and white pixels. Source: K. Grauman]
• These intensities define the three groups.
• We could label every pixel in the image according to which of these primary intensities it is.
• i.e., segment the image based on the intensity feature.
• But … the image isn’t quite so simple …
[Figure: a more realistic input image and its intensity histogram (pixel count). Source: K. Grauman]
• Now how do we determine the three main intensities that define our groups?
• We need to cluster.
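A minimal sketch of this clustering step, assuming scikit-learn and scikit-image are available; the file name and the choice of k = 3 are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans   # assumption: scikit-learn
from skimage import io               # assumption: scikit-image for loading

# Cluster pixel intensities into k groups, then label every pixel by its cluster.
image = io.imread("input_image.png", as_gray=True)         # hypothetical file name
intensities = image.reshape(-1, 1)                         # one feature per pixel: intensity

kmeans = KMeans(n_clusters=3, n_init=10).fit(intensities)  # the three "main intensities"
labels = kmeans.labels_.reshape(image.shape)               # per-pixel group label
centers = kmeans.cluster_centers_.ravel()                  # the recovered group intensities
print(centers)
```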
Deep Learning for recognition tasks:
• Semantic Segmentation (pixel-level): GRASS, CAT, TREE, SKY
• Classification + Localization (single object): CAT
• Object Detection (multiple objects): DOG, DOG, CAT
• Instance Segmentation (multiple objects; segmentation + classification): DOG, DOG, CAT
Slides adapted from Fei-Fei Li, Justin Johnson & Serena Yeung, Stanford CS231n, Lecture 11, May 10, 2017.
Semantic Segmentation
Label each pixel in the image with a category label
Don’t differentiate instances, only care about pixels
Semantic Segmentation Idea: Fully Convolutional
Design a network as a bunch of convolutional layers to make predictions for all pixels at once!
[Diagram: input image → convolutions → scores (C x H x W) → per-pixel argmax → predictions (H x W)]
Each channel is a class: C channels → C classes.
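To make this concrete, a minimal sketch assuming PyTorch; the layer widths and the number of classes are illustrative, and the lecture does not prescribe a framework.

```python
import torch
import torch.nn as nn

C = 21  # illustrative number of classes

# A fully convolutional network that keeps full resolution throughout.
fcn = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, C, kernel_size=3, padding=1),   # scores: C x H x W
)

image = torch.randn(1, 3, 128, 128)       # 3 x H x W
scores = fcn(image)                       # 1 x C x H x W
predictions = scores.argmax(dim=1)        # per-pixel argmax -> 1 x H x W label map
print(scores.shape, predictions.shape)
```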
Semantic Segmentation Idea: Fully Convolutional
Design the network as a bunch of convolutional layers, with downsampling and upsampling inside the network!
[Diagram: High-res: D1 x H/2 x W/2 → Med-res: D2 x H/4 x W/4 → Low-res: D3 x H/4 x W/4 → Med-res: D2 x H/4 x W/4 → High-res: D1 x H/2 x W/2 → Predictions: H x W]
Downsampling: pooling, strided convolution
Upsampling: ???
Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015; Noh et al., “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
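A minimal PyTorch sketch of this downsample-then-upsample design; the layer widths, class count, and the use of bilinear interpolation as a stand-in answer to the “???” are all assumptions for illustration (unpooling and transpose convolution are covered next).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallFCN(nn.Module):
    """Sketch of downsample-then-upsample segmentation; D1, D2, C are illustrative."""
    def __init__(self, C=21, D1=32, D2=64):
        super().__init__()
        self.down1 = nn.Conv2d(3, D1, 3, stride=2, padding=1)    # high-res: D1 x H/2 x W/2
        self.down2 = nn.Conv2d(D1, D2, 3, stride=2, padding=1)   # med-res:  D2 x H/4 x W/4
        self.score = nn.Conv2d(D2, C, 3, padding=1)              # scores at reduced resolution

    def forward(self, x):
        h, w = x.shape[-2:]
        x = F.relu(self.down1(x))
        x = F.relu(self.down2(x))
        x = self.score(x)
        # Upsampling back to full resolution; bilinear interpolation as a placeholder.
        return F.interpolate(x, size=(h, w), mode="bilinear", align_corners=False)

print(SmallFCN()(torch.randn(1, 3, 128, 128)).shape)  # torch.Size([1, 21, 128, 128])
```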
In-Network upsampling: “Unpooling”
• Nearest Neighbor: Input: 2 x 2 → Output: 4 x 4 (each input value is copied into its 2 x 2 output block)
• “Bed of Nails”: Input: 2 x 2 → Output: 4 x 4 (each input value goes to one corner of its 2 x 2 output block; the other entries are zero)
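A small sketch of both schemes, assuming PyTorch; the 2 x 2 input values are illustrative.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1., 2.],
                  [3., 4.]]).view(1, 1, 2, 2)   # Input: 2 x 2

# Nearest-neighbor unpooling: copy each value into a 2 x 2 block -> 4 x 4
nearest = F.interpolate(x, scale_factor=2, mode="nearest")

# "Bed of nails" unpooling: each value goes to the top-left of its 2 x 2 block, zeros elsewhere
bed_of_nails = torch.zeros(1, 1, 4, 4)
bed_of_nails[:, :, ::2, ::2] = x

print(nearest.squeeze())        # [[1,1,2,2],[1,1,2,2],[3,3,4,4],[3,3,4,4]]
print(bed_of_nails.squeeze())   # [[1,0,2,0],[0,0,0,0],[3,0,4,0],[0,0,0,0]]
```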
In-Network upsampling: “Max Unpooling”
Max Pooling: remember which element was max! Input: 4 x 4 → Output: 2 x 2
… rest of the network …
Max Unpooling: use the positions from the corresponding pooling layer. Input: 2 x 2 → Output: 4 x 4 (the remaining entries are zero)
Corresponding pairs of downsampling and upsampling layers.
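A minimal sketch of paired max pooling and max unpooling, assuming PyTorch; the intermediate layers are omitted.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)  # remember max positions
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 1, 4, 4)           # Input: 4 x 4
pooled, indices = pool(x)             # Output: 2 x 2, plus the argmax positions
# ... rest of the network would go here ...
restored = unpool(pooled, indices)    # Input: 2 x 2 -> Output: 4 x 4
print(pooled.shape, restored.shape)   # torch.Size([1, 1, 2, 2]) torch.Size([1, 1, 4, 4])
# Max values return to their original positions; all other entries are zero.
```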
Learnable Upsampling: Transpose Convolution
3 x 3 transpose convolution, stride 2, pad 1: Input: 2 x 2 → Output: 4 x 4
• The input gives the weight for the filter: the output contains copies of the filter, each weighted by one input value, summed where they overlap in the output.
• The filter moves 2 pixels in the output for every one pixel in the input: the stride gives the ratio between movement in the output and movement in the input.
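As a quick shape check, a sketch using PyTorch's nn.ConvTranspose2d (an assumption; the lecture does not prescribe a framework). Note that PyTorch's shape rule needs output_padding=1 to land exactly on the slide's 2 x 2 → 4 x 4 example.

```python
import torch
import torch.nn as nn

# 3 x 3 transpose convolution, stride 2, padding 1 on a 2 x 2 input.
# With PyTorch's formula, (2-1)*2 - 2*1 + 3 = 3, so output_padding=1 gives 4 x 4.
upsample = nn.ConvTranspose2d(in_channels=1, out_channels=1, kernel_size=3,
                              stride=2, padding=1, output_padding=1)

x = torch.randn(1, 1, 2, 2)   # (batch, channels, H, W)
y = upsample(x)
print(y.shape)                # torch.Size([1, 1, 4, 4])
```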
Transpose Convolution: 1D Example
[Figure: 1D transpose convolution; the output is built from copies of the filter weighted by the input, summed where they overlap.]
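To make the 1D example concrete, here is a minimal NumPy sketch; the filter taps and input values are made up for illustration. It builds the output as filter copies weighted by the input and summed where they overlap.

```python
import numpy as np

def transpose_conv1d(x, w, stride=2):
    """1D transpose convolution (no padding): place a copy of filter w, weighted by
    each input value, every `stride` output positions, and sum the overlaps."""
    out = np.zeros((len(x) - 1) * stride + len(w))
    for i, xi in enumerate(x):
        out[i * stride : i * stride + len(w)] += xi * w
    return out

x = np.array([2.0, 3.0])          # input: 2 values
w = np.array([1.0, 2.0, 1.0])     # 3-tap filter
print(transpose_conv1d(x, w))     # [2. 4. 5. 6. 3.]  (overlap at index 2: 2*1 + 3*1 = 5)
```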
Object Detection as Regression?
• CAT: (x, y, w, h) → 4 numbers
• DOG: (x, y, w, h), DOG: (x, y, w, h), CAT: (x, y, w, h) → 16 numbers
• DUCK: (x, y, w, h), DUCK: (x, y, w, h), … → many numbers!
Problem: each image needs a different number of outputs!
Object Detection as Classification: Sliding Window
Apply a CNN to many different crops of the image; the CNN classifies each crop as object or background.
• An empty crop: Dog? NO, Cat? NO, Background? YES
• A crop containing a dog: Dog? YES, Cat? NO, Background? NO
• A crop containing the cat: Dog? NO, Cat? YES, Background? NO
Problem: we need to apply the CNN to a huge number of locations and scales, which is very computationally expensive!
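A sketch of this brute-force approach, assuming PyTorch; the classifier, crop size, stride, and class ordering are illustrative assumptions.

```python
import torch

def sliding_window_detect(image, classifier, crop_size=128, stride=32):
    """Run a classifier on every crop of the image; classes assumed to be [dog, cat, background]."""
    _, H, W = image.shape
    detections = []
    for y in range(0, H - crop_size + 1, stride):
        for x in range(0, W - crop_size + 1, stride):
            crop = image[:, y:y + crop_size, x:x + crop_size].unsqueeze(0)
            label = classifier(crop).argmax(dim=1).item()
            if label != 2:                               # not background
                detections.append((x, y, crop_size, crop_size, label))
    return detections

# Dummy classifier just to make the sketch runnable.
classifier = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 128 * 128, 3))
print(len(sliding_window_detect(torch.randn(3, 256, 256), classifier)))
# The loop over positions (and, in practice, scales) is what makes this so expensive.
```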
Region Proposals
● Find image regions that are likely to contain objects
● Relatively fast to run; e.g. Selective Search gives 1000 region proposals in a few seconds on CPU
Alexe et al., “Measuring the objectness of image windows”, TPAMI 2012; Uijlings et al., “Selective Search for Object Recognition”, IJCV 2013; Cheng et al., “BING: Binarized normed gradients for objectness estimation at 300fps”, CVPR 2014; Zitnick and Dollar, “Edge boxes: Locating object proposals from edges”, ECCV 2014
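For example, a sketch using the Selective Search implementation shipped with OpenCV's contrib modules (an assumption: opencv-contrib-python is installed; the file name is hypothetical).

```python
import cv2  # assumption: opencv-contrib-python, which ships Selective Search

image = cv2.imread("street.jpg")   # hypothetical file name

ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()   # trade a little quality for speed

rects = ss.process()               # array of (x, y, w, h) region proposals
print(len(rects), "proposals")     # typically on the order of a thousand or more
```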
[Figure: example object proposals, from Alexe et al., CVPR 2010]
R-CNN: a figure-based walkthrough of the region-based detection pipeline of Girshick et al., “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation”, CVPR 2014.
Detection without Proposals: YOLO
Input image: 3 x H x W. Divide the image into a 7 x 7 grid.
Imagine a set of B base boxes centered at each grid cell (here B = 3).
Within each grid cell:
• Regress from each of the B base boxes to a final box with 5 numbers: (dx, dy, dh, dw, confidence)
• Predict scores for each of C classes (including background as a class)
Output: 7 x 7 x (5 * B + C)
Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016; Liu et al., “SSD: Single Shot MultiBox Detector”, ECCV 2016
This parameterization fixes the output size. Each cell predicts:
• For each bounding box:
  – 4 coordinates (x, y, w, h)
  – 1 confidence value
• Some number of class probabilities
For Pascal VOC: 2 bounding boxes per cell, 20 classes
→ 7 x 7 x (2 x 5 + 20) = 7 x 7 x 30 tensor = 1470 outputs
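A sketch of how such an output tensor can be decoded, assuming PyTorch; random numbers stand in for the network output, and the layout (box fields first, then class scores) is an illustrative assumption.

```python
import torch

# YOLO-style output for Pascal VOC: S=7 grid, B=2 boxes per cell, C=20 classes.
S, B, C = 7, 2, 20
output = torch.randn(S, S, B * 5 + C)              # 7 x 7 x 30, i.e. 1470 numbers per image

boxes = output[..., :B * 5].reshape(S, S, B, 5)    # per box: (x, y, w, h, confidence)
class_probs = output[..., B * 5:].softmax(dim=-1)  # per cell: P(class | object)

# Score for (box b, class c) in a cell: box confidence * class probability.
confidences = boxes[..., 4:5]                      # 7 x 7 x B x 1
scores = confidences * class_probs.unsqueeze(2)    # 7 x 7 x B x C
print(scores.shape)                                # torch.Size([7, 7, 2, 20])
# Non-maximum suppression and thresholding would follow.
```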
Redmon et al., “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016:
• Split the image into a grid.
• Each cell predicts boxes and confidences: P(Object).
• Each cell also predicts a class probability, P(Class | Object), for classes such as Dining Table.
• Combine the box and class predictions.
• Finally, do non-maximum suppression and threshold the detections.
• YOLO also generalizes well to new domains.
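For the last step, a minimal sketch of non-maximum suppression using torchvision's off-the-shelf implementation (an assumption; the box coordinates and scores are made up).

```python
import torch
from torchvision.ops import nms  # assumption: torchvision is available

boxes = torch.tensor([[10., 10., 60., 60.],
                      [12., 12., 62., 62.],
                      [100., 100., 160., 160.]])   # (x1, y1, x2, y2), illustrative
scores = torch.tensor([0.9, 0.8, 0.75])

keep = nms(boxes, scores, iou_threshold=0.5)       # drops the overlapping lower-scoring box
print(keep)                                        # tensor([0, 2])
```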