CISC 6525
Perception
(Computer Vision)
Chapter 24
VM For Class
Download the virtual machine for Oracle VirtualBox:
http://erdos.dsm.fordham.edu/~lyons/ROSIndigo64Bits.ova
Google team drive: CISC 6525 Fall 2018
File: RosIndigo64Bits.ova
This is an Ubuntu 14.04 VM with some special software installed.
This has ROS (Robot Operating System), OpenCV (computer vision), and FF (a high-performance symbolic planner) installed.
Outline
Perception generally
Image formation
Early vision
2D → 3D
Object recognition
Slides from R&N Chapter 24 or DML unless otherwise attributed
The Problem
Image Formation
P is a point in the scene, with coordinates (X, Y, Z)
P’ is its image on the image plane, with coordinates (x, y, z)
By similar triangles, −x/f = X/Z and −y/f = Y/Z, i.e. x = −fX/Z and y = −fY/Z. Scale/distance is indeterminate!
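A minimal Python sketch of this pinhole projection (the focal length f and the sample points are illustrative only):

```python
def project(X, Y, Z, f=0.05):
    """Pinhole (perspective) projection of scene point (X, Y, Z).

    Returns image-plane coordinates (x, y); the minus signs reflect the
    inversion of the image behind the pinhole. Z must be nonzero.
    """
    x = -f * X / Z
    y = -f * Y / Z
    return x, y

# Doubling both the size and the distance of an object gives the same image
# point, which is why scale/distance is indeterminate from a single image:
print(project(1.0, 2.0, 10.0))   # (-0.005, -0.01)
print(project(2.0, 4.0, 20.0))   # (-0.005, -0.01)
```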
Images
Individual values are called pixels (picture elements).
Images
Images & Video
I(x, y, t) is the intensity at (x, y) at time t
CCD camera: 4,000,000 pixels (4 Mpixel); human eyes: 240,000,000 (240 Mpixel)
i.e., ~5 terabits/sec at 20 Hz = 20 fps
What is color
Color is related to the wavelength of light
Shorter wavelengths are perceived as blue and longer ones as red, with green in between.
What is daylight
The intensity of light at each frequency that falls on the earth during the day can be represented by a spectral power distribution graph.
From a subjective viewpoint
The Retina
Rods sense ‘light intensity’; cones sense ‘color’.
Each cone has one of three pigments: red, green, or blue.
Color sensitivity of the 3 cones
The closer the wavelength of the incoming light to the target wavelength for that cone, the more active the cone cell becomes.
How do we see all those colors!
Depending on how strongly each of the three cone types is activated, we perceive a different color (wavelength of light).
E.g.: 10% blue + 30% red + 60% green ≈ light of approximately 500 nm.
The Tristimulus Theory
This is the theory that any color can be specified by giving just three values.
We call Red, Green, and Blue the additive primary colors.
We can define a given color by saying how much red, green, and blue light we need to add to get that color.
Color – Summary
Intensity varies with frequency – infinite dimensional signal
Human eye has three types of color-sensitive cells;
each integrates the signal => 3-element vector intensity
Alternative way of specifying color
Hue (roughly, dominant wavelength)
Saturation (purity)
Value (brightness)
Model HSV as a ‘cylinder’: H angle, S distance from axis, V distance along axis
Basis of a popular style of color picker (see the sketch below)
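A quick way to see the H/S/V parameterization in practice, using Python's standard-library colorsys (all values in [0, 1]; hue is returned as a fraction of the 360° angle):

```python
import colorsys

# Pure red, a darker red, and a desaturated (pinkish) red:
for rgb in [(1.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.5, 0.5)]:
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    print(rgb, "-> H=%.0f deg, S=%.2f, V=%.2f" % (h * 360, s, v))
# All three share H = 0 deg (red); only S (purity) and V (brightness) change.
```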
HSV
HSV Color Cone
Why is it not a cylinder?
YUV
However, Y is not a simple average of R, G, and B, because the eye is more sensitive to some colors than others.
Digital TV uses Y′CbCr, not YUV (different weights).
Y = R * .299000 + G * .587000 + B * .114000
U = R * -.168736 + G * -.331264 + B * .500000 + 128
V = R * .500000 + G * -.418688 + B * -.081312 + 128
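A minimal NumPy sketch applying these weights to a whole image; the matrix rows are the Y, U (Cb), and V (Cr) coefficients quoted above, and the +128 offset centres the chroma channels for 8-bit storage:

```python
import numpy as np

def rgb_to_ycbcr(img):
    """Convert an HxWx3 uint8 RGB image using the weights listed above."""
    rgb = img.astype(np.float64)
    m = np.array([[ 0.299000,  0.587000,  0.114000],   # Y
                  [-0.168736, -0.331264,  0.500000],   # U (Cb)
                  [ 0.500000, -0.418688, -0.081312]])  # V (Cr)
    out = rgb @ m.T
    out[..., 1:] += 128.0            # offset the two chroma channels
    return np.clip(out, 0, 255).astype(np.uint8)

# Example: a single pure-green pixel.
pixel = np.array([[[0, 255, 0]]], dtype=np.uint8)
print(rgb_to_ycbcr(pixel))           # approx. [149, 43, 21]
```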
YUV Color Cube
two perspectives
Compute new value for pixel from its old value and the values of surrounding pixels
Filtering operations
Compute weighted average of pixel values
Array of weights is known as the convolution mask
Pixels used in the convolution are known as the convolution kernel
Computationally intensive process
Pixel Group Processing
Pixel processing
-1 -1 -1
-1 8 -1
-1 -1 -1
50 10 55 30 20
18 20 40 35 30
19 18 30 40 50
18 18 20 90 80
17 16 40 80 100
Convolution kernel
Image
Kernel applied left to right, top to bottom (see the sketch below)
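A minimal NumPy sketch of this pixel-group operation, using the mask and image values shown above (border pixels are simply skipped):

```python
import numpy as np

# The 3x3 mask and 5x5 image from the example above.
mask = np.array([[-1, -1, -1],
                 [-1,  8, -1],
                 [-1, -1, -1]])
image = np.array([[50, 10, 55, 30, 20],
                  [18, 20, 40, 35, 30],
                  [19, 18, 30, 40, 50],
                  [18, 18, 20, 90, 80],
                  [17, 16, 40, 80, 100]])

# Slide the mask over every interior pixel, left to right, top to bottom,
# replacing the pixel by the weighted sum of its 3x3 neighbourhood.
out = np.zeros((3, 3), dtype=int)
for r in range(1, 4):
    for c in range(1, 4):
        out[r - 1, c - 1] = np.sum(mask * image[r - 1:r + 2, c - 1:c + 2])
print(out)
```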
Classic simple blur
Convolution mask with equal weights
Unnatural effect
Gaussian blur
Convolution mask with coefficients falling off gradually (Gaussian bell curve)
More gentle; amount and radius can be set (see the sketch below)
Blurring
Gaussian Blur Filter
Figure: no blur vs. small radius vs. larger radius
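A minimal OpenCV sketch of both blurs; input.jpg is a placeholder filename, and the kernel size and sigma are illustrative:

```python
import cv2

img = cv2.imread("input.jpg")                      # placeholder filename

box_blur = cv2.blur(img, (5, 5))                   # equal-weight 5x5 mask
gauss_blur = cv2.GaussianBlur(img, (5, 5), 1.5)    # Gaussian weights, sigma = 1.5

cv2.imwrite("box_blur.jpg", box_blur)
cv2.imwrite("gauss_blur.jpg", gauss_blur)
```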
High-pass (sharpening) filter
3×3 convolution mask coefficients all equal to -1, except centre = 9
Produces harsh edges
Unsharp masking
Copy image, apply Gaussian blur to copy, subtract it from original
Enhances image features (see the sketch below)
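A minimal OpenCV sketch of unsharp masking, again with input.jpg as a placeholder and an adjustable amount; subtracting the blurred copy is folded into one weighted sum:

```python
import cv2

img = cv2.imread("input.jpg")                          # placeholder filename
blurred = cv2.GaussianBlur(img, (0, 0), sigmaX=3)      # the blurred copy

# original + amount * (original - blurred), with saturating arithmetic
amount = 1.0
sharpened = cv2.addWeighted(img, 1 + amount, blurred, -amount, 0)
cv2.imwrite("sharpened.jpg", sharpened)
```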
Sharpening
Sharpening Filter
Edge Detection
Convolve image with spatially oriented
filters (possibly multi-scale)
Label above-threshold pixels with edge orientation
Infer “clean” line segments by combining edge pixels with the same orientation (see the sketch below)
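A minimal OpenCV sketch of the first two steps (oriented Sobel filters, a magnitude threshold, and per-pixel orientation labels); input.jpg and the threshold value are placeholders, and the final line-segment grouping step is omitted:

```python
import cv2
import numpy as np

gray = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)    # placeholder filename

gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)         # horizontal derivative
gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)         # vertical derivative

magnitude = np.hypot(gx, gy)
orientation = np.arctan2(gy, gx)                        # edge orientation per pixel

edges = magnitude > 100                                 # threshold (tune per image)
print("edge pixels:", edges.sum())
```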
Edge Detection
Edges in the image come from discontinuities of I(x, y, t) that reflect scene structure.
These can be due to:
1) depth
2) surface orientation
3) reflectance (surface markings)
4) illumination (shadows, etc.)
Laplacian Edges
Reconstructing based on edges
Solid polygons with trihedral edges
Trihedral Edges
Vertex/Edge Labeling Example
Cues from Prior Knowledge
(“Shape from X”)
Shape from Motion
Stereo
Stereo Depth Calculation
Example Stereo Disparity
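The disparity slides are figures only; as a reminder (the standard rectified-stereo relation, not quoted in the text above), depth follows from disparity as Z = f·b/d:

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Standard pinhole stereo relation Z = f * b / d (assumes rectified cameras)."""
    return focal_px * baseline_m / disparity_px

# e.g. a 20-pixel disparity with f = 700 px and a 10 cm baseline:
print(depth_from_disparity(20.0, 700.0, 0.10))   # 3.5 metres
```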
Shape from Texture
Idea: assume the actual texture is uniform, and compute the surface shape that would produce this distortion
Similar idea works for shading – assume uniform reflectance, etc.
But inter-reflections give nonlocal computation of perceived intensity
=> hollows seem shallower than they really are
Shape from Optical Flow
Optical flow describes the direction and speed of motion of features in the image.
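A minimal OpenCV sketch of dense optical flow between two consecutive frames, assuming placeholder filenames frame0.png and frame1.png:

```python
import cv2
import numpy as np

prev = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)   # placeholder filenames
curr = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)

# Farneback dense optical flow: one (dx, dy) vector per pixel.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2,
                                    flags=0)
speed = np.hypot(flow[..., 0], flow[..., 1])
print("mean speed (pixels/frame):", speed.mean())
```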
Segmentation of Images
Which image components “belong together”?
Belong together=lie on the same object
Cues
similar color (sketched after this list)
similar texture
not separated by contour
form a suggestive shape when assembled
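A minimal sketch of grouping by the first cue alone (similar color), clustering pixel colors with k-means; input.jpg and k = 4 are placeholders:

```python
import cv2
import numpy as np

img = cv2.imread("input.jpg")                       # placeholder filename
pixels = img.reshape(-1, 3).astype(np.float32)

# Cluster pixel colors into k groups; each pixel gets the label of its group.
k = 4
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
_, labels, centers = cv2.kmeans(pixels, k, None, criteria, 5,
                                cv2.KMEANS_RANDOM_CENTERS)

# Paint each pixel with its cluster's mean color to visualise the grouping.
segmented = centers[labels.flatten()].reshape(img.shape).astype(np.uint8)
cv2.imwrite("segments.jpg", segmented)
```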
Computer Vision – A Modern Approach
Set: Introduction to Vision
Slides by D.A. Forsyth
Object Recognition
Simple idea:
extract 3-D shapes from image
match against “shape library”
Problems:
extracting curved surfaces from image
representing shape of extracted object
representing shape and variability of library object classes
improper segmentation, occlusion
unknown illumination, shadows, markings, noise, complexity, etc.
Approaches:
index into library by measuring invariant properties of objects
alignment of image feature with projected library object feature
match image against multiple stored views (aspects) of library object
machine learning methods based on image statistics
ImageNet
2012: 1.3 million hand-labelled images
1000 classes (e.g., 120 dog classes)
Deep Learning for Image Classification
Regular (fully connected) NNs don’t scale well to image sizes
AlexNet (2012): ~50% reduction in ImageNet error rate
ResNet (2015): performance exceeds human level on ImageNet
Deep Issues
Supervised vs. Unsupervised
Transfer learning
Computational requirements, GPUs
Adversarial images:
Matching templates
Some objects are 2D patterns
e.g. faces
Find faces by
finding eyes, nose, mouth
finding assembly of the three that has the “right” relations
Build an explicit pattern matcher
discount changes in illumination by using a parametric model
changes in background are hard
changes in pose are hard
http://www.ri.cmu.edu/projects/project_271.html
http://www.ri.cmu.edu/projects/project_320.html
People
Skin color is characteristic; clothing is hard to segment
hence, works best for people wearing little clothing
Finding body segments:
finding skin-like (color, texture) regions that have nearly straight, nearly parallel boundaries
Grouping process constructed by hand and tuned by hand using a small dataset.
When a sufficiently large group is found, assert a person is present
Action recognition from still images
Description of the human pose
Silhouette description [Sullivan & Carlsson, 2002]
Histogram of gradients (HOG) [Dalal & Triggs 2005]
Human body part layout
[Felzenszwalb & Huttenlocher, 2000]
Tracking
Extract a set of features from the image
Use a model to predict next position and refine using next image
Model:
simple dynamic models (second order dynamics)
kinematic models
etc.
Face tracking and eye tracking now work rather well
SIFT Features (Lowe 1999)
Image content is transformed into local feature coordinates that are invariant to translation, rotation, scale, and other imaging parameters
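A minimal OpenCV sketch of extracting SIFT keypoints and descriptors (needs an OpenCV build that includes SIFT, e.g. opencv-python 4.4 or later; input.jpg is a placeholder):

```python
import cv2

gray = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder filename

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)

# Each keypoint carries location, scale, and orientation; each descriptor is a
# 128-dimensional vector, largely invariant to translation, rotation, and scale.
print(len(keypoints), "keypoints,", descriptors.shape, "descriptor array")
```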
SIFT Features
Lowe’s Scale-space Interest Points
Laplacian of Gaussian kernel
Scale-normalised (multiplied by scale²)
Proposed by Lindeberg
Scale-space detection
Find local maxima across scale/space
A good “blob” detector
[ T. Lindeberg IJCV 1998 ]
Lowe’s Pyramid Scheme
Per octave: s+2 Gaussian filters with scales σ_i = 2^(i/s)·σ_0, for i = 0, 1, …, s+1.
These give s+3 images per octave (including the original) and s+2 difference (DoG) images.
The parameter s determines the number of images per octave.
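A small sketch of how the per-octave scales follow from s; σ_0 = 1.6 is the value commonly quoted for Lowe's implementation and is used here purely as an illustration:

```python
s = 3          # intervals per octave
sigma0 = 1.6   # base scale (illustrative value)

# s+2 filter scales per octave: sigma_i = 2**(i/s) * sigma_0, i = 0 .. s+1
sigmas = [sigma0 * 2 ** (i / s) for i in range(s + 2)]
print(["%.2f" % v for v in sigmas])   # the scale doubles after s steps
```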
Using SIFT for Matching “Objects”
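A sketch of the usual matching recipe with OpenCV: brute-force matching of SIFT descriptors from two placeholder images, followed by Lowe's ratio test to drop ambiguous matches:

```python
import cv2

img1 = cv2.imread("object.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder filenames
img2 = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

matcher = cv2.BFMatcher()
candidates = matcher.knnMatch(des1, des2, k=2)          # two nearest neighbours each

# Lowe's ratio test: keep a match only if it clearly beats the runner-up.
good = [m for m, n in candidates if m.distance < 0.75 * n.distance]
print(len(good), "good matches")
```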
SIFT for Navigation
Homing in Scale Space (HiSS)
[Churchill & Vardy 2008]
Uses SIFT Feature matching
Adds scale information to improve homing performance
Figure: home image vs. current image
In robotics, the input image is typically a 360-degree (omnidirectional) image, and there are several approaches to calculating the home vector from the image comparison: bearing-based approaches use only landmark bearing information, computed from the home and current images, to find a direction to move; the ALV approach is fast, simple, and homes reliably, but the path is erratic.
Churchill & Vardy's HiSS improved on this by leveraging the scale information from SIFT feature matching to add a distance to the bearing calculation.
Semantic Navigation (Hulbert 2018)
YOLO: extremely fast (155 fps) object detection using a specialized CNN
ROS/Gazebo 3D simulation of a large suburban scene
Action recognition in videos
Motion history image
[Bobick & Davis, 2001]
Spatial motion descriptor
[Efros et al. ICCV 2003]
Learning dynamic prior
[Blake et al. 1998]
Sign language recognition
[Zisserman et al. 2009]
Action Recognition:
Action = Space Time Object
CNNs & Activity Recognition
Karen Simonyan & Andrew Zisserman, NIPS 2014