[06-30213][06-30241][06-25024]
Computer Vision and Imaging &
Robot Vision
Dr Hyung Jin Chang (h.j.chang@bham.ac.uk)
Dr Yixing Gao (y.gao.8@bham.ac.uk)
School of Computer Science
Dr. Hector Basevi
• Research Fellow working within the Intelligent Robotics Lab in the School of Computer Science of the University of Birmingham under Professor Aleš Leonardis.
• My interests include:
– Scene understanding from RGB and RGBD images.
– Using non-visual measurement channels and prior information to enhance analysis of visual data.
– Deep learning.
– Generative models.
– Effective and interpretable representations for visual, geometric and other information channels.
– Medical imaging and medical image analysis.
– 3D imaging via structured light.
– 3D printing.
– Virtual and augmented reality.
Moving to Active 3D Imaging
Dr Hector Basevi
Surveying
Marco Verch, Melaten Friedhof Köln, 2015, https://en.wikipedia.org/wiki/File:Melaten_Friedhof_K%C3%B6ln_(23311886499).jpg
Google maps
Archaeology
Paul Souders/Corbis, https://www.nature.com/news/3d-images-remodel-history-1.15418
http://www.ucl.ac.uk/~tcrnahb/ImpLog3D/
Industrial inspection
https://www.faro.com/en-gb/
https://www.industrialvision.co.uk/about-us/case-studies/3d-vision-inspection-systems-and-3d-machine-vision-technology
http://www.ukiva.org/3D-Imaging/3d-inspection.html
Biometrics
K. Bowyer et al., A survey of approaches and challenges in 3D and multi-modal 3D + 2D face recognition, 2006
appleinsider.com
S. Huang et al., 3D fingerprint imaging system based on full-field fringe projection profilometry, 2014
Augmented reality
R. Fernandez, 2017, https://en.wikipedia.org/wiki/File:IPhone_X_vector.svg
Wayfair – furniture & décor, https://get.google.com/tango/apps/
http://8i.com/technology/
3D scanning and printing
https://central-scanning.co.uk/3d-scanning-services/3d-printing.html
http://blog.tngvisualeffects.com/3d-printing-is-now-the-rage/
http://microfabricator.com/articles/view/id/53f0e2de313944596e8b45a1/lifeform-studios-brings-3d-scanning-and-3d-printing-to-kansas
Autonomous driving
https://arstechnica.co.uk/cars/2017/01/googles-waymo-invests-in-lidar-technology-cuts-costs-by-90-percent/
https://spectrum.ieee.org/transportation/advanced-cars/cheap-lidar-the-key-to-making-selfdriving-cars-affordable
Robotic manipulation
https://www.cs.bham.ac.uk/research/groupings/robotics/
3D
• Useful in the applications just mentioned
• Also, to most other tasks in computer vision
• 3D allows us to cheat:
• Distinguish objects from pictures of objects
• Segment foreground from background
• Distinguish object texture from outlines
• Remove effects of directional illumination
• Different environments and applications require different measurement techniques
3D Imaging
• Specific type of 3D information: How far is the surface patch measured in a pixel from the camera?
Depth versus distance
Depth
• Observer dependent
• Vision
Distance
• Observer independent
• Physics
How to measure depth/distance?
1. Passive
   1. Stereophotogrammetry
      • Observe the same surface from a different position
   2. Structure from motion
      • Observe the surface when it, or the camera, is moved
   3. Depth from focus
      • Use a focal image stack to estimate when a point is most in focus
2. Active
   1. Time of flight
      • Emit light and measure how long it takes to come back
   2. Structured light imaging
      • Project a pattern onto the surface from a different position
   3. Photometric stereo
      • Observe the surface when lit by lights in different positions
Recap: Stereophotogrammetry
• Inspiration from one of the ways that we get 3D perception: multiple observers
Determining distance in 2D
• Adopt the pinhole camera model: single ray between object and pupil
• Use coplanar cameras
• Find pixel locations of object in both images
• $d = fb/\mathrm{disparity}$ (see the code sketch below)
[Figure: an imaged object at distance $d$ from two coplanar pinhole cameras (pupil/pinhole with virtual image planes of 8 pixels each), focal length $f$, baseline $b$; the offset between the object's pixel positions in the two images is the disparity.]
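As a concrete illustration of the relation above, here is a minimal Python/NumPy sketch (the function name and the focal length, baseline and disparity values are made up for the example):

import numpy as np

def disparity_to_depth(disparity, f, b):
    # d = f * b / disparity for a coplanar pinhole-camera pair;
    # disparity is in pixels, f in pixels, b in metres.
    depth = np.full(disparity.shape, np.nan)
    valid = disparity > 0                       # zero disparity = no match found
    depth[valid] = f * b / disparity[valid]
    return depth

# Assumed values: 700-pixel focal length, 10 cm baseline.
print(disparity_to_depth(np.array([35.0, 14.0, 7.0]), f=700.0, b=0.1))  # [2. 5. 10.] metres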
Stereophotogrammetry
• Inspiration from one of the ways that we get 3D perception: multiple observers
Aside
• Another way to get 3D perception is via motion
http://www.cambridgeincolour.com/tutorials/animated-3d-stereo-photography.htm
http://depthy.me/
Structure from motion
N. Snavely et al., Photo tourism: exploring photo collections in 3D, 2006
Depth from focus
http://prometheus.med.utah.edu/~bwjones/2009/03/focus-stacking/
Depth from focus
K. Kutulakos et al., Focal Stack Photography: High-Performance Photography with a Conventional Camera, 2009
Depth from focus
S. Suwajanakorn et al., Depth from Focus with Your Mobile Phone, 2015
Passive imaging
• Need two (or more) cameras, but no additional light source
• 3D data from shared field of view
• Units do not interfere with each other
• At mercy of environment:
• Shadows
• Reflections
• Lack of surface features
• However, can artificially add surface features
Active stereophotogrammetry
• Can project surface features
• Multiple cameras still do not interfere with each other
https://software.intel.com/en-us/articles/realsense-r200-camera
Stereophotogrammetry in practice
• Intel RealSense R200 camera
Combining multiple images
Stereophotogrammetry summary
• Two or more cameras, coplanar in the simplest case
• Find correspondences between images of the same scene taken from different viewpoints at the same time
• Finding correspondences relies on surface detail
• Can add detail artificially via projection
• Lack of detail and edges produce holes
• Error inversely proportional to resolution and distance between cameras (assuming fixed disparity accuracy in pixels)
• Only have 3D data where two views overlap
Time of flight
• $c = 3\times10^{8}\ \mathrm{m/s}$
• Light travels in straight lines
• $d = ct/2$
• Attempt 1:
Time of flight
• $c = 3\times10^{8}\ \mathrm{m/s}$
• Light travels in straight lines
• $d = ct/2$
• Attempt 2:
Lidar
• Photosensor and pulse laser on same optical path
• To acquire an image, rotate the system
• In practice, rotate a mirror
• Imaging speed determined by speed of rotation, pulsing, and sensor
• Every measurement acquired at a slightly different time
Lidar
• Avalanche photodiode or photomultiplier-based sensor
• Difficult to measure short distances accurately
• Large
• Expensive
• Robust
• Very useful for autonomous vehicles
David Monniaux, 2007, https://en.wikipedia.org/wiki/File:Lidar_P1270901.jpg
Sarah Frey, 2016, https://en.wikipedia.org/wiki/File:Forest_LIDAR.jpg
Producing a time of flight camera
• Add spatially resolved sensor and unfocused light source
• Can no longer measure time directly as all pixels accumulate photons and are read off together (unless using avalanche photodiodes)
• Instead modulate the illumination and gate pixels
• $d = c\Delta\theta/(4\pi f)$
[Figure: modulated signal over phase $0$ to $4\pi$; the phase offset $\Delta\theta$ of the returned signal is proportional to distance, and wraps every $2\pi$.]
Producing a time of flight camera
• Modulate the illumination and gate pixels
• 4 pixels acquire a single phase measurement
• Each pixel collects light from a different part of the wave
Time of flight camera
• Can use measurements to estimate phase of signal
• Can do this with two measurements per wavelength
• Four measurements per wavelength reduces the effect of background illumination and source reflectivity
• Unfortunately, space becomes periodic
[Figure: the four gated samples $I_1$, $I_2$, $I_3$, $I_4$, each integrating a different part of the modulation period.]
• $\Delta\theta = \tan^{-1}\!\left(\dfrac{I_1 - I_2}{I_3 - I_4}\right)$
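A small NumPy sketch of the phase and distance calculation above (the function name is made up, and exactly which sample differences form the numerator and denominator depends on the sensor's gating convention, so treat that pairing as an assumption):

import numpy as np

C = 3e8  # speed of light in m/s

def tof_distance(i1, i2, i3, i4, f_mod):
    # Phase of the returned modulated signal from the four gated samples,
    # then distance via d = c * delta_theta / (4 * pi * f_mod).
    delta_theta = np.mod(np.arctan2(i1 - i2, i3 - i4), 2 * np.pi)
    return C * delta_theta / (4 * np.pi * f_mod)

# Unambiguous range d_max = c / (2 f) for an assumed 20 MHz modulation frequency:
print(C / (2 * 20e6))  # 7.5 metres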
Time of flight camera
• $f = c/(2 d_{\max}) = 1.5\times10^{8} / d_{\max}$
• $d_{\max} = 1.5\times10^{8} / f$
• Unfortunately, noise is inversely proportional to frequency
• Noise is proportional to square root of background
• Noise is inversely proportional to square root of signal
• Can unwrap phase in software or via multiple signal frequencies (see structured light section)
• Larry Li, Time-of-Flight Camera – An Introduction, Texas Instruments, 2014
Active imaging
• Projecting light via laser or modulated LED
• Dark surfaces absorb light (no signal)
• Specular surfaces reflect it (no signal or secondary signals from multiple reflections)
• Secondary signals more problematic for time of flight cameras than lidar
• Multiple units interfere with each other
Time of flight summary
• Only one observer
• Acquiring pixels sequentially is slow, but robust and long range
• Acquiring pixels simultaneously is expensive or subject to distance wrapping, but fast
• Have to deal with phase wrapping or put up with increased noise
• Increasingly being used in commercial applications (Kinect 2, Google Tango devices)
Structured light imaging
• Stereophotogrammetry has problems with featureless surfaces
• Can artificially add detail by projecting an arbitrary pattern
• Can this be done more effectively, and without a second camera?
• Yes, by projecting known patterns which specify points or curves on the surface
• These techniques can be more accurate than time of flight imaging, particularly for small objects, at the cost of increased imaging time and processing expense
Stereophotogrammetry
Active stereophotogrammetry
Structured light imaging
Point scanner
• Similar to lidar system, except measure direction rather than time
• Assume the relative poses of camera and projector are known
• Unlike time of flight, want to separate camera and laser
• Can use closest points of two lines to estimate coordinates
• Need an image for each point measured
• Slow
Laser line scanner
• Rather than projecting a line, can project a plane
• Illuminates a curve on the surface
• For each point on the curve, calculate intersection between line corresponding to the pixel and the projected plane
• One image per projected plane
• Faster, but not ideal
• This is one of the most common types of industrial inspection cameras because it is robust and synergises with conveyor belts
Encodings
• To reconstruct the position of a pixel, need to know on which projected plane it lies
• Can use a conventional electronic projector to flexibly project patterns
• However, can’t simply turn on all the projector columns at once
• Can turn on alternate columns and then look for bright-dark-bright patterns in the image
• This works for continuous planar surfaces, but for complex surfaces may miss some of the transitions
• Instead encode a label for each plane, using a series of patterns
Binary encodings
• Example: projector with 1920 columns
• Assume can only reliably distinguish between white and black
• Each pixel in a pattern encodes a zero or one
• To encode numbers up to 1920, need $\lceil\log_2 1920\rceil = 11$ images (see the sketch below)
• This seems pretty efficient!
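A short NumPy sketch of the counting argument and of Gray-code pattern generation and decoding (the helper names are illustrative):

import numpy as np

columns = 1920
n_patterns = int(np.ceil(np.log2(columns)))          # 11 patterns for 1920 columns

col = np.arange(columns)
gray = col ^ (col >> 1)                              # adjacent columns differ in one bit
# Bit k of each column's code gives one stripe pattern (one value per projector column).
patterns = [((gray >> k) & 1).astype(np.uint8) for k in range(n_patterns)]

def gray_to_binary(g, n_bits):
    # Invert the Gray code: bit i of the result is the XOR of bits i..n-1 of g.
    b = g.copy()
    for shift in range(1, n_bits):
        b ^= g >> shift
    return b

assert np.array_equal(gray_to_binary(gray, n_patterns), col)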
Binary encodings
• Binary code formed from sequence of patterns
• Gray code: Hamming distance of 1 for adjacent codes
• Inverse Gray code: Hamming distance of 𝑛 − 1
• Which is more useful?
• Binary encodings can in theory use the full resolution of the projector
• But what about defocus?
J. Geng, Structured-light 3D surface imaging: a tutorial, 2011
Binary encodings
• Binary code formed from sequence of patterns
• Gray code: Hamming distance of 1 for adjacent codes
• Inverse Gray code: Hamming distance of 𝑛 − 1
• Inverse Gray code: adjacent columns are easy to tell apart
• Binary encodings can in theory use the full resolution of the projector
• But what about defocus?
J. Geng, Structured-light 3D surface imaging: a tutorial, 2011
Binary encodings
J. Geng, Structured-light 3D surface imaging: a tutorial, 2011
Sinusoidal patterns
• Can do the same as the time of flight camera and encode the column in a phase signal
• Sinusoidal patterns are continuous so less affected by defocus
J. Geng, Structured-light 3D surface imaging: a tutorial, 2011
Sinusoidal patterns
• Phase can be reconstructed from three or more shifted patterns
• As with the time of flight camera, can only reconstruct phase modulo 2𝜋
• Again, reducing the pattern frequency reduces the number of phase wrapping events but increases the noise
• Want to use a high frequency for accuracy, but also want to unwrap phase
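A minimal sketch of the phase recovery, assuming the pattern model $I_k = A + B\cos(\varphi + \delta_k)$ with equally spaced shifts (three shifts in the example; names and constants are illustrative):

import numpy as np

def phase_from_shifts(images, shifts):
    # For I_k = A + B*cos(phi + delta_k) with equally spaced shifts,
    # sum_k I_k*sin(delta_k) = -B*sin(phi)*N/2 and sum_k I_k*cos(delta_k) = B*cos(phi)*N/2,
    # so the wrapped phase follows from an arctangent.
    num = sum(I * np.sin(d) for I, d in zip(images, shifts))
    den = sum(I * np.cos(d) for I, d in zip(images, shifts))
    return np.arctan2(-num, den)

shifts = [0.0, 2 * np.pi / 3, 4 * np.pi / 3]
phi_true = np.array([[1.0]])                     # assumed true phase for the check
images = [0.5 + 0.4 * np.cos(phi_true + d) for d in shifts]
print(phase_from_shifts(images, shifts))         # ~1.0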
An example
H. Basevi, Use of prior information and probabilistic image reconstruction for optical tomographic imaging, 2015
Phase unwrapping
• There are a number of methods of unwrapping phase:
1. Unwrap spatially by following paths from a known pixel
2. Unwrap from a series of phase maps, starting from a low-frequency map with no wrapping (sketched in code after this list)
3. Label each wave using a binary encoding
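A minimal sketch of option 2 with two frequencies, assuming a wrapped high-frequency phase map and an absolute (unwrapped) low-frequency map are already available; the fringe order is recovered by rounding:

import numpy as np

def unwrap_with_low_freq(phi_high, phi_low, ratio):
    # phi_high: wrapped phase of the high-frequency pattern, in [0, 2*pi)
    # phi_low : absolute phase of a pattern with no wrapping, in [0, 2*pi)
    # ratio   : f_high / f_low, i.e. high-frequency periods per low-frequency period
    k = np.round((ratio * phi_low - phi_high) / (2 * np.pi))   # fringe order
    return phi_high + 2 * np.pi * k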
Low frequency vs. high frequency
Variety is the spice of life
J. Geng, Structured-light 3D surface imaging: a tutorial, 2011
Summary of structured light imaging
• Able to image featureless surfaces, unlike stereophotogrammetry, and more accurate than time-of-flight and stereophotogrammetric cameras (for small objects)
• Longer imaging time than both, and more expensive to process than time of flight (though less expensive than stereophotogrammetry)
• More flexible than both due to the nature of the hardware
Photometric stereo
Meekohi, 2015, https://en.wikipedia.org/wiki/File:Photometric_stereo.png
Illumination
• Recall Lambertian reflectance, and Phong model of specular reflectance:
• $I_d = I_0 k_d \cos\theta$, where $\theta$ is the angle between the surface normal and the direction of the ray to the light source
• $I_s = I_0 k_s \cos^{n}\!\alpha$, where $\alpha$ is the angle between the reflected ray and the ray to the observer
Angles
• Take unit vectors in relevant directions, and apply cosine dot product formula:
• $I_d = I_0 k_d\,(\mathbf{l}\cdot\mathbf{n})$
• $I_s = I_0 k_s\,(\mathbf{r}\cdot\mathbf{s})^{n}$
• Have the surface normal explicitly in $I_d$, and implicitly in $I_s$ as the reflection depends on $\mathbf{n}$
• Phong is approximate; assume no specular reflection (see the sketch below)
[Figure: unit vectors $\mathbf{l}$ (towards the light), $\mathbf{n}$ (surface normal), $\mathbf{r}$ (reflected ray) and $\mathbf{s}$ (towards the observer); $\theta$ lies between $\mathbf{l}$ and $\mathbf{n}$, $\alpha$ between $\mathbf{r}$ and $\mathbf{s}$.]
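These two terms are straightforward to evaluate numerically; below is a minimal Python/NumPy sketch (the function name, constants and exponent are illustrative, and the reflection vector is computed as $\mathbf{r} = 2(\mathbf{n}\cdot\mathbf{l})\mathbf{n} - \mathbf{l}$):

import numpy as np

def shade(n, l, s, i0=1.0, kd=0.8, ks=0.2, p=10):
    # n: surface normal, l: direction to light, s: direction to observer (all unit vectors).
    r = 2 * np.dot(n, l) * n - l                       # reflection of l about n
    i_d = i0 * kd * max(np.dot(l, n), 0.0)             # Lambertian term
    i_s = i0 * ks * max(np.dot(r, s), 0.0) ** p        # Phong specular term
    return i_d, i_s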
Estimating surface normal
• Have $I_d = I_0 k_d\,(\mathbf{l}\cdot\mathbf{n})$
• $k_d$ is constant, and assume that $I_0$ is also constant
• If $\mathbf{l}$ is known then only need to find $\mathbf{n}$
• $\mathbf{n}$ has 2 degrees of freedom (why?), so need at least 2 measurements
• Is 𝒍 really known?
Estimating surface normal
• Assume the light source is at infinity
• $\mathbf{l}$ is the same everywhere
• Also assume $I_0$ is the same everywhere, for each light source
• Have measurement $I_i = I_0 k_d\,(\mathbf{l}_i\cdot\mathbf{n})$ for each light source $i$
Estimating surface normal
• Writing the equations for all measurements in matrix form
$$\begin{pmatrix} I_1 \\ I_2 \\ I_3 \end{pmatrix} = \begin{pmatrix} \mathbf{l}_1^{T} \\ \mathbf{l}_2^{T} \\ \mathbf{l}_3^{T} \end{pmatrix} \rho\,\mathbf{n}$$
• Can write this as $\mathbf{i} = \mathbf{L}\rho\mathbf{n}$, where $\rho$ absorbs the constant $I_0 k_d$
• Rearranging, $\rho\mathbf{n} = (\mathbf{L}^{T}\mathbf{L})^{-1}\mathbf{L}^{T}\mathbf{i}$
• Can normalise to remove $\rho$ (see the code sketch after this slide)
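A minimal per-pixel least-squares sketch of this estimate, assuming NumPy, a stack of k ≥ 3 images and known unit light directions (the function name is illustrative):

import numpy as np

def photometric_stereo(images, light_dirs):
    # images: (k, h, w) stack, one image per light; light_dirs: (k, 3) unit vectors.
    # Solves i = L * rho * n in the least-squares sense at every pixel.
    k, h, w = images.shape
    i = images.reshape(k, -1)
    g = np.linalg.lstsq(light_dirs, i, rcond=None)[0]   # (3, h*w), equals rho * n
    rho = np.linalg.norm(g, axis=0)                     # rho (absorbs I_0 k_d)
    n = g / np.maximum(rho, 1e-12)                      # unit surface normals
    return rho.reshape(h, w), n.reshape(3, h, w)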
Estimating surface normal
• $\rho\mathbf{n} = (\mathbf{L}^{T}\mathbf{L})^{-1}\mathbf{L}^{T}\mathbf{i}$
• Can normalise to remove 𝜌
• This gives estimate of 𝒏 at each pixel
• This is still insufficient to extract 3D coordinates
• Need known point to begin
• Can then integrate (known point accounts for unknown integration constant)
Photometric stereo
• Requires assumptions about illumination and surface reflectance
• Requires continuous surfaces
• Surface normal estimated directly and independently per-pixel
• Depth estimated indirectly from estimated surface normals
• Requires a known reference point
Comparison
Technique              | Imaging time | Processing cost | Accuracy | Size of imaged area | Multiple units
Lidar                  | 1            | 5               | 3        | 5                   | No
Time of flight         | 5            | 5               | 3        | 3                   | No
Stereophotogrammetry   | 5            | 2               | 3        | 4                   | Yes
Structured light       | 3            | 3               | 5        | 2                   | No
Photometric stereo     | 3            | 3               | 1        | 1                   | No
Structure from motion  | 2            | 1               | 3        | 4                   | Yes
Depth from focus       | 3            | 3               | 2        | 3                   | Yes

Arbitrary number scale: 1 = bad, 5 = good
3D data representations
Representations: Depth map
Converting depth images to point clouds
• Point cloud: set of 3D points (possibly with associated colour etc.)
• World-to-camera transformation: $\mathbf{x}_c = \begin{pmatrix} \mathbf{R} & \mathbf{t} \end{pmatrix} \begin{pmatrix} \mathbf{x}_w \\ 1 \end{pmatrix}$
• Projection to image coordinates (intrinsics): $\mathbf{x}_i = \begin{pmatrix} f_x & s & p_x \\ 0 & f_y & p_y \\ 0 & 0 & 1 \end{pmatrix} \mathbf{x}_c$
• Can partially invert this to determine a line (ray) for each pixel
• The depth value determines the position of the point on that line (see the sketch below)
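A back-projection sketch under the simplifying assumptions of zero skew and depth measured along the camera z axis (the function name and parameter names are illustrative):

import numpy as np

def depth_to_points(depth, fx, fy, px, py):
    # Each pixel (u, v) defines a ray; the depth value fixes the point on that ray.
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - px) / fx * depth
    y = (v - py) / fy * depth
    points = np.stack([x, y, depth], axis=-1)   # camera-frame coordinates, (h, w, 3)
    return points[depth > 0]                    # drop pixels with no depth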
Representations: Point cloud
• 3D points
• Estimated surface normals
Building surfaces from point clouds
M. Kazhdan et al., Screened Poisson surface reconstruction, 2013
Poisson surface reconstruction
M. Kazhdan et al., Poisson surface reconstruction, 2006
Representations: Untextured mesh
• Triangular faces
Representations: Untextured mesh
• Triangular faces
• Shared vertices
Representations: Textured mesh
• Triangular faces
• Shared vertices
• Texture image
• Texture coordinates
Representations
• Voxels (3D pixels)
Collecting multiple views of a scene (world coordinates)
Robot coordinates
Building up a scene representation (SLAM)
http://pointclouds.org/documentation/tutorials/registration_api.php
Registering models of objects to point clouds
Aldoma et al., Point Cloud Library: Three-Dimensional Object Recognition and 6 DoF Pose Estimation, 2012
Combining point clouds
• Essentially two types of situation:
1. Correspondences between the two point clouds (i.e. pairs of points that correspond to the same point in space or in a scene/object) are known.
2. Correspondences between the two point clouds are not known.
• Type 2 situations are often solved by inferring correspondences and then applying type 1 solutions
Combining point clouds (known correspondences)
• Need to find a single rigid transformation (translation and rotation) that minimises distance between points in each pair
• 6 degrees of freedom
• Solution quality affected by measurement noise
• Optimal least-squares solution via Kabsch algorithm
Kabsch algorithm
• Optimal least-squares solution
• Two sets of points $P$ and $Q$ with known correspondences $\mathbf{p}_i \leftrightarrow \mathbf{q}_i$
• Solves $\min_{\mathbf{R},\mathbf{t}} \sum_i \lVert \mathbf{p}_i - (\mathbf{R}\mathbf{q}_i + \mathbf{t}) \rVert^2$
• Steps:
  1. Centre both sets: $\tilde{\mathbf{p}}_i = \mathbf{p}_i - E[\mathbf{p}]$, $\tilde{\mathbf{q}}_i = \mathbf{q}_i - E[\mathbf{q}]$
  2. Form the cross-covariance $\mathbf{A} = \tilde{\mathbf{Q}}^{T}\tilde{\mathbf{P}}$
  3. $\mathbf{R} = (\mathbf{A}^{T}\mathbf{A})^{1/2}\mathbf{A}^{-1}$
  4. $\mathbf{t} = E[\mathbf{p}] - \mathbf{R}\,E[\mathbf{q}]$
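A NumPy sketch of the Kabsch solution. It uses the equivalent SVD form rather than the matrix square root above, since the sign correction then also guards against a reflection being returned for degenerate or noisy data:

import numpy as np

def kabsch(p, q):
    # p, q: (n, 3) arrays of corresponding points; returns R, t with p_i ~= R q_i + t.
    p_mean, q_mean = p.mean(axis=0), q.mean(axis=0)
    a = (q - q_mean).T @ (p - p_mean)            # 3x3 cross-covariance of the centred sets
    u, _, vt = np.linalg.svd(a)
    d = np.sign(np.linalg.det(u @ vt))           # -1 would indicate a reflection
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T      # optimal rotation
    t = p_mean - r @ q_mean                      # optimal translation
    return r, t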
Combining point clouds (unknown correspondences)
• Still have measurement noise
• Point clouds may not fully overlap
• Point clouds may be of different density
• Must find correspondences between point clouds in this context
• Can develop 3D feature descriptors based on normal vectors and curvature, similarly to SIFT
• Will talk about a simpler algorithm which is still widely used: ICP
Iterative closest point algorithm
• Iterative closest point (ICP) algorithm:
1. Find closest point in reference point cloud to each point in input point cloud
2. Find single transformation that minimises error between reference and input pairs
3. Transform all input points using transformation
4. Find closest point in reference point cloud to each point in transformed input point cloud
5. Terminate if the pairs are unchanged, else go to step 2
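A bare-bones point-to-point ICP sketch following the steps above. It assumes SciPy's cKDTree for the closest-point search, a reasonable initial alignment, and largely overlapping clouds (no outlier rejection); names are illustrative:

import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(p, q):
    # Kabsch/SVD solution: R, t minimising sum_i ||p_i - (R q_i + t)||^2.
    p_mean, q_mean = p.mean(axis=0), q.mean(axis=0)
    u, _, vt = np.linalg.svd((q - q_mean).T @ (p - p_mean))
    d = np.sign(np.linalg.det(u @ vt))
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return r, p_mean - r @ q_mean

def icp(reference, source, n_iters=50, tol=1e-8):
    tree = cKDTree(reference)                    # for the closest-point queries
    current = source.copy()
    prev_error = np.inf
    for _ in range(n_iters):
        dists, idx = tree.query(current)         # step 1/4: closest reference point per input point
        r, t = best_rigid_transform(reference[idx], current)   # step 2
        current = current @ r.T + t              # step 3: transform all input points
        error = np.mean(dists ** 2)
        if abs(prev_error - error) < tol:        # step 5: stop once the pairs/error stop changing
            break
        prev_error = error
    return current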
ICP
[Figures: successive ICP iterations on an example pair of point clouds.]
ICP converges on correct answer
Iterative closest point algorithm
• What happens if not every point in input point cloud has a paired point in reference point cloud?
• If not every point in the input point cloud has a paired point in the reference point cloud then step 2 fails, unless there is outlier removal
• What happens if the initial poses are dissimilar?
ICP (different initialisation)
[Figures: successive ICP iterations starting from a different initial pose.]
ICP converges on incorrect answer
Improvements to ICP
• Pre-centring and alignment to axes of maximum variation
• Point-to-plane distance measure
• Random restarts
• Consensus subsets
• Point resampling
• Point consistency
• Auxiliary correspondences
Point-to-plane distance measure
• Estimate surface normal using nearby points
• Allows movement in the two dimensions tangent to the surface without changing the penalty
• More computationally expensive, but converges faster
• Solve $\min_{\mathbf{R},\mathbf{t}} \sum_i \big(\mathbf{n}_i \cdot (\mathbf{p}_i - (\mathbf{R}\mathbf{q}_i + \mathbf{t}))\big)^2$
K.-L. Low, Linear least-squares optimization for point-to-plane ICP surface registration, 2004
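A sketch of a single linearised point-to-plane update in the spirit of Low's formulation, assuming already-matched point pairs, unit normals at the reference points, and small rotation angles (names are illustrative):

import numpy as np

def point_to_plane_step(p, q, n):
    # p: (m, 3) reference points, q: (m, 3) matched source points,
    # n: (m, 3) unit normals at the reference points.
    # Linearise R with small angles and solve the 6-parameter least-squares problem.
    a = np.hstack([np.cross(q, n), n])            # rows [q_i x n_i, n_i]
    b = np.einsum('ij,ij->i', n, p - q)           # n_i . (p_i - q_i)
    x = np.linalg.lstsq(a, b, rcond=None)[0]      # [alpha, beta, gamma, tx, ty, tz]
    alpha, beta, gamma = x[:3]
    r = np.array([[1.0, -gamma, beta],            # R ~ I + skew([alpha, beta, gamma])
                  [gamma, 1.0, -alpha],
                  [-beta, alpha, 1.0]])
    return r, x[3:]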
Consensus subsets
• Useful for refinement when there are points with no corresponding point
• Remove points too far from their correspondence point when calculating new transformation
• Alternatively can look at local distance distribution
Point consistency and auxiliary correspondences
• Can use other information channels in performing outlier rejection:
• Local surface normals
• Colours
• Can also use auxiliary information to generate initial auxiliary correspondences, e.g. SIFT
Further reading 1
• Photometric stereo:
  • D. Forsyth and J. Ponce, Computer Vision – A Modern Approach, Prentice Hall, 2002, section 2.5
• Structure from motion:
  • R. Szeliski, Computer Vision – Algorithms and Applications, Springer, 2010, chapter 7
  • D. Forsyth and J. Ponce, Computer Vision – A Modern Approach, Prentice Hall, 2002, chapter 14
• Depth from focus:
  • R. Szeliski, Computer Vision – Algorithms and Applications, Springer, 2010, subsection 12.1.3
  • Y. Schechner and N. Kiryati, Depth from Defocus vs. Stereo: How Different Really Are They?, International Journal of Computer Vision, 2000
• Active 3D imaging in general:
  • R. Szeliski, Computer Vision – Algorithms and Applications, Springer, 2010, section 12.2
  • D. Forsyth and J. Ponce, Computer Vision – A Modern Approach, Prentice Hall, 2002, section 23.1
Further reading 2
• Lidar:
  • https://news.voyage.auto/an-introduction-to-lidar-the-key-self-driving-car-sensor-a7e405590cff
• Structured light imaging:
  • J. Geng, Structured-light 3D surface imaging: a tutorial, Advances in Optics and Photonics, 2011
• Time of flight cameras (including lidar):
  • R. Horaud et al., An Overview of Depth Cameras and Range Scanners Based on Time-of-Flight Technologies, https://hal.inria.fr/hal-01325045
• 3D data representations:
  • R. Szeliski, Computer Vision – Algorithms and Applications, Springer, 2010, section 12.3
• Kabsch algorithm:
  • W. Kabsch, A solution for the best rotation to relate two sets of vectors, Acta Crystallographica Section A, 1976
• Iterative closest point:
  • R. Szeliski, Computer Vision – Algorithms and Applications, Springer, 2010, subsection 12.2.1
  • D. Forsyth and J. Ponce, Computer Vision – A Modern Approach, Prentice Hall, 2002, subsection 23.3.1
  • P. Besl and N. McKay, A method for registration of 3-D shapes, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1992