Deep Learning for 3D Vision
Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu
Spatial transformer networks
Why do we need spatial transformer networks?
CS231n: Convolutional Neural Networks for Visual Recognition (Stanford)
Are Convolutional Neural Networks invariant to…
Scale? No
Rotation? No
Translation? Partially
A. W. Harley, “An Interactive Node-Link Visualization of Convolutional Neural Networks,” in ISVC, pages 867-877, 2015
Intuition behind Spatial transformers
Sampling!
Formulating spatial transformers
Three main differentiable blocks:
Localisation network
Grid generator
Sampler
Grid generator: Examples
Affine transform
Attention model
coordinates in the target (output) feature map
coordinates in the source (input) feature map
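The affine case of the grid generator maps each target (output) coordinate back to a source (input) coordinate; the pointwise transform from the paper can be written as:

```latex
\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix}
= \mathcal{T}_\theta(G_i)
= A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}
= \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\
                  \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix}
  \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}
```

The six parameters θ are regressed by the localisation network; the attention model is the special case of an isotropic scale plus translation.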
Sampler: mathematical formulation
Generic sampling kernel
From the grid generator
All pixels in the output feature map
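The bilinear instance of the generic sampling kernel can be sketched in NumPy as follows (single-channel feature map, coordinates in pixel units; an illustrative sketch, not the paper's implementation):

```python
import numpy as np

def bilinear_sample(U, xs, ys):
    """Sample feature map U (H x W) at continuous source coords (xs, ys).

    Implements the bilinear kernel from the STN formulation:
    V_i = sum_{n,m} U[n, m] * max(0, 1-|xs_i - m|) * max(0, 1-|ys_i - n|)
    The kernel is (sub-)differentiable w.r.t. both U and the coordinates.
    """
    H, W = U.shape
    out = np.zeros(len(xs))
    for i, (x, y) in enumerate(zip(xs, ys)):
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        for n in (y0, y0 + 1):          # only the 4 nearest pixels contribute
            for m in (x0, x0 + 1):
                if 0 <= n < H and 0 <= m < W:
                    out[i] += U[n, m] * max(0, 1 - abs(x - m)) * max(0, 1 - abs(y - n))
    return out
```

Looping over only the four nearest pixels is equivalent to the double sum over all (n, m), since the kernel weight vanishes outside that neighborhood.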
Experiment: Distorted MNIST
Distortions: Rotation, Translation, Projective, Elastic
Transformations: Affine, Projective, Thin Plate Spline (TPS)
Other experiments:
Applications of spatial transformers
Street View House Numbers
Fine-grained classification
Deep Learning for 3D Vision
Our world is 3D
Broad applications of 3D data
Robotics
Augmented Reality
Autonomous Driving
Medical Image Processing
3D Understanding Enables Interactions
[SIGGRAPH Asia 2016]
Example: 3D understanding for a robot
Shape
Graspable
Mass
Mobility
AI Perspective of 3D Understanding
See the world (Sensory)
Understand the world (Cognition)
Transform the world (Action)
Towards interaction with the physical world, 3D is the key!
Traditional 3D Vision
Multi-view Geometry: Physics based
3D Learning: Knowledge Based
Acquire Knowledge of 3D World by Learning
3D Learning Tasks
3D Analysis
Classification
Segmentation (object/scene)
Correspondence
3D Learning Tasks
3D Synthesis
Monocular 3D reconstruction
Shape completion
Shape modeling
3D Learning Tasks
3D-based Knowledge Transportation
3D Learning Tasks
Intuitive Physics based on 3D Understanding
Deep Learning on 3D: A New Rising Field
3D understanding sits at the intersection of Computer Vision, Computer Graphics, Robotics, Cognitive Science, Machine Learning, Artificial Intelligence, and Mathematics (Differential Geometry, Topological Analysis, Functional Analysis).
Outline
Overview of 3D Deep Learning
3D Deep Learning Algorithms
The Representation Issue of 3D Deep Learning
Images: Unique representation with regular data structure
3D has many representations:
multi-view RGB(D) images
volumetric
polygonal mesh
point cloud
primitive-based models
Example task: novel view image synthesis
The Representation Issue of 3D Deep Learning
Cartesian Product Space of “Task” and “Representation”
3D geometry analysis
3D synthesis
Fundamental Challenges of 3D Deep Learning
Convolution needs an underlying structure. Can we directly apply a CNN to 3D data?
Rasterized vs Geometric
Rasterized form (regular grids)
Can directly apply CNN
But has other challenges
Fundamental Challenges of 3D Deep Learning
Rasterized form (regular grids): can directly apply CNN
Geometric form (irregular): cannot directly apply CNN
3D Deep Learning Algorithms (by Representations)
Multi-view (projection-based): [Su et al. 2015], [Kalogerakis et al. 2016], …
Volumetric: [Maturana et al. 2015], [Wu et al. 2015] (GAN), [Qi et al. 2016], [Liu et al. 2016], [Wang et al. 2017] (O-CNN), [Tatarchenko et al. 2017] (OGN), …
Mesh (graph CNN): [Defferrard et al. 2016], [Henaff et al. 2015], [Yi et al. 2017] (SyncSpecCNN), …
Point cloud: [Qi et al. 2017] (PointNet), [Fan et al. 2017] (PointSetGen)
Part assembly: [Tulsiani et al. 2017], [Li et al. 2017] (GRASS)
Deep Learning on Multi-view Representation
Multi-view Representation as 3D Input
Leverage the huge CNN literature in image analysis
Multi-view Representation as 3D Input
Classification pipeline: each rendered view passes through a shared CNN1; a view-pooling layer aggregates features across views; CNN2, a second ConvNet, produces the shape descriptor, followed by a softmax classifier.
Hang Su, Subhransu Maji, Evangelos Kalogerakis, Erik Learned-Miller, “Multi-view Convolutional Neural Networks for 3D Shape Recognition”, Proceedings of ICCV 2015
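The view-pooling step can be sketched in NumPy as an element-wise max over per-view CNN1 features (an illustrative sketch; MVCNN applies this inside the network, not as a post-hoc step):

```python
import numpy as np

def view_pool(view_features):
    """MVCNN-style view pooling: element-wise max over per-view features.

    view_features: array of shape (n_views, feat_dim), the CNN1 output
    for each rendered view. Returns a single descriptor that is invariant
    to the order in which the views are presented.
    """
    return np.max(view_features, axis=0)
```

Because max is symmetric, rendering the shape from the same viewpoints in a different order yields the same pooled descriptor, which CNN2 then maps to class scores.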
Multi-view Representation as 3D Output
The Novel-view Synthesis Problem
Fully Convolutional Network (FCN)
Segmentation:
Learning Deconvolution Network for Semantic Segmentation
Direct Novel-view Synthesis
Maxim Tatarchenko, Alexey Dosovitskiy, Thomas Brox,
“Multi-view 3D Models from Single Images with a Convolutional Network”,
ECCV2016
Results are often Blurry
[Figure: the novel-view feature is synthesized as a weighted combination of observed-view image features, e.g. 0.1 · f₁ + 0.4 · f₂ + 0.3 · f₃ + …]
Su et al, 3D-Assisted Image Feature Synthesis for Novel Views of an Object, ECCV 2016
Idea 2: Explore Cross-View Relationship
Single-view network architecture:
Zhou et al, View Synthesis by Appearance Flow, ECCV 2016
Combine both ideas
First, apply flow prediction
Second, conduct invisible part hallucination
Park et al, Transformation-Grounded Image Generation Network for Novel 3D View Synthesis, CVPR 2017
Deep Learning on Volumetric Representation
Popular 3D volumetric data
fMRI
Manufacturing (finite-element analysis)
Geology
CT
Volumetric Representation as 3D Input
The main hurdle is Complexity
The Sparsity Characteristic of 3D Data
[Chart: occupancy ratio vs. resolution (32, 64, 128): occupied voxels become increasingly sparse as resolution grows]
Li et al., FPNN: Field Probing Neural Networks for 3D Data, NIPS 2016
Solution: Octree based CNN (O-CNN)
Octree
Convolution on Octree
Neighborhood searching: Hash table
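The hash-table neighborhood search can be sketched as follows (an illustrative Python sketch of the idea of indexing only occupied voxels, not the O-CNN implementation):

```python
import numpy as np

def build_hash(occupied):
    """Hash table mapping voxel coordinate -> index, for sparse occupancy.

    occupied: (M, 3) int array of occupied voxel coordinates. Octree-based
    CNNs use such a table to find a voxel's neighbors for convolution
    without storing the full dense grid.
    """
    return {tuple(v): i for i, v in enumerate(occupied)}

def neighbors(table, voxel):
    """Indices of the 6 face-adjacent occupied neighbors, if present."""
    x, y, z = voxel
    offs = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    return [table[(x + dx, y + dy, z + dz)] for dx, dy, dz in offs
            if (x + dx, y + dy, z + dz) in table]
```

Memory and lookup cost then scale with the number of occupied voxels rather than with the cube of the resolution.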
[Comparison: octree vs. full voxel grid]
Gernot Riegler, Ali Osman Ulusoy, Andreas Geiger
“OctNet: Learning Deep 3D Representations at High Resolutions”
CVPR2017
Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, Xin Tong
“O-CNN: Octree-based Convolutional Neural Network for Understanding 3D Shapes”
SIGGRAPH2017
Volumetric Representation as 3D Output
The main hurdle is still Complexity
A Straight-forward Implementation
Choi et al. ECCV 2016
Towards Higher Spatial Resolution
Maxim Tatarchenko, Alexey Dosovitskiy, Thomas Brox
“Octree Generating Networks: Efficient Convolutional Architectures for High-resolution 3D Outputs”
arXiv (March 2017)
Progressive Voxel Refinement
Deep Learning on Polygonal Meshes
Mesh as 3D Input
Deep Learning on Graphs
Geometry-aware Convolution can be Important
Convolution along spatial coordinates vs. convolution considering the underlying geometry
image credit: D. Boscaini, et al.
Meshes can be represented as graphs
3D shape graph
social network
molecules
How to define convolution kernel on graphs?
from Shuman et al. 2013
Desired properties:
locally supported (w.r.t. the graph metric)
allowing weight sharing across different coordinates
Issues of Geodesic CNN
The local charting method relies on a fast marching-like procedure requiring a triangular mesh.
The radius of the geodesic patches must be sufficiently small to acquire a topological disk.
No effective pooling, purely relying on convolutions to increase receptive field.
Spectral construction: Spectral CNN
Fourier analysis
Convert convolution to multiplication in spectral domain
Bases on meshes: eigenfunctions of the Laplace-Beltrami operator
Synchronization of functional space across meshes
Functional map
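The conversion of convolution to spectral multiplication can be summarized as follows (a standard graph/manifold formulation; Φ denotes the eigenbasis of the graph Laplacian or Laplace-Beltrami operator, and ĝ the learned filter coefficients in the spectral domain):

```latex
L = \Phi \Lambda \Phi^{\top}, \qquad
x \ast_{\mathcal{G}} g \;=\; \Phi \,\mathrm{diag}(\hat{g})\, \Phi^{\top} x
```

Because the eigenbasis differs from mesh to mesh, a spectral filter learned on one shape does not directly transfer to another; this is the synchronization problem that SyncSpecCNN addresses via functional maps.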
Li Yi, Hao Su, Xingwen Guo, Leonidas Guibas
“SyncSpecCNN: Synchronized Spectral CNN for 3D Shape Segmentation”
CVPR2017 (spotlight)
Deep Learning
on Point Cloud Representation
Point Cloud: the Most Common Sensor Output
Figure from the recent VoxelNet paper from Apple.
Point Cloud as 3D Input
Deep Learning on Sets (orderless)
Properties of a desired neural network on point clouds
Point cloud: N orderless points, each represented by a D-dim coordinate (stored as an N × D array)
Hao Su*, Charles Qi*, Kaichun Mo, Leonidas Guibas
“PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation”
CVPR2017 (oral)
An N × D array and any row permutation of it represent the same point set.
Properties of a desired neural network on point clouds
Permutation invariance: f(x_{π(1)}, x_{π(2)}, …, x_{π(n)}) = f(x_1, x_2, …, x_n) for any permutation π, where each x_i ∈ ℝ^D
Examples:
f(x_1, x_2, …, x_n) = max{x_1, x_2, …, x_n}
f(x_1, x_2, …, x_n) = x_1 + x_2 + … + x_n
Construct symmetric function family
Observe: f(x_1, x_2, …, x_n) = γ ∘ g(h(x_1), …, h(x_n)) is symmetric if g is symmetric
Example point set: (1,2,3), (1,1,1), (2,3,2), (2,3,4)
h: per-point feature transform; g: simple symmetric function (e.g. max pooling); γ: further processing
PointNet (vanilla): f(x_1, x_2, …, x_n) = γ ∘ g(h(x_1), …, h(x_n)), symmetric if g is symmetric
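A minimal NumPy sketch of this construction (single linear layers stand in for the MLPs h and γ; all names and sizes here are illustrative, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, K = 3, 64, 16                # input dim, hidden width, output dim
W_h = rng.normal(size=(D, H))      # shared per-point transform h (one layer here)
W_g = rng.normal(size=(H, K))      # post-pooling transform gamma (one layer here)

def pointnet_vanilla(points):
    """f(x_1..x_n) = gamma(g(h(x_1), ..., h(x_n))) with g = max pooling.

    points: (N, D) array. h is applied with shared weights to every point;
    the column-wise max makes f invariant to the ordering of the rows.
    """
    h = np.maximum(points @ W_h, 0)        # shared h on every point: (N, H)
    g = h.max(axis=0)                      # symmetric max pooling: (H,)
    return np.maximum(g @ W_g, 0)          # gamma on the pooled feature: (K,)
```

Feeding the same points in any order produces the identical output, since only the symmetric max pooling mixes information across points.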
Q: What symmetric functions can be constructed by PointNet?
A: Universal approximation to continuous symmetric functions
Theorem: A Hausdorff continuous symmetric function f : 2^X → ℝ (X ⊂ ℝ^d) can be arbitrarily approximated by PointNet (vanilla).
PointNet is Light-weight
[Bar chart: space complexity (#params), log scale from 1M to 100M, for MVCNN [Su et al. 2015] (multi-view), Subvolume [Su et al. 2016] (volumetric), VRN [Su et al. 2016] (volumetric), and PointNet [Su et al. 2017] (point cloud)]
Saves 95% GPU memory
Robustness to data corruption
Segmentation from partial scans
Visualize what is learned by reconstruction
Salient points are discovered!
PointNet v2.0: Multi-Scale PointNet
N points in (x,y)
N1 points in (x,y,f)
N2 points in (x,y,f’)
Larger receptive field in higher layers
Fewer points in higher layers (more scalable)
Weight sharing
Translation invariance (local coordinates in local regions)
Charles Qi, Hao Su, Li Yi, Leonidas Guibas
“PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space”
NIPS 2017
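The hierarchical grouping in PointNet++ needs well-spread centroids for its local regions; a common choice is iterative farthest point sampling, sketched here in NumPy (an illustrative implementation, not the authors' code):

```python
import numpy as np

def farthest_point_sample(points, k):
    """Iterative farthest point sampling.

    points: (N, D) array; returns indices of k well-spread points.
    Each step picks the point farthest from everything chosen so far,
    giving better coverage than uniform random sampling.
    """
    chosen = [0]                               # start from an arbitrary point
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(dist))             # farthest from the chosen set
        chosen.append(idx)
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(chosen)
```

A local PointNet is then applied to the neighborhood of each sampled centroid, and the process repeats on the pooled features at the next level.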
Fuse 2D and 3D:
Frustum PointNets for 3D Object Detection
+ Leveraging mature 2D detectors for region proposal and 3D search space reduction
+ Solving 3D detection problem with 3D data and 3D deep learning architectures
Our method ranks No. 1 on the KITTI 3D Object Detection Benchmark
We get 5% higher AP than Apple's recent CVPR submission, and more than 10% higher AP than the previous SOTA in the easy category
We are also 1st place for smaller objects (pedestrians and cyclists), winning by even bigger margins
Remarkable box estimation accuracy even with only a dozen points or a very partial point cloud
Point Cloud as 3D Output
Deep Learning to Generate Combinatorial Objects
Supervision from “Synthesize for Learning”
ShapeNet
Renderer
3D Representation: Point Cloud
Describe shape for the whole object
Usable as network output?
No prior work in the deep learning community!
3D Prediction by Point Clouds
Input Reconstructed 3D point cloud
Hao Su, Haoqiang Fan, Leonidas Guibas
“A Point Set Generation Network for 3D Object Reconstruction from a Single Image”
CVPR2017 (oral)
3D Prediction by Point Clouds
Input
Reconstructed 3D point cloud
Pipeline
CVPR ’17, Point Set Generation
[Pipeline: input image → deep network f → predicted point set; compared with a point set sampled from the ground-truth shape via a loss L defined on sets]
Loss function: Earth Mover’s Distance (EMD)
Given two sets of points, measure their discrepancy:
Differentiable
Admits fast computation
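A brute-force sketch of the EMD between two equal-size point sets (minimizing the total matching cost over all bijections; practical implementations use the Hungarian algorithm or the paper's approximation; names here are mine):

```python
import itertools
import numpy as np

def emd(S1, S2):
    """Earth Mover's Distance between two equal-size point sets.

    EMD(S1, S2) = min over bijections phi of sum_i ||x_i - phi(x_i)||_2.
    Enumerates all permutations, so it is only usable for tiny sets;
    it illustrates the definition rather than a scalable algorithm.
    """
    n = len(S1)
    cost = np.linalg.norm(S1[:, None, :] - S2[None, :, :], axis=2)  # pairwise distances
    return min(sum(cost[i, p[i]] for i in range(n))
               for p in itertools.permutations(range(n)))
```

Because the optimal assignment is unique almost everywhere, the distance is differentiable almost everywhere in the predicted point positions, which is what makes it usable as a training loss.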
Generalization to Unseen Categories
[Figure: inputs, observed views, and reconstructions for categories outside the training set]
Deep Learning on Primitives
Describe Shapes by Primitives
What are parts? Reusable substructures!
A Structure Mining Problem
By DL, also a Meta-Learning Problem
Primitive-based Assembly
Shubham Tulsiani, Hao Su, Leonidas Guibas, Alexei A. Efros, Jitendra Malik, "Learning Shape Abstractions by Assembling Volumetric Primitives", CVPR 2017
Approach
We predict primitive parameters: size, rotation, translation of M cuboids.
Variable number of parts? We predict “primitive existence probability”
Generative Models for Shapes by Reusing Primitives
Incremental Assembly-based modeling
“Transfer Learning” in the sense of reusing prior knowledge
Primitive Space from ShapeNet Parts
Markov Modeling Process
Part assembly:
Markov process – Incrementally assemble parts.
Sung et al, ComplementMe: Weakly-Supervised Component Suggestions for 3D Modeling SIGGRAPH Asia 2017
New part proposal by network
[Architecture: a proposal network suggests the next component from a component embedding space; a placement network positions it within the partial assembly to produce the output]
Automatic Shape Synthesis