Deep Learning for 3D Vision

Spatial transformer networks
Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu

Why do we need
Spatial transformer networks?

CS231n: Convolutional Neural Networks for Visual Recognition (Stanford)

Are Convolutional Neural Networks invariant to…
Scale? No
Rotation? No
Translation? Partially
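The "Partially" answer for translation can be seen in a minimal 1D sketch. The signal, kernel, and helper names are illustrative assumptions, not from the slides. Convolution is translation-equivariant, so a shifted input gives a shifted feature map. Stride-2 max pooling only absorbs shifts that are multiples of its stride, and only a global max is fully invariant.

```python
def conv1d_valid(x, k):
    """'Valid' 1D cross-correlation, which deep-learning libraries call convolution."""
    n = len(k)
    return [sum(x[i + j] * k[j] for j in range(n)) for i in range(len(x) - n + 1)]


def max_pool1d(x, size=2, stride=2):
    """Local max pooling over windows of `size` taken every `stride` samples."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, stride)]


def shift_right(x, s):
    """Shift a signal right by s samples, zero-filling on the left (length preserved)."""
    return [0.0] * s + x[:len(x) - s]


kernel = [1.0, -1.0]  # hypothetical edge-detector kernel
x = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

y = conv1d_valid(x, kernel)
y1 = conv1d_valid(shift_right(x, 1), kernel)  # input shifted by 1 sample
y2 = conv1d_valid(shift_right(x, 2), kernel)  # input shifted by the pooling stride

print(y1[1:] == y[:-1])                          # equivariance: the feature map shifts too
print(max_pool1d(y), max_pool1d(y1))             # shift by 1: pooled features change
print(max_pool1d(y2)[1:] == max_pool1d(y)[:-1])  # shift by stride: pooled map just shifts
print(max(y) == max(y1) == max(y2))              # global max pooling: invariant
```

The equivariance holds here only because the pattern stays away from the borders; zero padding and clipping at the edges break it.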

Why do we need
Spatial transformer networks?

A. W. Harley, “An Interactive Node-Link Visualization of Convolutional Neural Networks,” in ISVC, pages 867-877, 2015

Intuition behind Spatial transformers

Sampling!

Formulating spatial transformers
Three main differentiable blocks:
Localisation network
Grid generator
Sampler

Grid generator: Examples

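Affine transform:
$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$$
Attention model (isotropic scale plus translation):
$$A_\theta = \begin{bmatrix} s & 0 & t_x \\ 0 & s & t_y \end{bmatrix}$$
where $(x_i^t, y_i^t)$ are coordinates in the target (output) feature map and $(x_i^s, y_i^s)$ are coordinates in the source (input) feature map.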
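The grid generator can be sketched in a few lines. This assumes normalized coordinates in [-1, 1] for both feature maps (as in the STN paper) and a plain tuple for A_θ. `affine_grid` and `attention_theta` are hypothetical names, not a library API.

```python
def linspace(n):
    """n evenly spaced normalized coordinates in [-1, 1]."""
    return [-1.0 + 2.0 * i / (n - 1) for i in range(n)] if n > 1 else [0.0]


def affine_grid(theta, out_h, out_w):
    """Grid generator: for every target pixel (x_t, y_t) of an out_h x out_w output,
    return the source location (x_s, y_s) = A_theta @ (x_t, y_t, 1)."""
    (a11, a12, a13), (a21, a22, a23) = theta
    return [[(a11 * xt + a12 * yt + a13, a21 * xt + a22 * yt + a23)
             for xt in linspace(out_w)]
            for yt in linspace(out_h)]


def attention_theta(s, tx, ty):
    """Attention model: isotropic scale s plus translation (tx, ty)."""
    return ((s, 0.0, tx), (0.0, s, ty))


identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
grid = affine_grid(identity, 3, 3)                        # source == target coordinates
zoom = affine_grid(attention_theta(0.5, 0.5, 0.0), 3, 3)  # 2x zoom onto a window
```

With s = 0.5 and (t_x, t_y) = (0.5, 0), the output samples x_s in [0, 1] and y_s in [-0.5, 0.5], i.e. a 2x zoom onto one window of the input.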

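Sampler:
Mathematical formulation
$$V_i^c = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \, k(x_i^s - m; \Phi_x) \, k(y_i^s - n; \Phi_y) \quad \forall i \in [1 \ldots H'W'],\ \forall c \in [1 \ldots C]$$
where $k(\cdot\,; \Phi)$ is a generic sampling kernel, $(x_i^s, y_i^s)$ come from the grid generator, and $i$ runs over all pixels in the output feature map. With the bilinear kernel this becomes
$$V_i^c = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \max(0, 1 - |x_i^s - m|) \max(0, 1 - |y_i^s - n|)$$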
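The bilinear sampler can be written almost literally as the double sum V_i = Σ_n Σ_m U_nm k(x_i^s - m) k(y_i^s - n) with k(d) = max(0, 1 - |d|). This is a single-channel sketch under stated assumptions: normalized [-1, 1] source coordinates mapped to pixel centres, and zero weight for locations outside the input. Because the kernel is piecewise linear, (sub-)gradients with respect to both U and the sampling coordinates exist. That is what lets the localisation network be trained end to end.

```python
def bilinear_sample(U, grid):
    """Sampler: V_i = sum_n sum_m U[n][m] * k(x_i - m) * k(y_i - n) with the
    bilinear kernel k(d) = max(0, 1 - |d|).

    U    -- single-channel H x W input map (list of lists)
    grid -- rows of (x_s, y_s) source locations in normalized [-1, 1] coordinates
    Locations outside the map receive zero weight (implicit zero padding).
    """
    H, W = len(U), len(U[0])
    out = []
    for row in grid:
        out_row = []
        for x_s, y_s in row:
            # normalized -> pixel coordinates; -1 and +1 land on the border pixel centres
            x = (x_s + 1.0) * (W - 1) / 2.0
            y = (y_s + 1.0) * (H - 1) / 2.0
            v = 0.0
            for n in range(H):  # the double sum from the formula, written out literally
                ky = max(0.0, 1.0 - abs(y - n))
                if ky == 0.0:
                    continue  # kernel has compact support: skip rows with zero weight
                for m in range(W):
                    kx = max(0.0, 1.0 - abs(x - m))
                    v += U[n][m] * kx * ky
            out_row.append(v)
        out.append(out_row)
    return out


U = [[0.0, 1.0],
     [2.0, 3.0]]
# corner, centre (average of all four pixels), opposite corner, and a point outside the map
V = bilinear_sample(U, [[(-1.0, -1.0), (0.0, 0.0), (1.0, 1.0), (3.0, 3.0)]])
```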

Experiment: Distorted MNIST

Distortions: Rotation, Translation, Projective, Elastic
Transformations: Affine, Projective, Thin Plate Spline (TPS)

Other experiments:
Applications of spatial transformers
Street View House Numbers

Fine-grained classification

Deep Learning for 3D Vision

Our world is 3D

Broad applications of 3D data
Robotics
Augmented reality
Autonomous driving
Medical Image Processing

3D Understanding Enables Interactions
[SIGGRAPH Asia 2016]
Example: 3D understanding for a robot

Shape

Graspable

Mass

Mobility

AI Perspective of 3D Understanding
Sensory: see the world
Cognition: understand the world
Action: transform the world
Towards interaction with the physical world, 3D is the key!

Traditional 3D Vision

Multi-view Geometry: Physics-based

3D Learning: Knowledge-based

Acquire Knowledge of 3D World by Learning

3D Learning Tasks
3D Analysis

Classification

Segmentation (object/scene)

Correspondence

3D Learning Tasks
3D Synthesis
Monocular 3D reconstruction

Shape completion

Shape modeling

3D Learning Tasks

3D-based Knowledge Transportation

3D Learning Tasks
Intuitive Physics based on 3D Understanding

Deep Learning on 3D: A New Rising Field

3D Understanding
Computer Vision
Computer Graphics
Robotics
Cognitive Science
Machine Learning
Differential Geometry
Topological Analysis
Functional Analysis
Artificial Intelligence
Mathematics

Outline
Overview of 3D Deep Learning

3D Deep Learning Algorithms

The Representation Issue of 3D Deep Learning

Images: unique representation with regular data structure

3D has many representations:
multi-view RGB(D) images
volumetric
polygonal mesh
point cloud
primitive-based models
The Representation Issue of 3D Deep Learning

Cartesian Product Space of “Task” and “Representation”
3D geometry analysis

3D synthesis

Fundamental Challenges of 3D Deep Learning
Convolution needs an underlying structure. Can we directly apply CNNs to 3D data?

Rasterized vs. Geometric
Rasterized form (regular grids): multi-view RGB(D) images, volumetric
Can directly apply CNNs, but has other challenges

Geometric form (irregular): polygonal mesh, point cloud, primitive-based models
Cannot directly apply CNNs

3D Deep Learning Algorithms (by Representations)
Multi-view (projection-based): [Su et al. 2015] [Kalogerakis et al. 2016]
Volumetric: [Maturana et al. 2015] [Wu et al. 2015] (GAN) [Qi et al. 2016] [Liu et al. 2016] [Wang et al. 2017] (O-CNN) [Tatarchenko et al. 2017] (OGN)
Point cloud: [Qi et al. 2017] (PointNet) [Fan et al. 2017] (PointSetGen)
Mesh (Graph CNN): [Defferrard et al. 2016] [Henaff et al. 2015] [Yi et al. 2017] (SyncSpecCNN)
Part assembly: [Tulsiani et al. 2017] [Li et al. 2017] (GRASS)


Deep Learning on Multi-view Representation

Multi-view Representation as 3D Input
Leverage the huge CNN literature in image analysis

Multi-view Representation as 3D Input
Classification

Rendered views → CNN1 (applied to each view) → View pooling → CNN2: a second ConvNet producing shape descriptors → softmax
Hang Su, Subhransu Maji, Evangelos Kalogerakis, Erik Learned-Miller, “Multi-view Convolutional Neural Networks for 3D Shape Recognition”, Proceedings of ICCV 2015
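View pooling combines the per-view CNN1 features into one shape feature. The sketch below assumes element-wise max pooling across views and uses made-up CNN1 outputs in place of a real network. Since max ignores view order, the resulting descriptor does not depend on how the views are enumerated.

```python
def view_pooling(view_features):
    """Element-wise max over views: V feature vectors of length C -> one length-C vector."""
    if not view_features:
        raise ValueError("need at least one view")
    return [max(channel) for channel in zip(*view_features)]


# hypothetical CNN1 outputs for 3 rendered views of one shape (C = 4 channels)
views = [
    [0.1, 0.9, 0.0, 0.3],
    [0.7, 0.2, 0.0, 0.3],
    [0.4, 0.5, 0.6, 0.1],
]
shape_feature = view_pooling(views)  # this pooled vector would then be fed to CNN2
```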

Multi-view Representation as 3D Output
The Novel-view Synthesis Problem

Fully Convolutional Network (FCN)

Segmentation: Learning Deconvolution Network for Semantic Segmentation

Direct Novel-view Synthesis

Maxim Tatarchenko, Alexey Dosovitskiy, Thomas Brox,
“Multi-view 3D Models from Single Images with a Convolutional Network”,
ECCV 2016

Results are often blurry

[Figure: a novel view feature synthesized from the observed view image as a weighted sum of features, with weights such as 0.1, 0.4, 0.3, …]
Su et al., 3D-Assisted Image Feature Synthesis for Novel Views of an Object, ECCV 2016
Idea 2: Explore Cross-View Relationship

Single-view network architecture:
Zhou et al., View Synthesis by Appearance Flow, ECCV 2016

Combine both ideas

First, apply flow prediction
Second, conduct invisible part hallucination
Park et al., Transformation-Grounded Image Generation Network for Novel 3D View Synthesis, CVPR 2017

Deep Learning on Volumetric Representation

Popular 3D volumetric data

fMRI
Manufacturing (finite-element analysis)

Geology

CT

Volumetric Representation as 3D Input
The main hurdle is complexity

The Sparsity Characteristic of 3D Data

[Figure: occupancy of 3D shapes at voxel resolutions 32, 64, and 128]

Li et al., FPNN: Field Probing Neural Networks for 3D Data, NIPS 2016
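To see why sparsity matters, the sketch below voxelizes points sampled on the surface of a unit sphere and reports the fraction of occupied voxels. The sphere is a stand-in for a real shape, and the Fibonacci-lattice sampler and all names are assumptions, not taken from FPNN. A surface touches roughly res² cells of a res³ grid, so the occupied fraction shrinks roughly as 1/res while the dense grid's cost grows as res³.

```python
import math


def sphere_surface_points(n):
    """n roughly uniform points on the unit sphere (Fibonacci lattice), a stand-in shape."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n
        r = math.sqrt(1.0 - z * z)
        phi = golden_angle * i
        points.append((r * math.cos(phi), r * math.sin(phi), z))
    return points


def occupancy(points, res):
    """Fraction of cells in a res^3 grid over [-1, 1]^3 that contain at least one point."""
    occupied = set()
    for p in points:
        occupied.add(tuple(min(res - 1, int((c + 1.0) / 2.0 * res)) for c in p))
    return len(occupied) / res ** 3


pts = sphere_surface_points(100_000)
occ = {res: occupancy(pts, res) for res in (32, 64, 128)}
for res, frac in occ.items():
    print(f"resolution {res}: {frac:.2%} of voxels occupied")
```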

Solution: Octree based CNN (O-CNN)

Octree

Convolution on Octree
Neighborhood searching: Hash table

[Figure: octree vs. full voxel grid]
Gernot Riegler, Ali Osman Ulusoy, Andreas Geiger
“OctNet: Learning Deep 3D Representations at High Resolutions”
CVPR 2017
Pengshuai Wang, Yang Liu, Yuxiao Guo, Chunyu Sun, Xin Tong
“O-CNN: Octree-based Convolutional Neural Network for Understanding 3D Shapes”
SIGGRAPH 2017
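Neighborhood searching with a hash table can be sketched as follows. Each octree level stores only its non-empty octants in a dict keyed by a Morton (z-order) code, and a 3x3x3 convolution gathers its neighbours with 27 lookups instead of scanning a dense grid. This is an illustration in the spirit of the slide: the key layout and all names are assumptions, and O-CNN's actual data structures are not reproduced here.

```python
def morton3(x, y, z, bits):
    """Interleave the low `bits` bits of x, y, z into one z-order (Morton) key."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (3 * b + 2)
        key |= ((y >> b) & 1) << (3 * b + 1)
        key |= ((z >> b) & 1) << (3 * b)
    return key


def build_levels(voxels, depth):
    """For each depth d = 0..depth, build a hash table {Morton key: octant index}
    holding only the non-empty octants. `voxels` are integer cells of a 2^depth grid."""
    levels = []
    for d in range(depth + 1):
        shift = depth - d  # coarser levels drop the low bits of each coordinate
        keys = sorted({morton3(x >> shift, y >> shift, z >> shift, d) for x, y, z in voxels})
        levels.append({k: i for i, k in enumerate(keys)})
    return levels


def neighbours(levels, d, x, y, z):
    """Indices of the 27 octants around (x, y, z) at depth d via hash lookups;
    None marks an empty or out-of-bounds neighbour."""
    table, n = levels[d], 1 << d
    out = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                cx, cy, cz = x + dx, y + dy, z + dz
                inside = 0 <= cx < n and 0 <= cy < n and 0 <= cz < n
                out.append(table.get(morton3(cx, cy, cz, d)) if inside else None)
    return out


voxels = [(0, 0, 0), (1, 0, 0), (7, 7, 7)]  # a sparse 8^3 grid (depth 3)
levels = build_levels(voxels, 3)
nb = neighbours(levels, 3, 0, 0, 0)  # 27 entries; only 2 are non-empty
```

Memory here grows with the number of occupied octants rather than with the full res³ grid, which is the point of the octree.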

Volumetric Representation as 3D Output
The main hurdle is still complexity

A Straightforward Implementation

Choy et al. ECCV 2016

Towards Higher Spatial Resolution

Maxim Tatarchenko, Alexey Dosovitskiy, Thomas Brox
“Octree Generating Networks: Efficient Convolutional Architectures for High-resolution 3D Outputs”
arXiv (March 2017)

Progressive Voxel Refinement


Deep Learning on Polygonal Meshes

Mesh as 3D Input
Deep Learning on Graphs

Geometry-aware Convolution can be Important

Convolution along spatial coordinates vs. convolution considering the underlying geometry
(image credit: D. Boscaini et al.)

Meshes can be represented as graphs

3D shape graph
social network
molecules

How to define convolution kernel on graphs?

from Shuman et al. 2013
Desired properties:

locally supported (w.r.t. the graph metric)
allowing weight sharing across different coordinates
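Both desired properties can be met by a very simple spatial construction. The mean-aggregation layer below is a generic example chosen for illustration, not the geodesic or spectral construction the following slides discuss, and its name and weights are hypothetical. Locality means changing a far-away node leaves a node's output untouched. Weight sharing means one pair of scalars serves every node, whatever its degree.

```python
def graph_conv(features, adjacency, w_self, w_nbr):
    """One hypothetical graph-convolution layer on scalar node features:

        h_i = w_self * x_i + w_nbr * mean_{j in N(i)} x_j

    Locally supported: node i only reads its 1-hop neighbours.
    Weight sharing: the same (w_self, w_nbr) is used at every node, whatever its
    degree and however its neighbours are ordered.
    """
    out = []
    for i, x_i in enumerate(features):
        nbrs = adjacency[i]
        agg = sum(features[j] for j in nbrs) / len(nbrs) if nbrs else 0.0
        out.append(w_self * x_i + w_nbr * agg)
    return out


# path graph 0 - 1 - 2 - 3 as adjacency lists
adjacency = [[1], [0, 2], [1, 3], [2]]
out = graph_conv([1.0, 2.0, 3.0, 4.0], adjacency, w_self=1.0, w_nbr=0.5)
far = graph_conv([1.0, 2.0, 3.0, 100.0], adjacency, w_self=1.0, w_nbr=0.5)  # change node 3 only
```

Nodes 0 and 1 are more than one hop from node 3, so their outputs are identical in `out` and `far`; stacking layers is what grows the receptive field.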

Issues of Geodesic CNN
The local charting method relies on a fast marching-like procedure requiring a triangular mesh.

The radius of the geodesic patches must be sufficiently small to acquire a topological disk.

No effective pooling; relies purely on convolutions to increase the receptive field.

Spectral construction: Spectral CNN
Fourier analysis
Convert convolution to multiplication in the spectral domain

Bases on meshes: eigenfunctions of the Laplace-Beltrami operator

Synchronization of functional space across meshes

Functional map
Li Yi, Hao Su, Xingwen Guo, Leonidas Guibas
“SyncSpecCNN: Synchronized Spectral CNN for 3D Shape Segmentation”
CVPR 2017 (spotlight)
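The spectral idea is easiest to check on the simplest graph, an n-cycle. Its Laplacian L = D - A is diagonalized by the discrete Fourier basis, so filtering in the spectral domain (transform, multiply coefficient-wise, transform back) reproduces an ordinary circular convolution on the vertices. This is a sketch under that simplifying assumption. On a mesh the Fourier basis would be replaced by Laplace-Beltrami eigenfunctions, which this code does not compute.

```python
import cmath


def dft(x):
    """Coefficients of x in the Fourier basis, the Laplacian eigenbasis of a cycle graph."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)) for k in range(n)]


def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n for t in range(n)]


def cycle_laplacian(x):
    """Apply L = D - A of the n-cycle graph to a node signal x."""
    n = len(x)
    return [2 * x[t] - x[(t - 1) % n] - x[(t + 1) % n] for t in range(n)]


def circular_conv(x, w):
    """Direct (vertex-domain) convolution on the cycle."""
    n = len(x)
    return [sum(w[s] * x[(t - s) % n] for s in range(n)) for t in range(n)]


def spectral_conv(x, w):
    """Same filter applied in the spectral domain: transform, multiply, transform back."""
    X, Wk = dft(x), dft(w)
    return [v.real for v in idft([a * b for a, b in zip(X, Wk)])]


x = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0]
w = [1.0, -1.0, 0.0, 0.0, 0.0, 0.0]
direct = circular_conv(x, w)
spectral = spectral_conv(x, w)

# a Fourier mode (k = 1) is a Laplacian eigenvector with eigenvalue 2 - 2 cos(2 pi / n)
v = [cmath.exp(2j * cmath.pi * t / len(x)) for t in range(len(x))]
lam = 2 - 2 * cmath.cos(2 * cmath.pi / len(x))
```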

Deep Learning on Point Cloud Representation

Point Cloud: the Most Common Sensor Output

Figure from the recent VoxelNet paper from Apple.

Point Cloud as 3D Input
Deep Learning on Sets (orderless)

Properties of a desired neural network on point clouds

Point cloud: N orderless points, each represented by a D-dim coordinate (2D array representation of size N × D)
Hao Su*, Charles Qi*, Kaichun Mo, Leonidas Guibas
“PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation”
CVPR 2017 (oral)


An N × D array and any row permutation of it represent the same set.
Permutation invariance:
$$f(x_1, x_2, \ldots, x_n) \equiv f(x_{\pi_1}, x_{\pi_2}, \ldots, x_{\pi_n}), \quad x_i \in \mathbb{R}^D$$
Examples:
$$f(x_1, x_2, \ldots, x_n) = \max\{x_1, x_2, \ldots, x_n\}$$
$$f(x_1, x_2, \ldots, x_n) = x_1 + x_2 + \cdots + x_n$$

Construct symmetric function family
Observe:
$$f(x_1, x_2, \ldots, x_n) = \gamma \circ g(h(x_1), \ldots, h(x_n))$$
is symmetric if $g$ is symmetric.

PointNet (vanilla): $h$ is applied to every point, $g$ is a simple symmetric function, and $\gamma$ maps the aggregated feature to the output.
[Figure: example input points (1,2,3), (1,1,1), (2,3,2), (2,3,4)]
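A vanilla PointNet of the form γ ∘ g(h(x_1), …, h(x_n)) can be sketched in plain Python. Here h is a shared two-layer per-point MLP, g is a channel-wise max, and γ is a single linear layer. Layer sizes, the random initialization, and the class name are hypothetical, and the weights are untrained, so the outputs only demonstrate the permutation invariance.

```python
import random


def relu(v):
    return [max(0.0, a) for a in v]


def linear(W, b, x):
    """Affine map W @ x + b on plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]


def init_layer(out_dim, in_dim, rng):
    W = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(out_dim)]
    return W, b


class VanillaPointNet:
    """f(x_1, ..., x_n) = gamma(max_i h(x_i)) with random, untrained weights."""

    def __init__(self, d_in=3, d_hidden=16, d_out=4, seed=0):
        rng = random.Random(seed)
        self.h1 = init_layer(d_hidden, d_in, rng)
        self.h2 = init_layer(d_hidden, d_hidden, rng)
        self.gamma = init_layer(d_out, d_hidden, rng)

    def h(self, x):
        """Per-point MLP, shared by every point."""
        return relu(linear(*self.h2, relu(linear(*self.h1, x))))

    def __call__(self, points):
        per_point = [self.h(p) for p in points]
        pooled = [max(col) for col in zip(*per_point)]  # g: channel-wise max, a symmetric function
        return linear(*self.gamma, pooled)              # gamma: applied after aggregation


net = VanillaPointNet()
cloud = [(1, 2, 3), (1, 1, 1), (2, 3, 2), (2, 3, 4)]  # the example points from the slide
shuffled = cloud[:]
random.Random(1).shuffle(shuffled)
```

Because the max sees the same multiset of per-point values in any order, `net(cloud)` and `net(shuffled)` are exactly equal, not just approximately.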

Q: What symmetric functions can be constructed by PointNet?

A: Universal approximation to continuous symmetric functions
Theorem: A Hausdorff continuous symmetric function $f: 2^{\mathcal{X}} \rightarrow \mathbb{R}$ can be arbitrarily approximated by PointNet (vanilla), for point sets $S \subseteq \mathbb{R}^d$.

PointNet is Lightweight
[Figure: space complexity (#params, 1M to 100M) of MVCNN [Su et al. 2015] (multi-view), Subvolume [Qi et al. 2016] and VRN [Brock et al. 2016] (volumetric), and PointNet [Qi et al. 2017] (point cloud)]
Saves 95% GPU memory

Robustness to data corruption

Segmentation from partial scans

Visualize what is learned by reconstruction

Salient points are discovered!

PointNet v2.0: Multi-Scale PointNet
N points in (x, y) → N1 points in (x, y, f) → N2 points in (x, y, f′)

Larger receptive field in higher layers
Fewer points in higher layers (more scalable)
Weight sharing
Translation invariance (local coordinates in local regions)
Charles Qi, Hao Su, Li Yi, Leonidas Guibas
“PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space”
NIPS 2017
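The hierarchy above relies on choosing fewer centroids per layer and grouping points around them in local coordinates. The sketch below shows farthest point sampling, a radius (ball) query, and the local-coordinate grouping. The per-group PointNet that would follow is omitted, and all function names and parameters are hypothetical.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Greedily pick k well-spread centroids: each new pick is the point
    farthest from all points picked so far."""
    chosen = [start]
    dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < min(k, len(points)):
        i = max(range(len(points)), key=dist.__getitem__)
        chosen.append(i)
        dist = [min(d, math.dist(p, points[i])) for d, p in zip(dist, points)]
    return chosen


def ball_query(points, center, radius, max_n):
    """Indices of up to max_n points within `radius` of `center`."""
    return [i for i, p in enumerate(points) if math.dist(p, center) <= radius][:max_n]


def group_local(points, centroid_ids, radius, max_n):
    """Grouping step: each group expressed in coordinates local to its centroid,
    which is what makes the next (shared) PointNet translation-invariant."""
    groups = []
    for c in centroid_ids:
        center = points[c]
        groups.append([tuple(a - b for a, b in zip(points[i], center))
                       for i in ball_query(points, center, radius, max_n)])
    return groups


pts = [(0, 0), (1, 0), (2, 0), (10, 0)]
centroids = farthest_point_sampling(pts, 3)
groups = group_local(pts, centroids, radius=1.5, max_n=8)
```

Translating the whole cloud leaves `groups` unchanged, since every group is re-centred on its own centroid.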

Fuse 2D and 3D:
Frustum PointNets for 3D Object Detection

+ Leveraging mature 2D detectors for region proposal and 3D search space reduction
+ Solving 3D detection problem with 3D data and 3D deep learning architectures
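Reducing the 3D search space with a 2D proposal amounts to keeping the points whose image projection falls inside the 2D box, i.e. the points inside the viewing frustum. This is a minimal sketch assuming points already in camera coordinates (Z forward), a distortion-free pinhole camera with hypothetical intrinsics fx, fy, cx, cy, and a box given as (u_min, v_min, u_max, v_max). Any further normalization of the frustum done by the method is not shown.

```python
def frustum_points(points, box, fx, fy, cx, cy):
    """Keep camera-frame points (X, Y, Z) whose pinhole projection lies inside the
    2D box (u_min, v_min, u_max, v_max). Assumes no lens distortion."""
    u0, v0, u1, v1 = box
    kept = []
    for X, Y, Z in points:
        if Z <= 0:
            continue  # behind the camera: cannot project into the image
        u = fx * X / Z + cx
        v = fy * Y / Z + cy
        if u0 <= u <= u1 and v0 <= v <= v1:
            kept.append((X, Y, Z))
    return kept


cloud = [(0.0, 0.0, 10.0), (0.0, 0.0, -5.0), (5.0, 0.0, 10.0), (0.5, 0.5, 10.0)]
inside = frustum_points(cloud, (40, 40, 60, 60), fx=100.0, fy=100.0, cx=50.0, cy=50.0)
```

The frustum keeps points at every depth along the box's viewing rays, so the later 3D stages still have to separate the object from background points inside it.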

Our method ranks No. 1 on KITTI 3D Object Detection Benchmark

We get 5% higher AP than Apple’s recent CVPR submission
and more than 10% higher AP than previous SOTA in easy category

We are also 1st place for smaller objects (ped. and cyclist) winning with even bigger margins.

Remarkable box estimation accuracy even with a dozen points or a very partial point cloud

Point Cloud as 3D Output
Deep Learning to Generate Combinatorial Objects

Supervision from “Synthesize for Learning”

ShapeNet
Renderer

3D Representation: Point Cloud

Describes the shape of the whole object

Usable as network output?

No prior work in the deep learning community!

3D Prediction by Point Clouds

[Figure: input image → reconstructed 3D point cloud]

Hao Su, Haoqiang Fan, Leonidas Guibas
“A Point Set Generation Network for 3D Object Reconstruction from a Single Image”
CVPR 2017 (oral)

Pipeline (CVPR ’17, Point Set Generation)
[Figure: a deep network (f) produces the prediction; a loss on sets (L) compares it with a point set sampled from the ground-truth shape]

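Loss function: Earth Mover’s Distance (EMD)
Given two sets of points $S_1, S_2 \subseteq \mathbb{R}^3$ of equal size, measure their discrepancy:
$$d_{EMD}(S_1, S_2) = \min_{\phi: S_1 \rightarrow S_2} \sum_{x \in S_1} \| x - \phi(x) \|_2$$
where $\phi$ is a bijection. Requirements on the loss:
Differentiable
Admits fast computation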
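The EMD between two equal-size point sets is the cost of the best one-to-one matching. The brute-force version below enumerates every bijection, so it only illustrates the definition on tiny sets. Its factorial cost is exactly why the "fast computation" requirement matters and why practical training code relies on approximate solvers instead. The function name is hypothetical.

```python
import itertools
import math


def emd(s1, s2):
    """Exact Earth Mover's Distance between two equal-size point sets:
    min over bijections phi of sum_x ||x - phi(x)||_2, by brute force.
    Factorial cost, so only usable on tiny sets."""
    if len(s1) != len(s2):
        raise ValueError("EMD as defined here needs sets of equal size")
    return min(
        sum(math.dist(a, s2[j]) for a, j in zip(s1, perm))
        for perm in itertools.permutations(range(len(s2)))
    )


a = [(0.0, 0.0), (1.0, 0.0)]
b = [(1.0, 1.0), (0.0, 1.0)]  # same two points moved up by 1, listed in a different order
```

Because the minimum runs over all matchings, the loss treats its inputs as sets: listing `b` in another order gives the same value.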

Generalization to Unseen Categories

[Figure: reconstructions from an input observed view, including categories out of training]

Deep Learning on Primitives

Describe Shapes by Primitives
What are parts? Reusable substructures!
A Structure Mining Problem
By DL, also a Meta-Learning Problem

Primitive-based Assembly

Shubham Tulsiani, Hao Su, Leonidas Guibas, Alexei A. Efros, Jitendra Malik Learning Shape Abstractions by Assembling Volumetric Primitives CVPR 2017

Approach
We predict primitive parameters: size, rotation, translation of M cuboids.

Variable number of parts? We predict “primitive existence probability”
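Predicting a fixed number M of cuboids plus an existence probability for each turns a fixed-size network output into a variable-size assembly. In the sketch below, the field layout (size, quaternion rotation, translation) and all names are hypothetical parameterizations. At test time one keeps cuboids whose probability exceeds a threshold. During training one could instead sample existence as independent Bernoulli variables, which is the setting where a sampling-based gradient estimator such as REINFORCE would be needed.

```python
import random
from dataclasses import dataclass


@dataclass
class Cuboid:
    """Hypothetical per-primitive prediction: shape, pose, and existence probability."""
    size: tuple         # (sx, sy, sz)
    rotation: tuple     # unit quaternion (w, x, y, z); hypothetical parameterization
    translation: tuple  # (tx, ty, tz)
    p_exist: float      # predicted probability that this primitive is used


def assemble(predictions, rng=None, threshold=0.5):
    """Turn M predicted cuboids into a variable-size assembly.

    rng=None: keep the likely primitives (p_exist > threshold).
    Otherwise: sample each primitive's existence as an independent Bernoulli variable.
    """
    if rng is None:
        return [c for c in predictions if c.p_exist > threshold]
    return [c for c in predictions if rng.random() < c.p_exist]


M = [
    Cuboid((0.5, 0.1, 0.5), (1.0, 0.0, 0.0, 0.0), (0.0, 0.4, 0.0), 0.95),  # e.g. a seat
    Cuboid((0.1, 0.4, 0.1), (1.0, 0.0, 0.0, 0.0), (0.4, 0.0, 0.4), 0.80),  # e.g. a leg
    Cuboid((0.2, 0.2, 0.2), (1.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0), 0.05),  # likely unused
]
parts = assemble(M)                           # deterministic: keeps the first two
sampled = assemble(M, rng=random.Random(0))   # stochastic: a random subset of M
```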

Generative Models for Shapes by Reusing Primitives
Incremental Assembly-based modeling
“Transfer Learning” in the sense of reusing prior knowledge

Primitive Space from ShapeNet Parts

Markov Modeling Process

Part assembly:
Markov process – Incrementally assemble parts.
Sung et al., ComplementMe: Weakly-Supervised Component Suggestions for 3D Modeling, SIGGRAPH Asia 2017

[Figure: from a partial assembly, the proposal network proposes a new part from the component embedding space, the placement network positions it, giving the output]

Automatic Shape Synthesis
