[06-30213][06-30241][06-25024]
Computer Vision and Imaging &
Robot Vision
Dr Hyung Jin Chang (h.j.chang@bham.ac.uk), Dr Yixing Gao (y.gao.8@bham.ac.uk)
School of Computer Science
DEEP LEARNING I
Discriminative classifiers
Nearest neighbor (10⁶ examples)
  Shakhnarovich, Viola, Darrell 2003; Berg, Berg, Malik 2005; …
Support Vector Machines
  Guyon, Vapnik; Heisele, Serre, Poggio 2001; …
Conditional Random Fields
  McCallum, Freitag, Pereira 2000; Kumar, Hebert 2003; …
Boosting
  Viola, Jones 2001; Torralba et al. 2004; Opelt et al. 2006; …
Neural networks
  LeCun, Bottou, Bengio, Haffner 1998; Rowley, Baluja, Kanade 1998; …
Slide adapted from Antonio Torralba
Traditional Image Categorization: Training phase
Training pipeline: Training Images → Image Features → Classifier Training (with Training Labels) → Trained Classifier
Traditional Image Categorization: Testing phase
Training pipeline: Training Images → Image Features → Classifier Training (with Training Labels) → Trained Classifier
Testing pipeline: Test Image → Image Features → Trained Classifier → Prediction (e.g. "Outdoor")
Features have been key…
Hand-crafted features:
SIFT [Lowe IJCV 04]
HOG [Dalal and Triggs CVPR 05]
SPM [Lazebnik et al. CVPR 06]
DPM [Felzenszwalb et al. PAMI 10]
Color Descriptor [Van De Sande et al. PAMI 10]
What about learning the features?
• Learn a feature hierarchy all the way from pixels to classifier
• Each layer extracts features from the output of previous layer
• Layers have (nearly) the same structure
• Train all layers jointly (“end-to-end”)
Image/Video Pixels → Layer 1 → Layer 2 → Layer 3 → Simple Classifier
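As a concrete illustration, a minimal PyTorch-style sketch (assumed library; the layer sizes and number of classes are illustrative, not from the slides) of such a hierarchy: a stack of layers with nearly the same structure, from pixels to a simple classifier, trained jointly end-to-end.

```python
import torch.nn as nn

# Illustrative feature hierarchy: each layer transforms the output of the
# previous one, and a simple (linear) classifier sits on top. All layers are
# trained jointly ("end-to-end") by backpropagating the classification loss.
model = nn.Sequential(
    nn.Flatten(),                             # pixels -> vector
    nn.Linear(32 * 32 * 3, 256), nn.ReLU(),   # layer 1
    nn.Linear(256, 128), nn.ReLU(),           # layer 2
    nn.Linear(128, 64), nn.ReLU(),            # layer 3
    nn.Linear(64, 10),                        # simple classifier
)
```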
Learning Feature Hierarchy
Goal: Learn useful higher-level features from images
Feature representation learned from the input data:
Pixels → 1st layer “Edges” → 2nd layer “Object parts” → 3rd layer “Objects”
Lee et al., ICML 2009; CACM 2011
Slide: Rob Fergus
Learning Feature Hierarchy
• Better performance
• Other domains (unclear how to hand-engineer):
  – Kinect
  – Video
  – Multi-spectral
• Feature computation time
  – Dozens of features now regularly used
  – Getting prohibitive for large datasets (10’s sec / image)
Slide: R. Fergus
“Shallow” vs. “Deep” architectures
Traditional recognition: “Shallow” architecture
Image/Video Pixels → Hand-designed feature extraction → Trainable classifier → Object Class
Deep learning: “Deep” architecture
Image/Video Pixels → Layer 1 → … → Layer N → Simple classifier → Object Class
Biological neuron and Perceptrons
A biological neuron
An artificial neuron (Perceptron) – a linear classifier
Simple, Complex, and Hyper-complex cells
David H. Hubel and Torsten Wiesel
Suggested a hierarchy of feature detectors in the visual cortex, with higher level features responding to patterns of activation in lower level cells, and propagating activation upwards to still higher level cells.
David Hubel’s Eye, Brain, and Vision
Hubel/Wiesel Architecture and Multi-layer Neural Network
Hubel and Wiesel’s architecture
Multi-layer Neural Network – A non-linear classifier
Neuron: Linear Perceptron
• Inputs are feature values
• Each feature has a weight
• Sum is the activation
• If the activation is:
  – Positive, output +1
  – Negative, output -1
Slide credit: Pieter Abbeel and Dan Klein
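A minimal NumPy sketch of this linear perceptron; the weight and feature values are made up for illustration.

```python
import numpy as np

def perceptron_output(weights, features):
    """Linear perceptron: weighted sum of feature values, thresholded at 0."""
    activation = np.dot(weights, features)   # sum_i w_i * f_i
    return +1 if activation > 0 else -1      # positive -> +1, negative -> -1

# Example: 3 feature values and their weights (illustrative numbers)
w = np.array([0.5, -1.0, 0.2])
f = np.array([1.0, 0.3, 2.0])
print(perceptron_output(w, f))   # +1, since 0.5 - 0.3 + 0.4 = 0.6 > 0
```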
Two-layer perceptron network
Slide credit: Pieter Abbeel and Dan Klein
Learning w
• Training examples
• Objective: a misclassification loss
• Procedure:
– Gradient descent / hill climbing
Slide credit: Pieter Abbeel and Dan Klein
Hill climbing
• Simple, general idea:
– Start wherever
– Repeat: move to the best neighboring state
– If no neighbors better than current, quit
– Neighbors = small perturbations of w
• What’s bad?
  – Optimal? No: it can get stuck in local optima.
Slide credit: Pieter Abbeel and Dan Klein
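A minimal sketch of the hill-climbing idea on a weight vector; the loss function, step size and number of neighbours are illustrative assumptions, not from the slides.

```python
import numpy as np

def hill_climb(loss, w, step=0.1, n_neighbors=20, max_iters=1000, rng=None):
    """Hill climbing on weights w: repeatedly move to the best neighbouring
    state (a small random perturbation of w); quit when no neighbour improves."""
    rng = rng or np.random.default_rng(0)
    for _ in range(max_iters):
        neighbours = [w + step * rng.standard_normal(w.shape)
                      for _ in range(n_neighbors)]
        best = min(neighbours, key=loss)
        if loss(best) >= loss(w):   # no neighbour is better -> quit
            return w                # may be a local optimum, not the global one
        w = best
    return w
```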
Two-layer perceptron network
Slide credit: Pieter Abbeel and Dan Klein
Two-layer neural network
Slide credit: Pieter Abbeel and Dan Klein
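A minimal NumPy sketch of a two-layer network's forward pass, assuming sigmoid units (the slides do not fix a particular non-linearity); the hidden non-linearity is what makes the overall classifier non-linear.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def two_layer_network(x, W1, W2):
    """Two-layer network: a hidden layer with a non-linearity, then an output unit."""
    h = sigmoid(W1 @ x)      # hidden activations
    return sigmoid(W2 @ h)   # output score in (0, 1)
```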
Neural network properties
• Theorem (Universal function approximators): a two-layer network with a sufficient number of neurons can approximate any continuous function to any desired accuracy
  [Cybenko, Approximation by Superpositions of a Sigmoidal Function, 1989]
• Practical considerations:
  – Can be seen as learning the features
  – Large number of neurons
    • Danger of overfitting
  – Hill-climbing procedure can get stuck in bad local optima
Slide credit: Pieter Abbeel and Dan Klein
Multi-layer Neural Network
• A non-linear classifier
• Training: find network weights w to minimize the error between true training labels and estimated labels
• Minimization can be done by gradient descent provided f is differentiable
• This training method is called back-propagation
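A schematic sketch of this training loop; `f` and `grad_f` are placeholder names for the network output and its gradient with respect to the weights (which back-propagation computes by applying the chain rule layer by layer), and a squared-error loss is assumed for illustration.

```python
def train(f, grad_f, w, xs, ys, lr=0.1, epochs=100):
    """Gradient descent on the weights w: minimise the squared error between
    the true labels y and the network outputs f(x; w). Requires f to be
    differentiable so that grad_f (the gradient of f w.r.t. w) exists."""
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = f(x, w) - y                  # estimated label vs. true label
            w = w - lr * err * grad_f(x, w)    # step against the error gradient
    return w
```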
Convolutional Neural Networks (CNN, ConvNet, DCN)
• CNN = a multi-layer neural network with
  – Local connectivity:
    • Neurons in a layer are only connected to a small region of the layer before it
  – Shared weight parameters across spatial positions:
    • Learning shift-invariant filter kernels
Image credit: A. Karpathy
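A minimal PyTorch sketch (assumed library; channel counts and image size are illustrative) showing both properties in a single convolutional layer.

```python
import torch
import torch.nn as nn

# One convolutional layer: each output neuron is connected only to a local
# 3x3 region of the input (local connectivity), and the same 3x3 filter
# weights are reused at every spatial position (weight sharing).
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

x = torch.randn(1, 3, 32, 32)      # a 32x32 RGB image (illustrative size)
y = conv(x)                        # -> 16 feature maps of size 32x32
print(y.shape)                     # torch.Size([1, 16, 32, 32])
```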
Neocognitron [Fukushima, Biological Cybernetics 1980]
Deformation-resistant recognition:
• S-cells (simple) – extract local features
• C-cells (complex) – allow for positional errors
LeNet [LeCun et al. 1998]
• Stack multiple stages of feature extractors
• Higher stages compute more global, more invariant features
• Classification layer at the end
Gradient-based learning applied to document recognition [LeCun, Bottou, Bengio, Haffner 1998]
LeNet-1 from 1993
Convolutional Neural Networks
Stage: Input Image → Convolution (Learned) → Non-linearity → Spatial pooling → Feature maps
What is a Convolution?
• A weighted moving sum over the input, producing a feature (activation) map
Why Convolution?
• Few parameters (filter weights)
• Dependencies are local
• Translation invariance
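A minimal NumPy sketch of convolution as a weighted moving sum (a "valid"-size output is assumed for simplicity).

```python
import numpy as np

def convolve2d(image, kernel):
    """Convolution as a weighted moving sum: slide the (flipped) kernel over
    the image and, at each position, sum the element-wise products."""
    kh, kw = kernel.shape
    k = kernel[::-1, ::-1]                 # flip for true convolution
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * k)
    return out                             # the feature (activation) map
```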
Convolutional Neural Networks
Stage: Input Image → Convolution (Learned) → Non-linearity → Spatial pooling → Feature maps
Convolutional Neural Networks
Stage: Input Image → Convolution (Learned) → Non-linearity → Spatial pooling → Feature maps
Non-linearity example: Rectified Linear Unit (ReLU)
Slide credit: S. Lazebnik
Non-Linearity
• Per-element (independent)
• Options:
  – Tanh
  – Sigmoid: 1/(1+exp(-x))
  – Rectified linear unit (ReLU)
    • Makes learning faster
    • Simplifies backpropagation
    • Avoids saturation issues
    → Preferred option
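A small NumPy sketch of the three options; each is applied per element, independently of the other elements.

```python
import numpy as np

def tanh(x):     return np.tanh(x)
def sigmoid(x):  return 1.0 / (1.0 + np.exp(-x))
def relu(x):     return np.maximum(0.0, x)   # does not saturate for x > 0

x = np.array([-2.0, 0.0, 2.0])
print(tanh(x), sigmoid(x), relu(x))          # applied element-wise
```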
Convolutional Neural Networks
Stage: Input Image → Convolution (Learned) → Non-linearity → Spatial pooling → Normalization → Feature maps
Max pooling
• Max-pooling: a non-linear down-sampling
• Provides translation invariance
Spatial Pooling
• Average or max
• Non-overlapping / overlapping regions
• Role of pooling:
  – Invariance to small transformations
  – Larger receptive fields (see more of input)
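A minimal NumPy sketch of non-overlapping spatial pooling; the block size is illustrative.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping spatial pooling: each size x size block of the feature
    map is reduced to one value (its max or its average), down-sampling the
    map and giving invariance to small translations."""
    h, w = x.shape[0] // size, x.shape[1] // size
    blocks = x[:h*size, :w*size].reshape(h, size, w, size)
    reduce = np.max if mode == "max" else np.mean
    return reduce(reduce(blocks, axis=3), axis=1)
```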
Engineered vs. learned features
Convolutional filters are trained in a supervised manner by back-propagating classification error
Learned: Image → Convolution/pool → Convolution/pool → Convolution/pool → Convolution/pool → Convolution/pool → Dense → Dense → Dense → Label
Engineered: Image → Feature extraction → Pooling → Classifier → Label
Compare: SIFT Descriptor [Lowe IJCV 2004]
Image Pixels → Apply oriented filters → Spatial pool (Sum) → Normalize to unit length → Feature Vector
Previous ConvNet successes
• Handwritten text/digits
  – MNIST (0.17% error [Ciresan et al. 2011])
  – Arabic & Chinese [Ciresan et al. 2012]
• Simpler recognition benchmarks
  – CIFAR-10 (9.3% error [Wan et al. 2013])
  – Traffic sign recognition
    • 0.56% error vs 1.16% for humans [Ciresan et al. 2011]
ImageNet Challenge 2012
Validation classification examples
[Deng et al. CVPR 2009]
• ~14 million labeled images, 20k classes
• Images gathered from Internet
• Human labels via Amazon Mechanical Turk
• ImageNet Challenge: 1.2 million training images, 1000 classes
A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012
AlexNet
Similar framework to LeCun’98 but:
• Bigger model (7 hidden layers, 650,000 units, 60,000,000 params)
• More data (10⁶ vs. 10³ images)
• GPU implementation (50x speedup over CPU)
• Trained on two GPUs for a week
A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012
AlexNet for image classification
AlexNet
“car”
Fixed input size: 224x224x3
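A short usage sketch, assuming a recent torchvision (the `models.alexnet` weights argument and the weight name are torchvision conventions, not from the slides): load a pretrained AlexNet and classify an image at the fixed 224x224x3 input size.

```python
import torch
from torchvision import models

# Load torchvision's pretrained AlexNet (variant of Krizhevsky et al. 2012).
alexnet = models.alexnet(weights="IMAGENET1K_V1")
alexnet.eval()

x = torch.randn(1, 3, 224, 224)    # one 224x224 RGB image (random here)
with torch.no_grad():
    scores = alexnet(x)            # 1000 ImageNet class scores
print(scores.argmax(dim=1))        # predicted class index, e.g. "car"
```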
ImageNet Classification Challenge
AlexNet
http://image-net.org/challenges/talks/2016/ILSVRC2016_10_09_clsloc.pdf
Industry Deployment
• Used in Facebook, Google, Microsoft
• Startups
• Image Recognition, Speech Recognition, …
• Fast at test time
Taigman et al. DeepFace: Closing the Gap to Human-Level Performance in Face Verification, CVPR’14
Visualizing CNNs
• What input pattern originally caused a given activation in the feature maps?
Visualizing and Understanding Convolutional Networks [Zeiler and Fergus, ECCV 2014]
(Visualizations shown for Layer 1, Layer 2, Layer 3, and Layers 4 and 5)
Beyond classification
• Detection
• Segmentation
• Regression
• Pose estimation
• Matching patches
• Synthesis
and many more…
R-CNN: Regions with CNN features
• Trained on ImageNet classification
• Fine-tune CNN on PASCAL
R-CNN [Girshick et al. CVPR 2014]
Labeling Pixels: Semantic Labels
Fully Convolutional Networks for Semantic Segmentation [Long et al. CVPR 2015]
Labeling Pixels: Edge Detection
DeepEdge: A Multi-Scale Bifurcated Deep Network for Top-Down Contour Detection [Bertasius et al. CVPR 2015]
CNN for Regression
DeepPose [Toshev and Szegedy CVPR 2014]
CNN as a Similarity Measure for Matching
Stereo matching [Zbontar and LeCun CVPR 2015]
Comparing patches [Zagoruyko and Komodakis 2015]
FaceNet [Schroff et al. 2015]
FlowNet [Fischer et al 2015]
Match ground and aerial images [Lin et al. CVPR 2015]
CNN for Image Generation
Learning to Generate Chairs with Convolutional Neural Networks [Dosovitskiy et al. CVPR 2015]
Chair Morphing
Learning to Generate Chairs with Convolutional Neural Networks [Dosovitskiy et al. CVPR 2015]
Fooling CNNs
Intriguing properties of neural networks [Szegedy ICLR 2014]
Transfer Learning
• Improvement of learning in a new task through the transfer of knowledge from a related task that has already been learned.
• Weight initialization for CNNs
Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks [Oquab et al. CVPR 2014]
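A minimal fine-tuning sketch of this idea, assuming a recent torchvision and an assumed 20-class target task: the pretrained weights initialise the network, the convolutional feature extractor is frozen, and only a new classifier head is trained on the new task.

```python
import torch.nn as nn
from torchvision import models

# Transfer learning sketch: start from ImageNet-pretrained weights, freeze the
# convolutional feature extractor, and retrain only a new classifier head.
model = models.alexnet(weights="IMAGENET1K_V1")
for p in model.features.parameters():
    p.requires_grad = False                      # keep pretrained conv filters
model.classifier[-1] = nn.Linear(4096, 20)       # new, randomly initialised head
# ...then train model.classifier on the new (e.g. 20-class) dataset as usual.
```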
Deep Learning Frameworks (2019)
(Framework logos shown, labelled by affiliation: UC Berkeley, NYU / Facebook, U Montreal, Baidu, Facebook, Facebook, Amazon, Google, Microsoft, Google)
+ MatConvNet etc…