CS7267 Machine Learning
AI Lecture:
Convolutional Neural Networks (CNN)
C.-C. Hung
Slides used in the classroom only
Outline
Pattern Recognition Concept
Basic Concept
Feature Extraction
Terminology
Challenges
Why deep learning?
What is CNN for deep learning?
Pattern Recognition/Classification
Assign an object or an event (pattern) to one of several known categories (or classes).
Category “A”
Category “B”
Classification vs Clustering
Category “A”
Category “B”
Classification (known categories): supervised classification.
Clustering (unknown categories): unsupervised classification.
What is a Pattern?
A pattern could be an object or event.
Typically, represented by a vector x of numbers.
biometric patterns
hand gesture patterns
What is a Pattern? (cont’d)
Loan/Credit card applications
Income, # of dependents, mortgage amount → credit-worthiness classification
Dating services
Age, hobbies, income → “desirability” classification
Web documents
Key-word based descriptions (e.g., documents containing “football”, “NFL”) → document classification
What is a Class ?
A collection of “similar” objects.
Female class
Male class
Main Objectives
Separate the data belonging to different classes.
Given new data, assign them to the correct category.
Gender Classification
Main Approaches
x: input vector (pattern)
ω: class label (class)
Generative
Model the joint probability p(x, ω).
Make predictions by using Bayes’ rule to calculate p(ω|x).
Pick the most likely class label ω.
Discriminative
No need to model p(x, ω).
Estimate p(ω|x) by “learning” a direct mapping from x to ω (i.e., estimate the decision boundary).
Pick the most likely class label ω.
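For the generative approach, Bayes’ rule turns the modeled joint probability into the posterior used for classification:

$p(\omega \mid \mathbf{x}) = \dfrac{p(\mathbf{x}, \omega)}{p(\mathbf{x})} = \dfrac{p(\mathbf{x} \mid \omega)\, p(\omega)}{\sum_{\omega'} p(\mathbf{x} \mid \omega')\, p(\omega')}$

The predicted class is then $\hat{\omega} = \arg\max_{\omega}\, p(\omega \mid \mathbf{x})$.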
How do we model p(x, ω)?
Typically with a statistical model: a probability density function (e.g., Gaussian).
Gender Classification
male
female
Key Challenges
Intra-class variability
Inter-class variability
Inter-class similarity: letters/numbers that look alike, e.g., M and W (if rotation must be considered).
Intra-class variability: the letter “T” in different typefaces.
Traditional pattern recognition
Traditional pattern recognition models use hand-crafted features and relatively simple trainable classifier.
This approach has the following limitations:
Developing hand-crafted features is tedious and costly.
Hand-crafted features are usually highly application-specific and cannot easily be transferred to other applications.
hand-crafted feature extractor
“Simple” Trainable Classifier
output
Traditional pattern recognition
What is the hand-crafted feature extractor?
hand-crafted feature extractor
“Simple” Trainable Classifier
output
Digital Image Processing/Machine Vision: feature extraction
Spatial Filtering
Use of spatial masks for image processing (spatial filters)
Linear and nonlinear filters
Low-pass filters eliminate or attenuate high-frequency components in the frequency domain (sharp image details), resulting in image blurring.
Spatial Filtering …
High-pass filters attenuate or eliminate low-frequency components, sharpening edges and other fine details.
Kernel Operator
Place the kernel h(α, β) on top of the image f(x, y) and compute

$g(x, y) = \sum_{\alpha} \sum_{\beta} f(x + \alpha,\, y + \beta)\, h(\alpha, \beta)$

where the 3×3 kernel is indexed as

$\begin{bmatrix} h(-1,-1) & h(0,-1) & h(1,-1) \\ h(-1,0) & h(0,0) & h(1,0) \\ h(-1,1) & h(0,1) & h(1,1) \end{bmatrix}$
Spatial Filtering …
The basic approach is to sum the products between the mask coefficients and the intensities of the pixels under the mask at a specific location in the image:

$R = w_1 z_1 + w_2 z_2 + \cdots + w_9 z_9$ (for a 3 × 3 filter)
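As a concrete illustration (not from the slides), a minimal Python/NumPy sketch of this kernel operation over the valid image region, applied with the 3×3 averaging mask from the smoothing slides on a toy image:

import numpy as np

def apply_kernel(f, h):
    # g(x, y) = sum over (a, b) of f(x + a, y + b) * h(a, b), valid region only
    kh, kw = h.shape
    g = np.zeros((f.shape[0] - kh + 1, f.shape[1] - kw + 1))
    for y in range(g.shape[0]):
        for x in range(g.shape[1]):
            g[y, x] = np.sum(f[y:y + kh, x:x + kw] * h)
    return g

f = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 image
h = np.full((3, 3), 1.0 / 9.0)                # 3x3 averaging (low-pass) mask
print(apply_kernel(f, h))                     # each output pixel = mean of its 3x3 neighborhood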
A spatial filter used in digital image processing
Spatial Filtering …
Non-linear filters also operate on pixel neighborhoods but do not explicitly use coefficients, e.g., noise reduction by computing the median gray-level value within the filter neighborhood.
Smoothing Filters …
Used for blurring (removal of small details prior to large object extraction, bridging small gaps in lines) and noise reduction.
Low-pass (smoothing) spatial filtering
Neighborhood averaging
– Results in image blurring
Image Enhancement in the Spatial Domain
Some Masks
Image Enhancement in the Spatial Domain
Smoothing with averaging masks of size N = 3, 5, 9, 15, and 35.
Low-pass filter
A spatial low-pass filter has the effect of passing, or leaving untouched, the low spatial frequency components of the image.
High-frequency components are attenuated and are virtually absent in the output image.
$\frac{1}{9} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}$
High-pass Filter
GRADIENT for edge detection
The gradient is a vector variable:
– its direction is the direction of maximum growth of the function;
– its magnitude is the rate of that growth;
– it is perpendicular to the edge direction.
GRADIENT
Image gradient
The gradient of an image: $\nabla f = \left[ \dfrac{\partial f}{\partial x},\; \dfrac{\partial f}{\partial y} \right]$
The gradient points in the direction of most rapid change in intensity.
The gradient direction is given by $\theta = \tan^{-1}\!\left( \dfrac{\partial f}{\partial y} \Big/ \dfrac{\partial f}{\partial x} \right)$ (how does this relate to the direction of the edge?)
The edge strength is given by the gradient magnitude $\|\nabla f\| = \sqrt{ \left( \dfrac{\partial f}{\partial x} \right)^2 + \left( \dfrac{\partial f}{\partial y} \right)^2 }$
GRADIENT AND EDGE VECTORS – EXAMPLE
(Figure: gradient vectors and the corresponding edge vectors.)
The discrete gradient
How can we differentiate a digital image f(x,y)?
Option 1: reconstruct a continuous image, then take gradient
Option 2: take discrete derivative (finite difference)
How would you implement this using the kernel operation?
How to calculate the gradient
Discrete approximation of the partial derivative
Filtering with kernel h
Response to sharp changes
1-D differentiation example
Filter h = [−1 1] for ∆x = 1:

$\dfrac{\partial f}{\partial x} = \lim_{\Delta x \to 0} \dfrac{f(x + \Delta x) - f(x)}{\Delta x}$

A more sensitive filter is the central difference h = ½ [−1 0 1]; recall:

$\dfrac{\partial f}{\partial x} = \lim_{\Delta x \to 0} \dfrac{f(x + \Delta x) - f(x - \Delta x)}{2 \Delta x}$
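A small Python/NumPy sketch (toy signal assumed) contrasting the two filters:

import numpy as np

f = np.array([0.0, 1.0, 4.0, 9.0, 16.0, 25.0])   # toy signal, f(x) = x^2

# forward difference, h = [-1 1]: g(x) = f(x + 1) - f(x)
forward = f[1:] - f[:-1]                          # [1, 3, 5, 7, 9]

# central difference, h = (1/2)[-1 0 1]: g(x) = (f(x + 1) - f(x - 1)) / 2
central = (f[2:] - f[:-2]) / 2.0                  # [2, 4, 6, 8], exact for x^2

print(forward, central)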
How to calculate the gradient …
To reduce noise sensitivity, take the difference horizontally and then average vertically:

$\dfrac{1}{6} \begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix}$
Sobel operator
Greater weight to the central pixels
Can be approximated as the derivative of a Gaussian.
First Gaussian smoothing, then differentiation.
Sobel operator
Mathematically, the operator uses two 3×3 kernels that are applied to the original image (the kernel operation above) to calculate approximations of the derivatives: one for horizontal changes and one for vertical.
If we define A as the source image, and Gx and Gy as two images which at each point contain the horizontal and vertical derivative approximations, the computations are as follows:

$G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix} * A \qquad G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ +1 & +2 & +1 \end{bmatrix} * A$
Sobel operator
where * denotes the two-dimensional kernel operation.
Sobel operator
The x-coordinate is here defined as increasing in the “right”-direction, and the y-coordinate is defined as increasing in the “down”-direction.
At each point in the image, the resulting gradient approximations can be combined to give the gradient magnitude $G = \sqrt{G_x^2 + G_y^2}$, commonly approximated as

$G \approx |G_x| + |G_y|$
Sobel operator
We can also calculate the gradient’s direction:

$\Theta = \arctan\left( |G_y| \,/\, |G_x| \right)$

where, for example, Θ is 0 for a vertical edge.
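A minimal Python/NumPy sketch of the whole Sobel computation on a made-up step-edge image; it uses atan2 for the direction to avoid division by zero (the slide’s |Gy|/|Gx| form is recovered by taking absolute values first):

import numpy as np

def apply_kernel(f, h):
    # valid-region kernel operation, as defined earlier
    kh, kw = h.shape
    g = np.zeros((f.shape[0] - kh + 1, f.shape[1] - kw + 1))
    for y in range(g.shape[0]):
        for x in range(g.shape[1]):
            g[y, x] = np.sum(f[y:y + kh, x:x + kw] * h)
    return g

sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
sobel_y = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)

A = np.zeros((6, 6)); A[:, 3:] = 1.0              # toy image with a vertical step edge
Gx = apply_kernel(A, sobel_x)
Gy = apply_kernel(A, sobel_y)

magnitude = np.sqrt(Gx**2 + Gy**2)                # edge strength G
approx = np.abs(Gx) + np.abs(Gy)                  # cheap approximation |Gx| + |Gy|
theta = np.arctan2(Gy, Gx)                        # 0 on the vertical edge, as expected
print(magnitude.max(), theta[2, 2])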
Sobel Edge Detection: Example
(Panels: Original, Horizontal Edges, Vertical Edges, Edge Strength.)
Note: the horizontal/vertical edge images are re-scaled for display (i.e., black corresponds to negative values, grey to zero, white to positive).
It is also possible to take the absolute value of the edges for display.
Sobel Edge Detection: Example
Sobel Edge Detector
The Sobel filter extracts all of the edges in an image, regardless of direction.
It is implemented as the sum of two directional edge-enhancement operators:

$\begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}$
Pattern/Template Matching: feature extraction
Grey-Level Image TM
When using a template-matching (TM) scheme on a grey-level image, it is unreasonable to expect a perfect match of the grey levels.
Instead of a yes/no match at each pixel, the difference in level should be used.
Template
Source Image
Matching Method
(Figure: the template image slides over the input image I(x, y) to produce the output image O(x, y).)
Correlation
The matching process moves the template image to all possible positions in a larger source image and computes a numerical index that indicates how well the template matches the image in that position.
Match is done on a pixel-by-pixel basis.
2D Pattern Matching – Example
Input: a 3×3 pattern over the alphabet {A, B}:
A B A
A B A
A A B
and a 7×7 image:
A A B A B A A
B A B A B A B
A A B A A B B
B A A B A A A
A B A B A A A
B B A A B A B
B B B A B A B
Output: {(1,4), (2,2), (4,3)}
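A brief Python/NumPy sketch of exhaustive 2-D pattern matching for the example above; the grids are copied from the slide, and matches are reported 1-indexed as (row, column) of the pattern’s top-left corner:

import numpy as np

image = np.array([list(row) for row in [
    "AABABAA", "BABABAB", "AABAABB", "BAABAAA",
    "ABABAAA", "BBAABAB", "BBBABAB"]])
pattern = np.array([list(row) for row in ["ABA", "ABA", "AAB"]])

ph, pw = pattern.shape
matches = [(r + 1, c + 1)                # 1-indexed (row, column)
           for r in range(image.shape[0] - ph + 1)
           for c in range(image.shape[1] - pw + 1)
           if np.array_equal(image[r:r + ph, c:c + pw], pattern)]
print(matches)                           # [(1, 4), (2, 2), (4, 3)]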
Euclidean Distance
Let I be a gray level image
and g be a gray-value template of size nm.
In this formula (r,c) denotes the top left corner of template g.
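A compact Python/NumPy sketch that evaluates this distance for every template placement (the arrays here are made up; the best match is where d is smallest):

import numpy as np

def ssd_map(I, g):
    # d(I, g, r, c) for every top-left corner (r, c), 0-indexed
    n, m = g.shape
    d = np.empty((I.shape[0] - n + 1, I.shape[1] - m + 1))
    for r in range(d.shape[0]):
        for c in range(d.shape[1]):
            d[r, c] = np.sum((I[r:r + n, c:c + m] - g) ** 2)
    return d

I = np.random.rand(8, 8)
g = I[2:5, 3:6].copy()                            # template cut out of the image itself
d = ssd_map(I, g)
print(np.unravel_index(d.argmin(), d.shape))      # (2, 3): perfect match, d = 0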
Correlation
Correlation is a measure of the degree to which two variables agree, not necessarily in actual value but in general behavior.
The two variables are the corresponding pixel values in two images, template and source.
Grey-Level Correlation Formula

$\mathrm{cor} = \dfrac{\sum_{i=0}^{N-1} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=0}^{N-1} (x_i - \bar{x})^2 \cdot \sum_{i=0}^{N-1} (y_i - \bar{y})^2}}$

x_i: pixels of the template gray-level image; x̄: average grey level of the template image.
y_i: pixels of the source-image section; ȳ: average grey level of the source-image section.
N: number of pixels in the section (N = template image size = columns × rows).
The value cor lies between −1 and +1,
with larger values representing a stronger relationship between the two images.
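A short Python/NumPy sketch of this formula for one template placement (toy patches; the values illustrate that the measure tracks behavior, not absolute grey levels):

import numpy as np

def correlation(template, section):
    # normalized grey-level correlation in [-1, 1] between equal-size patches
    x = template.astype(float).ravel()
    y = section.astype(float).ravel()
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2))

t = np.array([[10, 20], [30, 40]])
print(correlation(t, 2 * t + 5))   # 1.0: same behavior despite different values
print(correlation(t, -t))          # -1.0: perfectly anti-correlated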
Multiple Features
To improve recognition accuracy, we might need to use more than one feature.
A single feature might not yield the best performance.
Using combinations of features often yields better performance.
How Many Features?
Does adding more features always improve performance?
It might be difficult and computationally expensive to extract certain features.
Correlated features might not improve performance (i.e., redundancy).
“Curse” of dimensionality.
Curse of Dimensionality
Adding too many features can, paradoxically, lead to a worsening of performance.
Divide each input feature into a number of intervals, so that the value of a feature can be specified approximately by saying in which interval it lies.
If each input feature is divided into M divisions, the total number of cells is $M^d$ (d: number of features).
Since each cell must contain at least one point, the amount of training data needed grows exponentially with d.
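A quick Python illustration of the $M^d$ blow-up (M = 10 intervals chosen arbitrarily):

M = 10                                   # intervals per feature
for d in (1, 2, 3, 5, 10):
    print(f"d = {d:2d}: {M**d:>15,} cells, each needing at least one training point")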
Missing Features
Certain features might be missing (e.g., due to occlusion).
How should we train the classifier with missing features?
How should the classifier make the best decision with missing features?
Convolution vs. Correlation
Figure 3.11. Graphic illustration of convolution.
Figure 3.11 (cont.). Graphic illustration of convolution.
Figure 3.16. Graphic illustration of correlation.
Figure 3.16 (cont.). Graphic illustration of correlation.
Deep Learning
Deep learning (a.k.a. representation learning) seeks to learn rich hierarchical representations (i.e., features) automatically through a multi-stage feature-learning process.
Low-level features → Mid-level features → High-level features → Trainable classifier → output
Feature visualization of convolutional net trained on ImageNet
(Zeiler and Fergus, 2013)
Learning Hierarchical Representations
Hierarchy of representations with increasing level of abstraction. Each stage is a kind of trainable nonlinear feature transform.
Image recognition (from low-level to high-level features)
Pixel → edge → texton → motif → part → object
Text (from low-level to high-level features)
Character → word → word group → clause → sentence → story
Low-level features → Mid-level features → High-level features → Trainable classifier → output
(increasing level of abstraction)
The word ‘deep’ in deep learning refers to the layered model architectures which are usually deeper than conventional learning models.
Convolutional Neural Network (CNN)
Input can have very high dimension (a fully-connected ANN would need a very large number of parameters).
Inspired by the neurophysiological experiments conducted by [Hubel & Wiesel 1962], CNNs are a special type of neural network whose hidden units are connected only to a local receptive field, so the number of parameters needed by a CNN is much smaller.
CNN
Example: 200×200 image
fully connected: 40,000 hidden units => 1.6 billion parameters
CNN: 5×5 kernel, 100 feature maps => 2,500 parameters
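The two counts on this slide, checked in Python (weights only, biases ignored):

# fully connected: every input pixel connects to every hidden unit
inputs, hidden = 200 * 200, 40_000
print(inputs * hidden)       # 1,600,000,000 -> 1.6 billion parameters

# CNN: 100 feature maps, each sharing a single 5x5 kernel
print(5 * 5 * 100)           # 2,500 parameters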
Local Receptive Field
A layer in CNN
A general architecture of CNN
Convolutional Layer
Images from http://cs231n.stanford.edu/slides/2016/winter1516_lecture7.pdf
Convolutional Layer
Images from http://cs231n.stanford.edu/slides/2016/winter1516_lecture7.pdf
Convolutional Layer
Images from http://cs231n.stanford.edu/slides/2016/winter1516_lecture7.pdf
Convolutional Layer
Images from http://cs231n.stanford.edu/slides/2016/winter1516_lecture7.pdf
Multiple filters in digital image processing
Spots and Oriented Bars (Malik and Perona)
Gabor Filters
Gabor filters at different scales and spatial frequencies; the top row shows the anti-symmetric (odd) filters, the bottom row the symmetric (even) filters:

symmetric: $\cos(k_x x + k_y y)\, \exp\!\left( -\dfrac{x^2 + y^2}{2\sigma^2} \right)$
anti-symmetric: $\sin(k_x x + k_y y)\, \exp\!\left( -\dfrac{x^2 + y^2}{2\sigma^2} \right)$
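A Python/NumPy sketch generating one even/odd Gabor pair from the formulas above (the size, scale σ, and frequency (kx, ky) values are arbitrary choices):

import numpy as np

def gabor_pair(size=21, sigma=3.0, kx=0.8, ky=0.0):
    # even (cosine) and odd (sine) Gabor kernels on a size x size grid
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))   # Gaussian envelope
    even = np.cos(kx * x + ky * y) * envelope            # symmetric filter
    odd = np.sin(kx * x + ky * y) * envelope             # anti-symmetric filter
    return even, odd

even, odd = gabor_pair()
print(even.shape, odd[10, 10])   # (21, 21) and 0.0: the odd filter vanishes at the center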
Hyper-parameters on convolutional layer
Stride
Stride controls how the filter convolves around the input volume
The amount by which the filter shifts is the stride.
Padding size
Amount of zero-padding added around the border of the input volume.
For input width W, filter size F, stride S, and padding P, the output width is (W − F + 2P)/S + 1, as sketched below.
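A tiny Python helper for that formula; the example values are illustrative, with AlexNet’s first convolutional layer as a sanity check:

def conv_output_size(W, F, S, P):
    # output width for input width W, filter size F, stride S, zero-padding P
    assert (W - F + 2 * P) % S == 0, "filter does not tile the input evenly"
    return (W - F + 2 * P) // S + 1

print(conv_output_size(W=7, F=3, S=1, P=1))      # 7: padding of 1 preserves the size
print(conv_output_size(W=227, F=11, S=4, P=0))   # 55: AlexNet's first layer (55x55)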
Ref: http://cs231n.github.io/convolutional-networks/
Filters on convolutional layer
Example filters learned by Krizhevsky et al.; each of the 96 filters shown here is of size [11×11×3].
Ref: http://cs231n.github.io/convolutional-networks/
Convolutional network
Images from http://cs231n.stanford.edu/slides/2016/winter1516_lecture7.pdf
A Rectified Linear Unit/Function (ReLU)
ReLU on convolutional layer
Rectified Linear Unit (ReLU): f(x) = max(0, x)
An element-wise operation (applied per pixel).
Replaces all negative pixel values in the feature map by zero.
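In Python/NumPy, ReLU on a toy feature map is a one-liner:

import numpy as np

feature_map = np.array([[-3.0, 2.0], [0.5, -1.0]])
print(np.maximum(0.0, feature_map))   # [[0.0, 2.0], [0.5, 0.0]]: negatives zeroed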
Ref: https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
Pooling Layer
Progressively reduce the spatial size of the representation to reduce the number of parameters.
Max, Average, or L2-norm pooling
Max is the most common
Pooling
Common pooling operations:
Max pooling: reports the maximum output within a rectangular neighborhood.
Average pooling: reports the average output of a rectangular neighborhood (possibly weighted by the distance from the central pixel).
By spacing pooling regions k pixels apart (with k > 1 rather than 1), the next higher layer has roughly k times fewer inputs to process, i.e., downsampling.
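A minimal Python/NumPy sketch of 2×2 max pooling with stride k = 2 (non-overlapping regions, even-sized toy input):

import numpy as np

def max_pool_2x2(x):
    # 2x2 max pooling with stride 2; height and width assumed even
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 7],
              [8, 9, 4, 2],
              [3, 1, 0, 6]], dtype=float)
print(max_pool_2x2(x))   # [[6. 7.] [9. 6.]]: each output is the max of a 2x2 region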
CNN: LeNet-5
Images from http://cs231n.stanford.edu/slides/2016/winter1516_lecture7.pdf
CNN: LeNet-5
Convolutional network
Images from http://cs231n.stanford.edu/slides/2016/winter1516_lecture7.pdf
CNN
Images from http://cs231n.stanford.edu/slides/2016/winter1516_lecture7.pdf
Hierarchical feature representation
Ref: https://www.rsipvision.com/exploring-deep-learning/
A basic topology of CNN
CNN models
LeNet
The first successful CNN model (Yann LeCun, 1990s).
AlexNet
The first work that popularized CNNs (Alex Krizhevsky et al., 2012).
GoogLeNet
ILSVRC 2014 winner (Szegedy et al.).
VGGNet
Runner-up in ILSVRC 2014.
ResNet
Residual Network (Kaiming He et al.), winner of ILSVRC 2015.
Ref: http://cs231n.github.io/convolutional-networks/
AlexNet
AlexNet
C: convolutional layer, S: sampling layer, F: fully connected layer
References
Deep Learning by I. Goodfellow, Y. Bengio, and A. Courville (The MIT Press).
Chapter 9, Foundation of Deep Machine Learning in Neural Networks (PDF).
Questions & Suggestions?
The End
Appendix
Filtering
Study about filtering in image processing
http://www.coe.utah.edu/~cs4640/slides/Lecture5.pdf
https://web.eecs.umich.edu/~jjcorso/t/598F14/files/lecture_0924_filtering.pdf
Filtering example
Example: Gaussian filtering on a histogram
Filtering for Vertical/Horizontal Edges
% read_gray is a course-provided helper that loads an image as a grayscale matrix
gray = read_gray('data/hand20.bmp');
% Sobel-style kernel responding to horizontal intensity changes (vertical edges)
dx = [-1 0 1;
      -2 0 2;
      -1 0 1];
dx = dx / (sum(abs(dx(:))));  % normalize by the sum of absolute coefficients
dy = dx';                     % dy is the transpose of dx (horizontal edges)
% filter and keep the magnitude of the response
dxgray = abs(imfilter(gray, dx, 'symmetric', 'same'));
dygray = abs(imfilter(gray, dy, 'symmetric', 'same'));
A feature matching filter
Equations used in the slides:

Feature vector: $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$

3×3 filter response: $R = w_1 z_1 + w_2 z_2 + \cdots + w_9 z_9$

Gradient magnitude: $|\nabla f(x, y)|$, with components $\partial f / \partial x$ and $\partial f / \partial y$

Template distance: $d(I, g, r, c) = \sum_{i=1}^{n} \sum_{j=1}^{m} \left[ I(r + i, c + j) - g(i, j) \right]^2$

Grey-level correlation: $\mathrm{cor} = \dfrac{\sum_{i=0}^{N-1} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=0}^{N-1} (x_i - \bar{x})^2 \cdot \sum_{i=0}^{N-1} (y_i - \bar{y})^2}}$

Two-feature example: $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$, $x_1$: lightness, $x_2$: width

Gabor filters: symmetric $\cos(k_x x + k_y y)\, \exp\!\left( -\dfrac{x^2 + y^2}{2\sigma^2} \right)$, anti-symmetric $\sin(k_x x + k_y y)\, \exp\!\left( -\dfrac{x^2 + y^2}{2\sigma^2} \right)$
AlexNet layer parameters (C: convolutional, S: sampling, F: fully connected):

Layer        C1      S1      C2      S2      C3      C4      C5      S5     F6     F7     Output
Depth        96      96      256     256     384     384     256     256    4096   4096   1000
Dimension    55×55   27×27   27×27   13×13   13×13   13×13   13×13   6×6    6×6    1      1
Filter Size  11×11   3×3     5×5     3×3     3×3     3×3     3×3     3×3    1×1    1      1
Stride       4       2       1       2       1       1       1       2      1      1      1