
CS7267 Machine Learning

AI Lecture:
Convolutional Neural Networks (CNN)
C.-C. Hung
Slides used in the classroom only

Outline
Pattern Recognition Concept
Basic Concept
Feature Extraction
Terminology
Challenges
Why deep learning?
What is CNN for deep learning?

Pattern Recognition/Classification
Assign an object or an event (pattern) to one of several known categories (or classes).
3

Category “A”
Category “B”

Classification vs Clustering
4

Category “A”
Category “B”

Classification: known categories (supervised classification)
Clustering: unknown categories (unsupervised classification)

What is a Pattern?
A pattern could be an object or event.
Typically, represented by a vector x of numbers.
5

biometric patterns

hand gesture patterns

What is a Pattern? (cont'd)
Loan/credit-card applications
Income, # of dependents, mortgage amount → credit-worthiness classification

Dating services
Age, hobbies, income → "desirability" classification

Web documents
Key-word-based descriptions (e.g., documents containing "football", "NFL") → document classification
6

What is a Class?
A collection of “similar” objects.
7

Female class
Male class

Main Objectives
Separate the data belonging to different classes.
Given new data, assign them to the correct category.

8

Gender Classification

Main Approaches
x: input vector (pattern)

ω: class label (class)

Generative
Model the joint probability, p(x, ω).
Make predictions by using Bayes' rule to calculate p(ω | x).
Pick the most likely class label ω.

Discriminative
No need to model p(x, ω).
Estimate p(ω | x) by "learning" a direct mapping from x to ω (i.e., estimate the decision boundary).
Pick the most likely class label ω.
9


How do we model p(x, ω)?
Typically, using a statistical model.
probability density function (e.g., Gaussian)

10

Gender Classification

male
female

Key Challenges
Intra-class variability

Inter-class variability

11
Letters/numbers that look similar:
The letter "T" in different typefaces
"M" and "W", if rotation must be considered

Traditional pattern recognition
Traditional pattern recognition models use hand-crafted features and a relatively simple trainable classifier.

This approach has the following limitations:
It is very tedious and costly to develop hand-crafted features.
Hand-crafted features are usually highly application-specific and cannot be transferred easily to other applications.

hand-crafted feature extractor → "simple" trainable classifier → output

Traditional pattern recognition

What is the hand-crafted feature extractor?

hand-crafted feature extractor → "simple" trainable classifier → output

Digital Image Processing/Machine Vision:
feature extraction

Spatial Filtering
Use of spatial masks for image processing (spatial filters)

Linear and nonlinear filters

Low-pass filters eliminate or attenuate high frequency components in the frequency domain (sharp image details), and result in image blurring.

Spatial Filtering …
High-pass filters attenuate or eliminate low-frequency components (resulting in sharpening edges and other sharp details).

Kernel Operator
Place the kernel h(α, β) on top of the image f(x, y):

g(x, y) = Σ_α Σ_β f(x + α, y + β) h(α, β)

h(−1,−1)  h(0,−1)  h(1,−1)
h(−1, 0)  h(0, 0)  h(1, 0)
h(−1, 1)  h(0, 1)  h(1, 1)
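
As a concrete sketch (MATLAB, matching the appendix code style; the toy image and kernel values are our own), the kernel operation above is plain 2-D correlation:

f = magic(5);               % toy 5x5 "image" (hypothetical values)
h = ones(3) / 9;            % 3x3 averaging (low-pass) kernel
g = filter2(h, f, 'same');  % filter2 computes exactly this sum of products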

Spatial Filtering …
The basic approach is to sum products between the mask coefficients and the intensities of the pixels under the mask at a specific location in the image:

R = w1 z1 + w2 z2 + … + w9 z9 (for a 3 × 3 filter)


A spatial filter used in digital image processing

Spatial Filtering …
Non-linear filters also use pixel neighborhoods but do not explicitly use coefficients.

e.g., noise reduction by computing the median gray-level value in the neighborhood of the filter

Smoothing Filters …
Used for blurring (removal of small details prior to large object extraction, bridging small gaps in lines) and noise reduction.

Low-pass (smoothing) spatial filtering
Neighborhood averaging
– Results in image blurring

Image Enhancement in the
Spatial Domain

Some Masks

Image Enhancement in the
Spatial Domain

N = 3, 5, 9, 15 and 35

Low-pass filter
A spatial low-pass filter has the effect of passing, or leaving untouched, the low spatial-frequency components of the image.
High-frequency components are attenuated and are virtually absent in the output image.

1/9 1/9 1/9
1/9 1/9 1/9
1/9 1/9 1/9

High-pass Filter

GRADIENT for edge detection
The gradient is a vector variable:
– its direction is the direction of maximum growth of the function
– its magnitude is the rate of that growth
– it is perpendicular to the edge direction

GRADIENT


Image gradient
The gradient of an image:
∇f = ( ∂f/∂x , ∂f/∂y )

The gradient points in the direction of most rapid change in intensity.

The gradient direction is given by:
θ = tan⁻¹( (∂f/∂y) / (∂f/∂x) )

how does this relate to the direction of the edge?
The edge strength is given by the gradient magnitude:
|∇f| = sqrt( (∂f/∂x)² + (∂f/∂y)² )

GRADIENT AND EDGE VECTORS – EXAMPLE
Gradient Vectors Edge Vectors

The discrete gradient
How can we differentiate a digital image f(x,y)?
Option 1: reconstruct a continuous image, then take gradient
Option 2: take discrete derivative (finite difference)
How would you implement this using the kernel operation?

How to calculate the gradient
Discrete approximation of the partial derivative
Filtering with kernel h
Response to sharp changes

1-D differentiation example
Filter h = [−1 1] for ∆x = 1
∂f/∂x = lim (f(x + ∆x) − f(x)) / ∆x as ∆x → 0

More sensitive: h = ½ [−1 0 1]
Recall: ∂f/∂x = lim (f(x + ∆x) − f(x − ∆x)) / (2∆x) as ∆x → 0

How to calculate the gradient …
To reduce noise sensitivity, take the difference horizontally and then average vertically:

      | −1 0 1 |
1/6 × | −1 0 1 |
      | −1 0 1 |

Sobel operator
Gives greater weight to the central pixels.

Can be approximated as the derivative of a Gaussian:
first Gaussian smoothing, then differentiation.

Sobel operator
Mathematically, the operator uses two 3×3 kernels which are “kernel operated” with the original image to calculate approximations of the derivatives – one for horizontal changes, and one for vertical.

If we define A as the source image, and Gx and Gy as two images which at each point contain the horizontal and vertical derivative approximations, the computations are as follows:

Gx = [ −1 0 +1 ; −2 0 +2 ; −1 0 +1 ] * A
Gy = [ −1 −2 −1 ; 0 0 0 ; +1 +2 +1 ] * A

where * here denotes the 2-dimensional kernel operation.

Sobel operator
The x-coordinate is here defined as increasing in the “right”-direction, and the y-coordinate is defined as increasing in the “down”-direction.
At each point in the image, the resulting gradient approximations can be combined to give the gradient magnitude, using:

G = |Gx| + |Gy|.

Sobel operator
We can also calculate the gradient's direction:

Θ = arctan (|Gy| / |Gx|)

where, for example, Θ is 0 for a vertical edge.
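
A short MATLAB sketch of the Sobel computation (the image here is a random stand-in; variable names are ours, and we use atan2 rather than the |Gy|/|Gx| form above to avoid division by zero):

gray = rand(64);                 % stand-in for a grayscale image
hx = [-1 0 1; -2 0 2; -1 0 1];   % kernel for horizontal changes
hy = hx';                        % kernel for vertical changes
Gx = filter2(hx, gray, 'same');  % derivative approximations
Gy = filter2(hy, gray, 'same');
G  = abs(Gx) + abs(Gy);          % gradient magnitude, as above
theta = atan2(Gy, Gx);           % gradient direction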

Sobel Edge Detection: Example

Original | Horizontal Edges | Vertical Edges | Edge Strength
Note: horizontal/vertical edge images are re-scaled for display (i.e., black relates to negative, grey to zero, white to positive).
It is also possible to take the absolute value of edges for display.

Sobel Edge Detection: Example

Sobel Edge Detector
The Sobel filter extracts all of the edges in an image, regardless of direction
It is implemented as the sum of two directional edge enhancement operators
-1 -2 -1
 0  0  0
 1  2  1

-1 0 1
-2 0 2
-1 0 1

Pattern/Template Matching:
feature extraction

44
Grey-Level Image TM
When using a template-matching (TM) scheme on a grey-level image, it is unreasonable to expect a perfect match of the grey levels.
Instead of a yes/no match at each pixel, the difference in level should be used.

Template
Source Image

45
Matching Method

[Figure: the template image is shifted over the input image I(x, y); matching at each position (x, y) produces the output image O(x, y).]

Correlation

The matching process moves the template image to all possible positions in a larger source image and computes a numerical index that indicates how well the template matches the image in that position.
Match is done on a pixel-by-pixel basis.

2D Pattern Matching – Example
Input: Pattern = {A, B}

Pattern:
A B A
A B A
A A B

Image:
A A B A B A A
B A B A B A B
A A B A A B B
B A A B A A A
A B A B A A A
B B A A B A B
B B B A B A B

Output: {(1,4), (2,2), (4,3)}

47
Euclidean Distance

Let I be a gray-level image and g a gray-value template of size n × m:

d(I, g, r, c) = Σ_{i=1..n} Σ_{j=1..m} [ I(r + i, c + j) − g(i, j) ]²

In this formula (r, c) denotes the top-left corner of template g.
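
A minimal base-MATLAB sketch of this distance map (the image and template are toy values of ours; note we index the patch from (r, c) itself, a slight shift from the formula's (r + i, c + j) convention):

I = rand(64);  g = rand(8);        % toy image and template
[n, m] = size(g);  [H, W] = size(I);
d = zeros(H - n + 1, W - m + 1);   % one distance per template position
for r = 1:H - n + 1
    for c = 1:W - m + 1
        patch   = I(r:r+n-1, c:c+m-1);
        d(r, c) = sum((patch(:) - g(:)).^2);   % squared Euclidean distance
    end
end
[~, best] = min(d(:));             % best match = smallest distance
[br, bc]  = ind2sub(size(d), best);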

48
Correlation
Correlation is a measure of the degree to which two variables agree, not necessarily in actual value but in general behavior.
The two variables are the corresponding pixel values in two images, template and source.

49
Grey-Level Correlation Formula

cor = Σ_{i=0..N−1} (xi − x̄)(yi − ȳ) / sqrt( Σ_{i=0..N−1} (xi − x̄)² · Σ_{i=0..N−1} (yi − ȳ)² )

x is the template gray-level image
x̄ is the average grey level in the template image
y is the source image section
ȳ is the average grey level in the source image section
N is the number of pixels in the section image
(N = template image size = columns × rows)
The value cor is between −1 and +1,
with larger values representing a stronger relationship between the two images.
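
The correlation index is a few lines of base MATLAB (template g and image section patch are toy stand-ins):

g = rand(8);                       % gray-value template (toy)
patch = rand(8);                   % same-size section of the source image (toy)
x = g(:)     - mean(g(:));         % zero-mean template
y = patch(:) - mean(patch(:));     % zero-mean image section
cor = sum(x .* y) / sqrt(sum(x.^2) * sum(y.^2));   % value in [-1, +1]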

Multiple Features
To improve recognition accuracy, we might need to use more than one feature.
A single feature might not yield the best performance.
Using combinations of features might yield better performance.

50

How Many Features?
Does adding more features always improve performance?
It might be difficult and computationally expensive to extract certain features.
Correlated features might not improve performance (i.e. redundancy).
“Curse” of dimensionality.
51

Curse of Dimensionality
Adding too many features can, paradoxically, lead to a worsening of performance.
Divide each of the input features into a number of intervals, so that the value of a feature can be specified approximately by saying in which interval it lies.

If each input feature is divided into M intervals, then the total number of cells is M^d (d: # of features); for example, M = 10 and d = 5 already give 10^5 = 100,000 cells.
Since each cell must contain at least one point, the amount of training data needed grows exponentially with d.
52

Missing Features
Certain features might be missing (e.g., due to occlusion).
How should we train the classifier with missing features?
How should the classifier make the best decision with missing features?

53

Convolution vs. Correlation

Figure 3.11. Graphic illustration of convolution.


Figure 3.11 (cont.). Graphic illustration of convolution.

Figure 3.16. Graphic illustration of correlation.

Figure 3.16 (cont.). Graphic illustration of correlation.
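
The practical difference: convolution flips the kernel by 180° before the sum of products, correlation does not. A toy base-MATLAB check of ours, using an impulse image:

f = zeros(5);  f(3,3) = 1;         % impulse image
h = [1 2 0; 3 4 0; 0 0 0];         % deliberately asymmetric kernel
corr_out = filter2(h, f, 'same');  % correlation: a 180°-rotated copy of h appears
conv_out = conv2(f, h, 'same');    % convolution: h itself appears at the impulse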

Deep Learning
Deep learning (a.k.a. representation learning) seeks to learn rich hierarchical representations (i.e., features) automatically through multiple stages of a feature-learning process.

Low-level features → mid-level features → high-level features → trainable classifier → output

Feature visualization of convolutional net trained on ImageNet
(Zeiler and Fergus, 2013)

Learning Hierarchical
Representations
Hierarchy of representations with increasing level of abstraction. Each stage is a kind of trainable nonlinear feature transform.
Image recognition (from low-level to high-level features)
Pixel → edge → texton → motif → part → object
Text (from low-level to high-level features)
Character → word → word group → clause → sentence → story

Low-level features → mid-level features → high-level features → trainable classifier → output
(increasing level of abstraction)

The word ‘deep’ in deep learning refers to the layered model architectures which are usually deeper than conventional learning models.

60

Convolutional Neural Network (CNN)
Input can have very high dimension (a fully connected ANN would need a very large number of parameters).
Inspired by the neurophysiological experiments conducted by [Hubel & Wiesel 1962], CNNs are a special type of neural network whose hidden units are connected only to a local receptive field. The number of parameters needed by CNNs is much smaller.

CNN

Example: 200×200 image (40,000 pixels)
fully connected: 40,000 hidden units => 40,000 × 40,000 = 1.6 billion parameters
CNN: 5×5 kernel, 100 feature maps => 100 × 5 × 5 = 2,500 (shared) parameters

Local Receptive Field

A layer in CNN

A general architecture of CNN
64

Convolutional Layer
65

Images from http://cs231n.stanford.edu/slides/2016/winter1516_lecture7.pdf

Convolutional Layer
66
Images from http://cs231n.stanford.edu/slides/2016/winter1516_lecture7.pdf

Convolutional Layer
67
Images from http://cs231n.stanford.edu/slides/2016/winter1516_lecture7.pdf

Convolutional Layer
68
Images from http://cs231n.stanford.edu/slides/2016/winter1516_lecture7.pdf

Multiple filters in digital image processing

Spots and Oriented Bars
(Malik and Perona)

Gabor Filters

Gabor filters at different scales and spatial frequencies.
The top row shows the anti-symmetric (odd) filters, the bottom row the symmetric (even) filters.
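
A sketch that builds one even/odd Gabor pair from the formulas reconstructed in the appendix (the frequency (kx, ky), scale sigma, support size, and image are our own choices):

gray = rand(64);                              % stand-in for a grayscale image
[xg, yg] = meshgrid(-10:10, -10:10);          % filter support
kx = 0.6; ky = 0; sigma = 4;                  % hypothetical frequency and scale
env  = exp(-(xg.^2 + yg.^2) / (2*sigma^2));   % Gaussian envelope
even = cos(kx*xg + ky*yg) .* env;             % symmetric (even) filter
odd  = sin(kx*xg + ky*yg) .* env;             % anti-symmetric (odd) filter
resp = filter2(even, gray, 'same');           % filter response of the image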

Hyper-parameters of a convolutional layer
Stride
Stride controls how the filter moves across the input volume: the amount by which the filter shifts is the stride.
Padding size
The number of zeros padded around the border of the input volume.
73
Ref: http://cs231n.github.io/convolutional-networks/
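
Together these fix the output size. A commonly used formula (from the cs231n notes referenced above), checked with hedged toy numbers:

% Output width = (W - F + 2P)/S + 1
W = 7; F = 3; S = 2; P = 1;       % hypothetical input width, filter size, stride, padding
outW = (W - F + 2*P) / S + 1      % = 4 activations across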

Filters on the convolutional layer
Example filters learned by Krizhevsky et al.
Each of the 96 filters shown here is of size [11×11×3].
74

Ref: http://cs231n.github.io/convolutional-networks/

Convolutional network
75

Images from http://cs231n.stanford.edu/slides/2016/winter1516_lecture7.pdf

A Rectified Linear Unit/Function (ReLU)
76

ReLU on the convolutional layer
Rectified Linear Unit (ReLU)
An element-wise operation (applied per pixel)
Replaces all negative pixel values in the feature map by zero
77

Ref: https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
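
In MATLAB the element-wise ReLU is one line (the toy feature map is ours):

relu = @(x) max(x, 0);      % element-wise: negative values become zero
fmap = [-2 3; 1 -5];        % toy feature map
relu(fmap)                  % -> [0 3; 1 0]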

Pooling Layer
Progressively reduces the spatial size of the representation to reduce the number of parameters.
Max, Average, or L2-norm pooling
Max is the most common
78

Pooling
Common pooling operations:
Max pooling: reports the maximum output within a rectangular neighborhood.
Average pooling: reports the average output of a rectangular neighborhood (possibly weighted by the distance from the central pixel).

By spacing pooling regions k > 1 (rather than 1) pixels apart, the next higher layer has roughly k times fewer inputs to process, leading to downsampling.
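
A base-MATLAB sketch of 2×2 max pooling with stride 2 (the input size is assumed divisible by 2; the reshape trick is one common idiom, and the feature map is a toy of ours):

A = magic(4);                      % toy 4x4 feature map
[m, n] = size(A);
B = reshape(A, 2, m/2, 2, n/2);    % carve into 2x2 blocks (column-major order)
P = reshape(max(max(B, [], 1), [], 3), m/2, n/2);   % max within each block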

79


CNN: LeNet-5
82
Images from http://cs231n.stanford.edu/slides/2016/winter1516_lecture7.pdf

CNN: LeNet-5
83

Convolutional network
84

Images from http://cs231n.stanford.edu/slides/2016/winter1516_lecture7.pdf

CNN
85

Images from http://cs231n.stanford.edu/slides/2016/winter1516_lecture7.pdf

Hierarchical feature representation
86

Ref: https://www.rsipvision.com/exploring-deep-learning/

A basic topology of CNN
87

CNN models
LeNet
First successful model of CNN (Yann LeCun, 1990s)
AlexNet
First work that popularized CNNs (Alex Krizhevsky et al., 2012)
GoogLeNet
ILSVRC 2014 winner (Szegedy et al.)
VGGNet
Runner-up in ILSVRC 2014
ResNet
Residual Network (Kaiming He et al.), winner of ILSVRC 2015
88
Ref: http://cs231n.github.io/convolutional-networks/

AlexNet
89

AlexNet

C: convolutional layer,
S: sampling layer,
F: fully connected layer

90

References
Deep Learning by I. Goodfellow, Y. Bengio, and A. Courville (The MIT Press)
Chapter 9, Foundation of Deep Machine Learning in Neural Networks (PDF)
91

Questions & Suggestions?

The End

92

Appendix

Filtering
Further reading on filtering in image processing:
http://www.coe.utah.edu/~cs4640/slides/Lecture5.pdf
https://web.eecs.umich.edu/~jjcorso/t/598F14/files/lecture_0924_filtering.pdf

94

Filtering example
Gaussian filtering on a histogram
95

Filtering for Vertical/Horizontal Edges

gray = read_gray('data/hand20.bmp');  % read_gray: course-provided grayscale loader
dx = [-1 0 1;
      -2 0 2;
      -1 0 1];
dx = dx / (sum(abs(dx(:))));          % normalize the kernel
dy = dx';                             % dy is the transpose of dx
dxgray = abs(imfilter(gray, dx, 'symmetric', 'same'));  % responds to vertical edges
dygray = abs(imfilter(gray, dy, 'symmetric', 'same'));  % responds to horizontal edges

A feature matching filter

Equations referenced in the slides:

Pattern vector:
x = [x1, x2, …, xn]^T

Filter response for a 3 × 3 mask:
R = w1 z1 + w2 z2 + … + w9 z9

Gradient and its magnitude:
∇f = ( ∂f/∂x , ∂f/∂y ),  edge strength = |∇f(x, y)|

Template-match (squared Euclidean) distance:
d(I, g, r, c) = Σ_{i=1..n} Σ_{j=1..m} [ I(r + i, c + j) − g(i, j) ]²

Grey-level correlation:
cor = Σ_{i=0..N−1} (xi − x̄)(yi − ȳ) / sqrt( Σ_{i=0..N−1} (xi − x̄)² · Σ_{i=0..N−1} (yi − ȳ)² )

Two-feature pattern vector:
x = [x1, x2]^T, with x1: lightness, x2: width

Gabor filters:
symmetric (even):  cos(kx·x + ky·y) · exp( −(x² + y²) / (2σ²) )
anti-symmetric (odd):  sin(kx·x + ky·y) · exp( −(x² + y²) / (2σ²) )

AlexNet layer table:

Layer        C1      S1      C2      S2      C3      C4      C5      S5     F6    F7    Output
Depth        96      96      256     256     384     384     256     256    4096  4096  1000
Dimension    55×55   27×27   27×27   13×13   13×13   13×13   13×13   6×6    6×6   1     1
Filter size  11×11   3×3     5×5     3×3     3×3     3×3     3×3     3×3    1×1   1     1
Stride       4       2       1       2       1       1       1       2      1     1     1
