CMT107 Visual Computing


Object Recognition

School of Computer Science and Informatics

Cardiff University

• Object Recognition
• Overview

• What “Works” Today

• Machine Learning Approach for Recognition
• The Machine Learning Framework

• Classifiers
• Nearest neighbour

• Recognition Task and Supervision

• Generalization

• Datasets

• Face Detection and Recognition
• The Viola/Jones Face Detector

• Face Recognition

Object Recognition

• Object recognition is the task of finding a given object in an image or a video.

• The object recognition problem can be defined as a labelling problem based
on models of known objects.

• Object recognition approaches:
• Geometric model-based methods

• Appearance-based methods

• Feature-based methods

How many Visual Object Categories are there?

Biederman (1987)
Slides adapted from Fei-Fei Li and others

How many Visual Object Categories are there?

Visual Object Categories

[Taxonomy diagram: ANIMALS / PLANTS / INANIMATE (NATURAL, MAN-MADE); VERTEBRATE → MAMMALS, BIRDS → grouse, boar, tapir; camera, …]

Specific Recognition Tasks

Scene Categorisation

• outdoor/indoor

• city/forest/factory/etc.

Image Annotation/Tagging

• building

• mountain

Object Detection

• find pedestrians

Image Parsing

street lamp

Image Understanding?

Recognition is All about Modelling Variability

• Variability
• Camera position

• Illumination

• Shape parameters

Within-class variations?

Within-class Variations

History of Ideas in Recognition

• 1960s – early 1990s: the geometric era

Recognition is All about Modelling Variability

• Variability
• Camera position

• Illumination

• Shape parameters : assumed known

Roberts (1965); Lowe (1987);

Faugeras & Hebert (1986);

Grimson & Lozano-Perez (1986);

Huttenlocher & Ullman (1987)

• Alignment: fitting a model to a transformation between pairs of features (matches) in two images

• Find the transformation T that minimizes Σᵢ residual(T(xᵢ), xᵢ′)
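As a minimal sketch of this alignment idea, consider the simplest transformation, a pure translation: the least-squares estimate of T(x) = x + t is just the mean of the per-match displacements. (Illustrative only; real alignment methods fit richer transformations such as affine or projective, and the match coordinates below are made up.)

```python
def fit_translation(src, dst):
    """Least-squares translation t minimizing sum ||(x_i + t) - x_i'||^2."""
    n = len(src)
    tx = sum(d[0] - s[0] for s, d in zip(src, dst)) / n
    ty = sum(d[1] - s[1] for s, d in zip(src, dst)) / n
    return (tx, ty)

def residual(src, dst, t):
    """Sum of squared residuals under a candidate translation t."""
    return sum((s[0] + t[0] - d[0]) ** 2 + (s[1] + t[1] - d[1]) ** 2
               for s, d in zip(src, dst))

# Toy feature matches between two images (hypothetical coordinates)
matches_src = [(0, 0), (2, 1), (4, 3)]
matches_dst = [(1, 2), (3, 3), (5, 5)]
t = fit_translation(matches_src, matches_dst)
print(t)  # (1.0, 2.0)
```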

Recognition as an Alignment Problem

J. Mundy, Object Recognition in the Geometric Era: a Retrospective, 2006

L. G. Roberts, Machine
Perception of Three
Dimensional Solids, Ph.D.
thesis, MIT Department of
Electrical Engineering, 1963.

http://www.di.ens.fr/~ponce/mundy.pdf
http://www.packet.cc/files/mach-per-3D-solids.html

Recognition is All about Modelling Variability

• Variability
• Camera position

• Illumination

• Internal parameters

Invariants: Duda & Hart (1972);
Weiss (1987);

Mundy et al. (1992-94);

Rothwell et al. (1992);

Burns et al. (1993)

Representing and Recognising Object Categories is Harder…

ACRONYM (Brooks and Binford, 1981)

Binford (1971), Nevatia & Binford (1972), Marr & Nishihara (1978)

Recognition by Components

Primitives (geons) Objects

http://en.wikipedia.org/wiki/Recognition_by_Components_Theory

Biederman (1987)


General Shape Primitives

Zisserman et al. (1995)

Generalized cylinders

Ponce et al. (1989)

Forsyth (2000)

History of Ideas in Recognition

• 1960s – early 1990s: the geometric era

• 1990s – appearance-based models

Recognition is All about Modelling Variability

• Empirical models of image variability

• Appearance-based models:
Turk & Pentland (1991); Murase & Nayar (1995); etc.

Eigenfaces (Turk & Pentland 1991)

Colour Histograms

Swain and Ballard, Color Indexing, IJCV 1991.

http://www.inf.ed.ac.uk/teaching/courses/av/LECTURE_NOTES/swainballard91.pdf
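A rough sketch of the colour-indexing idea: quantize each RGB channel into a few bins, build a joint colour histogram, and compare histograms with the intersection measure used by Swain and Ballard. The bin count and the toy "images" (flat lists of RGB tuples) are illustrative assumptions, not the paper's exact setup.

```python
def colour_histogram(pixels, bins=4):
    """Joint RGB histogram with bins^3 cells, normalized to sum to 1."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1.0
    total = len(pixels)
    return [h / total for h in hist]

def intersection(h1, h2):
    """Histogram intersection: 1.0 for identical distributions, 0.0 for disjoint."""
    return sum(min(a, b) for a, b in zip(h1, h2))

red_image = [(200, 10, 10)] * 100
mostly_red = [(200, 10, 10)] * 90 + [(10, 10, 200)] * 10
blue_image = [(10, 10, 200)] * 100

h_red = colour_histogram(red_image)
print(intersection(h_red, colour_histogram(mostly_red)))  # 0.9
print(intersection(h_red, colour_histogram(blue_image)))  # 0.0
```

Because the histogram discards all spatial layout, the comparison is invariant to rotation and robust to partial occlusion, which is what made colour indexing attractive.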

Appearance Manifolds

H. Murase and S. Nayar, Visual learning and recognition of 3-d objects
from appearance, IJCV 1995

Limitations of Global Appearance Models

• Requires global registration of patterns

• Not robust to clutter, occlusion, geometric transformations

History of Ideas in Recognition

• 1960s – early 1990s: the geometric era

• 1990s – appearance-based models

• Mid-1990s – sliding window approaches

Sliding Window Approaches

• Turk and Pentland, 1991

• Belhumeur, Hespanha, & Kriegman, 1997

• Schneiderman & Kanade, 2004

• Viola and Jones, 2000

• Agarwal and Roth, 2002

• Poggio et al., 1993

History of Ideas in Recognition

• 1960s – early 1990s: the geometric era

• 1990s – appearance-based models

• Mid-1990s – sliding window approaches

• Late-1990s – local features

Local features for Object Instance Recognition

D. Lowe (1999, 2004)

Large-Scale Image Search

• Combine local features, indexing, and spatial constraints

Image credit: K. Grauman and B. Leibe

Large-Scale Image Search

• Combine local features, indexing, and spatial constraints

Philbin et al. ‘07

History of Ideas in Recognition

• 1960s – early 1990s: the geometric era

• 1990s – appearance-based models

• Mid-1990s – sliding window approaches

• Late-1990s – local features

• Early-2000s – parts-and-shape models

Parts-and-Shape Models

• Object as a set of parts

• Relative locations between parts

• Appearance of parts

Figure from [Fischler & Elschlager 73]

Constellation Models

Weber, Welling & Perona (2000), Fergus, Perona & Zisserman (2003)

Pictorial Structure Models

• Representing people

Part Appearance / Part Geometry

Fischler and Elschlager (1973), Felzenszwalb and Huttenlocher (2000)

History of Ideas in Recognition

• 1960s – early 1990s: the geometric era

• 1990s – appearance-based models

• Mid-1990s – sliding window approaches

• Late-1990s – local features

• Early-2000s – parts-and-shape models

• Mid-2000s – bags of features

Bag-of-Features Model

Bag of ‘words’

Objects as Texture

• All of these are being treated as the same

• No distinction between foreground and background: Scene recognition?

History of Ideas in Recognition

• 1960s – early 1990s: the geometric era

• 1990s – appearance-based models

• Mid-1990s – sliding window approaches

• Late-1990s – local features

• Early-2000s – parts-and-shape models

• Mid-2000s – bags of features

• Present trends: combination of local and global methods, data-driven
methods, context

Global Scene Descriptors

• The “gist” of a scene: Oliva & Torralba (2001)

http://people.csail.mit.edu/torralba/code/spatialenvelope/


Data Driven Methods

J. Hays and A. Efros,
Scene Completion using
Millions of Photographs,

SIGGRAPH 2007

http://graphics.cs.cmu.edu/projects/scene-completion/

Data Driven Methods

J. Tighe and S. Lazebnik, ECCV 2010

Geometric Context

D. Hoiem, A. Efros, and M. Hebert. Putting Objects in Perspective. CVPR 2006.

http://www.cs.uiuc.edu/homes/dhoiem/projects/pop/

Discriminatively Trained Part-based Models

P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan, “Object Detection with Discriminatively Trained Part-Based Models”

http://www.ics.uci.edu/~dramanan/papers/latentmix.pdf

What “Works” Today

• Reading license plates, postcodes, cheques

• Fingerprint recognition

• Face detection

• Recognition of flat textured objects (CD covers, book covers, etc.)

Recognition: A Machine Learning Approach

Slides adapted from Fei-Fei Li, Derek Hoiem, and others

The Machine Learning Framework

• Apply a prediction function to a feature representation of the image to get
the desired output

f( ) = “apple”
f( ) = “tomato”
f( ) = “cow”

The Machine Learning Framework

• Training: given a training set of labelled examples {(x1,y1), …, (xN,yN)},
estimate the prediction function f by minimizing the prediction error on the
training set

• Testing: apply f to a never-before-seen test example x and output the predicted value y = f(x), where y is the output, f is the prediction function, and x is the image feature

The Machine Learning Framework

[Pipeline: training images + labels → image features → learned classifier f; test image → image features → prediction]

• Image features:
• Raw pixels

• Histograms

• GIST descriptors

Classifier: Nearest Neighbour

f(x) = label of the training sample nearest to x

• All we need is a distance function for the inputs

• No training required!

Training examples
from class 1

Training examples
from class 2
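The nearest-neighbour classifier above can be sketched in a few lines: all it needs is a distance function on the inputs, and there is no training step at all. The 2-D points and class labels are toy examples.

```python
import math

def nearest_neighbour(train, x):
    """Return the label of the training sample nearest to x (Euclidean distance)."""
    return min(train, key=lambda sample: math.dist(sample[0], x))[1]

train = [((0.0, 0.0), "class 1"), ((0.5, 0.2), "class 1"),
         ((5.0, 5.0), "class 2"), ((5.5, 4.8), "class 2")]
print(nearest_neighbour(train, (0.3, 0.1)))  # class 1
print(nearest_neighbour(train, (5.2, 5.1)))  # class 2
```

The flip side of "no training required" is that every test query must compare against the whole training set, which is why large-scale versions rely on indexing structures.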

Classifier: Linear

• Find a linear function to separate the classes:

f(x) = sgn(w · x + b)

Training examples
from class 1

Training examples
from class 2
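The linear decision rule f(x) = sgn(w · x + b) can be sketched directly. The weights here are hand-picked for a toy 2-D problem rather than learned; in practice w and b would come from a training procedure (e.g. a perceptron or an SVM).

```python
def linear_classify(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane w.x + b = 0 x lies."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

w, b = (1.0, 1.0), -5.0  # decision boundary: x1 + x2 = 5 (assumed, not learned)
print(linear_classify(w, b, (1.0, 1.0)))  # -1
print(linear_classify(w, b, (4.0, 4.0)))  # 1
```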

Recognition Task and Supervision

• Images in the training set must be annotated with the “correct answer” that
the model is expected to produce

“Contains a motorbike”

Spectrum of Supervision

Unsupervised “Weakly” supervised Fully supervised

Definition depends on task

Generalisation

• How well does a
learned model
generalise from
the data it was
trained on to a
new test set?

Training set (labels known) Test set (labels unknown)

Generalisation

• Components of generalisation error
• Bias: how much does the average model over all training sets differ from the true model?

– Error due to inaccurate assumptions/simplifications made by the model

• Variance: how much do models estimated from different training sets differ from each other?
• Underfitting: model is too “simple” to represent all the relevant class
characteristics
• High bias and low variance

• High training error and high test error

• Overfitting: model is too “complex” and fits irrelevant characteristics (noise)
in the data
• Low bias and high variance

• Low training error and high test error

Bias-Variance Tradeoff

[Plot: training and test error vs. model complexity — left: high bias / low variance (underfitting), right: low bias / high variance (overfitting)]

Effect of Training Size

[Plot: test error vs. model complexity for many vs. few training examples — more data shifts the optimal complexity upwards]

Effect of Training Size

[Plot: generalisation error decreases as the number of training examples grows, for a fixed prediction model]

Datasets

• Circa 2001: 5 categories, 100s of images per category

• Circa 2004: 101 categories

• Today: up to thousands of categories, millions of images

Caltech 101 & 256

Griffin, Holub, Perona, 2007

Fei-Fei, Fergus, Perona, 2004

http://www.vision.caltech.edu/Image_Datasets/Caltech101/
http://www.vision.caltech.edu/Image_Datasets/Caltech256/


Caltech 101: Intraclass Variability

Face Detection

Behold a state-of-the-art face detector!
(Courtesy Boris Babenko)

http://vision.ucsd.edu/~bbabenko/

Face Detection and Recognition

Detection Recognition “Sally”

Consumer Application: Apple iPhoto

http://www.apple.com/ilife/iphoto/


Consumer Application: Apple iPhoto

• Things iPhoto thinks are faces

Funny Nikon Ads

“The Nikon S60 detects up to 12 faces.”


Challenges of Face Detection

• A sliding window detector must evaluate tens of thousands of location/scale combinations

• Faces are rare: 0–10 per image
• For computational efficiency, we should spend as little time as possible on the non-face windows

• A megapixel image has ~10^6 pixels and a comparable number of candidate face locations

• To avoid a false positive in every image, the false positive rate has to be below 10^-6
The Viola/Jones Face Detector

• A seminal approach to real-time object detection

• Training is slow but detection is very fast

• Key ideas
• Integral images for fast feature evaluation

• Boosting for feature selection

• Attentional cascade for fast rejection of non-face windows

P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features.
CVPR 2001.

P. Viola and M. Jones. Robust real-time face detection. IJCV 57(2), 2004.

http://research.microsoft.com/en-us/um/people/viola/pubs/detect/violajones_cvpr2001.pdf
http://www.vision.caltech.edu/html-files/EE148-2005-Spring/pprs/viola04ijcv.pdf

Image Features

• “Rectangular filters”

• Value = ∑ (pixels in white area) − ∑ (pixels in black area)

Fast Computation with Integral Images

• The integral image computes a
value at each pixel (x,y) that is the
sum of the pixel values above and
to the left of (x,y), inclusive

• This can quickly be computed in
one pass through the image

Computing the Integral Image

• Cumulative row sum: s(x, y) = s(x−1, y) + i(x, y)

• Integral image: ii(x, y) = ii(x, y−1) + s(x, y)
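The two recurrences above can be implemented in a single pass over the image: s accumulates along each row, and ii accumulates down each column. A minimal sketch on a list-of-lists "image":

```python
def integral_image(img):
    """ii[y][x] = sum of img values above and to the left of (x, y), inclusive."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        s = 0  # cumulative row sum s(x, y)
        for x in range(w):
            s += img[y][x]                                 # s(x,y) = s(x-1,y) + i(x,y)
            ii[y][x] = (ii[y - 1][x] if y > 0 else 0) + s  # ii(x,y) = ii(x,y-1) + s(x,y)
    return ii

print(integral_image([[1, 2], [3, 4]]))  # [[1, 3], [4, 10]]
```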

Computing Sum within a Rectangle

• Let A, B, C, D be the values of the integral image at the corners of a rectangle

• Then the sum of original image values within the rectangle can be computed as:

sum = A − B − C + D

• Only 3 additions are required for any size of rectangle!
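A sketch of the constant-time rectangle sum. The corner labelling below is one common convention (with a zero border row/column prepended, so every lookup stays in bounds); it realizes the A − B − C + D form with three additions/subtractions regardless of rectangle size.

```python
def integral_image_padded(img):
    """Integral image with a leading row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = (img[y][x] + ii[y][x + 1]
                                + ii[y + 1][x] - ii[y][x])
    return ii

def rect_sum(ii, x0, y0, x1, y1):
    """Sum of img[y0..y1][x0..x1] (inclusive) via 3 adds/subtracts."""
    a = ii[y1 + 1][x1 + 1]   # bottom-right corner
    b = ii[y0][x1 + 1]       # just above the top-right corner
    c = ii[y1 + 1][x0]       # just left of the bottom-left corner
    d = ii[y0][x0]           # diagonally above-left of the top-left corner
    return a - b - c + d

img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ii = integral_image_padded(img)
print(rect_sum(ii, 1, 1, 2, 2))  # 5 + 6 + 8 + 9 = 28
```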

Integral Image

Black = A-B-C+D
White = C-D-E+F
Value = White-Black = -A+B+2C-2D-E+F

Feature Selection

• For a 24×24 detection region, the number of possible rectangle features is ~160,000

• At test time, it is impractical to evaluate the entire feature set

• Can we create a good classifier using just a small subset of all possible features?

• How to select such a subset?
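The size of the feature pool can be verified by brute force: count every scaled and translated placement of the five standard rectangle-feature shapes (two 2-rectangle, two 3-rectangle, one 4-rectangle) inside a 24×24 window. The exact total depends on which shapes are counted, which is why different sources quote slightly different figures (Viola and Jones report "over 180,000" for their set).

```python
def count_features(W, H, base_w, base_h):
    """Count placements of a feature whose width/height are multiples
    of (base_w, base_h) inside a W x H window."""
    total = 0
    for w in range(base_w, W + 1, base_w):
        for h in range(base_h, H + 1, base_h):
            total += (W - w + 1) * (H - h + 1)
    return total

# Base shapes: horizontal/vertical 2-rect, horizontal/vertical 3-rect, 4-rect
shapes = [(2, 1), (1, 2), (3, 1), (1, 3), (2, 2)]
total = sum(count_features(24, 24, bw, bh) for bw, bh in shapes)
print(total)  # 162336
```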

Boosting

• Boosting is a classification scheme that combines weak learners into a more accurate ensemble classifier

• Training procedure
• Initially, weight each training example equally

• In each boosting round:
• Find the weak learner that achieves the lowest weighted training error

• Raise the weights of training examples misclassified by current weak learner

• Compute the final classifier as a linear combination of all weak learners (the weight of each learner is directly proportional to its accuracy)
• Exact formulas for re-weighting and combining weak learners depend on the particular boosting scheme (e.g., AdaBoost)

Y. Freund and R. Schapire, A short introduction to boosting, Journal of Japanese Society for Artificial Intelligence,
14(5):771-780, September, 1999.

http://www.cs.princeton.edu/~schapire/uncompress-papers.cgi/FreundSc99.ps
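The training procedure above can be sketched with standard AdaBoost on 1-D data, using threshold "stumps" as weak learners: weight examples equally, repeatedly pick the stump with lowest weighted error, then up-weight the examples it misclassified. The toy data and the stump form are illustrative assumptions; the re-weighting formulas are the usual AdaBoost ones.

```python
import math

def best_stump(xs, ys, w):
    """Weak learner: the threshold/parity pair with minimum weighted error."""
    best = None
    for thr in xs:
        for parity in (1, -1):
            pred = [parity if x < thr else -parity for x in xs]
            err = sum(wi for wi, p, y in zip(w, pred, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, parity)
    return best

def adaboost(xs, ys, rounds=3):
    n = len(xs)
    w = [1.0 / n] * n                      # initially, equal weights
    ensemble = []
    for _ in range(rounds):
        err, thr, parity = best_stump(xs, ys, w)
        err = max(err, 1e-10)              # avoid log(0) for a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, parity))
        for i in range(n):                 # raise weights of misclassified examples
            p = parity if xs[i] < thr else -parity
            w[i] *= math.exp(-alpha * ys[i] * p)
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def classify(ensemble, x):
    """Final classifier: sign of the alpha-weighted vote."""
    s = sum(a * (p if x < t else -p) for a, t, p in ensemble)
    return 1 if s >= 0 else -1

xs = [0.1, 0.2, 0.3, 0.8, 0.9, 1.0]
ys = [1, 1, 1, -1, -1, -1]
model = adaboost(xs, ys)
print([classify(model, x) for x in xs])  # [1, 1, 1, -1, -1, -1]
```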

Boosting for Face Detection

• Define weak learners based on rectangle features

• For each round of boosting:
• Evaluate each rectangle filter on each example

• Select best filter/threshold combination based on weighted training error

• Reweight examples

The weak learner thresholds a single rectangle feature:

h(x) = 1 if p · f(x) < p · θ, and 0 otherwise

where f(x) is the value of the rectangle feature, θ is the threshold, and p is the parity.
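As a function, the thresholded weak classifier looks like this (following Viola and Jones; the parity p flips the direction of the inequality, and the numeric values are illustrative):

```python
def weak_classifier(f_val, theta, p):
    """Return 1 (face) if p * f_val < p * theta, else 0 (non-face)."""
    return 1 if p * f_val < p * theta else 0

print(weak_classifier(3.0, 5.0, 1))   # 1: feature value below the threshold
print(weak_classifier(3.0, 5.0, -1))  # 0: parity -1 flips the test
```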

Boosting for Face Detection

• First two features selected by boosting

• This feature combination can yield a 100% detection rate and a 50% false positive rate


Boosting for Face Detection

• A 200-feature classifier can yield 95% detection rate and a false positive rate
of 1 in 14084

Receiver operating characteristic (ROC) curve

Not good enough!

Attentional Cascade

• We start with simple classifiers which reject many of the negative sub-windows while detecting almost all positive sub-windows

• A positive response from the first classifier triggers the evaluation of a second (more complex) classifier, and so on

• A negative outcome at any point leads to the immediate rejection of the sub-window

[Diagram: sub-window → Classifier 1 → Classifier 2 → Classifier 3 → …; a rejection at any stage discards the sub-window]

Attentional Cascade

• Chain classifiers that are progressively
more complex and have lower false
positive rates:

[Diagram: sub-window → Classifier 1 → Classifier 2 → Classifier 3; ROC curve of % detection vs. % false positives — each stage's trade-off between detections and false negatives is set by its threshold]

Attentional Cascade

• The detection rate and the false positive rate of the cascade are found by multiplying the respective rates of the individual stages

• A detection rate of 0.9 and a false positive rate on the order of 10^-6 can be achieved by a 10-stage cascade if each stage has a detection rate of 0.99 (0.99^10 ≈ 0.9) and a false positive rate of about 0.30 (0.3^10 ≈ 6×10^-6)
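The cascade arithmetic above is a direct multiplication of per-stage rates, easy to check:

```python
# Overall cascade rates are the product of the per-stage rates.
stages = 10
detection_per_stage = 0.99
fp_per_stage = 0.30

overall_detection = detection_per_stage ** stages
overall_fp = fp_per_stage ** stages
print(round(overall_detection, 3))   # 0.904
print(f"{overall_fp:.4e}")           # 5.9049e-06, i.e. about 6e-6
```

Note the asymmetry this exploits: a modest per-stage false positive rate (0.30) compounds into a tiny overall rate, while a very high per-stage detection rate (0.99) is needed to keep the overall detection rate acceptable.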

Training the Cascade

• Set target detection and false positive rates for each stage

• Keep adding features to the current stage until its target rates have been met
• Need to lower the AdaBoost threshold to maximize detection (as opposed to minimizing total classification error)

• Test on a validation set

• If the overall false positive rate is not low enough, then add another stage

• Use false positives from current stage as the negative training examples for
the next stage

The Implemented System

• Training data
• 5000 faces, all frontal, rescaled to 24×24 pixels

• 300 million non-face sub-windows from 9500 non-face images

• Faces are normalized for scale and translation

• Many variations
• Across individuals

• Illumination
System Performance

• Training time: “weeks” on 466 MHz Sun workstation

• 38 layers, total of 6061 features

• Average of 10 features evaluated per window on test set

• “On a 700 MHz Pentium III processor, the face detector can process a 384 by 288 pixel image in about .067 seconds”

• 15 times faster than previous detector of comparable accuracy (Rowley et al., 1998)

Output of Face Detector on Test Images

Other Detection Tasks

Facial Feature Localization

Profile Detection


Viola/Jones Detector Summary

• Rectangle features

• Integral images for fast computation

• Boosting for feature selection

• Attentional cascade for fast rejection of negative windows

Face Recognition

• N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar, “Attribute
and Simile Classifiers for Face Verification,” ICCV 2009.

http://www.cs.columbia.edu/CAVE/projects/faceverification/

Face Recognition

• N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar, “Attribute
and Simile Classifiers for Face Verification,” ICCV 2009.

Attributes for training / Similes for training

http://www.cs.columbia.edu/CAVE/projects/faceverification/

Face Recognition

• Results on Labeled Faces in the Wild Dataset

• N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar, “Attribute
and Simile Classifiers for Face Verification,” ICCV 2009.

http://vis-www.cs.umass.edu/lfw/
http://www.cs.columbia.edu/CAVE/projects/faceverification/

• What is object recognition?

• Briefly describe the history of ideas of object recognition.

• Describe the machine learning framework.

• Describe nearest neighbour and linear classifiers.

• What is the task of face detection and recognition?

• Describe the Viola/Jones Face Detector.
