Computer Vision (7CCSMCVI / 6CCS3COV)
Recap
• Image formation
● Low-level vision
● Mid-level vision
● High-level vision
● Artificial
– template matching
– sliding window
– edge matching
– model-based
– intensity histograms
– implicit shape model
– SIFT feature matching
– bag-of-words
– geometric invariants
● Biological ← Today
Computer Vision / High-Level Vision / Object Recognition (Biological) 1
Today
• Theories of object recognition / categorisation:
– object-based (3D) vs image-based (2D)
– configural (global) vs featural (local)
– rules vs exemplars vs prototypes
• Theories of cortical processing:
– hierarchical neural network models
» Feedforward (HMAX, CNN)
» Recurrent
• Top-down vs Bottom-up
– Bayesian inference
Object based vs Image based theories
Object based:
• each object represented by storing a 3D model
• object-centred reference frame
Image based:
• each object represented by storing multiple 2D views (e.g. images)
• viewer-centred reference frame
Object based: Recognition By Components
Representations of objects are stored in the brain as structural descriptions.
A structural description contains a specification of the object’s parts and their inter-relations (e.g. a cube above a cylinder).
[Figure: processing stages labelled early processing, part segmentation, part modelling, and structural description]
Object based: Recognition By Components
Hypothesis: there is a small number of geometric components that constitute the primitive elements of the object recognition system (like letters forming words).
Object based: Recognition By Components
Hence, an object is an arrangement of a few simple three-dimensional shapes called geometrical ions, or geons.
Geons are simple volumes such as cubes, spheres, cylinders, and wedges.
Object based: Recognition By Components
Different combinations of geons can be used to represent a large variety of objects
Object based: Recognition By Components
Geons are chosen to be:
• sufficiently different from each other to be easily discriminated
• robust to noise (can be identified even with parts of image missing)
• view-invariant (look similar from most viewpoints)
Different views of the same object are represented by the same set of geons, in the same arrangement. Therefore, the model achieves viewpoint invariance.
Object based: Recognition By Components
Matching
Recognition involves recognizing object elements (geons) and their configuration
The visual system parses an image of an object into its constituent geons.
Interrelations are determined, such as relative location and size (e.g., the lamp shade is left-of, below, and larger-than the fixture).
The geons and interrelations of the perceived object are matched against stored structural descriptions.
If a reasonably good match is found, then successful object recognition will occur.
Object based: Recognition By Components
Problems:
• difficult to decompose an image into components (i.e. to map an image onto a representation in geons)
• difficult to represent many natural objects using geons (may not have a simple parts-based description, e.g. a tree)
• cannot capture the finer details that are necessary for identifying individuals or discriminating between similar objects
Image based
3D object represented by multiple, stored, 2D views of the object.
Object recognition occurs when a current pattern matches a stored pattern.
• Template matching
» An early version of the image-based approach.
» Too rigid to account for flexibility of human object recognition.
• Multiple Views approach
» More recent version of the image-based approach.
» Through experience, we encode multiple views of objects.
» These serve as the templates for recognition, but interpolation between stored views enables recognition of objects from novel viewpoints.
Configural vs Featural theories
Who is this?
Is he looking well?
Configural vs Featural theories
Inverted faces: featural processing
– features processed independently, relationships between features ignored.
Upright faces: configural processing
– holistic, global
Rules vs Prototypes vs Exemplars
How are the boundaries between different categories defined?
How are new stimuli assigned to the closest category?
[Figure: 2D feature space (see lecture 5) showing previous examples of stimuli from 3 different categories and a new stimulus from an unknown category]
Rules
Category membership defined by abstract rules, e.g.
– has three sides = triangle
– has four legs and barks = dog
– has a beak and feathers = bird
Anything that satisfies the rule(s) for the category goes into that category.
For: over-extension of rules of grammar, e.g. “goed” instead of “went”, “bitted” instead of “bitten”, “mouses” instead of “mice”.
Against: some members are better examples of a category than others (graded membership), e.g. a bear is a better mammal than a whale, 4 is a better even number than 106, a pigeon is a better bird than a penguin.
Prototypes
Calculate the average (or prototype) of all the individual instances from each category.
A new stimulus is compared to the stored prototypes and assigned to the category of the nearest one
For: prototypical category members are accessed more quickly and learnt more easily (e.g. pigeons vs penguins).
Against: variations within a class cannot be represented.
Exemplars
Specific individual instances of each category (“exemplars”) stored in memory.
A new stimulus is compared to the stored exemplars and assigned to the category of the nearest one
For: successfully predicts some kinds of mis-categorizations (e.g., a whale as a fish).
Against: some members are better examples of a category than others (graded membership), e.g. a bear is a better mammal than a whale, 4 is a better even number than 106, a pigeon is a better bird than a penguin.
Classifiers
Prototype and Exemplar theories in psychology correspond to standard classification methods used in pattern recognition / machine learning.
These methods use “supervised” learning:
• assumes that the class of each data point in the training set is known
• new (unknown) data points are assigned to the appropriate class based on similarity to training examples
The alternative is unsupervised learning:
• assumes that the class of each data point is unknown
• all data points are assigned to an appropriate class based on similarity
We previously came across unsupervised pattern recognition / machine learning methods, called clustering, when discussing image segmentation techniques (i.e. k-means clustering, agglomerative hierarchical clustering, graph cutting).
Nearest Mean Classifier (Prototype)
For each class
• calculate the mean of the feature vectors for all the training examples in that class
For each new stimulus
• find the closest class prototype and assign new stimulus to that class label
Decision boundary is linear. Hence, suitable only if data is linearly separable
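The two steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the toy 2D training data, and the choice of Euclidean distance are assumptions, not part of the lecture.

```python
import math

def class_means(training):
    """Return {label: mean feature vector} for a list of (vector, label) pairs."""
    sums, counts = {}, {}
    for vec, label in training:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

def nearest_mean(prototypes, x):
    """Assign x to the class whose prototype (mean) is closest (Euclidean distance)."""
    return min(prototypes, key=lambda label: math.dist(prototypes[label], x))

# Hypothetical 2D training data: two classes, three examples each.
training = [((1.0, 1.0), "A"), ((2.0, 1.0), "A"), ((1.5, 2.0), "A"),
            ((6.0, 5.0), "B"), ((7.0, 6.0), "B"), ((6.5, 7.0), "B")]
prototypes = class_means(training)
print(nearest_mean(prototypes, (2.0, 2.0)))
```

Only one vector per class is stored after training, which is why the boundary between any two classes is the straight line (hyperplane) equidistant from their two means.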
Nearest Neighbour Classifier (Exemplar)
• Save the vectors for all the training examples (instead of just the mean for each class)
For each new stimulus
• find the closest training exemplar and assign new stimulus to that class label
Decision boundary is non-linear (piecewise linear). Hence, suitable if data is non-linearly separable.
Decision boundaries form a Voronoi partitioning of feature space.
Does not cope well with outliers: a single atypical training example claims its whole Voronoi cell.
K-Nearest Neighbours Classifier
• Save the vectors for all the training examples (instead of just the mean for each class)
For each new stimulus
• find the k closest training exemplars and assign the new stimulus to the class label of the majority of these points (i.e. the closest points vote on the label)
[Figure: example classification with k=3]
Decision boundary is non-linear. Hence, suitable if data is non-linearly separable.
k is typically small and odd (to avoid tied votes between two classes).
Increasing k reduces the effects of outliers.
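A minimal sketch of the k-nearest-neighbours vote follows, with k=1 giving the plain nearest neighbour (exemplar) classifier. The function name, the toy data, and the use of Euclidean distance are illustrative assumptions. The data deliberately include one outlier so that the effect of increasing k can be seen.

```python
import math
from collections import Counter

def knn_classify(training, x, k=3):
    """Label x by majority vote among its k nearest training exemplars."""
    neighbours = sorted(training, key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical 2D training data; the last point is an outlier:
# a "B" exemplar sitting in the middle of the "A" examples.
training = [((1.0, 1.0), "A"), ((2.0, 1.0), "A"), ((1.5, 2.0), "A"),
            ((6.0, 5.0), "B"), ((7.0, 6.0), "B"), ((6.5, 7.0), "B"),
            ((2.0, 2.0), "B")]
query = (2.1, 1.9)
print(knn_classify(training, query, k=1))  # the outlier alone decides the label
print(knn_classify(training, query, k=3))  # the outlier is outvoted by two "A" neighbours
```

All training vectors are stored, so classification cost grows with the size of the training set, unlike the nearest mean classifier.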
Similarity Measures
Determining the nearest neighbour(s) or nearest mean requires some measure of the distance between two sets of features.
As previously, we can either find the minimum distance, e.g.:
• Sum of Squared Differences (SSD)
• Euclidean distance
• Sum of Absolute Differences (SAD) = Manhattan distance
Or, find the maximum similarity, e.g.:
• Cross-correlation
• Normalised cross-correlation
• Correlation coefficient
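The measures listed above can be written out for two feature vectors as follows. This is a sketch using the usual textbook definitions, which are assumed here to match those used earlier in the course; the function names and the example vectors are illustrative.

```python
import math

def ssd(a, b):
    """Sum of Squared Differences (smaller = more similar)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    """Euclidean distance: the square root of the SSD."""
    return math.sqrt(ssd(a, b))

def sad(a, b):
    """Sum of Absolute Differences, i.e. Manhattan distance."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cross_correlation(a, b):
    """Cross-correlation (larger = more similar); sensitive to overall magnitude."""
    return sum(x * y for x, y in zip(a, b))

def normalised_cross_correlation(a, b):
    """Cross-correlation divided by both vector lengths.
    Undefined (division by zero) if either vector is all zeros."""
    norm_a = math.sqrt(cross_correlation(a, a))
    norm_b = math.sqrt(cross_correlation(b, b))
    return cross_correlation(a, b) / (norm_a * norm_b)

def correlation_coefficient(a, b):
    """Normalised cross-correlation of the mean-subtracted vectors.
    Undefined if either vector is constant."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return normalised_cross_correlation([x - mean_a for x in a],
                                        [y - mean_b for y in b])

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same pattern as a, with doubled magnitude
c = [3.0, 2.0, 1.0]   # reversed pattern
print(ssd(a, b), sad(a, b), normalised_cross_correlation(a, b))
print(correlation_coefficient(a, c))
```

Note that a and b are far apart under the distance measures but maximally similar under the normalised measures, so the choice of measure determines which neighbours count as "nearest".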
The Cortical Visual System: pathways
Where (or How):
• V1 to parietal cortex
• spatial / motion information
What:
• V1 to inferotemporal cortex
• identity / category information
“What” and “Where” pathways
Hierarchically organised:
• simple, local, RFs at V1
• complex, large, RFs in higher areas
Hierarchy of Receptive Fields
As we progress along a pathway, neurons’ preferred stimuli get more complex, receptive fields become larger, and there is greater invariance to location.
e.g.: Eye → LGN → V1
Centre-surround cells → Simple cells → Complex cells
Centre-surround cells respond to isolated spots of contrasting intensity or colour.
Simple cells respond to edges/bars at a specific orientation, with a specific contrast polarity, at a specific location.
Complex cells respond to edges/bars of a specific orientation, with any polarity, at any position within a small region.
Hierarchy of Receptive Fields
More complex RFs built by combining outputs from multiple cells with simpler RFs.
e.g.: Eye → LGN → V1
Centre-surround cells → Simple cells → Complex cells
Simple cells respond when multiple co-aligned centre-surround cells are active.
Complex cells respond when any of multiple similarly oriented simple cells are active.
Hierarchy of Receptive Fields
This trend continues along the ventral pathway:
• larger RFs
• higher complexity
• higher invariance
Hierarchy of Receptive Fields
Neurons’ preferred stimuli get more complex, but the neurons become less sensitive to location.
Neurons near the end of the ventral pathway respond to very complex stimuli, like faces.
Feedforward models of cortical hierarchy
Image processed by layers of neurons with progressively more complex receptive fields at progressively less specific locations.
Hierarchical in that features at one stage are built from features at earlier stages.
Can be thought of as hierarchical template matching
Feedforward models: HMAX
[Figure: HMAX architecture with alternating simple (S) and complex (C) layers]
Different mathematical operations are required to increase the complexity or selectivity of RFs and to increase the degree of invariance of RFs.
Hence, several models use alternating layers of neurons with different properties.
In analogy with the response properties of V1, these are called simple (S) and complex (C) cells.
Feedforward models: HMAX
Unit types           Computation        Result
Simple “S-cells”     sum (“and”-like)   Increased selectivity
Complex “C-cells”    max (“or”-like)    Increased invariance
Feedforward models: HMAX
Simple “S-cells”
S-cells in one layer respond to conjunctions of C-cells in previous layer.
Complex “C-cells”
C-cells in one layer respond to any S-cell in a small neighborhood of the previous layer.
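The two unit types can be illustrated with a one-dimensional toy example. This is a hedged sketch of the sum and max operations only, not an implementation of the full HMAX model; the function names, the template, and the input patterns are assumptions made for illustration.

```python
def s_layer(inputs, template):
    """S-cells: each unit sums the inputs under a template ("and"-like), so it
    responds most strongly only when every part of the pattern is present."""
    n = len(template)
    return [sum(w * v for w, v in zip(template, inputs[i:i + n]))
            for i in range(len(inputs) - n + 1)]

def c_layer(s_responses, pool):
    """C-cells: each unit takes the max over a neighbourhood of S-cells
    ("or"-like), so it responds wherever in that neighbourhood the pattern
    occurs. Incomplete trailing neighbourhoods are dropped."""
    return [max(s_responses[i:i + pool])
            for i in range(0, len(s_responses) - pool + 1, pool)]

template = [1, 1, 1]
pattern = [0, 1, 1, 1, 0, 0, 0, 0]
shifted = [0, 0, 1, 1, 1, 0, 0, 0]   # same pattern moved one position right

s_pattern = s_layer(pattern, template)
s_shifted = s_layer(shifted, template)
print(s_pattern, s_shifted)                          # S responses differ: location-specific
print(c_layer(s_pattern, 4), c_layer(s_shifted, 4))  # C responses agree: location-tolerant
```

The S layer is selective (its peak response requires the whole template to be present) but location-specific; the C layer gives up some location information in exchange for invariance.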
Feedforward models: HMAX
From IT to PFC:
Task-specific.
Trained using supervised learning (i.e. a classifier).
From V1 to IT:
Generic, reusable representations of shape components.
“Hard-wired.”
Feedforward models: CNN
CNN = convolutional neural network
A hierarchical model similar to HMAX can be implemented using standard image processing techniques: convolution and sub-sampling.
It consists of alternating layers of
• convolution (equivalent to responding to conjunctions), and
• sub-sampling (equivalent to responding to any input in a small neighbourhood, to reduce location specificity).
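One convolution layer followed by one sub-sampling layer can be sketched in plain Python as below. The function names, the toy image, and the kernel are illustrative assumptions. Sub-sampling is implemented here as max pooling, which is one common choice (averaging is another), and no learning or non-linearity is included.

```python
def convolve2d(image, kernel):
    """Valid-mode sliding-window weighted sum. The kernel is not flipped
    (cross-correlation), as is usual in CNN implementations."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[u][v] * image[r + u][c + v]
                 for u in range(kh) for v in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool(fmap, size=2):
    """Sub-sample by keeping the maximum of each non-overlapping size x size block."""
    return [[max(fmap[r + u][c + v] for u in range(size) for v in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]

# Hypothetical 5x5 image containing a vertical edge, and a 2x2 vertical-edge kernel.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
kernel = [[-1, 1],
          [-1, 1]]

features = convolve2d(image, kernel)   # 4x4 feature map, peaks along the edge
pooled = max_pool(features)            # 2x2 map, coarser edge location
print(features)
print(pooled)
```

After pooling, the edge is still detected but its position is known only to within a 2x2 block, which is the reduction in location specificity described above.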
Feedforward models: CNN
Confusingly:
• convolution layers are called “C layers” but are equivalent to S layers in HMAX, and
• sub-sampling layers are called “S layers” but are equivalent to C layers in HMAX.
Deep Neural Networks (like HMAX and CNN) achieve state-of-the-art performance in many computer vision tasks.
Feedforward models of cortical hierarchy
HMAX and CNN are two examples of several models that propose a purely serial, feedforward sequence of cortical information processing.
[Figure: cortical hierarchy with areas labelled V1, V2, V4, MT, temporal areas, and parietal areas]
Recurrent models of cortical hierarchy
However, there are two types of recurrent connections.
(1) Within each region, lateral connections (both excitatory and inhibitory) enable neurons within the same population to interact (see descriptions of V1 and V2 in earlier lectures).
[Figure: cortical hierarchy with areas labelled V1, V2, V4, MT, temporal areas, and parietal areas]
Recurrent models of cortical hierarchy
In addition,
(2) feedback connections convey information from higher cortical regions to primary sensory areas.
Bottom-up and top-down information interact to affect perception.
[Figure: cortical hierarchy with areas labelled V1, V2, V4, MT, temporal areas, and parietal areas]
Recurrent models of cortical hierarchy
Allow bottom-up and top-down information to be combined.
• Bottom-Up processes:
– Using the information in the stimulus itself to aid in identification.
– Stimulus driven.
• Top-Down processes:
– Using context, previous knowledge, and expectation to aid in identification.
– Knowledge driven.
Bayesian Inference
Bayes’ Theorem describes an optimal method of combining bottom-up and top-down information.
Bayes’ Theorem:
p(A|B) p(B) = p(B|A) p(A)
or
p(A|B) = p(B|A) p(A) / p(B)
p(A|B) is the conditional probability of A given B (vertical bar “|” reads as “given”).
Bayesian Inference
An example of conditional probabilities.
The conditional probability that it is raining given that the pavement is wet is:
p(rain | wet pavement) < 1
because a wet pavement can be caused by many things (leaking pipes, dropped water bottles, etc).
The conditional probability that the pavement is wet given that it is raining is:
p(wet pavement | rain) = 1
because rain always wets the pavement.
Therefore, the two conditional probabilities are not necessarily equal
p(rain | wet pavement) ≠ p(wet pavement | rain)
Bayes' theorem gives the relationship between conditional probabilities.
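To make the relationship concrete, the rain example can be worked through numerically. The values of p(rain) and p(wet pavement) below are invented purely for illustration; only p(wet pavement | rain) = 1 comes from the example above.

```python
# p(wet pavement | rain) = 1 is taken from the example above.
p_wet_given_rain = 1.0
# The next two values are hypothetical, chosen only to illustrate the calculation.
p_rain = 0.2    # assumed probability that it is raining
p_wet = 0.25    # assumed probability that the pavement is wet (from any cause)

# Bayes' theorem: p(rain | wet) = p(wet | rain) p(rain) / p(wet)
p_rain_given_wet = p_wet_given_rain * p_rain / p_wet
print(p_rain_given_wet)
```

Under these assumed numbers p(rain | wet pavement) is less than 1 while p(wet pavement | rain) equals 1, matching the inequality stated above.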
Bayesian Inference
Bayes' theorem can be considered as a method for obtaining the information you need from the information you have.
In vision, we want to know p(objectj | imagei): the probability that objectj is present in the world given that imagei is on the retina.
Solving this is hard – it is an inverse problem.
However, we can calculate p(imagei | objectj): the probability of observing imagei given the 3D objectj.
Solving this is easier – it is a forward problem.
Bayes' theorem provides a means of calculating p(objectj | imagei), since:
p(objectj | imagei) = p(imagei | objectj) p(objectj) / p(imagei)
Bayesian Inference: nomenclature
p(objectj | imagei) = p(imagei | objectj) p(objectj) / p(imagei)
posterior = likelihood × prior / evidence
posterior: the thing we want to know (the probability of a particular object being present given the image).
likelihood: the thing we can calculate (the probability of the particular image being a projection of the particular object).
prior: the thing we know from prior experience (the probability that the particular object will be present in the environment)
evidence: the thing we can ignore, as it is the same for all possible interpretations of this image.
Bayesian Inference: example
Each of N=3 possible objects can generate the observed image.
The probability of observing this image, I, is constant (make p(I)=1 for simplicity).
The likelihood p(I|objj) is: “the probability of observing image I, given the 3D object objj”.
If all N=3 objects could produce the same image with equal probability, their likelihoods are the same:
p(I|obj1) = p(I|obj2) = p(I|obj3) = 0.09
[Figure: three candidate 3D objects (obj1, obj2, obj3) and the image I]
Bayesian Inference: example
Thus, the image alone cannot be used to decide which of the three possible objects produced the image.
However, if our prior experience of 3D objects produces a higher expectation of cubes than irregular shapes, then the priors will be different: e.g. p(obj3) = 0.1, p(obj2) = 0.01, p(obj1) = 0.01
We can use the prior probability of each object to weight the known likelihood to obtain the posterior probability:
p(objj|I) = p(I|objj) p(objj) (assuming p(I)=1).
Hence,
p(obj1|I) = p(obj2|I) = 0.09 × 0.01 = 0.0009
p(obj3|I) = 0.09 × 0.1 = 0.009
[Figure: three candidate 3D objects (obj1, obj2, obj3) and the image I]
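The calculation in this example can be reproduced as follows. The likelihoods and priors are the values given above; the normalisation step at the end is an addition that assumes, for illustration, that the three objects are the only possible interpretations of the image.

```python
# Values from the example: equal likelihoods, unequal priors.
likelihood = {"obj1": 0.09, "obj2": 0.09, "obj3": 0.09}
prior = {"obj1": 0.01, "obj2": 0.01, "obj3": 0.1}

# With p(I) set to 1, likelihood * prior gives unnormalised posteriors.
unnormalised = {obj: likelihood[obj] * prior[obj] for obj in likelihood}

# Assumption (not in the example): if these three objects are the only
# hypotheses, dividing by the total gives posteriors that sum to 1.
total = sum(unnormalised.values())
posterior = {obj: value / total for obj, value in unnormalised.items()}

best = max(posterior, key=posterior.get)
print(unnormalised)
print(posterior)
print(best)
```

Normalising does not change which hypothesis wins, which is why the evidence term can be ignored when only the most probable object is required.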
Bayesian Inference: example
The posterior p(objj|I) is the probability that object objj is present in the world given that image I is on the retina.
The posterior probabilities thus tell us which object is most likely to have yielded image I.
In this example, the prior experience biases our interpretation of the image, so that we tend to interpret the image I as object obj3.
[Figure: three candidate 3D objects (obj1, obj2, obj3) and the image I]
Bayesian Inference
Bayes' rule shows how to combine current evidence, I, with knowledge gained from prior experience, p(objj), to estimate the posterior probability p(objj|I) that the hypothesis (objj) under consideration is true (e.g. that objj is the correct 3D object).
We need to compute the posterior p(objj|I) for all possible hypotheses in order to select the hypothesis with the largest posterior.
If we assume p(I)=1 then
posterior = likelihood * prior
Bayesian Inference
Alternatively, if we just want to determine the probability that an image contains a particular object or not, we can use the following formulation:
p(objectj | imagei) / p(not objectj | imagei) = [p(imagei | objectj) / p(imagei | not objectj)] × [p(objectj) / p(not objectj)]
posterior ratio = likelihood ratio × prior ratio
Bayesian Inference: example
p(image | zebra) = 0.07
p(zebra) = 0.01
p(image | no zebra) = 0.0005
p(no zebra) = 0.99
p(zebra | image) / p(no zebra | image) = [p(image | zebra) / p(image | no zebra)] × [p(zebra) / p(no zebra)]
= (0.07 / 0.0005) × (0.01 / 0.99) = 1.41
> 1, so zebra
Bayesian Inference: example
p(image | zebra) = 0.003
p(zebra) = 0.01
p(image | no zebra) = 0.85
p(no zebra) = 0.99
p(zebra | image) / p(no zebra | image) = [p(image | zebra) / p(image | no zebra)] × [p(zebra) / p(no zebra)]
= (0.003 / 0.85) × (0.01 / 0.99) = 0.000036
< 1, so not zebra
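The two zebra examples can be checked with a short function for the posterior ratio. The function and variable names are illustrative assumptions; the numbers are those given in the examples.

```python
def posterior_ratio(lik_present, lik_absent, prior_present):
    """Posterior ratio = likelihood ratio * prior ratio.
    A result above 1 favours 'object present'; below 1 favours 'object absent'."""
    prior_absent = 1.0 - prior_present
    return (lik_present / lik_absent) * (prior_present / prior_absent)

# First example: the image is far more likely under "zebra" than "no zebra".
ratio_1 = posterior_ratio(lik_present=0.07, lik_absent=0.0005, prior_present=0.01)
# Second example: the image is far more likely under "no zebra".
ratio_2 = posterior_ratio(lik_present=0.003, lik_absent=0.85, prior_present=0.01)

for ratio in (ratio_1, ratio_2):
    print(ratio, "zebra" if ratio > 1 else "not zebra")
```

In the first case a likelihood ratio of 140 is needed to overcome a prior ratio of about 0.01, which shows how strongly a low prior weighs against a hypothesis.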
Bayesian Inference
Bayesian inference can be seen as a method of solving the ill-posed, inverse problem of vision (see introductory lecture).
Vision is an inverse problem – we know the pixel intensities (the outcomes) and want to infer the causes (i.e. the objects in the scene, etc.).
Vision is ill-posed as there are usually multiple solutions (i.e. multiple causes that could give rise to the same outcomes).
In order to compensate, the perceptual systems make use of assumptions, constraints or priors about the nature of the physical world.
Bayesian Inference
Prior: Texture is circular and homogeneous Infer: shape/depth
Prior: Light from above Infer: depth
Bayesian Inference
Prior: faces are convex Infer: shape/depth
A convex face appears convex if we assume light comes from above.
A concave face still appears convex, but only if we assume light now comes from below.
Bayesian Inference
Prior: size is constant Infer: depth
Bayesian Inference
Prior: neighbouring features are related Infer: grouping
Prior: similar features are related Infer: grouping
Prior: connected features are related Infer: grouping
Bayesian Inference
Prior: strings of letters form words Infer: letter identity
Bayesian Inference
Prior: knowledge about image content Infer: object identity
We are back where we started in lecture 1!
Summary
psychology      machine learning
rules           Decision Trees
vs
prototypes      Nearest Mean Classifier
vs
exemplars       Nearest Neighbour Classifier, K-Nearest Neighbours Classifier
Summary
object-based (3D) e.g. recognition by components
vs
image-based (2D) e.g. template matching
configural (global)
vs
featural (local)
Summary
Local (featural) and global (configural) representations have complementary advantages and disadvantages.
Simple (local) features generate many false positives:
– fail to distinguish objects with similar features in different arrangements
– fail to deal with clutter
Complex (global) features generate many false negatives:
– fail to deal with occlusion
– fail to deal with viewpoint changes and within-class variation
Solutions:
1. use features of intermediate complexity
2. use a hierarchy of features with a range of complexities
Summary
Cortex seems to employ the latter approach: a hierarchy of features with a range of complexities.
Modelled using alternating layers: some increase selectivity, the others increase invariance.
Summary
• Bottom-Up processes
– Using the information in the stimulus itself to aid in identification
– Stimulus driven
– Discriminative
• Top-Down processes
– Using context, previous knowledge, and expectation to aid in identification
– Knowledge driven
– Generative
Summary
p(objectj | imagei) = p(imagei | objectj) p(objectj) / p(imagei)
posterior = likelihood × prior / evidence
p(objectj | imagei) / p(not objectj | imagei) = [p(imagei | objectj) / p(imagei | not objectj)] × [p(objectj) / p(not objectj)]
posterior ratio = likelihood ratio × prior ratio
• Discriminative methods model the posterior
• Generative methods model the likelihood and prior