High-Level Vision (Artificial)
1. Below are shown a template T and an image I. Calculate the result of performing template matching on the image and, hence, suggest the location of the object depicted in the template, assuming that there is exactly one such object in the image. Use the following similarity measures: (a) normalised cross-correlation, (b) sum of absolute differences.
$$T = \begin{pmatrix} 100 & 150 & 200 \\ 150 & 10 & 200 \\ 200 & 200 & 250 \end{pmatrix}, \qquad I = \begin{pmatrix} 60 & 50 & 40 & 40 \\ 150 & 100 & 100 & 80 \\ 50 & 20 & 200 & 80 \\ 200 & 150 & 150 & 50 \end{pmatrix}$$

(a) Normalised cross-correlation.

$$\text{Similarity} = \frac{\sum_{i,j} T(i,j)I(i,j)}{\sqrt{\sum_{i,j} T(i,j)^2}\,\sqrt{\sum_{i,j} I(i,j)^2}}$$

$$\sqrt{\textstyle\sum_{i,j} T(i,j)^2} = \sqrt{100^2+150^2+200^2+150^2+10^2+200^2+200^2+200^2+250^2} = 526.9$$

At pixel (2,2):
$$\text{Similarity} = \frac{100{\times}60+150{\times}50+200{\times}40+150{\times}150+10{\times}100+200{\times}100+200{\times}50+200{\times}20+250{\times}200}{526.9\sqrt{60^2+50^2+40^2+150^2+100^2+100^2+50^2+20^2+200^2}} = \frac{129000}{526.9\times305.1} = 0.80$$

At pixel (3,2):
$$\text{Similarity} = \frac{100{\times}50+150{\times}100+200{\times}20+150{\times}40+10{\times}100+200{\times}200+200{\times}40+200{\times}80+250{\times}80}{526.9\sqrt{50^2+100^2+20^2+40^2+100^2+200^2+40^2+80^2+80^2}} = \frac{115000}{526.9\times280.9} = 0.78$$

At pixel (2,3):
$$\text{Similarity} = \frac{100{\times}150+150{\times}50+200{\times}200+150{\times}100+10{\times}20+200{\times}150+200{\times}100+200{\times}200+250{\times}150}{526.9\sqrt{150^2+50^2+200^2+100^2+20^2+150^2+100^2+200^2+150^2}} = \frac{205200}{526.9\times412.8} = 0.94$$

At pixel (3,3):
$$\text{Similarity} = \frac{100{\times}100+150{\times}20+200{\times}150+150{\times}100+10{\times}200+200{\times}150+200{\times}80+200{\times}80+250{\times}50}{526.9\sqrt{100^2+20^2+150^2+100^2+200^2+150^2+80^2+80^2+50^2}} = \frac{134500}{526.9\times347.4} = 0.73$$

Hence, object at location (2,3).

(b) Sum of absolute differences.

$$\text{Distance} = \sum_{i,j} |T(i,j)-I(i,j)|$$

At pixel (2,2): Distance = $|100-60|+|150-50|+|200-40|+|150-150|+|10-100|+|200-100|+|200-50|+|200-20|+|250-200| = 870$

At pixel (3,2): Distance = $|100-50|+|150-40|+|200-40|+|150-100|+|10-100|+|200-80|+|200-20|+|200-200|+|250-80| = 930$

At pixel (2,3): Distance = $|100-150|+|150-100|+|200-100|+|150-50|+|10-20|+|200-200|+|200-200|+|200-150|+|250-150| = 460$

At pixel (3,3): Distance = $|100-100|+|150-100|+|200-80|+|150-20|+|10-200|+|200-80|+|200-150|+|200-150|+|250-50| = 910$

Hence, object at location (2,3).
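Both measures can be checked mechanically. Below is a minimal NumPy sketch; the 1-indexed (x, y) = (column, row) labelling of window centres is an assumption chosen to match the pixel labels used above.

```python
# Slide the template over every valid window and report both measures.
import numpy as np

T = np.array([[100, 150, 200],
              [150,  10, 200],
              [200, 200, 250]], dtype=float)

I = np.array([[ 60,  50,  40,  40],
              [150, 100, 100,  80],
              [ 50,  20, 200,  80],
              [200, 150, 150,  50]], dtype=float)

h, w = T.shape
for r in range(I.shape[0] - h + 1):          # top-left row of the window
    for c in range(I.shape[1] - w + 1):      # top-left column of the window
        W = I[r:r+h, c:c+w]
        ncc = (T * W).sum() / (np.sqrt((T**2).sum()) * np.sqrt((W**2).sum()))
        sad = np.abs(T - W).sum()
        # window centre reported as (x, y) = (column, row), 1-indexed
        print(f"centre ({c+2},{r+2}): NCC = {ncc:.2f}, SAD = {sad:.0f}")
```

Running this reproduces the working above: the NCC peaks (0.94) and the SAD bottoms out (460) at pixel (2,3).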
2. A computer vision system uses template matching to perform object recognition. The system needs to detect 20 different objects each of which can be seen from 12 different viewpoints, each of which requires a different template. If an image is 300 by 200 pixels, and templates are 11 by 11 pixels, how many floating-point operations are required to process one image if cross-correlation is used as the similarity measure?
The system uses 20 × 12 templates. Each template is matched to (300 − 10) × (200 − 10) image locations (assuming we don’t allow the template to fall off the edge of the image). Hence, similarity is calculated 20 × 12 × (300 − 10) × (200 − 10) = 13, 224, 000 times when processing one image.
For cross-correlation, each matching operation requires 11 × 11 multiplications plus 11 × 11 − 1 additions, i.e., 2 × (11 × 11) − 1 = 241 floating-point operations.
Therefore, processing one image requires $13{,}224{,}000 \times 241 \approx 3.2 \times 10^9$ floating-point operations.
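The arithmetic can be verified with a few lines of plain Python:

```python
# Check of the operation count above, using only figures from the question.
objects, viewpoints = 20, 12
img_w, img_h = 300, 200
t = 11                                             # template side length

locations = (img_w - (t - 1)) * (img_h - (t - 1))  # 290 * 190 valid positions
matches = objects * viewpoints * locations         # similarity evaluations per image
flops_per_match = 2 * t * t - 1                    # 121 multiplications + 120 additions
print(matches, matches * flops_per_match)          # 13224000 3186984000
```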
3. Below are shown three binary templates T1, T2 and T3, together with a patch I of a binary image. Determine which template best matches the image patch using the following similarity measures: (a) cross-correlation, (b) normalised cross-correlation, (c) correlation coefficient, (d) sum of absolute differences.
$$T_1 = \begin{pmatrix}1&1&1\\1&0&0\\1&1&1\end{pmatrix}, \quad T_2 = \begin{pmatrix}1&0&0\\1&0&0\\1&1&1\end{pmatrix}, \quad T_3 = \begin{pmatrix}1&1&1\\1&0&1\\1&1&1\end{pmatrix}, \quad I = \begin{pmatrix}1&1&1\\1&0&0\\1&1&1\end{pmatrix}$$
(a) Cross-correlation. $\text{Similarity} = \sum_{i,j} T(i,j)I(i,j)$

For T1: Similarity = 7
For T2: Similarity = 5
For T3: Similarity = 7

Both T1 and T3 match equally well.

(b) Normalised cross-correlation. $\text{Similarity} = \frac{\sum_{i,j} T(i,j)I(i,j)}{\sqrt{\sum_{i,j} T(i,j)^2}\,\sqrt{\sum_{i,j} I(i,j)^2}}$

For T1: Similarity = $\frac{7}{\sqrt{7}\times\sqrt{7}} = 1$
For T2: Similarity = $\frac{5}{\sqrt{5}\times\sqrt{7}} = 0.85$
For T3: Similarity = $\frac{7}{\sqrt{8}\times\sqrt{7}} = 0.94$

T1 is the best match.

(c) Correlation coefficient. $\text{Similarity} = \frac{\sum_{i,j}(T(i,j)-\bar{T})(I(i,j)-\bar{I})}{\sqrt{\sum_{i,j}(T(i,j)-\bar{T})^2}\,\sqrt{\sum_{i,j}(I(i,j)-\bar{I})^2}}$

$$T_1-\bar{T}_1 = \begin{pmatrix}0.22&0.22&0.22\\0.22&-0.78&-0.78\\0.22&0.22&0.22\end{pmatrix}, \quad T_2-\bar{T}_2 = \begin{pmatrix}0.44&-0.56&-0.56\\0.44&-0.56&-0.56\\0.44&0.44&0.44\end{pmatrix},$$
$$T_3-\bar{T}_3 = \begin{pmatrix}0.11&0.11&0.11\\0.11&-0.89&0.11\\0.11&0.11&0.11\end{pmatrix}, \quad I-\bar{I} = \begin{pmatrix}0.22&0.22&0.22\\0.22&-0.78&-0.78\\0.22&0.22&0.22\end{pmatrix}$$

For T1: Similarity = $\frac{7(0.22^2)+2(-0.78)^2}{\sqrt{(7(0.22^2)+2(-0.78)^2)(7(0.22^2)+2(-0.78)^2)}} = 1$

For T2: Similarity = $\frac{5(0.44\times0.22)+2(-0.56\times0.22)+2(-0.56\times-0.78)}{\sqrt{(5(0.44^2)+4(-0.56)^2)(7(0.22^2)+2(-0.78)^2)}} = \frac{1.111}{1.491\times1.247} = 0.60$

For T3: Similarity = $\frac{7(0.11\times0.22)+(0.11\times-0.78)+(-0.89\times-0.78)}{\sqrt{(8(0.11^2)+(-0.89)^2)(7(0.22^2)+2(-0.78)^2)}} = \frac{0.777}{0.943\times1.247} = 0.66$

T1 is the best match.

(d) Sum of absolute differences. $\text{Distance} = \sum_{i,j} |T(i,j)-I(i,j)|$

For T1: Distance = 7|1−1| + 2|0−0| = 0
For T2: Distance = 5|1−1| + 2|1−0| + 2|0−0| = 2
For T3: Distance = 7|1−1| + 1|1−0| + 1|0−0| = 1

T1 is the best match.
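All four measures can be evaluated together with a short NumPy sketch along the following lines:

```python
# Evaluate all four similarity measures for each binary template against I.
import numpy as np

T1 = np.array([[1,1,1],[1,0,0],[1,1,1]], dtype=float)
T2 = np.array([[1,0,0],[1,0,0],[1,1,1]], dtype=float)
T3 = np.array([[1,1,1],[1,0,1],[1,1,1]], dtype=float)
I  = np.array([[1,1,1],[1,0,0],[1,1,1]], dtype=float)

for name, T in [("T1", T1), ("T2", T2), ("T3", T3)]:
    cc  = (T * I).sum()                                    # cross-correlation
    ncc = cc / (np.sqrt((T**2).sum()) * np.sqrt((I**2).sum()))
    Td, Id = T - T.mean(), I - I.mean()                    # mean-subtracted
    corr = (Td * Id).sum() / (np.sqrt((Td**2).sum()) * np.sqrt((Id**2).sum()))
    sad = np.abs(T - I).sum()                              # sum of absolute differences
    print(f"{name}: CC={cc:.0f}  NCC={ncc:.2f}  corr={corr:.2f}  SAD={sad:.0f}")
```

This prints CC = 7, 5, 7; NCC = 1.00, 0.85, 0.94; correlation coefficient = 1.00, 0.60, 0.66; and SAD = 0, 2, 1, matching the working above.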
4. Below is shown an edge template T and a binary image I which has been pre-processed to extract edges. Calculate the result of performing edge matching on the image and, hence, suggest the location of the object depicted in the edge template, assuming that there is exactly one such object in the image. Calculate the distance between the template and the image as the average of the minimum distances between points on the edge template (T) and points on the edge image (I).
$$T = \begin{pmatrix}1&1&1\\1&0&1\\1&1&1\end{pmatrix}, \qquad I = \begin{pmatrix}0&0&0&0\\1&1&1&0\\0&0&1&0\\1&1&1&0\end{pmatrix}$$

At pixel (2,2): Distance = $\frac{1}{8}[1+1+1+0+0+1+1+0] = 0.625$
At pixel (3,2): Distance = $\frac{1}{8}[1+1+\sqrt{2}+0+1+1+0+1] = 0.802$
At pixel (2,3): Distance = $\frac{1}{8}[0+0+0+1+0+0+0+0] = 0.125$
At pixel (3,3): Distance = $\frac{1}{8}[0+0+1+1+1+0+0+1] = 0.5$

Hence, object at location (2,3).
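A minimal NumPy sketch of this average minimum-distance (chamfer-like) matching, using the same (x, y) = (column, row), 1-indexed centre labelling as above:

```python
# For each template edge point, find the Euclidean distance to the nearest
# image edge point; the match score is the mean of these minima.
import numpy as np

T = np.array([[1,1,1],[1,0,1],[1,1,1]])
I = np.array([[0,0,0,0],[1,1,1,0],[0,0,1,0],[1,1,1,0]])

img_pts = np.argwhere(I == 1)          # (row, col) of every image edge point
tpl_pts = np.argwhere(T == 1)          # template edge points, as offsets

for r in range(I.shape[0] - 2):        # top-left row of the 3x3 window
    for c in range(I.shape[1] - 2):    # top-left column of the 3x3 window
        placed = tpl_pts + [r, c]      # template points in image coordinates
        # pairwise distances: (template points) x (image edge points)
        d = np.linalg.norm(placed[:, None, :] - img_pts[None, :, :], axis=2)
        avg = d.min(axis=1).mean()     # average of the per-point minima
        print(f"centre ({c+2},{r+2}): distance = {avg:.3f}")
```

This reproduces the four distances above (0.625, 0.802, 0.125, 0.5), with the minimum at pixel (2,3).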
5. One method of object recognition is comparison of intensity histograms. Briefly describe two advantages and two disadvantages of this method.

Advantages:
• Fast
• Unaffected by viewpoint changes

Disadvantages:
• Sensitive to illumination changes
• Insensitive to spatial configuration changes
6. In a very simple feature-matching object recognition system each keypoint has x,y-coordinates and a 3-element feature vector. Two training images, one of object A and the other of object B, have been processed to create a database of
known objects, as shown below:
Object   Keypoint Number   Coordinates (pixels)   Feature Vector
A        A1                (20,5)                 (1,6,10)
A        A2                (10,40)                (7,8,15)
A        A3                (40,25)                (2,9,3)
B        B1                (20,10)                (6,1,12)
B        B2                (30,5)                 (13,4,8)
B        B3                (30,45)                (3,8,4)
The keypoints and feature vectors extracted from a new image are as follows:
Keypoint Number   Coordinates (pixels)   Feature Vector
N1                (16,50)                (5,8,15)
N2                (25,14)                (2,6,11)
N3                (30,31)                (12,3,8)
N4                (40,45)                (5,2,11)
N5                (44,34)                (2,8,3)
Perform feature matching using the sum of absolute differences as the distance measure, applying the following criterion for accepting a match: the ratio of the distance to the first nearest descriptor to the distance to the second nearest must be less than 0.4.
It is known that objects in different images are related by a pure translation in the image plane. Hence, use the RANSAC algorithm to assess the consistency of the matched points and so determine which of the two training objects is present in the new image. Apply RANSAC exhaustively to all matches, rather than to a subset of matches chosen at random, and assume the threshold for comparing the model’s prediction with the data is 3 pixels.
SAD distance between each keypoint in the training image database and each keypoint in the new image:
Feature Vectors   N1 (5,8,15)   N2 (2,6,11)   N3 (12,3,8)   N4 (5,2,11)   N5 (2,8,3)
A1 (1,6,10)           11             2             16            9            10
A2 (7,8,15)            2            11             17           12            17
A3 (2,9,3)            16            11             21           18             1
B1 (6,1,12)           11            10             12            3            20
B2 (13,4,8)           19            16              2           13            20
B3 (3,8,4)            13            10             18           15             2
Ratio 1st to 2nd:  2/11 = 0.18   2/10 = 0.2    2/12 = 0.17   3/9 = 0.33    1/2 = 0.5

Hence, the following matches are found (N5 is rejected because its ratio, 0.5, is not less than 0.4):

Keypoints   Coordinates
N1-A2       (16,50) – (10,40)
N2-A1       (25,14) – (20,5)
N3-B2       (30,31) – (30,5)
N4-B1       (40,45) – (20,10)

Applying RANSAC.
Choose N1. Model is a translation of (16,50) − (10,40) = (6,10). Locations of matching points predicted by the model are:
For N2: (25,14) − (6,10) = (19,4); the actual match is at (20,5), within the 3-pixel threshold, hence this is an inlier for this model. Hence, consensus set = 1.
Choosing N2 will also predict a translation consistent with N1.
Choose N3. Model is a translation of (30,31) − (30,5) = (0,26). Locations of matching points predicted by the model are:
For N4: (40,45) − (0,26) = (40,19); the actual match is at (20,10), well outside the 3-pixel threshold, hence this is an outlier for this model. Hence, consensus set = 0.
Choosing N4 will predict a translation inconsistent with N3.
The new image contains object A, as this is the only object for which the locations of the matching keypoints are consistent.
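The matching, ratio test, and exhaustive RANSAC can be sketched as follows (a minimal NumPy sketch; note that comparing the translations implied by two matches is equivalent to comparing the model’s predicted location with the actual match location):

```python
# Feature matching by SAD with a ratio test, then exhaustive RANSAC over a
# pure-translation model with a 3-pixel threshold.
import numpy as np

db = {  # training keypoints: name -> (coordinates, feature vector)
    "A1": ((20,  5), (1, 6, 10)), "A2": ((10, 40), (7, 8, 15)),
    "A3": ((40, 25), (2, 9,  3)), "B1": ((20, 10), (6, 1, 12)),
    "B2": ((30,  5), (13, 4, 8)), "B3": ((30, 45), (3, 8,  4)),
}
new = {
    "N1": ((16, 50), (5, 8, 15)), "N2": ((25, 14), (2, 6, 11)),
    "N3": ((30, 31), (12, 3, 8)), "N4": ((40, 45), (5, 2, 11)),
    "N5": ((44, 34), (2, 8,  3)),
}

# Ratio test: accept a match only if 1st-best SAD / 2nd-best SAD < 0.4.
matches = []
for n, (n_xy, n_f) in new.items():
    dists = sorted((sum(abs(a - b) for a, b in zip(n_f, d_f)), k, d_xy)
                   for k, (d_xy, d_f) in db.items())
    if dists[0][0] / dists[1][0] < 0.4:
        matches.append((n, dists[0][1], np.array(n_xy) - np.array(dists[0][2])))

# Exhaustive RANSAC: each match proposes a translation; count as inliers the
# matches whose own translation agrees with the model within 3 pixels.
best = max(matches, key=lambda m: sum(
    np.linalg.norm(m[2] - other[2]) < 3 for other in matches))
inliers = [o for o in matches if np.linalg.norm(best[2] - o[2]) < 3]
print([(n, k) for n, k, _ in inliers])   # [('N1', 'A2'), ('N2', 'A1')] -> object A
```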
7. In a simple bag-of-words object recognition system, images are represented by histograms showing the number of occurrences of 10 “codewords”. The numbers of occurrences of the codewords in three training images are given below:
ObjectA = (2,0,0,5,1,0,0,0,3,1)
ObjectB = (0,0,1,2,0,3,1,0,1,0)
ObjectC = (1,1,2,0,0,1,0,3,1,1)
A new image is encoded as follows:
New = (2,1,1,0,1,1,0,2,0,1)
Determine the training image that best matches the new image by finding the cosine of the angle between the codeword vectors.
$$\text{Similarity} = \cos(\theta) = \frac{\sum_i A(i)N(i)}{\sqrt{\sum_i A(i)^2}\,\sqrt{\sum_i N(i)^2}}$$
(i.e., the normalised cross-correlation).
Similarity between New image and ObjectA is:
$$\frac{(2\times2)+(1\times0)+(1\times0)+(0\times5)+(1\times1)+(1\times0)+(0\times0)+(2\times0)+(0\times3)+(1\times1)}{\sqrt{2^2+1^2+1^2+0^2+1^2+1^2+0^2+2^2+0^2+1^2}\,\sqrt{2^2+0^2+0^2+5^2+1^2+0^2+0^2+0^2+3^2+1^2}} = \frac{6}{3.6\times6.3} = 0.26$$

Similarity between New image and ObjectB is:
$$\frac{(2\times0)+(1\times0)+(1\times1)+(0\times2)+(1\times0)+(1\times3)+(0\times1)+(2\times0)+(0\times1)+(1\times0)}{\sqrt{2^2+1^2+1^2+0^2+1^2+1^2+0^2+2^2+0^2+1^2}\,\sqrt{0^2+0^2+1^2+2^2+0^2+3^2+1^2+0^2+1^2+0^2}} = \frac{4}{3.6\times4} = 0.28$$

Similarity between New image and ObjectC is:
$$\frac{(2\times1)+(1\times1)+(1\times2)+(0\times0)+(1\times0)+(1\times1)+(0\times0)+(2\times3)+(0\times1)+(1\times1)}{\sqrt{2^2+1^2+1^2+0^2+1^2+1^2+0^2+2^2+0^2+1^2}\,\sqrt{1^2+1^2+2^2+0^2+0^2+1^2+0^2+3^2+1^2+1^2}} = \frac{13}{3.6\times4.2} = 0.86$$

Hence, the new image is most similar to ObjectC.
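A minimal NumPy sketch of the comparison (with unrounded norms ObjectC scores 0.85; the 0.86 above comes from rounding the norms to 3.6 and 4.2 first):

```python
# Cosine similarity between the new codeword histogram and each training one.
import numpy as np

hists = {
    "ObjectA": np.array([2, 0, 0, 5, 1, 0, 0, 0, 3, 1], dtype=float),
    "ObjectB": np.array([0, 0, 1, 2, 0, 3, 1, 0, 1, 0], dtype=float),
    "ObjectC": np.array([1, 1, 2, 0, 0, 1, 0, 3, 1, 1], dtype=float),
}
new = np.array([2, 1, 1, 0, 1, 1, 0, 2, 0, 1], dtype=float)

for name, h in hists.items():
    cos = (h @ new) / (np.linalg.norm(h) * np.linalg.norm(new))
    print(f"{name}: {cos:.2f}")   # ObjectA: 0.26, ObjectB: 0.28, ObjectC: 0.85
```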
8. A computer vision system is to be developed that can read digits from a 7-segment LCD display (like that on a standard calculator). On such a display, the numbers 0 to 9 are generated by turning on specific combinations of segments, as shown below.
A simple bag-of-words object recognition system is to be used. The codeword dictionary consists of two features: (1) a vertical line, and (2) a horizontal line.
(a) How would the digits 0 to 9 be encoded?
(b) Does this system succeed in recognising all 10 digits?
(c) Suggest an alternative object recognition method that might work better.
(d) If the camera capturing images of the LCD display gets rotated 180 degrees around the optical axis, what effect does this have on the bag-of-words solution and your alternative method?
(e) If the LCD display shows multiple digits simultaneously, what effect does this have on the bag-of-words solution and your alternative method?
(a) encoding = (vertical, horizontal)
0 = (4,2)
1 = (2,0)
2 = (2,3)
3 = (2,3)
4 = (3,1)
5 = (2,3)
6 = (3,3)
7 = (3,1)
8 = (4,3)
9 = (3,3)
(b) No. Digits 2, 3, and 5 all have the same encoding, as do digits 4 and 7 and digits 6 and 9. These digits cannot be told apart.
(c) Template matching.
(d) If the image is rotated 180 degrees, this:
• has no effect on the bag-of-words solution (inverted input images will be encoded in the same way as upright training images, but the system still fails for the reason given in part (b)).
• has a large effect on template matching, as most templates will fail to match the upside-down input, and some templates (such as those for 2 and 5) will match the wrong inputs.

(e) If the image contains multiple digits, this:
• makes the bag-of-words solution even more infeasible, as the encoding of the input image will contain features from separate objects.
• has no effect on template matching, except that multiple matches may now be found.
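The collision analysis in part (b) can be automated; a minimal sketch using the encodings from part (a):

```python
# Group digits by their (vertical, horizontal) bag-of-words encoding and
# report any encodings shared by more than one digit.
from collections import defaultdict

encoding = {0: (4, 2), 1: (2, 0), 2: (2, 3), 3: (2, 3), 4: (3, 1),
            5: (2, 3), 6: (3, 3), 7: (3, 1), 8: (4, 3), 9: (3, 3)}

groups = defaultdict(list)
for digit, code in encoding.items():
    groups[code].append(digit)

for code, digits in groups.items():
    if len(digits) > 1:
        print(f"{code}: digits {digits} are indistinguishable")
# (2, 3): [2, 3, 5]; (3, 1): [4, 7]; (3, 3): [6, 9]
```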
9. A computer vision system is to be developed that can read digits from a 7-segment LCD display (like that on a standard calculator). On such a display, the numbers 0 to 9 are generated by turning on specific combinations of segments, as shown below.
A simple bag-of-words object recognition system is to be used. The SIFT feature detector has been used to locate features in the 10 training digits in order to create a codeword dictionary. Due to the rotation invariance of the SIFT descriptor, only three distinct features are identified: (1) an “L” shaped corner (at any orientation), (2) a “T” shaped corner (at any orientation), (3) a line termination, or end point (at any orientation).
(a) How would the digits 0 to 9 be encoded?
(b) Does this system succeed in recognising all 10 digits?
(a) encoding = (L, T, end)
0 = (4,0,0)
1 = (0,0,2)
2 = (4,0,2)
3 = (2,1,3)
4 = (1,1,3)
5 = (4,0,2)
6 = (4,1,1)
7 = (2,0,2)
8 = (4,2,0)
9 = (4,1,1)
(b) No. Digits 2 and 5 have the same encoding, as do digits 6 and 9. These digits cannot be told apart.
10. Projective geometry does not preserve distances or angles. However, the cross-ratio (which is a ratio of ratios of distances) is preserved. Given four collinear points p1, p2, p3, and p4, the cross-ratio is defined as:
$$Cr(p_1,p_2,p_3,p_4) = \frac{\Delta_{13}\Delta_{24}}{\Delta_{14}\Delta_{23}}$$
where $\Delta_{ij}$ is the distance between two points $p_i$ and $p_j$.
Four collinear points are at the following 3D coordinates relative to the camera reference frame: p1 = [40,−40,400], p2 = [23.3,−6.7,483.3], p3 = [15,10,525], p4 = [−10,60,650].
(a) Calculate the cross-ratio for these points in 3D space.
(b) Calculate the cross-ratio for these points in the image seen by the camera. The image principal point is at coordinates [244,180] pixels, and the magnification factors in the x and y directions are 925 and 740. Assume that the camera does not suffer from skew or any other defect.
(a)
$$\Delta_{13} = \sqrt{(40-15)^2+(-40-10)^2+(400-525)^2} = 136.9$$
$$\Delta_{24} = \sqrt{(23.3-(-10))^2+(-6.7-60)^2+(483.3-650)^2} = 182.6$$
$$\Delta_{14} = \sqrt{(40-(-10))^2+(-40-60)^2+(400-650)^2} = 273.9$$
$$\Delta_{23} = \sqrt{(23.3-15)^2+(-6.7-10)^2+(483.3-525)^2} = 45.7$$

$$Cr(p_1,p_2,p_3,p_4) = \frac{\Delta_{13}\Delta_{24}}{\Delta_{14}\Delta_{23}} = \frac{136.9\times182.6}{273.9\times45.7} = 2$$

(b) Each point is projected into the image using the perspective projection equation:

$$\begin{pmatrix}u\\v\\1\end{pmatrix} = \frac{1}{z}\begin{pmatrix}\alpha&0&o_x&0\\0&\beta&o_y&0\\0&0&1&0\end{pmatrix}\begin{pmatrix}x\\y\\z\\1\end{pmatrix} = \frac{1}{z}\begin{pmatrix}925&0&244&0\\0&740&180&0\\0&0&1&0\end{pmatrix}\begin{pmatrix}x\\y\\z\\1\end{pmatrix}$$

which gives $u = 925x/z + 244$ and $v = 740y/z + 180$. Hence:

$$(u_1,v_1) = \left(\tfrac{925\times40}{400}+244,\ \tfrac{740\times(-40)}{400}+180\right) = (336.5,\ 106)$$
$$(u_2,v_2) = \left(\tfrac{925\times23.3}{483.3}+244,\ \tfrac{740\times(-6.7)}{483.3}+180\right) = (288.6,\ 169.7)$$
$$(u_3,v_3) = \left(\tfrac{925\times15}{525}+244,\ \tfrac{740\times10}{525}+180\right) = (270.4,\ 194.1)$$
$$(u_4,v_4) = \left(\tfrac{925\times(-10)}{650}+244,\ \tfrac{740\times60}{650}+180\right) = (229.8,\ 248.3)$$

$$\Delta_{13} = \sqrt{(336.5-270.4)^2+(106-194.1)^2} = 110.1$$
$$\Delta_{24} = \sqrt{(288.6-229.8)^2+(169.7-248.3)^2} = 98.2$$
$$\Delta_{14} = \sqrt{(336.5-229.8)^2+(106-248.3)^2} = 177.9$$
$$\Delta_{23} = \sqrt{(288.6-270.4)^2+(169.7-194.1)^2} = 30.4$$

$$Cr(p_1,p_2,p_3,p_4) = \frac{\Delta_{13}\Delta_{24}}{\Delta_{14}\Delta_{23}} = \frac{110.1\times98.2}{177.9\times30.4} = 2$$
Note that, in reality, image coordinates would be rounded to integer values (as they are in pixels), and this might result in a slight error in the value calculated for the cross-ratio.
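A minimal NumPy sketch verifying the invariance numerically, using the unrounded projections:

```python
# Cross-ratio of four collinear points, before and after perspective projection.
import numpy as np

pts = np.array([[40.0, -40.0, 400.0], [23.3, -6.7, 483.3],
                [15.0,  10.0, 525.0], [-10.0, 60.0, 650.0]])

def cross_ratio(p):
    d = lambda i, j: np.linalg.norm(p[i] - p[j])   # distance between points i, j
    return (d(0, 2) * d(1, 3)) / (d(0, 3) * d(1, 2))

print(cross_ratio(pts))                  # ~2.0 in 3D

# Project with u = 925 x/z + 244, v = 740 y/z + 180 (no skew or other defects)
uv = np.stack([925 * pts[:, 0] / pts[:, 2] + 244,
               740 * pts[:, 1] / pts[:, 2] + 180], axis=1)
print(cross_ratio(uv))                   # ~2.0 in the image as well
```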