Instructions
Fundamentals of Computer Vision Project 3
Tracking Objects in Videos
1. Integrity and collaboration: Students are encouraged to work in groups, but each student must submit their own work. If you work as a group, include the names of your collaborators in your write-up. Code should NOT be shared or copied. Please DO NOT use external code unless permitted. Plagiarism is strictly prohibited and may lead to failure of this course.
2. Start early! Running the code on the videos can take a lot of time, making debugging very slow.
3. Verify your implementation as you proceed! Otherwise you risk building up a huge mess of malfunctioning code that can go wrong anywhere.
4. Questions: If you have any questions, please check the Discussion board on Canvas first. Other students may have encountered the same problem, and it may already have been solved. If not, post your question on the discussion board, and a TA will respond as soon as possible.
5. Write-up: Your write-up should mainly consist of two parts, your answers to theory questions and your insights as you attempt the programming questions. Specific items to be included in the write-up are mentioned in each question.
6. Code: Stick to the function prototypes mentioned in the handout. This makes verifying code easier for the TA. If you do want to change a function prototype or add an extra parameter, please talk to the TA.
7. Submission: Your submission for this assignment should be a zip file. Your final upload should have the files arranged in this layout:

    LucasKanade.m
    affineMBTracker.m
    initAffineMBTracker.m
    lk_demo.m
    mb_demo.m
    results/
        car.mp4
Figure 1: Sample images from the video sequence provided
Overview
One incredibly important aspect of human and animal vision is the ability to follow objects and people in our view. Whether it is a tiger chasing its prey, or you trying to catch a basketball, tracking is so integral to our everyday lives that we forget how much we rely on it. In this assignment, you will be implementing an algorithm that will track an object in a video.
You will first implement the Lucas-Kanade tracker, and then a more computationally effi- cient version called the Matthew-Baker (or inverse compositional) method [1]. This method is one of the most commonly used methods in computer vision due to its simplicity and wide applicability. We have provided two video sequences: a car on a road, and a helicopter approaching a runway.
To initialize the tracker you need to define a template by drawing a bounding box around the object to be tracked in the first frame of the video. For each of the subsequent frames the tracker will update an affine transform that warps the current frame so that the template in the first frame is aligned with the warped current frame.
Preliminaries
An image transformation or warp is an operation that acts on pixel coordinates and maps pixel values from one place to another in an image. Translation, rotation and scaling are all examples of warps. We will use the symbol W to denote warps. A warp function W has a set of parameters p associated with it and maps a pixel with coordinates x = [u v]^T to x′ = [u′ v′]^T.
    x′ = W(x; p)        (1)

An affine transform is a warp that can include any combination of translation, anisotropic scaling and rotation. An affine warp can be parametrized in terms of 6 parameters p = [p1 p2 p3 p4 p5 p6]^T. One of the convenient things about an affine transformation is that it is linear; its action on a point with coordinates x = [u v]^T can be described as a matrix operation

    [u′]          [u]
    [v′] = W(p)   [v]        (2)
    [1 ]          [1]

where W(p) is a 3 × 3 matrix such that

           [1 + p1     p3    p5]
    W(p) = [   p2   1 + p4   p6]        (3)
           [   0       0      1]
Note that for convenience when we want to refer to the warp as a function we will use W(x; p) and when we want to refer to the matrix for an affine warp we will use W(p). Table 1 contains a summary of the variables used in the next two sections. It will be useful to keep these in mind.
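As a concrete illustration (not part of the required MATLAB submission), the parametrization of Equation 3 can be sketched in Python/NumPy; the function name warp_matrix is our own:

```python
import numpy as np

def warp_matrix(p):
    """Build the 3x3 affine warp matrix W(p) of Equation 3 from the 6-vector p."""
    p1, p2, p3, p4, p5, p6 = p
    return np.array([[1 + p1,     p3,  p5],
                     [    p2, 1 + p4,  p6],
                     [   0.0,    0.0, 1.0]])

# p = 0 gives the identity warp; p5 and p6 are the translation components.
p = np.array([0.0, 0.0, 0.0, 0.0, 2.0, -1.0])   # pure translation by (2, -1)
u_new, v_new, _ = warp_matrix(p) @ np.array([10.0, 20.0, 1.0])
# (u_new, v_new) == (12.0, 19.0)
```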
Table 1: Summary of Variables

    Symbol    Size          Description
    u         1×1           image horizontal coordinate
    v         1×1           image vertical coordinate
    x         2×1 or 1×1    pixel coordinates: (u, v) or unrolled
    I         m×1           image unrolled into a vector (m pixels)
    T         m×1           template unrolled into a vector (m pixels)
    W(p)      3×3           affine warp matrix
    p         6×1           parameters of the affine warp
    ∂I/∂u     m×1           partial derivative of the image wrt u
    ∂I/∂v     m×1           partial derivative of the image wrt v
    ∂T/∂u     m×1           partial derivative of the template wrt u
    ∂T/∂v     m×1           partial derivative of the template wrt v
    ∇I        m×2           image gradient ∇I(x) = [∂I(x)/∂u  ∂I(x)/∂v]
    ∇T        m×2           template gradient ∇T(x) = [∂T(x)/∂u  ∂T(x)/∂v]
    ∂W/∂p     2×6           Jacobian of the affine warp wrt its parameters
    J         m×6           Jacobian of the error function L wrt p
    H         6×6           pseudo-Hessian of L wrt p
Lucas-Kanade: Forward Additive Alignment
A Lucas-Kanade tracker maintains a warp W(x; p) which aligns a sequence of images It to a template T. We denote pixel locations by x, so I(x) is the pixel value at location x in image I. For the purposes of this derivation, I and T are treated as column vectors (think of them as unrolled image matrices). W(x; p) is the point obtained by warping x with a transform that has parameters p. W can be any transformation that is continuous in its parameters p. Examples of valid warp classes for W include translations (2 parameters), affine transforms (6 parameters) and full projective transforms (8 parameters). The Lucas-Kanade tracker minimizes the pixel-wise sum of squared differences between the warped image I(W(x; p)) and the template T.
In order to align an image or patch to a reference template, we seek to find the parameter vector p that minimizes L, where:
    L = Σ_x [T(x) − I(W(x; p))]²        (4)
In general this is a difficult non-linear optimization, but if we assume we already have a close estimate p of the correct warp, then we can assume that a small linear change ∆p is enough to get the best alignment. This is the forward additive form of the warp. The objective can then be written as:
    L = Σ_x [T(x) − I(W(x; p + ∆p))]²        (5)
Expanding this to first order with a Taylor series gives us:

    L ≈ Σ_x [T(x) − I(W(x; p)) − ∇I(x) (∂W/∂p) ∆p]²        (6)

Here ∇I(x) = [∂I(x)/∂u  ∂I(x)/∂v], which is the vector containing the horizontal and vertical gradients at pixel location x. Rearranging the Taylor expansion, it can be rewritten as a typical least-squares approximation:

    ∆p* = argmin_∆p ||A∆p − b||²        (7)

    ∆p* = argmin_∆p Σ_x ||∇I (∂W/∂p) ∆p − [T(x) − I(W(x; p))]||²        (8)

This can be solved with ∆p* = (A^T A)^(−1) A^T b, where:

    A = ∇I (∂W/∂p)        (9)

    b = T(x) − I(W(x; p))        (10)

and the pseudo-Hessian is H = (A^T A) = Σ_x (∇I ∂W/∂p)^T (∇I ∂W/∂p).
Once ∆p is computed, the best estimate of the warp can be updated as p ← p + ∆p, and the whole procedure repeated, stopping when the magnitude of ∆p falls below some threshold.
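To make the iteration concrete, here is a minimal Python/NumPy sketch of the forward additive update for the simplest case, a pure-translation warp p = [u v]; for translation ∂W/∂p is the 2 × 2 identity, so A is just the m × 2 gradient matrix. The assignment itself asks for MATLAB, and the function and variable names here are our own:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def lk_translation(It, It1, rect, num_iters=50, tol=1e-4):
    """Forward additive Lucas-Kanade for a pure-translation warp.

    rect = [x, y, w, h]: inclusive template rectangle in frame It,
    as in the handout. Returns p = [u, v]."""
    x, y, w, h = rect
    ys, xs = np.meshgrid(np.arange(y, y + h + 1),
                         np.arange(x, x + w + 1), indexing="ij")
    T = map_coordinates(It, [ys, xs], order=1)            # template pixels
    p = np.zeros(2)
    for _ in range(num_iters):
        # I(W(x; p)): sample the next frame at the translated coordinates
        Iw = map_coordinates(It1, [ys + p[1], xs + p[0]], order=1)
        gy, gx = np.gradient(Iw)                          # warped-image gradients
        A = np.stack([gx.ravel(), gy.ravel()], axis=1)    # m x 2 (dW/dp = I2)
        b = (T - Iw).ravel()                              # error image
        dp, *_ = np.linalg.lstsq(A, b, rcond=None)        # solve A dp = b
        p += dp
        if np.linalg.norm(dp) < tol:
            break
    return p
```

Each iteration warps the current frame by the running estimate (the role interp2 plays in MATLAB), solves the least-squares system of Equations 7-10, and accumulates ∆p into p.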
Matthew-Baker: Inverse Compositional Alignment
While Lucas-Kanade alignment works very well, it is computationally expensive. The inverse compositional method is similar, but requires less computation, as the Hessian and Jacobian only need to be computed once. One caveat is that the warp needs to be invertible. Since affine warps are invertible, we can use this method.
In the previous section, we combined two warps by simply adding one parameter vector to another, producing a new warp W(x; p + p′). Another way of combining warps is through composition. After applying a warp W(x; p) to an image, another warp W(x; q) can be applied to the warped image. The resultant (combined) warp is
    W(x; q) ◦ W(x; p) = W(W(x; p), q)        (11)

Since affine warps can be implemented as matrix multiplications, composing two affine warps reduces to multiplying their corresponding matrices:

    W(x; q) ◦ W(x; p) = W(W(x; p), q) = W(W(p)x, q) = W(q)W(p)x        (12)
An affine transform can also be inverted. The inverse warp of W(p) is simply the matrix inverse of W(p), W(p)−1. In this assignment it will sometimes be simpler to consider an affine warp as a set of 6 parameters in a vector p and it will sometimes be easier to work with the matrix version W(p). Fortunately, switching between these two forms is easy (Equation 3).
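Since composition and inversion reduce to 3 × 3 matrix algebra, they are easy to check numerically; a small NumPy sketch using two made-up warps:

```python
import numpy as np

# Two affine warps written as the 3x3 matrices of Equation 3
Wp = np.array([[1.0, 0.0, 2.0],   # W(p): translate by (2, 3)
               [0.0, 1.0, 3.0],
               [0.0, 0.0, 1.0]])
Wq = np.array([[1.1, 0.0, 0.0],   # W(q): scale by 1.1 about the origin
               [0.0, 1.1, 0.0],
               [0.0, 0.0, 1.0]])

x = np.array([4.0, 5.0, 1.0])     # homogeneous point [u v 1]^T

# Composition (Equation 12): apply W(p) first, then W(q)
W_comp = Wq @ Wp                  # maps x to [6.6, 8.8, 1.0]

# Inversion: the inverse warp is just the matrix inverse
W_inv = np.linalg.inv(Wp)         # W(p)^-1 undoes W(p)
```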
The minimization is performed using an iterative procedure that makes a small change (∆p) to p at each iteration. It is computationally more efficient to perform the minimization by finding the ∆p that warps the template toward the image, rather than one that warps the image toward the template. This is because the image changes with each frame of the video, but the template is fixed at initialization. We will see shortly that this choice allows us to write the Hessian and Jacobian in terms of the template, so they can be computed once at the beginning of tracking. Hence at each step, we want to find the ∆p that minimizes
    L = Σ_x [T(W(x; ∆p)) − I(W(x; p))]²        (13)
For tracking a patch template, the summation is performed only over the pixels lying inside the template region. We can expand T (W (x; ∆p)) in terms of its first order linear approximation to get
    L ≈ Σ_x [T(x) + ∇T(x) (∂W/∂p) ∆p − I(W(x; p))]²        (14)

where ∇T(x) = [∂T(x)/∂u  ∂T(x)/∂v]. To minimize, we take the derivative of L with respect to ∆p and set it to zero:

    ∂L/∂∆p = 2 Σ_x [∇T (∂W/∂p)]^T [T(x) + ∇T(x) (∂W/∂p) ∆p − I(W(x; p))] = 0        (15)
Switching from summation to vector notation and solving for ∆p, we get

    ∆p = H^(−1) J^T [I(W(x; p)) − T]        (16)

where J is the Jacobian of T(W(x; ∆p)), J = ∇T (∂W/∂p), H is the approximated Hessian H = J^T J, and I(W(x; p)) is the warped image. Note that for a given template, the Jacobian J and Hessian H are independent of p. This means they only need to be computed once and can then be reused for the entire tracking sequence.
Once ∆p has been solved for, it needs to be inverted and composed with p to get the new warp parameters for the next iteration.
W(x; p) ← W(x; p) ◦ W(x; ∆p)−1 (17)
The next iteration solves Equation 16 starting with the new value of p. Possible termination criteria include the magnitude of ∆p falling below some threshold or running for a fixed number of iterations.
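The precompute-once structure above can be sketched for the pure-translation case in Python/NumPy (for translation ∂W/∂p is the 2 × 2 identity, so J = ∇T, and composing with the inverted increment reduces to subtraction). These helper names are our own, not the required MATLAB prototypes:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def init_mb_translation(T):
    """Precompute J and inv(H) from the template alone (done once).

    For a pure-translation warp dW/dp is the 2x2 identity, so J = grad(T)."""
    gy, gx = np.gradient(T)
    J = np.stack([gx.ravel(), gy.ravel()], axis=1)   # m x 2 Jacobian
    Hinv = np.linalg.inv(J.T @ J)                    # 2 x 2 inverse pseudo-Hessian
    return J, Hinv

def mb_translation_step(I, ys, xs, p, J, Hinv, T):
    """One inverse compositional update (Equation 16).

    For translation, composing with the inverted increment (Equation 17)
    reduces to p <- p - dp."""
    Iw = map_coordinates(I, [ys + p[1], xs + p[0]], order=1)   # I(W(x; p))
    dp = Hinv @ (J.T @ (Iw - T).ravel())
    return p - dp
```

Note that only the warped image Iw and one small matrix-vector product are recomputed per iteration; J and Hinv never change, which is the source of the method's efficiency.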
1 Theory Questions (25 points)
Write your answers to the following questions in your write-up. Each question should only take a couple of lines. In particular, the "proofs" do not require any lengthy calculations. If you are lost in many lines of complicated algebra, you are doing something much too complicated (or wrong).
Q1.1: Calculating the Jacobian (15 points)
Assuming the affine warp model defined in Equation 3, derive the expression for the Jacobian matrix J in terms of the warp parameters p = [p1 p2 p3 p4 p5 p6]^T.
Q1.2: Computational complexity (10 points)
Find the computational complexity (Big O notation) for the initialization step (Pre- computing J and H−1) and for each runtime iteration (Equation 16) of the Inverse Compositional method. Express your answers in terms of n, m and p where n is the number of pixels in the template T, m is the number of pixels in an input image I and p is the number of parameters used to describe the warp W. How does this compare to the run time of the regular Lucas-Kanade method?
2 Lucas-Kanade Tracker (60 points)
For this section, the TA will grade your tracker based on the performance you achieve on the two provided video sequences: (1) data/car1/. The provided script files lk_demo.m and mb_demo.m handle reading in images, marking the template region, making tracker function calls, and displaying output on the screen. The function prototypes provided are guidelines. Please make sure that your code runs correctly with the original scripts and generates the outputs we are looking for (a frame sequence with the bounding box of the tracked target drawn on each frame) so that we can replicate your results.
Note that the only things the TA will do during grading are change the input data directory and initialize your tracker based on what you mention in your write-up. Please submit one video for each sequence in the results/ directory, with file name car.mp4. Also, please mention the initialization coordinates of your tracker for both video sequences in your write-up and in your code.
Q2.1: Write a Lucas-Kanade Tracker for a Flow Warp (20 points)
Write a function with the following signature:

[u,v] = LucasKanade(It, It1, rect)

that computes the optimal local motion from frame It to frame It+1 that minimizes Equation 4. Here It is the image frame It, It1 is the image frame It+1, and rect is the 4×1 vector that represents a rectangle on the image frame It. The four components of the rectangle are [x, y, w, h], where (x, y) is the top-left corner and (w, h) are the width and height of the bounding box. The rectangle is inclusive, i.e., it includes all four corners. To deal with fractional movement of the template, you will need to interpolate the image using the Matlab function interp2. You will also need to iterate the estimation until the change in warp parameters (u, v) is below a threshold. Use forward additive (Lucas-Kanade) alignment for this question.
Q2.2: Initializing the Matthew-Baker Tracker (10 points)
Write the function initAffineMBTracker() that initializes the inverse compositional tracker by precomputing the matrices needed to track a template patch:

function [affineMBContext] = initAffineMBTracker(img, rect)

The function takes as input a greyscale image (img) along with a bounding box (rect) in the format [x y w h].
The function should output a Matlab structure affineMBContext that contains the Jacobian of the affine warp with respect to the 6 affine warp parameters and the inverse of the approximated Hessian matrix (J and H−1 in Equation 16).
Q2.3: The Main Matthew-Baker Tracker (20 points)
Write the function affineMBTracker() that does the actual template tracking:

function [Wout] = affineMBTracker(img, tmp, rect, Win, context)
The function takes as input a greyscale image of the current frame (img), the template image (tmp), the bounding box rect that marks the template region in tmp, the affine warp matrix for the previous frame (Win), and the precomputed J and H−1 matrices (context).
The function should output the 3 × 3 matrix Wout containing the new affine warp matrix, updated so that it aligns the current frame with the template.
You can either use a fixed number of gradient descent iterations or formulate a stopping criterion for the algorithm. You can use the included image warping function to apply affine warps to images.
Q2.4: Tracking a Car (10 points)
Test your trackers on the short car video sequence (data/car1/) by running the wrapper scripts lk_demo.m and mb_demo.m. What sort of templates work well for tracking? At what point does the tracker break down? Why does this happen?
In your write-up: Submit your best video of the car being tracked. Save it as results/car.mp4.
Figure 2: Tracking in the car image sequences
4 Submission Summary

• Q1.1 Derive the expression for the Jacobian matrix.
• Q1.2 What is the computational complexity of the inverse compositional method?
• Q2.1 Write the forward additive tracker (LK tracker).
• Q2.2 Initialize the inverse compositional tracker (MB tracker).
• Q2.3 Write the inverse compositional tracker (MB tracker).
• Q2.4 Run the inverse compositional tracker on the car dataset. What templates does it work well with? When does the tracker break down? Why does this happen?
• Q2.5 Run the inverse compositional tracker on the runway markings dataset.
References
[1] Simon Baker, et al. Lucas-Kanade 20 Years On: A Unifying Framework: Part 1. CMU-RI-TR-02-16, Robotics Institute, Carnegie Mellon University, 2002.
[2] Simon Baker, et al. Lucas-Kanade 20 Years On: A Unifying Framework: Part 2. CMU-RI-TR-03-35, Robotics Institute, Carnegie Mellon University, 2003.
[3] Jean-Yves Bouguet. Pyramidal Implementation of the Lucas Kanade Feature Tracker: Description of the Algorithm. Intel Corporation, 2001.
Credit
This project is adapted directly from Ioannis Gkioulekas.