Carnegie Mellon University Department of Electrical and Computer Engineering
18794 Pattern Recognition Theory Midterm – Fall 2020
From: Friday, November 6, 2020, 4:30 pm. Due: Saturday, November 7, 2020, 4:29 pm.
Name: Andrew ID:
Instructions
• Do all problems by hand, and do not use Matlab, unless stated otherwise.
• Submitting your work: For all problems on this exam we will be using Gradescope. You can access Gradescope from Canvas or from here: https://gradescope.com. To receive full credit, show the steps to your solution. Please be super clear, neat, and legible when you write your solutions. If the TAs cannot follow your work, you will not receive credit for that problem.
• This midterm contains 32 pages. Total of points is 150.
• The exam sheet provides space for you to write your answers. You are free to print this paper out and solve on it. You are allowed to write on separate sheets, but you have to compile all the questions in the order of the exam paper. The answer to each question must be on a separate page.
• To receive credit, show all the steps in a clear and legible form.
• This is an individual exam; please DO NOT collaborate with other students or anyone else. Please don't search online. You are allowed to look at lecture notes and slides. We are aware of the online answers and implementations and will recognize them when grading the midterms.
• We will be monitoring our emails for questions during the midterm and will do our best to respond quickly. Please ask your questions PRIVATELY, to instructors only; this is really important!! Please check the Canvas Announcements periodically for any updates or announcements.
• Keep your answers precise and concise. Show all work, clearly and in order, or else points will be deducted, even if your final answer is correct.
• If you’re caught violating any protocol, the exam is considered over for you.
For TA use only:
Total (150pt)
1 (8pt)
2 (20pt)
3 (20pt)
4 (20pt)
5 (20pt)
6 (20pt)
7 (12pt)
8 (20pt)
9 (10pt)
1. Freebies, because everything has a cost (8 points)
Please attempt ALL questions; you get points for ANY attempt, even if your answer doesn't make sense. There are no wrong answers, only non-interesting ones.

Name two types of fish? (1 POINT)

What is the good, the bad, and the ugly, in the context of our class? (1 POINT)

What is your most favorite algorithm covered in class so far? (1 POINT)

What is your least favorite algorithm covered in class so far? (1 POINT)

If you thought Pattern Recognition and Machine Learning was pretty cool and interesting with a score of 5 on a scale of 1 to 10, how do you think it is now, based on your experience in this class? (1 POINT)

Is Friday the 6th a good day to take a mid-term? If yes, why? If no, why not? (1 POINT)

Why did the pattern recognition engineer not cross the road? (1 POINT)

Can you come up with a one-liner freebie question that would be interesting to have in next year's midterm? (1 POINT)
2. Probability and Linear Algebra Basics (20 points)
— Covariance Matrices are useful —
Let us say you are provided with a dataset consisting of n samples of d-dimensional data (d > n): x_1, x_2, . . . , x_n. You obtain an estimate of the mean (designated by μ̂_n) and an estimate of the covariance matrix (designated by Σ̂_n). Now, a new sample x_{n+1} is added to the data. Assuming that the number of samples n is very large, so that the mean remains almost unchanged, derive an expression for the new covariance matrix Σ̂_{n+1} in terms of μ̂_n, Σ̂_n, and x_{n+1}. (3 POINTS)
— Projections are more useful —
Consider two vectors x and y in an n-dimensional space as shown below. If a is a scalar, ay represents a vector along the direction of vector y, scaled in magnitude by the amount a. For any given value of a, there is a corresponding vector e as shown which connects x to ay. Derive a formula for the value of a which minimizes the l2-norm squared of this vector. (2 POINTS)
— Multivariate Gaussian Distributions are super useful (nah, not really) —
X is a random vector of dimension N following a multivariate Gaussian distribution, X ∼ N(μ_X, Σ_X). The random vector Y is an affine transformation of X, where Y = AX + b. A is an N × N square matrix, and b is an N × 1 vector; therefore, Y has the same dimension as X. In this case, Y also follows a multivariate Gaussian distribution, Y ∼ N(μ_Y, Σ_Y). Show the relation between μ_X and μ_Y, as well as Σ_X and Σ_Y.

f_X(x) = 1 / √((2π)^N det(Σ_X)) · exp( −(1/2)(x − μ_X)^⊤ Σ_X^{−1} (x − μ_X) )

Hint: for any affine transformation Y = AX + b, the PDF of Y is related to the PDF of X as follows:

f_Y(y) = (1 / |det(A)|) f_X( A^{−1}(y − b) )

Also, you can use the facts that det(A) = det(A^⊤) and det(AB) = det(A) det(B). (3 POINTS)
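An optional sanity check (not part of the exam): the MATLAB sketch below draws samples from an assumed X ∼ N(μ_X, Σ_X) with made-up values of μ_X, Σ_X, A, and b, applies Y = AX + b, and prints the empirical mean and covariance of Y so you can compare them against whatever expressions you derive.

% Illustrative sketch only; mu_X, Sigma_X, A, and b are made-up values.
rng(0);
N = 3;                                       % dimension of X and Y
mu_X    = [1; -2; 0.5];
Sigma_X = [2 0.5 0; 0.5 1 0.2; 0 0.2 1.5];   % symmetric positive definite
A = [1 2 0; 0 1 -1; 3 0 1];                  % invertible N x N matrix
b = [0.3; -1; 2];

num_samples = 1e6;
L = chol(Sigma_X, 'lower');                  % so that L*L' = Sigma_X
X = mu_X + L*randn(N, num_samples);          % columns are samples of X ~ N(mu_X, Sigma_X)
Y = A*X + b;                                 % apply the affine transformation

mu_Y_empirical    = mean(Y, 2)               % compare with your derived mu_Y
Sigma_Y_empirical = cov(Y')                  % compare with your derived Sigma_Y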
—Matrix Calculus, because uni-variate calculus is so mainstream—
Given A, an M × N matrix, find the solution to:

minimize_x  J(x) = μ‖x‖² + ‖y − Ax‖²

Your answer should be in terms of y, μ, and A. Hint: ∂(x^⊤Bx)/∂x = (B + B^⊤)x.
(3 POINTS)
—Generalized Norm—
The lp-norm of the vector v ∈ R^n is defined as:

‖v‖_p = ( Σ_{i=1}^n |v_i|^p )^{1/p}      (1)

The l0-norm of the vector v ∈ R^n is defined as:

‖v‖_0 = Σ_{i=1}^n |v_i|^0      (2)

The Frobenius norm of the matrix M ∈ R^{n×m} is defined as:

‖M‖_F = ( Σ_{i=1}^n Σ_{j=1}^m m_ij² )^{1/2} = ( Σ_{i=1}^n ‖m_i‖_2² )^{1/2}      (3)

where m_i is the i-th row of matrix M. The l_{2,1}-norm of a matrix is defined as:

‖M‖_{2,1} = Σ_{i=1}^n ( Σ_{j=1}^m m_ij² )^{1/2} = Σ_{i=1}^n ‖m_i‖_2      (4)

Derive the l_{3,2}-norm. (2 POINTS)

Generalize to the l_{r,p}-norm. (1 POINT)

Note that the l_{r,p}-norm is a valid norm because it satisfies the three norm conditions, including the triangle inequality ‖A‖_{r,p} + ‖B‖_{r,p} ≥ ‖A + B‖_{r,p}. Prove it.
Hint: start from the triangle inequality ( Σ_i |u_i|^p )^{1/p} + ( Σ_i |v_i|^p )^{1/p} ≥ ( Σ_i |u_i + v_i|^p )^{1/p} and set u_i = ‖a_i‖_r and v_i = ‖b_i‖_r, where a_i and b_i are the i-th rows of A and B. (2 POINTS)
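As an optional illustration (not part of the graded questions), the MATLAB lines below evaluate the definitions in Eqs. (1)-(4) on a small example; the vector v, matrix M, and exponent p are made-up values.

% Illustrative sketch only; v, M, and p are made up.
v = [3; -4; 0; 1];
M = [1 -2 0; 0 3 4];                 % n = 2 rows, m = 3 columns

p = 3;
lp_norm  = sum(abs(v).^p)^(1/p);     % l_p norm of v, Eq. (1)
l0_norm  = nnz(v);                   % l_0 "norm" of v: number of non-zeros, Eq. (2)
fro_norm = sqrt(sum(M(:).^2));       % Frobenius norm, Eq. (3); same as norm(M, 'fro')
l21_norm = sum(sqrt(sum(M.^2, 2)));  % l_{2,1} norm: sum of the l_2 norms of the rows, Eq. (4)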
—2D Gaussian Distribution: the ideal Gaussians for paper—
Consider two 2D Gaussian distributions characterized by:

μ1 = [1; 2],   Σ1 = [1 1; 1 2]

μ2 = [3; 4],   Σ2 = [2 0; 0 1/2]

Sketch the contours of the Gaussians in Figure 1 (a) below. The top entry of μ1 and μ2 corresponds to the y-axis in the figure below, and the second entry of each vector corresponds to the x-axis. (1 POINT)

Figure 1: (a)

If you perform PCA on EACH of the two distributions above, in both cases draw the first principal component vector along which there is maximum variance of the data. Draw the vector from the mean of each Gaussian. Do this on Figure 1 (b) below. (1 POINT)

Figure 2: (b)
— Correlation Coefficient: apparently enhancing convenience —
If x is a d-dimensional random vector, x = [X_1, X_2, …, X_d]^⊤, its covariance matrix is defined as:

Cov[x] = E[(x − E(x))(x − E(x))^T]
       = [ Var[X_1]       Cov[X_1, X_2]  ...  Cov[X_1, X_d] ;
           Cov[X_2, X_1]  Var[X_2]       ...  Cov[X_2, X_d] ;
           ...            ...            ...  ...           ;
           Cov[X_d, X_1]  Cov[X_d, X_2]  ...  Var[X_d]      ]

Covariances can be arbitrarily large in magnitude. Sometimes it is more convenient to work with a normalized measure with a finite upper bound. In the 2-D case, the correlation coefficient between X and Y is defined as:

ρ[X, Y] = Corr[X, Y] = Cov[X, Y] / √(Var[X] Var[Y])

Consider several sets of (x, y) points shown in the figure below. Write the correlation coefficient of x and y for each set. Hint: you may choose from {−1.0, −0.8, −0.4, 0, 0.4, 0.8, 1.0}. (2 POINTS)

[Figure: 21 scatter plots (not reproduced), with an answer blank ( ) under each plot, arranged in three rows of seven.]
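As an optional aside (not part of the graded question), the sketch below estimates the correlation coefficient defined above from a synthetic sample; the data-generating numbers are made up purely for illustration.

% Illustrative sketch only; the sample is synthetic.
rng(1);
num_points = 5000;
x = randn(num_points, 1);
y = 0.8*x + sqrt(1 - 0.8^2)*randn(num_points, 1);   % constructed so Corr[x, y] is about 0.8

% Correlation coefficient straight from the definition (covariance over product of std's):
rho_manual = mean((x - mean(x)).*(y - mean(y))) / (std(x, 1)*std(y, 1));

R = corrcoef(x, y);      % built-in check; R(1, 2) should be close to rho_manual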
3. Decision Theory (20 points)
— ROC Curves, because they are pretty —
Below is a plot of ROC curves from some classification experiments evaluated on 5 algorithms. Draw the trajectory of the Equal Error Rate. Note that the horizontal axis is in log scale. (1 POINT)
[Figure: ROC curves of algorithms 1-5. Horizontal axis: False Accept Rate on a log scale from 10^-3 to 10^0. Vertical axis: True Accept Rate from 0 to 1.]
According to the ROC curves, which algorithm should be preferred? (1 POINT)
— 2-Class Problem in 1D Space, because life is easy —
In a 2-class problem using a single feature x, the class conditional PDFs are as follows:

f(x|w1) = a e^{−|x|},    f(x|w2) = b e^{−5|x|},    −∞ < x < ∞

Determine the constants a and b.
(2 POINTS)
Determine the minimum error decision strategy and the probability of error (Pe), assuming equal priors for both classes. (Hint: log(5)/4 ≈ 0.4024. You can leave Pe in terms of e^{−0.4024}.) (8 POINTS)
— Minimum Error Decision Strategy does not exist for life in general —
Consider a three-class problem based on two features x1 and x2. Each class conditional PDF is uniformly distributed inside the region shown in the above figures. Determine the minimum probability of error scheme and the corresponding probability of error (Pe) with the priors given by P(w1) = 3P(w3), and P (w2) = 2P (w3). (8 POINTS)
4. Dimensionality Reduction (20 points)
—Orthogonality, because it’s more fancy than just being perpendicular —
When applying PCA on the covariance matrix, Σw = λw, we would get N − 1 such w vectors, where N is the number of samples, and we assume dimensionality d ≫ N . Prove that the w’s are orthogonal to each other. Hint: Start by proving that the covariance matrix is symmetric. (5 POINTS)
—PCA and Feature Selection will Transform your life—
As we have shown in class, PCA is a good tool for dimensionality reduction but has limited discriminability. Here we introduce a feature selection transform (FST) which is similar to PCA but does a better job of discriminating between two classes. Unlike traditional principal component analysis (the Karhunen-Loève transform), the FST incorporates data from both the positive and the negative class and uses an eigen-decomposition of the joint covariance matrix in order to find the optimal basis vectors that represent one class very well while having the least representation power on the other class.
Let X ∈ R^{d×m} be the dataset from Class 1, with each column a vectorized image of dimension d. Let Y ∈ R^{d×n} be the dataset from Class 2. Both X and Y have zero mean. Let Σ_X and Σ_Y be the autocorrelation matrices of datasets X and Y respectively, i.e.

Σ_X = XX^T,    Σ_Y = YY^T
Now consider a unified dataset Z = [X Y] ∈ R^{d×(m+n)} with autocorrelation matrix Σ. Derive the relationship between Σ, Σ_X, and Σ_Y. (2 POINTS)
Σ is symmetric and can be diagonalized using eigen-decomposition as: Σ = ΦΛΦ⊤
where Φ contains the entire span of eigenvectors of Σ and Λ houses the corresponding eigenvalues of Σ on its diagonal.
Next, a whitening step is applied in the FST. Both the Class 1 and Class 2 data are transformed by the whitening matrix P = ΦΛ^{−1/2}, so the transformed data become:

X̃ = P^⊤X and Ỹ = P^⊤Y

What are the autocorrelation matrices of the transformed data X̃ and Ỹ? (2 POINTS)
The transformed total autocorrelation matrix is Σ̃ = Σ_X̃ + Σ_Ỹ. Show that Σ̃ = I. (2 POINTS)

Here, we perform an eigen-decomposition on Σ_X̃, which yields:

Σ_X̃ v = λv      (5)

Can you show that Σ_Ỹ and Σ_X̃ share the same eigenvectors and that their eigenvalues are complementary (i.e., for the same eigenvector, the corresponding eigenvalues from the two classes sum up to 1)? (3 POINTS)
This effectively means that the autocorrelation matrices of the two classes share the same eigenvectors v, and the eigenvalue for one class is exactly the complement of the eigenvalue for the other class. Because of this complementary property, the eigenvectors that are most dominant in one class are the least dominant in the other class. So in the traditional FST method, as applied to any two-class problem, a discriminative subspace is created by selecting a few of the most dominant eigenvectors for one class and the least dominant ones for the other class. By ignoring the eigenvectors in the middle range, the subspace we obtain contains basis vectors that are very discriminative and will yield discriminative features after projection.
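The following MATLAB sketch only illustrates the FST procedure described above on made-up random data; the dimensions, the rank threshold, and the number of kept basis vectors k are arbitrary assumptions, and this is not a solution to any of the questions in this problem.

% Illustrative sketch of the FST pipeline on made-up data.
d = 50; m = 40; n = 30;
X = randn(d, m);  X = X - mean(X, 2);         % zero-mean Class 1 data (columns are samples)
Y = randn(d, n);  Y = Y - mean(Y, 2);         % zero-mean Class 2 data

Sigma = X*X' + Y*Y';                          % joint autocorrelation matrix of Z = [X Y]
[Phi, Lambda] = eig(Sigma);
keep = diag(Lambda) > 1e-10;                  % drop a possible null space (an added safeguard)
P = Phi(:, keep) * diag(diag(Lambda(keep, keep)).^(-1/2));   % whitening matrix, P = Phi*Lambda^(-1/2)

Xw = P'*X;  Yw = P'*Y;                        % whitened (transformed) data
[V, D] = eig(Xw*Xw');                         % eigen-decomposition of the whitened Class 1 autocorrelation
[~, idx] = sort(diag(D), 'descend');
V = V(:, idx);

k = 3;                                        % number of basis vectors kept per class (arbitrary choice)
B = [V(:, 1:k), V(:, end-k+1:end)];           % most dominant for Class 1, least dominant (dominant for Class 2)
featX = B'*Xw;  featY = B'*Yw;                % discriminative features after projection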
— Numerical Problem on PCA, because why not —
Consider the two binary face images in low resolution shown below (black being 0):
Figure 3: Two low-resolution faces.
What will be the dimension of the covariance matrix? What will be the dimension of the Gram matrix? (1 POINT)
Using the Gram matrix trick, determine the principal component with the largest variance. For this problem, you shouldn’t normalize the eigenvectors at any stage to be unit norm nor should you subtract the mean of the sample data when building the matrix (X) of the vectorized samples. (5 POINTS)
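As a generic illustration only (random placeholder data, NOT the numerical answer for the two face images above), the sketch below shows the Gram-matrix trick; following the problem statement, there is no mean subtraction and no normalization of the eigenvectors.

% Illustrative sketch of the Gram-matrix trick on placeholder data.
d = 100; n = 2;                  % d-dimensional data, n samples, with d >> n
X = randn(d, n);                 % columns are vectorized samples (placeholder values)

G = X'*X;                        % n x n Gram matrix (small), instead of the d x d matrix X*X'
[V, D] = eig(G);                 % eigenvectors v of the Gram matrix: (X'X)v = lambda*v
[~, imax] = max(diag(D));        % pick the largest eigenvalue
w = X*V(:, imax);                % then X*v is an eigenvector of X*X' with the same eigenvalue
                                 % (left unnormalized, as the problem instructs)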
5. Correlation Filters (20 points)
—Alternate views of MACE filters—
We know the MACE filter aims to minimize the average correlation energy of the correlation surfaces with your training data while maintaining some peak constraints. We know the solution for the MACE filter is of the form
h = D^{−1}X(X^+D^{−1}X)^{−1}u

We have also seen that ECP-SDF filters take the form

h = X(X^+X)^{−1}u
Show how the MACE filter can be seen as an ECP-SDF filter created from data transformed by some matrix. (This is also called pre-whitening, given the matrix you multiply by.) (4 POINTS)
Currently, we have seen how to formulate a MACE filter in the frequency domain. Now, you will show how to formulate a MACE filter in the time domain. We know that the MACE filter tries to minimize the average correlation energy while maintaining peak constraints for all the training data. Therefore, the first thing to do is formulate a term for the average correlation energy in the time domain. Assume you have N data samples, x_1 through x_N. Given the first data sample, x_1, and a filter h, both as column vectors in the time domain, we can write an expression for the correlation surface c_1 as

c_1 = A_1^T h
Give a general idea of how to create the matrix A1. (3 POINTS)
What is the energy of c1 in terms of A1 and h? (2 POINTS)
What is the average energy of the correlation surfaces over all N data samples? Express it in the form h^T R h and write an equation for R. (3 POINTS)

What is the equation for the filter h that minimizes the average correlation energy h^T R h while maintaining constraints on the correlation peaks, X^T h = u? (4 POINTS)
What is the main reason this solution is not usually a computationally feasible solution while the solution in the frequency domain is? (4 POINTS)
6. Correlation Filters and Support Vector Machines (20 points)
— Maximum Margin Correlation Filter (MMCF) is basically a margin maximizing correlation filter (in case the name didn’t make sense) —
Given N training samples, each represented by a column vector x_i ∈ R^d, with class label t_i ∈ {1, −1}, ∀i ∈ {1, . . . , N}, the support vector machine (SVM) approach (for a 2-class problem) finds the hyperplane that maximizes the smallest l2-norm distance between the hyperplane and any data sample:
minimize_{w,b}  w^T w + C Σ_{i=1}^N ξ_i
such that  t_i(x_i^T w + b) ≥ 1 − ξ_i      (6)
where w denotes the normal to the hyperplane and b is the bias or offset from the origin, C > 0 is a trade-off parameter, and the sum of ξi ≥ 0 is a penalty term containing the slack variables.
A new classifier called the MMCF was proposed¹ to combine the design principles of SVMs and CFs; it has the good generalization capability of SVMs and the good localization capability of CFs. Many CF designs can be interpreted as optimizing a distance metric between an ideal desired correlation output for an input image and the correlation output of the training images with the filter template:
w = argmin_w Σ_{i=1}^N ‖w ⊗ x_i − g_i‖²      (7)
where the ⊗ symbol denotes the implied 2D cross-correlation operation of the 2D input image and the template represented by their vector versions xi and w, respectively, and gi is the vector version of the desired correlation output.
In MMCF, we choose gi to be a delta function-like in order to have a sharp peak in the correlation output at the target’s center location. So the MMCF multi-objective function (combining SVM and CFs design principles) can be written as:
minimize_{w,b}  ( w^T w + C Σ_{i=1}^N ξ_i ,  Σ_{i=1}^N ‖w ⊗ x_i − g_i‖² )
such that  t_i(x_i^T w + b) ≥ c_i − ξ_i      (8)

where c_i = 1 for true-class images and c_i = 0 or another small value ε for false-class images. That is, for true-class images we expect a value near 1, and for false-class images we expect a value close to 0. This allows us to detect the true targets and ignore everything else. We refer to w^T w + C Σ_{i=1}^N ξ_i as the margin criterion and Σ_{i=1}^N ‖w ⊗ x_i − g_i‖² as the localization criterion. We set the desired CF output to g_i = [0, …, 0, w^T x_i, 0, …, 0]^T, where we use w^T x_i as the cross-correlation value of w and x_i at the target's location.
We now express the multi-objective function shown in Eq. 8 in the frequency domain in order to take advan- tage of the well-known property that cross-correlation in the space domain is equivalent to multiplication in the frequency domain.
¹ Rodriguez, A.; Boddeti, V. N.; Kumar, B. V. K. V.; Mahalanobis, A., "Maximum Margin Correlation Filter: A New Approach for Localization and Classification," IEEE Transactions on Image Processing, vol. 22, no. 2, pp. 631-643, Feb. 2013.
Express the localization criterion in the frequency domain using Parseval's Theorem. (Hint: use the notation provided: d is the dimensionality of x_i; X̂_i is a diagonal matrix whose diagonal entries are the elements of x̂_i; and x̂_i, ŵ, and ĝ_i are the vector representations of the 2D discrete Fourier transforms (DFTs) of the 2D images represented by the vectors x_i, w, and g_i, respectively. You can use superscript ∗ to denote complex conjugate. We can ignore the scalar 1/d because it does not affect the optimality.) (4 POINTS)

The vector ĝ_i can be expressed as:

ĝ_i = 1 x_i^T w = (1/d) 1 x̂_i^† ŵ      (9)

where † denotes conjugate transpose, and the vector 1 = [1 · · · 1]^T contains d ones. Express the vector x̂_i in terms of X̂_i using a simple matrix-vector operation: (2 POINTS)
Show that the localization criterion in the frequency domain can be expanded into a quadratic form ŵ^†Ẑŵ, where Ẑ is a positive semidefinite matrix. (6 POINTS)
We can formulate the SVM in the frequency domain by making use of the fact that inner products are only scaled by 1/d. Thus, the SVM frequency domain formulation is as follows:

minimize_{ŵ, b′}  ŵ^†ŵ + C Σ_{i=1}^N ξ_i
such that  t_i(ŵ^†x̂_i + b′) ≥ 1 − ξ_i      (10)

where b′ = b × d, and the other 1/d scalars are ignored because everything gets scaled appropriately.
Now express the MMCF multi-criteria shown in Equation 8 in the frequency domain: (4 POINTS)
Studies have shown that two quadratic criteria can be optimized by minimizing a weighted sum of the two. So we can re-write the optimization as follows:

minimize_{ŵ, b′}  λŵ^†ŵ + C′ Σ_{i=1}^N ξ_i + (1 − λ)ŵ^†Ẑŵ
such that  t_i(ŵ^†x̂_i + b′) ≥ c_i − ξ_i      (11)

where 0 < λ ≤ 1 and C′ = λC. Further subsuming one quadratic term into the other yields:

minimize_{ŵ, b′}  ŵ^†Ŝŵ + C′ Σ_{i=1}^N ξ_i
such that  t_i(ŵ^†x̂_i + b′) ≥ c_i − ξ_i      (12)

where Ŝ = λI + (1 − λ)Ẑ.

What role does the parameter λ play? What happens if λ = 1? (2 POINTS)

Suggest a way to transform the data x̂_i such that we can implement the MMCF design using a standard SVM solver. (2 POINTS)
7. Support Vector Machine (12 points)

Please keep your answer to each question short so we know you get the key point of the question. We will penalize your answer if you provide unnecessary justification.

Given the objective function of the hard-margin SVM:

minimize_{w,b}  w^T w
such that  y_i(x_i^T w + b) ≥ 1      (13)

What are x_i and y_i? What values can y_i take? (1 POINT)

How do you predict the class of a new data sample x with a trained SVM, given w and b? (1 POINT)
Does the hard-margin SVM always have a solution in general? If not, in what case does it fail? What is the key factor in real-world scenarios that makes the hard-margin SVM fail? (1 POINT)

The objective function of the soft-margin SVM is given by:

minimize_{w,b}  w^T w + C Σ_{i=1}^N ξ_i
such that  y_i(x_i^T w + b) ≥ 1 − ξ_i
           ξ_i ≥ 0      (14)

How does it solve the cases that the hard-margin SVM cannot handle? Explain in terms of the role of ξ_i. (1 POINT)

In the soft-margin SVM, the cost of the slack variables (C) controls how much slack (mislabelling in training) the SVM can afford. If C is increased, does the feasible set of the SVM increase or decrease, and why? The feasible set of an optimization problem (such as the SVM optimization problem above) can be thought of as the size of the set of all w that satisfy the constraints in the SVM optimization problem. If this set increases, the feasible set of the SVM increases. The optimization problem of the SVM minimizes the objective function (under the constraints) to pick one and only one solution from the feasible set. Hence, if the feasible set is large, the optimization procedure has more decision boundaries to choose from. (2 POINTS)
If you find that you have very low training error and high test error, would you increase or decrease C, and why? (2 POINTS)
If you find that the SVM achieves very high training error but the test error is low, would you increase or decrease C, and why? (2 POINTS)

If you increase C for a kernel SVM on a non-linearly separable dataset, does the complexity of the decision boundary increase or decrease? The complexity of the decision boundary is defined as how complex or how non-linear the final decision boundary obtained is. (2 POINTS)
8. Solving SVM with MATLAB (20 points)
Please attach your code at the end of your submission!
In the lecture, we covered SVM in both primal and dual formulations. Given a dataset with N samples, X ∈ R^{N×p}, and their labels y ∈ {−1, 1}^N, the primal problem is:

min_{w, w0, ξ}  (1/2) w^T w + C Σ_{i=1}^N ξ_i
s.t.  y_i(x_i^T w + w_0) ≥ 1 − ξ_i, ∀i
      ξ_i ≥ 0, ∀i      (15)

(1) Good? Bad? Ugly? Write down the dimensionality of w, w_0, ξ, C, x_i and y_i. (Answers should look like: X ∈ R^{N×p}.) (2 POINTS)
(2) QP Solver in MATLAB. Here we provide an actual example of solving the SVM with the QP solver in MATLAB. The function in MATLAB is defined as:

t = quadprog(H, f, A, b, Aeq, beq, blow, bup),

where it finds the solution to the following optimization problem:

min_t  (1/2) t^⊤Ht + f^⊤t
s.t.   At ≤ b
       Aeq t = beq
       blow ≤ t ≤ bup      (16)
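As an aside (not part of the graded problem), here is a minimal toy call to quadprog just to illustrate the argument order of objective (16); the numbers below are made up and have nothing to do with the SVM dataset.

% Toy QP, for illustration only:
%   minimize 0.5*(t1^2 + t2^2) - 2*t1 - 5*t2   s.t.  t1 + t2 <= 3,  t >= 0
H    = eye(2);
f    = [-2; -5];
A    = [1 1];     b   = 3;          % inequality constraint A*t <= b
Aeq  = [];        beq = [];         % no equality constraints
blow = [0; 0];    bup = [Inf; Inf]; % bounds blow <= t <= bup
t = quadprog(H, f, A, b, Aeq, beq, blow, bup)   % should return approximately [0; 3]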
Let's start by focusing on the optimization target t. In the original SVM problem shown in objective (15), we have three sets of optimization targets {w, w_0, ξ}; however, to get the QP solver to work in MATLAB, we need to concatenate all of {w, w_0, ξ} into one longer vector, i.e. t = [w^⊤, w_0, ξ^⊤]^⊤ ∈ R^{(p+1+N)×1}. Starting from this, write down the dimensionality of H, f, A, b, Aeq, beq, blow, bup. (2 POINTS)
(3) Code it up with your linear algebra skills! Now, in MATLAB, please load the dataset pattern_rec.mat. It consists of 100 training data X_train ∈ R^{100×2} and their associated binary labels y_train ∈ R^{100×1}. Given the dataset, you should be able to construct H, f, A, b, Aeq, beq, blow, bup using the mapping from objective (15) to objective (16):

(1/2) w^T w                       ⇒  (1/2) t^⊤Ht
C Σ_{i=1}^N ξ_i                   ⇒  f^⊤t
y_i(x_i^T w + w_0) ≥ 1 − ξ_i, ∀i  ⇒  At ≤ b
ξ_i ≥ 0, ∀i                       ⇒  blow ≤ t ≤ bup

Note that the SVM doesn't have equality constraints, so all elements in Aeq and beq should be zero. Additionally, if there is no lower bound or upper bound for a variable, the associated entries of blow and bup will be -Inf and Inf, respectively.
Now, let C = 1, and report the size and summation of entries of your H, f, A, b, Aeq, beq, blow, bup with the following example MATLAB commands:

size(H)
sum(H(:))

(2 POINTS)
(4) Time to Solve! Let C = 1 and use MATLAB's quadprog function to solve the SVM problem for the given dataset. Plot the dataset and draw the decision boundary (a line). Use dark circles to point out the support vectors (the associated ξ_i of the support vectors should be non-zero). Please include your MATLAB code in the submission to get full points for this question. (8 POINTS)
(5) Play with C. Now change C = 0.01, and then C = 100, train again, and plot the results with the same procedure. What do you observe about the decision boundary and the support vectors when you increase C? (6 POINTS)
9. Matrix Approximation (10 points)
— This is nothing like Robust PCA —
Matrices can be decomposed in many ways. One way is to say that a matrix M ∈ R^{N×N} is a sum of two matrices L and S, where L is a low-rank matrix and S is a sparse matrix (‖S‖_0 ≤ K). Here the l0 norm is defined for matrices as ‖S‖_0 = Σ_{i,j} 1{S_ij ≠ 0}, where 1{·} = 1 if the argument is true and 1{·} = 0 if the argument (the expression in the bracket) is false. Thus the l0 norm simply counts the number of non-zeros in the matrix. The singular norm ‖A‖_* = Σ_i σ_i(A) is defined as the sum of the singular values.
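As an optional illustration (not part of the graded questions), the lines below evaluate the norms just defined on a small made-up matrix using built-in MATLAB functions.

% Illustrative sketch only; M is made up.
M = [3 0 0; 0 0 2; 0 0 0];
nuclear_norm = sum(svd(M));      % ||M||_* : sum of the singular values
l0_norm      = nnz(M);           % ||M||_0 : number of non-zero entries
fro_norm     = norm(M, 'fro');   % ||M||_F : Frobenius norm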
The Robust PCA problem is defined as

argmin_{L,S}  ‖M − (L + S)‖_F²   s.t.  ‖L‖_* ≤ t,  ‖S‖_0 ≤ K

Here, ‖M‖_F is the Frobenius norm (simply the l2 norm of the vector formed by all the matrix elements). Write the Lagrangian of the function with variable u for the singular norm and v for the l0 norm. Do you have any conditions on u, v? Would you be able to minimize the Lagrangian w.r.t. L, S directly? Why or why not? (3 POINTS)
The original formulation is actually not convex, and so the Lagrangian would only help in finding the dual problem and not solving the primal formulation. Nonetheless, sometimes, under certain conditions, it is possible to solve a non-convex problem optimally without even using the Lagrangian. You will now attempt such a feat.
Consider the case where K > N². What would the solution of the Robust PCA optimization problem look like for any given matrix M? Specifically, under the new relaxed condition K > N², give one solution to the original problem (specify what L and S become) that is globally optimal. (2 POINTS)
We now drop the l0 constraint and the variable S, and replace the singular norm ∥A∥∗ with the rank function (rank(A)), which just returns the rank of the matrix A. Rewrite the original Robust PCA problem with the modifications suggested. (2 POINTS)
The problem now becomes to find a matrix that is the best rank t approximation to a given matrix M. Suggest an algorithm to solve this problem. Write down the steps you would follow to solve it. (3 POINTS)