Assignment 1
EECS 4404/5327
The assignment is due on Thursday, February 8, before 10:30am (before class).
1. Bayesian Reasoning I
Consider a test which detects whether a person has a disease. Let R denote the outcome of the test on a person, let D denote whether the person actually has the disease, and let θ be the probability that the test gives the correct result. That is, the probability that it reports that someone has the disease (R = 1) when they actually do (D = 1) is θ, and the probability that it reports that someone doesn't have the disease when they don't is also θ. Formally:
p(R = 1|D = 1) = p(R = 0|D = 0) = θ
Finally, an α-fraction of the population actually has this disease; that is, the prior probability of a person having the disease is p(D = 1) = α.
(a) A patient goes to the doctor, has the test performed and it comes back positive. Derive the posterior probability that the person actually has the disease, and simplify it in terms of θ and α.
(b) After the results of the first test come back positive, the doctor runs it a second time. This time it comes back negative. Derive the posterior probability that the person actually has the disease after this second round of testing and simplify in terms of θ and α.
(c) Suppose θ = 0.99 and α = 0.001. Suppose 1000 patients get tested, and all tests come back negative. On average, how many of these patients actually have the disease? That is, what is the expected number of false negatives?
(3 + 4 + 3 marks)

Solution
(a) By applying Bayes' theorem, we have

p(D = 1 | R = 1) = p(R = 1 | D = 1) p(D = 1) / p(R = 1)
                 = p(R = 1 | D = 1) p(D = 1) / [ p(R = 1 | D = 0) p(D = 0) + p(R = 1 | D = 1) p(D = 1) ]
                 = θα / ((1 − θ)(1 − α) + θα)

Thus, if the first test comes back positive, the probability that the person actually has the disease is θα / ((1 − θ)(1 − α) + θα).
(b) The important thing to notice here is that the two test results, which we'll call R1 and R2, are conditionally independent given the disease status D. Thus, we get

p(D = 1 | R1 = 1 ∧ R2 = 0)
  = p(R1 = 1 ∧ R2 = 0 | D = 1) p(D = 1) / p(R1 = 1 ∧ R2 = 0)
  = p(R1 = 1 | D = 1) p(R2 = 0 | D = 1) p(D = 1) / [ p(R1 = 1 ∧ R2 = 0 | D = 0) p(D = 0) + p(R1 = 1 ∧ R2 = 0 | D = 1) p(D = 1) ]
  = θ(1 − θ)α / [ p(R1 = 1 | D = 0) p(R2 = 0 | D = 0) p(D = 0) + p(R1 = 1 | D = 1) p(R2 = 0 | D = 1) p(D = 1) ]
  = θ(1 − θ)α / ((1 − θ)θα + θ(1 − θ)(1 − α))
  = θ(1 − θ)α / ((1 − θ)θ(α + (1 − α)))
  = α

That is, one positive and one negative test result cancel each other out, and the posterior probability returns to the prior α.
(c) We first compute the probability that a person has the disease, given that their test is negative:
p(D = 1 | R = 0) = p(R = 0 | D = 1) p(D = 1) / p(R = 0)
                 = p(R = 0 | D = 1) p(D = 1) / [ p(R = 0 | D = 0) p(D = 0) + p(R = 0 | D = 1) p(D = 1) ]
                 = (1 − θ)α / ((1 − θ)α + θ(1 − α))
Now note that the expectation of a random variable that takes values in {0, 1} is equal to its probability of being 1. That is, denoting Di the variable that indicates
whether the i-th person has the disease and Ri the test result of the i-th person, we get
E[number of people with the disease among the 1000 negatively tested]
  = Σ_{i=1}^{1000} E[Di | Ri = 0]
  = Σ_{i=1}^{1000} ( p(Di = 1 | Ri = 0) · 1 + p(Di = 0 | Ri = 0) · 0 )
  = Σ_{i=1}^{1000} p(Di = 1 | Ri = 0)
  = 1000 · p(D = 1 | R = 0)
  = 1000 · (1 − θ)α / ((1 − θ)α + θ(1 − α))
  ≈ 0.0101

So, among 1000 negatively tested patients, we expect only about 0.01 actually to have the disease.
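These quantities are easy to sanity-check numerically; here is a minimal Octave/Matlab snippet (the values in the comments are rounded):

```matlab
theta = 0.99; alpha = 0.001;

% (a) posterior after a single positive test
post_pos = theta*alpha / ((1 - theta)*(1 - alpha) + theta*alpha)           % approx. 0.09

% (c) expected number of false negatives among 1000 negative tests
exp_fn = 1000 * (1 - theta)*alpha / ((1 - theta)*alpha + theta*(1 - alpha))  % approx. 0.0101
```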
2. Bayesian Reasoning II
A terrible crime has been committed, and blood that must have come from the perpetrator is found at the crime scene. Only 1% of the population (in a city of 1,000,000 inhabitants) has this blood type. A suspect is identified and tests positive for this blood type.
(a) The prosecutor says: there was only a 1% chance that the suspect had this blood type if they were innocent, so there is now a 99% chance they are guilty.
What is wrong with this argument?
(b) The defendant says: there are 10,000 people in this city with this blood type, so the chance of my being guilty is only 1/10000.
What is wrong with this argument? Can you come up with a scenario, where it would be valid?
(c) Further investigations are conducted and more evidence is collected. The search is narrowed down to 10 suspects; one of these 10 must have committed the crime. One of these suspects is chosen at random, the test is conducted, and it comes back positive. The judge says: “Given how this whole case developed, I have learned my lesson about using Bayes' rule now. We can send this person to jail. We know:
p(B) = 1/100, p(G) = 1/10,
where B is the event that the blood test comes back positive and G is the event
that the person was guilty. We also know p(B|G) = 1, due to the evidence on the crime scene. Now we get

p(G|B) = p(B|G) p(G) / p(B) = (1 · (1/10)) / (1/100) = 10.

Now this seems convincing…!” Is the judge correct?

(3 + 3 + 4 marks)
Solution We let p(B) denote the probability that the blood test is positive, and p(G) the probability of being guilty. From the information given in the question we can conclude that, for the population of the whole city, we have
p(G) = 1/1000000, p(B) = 0.01, p(B|G) = 1. (Here we assume that the blood test itself is always correct.)
(a) The prosecutor concludes
p(G|B) = 1 − p(B|¬G) = 1 − 0.01 = 0.99

However, the first step is not correct. We have p(G|B) = 1 − p(¬G|B), so the prosecutor is implicitly assuming p(B|¬G) = p(¬G|B), which is not true in general. Confusing p(evidence | innocence) with p(innocence | evidence) in this way is known as the prosecutor's fallacy.
(b) The conclusion of the defendant would be correct if the suspect had been chosen uniformly at random from the population of the city. In that case, applying Bayes' theorem correctly would yield

p(G|B) = p(B|G) p(G) / p(B) = (1 · (1/1000000)) / (1/100) = 1/10000.
However, in reality, the suspect was likely not chosen by drawing from a lottery, but rather became a suspect based on some other, additional evidence (for example, they may have been seen close to the crime scene, or may have known the victim). Given that there was such other evidence, the above calculation and its conclusion are no longer valid.
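For completeness, the arithmetic of the calculation above is easy to check in Octave/Matlab:

```matlab
p_G = 1/1000000; p_B = 1/100; p_B_given_G = 1;
p_G_given_B = p_B_given_G * p_G / p_B   % = 1e-04, i.e. 1/10000
```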
(c) The argument is obviously wrong, since a probability value can never be larger than 1. Technically, the mistake is in the assumed value of p(B) (the application of Bayes' theorem as stated is fine in principle). In the population of the city, the blood type occurs in 1% of the population, thus p(B) = 1/100. However, after narrowing down the search to 10 suspects (one of whom is the criminal), more than 1% of this new population has the blood type. We now have p(B) ≥ 1/10, since the guilty person is among the 10 suspects; we do not know the exact new value of p(B).
3. Linear Algebra
We have seen in class that the solution to regularized least squares regression is given as a solution to the linear system

(X^T X + λ I_d) w = X^T t,

where X is the design matrix and I_d is the identity matrix. Prove that if λ > 0, then (X^T X + λ I_d) is invertible.

Hint: What do you know about the eigenvalues and eigenvectors of X^T X? What can you infer about (X^T X + λ I_d)?
(10 marks)
Solution We have seen in class that the matrix X^T X is positive semi-definite, so all its eigenvalues λ_i are non-negative, i.e., λ_i ≥ 0 for all i. We also know that the eigenvectors of X^T X form an orthogonal basis. By assumption, λ > 0. Let v_i be an eigenvector of X^T X with corresponding eigenvalue λ_i. Then we have

(X^T X + λ I_d) v_i = X^T X v_i + λ I_d v_i = λ_i v_i + λ v_i = (λ_i + λ) v_i.

Thus the v_i are also eigenvectors of X^T X + λ I_d, and the λ_i + λ are the corresponding eigenvalues. We have λ_i + λ ≥ λ > 0, since λ_i ≥ 0 and λ > 0. Thus all eigenvalues of X^T X + λ I_d are positive and the corresponding eigenvectors form a basis. This implies that the null space of X^T X + λ I_d contains only the zero vector (a nonzero null-space vector would be an eigenvector with eigenvalue 0), and thus the matrix is invertible.
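The eigenvalue shift is also easy to see numerically; a quick Octave/Matlab illustration with an arbitrary random design matrix:

```matlab
X = randn(50, 5);            % arbitrary 50x5 design matrix
lambda = 0.1;

eig(X'*X)                    % all eigenvalues >= 0 (up to round-off)
eig(X'*X + lambda*eye(5))    % each eigenvalue shifted up by lambda, hence > 0
rank(X'*X + lambda*eye(5))   % full rank (5), i.e., invertible
```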
4. Linear Regression
Step 1 – load the data
The data is stored in two files, dataset1_inputs.txt and dataset1_outputs.txt, which contain the input values (i.e., values xi) and the target values (i.e., values ti), respectively. These files are simple text files which can be loaded with the load function in Matlab/Octave. Plot the outputs as a function of the inputs (i.e., plot the data points, not a curve) and include this plot in your report.
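For concreteness, a minimal Octave/Matlab sketch of this step (the plot styling is just one possible choice):

```matlab
% Load inputs x_i and targets t_i (one value per line in each file).
x = load('dataset1_inputs.txt');
t = load('dataset1_outputs.txt');

% Plot the data points themselves, not a fitted curve.
plot(x, t, 'o');
xlabel('x'); ylabel('t'); title('dataset1: targets vs. inputs');
```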
Step 2 – ERM
For degrees W = 1, …, 20, fit a polynomial of degree W to the data using (unregularized) least squares regression. For each learned function, compute the empirical squared loss on the data and plot it as a function of W. Include this plot in your report. Which value of W do you think would be suitable?
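Continuing from the Step 1 sketch (x and t already loaded), one possible implementation; the design-matrix helper and the variable names here are our own choices, not prescribed by the assignment:

```matlab
% Design matrix with columns x.^0, x.^1, ..., x.^W (relies on broadcasting).
design = @(x, W) x(:).^(0:W);

erm_loss = zeros(20, 1);
for W = 1:20
  X = design(x, W);
  w = X \ t;                          % unregularized least-squares solution
  erm_loss(W) = mean((X*w - t).^2);   % empirical squared loss on the training data
end
plot(1:20, erm_loss, '-o');
xlabel('degree W'); ylabel('empirical loss');
```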
Step 3 – RLM
Repeat the previous step using regularized least squares polynomial regression with λ = 0.001. Again plot (and include) the empirical loss as a function of W . Compare and discuss the two curves you get for ERM and RLM.
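Continuing the previous sketch, the regularized fit only changes the linear system being solved (again, names are our own):

```matlab
lambda = 0.001;
rlm_loss = zeros(20, 1);
for W = 1:20
  X = design(x, W);
  % Regularized solution of (X'X + lambda*I) w = X' t
  w = (X'*X + lambda*eye(W+1)) \ (X'*t);
  rlm_loss(W) = mean((X*w - t).^2);
end
plot(1:20, erm_loss, '-o', 1:20, rlm_loss, '-x');
legend('ERM', 'RLM'); xlabel('degree W'); ylabel('empirical loss');
```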
Step 4 – cross validation
Implement 10-fold cross validation for RLM. That is, randomly divide the data into 10 chunks of equal size. Then train a model on 9 chunks and test it on the 10th chunk that was not used for training. For each degree W, average the 10 test losses you get and plot these averages as a function of W. Which value of W do you think would be suitable?
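One possible implementation, continuing the sketch above (for simplicity it assumes the number of data points is divisible by 10):

```matlab
K = 10;
perm = randperm(length(x));      % random permutation of the data indices
folds = reshape(perm, K, []);    % each row is one chunk (assumes divisibility by 10)

cv_loss = zeros(20, 1);
for W = 1:20
  fold_loss = zeros(K, 1);
  for k = 1:K
    test_idx  = folds(k, :);
    train_idx = setdiff(perm, test_idx);
    Xtr = design(x(train_idx), W);
    Xte = design(x(test_idx), W);
    w = (Xtr'*Xtr + lambda*eye(W+1)) \ (Xtr'*t(train_idx));
    fold_loss(k) = mean((Xte*w - t(test_idx)).^2);
  end
  cv_loss(W) = mean(fold_loss);
end
plot(1:20, cv_loss, '-o');
xlabel('degree W'); ylabel('average test loss');
```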
Step 5 – visualization
For the degree that you chose based on the previous questions and for W = 1, 5, 10, 20 plot the data along with the ERM and RLM learned models. Discuss the plots.
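One way to produce these plots, evaluating each learned polynomial on a dense grid (again a sketch, reusing the names from above):

```matlab
xs = linspace(min(x), max(x), 200)';   % dense grid for smooth curves
plot(x, t, 'ko');                      % the data points
hold on;
for W = [1 5 10 20]
  X = design(x, W);
  w_erm = X \ t;
  w_rlm = (X'*X + lambda*eye(W+1)) \ (X'*t);
  Xs = design(xs, W);
  plot(xs, Xs*w_erm, '-');             % ERM fit of degree W
  plot(xs, Xs*w_rlm, '--');            % RLM fit of degree W
end
hold off;
```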
Step 6 – bonus
Repeat the steps above (or whatever else you may find suitable) to come up with a polynomial regression vector w = (w0, w1, …, wW) for the data in dataset2_inputs.txt and dataset2_outputs.txt (to be posted a few days before the submission deadline). Submit the weight vector. Your submitted weight vector will then be tested on an independent test set generated by the same process.
Please submit the weights as a 21-dimensional vector w = (w0, w1, …, w20) to be applied to the data as w0 + w1 x + w2 x^2 + … + w20 x^20; if you choose W < 20, just set the appropriate weights to 0. Submit this vector as a text file with one weight per line.
(2 + 5 + 5 + 5 + 3 marks + 5 bonus marks)
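A small sketch for producing such a submission file (the file name weights.txt is our own choice; check the actual submission instructions):

```matlab
% Suppose w holds the learned coefficients for the chosen degree W,
% with w(1) = w_0, ..., w(W+1) = w_W. Pad with zeros up to degree 20.
w_full = zeros(21, 1);
w_full(1:length(w)) = w;

fid = fopen('weights.txt', 'w');
fprintf(fid, '%.12g\n', w_full);   % one weight per line
fclose(fid);
```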
Solution

Step 1 – load the data

Here you just load the inputs and outputs and then generate the following plot:
Step 2 – ERM

Here you fit polynomials of various degrees to the data and compute the training loss as a function of the degree of the fitted polynomial. As expected, the loss decreases as the degree increases (since we fit potentially more complex functions):

At this point, it seems that some degree a bit over 10 would be suitable. The loss decreases sharply up to degree 10 and then continues to decrease only mildly. This indicates that, for degrees larger than 10, we would expect overfitting to occur.
Step 3 – RLM

We repeat the same steps but include a regularizer in the loss minimization. Plotting the loss as a function of the degree yields a very similar picture:

However, examining the values of the loss closely for degrees larger than 10, we observe that the loss of the regularized minimizer remains a bit larger. The regularization term guides the optimization towards less complex functions and thereby prevents overfitting.
Step 4 – cross validation

Here, we first randomly permute the data and then divide it into 10 equal-sized parts. For every degree from 1 to 20 and for each of the 10 parts in turn, we hold that part out as the test set, fit a polynomial to the other 9 parts (using the RLM method), and compute the test loss on the held-out part. That is, for every degree we fit a function 10 times, compute the corresponding test losses, and average them. Plotting these averages as a function of the degree yields:
Step 5 – visualization

Plotting polynomials of various degrees together with the data:

We observe that the fitted function follows the trends in the data more and more closely as the degree increases. We also see (at the ends of the input range) that for degrees 15 and 20 the function starts to fit the noise, an indication of overfitting.