EECS4404/5327, Fall 2021 Midterm
You are strongly encouraged to typeset your solutions. Handwritten answers will be accepted but must be scanned into a PDF of reasonable size. All submissions must be easily readable; unreadable answers will be given zero marks. Submissions must be uploaded to eClass by 11:59pm Toronto time on Thursday, October 28th.
This midterm test is open book, meaning that you may use the textbook and lectures to arrive at your answers. All answers must be well justified. You may use results (e.g., theorems, lemmas, identities, properties, etc.) given in the textbook and lectures, but wherever you use such a result you must clearly state which result you are using and where it came from.
1. Consider the probability density function
$$p(x \mid \alpha, \beta) = \frac{\sqrt{\alpha}}{\sqrt{\pi}}\, 3x^2 \exp\left(2\beta x^3 - \alpha x^6 - \frac{\beta^2}{\alpha}\right)$$
over the values of $x \in \mathbb{R}$ with parameters $\alpha > 0$ and $\beta \in \mathbb{R}$. Assume you are given a set of $N$ IID samples from this density, $D = \{x_i\}_{i=1}^N$. Define and fully specify with pseudocode two different methods to compute the maximum likelihood estimators of the parameters $\alpha$ and $\beta$ based on the sampled data $D$. This will require both mathematical derivations and algorithmic specifications.
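Before deriving estimators, it can help to confirm that the density is being read correctly. The following is a minimal sketch, not part of the original test, that assumes the density is $p(x \mid \alpha, \beta) = \sqrt{\alpha/\pi}\, 3x^2 \exp(2\beta x^3 - \alpha x^6 - \beta^2/\alpha)$ and checks numerically that it integrates to one. The function name `density`, the parameter values, and the integration grid are illustrative choices only.

```python
import numpy as np

def density(x, alpha, beta):
    """Evaluate the assumed density p(x | alpha, beta) elementwise."""
    x = np.asarray(x, dtype=float)
    norm = np.sqrt(alpha / np.pi)
    exponent = 2.0 * beta * x**3 - alpha * x**6 - beta**2 / alpha
    return norm * 3.0 * x**2 * np.exp(exponent)

# Illustrative parameter values (assumptions, not taken from the test).
alpha, beta = 1.5, 0.7

# Riemann sum on a fine grid; the density is negligible outside [-3, 3]
# for these parameter values because of the -alpha * x**6 term.
grid = np.linspace(-3.0, 3.0, 200001)
dx = grid[1] - grid[0]
total = float(np.sum(density(grid, alpha, beta)) * dx)
```

If the reading of the density is right, `total` should be very close to 1 for any valid choice of `alpha > 0` and `beta`.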
2. A traditional model of linear regression assumes that all data points have been corrupted with the same level of noise, that is, $p(y \mid x, w) = \mathcal{N}(y \mid w^T x, \sigma^2)$. Instead, consider a new model of linear regression where
$$p(y \mid x, w) = \mathcal{N}(y \mid w^T x, f(x))$$
where $f(x) = \|x\|$. Given a dataset of the form $D = \{(x_i, y_i)\}_{i=1}^N$, derive the maximum likelihood estimator of $w$ and the maximum a posteriori estimator of $w$ assuming the prior $p(w) = \mathcal{N}(w \mid 0, \sigma^2 I)$. Describe what effect this will have on the estimation of $w$ and why. Finally, if you were fitting this model in practice, what, if any, practical or numerical concerns might you have separate from the usual concerns of linear regression?
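To make the noise model concrete, here is a minimal sketch, not part of the original test, that draws synthetic data from it. It reads the second argument of the normal density as a variance, matching the $\sigma^2$ convention of the homoscedastic model, so each point has noise variance $\|x_i\|$. The function name `simulate`, the Gaussian inputs, and the value of `w_true` are hypothetical choices for illustration.

```python
import numpy as np

def simulate(w, n, rng):
    """Draw n points with y_i ~ N(w^T x_i, ||x_i||), variance = ||x_i||."""
    d = w.shape[0]
    X = rng.normal(size=(n, d))              # inputs (illustrative choice)
    noise_var = np.linalg.norm(X, axis=1)    # f(x) = ||x||, one value per row
    y = X @ w + rng.normal(size=n) * np.sqrt(noise_var)
    return X, y, noise_var

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])          # hypothetical ground truth
X, y, noise_var = simulate(w_true, 20000, rng)

# Squared residuals divided by ||x_i|| should average to about 1
# if the noise variance really scales with the norm of the input.
scaled_sq_resid = (y - X @ w_true) ** 2 / noise_var
```

Data generated this way can be used to sanity-check a derived estimator against `w_true`.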
3. Sometimes we may have knowledge about a classification problem that comes from previous studies or some other fundamental knowledge. In these cases we would, ideally, like to take that information into account. For instance, we might know from a previous study that 70% of people who get a particular disease are women and about 55% of people who don't get the disease are women. Or we might believe that there should be no connection between a disease and sex, even if the data we're learning from may be biased. If we're trying to predict whether a person has that disease given a bunch of information including their sex, how can we make use of that knowledge?
Specifically, consider a classification problem where we are trying to predict the binary class label $y \in \{0, 1\}$ given the input $x$. Further, assume that $x$ has two parts, i.e., $x = (x_1, x_2)^T$ where $x_2 \in \{0, 1\}$ and we already know the distribution $p(x_2 \mid y)$. Derive a classifier $p(y \mid x)$ which explicitly takes advantage of this known distribution but makes no other assumption about the data. Explicitly define what other distributions we need to estimate from the data. If $x_1$ is a discrete variable with $K$ choices (i.e., $x_1 \in \{1, 2, \ldots, K\}$) and we want to avoid making more assumptions about the data, what are the forms of the distributions and how many parameters do we need to estimate? Be sure to justify your answer and your derivation thoroughly.
4. Consider a Gamma class conditional model for binary classification where the inputs are $x \in \mathbb{R}^+$ (i.e., 1D scalars that are always greater than zero) and the output is a binary class $y \in \{0, 1\}$. In this model, it is assumed that $p(x \mid y = c) = \mathrm{Gamma}(x \mid \alpha_c, \beta_c)$ where $\alpha_c$ and $\beta_c$ are the parameters of the Gamma distribution of $x$ for class $c$ and that $p(y) = \mathrm{Bernoulli}(y \mid \theta)$. The probability density function of the Gamma distribution is
$$\mathrm{Gamma}(x \mid \alpha, \beta) = Z(\alpha, \beta)\, x^{\alpha - 1} \exp(-\beta x)$$
where $Z(\alpha, \beta)$ is the normalization constant which doesn't depend on $x$. Derive the function whose zeros (i.e., points where the function is equal to zero) define the decision boundary between the two classes for this model. For the cases when either $\alpha_0 = \alpha_1$ or $\beta_0 = \beta_1$, give direct expressions for the values of $x$ which lie on the decision boundary. What is the geometric structure of the decision boundary for this model and why? What is the decision boundary if both $\alpha_0 = \alpha_1$ and $\beta_0 = \beta_1$?
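For reference, in the shape-rate parameterization used here the normalization constant is the standard $Z(\alpha, \beta) = \beta^\alpha / \Gamma(\alpha)$. The following minimal sketch, not part of the original test, evaluates the log density with that constant; the function name `gamma_log_pdf` and the example values are illustrative assumptions.

```python
import math

def gamma_log_pdf(x, alpha, beta):
    """Log of Gamma(x | alpha, beta) with shape alpha and rate beta, x > 0."""
    # log Z(alpha, beta) = alpha * log(beta) - log Gamma(alpha)
    log_z = alpha * math.log(beta) - math.lgamma(alpha)
    return log_z + (alpha - 1.0) * math.log(x) - beta * x

# Example evaluation at illustrative values x = 0.5, alpha = 2, beta = 3.
value = math.exp(gamma_log_pdf(0.5, 2.0, 3.0))
```

Working in log space avoids overflow in $\Gamma(\alpha)$ for large shape parameters, which is why `math.lgamma` is used instead of `math.gamma`.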