CSCI 567 Fall 2022 Practice Problems for Part II of class
Instructor:

1 Multiple choice questions


(1) Which of the following about neural networks is correct?

(A) A neural network with a fixed architecture can represent any continuous function.
(B) Dropout is useful for preventing overfitting when training a neural net.
(C) A neural network with one hidden layer and ReLU activations is a convex model, since ReLU is a
convex function.
(D) A convolution layer is a special case of a fully connected layer.

(2) Suppose we are training a neural network with mini-batch SGD with batch size 50 and 50000
training samples. How many gradient updates are performed in total over 5 training epochs?

(D) 250000
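A quick sanity check for this question: assuming each epoch is a single pass over the data and every mini-batch triggers exactly one gradient update, the count is just (number of samples / batch size) × (number of epochs). A minimal sketch:

# Number of SGD updates, assuming one update per mini-batch and one full pass per epoch.
n_samples = 50_000
batch_size = 50
epochs = 5

updates_per_epoch = n_samples // batch_size   # 1000 mini-batches per epoch
total_updates = updates_per_epoch * epochs    # 5000 updates over 5 epochs
print(total_updates)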

(3) Suppose a convolution layer takes a 4 × 6 image with 3 channels as input and outputs a 3 × 4 × 8
volume. Which of the following is a possible configuration of this layer? (You can refer to slide 50 of
Lecture 7.)

(A) One 2 × 3 filter with depth 8, stride 1, and no zero-padding.
(B) Eight 2 × 3 filters with depth 3, stride 1, and no zero-padding.
(C) Eight 2 × 2 filters with depth 3, stride 2, and 1 pixel of zero-padding.
(D) Eight 2 × 2 filters with depth 3, stride 2, and 2 pixels of zero-padding.
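For Questions 3–5 it helps to recall the standard output-size formula for a convolution layer, out = ⌊(in + 2·pad − filter)/stride⌋ + 1, with the output depth equal to the number of filters. A small sketch of that formula (the helper name conv_output_size is ours):

# Spatial output size of a convolution layer along one dimension (standard formula assumed):
# out = floor((in + 2 * pad - kernel) / stride) + 1
def conv_output_size(in_size, kernel, stride=1, pad=0):
    return (in_size + 2 * pad - kernel) // stride + 1

# For example, configuration (C) above applied to the 4 x 6 input:
print(conv_output_size(4, 2, stride=2, pad=1))   # height: 3
print(conv_output_size(6, 2, stride=2, pad=1))   # width:  4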

(4) How many parameters do we need to learn for the following network structure? An 8 × 8 × 3 image
input, followed by a convolution layer with 4 filters of size 3 × 3 (stride 1 and 1 pixel of zero-padding),
then another convolution layer with 3 filters of size 2 × 2 (stride 2 and no zero-padding), and finally a
max-pooling layer with a 2 × 2 filter (stride 2 and no zero-padding). (Note: the depth of each filter is
not explicitly spelled out, and we assume no bias/intercept terms are used here.)

(A) 48 (B) 144 (C) 156 (D) 168

(5) What is the final output dimension of the network in the last question? (You can refer to slide 50 of Lecture 7.)

(A) 2 × 2 × 3 (B) 4 × 4 × 3 (C) 2 × 2 × 1 (D) 4 × 4 × 1
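For Questions 4 and 5, each filter spans the full depth of its input volume, so a k × k filter applied to a volume of depth d contributes k·k·d weights (and no bias, per the note in Question 4), while pooling layers add no parameters. A sketch that traces the shapes and the parameter count through the network described above (the helper conv_output_size repeats the formula from the previous sketch):

# Trace shapes and parameter counts through the network of Question 4 (no bias terms, as stated).
def conv_output_size(in_size, kernel, stride=1, pad=0):
    # out = floor((in + 2 * pad - kernel) / stride) + 1
    return (in_size + 2 * pad - kernel) // stride + 1

h = w = 8
depth = 3                                       # 8 x 8 x 3 input

# Conv layer 1: 4 filters of size 3 x 3 (each of depth 3), stride 1, 1 pixel of zero-padding.
params = 4 * (3 * 3 * depth)
h = conv_output_size(h, 3, stride=1, pad=1)
w = conv_output_size(w, 3, stride=1, pad=1)
depth = 4

# Conv layer 2: 3 filters of size 2 x 2 (each of depth 4), stride 2, no zero-padding.
params += 3 * (2 * 2 * depth)
h = conv_output_size(h, 2, stride=2, pad=0)
w = conv_output_size(w, 2, stride=2, pad=0)
depth = 3

# Max-pooling 2 x 2, stride 2: shrinks the spatial size but has no parameters to learn.
h = conv_output_size(h, 2, stride=2, pad=0)
w = conv_output_size(w, 2, stride=2, pad=0)

print(params)           # total number of learned parameters
print((h, w, depth))    # final output dimension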

(6) Which of the following about decision trees is correct?

(A) Good interpretability is a key advantage of decision trees.
(B) Decision tree algorithms are usually implemented using recursion.
(C) Entropy is the only way to measure the uncertainty of a node when building a decision tree.

(D) Regularization is not applicable to decision trees since they do not minimize a certain loss function.

(7) Consider a binary dataset with 50 positive examples and 50 negative examples. Decision stump T1 splits
this dataset into two children, where the left child has 20 positive examples and 40 negative examples,
while another decision stump T2 results in a left child with 25 positive examples and 25 negative ex-
amples. Which of the following is correct?

(A) The entropy of the left child of T1 is (1/3) ln 3 + (2/3) ln(3/2).

(B) The entropy of the right child of T1 is (1/4) ln 4 + (3/4) ln(4/3).

(C) The entropy of either child of T2 is the maximum possible entropy for two classes.
(D) Based on conditional entropy, T2 is a better split than T1.
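For Question 7, the quantities to compare are the entropy of each child, −∑_k p_k ln p_k, and the conditional entropy of a split, i.e. the children's entropies weighted by the fraction of examples each child receives. A small sketch, using natural logarithms as in the options above:

import math

def entropy(counts):
    # Entropy -sum_k p_k ln p_k of a node with the given class counts (natural log).
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)

# T1: left child 20+/40-, so the right child gets the remaining 30+/10-.
# T2: left child 25+/25-, so the right child is also 25+/25-.
cond_T1 = 0.6 * entropy([20, 40]) + 0.4 * entropy([30, 10])
cond_T2 = 0.5 * entropy([25, 25]) + 0.5 * entropy([25, 25])
print(entropy([20, 40]), entropy([30, 10]))   # child entropies of T1
print(cond_T1, cond_T2)                       # conditional entropies of the two splits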

(8) Which of the following about Principal Component Analysis (PCA) is correct?

(A) PCA can be used to compress a dataset.
(B) PCA is useful for visualizing a dataset.
(C) PCA requires finding all the eigenvectors of the covariance matrix.
(D) To decide how many principal components to use, we can plot the eigenvalues and check whether
they become small after some point.

(9) Which of the following about k-means is correct?

(A) k-means always finds the global minimum of the k-means objective, but it might take a very long
time to converge.
(B) k-means always finds a local minimum of the k-means objective.
(C) There are initialization algorithms (such as k-means++) for the k-means objective, such that the
algorithm is always guaranteed to find the global minimum of the objective in polynomial time.
(D) k-means may not find the best possible cluster assignment for the cluster centers it finds.

(10) Which of the following about the Gaussian Mixture Model (GMM) are correct?

(A) GMM assumes that the data are generated probabilistically in a particular way, and thus can only
be applied if we know the training data is indeed generated from this model.
(B) The global maximizer of the maximum likelihood objective of a GMM model can be found using
the EM algorithm.
(C) Learning a GMM via EM gives not only the cluster (soft) assignments and centers, but also other
parameters such as mixture weights and covariance matrices.
(D) GMM for clustering always performs better than k-means since it learns more parameters.

(11) When applying a GMM with k components to a dataset of n points, if we represent γij (the soft
assignment/posterior probability of data point i belonging to cluster j) as a matrix with n rows and k
columns, what is the sum of all the entries of this matrix?
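A short way to see the answer to Question 11: each row of the matrix is a posterior distribution over the k components, so every row sums to 1. A tiny numerical sketch (the unnormalized component scores below are made up for illustration):

import numpy as np

# Made-up unnormalized scores pi_j * N(x_i; mu_j, Sigma_j) for n = 4 points and k = 3 components.
scores = np.array([[0.2, 0.5, 0.3],
                   [0.1, 0.1, 0.8],
                   [0.6, 0.3, 0.1],
                   [0.3, 0.3, 0.6]])
gamma = scores / scores.sum(axis=1, keepdims=True)   # soft assignments: each row sums to 1

print(gamma.sum(axis=1))   # all ones
print(gamma.sum())         # n (= 4 here), since each of the n rows sums to 1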

(12) In a multi-armed bandit problem with two arms and binary rewards, suppose that we have selected
the first arm 6 times, 3 of which yield reward 1, and the second arm 3 times, 2 of which yield reward
1. Which of the following about the behavior of the UCB algorithm in the next round is correct (you
can refer to the lecture notes for the UCB algorithm)?

(A) UCB selects the second arm because 1

(B) UCB selects the second arm because 1

(C) UCB selects the second arm because 1

(D) UCB selects the second arm with probability 2
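The options above compare the arms' upper confidence bounds. As a sketch of the computation only, assuming a UCB1-style index of empirical mean plus √(2 ln t / n_a); the exact exploration bonus in the lecture notes may differ, so substitute that formula before drawing conclusions:

import math

def ucb_index(total_reward, pulls, t):
    # Empirical mean plus an exploration bonus (UCB1-style bonus assumed here;
    # replace it with the bonus used in the lecture notes if it differs).
    return total_reward / pulls + math.sqrt(2 * math.log(t) / pulls)

t = 6 + 3 + 1                      # the upcoming round
print(ucb_index(3, 6, t))          # arm 1: mean 1/2 plus its bonus
print(ucb_index(2, 3, t))          # arm 2: mean 2/3 plus a larger bonus (fewer pulls)
# UCB deterministically plays the arm with the larger index.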

2 EM for a Zero-Inflated Poisson Model

Consider the following probabilistic model to generate a non-negative integer x. First, draw a hidden binary
variable z from a Bernoulli distribution with mean π ∈ [0, 1], that is, p(z = 1; π) = π and p(z = 0; π) = 1 − π.
If z = 0, set x = 0; otherwise, draw x from a Poisson distribution with parameter λ, so that

p(x|z = 1; λ) = λ^x e^(−λ) / x!,   for x = 0, 1, 2, . . .

Given n samples x1, . . . , xn generated independently in this way, follow the steps below to derive the
EM algorithm for this model. (This is also a good example to see why finding the exact MLE is difficult;
you can try it yourself.)

Fixing the model parameters π and λ, for each sample i, find the posterior distribution of the corresponding
hidden variable zi: p(zi|xi; π, λ). You will find it useful to consider the cases xi > 0 and xi = 0 separately.
Given that zi is binary, this means that you have to find the values of the following four quantities:

• γ′0 = p(zi = 0|xi > 0;π, λ),

• γ′1 = p(zi = 1|xi > 0;π, λ),

• γ0 = p(zi = 0|xi = 0;π, λ),

• γ1 = p(zi = 1|xi = 0;π, λ).

Suppose that we have computed γ0, γ1, γ′0, γ′1 from the previous values of π and λ. Now, write down the
expected complete likelihood function Q(π, λ) in terms of the data x1, . . . , xn, the posteriors γ0, γ1, γ′0, γ′1,
and the parameters π and λ (show intermediate steps). This completes the E-step.

Find the maximizers π∗ and λ∗ of the function Q(π, λ) from the previous question (show your derivation).
You might find it convenient to use the notation N0 = |{i : xi = 0}| (that is, the number of examples with
value 0) in your solution. This completes the M-step.

Combining all the results, write down the EM update formulas for πnew and λnew, using only the data
x1, . . . , xn and the previous parameter values πold and λold (do not use γ0, γ1, γ′0, γ′1).
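To check a derivation of this EM algorithm numerically, here is a minimal sketch based on our own working of the updates: for xi > 0 the posterior puts all mass on zi = 1, for xi = 0 it follows from Bayes' rule, and the M-step reduces to weighted Bernoulli/Poisson maximum-likelihood estimates. Verify these against your own answer before relying on the sketch.

import numpy as np

def em_zero_inflated_poisson(x, pi=0.5, lam=1.0, iters=100):
    # EM for the model above: z ~ Bernoulli(pi); x = 0 if z = 0, else x ~ Poisson(lam).
    # The update formulas follow our own derivation and should be checked against yours.
    x = np.asarray(x, dtype=float)
    n = len(x)
    n0 = np.sum(x == 0)                                   # N0: number of examples equal to 0
    for _ in range(iters):
        # E-step: gamma1 = p(z = 1 | x = 0) by Bayes' rule; for x > 0, p(z = 1 | x) = 1.
        gamma1 = pi * np.exp(-lam) / (1 - pi + pi * np.exp(-lam))
        # M-step: expected count of z = 1, then the usual closed-form maximizers.
        expected_ones = (n - n0) + n0 * gamma1
        pi = expected_ones / n
        lam = x.sum() / expected_ones
    return pi, lam

print(em_zero_inflated_poisson([0, 0, 0, 2, 3, 1, 0, 4]))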

3 Naive Bayes and Boosting

Consider the dataset of 5 examples shown in Table 1. Each example corresponds to a mushroom and has
two features, “Color” and “Aroma”, and one label, “Label”, which denotes whether or not that mushroom
is poisonous. Each of these takes two possible values: red or gray for Color, chemical or earthy for Aroma,
and poisonous or edible for Label.

i Color Aroma Label

1 red earthy edible
2 gray chemical edible
3 red earthy poisonous
4 red chemical poisonous
5 gray earthy poisonous

Table 1: Training set

p(L = poisonous)
p(L = edible)

                 C = red    C = gray    A = chemical    A = earthy
L = poisonous
L = edible

Table 2: Naive Bayes parameters

(a) Suppose that we run Naive Bayes on this dataset. Write down all 10 parameters learned by this
algorithm in Table 2. Specifically, for the table on the top, fill in the unconditional probability of the
label, and for the table on the bottom, fill in the probability of each feature value conditioned on the label:
for example, the first entry represents p(C = red|L = poisonous) (C, A, and L are shorthand for Color,
Aroma, and Label respectively). No reasoning needed.

(b) For each of the 4 possible feature combinations x, write down the joint probability p(x, L = poisonous/edible)
in Table 3 (still under the Naive Bayes assumption), as well as the prediction of the algorithm. No
reasoning needed.

x p(x, L = poisonous) p(x, L = edible) Prediction (poisonous or edible)

(red, chemical)

(red, earthy)

(gray, chemical)

(gray, earthy)

Table 3: Naive Bayes predictions
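As a way to check your entries in Tables 2 and 3, here is a sketch of the counting behind the maximum-likelihood Naive Bayes estimates (no smoothing, which is what the question appears to assume):

from collections import Counter

# Training set from Table 1: (Color, Aroma, Label).
data = [("red", "earthy", "edible"),
        ("gray", "chemical", "edible"),
        ("red", "earthy", "poisonous"),
        ("red", "chemical", "poisonous"),
        ("gray", "earthy", "poisonous")]

label_counts = Counter(label for _, _, label in data)
prior = {lab: cnt / len(data) for lab, cnt in label_counts.items()}   # p(L)

def cond(feature_index, value, label):
    # p(feature = value | L = label), estimated by counting (no smoothing assumed).
    rows = [d for d in data if d[2] == label]
    return sum(d[feature_index] == value for d in rows) / len(rows)

def joint(color, aroma, label):
    # Naive Bayes joint probability: p(L) * p(C | L) * p(A | L).
    return prior[label] * cond(0, color, label) * cond(1, aroma, label)

for x in [("red", "chemical"), ("red", "earthy"), ("gray", "chemical"), ("gray", "earthy")]:
    scores = {lab: joint(*x, lab) for lab in prior}
    print(x, scores, max(scores, key=scores.get))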

(c) Suppose that we run AdaBoost with Naive Bayes as the base algorithm. What you have figured out
in the last two questions shows you what this base algorithm will do in the first round of AdaBoost.
Calculate the importance of this classifier β1 and the distribution D2 over the 5 training examples for
the next round of AdaBoost (refer to the lecture notes for the expressions for β1 and D2). Show your
derivation.
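For part (c), a sketch of the standard AdaBoost bookkeeping, assuming the common form β_t = (1/2) ln((1 − ε_t)/ε_t) and D_{t+1}(i) ∝ D_t(i) exp(−β_t y_i h_t(x_i)) with labels and predictions in {−1, +1}; check these against the exact expressions in the lecture notes before using it:

import math

def adaboost_round(labels, preds, D):
    # One AdaBoost update with labels/predictions in {-1, +1} and current weights D (summing to 1).
    eps = sum(d for d, y, p in zip(D, labels, preds) if y != p)    # weighted error
    beta = 0.5 * math.log((1 - eps) / eps)                         # importance of this classifier
    D_new = [d * math.exp(-beta * y * p) for d, y, p in zip(D, labels, preds)]
    Z = sum(D_new)                                                 # normalizer
    return beta, [d / Z for d in D_new]

# Usage: encode poisonous/edible as +1/-1, plug in the Naive Bayes predictions from part (b),
# and start from the uniform distribution D1 = [1/5] * 5.
# beta1, D2 = adaboost_round(labels, preds, D=[1/5] * 5)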

4 Boosting for Regression

We know that AdaBoost can be derived by greedily minimizing the exponential loss. In this problem, you
will follow the same idea to derive a boosting algorithm for regression by greedily minimizing the square loss.

Specifically, our goal is to output a function f(x) = ∑_{τ=1}^T βτhτ(x), where βτ ∈ R and hτ is from a
fixed class H which consists of mappings from Rd to [−1, 1]. Denote ft = ∑_{τ=1}^t βτhτ (so that f = fT),

and suppose we have found ft−1. Now, we want to find βt and ht to minimize the square loss on a set of
examples (x1, y1), . . . , (xn, yn) ∈ Rd × [−1, 1]:

L(βt, ht) = ∑_{i=1}^n (ft(xi) − yi)² = ∑_{i=1}^n (βtht(xi) − ri)²,

where ri = yi − ft−1(xi).

(a) Given ht provided by a base algorithm, find the value of βt that minimizes L(βt, ht).

(b) Suppose that H is symmetric in the sense that if h ∈ H, then −h ∈ H. Plug the optimal value of βt
from the last question into L(βt, ht), and show that if the base algorithm wants to minimize L(βt, ht),
it needs to find

h∗t = argmin

where ∥ht∥ = √(∑_{i=1}^n ht(xi)²).
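As a numerical companion to this problem, here is a sketch of a few greedy rounds on toy data; the base class is a set of one-dimensional decision stumps, which is purely our own choice for illustration. The scalar β for a fixed h is taken from one-dimensional least squares, β = ∑_i ri h(xi) / ∑_i h(xi)², which you can compare against your answer to part (a):

import numpy as np

def best_beta(h_vals, residuals):
    # One-dimensional least squares: the beta minimizing sum_i (beta * h(x_i) - r_i)^2.
    return np.dot(residuals, h_vals) / np.dot(h_vals, h_vals)

# Toy setup (our own choice): 1-d inputs, targets in [-1, 1], stumps h(x) = sign(x - s).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
y = np.clip(np.sin(3 * x), -1, 1)

f = np.zeros_like(y)                               # f_0 = 0
for t in range(1, 6):                              # a few greedy rounds
    r = y - f                                      # residuals r_i = y_i - f_{t-1}(x_i)
    # "Base algorithm": scan stump thresholds and keep the one whose scaled fit has the lowest loss.
    best_loss, best_s, beta = np.inf, None, 0.0
    for s in np.linspace(-0.99, 0.99, 21):
        h_vals = np.sign(x - s)
        b = best_beta(h_vals, r)
        loss = np.sum((b * h_vals - r) ** 2)
        if loss < best_loss:
            best_loss, best_s, beta = loss, s, b
    f = f + beta * np.sign(x - best_s)             # f_t = f_{t-1} + beta_t * h_t
    print(t, float(np.mean((f - y) ** 2)))         # training MSE should not increase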

