
CSC 311: Introduction to Machine Learning
Tutorial 12 – Test 2 Review
University of Toronto


This tutorial
We cover example questions on several topics:
Bias-Variance Decomposition
Bagging / Boosting
Probabilistic Models (Naïve Bayes, Gaussian Discriminant Analysis)
Principal Component Analysis (Matrix factorization, Autoencoder)
K-Means / EM

Useful mathematical concepts
Working with logs / exponents
MLE, MAP, Generative modeling
Independence, conditional independence
Bayes rule, law of total probability, marginalization.
Properties of covariance matrices (e.g., positive semidefiniteness) and the spectral decomposition used in PCA (a small check is sketched after this list).
Definition of expectation. Expectation/variance of a sum of variables
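For instance, the positive semidefiniteness of a covariance matrix can be checked directly through its spectral decomposition. A minimal numpy sketch (the data and variable names are illustrative assumptions, not from the tutorial):

```python
import numpy as np

rng = np.random.default_rng(0)
# Arbitrary correlated data, purely for illustration
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))
cov = np.cov(X, rowvar=False)

# Spectral decomposition of the symmetric covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)

# PSD: all eigenvalues are (numerically) non-negative
print(np.all(eigvals >= -1e-10))

# PCA uses exactly these eigenvectors, ordered by decreasing eigenvalue
print(eigvecs[:, ::-1])
```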

Bias-Variance Decomposition
E[(y − t)²] = (y⋆ − E[y])² + Var(y) + Var(t)
where the three terms are the bias, the variance, and the Bayes error, respectively.
We just split the expected loss into three terms:
bias: how wrong the expected prediction is (corresponds to underfitting)
variance: the amount of variability in the predictions (corresponds to overfitting)
Bayes error: the inherent unpredictability of the targets
Even though this analysis only applies to squared error, we often loosely use “bias” and “variance” as synonyms for “underfitting” and “overfitting”.
(From Lecture 5, Slide 49)
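As a sanity check, the decomposition can be verified numerically by repeatedly re-drawing a training set and refitting a model. The sketch below is illustrative only: the true function, noise level, model class, and query point are all assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)      # assumed true function y_star
noise_std = 0.3                          # assumed noise level (Bayes error source)
x_test = 0.4                             # single query point for clarity

preds = []
for _ in range(500):
    # Fresh training set each round, simulating "re-drawing the dataset"
    x_train = rng.uniform(0, 1, 20)
    t_train = f(x_train) + rng.normal(0, noise_std, 20)
    # Fit an intentionally simple (high-bias) degree-1 polynomial
    w = np.polyfit(x_train, t_train, deg=1)
    preds.append(np.polyval(w, x_test))

preds = np.array(preds)
bias_sq = (f(x_test) - preds.mean()) ** 2   # (y_star - E[y])^2
variance = preds.var()                      # Var(y)
bayes_err = noise_std ** 2                  # Var(t)

# Expected squared error at x_test, estimated directly from fresh targets
t_samples = f(x_test) + rng.normal(0, noise_std, preds.size)
expected_loss = ((preds - t_samples) ** 2).mean()
print(bias_sq + variance + bayes_err, expected_loss)  # should roughly agree
```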

Ensembling Methods (Bagging/Boosting)
Bagging: Train independent models on random subsets of the full training data
Boosting: Train models sequentially, each time focusing on examples the previous model got wrong
                     Bagging                 Boosting
Reduces              Variance                Bias
Training             Independent             Sequential
Ensemble elements    Minimize correlation    High dependency
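For concreteness, here is a small scikit-learn sketch of the contrast (the dataset, base learners, and hyperparameters are illustrative assumptions): bagging averages deep, high-variance trees trained independently on bootstrap samples, while boosting chains shallow, high-bias trees trained sequentially.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: deep trees (low bias, high variance), trained independently
# on bootstrap resamples; averaging reduces variance.
bag = BaggingClassifier(DecisionTreeClassifier(max_depth=None),
                        n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Boosting: shallow trees (high bias, low variance), trained sequentially
# with reweighted examples; the ensemble reduces bias.
boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                           n_estimators=50, random_state=0).fit(X_tr, y_tr)

print(bag.score(X_te, y_te), boost.score(X_te, y_te))
```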

Ensembling Methods (Bagging/Boosting)
Question: Suppose your classifier achieves poor accuracy on both the training and test sets. Which would be a better choice to try to improve the performance: bagging or boosting? Justify your answer.

Ensembling Methods (Bagging/Boosting)
Question: Suppose your classifier achieves poor accuracy on both the training and test sets. Which would be a better choice to try to improve the performance: bagging or boosting? Justify your answer.
The model is underfitting: it has high bias.
Bagging reduces variance, whereas boosting reduces bias. Therefore, use boosting.

Probabilistic Models: Naive Bayes
Question: True or False: Naive Bayes assumes that all features are independent.

Probabilistic Models: Naive Bayes
Question: True or False: Naive Bayes assumes that all features are independent.
Answer: False. Naive Bayes assumes that the input features xi are conditionally independent given the class c:
p(c, x1, …, xD) = p(c) p(x1|c) ··· p(xD|c)

Probabilistic Models: Naive Bayes
Question: Which of the following diagrams could be a visualization of a Naive Bayes classifier? Select all that apply.

Probabilistic Models: Naive Bayes
Question: Which of the following diagrams could be a visualization of a Naive Bayes classifier? Select all that apply.
Answer: A, D

Probabilistic Models: Naïve Bayes
Consider the following problem, in which we have two classes, {Tainted, Clean}, and each data point x has 3 attributes: (a1, a2, a3).
These attributes are also binary variables: a1 ∈ {on, off}, a2 ∈ {blue, red}, a3 ∈ {light, heavy}.
We are given a training set as follows:
1. Tainted: (on, blue, light), (off, red, light), (on, red, heavy)
2. Clean: (off, red, heavy), (off, blue, light), (on, blue, heavy)
(A) Manually construct a Naïve Bayes classifier based on the above training data. Compute the following probability tables: a) the class prior probability; b) the class-conditional probabilities of each attribute.

Probabilistic Models: Naïve Bayes
(a) Class prior probability:
p(c = Tainted) = 3/6 = 1/2, p(c = Clean) = 1/2

Probabilistic Models: Naïve Bayes
(a) Class prior probability:
p(c = Tainted) = 3/6 = 1/2, p(c = Clean) = 1/2
(b) The class conditional distributions:
p(a1 = on|c = Tainted) = 2/3, p(a1 = off|c = Tainted) = 1/3

Probabilistic Models: Naïve Bayes
(a) Class prior probability:
p(c = Tainted) = 3/6 = 1/2, p(c = Clean) = 1/2
(b) The class conditional distributions:
p(a1 = on|c = Tainted) = 2/3, p(a1 = off|c = Tainted) = 1/3
p(a2 = blue|c = Tainted) = 1/3, p(a2 = red|c = Tainted) = 2/3
p(a3 = light|c = Tainted) = 2/3, p(a3 = heavy|c = Tainted) = 1/3
p(a1 = on|c = Clean) = 1/3, p(a1 = off|c = Clean) = 2/3
p(a2 = blue|c = Clean) = 2/3, p(a2 = red|c = Clean) = 1/3
p(a3 = light|c = Clean) = 1/3, p(a3 = heavy|c = Clean) = 2/3
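These tables are just relative-frequency counts over the six training examples. A minimal Python sketch that reproduces them (the variable names are my own, not from the tutorial):

```python
from collections import Counter

# The six training examples from the slide: (a1, a2, a3, class)
data = [
    ("on",  "blue", "light", "Tainted"),
    ("off", "red",  "light", "Tainted"),
    ("on",  "red",  "heavy", "Tainted"),
    ("off", "red",  "heavy", "Clean"),
    ("off", "blue", "light", "Clean"),
    ("on",  "blue", "heavy", "Clean"),
]

classes = Counter(c for *_, c in data)
prior = {c: n / len(data) for c, n in classes.items()}   # p(c)

# Class-conditional p(a_j = v | c), estimated by counting within each class
cond = {}
for j in range(3):
    for *attrs, c in data:
        key = (j, attrs[j], c)
        cond[key] = cond.get(key, 0) + 1 / classes[c]

print(prior)                        # {'Tainted': 0.5, 'Clean': 0.5}
print(cond[(0, "on", "Tainted")])   # p(a1 = on | Tainted) = 2/3 ≈ 0.667
```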

Probabilistic Models: Naïve Bayes
(B) Classify a new example (on, red, light) using the classifier you built above. You need to compute the posterior probability (up to a constant) of class given this example.

Probabilistic Models: Naïve Bayes
(B) Classify a new example (on, red, light) using the classifier you built above. You need to compute the posterior probability (up to a constant) of class given this example.
Answer: To classify x = (on, red, light), we compute:
p(c|x) = p(c) p(x|c) / [ p(c = Tainted) p(x|c = Tainted) + p(c = Clean) p(x|c = Clean) ]
Computing each term:
p(c = T) p(x|c = T) = p(c = T) p(a1 = on|c = T) p(a2 = red|c = T) p(a3 = light|c = T)
= 1/2 × 2/3 × 2/3 × 2/3 = 8/54

Probabilistic Models: Naïve Bayes
(B) Classify a new example (on, red, light) using the classifier you built above. You need to compute the posterior probability (up to a constant) of class given this example.
Answer: Similarly,
p(c = Clean) p(x|c = Clean) = 1/2 × 1/3 × 1/3 × 1/3 = 1/54
Therefore, p(c = Tainted|x) = 8/9 and p(c = Clean|x) = 1/9, so according to the Naïve Bayes classifier this example should be classified as Tainted.
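Continuing the counting sketch above (using the same hypothetical prior and cond dictionaries), the posterior follows by multiplying the factors and normalizing:

```python
x = ("on", "red", "light")

# Unnormalized joint p(c) * prod_j p(a_j = x_j | c) for each class
joint = {
    c: prior[c] * cond[(0, x[0], c)] * cond[(1, x[1], c)] * cond[(2, x[2], c)]
    for c in prior
}
Z = sum(joint.values())
posterior = {c: p / Z for c, p in joint.items()}
print(posterior)   # {'Tainted': 0.888..., 'Clean': 0.111...} -> predict Tainted
```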

Principal Component Analysis (PCA)
1. The principal components of a dataset can be found by either minimizing an objective or, equivalently, maximizing a different objective. In words, describe the objective in each case using a single sentence.

Principal Component Analysis (PCA)
1. The principal components of a dataset can be found by either minimizing an objective or, equivalently, maximizing a different objective. In words, describe the objective in each case using a single sentence.
Minimizing: the reconstruction error, i.e., the distance between each original point and its projection onto the principal component subspace.

Principal Component Analysis (PCA)
1. The principal components of a dataset can be found by either minimizing an objective or, equivalently, maximizing a different objective. In words, describe the objective in each case using a single sentence.
Minimizing: the reconstruction error, i.e., the distance between each original point and its projection onto the principal component subspace.
Maximizing: the variance of the code vectors, i.e., the variance of the coordinate representations of the data in the principal component subspace.
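A small numpy sketch (with assumed synthetic data) illustrating that the direction of maximum projected variance is also the direction of minimum reconstruction error: the total variance splits, approximately, into the variance captured by the first principal component plus the mean squared reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data (assumed for illustration), then centered
X = rng.multivariate_normal([0, 0], [[3.0, 1.2], [1.2, 1.0]], size=500)
X = X - X.mean(axis=0)

# Principal directions = eigenvectors of the (PSD) covariance matrix
cov = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
u1 = eigvecs[:, -1]                        # first PC: largest eigenvalue

# Maximizing view: variance of the 1-D code z = u1^T x
proj_var = np.var(X @ u1)

# Minimizing view: squared error of reconstructing x from its projection
X_hat = np.outer(X @ u1, u1)
recon_err = np.mean(np.sum((X - X_hat) ** 2, axis=1))

# Total variance ≈ (captured variance) + (reconstruction error)
print(proj_var, recon_err, np.trace(cov))
```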

Principal Component Analysis (PCA)
2. The figure below shows a two-dimensional dataset. Draw the vector corresponding to the second principal component.

Principal Component Analysis (PCA)
2. The figure below shows a two-dimensional dataset. Draw the vector corresponding to the second principal component.

K-Means / EM
1. What is the difference between the K-Means and Soft K-Means algorithms?

K-Means / EM
1. What is the difference between the K-Means and Soft K-Means algorithms?
Hard K-Means assigns each point to a single cluster, whereas Soft K-Means assigns responsibilities (summing to 1) across all clusters.
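A tiny numpy sketch of the two assignment rules for a single point (the point, centers, and stiffness parameter beta are illustrative assumptions):

```python
import numpy as np

x = np.array([1.0, 0.5])                       # one data point (illustrative)
centers = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]])
dists_sq = np.sum((centers - x) ** 2, axis=1)

# Hard K-Means: one-hot assignment to the nearest center
hard = np.zeros(len(centers))
hard[np.argmin(dists_sq)] = 1.0

# Soft K-Means: responsibilities from a softmax over negative distances,
# with an assumed stiffness parameter beta
beta = 1.0
logits = -beta * dists_sq
soft = np.exp(logits - logits.max())
soft /= soft.sum()

print(hard, soft, soft.sum())                  # soft responsibilities sum to 1
```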

K-Means / EM
2. K-means algorithm can be seen as a special case of the EM algorithm. Describe the steps in K-means that correspond to the E and M steps, respectively.

K-Means / EM
2. K-means algorithm can be seen as a special case of the EM algorithm. Describe the steps in K-means that correspond to the E and M steps, respectively.
The assignment step in K-Means corresponds to the E-step in EM: both compute the responsibilities (cluster assignments) for each data point.

K-Means / EM
2. K-means algorithm can be seen as a special case of the EM algorithm. Describe the steps in K-means that correspond to the E and M steps, respectively.
The assignment step in K-Means corresponds to the E-step in EM: both compute the responsibilities (cluster assignments) for each data point.
The refitting step in K-Means minimizes the distance of points to their cluster centers, while the M-step in EM maximizes the expected complete-data likelihood.

K-Means / EM
2. K-means algorithm can be seen as a special case of the EM algorithm. Describe the steps in K-means that correspond to the E and M steps, respectively.
The assignment step in K-Means corresponds to the E-step in EM: both compute the responsibilities (cluster assignments) for each data point.
The refitting step in K-Means minimizes the distance of points to their cluster centers, while the M-step in EM maximizes the expected complete-data likelihood.
Soft K-Means is equivalent to EM with a shared spherical (equal-variance diagonal) covariance, while EM for a Gaussian mixture can have arbitrary covariances.
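To make the correspondence concrete, here is a one-iteration sketch on assumed synthetic data: the hard assignment plays the role of the E-step responsibilities, and the responsibility-weighted mean update plays the role of the M-step.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) + np.repeat([[0, 0], [4, 4]], 100, axis=0)
mu = X[rng.choice(len(X), 2, replace=False)]    # initial centers
sigma2, pi = 1.0, np.array([0.5, 0.5])          # shared spherical variance, mixing weights

d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # squared distances, shape (N, K)

# K-Means assignment step  <->  E-step with hard (0/1) responsibilities
r_hard = (d2 == d2.min(axis=1, keepdims=True)).astype(float)

# EM E-step: soft responsibilities under spherical Gaussians
log_r = np.log(pi) - d2 / (2 * sigma2)
r_soft = np.exp(log_r - log_r.max(axis=1, keepdims=True))
r_soft /= r_soft.sum(axis=1, keepdims=True)

# K-Means refitting step / EM M-step: responsibility-weighted means
mu_kmeans = r_hard.T @ X / r_hard.sum(axis=0)[:, None]
mu_em = r_soft.T @ X / r_soft.sum(axis=0)[:, None]
print(mu_kmeans, mu_em)
```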
