CSCC11 Introduction to Machine Learning, Winter 2021 Assignment 4, Due Thursday, April 8, 10am
This assignment makes use of material from week 8 to week 11 (specifically Chapter 12, Chapter 14 and Chapter 16). To begin the programming component, download a4.tgz from the course website and untar it. A directory A4 will be created; please don’t change its structure, nor the headers of the python methods therein. Please read the assignment handout carefully. Submission instructions are at the end.
We prefer you ask questions on Piazza. Otherwise, for your first point of contact, please reach out to your TA, Gautam Dawar, through email: gautam.dawar@mail.utoronto.ca.
Written Component
Suppose you are deciding whether or not to procrastinate on this assignment and want to rely on tossing a coin N times. Every time the coin lands on head, you get to procrastinate for 30 minutes and otherwise you need to work on this assignment for 30 minutes. After observing many coin tosses, you notice that you need to work on the assignment more often than not (curses!).
Being in this course for so long, you want to investigate this "fair" coin and determine the probability of tossing a head. Given N tosses, you observed n of them to have landed on heads. Formally, denote x_i to be the i-th outcome, where x_i ∈ {0, 1} corresponds to tails or heads, respectively. Further denote x_{1:N} to be the N outcomes; then we can formulate this as a Binomial distribution:
P(x_{1:N} \mid \theta) = \binom{N}{n} \theta^{n} (1 - \theta)^{N-n}, (1)
where θ is the probability of tossing a head. Back when you decided to use the tossing results to determine whether or not to work, you believed your coin was more likely to land on heads. To model this, you decided to add a prior distribution on this probability that follows a Beta distribution:
P(\theta) = 336\,\theta^{5}(1 - \theta). (2)
Unhappy with the previous results, you want to toss one more time to determine whether or not to procrastinate once and for all. To this end, you want to know P(x_{N+1} | x_{1:N}) before you go all-in with this idea.
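If you would like an optional numerical sanity check (not part of the written solution), the quantities in Equations (1) and (2) can be plotted on a grid of θ values, as in the short Python sketch below; the counts N and n here are placeholders, not the values used in the questions.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.special import comb

    # Placeholder counts; Equation (1) is the likelihood, Equation (2) is the prior.
    N, n = 10, 4
    theta = np.linspace(0.0, 1.0, 1001)
    likelihood = comb(N, n) * theta**n * (1 - theta)**(N - n)   # Equation (1)
    prior = 336 * theta**5 * (1 - theta)                        # Equation (2)

    plt.plot(theta, likelihood / likelihood.max(), label="likelihood (rescaled)")
    plt.plot(theta, prior / prior.max(), label="prior (rescaled)")
    plt.xlabel("theta")
    plt.legend()
    plt.show()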
Written Problems.
1. Compute the Maximum Likelihood estimate θ_ML.
2. Compute the Maximum a Posteriori estimate θ_MAP.
3. Compute the Bayes' estimate θ_Bayes. Hint: You might find the following identity useful:
\int_0^1 \theta^{a} (1 - \theta)^{b} \, d\theta = \frac{\Gamma(a+1)\,\Gamma(b+1)}{\Gamma(a+b+2)}, (3)
where Γ(x) = (x − 1)! for x ∈ N+ (the positive integers).
4. Predict the probability of tossing a head using each of the above estimates θ_ML, θ_MAP, θ_Bayes. In other words, compute P(x_{N+1} = 1 | θ), where θ ∈ {θ_ML, θ_MAP, θ_Bayes}.
5. Predict the probability of tossing a head using Bayesian prediction. That is, compute P(x_{N+1} = 1 | x_{1:N}).
6. Suppose you tossed the coin 38 times and 18 tosses landed on heads. Would you still go all-in after seeing the probabilities above? Why?
Programming Exercise 1: Dimensionality Reduction
In the starter code, in directory A4/pca, you will find a simple implementation of PCA. See the class PCA in the Python file pca.py, and the associated methods:
1. _compute_components computes the eigenvectors and eigenvalues of the training data covariance.
2. inference returns the representation of the training data in the (low-dimensional) principal subspace.
3. reconstruct computes (reconstructs) the approximation to the training data provided by the subspace model (i.e. as a linear combination of the principal vectors).
In test_pca.py you will also find functions to generate synthetic data, with control over the dimension of the observations (the data space), the dimension of the low dimensional representation (the subspace), and the magnitude of additive noise. First set the run_em_pca flag in test_pca.py to False, and then try running the PCA model on generated data.
Choosing the right subspace dimension is often challenging. To help, it’s often useful to plot properties of the data that reflect how well different subspaces capture the main factors of variation in the data. To this end, complete the following methods:
1. plot_eigenvalues plots the j-th largest eigenvalue versus j, where j varies from 1 to the data dimension. That is, it simply plots the eigenvalues in rank order, from largest to smallest.
2. plot_subspace_variance plots the fraction of the total training data variance that is captured within the subspace, as a function of the subspace dimension. As described in the online notes, this is the sum of the K leading eigenvalues divided by the sum of all eigenvalues, where K is the dimension of the subspace.
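If it helps, the variance fraction can be computed directly from the eigenvalues, as in the minimal sketch below; the function name and plotting details are illustrative assumptions, not the required interface.

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_subspace_variance_sketch(eigenvalues):
        # Fraction of the total variance captured by the K leading eigenvalues,
        # plotted for every K from 1 to the data dimension.
        eigenvalues = np.sort(np.asarray(eigenvalues))[::-1]    # largest first
        frac = np.cumsum(eigenvalues) / np.sum(eigenvalues)
        plt.plot(np.arange(1, len(eigenvalues) + 1), frac)
        plt.xlabel("subspace dimension K")
        plt.ylabel("fraction of variance captured")
        plt.show()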
Then run the script test_pca.py and set plot_eigenvalues to True, and answer the following questions:
1. How do the eigenvalues relate to the diagonal variances used in generating the data?
2. How does the additive noise affect the curves?
3. Based on the plots, can you hypothesize ways to choose the number of subspace dimensions?
For each, you are expected to write short answers (no more than 2 or 3 sentences) in the file called questions.txt which will be submitted (electronically) with your solution.
Now, let’s consider a simple data compression application. Read visualize_documents.py carefully. The script runs PCA or EM–PCA (you will be implementing this soon) independently to compress documents into three dimensions. It then stores the data and parameters into the results directory. Set algo to pca and in questions.txt answer the following questions:
1. How big is the covariance matrix used in the PCA algorithm?
2. How long does PCA algorithm take?
3. Do the points from different classes look reasonably separable in this space?
OK. That was fun. Now we’re going to implement a different variant of the PCA algorithm. It is interesting because it does not require you to compute the covariance matrix explicitly, which is very useful for high-dimensional data. It also allows you to perform PCA even when some of the data points are missing elements of their data vectors (e.g., part of the image is corrupted). And finally, it turns out this method can be much faster because you only have to compute the K principal directions (eigenvectors) that you actually care about.
In a nutshell, we want to find a matrix A, and low-dimensional vectors x_i, to model the data points y_i by minimizing the loss \sum_i \|y_i - A x_i\|^2. We’ll use an iterative algorithm, with two steps in each iteration:
Step 1: Given the current estimate of A, find the best xi for each yi. This entails N regression problems (and note that A may not be orthogonal).
Step 2: Given the estimated values xi, find the best matrix A, yet another regression problem.
We iterate these steps until the residual error (i.e. the training loss) stops decreasing significantly.
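To make the two steps concrete, here is a minimal alternating least-squares sketch; the (D, N) data layout, the function name, and the convergence tolerance are assumptions for illustration and do not reflect the required _e_step/_m_step interface.

    import numpy as np

    def em_pca_iterations(Y, K, n_iters=200, tol=1e-8, seed=0):
        # Y is a (D, N) array with one (already centred) data point per column.
        rng = np.random.default_rng(seed)
        D, N = Y.shape
        A = rng.standard_normal((D, K))            # random initial basis
        prev_loss = np.inf
        for _ in range(n_iters):
            # Step 1: best x_i for each y_i given A (N regressions at once).
            X = np.linalg.solve(A.T @ A, A.T @ Y)
            # Step 2: best A given the current X (another regression).
            A = Y @ X.T @ np.linalg.inv(X @ X.T)
            loss = np.sum((Y - A @ X) ** 2)
            if prev_loss - loss < tol:             # residual stopped decreasing
                break
            prev_loss = loss
        return A, X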
Now it turns out that this algorithm assumes neither that the columns of the matrix A are orthogonal, nor that they are aligned with the principal directions in PCA. But we can fix both in two steps. First, using the well-known Gram-Schmidt algorithm (provided for you), we transform the matrix A so that its columns are orthonormal, without changing the subspace they span. Run this algorithm on A and check that the columns are orthogonal and have unit length. Once this is done, you will have to re-estimate the latent subspace coordinates of each point, i.e., re-solve the regression problem for each x_i with the new matrix A. Because A is orthogonal, x_i is simply the product of A^T and y_i. (Prove this if you are unsure why it is so.)
But the columns of the orthogonalized A matrix may still not be aligned with the principal directions. Fortunately, all we have to do is rotate the columns of A and the x_i. This doesn’t change our loss, since A x_i = A R R^T x_i for any orthogonal R. Moreover, it’s easy to find the right R since, as discussed in the PCA notes, the covariance of the latent subspace coordinates should be diagonal. Given C = (\sum_i x_i x_i^T)/N, where N is the number of data points, we just need to set R to be the matrix of eigenvectors of C, i.e., satisfying C = R D R^T for some diagonal matrix D. So the final step of the algorithm involves (a) finding the covariance matrix C = (\sum_i x_i x_i^T)/N, (b) finding the matrix R containing the eigenvectors of C, and (c) computing the new matrix W = A R and the new latent coordinates z_i = R^T x_i. Now, up to a reflection of the subspace vectors, W should be equivalent to the mapping obtained with the PCA algorithm above (i.e., in pca.py), and the corresponding latent coordinates z_i should be equivalent to those found using the inference method above.
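As a rough guide, the orthonormalization-and-alignment step might look like the sketch below; it uses numpy's QR factorization as a stand-in for the provided Gram-Schmidt routine, and the shapes follow the previous sketch.

    import numpy as np

    def align_with_principal_directions(A, Y):
        # Y is (D, N), A is (D, K), as in the previous sketch.
        Q, _ = np.linalg.qr(A)             # orthonormal columns, same span as A
        X = Q.T @ Y                        # re-solve for x_i; valid since Q^T Q = I
        C = (X @ X.T) / Y.shape[1]         # covariance of the subspace coordinates
        evals, R = np.linalg.eigh(C)       # C = R diag(evals) R^T
        order = np.argsort(evals)[::-1]    # sort directions by decreasing variance
        R = R[:, order]
        W = Q @ R                          # basis aligned with PCA (up to sign flips)
        Z = R.T @ X                        # rotated latent coordinates
        return W, Z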
Your next job is to implement this iterative algorithm, EM–PCA. You will need to complete the methods _e_step and _m_step in em_pca.py. In particular, the associated methods for EM–PCA are:
1. _compute_components computes the K leading eigenvectors and eigenvalues of the training data covariance.
2. _e_step: computes the subspace representation xi for each data point yi.
3. _m_step: computes the matrix A given data point yi and estimated subspace representation xi.
4. inference returns the representation of the training data in the (low-dimensional) principal subspace.
5. reconstruct computes (reconstructs) the approximation to the training data provided by the subspace model (i.e. as a linear combination of the principal vectors).
Now you can flip the switch, setting the run_em_pca flag in test_pca.py to True, and then run this new algorithm on test data to check that it works well. Then run visualize_documents.py again with algo set to empca and answer the following questions:
1. After running visualize_documents.py, compare the parameters you’ve estimated using PCA and EM–PCA. Are they identical and if not, how do they differ?
2. Which algorithm takes more space in your opinion?
3. How long does the EM–PCA algorithm take to run compared to PCA?
Programming Exercise 2: Clustering
You are asked to implement K-Means, Gaussian Mixture Model (GMM), and K-Means++. Starter code is in directory A4/clustering. You will then apply these algorithms on different datasets for various applications. This problem also asks a number of questions. For each, you are expected to write short answers (no more than 2 or 3 sentences) in the file called questions.txt which will be submitted (electronically) with your solution.
Implementing K-Means: Implement the K-Means clustering algorithm explained in Chapter 15 of the online lecture notes. You only need to implement the train method in the Python file kmeans.py, which performs K-Means clustering on a given dataset. It iteratively updates K centers and assigns each data point to the closest center. Upon convergence, the method returns the assignments of the dataset. The cluster centers are stored within the KMeans object under the variable self.centers. Once you are confident that your code is correct, you can run test_clustering.py to test your implementation. Note that you can visualize the clusters by setting visualize=True.
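For orientation only, the basic K-Means loop can be sketched as follows; the (N, D) data layout, the initial-centers argument, and the convergence test are assumptions and do not match the required train method exactly.

    import numpy as np

    def kmeans_sketch(data, centers, max_iters=1000):
        # data is (N, D); centers is (K, D).
        for _ in range(max_iters):
            # Assign each point to its nearest center (squared Euclidean distance).
            dists = np.sum((data[:, None, :] - centers[None, :, :]) ** 2, axis=2)
            labels = np.argmin(dists, axis=1)
            # Recompute each center as the mean of its assigned points.
            new_centers = np.array([
                data[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                for k in range(centers.shape[0])
            ])
            if np.allclose(new_centers, centers):  # centers stopped moving
                break
            centers = new_centers
        return labels, centers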
With K-Means implemented, let’s look at an application. Read document_clustering.py carefully. It loads a dataset from the BBC that contains word frequency vectors for thousands of documents on five different topics. Our goal is to cluster these documents to discover what those topics might be. This general problem is often called topic modeling. Here we explore some of the issues involved in determining the topics through clustering and careful data pre-processing. For this task, we only use K-Means clustering. Note: you may assume all initial centers are unique.
Step 0: Before you start, have a look at the data. Load BBC_data.pkl from the data directory. This file contains three main arrays: data contains the raw data, terms contains the words corresponding to the features of the data vectors, and labels contains the assignment of documents to classes (we’ll only use this at the end). Plot a few vectors from data, inspect some of the words in the terms array, print the word contents (and frequencies) for several document vectors, and get a good idea of how many different words you might expect in a document.
• How sparse are the document term vectors (i.e. on average, how many of the entries in each vector are zero)?
• What are the 10 most common terms? What are the 10 least common terms?
• What is the average value for word-frequencies? (only counting vector entries that are non-zero).
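One way to start on these questions is sketched below; it assumes BBC_data.pkl unpickles to a dict with keys "data", "terms", and "labels", with one document per row of data, and it measures "common" by the number of documents containing a term (total counts are a reasonable alternative). Check the actual file format in the starter code before relying on this.

    import pickle
    import numpy as np

    with open("data/BBC_data.pkl", "rb") as f:   # path and keys are assumptions
        bbc = pickle.load(f)
    data = np.asarray(bbc["data"])               # one document per row (assumed)
    terms = np.asarray(bbc["terms"])

    sparsity = np.mean(data == 0)                # fraction of zero entries
    doc_freq = np.sum(data > 0, axis=0)          # documents containing each term
    most_common = terms[np.argsort(doc_freq)[::-1][:10]]
    least_common = terms[np.argsort(doc_freq)[:10]]
    avg_nonzero = data[data > 0].mean()          # average non-zero word frequency
    print(sparsity, most_common, least_common, avg_nonzero)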
Step 1: Run the document clustering script with K=5, norm_flag=0, diffuse=0, and random center initialization (see center_initializations.py). This will apply your K-Means clustering code to the original input vectors. The result will be stored in the results directory with the corresponding experiment configuration you used. Inspect the resulting clusters and labels, and record:
• Can you figure out which topics the clusters represent?
For visualization, copy mkWordClouds.sh to the generated directory containing the files centers_.txt and run it. This will plot a word cloud for each cluster (this works on Linux ONLY; if you are running this on your own machine, you should first install WordCloud). Run the clustering algorithm several times by changing num_trials in document_clustering.py. At this point it’s fair to wonder whether clustering these documents will be effective.
• What are the factors that make clustering difficult?
• Should we expect better results if we get a lucky guess at the cluster centers?
Step 2: Run the document clustering script with K=5, norm_flag=1, diffuse=0, and random center initialization. This will run clustering after pre-processing the input document vectors so that they are normalized to have unit length. In effect, we are now treating each input vector as a probability distribution over terms.
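In case it is useful, unit-length normalization can be done in one line, as in the sketch below (assuming data is an (N, D) numpy array with one document per row); note that dividing by the row sum instead would give a literal probability distribution, so check which convention the script uses.

    import numpy as np

    norms = np.linalg.norm(data, axis=1, keepdims=True)
    unit_data = data / np.maximum(norms, 1e-12)   # guard against all-zero rows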
• What problem in Step 1 does this solve?
• Based on the data points in each cluster, what do you think the topics for these clusters are?
• Would you consider this result better or worse than Step 1? Why?
Step 3: Run the document clustering script with K=5, norm_flag=1, diffuse=2, and random center initialization. This will pre-process each document by doing 2-steps of random-walk diffusion based on word co-occurrence probabilities (estimated from the entire data-set). That is, the pre-processed vectors will not only have the original terms, but they will also contain words that are strongly associated with those originally in the document (e.g., they might be synonyms).
Run the clustering a few times, checking the output results, and pick a good one. Answer these questions:
• What would you say are the topics for clusters?
• Why is the clustering suddenly better?
• What would you say is the general lesson to be learned from trying to cluster high-dimensional sparse data?
Step 4: Plot what you think are the most representative word clouds for each of the output clusters from Step 3 above, and include them with your submitted code. For each cluster i, save an image named cluster i.png, where i = 0, 1, 2, 3, 4.
Step 5: Implement the K-means++ center initialization algorithm. You will need to modify the function kmeans_pp in center_initializations.py. Compare random center initialization ('random') and K-means++ ('kmeans_pp') by setting center_init in document_clustering.py, along with K=5, norm_flag=1, diffuse=2. (A sketch of the standard K-means++ seeding procedure appears after the questions below.)
• How does the total training error (measured with the objective function) differ between K-means++ and random center initialization in Step 3?
• Run K-means with K-means++ center initialization 5 times and find the mean and variance of the (total) training error after the algorithm finishes.
• Do the training errors appear to be more consistent?
• Do the topics appear to be more meaningful?
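For reference, the standard K-means++ seeding procedure is sketched below; the function name, data layout, and random-number handling are assumptions and do not match the required kmeans_pp signature.

    import numpy as np

    def kmeans_pp_sketch(data, K, seed=0):
        # First center uniform at random; each later center is drawn with
        # probability proportional to the squared distance to its nearest
        # already-chosen center.
        rng = np.random.default_rng(seed)
        N = data.shape[0]
        centers = [data[rng.integers(N)]]
        for _ in range(K - 1):
            d2 = np.min(
                np.sum((data[:, None, :] - np.array(centers)[None, :, :]) ** 2, axis=2),
                axis=1,
            )
            centers.append(data[rng.choice(N, p=d2 / d2.sum())])
        return np.array(centers)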
Implementing GMM: K-Means "struggles" to cluster certain types of data. We now look at GMM clustering and compare it with K-Means. Implement the GMM clustering algorithm explained in Chapter 15 of the online lecture notes. You only need to modify the Python file gmm.py. The train method is already provided; you will only need to implement the following methods:
1. _e_step performs the expectation step of the EM algorithm. It computes the responsibilities (probabilistic assignment of data to mixture components), given the current parameters.
2. _m_step performs the maximization step of the EM algorithm. This updates the mean and covariance of each Gaussian component in the mixture, along with the mixture proportions. (A minimal sketch of both steps appears after this list.)
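As a rough guide only (shapes, names, and attribute layout here are assumptions, not the required method signatures), the two steps can be sketched as follows.

    import numpy as np
    from scipy.stats import multivariate_normal

    def e_step_sketch(data, centers, covariances, mixture_proportions):
        # r[i, k] = P(component k | data point i), for data of shape (N, D).
        K = len(mixture_proportions)
        weighted = np.stack(
            [mixture_proportions[k]
             * multivariate_normal.pdf(data, mean=centers[k], cov=covariances[k])
             for k in range(K)],
            axis=1,
        )
        return weighted / weighted.sum(axis=1, keepdims=True)

    def m_step_sketch(data, resp):
        # Update means, covariances, and mixture proportions from responsibilities.
        N, D = data.shape
        Nk = resp.sum(axis=0)                      # effective counts per component
        centers = (resp.T @ data) / Nk[:, None]
        covariances = []
        for k in range(resp.shape[1]):
            diff = data - centers[k]
            covariances.append((resp[:, k, None] * diff).T @ diff / Nk[k])
        return centers, np.array(covariances), Nk / N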
Similar to KMeans, the parameters of the GMM are stored within the GMM object under the variables self.centers, self.covariances, and self.mixture_proportions. Test the code again with test_clustering.py by setting test_method='gmm'. Set test_method='all' to compare the methods. For some datasets, you will notice that the clusters found by K-Means and GMM are very different. How can you characterize these differences, and why do they occur? If you are curious (not marked), you can also run document clustering with GMM, but you will quickly notice why it's not a great idea. What can we do if we want to run GMM for this task?
Submission
You will need to submit two files, one for each component. The specifications of the file format and naming convention are specified below.
1. Written Component
You may hand-write or type up (e.g., Word, LaTeX, etc.) the solutions. We strongly recommend typing it up for legibility. Compile the document as a PDF named A4sol UTORID.pdf (e.g., A4sol student1.pdf). Then submit the PDF file on Quercus.
2. Programming Component
Compress the A4 directory into a tar file named A4sol UTORID.tgz (e.g., A4sol student1.tgz).
Please don’t modify the method/function headers or the structure of the A4 directory tree since the automarker will fail to run. Failing to follow the instructions will result in a 25% penalty.
To tar and zip the contents, use the command: tar -cvzf name of tar.tgz A4. Make sure to decompress the tar file to ensure the entire A4 directory is there. Then you should submit the tar file onto Quercus.