CSCC11 Introduction to Machine Learning, Fall 2018
Assignment 2
Due Friday, November 2, 2018, by 11 AM
This assignment includes both written work and programming (in Matlab). To begin the programming, download a2starter.tar from the calendar page on the course website. When you untar a2starter.tar, the directory A2 is created. Please do not change the directory structure, nor the headers of the Matlab functions therein. Please read the assignment handout carefully. The submission instructions are at the end.
Written Exercises
W1. Probability: Consider a test for detecting whether an athlete has used a performance-enhancing drug. Let θ be the probability that the drug test gives the correct result. That is, the probability that the test reports drug use (R = 1) when the athlete actually used the drug (D = 1) is θ. And the probability that it reports no drug use (R = 0) when indeed there was none (D = 0) is also θ. More formally,
P(R=1|D=1) = P(R=0|D=0) = θ
Suppose also that the proportion of athletes in a particular sport that use the drug is α, where 0 ≤ α ≤ 1.
(a) Suppose an athlete tests positive. Derive the posterior probability that the athlete used the drug, and simplify it in terms of θ and α.
(b) Suppose θ = 0.99, and suppose 1000 athletes are tested, all of whom are 'clean'. What is the expected number of false positives?
W2. Gaussian Distributions: Suppose an ant is looking for its nest in a 1D world, beginning at location s. And suppose that s is normally distributed with a mean of 0 and a standard deviation of 3. From this point the ant walks for a short period. The resulting change in position, denoted d, is also normally distributed, with a mean of 0 and a standard deviation of 2. If the displacement d is positive, the ant moves to the right. Otherwise it moves to the left. Finally, assume s and d are statistically independent, and that the final position of the ant, f , is measured to be f = 3.
Find the most likely location s from which the ant began the search. Show each step of your derivation (i.e. just giving the numeric solution won’t earn you many marks). Hint: Try to write f in terms of s and d, and think of how to maximize the probability of observing f = 3.
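In symbols, the hint amounts to the following setup (the derivation itself is left to you):

f = s + d,   s ~ N(0, 3^2),   d ~ N(0, 2^2),   s and d independent,
so f | s ~ N(s, 2^2), and hence p(s | f = 3) ∝ p(f = 3 | s) p(s).

The most likely starting location is the value of s that maximizes this product.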
W3. Logistic Regression: Suppose you want to classify 2D inputs (x1, x2) using logistic regression, and suppose there are two classes, with the output denoted as y ∈ {0, 1}. Following the lecture notes,

p(y=1 | x, w) = 1 / (1 + e^{-(w1 x1 + w2 x2 + b)}) = g(x^T w),  where g(a) = 1 / (1 + e^{-a}),   (1)
where x = [x1, x2, 1]^T and the parameter vector is w = [w1, w2, b]^T. Because there are two classes, we also know p(y = 0 | x, w) = 1 − p(y = 1 | x, w). Finally, our learning objective is the negative log-likelihood for N training pairs, {(xi, yi)}_{i=1}^{N}:

E(w) = − Σ_{i=1}^{N} log p(y = yi | xi, w) .   (2)
We want to derive the necessary conditions for minimizing E(w).
(a) Show that the derivative of the sigmoid satisfies dg/da = g (1 − g).
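If you want to sanity-check your answer to (a) numerically, a central finite difference in Matlab will do; this is only an illustration, not a substitute for the analytic derivation the question asks for:

% Numeric sanity check of the identity dg/da = g(1 - g).
g = @(a) 1 ./ (1 + exp(-a));
a = linspace(-5, 5, 11);
h = 1e-6;
numeric  = (g(a + h) - g(a - h)) / (2 * h);   % central finite difference
analytic = g(a) .* (1 - g(a));
max(abs(numeric - analytic))                  % should be tiny, around 1e-10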
(b) Derive closed-form expressions for the derivatives of E(w) with respect to the model parameters. In other words, find the gradient ∂E/∂w = (∂E/∂w1, ∂E/∂w2, ∂E/∂b). It should have the form

∂E/∂w = Σ_{i} ( g(w^T xi) − yi ) xi .   (3)

You will find it convenient to use the following form of the log-likelihood:

E(w) = − Σ_{i=1}^{N} log [ p(y = 1 | xi, w)^{yi} p(y = 0 | xi, w)^{1−yi} ] .   (4)

(c) Including regularization in the form of an isotropic Gaussian prior on w, the log-posterior becomes

Ê(w) = − log [ p(w) Π_{i=1}^{N} p(y = yi | xi, w) ] ,  where p(w) = (1 / (2πα)^{3/2}) exp( − w^T w / (2α) ) .   (5)

Derive ∂Ê/∂w for this regularized objective function.

P1. Decision Trees – Programming Exercise
Your task is to complete a random forest implementation. Use the lecture notes on decision forests (Sec. 8.3.2), multinomial distributions (Sec. 5.4), information theory (Sec. 9), and cross-validation (Sec. 11).
You are given starter-code to build and test random forests, including scripts to test with both synthetic and real-world datasets. We suggest you initially debug your code on the synthetic examples, and then proceed to training/testing on the real-world datasets. To this end you will need to tune the hyper-parameters using a validation set, and then modify the script to train on the entire training set (and perform testing).
To begin, download the starter file and look at the code in directory P1. Among the various scripts, which you should read, you will find the following functions (m-files) that you have to complete:
- entropy.m computes the entropy of the distribution over class labels given a dataset with labels y. This function is used for building a single decision tree in build_tree.m and find_split.m. (An illustrative sketch of the entropy and information-gain computations appears after this list.)
- find_split.m chooses a random feature dimension of the input vector, sorts data points according to the value of the selected feature, and evaluates the information gain for hypothetical split thresholds. This file is used for building a single decision tree in build_tree.m. It returns the split threshold with maximum information gain. The file includes code to help structure your implementation.
- build_forest.m builds a random forest, a set of decision trees learned on random subsets of data and features. You need to complete the part that selects random subsets of data and features.
- test_examples.m enables debugging using simple synthetic examples. Feel free to write additional tests that check the individual functions above. We will automatically mark your code in a similar way, using other training/testing datasets of varying complexity.
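As referenced above, here is a minimal sketch of the quantities that entropy.m and find_split.m compute (e.g., each function saved in its own m-file). The function and variable names below are illustrative only; your implementations must keep the starter code's headers:

% Illustrative only: entropy (in bits) of the empirical distribution
% over class labels in the vector y.
function H = label_entropy(y)
  classes = unique(y);
  p = arrayfun(@(c) mean(y(:) == c), classes);  % empirical class probabilities
  H = -sum(p .* log2(p));                       % p > 0 by construction
end

% Information gain of splitting labels y into subsets y_left and y_right,
% as evaluated for candidate thresholds in find_split.m.
function ig = info_gain(y, y_left, y_right)
  n  = numel(y);
  ig = label_entropy(y) ...
       - (numel(y_left)  / n) * label_entropy(y_left) ...
       - (numel(y_right) / n) * label_entropy(y_right);
end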
Data Sets: The starter code also includes scripts to train on two real-world datasets. In these two datasets the ranges of the feature values are very different. Decision trees are not limited by such differences. You can follow the links if you are interested in more information about the datasets. The two real-world datasets are in the directory called data.
- Occupancy Detection. This is a binary classification task with real-valued sensory inputs. The task is to detect whether an office is occupied from measurements of light, temperature, humidity and CO2. This dataset comes with two test sets, one for validation and the other for testing (see test_occupancy.m). On this dataset, you should be able to achieve accuracies above 95%.
In the text file called occupancy.txt, answer the following: Is a single DT prone to over-fitting with the occupancy dataset? How does classification accuracy behave as a function of the number of trees? Do you think a random forest is helpful? Explain why.
- Amazon Commerce Reviews. The task is to predict the author of a review on Amazon. There are 50 authors (classes). The features are the frequencies of words and of pairs of words in a vocabulary. You are given 1300 training pairs (each with a feature vector and author ID), and 200 test points. From the training set, 100 points are selected as a validation set (see test_amazon.m). This is high-dimensional data, with 10,000 feature dimensions, and a relatively small training set, so over-fitting is an issue.
In the text file called amazon.txt, answer the following: Can you reach a high accuracy on the test set using random forests? Can you think of a better way to choose split functions (as opposed to the method in find_split.m)? And how does this challenge differ from that of the occupancy dataset?
Decision forests have several important hyper-parameters, including the number of trees, the tree depth, the minimum number of data points in a leaf, and the percentage of the data used to build each tree. A file called default_opts.m lists the parameters. You will have to put these parameters in occupancy_opts.m and amazon_opts.m and choose suitable values for each dataset. This may require some thought and experimentation. After tuning the hyper-parameters on the validation set, change the test scripts test_occupancy.m and test_amazon.m to work on the test set.
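For illustration only, an options file might populate a struct along these lines. The field names below are made up; copy the actual names from default_opts.m rather than from this sketch:

% Hypothetical occupancy_opts.m -- field names are invented for
% illustration; use the names that default_opts.m actually defines.
function opts = occupancy_opts()
  opts.num_trees     = 50;    % number of trees in the forest
  opts.max_depth     = 10;    % maximum depth of each tree
  opts.min_leaf_size = 5;     % minimum number of data points in a leaf
  opts.data_frac     = 0.7;   % fraction of the data used per tree
end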
P2. Logistic Regression – Programming Exercise
For this problem, you will implement a logistic regression classification algorithm. (We'll provide gradient descent code for the optimization.) You'll apply the approach to three datasets. Each dataset comprises two parts, a training set and a test set. You should train your algorithm on the training data, and then evaluate performance on the test set. The test performance should be measured as the average 0-1 loss on the test set. In other words, once you've fit a model to the training data, evaluate it by counting the number of incorrectly labelled test points, and then divide by the number of test points.
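Concretely, with predicted labels yhat and true labels ytest (variable names here are just for illustration), the average 0-1 loss is one line of Matlab:

% Average 0-1 loss: fraction of test points whose predicted label is wrong.
err = mean(yhat ~= ytest);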
Data Sets: In the starter file from the course web site there is a Dataset directory. In that directory there are several .mat files for the different datasets. To plot the data, see the script plotDataSets.m. The individual datasets are described below.
For the Fruit dataset, the 2D measurements (features) are the heights and widths of oranges and lemons (in cm, to the nearest mm). There are two files, fruit_train.mat and fruit_test.mat. The training file contains two arrays, inputs_train and target_train. The array inputs_train is 2 × 30, with height and width measurements for 30 training inputs. The array target_train is also 2 × 30, with one column for each training exemplar: (1,0) for oranges, and (0,1) for lemons. The test data file, fruit_test.mat, is structured in a similar way, comprising 10 test instances.
There are also two generic datasets, generic1.mat and generic2.mat, both of which have the same format. The measurements are 2D and real-valued. Each MATLAB file comprises four arrays, namely, c1_train and c2_train comprising 21 training exemplars for class 1 and class 2, and c1_test and c2_test comprising 69 instances of class 1 and class 2 for testing.
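The array names above are exactly what the .mat files provide, so you can inspect them directly; for example (assuming Matlab's path can see the Dataset directory):

load('fruit_train.mat');   % provides inputs_train (2 x 30), target_train (2 x 30)
size(inputs_train)         % ans = [2 30]
load('generic1.mat');      % provides c1_train, c2_train, c1_test, c2_test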
Implementing Logistic Regression: Code to get you started is available in the directory starterCode. You should implement the regularized logistic regression model explained in Chapter 8 of the online lecture notes. The gradient for the model is derived in a written question above. The m-files in the starter code are
- logistic.m evaluates the logistic regression model given an array of exemplars and the weight parameters. It computes the posterior class probability given the data and model parameters. This function is used mainly for classification. (A small illustrative sketch of the underlying computations appears after this list.)
- logisticNLP.m computes the negative log likelihood of the input data, and the gradient of the negative log likelihood. It is essential for learning the logistic regression model parameters.
- learnLogReg.m is a template for a gradient descent method, with line search, for minimizing the negative log likelihood of the data. It uses logisticNLP.m extensively, and it provides a facility to check your gradients numerically. But most of the code for this function is already written. You only have to modify a small amount of code here.
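For reference, logistic.m and logisticNLP.m boil down to equations (1) and (3)-(5) from question W3. The sketch below assumes X is 2 x N with one exemplar per column, y is 1 x N with entries in {0,1}, and w = [w1; w2; b]; these layouts are assumptions, so defer to the starter code's headers and conventions:

% Model evaluation, equation (1): posterior p(y = 1 | x, w) per exemplar.
Xh = [X; ones(1, size(X, 2))];      % homogeneous coordinates, 3 x N
p1 = 1 ./ (1 + exp(-(w' * Xh)));    % 1 x N vector of posteriors

% Negative log-likelihood, equation (4):
nll = -sum(y .* log(p1) + (1 - y) .* log(1 - p1));

% Gradient of the NLL, equation (3), a 3 x 1 vector:
grad = Xh * (p1 - y)';

% With the isotropic Gaussian prior of equation (5), the regularized
% objective adds w' * w / (2 * alpha) (plus a constant) to nll, and
% w / alpha to the gradient.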
Testing: To evaluate the classifier, you will learn models with different parameter settings, and compute their test errors on the different datasets. Make use of Matlab scripts and the functions above to help automate these processes.
Begin your testing on the generic datasets. In doing so, plot test error as a function of the regularization weight; use α = 0.25, 0.5, 1, 2, and 5. Also try it on the fruit dataset. In the file logisticRegression.txt, answer the following: On which datasets does logistic regression work well? Explain. Also, for which datasets does the regularization help? Explain.
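A sweep over the regularization weights might be automated along these lines; the calls to learnLogReg and logistic below are placeholders, so take their exact signatures from the starter code:

% Hypothetical sweep over regularization weights (placeholder calls).
alphas  = [0.25, 0.5, 1, 2, 5];
testErr = zeros(size(alphas));
for k = 1:numel(alphas)
  w          = learnLogReg(Xtrain, ytrain, alphas(k));  % placeholder signature
  yhat       = logistic(Xtest, w) > 0.5;                % placeholder signature
  testErr(k) = mean(yhat ~= ytest);                     % average 0-1 loss
end
semilogx(alphas, testErr, '-o');
xlabel('regularization weight \alpha');
ylabel('average test error');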
Submission
- Hand in W1-W3 above on paper, into the drop box on the 4th floor of the IC building (the office).
- For the programming parts, compress your code into a tar file named A2Sol_studentNo.tar. Please don’t modify the function headers or directory tree. The tar command is:
tar -cvzf name_of_tar.tar A2/P1/*.m A2/P1/*.txt A2/P2/*.m A2/P2/*.txt
Make sure that when you decompress the tar file, you have 2 directories, named P1 and P2, with the corresponding Matlab scripts and txt files inside.
Then submit the tar file from a mathlab computer (e.g., in IC406) using the command:
submit -c cscc11f18 -a A2 -f name_of_tar.tar