
University of Toronto Scarborough
Department of Computer and Mathematical Sciences

Introduction to Machine Learning and Data Mining
CSCC11H3, Fall 2021

Assignment 2

Due December 8, 2021 at 11:49 pm

1 Logistic Regression

1.1 Understanding Binary Class Logistic Regression

In this question, we investigate the assumptions and limitations of the binary class logistic regression model.
Suppose we have two classes with labels c1 and c2. We can express the posterior probability of class c1 given
input x as

P(c_1 \mid x) = \frac{1}{1 + e^{-w^T x}} = g(w^T x)    (1.1)

where w is the weight vector with bias included and g is the Sigmoid function.
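For concreteness, a minimal sketch in Python/NumPy of evaluating Eqn. (1.1) is given below. It assumes the input has been augmented with a leading 1 so that the bias is folded into w, as stated above; the specific numbers are hypothetical.

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weight vector (bias as first element) and augmented input.
w = np.array([-1.0, 2.0, 0.5])   # (b, w1, w2)
x = np.array([1.0, 0.3, -0.8])   # (1, x1, x2)
print(sigmoid(w @ x))            # P(c1 | x) under Eqn (1.1)
```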

1.2 Written Problems.

1. First, let's investigate the decision boundary of a binary class logistic regression model. The decision
boundary is the set of points where P(c1 | x) = 0.5 (i.e., where the decision function α(x) = 0).
Show that this decision boundary, α(x) = 0, is a linear function of x.

2. Knowing that logistic regression has a linear decision boundary, what datasets would this model
perform poorly on? In this case, performing poorly means the model is unable to classify all the
training data correctly.

Let's look at a toy example: the XOR (Exclusive OR) dataset. XOR is a Boolean function of two binary
inputs that outputs TRUE when the two inputs differ and outputs FALSE when they are the same. The dataset
is represented in the table below. Our goal is to predict the output y ∈ {0, 1} given the input vector
x = [x_1, x_2]^T, where x_1, x_2 ∈ {0, 1}.


x1 x2 y

0 0 0
0 1 1
1 0 1
1 1 0

(a) Assume we are using binary class logistic regression on the XOR dataset. What is the maximum
classification accuracy we can obtain? Explain.

(b) As in basis function regression, we can apply basis functions to create a more sophisticated model.
Consider the following feature map (basis functions) on the inputs:

\psi(x) = \begin{bmatrix} \psi_1(x) \\ \psi_2(x) \\ \psi_3(x) \end{bmatrix}
        = \begin{bmatrix} x_1 \\ x_2 \\ x_1 x_2 \end{bmatrix}    (1.2)

Then, we can express the posterior probability of class c_1 given input x as P(c_1 \mid x) = g(w^T \psi(x)),
where w = [w_1, w_2, w_3]^T is the weight vector of the model. Note that we exclude the bias term in this
question. Specify the conditions and provide an example for the weight vector w such that this model
perfectly classifies the XOR dataset.
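As an illustration (not part of the required solution), the feature map in Eqn. (1.2) applied to the four XOR inputs could be computed as follows; the helper name psi is hypothetical.

```python
import numpy as np

def psi(x):
    # Eqn (1.2): psi(x) = [x1, x2, x1*x2]^T
    x1, x2 = x
    return np.array([x1, x2, x1 * x2])

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(np.array([psi(x) for x in X_xor]))
# rows: [0 0 0], [0 1 0], [1 0 0], [1 1 1]
```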

1.3 Multiclass Logistic Regression

In this question we will consider multi-class logistic regression. We will begin by extending the formulation
of 2-class logistic regression to the multi-class case.

We will denote input feature vectors by x, and let's assume the first element of x is 1 to allow for a bias
term; i.e., x = (1, x_1, \dots, x_d)^T. For now, suppose there are 2 classes, with class labels 1 and 2, and let c
denote the class variable. In these terms, the conditional probability of class 1, given the data, is simply

p(c = 1 \mid x) = \frac{p(c = 1)\, p(x \mid c = 1)}{p(c = 1)\, p(x \mid c = 1) + p(c = 2)\, p(x \mid c = 2)}    (1.3)

For the purposes of logistic regression, we define a mapping from an input feature vector to what we might
call unnormalized probabilities as follows:

e^{w_k^T x} = \alpha\, p(c = k)\, p(x \mid c = k)    (1.4)

for some unknown constant α, where k is either 1 or 2 in our 2-class case. Such unnormalized probabilities
are just exponentiated linear functions of the inputs (with the addition of the bias term, which is the first
element of the weight vector). We call them unnormalized because we don't know the value of α, and hence
the exponentiated inner products e^{w_k^T x} don't sum to one like a probability distribution should.

For two classes the model is specified by two weight vectors, w1 and w2, where we can write:

p(c = 1 \mid x, w_{1:2}) = \frac{e^{w_1^T x}}{e^{w_1^T x} + e^{w_2^T x}} = \frac{1}{1 + e^{(w_2 - w_1)^T x}}    (1.5)

This is immediately recognizable as the sigmoidal function g(·) from Eqn. (39) in Chapter 9.6 of the online
lecture notes, but here it is applied to (w_1 - w_2)^T x instead of w^T x. In other words, in the online notes
on 2-class logistic regression, the weights we learned were equal to the difference of the weights used here,
i.e., w_1 - w_2, which defines the normal vector to the decision boundary (w_1 - w_2)^T x = 0.

Now we're ready to look at the more general case with K classes. Assume again that inputs are
x = (1, x_1, \dots, x_d)^T, and for each of the K classes let there be a weight vector
w_k = (b_k, w_{k,1}, \dots, w_{k,d})^T.

Assuming c denotes the class variable, we can write the probability for class c = k, where 1 ≤ k ≤ K, as

p(c = k \mid x) = \frac{p(c = k)\, p(x \mid c = k)}{\sum_{\ell=1}^{K} p(c = \ell)\, p(x \mid c = \ell)}    (1.6)

As above, the logistic regression model defines a parametric mapping from inputs to unnormalized
probabilities. Given the model parameters, w_{1:K}, the model specifies that

p(c = k \mid x, w_{1:K}) = \frac{e^{w_k^T x}}{\sum_{\ell=1}^{K} e^{w_\ell^T x}}
                         = \frac{1}{1 + \sum_{\ell \in \{1,\dots,K\} \setminus k} e^{(w_\ell - w_k)^T x}}    (1.7)

In the sum in the denominator of the second expression in Eqn. (1.7), ℓ takes on all values between 1 and K
except k. Note that while this second expression more closely resembles the sigmoidal function from 2-class
logistic regression, it will be more convenient to use the first expression in Eqn. (1.7) in what follows. That
first expression, which maps K score values (i.e., the inner products w_k^T x) to probabilities, is used often
in deep learning; in neural network jargon it is called a softmax function.
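For reference, here is a minimal numerically stable softmax sketch in Python/NumPy. It is an illustration only (not the starter-code interface), and the scores used are hypothetical.

```python
import numpy as np

def softmax(z):
    # maps K scores z_k = w_k^T x to probabilities that sum to one;
    # subtracting max(z) does not change the result but avoids overflow
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, -1.0, 0.5])   # hypothetical w_k^T x for K = 3
probs = softmax(scores)
print(probs, probs.sum())             # probabilities, sums to 1.0
```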

Now let's consider the objective for learning. As with 2-class logistic regression we are going to use a log
loss. Given training data {(x_i, y_i)}_{i=1}^{N}, assumed to be IID, we form the loss under the model as

E(w_{1:K}) = -\log \prod_{i=1}^{N} p(y_i \mid x_i, w_{1:K})    (1.8)

In the lectures and notes on the 2-class case we manipulated the form of the log probability of the model
in terms of the sigmoid function, with the class labels being either 0 or 1. We can do the same thing here,
but it's just a little trickier. Rather than assuming that y_i takes on a value from 1 to K, we let y_i be a
so-called one-hot binary vector. That is, we define y_i to be a vector of length K whose elements are 0 or 1.
When the correct class for the i-th data point is k, then all elements of y_i are 0 except for the k-th; i.e.,

y_{i,k} = \begin{cases} 1 & \text{when } k \text{ is the correct class} \\ 0 & \text{otherwise} \end{cases}    (1.9)

With this one-hot label encoding we can express the negative log likelihood of the training data as follows:

E(w_{1:K}) = -\log \prod_{i=1}^{N} \prod_{k=1}^{K} p(y_{i,k} \mid x_i, w_{1:K})^{y_{i,k}}
           = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log p(y_{i,k} \mid x_i, w_{1:K})    (1.10)

Next we need to derive the form of the gradients of E so we can figure out the necessary conditions for
the optimal values of the parameters w_{1:K}. To this end, just as we defined the sigmoid function for 2-class
logistic regression, here we use a similar notation to define the softmax function, i.e.,

\sigma(z_{1:K}, k) = \frac{e^{z_k}}{\sum_{\ell=1}^{K} e^{z_\ell}}    (1.11)

With this notation, and defining z_{i,k} = w_k^T x_i for k = 1, \dots, K, we can express the loss as

E(w_{1:K}) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log \sigma(z_{i,1:K}, k)    (1.12)

We will solve the learning problem by minimizing E using gradient descent to find the weight vectors. But
we still need to find the gradient first.
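To make Eqns. (1.11) and (1.12) concrete, here is a small sketch of the loss contributed by a single point with a one-hot label. It is illustrative only; the function name and the numbers are hypothetical.

```python
import numpy as np

def point_loss(scores, y_one_hot):
    # -sum_k y_k * log sigma(z_{1:K}, k), computed with a stable log-softmax
    z = scores - np.max(scores)
    log_probs = z - np.log(np.sum(np.exp(z)))
    return -np.sum(y_one_hot * log_probs)

scores = np.array([1.2, 0.3, -0.7])   # hypothetical z_{i,1:K} = w_k^T x_i
y = np.array([0, 1, 0])               # one-hot label: correct class is k = 2
print(point_loss(scores, y))
```

Summing this quantity over all training points gives E(w_{1:K}) in Eqn. (1.12).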

1.4 Written Problems.

1. Find the gradient of σ(z_{1:K}, k) in Eqn. (1.11) with respect to z_j:

\frac{\partial \sigma(z_{1:K}, k)}{\partial z_j}.    (1.13)

Hint: Consider the cases when k = j and k ≠ j. You may find it helpful to look at the structure of the
gradient in the 2-class case in Eqn. (46) in Chapter 9.6 of the online lecture notes.

3. Find the gradient of the log likelihood for a single point (x, y) with respect to w_j:

\frac{\partial}{\partial w_j} \sum_{k=1}^{K} y_k \log \sigma(z_{1:K}, k)    (1.14)

4. Find the gradient of the loss with respect to w_j:

\frac{\partial E(w_{1:K})}{\partial w_j}    (1.15)

Hint: Use the results above. The gradient should have a form similar to Eqn. (48) in Chapter 9.6 of the
online lecture notes.

5. Now, suppose we have D-dimensional inputs and K classes. For each of the K classes, let there be a
weight vector w_k = (b_k, w_{k,1}, \dots, w_{k,D})^T. If we include regularization, with a (diagonal) Gaussian
prior, the negative log-posterior becomes, up to an additive constant,

\hat{E}(w_{1:K}) = -\log \left[ p(w_{1:K}) \prod_{i=1}^{N} p(y = y_i \mid x_i, w_{1:K}) \right]    (1.16)

where p(w_{1:K}) is the joint distribution over K independent Gaussian densities with the same diagonal
covariance. That is,

p(w_{1:K}) = \prod_{k=1}^{K} \frac{1}{\left( (2\pi)^{D+1} \beta \alpha^{D} \right)^{1/2}}
             \exp\!\left( -\frac{w_k^T C^{-1} w_k}{2} \right)    (1.17)

where the covariance matrix is given by C = diag(β, α, α, \dots, α) ∈ R^{(D+1)×(D+1)}. Here, α denotes the
variance of the prior on each of the weights, and β is the prior variance on the bias term. Usually we don't
want a strong prior on the bias term, so β ≫ α. Derive \frac{\partial \hat{E}}{\partial w_k} for this regularized objective function.

Hint: Your negative log-posterior should have form Ê (w1:K) = E (w1:K)+E2 (w1:K), where E2 (w1:K)
is the regularization term.
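As a numerical sanity check (not the required derivation), the regularization term E2 from the hint, i.e., the negative log of Eqn. (1.17) with constants dropped, could be evaluated as in the sketch below. The array layout (rows of a K × (D+1) matrix holding (b_k, w_{k,1}, ..., w_{k,D})) is an assumption for illustration.

```python
import numpy as np

def neg_log_prior(W, alpha, beta):
    # W: K x (D+1) matrix whose rows are w_k = (b_k, w_k1, ..., w_kD);
    # C = diag(beta, alpha, ..., alpha), so E2 = sum_k 0.5 * w_k^T C^{-1} w_k
    D = W.shape[1] - 1
    c_inv = np.concatenate(([1.0 / beta], np.full(D, 1.0 / alpha)))
    return 0.5 * np.sum(W ** 2 * c_inv)

W = np.array([[0.1, 2.0, -1.0],
              [0.0, 0.5, 0.3]])      # hypothetical K = 2, D = 2
print(neg_log_prior(W, alpha=1.0, beta=1e3))
```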

1.5 Programming Component

You are asked to implement a logistic regression classification algorithm (we provide gradient descent
code). The starter code is in directory A2. You will then apply this algorithm to four datasets, each
comprising a training set and a test set. You should train your algorithm on the training data, then evaluate
performance on the test set. The test performance should be measured as classification accuracy
(equivalently, one minus the average 0-1 test loss); i.e., once you've fit a model to the training data,
evaluate it by counting the number of correctly labelled test points and dividing the count by the size of
the test set.
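For example, the test performance described above could be computed along these lines; the array names are hypothetical.

```python
import numpy as np

def test_accuracy(pred_labels, true_labels):
    # fraction of correctly labelled test points (1 - average 0-1 loss)
    return np.mean(pred_labels == true_labels)

pred = np.array([0, 1, 1, 2, 0])   # hypothetical predicted labels
true = np.array([0, 1, 2, 2, 0])   # corresponding test labels
print(test_accuracy(pred, true))   # 0.8
```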

Datasets: In the starter file there is a datasets directory. In that directory there are several pickle files for the
different datasets, as described below.

• Generic datasets: There are three generic datasets, generic_〈i〉.pkl where 〈i〉 ∈ {1, 2, 3}, all of which
have the same format. Given 2D, real-valued measurements (features), we want to classify whether
a test point is class 0 or class 1 for the first two datasets, and class 0, class 1, or class 2 for the
third dataset. Each pickle file contains four arrays, train_X, train_y, test_X, and test_y. The arrays
train_X ∈ R^{100×2} and train_y ∈ R^{100×1} form 100 input/output pairs of the training set. The arrays
test_X ∈ R^{50×2} and test_y ∈ R^{50×1} form 50 input/output pairs of the test set.

• Wine dataset: The wine dataset, wine.pkl, has 13 attributes: Alcohol, Malic acid, Ash, Alcalinity of
ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity,
Hue, OD280/OD315 of diluted wines, and Proline (i.e., 13D measurements/features). Given the 13D
measurements, we want to classify whether the wine class is wine 0, wine 1, or wine 2 (i.e., 3 classes).
The file contains four arrays, train_X, train_y, test_X, and test_y. The arrays train_X ∈ R^{148×13} and
train_y ∈ R^{148×1} form 148 input/output pairs of the training set. The arrays test_X ∈ R^{30×13} and
test_y ∈ R^{30×1} form 30 input/output pairs of the test set.

Visualizing the Generic Datasets: You need to modify visualize_generic.py to visualize the three
generic datasets. Once you have visualized the datasets, answer the following questions (a minimal plotting
sketch is given after the questions):

1. Do you expect logistic regression to perform well on generic_1? Why? What if we apply the feature
map defined in Eqn. (1.2)?

2. Do you expect logistic regression to perform well on generic_2? Why? What if we apply the feature
map defined in Eqn. (1.2)?

3. Do you expect logistic regression to perform well on generic_3? Why?

4. Why can’t we directly visualize the wine dataset? What are some ways to visualize it?
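A minimal plotting sketch for one generic dataset follows. The file path and dictionary key names are assumptions based on the dataset description above; adjust them to match the actual pickle contents and the visualize_generic.py template.

```python
import pickle
import matplotlib.pyplot as plt

# Assumed path and key names (train_X, train_y); verify against the starter code.
with open("datasets/generic_1.pkl", "rb") as f:
    data = pickle.load(f)

X, y = data["train_X"], data["train_y"].flatten()
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", edgecolors="k")
plt.xlabel("x1")
plt.ylabel("x2")
plt.title("generic_1 training data")
plt.show()
```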

Implementing Logistic Regression: You should implement the multi-class logistic regression with regular-
ization explained in the written component. The gradient for the model is derived in the written component
above. You only need to modify the file logistic_regression.py. You might find utils.py to be helpful for your
implementation. The relevant methods are:


• _compute_loss_and_gradient: computes the negative log likelihood and its gradient given the inputs
and outputs. It is essential for learning the logistic regression model parameters. When the
hyperparameters α^{-1} and β^{-1} are non-zero, we compute the negative log posterior instead.

IMPORTANT NOTE: For numerical stability, divide the negative log likelihood by the number of
data points, drop all log constants, and drop all constant factors. Your loss should be of the form
\hat{E}_{avg}(w) = E(w)/N + E_2(w), where N is the number of data points. Then, compute the gradient
based on \hat{E}_{avg}.

• learn is a template for a gradient descent method, with line search, for minimizing the negative log
likelihood of the data. It uses _compute_loss_and_gradient extensively, and it provides a facility to
check your gradients numerically. Specifically, by setting the check_grad flag, it computes the
numerical gradient using finite differences (a generic sketch of this idea is given after this list). It uses
your implementation of the negative log likelihood and negative log posterior for the computation.

• predict evaluates the logistic regression model given an array of exemplars and the weight parameters.
It computes the posterior class probability given the data and model parameters. This function is used
mainly for classification.
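For intuition, a finite-difference gradient check is conceptually similar to the sketch below. This is illustrative only, not the starter code's implementation, and the example loss function is hypothetical.

```python
import numpy as np

def numerical_gradient(loss_fn, w, eps=1e-6):
    # central differences: dE/dw_i ~ (E(w + eps*e_i) - E(w - eps*e_i)) / (2*eps)
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (loss_fn(w_plus) - loss_fn(w_minus)) / (2 * eps)
    return grad

# check against f(w) = 0.5 * ||w||^2, whose exact gradient is w
w = np.array([0.3, -1.2, 2.0])
print(numerical_gradient(lambda v: 0.5 * np.sum(v ** 2), w))   # ~ [0.3, -1.2, 2.0]
```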

Training Logistic Regression and Evaluating Performance: To train and evaluate the classifier, you will
learn models with different parameter settings, and compute the test errors for them on different datasets.
The Python file train_logistic_regression.py allows you to train and evaluate a model given a dataset,
random seed, hyperparameters, and initial parameters for training. You will need to implement the train
function and the feature_map in train_logistic_regression.py. Specifically, you will need to implement the
code to initialize the logistic regression model, initialize its parameters, train the model, and evaluate the
training and test performance with and without the feature map defined in Eqn. (1.2). Note that only the
generic datasets can be used with the feature map since it is meant for 2D inputs. To apply the feature map
to the inputs, set apply_data_preprocessing = True.

Finally, run logistic regression using train_logistic_regression.py and answer the following questions:

1. First, let's run an experiment on the generic_1 dataset. Run logistic regression without regularization
and without the feature map. Did you run into any numerical errors? If so, why do you think this is the
case? Now, run logistic regression with regularization. What happens? What are the train and test
accuracies?

2. Now, let's look at the generic_2 dataset. Run logistic regression without regularization and without
the feature map. What are the train and test accuracies? Now run it with the feature map: did the
performance get better? Why do you think that is the case?

3. generic_3 is a multi-class classification problem. Run logistic regression without regularization and
without the feature map. What are the train and test accuracies? What if we run it with the feature map?

4. Wine dataset: Does logistic regression perform well on this dataset?


2 Clustering

You are asked to implement K-Means and Gaussian Mixture Model (GMM). Starter code is in directory
Clustering_Problem. You will then apply these algorithms on different datasets for various applications.
This problem also asks a number of questions. For each, you are expected to write short answers (no more
than 2 or 3 sentences).

Implementing K-Means: Implement the K-Means clustering algorithm explained in Chapter 15 of the
online lecture notes. You only need to implement the train method in the Python file kmeans.py, which
performs K-Means clustering on a given dataset. It iteratively updates K centers and assigns each data
point to the closest center. Upon convergence, the method returns the assignments of the dataset. The cluster
centers are stored within the KMeans object under the variable self.centers. Once you are confident that
your code is correct, you can run test_clustering.py to test your implementation. Note that you can visualize
the clusters by setting visualize=True.
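For orientation, the core assign-then-update loop of K-Means (Lloyd's algorithm) looks roughly like the sketch below. This is not the starter-code interface (your implementation lives in the train method and stores centers in self.centers); it only illustrates the idea.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    # Lloyd's algorithm: assign each point to the nearest center,
    # then move each center to the mean of its assigned points.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

labels, centers = kmeans(np.random.randn(200, 2), K=3)
```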

With K-Means implemented, let's look at an application. Read document_clustering.py carefully. It loads a
dataset from the BBC that contains word frequency vectors for thousands of documents on five different topics.
Our goal is to cluster these documents to discover what those topics might be. This general problem is
often called topic modeling. Here we explore some of the issues involved in determining the topics through
clustering and careful data pre-processing. For this task, we only use K-Means clustering. Note: You may
assume all initial centers are unique.

Step 0: Before you start, have a look at the data. Load BBC_data.pkl from the data directory. This file
contains three main arrays: data contains the raw data, terms contains the words corresponding to the
features of the data vectors, and labels contains the assignment of documents to classes (we'll only use
this at the end). Plot a few vectors from data, inspect some of the words in the terms array, print the word
contents (and frequencies) for several document vectors, and get a good idea of how many different words
you might expect in a document. (A small exploration sketch is given after the questions below.)

1. How sparse are the document term vectors (i.e. on average, how many of the entries in each vector
are zero)?

2. What are the 10 most common terms? What are the 10 least common terms?

3. What is the average value for word-frequencies? (only counting vector entries that are non-zero).
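One way to carry out this exploration is sketched below. The dictionary key names and the "data" directory path are assumptions based on the description above, and "most common" is taken here to mean highest total frequency; adjust as needed.

```python
import pickle
import numpy as np

with open("data/BBC_data.pkl", "rb") as f:
    bbc = pickle.load(f)

data = np.asarray(bbc["data"])        # documents x terms word-frequency matrix
terms = np.asarray(bbc["terms"])

print("fraction of zero entries:", np.mean(data == 0))
order = np.argsort(data.sum(axis=0))  # terms sorted by total frequency
print("10 most common terms:", terms[order[-10:]])
print("10 least common terms:", terms[order[:10]])
print("mean non-zero word frequency:", data[data > 0].mean())
```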

Step 1: Run the document clustering script with K = 5, norm_flag=0, diffuse=0, and random center
initialization (see center_initializations.py). This will apply your K-Means clustering code to the original
input vectors. The result will be stored in the results directory with the corresponding experiment
configuration you used. Inspect the resulting clusters and labels, and record:

1. Can you figure out which topics the clusters represent?

2. What are the factors that make clustering difficult?

3. Should we expect better results if we get a lucky guess at the cluster centers?

Step 2: Run the document clustering script with K = 5, norm_flag=1, diffuse=0, and random center
initialization. This will run clustering after pre-processing the input document vectors so they are
normalized to have unit length. In effect, we are now treating each input vector as a probability
distribution over terms.
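For reference, the normalization might look like the short sketch below (the script's norm_flag handles this for you). Whether unit L2 length or an L1 "distribution over terms" scaling is used is an implementation detail of the script; the sketch shows the L2 variant and notes the alternative.

```python
import numpy as np

def normalize_rows(X, eps=1e-12):
    # scale each document vector to unit L2 length; for the "distribution
    # over terms" reading, divide by X.sum(axis=1, keepdims=True) instead
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)
```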


1. What problem in Step 1 does this solve?

2. Based on the data points in each cluster, what do you think the topics for these clusters are?

3. Would you consider this result better or worse than Step 1? Why?

Step 3: Run the document clustering script with K = 5, norm_flag=1, diffuse=2, and random center
initialization. This will pre-process each document by doing 2 steps of random-walk diffusion based on
word co-occurrence probabilities (estimated from the entire dataset). That is, the pre-processed vectors will
not only have the original terms, but they will also contain words that are strongly associated with those
originally in the document (e.g., they might be synonyms). Run the clustering a few times, checking the
output results, and pick a good one. Answer these questions:

1. What would you say are the topics for the clusters?

2. Why is the clustering suddenly better?

3. What would you say is the general lesson to be learned from trying to cluster high-dimensional sparse
data?

Implementing GMM: K-Means "struggles" to cluster specific types of data. We now look at GMM
clustering and compare it with K-Means. Implement the GMM clustering algorithm explained in Chapter 15
of the online lecture notes. You only need to modify the Python file gmm.py. The train method is already
provided and you will only need to implement the following methods:

1. e_step performs the expectation step of the EM algorithm. It computes the responsibilities (the
probabilistic assignment of data points to mixture components), given the current parameters; the
standard update equations are sketched after this list.

2. m_step performs the maximization step of the EM algorithm. This updates the mean and variance of
each Gaussian component in the mixture, along with the mixture proportions.
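For reference, the standard EM updates (as covered in the lecture notes) take roughly the following form, where γ_{ik} denotes the responsibility of component k for data point x_i, π_k the mixture proportions, μ_k the means, and C_k the covariances. This is a sketch of the equations, not the required derivation or code.

```latex
% E-step: responsibilities under the current parameters
\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \mathbf{C}_k)}
                   {\sum_{\ell=1}^{K} \pi_\ell \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_\ell, \mathbf{C}_\ell)}

% M-step: update mixture proportions, means, and covariances
\pi_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ik}, \qquad
\boldsymbol{\mu}_k = \frac{\sum_i \gamma_{ik}\, \mathbf{x}_i}{\sum_i \gamma_{ik}}, \qquad
\mathbf{C}_k = \frac{\sum_i \gamma_{ik}\, (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^T}{\sum_i \gamma_{ik}}
```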

As with KMeans, the parameters of the GMM are stored within the GMM object under the variables
self.centers, self.covariances, and self.mixture_proportions. Test the code again with test_clustering.py by
setting test_method='gmm'. Set test_method='all' to compare the methods. For some datasets, you will
notice that the clusters found by K-Means and GMM are very different.

How can you characterize these differences and why do they occur? If you are curious (not marked), you
can also run document clustering with GMM, but you will quickly notice why it's not a great idea.
What can we do if we want to run GMM for this task?
