
CIS 545 Homework 4: Amazon Review Analysis and Classification¶
Your main training set for this assignment is the text from 100,000 reviews from Amazon.com, their timestamps, and their star ratings. The high level goal of this homework is to use the textual and temporal data to predict the star ratings.
Adventurers beware! Analyzing this data in sklearn will likely kill your kernel because it may need to store 1.9 billion values. So instead we will use the package gensim for analysis. gensim specializes in efficient implementations of common modeling techniques for big text.
In [0]:
%%capture
# install dependencies (urllib.request ships with Python 3's standard library, so no separate install is needed)
!pip install -U gensim

Make sure the following line prints the up-to-date version of gensim, which at time of releasing this homework was version 3.8.1. If not, run the cell above again. If you don’t do this, you may get different answers than us or have annoying error messages.
In [0]:
# check gensim version
import gensim
gensim.__version__
In [0]:
# import stuff
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm

from gensim import corpora
from gensim.models import LsiModel, KeyedVectors
from gensim.models.tfidfmodel import TfidfModel
from gensim.models.nmf import Nmf

import sklearn.model_selection as ms
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

from datetime import *
from operator import itemgetter
In [0]:
%%capture
!wget https://cis.upenn.edu/~cis545/data/reviews.dict
!wget https://cis.upenn.edu/~cis545/data/train_reviews.mm
!wget https://cis.upenn.edu/~cis545/data/train_times.npy
In [0]:
reviews_dict = corpora.Dictionary.load("reviews.dict")
reviews_bow = corpora.MmCorpus('train_reviews.mm')
reviews_times = np.load('train_times.npy')
reviews_times.shape = (len(reviews_bow),1)
y = np.vstack((np.repeat(1, 4000), np.repeat(2, 4000), np.repeat(3, 4000), np.repeat(4, 4000), np.repeat(5, 4000)))
y = np.repeat(y, 5)

Autograder Setup¶
In [0]:
# TODO: Enter your 8-digit Penn Id as an integer

STUDENT_PENN_ID = # TODO #
In [0]:
import json
import urllib.request
import dill
import base64

api_endpoint = 'https://qvms14rjk2.execute-api.us-east-2.amazonaws.com/default/GallantGrader_v2'

class TheGallantGrader:

    def __init__(self, student_id, api_endpoint=api_endpoint, homework_id='4'):
        if student_id is None:
            print('Error Autograder Not Setup: Enter your 8 digit PennID in the cell above.')
        self.student_id = str(student_id)
        self.api_endpoint = api_endpoint
        self.homework_id = homework_id

    def grade(self, question_id, answer):
        payload = {'student_id': self.student_id,
                   'homework_id': self.homework_id,
                   'test_case_id': question_id,
                   'answer': self.serialize(answer)}
        params = json.dumps(payload).encode('utf-8')
        request = urllib.request.Request(self.api_endpoint,
                                         data=params,
                                         headers={'content-type': 'application/json'})
        try:
            response = urllib.request.urlopen(request)
            response_body = response.read().decode('utf-8')
            print('{}'.format(response_body))
        except:
            print('Error: Grading request could not be completed.')

    def serialize(self, obj):
        byte_serialized = dill.dumps(obj)
        return base64.b64encode(byte_serialized).decode('utf-8')

    def deserialize(self, obj):
        byte_decoded = base64.b64decode(obj)
        return dill.loads(byte_decoded)

grader = TheGallantGrader(student_id=STUDENT_PENN_ID, homework_id='4')

Step 0: Explore the format¶
We will start with exploring the format of all of the data files that we imported above.

Step 0.1: gensim dictionary (lexicon)¶
Most data science over text has some form of vocabulary. Simply put, you need to decide which words your model will care about. Very rare words, misspellings, numbers, and urls are good candidates for exclusion, especially since if the model needs any form of normalization, the time complexity of such computations is at least linear in the size of the vocabulary, if not worse.
A lexicon associates each word in the vocabulary with an index. Since words are repeated, the model can save space by using the index for every repetition and only linking the index with the string form once. A gensim dictionary is special in that it is very fast and allows bidirectional lookups, namely, word to index and index to word.
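As a quick orientation, here is a minimal sketch of these bidirectional lookups on a toy corpus (toy_docs and toy_dict are illustrative names, not part of the homework data):
In [0]:
# Sketch: a gensim Dictionary maps words to integer ids and back
from gensim import corpora

toy_docs = [['the', 'best', 'toy'], ['the', 'worst', 'toy']]
toy_dict = corpora.Dictionary(toy_docs)
print(toy_dict.token2id)   # word -> index mapping, e.g. {'best': 0, ...}
print(toy_dict[0])         # index -> word lookup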
After reviewing the documentation, rewrite the right hand side of each line in the cell below with the answers to these questions.
1. In the gensim dictionary reviews_dict, what is the index of “best”? Look it up and store it in a variable named best. To clarify, if you find that 42 is the index of “best”, change the line below so that it sets best equal to 42. Of course, you can do this with best = 42 and earn full points, but it is a little better to reuse the command with which you found the index. For example, if the gensim dictionary worked like a list of strings, you could do it with
best = reviews_dict.iloc('best').
2. What word belongs to index 1911? Look it up and store it in a variable named onenineoneone.
3. What happens when you evaluate reviews_dict[i] for some variable i? If this returns the word associated with that index, set idx2word to True. Otherwise, set it to False. For example, if reviews_dict['best'] equals best, idx2word should be False, but if reviews_dict[1911] equals onenineoneone, idx2word should be True.
Hint: token2id('best') and id2token(1911) didn't work for me either. Keep trying!
In [0]:
# answer 0.1
best = # TODO #
onenineoneone = # TODO #
idx2word = # TODO #
In [0]:
## AUTOGRADER Step 0.1: Run this to get your score. ##

grader.grade(question_id = "0.1", answer = (best, onenineoneone, idx2word))

Step 0.2: Look up individual reviews¶
gensim represents everything in a sparse way. Namely, the representation of a review will be a variable-size list that contains counts of the words that are present in the review. A dense representation, on the other hand, such as a matrix, would, in addition to the present words, contain zero counts for all of the words that are not in that particular review. For some examples, see this tutorial.
But the optimizations don’t stop there! gensim also saves space by not directly storing where one document ends and another begins. Such an implementation decision encourages users to stream the dataset through the user’s pipeline, rather than attempt to read large chunks into memory. Put another way, you can iterate through the dataset using a loop or vectorized function, but you cannot index. In code:
for doc in corpus works!
corpus[1911] does not work!
On some occasions, though, it would be convenient for us to, say, look up the 1911th document directly.
So let’s implement a function that iterates through the gensim corpus, collects every document whose index appears in indices, and returns that list of documents (subset of the dataset). For example, say we want the documents with the following indices: indices = [0, 19, 11, 0]. Then lookup_docs should return the 1st, 12th, and 20th documents, in that order.
To emphasize, if an index appears multiple times in indices, just return one copy. And for consistency with our autograder, please return the documents in order of increasing index. That would be like corpus[0], then corpus[11], then corpus[19] in our example. Of course, that way of referencing them doesn't work, though!
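Since the whole assignment leans on this stream-only access, here is a minimal sketch of the pattern on a toy list-of-lists corpus (toy_corpus and wanted are illustrative names; the same enumerate idea applies when streaming an MmCorpus):
In [0]:
# Sketch: collect documents at chosen positions while streaming
toy_corpus = [[(0, 1.0)], [(1, 2.0)], [(2, 1.0)]]
wanted = {0, 2}                # a set also de-duplicates repeated indices
subset = [doc for position, doc in enumerate(toy_corpus) if position in wanted]
print(subset)                  # documents 0 and 2, in increasing order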
In [0]:
# answer 0.2
# TODO: Complete the function
def lookup_docs(corpus, indices):
    # TODO #

Once you have written lookup_docs, you may run the cell below (no modification needed) to see how documents are represented in a gensim corpus. In each review, gensim stores a tuple of size 2 for each distinct word in the review. The first number in the tuple is the index of the word in the dictionary and the second number in the tuple is the count of the times that word appeared in that review.
In [0]:
indices = [10,18]
docs = lookup_docs(reviews_bow, indices)

print(docs[0])
print(docs[1])
In [0]:
## AUTOGRADER Step 0.2: Run this to get your score. ##

grader.grade(question_id = "0.2", answer = docs)

Step 0.3: Make reviews more human-readable¶
Now, we would like you to write a function that takes a gensim bag of words document and its corresponding dictionary as input and returns a “translated” version that is more readable. The reviews are already represented as bags of words, so recall that you cannot recover the order of the words in the reviews. But, we would like you to spell out the repeats of each word. So, if the original review were “to be or not to be”, reviews_bow would have something like:
[(0, 2.0), (1, 1.0), (2, 1.0), (3, 2.0)]
and we would like you to return the string
“be be not or to to”
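To make the expected behavior concrete, here is a hedged sketch of the same expansion on the toy example above (toy_dict here is a plain Python dict standing in for a gensim dictionary):
In [0]:
# Sketch: expand a bag of words into a space-separated string
toy_dict = {0: 'be', 1: 'not', 2: 'or', 3: 'to'}
bow = [(0, 2.0), (1, 1.0), (2, 1.0), (3, 2.0)]
print(' '.join(toy_dict[idx] for idx, count in bow for _ in range(int(count))))
# -> 'be be not or to to'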
In [0]:
# answer 0.3
# TODO: Complete the function
def translate_review(review, reviews_dict):
    # TODO #

readable_1 = translate_review(docs[0], reviews_dict)
print(readable_1)

readable_2 = translate_review(docs[1], reviews_dict)
print(readable_2)
In [0]:
## AUTOGRADER Step 0.3: Run this to get your score. ##

grader.grade(question_id = "0.3", answer = (translate_review, dict(reviews_dict)))

Step 0.4: Parse review times¶
It might be useful in predicting the scores of the reviews to know when the reviews were written. In this dataset, the day of the review was recorded as the number of seconds that passed between midnight on January 1, 1970 (the beginning of time for many computer systems) and the time the review was created. This may be efficient because it is one integer, but it is not very convenient. So we are going to convert these int objects to datetime objects:
Do not change reviews_times in any way. Work with other variables instead.

Step 0.4.1: Convert times vector¶
The convert_times function should take in the entire reviews_times vector at once. It should return a new pandas Series object made from reviews_times, but the entries should be of type datetime or Timestamp.
Hint: You might find datetime.fromtimestamp to be useful.
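For orientation, datetime.fromtimestamp turns one of these integers into a datetime (the timestamp value below is illustrative, not from the dataset):
In [0]:
from datetime import datetime
# 1325376000 seconds after the Unix epoch:
print(datetime.fromtimestamp(1325376000))   # a datetime, rendered in the local timezone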
In [0]:
# answer 0.4.1
# TODO: Complete the function
def convert_times(reviews_times):
    # TODO #
In [0]:
converted_times = convert_times(reviews_times)
print("converted_times is a", type(converted_times))
In [0]:
## AUTOGRADER Step 0.4: Run this to get your score. ##

grader.grade(question_id = "0.4", answer = (str(type(converted_times)), str(type(converted_times[0])), str(type(converted_times[1])), str(converted_times[1])))

Step 0.4.2: Time math¶
The days_before function should take in one time value (after applying convert_times) and return a new time value that is exactly offset days before the input.
Hint: You might find timedelta to be useful.
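A minimal sketch of the arithmetic (the dates are chosen arbitrarily for illustration):
In [0]:
from datetime import datetime, timedelta
print(datetime(2012, 1, 1) - timedelta(days=40))   # -> 2011-11-22 00:00:00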
In [0]:
# answer 0.4.2
# TODO: Complete the function
def days_before(time_item, offset):
    # TODO #
In [0]:
display(converted_times[0])
forty_days_before_review_times_0 = days_before(converted_times[0], 40)
display(forty_days_before_review_times_0)
In [0]:
## AUTOGRADER Step 0.4.2: Run this to get your score. ##

grader.grade(question_id = "0.4.2", answer = str(forty_days_before_review_times_0))

Step 1: How many components?¶
We will need to perform dimensionality reduction on our dataset before we can proceed further with the supervised task of predicting the star ratings. One of the greatest benefits of gensim is that it can decompose a sparse dataset directly. Indeed, they post some impressive numbers about their SVD speed here.

Step 1.1: PCA on raw counts¶
We are first going to choose too many components deliberately, just to make sure that we see the whole picture. But note that 1000 components would still require us to store 100 million numbers. So that is probably too big for convenient exploration of the dataset.

Step 1.1.1: Train the PCA model¶
Train a gensim LsiModel on reviews_bow using reviews_dict as the dictionary and 1000 components. This magic number is provided as max_cutoff. The API is here.
This step took about 4 minutes for my Colab instance to complete.
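Before you fill in the cell below, here is a hedged sketch of the constructor shape on a toy corpus (the toy names are illustrative; check the LsiModel API for the full parameter list):
In [0]:
from gensim import corpora
from gensim.models import LsiModel

toy = [['good', 'fun'], ['bad', 'boring'], ['good', 'great', 'fun']]
toy_dict = corpora.Dictionary(toy)
toy_bow = [toy_dict.doc2bow(t) for t in toy]
lsi = LsiModel(corpus=toy_bow, id2word=toy_dict, num_topics=2)
print(lsi.projection.s)   # the singular values, one per retained component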
In [0]:
# answer 1.1.1
# TODO: Learn the syntax of the LsiModel command
max_cutoff = 1000
# TODO #

Step 1.1.2: Extract the singular values¶
Look at the API page to figure out how to get the singular values from a trained model. Feed those and max_cutoff to the plot_variance_vs_components function, which you do not have to edit.
In [0]:
def plot_variance_vs_components(singular_values, cutoff):
    evr = np.array([singular_values[i]**2 / sum(singular_values**2) for i in range(cutoff)])
    var = np.cumsum(evr*100)
    plt.ylabel('% Variance Explained')
    plt.xlabel('# of Components')
    plt.title('PCA Analysis')
    plt.style.context('seaborn-whitegrid')
    plt.plot(var)
In [0]:
# answer 1.1.2
# TODO: Plot variance versus number of components
# TODO #

The good news is this curve is very steep in the beginning, which shows that a lot of information is conveyed in the first components. However, there is no plateau that we can use to choose a cutoff!
So, let’s go back to the dataset. Are the numbers in reviews_bow distributed sensibly?

Step 1.2: TF-IDF — a better distribution¶
The function below allows us to visualize the distribution of the values in the bag of words. You do not need to edit it. Recall that there are no zero values by nature of the sparse representation. The function has two convenient features:
1. It allows you to transform the values uniformly using an optional second argument.
2. By subtracting the mean, the new mean will line up with $x=0$.
In [0]:
def plot_values(reviews, function=None):
    values = []
    for doc in reviews:
        for (word, score) in doc:
            if not function: values.append(score)
            else: values.append(function(score))

    values = np.array(values)   # an array supports the elementwise subtraction below
    plt.hist(values - np.mean(values), bins='auto')
    plt.show()
In [0]:
plot_values(reviews_bow)

It appears that our values are very highly skewed. Therefore, minmax and standard scaling would not (yet) be appropriate. Let’s see if we can make it look better by log scaling the values:
In [0]:
plot_values(reviews_bow, np.log)

It is a little better, but only a little bit. Perhaps a double log?
In [0]:
plot_values(reviews_bow, lambda x: np.log(np.log(x+1)))

Still not so good. There are (at least) two outstanding issues with this distribution:
1. The vast majority of words only occur once per review.
2. In the rare case that a word occurs more than once, we can’t tell if that is because it is especially important or because it is a common word, like a stop word.
Therefore, we are going to convert our counts into TF-IDF scores. Luckily, this is built into gensim, so it can be done in a couple of lines of code. The API is here. Complete the function below that converts the data to TF-IDF scores. Note that this is a two-step process. First, you need to initialize and fit a TF-IDF model to reviews_bow (use default values for all hyperparameters EXCEPT that you should set normalize=True). Then, you should apply your TF-IDF model to reviews_bow to transform it. Return the new version of the dataset. A sketch of the two-step pattern follows.
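Here is the fit-then-transform pattern on a toy bag-of-words corpus (toy_bow is an illustrative name; only the two-step shape carries over to the homework data):
In [0]:
from gensim.models.tfidfmodel import TfidfModel

toy_bow = [[(0, 2), (1, 1)], [(1, 3), (2, 1)]]
tfidf = TfidfModel(toy_bow, normalize=True)   # step 1: fit document frequencies
toy_tfidf = tfidf[toy_bow]                    # step 2: transform (lazily)
print(list(toy_tfidf))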
In [0]:
# answer 1.2
# TODO: Complete the function
def make_tfidf(reviews_bow):
    # TODO #
In [0]:
reviews_tfidf = make_tfidf(reviews_bow)
In [0]:
plot_values(reviews_tfidf)
In [0]:
## AUTOGRADER Step 1.2: Run this to get your score. ##

grader.grade(question_id = "1.2", answer = str(type(reviews_tfidf)))

This should look a lot better. Log scaling it may make the distribution look a bit more symmetrical, but this would come at the cost of collapsing some distinctions in the right tail, so we will not do it.

Step 1.3: PCA on TF-IDF scores¶
Let’s try the PCA again and plot a new variance versus number of components.
In [0]:
# answer 1.3
# TODO: Train an LsiModel and run the plot_variance_vs_components function
# TODO #

If anything, this graph is less helpful than before. So, instead, we would like to use downstream performance of the classifier to tune this hyperparameter. So let’s build the remaining pieces that we need.

Step 2: Interface with sparse representations¶
To get the real benefit of dimensionality reduction, it is important to consider which pieces of the decomposition are actually needed. Then, we can simply throw away the rest. To help you become familiar with the different pieces we will fully decompose and reconstruct the toy dataset of 5 computer science and 4 math article titles using gensim. It will be important later on that you only apply the functions that you write in this section to the pieces that you need on the big dataset.

Step 2.0: The sparse toy dataset¶
After lower casing, tokenizing, and stop wording, the corpus looks like titles in the cell below. Then, we create a dictionary and a sparse document-term matrix.
In [0]:
titles = [['human', 'interface', 'computer'],
          ['survey', 'user', 'computer', 'system', 'response', 'time'],
          ['eps', 'user', 'interface', 'system'],
          ['system', 'human', 'system', 'eps'],
          ['user', 'response', 'time'],
          ['trees'],
          ['graph', 'trees'],
          ['graph', 'minors', 'trees'],
          ['graph', 'minors', 'survey']]

titles_dict = corpora.Dictionary(titles)
titles_bow = [titles_dict.doc2bow(title) for title in titles]
display(titles_bow)
display(titles_bow)

Step 2.1: Sparse to dense¶
To get the term-document matrix that we have seen in lecture, we need to convert this matrix to its dense form. Write a function densify that takes as input:
1. a sparse matrix in the format of titles_bow above
2. an integer number of columns
and returns a NumPy array. Note that titles_bow is a document-term matrix, not a term-document matrix, so we transpose it in the test cell to show the matrix from lecture (with the rows and columns slightly reordered).
You may not use the corpus2dense function from gensim.
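To pin down the expected shape, here is a sketch on a tiny handmade sparse corpus (toy_sparse is an illustrative name):
In [0]:
import numpy as np

# Sketch: one row per document, one column per word id, zeros elsewhere
toy_sparse = [[(0, 1.0)], [(0, 2.0), (2, 1.0)]]
dense = np.zeros((len(toy_sparse), 3))
for row, doc in enumerate(toy_sparse):
    for col, value in doc:
        dense[row, int(col)] = value
print(dense)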
In [0]:
# answer 2.1
# TODO: Complete the function
def densify(sparse, columns):
    # TODO #
In [0]:
td = densify(titles_bow, len(titles_dict)).transpose()
print(td)
print(td.shape)
In [0]:
## AUTOGRADER Step 2.1: Run this to get your score. ##

grader.grade(question_id = "2.1", answer = (td[4].tolist(), round(sum(sum(td))).item(), str(type(td))))

Step 2.2: Toy PCA reconstruction¶
In the cell below, write a function called PCA_reconstruction that takes as input:
1. a sparse matrix
2. a gensim dictionary
3. a cutoff for PCA
The function should compute an LsiModel and reconstruct the original matrix.
There is something unexpected about the correct solution to this part!
Before turning to Piazza, print the dimensions of the pieces that you are working with, using .shape. What are the dimensions of the original? What are the dimensions of the outputs from LsiModel? How can you multiply the pieces together to get a match? You can do this! We have faith in you!
Note that there could be a loss because the function only computes the part of the singular value decomposition that is needed according to cutoff. So after reconstructing, let’s quantify the loss: compute the difference between the reconstructed matrix and the original. Then, take the Frobenius norm of that difference matrix. Divide by the Frobenius norm of the original. Make this the return value for the function.
Hint: The right singular vectors ($V$ or model[sparse]) already contain the singular values ($S$ or model.projection.s) so don’t include them again!
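Putting the hint into symbols, as a sketch of the relationship rather than a prescription for the code: with $U$ from the trained model and the product $SV^\top$ already available from the projected corpus, the reconstruction and the requested relative error are
$$\hat{X} \approx U\,(SV^\top), \qquad \text{error} = \frac{\lVert X - \hat{X} \rVert_F}{\lVert X \rVert_F}$$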
In [0]:
# answer 2.2
# TODO: Complete the function
def PCA_reconstruction(sparse, gsdict, cutoff):
    # TODO #
In [0]:
for cutoff in range(2,10):
    error = PCA_reconstruction(titles_bow, titles_dict, cutoff)
    print("The reconstruction error with", cutoff, "components on the toy dataset is", error)
In [0]:
## AUTOGRADER Step 2.2: Run this to get your score. ##

answer1 = PCA_reconstruction(titles_bow, titles_dict, 2)
answer2 = PCA_reconstruction(titles_bow, titles_dict, 6)
grader.grade(question_id = "2.2", answer = (answer1.item(), answer2.item()))

Step 3: Choose the number of components via the downstream task¶
Using classification performance to choose the number of components is arguably even better than the plateau method, because we are optimizing directly on the downstream task rather than something intrinsic to the dataset.

Step 3.1 Train the random forest¶
The code below:
1. combines the review TF-IDF scores and the date information into one dataset
2. splits off 20% of the training data as a validation set
3. Initializes a random forest with 70 estimators
To finish the pipeline, add the code that trains the random forest and computes the accuracy on the test set. Return that number as a real value between 0 and 1.
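The two missing calls follow the standard sklearn fit-and-score pattern; here it is on toy data (all names below are illustrative):
In [0]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_toy = np.random.RandomState(0).rand(100, 5)
y_toy = (X_toy[:, 0] > 0.5).astype(int)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(Xtr, ytr)
print(clf.score(Xte, yte))   # mean accuracy, a real value between 0 and 1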
In [0]:
# answer 3.1
# TODO: Complete the function
def evaluate_model(X, review_times, y):
    X = np.hstack((X, review_times))
    X_train, X_test, y_train, y_test = ms.train_test_split(X, y, test_size=0.2, random_state=1911)
    rfor = RandomForestClassifier(n_estimators=70, random_state=1911)
    # TODO #

Step 3.2 Compare performance¶
In the cell below, finish the evaluate_cutoffs function. The missing code should train an LsiModel, compute the $V$ matrix (right singular vectors), call densify on that, and pass the dense matrix to evaluate_model. Store all of your accuracies in a list named results.
In [0]:
# answer 3.2
# TODO: Complete the function
def evaluate_cutoffs(X_orig, X_dict, X_times, y, cutoffs):
    results = []
    for cutoff in cutoffs:
        np.random.seed(1911)
        # TODO #

WARNING: The following cell should take a while to complete.
Each of the 30 models takes a minute or two, and the later ones are bigger (correspondingly slower). Therefore we are going to analyze your output for grading. Once you have a good idea about the best performing model in this set, give us that accuracy and we will check if it is in the expected range.
In [0]:
results = evaluate_cutoffs(reviews_tfidf, reviews_dict, reviews_times, y, range(10,40))
In [0]:
display(results)
In [0]:
## AUTOGRADER Step 3.2: Run this to get your score. ##

answer = max(results)
grader.grade(question_id = "3.2", answer = (len(results), answer.item()))

Step 4: k-means clustering¶
So far, we have one system for classifying the number of stars in a review. But maybe there are patterns that are only true for some subsets of the data? To uncover this, we would like to cluster the reviews.

Step 4.0: Which version of the data?¶
Recall that k-means has a runtime complexity with the strongest term proportional to:
(# of dimensions)(# of points)(# of clusters)(# of iterations)(# of restarts)
Let’s focus on the first three terms. The number of points is 100,000, which is pretty large. Therefore, we will have to be especially careful with the number of dimensions and clusters.
In the previous steps, we generated dimensionality-reduced versions of the dataset. While they did not capture a large, satisfying percentage of the variance in the reviews, the random forest classifier hinted that relatively few principal components were enough to capture the relevant variance for classifying star ratings. Specifically, my random forest seemed to hit a performance ceiling somewhat before reaching 40 components. Therefore, let’s use the TF-IDF version with 40 components.
In the cell below, add the code that trains the LsiModel, computes the right singular vectors, and densifies these projections. Store this dimensionality-reduced dataset as X. What are the expected dimensions?
In [0]:
# answer 4.0
# TODO: Reduce the dimensions of the dataset
cutoff = 40
np.random.seed(1911)
# TODO #
In [0]:
X.shape
In [0]:
## AUTOGRADER Step 4.0: Run this to get your score. ##

grader.grade(question_id = "4.0", answer = X.shape)

Step 4.1: Collect SSWs¶
In the cell below, the function called test_cluster_size iterates over the numbers of clusters in the array num_clusters. The function takes as input (1) the data as a matrix and (2) the num_clusters array.
Add the missing code that should cluster the data using k-means and store the $SS_W$ values.
Note from the sklearn.cluster documentation on k-means:
Attributes:
• cluster_centers_ : array, [n_clusters, n_features] Coordinates of cluster centers

• labels_ : Labels of each point

• inertia_ : Sum of squared distances between data points and their cluster centers

Finally, return a list of $SS_W$ values using the attributes above.
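Here is the relevant attribute in action on toy points (toy names only; the km inside your function is already constructed for you):
In [0]:
import numpy as np
from sklearn.cluster import KMeans

pts = np.random.RandomState(0).rand(50, 2)
km_toy = KMeans(n_clusters=3, n_init=10, random_state=0).fit(pts)
print(km_toy.inertia_)   # SS_W: summed squared distances to assigned centers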
In [0]:
# answer 4.1
# TODO: Complete the function
def test_cluster_size(data, num_clusters):
    scores = []
    for i in num_clusters:
        km = KMeans(n_clusters=i, init='k-means++', n_init=30, max_iter=10,
                    tol=1e-4, random_state=1911, n_jobs=1)
        # TODO #

The cell below also takes a while to run because it is going to cluster the data 38 times.
In [0]:
num_clusters = range(2, 40)
ssws41 = test_cluster_size(X, num_clusters)
In [0]:
display(ssws41)
if (len(ssws41) != 38):
    raise ValueError("Did not compute SSWs for the given values of k.")
In [0]:
## AUTOGRADER Step 4.1: Run this to get your score. ##

grader.grade(question_id = "4.1", answer = [item.item() for item in ssws41])

Step 4.2: Find the elbow?¶
The following provided code helps you plot the number of clusters (from 2 to 39) versus $SS_W$. You do not need to modify these two cells.
In [0]:
def plot_clusters(num_clusters, distortions):
    plt.figure()
    plt.xlabel('Number of Clusters')
    plt.ylabel('Distortion')
    plt.title('Cluster Analysis')
    plt.style.context('seaborn-whitegrid')
    plt.plot(num_clusters, distortions)
    plt.show()
In [0]:
plot_clusters(num_clusters, ssws41)

Do you see a clear “elbow” in this graph??
Probably not.
Just so we can test your solution, we will mathematically define elbow as:
$$\hat{k}_{SSW} = \underset{k}{\operatorname{argmin}} \left(SS_W(k) - SS_W(k+1)\right)$$
This is not a perfect mathematical definition because it does not take into account how much $SS_W$ dropped before the selected point. But for this dataset, it does provide one consistent answer.
In the cell below, complete the function that implements this mathematical definition. Note that we only pass in the list of distortion values.
So the function should return the index of the selected number of clusters!!
Look at the visible test for this subsection to see how we ultimately assign the value of $k$.
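On toy numbers, the definition reads like this (the values are illustrative; note these are indices into the list, and the visible test converts an index into a cluster count):
In [0]:
import numpy as np

toy_ssws = [100.0, 60.0, 45.0, 40.0, 39.0]
drops = np.array(toy_ssws[:-1]) - np.array(toy_ssws[1:])   # SS_W(k) - SS_W(k+1)
print(int(np.argmin(drops)))                               # index of the smallest drop -> 3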
In [0]:
# answer 4.2
# TODO: Complete the function
def sharpest(distortions):
    # TODO #
In [0]:
khat42 = num_clusters[sharpest(ssws41)]
print("I have chosen to have", khat42, "clusters.")
if ((khat42 < 2) or (khat42 > 39)):
    raise ValueError('k hat is not in the right range')
In [0]:
## AUTOGRADER Step 4.2: Run this to get your score. ##

grader.grade(question_id = "4.2", answer = khat42)

Step 4.3: The Variance Ratio Criterion¶
Perhaps we can shift to a different cluster evaluation metric that gives a more satisfying suggestion for the number of clusters.
Recall the Variance Ratio Criterion ($VRC$), given by
$$ VRC(k) = \frac{SS_B}{k-1} / \frac{SS_W}{N - k}$$
where $SS_B$ is the sum of squared distance between the cluster centers and the grand mean (calculated per data point), $k$ is the number of clusters, $SS_W$ is the sum of squared distance between data points and their assigned cluster centers, and $N$ is the number of data points.

Step 4.3.0: The grand mean¶
Before we apply the full formula, please compute the grand mean of the dataset. What does this represent?
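A hedged reading: with rows as data points, the grand mean is the column-wise mean, i.e. the centroid of the whole dataset (toy values below; X in your cell is the real dataset):
In [0]:
import numpy as np

toy_X = np.array([[1.0, 2.0], [3.0, 4.0]])
print(toy_X.mean(axis=0))   # -> [2. 3.], the centroid of all rows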
In [0]:
# answer 4.3.0
# TODO: Compute grand_mean
# TODO #

Step 4.3.1: Interpret the grand mean¶
The grand mean is the text of the “average” review on Amazon. Let’s figure out what that is a bit more precisely for this dataset. The function below finds real data points, i.e. real reviews, that are the closest neighbors to a given vector (item). X_proj is the dataset, mask is a list of booleans stating whether each item in the dataset is an eligible neighbor (we need this later), and k is the number of neighbors we would like to find. Write the missing code which should:
1. Normalize item by its Frobenius norm.
2. Loop through the dataset. Exclude the items that have a corresponding False value in mask.
3. For each eligible item in the dataset, compute the cosine similarity with item (see the sketch after this list). Remember that you have normalized item but you will still need to normalize the other vector.
4. Store the cosines in a list. It may be useful to put the cosines in tuples with the corresponding indices, but you don’t have to do it this way.
5. Find the k highest cosine values.
6. Return the indices corresponding to these highest cosines.
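For step 3, here is the cosine similarity of two vectors, as a minimal sketch on arbitrary toy vectors:
In [0]:
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
a = a / np.linalg.norm(a)                  # normalize the query once, up front
print(np.dot(a, b / np.linalg.norm(b)))    # cosine similarity, ~0.707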
In [0]:
# answer 4.3.1
# TODO: Complete the function
def k_nearest_neighbors(X_proj, mask, item, k):
    # TODO #

This visible test prints the “readable” versions of the ten nearest neighbors to the grand mean review using translate_review from Step 0.3. Do you agree that these are acceptable “average” reviews?
In [0]:
most_typical_indices = k_nearest_neighbors(X, [True]*len(X), grand_mean, 10)
most_typical_reviews = lookup_docs(reviews_bow, most_typical_indices)
for review in most_typical_reviews:
    print(translate_review(review, reviews_dict))
In [0]:
## AUTOGRADER Step 4.3.1: Run this to get your score. ##

grader.grade(question_id = "4.3.1", answer = (most_typical_indices, most_typical_reviews))

Step 4.3.2 Implement VRC¶
Complete the function test_vrc(data, num_clusters) that computes the $VRC$ for each value of k in num_clusters. Since we are passing in the data, compute a new grand mean within the function. However, since the grand mean does not depend on the clusters, you should not compute it within a loop. Please compute $SS_B$ using the grand mean, the cluster centers, and the assignments only (no additional libraries or built-in values). Just as a warning, it is expected that your $SS_W$ and $SS_B$ may not add up to exactly the same number every time, but the sum should not change too much.
Additionally, please also compute a related ratio:
$$\eta^2 = \frac{SS_B}{SS_B + SS_W}$$

This $\eta^2$ (eta squared) is an effect size that pairs with your $VRC$ statistic. Basically all of the $VRC$s are statistically significant because we have so many data points. This is why the effect size is so important. The literature recommends an effect size of at least 0.12.
The return statement is given because we would like to keep the $VRC$s and the $\eta^2$s.
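Here is the $SS_B$ bookkeeping on toy values (all names illustrative): each point contributes the squared distance from its assigned center to the grand mean.
In [0]:
import numpy as np

centers = np.array([[0.0, 0.0], [2.0, 2.0]])
labels = np.array([0, 0, 1])               # cluster assignment of each point
grand_mean = np.array([1.0, 1.0])
ss_b = sum(np.sum((centers[l] - grand_mean) ** 2) for l in labels)
print(ss_b)   # -> 6.0 for this toy example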
In [0]:
# answer 4.3.2
# TODO: Complete the function
def test_vrc(data, num_clusters):
    # TODO #
    return vrcs, etas_squared

The code below takes a while to run, just as the normal clustering before. I recommend printing the $VRC$ and $\eta^2$ values as they come, so that you can track the progress.
In [0]:
num_clusters = range(2,40)
vrcs432, etas_squared432 = test_vrc(X, num_clusters)
In [0]:
## AUTOGRADER Step 4.3.2.1: Run this to get your score. ##

grader.grade(question_id = "4.3.2.1", answer = (vrcs432, etas_squared432))

Step 4.3.3: Select a number of clusters with VRC¶
The code below prints rounded versions of the $VRC$s and $\eta^2$s. Then, it plots the number of clusters (from 2 to 39) versus $VRC$. You do not need to modify this cell.
In [0]:
for i in range(len(num_clusters)):
    print("%2d" % num_clusters[i],
          "%d" % np.round(vrcs432[i], 0),
          np.round(etas_squared432[i], 2))

plot_clusters(num_clusters, vrcs432)

Much better, right??
Complete the best_vrc function that compares and chooses a number of clusters based on the $VRC$s and $\eta^2$s. Note that you are now looking for local maxima, so your elbow method should not be used again. Let’s define a maximum as a point where the graph is increasing just before and decreasing just after. Return a list of all indices of distortions that are maxima. Then, these can be used to select $k$s from the num_clusters array.
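The local-maximum test on toy numbers (illustrative values; the endpoints are excluded because they lack a neighbor on one side):
In [0]:
vals = [1.0, 3.0, 2.0, 4.0, 1.0]
peaks = [i for i in range(1, len(vals) - 1) if vals[i - 1] < vals[i] > vals[i + 1]]
print(peaks)   # -> [1, 3]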
In [0]:
# answer 4.3.3
# TODO: Complete the function
def best_vrc(distortions):
    # TODO #
In [0]:
khat433 = [num_clusters[i] for i in best_vrc(vrcs432)]
print("A good number of clusters is one of these:", khat433)
if ((min(khat433) < 2) or (max(khat433) > 39)):
    raise ValueError('k hat is not in the right range')
In [0]:
## AUTOGRADER Step 4.3.3: Run this to get your score. ##

grader.grade(question_id = "4.3.3", answer = khat433)

Step 5: t-SNE¶
In this last section, we are going to create a t-SNE plot that may help us decide how to use the clusters that we found in the previous section. More specifically, does k-means find differences between the reviews that are related to the star ratings, or other differences?
We are going to use the clustering given below for this section. Double check that X is still the same as when you defined it in Step 4.0. You do not need to modify the cell.
In [0]:
km = KMeans(n_clusters=35, init='k-means++', n_init=30, max_iter=10,
            tol=1e-4, random_state=1911, n_jobs=1)
km.fit(X)

Step 5.1: Exemplars of each cluster¶
Let’s begin by approaching this question in a somewhat qualitative way. Namely, let’s find the nearest review to each cluster center. Complete the function below that iterates through the cluster centers of km. For each cluster center, call k_nearest_neighbors. Use the cluster assignments (labels) from km to construct the mask parameter for k_nearest_neighbors. The key idea here is that you want to search only through the reviews that were assigned to that cluster when searching for neighbors. It really should not change your answer, but it is a whole lot faster to do it this way. Alternatively, you could subset the dataset and pass in [True]*(number_of_points_in_that_cluster) as the mask, i.e. the mask has all true values, so no item is masked. But keeping track of indices is harder in that case.
The function should return a list of lists of indices. There is one list per cluster center. Each index in one of these lists corresponds to a closest neighbor to a cluster center.
In [0]:
# answer 5.1
# TODO: Complete the function
def get_exemplars(X_proj, km, n_exemplars):
    # TODO #

The visible test cell below finds the indices of the nearest neighbors to each cluster center. Then it looks up the vectors for each of these indices and prints the readable version of each vector. Depending on your implementation this cell could take a long time. You do not need to modify this cell.
In [0]:
exemplar_indices = get_exemplars(X, km, 1)
exemplars = lookup_docs(reviews_bow, sum(exemplar_indices, []))
for exemplar in exemplars:
    print(translate_review(exemplar, reviews_dict))
In [0]:
## AUTOGRADER Step 5.1: Run this to get your score. ##

grader.grade(question_id = "5.1", answer = exemplar_indices)

Step 5.2: Prepare and run t-SNE¶
The core idea of t-SNE is preserving the relative distances between all of the data points, but showing those distances in very few dimensions. As such, t-SNE compares every data point against every data point (Cartesian product). For the full dataset, that would be about 10 billion comparisons. So we can’t do that.
But, fortunately, you have already done a lot of work to cluster these data points. If we take a relatively large number of clusters and only consider 30 or so exemplars from each cluster, that should be small enough for t-SNE. The code below assembles the data subset for t-SNE using functions you have written before. Depending on your implementation this cell could take a long time. You do not need to modify this cell.
In [0]:
exemplar_indices = get_exemplars(X, km, 30)
exemplars = lookup_docs(reviews_tfidf, sum(exemplar_indices, []))
for_tsne = densify(exemplars, len(reviews_dict))
for_tsne.shape
In [0]:
## AUTOGRADER Step 5.2: Run this to get your score. ##

grader.grade(question_id = "5.2", answer = (for_tsne.shape[0], for_tsne.shape[1]))

The hyperparameters for t-SNE that I used to get a pretty picture are given below. All you need to do is look up the command to train the t-SNE model, which is given in the API. Call your t-SNE vectors embeddings_2d.
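The training call is a one-liner; here it is on toy data (the toy array is illustrative; use the tsne_model_2d constructed below on for_tsne):
In [0]:
import numpy as np
from sklearn.manifold import TSNE

toy = np.random.RandomState(0).rand(50, 10)
emb = TSNE(n_components=2, random_state=0).fit_transform(toy)
print(emb.shape)   # -> (50, 2): one 2-D embedding per input row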
In [0]:
# answer 5.2
# TODO: Run t-SNE
tsne_model_2d = TSNE(perplexity=20, n_components=2, init='pca', n_iter=3500, random_state=1911)
# TODO #

Step 5.3: Color the points¶
Some code to plot the t-SNE vectors is given below, but it won’t look very pretty yet because we have not decided on a color scheme for the points. We will implement two options.
In [0]:
def tsne_plot(embedding_clusters, a):
    plt.figure(figsize=(16, 9))
    colors = cm.rainbow(np.linspace(0, 1, len(embedding_clusters)))
    for embeddings, color in zip(embedding_clusters, colors):
        x = embeddings[:, 0]
        y = embeddings[:, 1]
        plt.scatter(x, y, c=[color]*len(x), alpha=a)
    plt.show()

Step 5.3.1: By cluster membership¶
The most straightforward choice would be to color a point according to the k-means cluster to which it was assigned. k-means is based on Euclidean distance and t-SNE is based on angular distance, so there should be some, but not total, consistency between the two algorithms.
get_exemplars originally returned a list of lists of indices, but these had to be flattened for lookup_docs. Therefore, you just need to re-group the points into clusters. There are exactly 30 points in each cluster, so there are many ways to do this. If you want an extra challenge, you can solve this part without using the magic number 30. But it is not required and will not affect your homework score.
Call the re-grouped t-SNE vectors embeddings.
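One regrouping pattern, sketched on a tiny flat array (the chunk size is 3 just for the toy; yours would be 30):
In [0]:
import numpy as np

flat = np.arange(12).reshape(6, 2)
chunks = [flat[i:i + 3] for i in range(0, len(flat), 3)]
print(len(chunks), chunks[0].shape)   # -> 2 (3, 2)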
In [0]:
# answer 5.3.1
# TODO: group the t-SNE embeddings by cluster membership
embeddings = []
# TODO #
In [0]:
tsne_plot(embeddings, 0.7)
In [0]:
## AUTOGRADER Step 5.3.1: Run this to get your score. ##

grader.grade(question_id = "5.3.1", answer = str(type(embeddings[0])))

Step 5.3.2 By star rating¶
Now as a finale, group the reviews by star rating. When you are done, embeddings should have a length of five. Then, redraw the plot.
In [0]:
# answer 5.3.2
# TODO: group the t-SNE embeddings by star rating
embeddings = [[],[],[],[],[]]
# TODO #
In [0]:
tsne_plot(embeddings, 0.7)
In [0]:
## AUTOGRADER Step 5.3.2: Run this to get your score. ##

grader.grade(question_id = "5.3.2", answer = str(type(embeddings[0])))

As you can see, the clusters formed by k-means and t-SNE do not seem to correspond to the star ratings. This pretty, but somewhat unsatisfying, picture is where Homework 4 ends and your project begins.
What kinds of analyses have we not tried? What structure is still hidden in these reviews? Can you infer how these 100,000 reviews were selected? Can something fancier than a random forest have higher accuracy in predicting the star rating?
For your project, your task is to put together an interesting notebook about this dataset, similar to this one, the other homework notebooks from this class, or articles on towardsdatascience.com. The notebook should explain what you did in such a way that a non-technical person can read it. As such, your project will be manually graded as a work of data science communication. We hope that you can use your project notebook as something that you can show off in data science job interviews and the like.
Congratulations on all of your work so far! Five stars for you!