程序代写代做 algorithm go kernel graph data science CIS 545 Homework 4: Amazon Review Analysis and Classification

CIS 545 Homework 4: Amazon Review Analysis and Classification
Your main training set for this assignment is the text from 100,000 reviews from Amazon.com, their timestamps, and their star ratings. The high level goal of this homework is to use the textual and temporal data to predict the star ratings.
Adventurers beware! Analyzing this data in sklearn will likely kill your kernel because it may need to store 1.9 billion values. So instead we will use the package gensim for analysis. gensim specializes in efficient implementations of common modeling techniques for big text.
In0:
install stuff
capture
!pip install U gensim
!pip install urllib2

Make sure the following line prints the uptodate version of gensim, which at time of releasing this homework was version 3.8.1. If not, run the cell above again. If you dont do this, you may get different answers than us or have annoying error messages.
In0:
check gensim version
import gensim
gensim.version
In0:
import stuff
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm

from gensim import corpora
from gensim.models import LsiModel, KeyedVectors
from gensim.models.tfidfmodel import TfidfModel
from gensim.models.nmf import Nmf

import sklearn.modelselection as ms
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

from datetime import
from operator import itemgetter
In0:
capture
!wget https:cis.upenn.educis545datareviews.dict
!wget https:cis.upenn.educis545datatrainreviews.mm
!wget https:cis.upenn.educis545datatraintimes.npy
In0:
reviewsdict corpora.Dictionary.loadreviews.dict
reviewsbow corpora.MmCorpustrainreviews.mm
reviewstimes np.loadtraintimes.npy
reviewstimes.shape lenreviewsbow,1
y np.vstacknp.repeat1, 4000, np.repeat2, 4000, np.repeat3, 4000, np.repeat4, 4000, np.repeat5, 4000
y np.repeaty, 5

Autograder Setup
In0:
TODO: Enter your 8digit Penn Id as an integer

STUDENTPENNID TODO
In0:
import json
import urllib.request
import dill
import base64

apiendpoint https:qvms14rjk2.executeapi.useast2.amazonaws.comdefaultGallantGraderv2

class TheGallantGrader:
def initself, studentid, apiendpoint apiendpoint, homeworkid 4:
if studentid None:
printError Autograder Not Setup: Enter your 8 digit PennID in the cell above.
self.studentid strstudentid
self.apiendpoint apiendpoint
self.homeworkid homeworkid

def gradeself, questionid, answer:
payload studentid : self.studentid,
homeworkid : self.homeworkid,
testcaseid : questionid,
answer : self.serializeanswer
params json.dumpspayload.encodeutf8
request urllib.request.Requestself.apiendpoint,
data params,
headers contenttype: applicationjson
try:
response urllib.request.urlopenrequest
responsebody response.read.decodeutf8
print.formatresponsebody
except:
printError: Grading request could not be completed.

def serializeself, obj:
byteserialized dill.dumpsobj
return base64.b64encodebyteserialized.decodeutf8

def deserializeself, obj:
bytedecoded base64.b64decodeobj
return dill.loadsbytedecoded

grader TheGallantGraderstudentid STUDENTPENNID, homeworkid 4

Step 0: Explore the format
We will start with exploring the format of all of the data files that we imported above.

Step 0.1: gensim dictionary lexicon
Most data science over text has some form of vocabulary. Simply put, you need to decide which words your model will care about. Very rare words, misspellings, numbers, and urls are good candidates for exclusion, especially since if the model needs any form of normalization, the time complexity of such computations is at least linear in the size of the vocabulary, if not worse.
A lexicon associates each word in the vocabulary with an index. Since words are repeated, the model can save space by using the index for every repetition and only linking the index with the string form once. A gensim dictionary is special in that it is very fast and allows bidirectional lookups, namely, word to index and index to word.
After reviewing the documentation, rewrite the right hand side of each line in the cell below with the answers to these questions.
1. In the gensim dictionary reviewsdict, what is the index of best? Look it up and store it in a variable named best. To clarify, if you find that 42 is the index of best, change the line below so that it sets best equal to 42. Of course, you can do this with best 42 and earn full points, but it is a litte better to reuse the command with which you found the index. For example, if the gensim dictionary worked like a list of strings, you could do it withbest reviewsdict.ilocbest.
2. What word belongs to index 1911? Look it up and store it in a variable named onenineoneone.
3. What happens when you evaluate reviewsdicti for some variable i? If this returns the word associated with that index, set idx2word to True. Otherwise, set it to False. For example, if reviewsdictbest equals best, idx2word should be False, but if reviewsdict1911 equals onenineoneone, idx2word should be True.
Hint: token2idbest and id2token1911 didnt work for me either. Keep trying!
In0:
answer 0.1
best TODO
onenineoneone TODO
idx2word TODO
In0:
AUTOGRADER Step 0.1: Run this to get your score.

grader.gradequestionid 0.1, answer best, onenineoneone, idx2word

Step 0.2: Look up individual reviews
gensim represents everything in a sparse way. Namely, the representation of a review will be a variablesize list that contains counts of the words that are present in the review. A dense representation, on the other hand, such as a matrix, would, in addition to the present words, contain zero counts for all of the words that are not in that particular review. For some examples, see this tutorial.
But the optimizations dont stop there! gensim also saves space by not directly storing where one document ends and another begins. Such an implementation decision encourages users to stream the dataset through the users pipeline, rather than attempt to read large chunks into memory. Put another way, you can iterate through the dataset using a loop or vectorized function, but you cannot index. In code:
for doc in corpus works!
corpus1911 does not work!
On some occasions, though, it would be convenient for us to, say, look up the 1911th document directly.
So lets implement a function that iterates through the gensim corpus, collects every document whose index appears in indices, and returns that list of documents subset of the dataset. For example, say we want the documents with the following indices: indices 0, 19, 11, 0. Then lookupdocs should return the 1st, 12th, and 20th documents, in that order.
To emphasize, if an index appears multiple times in indices, just return one copy. And for consistency with our autograder, please return the documents in order of increasing index. That would be be like corpus0, then corpus11, then corpus19 in our example. Of course, that way to reference them doesnt work, though!
In0:
answer 0.2
TODO: Complete the function
def lookupdocscorpus, indices:
TODO

Once you have written lookupdocs, you may run the cell below no modification needed to see how documents are represented in a gensim corpus. In each review, gensim stores a tuple of size 2 for each distinct word in the review. The first number in the tuple is the index of the word in the dictionary and the second number in the tuple is the count of the times that word appeared in that review.
In0:
indices 10,18
docs lookupdocsreviewsbow, indices

printdocs0
printdocs1
In0:
AUTOGRADER Step 0.2: Run this to get your score.

grader.gradequestionid 0.2, answer docs

Step 0.3: Make reviews more humanreadable
Now, we would like you to write a function that takes a gensim bag of words document and its corresponding dictionary as input and returns a translated version that is more readable. The reviews are already represented as bags of words, so recall that you cannot recover the order of the words in the reviews. But, we would like you to spell out the repeats of each word. So, if the original review were to be or not to be, reviewsbow would have something like:
0, 2.0, 1, 1.0, 2, 1.0, 3, 2.0
and we would like you to return the string
be be not or to to
In0:
answer 0.3
TODO: Complete the function
def translatereviewreview, reviewsdict:
TODO

readable1 translatereviewdocs0, reviewsdict
printreadable1

readable2 translatereviewdocs1, reviewsdict
printreadable2
In0:
AUTOGRADER Step 0.3: Run this to get your score.

grader.gradequestionid 0.3, answer translatereview, dictreviewsdict

Step 0.4: Parse review times
It might be useful in predicting the scores of the reviews to know when the reviews were written. In this dataset, the day of the review was recorded as the number of seconds that passed between midnight on January 1, 1970 the beginning of time for many computer systems and the time the review was created. This may be efficient because it is one integer, but it is not very convenient. So we are going to convert these int objects to datetime objects:
Do not change reviewtimes in any way. Work with other variables instead.

Step 0.4.1: Convert times vector
The converttimes function should take in the entire reviewtimes vector at once. It should return a new pandas Series object made from reviewtimes but the entries should be of type datetime or Timestamp.
Hint: You might find datetime.fromtimestamp to be useful.
In0:
answer 0.4.1
TODO: Complete the function
def converttimesreviewstimes:
TODO
In0:
convertedtimes converttimesreviewstimes
printconvertedtimes is a, typeconvertedtimes
In0:
AUTOGRADER Step 0.4: Run this to get your score.

grader.gradequestionid 0.4, answer strtypeconvertedtimes, strtypeconvertedtimes0, strtypeconvertedtimes1, strconvertedtimes1

Step 0.4.2: Time math
The daysbefore function should take in one time value after applying converttimes and return a new time value that is exactly offset days before the input.
Hint: You might find timedelta to be useful.
In0:
answer 0.4.2
TODO: Complete the function
def daysbeforetimeitem, offset:
TODO
In0:
displayconvertedtimes0
fortydaysbeforereviewtimes0 daysbeforeconvertedtimes0, 40
displayfortydaysbeforereviewtimes0
In0:
AUTOGRADER Step 0.4.2: Run this to get your score.

grader.gradequestionid 0.4.2, answer strfortydaysbeforereviewtimes0

Step 1: How many components?
We will need to perform dimesionality reduction on our dataset before we can proceed further with the supervised task of predicting the star ratings. One of the greatest benefits of gensim is that it can decompose a sparse dataset directly. Indeed, they post some impressive numbers about their SVD speed here.

Step 1.1: PCA on raw counts
We are first going to choose too many components deliberately, just to make sure that we see the whole picture. But note that 1000 components would still require us to store 100 million numbers. So that is probably too big for convenient exploration of the dataset.

Step 1.1.1: Train the PCA model
Train a gensim LsiModel on reviewsbow using reviewsdict as the dictionary and 1000 components. This magic number is provided as maxcutoff. The API is here.
This step took about 4 minutes for my Colab instance to complete.
In0:
answer 1.1.1
TODO: Learn the syntax of the LsiModel command
maxcutoff 1000
TODO

Step 1.1.2: Extract the singular values
Look at the API page to figure out how to get the singular values from a trained model. Feed those and maxcutoff to the plotvariancevscomponents function, which you do not have to edit.
In0:
def plotvariancevscomponentssingularvalues, cutoff:
evr np.arraysingularvaluesi2 sumsingularvalues2 for i in rangecutoff
var np.cumsumevr100
plt.ylabel Variance Explained
plt.xlabel of Components
plt.titlePCA Analysis
plt.style.contextseabornwhitegrid
plt.plotvar
In0:
answer 1.1.2
TODO: Plot variance versus number of components
TODO

The good news is this curve is very steep in the beginning, which shows that a lot of information is conveyed in the first components. However, there is no plateau that we can use to choose a cutoff!
So, lets go back to the dataset. Are the numbers in reviewsbow distributed sensibly?

Step 1.2: TFIDF a better distribution
The function below allows us to visualize the distribution of the values in the bag of words. You do not need to edit it. Recall that there are no zero values by nature of the sparse representation. The function has two convenient features:
1. It allows you to transform the values uniformly using an optional second argument.
2. By subtracting the mean, the new mean will line up with x0.
In0:
def plotvaluesreviews, functionNone:
values
for doc in reviews:
for word, score in doc:
if not function: values.appendscore
else: values.appendfunctionscore

plt.histvalues np.meanvalues, binsauto
plt.show
In0:
plotvaluesreviewsbow

It appears that our values are very highly skewed. Therefore, minmax and standard scaling would not yet be appropriate. Lets see if we can make it look better by log scaling the values:
In0:
plotvaluesreviewsbow, np.log

It is a little better, but only a little bit. Perhaps a double log?
In0:
plotvaluesreviewsbow, lambda x: np.lognp.logx1

Still not so good. There are at least two outstanding issues with this distribution:
1. The vast majority of words only occur once per review.
2. In the rare case that a word occurs more than once, we cant tell if that is because it is especially important or because it is a common word, like a stop word.
Therefore, we are going to convert our counts into TFIDF scores. Luckily, this is built in to gensim so this can be done in a couple lines of code. The API is here. Complete the function below that converts the data to TFIDF scores. Note: that is a two step process. First, you need to initialize and fit a TFIDF model to reviewsbow. use default values for all hyperparameters EXCEPT you should set normalizeTrue. Then, you should apply your TFIDF model to reviewsbow to transform it. Return the new version of the dataset.
In0:
answer 1.2
TODO: Complete the function
def maketfidfreviewsbow:
TODO
In0:
reviewstfidf maketfidfreviewsbow
In0:
plotvaluesreviewstfidf
In0:
AUTOGRADER Step 1.2: Run this to get your score.

grader.gradequestionid 1.2, answer strtypereviewstfidf

This should look a lot better. Log scaling it may make the distribution look a bit more symmetrical, but this would come at the cost of collapsing some distinctions in the right tail, so we will not do it.

Step 1.3: PCA on TFIDF scores
Lets try the PCA again and plot a new variance versus number of components.
In0:
answer 1.3
TODO: Train an LsiModel and run the plotvariancevscomponents function
TODO

If anything, this graph is less helpful than before. So, instead, we would like to use downstream performance of the classifier to tune this hyperparameter. So lets build the remaining pieces that we need.

Step 2: Interface with sparse representations
To get the real benefit of dimensionality reduction, it is important to consider which pieces of the decomposition are actually needed. Then, we can simply throw away the rest. To help you become familiar with the different pieces we will fully decompose and reconstruct the toy dataset of 5 computer science and 4 math article titles using gensim. It will be important later on that you only apply the functions that you write in this section to the pieces that you need on the big dataset.

Step 2.0: The sparse toy dataset
After lower casing, tokenizing, and stop wording, the corpus looks like titles in the cell below. Then, we create a dictionary and a sparse documentterm matrix.
In0:
titles human, interface, computer,
survey, user, computer, system, response, time,
eps, user, interface, system,
system, human, system, eps,
user, response, time,
trees,
graph, trees,
graph, minors, trees,
graph, minors, survey

titlesdict corpora.Dictionarytitles
titlesbow titlesdict.doc2bowtitle for title in titles
displaytitlesbow

Step 2.1: Sparse to dense
To get the termdocument matrix that we have seen in lecture, we need to convert this matrix to its dense form. Write a function densify that takes as input:
1. a sparse matrix in the format of titlesbow above
2. an integer number of columns
and returns a NumPy array. Note that titlesbow is a documentterm matrix, not a termdocument matrix, so we transpose it in the test cell to show the matrix from lecture with the rows and columns slightly reordered.
You may not use the corpus2dense function from gensim.
In0:
answer 2.1
TODO: Complete the function
def densifysparse, columns:
TODO
In0:
td densifytitlesbow, lentitlesdict.transpose
printtd
printtd.shape
In0:
AUTOGRADER Step 2.1: Run this to get your score.

grader.gradequestionid 2.1, answer td4.tolist, roundsumsumtd.item, strtypetd

Step 2.2: Toy PCA reconstruction
In the cell below, write a function called reconstruction that takes as input:
1. a sparse matrix
2. a gensim dictionary
3. a cutoff for PCA
The function should compute an LsiModel and reconstruct the original matrix.
There is something unexpected about the correct solution to this part!
Before turning to Piazza, print the dimensions of the pieces that you are working with, using .shape. What are the dimensions of the original? What are the dimensions of the outputs from LsiModel? How can you multiply the pieces together to get a match? You can do this! We have faith in you!
Note that there could be a loss because the function only computes the part of the singular value decomposition that is needed according to cutoff. So after reconstructing, lets quantify the loss: compute the difference between the reconstructed matrix and the original. Then, take the Frobenius norm of that difference matrix. Divide by the Frobenius norm of the original. Make this the return value for the function.
Hint: The right singular vectors V or modelsparse already contain the singular values S or model.projection.s so dont include them again!
In0:
answer 2.2
TODO: Complete the function
def PCAreconstructionsparse, gsdict, cutoff:
TODO
In0:
for cutoff in range2,10:
error PCAreconstructiontitlesbow, titlesdict, cutoff
printThe reconstruction error with, cutoff, components on the the toy dataset is, error
In0:
AUTOGRADER Step 2.2: Run this to get your score.

answer1 PCAreconstructiontitlesbow, titlesdict, 2
answer2 PCAreconstructiontitlesbow, titlesdict, 6
grader.gradequestionid 2.2, answer answer1.item, answer2.item

Step 3: Choose the number of components via the downstream task
Using classification performance to choose the number of components is arguably even better than the plateau method, because we are optimizing directly on the downstream task rather than something intrinsic to the dataset.

Step 3.1 Train the random forest
The code below:
1. combines the review TFIDF scores and the date information into one dataset
2. splits off 20 of the training data as a validation set
3. Initializes a random forest with 70 estimators
To finish the pipeline, add the code that trains the random forest and computes the accuracy on the test set. Return that number as a real value between 0 and 1.
In0:
answer 3.1
TODO: Complete the function
def evaluatemodelX, reviewtimes, y:
X np.hstackX, reviewtimes
Xtrain, Xtest, ytrain, ytest ms.traintestsplitX, y, testsize0.2, randomstate 1911
rfor RandomForestClassifiernestimators70, randomstate1911
TODO

Step 3.2 Compare performance
In the cell below, finish the evaluatecutoffs function. The missing code should train an LsiModel, compute the V matrix right singular vectors, call densify on that, and pass the dense matrix to evaluate model. Store all of your accuracies in a list named results.
In0:
answer 3.2
TODO: Complete the function
def evaluatecutoffsXorig, Xdict, Xtimes, y, cutoffs:
results
for cutoff in cutoffs:
np.random.seed1911
TODO

WARNING: The following cell should take a while to complete.
Each of the 30 models takes a minute or two, and the later ones are bigger correspondingly slower. Therefore we are going to analyze your output for grading. Once you have a good idea about the best performing model in this set, give us that accuracy and we will check if it is in the expected range.
In0:
results evaluatecutoffsreviewstfidf, reviewsdict, reviewstimes, y, range10,40
In0:
displayresults
In0:
AUTOGRADER Step 3.2: Run this to get your score.

answer maxresults
grader.gradequestionid 3.2, answer lenresults, answer.item

Step 4: kmeans clustering
So far, we have one system for classifying the number of stars in a review. But maybe there are patterns that are only true for some subsets of the data? To uncover this, we would like to cluster the reviews.

Step 4.0: Which version of the data?
Recall that kmeans has a runtime complexity with the strongest term proportional to:
of dimensions of points of clusters of iterations of restarts
Lets focus on the first three terms. The number of points is 100,000, which is pretty large. Therefore, we will have to be especially careful with the number of dimensions and clusters.
In the previous steps, we generated dimensionalityreduced versions of the dataset. While they did not capture a large, satisfying percentage of the variance in the reviews, the random forest classifier hinted that relatively few principal components were enough to capture the relevant variance for classifying star ratings. Specifically, my random forest seemed to hit a performance ceiling somewhat before reaching 40 components. Therefore, lets use the TFIDF version with 40 components.
In the cell below, add the code that trains the LsiModel, computes the right singular vectors, and densifies these projections. Store this dimensionalityreduced dataset as X. What are the expected dimensions?
In0:
answer 4.0
TODO: Reduce the dimensions of the dataset
cutoff 40
np.random.seed1911
TODO
In0:
X.shape
In0:
AUTOGRADER Step 4.0: Run this to get your score.

grader.gradequestionid 4.0, answer X.shape

Step 4.1: Collect SSWs
In the cell below, the function called testclustersize iterates over the numbers of clusters in the array numclusters. The function takes as input 1 the data as a matrix and 2 the numclusters array.
Add the missing code that should cluster the data using kmeans and store the SSW values.
Note from the sklearn.cluster documentation on kmeans:
Attributes:
clustercenters : array, nclusters, nfeatures Coordinates of cluster centers
labels : Labels of each point
inertia : Sum of squared distances between data points and their cluster centers
Finally, return a list of SSW values using the attributes above.
In0:
answer 4.1
TODO: Complete the function
def testclustersizedata, numclusters:
scores
for i in numclusters:
km KMeansnclustersi, initkmeans, ninit30, maxiter10,
tol1e4, randomstate1911, njobs1
TODO

The cell below also takes a while to run because it is going to cluster the data 38 times.
In0:
numclusters range2, 40
ssws41 testclustersizeX, numclusters
In0:
displayssws41
if lenssws41 ! 38:
raise ValueErrorDid not compute SSWs for the given values of k.
In0:
AUTOGRADER Step 4.1: Run this to get your score.

grader.gradequestionid 4.1, answer item.item for item in ssws41

Step 4.2: Find the elbow?
The following provided code helps you plot the number of clusters from 2 to 40 versus SSW. You do not need to modify these two cells.
In0:
def plotclustersnumclusters, distortions:
plt.figure
plt.xlabelNumber of Clusters
plt.ylabelDistortion
plt.titleCluster Analysis
plt.style.contextseabornwhitegrid
plt.plotnumclusters, distortions
plt.show
In0:
plotclustersnumclusters, ssws41

Do you see a clear elbow in this graph??
Probably not.
Just so we can test your solution, we will mathematically define elbow as:
hatkSSW undersetkoperatornameargmin SSWk SSWk1
This is not a perfect mathematical definition because it does not take into account how much SSW dropped before the selected point. But for this dataset, it does provide one consistent answer.
In the cell below, complete the function that implements this mathematical definition. Note that we only pass in the list of distortion values.
So the function should return the index of the selected number of clusters!!
Look at the visible test for this subsection to see how we ultimately assign the value of k.
In0:
answer 4.2
TODO: Complete the function
def sharpestdistortions:
TODO
In0:
khat42 numclusterssharpestssws41
printI have chosen to have, khat42, clusters.
if khat42 2 or khat42 39:
raise ValueErrork hat is not in the right range
In0:
AUTOGRADER Step 4.2: Run this to get your score.

grader.gradequestionid 4.2, answer khat42

Step 4.3: The Variance Ratio Criterion
Perhaps we can shift to a different cluster evaluation metric that gives a more satisfying suggestion for the number of clusters.
Recall the Variance Ratio Criterion VRC, given by
VRCk fracSSBk1 fracSSWN k
where SSB is the sum of squared distance between the cluster centers and the grand mean calculated per data point, k is the number of clusters, SSW is the sum of squared distance between data points and their assigned cluster centers, and N is the number of data points.

Step 4.3.0: The grand mean
Before we apply the full formula, please compute the grand mean of the dataset. What does this represent?
In0:
answer 4.3.0
TODO: Compute grandmean
TODO

Step 4.3.1: Interpret the grand mean
The grand mean is the text of the average review on Amazon. Lets figure out what that is a bit more precisely for this dataset. The function below finds real data points, i.e. real reviews, that are the closest neighbors to a given vector item. Xproj is the dataset, mask is a list of booleans stating whether each item in the dataset is an eligible neighbor we need this later, and k is the number of neighbors we would like to find. Write the missing code which should:
1. Normalize item by its Frobenius norm.
2. Loop through the dataset. Exclude the items that have a corresponding False value in mask.
3. For each eligible item in the dataset, compute the cosine similarity with item. Remember that you have normalized item but you will still need to normalize the other vector.
4. Store the cosines in a list. It may be useful to put the cosines in tuples with the corresponding indices, but you dont have to do it this way.
5. Find the k highest cosine values.
6. Return the indices corresponding to these highest cosines.
In0:
answer 4.3.1
TODO: Complete the function
def knearestneighborsXproj, mask, item, k:
TODO

This visible test prints the readable versions of the ten nearest neighbors to the grand mean review using translatereview from Step 0.2. Do you agree that these are acceptable average reviews?
In0:
mosttypicalindices knearestneighborsX, TruelenX, grandmean, 10
mosttypicalreviews lookupdocsreviewsbow, mosttypicalindices
for review in mosttypicalreviews:
printtranslatereviewreview, reviewsdict
In0:
AUTOGRADER Step 4.3.1: Run this to get your score.

grader.gradequestionid 4.3.1, answer mosttypicalindices, mosttypicalreviews

Step 4.3.2 Implement VRC
Complete the function testvrcdata, maxnumclusters that computes the VRC for each value of k in numclusters. Since we are passing in the data, compute a new grand mean within the function. However, since the grand mean does not depend on the clusters, you should not compute it within a loop. Please compute SSB using the grand mean, the cluster centers, and the assignments only no additional libraries or builtin values. Just as a warning, it is expected that your SSW and SSB may not add up to exactly the same number every time, but the sum should not change too much.
Additionally, please also compute a related ratio:
eta2 fracSSBSSB SSW

This eta2 eta squared is an effect size that pairs with your VRC statistic. Basically all of the VRCs are statistically significant because we have so many data points. This is why the effect size is so important. The literature recommends an effect size of at least 0.12.
The return statement is given because we would like to keep the VRCs and the eta2s.
In0:
answer 4.3.2
TODO: Complete the function
def testvrcdata, numclusters:
TODO
return vrcs, etassquared

The code below takes a while to run just as the normal clustering before. I recommend printing the VRC and eta2 values as the come, so that you can track the progress.
In0:
numclusters range2,40
vrcs432, etassquared432 testvrcX, numclusters
In0:
AUTOGRADER Step 4.3.2.1: Run this to get your score.

grader.gradequestionid 4.3.2.1, answer vrcs432, etassquared432

Step 4.3.3: Select a number of clusters with VRC
The code below prints rounded versions of the VRCs and eta2s. Then, it plots the number of clusters from 2 to 40 versus VRC. You do not need to modify this cell.
In0:
for i in rangelennumclusters:
print2dnumclustersi,
dnp.roundvrcs432i, 0,
np.roundetassquared432i, 2

plotclustersnumclusters, vrcs432

Much better, right??
Complete the bestvrc function that compares and chooses a number of clusters based on the VRCs and eta2s. Note that you are now looking for local maxima, so your elbow method should not be used again. Lets define a maximum as a point where the graph is increasing just before and decreasing just after. Return a list of all indices of distortions that are maxima. Then, these can be used to select ks from the numclusters array.
In0:
answer 4.3.3
TODO: Complete the function
def bestvrcdistortions:
TODO
In0:
khat433 numclustersi for i in bestvrcvrcs432
printA good number of clusters is one of these:, khat433
if minkhat433 2 or maxkhat433 39:
raise ValueErrork hat is not in the right range
In0:
AUTOGRADER Step 4.3.3: Run this to get your score.

grader.gradequestionid 4.3.3, answer khat433

Step 5: tSNE
In this last section, we are going to create a tSNE plot that may help us decide how to use the clusters that we found in the previous section. More specifically, does kmeans find differences between the reviews that are related to the star ratings, or other differences?
We are going to use the clustering given below for this section. Double check that X is still the same as when you defined it in Step 4.0. You do not need to modify the cell.
In0:
km KMeansnclusters35, initkmeans, ninit30, maxiter10,
tol1e4, randomstate1911, njobs1
km.fitX

Step 5.1: Exemplars of each cluster
Lets begin by approaching this question in a somewhat qualitative way. Namely, lets find the nearest review to each cluster center. Complete the function below that iterates through the cluster centers of km. For each cluster center, call knearestneighbors. Use the cluster assignments labels from km to construct the mask parameter for knearestneighbors. The key idea here is that you want to search only through the reviews that were assigned to that cluster when searching for neighbors. It really should not change your answer, but it is a whole lot faster to do it this way. Alternatively, you could subset the dataset and pass in Truenumberofpointsinthatcluster as the mask, i.e. the mask has all true values, so no item is masked. But keeping track of indices is harder in that case.
The function should return a list of lists of indices. There is one list per cluster center. Each index in one of these lists corresponds to a closest neighbor to a cluster center.
In0:
answer 5.1
TODO: Complete the function
def getexemplarsXproj, km, nexemplars:
TODO

The visible test cell below finds the indices of the nearest neighbors to each cluster center. Then it looks up the vectors for each of these indices and prints the readable version of each vector. Depending on your implementation this cell could take a long time. You do not need to modify this cell.
In0:
exemplarindices getexemplarsX, km, 1
exemplars lookupdocsreviewsbow, sumexemplarindices,
for exemplar in exemplars:
printtranslatereviewexemplar, reviewsdict
In0:
AUTOGRADER Step 5.1: Run this to get your score.

grader.gradequestionid 5.1, answer exemplarindices

Step 5.2: Prepare and run tSNE
The core idea of tSNE is preserving the relative distances between all of the data points, but showing those distances in very few dimensions. As such, tSNE compares every data point against every data point Cartesian product. For the full dataset, that would be about 19 billion comparisons. So we cant do that.
But, fortunately, you have already done a lot of work to cluster these data points. If we take a relatively large number of clusters and only consider 30 or so exemplars from each cluster, that should be small enough for tSNE. The code below assembles the data subset for tSNE using functions you have written before. Depending on your implementation this cell could take a long time. You do not need to modify this cell.
In0:
exemplarindices getexemplarsX, km, 30
exemplars lookupdocsreviewstfidf, sumexemplarindices,
fortsne densifyexemplars, lenreviewsdict
fortsne.shape
In0:
AUTOGRADER Step 5.2: Run this to get your score.

grader.gradequestionid 5.2, answer fortsne.shape0, fortsne.shape1

The hyperparameters for tSNE that I used to get a pretty picture are given below. All you need to do is look up the command to train the tSNE model, which is given in the API. Call your tSNE vectors embeddings2d.
In0:
answer 5.2
TODO: Run tSNE
tsnemodel2d TSNEperplexity20, ncomponents2, initpca, niter3500, randomstate1911
TODO

Step 5.3: Color the points
Some code to plot the tSNE vectors is given below, but it wont look very pretty yet because we have not decided on a color scheme for the points. We will implement two options.
In0:
def tsneplotembeddingclusters, a:
plt.figurefigsize16, 9
colors cm.rainbownp.linspace0, 1, lenembeddingclusters
for embeddings, color in zipembeddingclusters, colors:
x embeddings:, 0
y embeddings:, 1
plt.scatterx, y, ccolorlenx, alphaa
plt.show

Step 5.3.1: By cluster membership
The most straighforward choice would be to color a point according to the kmeans cluster to which it was assigned. kmeans is based on Euclidean distance and tSNE is based on angular distance, so there should be some, but not total, consistency between the two algorithms.
getexemplars originally had a list of lists of indices, but these had to be flattened for lookupdocs. Therefore, you just need to regroup the points into clusters. There are exactly 30 points in each cluster, so there are many ways to do this. If you want an extra challenge, you can solve this part without using the magic number 30. But it is not required and will not affect your homework score.
Call the regrouped tSNE vectors embeddings.
In0:
answer 5.3.1
TODO: group the tSNE embeddings by cluster membership
embeddings
TODO
In0:
tsneplotembeddings, 0.7
In0:
AUTOGRADER Step 5.3.1: Run this to get your score.

grader.gradequestionid 5.3.1, answer strtypeembeddings0

Step 5.3.2 By star rating
Now as a finale, group the reviews by star rating. When you are done, embeddings should have a length of five. Then, redraw the plot.
In0:
answer 5.3.2
TODO: group the tSNE embeddings by star rating
embeddings ,,,,
TODO
In0:
tsneplotembeddings, 0.7
In0:
AUTOGRADER Step 5.3.2: Run this to get your score.

grader.gradequestionid 5.3.2, answer strtypeembeddings0

As you can see, the clusters formed by kmeans and tSNE do not seem to correspond to the star ratings. This pretty, but somewhat unhappy standpoint is where Homework 4 ends and your project begins.
What kinds of analyses have we not tried? What structure is still hidden in these reviews? Can you infer how these 100,000 reviews were selected? Can something fancier than a random forest have higher accuracy in predicting the star rating?
For your project, your task is to put together an interesting notebook about this dataset, similar to this one, the other homework notebooks from this class, or articles on towardsdatascience.com. The notebook should explain what you did in such a way that a nontechnical person can read it. As such, your project will be manually graded as a work of data science communication. We hope that you can use your project notebook as something that you can show off in data science job interviews and the like.
Congratulations on all of your work so far! Five stars for you!

Related Posts