
Assignment_8_solved

Follow these instructions:
Once you are finished, make sure to complete the following steps.

1. Restart your kernel by clicking 'Kernel' > 'Restart & Run All'.

2. Fix any errors which result from this.

3. Repeat steps 1. and 2. until your notebook runs without errors.

4. Submit your completed notebook to OWL by the deadline.

Assignment Week 8: Text Mining using Dimensionality Reduction Methods [_/100 Marks]
This dataset comes from the Amazon website and consists of 1,000 reviews that were labeled (by humans) as positive or negative. In this assignment, we will apply dimensionality reduction methods to improve our understanding of text data and to predict the sentiment of a set of texts.

# !gdown https://drive.google.com/uc?id=1habwBbNCj6wFIDvxLa7hdP2xakA_tcDG

Downloading…
From: https://drive.google.com/uc?id=1habwBbNCj6wFIDvxLa7hdP2xakA_tcDG
To: /home/alireza/Desktop/Ass8/Reviews_sample.csv
100%|████████████████████████████████████████| 456k/456k [00:00<00:00, 16.6MB/s]

import numpy as np
import pandas as pd
import umap
from sklearn.decomposition import PCA, TruncatedSVD
import sklearn.feature_extraction.text as sktext
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score
from itertools import product
import seaborn as sns
import matplotlib.pyplot as plt

Task 1: Decomposition of the texts [ /66 marks]

Question 1.1
The dataset comes with the text and a binary variable which represents the sentiment, either positive or negative. Import the data and use sklearn's TfidfVectorizer to eliminate accents, special characters, and stopwords. In addition, make sure to eliminate words that appear in fewer than 5% of documents and those that appear in over 95%. You can also set sublinear_tf to True. After that, split the data into train and test with test_size = 0.2 and random_state = seed. Calculate the Tf-Idf transform for both train and test. Note that you need to fit and transform the inputs for the train set but you only need to transform the inputs for the test set. Don't forget to turn the sparse matrices to dense ones after you apply the Tf-Idf transform.

# Load the data [ /1 marks]
reviews_data = pd.read_csv('Reviews_sample.csv')

# Display the first 5 rows [ /1 marks]
reviews_data.head()

   text                                               label
0  Stuning even for the non-gamer: This sound tr...   1
1  The best soundtrack ever to anything.: I'm re...   1
2  Amazing!: This soundtrack is my favorite musi...   1
3  Excellent Soundtrack: I truly like this sound...   1
4  Remember, Pull Your Jaw Off The Floor After H...   1

# Defining the TfIDFTransformer [ /4 marks]
TfIDFTransformer = sktext.TfidfVectorizer(strip_accents = 'unicode', # Eliminate accents and special characters
                                          stop_words = 'english', # Eliminate stop words
                                          min_df = 0.05, # Eliminate words that appear in fewer than 5% of texts
                                          max_df = 0.95, # Eliminate words that appear in more than 95% of texts
                                          sublinear_tf = True # Use sublinear term frequency, 1 + log(tf)
                                          )

# The cell that defines `seed` is missing from this export; the value below is an assumption
seed = 123456789

# Train/test split [ /2 marks]
# Note: the question text specifies test_size = 0.2, but this solution uses 0.3
x_train, x_test, y_train, y_test = train_test_split(reviews_data['text'],
                                                    reviews_data['label'],
                                                    test_size = 0.3,
                                                    random_state = seed)

# Calculate the Tf-Idf transform [ /4 marks]
TfIDF_train = TfIDFTransformer.fit_transform(x_train)
TfIDF_test = TfIDFTransformer.transform(x_test)

From here on, you will use the variables TfIDF_train and TfIDF_test as the input for the different tasks, and the y_train and y_test labels for each dataset (if required). Print the number of indices in the output using the TfIDFTransformer.get_feature_names_out() method.

# Print the number of indices [ /2 marks]
print('There are %i words in the index.' % len(TfIDFTransformer.get_feature_names_out()))

There are 69 words in the index.

Question 1.2
Now we have the TfIDF matrix so we can start working on the data. We hope to explore what some commonly occurring concepts are in the text reviews. We can do this using PCA. A PCA transform of the TF-IDF matrix will give us a basis of the text data, each component representing a concept or set of words that are correlated. Correlation in text can be interpreted as a relation to a similar topic. Calculate a PCA transform of the training data using the maximum number of concepts possible. Make a plot that shows the cumulative explained variance per number of concepts.
# Apply PCA on training data and get the explained variance [ / 4 marks]
PCA_Reviews = PCA(n_components = 69)
PCA_Reviews.fit(TfIDF_train.toarray()) # toarray() returns an ndarray and avoids the np.matrix deprecation warning
PCA_variances = PCA_Reviews.explained_variance_ratio_

# Plotting explained variance with number of concepts [ / 4 marks]
plt.plot(np.arange(1, 70), np.cumsum(PCA_variances))
plt.xlabel('Number of concepts')
plt.ylabel('Explained variance %')

Text(0, 0.5, 'Explained variance %')

Question: Exactly how many concepts do we need to correctly explain at least 80% of the data?

# To get the exact index where the variance is above 80% [ / 4 marks]
# enumerate is zero-based, so the number of concepts is index80 + 1
index80 = next(x for x, val in enumerate(np.cumsum(PCA_variances)) if val > 0.8)
print(index80)

Your Answer: We need 43 concepts in order to explain at least 80% of the variance.
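The threshold lookup can also be written without a generator. The sketch below is an alternative formulation on made-up variance ratios; the helper name components_for_variance and the toy numbers are assumptions for illustration, and it returns a count (index plus one) rather than a zero-based index.

```python
import numpy as np

def components_for_variance(variance_ratios, threshold=0.8):
    """Smallest number of leading components whose cumulative
    explained-variance ratio reaches `threshold`."""
    cumulative = np.cumsum(variance_ratios)
    # Zero-based position of the first cumulative value >= threshold
    index = int(np.searchsorted(cumulative, threshold, side="left"))
    if index >= len(cumulative):
        raise ValueError("threshold is never reached")
    return index + 1  # convert the zero-based index into a count

# Hypothetical ratios; the cumulative sums are 0.5, 0.7, 0.85, 0.95, 1.0
toy_ratios = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
n_needed = components_for_variance(toy_ratios, threshold=0.8)
print(n_needed)  # 3
```

np.searchsorted works here because a cumulative sum of non-negative ratios is sorted in ascending order.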

Question 1.3
Let's examine the first three concepts by looking at how much variance they explain and showing the 10 words that are the most important in each of these three concepts (as revealed by the absolute value of the PCA weight in each concept).

# Explained variance [ / 2 marks]
print('The first three components explain %.2f%% of the variance.' % (np.cumsum(PCA_variances)[2] * 100))

The first three components explain 12.72% of the variance.

# Get 10 most important words for each component [ / 4 marks]
words_per_row = TfIDFTransformer.get_feature_names_out()
most_important = [np.argpartition(np.abs(PCA_Reviews.components_[i]), -10)[-10:] for i in range(3)]

# Words for concept 1 [ / 2 marks]
[words_per_row[i] for i in most_important[0]]

‘written’,
‘product’,
‘reading’,

# Words for concept 2 [ / 2 marks]
[words_per_row[i] for i in most_important[1]]

[‘little’,

# Words for concept 3 [ / 2 marks]
[words_per_row[i] for i in most_important[2]]

‘product’]

Question: What is the cumulative variance explained by these three concepts? What would you name each of these concepts? [ / 2 marks]

Hint: If in a concept you would get the words ‘dog’, ‘cat’, ‘fish’ as the most important ones, you could name the concept ‘animals’ or ‘pets’.

Your answer: The three concepts together explain 12.72% of the variance. Several interpretations are possible here, but in general we can say that the first component is mostly about reading, the second one could be positive opinions on movies, and the third one hints at recommendation.
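One detail when reading these word lists: np.argpartition returns the ten largest weights in no particular order. If a ranked list is easier to interpret, a full argsort can be used instead. The sketch below uses a made-up vocabulary and component; top_words, toy_vocabulary and toy_component are illustrative names, not part of the assignment.

```python
import numpy as np

def top_words(component, vocabulary, k=3):
    """Return the k words with the largest absolute weight, strongest first,
    together with their signed weights."""
    order = np.argsort(np.abs(component))[::-1][:k]
    return [(str(vocabulary[i]), float(component[i])) for i in order]

# Hypothetical vocabulary and PCA weights for a single concept
toy_vocabulary = np.array(["book", "movie", "read", "great", "story"])
toy_component = np.array([0.6, -0.1, 0.7, 0.05, -0.3])

ranked = top_words(toy_component, toy_vocabulary, k=3)
print(ranked)  # [('read', 0.7), ('book', 0.6), ('story', -0.3)]
```

Keeping the sign next to each word shows which words pull in opposite directions within the same concept.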

Question 1.4
Apply the PCA transformation to the test dataset. Use only the first two components and make a scatter plot of the cases. Identify positive and negative cases by colouring points with different sentiments with different colours.

# Apply PCA to the test dataset [ / 2 marks]
test_pca = PCA_Reviews.transform(TfIDF_test.toarray()) # toarray() avoids the np.matrix deprecation warning

# Plot the two different sets of points with different markers and labels [ /4 marks]

fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(test_pca[:, 0][y_test == 0], test_pca[:, 1][y_test == 0], marker = 'o', label = 'Negative sentiment')
ax.scatter(test_pca[:, 0][y_test == 1], test_pca[:, 1][y_test == 1], marker = '^', label = 'Positive sentiment')
plt.legend()

Question: What can we say about where the positive and negative cases lie in our plot? Could we use these concepts to discriminate positive and negative cases? If yes, why? If no, why not? Discuss your findings. [ /2 marks]

Your answer: The positive cases are more widely scattered and lie mostly above the negative ones. We cannot use the first two components alone to discriminate positive from negative cases, since the two classes overlap heavily where the second component is between -0.2 and 0.

Question 1.5
Repeat the process above, only now using a UMAP projection with two components. Test all combinations of n_neighbors=[2, 10, 25] and min_dist=[0.1, 0.25, 0.5] over the train data and choose the projection that you think is best, and apply it over the test data. Use 1000 epochs, a cosine metric and random initialization. If you have more than 8GB of RAM (as in Colab), you may want to set low_memory=False to speed up computations.

Hint: This link may be helpful.

# Set parameters
fig, axs = plt.subplots(3, 3, figsize=(20,20))
n_neighbors=[2, 10, 25]
min_dist=[0.1, 0.25, 0.5]
fig.tight_layout()
sns.set_style('white')

# Create UMAP and plots [ / 8 marks]
for (i,j), (nei, dist) in zip(product([0,1,2],[0,1,2]), product(n_neighbors, min_dist)):
    UMAP_Reviews = umap.UMAP(n_components = 2,
                             metric='cosine',
                             n_epochs=1000,
                             low_memory=False,
                             n_neighbors=nei,
                             min_dist=dist,
                             init='random'
                             )

    x_train_umap = UMAP_Reviews.fit_transform(TfIDF_train)

    # Create plot
    sns.scatterplot(x=x_train_umap[:, 0], y=x_train_umap[:, 1], hue=y_train, ax=axs[i,j])
    axs[i,j].title.set_text(f"n_neighbors={nei}, min_dist={dist}")

plt.show()

Question: Which parameters would you choose? [ / 2 marks]

Your Answer: We can see that 2 neighbours is too few and that there are limited gains from using 25 neighbours, so 10 is a reasonable choice. Regarding min_dist, 0.1 gives the clearest structure. This is a judgment call: other choices should be considered correct if they are well argued.

# Choose the parameters that you think are best and apply to test set [ / 4 marks]
UMAP_Reviews = umap.UMAP(n_components = 2,
                         metric='cosine',
                         n_epochs=5000,
                         low_memory=False,
                         n_neighbors=10,
                         min_dist=0.1,
                         init='random'
                         )

x_train_umap = UMAP_Reviews.fit_transform(TfIDF_train)
x_test_umap = UMAP_Reviews.transform(TfIDF_test)

# Create plot [ /2 marks]
sns.scatterplot(x=x_test_umap[:, 0], y=x_test_umap[:, 1], hue=y_test)
plt.show()

Question: How does the plot compare to the PCA one? [ /2 marks]

Your answer: The new plot looks fairly different from the PCA one. We see other structures and some non-linear separation between the cases. Still, there is no easy way of separating the positive and negative reviews.

Task 2: Benchmarking predictive capabilities of the compressed data [ / 34 marks]
For this task, we will benchmark the predictive capabilities of the compressed data against the original one.

Question 2.1
Train a regularized logistic regression over the original TfIDF train set (with no compression) using l2 regularization. Calculate the AUC score and plot the ROC curve for the original test set.

# Train and test using model LogisticRegressionCV [ /4 marks]

# Define the model
LogRegFull = LogisticRegressionCV(Cs = 10,
                                  penalty = 'l2',
                                  solver = 'lbfgs',
                                  tol = 0.0001,
                                  max_iter = 1000,
                                  n_jobs = -1,
                                  refit = True,
                                  random_state = seed
                                  )

# Fit on the training dataset
LogRegFull.fit(TfIDF_train, y_train)

# Apply to the test dataset
probs_test_full = LogRegFull.predict_proba(TfIDF_test)

# Plot ROC curve and compute AUC score [ /4 marks]
# Calculate the ROC curve points
fpr, tpr, thresholds = roc_curve(y_test, probs_test_full[:, 1])

# Save the AUC in a variable to display it. Round it first
auc = np.round(roc_auc_score(y_true = y_test,
                             y_score = probs_test_full[:, 1]), decimals = 3)

# Create and show the plot
plt.plot(fpr, tpr, label = "Full dataset, auc = " + str(auc))
plt.legend(loc = 4)
plt.show()
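The AUC reported in the legend has a direct ranking interpretation: it is the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes that quantity by brute force on four made-up scores; pairwise_auc and the toy values are illustrative assumptions and are far slower than roc_auc_score on real data.

```python
from itertools import product

def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive example has the higher score; ties count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p, n in product(positives, negatives):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities of the positive class
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

An AUC of 0.5 therefore corresponds to a model that ranks positives and negatives no better than chance.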

Question 2.2
Train a regularized logistic regression over an SVD-reduced dataset (with 10 components) using l2 regularization. Calculate the AUC score and plot the ROC curve for the SVD-transformed test set.

# Apply SVD first [ / 4 marks]
SVD_Reviews = TruncatedSVD(n_components = 10,
                           n_iter=10,
                           random_state=42
                           )
x_train_svd = SVD_Reviews.fit_transform(TfIDF_train)
x_test_svd = SVD_Reviews.transform(TfIDF_test)

# Train and test using model LogisticRegressionCV [ /4 marks]
LogRegSVD = LogisticRegressionCV(Cs = 10,
                                 penalty = 'l2',
                                 solver = 'lbfgs',
                                 tol = 0.0001,
                                 max_iter = 1000,
                                 n_jobs = -1,
                                 refit = True,
                                 random_state = 123456789
                                 )

LogRegSVD.fit(x_train_svd, y_train)
probs_test_svd = LogRegSVD.predict_proba(x_test_svd)

# Plot ROC curve and compute AUC score [ /4 marks]
# Calculate the ROC curve points
fpr, tpr, thresholds = roc_curve(y_test, probs_test_svd[:, 1])

# Save the AUC in a variable to display it. Round it first
auc = np.round(roc_auc_score(y_true = y_test,
                             y_score = probs_test_svd[:, 1]), decimals = 3)

# Create and show the plot
plt.plot(fpr, tpr, label = "SVD, auc = " + str(auc))
plt.legend(loc = 4)
plt.show()
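Unlike the PCA steps in Task 1, the SVD step above is fitted on the sparse TF-IDF matrix without a dense conversion. TruncatedSVD accepts scipy sparse input and does not centre the data, which is why this works. A minimal sketch on a randomly generated sparse matrix follows; the matrix and the names toy_matrix, svd and reduced are assumptions for illustration only.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Hypothetical mostly-zero matrix standing in for a TF-IDF matrix
rng = np.random.default_rng(0)
dense = rng.random((50, 20))
dense[dense < 0.8] = 0.0
toy_matrix = sparse.csr_matrix(dense)

# Fit directly on the sparse matrix; no toarray() call is required
svd = TruncatedSVD(n_components=5, n_iter=10, random_state=0)
reduced = svd.fit_transform(toy_matrix)

print(reduced.shape)  # (50, 5)
print(svd.explained_variance_ratio_.sum())
```

Because no centring is applied, the components are not identical to PCA components computed on the same data, although they are often similar for TF-IDF matrices.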

Question 2.3
Train a regularized logistic regression over the UMAP-reduced dataset (with 10 components using the same parameters as Task 1.5) using l2 regularization. Calculate the AUC score and plot the ROC curve for the UMAP-transformed test set.

# Apply UMAP first [ / 4 marks]
UMAP_Reviews = umap.UMAP(n_components = 10, # Question 2.3 asks for 10 components
                         metric='cosine',
                         n_epochs=5000,
                         low_memory=False,
                         n_neighbors=10,
                         min_dist=0.1,
                         init='random'
                         )
x_train_umap = UMAP_Reviews.fit_transform(TfIDF_train)
x_test_umap = UMAP_Reviews.transform(TfIDF_test)

# Train and test using model LogisticRegressionCV [ /4 marks]
LogRegUMAP = LogisticRegressionCV(Cs = 10,
                                  penalty = 'l2',
                                  solver = 'lbfgs',
                                  tol = 0.0001,
                                  max_iter = 1000,
                                  n_jobs = -1,
                                  refit = True,
                                  random_state = 123456789
                                  )

LogRegUMAP.fit(x_train_umap, y_train)
probs_test_umap = LogRegUMAP.predict_proba(x_test_umap)

# Plot ROC curve and compute AUC score [ /4 marks]
# Calculate the ROC curve points
fpr, tpr, thresholds = roc_curve(y_test, probs_test_umap[:, 1])

# Save the AUC in a variable to display it. Round it first
auc = np.round(roc_auc_score(y_true = y_test,
                             y_score = probs_test_umap[:, 1]), decimals = 3)

# Create and show the plot
plt.plot(fpr, tpr, label = "UMAP, auc = " + str(auc))
plt.legend(loc = 4)
plt.show()

Question 2.4
Compare the performance of the three models. Which one is the best? [ / 2 marks]

Your Answer: The model trained on the original data performs better than those trained on the reduced data. However, SVD still exhibits acceptable predictive capabilities. Because UMAP produces a non-linear mapping, its output is poorly suited to logistic regression, which looks for linear separations, so its performance is poor. This can probably be improved by using a non-linear model such as XGBoost.
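The last point can be illustrated on synthetic data. The sketch below is not run on the review data and does not use XGBoost: it swaps in scikit-learn's GradientBoostingClassifier as a stand-in non-linear model and uses make_circles as an assumed example of classes that no straight line can separate.

```python
from sklearn.datasets import make_circles
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic two-class data: one class forms a ring around the other
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# A linear model and a tree-based non-linear model trained on the same inputs
linear = LogisticRegression().fit(X_tr, y_tr)
boosted = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

auc_linear = roc_auc_score(y_te, linear.predict_proba(X_te)[:, 1])
auc_boosted = roc_auc_score(y_te, boosted.predict_proba(X_te)[:, 1])

print("logistic regression AUC:", round(auc_linear, 3))
print("gradient boosting AUC:", round(auc_boosted, 3))
```

Whether the same gain appears on the UMAP-reduced reviews would have to be checked by fitting such a model on x_train_umap and scoring it on x_test_umap.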
