
Stop Words¶
In [1]:
import nltk
In [2]:
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /home/diego/nltk_data…
[nltk_data] Package stopwords is already up-to-date!
In [3]:
stop = stopwords.words('english')
stop[:5]
Out[3]:
['i', 'me', 'my', 'myself', 'we']
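As a small illustration (an addition to the original notebook, using a made-up token list), the stop word list can be used to filter function words out of a tokenised sentence:
# A minimal sketch: keep only the tokens that are not in the stop word list.
# The token list below is an arbitrary example, not taken from any corpus used here.
tokens = ['this', 'is', 'a', 'short', 'example', 'of', 'stop', 'word', 'removal']
[t for t in tokens if t not in stop]
# Expected: ['short', 'example', 'stop', 'word', 'removal']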

Zipf’s Law¶
In [4]:
%matplotlib inline
nltk.download(‘gutenberg’)
import nltk
import collections
import matplotlib.pyplot as plt
words = nltk.corpus.gutenberg.words()
fd = collections.Counter(words)
data = [f for w, f in fd.most_common()]
plt.plot(data[:1000])
plt.title("Zipf's Law")
plt.xlabel("Word rank")
plt.ylabel("Frequency")

[nltk_data] Downloading package gutenberg to /home/diego/nltk_data…
[nltk_data] Package gutenberg is already up-to-date!
Out[4]:
Text(0, 0.5, 'Frequency')

[Figure: "Zipf's Law" plot of frequency against word rank for the 1,000 most frequent words]
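Zipf's law says that a word's frequency is roughly inversely proportional to its rank, so the same data should look approximately linear on a log-log plot. A minimal sketch of that view (an addition to the original notebook, reusing the data list computed above):
# Plot frequency against rank on logarithmic axes; under Zipf's law the curve
# is approximately a straight line with slope close to -1.
ranks = range(1, len(data) + 1)
plt.figure()
plt.loglog(ranks, data)
plt.title("Zipf's Law (log-log)")
plt.xlabel("Word rank")
plt.ylabel("Frequency")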
In [5]:
fd.most_common(10)
Out[5]:
[(',', 186091),
 ('the', 125748),
 ('and', 78846),
 ('.', 73746),
 ('of', 70078),
 (':', 47406),
 ('to', 46443),
 ('a', 32504),
 ('in', 31959),
 ('I', 30221)]

Vectors and Matrices in Python¶

Manipulating Vectors¶
In [6]:
import numpy as np
a = np.array([1,2,3,4])
a[0]
Out[6]:
1
In [7]:
a[1:3]
Out[7]:
array([2, 3])
In [8]:
a+1
Out[8]:
array([2, 3, 4, 5])
In [9]:
b = np.array([2,3,4,5])
a+b
Out[9]:
array([3, 5, 7, 9])
In [10]:
np.dot(a,b)
Out[10]:
40
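The dot product is the sum of the elementwise products, which we can verify directly (a quick check added to the notebook):
(a * b).sum()   # 1*2 + 2*3 + 3*4 + 4*5 = 2 + 6 + 12 + 20 = 40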

Manipulating Matrices¶
In [11]:
x = np.array([[1,2,3],[4,5,6]])
x
Out[11]:
array([[1, 2, 3],
       [4, 5, 6]])
In [12]:
y = np.array([[1,1,1],[2,2,2]])
x+y
Out[12]:
array([[2, 3, 4],
       [6, 7, 8]])
In [13]:
x*y
Out[13]:
array([[ 1,  2,  3],
       [ 8, 10, 12]])
In [14]:
x.T
Out[14]:
array([[1, 4],
       [2, 5],
       [3, 6]])
In [15]:
np.dot(x.T,y)
Out[15]:
array([[ 9,  9,  9],
       [12, 12, 12],
       [15, 15, 15]])
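A quick shape check may help here (an addition to the notebook): x.T is 3x2 and y is 2x3, so their matrix product is 3x3.
# Matrix multiplication requires the inner dimensions to match: (3, 2) x (2, 3) -> (3, 3).
x.T.shape, y.shape, np.dot(x.T, y).shape
# Expected: ((3, 2), (2, 3), (3, 3))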

Scikit-learn¶
In the following code, tfidf is an instance of the TfidfVectorizer class, which accepts a list of file names and ignores English stop words.
The fit method learns the vocabulary and idf weights, and the transform method computes the tf.idf values of the files. When we fit and transform the same files, as is the case below, we can combine the two steps with fit_transform.
The output of transform (and fit_transform) is a sparse matrix. Scikit-learn provides many functions that operate on sparse matrices, so often we will not need to convert sparse matrices to regular arrays. If we do need to convert a sparse matrix to an array, we can use the toarray method.
The shape attribute returns the dimensions of the array or sparse matrix. In our case, it shows that the variable tfidf_values has 3,672 rows (one row per file) and 19,891 columns (one column per distinct word).
In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
import zipfile
In [17]:
import os.path
if not os.path.exists('enron1'):
    with zipfile.ZipFile('enron1.zip') as myzip:
        myzip.extractall()
In [18]:
import glob
files = glob.glob('enron1/ham/*.txt')
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(input='filename', stop_words='english')
tfidf.fit(files)
tfidf_values = tfidf.transform(files)
len(tfidf.get_feature_names())
Out[18]:
19891
In [19]:
tfidf.get_feature_names()[10000:10005]
Out[19]:
['grandpa', 'grandsn', 'grandsons', 'grant', 'granted']
In [20]:
type(tfidf_values)
Out[20]:
scipy.sparse.csr.csr_matrix
In [21]:
type(tfidf_values.toarray())
Out[21]:
numpy.ndarray
In [22]:
tfidf_values.shape
Out[22]:
(3672, 19891)
In [23]:
len(files)
Out[23]:
3672
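Since we fit and transform the same files, the two steps can be merged; a quick check (an addition to the original notebook) that fit_transform produces the same matrix as the separate fit and transform calls above:
# fit_transform refits the vectorizer and transforms in one step; the result
# should match tfidf_values computed with separate fit and transform calls.
tfidf_values_combined = tfidf.fit_transform(files)
(tfidf_values != tfidf_values_combined).nnz   # expected to be 0: no differing entries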

Normalised tf.idf and cosine similarity¶
In [24]:
import numpy as np
tfidf_norm = TfidfVectorizer(input='filename',
                             stop_words='english',
                             norm='l2')
tfidf_norm_values = tfidf_norm.fit_transform(files).toarray()
In [25]:
def cosine_similarity(X, Y):
    # For L2-normalised vectors, the cosine similarity is simply the dot product.
    return np.dot(X, Y)

Note that the first character of l2 is lowercase L and not the number 1.
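For reference, the cosine similarity between two vectors $x$ and $y$ is $\frac{x \cdot y}{\|x\| \, \|y\|}$. When both vectors have been normalised with the L2 norm, $\|x\| = \|y\| = 1$, so the cosine similarity reduces to the dot product, which is why the function above simply returns np.dot(X, Y).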
In [26]:
cosine_similarity(tfidf_norm_values[0,:],
tfidf_norm_values[1,:])
Out[26]:
0.0024079977619124405
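If the vectors were not already normalised, we would need to divide by their norms; a minimal sketch of such a general version (an addition to the notebook, not part of the original code):
def cosine_similarity_general(X, Y):
    # Cosine similarity for arbitrary vectors: dot product divided by both norms.
    return np.dot(X, Y) / (np.linalg.norm(X) * np.linalg.norm(Y))

# On the already-normalised rows this matches the simple dot product above.
cosine_similarity_general(tfidf_norm_values[0,:], tfidf_norm_values[1,:])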

The following code shows an alternative way to compute the cosine similarity between two vectors. The sklearn package provides a cosine_similarity function in the sklearn.metrics.pairwise module that computes the pairwise cosine similarities between the vectors of two lists. The function returns a matrix of cosine similarities where element ($i$,$j$) is the cosine similarity between vector $i$ of the first list and vector $j$ of the second list. The call below fails because it passes rows of a sparse matrix wrapped in Python lists; converting the sparse matrix to a numpy array with toarray, as in the cell after the error, avoids the problem.
We can see that the cosine similarity between the tf.idf vectors is the same as the cosine similarity between the normalised tf.idf vectors.
In [27]:
from sklearn.metrics import pairwise
pairwise.cosine_similarity([tfidf_norm_values[0,:]],
[tfidf_norm_values[1,:]])
Out[27]:
array([[0.002408]])
In [28]:
pairwise.cosine_similarity([tfidf_values[0,:]],
[tfidf_values[1,:]])

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'csr_matrix'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
in
----> 1 pairwise.cosine_similarity([tfidf_values[0,:]],
      2                            [tfidf_values[1,:]])

~/anaconda3/envs/comp3220/lib/python3.8/site-packages/sklearn/metrics/pairwise.py in cosine_similarity(X, Y, dense_output)
   1177     # to avoid recursive import
   1178
-> 1179     X, Y = check_pairwise_arrays(X, Y)
   1180
   1181     X_normalized = normalize(X, copy=True)

~/anaconda3/envs/comp3220/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74

~/anaconda3/envs/comp3220/lib/python3.8/site-packages/sklearn/metrics/pairwise.py in check_pairwise_arrays(X, Y, precomputed, dtype, accept_sparse, force_all_finite, copy)
    143                                     estimator=estimator)
    144     else:
--> 145         X = check_array(X, accept_sparse=accept_sparse, dtype=dtype,
    146                         copy=copy, force_all_finite=force_all_finite,
    147                         estimator=estimator)

~/anaconda3/envs/comp3220/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74

~/anaconda3/envs/comp3220/lib/python3.8/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    596                     array = array.astype(dtype, casting="unsafe", copy=False)
    597                 else:
--> 598                     array = np.asarray(array, order=order, dtype=dtype)
    599             except ComplexWarning:
    600                 raise ValueError("Complex data not supported\n"

~/anaconda3/envs/comp3220/lib/python3.8/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     81
     82     """
---> 83     return array(a, dtype, copy=False, order=order)
     84
     85

ValueError: setting an array element with a sequence.
In [29]:
dense_tfidf_values = tfidf_values.toarray()
pairwise.cosine_similarity([dense_tfidf_values[0,:]],
[dense_tfidf_values[1,:]])
Out[29]:
array([[0.002408]])
In [30]:
pairwise.cosine_similarity([tfidf_norm_values[0,:]],
[tfidf_norm_values[1,:]])
Out[30]:
array([[0.002408]])
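All three results agree because TfidfVectorizer uses the L2 norm by default, so tfidf_values is already normalised even though we did not pass norm='l2' when creating it. A quick check (an addition to the notebook):
tfidf.norm, tfidf_norm.norm   # the default norm and the explicit norm are both 'l2'
# Expected: ('l2', 'l2')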
In [ ]: