wk12_lda_word2vec
Week 12 – LDA topics and word embeddings
© Professor Yuefeng Li

LDA topics in the 20 newsgroups collection
The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups
This data set consists of 20,000 messages: one thousand Usenet articles taken from each of the following 20 newsgroups (a quick sanity check follows the list):
alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc
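As a quick sanity check, the 20 category names can be listed directly from scikit-learn's copy of the collection (a minimal sketch; fetch_20newsgroups is the same loader used below):
from sklearn.datasets import fetch_20newsgroups
# load (or download and cache) the training subset and list its 20 category names
check = fetch_20newsgroups(subset='train')
print(len(check.target_names))  # 20
print(check.target_names)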
pip install gensim # for word2vec
Collecting gensim
Downloading gensim-4.2.0-cp38-cp38-macosx_10_9_x86_64.whl (24.0 MB)
|████████████████████████████████| 24.0 MB 655 kB/s eta 0:00:011
Collecting smart-open>=1.8.1
Downloading smart_open-6.0.0-py3-none-any.whl (58 kB)
|████████████████████████████████| 58 kB 5.8 MB/s eta 0:00:01
Requirement already satisfied: numpy>=1.17.0 in /Users/li3/opt/anaconda3/lib/python3.8/site-packages (from gensim) (1.20.1)
Requirement already satisfied: scipy>=0.18.1 in /Users/li3/opt/anaconda3/lib/python3.8/site-packages (from gensim) (1.6.2)
Installing collected packages: smart-open, gensim
Successfully installed gensim-4.2.0 smart-open-6.0.0
Note: you may need to restart the kernel to use updated packages.
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print('Topic_' + str(topic_idx + 1))
        # argsort gives indices in ascending order of weight; the reversed
        # slice picks out the no_top_words highest-weighted terms
        print('<' + ", ".join([feature_names[i]
                               for i in topic.argsort()[:-no_top_words - 1:-1]]) + '>')
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
docs = dataset.data
# remove digits; keeping them would change the generated topics
documents = [d.translate(str.maketrans('', '', string.digits)) for d in docs]
# write the first 20 messages into a .dat file to inspect the input messages
wFile = open('20NewsGroups.dat', 'w')
for i in range(20):
    wFile.write(documents[i] + '\n')  # one message per line
wFile.close()
# LDA works on raw term counts (not TF-IDF weights) because it is a probabilistic model of word counts
no_features = 1000
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
tf = tf_vectorizer.fit_transform(documents)
tf_feature_names = tf_vectorizer.get_feature_names_out()  # use get_feature_names() on scikit-learn < 1.0
no_topics = 10
lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online',
                                learning_offset=50., random_state=0).fit(tf)
# display topics as a set of top words
no_top_words = 8
display_topics(lda, tf_feature_names, no_top_words)
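The fitted model can also infer each document's topic mixture, not just the top words per topic; a minimal sketch using lda.transform on the tf matrix built above:
# topic distribution per document; each row sums to 1
doc_topics = lda.transform(tf)
print(doc_topics.shape)        # (number of documents, no_topics)
print(doc_topics[0].round(3))  # topic mixture of the first document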
Python Word2Vec
To generate word vectors in Python, the nltk and gensim modules are needed. Run these commands in a terminal to install them:
#pip install nltk
#pip install gensim
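sent_tokenize and word_tokenize also rely on NLTK's punkt tokenizer models, which ship separately from the package itself; if they are not already present, a one-off download is needed (a minimal sketch using NLTK's standard downloader):
import nltk
nltk.download('punkt')  # tokenizer data used by sent_tokenize and word_tokenize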
# importing all necessary modules
from nltk.tokenize import sent_tokenize, word_tokenize
import warnings
warnings.filterwarnings(action='ignore')
import gensim
from gensim.models import Word2Vec
# Read the 'alice.txt' file
# sample = open("alice.txt", encoding="utf8")
sample = open("alice.txt")
s = sample.read()
# Replace newline characters with spaces
f = s.replace("\n", " ")
data = []
# iterate through each sentence in the file
for i in sent_tokenize(f):
    temp = []
    # tokenize the sentence into words
    for j in word_tokenize(i):
        temp.append(j.lower())
    data.append(temp)
# Create CBOW model
model1 = gensim.models.Word2Vec(data, min_count=1, vector_size=10, window=5)
# Print results
print("vector of alice = ")
print(model1.wv['alice'])
print("vector of wonderland = ")
print(model1.wv['wonderland'])
print("vector of machines = ")
print(model1.wv['machines'])
print("Cosine similarity between 'alice' and 'wonderland' - CBOW : ", model1.wv.similarity('alice', 'wonderland'))
print("Cosine similarity between 'alice' and 'machines' - CBOW : ", model1.wv.similarity('alice', 'machines'))
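wv.similarity is plain cosine similarity between the two word vectors; as a sketch, the same value can be reproduced directly with numpy:
import numpy as np

def cosine(u, v):
    # cosine similarity: dot product divided by the product of the vector norms
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(model1.wv['alice'], model1.wv['wonderland']))  # matches wv.similarity above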
# you may use a different model, e.g., the Skip-gram model
# model2 = gensim.models.Word2Vec(data, min_count=1, vector_size=50, window=6, sg=1)
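A minimal sketch of the Skip-gram variant (sg=1 switches gensim's Word2Vec from the default CBOW to Skip-gram; the vector_size and window values here are just illustrative):
model2 = gensim.models.Word2Vec(data, min_count=1, vector_size=50, window=6, sg=1)
print("Skip-gram similarity of 'alice' and 'wonderland':",
      model2.wv.similarity('alice', 'wonderland'))
# words closest to 'alice' in the Skip-gram embedding space
print(model2.wv.most_similar('alice', topn=5))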
vector of alice =
[-0.6232302 -0.7186284 1.2500157 0.59926826 -0.03486401 0.7099776
2.3744094 2.128018 -1.971011 -0.69042027]
vector of wonderland =
[-0.01726623 -0.10275307 0.19141996 0.12364732 0.04587556 0.24207804
0.22506802 0.32132903 -0.37632695 -0.01187956]
vector of machines =
[-0.03659099 -0.0750192 0.12164483 -0.03393448 0.08365427 0.00243759
0.09836097 0.09806564 -0.13419634 0.07361779]
Cosine similarity between 'alice' and 'wonderland' - CBOW :  0.92701817
Cosine similarity between 'alice' and 'machines' - CBOW :  0.75782454