workshop_topicmodels
In [ ]:
%matplotlib inline
Data Science Workshop Week 10 – Topic Models¶
The goal of this workshop is to extract topic models out of a corpus of 20 newsgroups. The corpus contains over 11000 documents. We will test Non-negative Matrix Factorization and Latent Dirichlet Allocation on this corpus with different pre-processing steps.
In order to make the notebook work, you first have to install the textblob library. This can simply be done by:
change to the command line and change into the anaconda2/bin directory
type: "./pip install -U textblob"
type: "./python -m textblob.download_corpora"
Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation¶
The default parameters (n_samples / n_topics) should make
the example runnable in a couple of tens of seconds. You can try to
increase the dimensions of the problem, but be aware that the time
complexity is polynomial in NMF. In LDA, the time complexity is
proportional to (n_samples * iterations).
In [ ]:
from __future__ import print_function
from time import time
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from textblob import TextBlob
from sklearn.datasets import fetch_20newsgroups
n_samples = 2000
n_topics = 20
n_top_words = 20
In [ ]:
# Load the 20 newsgroups dataset and vectorize it. We use a few heuristics
# to filter out useless terms early on: the posts are stripped of headers,
# footers and quoted replies, and common English words, words occurring in
# only one document or in at least 95% of the documents are removed.
print("Loading dataset...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
data_samples = dataset.data[:n_samples]
print("done in %0.3fs." % (time() - t0))
# Look at the first entry
data_samples[1]
# How many documents do we have?
len(data_samples)
Exercise 1: Data Preprocessing¶
Our first step is data pre-processing. Write the function split_into_tokens that splits a document into its individual words (also see workshop week 8). Subsequently, use the class CountVectorizer to transform the documents into the bag-of-words representation. How big is your resulting dictionary?
In [ ]:
def split_into_tokens(message):
    # put your code here
# What does the first element look like?
split_into_tokens(data_samples[0])
In [ ]:
bow_transformer = # put your code here
# fit the bow transformer to the data
data_bow = # get the transformed data
print(len(bow_transformer.vocabulary_))
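If you want to check your approach, here is one possible sketch, assuming a TextBlob-based tokenizer (workshop week 8 may have used a slightly different one) that is passed to CountVectorizer as a custom analyzer; your own solution may look different.
In [ ]:
# A possible sketch (not the only solution): tokenize with TextBlob and
# plug the tokenizer into CountVectorizer via the analyzer argument.
def split_into_tokens(message):
    # TextBlob splits the text into a list of word tokens
    return list(TextBlob(message).words)

bow_transformer = CountVectorizer(analyzer=split_into_tokens)
bow_transformer.fit(data_samples)                    # fit the bow transformer to the data
data_bow = bow_transformer.transform(data_samples)   # get the transformed data
print(len(bow_transformer.vocabulary_))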
Your dictionary is probably pretty huge! We can restrict the number of words by using the min_df and max_df parameters of the CountVectorizer class. min_df means we only keep words that occur in at least the given fraction of documents. max_df means we drop words that occur in more than a fraction max_df of the documents, as these are uninformative. Use min_df = 0.001 and max_df = 0.3 and retrain the CountVectorizer. Is your vocabulary smaller now?
In [ ]:
bow_transformer = # put your code here
# fit the bow transformer to the data
data_bow = # get the transformed data
print(len(bow_transformer.vocabulary_))
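A minimal sketch of the restricted vectorizer, assuming the same split_into_tokens analyzer as above; only the min_df and max_df arguments are new.
In [ ]:
# Keep only words that appear in at least 0.1% and at most 30% of the documents.
bow_transformer = CountVectorizer(analyzer=split_into_tokens,
                                  min_df=0.001, max_df=0.3)
data_bow = bow_transformer.fit_transform(data_samples)
print(len(bow_transformer.vocabulary_))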
Training a non-negative matrix factorization (NMF) model¶
The NMF can be trained by
In [ ]:
nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(data_bow)
nmf
Now we can look at the topics. We first define a function that prints the top 20 words of each topic.
In [ ]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
And then use it for our topic model…
In [ ]:
print("\nTopics in NMF model:")
feature_names = bow_transformer.get_feature_names()
print_top_words(nmf, feature_names, n_top_words)
As we can see, the quality of our topics is quite poor. We have to do some more preprocessing.
Exercise 2: Stemming and lower-case words¶
Repeat the previous experiment for training an NMF, but this time use stemming as a preprocessing step. Also, convert all words to lower case. Again, see WS 8 for some code samples. Did your vocabulary size decrease? Print the first element of data_samples in the bag-of-words representation. Can you interpret this vector?
In [ ]:
def split_into_lemmas(message):
    # convert to lower case...
    # Do the stemming...
In [ ]:
# create the bow_transformer again and transform the data
In [ ]:
# train NMF model and print the topics
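One possible sketch for this exercise is shown below. It lower-cases the text and uses TextBlob's lemmatizer as a stand-in for stemming (an nltk stemmer would work similarly); the NMF call mirrors the earlier cell. Note that on newer scikit-learn versions you may need alpha_W/alpha_H and get_feature_names_out() instead of alpha and get_feature_names().
In [ ]:
# A possible sketch: lower-case, tokenize and lemmatize with TextBlob,
# then rebuild the bag-of-words data and retrain the NMF model.
def split_into_lemmas(message):
    message = message.lower()              # convert to lower case
    words = TextBlob(message).words        # tokenize
    return [word.lemma for word in words]  # reduce each word to its base form

bow_transformer = CountVectorizer(analyzer=split_into_lemmas,
                                  min_df=0.001, max_df=0.3)
data_bow = bow_transformer.fit_transform(data_samples)
print(len(bow_transformer.vocabulary_))
print(data_bow[0])   # first document in the bag-of-words representation

nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(data_bow)
print_top_words(nmf, bow_transformer.get_feature_names(), n_top_words)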
Exercise 3: Pruning numbers and short words¶
We still get low-quality topics, containing numbers or single letters. We want to prune this vocabulary. In your
preprocessing step, remove all words that contain digits or that have fewer than 3 letters from the vocabulary.
Hint:
you can check whether a string consists only of letters with .isalpha(). It returns True if the string contains no digits (or any other non-letter characters).
the following pattern might help you (Also see https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions):
words = [word for word in words if *put your code here *]
How large is your vocabulary? Again train the topic model and print the topics.
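If you get stuck, the extended preprocessing function could look like the following sketch (reusing the TextBlob lemmatizer from Exercise 2); try it yourself first in the cells below.
In [ ]:
def split_into_lemmas(message):
    words = TextBlob(message.lower()).words    # convert to lower case and tokenize
    words = [word.lemma for word in words]     # base form of each word
    # keep only purely alphabetic words with at least 3 letters
    words = [word for word in words if word.isalpha() and len(word) >= 3]
    return words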
In [ ]:
def split_into_lemmas(message):
    # convert to lower case...
    # Do the stemming...
    # remove words that contain digits
    # remove words with len < 3
    return words
In [ ]:
# create new bow_transformer
In [ ]:
# create NMF model and print topics
Exercise 4: TFIDF representation¶
Repeat the training with the TFIDF representation instead of the bag-of-words representation (see WS8). Can you see a difference in the topics?
In [ ]:
# create tf_idf transformer
In [ ]:
# create NMF model using tf_idf data and print topics
Exercise 5: Removing stop words¶
Our dictionary still contains many uninformative words. Download the stopwords corpus from nltk and use the English stop words. In your data-preprocessing step, delete all words that are contained in the stop word list.
Hints:
download the corpus:
In [ ]:
import nltk
nltk.download('stopwords')
get the stop words:
In [ ]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
Again use the function split_into_lemmas for removing the stop words. Do not use the built-in stop-word option of the TFIDF vectorizer, as it does not work with a custom analyzer.
You can check whether a word is contained in the set of stop words with
In [ ]:
'have' in stop_words
Repeat the experiment and train the topic models. Do you get better topics?
In [ ]:
def split_into_lemmas(message):
    # convert to lower case
    # Do the stemming...
    # remove words that contain digits
    # remove words with len < 3
    # remove stop words
    return words
In [ ]:
# train new tf_idf transformer
In [ ]:
# train new NMF model and print the topics
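Finally, here is one possible end-to-end sketch for the last two exercises. It assumes the nltk stop word list, a TfidfVectorizer with the same custom analyzer, and the illustrative names tfidf_transformer / data_tfidf; your own solution may differ in the details.
In [ ]:
# A possible sketch for Exercises 4 and 5: final preprocessing function,
# TF-IDF representation and a new NMF topic model.
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def split_into_lemmas(message):
    words = TextBlob(message.lower()).words       # convert to lower case and tokenize
    words = [word.lemma for word in words]        # reduce each word to its base form
    words = [word for word in words               # remove digits and short words
             if word.isalpha() and len(word) >= 3]
    words = [word for word in words if word not in stop_words]  # remove stop words
    return words

tfidf_transformer = TfidfVectorizer(analyzer=split_into_lemmas,
                                    min_df=0.001, max_df=0.3)
data_tfidf = tfidf_transformer.fit_transform(data_samples)
print(len(tfidf_transformer.vocabulary_))

nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(data_tfidf)
print("\nTopics in NMF model (TF-IDF):")
print_top_words(nmf, tfidf_transformer.get_feature_names(), n_top_words)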