Textual Analysis in Finance and Accounting Module 2
Agenda
• Use of Python in Text Analytics
• Python Intro
• Data Preparation for Text Analytics
• Data Collection
• Tokenization
• Stopwords
• Lexicon Normalization
• POS Tagging
• Web Scraping
• Extracting data from 10-K report
• Feature Extraction for Text Analytics
• Document Model Representation
• Bag of Words
• Text Classification approach
Using Python for Text Analytics
10-minute lab: https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
Why use Python for Text Analytics
• Short, concise text processing
• SciPy, NumPy, scikit-learn
• NLTK, spaCy, Gensim, etc.
• Text processing functions
• Named Entity Recognition
• Sentiment analysis
• Topic Modelling
• Text clustering
Basic Data Types and Syntax
• Variables
• Put quotes around strings
• Can use both " and '
• Lists
• Example: mylist = [1, 1, 2, 3, 5, …]
• First element of the list: mylist[0]
• Last element of the list: mylist[-1]
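A quick illustrative snippet of the points above (the variable names are just examples):

ticker = "AAPL"          # strings can use double or single quotes
mylist = [1, 1, 2, 3, 5]
print(mylist[0])         # first element -> 1
print(mylist[-1])        # last element -> 5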
Pandas Data Frame
• Use pandas to create a DataFrame, which is like a matrix but can store more than just numbers
• To create a data frame:
• import pandas as pd
• pd.DataFrame()
Fill the data frame with zeros using np.zeros, which creates an n-by-m matrix of zeros

import pandas as pd
import numpy as np
df = pd.DataFrame(np.zeros(shape=(3,3)))
print(df)
Loading Data
import pandas as pd
data = pd.read_csv("music.csv")
data.groupby("Artist").mean()["Listeners"].idxmax()
Python Dictionaries
• Dictionaries
• Store collection of data
• Have elements (keys) and attributes (values)
• For example, this dictionary has 3 elements:
• Fish, dog and cat
• To extract an attribute of an element, use:
dictionary[element]
• Example: thisdict['brand'] will return "Ford"

thisdict = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}
print(thisdict)

{'brand': 'Ford', 'model': 'Mustang', 'year': 1964}
Dictionaries are used to store data values in key:value pairs.
Sets are used to store multiple items in a single variable
e.g. words that differentiate sentences about men and about women
MALE = 'male'
FEMALE = 'female'
UNKNOWN = 'unknown'
BOTH = 'both'

MALE_WORDS = set(['guy','spokesman','chairman',"men's",'men','him',"he's",'his','boy','boyfriend','boyfriends','boys','brother','brothers','dad','dads','dude','father','fathers','fiance','gentleman','gentlemen','god','grandfather','grandpa','grandson','groom','he','himself','husband','husbands','king','male','man','mr','nephew','nephews','priest','prince','son','sons','uncle','uncles','waiter','widower','widowers'])

FEMALE_WORDS = set(['heroine','spokeswoman','chairwoman',"women's",'actress','women',"she's",'her','aunt','aunts','bride','daughter','daughters','female','fiancee','girl','girlfriend','girlfriends','girls','goddess','granddaughter','grandma','grandmother','herself','ladies','lady','mom','moms','mother','mothers','mrs','ms','niece','nieces','priestess','princess','queens','she','sister','sisters','waitress','widow','widows','wife','wives','woman'])
Building sets of words that differentiate sentences about men and about women
def genderize(words):
    mwlen = len(MALE_WORDS.intersection(words))
    fwlen = len(FEMALE_WORDS.intersection(words))
    if mwlen > 0 and fwlen == 0:
        return MALE
    elif mwlen == 0 and fwlen > 0:
        return FEMALE
    elif mwlen > 0 and fwlen > 0:
        return BOTH
    else:
        return UNKNOWN

text = ['men', 'his', 'gentlemen', 'her']
genderize(text)
‘both’
Examine the number of words from a sentence that appear in the MALE_WORDS/FEMALE_WORDS sets and assign a label.
If a sentence has only MALE_WORDS, label it as a "MALE" sentence; if it has only FEMALE_WORDS, label it as a "FEMALE" sentence.
If a sentence has nonzero counts for both male and female words, label it as "BOTH"; otherwise label it as "UNKNOWN".
Processing files with Python
• Opening/reading text files
• Writing/saving text files
• Word Frequency Counts
• Regular Expressions
Opening a Text File and Reading a Few Characters
Simply use the open() and read() functions; the read() method returns the whole text.

f = open("demofile.txt", "r")
print(f.read())

# read 5 characters
print(f.read(5))

# return one line by using the readline() method:
print(f.readline())

f = open("demofile.txt", "r")
for line in f:
    print(line)
More about text processing – split()
• Counting the number of words in a string
• yourstring.split() breaks a string by spaces and puts the pieces into a list
• len(yourlist) returns the number of items in a list
• wordcount = len(yourstring.split())
• You can pass any delimiter you want between the parentheses in split(). This can be used to split by specific characters, words, paragraphs, etc.
• yourstring.split(yourdelimiter) (see the delimiter example after the output below)
sentence = "The capital of China is Beijing"
sentence.split()

['The', 'capital', 'of', 'China', 'is', 'Beijing']
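Splitting on a custom delimiter works the same way; a small illustrative example (the string here is made up):

line = "AAPL,Apple Inc.,Technology"
fields = line.split(",")     # split on commas instead of spaces
print(fields)                # ['AAPL', 'Apple Inc.', 'Technology']
print(len(fields))           # 3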
Open a text file with the codecs package
• You can open files with open(); alternatively, you can use codecs.open()
• E.g. when you want to process a text file in a Unicode encoding
• Arguments for codecs.open (see the sketch below):
• Document name
• "r" is for reading
• UTF-8, which is the default encoding for HTML
• Some characters cannot be represented in UTF-8; the replace option will put a flag character in place of the missing one
• Replacing '\n' removes line breaks
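A minimal sketch of these arguments, assuming a file named demofile.txt exists in the working directory:

import codecs

f = codecs.open("demofile.txt", "r", encoding="utf-8", errors="replace")
text = f.read().replace("\n", " ")   # errors="replace" flags undecodable characters; replacing '\n' removes line breaks
f.close()
print(text[:100])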
Writing Files
• To write to an existing file, you must add a parameter to the open() function:
• "a" – Append – will append to the end of the file
• "w" – Write – will overwrite any existing content
• A new file is created if it does not exist
f = open("demofile2.txt", "a")
f.write("Now the file has more content!")
f.close()

• You can also use codecs.open to save the file: codecs.open("demofile2.txt", "a", "utf-8")
Word Frequency Count
• Identify the most recurrent terms or concepts in a set of data
• finding out the most mentioned terms when analyzing text such as customer reviews, social media conversations or customer feedback
• "expensive", "overpriced" and "overrated" appearing frequently in your customer reviews may indicate you need to adjust your prices (or your target market!)
from nltk import FreqDist

text1 = ['Now', 'the', 'file', 'has', 'more', 'content']
fdist1 = FreqDist(text1)
fdist1.most_common(50)
fdist1.plot()
• For any word, we can check how many times it occurred in a particular document.
• Count: freq_dist['and'] (freq_dist.count('and') in older NLTK versions) returns the number of times 'and' occurred.
• Frequency: freq_dist.freq('and') returns the relative frequency of a given sample (see the example below).
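A short example of both lookups (the token list here is made up):

from nltk import FreqDist

tokens = ['the', 'cat', 'and', 'the', 'dog', 'and', 'the', 'bird']
fdist = FreqDist(tokens)
print(fdist['and'])       # count of 'and' -> 2
print(fdist.freq('and'))  # relative frequency -> 0.25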
Regular Expressions (RegEx)
• A Regular Expression (RegEx) is a sequence of characters that forms a search pattern.
• Useful to check if a string contains the specified search pattern.
• Python has a built-in "re" package to work with regular expressions
• >>> import re
RegEx Functions
• The re module offers a set of functions that allow us to search a string for a match:
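For illustration, a quick tour of the most commonly used functions (findall, search, split, sub):

import re

text = "A fat cat doesn't eat oat but a rat eats bats."
print(re.findall("at", text))       # all non-overlapping matches
print(re.search("cat", text))       # first match as a Match object (or None)
print(re.split(r"\s", text)[:5])    # split on whitespace
print(re.sub("cat", "dog", text))   # replace matches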
Search for an exact match
• Suppose we have a text string, and we want to know if the string has the exact word "cat" in it…
• String: "A fat cat doesn't eat oat but a rat eats bats."
• re.search(pattern, string[, flags])
• Find the first location where the pattern produces an exact match

>>> import re
>>> teststring = "A fat cat doesn't eat oat but a rat eats bats."
>>> match = re.search("cat", teststring)
>>> print(match)
<re.Match object; span=(6, 9), match='cat'>
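The returned Match object exposes the matched text and its position; continuing the example above:

>>> print(match.group())
cat
>>> print(match.span())
(6, 9)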
import re

# replacement_patterns is assumed to be defined earlier; a minimal illustrative subset might be:
replacement_patterns = [
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"n't", " not"),
]

class RegexpReplacer(object):
    def __init__(self, patterns=replacement_patterns):
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        for (pattern, repl) in self.patterns:
            s = re.sub(pattern, repl, s)
        return s

replacer = RegexpReplacer()
replacer.replace("can't is a contraction")

'cannot is a contraction'
More about Regular Expression
• Python’s official Regular Expression HOWTO:
• https://docs.python.org/2/howto/regex.html#regex-howto
• Google’s Python Regular Expression Tutorial:
• https://developers.google.com/edu/python/regular-expressions
• Test your regular expression:
• https://regex101.com/#python
Test your regular expression:
Pattern: item[^a-zA-Z\n]*1a\.(.*?)item[^a-zA-Z\n]*1b
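As an illustrative sketch (not the full extraction pipeline), this pattern can be applied to a filing's text with re.search; here filing_text is an assumed variable holding the plain text of a 10-K:

import re

pattern = r"item[^a-zA-Z\n]*1a\.(.*?)item[^a-zA-Z\n]*1b"
match = re.search(pattern, filing_text, re.IGNORECASE | re.DOTALL)  # filing_text: assumed 10-K text
if match:
    item_1a = match.group(1)   # text between "Item 1A." and "Item 1B"
    print(item_1a[:500])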
NLTK in Python
• A set of Python libraries for common natural language processing tasks
• Developed at the University of Pennsylvania
• Free and open source
• NLTK provides:
• Basic classes for representing data relevant to Natural Language Processing.
• Standard interfaces for performing NLP tasks such as tokenization, tagging and parsing
• Over 50 corpora and lexical resources
• Online book: http://www.nltk.org/book/
• Authors: Edward Loper, Ewan Klein and Steven Bird
NLTK package
• corpora: a package containing modules of example text
• tokenize: functions to separate text strings
• stem: package of functions to stem words of text
• tag: tagging each word with part-of-speech, sense, etc.
• wordnet: interface to the WordNet lexical resource
• chunk: identify short non-nested phrases in text
• parse: building trees over text – recursive descent, shift-reduce, probabilistic, etc.
• probability: for modelling frequency distributions and probabilistic systems
• cluster: clustering algorithms
• draw: visualize NLP structures and processes
• contrib: various pieces of software from outside contributors
• etree: for hierarchical structure over text
NLTK Basic Functions
• Corpus Package
• Word Tokenizer
• Sentence Tokenizer
• Stemming
• Lemmatization
• Part of Speech
• Stop Words
NLTK Corpora
• A corpus is a large body of text or linguistic data
• Useful in NLP research for application development and testing
• Over 50 corpora and lexical resources
• Use NLTK functions to analyze the text in the imported corpus (e.g. Brown Corpus)
Important Concept in NLP
• Document
• an entity containing a whole body of text data
• Corpus
• consists of a collection (sets) of documents
• Documents arranged within a corpus (might be categorized)
• A tokenized corpus
• refers to a corpus where each document is tokenized or broken down into tokens, which are usually words.
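A toy illustration of these terms (the two "documents" below are made up):

import nltk

corpus = ["Revenue increased in 2023.", "The company reported a loss."]   # a corpus of two documents
tokenized_corpus = [nltk.word_tokenize(doc) for doc in corpus]
print(tokenized_corpus)
# [['Revenue', 'increased', 'in', '2023', '.'], ['The', 'company', 'reported', 'a', 'loss', '.']]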
NLTK Corpora
NLTK provides a diverse set of corpora (over 50)
• Brown corpus
• WordNet
• Reuters corpus
• Gutenberg corpus
• These corpora can be accessed via a corpus reader object from nltk.corpus
• More details about each corpus can be found here: http://www.nltk.org/book/ch02.html
Accessing Text Corpora
>>> import nltk
>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> len(emma)
192427
Readability of the Book
from nltk.corpus import gutenberg
fileid = 'austen-emma.txt'
num_chars = len(gutenberg.raw(fileid))
num_words = len(gutenberg.words(fileid))
num_sents = len(gutenberg.sents(fileid))
num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
print(fileid, '\n' + 'chars per word', round(num_chars/num_words),
      '\n' + 'words per sent', round(num_words/num_sents),
      '\n' + 'words per vocab', round(num_words/num_vocab))

austen-emma.txt
chars per word 5
words per sent 25
words per vocab 26
Concordance
import nltk
from nltk.corpus import gutenberg
emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
emma.concordance("surprize", width=100, lines=2)
Brown Corpus
• Developed at Brown University in the 1960s
• The first major corpus of English for computer analysis
• Annotated with part-of-speech tags
• enabled sophisticated statistical analysis
• Consists of 1,000,000+ words of English prose text
• 15 text categories with a total of 500 samples
• brown.categories()
• news, editorial, reviews, religion, etc.
Brown Corpus
Study the writing styles in different categories
import nltk
from nltk.corpus import brown

news_text = brown.words(categories='news')
fdist = nltk.FreqDist(w.lower() for w in news_text)
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print(m + ':', fdist[m], end=' ')
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389
Annotated Corpora
• Brown Corpus comes with annotation
• POS tags, parse tree
• The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn…
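For example, the tagged version of the corpus pairs each word with its POS tag; a quick sketch (output shown is indicative):

from nltk.corpus import brown

print(brown.tagged_words()[:5])
# e.g. [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL')]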
Brown Corpus
Study the writing styles in different categories
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)
WordNet Corpus
• Lexical database for English and other languages
• Created by Princeton University
• WordNet links words into semantic relations
• E.g. synonyms
• Synonyms are grouped into synsets with short definitions and usage examples
• For example, find the synsets for the word "program":
• syns = wordnet.synsets("program")
• English WordNet has 155287 words and 117659 synonym sets
Wordnet Synsets
from nltk.corpus import wordnet
syn = wordnet.synsets('menu')[0]
print(syn.name())
print(syn.definition())

menu.n.01
a list of dishes available at a restaurant

syn = wordnet.synsets('menu')[2]
print(syn.name())
print(syn.definition())

menu.n.03
(computer science) a list of options available to a computer user
Find the definition and examples of a given word using WordNet Corpus
from nltk.corpus import wordnet
syns = wordnet.synsets("age")
print("Definition of the said word:")
print(syns[0].definition())
print("\nExamples of the word in use:")
print(syns[0].examples())

Definition of the said word:
how long something has existed

Examples of the word in use:
['it was replaced because of its age']
Prepare Text Data with NLTK
Basic Pre-processing for Text Data
• Tokenization
• Sentences
• Words
• Case conversion, stripping extra characters
• Removing stop words
• Removing punctuation
• Stemming
• Lemmatization
Tokenization
• Tokenization is the first step in text analytics
• Process of breaking down a text paragraph into words or sentences
• Start with splitting the text into sentences, then splitting each sentence into words, then saving each sentence to file, one per line.
• Sentence Tokenization: Sentence tokenizer
• sent_tokenize() breaks text paragraph into sentences.
• Word Tokenization: Word tokenizer
• word_tokenize() breaks text paragraph into words.
• Splits tokens based on white space and punctuation
• Contractions are split apart (What’s becomes What and ‘s)
Sentence Tokenizer in NLTK
import nltk
text = “I buy pizza ingredients. My dogs like bones. Pizza is still unregulated. Dog bones are heavily regulated.”
print (text)
# split into sentences
sentences = nltk.sent_tokenize(text)
print(sentences)

I buy pizza ingredients. My dogs like bones. Pizza is still unregulated. Dog bones are heavily regulated.
['I buy pizza ingredients.', 'My dogs like bones.', 'Pizza is still unregulated.', 'Dog bones are heavily regulated.']
Word Tokenizer in NLTK
import nltk
from nltk.tokenize import word_tokenize

# split into words
tokens = nltk.word_tokenize(text)
print(tokens[:20])

['I', 'buy', 'pizza', 'ingredients', '.', 'My', 'dogs', 'like', 'bones', '.', 'Pizza', 'is', 'still', 'unregulated', '.', 'Dog', 'bones', 'are', 'heavily', 'regulated']

# split into sentences then into words
sentences = nltk.sent_tokenize(text)
for sent_text in sentences:
    tokens = nltk.word_tokenize(sent_text)
    print(tokens)

['I', 'buy', 'pizza', 'ingredients', '.']
['My', 'dogs', 'like', 'bones', '.']
['Pizza', 'is', 'still', 'unregulated', '.']
['Dog', 'bones', 'are', 'heavily', 'regulated', '.']
Tokenizing sentences using regular expressions
from nltk.tokenize import word_tokenize
word_tokenize("Can't is a contraction.")
['Ca', "n't", 'is', 'a', 'contraction', '.']

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w']+")
tokenizer.tokenize("Can't is a contraction.")
["Can't", 'is', 'a', 'contraction']
Regular expressions can be used if you want complete control over how to tokenize text. As regular expressions can get complicated very quickly, only use them if the word tokenizers covered in the previous recipe are unacceptable.
N-gram Tokenization
• An n-gram is an n-token sequence of words:
• a 2-gram (bigram) is a two-word sequence of words like "please turn", "turn your", or "your homework"
• a 3-gram (trigram) is a three-word sequence of words like "please turn your", or "turn your homework"
• For example, the sentence "It was the best of times" is tokenized into bigrams:
• It was
• was the
• the best
• best of
• of times
N-gram tokenizer in NLTK
import nltk
from nltk import ngrams

n = 2
text = "I buy pizza ingredients. My dogs like bones." \
       " Pizza is still unregulated. Dog bones are heavily regulated."
tokens = nltk.word_tokenize(text)
print(tokens)

bigrams_token = list(ngrams(tokens, n))
print(bigrams_token)
[‘I’, ‘buy’, ‘pizza’, ‘ingredients’, ‘.’, ‘My’, ‘dogs’, ‘like’, ‘bones’, ‘.’, ‘Pizza’, ‘is’, ‘still’, ‘unregulated’, ‘.’, ‘Dog’, ‘bones’, ‘are’, ‘heavily’, ‘regulated’, ‘.’]
[(‘I’, ‘buy’), (‘buy’, ‘pizza’), (‘pizza’, ‘ingredients’), (‘ingredients’, ‘.’), (‘.’, ‘My’), (‘My’, ‘dogs’), (‘dogs’, ‘like’), (‘like’, ‘bones’), (‘bones’, ‘.’), (‘.’, ‘Pizza’), (‘Pizza’, ‘is’), (‘is’, ‘still’), (‘still’, ‘unregulated’), (‘unregulated’, ‘.’), (‘.’, ‘Dog’), (‘Dog’, ‘bones’), (‘bones’, ‘are’), (‘are’, ‘heavily’), (‘heavily’, ‘regulated’), (‘regulated’, ‘.’)]
Normalizing Case and Filtering Out Non-Alphabetic Tokens
• isalpha() is a built-in method used for string handling
• The isalpha() method returns True if all characters in the string are alphabetic; otherwise, it returns False
• Use the lower() function in Python to convert all words to lowercase
• Convert all words to one case
• Vocabulary will shrink in size but some distinctions are lost
• Apple the company vs apple the fruit
Filter out remaining tokens that are not alphabetic
text = nltk.word_tokenize(raw)
# remove all tokens that are not alphabetic
words = [w.lower() for w in text if w.isalpha()]
vocab = sorted(set(words))
print(words[:20])

['i', 'buy', 'pizza', 'ingredients', 'my', 'dogs', 'like', 'bones', 'pizza', 'is', 'still', 'unregulated', 'dog', 'bones', 'are', 'heavily', 'regulated']
N-gram tokenizer in NLTK
Filter out remaining tokens that are not alphabetic
import nltk
from nltk import ngrams

n = 2
text = "I buy pizza ingredients. My dogs like bones." \
       " Pizza is still unregulated. Dog bones are heavily regulated."
tokens = nltk.word_tokenize(text)
words = [w.lower() for w in tokens if w.isalpha()]
bigrams_token = list(ngrams(words, n))
print(len(bigrams_token))
print(bigrams_token)
[(‘i’, ‘buy’), (‘buy’, ‘pizza’), (‘pizza’, ‘ingredients’), (‘ingredients’, ‘my’), (‘my’, ‘dogs’), (‘dogs’, ‘like’), (‘like’, ‘bones’), (‘bones’, ‘pizza’), (‘pizza’, ‘is’), (‘is’, ‘still’), (‘still’, ‘unregulated’), (‘unregulated’, ‘dog’), (‘dog’, ‘bones’), (‘bones’, ‘are’), (‘are’, ‘heavily’), (‘heavily’, ‘regulated’)]
Case Conversion
• Lower/uppercase conversions
• Title case will capitalize the first letter of each word in the sentence.
text = 'The quick brown fox jumped over The Big Dog'
# lowercase
print(text.lower())
# uppercase
print(text.upper())
# title case
print(text.title())

the quick brown fox jumped over the big dog
THE QUICK BROWN FOX JUMPED OVER THE BIG DOG
The Quick Brown Fox Jumped Over The Big Dog
Stop words Removal
• Words that do not contribute to the deeper meaning of the phrase
• words such as: the, a, and is
• For some applications, like document classification, it may make sense to remove stop words
• To remove stop words from the text, create a list of stop words and filter them out of your list of tokens
• NLTK provides a list of stop words for a variety of languages, such as English.
Get a list of common stop words from NLTK package
import nltk
# stopword corpus from NLTK
from nltk.corpus import stopwords
print(stopwords.fileids())

# get the list of stopwords in the English language
stop = set(stopwords.words('english'))
print("List of stopwords in English:")
print(stop)

['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french', ……]
List of stopwords in English:
{'this', 'has', 'the', 'and', 'below', 'ain', 'o', "wasn't", 'my', 'his', 'at', 'through', 'each', 'between', 'down', 'should', 'me', "you'd", 'they', 'yourself', 'is', 'an', 'couldn', "weren't", 'over', 'you', 'no', 'does', 'him', 'there', 'too', 'what', 'have', …}
Remove stop words for a given text
import nltk
from nltk.tokenize import word_tokenize
text = "The quick brown fox jumps over the lazy dog"
tokens = word_tokenize(text.lower())
print(tokens)

['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
tokens = [w for w in tokens if w not in stop_words]
print(tokens)

['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
Omit some given stop words from the stopwords list
import nltk
from nltk.tokenize import word_tokenize
text = "The quick brown fox jumps over the lazy dog"
tokens = word_tokenize(text.lower())
print(tokens)

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english')) - set(['over'])
tokens = [w for w in tokens if w not in stop_words]
print(tokens)

['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
Before and after stop words removal:
Original text:
European shares began the month with a gain, as BNP Paribas rose on relief it had settled a U.S. sanctions case and mining companies rallied after encouraging economic data came out of China, the world’s top metals consumer. The pan-European FTSEurofirst 300 index closed up 0.9 percent at 1,382.31 points – notching its biggest one-day percentage gain since May 8. BNP Paribas rose 3.6 percent in trading volume of almost twice its 90-day daily average. It had lost about 20 percent – or $21 billion of its market value – since Feb. 13 when it announced the provision for the fine. The French bank pleaded guilty to two criminal charges and agreed to pay almost $9 billion to resolve allegations that in many financial dealings it violated U.S. sanctions against Sudan, Cuba and Iran. Analysts and investors said the stock could now recover ground lost over the last few months.
(879)
After stop word removal:
European shares began month gain, BNP Paribas rose relief settled U.S. sanctions case mining companies rallied encouraging economic data came China, world’s top metals consumer. The pan-European FTSEurofirst 300 index closed 0.9 percent 1,382.31 points – notching biggest one-day percentage gain since May 8. BNP Paribas rose 3.6 percent trading volume almost twice 90-day daily average. It lost 20 percent – $21 billion market value – since Feb. 13 announced provision fine. The French bank pleaded guilty two criminal charges agreed pay almost $9 billion resolve allegations many financial dealings violated U.S. sanctions Sudan, Cuba Iran. Analysts investors said stock could recover ground lost last months. (711)
Removing Punctuation Marks
Use word_tokenize() and a list comprehension to remove all punctuation marks using isalnum()

import nltk

sentence = "Think and wonder, wonder and think."
words = nltk.word_tokenize(sentence)
new_words = [word for word in words if word.isalnum()]
print(new_words)
[‘Think’, ‘and’, ‘wonder’, ‘wonder’, ‘and’, ‘think’]
Remove numbers and punctuation
European shares began month gain BNP Paribas rose relief settled US sanctions case mining companies rallied encouraging economic data came China world top metals consumer The pan European FTSEurofirst index closed percent points notching biggest one day percentage gain since May BNP Paribas rose percent trading volume almost twice day daily average It lost percent billion market value since Feb announced provision fine The French bank pleaded guilty two criminal charges agreed pay almost billion resolve allegations many financial dealings violated US sanctions Sudan Cuba Iran Analysts investors said stock could recover ground lost last months. (650)
Text Normalization
• Languages are made up of words which are often derived from one another.
• A language that contains words whose forms change as their use in speech changes is called an inflected language. The derived word is the inflected word.
• Inflection is the modification of a word to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, and mood.
• An inflection expresses one or more grammatical categories with a prefix, suffix or another internal modification such as a vowel change.
Text Normalization
• Text normalization is a way to remove noise in the text.
• For example, connection, connected and connecting can be reduced to the common word "connect".
• It reduces derivationally related forms of a word to a common root word.
• Stemming and Lemmatization
• The goal is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.
Normalization with Stemming
Word Stems and Inflections
“Stemming is the process of reducing inflection in words to their root forms such as mapping a group of words to the same stem even if the stem itself is not a valid word in the Language.”
The chopped-off pieces are referred to as affixes
Stemming
• Process of linguistic normalization, which reduces words to their root or base. For example, Fishing, fished, fisher all reduce to the stem “fish”
• the stem need not be a valid word
• traditional would get stemmed to tradit
• Applications like document classification and internet search engines benefit from stemming
• Store only the stems
• Reduce the vocabulary
Stemmer in NLTK
• PorterStemmer and LancasterStemmer are the two popular stemmers for English in the NLTK package; other stemmers, such as SnowballStemmer, are also available
• PorterStemmer
• developed in 1979.
• Most popular
• LancasterStemmer
• developed in 1990 and uses a more aggressive approach than the Porter stemming algorithm
• SnowballStemmer
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
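A quick side-by-side sketch of the three stemmers on the same words (outputs will differ because Lancaster is more aggressive):

from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

words = ['trouble', 'troubling', 'maximum', 'university']
for stemmer in (PorterStemmer(), LancasterStemmer(), SnowballStemmer('english')):
    print(type(stemmer).__name__, [stemmer.stem(w) for w in words])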
Stemmer Example
import pandas as pd
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
original_words = ['trouble', 'troubling', 'troubled', 'cats', 'cat', 'died', 'dying', 'die']
singles = [stemmer.stem(w) for w in original_words]
pd.DataFrame(data={'original word': original_words, 'stemmed': singles})
  original word  stemmed
0 trouble        troubl
1 troubling      troubl
2 troubled       troubl
3 cats           cat
4 cat            cat
5 died           die
6 dying          die
7 die            die
Lemmatization
Lemmatization, unlike stemming, reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization, the root word is called a lemma (as opposed to a root stem in stemming).
Lemmatization
• Reduces words to their base words, which are linguistically correct lemmas
• runs, running, ran are all forms of the word run; therefore run is the lemma of all these
• Returns an actual word of the language; it is used where it is necessary to get valid words
• Lemmatization is more sophisticated than stemming
• For example, the word "better" has "good" as its lemma
• NLTK provides the WordNet Lemmatizer, which uses the WordNet database to look up lemmas of words
• Other common lemmatizers include the spaCy lemmatizer and the Gensim lemmatizer
Lemmatize Example – WordNetLemmatizer
from nltk.stem import WordNetLemmatizer
import pandas as pd

lemma = WordNetLemmatizer()
original_words = ['trouble', 'troubling', 'troubled', 'cats', 'cat', 'is', 'am', 'are']
singles = [lemma.lemmatize(w) for w in original_words]
pd.DataFrame(data={'original word': original_words, 'Lemma': singles})
  original word  Lemma
0 trouble        trouble
1 troubling      troubling
2 troubled       troubled
3 cats           cat
4 cat            cat
5 is             is
6 am             am
7 are            are
Lemmatize Example
Wordnet Lemmatizer with appropriate POS tag
from nltk.stem import WordNetLemmatizer
import pandas as pd

lemma = WordNetLemmatizer()
original_words = ['trouble', 'troubling', 'troubled', 'cats', 'cat', 'is', 'am', 'are']
singles = [lemma.lemmatize(w, pos='v') for w in original_words]
pd.DataFrame(data={'original word': original_words, 'Lemma': singles})
  original word  Lemma
0 trouble        trouble
1 troubling      trouble
2 troubled       trouble
3 cats           cat
4 cat            cat
5 is             be
6 am             be
7 are            be
Also, the same word can sometimes have multiple different lemmas. So, based on the context in which it is used, you should identify the part-of-speech (POS) tag for the word in that specific context and extract the appropriate lemma. Examples of implementing this come in the following sections.
Wordnet Lemmatizer with appropriate POS tag
4 common POS tags found in WordNet
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(word):
    """Map POS tag to the first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer()

# Lemmatize a sentence with the appropriate POS tag
sentence = "The striped bats are hanging on their feet for best"
print([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(sentence)])
[‘The’, ‘strip’, ‘bat’, ‘be’, ‘hang’, ‘on’, ‘their’, ‘foot’, ‘for’, ‘best’]
The WordNet lemmatizer works well if the POS tags are also provided as inputs.
Combining Stemming and Lemmatization
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

stemmer = PorterStemmer()
lemma = WordNetLemmatizer()
print(stemmer.stem('believes'))      # believ
print(lemma.lemmatize('believes'))   # belief

>>> stemmer.stem('buses')
'buse'
>>> lemma.lemmatize('buses')
'bus'
>>> stemmer.stem('bus')
'bu'
Combining stemming and lemmatization => 60% compression
Original (cleaned) text
European shares began month gain BNP Paribas rose relief settled US sanctions case mining companies rallied encouraging economic data came China world top metals consumer The pan European FTSEurofirst index closed percent points notching biggest one day percentage gain since May BNP Paribas rose percent trading volume almost twice day daily average It lost percent billion market value since Feb announced provision fine The French bank pleaded guilty two criminal charges agreed pay almost billion resolve allegations many financial dealings violated US sanctions Sudan Cuba Iran Analysts investors said stock could recover ground lost last months
Porter stemmed
european share began month gain bnp pariba rose relief settl US sanction case mine compani ralli encourag econom data came china world top metal consum the pan european ftseurofirst index close percent point notch biggest one day percentag gain sinc may bnp pariba rose percent trade volum almost twice day daili averag It lost percent billion market valu sinc feb announc provis fine the french bank plead guilti two crimin charg agre pay almost billion resolv alleg mani financi deal violat US sanction sudan cuba iran analyst investor said stock could recov ground lost last month
Lemmatized (pos = ’v’)
European share begin month gain BNP Paribas rise relief settle US sanction case mine company rally encourage economic data come China world top metal consumer The pan European FTSEurofirst index close percent point notch biggest one day percentage gain since May BNP Paribas rise percent trade volume almost twice day daily average It lose percent billion market value since Feb announce provision fine The French bank plead guilty two criminal charge agree pay almost billion resolve allegations many financial deal violate US sanction Sudan Cuba Iran Analysts investors say stock could recover grind lose last months
Stemming or Lemmatization
• Both generate the root form of inflected words, but a "stem" might not be an actual word, whereas a "lemma" is an actual language word
• 'Caring' -> Lemmatization -> 'Care'
• 'Caring' -> Stemming -> 'Car'
• Stemming is faster
• Stemming follows an algorithm with steps to perform on the words which makes it faster
• Only removes the last few characters, often leading to incorrect meanings and spelling errors.
• Lemmatization considers the context
• Converts the word to its meaningful base form
• Uses WordNet corpus and also consider part-of-speech to produce lemma
Stemming/lemmatization a document
1. Take a document as the input.
2. Read the document line by line
3. Tokenize the line
4. Stem the words
5. Output the stemmed words (print on screen or write to a file)
6. Repeat steps 2 to 5 until the end of the document is reached.
Assignment (?)
• Given a section of the 10K document (e.g. Item 1A)
• Write a Python program to stem/lemmatize the document
• Follow this process:
1. Take a document as the input
2. Read the document line by line
3. Tokenize the line
4. Stem/Lemmatize the words
5. Output the stemmed words/lemma (write to a file)
6. Repeat steps 2 to 5 until the end of the document is reached.
• Refer to the code on the next page and the Home Exercise (Apr 18) Python Notebook
Sample Code
file = open("/kaggle/input/10K-Item1A.txt")
my_lines_list = file.readlines()

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

lemma = WordNetLemmatizer()

def lemmaSentence(sentence):
    token_words = word_tokenize(sentence)
    lemma_sentence = []
    for word in token_words:
        lemma_sentence.append(lemma.lemmatize(word, pos='v'))
        lemma_sentence.append(" ")
    return "".join(lemma_sentence)

lemma_file = open("10K-Item1A-Lemma.txt", mode="w", encoding="utf-8")
for line in my_lines_list:
    lemma_sentence = lemmaSentence(line)
    lemma_file.write(lemma_sentence)
lemma_file.close()
Discovering Word Collocation
from nltk.corpus import gutenberg
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

words = [w.lower() for w in gutenberg.words('austen-emma.txt')]
bcf = BigramCollocationFinder.from_words(words)
bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)

from nltk.corpus import stopwords
stopset = set(stopwords.words('english'))
filter_stops = lambda w: len(w) < 3 or w in stopset
bcf.apply_word_filter(filter_stops)
bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)

[("'", 's'), ('mr', '.'), ('."', '"'), ('mrs', '.')]
[('frank', 'churchill'), ('miss', 'woodhouse'), ('miss', 'bates'), ('jane', 'fairfax')]
Break
Try this yourself during the break
Processing Raw Text
import nltk
from nltk import word_tokenize
from nltk.corpus import gutenberg
emma = gutenberg.words('austen-emma.txt')
print(len(emma))
print (emma[:10])
192427
['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER']
Try this yourself during the break
Processing Raw Text
>>> import nltk
>>> from nltk import word_tokenize
>>> from nltk import sent_tokenize
>>> from nltk.corpus import gutenberg
>>> emma_raw = gutenberg.raw('austen-emma.txt')
>>> emma_tokens = word_tokenize(emma_raw)
>>> emma_sent = sent_tokenize(emma_raw)
>>> print (emma_raw[:10])
>>> print (len(emma_tokens))
>>> print (len(emma_sent))
>>> print (emma_tokens[:10])
>>> print (emma_sent[9:10])
[Emma by J
191785
7493
['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER']
["It was Miss\nTaylor's loss which first brought grief."]
Try this yourself during the break
Processing Raw Text
import nltk
from nltk import ngrams
n = 2
bigrams_token = list(ngrams(emma_tokens, n))
print(len(bigrams_token))
print(bigrams_token[:10])
191784
[('[', 'Emma'), ('Emma', 'by'), ('by', 'Jane'), ('Jane', 'Austen'), ('Austen', '1816'), ('1816', ']'), (']', 'VOLUME'), ('VOLUME', 'I'), ('I', 'CHAPTER'), ('CHAPTER', 'I')]
Statistics
import nltk
from nltk.corpus import gutenberg
from nltk.corpus import stopwords
corpus=gutenberg.raw(‘austen-emma.txt’)
sents = nltk.sent_tokenize(corpus)
print(“The number of sentences is”, len(sents))
words = nltk.word_tokenize(corpus)
print(“The number of tokens is”, len(words))
average_tokens = round(len(words)/len(sents))
print(“The average number of tokens per sentence is”,average_tokens)
unique_tokens = set(words)
print(“The number of unique tokens are”, len(unique_tokens))
stop_words = set(stopwords.words(‘english’))
final_tokens = []
for each in words:
    if each not in stop_words:
        final_tokens.append(each)
print(“The number of total tokens after removing stopwords are”, len((final_tokens)))
CLASS RESUME
Part of Speech Tagging
• Part-of-Speech Tagging refers to labelling words with their appropriate part of speech (noun, verb, adverb, …)
• The tag is a part-of-speech tag, and signifies whether the word is a noun, adjective, verb, adverb, pronoun… and so on.
• It converts a sentence into a list of tuples, where each tuple is of the form (word, tag).
• Part-of-speech tagging is a necessary step for many NLP applications (e.g. NER).
• By identifying the POS of a word, we can deduce its contextual meaning
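A minimal sketch with NLTK's pos_tag (the tag values in the comment are indicative):

import nltk

tokens = nltk.word_tokenize("The brown fox is quick and he is jumping over the lazy dog")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ'), ...]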
Language Syntax and Structure
• Words
• Phrase
• Clause
• Sentence
The brown fox is quick and he is jumping over the lazy dog