TEXT MINING Applied Analytics: Frameworks and Methods 2
Outline
■ Examine the potential of analyzing unstructured data
■ Discuss applications of text analysis
■ Examine process of sentiment analysis
■ Use text as features in a predictive model
■ Review various methods used for text analysis
Business Decisions
■ Despite the overwhelming amount of unstructured data available, most decisions are based on structured data.
What is Unstructured Data?
■ Data that does not fit into rows and columns in a spreadsheet, e.g.,
– Text
– Pictures
– Audio
– Video
– …
■ We will focus on Text
Text is everywhere
■ Books and articles
■ Web pages
■ News stories
■ Call center notes
■ Doctor’s notes
■ Emails
■ Reviews
■ Tweets
■ Social media
And much of it is generated by consumers
Text Analysis
■ In traditional data analysis projects
– Data wrangling takes about 80% of the time
– Analysis and interpretation take about 20% of the time
■ For data analysis projects involving unstructured data
– Data wrangling takes about 90% of the time
– Analysis and interpretation take about 10% of the time
Text Mining
■ … process of discovering and extracting meaningful patterns and relationships from text.
Text Mining = Data Mining + Natural Language Processing
Applications of Text Analysis
■ Document search (e.g., Google)
■ Translation
■ Feature Extraction – identifying specific words in text
■ Theme extraction – Cluster and Topic Models
■ Sentiment analysis
■ Classification
■ Summarization
■ Understanding (the semantic web)
■ Conversation/Chat – Turing Test
Value of Text
■ Identify keywords or groups of words
■ Identify underlying themes
■ Enhance predictive models by using text as an input
■ Text-specific applications
– email classification,
– prioritization and routing,
– spam filtering,
– contextual advertising,
– news story recommendation
■ Information retrieval tasks
– text search engines,
– opinion mining,
– summarizing documents
But, text analysis is hard
■ There is a lot of text data
■ But text written by people
– is loosely structured
– includes typographical errors, abbreviations, emoticons
– is multilingual
– includes sarcasm, double meanings, puns, and mixed languages (e.g., Hinglish, Chinglish, and Spanglish)
■ Turing Test
– Which one is human?
Source: Go-Globe
PROCESS: SENTIMENT ANALYSIS
Steps in Process
■ Get text
■ Explore text
■ Prepare text
■ Tokenize
■ Categorize tokens using a lexicon
■ Summarize results
Get Text
■ Manually Copy
■ Pull from databases
■ API
■ Scrape
Explore Text
■ Explore general characteristics of text
– Number of letters, words, sentences, URLs, emojis
■ Identify patterns
– Define patterns using regular expressions
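The exploration step above can be sketched with a few regular expressions. This is a minimal illustration in Python (an assumption, since the slides refer to R's tidytext); the sample text, the patterns, and the `explore` helper are all hypothetical and deliberately rough.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "Loved the new phone! Battery lasts 2 days 😀. See https://example.com/review for details."

# Simple, deliberately rough patterns; real text needs more careful rules.
URL_PATTERN = re.compile(r"https?://\S+")
WORD_PATTERN = re.compile(r"[A-Za-z']+")
SENTENCE_END_PATTERN = re.compile(r"[.!?]+(?:\s|$)")
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def explore(text):
    """Count letters, words, sentences, URLs, and emojis in a string."""
    urls = URL_PATTERN.findall(text)
    # Remove URLs first so their pieces are not counted as words or sentences.
    without_urls = URL_PATTERN.sub(" ", text)
    return {
        "letters": sum(ch.isalpha() for ch in without_urls),
        "words": len(WORD_PATTERN.findall(without_urls)),
        "sentences": len(SENTENCE_END_PATTERN.findall(without_urls)),
        "urls": len(urls),
        "emojis": len(EMOJI_PATTERN.findall(text)),
    }

summary = explore(text)
print(summary)
```

Counting sentences by terminal punctuation is only an approximation; abbreviations such as "e.g." would be miscounted.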
Prepare Text
■ Bag of Words Approach
– create corpus
– lower case
– remove punctuation
– remove numbers
– remove brackets; replace numbers, contractions, abbreviations, and symbols
– remove stop words and other unwanted words
– strip white space
– stem document
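A minimal sketch of the bag-of-words preparation steps listed above, written in Python as an assumption (the slides point to R). The stop-word list is a tiny illustrative set, and `naive_stem` is a toy suffix stripper standing in for a real stemmer; neither reflects a specific library.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "it", "i"}

def naive_stem(word):
    # Toy suffix stripper standing in for a real stemmer such as Porter's.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def prepare(document):
    text = document.lower()                                           # lower case
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"\d+", " ", text)                                  # remove numbers
    words = text.split()                                              # split; drops extra white space
    words = [w for w in words if w not in STOP_WORDS]                 # remove stop words
    return [naive_stem(w) for w in words]                             # stem

# Hypothetical two-document corpus.
corpus = [
    "The battery lasted 2 days, and it charged quickly!",
    "I was disappointed: the screen cracked.",
]
prepared = [prepare(doc) for doc in corpus]
print(prepared)
```

Note that crude stemming produces non-words such as "charg"; that is expected, since the goal is to map related word forms to one token rather than to produce readable text.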
Tokenization
■ Process of breaking a stream of text, a character sequence or a defined document unit, into phrases, words or other meaningful elements called tokens
– One word token: Unigram
– Two word token: Bigram
– n word token: n-Gram
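Unigrams, bigrams, and n-grams differ only in the window size. A short Python sketch (the `ngrams` helper and the sample sentence are illustrative assumptions):

```python
def ngrams(words, n):
    """Return the list of n-word tokens (as strings) from a list of words."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "text mining is fun".split()
unigrams = ngrams(words, 1)
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

A document of w words yields w - n + 1 n-grams, so larger n gives fewer but more specific tokens.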
Tokenization
[Figure: the units of text, from Corpus to Document (typically the unit of analysis) to Token/Term]
Categorize Tokens
■ Tokens may be grouped using a lexicon.
■ A lexicon is a taxonomy of tokens. The choice of lexicon depends on the goal of the analysis. Here is a short list of lexicons with an open license:
– Binary (positive/negative) sentiment: Categorizes words as positive or negative. Lexicons include nrc, emoji sentiment, and bing.
– Emotion: Categorizes words based on the emotion conveyed. Lexicons include nrc.
– Sentiment score: Scores words based on sentiment. Lexicons include afinn, Jockers Polarity table, Jockers Sentiment, Senticnet, SentiWordNet, Slang phrases, and SOCal.
– Domain-specific lexicon: Used to code words with domain-specific jargon. E.g., loughran (scores words relevant to finance), accounting, entrepreneurship, market orientation, etc.
– Interpreting emojis and emoticons: Lists of emojis and emoticons with their associated emotions.
– Common text filters: Lists of words to be removed before processing. E.g., common names (based on the 1990 US census), profane words, stopword lists.
– Other: valence shifters, contraction conversions, clichés, POS (pronouns, prepositions, interjections)
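Categorizing tokens with a lexicon amounts to a lookup. The sketch below uses Python and two made-up mini-lexicons; the words and labels are hypothetical and are not excerpts from bing, nrc, or any other published lexicon.

```python
# Hypothetical mini-lexicons for illustration only; they are NOT excerpts
# from bing, nrc, or any other published lexicon.
BINARY_LEXICON = {"love": "positive", "great": "positive",
                  "slow": "negative", "broken": "negative"}
EMOTION_LEXICON = {"love": "joy", "great": "joy",
                   "slow": "anger", "broken": "sadness"}

def categorize(tokens, lexicon):
    """Pair each token found in the lexicon with its category."""
    return [(token, lexicon[token]) for token in tokens if token in lexicon]

tokens = ["i", "love", "the", "screen", "but", "shipping", "was", "slow"]
binary_tags = categorize(tokens, BINARY_LEXICON)
emotion_tags = categorize(tokens, EMOTION_LEXICON)

print(binary_tags)
print(emotion_tags)
```

Tokens missing from the lexicon are simply dropped, which is why the choice of lexicon matters: a word the lexicon does not cover contributes nothing to the result.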
Summarize
■ Descriptive
– Sentiment score
– Distribution of categories of emotions expressed
■ Visualize
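One possible way to summarize, sketched in Python: sum word scores per document for a sentiment score, count emotions across documents for the distribution, and print a crude text bar chart in place of a plot. The scores, emotions, and documents are hypothetical and not taken from afinn or nrc.

```python
from collections import Counter

# Hypothetical word scores and emotions for illustration; not from afinn or nrc.
SCORES = {"love": 3, "great": 3, "slow": -2, "broken": -3}
EMOTIONS = {"love": "joy", "great": "joy", "slow": "anger", "broken": "sadness"}

# Hypothetical tokenized documents.
documents = {
    "review_1": ["love", "the", "great", "screen"],
    "review_2": ["slow", "shipping", "and", "broken", "case"],
}

def sentiment_score(tokens):
    """Sum of the scores of the tokens that appear in the lexicon."""
    return sum(SCORES.get(token, 0) for token in tokens)

scores = {name: sentiment_score(tokens) for name, tokens in documents.items()}

# Distribution of emotions expressed across all documents.
emotion_counts = Counter(
    EMOTIONS[token]
    for tokens in documents.values()
    for token in tokens
    if token in EMOTIONS
)

print(scores)
# A crude text "bar chart" of the emotion distribution.
for emotion, count in emotion_counts.most_common():
    print(f"{emotion:<8} {'#' * count}")
```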
Process using library(tidytext)
Other Techniques for Text Summarization
■ Topic Modelling
– Useful for discovering underlying themes or topics
– Two common topic models: Latent Dirichlet Allocation and the Correlated Topic Model
■ Latent Semantic Analysis
– Used to understand the meaning of a collection of documents
– Relies on singular value decomposition
■ Machine learning driven text summarization
■ Text clustering and Document Clustering
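The reliance of Latent Semantic Analysis on singular value decomposition can be written compactly. The notation is an assumption, not from the slides: X is a term-document matrix with m terms and n documents, and k is the number of retained dimensions, chosen much smaller than min(m, n).

```latex
X \approx X_k = U_k \Sigma_k V_k^{\top},
\qquad
X \in \mathbb{R}^{m \times n},\;
U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k}
```

Under this notation, each row of V_k Σ_k places a document in a k-dimensional space, where documents that use related vocabulary tend to lie close together.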
PROCESS: PREDICTIVE ANALYSIS
Text Mining and Data Mining
[Figure: Training. Text Mining converts the Corpus into a matrix of numeric features, paired with Scores.]
Text Mining and Data Mining
[Figure: Trained Model. Text Mining converts New Documents into numeric features, from which the trained model produces Scores.]
Steps in Process
■ Get text
■ Prepare text
■ Tokenization
■ Dimensionality Reduction
■ Weighting
■ Predictive modelling with textual features
Get Text
■ Manually Copy
■ Pull from databases
■ API
■ Scrape
Prepare Text
■ Bag of Words Approach
– create corpus
– lower case
– remove punctuation
– remove numbers
– remove brackets; replace numbers, contractions, abbreviations, and symbols
– remove stop words and other unwanted words
– strip white space
– stem document
■ Semantic Parsing
– Identify patterns (e.g., zip code, address, sentences)
– Identify and group synonyms
– Lexical diversity
– Readability
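Two of the semantic parsing ideas above lend themselves to a short Python sketch: finding a pattern such as a zip code, and measuring lexical diversity. The ZIP pattern assumes US-style codes, and the type-token ratio is one common diversity measure chosen here for illustration; the slides do not specify either.

```python
import re

# Assumed pattern for US ZIP codes: five digits, optionally followed by -1234.
ZIP_PATTERN = re.compile(r"\b\d{5}(?:-\d{4})?\b")

def find_zip_codes(text):
    """Return all substrings that look like US ZIP codes."""
    return ZIP_PATTERN.findall(text)

def type_token_ratio(text):
    """Lexical diversity as unique words divided by total words."""
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0

# Hypothetical inputs.
note = "Ship to New York, NY 10027 and bill to Austin, TX 78701-1234."
zips = find_zip_codes(note)
ttr = type_token_ratio("the cat saw the other cat")

print(zips)
print(round(ttr, 3))
```

The type-token ratio falls as documents get longer, so it is best compared across texts of similar length.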
Tokenization
■ Process of breaking a stream of text, a character sequence or a defined document unit, into phrases, words or other meaningful elements called tokens
– One word token: Unigram
– Two word token: Bigram
– n word token: n-Gram
■ PoS Tagging
– Annotate each word with its part-of-speech tag. Basic tags include noun, verb, adjective, number, and proper noun
– Use a PoS tag dictionary or Hidden Markov Models
■ Chunking
– Divide text into syntactically related groups of words, such as noun groups and verb groups, or by their role in the sentence
■ Extracting co-occurrences
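Extracting co-occurrences can be sketched as counting word pairs that appear in the same document. This Python example counts at the document level, which is one of several possible definitions (a sliding window within a document is another); the helper name and documents are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrences(documents):
    """Count how many documents contain each unordered pair of distinct words."""
    counts = Counter()
    for tokens in documents:
        # Sort so each pair is always stored in the same (alphabetical) order.
        unique_sorted = sorted(set(tokens))
        counts.update(combinations(unique_sorted, 2))
    return counts

# Hypothetical tokenized documents.
documents = [
    ["battery", "life", "great"],
    ["battery", "life", "short"],
    ["screen", "great"],
]
pair_counts = cooccurrences(documents)

print(pair_counts.most_common(3))
```

Pairs are keyed alphabetically, so look up `pair_counts[("battery", "life")]` rather than `("life", "battery")`.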
Dimensionality Reduction
■ Tokenization of text will inevitably create a large number of dimensions relative to sample size. Too many dimensions lead to overfitting and in extreme cases (n