EM623-Week12
Text Mining
Carlo Lipizzi
clipizzi@stevens.edu
SSE
2016
Reasons for Text Mining
• 85-90 percent of all corporate
data is in some kind of
unstructured form (e.g., text)
• Unstructured corporate data is
doubling in size every 18
months
• Tapping into these information sources is not an option but a necessity to stay competitive
• Text mining is a semi-automated process of extracting knowledge from unstructured data sources, also known as text data mining or knowledge discovery in textual databases
“Search” versus “Discover”

                           Search (goal-oriented)    Discover (opportunistic)
Structured Data            Data Retrieval            Data Mining
Unstructured Data (Text)   Information Retrieval     Text Mining
Information Retrieval

Database Type:             Unstructured
Search Mode:               Goal-driven
Atomic entity:             Document
Example Information Need:  “Find a Japanese restaurant in New York”
Example Query:             “Japanese restaurant New York” or New York -> Restaurants -> Japanese
Examples in Corporations
• Customer complaint
letters
• Contracts
• Transcripts of phone
calls with customers
• Technical documents
• Email
• Insurance claims
• News articles
• Web pages
• Patent portfolios
Challenges in Text Mining
• Very high number of possible “dimensions”
• All possible word and phrase types in the language
• Unlike data mining:
• records (= docs) are not structurally identical
• records are not statistically independent
• Complex and subtle relationships between concepts in text
• “AT&T merges with Time-Warner”
• “Time-Warner is bought by AT&T”
• Ambiguity and context sensitivity
• automobile = car = vehicle = Toyota
• Apple (the company) or apple (the fruit)
Text Mining vs. Data Mining
• Both seek novel and useful patterns
• Both are semi-automated processes
• The difference is the nature of the data: structured versus unstructured
  – Structured data: in databases
  – Unstructured data: Word documents, PDF files, text excerpts, XML files, and so on
• Text mining: first impose structure on the data, then mine the structured data
Using Text Mining
• Benefits of text mining are obvious especially in
text-rich data environments
– e.g., law (court orders), academic research
(research articles), finance (quarterly reports),
medicine (discharge summaries), biology
(molecular interactions), technology (patent files),
marketing (customer comments), etc.
• Electronic communication records (e.g., email)
– Spam filtering
– Email prioritization and categorization
– Automatic response generation
Text Mining Application Areas
• Information extraction
• Topic tracking
• Summarization
• Categorization
• Clustering
• Concept linking
• Question answering
Text Mining Terminology
• Corpus
• Terms
• Concepts
• Stemming
• Stop words
• Synonyms
• Tokenizing
• Term dictionary
• Word frequency
• Part-of-speech tagging
• Morphology
• Term-by-document matrix (occurrence matrix)
• Singular value decomposition (latent semantic indexing)
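The last two terms go together: latent semantic indexing applies singular value decomposition to the term-by-document (occurrence) matrix to map terms and documents into a smaller “concept” space. A minimal sketch in Python with numpy; the matrix values below are invented purely for illustration:

    import numpy as np

    # Toy term-by-document matrix: rows = terms, columns = documents.
    # The frequency values are invented purely for illustration.
    A = np.array([[2, 0, 1, 0],
                  [1, 1, 0, 0],
                  [0, 2, 0, 1],
                  [0, 0, 1, 2]], dtype=float)

    # Singular value decomposition: A = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    # Latent semantic indexing keeps only the k largest singular values,
    # projecting documents into a k-dimensional "concept" space
    k = 2
    doc_concepts = np.diag(s[:k]) @ Vt[:k, :]  # one column per document
    print(doc_concepts)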
The Emergence of Text Mining
• Advances in text processing technology
– Natural Language Processing (NLP)
– Computational Linguistics
• Cheap Hardware
– CPU
– Disk
– Network
Two Mining Phases
• Text “preparation”. Beyond the typical data mining
cleaning, there are some semantic steps involved
here
• Information Extraction. There are several ways to
extract information from clean data, based on the
goal of the search
Text “preparation”
• Basic data cleaning
• “Tokenization”
• Stopwords removal
• Stemming/Lemmatization
• Statistical Analysis
• Additional Content Analysis
Tokenization
• The simplest way to represent a text is with a single
string, but it is difficult to process text in this format
• Often, it is more convenient to work with a list of
tokens/elements
• The task of converting a text from a single string to a
list of tokens is known as tokenization
• There are tools/functions within libraries providing this capability: they take a string and return a list of tokens
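For instance, a minimal tokenizer built on Python's standard library (real pipelines typically use a library such as NLTK or spaCy instead):

    import re

    def tokenize(text):
        """Convert a single string into a list of lowercase tokens."""
        return re.findall(r"[a-z0-9']+", text.lower())

    print(tokenize("Text mining is a semi-automated process."))
    # ['text', 'mining', 'is', 'a', 'semi', 'automated', 'process']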
Stopwords removal
• This step eliminates words/tokens with no semantic value from the token set
• Some words have no semantic value in any context, such as articles and pronouns; those belong in any “standard” stopwords removal process
• Other words have intrinsic semantic value but are irrelevant for the specific case; those may be added to the stopword list for that case, as in the sketch below
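A minimal sketch of both cases, standard and case-specific stopwords (the word lists are illustrative; production lists such as NLTK's are far longer):

    # Standard stopwords (articles, pronouns, ...); real lists are longer
    STOPWORDS = {"a", "an", "the", "is", "it", "of", "and", "to"}

    def remove_stopwords(tokens, case_specific=()):
        """Drop standard stopwords plus any case-specific additions."""
        drop = STOPWORDS | set(case_specific)
        return [t for t in tokens if t not in drop]

    tokens = ["the", "mining", "of", "text", "is", "semi", "automated"]
    print(remove_stopwords(tokens))
    # ['mining', 'text', 'semi', 'automated']
    print(remove_stopwords(tokens, case_specific=["text"]))
    # ['mining', 'semi', 'automated']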
Stemming/Lemmatization
• This step conflates words that carry the same semantic value
• Stemming is the process of reducing words to their stem, base or root form. Examples:
  – “fish” for “fishing”, “fished” and “fisher”
  – “argu” for “argue”, “argued”, “argues”, “arguing” and “argus”
• Lemmatization is the process of determining the lemma for a given word, where “lemma” is the canonical form, dictionary form or citation form of a set of words. For example, “run”, “runs”, “ran” and “running” share the canonical form “run”. Lemmatization may rely on an analysis of each word’s part of speech; parts of speech are linguistic categories, such as noun and verb
• A stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words that have different meanings depending on part of speech. Stemmers may also create “stems” with no dictionary value (like “argu” above). However, they are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications
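A small comparison of the two approaches using NLTK (an assumed tool choice; requires the library to be installed and the WordNet data downloaded):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    # Assumes NLTK is installed and the WordNet data has been fetched
    # once via nltk.download("wordnet")
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    words = ["fishing", "fished", "argues", "arguing", "running", "ran"]

    # Stemming: fast, context-free, may produce non-dictionary stems
    print([stemmer.stem(w) for w in words])

    # Lemmatization: returns dictionary forms, here treating words as verbs
    print([lemmatizer.lemmatize(w, pos="v") for w in words])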
Statistical Analysis
• Use statistics to add a numerical dimension to unstructured text (a small example follows the list):
• Term frequency
• Document frequency
• Term proximity
• Document length
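A small example computing the first two statistics, term frequency and document frequency, with the Python standard library:

    from collections import Counter

    docs = [["text", "mining", "finds", "patterns"],
            ["data", "mining", "finds", "patterns", "patterns"]]

    # Term frequency: how often a term occurs within one document
    tf = [Counter(doc) for doc in docs]
    print(tf[1]["patterns"])  # 2

    # Document frequency: in how many documents a term occurs at all
    df = Counter(term for doc in docs for term in set(doc))
    print(df["patterns"])     # 2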
Additional Content Analysis
• Semantic Processing
– Extracting meaning
– Named Entity Extraction (people names, company names, locations, etc.)
• Extra-semantic features
– Identify feelings or sentiment in text
• Goal = Dimension Reduction
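As an example of named entity extraction, a minimal sketch using spaCy (an assumed tool choice, not prescribed by the slides):

    import spacy

    # Assumes spaCy and its small English model are installed:
    #   pip install spacy && python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("AT&T merges with Time-Warner in New York.")
    for ent in doc.ents:
        # Company names are tagged ORG, locations GPE, person names PERSON
        print(ent.text, ent.label_)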
Information Extraction
• The process of extracting the information depends heavily on the goal of the project, such as:
• Information extraction
• Topic tracking
• Summarization
• Categorization
• Clustering
• Concept linking
• Question answering
Information Extraction: Text recognition/classification

The process runs in three tasks, with feedback between them:

Task 1 – Establish the Corpus: collect & organize the domain-specific unstructured data. The inputs include a variety of relevant unstructured (and semi-structured) data sources such as text, XML, HTML, etc.; the output of Task 1 is a collection of documents in some digitized format for computer processing.

Task 2 – Create the Term-Document Matrix: introduce structure to the corpus. The output of Task 2 is a flat file called the term-document matrix, where the cells are populated with the term frequencies.

Task 3 – Extract Knowledge: discover novel patterns from the T-D matrix. The output of Task 3 is a number of problem-specific classification, association, and clustering models and visualizations.
Step 1: Establish the corpus
• Collect all relevant unstructured data (e.g., textual
documents, XML files, emails, Web pages, short
notes, voice recordings…)
• Digitize, standardize the collection (e.g., all in ASCII
text files)
• Place the collection in a common place (e.g., in a
flat file, or in a directory as separate files)
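A minimal sketch of the “common place” idea: reading every text file in one directory into a Python list (the directory name is hypothetical):

    from pathlib import Path

    # Read every .txt file from one directory into a single list,
    # one string per document; "corpus" is a hypothetical directory name
    corpus_dir = Path("corpus")
    documents = [p.read_text(encoding="utf-8")
                 for p in sorted(corpus_dir.glob("*.txt"))]
    print(f"Loaded {len(documents)} documents")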
Step 2: Create the Term-by-Document Matrix

[Figure: a sample term-by-document matrix for Documents 1–6, with one column per term such as “investment risk”, “project management”, “software engineering”, “development” and “SAP”; the cells hold term frequencies (the 1s, 2s and 3s in the example).]
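A minimal sketch of building such a matrix with scikit-learn's CountVectorizer (an assumed tool choice; the documents are invented examples). CountVectorizer returns the transpose, a document-by-term matrix, which carries the same information:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["investment risk management",
            "software engineering project management",
            "SAP project development"]

    # fit_transform returns a document-by-term matrix; its transpose is
    # the term-by-document matrix described above, cells = frequencies
    vectorizer = CountVectorizer()
    dtm = vectorizer.fit_transform(docs)

    print(vectorizer.get_feature_names_out())  # the term dictionary
    print(dtm.T.toarray())                     # rows = terms, cols = docs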
Step 3: Extract patterns/knowledge
• Classification (text categorization)
• Clustering (natural groupings of text)
– Improve search recall
– Improve search precision
– Scatter/gather
– Query-specific clustering
• Association
• Trend Analysis (…)
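As an illustration of the clustering task, a minimal sketch using scikit-learn (an assumption; the course does not prescribe this library). TF-IDF vectors feed a k-means model that groups documents with similar vocabulary:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = ["stocks fell on inflation fears",
            "bonds rally as inflation cools",
            "new vaccine shows strong trial results",
            "clinical trial results were published"]

    # Vectorize, then group documents with similar vocabulary
    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)  # e.g., the two finance texts share one cluster id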
Wordij
• WORDij is a package containing a set of tools for text analysis
• It takes plain text files (UTF-8 .txt) as input
• It creates files to be analyzed by WORDij itself or by external tools
Using Wordij
Program      Role
WordLink     Counts words and word-pair strings; produces custom semantic networks by using one, two or three special input files
OptiComm     The optimal message creator: traces all shortest paths between a seed word and a target word, both of which must be connected indirectly in the network
VISij        Visualization of a network
QAPNet       QAP is an overall measure of the similarity of two whole networks using a correlation coefficient
Z-Utilities  Compare two text files and determine what significant differences there are for either the words, the word pairs, or the pairs from NodeTric .nets
Conversions  Three types of conversions are possible:
             1. Convert WordLink .wtg and .ptg files into MultiNet Node and Link .csv files
             2. Convert a WordLink .pr file into a Pajek .net file
             3. Convert a Pajek .net file into MultiNet Node and Link .csv files
Utilities    Two applications:
             1. A Proper Nouns extraction program which creates a list of Proper Nouns and a String Replace File (.str)
             2. A TimeSegs program which creates a new WordLink text file that has embedded time-stamp headers and a Select file (.sel)
Using Wordij
• Source File: text file in UTF-8 format
• Drop List File: file with a list of words that will be dropped
• Drop words / pairs appearing less often than: words / pairs appearing less often will not be included in the output files
• Window size for extracting word pairs: each word is paired with the words within the window size before and after it in the text (see the sketch below)
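A rough Python approximation of window-based pair extraction (this mimics the idea described above, not WORDij's exact algorithm):

    from collections import Counter

    def word_pairs(tokens, window=2):
        """Pair each word with the words up to `window` positions
        after it and count the pairs."""
        pairs = Counter()
        for i, w in enumerate(tokens):
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                pairs[(w, tokens[j])] += 1
        return pairs

    print(word_pairs(["text", "mining", "extracts", "knowledge"], window=2))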
Wordij files
• .net file. This contains the semantic network
• .pr is word pair listing in the form of From, To, and Frequency Count. The file contains three
columns: Word1, Word 2 and Frequency
• .ptg is like the “.pr” file except it contains IDs rather than words. The file contains three columns:
the ID for Word1, the ID for Word 2 and the Frequency
• .stp.csv is a file showing the number of pairs, number of unique pairs, average pair frequency,
and pair negative entropy. Then there are five columns listing the pair, frequency, proportion,
negentropy term, and mutual information (Negentropy measures the difference in entropy
between a given distribution and the Gaussian distribution with the same mean and variance)
• .stw.csv a file showing the number of words, number of unique words, average word frequency,
and word negative entropy. Then there are four columns listing the word, frequency, proportion,
and negentropy term
• .log, is a log file of the run settings
• .r.wrd, is an alphabetically listing of the words and a frequency count of their occurrence. The file
contains two columns: Word and Frequency
• .wtg is an is an alphabetically listing of the words, a unique ID number assigned to it and a
frequency count of their occurrence. The file contains three columns: Word, ID number and
Frequency
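Since the .net file is in Pajek format (the exercises below open it in Gephi) and the .pr file is plain three-column text, both can also be inspected programmatically. A hedged sketch with networkx; the filenames are hypothetical, and whitespace separation in .pr is an assumption about the exact layout:

    import networkx as nx

    # "output.net" / "output.pr" are hypothetical filenames
    G = nx.read_pajek("output.net")          # the semantic network
    print(G.number_of_nodes(), G.number_of_edges())

    # .pr: three columns Word1, Word2, Frequency; whitespace separation
    # between the columns is an assumption about the exact layout
    with open("output.pr") as f:
        for line in f:
            word1, word2, freq = line.split()
            print(word1, "->", word2, freq)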
Text Analysis – Exercise 1
• Open the file “2SpeedInternet.txt” with Wordij and create the standard output files
• Analyze the network using VISij
• Analyze the network using Gephi, with the .net file generated by Wordij
Text Analysis – Exercise 2
1. Open the file “2SpeedInternet.txt” with Wordij
2. Add droplist.txt to eliminate stopwords
3. Create the standard output files
4. Analyze the network using Gephi, with the .net file generated by Wordij
5. Compare the results with Exercise 1
6. Go back to Wordij, changing the parameters Drop words/Drop pairs and window size
7. Open the droplist.txt file to see if there are words you want to add/eliminate
8. Go back to 3, or to 2 if you changed droplist.txt
Text Analysis – Exercise 3
• Using your browser, open an article from any newspaper/magazine; a longer one is recommended
• Copy/paste the content into a .txt file using a basic text editor
• Analyze the text using Wordij
• Analyze the network using Gephi, with the .net file generated by Wordij
Text Analysis – Exercise 4
• Use the files you created in the previous exercise
• Analyze with Excel the two .csv files generated by Wordij
• Clean the .csv files, eliminating the statistical information at the top
• Analyze the files using R/Rattle
• Compare the results from Rattle with the results from Gephi, in terms of the information you can extract from the original text
• As an option, you can skip this exercise and start your HW4