EM623-Week12
Text Mining
Carlo Lipizzi
clipizzi@stevens.edu
SSE
2016
Reasons for Text Mining
• 85-90 percent of all corporate
data is in some kind of
unstructured form (e.g., text)
• Unstructured corporate data is
doubling in size every 18
months
• Tapping into these information sources is not an option but a necessity to stay competitive
• Text mining is a semi-automated process of extracting knowledge from unstructured data sources, also known as text data mining or knowledge discovery in textual databases
“Search” versus “Discover”

                           Search (goal-oriented)    Discover (opportunistic)
Structured Data            Data Retrieval            Data Mining
Unstructured Data (Text)   Information Retrieval     Text Mining
Information Retrieval

Database Type:             Unstructured
Search Mode:               Goal-driven
Atomic entity:             Document
Example Information Need:  “Find a Japanese restaurant in New York”
Example Query:             “Japanese restaurant New York” or New York -> Restaurants -> Japanese
Examples in Corporations
• Customer complaint
letters
• Contracts
• Transcripts of phone
calls with customers
• Technical documents
• Email
• Insurance claims
• News articles
• Web pages
• Patent portfolios
Challenges in Text Mining
• Very high number of possible “dimensions”
• All possible word and phrase types in the language
• Unlike data mining:
• records (= docs) are not structurally identical
• records are not statistically independent
• Complex and subtle relationships between concepts in text
• “AT&T merges with Time-Warner”
• “Time-Warner is bought by AT&T”
• Ambiguity and context sensitivity
• automobile = car = vehicle = Toyota
• Apple (the company) or apple (the fruit)
Text Mining vs. Data Mining
• Both seek novel and useful patterns
• Both are semi-automated processes
• The difference is the nature of the data: structured versus unstructured
  – Structured data: in databases
  – Unstructured data: Word documents, PDF files, text excerpts, XML files, and so on
• Text mining: first impose structure on the data, then mine the structured data
Using Text Mining
• Benefits of text mining are obvious especially in
text-rich data environments
– e.g., law (court orders), academic research
(research articles), finance (quarterly reports),
medicine (discharge summaries), biology
(molecular interactions), technology (patent files),
marketing (customer comments), etc.
• Electronic communication records (e.g., email)
– Spam filtering
– Email prioritization and categorization
– Automatic response generation
Text Mining Application Areas
• Information extraction
• Topic tracking
• Summarization
• Categorization
• Clustering
• Concept linking
• Question answering
Text Mining Terminology
• Corpus
• Terms
• Concepts
• Stemming
• Stop words
• Synonyms
• Tokenizing
• Term dictionary
• Word frequency
• Part-of-speech tagging
• Morphology
• Term-by-document matrix (occurrence matrix)
• Singular value decomposition (latent semantic indexing)
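The last two terms go together: latent semantic indexing applies singular value decomposition to the term-by-document (occurrence) matrix to map terms and documents into a smaller “concept” space. A minimal sketch in Python with numpy; the matrix values below are invented purely for illustration:

    import numpy as np

    # Toy term-by-document matrix: rows = terms, columns = documents.
    # The frequency values are invented purely for illustration.
    A = np.array([[2, 0, 1, 0],
                  [1, 1, 0, 0],
                  [0, 2, 0, 1],
                  [0, 0, 1, 2]], dtype=float)

    # Singular value decomposition: A = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    # Latent semantic indexing keeps only the k largest singular values,
    # projecting documents into a k-dimensional "concept" space
    k = 2
    doc_concepts = np.diag(s[:k]) @ Vt[:k, :]  # one column per document
    print(doc_concepts)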
The Emergence of Text Mining
• Advances in text processing technology
– Natural Language Processing (NLP)
– Computational Linguistics
• Cheap Hardware
– CPU
– Disk
– Network
Two Mining Phases
• Text “preparation”. Beyond the typical data mining
cleaning, there are some semantic steps involved
here
• Information Extraction. There are several ways to
extract information from clean data, based on the
goal of the search
Text “preparation”
• Basic data cleaning
• “Tokenization”
• Stopwords removal
• Stemming/Lemmatization
• Statistical Analysis
• Additional Content Analysis
Tokenization
• The simplest way to represent a text is with a single
string, but it is difficult to process text in this format
• Often, it is more convenient to work with a list of
tokens/elements
• The task of converting a text from a single string to a
list of tokens is known as tokenization
• There are tools/functions within libraries providing this capability: they take a string and return a list of tokens
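For instance, a minimal tokenizer built on Python's standard library (real pipelines typically use a library such as NLTK or spaCy instead):

    import re

    def tokenize(text):
        """Convert a single string into a list of lowercase tokens."""
        return re.findall(r"[a-z0-9']+", text.lower())

    print(tokenize("Text mining is a semi-automated process."))
    # ['text', 'mining', 'is', 'a', 'semi', 'automated', 'process']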
Stopwords removal
• This step eliminates words/tokens with no semantic value from the token set
• Some words have no semantic value in any context, such as articles and pronouns; those belong in any “standard” stopwords removal process
• Other words have intrinsic semantic value but are irrelevant for the specific case; those may be added to the stopword list for that case, as in the sketch below
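A minimal sketch of both cases, standard and case-specific stopwords (the word lists are illustrative; production lists such as NLTK's are far longer):

    # Standard stopwords (articles, pronouns, ...); real lists are longer
    STOPWORDS = {"a", "an", "the", "is", "it", "of", "and", "to"}

    def remove_stopwords(tokens, case_specific=()):
        """Drop standard stopwords plus any case-specific additions."""
        drop = STOPWORDS | set(case_specific)
        return [t for t in tokens if t not in drop]

    tokens = ["the", "mining", "of", "text", "is", "semi", "automated"]
    print(remove_stopwords(tokens))
    # ['mining', 'text', 'semi', 'automated']
    print(remove_stopwords(tokens, case_specific=["text"]))
    # ['mining', 'semi', 'automated']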
Stemming/Lemmatization
• This step conflates words that carry the same semantic value
• Stemming is the process of reducing words to their stem, base or root form. Examples:
  – “fish” for “fishing”, “fished” and “fisher”
  – “argu” for “argue”, “argued”, “argues”, “arguing” and “argus”
• Lemmatization is the process of determining the lemma for a given word, where “lemma” is the canonical form, dictionary form or citation form of a set of words. For example, “run”, “runs”, “ran” and “running” share the canonical form “run”. Lemmatization may rely on an analysis of each word’s part of speech; parts of speech are linguistic categories, such as noun and verb
• A stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words that have different meanings depending on part of speech. Stemmers may also create “stems” with no dictionary value (like “argu” above). However, they are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications
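A small comparison of the two approaches using NLTK (an assumed tool choice; requires the library to be installed and the WordNet data downloaded):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    # Assumes NLTK is installed and the WordNet data has been fetched
    # once via nltk.download("wordnet")
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    words = ["fishing", "fished", "argues", "arguing", "running", "ran"]

    # Stemming: fast, context-free, may produce non-dictionary stems
    print([stemmer.stem(w) for w in words])

    # Lemmatization: returns dictionary forms, here treating words as verbs
    print([lemmatizer.lemmatize(w, pos="v") for w in words])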
Statistical Analysis
• Use statistics to add a numerical dimension to unstructured text (a small example follows the list):
• Term frequency
• Document frequency
• Term proximity
• Document length
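A small example computing the first two statistics, term frequency and document frequency, with the Python standard library:

    from collections import Counter

    docs = [["text", "mining", "finds", "patterns"],
            ["data", "mining", "finds", "patterns", "patterns"]]

    # Term frequency: how often a term occurs within one document
    tf = [Counter(doc) for doc in docs]
    print(tf[1]["patterns"])  # 2

    # Document frequency: in how many documents a term occurs at all
    df = Counter(term for doc in docs for term in set(doc))
    print(df["patterns"])     # 2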
Additional Content Analysis
• Semantic Processing
– Extracting meaning
– Named Entity Extraction (people names, company names, locations, etc.)
• Extra-semantic features
– Identify feelings or sentiment in text
• Goal = Dimension Reduction
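As an example of named entity extraction, a minimal sketch using spaCy (an assumed tool choice, not prescribed by the slides):

    import spacy

    # Assumes spaCy and its small English model are installed:
    #   pip install spacy && python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("AT&T merges with Time-Warner in New York.")
    for ent in doc.ents:
        # Company names are tagged ORG, locations GPE, person names PERSON
        print(ent.text, ent.label_)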
Information Extraction
• The process of extracting the information depends heavily on the goal of the project, such as:
• Information extraction
• Topic tracking
• Summarization
• Categorization
• Clustering
• Concept linking
• Question answering
Information Extraction: Text recognition/classification

The process runs in three tasks, with feedback between them:

Task 1 – Establish the Corpus: collect & organize the domain-specific unstructured data. The inputs include a variety of relevant unstructured (and semi-structured) data sources such as text, XML, HTML, etc.; the output of Task 1 is a collection of documents in some digitized format for computer processing.

Task 2 – Create the Term-Document Matrix: introduce structure to the corpus. The output of Task 2 is a flat file called the term-document matrix, where the cells are populated with the term frequencies.

Task 3 – Extract Knowledge: discover novel patterns from the T-D matrix. The output of Task 3 is a number of problem-specific classification, association, and clustering models and visualizations.
Step 1: Establish the corpus
• Collect all relevant unstructured data (e.g., textual
documents, XML files, emails, Web pages, short
notes, voice recordings…)
• Digitize, standardize the collection (e.g., all in ASCII
text files)
• Place the collection in a common place (e.g., in a
flat file, or in a directory as separate files)
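A minimal sketch of the “common place” idea: reading every text file in one directory into a Python list (the directory name is hypothetical):

    from pathlib import Path

    # Read every .txt file from one directory into a single list,
    # one string per document; "corpus" is a hypothetical directory name
    corpus_dir = Path("corpus")
    documents = [p.read_text(encoding="utf-8")
                 for p in sorted(corpus_dir.glob("*.txt"))]
    print(f"Loaded {len(documents)} documents")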
Step 2: Create the Term-by-Document Matrix

[Figure: a sample term-by-document matrix for Documents 1–6, with one column per term such as “investment risk”, “project management”, “software engineering”, “development” and “SAP”; the cells hold term frequencies (the 1s, 2s and 3s in the example).]
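A minimal sketch of building such a matrix with scikit-learn's CountVectorizer (an assumed tool choice; the documents are invented examples). CountVectorizer returns the transpose, a document-by-term matrix, which carries the same information:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["investment risk management",
            "software engineering project management",
            "SAP project development"]

    # fit_transform returns a document-by-term matrix; its transpose is
    # the term-by-document matrix described above, cells = frequencies
    vectorizer = CountVectorizer()
    dtm = vectorizer.fit_transform(docs)

    print(vectorizer.get_feature_names_out())  # the term dictionary
    print(dtm.T.toarray())                     # rows = terms, cols = docs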
Step 3: Extract patterns/knowledge
• Classification (text categorization)
• Clustering (natural groupings of text)
– Improve search recall
– Improve search precision
– Scatter/gather
– Query-specific clustering
• Association
• Trend Analysis (…)
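As an illustration of the clustering task, a minimal sketch using scikit-learn (an assumption; the course does not prescribe this library). TF-IDF vectors feed a k-means model that groups documents with similar vocabulary:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = ["stocks fell on inflation fears",
            "bonds rally as inflation cools",
            "new vaccine shows strong trial results",
            "clinical trial results were published"]

    # Vectorize, then group documents with similar vocabulary
    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)  # e.g., the two finance texts share one cluster id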
Wordij
• WORDij is a package containing a set of tools for text analysis
• It takes plain text files (UTF-8 .txt) as input
• It creates files to be analyzed by WORDij itself or by external tools
Using Wordij
Program      Role
WordLink     Counts words and word-pair strings; produces custom semantic networks by using one, two or three special input files
OptiComm     The optimal message creator: traces all shortest paths between a seed word and a target word, both of which must be connected indirectly in the network
VISij        Visualization of a network
QAPNet       QAP is an overall measure of the similarity of two whole networks using a correlation coefficient
Z-Utilities  Compare two text files and determine what significant differences there are for either the words, the word pairs, or the pairs from NodeTric .nets
Conversions  Three types of conversions are possible:
             1. Convert WordLink .wtg and .ptg files into MultiNet Node and Link .csv files
             2. Convert a WordLink .pr file into a Pajek .net file
             3. Convert a Pajek .net file into MultiNet Node and Link .csv files
Utilities    Two applications:
             1. A Proper Nouns extraction program which creates a list of Proper Nouns and a String Replace File (.str)
             2. A TimeSegs program which creates a new WordLink text file that has embedded time-stamp headers and a Select file (.sel)
Using Wordij
• Source File: text file in UTF-8 format
• Drop List File: file with a list of words that will be dropped
• Drop words / pairs appearing less often than: words / pairs appearing less often will not be included in the output files
• Window size for extracting word pairs: each word is paired with the words within the window size before and after it in the text (see the sketch below)
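A rough Python approximation of window-based pair extraction (this mimics the idea described above, not WORDij's exact algorithm):

    from collections import Counter

    def word_pairs(tokens, window=2):
        """Pair each word with the words up to `window` positions
        after it and count the pairs."""
        pairs = Counter()
        for i, w in enumerate(tokens):
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                pairs[(w, tokens[j])] += 1
        return pairs

    print(word_pairs(["text", "mining", "extracts", "knowledge"], window=2))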
Wordij files
• .net file. This contains the semantic network
• .pr is word pair listing in the form of From, To, and Frequency Count. The file contains three
columns: Word1, Word 2 and Frequency
• .ptg is like the “.pr” file except it contains IDs rather than words. The file contains three columns:
the ID for Word1, the ID for Word 2 and the Frequency
• .stp.csv is a file showing the number of pairs, number of unique pairs, average pair frequency,
and pair negative entropy. Then there are five columns listing the pair, frequency, proportion,
negentropy term, and mutual information (Negentropy measures the difference in entropy
between a given distribution and the Gaussian distribution with the same mean and variance)
• .stw.csv a file showing the number of words, number of unique words, average word frequency,
and word negative entropy. Then there are four columns listing the word, frequency, proportion,
and negentropy term
• .log, is a log file of the run settings
• .r.wrd, is an alphabetically listing of the words and a frequency count of their occurrence. The file
contains two columns: Word and Frequency
• .wtg is an is an alphabetically listing of the words, a unique ID number assigned to it and a
frequency count of their occurrence. The file contains three columns: Word, ID number and
Frequency
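Since the .net file is in Pajek format (the exercises below open it in Gephi) and the .pr file is plain three-column text, both can also be inspected programmatically. A hedged sketch with networkx; the filenames are hypothetical, and whitespace separation in .pr is an assumption about the exact layout:

    import networkx as nx

    # "output.net" / "output.pr" are hypothetical filenames
    G = nx.read_pajek("output.net")          # the semantic network
    print(G.number_of_nodes(), G.number_of_edges())

    # .pr: three columns Word1, Word2, Frequency; whitespace separation
    # between the columns is an assumption about the exact layout
    with open("output.pr") as f:
        for line in f:
            word1, word2, freq = line.split()
            print(word1, "->", word2, freq)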
Text Analysis – Exercise 1
• Open the file “2SpeedInternet.txt” with Wordij and create the standard output files
• Analyze the network using VISij
• Analyze the network using Gephi, with the .net file generated by Wordij
Text Analysis – Exercise 2
1. Open the file “2SpeedInternet.txt” with Wordij
2. Add droplist.txt to eliminate stopwords
3. Create the standard output files
4. Analyze the network using Gephi, with the .net file generated by Wordij
5. Compare the results with Exercise 1
6. Go back to Wordij, changing the parameters Drop words/Drop pairs and window size
7. Open the droplist.txt file to see if there are words you want to add/eliminate
8. Go back to 3, or to 2 if you changed droplist.txt
Text Analysis – Exercise 3
• Using your browser, open an article from any newspaper/magazine; a longer one is recommended
• Copy/paste the content into a .txt file using a basic text editor
• Analyze the text using Wordij
• Analyze the network using Gephi, with the .net file generated by Wordij
Text Analysis – Exercise 4
• Use the files you created in the previous exercise
• Analyze with Excel the two .csv files generated by Wordij
• Clean the .csv files, eliminating the statistical information at the top
• Analyze the files using R/Rattle
• Compare the results from Rattle with the results from Gephi, in terms of the information you can extract from the original text
• As an option, you can skip this exercise and start your HW4