Data Mining and Machine Learning
Lecture 2
Statistical Analysis of Texts
Peter Jančovič
Slide 1
Data Mining and Machine Learning
Objectives
Understand different approaches to text-based IR – Rationalism vs Empiricism
“Bundles of words” approaches
Introduction to zipf.c
Statistical analysis of word occurrence in text
Zipf’s Law
Examples
Slide 2
Data Mining and Machine Learning
A Basic Search Engine [Belew]
Query
Retrieved documents
Relevance feedback
+
Indices
Match
Information retrieval system
Documents
Slide 3
–
Data Mining and Machine Learning
Information Retrieval Components
The Documents
– Identify words which are ‘important’ for discriminating between documents, and how important they are
The Index
– Specifies the relationships between these ‘keywords’ and
the documents
The query
Matching
– Measuring the similarity between the query and each
document
Retrieved documents
Assessment and Relevance Feedback
Slide 4
Data Mining and Machine Learning
Example Text
“There was no possibility of taking a walk that day. We had been wandering, indeed, in the leafless shrubbery an hour in the morning; but since dinner (Mrs. Reed, when there was no company, dined early) the cold winter wind had brought with it clouds so sombre, and a rain so penetrating, that further out-door exercise was now out of the question.”
Charlotte Brontë, “Jane Eyre”, first paragraph
Slide 5
Data Mining and Machine Learning
“Jane Eyre” extract
What is it about?
How do you know?
What is your ‘strategy’ for understanding what a text is about?
What are the component topics?
– Exercise (walk, wandering, exercise)
– Gardens (shrubbery)
– Weather (cold, winter, wind, clouds, rain)
Slide 6
Data Mining and Machine Learning
Structure in text
Words
– Keywords (some words are more important than others) – Cold, Walk and Shrubbery are important
– There, and and that are not
Sentences (Grammar / Syntax)
– Word sequence structure helps us to understand and to
remove ambiguity
– ‘Parts of speech’
– The lead miner lived in Cornwall
– Keep that dog on a lead!
– He won the lead role in the new film
Slide 7
Data Mining and Machine Learning
Example
Det Noun Noun Verb Prep Noun Verb
Adj
The lead miner lived in Cornwall
Noun Phrase: The lead miner
Verb Phrase:
Lived in Cornwall
Determiner: The
Noun Phrase:
lead miner
Verb:
lived
Preposition Phrase: in Cornwall
Prep: in
Noun:
Cornwall
Slide 8
Data Mining and Machine Learning
Rationalism vs. Empiricism 1 Rationalism:
– Try to copy human language processing
Two questions:
– Do we understand sufficiently well how we do it?
– Is our knowledge ‘computationally useful’? I.e. is our knowledge sufficiently ‘solid’ to support algorithms and computer programs?
These are topics in Natural Language Processing (NLP) and Computational Linguistics
Slide 9
Data Mining and Machine Learning
Available knowledge
Word inventories
– Electronic dictionaries
Word forms (noun, verb etc)
– Available in electronic dictionaries
Word meanings
– Expressed in terms of predicate logic (properties)
Grammar / syntax
– Grammatical rules
Parsers
– Apply grammatical rules to a word sequence to determine if it is grammatical and, if so, its grammatical structure
Slide 10
Data Mining and Machine Learning
Natural Language Processing
Use word sense and meaning plus grammatical structure to infer ‘meaning’
Several problems
– Grammar may be too accommodating – accept non-
grammatical sentences
– Grammar may be too restrictive – reject valid sentences
– The number of interpretations of a simple sentence may be huge (“I saw the man on the hill with the telescope”)
Language is dynamic and changing
Slide 11
Data Mining and Machine Learning
Rationalism vs. Empiricism 2
Empiricism (“Big Data”)
– Use large corpora of text instead of human knowledge
– Use machine-learning to identify important structure and relationships
– Quantify the problem
– Rely on quantities which can be measured from these
large corpora, rather than human opinion For example:
– For each word w define a number U(w) which indicates how useful w is for Information Retrieval
– Invent algorithms to find the most useful words
– Invent measures of the similarity between queries and
texts Slide 12
Data Mining and Machine Learning
Rationalism vs Empiricism
Need sophisticated computationally useful models of language and semantics to infer meaning
Rational approaches accommodate complex structure but may be fragile and hard to generalise
– She ran, waving, across the bridge
Models based on Machine Learning (ML) are conceptually simpler but huge, and trained automatically
NLP currently outperformed in most applications by methods based on ML – “Deep Learning”, “Deep Neural Networks”
Progress – Amazon Echo/Alexa
Slide 13
Data Mining and Machine Learning
‘Bundles of Words’ approaches
There was no possibility of taking a walk that day. We had been wandering, indeed, in the leafless shrubbery an hour in the morning; but since dinner (Mrs. Reed, when there was no company, dined early) the cold winter wind had brought with it clouds so sombre, and a rain so penetrating, that further out- door exercise was now out of the question
the 4
was 3
a2
had 2
in 2
no 2
of 2
so 2
that 2 there 2
an 1
and 1
been 1 brought 1 but 1 clouds 1 cold 1 company 1 day 1 dined 1 dinner 1
early 1 exercise 1 further 1 hour 1 indeed 1 it1 leafless 1 morning 1 mrs 1 now 1
walk 1 wandering 1 we 1
when 1 wind 1 winter 1 with 1
out 1 out-door 1 penetrating 1 possibility 1 question 1 rain 1
reed 1 shrubbery 1 since 1 sombre 1 taking 1
Slide 14
Data Mining and Machine Learning
What is a word?
Tokens things separated by white space
Hyphenation
– Database Data-base?
Case
– “the bath shop” vs “the Bath shop”
– “the brown house” vs “the Brown house”
Morphology
– retrieval, retrieve, retrieved, retrieving,…
Punctuation
– The ‘honest’ politician vs the honest politician
Slide 15
Data Mining and Machine Learning
Some arbitrary choices…
Tokens things separated by white space
Ignore case:
– London london – BBC bbc
Ignore non-alphanumerics at start and end of token: – ‘honest’ honest. honest! “honest honest
Slide 16
Data Mining and Machine Learning
Statistical Analysis of Word Occurrence in Texts
zipf.c
– ANSII C program for simple analysis of texts
– Finds the set of different tokens in the text
– Counts how many times each word occurs
– Orders words according to the number of times they occur in the text (their rank)
– Prints out the result, and
– Stores results in a file results
Slide 17
Data Mining and Machine Learning
zipf.c
/* Function to read next word from text */
int nextWord(FILE *ip, char *token) {
int x;
int c;
for (c=0; c