程序代写代做代考 database data mining deep learning information retrieval algorithm compiler Data Mining and Machine Learning

Data Mining and Machine Learning
Lecture 2
Statistical Analysis of Texts
Peter Jančovič
Slide 1
Data Mining and Machine Learning

Objectives
 Understand different approaches to text-based IR – Rationalism vs Empiricism
 “Bundles of words” approaches
 Introduction to zipf.c
 Statistical analysis of word occurrence in text
 Zipf’s Law
 Examples
Slide 2
Data Mining and Machine Learning

A Basic Search Engine [Belew]
Query
Retrieved documents
Relevance feedback
+
Indices
Match
Information retrieval system
Documents
Slide 3

Data Mining and Machine Learning

Information Retrieval Components
 The Documents
– Identify words which are ‘important’ for discriminating between documents, and how important they are
 The Index
– Specifies the relationships between these ‘keywords’ and
the documents
 The query
 Matching
– Measuring the similarity between the query and each
document
 Retrieved documents
 Assessment and Relevance Feedback
Slide 4
Data Mining and Machine Learning

Example Text
“There was no possibility of taking a walk that day. We had been wandering, indeed, in the leafless shrubbery an hour in the morning; but since dinner (Mrs. Reed, when there was no company, dined early) the cold winter wind had brought with it clouds so sombre, and a rain so penetrating, that further out-door exercise was now out of the question.”
Charlotte Brontë, “Jane Eyre”, first paragraph
Slide 5
Data Mining and Machine Learning

“Jane Eyre” extract
 What is it about?
 How do you know?
 What is your ‘strategy’ for understanding what a text is about?
 What are the component topics?
– Exercise (walk, wandering, exercise)
– Gardens (shrubbery)
– Weather (cold, winter, wind, clouds, rain)
Slide 6
Data Mining and Machine Learning

Structure in text
 Words
– Keywords (some words are more important than others) – Cold, Walk and Shrubbery are important
– There, and and that are not
 Sentences (Grammar / Syntax)
– Word sequence structure helps us to understand and to
remove ambiguity
– ‘Parts of speech’
– The lead miner lived in Cornwall
– Keep that dog on a lead!
– He won the lead role in the new film
Slide 7
Data Mining and Machine Learning

Example
Det Noun Noun Verb Prep Noun Verb
Adj
The lead miner lived in Cornwall
Noun Phrase: The lead miner
Verb Phrase:
Lived in Cornwall
Determiner: The
Noun Phrase:
lead miner
Verb:
lived
Preposition Phrase: in Cornwall
Prep: in
Noun:
Cornwall
Slide 8
Data Mining and Machine Learning

Rationalism vs. Empiricism 1  Rationalism:
– Try to copy human language processing
 Two questions:
– Do we understand sufficiently well how we do it?
– Is our knowledge ‘computationally useful’? I.e. is our knowledge sufficiently ‘solid’ to support algorithms and computer programs?
 These are topics in Natural Language Processing (NLP) and Computational Linguistics
Slide 9
Data Mining and Machine Learning

Available knowledge
 Word inventories
– Electronic dictionaries
 Word forms (noun, verb etc)
– Available in electronic dictionaries
 Word meanings
– Expressed in terms of predicate logic (properties)
 Grammar / syntax
– Grammatical rules
 Parsers
– Apply grammatical rules to a word sequence to determine if it is grammatical and, if so, its grammatical structure
Slide 10
Data Mining and Machine Learning

Natural Language Processing
 Use word sense and meaning plus grammatical structure to infer ‘meaning’
 Several problems
– Grammar may be too accommodating – accept non-
grammatical sentences
– Grammar may be too restrictive – reject valid sentences
– The number of interpretations of a simple sentence may be huge (“I saw the man on the hill with the telescope”)
 Language is dynamic and changing
Slide 11
Data Mining and Machine Learning

Rationalism vs. Empiricism 2
 Empiricism (“Big Data”)
– Use large corpora of text instead of human knowledge
– Use machine-learning to identify important structure and relationships
– Quantify the problem
– Rely on quantities which can be measured from these
large corpora, rather than human opinion  For example:
– For each word w define a number U(w) which indicates how useful w is for Information Retrieval
– Invent algorithms to find the most useful words
– Invent measures of the similarity between queries and
texts Slide 12
Data Mining and Machine Learning

Rationalism vs Empiricism
 Need sophisticated computationally useful models of language and semantics to infer meaning
 Rational approaches accommodate complex structure but may be fragile and hard to generalise
– She ran, waving, across the bridge
 Models based on Machine Learning (ML) are conceptually simpler but huge, and trained automatically
 NLP currently outperformed in most applications by methods based on ML – “Deep Learning”, “Deep Neural Networks”
 Progress – Amazon Echo/Alexa
Slide 13
Data Mining and Machine Learning

‘Bundles of Words’ approaches
There was no possibility of taking a walk that day. We had been wandering, indeed, in the leafless shrubbery an hour in the morning; but since dinner (Mrs. Reed, when there was no company, dined early) the cold winter wind had brought with it clouds so sombre, and a rain so penetrating, that further out- door exercise was now out of the question
the 4
was 3
a2
had 2
in 2
no 2
of 2
so 2
that 2 there 2
an 1
and 1
been 1 brought 1 but 1 clouds 1 cold 1 company 1 day 1 dined 1 dinner 1
early 1 exercise 1 further 1 hour 1 indeed 1 it1 leafless 1 morning 1 mrs 1 now 1
walk 1 wandering 1 we 1
when 1 wind 1 winter 1 with 1
out 1 out-door 1 penetrating 1 possibility 1 question 1 rain 1
reed 1 shrubbery 1 since 1 sombre 1 taking 1
Slide 14
Data Mining and Machine Learning

What is a word?
 Tokens  things separated by white space
 Hyphenation
– Database  Data-base?
 Case
– “the bath shop” vs “the Bath shop”
– “the brown house” vs “the Brown house”
 Morphology
– retrieval, retrieve, retrieved, retrieving,…
 Punctuation
– The ‘honest’ politician vs the honest politician
Slide 15
Data Mining and Machine Learning

Some arbitrary choices…
 Tokens  things separated by white space
 Ignore case:
– London  london – BBC  bbc
 Ignore non-alphanumerics at start and end of token: – ‘honest’  honest.  honest!  “honest  honest
Slide 16
Data Mining and Machine Learning

Statistical Analysis of Word Occurrence in Texts
 zipf.c
– ANSII C program for simple analysis of texts
– Finds the set of different tokens in the text
– Counts how many times each word occurs
– Orders words according to the number of times they occur in the text (their rank)
– Prints out the result, and
– Stores results in a file results
Slide 17
Data Mining and Machine Learning

zipf.c
/* Function to read next word from text */
int nextWord(FILE *ip, char *token) {
int x;
int c;
for (c=0; c