
Data Mining and Machine Learning
Topic Analysis
Peter Jančovič
Slide 1

Objectives
• Statistical modelling of topics
• Identifying topics in a document – Latent Dirichlet Allocation (LDA)
• Topic Spotting
– Salience and Usefulness
– Example: The AT&T “How May I Help You?” system
Slide 2

Motivation
• Example 1: You are responsible for competitor analysis in a large company. You need to monitor all media for press releases, news items and other articles relating to your company’s product range.
• Example 2: You work for the police. You are given the task of monitoring 500 telephone lines for 12 months. You have to identify calls on these lines which are about illegal drug trafficking.
• Example 3: You manage a call centre. You are concerned that some staff are being rude to the people that they are calling. You need to monitor all calls for a period of 6 months and detect all instances of ‘rudeness’.
Slide 3

Topics
• “your company’s product range”, “illegal drug trafficking” and “rudeness” are all examples of topics
• A typical document covers multiple topics
• Topic Analysis is about decomposing a document into its component topics
• Topic Spotting is about identifying documents that are relevant to a particular topic
• The previous slide is a list of Topic Spotting problems
Slide 4

Topics as “bundles of words”
• For any term w, P(w) is the probability of w
– Choose a document at random, and then choose a term at random from the document; P(w) is the probability that the term is w
– We know about P(w) from Zipf’s Law
• If T is a topic, P(w|T) is the conditional probability of w given the topic T
– Choose a document about topic T at random, then choose a term at random from the document; P(w|T) is the probability that the term is w
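To make these definitions concrete, here is a minimal sketch (not from the slides) of estimating P(w) and P(w|T) exactly as defined above: pick a document uniformly at random, then pick a term uniformly at random from it. The toy corpus and function name are invented for illustration.

```python
from collections import Counter

def term_distribution(documents):
    """P(w): pick a document uniformly at random, then pick a term
    uniformly at random from that document."""
    dist = Counter()
    for doc in documents:
        for w, c in Counter(doc).items():
            dist[w] += (c / len(doc)) / len(documents)
    return dist

# Hypothetical toy corpus: each document is a list of terms.
docs = [["eat", "sandwiches", "sand", "sea"],
        ["race", "marathon", "london", "marathon"]]
on_topic_docs = [docs[1]]                        # documents known to be about topic T

p_w = term_distribution(docs)                    # P(w) over the whole corpus
p_w_given_T = term_distribution(on_topic_docs)   # P(w|T) over on-topic documents only
print(p_w["marathon"], p_w_given_T["marathon"])  # 0.25 0.5
```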
Slide 5

Statistical modelling of topics
• The conditional distribution P(w|T) is a “bundle of words” model of the topic T
• A typical document is made up of multiple topics
– Example: Wikipedia entry on the London Marathon (next slide)
• Latent Dirichlet Allocation (LDA) expresses a document as a combination of topics
• The simplest way to understand LDA is to see how the LDA model generates a document
Slide 6

Documents have multiple topics
Topics include: London, marathons, fund-raising
The race was founded by the former Olympic champion and journalist Chris Brasher and athlete John Disley. It is organised by Hugh Brasher (son of Chris) as Race Director and Nick Bitel as Chief Executive. Set over a largely flat course around the River Thames, the race begins at three separate points around Blackheath and finishes in The Mall alongside St. James’s Park. Since the first marathon, the course has undergone very few route changes. In 1982, the finishing post was moved from Constitution Hill to Westminster Bridge due to construction works. It remained there for twelve years before moving to its present location at The Mall.
In addition to being one of the top six international marathons run over the distance of 26 miles and 385 yards, the IAAF standard for the marathon established in 1921 and originally used for the 1908 London Olympics, the London Marathon is also a large, celebratory sporting festival, third only to the Great North Run in South Shields and Great Manchester Run in Manchester in terms of the number of participants. The event has raised over £450 million for charity since 1981,[2][3] and holds the Guinness world record as the largest annual fund raising event in the world, with the 2009 participants raising over £47.2 million for charity. In 2007, 78% of all runners raised money. In 2011 the official charity of the London Marathon was Oxfam. In 2014, the official charity was Anthony Nolan, and in 2015, it will be Cancer Research UK.
Slide 7
Overview of the London Marathon, Wikipedia, January 2017

Latent Semantic Analysis
• Latent Semantic Analysis can be seen as a method for automatically discovering topics in a corpus
• W = U S V^T
• In LSA the topic vectors are the columns of V
• So, a topic is described as a document vector
• If d is a document and v_i is the ith “topic” (column of V), then vec(d) · v_i is a measure of the contribution of the ith topic to d
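A small numpy sketch of this use of the SVD (the matrix, vocabulary and numbers are invented; documents are taken as the rows of W, so each column of V is a topic expressed in the same term space as a document vector):

```python
import numpy as np

# Hypothetical document-term matrix W: rows = documents, columns = terms
# (vocabulary, for illustration: "eat", "sandwich", "sand", "sea", "race").
W = np.array([[2., 1., 0., 0., 0.],
              [0., 0., 3., 2., 0.],
              [1., 1., 1., 1., 0.],
              [0., 0., 0., 1., 4.]])

U, S, Vt = np.linalg.svd(W, full_matrices=False)   # W = U S V^T
V = Vt.T                                           # columns of V are the "topic" vectors

d = W[2]                       # vec(d): the term vector of the third document
contributions = d @ V          # vec(d) . v_i for each topic i
print(contributions)           # contribution of each topic to d
```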
Slide 8

Latent Dirichlet Allocation
• Consider the document d:
“I eat sandwiches in a deck-chair on the sand by the sea” → “eat sandwiches deck-chair sand sea”
• Intuitively d is made up of two topics, A and B:
– A: food, corresponding to “eat” and “sandwiches”
– B: seaside, corresponding to “deck-chair”, “sand” and “sea”
• It looks like d is made up of approximately 40% topic A (food) and 60% topic B (seaside)
Slide 9

Latent Dirichlet Allocation
• According to LDA, d might be generated as follows:
– Decide the number of topics: N = 2, “food” (A) and “seaside” (B)
– Decide the document length: M = 5
– Decide the prior probabilities of the topics: P_T(A) = 0.4, P_T(B) = 0.6
– For i = 1 to M:
– Choose the topic T_i randomly according to P_T
– Choose the word w_i randomly according to P(w|T_i)
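A sketch of this generative recipe in Python; the topic word distributions P(w|T) below are invented purely for illustration:

```python
import random

# Assumed topic word distributions (illustrative values only).
p_w_given_T = {
    "A": {"eat": 0.5, "sandwiches": 0.5},                   # "food"
    "B": {"deck-chair": 0.34, "sand": 0.33, "sea": 0.33},   # "seaside"
}
p_T = {"A": 0.4, "B": 0.6}   # prior probabilities of the topics
M = 5                        # document length

doc = []
for i in range(M):
    # Choose topic T_i according to P_T, then word w_i according to P(w|T_i).
    T_i = random.choices(list(p_T), weights=list(p_T.values()))[0]
    words = list(p_w_given_T[T_i])
    probs = list(p_w_given_T[T_i].values())
    doc.append(random.choices(words, weights=probs)[0])

print(doc)   # e.g. ['eat', 'sandwiches', 'deck-chair', 'sand', 'sea']
```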
Slide 10

Latent Dirichlet Allocation
• So, according to this model, the document d was generated as follows:
– i = 1, T_1 = A (“food”), w_1 = “eat”
– i = 2, T_2 = A (“food”), w_2 = “sandwiches”
– i = 3, T_3 = B (“seaside”), w_3 = “deck-chair”
– i = 4, T_4 = B (“seaside”), w_4 = “sand”
– i = 5, T_5 = B (“seaside”), w_5 = “sea”
Slide 11

Latent Dirichlet Allocation
• This is simple because we know the two topics and their associated word probability distributions
• Given a corpus C and a number of topics N, a much bigger problem is to derive a set of N topics that cover C in some optimal sense
• This is the clever part of LDA
• LDA uses an “E-M”-type algorithm to do this
Slide 12

Latent Dirichlet Allocation
• Basically:
1. Make an initial estimate of N topics (remember, a topic is just a probability distribution over words)
2. Decompose each document in C into its component topics
3. Use this decomposition to re-estimate the topic word probability distributions
4. Go back to 2.
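As an illustration of this estimation loop (not part of the slides), the gensim library provides an implementation of LDA fitting; a minimal sketch, assuming gensim is installed and that each document has already been tokenised:

```python
from gensim import corpora, models

# Toy tokenised corpus C (invented for illustration).
texts = [["eat", "sandwiches", "deck-chair", "sand", "sea"],
         ["race", "marathon", "london", "charity", "runners"],
         ["sand", "sea", "beach", "sun", "deck-chair"]]

dictionary = corpora.Dictionary(texts)                # term <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in texts]   # bag-of-words counts

# Fit N = 2 topics; 'passes' controls how many times steps 2-3 are repeated.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

print(lda.print_topics())                   # estimated P(w|T) for each topic
print(lda.get_document_topics(corpus[0]))   # topic mixture for the first document
```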
Slide 13

Latent Dirichlet Allocation
• See Edwin Chen’s blog post “Introduction to Latent Dirichlet Allocation” for an explanation
• The method is called “Latent Dirichlet Allocation” because the prior distribution over the topics, P_T, is assumed to follow a Dirichlet distribution
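For example, a document’s topic proportions can be drawn from a Dirichlet prior; a small numpy illustration (the concentration parameters are arbitrary):

```python
import numpy as np

alpha = [0.5, 0.5]                 # Dirichlet concentration parameters (arbitrary)
rng = np.random.default_rng(0)

# Each row is a valid probability distribution over the N = 2 topics,
# i.e. one possible setting of (P_T(A), P_T(B)) for a document.
topic_proportions = rng.dirichlet(alpha, size=3)
print(topic_proportions)
```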
Slide 14

Topic Spotting
• Topic Spotting is a type of ‘dedicated’ IR
– The task is to find documents that are about a particular topic
– The corpus from which data is retrieved is dynamic
• Other examples
– Detect all weather forecasts in BBC Radio 4 broadcasts
– Find all documents written by Charlotte Brontë
– Find all requirements in new EU railway legislation
• Topic Spotting vs IR
– Because a topic is richer than a query, we can calculate probabilities P(t|T) and not just IDF(t)
Slide 15

Topic spotting
[Diagram: a data stream feeds the Topic Spotter, which accepts documents that are “On Topic” and rejects documents that are “Off Topic”.]
Slide 16

TF-IDF weights
• Recall the definition of the TF-IDF weight for a term t relative to a document d:
w_{t,d} = f_{t,d} × IDF(t), where IDF(t) = log( N_D / N_{D_t} )
(here N_D is the total number of documents and N_{D_t} is the number of documents that contain t)
• IDF(t) indicates how useful t is for discriminating between documents
• f_{t,d} ensures that t occurs sufficiently often to be useful
• For Topic Spotting we can define a more sophisticated criterion to identify words that are indicative of a given topic
Slide 17

Usefulness
• Given a term t and a topic T, define the usefulness of t (relative to T) by:
U(t) = P(t|T) log( P(t|T) / P(t) )
• If log( P(t|T) / P(t) ) is large, t is characteristic of the topic
• If P(t|T) is large, then t occurs sufficiently often “on topic” to be useful for topic spotting
Slide 18

Usefulness and IDF Recall IDFtlogND
 ND  t
 Given a set of documents S, write S = St u St’ , where St is the set of documents that contain t and St’ is the
set of documents that don’t contain t
 Then
Pt Pt | St PSt  Pt | St’ PSt’  t
Pt|S PS Pt|S ND tt tND
Slide 19

Usefulness and IDF
• Hence
P(t|S_t) = P(t) × N_D / N_{D_t}, and IDF(t) = log( N_D / N_{D_t} ) = log( P(t|S_t) / P(t) )
Slide 20

Usefulness and IDF
• IDF(t) = log( P(t|S_t) / P(t) ) is a measure of how useful the term t is for general information retrieval (or for retrieving documents that contain it?)
• So, log( P(t|T) / P(t) ) is a measure of the usefulness of t for retrieving documents about topic T
Slide 21

‘Salience’
• Similarly, given a term t and a topic T, define the salience of t (relative to T) by:
S(t) = P(T|t) log( P(T|t) / P(T) )
• Using Bayes’ Theorem (below) it is easy to establish a relationship between salience and usefulness:
P(T|t) = p(t|T) P(T) / p(t)
Slide 22

Salience and Usefulness
 P T | t   StPT|tlog PT
 pt|TPT pt|TPT
 pt log PTpt  
PT pt|T PT
 ptpt|Tlog pt  ptUt

Slide 23

Salience and Usefulness
St PTUt pt
 Now, T is the topic, so P(T) is fixed. Therefore St 1 Ut
Slide 24
pt

Salience and Usefulness
• So, the main difference between Salience and Usefulness is that to have high usefulness, a term must occur frequently
• Sometimes this means that the most useful words for a topic are not the ones you would immediately suspect:
– e.g. for Weather Forecast spotting, “north”, “south”, “east” and “west” turned out to be more ‘useful’ than “rain” and “sun” – why?
Slide 25

Example
• A term w occurs:
– t1 times in documents about topic T
– t2 times in documents which are not about topic T
• Total number of terms:
– in documents about topic T is N1
– in documents not about topic T is N2
• The corpus contains C1 documents about T and C2 documents not about T
• Then
– P(w|T) = t1 / N1, P(w|not-T) = t2 / N2
– P(w) = P(w|T) P(T) + P(w|not-T) P(not-T)
  = ( t1 / N1 ) × C1 / ( C1 + C2 ) + ( t2 / N2 ) × C2 / ( C1 + C2 )
  = ( t1 N2 C1 + t2 N1 C2 ) / ( N1 N2 ( C1 + C2 ) )
Slide 26

Example
• A term w occurs:
– t1 = 150 times in documents about topic T
– t2 = 230 times in documents which are not about topic T
• Total number of terms:
– in documents about topic T is N1 = 12,500
– in documents not about topic T is N2 = 23,100
• Suppose that only 10% of documents are “on topic”
• So (using log base 10):
– P(w|T) = 0.012, P(w) ≈ 0.0102, log( P(w|T) / P(w) ) ≈ 0.0706
– U(w) ≈ 0.000847
– S(w) = ( P(T) / P(w) ) × U(w) ≈ ( 0.1 / 0.0102 ) × 0.000847 ≈ 0.0083
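These figures can be checked with a few lines of Python (log base 10; the slide’s 0.000847 and 0.0083 differ slightly from the values below because P(w) is rounded to 0.0102 before the later steps):

```python
import math

t1, t2 = 150, 230          # occurrences of w in on-topic / off-topic documents
N1, N2 = 12500, 23100      # total terms in on-topic / off-topic documents
P_T = 0.1                  # 10% of documents are "on topic"

P_w_given_T = t1 / N1                                  # 0.012
P_w = P_T * P_w_given_T + (1 - P_T) * (t2 / N2)        # ~ 0.0102
U_w = P_w_given_T * math.log10(P_w_given_T / P_w)      # usefulness U(w) ~ 0.00087
S_w = (P_T / P_w) * U_w                                # salience  S(w) ~ 0.0085
print(round(P_w, 4), round(U_w, 5), round(S_w, 4))
```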
Slide 27

Application to Topic Spotting
1. Start with a training corpus of documents d1, …, dN. Each dn could be a separate document, or a section (e.g. a paragraph) from the same document.
2. For each n, decide whether dn is on-topic (T) or off-topic (not-T)
3. Apply stemming and stop-word removal if required
4. Identify the set of terms (the vocabulary) in the corpus: w1, …, wV
5. For each v, calculate U(wv) – the usefulness of wv for the topic T
Slide 28

Application (continued)
6. If required, choose a threshold X and discard any terms with usefulness less than X
7. For each document dn in the training set:
– Let v1, …, vI(n) be the terms in dn
– Calculate AU(dn) = ( 1 / I(n) ) × Σ_{i=1}^{I(n)} U(vi)
– AU(dn) is the average usefulness of the terms in dn
Slide 29

Application (continued)
8. For a threshold W define a classification rule by:
– If AU(dn) > W, then dn is classified as “topic”
– If AU(dn) ≤ W, then dn is classified as “not-topic”
9. Choose a suitable value of W using the training documents. For example, W could correspond to the Equal Error Rate
10. Classification: Given a new document d:
– Calculate AU(d)
– Classify d as “topic” if AU(d) > W, otherwise d is “not-topic”
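A compact sketch of steps 1–10 (the helper names, the toy training data and the threshold value are all invented; here U(w) is computed from pooled term counts, with log base 10):

```python
import math
from collections import Counter

def usefulness(on_topic_docs, all_docs):
    """U(w) = P(w|T) log10( P(w|T) / P(w) ) for every term in the on-topic documents."""
    def dist(docs):
        counts = Counter(w for d in docs for w in d)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}
    p_w, p_w_T = dist(all_docs), dist(on_topic_docs)
    return {w: p * math.log10(p / p_w[w]) for w, p in p_w_T.items()}

def average_usefulness(doc, U):
    scores = [U.get(w, 0.0) for w in doc]      # discarded / unseen terms contribute 0
    return sum(scores) / len(scores) if scores else 0.0

def classify(doc, U, W):
    return "topic" if average_usefulness(doc, U) > W else "not-topic"

# Hypothetical training corpus: (tokenised document, on-topic?) pairs.
train = [(["rain", "wind", "north", "east"], True),
         (["football", "score", "goal"], False)]
U = usefulness([d for d, on in train if on], [d for d, _ in train])
W = 0.05                                       # threshold chosen on the training set (e.g. at the EER)
print(classify(["north", "east", "rain"], U, W))   # -> "topic"
```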
Slide 30

Topic Spotter
[Figure: topic-spotter score (e.g. average usefulness for the topic) plotted against position in a running stream of text; high-scoring sections are flagged as on-topic.]
Slide 31

Spotting topics in speech
• First convert the audio stream into a text stream using automatic speech recognition
• Consider overlapping sections of text corresponding to, say, 30 seconds of speech (depends on the application)
• Calculate the Average (or Total) Usefulness or Average (or Total) Salience of the words in the section of text for the topic
• Signal whenever this value exceeds a threshold
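A sketch of the windowing step, assuming the recogniser output is already a list of words and U is a usefulness table as in the earlier sketch (window size, step and threshold are arbitrary):

```python
def spot_topic(words, U, window=50, step=25, threshold=0.05):
    """Slide an overlapping window along the recognised word stream and signal
    every window whose average usefulness for the topic exceeds the threshold."""
    hits = []
    for start in range(0, max(len(words) - window, 0) + 1, step):
        section = words[start:start + window]
        if not section:
            break
        score = sum(U.get(w, 0.0) for w in section) / len(section)
        if score > threshold:
            hits.append((start, score))
    return hits

# words = word sequence produced by automatic speech recognition
# print(spot_topic(words, U))
```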
Slide 32

Example
• The AT&T “How May I Help You?” system
• Task: to understand what AT&T customers’ messages are about sufficiently well to connect them to the correct service
• Services can be human operators (who deal with a specific problem or speak a specific language) or automated services.
• Look HMIHY? up on the web
Slide 33

AT&T How May I Help You?
[Diagram: the caller’s speech passes through Speech Recognition and Language Processing (using a salient word list), and the call is routed to one of 15 services (Service 1, Service 2, Service 3, …, Service 15).]
Slide 34

AT&T How May I Help You?
• HMIHY? treats telephone network services as topics or documents, to be detected or retrieved
• Example salient words:
Word        Salience   Word          Salience
Difference  4.04       Dialled       1.29
Cost        3.39       Area          1.28
Rate        3.37       Time          1.23
Much        3.24       Person        1.23
Emergency   2.23       Charge        1.22
Misdialed   1.43       Home          1.13
Wrong       1.37       Information   1.11
code        1.36       credit        1.11
Allen Gorin, “Processing of semantic information in fluent spoken language”, Proc. ICSLP 1996
Slide 35

HMIHY Demonstrations
• See http://www.research.att.com/~algor/hmihy/samples.html
Slide 36

Summary
• Topics
• Modelling a document as a mixture of topics
• Latent Dirichlet Allocation
• Topic spotting
• Salience and usefulness
• How May I Help You?
Slide 37