Data Mining and Machine Learning
Topic Analysis
Peter Jančovič
Slide 1
Objectives
Statistical modelling of topics
Identifying topics in a document – Latent Dirichlet Allocation (LDA)
Topic Spotting
– Salience and Usefulness
– Example: The AT&T “How May I Help You?” system
Slide 2
Motivation
Example 1: You are responsible for competitor analysis in a large company. You need to monitor all media for press releases, news items and other articles relating to your company’s product range.
Example 2: You work for the police. You are given the task of monitoring 500 telephone lines for 12 months. You have to identify calls on these lines which are about illegal drug trafficking.
Example 3: You manage a call centre. You are concerned that some staff are being rude to the people that they are calling. You need to monitor all calls for a period of 6 months and detect all instances of ‘rudeness’.
Slide 3
Topics
“your company’s product range”, “illegal drug trafficking” and “rudeness” are all examples of topics
A document typically covers multiple topics
Topic Analysis is about decomposing a document into its component topics
Topic Spotting is about identifying documents that are relevant to a particular topic
The previous slide is a list of Topic Spotting problems
Slide 4
Topics as “bundles of words”
For any term w, P(w) is the probability of w
– Choose a document at random, and then choose a term at random from the document, P(w) is the probability that the term is w
– We know about P(w) from Zipf’s Law
If T is a topic, P(w|T) is the conditional probability of w given the topic T
– Choose a document about topic T at random, then choose a term at random from the document, P(w|T) is the probability that the term is w
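As a minimal sketch of how these two distributions could be estimated by counting, assuming a small tokenised corpus in which some documents carry an (illustrative) on-topic label for T:

```python
from collections import Counter

# Toy corpus: each document is a list of terms; on_topic marks documents about topic T
docs = [["race", "marathon", "london", "charity"],
        ["charity", "fund", "raising", "marathon"],
        ["sand", "sea", "deck-chair", "sandwiches"]]
on_topic = [True, True, False]          # illustrative labels for a topic T

all_counts = Counter(w for d in docs for w in d)
topic_counts = Counter(w for d, lab in zip(docs, on_topic) if lab for w in d)
N_all = sum(all_counts.values())        # total number of term tokens in the corpus
N_topic = sum(topic_counts.values())    # total number of term tokens in on-topic documents

def P(w):
    """P(w): probability that a term chosen at random from the corpus is w."""
    return all_counts[w] / N_all

def P_given_T(w):
    """P(w|T): probability that a term chosen at random from an on-topic document is w."""
    return topic_counts[w] / N_topic

print(P("marathon"), P_given_T("marathon"))   # 2/12 and 2/8
```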
Slide 5
Statistical modelling of topics
The conditional distribution P(w|T) is a “bundle of words” model of the topic T
A typical document is made up of multiple topics
– Example: Wikipedia entry on the London Marathon (next slide)
Latent Dirichlet Allocation (LDA) expresses a document as a combination of topics
The simplest way to understand LDA is to see how the LDA model generates a document
Slide 6
Documents have multiple topics
Topics include: London, marathons, fund-raising
The race was founded by the former Olympic champion and journalist Chris Brasher and athlete John Disley. It is organised by Hugh Brasher (son of Chris) as Race Director and Nick Bitel as Chief Executive. Set over a largely flat course around the River Thames, the race begins at three separate points around Blackheath and finishes in The Mall alongside St. James’s Park. Since the first marathon, the course has undergone very few route changes. In 1982, the finishing post was moved from Constitution Hill to Westminster Bridge due to construction works. It remained there for twelve years before moving to its present location at The Mall.
In addition to being one of the top six international marathons run over the distance of 26 miles and 385 yards, the IAAF standard for the marathon established in 1921 and originally used for the 1908 London Olympics, the London Marathon is also a large, celebratory sporting festival, third only to the Great North Run in South Shields and Great Manchester Run in Manchester in terms of the number of participants. The event has raised over £450 million for charity since 1981,[2][3] and holds the Guinness world record as the largest annual fund raising event in the world, with the 2009 participants raising over £47.2 million for charity. In 2007, 78% of all runners raised money. In 2011 the official charity of the London Marathon was Oxfam. In 2014, the official charity was Anthony Nolan, and in 2015, it will be Cancer Research UK.
Slide 7
Overview of the London Marathon, Wikipedia, January 2017
Latent Semantic Analysis
Latent Semantic Analysis can be seen as a method for automatically discovering topics in a corpus
W = U S V^T
In LSA the topic vectors are the columns of V
So, a topic is described as a document vector
If d is a document and vi is the ith “topic” (the ith column of V), then vec(d)·vi is a measure of the contribution of the ith topic to d
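A rough sketch of this with NumPy; the tiny term-document matrix below (documents as rows, terms as columns) is invented purely for illustration:

```python
import numpy as np

# Toy term-document matrix W: rows are documents, columns are terms (counts or TF-IDF weights)
W = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 0., 3.]])

# W = U S V^T; with this orientation the rows of Vt (i.e. the columns of V) lie in term space
U, S, Vt = np.linalg.svd(W, full_matrices=False)

d = W[0]        # a document vector
i = 0           # index of the "topic" of interest
v_i = Vt[i]     # the ith topic vector (ith column of V)

contribution = d @ v_i    # vec(d)·v_i, the contribution of the ith topic to d
print(contribution)
```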
Slide 8
Latent Dirichlet Allocation
Consider the document d:
“I eat sandwiches in a deck-chair on the sand by the sea” → “eat sandwiches deck-chair sand sea”
Intuitively d is made up of two topics, A and B:
– A: food, corresponding to “eat” and “sandwiches”
– B: seaside, corresponding to “deck-chair”, “sand” and “sea”
It looks like d is made up approximately of 40% topic A (food) and 60% topic B (seaside)
Slide 9
Latent Dirichlet Allocation
According to LDA, d might be generated as follows:
– Decide number of topics: N=2 “food” (A) and “seaside” (B)
– Decide the document length: M=5
– Decide the prior probabilities of the topics: PT(A) = 0.4, PT(B) = 0.6
– For i = 1 to M:
  – Choose the topic Ti randomly according to PT
  – Choose the word wi randomly according to P(w|Ti)
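A minimal sketch of this generative process in Python; the word distributions P(w|T) below are made-up numbers, not part of the slides:

```python
import random

# Word distributions P(w|T) for the two topics (made-up numbers)
P_w_given_T = {
    "A": {"eat": 0.5, "sandwiches": 0.5},                   # topic A: "food"
    "B": {"deck-chair": 0.34, "sand": 0.33, "sea": 0.33},   # topic B: "seaside"
}
P_T = {"A": 0.4, "B": 0.6}   # prior probabilities of the topics
M = 5                        # document length

document = []
for i in range(M):
    # Choose the topic T_i randomly according to P_T
    topic = random.choices(list(P_T.keys()), weights=list(P_T.values()))[0]
    # Choose the word w_i randomly according to P(w|T_i)
    dist = P_w_given_T[topic]
    document.append(random.choices(list(dist.keys()), weights=list(dist.values()))[0])

print(" ".join(document))    # e.g. "eat sand sea sandwiches deck-chair"
```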
Slide 10
Latent Dirichlet Allocation
So, according to this model the document d was generated as follows:
– i=1, T1 = A (“food”), w1 = “eat”
– i=2, T2 = A (“food”), w2 = “sandwiches”
– i=3, T3 = B (“seaside”), w3 = “deck-chair”
– i=4, T4 = B (“seaside”), w4 = “sand”
– i=5, T5 = B (“seaside”), w5 = “sea”
Slide 11
Latent Dirichlet Allocation
This is simple because we know the two topics and their associated word probability distributions
Given a corpus C and a number of topics N, a much bigger problem is to derive a set of N topics that cover C in some optimal sense
This is the clever part of LDA
LDA uses an “E-M” type algorithm to do this
Slide 12
Latent Dirichlet Allocation
Basically:
1. Make an initial estimate of N topics (remember, a topic is just a probability distribution over words)
2. Decompose each document in C into its component topics
3. Use this decomposition to re-estimate the topic word probability distributions
4. Go back to 2.
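In practice this iterative estimation is done with a library. A sketch using scikit-learn’s LatentDirichletAllocation (which fits the topics with a variational E-M style procedure); the toy corpus is purely illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus C; in practice this would be a large document collection
corpus = [
    "eat sandwiches deck-chair sand sea",
    "marathon race charity london runners",
    "eat lunch sandwiches picnic food",
    "charity fund raising marathon event",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)        # document-term count matrix

N = 2                                       # chosen number of topics
lda = LatentDirichletAllocation(n_components=N, random_state=0)
doc_topics = lda.fit_transform(X)           # per-document topic proportions

vocab = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_): # components_ holds the topic-word weights
    top = topic.argsort()[::-1][:5]
    print(f"topic {k}:", [vocab[i] for i in top])
print(doc_topics.round(2))
```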
Slide 13
Latent Dirichlet Allocation
See Edwin Chen’s blog “Introduction to Latent Dirichlet Allocation” for an explanation
The method is called “Latent Dirichlet Allocation” because the prior probabilities of the different topics, PT(·), are assumed to follow a Dirichlet distribution
Slide 14
Topic Spotting
Topic Spotting is a type of ‘dedicated’ IR
– The task is to find documents that are about a particular topic
– The corpus from which documents are retrieved is dynamic
Other examples:
– Detect all weather forecasts in BBC Radio 4 broadcasts
– Find all documents written by Charlotte Bronte
– Find all requirements in new EU railway legislation
Topic Spotting vs IR
– Because a topic is richer than a query, we can calculate probabilities P(t|T) and not just IDF(t)
Slide 15
Topic spotting
[Diagram: a data stream enters the Topic Spotter, which outputs accepted documents (“On Topic”) and rejected documents (“Off Topic”)]
Slide 16
TF-IDF weights
Recall the definition of the TF-IDF weight for a term t relative to a document d:
w(t,d) = f(t,d) × IDF(t), where IDF(t) = log( ND / NDt )
(ND is the total number of documents; NDt is the number of documents that contain t)
IDF(t) indicates how useful t is for discriminating between documents
f(t,d) ensures that t occurs sufficiently often to be useful
For Topic Spotting we can define a more sophisticated criterion to identify words that are indicative of a given topic
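A small sketch of these quantities computed directly from the definitions above (toy tokenised documents, no stemming or stop-word removal):

```python
import math
from collections import Counter

docs = [["eat", "sandwiches", "sand", "sea"],
        ["marathon", "race", "charity"],
        ["sea", "sand", "sun", "sea"]]

N_D = len(docs)    # total number of documents

def idf(t):
    """IDF(t) = log( ND / NDt ), where NDt is the number of documents containing t."""
    n_dt = sum(1 for d in docs if t in d)
    return math.log(N_D / n_dt)

def tf_idf(t, d):
    """w(t,d) = f(t,d) * IDF(t), where f(t,d) is the frequency of t in document d."""
    return Counter(d)[t] * idf(t)

print(idf("sea"), tf_idf("sea", docs[2]))
```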
Slide 17
Usefulness
Given a term t and a topic T, define the usefulness of t (relative to T) by:
U(t) = P(t|T) log( P(t|T) / P(t) )
If log( P(t|T) / P(t) ) is large, t is characteristic of the topic
If P(t|T) is large, then t occurs sufficiently often “on topic” to be useful for topic spotting
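A minimal sketch of this definition (natural logarithm and illustrative probabilities; the values are not from the slides):

```python
import math

def usefulness(p_t_given_T, p_t):
    """U(t) = P(t|T) * log( P(t|T) / P(t) )."""
    return p_t_given_T * math.log(p_t_given_T / p_t)

# Two terms that are both ten times more likely on topic than in general:
print(usefulness(0.02, 0.002))        # frequent on topic -> high usefulness
print(usefulness(0.0002, 0.00002))    # same ratio but rare -> much lower usefulness
```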
Slide 18
Usefulness and IDF
Recall IDF(t) = log( ND / NDt )
Given a set of documents S, write S = St ∪ St', where St is the set of documents that contain t and St' is the set of documents that don't contain t
Then
P(t) = P(t|St) P(St) + P(t|St') P(St')
     = P(t|St) P(St)                (since P(t|St') = 0)
     = P(t|St) × NDt / ND
Slide 19
Usefulness and IDF
Hence
P(t|St) / P(t) = ND / NDt, and IDF(t) = log( P(t|St) / P(t) )
Slide 20
Usefulness and IDF
IDF(t) = log( P(t|St) / P(t) ) is a measure of how useful the term t is for general information retrieval (or for retrieving documents that contain it?)
So, log( P(t|T) / P(t) ) is a measure of the usefulness of t for retrieving documents about topic T
Slide 21
‘Salience’
Similarly, given a term t and a topic T, define the salience of t (relative to T) by:
S(t) = P(T|t) log( P(T|t) / P(T) )
Using Bayes’ Theorem (below) it is easy to establish a relationship between salience and usefulness:
P(T|t) = p(t|T) P(T) / p(t)
Slide 22
Salience and Usefulness
S(t) = P(T|t) log( P(T|t) / P(T) )
     = [ p(t|T) P(T) / p(t) ] × log( p(t|T) P(T) / ( P(T) p(t) ) )
     = ( P(T) / p(t) ) × p(t|T) log( p(t|T) / p(t) )
     = ( P(T) / p(t) ) × U(t)
Slide 23
Salience and Usefulness
S(t) = ( P(T) / p(t) ) × U(t)
Now, T is the topic, so P(T) is fixed. Therefore S(t) ∝ ( 1 / p(t) ) × U(t)
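A quick numerical check of this relationship, with made-up values for P(T), p(t|T) and p(t|not-T):

```python
import math

# Made-up probabilities (not from the slides)
P_T = 0.1               # prior probability of the topic
p_t_given_T = 0.012     # p(t|T)
p_t_given_notT = 0.008  # p(t|not-T)

p_t = p_t_given_T * P_T + p_t_given_notT * (1 - P_T)         # p(t)
P_T_given_t = p_t_given_T * P_T / p_t                        # Bayes' theorem

salience_direct = P_T_given_t * math.log(P_T_given_t / P_T)  # S(t) from its definition
usefulness = p_t_given_T * math.log(p_t_given_T / p_t)       # U(t)
salience_via_U = (P_T / p_t) * usefulness                    # S(t) = (P(T)/p(t)) * U(t)

print(salience_direct, salience_via_U)   # the two values agree
```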
Slide 24
Salience and Usefulness
So, the main difference between salience and usefulness is that to have high usefulness, a term must occur frequently
Sometimes this means that the most useful words for a topic are not the ones you would immediately suspect:
– e.g. for Weather Forecast spotting, “north”, “south”, “east” and “west” turned out to be more ‘useful’ than “rain” and “sun” – why?
Slide 25
Example
A term w occurs:
– t1 times in documents about topic T
– t2 times in documents which are not about topic T
Total number of terms:
– in documents about topic T is N1
– in documents not about topic T is N2
The corpus contains C1 documents about T and C2 documents not about T
Then
– P(w|T) = t1/N1, P(w|not-T) = t2/N2
– P(w) = P(w|T)P(T) + P(w|not-T)P(not-T)
       = (t1/N1)·(C1/(C1+C2)) + (t2/N2)·(C2/(C1+C2))
       = (t1·N2·C1 + t2·N1·C2) / (N1·N2·(C1+C2))
Slide 26
Example
A term w occurs:
– t1 = 150 times in documents about topic T
– t2 = 230 times in documents which are not about topic T
Total number of terms:
– in documents about topic T is N1 = 12,500
– in documents not about topic T is N2 = 23,100
Suppose that only 10% of documents are “on topic”
So:
– P(w|T)=0.012, P(w)=0.0102, log(P(w|T)/P(w)) = 0.0706
– U(w) = 0.000847
– S(w) = (P(T)/P(w))×U(w) = (0.1/0.0102)×0.000847= 0.0083
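These figures can be reproduced by plugging the numbers into the formula from the previous slide; a base-10 logarithm is assumed here, and P(w) is rounded to 0.0102 before the log is taken, to match the working above:

```python
import math

t1, t2 = 150, 230          # occurrences of w in on-topic / off-topic documents
N1, N2 = 12_500, 23_100    # total terms in on-topic / off-topic documents
P_T = 0.1                  # 10% of documents are "on topic"

P_w_given_T = t1 / N1                                             # 0.012
P_w_given_notT = t2 / N2
P_w = round(P_w_given_T * P_T + P_w_given_notT * (1 - P_T), 4)    # 0.0102 (rounded as above)

U_w = P_w_given_T * math.log10(P_w_given_T / P_w)                 # ≈ 0.000847
S_w = (P_T / P_w) * U_w                                           # ≈ 0.0083

print(P_w_given_T, P_w, round(U_w, 6), round(S_w, 4))
```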
Slide 27
Application to Topic Spotting
1. Start with a training corpus of documents d1,…, dN. Each dn could be a separate document, or a section (e.g. paragraph) from the same document.
2. For each n decide whether dn is on-topic (T) or off-topic (not-T)
3. Apply stemming and stop-word removal if required
4. Identify the set of terms (the vocabulary) in the corpus: w1,…, wV.
5. For each v, calculate U(wv) – the usefulness of wv for the topic T.
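A sketch of steps 1–5 on a small made-up labelled corpus (step 3, stemming and stop-word removal, is skipped for brevity):

```python
import math
from collections import Counter

# Steps 1-2: a toy training corpus of tokenised documents with on-topic / off-topic labels
training = [
    (["rain", "showers", "north", "west", "wind"], True),    # on-topic (T)
    (["sunny", "spells", "east", "cold", "north"], True),
    (["match", "goal", "score", "rain", "late"],  False),    # off-topic (not-T)
    (["election", "vote", "poll", "north"],       False),
]

# Step 4: the vocabulary
vocab = sorted({w for doc, _ in training for w in doc})

# Step 5: usefulness U(w) = P(w|T) log( P(w|T) / P(w) ) for each vocabulary term
all_counts = Counter(w for doc, _ in training for w in doc)
topic_counts = Counter(w for doc, lab in training if lab for w in doc)
N_all, N_topic = sum(all_counts.values()), sum(topic_counts.values())

usefulness = {}
for w in vocab:
    if topic_counts[w] == 0:
        usefulness[w] = 0.0            # never occurs on topic
        continue
    p_w, p_w_T = all_counts[w] / N_all, topic_counts[w] / N_topic
    usefulness[w] = p_w_T * math.log(p_w_T / p_w)

print(sorted(usefulness.items(), key=lambda kv: -kv[1])[:5])
```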
Slide 28
Application (continued)
6. If required, choose a threshold X and discard any terms with usefulness less than X
7. For each document dn in the training set:
   – Let v1,…, vI(n) be the terms in dn
   – Calculate AU(dn) = (1/I(n)) × Σ i=1..I(n) U(vi)
   – AU(dn) is the average usefulness of terms in dn
Slide 29
Application (continued)
8. For a threshold W define a classification rule by:
– If AU(dn) > W, then dn is classified as “topic”
– If AU(dn) ≤ W, then dn is classified as “not-topic”
9. Choose a suitable value of W using the training documents. For example W could correspond to the Equal Error Rate
10. Classification: Given a new document d:
– Calculate AU(d)
– Classify d as “topic” if AU(d) > W, otherwise d is “not-topic”
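Steps 7–10 sketched in code; the usefulness table and the threshold W below are illustrative (in practice W would be tuned on the training documents, e.g. at the equal error rate):

```python
def average_usefulness(doc, usefulness):
    """AU(d): average of U(v) over the terms of d that survive the usefulness threshold."""
    terms = [w for w in doc if w in usefulness]
    if not terms:
        return 0.0
    return sum(usefulness[w] for w in terms) / len(terms)

def classify(doc, usefulness, W):
    """Classify d as "topic" if AU(d) > W, otherwise as "not-topic"."""
    return "topic" if average_usefulness(doc, usefulness) > W else "not-topic"

# Illustrative usage with a hand-made usefulness table and threshold
usefulness = {"rain": 0.03, "north": 0.05, "goal": 0.0, "vote": 0.0}
W = 0.02
print(classify(["rain", "north", "wind"], usefulness, W))   # -> "topic"
print(classify(["goal", "vote", "late"], usefulness, W))    # -> "not-topic"
```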
Slide 30
Topic Spotter
[Figure: Topic Spotter output – a score (roughly 0 to 0.4) plotted over successive positions in a running text stream]
Slide 31
Spotting topics in speech
First convert audio stream into a text stream using automatic speech recognition
Consider overlapping sections of text corresponding to, say, 30 seconds of speech (depends on the application)
Calculate the Average (or Total) Usefulness or Average (or Total) Salience of words in the section of text for the topic
Signal whenever this value exceeds a threshold
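A rough sketch of this windowing scheme, assuming the speech recogniser’s output has already been turned into (start time, word) pairs and that a usefulness table for the topic is available; the window length, hop and threshold below are illustrative:

```python
def spot_topic(timed_words, usefulness, window=30.0, hop=5.0, threshold=0.02):
    """Slide a window over the recognised words and signal the sections whose
    average usefulness for the topic exceeds the threshold."""
    if not timed_words:
        return []
    hits = []
    end_time = timed_words[-1][0]
    start = 0.0
    while start <= end_time:
        words = [w for t, w in timed_words if start <= t < start + window]
        if words:
            au = sum(usefulness.get(w, 0.0) for w in words) / len(words)
            if au > threshold:
                hits.append((start, start + window, round(au, 4)))
        start += hop
    return hits

# Illustrative usage: (time in seconds, recognised word) pairs from a speech recogniser
transcript = [(0.5, "today"), (1.2, "rain"), (2.0, "north"), (3.1, "west"),
              (40.0, "football"), (41.5, "goal")]
usefulness = {"rain": 0.03, "north": 0.05, "west": 0.04}
print(spot_topic(transcript, usefulness))    # -> [(0.0, 30.0, 0.03)]
```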
Slide 32
Example
The AT&T “How May I Help You?” system
Task: to understand what AT&T customers’ messages are about sufficiently well to connect them to the correct service
Services can be human operators (who deal with a specific problem or speak a specific language) or automated services.
Look HMIHY? up on the web
Slide 33
AT&T How May I Help You?
[Diagram: incoming speech → Speech Recognition → Language Processing (using a salient word list) → routed to one of the services: Service 1, Service 2, Service 3, …, Service 15]
Slide 34
AT&T How May I Help You?
HMIHY? Treats telephone network services as topics or documents, to be detected or retrieved
Example salient words:
Word        Salience | Word         Salience
Difference  4.04     | Dialled      1.29
Cost        3.39     | Area         1.28
Rate        3.37     | Time         1.23
Much        3.24     | Person       1.23
Emergency   2.23     | Charge       1.22
Misdialed   1.43     | Home         1.13
Wrong       1.37     | Information  1.11
code        1.36     | credit       1.11
Allen Gorin, “Processing of semantic information in fluent spoken language”, Proc. ICSLP 1996
Slide 35
HMIHY Demonstrations
See http://www.research.att.com/~algor/hmihy/samples.html
Slide 36
Summary
Topics
Modelling a document as a mixture of topics
Latent Dirichlet Allocation
Topic spotting
Salience and usefulness
How May I Help You?
Slide 37