Data Mining and Machine Learning
Page Rank Peter Jančovič
Slide 1
Data Mining and Machine Learning
Objectives
To understand the basic idea of the PageRank of a document in a corpus
To understand how to calculate PageRank
To understand the Markovian model that underlies
PageRank
Slide 2
Data Mining and Machine Learning
Not all documents are equal
So far, whether or not a document d is retrieved in
response to a query q depends only on sim(q,d)
Assumption is that all documents are equal – relevance of a document for a query depends only on the similarity score
This is clearly not true (compare Wikipedia with my home page)
Prior importance of a document is its Page rank
Probabilistic interpretation of Page rank
Slide 3
Data Mining and Machine Learning
The prior probability of a document
Suppose that we could assign a probability P(d) to
each document d in our corpus
Think of P(d) as the probability that d is a relevant
document before the user creates a query q
P(d) is the prior (or a priori ) probability of d
In this case, whether d is returned in response to a query q depends on sim(q,d) and P(d)
We will treat P(d) as the Page rank of d
Slide 4
Data Mining and Machine Learning
Retrieval using prior probabilities
Retrieval based only on sim(q,d) assumes that P(d) is the same for all documents
This case is called equal priors
Intuitively we could do better if we could estimate
more meaningful priors
Assumption: the prior relevance of a document to any query is related to how often that document is accessed
Slide 5
Data Mining and Machine Learning
Citation indices
Similar idea used to measure quality of academic papers
If a paper p contains important results or ideas, then lots of papers will refer to it
The citations index ci(p) measures how many papers refer to a given paper p
Citations index is a standard quality measure in research assessment
But, quality of a paper depends not only on the quantity of papers that cite it but on their quality – their citation indices
Slide 6
Data Mining and Machine Learning
Basics of Page Rank
For a document, or a page, d on the web, we could defined the Page Rank pr(d) to be the number of documents that have a hyperlink to d
This relies on the underlying democracy of the web – users ‘vote with their mouse buttons’
The ranking of a document d in response to q depends on both sim(q,d) and pr(d)
But not all links are equal
Slide 7
Data Mining and Machine Learning
The “Random Surfer Model”
The solution is to allocate a weight of wde to the
hyperlink from document d to document e
wde can be thought of as the probability of following
the link to page e if the user is on page d
If l(d) denotes the number of hyperlinks from d, setting wde = 1/l(d) corresponds to the random surfer model: on any page any of the available links are chosen with equal probability
Slide 8
Data Mining and Machine Learning
The “Intentional Surfer Model”
In reality all links on a page are not clicked with equal probability
A better alternative is to estimate the wdes using actual statistics of hyperlink use by surfers
This is the intentional surfer model
Organizations like Google collect and store this kind
of information (I assume!)
Slide 9
Data Mining and Machine Learning
Simplified Page Rank Calculation
Once pr(d) is accepted as a measure of the
importance of d there is a natural consequence
In the calculation of pr(d), a hyperlink from a page d1 to d should count for more than a hyperlink from page d2 to d if pr(d1) > pr(d2)
This motivates:
prd prew
ed eLd
where L(d) is the set of pages which link to page d This is the simplified Page rank calculation
Slide 10
Data Mining and Machine Learning
Simplified Page Rank Calculation
pr = 0.35
1/3
1/3
pr = 0.1
1
1/2
pr = 0.15
1/3 1
pr = 0.25
pr(d)
1/2
p r d 0 . 3 5 1 0 . 1 1 0 . 1 5 1 0 . 2 9 2 3 2
Slide 11
Data Mining and Machine Learning
Example
Taken from wikipedia: see http://en.wikipedia.org/wiki/PageRank
Slide 12
Data Mining and Machine Learning
Simplified Page Rank Calculation
Of course, changing pr(d) will change the Page Ranks of the other pages, which in turn will change pr(d)….
Hence the definition of Page Rank is recursive, and pr(d) is calculated iteratively:
pr dprew
n1
eLd
n ed
Slide 13
Data Mining and Machine Learning
Markov Chain interpretation
w w 11 12
w 1D
Let
W….
w2D wd1 wd2 … wdD
wD1 wD2 … wDD
where wij is the probability of a user following a hyperlink between the ith and jth pages and D is the number of pages – this is the page transition probability matrix
Notice that each row of W sums to 1
w21 w22
Slide 14
Data Mining and Machine Learning
Markov Chain interpretation
pr T pr 1, pr 2,…, pr D – pr (i) is the Page Rank nnnnn
Let
of the ith page after n iterations
Then pr(n+1) = WTprn , or prn = (WT)npr0
In Markov Chain terminology, wde is the transition
probability from page d to page e
Can think of wde as the probability of page e at time
t+1 given page d at time t: P(e @ t+1 | d @ t)
prn is an estimate of the probability distribution
over all of the pages after the nth iteration
In this case pr (d ) 1
n d
Slide 15
Data Mining and Machine Learning
Markov chain interpretation
pr 1
n1 w w
pr1 w n
pr 2 11 21 n1 w w
D1pr2 w n
12 22
w w … w
D2 pr d. . . .prd
n1 1d 2d
Dd n
w w … w 1D2D DD
pr D prD n1 n
Slide 16
Data Mining and Machine Learning
Markov Chain interpretation If this system converges, then
pr1 w w w pr1 pr2 11 21 D1 pr2
12 22 D2
w w w
w w … w prd . . . . prd
1d 2d
Dd
w w … w 1D2D DD
prD prD
pr = WTpr
In other words pr is an eigenvector of WT with
eigenvalue 1
Slide 17
Data Mining and Machine Learning
Damping Factor
The model we have used to develop Page Rank is a “random surfer” model with ‘proper’ hyperlink probabilities
The random surfer will eventually stop clicking
The probability that the random surfer continues clicking when he arrives at a page is called the damping factor and denoted by δ
A typical value of δ is 0.85
Slide 18
Data Mining and Machine Learning
Page Rank
Taking into account the damping factor,
prd1 prew N ed
eLd where N is the number of documents
Slide 19
Data Mining and Machine Learning
Notes
Assuming that p(e) is the probability of the page d,
then this formula preserves prd1 d
1
The formula assigns a ‘floor’ value of N to a
page that has no incoming hyperlinks (so that it has non-zero page rank)
In addition, the damping factor reduces the effect of past estimates of PageRank on the present estimate
Slide 20
Data Mining and Machine Learning
Notes
This lecture presents a probabilistic approach to Page rank
“PageRank” is a trademark of Google
It was developed by Larry Page between 1995 and
1998
Larry Page is one of the founders of Google Inc.
A high PageRank is a valuable asset for a www page, for example to attract advertising
Hence the precise details of the Google PageRank algorithm are secret!
Slide 21
Data Mining and Machine Learning