
Data Mining and Machine Learning
Page Rank
Peter Jančovič
Slide 1

Objectives
• To understand the basic idea of the PageRank of a document in a corpus
• To understand how to calculate PageRank
• To understand the Markovian model that underlies PageRank
Slide 2

Not all documents are equal
• So far, whether or not a document d is retrieved in response to a query q depends only on sim(q,d)
• The assumption is that all documents are equal: the relevance of a document to a query depends only on the similarity score
• This is clearly not true (compare Wikipedia with my home page)
• The prior importance of a document is its PageRank
• Probabilistic interpretation of PageRank
Slide 3

The prior probability of a document
• Suppose that we could assign a probability P(d) to each document d in our corpus
• Think of P(d) as the probability that d is a relevant document before the user creates a query q
• P(d) is the prior (or a priori) probability of d
• In this case, whether d is returned in response to a query q depends on both sim(q,d) and P(d)
• We will treat P(d) as the PageRank of d
Slide 4

Retrieval using prior probabilities
• Retrieval based only on sim(q,d) assumes that P(d) is the same for all documents
• This case is called equal priors
• Intuitively we could do better if we could estimate more meaningful priors
• Assumption: the prior relevance of a document to any query is related to how often that document is accessed
Slide 5

Citation indices
• A similar idea is used to measure the quality of academic papers
• If a paper p contains important results or ideas, then lots of papers will refer to it
• The citation index ci(p) measures how many papers refer to a given paper p
• The citation index is a standard quality measure in research assessment
• But the quality of a paper depends not only on the quantity of papers that cite it but also on their quality, i.e. on their citation indices
Slide 6

Basics of Page Rank
• For a document, or page, d on the web, we could define the PageRank pr(d) to be the number of documents that have a hyperlink to d
• This relies on the underlying democracy of the web: users ‘vote with their mouse buttons’
• The ranking of a document d in response to q depends on both sim(q,d) and pr(d)
• But not all links are equal
Slide 7

The “Random Surfer Model”
• The solution is to allocate a weight w_de to the hyperlink from document d to document e
• w_de can be thought of as the probability of following the link to page e when the user is on page d
• If l(d) denotes the number of hyperlinks from d, setting w_de = 1/l(d) corresponds to the random surfer model: on any page, each of the available links is chosen with equal probability
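As a minimal sketch of the random surfer weights (the four-page link graph below is hypothetical, not from the slides):

```python
# Random surfer weights: w_de = 1 / l(d), where l(d) is the number of
# hyperlinks leaving page d. The link graph here is hypothetical.
links = {
    "a": ["b", "c", "d"],   # page "a" has three outgoing links
    "b": ["c"],
    "c": ["a", "d"],
    "d": ["a"],
}

weights = {
    d: {e: 1.0 / len(out) for e in out}   # each outgoing link is equally likely
    for d, out in links.items()
}

print(weights["a"]["b"])   # 1/3: any of a's three links is chosen with equal probability
```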
Slide 8

The “Intentional Surfer Model”
• In reality, all links on a page are not clicked with equal probability
• A better alternative is to estimate the w_de values using actual statistics of hyperlink use by surfers
• This is the intentional surfer model
• Organizations like Google collect and store this kind of information (I assume!)
Slide 9

Simplified Page Rank Calculation
• Once pr(d) is accepted as a measure of the importance of d, there is a natural consequence
• In the calculation of pr(d), a hyperlink from a page d1 to d should count for more than a hyperlink from page d2 to d if pr(d1) > pr(d2)
• This motivates:

    pr(d) = Σ_{e ∈ L(d)} pr(e) · w_ed

where L(d) is the set of pages which link to page d
• This is the simplified PageRank calculation
Slide 10

Simplified Page Rank Calculation
[Figure: four linked pages with PageRanks 0.35, 0.1, 0.15 and 0.25; the hyperlinks into page d carry weights 1/3, 1 and 1/2]

    pr(d) = 0.35 · (1/3) + 0.1 · 1 + 0.15 · (1/2) = 0.292
Slide 11

Example
Taken from Wikipedia: see http://en.wikipedia.org/wiki/PageRank
Slide 12

Simplified Page Rank Calculation
• Of course, changing pr(d) will change the PageRanks of the other pages, which in turn will change pr(d)…
• Hence the definition of PageRank is recursive, and pr(d) is calculated iteratively:

    pr_{n+1}(d) = Σ_{e ∈ L(d)} pr_n(e) · w_ed
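The iterative calculation above can be sketched in a few lines of Python. The four-page link graph is hypothetical, and the weights follow the random surfer model:

```python
# Iterative simplified PageRank: pr_{n+1}(d) = sum over e in L(d) of pr_n(e) * w_ed.
# Hypothetical four-page link graph.
links = {
    "a": ["b", "c", "d"],
    "b": ["c"],
    "c": ["a", "d"],
    "d": ["a"],
}
pages = sorted(links)
pr = {p: 1.0 / len(pages) for p in pages}   # pr_0: uniform over all pages

for _ in range(50):
    new_pr = {p: 0.0 for p in pages}
    for d, out in links.items():
        w = 1.0 / len(out)                  # random surfer weight w_de = 1/l(d)
        for e in out:
            new_pr[e] += pr[d] * w          # pr_n(d) flows along each link d -> e
    pr = new_pr

# Total probability is preserved because every page has outgoing links.
print({p: round(v, 3) for p, v in pr.items()})   # {'a': 0.375, 'b': 0.125, 'c': 0.25, 'd': 0.25}
```

For this graph the iteration settles quickly; the printed values are the fixed point of the update.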
Slide 13

Markov Chain interpretation
w w  11 12
 w  1D 
 Let
W….
 w2D  wd1 wd2 … wdD 
    wD1 wD2 … wDD 
where wij is the probability of a user following a hyperlink between the ith and jth pages and D is the number of pages – this is the page transition probability matrix
 Notice that each row of W sums to 1
w21 w22
   
Slide 14

Markov Chain interpretation
pr T  pr 1, pr 2,…, pr D – pr (i) is the Page Rank nnnnn
 Let
of the ith page after n iterations
 Then pr(n+1) = WTprn , or prn = (WT)npr0
 In Markov Chain terminology, wde is the transition
probability from page d to page e
 Can think of wde as the probability of page e at time
t+1 given page d at time t: P(e @ t+1 | d @ t)
 prn is an estimate of the probability distribution
over all of the pages after the nth iteration
 In this case  pr (d )  1
n d
Slide 15

Markov chain interpretation
pr 1
n1 w w
pr1  w  n
pr 2  11 21 n1 w w
D1pr2  w  n
   12 22
w w … w
D2    pr d. . . .prd
 n1   1d 2d
 
Dd  n   
w w … w 1D2D DD
pr D prD n1  n 
Slide 16

Markov Chain interpretation
• If this system converges, then

    ( pr(1) )   ( w_11  w_21  …  w_D1 ) ( pr(1) )
    ( pr(2) )   ( w_12  w_22  …  w_D2 ) ( pr(2) )
    (   ⋮   ) = (   ⋮     ⋮        ⋮  ) (   ⋮   )
    ( pr(D) )   ( w_1D  w_2D  …  w_DD ) ( pr(D) )

• i.e. pr = W^T pr
• In other words pr is an eigenvector of W^T with eigenvalue 1
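A quick numerical check of this claim, on a hypothetical four-page link graph with random surfer weights: the vector that power iteration converges to is indeed a fixed point of W^T.

```python
# Check that the converged PageRank vector is an eigenvector of W^T with
# eigenvalue 1. The four-page link graph is hypothetical.
links = {"a": ["b", "c", "d"], "b": ["c"], "c": ["a", "d"], "d": ["a"]}
pages = sorted(links)
D = len(pages)
idx = {p: i for i, p in enumerate(pages)}

# Row-stochastic transition matrix W with random surfer weights w_de = 1/l(d).
W = [[0.0] * D for _ in range(D)]
for d, out in links.items():
    for e in out:
        W[idx[d]][idx[e]] = 1.0 / len(out)

# Power iteration: pr_{n+1} = W^T pr_n, starting from the uniform distribution.
pr = [1.0 / D] * D
for _ in range(200):
    pr = [sum(W[j][i] * pr[j] for j in range(D)) for i in range(D)]

# At convergence each component reproduces itself under W^T.
residual = max(abs(pr[i] - sum(W[j][i] * pr[j] for j in range(D))) for i in range(D))
print(residual < 1e-10)   # True: pr = W^T pr, an eigenvector with eigenvalue 1
```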
Slide 17

Damping Factor
• The model we have used to develop PageRank is a “random surfer” model with ‘proper’ hyperlink probabilities
• The random surfer will eventually stop clicking
• The probability that the random surfer continues clicking when he arrives at a page is called the damping factor and denoted by δ
• A typical value of δ is 0.85
Slide 18

Page Rank
• Taking into account the damping factor,

    pr(d) = (1 − δ)/N + δ · Σ_{e ∈ L(d)} pr(e) · w_ed

where N is the number of documents
Slide 19

Notes
• Assuming that pr(e) is the probability of page e, this formula preserves Σ_d pr(d) = 1
• The formula assigns a ‘floor’ value of (1 − δ)/N to a page that has no incoming hyperlinks (so that it has non-zero PageRank)
• In addition, the damping factor reduces the effect of past estimates of PageRank on the present estimate
Slide 20

Notes
• This lecture presents a probabilistic approach to PageRank
• “PageRank” is a trademark of Google
• It was developed by Larry Page between 1995 and 1998
• Larry Page is one of the founders of Google Inc.
• A high PageRank is a valuable asset for a www page, for example to attract advertising
• Hence the precise details of the Google PageRank algorithm are secret!
Slide 21