CSCI 5250: Information Retrieval and Search Engine
Lecture 8: Massive Link Analysis
1
CMSC5741 Big Data Tech. & Apps.
Prof. Michael R. Lyu
Computer Science & Engineering Dept.
The Chinese University of Hong Kong
2
How Google return such kind of rankings (e.g. Charles Kao Wikipedia first then his NobelPrize. Even his passing away)?
One important factor is PageRank score.
What’s the Mechanism Distinguished Google?
2
How to make PageRank computation scalable
There are billions of web pages and hyperlinks between them, how to compute their ranking score (e.g., PageRank) efficiently?
3
Outline
Web as a Graph
PageRank
Topic-Specific PageRank
Appendix: Trust-Rank
4
Outline
Web as a Graph
PageRank
Topic-Specific PageRank
Appendix: Trust-Rank
5
Web as a Graph
Web as a directed graph:
Nodes: Web pages
Edges: Hyperlinks
6
Web as a Directed Graph
7
Broad Question
How to organize the Web?
First try: Human curated Web directories
Yahoo, DMOZ, LookSmart
Second try: Web Search
Information Retrieval investigates: Find relevant docs in a small and trusted set
Newspaper articles, Patents, etc.
But: Web is huge, full of untrusted documents, random things, web spam, etc.
8
Web Search: 2 Challenges
Web contains many sources of information: Who to “trust”?
Trick: Trustworthy pages may point to each other
What is the “best” answer to query “newspaper”?
No single right answer
Trick: Pages that actually know about newspapers might all be pointing to many newspapers
9
Ranking Nodes on the Graph
All web pages are not equally “important”
www.cuhk.edu.hk vs. www.joe-schmoe.com
There is large diversity in the web-graph node connectivity.
Rank the pages by the link structure!
10
Link Analysis Algorithms
We will cover the following Link Analysis approaches for computing importance of nodes in a graph
PageRank
Topic-Specific (Personalized) PageRank
Web Spam Detection Algorithms, e.g. TrustRank
11
Outline
Web as a Graph
PageRank
Topic-Specific PageRank
Appendix: Trust-Rank
12
Links as Votes
Idea: Links as votes
Page is more important if it has more links
In-coming links? Out-going links?
Think of in-links as votes:
www.cuhk.edu.hk has 15,432 in-links
www.joe-schmoe.com has 1 in-link
Are all in-links equal?
Link from important pages count more
Recursive question!
13
Example: PageRank Scores
14
Simple Recursive Formulation
Each link’s vote is proportional to the importance of its source page
If page j with importance has n out-links, each link gets votes.
Page j’s own importance is the sum of the votes on its in-links
15
PageRank: The “Flow” Model
A “vote” from an important page is worth more
A page is important if it is pointed to by other important pages
Define a “rank” for page j
16
Solving the Flow Equations
3 equations, 3 unknowns,
no constants
No unique solution
All solutions equivalent modulo the scale factor
Additional constraint forces uniqueness:
Solution:
Gaussian Elimination method works for small examples, but we need a better method for large web-size graphs
17
PageRank: Matrix Formulation
Stochastic adjacency matrix M
Let page have out-links
If , then , else
M is a column stochastic matrix , i.e., columns sum to 1
Rank vector r: vector with an entry per page
is the importance score of page
The flow equations can be written
18
Example
Remember the flow equation:
Flow equation in the matrix form:
Suppose page i links to 3 pages, including j
19
Eigenvector Formulation
The flow equations can be written
So the rank vector r is an eigenvector of the stochastic web matrix M
In fact, its first or principal eigenvector, with corresponding eigenvalue 1
Largest eigenvalue of M is 1 since M is column stochastic
We can now efficiently solve for r!
The method is called Power iteration.
20
Example: Flow Equation & M
21
Power Iteration Method
Given a web graph with n nodes, where the nodes are pages and edges are hyperlinks
Power Iteration: a simple iterative scheme
Suppose there are N web pages
Initialize:
Iterate:
Stop when
is the L1 norm
22
Why Power Iteration Works?
Power iteration:
A method for finding dominant eigenvector (the vector corresponding to the largest eigenvalue)
Claim:
Sequence approaches the dominant eigenvector of M
23
Why Power Iteration Works?
Proof:
Assume M has n linearly independent eigenvectors x1, x2, …, xn with corresponding eigenvalues
Vectors x1, x2, …, xn form a basis and thus we can write:
Repeated multiplication on both sides:
24
Why Power Iteration Works?
Proof: (cont.)
Repeated multiplication on both sides produces
Since then , so
Thus
Note if , then the method won’t converge
25
Random Walk Interpretation
Imagine a random web surfer:
At any time t, surfer is on some page i
At time t+1, the surfer follows an out-link from i uniformly at random
Ends up on some page j linked from i
Process repeats indefinitely
Let:
p(t) … vector whose ith coordinate is the prob. that the surfer is at page i at time t
So p(t) is a probability distribution over pages
26
The Stationary Distribution
Where is the surfer at time t+1
Follows a link uniformly at random
Suppose the random walk reaches a state
Then p(t) is stationary distribution of a random walk
Our original rank vector r satisfies
So r is a stationary distribution for the random walk
27
PageRank
Three questions:
Does this converge?
Does it converge to what we want?
Are results reasonable?
28
Does This Converge?
29
Does it Converge to What We Want?
30
PageRank: Problems
Two problems:
Spider traps: all out-links are within the group
Eventually spider traps absorb all importance
Some pages are dead ends (have no out-links)
Such pages cause importance to “leak out”
31
Problem: Spider Traps
Power Iteration:
Set rj = 1
And iterate
Example
32
Solution: Random Teleports
The Google solution for spider traps: At each time step, the random surfer has two options
With prob. , follow a link at random
With prob. , jump to some random page
Common values for are in the range 0.8 to 0.9
Surfer will teleport out of spider trap within a few time steps
33
Problem: Dead Ends
Power Iteration:
Set rj = 1
And iterate
Example
34
Solution: Always Teleport
Teleports: Follow random teleport links with probability 1.0 from dead-ends
Adjust matrix accordingly
35
Why Teleports Solve the Problem?
Markov chains
Set of states X
Transition matrix P where
specifying the stationary probability of being at each state
Goal is to find such that
36
Why Is This Analogy Useful?
Theory of Markov chains
Fact: For any start vector, the power method applied to a Markov transition matrix P will converge to a unique positive stationary vector as long as P is stochastic, irreducible and aperiodic.
37
Make M Stochastic
Stochastic: Every column sums to 1
A possible solution: add green links
38
Make M Aperiodic
A chain is periodic if there exists k > 1 such that the interval between two visits to some state s is always a multiple of k.
A possible solution: add green links
39
Make M Irreducible
From any state, there is a non-zero probability of going from any one state to any another
A possible solution: add green links
40
Solution: Random Jumps
Google’s solution that does it all:
Makes M stochastic, aperiodic, irreducible
At each step, random surfer has two options:
With probability , follow a link at random
With probability , jump to some random page
PageRank equation [Brin-Page,98]
41
This formulation assumes that M has no dead ends. We can either preprocess matrix M to remove all dead ends or explicitly follow random teleport links with probability 1.0 from dead-ends.
The Google Matrix
PageRank equation [Brin-Page,98]
The Google Matrix A:
A is stochastic, aperiodic and irreducible, so
In practice (make 5 steps and jump)
42
In-class Practice
Go to Practice
43
Computing PageRank
Key step is matrix-vector multiplication
Easy if we have enough main memory to hold
Say N=1 billion pages
We need 4 bytes for each entry (say)
2 billion entries for vectors, approx. 8GB
Matrix A has N2 entries: 1018 is a large number!
44
Matrix Formulation
Suppose there are N pages
Consider page j, with dj out-links
We have Mij = 1/|dj| when and Mij = 0 otherwise
The random teleport is equivalent to:
Adding a teleport link from j to every other page and setting transition prob. to
Reducing the prob. of following each out-link from
45
Rearranging the Equation
46
Note: Here we assumed M has no dead-ends
[x]N … a vector of length N with all entries x
Spare Matrix Formulation
We just rearranged the PageRank equation
M is a sparse matrix! (with no dead-ends)
10 links per node, approx. 10N entries
So in each iteration, we need to
Compute
Add a constant to each entry in
Note: if M contains dead-ends then and we also have to renormalize so that it sums to 1
47
PageRank: The Complete Algorithm
48
Sparse Matrix Encoding
Encode sparse matrix using only nonzero entries
Space proportional roughly to number of links
Say 10N, or 4*10*1 billion = 40GB
Still won’t fit in memory, but will fit on disk
49
Basic Algorithm: Update Step
Assume enough RAM to fit into memory
Store and matrix M on disk
Then 1 step of power-iteration is:
Initialize all entries of to
For each page p (of out-degree n):
Read into memory: p, n, dest1 , …, destn , rold (p)
For j=1…n: rnew (destj ) += β rold (p)/n
50
Analysis
Assume enough RAM to fit into memory
Store and matrix M on disk
In each iteration, we have to:
Read and M
Write back to disk
IO cost = 2|r| + |M|
Question:
What if we could not even fit in memory
51
Block-based Update Algorithm
52
Analysis of Block Update
Similar to nested-loop join in databases
Break into k blocks that fit in memory
Scan M and once for each block
k scans of M and
k(|M|+|r|) + |r| = k|M| + (k+1)|r|
Can we do better?
Hint: M is much bigger than r (approx. 10-20x), so we must avoid reading it k times per iteration
53
Block-Strip Update Algorithm
54
Block-Strip Analysis
Break M into stripes
Each strip contains only destination nodes in the corresponding block of
Some additional overhead per stripe
But it is usually worth it
Cost per iteration
55
Some Problems with PageRank
Measures generic popularity of a page
Biased against topic-specific authorities
Solution: Topic-Specific PageRank (next)
Susceptible to Link spam
Artificial link topographies created in order to boost page rank
Solution: TrustRank (next)
Uses a single measure of importance
Solution: Hubs-and-Authorities (in further reading)
56
Outline
Web as a Graph
PageRank
Topic-Specific PageRank
Appendix: Trust-Rank
57
Topic-Specific PageRank
Instead of generic popularity, can we measure popularity within a topic?
Goal: Evaluate Web pages not just according to their popularity, but by how close they are to a particular topic, e.g. “sports” or “history”
Allows search queries to be answered based on interests of the user
Example: Query “Trojan” wants different pages depending on whether you are interested in sports, history and computer security
58
Topic-Specific PageRank
Random walker has a small probability of teleporting at any step
Teleport can go to:
Standard PageRank: Any page with equal probability
To avoid dead-end and spider-trap problems
Topic Specific PageRank: A topic-specific set of “relevant” pages (teleport set)
59
Topic-Specific PageRank
Idea: Bias the random walk
When walker teleports, she picks a page from a set S
S contains only pages that are relevant to the topic
e.g., Open Directory (DMOZ) pages for a given topic/query
For each teleport set S, we get a different vector rs
60
Matrix Formulation
To make this work all we need is to update the teleportation part of the PageRank formulation:
A is stochastic!
We have weighted all pages in the teleport set S equally
Could also assign different weights to pages!
Compute as for standard PageRank
61
Example
62
Discovering the Topic
Create different PageRanks for different topics
The 16 DMOZ top-level categories: arts, business, sports, …
Which topic ranking to use?
User can pick from a menu
Classify query into a topic
Can use the context of the query
E.g., query is launched from a web page talking about a known topic
User context, e.g., user’s bookmarks,…
63
SimiRank: An Application of Personalized PageRank
SimRank: Random walks from a fixed node on k-partite graphs
Setting: k-partite graph with k types of nodes
Example: picture nodes and tag nodes
How to find nodes similar to node u?
Do a Random-Walk with Restarts from node u
i.e. teleport set S = {u}
64
SimiRank(cont.)
Resulting scores measures similarity/proximity to node u
Generally applicable to social networks (typically undirected graphs)
Problems:
Must be done once for each node u
Suitable for sub-Web-scale applications
65
SimRank: Example
66
SimRank: Example
67
Outline
Web as a Graph
PageRank
Topic-Specific PageRank
Appendix: Trust-Rank
Skip to conclusion
68
What is Web Spam?
Spamming:
Any deliberate action to boost a web page’s position in search engine results, incommensurate with page’s real value
Spam:
Web pages that are the result of spamming
Approximately 10-15% of web pages are spam
69
Web Search
Early Search Engines
Crawl the Web
Index pages by the words they contained
Respond to search queries with pages containing those words
70
Web Search
Early page ranking
Attempt to order pages matching a search query by “importance”
First search engines considered:
Number of times query words appeared
Prominence of word position, e.g. title, header
71
First Spammers
Those with commercial interests tried to exploit search engines to bring people to their own site – whether they wanted to be there or not
Example
Shirt-seller might pretend to be about “movies”
Techniques for achieving high relevance/importance for a web page
72
First Spammers: Term Spam
How do you make your page appear to be about movies?
Add the word movie 1,000 times to your page, set text color to the background color, so only search engines would see it
Or, run the query “movie” on your target search engine, see what page came first in the listings, copy it into your page, make it “invisible”
These and similar techniques are term spam
73
Google’s Solution to Term Spam
Believe what people say about you, rather than what you say about yourself
Use words in the anchor text (words that appear underlined to represent the link) and its surrounding text
PageRank as a tool to measure the “importance” of Web pages
74
Why It Works?
Our hypothetical shirt-seller looses
Saying he is about movies doesn’t help, because others don’t say he is about movies, his page isn’t very important, so it won’t be ranked high for shirts or movies
Example:
Shirt-seller creates 1,000 pages, each links to his with “movie” in the anchor text, these pages have no links in, so they get little PageRank
So the shirt-seller can’t beat truly important movie, pages, like IMDB
75
Spam Farms
Once Google became the dominant search engine, spammers began to work out ways to fool Google
Spam farms were developed to concentrate PageRank on a single page
Link spam:
Creating link structures that boost PageRank of a particular page
76
Link Spamming
Three kinds of web pages from a spammer’s point of view
Inaccessible pages
Accessible pages
E.g. blog comments pages
Spammer can post links to his pages
Own pages
Completely controlled by spammer
May span multiple domain names
77
Link Farms
Spammer’s goal:
Maximize the PageRank of target page t
Technique:
Get as many links from accessible pages as possible to target page t
Construct “link farm” to get PageRank multiplier effect
78
Link Farms
79
Analysis
x: PageRank contributed by accessible pages
y: PageRank of target page t
Rank of each “farm” page =
80
Very small: ignore now we solve for y
Analysis
Multiplier effect for “acquired” PageRank
By making M large, we can make y as large as we want
81
TrustRank: Combating Spam
Combating term spam
Analyze text using statistical methods
Similar to email spam filtering
Also useful: Detecting approximate duplicate pages
Combating link spam
Detection and blacklisting of structures that look like spam farms
Leads to another war – hiding and detecting spam farms
TrustRank = topic-specific PageRank with a teleport set of “trusted” pages
Example: .edu domains, similar domains for non-US schools
82
TrustRank: Idea
Basic principle: Approximate isolation
It is rare for a “good” page to point to a “bad” (spam) page
Sample a set of seed pages from the web
Have an oracle (human) to identify the good pages and the spam pages in the seed set
Expensive task, so we must make seed set as small as possible
83
Trust Propagation
Call the subset of seed pages that are identified as good the trusted pages
Perform a topic-sensitive PageRank with teleport set = trusted pages
Propagate trust through links:
Each page gets a trust value between 0 and 1
Use a threshold value and mark all pages below the trust threshold as spam
84
Why is It a Good Idea?
Trust attenuation:
The degree of trust conferred by a trusted page decreases with the distance in the graph
Trust splitting:
The larger the number of out-links from a page, the less scrutiny the page author gives each out-link
Trust is split across out-links
85
Picking the Seed Set
Two conflicting considerations:
Human has to inspect each seed page, so seed set must be as small as possible
Must ensure every good page gets adequate trust rank, so need make all good pages reachable from seed set by short paths
86
Approaches to Picking Seed Set
Suppose we want to pick a good set of k pages
How to do that?
PageRank
Pick the top k pages by PageRank
Theory is that you can’t get a bad page’s rank really high
Use trusted domains whose membership is controlled, like .edu, .mil, .gov
87
Spam Mass
In the TrustRank model, we start with good pages and propagate trust.
Complementary view:
What fraction of a page’s PageRank comes from spam pages?
In practice, we don’t know all the spam pages, so we need to estimate.
88
Spam Mass Estimation
rp = PageRank of page p
rp+ = PageRank of p with teleport into trusted pages only
Then: What fraction of a page’s PageRank comes from spam pages?
Spam mass of
89
One-slide Takeaway
Web as a Graph
Denote the web structure as a graph
PageRank
PageRank score reflect the importance of web pages
Topic-Specific PageRank
Evaluate web pages by their popularity as well as particular topic
Trust-Rank
Deal with link spams
90
Further Reading
Original PageRank paper: http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
An Analytical Comparison of Approaches to Personalizing PageRank: http://www-cs-students.stanford.edu/~taherh/papers/comparison.pdf
91
Further Reading
HITS algorithm: http://www.cs.cornell.edu/home/kleinber/auth.pdf
Parallel PageRank: http://link.springer.com/content/pdf/10.1007%2F11735106_22.pdf
92
93
Reference
http://www.stanford.edu/class/cs246/slides/09-pagerank.pdf
http://www.stanford.edu/class/cs246/slides/10-spam.pdf
http://i.stanford.edu/~ullman/mmds/ch5.pdf
http://en.wikipedia.org/wiki/PageRank
http://en.wikipedia.org/wiki/HITS_algorithm
93
In-class Practice
Compute the final PageRank Score of the given graph, with Google matrix A and assume
Show the matrix A, and solve by both Gaussian Elimination and Power Iteration methods
94
/docProps/thumbnail.jpeg