High dim. data
Dimension ality reduction
Graph Infinite data data
PageRank,
Machine learning
Apps
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
2
Facebook social graph
4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 3
Connections between political blogs
Polarization of the network [Adamic-Glance, 2005]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 4
Citation networks and Maps of science
[Börner et al., 2012]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 5
domain2
domain3
Internet
domain1
router
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 6
Seven Bridges of Königsberg
[Euler, 1735]
Return to the starting point by traveling each link of the graph once and only once.
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 7
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 8
How to organize the Web? First try: Human curated
Web directories
▪ Yahoo, DMOZ, LookSmart Second try: Web Search
▪ Information Retrieval investigates: Find relevant docs in a small
and trusted set
▪ Newspaper articles, Patents, etc.
▪ But: Web is huge, full of untrusted documents,
random things, web spam, etc.
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 9
2 challenges of web search:
(1) Web contains many sources of information Who to “trust”?
▪ Trick: Trustworthy pages may point to each other! (2) What is the “best” answer to query
“newspaper”?
▪ No single right answer
▪ Trick: Pages that actually know about newspapers
might all be pointing to many newspapers
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 10
All web pages are not equally “important”
There is large diversity in the web-graph
node connectivity. Let’s rank the pages by the link structure!
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 11
We will cover the following Link Analysis approaches for computing importances of nodes in a graph:
▪ Page Rank
▪ Topic-Specific (Personalized) Page Rank ▪ Web Spam Detection Algorithms
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 12
Idea: Links as votes
▪ Page is more important if it has more links ▪ In-coming links? Out-going links?
Think of in-links as votes: Are all in-links are equal?
▪ Links from important pages count more
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 13
A 3.3
B 38.4
C 34.3
D 3.9
E 8.1
1.6
F 3.9
1.6
1.6
1.6
1.6
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
14
Each link’s vote is proportional to the importance of its source page
If page j with importance rj has n out-links, each link gets rj / n votes
Page j’s own importance is the sum of the
votes on its in-links
ir/3 k
i rk/4
j
rj/3
rj/3 rj/3
rj = ri/3+rk/4
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 15
A
B
C
D
16
ABCD
A B C D
0110 1001 1001 1100
17
ABCD
A B C D
0 1/2 1/1 0 1/3 0 0 1/2 1/3 0 0 1/2 1/3 1/2 0 0
A B C D
0.25 0.25 0.25 0.25
18
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 19