CS计算机代考程序代写 algorithm information retrieval High dim. data

High dim. data
Dimension ality reduction
Graph Infinite data data
PageRank,
Machine learning
Apps
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
2

Facebook social graph
4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 3

Connections between political blogs
Polarization of the network [Adamic-Glance, 2005]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 4

Citation networks and Maps of science
[Börner et al., 2012]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 5

domain2
domain3
Internet
domain1
router
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 6

Seven Bridges of Königsberg
[Euler, 1735]
Return to the starting point by traveling each link of the graph once and only once.
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 7

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 8

 How to organize the Web?  First try: Human curated
Web directories
▪ Yahoo, DMOZ, LookSmart  Second try: Web Search
▪ Information Retrieval investigates: Find relevant docs in a small
and trusted set
▪ Newspaper articles, Patents, etc.
▪ But: Web is huge, full of untrusted documents,
random things, web spam, etc.
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 9

2 challenges of web search:
 (1) Web contains many sources of information Who to “trust”?
▪ Trick: Trustworthy pages may point to each other!  (2) What is the “best” answer to query
“newspaper”?
▪ No single right answer
▪ Trick: Pages that actually know about newspapers
might all be pointing to many newspapers
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 10

 All web pages are not equally “important”
 There is large diversity in the web-graph
node connectivity. Let’s rank the pages by the link structure!
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 11

 We will cover the following Link Analysis approaches for computing importances of nodes in a graph:
▪ Page Rank
▪ Topic-Specific (Personalized) Page Rank ▪ Web Spam Detection Algorithms
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 12

 Idea: Links as votes
▪ Page is more important if it has more links ▪ In-coming links? Out-going links?
 Think of in-links as votes:  Are all in-links are equal?
▪ Links from important pages count more
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 13

A 3.3
B 38.4
C 34.3
D 3.9
E 8.1
1.6
F 3.9
1.6
1.6
1.6
1.6
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
14

 Each link’s vote is proportional to the importance of its source page
 If page j with importance rj has n out-links, each link gets rj / n votes
 Page j’s own importance is the sum of the
votes on its in-links
ir/3 k
i rk/4
j
rj/3
rj/3 rj/3
rj = ri/3+rk/4
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 15

A
B
C
D
16

ABCD
A B C D
0110 1001 1001 1100
17

ABCD
A B C D
0 1/2 1/1 0 1/3 0 0 1/2 1/3 0 0 1/2 1/3 1/2 0 0
A B C D
0.25 0.25 0.25 0.25
18

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 19