
CIS 455/555: Internet and Web Systems
PageRank November 16, 2020
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania

Details: Final Project
n Four basic components:
n Crawler: Mercator-style distributed crawler
n Indexer / TF-IDF retrieval engine with a distributed store
n PageRank engine, based on MapReduce
n Search engine / user interface
n (and lots of opportunities for extra credit)
n Plus evaluation
n For crawling and query processing
n Examples: How does the throughput depend on the number of nodes in the system? What does the processing time depend on? Which components take the most time? etc.
n Evaluation should be on Amazon EC2; AWS credit is available (through AWS In Education program)

Milestones
n Project plan (due Monday):
n A two-page PDF document
n Introduction: Project goals, high-level approach
n Project architecture: Which components? How do they interact? What are the interfaces? Etc.
n Division of labor + Timeline
n Check-in (December 4-7):
n 50k pages crawled
n Something that looks like a search engine and pulls in data via the defined interfaces
n Final report (due 24 hours after your demo):
n A six-page PDF document
n All of the above (possibly revised and/or expanded) plus nontrivial implementation details, evaluation, conclusion
n The report will have a substantial impact on your grade!

Major challenge: Integration
n The components must fit together for the overall system to work well
n Good interfaces are essential
n Test early!! – even if some of the functionality is still missing
n Regular project meetings help – talk to your teammates!!
n Think about which component(s) are likely to be ‘central’ and start working on these very early
n Experience shows that, if you only integrate at the end, you may have to rewrite a substantial amount of code
n The components must work on a cloud provider, e.g., EC2
n May have implications for performance
n Try to develop on EC2 early, to avoid surprises later on

A recipe for disaster
n Submit a vague project plan without a clear division of labor and/or vague interfaces
n Result: Missing components, duplicated functionality, pieces don’t fit together at the end
n Do not motivate your teammates, or keep track of their progress
n Result: You end up having to do most of the work
n Do not define milestones
n Result: Most of the time is spent on one component; other components are hastily cobbled together at the last minute
n Develop components separately and test only on workstations; do integration at the end
n Result: Major debugging effort; lots of code is rewritten

A recipe for disaster (continued)
n Do nothing until December
n Result 1: Most of the features not completed due to lack of time; other features untested and buggy
n Result 2: Massive time crunch at the end (second midterm, exams & projects in other classes will all be due around the same time!!!)
n Do not test your solution
n Result: Evaluation becomes a nightmare; system doesn’t run on EC2 and/or crashes during the project demo
n Ignore the report, and write it only at the last minute
n Result: Grade is lower than expected, even if you have produced a beautiful implementation
Guarantees disaster!

A recipe for disaster (continued)
n Don’t crawl until the very end
n Result 1: Small corpus with not enough data to return good search results; ranking function cannot be tested
n Result 2: Scalability problems not discovered early (search takes minutes) & too late to do anything about it
n Don’t make the crawler restartable
n Small bug -> Need to throw away the entire corpus!
n Assume ranking function is trivial
n Do not reserve any time for tweaking it; do not read IR book and other materials to discover good approaches; do not build in diagnostics
n Result: Visually appealing, technically sophisticated search engine – but doesn’t return useful results!!

Plan for today
n Information retrieval
n Basics
n Precision and recall
n Taxonomy of IR models
n Classic IR models
n Boolean model
n Vector model
n TF/IDF
n HITS and PageRank

Common TF and IDF preprocessing
n Let:
N = the total number of docs in the collection
n_i = the number of docs which contain k_i
freq(i,j) = the raw frequency of k_i within d_j
n A normalized tf factor is given by
f(i,j) = a + (1-a) * freq(i,j) / max_l freq(l,j)
where the maximum is computed over all terms l which occur within the document d_j (a is usually set to 0.4 or 0.5)
n The idf factor is computed as
idf(i) = log(N / n_i)
The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term k_i.
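As a concrete illustration, here is a minimal Python sketch of these two formulas (the function name and the document representation are illustrative assumptions, not part of the course code; documents are assumed to be non-empty token lists):

```python
import math
from collections import Counter

def tfidf_weights(docs, a=0.4):
    """Compute normalized-tf * idf weights for each document.
    docs: list of tokenized documents, e.g. [["web", "search", "web"], ...]"""
    N = len(docs)
    # n_i: number of documents that contain each term k_i
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        freq = Counter(doc)                   # freq(i,j)
        max_freq = max(freq.values())         # max_l freq(l,j)
        w = {}
        for term, f in freq.items():
            tf = a + (1 - a) * f / max_freq   # normalized tf factor
            idf = math.log(N / df[term])      # idf factor
            w[term] = tf * idf
        weights.append(w)
    return weights
```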

Vector Model Example 1
No weights
[Figure: Venn-style diagram placing documents d1-d7 among the terms k1, k2, k3]

      k1   k2   k3   q • dj
d1     1    0    1     2
d2     1    0    0     1
d3     0    1    1     2
d4     1    0    0     1
d5     1    1    1     3
d6     1    1    0     2
d7     0    1    0     1
q      1    1    1

Query: k1 k2 k3
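The q • dj column is just the inner product of the query vector and each document vector. A tiny sketch (names are illustrative, not part of the course code):

```python
def dot_score(query, doc):
    """Rank documents by the inner product q . dj over the term vocabulary."""
    return sum(w * doc.get(term, 0) for term, w in query.items())

q = {"k1": 1, "k2": 1, "k3": 1}            # binary query weights
d5 = {"k1": 1, "k2": 1, "k3": 1}
d2 = {"k1": 1}
print(dot_score(q, d5), dot_score(q, d2))  # 3 1, matching the table
```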

Vector Model Example 2
Query weights
[Figure: Venn-style diagram placing documents d1-d7 among the terms k1, k2, k3]

      k1   k2   k3   q • dj
d1     1    0    1     4
d2     1    0    0     1
d3     0    1    1     5
d4     1    0    0     1
d5     1    1    1     6
d6     1    1    0     3
d7     0    1    0     2
q      1    2    3

Query: k3 k2 k3 k1 k2 k3

Vector Model Example 3
Document + query weights
[Figure: Venn-style diagram placing documents d1-d7 among the terms k1, k2, k3]

      k1   k2   k3   q • dj
d1     2    0    1     5
d2     1    0    0     1
d3     0    1    3    11
d4     2    0    0     2
d5     1    2    4    17
d6     1    2    0     5
d7     0    5    0    10
q      1    2    3

Query: k3 k2 k3 k1 k2 k3

Putting it all together: Scoring
df and idf are computed across all documents; the tf and w_t,d columns are specific to the document being scored.

             Query                      Document         Product
term         tf   df      idf   w_t,q   tf   w_t,d       w_t,d * w_t,q
auto          0   5000    2.3   0        1   0.41        0
best          1   50000   1.3   1.3      0   0           0
car           1   10000   2.0   2.0      1   0.41        0.82
insurance     1   1000    3.0   3.0      2   0.82        2.46

n Example: Query is for ‘best car insurance’, against a collection of N = 1,000,000 documents
n Document: Use tf weighting without idf, but with Euclidean normalization
n Query: Use idf
n Net score for this document is sum of wt,d*wt,q: 0.41*0 + 0*1.3 + 0.41*2.0 + 0.82*3.0 = 3.28

From: An Introduction to Information Retrieval, Cambridge UP
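A sketch of this scoring scheme in Python (assuming log base 10 for idf, consistent with the table; the helper name is made up). With unrounded weights the exact score comes out slightly below the slide’s 3.28, which uses rounded intermediate values:

```python
import math

def score_document(query_tf, doc_tf, df, N=1_000_000):
    """tf (Euclidean-normalized) on the document side, tf*idf on the query side."""
    norm = math.sqrt(sum(tf * tf for tf in doc_tf.values()))  # Euclidean length
    total = 0.0
    for term, qtf in query_tf.items():
        w_q = qtf * math.log10(N / df[term])   # query weight: tf * idf
        w_d = doc_tf.get(term, 0) / norm       # document weight: normalized tf
        total += w_d * w_q
    return total

df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
doc = {"auto": 1, "car": 1, "insurance": 2}
query = {"best": 1, "car": 1, "insurance": 1}
print(score_document(query, doc, df))  # ~3.27 (3.28 with the rounded weights above)
```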

Stop words
n What do we do about very common words (‘the’, ‘of’, ‘is’, ‘may’, ‘a’, …)?
n Do not appear to be very useful in general
n … though they may be in phrase searches
n “President of the United States”
n “To be or not to be”
n We can use a stop list to remove these entirely
n Typically small (200-300 terms or less)
n Ongoing trend is towards even smaller lists, or even no list at all (web search engines generally do not use them)
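Applying a stop list is a one-line filter; a minimal sketch (the list here is illustrative only, not a recommended one):

```python
STOP_WORDS = {"the", "of", "is", "may", "a", "to", "be", "or", "not"}

def remove_stop_words(tokens):
    """Drop stop words before indexing; note that this breaks phrase
    queries such as 'to be or not to be'."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]
```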

Stemming and lemmatization
n What if the document contains many similar word forms?
n View, viewing, viewer, viewed, views, viewable, …
n Democracy, democratization, …
n Can use stemming to ‘normalize’ words
n A somewhat rough heuristic; chops off ends of words etc.
n Most common algorithm: Porter stemmer
n http://tartarus.org/~martin/PorterStemmer/
n Far from perfect
n Example: Operate, operating, operates, operation, operative, operatives, operational, … are all stemmed to ‘oper’
n Better: Use NLP tools (lemmatizer)
n May use a vocabulary (e.g., ‘are/is/were’ -> ‘be’)

Example: Porter stemmer
Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation
n Above: Simple example of stemmer output
n Example rules: SSES → SS, IES → I, …
n Entire algorithm is fairly long and complex
Porter stemmer
such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret
From: “An introduction to Information Retrieval”, Cambridge University Press, page 34
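One easy way to experiment with this is NLTK’s implementation of the Porter stemmer (assuming the nltk package is installed; it is not required course code):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["operate", "operating", "operates", "operation", "operative"]:
    print(word, "->", stemmer.stem(word))  # all of these stem to 'oper'
```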

Pros & Cons of the vector model
n Advantages:
n Term-weighting improves quality of the answer set
n Partial matching allows retrieval of docs that approximate the query conditions
n Cosine ranking formula sorts documents according to degree of similarity to the query
n Disadvantages:
n Assumes independence of index terms; not clear if this is a good or bad assumption

Further reading
n “An Introduction to Information Retrieval”
n Christopher D. Manning, Prabhakar Raghavan, Hinrich Schuetze; Cambridge University Press, 2009
n Available as a PDF from: http://nlp.stanford.edu/IR-book/
n Contains more details on many topics covered in this lecture
n Examples: Scoring, tokenization, lemmatization, …
n If you’re the ranking expert in your final project team, you should have a look!
n … and possibly even if you’re not (interesting!)

Plan for today
n Information retrieval
n Classic IR models
n HITS
n Hubs and authorities
n PageRank
n Iterative computation
n Random-surfer model
n Refinements: Sinks and Hogs
n Google

Goal: Find authoritative pages
n Many queries are relatively broad
n “cats”, “harvard”, “iphone”, …
n Consequence: Abundance of results
n There may be thousands or even millions of pages that contain the search term, incl. personal homepages, rants, …
n IR-type ranking isn’t enough; still way too much for a human user to digest
n Need to further refine the ranking!
n Idea: Look for the most authoritative pages
n But how do we tell which pages these are?
n Problem: No endogenous measure of authoritativeness → hard to tell just by looking at the page
n Need some ‘off-page’ factors

Idea: Use the link structure
n Hyperlinks encode a considerable amount of human judgment
n What does it mean when a web page links to another web page?
n Intra-domain links: Often created primarily for navigation
n Inter-domain links: Confer some measure of authority
n So, can we simply boost the rank of pages with lots of inbound links?

Relevance ≠ Popularity!

[Figure: example link graph over pages including a “Tiger King” page, Tiger – Wikipedia, a Hollywood “Series to Recycle” page, Carole Baskin’s page, Joe’s favorite TV Shows, the New York Times, and Penn Biology]

Hubs and authorities
[Figure: a hub page A linking to an authority page B]
n Idea: Give more weight to links from hub pages that point to lots of other authorities
n Mutually reinforcing relationship:
n A good hub is one that points to many good authorities
n A good authority is one that is pointed to by many good hubs

HITS
[Figure: root set R contained in the expanded set S]

n Algorithm for a query Q:
1. Start with a root set R, e.g., the t highest-ranked pages from the IR-style ranking for Q
2. For each r ∈ R, add all the pages r points to, and up to d pages that point to r. Call the resulting set S.
3. Assign each page p ∈ S an authority weight x_p and a hub weight y_p; initially, set all weights to be equal and sum to 1
4. For each p ∈ S, compute new weights x_p and y_p as follows:
n New x_p := sum of all y_q such that q → p is an interdomain link
n New y_p := sum of all x_q such that p → q is an interdomain link
5. Normalize the new weights such that both the sum of all the x_p and the sum of all the y_p are 1
6. Repeat from step 4 until a fixpoint is reached
n If A is the adjacency matrix, the fixpoints are the principal eigenvectors of A^T A and A A^T, respectively
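A minimal Python sketch of this iteration, following the steps above (the graph representation and names are illustrative; it assumes S is connected enough that the weight sums stay nonzero, and it runs a fixed number of rounds rather than testing for an exact fixpoint):

```python
def hits(out_links, iterations=50):
    """out_links: dict mapping each page in S to the set of pages it links to."""
    pages = list(out_links)
    auth = {p: 1 / len(pages) for p in pages}  # authority weights x_p
    hub = {p: 1 / len(pages) for p in pages}   # hub weights y_p
    for _ in range(iterations):
        # New x_p := sum of y_q over all links q -> p
        auth = {p: sum(hub[q] for q in pages if p in out_links[q]) for p in pages}
        # New y_p := sum of x_q over all links p -> q
        hub = {p: sum(auth[q] for q in out_links[p]) for p in pages}
        # Normalize each set of weights to sum to 1
        a_sum, h_sum = sum(auth.values()), sum(hub.values())
        auth = {p: x / a_sum for p, x in auth.items()}
        hub = {p: y / h_sum for p, y in hub.items()}
    return auth, hub
```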

Recap: HITS
n Improves the ranking based on link structure
n Intuition: Links confer some measure of authority
n Overall ranking is a combination of IR ranking and this
n Based on concept of hubs and authorities
n Hub: Points to many good authorities
n Authority: Is pointed to by many good hubs
n Iterative algorithm to assign hub/authority scores
n Query-specific
n No notion of ‘absolute quality’ of a page; ranking needs to be computed for each new query

Plan for today
n Information retrieval
n Classic IR models
n HITS
n Hubs and authorities
n PageRank
n Iterative computation
n Random-surfer model
n Refinements: Sinks and Hogs
n Google

Google’s PageRank (Brin/Page 98)
n A technique for estimating page quality
n Based on web link graph, just like HITS
n Like HITS, relies on a fixpoint computation
n Important differences to HITS:
n No hubs/authorities distinction; just a single value per page
n Query-independent
n Results are combined with IR score
n Think of it as: TotalScore = IR score * PageRank
n In practice, search engines use many other factors (for example, Google says it uses more than 200)

PageRank: Intuition

[Figure: example link graph over pages A through J, with annotations: “Shouldn’t E’s vote be worth more than F’s?” and “How many levels should we consider?”]
n Imagine a contest for The Web’s Best Page
n Initially, each page has one vote
n Each page votes for all the pages it has a link to
n To ensure fairness, pages voting for more than one page must split their vote equally between them
n Voting proceeds in rounds; in each round, each page has the number of votes it received in the previous round
n In practice, it’s a little more complicated – but not much!

PageRank
n Each page i is given a rank x_i
n Goal: Assign the x_i such that the rank of each page is governed by the ranks of the pages linking to it:
x_i = Σ_{j ∈ B_i} x_j / N_j

where x_i is the rank of page i, B_i is the set of pages j that link to i, x_j is the rank of page j, and N_j is the number of links out from page j.

How do we compute the rank values?

Iterative PageRank (simplified)
Initialize all ranks to be equal, e.g.:

x_i^(0) = 1/n

Iterate until convergence:

x_i^(k+1) = Σ_{j ∈ B_i} x_j^(k) / N_j
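In code, the simplified iteration looks like this (a sketch; it assumes every page has at least one out-link, since dead ends are only handled later). The example graph is an inferred three-page graph that reproduces the ranks shown on the next slides:

```python
def simple_pagerank(out_links, iterations=50):
    """out_links: dict mapping each page to the list of pages it links to."""
    n = len(out_links)
    ranks = {p: 1 / n for p in out_links}    # x_i^(0) = 1/n
    for _ in range(iterations):
        new_ranks = {p: 0.0 for p in out_links}
        for j, targets in out_links.items():
            share = ranks[j] / len(targets)  # x_j^(k) / N_j
            for i in targets:
                new_ranks[i] += share
        ranks = new_ranks
    return ranks

out_links = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}
print(simple_pagerank(out_links))  # converges to about 0.4, 0.4, 0.2
```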

Example: Step 0

Initialize all ranks to be equal: x_i^(0) = 1/n

[Figure: three-page graph; every page starts with rank .33]

Example: Step 1

Propagate weights across out-edges: x_i^(k+1) = Σ_{j ∈ B_i} x_j^(k) / N_j

[Figure: each page sends its current rank along its out-edges; a page with two out-links sends .17 along each edge, a page with one out-link sends its full .33]

Example: Step 2

Compute weights based on in-edges: x_i^(1) = Σ_{j ∈ B_i} x_j^(0) / N_j

[Figure: after one iteration, the ranks are .5, .33, and .17]

Example: Convergence

x_i^(k+1) = Σ_{j ∈ B_i} x_j^(k) / N_j

[Figure: the ranks converge to 0.4, 0.4, and 0.2]

Naïve PageRank Algorithm Restated
n Let
n N(p) = number of outgoing links from page p
n B(p) = set of pages with back-links to page p

PageRank(p) = Σ_{b ∈ B(p)} PageRank(b) / N(b)

n Each page b distributes its importance to all of the pages it points to (so we scale by 1/N(b))
n Page p’s importance is increased by the importance of its back set

Linear Algebra formulation
n Create an m x m matrix M to capture links:
n M(i,j) = 1/n_j if page i is pointed to by page j, and page j has n_j outgoing links
n M(i,j) = 0 otherwise
n Initialize all PageRanks to 1, multiply by M repeatedly until all values converge:

[PageRank(p1')]       [PageRank(p1)]
[PageRank(p2')]  =  M [PageRank(p2)]
[     ...     ]       [     ...    ]
[PageRank(pm')]       [PageRank(pm)]
n Computes principal eigenvector via power iteration
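A sketch of this power iteration with NumPy (the matrix used here is the Google/Yahoo/Amazon example from the next slide):

```python
import numpy as np

def pagerank_power_iteration(M, iterations=50):
    """M is column-stochastic: M[i, j] = 1/n_j if page j links to page i."""
    ranks = np.ones(M.shape[0])  # initialize all PageRanks to 1
    for _ in range(iterations):
        ranks = M @ ranks        # multiply by M repeatedly
    return ranks

# g = Google, y = Yahoo, a = Amazon (see the next slide)
M = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 0.5],
              [1.0, 0.5, 0.0]])
print(pagerank_power_iteration(M))  # -> approximately [1.0, 0.67, 1.33]
```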

A Brief Example

[Figure: link graph between Google (g), Yahoo (y), and Amazon (a)]

[g']   [0   0.5  0.5]   [g]
[y'] = [0    0   0.5] * [y]
[a']   [1   0.5   0 ]   [a]

Running for multiple iterations:

g   1     1      1            1
y = 1 , 0.5 , 0.75 , ... , 0.67
a   1   1.5   1.25         1.33

Total rank sums to number of pages

Oops #1 – PageRank

[Figure: link graph in which Yahoo has no outgoing links: a ‘dead end’ – PageRank is lost after each round]

[g']   [0    0   0.5]   [g]
[y'] = [0.5  0   0.5] * [y]
[a']   [0.5  0    0 ]   [a]

Running for multiple iterations:

g   1   0.5   0.25           0
y = 1 ,   1 ,  0.5 , ... ,   0
a   1   0.5   0.25           0

Oops #2 – PageRank

[Figure: link graph in which Yahoo links only to itself – PageRank cannot flow out and accumulates in a PageRank ‘sink’]

[g']   [0    0   0.5]   [g]
[y'] = [0.5  1   0.5] * [y]
[a']   [0.5  0    0 ]   [a]

Running for multiple iterations:

g   1   0.5   0.25           0
y = 1 ,   2 ,  2.5 , ... ,   3
a   1   0.5   0.25           0

Random Surfer Model
n PageRank has an intuitive basis in random walks on graphs
n Imagine a random surfer, who starts on a random page and, in each step,
n with probability d, clicks on a random link on the page
n with probability 1-d, jumps to a random page (bored?)
n The PageRank of a page can be interpreted as the fraction of steps the surfer spends on the corresponding page
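This model is easy to simulate directly; a Monte Carlo sketch (the names and step count are arbitrary choices, not part of the course code):

```python
import random

def random_surfer(out_links, d=0.85, steps=1_000_000):
    """Estimate PageRank as the fraction of steps a random surfer spends on each page."""
    pages = list(out_links)
    visits = {p: 0 for p in pages}
    page = random.choice(pages)
    for _ in range(steps):
        targets = out_links[page]
        if targets and random.random() < d:
            page = random.choice(targets)  # probability d: click a random link
        else:
            page = random.choice(pages)    # probability 1-d: jump to a random page
        visits[page] += 1
    return {p: v / steps for p, v in visits.items()}
```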

Improved PageRank
n Remove out-degree 0 nodes (or consider them to refer back to referrer)
n Add decay factor d to deal with sinks:

PageRank(p) = (1-d) + d * Σ_{b ∈ B_p} PageRank(b) / N(b)

n Typical value: d = 0.85
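A sketch of the improved iteration, directly following the formula above (out-degree-0 nodes are assumed to have been removed already; with this normalization the ranks sum to the number of pages):

```python
def pagerank(out_links, d=0.85, iterations=50):
    """out_links: dict mapping each page to a non-empty list of pages it links to."""
    ranks = {p: 1.0 for p in out_links}
    for _ in range(iterations):
        new_ranks = {}
        for p in out_links:
            # Sum PageRank(b)/N(b) over all back-links b of p
            incoming = sum(ranks[b] / len(out_links[b])
                           for b in out_links if p in out_links[b])
            new_ranks[p] = (1 - d) + d * incoming
        ranks = new_ranks
    return ranks
```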

Stopping the Sink

[Figure: the same Google/Yahoo/Amazon graph, with the Yahoo self-link sink]

[g']          [0    0   0.5]   [g]   [0.15]
[y'] = 0.85 * [0.5  1   0.5] * [y] + [0.15]
[a']          [0.5  0    0 ]   [a]   [0.15]

Running for multiple iterations:

g   1   0.57   0.39   0.32          0.26
y = 1 , 1.85 , 2.21 , 2.36 , ... , 2.48
a   1   0.57   0.39   0.32          0.26

Reminder: Search-Engine Optimization
n White-hat techniques
n Google webmaster tools; add meta tags to documents, etc.
n Black-hat techniques
n Link farms
n Keyword stuffing, hidden text, meta-tag stuffing, …
n Comment spam
n Initial solution:
n Some people started to abuse this to improve their own rankings
n Doorway pages / cloaking
n Special pages just for search engines
n BMW Germany and Ricoh Germany banned in February 2006
n Link buying
n You need countermeasures for these
n Otherwise they can degrade the quality of your results

Recap: PageRank
n Estimates absolute ‘quality’ or ‘importance’ of a given page based on inbound links
n Query-independent
n Can be computed via fixpoint iteration
n Can be interpreted as the fraction of time a ‘random surfer’ would spend on the page
n Several refinements, e.g., to deal with sinks
n An important factor, but not the only one
n Overall ranking is based on many factors (Google: >200)
n Need to perform rank merging, e.g., with TF/IDF scores
n E.g., TF/IDF can ensure high precision, and PageRank high quality