MULTIMEDIA RETRIEVAL
Semester 1, 2022
Web Search
Web Information Retrieval The Web
Copyright By PowCoder代写 加微信 powcoder
Crawling
Ranking: PageRank & HITS
School of Computer Science
Web Search – Everyone is doing it!
First, they do an on-line search
Police in woods with bloodhounds looking at laptop computer on ground.
School of Computer Science
Web Search
Historically, IR was mainly motivated by text search (libraries, etc.). Today: Various other areas and data, e.g. multi media (images, video, etc.), WWW, etc.
Web search: perfect example for an IR system
Goal: Find best possible results (web pages) based on a) Unstructured, heterogeneous, semistructured data b) Imprecise, ambiguous, short queries
Note: ‘Best possible results‘ is also a very vague specification of the ultimate goal
But: Very different from traditional IR tasks!
School of Computer Science
Web Search
Web search is an active research area with high economical impact
Many open questions & challenges for research
Improving existing systems
adapting to new scenarios (more data, spam, …)
new challenges (diverse data formats, multimedia, …),
new tasks (desktop search, personalization, …), etc. Many other approaches & techniques exist, e.g.
Clustering
Specialized search engines, meta search engines, etc.
School of Computer Science
Brief (non-technical) History
Early keyword-based engines
Altavista, Excite, Infoseek, Inktomi, 1995-1997 Extension of traditional IR
Sponsored search ranking: Goto.com (morphed into Overture.com Yahoo!)
Your search ranking depended on how much you paid
Auction for keywords: casino was expensive!
School of Computer Science
Brief (non-technical) History
1998+: Link-based ranking pioneered by Google
Blew away all early engines
Great user experience in search of a business model
Meanwhile Goto/Overture’s annual revenues were nearing $1 billion
Result: Google added paid-placement “ads” to the side, independent of search results
Yahoo followed suit, acquiring Overture (for paid placement) and Inktomi (for search)
HITS introduced by Kleinberg was used by Teoma
School of Computer Science
Web search basics
Sponsored Links
CG Appliance Express
Web spider
Discount Appliances (650) 756-3931 Same Day Certified Installation www.cgappliance.com
San Francisco-Oakland-San Jose, CA
– Complete Selection Free Shipping!
www.vacuums.com
Cleaners Miele-Free Air shipping! All models. Helpful advice. www.best-vacuum.com
Results 1 – 10 of about 7,310,000 for miele. (0.12 seconds)
Miele, Inc — Anything else is a compromise
At the heart of your home, Appliances by Miele. … USA. to miele.com. Residential Appliances. Vacuum Cleaners. Dishwashers. Cooking Appliances. Steam Oven. Coffee System … www.miele.com/ – 20k – Cached – Similar pages
to Miele, the home of the very best appliances and kitchens in the world. www.miele.cIo.unk/ – 3dk – Ceachexd – Seimilar pages
Miele – Deu
Das Portal zum & Geniessen online unter www.zu-tisch.de. Miele weltweit …ein Leben lang. … Wählen Sie die .
www.miele.de/ – 10k – Cached – Similar pages
Herzlich willkommen bei ̈sterreich – [ Translate this page ] Herzlich willkommen bei ̈sterreich nicht automatisch weitergeleitet werden, klicken Sie bitte hier! HAUSHALTSGERÄTE … www.miele.at/ – 3k – Cached – Similar pages
Einbaugeräten, Hausgeräten … – [ Translate this
Ad indexes
School of Computer Science
Algorithmic results.
Characteristics of the Web
Size: The web is huge! An there are lots of users!
Documents
Extreme variety regarding formats, structure, quality, content (duplicate/near-duplicate) etc.
Users: Very different skills & intensions, e.g. Find all information about related patents
Find some good tourist inform. about Paris
Find the phone no. of the tourist office
Space: The web is a distributed system
Spam: Expect manipulation instead of cooperation
from the document providers
Dynamic: The web keeps growing & changing
School of Computer Science
User Needs
Informational – want to learn about something (~40% / 65%)
Low hemoglobin
Navigational – want to go to that page (~25% / 15%)
United Airlines
Transactional – want to do something (web-mediated) (~35% / 20%) Access a service
Downloads
Gray areas
Find a good hub
Exploratory search “see what’s there”
Seattle weather
Mars surface images
Canon S410
Car rental Brasil
School of Computer Science
How far do people look for results?
(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)
School of Computer Science
Users’ empirical evaluation of results
Quality of pages varies widely Relevance is not enough
Other desirable qualities (non IR!!)
Content: Trustworthy, diverse, non-duplicated, well maintained Web readability: display correctly & fast
No annoyances: pop-ups, etc
Precision vs. recall
On the web, recall seldom matters
What matters
Precision at 1? Precision above the fold?
Comprehensiveness – must be able to deal with obscure queries
Recall matters when the number of matches is very small
User perceptions may be unscientific, but are significant over a large aggregate
School of Computer Science
Search engine optimization (Spam)
Commercial, political, religious, lobbies Promotion funded by advertising budget
Operators
Contractors (Search Engine Optimizers) for lobbies, companies Web masters
Hosting services
E.g., Web master world ( www.webmasterworld.com ) Search engine specific tricks
Discussions about academic papers
School of Computer Science
Simplest forms
First generation engines relied heavily on tf/idf
The top-ranked pages for the query maui resort were the ones
containing the most maui’s and resort’s
SEOs responded with dense repetitions of chosen terms
e.g., maui resort maui resort maui resort
Often, the repetitions would be in the same color as the background of the web page
Repeated terms got indexed by crawlers But not visible to humans on browsers
Pure word density cannot be trusted as an IR signal
School of Computer Science
Variants of keyword stuffing
Misleading meta-tags, excessive repetition
Hidden text with colors, style sheet tricks, etc.
Meta-Tags =
“… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …”
School of Computer Science
Serve fake content to search engine spider
DNS cloaking: Switch IP address. Impersonate
Is this a Search Engine spider?
School of Computer Science
More spam techniques
Doorway pages
Pages optimized for a single keyword that re-
direct to the real target page
Link spamming
Mutual admiration societies, hidden links, awards
Domain flooding: numerous domains that point or re-direct to a target page
Fake query stream – rank checking programs “Curve-fit” ranking programs of search engines
Millions of submissions via Add-Url
– more on these later
School of Computer Science
The war against spam
Quality signals – Prefer authoritative pages based on:
Votes from authors (linkage signals) Votes from users (usage signals)
PolicingofURLsubmissions Anti robot test
Limitsonmeta-keywords Robustlinkanalysis
Ignore statistically implausible linkage (or text)
Use link analysis to detect spammers (guilt by association)
Spam recognition by machine learning
Training set based on known spam
Family friendly filters
Linguistic analysis, general
classification techniques, etc.
For images: flesh tone detectors, source text analysis, etc.
Editorial intervention Blacklists
Top queries audited
Complaints addressed
Suspect pattern detection
School of Computer Science
Web search basics
Sponsored Links
CG Appliance Express
Web spider
Discount Appliances (650) 756-3931 Same Day Certified Installation www.cgappliance.com
San Francisco-Oakland-San Jose, CA
– Complete Selection Free Shipping!
www.vacuums.com
Cleaners Miele-Free Air shipping! All models. Helpful advice. www.best-vacuum.com
Results 1 – 10 of about 7,310,000 for miele. (0.12 seconds)
Miele, Inc — Anything else is a compromise
At the heart of your home, Appliances by Miele. … USA. to miele.com. Residential Appliances. Vacuum Cleaners. Dishwashers. Cooking Appliances. Steam Oven. Coffee System … www.miele.com/ – 20k – Cached – Similar pages
to Miele, the home of the very best appliances and kitchens in the world. www.miele.cIo.unk/ – 3dk – Ceachexd – Seimilar pages
Miele – Deu
Das Portal zum & Geniessen online unter www.zu-tisch.de. Miele weltweit …ein Leben lang. … Wählen Sie die .
www.miele.de/ – 10k – Cached – Similar pages
Herzlich willkommen bei ̈sterreich – [ Translate this page ] Herzlich willkommen bei ̈sterreich nicht automatisch weitergeleitet werden, klicken Sie bitte hier! HAUSHALTSGERÄTE … www.miele.at/ – 3k – Cached – Similar pages
Einbaugeräten, Hausgeräten … – [ Translate this
Ad indexes
School of Computer Science
Basic crawler operation
Begin with known “seed” pages
Fetch and parse them
Extract URLs they point to
Place the extracted URLs on a queue
Fetch each URL on the queue and repeat
Web spider
School of Computer Science
Crawling picture
URLs crawled and parsed
Seed pages
Unseen Web
URLs frontier
School of Computer Science
Simple picture – complications
Web crawling isn’t feasible with one machine All of the above steps distributed
Even non-malicious pages pose challenges
Latency/bandwidth to remote servers vary
Webmasters’ stipulations
How “deep” should you crawl a site’s URL hierarchy?
Site mirrors and duplicate pages Malicious pages
Spam pages
Spider traps – incl dynamically generated
Politeness – don’t hit a server too often
School of Computer Science
What any crawler must do
Be Polite: Respect implicit and explicit politeness considerations
Only crawl allowed pages
Respect robots.txt (more on this shortly)
Be Robust: Be immune to spider traps and other malicious behavior from web servers
School of Computer Science
What any crawler should do
Be capable of distributed operation: designed to run on multiple distributed machines
Be scalable: designed to increase the crawl rate by adding more machines
Performance/efficiency: permit full use of available processing and network resources
School of Computer Science
What any crawler should do
Fetch pages of “higher quality” first
Continuous operation: Continue fetching
fresh copies of a previously fetched page
Extensible: Adapt to new data formats, protocols
School of Computer Science
Updated crawling picture
URLs crawled and parsed
Seed Pages
Crawling thread
Unseen Web
URL frontier
School of Computer Science
URL frontier: two main considerations
Politeness: do not hit a web server too frequently
Freshness: crawl some pages more often than others
E.g., pages (such as News sites) whose content changes often
These goals may conflict each other.
(E.g., simple priority queue fails – many links out of a page go to its own site, creating a burst of accesses to that site.)
School of Computer Science
URL frontier
Can include multiple pages from the same host
Must avoid trying to fetch them all at the same time
Must try to keep all crawling threads busy
School of Computer Science
Explicit and implicit politeness
Explicit politeness: specifications from webmasters on what portions of site can be crawled
robots.txt
Implicit politeness: even with no specification, avoid hitting any site too often
School of Computer Science
Robots.txt
Protocol for giving spiders (“robots”) limited access to a website, originally from 1994
www.robotstxt.org/wc/norobots.html
Website announces its request on what can(not) be crawled
For a URL, create a file URL/robots.txt This file specifies access restrictions
School of Computer Science
Robots.txt example
No robot should visit any URL starting with “/yoursite/temp/”, except the robot called “searchengine”:
User-agent: *
Disallow: /yoursite/temp/
User-agent: searchengine
School of Computer Science
Processing steps in crawling
Pick a URL from the frontier
Fetch the document at the URL
Which one?
Parse the URL
Extract links from it to other docs (URLs)
Check if URL has content already seen
If not, add to indexes
For each extracted URL
E.g., only crawl .edu, obey robots.txt, etc.
Ensure it passes certain URL filter tests
Check if it is already in the frontier (duplicate URL elimination)
School of Computer Science
Basic crawl architecture
Doc robots FP’s filters
Content seen?
URL Frontier
URL filter
Dup URL elim
School of Computer Science
DNS (Domain Name Server)
A lookup service on the internet
Given a URL, retrieve its IP address
Service provided by a distributed set of servers – thus, lookup latencies can be high (even seconds)
Common OS implementations of DNS lookup are blocking: only one outstanding request at a time
Solutions
DNS caching
Batch DNS resolver – collects requests and sends them out together
School of Computer Science
Parsing: URL normalization
When a fetched document is parsed, some of the extracted links are relative URLs
E.g., at http://en.wikipedia.org/wiki/Main_Page we have a relative link to
/wiki/Wikipedia:General_disclaimer which is the same as the absolute URL
http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
During parsing, must normalize (expand) such relative URLs
School of Computer Science
Content seen?
Duplication is widespread on the web If the page just fetched is already in
the index, do not further process it
This is verified using document fingerprints or shingles
School of Computer Science
Filters and robots.txt
Filters – regular expressions for URL’s to be crawled/not
Once a robots.txt file is fetched from a site, need not fetch it repeatedly
Doing so burns bandwidth, hits web server Cache robots.txt files
School of Computer Science
Duplicate URL elimination
For a non-continuous (one-shot) crawl, test to see if an extracted+filtered URL has already been passed to the frontier
For a continuous crawl – see details of frontier implementation
School of Computer Science
Distributing the crawler
Run multiple crawl threads, under different processes – potentially at different nodes
Geographically distributed nodes
Partition hosts being crawled into nodes
Hash used for partition
How do these nodes communicate?
School of Computer Science
Communication between nodes
The output of the URL filter at each node is sent to the Duplicate URL Eliminator at all nodes
Doc robots To
FP’s filters
URL hosts set
Content seen?
From other hosts
URL Frontier
URL filter
Dup URL elim
School of Computer Science
Beyond traditional IR ranking Purely focus on content
Must incorporating the following formation as well
Anchor text (peer review)
Link analysis (social relationship) PageRank and variants
School of Computer Science
The Web as a Directed Graph
Assumption 1: A hyperlink between pages denotes author perceived relevance (quality signal)
Assumption 2: The anchor of the hyperlink
describes the target page (textual context)
School of Computer Science
Anchor Text
For ibm how to distinguish between:
IBM’s home page (mostly graphical)
IBM’s copyright page (high term freq. for ‘ibm’) Rival’s spam page (arbitrarily high term freq.)
A million pieces of anchor text with “ibm” send a strong signal
“IBM home page”
www.ibm.com
WWW Worm – McBryan [Mcbr94]
School of Computer Science
Indexing anchor text
When indexing a document D, include anchor text from links pointing to D.
www.ibm.com
Armonk, NY-based computer giant IBM announced today
Big Blue today announced record profits for the quarter
Joe’s computer hardware links Sun
School of Computer Science
Indexing anchor text
Can sometimes have unexpected side effects – e.g., evil empire.
Google Bomb
Can score anchor text with weight depending on the authority of the anchor page’s website
E.g., if we were to assume that content from cnn.com or yahoo.com is authoritative, then trust their anchor text
To describe the force of public opinion, and even be able to melt metal. Unanimously to confuse right and wrong analogy.
http://www.searchenginedictionary.com/terms-google-bomb.shtml
School of Computer Science
Citation Analysis
Citation frequency
Co-citation coupling frequency
Cocitations with a given author measures “impact” Cocitation analysis
Bibliographic coupling frequency
Articles that co-cite the same articles are related
Citation indexing
Who is author cited by? (Garfield 1972)
PageRank preview: Pinsker and Narin ’60s which journals are authoritative?
School of Computer Science
Query processing
First retrieve all pages meeting the text query (say venture capital).
Order these by their link popularity (either variant on the previous page).
School of Computer Science
Query-independent ordering
First generation: using link counts as simple measures of popularity.
Two basic suggestions: Undirected popularity:
Each page gets a score = the number of in-links plus the number of out-links (3+2=5).
Directed popularity:
Score of a page = number of its in-links (3).
School of Computer Science
Spamming simple popularity
Exercise: How do you spam each of the following heuristics so your page gets a high score?
Each page gets a static score = the number of in- links plus the number of out-links.
Static score of a page = number of its in-links.
School of Computer Science
PageRank Scoring
Imagine a browser doing a random walk on
web pages:
Start at a random page
1/3 1/3 1/3
At each step, go out of the current page along one of the links on that page, equiprobably
“In the steady state” each page has a long- term visit rate – use this as the page’s score.
School of Computer Science
Mathematically … …
Assume that there are n web pages and the i-th page is simply noted with i.
The web pages form a directed graph and the graph can be represented with adjacent matrix P
Element pi,j represents the connection between the i-th web page and the j-th web page
1: if page i refers to page j; 0, otherwise.
PageRank of the i-th web page is denoted with ai
i1 OutLinks(i) i1
n ai 1. i1
a i pa
i, j OutLinks(i)
School of Computer Science
0 1 1 1 0 0 1 0 0
0 0.5 0.5
1 0 0
pi, j OutLinks(i)
n pij 1. j1
School of Computer Science
More Mathematically … …
a aP a(t) a(t 1)P
School of Computer Science
How do we compute this vector?
Let a = [a1, … an] denote the row vector of steady- state probabilities.
If we our current position is described by a, then the next step is distributed as aP.
But a is the steady state, so a=aP.
Solving this matrix equation gives us a.
So a is the
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com