CS代考 MULTIMEDIA RETRIEVAL

MULTIMEDIA RETRIEVAL
Semester 1, 2022
Web Search
 Web Information Retrieval The Web

Copyright By PowCoder代写 加微信 powcoder

 Crawling
 Ranking: PageRank & HITS
School of Computer Science

Web Search – Everyone is doing it!
 First, they do an on-line search
 Police in woods with bloodhounds looking at laptop computer on ground.
School of Computer Science
Web Search
 Historically, IR was mainly motivated by text search (libraries, etc.). Today: Various other areas and data, e.g. multi media (images, video, etc.), WWW, etc.
 Web search: perfect example for an IR system
 Goal: Find best possible results (web pages) based on a) Unstructured, heterogeneous, semistructured data b) Imprecise, ambiguous, short queries
 Note: ‘Best possible results‘ is also a very vague specification of the ultimate goal
 But: Very different from traditional IR tasks!
School of Computer Science

Web Search
 Web search is an active research area with high economical impact
 Many open questions & challenges for research
 Improving existing systems
 adapting to new scenarios (more data, spam, …)
 new challenges (diverse data formats, multimedia, …),
 new tasks (desktop search, personalization, …), etc.  Many other approaches & techniques exist, e.g.
 Clustering
 Specialized search engines, meta search engines, etc.
School of Computer Science
Brief (non-technical) History
 Early keyword-based engines
Altavista, Excite, Infoseek, Inktomi, 1995-1997 Extension of traditional IR
 Sponsored search ranking: Goto.com (morphed into Overture.com  Yahoo!)
 Your search ranking depended on how much you paid
 Auction for keywords: casino was expensive!
School of Computer Science

Brief (non-technical) History
 1998+: Link-based ranking pioneered by Google
 Blew away all early engines
 Great user experience in search of a business model
 Meanwhile Goto/Overture’s annual revenues were nearing $1 billion
 Result: Google added paid-placement “ads” to the side, independent of search results
 Yahoo followed suit, acquiring Overture (for paid placement) and Inktomi (for search)
 HITS introduced by Kleinberg was used by Teoma
School of Computer Science
Web search basics
Sponsored Links
CG Appliance Express
Web spider
Discount Appliances (650) 756-3931 Same Day Certified Installation www.cgappliance.com
San Francisco-Oakland-San Jose, CA
– Complete Selection Free Shipping!
www.vacuums.com
Cleaners Miele-Free Air shipping! All models. Helpful advice. www.best-vacuum.com
Results 1 – 10 of about 7,310,000 for miele. (0.12 seconds)
Miele, Inc — Anything else is a compromise
At the heart of your home, Appliances by Miele. … USA. to miele.com. Residential Appliances. Vacuum Cleaners. Dishwashers. Cooking Appliances. Steam Oven. Coffee System … www.miele.com/ – 20k – Cached – Similar pages
to Miele, the home of the very best appliances and kitchens in the world. www.miele.cIo.unk/ – 3dk – Ceachexd – Seimilar pages
Miele – Deu
Das Portal zum & Geniessen online unter www.zu-tisch.de. Miele weltweit …ein Leben lang. … Wählen Sie die .
www.miele.de/ – 10k – Cached – Similar pages
Herzlich willkommen bei ̈sterreich – [ Translate this page ] Herzlich willkommen bei ̈sterreich nicht automatisch weitergeleitet werden, klicken Sie bitte hier! HAUSHALTSGERÄTE … www.miele.at/ – 3k – Cached – Similar pages
Einbaugeräten, Hausgeräten … – [ Translate this
Ad indexes
School of Computer Science

Algorithmic results.
Characteristics of the Web
 Size: The web is huge! An there are lots of users!
 Documents
 Extreme variety regarding formats, structure, quality, content (duplicate/near-duplicate) etc.
 Users: Very different skills & intensions, e.g.  Find all information about related patents
 Find some good tourist inform. about Paris
 Find the phone no. of the tourist office
 Space: The web is a distributed system
 Spam: Expect manipulation instead of cooperation
from the document providers
 Dynamic: The web keeps growing & changing
School of Computer Science

User Needs
 Informational – want to learn about something (~40% / 65%)
Low hemoglobin
 Navigational – want to go to that page (~25% / 15%)
United Airlines
 Transactional – want to do something (web-mediated) (~35% / 20%)  Access a service
 Downloads
 Gray areas
 Find a good hub
 Exploratory search “see what’s there”
Seattle weather
Mars surface images
Canon S410
Car rental Brasil
School of Computer Science
How far do people look for results?
(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)
School of Computer Science

Users’ empirical evaluation of results
 Quality of pages varies widely  Relevance is not enough
 Other desirable qualities (non IR!!)
 Content: Trustworthy, diverse, non-duplicated, well maintained  Web readability: display correctly & fast
 No annoyances: pop-ups, etc
 Precision vs. recall
 On the web, recall seldom matters
 What matters
 Precision at 1? Precision above the fold?
 Comprehensiveness – must be able to deal with obscure queries
 Recall matters when the number of matches is very small
 User perceptions may be unscientific, but are significant over a large aggregate
School of Computer Science
Search engine optimization (Spam)
 Commercial, political, religious, lobbies  Promotion funded by advertising budget
 Operators
 Contractors (Search Engine Optimizers) for lobbies, companies  Web masters
 Hosting services
 E.g., Web master world ( www.webmasterworld.com )  Search engine specific tricks
 Discussions about academic papers
School of Computer Science

Simplest forms
 First generation engines relied heavily on tf/idf
 The top-ranked pages for the query maui resort were the ones
containing the most maui’s and resort’s
 SEOs responded with dense repetitions of chosen terms
 e.g., maui resort maui resort maui resort
 Often, the repetitions would be in the same color as the background of the web page
 Repeated terms got indexed by crawlers  But not visible to humans on browsers
Pure word density cannot be trusted as an IR signal
School of Computer Science
Variants of keyword stuffing
 Misleading meta-tags, excessive repetition
 Hidden text with colors, style sheet tricks, etc.
Meta-Tags =
“… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …”
School of Computer Science

 Serve fake content to search engine spider
 DNS cloaking: Switch IP address. Impersonate
Is this a Search Engine spider?
School of Computer Science
More spam techniques
 Doorway pages
 Pages optimized for a single keyword that re-
direct to the real target page
 Link spamming
 Mutual admiration societies, hidden links, awards
 Domain flooding: numerous domains that point or re-direct to a target page
 Fake query stream – rank checking programs  “Curve-fit” ranking programs of search engines
 Millions of submissions via Add-Url
– more on these later
School of Computer Science

The war against spam
 Quality signals – Prefer authoritative pages based on:
 Votes from authors (linkage signals)  Votes from users (usage signals)
 PolicingofURLsubmissions  Anti robot test
 Limitsonmeta-keywords  Robustlinkanalysis
 Ignore statistically implausible linkage (or text)
 Use link analysis to detect spammers (guilt by association)
 Spam recognition by machine learning
 Training set based on known spam
 Family friendly filters
 Linguistic analysis, general
classification techniques, etc.
 For images: flesh tone detectors, source text analysis, etc.
 Editorial intervention  Blacklists
 Top queries audited
 Complaints addressed
 Suspect pattern detection
School of Computer Science
Web search basics
Sponsored Links
CG Appliance Express
Web spider
Discount Appliances (650) 756-3931 Same Day Certified Installation www.cgappliance.com
San Francisco-Oakland-San Jose, CA
– Complete Selection Free Shipping!
www.vacuums.com
Cleaners Miele-Free Air shipping! All models. Helpful advice. www.best-vacuum.com
Results 1 – 10 of about 7,310,000 for miele. (0.12 seconds)
Miele, Inc — Anything else is a compromise
At the heart of your home, Appliances by Miele. … USA. to miele.com. Residential Appliances. Vacuum Cleaners. Dishwashers. Cooking Appliances. Steam Oven. Coffee System … www.miele.com/ – 20k – Cached – Similar pages
to Miele, the home of the very best appliances and kitchens in the world. www.miele.cIo.unk/ – 3dk – Ceachexd – Seimilar pages
Miele – Deu
Das Portal zum & Geniessen online unter www.zu-tisch.de. Miele weltweit …ein Leben lang. … Wählen Sie die .
www.miele.de/ – 10k – Cached – Similar pages
Herzlich willkommen bei ̈sterreich – [ Translate this page ] Herzlich willkommen bei ̈sterreich nicht automatisch weitergeleitet werden, klicken Sie bitte hier! HAUSHALTSGERÄTE … www.miele.at/ – 3k – Cached – Similar pages
Einbaugeräten, Hausgeräten … – [ Translate this
Ad indexes
School of Computer Science

Basic crawler operation
 Begin with known “seed” pages
 Fetch and parse them
Extract URLs they point to
Place the extracted URLs on a queue
 Fetch each URL on the queue and repeat
Web spider
School of Computer Science
Crawling picture
URLs crawled and parsed
Seed pages
Unseen Web
URLs frontier
School of Computer Science

Simple picture – complications
 Web crawling isn’t feasible with one machine All of the above steps distributed
 Even non-malicious pages pose challenges
 Latency/bandwidth to remote servers vary
 Webmasters’ stipulations
 How “deep” should you crawl a site’s URL hierarchy?
 Site mirrors and duplicate pages  Malicious pages
Spam pages
 Spider traps – incl dynamically generated
 Politeness – don’t hit a server too often
School of Computer Science
What any crawler must do
 Be Polite: Respect implicit and explicit politeness considerations
Only crawl allowed pages
Respect robots.txt (more on this shortly)
 Be Robust: Be immune to spider traps and other malicious behavior from web servers
School of Computer Science

What any crawler should do
 Be capable of distributed operation: designed to run on multiple distributed machines
 Be scalable: designed to increase the crawl rate by adding more machines
 Performance/efficiency: permit full use of available processing and network resources
School of Computer Science
What any crawler should do
 Fetch pages of “higher quality” first
 Continuous operation: Continue fetching
fresh copies of a previously fetched page
 Extensible: Adapt to new data formats, protocols
School of Computer Science

Updated crawling picture
URLs crawled and parsed
Seed Pages
Crawling thread
Unseen Web
URL frontier
School of Computer Science
URL frontier: two main considerations
 Politeness: do not hit a web server too frequently
 Freshness: crawl some pages more often than others
 E.g., pages (such as News sites) whose content changes often
These goals may conflict each other.
(E.g., simple priority queue fails – many links out of a page go to its own site, creating a burst of accesses to that site.)
School of Computer Science

URL frontier
 Can include multiple pages from the same host
 Must avoid trying to fetch them all at the same time
 Must try to keep all crawling threads busy
School of Computer Science
Explicit and implicit politeness
 Explicit politeness: specifications from webmasters on what portions of site can be crawled
 robots.txt
 Implicit politeness: even with no specification, avoid hitting any site too often
School of Computer Science

Robots.txt
 Protocol for giving spiders (“robots”) limited access to a website, originally from 1994
 www.robotstxt.org/wc/norobots.html
 Website announces its request on what can(not) be crawled
 For a URL, create a file URL/robots.txt  This file specifies access restrictions
School of Computer Science
Robots.txt example
 No robot should visit any URL starting with “/yoursite/temp/”, except the robot called “searchengine”:
User-agent: *
Disallow: /yoursite/temp/
User-agent: searchengine
School of Computer Science

Processing steps in crawling
 Pick a URL from the frontier
 Fetch the document at the URL
Which one?
 Parse the URL
 Extract links from it to other docs (URLs)
 Check if URL has content already seen
 If not, add to indexes
 For each extracted URL
E.g., only crawl .edu, obey robots.txt, etc.
 Ensure it passes certain URL filter tests
 Check if it is already in the frontier (duplicate URL elimination)
School of Computer Science
Basic crawl architecture
Doc robots FP’s filters
Content seen?
URL Frontier
URL filter
Dup URL elim
School of Computer Science

DNS (Domain Name Server)
 A lookup service on the internet
 Given a URL, retrieve its IP address
 Service provided by a distributed set of servers – thus, lookup latencies can be high (even seconds)
 Common OS implementations of DNS lookup are blocking: only one outstanding request at a time
 Solutions
 DNS caching
 Batch DNS resolver – collects requests and sends them out together
School of Computer Science
Parsing: URL normalization
 When a fetched document is parsed, some of the extracted links are relative URLs
 E.g., at http://en.wikipedia.org/wiki/Main_Page we have a relative link to
/wiki/Wikipedia:General_disclaimer which is the same as the absolute URL
http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
 During parsing, must normalize (expand) such relative URLs
School of Computer Science

Content seen?
 Duplication is widespread on the web  If the page just fetched is already in
the index, do not further process it
 This is verified using document fingerprints or shingles
School of Computer Science
Filters and robots.txt
 Filters – regular expressions for URL’s to be crawled/not
 Once a robots.txt file is fetched from a site, need not fetch it repeatedly
Doing so burns bandwidth, hits web server  Cache robots.txt files
School of Computer Science

Duplicate URL elimination
 For a non-continuous (one-shot) crawl, test to see if an extracted+filtered URL has already been passed to the frontier
 For a continuous crawl – see details of frontier implementation
School of Computer Science
Distributing the crawler
 Run multiple crawl threads, under different processes – potentially at different nodes
 Geographically distributed nodes
 Partition hosts being crawled into nodes
 Hash used for partition
 How do these nodes communicate?
School of Computer Science

Communication between nodes
 The output of the URL filter at each node is sent to the Duplicate URL Eliminator at all nodes
Doc robots To
FP’s filters
URL hosts set
Content seen?
From other hosts
URL Frontier
URL filter
Dup URL elim
School of Computer Science
 Beyond traditional IR ranking Purely focus on content
 Must incorporating the following formation as well
 Anchor text (peer review)
Link analysis (social relationship)  PageRank and variants
School of Computer Science

The Web as a Directed Graph
Assumption 1: A hyperlink between pages denotes author perceived relevance (quality signal)
Assumption 2: The anchor of the hyperlink
describes the target page (textual context)
School of Computer Science
Anchor Text
 For ibm how to distinguish between:
 IBM’s home page (mostly graphical)
 IBM’s copyright page (high term freq. for ‘ibm’)  Rival’s spam page (arbitrarily high term freq.)
A million pieces of anchor text with “ibm” send a strong signal
“IBM home page”
www.ibm.com
WWW Worm – McBryan [Mcbr94]
School of Computer Science

Indexing anchor text
 When indexing a document D, include anchor text from links pointing to D.
www.ibm.com
Armonk, NY-based computer giant IBM announced today
Big Blue today announced record profits for the quarter
Joe’s computer hardware links Sun
School of Computer Science
Indexing anchor text
 Can sometimes have unexpected side effects – e.g., evil empire.
 Google Bomb
 Can score anchor text with weight depending on the authority of the anchor page’s website
 E.g., if we were to assume that content from cnn.com or yahoo.com is authoritative, then trust their anchor text
To describe the force of public opinion, and even be able to melt metal. Unanimously to confuse right and wrong analogy.
http://www.searchenginedictionary.com/terms-google-bomb.shtml
School of Computer Science

Citation Analysis
 Citation frequency
 Co-citation coupling frequency
 Cocitations with a given author measures “impact”  Cocitation analysis
 Bibliographic coupling frequency
 Articles that co-cite the same articles are related
 Citation indexing
 Who is author cited by? (Garfield 1972)
 PageRank preview: Pinsker and Narin ’60s  which journals are authoritative?
School of Computer Science
Query processing
 First retrieve all pages meeting the text query (say venture capital).
 Order these by their link popularity (either variant on the previous page).
School of Computer Science

Query-independent ordering
 First generation: using link counts as simple measures of popularity.
 Two basic suggestions: Undirected popularity:
 Each page gets a score = the number of in-links plus the number of out-links (3+2=5).
Directed popularity:
 Score of a page = number of its in-links (3).
School of Computer Science
Spamming simple popularity
 Exercise: How do you spam each of the following heuristics so your page gets a high score?
 Each page gets a static score = the number of in- links plus the number of out-links.
 Static score of a page = number of its in-links.
School of Computer Science

PageRank Scoring
 Imagine a browser doing a random walk on
web pages:
Start at a random page
1/3 1/3 1/3
 At each step, go out of the current page along one of the links on that page, equiprobably
 “In the steady state” each page has a long- term visit rate – use this as the page’s score.
School of Computer Science
Mathematically … …
 Assume that there are n web pages and the i-th page is simply noted with i.
 The web pages form a directed graph and the graph can be represented with adjacent matrix P
 Element pi,j represents the connection between the i-th web page and the j-th web page
 1: if page i refers to page j; 0, otherwise.
 PageRank of the i-th web page is denoted with ai
i1 OutLinks(i) i1
n ai 1. i1
a i pa
i, j OutLinks(i)
School of Computer Science

0 1 1 1 0 0 1 0 0
0 0.5 0.5
1 0 0 
pi, j OutLinks(i)
n pij 1. j1
School of Computer Science
More Mathematically … …
a  aP a(t)  a(t 1)P
School of Computer Science

How do we compute this vector?
 Let a = [a1, … an] denote the row vector of steady- state probabilities.
 If we our current position is described by a, then the next step is distributed as aP.
 But a is the steady state, so a=aP.
 Solving this matrix equation gives us a.
 So a is the

程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com