CIS 455/555: Internet and Web Systems
Web Services November 9, 2020
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
Plan for today
n Hadoop and HDFS
n Remote Procedure Calls
n Abstraction
n Mechanism
n Stub-code generation
n Web services
n REST vs SOAP
n JSON
NEXT
(W3C) Web Services
“A Web service is a software system designed to support interoperable machine-to-machine interaction over a network. It has an interface described in a machine-processable format (specifically WSDL). Other systems interact with the Web service in a manner prescribed by its description using SOAP messages, typically conveyed using HTTP with an XML serialization in conjunction with other Web-related standards.”
http://www.w3.org/TR/ws-arch/
n Key elements:
n Machine-to-machine interaction
n Interoperable (with other applications and services)
n Machine-processable format
n Key technologies:
n SOAP and REST
n WSDL (Web Services Description Language; XML-based)
The “Standard” for Web Services
Three parts:
1. “Wire” / messaging protocols
n Data encodings, RPC calls or document passing, etc.
n We will discuss: SOAP and REST
2. Describing what goes on the wire
n Schemas for the data
n We have already discussed: XML Schema
3. “Service discovery”
n Means of finding web services
n Historical example: UDDI
The ‘protocol stacks’ of web services
[Figure: two protocol stacks, the Wire Format Stack and the Description Stack. Layers shown: XML; SOAP, XML-RPC; XML Schema; Service Description (WSDL); Service Capabilities (WS-Capability); Message Sequencing; Orchestration (WS-BPEL); WS-Security, SAML; WS-Addressing; MTOM / SOAP Attachments; WS-AtomicTransaction, WS-Coordination; other extensions; high-level state transition + messaging diagrams between modules]
Enhanced + expanded from a figure from IBM’s “Web Services Insider”, http://www.ibm.com/developerworks/webservices/library/ws-ref2/index.html
REST and SOAP
n Example: Access AWS from your program
n Example: Launch an EC2 instance, store a value in S3, …
n Simple Object Access Protocol (SOAP)
n Not as simple as the name suggests
n Full of features: well-defined encodings, security/authentication, QoS, failure handling, etc.
n XML-based, extensible, general, standardized, but also somewhat heavyweight and verbose
n Representational State Transfer (REST)
n Much simpler to develop than SOAP
n Web-specific; lack of standards
Simple Object Access Protocol (SOAP)
n One example of a messaging protocol
n XML-based format for passing parameters
n Has a SOAP header and body inside an envelope
n Has a defined HTTP binding (POST with content-type of application/soap+xml)
n A companion SOAP Attachments Protocol encapsulates other (MIME) data
n The header defines information about processing: encoding, signatures, etc.
n It’s extensible, and there’s a special attribute called mustUnderstand that is attached to elements that must be supported by the callee
n The body defines the actual application-defined data
[Figure: an Envelope containing a Header and a Body]
Making a SOAP Call
n To execute a call to service PlaceOrder:
POST /PlaceOrder HTTP/1.1
Host: my.server.com
Content-Type: application/soap+xml; charset="utf-8"
Content-Length: nnn
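A minimal sketch of how such a call could be assembled in Java, using only the standard library. The request is built but not sent. The payload element names (`m:PlaceOrder`, `m:item`, `m:quantity`) and the `http://example.org/orders` namespace are hypothetical; a real service defines them in its WSDL.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class SoapCallSketch {
    // SOAP 1.2 envelope namespace.
    static final String SOAP_NS = "http://www.w3.org/2003/05/soap-envelope";

    // Wraps an application-defined XML payload in an Envelope with an empty Header.
    static String envelope(String bodyXml) {
        return "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
             + "<env:Envelope xmlns:env=\"" + SOAP_NS + "\">"
             + "<env:Header/>"
             + "<env:Body>" + bodyXml + "</env:Body>"
             + "</env:Envelope>";
    }

    // Builds (but does not send) the HTTP POST that carries the envelope.
    // Host and Content-Length are filled in by the HTTP client itself.
    static HttpRequest placeOrderRequest(String bodyXml) {
        return HttpRequest.newBuilder(URI.create("http://my.server.com/PlaceOrder"))
                .header("Content-Type", "application/soap+xml; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(envelope(bodyXml)))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical payload, for illustration only.
        String order = "<m:PlaceOrder xmlns:m=\"http://example.org/orders\">"
                     + "<m:item>book</m:item><m:quantity>1</m:quantity></m:PlaceOrder>";
        HttpRequest request = placeOrderRequest(order);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(envelope(order));
    }
}
```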
SOAP Return Values
n If successful, the SOAP response will generally be another SOAP message with the return data values, much like the request
n On failure, the contents of the SOAP envelope will generally be a Fault message
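As an illustration of the failure case, here is a sketch that checks a response envelope for a Fault and extracts its reason text. The sample message follows the SOAP 1.2 Fault structure (a Code and a Reason); the reason string and the class and method names are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class SoapFaultSketch {
    static final String SOAP_NS = "http://www.w3.org/2003/05/soap-envelope";

    // Illustrative Fault: the Code says who is at fault, the Reason is human-readable.
    static final String SAMPLE_FAULT =
          "<env:Envelope xmlns:env=\"" + SOAP_NS + "\"><env:Body><env:Fault>"
        + "<env:Code><env:Value>env:Sender</env:Value></env:Code>"
        + "<env:Reason><env:Text xml:lang=\"en\">Unknown item in order</env:Text></env:Reason>"
        + "</env:Fault></env:Body></env:Envelope>";

    // Returns the Reason text of the Fault, or null if the envelope carries no Fault.
    static String faultReason(String envelopeXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);  // required for lookups by namespace
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(envelopeXml)));
            if (doc.getElementsByTagNameNS(SOAP_NS, "Fault").getLength() == 0) {
                return null;
            }
            NodeList texts = doc.getElementsByTagNameNS(SOAP_NS, "Text");
            return texts.getLength() == 0 ? "" : texts.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("response is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(faultReason(SAMPLE_FAULT));
    }
}
```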
Example: SOAP envelope
[Figure: sample request and sample response]
Source: http://awsdocs.s3.amazonaws.com/SDB/latest/sdb-dg.pdf
Extensions
n WS-Notification
n WS-BaseNotification
n WS-Topics
n WS-BrokeredNotification
n WS-Addressing
n WS-Transfer
n WS-Eventing
n WS-Enumeration
n WS-MakeConnection
n WS-ReliableMessaging
n WS-Reliability
n WS-RM Policy Assertion
n WS-Security
n WS-Policy
n WS-PolicyAssertions
n WS-PolicyAttachment
n WS-Discovery
n WS-Inspection
n WS-MetadataExchange
n …
Representational State Transfer (REST)
n Another example of a messaging protocol
n Not really a standard – a style of development
n Data is represented in XML, e.g., with a schema
n Function call interface uses HTTP requests
n GET/POST/PUT/PATCH/DELETE
n Server is to be stateless
n And the HTTP request type specifies the operation
n e.g., GET http://my.com/rest/service1
n e.g., POST http://my.com/rest/service1 {body} adds the body to the service
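To make "the HTTP request type specifies the operation" concrete, here is a small in-memory sketch that dispatches on the method name. It is not a web server; the class name, the status codes chosen, and the way POST assigns child paths are assumptions for illustration. PATCH is left out, so it falls through to 405.

```java
import java.util.Map;
import java.util.TreeMap;

final class RestVerbSketch {
    // All resource state lives in the store; nothing is remembered per client.
    private final Map<String, String> store = new TreeMap<>();
    private int nextId = 1;

    // Dispatches on the HTTP method and returns an HTTP status code.
    int handle(String method, String path, String body) {
        switch (method) {
            case "GET":                       // read the resource at path
                return store.containsKey(path) ? 200 : 404;
            case "PUT":                       // create or replace the resource at path
                store.put(path, body);
                return 200;
            case "POST":                      // add the body as a new child of path
                store.put(path + "/" + nextId++, body);
                return 201;
            case "DELETE":                    // remove the resource at path
                return store.remove(path) != null ? 204 : 404;
            default:                          // method not supported here
                return 405;
        }
    }

    // Returns the stored representation, or null if there is none.
    String lookup(String path) {
        return store.get(path);
    }

    public static void main(String[] args) {
        RestVerbSketch service = new RestVerbSketch();
        System.out.println(service.handle("POST", "/rest/service1", "{\"color\":\"blue\"}"));
        System.out.println(service.lookup("/rest/service1/1"));
    }
}
```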
Example: REST
Sample request:
https://sdb.amazonaws.com/?Action=PutAttributes
&DomainName=MyDomain
&ItemName=Item123
&Attribute.1.Name=Color&Attribute.1.Value=Blue
&Attribute.2.Name=Size&Attribute.2.Value=Med
&Attribute.3.Name=Price&Attribute.3.Value=0014.99
&AWSAccessKeyId=
&Signature=[valid signature]
&SignatureVersion=2
&SignatureMethod=HmacSHA256
&Timestamp=2014-01-25T15%3A01%3A28-07%3A00
Callouts in the figure: invoked method (Action), parameters, credentials
[Figure: sample response, annotated with its response elements]
Source: http://awsdocs.s3.amazonaws.com/SDB/latest/sdb-dg.pdf
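A rough sketch of how a query-string request like the sample one could be put together: parameters are URL-encoded and joined with `&`, and an HMAC-SHA256 value is computed over some text with a secret key. This shows the general idea only and is not the exact AWS signing procedure; the choice of text to sign, the sorted parameter order, and all names in the code are assumptions.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

final class QueryRequestSketch {
    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // Joins parameters as name=value pairs separated by '&', sorted by name.
    static String queryString(Map<String, String> params) {
        return new TreeMap<String, String>(params).entrySet().stream()
                .map(e -> encode(e.getKey()) + "=" + encode(e.getValue()))
                .collect(Collectors.joining("&"));
    }

    // Computes HMAC-SHA256 over the text and returns it Base64-encoded.
    static String sign(String text, String secretKey) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secretKey.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            return Base64.getEncoder()
                    .encodeToString(mac.doFinal(text.getBytes(StandardCharsets.UTF_8)));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new TreeMap<>();
        params.put("Action", "PutAttributes");
        params.put("DomainName", "MyDomain");
        params.put("ItemName", "Item123");
        params.put("Timestamp", "2014-01-25T15:01:28-07:00");
        String query = queryString(params);
        // "not-a-real-secret" is a placeholder key for illustration.
        System.out.println(query + "&Signature=" + encode(sign(query, "not-a-real-secret")));
    }
}
```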
Plan for today
n Web services
n REST vs SOAP
n WSDL
n JSON
n Information retrieval
n Basics
n Precision and recall
n Taxonomy of IR models
NEXT
JSON
n Another standard for data interchange
n “JavaScript Object Notation”; MIME type application/json
n Basically legal JavaScript code; can be parsed with eval()
n Caution: Security!
n Often used in AJAX-style applications
n Data types: Numbers, strings, booleans, null, arrays, “objects”
n “Object”: Unordered collection of key-value pairs
n Array: Ordered sequence of values; can be different types
{
  "firstName": "John",
  "lastName": "Smith",
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": 10021
  },
  "phoneNumber": [
    { "type": "home", "number": "212 555-1234" },
    { "type": "fax", "number": "646 555-4567" }
  ]
}
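The mapping from program data to JSON text can be sketched with a tiny serializer. This is a teaching sketch, not a replacement for a real library: it assumes Java maps stand for objects and lists for arrays, and it only escapes quotes and backslashes in strings.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class JsonSketch {
    // Serializes null, strings, numbers, booleans, lists (arrays) and maps (objects).
    static String toJson(Object value) {
        if (value == null) return "null";
        if (value instanceof String) return quote((String) value);
        if (value instanceof Number || value instanceof Boolean) return value.toString();
        if (value instanceof List) {
            return ((List<?>) value).stream()
                    .map(JsonSketch::toJson)
                    .collect(Collectors.joining(",", "[", "]"));
        }
        if (value instanceof Map) {
            return ((Map<?, ?>) value).entrySet().stream()
                    .map(e -> quote(String.valueOf(e.getKey())) + ":" + toJson(e.getValue()))
                    .collect(Collectors.joining(",", "{", "}"));
        }
        throw new IllegalArgumentException("unsupported type: " + value.getClass());
    }

    // Simplification: control characters are not escaped here.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        Map<String, Object> person = new LinkedHashMap<>();
        person.put("firstName", "John");
        person.put("age", 25);
        person.put("phoneNumber", List.of(Map.of("type", "home")));
        System.out.println(toJson(person));
    }
}
```

On the receiving side, a dedicated JSON parser is the safer choice than eval(), since eval() would execute whatever code the sender put in the text.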
Case study: The Facebook API
n Like many other web systems, Facebook offers API access to its system
n Programs can use the API to:
n Read data from profiles and pages
n Navigate the graph (e.g., via friends lists)
n Issue queries (for posts, people, pages, …)
n Add or modify data (e.g., create new posts)
n Get real-time updates, issue batch requests, …
n How you can access it:
n Graph API, Marketing API
The Graph API
n Requests are mapped directly to HTTP:
n https://graph.facebook.com/(identifier)?fields=(fieldList)
n Response is in JSON
{
  "id": "1074724712",
  "age_range": {
    "min": 21
  },
  "locale": "en_US",
  "location": {
    "id": "101881036520836",
    "name": "Philadelphia, Pennsylvania"
  }
}
The Graph API
n Uses several HTTP methods:
n GET for reading
n POST for adding or modifying
n DELETE for removing
n IDs can be numeric or names
n /1074724712 or /vincent.liu
n Pages also have IDs
n Authorization is via ‘access tokens’
n Opaque string; encodes specific permissions (access user location, but not interests, etc.)
n Has an expiration date, so may need to be refreshed
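A sketch of how a Graph API request could be built following the URL pattern on the previous slide. The requests are only constructed, never sent. Passing the token as an `access_token` query parameter is an assumption of this sketch, and the token value used in `main` is a placeholder.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.List;

final class GraphRequestSketch {
    // Builds https://graph.facebook.com/(identifier)?fields=(fieldList)&access_token=...
    static URI nodeUri(String id, List<String> fields, String accessToken) {
        String query = "fields=" + URLEncoder.encode(String.join(",", fields), StandardCharsets.UTF_8)
                     + "&access_token=" + URLEncoder.encode(accessToken, StandardCharsets.UTF_8);
        return URI.create("https://graph.facebook.com/" + id + "?" + query);
    }

    // GET for reading.
    static HttpRequest read(String id, List<String> fields, String accessToken) {
        return HttpRequest.newBuilder(nodeUri(id, fields, accessToken)).GET().build();
    }

    // DELETE for removing.
    static HttpRequest remove(String id, String accessToken) {
        return HttpRequest.newBuilder(nodeUri(id, List.of(), accessToken)).DELETE().build();
    }

    public static void main(String[] args) {
        // "PLACEHOLDER_TOKEN" stands in for a real, unexpired access token.
        HttpRequest request = read("1074724712", List.of("age_range", "locale"), "PLACEHOLDER_TOKEN");
        System.out.println(request.method() + " " + request.uri());
    }
}
```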
REST Implementation in Java
REST – REpresentational State Transfer
n Basic idea: use HTTP navigation to abstract parameters in a hierarchy
http://my.service.com/lookup/{user}/{folder}/{file}
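Typically use JSON to encompass the request + response.
Two useful tools used in HW3:
n Jackson allows us to serialize/de-serialize Java objects in JSON
import com.fasterxml.jackson.databind.ObjectMapper;
ObjectMapper om …;
String str = om.writeValueAsString(myObj);       // Java object -> JSON text
MyClass obj = om.readValue(str, MyClass.class);  // JSON text -> Java object; MyClass is a placeholder type
n “Route”-based programming via Spark Java
import static spark.Spark.get;
public static void main(String[] args) {
    // GET /myfn/<arg>: hand the request body and the :arg path parameter to handle()
    get("/myfn/:arg", (req, res) -> handle(req.body(), req.params(":arg")));
}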
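The idea of mapping a path hierarchy to parameters can be sketched without any framework. The matcher below binds the `{name}` segments of a template such as `/lookup/{user}/{folder}/{file}` to the corresponding segments of a request path. It is a simplified stand-in for what a routing library does, not Spark's actual implementation, and the sample path is made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RouteMatchSketch {
    // Returns the parameter bindings if the path matches the template, or null if it does not.
    static Map<String, String> match(String template, String path) {
        String[] want = template.split("/");
        String[] got = path.split("/");
        if (want.length != got.length) {
            return null;
        }
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 0; i < want.length; i++) {
            if (want[i].startsWith("{") && want[i].endsWith("}")) {
                // Placeholder segment: bind its name to the actual path segment.
                params.put(want[i].substring(1, want[i].length() - 1), got[i]);
            } else if (!want[i].equals(got[i])) {
                return null;  // literal segment differs
            }
        }
        return params;
    }

    public static void main(String[] args) {
        System.out.println(match("/lookup/{user}/{folder}/{file}", "/lookup/alice/docs/notes.txt"));
    }
}
```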
Plan for today
n Web services
n Information retrieval
n Basics
n Precision and recall
n Taxonomy of IR models
n Classic IR models
n Boolean model
n Vector model
n TF/IDF
NEXT
Web search
n Goal is to find information relevant to a user’s interests – and this is hard!
n Challenge 1: Data quality
n A significant amount of content on the web is not quality information
n Many pages contain nonsensical rants, etc.
n The web is full of misspellings, multiple languages, etc.
n Many pages are designed not to convey information – but to get a high ranking (e.g., “search engine optimization”)
n Challenge 2: Scale
n Billions of documents
n Challenge 3: Very little structure
n No explicit schema
n However, hyperlinks and tags encode information!
Our discussion of web search
n Begin with traditional information retrieval
n Document models
n Stemming and stop words
n Web-specific issues
n Crawlers and robots.txt (already discussed)
n Scalability
n Models for exploiting hyperlinks in ranking
n Google and PageRank
n Latent Semantic Indexing
Information Retrieval
n Traditional information retrieval is basically text search
n A corpus or body of text documents, e.g., in a document collection in a library or on a CD
n Documents are generally high-quality and designed to convey information
n Documents are assumed to have no structure beyond words
n Searches are generally based on meaningful phrases, perhaps including predicates over categories, dates, etc.
n The goal is to find the document(s) that best match the search phrase, according to a search model
n Assumptions are typically different from the Web: quality text, limited-size corpus, no hyperlinks
Motivation for Information Retrieval
n Information Retrieval (IR) is about the representation, storage, organization of, and access to “information items”
n Focus is on the user’s information need rather than a precise query:
n User enters: “March Madness”
n Goal: Find information on college basketball teams which (1) are maintained by a US university and (2) participate in the NCAA tournament
n Emphasis is on the retrieval of information (not data)
Data vs. Information Retrieval
n Data retrieval, analogous to database querying: which docs contain a set of keywords?
n Well-defined, precise logical semantics
n Example: All documents with ((‘CIS455’ OR ‘CIS555’) AND (‘midterm’))
n A single erroneous object implies failure!
n Information retrieval:
n Information about a subject or topic
n Semantics is frequently loose; we want approximate matches
n Small errors are tolerated (and in fact inevitable)
n IR system:
n Interpret contents of information items
n Generate a ranking which reflects relevance
n Notion of relevance is most important – needs a model
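The data-retrieval side of this contrast can be sketched directly: the query ((‘CIS455’ OR ‘CIS555’) AND (‘midterm’)) has exact semantics, so each document either satisfies it or does not, and there is no ranking. The tokenization (split on whitespace, case-sensitive) and the sample documents are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

final class BooleanQuerySketch {
    // Splits text into its set of keywords (whitespace-separated, case preserved).
    static Set<String> keywords(String text) {
        return new HashSet<>(Arrays.asList(text.trim().split("\\s+")));
    }

    // Evaluates ((CIS455 OR CIS555) AND midterm) against one document's keywords.
    static boolean matches(Set<String> words) {
        return (words.contains("CIS455") || words.contains("CIS555"))
                && words.contains("midterm");
    }

    // Returns the ids of all documents that satisfy the query, in input order.
    static List<String> select(Map<String, String> docs) {
        return docs.entrySet().stream()
                .filter(e -> matches(keywords(e.getValue())))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("d1", "CIS455 midterm review");
        docs.put("d2", "CIS555 final exam");
        docs.put("d3", "midterm for another course");
        System.out.println(select(docs));
    }
}
```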
Basic model
[Figure: basic IR model. Labels: Docs, Index Terms, doc, Information Need, query, match, Ranking]
Information Retrieval as a field
n IR addressed many issues in the last 30 years:
n Classification and categorization of documents
n Systems and languages for searching
n User interfaces and visualization of results
n Area was seen as of narrow interest – libraries, mainly
n And then – the advent of the web:
n Universal “library”
n Free (low cost) universal access
n No central editorial board
n Many problems in finding information: IR seen as key to finding the solutions!
The full Information Retrieval process
[Figure: the full information retrieval process. Components: Browser / UI, Text Processing and Modeling, Query Operations, Indexing, Searching, Ranking, Index (inverted index), Crawler / Data Access, Documents (Web or DB). Flow labels: user interest, text, logical view, user feedback, query, retrieved docs, ranked docs]
Precision and recall
n How good is our IR system?
n Two common metrics:
n Precision: What fraction of the returned documents is relevant?
n Recall: What fraction of the relevant documents are returned?
n How can you build trivial systems that optimize one of them?
n Tradeoff: Increasing precision will usually lower recall, and vice versa
[Figure: precision (p) and recall (r) axes, with curves labeled “typical”, “better”, and “ideal”]
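Both metrics follow directly from two sets, the returned documents and the relevant ones. A small sketch, assuming documents are identified by integer ids; returning 0.0 when a denominator would be zero is a convention chosen here, not something the definitions dictate.

```java
import java.util.HashSet;
import java.util.Set;

final class PrecisionRecallSketch {
    // Number of documents that are both returned and relevant.
    static <T> int hits(Set<T> returned, Set<T> relevant) {
        Set<T> both = new HashSet<>(returned);
        both.retainAll(relevant);
        return both.size();
    }

    // Fraction of the returned documents that are relevant.
    static <T> double precision(Set<T> returned, Set<T> relevant) {
        return returned.isEmpty() ? 0.0 : (double) hits(returned, relevant) / returned.size();
    }

    // Fraction of the relevant documents that are returned.
    static <T> double recall(Set<T> returned, Set<T> relevant) {
        return relevant.isEmpty() ? 0.0 : (double) hits(returned, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        Set<Integer> returned = Set.of(1, 2, 3, 4);
        Set<Integer> relevant = Set.of(2, 4, 5);
        System.out.println("precision = " + precision(returned, relevant));
        System.out.println("recall    = " + recall(returned, relevant));
    }
}
```

With these sets, two of the four returned documents are relevant (precision 1/2) and two of the three relevant documents are returned (recall 2/3).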
What is a meaningful result?
n Matching keywords is quite imprecise
n Users are frequently dissatisfied
n One problem: users are generally poor at formulating queries
n Frequent dissatisfaction of Web users (who often give single-keyword queries)
n Issue of deciding relevance is critical for IR systems: ranking
n Show more relevant documents first
n May leave out documents with low relevance
Rankings
n A ranking is an ordering of the documents retrieved that (hopefully) reflects the relevance of the documents to the user query
n A ranking is based on fundamental premises regarding the notion of relevance, such as:
n common sets of index terms
n sharing of weighted terms
n likelihood of relevance
n Each set of premises leads to a distinct IR model
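As one concrete (and deliberately naive) instance of such a premise, the sketch below scores each document by the number of index terms it shares with the query, orders documents by that score, and leaves out documents that share nothing. The scoring rule and the sample data are assumptions for illustration; the models discussed later use more refined weights.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

final class RankingSketch {
    // Relevance premise used here: the number of index terms shared with the query.
    static long score(Set<String> queryTerms, Set<String> docTerms) {
        return queryTerms.stream().filter(docTerms::contains).count();
    }

    // Returns document ids ordered by decreasing score, omitting documents with score 0.
    static List<String> rank(Set<String> queryTerms, Map<String, Set<String>> docs) {
        return docs.entrySet().stream()
                .filter(e -> score(queryTerms, e.getValue()) > 0)
                .sorted(Comparator.comparingLong(
                        (Map.Entry<String, Set<String>> e) -> score(queryTerms, e.getValue()))
                        .reversed())
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Set<String>> docs = new LinkedHashMap<>();
        docs.put("a", Set.of("ncaa", "basketball", "tournament", "teams"));
        docs.put("b", Set.of("basketball", "shoes"));
        docs.put("c", Set.of("cooking"));
        System.out.println(rank(Set.of("ncaa", "basketball", "tournament"), docs));
    }
}
```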
Types of IR Models
User task: Retrieval (ad hoc, filtering) or Browsing
Retrieval:
n Classic Models: boolean, vector, probabilistic
n Set Theoretic: Fuzzy, Extended Boolean
n Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
n Probabilistic: Inference Network, Belief Network
n Structured Models: Non-Overlapping Lists, Proximal Nodes
Browsing:
n Flat
n Structure Guided
n Hypertext
Classic IR models – Basic concepts
n Each document represented by a set of representative keywords or index terms
n An index term is a document word useful for remembering the document’s main themes
n Traditionally, index terms were nouns because nouns have meaning by themselves
n Search engines assume that all words are index terms (full text representation)
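Under the full-text assumption, every word of a document becomes an index term. A minimal sketch of the resulting structure, an inverted index from term to the documents containing it; lower-casing and splitting on non-word characters are assumptions of this sketch, and stemming and stop words are ignored.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class InvertedIndexSketch {
    // Maps each index term to the sorted set of ids (list positions) of documents containing it.
    static Map<String, Set<Integer>> build(List<String> docs) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : docs.get(id).toLowerCase().split("\\W+")) {
                if (term.isEmpty()) {
                    continue;  // split can yield a leading empty token
                }
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index = build(List.of("Web search is hard", "Web services"));
        index.forEach((term, ids) -> System.out.println(term + " -> " + ids));
    }
}
```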
Classic IR Models – Weights
n Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of documents
n The importance of the index terms is represented by weights associated with them
n Let
n ki be an index term
n dj be a document
n wij be a weight associated with (ki, dj)
n The weight wij quantifies the importance of the index term for describing the document contents
Classic IR Models – Notation
dj is a document
ki is an index term (keyword)
t is the total number of index terms
K = (k1, k2, …, kt) is the set of all index terms
wij ≥ 0 is a weight associated with (ki, dj)
wij = 0 indicates that the term does not belong to the doc
gi(dj) = wij is a function which returns the weight associated with the pair (ki, dj)
dj = (w1j, w2j, …, wtj) is a weighted vector associated with the document dj
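The notation can be made concrete with a short sketch that turns a document into its weight vector over a fixed term list K. Using raw term counts as the weights wij is an assumption made only for this example; the slides have not yet said how weights are chosen. Indices are 0-based in the code.

```java
import java.util.Arrays;
import java.util.List;

final class WeightVectorSketch {
    // Builds dj = (w1j, ..., wtj): one weight per index term in K (here: raw term counts).
    static double[] vector(List<String> K, List<String> docWords) {
        double[] w = new double[K.size()];
        for (String word : docWords) {
            int i = K.indexOf(word);
            if (i >= 0) {
                w[i] += 1.0;  // terms outside K are ignored
            }
        }
        return w;  // wij stays 0 for terms that do not occur in the document
    }

    // gi(dj) = wij: returns the weight of term i in document vector dj.
    static double g(int i, double[] dj) {
        return dj[i];
    }

    public static void main(String[] args) {
        List<String> K = List.of("web", "search", "soap");
        double[] dj = vector(K, List.of("web", "search", "web"));
        System.out.println(Arrays.toString(dj));
    }
}
```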