IDs
data integration
URI = Uniform Resource Identifier
RDF = Resource Description Framework
RDF/XML
RSS = RDF Site Summary
Names vs. IDs
names are meant for humans
IDs are meant for computers
e.g. the word ‘soap’ is inherently ambiguous
humans will have no problems disambiguating it using the context (human intelligence)
computers will also have some success disambiguating it using the context (artificial intelligence)
humans will typically modify the search query adding other relevant terms, but that approach relies on human improvisation
one way of systematically tackling the ambiguity problem on the Web is assigning IDs to ‘things’
4/26/2016
Semantic Web
CMT207 Information modelling & database systems
1
Example
a software consultant has just received a new project to create a series of SOAP-based Web services
they need to learn a bit about SOAP, so they search for the term using a search engine
the search results will contain documents about soap operas, toiletries, detergents as well as SOAP-based Web services
different semantic associations of the word ‘soap’ → search results will vary in relevance
→ manually sifting through a lot of information
Lecture content
Semantic Web data
resources
Semantic Web
conceived by Tim Berners-Lee: “a web of data that can be processed directly & indirectly by machines”
data itself becomes part of the Web and can be processed independently of application, platform or domain
information is currently shared on the Web in the form of documents
1. computers can search for these documents
2. … but humans have to read & interpret them before any useful information can be extrapolated
Semantic Web
does not rely on artificial intelligence, i.e. guesstimates
instead it relies on structured information & inference rules that allow it to ‘understand’ the relationship between different data resources
the computer does not really understand information the way a human can
… but it can be given enough information to make logical connections & decisions rather than to guess
Semantics & relationships
Semantic Web requires adding semantic metadata (data about data) to information resources
semantic metadata allows computers to effectively process the data & make inferences about the data
XML has paved the road by adding metadata in the form of human-readable tags
before XML, data was stored in flat files and databases with proprietary formats
XML made data interoperable within a single domain, i.e. the domain defined by an XML schema
XML provides syntactic interoperability only when both parties know & understand the element names used
Semantic Web requires a universal (or global) schema
Data integration
Semantic Web = a “web of data” that not only harnesses the seemingly endless amount of data, but also connects the data
the ability to connect data not only on the Web, but also in relational databases and other types of repositories increases the usability of data available
data integration applications for connecting disparate sources typically require one-to-one mappings between elements (i.e. local IDs) in each data repository
Semantic Web supports efficient data integration based on built-in, universally available semantic information that describes each resource (i.e. global IDs)
Semantic Web acts as one huge database
Uniform Resource Identifier (URI)
resource – anything that has an identity
e.g. an electronic document, an image, a service, a collection of other resources…
NOTE: not all resources are network retrievable!
e.g. human beings, corporations & books in a library can also be considered resources
resource = the conceptual mapping to an entity or set of entities, not necessarily the entity which corresponds to that mapping
identifier = an object that can act as a reference to something that has identity
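The contrast between local IDs and global IDs in data integration can be illustrated with a minimal sketch. It assumes two hypothetical repositories that describe the same book under different local IDs but share one URI; the record contents and the example.org URI are invented for illustration only.

```python
# Two hypothetical repositories: the local IDs ("B-17", 4021) differ,
# but both records carry the same global identifier (a URI).
library_db = {
    "B-17": {"uri": "http://www.example.org/book/42", "title": "Example Book"},
}
shop_db = {
    4021: {"uri": "http://www.example.org/book/42", "price_gbp": 10.0},
}


def integrate(*repositories):
    """Merge records from any number of repositories, keyed by global URI."""
    merged = {}
    for repo in repositories:
        for record in repo.values():
            fields = {k: v for k, v in record.items() if k != "uri"}
            merged.setdefault(record["uri"], {}).update(fields)
    return merged


combined = integrate(library_db, shop_db)
print(combined)
```

No mapping table between "B-17" and 4021 is needed here; adding a third repository would not require any new pairwise mappings either.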
Semantic Web standards & technologies
the very minimum needed to enable the Semantic Web includes the means of:
1. uniquely identifying resources
2. defining relationships between them
these requirements are addressed by using:
1. URI = Uniform Resource Identifier
2. RDF = Resource Description Framework
an official W3C recommendation, RDF is an XML-based standard for describing resources
RDF builds on existing XML and URI technologies, using a URI to identify every resource & using URIs to make statements about resources
Uniform Resource Identifier (URI)
the Web is an information space – URIs are the points in that space
URI is simply a Web identifier
a sequence of characters with a restricted syntax
e.g. the strings starting with ‘http:’ or ‘ftp:’ found on the Web
URI can be further classified as a locator (URL) or a name (URN)
URL (Uniform Resource Locator) – identifies a resource via a representation of its primary access mechanism
URN (Uniform Resource Name) – persistent labelling of a resource with a globally unique identifier
Example URIs
ftp://ftp.is.co.za/rfc/rfc1808.txt
gopher://spinaltap.micro.umn.edu/00/Weather/California/Los%20Angeles
http://www.math.uio.no/faq/compression-faq/part1.html
mailto:mduerst@ifi.unizh.ch
news:comp.infosystems.www.servers.unix
telnet://melvyl.ucop.edu/
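As a small sketch, the scheme of each example URI above can be extracted with Python's standard `urllib.parse.urlsplit`. The URIs are the ones listed on the slide (with line breaks removed); the variable names are chosen here for illustration.

```python
from urllib.parse import urlsplit

example_uris = [
    "ftp://ftp.is.co.za/rfc/rfc1808.txt",
    "gopher://spinaltap.micro.umn.edu/00/Weather/California/Los%20Angeles",
    "http://www.math.uio.no/faq/compression-faq/part1.html",
    "mailto:mduerst@ifi.unizh.ch",
    "news:comp.infosystems.www.servers.unix",
    "telnet://melvyl.ucop.edu/",
]

# The scheme is everything before the first ":".
schemes = [urlsplit(uri).scheme for uri in example_uris]
print(schemes)

# Only some of the URIs name a server after "//"; mailto: and news: do not.
with_server = [uri for uri in example_uris if urlsplit(uri).netloc]
print(with_server)
```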
URI syntax
http://meyerweb.com/eric/tools/dencoder/
URI syntax
a restricted set of characters: digits, letters & a few graphic symbols
reserved characters – their usage within the URI component is limited to their reserved purpose
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","
unreserved characters – allowed in a URI but without a reserved purpose
upper & lower case letters, decimal digits & a limited set of punctuation marks and symbols
unreserved = alphanum | mark
mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
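The reserved and unreserved character sets above can be written down directly as Python sets. This is only a sketch of the grammar as quoted on the slide; the function name `classify` is hypothetical, and later revisions of the URI specification define these sets differently.

```python
import string

# Character classes exactly as given in the grammar above.
RESERVED = set(";/?:@&=+$,")
MARK = set("-_.!~*'()")
UNRESERVED = set(string.ascii_letters + string.digits) | MARK


def classify(ch):
    """Return the class of a single character under the grammar above."""
    if ch in RESERVED:
        return "reserved"
    if ch in UNRESERVED:
        return "unreserved"
    return "other"  # neither reserved nor unreserved


for ch in "/a~ %":
    print(repr(ch), classify(ch))
```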
URI syntax: excluded URI characters
the reasons for exclusion of some US-ASCII characters
1. control characters in the US-ASCII coded character set are not allowed, because they are non-printable and are likely to be misinterpreted
control = <US-ASCII coded characters 00-1F and 7F hexadecimal>
2. the space character may disappear or be introduced when transcribed, typeset or word-processed
whitespace is also used to delimit URIs in many contexts
space = <US-ASCII coded character 20 hexadecimal>
3. delims = "<" | ">" | "#" | "%" | <">
"<", ">" and the double quote (") are often used as the delimiters around URIs in text documents & protocol fields
"#" is used to delimit a URI from a fragment identifier in URI references
"%" is used for the encoding of escaped characters
4. unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
other characters are excluded because gateways and other transport agents are known to sometimes modify such characters, or they are used as delimiters
URI syntax
a character must be escaped if it does not have a representation using an unreserved character
i.e. does not correspond to a printable character of the US-ASCII coded character set
… or corresponds to any US-ASCII character that is not allowed in a URI, as explained previously
escaped octet – encoded as a character triplet, consisting of the percent character "%" followed by two hexadecimal digits
e.g. "%20" = the US-ASCII space character, i.e. " "
escaped = "%" hex hex
hex = digit | "A" | "B" | "C" | "D" | "E" | "F" | "a" | "b" | "c" | "d" | "e" | "f"
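The escaping rule (a percent character followed by two hexadecimal digits per octet) can be sketched as follows. `escape_octets` is a hypothetical helper written for illustration, and its use of UTF-8 for non-ASCII characters is an assumption; `quote` and `unquote` are the standard-library functions that do the same job in practice.

```python
from urllib.parse import quote, unquote


def escape_octets(ch):
    """Encode one character as "%" + two hex digits for each of its bytes.

    Assumption: characters outside US-ASCII are first encoded as UTF-8.
    """
    return "".join(f"%{byte:02X}" for byte in ch.encode("utf-8"))


print(escape_octets(" "))          # the space character from the slide
print(escape_octets("<"))          # one of the excluded delimiters
print(quote("Los Angeles"))        # standard-library escaping
print(unquote("Los%20Angeles"))    # and the reverse operation
```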
URI scheme
the top level of a URI
most schemes were originally designed to be used with a particular protocol, and often have the same name
… but URI schemes are not protocols themselves!
e.g. the http scheme is mainly used for interacting with Web resources using HyperText Transfer Protocol
… but URIs within the http scheme are also used for other purposes, e.g. RDF resource identifiers and XML namespaces, which are not related to the protocol
some schemes are not associated with any protocol (e.g. file) or do not use the name of a protocol as their prefix (e.g. news)
URI schemes should be registered with IANA (Internet Assigned Numbers Authority)
URI syntactic components
the URI syntax is dependent upon the scheme
the scheme-specific part does not have to have any general structure or set of semantics common among all URIs
however, a subset of URIs do share a common syntax for representing hierarchical relationships within the namespace
URI authority
naming authority: the namespace defined by the remainder of the URI is governed by that authority
typically defined by an Internet-based server or a scheme-specific registry of naming authorities
authority = server | reg_name
preceded by a double slash “//”, terminated by the next slash “/”, question-mark “?” or the end of the URI
URI path
contains data, specific to the authority (or the scheme if there is no authority component), identifying the resource within the scope of that scheme and authority
the path may consist of a sequence of path segments separated by a single slash “/” character
each path segment may include a sequence of parameters, indicated by the semicolon “;” character
the parameters are not significant to the parsing of relative references
URI query
a string of information to be interpreted by the resource
Demo…
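The four syntactic components discussed above (scheme, authority, path, query) can be pulled apart with the standard `urllib.parse.urlsplit`. This is a sketch only: the URI below is a made-up example, and the splitting of segment parameters is done by hand because `urlsplit` leaves them inside the path.

```python
from urllib.parse import urlsplit

# Hypothetical URI containing every component, including a ";" parameter.
uri = "http://www.example.org/staff/list;type=a?dept=cs"

parts = urlsplit(uri)
scheme = parts.scheme        # before the first ":"
authority = parts.netloc     # between "//" and the next "/", "?" or end
path = parts.path            # sequence of segments separated by "/"
query = parts.query          # after "?"

# Split the path into segments, then split each segment's parameters off.
segments = []
for segment in path.strip("/").split("/"):
    name, _, params = segment.partition(";")
    segments.append((name, params.split(";") if params else []))

print(scheme, authority, path, query)
print(segments)
```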
Resource Description Framework (RDF)
a standard model for data interchange on the Web
extends the linking structure of the Web
RDF lets us use URIs to make statements about resources using triples: (subject, predicate, object)
uses URIs to name the triple elements, i.e. subject, predicate & object
RDF example
RDF statements may be represented as graphs
nodes: subject & object
arc (edge): predicate
labelled directed graph: arcs have labels & point in a specific direction, from subject to object
[Figure: RDF graph for the subject http://www.example.org/index.html with three arcs: http://www.example.org/terms/creation-date → “August 16, 1999”; http://www.example.org/terms/language → “English”; http://purl.org/dc/elements/1.1/creator → http://www.example.org/staffid/85740]
RDF example
English statement:
http://www.example.org/index.html has a creator whose value is John Smith
RDF statement:
subject http://www.example.org/index.html
predicate http://purl.org/dc/elements/1.1/creator
object http://www.example.org/staffid/85740
NOTE: URIs instead of names such as ‘creator’ & ‘John Smith’!
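The statement above can be held in code as a plain (subject, predicate, object) tuple. The following is a minimal sketch; the `Triple` type and the `outgoing_arcs` helper are names invented here, not part of any RDF library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str  # named "obj" to avoid shadowing Python's built-in "object"


creator_statement = Triple(
    subject="http://www.example.org/index.html",
    predicate="http://purl.org/dc/elements/1.1/creator",
    obj="http://www.example.org/staffid/85740",
)


def outgoing_arcs(triples, node):
    """Return the labelled arcs (predicate, object) that leave a node."""
    return [(t.predicate, t.obj) for t in triples if t.subject == node]


graph = [creator_statement]
print(outgoing_arcs(graph, "http://www.example.org/index.html"))
```

A list of such tuples is already a labelled directed graph: each triple is one arc from its subject to its object.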
RDF: URIs vs. literals
RDF permits the objects of statements to be constant values (called literals) represented by character strings
… but not the subjects or predicates!
literals are used to identify values such as numbers & dates by means of a lexical representation
resources (URIs) vs. values (literals)
URIs are shown as ellipses, literals are shown as boxes
anything represented by a literal could also be represented by a URI, but it is often more convenient or intuitive to use literals
RDF statements
RDF statements are similar to a number of other formats for recording information, e.g.
rows in a simple relational database
XML elements in an XML document
simple assertions in formal logic
etc.
information in these formats can be treated as RDF statements
treating different formats as RDF statements allows RDF to be used as a unifying model for integrating data from many sources
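To illustrate how a row in a simple relational database can be treated as RDF statements, here is a rough sketch. The URI patterns (a subject base plus the key value, a terms base plus the column name) are assumptions chosen for this example, as is the `language` column.

```python
def row_to_triples(subject_base, predicate_base, key_column, row):
    """Turn one table row (a dict) into (subject, predicate, object) triples.

    The key column identifies the subject; every other column becomes
    one statement about that subject.
    """
    subject = f"{subject_base}/{row[key_column]}"
    return [
        (subject, f"{predicate_base}/{column}", value)
        for column, value in row.items()
        if column != key_column
    ]


# Hypothetical staff table row.
staff_row = {"staffid": 85740, "name": "John Smith", "language": "English"}

triples = row_to_triples(
    "http://www.example.org/staffid",
    "http://www.example.org/terms",
    "staffid",
    staff_row,
)
for triple in triples:
    print(triple)
```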
RDF: URIs vs. literals
example: easier to use the literal “7” than the URI http://dbpedia.org/resource/7_(number)
literals are usually abstract values & describing them is in most cases neither necessary nor practical
literals are end nodes in an RDF graph that do not branch out
similar to the OO model: URI ≈ object, literal ≈ object property, which usually belongs to a primitive data type, which is represented by a single value
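The rule that literals may appear only as objects can be sketched with two small wrapper types. `URIRef`, `Literal` and `make_triple` are local, hypothetical names defined in this example; the sketch models only the two kinds of node discussed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class URIRef:
    """A resource, identified by a URI."""
    value: str


@dataclass(frozen=True)
class Literal:
    """A constant value, given by its lexical representation."""
    value: str


def make_triple(subject, predicate, obj):
    """Build a triple, rejecting literals as subject or predicate."""
    if not isinstance(subject, URIRef) or not isinstance(predicate, URIRef):
        raise TypeError("subject and predicate must be URIs")
    if not isinstance(obj, (URIRef, Literal)):
        raise TypeError("object must be a URI or a literal")
    return (subject, predicate, obj)


page = URIRef("http://www.example.org/index.html")
created = URIRef("http://www.example.org/terms/creation-date")
triple = make_triple(page, created, Literal("August 16, 1999"))
print(triple)
```

Calling `make_triple(Literal("7"), created, page)` raises `TypeError`, which mirrors the restriction stated on the slide.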
RDF: URIs
using URIs in RDF statements allows us to begin to develop & use a controlled vocabulary on the Web
this vocabulary reflects a shared understanding of the concepts we talk about
of course, URIs do not automatically solve all our problems because, e.g. people can still use different URIs to refer to the same thing
however, URIs are used in the commonly-accessible Web space, thus creating the opportunity to:
identify equivalences among them
migrate toward the use of common references
RDF/XML
an XML syntax for RDF: RDF/XML
initial statement:
RDF graph:
triple:
RDF/XML:
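As a rough sketch of how one triple maps onto RDF/XML, the code below serialises the creator statement used earlier in the lecture with Python's standard `xml.etree.ElementTree`. The function name `triple_to_rdfxml` is hypothetical, and the output is one possible RDF/XML rendering (an `rdf:Description` with a property element), not necessarily the exact markup shown in the lecture.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Ask ElementTree to use the conventional prefixes when serialising.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)


def triple_to_rdfxml(subject, predicate_ns, predicate_name, object_uri):
    """Serialise one (subject, predicate, object) triple as RDF/XML."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    # The subject becomes rdf:Description with an rdf:about attribute.
    description = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": subject}
    )
    # The predicate becomes a child element; the object URI goes in rdf:resource.
    ET.SubElement(
        description,
        f"{{{predicate_ns}}}{predicate_name}",
        {f"{{{RDF_NS}}}resource": object_uri},
    )
    return ET.tostring(root, encoding="unicode")


rdf_xml = triple_to_rdfxml(
    "http://www.example.org/index.html",
    DC_NS,
    "creator",
    "http://www.example.org/staffid/85740",
)
print(rdf_xml)
```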
Another RDF/XML example
Yet another RDF/XML example
RSS 1.0 – RDF Site Summary
RSS 1.0 (RDF Site Summary) – a lightweight multipurpose extensible metadata description & syndication format
lightweight → an XML document
extensible → via XML namespaces & RDF
metadata = data about data – descriptive information structured in a way that allows Web pages to be properly searched & processed, in particular by computers
RDF allows for representation of rich metadata
syndication – making data available online for further transmission, aggregation or online publication
RSS 1.0 is an abbreviation of RDF Site Summary; RSS is also known as Really Simple Syndication
RSS
an XML application that conforms to RDF specification
one of the most widely deployed RDF applications on the Web
packages content into easily distinguishable sections
the feed can be requested by any application able to speak HTTP
ideal for dynamic information: news sites, web logs, sports scores, stock quotes…
an RSS summary is a document describing a channel consisting of URL-retrievable items
RSS
the channel element contains metadata describing the channel itself:
title
brief description
URL link
the rdf:about attribute is a URI which identifies the channel, most commonly:
URL of the homepage being described, or URL where the RSS file can be found
RSS reader
an application which aggregates syndicated web content such as news headlines, blogs, podcasts & video blogs in one location for easy viewing
Demo – Google News
RSS
an RDF table of contents:
associates the document’s items with the given RSS channel
each item’s rdf:resource {item_uri} must be the same as the associated item element’s rdf:about {item_uri}
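The consistency rule above (every URI listed in the channel's table of contents must match the rdf:about of an item element) can be checked with a short script. This is a sketch under assumptions: the feed below is a made-up minimal document using example.org URIs, the attribute naming follows the slide, and `toc_matches_items` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
}
RDF = "{" + NS["rdf"] + "}"

# Hypothetical minimal feed: one channel with a table of contents, two items.
FEED = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://www.example.org/news.rss">
    <title>Example News</title>
    <link>http://www.example.org/</link>
    <description>A hypothetical channel.</description>
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="http://www.example.org/item/1"/>
        <rdf:li rdf:resource="http://www.example.org/item/2"/>
      </rdf:Seq>
    </items>
  </channel>
  <item rdf:about="http://www.example.org/item/1">
    <title>First item</title>
    <link>http://www.example.org/item/1</link>
  </item>
  <item rdf:about="http://www.example.org/item/2">
    <title>Second item</title>
    <link>http://www.example.org/item/2</link>
  </item>
</rdf:RDF>"""


def toc_matches_items(feed_xml):
    """True if every URI in the table of contents has a matching item."""
    root = ET.fromstring(feed_xml)
    listed = [
        li.get(RDF + "resource")
        for li in root.findall("rss:channel/rss:items/rdf:Seq/rdf:li", NS)
    ]
    described = {item.get(RDF + "about") for item in root.findall("rss:item", NS)}
    return all(uri in described for uri in listed)


print(toc_matches_items(FEED))
```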
Summary