Introduction to Data Science
Lecture 7
Data Integration,
Information Retrieval
CIS 5930/4930 – Fall 2021
Assignments
• Homework 1
• Posted on Canvas 9/10
• Due 9/17 3pm on Canvas
Data Integration
1. Enterprise Information Integration:
making separate DBs, all owned by one
company, work together.
2. Scientific DBs, e.g., genome DBs.
3. Catalog integration: combining product
information from all your suppliers.
Challenges
1. DBs get used for many applications.
You can’t change a DB’s structure for the sake of one
application, because doing so will cause others to
break.
2. Incompatibilities: two supposedly similar
databases will mismatch in many ways.
Examples: Incompatibilities
• Lexical: addr in one DB is address in
another.
• Value mismatches: is 20 degrees Fahrenheit
or Centigrade?
• Semantic: are “employees” in each database
the same? What about consultants?
Retirees? Contractors?
What Do You Do About It?
• Handwritten translation at each interface
• Wrapper (aka “adapter”) translates incoming
queries and outgoing answers.
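A minimal sketch of a wrapper, assuming a hypothetical source that stores `addr` and Fahrenheit temperatures while the rest of the system uses `address` and Celsius. The class name `SourceWrapper` and the fields `temp_f` and `temp_c` are illustrative assumptions, not part of the lecture.

```python
class SourceWrapper:
    """Translates incoming queries to the source's vocabulary and answers back."""

    # Assumed mapping: global field name -> local field name.
    FIELD_MAP = {"address": "addr", "temp_c": "temp_f"}

    def __init__(self, rows):
        self.rows = rows  # the wrapped source, here just a list of dicts

    def query(self, field, value):
        # Incoming query: rename the field and convert units if needed.
        local_field = self.FIELD_MAP.get(field, field)
        if field == "temp_c":
            value = value * 9 / 5 + 32  # Celsius -> Fahrenheit
        matches = [r for r in self.rows if r[local_field] == value]
        # Outgoing answer: translate each row back to the global vocabulary.
        return [self._to_global(r) for r in matches]

    def _to_global(self, row):
        out = {}
        for key, val in row.items():
            if key == "addr":
                out["address"] = val
            elif key == "temp_f":
                out["temp_c"] = round((val - 32) * 5 / 9, 1)
            else:
                out[key] = val
        return out


source = [{"name": "lab", "addr": "1 Main St", "temp_f": 68.0}]
wrapper = SourceWrapper(source)
print(wrapper.query("address", "1 Main St"))
```

The caller never sees `addr` or Fahrenheit values; both the lexical and the value mismatch are handled at the interface.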
Integration Architectures
1. Federation: everybody talks directly to
everyone else.
2. Warehouse: sources are translated from
their local schema to a global schema and
copied to a central DB.
3. Mediator: virtual warehouse that turns a
user query into a sequence of source
queries.
Federations
[Diagram: the sources are connected directly to one another, with a Wrapper on each connection (six Wrappers shown).]
Warehouse Diagram
[Diagram: Source 1 and Source 2 each pass through a Wrapper; both Wrappers feed the Warehouse.]
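The warehouse approach can be sketched as a load step that translates each source's local schema into a global schema and copies the rows into one central DB. This is only an illustration: the `customer` table, the field maps, and the function name `load_warehouse` are assumptions, and an in-memory SQLite database stands in for the central DB.

```python
import sqlite3


def load_warehouse(sources):
    """sources: list of (rows, field_map) pairs; field_map maps local -> global names."""
    db = sqlite3.connect(":memory:")
    # Assumed global schema with two columns.
    db.execute("CREATE TABLE customer (name TEXT, address TEXT)")
    for rows, field_map in sources:
        for row in rows:
            # Translate the local field names to the global schema, then copy.
            g = {field_map[k]: v for k, v in row.items() if k in field_map}
            db.execute(
                "INSERT INTO customer (name, address) VALUES (?, ?)",
                (g["name"], g["address"]),
            )
    db.commit()
    return db


source1 = ([{"nm": "Ann", "addr": "1 Main St"}], {"nm": "name", "addr": "address"})
source2 = ([{"full_name": "Bob", "address": "2 Oak Ave"}],
           {"full_name": "name", "address": "address"})
warehouse = load_warehouse([source1, source2])
print(warehouse.execute("SELECT name, address FROM customer ORDER BY name").fetchall())
```

Queries then run against the central copy only, so the data is as fresh as the last load.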
A Mediator
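[Diagram: a User query enters the Mediator, which sends a Query to each of two Wrappers; each Wrapper sends a Query to its source (Source 1, Source 2). Results flow back from the sources through the Wrappers to the Mediator, which returns a Result to the user.]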
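A rough sketch of the mediator idea: nothing is copied in advance, and each user query is turned into one query per source at request time. The classes `Wrapper` and `Mediator` and the field maps are assumptions made for illustration.

```python
class Wrapper:
    """Answers global-schema queries against one source with its own field names."""

    def __init__(self, rows, field_map):
        self.rows = rows
        self.field_map = field_map  # global name -> local name

    def query(self, field, value):
        local = self.field_map[field]
        to_global = {loc: glob for glob, loc in self.field_map.items()}
        return [
            {to_global[k]: v for k, v in row.items()}
            for row in self.rows
            if row[local] == value
        ]


class Mediator:
    """Virtual warehouse: stores no data, only forwards queries and merges results."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, field, value):
        results = []
        for wrapper in self.wrappers:  # one source query per wrapper
            results.extend(wrapper.query(field, value))
        return results


w1 = Wrapper([{"nm": "Ann", "addr": "1 Main St"}], {"name": "nm", "address": "addr"})
w2 = Wrapper(
    [{"full_name": "Ann", "address": "9 Elm Rd"}, {"full_name": "Bob", "address": "2 Oak Ave"}],
    {"name": "full_name", "address": "address"},
)
print(Mediator([w1, w2]).query("name", "Ann"))
```

Compared with a warehouse, answers are always current, at the cost of contacting every source on every query.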
Vector Space Model
Term-Document
[Figure: term-document matrix. Terms: markets, below, levinson, olsen, remorse, schuyler, rodents, scrambled, likely, minnesota. Documents: doc1 through doc10. The highlighted cell holds the # of times “remorse” appears in document #4.]
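A small sketch of how such a term-document count matrix could be built. The two sample documents are made up for illustration, and whitespace tokenization is a simplifying assumption.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts terms[i] in docs[j]."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))
    # A Counter returns 0 for a term that is absent from a document.
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


# Made-up documents, not the ones behind the slide's figure.
docs = ["remorse remorse markets", "markets below"]
terms, matrix = term_document_matrix(docs)
for term, row in zip(terms, matrix):
    print(term, row)
```

Each column of the matrix is the vector that represents one document in the vector space model.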
Web-search
• Architecture
• Crawling
• Indexing
• Ranking
• PageRank
• Relevance of search results
• Metrics
Relevance to Data Science
• Before
• Data models
• Relational
• Semi-structured
• Data cleaning
• After
• Machine Learning
• Ranking
• Visualization
• Crowdsourcing
Architecture
[Diagram: Web Pages -> crawl the web -> store documents, check for duplicates, extract links -> create an inverted index -> inverted index -> Search engine servers. A user Query goes to the Search engine servers, which show results to the user.]
Crawling
[Diagram: Web Pages -> crawl the web -> store documents, check for duplicates, extract links]
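The three crawling steps (store documents, check for duplicates, extract links) can be sketched as a breadth-first loop. To keep the example self-contained, `fetch` is a hypothetical stand-in for an HTTP client, and links are treated as complete page identifiers; a real crawler would also resolve relative URLs and respect robots.txt and rate limits.

```python
import hashlib
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch):
    """fetch(url) returns the page HTML, or None if the page is unavailable."""
    store = {}            # url -> html (stored documents)
    seen_hashes = set()   # content hashes, for exact-duplicate detection
    visited = {seed}
    frontier = deque([seed])
    while frontier:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # same content already stored under another URL
        seen_hashes.add(digest)
        store[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    return store


# A made-up three-page web; pages "b" and "c" have identical content.
fake_web = {
    "a": '<a href="b">b</a> <a href="c">c</a> <a href="d">d</a>',
    "b": '<a href="a">back</a>',
    "c": '<a href="a">back</a>',
}
print(sorted(crawl("a", fake_web.get)))
```

Hashing the full page only catches exact duplicates; near-duplicate detection needs a different technique.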
Indexing
[Diagram: Web Pages -> create an inverted index -> inverted index]
• What to index
• Scalability
• Performance
Indexing
• Inverted index for 1 file
File (POS is the position of the first word on each line):
POS  TEXT
1    A file is a list of words by position
10   First entry is the word in position 1 (first word)
20   Entry 1234 is the word in position 1234 (1234th word)
30   Last entry is the last word
36   An inverted file is a list of positions by word
Inverted file:
a (1, 4, 40)
entry (11, 20, 31)
file (2, 38)
list (5, 41)
position (9, 16, 26)
positions (43)
word (14, 19, 24, 29, 35, 45)
words (7)
1234 (21, 27, 28)
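A minimal sketch of inverting a single file, using the five example sentences from this slide as input. The tokenizer (lowercase, runs of letters and digits) is an assumption; unlike the slide, it does not fold "1234th" into "1234".

```python
import re
from collections import defaultdict

TEXT = """A file is a list of words by position
First entry is the word in position 1 (first word)
Entry 1234 is the word in position 1234 (1234th word)
Last entry is the last word
An inverted file is a list of positions by word"""


def invert(text):
    """Map each word to the list of 1-based word positions where it occurs."""
    index = defaultdict(list)
    words = re.findall(r"[a-z0-9]+", text.lower())
    for pos, word in enumerate(words, start=1):
        index[word].append(pos)
    return dict(index)


index = invert(TEXT)
print(index["word"])   # [14, 19, 24, 29, 35, 45]
print(index["entry"])  # [11, 20, 31]
```

The file maps positions to words; the inverted file maps words to positions, which is the direction a search needs.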
Indexing
• Index
– Must handle multiple documents
– Must minimize disk seeks & reads
     t1   t2   …  tm
d1   w11  w12  …  w1m
d2   w21  w22  …  w2m
…
dn   wn1  wn2  …  wnm
+
a (1, 4, 40)
entry (11, 20, 31)
file (2, 38)
list (5, 41)
position (9, 16, 26)
positions (43)
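One way to extend the single-file inverted index to multiple documents is to key each word's positions by document ID. The sketch below assumes whitespace tokenization and made-up documents; it keeps everything in memory and does not address disk seeks.

```python
from collections import defaultdict


def build_index(docs):
    """docs: dict of docID -> text. Returns word -> {docID: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split(), start=1):
            index[word].setdefault(doc_id, []).append(pos)
    return dict(index)


# Made-up documents for illustration.
docs = {1: "a file is a list", 2: "a list of words"}
index = build_index(docs)
print(index["a"])
print(index["list"])
```

The number of documents containing a word is `len(index[word])`, and the count within one document is the length of that document's position list.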
Indexing
• Multiple Documents
– Split index into two files
• Lexicon
– Hashtable on disk (one read)
– Main memory
• Occurrence List
– On Disk
– Distributed File System
[Diagram: each Lexicon entry (a, aa, add, and, …) points into the Occurrence List, whose entries have the form: docID, #, pos1, …]
Indexing
Lexicon
WORD        NDOCS  PTR
jezebel     20
jezer       3
jezerit     1
jeziah      1
jeziel      1
jezliah     1
jezoar      1
jezrahliah  1
jezreel     39
[PTR: each Lexicon entry points to its rows in the Occurrence List]
Occurrence List
DOCID  OCCURS  POS 1  POS 2  . . .
34     6       1  118  2087  3922  3981  5002
44     3       215  2291  3010
56     4       5  22  134  992
566    3       203  245  287
67     1       132
. . .
107    4       322  354  381  405
232    6       15  195  248  1897  1951  2192
677    1       481
713    3       42  312  802
…
“jezreel” occurs
4 times in document 107,
6 times in document 232,
once in document 677 . . .
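The lexicon and occurrence list split can be sketched with a dictionary of (NDOCS, PTR) pairs and one flat list of integers laid out as DOCID, OCCURS, POS 1, POS 2, and so on. The "jezreel" rows are the four shown on the slide, so NDOCS is 4 here rather than 39; the "aardvark" entry is made up so that the pointer is not zero. Function names are assumptions.

```python
def build(postings):
    """postings: word -> list of (doc_id, positions). Returns (lexicon, occurrences)."""
    lexicon, occurrences = {}, []
    for word in sorted(postings):
        # PTR is the offset in the occurrence list where this word's rows start.
        lexicon[word] = (len(postings[word]), len(occurrences))
        for doc_id, positions in postings[word]:
            occurrences += [doc_id, len(positions), *positions]
    return lexicon, occurrences


def lookup(word, lexicon, occurrences):
    """Return [(doc_id, positions), ...] for word, reading NDOCS rows from PTR."""
    if word not in lexicon:
        return []
    ndocs, ptr = lexicon[word]
    rows = []
    for _ in range(ndocs):
        doc_id, occurs = occurrences[ptr], occurrences[ptr + 1]
        rows.append((doc_id, occurrences[ptr + 2 : ptr + 2 + occurs]))
        ptr += 2 + occurs
    return rows


postings = {
    "aardvark": [(1, [7])],  # made-up entry
    "jezreel": [             # the four rows visible on the slide
        (107, [322, 354, 381, 405]),
        (232, [15, 195, 248, 1897, 1951, 2192]),
        (677, [481]),
        (713, [42, 312, 802]),
    ],
}
lexicon, occurrences = build(postings)
for doc_id, positions in lookup("jezreel", lexicon, occurrences):
    print(doc_id, len(positions), positions)
```

A lookup touches the lexicon once and then reads one contiguous run of the occurrence list, which is the access pattern that keeps disk seeks low.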