Introduction to Data Science

Lecture 7

Data Integration, Information Retrieval

CIS 5930/4930 – Fall 2021

Assignments

• Homework 1
– Posted on Canvas 9/10
– Due 9/17 at 3pm on Canvas

Data Integration

1. Enterprise Information Integration: making separate DBs, all owned by one company, work together.

2. Scientific DBs, e.g., genome DBs.

3. Catalog integration: combining product information from all your suppliers.

Challenges

1. DBs get used for many applications. You can't change a DB's structure for the sake of one application, because doing so will cause the others to break.

2. Incompatibilities: two supposedly similar databases will mismatch in many ways.

Examples: Incompatibilities

• Lexical: addr in one DB is address in another.

• Value mismatches: is 20 degrees Fahrenheit or Centigrade?

• Semantic: are "employees" in each database the same? What about consultants? Retirees? Contractors?

What Do You Do About It?

• Handwritten translation at each interface

• Wrapper (aka “adapter”) translates incoming
queries and outgoing answers.
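
A wrapper along these lines can be sketched as follows. This is a minimal illustration, not a prescribed design: the class name `SourceWrapper`, the field names `addr`, `temp_f`, `address`, `temp_c`, and the sample row are all assumptions made for the example. It resolves a lexical mismatch (addr vs. address) and a value mismatch (Fahrenheit vs. Centigrade) at one interface.

```python
from dataclasses import dataclass


def fahrenheit_to_celsius(f: float) -> float:
    return (f - 32.0) * 5.0 / 9.0


@dataclass
class SourceWrapper:
    """Hypothetical wrapper between a global schema and one source's local schema."""
    rows: list        # the source's rows, stored in its local schema
    field_map: dict   # global field name -> local field name
    converters: dict  # global field name -> function applied to local values

    def query(self, field: str, value):
        """Answer 'global field == value'; return rows in the global schema.

        Only fields without a converter are safe to filter on here; filtering
        on a converted field would need the inverse conversion (omitted).
        """
        local_field = self.field_map[field]           # translate the incoming query
        hits = [r for r in self.rows if r[local_field] == value]
        return [self._to_global(r) for r in hits]     # translate the outgoing answer

    def _to_global(self, row: dict) -> dict:
        out = {}
        for global_name, local_name in self.field_map.items():
            convert = self.converters.get(global_name, lambda v: v)
            out[global_name] = convert(row[local_name])
        return out


source_a = SourceWrapper(
    rows=[{"addr": "12 Main St", "temp_f": 68.0}],
    field_map={"address": "addr", "temp_c": "temp_f"},
    converters={"temp_c": fahrenheit_to_celsius},
)
result = source_a.query("address", "12 Main St")
print(result)  # [{'address': '12 Main St', 'temp_c': 20.0}]
```

The caller only ever sees the global names, so the source's own structure does not have to change for this one application.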


Integration Architectures

1. Federation: everybody talks directly to everyone else.

2. Warehouse: sources are translated from their local schema to a global schema and copied to a central DB.

3. Mediator: a virtual warehouse that turns a user query into a sequence of source queries.

Federations

[Figure: federation diagram with six Wrapper components.]

Warehouse Diagram

[Figure: Source 1 and Source 2 each pass through a Wrapper into the Warehouse.]
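
The warehouse approach can be sketched with Python's built-in `sqlite3` module standing in for the central DB. The two sources, their field names, the `person` table, and the two `wrap_*` functions are assumptions made for illustration; the point is only that each wrapper translates rows into the global schema before they are copied.

```python
import sqlite3

# Two hypothetical sources with different local schemas.
source1 = [{"name": "Ann", "addr": "12 Main St"}]
source2 = [{"full_name": "Bob", "address": "9 Oak Ave"}]


def wrap_source1(row):
    """Translate a source-1 row to the global schema (name, address)."""
    return (row["name"], row["addr"])


def wrap_source2(row):
    """Translate a source-2 row to the global schema (name, address)."""
    return (row["full_name"], row["address"])


# The warehouse: one central DB holding copies in the global schema.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE person (name TEXT, address TEXT)")
for rows, wrap in ((source1, wrap_source1), (source2, wrap_source2)):
    warehouse.executemany(
        "INSERT INTO person VALUES (?, ?)", [wrap(r) for r in rows]
    )
warehouse.commit()

# Queries now run against the warehouse only; the sources are not contacted.
names = [n for (n,) in warehouse.execute("SELECT name FROM person ORDER BY name")]
print(names)  # ['Ann', 'Bob']
```

Because the data is copied, the warehouse can go stale when a source changes; a real system would need some refresh policy, which this sketch leaves out.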

A Mediator

[Figure: a user query enters the Mediator, which sends queries through two Wrappers to Source 1 and Source 2; results flow back through the Wrappers and the Mediator to the user.]
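
The query and result flow of a mediator can be sketched as below. The `Mediator` class, the `make_wrapper` helper, and the sample sources are hypothetical names chosen for this example. Unlike a warehouse, the mediator stores nothing; it fans each user query out to the wrappers and merges what comes back.

```python
def make_wrapper(rows, field_map):
    """Return a query function that answers in the global schema.

    field_map maps global field names to this source's local names.
    """
    def query(field, value):
        local = field_map[field]
        return [
            {g: row[l] for g, l in field_map.items()}
            for row in rows
            if row[local] == value
        ]
    return query


class Mediator:
    """Stores no data; forwards each user query to every wrapper."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, field, value):
        results = []
        for ask_source in self.wrappers:               # one source query per wrapper
            results.extend(ask_source(field, value))   # merge the results
        return results


source1 = [{"nm": "Ann", "town": "Tampa"}, {"nm": "Cy", "town": "Miami"}]
source2 = [{"name": "Bob", "city": "Tampa"}]

mediator = Mediator([
    make_wrapper(source1, {"name": "nm", "city": "town"}),
    make_wrapper(source2, {"name": "name", "city": "city"}),
])
answer = mediator.query("city", "Tampa")
print(answer)  # [{'name': 'Ann', 'city': 'Tampa'}, {'name': 'Bob', 'city': 'Tampa'}]
```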

Vector Space Model

Term-Document matrix

[Figure: term-document matrix over the terms markets, below, levinson, olsen, remorse, schuyler, rodents, scrambled, likely, minnesota and the documents doc1 through doc10. Each cell is a count; the highlighted cell is the # of times "remorse" appears in document #4.]
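
A term-document matrix of raw counts can be built in a few lines. The three short documents below are invented for illustration (they are not the documents behind the figure), and whitespace tokenization is an assumption.

```python
from collections import Counter

# Hypothetical mini-corpus; dicts keep insertion order, so columns are doc1..doc3.
docs = {
    "doc1": "markets fell below forecasts",
    "doc2": "no remorse and no more remorse",
    "doc3": "rodents scrambled below the markets",
}

counts = {d: Counter(text.split()) for d, text in docs.items()}
vocab = sorted({term for c in counts.values() for term in c})

# One row per term, one column per document; each cell is a raw count.
matrix = {term: [counts[d][term] for d in docs] for term in vocab}


def doc_vector(doc_id):
    """A document as a vector over the vocabulary (one column of the matrix)."""
    return [counts[doc_id][term] for term in vocab]


print(matrix["remorse"])  # [0, 2, 0]
print(matrix["below"])    # [1, 0, 1]
```

Each column of the matrix is the vector that represents a document in the vector space model.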

Web-search

• Architecture
• Crawling
• Indexing
• Ranking
• PageRank
• Relevance of search results
• Metrics


Relevance to Data Science

• Before
– Data models: Relational, Semi-structured
– Data cleaning

• After
– Machine Learning: Ranking
– Visualization
– Crowdsourcing

Architecture

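[Figure: search engine architecture. Crawl the web: fetch Web pages, store documents, check for duplicates, extract links. Create an inverted index from the stored documents. Search engine servers take a query, use the inverted index, and show results to the user.]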

Crawling

[Figure: crawl the web: fetch Web pages, store documents, check for duplicates, extract links.]
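
The three crawling steps (store documents, check for duplicates, extract links) can be sketched against an in-memory stand-in for the web. `FAKE_WEB`, its URLs, and the `crawl` function are assumptions made for this example; a real crawler would also handle HTTP fetching, politeness, and URL normalization, none of which is shown here. Duplicates are detected by hashing page content, which is one simple choice among several.

```python
import hashlib
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical three-page web; /c has exactly the same content as /b.
FAKE_WEB = {
    "/a": '<a href="/b">b</a> <a href="/c">c</a>',
    "/b": '<a href="/a">a</a>',
    "/c": '<a href="/a">a</a>',
}


def crawl(seed, fetch):
    """Breadth-first crawl; fetch(url) returns page text or None."""
    store = {}                 # stored documents: url -> text
    seen_urls = {seed}
    seen_hashes = set()
    frontier = deque([seed])
    while frontier:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:
            continue
        digest = hashlib.sha256(page.encode("utf-8")).hexdigest()
        if digest in seen_hashes:      # check for duplicates
            continue
        seen_hashes.add(digest)
        store[url] = page              # store the document
        parser = LinkExtractor()
        parser.feed(page)              # extract links
        for link in parser.links:
            if link not in seen_urls:
                seen_urls.add(link)
                frontier.append(link)
    return store


stored = crawl("/a", FAKE_WEB.get)
print(sorted(stored))  # ['/a', '/b']
```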

Indexing

[Figure: create an inverted index from the Web pages.]

• What to index
• Scalability
• Performance

Indexing

File (words by position):

POS  Text
1    A file is a list of words by position
10   First entry is the word in position 1 (first word)
20   Entry 1234 is the word in position 1234 (1234th word)
30   Last entry is the last word
36   An inverted file is a list of positions by word

Inverted file (positions by word):

a (1, 4, 40)
entry (11, 20, 31)
file (2, 38)
list (5, 41)
position (9, 16, 26)
positions (43)
word (14, 19, 24, 29, 35, 45)
words (7)
1234 (21, 27, 28)

• Inverted index for 1 file
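
Inverting a single file can be sketched as below, using the five lines of the example as input. The tokenizer (lowercase, runs of letters and digits) is an assumption, so entries that depend on how "(1234th" is tokenized may differ from the slide; here "1234th" becomes its own token.

```python
import re
from collections import defaultdict

TEXT = """A file is a list of words by position
First entry is the word in position 1 (first word)
Entry 1234 is the word in position 1234 (1234th word)
Last entry is the last word
An inverted file is a list of positions by word"""


def invert(text):
    """Map each word to the list of positions (1-based) where it occurs."""
    index = defaultdict(list)
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    for pos, token in enumerate(tokens, start=1):
        index[token].append(pos)
    return dict(index)


index = invert(TEXT)
print(index["a"])     # [1, 4, 40]
print(index["file"])  # [2, 38]
print(index["word"])  # [14, 19, 24, 29, 35, 45]
```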

Indexing

• Index

– Must handle multiple documents

– Must minimize disk seeks & reads

      t1    t2    …   tm
d1    w11   w12   …   w1m
d2    w21   w22   …   w2m
…
dn    wn1   wn2   …   wnm

+

a (1, 4, 40)
entry (11, 20, 31)
file (2, 38)
list (5, 41)
position (9, 16, 26)
positions (43)


Indexing

• Multiple Documents
– Split index into Two Files

• Lexicon
– Hashtable on disk (one read)
– Main memory

• Occurrence List
– On Disk
– Distributed File System

[Figure: the Lexicon holds the words (a, aa, add, and, …); each entry points into the Occurrence List, whose records have the form: docID, #, pos1, …]
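
Splitting the index into a lexicon and an occurrence list can be sketched in memory as below. The function names, the tuple layout, and the two sample documents are assumptions for illustration; here the "pointer" is simply an offset into a Python list, standing in for a position in an on-disk file.

```python
from collections import defaultdict


def build_index(docs):
    """docs: dict docID -> text. Returns (lexicon, occurrences).

    lexicon:     word -> (ndocs, ptr), ptr = offset of the word's first record
    occurrences: list of records (docID, count, [positions])
    """
    postings = defaultdict(dict)  # word -> {docID: [positions]}
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split(), start=1):
            postings[word].setdefault(doc_id, []).append(pos)

    lexicon, occurrences = {}, []
    for word in sorted(postings):
        lexicon[word] = (len(postings[word]), len(occurrences))
        for doc_id in sorted(postings[word]):
            positions = postings[word][doc_id]
            occurrences.append((doc_id, len(positions), positions))
    return lexicon, occurrences


def lookup(word, lexicon, occurrences):
    """One lexicon probe, then one contiguous read of the occurrence list."""
    if word not in lexicon:
        return []
    ndocs, ptr = lexicon[word]
    return occurrences[ptr:ptr + ndocs]


docs = {1: "add a word and a word", 2: "a list and a file"}
lexicon, occurrences = build_index(docs)
print(lexicon["and"])                       # (2, 3)
print(lookup("and", lexicon, occurrences))  # [(1, 1, [4]), (2, 1, [3])]
```

Keeping each word's records contiguous is what lets a lookup touch the occurrence list with a single sequential read.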

Indexing

Lexicon

WORD         NDOCS   PTR
jezebel      20
jezer        3
jezerit      1
jeziah       1
jeziel       1
jezliah      1
jezoar       1
jezrahliah   1
jezreel      39
jezoar

Occurrence List

DOCID   OCCURS   POS 1   POS 2   . . .
34      6        1   118   2087   3922   3981   5002
44      3        215   2291   3010
56      4        5   22   134   992
566     3        203   245   287
67      1        132
. . .
107     4        322   354   381   405
232     6        15   195   248   1897   1951   2192
677     1        481
713     3        42   312   802

"jezreel" occurs
4 times in document 107,
6 times in document 232,
once in document 677 . . .
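
Reading a word's records back from a flat occurrence list can be sketched as below. The flat-integer layout (DOCID, OCCURS, then OCCURS positions, repeated) follows the column header in the example; the function name `read_postings` is hypothetical, and the stream holds only the four "jezreel" records shown, not all 39 listed in the lexicon.

```python
def read_postings(stream, ptr, ndocs):
    """Decode ndocs records starting at offset ptr.

    Each record is: DOCID, OCCURS, then OCCURS position values.
    """
    postings = []
    i = ptr
    for _ in range(ndocs):
        doc_id, occurs = stream[i], stream[i + 1]
        positions = stream[i + 2 : i + 2 + occurs]
        postings.append((doc_id, occurs, positions))
        i += 2 + occurs    # advance to the next record
    return postings


# The four "jezreel" records from the example, flattened.
stream = [
    107, 4, 322, 354, 381, 405,
    232, 6, 15, 195, 248, 1897, 1951, 2192,
    677, 1, 481,
    713, 3, 42, 312, 802,
]

for doc_id, occurs, positions in read_postings(stream, ptr=0, ndocs=4):
    print(doc_id, occurs, positions)
# 107 4 [322, 354, 381, 405]
# 232 6 [15, 195, 248, 1897, 1951, 2192]
# 677 1 [481]
# 713 3 [42, 312, 802]
```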
