8/25/2021
1
Introduction to Data Science
Lecture 2
Data Science, Structured Data
CIS 5930/4930 – Fall 2021
What is Data Science
• Relatively new Computer Science field
• Systems, algorithms, and processes for
managing and deriving insights from
heterogeneous and/or large data sets
CIS 5930/4930 – Fall 2021
Typical Workflow for a Data Scientist
CIS 5930/4930 – Fall 2021
Structured Data: Relations
CIS 5930/4930 – Fall 2021
• Definitions:
• A Relation R is a subset of S1 x S2 x … x Sn,
where Si is Domain of attribute i ϵ [1, n],
and n is a number of attributes of R
• A Tuple t is an element of S1 x S2 x … x Sn
• A relation is a set of tuples
• Relational Database is a collection of
relations
Structured Data: Schema
CIS 5930/4930 – Fall 2021
• Relation schema
• Relation name
• Name of each attribute
• Type of each attribute
• Database schema
• A set of all relation schemas
Example
CIS 5930/4930 – Fall 2021
Relation Schema
Supplier (sno: integer, sname: string, scity: string, sstate: string)
Supplier
sno sname scity sstate
1 s1 city1 TX
2 s2 city1 TX
3 s3 city2 TX
4 s4 city2 TX
1 2
3 4
5 6
8/25/2021
2
Keys
CIS 5930/4930 – Fall 2021
• Super Key: a set of attributes that functionally
determines all other attributes in a tuple
• Key (candidate key): a minimal super-key
• A relation can have several candidate
keys
• Primary key: one minimal key can be selected
as primary key
Foreign Keys
CIS 5930/4930 – Fall 2021
• Foreign key (A → B)
• A and B are relations
• An attribute K (or collection of attributes)
in relation B that uniquely identifies a
tuple in relation A
• Typically, K is a primary key in A
Key Constraints
CIS 5930/4930 – Fall 2021
CREATE TABLE Part
(
pno integer,
pname varchar(20),
psize integer,
pcolor varchar(20),
PRIMARY KEY (pno)
);
CREATE TABLE Order
(
sno integer,
pno integer,
qty integer,
price integer,
PRIMARY KEY (sno, pno),
FOREIGN KEY (pno) REFERENCES Part
);
Relational Query Languages
CIS 5930/4930 – Fall 2021
• Relational Queries
• Inputs and outputs are relations
• Takes one/many relation instance(s) as
input, returns one relation instance as a
result
• Easy to compose into relational algebra
expressions
Basic Relational Operators
CIS 5930/4930 – Fall 2021
• Selection: sCondition (S)
• Condition is a Boolean combination (, ) of
atomic predicates ( , =, = , ≠, =, = , )
• Projection: list-of-attributes (S)
• Union ()
• Set difference ( – )
• Cross-product/Cartesian product (x)
• Join: R ⨝θ S = sθ (R x S)
Selection/Projection Examples
CIS 5930/4930 – Fall 2021
Patient
no name zip disease
1 p1 78225 flu
2 p2 78225 heart
3 p3 78220 lung
4 p4 78220 heart
zip, disease (Patient)
zip disease
78225 flu
78225 heart
78220 lung
78220 heart
sdisease = ‘heart’ (Patient)
no name zip disease
2 p2 78225 heart
4 p4 78220 heart
zip sdisease = ‘heart’ (Patient)
zip
78220
78225
7 8
9 10
11 12
8/25/2021
3
Cross-Product Example
CIS 5930/4930 – Fall 2021
AnonymousPatient P
age zip disease
54 78225 heart
20 78220 flu
Voters V
name age zip
p1 54 78225
p2 20 78220
P x V
P.age P.zip disease name V.age V.zip
54 78225 heart p1 54 78225
54 78225 heart p2 20 78220
20 78220 flu p1 54 78225
20 78220 flu p2 20 78220
Natural Join
CIS 5930/4930 – Fall 2021
AnonymousPatient P
age zip disease
54 78225 heart
20 78220 flu
Voters V
name age zip
p1 54 78225
p2 20 78220
P⨝ V
age zip disease name
54 78225 heart p1
20 78220 flu p2
Outer Join
CIS 5930/4930 – Fall 2021
• Outer join
• Includes tuples, even with no matches
• NULL values for missing attributes
Outer Join
CIS 5930/4930 – Fall 2021
AnonPatient P
age zip disease
54 78225 heart
20 78220 flu
33 78220 lung
Voters V
name age zip
p1 54 78225
p2 20 78220
P ⨝V
age zip disease name
54 78225 heart p1
20 78220 flu p2
33 78220 lung null
Relational Algebra Expressions
CIS 5930/4930 – Fall 2021
E1: Names of voting patients who have heart
disease:
name (Voter ⨝ (sdisease =‘heart’(AnonymousPatient))
CIS 5930/4930 – Fall 2021
Relations
Supplier(sno,sname,scity,sstate)
Part(pno,pname,psize,pcolor)
Order(sno,pno,qty,price)
E2: Suppliers of parts, size greater than 20
sname (Supplier ⨝ Order⨝(spsize > 20(Part))
E3: Suppliers of red parts or parts of size greater than 30
sname (Supplier ⨝ Order⨝(spsize > 30(Part) spcolor = ‘red’(Part)))
Relational Algebra Expressions
13 14
15 16
17 18
8/25/2021
4
CIS 5930/4930 – Fall 2021
Supplier(sno, sname, scity, sstate)
Order(sno, pno, qty, price)
Part(pno, pname, psize, pcolor)
Supplier Order
pno = pno
sname, scity
sno = sno
spsize > 30
Part
Relational Algebra Expression
Query Plan
19