Course Review
Introduction
DSCI 551
Wensheng Wu
1
Logistics
• Instructor email:
• Class meeting times (2 sections):
– MW 3:30-5:20pm
– Tuesday 3:30-6:50pm
• Office hours:
– MW 2:30-3:15pm (Zoom, link to be posted)
– After class
– Please email for appointment
2
mailto:
Logistics
• TAs & office hours
– Watch the announcements
• Class materials
– Posted on Blackboard
3
Piazza
• Discussion forums
– You may post general and homework questions
– Do not post solutions
– Please actively participate in helping others!
– Do not abuse forum (an academic misconduct!)
• Check frequently for updates
• Follow instruction on Blackboard to sign up
4
Prerequisites
• Programming skills:
– Python (homework, Spark), Java (e.g., for Hadoop only)
• Unix-like environment & shell commands (ls?)
– E.g., Amazon EC2
• Basic knowledge of algorithms and data structures
– Sorting, hashing, etc. (CS 570)
• Basic probability and statistics
5
Textbooks
• Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-
Dusseau. Operating Systems: Three Easy Pieces,
2015 (selected chapters only). Available free at:
http://pages.cs.wisc.edu/~remzi/OSTEP/
• Hector Garcia-Molina, Jeffrey D. Ullman, and
Jennifer Widom. Database Systems: The
Complete Book (Second Edition), Prentice Hall,
2009. (selected chapters only)
– http://infolab.stanford.edu/~ullman/dscb.html
6
http://pages.cs.wisc.edu/~remzi/OSTEP/
http://infolab.stanford.edu/~ullman/dscb.html
Additional readings
• Links can be found in Syllabus
– Check out the schedule
7
Grading structure
• Homework 20%
• Midterm 1 15%
• Midterm 15%
• Final 25%
• Lab sessions 5%
• Group project 20%
8
Grading scale
• [93, 100] = A
• [90, 93) = A-
• [87, 90) = B+
• [83, 87) = B
• [80, 83) = B-
• [77, 80) = C+
• [73, 77) = C
• … (see Syllabus for complete breakdown)
9
Lab sessions
• Flipped:
– Task and details posted before class
– Bring questions to class
• Typically utilize last 15-30 mins of class
10
Exams
• Closed-notes & book
• Midterms:
– See syllabus for time
• Final:
– See syllabus for time
11
Calculator
• Bring one to the tests
• If calculator is needed, we will either
announce or state it on the tests
• Otherwise, no electronic devices are allowed
12
Course project
• Details coming up
• Done in phases
– Proposal
– Midterm report
– Final report
13
Late Policy
• Homework/coursework will be submitted online
– 10% for every 24 hours late
– No credit after 3 days
• Make up for tests are permitted only when
– You have an emergency, typically medical
– Let me know at least two weeks in advance
– You may be required to contact Campus Support and
Intervention office for verification of emergency
14
Grading Corrections
• All coursework’s grades are final one week after
grades are posted
• Final exam grades (& all grades) are final after
final exam grading review time (to be announced
right before/after final)
• Please submit reasonable regrading requests
– Irrational requests (e.g., simply asking for more points
or special treatments) may result in reduction of your
grades
15
Academic Integrity
• Cheating will NOT be tolerated
• All parties involved will receive a grade of F
for the course and be reported to SJACS
WITHOUT EXCEPTION
– USC Student Judicial Affairs and Community
Standards
16
Now, movie time ☺
• Explain big data:
– https://www.youtube.com/watch?v=7D1CQ_LOiz
A
• Questions:
– Where does big data come from?
– What characteristics doe it have? 3Vs?
– What big data technologies were mentioned?
17
Variety
18
Internet Traffic in 2012
• 4.8 zettabyte = 4.8 billion terabytes
• Zettabyte (1000 exabytes)
• Exabyte
• Petabyte
• Terabyte (storage)
• Gigabyte = 2^30 (memory)
• Megabyte (128MB, HDFS)
• Kilobyte = 2^10 (4KB)
19
Major topics
• Storage systems
• File systems & file formats
• Database management systems (RDBMS)
– R = relational
• Big data solution stack
20
Storage Systems
• Hard disk
• SSD (Solid state drive)
21
https://youtu.be/4iaxOUYalJU
Internal of hard disk
22
Actuator Spindle
Disk head
Platter
https://youtu.be/4iaxOUYalJU
NAND flash
23
Latencies: read, write, and erase
24
Major topics
• Storage systems
• File systems & file formats
• Database management systems
• Big data solution stack
25
File Systems
• Standalone
– Single machine
• Distributed (e.g., Hadoop)
– A number of data servers
26
Standalone file systems
• Data structures
– Data blocks
– Metadata blocks (Inodes)
– Bitmap blocks (for space allocation)
• Access paths
– Read
– write
27
Inode (index node)
• Each is identified by a number
– Low-level number of file name: inumber
• Can figure out location of inode from inumber
28
Distributed file systems
• Hadoop HDFS (after GFS)
– Data are distributed among data nodes
• Replication
– Automatic creation of replica (typically 2 or 3
copies/replica of data)
• Fault-tolerant
– Automatic recovery from node failure
29
HDFS architecture
30
Major topics
• Storage systems
• File systems & file formats
• Database management systems
• Big data solution stack
31
File Formats
• JSON
32
33
HTML
Bibliography
Foundations of Databases
Abiteboul, Hull, Vianu
Addison Wesley, 1995
Data on the Web
Abiteoul, Buneman, Suciu
Morgan Kaufmann, 1999
34
XML
…
XML describes the content
XML usages
• Software configurations files
– E.g., HDFS
• Android app development
– Layout resource files, e.g., activity_main.xml
• Java archive (.jar file)
– Manifest.xml
35
Android app resource file
36
Manifest.xml
37
38
…
…
39
Data Model for XPath
bib
book book
publisher author . . . .
Addison-Wesley Serge Abiteboul
Document node
The root element
40
XPath: Simple Expressions
Result:
Result: empty (there were no papers)
/bib/book/year
/bib/paper/year
Major topics
• Storage systems
• File systems & file formats
• Database management systems
• Big data solution stack
41
Relational DBMS
• Data models
– E (entity set) R
– Relational (redundancy => update anomaly)
• Schema
– Normal forms, e.g., 3NF, BCNF (not required, unless we
talk about data warehousing)
– Multidimensional model
– Fact table + dimension (product, customer, time) tables
– Star/snowflake
– ETL: insert into … (warehouse)
• select … From … (data sources)
42
RDBMS
• Query languages
– Relational algebra
– SQL, constraints, views
• Data organization
– Records and blocks
– Index structure: B+-tree (external data structure)
43
RDBMS
• Query execution algorithms
– External sorting
– One-pass algorithms
– Nested-loop join, sorting, hashing-based
– Multiple-pass algorithms
44
RDBMS
• Rigid schema
• Strong consistency is the key design goal
– Never read old data
– Suitable for mission-critical applications, e.g.,
banking
• But may suffer from low availability
– ACID vs CAP
45
RDBMS
• Hard to scale out
– Horizontal partitioning/sharding possible
– But would need distributed storage & computing
support like Hadoop & MapReduce
46
RDBMS Examples
• MySQL (can be installed in Amazon AWS EC2)
• Amazon RDS (Relational database as a service)
– DBMS in the cloud
– Database as a service
• Data warehouse on RDBMS
– OLAP
47
Amazon RDS: Database-as-a-service
• MySQL, PostgreSQL, Oracle, SQL Server, etc.
48
49
address name field
Professor
Advises
Takes
Teaches
Course
Student
name category
semester
name
ssn
Conceptual
Modeling
cid
50
Schema Design and Implementation
• Tables (relations):
• Separates the logical view from the physical view
of the data.
SSN Name Category
123-45-6789 Charles undergrad
234-56-7890 Dan grad
… …
SSN CID
123-45-6789 CSE444
123-45-6789 CSE444
234-56-7890 CSE142
…
Students: Takes:
CID Name Semster
CSE444 Databases fall
CSE541 Operating systems spring
Courses:
51
Querying a Database
• Find all courses that “Mary” takes
• S(tructured) Q(uery) L(anguage)
• Query processor figures out how to answer
the query efficiently.
select C.name
from Students S, Takes T, Courses C
where S.name = “Mary” and
S.ssn = T.ssn and T.cid = C.cid
Select A’s ,agg
From R’s
Where C’s
Group by A’s
Having
Order by
Limit ?
Offset ?
(pagination)
===
Insert
Update
Delete
Declarative (what)
52
Query Optimization
Imperative query execution plan:
select C.name
from Students S, Takes T, Courses C
where S.name=”Mary” and
S.ssn = T.ssn and T.cid = C.cid
Declarative SQL query
Plan: tree of Relational Algebra operators,
choice of algorithms at each operator
Goal:
Students Takes
sid=sid
Courses.name
name=”Mary”
cid=cid
Courses
Major topics
• Storage systems
• File systems & file formats
• Database management systems
• Big data solution stack
53
Topics
• Big data management & analytics
– Cloud data storage (Amazon S3)
– NoSQL
• Google Firebase (real-time database, …)
• MongoDB (shell, mongo)
• Amazon DynamoDB (row store, key-value)
• Cassandra (not required)
– Apache Hadoop & MapReduce
– Apache Spark
54
Cloud data storage
• Amazon S3 (simple storage service)
– Ideal for storing large binary files
– E.g., audio, video, image
– Simple RESTful web service
• Eventual consistency for high availability
55
56
Upload a file
57
58
NoSQL
• Not only SQL
• Flexible schemas
– e.g., JSON documents or key-value pairs
– Ideal for managing a mix of structured, semi-
structured, and unstructured data
• High availability (CAP)
• Weaker (e.g., eventual) consistency model
59
Example NoSQL databases
• MongoDB, Firebase, etc.
– Manage JSON documents
• Amazon DynamoDB
– Row store
– row = item = a collection of key-value pairs
• Apache Cassandra (not required)
– Wide column store
– Google’s Bigtable clone
• Neo4J…
60
Key techniques
• Consistent hashing (Cassandra, Dynamo)
– Avoid moving too much data when adding new
machines (scaling out)
• Efficient writes (for update-heavy apps)
– Append-only
– No overwrites
– Avoid random seek
– But compaction needed later
61
Write path in Cassandra
62
1
2
3
Key techniques
• Compaction
– Introduced in Google “Bigtable” paper
– Merge multiple versions of data
– Remove expired or deleted data
63
DynamoDB
• https://console.aws.amazon.com/dynamodb/
home?region=us-east-1#gettingStarted:
64
https://console.aws.amazon.com/dynamodb/home?region=us-east-1#gettingStarted
65
Insert items
66
May add new attributes
67
Firebase: a cloud database
68
Firebase
69
Topics
• Big data management & analytics
– Cloud data storage (Amazon S3)
– NoSQL (Amazon DynamoDB, Cassandra,
MongoDB)
– MapReduce
– Apache Hadoop
– Apache Spark
70
Roots in functional programming
• Functional programming languages:
– Python, Lisp (list processor), Scheme, Erlang, Haskell
• Two functions:
– Map: mapping a list => list
– Reduce: reducing a list => value
• map() and reduce() in Python
– https://docs.python.org/2/library/functions.html#ma
p
71
https://docs.python.org/2/library/functions.html#map
map() and reduce() in Python
• list = [1, 2, 3]
• def sqr(x): return x ** 2
• list1 = map(sqr, list)
• def add(x, y): return x + y
• z = reduce(add, list)
72
What are the value of list1 and z?
reduce() is in functools module of Python 3
Lambda function
• Anonymous function (not bound to a name)
• list = [1, 2, 3]
• list1 = map(lambda x: x ** 2, list)
• z = reduce(lambda x, y: x + y, list)
73
How is reduce() in Python evaluated?
• z = reduce(f, list) where f is add function
• Initially, z (an accumulator) is set to list[0]
• Next, repeat z = add(z, list[i]) for each i > 0
• Return final z
• Example: z = reduce(add, [1, 2, 3])
– i = 0, z = 1; i = 1, z = 3; i = 2, z = 6
74
Hadoop MapReduce
• Map
–
• Reduce:
–
• Write MapReduce programs on Hadoop
– Using Java
75
MapReduce
76
WordCount: mapper
77
Data types of input key-value
Data types of output key-value
Object can be replaced with LongWritable
Key-value pairs with specified data types
WordCount: reducer
78
Data types of input key-value
Data types of output key-value
A list of values
Characteristics of Hadoop
• Acyclic data flow model
– Data loaded from stable storage (e.g., HDFS)
– Processed through a sequence of steps
– Results written to disk
• Batch processing
– No interactions permitted during processing
79
Problems
• Ill-suited for iterative algorithms that requires
repeated reuse of data
– E.g., machine learning and data mining algorithms
such as k-means, PageRank, logistic regression
• Ill-suited for interactive exploration of data
– E.g., OLAP on big data
80
In-memory MapReduce (Spark)
• Key concepts
– RDD (resilient distributed dataset)
– Transformations
– Actions
81
Apache Spark: history
82
Spark
• Support working sets through RDD
– Enabling reuse & fault-tolerance
• 10x faster than Hadoop in iterative jobs
• Interactively explore 39GB with sub-second
response time
83
Spark
• Combine SQL, streaming, and complex
analytics
• We will see DataFrame in Spark too
84
Spark
• Run on Hadoop, Cassandra, HBase, etc.
85
wc.py
from pyspark import SparkContext
from operator import add
sc = SparkContext(appName=”inf551″)
lines = sc.textFile(‘hello.txt’)
counts = lines.flatMap(lambda x: x.split(‘ ‘)) \
.map(lambda x: (x, 1)) \
.reduceByKey(add)
output = counts.collect()
for v in output:
print(v[0], v[1])
86
Lab session
• Task: Setting up an EC2 instance
• Details: see lab session slides to be posted…
87