Course Review

Introduction

DSCI 551

Wensheng Wu

Logistics

• Instructor email:

• Class meeting times (2 sections):
– MW 3:30-5:20pm
– Tuesday 3:30-6:50pm

• Office hours:
– MW 2:30-3:15pm (Zoom, link to be posted)
– After class
– Please email for appointment

mailto:

Logistics

• TAs & office hours

– Watch the announcements

• Class materials

– Posted on Blackboard

Piazza

• Discussion forums
– You may post general and homework questions

– Do not post solutions

– Please actively participate in helping others!

– Do not abuse forum (an academic misconduct!)

• Check frequently for updates

• Follow instruction on Blackboard to sign up

Prerequisites

• Programming skills:
– Python (homework, Spark), Java (e.g., for Hadoop only)

• Unix-like environment & shell commands (ls?)
– E.g., Amazon EC2

• Basic knowledge of algorithms and data structures
– Sorting, hashing, etc. (CS 570)

• Basic probability and statistics

Textbooks

• Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-
Dusseau. Operating Systems: Three Easy Pieces,
2015 (selected chapters only). Available free at:
http://pages.cs.wisc.edu/~remzi/OSTEP/

• Hector Garcia-Molina, Jeffrey D. Ullman, and
Jennifer Widom. Database Systems: The
Complete Book (Second Edition), Prentice Hall,
2009. (selected chapters only)
– http://infolab.stanford.edu/~ullman/dscb.html

http://pages.cs.wisc.edu/~remzi/OSTEP/
http://infolab.stanford.edu/~ullman/dscb.html

Additional readings

• Links can be found in Syllabus

– Check out the schedule

Grading structure

• Homework 20%

• Midterm 1 15%

• Midterm 15%

• Final 25%

• Lab sessions 5%

• Group project 20%

Grading scale

• [93, 100] = A

• [90, 93) = A-

• [87, 90) = B+

• [83, 87) = B

• [80, 83) = B-

• [77, 80) = C+

• [73, 77) = C

• … (see Syllabus for complete breakdown)

Lab sessions

• Flipped:

– Task and details posted before class

– Bring questions to class

• Typically utilize last 15-30 mins of class

Exams

• Closed-notes & book

• Midterms:

– See syllabus for time

• Final:

– See syllabus for time

Calculator

• Bring one to the tests

• If calculator is needed, we will either
announce or state it on the tests

• Otherwise, no electronic devices are allowed

Course project

• Details coming up

• Done in phases

– Proposal

– Midterm report

– Final report

Late Policy

• Homework/coursework will be submitted online

– 10% for every 24 hours late

– No credit after 3 days

• Make up for tests are permitted only when

– You have an emergency, typically medical

– Let me know at least two weeks in advance

– You may be required to contact Campus Support and
Intervention office for verification of emergency

Grading Corrections

• All coursework’s grades are final one week after
grades are posted

• Final exam grades (& all grades) are final after
final exam grading review time (to be announced
right before/after final)

• Please submit reasonable regrading requests
– Irrational requests (e.g., simply asking for more points

or special treatments) may result in reduction of your
grades

Academic Integrity

• Cheating will NOT be tolerated

• All parties involved will receive a grade of F
for the course and be reported to SJACS
WITHOUT EXCEPTION

– USC Student Judicial Affairs and Community
Standards

Welcome

Now, movie time ☺

• Explain big data:

– https://www.youtube.com/watch?v=7D1CQ_LOiz
A

• Questions:

– Where does big data come from?

– What characteristics doe it have? 3Vs?

– What big data technologies were mentioned?

Variety

Internet Traffic in 2012

• 4.8 zettabyte = 4.8 billion terabytes

• Zettabyte (1000 exabytes)
• Exabyte
• Petabyte
• Terabyte (storage)
• Gigabyte = 2^30 (memory)
• Megabyte (128MB, HDFS)
• Kilobyte = 2^10 (4KB)

Major topics

• Storage systems

• File systems & file formats

• Database management systems (RDBMS)
– R = relational

• Big data solution stack

Storage Systems

• Hard disk

• SSD (Solid state drive)

https://youtu.be/4iaxOUYalJU

Internal of hard disk

Actuator Spindle

Disk head

Platter

https://youtu.be/4iaxOUYalJU

NAND flash

Latencies: read, write, and erase

Major topics

• Storage systems

• File systems & file formats

• Database management systems

• Big data solution stack

File Systems

• Standalone

– Single machine

• Distributed (e.g., Hadoop)

– A number of data servers

Standalone file systems

• Data structures

– Data blocks

– Metadata blocks (Inodes)

– Bitmap blocks (for space allocation)

• Access paths

– Read

– write

Inode (index node)

• Each is identified by a number

– Low-level number of file name: inumber

• Can figure out location of inode from inumber

Distributed file systems

• Hadoop HDFS (after GFS)
– Data are distributed among data nodes

• Replication
– Automatic creation of replica (typically 2 or 3

copies/replica of data)

• Fault-tolerant
– Automatic recovery from node failure

HDFS architecture

Major topics

• Storage systems

• File systems & file formats

• Database management systems

• Big data solution stack

File Formats

• JSON

HTML

Bibliography

Foundations of Databases

Abiteboul, Hull, Vianu

Addison Wesley, 1995

Data on the Web

Abiteoul, Buneman, Suciu

Morgan Kaufmann, 1999

XML

Foundations…

Abiteboul

Hull

Vianu

Addison Wesley

1995

…

XML describes the content

XML usages

• Software configurations files

– E.g., HDFS

• Android app development

– Layout resource files, e.g., activity_main.xml

• Java archive (.jar file)

– Manifest.xml

Android app resource file

Manifest.xml

…

Addison-Wesley

Serge Abiteboul

RickHull

Victor Vianu

Foundations of Databases

1995

38.8

Freeman

Jeffrey D. Ullman

Principles of Database and Knowledge Base Systems

1998

…

Data Model for XPath

bib

book book

publisher author . . . .

Addison-Wesley Serge Abiteboul

Document node

The root element

XPath: Simple Expressions

Result: 1995

1998

Result: empty (there were no papers)

/bib/book/year

/bib/paper/year

Major topics

• Storage systems

• File systems & file formats

• Database management systems

• Big data solution stack

Relational DBMS

• Data models
– E (entity set) R
– Relational (redundancy => update anomaly)

• Schema
– Normal forms, e.g., 3NF, BCNF (not required, unless we

talk about data warehousing)
– Multidimensional model
– Fact table + dimension (product, customer, time) tables
– Star/snowflake
– ETL: insert into … (warehouse)

• select … From … (data sources)

RDBMS

• Query languages

– Relational algebra

– SQL, constraints, views

• Data organization

– Records and blocks

– Index structure: B+-tree (external data structure)

RDBMS

• Query execution algorithms

– External sorting

– One-pass algorithms

– Nested-loop join, sorting, hashing-based

– Multiple-pass algorithms

RDBMS

• Rigid schema

• Strong consistency is the key design goal
– Never read old data

– Suitable for mission-critical applications, e.g.,
banking

• But may suffer from low availability
– ACID vs CAP

RDBMS

• Hard to scale out

– Horizontal partitioning/sharding possible

– But would need distributed storage & computing
support like Hadoop & MapReduce

RDBMS Examples

• MySQL (can be installed in Amazon AWS EC2)

• Amazon RDS (Relational database as a service)

– DBMS in the cloud

– Database as a service

• Data warehouse on RDBMS

– OLAP

Amazon RDS: Database-as-a-service

• MySQL, PostgreSQL, Oracle, SQL Server, etc.

49
address name field

Professor

Advises

Takes

Teaches

Course

Student

name category

semester

name

ssn

Conceptual

Modeling

cid

Schema Design and Implementation

• Tables (relations):

• Separates the logical view from the physical view
of the data.

SSN Name Category

123-45-6789 Charles undergrad

234-56-7890 Dan grad

… …

SSN CID

123-45-6789 CSE444

234-56-7890 CSE142

…

Students: Takes:

CID Name Semster

CSE444 Databases fall

CSE541 Operating systems spring

Courses:

Querying a Database

• Find all courses that “Mary” takes

• S(tructured) Q(uery) L(anguage)

• Query processor figures out how to answer
the query efficiently.

select C.name
from Students S, Takes T, Courses C
where S.name = “Mary” and

S.ssn = T.ssn and T.cid = C.cid

Select A’s ,agg
From R’s
Where C’s
Group by A’s
Having
Order by
Limit ?
Offset ?
(pagination)

===
Insert
Update
Delete

Declarative (what)

Query Optimization

Imperative query execution plan:

select C.name

from Students S, Takes T, Courses C

where S.name=”Mary” and

S.ssn = T.ssn and T.cid = C.cid

Declarative SQL query

Plan: tree of Relational Algebra operators,

choice of algorithms at each operator

Goal:

Students Takes

sid=sid

Courses.name

name=”Mary”

cid=cid

Courses

Major topics

• Storage systems

• File systems & file formats

• Database management systems

• Big data solution stack

Topics

• Big data management & analytics

– Cloud data storage (Amazon S3)

– NoSQL

• Google Firebase (real-time database, …)

• MongoDB (shell, mongo)

• Amazon DynamoDB (row store, key-value)

• Cassandra (not required)

– Apache Hadoop & MapReduce

– Apache Spark

Cloud data storage

• Amazon S3 (simple storage service)

– Ideal for storing large binary files

– E.g., audio, video, image

– Simple RESTful web service

• Eventual consistency for high availability

Upload a file

NoSQL

• Not only SQL

• Flexible schemas
– e.g., JSON documents or key-value pairs
– Ideal for managing a mix of structured, semi-

structured, and unstructured data

• High availability (CAP)

• Weaker (e.g., eventual) consistency model

Example NoSQL databases

• MongoDB, Firebase, etc.
– Manage JSON documents

• Amazon DynamoDB
– Row store
– row = item = a collection of key-value pairs

• Apache Cassandra (not required)
– Wide column store
– Google’s Bigtable clone

• Neo4J…

Key techniques

• Consistent hashing (Cassandra, Dynamo)
– Avoid moving too much data when adding new

machines (scaling out)

• Efficient writes (for update-heavy apps)
– Append-only

– No overwrites

– Avoid random seek

– But compaction needed later

Write path in Cassandra

Key techniques

• Compaction

– Introduced in Google “Bigtable” paper

– Merge multiple versions of data

– Remove expired or deleted data

DynamoDB

• https://console.aws.amazon.com/dynamodb/
home?region=us-east-1#gettingStarted:

https://console.aws.amazon.com/dynamodb/home?region=us-east-1#gettingStarted

Insert items

May add new attributes

Firebase: a cloud database

Firebase

Topics

• Big data management & analytics

– Cloud data storage (Amazon S3)

– NoSQL (Amazon DynamoDB, Cassandra,
MongoDB)

– MapReduce

– Apache Hadoop

– Apache Spark

Roots in functional programming

• Functional programming languages:
– Python, Lisp (list processor), Scheme, Erlang, Haskell

• Two functions:
– Map: mapping a list => list

– Reduce: reducing a list => value

• map() and reduce() in Python
– https://docs.python.org/2/library/functions.html#ma

https://docs.python.org/2/library/functions.html#map

map() and reduce() in Python

• list = [1, 2, 3]

• def sqr(x): return x ** 2

• list1 = map(sqr, list)

• def add(x, y): return x + y

• z = reduce(add, list)

What are the value of list1 and z?

reduce() is in functools module of Python 3

Lambda function

• Anonymous function (not bound to a name)

• list = [1, 2, 3]

• list1 = map(lambda x: x ** 2, list)

• z = reduce(lambda x, y: x + y, list)

How is reduce() in Python evaluated?

• z = reduce(f, list) where f is add function

• Initially, z (an accumulator) is set to list[0]

• Next, repeat z = add(z, list[i]) for each i > 0

• Return final z

• Example: z = reduce(add, [1, 2, 3])
– i = 0, z = 1; i = 1, z = 3; i = 2, z = 6

Hadoop MapReduce

• Map

– => list of

• Reduce:

– => list of

• Write MapReduce programs on Hadoop

– Using Java

MapReduce

WordCount: mapper

Data types of input key-value

Data types of output key-value

Object can be replaced with LongWritable

Key-value pairs with specified data types

WordCount: reducer

Data types of input key-value

Data types of output key-value

A list of values

Characteristics of Hadoop

• Acyclic data flow model

– Data loaded from stable storage (e.g., HDFS)

– Processed through a sequence of steps

– Results written to disk

• Batch processing

– No interactions permitted during processing

Problems

• Ill-suited for iterative algorithms that requires
repeated reuse of data

– E.g., machine learning and data mining algorithms
such as k-means, PageRank, logistic regression

• Ill-suited for interactive exploration of data

– E.g., OLAP on big data

In-memory MapReduce (Spark)

• Key concepts

– RDD (resilient distributed dataset)

– Transformations

– Actions

Apache Spark: history

Spark

• Support working sets through RDD

– Enabling reuse & fault-tolerance

• 10x faster than Hadoop in iterative jobs

• Interactively explore 39GB with sub-second
response time

Spark

• Combine SQL, streaming, and complex
analytics

• We will see DataFrame in Spark too

Spark

• Run on Hadoop, Cassandra, HBase, etc.

wc.py

from pyspark import SparkContext
from operator import add

sc = SparkContext(appName=”inf551″)

lines = sc.textFile(‘hello.txt’)

counts = lines.flatMap(lambda x: x.split(‘ ‘)) \
.map(lambda x: (x, 1)) \
.reduceByKey(add)

Bibliography

Related Posts