CS计算机代考程序代写 SQL scheme python data structure database Java file system flex hbase android data mining hadoop Erlang AWS Haskell algorithm Hive Course Review

Course Review

Introduction

DSCI 551

Wensheng Wu

1

Logistics

• Instructor email:

• Class meeting times (2 sections):
– MW 3:30-5:20pm
– Tuesday 3:30-6:50pm

• Office hours:
– MW 2:30-3:15pm (Zoom, link to be posted)
– After class
– Please email for appointment

2

mailto:

Logistics

• TAs & office hours

– Watch the announcements

• Class materials

– Posted on Blackboard

3

Piazza

• Discussion forums
– You may post general and homework questions

– Do not post solutions

– Please actively participate in helping others!

– Do not abuse forum (an academic misconduct!)

• Check frequently for updates

• Follow instruction on Blackboard to sign up

4

Prerequisites

• Programming skills:
– Python (homework, Spark), Java (e.g., for Hadoop only)

• Unix-like environment & shell commands (ls?)
– E.g., Amazon EC2

• Basic knowledge of algorithms and data structures
– Sorting, hashing, etc. (CS 570)

• Basic probability and statistics

5

Textbooks

• Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-
Dusseau. Operating Systems: Three Easy Pieces,
2015 (selected chapters only). Available free at:
http://pages.cs.wisc.edu/~remzi/OSTEP/

• Hector Garcia-Molina, Jeffrey D. Ullman, and
Jennifer Widom. Database Systems: The
Complete Book (Second Edition), Prentice Hall,
2009. (selected chapters only)
– http://infolab.stanford.edu/~ullman/dscb.html

6

http://pages.cs.wisc.edu/~remzi/OSTEP/
http://infolab.stanford.edu/~ullman/dscb.html

Additional readings

• Links can be found in Syllabus

– Check out the schedule

7

Grading structure

• Homework 20%

• Midterm 1 15%

• Midterm 15%

• Final 25%

• Lab sessions 5%

• Group project 20%

8

Grading scale

• [93, 100] = A

• [90, 93) = A-

• [87, 90) = B+

• [83, 87) = B

• [80, 83) = B-

• [77, 80) = C+

• [73, 77) = C

• … (see Syllabus for complete breakdown)

9

Lab sessions

• Flipped:

– Task and details posted before class

– Bring questions to class

• Typically utilize last 15-30 mins of class

10

Exams

• Closed-notes & book

• Midterms:

– See syllabus for time

• Final:

– See syllabus for time

11

Calculator

• Bring one to the tests

• If calculator is needed, we will either
announce or state it on the tests

• Otherwise, no electronic devices are allowed

12

Course project

• Details coming up

• Done in phases

– Proposal

– Midterm report

– Final report

13

Late Policy

• Homework/coursework will be submitted online

– 10% for every 24 hours late

– No credit after 3 days

• Make up for tests are permitted only when

– You have an emergency, typically medical

– Let me know at least two weeks in advance

– You may be required to contact Campus Support and
Intervention office for verification of emergency

14

Grading Corrections

• All coursework’s grades are final one week after
grades are posted

• Final exam grades (& all grades) are final after
final exam grading review time (to be announced
right before/after final)

• Please submit reasonable regrading requests
– Irrational requests (e.g., simply asking for more points

or special treatments) may result in reduction of your
grades

15

Academic Integrity

• Cheating will NOT be tolerated

• All parties involved will receive a grade of F
for the course and be reported to SJACS
WITHOUT EXCEPTION

– USC Student Judicial Affairs and Community
Standards

16

Welcome

Now, movie time ☺

• Explain big data:

– https://www.youtube.com/watch?v=7D1CQ_LOiz
A

• Questions:

– Where does big data come from?

– What characteristics doe it have? 3Vs?

– What big data technologies were mentioned?

17

Variety

18

Internet Traffic in 2012

• 4.8 zettabyte = 4.8 billion terabytes

• Zettabyte (1000 exabytes)
• Exabyte
• Petabyte
• Terabyte (storage)
• Gigabyte = 2^30 (memory)
• Megabyte (128MB, HDFS)
• Kilobyte = 2^10 (4KB)

19

Major topics

• Storage systems

• File systems & file formats

• Database management systems (RDBMS)
– R = relational

• Big data solution stack

20

Storage Systems

• Hard disk

• SSD (Solid state drive)

21

https://youtu.be/4iaxOUYalJU

Internal of hard disk

22

Actuator Spindle

Disk head

Platter

https://youtu.be/4iaxOUYalJU

NAND flash

23

Latencies: read, write, and erase

24

Major topics

• Storage systems

• File systems & file formats

• Database management systems

• Big data solution stack

25

File Systems

• Standalone

– Single machine

• Distributed (e.g., Hadoop)

– A number of data servers

26

Standalone file systems

• Data structures

– Data blocks

– Metadata blocks (Inodes)

– Bitmap blocks (for space allocation)

• Access paths

– Read

– write

27

Inode (index node)

• Each is identified by a number

– Low-level number of file name: inumber

• Can figure out location of inode from inumber

28

Distributed file systems

• Hadoop HDFS (after GFS)
– Data are distributed among data nodes

• Replication
– Automatic creation of replica (typically 2 or 3

copies/replica of data)

• Fault-tolerant
– Automatic recovery from node failure

29

HDFS architecture

30

Major topics

• Storage systems

• File systems & file formats

• Database management systems

• Big data solution stack

31

File Formats

• JSON

32

33

HTML

Bibliography

Foundations of Databases

Abiteboul, Hull, Vianu

Addison Wesley, 1995

Data on the Web

Abiteoul, Buneman, Suciu

Morgan Kaufmann, 1999

34

XML

Foundations…

Abiteboul

Hull

Vianu

Addison Wesley

1995

XML describes the content

XML usages

• Software configurations files

– E.g., HDFS

• Android app development

– Layout resource files, e.g., activity_main.xml

• Java archive (.jar file)

– Manifest.xml

35

Android app resource file

36

Manifest.xml

37

38

Addison-Wesley

Serge Abiteboul

RickHull

Victor Vianu

Foundations of Databases

1995

38.8

Freeman

Jeffrey D. Ullman

Principles of Database and Knowledge Base Systems

1998

39

Data Model for XPath

bib

book book

publisher author . . . .

Addison-Wesley Serge Abiteboul

Document node

The root element

40

XPath: Simple Expressions

Result: 1995

1998

Result: empty (there were no papers)

/bib/book/year

/bib/paper/year

Major topics

• Storage systems

• File systems & file formats

• Database management systems

• Big data solution stack

41

Relational DBMS

• Data models
– E (entity set) R
– Relational (redundancy => update anomaly)

• Schema
– Normal forms, e.g., 3NF, BCNF (not required, unless we

talk about data warehousing)
– Multidimensional model
– Fact table + dimension (product, customer, time) tables
– Star/snowflake
– ETL: insert into … (warehouse)

• select … From … (data sources)

42

RDBMS

• Query languages

– Relational algebra

– SQL, constraints, views

• Data organization

– Records and blocks

– Index structure: B+-tree (external data structure)

43

RDBMS

• Query execution algorithms

– External sorting

– One-pass algorithms

– Nested-loop join, sorting, hashing-based

– Multiple-pass algorithms

44

RDBMS

• Rigid schema

• Strong consistency is the key design goal
– Never read old data

– Suitable for mission-critical applications, e.g.,
banking

• But may suffer from low availability
– ACID vs CAP

45

RDBMS

• Hard to scale out

– Horizontal partitioning/sharding possible

– But would need distributed storage & computing
support like Hadoop & MapReduce

46

RDBMS Examples

• MySQL (can be installed in Amazon AWS EC2)

• Amazon RDS (Relational database as a service)

– DBMS in the cloud

– Database as a service

• Data warehouse on RDBMS

– OLAP

47

Amazon RDS: Database-as-a-service

• MySQL, PostgreSQL, Oracle, SQL Server, etc.

48

49
address name field

Professor

Advises

Takes

Teaches

Course

Student

name category

semester

name

ssn

Conceptual

Modeling

cid

50

Schema Design and Implementation

• Tables (relations):

• Separates the logical view from the physical view
of the data.

SSN Name Category

123-45-6789 Charles undergrad

234-56-7890 Dan grad

… …

SSN CID

123-45-6789 CSE444

123-45-6789 CSE444

234-56-7890 CSE142

Students: Takes:

CID Name Semster

CSE444 Databases fall

CSE541 Operating systems spring

Courses:

51

Querying a Database

• Find all courses that “Mary” takes

• S(tructured) Q(uery) L(anguage)

• Query processor figures out how to answer
the query efficiently.

select C.name
from Students S, Takes T, Courses C
where S.name = “Mary” and

S.ssn = T.ssn and T.cid = C.cid

Select A’s ,agg
From R’s
Where C’s
Group by A’s
Having
Order by
Limit ?
Offset ?
(pagination)

===
Insert
Update
Delete

Declarative (what)

52

Query Optimization

Imperative query execution plan:

select C.name

from Students S, Takes T, Courses C

where S.name=”Mary” and

S.ssn = T.ssn and T.cid = C.cid

Declarative SQL query

Plan: tree of Relational Algebra operators,

choice of algorithms at each operator

Goal:

Students Takes

sid=sid

Courses.name

name=”Mary”

cid=cid

Courses

Major topics

• Storage systems

• File systems & file formats

• Database management systems

• Big data solution stack

53

Topics

• Big data management & analytics

– Cloud data storage (Amazon S3)

– NoSQL

• Google Firebase (real-time database, …)

• MongoDB (shell, mongo)

• Amazon DynamoDB (row store, key-value)

• Cassandra (not required)

– Apache Hadoop & MapReduce

– Apache Spark

54

Cloud data storage

• Amazon S3 (simple storage service)

– Ideal for storing large binary files

– E.g., audio, video, image

– Simple RESTful web service

• Eventual consistency for high availability

55

56

Upload a file

57

58

NoSQL

• Not only SQL

• Flexible schemas
– e.g., JSON documents or key-value pairs
– Ideal for managing a mix of structured, semi-

structured, and unstructured data

• High availability (CAP)

• Weaker (e.g., eventual) consistency model

59

Example NoSQL databases

• MongoDB, Firebase, etc.
– Manage JSON documents

• Amazon DynamoDB
– Row store
– row = item = a collection of key-value pairs

• Apache Cassandra (not required)
– Wide column store
– Google’s Bigtable clone

• Neo4J…

60

Key techniques

• Consistent hashing (Cassandra, Dynamo)
– Avoid moving too much data when adding new

machines (scaling out)

• Efficient writes (for update-heavy apps)
– Append-only

– No overwrites

– Avoid random seek

– But compaction needed later

61

Write path in Cassandra

62

1

2

3

Key techniques

• Compaction

– Introduced in Google “Bigtable” paper

– Merge multiple versions of data

– Remove expired or deleted data

63

DynamoDB

• https://console.aws.amazon.com/dynamodb/
home?region=us-east-1#gettingStarted:

64

https://console.aws.amazon.com/dynamodb/home?region=us-east-1#gettingStarted

65

Insert items

66

May add new attributes

67

Firebase: a cloud database

68

Firebase

69

Topics

• Big data management & analytics

– Cloud data storage (Amazon S3)

– NoSQL (Amazon DynamoDB, Cassandra,
MongoDB)

– MapReduce

– Apache Hadoop

– Apache Spark

70

Roots in functional programming

• Functional programming languages:
– Python, Lisp (list processor), Scheme, Erlang, Haskell

• Two functions:
– Map: mapping a list => list

– Reduce: reducing a list => value

• map() and reduce() in Python
– https://docs.python.org/2/library/functions.html#ma

p

71

https://docs.python.org/2/library/functions.html#map

map() and reduce() in Python

• list = [1, 2, 3]

• def sqr(x): return x ** 2

• list1 = map(sqr, list)

• def add(x, y): return x + y

• z = reduce(add, list)

72

What are the value of list1 and z?

reduce() is in functools module of Python 3

Lambda function

• Anonymous function (not bound to a name)

• list = [1, 2, 3]

• list1 = map(lambda x: x ** 2, list)

• z = reduce(lambda x, y: x + y, list)

73

How is reduce() in Python evaluated?

• z = reduce(f, list) where f is add function

• Initially, z (an accumulator) is set to list[0]

• Next, repeat z = add(z, list[i]) for each i > 0

• Return final z

• Example: z = reduce(add, [1, 2, 3])
– i = 0, z = 1; i = 1, z = 3; i = 2, z = 6

74

Hadoop MapReduce

• Map

=> list of

• Reduce:

=> list of

• Write MapReduce programs on Hadoop

– Using Java

75

MapReduce

76

WordCount: mapper

77

Data types of input key-value

Data types of output key-value

Object can be replaced with LongWritable

Key-value pairs with specified data types

WordCount: reducer

78

Data types of input key-value

Data types of output key-value

A list of values

Characteristics of Hadoop

• Acyclic data flow model

– Data loaded from stable storage (e.g., HDFS)

– Processed through a sequence of steps

– Results written to disk

• Batch processing

– No interactions permitted during processing

79

Problems

• Ill-suited for iterative algorithms that requires
repeated reuse of data

– E.g., machine learning and data mining algorithms
such as k-means, PageRank, logistic regression

• Ill-suited for interactive exploration of data

– E.g., OLAP on big data

80

In-memory MapReduce (Spark)

• Key concepts

– RDD (resilient distributed dataset)

– Transformations

– Actions

81

Apache Spark: history

82

Spark

• Support working sets through RDD

– Enabling reuse & fault-tolerance

• 10x faster than Hadoop in iterative jobs

• Interactively explore 39GB with sub-second
response time

83

Spark

• Combine SQL, streaming, and complex
analytics

• We will see DataFrame in Spark too

84

Spark

• Run on Hadoop, Cassandra, HBase, etc.

85

wc.py

from pyspark import SparkContext
from operator import add

sc = SparkContext(appName=”inf551″)

lines = sc.textFile(‘hello.txt’)

counts = lines.flatMap(lambda x: x.split(‘ ‘)) \
.map(lambda x: (x, 1)) \
.reduceByKey(add)

output = counts.collect()

for v in output:
print(v[0], v[1])

86

Lab session

• Task: Setting up an EC2 instance

• Details: see lab session slides to be posted…

87