Large Scale Data Processing
Adaptation from
(Univ. of Washington) Mining of Massive Datasets, by Rajaraman and Gates (Yahoo!)
Why Distributed Data Processing
• Hardware:
– CPU clock speed no longer increases
– Instead: multicore
• Commodity clusters
– Easy access to 1000s of nodes through cloud computing
– Much cheaper than a large mainframe
• Big Data
– Astronomy: high-resolution, high-frequency sky surveys
– Medicine: digital records, MRI, ultrasound
– Biology: sequencing data
– User behavior data: clickstreams, search logs, …
• Google and Facebook, but also Walmart and co. …
Distribution and Performance
• Traditionally: scale-up
– Improve performance by buying larger machine
• Distribution: scale-out
– Improve performance through parallel execution
• Performance metrics:
– Throughput: transactions/queries per time unit
• The higher, the better
• Important for OLTP
– Response time: time for execution of an individual transaction/query
• The smaller, the better
• Important for OLAP
• Speedup (the data size remains the same):
– More nodes → more throughput and/or lower response time
(Figure: ideal vs. actual speedup as the number of nodes grows, shown for response time and for throughput.)
• Non-linear speedup:
– Startup costs
– Coordination costs
– Communication costs
– Skew (equal distribution of load not possible)
• Scaleup (the data size increases):
– More nodes → same throughput / same response time despite more data
• Non-linear speedup and scaleup:
– Data distribution overhead
– Non-parallelizable operations (e.g., aggregation; see the sketch below)
– Communication costs
– Skew (equal distribution of data not possible)
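As a rough illustration of why speedup flattens out, the sketch below applies a standard Amdahl-style estimate (an assumption for illustration, not taken from the slides): if a fraction s of the work, such as a final non-parallelizable aggregation, must run on a single node, then n nodes give a speedup of at most 1 / (s + (1 - s) / n).

# Rough illustration (assumption: Amdahl-style model, not from the slides).
# s = non-parallelizable fraction of the work, n = number of nodes.
def speedup(s, n):
    return 1.0 / (s + (1.0 - s) / n)

for n in (1, 10, 100, 1000):
    # With 5% serial work the speedup saturates well below n:
    print(n, round(speedup(0.05, n), 1))   # 1.0, 6.9, 16.8, 19.6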
Parallel Relational Database Systems
• A lot of DBS technology was developed in the 90s.
• Good understanding of distributed execution of
relational algebra queries
• Both for
– OLTP (online transaction processing): workload of short, update-intensive queries, such as day-to-day banking, flight reservations, etc.
– OLAP (online analytical processing) / Decision support: workload of complex queries, mainly read-only
• Sophisticated and optimized operators (such as distributed join operators…)
• Expensive and specialized: Oracle, Teradata, …
Parallel Query Evaluation
• Inter-query parallelism
– Different queries run in parallel on different processors; each query is executed sequentially
• Inter-operator parallelism
– Different operators within the same execution tree run on different processors
– Pipelining leads to parallelism
• Intra-operator parallelism
– A single operator (e.g., scan, join) runs on many processors
– Topic of this week
Horizontal Data Partitioning
– Large table R(K, A, B, C) or key-value store KV(K, V)
– Partition into chunks C1, C2, …, Cn of records stored at n nodes
• Hash partitioned on attribute X:
– Record r goes to chunk i according to a hash function
– Example hash function: H = r.X mod n + 1
• Range partitioned on attribute X:
– Partition the range of X into -∞ = v0 < v1 < … < vn = +∞; record r goes to chunk i if v(i-1) < r.X ≤ v(i)
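A minimal sketch of the two partitioning schemes, assuming numeric partitioning attributes; the helper functions hash_partition and range_partition are hypothetical names, not part of any particular system.

# Hash and range partitioning of records across n nodes (illustrative sketch).
def hash_partition(record, attr, n):
    # Record r goes to chunk i = r.X mod n + 1 (1-based chunk numbering), as in the slides.
    return record[attr] % n + 1

def range_partition(record, attr, boundaries):
    # boundaries = [v1, v2, ..., v(n-1)] split -inf..+inf into n ranges;
    # record r goes to chunk i if v(i-1) < r.X <= v(i).
    x = record[attr]
    for i, v in enumerate(boundaries, start=1):
        if x <= v:
            return i
    return len(boundaries) + 1

# Example: partition user records on 'uid' across n = 4 nodes.
users = [{"uid": 3, "name": "a"}, {"uid": 17, "name": "b"}, {"uid": 42, "name": "c"}]
print([hash_partition(u, "uid", 4) for u in users])              # [4, 2, 3]
print([range_partition(u, "uid", [10, 20, 30]) for u in users])  # [1, 2, 4]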
Vertical Data Partitioning
• Split table R(K, A, B, C, D, E) into partitions RAB(K, A, B) and RCDE(K, C, D, E)
• A query that touches only columns A and B only needs to access partition RAB
– Much less I/O
• Why is the key replicated?
– So that SELECT * FROM R can be answered as
SELECT * FROM RAB, RCDE
WHERE RAB.K = RCDE.K
Map-reduce
❑ General-purpose distributed computing framework
❑ Can be applied to many types of queries; non- relational and relational
❑ Developed by Google; open-source version Hadoop developed by Yahoo led to quick success
❑ One initiative within the NoSQL movement
Data Processing at Massive Scale
❑ Massive scale
✩ Petabytes of data
✩ 100s, 1000s, 10000s of servers
✩ Many hours
❑ Failure becomes an issue
✩ If the mean time between failures of a server is 1 year
✩ Then 10000 servers have about one failure per hour (10000 / 8760 h ≈ 1.1)
✩ Query execution must succeed even if individual nodes fail
Distributed Large-Scale File Systems
• Google’s GFS / Hadoop HDFS (sponsored by Yahoo)
• Assumptions:
– Files are large (terabytes…)
– Files are rarely updated
• Main concepts
– Files are split into chunks, typically 64 MB
• (compare with the 4 KB page size discussed so far)
– Each chunk replicated for availability
– Master node knows about location of chunks
• Meta-repository (also replicated for fault-tolerance)
Map-reduce
• High-level programming model AND implementation for large-scale parallel data processing
• Programming model
– Distribute the data; each node reads its data records one by one (key-value pairs)
– Map tasks: extract something interesting from records and output a new set of data records (key-value pairs)
– Shuffle and sort (same key goes to the same reduce task)
– Reduce tasks: aggregate, summarize, filter
– Write the results
(Figure: dataflow — input chunks of records (k1, v) → map tasks → key-value pairs (k2, w) → group by keys → key-value-sequence pairs (k2, (w, x, y, …)) → reduce tasks → key-value pairs (k3, s).)
• Input and output considered key/value pairs in order to be able to compose several map/reduce instances
• Keys and values themselves could be complex objects (including tuples).
• Map and Reduce functions are written by programmer
• Number of map tasks and reduce tasks given at start of program
• The rest is done automatically (at least conceptually)
Example: Word Count
• Given: document set DS(K, documenttext)
• Output: for each word w occurring in at least one document of DS, indicate the number of occurrences of w in DS
Input Parameters from User
– Number m of map tasks
– Number r of reduce tasks
– Data set = document set DS
Map function written by User
WordCountMap:
  For each input key/value pair (dkey, dtext)
    For each word w of dtext
      Output key-value pair (w, 1)
Map Step
• System splits the input set into m partitions
• System creates m map tasks and gives each one partition
• Each map task executes the map function on its partition
• The map step only completes once all map tasks are done
Shuffle and Reduce Steps
• System sorts map outputs by key and transforms all key/value pairs (k, v1), (k, v2),…(k, vn) with same key k to one key/value-list pair (k, (v1, v2,…vn))
– For Word count: all (‘star’, 1), (‘star’, 1), (‘star’, 1) … are transformed into one (‘star’, (1,1,1,….))
• System partitions the output by key into r partitions
• System creates r reduce tasks and assigns each one partition
• Each reduce task executes the user-written reduce function
WordCountReduce:
  For each input key/value-list pair (k, (v1, v2, … vn))
    Output (k, n)
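A minimal single-process sketch of the word-count map and reduce functions above, with the group/shuffle step simulated by an in-memory dictionary (the documents and names are illustrative only).

from collections import defaultdict

def word_count_map(dkey, dtext):
    # For each word w of dtext, output the key-value pair (w, 1).
    for w in dtext.split():
        yield (w, 1)

def word_count_reduce(key, values):
    # For the key/value-list pair (k, (v1, ..., vn)), output (k, n).
    return (key, sum(values))

documents = {"d1": "star wars star", "d2": "star trek"}

# Map step
map_output = []
for dkey, dtext in documents.items():
    map_output.extend(word_count_map(dkey, dtext))

# Group and shuffle step: (k, v1), (k, v2), ... -> (k, (v1, v2, ...))
groups = defaultdict(list)
for k, v in map_output:
    groups[k].append(v)

# Reduce step
print([word_count_reduce(k, vs) for k, vs in groups.items()])
# [('star', 3), ('wars', 1), ('trek', 1)]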
Example Execution
(Figure: map tasks read input chunks (d1, text1), (d2, text2), …, (d6, text6) and emit pairs such as (w1, 1), (w2, 1), (w1, 1), …; sort/group/shuffle produces (w1, (1, 1, 1)), (w2, (1, 1)), (w3, (1, 1)), (w4, (1, 1)); reduce tasks output (w1, 3), (w2, 2), (w3, 2), (w4, 2).)
Once more as a tree
(Figure: the same execution drawn as a tree; map task outputs go to the local file system before the shuffle.)
Phase Details
• Split into partitions
• At map tasks
– Record reader
– Map function
– Possibly combine (explained below)
– Write to local file
• Group and shuffle
– Group keys and aggregate value-lists
– Copy from map locations to reduce locations
– Group keys and aggregate value-lists
• At reduce tasks
– Reduce function and write to file system
Combiner
❑ Possible if the reduce function is commutative and associative
❑ Execute the reduce function at each mapper on the partial result of that mapper
❑ Reduces data to be transferred by shuffle
❑ Example word count:
✩ At each mapper: count the occurrences of each word across all documents read by this mapper
✩ For each mapper, there is then only one key-value pair per word, and its value is the number of occurrences
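A minimal sketch of a combiner for word count; the helper name combine is hypothetical, and this is valid here only because word count's reduce is commutative and associative.

from collections import defaultdict

def combine(map_output):
    # Pre-aggregate (w, 1), (w, 1), ... into a single (w, count) per word,
    # so only one pair per word is sent through the shuffle.
    counts = defaultdict(int)
    for w, v in map_output:
        counts[w] += v
    return list(counts.items())

# Output of one map task before combining:
map_output = [("star", 1), ("wars", 1), ("star", 1), ("star", 1)]
print(combine(map_output))   # [('star', 3), ('wars', 1)]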
Implementation
• There is one master node controlling execution
• Master partitions file into m partitions
• Master assigns workers (server processes) to m map tasks
• Workers executing map tasks write to local disk
• Master assigns workers to r reduce tasks
• Reduce workers implement group and shuffle (read from map disks) and execute reduce tasks
• Failures are detected by master
– Failure of a map task during the map phase
• Master assigns a new worker to the map task
– Failure of a map task during the reduce phase
• Master assigns a new worker to the map task to redo it (as its output is stored locally)
– Failure of a reduce task during the reduce phase
• Master assigns a new worker to the reduce task
• Straggler
– A machine that takes unusually long to complete one of its last tasks
• Maybe an I/O problem, too many other tasks, …
– Solution: backup execution of the last few remaining in-progress tasks
Selection with Map/reduce
• Assume relation R(A, B, C) (no duplicates)
• Selection with condition c on R
– Map:
• For each tuple t = (a, b, c) of R for which condition c holds, output (a, (b, c))
– Reduce:
• Identity, that is, output (a, (b, c))
SELECT * FROM Users WHERE experience = 10
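A minimal sketch of selection as map/reduce; the tuple layout and the condition standing in for "experience = 10" are illustrative assumptions, not the slides' schema.

def selection_map(t):
    # t = (a, b, c); output (a, (b, c)) only if the condition holds.
    a, b, c = t
    if c == 10:                      # e.g. condition corresponding to experience = 10
        yield (a, (b, c))

def selection_reduce(key, values):
    # Identity: output each (a, (b, c)) unchanged.
    for v in values:
        yield (key, v)

# Example relation R(A, B, C) with illustrative values
r = [(1, "alice", 10), (2, "bob", 3), (3, "carol", 10)]
print([kv for t in r for kv in selection_map(t)])
# [(1, ('alice', 10)), (3, ('carol', 10))]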
Join with Map/reduce
• Natural Join R(A,B,C) with Q(C,D,E)
– Map:
• For each tuple (a, b, c) of R, output (c, (R, (a, b)))
• For each tuple (c, d, e) of Q, output (c, (Q, (d, e)))
– Group and shuffle will aggregate all key/value pairs with the same c-value
– Reduce:
For each pair (c, value-list)
(e.g., value-list = (R, (a1, b1)), (R, (a2, b2)), …, (Q, (d1, e1)), …)
  Rt = Qt = empty
  for each v = (rel, tuple) in value-list:
    if v.rel = R: insert tuple into Rt, else insert tuple into Qt
  for v1 in Rt, for v2 in Qt: output (c, (v1, v2))
• Basically produces all combinations (c, (ai, bi, dj, ej))
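A minimal single-process sketch of this reduce-side natural join, with map, shuffle, and reduce simulated in memory; the relation tags "R" and "Q" follow the slides, while the sample data and function names are illustrative.

from collections import defaultdict

def join_map_r(t):
    # R(A, B, C): emit the join attribute C as the key, tagged with the relation name.
    a, b, c = t
    yield (c, ("R", (a, b)))

def join_map_q(t):
    # Q(C, D, E): same key, tagged Q.
    c, d, e = t
    yield (c, ("Q", (d, e)))

def join_reduce(c, value_list):
    # Separate the tagged tuples, then output every R x Q combination for this c.
    rt = [tup for rel, tup in value_list if rel == "R"]
    qt = [tup for rel, tup in value_list if rel == "Q"]
    for (a, b) in rt:
        for (d, e) in qt:
            yield (c, (a, b, d, e))

# Simulate map + shuffle + reduce on tiny relations
R = [(1, "x", 7), (2, "y", 7)]
Q = [(7, "u", "v"), (8, "p", "q")]
groups = defaultdict(list)
for t in R:
    for k, v in join_map_r(t):
        groups[k].append(v)
for t in Q:
    for k, v in join_map_q(t):
        groups[k].append(v)
print([out for c, vs in groups.items() for out in join_reduce(c, vs)])
# [(7, (1, 'x', 'u', 'v')), (7, (2, 'y', 'u', 'v'))]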
… FROM Users u, GroupMembers g WHERE u.uid = g.uid
Projection with Map/reduce
• Projection on B, C of R
– Map:
• For each tuple t = (a, (b, c)) of R, let t' = (b, c); output (t', 0)
– There might now be duplicates, that is, several (t', 0) tuples; the group function will aggregate them into (t', (0, 0, …, 0))
– Reduce:
• For each tuple (t', (0, 0, 0, …)), output (t', 0)
SELECT DISTINCT uname, age FROM Users
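A minimal sketch of this duplicate-eliminating projection as map/reduce; the sample relation and function names are illustrative.

from collections import defaultdict

def projection_map(t):
    # Key is the projected tuple (b, c), value is a dummy 0.
    a, b, c = t
    yield ((b, c), 0)

def projection_reduce(key, values):
    # One output per distinct projected tuple, duplicates collapsed by the shuffle.
    yield (key, 0)

R = [(1, "alice", 30), (2, "bob", 25), (3, "alice", 30)]
groups = defaultdict(list)
for t in R:
    for k, v in projection_map(t):
        groups[k].append(v)
print([out for k, vs in groups.items() for out in projection_reduce(k, vs)])
# [(('alice', 30), 0), (('bob', 25), 0)]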
Group By with Map/reduce
• Grouping: SELECT b, max(c) GROUP BY b
– Map:
• For each tuple (a, (b, c)) of R, output (b, c)
– Group and shuffle will create for each value b a key/value-list (b, (c1, c2, …))
– Reduce:
• For each (b, (c1, c2, …)) perform the aggregation (e.g., c1 + c2 + …)
SELECT experience, max(age) FROM Users GROUP BY experience
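A minimal sketch of grouping with aggregation as map/reduce; the rows mirroring (uid, experience, age) and the function names are illustrative assumptions.

from collections import defaultdict

def groupby_map(t):
    # Emit the grouping attribute as key and the aggregated attribute as value.
    a, b, c = t
    yield (b, c)

def groupby_reduce(b, values):
    # Apply the aggregate to the value-list, e.g. max for SELECT b, max(c) ... GROUP BY b.
    yield (b, max(values))

# Rows (uid, experience, age), mirroring SELECT experience, max(age) ... GROUP BY experience
R = [(1, 10, 34), (2, 3, 21), (3, 10, 29)]
groups = defaultdict(list)
for t in R:
    for k, v in groupby_map(t):
        groups[k].append(v)
print([out for k, vs in groups.items() for out in groupby_reduce(k, vs)])
# [(10, 34), (3, 21)]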