Declarative Languages
• Map-Reduce framework hides scheduling and parallelization details
• Limited query expressiveness
– Complex queries difficult to write
• Declarative languages on top of map-reduce
– Pig Latin (Yahoo!)
• Like relational algebra
• Open source
– HiveQL (Facebook)
• SQL-like language
• Open source
– SQL / Tenzing
• Proprietary
Pig (Latin)
• Pig Latin
– A higher-level, SQL-like language for running complex queries that require several map-reduce jobs
• Pig
– An execution engine
• Translates Pig Latin programs into graphs of map-reduce jobs
• Executes them on top of Hadoop
• An Apache open source project
Example ( , Yahoo)
❑ users(name, age), pagelog(url, uname)
❑ Find the top 5 most popular pages for users aged 18-25
SELECT url, count(*) AS clicks
FROM users U, pagelog P
WHERE U.name = P.uname
  AND U.age >= 18
  AND U.age <= 25
GROUP BY url
ORDER BY clicks DESC
LIMIT 5
-- or: FETCH FIRST 5 ROWS ONLY
Map-reduce program: 170 lines of code
In Pig Latin
– Input and output are files (local or on the DFS)
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (uname, url);
Jnd = join Fltrd by name, Pages by uname;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate ($0), COUNT($1) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
Very similar to Relational Algebra
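To make the step-by-step dataflow concrete, here is a minimal Python sketch that mirrors each Pig Latin statement on small in-memory lists. The sample data and the function name top_pages are made up for illustration; this only imitates the semantics of the statements and says nothing about how Pig executes them.

```python
from collections import defaultdict

# Hypothetical sample data; relation and attribute names follow the slides.
users = [("amy", 20), ("bob", 30), ("cat", 18), ("dan", 25)]
pages = [("amy", "a.com"), ("cat", "a.com"), ("dan", "b.com"),
         ("bob", "b.com"), ("amy", "b.com"), ("cat", "a.com")]

def top_pages(users, pages, lo=18, hi=25, k=5):
    # Fltrd = filter Users by age >= 18 and age <= 25;
    fltrd = [(name, age) for name, age in users if lo <= age <= hi]
    # Jnd = join Fltrd by name, Pages by uname;
    jnd = [(name, age, uname, url)
           for name, age in fltrd
           for uname, url in pages if name == uname]
    # Grpd = group Jnd by url;
    grpd = defaultdict(list)
    for row in jnd:
        grpd[row[3]].append(row)
    # Smmd = foreach Grpd generate $0, COUNT($1) as clicks;
    smmd = [(url, len(bag)) for url, bag in grpd.items()]
    # Srtd = order Smmd by clicks desc;
    srtd = sorted(smmd, key=lambda t: t[1], reverse=True)
    # Top5 = limit Srtd 5;
    return srtd[:k]

top5 = top_pages(users, pages)
print(top5)
```

Each intermediate list plays the role of one intermediate relation (Fltrd, Jnd, Grpd, ...), which is the main structural difference from the single nested SQL statement.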
Pig Latin vs. RA
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;          -- RA: selection, 18 <= age <= 25
Pages = load 'pages' as (uname, url);
Jnd = join Fltrd by name, Pages by uname;                 -- RA: join on name = uname
Grpd = group Jnd by url;                                  -- RA: grouping on url
Smmd = foreach Grpd generate ($0), COUNT($1) as clicks;   -- RA: count as clicks
Srtd = order Smmd by clicks desc;                         -- RA: order by clicks desc
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
Load and Store
– Users = load 'users' as (name, age);
– Load: reads data from a file into a (temporary) relation
• Mostly a user-defined function that translates the file format into a relational format
– Store: writes a relation into a file
• Again, usually provided by the user
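As a rough illustration of what a load/store pair does, the following Python sketch turns tab-delimited text lines into tuples and writes tuples back out. The function names load and store and the tab delimiter are assumptions chosen for this example; they are not Pig's loader interface.

```python
import io

def load(f, delimiter="\t"):
    """Read delimited lines into a list of tuples (a temporary relation)."""
    return [tuple(line.rstrip("\n").split(delimiter))
            for line in f if line.strip()]

def store(relation, f, delimiter="\t"):
    """Write each tuple of the relation as one delimited line."""
    for tup in relation:
        f.write(delimiter.join(str(v) for v in tup) + "\n")

# In-memory stand-ins for files, so the sketch is self-contained.
src = io.StringIO("amy\t20\nbob\t30\n")
users = load(src)

out = io.StringIO()
store(users, out)
print(users)
```

Note that the loaded fields are plain strings here; mapping them to typed attributes is part of the translation work a real loader has to do.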
Pig Latin Operators
• Fltrd = filter Users by age >= 18 and age <= 25;
– Left side: new intermediate relation
– Right side: operation on existing relations
• Operators
– Selection
• Res = filter R1 by ...;
• SELECT * FROM R1 WHERE ...
• Examples: by age >= 18; by url matches '.*oracle.*' (matches takes a Java regular expression)
– Join
• Res = join R1 by a1, R2 by a2;
• SELECT * FROM R1, R2 WHERE R1.a1 = R2.a2
– Order by
• Res = order R1 by a1 desc;
• SELECT * FROM R1 ORDER BY a1 DESC
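The three operators above can be imitated in Python on small lists of tuples. The relations r1 and r2 and their contents are invented for this sketch. One assumption worth stating: a pattern given to matches is treated here as a regular expression that must match the whole string, which is why the sketch uses re.fullmatch.

```python
import re

# Hypothetical relations: r1(age, url), r2(age, label).
r1 = [(17, "http://oracle.com/db"),
      (30, "http://example.org"),
      (22, "http://docs.oracle.com")]
r2 = [(22, "x"), (30, "y"), (40, "z")]

# Res = filter R1 by age >= 18;
adults = [t for t in r1 if t[0] >= 18]

# Res = filter R1 by url matches '.*oracle.*';
oracle = [t for t in r1 if re.fullmatch(r".*oracle.*", t[1])]

# Without the surrounding .* the whole-string match finds nothing.
bare = [t for t in r1 if re.fullmatch(r"oracle", t[1])]

# Res = join R1 by a1, R2 by a2;  (flat result: fields of both tuples)
joined = [a + b for a in r1 for b in r2 if a[0] == b[0]]

# Res = order R1 by a1 desc;
ordered = sorted(r1, key=lambda t: t[0], reverse=True)

print(adults, oracle, joined, ordered, sep="\n")
```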
Group
• Given relation Rel(A, B, C) with three tuples:
(a1, b1, c1), (a1, b2, c2), (a3, b3, c3)
• Grpd = group Rel by A;
• Result relation is Grpd(group, Rel)
– Attribute ‘group’ has the same type as attribute A of Rel
– Attribute ‘Rel’ is a multiset (bag)
– In the given example, Grpd has two tuples, one for each value of A; the first attribute of each tuple is the value of A, the second is the bag of all tuples of Rel that have this particular value of A
– dump Grpd;
• (a1, {(a1, b1, c1), (a1, b2, c2)})
• (a3, {(a3, b3, c3)})
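The nested result of grouping can be reproduced with a short Python sketch. The helper name group_by is hypothetical, and a Python list stands in for the bag.

```python
def group_by(rel, key_index):
    """Return one (group, bag) pair per distinct key value.

    The bag holds the complete original tuples, including the key itself.
    """
    groups = {}
    for tup in rel:
        groups.setdefault(tup[key_index], []).append(tup)
    return list(groups.items())

rel = [("a1", "b1", "c1"), ("a1", "b2", "c2"), ("a3", "b3", "c3")]

# Grpd = group Rel by A;
grpd = group_by(rel, 0)
for group, bag in grpd:
    print(group, bag)
```

The point to notice is that nothing is aggregated yet: the group operator only nests the tuples, and a later foreach decides what to compute over each bag.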
• Assume the same relation as before
– Grpd = group Rel by A;
– Result relation is Grpd(group, Rel):
• (a1, {(a1, b1, c1), (a1, b2, c2)})
• (a3, {(a3, b3, c3)})
• For each (two options)
– Smmd = foreach Grpd generate ($0), COUNT($1) as c;
– Smmd = foreach Grpd generate group, COUNT(Rel) as c;
• Result relation is Smmd(group, c)
– Attribute ‘group’ has the same type as attribute A of Rel
– Attribute c is a long
– dump Smmd;
• (a1, 2L)
• (a3, 1L)
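The two ways of addressing attributes, by position ($0, $1) or by name (group, Rel), can be mirrored in Python with a namedtuple. The type name GrpdRow is made up for this sketch.

```python
from collections import namedtuple

# One row of Grpd(group, Rel); a list stands in for the bag.
GrpdRow = namedtuple("GrpdRow", ["group", "Rel"])

grpd = [
    GrpdRow("a1", [("a1", "b1", "c1"), ("a1", "b2", "c2")]),
    GrpdRow("a3", [("a3", "b3", "c3")]),
]

# Smmd = foreach Grpd generate ($0), COUNT($1) as c;
by_position = [(row[0], len(row[1])) for row in grpd]

# Smmd = foreach Grpd generate group, COUNT(Rel) as c;
by_name = [(row.group, len(row.Rel)) for row in grpd]

print(by_position)
print(by_name)
```

Both forms produce the same relation; the named form is easier to read once a script has more than a few steps.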
Projection and others
• Projection
– Assume R1(A, B, C)
– Rel = foreach R1 generate A, B;
• Co-Group
– Group-by over more than one relation
– Related to joins
• Join results in a flat result
• Co-group results in nested result
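The difference between a flat join result and a nested co-group result is easiest to see side by side. The sketch below uses invented relations r1 and r2 and hypothetical helper names. It assumes that co-group emits one tuple per key found in either input, with an empty bag on the side that has no matching tuples.

```python
def join(r1, k1, r2, k2):
    """Flat result: one concatenated tuple per matching pair."""
    return [a + b for a in r1 for b in r2 if a[k1] == b[k2]]

def cogroup(r1, k1, r2, k2):
    """Nested result: (key, bag of r1 tuples, bag of r2 tuples) per key."""
    keys = sorted({t[k1] for t in r1} | {t[k2] for t in r2})
    return [(k,
             [t for t in r1 if t[k1] == k],
             [t for t in r2 if t[k2] == k])
            for k in keys]

r1 = [(1, "x"), (2, "y")]
r2 = [(1, "p"), (1, "q"), (3, "r")]

flat = join(r1, 0, r2, 0)
nested = cogroup(r1, 0, r2, 0)
print(flat)
print(nested)
```

Pairing up the two bags of each nested tuple, and dropping keys where one bag is empty, yields exactly the join result, which is the sense in which co-group is related to joins.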
Data Model and Flattening
• Supported types:
– Atomic (string, number…)
– Tuple: (58, 'lilly', 10, 10)
– Multiset (bag): {(58, 'lilly', 10, 10), (33, 'debby', 5, 7)}
– Further nesting possible: (1, (2, 3))
– Maps (advanced)
• Flattening example
– Assume R = {(1, (2,3))}
– Res = foreach R generate $0, flatten($1);
– Res = {(1, 2, 3)}
– Semantics somewhat obscure…
• Sometimes output type not quite clear: try to flatten…
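Since the semantics of flattening are easy to get wrong, here is a small Python sketch of the two common cases. The first reproduces the tuple example from the slide. The second, flattening a bag, goes beyond the slide's example and is included on the assumption that it produces one output tuple per bag element. Both helper names are hypothetical.

```python
def flatten_tuple_field(rel, idx):
    """Splice the nested tuple at position idx into its parent tuple."""
    return [t[:idx] + tuple(t[idx]) + t[idx + 1:] for t in rel]

def flatten_bag_field(rel, idx):
    """Un-nest the bag at position idx: one output tuple per bag element."""
    return [t[:idx] + tuple(inner) + t[idx + 1:]
            for t in rel
            for inner in t[idx]]

# Res = foreach R generate $0, flatten($1);  with R = {(1, (2, 3))}
r = [(1, (2, 3))]
res = flatten_tuple_field(r, 1)

# Flattening the bag produced by a group step (lists stand in for bags).
grpd = [("a1", [("a1", "b1"), ("a1", "b2")]), ("a3", [])]
unnested = flatten_bag_field(grpd, 1)

print(res)
print(unnested)
```

In this sketch a tuple whose bag is empty contributes no output rows at all, which is one reason the output of a flatten can be surprising.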
Implementation
• Parser and Query Generator:
– Everything between load and store translates into one logical plan
• Logical plan:
– Graph of Hadoop map-reduce jobs
• All statements between two (co)groups → one map-reduce job
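To see why a group step marks a job boundary, it helps to write a group followed by a count as a single map-reduce job. The Python sketch below simulates the three phases in memory. The function names and sample records are invented, and this is not the code Pig generates.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit (grouping key, 1) for every input record."""
    for _uname, url in records:
        yield url, 1

def shuffle(pairs):
    """Shuffle: bring all values with the same key together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate the values of each key (here: count)."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical (uname, url) records.
records = [("amy", "a.com"), ("cat", "a.com"), ("dan", "b.com")]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts)
```

Per-tuple work such as filtering or projection can run inside the map function, while the grouping itself needs the shuffle, so each (co)group forces its own map-reduce job.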
Map Reduce Assignment
• Use Pig Latin on top of Hadoop to process a (not so) large data set
• We have set up a Hadoop cluster with 4 (virtual) nodes
• You have access to that cluster
• You have to write Pig Latin queries
• You have to observe the execution
• Instructions will be on myCourses