
CIS 455/555: Internet and Web Systems
Hadoop November 2, 2020
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania

Plan for today
- Google File System
- Introduction to MapReduce
  - Programming model
  - Data flow
  - Example tasks ← NEXT
- Hadoop and HDFS
  - Architecture
  - Using Hadoop
  - Using HDFS
- Beyond MapReduce

Beyond word count
- Distributed grep – all lines matching a pattern

Input: (k,v) where k is __________ and v is __________

map(key : __________, value : __________) {
}

reduce(key : __________, values : __________) {
}

Output: (k,v) where k is __________ and v is __________
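One way to fill in the blanks for distributed grep, as a hedged sketch in Hadoop Java (the regex is an arbitrary example): k is the byte offset of the line in the file, v is the line itself; the output key is a matching line, and the value can be empty.

    import java.io.IOException;
    import java.util.regex.Pattern;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        private static final Pattern PATTERN = Pattern.compile("upenn\\.edu");  // example pattern

        @Override
        public void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            if (PATTERN.matcher(line.toString()).find())
                context.write(line, NullWritable.get());  // emit each matching line
        }
        // reduce can simply be the identity (see the filtering slide later)
    }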

Beyond word count
- Distributed grep – all lines matching a pattern
  - Map: filter by pattern
  - Reduce: output the set
- Count URL access frequency
  - Map: output each URL as key, with count 1
  - Reduce: sum the counts
- Reverse web-link graph
  - Map: output (target, source) pairs when a link to target is found in source
  - Reduce: concatenate values and emit (target, list(source))
- Inverted index (see the sketch below)
  - Map: emit (word, documentID)
  - Reduce: combine these into (word, list(documentID))
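A minimal inverted-index sketch, assuming the input key is a document ID and the value is the document text (the class names here are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class InvertedIndex {
        public static class IndexMapper extends Mapper<Text, Text, Text, Text> {
            @Override
            public void map(Text docId, Text body, Context context)
                    throws IOException, InterruptedException {
                for (String word : body.toString().split("\\s+"))
                    context.write(new Text(word), docId);       // emit (word, documentID)
            }
        }

        public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            public void reduce(Text word, Iterable<Text> docIds, Context context)
                    throws IOException, InterruptedException {
                StringBuilder list = new StringBuilder();       // build list(documentID)
                for (Text id : docIds)
                    list.append(list.length() == 0 ? "" : ",").append(id);
                context.write(word, new Text(list.toString())); // (word, list(documentID))
            }
        }
    }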

Common mistakes to avoid
- Mapper and reducer should be stateless
  - Don't use static variables – after map and reduce return, they should remember nothing about the processed data!
  - Reason: there are no guarantees about which key-value pairs will be processed by which workers!
- Don't try to do your own I/O!
  - Don't try to read from, or write to, files in the file system
  - The MapReduce framework does all the I/O for you:
    - All the incoming data will be fed as arguments to map and reduce
    - Any data your functions produce should be output via emit

Dangerous! (stateful mapper):

    HashMap h = new HashMap();

    map(key, value) {
      if (h.contains(key)) {     // h keeps state across invocations of map!
        h.add(key, value);
        emit(key, "X");
      }
    }

Dangerous! (doing your own I/O):

    map(key, value) {
      File foo = new File("xyz.txt");   // reading files yourself – don't!
      while (true) {
        s = foo.readLine();
        ...
      }
    }

More common mistakes to avoid
    map(key, value) {
      emit("FOO", key + " " + value);   // Wrong! Everything goes to a single key
    }

    reduce(key, value[]) {
      /* do some computation on all the values */
    }

- Mapper must not map too much data to the same key
  - In particular, don't map everything to the same key!!
  - Otherwise the reduce worker for that key will be overwhelmed!
  - It's okay if some reduce workers have more work than others
    - Example: in WordCount, the reduce worker that works on the key 'and' has a lot more work than the reduce worker that works on 'syzygy'.

Designing MapReduce algorithms
- Key decision: What should be done by map, and what by reduce?
- map can do something to each individual key-value pair, but it can't look at other key-value pairs
  - Example: filtering out key-value pairs we don't need
- map can emit more than one intermediate key-value pair for each incoming key-value pair
  - Example: incoming data is text, map produces (word, 1) for each word
- reduce can aggregate data; it can look at multiple values, as long as map has mapped them to the same (intermediate) key
  - Example: count the number of words, add up the total cost, ...
- Need to get the intermediate format right!
  - If reduce needs to look at several values together, map must emit them using the same key! (see the WordCount sketch below)
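A minimal WordCount sketch along these lines: map emits (word, 1) using the word as the intermediate key, so reduce sees all the 1s for one word together and can sum them.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        public static class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            @Override
            public void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                for (String word : line.toString().split("\\s+"))
                    if (!word.isEmpty())
                        context.write(new Text(word), ONE);   // (word, 1)
            }
        }

        public static class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();  // aggregate per key
                context.write(word, new IntWritable(sum));
            }
        }
    }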

Filtering algorithms
- Goal: Find lines/files/tuples with a particular characteristic
- Examples:
  - grep Web logs for requests to *.upenn.edu/*
  - find in the Web logs the hostnames accessed by 192.168.2.1
  - locate all the files that contain the words 'Apple' and 'Jobs'
- Generally: map does most of the work; reduce may simply be the identity (see the note below)
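In Hadoop, an identity reduce stage needs no custom class: the default reduce() in the base org.apache.hadoop.mapreduce.Reducer writes every (key, value) pair through unchanged. A hedged driver fragment:

    // The base Reducer is the identity (and is also the default if no
    // reducer class is set) – fine for most filtering jobs.
    job.setReducerClass(Reducer.class);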

Aggregation algorithms
- Goal: Compute the maximum, the sum, the average, ..., over a set of values
- Examples:
  - Count the number of requests to *.upenn.edu/*
  - Find the most popular domain
  - Average the number of requests per page per Web site (see the sketch below)
- Often: map may be simple or the identity
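A hedged sketch for the 'average' case: unlike a sum, an average cannot be combined naively, so the reducer below computes the sum and the count together (the key and value types are illustrative).

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class AvgReducer extends Reducer<Text, LongWritable, Text, DoubleWritable> {
        @Override
        public void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0, count = 0;
            for (LongWritable v : values) { sum += v.get(); count++; }  // sum and count in one pass
            context.write(key, new DoubleWritable((double) sum / count));
        }
    }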

A more complex example
- Goal: Billing for a CDN like Amazon CloudFront
- Input: log files from the edge servers. Two files per domain:
  - access_log-www.foo.com-20160316.txt: HTTP accesses
  - ssl_access_log-www.foo.com-20160316.txt: HTTPS accesses
  - Example line:
    158.130.53.72 - - [03/Mar/2016:08:30:38 -0400] "GET /largeFile.ISO HTTP/1.1" 200 8130928734 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11) Gecko/20100101 Firefox/44.0"
  - Mapper receives (filename, line) tuples
- Billing policy (simplified):
  - Billing is based on a mix of request count and data traffic
  - 10,000 HTTP requests cost $0.0075
  - 10,000 HTTPS requests cost $0.0100
  - One GB of traffic costs $0.085
- Desired output is a list of (domain, grandTotal) tuples (one possible solution is sketched below)
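One possible solution, as a hedged sketch: it takes the slide's (filename, line) input model literally (with stock Hadoop, the filename would instead come from the InputSplit), and it assumes the response size is always the tenth space-separated field of the log line.

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class Billing {
        public static class BillingMapper extends Mapper<Text, Text, Text, Text> {
            @Override
            public void map(Text filename, Text line, Context context)
                    throws IOException, InterruptedException {
                String name = filename.toString();
                boolean ssl = name.startsWith("ssl_access_log");
                String domain = name.split("-")[1];       // [ssl_]access_log-<domain>-<date>.txt
                String[] fields = line.toString().split(" ");
                long bytes = Long.parseLong(fields[9]);   // response size (real logs may have "-")
                context.write(new Text(domain),
                              new Text((ssl ? "https" : "http") + "," + bytes));
            }
        }

        public static class BillingReducer extends Reducer<Text, Text, Text, DoubleWritable> {
            @Override
            public void reduce(Text domain, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                long http = 0, https = 0, bytes = 0;
                for (Text v : values) {
                    String[] p = v.toString().split(",");
                    if (p[0].equals("https")) https++; else http++;
                    bytes += Long.parseLong(p[1]);
                }
                double grandTotal = (http / 10000.0) * 0.0075     // HTTP requests
                                  + (https / 10000.0) * 0.0100    // HTTPS requests
                                  + (bytes / (1024.0 * 1024 * 1024)) * 0.085;  // GB of traffic
                context.write(domain, new DoubleWritable(grandTotal));
            }
        }
    }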

Advanced Aggregation: Combiners
[Diagram: Mapper → Combiner → Reducer]

- Multiple map tasks on the same machine may write to the same reduce key
- Example: map(1, "Apple juice") -> ("apple", 1), ("juice", 1)
           map(2, "Apple sauce") -> ("apple", 1), ("sauce", 1)
- Combiner: ("apple", [1,1]) -> ("apple", 2), so less data crosses the network
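In Hadoop, the combiner is registered in the driver; a hedged fragment, reusing the WordCount reducer sketched earlier (WCReducer). A reducer can double as a combiner only when its operation is associative and commutative, like summing:

    job.setCombinerClass(WCReducer.class);   // run the reduce logic map-side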

Intersections and joins
- Goal: Intersect multiple different inputs on some shared values
  - Values can be equal, or meet a certain predicate
- Examples:
  - Find all documents with the words "data" and "centric", given an inverted index
  - Find all professors and students in common courses, and return the pairs for those cases

Partial Cartesian products
- Goal: Find some complex relationship, e.g., based on pairwise distance
- Examples:
  - Find all pairs of coffee shops within 100m of each other
- Generally hard to parallelize
- But may be possible if we can divide the input into bins or tiles, or link it to some sort of landmark
  - Overlap the tiles? (how does this scale?)
  - Generate landmarks using clustering?

Sorting
- Goal: Sort the input
- Examples:
  - Return all the domains covered by Google's index and the number of pages in each, ordered by the number of pages
- The programming model does not support this per se, but the implementations do
- Let's take a look at what happens in the Shuffle stage

The shuffle stage revisited (example: Hadoop)

Shuffle really consists of two parts:
- Partition
- Sort

[Diagram: two nodes, each running the same pipeline. On each node, the InputFormat reads files from the file system and divides them into splits; a RecordReader (RR) turns each split into (key, value) pairs for a map task; map output passes through an optional Combine step and then through Partition & Sort, which routes each intermediate key across the network to the node whose Reduce task is responsible for it; each Reduce writes its results back to the file system via the OutputFormat.]

Shuffle as a sort mechanism
- We can exploit the per-node sorting operation done by Shuffle
  - If we have a single reducer, we will get sorted output
  - If we have multiple reducers, we can get partly sorted output (or better – consider an order-preserving hash; see the fragments below)
  - Note: it is not difficult to write a last-pass file that merges all of the output files from the reducers
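Two hedged driver fragments for getting sorted output. TotalOrderPartitioner is Hadoop's stock order-preserving (range) partitioner; it needs a partition file, typically built with InputSampler:

    // Option 1: a single reducer – output is fully sorted, but not parallel
    job.setNumReduceTasks(1);

    // Option 2: range-partition the keys so reducer i receives only keys
    // smaller than those of reducer i+1; concatenating the output files
    // then yields a globally sorted result
    job.setPartitionerClass(TotalOrderPartitioner.class);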

Strengths and weaknesses
- What problems can you solve well with MapReduce?
  - ... in a single pass?
  - ... in multiple passes?
- Are there problems you cannot solve efficiently with MapReduce?
- Are there problems it can't solve at all?
- How does it compare to other ways of doing large-scale data analysis?
- Is MapReduce always the fastest/most efficient way?

Plan for today
- MapReduce
  - Programming model
  - Data flow
  - Example tasks
- Hadoop and HDFS ← NEXT
  - Architecture
  - Using Hadoop
  - Using HDFS
- Beyond MapReduce

What is Hadoop?
- An open-source re-implementation of Google's MapReduce framework
- Developed by Doug Cutting
  - Open Source Award 2015
- Benefited from substantial help by Yahoo!
- Today, Hadoop (and its file system, HDFS) are Apache Foundation projects

Who uses Hadoop?
- Hadoop is running search on some of the Internet's largest sites:
  - Helped IBM's Watson to win Jeopardy!
  - Amazon Web Services: Elastic MapReduce
  - Verizon: variety of uses, e.g., behavioral analysis & targeting
  - eBay: search optimization (700-node cluster; each node has 24TB disk, 72GB RAM, 12 cores; can run 26,000 MapReduce tasks simultaneously)
  - Facebook: reporting/analytics, machine learning (1100 machines), messaging
    - 2000-node warehouse cluster, 21PB total storage capacity, ~400 million objects
  - IBM: Blue Cloud Computing Clusters
  - LinkedIn: 1000s of machines, e.g., People You May Know
  - Twitter: store + process tweets, log files, other data
  - Apple: iAds platform
  - Netflix: streaming summaries, analysis tasks
  - Hulu: log storage and analysis

Hadoop
- A 'modern' open-source 'clone' of MapReduce+GFS
- Written in Java
- Operates on HDFS, a block-level replicating file system
  - Modeled in part after GFS

[Figure: HDFS architecture. Source: Hadoop HDFS architecture documentation]

Hadoop MapReduce 1.X Architecture
- Job tracker (~MapReduce master):
  - Accepts jobs submitted by users
  - Gives tasks to Task trackers – makes scheduling decisions, co-locates tasks with data
  - Monitors task and tracker status, re-executes tasks if needed
- Task trackers (~MapReduce worker):
  - Run Map and Reduce tasks
  - Manage storage and transmission of intermediate output
- HDFS-related nodes:
  - Namenode: handles the file-to-chunk mapping (~GFS master)
    - Plus secondary namenodes if necessary
  - Datanode: stores chunks (~GFS chunkserver)
- A single node can run more than one of these!

Two example configurations

[Diagram: two deployment layouts. Small cluster: the JobTracker, NameNode, and Secondary NameNode share a single master machine. Medium cluster: the JobTracker and NameNode run on separate dedicated machines, with the Secondary NameNode on its own machine; in both layouts, each worker machine runs a TaskTracker and a DataNode.]

Plan for today
- MapReduce
  - Programming model
  - Data flow
  - Example tasks
- Hadoop and HDFS
  - Architecture
  - Using Hadoop ← NEXT
  - Using HDFS
- Beyond MapReduce

What do we need to write?
- A mapper
  - Accepts (key, value) pairs from the input
  - Produces intermediate (key, value) pairs, which are then shuffled
- A reducer
  - Accepts intermediate (key, value) pairs
  - Produces final (key, value) pairs for the output
- A driver
  - Specifies which inputs to use, where to put the outputs
  - Chooses the mapper and the reducer to use
- Hadoop takes care of the rest
  - Default behaviors can be customized by the driver

Hadoop data types
    Name            Description              JDK equivalent
    --------------  -----------------------  --------------
    IntWritable     32-bit integers          Integer
    LongWritable    64-bit integers          Long
    DoubleWritable  Floating-point numbers   Double
    Text            Strings                  String

- Hadoop uses its own serialization
  - Java serialization is known to be very inefficient
  - Result: a set of special data types
  - All implement the 'Writable' interface
  - The most common types are shown above; there are also more specialized types (SortedMapWritable, ObjectWritable, ...)
- Caution: these are NOT normal Java classes. Do not try to reuse them between iterations – their content can change! (see the sketch below)
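A hedged sketch of this pitfall, as it would appear inside a reduce() method (the surrounding class is omitted): Hadoop recycles one Writable instance across the whole iteration, so stored references all end up aliasing the last value.

    // Hadoop reuses the same Text object for every element of 'values',
    // so copy the contents you want to keep.
    List<Text> saved = new ArrayList<>();
    for (Text value : values) {
        // saved.add(value);         // BUG: all list entries alias one object
        saved.add(new Text(value));  // OK: copies the current contents
    }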

The Mapper

    import org.apache.hadoop.mapreduce.*;
    import org.apache.hadoop.io.*;

    // Input format: (file offset, line); the intermediate format can be freely chosen
    public class FooMapper extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            context.write(new Text("foo"), value);
        }
    }

- Extends the abstract 'Mapper' class
  - Input/output types are specified as type parameters
- Implements a 'map' function
  - Accepts a (key, value) pair of the specified type
  - Writes output pairs by calling the 'write' method on the context
  - Mixing up the types will cause problems at runtime (!)

The Reducer

    import org.apache.hadoop.mapreduce.*;
    import org.apache.hadoop.io.*;

    // Intermediate format (same as the mapper's output); output format: (IntWritable, Text)
    public class FooReducer extends Reducer<Text, Text, IntWritable, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws java.io.IOException, InterruptedException {
            // Note: we may get multiple values for the same key!
            for (Text value : values)
                context.write(new IntWritable(4711), value);
        }
    }

- Extends the abstract 'Reducer' class
  - Must specify the types again (they must be compatible with the mapper!)
- Implements a 'reduce' function
  - Values are passed in as an 'Iterable'
  - context.write is used here as well

The Driver

    import org.apache.hadoop.mapreduce.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class FooDriver {
        public static void main(String[] args) throws Exception {
            Job job = new Job();
            job.setJarByClass(FooDriver.class);   // Mapper & Reducer are in the same Jar as FooDriver
            job.setMapperClass(FooMapper.class);
            job.setReducerClass(FooReducer.class);

            // TextInputFormat is the default input format

            // Format of the intermediate (key,value) pairs, as emitted by FooMapper
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            // Format of the final output (key,value) pairs, as written by FooReducer
            job.setOutputKeyClass(IntWritable.class);
            job.setOutputValueClass(Text.class);

            // Input and output paths
            FileInputFormat.addInputPath(job, new Path("in"));
            FileOutputFormat.setOutputPath(job, new Path("out"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

- Specifies how the job is to be executed
  - Input and output directories; mapper & reducer classes
  - Since FooMapper and FooReducer produce different output types, both the map output types and the final output types must be declared

Manual compilation
- Goal: produce a JAR file that contains the classes for mapper, reducer, and driver
  - This can be submitted to the Job Tracker, or run directly through Hadoop
- Step #1: Put the Hadoop JAR into the classpath:
    export CLASSPATH=$CLASSPATH:/path/to/hadoop/hadoop-mapreduce-client-core-2.7.3.jar
- Step #2: Compile mapper, reducer, driver:
    javac FooMapper.java FooReducer.java FooDriver.java
- Step #3: Package into a JAR file:
    jar cvf Foo.jar *.class
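Once packaged, a hedged example of running the job directly through Hadoop (the installation path mirrors the HDFS examples below):

    /usr/local/hadoop-2.7.3/bin/hadoop jar Foo.jar FooDriver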

Accessing data in HDFS
    [liuv@carbon ~]$ ls -la /tmp/hadoop/dfs/data/current/
    total 209588
    drwxrwxr-x 2 liuv liuv     4096 2017-02-13 15:46 .
    drwxrwxr-x 5 liuv liuv     4096 2017-02-13 15:39 ..
    -rw-rw-r-- 1 liuv liuv 11568995 2017-02-13 15:44 blk_-3562426239750716067
    -rw-rw-r-- 1 liuv liuv    90391 2017-02-13 15:44 blk_-3562426239750716067_1020.meta
    -rw-rw-r-- 1 liuv liuv        4 2017-02-13 15:40 blk_5467088600876920840
    -rw-rw-r-- 1 liuv liuv       11 2017-02-13 15:40 blk_5467088600876920840_1019.meta
    -rw-rw-r-- 1 liuv liuv 67108864 2017-02-13 15:44 blk_7080460240917416109
    -rw-rw-r-- 1 liuv liuv   524295 2017-02-13 15:44 blk_7080460240917416109_1020.meta
    -rw-rw-r-- 1 liuv liuv 67108864 2017-02-13 15:44 blk_-8388309644856805769
    -rw-rw-r-- 1 liuv liuv   524295 2017-02-13 15:44 blk_-8388309644856805769_1020.meta
    -rw-rw-r-- 1 liuv liuv 67108864 2017-02-13 15:44 blk_-9220415087134372383
    -rw-rw-r-- 1 liuv liuv   524295 2017-02-13 15:44 blk_-9220415087134372383_1020.meta
    -rw-rw-r-- 1 liuv liuv      158 2017-02-13 15:40 VERSION
    [liuv@carbon ~]$

- HDFS implements a separate namespace
  - Files in HDFS are not visible in the normal file system
  - Only the blocks and the block metadata are visible
  - HDFS cannot be mounted

Accessing data in HDFS
    [liuv@carbon ~]$ /usr/local/hadoop-2.7.3/bin/hadoop fs -ls /user/liuv
    Found 4 items
    -rw-r--r-- 1 liuv supergroup      1366 2019-02-13 15:46 /user/liuv/README.txt
    -rw-r--r-- 1 liuv supergroup         0 2019-02-13 15:35 /user/liuv/input
    -rw-r--r-- 1 liuv supergroup         0 2019-02-13 15:39 /user/liuv/input2
    -rw-r--r-- 1 liuv supergroup 212895587 2019-02-13 15:44 /user/liuv/input3
    [liuv@carbon ~]$

- File access is through the hadoop command
- Examples:
  - hadoop fs -put [file] [hdfsPath]    – stores a file in HDFS
  - hadoop fs -ls [hdfsPath]            – lists a directory
  - hadoop fs -get [hdfsPath] [file]    – retrieves a file from HDFS
  - hadoop fs -rm [hdfsPath]            – deletes a file in HDFS
  - hadoop fs -mkdir [hdfsPath]         – makes a directory in HDFS

Accessing HDFS directly from Java
- Programs can read/write HDFS files directly
  - Not needed in MapReduce; I/O is handled by the framework
- Files are represented as URIs
  - Example: hdfs://localhost/user/liuv/example.txt
- Access is via the FileSystem API
  - To get access to the file: FileSystem.get()
  - For reading, call open() – returns an InputStream
  - For writing, call create() – returns an OutputStream
  (a minimal sketch follows)
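A minimal sketch of direct HDFS access via the FileSystem API (the host and paths are hypothetical):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://localhost"), conf);

            // Reading: open() returns an InputStream
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    fs.open(new Path("/user/liuv/example.txt"))))) {
                System.out.println(in.readLine());
            }

            // Writing: create() returns an OutputStream
            try (FSDataOutputStream out = fs.create(new Path("/user/liuv/out.txt"))) {
                out.writeBytes("hello HDFS\n");
            }
        }
    }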

Recap: HDFS
- HDFS: a specialized distributed file system
  - Good for large amounts of data, sequential reads
  - Bad for lots of small files, random access, non-append writes
- Architecture: blocks, namenode, datanodes
  - File data is broken into large blocks (64MB default)
  - Blocks are stored & replicated by datanodes
  - A single namenode manages all the metadata
  - Secondary namenode: housekeeping & (some) redundancy
- Usage: special command-line interface
  - Example: hadoop fs -ls /path/in/hdfs

Plan for today
- MapReduce: examples
- Hadoop and HDFS
  - Architecture
  - Using Hadoop
  - Using HDFS
- Beyond MapReduce ← NEXT

Lots of Places where MR is Limited
- What if I want streaming data instead of static data?
- What if I want low-latency jobs instead of batch jobs?
- What if my computation requires iteration (e.g., transitive closure, clustering, classification, ...) or needs to see all values?
- MapReduce can only do iteration in a kludgey way (a driver-loop sketch follows):
  1. A "base case" job produces a "single" output file combining all input data
  2. An "iterative case" job reads the last stage's output file, does its computation, and writes another "single" output file for the next iteration
  3. A "termination case" job converts the last output file into something a human can read
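A hedged sketch of such a driver loop, chaining one Hadoop job per iteration through HDFS paths (IterMapper, IterReducer, and the path scheme are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class IterativeDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            int maxIterations = 10;
            for (int i = 0; i < maxIterations; i++) {
                Job job = Job.getInstance(conf, "iteration-" + i);
                job.setJarByClass(IterativeDriver.class);
                job.setMapperClass(IterMapper.class);     // hypothetical
                job.setReducerClass(IterReducer.class);   // hypothetical
                // each pass reads the previous pass's output directory
                FileInputFormat.addInputPath(job, new Path("iter-" + i));
                FileOutputFormat.setOutputPath(job, new Path("iter-" + (i + 1)));
                if (!job.waitForCompletion(true)) break;
            }
        }
    }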

Early Picture of Hadoop (1.0)
[Figure: architecture diagram of Hadoop 1.0]

Many Early Projects Re-Implemented Communication, Checkpointing
- e.g., Apache Giraph (based on Google's Pregel) does in-memory computation with messages between nodes, over many iterations
- e.g., Apache Spark replaces Hadoop's functional model with in-memory, procedural / iterative computation
- Could we generalize the main parts of Hadoop for all of these?

The New Apache Hadoop (V2+): Now a "Stack", Not Just MapReduce
http://hortonworks.com/hadoop/yarn/

Also, HDFS NameNodes are now:
- Replicated using primary/backup
- Federated (i.e., there are more than one active)

Beyond MapReduce: Apache Spark, Flink, FlumeJava, ...
- Virtually every "post-MapReduce" implementation treats datasets as virtual collections, and enables us to apply map / reduce / join etc. to these collections

    // From Spark's Java API (the stripped generics have been reconstructed):
    JavaRDD<String> lines = ctx.textFile(args[0], 1);

    JavaPairRDD<String, Iterable<String>> links = lines.mapToPair(
        new PairFunction<String, String, String>() {
            public Tuple2<String, String> call(String s) {
                String[] parts = SPACES.split(s);
                return new Tuple2<String, String>(parts[0], parts[1]);
            }
        }).distinct().groupByKey().cache();