CIS 455/555: Internet and Web Systems
RPCs (November 4, 2020)
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
Plan for today
- MapReduce (NEXT)
  - Example tasks
- Hadoop and HDFS
  - Architecture
  - Using Hadoop
  - Using HDFS
  - Beyond MapReduce
- Remote Procedure Calls
- Web Services
What do we need to write?
- A mapper
  - Accepts (key,value) pairs from the input
  - Produces intermediate (key,value) pairs, which are then shuffled
- A reducer
  - Accepts intermediate (key,value) pairs
  - Produces final (key,value) pairs for the output
- A driver
  - Specifies which inputs to use and where to put the outputs
  - Chooses the mapper and the reducer to use
- Hadoop takes care of the rest
  - Default behaviors can be customized by the driver
Hadoop data types

  Name             Description              JDK equivalent
  IntWritable      32-bit integers          Integer
  LongWritable     64-bit integers          Long
  DoubleWritable   Floating-point numbers   Double
  Text             Strings                  String
- Hadoop uses its own serialization
  - Java serialization is known to be very inefficient
- Result: a set of special data types
  - All implement the 'Writable' interface
  - Most common types are shown above; there are also some more specialized types (SortedMapWritable, ObjectWritable, ...)
- Caution: these are NOT normal Java classes. Hadoop reuses Writable instances, so do not hold on to them across iterations of your map/reduce loop; their contents can change!
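To see why this matters, here is a hedged sketch (a hypothetical class, not from the lecture) of the classic pitfall: Hadoop passes the same Text instance for every element of the Iterable in reduce(), so storing the references yields a list of identical values, while copying the contents out is safe.

import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

// Hypothetical reducer illustrating the Writable-reuse pitfall:
// Hadoop recycles ONE Text instance for every element of 'values'.
public class ReusePitfallReducer extends Reducer<Text, Text, Text, Text> {
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws java.io.IOException, InterruptedException {
    List<Text> broken = new ArrayList<>();   // every entry will alias the same object
    List<String> safe = new ArrayList<>();   // copies survive across iterations
    for (Text value : values) {
      broken.add(value);                     // BUG: same Text, mutated on each iteration
      safe.add(value.toString());            // OK: copy the current contents out
    }
    context.write(key, new Text(String.join(",", safe)));
  }
}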
The Mapper

(Input format: (file offset, line). The intermediate format can be freely chosen.)

import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.io.*;

public class FooMapper extends Mapper<LongWritable, Text, Text, Text> {
  public void map(LongWritable key, Text value, Context context)
      throws java.io.IOException, InterruptedException {
    context.write(new Text("foo"), value);
  }
}

- Extends the abstract 'Mapper' class
  - Input/output types are specified as type parameters
- Implements a 'map' function
  - Accepts a (key,value) pair of the specified type
  - Writes output pairs by calling the 'write' method on the context
  - Mixing up the types will cause problems at runtime (!)
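As a slightly more realistic sketch (a standard word-count-style mapper, hypothetical and not part of the FooMapper/FooReducer example), the mapper below tokenizes each input line and emits one (word, 1) pair per token:

import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

// Hypothetical word-count mapper: input is (file offset, line),
// output is (word, 1) for every token on the line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws java.io.IOException, InterruptedException {
    StringTokenizer tok = new StringTokenizer(value.toString());
    while (tok.hasMoreTokens()) {
      word.set(tok.nextToken());
      context.write(word, ONE);   // one count per occurrence
    }
  }
}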
The Reducer

(Intermediate format: same as the mapper's output. The output format can again be freely chosen.)

import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.io.*;

public class FooReducer extends Reducer<Text, Text, IntWritable, Text> {
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws java.io.IOException, InterruptedException {
    for (Text value : values)
      context.write(new IntWritable(4711), value);
  }
}

Note: we may get multiple values for the same key!

- Extends the abstract 'Reducer' class
  - Must specify the types again (they must be compatible with the mapper's output!)
- Implements a 'reduce' function
  - Values are passed in as an 'Iterable'
  - context.write is used here as well
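And the matching word-count reducer, again a hedged sketch: it sums the 1s emitted for each word, which is exactly the "multiple values for the same key" case noted above.

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

// Hypothetical word-count reducer: receives (word, [1, 1, ...]) and
// emits (word, total number of occurrences).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws java.io.IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values)
      sum += value.get();          // add up the per-occurrence counts
    context.write(key, new IntWritable(sum));
  }
}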
The Driver

import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FooDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job();
    // Mapper & Reducer are in the same JAR as FooDriver
    job.setJarByClass(FooDriver.class);
    job.setMapperClass(FooMapper.class);
    job.setReducerClass(FooReducer.class);
    // Format of the output (key,value) pairs
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    // Input and output paths
    FileInputFormat.addInputPath(job, new Path("in"));
    FileOutputFormat.setOutputPath(job, new Path("out"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

- Specifies how the job is to be executed
  - Input and output directories; mapper & reducer classes
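One subtlety, added here as a hedged note: setOutputKeyClass/setOutputValueClass describe the reducer's output. When the intermediate (map-output) types differ from the final output types, as they would for a FooMapper emitting (Text, Text) feeding the FooReducer above, which emits (IntWritable, Text), the driver must declare the intermediate types separately:

// Sketch: declaring intermediate types when they differ from the output types
job.setMapOutputKeyClass(Text.class);      // mapper emits (Text, Text)
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(IntWritable.class);  // reducer emits (IntWritable, Text)
job.setOutputValueClass(Text.class);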
Manual compilation

- Goal: produce a JAR file that contains the classes for mapper, reducer, and driver
  - This can be submitted to the Job Tracker, or run directly through Hadoop
- Step #1: Put the Hadoop JAR on the classpath:
    export CLASSPATH=$CLASSPATH:/path/to/hadoop/hadoop-mapreduce-client-core-2.7.3.jar
- Step #2: Compile mapper, reducer, and driver:
    javac FooMapper.java FooReducer.java FooDriver.java
- Step #3: Package into a JAR file:
    jar cvf Foo.jar *.class
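Once packaged, the job can be run directly through Hadoop; a hedged example, assuming FooDriver is in the default package and the hadoop binary is on the PATH:

    hadoop jar Foo.jar FooDriver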
Accessing data in HDFS

[liuv@carbon ~]$ ls -la /tmp/hadoop/dfs/data/current/
total 209588
drwxrwxr-x 2 liuv liuv     4096 2017-02-13 15:46 .
drwxrwxr-x 5 liuv liuv     4096 2017-02-13 15:39 ..
-rw-rw-r-- 1 liuv liuv 11568995 2017-02-13 15:44 blk_-3562426239750716067
-rw-rw-r-- 1 liuv liuv    90391 2017-02-13 15:44 blk_-3562426239750716067_1020.meta
-rw-rw-r-- 1 liuv liuv        4 2017-02-13 15:40 blk_5467088600876920840
-rw-rw-r-- 1 liuv liuv       11 2017-02-13 15:40 blk_5467088600876920840_1019.meta
-rw-rw-r-- 1 liuv liuv 67108864 2017-02-13 15:44 blk_7080460240917416109
-rw-rw-r-- 1 liuv liuv   524295 2017-02-13 15:44 blk_7080460240917416109_1020.meta
-rw-rw-r-- 1 liuv liuv 67108864 2017-02-13 15:44 blk_-8388309644856805769
-rw-rw-r-- 1 liuv liuv   524295 2017-02-13 15:44 blk_-8388309644856805769_1020.meta
-rw-rw-r-- 1 liuv liuv 67108864 2017-02-13 15:44 blk_-9220415087134372383
-rw-rw-r-- 1 liuv liuv   524295 2017-02-13 15:44 blk_-9220415087134372383_1020.meta
-rw-rw-r-- 1 liuv liuv      158 2017-02-13 15:40 VERSION
[liuv@carbon ~]$

- HDFS implements a separate namespace
  - Files in HDFS are not visible in the normal file system
  - Only the blocks and the block metadata are visible
  - HDFS cannot be mounted
Accessing data in HDFS

[liuv@carbon ~]$ /usr/local/hadoop-2.7.3/bin/hadoop fs -ls /user/liuv
Found 4 items
-rw-r--r--   1 liuv supergroup      1366 2019-02-13 15:46 /user/liuv/README.txt
-rw-r--r--   1 liuv supergroup         0 2019-02-13 15:35 /user/liuv/input
-rw-r--r--   1 liuv supergroup         0 2019-02-13 15:39 /user/liuv/input2
-rw-r--r--   1 liuv supergroup 212895587 2019-02-13 15:44 /user/liuv/input3
[liuv@carbon ~]$

- File access is through the hadoop command
- Examples:
  - hadoop fs -put [file] [hdfsPath]    (stores a file in HDFS)
  - hadoop fs -ls [hdfsPath]            (lists a directory)
  - hadoop fs -get [hdfsPath] [file]    (retrieves a file from HDFS)
  - hadoop fs -rm [hdfsPath]            (deletes a file in HDFS)
  - hadoop fs -mkdir [hdfsPath]         (makes a directory in HDFS)
Accessing HDFS directly from Java

- Programs can read/write HDFS files directly
  - Not needed in MapReduce; I/O is handled by the framework
- Files are represented as URIs
  - Example: hdfs://localhost/user/liuv/example.txt
- Access is via the FileSystem API
  - To get access to the file: FileSystem.get()
  - For reading, call open(), which returns an InputStream
  - For writing, call create(), which returns an OutputStream
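A minimal sketch of reading an HDFS file this way (the path is hypothetical; assumes a reachable HDFS namenode at hdfs://localhost):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hedged sketch: open an HDFS file via the FileSystem API and print it.
public class HdfsCat {
  public static void main(String[] args) throws Exception {
    String uri = "hdfs://localhost/user/liuv/example.txt";   // hypothetical path
    FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
    try (FSDataInputStream in = fs.open(new Path(uri));
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null)
        System.out.println(line);
    }
  }
}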
Recap: HDFS

- HDFS: a specialized distributed file system
  - Good for large amounts of data and sequential reads
  - Bad for lots of small files, random access, and non-append writes
- Architecture: blocks, namenode, datanodes
  - File data is broken into large blocks (64MB default in Hadoop 1.x; 128MB in Hadoop 2.x)
  - Blocks are stored & replicated by datanodes
  - Single namenode manages all the metadata
  - Secondary namenode: housekeeping & (some) redundancy
- Usage: special command-line interface
  - Example: hadoop fs -ls /path/in/hdfs
Plan for today
- Hadoop and HDFS
  - Architecture
  - Using Hadoop
  - Using HDFS
  - Beyond MapReduce (NEXT)
- Remote Procedure Calls
- Web Services
Lots of Places where MR is Limited

- What if I want streaming data instead of static data?
- What if I want low-latency jobs instead of batch jobs?
- What if my computation requires iteration (e.g., transitive closure, clustering, classification, ...) or needs to see all values?
  - MapReduce can only do iteration in a kludgey way (see the sketch below):
    1. a "base case" job produces a "single" output file combining all input data
    2. each "iterative case" job reads the last stage's output file, does its computation, and writes another "single" output file for the next iteration
    3. a "termination case" job converts the last output file into something a human can read
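A hedged sketch of that driver-side loop (StepMapper/StepReducer and the paths are hypothetical; each round is a complete MapReduce job whose output directory becomes the next round's input):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical "kludgey iteration" driver: one full job per iteration.
public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    String input = "data/base";                // output of the "base case" job
    int maxIterations = 10;                    // or test a convergence condition
    for (int i = 0; i < maxIterations; i++) {
      String output = "data/iter" + i;
      Job job = Job.getInstance(new Configuration(), "iteration-" + i);
      job.setJarByClass(IterativeDriver.class);
      job.setMapperClass(StepMapper.class);    // hypothetical mapper/reducer
      job.setReducerClass(StepReducer.class);
      FileInputFormat.addInputPath(job, new Path(input));
      FileOutputFormat.setOutputPath(job, new Path(output));
      if (!job.waitForCompletion(true)) System.exit(1);
      input = output;                          // next round reads this round's output
    }
  }
}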
Early Picture of Hadoop (1.0)

[Figure: the original Hadoop 1.0 architecture]
Many Early Projects Re-Implemented Communication, Checkpointing

- e.g., Apache Giraph (based on Google's Pregel) does in-memory computation, with messages between nodes, over many iterations
- e.g., Apache Spark replaces Hadoop's functional model with in-memory, procedural / iterative computation
- Could we generalize the main parts of Hadoop for all of these?
The New Apache Hadoop (V2+): Now a "Stack", Not Just MapReduce

[Figure: the YARN-based Hadoop 2 stack, from http://hortonworks.com/hadoop/yarn/]

Also, HDFS NameNodes are now:
- Replicated using Primary/Backup
- Federated (i.e., more than one is active)
Beyond MapReduce: Apache Spark, Flink, FlumeJava, ...

- Virtually every "post-MapReduce" implementation treats datasets as virtual collections and lets us apply map / reduce / join etc. to these collections

JavaRDD<String> lines = ctx.textFile(inputPath);   // assumed setup: ctx is a
                                                   // JavaSparkContext, inputPath the input
JavaPairRDD<String, Iterable<String>> links = lines.mapToPair(
    new PairFunction<String, String, String>() {
      public Tuple2<String, String> call(String s) {
        String[] parts = SPACES.split(s);          // SPACES: a precompiled whitespace Pattern
        return new Tuple2<String, String>(parts[0], parts[1]);
      }
    }).distinct().groupByKey().cache();
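For comparison, the same pipeline written with Java 8 lambdas, a hedged equivalent of the snippet above (mapToPair accepts a lambda because PairFunction is a functional interface):

// Equivalent pipeline using a lambda instead of an anonymous PairFunction
JavaPairRDD<String, Iterable<String>> links = lines
    .mapToPair(s -> {
      String[] parts = SPACES.split(s);
      return new Tuple2<>(parts[0], parts[1]);
    })
    .distinct()
    .groupByKey()
    .cache();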
Plan for today
- Hadoop and HDFS
  - Architecture
  - Using Hadoop
  - Using HDFS
  - Beyond MapReduce
- Remote Procedure Calls (NEXT)
- Web Services
What are web services?

[Figure: applications operated by Alice, Bob, and Charlie invoking one another over the web]

- Intuition: an application that is accessible to other applications over the web
- Example enabling technology: XML
(W3C) Web Services

"A Web service is a software system designed to support interoperable machine-to-machine interaction over a network. It has an interface described in a machine-processable format (specifically WSDL). Other systems interact with the Web service in a manner prescribed by its description using SOAP messages, typically conveyed using HTTP with an XML serialization in conjunction with other Web-related standards."
http://www.w3.org/TR/ws-arch/

- Key elements:
  - Machine-to-machine interaction
  - Interoperable (with other applications and services)
  - Machine-processable format
- Key technologies:
  - SOAP (and also REST)
  - WSDL (Web Services Description Language; XML-based)
Plan for today
- Hadoop and HDFS
- Remote Procedure Calls (NEXT)
  - Abstraction
  - Mechanism
  - Stub-code generation
- Web services
  - REST
  - SOAP
  - WSDL
  - JSON
Motivation for RPCs

- Coding your own messaging is hard
  - Example: look up a name on our directory server
  - Assemble the message at the sender, parse it at the receiver...
  - Other things too: Which service to call? How to get return values?
- Let's hide this in the programming language and middleware
  - A similar strategy works great for many other hard or cumbersome tasks, e.g., memory management
- Wouldn't it be nice if we could simply call a function lookup(name) in the client code, and it executed remotely on the name server?
  - That is the abstraction provided by Remote Procedure Calls
The intuition behind RPCs

void foo() {
  int x, y;
  ...
  x = bar(45, &y, false);
  ...
}

void bar(int a, int *b, bool c) {
  ...
  if (!c)
    *b = a + 17;
  ...
}

[Figure: foo runs on Machine A, bar on Machine B. In a local call, the arguments (45, &y, false) and the return address (retaddr) are pushed onto the stack; in an RPC, the same stack contents become a "message" sent from Machine A to Machine B.]
Remote Procedure Calls

- Remote procedure calls have been around forever
  - Implementation examples: COM+, CORBA, DCE, Java RMI, ...
- An RPC API defines a format for:
  - Initiating a call on a given server, generally in a reliable way
    - At-most-once, at-least-once, or exactly-once semantics
  - Sending parameters (marshalling) to the server
  - Receiving a return value, which may require marshalling as well
- Different language bindings may exist
  - A Java client can call a C++ server, a Fortran client can call a Pascal server, ...
- Traditionally, RPC calls are synchronous
  - The caller blocks until a response is received from the callee
  - Exception: one-way RPCs
  - Modern RPC frameworks include support for async calls
RPC visualized

[Figure: timeline of a synchronous RPC. The client sends a request and blocks, waiting for the response; the server, initially busy, executes the function when the request arrives, sends back the response, and then waits for the next request. The client continues working once the response arrives.]
How RPC generally works

- You write an application with a series of functions
  - Some of these functions will be distributed remotely
- You call a stub-code generator, which produces:
  - A client stub, which emulates each function F:
    - Marshals all the parameters and produces a request message
    - Opens a connection to the server and sends the request
    - Receives the response, unmarshals F's return values and status, and returns them
  - A server stub, which emulates the caller on the server side:
    - Receives a request for F with parameters
    - Unmarshals the parameters and invokes F
    - Takes F's return status (e.g., protection fault) and return value, marshals them, produces a response, and sends it back to the client
    - Waits for the next request (or returns to the server loop)
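To make this concrete, here is a minimal hand-written sketch of what a generated client stub for the earlier lookup(name) example could look like (the wire format and class names are hypothetical; a real stub generator emits framework-specific code):

import java.io.*;
import java.net.Socket;

// Hypothetical client stub for lookup(name): marshals the call by hand
// over a plain socket, then unmarshals the reply.
public class DirectoryClientStub {
  private final String host;
  private final int port;

  public DirectoryClientStub(String host, int port) {
    this.host = host;
    this.port = port;
  }

  public String lookup(String name) throws IOException {
    try (Socket socket = new Socket(host, port);
         DataOutputStream out = new DataOutputStream(socket.getOutputStream());
         DataInputStream in = new DataInputStream(socket.getInputStream())) {
      out.writeUTF("lookup");   // which remote function to invoke
      out.writeUTF(name);       // marshal the parameter
      out.flush();
      return in.readUTF();      // unmarshal the return value
    }
  }
}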
Passing value parameters

[Figure 2-8: the steps involved in doing a remote computation through RPC]
RPC components

- Generally, you need to write:
  - Your function, in a compatible language
  - An interface definition, analogous to a C header file, so other people can program against F without having its source
    - Includes annotations for marshalling, e.g., [in] and [out]
    - Special interface definition languages (IDLs) exist for this
- The stub-code generator takes the interface definition and generates the appropriate stubs
  - (In the case of Java, rmic knows enough about Java to run directly on the source file)
- The server stubs will generally run in some type of daemon process on the server
  - Each function will need a globally unique name or GUID
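For instance, in Java RMI the interface definition is itself a Java interface; a hedged sketch with a hypothetical DirectoryServer service:

import java.rmi.Remote;
import java.rmi.RemoteException;

// Hypothetical RMI interface definition: plays the role of the IDL file.
// rmic (or the modern RMI runtime) generates the client/server stubs from it.
public interface DirectoryServer extends Remote {
  String lookup(String name) throws RemoteException;
}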