Hadoop & MapReduce
Hadoop & HDFS
DSCI 551
Wensheng Wu
1
Hadoop
• A large-scale distributed & parallel batch-
processing infrastructure
• Large-scale:
– Handle a large amount of data and computation
• Distributed:
– Distribute data & computation over multiple machines
• Batch processing
– Process a series of jobs without human intervention
2
[Figure: a typical cluster — racks of commodity nodes, each with its own CPU, memory, and disk, connected by a switch; racks are in turn connected by higher-level switches. Each rack contains 16-64 nodes, with ~1 Gbps bandwidth between any pair of nodes in a rack and a 2-10 Gbps backbone between racks. In 2011 it was estimated that Google had 1M machines, http://bit.ly/Shh0RO]
3
[Figure: the same cluster architecture, from Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu]
4
History
• 1st version released by Yahoo! in 2006
– named after an elephant toy
• Originated from Google’s work
– GFS: Google File System (2003)
– MapReduce (2004)
5
Roadmap
• Hadoop architecture
– HDFS
– MapReduce
• Installing Hadoop & HDFS
6
Key components
• HDFS (Hadoop distributed file system)
– Distributed data storage with high reliability
• MapReduce
– A parallel, distributed computational paradigm
– With a simplified programming model
7
HDFS
• Data are distributed among multiple data nodes
– Data nodes may be added on demand for more
storage space
• Data are replicated to cope with node failure
– Typical replication factor: 2 or 3
• Requests can go to any replica
– Removing the bottleneck of a single file server
8
HDFS architecture
9
HDFS has …
• A single NameNode, storing metadata:
– A hierarchy of directories and files (name space)
– Attributes of directories and files (in inodes), e.g.,
permission, access/modification times, etc.
– Mapping of files to blocks on data nodes
• A number of DataNodes:
– Storing contents/blocks of files
10
Compute nodes
• Data nodes are compute nodes too
• Advantage:
– Allows scheduling computation close to the data
11
HDFS also has …
• A SecondaryNameNode
– Maintaining checkpoints/images of NameNode
– For recovery
• In a single-machine setup
– all nodes correspond to the same machine
12
Metadata in NameNode
• NameNode has an inode for each file and dir
• Record attributes of file/dir such as
– Permission
– Access time
– Modification time
• Also record mapping of files to blocks
13
Mapping information in NameNode
• E.g., file /user/aaron/foo consists of blocks 1,
2, and 4
• Block 1 is stored on data nodes 1 and 3
• Block 2 is stored on data nodes 1 and 2
• …
14
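To make this concrete, here is a purely hypothetical Java sketch of the two mappings (these are invented names and toy structures for illustration only, not the actual NameNode classes):

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only -- not the real HDFS data structures
Map<String, List<Integer>> fileToBlocks = new HashMap<>();
fileToBlocks.put("/user/aaron/foo", Arrays.asList(1, 2, 4));        // file -> its blocks, in order

Map<Integer, List<String>> blockToDataNodes = new HashMap<>();
blockToDataNodes.put(1, Arrays.asList("datanode1", "datanode3"));   // block 1 on data nodes 1 and 3
blockToDataNodes.put(2, Arrays.asList("datanode1", "datanode2"));   // block 2 on data nodes 1 and 2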
Block size
• HDFS: 128 MB (version 2 & above)
– Much larger than the disk block size (typically 4KB)
– One HDFS block spans 128MB/4KB = 32K disk blocks
– A 1GB file occupies 1GB/128MB = 8 HDFS blocks, vs. 1GB/4KB = 256K disk blocks
• Why larger size in HDFS?
– Reduce metadata required per file (e.g., only 8 block entries for a 1GB file)
– Fast streaming read of data (since a larger amount of data is sequentially laid out on disk)
• Thus good for workloads with largely sequential reads of large files
15
HDFS
• HDFS exposes the concept of blocks to client
• Reading and writing are done in two phases
– Phase 1: client asks NameNode for block locations
• By calling (sending request) getBlockLocations(), if reading
• Or calling addBlock() for allocating new blocks (one at a
time), if writing (need to call create()/append() first)
– Phase 2: client talks to DataNode for data transfer
• Reading blocks via readBlock() or writing blocks via
writeBlock()
16
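As an illustration (a sketch, not part of the original slides), an application typically goes through the public FileSystem API; the HDFS client library carries out both phases (getBlockLocations() to the NameNode, then readBlock() to the DataNodes) behind the scenes. The path below is just an example:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml (fs.defaultFS)
    FileSystem fs = FileSystem.get(conf);          // connects to the NameNode
    Path file = new Path("/user/ec2-user/input/core-site.xml");   // example path
    try (FSDataInputStream in = fs.open(file);     // phase 1: block locations from NameNode
         BufferedReader r = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = r.readLine()) != null) {      // phase 2: blocks streamed from DataNodes
        System.out.println(line);
      }
    }
  }
}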
Client and NameNode communication
• Source code (version 2.8.1)
– Definition of protocol
• ClientNamenodeProtocol.proto
• Located at: client\src\main\proto
– Implementation
• ClientProtocol.java
• Located at: client\src\main\java\org\apache\hadoop\hdfs\protocol
17
Key operations
• Reading:
– getBlockLocations()
• Writing
– create()
– append()
– addBlock()
18
getBlockLocations
19
Before reading, client needs to first obtain locations of blocks
getBlockLocations
• Input:
– File name
– Offset (to start reading)
– Length (how much data to be read)
• Output:
– Located blocks (data nodes + offsets)
20
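The corresponding declaration in ClientProtocol.java looks roughly as follows (a simplified sketch; annotations and the full thrown-exception list are omitted):

// Sketch of the NameNode RPC used before reading (see ClientProtocol.java)
public LocatedBlocks getBlockLocations(String src,   // file name, e.g. "/user/aaron/foo"
                                       long offset,  // where to start reading
                                       long length)  // how much data to read
    throws IOException;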
21
22
[Screenshot: ../java/…hdfs/protocol/LocatedBlocks.java — each located block records the block, the offset of this block in the entire file, and the data nodes with replicas of the block.]
Create/append a file
23
This opens the file for create/append
Creating a file
• Needs to specify:
– Path to the file to be created, e.g., /foo/bar
– Permission mask
– Client name
– Flag on whether to overwrite (entire file!) if
already exists
– How many replicas
– Block size
24
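These parameters line up closely with one of the public FileSystem.create() overloads, shown below as a rough sketch (the underlying ClientProtocol.create() RPC carries essentially the same information, with the client name supplied by the HDFS client itself):

// Rough sketch of a public FileSystem.create() overload
FSDataOutputStream create(Path f,                 // path, e.g. new Path("/foo/bar")
                          FsPermission permission, // permission mask
                          boolean overwrite,       // overwrite the entire file if it exists
                          int bufferSize,
                          short replication,       // how many replicas
                          long blockSize,          // e.g. 128 MB
                          Progressable progress)
    throws IOException;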
25
[Screenshot annotations: creating a new file; a hierarchy of files and directories.]
Allocating new blocks for writing
26
Asking NameNode to allocate a new block + the data nodes holding its replicas
27
Client and DataNode communication
• Source code (version 2.8.1)
– Definition of protocol
• datatransfer.proto
• Located at: project\hadoop-hdfs-client\src\main\proto
– Implementation
• DataTransferProtocol.java
• Located at: client\src\main\java\org\apache\hadoop\hdfs\protocol\datatransfer
28
Operations
• readBlock()
• writeBlock()
• copyBlock() – for load balancing
• replaceBlock() – for load balancing
– Move a block from one DataNode to another
29
Reading a file
1. Client first contacts NameNode which
informs the client of the closest DataNodes
storing blocks of the file
– This is done by making which RPC call?
2. Client contacts the DataNodes directly for
reading the blocks
– Calling readBlock()
30
datatransfer.proto
31
[Screenshot: the block-transfer messages in datatransfer.proto carry the block, offset, and length.]
DataTransferProtocol.java
32
[Screenshot: the corresponding methods in DataTransferProtocol.java also take the block, offset, and length.]
Writing a file
• Blocks are written one at a time
– In a pipelined fashion through the data nodes
• For each block:
– Client asks NameNode to select DataNodes for
holding its replica (using which rpc call?)
• e.g., DataNodes 1 and 3 for the first block of
/user/aaron/foo
– It then forms the pipeline to send the block
33
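A small usage sketch (assumed paths and data, not from the slides): the application simply writes to the output stream, while the HDFS client requests a new block from the NameNode via addBlock() whenever the current block fills up and pipelines the data to the chosen DataNodes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/aaron/foo");                 // example path
    try (FSDataOutputStream out = fs.create(file, true)) {   // true = overwrite if it exists
      for (int i = 0; i < 1000; i++) {
        out.writeBytes("line " + i + "\n");   // buffered into packets and pipelined by the client
      }
    }                                         // close() completes the file at the NameNode
  }
}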
Writing a file
34
[Figure (slides 34-35): pipelined writeBlock() calls — each call names the block to be written, the current data node in the pipeline, and the rest of the data nodes: A, [B,C] from the client; then B, [C] from A; then C, [] from B.]
Data pipelining
• Consider a block X to be written to DataNodes A, B, and C (replication factor = 3)
1. X is broken down into packets (typically 64KB/packet)
– A 128MB block thus yields 128MB/64KB = 2048 packets
2. Client sends the packet to DataNode A
3. A sends it further to B & B further to C
36
Acknowledgement
• Client maintains an ack (acknowledgment) queue
• Packet removed from ack queue once received by
all data nodes
• When all packets have been written, the client notifies the NameNode
– NameNode will update the metadata for the file
– Reflecting that a new block has been added to the file
37
Data pipelining for writing blocks
38
[Figure: control messages, acknowledgment messages, and data packets exchanged between the client and the DataNodes in the pipeline.]
Acknowledgement
• Client does not wait for the acknowledgement of the previous packet before sending the next one
• Is this synchronous or asynchronous?
• Advantage?
39
Roadmap
• Hadoop architecture
– HDFS
– MapReduce
• Installing Hadoop & HDFS
40
Hadoop installation
• Install the Hadoop package
– Log into your EC2 instance and then execute:
• wget https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
• tar xvf hadoop-3.3.1.tar.gz
• Might want to remove installation package
(~200MB) to save space
41
Install java sdk (if you have not)
• sudo yum install java-1.8.0-devel
– Java 1.8 is needed for Spark
42
Setup environment variables
• Edit ~/.bashrc by adding the following:
– export JAVA_HOME=/usr/lib/jvm/java
– export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
– export HADOOP_HOME=/home/ec2-user/hadoop-3.3.1
– export PATH=${JAVA_HOME}/bin:${HADOOP_HOME}/bin:${PATH}
• source ~/.bashrc
– This puts the new variables into effect
– Or you may also log out and log in again
43
Set up pseudo-distributed mode
• Edit etc/hadoop/core-site.xml, add the following property:
– fs.defaultFS = hdfs://localhost:9000
• hdfs://localhost:9000 will be the URI for the root of hdfs
44
Pseudo-distributed mode
• Edit etc/hadoop/hdfs-site.xml, add this:
– dfs.replication = 1 (replication factor)
– See below for a minimal sketch of both configuration files
45
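For reference, a minimal sketch of the two files as edited above (only the properties shown on these slides are included; values assume the single-node setup described here):

<!-- etc/hadoop/core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- etc/hadoop/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>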
Setup passphraseless ssh
• Reason:
– So that Hadoop can automatically start the DataNode daemons on the machines that run them, without prompting for a password
• Note that DataNode is running on localhost in
our setup
– So all daemons run on the same host
46
Setup passphraseless ssh
• ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
– This generates a public/private key pair
– id_rsa is the private key; id_rsa.pub is the public key
• cat ~/.ssh/id_rsa.pub >>
~/.ssh/authorized_keys
– Add public key into the list of authorized keys
• chmod 0400 ~/.ssh/authorized_keys
– Change the file permission properly
47
-P specifies the passphrase; here it is the empty string
Check if it works
• ssh localhost
– It should log in to localhost without asking for a password (you may need to confirm yes the first time)
• exit
– Make sure you exit from “ssh localhost”
48
Formatting hdfs & starting hdfs
• Go to your Hadoop installation directory
• bin/hdfs namenode -format
• sbin/start-dfs.sh
– sbin/stop-dfs.sh to stop it
49
Verifying HDFS is started properly
• Execute jps; you should see 3 Java processes:
– SecondaryNameNode
– DataNode
– NameNode
• If NameNode is not started
– Try to stop hdfs & reformat namenode (see
previous slide)
50
Working with hdfs
• Setting up home directory in hdfs
– bin/hdfs dfs -mkdir /user
– bin/hdfs dfs -mkdir /user/ec2-user
(ec2-user is user name of your EC2 account)
• Create a directory “input” under home
– bin/hdfs dfs -mkdir /user/ec2-user/input
– Or simply:
– bin/hdfs dfs -mkdir input
51
This will automatically create the “input” directory under /user/ec2-user
Working with hdfs
• Copy data from local file system
– bin/hdfs dfs -put etc/hadoop/*.xml /user/ec2-user/input
– Ignore the error if you see one like this: “WARN hdfs.DataStreamer: Caught exception…”
• List the content of directory
– bin/hdfs dfs -ls /user/ec2-user/input
52
Working with hdfs
• Copy data from hdfs
– bin/hdfs dfs -get /user/ec2-user/input input1
– If input1 does not exist, it will create one
– If it does, it will create another one under it
• Examine the content of file in hdfs
– bin/hdfs dfs -cat /user/ec2-user/input/core-site.xml
53
Working with hdfs
• Remove files
– bin/hdfs dfs -rm /user/ec2-user/input/core-site.xml
– bin/hdfs dfs -rm /user/ec2-user/input/*
• Remove directory
– bin/hdfs dfs -rmdir /user/ec2-user/input
– Directory “input” needs to be empty first
54
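The same operations can also be done programmatically through the Java FileSystem API; below is a sketch (paths reuse the examples above; the recursive delete at the end corresponds to -rm plus -rmdir):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalents {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    fs.mkdirs(new Path("/user/ec2-user/input"));                        // hdfs dfs -mkdir
    fs.copyFromLocalFile(new Path("etc/hadoop/core-site.xml"),          // hdfs dfs -put
                         new Path("/user/ec2-user/input"));
    for (FileStatus s : fs.listStatus(new Path("/user/ec2-user/input"))) {  // hdfs dfs -ls
      System.out.println(s.getPath());
    }
    fs.copyToLocalFile(new Path("/user/ec2-user/input/core-site.xml"),  // hdfs dfs -get
                       new Path("input1"));
    fs.delete(new Path("/user/ec2-user/input"), true);                  // recursive remove
  }
}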
Where is hdfs located?
• /tmp/hadoop-ec2-user/dfs/
55
References
• K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop Distributed File System,” in Proc. IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 2010, pp. 1-10.
• HDFS File System Shell Guide:
– https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
56