Hadoop & MapReduce
Hadoop & HDFS
DSCI 551
Wensheng Wu
1
Hadoop
• A large-scale distributed & parallel batch-
processing infrastructure
• Large-scale:
– Handle a large amount of data and computation
• Distributed:
– Distribute data & computation over multiple machines
• Batch processing
– Process a series of jobs without human intervention
2
[Figure: a typical cluster — racks of commodity nodes, each with its own CPU, memory, and disk, connected by a switch; racks are in turn connected by higher-level switches. Each rack contains 16-64 nodes, with ~1 Gbps bandwidth between any pair of nodes in a rack and a 2-10 Gbps backbone between racks. In 2011 it was estimated that Google had 1M machines, http://bit.ly/Shh0RO]
3
[Figure: the same cluster architecture, from Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu]
4
History
• 1st version released by Yahoo! in 2006
– named after an elephant toy
• Originated from Google’s work
– GFS: Google File System (2003)
– MapReduce (2004)
5
Roadmap
• Hadoop architecture
– HDFS
– MapReduce
• Installing Hadoop & HDFS
6
Key components
• HDFS (Hadoop distributed file system)
– Distributed data storage with high reliability
• MapReduce
– A parallel, distributed computational paradigm
– With a simplified programming model
7
HDFS
• Data are distributed among multiple data nodes
– Data nodes may be added on demand for more
storage space
• Data are replicated to cope with node failure
– Typical replication factor: 2 or 3
• Requests can go to any replica
– Removing the bottleneck of a single file server
8
HDFS architecture
9
HDFS has …
• A single NameNode, storing metadata:
– A hierarchy of directories and files (name space)
– Attributes of directories and files (in inodes), e.g.,
permission, access/modification times, etc.
– Mapping of files to blocks on data nodes
• A number of DataNodes:
– Storing contents/blocks of files
10
Compute nodes
• Data nodes are compute nodes too
• Advantage:
– Allows scheduling computation close to the data
11
HDFS also has …
• A SecondaryNameNode
– Maintaining checkpoints/images of NameNode
– For recovery
• In a single-machine setup
– all nodes correspond to the same machine
12
Metadata in NameNode
• NameNode has an inode for each file and dir
• Record attributes of file/dir such as
– Permission
– Access time
– Modification time
• Also record mapping of files to blocks
13
Mapping information in NameNode
• E.g., file /user/aaron/foo consists of blocks 1,
2, and 4
• Block 1 is stored on data nodes 1 and 3
• Block 2 is stored on data nodes 1 and 2
• …
14
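To make this concrete, here is a purely hypothetical Java sketch of the two mappings (these are invented names and toy structures for illustration only, not the actual NameNode classes):

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only -- not the real HDFS data structures
Map<String, List<Integer>> fileToBlocks = new HashMap<>();
fileToBlocks.put("/user/aaron/foo", Arrays.asList(1, 2, 4));        // file -> its blocks, in order

Map<Integer, List<String>> blockToDataNodes = new HashMap<>();
blockToDataNodes.put(1, Arrays.asList("datanode1", "datanode3"));   // block 1 on data nodes 1 and 3
blockToDataNodes.put(2, Arrays.asList("datanode1", "datanode2"));   // block 2 on data nodes 1 and 2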
Block size
• HDFS: 128 MB (version 2 & above)
– Much larger than the disk block size (typically 4KB)
– One HDFS block spans 128MB/4KB = 32K disk blocks
– A 1GB file occupies 1GB/128MB = 8 HDFS blocks, vs. 1GB/4KB = 256K disk blocks
• Why larger size in HDFS?
– Reduce metadata required per file (e.g., only 8 block entries for a 1GB file)
– Fast streaming read of data (since a larger amount of data is sequentially laid out on disk)
• Thus good for workloads with largely sequential reads of large files
15
HDFS
• HDFS exposes the concept of blocks to client
• Reading and writing are done in two phases
– Phase 1: client asks NameNode for block locations
• By calling (sending request) getBlockLocations(), if reading
• Or calling addBlock() for allocating new blocks (one at a
time), if writing (need to call create()/append() first)
– Phase 2: client talks to DataNode for data transfer
• Reading blocks via readBlock() or writing blocks via
writeBlock()
16
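As an illustration (a sketch, not part of the original slides), an application typically goes through the public FileSystem API; the HDFS client library carries out both phases (getBlockLocations() to the NameNode, then readBlock() to the DataNodes) behind the scenes. The path below is just an example:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml (fs.defaultFS)
    FileSystem fs = FileSystem.get(conf);          // connects to the NameNode
    Path file = new Path("/user/ec2-user/input/core-site.xml");   // example path
    try (FSDataInputStream in = fs.open(file);     // phase 1: block locations from NameNode
         BufferedReader r = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = r.readLine()) != null) {      // phase 2: blocks streamed from DataNodes
        System.out.println(line);
      }
    }
  }
}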
Client and NameNode communication
• Source code (version 2.8.1)
– Definition of protocol
• ClientNamenodeProtocol.proto
• Located at: client\src\main\proto
– Implementation
• ClientProtocol.java
• Located at: client\src\main\java\org\apache\hadoop\hdfs\protocol
17
Key operations
• Reading:
– getBlockLocations()
• Writing
– create()
– append()
– addBlock()
18
getBlockLocations
19
Before reading, client needs to first obtain locations of blocks
getBlockLocations
• Input:
– File name
– Offset (to start reading)
– Length (how much data to be read)
• Output:
– Located blocks (data nodes + offsets)
20
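The corresponding declaration in ClientProtocol.java looks roughly as follows (a simplified sketch; annotations and the full thrown-exception list are omitted):

// Sketch of the NameNode RPC used before reading (see ClientProtocol.java)
public LocatedBlocks getBlockLocations(String src,   // file name, e.g. "/user/aaron/foo"
                                       long offset,  // where to start reading
                                       long length)  // how much data to read
    throws IOException;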
21
22
[Screenshot: ../java/…hdfs/protocol/LocatedBlocks.java — each located block records the block, the offset of this block in the entire file, and the data nodes with replicas of the block.]
Create/append a file
23
This opens the file for create/append
Creating a file
• Needs to specify:
– Path to the file to be created, e.g., /foo/bar
– Permission mask
– Client name
– Flag on whether to overwrite (entire file!) if
already exists
– How many replicas
– Block size
24
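These parameters line up closely with one of the public FileSystem.create() overloads, shown below as a rough sketch (the underlying ClientProtocol.create() RPC carries essentially the same information, with the client name supplied by the HDFS client itself):

// Rough sketch of a public FileSystem.create() overload
FSDataOutputStream create(Path f,                 // path, e.g. new Path("/foo/bar")
                          FsPermission permission, // permission mask
                          boolean overwrite,       // overwrite the entire file if it exists
                          int bufferSize,
                          short replication,       // how many replicas
                          long blockSize,          // e.g. 128 MB
                          Progressable progress)
    throws IOException;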
25
[Screenshot annotations: creating a new file; a hierarchy of files and directories.]
Allocating new blocks for writing
26
Asking NameNode to allocate a new block + the data nodes holding its replicas
27
Client and DataNode communication
• Source code (version 2.8.1)
– Definition of protocol
• datatransfer.proto
• Located at: project\hadoop-hdfs-client\src\main\proto
– Implementation
• DataTransferProtocol.java
• Located at: client\src\main\java\org\apache\hadoop\hdfs\protocol\datatransfer
28
Operations
• readBlock()
• writeBlock()
• copyBlock() – for load balancing
• replaceBlock() – for load balancing
– Move a block from one DataNode to another
29
Reading a file
1. Client first contacts NameNode which
informs the client of the closest DataNodes
storing blocks of the file
– This is done by making which RPC call?
2. Client contacts the DataNodes directly for
reading the blocks
– Calling readBlock()
30
datatransfer.proto
31
[Screenshot: the block-transfer messages in datatransfer.proto carry the block, offset, and length.]
DataTransferProtocol.java
32
[Screenshot: the corresponding methods in DataTransferProtocol.java also take the block, offset, and length.]
Writing a file
• Blocks are written one at a time
– In a pipelined fashion through the data nodes
• For each block:
– Client asks NameNode to select DataNodes for
holding its replica (using which rpc call?)
• e.g., DataNodes 1 and 3 for the first block of
/user/aaron/foo
– It then forms the pipeline to send the block
33
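A small usage sketch (assumed paths and data, not from the slides): the application simply writes to the output stream, while the HDFS client requests a new block from the NameNode via addBlock() whenever the current block fills up and pipelines the data to the chosen DataNodes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/aaron/foo");                 // example path
    try (FSDataOutputStream out = fs.create(file, true)) {   // true = overwrite if it exists
      for (int i = 0; i < 1000; i++) {
        out.writeBytes("line " + i + "\n");   // buffered into packets and pipelined by the client
      }
    }                                         // close() completes the file at the NameNode
  }
}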
Writing a file
34
[Figure (slides 34-35): pipelined writeBlock() calls — each call names the block to be written, the current data node in the pipeline, and the rest of the data nodes: A, [B,C] from the client; then B, [C] from A; then C, [] from B.]
Data pipelining
• Consider a block X to be written to DataNodes A, B, and C (replication factor = 3)
1. X is broken down into packets (typically 64KB/packet)
– A 128MB block thus yields 128MB/64KB = 2048 packets
2. Client sends the packet to DataNode A
3. A sends it further to B & B further to C
36
Acknowledgement
• Client maintains an ack (acknowledgment) queue
• Packet removed from ack queue once received by
all data nodes
• When all packets have been written, the client notifies the NameNode
– NameNode will update the metadata for the file
– Reflecting that a new block has been added to the file
37
Data pipelining for writing blocks
38
[Figure: control messages, acknowledgment messages, and data packets exchanged between the client and the DataNodes in the pipeline.]
Acknowledgement
• Client does not wait for the acknowledgement of the previous packet before sending the next one
• Is this synchronous or asynchronous?
• Advantage?
39
Roadmap
• Hadoop architecture
– HDFS
– MapReduce
• Installing Hadoop & HDFS
40
Hadoop installation
• Install the Hadoop package
– Log into your EC2 instance and then execute:
• wget https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
• tar xvf hadoop-3.3.1.tar.gz
• Might want to remove installation package
(~200MB) to save space
41
Install java sdk (if you have not)
• sudo yum install java-1.8.0-devel
– Java 1.8 is needed for Spark
42
Setup environment variables
• Edit ~/.bashrc by adding the following:
– export JAVA_HOME=/usr/lib/jvm/java
– export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
– export HADOOP_HOME=/home/ec2-user/hadoop-3.3.1
– export PATH=${JAVA_HOME}/bin:${HADOOP_HOME}/bin:${PATH}
• source ~/.bashrc
– This puts the new variables into effect
– Or you may also log out and log in again
43
Set up pseudo-distributed mode
• Edit etc/hadoop/core-site.xml, add the following property:
– fs.defaultFS = hdfs://localhost:9000
• hdfs://localhost:9000 will be the URI for the root of hdfs
44
Pseudo-distributed mode
• Edit etc/hadoop/hdfs-site.xml, add this:
– dfs.replication = 1 (replication factor)
– See below for a minimal sketch of both configuration files
45
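For reference, a minimal sketch of the two files as edited above (only the properties shown on these slides are included; values assume the single-node setup described here):

<!-- etc/hadoop/core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- etc/hadoop/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>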
Setup passphraseless ssh
• Reason:
– So that Hadoop can automatically start the DataNode daemons on the machines that run them, without prompting for a password
• Note that DataNode is running on localhost in
our setup
– So all daemons run on the same host
46
Setup passphraseless ssh
• ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
– This generates a public/private key pair
– id_rsa is the private key; id_rsa.pub is the public key
• cat ~/.ssh/id_rsa.pub >>
~/.ssh/authorized_keys
– Add public key into the list of authorized keys
• chmod 0400 ~/.ssh/authorized_keys
– Change the file permission properly
47
-P specifies the passphrase; here it is the empty string
Check if it works
• ssh localhost
– It should log in to localhost without asking for a password (you may need to confirm yes the first time)
• exit
– Make sure you exit from “ssh localhost”
48
Formatting hdfs & starting hdfs
• Go to your Hadoop installation directory
• bin/hdfs namenode -format
• sbin/start-dfs.sh
– sbin/stop-dfs.sh to stop it
49
Verifying HDFS is started properly
• Execute jps; you should see 3 Java processes:
– SecondaryNameNode
– DataNode
– NameNode
• If NameNode is not started
– Try to stop hdfs & reformat namenode (see
previous slide)
50
Working with hdfs
• Setting up home directory in hdfs
– bin/hdfs dfs -mkdir /user
– bin/hdfs dfs -mkdir /user/ec2-user
(ec2-user is user name of your EC2 account)
• Create a directory “input” under home
– bin/hdfs dfs -mkdir /user/ec2-user/input
– Or simply:
– bin/hdfs dfs -mkdir input
51
This will automatically create the “input” directory under /user/ec2-user
Working with hdfs
• Copy data from local file system
– bin/hdfs dfs -put etc/hadoop/*.xml /user/ec2-user/input
– Ignore the error if you see one like this: “WARN hdfs.DataStreamer: Caught exception…”
• List the content of directory
– bin/hdfs dfs -ls /user/ec2-user/input
52
Working with hdfs
• Copy data from hdfs
– bin/hdfs dfs -get /user/ec2-user/input input1
– If input1 does not exist, it will create one
– If it does, it will create another one under it
• Examine the content of file in hdfs
– bin/hdfs dfs -cat /user/ec2-user/input/core-site.xml
53
Working with hdfs
• Remove files
– bin/hdfs dfs -rm /user/ec2-user/input/core-site.xml
– bin/hdfs dfs -rm /user/ec2-user/input/*
• Remove directory
– bin/hdfs dfs -rmdir /user/ec2-user/input
– Directory “input” needs to be empty first
54
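The same operations can also be done programmatically through the Java FileSystem API; below is a sketch (paths reuse the examples above; the recursive delete at the end corresponds to -rm plus -rmdir):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalents {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    fs.mkdirs(new Path("/user/ec2-user/input"));                        // hdfs dfs -mkdir
    fs.copyFromLocalFile(new Path("etc/hadoop/core-site.xml"),          // hdfs dfs -put
                         new Path("/user/ec2-user/input"));
    for (FileStatus s : fs.listStatus(new Path("/user/ec2-user/input"))) {  // hdfs dfs -ls
      System.out.println(s.getPath());
    }
    fs.copyToLocalFile(new Path("/user/ec2-user/input/core-site.xml"),  // hdfs dfs -get
                       new Path("input1"));
    fs.delete(new Path("/user/ec2-user/input"), true);                  // recursive remove
  }
}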
Where is hdfs located?
• /tmp/hadoop-ec2-user/dfs/
55
References
• K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop Distributed File System,” in Proc. IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 2010, pp. 1-10.
• HDFS File System Shell Guide:
– https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
56