
DSCC 201/401
Tools and Infrastructure for Data Science
February 8, 2021

TAs and Blackboard
• Teaching Assistants
• Alex Crystal (acrystal@u.rochester.edu)
• Siyu Xue (sxue3@u.rochester.edu)
• Senqi Zhang (szhang71@u.rochester.edu)
• Quick Review of Blackboard
• Reminder: HW#1 is due at 9 a.m. EST on Wednesday
2

Hardware Resources for Data Science
• Supercomputers
• Cluster Computing
• Virtualization and Cloud Computing
3

Supercomputers: Measuring Performance
• Rpeak (theoretical) vs. Rmax (actual)
• Theoretical value is just that — a theoretical value based on chip architecture
• How do we know that we can achieve (or get close to) that number?
• Answer: LINPACK Benchmark
• LINPACK Benchmark is based on LU decomposition of a large matrix (factoring a matrix as the product of a lower triangular matrix and an upper triangular matrix)
• “Gaussian elimination” for computers – used for solving systems of equations and determining inverses, determinants, etc.
• How big of a matrix should we use? We tune to get the best Rmax.
• Other considerations: Power consumption (measured in FLOPS/Watt)
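• To make the LU idea concrete, here is a minimal sketch (using NumPy/SciPy, not the actual tuned HPL/LINPACK code) of factoring a matrix and reusing the factors to solve A x = b:
```python
# Minimal LU-factorization sketch with SciPy; an illustration of the math
# behind LINPACK, not the benchmark itself.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

n = 1000                          # problem size (HPL tunes this to maximize Rmax)
A = np.random.rand(n, n)
b = np.random.rand(n)

lu, piv = lu_factor(A)            # Gaussian elimination with partial pivoting
x = lu_solve((lu, piv), b)        # forward/back substitution with the L and U factors

print(np.allclose(A @ x, b))      # True: the factorization solves the system
```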
4

Measuring Performance
• Theoretical CPU performance is calculated from CPU architecture and clock speed
• Most common metric is based on calculation of double-precision (DP) floating-point numbers (i.e. double in C++) – 64 bits
• FLOPS = FLoating point OPerations per Second
• We need to consider what type of floating point operation per second!
Name | Abbreviation | Memory (Bytes) | Bits | Format Name
Double Precision | DP | 8 | 64 | FP64
Single Precision | SP | 4 | 32 | FP32
Half Precision | HP | 2 | 16 | FP16
5

Intel Xeon (x86)
• Performance based on: Microarchitecture, Number of Cores, and Clock Speed
Microarchitecture | Year Announced | Process Technology | Instructions per Cycle (DP)
Nehalem | 2008 | 45 nm | 4
Westmere | 2010 | 32 nm | 4
Sandy Bridge | 2011 | 32 nm | 8
Ivy Bridge | 2012 | 22 nm | 8
Haswell | 2013 | 22 nm | 16
Broadwell | 2014 | 14 nm | 16
Skylake | 2015 | 14 nm | 32
Cascade Lake | 2018 | 14 nm | 32
6

Supercomputers: Measuring Performance
• Most common metric is based on calculation of double-precision floating-point numbers (i.e. double in C++) – 64 bits
• Example: Intel Xeon E5-2697v4 (Broadwell) 2.3 GHz
• Details: https://ark.intel.com/content/www/us/en/ark/products/91755/intel-xeon-processor-e5-2697-v4-45m-cache-2-30-ghz.html
• Performance = (# cores) * (clock speed) * (instructions per cycle)
• Performance = 18 * 2.3 GHz * 16 = 662.4 GFLOPS
• 100 servers with 2 CPUs per server would have a theoretical performance of:
• 100 * 2 * 662.4 = 132,480 GFLOPS = 132.48 TFLOPS
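• A small sketch of the same arithmetic, so the numbers above can be reproduced for any processor (the inputs for the Broadwell example come from this slide):
```python
# Theoretical peak (Rpeak) from core count, clock speed, and DP instructions per cycle.
def rpeak_gflops(cores, clock_ghz, flops_per_cycle):
    return cores * clock_ghz * flops_per_cycle

per_cpu = rpeak_gflops(cores=18, clock_ghz=2.3, flops_per_cycle=16)  # Xeon E5-2697v4
cluster = 100 * 2 * per_cpu                                          # 100 servers, 2 CPUs each
print(per_cpu, "GFLOPS per CPU")        # 662.4
print(cluster / 1000, "TFLOPS total")   # 132.48
```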
7

Processing Technology
• x86 architecture: Intel Xeon
• Accelerator architecture: Intel Phi and Nvidia GPU
8

• CPU = Central Processing Unit
• Diagram: a CPU's main components are the Control Unit, the Arithmetic Logic Unit, Instruction Memory, Data Memory, and Input/Output
9

Intel Phi
• Many integrated core (MIC) architecture
• Introduced in 2013 and x86 compatible
• Goal is to provide many cores at a slower clock speed (the opposite of the original design driver for standard CPUs: fewer cores at higher clock speeds)
• X100 Series – Introduced as PCIe card (e.g. Phi 5110P – 60 cores, 1.0 GHz, 1.0 TF (DP))
• Evolved to exist as standalone chips – Knights Landing (72 cores, 1.5 GHz, 3.5 TF), Knights Hill (canceled), and Knights Mill (future)
• Much of the development of Intel Phi has been integrated into the latest server-class CPUs (Skylake and Cascade Lake)
10

Trinity: Los Alamos National Laboratory
Cray XC40: Intel Haswell + Knights Landing

GPU
• GPU = Graphics Processing Unit
• “A specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device”
• Originally developed for graphics and high-end video gaming systems (and continues to be developed)
• Eventually extended for general purpose computing using programming models to access and control a GPU – starting around 2007 – CUDA introduced (Compute Unified Device Architecture)
12

GPU
• GPU is a coprocessor to the CPU and is tied to at least one core
• A GPU has many more cores than a CPU (e.g. 2,688)
• GPU RAM is smaller than CPU RAM (e.g. 8 GB)
• Diagram: the CPU (with its cache and main memory) is connected to the GPU (with its own GPU memory) over the PCIe bus
13

GPU
• What does a GPU look like?
• Expansion card or integrated onto board
14

GPU
• PCIe (Peripheral Component Interconnect Express) card
• Each device has a grid of blocks and each block has shared memory and threads (a minimal kernel sketch follows below)
• What does a GPU look like?
• Diagram: the device contains its global memory and a grid of blocks
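• As a rough illustration of the grid/block/thread model, here is a sketch using Numba's CUDA support from Python; the kernel, array sizes, and launch configuration are arbitrary choices for illustration, not anything specific to a particular system:
```python
# Vector addition on the GPU: the grid is split into blocks, each block into threads.
import numpy as np
from numba import cuda

@cuda.jit
def add(a, b, out):
    i = cuda.grid(1)          # this thread's global index across the whole grid
    if i < out.size:          # guard: the grid may cover more indices than elements
        out[i] = a[i] + b[i]

n = 1_000_000
a = np.ones(n, dtype=np.float32)
b = np.ones(n, dtype=np.float32)
out = np.zeros(n, dtype=np.float32)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
add[blocks, threads_per_block](a, b, out)   # host arrays are copied over the PCIe bus
print(out[:3])                              # [2. 2. 2.]
```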
15

GPU Specifications
• Nvidia GPUs designed for high-performance computing are referred to as Tesla
• Nvidia Tesla has had many major generations of GPUs for computing: Fermi, Kepler, Pascal, Volta, and Ampere
GPU | Generation | CUDA Cores | GPU RAM | TF (DP)
C2050 | Fermi | 448 | 6 GB | 0.5
K20 | Kepler | 2496 | 5 GB | 1.2
K20X | Kepler | 2688 | 6 GB | 1.3
K80 | Kepler | 4992 | 24 GB | 2.9
P100 | Pascal | 3584 | 16 GB | 4.7
V100 | Volta | 5120 | 16 GB | 7.0
A100 | Ampere | 6912 | 40 GB | 19.5
16

GPU – Programming Model and PCIe vs. NVLink
• Multiple GPU cards can be placed in a host computer
• Communication through PCIe bus (P2P)
17

GPU – Programming Model and PCIe vs. NVLink
• Multiple GPU cards can be placed in a host computer
• NVLink is available to Pascal, Volta, and Ampere class GPUs in systems that support it – much better performance (but expensive!)
18

GPU – Nvidia DGX-1
19

GPU – Nvidia DGX-2
20

Summit: Oak Ridge National Laboratory – 149 PF
IBM Power9 CPUs + Nvidia V100 GPUs (2 CPUs + 6 GPUs per node)

Measuring Performance
• Theoretical performance is calculated from chip architecture and clock speed
• Most common metric is based on calculation of double-precision floating-point numbers (i.e. double in C++) – 64 bits
• FLOPS = FLoating point OPerations per Second
• We need to consider what type of floating point operation per second!
Name | Abbreviation | Memory (Bytes) | Bits | Format Name
Double Precision | DP | 8 | 64 | FP64
Single Precision | SP | 4 | 32 | FP32
Half Precision | HP | 2 | 16 | FP16
22

Floating Point Precision
• FP64: 1 sign bit, 11 exponent bits, 52 fraction bits (bits 63-0); pi stored as 3.14159265359
• FP32: 1 sign bit, 8 exponent bits, 23 fraction bits (bits 31-0); pi stored as 3.14159
• FP16: 1 sign bit, 5 exponent bits, 10 fraction bits (bits 15-0); pi stored as 3.14
Note: Values of pi only shown for illustration purposes.
23
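• A quick way to see the effect of these formats (a NumPy sketch; the printed digits are approximate and only meant to mirror the figure above):
```python
import numpy as np

# The same value of pi stored in each precision; byte sizes match the table above.
for dtype in (np.float64, np.float32, np.float16):
    x = dtype(np.pi)
    print(dtype.__name__, x.nbytes, "bytes:", x)
# float64 8 bytes: 3.141592653589793
# float32 4 bytes: 3.1415927
# float16 2 bytes: 3.14
```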

GPU Acceleration for Machine Learning
• Machine learning algorithms generally do not need double precision
• GPUs provide additional acceleration when high precision is not required
GPU or CPU | Generation | Cores | TF (DP) | TF (SP)
C2050 | Fermi | 448 | 0.5 | 1.0
K20 | Kepler | 2496 | 1.2 | 3.5
K20X | Kepler | 2688 | 1.3 | 3.9
K80 | Kepler | 4992 | 2.9 | 8.7
P100 | Pascal | 3584 | 4.7 | 9.3
V100 | Volta | 5120 | 7.0 | 14.0 (125 HP)
A100 | Ampere | 6912 | 19.5 | 19.5 (312 HP)
Intel Xeon Gold 6230 2.1 GHz | Cascade Lake | 20 | 1.3 | 2.6
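• One reason lower precision pays off, sketched with plain NumPy (the array size is an arbitrary example): the same set of model weights takes a quarter of the memory, and of the memory bandwidth, in FP16 compared with FP64.
```python
import numpy as np

# Memory footprint of 100 million weights at each precision.
n = 100_000_000
for dtype in (np.float64, np.float32, np.float16):
    w = np.empty(n, dtype=dtype)
    print(f"{np.dtype(dtype).name:8s} {w.nbytes / 2**30:.2f} GiB")
# float64  0.75 GiB
# float32  0.37 GiB
# float16  0.19 GiB
```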
24

GPU Acceleration for Machine Learning
• Nvidia’s V100 Tensor cores provide 125 TF (HP) of acceleration per GPU
• Nvidia’s A100 Tensor cores provide 312 TF (HP) of acceleration per GPU
• Built specifically for tensor processing
• Competition with Google’s Tensor Processing Unit (TPU) – 45 TF (HP)
• The TPU is an ASIC (Application Specific Integrated Circuit)
25

Nvidia DGX A100
26

27

Selene: Nvidia Corporation (USA) – 63 PF
Nvidia DGX A100 SuperPOD with Nvidia A100 GPUs

GPU Acceleration for Machine Learning
• AMD is working on competitors based on Radeon Instinct
• AMD is focusing its programming efforts on OpenCL (Open Computing Language), in contrast to Nvidia’s CUDA
GPU or CPU | Generation | Cores | TF (DP) | TF (SP)
MI100 (AMD) | MI100 (AMD) | 7680 | 11.5 | 23.1 (185 HP)
MI60 (AMD) | Vega (AMD) | 4096 | 7.4 | 14.7 (29.5 HP)
A100 | Ampere | 6912 | 19.5 | 19.5 (312 HP)
P100 | Pascal | 3584 | 4.7 | 9.3
V100 | Volta | 5120 | 7.0 | 14.0 (125 HP)
Intel Xeon Gold 6230 2.1 GHz | Cascade Lake | 20 | 1.3 | 2.6
29

GPUs – More Than Just For Supercomputers
• Nvidia Tesla line is designed for supercomputers and server-class architectures
• Nvidia Tegra line is designed for mobile devices and embedded systems
• Google’s Edge TPU is another device designed for embedded and mobile systems
• AI application – computer vision for cars, robotics, etc.
GPU | Generation | CUDA Cores | GPU RAM | TF (SP) | TF (HP)
K1 | Kepler | 192 | 8 GB | 0.4 | -
X1 | Maxwell | 256 | 8 GB | 0.5 | 1.0
X2 | Pascal | 256 | 8 GB | 0.8 | 1.5
Xavier | Volta | 512 | 16 GB | 1.4 | 2.8
Orin | Ampere | 2048 | TBD | TBD | TBD

Cluster Computing
31

What is a Linux Cluster?
• A group of computers linked by a high-speed interconnect that can act as a large system for big computations and data processing
• Group works closely together and has the appearance of a single computer
• Runs an operating system that uses the Linux kernel
• Uses software to control computational tasks
32

33

How is a Linux Cluster Different from Other Supercomputers?
• A Linux cluster is a type of supercomputer
• Typically constructed from “commodity” server hardware
• Linux clusters have a more customizable system architecture (processor, memory, interconnect)
• Usually deployed with “less proprietary” designs and configurations and can be constructed with smaller discrete units (e.g. node vs. rack)
• We will examine Linux clusters in the context of our own here at the University of Rochester, known as BlueHive
34

Linux Clusters – Major Components
• Computing
• Storage
• Network
• Software
35

Some Hardware Definitions
• Storage – permanent data storage (hard drive or solid state drive), does not go away when system is powered off
• Memory – usually refers to RAM (random access memory), data goes away when system is powered off (volatile)
• Processor – a computing chip that processes data (usually refers to a CPU)
• CPU – central processing unit, the chip that does the computing, also known as a processor
• Socket – a physical location on a computer board that can house a CPU
• Core – a computing element of a CPU that can process data independently; most CPUs today have multiple cores
• Node – a physical computing unit (e.g. server) with sockets that have one or more processors, banks of memory, and a network interface
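• A small sketch (Linux only; it relies on the standard fields in /proc/cpuinfo) that counts the sockets and logical cores visible on a node, tying these definitions to something you can run:
```python
# Count sockets and logical CPUs from /proc/cpuinfo on a Linux node.
def cpu_topology(path="/proc/cpuinfo"):
    sockets = set()      # distinct "physical id" values, one per socket
    logical = 0          # every "processor" entry is one logical core
    with open(path) as f:
        for line in f:
            if line.startswith("processor"):
                logical += 1
            elif line.startswith("physical id"):
                sockets.add(line.split(":")[1].strip())
    return {"sockets": len(sockets), "logical_cpus": logical}

print(cpu_topology())    # e.g. {'sockets': 2, 'logical_cpus': 48} on a two-socket node
```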
36

Linux Cluster Hardware – Compute
• Compute nodes are typically dense servers that are responsible for processing in a Linux cluster
• Compute nodes are stacked and placed in standard 19-inch wide 42U high racks
• Each server often has 2 sockets with RAM and local disk
• Each socket has 1 CPU and that CPU has many cores
• At least 1 node in the cluster is a login node and 1 node in the cluster is a service node
• Login node is where users log in to system and submit jobs
• Service node controls the system (not user accessible)
37

Linux Cluster Hardware – Compute
38

Linux Cluster Hardware – Compute
e.g. Dell C6300: 2 sockets, RAM, hard drive
39

Linux Cluster Hardware – Compute
• Diagram: a 2U chassis (~3.5 in high) holding four compute nodes (CN) and shared power supplies (PS)
• e.g. Dell C6300: 4 compute nodes in 2U
40

Linux Cluster Hardware – Compute
• Diagram: 2U chassis (power supplies and compute nodes) stacked in a 42U rack (~6.5 ft high) with 19-inch width rails
41

Linux Cluster Hardware – Storage
• Individual hard drives on compute nodes – BUT typically not enough!
• How do we make sure all files are accessible by any one of the compute nodes?
• Clustered file system allows a file system to be mounted on several nodes
• Network attached storage (NAS) can be mounted on nodes using NFS (network file system) – similar to “network drive”
• Better performance and redundancy are achieved through parallel file systems, which are a special type of clustered file system
• Parallel file systems provide concurrent high-speed file access to applications executing on multiple nodes of clusters
• Efficiency and redundancy are improved by allowing nodes to access storage at the block level (a lower level than the file level)
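• A quick way to see which of these file systems a node actually has mounted (a Linux-only sketch reading /proc/mounts; the set of file-system type names is an assumption and may differ by site):
```python
# List mount points whose file-system type looks like a network or parallel file system.
NETWORK_FS = {"nfs", "nfs4", "lustre", "gpfs"}   # assumed type names; site setups vary

with open("/proc/mounts") as f:
    for line in f:
        device, mountpoint, fstype = line.split()[:3]
        if fstype in NETWORK_FS:
            print(f"{mountpoint:20s} {fstype:8s} from {device}")
```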
42

Parallel File Systems
• Lustre
• Developed by Cluster File Systems, Inc. (but now open source)
• Uses metadata servers, object storage servers, and client servers for the file system
• Spectrum Scale (i.e. GPFS – General Parallel File System)
• Developed by IBM (proprietary)
• Parallel file system that provides access to block level storage on multiple nodes
• Blocks are distributed across multiple disk arrays with redundancy (declustered RAID – Redundant Array of Independent Disks)
• Metadata (i.e. information about the files and layout) are distributed across the multiple disk arrays
43

Spectrum Scale (GPFS)
• Diagram: pairs of NSD servers attached to multiple JBOD disk enclosures, all housed in a 42U rack
• JBOD (Just a Bunch of Disks) provides the actual storage – typically 60 disks
• NSD (Network Shared Disk) Servers share out the storage to the clients through a network connection
• GPFS server and client software
44

Linux Clusters – Networking
• Ethernet can be used but has high latency (e.g. 50-125 μs) due to the TCP/IP protocol
• Small packets of information take a long time to reach destination
• InfiniBand has been designed for low latency (and high bandwidth) – typically less than 5 μs
• FDR10 (10 Gb/s), FDR (14 Gb/s), EDR (25 Gb/s), and HDR (50 Gb/s) are commonly used today
• Links can be aggregated for extra bandwidth (e.g. 4X, 8X, etc.)
• BlueHive uses 4X aggregation of EDR (i.e. 100 Gb/s) and 4X aggregation of FDR10 (i.e. 40 Gb/s bandwidth)
• Copper cables can be used for InfiniBand lengths less than 10 meters (otherwise optical fiber cables are used)
• Network switches can be used to link all components together
45
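• A back-of-the-envelope sketch of why latency matters for small messages (the Ethernet figures of 50 μs and 10 Gb/s are assumed for illustration; the InfiniBand numbers come from the slide above):
```python
# Time to deliver one message = latency + size / bandwidth.
def transfer_time_us(size_bytes, latency_us, bandwidth_gbps):
    bits_per_us = bandwidth_gbps * 1e3          # 1 Gb/s = 1,000 bits per microsecond
    return latency_us + (size_bytes * 8) / bits_per_us

msg = 4096                                                      # a 4 KB message
print(f"Ethernet:   {transfer_time_us(msg, 50, 10):.1f} us")    # ~53.3 us, latency-dominated
print(f"InfiniBand: {transfer_time_us(msg, 5, 100):.1f} us")    # ~5.3 us
```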

BlueHive
46

47

Linux Cluster Hardware
• Necessary hardware components: Compute Nodes, Storage, Networking
• Also need a login node to provide a system for users to log in and interact with the system
• Also need a service node to provide a place to run and manage the control and monitoring software for the system
48