DSCC 201/401
Tools and Infrastructure for Data Science
February 3, 2021
Office Hours, TAs, and Blackboard
• Instructor Office Hours: TBD (email me with questions or to set up a time)
• brendan.mort@rochester.edu
• Teaching Assistants
• Alex Crystal (acrystal@u.rochester.edu)
• Siyu Xue (sxue3@u.rochester.edu)
• Senqi Zhang (szhang71@u.rochester.edu)
• Quick Review of Blackboard
Reminder of Homework Assignment #1
• Homework #1 is due Wednesday, February 10th at 9 a.m. EST
• Posted to Blackboard
• NetID Username and Password
• DUO Two-Factor Authentication
• VPN Client
• Submit acknowledgment through Blackboard
Hardware Resources for Data Science
• Supercomputers
• Cluster Computing
• Virtualization and Cloud Computing
Supercomputers
Quick Hardware Overview
• CPU (Central Processing Unit) – the chip that performs calculations and controls the computer (arithmetic and logic)
• Memory – often an abbreviation for RAM (random access memory), where data is temporarily stored for access by an application; memory is volatile (i.e. its contents disappear when the computer is powered off)
• Cache – small, fast memory integrated on the processor and used by the processor during computations
• Storage – where data is stored permanently (non-volatile), usually a hard drive, flash drive, or array of drives
• Bus – circuitry that allows communication between components inside a computer (internal communication)
• High-speed interconnect – often a network cable that allows computers to communicate with one another over a distance (external communication)
Supercomputers
• What is supercomputing? – the fastest computing systems available
• Architectures are distinguished by the layout of processors, memory, and the interconnect
• 4 Standard Models (Historical)
• Distributed Memory
• Shared Memory
• Accelerator Coprocessing
• Massively Parallel Processing
Distributed Memory
Memory on remote machines is accessed via a high-speed interconnect
Cluster of uniprocessor computers (the original compute cluster)
[Diagram: multiple nodes, each with a CPU, cache, and local memory, connected by a high-speed interconnect]
Shared Memory
Common physical memory can be accessed by all processors
Known as symmetric multiprocessing (SMP)
Example: SGI Altix UV (up to 256 sockets sharing up to 16 TB)
[Diagram: multiple CPUs, each with its own cache, sharing one memory over a high-speed interconnect]
Hybrid Model
Distributed and shared memory models combined
Example: IBM BladeCenter H with HS22 blades
[Diagram: nodes in which several CPUs with their own caches share memory over a local bus, with the nodes connected by a high-speed interconnect]
Accelerated Computing
Example: Nvidia Tesla P100 GPU
[Diagram: CPUs, each with its own cache and memory, connected over a bus to GPU accelerators]
Massively Parallel Processing
Example: IBM Blue Gene/P system
[Diagram: many CPUs, each with its own RAM, arranged in a massively parallel layout]
Supercomputers: Measuring Performance
• Measuring performance: FLOPS
• 1 FLOPS = 1 FLoating point Operation Per Second
• Theoretical and “actual” value based on a specific benchmark
• Often measured today in GigaFLOPS (GF) and TeraFLOPS (TF)
Device                 Speed (FLOPS)
Human                  < 10^-2
Pocket Calculator      10
Cray-1 (1976)          3 x 10^8
iPhone 7               3 x 10^11
High-End Desktop PC    1 x 10^12
Blue Gene/Q Rack       2.1 x 10^14
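• As a rough illustration (a minimal sketch, not part of the course materials; the matrix size n = 2000 is an arbitrary choice), the achieved FLOPS of a machine can be estimated in Python by timing a dense NumPy matrix multiply:

import time
import numpy as np

# Multiplying two n x n matrices takes roughly 2*n^3 floating-point
# operations (about n multiplies and n adds per output element).
n = 2000
a = np.random.rand(n, n)
b = np.random.rand(n, n)

start = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - start

flops = 2 * n**3 / elapsed
print(f"Achieved roughly {flops / 1e9:.1f} GFLOPS (double precision)")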
Brief History
[Plot: performance of the fastest supercomputer over time, log(FLOPS) vs. year, 1960–2020, rising from MegaFLOPS through GigaFLOPS and TeraFLOPS to PetaFLOPS]
• CDC 6600 (1964)
• Cray X-MP (ca. 1984)
• Beowulf Cluster - NASA (ca. 1994)
• ASCI Red - Sandia (1996) - 1.3 TF
• Blue Gene/L - LLNL (2004)
• HLRB II - Leibniz-Rechenzentrum, Garching, Germany (ca. 2005)
• MareNostrum - Barcelona Supercomputing Center (2005) (image courtesy of www.bsc.es)
• Roadrunner - Los Alamos National Laboratory (2008) - 1.0 PF
• Jugene - Forschungszentrum Jülich GmbH, Jülich, Germany (2010) - 826 TF
• Juqueen - Forschungszentrum Jülich GmbH, Jülich, Germany (2013) - 5.0 PF
• Sequoia - Lawrence Livermore National Laboratory (LLNL) (2013) - 17.2 PF
• Titan - Oak Ridge National Laboratory (ORNL) (2013) - 17.6 PF
Top Systems (2021)
• Fugaku (富岳) - RIKEN Center for Computational Science (Japan) - 442 PF - Fujitsu A64FX processor
• Summit - Oak Ridge National Laboratory (USA) - 149 PF - IBM Power 9 + Nvidia Volta V100 GPU
• Sierra - Lawrence Livermore National Laboratory (USA) - 95 PF - IBM Power 9 + Nvidia Volta V100 GPU
• TaihuLight (太湖之光) - National Supercomputing Center in Wuxi (China) - 93 PF - Sunway processor
• Selene - Nvidia Corporation (USA) - 63 PF - Nvidia DGX A100 SuperPOD with Nvidia A100 GPU
• Tianhe-2A (天河-2A) - National Super Computer Center in Guangzhou (China) - 61 PF - Intel Ivy Bridge + Phi
• JUWELS - Forschungszentrum Juelich (Germany) - 44 PF - Bull: AMD EPYC + Nvidia Ampere A100 GPU
• HPC5 - Eni S.p.A. (Italy) - 35 PF - Dell: Intel Cascade Lake + Nvidia Volta V100 GPU
• Frontera - Texas Advanced Computing Center (USA) - 24 PF - Dell: Intel Cascade Lake
• Dammam-7 - Saudi Aramco (Saudi Arabia) - 22 PF - Intel Cascade Lake + Nvidia Volta V100
Top 5 Supercomputers (2021)
Supercomputers: Measuring Performance
• Rpeak (theoretical) vs. Rmax (actual)
• The theoretical value is just that: a value computed from the chip architecture
• How do we know that we can achieve (or get close to) that number?
• Answer: LINPACK Benchmark
• LINPACK Benchmark is based on LU decomposition of a large matrix (factoring a matrix as the product of a lower triangular matrix and an upper triangular matrix)
• "Gaussian elimination" for computers - used for solving systems of equations and determining inverses, determinants, etc.
• How big of a matrix should we use? We tune to get the best Rmax.
• Other considerations: Power consumption (measured in FLOPS/Watt)
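• To make the LINPACK idea concrete, here is a minimal Python sketch (not the official HPL benchmark; the matrix size n = 1000 is an arbitrary choice) that LU-factors a random dense matrix with SciPy and then solves Ax = b using the factors:

import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Build a random dense system Ax = b, the kind of problem LINPACK times.
n = 1000
rng = np.random.default_rng(0)
A = rng.random((n, n))
b = rng.random(n)

# LU decomposition with partial pivoting, then forward/back substitution.
lu, piv = lu_factor(A)
x = lu_solve((lu, piv), b)

# Sanity check: the residual ||Ax - b|| should be near zero.
print(f"n = {n}, residual = {np.linalg.norm(A @ x - b):.2e}")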
Measuring Performance
• Theoretical CPU performance is calculated from CPU architecture and clock speed
• Most common metric is based on calculation of double-precision (DP) floating-point numbers (i.e. double in C++) - 64 bits
• FLOPS = FLoating point OPerations per Second
• We need to consider what type of floating point operation per second!
Name                Abbreviation    Memory (Bytes)    Bits    Common Name
Double Precision    DP              8                 64      FP64
Single Precision    SP              4                 32      FP32
Half Precision      HP              2                 16      FP16
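• As a quick check of these widths (a minimal sketch, assuming NumPy is installed):

import numpy as np

# Byte and bit widths of the common IEEE 754 floating-point types.
for label, dtype in [("FP64 (double)", np.float64),
                     ("FP32 (single)", np.float32),
                     ("FP16 (half)", np.float16)]:
    size = np.dtype(dtype).itemsize
    print(f"{label}: {size} bytes = {size * 8} bits")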
Intel Xeon (x86)
• Performance based on: Microarchitecture, Number of Cores, and Clock Speed
Microarchitecture    Year Announced    Process Technology    Instructions per Cycle (DP)
Nehalem              2008              45 nm                 4
Westmere             2010              32 nm                 4
Sandy Bridge         2011              32 nm                 8
Ivy Bridge           2012              22 nm                 8
Haswell              2013              22 nm                 16
Broadwell            2014              14 nm                 16
Skylake              2015              14 nm                 32
Cascade Lake         2018              14 nm                 32
Supercomputers: Measuring Performance
• Most common metric is based on calculation of double-precision floating-point numbers (i.e. double in C++) - 64 bits
• Example: Intel Xeon E5-2697v4 (Broadwell) 2.3 GHz
• Details: https://ark.intel.com/content/www/us/en/ark/products/91755/intel-xeon-processor-e5-2697-v4-45m-cache-2-30-ghz.html
• Performance = (# cores) * (clock speed) * (instructions per cycle)
• Performance = 18 * 2.3 GHz * 16 = 662.4 GFLOPS
• 100 nodes with 2 CPUs per node would be:
• 100 * 2 * 662.4 = 132,480 GFLOPS = 132.48 TFLOPS
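• The same arithmetic as a minimal Python sketch (using the example values above):

# Theoretical peak (Rpeak) for an Intel Xeon E5-2697 v4 (Broadwell):
# 18 cores at 2.3 GHz with 16 double-precision operations per cycle.
cores = 18
clock_ghz = 2.3
flops_per_cycle = 16

gflops_per_cpu = cores * clock_ghz * flops_per_cycle
print(f"Per CPU: {gflops_per_cpu:.1f} GFLOPS")          # 662.4 GFLOPS

# A 100-node cluster with 2 CPUs per node:
nodes, cpus_per_node = 100, 2
total_tflops = nodes * cpus_per_node * gflops_per_cpu / 1000
print(f"Cluster: {total_tflops:.2f} TFLOPS")            # 132.48 TFLOPS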
Processing Technology
• x86 architecture: Intel Xeon
• Accelerator architecture: Intel Phi and Nvidia GPU
• CPU = Central Processing Unit
[Diagram: CPU block diagram with instruction memory, control unit, arithmetic logic unit, data memory, and input/output]
Intel Phi
• Many integrated core (MIC) architecture
• Introduced in 2013 and x86 compatible
• Goal is to provide many cores at a slower clock speed (opposite of initial driver for standard CPUs)
• X100 Series - Introduced as PCIe card (e.g. Phi 5110P - 60 cores, 1.0 GHz, 1.0 TF (DP))
• Evolved into standalone chips - Knights Landing (72 cores, 1.5 GHz, 3.5 TF), Knights Hill (canceled), and Knights Mill (2017)
• Much of the development of Intel Phi has been integrated into the latest server-class CPUs (Skylake and Cascade Lake)
Trinity: Los Alamos National Laboratory
Cray XC30: Intel Haswell + Knights Landing