
DSCC 201/401
Tools and Infrastructure for Data Science
February 3, 2021

Office Hours, TAs, and Blackboard
• Instructor Office Hours: TBD (email me with questions or to set up a time)
• brendan.mort@rochester.edu
• Teaching Assistants
• Alex Crystal (acrystal@u.rochester.edu)
• Siyu Xue (sxue3@u.rochester.edu)
• Senqi Zhang (szhang71@u.rochester.edu)
• Quick Review of Blackboard
2

Reminder of Homework Assignment #1
• Homework #1 is due Wednesday, February 10th at 9 a.m. EST
• Posted to Blackboard
• NetID Username and Password
• DUO Two-Factor Authentication
• VPN Client
• Submit acknowledgment through Blackboard
3

Hardware Resources for Data Science
• Supercomputers
• Cluster Computing
• Virtualization and Cloud Computing
4

Supercomputers
5

Quick Hardware Overview
• CPU (Central Processing Unit) – the chip that performs the calculations and controls the computer (arithmetic and logic)
• Memory – often an abbreviation for RAM (random access memory), a place where data is temporarily stored for access by an application; memory is volatile (i.e. the contents disappear when the computer is powered off)
• Cache – small, fast memory integrated on the processor and used by the processor during computation
• Storage – place where data is stored permanently (non-volatile), usually a hard drive, flash drive, or an array of drives
• Bus – circuitry that allows communication between components inside a computer (internal communication)
• High-speed interconnect – networking hardware, often cabling, that allows separate computers to communicate with one another over distance (external communication)
6

Supercomputers
• What is supercomputing? Computing with the fastest systems available
• Architectures are distinguished by the layout of processors, memory, and the interconnect
• 4 Standard Models (Historical)
• Distributed Memory
• Shared Memory
• Accelerator Coprocessing
• Massively Parallel Processing
7

Distributed Memory
Memory access on remote machines occurs over a high-speed interconnect (see the message-passing sketch after this slide)
[Diagram: four nodes, each with its own CPU, cache, and memory, connected by a high-speed interconnect]
Cluster of uniprocessor computers (the original compute cluster)
8
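A minimal sketch of the distributed-memory idea, assuming Python with the mpi4py package (an assumption, not part of the course materials): each process owns its own memory, and data only moves between processes as explicit messages over the interconnect.

```python
# Hypothetical illustration of distributed memory with message passing.
# Requires an MPI installation and mpi4py; run with: mpirun -n 4 python demo.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()           # each process has its own rank and its own memory
size = comm.Get_size()

local_data = [rank] * 3          # lives only in this process's memory

if rank == 0:
    # Rank 0 cannot read other ranks' memory directly; it must receive messages.
    gathered = [local_data]
    for source in range(1, size):
        gathered.append(comm.recv(source=source, tag=0))
    print("Rank 0 assembled:", gathered)
else:
    comm.send(local_data, dest=0, tag=0)
```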

Shared Memory
Common physical memory can be accessed by all processors; this is known as symmetric multiprocessing (SMP) (see the sketch after this slide)
[Diagram: four CPUs, each with its own cache, sharing a single memory over a high-speed interconnect]
SGI Altix UV (up to 256 sockets sharing up to 16 TB)
9
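By contrast, a minimal shared-memory (SMP-style) sketch using only the Python standard library: several workers read and write one region of memory directly, with no message passing. The `scale` helper is hypothetical, purely for illustration.

```python
# Hypothetical illustration of shared memory: all workers touch the same array.
from multiprocessing import Process, Array

def scale(shared, start, stop, factor):
    # Each worker writes directly into the common memory region.
    for i in range(start, stop):
        shared[i] = shared[i] * factor

if __name__ == "__main__":
    data = Array("d", range(8))            # one block of memory shared by all workers
    workers = [
        Process(target=scale, args=(data, 0, 4, 2.0)),
        Process(target=scale, args=(data, 4, 8, 2.0)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(list(data))                      # [0.0, 2.0, 4.0, ..., 14.0]
```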

Hybrid Model
The distributed and shared memory models combined (see the sketch after this slide)
[Diagram: nodes that each contain several CPUs with caches sharing memory over a local bus; the nodes are connected by a high-speed interconnect]
IBM BladeCenter H with HS22 blades
10
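A brief sketch of the hybrid idea, again assuming mpi4py (not specified in the slides): threads share memory inside each process, and message passing combines results between processes or nodes.

```python
# Hypothetical hybrid sketch: threads share memory within a rank,
# MPI moves results between ranks. Run with: mpirun -n 2 python hybrid.py
from concurrent.futures import ThreadPoolExecutor
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

chunk = list(range(rank * 1000, (rank + 1) * 1000))    # this rank's local data

# Shared-memory part: threads inside the process work on the same list.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(sum, [chunk[i::4] for i in range(4)]))
local_total = sum(partials)

# Distributed-memory part: combine per-rank totals over the interconnect.
grand_total = comm.allreduce(local_total, op=MPI.SUM)
if rank == 0:
    print("Total across all ranks:", grand_total)
```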

Accelerated Computing
[Diagram: two CPUs, each with its own cache and memory, connected over a bus to two GPUs]
Nvidia Tesla P100 GPU
11
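A minimal sketch of accelerator offload, assuming the CuPy library and an Nvidia GPU (neither is specified in the slides): data moves from CPU memory to GPU memory over the bus, the GPU does the arithmetic, and the result comes back to the host.

```python
# Hypothetical GPU-offload illustration with CuPy (requires an Nvidia GPU).
import numpy as np
import cupy as cp

host = np.random.rand(1_000_000)        # lives in CPU (host) memory
device = cp.asarray(host)               # copied across the bus to GPU memory
result = cp.sqrt(device).sum()          # computed on the GPU
print(float(result))                    # brought back to the CPU as a Python float
```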

Massively Parallel Processing
[Diagram: a large array of nodes, each pairing a CPU with its own RAM]
Blue Gene/P
12

Blue Gene/P System
13

Supercomputers: Measuring Performance
• Measuring performance: FLOPS
• 1 FLOPS = 1 FLoating point Operation Per Second
• Theoretical and “actual” value based on a specific benchmark
• Often measured today in GigaFLOPS (GF) and TeraFLOPS (TF)
Device: Speed (FLOPS)
• Human: < 10^-2
• Pocket Calculator: 10
• Cray-1 (1976): 3 x 10^8
• iPhone 7: 3 x 10^11
• High-End Desktop PC: 1 x 10^12
• Blue Gene/Q Rack: 2.1 x 10^14
14

Brief History
[Chart: fastest supercomputer over time, log(FLOPS) vs. year, 1960-2020, climbing from MegaFLOPS through GigaFLOPS and TeraFLOPS to PetaFLOPS]
CDC 6600 (1964)
Cray X-MP (ca. 1984)
Beowulf Cluster - NASA (ca. 1994)
ASCI Red - Sandia (1996) - 1.3 TF
Blue Gene/L - LLNL (2004)
HLRB II (ca. 2005) - Leibniz-Rechenzentrum, Garching, Germany
MareNostrum (2005) - Barcelona Supercomputing Center (image courtesy of www.bsc.es)
Roadrunner (2008) - 1.0 PF - Los Alamos National Laboratory
Jugene (2010) - 826 TF - Forschungszentrum Jülich GmbH, Jülich, Germany
Juqueen (2013) - 5.0 PF - Forschungszentrum Jülich GmbH, Jülich, Germany
Sequoia (2013) - 17.2 PF - Lawrence Livermore National Laboratory (LLNL)
Titan (2013) - 17.6 PF - Oak Ridge National Laboratory (ORNL)

Top Systems (2021)
• Fugaku (富岳): RIKEN Center for Computational Science (Japan) - 442 PF - Fujitsu A64FX Processor
• Summit: Oak Ridge National Laboratory (USA) - 149 PF - IBM Power 9 + Nvidia Volta V100 GPU
• Sierra: Lawrence Livermore National Laboratory (USA) - 95 PF - IBM Power 9 + Nvidia Volta V100 GPU
• TaihuLight (太湖之光): National Supercomputing Center in Wuxi (China) - 93 PF - Sunway Processor
• Selene: Nvidia Corporation (USA) - 63 PF - Nvidia DGX A100 SuperPOD with Nvidia A100 GPU
• Tianhe-2A (天河-2A): National Super Computer Center in Guangzhou (China) - 61 PF - Intel Ivy Bridge + Phi
• JUWELS: Forschungszentrum Juelich (Germany) - 44 PF - Bull: AMD EPYC + Nvidia Ampere A100 GPU
• HPC5: Eni S.p.A. (Italy) - 35 PF - Dell: Intel Cascade Lake + Nvidia Volta V100 GPU
• Frontera: Texas Advanced Computing Center (USA) - 24 PF - Dell: Intel Cascade Lake
• Dammam-7: Saudi Aramco (Saudi Arabia) - 22 PF - Intel Cascade Lake + Nvidia Volta V100

Top 5 Supercomputers (2021)
39

Supercomputers: Measuring Performance
• Rpeak (theoretical) vs. Rmax (actual)
• The theoretical value is just that: a theoretical value based on the chip architecture
• How do we know that we can achieve (or get close to) that number?
• Answer: the LINPACK Benchmark
• The LINPACK Benchmark is based on LU decomposition of a large matrix (factoring a matrix as the product of a lower triangular matrix and an upper triangular matrix)
• "Gaussian elimination" for computers - used for solving systems of equations and determining inverses, determinants, etc.
• How big of a matrix should we use? We tune the size to get the best Rmax.
• Other considerations: power consumption (measured in FLOPS/Watt)
40

Measuring Performance
• Theoretical CPU performance is calculated from the CPU architecture and clock speed
• The most common metric is based on calculations with double-precision (DP) floating-point numbers (i.e. double in C++) - 64 bits
• FLOPS = FLoating point OPerations per Second
• We need to consider what type of floating point operation per second!

Name | Abbreviation | Memory (Bytes) | Bits | FP Name
Double Precision | DP | 8 | 64 | FP64
Single Precision | SP | 4 | 32 | FP32
Half Precision | HP | 2 | 16 | FP16
41

Intel Xeon (x86)
• Performance based on: microarchitecture, number of cores, and clock speed

Microarchitecture | Year Announced | Process Technology | Instructions per Cycle (DP)
Nehalem | 2008 | 45 nm | 4
Westmere | 2010 | 32 nm | 4
Sandy Bridge | 2011 | 32 nm | 8
Ivy Bridge | 2012 | 22 nm | 8
Haswell | 2013 | 22 nm | 16
Broadwell | 2014 | 14 nm | 16
Skylake | 2015 | 14 nm | 32
Cascade Lake | 2018 | 14 nm | 32
42

Supercomputers: Measuring Performance
• The most common metric is based on calculations with double-precision floating-point numbers (i.e. double in C++) - 64 bits
• Example: Intel Xeon E5-2697v4 (Broadwell), 2.3 GHz
• Details: https://ark.intel.com/content/www/us/en/ark/products/91755/intel-xeon-processor-e5-2697-v4-45m-cache-2-30-ghz.html
• Performance = (# of cores) * (clock speed) * (instructions per cycle)
• Performance = 18 * 2.3 GHz * 16 = 662.4 GFLOPS
• 100 nodes with 2 CPUs per node would be:
• 100 * 2 * 662.4 GFLOPS = 132,480 GFLOPS = 132.48 TFLOPS (a short script reproducing this calculation appears at the end of this section)
43

Processing Technology
• x86 architecture: Intel Xeon
• Accelerator architecture: Intel Phi and Nvidia GPU
44

• CPU = Central Processing Unit
[Diagram: CPU containing a control unit and an arithmetic logic unit, connected to instruction memory, data memory, and input/output]
45

Intel Phi
• Many integrated core (MIC) architecture
• Introduced in 2013 and x86 compatible
• Goal is to provide many cores at a slower clock speed (the opposite of the initial driver for standard CPUs)
• X100 series - introduced as a PCIe card (e.g. Phi 5110P - 60 cores, 1.0 GHz, 1.0 TF (DP))
• Evolved into standalone chips - Knights Landing (72 cores, 1.5 GHz, 3.5 TF), Knights Hill (canceled), and Knights Mill (future)
• Much of the development of Intel Phi has been integrated into the latest server-class CPUs (Skylake and Cascade Lake)
46

Trinity: Los Alamos National Laboratory
Cray XC30: Intel Haswell + Knights Landing
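To tie the Rpeak formula and the LINPACK idea together, here is a hedged sketch in Python, assuming NumPy (not part of the slides). The peak numbers reproduce the Broadwell example above; the timed dense solve is only a rough stand-in for the real LINPACK benchmark, the helper name peak_gflops is hypothetical, and the (2/3)n^3 operation count is the standard estimate for an LU-based solve rather than a course-provided figure.

```python
# Minimal sketch: theoretical peak (Rpeak) vs. an achieved rate in the spirit of Rmax.
import time
import numpy as np

def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak in GFLOPS: cores * clock speed * instructions per cycle."""
    return cores * clock_ghz * flops_per_cycle

# Slide example: 18-core Broadwell at 2.3 GHz, 16 DP FLOPS per cycle.
rpeak = peak_gflops(18, 2.3, 16)             # 662.4 GFLOPS per CPU
cluster = 100 * 2 * rpeak                    # 100 nodes, 2 CPUs per node
print(f"Rpeak per CPU:     {rpeak:.1f} GFLOPS")
print(f"Rpeak for cluster: {cluster / 1000:.2f} TFLOPS")

# LINPACK-style measurement: solve Ax = b for a dense n x n system.
# An LU-based solve costs roughly (2/3) * n^3 floating point operations.
n = 2000
A = np.random.rand(n, n)
b = np.random.rand(n)
t0 = time.perf_counter()
x = np.linalg.solve(A, b)                    # LU factorization + triangular solves
elapsed = time.perf_counter() - t0
flops = (2.0 / 3.0) * n**3
print(f"Achieved: {flops / elapsed / 1e9:.1f} GFLOPS on one {n}x{n} solve")
```

The gap between the printed peak and the achieved rate is exactly the Rpeak vs. Rmax distinction: tuning the problem size (and using an optimized benchmark rather than this sketch) narrows, but never closes, that gap.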