Future of Computing I:
Diverging Computer System Design
15-213/18-213/15-513/18-613: Introduction to Computer Systems 28th Lecture, April 28, 2020
Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition
The Proliferation of Computing
Before 2000
Personal computers: desktops (more predominant) and laptops
Servers: delivered mostly static web pages (limited PHP/Perl/ASP)
Today
Smartphones/tablets greatly outnumber desktops and laptops
Servers and the cloud provide and store changing content
Big data revolution: amount generated is growing exponentially
[Chart: Data Generated Worldwide (ZB), 0 to 160 ZB over 2010-2025. Source: IDC/Seagate]
One-Size-Fits-All Is Going Away
Computers have very different needs and target metrics
Used to be just performance
Power/energy matter greatly now
Specialization can help cut down energy
Mobile devices use systems-on-chip (SoCs)
Servers use highly-multithreaded CPUs
x86 is no longer the dominant ISA
Recall: RISC vs. CISC from the first machine programming lecture
ARM ISAs (which are RISC) are now used in the vast majority of mobile devices
x86 (which is CISC) still reigns supreme in servers, desktops, laptops
Computer System Design is Diverging
Several types of systems are becoming popular
Graphics processing units (GPUs)
Mobile systems-on-chip (SoCs)
Data centers and cloud computing
Internet of things (IoT)/edge computing
A few promising designs may emerge in the future
Processing-in-memory (PIM)
Neuromorphic computing
How Do We Run Thousands of Threads?
CPUs become increasingly inefficient
Small number of cores: need thousands of context switches
Large number of cores: requires a lot of hardware and power
How do we handle consistency and snooping?
One option: SIMT (single instruction, multiple thread)
Run multiple copies of a single thread in lockstep (they all execute the same instruction at the same time) on different pieces of data
Programs use the multithreaded programming model (a.k.a. single program, multiple data or SPMD), but there are key differences from the multithreading you are used to
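To make the SPMD model concrete, here is a minimal CUDA sketch (an illustrative example, not taken from the lecture): every thread runs the same kernel body, and the only thing that distinguishes threads is the index each one computes, which selects the data element it works on.

    // Minimal SPMD sketch: each thread executes this same code (in lockstep with
    // the other threads of its warp) but on a different array element.
    __global__ void scale(float *x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique per-thread index
        if (i < n)           // guard: the grid may contain more threads than elements
            x[i] = a * x[i];
    }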
How to Implement SIMT
[Figure: a SIMT core, with one shared PC, I-Cache, and Decode stage driving several scalar-pipeline lanes that execute in SIMD fashion]
Write a program using threads
Each thread executes the same code but operates on a different piece of data
Each thread has its own context (i.e., can be treated/restarted/executed independently)
Group threads together dynamically (i.e., in hardware)
A group is known as a warp or a wavefront
Essentially a vector formed by hardware
SIMT processors can share common control flow logic for a warp across a number of scalar execution lanes (one lane per thread)
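As a small, hedged illustration of warp formation (assuming the usual warp size of 32, exposed in CUDA device code as the built-in warpSize): hardware groups consecutive threads of a block into warps, and each thread occupies one scalar lane.

    // Hypothetical helper: record which warp and which lane each thread lands in.
    __global__ void who_am_i(int *warp_of, int *lane_of) {
        int t = threadIdx.x;            // linear thread index within the block
        warp_of[t] = t / warpSize;      // the warp (vector formed by hardware)
        lane_of[t] = t % warpSize;      // the scalar execution lane within that warp
    }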
Graphics Processing Units (GPUs)
A cluster of SIMT cores (known as SMs or shaders) that share a memory hierarchy
Each SIMT core operates on one or more warps
Original purpose was for graphics workloads
Designed to operate in parallel on thousands of pixels/vertices/fragments
Used to have special cores for each step of the graphics pipeline
SIMT cores now general purpose enough to execute all graphics pipeline stages (generically called a shader core)
Now also used for general-purpose GPU (GPGPU) programming
[Diagram: SIMT cores connected through an interconnect (crossbar) to several memory partitions]
Using a GPU: Program
All terms here are based on NVIDIA CUDA
Basic unit of programming: kernel
A piece of code that can be run in parallel
A program consists of multiple kernels
Each kernel is assigned to a grid of threads
Basic unit of execution: thread block
A group of threads that can be executed in parallel
Thread block is limited to 1024 threads
Multiple blocks (of the same thread count) can be combined to form a grid
Kernels and thread blocks are managed by a software runtime
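A hedged sketch of how these terms map onto CUDA launch syntax (reusing the scale kernel from the earlier sketch; the array size N and the block size of 256 are arbitrary illustrative choices): the thread block size is chosen at launch (at most 1024 threads), and enough equal-sized blocks are requested to form a grid covering all the data.

    int threads_per_block = 256;                                    // one thread block: <= 1024 threads
    int blocks = (N + threads_per_block - 1) / threads_per_block;   // blocks of equal size form the grid
    scale<<<blocks, threads_per_block>>>(d_x, 2.0f, N);             // launch: the grid of thread blocks runs the kernel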
Using a GPU: Execution Flow
1. Host (i.e., CPU) sends a request to the GPU runtime to start a program
2. Runtime copies memory from host address space to the GPU address space (separate memory in discrete GPUs)
3. Runtime allocates per-thread resources (e.g., registers, scratchpad)
4. GPU executes each kernel in the program
5. Runtime copies results from GPU address space to host address space
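A minimal host-side sketch of this flow using the CUDA runtime API (the helper name run_on_gpu and the reuse of the scale kernel from earlier are assumptions for illustration):

    #include <cuda_runtime.h>

    // h_x is a host buffer of n floats; scale() is the kernel sketched earlier.
    void run_on_gpu(float *h_x, int n) {
        float *d_x;
        size_t bytes = n * sizeof(float);
        cudaMalloc(&d_x, bytes);                              // allocate GPU memory
        cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);  // step 2: host -> GPU copy
        scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);        // step 4: GPU runs the kernel
        cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);  // step 5: results back to host
        cudaFree(d_x);
    }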
Common Issues in GPUs
Sharing memory with the CPU is challenging
GPU has its own physical memory and address space
Not managed by the OS!
Requires program to copy data between the CPU and GPU
Unified Virtual Memory
Shared address space between the CPU and the GPU
No more need to copy data back and forth
Big issue: coordinating virtual-to-physical page mappings
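For comparison, a hedged sketch of CUDA's managed (unified) memory, where one allocation is visible to both the CPU and the GPU and the runtime migrates pages on demand, which is exactly where coordinating virtual-to-physical mappings becomes hard:

    // Hypothetical sketch reusing the scale kernel from earlier.
    void scale_managed(int n) {
        float *x;
        cudaMallocManaged(&x, n * sizeof(float));    // single pointer, valid on CPU and GPU
        for (int i = 0; i < n; i++) x[i] = 1.0f;     // CPU writes directly, no explicit copy
        scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n); // GPU reads/writes the same pointer
        cudaDeviceSynchronize();                     // wait before the CPU touches x again
        cudaFree(x);
    }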
Thread divergence makes lockstep execution inefficient
Each thread can have control flow instructions (e.g., branches)
Branch divergence occurs when threads inside a warp branch to different execution paths
Memory divergence occurs when some threads hit in a cache and others must go to main memory
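A hedged example of branch divergence: when threads of the same warp take different sides of a branch, the hardware runs both paths one after the other with the non-participating lanes masked off, so the lockstep lanes are only partially utilized.

    __global__ void diverge(int *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i % 2 == 0)        // even and odd lanes of the same warp disagree here,
            out[i] = 2 * i;    // so the warp first runs this path (odd lanes masked off)...
        else
            out[i] = 3 * i;    // ...and then this path (even lanes masked off)
    }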
Resource Allocation and Program Portability
GPUs have several resources that must be allocated
Programmer dictates the resources needed per thread
Runtime simply provides what each thread/warp needs until it cannot fit any more threads on the SM
How does a programmer know how to allocate resources?
Performance tuning: for each GPU architecture, test out different resource allocations and assign the best one
GPU architectures tend to keep the ratio of resources per warp context/per SM constant within a GPU generation
Requires retuning every time a program is ported to a different architecture
Auto-tuning tries to automate resource allocation (with mixed results)
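One concrete (and hedged) example of partially automating this tuning: the CUDA runtime's occupancy API can suggest a block size for whatever GPU the program is running on, based on the kernel's register and shared-memory usage, instead of hard-coding a value tuned for one architecture (the helper name launch_tuned and the reuse of the scale kernel are assumptions for illustration).

    #include <cuda_runtime.h>

    void launch_tuned(float *d_x, float a, int n) {
        int min_grid = 0, block = 0;
        // Ask the runtime for a block size that maximizes occupancy on this GPU.
        cudaOccupancyMaxPotentialBlockSize(&min_grid, &block, scale, 0, 0);
        int grid = (n + block - 1) / block;   // enough blocks to cover all n elements
        scale<<<grid, block>>>(d_x, a, n);
    }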
Heterogeneous Computing
While GPUs can help with massive multithreading, they clearly have their own challenges
Reality: different types of compute require different types of hardware and systems
Today: heterogeneous computing reigns supreme
CPUs handle more traditional workloads
GPUs handle highly parallel programs and graphics
Other hardware accelerators are designed for very common tasks
We could just have separate chips for each…
… But today we put them all into a single system-on-chip (SoC)
Why a System-on-Chip?
There used to be separate chips for almost everything
Floating-point units (e.g., the Intel x87)
Caches
Memory and I/O controllers
Discrete modems and accelerators (if present)
A few fundamental changes have made it more desirable to combine these on a single chip
Smaller communication distances: faster latencies, higher bandwidth, and lower energy
Better use of the available transistors and chip area
CPUs integrated some, but not all, units (e.g., FPUs, caches, memory controllers) over the last few decades
1970: the first SoC (used by Pulsar for the first digital watch)
Mid-2000s: SoC development led to the smartphone revolution
Common System-on-Chip Components
Processor cores
Graphics processing units (GPUs)
Caches (L1/L2/L3 today)
Digital signal processors (DSPs): accelerators that perform signal processing operations for sensors and multimedia processing, often made up of vector extensions
Networking modems (e.g., WiFi, 4G LTE)
AI/ML accelerators (i.e., neural processing units)
On-chip interconnect
[Die photo: Samsung Exynos 9820 (2019). Source: AnandTech/Chip Rebel]
How Do We Use an SoC?
Each SoC can have a different set of components
Before: one fixed set of resources, then write software for them
Now: software informs the hardware design!
Start with basic structures (e.g., CPU, cache, GPU)
Analyze software to find most common operations/tasks
Define an SoC architecture (using basic structures, premade blocks known as IP cores, and custom-designed logic)
Optimize your software for your SoC
System design can be challenging
How do we manage and coordinate all of these components?
Burden often left on the systems programmer
Runtimes or APIs are commonly used by application developers
SoCs Have Helped Move Us to the Cloud
Traditional model
Compute everything locally
Worked great for small-data workloads of the past
Difficult to shrink the size of a computer (e.g., an SoC)
Today: data centers and cloud computing
Your computer sends a request across the network
Giant “farms” of computers perform a significant portion of the computation
Result is sent back to your computer
A key enabler of smartphones
These farms typically service billions of requests each second (think Google or Facebook)
Requires a highly-available, reasonably fast network
Cloud Computing vs. Data Centers
Data center
The company providing a service owns and maintains its own servers for the service (or pays someone to do so)
Machines are dedicated for that company
Can (but don’t always) run code natively
Cloud computing
The company providing a service runs the service on someone else's servers
Machines are shared across many companies and services
Typically use virtual machines (VMs) or containers to allow multiple services to run on a single server without having access to each other’s data, and to allow for job migration
Examples: Amazon AWS, Microsoft Azure, Google GCP
Running Programs in the Cloud
You wrote a program for OS G, but the cloud runs OS H
Virtual machines let you run your program inside OS H!
System virtual machines (i.e., full virtualization)
Hypervisor runs inside OS H (the host OS), provides an interface to emulate all of the hardware
OS G (the guest OS) runs inside the hypervisor, and thinks it is running directly on a machine (the one faked by the hypervisor)
Lots of overhead (e.g., nested 4-level page tables can require as many as 24 memory accesses; see the arithmetic sketch after this list)
Process virtual machines (i.e., managed runtime environments)
Create a platform-independent environment for programs
Examples: Java VM, .NET framework
Containers: one OS kernel can run multiple isolated user-space instances
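Where the "as many as 24 memory accesses" figure comes from, as a hedged back-of-the-envelope for a nested (two-dimensional) page walk with 4-level guest and 4-level host page tables: the guest walk touches 4 guest page-table entries plus the final guest-physical data address, and each of those 5 guest-physical addresses must itself be translated by a 4-step walk of the host page table before the reference can be made. Counting only page-table accesses (excluding the data access itself):

    \[
      \underbrace{(4+1)}_{\text{guest PTEs + data address}} \times
      \underbrace{(4+1)}_{\text{host walk + the reference itself}} - 1 = 24
      \ \text{memory accesses per TLB miss}
    \]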
Data Centers Require Significant Power
Globally, data centers consumed 3% of the world's total power in 2017, and produced 2% of global emissions
Projected to be as much as 20% by 2025
Need to be efficient but reliable
Redundant power feeds and infrastructure
Load varies from day to day, and minute to minute within a day: data centers need to be overprovisioned, and must adjust based on the current load
[Diagram: redundant data-center power delivery: A-side and B-side power feeds (utility and generator, each through an ATS), UPS units, transformers, RPPs, and circuit breakers feeding per-rack CDUs and redundant server power supplies]
Bringing Back Local Compute
Large amounts of data are sent to the cloud
What if our devices could be smart and process (some of the) data for us?
Internet of Things (IoT)
A very wide, distributed network of devices that can all talk with each other
Many IoT devices are simpler than smartphones (e.g., smart sensors) – designed to be deployed everywhere
Edge computing
Cloud computing + IoT model pushed almost all compute from a smartphone to data centers
Now we're pushing back, because the Internet can't scale as rapidly as data: bandwidth limited, energy hungry
Rethinking the Computer
Today’s computers are built off of assumptions made going back to the 1940s
Spatial/temporal locality
Instruction-based computation
Today's levels of abstraction
Applications and use cases have changed significantly Machine learning and data analytics
IoT and edge computing
Drones and autonomous vehicles
Precision medicine and bioinformatics
Mobile apps
Shouldn't our computers change as well?
Computer System Design is Diverging
Several types of systems are becoming popular
Graphics processing units (GPUs)
Mobile systems-on-chip (SoCs)
Data centers and cloud computing
Internet of things (IoT)/edge computing
A few promising designs may emerge in the future
Processing-in-memory (PIM)
Neuromorphic computing
Hardware Hasn’t Kept Up with the Times
[Diagram: a beefy processing engine (compute) connected to memory (DRAM) over a long, narrow memory channel]
Beefy processing engines (CPUs, GPUs, accelerators)
Large numbers of cores, high degrees of multithreading
Out-of-order execution in CPUs
Many low-power optimizations
Designed for infrequent memory accesses
Caches highly dependent on locality
Long, narrow off-chip memory channel to connect CPU with DRAM
While programs are becoming more data-centric, computer architectures remain compute-centric
The Cost of Data Movement in Modern CPUs
In terms of energy costs, data movement dominates compute
(Source: Dally, HiPEAC 2015)
DRAM responsible for 25–50% of a computer’s total energy
Off-chip memory channel: ~30% of DRAM energy
Data movement is a major bottleneck in modern systems
High energy spent on off-chip communication
Pin-limited bandwidth
High latency
Identified as the von Neumann bottleneck by John Backus in 1977
Can We Avoid Moving Data Around?
Processing-in-memory (PIM)
Add some compute capability to memory
No need to move data across memory channel
[Diagram: the processing engine connected over the long, narrow memory channel to memory (DRAM) that now also includes PIM compute]
PIM was proposed as early as 1970
New innovations in memory design have finally brought PIM close to a reality
Kind of like an SoC: add new components/functionality, but this time near memory
Two Variants of PIM
Variant 1: Processing-Near-Memory
Memory layers stacked in 3D, connected with high bandwidth using Through-Silicon Vias (TSVs)
We can add small processing engines to the Logic Layer or on nearby chips
Variant 2: Processing-Using-Memory
Using new memory technologies, the memory arrays themselves perform computation (e.g., producing A NOR B directly inside the array), providing high-bandwidth internal compute
[Diagrams: (1) a CPU connected over the memory channel to a stack of memory layers on top of a logic layer; (2) a memory array computing A NOR B (operands A and B, result C) internally]
Great… How Does This Affect Systems?
Once PIM hardware exists, programmers must be able to use it
Tough sell: force them to learn a new programming model
Path to broad adoption: adapt PIM to existing models
Unfortunately, PIM logic can’t easily make use of a lot of systems essentials
Support for multithreading: OS needs to be exposed to PIM
Virtual memory: expensive for PIM to access TLBs in the CPU
Coherence/consistency: these can introduce a lot of traffic between the CPU and PIM
How do compilers generate code for PIM logic?
What about handling branches?
Active research area: solving these challenges in the coming years
Motivating Neuromorphic Computing
Artificial neural networks are the hot topic of computing right now
They form implicit relationships between inputs and outputs
Can learn and represent very powerful models
However, ANNs are not accurate representations of our brain
What can our brain do?
We can track things moving in real time as we see them
We can learn with uncertainty (ANNs need to experience everything)
And yet our brain runs at only a few Hz (vs. GHz for ANN accelerators)
Many applications can benefit from designing computers that look more like our brain
Neuromorphic Architectures
Several chips exist: IBM TrueNorth, Intel Loihi
How do you use this?
Replace CPUs in existing systems? Add as accelerators?
IBM made its own object-oriented language (Corelet)
Summary
Computing is looking more and more heterogeneous
Many different types of hardware
Many different types of use cases
There may be more radical hardware changes ahead
Keeping up with significant shifts in applications
We need to think of what systems support will look like after these changes!
Does it mean that what you've learned in 213 is useless? No!
Most of the core ideas will still stick around for decades
New systems are still built on the same underlying principles
It’s an exciting time to be working in systems!