High Performance Computing
Course Notes
GPU and CUDA – I
Dr Ligang He
Computer Science, University of Warwick
GPU
– Graphics processing unit
– Contains a large number of ALUs
e.g., 2560 ALUs (stream processors) in the Nvidia GeForce GTX 1080
– Is a PCI-e peripheral device
PCI-e slot
Performance Trend
– A many-core GPU can be up to 100x more powerful (in raw arithmetic
throughput) than a multicore CPU
– Why is there such a performance gap?
Because of the differences in design
between the GPU and the CPU
Design of CPU
– The design objective of the CPU is to optimize the
performance of sequential code
– Has complicated control unit
– Obtains instructions from memory
– Interprets the instructions
– Figures out what data the instructions need and where
they are stored
– Issues signals to ask other functional units (ALUs) to run the
instructions
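The four control-unit steps listed above can be sketched as a toy interpreter loop. This is only an illustrative model, not how real hardware is built: the tuple instruction format, the register names (`r0` to `r3`) and the `alu` and `run` helpers are all invented for the example.

```python
# Toy model of the control-unit cycle. The instruction format
# (op, dst, src1, src2) and the register names are hypothetical.

def alu(op, a, b):
    """Functional unit: performs the arithmetic the control unit requests."""
    if op == "add":
        return a + b
    if op == "mul":
        return a * b
    raise ValueError(f"unknown operation: {op}")


def run(program, registers):
    """Execute the program one instruction at a time, in sequential order."""
    pc = 0  # program counter: index of the next instruction
    while pc < len(program):
        instruction = program[pc]                # 1. obtain the instruction from memory
        op, dst, src1, src2 = instruction        # 2. interpret (decode) it
        a, b = registers[src1], registers[src2]  # 3. locate the data it needs
        registers[dst] = alu(op, a, b)           # 4. signal the ALU to run it
        pc += 1
    return registers


program = [
    ("add", "r2", "r0", "r1"),  # r2 = r0 + r1
    ("mul", "r3", "r2", "r2"),  # r3 = r2 * r2
]
result = run(program, {"r0": 2, "r1": 3, "r2": 0, "r3": 0})
print(result["r3"])  # prints 25
```

A real CPU control unit overlaps and reorders these steps; the loop above deliberately shows only the simplest in-order case.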
– Complicated control unit enables
– instructions from a single thread to execute out of their
sequential order (single core) or in parallel (multicore)
– branch prediction
– data forwarding
– Has large cache to reduce the instruction and data
access latencies
– Powerful ALU
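How much a large cache reduces access latency can be estimated with the standard average-memory-access-time formula (hit time plus miss rate times miss penalty). The sketch below uses made-up cycle counts and miss rates purely for illustration; they are assumptions, not measurements of any real CPU.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles: hit_time + miss_rate * miss_penalty."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical numbers: the same 4-cycle hit time and 200-cycle miss penalty,
# but the larger cache is assumed to miss far less often.
small_cache = average_access_time(hit_time=4, miss_rate=0.20, miss_penalty=200)
large_cache = average_access_time(hit_time=4, miss_rate=0.02, miss_penalty=200)

print(f"small cache: {small_cache:.1f} cycles per access")
print(f"large cache: {large_cache:.1f} cycles per access")
```

Under these assumed values the average access drops from roughly 44 cycles to roughly 8, which is the sense in which a large cache serves a latency-oriented design.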
Design Objective of CPU
– Latency-oriented design
Large on-chip caches
Complicated control unit
Complicated arithmetic logic unit
These come at the cost of increased chip area
and power
– Applications with one or
very few threads achieve
higher performance on a CPU
NAND gate with transistors
Motivation of GPU Design
– The video game industry needs to perform a massive
number of floating-point calculations per video
frame
– This motivates GPU vendors to maximize the chip area
and power dedicated to floating-point
calculations
Each calculation is simple, so simple control
logic and simple ALUs suffice
Calculation matters more than caching, so the
caches are small and memory accesses are allowed
to have long latency
GPU Design
– GPU has a large number of ALUs on a chip to
increase the total throughput
The application is run with a large number of parallel
threads
While some threads are waiting for long-latency
operations (e.g., memory access), the GPU can
always find other threads to run due to the large
number of threads
Throughput-oriented design: maximize the total
throughput of a large number of threads, allowing
individual threads to take a longer time
– The GPU adopts a throughput-oriented design
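The latency-hiding argument above can be illustrated with a small simulation. This is a deliberately simplified sketch under stated assumptions: a single ALU, threads that each issue one 1-cycle operation and then wait a fixed `memory_latency` before they are ready again, and a scheduler that picks the first ready thread. The function name `simulate` and all parameter values are hypothetical.

```python
def simulate(num_threads, ops_per_thread, memory_latency):
    """Return (total_cycles, alu_utilization) for one ALU shared by many threads.

    Each operation occupies the ALU for 1 cycle, after which its thread
    waits memory_latency cycles before it can issue again. Counting stops
    when the last operation has been issued.
    """
    ready_at = [0] * num_threads               # cycle at which each thread may run next
    remaining = [ops_per_thread] * num_threads
    cycle = 0
    busy_cycles = 0
    while any(remaining):
        # Run the first thread that is ready and still has work, if any.
        for t in range(num_threads):
            if remaining[t] and ready_at[t] <= cycle:
                remaining[t] -= 1
                ready_at[t] = cycle + 1 + memory_latency
                busy_cycles += 1
                break
        cycle += 1
    return cycle, busy_cycles / cycle


few = simulate(num_threads=1, ops_per_thread=4, memory_latency=9)
many = simulate(num_threads=10, ops_per_thread=4, memory_latency=9)
print(few)   # 4 operations completed; the ALU is idle most of the time
print(many)  # 40 operations completed; the ALU is busy every cycle
```

In this model the single thread needs 31 cycles for 4 operations, while ten threads complete 40 operations in 40 cycles. No individual thread finishes sooner, but total throughput is roughly eight times higher, which is the trade-off a throughput-oriented design makes.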
GPU vs. CPU in Architecture