High Performance Computing
Course Notes

GPU and CUDA – I

Dr Ligang He

Computer Science, University of Warwick

GPU

– Graphics processing unit

– Contains a large number of ALUs

For example, 2560 ALUs (stream processors) in the Nvidia
GeForce GTX 1080

– Is a PCI-e peripheral device
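
To give a feel for what a large number of ALUs means for raw arithmetic capacity, here is a rough back-of-the-envelope sketch. The ALU count is the one quoted on the slide; the 1.6 GHz clock and the two floating-point operations per ALU per cycle (one fused multiply-add) are illustrative assumptions, not figures from these notes, and the function name `peak_gflops` is made up for this example.

```python
def peak_gflops(num_alus: int, clock_ghz: float, flops_per_cycle: int = 2) -> float:
    """Theoretical peak in GFLOP/s: ALUs x clock (GHz) x FLOPs per ALU per cycle."""
    return num_alus * clock_ghz * flops_per_cycle


# 2560 ALUs comes from the slide; the 1.6 GHz clock is an assumed value.
gpu_peak = peak_gflops(num_alus=2560, clock_ghz=1.6)
print(f"Estimated peak: {gpu_peak:.0f} GFLOP/s")
```

This is only an upper bound: real programs rarely keep every ALU busy on every cycle.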

PCI-e slot

Performance Trend

– A many-core GPU can be up to 100x more powerful than a
multicore CPU on highly parallel, arithmetic-heavy workloads

– Why is there such a performance gap?

Because of the differences in design
between the GPU and the CPU

Design of CPU

– The design objective of the CPU is to optimize the
performance of sequential code

– Has a complicated control unit

– Obtains instructions from memory

– Interprets the instructions

– Figures out what data the instructions need and where
they are stored

– Issues signals asking other functional units (ALUs) to run the
instructions
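
The four control-unit steps above can be sketched as a toy interpreter loop. This is a deliberately simplified illustration, not a model of any real CPU: the instruction format `(op, dst, src1, src2)`, the register names and the `run` function are all hypothetical.

```python
def run(program, registers):
    """Toy control unit: fetch, decode, locate operands, dispatch to an ALU."""
    # The "ALU" is just a table of arithmetic operations.
    alu = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    pc = 0  # program counter
    while pc < len(program):
        instruction = program[pc]                # obtain the instruction from memory
        op, dst, src1, src2 = instruction        # interpret (decode) the instruction
        a, b = registers[src1], registers[src2]  # work out where the data is stored
        registers[dst] = alu[op](a, b)           # ask the ALU to run the instruction
        pc += 1
    return registers


regs = run(
    [("add", "r2", "r0", "r1"), ("mul", "r3", "r2", "r2")],
    {"r0": 2, "r1": 3, "r2": 0, "r3": 0},
)
print(regs["r2"], regs["r3"])  # prints "5 25"
```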

– Complicated control unit enables

– instructions from a single thread to execute out of their
sequential order or in parallel (several instructions issued per cycle)

– branch prediction

– data forwarding

– Has large caches to reduce instruction and data
access latencies

– Has powerful ALUs
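
To see why a large cache reduces access latency, a common textbook estimate is the average memory access time: hit time plus miss rate times miss penalty. The sketch below applies that formula. The cycle counts and miss rates are made-up illustrative values, not measurements from these notes.

```python
def average_access_time(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time (cycles) = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical numbers: 4-cycle cache hit, 200-cycle trip to main memory.
large_cache = average_access_time(hit_time=4, miss_rate=0.02, miss_penalty=200)
small_cache = average_access_time(hit_time=4, miss_rate=0.20, miss_penalty=200)
print(f"large cache: {large_cache:.0f} cycles, small cache: {small_cache:.0f} cycles")
```

Under these assumed numbers, the lower miss rate of the larger cache cuts the average access time several times over, which is the effect a latency-oriented design is paying chip area for.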

Design Objective of CPU

– Latency-oriented design

Large on-chip caches

Complicated control unit

Complicated arithmetic logic unit

These come at the cost of increased chip area
and power

– Applications with one or
very few threads achieve
higher performance on a CPU

NAND gate with transistors

Motivation of GPU Design

– The video game industry needs to perform a massive
number of floating-point calculations per video
frame

– This motivates GPU vendors to maximize the chip area
and power dedicated to floating-point
calculations

Each calculation is simple, so simple control
logic and simple ALUs are enough

Calculation matters more than caching, so the
cache is small and memory accesses are allowed to
have long latency

GPU Design

– GPU has a large number of ALUs on a chip to
increase the total throughput

– The application is run with a large number of parallel
threads

– While some threads are waiting for long-latency
operations (e.g., memory access), the GPU can
always find other threads to run due to the large
number of threads

– Throughput-oriented design: maximize the total
throughput of a large number of threads, allowing
individual threads to take a longer time

– GPU adopts the throughput-oriented design
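
The latency-hiding argument above can be made concrete with a very simple model. It is a sketch under strong assumptions that are not in the slides: a single ALU, threads that alternate a fixed burst of computation with a fixed memory wait, and switching between threads at no cost. The function name and cycle counts are hypothetical.

```python
def alu_utilization(num_threads: int, compute_cycles: int, memory_latency: int) -> float:
    """Fraction of time the ALU is busy when threads alternate compute and memory waits.

    Each thread needs the ALU for `compute_cycles`, then waits `memory_latency`
    cycles. While one thread waits, the others can use the ALU.
    """
    busy = num_threads * compute_cycles
    return min(1.0, busy / (compute_cycles + memory_latency))


# Hypothetical workload: 10 cycles of arithmetic per 400-cycle memory access.
for n in (1, 8, 41):
    print(n, "threads ->", f"{alu_utilization(n, 10, 400):.0%}", "ALU utilization")
```

In this model one thread leaves the ALU idle almost all the time, while 41 threads keep it fully busy even though each individual thread still waits 400 cycles per memory access. That is the trade the throughput-oriented design makes.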

GPU vs. CPU in Architecture