This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
Computer Graphics
gpu101.pptx
mjb – March 16, 2022
How Have You Been Able to Gain Access to GPU Power?
There have been three ways:
1. Write a graphics display program (≥ 1985)
2. Write an application that looks like a graphics display program, but uses the fragment shader to do some per-node computation (≥ 2002)
3. Write in OpenCL or CUDA, which looks like C++ (≥ 2006)
Why do we care about GPU Programming? A History of GPU vs. CPU Performance
Why do we care about GPU Programming? A History of GPU vs. CPU Performance
Note that the top of the graph on the previous page fits here
The “Core-Score”. How can this be?
Why have GPUs Been Outpacing CPUs in Performance?
Due to the nature of graphics computations, GPU chips are customized to stream regular data. General CPU chips must be able to handle irregular data.
Another reason is that GPU chips do not need the significant amount of cache space that occupies much of the real estate on general-purpose CPU chips. The GPU die real estate can then be re-targeted to hold more cores and thus to produce more processing power.
Why have GPUs Been Outpacing CPUs in Performance?
Another reason is that general CPU chips contain on-chip logic to do branch prediction and out-of-order execution. This, too, takes up chip die space.
But, CPU chips can handle more general-purpose computing tasks.
So, which is better, a CPU or a GPU?
It depends on what you are trying to do!
Originally, GPU Devices were very task-specific
Today’s GPU Devices are much less task-specific
Consider the architecture of the NVIDIA Tesla V100s that we have in our DGX System
84 Streaming Multiprocessors (SMs) / chip
64 cores / SM
Wow! 5,376 cores / chip? Really?
What is a “Core” in the GPU Sense?
Look closely, and you’ll see that NVIDIA really calls these “CUDA Cores”
Look even more closely and you’ll see that these CUDA Cores have no control logic – they are pure compute units. (The surrounding SM has the control logic.)
Other vendors refer to these as “Lanes”. You might also think of them as 64-way SIMD.
A Mechanical Equivalent…
[Figure: a mechanical equivalent, with labels “Streaming Multiprocessor”, “CUDA Cores”, and “Data”]
http://news.cision.com
How Many Robots Do You See Here?
12? 72? Depends on what you count as a “robot”.
Streaming Multiprocessors: A Spec Sheet Example
[Table: Streaming Multiprocessors and CUDA Cores per SM, by GPU model]
NVIDIA’s Ampere Line
The Bottom Line is This
It is obvious that it is difficult to directly compare a CPU with a GPU. They are optimized to do different things.
So, let’s use the information about the architecture as a way to consider what CPUs should be good at and what GPUs should be good at:

CPUs:
• General-purpose programming
• Multi-core under user control
• Irregular data structures
• Irregular flow control

GPUs:
• Data-parallel programming
• Little user control
• Regular data structures
• Regular flow control
The general term in the OpenCL world for an SM is a Compute Unit.
The general term in the OpenCL world for a CUDA Core is a Processing Element.
Compute Units and Processing Elements are Arranged in Grids
A GPU Platform can have one or more Devices. A GPU Device is organized as a grid of Compute Units.
Each Compute Unit is organized as a grid of Processing Elements.
So in NVIDIA terms, their V100 GPU has 84 Compute Units, each of which has 64 Processing Elements, for a grand total of 5,376 Processing Elements.
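The grid-of-grids totals multiply out directly. A trivial sketch in C (the function name is ours; the counts are the V100’s from above):

```c
/* Illustrative only: a Device is a grid of Compute Units, and each
   Compute Unit is a grid of Processing Elements, so the device-wide
   total is just the product of the two counts. */
int total_processing_elements(int compute_units, int pes_per_cu)
{
    return compute_units * pes_per_cu;
}
/* total_processing_elements(84, 64) -> 5376, the V100 total above. */
```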
[Figure: a Device shown as a grid of Compute Units, each of which is a grid of Processing Elements]
Thinking ahead to CUDA and OpenCL… How can GPUs execute General C Code Efficiently?
• Ask them to do what they do best. Unless you have a very intense Data Parallel application, don’t even think about using GPUs for computing.
• GPU programs expect you to not just have a few threads, but to have thousands of them!
• Each thread executes the same program (called the kernel), but operates on a different small piece of the overall data.
• Thus, you have many, many threads, all waking up at about the same time, all executing the same kernel program, all hoping to work on a small piece of the overall problem.
• CUDA and OpenCL have built-in functions so that each thread can figure out which thread number it is, and thus can figure out what part of the overall job it’s supposed to do.
• When a thread gets blocked somehow (a memory access, waiting for information from another thread, etc.), the processor switches to executing another thread to work on.
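The thread-numbering idea above can be sketched in plain C (a hypothetical stand-in for a kernel: the function names are made up, but the index arithmetic mirrors CUDA’s blockIdx.x * blockDim.x + threadIdx.x idiom):

```c
/* Each "thread" computes its own global index from its position in the
   launch, mirroring CUDA's blockIdx.x * blockDim.x + threadIdx.x. */
int global_index(int block_idx, int block_dim, int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}

/* The kernel-like body: one thread touches one element of the data.
   The guard matters because the total thread count is usually rounded
   up past the data size. */
void kernel_like(float *data, int n, int block_idx, int block_dim, int thread_idx)
{
    int gid = global_index(block_idx, block_dim, thread_idx);
    if (gid < n)
        data[gid] *= 2.0f;
}
```

On a real GPU, thousands of these bodies run concurrently, and the indices come from the CUDA or OpenCL built-ins rather than from parameters.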
So, the Trick is to Break your Problem into Many, Many Small Pieces
Particle Systems are a great example.
1. Have one thread per each particle.
2. Put all of the initial parameters into an array in GPU memory.
3. Tell each thread what the current Time is.
4. Each thread then computes its particle’s position, color, etc. and writes it into arrays in GPU memory.
5. The CPU program then initiates OpenGL drawing of the information in those arrays.
Note: once set up, the data never leaves GPU memory!
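A minimal sketch of step 4 in ordinary C (the names and the constant-velocity motion model are illustrative, not the author’s actual code):

```c
/* The per-particle work a single GPU thread would do.
   Constant-velocity motion keeps the example simple. */
typedef struct { float x0, vx; } Particle;

float particle_x(const Particle *p, float t)
{
    return p->x0 + p->vx * t;
}

/* On the GPU this loop disappears: thread i handles particle i,
   reading and writing arrays that stay in GPU memory. */
void update_all(const Particle *ps, float *xs, int n, float t)
{
    for (int i = 0; i < n; ++i)
        xs[i] = particle_x(&ps[i], t);
}
```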
Something New – Tensor Cores
Tensor Cores Accelerate Fused Multiply-Add Arithmetic
What is Fused Multiply-Add?
Many scientific and engineering computations take the form:
D = A + (B*C);
A “normal” multiply-add would likely handle this as:
tmp = B*C; D = A + tmp;
A “fused” multiply-add does it all at once, that is, when the low-order bits of B*C are ready, they are immediately added into the low-order bits of A at the same time the higher-order bits of B*C are being multiplied.
Consider a base-10 example: 789 + ( 123 × 456 )

      123
    × 456
    -----
      738     ← 123 × 6
     615      ← 123 × 5, shifted
    492       ← 123 × 4, shifted
   ------
   56,088
  +   789
   ------
   56,877

Can start adding the 9 the moment the 8 is produced!
Note: “Normal” A+(B*C) ≠ “FMA” A+(B*C)
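One way to see the rounding difference the note refers to: emulate a fused multiply-add in C by doing the float product and sum in double precision, so only the final result is rounded (an illustrative sketch; real FMA hardware does this in a single instruction):

```c
/* Two roundings: the product is rounded to float, then the sum is too. */
float normal_madf(float a, float b, float c)
{
    float tmp = b * c;   /* rounded once here...      */
    return a + tmp;      /* ...and a second time here */
}

/* One rounding: a double holds the exact float product and sum,
   so only the final cast back to float rounds. */
float fused_madf(float a, float b, float c)
{
    return (float)((double)b * (double)c + (double)a);
}
```

With a = -(b*c), the twice-rounded version returns exactly 0, while the fused version recovers the low-order bits the first rounding threw away.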
There are Two Approaches to Combining CPU and GPU Programs
1. Combine both the CPU and GPU code in the same file. The CPU compiler compiles its part of that file. The GPU compiler compiles just its part of that file.
2. Have two separate programs: a .cpp and a .somethingelse that get compiled separately.
Advantages of Each
1. The CPU and GPU sections of the code know about each other’s intent. Also, they can share common structs, #define’s, etc.
2. It’s potentially cleaner to look at each section by itself. Also, the GPU code can be easily used in combination with other CPU programs.
Who are we Talking About Here?
1 = NVIDIA’s CUDA
2 = Khronos’s OpenCL
We will talk about each of these separately – stay tuned!
Looking ahead: If threads all execute the same program, what happens on flow divergence?

    if( a > b )
        Do This;

The line “if( a > b )” creates a vector of Boolean values giving the result of the if-statement for each thread. This becomes a “mask”.
Then, the GPU executes all parts of the divergence. During that execution, anytime a value wants to be stored, the mask is consulted, and the store only happens if that thread’s location in the mask holds the right value.
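The masking idea can be simulated in plain C (a sketch: LANES stands in for the threads in one SIMD group, and the helper name is made up):

```c
#define LANES 4   /* stand-in for the threads in one SIMD group */

/* Sketch of predication: every lane executes both sides of the branch,
   but each store is gated by the mask, so the net effect is the same
   as a per-lane if/else. */
void masked_if(const int *a, const int *b, int *out)
{
    int mask[LANES];
    for (int i = 0; i < LANES; ++i)
        mask[i] = a[i] > b[i];       /* the comparison builds the mask  */

    for (int i = 0; i < LANES; ++i)  /* "then" side runs on all lanes...*/
        if (mask[i])                 /* ...but only masked lanes store  */
            out[i] = a[i];

    for (int i = 0; i < LANES; ++i)  /* "else" side, mask inverted      */
        if (!mask[i])
            out[i] = b[i];
}
```

Each lane ends up with the result of its own branch, exactly as if it had run alone; the cost is that every lane executes both sides.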
• GPUs were originally designed for the streaming-ness of computer graphics
• Now, GPUs are also used for the streaming-ness of data-parallel computing
• GPUs are better for some things. CPUs are better for others.
Dismantling a Graphics Card
This is an NVIDIA 1080 Ti card – one that died on us. It willed its body to education.
Dismantling a Graphics Card
Removing the covers:
Dismantling a Graphics Card
Removing the heat sink:
This transfers heat from the GPU Chip to the cooling fins
Dismantling a Graphics Card
Removing the fan assembly reveals the board:
Dismantling a Graphics Card
Power half of the board:
[Figure labels: power input, power distribution]
Dismantling a Graphics Card
Graphics half of the board:
This one contains 7.2 billion transistors! (Thank you, Moore’s Law)
Dismantling a Graphics Card
Underside of the board:
Dismantling a Graphics Card
Underside of where the GPU chip attaches:
Here is a fun video of someone explaining the different parts of this same card:
Bonus — Looking at a More Complete GPU Spec Sheet