This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
How Have You Been Able to Gain Access to GPU Power?
There have been three ways:
1. Write a graphics display program (≥ 1985)

2. Write an application that looks like a graphics display program, but uses the fragment shader to do some per-node computation (≥ 2002)
3. Write in OpenCL or CUDA, which looks like C++ (≥ 2006)
Computer Graphics
mjb – March 16, 2022
gpu101.pptx
Why do we care about GPU Programming? A History of GPU vs. CPU Performance
Note that the top of the graph on the previous page fits here
The “Core-Score”. How can this be?
Why have GPUs Been Outpacing CPUs in Performance?
Due to the nature of graphics computations, GPU chips are customized to stream regular data. General CPU chips must be able to handle irregular data.
Another reason is that GPU chips do not need the significant amount of cache space that occupies much of the real estate on general-purpose CPU chips. The GPU die real estate can then be re-targeted to hold more cores and thus to produce more processing power.
Another reason is that general CPU chips contain on-chip logic to do branch prediction and out-of-order execution. This, too, takes up chip die space.
But, CPU chips can handle more general-purpose computing tasks.
So, which is better, a CPU or a GPU?
It depends on what you are trying to do!
Originally, GPU Devices were very task-specific
Today’s GPU Devices are much less task-specific
Consider the architecture of the NVIDIA Tesla V100s that we have in our DGX System
84 Streaming Multiprocessors (SMs) / chip
64 cores / SM
Wow! 5,376 cores / chip? Really?
What is a “Core” in the GPU Sense?
Look closely, and you’ll see that NVIDIA really calls these “CUDA Cores”
Look even more closely and you’ll see that these CUDA Cores have no control logic: they are pure compute units. (The surrounding SM has the control logic.)
Other vendors refer to these as “Lanes”. You might also think of them as 64-way SIMD.
A Mechanical Equivalent…
Figure labels: “Streaming Multiprocessor”, “CUDA Cores”, “Data” (image source: http://news.cision.com)

How Many Robots Do You See Here?
12? 72? Depends what you count as a “robot”.
A Spec Sheet Example
Figure labels: Streaming Multiprocessors, CUDA Cores per SM
NVIDIA’s Ampere Line
The Bottom Line is This
It is obvious that it is difficult to directly compare a CPU with a GPU. They are optimized to do different things.
So, let’s use the information about the architecture as a way to consider what CPUs should be good at and what GPUs should be good at
CPU:
• General purpose programming
• Multi-core under user control
• Irregular data structures
• Irregular flow control
GPU:
• Data parallel programming
• Little user control
• Regular data structures
• Regular flow control
The general term in the OpenCL world for an SM is a Compute Unit.
The general term in the OpenCL world for a CUDA Core is a Processing Element.
Compute Units and Processing Elements are Arranged in Grids
A GPU Platform can have one or more Devices. A GPU Device is organized as a grid of Compute Units.
Each Compute Unit is organized as a grid of Processing Elements.
So in NVIDIA terms, their new V100 GPU has 84 Compute Units, each of which has 64 Processing Elements, for a grand total of 5,376 Processing Elements.
Thinking ahead to CUDA and OpenCL…
How can GPUs execute General C Code Efficiently?
• Ask them to do what they do best. Unless you have a very intense Data Parallel application, don’t even think about using GPUs for computing.
• GPU programs expect you to not just have a few threads, but to have thousands of them!
• Each thread executes the same program (called the kernel), but operates on a different small piece of the overall data.
• Thus, you have many, many threads, all waking up at about the same time, all executing the same kernel program, all hoping to work on a small piece of the overall problem.
• CUDA and OpenCL have built-in functions so that each thread can figure out which thread number it is, and thus can figure out what part of the overall job it’s supposed to do.
• When a thread gets blocked somehow (a memory access, waiting for information from another thread, etc.), the processor switches to executing another thread.
So, the Trick is to Break your Problem into Many, Many Small Pieces
Particle Systems are a great example.
1. Have one thread per particle.
2. Put all of the initial parameters into an array in GPU memory.
3. Tell each thread what the current Time is.
4. Each thread then computes its particle’s position, color, etc. and writes it into arrays in GPU memory.
5. The CPU program then initiates OpenGL drawing of the information in those arrays.
Note: once set up, the data never leaves GPU memory!
Something New – Tensor Cores
Tensor Cores Accelerate Fused-Multiply-Add Arithmetic
What is Fused Multiply-Add?
Many scientific and engineering computations take the form:
D = A + (B*C);
A “normal” multiply-add would likely handle this as:
tmp = B*C; D = A + tmp;
A “fused” multiply-add does it all at once, that is, when the low-order bits of B*C are ready, they are immediately added into the low-order bits of A at the same time the higher-order bits of B*C are being multiplied.
Consider a Base 10 example: 789 + ( 123*456 )
     123
   x 456
   -----
     738
    615
   492
 +   789
 -------
  56,877
Can start adding the 9 the moment the 8 is produced!
Note: “Normal” A+(B*C) ≠ “FMA” A+(B*C)
There are Two Approaches to Combining CPU and GPU Programs
1. Combine both the CPU and GPU code in the same file. The CPU compiler compiles its part of that file. The GPU compiler compiles just its part of that file.
2. Have two separate programs: a .cpp and a .somethingelse that get compiled separately.
Advantages of Each
1. The CPU and GPU sections of the code know about each other’s intents. Also, they can share common structs, #define’s, etc.
2. It’s potentially cleaner to look at each section by itself. Also, the GPU code can be easily used in combination with other CPU programs.
Who are we Talking About Here?
1 = NVIDIA’s CUDA
2 = Khronos’s OpenCL
We will talk about each of these separately. Stay tuned!
Looking ahead: If threads all execute the same program, what happens on flow divergence?
The line “if( a > b )” creates a vector of Boolean values giving the results of the if-statement for each thread. This becomes a “mask”.
Then, the GPU executes all parts of the divergence: Do This;
During that execution, any time a value is about to be stored, the mask is consulted, and the store happens only if that thread’s entry in the mask has the right value.
if( a > b )
• GPUs were originally designed for the streaming-ness of computer graphics
• Now, GPUs are also used for the streaming-ness of data-parallel computing
• GPUs are better for some things. CPUs are better for others.
Dismantling a Graphics Card
This is an NVIDIA 1080 Ti card, one that died on us. It willed its body to education.
Removing the covers:
Removing the heat sink:
This transfers heat from the GPU Chip to the cooling fins

Removing the fan assembly reveals the board:
Power half of the board:
Figure labels: power input, power distribution
Graphics half of the board:
This one contains 7.2 billion transistors! (Thank you, Moore’s Law)
Underside of the board:
Underside of where the GPU chip attaches:
Here is a fun video of someone explaining the different parts of this same card:

Bonus: Looking at a More Complete GPU Spec Sheet