AM 148 Chapter 2: CPU vs GPU, and HIP
Steven I. Reeves, PhD
1 CPU vs GPU
As the title of this class indicates, we will be using graphics processing units (GPUs) to do parallel computing. In industry there are two main vendors of compute-capable GPUs: AMD and Nvidia. In AMS 147 and previous computing classes, you will most likely have executed code only on the central processing unit (CPU).
Modern CPUs have many cores and support hyper-threading, the ability to handle two or more processes with one core. The majority of even laptop-level CPUs have two or more cores. So the types of parallel processing discussed in the previous chapter are already being used within basic applications and the operating system.
In scientific applications both CPUs and GPUs are used for many large scale programs. However, the majority of modern data centers and supercomputers are GPU based. Key examples are the data centers owned by Meta (formerly Facebook) and Oak Ridge National Laboratory's Frontier; both Meta's data centers and Frontier use AMD as their processor vendor. Other supercomputers, like Lawrence Berkeley National Laboratory's Perlmutter, are Nvidia based. The compute system that we use for this class uses Nvidia GPUs.
1.1 Central Processing Unit
This section gives a brief overview of the CPU from a user's perspective. Details regarding architecture and engineering specifics are outside the scope of this class.
The central processing unit has been the brain of computer systems since the early days of computing. This chip executes the operating system, manages memory, and writes to disc space when it needs to. Further, when a program is launched the CPU has direct access to off chip memory. A modern processor will have multiple cores, and each core can handle a number of threads. Figure 1 illustrates the compute hierarchy for CPUs.
Figure 1: Compute hierarchy in a CPU: a processor contains several cores (Core 1, Core 2, ..., Core m), and each core supports multiple threads.
A typical CPU supports hyper-threading, where each core can support two threads. Some exotic architectures allow for greater hyper-threading, e.g. the Intel Xeon Phi Knights Landing accelerator, which is no longer in production.
Notice that CPUs have on-chip memory as well as access to main memory. Cache is the "on chip" memory that allows the CPU to make very fast calculations. Generally, a program will have the CPU transfer data from main memory into cache. Data that is being reused should reside in cache, both for faster execution and to enable vectorization. Vectorization is a form of data parallelism, where an operation is applied to multiple entries of an array at once rather than one at a time. As CPUs have evolved, their ability to vectorize has increased. Lawrence Berkeley National Laboratory is home to NERSC (the National Energy Research Scientific Computing Center), which has two supercomputers in service: Perlmutter (the GPU machine mentioned earlier) and Cori, a CPU-based supercomputer that leverages the Intel "Knights Landing" many-core accelerator. Knights Landing behaves like a CPU and has 32 compute tiles, each containing 2 cores and 2 512-bit vector processing units. So each core can handle 8 double precision floating point operations per cycle. Most commercial CPUs can handle 4 64-bit (double precision) floating point operations per core, each cycle.
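To make vectorization concrete, here is a minimal sketch (not from the original notes; the function name is illustrative) of a loop that a vectorizing compiler can map onto wide SIMD instructions, e.g. eight doubles per 512-bit operation:

// Independent iterations with a uniform operation: ideal for auto-vectorization.
void scale_add(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   // can become packed multiply-add instructions
}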
Figure 2 is a simplified diagram of a single-core CPU. A multicore CPU will have more arithmetic logic units (ALUs) and other components.
The control unit is the internal portion of the CPU that coordinates the instruction and data flow between the CPU and other components of the computer, such as main memory.
The arithmetic logic unit is the internal circuitry of a CPU that performs all the arithmetic and logical operations on a computer. The ALU receives three types of inputs:
Control signal from the Control Unit
Data from memory for operation
Status information from previous operations
Figure 2: A basic diagram illustrating the function of a CPU, it4nextgen.com.
Processors in general have a clock speed, which defines how fast the processor handles instructions. Today's CPUs typically have clock speeds on the order of gigahertz; that is, a CPU performs billions of cycles per second. Clock speed is limited by a number of factors: the number of transistors, the energy supplied to the system, and the heat tolerance of the system. One major reason for the drive towards parallel computing was that increasing the clock speed of single-core processors was becoming too energy expensive. If we continued to increase clock speed to reach the performance of today's supercomputers, that super-CPU would need a dedicated nuclear reactor. We can estimate the power required via the following formula:

P = C V^2 f,

where P is the dynamic power consumed by the CPU, C is the capacitance, V is the voltage, and f is the clock frequency. Since the supply voltage generally has to rise along with the frequency, dynamic power grows faster than linearly in f, which is why pushing clock speeds ever higher became impractical.
There are two main ways to do parallelism on CPUs, through the use of application programming interfaces: Open Multi-Processing (OpenMP) or the Message Passing Interface (MPI). Essentially, OpenMP assigns threads to the cores of a CPU, and each thread can "see" the same memory. In MPI, by contrast, memory is segmented across many processes, and data must be explicitly passed from one core or processor to another. The most effective CPU-based high performance computing codes use a hybrid parallel programming model, combining MPI across the nodes of a compute cluster with OpenMP threading for the cores within each node. If you wish to learn more about how to take advantage of CPU-based systems and CPU parallelization, look into AMS 250: High Performance Computing.
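To give a flavor of the shared-memory model (a minimal sketch, not from the notes; the function name is made up), OpenMP can spread a loop across the cores of a CPU with a single pragma:

#include <omp.h>

// Each thread executes a chunk of the iterations; all threads "see" the same x and y.
void saxpy_openmp(int n, float a, const float *x, float *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

Such code is built with an OpenMP-enabled compiler, e.g. g++ -fopenmp.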
1.2 Graphics Processing Units
Graphics Processing Units (GPUs) are somewhat like an extreme version of Figure 1. These types of processors are many-core, and are used as accelerators. That is, compute intensive applications are offloaded to GPUs for efficiency.
Figure 3: A basic diagram illustrating the function of an Nvidia GPU, An Li et al. 2016.
Figure 3 illustrates the relationship between an Nvidia GPU and the host machine; the relationship is very similar for AMD GPUs. Within this course, we will refer to the CPU as the host, and GPUs as devices. A GPU has cores and multiprocessing managers. On Nvidia GPUs these managers are referred to as streaming multiprocessors (SMs), and on AMD hardware they are referred to as compute units (CUs). These multiprocessing managers can be thought of as foremen for a group of workers, in this case the cores. Like the host system, the GPU has memory. A GPU has memory at three different speeds, with three different uses:
The main device memory (global memory); this is the largest memory space on the GPU.
A scratchpad memory that allows communication between cores within a workgroup.
Per-thread private memory, usually mapped to registers.
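As a small sketch of how these three memory spaces appear in device code (illustrative only, using the HIP kernel syntax introduced later in this chapter; assumes 256-thread blocks and an array length that is a multiple of 256):

__global__ void memory_spaces(float *g_data)       // g_data resides in global (main GPU) memory
{
    __shared__ float scratch[256];                 // scratchpad shared by one block/workgroup
    int i = blockIdx.x*blockDim.x + threadIdx.x;   // i is thread-private, typically a register
    scratch[threadIdx.x] = g_data[i];              // stage data in the fast scratchpad
    __syncthreads();                               // make it visible to the whole block
    g_data[i] = scratch[threadIdx.x];              // write back to global memory
}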
1.2.1 Are GPUs faster than CPUs?
For inherently sequential tasks, they are not. However, due to the massively parallel nature of the architecture, data-parallel tasks can be done in unison. Figure 4 is a table showing the difference between a real CPU and a GPU (albeit fairly old architectures). A useful measurement of compute power is the FLOP/s (floating point operations per second), which depends on the clock speed, the number of cores, and the number of floating point operations each core can perform per cycle. As we can see in the table, the Intel chip has a greater clock speed than the Nvidia card. However, the Nvidia card has greater performance in FLOP/s than the CPU, by an order of magnitude. Something that is about as important as compute power is memory bandwidth, especially for problems involving big data or machine learning. We also see that the GPU has a larger memory bandwidth, by a factor of 4.
Figure 4: A table illustrating the performance difference between an Intel Xeon CPU and an Nvidia Tesla K20.
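As a back-of-the-envelope check (a sketch with hypothetical numbers, not values taken from the table above), peak floating point throughput can be estimated as

peak FLOP/s ≈ (number of cores) × (clock frequency) × (FLOPs per core per cycle).

For example, a hypothetical 8-core CPU at 3 GHz performing 4 double-precision FLOPs per core per cycle peaks at 8 × (3 × 10^9) × 4 = 9.6 × 10^10 FLOP/s, i.e. about 96 GFLOP/s.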
2 Programming On GPUs
Since there are two main manufacturers of compute-capable GPUs, there are naturally a few languages we can use to program them. The majority of these languages are C++ extensions. Let's briefly discuss the two main languages:
Compute Unified Device Architecture (CUDA)
Heterogeneous Interface for Portability (HIP)
2.1 Compute Unified Device Architecture: CUDA
CUDA is the programming language provided by the NVIDIA Corporation for computational tasks on the GPU. There are other flavors of CUDA that programmers can harness, such as pyCUDA (a Python interface to CUDA C++ backends) and CUDA Fortran. CUDA C/C++, CUDA Fortran, and pyCUDA all have compilers that are free to download.
The CUDA C/C++ compiler is NVCC. The Lux cluster (the compute resource for this class), which has 28 GPU compute nodes with 2 NVIDIA V100 GPUs each, will be our main hardware resource. If you have an NVIDIA GPU you can also run CUDA on your home machine.
2.2 Heterogeneous Interface for Portability (HIP)
HIP is the language solution from AMD. HIP is different from CUDA in that compiled HIP code can run on both AMD and NVIDIA hardware. HIP works by having compilation targets, AMD or NVIDIA. If the compilation target is NVIDIA, HIP uses nvcc to compile the HIP code into Nvidia PTX code. Conversely, if the compilation target is AMD, HIP uses LLVM compilation tools to generate AMD GCN code.
As CUDA was created before HIP, HIP is very similar to CUDA in syntax and use; this was intentional, to make it easy for users to transition their CUDA applications to HIP. Table 1 collects the corresponding terminology between the two systems.
Nvidia Terminology | AMD Terminology | Description
Streaming Multiprocessor | Compute Unit | One of many parallel vector processors in a GPU, each containing parallel ALUs. All wavefronts in a workgroup are assigned to the same SM/CU.
Kernel | Kernel | Functions launched to the GPU that are executed by the parallel workers on the GPU.
Warp | Wavefront | Collection of operations that execute in lockstep, running the same instructions on cores; think of them as threads. 32 threads to a warp on Nvidia hardware, 64 work items to a wavefront on AMD GCN hardware.
Thread Block | Workgroup | Group of warps/wavefronts that are on the GPU at the same time. They can synchronize together and communicate through shared/local memory.
Thread | Work Item | Individual lane in a warp/wavefront. GPU programming models can treat this as a separate thread of execution, though you do not necessarily get forward sub-wavefront progress.
Global Memory | Global Memory | DRAM memory accessible by the GPU (the main memory of the GPU).
Shared Memory | Local Memory | Communication memory that can be used to share data between warps/wavefronts within a thread block/workgroup.
Local Memory | Private Memory | Thread-private memory, usually mapped to registers.

Table 1: A comparison of terminology between the two systems.

2.3 Running HIP on your own machine

If you have a CUDA enabled graphics card (NVIDIA) or a select AMD compute enabled GPU you can develop HIP code. HIP can be easily installed with pre-built binary packages using the package manager on your Linux platform.

HIP code is developed either on the AMD ROCm platform using HIP-Clang, or on a CUDA platform using NVCC (the NVIDIA compiler).

2.3.1 AMD Platform

HIP-Clang is the compiler for compiling HIP programs on the AMD platform:
sudo apt-get -y install rocm-dkms
sudo apt-get install mesa-common-dev
sudo apt-get install clang
sudo apt-get install comgr
Or you can build from source:
HIP-Clang:
git clone -b roc-5.0.x https://github.com/RadeonOpenCompute/llvm-project.git
cd llvm-project
mkdir -p build && cd build
cmake -DCMAKE_INSTALL_PREFIX=/opt/rocm/llvm -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=1 \
      -DLLVM_TARGETS_TO_BUILD="AMDGPU;X86" -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt" ../llvm
make -j
sudo make install
The ROCm device library can be manually built as follows:
export PATH=/opt/rocm/llvm/bin:$PATH
git clone -b roc-5.0.x https://github.com/RadeonOpenCompute/ROCm-Device-Libs.git
cd ROCm-Device-Libs
mkdir -p build && cd build
CC=clang CXX=clang++ cmake -DLLVM_DIR=/opt/rocm/llvm -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_WERROR=1 \
      -DLLVM_ENABLE_ASSERTIONS=1 -DCMAKE_INSTALL_PREFIX=/opt/rocm ..
make -j
sudo make install
2.4 NVIDIA Platform
To run HIP code you will need HIP-nvcc. HIP-nvcc is the compiler for HIP program compilation on NVIDIA hardware. You will need to:
Add the ROCm package server to your system
Install the ”hip-runtime-nvidia” and ”hip-dev” packages. This will install the CUDA SDK and the HIP porting layer.
apt-get install hip-runtime-nvidia hip-dev
Default paths and environment variables:
– By default HIP finds the CUDA SDK in /usr/local/cuda. This can be overridden by setting the CUDA_PATH environment variable.
– HIP is installed into /opt/rocm/hip.
– Consider adding /opt/rocm/bin to your path to make it easier to use.
HIP is a C++ runtime API and kernel language that allows developers to create portable applications for AMD and NVIDIA GPUs from a single source code.
Key features include:
HIP when compiled on the NVIDIA Platform is a thin wrapper over CUDA, and has little or no performance impact over coding directly in CUDA.
HIP allows coding in single-source C++, including features such as templates, C++11 lambdas, classes, namespaces, and more (see the sketch after this list).
HIP allows developers to use the ”best” development environment and tools on each target platform.
Developers can specialize for the platform (CUDA or AMD) to tune for performance or handle tricky cases.
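As a small illustration of the single-source C++ features mentioned above (a sketch with hypothetical names, not code from the HIP repository), a kernel can be templated over its element type:

// Hypothetical templated kernel: the same source compiles for float, double, etc.
template <typename T>
__global__ void scale(int n, T a, T *x)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Host-side launches pick the instantiation:
//   scale<float><<<blocks, threads>>>(n, 2.0f, d_x);
//   scale<double><<<blocks, threads>>>(n, 2.0, d_xd);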
Check out github.com/ROCm-Developer-Tools/HIP for the HIP source code. HIP releases are typically named in correspondence with each ROCm release.
3.1 Writing your first HIP program
When writing in HIP C/C++, you divide your program into two parts: host code and device code. The host manages memory transfers and sets up the data, while the device code performs the bulk of the computation. Note that GPU I/O is not supported without great difficulty, so a "Hello World!" type program will not be our first HIP program. Instead, let us revisit our cx + y algorithm from Chapter 1. HIP is a C++ extension, so initialization and the main program are the same as in standard C++. We will call this algorithm SAXPY, for Single-precision A*X Plus Y.
Listing 1: My First HIP C++ program
#include <hip/hip_runtime.h>
#include <cstdio>
#include <cmath>
#include <algorithm>

__global__
void saxpy(int n, float a, float *x, float *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

int main(void) {
    int N = 256;
    float *x, *y, *d_x, *d_y;
    x = new float[N];
    y = new float[N];
    hipMalloc(&d_x, N*sizeof(float));
    hipMalloc(&d_y, N*sizeof(float));
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    hipMemcpy(d_x, x, N*sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(d_y, y, N*sizeof(float), hipMemcpyHostToDevice);
    // Perform SAXPY on the N elements
    saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
    hipMemcpy(y, d_y, N*sizeof(float), hipMemcpyDeviceToHost);
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = std::max(maxError, std::abs(y[i]-4.0f));
    printf("Max error: %f\n", maxError);
    hipFree(d_x); hipFree(d_y);
    delete[] x; delete[] y;
    return 0;
}
HIP and GPU code in general is divided into two sections: host code (code that is executed and run on the CPU) and device code (code that is launched by the CPU but run on the GPU).
3.1.1 Host Code
The main function declares two pairs of arrays, one for the host and one for the device. As a convention we denote device arrays with the prefix d_. The pointers x and y point to the host arrays, allocated in the typical C++ fashion, while d_x and d_y are allocated with the hipMalloc function from the HIP runtime API. As illustrated before, the host and device have separate memory spaces, both of which can be managed from host code. However, some GPUs also support device code that allocates device memory.
Listing 2: Host Allocation
float *x, *y, *d_x, *d_y;
x = new float[N];
y = new float[N];
hipMalloc(&d_x, N*sizeof(float));
hipMalloc(&d_y, N*sizeof(float));
Next the host initializes the data in x and y.
Listing 3: Host Initialization
for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
}
To initialize the device arrays we must transfer data from the host arrays, using hipMemcpy. This function works like the standard C++ memcpy function, except that it takes a fourth argument, which dictates the direction of the memory transfer. For the transfer from the host to the device, we use hipMemcpyHostToDevice.
Listing 4: Device Initialization
hipMemcpy(d_x, x, N*sizeof(float), hipMemcpyHostToDevice);
hipMemcpy(d_y, y, N*sizeof(float), hipMemcpyHostToDevice);
After completing the computation, we transfer the result back to the host:
Listing 5: Result transfer
hipMemcpy(y, d_y, N*sizeof(float), hipMemcpyDeviceToHost);
The function that the host calls to execute computation on the device is called a kernel. To launch the saxpy kernel use the statement:
Listing 6: Launching the Kernel
saxpy<<<(N+255)/256, 256>>>(N, 2.0, d_x, d_y);
In HIP there are two ways to launch kernels: using the triple chevron syntax, or using the hipLaunchKernelGGL macro. The triple chevrons dictate the execution configuration, that is, how many device threads execute the kernel in parallel. In HIP there is a hierarchy of threads which mimics how processors are grouped on the GPU. The first argument in the execution configuration specifies the number of thread blocks in the grid, and the second specifies the number of threads in each thread block. We will explain this in more detail later.
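For an arbitrary N, the configuration is typically computed explicitly (a minimal sketch; threadsPerBlock and blocksPerGrid are illustrative names not used in Listing 1):

int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;  // round up so every element is covered
saxpy<<<blocksPerGrid, threadsPerBlock>>>(N, 2.0f, d_x, d_y);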
We can also use the hipLaunchKernelGGL macro like this:
hipLaunchKernelGGL(saxpy,              /* compute kernel */
                   dim3((N+255)/256),  /* grid dimensions */
                   dim3(256),          /* block dimensions */
                   0,                  /* dynamic shared memory */
                   0,                  /* stream */
                   N, 2.0f, d_x, d_y); /* arguments to the compute kernel */
After the computation we must deallocate memory.
Listing 7: Deallocating memory
hipFree(d_x);
hipFree(d_y);
delete[] x;
delete[] y;
Here we use the C++ delete[] operator for the host arrays, and hipFree for the device arrays.
3.1.2 Device Code
Earlier we mentioned that the kernel is where the GPU computation happens.
Listing 8: The saxpy kernel
__global__
void saxpy(int n, float a, float *x, float *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
For device code there are declaration specifiers; for kernels we use the __global__ specifier. __global__ lets the compiler know that the code in the function is device code, but can be called from host code. There are other specifiers, notably __device__, which marks a function that runs on the device and can be called from within kernels.
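As a small illustration (a sketch, not part of the original listing; fma_op is a hypothetical helper name), a __device__ function can be called from the saxpy kernel:

__device__ float fma_op(float a, float x, float y)
{
    // Helper callable only from device code (kernels or other __device__ functions).
    return a*x + y;
}

__global__ void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = fma_op(a, x[i], y[i]);  // call the __device__ helper from the kernel
}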
In the serial case we loop over an index to do the operation on all entries. In HIP we create a local index based on the thread's "location": blockIdx.x is the block number within the grid, blockDim.x is the number of threads within each block, and threadIdx.x is the thread number within its block. Recall that we specified the block dimension and grid dimension in Listing 6 while launching the kernel. For example, with 256 threads per block, thread 5 of block 2 computes the global index i = 2*256 + 5 = 517. You can think of this as a mailing address; for example, UCSC's address is 1156 High St, Santa Cruz, CA. Here California is the GPU, Santa Cruz is the block, and 1156 High St is the analogous thread id.
The next step is to perform the computation only for threads whose index falls within the range of the array. For performance reasons, the number of threads launched will usually not match the dimensions of the array exactly, so this bounds check is necessary.
All these components together make a HIP C++ program.
Figure 5: 1024x1024 Mandelbrot set generated with max iter = 256
4 Parallel Mandelbrot Calculation
The Mandelbrot set is a fun example to demonstrate parallel computing. The Mandelbrot set has a well known fractal structure, and is interesting both mathematically and aesthetically because it has an infinitely recursive structure. You can zoom in to reveal swirls, spirals, snowflakes, and other interesting shapes if you are willing to do enough computation.
4.1 Serial Implementation
This program uses the escape time algorithm. For each pixel in the image, it starts with the x and y position, and then computes a recurrence.
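A minimal sketch of the escape-time iteration for a single pixel, using the standard z ← z² + c recurrence (illustrative only, not the author's implementation):

// Escape-time count for one pixel at complex coordinate (cx, cy).
int escape_time(float cx, float cy, int max_iter)
{
    float zx = 0.0f, zy = 0.0f;
    int iter = 0;
    while (zx*zx + zy*zy < 4.0f && iter < max_iter) {   // |z| > 2 means the point escapes
        float tmp = zx*zx - zy*zy + cx;                 // real part of z^2 + c
        zy = 2.0f*zx*zy + cy;                           // imaginary part of z^2 + c
        zx = tmp;
        iter++;
    }
    return iter;                                        // used to color the pixel
}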