CUDA C++ Programming Guide
Design Guide
PG-02829-001_v11.6 | March 2022

Changes from Version 11.3
‣ Added Graph Memory Nodes.
‣ Formalized Asynchronous SIMT Programming Model.

Table of Contents
Chapter 1. Introduction
1.1. The Benefits of Using GPUs
1.2. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model
1.3. A Scalable Programming Model
1.4. Document Structure
Chapter 2. Programming Model
2.1. Kernels
2.2. Thread Hierarchy
2.3. Memory Hierarchy
2.4. Heterogeneous Programming
2.5. Asynchronous SIMT Programming Model
2.5.1. Asynchronous Operations
2.6. Compute Capability
Chapter 3. Programming Interface
3.1. Compilation with NVCC
3.1.1. Compilation Workflow
3.1.1.1. Offline Compilation
3.1.1.2. Just-in-Time Compilation
3.1.2. Binary Compatibility
3.1.3. PTX Compatibility
3.1.4. Application Compatibility
3.1.5. C++ Compatibility
3.1.6. 64-Bit Compatibility
3.2. CUDA Runtime
3.2.1. Initialization
3.2.2. Device Memory
3.2.3. Device Memory L2 Access Management
3.2.3.1. L2 cache Set-Aside for Persisting Accesses
3.2.3.2. L2 Policy for Persisting Accesses
3.2.3.3. L2 Access Properties
3.2.3.4. L2 Persistence Example
3.2.3.5. Reset L2 Access to Normal
3.2.3.6. Manage Utilization of L2 set-aside cache
3.2.3.7. Query L2 cache Properties
3.2.3.8. Control L2 Cache Set-Aside Size for Persisting Memory Access
3.2.4. Shared Memory
3.2.5. Page-Locked Host Memory
3.2.5.1. Portable Memory
3.2.5.2. Write-Combining Memory
3.2.5.3. Mapped Memory
3.2.6. Asynchronous Concurrent Execution
3.2.6.1. Concurrent Execution between Host and Device
3.2.6.2. Concurrent Kernel Execution
3.2.6.3. Overlap of Data Transfer and Kernel Execution
3.2.6.4. Concurrent Data Transfers
3.2.6.5. Streams
3.2.6.6. CUDA Graphs
3.2.6.7. Events
3.2.6.8. Synchronous Calls
3.2.7. Multi-Device System
3.2.7.1. Device Enumeration
3.2.7.2. Device Selection
3.2.7.3. Stream and Event Behavior
3.2.7.4. Peer-to-Peer Memory Access
3.2.7.5. Peer-to-Peer Memory Copy
3.2.8. Unified Virtual Address Space
3.2.9. Interprocess Communication
3.2.10. Error Checking
3.2.11. Call Stack
3.2.12. Texture and Surface Memory
3.2.12.1. Texture Memory
3.2.12.2. Surface Memory
3.2.12.3. CUDA Arrays
3.2.12.4. Read/Write Coherency
3.2.13. Graphics Interoperability
3.2.13.1. OpenGL Interoperability
3.2.13.2. Direct3D Interoperability
3.2.13.3. SLI Interoperability
3.2.14. External Resource Interoperability
3.2.14.1. Vulkan Interoperability
3.2.14.2. OpenGL Interoperability
3.2.14.3. Direct3D 12 Interoperability
3.2.14.4. Direct3D 11 Interoperability
3.2.14.5. NVIDIA Software Communication Interface Interoperability (NVSCI)
3.2.15. CUDA User Objects
3.3. Versioning and Compatibility
3.4. Compute Modes
3.5. Mode Switches
3.6. Tesla Compute Cluster Mode for Windows
Chapter 4. Hardware Implementation
4.1. SIMT Architecture
4.2. Hardware Multithreading
Chapter 5. Performance Guidelines
5.1. Overall Performance Optimization Strategies
5.2. Maximize Utilization
5.2.1. Application Level
5.2.2. Device Level
5.2.3. Multiprocessor Level
5.2.3.1. Occupancy Calculator
5.3. Maximize Memory Throughput
5.3.1. Data Transfer between Host and Device
5.3.2. Device Memory Accesses
5.4. Maximize Instruction Throughput
5.4.1. Arithmetic Instructions
5.4.2. Control Flow Instructions
5.4.3. Synchronization Instruction
5.5. Minimize Memory Thrashing
Appendix A. CUDA-Enabled GPUs
Appendix B. C++ Language Extensions
B.1. Function Execution Space Specifiers
B.1.1. __global__
B.1.2. __device__
B.1.3. __host__
B.1.4. Undefined behavior
B.1.5. __noinline__ and __forceinline__
B.2. Variable Memory Space Specifiers
B.2.1. __device__
B.2.2. __constant__
B.2.3. __shared__
B.2.4. __managed__
B.2.5. __restrict__
B.3. Built-in Vector Types
B.3.1. char, short, int, long, longlong, float, double
B.3.2. dim3
B.4. Built-in Variables
B.4.1. gridDim
B.4.2. blockIdx
B.4.3. blockDim
B.4.4. threadIdx
B.4.5. warpSize
B.5. Memory Fence Functions
B.6. Synchronization Functions
B.7. Mathematical Functions
B.8. Texture Functions
B.8.1. Texture Object API
B.8.1.1. tex1Dfetch()
B.8.1.2. tex1D()
B.8.1.3. tex1DLod()
B.8.1.4. tex1DGrad()
B.8.1.5. tex2D()
B.8.1.6. tex2DLod()
B.8.1.7. tex2DGrad()
B.8.1.8. tex3D()
B.8.1.9. tex3DLod()
B.8.1.10. tex3DGrad()
B.8.1.11. tex1DLayered()
B.8.1.12. tex1DLayeredLod()
B.8.1.13. tex1DLayeredGrad()
B.8.1.14. tex2DLayered()
B.8.1.15. tex2DLayeredLod()
B.8.1.16. tex2DLayeredGrad()
B.8.1.17. texCubemap()
B.8.1.18. texCubemapLod()
B.8.1.19. texCubemapLayered()
B.8.1.20. texCubemapLayeredLod()
B.8.1.21. tex2Dgather()
B.8.2. Texture Reference API
B.8.2.1. tex1Dfetch()
B.8.2.2. tex1D()
B.8.2.3. tex1DLod()
