=============================
User Guide for NVPTX Back-end
=============================

.. contents::
   :local:
   :depth: 3


Introduction
============

To support GPU programming, the NVPTX back-end supports a subset of LLVM IR
along with a defined set of conventions used to represent GPU programming
concepts. This document provides an overview of the general usage of the
back-end, including a description of the conventions used and the set of
accepted LLVM IR.

.. note::

   This document assumes a basic familiarity with CUDA and the PTX
   assembly language. Information about the CUDA Driver API and the PTX assembly
   language can be found in the `CUDA documentation
   <http://docs.nvidia.com/cuda/index.html>`_.


Conventions
===========

Marking Functions as Kernels
----------------------------

In PTX, there are two types of functions: *device functions*, which are only
callable by device code, and *kernel functions*, which are callable by host
code. By default, the back-end will emit device functions. Metadata is used to
declare a function as a kernel function. This metadata is attached to the
``nvvm.annotations`` named metadata object, and has the following format:

.. code-block:: llvm

  !0 = metadata !{<function-ref>, metadata !"kernel", i32 1}

The first parameter is a reference to the kernel function. The following
example shows a kernel function calling a device function in LLVM IR. The
function ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not.

.. code-block:: llvm

  define float @my_fmad(float %x, float %y, float %z) {
    %mul = fmul float %x, %y
    %add = fadd float %mul, %z
    ret float %add
  }

  define void @my_kernel(float* %ptr) {
    %val = load float* %ptr
    %ret = call float @my_fmad(float %val, float %val, float %val)
    store float %ret, float* %ptr
    ret void
  }

  !nvvm.annotations = !{!1}
  !1 = metadata !{void (float*)* @my_kernel, metadata !"kernel", i32 1}

When compiled, the PTX kernel functions are callable by host-side code.


Address Spaces
--------------

The NVPTX back-end uses the following address space mapping:

============= ======================
Address Space Memory Space
============= ======================
0             Generic
1             Global
2             Internal Use
3             Shared
4             Constant
5             Local
============= ======================

Every global variable and pointer type is assigned to one of these address
spaces, with 0 being the default address space. Intrinsics are provided which
can be used to convert pointers between the generic and non-generic address
spaces.

As an example, the following IR will define an array ``@g`` that resides in
global device memory.

.. code-block:: llvm

  @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]

LLVM IR functions can read and write to this array, and host-side code can
copy data to it by name with the CUDA Driver API.

Note that since address space 0 is the generic space, it is illegal to have
global variables in address space 0. Address space 0 is the default address
space in LLVM, so the ``addrspace(N)`` annotation is *required* for global
variables.
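For example, a device function can load an element of ``@g`` by first
converting its ``addrspace(1)`` pointer to a generic pointer, using one of the
address space conversion intrinsics described in the next section. The
following is only a sketch; the function name ``@read_g`` is illustrative, and
the element could just as well be loaded directly through the
``addrspace(1)`` pointer.

.. code-block:: llvm

  declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*)

  define i32 @read_g(i32 %i) {
    ; View @g as an i8 pointer in the global address space
    %g.i8 = bitcast [4 x i32] addrspace(1)* @g to i8 addrspace(1)*
    ; Convert the global pointer to a generic pointer
    %gen.i8 = call i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)* %g.i8)
    %gen = bitcast i8* %gen.i8 to i32*
    ; Index the array and load through the generic pointer
    %elem = getelementptr i32* %gen, i32 %i
    %val = load i32* %elem
    ret i32 %val
  }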


NVPTX Intrinsics
================

Address Space Conversion
------------------------

'``llvm.nvvm.ptr.*.to.gen``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

These are overloaded intrinsics. You can use these on any pointer type.

.. code-block:: llvm

  declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*)
  declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)
  declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*)
  declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*)

Overview:
"""""""""

The '``llvm.nvvm.ptr.*.to.gen``' intrinsics convert a pointer in a non-generic
address space to a generic address space pointer.

Semantics:
""""""""""

These intrinsics modify the pointer value to be a valid generic address space
pointer.


'``llvm.nvvm.ptr.gen.to.*``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

These are overloaded intrinsics. You can use these on any pointer type.

.. code-block:: llvm

  declare i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8*)
  declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*)
  declare i8 addrspace(4)* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8*)
  declare i8 addrspace(5)* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8*)

Overview:
"""""""""

The '``llvm.nvvm.ptr.gen.to.*``' intrinsics convert a pointer in the generic
address space to a pointer in the target address space. Note that these
intrinsics are only useful if the non-generic address space that the pointer
actually refers to is known. It is not legal to use address space conversion
intrinsics to convert a pointer from one non-generic address space to another
non-generic address space.

Semantics:
""""""""""

These intrinsics modify the pointer value to be a valid pointer in the target
non-generic address space.


Reading PTX Special Registers
-----------------------------

'``llvm.nvvm.read.ptx.sreg.*``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
  declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
  declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
  declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
  declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
  declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
  declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
  declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
  declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()

Overview:
"""""""""

The '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX
special registers, in particular the kernel launch bounds. These registers
map in the following way to CUDA builtins:

============= =====================================
CUDA Builtin  PTX Special Register Intrinsic
============= =====================================
``threadIdx`` ``@llvm.nvvm.read.ptx.sreg.tid.*``
``blockIdx``  ``@llvm.nvvm.read.ptx.sreg.ctaid.*``
``blockDim``  ``@llvm.nvvm.read.ptx.sreg.ntid.*``
``gridDim``   ``@llvm.nvvm.read.ptx.sreg.nctaid.*``
============= =====================================
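For example, the familiar CUDA expression
``blockIdx.x * blockDim.x + threadIdx.x`` for a thread's global index in the
x dimension can be computed from these intrinsics. The following is only a
sketch; the function name ``@global_thread_id_x`` is illustrative.

.. code-block:: llvm

  declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()

  define i32 @global_thread_id_x() {
    ; threadIdx.x
    %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    ; blockDim.x
    %ntid = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
    ; blockIdx.x
    %ctaid = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
    ; blockIdx.x * blockDim.x + threadIdx.x
    %tmp = mul i32 %ctaid, %ntid
    %id = add i32 %tmp, %tid
    ret i32 %id
  }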

Barriers
--------

'``llvm.nvvm.barrier0``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare void @llvm.nvvm.barrier0()

Overview:
"""""""""

The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
instruction, equivalent to the ``__syncthreads()`` call in CUDA.


Other Intrinsics
----------------

For the full set of NVPTX intrinsics, please see the
``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree.


Executing PTX
=============

The most common way to execute PTX assembly on a GPU device is to use the CUDA
Driver API. This API is a low-level interface to the GPU driver and allows for
JIT compilation of PTX code to native GPU machine code.

Initializing the Driver API:

.. code-block:: c++

  CUdevice device;
  CUcontext context;

  // Initialize the driver API
  cuInit(0);
  // Get a handle to the first compute device
  cuDeviceGet(&device, 0);
  // Create a compute device context
  cuCtxCreate(&context, 0, device);

JIT compiling a PTX string to a device binary:

.. code-block:: c++

  CUmodule module;
  CUfunction function;

  // JIT compile a null-terminated PTX string
  cuModuleLoadData(&module, (void*)PTXString);

  // Get a handle to the "myfunction" kernel function
  cuModuleGetFunction(&function, module, "myfunction");

For full examples of executing PTX assembly, please see the `CUDA Samples
<https://developer.nvidia.com/cuda-downloads>`_ distribution.