=============================
User Guide for NVPTX Back-end
=============================

.. contents::
   :local:
   :depth: 3


Introduction
============

To support GPU programming, the NVPTX back-end supports a subset of LLVM IR
along with a defined set of conventions used to represent GPU programming
concepts. This document provides an overview of the general usage of the
back-end, including a description of the conventions used and the set of
accepted LLVM IR.

.. note::

   This document assumes a basic familiarity with CUDA and the PTX
   assembly language. Information about the CUDA Driver API and the PTX
   assembly language can be found in the `CUDA documentation
   <http://docs.nvidia.com/cuda/index.html>`_.


Conventions
===========

Marking Functions as Kernels
----------------------------

In PTX, there are two types of functions: *device functions*, which are only
callable by device code, and *kernel functions*, which are callable by host
code. By default, the back-end will emit device functions. Metadata is used to
declare a function as a kernel function. This metadata is attached to the
``nvvm.annotations`` named metadata object, and has the following format:

.. code-block:: llvm

   !0 = !{<function-ref>, !"kernel", i32 1}

The first parameter is a reference to the kernel function. The following
example shows a kernel function calling a device function in LLVM IR. The
function ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not.

.. code-block:: llvm

    define float @my_fmad(float %x, float %y, float %z) {
      %mul = fmul float %x, %y
      %add = fadd float %mul, %z
      ret float %add
    }

    define void @my_kernel(float* %ptr) {
      %val = load float, float* %ptr
      %ret = call float @my_fmad(float %val, float %val, float %val)
      store float %ret, float* %ptr
      ret void
    }

    !nvvm.annotations = !{!1}
    !1 = !{void (float*)* @my_kernel, !"kernel", i32 1}

When compiled, the PTX kernel functions are callable by host-side code.
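
For instance, assuming the module above is saved as ``kernel.ll``, a PTX file
can be produced with the ``llc`` tool (the SM version here is illustrative;
pick one appropriate for your device):

.. code-block:: text

    llc -march=nvptx64 -mcpu=sm_20 kernel.ll -o kernel.ptx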


Address Spaces
--------------

The NVPTX back-end uses the following address space mapping:

   ============= ======================
   Address Space Memory Space
   ============= ======================
   0             Generic
   1             Global
   2             Internal Use
   3             Shared
   4             Constant
   5             Local
   ============= ======================

Every global variable and pointer type is assigned to one of these address
spaces, with 0 being the default address space. Intrinsics are provided which
can be used to convert pointers between the generic and non-generic address
spaces.

As an example, the following IR will define an array ``@g`` that resides in
global device memory.

.. code-block:: llvm

    @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]

LLVM IR functions can read and write to this array, and host-side code can
copy data to it by name with the CUDA Driver API.
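
As a minimal sketch of the host side (assuming ``module`` holds a module
loaded with the Driver API as shown under `Executing PTX`_ below, with error
checking omitted), ``@g`` can be located by name and overwritten:

.. code-block:: c++

    CUdeviceptr gPtr;
    size_t gSize;

    // Look up the device address and size of @g by name
    cuModuleGetGlobal(&gPtr, &gSize, module, "g");

    // Copy new contents for the four i32 elements from the host
    int hostData[4] = {10, 11, 12, 13};
    cuMemcpyHtoD(gPtr, hostData, sizeof(hostData));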

Note that since address space 0 is the generic space, it is illegal to have
global variables in address space 0.  Address space 0 is the default address
space in LLVM, so the ``addrspace(N)`` annotation is *required* for global
variables.


NVPTX Intrinsics
================

Address Space Conversion
------------------------

'``llvm.nvvm.ptr.*.to.gen``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

These are overloaded intrinsics.  You can use these on any pointer type.

.. code-block:: llvm

    declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*)
    declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)
    declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*)
    declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*)

Overview:
"""""""""

The '``llvm.nvvm.ptr.*.to.gen``' intrinsics convert a pointer in a non-generic
address space to a generic address space pointer.

Semantics:
""""""""""

These intrinsics modify the pointer value to be a valid generic address space
pointer.
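
For illustration, here is a sketch of converting a pointer to a ``shared``
variable into the generic space (the variable and function names are
illustrative):

.. code-block:: llvm

    declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)

    @scratch = internal addrspace(3) global [16 x i8] zeroinitializer

    define void @zero_first_byte() {
      %raw = getelementptr inbounds [16 x i8], [16 x i8] addrspace(3)* @scratch, i32 0, i32 0
      ; Convert the shared-space pointer into the generic address space
      %p = call i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)* %raw)
      ; The generic pointer can be used by code that is agnostic to the
      ; underlying memory space
      store i8 0, i8* %p
      ret void
    }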


'``llvm.nvvm.ptr.gen.to.*``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

These are overloaded intrinsics.  You can use these on any pointer type.

.. code-block:: llvm

    declare i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8*)
    declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*)
    declare i8 addrspace(4)* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8*)
    declare i8 addrspace(5)* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8*)

Overview:
"""""""""

The '``llvm.nvvm.ptr.gen.to.*``' intrinsics convert a pointer in the generic
address space to a pointer in the target address space.  Note that these
intrinsics are only useful if the non-generic address space that the pointer
actually refers to is known.  It is not legal to use address space conversion
intrinsics to convert a pointer from one non-generic address space to another
non-generic address space.

Semantics:
""""""""""

These intrinsics modify the pointer value to be a valid pointer in the target
non-generic address space.
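
For illustration, here is a sketch that specializes a generic pointer which is
known, by some outside contract, to point into global memory (the function and
parameter names are illustrative):

.. code-block:: llvm

    declare i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8*)

    define i8 @load_known_global(i8* %gp) {
      ; Specialize the generic pointer to the global address space so the
      ; back-end can emit a global-space load
      %g = call i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8* %gp)
      %v = load i8, i8 addrspace(1)* %g
      ret i8 %v
    }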


Reading PTX Special Registers
-----------------------------

'``llvm.nvvm.read.ptx.sreg.*``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()

Overview:
"""""""""

The '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX
special registers, in particular the kernel launch parameters (thread and
block indices and dimensions).  These registers map in the following way to
CUDA builtins:

   ============= =====================================
   CUDA Builtin  PTX Special Register Intrinsic
   ============= =====================================
   ``threadIdx`` ``@llvm.nvvm.read.ptx.sreg.tid.*``
   ``blockIdx``  ``@llvm.nvvm.read.ptx.sreg.ctaid.*``
   ``blockDim``  ``@llvm.nvvm.read.ptx.sreg.ntid.*``
   ``gridDim``   ``@llvm.nvvm.read.ptx.sreg.nctaid.*``
   ============= =====================================
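
For example, the common CUDA expression ``blockIdx.x * blockDim.x +
threadIdx.x`` for a thread's global index maps onto these intrinsics as
follows (a small sketch; the function name is illustrative):

.. code-block:: llvm

    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()

    define i32 @global_thread_id() {
      %tid   = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()    ; threadIdx.x
      %ntid  = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()   ; blockDim.x
      %ctaid = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()  ; blockIdx.x
      %base  = mul i32 %ctaid, %ntid
      %gid   = add i32 %base, %tid
      ret i32 %gid
    }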


Barriers
--------

'``llvm.nvvm.barrier0``'
^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare void @llvm.nvvm.barrier0()

Overview:
"""""""""

The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
instruction, equivalent to the ``__syncthreads()`` call in CUDA.
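
As a minimal sketch of its use (assuming a 256-thread block; the names are
illustrative), each thread publishes a value to shared memory, and only after
the barrier is it safe to read a value written by another thread:

.. code-block:: llvm

    @tile = internal addrspace(3) global [256 x float] zeroinitializer

    declare void @llvm.nvvm.barrier0()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()

    define float @read_neighbor(float %v) {
      %tid  = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
      %slot = getelementptr inbounds [256 x float], [256 x float] addrspace(3)* @tile, i32 0, i32 %tid
      store float %v, float addrspace(3)* %slot
      ; No thread continues past this point until every thread in the
      ; block has executed the store above
      call void @llvm.nvvm.barrier0()
      %next  = add i32 %tid, 1
      %wrap  = and i32 %next, 255
      %nslot = getelementptr inbounds [256 x float], [256 x float] addrspace(3)* @tile, i32 0, i32 %wrap
      %r = load float, float addrspace(3)* %nslot
      ret float %r
    }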


Other Intrinsics
----------------

For the full set of NVPTX intrinsics, please see the
``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree.


Executing PTX
=============

The most common way to execute PTX assembly on a GPU device is to use the CUDA
Driver API. This API is a low-level interface to the GPU driver and allows for
JIT compilation of PTX code to native GPU machine code.

Initializing the Driver API:

.. code-block:: c++

    CUdevice device;
    CUcontext context;

    // Initialize the driver API
    cuInit(0);
    // Get a handle to the first compute device
    cuDeviceGet(&device, 0);
    // Create a compute device context
    cuCtxCreate(&context, 0, device);

JIT compiling a PTX string to a device binary:

.. code-block:: c++

    CUmodule module;
    CUfunction function;

    // JIT compile a null-terminated PTX string
    cuModuleLoadData(&module, (void*)PTXString);

    // Get a handle to the "myfunction" kernel function
    cuModuleGetFunction(&function, module, "myfunction");

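Launching the kernel once a handle is obtained (a sketch; the grid and block
sizes are illustrative, and ``devPtr`` is assumed to have been allocated
earlier with ``cuMemAlloc``):

.. code-block:: c++

    // Kernel arguments are passed as an array of pointers to each argument
    void *params[] = { &devPtr };

    // Launch 16 blocks of 256 threads on the default stream,
    // with no dynamic shared memory
    cuLaunchKernel(function,
                   16, 1, 1,    // grid dimensions
                   256, 1, 1,   // block dimensions
                   0, NULL,     // dynamic shared memory, stream
                   params, NULL);
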
For full examples of executing PTX assembly, please see the `CUDA Samples
<https://developer.nvidia.com/cuda-downloads>`_ distribution.
    277