=============================
User Guide for NVPTX Back-end
=============================

.. contents::
   :local:
   :depth: 3


Introduction
============

To support GPU programming, the NVPTX back-end supports a subset of LLVM IR
along with a defined set of conventions used to represent GPU programming
concepts. This document provides an overview of the general usage of the
back-end, including a description of the conventions used and the set of
accepted LLVM IR.

.. note::

   This document assumes a basic familiarity with CUDA and the PTX
   assembly language. Information about the CUDA Driver API and the PTX assembly
   language can be found in the `CUDA documentation
   <http://docs.nvidia.com/cuda/index.html>`_.



Conventions
===========

Marking Functions as Kernels
----------------------------

In PTX, there are two types of functions: *device functions*, which are only
callable by device code, and *kernel functions*, which are callable by host
code. By default, the back-end will emit device functions. Metadata is used to
declare a function as a kernel function. This metadata is attached to the
``nvvm.annotations`` named metadata object, and has the following format:

.. code-block:: llvm

   !0 = metadata !{<function-ref>, metadata !"kernel", i32 1}

The first parameter is a reference to the kernel function. The following
example shows a kernel function calling a device function in LLVM IR. The
function ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not.

.. code-block:: llvm

    define float @my_fmad(float %x, float %y, float %z) {
      %mul = fmul float %x, %y
      %add = fadd float %mul, %z
      ret float %add
    }

    define void @my_kernel(float* %ptr) {
      %val = load float* %ptr
      %ret = call float @my_fmad(float %val, float %val, float %val)
      store float %ret, float* %ptr
      ret void
    }

    !nvvm.annotations = !{!1}
    !1 = metadata !{void (float*)* @my_kernel, metadata !"kernel", i32 1}

When compiled, the PTX kernel functions are callable by host-side code.


.. _address_spaces:

Address Spaces
--------------

The NVPTX back-end uses the following address space mapping:

   ============= ======================
   Address Space Memory Space
   ============= ======================
   0             Generic
   1             Global
   2             Internal Use
   3             Shared
   4             Constant
   5             Local
   ============= ======================

Every global variable and pointer type is assigned to one of these address
spaces, with 0 being the default address space. Intrinsics are provided which
can be used to convert pointers between the generic and non-generic address
spaces.

As an example, the following IR will define an array ``@g`` that resides in
global device memory.

.. code-block:: llvm

    @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]

LLVM IR functions can read and write to this array, and host-side code can
copy data to it by name with the CUDA Driver API.

Note that since address space 0 is the generic space, it is illegal to have
global variables in address space 0.  Address space 0 is the default address
space in LLVM, so the ``addrspace(N)`` annotation is *required* for global
variables.


Triples
-------

The NVPTX target uses the module triple to select between 32/64-bit code
generation and the driver-compiler interface to use. The triple architecture
can be one of ``nvptx`` (32-bit PTX) or ``nvptx64`` (64-bit PTX). The
operating system should be one of ``cuda`` or ``nvcl``, which determines the
interface used by the generated code to communicate with the driver.  Most
users will want to use ``cuda`` as the operating system, which makes the
generated PTX compatible with the CUDA Driver API.

Example: 32-bit PTX for CUDA Driver API: ``nvptx-nvidia-cuda``

Example: 64-bit PTX for CUDA Driver API: ``nvptx64-nvidia-cuda``
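
In LLVM IR, the triple is set with a module-level declaration, paired with the
matching data layout string (shown in the tutorial below). For example:

.. code-block:: llvm

    target triple = "nvptx64-nvidia-cuda"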


.. _nvptx_intrinsics:

NVPTX Intrinsics
================

Address Space Conversion
------------------------

'``llvm.nvvm.ptr.*.to.gen``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

These are overloaded intrinsics.  You can use these on any pointer types.

.. code-block:: llvm

    declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*)
    declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)
    declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*)
    declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*)

Overview:
"""""""""

The '``llvm.nvvm.ptr.*.to.gen``' intrinsics convert a pointer in a non-generic
address space to a generic address space pointer.

Semantics:
""""""""""

These intrinsics modify the pointer value to be a valid generic address space
pointer.
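
For example, a pointer into a shared memory buffer can be widened to a generic
pointer before use by code that operates only on generic pointers. The
following is a minimal sketch; the shared variable ``@scratch`` is
hypothetical:

.. code-block:: llvm

    @scratch = internal addrspace(3) global [64 x i32] undef

    declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)

    define void @example() {
      ; Address of the first element of the shared array
      %elem = getelementptr [64 x i32] addrspace(3)* @scratch, i32 0, i32 0
      %raw  = bitcast i32 addrspace(3)* %elem to i8 addrspace(3)*
      ; Widen the shared pointer to a generic pointer
      %gen  = call i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)* %raw)
      %ptr  = bitcast i8* %gen to i32*
      ; Generic store; the hardware resolves the underlying memory space
      store i32 0, i32* %ptr
      ret void
    }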


'``llvm.nvvm.ptr.gen.to.*``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

These are overloaded intrinsics.  You can use these on any pointer types.

.. code-block:: llvm

    declare i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8*)
    declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*)
    declare i8 addrspace(4)* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8*)
    declare i8 addrspace(5)* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8*)

Overview:
"""""""""

The '``llvm.nvvm.ptr.gen.to.*``' intrinsics convert a pointer in the generic
address space to a pointer in the target address space.  Note that these
intrinsics are only useful if the target address space of the pointer is
known.  It is not legal to use address space conversion intrinsics to convert
a pointer from one non-generic address space to another non-generic address
space.

Semantics:
""""""""""

These intrinsics modify the pointer value to be a valid pointer in the target
non-generic address space.
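
As a sketch, a generic pointer that is known to reference shared memory can be
narrowed back to the shared address space, allowing the back-end to emit the
cheaper ``shared`` state space accesses (the function ``@narrow`` is
hypothetical):

.. code-block:: llvm

    declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*)

    define void @narrow(i8* %p) {
      ; %p must already point into shared memory for this to be valid
      %s  = call i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8* %p)
      %sp = bitcast i8 addrspace(3)* %s to i32 addrspace(3)*
      store i32 1, i32 addrspace(3)* %sp
      ret void
    }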


Reading PTX Special Registers
-----------------------------

'``llvm.nvvm.read.ptx.sreg.*``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
    declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
    declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()

Overview:
"""""""""

The '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX
special registers, in particular the kernel launch parameters (thread and
block indices and dimensions).  These registers map in the following way to
CUDA builtins:

   ============= =====================================
   CUDA Builtin  PTX Special Register Intrinsic
   ============= =====================================
   ``threadIdx`` ``@llvm.nvvm.read.ptx.sreg.tid.*``
   ``blockIdx``  ``@llvm.nvvm.read.ptx.sreg.ctaid.*``
   ``blockDim``  ``@llvm.nvvm.read.ptx.sreg.ntid.*``
   ``gridDim``   ``@llvm.nvvm.read.ptx.sreg.nctaid.*``
   ============= =====================================
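
For example, the common CUDA expression ``blockIdx.x * blockDim.x +
threadIdx.x`` for a global thread index can be written directly against these
intrinsics; a minimal sketch:

.. code-block:: llvm

    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
    declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()

    define i32 @global_thread_id() {
      %tid   = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
      %ntid  = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
      %ctaid = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
      ; blockIdx.x * blockDim.x + threadIdx.x
      %base = mul i32 %ctaid, %ntid
      %id   = add i32 %base, %tid
      ret i32 %id
    }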


Barriers
--------

'``llvm.nvvm.barrier0``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare void @llvm.nvvm.barrier0()

Overview:
"""""""""

The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
instruction, equivalent to the ``__syncthreads()`` call in CUDA.
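
The barrier is typically used to order shared memory accesses between the
threads of a CTA. The sketch below stages one value per thread into a
hypothetical shared array ``@sdata`` and waits until every thread has written
its slot:

.. code-block:: llvm

    @sdata = internal addrspace(3) global [256 x float] undef

    declare void @llvm.nvvm.barrier0()
    declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()

    define void @stage(float %v) {
      %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
      %ptr = getelementptr [256 x float] addrspace(3)* @sdata, i32 0, i32 %tid
      ; Each thread writes its own slot in shared memory
      store float %v, float addrspace(3)* %ptr, align 4
      ; Wait until all threads in the CTA have written
      call void @llvm.nvvm.barrier0()
      ret void
    }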


Other Intrinsics
----------------

For the full set of NVPTX intrinsics, please see the
``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree.


.. _libdevice:

Linking with Libdevice
======================

The CUDA Toolkit comes with an LLVM bitcode library called ``libdevice`` that
implements many common mathematical functions. This library can be used as a
high-performance math library for any compiler using the LLVM NVPTX target.
The library can be found under ``nvvm/libdevice/`` in the CUDA Toolkit, and
there is a separate version for each compute architecture.

For a list of all math functions implemented in libdevice, see
`libdevice Users Guide <http://docs.nvidia.com/cuda/libdevice-users-guide/index.html>`_.

To accommodate various math-related compiler flags that can affect code
generation of libdevice code, the library code depends on a special LLVM IR
pass (``NVVMReflect``) to handle conditional compilation within LLVM IR. This
pass looks for calls to the ``@__nvvm_reflect`` function and replaces them
with constants based on the defined reflection parameters. Such conditional
code often follows a pattern:

.. code-block:: c++

  float my_function(float a) {
    if (__nvvm_reflect("FASTMATH"))
      return my_function_fast(a);
    else
      return my_function_precise(a);
  }

The default value for all unspecified reflection parameters is zero.
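
At the LLVM IR level, such a call is an ordinary external call taking a
constant string. The following is a sketch of the pattern, assuming the common
libdevice prototype where ``@__nvvm_reflect`` takes an ``i8*`` string pointer
and returns ``i32``; the string constant ``@.str`` is illustrative, and the
whole call is folded away by the ``NVVMReflect`` pass before code generation:

.. code-block:: llvm

    @.str = private addrspace(4) constant [9 x i8] c"FASTMATH\00"

    declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*)
    declare i32 @__nvvm_reflect(i8*)

    define i32 @choose() {
      %cptr = getelementptr [9 x i8] addrspace(4)* @.str, i32 0, i32 0
      %gptr = call i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)* %cptr)
      ; Replaced by NVVMReflect with the value bound to FASTMATH
      ; (0 if the parameter is left unspecified)
      %val  = call i32 @__nvvm_reflect(i8* %gptr)
      ret i32 %val
    }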

The ``NVVMReflect`` pass should be executed early in the optimization
pipeline, immediately after the link stage. The ``internalize`` pass is also
recommended to remove unused math functions from the resulting PTX. For an
input IR module ``module.bc``, the following compilation flow is recommended:

1. Save the list of external functions in ``module.bc``
2. Link ``module.bc`` with ``libdevice.compute_XX.YY.bc``
3. Internalize all functions not in the list from (1)
4. Eliminate all unused internal functions
5. Run the ``NVVMReflect`` pass
6. Run the standard optimization pipeline

.. note::

  ``linkonce`` and ``linkonce_odr`` linkage types are not suitable for the
  libdevice functions. It is possible to link two IR modules that have been
  linked against libdevice using different reflection variables.

Since the ``NVVMReflect`` pass replaces conditionals with constants, it will
often leave behind dead code of the form:

.. code-block:: llvm

  entry:
    ..
    br i1 true, label %foo, label %bar
  foo:
    ..
  bar:
    ; Dead code
    ..

Therefore, it is recommended that ``NVVMReflect`` be executed early in the
optimization pipeline, before dead-code elimination.


Reflection Parameters
---------------------

The libdevice library currently uses the following reflection parameters to
control code generation:

==================== ======================================================
Flag                 Description
==================== ======================================================
``__CUDA_FTZ=[0,1]`` Use optimized code paths that flush subnormals to zero
==================== ======================================================


Invoking NVVMReflect
--------------------

To ensure that all dead code caused by the reflection pass is eliminated, it
is recommended that the reflection pass be executed early in the LLVM IR
optimization pipeline. The pass takes an optional mapping of reflection
parameter names to integer values. This mapping can be specified as either a
command-line option to ``opt`` or as an LLVM ``StringMap<int>`` object when
programmatically creating a pass pipeline.

With ``opt``:

.. code-block:: text

  # opt -nvvm-reflect -nvvm-reflect-list=<var>=<value>,<var>=<value> module.bc -o module.reflect.bc


With a programmatic pass pipeline:

.. code-block:: c++

  extern ModulePass *llvm::createNVVMReflectPass(const StringMap<int>& Mapping);

  StringMap<int> ReflectParams;
  ReflectParams["__CUDA_FTZ"] = 1;
  Passes.add(createNVVMReflectPass(ReflectParams));



Executing PTX
=============

The most common way to execute PTX assembly on a GPU device is to use the CUDA
Driver API. This API is a low-level interface to the GPU driver and allows for
JIT compilation of PTX code to native GPU machine code.

Initializing the Driver API:

.. code-block:: c++

    CUdevice device;
    CUcontext context;

    // Initialize the driver API
    cuInit(0);
    // Get a handle to the first compute device
    cuDeviceGet(&device, 0);
    // Create a compute device context
    cuCtxCreate(&context, 0, device);

JIT compiling a PTX string to a device binary:

.. code-block:: c++

    CUmodule module;
    CUfunction function;

    // JIT compile a null-terminated PTX string
    cuModuleLoadData(&module, (void*)PTXString);

    // Get a handle to the "myfunction" kernel function
    cuModuleGetFunction(&function, module, "myfunction");

For full examples of executing PTX assembly, please see the `CUDA Samples
<https://developer.nvidia.com/cuda-downloads>`_ distribution.


Common Issues
=============

ptxas complains of undefined function: __nvvm_reflect
-----------------------------------------------------

When linking with libdevice, the ``NVVMReflect`` pass must be used. See
:ref:`libdevice` for more information.


Tutorial: A Simple Compute Kernel
=================================

To start, let us take a look at a simple compute kernel written directly in
LLVM IR. The kernel implements vector addition, where each thread computes one
element of the output vector C from the input vectors A and B.  To make this
easier, we also assume that only a single CTA (thread block) will be launched,
and that it will be one dimensional.


The Kernel
----------

.. code-block:: llvm

  target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
  target triple = "nvptx64-nvidia-cuda"

  ; Intrinsic to read X component of thread ID
  declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind

  define void @kernel(float addrspace(1)* %A,
                      float addrspace(1)* %B,
                      float addrspace(1)* %C) {
  entry:
    ; What is my ID?
    %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind

    ; Compute pointers into A, B, and C
    %ptrA = getelementptr float addrspace(1)* %A, i32 %id
    %ptrB = getelementptr float addrspace(1)* %B, i32 %id
    %ptrC = getelementptr float addrspace(1)* %C, i32 %id

    ; Read A, B
    %valA = load float addrspace(1)* %ptrA, align 4
    %valB = load float addrspace(1)* %ptrB, align 4

    ; Compute C = A + B
    %valC = fadd float %valA, %valB

    ; Store back to C
    store float %valC, float addrspace(1)* %ptrC, align 4

    ret void
  }

  !nvvm.annotations = !{!0}
  !0 = metadata !{void (float addrspace(1)*,
                        float addrspace(1)*,
                        float addrspace(1)*)* @kernel, metadata !"kernel", i32 1}


We can use the LLVM ``llc`` tool to directly run the NVPTX code generator:

.. code-block:: text

  # llc -mcpu=sm_20 kernel.ll -o kernel.ptx


.. note::

  If you want to generate 32-bit code, change ``p:64:64:64`` to ``p:32:32:32``
  in the module data layout string and use ``nvptx-nvidia-cuda`` as the
  target triple.


The output we get from ``llc`` (as of LLVM 3.4):

.. code-block:: text

  //
  // Generated by LLVM NVPTX Back-End
  //

  .version 3.1
  .target sm_20
  .address_size 64

    // .globl kernel
                                          // @kernel
  .visible .entry kernel(
    .param .u64 kernel_param_0,
    .param .u64 kernel_param_1,
    .param .u64 kernel_param_2
  )
  {
    .reg .f32   %f<4>;
    .reg .s32   %r<2>;
    .reg .s64   %rl<8>;

  // BB#0:                                // %entry
    ld.param.u64    %rl1, [kernel_param_0];
    mov.u32         %r1, %tid.x;
    mul.wide.s32    %rl2, %r1, 4;
    add.s64         %rl3, %rl1, %rl2;
    ld.param.u64    %rl4, [kernel_param_1];
    add.s64         %rl5, %rl4, %rl2;
    ld.param.u64    %rl6, [kernel_param_2];
    add.s64         %rl7, %rl6, %rl2;
    ld.global.f32   %f1, [%rl3];
    ld.global.f32   %f2, [%rl5];
    add.f32         %f3, %f1, %f2;
    st.global.f32   [%rl7], %f3;
    ret;
  }


Dissecting the Kernel
---------------------

Now let us dissect the LLVM IR that makes up this kernel.

Data Layout
^^^^^^^^^^^

The data layout string determines the size in bits of common data types, their
ABI alignment, and their storage size.  For NVPTX, you should use one of the
following:

32-bit PTX:

.. code-block:: llvm

  target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"

64-bit PTX:

.. code-block:: llvm

  target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"


Target Intrinsics
^^^^^^^^^^^^^^^^^

In this example, we use the ``@llvm.nvvm.read.ptx.sreg.tid.x`` intrinsic to
read the X component of the current thread's ID, which corresponds to a read
of register ``%tid.x`` in PTX. The NVPTX back-end supports a large set of
intrinsics.  A short list is shown below; please see
``include/llvm/IR/IntrinsicsNVVM.td`` for the full list.


================================================ ====================
Intrinsic                                        CUDA Equivalent
================================================ ====================
``i32 @llvm.nvvm.read.ptx.sreg.tid.{x,y,z}``     threadIdx.{x,y,z}
``i32 @llvm.nvvm.read.ptx.sreg.ctaid.{x,y,z}``   blockIdx.{x,y,z}
``i32 @llvm.nvvm.read.ptx.sreg.ntid.{x,y,z}``    blockDim.{x,y,z}
``i32 @llvm.nvvm.read.ptx.sreg.nctaid.{x,y,z}``  gridDim.{x,y,z}
``void @llvm.cuda.syncthreads()``                __syncthreads()
================================================ ====================


Address Spaces
^^^^^^^^^^^^^^

You may have noticed that all of the pointer types in the LLVM IR example had
an explicit address space specifier. What is address space 1? NVIDIA GPU
devices (generally) have four types of memory:

- Global: Large, off-chip memory
- Shared: Small, on-chip memory shared among all threads in a CTA
- Local: Per-thread, private memory
- Constant: Read-only memory shared across all threads

These different types of memory are represented in LLVM IR as address spaces.
There is also a fifth address space used by the NVPTX code generator that
corresponds to the "generic" address space.  This address space can represent
addresses in any other address space (with a few exceptions).  This allows
users to write IR functions that can load/store memory using the same
instructions. Intrinsics are provided to convert pointers between the generic
and non-generic address spaces.
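
For example, a single hypothetical helper taking a generic pointer can be
called with pointers converted from any of the memory spaces above; a minimal
sketch:

.. code-block:: llvm

  define void @set(float* %p, float %v) {
    ; Generic store: works whether %p refers to global, shared, or local memory
    store float %v, float* %p
    ret void
  }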

See :ref:`address_spaces` and :ref:`nvptx_intrinsics` for more information.


Kernel Metadata
^^^^^^^^^^^^^^^

In PTX, a function can be either a `kernel` function (callable from the host
program), or a `device` function (callable only from GPU code). You can think
of `kernel` functions as entry-points in the GPU program. To mark an LLVM IR
function as a `kernel` function, we make use of special LLVM metadata. The
NVPTX back-end will look for a named metadata node called
``nvvm.annotations``. This named metadata must contain a list of metadata that
describe the IR. For our purposes, we need to declare a metadata node that
assigns the "kernel" attribute to the LLVM IR function that should be emitted
as a PTX `kernel` function. These metadata nodes take the form:

.. code-block:: text

  metadata !{<function ref>, metadata !"kernel", i32 1}

For the previous example, we have:

.. code-block:: llvm

  !nvvm.annotations = !{!0}
  !0 = metadata !{void (float addrspace(1)*,
                        float addrspace(1)*,
                        float addrspace(1)*)* @kernel, metadata !"kernel", i32 1}

Here, we have a single metadata declaration in ``nvvm.annotations``. This
metadata annotates our ``@kernel`` function with the ``kernel`` attribute.


Running the Kernel
------------------

Generating PTX from LLVM IR is all well and good, but how do we execute it on
a real GPU device? The CUDA Driver API provides a convenient mechanism for
loading PTX, JIT compiling it to native GPU code, and launching a kernel.
The API is similar to OpenCL.  A simple example showing how to load and
execute our vector addition code is shown below. Note that for brevity this
code does not perform much error checking!

.. note::

  You can also use the ``ptxas`` tool provided by the CUDA Toolkit to compile
  PTX to machine code (SASS) ahead of time for a specific GPU architecture.
  Such binaries can be loaded by the CUDA Driver API in the same way as PTX,
  and precompiling the PTX kernels this way can reduce startup time.


.. code-block:: c++

  #include <iostream>
  #include <fstream>
  #include <cassert>
  #include "cuda.h"


  void checkCudaErrors(CUresult err) {
    assert(err == CUDA_SUCCESS);
  }

  /// main - Program entry point
  int main(int argc, char **argv) {
    CUdevice   device;
    CUmodule   cudaModule;
    CUcontext  context;
    CUfunction function;
    int        devCount;

    // CUDA initialization
    checkCudaErrors(cuInit(0));
    checkCudaErrors(cuDeviceGetCount(&devCount));
    checkCudaErrors(cuDeviceGet(&device, 0));

    char name[128];
    checkCudaErrors(cuDeviceGetName(name, 128, device));
    std::cout << "Using CUDA Device [0]: " << name << "\n";

    int devMajor, devMinor;
    checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device));
    std::cout << "Device Compute Capability: "
              << devMajor << "." << devMinor << "\n";
    if (devMajor < 2) {
      std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n";
      return 1;
    }

    std::ifstream t("kernel.ptx");
    if (!t.is_open()) {
      std::cerr << "kernel.ptx not found\n";
      return 1;
    }
    std::string str((std::istreambuf_iterator<char>(t)),
                      std::istreambuf_iterator<char>());

    // Create driver context
    checkCudaErrors(cuCtxCreate(&context, 0, device));

    // Create module for object
    checkCudaErrors(cuModuleLoadDataEx(&cudaModule, str.c_str(), 0, 0, 0));

    // Get kernel function
    checkCudaErrors(cuModuleGetFunction(&function, cudaModule, "kernel"));

    // Device data
    CUdeviceptr devBufferA;
    CUdeviceptr devBufferB;
    CUdeviceptr devBufferC;

    checkCudaErrors(cuMemAlloc(&devBufferA, sizeof(float)*16));
    checkCudaErrors(cuMemAlloc(&devBufferB, sizeof(float)*16));
    checkCudaErrors(cuMemAlloc(&devBufferC, sizeof(float)*16));

    float* hostA = new float[16];
    float* hostB = new float[16];
    float* hostC = new float[16];

    // Populate input
    for (unsigned i = 0; i != 16; ++i) {
      hostA[i] = (float)i;
      hostB[i] = (float)(2*i);
      hostC[i] = 0.0f;
    }

    checkCudaErrors(cuMemcpyHtoD(devBufferA, &hostA[0], sizeof(float)*16));
    checkCudaErrors(cuMemcpyHtoD(devBufferB, &hostB[0], sizeof(float)*16));

    unsigned blockSizeX = 16;
    unsigned blockSizeY = 1;
    unsigned blockSizeZ = 1;
    unsigned gridSizeX  = 1;
    unsigned gridSizeY  = 1;
    unsigned gridSizeZ  = 1;

    // Kernel parameters
    void *KernelParams[] = { &devBufferA, &devBufferB, &devBufferC };

    std::cout << "Launching kernel\n";

    // Kernel launch
    checkCudaErrors(cuLaunchKernel(function, gridSizeX, gridSizeY, gridSizeZ,
                                   blockSizeX, blockSizeY, blockSizeZ,
                                   0, NULL, KernelParams, NULL));

    // Retrieve device data
    checkCudaErrors(cuMemcpyDtoH(&hostC[0], devBufferC, sizeof(float)*16));

    std::cout << "Results:\n";
    for (unsigned i = 0; i != 16; ++i) {
      std::cout << hostA[i] << " + " << hostB[i] << " = " << hostC[i] << "\n";
    }

    // Clean up after ourselves
    delete [] hostA;
    delete [] hostB;
    delete [] hostC;

    checkCudaErrors(cuMemFree(devBufferA));
    checkCudaErrors(cuMemFree(devBufferB));
    checkCudaErrors(cuMemFree(devBufferC));
    checkCudaErrors(cuModuleUnload(cudaModule));
    checkCudaErrors(cuCtxDestroy(context));

    return 0;
  }


You will need to link with the CUDA driver and specify the path to cuda.h.

.. code-block:: text

  # clang++ sample.cpp -o sample -O2 -g -I/usr/local/cuda-5.5/include -lcuda

We don't need to specify a path to ``libcuda.so`` since this is installed in a
system location by the driver, not the CUDA toolkit.

If everything goes as planned, you should see the following output when
running the compiled program:

.. code-block:: text

  Using CUDA Device [0]: GeForce GTX 680
  Device Compute Capability: 3.0
  Launching kernel
  Results:
  0 + 0 = 0
  1 + 2 = 3
  2 + 4 = 6
  3 + 6 = 9
  4 + 8 = 12
  5 + 10 = 15
  6 + 12 = 18
  7 + 14 = 21
  8 + 16 = 24
  9 + 18 = 27
  10 + 20 = 30
  11 + 22 = 33
  12 + 24 = 36
  13 + 26 = 39
  14 + 28 = 42
  15 + 30 = 45

.. note::

  You will likely see a different device identifier based on your hardware.


Tutorial: Linking with Libdevice
================================

In this tutorial, we show a simple example of linking LLVM IR with the
libdevice library. We will use the same kernel as the previous tutorial,
except that we will compute ``C = pow(A, B)`` instead of ``C = A + B``.
Libdevice provides an ``__nv_powf`` function that we will use.

.. code-block:: llvm

  target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
  target triple = "nvptx64-nvidia-cuda"

  ; Intrinsic to read X component of thread ID
  declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
  ; libdevice function
  declare float @__nv_powf(float, float)

  define void @kernel(float addrspace(1)* %A,
                      float addrspace(1)* %B,
                      float addrspace(1)* %C) {
  entry:
    ; What is my ID?
    %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind

    ; Compute pointers into A, B, and C
    %ptrA = getelementptr float addrspace(1)* %A, i32 %id
    %ptrB = getelementptr float addrspace(1)* %B, i32 %id
    %ptrC = getelementptr float addrspace(1)* %C, i32 %id

    ; Read A, B
    %valA = load float addrspace(1)* %ptrA, align 4
    %valB = load float addrspace(1)* %ptrB, align 4

    ; Compute C = pow(A, B)
    %valC = call float @__nv_powf(float %valA, float %valB)

    ; Store back to C
    store float %valC, float addrspace(1)* %ptrC, align 4

    ret void
  }

  !nvvm.annotations = !{!0}
  !0 = metadata !{void (float addrspace(1)*,
                        float addrspace(1)*,
                        float addrspace(1)*)* @kernel, metadata !"kernel", i32 1}


To compile this kernel, we perform the following steps:

1. Link with libdevice
2. Internalize all but the public kernel function
3. Run ``NVVMReflect`` and set ``__CUDA_FTZ`` to 0
4. Optimize the linked module
5. Codegen the module


These steps can be performed by the LLVM ``llvm-link``, ``opt``, and ``llc``
tools. In a complete compiler, these steps can also be performed entirely
programmatically by setting up an appropriate pass configuration (see
:ref:`libdevice`).

.. code-block:: text

  # llvm-link t2.bc libdevice.compute_20.10.bc -o t2.linked.bc
  # opt -internalize -internalize-public-api-list=kernel -nvvm-reflect-list=__CUDA_FTZ=0 -nvvm-reflect -O3 t2.linked.bc -o t2.opt.bc
  # llc -mcpu=sm_20 t2.opt.bc -o t2.ptx

.. note::

  The ``-nvvm-reflect-list=__CUDA_FTZ=0`` option is not strictly required, as
  any undefined variables will default to zero. It is shown here for
  evaluation purposes.


This gives us the following PTX (excerpt):

.. code-block:: text

  //
  // Generated by LLVM NVPTX Back-End
  //

  .version 3.1
  .target sm_20
  .address_size 64

    // .globl kernel
                                          // @kernel
  .visible .entry kernel(
    .param .u64 kernel_param_0,
    .param .u64 kernel_param_1,
    .param .u64 kernel_param_2
  )
  {
    .reg .pred  %p<30>;
    .reg .f32   %f<111>;
    .reg .s32   %r<21>;
    .reg .s64   %rl<8>;

  // BB#0:                                // %entry
    ld.param.u64  %rl2, [kernel_param_0];
    mov.u32   %r3, %tid.x;
    ld.param.u64  %rl3, [kernel_param_1];
    mul.wide.s32  %rl4, %r3, 4;
    add.s64   %rl5, %rl2, %rl4;
    ld.param.u64  %rl6, [kernel_param_2];
    add.s64   %rl7, %rl3, %rl4;
    add.s64   %rl1, %rl6, %rl4;
    ld.global.f32   %f1, [%rl5];
    ld.global.f32   %f2, [%rl7];
    setp.eq.f32 %p1, %f1, 0f3F800000;
    setp.eq.f32 %p2, %f2, 0f00000000;
    or.pred   %p3, %p1, %p2;
    @%p3 bra  BB0_1;
    bra.uni   BB0_2;
  BB0_1:
    mov.f32   %f110, 0f3F800000;
    st.global.f32   [%rl1], %f110;
    ret;
  BB0_2:                                  // %__nv_isnanf.exit.i
    abs.f32   %f4, %f1;
    setp.gtu.f32  %p4, %f4, 0f7F800000;
    @%p4 bra  BB0_4;
  // BB#3:                                // %__nv_isnanf.exit5.i
    abs.f32   %f5, %f2;
    setp.le.f32 %p5, %f5, 0f7F800000;
    @%p5 bra  BB0_5;
  BB0_4:                                  // %.critedge1.i
    add.f32   %f110, %f1, %f2;
    st.global.f32   [%rl1], %f110;
    ret;
  BB0_5:                                  // %__nv_isinff.exit.i

    ...

  BB0_26:                                 // %__nv_truncf.exit.i.i.i.i.i
    mul.f32   %f90, %f107, 0f3FB8AA3B;
    cvt.rzi.f32.f32 %f91, %f90;
    mov.f32   %f92, 0fBF317200;
    fma.rn.f32  %f93, %f91, %f92, %f107;
    mov.f32   %f94, 0fB5BFBE8E;
    fma.rn.f32  %f95, %f91, %f94, %f93;
    mul.f32   %f89, %f95, 0f3FB8AA3B;
    // inline asm
    ex2.approx.ftz.f32 %f88,%f89;
    // inline asm
    add.f32   %f96, %f91, 0f00000000;
    ex2.approx.f32  %f97, %f96;
    mul.f32   %f98, %f88, %f97;
    setp.lt.f32 %p15, %f107, 0fC2D20000;
    selp.f32  %f99, 0f00000000, %f98, %p15;
    setp.gt.f32 %p16, %f107, 0f42D20000;
    selp.f32  %f110, 0f7F800000, %f99, %p16;
    setp.eq.f32 %p17, %f110, 0f7F800000;
    @%p17 bra   BB0_28;
  // BB#27:
    fma.rn.f32  %f110, %f110, %f108, %f110;
  BB0_28:                                 // %__internal_accurate_powf.exit.i
    setp.lt.f32 %p18, %f1, 0f00000000;
    setp.eq.f32 %p19, %f3, 0f3F800000;
    and.pred    %p20, %p18, %p19;
    @!%p20 bra  BB0_30;
    bra.uni   BB0_29;
  BB0_29:
    mov.b32    %r9, %f110;
    xor.b32   %r10, %r9, -2147483648;
    mov.b32    %f110, %r10;
  BB0_30:                                 // %__nv_powf.exit
    st.global.f32   [%rl1], %f110;
    ret;
  }