1 ============================= 2 User Guide for NVPTX Back-end 3 ============================= 4 5 .. contents:: 6 :local: 7 :depth: 3 8 9 10 Introduction 11 ============ 12 13 To support GPU programming, the NVPTX back-end supports a subset of LLVM IR 14 along with a defined set of conventions used to represent GPU programming 15 concepts. This document provides an overview of the general usage of the back- 16 end, including a description of the conventions used and the set of accepted 17 LLVM IR. 18 19 .. note:: 20 21 This document assumes a basic familiarity with CUDA and the PTX 22 assembly language. Information about the CUDA Driver API and the PTX assembly 23 language can be found in the `CUDA documentation 24 <http://docs.nvidia.com/cuda/index.html>`_. 25 26 27 28 Conventions 29 =========== 30 31 Marking Functions as Kernels 32 ---------------------------- 33 34 In PTX, there are two types of functions: *device functions*, which are only 35 callable by device code, and *kernel functions*, which are callable by host 36 code. By default, the back-end will emit device functions. Metadata is used to 37 declare a function as a kernel function. This metadata is attached to the 38 ``nvvm.annotations`` named metadata object, and has the following format: 39 40 .. code-block:: llvm 41 42 !0 = metadata !{<function-ref>, metadata !"kernel", i32 1} 43 44 The first parameter is a reference to the kernel function. The following 45 example shows a kernel function calling a device function in LLVM IR. The 46 function ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not. 47 48 .. code-block:: llvm 49 50 define float @my_fmad(float %x, float %y, float %z) { 51 %mul = fmul float %x, %y 52 %add = fadd float %mul, %z 53 ret float %add 54 } 55 56 define void @my_kernel(float* %ptr) { 57 %val = load float* %ptr 58 %ret = call float @my_fmad(float %val, float %val, float %val) 59 store float %ret, float* %ptr 60 ret void 61 } 62 63 !nvvm.annotations = !{!1} 64 !1 = metadata !{void (float*)* @my_kernel, metadata !"kernel", i32 1} 65 66 When compiled, the PTX kernel functions are callable by host-side code. 67 68 69 .. _address_spaces: 70 71 Address Spaces 72 -------------- 73 74 The NVPTX back-end uses the following address space mapping: 75 76 ============= ====================== 77 Address Space Memory Space 78 ============= ====================== 79 0 Generic 80 1 Global 81 2 Internal Use 82 3 Shared 83 4 Constant 84 5 Local 85 ============= ====================== 86 87 Every global variable and pointer type is assigned to one of these address 88 spaces, with 0 being the default address space. Intrinsics are provided which 89 can be used to convert pointers between the generic and non-generic address 90 spaces. 91 92 As an example, the following IR will define an array ``@g`` that resides in 93 global device memory. 94 95 .. code-block:: llvm 96 97 @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ] 98 99 LLVM IR functions can read and write to this array, and host-side code can 100 copy data to it by name with the CUDA Driver API. 101 102 Note that since address space 0 is the generic space, it is illegal to have 103 global variables in address space 0. Address space 0 is the default address 104 space in LLVM, so the ``addrspace(N)`` annotation is *required* for global 105 variables. 106 107 108 Triples 109 ------- 110 111 The NVPTX target uses the module triple to select between 32/64-bit code 112 generation and the driver-compiler interface to use. The triple architecture 113 can be one of ``nvptx`` (32-bit PTX) or ``nvptx64`` (64-bit PTX). The 114 operating system should be one of ``cuda`` or ``nvcl``, which determines the 115 interface used by the generated code to communicate with the driver. Most 116 users will want to use ``cuda`` as the operating system, which makes the 117 generated PTX compatible with the CUDA Driver API. 118 119 Example: 32-bit PTX for CUDA Driver API: ``nvptx-nvidia-cuda`` 120 121 Example: 64-bit PTX for CUDA Driver API: ``nvptx64-nvidia-cuda`` 122 123 124 125 .. _nvptx_intrinsics: 126 127 NVPTX Intrinsics 128 ================ 129 130 Address Space Conversion 131 ------------------------ 132 133 '``llvm.nvvm.ptr.*.to.gen``' Intrinsics 134 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 135 136 Syntax: 137 """"""" 138 139 These are overloaded intrinsics. You can use these on any pointer types. 140 141 .. code-block:: llvm 142 143 declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*) 144 declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*) 145 declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*) 146 declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*) 147 148 Overview: 149 """"""""" 150 151 The '``llvm.nvvm.ptr.*.to.gen``' intrinsics convert a pointer in a non-generic 152 address space to a generic address space pointer. 153 154 Semantics: 155 """""""""" 156 157 These intrinsics modify the pointer value to be a valid generic address space 158 pointer. 159 160 161 '``llvm.nvvm.ptr.gen.to.*``' Intrinsics 162 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 163 164 Syntax: 165 """"""" 166 167 These are overloaded intrinsics. You can use these on any pointer types. 168 169 .. code-block:: llvm 170 171 declare i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8*) 172 declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*) 173 declare i8 addrspace(4)* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8*) 174 declare i8 addrspace(5)* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8*) 175 176 Overview: 177 """"""""" 178 179 The '``llvm.nvvm.ptr.gen.to.*``' intrinsics convert a pointer in the generic 180 address space to a pointer in the target address space. Note that these 181 intrinsics are only useful if the address space of the target address space of 182 the pointer is known. It is not legal to use address space conversion 183 intrinsics to convert a pointer from one non-generic address space to another 184 non-generic address space. 185 186 Semantics: 187 """""""""" 188 189 These intrinsics modify the pointer value to be a valid pointer in the target 190 non-generic address space. 191 192 193 Reading PTX Special Registers 194 ----------------------------- 195 196 '``llvm.nvvm.read.ptx.sreg.*``' 197 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 198 199 Syntax: 200 """"""" 201 202 .. code-block:: llvm 203 204 declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() 205 declare i32 @llvm.nvvm.read.ptx.sreg.tid.y() 206 declare i32 @llvm.nvvm.read.ptx.sreg.tid.z() 207 declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x() 208 declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y() 209 declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z() 210 declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() 211 declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y() 212 declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z() 213 declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x() 214 declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y() 215 declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z() 216 declare i32 @llvm.nvvm.read.ptx.sreg.warpsize() 217 218 Overview: 219 """"""""" 220 221 The '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX 222 special registers, in particular the kernel launch bounds. These registers 223 map in the following way to CUDA builtins: 224 225 ============ ===================================== 226 CUDA Builtin PTX Special Register Intrinsic 227 ============ ===================================== 228 ``threadId`` ``@llvm.nvvm.read.ptx.sreg.tid.*`` 229 ``blockIdx`` ``@llvm.nvvm.read.ptx.sreg.ctaid.*`` 230 ``blockDim`` ``@llvm.nvvm.read.ptx.sreg.ntid.*`` 231 ``gridDim`` ``@llvm.nvvm.read.ptx.sreg.nctaid.*`` 232 ============ ===================================== 233 234 235 Barriers 236 -------- 237 238 '``llvm.nvvm.barrier0``' 239 ^^^^^^^^^^^^^^^^^^^^^^^^^^^ 240 241 Syntax: 242 """"""" 243 244 .. code-block:: llvm 245 246 declare void @llvm.nvvm.barrier0() 247 248 Overview: 249 """"""""" 250 251 The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0`` 252 instruction, equivalent to the ``__syncthreads()`` call in CUDA. 253 254 255 Other Intrinsics 256 ---------------- 257 258 For the full set of NVPTX intrinsics, please see the 259 ``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree. 260 261 262 .. _libdevice: 263 264 Linking with Libdevice 265 ====================== 266 267 The CUDA Toolkit comes with an LLVM bitcode library called ``libdevice`` that 268 implements many common mathematical functions. This library can be used as a 269 high-performance math library for any compilers using the LLVM NVPTX target. 270 The library can be found under ``nvvm/libdevice/`` in the CUDA Toolkit and 271 there is a separate version for each compute architecture. 272 273 For a list of all math functions implemented in libdevice, see 274 `libdevice Users Guide <http://docs.nvidia.com/cuda/libdevice-users-guide/index.html>`_. 275 276 To accommodate various math-related compiler flags that can affect code 277 generation of libdevice code, the library code depends on a special LLVM IR 278 pass (``NVVMReflect``) to handle conditional compilation within LLVM IR. This 279 pass looks for calls to the ``@__nvvm_reflect`` function and replaces them 280 with constants based on the defined reflection parameters. Such conditional 281 code often follows a pattern: 282 283 .. code-block:: c++ 284 285 float my_function(float a) { 286 if (__nvvm_reflect("FASTMATH")) 287 return my_function_fast(a); 288 else 289 return my_function_precise(a); 290 } 291 292 The default value for all unspecified reflection parameters is zero. 293 294 The ``NVVMReflect`` pass should be executed early in the optimization 295 pipeline, immediately after the link stage. The ``internalize`` pass is also 296 recommended to remove unused math functions from the resulting PTX. For an 297 input IR module ``module.bc``, the following compilation flow is recommended: 298 299 1. Save list of external functions in ``module.bc`` 300 2. Link ``module.bc`` with ``libdevice.compute_XX.YY.bc`` 301 3. Internalize all functions not in list from (1) 302 4. Eliminate all unused internal functions 303 5. Run ``NVVMReflect`` pass 304 6. Run standard optimization pipeline 305 306 .. note:: 307 308 ``linkonce`` and ``linkonce_odr`` linkage types are not suitable for the 309 libdevice functions. It is possible to link two IR modules that have been 310 linked against libdevice using different reflection variables. 311 312 Since the ``NVVMReflect`` pass replaces conditionals with constants, it will 313 often leave behind dead code of the form: 314 315 .. code-block:: llvm 316 317 entry: 318 .. 319 br i1 true, label %foo, label %bar 320 foo: 321 .. 322 bar: 323 ; Dead code 324 .. 325 326 Therefore, it is recommended that ``NVVMReflect`` is executed early in the 327 optimization pipeline before dead-code elimination. 328 329 330 Reflection Parameters 331 --------------------- 332 333 The libdevice library currently uses the following reflection parameters to 334 control code generation: 335 336 ==================== ====================================================== 337 Flag Description 338 ==================== ====================================================== 339 ``__CUDA_FTZ=[0,1]`` Use optimized code paths that flush subnormals to zero 340 ==================== ====================================================== 341 342 343 Invoking NVVMReflect 344 -------------------- 345 346 To ensure that all dead code caused by the reflection pass is eliminated, it 347 is recommended that the reflection pass is executed early in the LLVM IR 348 optimization pipeline. The pass takes an optional mapping of reflection 349 parameter name to an integer value. This mapping can be specified as either a 350 command-line option to ``opt`` or as an LLVM ``StringMap<int>`` object when 351 programmatically creating a pass pipeline. 352 353 With ``opt``: 354 355 .. code-block:: text 356 357 # opt -nvvm-reflect -nvvm-reflect-list=<var>=<value>,<var>=<value> module.bc -o module.reflect.bc 358 359 360 With programmatic pass pipeline: 361 362 .. code-block:: c++ 363 364 extern ModulePass *llvm::createNVVMReflectPass(const StringMap<int>& Mapping); 365 366 StringMap<int> ReflectParams; 367 ReflectParams["__CUDA_FTZ"] = 1; 368 Passes.add(createNVVMReflectPass(ReflectParams)); 369 370 371 372 Executing PTX 373 ============= 374 375 The most common way to execute PTX assembly on a GPU device is to use the CUDA 376 Driver API. This API is a low-level interface to the GPU driver and allows for 377 JIT compilation of PTX code to native GPU machine code. 378 379 Initializing the Driver API: 380 381 .. code-block:: c++ 382 383 CUdevice device; 384 CUcontext context; 385 386 // Initialize the driver API 387 cuInit(0); 388 // Get a handle to the first compute device 389 cuDeviceGet(&device, 0); 390 // Create a compute device context 391 cuCtxCreate(&context, 0, device); 392 393 JIT compiling a PTX string to a device binary: 394 395 .. code-block:: c++ 396 397 CUmodule module; 398 CUfunction funcion; 399 400 // JIT compile a null-terminated PTX string 401 cuModuleLoadData(&module, (void*)PTXString); 402 403 // Get a handle to the "myfunction" kernel function 404 cuModuleGetFunction(&function, module, "myfunction"); 405 406 For full examples of executing PTX assembly, please see the `CUDA Samples 407 <https://developer.nvidia.com/cuda-downloads>`_ distribution. 408 409 410 Common Issues 411 ============= 412 413 ptxas complains of undefined function: __nvvm_reflect 414 ----------------------------------------------------- 415 416 When linking with libdevice, the ``NVVMReflect`` pass must be used. See 417 :ref:`libdevice` for more information. 418 419 420 Tutorial: A Simple Compute Kernel 421 ================================= 422 423 To start, let us take a look at a simple compute kernel written directly in 424 LLVM IR. The kernel implements vector addition, where each thread computes one 425 element of the output vector C from the input vectors A and B. To make this 426 easier, we also assume that only a single CTA (thread block) will be launched, 427 and that it will be one dimensional. 428 429 430 The Kernel 431 ---------- 432 433 .. code-block:: llvm 434 435 target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64" 436 target triple = "nvptx64-nvidia-cuda" 437 438 ; Intrinsic to read X component of thread ID 439 declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind 440 441 define void @kernel(float addrspace(1)* %A, 442 float addrspace(1)* %B, 443 float addrspace(1)* %C) { 444 entry: 445 ; What is my ID? 446 %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind 447 448 ; Compute pointers into A, B, and C 449 %ptrA = getelementptr float addrspace(1)* %A, i32 %id 450 %ptrB = getelementptr float addrspace(1)* %B, i32 %id 451 %ptrC = getelementptr float addrspace(1)* %C, i32 %id 452 453 ; Read A, B 454 %valA = load float addrspace(1)* %ptrA, align 4 455 %valB = load float addrspace(1)* %ptrB, align 4 456 457 ; Compute C = A + B 458 %valC = fadd float %valA, %valB 459 460 ; Store back to C 461 store float %valC, float addrspace(1)* %ptrC, align 4 462 463 ret void 464 } 465 466 !nvvm.annotations = !{!0} 467 !0 = metadata !{void (float addrspace(1)*, 468 float addrspace(1)*, 469 float addrspace(1)*)* @kernel, metadata !"kernel", i32 1} 470 471 472 We can use the LLVM ``llc`` tool to directly run the NVPTX code generator: 473 474 .. code-block:: text 475 476 # llc -mcpu=sm_20 kernel.ll -o kernel.ptx 477 478 479 .. note:: 480 481 If you want to generate 32-bit code, change ``p:64:64:64`` to ``p:32:32:32`` 482 in the module data layout string and use ``nvptx-nvidia-cuda`` as the 483 target triple. 484 485 486 The output we get from ``llc`` (as of LLVM 3.4): 487 488 .. code-block:: text 489 490 // 491 // Generated by LLVM NVPTX Back-End 492 // 493 494 .version 3.1 495 .target sm_20 496 .address_size 64 497 498 // .globl kernel 499 // @kernel 500 .visible .entry kernel( 501 .param .u64 kernel_param_0, 502 .param .u64 kernel_param_1, 503 .param .u64 kernel_param_2 504 ) 505 { 506 .reg .f32 %f<4>; 507 .reg .s32 %r<2>; 508 .reg .s64 %rl<8>; 509 510 // BB#0: // %entry 511 ld.param.u64 %rl1, [kernel_param_0]; 512 mov.u32 %r1, %tid.x; 513 mul.wide.s32 %rl2, %r1, 4; 514 add.s64 %rl3, %rl1, %rl2; 515 ld.param.u64 %rl4, [kernel_param_1]; 516 add.s64 %rl5, %rl4, %rl2; 517 ld.param.u64 %rl6, [kernel_param_2]; 518 add.s64 %rl7, %rl6, %rl2; 519 ld.global.f32 %f1, [%rl3]; 520 ld.global.f32 %f2, [%rl5]; 521 add.f32 %f3, %f1, %f2; 522 st.global.f32 [%rl7], %f3; 523 ret; 524 } 525 526 527 Dissecting the Kernel 528 --------------------- 529 530 Now let us dissect the LLVM IR that makes up this kernel. 531 532 Data Layout 533 ^^^^^^^^^^^ 534 535 The data layout string determines the size in bits of common data types, their 536 ABI alignment, and their storage size. For NVPTX, you should use one of the 537 following: 538 539 32-bit PTX: 540 541 .. code-block:: llvm 542 543 target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64" 544 545 64-bit PTX: 546 547 .. code-block:: llvm 548 549 target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64" 550 551 552 Target Intrinsics 553 ^^^^^^^^^^^^^^^^^ 554 555 In this example, we use the ``@llvm.nvvm.read.ptx.sreg.tid.x`` intrinsic to 556 read the X component of the current thread's ID, which corresponds to a read 557 of register ``%tid.x`` in PTX. The NVPTX back-end supports a large set of 558 intrinsics. A short list is shown below; please see 559 ``include/llvm/IR/IntrinsicsNVVM.td`` for the full list. 560 561 562 ================================================ ==================== 563 Intrinsic CUDA Equivalent 564 ================================================ ==================== 565 ``i32 @llvm.nvvm.read.ptx.sreg.tid.{x,y,z}`` threadIdx.{x,y,z} 566 ``i32 @llvm.nvvm.read.ptx.sreg.ctaid.{x,y,z}`` blockIdx.{x,y,z} 567 ``i32 @llvm.nvvm.read.ptx.sreg.ntid.{x,y,z}`` blockDim.{x,y,z} 568 ``i32 @llvm.nvvm.read.ptx.sreg.nctaid.{x,y,z}`` gridDim.{x,y,z} 569 ``void @llvm.cuda.syncthreads()`` __syncthreads() 570 ================================================ ==================== 571 572 573 Address Spaces 574 ^^^^^^^^^^^^^^ 575 576 You may have noticed that all of the pointer types in the LLVM IR example had 577 an explicit address space specifier. What is address space 1? NVIDIA GPU 578 devices (generally) have four types of memory: 579 580 - Global: Large, off-chip memory 581 - Shared: Small, on-chip memory shared among all threads in a CTA 582 - Local: Per-thread, private memory 583 - Constant: Read-only memory shared across all threads 584 585 These different types of memory are represented in LLVM IR as address spaces. 586 There is also a fifth address space used by the NVPTX code generator that 587 corresponds to the "generic" address space. This address space can represent 588 addresses in any other address space (with a few exceptions). This allows 589 users to write IR functions that can load/store memory using the same 590 instructions. Intrinsics are provided to convert pointers between the generic 591 and non-generic address spaces. 592 593 See :ref:`address_spaces` and :ref:`nvptx_intrinsics` for more information. 594 595 596 Kernel Metadata 597 ^^^^^^^^^^^^^^^ 598 599 In PTX, a function can be either a `kernel` function (callable from the host 600 program), or a `device` function (callable only from GPU code). You can think 601 of `kernel` functions as entry-points in the GPU program. To mark an LLVM IR 602 function as a `kernel` function, we make use of special LLVM metadata. The 603 NVPTX back-end will look for a named metadata node called 604 ``nvvm.annotations``. This named metadata must contain a list of metadata that 605 describe the IR. For our purposes, we need to declare a metadata node that 606 assigns the "kernel" attribute to the LLVM IR function that should be emitted 607 as a PTX `kernel` function. These metadata nodes take the form: 608 609 .. code-block:: text 610 611 metadata !{<function ref>, metadata !"kernel", i32 1} 612 613 For the previous example, we have: 614 615 .. code-block:: llvm 616 617 !nvvm.annotations = !{!0} 618 !0 = metadata !{void (float addrspace(1)*, 619 float addrspace(1)*, 620 float addrspace(1)*)* @kernel, metadata !"kernel", i32 1} 621 622 Here, we have a single metadata declaration in ``nvvm.annotations``. This 623 metadata annotates our ``@kernel`` function with the ``kernel`` attribute. 624 625 626 Running the Kernel 627 ------------------ 628 629 Generating PTX from LLVM IR is all well and good, but how do we execute it on 630 a real GPU device? The CUDA Driver API provides a convenient mechanism for 631 loading and JIT compiling PTX to a native GPU device, and launching a kernel. 632 The API is similar to OpenCL. A simple example showing how to load and 633 execute our vector addition code is shown below. Note that for brevity this 634 code does not perform much error checking! 635 636 .. note:: 637 638 You can also use the ``ptxas`` tool provided by the CUDA Toolkit to offline 639 compile PTX to machine code (SASS) for a specific GPU architecture. Such 640 binaries can be loaded by the CUDA Driver API in the same way as PTX. This 641 can be useful for reducing startup time by precompiling the PTX kernels. 642 643 644 .. code-block:: c++ 645 646 #include <iostream> 647 #include <fstream> 648 #include <cassert> 649 #include "cuda.h" 650 651 652 void checkCudaErrors(CUresult err) { 653 assert(err == CUDA_SUCCESS); 654 } 655 656 /// main - Program entry point 657 int main(int argc, char **argv) { 658 CUdevice device; 659 CUmodule cudaModule; 660 CUcontext context; 661 CUfunction function; 662 CUlinkState linker; 663 int devCount; 664 665 // CUDA initialization 666 checkCudaErrors(cuInit(0)); 667 checkCudaErrors(cuDeviceGetCount(&devCount)); 668 checkCudaErrors(cuDeviceGet(&device, 0)); 669 670 char name[128]; 671 checkCudaErrors(cuDeviceGetName(name, 128, device)); 672 std::cout << "Using CUDA Device [0]: " << name << "\n"; 673 674 int devMajor, devMinor; 675 checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device)); 676 std::cout << "Device Compute Capability: " 677 << devMajor << "." << devMinor << "\n"; 678 if (devMajor < 2) { 679 std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n"; 680 return 1; 681 } 682 683 std::ifstream t("kernel.ptx"); 684 if (!t.is_open()) { 685 std::cerr << "kernel.ptx not found\n"; 686 return 1; 687 } 688 std::string str((std::istreambuf_iterator<char>(t)), 689 std::istreambuf_iterator<char>()); 690 691 // Create driver context 692 checkCudaErrors(cuCtxCreate(&context, 0, device)); 693 694 // Create module for object 695 checkCudaErrors(cuModuleLoadDataEx(&cudaModule, str.c_str(), 0, 0, 0)); 696 697 // Get kernel function 698 checkCudaErrors(cuModuleGetFunction(&function, cudaModule, "kernel")); 699 700 // Device data 701 CUdeviceptr devBufferA; 702 CUdeviceptr devBufferB; 703 CUdeviceptr devBufferC; 704 705 checkCudaErrors(cuMemAlloc(&devBufferA, sizeof(float)*16)); 706 checkCudaErrors(cuMemAlloc(&devBufferB, sizeof(float)*16)); 707 checkCudaErrors(cuMemAlloc(&devBufferC, sizeof(float)*16)); 708 709 float* hostA = new float[16]; 710 float* hostB = new float[16]; 711 float* hostC = new float[16]; 712 713 // Populate input 714 for (unsigned i = 0; i != 16; ++i) { 715 hostA[i] = (float)i; 716 hostB[i] = (float)(2*i); 717 hostC[i] = 0.0f; 718 } 719 720 checkCudaErrors(cuMemcpyHtoD(devBufferA, &hostA[0], sizeof(float)*16)); 721 checkCudaErrors(cuMemcpyHtoD(devBufferB, &hostB[0], sizeof(float)*16)); 722 723 724 unsigned blockSizeX = 16; 725 unsigned blockSizeY = 1; 726 unsigned blockSizeZ = 1; 727 unsigned gridSizeX = 1; 728 unsigned gridSizeY = 1; 729 unsigned gridSizeZ = 1; 730 731 // Kernel parameters 732 void *KernelParams[] = { &devBufferA, &devBufferB, &devBufferC }; 733 734 std::cout << "Launching kernel\n"; 735 736 // Kernel launch 737 checkCudaErrors(cuLaunchKernel(function, gridSizeX, gridSizeY, gridSizeZ, 738 blockSizeX, blockSizeY, blockSizeZ, 739 0, NULL, KernelParams, NULL)); 740 741 // Retrieve device data 742 checkCudaErrors(cuMemcpyDtoH(&hostC[0], devBufferC, sizeof(float)*16)); 743 744 745 std::cout << "Results:\n"; 746 for (unsigned i = 0; i != 16; ++i) { 747 std::cout << hostA[i] << " + " << hostB[i] << " = " << hostC[i] << "\n"; 748 } 749 750 751 // Clean up after ourselves 752 delete [] hostA; 753 delete [] hostB; 754 delete [] hostC; 755 756 // Clean-up 757 checkCudaErrors(cuMemFree(devBufferA)); 758 checkCudaErrors(cuMemFree(devBufferB)); 759 checkCudaErrors(cuMemFree(devBufferC)); 760 checkCudaErrors(cuModuleUnload(cudaModule)); 761 checkCudaErrors(cuCtxDestroy(context)); 762 763 return 0; 764 } 765 766 767 You will need to link with the CUDA driver and specify the path to cuda.h. 768 769 .. code-block:: text 770 771 # clang++ sample.cpp -o sample -O2 -g -I/usr/local/cuda-5.5/include -lcuda 772 773 We don't need to specify a path to ``libcuda.so`` since this is installed in a 774 system location by the driver, not the CUDA toolkit. 775 776 If everything goes as planned, you should see the following output when 777 running the compiled program: 778 779 .. code-block:: text 780 781 Using CUDA Device [0]: GeForce GTX 680 782 Device Compute Capability: 3.0 783 Launching kernel 784 Results: 785 0 + 0 = 0 786 1 + 2 = 3 787 2 + 4 = 6 788 3 + 6 = 9 789 4 + 8 = 12 790 5 + 10 = 15 791 6 + 12 = 18 792 7 + 14 = 21 793 8 + 16 = 24 794 9 + 18 = 27 795 10 + 20 = 30 796 11 + 22 = 33 797 12 + 24 = 36 798 13 + 26 = 39 799 14 + 28 = 42 800 15 + 30 = 45 801 802 .. note:: 803 804 You will likely see a different device identifier based on your hardware 805 806 807 Tutorial: Linking with Libdevice 808 ================================ 809 810 In this tutorial, we show a simple example of linking LLVM IR with the 811 libdevice library. We will use the same kernel as the previous tutorial, 812 except that we will compute ``C = pow(A, B)`` instead of ``C = A + B``. 813 Libdevice provides an ``__nv_powf`` function that we will use. 814 815 .. code-block:: llvm 816 817 target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64" 818 target triple = "nvptx64-nvidia-cuda" 819 820 ; Intrinsic to read X component of thread ID 821 declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind 822 ; libdevice function 823 declare float @__nv_powf(float, float) 824 825 define void @kernel(float addrspace(1)* %A, 826 float addrspace(1)* %B, 827 float addrspace(1)* %C) { 828 entry: 829 ; What is my ID? 830 %id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind 831 832 ; Compute pointers into A, B, and C 833 %ptrA = getelementptr float addrspace(1)* %A, i32 %id 834 %ptrB = getelementptr float addrspace(1)* %B, i32 %id 835 %ptrC = getelementptr float addrspace(1)* %C, i32 %id 836 837 ; Read A, B 838 %valA = load float addrspace(1)* %ptrA, align 4 839 %valB = load float addrspace(1)* %ptrB, align 4 840 841 ; Compute C = pow(A, B) 842 %valC = call float @__nv_powf(float %valA, float %valB) 843 844 ; Store back to C 845 store float %valC, float addrspace(1)* %ptrC, align 4 846 847 ret void 848 } 849 850 !nvvm.annotations = !{!0} 851 !0 = metadata !{void (float addrspace(1)*, 852 float addrspace(1)*, 853 float addrspace(1)*)* @kernel, metadata !"kernel", i32 1} 854 855 856 To compile this kernel, we perform the following steps: 857 858 1. Link with libdevice 859 2. Internalize all but the public kernel function 860 3. Run ``NVVMReflect`` and set ``__CUDA_FTZ`` to 0 861 4. Optimize the linked module 862 5. Codegen the module 863 864 865 These steps can be performed by the LLVM ``llvm-link``, ``opt``, and ``llc`` 866 tools. In a complete compiler, these steps can also be performed entirely 867 programmatically by setting up an appropriate pass configuration (see 868 :ref:`libdevice`). 869 870 .. code-block:: text 871 872 # llvm-link t2.bc libdevice.compute_20.10.bc -o t2.linked.bc 873 # opt -internalize -internalize-public-api-list=kernel -nvvm-reflect-list=__CUDA_FTZ=0 -nvvm-reflect -O3 t2.linked.bc -o t2.opt.bc 874 # llc -mcpu=sm_20 t2.opt.bc -o t2.ptx 875 876 .. note:: 877 878 The ``-nvvm-reflect-list=_CUDA_FTZ=0`` is not strictly required, as any 879 undefined variables will default to zero. It is shown here for evaluation 880 purposes. 881 882 883 This gives us the following PTX (excerpt): 884 885 .. code-block:: text 886 887 // 888 // Generated by LLVM NVPTX Back-End 889 // 890 891 .version 3.1 892 .target sm_20 893 .address_size 64 894 895 // .globl kernel 896 // @kernel 897 .visible .entry kernel( 898 .param .u64 kernel_param_0, 899 .param .u64 kernel_param_1, 900 .param .u64 kernel_param_2 901 ) 902 { 903 .reg .pred %p<30>; 904 .reg .f32 %f<111>; 905 .reg .s32 %r<21>; 906 .reg .s64 %rl<8>; 907 908 // BB#0: // %entry 909 ld.param.u64 %rl2, [kernel_param_0]; 910 mov.u32 %r3, %tid.x; 911 ld.param.u64 %rl3, [kernel_param_1]; 912 mul.wide.s32 %rl4, %r3, 4; 913 add.s64 %rl5, %rl2, %rl4; 914 ld.param.u64 %rl6, [kernel_param_2]; 915 add.s64 %rl7, %rl3, %rl4; 916 add.s64 %rl1, %rl6, %rl4; 917 ld.global.f32 %f1, [%rl5]; 918 ld.global.f32 %f2, [%rl7]; 919 setp.eq.f32 %p1, %f1, 0f3F800000; 920 setp.eq.f32 %p2, %f2, 0f00000000; 921 or.pred %p3, %p1, %p2; 922 @%p3 bra BB0_1; 923 bra.uni BB0_2; 924 BB0_1: 925 mov.f32 %f110, 0f3F800000; 926 st.global.f32 [%rl1], %f110; 927 ret; 928 BB0_2: // %__nv_isnanf.exit.i 929 abs.f32 %f4, %f1; 930 setp.gtu.f32 %p4, %f4, 0f7F800000; 931 @%p4 bra BB0_4; 932 // BB#3: // %__nv_isnanf.exit5.i 933 abs.f32 %f5, %f2; 934 setp.le.f32 %p5, %f5, 0f7F800000; 935 @%p5 bra BB0_5; 936 BB0_4: // %.critedge1.i 937 add.f32 %f110, %f1, %f2; 938 st.global.f32 [%rl1], %f110; 939 ret; 940 BB0_5: // %__nv_isinff.exit.i 941 942 ... 943 944 BB0_26: // %__nv_truncf.exit.i.i.i.i.i 945 mul.f32 %f90, %f107, 0f3FB8AA3B; 946 cvt.rzi.f32.f32 %f91, %f90; 947 mov.f32 %f92, 0fBF317200; 948 fma.rn.f32 %f93, %f91, %f92, %f107; 949 mov.f32 %f94, 0fB5BFBE8E; 950 fma.rn.f32 %f95, %f91, %f94, %f93; 951 mul.f32 %f89, %f95, 0f3FB8AA3B; 952 // inline asm 953 ex2.approx.ftz.f32 %f88,%f89; 954 // inline asm 955 add.f32 %f96, %f91, 0f00000000; 956 ex2.approx.f32 %f97, %f96; 957 mul.f32 %f98, %f88, %f97; 958 setp.lt.f32 %p15, %f107, 0fC2D20000; 959 selp.f32 %f99, 0f00000000, %f98, %p15; 960 setp.gt.f32 %p16, %f107, 0f42D20000; 961 selp.f32 %f110, 0f7F800000, %f99, %p16; 962 setp.eq.f32 %p17, %f110, 0f7F800000; 963 @%p17 bra BB0_28; 964 // BB#27: 965 fma.rn.f32 %f110, %f110, %f108, %f110; 966 BB0_28: // %__internal_accurate_powf.exit.i 967 setp.lt.f32 %p18, %f1, 0f00000000; 968 setp.eq.f32 %p19, %f3, 0f3F800000; 969 and.pred %p20, %p18, %p19; 970 @!%p20 bra BB0_30; 971 bra.uni BB0_29; 972 BB0_29: 973 mov.b32 %r9, %f110; 974 xor.b32 %r10, %r9, -2147483648; 975 mov.b32 %f110, %r10; 976 BB0_30: // %__nv_powf.exit 977 st.global.f32 [%rl1], %f110; 978 ret; 979 } 980 981