===================================
Compiling CUDA C/C++ with LLVM
===================================

.. contents::
   :local:

Introduction
============

This document describes how to compile CUDA C/C++ with LLVM and gives some
detail on its internals. It is aimed both at users who want to compile CUDA
with LLVM and at developers who want to improve LLVM for GPUs. This document
assumes a basic familiarity with CUDA. Information about CUDA programming can
be found in the `CUDA programming guide
<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.

How to Build LLVM with CUDA Support
===================================

Below is a quick summary of downloading and building LLVM. Consult the `Getting
Started <http://llvm.org/docs/GettingStarted.html>`_ page for more details on
setting up LLVM.

#. Check out LLVM:

   .. code-block:: console

     $ cd where-you-want-llvm-to-live
     $ svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm

#. Check out Clang:

   .. code-block:: console

     $ cd where-you-want-llvm-to-live
     $ cd llvm/tools
     $ svn co http://llvm.org/svn/llvm-project/cfe/trunk clang

#. Configure and build LLVM and Clang:

   .. code-block:: console

     $ cd where-you-want-llvm-to-live
     $ mkdir build
     $ cd build
     $ cmake [options] ..
     $ make

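The ``[options]`` placeholder depends on your machine and goals. As a rough
sketch (the target list below is an assumption; adjust it to your host
architecture), a release build with the NVPTX backend enabled, which CUDA
compilation requires, might be configured as:

.. code-block:: console

  $ cmake -DCMAKE_BUILD_TYPE=Release \
          -DLLVM_TARGETS_TO_BUILD="X86;NVPTX" \
          ..
  $ make
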
How to Compile CUDA C/C++ with LLVM
===================================

We assume you have installed the CUDA driver and runtime. Consult the `NVIDIA
CUDA Installation Guide
<https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ if
you have not.

Suppose you want to compile and run the following CUDA program (``axpy.cu``),
which multiplies a ``float`` array by a ``float`` scalar (AXPY).

.. code-block:: c++

  #include <helper_cuda.h> // for checkCudaErrors

  #include <iostream>

  __global__ void axpy(float a, float* x, float* y) {
    y[threadIdx.x] = a * x[threadIdx.x];
  }

  int main(int argc, char* argv[]) {
    const int kDataLen = 4;

    float a = 2.0f;
    float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
    float host_y[kDataLen];

    // Allocate device memory and copy the input data to the device.
    float* device_x;
    float* device_y;
    checkCudaErrors(cudaMalloc(&device_x, kDataLen * sizeof(float)));
    checkCudaErrors(cudaMalloc(&device_y, kDataLen * sizeof(float)));
    checkCudaErrors(cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
                               cudaMemcpyHostToDevice));

    // Launch the kernel with one block of kDataLen threads.
    axpy<<<1, kDataLen>>>(a, device_x, device_y);

    // Wait for the kernel to finish, then copy the output back to the host.
    checkCudaErrors(cudaDeviceSynchronize());
    checkCudaErrors(cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
                               cudaMemcpyDeviceToHost));

    // Print the results.
    for (int i = 0; i < kDataLen; ++i) {
      std::cout << "y[" << i << "] = " << host_y[i] << "\n";
    }

    checkCudaErrors(cudaDeviceReset());
    return 0;
  }

The command line for compilation is similar to what you would use for C++.

.. code-block:: console

  $ clang++ -o axpy -I<CUDA install path>/samples/common/inc \
      -L<CUDA install path>/<lib64 or lib> axpy.cu \
      -lcudart_static -lcuda -ldl -lrt -pthread
  $ ./axpy
  y[0] = 2
  y[1] = 4
  y[2] = 6
  y[3] = 8

Note that ``helper_cuda.h`` comes from the CUDA samples, so you need the
samples installed for this example. ``<CUDA install path>`` is the root
directory of your CUDA installation, typically ``/usr/local/cuda``.

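If clang cannot find your CUDA installation on its own, or you want to
generate code for a specific GPU, clang's CUDA support provides flags for
both. Treat the following as a sketch; flag spellings and the set of
supported architectures depend on your clang version:

.. code-block:: console

  $ clang++ -o axpy axpy.cu \
      --cuda-path=/usr/local/cuda \
      --cuda-gpu-arch=sm_35 \
      -I<CUDA install path>/samples/common/inc \
      -L<CUDA install path>/<lib64 or lib> \
      -lcudart_static -lcuda -ldl -lrt -pthread

``--cuda-gpu-arch`` selects the compute capability to compile for; ``sm_35``
is only an example, so substitute the value matching your GPU.
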
Optimizations
=============

CPUs and GPUs have different design philosophies and architectures. For
example, a typical CPU has branch prediction and out-of-order execution and is
superscalar, whereas a typical GPU has none of these. Due to such differences,
an optimization pipeline well-tuned for CPUs may not be suitable for GPUs.

LLVM performs several general and CUDA-specific optimizations for GPUs. The
list below shows some of the more important optimizations for GPUs. Most of
them have been upstreamed to ``lib/Transforms/Scalar`` and
``lib/Target/NVPTX``. A few of them have not been upstreamed due to lack of a
customizable target-independent optimization pipeline.

* **Straight-line scalar optimizations**. These optimizations reduce redundancy
  in straight-line code, such as repeated address computations. Details can be
  found in the `design document for straight-line scalar optimizations
  <https://goo.gl/4Rb9As>`_; a small illustration follows this list.

* **Inferring memory spaces**. `This optimization
  <http://www.llvm.org/docs/doxygen/html/NVPTXFavorNonGenericAddrSpaces_8cpp_source.html>`_
  infers the memory space of an address so that the backend can emit faster
  special loads and stores from it. Details can be found in the `design
  document for memory space inference <https://goo.gl/5wH2Ct>`_.

* **Aggressive loop unrolling and function inlining**. Loop unrolling and
  function inlining need to be more aggressive for GPUs than for CPUs because
  control-flow transfer on a GPU is more expensive. They also promote other
  optimizations, such as constant propagation and SROA, which sometimes speed
  up code by over 10x. An empirical inline threshold for GPUs is 1100. This
  configuration has yet to be upstreamed with a target-specific optimization
  pipeline. LLVM also provides `loop unrolling pragmas
  <http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
  and ``__attribute__((always_inline))`` for programmers to force unrolling and
  inlining; a usage sketch follows this list.

* **Aggressive speculative execution**. `This transformation
  <http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_
  mainly serves to promote straight-line scalar optimizations, which are most
  effective on code along dominator paths.

* **Memory-space alias analysis**. `This alias analysis
  <http://reviews.llvm.org/D12414>`_ infers that two pointers in different
  special memory spaces do not alias. It has yet to be integrated into the new
  alias analysis infrastructure; the new infrastructure does not run
  target-specific alias analysis.

* **Bypassing 64-bit divides**. `An existing optimization
  <http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_
  enabled in the NVPTX backend. 64-bit integer divides are much slower than
  32-bit ones on NVIDIA GPUs due to the lack of a divide unit. Many of the
  64-bit divides in our benchmarks have a divisor and dividend which fit in 32
  bits at runtime. This optimization provides a fast path for this common case,
  sketched after this list.
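
To make the first bullet concrete, here is a small illustration of the kind of
redundancy the straight-line scalar optimizations remove. The kernel is
hypothetical; the point is the address arithmetic:

.. code-block:: c++

  __global__ void sum_pairs(const float* in, float* out) {
    int i = threadIdx.x;
    // The addresses &in[2 * i] and &in[2 * i + 1] differ only by a constant
    // offset. Straight-line scalar optimizations let the compiler compute
    // the second address as "first address + sizeof(float)" instead of
    // recomputing 2 * i + 1 from scratch.
    out[i] = in[2 * i] + in[2 * i + 1];
  }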
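
The unrolling pragma and inlining attribute from the third bullet are applied
at the source level. A minimal sketch, with a hypothetical kernel:

.. code-block:: c++

  __device__ __attribute__((always_inline)) float square(float v) {
    return v * v;
  }

  __global__ void squares(float* x) {
    // Force full unrolling of this fixed-trip-count loop.
    #pragma unroll
    for (int i = 0; i < 4; ++i) {
      // square() is inlined regardless of the inliner's cost model.
      x[i] = square(x[i]);
    }
  }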
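
Finally, the effect of bypassing 64-bit divides corresponds roughly to the
following source-level rewrite. The transformation actually runs on LLVM IR,
so this is only a conceptual sketch:

.. code-block:: c++

  #include <cstdint>

  __device__ std::uint64_t divide(std::uint64_t n, std::uint64_t d) {
    // If both operands fit in 32 bits, take the much cheaper 32-bit path.
    if (((n | d) >> 32) == 0) {
      return static_cast<std::uint32_t>(n) / static_cast<std::uint32_t>(d);
    }
    return n / d; // Slow 64-bit divide.
  }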