===================================
Compiling CUDA C/C++ with LLVM
===================================

.. contents::
   :local:

Introduction
============

This document describes both how to compile CUDA C/C++ with LLVM and some of
the internals of LLVM's CUDA support. It is aimed at both users who want to
compile CUDA with LLVM and developers who want to improve LLVM for GPUs. This
document assumes a basic familiarity with CUDA. Information about CUDA
programming can be found in the `CUDA programming guide
<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.

How to Build LLVM with CUDA Support
===================================

CUDA support is still in development and works best with the trunk version of
LLVM. Below is a quick summary of downloading and building the trunk version.
Consult the `Getting Started <http://llvm.org/docs/GettingStarted.html>`_ page
for more details on setting up LLVM.

#. Check out LLVM:

   .. code-block:: console

     $ cd where-you-want-llvm-to-live
     $ svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm

#. Check out Clang:

   .. code-block:: console

     $ cd where-you-want-llvm-to-live
     $ cd llvm/tools
     $ svn co http://llvm.org/svn/llvm-project/cfe/trunk clang

#. Configure and build LLVM and Clang:

   .. code-block:: console

     $ cd where-you-want-llvm-to-live
     $ mkdir build
     $ cd build
     $ cmake [options] ..
     $ make

How to Compile CUDA C/C++ with LLVM
===================================

We assume you have installed the CUDA driver and runtime. Consult the `NVIDIA
CUDA installation guide
<https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ if
you have not.

Suppose you want to compile and run the following CUDA program (``axpy.cu``)
which multiplies a ``float`` array by a ``float`` scalar (AXPY).

.. code-block:: c++

  #include <iostream>

  __global__ void axpy(float a, float* x, float* y) {
    y[threadIdx.x] = a * x[threadIdx.x];
  }

  int main(int argc, char* argv[]) {
    const int kDataLen = 4;

    float a = 2.0f;
    float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
    float host_y[kDataLen];

    // Copy input data to device.
    float* device_x;
    float* device_y;
    cudaMalloc(&device_x, kDataLen * sizeof(float));
    cudaMalloc(&device_y, kDataLen * sizeof(float));
    cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
               cudaMemcpyHostToDevice);

    // Launch the kernel.
    axpy<<<1, kDataLen>>>(a, device_x, device_y);

    // Copy output data to host.
    cudaDeviceSynchronize();
    cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
               cudaMemcpyDeviceToHost);

    // Print the results.
    for (int i = 0; i < kDataLen; ++i) {
      std::cout << "y[" << i << "] = " << host_y[i] << "\n";
    }

    cudaDeviceReset();
    return 0;
  }

The command line for compilation is similar to what you would use for C++.

.. code-block:: console

  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \
      -L<CUDA install path>/<lib64 or lib> \
      -lcudart_static -ldl -lrt -pthread
  $ ./axpy
  y[0] = 2
  y[1] = 4
  y[2] = 6
  y[3] = 8

``<CUDA install path>`` is the root directory where you installed the CUDA
SDK, typically ``/usr/local/cuda``. ``<GPU arch>`` is `the compute capability
of your GPU <https://developer.nvidia.com/cuda-gpus>`_. For example, if you
want to run your program on a GPU with compute capability 3.5, you should
specify ``--cuda-gpu-arch=sm_35``.
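
If you are not sure which compute capability your GPU has, you can also query
it at run time with the CUDA runtime API. The program below is a minimal
sketch for that purpose; it is our own illustration and not part of the CUDA
SDK or of ``axpy.cu``.

.. code-block:: c++

  #include <iostream>

  int main() {
    cudaDeviceProp prop;
    // Query the properties of device 0 and check the returned error code.
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
      std::cerr << "failed to query device 0\n";
      return 1;
    }
    // A GPU with compute capability 3.5 prints "sm_35".
    std::cout << "sm_" << prop.major << prop.minor << "\n";
    return 0;
  }

Save it as a ``.cu`` file and build it with the same kind of command line
shown above; the printed value (for example ``sm_35``) is what you pass to
``--cuda-gpu-arch``.
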

Detecting clang vs NVCC
=======================

Although clang's CUDA implementation is largely compatible with NVCC's, you
may still want to detect when you're compiling CUDA code specifically with
clang.

This is tricky, because NVCC may invoke clang as part of its own compilation
process! For example, NVCC uses the host compiler's preprocessor when
compiling for device code, and that host compiler may in fact be clang.

When clang is actually compiling CUDA code -- rather than being used as a
subtool of NVCC -- it defines the ``__CUDA__`` macro. ``__CUDA_ARCH__`` is
defined only in device mode (but will be defined if NVCC is using clang as a
preprocessor). So you can use the following incantations to detect clang CUDA
compilation, in host and device modes:

.. code-block:: c++

  #if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__)
  // clang compiling CUDA code, host mode.
  #endif

  #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
  // clang compiling CUDA code, device mode.
  #endif

Both clang and NVCC define ``__CUDACC__`` during CUDA compilation. You can
detect NVCC specifically by looking for ``__NVCC__``.

Flags that control numerical code
=================================

If you're using GPUs, you probably care about making numerical code run fast.
GPU hardware allows for more control over numerical operations than most
CPUs, but this results in more compiler options for you to juggle.

Flags you may wish to tweak include:

* ``-ffp-contract={on,off,fast}`` (defaults to ``fast`` on host and device
  when compiling CUDA). Controls whether the compiler emits fused multiply-add
  operations.

  * ``off``: never emit fma operations, and prevent ptxas from fusing multiply
    and add instructions.
  * ``on``: fuse multiplies and adds within a single statement, but never
    across statements (C11 semantics). Prevent ptxas from fusing other
    multiplies and adds.
  * ``fast``: fuse multiplies and adds wherever profitable, even across
    statements. Doesn't prevent ptxas from fusing additional multiplies and
    adds.

  Fused multiply-add instructions can be much faster than the unfused
  equivalents, but because the intermediate result in an fma is not rounded,
  this flag can affect numerical code. (See the sketch after this list for a
  concrete illustration.)

* ``-fcuda-flush-denormals-to-zero`` (default: off) When this is enabled,
  floating point operations may flush `denormal
  <https://en.wikipedia.org/wiki/Denormal_number>`_ inputs and/or outputs to 0.
  Operations on denormal numbers are often much slower than the same
  operations on normal numbers.

* ``-fcuda-approx-transcendentals`` (default: off) When this is enabled, the
  compiler may emit calls to faster, approximate versions of transcendental
  functions, instead of using the slower, fully IEEE-compliant versions. For
  example, this flag allows clang to emit the ptx ``sin.approx.f32``
  instruction.

  This is implied by ``-ffast-math``.
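
As a concrete illustration of ``-ffp-contract``, consider the kernel below.
It is only a sketch; the kernel name ``mad`` and its parameters are ours and
not part of the example program above.

.. code-block:: c++

  // Illustrative only; not part of axpy.cu above.
  // With -ffp-contract=fast (the CUDA default), a * x[i] + y[i] may be
  // compiled to a single fma instruction, which skips the rounding of the
  // intermediate product a * x[i].
  // With -ffp-contract=off, the multiply and add stay separate, and ptxas
  // is also prevented from fusing them.
  __global__ void mad(float a, const float* x, const float* y,
                      float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
      out[i] = a * x[i] + y[i];
    }
  }

One way to observe the difference is to compile the kernel under each setting
and compare the ``fma`` versus ``mul``/``add`` instructions in the generated
device code (for example with ``-S --cuda-device-only``, if your clang
supports those flags).
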

Optimizations
=============

CPUs and GPUs have different design philosophies and architectures. For
example, a typical CPU has branch prediction and out-of-order execution and
is superscalar, whereas a typical GPU has none of these. Due to such
differences, an optimization pipeline well-tuned for CPUs may not be suitable
for GPUs.

LLVM performs several general and CUDA-specific optimizations for GPUs. The
list below shows some of the more important optimizations for GPUs. Most of
them have been upstreamed to ``lib/Transforms/Scalar`` and
``lib/Target/NVPTX``. A few of them have not been upstreamed due to lack of a
customizable target-independent optimization pipeline.

* **Straight-line scalar optimizations**. These optimizations reduce redundancy
  in straight-line code. Details can be found in the `design document for
  straight-line scalar optimizations <https://goo.gl/4Rb9As>`_.

* **Inferring memory spaces**. `This optimization
  <https://github.com/llvm-mirror/llvm/blob/master/lib/Target/NVPTX/NVPTXInferAddressSpaces.cpp>`_
  infers the memory space of an address so that the backend can emit faster
  special loads and stores from it.

* **Aggressive loop unrolling and function inlining**. Loop unrolling and
  function inlining need to be more aggressive for GPUs than for CPUs because
  control flow transfer on a GPU is more expensive. They also promote other
  optimizations, such as constant propagation and SROA, which sometimes speed
  up code by over 10x. An empirical inline threshold for GPUs is 1100. This
  configuration has yet to be upstreamed with a target-specific optimization
  pipeline. LLVM also provides `loop unrolling pragmas
  <http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
  and ``__attribute__((always_inline))`` for programmers to force unrolling
  and inlining.

* **Aggressive speculative execution**. `This transformation
  <http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_
  is mainly for promoting straight-line scalar optimizations, which are most
  effective on code along dominator paths.

* **Memory-space alias analysis**. `This alias analysis
  <http://reviews.llvm.org/D12414>`_ infers that two pointers in different
  special memory spaces do not alias. It has yet to be integrated into the new
  alias analysis infrastructure; the new infrastructure does not run
  target-specific alias analysis.

* **Bypassing 64-bit divides**. `An existing optimization
  <http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_
  enabled in the NVPTX backend. 64-bit integer divides are much slower than
  32-bit ones on NVIDIA GPUs due to the lack of a divide unit. Many of the
  64-bit divides in our benchmarks have a divisor and dividend which fit in
  32 bits at run time. This optimization provides a fast path for this common
  case.

Publication
===========

| `gpucc: An Open-Source GPGPU Compiler <http://dl.acm.org/citation.cfm?id=2854041>`_
| Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt
| *Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO 2016)*
| `Slides for the CGO talk <http://wujingyue.com/docs/gpucc-talk.pdf>`_

Tutorial
========

`CGO 2016 gpucc tutorial <http://wujingyue.com/docs/gpucc-tutorial.pdf>`_

Obtaining Help
==============

To obtain help on LLVM in general and its CUDA support, see `the LLVM
community <http://llvm.org/docs/#mailing-lists>`_.