CUDA Module Introduction {#cuda_intro}
========================

General Information
-------------------

The OpenCV CUDA module is a set of classes and functions that use CUDA computational capabilities.
It is implemented using the NVIDIA\* CUDA\* Runtime API and supports only NVIDIA GPUs. The OpenCV CUDA
module includes utility functions, low-level vision primitives, and high-level algorithms. The
utility functions and low-level primitives provide a powerful infrastructure for developing fast
vision algorithms that take advantage of CUDA, whereas the high-level functionality includes some
state-of-the-art algorithms (such as stereo correspondence, face and people detectors, and others)
ready to be used by application developers.

The CUDA module is designed as a host-level API. This means that if you have pre-compiled OpenCV
CUDA binaries, you are not required to have the CUDA Toolkit installed or write any extra code to
make use of CUDA.

The OpenCV CUDA module is designed for ease of use and does not require any knowledge of CUDA.
However, such knowledge is certainly useful for handling non-trivial cases or achieving the highest
performance. It is helpful to understand the cost of various operations, what the GPU does, what the
preferred data formats are, and so on. The CUDA module is an effective instrument for quick
implementation of CUDA-accelerated computer vision algorithms. However, if your algorithm involves
many simple operations, then, for the best possible performance, you may still need to write your
own kernels to avoid extra write and read operations on the intermediate results.

To enable CUDA support, configure OpenCV using CMake with WITH\_CUDA=ON. When the flag is set and
CUDA is installed, the full-featured OpenCV CUDA module is built. Otherwise, the module is still
built, but at runtime all functions from the module throw Exception with the CV\_GpuNotSupported
error code, except for cuda::getCudaEnabledDeviceCount(), which returns a GPU count of zero in
this case. Building OpenCV without CUDA support does not perform device code compilation, so it does
not require the CUDA Toolkit to be installed. Therefore, using the cuda::getCudaEnabledDeviceCount()
function, you can implement a high-level algorithm that detects GPU presence at runtime and
chooses an appropriate implementation (CPU or GPU) accordingly.
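The runtime dispatch described above can be sketched as follows. This is a minimal illustration,
not part of OpenCV: `runOnBestBackend` and its callable parameters are hypothetical names, and the
device count is passed in as a plain integer so that the sketch compiles without OpenCV headers.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper (not an OpenCV function): runs gpuImpl when at least one
// CUDA-capable device is reported and cpuImpl otherwise.
// cudaDeviceCount is assumed to be the value returned by
// cv::cuda::getCudaEnabledDeviceCount(), which is zero when OpenCV was built
// without CUDA support. Any non-positive count selects the CPU path.
inline int runOnBestBackend(int cudaDeviceCount,
                            const std::function<int()>& cpuImpl,
                            const std::function<int()>& gpuImpl)
{
    assert(cpuImpl && gpuImpl);  // both implementations must be callable
    return cudaDeviceCount > 0 ? gpuImpl() : cpuImpl();
}
```

In an application, the first argument would typically be `cv::cuda::getCudaEnabledDeviceCount()`,
and the two callables would wrap the CPU and CUDA variants of the same algorithm.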

Compilation for Different NVIDIA\* Platforms
--------------------------------------------

The NVIDIA\* compiler can generate binary code (cubin and fatbin) and intermediate code (PTX).
Binary code often implies a specific GPU architecture and generation, so compatibility with
other GPUs is not guaranteed. PTX targets a virtual platform that is defined entirely by its
set of capabilities or features. Depending on the selected virtual platform, some
instructions are emulated or disabled, even if the real hardware supports all the features.

At the first call, the PTX code is compiled to binary code for the particular GPU using a JIT
compiler. When the target GPU has a compute capability (CC) lower than the one the PTX code was
generated for, JIT compilation fails. By default, the OpenCV CUDA module includes:

-   Binaries for compute capabilities 1.3 and 2.0 (controlled by CUDA\_ARCH\_BIN in CMake)
-   PTX code for compute capabilities 1.1 and 1.3 (controlled by CUDA\_ARCH\_PTX in CMake)

This means that for devices with CC 1.3 and 2.0, binary images are ready to run. For all newer
platforms, the PTX code for 1.3 is JIT-compiled to a binary image. For devices with CC 1.1 and 1.2,
the PTX for 1.1 is JIT-compiled. For devices with CC 1.0, no code is available and the functions
throw Exception. On platforms that require JIT compilation, the first run is slow.
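The selection rules above can be modeled with a small sketch. It is a simplification that follows
this text only: a binary is used on an exact CC match, otherwise the highest PTX version not above
the device CC is JIT-compiled. The names `LoadKind`, `LoadResult`, and `resolveImage` are
hypothetical, the CC is encoded as major * 10 + minor, and the actual CUDA runtime applies more
detailed compatibility rules than this model.

```cpp
#include <cassert>
#include <vector>

// Hypothetical types for illustration; not part of OpenCV or the CUDA runtime.
enum class LoadKind { Binary, JitFromPtx, Unavailable };

struct LoadResult {
    LoadKind kind;
    int arch;  // CC of the image used (major * 10 + minor), 0 when Unavailable
};

// Simplified model of the rules in the text:
// 1. a binary built for exactly the device CC runs as is;
// 2. otherwise the highest PTX version that does not exceed the device CC is JIT-compiled;
// 3. otherwise no code is available.
inline LoadResult resolveImage(int deviceCc,
                               const std::vector<int>& binArchs,
                               const std::vector<int>& ptxArchs)
{
    assert(deviceCc > 0);

    for (int bin : binArchs)
        if (bin == deviceCc)
            return LoadResult{LoadKind::Binary, bin};

    int bestPtx = 0;
    for (int ptx : ptxArchs)
        if (ptx <= deviceCc && ptx > bestPtx)
            bestPtx = ptx;

    if (bestPtx != 0)
        return LoadResult{LoadKind::JitFromPtx, bestPtx};
    return LoadResult{LoadKind::Unavailable, 0};
}
```

With the default lists (binaries 13 and 20, PTX 11 and 13), a device with CC 1.2 resolves to JIT
compilation from PTX 1.1, and a device with CC 1.0 resolves to `Unavailable`, matching the cases
listed above.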

On a GPU with CC 1.0, you can still compile the CUDA module, and most of the functions will run
correctly. To achieve this, add "1.0" to the list of binaries, for example,
CUDA\_ARCH\_BIN="1.0 1.3 2.0". The functions that cannot run on CC 1.0 GPUs throw an exception.

You can always determine at runtime whether the OpenCV CUDA binaries (or PTX code) are
compatible with your GPU. The function cuda::DeviceInfo::isCompatible returns the compatibility
status (true/false).

Utilizing Multiple GPUs
-----------------------

In the current version, each of the OpenCV CUDA algorithms can use only a single GPU. So, to utilize
multiple GPUs, you have to manually distribute the work between GPUs. Use the cuda::setDevice()
function to switch the active device. For more details, see the CUDA C Programming Guide.

While developing algorithms for multiple GPUs, keep the data-transfer overhead in mind. For
primitive functions and small images, it can be significant enough to eliminate all the advantages
of having multiple GPUs. But for high-level algorithms, consider using multi-GPU acceleration. For
example, the Stereo Block Matching algorithm has been successfully parallelized using the following
scheme:

1.  Split each image of the stereo pair into two horizontal overlapping stripes.
2.  Process each pair of stripes (from the left and right images) on a separate Fermi\* GPU.
3.  Merge the results into a single disparity map.

With this algorithm, a dual GPU gave a 180% performance increase compared to a single Fermi GPU.
For a source code example, see <https://github.com/Itseez/opencv/tree/master/samples/gpu/>.
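The stripe-splitting step of the multi-GPU scheme can be sketched as below. This is an assumption
about how the split might be computed, not the code of the linked sample: `RowRange`, `StripePair`,
`splitIntoStripes`, and the `overlap` parameter are hypothetical, and the text does not state how
many rows the stripes share.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper types for illustration.
struct RowRange { int begin; int end; };            // half-open row interval [begin, end)
struct StripePair { RowRange top; RowRange bottom; };

// Splits an image of `rows` rows into two horizontal stripes.
// Each stripe extends `overlap` rows past the middle row, clamped to the image,
// so the two stripes share up to 2 * overlap rows.
inline StripePair splitIntoStripes(int rows, int overlap)
{
    assert(rows >= 2 && overlap >= 0);
    const int mid = rows / 2;
    const RowRange top{0, std::min(rows, mid + overlap)};
    const RowRange bottom{std::max(0, mid - overlap), rows};
    return StripePair{top, bottom};
}
```

Each stripe would then be processed after selecting its GPU with cuda::setDevice(), and the
overlapping rows would be discarded or blended when merging the two partial disparity maps.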