Similarity check (PSNR and SSIM) on the GPU {#tutorial_gpu_basics_similarity}
===========================================
@todo update this tutorial

Goal
----

In the @ref tutorial_video_input_psnr_ssim tutorial I already presented the PSNR and SSIM methods for checking
the similarity between two images. As you could see there, performing these takes quite some
time, especially in the case of the SSIM. However, if the performance of the OpenCV
implementation for the CPU does not satisfy you and you happen to have an NVIDIA CUDA GPU device in
your system, all is not lost. You may try to port or write your algorithm for the video card.

This tutorial will give you a good grasp of how to approach coding with the GPU module of OpenCV. As
a prerequisite you should already know how to handle the core, highgui and imgproc modules. So, our
goals are:

-   See what is different compared to the CPU
-   Create the GPU code for the PSNR and SSIM
-   Optimize the code for maximal performance

The source code
---------------

You may also find the source code and the video files in the
`samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity` folder of the OpenCV
source library or download it from [here](https://github.com/Itseez/opencv/tree/master/samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity.cpp).
The full source code is quite long (due to the handling of the command line
arguments and the performance measurement). Therefore, to avoid cluttering up these sections with those,
you'll find here only the functions themselves.

The PSNR returns a float number that, if the two inputs are similar, lies between 30 and 50 (higher is
better).

@snippet samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity.cpp getpsnr
@snippet samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity.cpp getpsnrcuda
@snippet samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity.cpp psnr
@snippet samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity.cpp getpsnropt

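For reference, the quantity that each of these functions computes can be sketched in plain C++ without OpenCV. This is a minimal sketch and not the sample's code: the name `psnrFromBytes` is hypothetical, it assumes 8-bit samples (peak value 255) stored in flat arrays, and returning 0 for identical inputs is a convention assumed here to avoid dividing by zero.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: PSNR in decibels between two equally sized 8-bit buffers.
// PSNR = 10 * log10(peak^2 / MSE), with peak = 255 for 8-bit data.
double psnrFromBytes(const std::vector<std::uint8_t>& a,
                     const std::vector<std::uint8_t>& b)
{
    if (a.empty() || a.size() != b.size())
        throw std::invalid_argument("buffers must be non-empty and equally sized");

    // Sum of squared differences over every sample.
    double sse = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        sse += d * d;
    }

    // Assumed convention: identical inputs report 0 instead of infinity.
    if (sse == 0.0)
        return 0.0;

    const double mse = sse / static_cast<double>(a.size());
    return 10.0 * std::log10((255.0 * 255.0) / mse);
}
```
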
The SSIM returns the MSSIM of the images. This too is a float number between zero and one (higher is
better); however, we have one for each channel. Therefore, we return a *Scalar* OpenCV data
structure:

@snippet samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity.cpp getssim
@snippet samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity.cpp getssimcuda
@snippet samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity.cpp ssim
@snippet samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity.cpp getssimopt

How to do it? - The GPU
-----------------------

As you can see, we have three types of functions for each operation: one for the CPU and two for
the GPU. The reason I made two for the GPU is to illustrate that simply porting your CPU code to the
GPU will often make it slower. If you want some performance gain you will need to remember a few
rules, which I'm going to detail later on.

The GPU module was developed so that it resembles its CPU counterpart as much as possible. This
makes porting easy. The first thing you need to do before writing any code is
to link the GPU module to your project and include the header file for the module. All the
functions and data structures of the GPU are in a *gpu* sub namespace of the *cv* namespace. You may
add this to the default one via the *using namespace* directive, or mark it everywhere explicitly via
the cv:: prefix to avoid confusion. I'll do the latter.
@code{.cpp}
#include <opencv2/gpu.hpp>        // GPU structures and methods
@endcode

GPU stands for "graphics processing unit". It was originally built to render graphical
scenes. These scenes are built from a lot of data. However, the data items do not all depend on one
another in a sequential way, so they can be processed in parallel. For this reason a
GPU contains many smaller processing units. These aren't state of the art processors, and
in a one on one test with a CPU each of them will fall behind. However, the GPU's strength lies in its numbers. In
recent years there has been an increasing trend to harness this massive parallel power of the
GPU for tasks other than rendering graphical scenes. This gave birth to general-purpose computation on
graphics processing units (GPGPU).

The GPU has its own memory. When you read data from the hard drive with OpenCV into a *Mat* object,
that takes place in your system's memory. The CPU works more or less directly on this (via its cache);
however, the GPU cannot. It has to transfer the information it will use for calculations from the
system memory to its own. This is done via an upload process and takes time. In the end the result
has to be downloaded back to your system memory for your CPU to see and use it. Porting
small functions to the GPU is not recommended, as the upload/download time will be larger than the time
you gain by parallel execution.

Mat objects are stored only in the system memory (or the CPU cache). To get an OpenCV matrix to
the GPU you'll need to use its GPU counterpart @ref cv::cuda::GpuMat . It works similarly to the Mat, with a
2D only limitation and no reference returning for its functions (you cannot mix GPU references with CPU
ones). To upload a Mat object to the GPU you need to call the upload function after creating an
instance of the class. To download you may use simple assignment to a Mat object or use the download
function.
@code{.cpp}
Mat I1;          // Main memory item - read an image into it with imread, for example
gpu::GpuMat gI1; // GPU matrix - empty for now
gI1.upload(I1);  // Upload the data from the system memory to the GPU memory

I1 = gI1;        // Download, gI1.download(I1) will work too
@endcode
Once you have your data up in the GPU memory you may call the GPU enabled functions of OpenCV. Most of
the functions keep the same name as on the CPU, with the difference that they only accept
*GpuMat* inputs. You will find a full list of these in the documentation: [online
here](http://docs.opencv.org/modules/gpu/doc/gpu.html) or in the OpenCV reference manual that comes
with the source code.

Another thing to keep in mind is that you cannot make efficient algorithms on the GPU for every
channel count. Generally, I found that the input images for the GPU need to be either one or
four channel ones, with char or float as the item type. There is no double support on the
GPU, sorry. Passing other types of objects to some functions will result in an exception being thrown
and an error message on the error output. In most places the documentation details the types
accepted for the inputs. If you have three channel images as an input you can do two things: either
add a new channel (and use char elements) or split up the image and call the function for each
resulting single channel image. The first one isn't really recommended as you waste memory.

For some functions, where the position of the elements (neighbor items) doesn't matter, a quick
solution is to just reshape the image into a single channel one. This is the case for the PSNR
implementation, where for the *absdiff* method the value of the neighbors is not important. However,
for the *GaussianBlur* this isn't an option, so we need to use the split method for the SSIM. With
this knowledge you can already write viable GPU code (like my first GPU version) and run it. You'll be
surprised to see that it might turn out slower than your CPU implementation.

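Why reshaping is safe for an element-wise operation such as *absdiff* can be checked on the CPU with plain C++. The sketch below is illustrative only: `sumAbsDiffPerPixel` and `sumAbsDiffFlat` are hypothetical names, and an interleaved three channel image is modelled as a flat array of bytes. Because each output depends only on the two inputs at the same index, walking the data pixel by pixel or as one long single channel row gives the same sum. A neighborhood filter such as a blur would mix adjacent channels in the flat view, which is why the SSIM needs the split.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Both functions assume a.size() == b.size() and, for the per-pixel view,
// that the size is a multiple of 3 (interleaved 3-channel data).

// Sum of absolute differences, visiting the data as 3-channel pixels.
long sumAbsDiffPerPixel(const std::vector<std::uint8_t>& a,
                        const std::vector<std::uint8_t>& b)
{
    long total = 0;
    const std::size_t pixels = a.size() / 3;
    for (std::size_t p = 0; p < pixels; ++p)
        for (std::size_t c = 0; c < 3; ++c)
            total += std::abs(static_cast<int>(a[3 * p + c]) -
                              static_cast<int>(b[3 * p + c]));
    return total;
}

// The same sum, treating the data as one single-channel row.
long sumAbsDiffFlat(const std::vector<std::uint8_t>& a,
                    const std::vector<std::uint8_t>& b)
{
    long total = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        total += std::abs(static_cast<int>(a[i]) - static_cast<int>(b[i]));
    return total;
}
```
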
Optimization
------------

The reason for this is the price of memory allocation and data transfer, which this version pays
needlessly. And on the GPU this price is very high. Another possibility for optimization is to
introduce asynchronous OpenCV GPU calls with the help of the @ref cv::cuda::Stream.

-#  Memory allocation on the GPU is expensive. Therefore, if possible, allocate new memory as
    few times as possible. If you create a function that you intend to call multiple times, it is a
    good idea to allocate any local variables of the function only once, during the first call. To
    do this you create a data structure containing all the local variables you will use. For
    instance, in the case of the PSNR these are:
    @code{.cpp}
    struct BufferPSNR                                     // Optimized GPU versions
    {   // Data allocations are very expensive on the GPU. Use a buffer to solve this: allocate once, reuse later.
        gpu::GpuMat gI1, gI2, gs, t1,t2;

        gpu::GpuMat buf;
    };
    @endcode
    Then create an instance of this in the main program:
    @code{.cpp}
    BufferPSNR bufferPSNR;
    @endcode
    And finally pass this to the function each time you call it:
    @code{.cpp}
    double getPSNR_GPU_optimized(const Mat& I1, const Mat& I2, BufferPSNR& b)
    @endcode
    Now you access these local variables as *b.gI1*, *b.buf* and so on. The GpuMat will only
    reallocate itself on a new call if the new matrix size is different from the previous one.

-#  Avoid unnecessary data transfers. Even a small data transfer becomes significant once
    you go to the GPU. Therefore, if possible, make all calculations in-place (in other words do not
    create new memory objects - for the reasons explained in the previous point). For example, although
    arithmetical operations may be easier to express as one line formulas, this will be
    slower. In the case of the SSIM at one point I need to calculate:
    @code{.cpp}
    b.t1 = 2 * b.mu1_mu2 + C1;
    @endcode
    Although the call above will succeed, observe that there is a hidden data transfer present.
    Before it makes the addition it needs to store the result of the multiplication somewhere. Therefore, it will
    create a local matrix in the background, add the *C1* value to that and finally assign that to
    *t1*. To avoid this we use the gpu functions instead of the arithmetic operators:
    @code{.cpp}
    gpu::multiply(b.mu1_mu2, 2, b.t1); //b.t1 = 2 * b.mu1_mu2 + C1;
    gpu::add(b.t1, C1, b.t1);
    @endcode
-#  Use asynchronous calls (the @ref cv::cuda::Stream ). By default, whenever you call a gpu function
    it will wait for the call to finish and return with the result afterwards. However, it is
    possible to make asynchronous calls, meaning the function will request the execution of the operation, make the
    costly data allocations for the algorithm and return right away. Now you can call another
    function if you wish to do so. For the MSSIM this is a small optimization point. In our default
    implementation we split up the image into channels and then call the gpu
    functions for each channel. A small degree of parallelization is possible with the stream. By using a stream we
    can perform the data allocation and upload operations while the GPU is already executing a given
    method. For example, we need to upload two images. We queue these one after another and already call
    the function that processes them. The function will wait for the upload to finish;
    however, while that happens, the output buffer allocations are made for the function to be executed
    next.
    @code{.cpp}
    gpu::Stream stream;

    stream.enqueueConvert(b.gI1, b.t1, CV_32F);    // Upload

    gpu::split(b.t1, b.vI1, stream);              // Methods (pass the stream as final parameter).
    gpu::multiply(b.vI1[i], b.vI1[i], b.I1_2, stream);        // I1^2
    @endcode

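The first two rules can be tried out on the CPU without a GPU at hand. The following is a minimal sketch in plain C++, not the sample's code: `Workspace`, `ensureSize` and `scaleAndShift` are hypothetical names, and `std::vector` merely stands in for *GpuMat*. The scratch buffer is only resized when the requested size changes, and the arithmetic is done in two in-place passes instead of building a temporary for the intermediate product.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the tutorial's buffer struct:
// it keeps scratch storage alive between calls.
struct Workspace {
    std::vector<float> t1;
    int allocations = 0; // how many times the buffer had to be resized

    // Resize only when the requested size differs from the current one.
    void ensureSize(std::size_t n) {
        if (t1.size() != n) {
            t1.assign(n, 0.0f);
            ++allocations;
        }
    }
};

// Computes 2 * src + c into w.t1 with two in-place passes,
// mirroring a multiply call followed by an add call on the same buffer.
void scaleAndShift(const std::vector<float>& src, float c, Workspace& w)
{
    w.ensureSize(src.size());
    for (std::size_t i = 0; i < src.size(); ++i)
        w.t1[i] = 2.0f * src[i]; // multiply step, written straight into t1
    for (std::size_t i = 0; i < src.size(); ++i)
        w.t1[i] += c;            // add step, in place on t1
}
```
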
Result and conclusion
---------------------

On an Intel P8700 laptop CPU paired with a low end NVIDIA GT220M, here are the performance numbers:
@code
Time of PSNR CPU (averaged for 10 runs): 41.4122 milliseconds. With result of: 19.2506
Time of PSNR GPU (averaged for 10 runs): 158.977 milliseconds. With result of: 19.2506
Initial call GPU optimized:              31.3418 milliseconds. With result of: 19.2506
Time of PSNR GPU OPTIMIZED ( / 10 runs): 24.8171 milliseconds. With result of: 19.2506

Time of MSSIM CPU (averaged for 10 runs): 484.343 milliseconds. With result of B0.890964 G0.903845 R0.936934
Time of MSSIM GPU (averaged for 10 runs): 745.105 milliseconds. With result of B0.89922 G0.909051 R0.968223
Time of MSSIM GPU Initial Call            357.746 milliseconds. With result of B0.890964 G0.903845 R0.936934
Time of MSSIM GPU OPTIMIZED ( / 10 runs): 203.091 milliseconds. With result of B0.890964 G0.903845 R0.936934
@endcode
The optimized GPU versions ran about 1.7 times as fast as the CPU implementation for the PSNR
(24.8 versus 41.4 milliseconds) and about 2.4 times as fast for the MSSIM (203.1 versus 484.3
milliseconds). It may be just the improvement needed for your application to work. You may observe a runtime
instance of this on [YouTube here](https://www.youtube.com/watch?v=3_ESXmFlnvY).

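The speedups can be recomputed from the timings listed above. The small sketch below does just that; `speedup` is a hypothetical helper, and the constants are copied from the CPU and optimized GPU rows of the measurements.

```cpp
#include <stdexcept>

// Hypothetical helper: how many times faster the optimized run is than the baseline.
double speedup(double baselineMs, double optimizedMs)
{
    if (optimizedMs <= 0.0)
        throw std::invalid_argument("optimized time must be positive");
    return baselineMs / optimizedMs;
}

// Timings in milliseconds, taken from the measurements listed above.
const double psnrCpuMs = 41.4122;
const double psnrGpuOptimizedMs = 24.8171;
const double mssimCpuMs = 484.343;
const double mssimGpuOptimizedMs = 203.091;

// speedup(psnrCpuMs, psnrGpuOptimizedMs)   is roughly 1.67
// speedup(mssimCpuMs, mssimGpuOptimizedMs) is roughly 2.38
```
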
\htmlonly
<div align="center">
<iframe title="Similarity check (PSNR and SSIM) on the GPU" width="560" height="349" src="http://www.youtube.com/embed/3_ESXmFlnvY?rel=0&loop=1" frameborder="0" allowfullscreen align="middle"></iframe>
</div>
\endhtmlonly
