Home | History | Annotate | Download | only in hs
      1 
      2 # HotSort
      3 
      4 HotSort is a high-performance GPU-accelerated integer sorting library
      5 for Vulkan, CUDA and OpenCL compute APIs.
      6 
      7 HotSort's advantages include:
      8 
      9 * Ultra-fast sorting of 32bit or 64bit keys
     10 * Reaches peak throughput on small arrays
     11 * In-place sorting for low-memory environments
     12 * Strong scaling with number of multiprocessors
     13 * Low memory transactions relative to array size
     14 * A concurrency-friendly dense kernel grid
     15 * Support for GPU post-processing of sorted results
     16 
     17 HotSort is typically significantly faster than other GPU-accelerated
     18 implementations when sorting arrays of smaller than 500K-2M keys.
     19 
     20 ## Benchmarks
     21 
     22 ### Throughput
     23 
     24 Here is a throughput plot for HotSort on Vulkan and CUDA sorting
     25 32-bit and 64-bit keys on a 640-core Quadro M1200:
     26 
     27 ![](images/hs_nvidia_sm35_u32_mkeys.svg)
     28 ![](images/hs_nvidia_sm35_u64_mkeys.svg)
     29 
     30 HotSort throughput on Vulkan on an AMD V1807B APU (similar to a Ryzen 2400G) with DDR4-2666 RAM:
     31 
     32 ![](images/hs_amd_gcn_mkeys.svg)
     33 
     34 HotSort throughput on Vulkan and OpenCL on an Intel HD 630:
     35 
     36 ![](images/hs_intel_gen8_mkeys.svg)
     37 
     38 ### Execution time
     39 
     40 Note that these sorting rates translate to sub-millisecond to
     41 multi-millisecond execution times on small GPUs:
     42 
     43 ![](images/hs_nvidia_sm35_u32_msecs.svg)
     44 ![](images/hs_nvidia_sm35_u64_msecs.svg)
     45 ![](images/hs_amd_gcn_msecs.svg)
     46 ![](images/hs_intel_gen8_msecs.svg)
     47 
     48 # Usage
     49 
     50 There are HotSort implementations for Vulkan, CUDA and OpenCL.
     51 
     52 Note that HotSort is a comparison sort and supports in-place sorting.
     53 
     54 There are also benchmarking examples for the
     55 [Vulkan](vk/bench/main.c), [CUDA](cuda/bench/main.c) and
     56 [OpenCL](cl/bench/main.c) implementations.
     57 
     58 *Not all targeted architectures have been tested.*
     59 
     60 ## Vulkan
     61 
     62 The following architectures are supported:
     63 
     64 Vendor | Architecture                              | 32bit             | 64bit             | 32+32bit   | Notes
     65 -------|-------------------------------------------|:------------------:|:------------------:|:-----------:|------
     66 NVIDIA | sm_35,sm_37,sm_50,sm_52,sm_60,sm_61,sm_70 | :white_check_mark: | :white_check_mark: | :x:         | Not tested on all architectures
     67 NVIDIA | sm_30,sm_32,sm_53,sm_62                   | :x:                | :x:                | :x:         | Need to generate properly shaped kernels
     68 AMD    | GCN                                       | :white_check_mark: | :white_check_mark: | :x:         | Tested on Linux MESA 18.2
     69 Intel  | GEN8+                                     | :white_check_mark: | :white_check_mark: | :x:         | Good but the assumed *best-shaped* kernels aren't being used due to a compiler issue
     70 Intel  | APL/GLK using a 2x9 or 1x12 thread pool   | :x:                | :x:                | :x:         | Need to generate properly shaped kernels
     71 
     72 Add an arch-specific HotSort algorithm (aka "target") to your project
     73 by including a `.c` source and `.h` header file:
     74 
     75 Key Size | Source                                                                 | Header
     76 ---------|------------------------------------------------------------------------|-------------------------------------------------------
     77 32bit   | ```hs/vk/<vendor>/<arch>/u32/hs_<vendor>_<arch>_u32.c``` | ```hs/vk/<vendor>/<arch>/u32/hs_target.h```
     78 64bit   | ```hs/vk/<vendor>/<arch>/u64/hs_<vendor>_<arch>_u64.c``` | ```hs/vk/<vendor>/<arch>/u64/hs_target.h```
     79 
     80 To sort `count` keys on Vulkan:
     81 
     82 ```C
     83 #include "hs/vk/intel/gen8/u32/hs_target.h"
     84 
     85 ...
     86 
     87 // create the Vulkan HotSort target
     88 struct hs_vk * hs = hs_vk_create(<address of target>,...);
     89 
     90 // allocate a descriptor set from a pool
     91 VkDescriptorSet hs_ds = hs_vk_ds_alloc(hs,descriptor_pool);
     92 
     93 ...
     94 
     95 // command buffer begin
     96 
     97 ...
     98 
     99 // bind buffer(s) to a command buffer
    100 hs_vk_ds_bind(hs,hs_ds,cb,vin,vout); // or (...,vin,VK_NULL_HANDLE) for in-place sorting
    101 
    102 // see how much padding may be required
    103 hs_vk_pad(hs,count,&count_padded_in,&count_padded_out);
    104 
    105 // append compute shaders to command buffer
    106 hs_vk_sort(hs,cb,...,vin,...,vout,...); // hs_vk_sort() and hs_vk_ds_bind() must have matching vin/vout args
    107 
    108 ...
    109 
    110 // command buffer end and queue submit
    111 
    112 ...
    113 
    114 // release the Vulkan HotSort target
    115 hs_vk_release(hs,...);
    116 
    117 ```
    118 
    119 The [`hs_vk.h`](vk/hs_vk.h) header file describes these functions in
    120 greater detail.
    121 
    122 ## CUDA
    123 
    124 The following architectures are supported:
    125 
    126 Vendor | Architecture                                          | 32bit             | 64bit             | 32+32bit | Notes
    127 -------|-------------------------------------------------------|:------------------:|:------------------:|:---------:|------
    128 NVIDIA | sm_35,sm_37,sm_50,sm_52,sm_60,sm_61,sm_70             | :white_check_mark: | :white_check_mark: | :x:       |
    129 NVIDIA | sm_30,sm_32,sm_53,sm_62                               | :x:                | :x:                | :x:       | Need to generate properly shaped kernels
    130 
    131 Add an arch-specific HotSort target to your project by including a
    132 `.cu` CUDA source and `.h` header file:
    133 
    134 Key Size | Source                                          | Header
    135 ---------|-------------------------------------------------|------------------------------------------
    136 32bit   | ```hs/cuda/sm_35/u32/hs_cuda_u32.cu``` | ```hs/cuda/sm_35/u32/hs_cuda.h```
    137 64bit   | ```hs/cuda/sm_35/u64/hs_cuda_u64.cu``` | ```hs/cuda/sm_35/u64/hs_cuda.h```
    138 
    139 Usage on CUDA is very simple.
    140 
    141 For example, to sort `count` keys:
    142 
    143 ```C
    144 #include "hs/cuda/sm_35/u32/hs_cuda.h"
    145 
    146 ...
    147 
    148 uint32_t count_padded_in, count_padded_out;
    149 
    150 hs_cuda_pad_u32(count,&count_padded_in,&count_padded_in);
    151 
    152 ...
    153 
    154 hs_cuda_sort_u32(vin,vout, // or (vin,NULL,...) for in-place sorting
    155                  count,count_padded_in,count_padded_out,
    156                  true,
    157                  stream0,stream1,stream2);
    158 ```
    159 
    160 HotSort on CUDA requires two auxilary streams in order to maximize concurrency.
    161 
    162 The algorithm is guaranteed to complete on `stream0`.
    163 
    164 
    165 ## OpenCL
    166 
    167 The following architectures are supported:
    168 
    169 Vendor | Architecture                            | 32bit             | 64bit             | 32+32bit | Notes
    170 -------|-----------------------------------------|:------------------:|:------------------:|:---------:|------
    171 Intel  | GEN8+                                   | :white_check_mark: | :white_check_mark: | :x:       | Due to a fragile compiler, the assumed best kernels are not being generated at this time
    172 Intel  | APL/GLK using a 2x9 or 1x12 thread pool | :x:                | :x:                | :x:       | Need to generate properly shaped kernels
    173 
    174 Add an arch-specific HotSort target to your project by including a
    175 `.c` source and `.h` header file:
    176 
    177 Key Size | Source                                                                 | Header
    178 ---------|------------------------------------------------------------------------|-------------------------------------------------------
    179 32bit   | ```hs/cl/<vendor>/<arch>/u32/hs_<vendor>_<arch>_u32.c``` | ```hs/cl/<vendor>/<arch>/u32/hs_target.h```
    180 64bit   | ```hs/cl/<vendor>/<arch>/u64/hs_<vendor>_<arch>_u64.c``` | ```hs/cl/<vendor>/<arch>/u64/hs_target.h```
    181 
    182 Note that if the symbol `HS_DUMP_SOURCE` is not defined then the
    183 pre-compiled GEN9 binaries will be included.  These binaries may not
    184 be compatible with all GEN8+ devices and drivers.
    185 
    186 To sort `count` keys on OpenCL:
    187 
    188 ```C
    189 // create the OpenCL HotSort target
    190 struct hs_cl * hs = hs_cl_create(<address of target>,...);
    191 
    192 ...
    193 
    194 // see how much padding may be required
    195 hs_cl_pad(hs,count,&count_padded_in,&count_padded_out);
    196 
    197 // enqueue compute kernels
    198 hs_cl_sort(hs,cq,...,vin,vout,...); // or (...,vin,NULL,...) for in-place sorting
    199 
    200 ...
    201 
    202 // release the OpenCL HotSort target
    203 hs_cl_release(hs,...);
    204 
    205 ```
    206 
    207 The [`hs_cl.h`](cl/hs_cl.h) header file describes these functions in
    208 greater detail.
    209 
    210 # Background
    211 
    212 The HotSort sorting algorithm was created in 2012 and generalized in
    213 2015 to support GPUs that benefit from non-power-of-two workgroups.
    214 
    215 The objective of HotSort is to achieve high throughput as *early* as
    216 possible on small GPUs when sorting modestly-sized arrays  1,000s to
    217 100s of thousands of 64bit keys.
    218 
    219 HotSort uses both well-known and obscure properties of bitonic
    220 sequences to create a novel mapping of keys onto data-parallel devices
    221 like GPUs.
    222 
    223 ## Overview
    224 
    225 The overall flow of the HotSort algorithm is below.  Kernel launches
    226 are in italics.
    227 
    228 1. For each workgroup of slabs:
    229    1. For each slab in the workgroup:
    230       1. *Slab Load*
    231       1. *Slab Sort*
    232    1. Until all slabs in the workgroup are merged:
    233       1. *Multi-Slab Flip Merge*
    234    1. *Slab Store*
    235 1. Until all slabs are merged:
    236    1. *Streaming Flip Merge*
    237    1. If necessary, *Streaming Half Merge*
    238    1. If necessary, *Multi-Slab Half Merge*
    239    1. If necessary, *Slab Half Merge*
    240    1. If complete:
    241       1. Optionally, *Report Key Changes*
    242       1. Optionally, *Slab Transpose & Store*
    243    1. Otherwise: *Slab Store*
    244 1. Done
    245 
    246 ## Sorting
    247 
    248 The algorithm begins with a very *dense* per-multiprocessor block
    249 sorting algorithm that loads a "slab" of keys into a subgroup's
    250 registers, sorts the slabs, merges all slabs in the workgroup, and
    251 stores the slabs back to global memory.
    252 
    253 In the slab sorting phase, each lane of a subgroup executes a bitonic
    254 sorting network on its registers and successively merges lanes until
    255 the slab of registers is sorted in serpentine order.
    256 
    257 ![](images/hs_sorted_slab.svg)
    258 
    259 ## Merging
    260 
    261 HotSort has several different merging strategies.
    262 
    263 The merging kernels leverage the multiprocessor's register file by
    264 loading, merging and storing a large number of strided slab rows
    265 without using local memory.
    266 
    267 The merging kernels exploit the bitonic sequence property that
    268 interleaved subsequences of a bitonic sequence are also bitonic
    269 sequences.  This property also holds for non-power-of-two sequences.
    270 
    271 As an example, the *Streaming Flip Merge* kernel is illustrated below:
    272 
    273 ![](images/hs_flip_merge.svg)
    274 
    275 # Future Enhancements
    276 
    277 ## Hybrid improved merging
    278 
    279 HotSort's initial sorting and merging phases are performed on bitonic
    280 sequences.  Because of this, throughput decreases as the problem size
    281 increases.
    282 
    283 A hybrid algorithm that combined HotSort's block sorter and several
    284 rounds of merging with a state-of-the-art GPU merging algorithm would
    285 probably improve the algorithm's performance on larger arrays.
    286 
    287 ## Reenable support for devices lacking shuffle functions
    288 
    289 The original version of HotSort ran on pre-Kepler GPUs without
    290 intra-warp/inter-lane shuffling  reenable this capability.
    291 
    292 ## Metal support
    293 
    294 Modify the HotSort generator to support Metal targets.
    295