1 2 # HotSort 3 4 HotSort is a high-performance GPU-accelerated integer sorting library 5 for Vulkan, CUDA and OpenCL compute APIs. 6 7 HotSort's advantages include: 8 9 * Ultra-fast sorting of 32bit or 64bit keys 10 * Reaches peak throughput on small arrays 11 * In-place sorting for low-memory environments 12 * Strong scaling with number of multiprocessors 13 * Low memory transactions relative to array size 14 * A concurrency-friendly dense kernel grid 15 * Support for GPU post-processing of sorted results 16 17 HotSort is typically significantly faster than other GPU-accelerated 18 implementations when sorting arrays of smaller than 500K-2M keys. 19 20 ## Benchmarks 21 22 ### Throughput 23 24 Here is a throughput plot for HotSort on Vulkan and CUDA sorting 25 32-bit and 64-bit keys on a 640-core Quadro M1200: 26 27 ![](images/hs_nvidia_sm35_u32_mkeys.svg) 28 ![](images/hs_nvidia_sm35_u64_mkeys.svg) 29 30 HotSort throughput on Vulkan on an AMD V1807B APU (similar to a Ryzen 2400G) with DDR4-2666 RAM: 31 32 ![](images/hs_amd_gcn_mkeys.svg) 33 34 HotSort throughput on Vulkan and OpenCL on an Intel HD 630: 35 36 ![](images/hs_intel_gen8_mkeys.svg) 37 38 ### Execution time 39 40 Note that these sorting rates translate to sub-millisecond to 41 multi-millisecond execution times on small GPUs: 42 43 ![](images/hs_nvidia_sm35_u32_msecs.svg) 44 ![](images/hs_nvidia_sm35_u64_msecs.svg) 45 ![](images/hs_amd_gcn_msecs.svg) 46 ![](images/hs_intel_gen8_msecs.svg) 47 48 # Usage 49 50 There are HotSort implementations for Vulkan, CUDA and OpenCL. 51 52 Note that HotSort is a comparison sort and supports in-place sorting. 53 54 There are also benchmarking examples for the 55 [Vulkan](vk/bench/main.c), [CUDA](cuda/bench/main.c) and 56 [OpenCL](cl/bench/main.c) implementations. 57 58 *Not all targeted architectures have been tested.* 59 60 ## Vulkan 61 62 The following architectures are supported: 63 64 Vendor | Architecture | 32bit | 64bit | 32+32bit | Notes 65 -------|-------------------------------------------|:------------------:|:------------------:|:-----------:|------ 66 NVIDIA | sm_35,sm_37,sm_50,sm_52,sm_60,sm_61,sm_70 | :white_check_mark: | :white_check_mark: | :x: | Not tested on all architectures 67 NVIDIA | sm_30,sm_32,sm_53,sm_62 | :x: | :x: | :x: | Need to generate properly shaped kernels 68 AMD | GCN | :white_check_mark: | :white_check_mark: | :x: | Tested on Linux MESA 18.2 69 Intel | GEN8+ | :white_check_mark: | :white_check_mark: | :x: | Good but the assumed *best-shaped* kernels aren't being used due to a compiler issue 70 Intel | APL/GLK using a 2x9 or 1x12 thread pool | :x: | :x: | :x: | Need to generate properly shaped kernels 71 72 Add an arch-specific HotSort algorithm (aka "target") to your project 73 by including a `.c` source and `.h` header file: 74 75 Key Size | Source | Header 76 ---------|------------------------------------------------------------------------|------------------------------------------------------- 77 32bit | ```hs/vk/<vendor>/<arch>/u32/hs_<vendor>_<arch>_u32.c``` | ```hs/vk/<vendor>/<arch>/u32/hs_target.h``` 78 64bit | ```hs/vk/<vendor>/<arch>/u64/hs_<vendor>_<arch>_u64.c``` | ```hs/vk/<vendor>/<arch>/u64/hs_target.h``` 79 80 To sort `count` keys on Vulkan: 81 82 ```C 83 #include "hs/vk/intel/gen8/u32/hs_target.h" 84 85 ... 86 87 // create the Vulkan HotSort target 88 struct hs_vk * hs = hs_vk_create(<address of target>,...); 89 90 // allocate a descriptor set from a pool 91 VkDescriptorSet hs_ds = hs_vk_ds_alloc(hs,descriptor_pool); 92 93 ... 94 95 // command buffer begin 96 97 ... 98 99 // bind buffer(s) to a command buffer 100 hs_vk_ds_bind(hs,hs_ds,cb,vin,vout); // or (...,vin,VK_NULL_HANDLE) for in-place sorting 101 102 // see how much padding may be required 103 hs_vk_pad(hs,count,&count_padded_in,&count_padded_out); 104 105 // append compute shaders to command buffer 106 hs_vk_sort(hs,cb,...,vin,...,vout,...); // hs_vk_sort() and hs_vk_ds_bind() must have matching vin/vout args 107 108 ... 109 110 // command buffer end and queue submit 111 112 ... 113 114 // release the Vulkan HotSort target 115 hs_vk_release(hs,...); 116 117 ``` 118 119 The [`hs_vk.h`](vk/hs_vk.h) header file describes these functions in 120 greater detail. 121 122 ## CUDA 123 124 The following architectures are supported: 125 126 Vendor | Architecture | 32bit | 64bit | 32+32bit | Notes 127 -------|-------------------------------------------------------|:------------------:|:------------------:|:---------:|------ 128 NVIDIA | sm_35,sm_37,sm_50,sm_52,sm_60,sm_61,sm_70 | :white_check_mark: | :white_check_mark: | :x: | 129 NVIDIA | sm_30,sm_32,sm_53,sm_62 | :x: | :x: | :x: | Need to generate properly shaped kernels 130 131 Add an arch-specific HotSort target to your project by including a 132 `.cu` CUDA source and `.h` header file: 133 134 Key Size | Source | Header 135 ---------|-------------------------------------------------|------------------------------------------ 136 32bit | ```hs/cuda/sm_35/u32/hs_cuda_u32.cu``` | ```hs/cuda/sm_35/u32/hs_cuda.h``` 137 64bit | ```hs/cuda/sm_35/u64/hs_cuda_u64.cu``` | ```hs/cuda/sm_35/u64/hs_cuda.h``` 138 139 Usage on CUDA is very simple. 140 141 For example, to sort `count` keys: 142 143 ```C 144 #include "hs/cuda/sm_35/u32/hs_cuda.h" 145 146 ... 147 148 uint32_t count_padded_in, count_padded_out; 149 150 hs_cuda_pad_u32(count,&count_padded_in,&count_padded_in); 151 152 ... 153 154 hs_cuda_sort_u32(vin,vout, // or (vin,NULL,...) for in-place sorting 155 count,count_padded_in,count_padded_out, 156 true, 157 stream0,stream1,stream2); 158 ``` 159 160 HotSort on CUDA requires two auxilary streams in order to maximize concurrency. 161 162 The algorithm is guaranteed to complete on `stream0`. 163 164 165 ## OpenCL 166 167 The following architectures are supported: 168 169 Vendor | Architecture | 32bit | 64bit | 32+32bit | Notes 170 -------|-----------------------------------------|:------------------:|:------------------:|:---------:|------ 171 Intel | GEN8+ | :white_check_mark: | :white_check_mark: | :x: | Due to a fragile compiler, the assumed best kernels are not being generated at this time 172 Intel | APL/GLK using a 2x9 or 1x12 thread pool | :x: | :x: | :x: | Need to generate properly shaped kernels 173 174 Add an arch-specific HotSort target to your project by including a 175 `.c` source and `.h` header file: 176 177 Key Size | Source | Header 178 ---------|------------------------------------------------------------------------|------------------------------------------------------- 179 32bit | ```hs/cl/<vendor>/<arch>/u32/hs_<vendor>_<arch>_u32.c``` | ```hs/cl/<vendor>/<arch>/u32/hs_target.h``` 180 64bit | ```hs/cl/<vendor>/<arch>/u64/hs_<vendor>_<arch>_u64.c``` | ```hs/cl/<vendor>/<arch>/u64/hs_target.h``` 181 182 Note that if the symbol `HS_DUMP_SOURCE` is not defined then the 183 pre-compiled GEN9 binaries will be included. These binaries may not 184 be compatible with all GEN8+ devices and drivers. 185 186 To sort `count` keys on OpenCL: 187 188 ```C 189 // create the OpenCL HotSort target 190 struct hs_cl * hs = hs_cl_create(<address of target>,...); 191 192 ... 193 194 // see how much padding may be required 195 hs_cl_pad(hs,count,&count_padded_in,&count_padded_out); 196 197 // enqueue compute kernels 198 hs_cl_sort(hs,cq,...,vin,vout,...); // or (...,vin,NULL,...) for in-place sorting 199 200 ... 201 202 // release the OpenCL HotSort target 203 hs_cl_release(hs,...); 204 205 ``` 206 207 The [`hs_cl.h`](cl/hs_cl.h) header file describes these functions in 208 greater detail. 209 210 # Background 211 212 The HotSort sorting algorithm was created in 2012 and generalized in 213 2015 to support GPUs that benefit from non-power-of-two workgroups. 214 215 The objective of HotSort is to achieve high throughput as *early* as 216 possible on small GPUs when sorting modestly-sized arrays 1,000s to 217 100s of thousands of 64bit keys. 218 219 HotSort uses both well-known and obscure properties of bitonic 220 sequences to create a novel mapping of keys onto data-parallel devices 221 like GPUs. 222 223 ## Overview 224 225 The overall flow of the HotSort algorithm is below. Kernel launches 226 are in italics. 227 228 1. For each workgroup of slabs: 229 1. For each slab in the workgroup: 230 1. *Slab Load* 231 1. *Slab Sort* 232 1. Until all slabs in the workgroup are merged: 233 1. *Multi-Slab Flip Merge* 234 1. *Slab Store* 235 1. Until all slabs are merged: 236 1. *Streaming Flip Merge* 237 1. If necessary, *Streaming Half Merge* 238 1. If necessary, *Multi-Slab Half Merge* 239 1. If necessary, *Slab Half Merge* 240 1. If complete: 241 1. Optionally, *Report Key Changes* 242 1. Optionally, *Slab Transpose & Store* 243 1. Otherwise: *Slab Store* 244 1. Done 245 246 ## Sorting 247 248 The algorithm begins with a very *dense* per-multiprocessor block 249 sorting algorithm that loads a "slab" of keys into a subgroup's 250 registers, sorts the slabs, merges all slabs in the workgroup, and 251 stores the slabs back to global memory. 252 253 In the slab sorting phase, each lane of a subgroup executes a bitonic 254 sorting network on its registers and successively merges lanes until 255 the slab of registers is sorted in serpentine order. 256 257 ![](images/hs_sorted_slab.svg) 258 259 ## Merging 260 261 HotSort has several different merging strategies. 262 263 The merging kernels leverage the multiprocessor's register file by 264 loading, merging and storing a large number of strided slab rows 265 without using local memory. 266 267 The merging kernels exploit the bitonic sequence property that 268 interleaved subsequences of a bitonic sequence are also bitonic 269 sequences. This property also holds for non-power-of-two sequences. 270 271 As an example, the *Streaming Flip Merge* kernel is illustrated below: 272 273 ![](images/hs_flip_merge.svg) 274 275 # Future Enhancements 276 277 ## Hybrid improved merging 278 279 HotSort's initial sorting and merging phases are performed on bitonic 280 sequences. Because of this, throughput decreases as the problem size 281 increases. 282 283 A hybrid algorithm that combined HotSort's block sorter and several 284 rounds of merging with a state-of-the-art GPU merging algorithm would 285 probably improve the algorithm's performance on larger arrays. 286 287 ## Reenable support for devices lacking shuffle functions 288 289 The original version of HotSort ran on pre-Kepler GPUs without 290 intra-warp/inter-lane shuffling reenable this capability. 291 292 ## Metal support 293 294 Modify the HotSort generator to support Metal targets. 295