1
2 # HotSort
3
4 HotSort is a high-performance GPU-accelerated integer sorting library
5 for Vulkan, CUDA and OpenCL compute APIs.
6
7 HotSort's advantages include:
8
9 * Ultra-fast sorting of 32bit or 64bit keys
10 * Reaches peak throughput on small arrays
11 * In-place sorting for low-memory environments
12 * Strong scaling with number of multiprocessors
13 * Low memory transactions relative to array size
14 * A concurrency-friendly dense kernel grid
15 * Support for GPU post-processing of sorted results
16
17 HotSort is typically significantly faster than other GPU-accelerated
18 implementations when sorting arrays of smaller than 500K-2M keys.
19
20 ## Benchmarks
21
22 ### Throughput
23
24 Here is a throughput plot for HotSort on Vulkan and CUDA sorting
25 32-bit and 64-bit keys on a 640-core Quadro M1200:
26
27 ![](images/hs_nvidia_sm35_u32_mkeys.svg)
28 ![](images/hs_nvidia_sm35_u64_mkeys.svg)
29
30 HotSort throughput on Vulkan on an AMD V1807B APU (similar to a Ryzen 2400G) with DDR4-2666 RAM:
31
32 ![](images/hs_amd_gcn_mkeys.svg)
33
34 HotSort throughput on Vulkan and OpenCL on an Intel HD 630:
35
36 ![](images/hs_intel_gen8_mkeys.svg)
37
38 ### Execution time
39
40 Note that these sorting rates translate to sub-millisecond to
41 multi-millisecond execution times on small GPUs:
42
43 ![](images/hs_nvidia_sm35_u32_msecs.svg)
44 ![](images/hs_nvidia_sm35_u64_msecs.svg)
45 ![](images/hs_amd_gcn_msecs.svg)
46 ![](images/hs_intel_gen8_msecs.svg)
47
48 # Usage
49
50 There are HotSort implementations for Vulkan, CUDA and OpenCL.
51
52 Note that HotSort is a comparison sort and supports in-place sorting.
53
54 There are also benchmarking examples for the
55 [Vulkan](vk/bench/main.c), [CUDA](cuda/bench/main.c) and
56 [OpenCL](cl/bench/main.c) implementations.
57
58 *Not all targeted architectures have been tested.*
59
60 ## Vulkan
61
62 The following architectures are supported:
63
64 Vendor | Architecture | 32bit | 64bit | 32+32bit | Notes
65 -------|-------------------------------------------|:------------------:|:------------------:|:-----------:|------
66 NVIDIA | sm_35,sm_37,sm_50,sm_52,sm_60,sm_61,sm_70 | :white_check_mark: | :white_check_mark: | :x: | Not tested on all architectures
67 NVIDIA | sm_30,sm_32,sm_53,sm_62 | :x: | :x: | :x: | Need to generate properly shaped kernels
68 AMD | GCN | :white_check_mark: | :white_check_mark: | :x: | Tested on Linux MESA 18.2
69 Intel | GEN8+ | :white_check_mark: | :white_check_mark: | :x: | Good but the assumed *best-shaped* kernels aren't being used due to a compiler issue
70 Intel | APL/GLK using a 2x9 or 1x12 thread pool | :x: | :x: | :x: | Need to generate properly shaped kernels
71
72 Add an arch-specific HotSort algorithm (aka "target") to your project
73 by including a `.c` source and `.h` header file:
74
75 Key Size | Source | Header
76 ---------|------------------------------------------------------------------------|-------------------------------------------------------
77 32bit | ```hs/vk/<vendor>/<arch>/u32/hs_<vendor>_<arch>_u32.c``` | ```hs/vk/<vendor>/<arch>/u32/hs_target.h```
78 64bit | ```hs/vk/<vendor>/<arch>/u64/hs_<vendor>_<arch>_u64.c``` | ```hs/vk/<vendor>/<arch>/u64/hs_target.h```
79
80 To sort `count` keys on Vulkan:
81
82 ```C
83 #include "hs/vk/intel/gen8/u32/hs_target.h"
84
85 ...
86
87 // create the Vulkan HotSort target
88 struct hs_vk * hs = hs_vk_create(<address of target>,...);
89
90 // allocate a descriptor set from a pool
91 VkDescriptorSet hs_ds = hs_vk_ds_alloc(hs,descriptor_pool);
92
93 ...
94
95 // command buffer begin
96
97 ...
98
99 // bind buffer(s) to a command buffer
100 hs_vk_ds_bind(hs,hs_ds,cb,vin,vout); // or (...,vin,VK_NULL_HANDLE) for in-place sorting
101
102 // see how much padding may be required
103 hs_vk_pad(hs,count,&count_padded_in,&count_padded_out);
104
105 // append compute shaders to command buffer
106 hs_vk_sort(hs,cb,...,vin,...,vout,...); // hs_vk_sort() and hs_vk_ds_bind() must have matching vin/vout args
107
108 ...
109
110 // command buffer end and queue submit
111
112 ...
113
114 // release the Vulkan HotSort target
115 hs_vk_release(hs,...);
116
117 ```
118
119 The [`hs_vk.h`](vk/hs_vk.h) header file describes these functions in
120 greater detail.
121
122 ## CUDA
123
124 The following architectures are supported:
125
126 Vendor | Architecture | 32bit | 64bit | 32+32bit | Notes
127 -------|-------------------------------------------------------|:------------------:|:------------------:|:---------:|------
128 NVIDIA | sm_35,sm_37,sm_50,sm_52,sm_60,sm_61,sm_70 | :white_check_mark: | :white_check_mark: | :x: |
129 NVIDIA | sm_30,sm_32,sm_53,sm_62 | :x: | :x: | :x: | Need to generate properly shaped kernels
130
131 Add an arch-specific HotSort target to your project by including a
132 `.cu` CUDA source and `.h` header file:
133
134 Key Size | Source | Header
135 ---------|-------------------------------------------------|------------------------------------------
136 32bit | ```hs/cuda/sm_35/u32/hs_cuda_u32.cu``` | ```hs/cuda/sm_35/u32/hs_cuda.h```
137 64bit | ```hs/cuda/sm_35/u64/hs_cuda_u64.cu``` | ```hs/cuda/sm_35/u64/hs_cuda.h```
138
139 Usage on CUDA is very simple.
140
141 For example, to sort `count` keys:
142
143 ```C
144 #include "hs/cuda/sm_35/u32/hs_cuda.h"
145
146 ...
147
148 uint32_t count_padded_in, count_padded_out;
149
150 hs_cuda_pad_u32(count,&count_padded_in,&count_padded_in);
151
152 ...
153
154 hs_cuda_sort_u32(vin,vout, // or (vin,NULL,...) for in-place sorting
155 count,count_padded_in,count_padded_out,
156 true,
157 stream0,stream1,stream2);
158 ```
159
160 HotSort on CUDA requires two auxilary streams in order to maximize concurrency.
161
162 The algorithm is guaranteed to complete on `stream0`.
163
164
165 ## OpenCL
166
167 The following architectures are supported:
168
169 Vendor | Architecture | 32bit | 64bit | 32+32bit | Notes
170 -------|-----------------------------------------|:------------------:|:------------------:|:---------:|------
171 Intel | GEN8+ | :white_check_mark: | :white_check_mark: | :x: | Due to a fragile compiler, the assumed best kernels are not being generated at this time
172 Intel | APL/GLK using a 2x9 or 1x12 thread pool | :x: | :x: | :x: | Need to generate properly shaped kernels
173
174 Add an arch-specific HotSort target to your project by including a
175 `.c` source and `.h` header file:
176
177 Key Size | Source | Header
178 ---------|------------------------------------------------------------------------|-------------------------------------------------------
179 32bit | ```hs/cl/<vendor>/<arch>/u32/hs_<vendor>_<arch>_u32.c``` | ```hs/cl/<vendor>/<arch>/u32/hs_target.h```
180 64bit | ```hs/cl/<vendor>/<arch>/u64/hs_<vendor>_<arch>_u64.c``` | ```hs/cl/<vendor>/<arch>/u64/hs_target.h```
181
182 Note that if the symbol `HS_DUMP_SOURCE` is not defined then the
183 pre-compiled GEN9 binaries will be included. These binaries may not
184 be compatible with all GEN8+ devices and drivers.
185
186 To sort `count` keys on OpenCL:
187
188 ```C
189 // create the OpenCL HotSort target
190 struct hs_cl * hs = hs_cl_create(<address of target>,...);
191
192 ...
193
194 // see how much padding may be required
195 hs_cl_pad(hs,count,&count_padded_in,&count_padded_out);
196
197 // enqueue compute kernels
198 hs_cl_sort(hs,cq,...,vin,vout,...); // or (...,vin,NULL,...) for in-place sorting
199
200 ...
201
202 // release the OpenCL HotSort target
203 hs_cl_release(hs,...);
204
205 ```
206
207 The [`hs_cl.h`](cl/hs_cl.h) header file describes these functions in
208 greater detail.
209
210 # Background
211
212 The HotSort sorting algorithm was created in 2012 and generalized in
213 2015 to support GPUs that benefit from non-power-of-two workgroups.
214
215 The objective of HotSort is to achieve high throughput as *early* as
216 possible on small GPUs when sorting modestly-sized arrays 1,000s to
217 100s of thousands of 64bit keys.
218
219 HotSort uses both well-known and obscure properties of bitonic
220 sequences to create a novel mapping of keys onto data-parallel devices
221 like GPUs.
222
223 ## Overview
224
225 The overall flow of the HotSort algorithm is below. Kernel launches
226 are in italics.
227
228 1. For each workgroup of slabs:
229 1. For each slab in the workgroup:
230 1. *Slab Load*
231 1. *Slab Sort*
232 1. Until all slabs in the workgroup are merged:
233 1. *Multi-Slab Flip Merge*
234 1. *Slab Store*
235 1. Until all slabs are merged:
236 1. *Streaming Flip Merge*
237 1. If necessary, *Streaming Half Merge*
238 1. If necessary, *Multi-Slab Half Merge*
239 1. If necessary, *Slab Half Merge*
240 1. If complete:
241 1. Optionally, *Report Key Changes*
242 1. Optionally, *Slab Transpose & Store*
243 1. Otherwise: *Slab Store*
244 1. Done
245
246 ## Sorting
247
248 The algorithm begins with a very *dense* per-multiprocessor block
249 sorting algorithm that loads a "slab" of keys into a subgroup's
250 registers, sorts the slabs, merges all slabs in the workgroup, and
251 stores the slabs back to global memory.
252
253 In the slab sorting phase, each lane of a subgroup executes a bitonic
254 sorting network on its registers and successively merges lanes until
255 the slab of registers is sorted in serpentine order.
256
257 ![](images/hs_sorted_slab.svg)
258
259 ## Merging
260
261 HotSort has several different merging strategies.
262
263 The merging kernels leverage the multiprocessor's register file by
264 loading, merging and storing a large number of strided slab rows
265 without using local memory.
266
267 The merging kernels exploit the bitonic sequence property that
268 interleaved subsequences of a bitonic sequence are also bitonic
269 sequences. This property also holds for non-power-of-two sequences.
270
271 As an example, the *Streaming Flip Merge* kernel is illustrated below:
272
273 ![](images/hs_flip_merge.svg)
274
275 # Future Enhancements
276
277 ## Hybrid improved merging
278
279 HotSort's initial sorting and merging phases are performed on bitonic
280 sequences. Because of this, throughput decreases as the problem size
281 increases.
282
283 A hybrid algorithm that combined HotSort's block sorter and several
284 rounds of merging with a state-of-the-art GPU merging algorithm would
285 probably improve the algorithm's performance on larger arrays.
286
287 ## Reenable support for devices lacking shuffle functions
288
289 The original version of HotSort ran on pre-Kepler GPUs without
290 intra-warp/inter-lane shuffling reenable this capability.
291
292 ## Metal support
293
294 Modify the HotSort generator to support Metal targets.
295