1 r600-sb 2 ======= 3 4 * * * * * 5 6 Debugging 7 --------- 8 9 ### Environment variables 10 11 - **R600\_DEBUG** 12 13 There are new flags: 14 15 - **sb** - Enable optimization of graphics shaders 16 - **sbcl** - Enable optimization of compute shaders (experimental) 17 - **sbdry** - Dry run, optimize but use source bytecode - 18 useful if you only want to check shader dumps 19 without the risk of lockups and other problems 20 - **sbstat** - Print optimization statistics (only time so far) 21 - **sbdump** - Print IR after some passes. 22 23 ### Regression debugging 24 25 If there are any regressions as compared to the default backend 26 (R600\_SB=0), it's possible to use the following environment variables 27 to find the incorrectly optimized shader that causes the regression. 28 29 - **R600\_SB\_DSKIP\_MODE** - allows to skip optimization for some 30 shaders 31 - 0 - disabled (default) 32 - 1 - skip optimization for the shaders in the range 33 [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END], that is, 34 optimize only the shaders that are not in this range 35 - 2 - optimize only the shaders in the range 36 [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END] 37 38 - **R600\_SB\_DSKIP\_START** - start of the range (1-based) 39 40 - **R600\_SB\_DSKIP\_END** - end of the range (1-based) 41 42 Example - optimize only the shaders 5, 6, and 7: 43 44 R600_SB_DSKIP_START=5 R600_SB_DSKIP_END=7 R600_SB_DSKIP_MODE=2 45 46 All shaders compiled by the application are numbered starting from 1, 47 the number of shaders used by the application may be obtained by running 48 it with "R600_DEBUG=sb,sbstat" - it will print "sb: shader \#index\#" 49 for each compiled shader. 50 51 After figuring out the total number of shaders used by the application, 52 the variables above allow to use bisection to find the shader that is 53 the cause of regression. E.g. if the application uses 100 shaders, we 54 can divide the range [1; 100] and run the application with the 55 optimization enabled only for the first half of the shaders: 56 57 R600_SB_DSKIP_START=1 R600_SB_DSKIP_END=50 R600_SB_DSKIP_MODE=2 <app> 58 59 If the regression is reproduced with these parameters, then the failing 60 shader is in the range [1; 50], if it's not reproduced - then it's in 61 the range [51; 100]. Then we can divide the new range again and repeat 62 the testing, until we'll reduce the range to a single failing shader. 63 64 *NOTE: This method relies on the assumption that the application 65 produces the same sequence of the shaders on each run. It's not always 66 true - some applications may produce different sequences of the shaders, 67 in such cases the tools like apitrace may be used to record the trace 68 with the application, then this method may be applied when replaying the 69 trace - also this may be faster and/or more convenient than testing the 70 application itself.* 71 72 * * * * * 73 74 Intermediate Representation 75 --------------------------- 76 77 ### Values 78 79 All kinds of the operands (literal constants, references to kcache 80 constants, references to GPRs, etc) are currently represented by the 81 **value** class (possibly it makes sense to switch to hierarchy of 82 classes derived from **value** instead, to save some memory). 83 84 All values (except some pseudo values like the exec\_mask or predicate 85 register) represent 32bit scalar values - there are no vector values, 86 CF/FETCH instructions use groups of 4 values for src and dst operands. 87 88 ### Nodes 89 90 Shader programs are represented using the tree data structure, some 91 nodes contain a list of subnodes. 92 93 #### Control flow nodes 94 95 Control flow information is represented using four special node types 96 (based on the ideas from [[1]](#references) ) 97 98 - **region\_node** - single-entry, single-exit region. 99 100 All loops and if's in the program are enclosed in region nodes. 101 Region nodes have two containers for phi nodes - 102 region\_node::loop\_phi contains the phi expressions to be executed 103 at the region entry, region\_node::phi contains the phi expressions 104 to be executed at the region exit. It's the only type of the node 105 that contains associated phi expressions. 106 107 - **depart\_node** - "depart region \$id after { ... }" 108 109 Depart target region (jump to exit point) after executing contained 110 code. 111 112 - **repeat\_node** - "repeat region \$id after { ... }" 113 114 Repeat target region (jump to entry point) after executing contained 115 code. 116 117 - **if\_node** - "if (cond) { ... }" 118 119 Execute contained code if condition is true. The difference from 120 [[1]](#references) is that we don't have associated phi expressions 121 for the **if\_node**, we enclose **if\_node** in the 122 **region\_node** and store corresponding phi's in the 123 **region\_node**, this allows more uniform handling. 124 125 The target region of depart and repeat nodes is always the region where 126 they are located (possibly in the nested region), there are no arbitrary 127 jumps/goto's - control flow in the program is always structured. 128 129 Typical control flow constructs can be represented as in the following 130 examples: 131 132 GLSL: 133 134 if (cond) { 135 < 1 > 136 } else { 137 < 2 > 138 } 139 140 IR: 141 142 region #0 { 143 depart region #0 after { 144 if (cond) { 145 depart region #0 after { 146 < 1 > 147 } 148 } 149 < 2 > 150 } 151 <region #0 phi nodes > 152 } 153 154 GLSL: 155 156 while (cond) { 157 < 1 > 158 } 159 160 IR: 161 162 region #0 { 163 <region #0 loop_phi nodes> 164 repeat region #0 after { 165 region #1 { 166 depart region #1 after { 167 if (!cond) { 168 depart region #0 169 } 170 } 171 } 172 < 1 > 173 } 174 <region #0 phi nodes> 175 } 176 177 'Break' and 'continue' inside the loops are directly translated to the 178 depart and repeat nodes for the corresponding loop region. 179 180 This may look a bit too complicated, but in fact this allows more simple 181 and uniform handling of the control flow. 182 183 All loop\_phi and phi nodes for some region always have the same number 184 of source operands. The number of source operands for 185 region\_node::loop\_phi nodes is 1 + number of repeat nodes that 186 reference this region as a target. The number of source operands for 187 region\_node::phi nodes is equal to the number of depart nodes that 188 reference this region as a target. All depart/repeat nodes for the 189 region have unique indices equal to the index of source operand for 190 phi/loop\_phi nodes. 191 192 First source operand for region\_node::loop\_phi nodes (src[0]) is an 193 incoming value that enters the region from the outside. Each remaining 194 source operand comes from the corresponding repeat node. 195 196 More complex example: 197 198 GLSL: 199 200 a = 1; 201 while (a < 5) { 202 a = a * 2; 203 if (b == 3) { 204 continue; 205 } else { 206 a = 6; 207 } 208 if (c == 4) 209 break; 210 a = a + 1; 211 } 212 213 IR with SSA form: 214 215 a.1 = 1; 216 region #0 { 217 // loop phi values: src[0] - incoming, src[1] - from repeat_1, src[2] - from repeat_2 218 region#0 loop_phi: a.2 = phi a.1, a.6, a.3 219 220 repeat_1 region #0 after { 221 a.3 = a.2 * 2; 222 cond1 = (b == 3); 223 region #1 { 224 depart_0 region #1 after { 225 if (cond1) { 226 repeat_2 region #0; 227 } 228 } 229 a.4 = 6; 230 231 region #1 phi: a.5 = phi a.4; // src[0] - from depart_0 232 } 233 cond2 = (c == 4); 234 region #2 { 235 depart_0 region #2 after { 236 if (cond2) { 237 depart_0 region #0; 238 } 239 } 240 } 241 a.6 = a.5 + 1; 242 } 243 244 region #0 phi: a.7 = phi a.5 // src[0] from depart_0 245 } 246 247 Phi nodes with single source operand are just copies, they are not 248 really necessary, but this allows to handle all **depart\_node**s in the 249 uniform way. 250 251 #### Instruction nodes 252 253 Instruction nodes represent different kinds of instructions - 254 **alu\_node**, **cf\_node**, **fetch\_node**, etc. Each of them contains 255 the "bc" structure where all fields of the bytecode are stored (the type 256 is **bc\_alu** for **alu\_node**, etc). The operands are represented 257 using the vectors of pointers to **value** class (node::src, node::dst) 258 259 #### SSA-specific nodes 260 261 Phi nodes currently don't have special node class, they are stored as 262 **node**. Destination vector contains a single destination value, source 263 vector contains 1 or more source values. 264 265 Psi nodes [[5], [6]](#references) also don't have a special node class 266 and stored as **node**. Source vector contains 3 values for each source 267 operand - the **value** of predicate, **value** of corresponding 268 PRED\_SEL field, and the source **value** itself. 269 270 ### Indirect addressing 271 272 Special kind of values (VLK\_RELREG) is used to represent indirect 273 operands. These values don't have SSA versions. The representation is 274 mostly based on the [[2]](#references). Indirect operand contains the 275 "offset/address" value (value::rel), (e.g. some SSA version of the AR 276 register value, though after some passes it may be any value - constant, 277 register, etc), also it contains the maydef and mayuse vectors of 278 pointers to **value**s (similar to dst/src vectors in the **node**) to 279 represent the effects of aliasing in the SSA form. 280 281 E.g. if we have the array R5.x ... R8.x and the following instruction : 282 283 MOV R0.x, R[5 + AR].x 284 285 then source indirect operand is represented with the VLK\_RELREG value, 286 value::rel is AR, value::maydef is empty (in fact it always contain the 287 same number of elements as mayuse to simplify the handling, but they are 288 NULLs), value::mayuse contains [R5.x, R6.x, R7.x, R8.x] (or the 289 corresponding SSA versions after ssa\_rename). 290 291 Additional "virtual variables" as in [HSSA [2]](#references) are not 292 used, also there is no special handling for "zero versions". Typical 293 programs in our case are small, indirect addressing is rare, array sizes 294 are limited by max gpr number, so we don't really need to use special 295 tricks to avoid the explosion of value versions. Also this allows more 296 precise liveness computation for array elements without modifications to 297 the algorithms. 298 299 With the following instruction: 300 301 MOV R[5+AR].x, R0.x 302 303 we'll have both maydef and mayuse vectors for dst operand filled with 304 array values initially: [R5.x, R6.x, R7.x, R8.x]. After the ssa\_rename 305 pass mayuse will contain previous versions, maydef will contain new 306 potentially-defined versions. 307 308 * * * * * 309 310 Passes 311 ------ 312 313 - **bc\_parser** - creates the IR from the source bytecode, 314 initializes src and dst value vectors for instruction nodes. Most 315 ALU nodes have one dst operand and the number of source operands is 316 equal to the number of source operands for the ISA instruction. 317 Nodes for PREDSETxx instructions have 3 dst operands - dst[0] is dst 318 gpr as in the original instruction, other two are pseudo-operands 319 that represent possibly updated predicate and exec\_mask. Predicate 320 values are used in the predicated alu instructions (node::pred), 321 exec\_mask values are used in the if\_nodes (if\_node::cond). Each 322 vector operand in the CF/TEX/VTX instructions is represented with 4 323 values - components of the vector. 324 325 - **ssa\_prepare** - creates phi expressions. 326 327 - **ssa\_rename** - renames the values (assigns versions). 328 329 - **liveness** - liveness computation, sets 'dead' flag for unused 330 nodes and values, optionally computes interference information for 331 the values. 332 333 - **dce\_cleanup** - eliminates 'dead' nodes, also removes some 334 unnecessary nodes created by bc\_parser, e.g. the nodes for the JUMP 335 instructions in the source, containers for ALU groups (they were 336 only needed for the ssa\_rename pass) 337 338 - **if\_conversion** - converts control flow with if\_nodes to the 339 data flow in cases where it can improve performance (small alu-only 340 branches). Both branches are executed speculatively and the phi 341 expressions are replaced with conditional moves (CNDxx) to select 342 the final value using the same condition predicate as was used by 343 the original if\_node. E.g. **if\_node** used dst[2] from PREDSETxx 344 instruction, CNDxx now uses dst[0] from the same PREDSETxx 345 instruction. 346 347 - **peephole** - peephole optimizations 348 349 - **gvn** - Global Value Numbering [[2]](#references), 350 [[3]](#references) 351 352 - **gcm** - Global Code Motion [[3]](#references). Also performs 353 grouping of the instructions of the same kind (CF/FETCH/ALU). 354 355 - register allocation passes, some ideas are used from 356 [[4]](#references), but implementation is simplified to make it more 357 efficient in terms of the compilation speed (e.g. no recursive 358 recoloring) while achieving good enough results. 359 360 - **ra\_split** - prepares the program to register allocation. 361 Splits live ranges for constrained values by inserting the 362 copies to/from temporary values, so that the live range of the 363 constrained values becomes minimal. 364 365 - **ra\_coalesce** - performs global allocation on registers used 366 in CF/FETCH instructions. It's performed first to make sure they 367 end up in the same GPR. Also tries to allocate all values 368 involved in copies (inserted by the ra\_split pass) to the same 369 register, so that the copies may be eliminated. 370 371 - **ra\_init** - allocates gpr arrays (if indirect addressing is 372 used), and remaining values. 373 374 - **post\_scheduler** - ALU scheduler, handles VLIW packing and 375 performs the final register allocation for local values inside ALU 376 clauses. Eliminates all coalesced copies (if src and dst of the copy 377 are allocated to the same register). 378 379 - **ra\_checker** - optional debugging pass that tries to catch basic 380 errors of the scheduler or regalloc, 381 382 - **bc\_finalize** - propagates the regalloc information from values 383 in node::src and node::dst vectors to the bytecode fields, converts 384 control flow structure (region/depart/repeat) to the target 385 instructions (JUMP/ELSE/POP, 386 LOOP\_START/LOOP\_END/LOOP\_CONTINUE/LOOP\_BREAK). 387 388 - **bc\_builder** - builds final bytecode, 389 390 * * * * * 391 392 References 393 ---------- 394 395 [1] ["Tree-Based Code Optimization. A Thesis Proposal", Carl 396 McConnell](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.4210&rep=rep1&type=pdf) 397 398 [2] ["Effective Representation of Aliases and Indirect Memory Operations 399 in SSA Form", Fred Chow, Sun Chan, Shin-Ming Liu, Raymond Lo, Mark 400 Streich](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.6974&rep=rep1&type=pdf) 401 402 [3] ["Global Code Motion. Global Value Numbering.", Cliff 403 Click](http://www.cs.washington.edu/education/courses/cse501/06wi/reading/click-pldi95.pdf) 404 405 [4] ["Register Allocation for Programs in SSA Form", Sebastian 406 Hack](http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/6532) 407 408 [5] ["An extension to the SSA representation for predicated code", 409 Francois de 410 Ferriere](http://www.cdl.uni-saarland.de/ssasem/talks/Francois.de.Ferriere.pdf) 411 412 [6] ["Improvements to the Psi-SSA Representation", F. de 413 Ferriere](http://www.scopesconf.org/scopes-07/presentations/3_Presentation.pdf) 414