Home | History | Annotate | Download | only in sb
      1 r600-sb
      2 =======
      3 
      4 * * * * *
      5 
      6 Debugging
      7 ---------
      8 
      9 ### Environment variables
     10 
     11 -   **R600\_DEBUG**
     12 
     13     There are new flags:
     14 
     15     -   **sb** - Enable optimization of graphics shaders
     16     -   **sbcl** - Enable optimization of compute shaders (experimental)
     17     -   **sbdry** - Dry run, optimize but use source bytecode - 
     18         useful if you only want to check shader dumps 
     19         without the risk of lockups and other problems
     20     -   **sbstat** - Print optimization statistics (only time so far)
     21     -   **sbdump** - Print IR after some passes.
     22 
     23 ### Regression debugging
     24 
     25 If there are any regressions as compared to the default backend
     26 (R600\_SB=0), it's possible to use the following environment variables
     27 to find the incorrectly optimized shader that causes the regression.
     28 
     29 -   **R600\_SB\_DSKIP\_MODE** - allows to skip optimization for some
     30     shaders
     31     -   0 - disabled (default)
     32     -   1 - skip optimization for the shaders in the range
     33         [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END], that is,
     34         optimize only the shaders that are not in this range
     35     -   2 - optimize only the shaders in the range
     36         [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END]
     37 
     38 -   **R600\_SB\_DSKIP\_START** - start of the range (1-based)
     39 
     40 -   **R600\_SB\_DSKIP\_END** - end of the range (1-based)
     41 
     42 Example - optimize only the shaders 5, 6, and 7:
     43 
     44     R600_SB_DSKIP_START=5 R600_SB_DSKIP_END=7 R600_SB_DSKIP_MODE=2
     45 
     46 All shaders compiled by the application are numbered starting from 1,
     47 the number of shaders used by the application may be obtained by running
     48 it with "R600_DEBUG=sb,sbstat" - it will print "sb: shader \#index\#"
     49 for each compiled shader.
     50 
     51 After figuring out the total number of shaders used by the application,
     52 the variables above allow to use bisection to find the shader that is
     53 the cause of regression. E.g. if the application uses 100 shaders, we
     54 can divide the range [1; 100] and run the application with the
     55 optimization enabled only for the first half of the shaders:
     56 
     57     R600_SB_DSKIP_START=1 R600_SB_DSKIP_END=50 R600_SB_DSKIP_MODE=2 <app>
     58 
     59 If the regression is reproduced with these parameters, then the failing
     60 shader is in the range [1; 50], if it's not reproduced - then it's in
     61 the range [51; 100]. Then we can divide the new range again and repeat
     62 the testing, until we'll reduce the range to a single failing shader.
     63 
     64 *NOTE: This method relies on the assumption that the application
     65 produces the same sequence of the shaders on each run. It's not always
     66 true - some applications may produce different sequences of the shaders,
     67 in such cases the tools like apitrace may be used to record the trace
     68 with the application, then this method may be applied when replaying the
     69 trace - also this may be faster and/or more convenient than testing the
     70 application itself.*
     71 
     72 * * * * *
     73 
     74 Intermediate Representation
     75 ---------------------------
     76 
     77 ### Values
     78 
     79 All kinds of the operands (literal constants, references to kcache
     80 constants, references to GPRs, etc) are currently represented by the
     81 **value** class (possibly it makes sense to switch to hierarchy of
     82 classes derived from **value** instead, to save some memory).
     83 
     84 All values (except some pseudo values like the exec\_mask or predicate
     85 register) represent 32bit scalar values - there are no vector values,
     86 CF/FETCH instructions use groups of 4 values for src and dst operands.
     87 
     88 ### Nodes
     89 
     90 Shader programs are represented using the tree data structure, some
     91 nodes contain a list of subnodes.
     92 
     93 #### Control flow nodes
     94 
     95 Control flow information is represented using four special node types
     96 (based on the ideas from [[1]](#references) )
     97 
     98 -   **region\_node** - single-entry, single-exit region.
     99 
    100     All loops and if's in the program are enclosed in region nodes.
    101     Region nodes have two containers for phi nodes -
    102     region\_node::loop\_phi contains the phi expressions to be executed
    103     at the region entry, region\_node::phi contains the phi expressions
    104     to be executed at the region exit. It's the only type of the node
    105     that contains associated phi expressions.
    106 
    107 -   **depart\_node** - "depart region \$id after { ... }"
    108 
    109     Depart target region (jump to exit point) after executing contained
    110     code.
    111 
    112 -   **repeat\_node** - "repeat region \$id after { ... }"
    113 
    114     Repeat target region (jump to entry point) after executing contained
    115     code.
    116 
    117 -   **if\_node** - "if (cond) { ... }"
    118 
    119     Execute contained code if condition is true. The difference from
    120     [[1]](#references) is that we don't have associated phi expressions
    121     for the **if\_node**, we enclose **if\_node** in the
    122     **region\_node** and store corresponding phi's in the
    123     **region\_node**, this allows more uniform handling.
    124 
    125 The target region of depart and repeat nodes is always the region where
    126 they are located (possibly in the nested region), there are no arbitrary
    127 jumps/goto's - control flow in the program is always structured.
    128 
    129 Typical control flow constructs can be represented as in the following
    130 examples:
    131 
    132 GLSL:
    133 
    134     if (cond) {
    135         < 1 >
    136     } else {
    137         < 2 >
    138     }
    139 
    140 IR:
    141 
    142     region #0 {
    143         depart region #0 after {
    144             if (cond) {
    145                 depart region #0 after {
    146                     < 1 >
    147                 }
    148             }
    149             < 2 >
    150         }
    151         <region #0 phi nodes >
    152     }
    153 
    154 GLSL:
    155 
    156     while (cond) {
    157         < 1 >
    158     }
    159 
    160 IR:
    161 
    162     region #0 {
    163         <region #0 loop_phi nodes>
    164         repeat region #0 after {
    165             region #1 {
    166                 depart region #1 after {
    167                     if (!cond) {
    168                         depart region #0
    169                     }
    170                 }
    171             }
    172             < 1 >
    173         }
    174         <region #0 phi nodes>
    175     }
    176 
    177 'Break' and 'continue' inside the loops are directly translated to the
    178 depart and repeat nodes for the corresponding loop region.
    179 
    180 This may look a bit too complicated, but in fact this allows more simple
    181 and uniform handling of the control flow.
    182 
    183 All loop\_phi and phi nodes for some region always have the same number
    184 of source operands. The number of source operands for
    185 region\_node::loop\_phi nodes is 1 + number of repeat nodes that
    186 reference this region as a target. The number of source operands for
    187 region\_node::phi nodes is equal to the number of depart nodes that
    188 reference this region as a target. All depart/repeat nodes for the
    189 region have unique indices equal to the index of source operand for
    190 phi/loop\_phi nodes.
    191 
    192 First source operand for region\_node::loop\_phi nodes (src[0]) is an
    193 incoming value that enters the region from the outside. Each remaining
    194 source operand comes from the corresponding repeat node.
    195 
    196 More complex example:
    197 
    198 GLSL:
    199 
    200     a = 1;
    201     while (a < 5) {
    202         a = a * 2;
    203         if (b == 3) {
    204             continue;
    205         } else {
    206             a = 6;
    207         }
    208         if (c == 4)
    209             break;
    210         a = a + 1;
    211     }
    212 
    213 IR with SSA form:
    214 
    215     a.1 = 1;
    216     region #0 {
    217         // loop phi values: src[0] - incoming, src[1] - from repeat_1, src[2] - from repeat_2
    218         region#0 loop_phi: a.2 = phi a.1, a.6, a.3
    219 
    220         repeat_1 region #0 after {
    221             a.3 = a.2 * 2;
    222             cond1 = (b == 3);
    223             region #1 {
    224                 depart_0 region #1 after {
    225                     if (cond1) {
    226                         repeat_2 region #0;
    227                     }
    228                 }
    229                 a.4 = 6;
    230 
    231                 region #1 phi: a.5 = phi a.4; // src[0] - from depart_0
    232             }
    233             cond2 = (c == 4);
    234             region #2 {
    235                 depart_0 region #2 after {
    236                     if (cond2) {
    237                         depart_0 region #0;
    238                     }
    239                 }
    240             }
    241             a.6 = a.5 + 1;
    242         }
    243 
    244         region #0 phi: a.7 = phi a.5 // src[0] from depart_0
    245     }
    246 
    247 Phi nodes with single source operand are just copies, they are not
    248 really necessary, but this allows to handle all **depart\_node**s in the
    249 uniform way.
    250 
    251 #### Instruction nodes
    252 
    253 Instruction nodes represent different kinds of instructions -
    254 **alu\_node**, **cf\_node**, **fetch\_node**, etc. Each of them contains
    255 the "bc" structure where all fields of the bytecode are stored (the type
    256 is **bc\_alu** for **alu\_node**, etc). The operands are represented
    257 using the vectors of pointers to **value** class (node::src, node::dst)
    258 
    259 #### SSA-specific nodes
    260 
    261 Phi nodes currently don't have special node class, they are stored as
    262 **node**. Destination vector contains a single destination value, source
    263 vector contains 1 or more source values.
    264 
    265 Psi nodes [[5], [6]](#references) also don't have a special node class
    266 and stored as **node**. Source vector contains 3 values for each source
    267 operand - the **value** of predicate, **value** of corresponding
    268 PRED\_SEL field, and the source **value** itself.
    269 
    270 ### Indirect addressing
    271 
    272 Special kind of values (VLK\_RELREG) is used to represent indirect
    273 operands. These values don't have SSA versions. The representation is
    274 mostly based on the [[2]](#references). Indirect operand contains the
    275 "offset/address" value (value::rel), (e.g. some SSA version of the AR
    276 register value, though after some passes it may be any value - constant,
    277 register, etc), also it contains the maydef and mayuse vectors of
    278 pointers to **value**s (similar to dst/src vectors in the **node**) to
    279 represent the effects of aliasing in the SSA form.
    280 
    281 E.g. if we have the array R5.x ... R8.x and the following instruction :
    282 
    283     MOV R0.x, R[5 + AR].x
    284 
    285 then source indirect operand is represented with the VLK\_RELREG value,
    286 value::rel is AR, value::maydef is empty (in fact it always contain the
    287 same number of elements as mayuse to simplify the handling, but they are
    288 NULLs), value::mayuse contains [R5.x, R6.x, R7.x, R8.x] (or the
    289 corresponding SSA versions after ssa\_rename).
    290 
    291 Additional "virtual variables" as in [HSSA [2]](#references) are not
    292 used, also there is no special handling for "zero versions". Typical
    293 programs in our case are small, indirect addressing is rare, array sizes
    294 are limited by max gpr number, so we don't really need to use special
    295 tricks to avoid the explosion of value versions. Also this allows more
    296 precise liveness computation for array elements without modifications to
    297 the algorithms.
    298 
    299 With the following instruction:
    300 
    301     MOV R[5+AR].x, R0.x
    302 
    303 we'll have both maydef and mayuse vectors for dst operand filled with
    304 array values initially: [R5.x, R6.x, R7.x, R8.x]. After the ssa\_rename
    305 pass mayuse will contain previous versions, maydef will contain new
    306 potentially-defined versions.
    307 
    308 * * * * *
    309 
    310 Passes
    311 ------
    312 
    313 -   **bc\_parser** - creates the IR from the source bytecode,
    314     initializes src and dst value vectors for instruction nodes. Most
    315     ALU nodes have one dst operand and the number of source operands is
    316     equal to the number of source operands for the ISA instruction.
    317     Nodes for PREDSETxx instructions have 3 dst operands - dst[0] is dst
    318     gpr as in the original instruction, other two are pseudo-operands
    319     that represent possibly updated predicate and exec\_mask. Predicate
    320     values are used in the predicated alu instructions (node::pred),
    321     exec\_mask values are used in the if\_nodes (if\_node::cond). Each
    322     vector operand in the CF/TEX/VTX instructions is represented with 4
    323     values - components of the vector.
    324 
    325 -   **ssa\_prepare** - creates phi expressions.
    326 
    327 -   **ssa\_rename** - renames the values (assigns versions).
    328 
    329 -   **liveness** - liveness computation, sets 'dead' flag for unused
    330     nodes and values, optionally computes interference information for
    331     the values.
    332 
    333 -   **dce\_cleanup** - eliminates 'dead' nodes, also removes some
    334     unnecessary nodes created by bc\_parser, e.g. the nodes for the JUMP
    335     instructions in the source, containers for ALU groups (they were
    336     only needed for the ssa\_rename pass)
    337 
    338 -   **if\_conversion** - converts control flow with if\_nodes to the
    339     data flow in cases where it can improve performance (small alu-only
    340     branches). Both branches are executed speculatively and the phi
    341     expressions are replaced with conditional moves (CNDxx) to select
    342     the final value using the same condition predicate as was used by
    343     the original if\_node. E.g. **if\_node** used dst[2] from PREDSETxx
    344     instruction, CNDxx now uses dst[0] from the same PREDSETxx
    345     instruction.
    346 
    347 -   **peephole** - peephole optimizations
    348 
    349 -   **gvn** - Global Value Numbering [[2]](#references),
    350     [[3]](#references)
    351 
    352 -   **gcm** - Global Code Motion [[3]](#references). Also performs
    353     grouping of the instructions of the same kind (CF/FETCH/ALU).
    354 
    355 -   register allocation passes, some ideas are used from
    356     [[4]](#references), but implementation is simplified to make it more
    357     efficient in terms of the compilation speed (e.g. no recursive
    358     recoloring) while achieving good enough results.
    359 
    360     -   **ra\_split** - prepares the program to register allocation.
    361         Splits live ranges for constrained values by inserting the
    362         copies to/from temporary values, so that the live range of the
    363         constrained values becomes minimal.
    364 
    365     -   **ra\_coalesce** - performs global allocation on registers used
    366         in CF/FETCH instructions. It's performed first to make sure they
    367         end up in the same GPR. Also tries to allocate all values
    368         involved in copies (inserted by the ra\_split pass) to the same
    369         register, so that the copies may be eliminated.
    370 
    371     -   **ra\_init** - allocates gpr arrays (if indirect addressing is
    372         used), and remaining values.
    373 
    374 -   **post\_scheduler** - ALU scheduler, handles VLIW packing and
    375     performs the final register allocation for local values inside ALU
    376     clauses. Eliminates all coalesced copies (if src and dst of the copy
    377     are allocated to the same register).
    378 
    379 -   **ra\_checker** - optional debugging pass that tries to catch basic
    380     errors of the scheduler or regalloc,
    381 
    382 -   **bc\_finalize** - propagates the regalloc information from values
    383     in node::src and node::dst vectors to the bytecode fields, converts
    384     control flow structure (region/depart/repeat) to the target
    385     instructions (JUMP/ELSE/POP,
    386     LOOP\_START/LOOP\_END/LOOP\_CONTINUE/LOOP\_BREAK).
    387 
    388 -   **bc\_builder** - builds final bytecode,
    389 
    390 * * * * *
    391 
    392 References
    393 ----------
    394 
    395 [1] ["Tree-Based Code Optimization. A Thesis Proposal", Carl
    396 McConnell](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.4210&rep=rep1&type=pdf)
    397 
    398 [2] ["Effective Representation of Aliases and Indirect Memory Operations
    399 in SSA Form", Fred Chow, Sun Chan, Shin-Ming Liu, Raymond Lo, Mark
    400 Streich](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.6974&rep=rep1&type=pdf)
    401 
    402 [3] ["Global Code Motion. Global Value Numbering.", Cliff
    403 Click](http://www.cs.washington.edu/education/courses/cse501/06wi/reading/click-pldi95.pdf)
    404 
    405 [4] ["Register Allocation for Programs in SSA Form", Sebastian
    406 Hack](http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/6532)
    407 
    408 [5] ["An extension to the SSA representation for predicated code",
    409 Francois de
    410 Ferriere](http://www.cdl.uni-saarland.de/ssasem/talks/Francois.de.Ferriere.pdf)
    411 
    412 [6] ["Improvements to the Psi-SSA Representation", F. de
    413 Ferriere](http://www.scopesconf.org/scopes-07/presentations/3_Presentation.pdf)
    414