Home | History | Annotate | Download | only in sb
      1 r600-sb
      2 =======
      3 
      4 * * * * *
      5 
      6 Debugging
      7 ---------
      8 
      9 ### Environment variables
     10 
     11 -   **R600\_DEBUG**
     12 
     13     There are new flags:
     14 
     15     -   **nosb** - Disable sb backend for graphics shaders
     16     -   **sbcl** - Enable optimization of compute shaders (experimental)
     17     -   **sbdry** - Dry run, optimize but use source bytecode -
     18         useful if you only want to check shader dumps
     19         without the risk of lockups and other problems
     20     -   **sbstat** - Print optimization statistics (only time so far)
     21     -   **sbdump** - Print IR after some passes.
     22     -   **sbnofallback** - Abort on errors instead of fallback
     23     -   **sbdisasm** - Use sb disassembler for shader dumps
     24     -   **sbsafemath** - Disable unsafe math optimizations
     25 
     26 ### Regression debugging
     27 
     28 If there are any regressions as compared to the default backend
     29 (R600\_SB=0), it's possible to use the following environment variables
     30 to find the incorrectly optimized shader that causes the regression.
     31 
     32 -   **R600\_SB\_DSKIP\_MODE** - allows to skip optimization for some
     33     shaders
     34     -   0 - disabled (default)
     35     -   1 - skip optimization for the shaders in the range
     36         [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END], that is,
     37         optimize only the shaders that are not in this range
     38     -   2 - optimize only the shaders in the range
     39         [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END]
     40 
     41 -   **R600\_SB\_DSKIP\_START** - start of the range (1-based)
     42 
     43 -   **R600\_SB\_DSKIP\_END** - end of the range (1-based)
     44 
     45 Example - optimize only the shaders 5, 6, and 7:
     46 
     47     R600_SB_DSKIP_START=5 R600_SB_DSKIP_END=7 R600_SB_DSKIP_MODE=2
     48 
     49 All shaders compiled by the application are numbered starting from 1,
     50 the number of shaders used by the application may be obtained by running
     51 it with "R600_DEBUG=sb,sbstat" - it will print "sb: shader \#index\#"
     52 for each compiled shader.
     53 
     54 After figuring out the total number of shaders used by the application,
     55 the variables above allow to use bisection to find the shader that is
     56 the cause of regression. E.g. if the application uses 100 shaders, we
     57 can divide the range [1; 100] and run the application with the
     58 optimization enabled only for the first half of the shaders:
     59 
     60     R600_SB_DSKIP_START=1 R600_SB_DSKIP_END=50 R600_SB_DSKIP_MODE=2 <app>
     61 
     62 If the regression is reproduced with these parameters, then the failing
     63 shader is in the range [1; 50], if it's not reproduced - then it's in
     64 the range [51; 100]. Then we can divide the new range again and repeat
     65 the testing, until we'll reduce the range to a single failing shader.
     66 
     67 *NOTE: This method relies on the assumption that the application
     68 produces the same sequence of the shaders on each run. It's not always
     69 true - some applications may produce different sequences of the shaders,
     70 in such cases the tools like apitrace may be used to record the trace
     71 with the application, then this method may be applied when replaying the
     72 trace - also this may be faster and/or more convenient than testing the
     73 application itself.*
     74 
     75 * * * * *
     76 
     77 Intermediate Representation
     78 ---------------------------
     79 
     80 ### Values
     81 
     82 All kinds of the operands (literal constants, references to kcache
     83 constants, references to GPRs, etc) are currently represented by the
     84 **value** class (possibly it makes sense to switch to hierarchy of
     85 classes derived from **value** instead, to save some memory).
     86 
     87 All values (except some pseudo values like the exec\_mask or predicate
     88 register) represent 32bit scalar values - there are no vector values,
     89 CF/FETCH instructions use groups of 4 values for src and dst operands.
     90 
     91 ### Nodes
     92 
     93 Shader programs are represented using the tree data structure, some
     94 nodes contain a list of subnodes.
     95 
     96 #### Control flow nodes
     97 
     98 Control flow information is represented using four special node types
     99 (based on the ideas from [[1]](#references) )
    100 
    101 -   **region\_node** - single-entry, single-exit region.
    102 
    103     All loops and if's in the program are enclosed in region nodes.
    104     Region nodes have two containers for phi nodes -
    105     region\_node::loop\_phi contains the phi expressions to be executed
    106     at the region entry, region\_node::phi contains the phi expressions
    107     to be executed at the region exit. It's the only type of the node
    108     that contains associated phi expressions.
    109 
    110 -   **depart\_node** - "depart region \$id after { ... }"
    111 
    112     Depart target region (jump to exit point) after executing contained
    113     code.
    114 
    115 -   **repeat\_node** - "repeat region \$id after { ... }"
    116 
    117     Repeat target region (jump to entry point) after executing contained
    118     code.
    119 
    120 -   **if\_node** - "if (cond) { ... }"
    121 
    122     Execute contained code if condition is true. The difference from
    123     [[1]](#references) is that we don't have associated phi expressions
    124     for the **if\_node**, we enclose **if\_node** in the
    125     **region\_node** and store corresponding phi's in the
    126     **region\_node**, this allows more uniform handling.
    127 
    128 The target region of depart and repeat nodes is always the region where
    129 they are located (possibly in the nested region), there are no arbitrary
    130 jumps/goto's - control flow in the program is always structured.
    131 
    132 Typical control flow constructs can be represented as in the following
    133 examples:
    134 
    135 GLSL:
    136 
    137     if (cond) {
    138         < 1 >
    139     } else {
    140         < 2 >
    141     }
    142 
    143 IR:
    144 
    145     region #0 {
    146         depart region #0 after {
    147             if (cond) {
    148                 depart region #0 after {
    149                     < 1 >
    150                 }
    151             }
    152             < 2 >
    153         }
    154         <region #0 phi nodes >
    155     }
    156 
    157 GLSL:
    158 
    159     while (cond) {
    160         < 1 >
    161     }
    162 
    163 IR:
    164 
    165     region #0 {
    166         <region #0 loop_phi nodes>
    167         repeat region #0 after {
    168             region #1 {
    169                 depart region #1 after {
    170                     if (!cond) {
    171                         depart region #0
    172                     }
    173                 }
    174             }
    175             < 1 >
    176         }
    177         <region #0 phi nodes>
    178     }
    179 
    180 'Break' and 'continue' inside the loops are directly translated to the
    181 depart and repeat nodes for the corresponding loop region.
    182 
    183 This may look a bit too complicated, but in fact this allows more simple
    184 and uniform handling of the control flow.
    185 
    186 All loop\_phi and phi nodes for some region always have the same number
    187 of source operands. The number of source operands for
    188 region\_node::loop\_phi nodes is 1 + number of repeat nodes that
    189 reference this region as a target. The number of source operands for
    190 region\_node::phi nodes is equal to the number of depart nodes that
    191 reference this region as a target. All depart/repeat nodes for the
    192 region have unique indices equal to the index of source operand for
    193 phi/loop\_phi nodes.
    194 
    195 First source operand for region\_node::loop\_phi nodes (src[0]) is an
    196 incoming value that enters the region from the outside. Each remaining
    197 source operand comes from the corresponding repeat node.
    198 
    199 More complex example:
    200 
    201 GLSL:
    202 
    203     a = 1;
    204     while (a < 5) {
    205         a = a * 2;
    206         if (b == 3) {
    207             continue;
    208         } else {
    209             a = 6;
    210         }
    211         if (c == 4)
    212             break;
    213         a = a + 1;
    214     }
    215 
    216 IR with SSA form:
    217 
    218     a.1 = 1;
    219     region #0 {
    220         // loop phi values: src[0] - incoming, src[1] - from repeat_1, src[2] - from repeat_2
    221         region#0 loop_phi: a.2 = phi a.1, a.6, a.3
    222 
    223         repeat_1 region #0 after {
    224             a.3 = a.2 * 2;
    225             cond1 = (b == 3);
    226             region #1 {
    227                 depart_0 region #1 after {
    228                     if (cond1) {
    229                         repeat_2 region #0;
    230                     }
    231                 }
    232                 a.4 = 6;
    233 
    234                 region #1 phi: a.5 = phi a.4; // src[0] - from depart_0
    235             }
    236             cond2 = (c == 4);
    237             region #2 {
    238                 depart_0 region #2 after {
    239                     if (cond2) {
    240                         depart_0 region #0;
    241                     }
    242                 }
    243             }
    244             a.6 = a.5 + 1;
    245         }
    246 
    247         region #0 phi: a.7 = phi a.5 // src[0] from depart_0
    248     }
    249 
    250 Phi nodes with single source operand are just copies, they are not
    251 really necessary, but this allows to handle all **depart\_node**s in the
    252 uniform way.
    253 
    254 #### Instruction nodes
    255 
    256 Instruction nodes represent different kinds of instructions -
    257 **alu\_node**, **cf\_node**, **fetch\_node**, etc. Each of them contains
    258 the "bc" structure where all fields of the bytecode are stored (the type
    259 is **bc\_alu** for **alu\_node**, etc). The operands are represented
    260 using the vectors of pointers to **value** class (node::src, node::dst)
    261 
    262 #### SSA-specific nodes
    263 
    264 Phi nodes currently don't have special node class, they are stored as
    265 **node**. Destination vector contains a single destination value, source
    266 vector contains 1 or more source values.
    267 
    268 Psi nodes [[5], [6]](#references) also don't have a special node class
    269 and stored as **node**. Source vector contains 3 values for each source
    270 operand - the **value** of predicate, **value** of corresponding
    271 PRED\_SEL field, and the source **value** itself.
    272 
    273 ### Indirect addressing
    274 
    275 Special kind of values (VLK\_RELREG) is used to represent indirect
    276 operands. These values don't have SSA versions. The representation is
    277 mostly based on the [[2]](#references). Indirect operand contains the
    278 "offset/address" value (value::rel), (e.g. some SSA version of the AR
    279 register value, though after some passes it may be any value - constant,
    280 register, etc), also it contains the maydef and mayuse vectors of
    281 pointers to **value**s (similar to dst/src vectors in the **node**) to
    282 represent the effects of aliasing in the SSA form.
    283 
    284 E.g. if we have the array R5.x ... R8.x and the following instruction :
    285 
    286     MOV R0.x, R[5 + AR].x
    287 
    288 then source indirect operand is represented with the VLK\_RELREG value,
    289 value::rel is AR, value::maydef is empty (in fact it always contain the
    290 same number of elements as mayuse to simplify the handling, but they are
    291 NULLs), value::mayuse contains [R5.x, R6.x, R7.x, R8.x] (or the
    292 corresponding SSA versions after ssa\_rename).
    293 
    294 Additional "virtual variables" as in [HSSA [2]](#references) are not
    295 used, also there is no special handling for "zero versions". Typical
    296 programs in our case are small, indirect addressing is rare, array sizes
    297 are limited by max gpr number, so we don't really need to use special
    298 tricks to avoid the explosion of value versions. Also this allows more
    299 precise liveness computation for array elements without modifications to
    300 the algorithms.
    301 
    302 With the following instruction:
    303 
    304     MOV R[5+AR].x, R0.x
    305 
    306 we'll have both maydef and mayuse vectors for dst operand filled with
    307 array values initially: [R5.x, R6.x, R7.x, R8.x]. After the ssa\_rename
    308 pass mayuse will contain previous versions, maydef will contain new
    309 potentially-defined versions.
    310 
    311 * * * * *
    312 
    313 Passes
    314 ------
    315 
    316 -   **bc\_parser** - creates the IR from the source bytecode,
    317     initializes src and dst value vectors for instruction nodes. Most
    318     ALU nodes have one dst operand and the number of source operands is
    319     equal to the number of source operands for the ISA instruction.
    320     Nodes for PREDSETxx instructions have 3 dst operands - dst[0] is dst
    321     gpr as in the original instruction, other two are pseudo-operands
    322     that represent possibly updated predicate and exec\_mask. Predicate
    323     values are used in the predicated alu instructions (node::pred),
    324     exec\_mask values are used in the if\_nodes (if\_node::cond). Each
    325     vector operand in the CF/TEX/VTX instructions is represented with 4
    326     values - components of the vector.
    327 
    328 -   **ssa\_prepare** - creates phi expressions.
    329 
    330 -   **ssa\_rename** - renames the values (assigns versions).
    331 
    332 -   **liveness** - liveness computation, sets 'dead' flag for unused
    333     nodes and values, optionally computes interference information for
    334     the values.
    335 
    336 -   **dce\_cleanup** - eliminates 'dead' nodes, also removes some
    337     unnecessary nodes created by bc\_parser, e.g. the nodes for the JUMP
    338     instructions in the source, containers for ALU groups (they were
    339     only needed for the ssa\_rename pass)
    340 
    341 -   **if\_conversion** - converts control flow with if\_nodes to the
    342     data flow in cases where it can improve performance (small alu-only
    343     branches). Both branches are executed speculatively and the phi
    344     expressions are replaced with conditional moves (CNDxx) to select
    345     the final value using the same condition predicate as was used by
    346     the original if\_node. E.g. **if\_node** used dst[2] from PREDSETxx
    347     instruction, CNDxx now uses dst[0] from the same PREDSETxx
    348     instruction.
    349 
    350 -   **peephole** - peephole optimizations
    351 
    352 -   **gvn** - Global Value Numbering [[2]](#references),
    353     [[3]](#references)
    354 
    355 -   **gcm** - Global Code Motion [[3]](#references). Also performs
    356     grouping of the instructions of the same kind (CF/FETCH/ALU).
    357 
    358 -   register allocation passes, some ideas are used from
    359     [[4]](#references), but implementation is simplified to make it more
    360     efficient in terms of the compilation speed (e.g. no recursive
    361     recoloring) while achieving good enough results.
    362 
    363     -   **ra\_split** - prepares the program to register allocation.
    364         Splits live ranges for constrained values by inserting the
    365         copies to/from temporary values, so that the live range of the
    366         constrained values becomes minimal.
    367 
    368     -   **ra\_coalesce** - performs global allocation on registers used
    369         in CF/FETCH instructions. It's performed first to make sure they
    370         end up in the same GPR. Also tries to allocate all values
    371         involved in copies (inserted by the ra\_split pass) to the same
    372         register, so that the copies may be eliminated.
    373 
    374     -   **ra\_init** - allocates gpr arrays (if indirect addressing is
    375         used), and remaining values.
    376 
    377 -   **post\_scheduler** - ALU scheduler, handles VLIW packing and
    378     performs the final register allocation for local values inside ALU
    379     clauses. Eliminates all coalesced copies (if src and dst of the copy
    380     are allocated to the same register).
    381 
    382 -   **ra\_checker** - optional debugging pass that tries to catch basic
    383     errors of the scheduler or regalloc,
    384 
    385 -   **bc\_finalize** - propagates the regalloc information from values
    386     in node::src and node::dst vectors to the bytecode fields, converts
    387     control flow structure (region/depart/repeat) to the target
    388     instructions (JUMP/ELSE/POP,
    389     LOOP\_START/LOOP\_END/LOOP\_CONTINUE/LOOP\_BREAK).
    390 
    391 -   **bc\_builder** - builds final bytecode,
    392 
    393 * * * * *
    394 
    395 References
    396 ----------
    397 
    398 [1] ["Tree-Based Code Optimization. A Thesis Proposal", Carl
    399 McConnell](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.4210&rep=rep1&type=pdf)
    400 
    401 [2] ["Effective Representation of Aliases and Indirect Memory Operations
    402 in SSA Form", Fred Chow, Sun Chan, Shin-Ming Liu, Raymond Lo, Mark
    403 Streich](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.6974&rep=rep1&type=pdf)
    404 
    405 [3] ["Global Code Motion. Global Value Numbering.", Cliff
    406 Click](http://www.cs.washington.edu/education/courses/cse501/06wi/reading/click-pldi95.pdf)
    407 
    408 [4] ["Register Allocation for Programs in SSA Form", Sebastian
    409 Hack](http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/6532)
    410 
    411 [5] ["An extension to the SSA representation for predicated code",
    412 Francois de
    413 Ferriere](http://www.cdl.uni-saarland.de/ssasem/talks/Francois.de.Ferriere.pdf)
    414 
    415 [6] ["Improvements to the Psi-SSA Representation", F. de
    416 Ferriere](http://www.scopesconf.org/scopes-07/presentations/3_Presentation.pdf)
    417