IR3 NOTES
=========

Some notes about ir3, the compiler and machine-specific IR for the shader ISA introduced with adreno a3xx.  The same shader ISA is present, with some small differences, in adreno a4xx.

Compared to the previous generation a2xx ISA (ir2), the a3xx ISA is a "simple" scalar instruction set.  However, the compiler is responsible, in most cases, for scheduling the instructions.  The hardware does not try to hide the shader core pipeline stages.  For example, a common (cat2) ALU instruction takes four cycles, so a subsequent cat2 instruction which uses the result must have three intervening instructions (or nops).  When operating on vec4's, the corresponding scalar instructions for the remaining three components can typically fill those slots.  However, that leaves a lot of edge cases where things fall over, like:

::

  ADD TEMP[0], TEMP[1], TEMP[2]
  MUL TEMP[0], TEMP[1], TEMP[0].wzyx

Here, the second instruction needs the output of the first group of scalar instructions in the wrong order, resulting in not enough instruction slots between the ``add r0.w, r1.w, r2.w`` and ``mul r0.x, r1.x, r0.w``.  This is why the original (old) compiler, which merely translated nearly literally from TGSI to ir3, had a strong tendency to fall over.
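
As a rough illustration of the delay-slot rule above, the following sketch counts how many ``nop``\s would have to be inserted before a consumer instruction.  ``nops_needed`` and its parameters are hypothetical names used only for this example, not part of ir3; the value of three delay slots for cat2 is taken from the text above.

```c
#include <assert.h>

/* Hypothetical helper (not an ir3 function): number of nops to insert
 * before a consumer so that at least `delay_slots` instructions sit
 * between it and the producer of one of its sources.  Both indices are
 * positions in the already-scalarized instruction stream. */
static unsigned
nops_needed(unsigned producer_idx, unsigned consumer_idx, unsigned delay_slots)
{
    assert(consumer_idx > producer_idx);
    unsigned intervening = consumer_idx - producer_idx - 1;
    return intervening >= delay_slots ? 0 : delay_slots - intervening;
}
```

With the ``ADD`` scalarized to indices 0..3 (x, y, z, w) and ``mul r0.x`` at index 4, a ``mul`` that reads ``r0.x`` (produced at index 0) needs ``nops_needed(0, 4, 3) == 0`` nops, while the ``.wzyx`` case, which reads ``r0.w`` (produced at index 3), needs ``nops_needed(3, 4, 3) == 3``.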

So the current compiler instead generates, in the frontend, a directed-acyclic-graph of instructions and basic blocks, which then go through various additional passes to eventually schedule and do register assignment.

For additional documentation about the hardware, see wiki: `a3xx ISA
<https://github.com/freedreno/freedreno/wiki/A3xx-shader-instruction-set-architecture>`_.

External Structure
------------------

``ir3_shader``
    A single vertex/fragment/etc shader from gallium perspective (ie.
    maps to a single TGSI shader), and manages a set of shader variants
    which are generated on demand based on the shader key.

``ir3_shader_key``
    The configuration key that identifies a shader variant.  Ie. based
    on other GL state (two-sided-color, render-to-alpha, etc) or render
    stages (binning-pass vertex shader) different shader variants are
    generated.

``ir3_shader_variant``
    The actual hw shader generated based on input TGSI and shader key.

``ir3_compiler``
    Compiler frontend which generates ir3 and runs the various backend
    stages to schedule and do register assignment.
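
The relationship between a shader, its key and its variants can be sketched roughly as below.  All ``sketch_*`` names and the two key bits are hypothetical stand-ins chosen for illustration; the real ``ir3_shader_key`` and ``ir3_shader_variant`` carry more state, and the real compile step is only marked by a comment here.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the structures described above. */
struct sketch_key {
    unsigned color_two_side : 1;
    unsigned binning_pass : 1;
};

struct sketch_variant {
    struct sketch_key key;
    struct sketch_variant *next;
};

struct sketch_shader {
    struct sketch_variant *variants;  /* variants generated so far */
    unsigned num_compiles;            /* how often we had to "compile" */
};

static int
sketch_key_equal(const struct sketch_key *a, const struct sketch_key *b)
{
    return a->color_two_side == b->color_two_side &&
           a->binning_pass == b->binning_pass;
}

/* Return the variant matching `key`, generating it on first use. */
static struct sketch_variant *
sketch_get_variant(struct sketch_shader *shader, struct sketch_key key)
{
    for (struct sketch_variant *v = shader->variants; v; v = v->next)
        if (sketch_key_equal(&v->key, &key))
            return v;

    struct sketch_variant *v = calloc(1, sizeof(*v));
    if (!v)
        return NULL;
    v->key = key;
    v->next = shader->variants;
    shader->variants = v;
    shader->num_compiles++;  /* the actual compile would happen here */
    return v;
}
```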

The IR
------

The ir3 IR maps quite directly to the hardware, in that instruction opcodes map directly to hardware opcodes, and that dst/src register(s) map directly to the hardware dst/src register(s).  But there are a few extensions, in the form of meta_ instructions.  And additionally, for normal (non-const, etc) src registers, the ``IR3_REG_SSA`` flag is set and ``reg->instr`` points to the source instruction which produced that value.  So, for example, the following TGSI shader:

::

  VERT
  DCL IN[0]
  DCL IN[1]
  DCL OUT[0], POSITION
  DCL TEMP[0], LOCAL
    1: DP3 TEMP[0].x, IN[0].xyzz, IN[1].xyzz
    2: MOV OUT[0], TEMP[0].xxxx
    3: END

eventually generates:

.. graphviz::

  digraph G {
  rankdir=RL;
  nodesep=0.25;
  ranksep=1.5;
  subgraph clusterdce198 {
  label="vert";
  inputdce198 [shape=record,label="inputs|<in0> i0.x|<in1> i0.y|<in2> i0.z|<in4> i1.x|<in5> i1.y|<in6> i1.z"];
  instrdcf348 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
  instrdcedd0 [shape=record,style=filled,fillcolor=lightgrey,label="{mad.f32|<dst0>|<src0> |<src1> |<src2> }"];
  inputdce198:<in2>:w -> instrdcedd0:<src0>
  inputdce198:<in6>:w -> instrdcedd0:<src1>
  instrdcec30 [shape=record,style=filled,fillcolor=lightgrey,label="{mad.f32|<dst0>|<src0> |<src1> |<src2> }"];
  inputdce198:<in1>:w -> instrdcec30:<src0>
  inputdce198:<in5>:w -> instrdcec30:<src1>
  instrdceb60 [shape=record,style=filled,fillcolor=lightgrey,label="{mul.f|<dst0>|<src0> |<src1> }"];
  inputdce198:<in0>:w -> instrdceb60:<src0>
  inputdce198:<in4>:w -> instrdceb60:<src1>
  instrdceb60:<dst0> -> instrdcec30:<src2>
  instrdcec30:<dst0> -> instrdcedd0:<src2>
  instrdcedd0:<dst0> -> instrdcf348:<src0>
  instrdcf400 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
  instrdcedd0:<dst0> -> instrdcf400:<src0>
  instrdcf4b8 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
  instrdcedd0:<dst0> -> instrdcf4b8:<src0>
  outputdce198 [shape=record,label="outputs|<out0> o0.x|<out1> o0.y|<out2> o0.z|<out3> o0.w"];
  instrdcf348:<dst0> -> outputdce198:<out0>:e
  instrdcf400:<dst0> -> outputdce198:<out1>:e
  instrdcf4b8:<dst0> -> outputdce198:<out2>:e
  instrdcedd0:<dst0> -> outputdce198:<out3>:e
  }
  }

(after scheduling, etc, but before register assignment).

Internal Structure
~~~~~~~~~~~~~~~~~~

``ir3_block``
    Represents a basic block.

    TODO: currently blocks are nested, but I think I need to change that
    to a more conventional arrangement before implementing proper flow
    control.  Currently the only flow control handled is if/else which
    gets flattened out and results chosen with ``sel`` instructions.

``ir3_instruction``
    Represents a machine instruction or meta_ instruction.  Has pointers
    to dst register (``regs[0]``) and src register(s) (``regs[1..n]``),
    as needed.

``ir3_register``
    Represents a src or dst register, flags indicate const/relative/etc.
    If ``IR3_REG_SSA`` is set on a src register, the actual register
    number (name) has not been assigned yet, and instead the ``instr``
    field points to src instruction.

In addition there are various util macros/functions to simplify manipulation/traversal of the graph:

``foreach_src(srcreg, instr)``
    Iterate each instruction's source ``ir3_register``\s.

``foreach_src_n(srcreg, n, instr)``
    Like ``foreach_src``, also setting ``n`` to the source number (starting
    with ``0``).

``foreach_ssa_src(srcinstr, instr)``
    Iterate each instruction's SSA source ``ir3_instruction``\s.  This skips
    non-SSA sources (consts, etc), but includes virtual sources (such as the
    address register if `relative addressing`_ is used).

``foreach_ssa_src_n(srcinstr, n, instr)``
    Like ``foreach_ssa_src``, also setting ``n`` to the source number.

For example:

::

  foreach_ssa_src_n(src, i, instr) {
    unsigned d = delay_calc_srcn(ctx, src, instr, i);
    delay = MAX2(delay, d);
  }
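
To make the traversal rules above more concrete, here is a minimal, self-contained sketch of how such an iterator could behave.  The ``sk_*`` types, flags and macro are hypothetical simplifications written for this example (they are not the real ir3 definitions); the sketch only mirrors the behaviour described above: non-SSA sources are skipped and the address register is visited last as a virtual source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for ir3_register / ir3_instruction. */
#define SK_REG_CONST 0x1
#define SK_REG_SSA   0x2

struct sk_instr;

struct sk_reg {
    unsigned flags;
    struct sk_instr *instr;   /* producer, valid if SK_REG_SSA is set */
};

struct sk_instr {
    unsigned regs_count;      /* dst + srcs, so always >= 1 */
    struct sk_reg *regs[4];   /* regs[0] is dst, regs[1..] are srcs */
    struct sk_instr *address; /* optional virtual src, or NULL */
};

/* n-th SSA source, or NULL if source n is not SSA.  Index
 * (regs_count - 1) is the virtual address-register source. */
static struct sk_instr *
sk_ssa_src_n(const struct sk_instr *instr, unsigned n)
{
    if (n == instr->regs_count - 1)
        return instr->address;
    const struct sk_reg *reg = instr->regs[n + 1];
    return (reg->flags & SK_REG_SSA) ? reg->instr : NULL;
}

/* Visit each SSA source; the loop body runs only for non-NULL sources. */
#define sk_foreach_ssa_src_n(srcinstr, n, instr)              \
    for (unsigned n = 0; n < (instr)->regs_count; n++)        \
        if (((srcinstr) = sk_ssa_src_n((instr), n)))
```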


TODO probably other helper/util stuff worth mentioning here

.. _meta:

Meta Instructions
~~~~~~~~~~~~~~~~~

**input**
    Used for shader inputs (registers configured in the command-stream
    to hold particular input values, written by the shader core before
    start of execution).  Also used for connecting up values within a
    basic block to an output of a previous block.

**output**
    Used to hold outputs of a basic block.

**flow**
    TODO

**phi**
    TODO

**fanin**
    Groups registers which need to be assigned to consecutive scalar
    registers, for example ``sam`` (texture fetch) src instructions (see
    `register groups`_) or array element dereference
    (see `relative addressing`_).

**fanout**
    The counterpart to **fanin**, when an instruction such as ``sam``
    writes multiple components, splits the result into individual
    scalar components to be consumed by other instructions.


.. _`flow control`:

Flow Control
~~~~~~~~~~~~

TODO


.. _`register groups`:

Register Groups
~~~~~~~~~~~~~~~

Certain instructions, such as texture sample instructions, consume multiple consecutive scalar registers via a single src register encoded in the instruction, and/or write multiple consecutive scalar registers.  In the simplest example:

::

  sam (f32)(xyz)r2.x, r0.z, s#0, t#0

for a 2d texture, would read ``r0.zw`` to get the coordinate, and write ``r2.xyz``.

Before register assignment, a ``fanin`` is used to group the two components of the texture src together:

.. graphviz::

  digraph G {
    { rank=same;
      fanin;
    };
    { rank=same;
      coord_x;
      coord_y;
    };
    sam -> fanin [label="regs[1]"];
    fanin -> coord_x [label="regs[1]"];
    fanin -> coord_y [label="regs[2]"];
    coord_x -> coord_y [label="right",style=dotted];
    coord_y -> coord_x [label="left",style=dotted];
    coord_x [label="coord.x"];
    coord_y [label="coord.y"];
  }

The frontend sets up the SSA ptrs from ``sam`` source register to the ``fanin`` meta instruction, which in turn points to the instructions producing the ``coord.x`` and ``coord.y`` values.  And the grouping_ pass sets up the ``left`` and ``right`` neighbor pointers to the ``fanin``\'s sources, used later by the `register assignment`_ pass to assign blocks of scalar registers.

And likewise, for the consecutive scalar registers for the destination:

.. graphviz::

  digraph {
    { rank=same;
      A;
      B;
      C;
    };
    { rank=same;
      fanout_0;
      fanout_1;
      fanout_2;
    };
    A -> fanout_0;
    B -> fanout_1;
    C -> fanout_2;
    fanout_0 [label="fanout\noff=0"];
    fanout_0 -> sam;
    fanout_1 [label="fanout\noff=1"];
    fanout_1 -> sam;
    fanout_2 [label="fanout\noff=2"];
    fanout_2 -> sam;
    fanout_0 -> fanout_1 [label="right",style=dotted];
    fanout_1 -> fanout_0 [label="left",style=dotted];
    fanout_1 -> fanout_2 [label="right",style=dotted];
    fanout_2 -> fanout_1 [label="left",style=dotted];
    sam;
  }

.. _`relative addressing`:

Relative Addressing
~~~~~~~~~~~~~~~~~~~

Most instructions support addressing indirectly (relative to address register) into const or gpr register file in some or all of their src/dst registers.  In this case the register accessed is taken from ``r<a0.x + n>`` or ``c<a0.x + n>``, ie. address register (``a0.x``) value plus ``n``, where ``n`` is encoded in the instruction (rather than the absolute register number).

    Note that cat5 (texture sample) instructions are the notable exception, not
    supporting relative addressing of src or dst.
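
The address computation can be sketched as below.  This assumes that scalar registers are numbered ``4 * N + component`` for ``rN.c`` (so ``r0.z`` is 2 and ``r1.y`` is 5), and that ``a0.x`` can be treated as a plain integer; ``sk_regid`` and ``sk_relative_regid`` are hypothetical helper names introduced only for this example.

```c
#include <assert.h>

enum { COMP_X = 0, COMP_Y, COMP_Z, COMP_W };

/* Assumed numbering: scalar register id of rN.c is 4 * N + c. */
static unsigned
sk_regid(unsigned num, unsigned comp)
{
    return (num << 2) | (comp & 0x3);
}

/* Scalar register accessed by r<a0.x + n> (or c<a0.x + n>): the value
 * of the address register plus the offset encoded in the instruction. */
static unsigned
sk_relative_regid(int a0_x, unsigned n)
{
    int id = a0_x + (int)n;
    assert(id >= 0);
    return (unsigned)id;
}
```

For example, with an encoded offset of 2 and ``a0.x == 3``, ``r<a0.x + 2>`` would refer to scalar register 5, ie. ``r1.y`` under the assumed numbering.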

Relative addressing of the const file (for example, a uniform array) is relatively simple.  We don't do register assignment of the const file, so all that is required is to schedule things properly.  Ie. the instruction that writes the address register must be scheduled first, and we cannot have two different address register values live at one time.

But relative addressing of gpr file (which can be as src or dst) has additional restrictions on register assignment (ie. the array elements must be assigned to consecutive scalar registers).  And in the case of relative dst, subsequent instructions now depend on both the relative write, as well as the previous instruction which wrote that register, since we do not know at compile time which actual register was written.

Each instruction has an optional ``address`` pointer, to capture the dependency on the address register value when relative addressing is used for any of the src/dst register(s).  This behaves as an additional virtual src register, ie. ``foreach_ssa_src()`` will also iterate the address register (last).

    Note that ``nop``\'s for timing constraints, type specifiers (ie.
    ``add.f`` vs ``add.u``), etc, are omitted for brevity in the examples.

::

  mova a0.x, hr1.y
  sub r1.y, r2.x, r3.x
  add r0.x, r1.y, c<a0.x + 2>

results in:

.. graphviz::

  digraph {
    rankdir=LR;
    sub;
    const [label="const file"];
    add;
    mova;
    add -> mova;
    add -> sub;
    add -> const [label="off=2"];
  }

The scheduling pass has some smarts to schedule things such that only a single ``a0.x`` value is used at any one time.

To implement variable arrays, values are stored in consecutive scalar registers.  This has some overlap with `register groups`_, in that ``fanin`` and ``fanout`` are used to help group things for the `register assignment`_ pass.

Using a variable array as a src register is a slight variation of what is done for a const array src.  The instruction src is a ``fanin`` instruction that groups all the array members:

::

  mova a0.x, hr1.y
  sub r1.y, r2.x, r3.x
  add r0.x, r1.y, r<a0.x + 2>

results in:

.. graphviz::

  digraph {
    a0 [label="r0.z"];
    a1 [label="r0.w"];
    a2 [label="r1.x"];
    a3 [label="r1.y"];
    sub;
    fanin;
    mova;
    add;
    add -> sub;
    add -> fanin [label="off=2"];
    add -> mova;
    fanin -> a0;
    fanin -> a1;
    fanin -> a2;
    fanin -> a3;
  }

TODO better describe how actual deref offset is derived, ie. based on array base register.

To do an indirect write to a variable array, a ``fanout`` is used.  Say the array was assigned to registers ``r0.z`` through ``r1.y`` (hence the constant offset of 2):

    Note that only cat1 (mov) can do indirect write.

::

  mova a0.x, hr1.y
  min r2.x, r2.x, c0.x
  mov r<a0.x + 2>, r2.x
  mul r0.x, r0.z, c0.z


In this case, the ``mov`` instruction does not write all elements of the array (compared to usage of ``fanout`` for ``sam`` instructions in grouping_).  But the ``mov`` instruction does need an additional dependency (via ``fanin``) on instructions that last wrote the array element members, to ensure that they get scheduled before the ``mov`` in scheduling_ stage (which also serves to group the array elements for the `register assignment`_ stage).

.. graphviz::

  digraph {
    a0 [label="r0.z"];
    a1 [label="r0.w"];
    a2 [label="r1.x"];
    a3 [label="r1.y"];
    min;
    mova;
    mov;
    mul;
    fanout [label="fanout\noff=0"];
    mul -> fanout;
    fanout -> mov;
    fanin;
    fanin -> a0;
    fanin -> a1;
    fanin -> a2;
    fanin -> a3;
    mov -> min;
    mov -> mova;
    mov -> fanin;
  }

Note that there would in fact be ``fanout`` nodes generated for each array element (although only the reachable ones will be scheduled, etc).



Shader Passes
-------------

After the frontend has generated the use-def graph of instructions, the instructions are run through various passes which include scheduling_ and `register assignment`_.  Because inserting ``mov`` instructions after scheduling would also require inserting additional ``nop`` instructions (since it is too late to reschedule to try and fill the bubbles), the earlier stages try to ensure (at least given an infinite supply of registers) that `register assignment`_ after scheduling_ cannot fail.

    Note that we essentially have ~256 scalar registers in the
    architecture (although larger register usage will at some thresholds
    limit the number of threads which can run in parallel).  And at some
    point we will have to deal with spilling.

.. _flatten:

Flatten
~~~~~~~

In this stage, simple if/else blocks are flattened into a single block with ``phi`` nodes converted into ``sel`` instructions.  The a3xx ISA has very few predicated instructions, and we would prefer not to use branches for simple if/else.


.. _`copy propagation`:

Copy Propagation
~~~~~~~~~~~~~~~~

Currently the frontend inserts ``mov``\s in various cases, because certain categories of instructions have limitations on const regs as sources.  The CP pass then simply removes all simple ``mov``\s (ie. src-type is same as dst-type, no abs/neg flags, etc).

The eventual plan is to invert that, with the front-end inserting no ``mov``\s and CP legalizing things.
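
The current behaviour (removing simple ``mov``\s) can be sketched as follows.  This is a minimal illustration on a hypothetical, simplified node type (``sk_node`` and friends are not ir3 names), assuming that "simple" means exactly the conditions listed above: same src and dst type, and no abs/neg flags on the source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified instruction node for illustration only. */
enum sk_opc { SK_MOV, SK_ADD };
enum sk_type { SK_F32, SK_F16 };

#define SK_SRC_ABS 0x1
#define SK_SRC_NEG 0x2

struct sk_node {
    enum sk_opc opc;
    enum sk_type src_type, dst_type;
    unsigned src_flags;       /* modifier flags on src[0] */
    struct sk_node *src[2];   /* SSA sources, NULL if unused/non-SSA */
};

static bool
sk_is_simple_mov(const struct sk_node *n)
{
    return n->opc == SK_MOV && n->src[0] != NULL &&
           n->src_type == n->dst_type &&
           !(n->src_flags & (SK_SRC_ABS | SK_SRC_NEG));
}

/* Skip over any chain of simple movs to reach the real producer. */
static struct sk_node *
sk_cp_resolve(struct sk_node *n)
{
    while (n && sk_is_simple_mov(n))
        n = n->src[0];
    return n;
}

/* Rewrite each source of `n` to point past simple movs. */
static void
sk_cp_instr(struct sk_node *n)
{
    for (unsigned i = 0; i < 2; i++)
        if (n->src[i])
            n->src[i] = sk_cp_resolve(n->src[i]);
}
```

A ``mov`` that converts between types (or carries abs/neg) is not simple in this sense and is left in place.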


.. _grouping:

Grouping
~~~~~~~~

In the grouping pass, instructions which need to be grouped (for ``fanin``\s, etc) have their ``left`` / ``right`` neighbor pointers set up.  In cases where there is a conflict (ie. one instruction cannot have two unique left or right neighbors), an additional ``mov`` instruction is inserted.  This ensures that there is some possible valid `register assignment`_ at the later stages.
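
The conflict check described above might look roughly like the sketch below.  ``sk_gnode`` and ``sk_link`` are hypothetical names for this example; a ``false`` return marks the case where the text says an extra ``mov`` would be inserted (the insertion itself is not shown).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node carrying only the neighbor pointers. */
struct sk_gnode {
    struct sk_gnode *left, *right;
};

/* Try to make `b` the right-hand neighbor of `a`.  Returns false on a
 * conflict, ie. if `a` already has a different right neighbor or `b`
 * already has a different left neighbor. */
static bool
sk_link(struct sk_gnode *a, struct sk_gnode *b)
{
    if ((a->right && a->right != b) || (b->left && b->left != a))
        return false;
    a->right = b;
    b->left = a;
    return true;
}
```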


.. _depth:

Depth
~~~~~

In the depth pass, a depth is calculated for each instruction node within its basic block.  The depth is the sum of the required cycles (delay slots needed between two instructions plus one) of each instruction plus the max depth of any of its source instructions.  (meta_ instructions don't add to the depth).  As an instruction's depth is calculated, it is inserted into a per block list sorted by deepest instruction.  Unreachable instructions and inputs are marked.

    TODO: we should probably calculate both hard and soft depths (?) to
    try to coax additional instructions to fit in places where we need
    to use sync bits, such as after a texture fetch or SFU.
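
The depth calculation described above can be sketched recursively as below.  This follows the prose literally on a hypothetical node type (``sk_dnode`` is not an ir3 structure) and assumes a single per-instruction delay-slot count; the per-block sorted list and the marking of unreachable instructions are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified node for the depth calculation. */
struct sk_dnode {
    bool is_meta;             /* meta instructions add no depth */
    unsigned delay_slots;     /* delay slots this instruction requires */
    unsigned nsrc;
    struct sk_dnode *src[3];
    bool visited;
    unsigned depth;
};

/* depth = (delay slots + 1, or 0 for meta) + max depth of any source */
static unsigned
sk_depth(struct sk_dnode *n)
{
    if (n->visited)
        return n->depth;

    unsigned max_src = 0;
    for (unsigned i = 0; i < n->nsrc; i++) {
        unsigned d = sk_depth(n->src[i]);
        if (d > max_src)
            max_src = d;
    }

    n->depth = max_src + (n->is_meta ? 0 : n->delay_slots + 1);
    n->visited = true;
    return n->depth;
}
```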

.. _scheduling:

Scheduling
~~~~~~~~~~

After the grouping_ pass, there are no more instructions to insert or remove.  Start scheduling each basic block from the deepest node in the depth sorted list created by the depth_ pass, recursively trying to schedule each instruction after its source instructions plus delay slots.  Insert ``nop``\s as required.
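
A heavily simplified sketch of the recursive idea is shown below.  It schedules the sources of an instruction first and then pads with ``nop``\s until the delay slots of every source are satisfied.  All ``sk_*`` names are hypothetical, and unlike a real scheduler this sketch makes no attempt to fill the bubbles with other useful instructions.

```c
#include <assert.h>
#include <stddef.h>

#define SK_MAX_SLOTS 64

/* Hypothetical, simplified instruction node for scheduling. */
struct sk_snode {
    const char *name;
    unsigned delay_slots;     /* slots required before a consumer */
    unsigned nsrc;
    struct sk_snode *src[3];
    int slot;                 /* issue slot, or -1 if not yet scheduled */
};

struct sk_sched {
    const struct sk_snode *slots[SK_MAX_SLOTS];  /* NULL entry == nop */
    unsigned count;
};

static void
sk_schedule(struct sk_sched *s, struct sk_snode *n)
{
    if (n->slot >= 0)
        return;  /* already scheduled */

    /* Schedule sources first, tracking the earliest legal slot. */
    unsigned earliest = 0;
    for (unsigned i = 0; i < n->nsrc; i++) {
        sk_schedule(s, n->src[i]);
        unsigned ready = (unsigned)n->src[i]->slot + n->src[i]->delay_slots + 1;
        if (ready > earliest)
            earliest = ready;
    }

    /* Pad with nops until the delay slots are satisfied. */
    while (s->count < earliest) {
        assert(s->count < SK_MAX_SLOTS);
        s->slots[s->count++] = NULL;
    }

    assert(s->count < SK_MAX_SLOTS);
    n->slot = (int)s->count;
    s->slots[s->count++] = n;
}
```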

.. _`register assignment`:

Register Assignment
~~~~~~~~~~~~~~~~~~~

TODO