Home | History | Annotate | Download | only in Proposals
      1 ==================
      2 Vectorization Plan
      3 ==================
      4 
      5 .. contents::
      6    :local:
      7 
      8 Abstract
      9 ========
     10 The vectorization transformation can be rather complicated, involving several
     11 potential alternatives, especially for outer-loops [1]_ but also possibly for
     12 innermost loops. These alternatives may have significant performance impact,
     13 both positive and negative. A cost model is therefore employed to identify the
     14 best alternative, including the alternative of avoiding any transformation
     15 altogether.
     16 
     17 The Vectorization Plan is an explicit model for describing vectorization
     18 candidates. It serves for both optimizing candidates including estimating their
     19 cost reliably, and for performing their final translation into IR. This
     20 facilitates dealing with multiple vectorization candidates.
     21 
     22 High-level Design
     23 =================
     24 
     25 Vectorization Workflow
     26 ----------------------
     27 VPlan-based vectorization involves three major steps, taking a "scenario-based
     28 approach" to vectorization planning:
     29 
     30 1. Legal Step: check if a loop can be legally vectorized; encode constraints and
     31    artifacts if so.
     32 2. Plan Step:
     33 
     34    a. Build initial VPlans following the constraints and decisions taken by
     35       Legal Step 1, and compute their cost.
     36    b. Apply optimizations to the VPlans, possibly forking additional VPlans.
     37       Prune sub-optimal VPlans having relatively high cost.
     38 3. Execute Step: materialize the best VPlan. Note that this is the only step
     39    that modifies the IR.
     40 
     41 Design Guidelines
     42 -----------------
     43 In what follows, the term "input IR" refers to code that is fed into the
     44 vectorizer whereas the term "output IR" refers to code that is generated by the
     45 vectorizer. The output IR contains code that has been vectorized or "widened"
     46 according to a loop Vectorization Factor (VF), and/or loop unroll-and-jammed
     47 according to an Unroll Factor (UF).
     48 The design of VPlan follows several high-level guidelines:
     49 
     50 1. Analysis-like: building and manipulating VPlans must not modify the input IR.
     51    In particular, if the best option is not to vectorize at all, the
     52    vectorization process terminates before reaching Step 3, and compilation
     53    should proceed as if VPlans had not been built.
     54 
     55 2. Align Cost & Execute: each VPlan must support both estimating the cost and
     56    generating the output IR code, such that the cost estimation evaluates the
     57    to-be-generated code reliably.
     58 
     59 3. Support vectorizing additional constructs:
     60 
     61    a. Outer-loop vectorization. In particular, VPlan must be able to model the
     62       control-flow of the output IR which may include multiple basic-blocks and
     63       nested loops.
     64    b. SLP vectorization.
     65    c. Combinations of the above, including nested vectorization: vectorizing
     66       both an inner loop and an outer-loop at the same time (each with its own
     67       VF and UF), mixed vectorization: vectorizing a loop with SLP patterns
     68       inside [4]_, (re)vectorizing input IR containing vector code.
     69    d. Function vectorization [2]_.
     70 
     71 4. Support multiple candidates efficiently. In particular, similar candidates
     72    related to a range of possible VF's and UF's must be represented efficiently.
     73    Potential versioning needs to be supported efficiently.
     74 
     75 5. Support vectorizing idioms, such as interleaved groups of strided loads or
     76    stores. This is achieved by modeling a sequence of output instructions using
     77    a "Recipe", which is responsible for computing its cost and generating its
     78    code.
     79 
     80 6. Encapsulate Single-Entry Single-Exit regions (SESE). During vectorization
     81    such regions may need to be, for example, predicated and linearized, or
     82    replicated VF*UF times to handle scalarized and predicated instructions.
     83    Innerloops are also modelled as SESE regions.
     84 
     85 7. Support instruction-level analysis and transformation, as part of Planning
     86    Step 2.b: During vectorization instructions may need to be traversed, moved,
     87    replaced by other instructions or be created. For example, vector idiom
     88    detection and formation involves searching for and optimizing instruction
     89    patterns.
     90 
     91 Definitions
     92 ===========
     93 The low-level design of VPlan comprises of the following classes.
     94 
     95 :LoopVectorizationPlanner:
     96   A LoopVectorizationPlanner is designed to handle the vectorization of a loop
     97   or a loop nest. It can construct, optimize and discard one or more VPlans,
     98   each VPlan modelling a distinct way to vectorize the loop or the loop nest.
     99   Once the best VPlan is determined, including the best VF and UF, this VPlan
    100   drives the generation of output IR.
    101 
    102 :VPlan:
    103   A model of a vectorized candidate for a given input IR loop or loop nest. This
    104   candidate is represented using a Hierarchical CFG. VPlan supports estimating
    105   the cost and driving the generation of the output IR code it represents.
    106 
    107 :Hierarchical CFG:
    108   A control-flow graph whose nodes are basic-blocks or Hierarchical CFG's. The
    109   Hierarchical CFG data structure is similar to the Tile Tree [5]_, where
    110   cross-Tile edges are lifted to connect Tiles instead of the original
    111   basic-blocks as in Sharir [6]_, promoting the Tile encapsulation. The terms
    112   Region and Block are used rather than Tile [5]_ to avoid confusion with loop
    113   tiling.
    114 
    115 :VPBlockBase:
    116   The building block of the Hierarchical CFG. A pure-virtual base-class of
    117   VPBasicBlock and VPRegionBlock, see below. VPBlockBase models the hierarchical
    118   control-flow relations with other VPBlocks. Note that in contrast to the IR
    119   BasicBlock, a VPBlockBase models its control-flow successors and predecessors
    120   directly, rather than through a Terminator branch or through predecessor
    121   branches that "use" the VPBlockBase.
    122 
    123 :VPBasicBlock:
    124   VPBasicBlock is a subclass of VPBlockBase, and serves as the leaves of the
    125   Hierarchical CFG. It represents a sequence of output IR instructions that will
    126   appear consecutively in an output IR basic-block. The instructions of this
    127   basic-block originate from one or more VPBasicBlocks. VPBasicBlock holds a
    128   sequence of zero or more VPRecipes that model the cost and generation of the
    129   output IR instructions.
    130 
    131 :VPRegionBlock:
    132   VPRegionBlock is a subclass of VPBlockBase. It models a collection of
    133   VPBasicBlocks and VPRegionBlocks which form a SESE subgraph of the output IR
    134   CFG. A VPRegionBlock may indicate that its contents are to be replicated a
    135   constant number of times when output IR is generated, effectively representing
    136   a loop with constant trip-count that will be completely unrolled. This is used
    137   to support scalarized and predicated instructions with a single model for
    138   multiple candidate VF's and UF's.
    139 
    140 :VPRecipeBase:
    141   A pure-virtual base class modeling a sequence of one or more output IR
    142   instructions, possibly based on one or more input IR instructions. These
    143   input IR instructions are referred to as "Ingredients" of the Recipe. A Recipe
    144   may specify how its ingredients are to be transformed to produce the output IR
    145   instructions; e.g., cloned once, replicated multiple times or widened
    146   according to selected VF.
    147 
    148 :VPValue:
    149   The base of VPlan's def-use relations class hierarchy. When instantiated, it
    150   models a constant or a live-in Value in VPlan. It has users, which are of type
    151   VPUser, but no operands.
    152 
    153 :VPUser:
    154   A VPValue representing a general vertex in the def-use graph of VPlan. It has
    155   operands which are of type VPValue. When instantiated, it represents a
    156   live-out Instruction that exists outside VPlan. VPUser is similar in some
    157   aspects to LLVM's User class.
    158 
    159 :VPInstruction:
    160   A VPInstruction is both a VPRecipe and a VPUser. It models a single
    161   VPlan-level instruction to be generated if the VPlan is executed, including
    162   its opcode and possibly additional characteristics. It is the basis for
    163   writing instruction-level analyses and optimizations in VPlan as creating,
    164   replacing or moving VPInstructions record both def-use and scheduling
    165   decisions. VPInstructions also extend LLVM IR's opcodes with idiomatic
    166   operations that enrich the Vectorizer's semantics.
    167 
    168 :VPTransformState:
    169   Stores information used for generating output IR, passed from
    170   LoopVectorizationPlanner to its selected VPlan for execution, and used to pass
    171   additional information down to VPBlocks and VPRecipes.
    172 
    173 The Planning Process and VPlan Roadmap
    174 ======================================
    175 
    176 Transforming the Loop Vectorizer to use VPlan follows a staged approach. First,
    177 VPlan is used to record the final vectorization decisions, and to execute them:
    178 the Hierarchical CFG models the planned control-flow, and Recipes capture
    179 decisions taken inside basic-blocks. Next, VPlan will be used also as the basis
    180 for taking these decisions, effectively turning them into a series of
    181 VPlan-to-VPlan algorithms. Finally, VPlan will support the planning process
    182 itself including cost-based analyses for making these decisions, to fully
    183 support compositional and iterative decision making.
    184 
    185 Some decisions are local to an instruction in the loop, such as whether to widen
    186 it into a vector instruction or replicate it, keeping the generated instructions
    187 in place. Other decisions, however, involve moving instructions, replacing them
    188 with other instructions, and/or introducing new instructions. For example, a
    189 cast may sink past a later instruction and be widened to handle first-order
    190 recurrence; an interleave group of strided gathers or scatters may effectively
    191 move to one place where they are replaced with shuffles and a common wide vector
    192 load or store; new instructions may be introduced to compute masks, shuffle the
    193 elements of vectors, and pack scalar values into vectors or vice-versa.
    194 
    195 In order for VPlan to support making instruction-level decisions and analyses,
    196 it needs to model the relevant instructions along with their def/use relations.
    197 This too follows a staged approach: first, the new instructions that compute
    198 masks are modeled as VPInstructions, along with their induced def/use subgraph.
    199 This effectively models masks in VPlan, facilitating VPlan-based predication.
    200 Next, the logic embedded within each Recipe for generating its instructions at
    201 VPlan execution time, will instead take part in the planning process by modeling
    202 them as VPInstructions. Finally, only logic that applies to instructions as a
    203 group will remain in Recipes, such as interleave groups and potentially other
    204 idiom groups having synergistic cost.
    205 
    206 Related LLVM components
    207 -----------------------
    208 1. SLP Vectorizer: one can compare the VPlan model with LLVM's existing SLP
    209    tree, where TSLP [3]_ adds Plan Step 2.b.
    210 
    211 2. RegionInfo: one can compare VPlan's H-CFG with the Region Analysis as used by
    212    Polly [7]_.
    213 
    214 3. Loop Vectorizer: the Vectorization Plan aims to upgrade the infrastructure of
    215    the Loop Vectorizer and extend it to handle outer loops [8]_, [9]_.
    216 
    217 References
    218 ----------
    219 .. [1] "Outer-loop vectorization: revisited for short SIMD architectures", Dorit
    220     Nuzman and Ayal Zaks, PACT 2008.
    221 
    222 .. [2] "Proposal for function vectorization and loop vectorization with function
    223     calls", Xinmin Tian, [`cfe-dev
    224     <http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html>`_].,
    225     March 2, 2016.
    226     See also `review <https://reviews.llvm.org/D22792>`_.
    227 
    228 .. [3] "Throttling Automatic Vectorization: When Less is More", Vasileios
    229     Porpodas and Tim Jones, PACT 2015 and LLVM Developers' Meeting 2015.
    230 
    231 .. [4] "Exploiting mixed SIMD parallelism by reducing data reorganization
    232     overhead", Hao Zhou and Jingling Xue, CGO 2016.
    233 
    234 .. [5] "Register Allocation via Hierarchical Graph Coloring", David Callahan and
    235     Brian Koblenz, PLDI 1991
    236 
    237 .. [6] "Structural analysis: A new approach to flow analysis in optimizing
    238     compilers", M. Sharir, Journal of Computer Languages, Jan. 1980
    239 
    240 .. [7] "Enabling Polyhedral Optimizations in LLVM", Tobias Grosser, Diploma
    241     thesis, 2011.
    242 
    243 .. [8] "Introducing VPlan to the Loop Vectorizer", Gil Rapaport and Ayal Zaks,
    244     European LLVM Developers' Meeting 2017.
    245 
    246 .. [9] "Extending LoopVectorizer: OpenMP4.5 SIMD and Outer Loop
    247     Auto-Vectorization", Intel Vectorizer Team, LLVM Developers' Meeting 2016.
    248