//===- README_X86_64.txt - Notes for X86-64 code gen ----------------------===//

AMD64 Optimization Manual 8.2 has some nice information about optimizing integer
multiplication by a constant. How much of it applies to Intel's X86-64
implementation? There are definite trade-offs to consider: latency vs. register
pressure vs. code size.
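
As an example of the trade-off (a sketch, not taken from the manual):
multiplication by 45 can be decomposed as 45 = 9 * 5, which maps to two LEAs
instead of one imul; lower latency on many cores, but one more instruction:

unsigned mul45(unsigned x) {
  unsigned t = x + (x << 2);   /* t = 5*x; a single LEA on x86-64 */
  return t + (t << 3);         /* 9*t = 45*x; a second LEA */
}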

//===---------------------------------------------------------------------===//

Are we better off using branches instead of cmov to implement FP to
unsigned i64 conversion?

_conv:
	ucomiss	LC0(%rip), %xmm0
	cvttss2siq	%xmm0, %rdx
	jb	L3
	subss	LC0(%rip), %xmm0
	movabsq	$-9223372036854775808, %rax
	cvttss2siq	%xmm0, %rdx
	xorq	%rax, %rdx
L3:
	movq	%rdx, %rax
	ret

instead of

_conv:
	movss LCPI1_0(%rip), %xmm1
	cvttss2siq %xmm0, %rcx
	movaps %xmm0, %xmm2
	subss %xmm1, %xmm2
	cvttss2siq %xmm2, %rax
	movabsq $-9223372036854775808, %rdx
	xorq %rdx, %rax
	ucomiss %xmm1, %xmm0
	cmovb %rcx, %rax
	ret

Seems like the jb branch has a high likelihood of being taken; when it is, the
branchy version executes a few fewer instructions.
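
For reference, a C rendering of the branchy sequence (a sketch; it assumes
LC0 holds 2^63 as a float):

unsigned long long conv(float x) {
  if (x < 9223372036854775808.0f)   /* x < 2^63: fits in the signed range */
    return (long long)x;            /* a single cvttss2siq */
  /* otherwise rebias by 2^63, convert, and restore the top bit */
  return (long long)(x - 9223372036854775808.0f) ^ 0x8000000000000000ULL;
}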

//===---------------------------------------------------------------------===//

It's not possible to reference the AH, BH, CH, and DH registers in an
instruction requiring a REX prefix. However, divb and mulb both produce
results in AH. If isel emits a CopyFromReg which gets turned into a movb, the
destination may be allocated to r8b - r15b, which requires a REX prefix and
therefore cannot encode AH as the source.

To get around this, isel emits a CopyFromReg from AX and then right-shifts it
down by 8 and truncates it. It's not pretty, but it works. We need some
register allocation magic to make the hack go away (e.g. putting additional
constraints on the result of the movb).
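
A minimal function that runs into this (the remainder of an 8-bit divide
materializes in AH):

unsigned char rem8(unsigned char a, unsigned char b) {
  return a % b;   /* divb puts the quotient in AL and the remainder in AH */
}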

//===---------------------------------------------------------------------===//

The x86-64 ABI for hidden-argument struct returns requires that the
incoming value of %rdi be copied into %rax by the callee upon return.

The idea is that it saves callers from having to remember this value,
which would often require a callee-saved register. Callees usually
need to keep this value live for most of their body anyway, so it
doesn't add a significant burden on them.
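
A minimal C illustration (the names here are made up):

struct S { long a, b, c; };    /* too large for registers: returned via sret */

struct S make(void) {
  struct S s = {1, 2, 3};
  return s;                    /* callee copies %rdi into %rax on return */
}

long use(void) {
  /* The caller passed a hidden pointer to s in %rdi.  Because make()
     returns that pointer in %rax, the caller can reload it from %rax
     instead of keeping a copy alive in a callee-saved register. */
  struct S s = make();
  return s.a + s.c;
}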

We currently implement this in codegen; however, this is suboptimal because
it makes it quite awkward to implement the optimization for callers.

A better implementation would be to relax the LLVM IR rules for sret
arguments to allow a function with an sret argument to have a non-void
return type, and to have the front-end set up the sret argument value
as the return value of the function. The front-end could more easily
emit uses of the returned struct value in terms of the function's
lowered return value, and it would free non-C frontends from a
complication only required by a C-based ABI.

//===---------------------------------------------------------------------===//

We get a redundant zero extension for code like this:

int mask[1000];
int foo(unsigned x) {
 if (x < 10)
   x = x * 45;
 else
   x = x * 78;
 return mask[x];
}

_foo:
LBB1_0:	## entry
	cmpl	$9, %edi
	jbe	LBB1_3	## bb
LBB1_1:	## bb1
	imull	$78, %edi, %eax
LBB1_2:	## bb2
	movl	%eax, %eax                    <----
	movq	_mask@GOTPCREL(%rip), %rcx
	movl	(%rcx,%rax,4), %eax
	ret
LBB1_3:	## bb
	imull	$45, %edi, %eax
	jmp	LBB1_2	## bb2

Before regalloc, we have:

        %reg1025<def> = IMUL32rri8 %reg1024, 45, %EFLAGS<imp-def>
        JMP mbb<bb2,0x203afb0>
    Successors according to CFG: 0x203afb0 (#3)

bb1: 0x203af60, LLVM BB @0x1e02310, ID#2:
    Predecessors according to CFG: 0x203aec0 (#0)
        %reg1026<def> = IMUL32rri8 %reg1024, 78, %EFLAGS<imp-def>
    Successors according to CFG: 0x203afb0 (#3)

bb2: 0x203afb0, LLVM BB @0x1e02340, ID#3:
    Predecessors according to CFG: 0x203af10 (#1) 0x203af60 (#2)
        %reg1027<def> = PHI %reg1025, mbb<bb,0x203af10>,
                            %reg1026, mbb<bb1,0x203af60>
        %reg1029<def> = MOVZX64rr32 %reg1027

so we'd have to know that IMUL32rri8 leaves the upper 32 bits zeroed, and we'd
have to be able to recognize and eliminate the redundant zero extend.  This
could also presumably be implemented if we had whole-function selectiondags.
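
A hypothetical sketch of such a check (illustrative names, not LLVM's actual
API): on x86-64, any instruction that writes a 32-bit GPR implicitly zeroes
bits 63:32, so a MOVZX64rr32 is redundant when every def reaching its source,
including all PHI inputs as above, is of that form:

enum opc { IMUL32rri8, MOV32rr, ADD32rr, MOVZX64rr32 };

static int zeroes_upper_32(enum opc op) {
  switch (op) {
  case IMUL32rri8:
  case MOV32rr:
  case ADD32rr:    /* in fact, any 32-bit GPR write qualifies */
    return 1;
  default:
    return 0;
  }
}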

//===---------------------------------------------------------------------===//

Take the following code
(from http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34653):
extern unsigned long table[];
unsigned long foo(unsigned char *p) {
  unsigned long tag = *p;
  return table[tag >> 4] + table[tag & 0xf];
}

Current code generated:
	movzbl	(%rdi), %eax
	movq	%rax, %rcx
	andq	$240, %rcx
	shrq	%rcx
	andq	$15, %rax
	movq	table(,%rax,8), %rax
	addq	table(%rcx), %rax
	ret

    144 
    145 Issues:
    146 1. First movq should be movl; saves a byte.
    147 2. Both andq's should be andl; saves another two bytes.  I think this was
    148    implemented at one point, but subsequently regressed.
    149 3. shrq should be shrl; saves another byte.
    150 4. The first andq can be completely eliminated by using a slightly more
    151    expensive addressing mode.
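
Issues 1-3 all come down to using 32-bit forms for arithmetic whose result is
known to fit in 32 bits.  A source-level sketch of the same observation (not
a codegen fix): with a 32-bit index type, the andl/shrl forms are the natural
lowering:

extern unsigned long table[];
unsigned long foo32(unsigned char *p) {
  unsigned tag = *p;             /* fits in 8 bits, so 32-bit ops suffice */
  return table[tag >> 4] + table[tag & 0xfu];
}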

//===---------------------------------------------------------------------===//

Consider the following (contrived testcase, but contains common factors):

#include <stdarg.h>
int test(int x, ...) {
  int sum = 0, i;
  va_list l;
  va_start(l, x);
  for (i = 0; i < x; i++)
    sum += va_arg(l, int);
  va_end(l);
  return sum;
}

Testcase given in C because fixing it will likely involve changing the IR
generated for it.  The primary issue with the result is that it doesn't do any
of the optimizations which are possible if we know the address of a va_list
in the current function is never taken:
1. We shouldn't spill the XMM registers because we only call va_arg with "int".
2. It would be nice if we could scalarrepl the va_list.
3. Probably overkill, but it'd be cool if we could peel off the first five
   iterations of the loop.

Other optimizations for functions which use va_arg on floats and don't have
the address of a va_list taken:
1. Conversely to the above, we shouldn't spill general registers if we only
   call va_arg on "double".
2. If we know that nothing wider than 64 bits is read from the XMM registers,
   we can change the spilling code to reduce the amount of stack used by half.
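
For reference, the SysV x86-64 va_list is a one-element array of the struct
below (field names as in the ABI; the typedef is renamed here to avoid
clashing with <stdarg.h>).  Scalarrepl'ing a va_list, as in item 2 of the
first list, means promoting these four fields:

typedef struct {
  unsigned int gp_offset;        /* next unconsumed GP register slot */
  unsigned int fp_offset;        /* next unconsumed XMM register slot */
  void *overflow_arg_area;       /* arguments passed on the stack */
  void *reg_save_area;           /* spilled GP and XMM argument registers */
} sysv_va_list[1];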

//===---------------------------------------------------------------------===//