Target Independent Opportunities:

//===---------------------------------------------------------------------===//

With the recent changes to make the implicit def/use set explicit in
machineinstrs, we should change the target descriptions for 'call' instructions
so that the .td files don't list all the call-clobbered registers as implicit
defs.  Instead, these should be added by the code generator (e.g. on the dag).

This has a number of uses:

1. PPC32/64 and X86 32/64 can avoid having multiple copies of call instructions
   for their different impdef sets.
2. Targets with multiple calling convs (e.g. x86) which have different clobber
   sets don't need copies of call instructions.
3. 'Interprocedural register allocation' can be done to reduce the clobber sets
   of calls.

//===---------------------------------------------------------------------===//

We should recognize various "overflow detection" idioms and translate them into
llvm.uadd.with.overflow and similar intrinsics.  Here is a multiply idiom:

unsigned int mul(unsigned int a, unsigned int b) {
  if ((unsigned long long)a*b > 0xffffffff)
    exit(0);
  return a*b;
}

The legalization code for mul-with-overflow needs to be made more robust before
this can be implemented though.

//===---------------------------------------------------------------------===//

Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
precision don't matter (-ffast-math).  Misc/mandel will like this. :)  This isn't
safe in general, even on darwin.  See the libm implementation of hypot for
examples (which special case when x/y are exactly zero to get signed zeros etc
right).

//===---------------------------------------------------------------------===//

On targets with expensive 64-bit multiply, we could LSR this:

for (i = ...; ++i) {
   x = 1ULL << i;

into:
 long long tmp = 1;
 for (i = ...; ++i, tmp+=tmp)
   x = tmp;

This would be a win on ppc32, but not x86 or ppc64.

//===---------------------------------------------------------------------===//

Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)

//===---------------------------------------------------------------------===//

Reassociate should turn things like:

int factorial(int X) {
  return X*X*X*X*X*X*X*X;
}

into llvm.powi calls, allowing the code generator to produce balanced
multiplication trees.

First, the intrinsic needs to be extended to support integers, and second the
code generator needs to be enhanced to lower these to multiplication trees.

//===---------------------------------------------------------------------===//

An interesting testcase for add/shift/mul reassoc:

int bar(int x, int y) {
  return x*x*x+y+x*x*x*x*x*y*y*y*y;
}
int foo(int z, int n) {
  return bar(z, n) + bar(2*z, 2*n);
}

This is blocked on not handling X*X*X -> powi(X, 3) (see note above).  The issue
is that we end up getting t = 2*X  s = t*t  and don't turn this into 4*X*X,
which is the same number of multiplies and is canonical, because the 2*X has
multiple uses.
Here's a simple example:

define i32 @test15(i32 %X1) {
  %B = mul i32 %X1, 47   ; X1*47
  %C = mul i32 %B, %B
  ret i32 %C
}


//===---------------------------------------------------------------------===//

Reassociate should handle the example in GCC PR16157:

extern int a0, a1, a2, a3, a4; extern int b0, b1, b2, b3, b4;
void f () {  /* this can be optimized to four additions... */
        b4 = a4 + a3 + a2 + a1 + a0;
        b3 = a3 + a2 + a1 + a0;
        b2 = a2 + a1 + a0;
        b1 = a1 + a0;
}

This requires reassociating to forms of expressions that are already available,
something that reassoc doesn't think about yet.


//===---------------------------------------------------------------------===//

This function: (derived from GCC PR19988)
double foo(double x, double y) {
  return ((x + 0.1234 * y) * (x + -0.1234 * y));
}

compiles to:
_foo:
	movapd	%xmm1, %xmm2
	mulsd	LCPI1_1(%rip), %xmm1
	mulsd	LCPI1_0(%rip), %xmm2
	addsd	%xmm0, %xmm1
	addsd	%xmm0, %xmm2
	movapd	%xmm1, %xmm0
	mulsd	%xmm2, %xmm0
	ret

Reassociate should be able to turn it into:

double foo(double x, double y) {
  return ((x + 0.1234 * y) * (x - 0.1234 * y));
}

Which allows the multiply by constant to be CSE'd, producing:

_foo:
	mulsd	LCPI1_0(%rip), %xmm1
	movapd	%xmm1, %xmm2
	addsd	%xmm0, %xmm2
	subsd	%xmm1, %xmm0
	mulsd	%xmm2, %xmm0
	ret

This doesn't need -ffast-math support at all.  This is particularly bad because
the llvm-gcc frontend is canonicalizing the latter into the former, but clang
doesn't have this problem.

//===---------------------------------------------------------------------===//

These two functions should generate the same code on big-endian systems:

int g(int *j, int *l) {  return memcmp(j,l,4);  }
int h(int *j, int *l) {  return *j - *l; }

This could be done in SelectionDAGISel.cpp, along with other special cases,
for 1,2,4,8 bytes.

//===---------------------------------------------------------------------===//

It would be nice to revert this patch:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html

And teach the dag combiner enough to simplify the code expanded before
legalize.  It seems plausible that this knowledge would let it simplify other
stuff too.

//===---------------------------------------------------------------------===//

For vector types, TargetData.cpp::getTypeInfo() returns alignment that is equal
to the type size.  It works but can be overly conservative as the alignment of
specific vector types is target dependent.

//===---------------------------------------------------------------------===//

We should produce an unaligned load from code like this:

v4sf example(float *P) {
  return (v4sf){P[0], P[1], P[2], P[3] };
}

//===---------------------------------------------------------------------===//

Add support for conditional increments, and other related patterns.  Instead
of:

	movl 136(%esp), %eax
	cmpl $0, %eax
	je LBB16_2	#cond_next
LBB16_1:	#cond_true
	incl _foo
LBB16_2:	#cond_next

emit:
	movl	_foo, %eax
	cmpl	$1, %edi
	sbbl	$-1, %eax
	movl	%eax, _foo

//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).
204 205 Expand these to calls of sin/cos and stores: 206 double sincos(double x, double *sin, double *cos); 207 float sincosf(float x, float *sin, float *cos); 208 long double sincosl(long double x, long double *sin, long double *cos); 209 210 Doing so could allow SROA of the destination pointers. See also: 211 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687 212 213 This is now easily doable with MRVs. We could even make an intrinsic for this 214 if anyone cared enough about sincos. 215 216 //===---------------------------------------------------------------------===// 217 218 quantum_sigma_x in 462.libquantum contains the following loop: 219 220 for(i=0; i<reg->size; i++) 221 { 222 /* Flip the target bit of each basis state */ 223 reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target); 224 } 225 226 Where MAX_UNSIGNED/state is a 64-bit int. On a 32-bit platform it would be just 227 so cool to turn it into something like: 228 229 long long Res = ((MAX_UNSIGNED) 1 << target); 230 if (target < 32) { 231 for(i=0; i<reg->size; i++) 232 reg->node[i].state ^= Res & 0xFFFFFFFFULL; 233 } else { 234 for(i=0; i<reg->size; i++) 235 reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL 236 } 237 238 ... which would only do one 32-bit XOR per loop iteration instead of two. 239 240 It would also be nice to recognize the reg->size doesn't alias reg->node[i], but 241 this requires TBAA. 242 243 //===---------------------------------------------------------------------===// 244 245 This isn't recognized as bswap by instcombine (yes, it really is bswap): 246 247 unsigned long reverse(unsigned v) { 248 unsigned t; 249 t = v ^ ((v << 16) | (v >> 16)); 250 t &= ~0xff0000; 251 v = (v << 24) | (v >> 8); 252 return v ^ (t >> 8); 253 } 254 255 //===---------------------------------------------------------------------===// 256 257 [LOOP DELETION] 258 259 We don't delete this output free loop, because trip count analysis doesn't 260 realize that it is finite (if it were infinite, it would be undefined). Not 261 having this blocks Loop Idiom from matching strlen and friends. 262 263 void foo(char *C) { 264 int x = 0; 265 while (*C) 266 ++x,++C; 267 } 268 269 //===---------------------------------------------------------------------===// 270 271 [LOOP RECOGNITION] 272 273 These idioms should be recognized as popcount (see PR1488): 274 275 unsigned countbits_slow(unsigned v) { 276 unsigned c; 277 for (c = 0; v; v >>= 1) 278 c += v & 1; 279 return c; 280 } 281 unsigned countbits_fast(unsigned v){ 282 unsigned c; 283 for (c = 0; v; c++) 284 v &= v - 1; // clear the least significant bit set 285 return c; 286 } 287 288 BITBOARD = unsigned long long 289 int PopCnt(register BITBOARD a) { 290 register int c=0; 291 while(a) { 292 c++; 293 a &= a - 1; 294 } 295 return c; 296 } 297 unsigned int popcount(unsigned int input) { 298 unsigned int count = 0; 299 for (unsigned int i = 0; i < 4 * 8; i++) 300 count += (input >> i) & i; 301 return count; 302 } 303 304 This should be recognized as CLZ: rdar://8459039 305 306 unsigned clz_a(unsigned a) { 307 int i; 308 for (i=0;i<32;i++) 309 if (a & (1<<(31-i))) 310 return i; 311 return 32; 312 } 313 314 This sort of thing should be added to the loop idiom pass. 315 316 //===---------------------------------------------------------------------===// 317 318 These should turn into single 16-bit (unaligned?) loads on little/big endian 319 processors. 
320 321 unsigned short read_16_le(const unsigned char *adr) { 322 return adr[0] | (adr[1] << 8); 323 } 324 unsigned short read_16_be(const unsigned char *adr) { 325 return (adr[0] << 8) | adr[1]; 326 } 327 328 //===---------------------------------------------------------------------===// 329 330 -instcombine should handle this transform: 331 icmp pred (sdiv X / C1 ), C2 332 when X, C1, and C2 are unsigned. Similarly for udiv and signed operands. 333 334 Currently InstCombine avoids this transform but will do it when the signs of 335 the operands and the sign of the divide match. See the FIXME in 336 InstructionCombining.cpp in the visitSetCondInst method after the switch case 337 for Instruction::UDiv (around line 4447) for more details. 338 339 The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of 340 this construct. 341 342 //===---------------------------------------------------------------------===// 343 344 [LOOP OPTIMIZATION] 345 346 SingleSource/Benchmarks/Misc/dt.c shows several interesting optimization 347 opportunities in its double_array_divs_variable function: it needs loop 348 interchange, memory promotion (which LICM already does), vectorization and 349 variable trip count loop unrolling (since it has a constant trip count). ICC 350 apparently produces this very nice code with -ffast-math: 351 352 ..B1.70: # Preds ..B1.70 ..B1.69 353 mulpd %xmm0, %xmm1 #108.2 354 mulpd %xmm0, %xmm1 #108.2 355 mulpd %xmm0, %xmm1 #108.2 356 mulpd %xmm0, %xmm1 #108.2 357 addl $8, %edx # 358 cmpl $131072, %edx #108.2 359 jb ..B1.70 # Prob 99% #108.2 360 361 It would be better to count down to zero, but this is a lot better than what we 362 do. 363 364 //===---------------------------------------------------------------------===// 365 366 Consider: 367 368 typedef unsigned U32; 369 typedef unsigned long long U64; 370 int test (U32 *inst, U64 *regs) { 371 U64 effective_addr2; 372 U32 temp = *inst; 373 int r1 = (temp >> 20) & 0xf; 374 int b2 = (temp >> 16) & 0xf; 375 effective_addr2 = temp & 0xfff; 376 if (b2) effective_addr2 += regs[b2]; 377 b2 = (temp >> 12) & 0xf; 378 if (b2) effective_addr2 += regs[b2]; 379 effective_addr2 &= regs[4]; 380 if ((effective_addr2 & 3) == 0) 381 return 1; 382 return 0; 383 } 384 385 Note that only the low 2 bits of effective_addr2 are used. On 32-bit systems, 386 we don't eliminate the computation of the top half of effective_addr2 because 387 we don't have whole-function selection dags. On x86, this means we use one 388 extra register for the function when effective_addr2 is declared as U64 than 389 when it is declared U32. 390 391 PHI Slicing could be extended to do this. 392 393 //===---------------------------------------------------------------------===// 394 395 Tail call elim should be more aggressive, checking to see if the call is 396 followed by an uncond branch to an exit block. 397 398 ; This testcase is due to tail-duplication not wanting to copy the return 399 ; instruction into the terminating blocks because there was other code 400 ; optimized out of the function after the taildup happened. 
401 ; RUN: llvm-as < %s | opt -tailcallelim | llvm-dis | not grep call 402 403 define i32 @t4(i32 %a) { 404 entry: 405 %tmp.1 = and i32 %a, 1 ; <i32> [#uses=1] 406 %tmp.2 = icmp ne i32 %tmp.1, 0 ; <i1> [#uses=1] 407 br i1 %tmp.2, label %then.0, label %else.0 408 409 then.0: ; preds = %entry 410 %tmp.5 = add i32 %a, -1 ; <i32> [#uses=1] 411 %tmp.3 = call i32 @t4( i32 %tmp.5 ) ; <i32> [#uses=1] 412 br label %return 413 414 else.0: ; preds = %entry 415 %tmp.7 = icmp ne i32 %a, 0 ; <i1> [#uses=1] 416 br i1 %tmp.7, label %then.1, label %return 417 418 then.1: ; preds = %else.0 419 %tmp.11 = add i32 %a, -2 ; <i32> [#uses=1] 420 %tmp.9 = call i32 @t4( i32 %tmp.11 ) ; <i32> [#uses=1] 421 br label %return 422 423 return: ; preds = %then.1, %else.0, %then.0 424 %result.0 = phi i32 [ 0, %else.0 ], [ %tmp.3, %then.0 ], 425 [ %tmp.9, %then.1 ] 426 ret i32 %result.0 427 } 428 429 //===---------------------------------------------------------------------===// 430 431 Tail recursion elimination should handle: 432 433 int pow2m1(int n) { 434 if (n == 0) 435 return 0; 436 return 2 * pow2m1 (n - 1) + 1; 437 } 438 439 Also, multiplies can be turned into SHL's, so they should be handled as if 440 they were associative. "return foo() << 1" can be tail recursion eliminated. 441 442 //===---------------------------------------------------------------------===// 443 444 Argument promotion should promote arguments for recursive functions, like 445 this: 446 447 ; RUN: llvm-as < %s | opt -argpromotion | llvm-dis | grep x.val 448 449 define internal i32 @foo(i32* %x) { 450 entry: 451 %tmp = load i32* %x ; <i32> [#uses=0] 452 %tmp.foo = call i32 @foo( i32* %x ) ; <i32> [#uses=1] 453 ret i32 %tmp.foo 454 } 455 456 define i32 @bar(i32* %x) { 457 entry: 458 %tmp3 = call i32 @foo( i32* %x ) ; <i32> [#uses=1] 459 ret i32 %tmp3 460 } 461 462 //===---------------------------------------------------------------------===// 463 464 We should investigate an instruction sinking pass. Consider this silly 465 example in pic mode: 466 467 #include <assert.h> 468 void foo(int x) { 469 assert(x); 470 //... 471 } 472 473 we compile this to: 474 _foo: 475 subl $28, %esp 476 call "L1$pb" 477 "L1$pb": 478 popl %eax 479 cmpl $0, 32(%esp) 480 je LBB1_2 # cond_true 481 LBB1_1: # return 482 # ... 483 addl $28, %esp 484 ret 485 LBB1_2: # cond_true 486 ... 487 488 The PIC base computation (call+popl) is only used on one path through the 489 code, but is currently always computed in the entry block. It would be 490 better to sink the picbase computation down into the block for the 491 assertion, as it is the only one that uses it. This happens for a lot of 492 code with early outs. 493 494 Another example is loads of arguments, which are usually emitted into the 495 entry block on targets like x86. If not used in all paths through a 496 function, they should be sunk into the ones that do. 497 498 In this case, whole-function-isel would also handle this. 499 500 //===---------------------------------------------------------------------===// 501 502 Investigate lowering of sparse switch statements into perfect hash tables: 503 http://burtleburtle.net/bob/hash/perfect.html 504 505 //===---------------------------------------------------------------------===// 506 507 We should turn things like "load+fabs+store" and "load+fneg+store" into the 508 corresponding integer operations. 
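Roughly, the integer form we want is sketched below (a C-level illustration
only; the real transform would be done on the load/store pair in instcombine or
the DAG, and the helper name is made up):

#include <stdint.h>
#include <string.h>

/* fneg done in the integer domain: load the bits, flip the sign bit, store.
   fabs would instead clear the sign bit with  bits &= ~(1ULL << 63).  */
static double fneg_as_int(double x) {
  uint64_t bits;
  memcpy(&bits, &x, sizeof bits);   /* the "load"                  */
  bits ^= 1ULL << 63;               /* integer xor of the sign bit */
  memcpy(&x, &bits, sizeof bits);   /* the "store"                 */
  return x;
}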
On a yonah, this loop: 509 510 double a[256]; 511 void foo() { 512 int i, b; 513 for (b = 0; b < 10000000; b++) 514 for (i = 0; i < 256; i++) 515 a[i] = -a[i]; 516 } 517 518 is twice as slow as this loop: 519 520 long long a[256]; 521 void foo() { 522 int i, b; 523 for (b = 0; b < 10000000; b++) 524 for (i = 0; i < 256; i++) 525 a[i] ^= (1ULL << 63); 526 } 527 528 and I suspect other processors are similar. On X86 in particular this is a 529 big win because doing this with integers allows the use of read/modify/write 530 instructions. 531 532 //===---------------------------------------------------------------------===// 533 534 DAG Combiner should try to combine small loads into larger loads when 535 profitable. For example, we compile this C++ example: 536 537 struct THotKey { short Key; bool Control; bool Shift; bool Alt; }; 538 extern THotKey m_HotKey; 539 THotKey GetHotKey () { return m_HotKey; } 540 541 into (-m64 -O3 -fno-exceptions -static -fomit-frame-pointer): 542 543 __Z9GetHotKeyv: ## @_Z9GetHotKeyv 544 movq _m_HotKey@GOTPCREL(%rip), %rax 545 movzwl (%rax), %ecx 546 movzbl 2(%rax), %edx 547 shlq $16, %rdx 548 orq %rcx, %rdx 549 movzbl 3(%rax), %ecx 550 shlq $24, %rcx 551 orq %rdx, %rcx 552 movzbl 4(%rax), %eax 553 shlq $32, %rax 554 orq %rcx, %rax 555 ret 556 557 //===---------------------------------------------------------------------===// 558 559 We should add an FRINT node to the DAG to model targets that have legal 560 implementations of ceil/floor/rint. 561 562 //===---------------------------------------------------------------------===// 563 564 Consider: 565 566 int test() { 567 long long input[8] = {1,0,1,0,1,0,1,0}; 568 foo(input); 569 } 570 571 Clang compiles this into: 572 573 call void @llvm.memset.p0i8.i64(i8* %tmp, i8 0, i64 64, i32 16, i1 false) 574 %0 = getelementptr [8 x i64]* %input, i64 0, i64 0 575 store i64 1, i64* %0, align 16 576 %1 = getelementptr [8 x i64]* %input, i64 0, i64 2 577 store i64 1, i64* %1, align 16 578 %2 = getelementptr [8 x i64]* %input, i64 0, i64 4 579 store i64 1, i64* %2, align 16 580 %3 = getelementptr [8 x i64]* %input, i64 0, i64 6 581 store i64 1, i64* %3, align 16 582 583 Which gets codegen'd into: 584 585 pxor %xmm0, %xmm0 586 movaps %xmm0, -16(%rbp) 587 movaps %xmm0, -32(%rbp) 588 movaps %xmm0, -48(%rbp) 589 movaps %xmm0, -64(%rbp) 590 movq $1, -64(%rbp) 591 movq $1, -48(%rbp) 592 movq $1, -32(%rbp) 593 movq $1, -16(%rbp) 594 595 It would be better to have 4 movq's of 0 instead of the movaps's. 596 597 //===---------------------------------------------------------------------===// 598 599 http://llvm.org/PR717: 600 601 The following code should compile into "ret int undef". Instead, LLVM 602 produces "ret int 0": 603 604 int f() { 605 int x = 4; 606 int y; 607 if (x == 3) y = 0; 608 return y; 609 } 610 611 //===---------------------------------------------------------------------===// 612 613 The loop unroller should partially unroll loops (instead of peeling them) 614 when code growth isn't too bad and when an unroll count allows simplification 615 of some code within the loop. One trivial example is: 616 617 #include <stdio.h> 618 int main() { 619 int nRet = 17; 620 int nLoop; 621 for ( nLoop = 0; nLoop < 1000; nLoop++ ) { 622 if ( nLoop & 1 ) 623 nRet += 2; 624 else 625 nRet -= 1; 626 } 627 return nRet; 628 } 629 630 Unrolling by 2 would eliminate the '&1' in both copies, leading to a net 631 reduction in code size. The resultant code would then also be suitable for 632 exit value computation. 
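For reference, this is roughly the body we would like the unroller to produce
(a hand-unrolled sketch; it assumes the trip count stays a multiple of two, as
it is here):

  for ( nLoop = 0; nLoop < 1000; nLoop += 2 ) {
    nRet -= 1;   /* even iteration: nLoop & 1 == 0 */
    nRet += 2;   /* odd iteration:  nLoop & 1 == 1 */
  }

At that point the body is branch-free and the exit value folds to
nRet = 17 + 500.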
633 634 //===---------------------------------------------------------------------===// 635 636 We miss a bunch of rotate opportunities on various targets, including ppc, x86, 637 etc. On X86, we miss a bunch of 'rotate by variable' cases because the rotate 638 matching code in dag combine doesn't look through truncates aggressively 639 enough. Here are some testcases reduces from GCC PR17886: 640 641 unsigned long long f5(unsigned long long x, unsigned long long y) { 642 return (x << 8) | ((y >> 48) & 0xffull); 643 } 644 unsigned long long f6(unsigned long long x, unsigned long long y, int z) { 645 switch(z) { 646 case 1: 647 return (x << 8) | ((y >> 48) & 0xffull); 648 case 2: 649 return (x << 16) | ((y >> 40) & 0xffffull); 650 case 3: 651 return (x << 24) | ((y >> 32) & 0xffffffull); 652 case 4: 653 return (x << 32) | ((y >> 24) & 0xffffffffull); 654 default: 655 return (x << 40) | ((y >> 16) & 0xffffffffffull); 656 } 657 } 658 659 //===---------------------------------------------------------------------===// 660 661 This (and similar related idioms): 662 663 unsigned int foo(unsigned char i) { 664 return i | (i<<8) | (i<<16) | (i<<24); 665 } 666 667 compiles into: 668 669 define i32 @foo(i8 zeroext %i) nounwind readnone ssp noredzone { 670 entry: 671 %conv = zext i8 %i to i32 672 %shl = shl i32 %conv, 8 673 %shl5 = shl i32 %conv, 16 674 %shl9 = shl i32 %conv, 24 675 %or = or i32 %shl9, %conv 676 %or6 = or i32 %or, %shl5 677 %or10 = or i32 %or6, %shl 678 ret i32 %or10 679 } 680 681 it would be better as: 682 683 unsigned int bar(unsigned char i) { 684 unsigned int j=i | (i << 8); 685 return j | (j<<16); 686 } 687 688 aka: 689 690 define i32 @bar(i8 zeroext %i) nounwind readnone ssp noredzone { 691 entry: 692 %conv = zext i8 %i to i32 693 %shl = shl i32 %conv, 8 694 %or = or i32 %shl, %conv 695 %shl5 = shl i32 %or, 16 696 %or6 = or i32 %shl5, %or 697 ret i32 %or6 698 } 699 700 or even i*0x01010101, depending on the speed of the multiplier. The best way to 701 handle this is to canonicalize it to a multiply in IR and have codegen handle 702 lowering multiplies to shifts on cpus where shifts are faster. 703 704 //===---------------------------------------------------------------------===// 705 706 We do a number of simplifications in simplify libcalls to strength reduce 707 standard library functions, but we don't currently merge them together. For 708 example, it is useful to merge memcpy(a,b,strlen(b)) -> strcpy. This can only 709 be done safely if "b" isn't modified between the strlen and memcpy of course. 710 711 //===---------------------------------------------------------------------===// 712 713 We compile this program: (from GCC PR11680) 714 http://gcc.gnu.org/bugzilla/attachment.cgi?id=4487 715 716 Into code that runs the same speed in fast/slow modes, but both modes run 2x 717 slower than when compile with GCC (either 4.0 or 4.2): 718 719 $ llvm-g++ perf.cpp -O3 -fno-exceptions 720 $ time ./a.out fast 721 1.821u 0.003s 0:01.82 100.0% 0+0k 0+0io 0pf+0w 722 723 $ g++ perf.cpp -O3 -fno-exceptions 724 $ time ./a.out fast 725 0.821u 0.001s 0:00.82 100.0% 0+0k 0+0io 0pf+0w 726 727 It looks like we are making the same inlining decisions, so this may be raw 728 codegen badness or something else (haven't investigated). 729 730 //===---------------------------------------------------------------------===// 731 732 Divisibility by constant can be simplified (according to GCC PR12849) from 733 being a mulhi to being a mul lo (cheaper). 
Testcase: 734 735 void bar(unsigned n) { 736 if (n % 3 == 0) 737 true(); 738 } 739 740 This is equivalent to the following, where 2863311531 is the multiplicative 741 inverse of 3, and 1431655766 is ((2^32)-1)/3+1: 742 void bar(unsigned n) { 743 if (n * 2863311531U < 1431655766U) 744 true(); 745 } 746 747 The same transformation can work with an even modulo with the addition of a 748 rotate: rotate the result of the multiply to the right by the number of bits 749 which need to be zero for the condition to be true, and shrink the compare RHS 750 by the same amount. Unless the target supports rotates, though, that 751 transformation probably isn't worthwhile. 752 753 The transformation can also easily be made to work with non-zero equality 754 comparisons: just transform, for example, "n % 3 == 1" to "(n-1) % 3 == 0". 755 756 //===---------------------------------------------------------------------===// 757 758 Better mod/ref analysis for scanf would allow us to eliminate the vtable and a 759 bunch of other stuff from this example (see PR1604): 760 761 #include <cstdio> 762 struct test { 763 int val; 764 virtual ~test() {} 765 }; 766 767 int main() { 768 test t; 769 std::scanf("%d", &t.val); 770 std::printf("%d\n", t.val); 771 } 772 773 //===---------------------------------------------------------------------===// 774 775 These functions perform the same computation, but produce different assembly. 776 777 define i8 @select(i8 %x) readnone nounwind { 778 %A = icmp ult i8 %x, 250 779 %B = select i1 %A, i8 0, i8 1 780 ret i8 %B 781 } 782 783 define i8 @addshr(i8 %x) readnone nounwind { 784 %A = zext i8 %x to i9 785 %B = add i9 %A, 6 ;; 256 - 250 == 6 786 %C = lshr i9 %B, 8 787 %D = trunc i9 %C to i8 788 ret i8 %D 789 } 790 791 //===---------------------------------------------------------------------===// 792 793 From gcc bug 24696: 794 int 795 f (unsigned long a, unsigned long b, unsigned long c) 796 { 797 return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0); 798 } 799 int 800 f (unsigned long a, unsigned long b, unsigned long c) 801 { 802 return ((a & (c - 1)) != 0) | ((b & (c - 1)) != 0); 803 } 804 Both should combine to ((a|b) & (c-1)) != 0. Currently not optimized with 805 "clang -emit-llvm-bc | opt -std-compile-opts". 806 807 //===---------------------------------------------------------------------===// 808 809 From GCC Bug 20192: 810 #define PMD_MASK (~((1UL << 23) - 1)) 811 void clear_pmd_range(unsigned long start, unsigned long end) 812 { 813 if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK)) 814 f(); 815 } 816 The expression should optimize to something like 817 "!((start|end)&~PMD_MASK). Currently not optimized with "clang 818 -emit-llvm-bc | opt -std-compile-opts". 819 820 //===---------------------------------------------------------------------===// 821 822 unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return 823 i;} 824 unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;} 825 These should combine to the same thing. Currently, the first function 826 produces better code on X86. 827 828 //===---------------------------------------------------------------------===// 829 830 From GCC Bug 15784: 831 #define abs(x) x>0?x:-x 832 int f(int x, int y) 833 { 834 return (abs(x)) >= 0; 835 } 836 This should optimize to x == INT_MIN. (With -fwrapv.) Currently not 837 optimized with "clang -emit-llvm-bc | opt -std-compile-opts". 
838 839 //===---------------------------------------------------------------------===// 840 841 From GCC Bug 14753: 842 void 843 rotate_cst (unsigned int a) 844 { 845 a = (a << 10) | (a >> 22); 846 if (a == 123) 847 bar (); 848 } 849 void 850 minus_cst (unsigned int a) 851 { 852 unsigned int tem; 853 854 tem = 20 - a; 855 if (tem == 5) 856 bar (); 857 } 858 void 859 mask_gt (unsigned int a) 860 { 861 /* This is equivalent to a > 15. */ 862 if ((a & ~7) > 8) 863 bar (); 864 } 865 void 866 rshift_gt (unsigned int a) 867 { 868 /* This is equivalent to a > 23. */ 869 if ((a >> 2) > 5) 870 bar (); 871 } 872 873 All should simplify to a single comparison. All of these are 874 currently not optimized with "clang -emit-llvm-bc | opt 875 -std-compile-opts". 876 877 //===---------------------------------------------------------------------===// 878 879 From GCC Bug 32605: 880 int c(int* x) {return (char*)x+2 == (char*)x;} 881 Should combine to 0. Currently not optimized with "clang 882 -emit-llvm-bc | opt -std-compile-opts" (although llc can optimize it). 883 884 //===---------------------------------------------------------------------===// 885 886 int a(unsigned b) {return ((b << 31) | (b << 30)) >> 31;} 887 Should be combined to "((b >> 1) | b) & 1". Currently not optimized 888 with "clang -emit-llvm-bc | opt -std-compile-opts". 889 890 //===---------------------------------------------------------------------===// 891 892 unsigned a(unsigned x, unsigned y) { return x | (y & 1) | (y & 2);} 893 Should combine to "x | (y & 3)". Currently not optimized with "clang 894 -emit-llvm-bc | opt -std-compile-opts". 895 896 //===---------------------------------------------------------------------===// 897 898 int a(int a, int b, int c) {return (~a & c) | ((c|a) & b);} 899 Should fold to "(~a & c) | (a & b)". Currently not optimized with 900 "clang -emit-llvm-bc | opt -std-compile-opts". 901 902 //===---------------------------------------------------------------------===// 903 904 int a(int a,int b) {return (~(a|b))|a;} 905 Should fold to "a|~b". Currently not optimized with "clang 906 -emit-llvm-bc | opt -std-compile-opts". 907 908 //===---------------------------------------------------------------------===// 909 910 int a(int a, int b) {return (a&&b) || (a&&!b);} 911 Should fold to "a". Currently not optimized with "clang -emit-llvm-bc 912 | opt -std-compile-opts". 913 914 //===---------------------------------------------------------------------===// 915 916 int a(int a, int b, int c) {return (a&&b) || (!a&&c);} 917 Should fold to "a ? b : c", or at least something sane. Currently not 918 optimized with "clang -emit-llvm-bc | opt -std-compile-opts". 919 920 //===---------------------------------------------------------------------===// 921 922 int a(int a, int b, int c) {return (a&&b) || (a&&c) || (a&&b&&c);} 923 Should fold to a && (b || c). Currently not optimized with "clang 924 -emit-llvm-bc | opt -std-compile-opts". 925 926 //===---------------------------------------------------------------------===// 927 928 int a(int x) {return x | ((x & 8) ^ 8);} 929 Should combine to x | 8. Currently not optimized with "clang 930 -emit-llvm-bc | opt -std-compile-opts". 931 932 //===---------------------------------------------------------------------===// 933 934 int a(int x) {return x ^ ((x & 8) ^ 8);} 935 Should also combine to x | 8. Currently not optimized with "clang 936 -emit-llvm-bc | opt -std-compile-opts". 
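A cheap way to gain confidence in folds like the two above before writing the
instcombine patch is a brute-force check over a range of inputs (a throwaway
test, not part of any testsuite):

#include <assert.h>

int main(void) {
  for (long long i = -70000; i <= 70000; ++i) {
    int x = (int)i;
    assert((x | ((x & 8) ^ 8)) == (x | 8));
    assert((x ^ ((x & 8) ^ 8)) == (x | 8));
  }
  return 0;
}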
937 938 //===---------------------------------------------------------------------===// 939 940 int a(int x) {return ((x | -9) ^ 8) & x;} 941 Should combine to x & -9. Currently not optimized with "clang 942 -emit-llvm-bc | opt -std-compile-opts". 943 944 //===---------------------------------------------------------------------===// 945 946 unsigned a(unsigned a) {return a * 0x11111111 >> 28 & 1;} 947 Should combine to "a * 0x88888888 >> 31". Currently not optimized 948 with "clang -emit-llvm-bc | opt -std-compile-opts". 949 950 //===---------------------------------------------------------------------===// 951 952 unsigned a(char* x) {if ((*x & 32) == 0) return b();} 953 There's an unnecessary zext in the generated code with "clang 954 -emit-llvm-bc | opt -std-compile-opts". 955 956 //===---------------------------------------------------------------------===// 957 958 unsigned a(unsigned long long x) {return 40 * (x >> 1);} 959 Should combine to "20 * (((unsigned)x) & -2)". Currently not 960 optimized with "clang -emit-llvm-bc | opt -std-compile-opts". 961 962 //===---------------------------------------------------------------------===// 963 964 This was noticed in the entryblock for grokdeclarator in 403.gcc: 965 966 %tmp = icmp eq i32 %decl_context, 4 967 %decl_context_addr.0 = select i1 %tmp, i32 3, i32 %decl_context 968 %tmp1 = icmp eq i32 %decl_context_addr.0, 1 969 %decl_context_addr.1 = select i1 %tmp1, i32 0, i32 %decl_context_addr.0 970 971 tmp1 should be simplified to something like: 972 (!tmp || decl_context == 1) 973 974 This allows recursive simplifications, tmp1 is used all over the place in 975 the function, e.g. by: 976 977 %tmp23 = icmp eq i32 %decl_context_addr.1, 0 ; <i1> [#uses=1] 978 %tmp24 = xor i1 %tmp1, true ; <i1> [#uses=1] 979 %or.cond8 = and i1 %tmp23, %tmp24 ; <i1> [#uses=1] 980 981 later. 982 983 //===---------------------------------------------------------------------===// 984 985 [STORE SINKING] 986 987 Store sinking: This code: 988 989 void f (int n, int *cond, int *res) { 990 int i; 991 *res = 0; 992 for (i = 0; i < n; i++) 993 if (*cond) 994 *res ^= 234; /* (*) */ 995 } 996 997 On this function GVN hoists the fully redundant value of *res, but nothing 998 moves the store out. This gives us this code: 999 1000 bb: ; preds = %bb2, %entry 1001 %.rle = phi i32 [ 0, %entry ], [ %.rle6, %bb2 ] 1002 %i.05 = phi i32 [ 0, %entry ], [ %indvar.next, %bb2 ] 1003 %1 = load i32* %cond, align 4 1004 %2 = icmp eq i32 %1, 0 1005 br i1 %2, label %bb2, label %bb1 1006 1007 bb1: ; preds = %bb 1008 %3 = xor i32 %.rle, 234 1009 store i32 %3, i32* %res, align 4 1010 br label %bb2 1011 1012 bb2: ; preds = %bb, %bb1 1013 %.rle6 = phi i32 [ %3, %bb1 ], [ %.rle, %bb ] 1014 %indvar.next = add i32 %i.05, 1 1015 %exitcond = icmp eq i32 %indvar.next, %n 1016 br i1 %exitcond, label %return, label %bb 1017 1018 DSE should sink partially dead stores to get the store out of the loop. 1019 1020 Here's another partial dead case: 1021 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395 1022 1023 //===---------------------------------------------------------------------===// 1024 1025 Scalar PRE hoists the mul in the common block up to the else: 1026 1027 int test (int a, int b, int c, int g) { 1028 int d, e; 1029 if (a) 1030 d = b * c; 1031 else 1032 d = b - c; 1033 e = b * c + g; 1034 return d + e; 1035 } 1036 1037 It would be better to do the mul once to reduce codesize above the if. 1038 This is GCC PR38204. 
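At the source level, the form we would like to end up with is roughly the
following (a sketch only; whether this is best done in GVN/PRE or elsewhere is
the open question):

int test (int a, int b, int c, int g) {
  int d, e;
  int bc = b * c;   /* compute the product once, above the if */
  if (a)
    d = bc;
  else
    d = b - c;
  e = bc + g;
  return d + e;
}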
1039 1040 1041 //===---------------------------------------------------------------------===// 1042 This simple function from 179.art: 1043 1044 int winner, numf2s; 1045 struct { double y; int reset; } *Y; 1046 1047 void find_match() { 1048 int i; 1049 winner = 0; 1050 for (i=0;i<numf2s;i++) 1051 if (Y[i].y > Y[winner].y) 1052 winner =i; 1053 } 1054 1055 Compiles into (with clang TBAA): 1056 1057 for.body: ; preds = %for.inc, %bb.nph 1058 %indvar = phi i64 [ 0, %bb.nph ], [ %indvar.next, %for.inc ] 1059 %i.01718 = phi i32 [ 0, %bb.nph ], [ %i.01719, %for.inc ] 1060 %tmp4 = getelementptr inbounds %struct.anon* %tmp3, i64 %indvar, i32 0 1061 %tmp5 = load double* %tmp4, align 8, !tbaa !4 1062 %idxprom7 = sext i32 %i.01718 to i64 1063 %tmp10 = getelementptr inbounds %struct.anon* %tmp3, i64 %idxprom7, i32 0 1064 %tmp11 = load double* %tmp10, align 8, !tbaa !4 1065 %cmp12 = fcmp ogt double %tmp5, %tmp11 1066 br i1 %cmp12, label %if.then, label %for.inc 1067 1068 if.then: ; preds = %for.body 1069 %i.017 = trunc i64 %indvar to i32 1070 br label %for.inc 1071 1072 for.inc: ; preds = %for.body, %if.then 1073 %i.01719 = phi i32 [ %i.01718, %for.body ], [ %i.017, %if.then ] 1074 %indvar.next = add i64 %indvar, 1 1075 %exitcond = icmp eq i64 %indvar.next, %tmp22 1076 br i1 %exitcond, label %for.cond.for.end_crit_edge, label %for.body 1077 1078 1079 It is good that we hoisted the reloads of numf2's, and Y out of the loop and 1080 sunk the store to winner out. 1081 1082 However, this is awful on several levels: the conditional truncate in the loop 1083 (-indvars at fault? why can't we completely promote the IV to i64?). 1084 1085 Beyond that, we have a partially redundant load in the loop: if "winner" (aka 1086 %i.01718) isn't updated, we reload Y[winner].y the next time through the loop. 1087 Similarly, the addressing that feeds it (including the sext) is redundant. In 1088 the end we get this generated assembly: 1089 1090 LBB0_2: ## %for.body 1091 ## =>This Inner Loop Header: Depth=1 1092 movsd (%rdi), %xmm0 1093 movslq %edx, %r8 1094 shlq $4, %r8 1095 ucomisd (%rcx,%r8), %xmm0 1096 jbe LBB0_4 1097 movl %esi, %edx 1098 LBB0_4: ## %for.inc 1099 addq $16, %rdi 1100 incq %rsi 1101 cmpq %rsi, %rax 1102 jne LBB0_2 1103 1104 All things considered this isn't too bad, but we shouldn't need the movslq or 1105 the shlq instruction, or the load folded into ucomisd every time through the 1106 loop. 1107 1108 On an x86-specific topic, if the loop can't be restructure, the movl should be a 1109 cmov. 1110 1111 //===---------------------------------------------------------------------===// 1112 1113 [STORE SINKING] 1114 1115 GCC PR37810 is an interesting case where we should sink load/store reload 1116 into the if block and outside the loop, so we don't reload/store it on the 1117 non-call path. 1118 1119 for () { 1120 *P += 1; 1121 if () 1122 call(); 1123 else 1124 ... 1125 -> 1126 tmp = *P 1127 for () { 1128 tmp += 1; 1129 if () { 1130 *P = tmp; 1131 call(); 1132 tmp = *P; 1133 } else ... 1134 } 1135 *P = tmp; 1136 1137 We now hoist the reload after the call (Transforms/GVN/lpre-call-wrap.ll), but 1138 we don't sink the store. We need partially dead store sinking. 1139 1140 //===---------------------------------------------------------------------===// 1141 1142 [LOAD PRE CRIT EDGE SPLITTING] 1143 1144 GCC PR37166: Sinking of loads prevents SROA'ing the "g" struct on the stack 1145 leading to excess stack traffic. This could be handled by GVN with some crazy 1146 symbolic phi translation. 
The code we get looks like (g is on the stack):

bb2:		; preds = %bb1
..
	%9 = getelementptr %struct.f* %g, i32 0, i32 0
	store i32 %8, i32* %9, align 4
	br label %bb3

bb3:		; preds = %bb1, %bb2, %bb
	%c_addr.0 = phi %struct.f* [ %g, %bb2 ], [ %c, %bb ], [ %c, %bb1 ]
	%b_addr.0 = phi %struct.f* [ %b, %bb2 ], [ %g, %bb ], [ %b, %bb1 ]
	%10 = getelementptr %struct.f* %c_addr.0, i32 0, i32 0
	%11 = load i32* %10, align 4

%11 is partially redundant, and in BB2 it should have the value %8.

GCC PR33344 and PR35287 are similar cases.


//===---------------------------------------------------------------------===//

[LOAD PRE]

There are many load PRE testcases in testsuite/gcc.dg/tree-ssa/loadpre* in the
GCC testsuite; the ones we don't get yet (checked through loadpre25) are:

[CRIT EDGE BREAKING]
loadpre3.c predcom-4.c

[PRE OF READONLY CALL]
loadpre5.c

[TURN SELECT INTO BRANCH]
loadpre14.c loadpre15.c

actually a conditional increment: loadpre18.c loadpre19.c

//===---------------------------------------------------------------------===//

[LOAD PRE / STORE SINKING / SPEC HACK]

This is a chunk of code from 456.hmmer:

int f(int M, int *mc, int *mpp, int *tpmm, int *ip, int *tpim, int *dpp,
      int *tpdm, int xmb, int *bp, int *ms) {
 int k, sc;
 for (k = 1; k <= M; k++) {
     mc[k] = mpp[k-1]   + tpmm[k-1];
     if ((sc = ip[k-1]  + tpim[k-1]) > mc[k])  mc[k] = sc;
     if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k])  mc[k] = sc;
     if ((sc = xmb  + bp[k])         > mc[k])  mc[k] = sc;
     mc[k] += ms[k];
   }
}

It is very profitable for this benchmark to turn the conditional stores to mc[k]
into a conditional move (select instr in IR) and allow the final store to do the
store.  See GCC PR27313 for more details.  Note that this is valid to xform even
with the new C++ memory model, since mc[k] is previously loaded and later
stored.

//===---------------------------------------------------------------------===//

[SCALAR PRE]
There are many PRE testcases in testsuite/gcc.dg/tree-ssa/ssa-pre-*.c in the
GCC testsuite.

//===---------------------------------------------------------------------===//

There are some interesting cases in testsuite/gcc.dg/tree-ssa/pred-comm* in the
GCC testsuite.  For example, we get the first example in predcom-1.c, but
miss the second one:

unsigned fib[1000];
unsigned avg[1000];

__attribute__ ((noinline))
void count_averages(int n) {
  int i;
  for (i = 1; i < n; i++)
    avg[i] = (((unsigned long) fib[i - 1] + fib[i] + fib[i + 1]) / 3) & 0xffff;
}

which compiles into two loads instead of one in the loop.

predcom-2.c is the same as predcom-1.c

predcom-3.c is very similar but needs loads feeding each other instead of
store->load.


//===---------------------------------------------------------------------===//

[ALIAS ANALYSIS]

Type based alias analysis:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14705

We should do better analysis of posix_memalign.  At the least it should
no-capture its pointer argument; at best, we should know that the out-value
result doesn't point to anything (like malloc).
One example of this is in 1246 SingleSource/Benchmarks/Misc/dt.c 1247 1248 //===---------------------------------------------------------------------===// 1249 1250 Interesting missed case because of control flow flattening (should be 2 loads): 1251 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26629 1252 With: llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as | 1253 opt -mem2reg -gvn -instcombine | llvm-dis 1254 we miss it because we need 1) CRIT EDGE 2) MULTIPLE DIFFERENT 1255 VALS PRODUCED BY ONE BLOCK OVER DIFFERENT PATHS 1256 1257 //===---------------------------------------------------------------------===// 1258 1259 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19633 1260 We could eliminate the branch condition here, loading from null is undefined: 1261 1262 struct S { int w, x, y, z; }; 1263 struct T { int r; struct S s; }; 1264 void bar (struct S, int); 1265 void foo (int a, struct T b) 1266 { 1267 struct S *c = 0; 1268 if (a) 1269 c = &b.s; 1270 bar (*c, a); 1271 } 1272 1273 //===---------------------------------------------------------------------===// 1274 1275 simplifylibcalls should do several optimizations for strspn/strcspn: 1276 1277 strcspn(x, "a") -> inlined loop for up to 3 letters (similarly for strspn): 1278 1279 size_t __strcspn_c3 (__const char *__s, int __reject1, int __reject2, 1280 int __reject3) { 1281 register size_t __result = 0; 1282 while (__s[__result] != '\0' && __s[__result] != __reject1 && 1283 __s[__result] != __reject2 && __s[__result] != __reject3) 1284 ++__result; 1285 return __result; 1286 } 1287 1288 This should turn into a switch on the character. See PR3253 for some notes on 1289 codegen. 1290 1291 456.hmmer apparently uses strcspn and strspn a lot. 471.omnetpp uses strspn. 1292 1293 //===---------------------------------------------------------------------===// 1294 1295 simplifylibcalls should turn these snprintf idioms into memcpy (GCC PR47917) 1296 1297 char buf1[6], buf2[6], buf3[4], buf4[4]; 1298 int i; 1299 1300 int foo (void) { 1301 int ret = snprintf (buf1, sizeof buf1, "abcde"); 1302 ret += snprintf (buf2, sizeof buf2, "abcdef") * 16; 1303 ret += snprintf (buf3, sizeof buf3, "%s", i++ < 6 ? "abc" : "def") * 256; 1304 ret += snprintf (buf4, sizeof buf4, "%s", i++ > 10 ? "abcde" : "defgh")*4096; 1305 return ret; 1306 } 1307 1308 //===---------------------------------------------------------------------===// 1309 1310 "gas" uses this idiom: 1311 else if (strchr ("+-/*%|&^:[]()~", *intel_parser.op_string)) 1312 .. 1313 else if (strchr ("<>", *intel_parser.op_string) 1314 1315 Those should be turned into a switch. 1316 1317 //===---------------------------------------------------------------------===// 1318 1319 252.eon contains this interesting code: 1320 1321 %3072 = getelementptr [100 x i8]* %tempString, i32 0, i32 0 1322 %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind 1323 %strlen = call i32 @strlen(i8* %3072) ; uses = 1 1324 %endptr = getelementptr [100 x i8]* %tempString, i32 0, i32 %strlen 1325 call void @llvm.memcpy.i32(i8* %endptr, 1326 i8* getelementptr ([5 x i8]* @"\01LC42", i32 0, i32 0), i32 5, i32 1) 1327 %3074 = call i32 @strlen(i8* %endptr) nounwind readonly 1328 1329 This is interesting for a couple reasons. First, in this: 1330 1331 The memcpy+strlen strlen can be replaced with: 1332 1333 %3074 = call i32 @strlen([5 x i8]* @"\01LC42") nounwind readonly 1334 1335 Because the destination was just copied into the specified memory buffer. This, 1336 in turn, can be constant folded to "4". 
1337 1338 In other code, it contains: 1339 1340 %endptr6978 = bitcast i8* %endptr69 to i32* 1341 store i32 7107374, i32* %endptr6978, align 1 1342 %3167 = call i32 @strlen(i8* %endptr69) nounwind readonly 1343 1344 Which could also be constant folded. Whatever is producing this should probably 1345 be fixed to leave this as a memcpy from a string. 1346 1347 Further, eon also has an interesting partially redundant strlen call: 1348 1349 bb8: ; preds = %_ZN18eonImageCalculatorC1Ev.exit 1350 %682 = getelementptr i8** %argv, i32 6 ; <i8**> [#uses=2] 1351 %683 = load i8** %682, align 4 ; <i8*> [#uses=4] 1352 %684 = load i8* %683, align 1 ; <i8> [#uses=1] 1353 %685 = icmp eq i8 %684, 0 ; <i1> [#uses=1] 1354 br i1 %685, label %bb10, label %bb9 1355 1356 bb9: ; preds = %bb8 1357 %686 = call i32 @strlen(i8* %683) nounwind readonly 1358 %687 = icmp ugt i32 %686, 254 ; <i1> [#uses=1] 1359 br i1 %687, label %bb10, label %bb11 1360 1361 bb10: ; preds = %bb9, %bb8 1362 %688 = call i32 @strlen(i8* %683) nounwind readonly 1363 1364 This could be eliminated by doing the strlen once in bb8, saving code size and 1365 improving perf on the bb8->9->10 path. 1366 1367 //===---------------------------------------------------------------------===// 1368 1369 I see an interesting fully redundant call to strlen left in 186.crafty:InputMove 1370 which looks like: 1371 %movetext11 = getelementptr [128 x i8]* %movetext, i32 0, i32 0 1372 1373 1374 bb62: ; preds = %bb55, %bb53 1375 %promote.0 = phi i32 [ %169, %bb55 ], [ 0, %bb53 ] 1376 %171 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1 1377 %172 = add i32 %171, -1 ; <i32> [#uses=1] 1378 %173 = getelementptr [128 x i8]* %movetext, i32 0, i32 %172 1379 1380 ... no stores ... 1381 br i1 %or.cond, label %bb65, label %bb72 1382 1383 bb65: ; preds = %bb62 1384 store i8 0, i8* %173, align 1 1385 br label %bb72 1386 1387 bb72: ; preds = %bb65, %bb62 1388 %trank.1 = phi i32 [ %176, %bb65 ], [ -1, %bb62 ] 1389 %177 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1 1390 1391 Note that on the bb62->bb72 path, that the %177 strlen call is partially 1392 redundant with the %171 call. At worst, we could shove the %177 strlen call 1393 up into the bb65 block moving it out of the bb62->bb72 path. However, note 1394 that bb65 stores to the string, zeroing out the last byte. This means that on 1395 that path the value of %177 is actually just %171-1. A sub is cheaper than a 1396 strlen! 1397 1398 This pattern repeats several times, basically doing: 1399 1400 A = strlen(P); 1401 P[A-1] = 0; 1402 B = strlen(P); 1403 where it is "obvious" that B = A-1. 1404 1405 //===---------------------------------------------------------------------===// 1406 1407 186.crafty has this interesting pattern with the "out.4543" variable: 1408 1409 call void @llvm.memcpy.i32( 1410 i8* getelementptr ([10 x i8]* @out.4543, i32 0, i32 0), 1411 i8* getelementptr ([7 x i8]* @"\01LC28700", i32 0, i32 0), i32 7, i32 1) 1412 %101 = call@printf(i8* ... @out.4543, i32 0, i32 0)) nounwind 1413 1414 It is basically doing: 1415 1416 memcpy(globalarray, "string"); 1417 printf(..., globalarray); 1418 1419 Anyway, by knowing that printf just reads the memory and forward substituting 1420 the string directly into the printf, this eliminates reads from globalarray. 1421 Since this pattern occurs frequently in crafty (due to the "DisplayTime" and 1422 other similar functions) there are many stores to "out". Once all the printfs 1423 stop using "out", all that is left is the memcpy's into it. 
This should allow 1424 globalopt to remove the "stored only" global. 1425 1426 //===---------------------------------------------------------------------===// 1427 1428 This code: 1429 1430 define inreg i32 @foo(i8* inreg %p) nounwind { 1431 %tmp0 = load i8* %p 1432 %tmp1 = ashr i8 %tmp0, 5 1433 %tmp2 = sext i8 %tmp1 to i32 1434 ret i32 %tmp2 1435 } 1436 1437 could be dagcombine'd to a sign-extending load with a shift. 1438 For example, on x86 this currently gets this: 1439 1440 movb (%eax), %al 1441 sarb $5, %al 1442 movsbl %al, %eax 1443 1444 while it could get this: 1445 1446 movsbl (%eax), %eax 1447 sarl $5, %eax 1448 1449 //===---------------------------------------------------------------------===// 1450 1451 GCC PR31029: 1452 1453 int test(int x) { return 1-x == x; } // --> return false 1454 int test2(int x) { return 2-x == x; } // --> return x == 1 ? 1455 1456 Always foldable for odd constants, what is the rule for even? 1457 1458 //===---------------------------------------------------------------------===// 1459 1460 PR 3381: GEP to field of size 0 inside a struct could be turned into GEP 1461 for next field in struct (which is at same address). 1462 1463 For example: store of float into { {{}}, float } could be turned into a store to 1464 the float directly. 1465 1466 //===---------------------------------------------------------------------===// 1467 1468 The arg promotion pass should make use of nocapture to make its alias analysis 1469 stuff much more precise. 1470 1471 //===---------------------------------------------------------------------===// 1472 1473 The following functions should be optimized to use a select instead of a 1474 branch (from gcc PR40072): 1475 1476 char char_int(int m) {if(m>7) return 0; return m;} 1477 int int_char(char m) {if(m>7) return 0; return m;} 1478 1479 //===---------------------------------------------------------------------===// 1480 1481 int func(int a, int b) { if (a & 0x80) b |= 0x80; else b &= ~0x80; return b; } 1482 1483 Generates this: 1484 1485 define i32 @func(i32 %a, i32 %b) nounwind readnone ssp { 1486 entry: 1487 %0 = and i32 %a, 128 ; <i32> [#uses=1] 1488 %1 = icmp eq i32 %0, 0 ; <i1> [#uses=1] 1489 %2 = or i32 %b, 128 ; <i32> [#uses=1] 1490 %3 = and i32 %b, -129 ; <i32> [#uses=1] 1491 %b_addr.0 = select i1 %1, i32 %3, i32 %2 ; <i32> [#uses=1] 1492 ret i32 %b_addr.0 1493 } 1494 1495 However, it's functionally equivalent to: 1496 1497 b = (b & ~0x80) | (a & 0x80); 1498 1499 Which generates this: 1500 1501 define i32 @func(i32 %a, i32 %b) nounwind readnone ssp { 1502 entry: 1503 %0 = and i32 %b, -129 ; <i32> [#uses=1] 1504 %1 = and i32 %a, 128 ; <i32> [#uses=1] 1505 %2 = or i32 %0, %1 ; <i32> [#uses=1] 1506 ret i32 %2 1507 } 1508 1509 This can be generalized for other forms: 1510 1511 b = (b & ~0x80) | (a & 0x40) << 1; 1512 1513 //===---------------------------------------------------------------------===// 1514 1515 These two functions produce different code. 
They shouldn't:

#include <stdint.h>

uint8_t p1(uint8_t b, uint8_t a) {
  b = (b & ~0xc0) | (a & 0xc0);
  return (b);
}

uint8_t p2(uint8_t b, uint8_t a) {
  b = (b & ~0x40) | (a & 0x40);
  b = (b & ~0x80) | (a & 0x80);
  return (b);
}

define zeroext i8 @p1(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
entry:
  %0 = and i8 %b, 63              ; <i8> [#uses=1]
  %1 = and i8 %a, -64             ; <i8> [#uses=1]
  %2 = or i8 %1, %0               ; <i8> [#uses=1]
  ret i8 %2
}

define zeroext i8 @p2(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
entry:
  %0 = and i8 %b, 63              ; <i8> [#uses=1]
  %.masked = and i8 %a, 64        ; <i8> [#uses=1]
  %1 = and i8 %a, -128            ; <i8> [#uses=1]
  %2 = or i8 %1, %0               ; <i8> [#uses=1]
  %3 = or i8 %2, %.masked         ; <i8> [#uses=1]
  ret i8 %3
}

//===---------------------------------------------------------------------===//

IPSCCP does not currently propagate argument-dependent constants through
functions where it does not know all of the callers.  This includes functions
with normal external linkage as well as templates, C99 inline functions etc.
Specifically, it does nothing to:

define i32 @test(i32 %x, i32 %y, i32 %z) nounwind {
entry:
  %0 = add nsw i32 %y, %z
  %1 = mul i32 %0, %x
  %2 = mul i32 %y, %z
  %3 = add nsw i32 %1, %2
  ret i32 %3
}

define i32 @test2() nounwind {
entry:
  %0 = call i32 @test(i32 1, i32 2, i32 4) nounwind
  ret i32 %0
}

It would be interesting to extend IPSCCP to be able to handle simple cases like
this, where all of the arguments to a call are constant.  Because IPSCCP runs
before inlining, trivial templates and inline functions are not yet inlined.
The results for a function + set of constant arguments should be memoized in a
map.

//===---------------------------------------------------------------------===//

The libcall constant folding stuff should be moved out of SimplifyLibcalls into
libanalysis' constantfolding logic.  This would allow IPSCCP to be able to
handle simple things like this:

static int foo(const char *X) { return strlen(X); }
int bar() { return foo("abcd"); }

//===---------------------------------------------------------------------===//

functionattrs doesn't know much about memcpy/memset.  This function should be
marked readnone rather than readonly, since it only twiddles local memory, but
functionattrs doesn't handle memset/memcpy/memmove aggressively:

struct X { int *p; int *q; };
int foo() {
 int i = 0, j = 1;
 struct X x, y;
 int **p;
 y.p = &i;
 x.q = &j;
 p = __builtin_memcpy (&x, &y, sizeof (int *));
 return **p;
}

This can be seen at:
$ clang t.c -S -o - -mkernel -O0 -emit-llvm | opt -functionattrs -S


//===---------------------------------------------------------------------===//

Missed instcombine transformation:
define i1 @a(i32 %x) nounwind readnone {
entry:
  %cmp = icmp eq i32 %x, 30
  %sub = add i32 %x, -30
  %cmp2 = icmp ugt i32 %sub, 9
  %or = or i1 %cmp, %cmp2
  ret i1 %or
}
This should be optimized to a single compare.  Testcase derived from gcc.
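For reference, the single compare is something like the following (a C-level
sketch of the expected fold, checked by brute force over a range around the
constants rather than proven):

#include <assert.h>

/* original:  x == 30  ||  (unsigned)(x - 30) > 9
   combined:  (unsigned)(x - 31) > 8               */
static int orig_form(int x)     { return x == 30 || (unsigned)(x - 30) > 9u; }
static int combined_form(int x) { return (unsigned)(x - 31) > 8u; }

int main(void) {
  for (int x = -1000; x < 1000; ++x)
    assert(orig_form(x) == combined_form(x));
  return 0;
}

In IR terms the combined form is icmp ugt (add i32 %x, -31), 8.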

//===---------------------------------------------------------------------===//

Missed instcombine or reassociate transformation:
int a(int a, int b) { return (a==12)&(b>47)&(b<58); }

The sgt and slt should be combined into a single comparison.  Testcase derived
from gcc.

//===---------------------------------------------------------------------===//

Missed instcombine transformation:

  %382 = srem i32 %tmp14.i, 64                    ; [#uses=1]
  %383 = zext i32 %382 to i64                     ; [#uses=1]
  %384 = shl i64 %381, %383                       ; [#uses=1]
  %385 = icmp slt i32 %tmp14.i, 64                ; [#uses=1]

The srem can be transformed to an and because if %tmp14.i is negative, the
shift is undefined.  Testcase derived from 403.gcc.

//===---------------------------------------------------------------------===//

This is a range comparison on a divided result (from 403.gcc):

  %1337 = sdiv i32 %1336, 8                       ; [#uses=1]
  %.off.i208 = add i32 %1336, 7                   ; [#uses=1]
  %1338 = icmp ult i32 %.off.i208, 15             ; [#uses=1]

We already catch this (removing the sdiv) if there isn't an add; we should
handle the 'add' as well.  This is a common idiom in its builtin_alloca code.
C testcase:

int a(int x) { return (unsigned)(x/16+7) < 15; }

Another similar case involves truncations on 64-bit targets:

  %361 = sdiv i64 %.046, 8                        ; [#uses=1]
  %362 = trunc i64 %361 to i32                    ; [#uses=2]
...
  %367 = icmp eq i32 %362, 0                      ; [#uses=1]

//===---------------------------------------------------------------------===//

Missed instcombine/dagcombine transformation:
define void @lshift_lt(i8 zeroext %a) nounwind {
entry:
  %conv = zext i8 %a to i32
  %shl = shl i32 %conv, 3
  %cmp = icmp ult i32 %shl, 33
  br i1 %cmp, label %if.then, label %if.end

if.then:
  tail call void @bar() nounwind
  ret void

if.end:
  ret void
}
declare void @bar() nounwind

The shift should be eliminated.  Testcase derived from gcc.

//===---------------------------------------------------------------------===//

These compile into different code; one gets recognized as a switch and the
other doesn't, due to phase ordering issues (PR6212):

int test1(int mainType, int subType) {
  if (mainType == 7)
    subType = 4;
  else if (mainType == 9)
    subType = 6;
  else if (mainType == 11)
    subType = 9;
  return subType;
}

int test2(int mainType, int subType) {
  if (mainType == 7)
    subType = 4;
  if (mainType == 9)
    subType = 6;
  if (mainType == 11)
    subType = 9;
  return subType;
}

//===---------------------------------------------------------------------===//

The following test case (from PR6576):

define i32 @mul(i32 %a, i32 %b) nounwind readnone {
entry:
  %cond1 = icmp eq i32 %b, 0                      ; <i1> [#uses=1]
  br i1 %cond1, label %exit, label %bb.nph
bb.nph:                                           ; preds = %entry
  %tmp = mul i32 %b, %a                           ; <i32> [#uses=1]
  ret i32 %tmp
exit:                                             ; preds = %entry
  ret i32 0
}

could be reduced to:

define i32 @mul(i32 %a, i32 %b) nounwind readnone {
entry:
  %tmp = mul i32 %b, %a
  ret i32 %tmp
}

//===---------------------------------------------------------------------===//

We should use DSE + llvm.lifetime.end to delete dead vtable pointer updates.
See GCC PR34949

Another interesting case is that something related could be used for variables
that go const after their ctor has finished.  In these cases, globalopt (which
can statically run the constructor) could mark the global const (so it gets put
in the readonly section).  A testcase would be:

#include <complex>
using namespace std;
const complex<char> should_be_in_rodata (42,-42);
complex<char> should_be_in_data (42,-42);
complex<char> should_be_in_bss;

Here we currently evaluate the ctors, but the globals don't become const
because the optimizer doesn't know they "become const" after the ctor is done.
See GCC PR4131 for more examples.

//===---------------------------------------------------------------------===//

In this code:

long foo(long x) {
  return x > 1 ? x : 1;
}

LLVM emits a comparison with 1 instead of 0.  0 would be equivalent
and cheaper on most targets.

LLVM prefers comparisons with zero over non-zero in general, but in this
case it chooses instead to keep the max operation obvious.

//===---------------------------------------------------------------------===//

define void @a(i32 %x) nounwind {
entry:
  switch i32 %x, label %if.end [
    i32 0, label %if.then
    i32 1, label %if.then
    i32 2, label %if.then
    i32 3, label %if.then
    i32 5, label %if.then
  ]
if.then:
  tail call void @foo() nounwind
  ret void
if.end:
  ret void
}
declare void @foo()

Generated code on x86-64 (other platforms give similar results):
a:
	cmpl	$5, %edi
	ja	LBB2_2
	cmpl	$4, %edi
	jne	LBB2_3
.LBB0_2:
	ret
.LBB0_3:
	jmp	foo  # TAILCALL

If we wanted to be really clever, we could simplify the whole thing to
something like the following, which eliminates a branch:
	xorl	$1, %edi
	cmpl	$4, %edi
	ja	.LBB0_2
	ret
.LBB0_2:
	jmp	foo  # TAILCALL

//===---------------------------------------------------------------------===//

We compile this:

int foo(int a) { return (a & (~15)) / 16; }

Into:

define i32 @foo(i32 %a) nounwind readnone ssp {
entry:
  %and = and i32 %a, -16
  %div = sdiv i32 %and, 16
  ret i32 %div
}

but (X & -A)/A is X >> log2(A) when A is a power of 2, so this case should be
instcombined into just "a >> 4".
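
A quick C sanity check of the claimed equivalence.  Right-shifting a negative
signed value is implementation-defined in C, but it is an arithmetic shift on
the targets discussed here; the masked value is a multiple of 16, so the sdiv
is exact and agrees with the shift:

#include <assert.h>
#include <stdio.h>

/* The function from the entry above. */
static int f_div(int a) { return (a & (~15)) / 16; }
/* The claimed instcombine result. */
static int f_shr(int a) { return a >> 4; }

int main(void) {
  for (int a = -100000; a <= 100000; ++a)
    assert(f_div(a) == f_shr(a));
  printf("(a & ~15)/16 == a >> 4 on the tested range\n");
  return 0;
}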

We do get this at the codegen level, so something knows about it, but
instcombine should catch it earlier:

_foo:                                   ## @foo
## BB#0:                                ## %entry
	movl	%edi, %eax
	sarl	$4, %eax
	ret

//===---------------------------------------------------------------------===//

This code (from GCC PR28685):

int test(int a, int b) {
  int lt = a < b;
  int eq = a == b;
  if (lt)
    return 1;
  return eq;
}

Is compiled to:

define i32 @test(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %cmp = icmp slt i32 %a, %b
  br i1 %cmp, label %return, label %if.end

if.end:                                           ; preds = %entry
  %cmp5 = icmp eq i32 %a, %b
  %conv6 = zext i1 %cmp5 to i32
  ret i32 %conv6

return:                                           ; preds = %entry
  ret i32 1
}

it could be:

define i32 @test__(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %0 = icmp sle i32 %a, %b
  %retval = zext i1 %0 to i32
  ret i32 %retval
}

//===---------------------------------------------------------------------===//

This code can be seen in viterbi:

  %64 = call noalias i8* @malloc(i64 %62) nounwind
...
  %67 = call i64 @llvm.objectsize.i64(i8* %64, i1 false) nounwind
  %68 = call i8* @__memset_chk(i8* %64, i32 0, i64 %62, i64 %67) nounwind

llvm.objectsize.i64 should be taught about malloc/calloc, allowing it to
fold to %62.  This is a security win (overflows of malloc will get caught)
and also a performance win by exposing more memsets to the optimizer.

This occurs several times in viterbi.

Note that this would change the semantics of @llvm.objectsize which by its
current definition always folds to a constant.  We also should make sure that
we remove checking in code like

  char *p = malloc(strlen(s)+1);
  __strcpy_chk(p, s, __builtin_objectsize(p, 0));

//===---------------------------------------------------------------------===//

This code (from Benchmarks/Dhrystone/dry.c):

define i32 @Func1(i32, i32) nounwind readnone optsize ssp {
entry:
  %sext = shl i32 %0, 24
  %conv = ashr i32 %sext, 24
  %sext6 = shl i32 %1, 24
  %conv4 = ashr i32 %sext6, 24
  %cmp = icmp eq i32 %conv, %conv4
  %. = select i1 %cmp, i32 10000, i32 0
  ret i32 %.
}

Should be simplified into something like:

define i32 @Func1(i32, i32) nounwind readnone optsize ssp {
entry:
  %sext = shl i32 %0, 24
  %conv = and i32 %sext, 0xFF000000
  %sext6 = shl i32 %1, 24
  %conv4 = and i32 %sext6, 0xFF000000
  %cmp = icmp eq i32 %conv, %conv4
  %. = select i1 %cmp, i32 10000, i32 0
  ret i32 %.
}

and then to:

define i32 @Func1(i32, i32) nounwind readnone optsize ssp {
entry:
  %conv = and i32 %0, 0xFF
  %conv4 = and i32 %1, 0xFF
  %cmp = icmp eq i32 %conv, %conv4
  %. = select i1 %cmp, i32 10000, i32 0
  ret i32 %.
}
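
A small C check that the sign-extend-based compare in the first form of @Func1
agrees with the mask-based compare in the final form.  It assumes two's
complement and an arithmetic right shift (implementation-defined in C, but
true for the targets above); only the low byte matters, so sweeping all
256 x 256 low-byte combinations is effectively exhaustive:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Mirrors the first form: sign-extend the low byte of each argument. */
static int cmp_sext(uint32_t a, uint32_t b) {
  int32_t ca = (int32_t)(a << 24) >> 24;
  int32_t cb = (int32_t)(b << 24) >> 24;
  return ca == cb;
}

/* Mirrors the fully simplified form: just compare the low bytes. */
static int cmp_mask(uint32_t a, uint32_t b) {
  return (a & 0xFF) == (b & 0xFF);
}

int main(void) {
  for (uint32_t a = 0; a < 256; ++a)
    for (uint32_t b = 0; b < 256; ++b) {
      assert(cmp_sext(a, b) == cmp_mask(a, b));
      /* Also try values with junk in the high bits. */
      assert(cmp_sext(a | 0xABCD0000u, b | 0x12340000u) ==
             cmp_mask(a | 0xABCD0000u, b | 0x12340000u));
    }
  printf("sext-compare matches mask-compare\n");
  return 0;
}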

//===---------------------------------------------------------------------===//

clang -O3 currently compiles this code

int g(unsigned int a) {
  unsigned int c[100];
  c[10] = a;
  c[11] = a;
  unsigned int b = c[10] + c[11];
  if(b > a*2) a = 4;
  else a = 8;
  return a + 7;
}

into

define i32 @g(i32 a) nounwind readnone {
  %add = shl i32 %a, 1
  %mul = shl i32 %a, 1
  %cmp = icmp ugt i32 %add, %mul
  %a.addr.0 = select i1 %cmp, i32 11, i32 15
  ret i32 %a.addr.0
}

The icmp should fold to false.  This CSE opportunity is only available
after GVN and InstCombine have run.

//===---------------------------------------------------------------------===//

memcpyopt should turn this:

define i8* @test10(i32 %x) {
  %alloc = call noalias i8* @malloc(i32 %x) nounwind
  call void @llvm.memset.p0i8.i32(i8* %alloc, i8 0, i32 %x, i32 1, i1 false)
  ret i8* %alloc
}

into a call to calloc.  We should make sure that we analyze calloc as
aggressively as malloc though.
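
At the C level the transformation amounts to something like the sketch below
(the helper names are made up for illustration; the null check is just
idiomatic C, the IR above calls memset unconditionally):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* What @test10 does today: allocate, then clear the whole block. */
static void *alloc_zeroed_memset(size_t n) {
  void *p = malloc(n);
  if (p) memset(p, 0, n);
  return p;
}

/* What memcpyopt could emit instead: a single calloc, which can be cheaper
   for large blocks because the allocator may return already-zeroed pages. */
static void *alloc_zeroed_calloc(size_t n) {
  return calloc(1, n);
}

int main(void) {
  unsigned char *a = alloc_zeroed_memset(64);
  unsigned char *b = alloc_zeroed_calloc(64);
  if (a && b)
    printf("a[17]=%d b[17]=%d\n", a[17], b[17]);  /* both print 0 */
  free(a);
  free(b);
  return 0;
}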

//===---------------------------------------------------------------------===//

clang -O3 doesn't optimize this:

void f1(int* begin, int* end) {
  std::fill(begin, end, 0);
}

into a memset.  This is PR8942.

//===---------------------------------------------------------------------===//

clang -O3 -fno-exceptions currently compiles this code:

void f(int N) {
  std::vector<int> v(N);

  extern void sink(void*); sink(&v);
}

into

define void @_Z1fi(i32 %N) nounwind {
entry:
  %v2 = alloca [3 x i32*], align 8
  %v2.sub = getelementptr inbounds [3 x i32*]* %v2, i64 0, i64 0
  %tmpcast = bitcast [3 x i32*]* %v2 to %"class.std::vector"*
  %conv = sext i32 %N to i64
  store i32* null, i32** %v2.sub, align 8, !tbaa !0
  %tmp3.i.i.i.i.i = getelementptr inbounds [3 x i32*]* %v2, i64 0, i64 1
  store i32* null, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0
  %tmp4.i.i.i.i.i = getelementptr inbounds [3 x i32*]* %v2, i64 0, i64 2
  store i32* null, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0
  %cmp.i.i.i.i = icmp eq i32 %N, 0
  br i1 %cmp.i.i.i.i, label %_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.thread.i.i, label %cond.true.i.i.i.i

_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.thread.i.i: ; preds = %entry
  store i32* null, i32** %v2.sub, align 8, !tbaa !0
  store i32* null, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0
  %add.ptr.i5.i.i = getelementptr inbounds i32* null, i64 %conv
  store i32* %add.ptr.i5.i.i, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0
  br label %_ZNSt6vectorIiSaIiEEC1EmRKiRKS0_.exit

cond.true.i.i.i.i:                                ; preds = %entry
  %cmp.i.i.i.i.i = icmp slt i32 %N, 0
  br i1 %cmp.i.i.i.i.i, label %if.then.i.i.i.i.i, label %_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.i.i

if.then.i.i.i.i.i:                                ; preds = %cond.true.i.i.i.i
  call void @_ZSt17__throw_bad_allocv() noreturn nounwind
  unreachable

_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.i.i:    ; preds = %cond.true.i.i.i.i
  %mul.i.i.i.i.i = shl i64 %conv, 2
  %call3.i.i.i.i.i = call noalias i8* @_Znwm(i64 %mul.i.i.i.i.i) nounwind
  %0 = bitcast i8* %call3.i.i.i.i.i to i32*
  store i32* %0, i32** %v2.sub, align 8, !tbaa !0
  store i32* %0, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0
  %add.ptr.i.i.i = getelementptr inbounds i32* %0, i64 %conv
  store i32* %add.ptr.i.i.i, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0
  call void @llvm.memset.p0i8.i64(i8* %call3.i.i.i.i.i, i8 0, i64 %mul.i.i.i.i.i, i32 4, i1 false)
  br label %_ZNSt6vectorIiSaIiEEC1EmRKiRKS0_.exit

This is just the handling of the construction of the vector.  Most surprising
here is the fact that all three null stores in %entry are dead (because we do
no cross-block DSE).

Also surprising is that %conv isn't simplified to 0 in %....exit.thread.i.i.
This is because the client of LazyValueInfo doesn't simplify all instruction
operands, just selected ones.

//===---------------------------------------------------------------------===//

clang -O3 -fno-exceptions currently compiles this code:

void f(char* a, int n) {
  __builtin_memset(a, 0, n);
  for (int i = 0; i < n; ++i)
    a[i] = 0;
}

into:

define void @_Z1fPci(i8* nocapture %a, i32 %n) nounwind {
entry:
  %conv = sext i32 %n to i64
  tail call void @llvm.memset.p0i8.i64(i8* %a, i8 0, i64 %conv, i32 1, i1 false)
  %cmp8 = icmp sgt i32 %n, 0
  br i1 %cmp8, label %for.body.lr.ph, label %for.end

for.body.lr.ph:                                   ; preds = %entry
  %tmp10 = add i32 %n, -1
  %tmp11 = zext i32 %tmp10 to i64
  %tmp12 = add i64 %tmp11, 1
  call void @llvm.memset.p0i8.i64(i8* %a, i8 0, i64 %tmp12, i32 1, i1 false)
  ret void

for.end:                                          ; preds = %entry
  ret void
}

This shouldn't need the ((zext (%n - 1)) + 1) game, and it should ideally fold
the two memsets together.

The issue with the addition only occurs in 64-bit mode, and appears to be at
least partially caused by Scalar Evolution not keeping its cache updated: it
returns the "wrong" result immediately after indvars runs, but figures out the
expected result if it is run from scratch on IR resulting from running indvars.

//===---------------------------------------------------------------------===//

clang -O3 -fno-exceptions currently compiles this code:

struct S {
  unsigned short m1, m2;
  unsigned char m3, m4;
};

void f(int N) {
  std::vector<S> v(N);
  extern void sink(void*); sink(&v);
}

into poor code for zero-initializing 'v' when N is >0.  The problem is that
S is only 6 bytes, but each element is 8 byte-aligned.  We generate a loop and
4 stores on each iteration.  If the struct were 8 bytes, this would get turned
into a memset.

In order to handle this we have to:
A) Teach clang to generate metadata for memsets of structs that have holes in
   them.
B) Teach clang to use such a memset for zero init of this struct (since it has
   a hole), instead of doing elementwise zeroing.

//===---------------------------------------------------------------------===//

clang -O3 currently compiles this code:

extern const int magic;
double f() { return 0.0 * magic; }

into

@magic = external constant i32

define double @_Z1fv() nounwind readnone {
entry:
  %tmp = load i32* @magic, align 4, !tbaa !0
  %conv = sitofp i32 %tmp to double
  %mul = fmul double %conv, 0.000000e+00
  ret double %mul
}

We should be able to fold away this fmul to 0.0.  More generally, fmul(x,0.0)
can be folded to 0.0 if we can prove that the LHS is not -0.0, not a NaN, and
not an INF.  The CannotBeNegativeZero predicate in value tracking should be
extended to support general "fpclassify" operations that can return
yes/no/unknown for each of these predicates.

In this predicate, we know that uitofp is trivially never NaN or -0.0, and
we know that it isn't +/-Inf if the floating point type has enough exponent
bits to represent the largest integer value as < inf.

//===---------------------------------------------------------------------===//

When optimizing a transformation that can change the sign of 0.0 (such as the
0.0*val -> 0.0 transformation above), it might be provable that the sign of the
expression doesn't matter.  For example, by the above rules, we can't transform
fmul(sitofp(x), 0.0) into 0.0, because x might be -1 and the result of the
expression is defined to be -0.0.

If we look at the uses of the fmul for example, we might be able to prove that
all uses don't care about the sign of zero.  For example, if we have:

  fadd(fmul(sitofp(x), 0.0), 2.0)

Since we know that x+2.0 doesn't care about the sign of any zeros in X, we can
transform the fmul to 0.0, and then the fadd to 2.0.

//===---------------------------------------------------------------------===//

We should enhance memcpy/memmove/memset to allow a metadata node on them
indicating that some bytes of the transfer are undefined.  This is useful for
frontends like clang when lowering struct copies, when some elements of the
struct are undefined.  Consider something like this:

struct x {
  char a;
  int b[4];
};
void foo(struct x*P);
struct x testfunc() {
  struct x V1, V2;
  foo(&V1);
  V2 = V1;

  return V2;
}

We currently compile this to:
$ clang t.c -S -o - -O0 -emit-llvm | opt -scalarrepl -S

%struct.x = type { i8, [4 x i32] }

define void @testfunc(%struct.x* sret %agg.result) nounwind ssp {
entry:
  %V1 = alloca %struct.x, align 4
  call void @foo(%struct.x* %V1)
  %tmp1 = bitcast %struct.x* %V1 to i8*
  %0 = bitcast %struct.x* %V1 to i160*
  %srcval1 = load i160* %0, align 4
  %tmp2 = bitcast %struct.x* %agg.result to i8*
  %1 = bitcast %struct.x* %agg.result to i160*
  store i160 %srcval1, i160* %1, align 4
  ret void
}

This happens because SRoA sees that the temp alloca is being memcpy'd into and
out of, and that it has holes, so it has to be conservative.  If we knew about
the holes, then this could be much better.

Having information about these holes would also improve memcpy (etc) lowering
at llc time when it gets inlined, because we can use smaller transfers.  This
also avoids partial register stalls in some important cases.

//===---------------------------------------------------------------------===//

We don't fold (icmp (add) (add)) unless the two adds only have a single use.
There are a lot of cases that we're refusing to fold in (e.g.) 256.bzip2, for
example:

  %indvar.next90 = add i64 %indvar89, 1     ;; Has 2 uses
  %tmp96 = add i64 %tmp95, 1                ;; Has 1 use
  %exitcond97 = icmp eq i64 %indvar.next90, %tmp96

We don't fold this because we don't want to introduce an overlapped live range
of the ivar.  However, we can make this more aggressive without causing
performance issues in two ways:

1. If *either* the LHS or RHS has a single use, we can definitely do the
   transformation.  In the overlapping liverange case we're trading one
   register use for one fewer operation, which is a reasonable trade.  Before
   doing this we should verify that the llc output actually shrinks for some
   benchmarks.
2. If both ops have multiple uses, we can still fold it if the operations are
   both sinkable to *after* the icmp (e.g. in a subsequent block) which doesn't
   increase register pressure.

There are a ton of icmp's we aren't simplifying because of the reg pressure
concern.  Care is warranted here though because many of these are induction
variables and other cases that matter a lot to performance, like the above.
Here's a blob of code that you can drop into the bottom of visitICmp to see
some missed cases:

  { Value *A, *B, *C, *D;
    if (match(Op0, m_Add(m_Value(A), m_Value(B))) &&
        match(Op1, m_Add(m_Value(C), m_Value(D))) &&
        (A == C || A == D || B == C || B == D)) {
      errs() << "OP0 = " << *Op0 << "  U=" << Op0->getNumUses() << "\n";
      errs() << "OP1 = " << *Op1 << "  U=" << Op1->getNumUses() << "\n";
      errs() << "CMP = " << I << "\n\n";
    }
  }

//===---------------------------------------------------------------------===//

define i1 @test1(i32 %x) nounwind {
  %and = and i32 %x, 3
  %cmp = icmp ult i32 %and, 2
  ret i1 %cmp
}

Can be folded to (x & 2) == 0.

define i1 @test2(i32 %x) nounwind {
  %and = and i32 %x, 3
  %cmp = icmp ugt i32 %and, 1
  ret i1 %cmp
}

Can be folded to (x & 2) != 0.

SimplifyDemandedBits shrinks the "and" constant to 2 but instcombine misses the
icmp transform.
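
Both folds are easy to check, since the expressions only depend on the low two
bits; a small C harness:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* @test1: (x & 3) <u 2  should fold to  (x & 2) == 0 */
static int t1_orig(uint32_t x)   { return (x & 3) < 2; }
static int t1_folded(uint32_t x) { return (x & 2) == 0; }

/* @test2: (x & 3) >u 1  should fold to  (x & 2) != 0 */
static int t2_orig(uint32_t x)   { return (x & 3) > 1; }
static int t2_folded(uint32_t x) { return (x & 2) != 0; }

int main(void) {
  /* The four low-bit patterns, plus a few values with high bits set. */
  uint32_t tests[] = { 0, 1, 2, 3,
                       0xFFFFFFFCu, 0xFFFFFFFDu, 0xFFFFFFFEu, 0xFFFFFFFFu };
  for (unsigned i = 0; i < sizeof(tests) / sizeof(tests[0]); ++i) {
    assert(t1_orig(tests[i]) == t1_folded(tests[i]));
    assert(t2_orig(tests[i]) == t2_folded(tests[i]));
  }
  printf("both folds hold\n");
  return 0;
}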

//===---------------------------------------------------------------------===//

This code:

typedef struct {
  int f1:1;
  int f2:1;
  int f3:1;
  int f4:29;
} t1;

typedef struct {
  int f1:1;
  int f2:1;
  int f3:30;
} t2;

t1 s1;
t2 s2;

void func1(void)
{
  s1.f1 = s2.f1;
  s1.f2 = s2.f2;
}

Compiles into this IR (on x86-64 at least):

%struct.t1 = type { i8, [3 x i8] }
@s2 = global %struct.t1 zeroinitializer, align 4
@s1 = global %struct.t1 zeroinitializer, align 4
define void @func1() nounwind ssp noredzone {
entry:
  %0 = load i32* bitcast (%struct.t1* @s2 to i32*), align 4
  %bf.val.sext5 = and i32 %0, 1
  %1 = load i32* bitcast (%struct.t1* @s1 to i32*), align 4
  %2 = and i32 %1, -4
  %3 = or i32 %2, %bf.val.sext5
  %bf.val.sext26 = and i32 %0, 2
  %4 = or i32 %3, %bf.val.sext26
  store i32 %4, i32* bitcast (%struct.t1* @s1 to i32*), align 4
  ret void
}

The two or/and's should be merged into one each.

//===---------------------------------------------------------------------===//

Machine level code hoisting can be useful in some cases.  For example, PR9408
is about:

typedef union {
  void (*f1)(int);
  void (*f2)(long);
} funcs;

void foo(funcs f, int which) {
  int a = 5;
  if (which) {
    f.f1(a);
  } else {
    f.f2(a);
  }
}

which we compile to:

foo:                                    # @foo
# BB#0:                                 # %entry
	pushq	%rbp
	movq	%rsp, %rbp
	testl	%esi, %esi
	movq	%rdi, %rax
	je	.LBB0_2
# BB#1:                                 # %if.then
	movl	$5, %edi
	callq	*%rax
	popq	%rbp
	ret
.LBB0_2:                                # %if.else
	movl	$5, %edi
	callq	*%rax
	popq	%rbp
	ret

Note that bb1 and bb2 are the same.  This doesn't happen at the IR level
because one call is passing an i32 and the other is passing an i64.

//===---------------------------------------------------------------------===//

I see this sort of pattern in 176.gcc in a few places (e.g. the start of
store_bit_field).  The rem should be replaced with a multiply and subtract:

  %3 = sdiv i32 %A, %B
  %4 = srem i32 %A, %B

Similarly for udiv/urem.  Note that this shouldn't be done on X86 or ARM,
which can do this in a single operation (instruction or libcall).  It is
probably best to do this in the code generator.
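
At the source level the rewrite looks like the sketch below (the helper name
is made up for illustration); A - (A/B)*B matches C's % because both division
and remainder truncate toward zero:

#include <assert.h>
#include <stdio.h>

/* Once the quotient is available, the remainder is a multiply and a subtract
   rather than a second division. */
static int divmod_sub(int A, int B, int *rem) {
  int q = A / B;        /* corresponds to %3 = sdiv i32 %A, %B */
  *rem = A - q * B;     /* replaces       %4 = srem i32 %A, %B */
  return q;
}

int main(void) {
  int as[] = { 7, -7, 123456, -123456, 0 };
  int bs[] = { 1, 3, -3, 64 };
  for (unsigned i = 0; i < sizeof(as) / sizeof(as[0]); ++i)
    for (unsigned j = 0; j < sizeof(bs) / sizeof(bs[0]); ++j) {
      int r, q = divmod_sub(as[i], bs[j], &r);
      assert(q == as[i] / bs[j] && r == as[i] % bs[j]);
    }
  printf("A - (A/B)*B matches A %% B on the tested values\n");
  return 0;
}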

//===---------------------------------------------------------------------===//

unsigned foo(unsigned x, unsigned y) { return (x & y) == 0 || x == 0; }
should fold to (x & y) == 0.

//===---------------------------------------------------------------------===//

unsigned foo(unsigned x, unsigned y) { return x > y && x != 0; }
should fold to x > y.

//===---------------------------------------------------------------------===//