//===- README.txt - Notes for improving PowerPC-specific code gen ---------===//

TODO:
* lmw/stmw pass a la arm load store optimizer for prolog/epilog

===-------------------------------------------------------------------------===

This code:

unsigned add32carry(unsigned sum, unsigned x) {
  unsigned z = sum + x;
  if (sum + x < x)
    z++;
  return z;
}

Should compile to something like:

        addc r3,r3,r4
        addze r3,r3

instead we get:

        add r3, r4, r3
        cmplw cr7, r3, r4
        mfcr r4 ; 1
        rlwinm r4, r4, 29, 31, 31
        add r3, r3, r4

Ick.

===-------------------------------------------------------------------------===

We compile the hottest inner loop of viterbi to:

        li r6, 0
        b LBB1_84       ;bb432.i
LBB1_83:                ;bb420.i
        lbzx r8, r5, r7
        addi r6, r7, 1
        stbx r8, r4, r7
LBB1_84:                ;bb432.i
        mr r7, r6
        cmplwi cr0, r7, 143
        bne cr0, LBB1_83 ;bb420.i

The CBE manages to produce:

        li r0, 143
        mtctr r0
loop:
        lbzx r2, r2, r11
        stbx r0, r2, r9
        addi r2, r2, 1
        bdz later
        b loop

This could be much better (bdnz instead of bdz) but it still beats us.  If we
produced this with bdnz, the loop would be a single dispatch group.

===-------------------------------------------------------------------------===

Lump the constant pool for each function into ONE pic object, and reference
pieces of it as offsets from the start.  For functions like this (contrived
to have lots of constants obviously):

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

We generate:

_X:
        lis r2, ha16(.CPI_X_0)
        lfd f0, lo16(.CPI_X_0)(r2)
        lis r2, ha16(.CPI_X_1)
        lfd f2, lo16(.CPI_X_1)(r2)
        fmadd f0, f1, f0, f2
        lis r2, ha16(.CPI_X_2)
        lfd f1, lo16(.CPI_X_2)(r2)
        lis r2, ha16(.CPI_X_3)
        lfd f2, lo16(.CPI_X_3)(r2)
        fmadd f1, f0, f1, f2
        blr

It would be better to materialize .CPI_X into a register, then use immediates
off of the register to avoid the lis's.  This is even more important in PIC
mode.

Note that this (and the static variable version) is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

Here's another example (the sgn function):
double testf(double a) {
  return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
}

it produces a BB like this:
LBB1_1: ; cond_true
        lis r2, ha16(LCPI1_0)
        lfs f0, lo16(LCPI1_0)(r2)
        lis r2, ha16(LCPI1_1)
        lis r3, ha16(LCPI1_2)
        lfs f2, lo16(LCPI1_2)(r3)
        lfs f3, lo16(LCPI1_1)(r2)
        fsub f0, f0, f1
        fsel f1, f0, f2, f3
        blr

===-------------------------------------------------------------------------===

PIC Code Gen IPO optimization:

Squish small scalar globals together into a single global struct, allowing the
address of the struct to be CSE'd, avoiding PIC accesses (also reduces the size
of the GOT on targets with one).

Note that this is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

===-------------------------------------------------------------------------===

Darwin Stub removal:

We still generate calls to foo$stub, and stubs, on Darwin.  This is not
necessary when building with the Leopard (10.5) or later linker, as stubs are
generated by ld when necessary.  Parameterizing this based on the deployment
target (-mmacosx-version-min) is probably enough.  x86-32 does this right, see
its logic.
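
For reference, even a trivial external call currently gets the stub treatment
(illustrative example, not from the tree; the exact label spelling depends on
the Darwin toolchain):

extern void foo(void);
void test_call(void) {
  foo();   /* today: "bl L_foo$stub" plus a compiler-generated stub body; with a
              10.5+ deployment target a plain "bl _foo" would do, since ld
              generates any stub that is still needed */
}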

===-------------------------------------------------------------------------===

Darwin Stub LICM optimization:

Loops like this:

  for (...)  bar();

have to go through an indirect stub if bar is external or linkonce.  It would
be better to compile it as:

  fp = &bar;
  for (...)  fp();

which only computes the address of bar once (instead of each time through the
stub).  This is Darwin specific and would have to be done in the code generator.
Probably not a win on x86.

===-------------------------------------------------------------------------===

Simple IPO for argument passing, change:
  void foo(int X, double Y, int Z) -> void foo(int X, int Z, double Y)

The Darwin ABI specifies that any integer arguments in the first 32 bytes worth
of arguments get assigned to r3 through r10.  That is, if you have a function
foo(int, double, int) you get r3, f1, r6, since the 64-bit double ate up the
argument bytes for r4 and r5.  The trick then would be to shuffle the argument
order for functions we can internalize so that the maximum number of
integers/pointers get passed in regs before you see any of the fp arguments.

Instead of implementing this, it would actually probably be easier to just
implement a PPC fastcc, where we could do whatever we wanted to the CC,
including having this work sanely.

===-------------------------------------------------------------------------===

Fix Darwin FP-In-Integer Registers ABI

Darwin passes doubles in structures in integer registers, which is very very
bad.  Add something like a BITCAST to LLVM, then do an i-p transformation that
percolates these things out of functions.

Check out how horrible this is:
http://gcc.gnu.org/ml/gcc/2005-10/msg01036.html

This is an extension of "interprocedural CC unmunging" that can't be done with
just fastcc.

===-------------------------------------------------------------------------===

Fold add and sub with constant into non-extern, non-weak addresses so that this:

static int a;
void bar(int b) { a = b; }
void foo(unsigned char *c) {
  *c = a;
}

which currently compiles to:

_foo:
        lis r2, ha16(_a)
        la r2, lo16(_a)(r2)
        lbz r2, 3(r2)
        stb r2, 0(r3)
        blr

becomes:

_foo:
        lis r2, ha16(_a+3)
        lbz r2, lo16(_a+3)(r2)
        stb r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

We should compile these two functions to the same thing:

#include <stdlib.h>
void f(int a, int b, int *P) {
  *P = (a-b)>=0?(a-b):(b-a);
}
void g(int a, int b, int *P) {
  *P = abs(a-b);
}

Further, they should compile to something better than:

_g:
        subf r2, r4, r3
        subfic r3, r2, 0
        cmpwi cr0, r2, -1
        bgt cr0, LBB2_2 ; entry
LBB2_1: ; entry
        mr r2, r3
LBB2_2: ; entry
        stw r2, 0(r5)
        blr

GCC produces:

_g:
        subf r4,r4,r3
        srawi r2,r4,31
        xor r0,r2,r4
        subf r0,r2,r0
        stw r0,0(r5)
        blr

... which is much nicer.

This theoretically may help improve twolf slightly (used in dimbox.c:142?).
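
For reference, the GCC sequence above is the standard branchless abs idiom
(hypothetical helper name; assumes arithmetic right shift of negative ints,
which holds on PPC):

int iabs(int x) {
  int mask = x >> 31;         /* srawi: all ones if x is negative, else zero */
  return (x ^ mask) - mask;   /* xor + subf                                  */
}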

===-------------------------------------------------------------------------===

PR5945: This:
define i32 @clamp0g(i32 %a) {
entry:
  %cmp = icmp slt i32 %a, 0
  %sel = select i1 %cmp, i32 0, i32 %a
  ret i32 %sel
}

is compiled to this with the PowerPC (32-bit) backend:

_clamp0g:
        cmpwi cr0, r3, 0
        li r2, 0
        blt cr0, LBB1_2
; BB#1:         ; %entry
        mr r2, r3
LBB1_2:         ; %entry
        mr r3, r2
        blr

This could be reduced to the much simpler:

_clamp0g:
        srawi r2, r3, 31
        andc r3, r3, r2
        blr

===-------------------------------------------------------------------------===

int foo(int N, int ***W, int **TK, int X) {
  int t, i;

  for (t = 0; t < N; ++t)
    for (i = 0; i < 4; ++i)
      W[t / X][i][t % X] = TK[i][t];

  return 5;
}

We generate relatively atrocious code for this loop compared to gcc.

We could also strength reduce the rem and the div:
http://www.lcs.mit.edu/pubs/pdf/MIT-LCS-TM-600.pdf

===-------------------------------------------------------------------------===

We generate ugly code for this:

void func(unsigned int *ret, float dx, float dy, float dz, float dw) {
  unsigned code = 0;
  if(dx < -dw) code |= 1;
  if(dx > dw)  code |= 2;
  if(dy < -dw) code |= 4;
  if(dy > dw)  code |= 8;
  if(dz < -dw) code |= 16;
  if(dz > dw)  code |= 32;
  *ret = code;
}

===-------------------------------------------------------------------------===

%struct.B = type { i8, [3 x i8] }

define void @bar(%struct.B* %b) {
entry:
  %tmp0 = bitcast %struct.B* %b to i32*         ; <i32*> [#uses=1]
  %tmp = load i32* %tmp0                        ; <i32> [#uses=1]
  %tmp3 = bitcast %struct.B* %b to i32*         ; <i32*> [#uses=1]
  %tmp4 = load i32* %tmp3                       ; <i32> [#uses=1]
  %tmp8 = bitcast %struct.B* %b to i32*         ; <i32*> [#uses=2]
  %tmp9 = load i32* %tmp8                       ; <i32> [#uses=1]
  %tmp4.mask17 = shl i32 %tmp4, 1               ; <i32> [#uses=1]
  %tmp1415 = and i32 %tmp4.mask17, 2147483648   ; <i32> [#uses=1]
  %tmp.masked = and i32 %tmp, 2147483648        ; <i32> [#uses=1]
  %tmp11 = or i32 %tmp1415, %tmp.masked         ; <i32> [#uses=1]
  %tmp12 = and i32 %tmp9, 2147483647            ; <i32> [#uses=1]
  %tmp13 = or i32 %tmp12, %tmp11                ; <i32> [#uses=1]
  store i32 %tmp13, i32* %tmp8
  ret void
}

We emit:

_bar:
        lwz r2, 0(r3)
        slwi r4, r2, 1
        or r4, r4, r2
        rlwimi r2, r4, 0, 0, 0
        stw r2, 0(r3)
        blr

We could collapse a bunch of those ORs and ANDs and generate the following
equivalent code:

_bar:
        lwz r2, 0(r3)
        rlwinm r4, r2, 1, 0, 0
        or r2, r2, r4
        stw r2, 0(r3)
        blr

===-------------------------------------------------------------------------===

Consider a function like this:

float foo(float X) { return X + 1234.4123f; }

The FP constant ends up in the constant pool, so we need to get the LR register.
This ends up producing code like this:

_foo:
.LBB_foo_0:     ; entry
        mflr r11
***     stw r11, 8(r1)
        bl "L00000$pb"
"L00000$pb":
        mflr r2
        addis r2, r2, ha16(.CPI_foo_0-"L00000$pb")
        lfs f0, lo16(.CPI_foo_0-"L00000$pb")(r2)
        fadds f1, f1, f0
***     lwz r11, 8(r1)
        mtlr r11
        blr

This is functional, but there is no reason to spill the LR register all the way
to the stack (the two marked instrs): spilling it to a GPR is quite enough.

Implementing this will require some codegen improvements.
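
Roughly what we would like to see instead is the same code with the two marked
spills gone, keeping LR in a scratch GPR (sketch; the register choice is
illustrative):

_foo:
        mflr r11                ; keep LR in a GPR instead of on the stack
        bl "L00000$pb"
"L00000$pb":
        mflr r2
        addis r2, r2, ha16(.CPI_foo_0-"L00000$pb")
        lfs f0, lo16(.CPI_foo_0-"L00000$pb")(r2)
        fadds f1, f1, f0
        mtlr r11                ; restore LR from the GPR
        blr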
Nate writes:

"So basically what we need to support the "no stack frame save and restore" is a
generalization of the LR optimization to "callee-save regs".

Currently, we have LR marked as a callee-save reg.  The register allocator sees
that it's callee save, and spills it directly to the stack.

Ideally, something like this would happen:

LR would be in a separate register class from the GPRs.  The class of LR would
be marked "unspillable".  When the register allocator came across an unspillable
reg, it would ask "what is the best class to copy this into that I *can* spill?"
If it gets a class back, which it will in this case (the gprs), it grabs a free
register of that class.  If it is then later necessary to spill that reg, so be
it."

===-------------------------------------------------------------------------===

We compile this:
int test(_Bool X) {
  return X ? 524288 : 0;
}

to:
_test:
        cmplwi cr0, r3, 0
        lis r2, 8
        li r3, 0
        beq cr0, LBB1_2 ;entry
LBB1_1: ;entry
        mr r3, r2
LBB1_2: ;entry
        blr

instead of:
_test:
        addic r2,r3,-1
        subfe r0,r2,r3
        slwi r3,r0,19
        blr

This sort of thing occurs a lot due to globalopt.

===-------------------------------------------------------------------------===

We compile:

define i32 @bar(i32 %x) nounwind readnone ssp {
entry:
  %0 = icmp eq i32 %x, 0                        ; <i1> [#uses=1]
  %neg = sext i1 %0 to i32                      ; <i32> [#uses=1]
  ret i32 %neg
}

to:

_bar:
        cntlzw r2, r3
        slwi r2, r2, 26
        srawi r3, r2, 31
        blr

it would be better to produce:

_bar:
        addic r3,r3,-1
        subfe r3,r3,r3
        blr

===-------------------------------------------------------------------------===

We generate horrible ppc code for this:

#define N  2000000
double   a[N],c[N];
void simpleloop() {
   int j;
   for (j=0; j<N; j++)
     c[j] = a[j];
}

LBB1_1: ;bb
        lfdx f0, r3, r4
        addi r5, r5, 1                 ;; Extra IV for the exit value compare.
        stfdx f0, r2, r4
        addi r4, r4, 8

        xoris r6, r5, 30               ;; This is due to a large immediate.
        cmplwi cr0, r6, 33920
        bne cr0, LBB1_1

//===---------------------------------------------------------------------===//

This:
        #include <algorithm>
        inline std::pair<unsigned, bool> full_add(unsigned a, unsigned b)
        { return std::make_pair(a + b, a + b < a); }
        bool no_overflow(unsigned a, unsigned b)
        { return !full_add(a, b).second; }

Should compile to:

__Z11no_overflowjj:
        add r4,r3,r4
        subfc r3,r3,r4
        li r3,0
        adde r3,r3,r3
        blr

(or better) not:

__Z11no_overflowjj:
        add r2, r4, r3
        cmplw cr7, r2, r3
        mfcr r2
        rlwinm r2, r2, 29, 31, 31
        xori r3, r2, 1
        blr
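
Note that the same predicate can be written without the pair at all; ideally
the backend would recognize both spellings (sketch, hypothetical function
name):

        bool no_overflow2(unsigned a, unsigned b)
        { return a + b >= a; }          // true iff a + b does not wrap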

//===---------------------------------------------------------------------===//

We compile some FP comparisons into an mfcr with two rlwinms and an or.  For
example:
#include <math.h>
int test(double x, double y) { return islessequal(x, y);}
int test2(double x, double y) { return islessgreater(x, y);}
int test3(double x, double y) { return !islessequal(x, y);}

Compiles into (all three are similar, but the bits differ):

_test:
        fcmpu cr7, f1, f2
        mfcr r2
        rlwinm r3, r2, 29, 31, 31
        rlwinm r2, r2, 31, 31, 31
        or r3, r2, r3
        blr

GCC compiles this into:

_test:
        fcmpu cr7,f1,f2
        cror 30,28,30
        mfcr r3
        rlwinm r3,r3,31,1
        blr

which is more efficient and can use mfocrf.  See PR642 for some more context.

//===---------------------------------------------------------------------===//

void foo(float *data, float d) {
  long i;
  for (i = 0; i < 8000; i++)
    data[i] = d;
}
void foo2(float *data, float d) {
  long i;
  data--;
  for (i = 0; i < 8000; i++) {
    data[1] = d;
    data++;
  }
}

These compile to:

_foo:
        li r2, 0
LBB1_1: ; bb
        addi r4, r2, 4
        stfsx f1, r3, r2
        cmplwi cr0, r4, 32000
        mr r2, r4
        bne cr0, LBB1_1 ; bb
        blr
_foo2:
        li r2, 0
LBB2_1: ; bb
        addi r4, r2, 4
        stfsx f1, r3, r2
        cmplwi cr0, r4, 32000
        mr r2, r4
        bne cr0, LBB2_1 ; bb
        blr

The 'mr' could be eliminated by folding the add into the cmp better.

//===---------------------------------------------------------------------===//

Codegen for the following (low-probability) case deteriorated considerably
when the correctness fixes for unordered comparisons went in (PR 642, 58871).
It should be possible to recover the code quality described in the comments.

; RUN: llvm-as < %s | llc -march=ppc32 | grep or | count 3
; This should produce one 'or' or 'cror' instruction per function.

; RUN: llvm-as < %s | llc -march=ppc32 | grep mfcr | count 3
; PR2964

define i32 @test(double %x, double %y) nounwind {
entry:
        %tmp3 = fcmp ole double %x, %y          ; <i1> [#uses=1]
        %tmp345 = zext i1 %tmp3 to i32          ; <i32> [#uses=1]
        ret i32 %tmp345
}

define i32 @test2(double %x, double %y) nounwind {
entry:
        %tmp3 = fcmp one double %x, %y          ; <i1> [#uses=1]
        %tmp345 = zext i1 %tmp3 to i32          ; <i32> [#uses=1]
        ret i32 %tmp345
}

define i32 @test3(double %x, double %y) nounwind {
entry:
        %tmp3 = fcmp ugt double %x, %y          ; <i1> [#uses=1]
        %tmp34 = zext i1 %tmp3 to i32           ; <i32> [#uses=1]
        ret i32 %tmp34
}

//===----------------------------------------------------------------------===//

; RUN: llvm-as < %s | llc -march=ppc32 | not grep fneg

; This could generate FSEL with appropriate flags (FSEL is not IEEE-safe, and
; should not be generated except with -enable-finite-only-fp-math or the like).
; With the correctness fixes for PR642 (58871) LowerSELECT_CC would need to
; recognize a more elaborate tree than a simple SETxx.

define double @test_FNEG_sel(double %A, double %B, double %C) {
        %D = fsub double -0.000000e+00, %A              ; <double> [#uses=1]
        %Cond = fcmp ugt double %D, -0.000000e+00       ; <i1> [#uses=1]
        %E = select i1 %Cond, double %B, double %C      ; <double> [#uses=1]
        ret double %E
}
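
For reference, a C-level version of what this test exercises (hypothetical
name; only equivalent under finite-math assumptions, since the IR uses an
unordered compare):

double test_FNEG_sel_c(double A, double B, double C) {
  /* With finite-only FP math this matches the IR above and could be
     lowered to a single fsel on the negated value. */
  return -A > 0.0 ? B : C;
}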

//===----------------------------------------------------------------------===//

The save/restore sequence for CR in prolog/epilog is terrible:
- Each CR subreg is saved individually, rather than doing one save as a unit.
- On Darwin, the save is done after the decrement of SP, which means the offset
  from SP of the save slot can be too big for a store instruction, so we need
  an additional register (currently hacked in 96015+96020; the solution there
  is correct, but poor).
- On SVR4 the same thing can happen, and I don't think saving before the SP
  decrement is safe on that target, as there is no red zone.  This is currently
  broken AFAIK, although it's not a target I can exercise.
The following demonstrates the problem:
extern void bar(char *p);
void foo() {
  char x[100000];
  bar(x);
  __asm__("" ::: "cr2");
}

//===----------------------------------------------------------------------===//

The naming convention for instruction formats is very haphazard.
We have agreed on a naming scheme as follows:

<INST_form>{_<OP_type><OP_len>}+

Where:
  INST_form is the instruction format (X-form, etc.)
  OP_type   is the operand type - one of OPC  (opcode),
                                         RD   (register destination),
                                         RS   (register source),
                                         RDp  (destination register pair),
                                         RSp  (source register pair),
                                         IM   (immediate),
                                         XO   (extended opcode)
  OP_len    is the length of the operand in bits

VSX register operands would be of length 6 (split across two fields),
condition register fields of length 3.
We would not need to denote reserved fields in the names of instruction formats.
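
As a purely hypothetical illustration of the scheme (not an existing
definition), a D-form load layout - a 6-bit opcode, a 5-bit register
destination, a 5-bit register source (the base), and a 16-bit immediate
displacement - would be named something like:

  D_OPC6_RD5_RS5_IM16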

//===----------------------------------------------------------------------===//

Instruction fusion was introduced in ISA 2.06 and more opportunities added in
ISA 2.07.  LLVM needs to add infrastructure to recognize fusion opportunities
and force instruction pairs to be scheduled together.

//===- README_ALTIVEC.txt - Notes for improving Altivec code gen ----------===//

Implement PPCInstrInfo::isLoadFromStackSlot/isStoreToStackSlot for vector
registers, to generate better spill code.

//===----------------------------------------------------------------------===//

The first should be a single lvx from the constant pool, the second should be
an xor/stvx:

void foo(void) {
  int x[8] __attribute__((aligned(128))) = { 1, 1, 1, 17, 1, 1, 1, 1 };
  bar (x);
}

#include <string.h>
void foo(void) {
  int x[8] __attribute__((aligned(128)));
  memset (x, 0, sizeof (x));
  bar (x);
}

//===----------------------------------------------------------------------===//

Altivec: Codegen'ing MUL with vector FMADD should add -0.0, not 0.0:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=8763

When -ffast-math is on, we can use 0.0.

//===----------------------------------------------------------------------===//

Consider this:
  v4f32 Vector;
  v4f32 Vector2 = { Vector.X, Vector.X, Vector.X, Vector.X };

Since we know that "Vector" is 16-byte aligned and we know the element offset
of ".X", we should change the load into a lve*x instruction, instead of doing
a load/store/lve*x sequence.

//===----------------------------------------------------------------------===//

For functions that use altivec AND have calls, we are VRSAVE'ing all
call-clobbered regs.

//===----------------------------------------------------------------------===//

Implement passing vectors by value into calls and receiving them as arguments.

//===----------------------------------------------------------------------===//

GCC apparently tries to codegen { C1, C2, Variable, C3 } as a constant pool load
of C1/C2/C3, then a load and vperm of Variable.

//===----------------------------------------------------------------------===//

We need a way to teach tblgen that some operands of an intrinsic are required to
be constants.  The verifier should enforce this constraint.

//===----------------------------------------------------------------------===//

We currently codegen SCALAR_TO_VECTOR as a store of the scalar to a 16-byte
aligned stack slot, followed by a load/vperm.  We should probably just store it
to a scalar stack slot, then use lvsl/vperm to load it.  If the value is already
in memory this is a big win.

//===----------------------------------------------------------------------===//

extract_vector_elt of an arbitrary constant vector can be done with the
following instructions:

vTemp = vec_splat(v0,2);    // 2 is the element the src is in.
vec_ste(&destloc,0,vTemp);

We can do an arbitrary non-constant value by using lvsr/perm/ste.

//===----------------------------------------------------------------------===//

If we want to tie instruction selection into the scheduler, we can do some
constant formation with different instructions.  For example, we can generate
"vsplti -1" with "vcmpequw R,R" and 1,1,1,1 with "vsubcuw R,R", and 0,0,0,0 with
"vsplti 0" or "vxor", each of which uses different execution units and thus
could help scheduling.

This is probably only reasonable for a post-pass scheduler.
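
For reference, the equivalences mentioned above, written out (sketch; R can be
any vector register, since only the identity of the two inputs matters):

        vspltisw vD, -1   <=>  vcmpequw vD, vR, vR   ; all ones
        vspltisw vD, 1    <=>  vsubcuw  vD, vR, vR   ; 1,1,1,1 (carry of R-R)
        vspltisw vD, 0    <=>  vxor     vD, vR, vR   ; all zeros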

//===----------------------------------------------------------------------===//

For this function:

void test(vector float *A, vector float *B) {
  vector float C = (vector float)vec_cmpeq(*A, *B);
  if (!vec_any_eq(*A, *B))
    *B = (vector float){0,0,0,0};
  *A = C;
}

we get the following basic block:

        ...
        lvx v2, 0, r4
        lvx v3, 0, r3
        vcmpeqfp v4, v3, v2
        vcmpeqfp. v2, v3, v2
        bne cr6, LBB1_2 ; cond_next

The vcmpeqfp/vcmpeqfp. instructions currently cannot be merged when the
vcmpeqfp. result is used by a branch.  This can be improved.

//===----------------------------------------------------------------------===//

The code generated for this is truly awful:

vector float test(float a, float b) {
  return (vector float){ 0.0, a, 0.0, 0.0};
}

LCPI1_0:                                        ;  float
        .space  4
        .text
        .globl  _test
        .align  4
_test:
        mfspr r2, 256
        oris r3, r2, 4096
        mtspr 256, r3
        lis r3, ha16(LCPI1_0)
        addi r4, r1, -32
        stfs f1, -16(r1)
        addi r5, r1, -16
        lfs f0, lo16(LCPI1_0)(r3)
        stfs f0, -32(r1)
        lvx v2, 0, r4
        lvx v3, 0, r5
        vmrghw v3, v3, v2
        vspltw v2, v2, 0
        vmrghw v2, v2, v3
        mtspr 256, r2
        blr

//===----------------------------------------------------------------------===//

int foo(vector float *x, vector float *y) {
  if (vec_all_eq(*x,*y)) return 3245;
  else return 12;
}

A predicate compare being used in a select_cc should have the same peephole
applied to it as a predicate compare used by a br_cc.  There should be no
mfcr here:

_foo:
        mfspr r2, 256
        oris r5, r2, 12288
        mtspr 256, r5
        li r5, 12
        li r6, 3245
        lvx v2, 0, r4
        lvx v3, 0, r3
        vcmpeqfp. v2, v3, v2
        mfcr r3, 2
        rlwinm r3, r3, 25, 31, 31
        cmpwi cr0, r3, 0
        bne cr0, LBB1_2 ; entry
LBB1_1: ; entry
        mr r6, r5
LBB1_2: ; entry
        mr r3, r6
        mtspr 256, r2
        blr

//===----------------------------------------------------------------------===//

CodeGen/PowerPC/vec_constants.ll has an and operation that should be
codegen'd to andc.  The issue is that the 'all ones' build vector is
SelectNodeTo'd to a VSPLTISB instruction node before the and/xor is selected,
which prevents the vnot pattern from matching.

//===----------------------------------------------------------------------===//

An alternative to the store/store/load approach for illegal insert element
lowering would be:

1. store element to any ol' slot
2. lvx the slot
3. lvsl 0; splat index; vcmpeq to generate a select mask
4. lvsl slot + x; vperm to rotate result into correct slot
5. vsel result together.

//===----------------------------------------------------------------------===//

Should codegen branches on vec_any/vec_all to avoid mfcr.  Two examples:

#include <altivec.h>
int f(vector float a, vector float b)
{
  int aa = 0;
  if (vec_all_ge(a, b))
    aa |= 0x1;
  if (vec_any_ge(a,b))
    aa |= 0x2;
  return aa;
}

vector float f(vector float a, vector float b) {
  if (vec_any_eq(a, b))
    return a;
  else
    return b;
}
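
For the second function, what we would like is essentially the pattern already
shown in the earlier vec_any_eq example: branch on CR6 straight from the
record-form compare, with no mfcr/rlwinm/cmpwi in between (sketch; labels and
register assignments are illustrative):

        vcmpeqfp. v2, v2, v3
        bne cr6, LBB_ret_a      ; some element compared equal -> return a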

//===----------------------------------------------------------------------===//

We should do a little better with eliminating dead stores.
The stores to the stack are dead since %a and %b are not needed:

; Function Attrs: nounwind
define <16 x i8> @test_vpmsumb() #0 {
entry:
  %a = alloca <16 x i8>, align 16
  %b = alloca <16 x i8>, align 16
  store <16 x i8> <i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16>, <16 x i8>* %a, align 16
  store <16 x i8> <i8 113, i8 114, i8 115, i8 116, i8 117, i8 118, i8 119, i8 120, i8 121, i8 122, i8 123, i8 124, i8 125, i8 126, i8 127, i8 112>, <16 x i8>* %b, align 16
  %0 = load <16 x i8>* %a, align 16
  %1 = load <16 x i8>* %b, align 16
  %2 = call <16 x i8> @llvm.ppc.altivec.crypto.vpmsumb(<16 x i8> %0, <16 x i8> %1)
  ret <16 x i8> %2
}

; Function Attrs: nounwind readnone
declare <16 x i8> @llvm.ppc.altivec.crypto.vpmsumb(<16 x i8>, <16 x i8>) #1

Produces the following code with -mtriple=powerpc64-unknown-linux-gnu:
# BB#0:                                 # %entry
        addis 3, 2, .LCPI0_0@toc@ha
        addis 4, 2, .LCPI0_1@toc@ha
        addi 3, 3, .LCPI0_0@toc@l
        addi 4, 4, .LCPI0_1@toc@l
        lxvw4x 0, 0, 3
        addi 3, 1, -16
        lxvw4x 35, 0, 4
        stxvw4x 0, 0, 3
        ori 2, 2, 0
        lxvw4x 34, 0, 3
        addi 3, 1, -32
        stxvw4x 35, 0, 3
        vpmsumb 2, 2, 3
        blr
        .long   0
        .quad   0

The two stxvw4x instructions are not needed.
With -mtriple=powerpc64le-unknown-linux-gnu, the associated permutes
are present too.

//===----------------------------------------------------------------------===//

The following example is found in test/CodeGen/PowerPC/vec_add_sub_doubleword.ll:

define <2 x i64> @increment_by_val(<2 x i64> %x, i64 %val) nounwind {
  %tmpvec = insertelement <2 x i64> <i64 0, i64 0>, i64 %val, i32 0
  %tmpvec2 = insertelement <2 x i64> %tmpvec, i64 %val, i32 1
  %result = add <2 x i64> %x, %tmpvec2
  ret <2 x i64> %result
}

This will generate the following instruction sequence:
        std 5, -8(1)
        std 5, -16(1)
        addi 3, 1, -16
        ori 2, 2, 0
        lxvd2x 35, 0, 3
        vaddudm 2, 2, 3
        blr

This will almost certainly cause a load-hit-store hazard.
Since val is a value parameter, it should not need to be saved onto
the stack, unless it's being done to set up the vector register.  Instead,
it would be better to splat the value into a vector register, and then
remove the (dead) stores to the stack.

//===----------------------------------------------------------------------===//

At the moment we always generate an lxsdx in preference to lfd, or stxsdx in
preference to stfd.  When we have a reg-immediate addressing mode, this is a
poor choice, since we have to load the address into an index register.  This
should be fixed for P7/P8.
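
For example (sketch; registers and offset are illustrative), a simple reload
with reg+imm addressing available would be:

        lfd 1, 64(1)            ; one D-form instruction

versus what we generate today:

        addi 3, 1, 64           ; materialize the address in a GPR
        lxsdx 1, 0, 3           ; indexed-form load into VSR 1 (overlaps FPR 1)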

//===----------------------------------------------------------------------===//

Right now, ShuffleKind 0 is supported only on BE, and ShuffleKind 2 only on LE.
However, we could actually support both kinds on either endianness, if we check
for the appropriate shufflevector pattern for each case ... this would cause
some additional shufflevectors to be recognized and implemented via the
"swapped" form.

//===----------------------------------------------------------------------===//

There is a utility program called PerfectShuffle that generates a table of the
shortest instruction sequence for implementing a shufflevector operation on
PowerPC.  However, this was designed for big-endian code generation.  We could
modify this program to create a little endian version of the table.  The table
is used in PPCISelLowering.cpp, PPCTargetLowering::LowerVECTOR_SHUFFLE().

//===----------------------------------------------------------------------===//

Opportunities to use instructions from PPCInstrVSX.td during code gen
  - Conversion instructions (Sections 7.6.1.5 and 7.6.1.6 of ISA 2.07)
  - Scalar comparisons (xscmpodp and xscmpudp)
  - Min and max (xsmaxdp, xsmindp, xvmaxdp, xvmindp, xvmaxsp, xvminsp)

Related to this: we currently do not generate the lxvw4x instruction for either
v4f32 or v4i32, probably because adding a dag pattern to the recognizer requires
a single target type.  This should probably be addressed in the PPCISelDAGToDAG
logic.

//===----------------------------------------------------------------------===//

Currently EXTRACT_VECTOR_ELT and INSERT_VECTOR_ELT are type-legal only
for v2f64 with VSX available.  We should create custom lowering
support for the other vector types.  Without this support, we generate
sequences with load-hit-store hazards.

v4f32 can be supported with VSX by shifting the correct element into
big-endian lane 0, using xscvspdpn to produce a double-precision
representation of the single-precision value in big-endian
double-precision lane 0, and reinterpreting lane 0 as an FPR or
vector-scalar register.

v2i64 can be supported with VSX and P8Vector in the same manner as
v2f64, followed by a direct move to a GPR.

v4i32 can be supported with VSX and P8Vector by shifting the correct
element into big-endian lane 1, using a direct move to a GPR, and
sign-extending the 32-bit result to 64 bits.

v8i16 can be supported with VSX and P8Vector by shifting the correct
element into big-endian lane 3, using a direct move to a GPR, and
sign-extending the 16-bit result to 64 bits.

v16i8 can be supported with VSX and P8Vector by shifting the correct
element into big-endian lane 7, using a direct move to a GPR, and
sign-extending the 8-bit result to 64 bits.
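
For concreteness, a minimal example of the v4i32 case above (sketch); today
this kind of extract goes through a stack temporary, which is where the
load-hit-store hazard comes from:

define i32 @extract_elt2(<4 x i32> %v) nounwind readnone {
entry:
  %e = extractelement <4 x i32> %v, i32 2
  ret i32 %e
}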