Verification todo
~~~~~~~~~~~~~~~~~
check that illegal insns on all targets don't cause the _toIR.c's to
assert.  [DONE: amd64 x86 ppc32 ppc64 arm s390]

check also with --vex-guest-chase-cond=yes

check that all targets can run their insn set tests with
--vex-guest-max-insns=1.

all targets: run some tests using --profile-flags=... to exercise
function patchProfInc_<arch> [DONE: amd64 x86 ppc32 ppc64 arm s390]

figure out if there is a way to write a test program that checks
that event checks are actually getting triggered


Cleanups
~~~~~~~~
host_arm_isel.c and host_arm_defs.c: get rid of global var arm_hwcaps.

host_x86_defs.c, host_amd64_defs.c: return proper VexInvalRange
records from the patchers, instead of {0,0}, so that transparent
self hosting works properly.

host_ppc_defs.h: is RdWrLR still needed?  If not delete.

ditto ARM, Ld8S

Comments that used to be in m_scheduler.c:
   tchaining tests:
   - extensive spinrounds
   - with sched quantum = 1  -- check that handle_noredir_jump
     doesn't return with INNER_COUNTERZERO
   other:
   - out of date comment w.r.t. bit 0 set in libvex_trc_values.h
   - can VG_TRC_BORING still happen?  if not, rm
   - memory leaks in m_transtab (InEdgeArr/OutEdgeArr leaking?)
   - move do_cacheflush out of m_transtab
   - more economical unchaining when nuking an entire sector
   - ditto w.r.t. cache flushes
   - verify case of 2 paths from A to B
   - check -- is IP_AT_SYSCALL still right?


Optimisations
~~~~~~~~~~~~~
ppc: chain_XDirect: generate short form jumps when possible

ppc64: immediate generation is terrible .. should be able
       to do better

arm codegen: Generate ORRS for CmpwNEZ32(Or32(x,y))

all targets: when nuking an entire sector, don't bother to undo the
patching for any translations within the sector (nor with their
invalidations).

(somewhat implausible) for jumps to disp_cp_xindir, have multiple
copies of disp_cp_xindir, one for each of the possible registers that
could have held the target guest address before jumping to the stub.
Then disp_cp_xindir wouldn't have to reload it from memory each time.
Might also have the effect of spreading out the indirect mispredict
burden somewhat (across the multiple copies.)


Implementation notes
~~~~~~~~~~~~~~~~~~~~
T-chaining changes -- summary

* The code generators (host_blah_isel.c, host_blah_defs.[ch]) interact
  more closely with Valgrind than before.  In particular the
  instruction selectors must use one of 3 different kinds of
  control-transfer instructions: XDirect, XIndir and XAssisted.
  All archs must use these in the same way; no more ad-hoc control
  transfer instructions.
  (more detail below)


* With T-chaining, translations can jump between each other without
  going through the dispatcher loop every time.  This means that the
  event check (counter dec, and exit if negative) the dispatcher loop
  previously did now needs to be compiled into each translation.


* The assembly dispatcher code (dispatch-arch-os.S) is still
  present.  It still provides table lookup services for
  indirect branches, but it also provides a new feature:
  dispatch points, to which the generated code jumps.  There
  are 5:

  VG_(disp_cp_chain_me_to_slowEP):
  VG_(disp_cp_chain_me_to_fastEP):
    These are chain-me requests, used for Boring conditional and
    unconditional jumps to destinations known at JIT time.  The
    generated code calls these (doesn't jump to them) and the
    stub recovers the return address.  These calls never return;
    instead the call is done so that the stub knows where the
    calling point is.  It needs to know this so it can patch
    the calling point to the requested destination.
  VG_(disp_cp_xindir):
    Old-style table lookup and go; used for indirect jumps
  VG_(disp_cp_xassisted):
    Most general and slowest kind.  Can transfer to anywhere, but
    first returns to scheduler to do some other event (eg a syscall)
    before continuing.
  VG_(disp_cp_evcheck_fail):
    Code jumps here when the event check fails.


* new instructions in backends: XDirect, XIndir and XAssisted.
  XDirect is used for chainable jumps.  It is compiled into a
  call to VG_(disp_cp_chain_me_to_slowEP) or
  VG_(disp_cp_chain_me_to_fastEP).

  XIndir is used for indirect jumps.  It is compiled into a jump
  to VG_(disp_cp_xindir)

  XAssisted is used for "assisted" (do something first, then jump)
  transfers.  It is compiled into a jump to VG_(disp_cp_xassisted)

  All 3 of these may be conditional.

  More complexity: in some circumstances (no-redir translations)
  all transfers must be done with XAssisted.  In such cases the
  instruction selector will be told this.


* Patching: XDirect is compiled basically into
     %r11 = &VG_(disp_cp_chain_me_to_{slow,fast}EP)
     call *%r11
  Backends must provide a function (eg) chainXDirect_AMD64
  which converts it into a jump to a specified destination
     jmp $delta-of-PCs
  or
     %r11 = 64-bit immediate
     jmpq *%r11
  depending on branch distance.

  Backends must provide a function (eg) unchainXDirect_AMD64
  which restores the original call-to-the-stub version.


* Event checks.  Each translation now has two entry points,
  the slow one (slowEP) and fast one (fastEP).  Like this:

     slowEP:
        counter--
        if (counter < 0) goto VG_(disp_cp_evcheck_fail)
     fastEP:
        (rest of the translation)

  slowEP is used for control flow transfers that are or might be
  a back edge in the control flow graph.
  Insn selectors are given the address of the highest guest byte in
  the block so they can determine which edges are definitely not
  back edges.

  The counter is placed in the first 8 bytes of the guest state,
  and the address of VG_(disp_cp_evcheck_fail) is placed in
  the next 8 bytes.  This allows very compact checks on all
  targets, since no immediates need to be synthesised, eg:

    decq 0(%baseblock-pointer)
    jns  fastEP
    jmpq *8(%baseblock-pointer)
    fastEP:

  On amd64 a non-failing check is therefore 2 insns; all 3 occupy
  just 8 bytes.

  On amd64 the event check is created by a special single
  pseudo-instruction AMD64_EvCheck.


* BB profiling (for --profile-flags=).  The dispatch assembly
  dispatch-arch-os.S no longer deals with this and so is much
  simplified.  Instead the profile inc is compiled into each
  translation, as the insn immediately following the event
  check.  Again, on amd64 a pseudo-insn AMD64_ProfInc is used.
  Counters are now 64 bit even on 32 bit hosts, to avoid overflow.

  One complexity is that the address of the counter is not known
  at JIT time.  To solve this, VexTranslateResult now returns the
  offset of the profile inc in the generated code.  When the counter
  address is known, VEX can be called again to patch it in.
  Backends must supply eg patchProfInc_AMD64 to make this happen.


* Front end changes (guest_blah_toIR.c)

  The way the guest program counter is handled has changed
  significantly.  Previously, the guest PC was updated (in IR)
  at the start of each instruction, except for the first insn
  in an IRSB.  This is inconsistent and doesn't work with the
  new framework.

  Now, each instruction must update the guest PC as its last
  IR statement -- not its first.  And no special exemption for
  the first insn in the block.
  As before most of these are optimised out by ir_opt, so no
  concerns about efficiency.

  As a logical side effect of this, exits (IRStmt_Exit) and the
  block-end transfer are both considered to write to the guest state
  (the guest PC) and so need to be told the offset of it.

  IR generators (eg disInstr_AMD64) are no longer allowed to set the
  IRSB::next, to specify the block-end transfer address.  Instead they
  now indicate, to the generic steering logic that drives them (iow,
  guest_generic_bb_to_IR.c), that the block has ended.  This then
  generates effectively "goto GET(PC)" (which, again, is optimised
  away).  What this does mean is that if the IR generator function
  ends the IR of the last instruction in the block with an incorrect
  assignment to the guest PC, execution will transfer to an incorrect
  destination -- making the error obvious quickly.
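
Illustrative sketch (not VEX code): the slowEP/fastEP event check
described under "Event checks" above can be modelled in plain C as
below.  Every name here (ToyGuestState, toy_translation, toy_run_loop
and so on) is hypothetical and exists only for this sketch.  The one
thing taken from the notes is the layout (counter in the first 8
bytes, fail-handler address starting at byte 8) and the rule
"decrement, fail if negative".  Refilling the counter after a failure
is an assumption standing in for whatever the scheduler really does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ToyGuestState ToyGuestState;
typedef void (*ToyEvCheckFail)(ToyGuestState *);

/* Hypothetical stand-in for the start of the guest state. */
struct ToyGuestState {
    int64_t        evc_counter;    /* bytes 0..7: event-check counter   */
    ToyEvCheckFail evc_failaddr;   /* from byte 8: fail handler address */
    uint64_t       evcheck_fails;  /* bookkeeping for this sketch only  */
    uint64_t       bodies_run;     /* bookkeeping for this sketch only  */
};

/* The compact check relies on fixed offsets, so pin them down. */
_Static_assert(offsetof(ToyGuestState, evc_counter) == 0, "counter first");
_Static_assert(offsetof(ToyGuestState, evc_failaddr) == 8, "fail addr next");

/* Toy fail handler: just records that the event check failed. */
void toy_evcheck_fail(ToyGuestState *gs) {
    gs->evcheck_fails++;
}

/* One toy translation with two entry points.
   via_slowEP != 0 enters at slowEP (does the event check first);
   via_slowEP == 0 enters at fastEP (skips it).
   Returns 1 if the body ran, 0 if the event check failed. */
int toy_translation(ToyGuestState *gs, int via_slowEP) {
    if (via_slowEP) {
        gs->evc_counter--;                 /* decq 0(%baseblock-pointer)  */
        if (gs->evc_counter < 0) {         /* jns fastEP not taken        */
            assert(gs->evc_failaddr != NULL);
            gs->evc_failaddr(gs);          /* jmpq *8(%baseblock-pointer) */
            return 0;
        }
    }
    /* fastEP: rest of the translation */
    gs->bodies_run++;
    return 1;
}

/* Enter the translation via slowEP 'iterations' times, as a back edge
   would.  On a failed check the counter is refilled with 'quantum'
   (assumed scheduler behaviour).  Returns the total failure count. */
uint64_t toy_run_loop(ToyGuestState *gs, int64_t quantum, int iterations) {
    gs->evc_counter = quantum;
    for (int i = 0; i < iterations; i++) {
        if (!toy_translation(gs, 1)) {
            gs->evc_counter = quantum;
        }
    }
    return gs->evcheck_fails;
}
```

With a quantum of 3, every fourth slowEP entry fails the check, while
entries via fastEP never touch the counter.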