1 page.title=SMP Primer for Android 2 page.tags="ndk","native" 3 4 page.article=true 5 @jd:body 6 7 <div id="tb-wrapper"> 8 <div id="tb"> 9 <h2>In this document</h2> 10 <ol class="nolist"> 11 <li><a href="#theory">Theory</a> 12 <ol class="nolist"> 13 <li style="margin: 3px 0 0"><a href="#mem_consistency">Memory consistency models</a> 14 <ol class="nolist"> 15 <li style="margin:0"><a href="#proc_consistency">Processor consistency</a></li> 16 <li style="margin:0"><a href="#cpu_cache">CPU cache behavior</a></li> 17 <li style="margin:0"><a href="#observability">Observability</a></li> 18 <li style="margin:0"><a href="#ordering">ARMs weak ordering</a></li> 19 </ol> 20 </li> 21 <li style="margin:3px 0 0"><a href="#datamem_barriers">Data memory barriers</a> 22 <ol class="nolist"> 23 <li style="margin:0"><a href="#ss_ll">Store/store and load/load</a></li> 24 <li style="margin:0"><a href="#ls_sl">Load/store and store/load</a></li> 25 <li style="margin:0"><a href="#barrier_inst">Barrier instructions</a></li> 26 <li style="margin:0"><a href="#addr_dep">Address dependencies and causal consistency</a></li> 27 <li style="margin:0"><a href="#membarrier_summry">Memory barrier summary</a></li> 28 </ol> 29 </li> 30 <li style="margin:3px 0 0"><a href="#atomic_ops">Atomic operations</a> 31 <ol class="nolist"> 32 <li style="margin:0"><a href="#atomic_essentials">Atomic essentials</a></li> 33 <li style="margin:0"><a href="#atomic_barrierpairing">Atomic + barrier pairing</a></li> 34 <li style="margin:0"><a href="#acq_rel">Acquire and release</a></li> 35 </ol> 36 </li> 37 </ol> 38 </li> 39 <li><a href="#practice">Practice</a> 40 <ol class="nolist"> 41 <li style="margin:3px 0 0"><a href="#c_dont">What not to do in C</a> 42 <ol class="nolist"> 43 <li style="margin:0"><a href="#volatile">C/C++ and volatile</a></li> 44 <li style="margin:0"><a href="#examplesc">Examples</a></li> 45 </ol> 46 </li> 47 <li style="margin:3px 0 0"><a href="#j_dont">What not to do in Java</a> 48 <ol class="nolist"> 49 <li style="margin:0"><a href="#sync_volatile">synchronized and volatile</a></li> 50 <li style="margin:0"><a href="#examplesj">Examples</a></li> 51 </ol> 52 </li> 53 <li style="margin:3px 0 0"><a href="#bestpractice">What to do</a> 54 <ol class="nolist"> 55 <li style="margin:0"><a href="#advice">General advice</a></li> 56 <li style="margin:0"><a href="#sync_guarantees">Synchronization primitive guarantees</a></li> 57 <li style="margin:0"><a href="#ccpp_changes">Upcoming changes to C/C++</a></li> 58 </ol> 59 </li> 60 </ol> 61 </li> 62 <li><a href="#closing_notes">Closing Notes</a></li> 63 <li><a href="#appendix">Appendix</a> 64 <ol class="nolist"> 65 <li style="margin:0"><a href="#smp_failure_example">SMP failure example</a></li> 66 <li style="margin:0"><a href="#sync_stores">Implementing synchronization stores</a></li> 67 <li style="margin:0"><a href="#more">Further reading</a></li> 68 </ol> 69 </li> 70 </ol> 71 </div> 72 </div> 73 74 <p>Android 3.0 and later platform versions are optimized to support 75 multiprocessor architectures. This document introduces issues that 76 can arise when writing code for symmetric multiprocessor systems in C, C++, and the Java 77 programming language (hereafter referred to simply as Java for the sake of 78 brevity). It's intended as a primer for Android app developers, not as a complete 79 discussion on the subject. The focus is on the ARM CPU architecture.</p> 80 81 <p>If youre in a hurry, you can skip the <a href="#theory">Theory</a> section 82 and go directly to <a href="#practice">Practice</a> for best practices, but this 83 is not recommended.</p> 84 85 86 <h2 id="intro">Introduction</h2> 87 88 <p>SMP is an acronym for Symmetric Multi-Processor. It describes a design in 89 which two or more identical CPU cores share access to main memory. Until 90 a few years ago, all Android devices were UP (Uni-Processor).</p> 91 92 <p>Most — if not all — Android devices do have multiple CPUs, but generally one 93 of them is used to run applications while others manage various bits of device 94 hardware (for example, the radio). The CPUs may have different architectures, and the 95 programs running on them cant use main memory to communicate with each 96 other.</p> 97 98 <p>Most Android devices sold today are built around SMP designs, 99 making things a bit more complicated for software developers. The sorts of race 100 conditions you might encounter in a multi-threaded program are much worse on SMP 101 when two or more of your threads are running simultaneously on different cores. 102 Whats more, SMP on ARM is more challenging to work with than SMP on x86. Code 103 that has been thoroughly tested on x86 may break badly on ARM.</p> 104 105 <p>The rest of this document will explain why, and tell you what you need to do 106 to ensure that your code behaves correctly.</p> 107 108 109 <h2 id="theory">Theory</h2> 110 111 <p>This is a high-speed, glossy overview of a complex subject. Some areas will 112 be incomplete, but none of it should be misleading or wrong.</p> 113 114 <p>See <a href="#more">Further reading</a> at the end of the document for 115 pointers to more thorough treatments of the subject.</p> 116 117 <h3 id="mem_consistency">Memory consistency models</h3> 118 119 <p>Memory consistency models, or often just memory models, describe the 120 guarantees the hardware architecture makes about memory accesses. For example, 121 if you write a value to address A, and then write a value to address B, the 122 model might guarantee that every CPU core sees those writes happen in that 123 order.</p> 124 125 <p>The model most programmers are accustomed to is <em>sequential 126 consistency</em>, which is described like this <span 127 style="font-size:.9em;color:#777">(<a href="#more" style="color:#777">Adve & 128 Gharachorloo</a>)</span>:</p> 129 130 <ul> 131 <li>All memory operations appear to execute one at a time</li> 132 <li>All operations on a single processor appear to execute in the order described 133 by that processor's program.</li> 134 </ul> 135 136 <p>If you look at a bit of code and see that it does some reads and writes from 137 memory, on a sequentially-consistent CPU architecture you know that the code 138 will do those reads and writes in the expected order. Its possible that the 139 CPU is actually reordering instructions and delaying reads and writes, but there 140 is no way for code running on the device to tell that the CPU is doing anything 141 other than execute instructions in a straightforward manner. (Were ignoring 142 memory-mapped device driver I/O for the moment.)</p> 143 144 <p>To illustrate these points its useful to consider small snippets of code, 145 commonly referred to as <em>litmus tests</em>. These are assumed to execute in 146 <em>program order</em>, that is, the order in which the instructions appear here is 147 the order in which the CPU will execute them. We dont want to consider 148 instruction reordering performed by compilers just yet.</p> 149 150 <p>Heres a simple example, with code running on two threads:</p> 151 152 <table> 153 <tr> 154 <th>Thread 1</th> 155 <th>Thread 2</th> 156 </tr> 157 <tr> 158 <td><code>A = 3<br /> 159 B = 5</code></td> 160 <td><code>reg0 = B<br /> 161 reg1 = A</code></td> 162 </tr> 163 </table> 164 165 <p>In this and all future litmus examples, memory locations are represented by 166 capital letters (A, B, C) and CPU registers start with reg. All memory is 167 initially zero. Instructions are executed from top to bottom. Here, thread 1 168 stores the value 3 at location A, and then the value 5 at location B. Thread 2 169 loads the value from location B into reg0, and then loads the value from 170 location A into reg1. (Note that were writing in one order and reading in 171 another.)</p> 172 173 <p>Thread 1 and thread 2 are assumed to execute on different CPU cores. You 174 should <strong>always</strong> make this assumption when thinking about 175 multi-threaded code.</p> 176 177 <p>Sequential consistency guarantees that, after both threads have finished 178 executing, the registers will be in one of the following states:</p> 179 180 181 <table> 182 <tr> 183 <th>Registers</th> 184 <th>States</th> 185 </tr> 186 <tr> 187 <td>reg0=5, reg1=3</td> 188 <td>possible (thread 1 ran first)</td> 189 </tr> 190 <tr> 191 <td>reg0=0, reg1=0</td> 192 <td>possible (thread 2 ran first)</td> 193 </tr> 194 <tr> 195 <td>reg0=0, reg1=3</td> 196 <td>possible (concurrent execution)</td> 197 </tr> 198 <tr> 199 <td>reg0=5, reg1=0</td> 200 <td>never</td> 201 </tr> 202 </table> 203 204 <p>To get into a situation where we see B=5 before we see the store to A, either 205 the reads or the writes would have to happen out of order. On a 206 sequentially-consistent machine, that cant happen.</p> 207 208 <p>Most uni-processors, including x86 and ARM, are sequentially consistent. 209 Most SMP systems, including x86 and ARM, are not.</p> 210 211 <h4 id="proc_consistency">Processor consistency</h4> 212 213 <p>x86 SMP provides <em>processor consistency</em>, which is slightly weaker than 214 sequential. While the architecture guarantees that loads are not reordered with 215 respect to other loads, and stores are not reordered with respect to other 216 stores, it does not guarantee that a store followed by a load will be observed 217 in the expected order.</p> 218 219 <p>Consider the following example, which is a piece of Dekkers Algorithm for 220 mutual exclusion:</p> 221 222 <table> 223 <tr> 224 <th>Thread 1</th> 225 <th>Thread 2</th> 226 </tr> 227 <tr> 228 <td><code>A = true<br /> 229 reg1 = B<br /> 230 if (reg1 == false)<br /> 231 <em>critical-stuff</em></code></td> 232 <td><code>B = true<br /> 233 reg2 = A<br /> 234 if (reg2 == false)<br /> 235 <em>critical-stuff</em></code></td> 236 </tr> 237 </table> 238 239 <p>The idea is that thread 1 uses A to indicate that its busy, and thread 2 240 uses B. Thread 1 sets A and then checks to see if B is set; if not, it can 241 safely assume that it has exclusive access to the critical section. Thread 2 242 does something similar. (If a thread discovers that both A and B are set, a 243 turn-taking algorithm is used to ensure fairness.)</p> 244 245 <p>On a sequentially-consistent machine, this works correctly. On x86 and ARM 246 SMP, the store to A and the load from B in thread 1 can be observed in a 247 different order by thread 2. If that happened, we could actually appear to 248 execute this sequence (where blank lines have been inserted to highlight the 249 apparent order of operations):</p> 250 251 <table> 252 <tr> 253 <th>Thread 1</th> 254 <th>Thread 2</th> 255 </tr> 256 <tr> 257 <td><code>reg1 = B<br /> 258 <br /> 259 <br /> 260 A = true<br /> 261 if (reg1 == false)<br /> 262 <em>critical-stuff</em></code></td> 263 264 <td><code><br /> 265 B = true<br /> 266 reg2 = A<br /> 267 <br /> 268 if (reg2 == false)<br /> 269 <em>critical-stuff</em></code></td> 270 </tr> 271 </table> 272 273 <p>This results in both reg1 and reg2 set to false, allowing the threads to 274 execute code in the critical section simultaneously. To understand how this can 275 happen, its useful to know a little about CPU caches.</p> 276 277 <h4 id="cpu_cache">CPU cache behavior</h4> 278 279 <p>This is a substantial topic in and of itself. An extremely brief overview 280 follows. (The motivation for this material is to provide some basis for 281 understanding why SMP systems behave as they do.)</p> 282 283 <p>Modern CPUs have one or more caches between the processor and main memory. 284 These are labeled L1, L2, and so on, with the higher numbers being successively 285 farther from the CPU. Cache memory adds size and cost to the hardware, and 286 increases power consumption, so the ARM CPUs used in Android devices typically 287 have small L1 caches and little or no L2/L3.</p> 288 289 <p>Loading or storing a value into the L1 cache is very fast. Doing the same to 290 main memory can be 10-100x slower. The CPU will therefore try to operate out of 291 the cache as much as possible. The <em>write policy</em> of a cache determines when data 292 written to it is forwarded to main memory. A <em>write-through</em> cache will initiate 293 a write to memory immediately, while a <em>write-back</em> cache will wait until it runs 294 out of space and has to evict some entries. In either case, the CPU will 295 continue executing instructions past the one that did the store, possibly 296 executing dozens of them before the write is visible in main memory. (While the 297 write-through cache has a policy of immediately forwarding the data to main 298 memory, it only <strong>initiates</strong> the write. It does not have to wait 299 for it to finish.)</p> 300 301 <p>The cache behavior becomes relevant to this discussion when each CPU core has 302 its own private cache. In a simple model, the caches have no way to interact 303 with each other directly. The values held by core #1s cache are not shared 304 with or visible to core #2s cache except as loads or stores from main memory. 305 The long latencies on memory accesses would make inter-thread interactions 306 sluggish, so its useful to define a way for the caches to share data. This 307 sharing is called <em>cache coherency</em>, and the coherency rules are defined 308 by the CPU architectures <em>cache consistency model</em>.</p> 309 310 <p>With that in mind, lets return to the Dekker example. When core 1 executes 311 A = 1, the value gets stored in core 1s cache. When core 2 executes if (A 312 == 0), it might read from main memory or it might read from core 2s cache; 313 either way it wont see the store performed by core 1. (A could be in core 314 2s cache because of a previous load from A.)</p> 315 316 <p>For the memory consistency model to be sequentially consistent, core 1 would 317 have to wait for all other cores to be aware of A = 1 before it could execute 318 if (B == 0) (either through strict cache coherency rules, or by disabling the 319 caches entirely so everything operates out of main memory). This would impose a 320 performance penalty on every store operation. Relaxing the rules for the 321 ordering of stores followed by loads improves performance but imposes a burden 322 on software developers.</p> 323 324 <p>The other guarantees made by the processor consistency model are less 325 expensive to make. For example, to ensure that memory writes are not observed 326 out of order, it just needs to ensure that the stores are published to other 327 cores in the same order that they were issued. It doesnt need to wait for 328 store #1 to <strong>finish</strong> being published before it can start on store 329 #2, it just needs to ensure that it doesnt finish publishing #2 before it 330 finishes publishing #1. This avoids a performance bubble.</p> 331 332 <p>Relaxing the guarantees even further can provide additional opportunities for 333 CPU optimization, but creates more opportunities for code to behave in ways the 334 programmer didnt expect.</p> 335 336 <p>One additional note: CPU caches dont operate on individual bytes. Data is 337 read or written as <em>cache lines</em>; for many ARM CPUs these are 32 bytes. If you 338 read data from a location in main memory, you will also be reading some adjacent 339 values. Writing data will cause the cache line to be read from memory and 340 updated. As a result, you can cause a value to be loaded into cache as a 341 side-effect of reading or writing something nearby, adding to the general aura 342 of mystery.</p> 343 344 <h4 id="observability">Observability</h4> 345 346 <p>Before going further, its useful to define in a more rigorous fashion what 347 is meant by observing a load or store. Suppose core 1 executes A = 1. The 348 store is <em>initiated</em> when the CPU executes the instruction. At some 349 point later, possibly through cache coherence activity, the store is 350 <em>observed</em> by core 2. In a write-through cache it doesnt really 351 <em>complete</em> until the store arrives in main memory, but the memory 352 consistency model doesnt dictate when something completes, just when it can be 353 <em>observed</em>.</p> 354 355 356 <p>(In a kernel device driver that accesses memory-mapped I/O locations, it may 357 be very important to know when things actually complete. Were not going to go 358 into that here.)</p> 359 360 <p>Observability may be defined as follows:</p> 361 362 <ul> 363 <li>"A write to a location in memory is said to be observed by an observer Pn 364 when a subsequent read of the location by Pn would return the value written by 365 the write."</li> 366 <li>"A read of a location in memory is said to be observed by an observer Pm 367 when a subsequent write to the location by Pm would have no effect on the value 368 returned by the read." <span style="font-size:.9em;color:#777">(<em><a 369 href="#more" style="color:#777">Reasoning about the ARM weakly consistent memory 370 model</a></em>)</span></li> 371 </ul> 372 373 374 <p>A less formal way to describe it (where you and I are CPU cores) would be:</p> 375 376 <ul> 377 <li>I have observed your write when I can read what you wrote</li> 378 <li>I have observed your read when I can no longer affect the value you read</li> 379 </ul> 380 381 <p>The notion of observing a write is intuitive; observing a read is a bit less 382 so (dont worry, it grows on you).</p> 383 384 <p>With this in mind, were ready to talk about ARM.</p> 385 386 <h4 id="ordering">ARM's weak ordering</h4> 387 388 <p>ARM SMP provides weak memory consistency guarantees. It does not guarantee that 389 loads or stores are ordered with respect to each other.</p> 390 391 <table> 392 <tr> 393 <th>Thread 1</th> 394 <th>Thread 2</th> 395 </tr> 396 <tr> 397 <td><code>A = 41<br /> 398 B = 1 // A is ready</code></td> 399 <td><code>loop_until (B == 1)<br /> 400 reg = A</code></td> 401 </tr> 402 </table> 403 404 <p>Recall that all addresses are initially zero. The loop_until instruction 405 reads B repeatedly, looping until we read 1 from B. The idea here is that 406 thread 2 is waiting for thread 1 to update A. Thread 1 sets A, and then sets B 407 to 1 to indicate data availability.</p> 408 409 <p>On x86 SMP, this is guaranteed to work. Thread 2 will observe the stores 410 made by thread 1 in program order, and thread 1 will observe thread 2s loads in 411 program order.</p> 412 413 <p>On ARM SMP, the loads and stores can be observed in any order. It is 414 possible, after all the code has executed, for reg to hold 0. Its also 415 possible for it to hold 41. Unless you explicitly define the ordering, you 416 dont know how this will come out.</p> 417 418 <p>(For those with experience on other systems, ARMs memory model is equivalent 419 to PowerPC in most respects.)</p> 420 421 422 <h3 id="datamem_barriers">Data memory barriers</h3> 423 424 <p>Memory barriers provide a way for your code to tell the CPU that memory 425 access ordering matters. ARM/x86 uniprocessors offer sequential consistency, 426 and thus have no need for them. (The barrier instructions can be executed but 427 arent useful; in at least one case theyre hideously expensive, motivating 428 separate builds for SMP targets.)</p> 429 430 <p>There are four basic situations to consider:</p> 431 432 <ol> 433 <li>store followed by another store</li> 434 <li>load followed by another load</li> 435 <li>load followed by store</li> 436 <li>store followed by load</li> 437 </ol> 438 439 <h4 id="ss_ll">Store/store and load/load</h4> 440 441 <p>Recall our earlier example:</p> 442 443 <table> 444 <tr> 445 <th>Thread 1</th> 446 <th>Thread 2</th> 447 </tr> 448 <tr> 449 <td><code>A = 41<br /> 450 B = 1 // A is ready</code></td> 451 <td><code>loop_until (B == 1)<br /> 452 reg = A</code></td> 453 </tr> 454 </table> 455 456 457 <p>Thread 1 needs to ensure that the store to A happens before the store to B. 458 This is a store/store situation. Similarly, thread 2 needs to ensure that the 459 load of B happens before the load of A; this is a load/load situation. As 460 mentioned earlier, the loads and stores can be observed in any order.</p> 461 462 <div style="padding:.5em 2em;"> 463 <div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;"> 464 <p>Going back to the cache discussion, assume A and B are on separate cache 465 lines, with minimal cache coherency. If the store to A stays local but the 466 store to B is published, core 2 will see B=1 but wont see the update to A. On 467 the other side, assume we read A earlier, or it lives on the same cache line as 468 something else we recently read. Core 2 spins until it sees the update to B, 469 then loads A from its local cache, where the value is still zero.</p> 470 </div> 471 </div> 472 473 <p>We can fix it like this:</p> 474 475 <table> 476 <tr> 477 <th>Thread 1</th> 478 <th>Thread 2</th> 479 </tr> 480 <tr> 481 <td><code>A = 41<br /> 482 <em>store/store barrier</em><br /> 483 B = 1 // A is ready</code></td> 484 <td><code>loop_until (B == 1)<br /> 485 <em>load/load barrier</em><br /> 486 reg = A</code></td> 487 </tr> 488 </table> 489 490 <p>The store/store barrier guarantees that <strong>all observers</strong> will 491 observe the write to A before they observe the write to B. It makes no 492 guarantees about the ordering of loads in thread 1, but we dont have any of 493 those, so thats okay. The load/load barrier in thread 2 makes a similar 494 guarantee for the loads there.</p> 495 496 <p>Since the store/store barrier guarantees that thread 2 observes the stores in 497 program order, why do we need the load/load barrier in thread 2? Because we 498 also need to guarantee that thread 1 observes the loads in program order.</p> 499 500 <div style="padding:.5em 2em;"> 501 <div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;"> 502 <p>The store/store barrier could work by flushing all 503 dirty entries out of the local cache, ensuring that other cores see them before 504 they see any future stores. The load/load barrier could purge the local cache 505 completely and wait for any in-flight loads to finish, ensuring that future 506 loads are observed after previous loads. What the CPU actually does doesnt 507 matter, so long as the appropriate guarantees are kept. If we use a barrier in 508 core 1 but not in core 2, core 2 could still be reading A from its local 509 cache.</p> 510 </div> 511 </div> 512 513 <p>Because the architectures have different memory models, these barriers are 514 required on ARM SMP but not x86 SMP.</p> 515 516 <h4 id="ls_sl">Load/store and store/load</h4> 517 518 <p>The Dekkers Algorithm fragment shown earlier illustrated the need for a 519 store/load barrier. Heres an example where a load/store barrier is 520 required:</p> 521 522 <table> 523 <tr> 524 <th>Thread 1</th> 525 <th>Thread 2</th> 526 </tr> 527 <tr> 528 <td><code>reg = A<br /> 529 B = 1 // I have latched A</code></td> 530 <td><code>loop_until (B == 1)<br /> 531 A = 41 // update A</code></td> 532 </tr> 533 </table> 534 535 <p>Thread 2 could observe thread 1s store of B=1 before it observes thread 1s 536 load from A, and as a result store A=41 before thread 1 has a chance to read A. 537 Inserting a load/store barrier in each thread solves the problem:</p> 538 539 <table> 540 <tr> 541 <th>Thread 1</th> 542 <th>Thread 2</th> 543 </tr> 544 <tr> 545 <td><code>reg = A<br /> 546 <em>load/store barrier</em><br /> 547 B = 1 // I have latched A</code></td> 548 <td><code>loop_until (B == 1)<br /> 549 <em>load/store barrier</em><br /> 550 A = 41 // update A</code></td> 551 </tr> 552 </table> 553 554 <div style="padding:.5em 2em;"> 555 <div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;"> 556 <p>A store to local cache may be observed before a load from main memory, 557 because accesses to main memory are so much slower. In this case, assume core 558 1s cache has the cache line for B but not A. The load from A is initiated, and 559 while thats in progress execution continues. The store to B happens in local 560 cache, and by some means becomes available to core 2 while the load from A is 561 still in progress. Thread 2 is able to exit the loop before it has observed 562 thread 1s load from A.</p> 563 564 <p>A thornier question is: do we need a barrier in thread 2? If the CPU doesnt 565 perform speculative writes, and doesnt execute instructions out of order, can 566 thread 2 store to A before thread 1s read if thread 1 guarantees the load/store 567 ordering? (Answer: no.) What if theres a third core watching A and B? 568 (Answer: now you need one, or you could observe B==0 / A==41 on the third core.) 569 Its safest to insert barriers in both places and not worry about the 570 details.</p> 571 </div> 572 </div> 573 574 <p>As mentioned earlier, store/load barriers are the only kind required on x86 575 SMP.</p> 576 577 <h4 id="barrier_inst">Barrier instructions</h4> 578 579 <p>Different CPUs provide different flavors of barrier instruction. For 580 example:</p> 581 582 <ul> 583 <li>Sparc V8 has a membar instruction that takes a 4-element bit vector. The 584 four categories of barrier can be specified individually.</li> 585 <li>Alpha provides rmb (load/load), wmb (store/store), and mb (full). 586 (Trivia: the linux kernel provides three memory barrier functions with these 587 names and behaviors.)</li> 588 <li>x86 has a variety of options; mfence (introduced with SSE2) provides a 589 full barrier.</li> 590 <li>ARMv7 has dmb st (store/store) and dmb sy (full).</li> 591 </ul> 592 593 <p>Full barrier means all four categories are included.</p> 594 595 <p>It is important to recognize that the only thing guaranteed by barrier 596 instructions is ordering. Do not treat them as cache coherency sync points or 597 synchronous flush instructions. The ARM dmb instruction has no direct 598 effect on other cores. This is important to understand when trying to figure 599 out where barrier instructions need to be issued.</p> 600 601 602 <h4 id="addr_dep">Address dependencies and causal consistency</h4> 603 604 <p><em>(This is a slightly more advanced topic and can be skipped.)</em> 605 606 <p>The ARM CPU provides one special case where a load/load barrier can be 607 avoided. Consider the following example from earlier, modified slightly:</p> 608 609 <table> 610 <tr> 611 <th>Thread 1</th> 612 <th>Thread 2</th> 613 </tr> 614 <tr> 615 <td><code>[A+8] = 41<br /> 616 <em>store/store barrier</em><br /> 617 B = 1 // A is ready</code></td> 618 <td><code>loop:<br /> 619 reg0 = B<br /> 620 if (reg0 == 0) goto loop<br /> 621 reg1 = 8<br /> 622 reg2 = [A + reg1]</code></td> 623 </tr> 624 </table> 625 626 <p>This introduces a new notation. If A refers to a memory address, A+n 627 refers to a memory address offset by 8 bytes from A. If A is the base address 628 of an object or array, [A+8] could be a field in the object or an element in the 629 array.</p> 630 631 <p>The loop_until seen in previous examples has been expanded to show the load 632 of B into reg0. reg1 is assigned the numeric value 8, and reg2 is loaded from 633 the address [A+reg1] (the same location that thread 1 is accessing).</p> 634 635 <p>This will not behave correctly because the load from B could be observed 636 after the load from [A+reg1]. We can fix this with a load/load barrier after 637 the loop, but on ARM we can also just do this:</p> 638 639 <table> 640 <tr> 641 <th>Thread 1</th> 642 <th>Thread 2</th> 643 </tr> 644 <tr> 645 <td><code>[A+8] = 41<br /> 646 <em>store/store barrier</em><br /> 647 B = 1 // A is ready</code></td> 648 <td><code>loop:<br /> 649 reg0 = B<br /> 650 if (reg0 == 0) goto loop<br /> 651 reg1 = 8 <strong>+ (reg0 & 0)</strong><br /> 652 reg2 = [A + reg1]</code></td> 653 </tr> 654 </table> 655 656 <p>What weve done here is change the assignment of reg1 from a constant (8) to 657 a value that depends on what we loaded from B. In this case, we do a bitwise 658 AND of the value with 0, which yields zero, which means reg1 still has the value 659 8. However, the ARM CPU believes that the load from [A+reg1] depends upon the 660 load from B, and will ensure that the two are observed in program order.</p> 661 662 <p>This is called an <em>address dependency</em>. Address dependencies exist 663 when the value returned by a load is used to compute the address of a subsequent 664 load or store. It can let you avoid the need for an explicit barrier in certain 665 situations.</p> 666 667 <p>ARM does not provide <em>control dependency</em> guarantees. To illustrate 668 this its necessary to dip into ARM code for a moment: <span 669 style="font-size:.9em;color:#777">(<em><a href="#more" 670 style="color:#777">Barrier Litmus Tests and Cookbook</a></em>)</span>.</p> 671 672 <pre> 673 LDR r1, [r0] 674 CMP r1, #55 675 LDRNE r2, [r3] 676 </pre> 677 678 <p>The loads from r0 and r3 may be observed out of order, even though the load 679 from r3 will not execute at all if [r0] doesnt hold 55. Inserting AND r1, r1, 680 #0 and replacing the last instruction with LDRNE r2, [r3, r1] would ensure 681 proper ordering without an explicit barrier. (This is a prime example of why 682 you cant think about consistency issues in terms of instruction execution. 683 Always think in terms of memory accesses.)</p> 684 685 <p>While were hip-deep, its worth noting that ARM does not provide <em>causal 686 consistency</em>:</p> 687 688 <table> 689 <tr> 690 <th>Thread 1</th> 691 <th>Thread 2</th> 692 <th>Thread 3</th> 693 </tr> 694 <tr> 695 <td><code>A = 1</code></td> 696 <td><code>loop_until (A == 1)<br /> 697 B = 1</code></td> 698 <td><code>loop:<br /> 699 reg0 = B<br /> 700 if (reg0 == 0) goto loop<br /> 701 reg1 = reg0 & 0<br /> 702 reg2 = [A+reg1]</code></td> 703 </tr> 704 </table> 705 706 <p>Here, thread 1 sets A, signaling thread 2. Thread 2 sees that and sets B to 707 signal thread 3. Thread 3 sees it and loads from A, using an address dependency 708 to ensure that the load of B and the load of A are observed in program 709 order.</p> 710 711 <p>Its possible for reg2 to hold zero at the end of this. The fact that a 712 store in thread 1 causes something to happen in thread 2 which causes something 713 to happen in thread 3 does not mean that thread 3 will observe the stores in 714 that order. (Inserting a load/store barrier in thread 2 fixes this.)</p> 715 716 <h4 id="membarrier_summary">Memory barrier summary</h4> 717 718 <p>Barriers come in different flavors for different situations. While there can 719 be performance advantages to using exactly the right barrier type, there are 720 code maintenance risks in doing so — unless the person updating the code 721 fully understands it, they might introduce the wrong type of operation and cause 722 a mysterious breakage. Because of this, and because ARM doesnt provide a wide 723 variety of barrier choices, many atomic primitives use full 724 barrier instructions when a barrier is required.</p> 725 726 <p>The key thing to remember about barriers is that they define ordering. Dont 727 think of them as a flush call that causes a series of actions to happen. 728 Instead, think of them as a dividing line in time for operations on the current 729 CPU core.</p> 730 731 732 <h3 id="atomic_ops">Atomic operations</h3> 733 734 <p>Atomic operations guarantee that an operation that requires a series of steps 735 always behaves as if it were a single operation. For example, consider a 736 non-atomic increment (++A) executed on the same variable by two threads 737 simultaneously:</p> 738 739 <table> 740 <tr> 741 <th>Thread 1</th> 742 <th>Thread 2</th> 743 </tr> 744 <tr> 745 <td><code>reg = A<br /> 746 reg = reg + 1<br /> 747 A = reg</code></td> 748 <td><code>reg = A<br /> 749 reg = reg + 1<br /> 750 A = reg</code></td> 751 </tr> 752 </table> 753 754 <p>If the threads execute concurrently from top to bottom, both threads will 755 load 0 from A, increment it to 1, and store it back, leaving a final result of 756 1. If we used an atomic increment operation, you would be guaranteed that the 757 final result will be 2.</p> 758 759 <h4 id="atomic_essentials">Atomic essentials</h4> 760 761 <p>The most fundamental operations — loading and storing 32-bit values 762 — are inherently atomic on ARM so long as the data is aligned on a 32-bit 763 boundary. For example:</p> 764 765 <table> 766 <tr> 767 <th>Thread 1</th> 768 <th>Thread 2</th> 769 </tr> 770 <tr> 771 <td><code>reg = 0x00000000<br /> 772 A = reg</code></td> 773 <td><code>reg = 0xffffffff<br /> 774 A = reg</code></td> 775 </tr> 776 </table> 777 778 <p>The CPU guarantees that A will hold 0x00000000 or 0xffffffff. It will never 779 hold 0x0000ffff or any other partial mix of bytes.</p> 780 781 <div style="padding:.5em 2em;"> 782 <div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;"> 783 <p>The atomicity guarantee is lost if the data isnt aligned. Misaligned data 784 could straddle a cache line, so other cores could see the halves update 785 independently. Consequently, the ARMv7 documentation declares that it provides 786 single-copy atomicity for all byte accesses, halfword accesses to 787 halfword-aligned locations, and word accesses to word-aligned locations. 788 Doubleword (64-bit) accesses are <strong>not</strong> atomic, unless the 789 location is doubleword-aligned and special load/store instructions are used. 790 This behavior is important to understand when multiple threads are performing 791 unsynchronized updates to packed structures or arrays of primitive types.</p> 792 </div> 793 </div> 794 795 <p>There is no need for 32-bit atomic read or atomic write functions on ARM 796 or x86. Where one is provided for completeness, it just does a trivial load or 797 store.</p> 798 799 <p>Operations that perform more complex actions on data in memory are 800 collectively known as <em>read-modify-write</em> (RMW) instructions, because 801 they load data, modify it in some way, and write it back. CPUs vary widely in 802 how these are implemented. ARM uses a technique called Load Linked / Store 803 Conditional, or LL/SC.</p> 804 805 <div style="padding:.5em 2em;"> 806 <div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;"> 807 <p>A <em>linked</em> or <em>locked</em> load reads the data from memory as 808 usual, but also establishes a reservation, tagging the physical memory address. 809 The reservation is cleared when another core tries to write to that address. To 810 perform an LL/SC, the data is read with a reservation, modified, and then a 811 conditional store instruction is used to try to write the data back. If the 812 reservation is still in place, the store succeeds; if not, the store will fail. 813 Atomic functions based on LL/SC usually loop, retrying the entire 814 read-modify-write sequence until it completes without interruption.</p> 815 </div> 816 </div> 817 818 <p>Its worth noting that the read-modify-write operations would not work 819 correctly if they operated on stale data. If two cores perform an atomic 820 increment on the same address, and one of them is not able to see what the other 821 did because each core is reading and writing from local cache, the operation 822 wont actually be atomic. The CPUs cache coherency rules ensure that the 823 atomic RMW operations remain atomic in an SMP environment.</p> 824 825 <p>This should not be construed to mean that atomic RMW operations use a memory 826 barrier. On ARM, atomics have no memory barrier semantics. While a series of 827 atomic RMW operations on a single address will be observed in program order by 828 other cores, there are no guarantees when it comes to the ordering of atomic and 829 non-atomic operations.</p> 830 831 <p>It often makes sense to pair barriers and atomic operations together. The 832 next section describes this in more detail.</p> 833 834 <h4 id="atomic_barrierpairing">Atomic + barrier pairing</h4> 835 836 <p>As usual, its useful to illuminate the discussion with an example. Were 837 going to consider a basic mutual-exclusion primitive called a <em>spin 838 lock</em>. The idea is that a memory address (which well call lock) 839 initially holds zero. When a thread wants to execute code in the critical 840 section, it sets the lock to 1, executes the critical code, and then changes it 841 back to zero when done. If another thread has already set the lock to 1, we sit 842 and spin until the lock changes back to zero.</p> 843 844 <p>To make this work we use an atomic RMW primitive called 845 <em>compare-and-swap</em>. The function takes three arguments: the memory 846 address, the expected current value, and the new value. If the value currently 847 in memory matches what we expect, it is replaced with the new value, and the old 848 value is returned. If the current value is not what we expect, we dont change 849 anything. A minor variation on this is called <em>compare-and-set</em>; instead 850 of returning the old value it returns a boolean indicating whether the swap 851 succeeded. For our needs either will work, but compare-and-set is slightly 852 simpler for examples, so we use it and just refer to it as CAS.</p> 853 854 <p>The acquisition of the spin lock is written like this (using a C-like 855 language):</p> 856 857 <pre>do { 858 success = atomic_cas(&lock, 0, 1) 859 } while (!success) 860 861 full_memory_barrier() 862 863 <em>critical-section</em></pre> 864 865 <p>If no thread holds the lock, the lock value will be 0, and the CAS operation 866 will set it to 1 to indicate that we now have it. If another thread has it, the 867 lock value will be 1, and the CAS operation will fail because the expected 868 current value does not match the actual current value. We loop and retry. 869 (Note this loop is on top of whatever loop the LL/SC code might be doing inside 870 the atomic_cas function.)</p> 871 872 <div style="padding:.5em 2em;"> 873 <div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;"> 874 <p>On SMP, a spin lock is a useful way to guard a small critical section. If we 875 know that another thread is going to execute a handful of instructions and then 876 release the lock, we can just burn a few cycles while we wait our turn. 877 However, if the other thread happens to be executing on the same core, were 878 just wasting time because the other thread cant make progress until the OS 879 schedules it again (either by migrating it to a different core or by preempting 880 us). A proper spin lock implementation would optimistically spin a few times 881 and then fall back on an OS primitive (such as a Linux futex) that allows the 882 current thread to sleep while waiting for the other thread to finish up. On a 883 uniprocessor you never want to spin at all. For the sake of brevity were 884 ignoring all this.</p> 885 </div> 886 </div> 887 888 <p>The memory barrier is necessary to ensure that other threads observe the 889 acquisition of the lock before they observe any loads or stores in the critical 890 section. Without that barrier, the memory accesses could be observed while the 891 lock is not held.</p> 892 893 <p>The <code>full_memory_barrier</code> call here actually does 894 <strong>two</strong> independent operations. First, it issues the CPUs full 895 barrier instruction. Second, it tells the compiler that it is not allowed to 896 reorder code around the barrier. That way, we know that the 897 <code>atomic_cas</code> call will be executed before anything in the critical 898 section. Without this <em>compiler reorder barrier</em>, the compiler has a 899 great deal of freedom in how it generates code, and the order of instructions in 900 the compiled code might be much different from the order in the source code.</p> 901 902 <p>Of course, we also want to make sure that none of the memory accesses 903 performed in the critical section are observed after the lock is released. The 904 full version of the simple spin lock is:</p> 905 906 <pre>do { 907 success = atomic_cas(&lock, 0, 1) // acquire 908 } while (!success) 909 full_memory_barrier() 910 911 <em>critical-section</em> 912 913 full_memory_barrier() 914 atomic_store(&lock, 0) // release</pre> 915 916 <p>We perform our second CPU/compiler memory barrier immediately 917 <strong>before</strong> we release the lock, so that loads and stores in the 918 critical section are observed before the release of the lock.</p> 919 920 <p>As mentioned earlier, the <code>atomic_store</code> operation is a simple 921 assignment on ARM and x86. Unlike the atomic RMW operations, we dont guarantee 922 that other threads will see this value immediately. This isnt a problem, 923 though, because we only need to keep the other threads <strong>out</strong>. The 924 other threads will stay out until they observe the store of 0. If it takes a 925 little while for them to observe it, the other threads will spin a little 926 longer, but we will still execute code correctly.</p> 927 928 <p>Its convenient to combine the atomic operation and the barrier call into a 929 single function. It also provides other advantages, which will become clear 930 shortly.</p> 931 932 933 <h4 id="acq_rel">Acquire and release</h4> 934 935 <p>When acquiring the spinlock, we issue the atomic CAS and then the barrier. 936 When releasing the spinlock, we issue the barrier and then the atomic store. 937 This inspires a particular naming convention: operations followed by a barrier 938 are acquiring operations, while operations preceded by a barrier are 939 releasing operations. (It would be wise to install the spin lock example 940 firmly in mind, as the names are not otherwise intuitive.)</p> 941 942 <p>Rewriting the spin lock example with this in mind:</p> 943 944 <pre>do { 945 success = atomic_<strong>acquire</strong>_cas(&lock, 0, 1) 946 } while (!success) 947 948 <em>critical-section</em> 949 950 atomic_<strong>release</strong>_store(&lock, 0)</pre> 951 952 <p>This is a little more succinct and easier to read, but the real motivation 953 for doing this lies in a couple of optimizations we can now perform.</p> 954 955 <p>First, consider <code>atomic_release_store</code>. We need to ensure that 956 the store of zero to the lock word is observed after any loads or stores in the 957 critical section above it. In other words, we need a load/store and store/store 958 barrier. In an earlier section we learned that these arent necessary on x86 959 SMP -- only store/load barriers are required. The implementation of 960 <code>atomic_release_store</code> on x86 is therefore just a compiler reorder 961 barrier followed by a simple store. No CPU barrier is required.</p> 962 963 <p>The second optimization mostly applies to the compiler (although some CPUs, 964 such as the Itanium, can take advantage of it as well). The basic principle is 965 that code can move across acquire and release barriers, but only in one 966 direction.</p> 967 968 <p>Suppose we have a mix of locally-visible and globally-visible memory 969 accesses, with some miscellaneous computation as well:</p> 970 971 <pre>local1 = arg1 / 41 972 local2 = threadStruct->field2 973 threadStruct->field3 = local2 974 975 do { 976 success = atomic_acquire_cas(&lock, 0, 1) 977 } while (!success) 978 979 local5 = globalStruct->field5 980 globalStruct->field6 = local5 981 982 atomic_release_store(&lock, 0)</pre> 983 984 <p>Here we see two completely independent sets of operations. The first set 985 operates on a thread-local data structure, so were not concerned about clashes 986 with other threads. The second set operates on a global data structure, which 987 must be protected with a lock.</p> 988 989 <p>A full compiler reorder barrier in the atomic ops will ensure that the 990 program order matches the source code order at the lock boundaries. However, 991 allowing the compiler to interleave instructions can improve performance. Loads 992 from memory can be slow, but the CPU can continue to execute instructions that 993 dont require the result of that load while waiting for it to complete. The 994 code might execute more quickly if it were written like this instead:</p> 995 996 <pre>do { 997 success = atomic_acquire_cas(&lock, 0, 1) 998 } while (!success) 999 1000 local2 = threadStruct->field2 1001 local5 = globalStruct->field5 1002 local1 = arg1 / 41 1003 threadStruct->field3 = local2 1004 globalStruct->field6 = local5 1005 1006 atomic_release_store(&lock, 0)</pre> 1007 1008 <p>We issue both loads, do some unrelated computation, and then execute the 1009 instructions that make use of the loads. If the integer division takes less 1010 time than one of the loads, we essentially get it for free, since it happens 1011 during a period where the CPU would have stalled waiting for a load to 1012 complete.</p> 1013 1014 <p>Note that <strong>all</strong> of the operations are now happening inside the 1015 critical section. Since none of the threadStruct operations are visible 1016 outside the current thread, nothing else can see them until were finished here, 1017 so it doesnt matter exactly when they happen.</p> 1018 1019 <p>In general, it is always safe to move operations <strong>into</strong> a 1020 critical section, but never safe to move operations <strong>out of</strong> a 1021 critical section. Put another way, you can migrate code downward across an 1022 acquire barrier, and upward across a release barrier. If the atomic ops used 1023 a full barrier, this sort of migration would not be possible.</p> 1024 1025 <p>Returning to an earlier point, we can state that on x86 all loads are 1026 acquiring loads, and all stores are releasing stores. As a result:</p> 1027 1028 <ul> 1029 <li>Loads may not be reordered with respect to each other. You cant take a 1030 load and move it upward across another loads acquire barrier.</li> 1031 <li>Stores may not be reordered with respect to each other, because you cant 1032 move a store downward across another stores release barrier.</li> 1033 <li>A load followed by a store cant be reordered, because neither instruction 1034 will tolerate it.</li> 1035 <li>A store followed by a load <strong>can</strong> be reordered, because each 1036 instruction can move across the other in that direction.</li> 1037 </ul> 1038 1039 <p>Hence, you only need store/load barriers on x86 SMP.</p> 1040 1041 <p>Labeling atomic operations with acquire or release describes not only 1042 whether the barrier is executed before or after the atomic operation, but also 1043 how the compiler is allowed to reorder code.</p> 1044 1045 <h2 id="practice">Practice</h2> 1046 1047 <p>Debugging memory consistency problems can be very difficult. If a missing 1048 memory barrier causes some code to read stale data, you may not be able to 1049 figure out why by examining memory dumps with a debugger. By the time you can 1050 issue a debugger query, the CPU cores will have all observed the full set of 1051 accesses, and the contents of memory and the CPU registers will appear to be in 1052 an impossible state.</p> 1053 1054 <h3 id="c_dont">What not to do in C</h3> 1055 1056 <p>Here we present some examples of incorrect code, along with simple ways to 1057 fix them. Before we do that, we need to discuss the use of a basic language 1058 feature.</p> 1059 1060 <h4 id="volatile">C/C++ and "volatile"</h4> 1061 1062 <p>When writing single-threaded code, declaring a variable volatile can be 1063 very useful. The compiler will not omit or reorder accesses to volatile 1064 locations. Combine that with the sequential consistency provided by the 1065 hardware, and youre guaranteed that the loads and stores will appear to happen 1066 in the expected order.</p> 1067 1068 <p>However, accesses to volatile storage may be reordered with non-volatile 1069 accesses, so you have to be careful in multi-threaded uniprocessor environments 1070 (explicit compiler reorder barriers may be required). There are no atomicity 1071 guarantees, and no memory barrier provisions, so volatile doesnt help you at 1072 all in multi-threaded SMP environments. The C and C++ language standards are 1073 being updated to address this with built-in atomic operations.</p> 1074 1075 <p>If you think you need to declare something volatile, that is a strong 1076 indicator that you should be using one of the atomic operations instead.</p> 1077 1078 <h4 id="examplesc">Examples</h4> 1079 1080 <p>In most cases youd be better off with a synchronization primitive (like a 1081 pthread mutex) rather than an atomic operation, but we will employ the latter to 1082 illustrate how they would be used in a practical situation.</p> 1083 1084 <p>For the sake of brevity were ignoring the effects of compiler optimizations 1085 here — some of this code is broken even on uniprocessors — so for 1086 all of these examples you must assume that the compiler generates 1087 straightforward code (for example, compiled with gcc -O0). The fixes presented here do 1088 solve both compiler-reordering and memory-access-ordering issues, but were only 1089 going to discuss the latter.</p> 1090 1091 <pre>MyThing* gGlobalThing = NULL; 1092 1093 void initGlobalThing() // runs in thread 1 1094 { 1095 MyStruct* thing = malloc(sizeof(*thing)); 1096 memset(thing, 0, sizeof(*thing)); 1097 thing->x = 5; 1098 thing->y = 10; 1099 /* initialization complete, publish */ 1100 gGlobalThing = thing; 1101 } 1102 1103 void useGlobalThing() // runs in thread 2 1104 { 1105 if (gGlobalThing != NULL) { 1106 int i = gGlobalThing->x; // could be 5, 0, or uninitialized data 1107 ... 1108 } 1109 }</pre> 1110 1111 <p>The idea here is that we allocate a structure, initialize its fields, and at 1112 the very end we publish it by storing it in a global variable. At that point, 1113 any other thread can see it, but thats fine since its fully initialized, 1114 right? At least, it would be on x86 SMP or a uniprocessor (again, making the 1115 erroneous assumption that the compiler outputs code exactly as we have it in the 1116 source).</p> 1117 1118 <p>Without a memory barrier, the store to <code>gGlobalThing</code> could be observed before 1119 the fields are initialized on ARM. Another thread reading from <code>thing->x</code> could 1120 see 5, 0, or even uninitialized data.</p> 1121 1122 <p>This can be fixed by changing the last assignment to:</p> 1123 1124 <pre> atomic_release_store(&gGlobalThing, thing);</pre> 1125 1126 <p>That ensures that all other threads will observe the writes in the proper 1127 order, but what about reads? In this case we should be okay on ARM, because the 1128 address dependency rules will ensure that any loads from an offset of 1129 <code>gGlobalThing</code> are observed after the load of 1130 <code>gGlobalThing</code>. However, its unwise to rely on architectural 1131 details, since it means your code will be very subtly unportable. The complete 1132 fix also requires a barrier after the load:</p> 1133 1134 <pre> MyThing* thing = atomic_acquire_load(&gGlobalThing); 1135 int i = thing->x;</pre> 1136 1137 <p>Now we know the ordering will be correct. This may seem like an awkward way 1138 to write code, and it is, but thats the price you pay for accessing data 1139 structures from multiple threads without using locks. Besides, address 1140 dependencies wont always save us:</p> 1141 1142 <pre>MyThing gGlobalThing; 1143 1144 void initGlobalThing() // runs in thread 1 1145 { 1146 gGlobalThing.x = 5; 1147 gGlobalThing.y = 10; 1148 /* initialization complete */ 1149 gGlobalThing.initialized = true; 1150 } 1151 1152 void useGlobalThing() // runs in thread 2 1153 { 1154 if (gGlobalThing.initialized) { 1155 int i = gGlobalThing.x; // could be 5 or 0 1156 } 1157 }</pre> 1158 1159 <p>Because there is no relationship between the <code>initialized</code> field and the 1160 others, the reads and writes can be observed out of order. (Note global data is 1161 initialized to zero by the OS, so it shouldnt be possible to read random 1162 uninitialized data.)</p> 1163 1164 <p>We need to replace the store with:</p> 1165 <pre> atomic_release_store(&gGlobalThing.initialized, true);</pre> 1166 1167 <p>and replace the load with:</p> 1168 <pre> int initialized = atomic_acquire_load(&gGlobalThing.initialized);</pre> 1169 1170 <p>Another example of the same problem occurs when implementing 1171 reference-counted data structures. The reference count itself will be 1172 consistent so long as atomic increment and decrement operations are used, but 1173 you can still run into trouble at the edges, for example:</p> 1174 1175 <pre>void RefCounted::release() 1176 { 1177 int oldCount = atomic_dec(&mRefCount); 1178 if (oldCount == 1) { // was decremented to zero 1179 recycleStorage(); 1180 } 1181 } 1182 1183 void useSharedThing(RefCountedThing sharedThing) 1184 { 1185 int localVar = sharedThing->x; 1186 sharedThing->release(); 1187 sharedThing = NULL; // cant use this pointer any more 1188 doStuff(localVar); // value of localVar might be wrong 1189 }</pre> 1190 1191 <p>The <code>release()</code> call decrements the reference count using a 1192 barrier-free atomic decrement operation. Because this is an atomic RMW 1193 operation, we know that it will work correctly. If the reference count goes to 1194 zero, we recycle the storage.</p> 1195 1196 <p>The <code>useSharedThing()</code> function extracts what it needs from 1197 <code>sharedThing</code> and then releases its copy. However, because we didnt 1198 use a memory barrier, and atomic and non-atomic operations can be reordered, 1199 its possible for other threads to observe the read of 1200 <code>sharedThing->x</code> <strong>after</strong> they observe the recycle 1201 operation. Its therefore possible for <code>localVar</code> to hold a value 1202 from "recycled" memory, for example a new object created in the same 1203 location by another thread after <code>release()</code> is called.</p> 1204 1205 <p>This can be fixed by replacing the call to <code>atomic_dec()</code> with 1206 <code>atomic_release_dec()</code>. The barrier ensures that the reads from 1207 <code>sharedThing</code> are observed before we recycle the object.</p> 1208 1209 <div style="padding:.5em 2em;"> 1210 <div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;"> 1211 <p>In most cases the above wont actually fail, because the recycle function 1212 is likely guarded by functions that themselves employ barriers (libc heap 1213 <code>free()</code>/<code>delete()</code>, or an object pool guarded by a 1214 mutex). If the recycle function used a lock-free algorithm implemented without 1215 barriers, however, the above code could fail on ARM SMP.</p> 1216 </div> 1217 </div> 1218 1219 <h3 id="j_dont">What not to do in Java</h3> 1220 1221 <p>We havent discussed some relevant Java language features, so well take a 1222 quick look at those first.</p> 1223 1224 <h4 id="sync_volatile">Java's "synchronized" and "volatile" keywords</h4> 1225 1226 <p>The synchronized keyword provides the Java languages in-built locking 1227 mechanism. Every object has an associated monitor that can be used to provide 1228 mutually exclusive access.</p> 1229 1230 <p>The implementation of the synchronized block has the same basic structure 1231 as the spin lock example: it begins with an acquiring CAS, and ends with a 1232 releasing store. This means that compilers and code optimizers are free to 1233 migrate code into a synchronized block. One practical consequence: you must 1234 <strong>not</strong> conclude that code inside a synchronized block happens 1235 after the stuff above it or before the stuff below it in a function. Going 1236 further, if a method has two synchronized blocks that lock the same object, and 1237 there are no operations in the intervening code that are observable by another 1238 thread, the compiler may perform lock coarsening and combine them into a 1239 single block.</p> 1240 1241 <p>The other relevant keyword is volatile. As defined in the specification 1242 for Java 1.4 and earlier, a volatile declaration was about as weak as its C 1243 counterpart. The spec for Java 1.5 was updated to provide stronger guarantees, 1244 almost to the level of monitor synchronization.</p> 1245 1246 <p>The effects of volatile accesses can be illustrated with an example. If 1247 thread 1 writes to a volatile field, and thread 2 subsequently reads from that 1248 same field, then thread 2 is guaranteed to see that write and all writes 1249 previously made by thread 1. More generally, the writes made by 1250 <strong>any</strong> thread up to the point where it writes the field will be 1251 visible to thead 2 when it does the read. In effect, writing to a volatile is 1252 like a monitor release, and reading from a volatile is like a monitor 1253 acquire.</p> 1254 1255 <p>Non-volatile accesses may be reorded with respect to volatile accesses in the 1256 usual ways, for example the compiler could move a non-volatile load or store above a 1257 volatile store, but couldnt move it below. Volatile accesses may not be 1258 reordered with respect to each other. The VM takes care of issuing the 1259 appropriate memory barriers.</p> 1260 1261 <p>It should be mentioned that, while loads and stores of object references and 1262 most primitive types are atomic, <code>long</code> and <code>double</code> 1263 fields are not accessed atomically unless they are marked as volatile. 1264 Multi-threaded updates to non-volatile 64-bit fields are problematic even on 1265 uniprocessors.</p> 1266 1267 <h4 id="examplesj">Examples</h4> 1268 1269 <p>Heres a simple, incorrect implementation of a monotonic counter: <span 1270 style="font-size:.9em;color:#777">(<em><a href="#more" style="color:#777">Java 1271 theory and practice: Managing volatility</a></em>)</span>.</p> 1272 1273 <pre>class Counter { 1274 private int mValue; 1275 1276 public int get() { 1277 return mValue; 1278 } 1279 public void incr() { 1280 mValue++; 1281 } 1282 }</pre> 1283 1284 <p>Assume <code>get()</code> and <code>incr()</code> are called from multiple 1285 threads, and we want to be sure that every thread sees the current count when 1286 <code>get()</code> is called. The most glaring problem is that 1287 <code>mValue++</code> is actually three operations:</p> 1288 1289 <ol> 1290 <li><code>reg = mValue</code></li> 1291 <li><code>reg = reg + 1</code></li> 1292 <li><code>mValue = reg</code></li> 1293 </ol> 1294 1295 <p>If two threads execute in <code>incr()</code> simultaneously, one of the 1296 updates could be lost. To make the increment atomic, we need to declare 1297 <code>incr()</code> synchronized. With this change, the code will run 1298 correctly in multi-threaded uniprocessor environments.</p> 1299 1300 <p>Its still broken on SMP, however. Different threads might see different 1301 results from <code>get()</code>, because were reading the value with an ordinary load. We 1302 can correct the problem by declaring <code>get()</code> to be synchronized. 1303 With this change, the code is obviously correct.</p> 1304 1305 <p>Unfortunately, weve introduced the possibility of lock contention, which 1306 could hamper performance. Instead of declaring <code>get()</code> to be 1307 synchronized, we could declare <code>mValue</code> with volatile. (Note 1308 <code>incr()</code> must still use <code>synchronize</code>.) Now we know that 1309 the volatile write to <code>mValue</code> will be visible to any subsequent volatile read of 1310 <code>mValue</code>. <code>incr()</code> will be slightly slower, but 1311 <code>get()</code> will be faster, so even in the absence of contention this is 1312 a win if reads outnumber writes. (See also {@link 1313 java.util.concurrent.atomic.AtomicInteger}.)</p> 1314 1315 <p>Heres another example, similar in form to the earlier C examples:</p> 1316 1317 <pre>class MyGoodies { 1318 public int x, y; 1319 } 1320 class MyClass { 1321 static MyGoodies sGoodies; 1322 1323 void initGoodies() { // runs in thread 1 1324 MyGoodies goods = new MyGoodies(); 1325 goods.x = 5; 1326 goods.y = 10; 1327 sGoodies = goods; 1328 } 1329 1330 void useGoodies() { // runs in thread 2 1331 if (sGoodies != null) { 1332 int i = sGoodies.x; // could be 5 or 0 1333 .... 1334 } 1335 } 1336 }</pre> 1337 1338 <p>This has the same problem as the C code, namely that the assignment 1339 <code>sGoodies = goods</code> might be observed before the initialization of the 1340 fields in <code>goods</code>. If you declare <code>sGoodies</code> with the 1341 volatile keyword, you can think about the loads as if they were 1342 <code>atomic_acquire_load()</code> calls, and the stores as if they were 1343 <code>atomic_release_store()</code> calls.</p> 1344 1345 <p>(Note that only the <code>sGoodies</code> reference itself is volatile. The 1346 accesses to the fields inside it are not. The statement <code>z = 1347 sGoodies.x</code> will perform a volatile load of <code>MyClass.sGoodies</code> 1348 followed by a non-volatile load of <code>sGoodies.x</code>. If you make a local 1349 reference <code>MyGoodies localGoods = sGoodies</code>, <code>z = 1350 localGoods.x</code> will not perform any volatile loads.)</p> 1351 1352 <p>A more common idiom in Java programming is the infamous double-checked 1353 locking:</p> 1354 1355 <pre>class MyClass { 1356 private Helper helper = null; 1357 1358 public Helper getHelper() { 1359 if (helper == null) { 1360 synchronized (this) { 1361 if (helper == null) { 1362 helper = new Helper(); 1363 } 1364 } 1365 } 1366 return helper; 1367 } 1368 }</pre> 1369 1370 <p>The idea is that we want to have a single instance of a <code>Helper</code> 1371 object associated with an instance of <code>MyClass</code>. We must only create 1372 it once, so we create and return it through a dedicated <code>getHelper()</code> 1373 function. To avoid a race in which two threads create the instance, we need to 1374 synchronize the object creation. However, we dont want to pay the overhead for 1375 the synchronized block on every call, so we only do that part if 1376 <code>helper</code> is currently null.</p> 1377 1378 <p>This doesnt work correctly on uniprocessor systems, unless youre using a 1379 traditional Java source compiler and an interpreter-only VM. Once you add fancy 1380 code optimizers and JIT compilers it breaks down. See the Double Checked 1381 Locking is Broken Declaration link in the appendix for more details, or Item 1382 71 (Use lazy initialization judiciously) in Josh Blochs <em>Effective Java, 1383 2nd Edition.</em>.</p> 1384 1385 <p>Running this on an SMP system introduces an additional way to fail. Consider 1386 the same code rewritten slightly, as if it were compiled into a C-like language 1387 (Ive added a couple of integer fields to represent <code>Helpers</code> 1388 constructor activity):</p> 1389 1390 <pre>if (helper == null) { 1391 // acquire monitor using spinlock 1392 while (atomic_acquire_cas(&this.lock, 0, 1) != success) 1393 ; 1394 if (helper == null) { 1395 newHelper = malloc(sizeof(Helper)); 1396 newHelper->x = 5; 1397 newHelper->y = 10; 1398 helper = newHelper; 1399 } 1400 atomic_release_store(&this.lock, 0); 1401 }</pre> 1402 1403 <p>Now the problem should be obvious: the store to <code>helper</code> is 1404 happening before the memory barrier, which means another thread could observe 1405 the non-null value of <code>helper</code> before the stores to the 1406 <code>x</code>/<code>y</code> fields.</p> 1407 1408 <p>You could try to ensure that the store to <code>helper</code> happens after 1409 the <code>atomic_release_store()</code> on <code>this.lock</code> by rearranging 1410 the code, but that wont help, because its okay to migrate code upward — 1411 the compiler could move the assignment back above the 1412 <code>atomic_release_store()</code> to its original position.</p> 1413 1414 <p>There are two ways to fix this:</p> 1415 <ol> 1416 <li>Do the simple thing and delete the outer check. This ensures that we never 1417 examine the value of <code>helper</code> outside a synchronized block.</li> 1418 <li>Declare <code>helper</code> volatile. With this one small change, the code 1419 in Example J-3 will work correctly on Java 1.5 and later. (You may want to take 1420 a minute to convince yourself that this is true.)</li> 1421 </ol> 1422 1423 <p>This next example illustrates two important issues when using volatile:</p> 1424 1425 <pre>class MyClass { 1426 int data1, data2; 1427 volatile int vol1, vol2; 1428 1429 void setValues() { // runs in thread 1 1430 data1 = 1; 1431 vol1 = 2; 1432 data2 = 3; 1433 } 1434 1435 void useValues1() { // runs in thread 2 1436 if (vol1 == 2) { 1437 int l1 = data1; // okay 1438 int l2 = data2; // wrong 1439 } 1440 } 1441 void useValues2() { // runs in thread 2 1442 int dummy = vol2; 1443 int l1 = data1; // wrong 1444 int l2 = data2; // wrong 1445 }</pre> 1446 1447 <p>Looking at <code>useValues1()</code>, if thread 2 hasnt yet observed the 1448 update to <code>vol1</code>, then it cant know if <code>data1</code> or 1449 <code>data2</code> has been set yet. Once it sees the update to 1450 <code>vol1</code>, it knows that the change to <code>data1</code> is also 1451 visible, because that was made before <code>vol1</code> was changed. However, 1452 it cant make any assumptions about <code>data2</code>, because that store was 1453 performed after the volatile store.</p> 1454 1455 <P>The code in <code>useValues2()</code> uses a second volatile field, 1456 <code>vol2</code>, in an attempt to force the VM to generate a memory barrier. 1457 This doesnt generally work. To establish a proper happens-before 1458 relationship, both threads need to be interacting with the same volatile field. 1459 Youd have to know that <code>vol2</code> was set after <code>data1/data2</code> 1460 in thread 1. (The fact that this doesnt work is probably obvious from looking 1461 at the code; the caution here is against trying to cleverly cause a memory 1462 barrier instead of creating an ordered series of accesses.)</p> 1463 1464 <h3 id="bestpractice">What to do</h3> 1465 1466 <h4 id="advice">General advice</h4> 1467 1468 <p>In C/C++, use the <code>pthread</code> operations, like mutexes and 1469 semaphores. These include the proper memory barriers, providing correct and 1470 efficient behavior on all Android platform versions. Be sure to use them 1471 correctly, for example be wary of signaling a condition variable without holding the 1472 corresponding mutex.</p> 1473 1474 <p>It's best to avoid using atomic functions directly. Locking and 1475 unlocking a pthread mutex require a single atomic operation each if theres no 1476 contention, so youre not going to save much by replacing mutex calls with 1477 atomic ops. If you need a lock-free design, you must fully understand the 1478 concepts in this entire document before you begin (or, better yet, find an 1479 existing code library that is known to be correct on SMP ARM).</p> 1480 1481 <p>Be extremely circumspect with "volatile in C/C++. It often indicates a 1482 concurrency problem waiting to happen.</p> 1483 1484 <p>In Java, the best answer is usually to use an appropriate utility class from 1485 the {@link java.util.concurrent} package. The code is well written and well 1486 tested on SMP.</p> 1487 1488 <p>Perhaps the safest thing you can do is make your class immutable. Objects 1489 from classes like String and Integer hold data that cannot be changed once the 1490 class is created, avoiding all synchronization issues. The book <em>Effective 1491 Java, 2nd Ed.</em> has specific instructions in Item 15: Minimize Mutability. Note in 1492 particular the importance of declaring fields final" <span 1493 style="font-size:.9em;color:#777">(<a href="#more" style="color:#777">Bloch</a>)</span>.</p> 1494 1495 <p>If neither of these options is viable, the Java synchronized statement 1496 should be used to guard any field that can be accessed by more than one thread. 1497 If mutexes wont work for your situation, you should declare shared fields 1498 volatile, but you must take great care to understand the interactions between 1499 threads. The volatile declaration wont save you from common concurrent 1500 programming mistakes, but it will help you avoid the mysterious failures 1501 associated with optimizing compilers and SMP mishaps.</p> 1502 1503 <p>The Java Memory Model guarantees that assignments to final fields are visible 1504 to all threads once the constructor has finished — this is what ensures 1505 proper synchronization of fields in immutable classes. This guarantee does not 1506 hold if a partially-constructed object is allowed to become visible to other 1507 threads. It is necessary to follow safe construction practices.<span 1508 style="font-size:.9em;color:#777">(<a href="#more" style="color:#777">Safe 1509 Construction Techniques in Java</a>)</span>.</p> 1510 1511 <h4 id="sync_guarantees">Synchronization primitive guarantees</h4> 1512 1513 <p>The pthread library and VM make a couple of useful guarantees: all accesses 1514 previously performed by a thread that creates a new thread are observable by 1515 that new thread as soon as it starts, and all accesses performed by a thread 1516 that is exiting are observable when a <code>join()</code> on that thread 1517 returns. This means you dont need any additional synchronization when 1518 preparing data for a new thread or examining the results of a joined thread.</p> 1519 1520 <p>Whether or not these guarantees apply to interactions with pooled threads 1521 depends on the thread pool implementation.</p> 1522 1523 <p>In C/C++, the pthread library guarantees that any accesses made by a thread 1524 before it unlocks a mutex will be observable by another thread after it locks 1525 that same mutex. It also guarantees that any accesses made before calling 1526 <code>signal()</code> or <code>broadcast()</code> on a condition variable will 1527 be observable by the woken thread.</p> 1528 1529 <p>Java language threads and monitors make similar guarantees for the comparable 1530 operations.</p> 1531 1532 <h4 id="ccpp_changes">Upcoming changes to C/C++</h4> 1533 1534 <p>The C and C++ language standards are evolving to include a sophisticated 1535 collection of atomic operations. A full matrix of calls for common data types 1536 is defined, with selectable memory barrier semantics (choose from relaxed, 1537 consume, acquire, release, acq_rel, seq_cst).</p> 1538 1539 <p>See the <a href="#more">Further Reading</a> section for pointers to the 1540 specifications.</p> 1541 1542 1543 <h2 id="closing_notes">Closing Notes</h2> 1544 1545 <p>While this document does more than merely scratch the surface, it doesnt 1546 manage more than a shallow gouge. This is a very broad and deep topic. Some 1547 areas for further exploration:</p> 1548 1549 <ul> 1550 <li>Learn the definitions of <em>happens-before</em>, 1551 <em>synchronizes-with</em>, and other essential concepts from the Java Memory 1552 Model. (Its hard to understand what volatile really means without getting 1553 into this.)</li> 1554 <li>Explore what compilers are and arent allowed to do when reordering code. 1555 (The JSR-133 spec has some great examples of legal transformations that lead to 1556 unexpected results.)</li> 1557 <li>Find out how to write immutable classes in Java and C++. (Theres more to 1558 it than just dont change anything after construction.)</li> 1559 <li>Internalize the recommendations in the Concurrency section of <em>Effective 1560 Java, 2nd Edition.</em> (For example, you should avoid calling methods that are 1561 meant to be overridden while inside a synchronized block.)</li> 1562 <li>Understand what sorts of barriers you can use on x86 and ARM. (And other 1563 CPUs for that matter, for example Itaniums acquire/release instruction 1564 modifiers.)</li> 1565 <li>Read through the {@link java.util.concurrent} and {@link 1566 java.util.concurrent.atomic} APIs to see what's available. Consider using 1567 concurrency annotations like <code>@ThreadSafe</code> and 1568 <code>@GuardedBy</code> (from net.jcip.annotations).</li> 1569 </ul> 1570 1571 <p>The <a href="#more">Further Reading</a> section in the appendix has links to 1572 documents and web sites that will better illuminate these topics.</p> 1573 1574 <h2 id="appendix">Appendix</h2> 1575 1576 <h3 id="smp_failure_example">SMP failure example</h3> 1577 1578 <p>This document describes a lot of weird things that can, in theory, happen. 1579 If youre not convinced that these issues are real, a practical example may be 1580 useful.</p> 1581 1582 <p>Bill Pughs Java memory model <a 1583 href="http://www.cs.umd.edu/~pugh/java/memoryModel/">web site</a> has a few 1584 test programs on it. One interesting test is ReadAfterWrite.java, which does 1585 the following:</p> 1586 1587 <table> 1588 <tr> 1589 <th>Thread 1</th> 1590 <th>Thread 2</th> 1591 </tr> 1592 <tr> 1593 <td><code>for (int i = 0; i < ITERATIONS; i++) {<br /> 1594 a = i;<br /> 1595 BB[i] = b;<br /> 1596 }</code></td> 1597 <td><code>for (int i = 0; i < ITERATIONS; i++) {<br /> 1598 b = i;<br /> 1599 AA[i] = a;<br /> 1600 }</code></td> 1601 </tr> 1602 </table> 1603 1604 <p>Where <code>a</code> and <code>b</code> are declared as volatile 1605 <code>int</code> fields, and <code>AA</code> and <code>BB</code> are ordinary 1606 integer arrays. 1607 1608 <p>This is trying to determine if the VM ensures that, after a value is written 1609 to a volatile, the next read from that volatile sees the new value. The test 1610 code executes these loops a million or so times, and then runs through afterward 1611 and searches the results for inconsistencies.</p> 1612 1613 <p>At the end of execution,<code>AA</code> and <code>BB</code> will be full of 1614 gradually-increasing integers. The threads will not run side-by-side in a 1615 predictable way, but we can assert a relationship between the array contents. 1616 For example, consider this execution fragment:</p> 1617 1618 <table> 1619 <tr> 1620 <th>Thread 1</th> 1621 <th>Thread 2</th> 1622 </tr> 1623 <tr> 1624 <td><code>(initially a == 1534)<br /> 1625 a = 1535<br /> 1626 BB[1535] = 165<br /> 1627 a = 1536<br /> 1628 BB[1536] = 165<br /> 1629 <br /> 1630 <br /> 1631 <br /> 1632 <br /> 1633 a = 1537<br /> 1634 BB[1537] = 167</code></td> 1635 <td><code>(initially b == 165) 1636 <br /> 1637 <br /> 1638 <br /> 1639 <br /> 1640 <br /> 1641 b = 166<br /> 1642 AA[166] = 1536<br /> 1643 b = 167<br /> 1644 AA[167] = 1536<br /> 1645 <br /></code></td> 1646 </tr> 1647 </table> 1648 1649 <p>(This is written as if the threads were taking turns executing so that its 1650 more obvious when results from one thread should be visible to the other, but in 1651 practice that wont be the case.)</p> 1652 1653 <p>Look at the assignment of <code>AA[166]</code> in thread 2. We are capturing 1654 the fact that, at the point where thread 2 was on iteration 166, it can see that 1655 thread 1 was on iteration 1536. If we look one step in the future, at thread 1656 1s iteration 1537, we expect to see that thread 1 saw that thread 2 was at 1657 iteration 166 (or later). <code>BB[1537]</code> holds 167, so it appears things 1658 are working.</p> 1659 1660 <p>Now suppose we fail to observe a volatile write to <code>b</code>:</p> 1661 1662 <table> 1663 <tr> 1664 <th>Thread 1</th> 1665 <th>Thread 2</th> 1666 </tr> 1667 <tr> 1668 <td><code>(initially a == 1534)<br /> 1669 a = 1535<br /> 1670 BB[1535] = 165<br /> 1671 a = 1536<br /> 1672 BB[1536] = 165<br /> 1673 <br /> 1674 <br /> 1675 <br /> 1676 <br /> 1677 a = 1537<br /> 1678 BB[1537] = 165 // stale b</code></td> 1679 <td><code>(initially b == 165)<br /> 1680 <br /> 1681 <br /> 1682 <br /> 1683 <br /> 1684 b = 166<br /> 1685 AA[166] = 1536<br /> 1686 b = 167<br /> 1687 AA[167] = 1536</code></td> 1688 </tr> 1689 </table> 1690 1691 <p>Now, <code>BB[1537]</code> holds 165, a smaller value than we expected, so we 1692 know we have a problem. Put succinctly, for i=166, BB[AA[i]+1] < i. (This also 1693 catches failures by thread 2 to observe writes to <code>a</code>, for example if we 1694 miss an update and assign <code>AA[166] = 1535</code>, we will get 1695 <code>BB[AA[166]+1] == 165</code>.)</p> 1696 1697 <p>If you run the test program under Dalvik (Android 3.0 Honeycomb or later) 1698 on an SMP ARM device, it will never fail. If you remove the word volatile 1699 from the declarations of <code>a</code> and <code>b</code>, it will consistently 1700 fail. The program is testing to see if the VM is providing sequentially 1701 consistent ordering for accesses to <code>a</code> and <code>b</code>, so you 1702 will only see correct behavior when the variables are volatile. (It will also 1703 succeed if you run the code on a uniprocessor device, or run it while something 1704 else is using enough CPU that the kernel doesnt schedule the test threads on 1705 separate cores.)</p> 1706 1707 <p>If you run the modified test a few times you will note that it doesnt fail 1708 in the same place every time. The test fails consistently because it performs 1709 the operations a million times, and it only needs to see out-of-order accesses 1710 once. In practice, failures will be infrequent and difficult to locate. This 1711 test program could very well succeed on a broken VM if things just happen to 1712 work out.</p> 1713 1714 <h3 id="sync_stores">Implementing synchronization stores</h3> 1715 1716 <p><em>(This isnt something most programmers will find themselves implementing, 1717 but the discussion is illuminating.)</em></p> 1718 1719 <p>Consider once again volatile accesses in Java. Earlier we made reference to 1720 their similarities with acquiring loads and releasing stores, which works as a 1721 starting point but doesnt tell the full story.</p> 1722 1723 <p>We start with a fragment of Dekkers algorithm. Initially both 1724 <code>flag1</code> and <code>flag2</code> are false:</p> 1725 1726 <table> 1727 <tr> 1728 <th>Thread 1</th> 1729 <th>Thread 2</th> 1730 </tr> 1731 <tr> 1732 <td><code>flag1 = true<br /> 1733 if (flag2 == false)<br /> 1734 <em>critical-stuff</em></code></td> 1735 <td><code>flag2 = true<br /> 1736 if (flag1 == false)<br /> 1737 <em>critical-stuff</em></code></td> 1738 </tr> 1739 </table 1740 1741 <p><code>flag1</code> and <code>flag2</code> are declared as volatile boolean 1742 fields. The rules for acquiring loads and releasing stores would allow the 1743 accesses in each thread to be reordered, breaking the algorithm. Fortunately, 1744 the JMM has a few things to say here. Informally:</p> 1745 1746 <ul> 1747 <li>A write to a volatile field <em>happens-before</em> every subsequent read of that 1748 same field. (For this example, it means that if one thread updates a flag, and 1749 later on the other thread reads that flag, the reader is guaranteed to see the 1750 write.)</li> 1751 <li>Every execution has a total order over all volatile field accesses. The 1752 order is consistent with program order.</li> 1753 </ul> 1754 1755 <p>Taken together, these rules say that the volatile accesses in our example 1756 must be observable in program order by all threads. Thus, we will never see 1757 these threads executing the critical-stuff simultaneously.</p> 1758 1759 <div style="padding:.5em 2em;"> 1760 <div style="border-left:4px solid #999;padding:0 1em;font-style:italic;"> 1761 <p>Another way to think about this is in terms of <em>data races</em>. A data race 1762 occurs if two accesses to the same memory location by different threads are not 1763 ordered, at least one of them stores to the memory location, and at least one of 1764 them is not a synchronization action <span style="font-size:.9em;color:#777">(<a 1765 href="#more" style="color:#777">Boehm and McKenney</a>)</span>. The memory model 1766 declares that a program free of data races must behave as if executed by a 1767 sequentially-consistent machine. Because both <code>flag1</code> and 1768 <code>flag2</code> are volatile, and volatile accesses are considered 1769 synchronization actions, there are no data races and this code must execute in a 1770 sequentially consistent manner.</p> 1771 </div> 1772 </div> 1773 1774 <p>As we saw in an earlier section, we need to insert a store/load barrier 1775 between the two operations. The code executed in the VM for a volatile access 1776 will look something like this:</p> 1777 1778 <table> 1779 <tr> 1780 <th>volatile load</th> 1781 <th>volatile store</th> 1782 </tr> 1783 <tr> 1784 <td><code>reg = A<br /> 1785 <em>load/load + load/store barrier</em></code></td> 1786 <td><code><em>store/store barrier</em><br /> 1787 A = reg<br /> 1788 <em>store/load barrier</em></code></td> 1789 </tr> 1790 </table> 1791 1792 <p>The volatile load is just an acquiring load. The volatile store is similar 1793 to a releasing store, but weve omitted load/store from the pre-store barrier, 1794 and added a store/load barrier afterward.</p> 1795 1796 <p>What were really trying to guarantee, though, is that (using thread 1 as an 1797 example) the write to flag1 is observed before the read of flag2. We could 1798 issue the store/load barrier before the volatile load instead and get the same 1799 result, but because loads tend to outnumber stores its best to associate it 1800 with the store.</p> 1801 1802 <p>On some architectures, its possible to implement volatile stores with an 1803 atomic operation and skip the explicit store/load barrier. On x86, for example, 1804 atomics provide a full barrier. The ARM LL/SC operations dont include a 1805 barrier, so for ARM we must use explicit barriers.</p> 1806 1807 <p>(Much of this is due to Doug Lea and his JSR-133 Cookbook for Compiler 1808 Writers page.)</p> 1809 1810 <h3 id="more">Further reading</h3> 1811 1812 <p>Web pages and documents that provide greater depth or breadth. The more generally useful articles are nearer the top of the list.</p> 1813 1814 <dl> 1815 <dt>Shared Memory Consistency Models: A Tutorial</dt> 1816 <dd>Written in 1995 by Adve & Gharachorloo, this is a good place to start if you want to dive more deeply into memory consistency models. 1817 <br /><a href="http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf">http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf</a></dd> 1818 1819 <dt>Memory Barriers</dt> 1820 <dd>Nice little article summarizing the issues. 1821 <br /><a href="http://en.wikipedia.org/wiki/Memory_barrier">http://en.wikipedia.org/wiki/Memory_barrier</a></dd> 1822 1823 <dt>Threads Basics</dt> 1824 <dd>An introduction to multi-threaded programming in C++ and Java, by Hans Boehm. Excellent discussion of data races and basic synchronization methods. 1825 <br /><a href="http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/threadsintro.html">http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/threadsintro.html</a></dd> 1826 1827 <dt>Java Concurrency In Practice</dt> 1828 <dd>Published in 2006, this book covers a wide range of topics in great detail. Highly recommended for anyone writing multi-threaded code in Java. 1829 <br /><a href="http://www.javaconcurrencyinpractice.com">http://www.javaconcurrencyinpractice.com</a></dd> 1830 1831 <dt>JSR-133 (Java Memory Model) FAQ</dt> 1832 <dd>A gentle introduction to the Java memory model, including an explanation of synchronization, volatile variables, and construction of final fields. 1833 <br /><a href="http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html">http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html</a></dd> 1834 1835 <dt>Overview of package java.util.concurrent</dt> 1836 <dd>The documentation for the <code>java.util.concurrent</code> package. Near the bottom of the page is a section entitled Memory Consistency Properties that explains the guarantees made by the various classes. 1837 <br />{@link java.util.concurrent java.util.concurrent} Package Summary</dd> 1838 1839 <dt>Java Theory and Practice: Safe Construction Techniques in Java</dt> 1840 <dd>This article examines in detail the perils of references escaping during object construction, and provides guidelines for thread-safe constructors. 1841 <br /><a href="http://www.ibm.com/developerworks/java/library/j-jtp0618.html">http://www.ibm.com/developerworks/java/library/j-jtp0618.html</a></dd> 1842 1843 <dt>Java Theory and Practice: Managing Volatility</dt> 1844 <dd>A nice article describing what you can and cant accomplish with volatile fields in Java. 1845 <br /><a href="http://www.ibm.com/developerworks/java/library/j-jtp06197.html">http://www.ibm.com/developerworks/java/library/j-jtp06197.html</a></dd> 1846 1847 <dt>The Double-Checked Locking is Broken Declaration</dt> 1848 <dd>Bill Pughs detailed explanation of the various ways in which double-checked locking is broken. Includes C/C++ and Java. 1849 <br /><a href="http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html">http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html</a></dd> 1850 1851 <dt>[ARM] Barrier Litmus Tests and Cookbook</dt> 1852 <dd>A discussion of ARM SMP issues, illuminated with short snippets of ARM code. If you found the examples in this document too un-specific, or want to read the formal description of the DMB instruction, read this. Also describes the instructions used for memory barriers on executable code (possibly useful if youre generating code on the fly). 1853 <br /><a href="http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf">http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf</a></dd> 1854 1855 <dt>Linux Kernel Memory Barriers 1856 <dd>Documentation for Linux kernel memory barriers. Includes some useful examples and ASCII art. 1857 <br/><a href="http://www.kernel.org/doc/Documentation/memory-barriers.txt">http://www.kernel.org/doc/Documentation/memory-barriers.txt</a></dd> 1858 1859 <dt>ISO/IEC JTC1 SC22 WG21 (C++ standards) 14882 (C++ programming language), chapter 29 (Atomic operations library)</dt> 1860 <dd>Draft standard for C++ atomic operation features. 1861 <br /><a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3090.pdf">http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3090.pdf</a> 1862 <br/ >(intro: <a href="http://www.hpl.hp.com/techreports/2008/HPL-2008-56.pdf">http://www.hpl.hp.com/techreports/2008/HPL-2008-56.pdf</a>)</dd> 1863 1864 <dt>ISO/IEC JTC1 SC22 WG14 (C standards) 9899 (C programming language) chapter 7.16 (Atomics <stdatomic.h>)</dt> 1865 <dd>Draft standard for ISO/IEC 9899-201x C atomic operation features. (See also n1484 for errata.) 1866 <br /><a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1425.pdf">http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1425.pdf</a></dd> 1867 1868 <dt>Dekkers algorithm</dt> 1869 <dd>The first known correct solution to the mutual exclusion problem in concurrent programming. The wikipedia article has the full algorithm, with a discussion about how it would need to be updated to work with modern optimizing compilers and SMP hardware. 1870 <br /><a href="http://en.wikipedia.org/wiki/Dekker's_algorithm">http://en.wikipedia.org/wiki/Dekker's_algorithm</a></dd> 1871 1872 <dt>Comments on ARM vs. Alpha and address dependencies</dt> 1873 <dd>An e-mail on the arm-kernel mailing list from Catalin Marinas. Includes a nice summary of address and control dependencies. 1874 <br /><a href="http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-05/msg11811.html">http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-05/msg11811.html</a></dd> 1875 1876 <dt>What Every Programmer Should Know About Memory</dt> 1877 <dd>A very long and detailed article about different types of memory, particularly CPU caches, by Ulrich Drepper. 1878 <br /><a href="http://www.akkadia.org/drepper/cpumemory.pdf">http://www.akkadia.org/drepper/cpumemory.pdf</a></dd> 1879 1880 <dt>Reasoning about the ARM weakly consistent memory model</dt> 1881 <dd>This paper was written by Chong & Ishtiaq of ARM, Ltd. It attempts to describe the ARM SMP memory model in a rigorous but accessible fashion. The definition of observability used here comes from this paper. 1882 <br /><a href="http://portal.acm.org/ft_gateway.cfm?id=1353528&type=pdf&coll=&dl=&CFID=96099715&CFTOKEN=57505711">http://portal.acm.org/ft_gateway.cfm?id=1353528&type=pdf&coll=&dl=&CFID=96099715&CFTOKEN=57505711</a></dd> 1883 1884 <dt>The JSR-133 Cookbook for Compiler Writers</dt> 1885 <dd>Doug Lea wrote this as a companion to the JSR-133 (Java Memory Model) documentation. It goes much deeper into the details than most people will need to worry about, but it provides good fodder for contemplation. 1886 <br /><a href="http://g.oswego.edu/dl/jmm/cookbook.html">http://g.oswego.edu/dl/jmm/cookbook.html</a></dd> 1887 1888 <dt>The Semantics of Power and ARM Multiprocessor Machine Code</dt> 1889 <dd>If you prefer your explanations in rigorous mathematical form, this is a fine place to go next. 1890 <br /><a href="http://www.cl.cam.ac.uk/~pes20/weakmemory/draft-ppc-arm.pdf">http://www.cl.cam.ac.uk/~pes20/weakmemory/draft-ppc-arm.pdf</a></dd> 1891 </dl> 1892