      1 page.title=SMP Primer for Android
      2 page.tags="ndk","native"
      3 
      4 page.article=true
      5 @jd:body
      6 
      7 <div id="tb-wrapper">
      8 <div id="tb">
      9 <h2>In this document</h2>
     10 <ol class="nolist">
     11   <li><a href="#theory">Theory</a>
     12     <ol class="nolist">
     13       <li style="margin: 3px 0 0"><a href="#mem_consistency">Memory consistency models</a>
     14         <ol class="nolist">
     15           <li style="margin:0"><a href="#proc_consistency">Processor consistency</a></li>
     16           <li style="margin:0"><a href="#cpu_cache">CPU cache behavior</a></li>
     17           <li style="margin:0"><a href="#observability">Observability</a></li>
          <li style="margin:0"><a href="#ordering">ARM's weak ordering</a></li>
     19         </ol>
     20       </li>
     21       <li style="margin:3px 0 0"><a href="#datamem_barriers">Data memory barriers</a>
     22         <ol class="nolist">
     23           <li style="margin:0"><a href="#ss_ll">Store/store and load/load</a></li>
     24           <li style="margin:0"><a href="#ls_sl">Load/store and store/load</a></li>
     25           <li style="margin:0"><a href="#barrier_inst">Barrier instructions</a></li>
     26           <li style="margin:0"><a href="#addr_dep">Address dependencies and causal consistency</a></li>
          <li style="margin:0"><a href="#membarrier_summary">Memory barrier summary</a></li>
     28         </ol>
     29       </li>
     30       <li style="margin:3px 0 0"><a href="#atomic_ops">Atomic operations</a>
     31         <ol class="nolist">
     32           <li style="margin:0"><a href="#atomic_essentials">Atomic essentials</a></li>
     33           <li style="margin:0"><a href="#atomic_barrierpairing">Atomic + barrier pairing</a></li>
     34           <li style="margin:0"><a href="#acq_rel">Acquire and release</a></li>
     35         </ol>
     36       </li>
     37     </ol>
     38   </li>
     39   <li><a href="#practice">Practice</a>
     40     <ol class="nolist">
     41       <li style="margin:3px 0 0"><a href="#c_dont">What not to do in C</a>
     42         <ol class="nolist">
     43           <li style="margin:0"><a href="#volatile">C/C++ and volatile</a></li>
     44           <li style="margin:0"><a href="#examplesc">Examples</a></li>
     45         </ol>
     46       </li>
     47       <li style="margin:3px 0 0"><a href="#j_dont">What not to do in Java</a>
     48         <ol class="nolist">
     49           <li style="margin:0"><a href="#sync_volatile">synchronized and volatile</a></li>
     50           <li style="margin:0"><a href="#examplesj">Examples</a></li>
     51         </ol>
     52       </li>
     53       <li style="margin:3px 0 0"><a href="#bestpractice">What to do</a>
     54         <ol class="nolist">
     55           <li style="margin:0"><a href="#advice">General advice</a></li>
     56           <li style="margin:0"><a href="#sync_guarantees">Synchronization primitive guarantees</a></li>
     57           <li style="margin:0"><a href="#ccpp_changes">Upcoming changes to C/C++</a></li>
     58         </ol>
     59       </li>
     60     </ol>
     61   </li>
     62   <li><a href="#closing_notes">Closing Notes</a></li>
     63   <li><a href="#appendix">Appendix</a>
     64     <ol class="nolist">
     65       <li style="margin:0"><a href="#smp_failure_example">SMP failure example</a></li>
     66       <li style="margin:0"><a href="#sync_stores">Implementing synchronization stores</a></li>
     67       <li style="margin:0"><a href="#more">Further reading</a></li>
     68     </ol>
     69   </li>
     70 </ol>
     71 </div>
     72 </div>
     73 
     74 <p>Android 3.0 and later platform versions are optimized to support
     75 multiprocessor architectures. This document introduces issues that
     76 can arise when writing code for symmetric multiprocessor systems in C, C++, and the Java
     77 programming language (hereafter referred to simply as Java for the sake of
     78 brevity). It's intended as a primer for Android app developers, not as a complete 
     79 discussion on the subject. The focus is on the ARM CPU architecture.</p>
     80 
<p>If you're in a hurry, you can skip the <a href="#theory">Theory</a> section
     82 and go directly to <a href="#practice">Practice</a> for best practices, but this
     83 is not recommended.</p>
     84 
     85 
     86 <h2 id="intro">Introduction</h2>
     87 
     88 <p>SMP is an acronym for Symmetric Multi-Processor.  It describes a design in
     89 which two or more identical CPU cores share access to main memory.  Until
     90 a few years ago, all Android devices were UP (Uni-Processor).</p>
     91 
     92 <p>Most &mdash; if not all &mdash; Android devices do have multiple CPUs, but generally one
     93 of them is used to run applications while others manage various bits of device
     94 hardware (for example, the radio).  The CPUs may have different architectures, and the
programs running on them can't use main memory to communicate with each
     96 other.</p>
     97 
     98 <p>Most Android devices sold today are built around SMP designs,
     99 making things a bit more complicated for software developers.  The sorts of race
    100 conditions you might encounter in a multi-threaded program are much worse on SMP
    101 when two or more of your threads are running simultaneously on different cores.
What's more, SMP on ARM is more challenging to work with than SMP on x86.  Code
    103 that has been thoroughly tested on x86 may break badly on ARM.</p>
    104 
    105 <p>The rest of this document will explain why, and tell you what you need to do
    106 to ensure that your code behaves correctly.</p>
    107 
    108 
    109 <h2 id="theory">Theory</h2>
    110 
    111 <p>This is a high-speed, glossy overview of a complex subject.  Some areas will
    112 be incomplete, but none of it should be misleading or wrong.</p>
    113 
    114 <p>See <a href="#more">Further reading</a> at the end of the document for
    115 pointers to more thorough treatments of the subject.</p>
    116 
    117 <h3 id="mem_consistency">Memory consistency models</h3>
    118 
    119 <p>Memory consistency models, or often just memory models, describe the
    120 guarantees the hardware architecture makes about memory accesses.  For example,
    121 if you write a value to address A, and then write a value to address B, the
    122 model might guarantee that every CPU core sees those writes happen in that
    123 order.</p>
    124 
    125 <p>The model most programmers are accustomed to is <em>sequential
    126 consistency</em>, which is described like this <span
    127 style="font-size:.9em;color:#777">(<a href="#more" style="color:#777">Adve &
    128 Gharachorloo</a>)</span>:</p>
    129 
    130 <ul>
    131 <li>All memory operations appear to execute one at a time</li>
    132 <li>All operations on a single processor appear to execute in the order described
    133 by that processor's program.</li>
    134 </ul>
    135 
    136 <p>If you look at a bit of code and see that it does some reads and writes from
    137 memory, on a sequentially-consistent CPU architecture you know that the code
will do those reads and writes in the expected order.  It's possible that the
    139 CPU is actually reordering instructions and delaying reads and writes, but there
    140 is no way for code running on the device to tell that the CPU is doing anything
other than execute instructions in a straightforward manner.  (We're ignoring
    142 memory-mapped device driver I/O for the moment.)</p>
    143 
<p>To illustrate these points it's useful to consider small snippets of code,
    145 commonly referred to as <em>litmus tests</em>.  These are assumed to execute in
    146 <em>program order</em>, that is, the order in which the instructions appear here is
the order in which the CPU will execute them.  We don't want to consider
    148 instruction reordering performed by compilers just yet.</p>
    149 
<p>Here's a simple example, with code running on two threads:</p>
    151 
    152 <table>
    153 <tr>
    154 <th>Thread 1</th>
    155 <th>Thread 2</th>
    156 </tr>
    157 <tr>
    158 <td><code>A = 3<br />
    159 B = 5</code></td>
    160 <td><code>reg0 = B<br />
    161 reg1 = A</code></td>
    162 </tr>
    163 </table>
    164 
    165 <p>In this and all future litmus examples, memory locations are represented by
capital letters (A, B, C) and CPU registers start with "reg".  All memory is
    167 initially zero.  Instructions are executed from top to bottom.  Here, thread 1
    168 stores the value 3 at location A, and then the value 5 at location B.  Thread 2
    169 loads the value from location B into reg0, and then loads the value from
location A into reg1.  (Note that we're writing in one order and reading in
    171 another.)</p>
    172 
    173 <p>Thread 1 and thread 2 are assumed to execute on different CPU cores.  You
    174 should <strong>always</strong> make this assumption when thinking about
    175 multi-threaded code.</p>
    176 
    177 <p>Sequential consistency guarantees that, after both threads have finished
    178 executing, the registers will be in one of the following states:</p>
    179 
    180 
    181 <table>
    182 <tr>
    183 <th>Registers</th>
    184 <th>States</th>
    185 </tr>
    186 <tr>
    187 <td>reg0=5, reg1=3</td>
    188 <td>possible (thread 1 ran first)</td>
    189 </tr>
    190 <tr>
    191 <td>reg0=0, reg1=0</td>
    192 <td>possible (thread 2 ran first)</td>
    193 </tr>
    194 <tr>
    195 <td>reg0=0, reg1=3</td>
    196 <td>possible (concurrent execution)</td>
    197 </tr>
    198 <tr>
    199 <td>reg0=5, reg1=0</td>
    200 <td>never</td>
    201 </tr>
    202 </table>
    203 
    204 <p>To get into a situation where we see B=5 before we see the store to A, either
    205 the reads or the writes would have to happen out of order.  On a
sequentially-consistent machine, that can't happen.</p>
    207 
    208 <p>Most uni-processors, including x86 and ARM, are sequentially consistent.
    209 Most SMP systems, including x86 and ARM, are not.</p>
    210 
    211 <h4 id="proc_consistency">Processor consistency</h4>
    212 
    213 <p>x86 SMP provides <em>processor consistency</em>, which is slightly weaker than
    214 sequential.  While the architecture guarantees that loads are not reordered with
    215 respect to other loads, and stores are not reordered with respect to other
    216 stores, it does not guarantee that a store followed by a load will be observed
    217 in the expected order.</p>
    218 
<p>Consider the following example, which is a piece of Dekker's Algorithm for
    220 mutual exclusion:</p>
    221 
    222 <table>
    223 <tr>
    224 <th>Thread 1</th>
    225 <th>Thread 2</th>
    226 </tr>
    227 <tr>
    228 <td><code>A = true<br />
    229 reg1 = B<br />
    230 if (reg1 == false)<br />
    231 &nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>
    232 <td><code>B = true<br />
    233 reg2 = A<br />
    234 if (reg2 == false)<br />
    235 &nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>
    236 </tr>
    237 </table>
    238 
<p>The idea is that thread 1 uses A to indicate that it's busy, and thread 2
    240 uses B.  Thread 1 sets A and then checks to see if B is set; if not, it can
    241 safely assume that it has exclusive access to the critical section.  Thread 2
    242 does something similar.  (If a thread discovers that both A and B are set, a
    243 turn-taking algorithm is used to ensure fairness.)</p>
    244 
    245 <p>On a sequentially-consistent machine, this works correctly.  On x86 and ARM
    246 SMP, the store to A and the load from B in thread 1 can be observed in a
    247 different order by thread 2.  If that happened, we could actually appear to
    248 execute this sequence (where blank lines have been inserted to highlight the
    249 apparent order of operations):</p>
    250 
    251 <table>
    252 <tr>
    253 <th>Thread 1</th>
    254 <th>Thread 2</th>
    255 </tr>
    256 <tr>
    257 <td><code>reg1 = B<br />
    258 <br />
    259 <br />
    260 A = true<br />
    261 if (reg1 == false)<br />
    262 &nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>
    263 
    264 <td><code><br />
    265 B = true<br />
    266 reg2 = A<br />
    267 <br />
    268 if (reg2 == false)<br />
    269 &nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>
    270 </tr>
    271 </table>
    272 
    273 <p>This results in both reg1 and reg2 set to false, allowing the threads to
    274 execute code in the critical section simultaneously.  To understand how this can
happen, it's useful to know a little about CPU caches.</p>
    276 
    277 <h4 id="cpu_cache">CPU cache behavior</h4>
    278 
    279 <p>This is a substantial topic in and of itself.  An extremely brief overview
    280 follows.  (The motivation for this material is to provide some basis for
    281 understanding why SMP systems behave as they do.)</p>
    282 
    283 <p>Modern CPUs have one or more caches between the processor and main memory.
    284 These are labeled L1, L2, and so on, with the higher numbers being successively
    285 farther from the CPU.  Cache memory adds size and cost to the hardware, and
    286 increases power consumption, so the ARM CPUs used in Android devices typically
    287 have small L1 caches and little or no L2/L3.</p>
    288 
    289 <p>Loading or storing a value into the L1 cache is very fast.  Doing the same to
    290 main memory can be 10-100x slower.  The CPU will therefore try to operate out of
    291 the cache as much as possible.  The <em>write policy</em> of a cache determines when data
    292 written to it is forwarded to main memory.  A <em>write-through</em> cache will initiate
    293 a write to memory immediately, while a <em>write-back</em> cache will wait until it runs
    294 out of space and has to evict some entries.  In either case, the CPU will
    295 continue executing instructions past the one that did the store, possibly
    296 executing dozens of them before the write is visible in main memory.  (While the
    297 write-through cache has a policy of immediately forwarding the data to main
    298 memory, it only <strong>initiates</strong> the write.  It does not have to wait
    299 for it to finish.)</p>
    300 
    301 <p>The cache behavior becomes relevant to this discussion when each CPU core has
    302 its own private cache.  In a simple model, the caches have no way to interact
with each other directly.  The values held by core #1's cache are not shared
with or visible to core #2's cache except as loads or stores from main memory.
    305 The long latencies on memory accesses would make inter-thread interactions
sluggish, so it's useful to define a way for the caches to share data.  This
sharing is called <em>cache coherency</em>, and the coherency rules are defined
by the CPU architecture's <em>cache consistency model</em>.</p>
    309 
<p>With that in mind, let's return to the Dekker example.  When core 1 executes
"A = 1", the value gets stored in core 1's cache.  When core 2 executes "if (A
== 0)", it might read from main memory or it might read from core 2's cache;
either way it won't see the store performed by core 1.  (A could be in core
2's cache because of a previous load from A.)</p>
    315 
<p>For the memory consistency model to be sequentially consistent, core 1 would
have to wait for all other cores to be aware of "A = 1" before it could execute
"if (B == 0)" (either through strict cache coherency rules, or by disabling the
    319 caches entirely so everything operates out of main memory).  This would impose a
    320 performance penalty on every store operation.  Relaxing the rules for the
    321 ordering of stores followed by loads improves performance but imposes a burden
    322 on software developers.</p>
    323 
    324 <p>The other guarantees made by the processor consistency model are less
    325 expensive to make.  For example, to ensure that memory writes are not observed
    326 out of order, it just needs to ensure that the stores are published to other
cores in the same order that they were issued.  It doesn't need to wait for
store #1 to <strong>finish</strong> being published before it can start on store
#2, it just needs to ensure that it doesn't finish publishing #2 before it
    330 finishes publishing #1.  This avoids a performance bubble.</p>
    331 
    332 <p>Relaxing the guarantees even further can provide additional opportunities for
    333 CPU optimization, but creates more opportunities for code to behave in ways the
programmer didn't expect.</p>
    335 
<p>One additional note: CPU caches don't operate on individual bytes.  Data is
    337 read or written as <em>cache lines</em>; for many ARM CPUs these are 32 bytes.  If you
    338 read data from a location in main memory, you will also be reading some adjacent
    339 values.  Writing data will cause the cache line to be read from memory and
    340 updated.  As a result, you can cause a value to be loaded into cache as a
    341 side-effect of reading or writing something nearby, adding to the general aura
    342 of mystery.</p>
    343 
    344 <h4 id="observability">Observability</h4>
    345 
<p>Before going further, it's useful to define in a more rigorous fashion what
is meant by "observing" a load or store.  Suppose core 1 executes "A = 1".  The
store is <em>initiated</em> when the CPU executes the instruction.  At some
point later, possibly through cache coherence activity, the store is
<em>observed</em> by core 2.  In a write-through cache it doesn't really
<em>complete</em> until the store arrives in main memory, but the memory
consistency model doesn't dictate when something completes, just when it can be
    353 <em>observed</em>.</p>
    354 
    355 
    356 <p>(In a kernel device driver that accesses memory-mapped I/O locations, it may
be very important to know when things actually complete.  We're not going to go
    358 into that here.)</p>
    359 
    360 <p>Observability may be defined as follows:</p>
    361 
    362 <ul>
    363 <li>"A write to a location in memory is said to be observed by an observer Pn
    364 when a subsequent read of the location by Pn would return the value written by
    365 the write."</li>
    366 <li>"A read of a location in memory is said to be observed by an observer Pm
    367 when a subsequent write to the location by Pm would have no effect on the value
    368 returned by the read." <span style="font-size:.9em;color:#777">(<em><a
    369 href="#more" style="color:#777">Reasoning about the ARM weakly consistent memory
    370 model</a></em>)</span></li>
    371 </ul>
    372 
    373 
    374 <p>A less formal way to describe it (where you and I are CPU cores) would be:</p>
    375 
    376 <ul>
    377 <li>I have observed your write when I can read what you wrote</li>
    378 <li>I have observed your read when I can no longer affect the value you read</li>
    379 </ul>
    380 
    381 <p>The notion of observing a write is intuitive; observing a read is a bit less
so (don't worry, it grows on you).</p>
    383 
<p>With this in mind, we're ready to talk about ARM.</p>
    385 
    386 <h4 id="ordering">ARM's weak ordering</h4>
    387 
    388 <p>ARM SMP provides weak memory consistency guarantees.  It does not guarantee that
    389 loads or stores are ordered with respect to each other.</p>
    390 
    391 <table>
    392 <tr>
    393 <th>Thread 1</th>
    394 <th>Thread 2</th>
    395 </tr>
    396 <tr>
    397 <td><code>A = 41<br />
    398 B = 1    // A is ready</code></td>
    399 <td><code>loop_until (B == 1)<br />
    400 reg = A</code></td>
    401 </tr>
    402 </table>
    403 
    404 <p>Recall that all addresses are initially zero.  The loop_until instruction
    405 reads B repeatedly, looping until we read 1 from B.  The idea here is that
    406 thread 2 is waiting for thread 1 to update A.  Thread 1 sets A, and then sets B
    407 to 1 to indicate data availability.</p>
    408 
    409 <p>On x86 SMP, this is guaranteed to work.  Thread 2 will observe the stores
made by thread 1 in program order, and thread 1 will observe thread 2's loads in
    411 program order.</p>
    412 
    413 <p>On ARM SMP, the loads and stores can be observed in any order.  It is
possible, after all the code has executed, for reg to hold 0.  It's also
possible for it to hold 41.  Unless you explicitly define the ordering, you
don't know how this will come out.</p>
    417 
<p>(For those with experience on other systems, ARM's memory model is equivalent
    419 to PowerPC in most respects.)</p>
    420 
    421 
    422 <h3 id="datamem_barriers">Data memory barriers</h3>
    423 
    424 <p>Memory barriers provide a way for your code to tell the CPU that memory
    425 access ordering matters.  ARM/x86 uniprocessors offer sequential consistency,
    426 and thus have no need for them.  (The barrier instructions can be executed but
aren't useful; in at least one case they're hideously expensive, motivating
    428 separate builds for SMP targets.)</p>
    429 
    430 <p>There are four basic situations to consider:</p>
    431 
    432 <ol>
    433 <li>store followed by another store</li>
    434 <li>load followed by another load</li>
    435 <li>load followed by store</li>
    436 <li>store followed by load</li>
    437 </ol>
    438 
    439 <h4 id="ss_ll">Store/store and load/load</h4>
    440 
    441 <p>Recall our earlier example:</p>
    442 
    443 <table>
    444 <tr>
    445 <th>Thread 1</th>
    446 <th>Thread 2</th>
    447 </tr>
    448 <tr>
    449 <td><code>A = 41<br />
    450 B = 1    // A is ready</code></td>
    451 <td><code>loop_until (B == 1)<br />
    452 reg = A</code></td>
    453 </tr>
    454 </table>
    455 
    456 
    457 <p>Thread 1 needs to ensure that the store to A happens before the store to B.
    458 This is a store/store situation.  Similarly, thread 2 needs to ensure that the
    459 load of B happens before the load of A; this is a load/load situation.  As
    460 mentioned earlier, the loads and stores can be observed in any order.</p>
    461 
    462 <div style="padding:.5em 2em;">
    463 <div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
    464 <p>Going back to the cache discussion, assume A and B are on separate cache
    465 lines, with minimal cache coherency.  If the store to A stays local but the
store to B is published, core 2 will see B=1 but won't see the update to A.  On
    467 the other side, assume we read A earlier, or it lives on the same cache line as
    468 something else we recently read.  Core 2 spins until it sees the update to B,
    469 then loads A from its local cache, where the value is still zero.</p>
    470 </div>
    471 </div>
    472 
    473 <p>We can fix it like this:</p>
    474 
    475 <table>
    476 <tr>
    477 <th>Thread 1</th>
    478 <th>Thread 2</th>
    479 </tr>
    480 <tr>
    481 <td><code>A = 41<br />
    482 <em>store/store barrier</em><br />
    483 B = 1    // A is ready</code></td>
    484 <td><code>loop_until (B == 1)<br />
    485 <em>load/load barrier</em><br />
    486 reg = A</code></td>
    487 </tr>
    488 </table>
    489 
    490 <p>The store/store barrier guarantees that <strong>all observers</strong> will
    491 observe the write to A before they observe the write to B.  It makes no
guarantees about the ordering of loads in thread 1, but we don't have any of
those, so that's okay.  The load/load barrier in thread 2 makes a similar
    494 guarantee for the loads there.</p>
    495 
    496 <p>Since the store/store barrier guarantees that thread 2 observes the stores in
    497 program order, why do we need the load/load barrier in thread 2?  Because we
    498 also need to guarantee that thread 1 observes the loads in program order.</p>
    499 
    500 <div style="padding:.5em 2em;">
    501 <div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
    502 <p>The store/store barrier could work by flushing all
    503 dirty entries out of the local cache, ensuring that other cores see them before
    504 they see any future stores.  The load/load barrier could purge the local cache
    505 completely and wait for any in-flight loads to finish, ensuring that future
loads are observed after previous loads.  What the CPU actually does doesn't
    507 matter, so long as the appropriate guarantees are kept.  If we use a barrier in
    508 core 1 but not in core 2, core 2 could still be reading A from its local
    509 cache.</p>
    510 </div>
    511 </div>
    512 
    513 <p>Because the architectures have different memory models, these barriers are
    514 required on ARM SMP but not x86 SMP.</p>
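
<p>To make this concrete, here is a minimal C sketch of the same pattern.  It
uses GCC's <code>__sync_synchronize()</code> builtin, which emits a full
barrier (stronger than the store/store and load/load barriers strictly needed
here, but correct).  The variable and function names are purely illustrative,
not part of any Android API:</p>

<pre>int gData = 0;            /* plays the role of "A" above */
int gReady = 0;           /* plays the role of "B" above */

void writer(void)         /* thread 1 */
{
    gData = 41;
    __sync_synchronize();     /* store/store: publish gData before gReady */
    gReady = 1;
}

int reader(void)          /* thread 2 */
{
    while (gReady != 1) {
        /* empty asm acts only as a compiler barrier, forcing a re-read */
        __asm__ __volatile__("" : : : "memory");
    }
    __sync_synchronize();     /* load/load: read gReady before gData */
    return gData;             /* should now be 41 */
}</pre>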
    515 
    516 <h4 id="ls_sl">Load/store and store/load</h4>
    517 
<p>The Dekker's Algorithm fragment shown earlier illustrated the need for a
store/load barrier.  Here's an example where a load/store barrier is
    520 required:</p>
    521 
    522 <table>
    523 <tr>
    524 <th>Thread 1</th>
    525 <th>Thread 2</th>
    526 </tr>
    527 <tr>
    528 <td><code>reg = A<br />
    529 B = 1    // I have latched A</code></td>
    530 <td><code>loop_until (B == 1)<br />
    531 A = 41    // update A</code></td>
    532 </tr>
    533 </table>
    534 
<p>Thread 2 could observe thread 1's store of B=1 before it observes thread 1's
    536 load from A, and as a result store A=41 before thread 1 has a chance to read A.
    537 Inserting a load/store barrier in each thread solves the problem:</p>
    538 
    539 <table>
    540 <tr>
    541 <th>Thread 1</th>
    542 <th>Thread 2</th>
    543 </tr>
    544 <tr>
    545 <td><code>reg = A<br />
    546 <em>load/store barrier</em><br />
    547 B = 1    // I have latched A</code></td>
    548 <td><code>loop_until (B == 1)<br />
    549 <em>load/store barrier</em><br />
    550 A = 41    // update A</code></td>
    551 </tr>
    552 </table>
    553 
    554 <div style="padding:.5em 2em;">
    555 <div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
    556 <p>A store to local cache may be observed before a load from main memory,
    557 because accesses to main memory are so much slower.  In this case, assume core
1's cache has the cache line for B but not A.  The load from A is initiated, and
while that's in progress execution continues.  The store to B happens in local
    560 cache, and by some means becomes available to core 2 while the load from A is
    561 still in progress.  Thread 2 is able to exit the loop before it has observed
thread 1's load from A.</p>
    563 
<p>A thornier question is: do we need a barrier in thread 2?  If the CPU doesn't
perform speculative writes, and doesn't execute instructions out of order, can
thread 2 store to A before thread 1's read if thread 1 guarantees the load/store
ordering?  (Answer: no.)  What if there's a third core watching A and B?
(Answer: now you need one, or you could observe B==0 / A==41 on the third core.)
It's safest to insert barriers in both places and not worry about the
    570 details.</p>
    571 </div>
    572 </div>
    573 
    574 <p>As mentioned earlier, store/load barriers are the only kind required on x86
    575 SMP.</p>
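
<p>For completeness, here is how the Dekker-style fragment from earlier might
look in C with an explicit full barrier between each flag store and flag load.
<code>__sync_synchronize()</code> is GCC's full-barrier builtin; the flag names
and functions are illustrative:</p>

<pre>int flagA = 0;
int flagB = 0;

void thread1(void)
{
    flagA = 1;
    __sync_synchronize();   /* store/load: publish flagA before reading flagB */
    if (flagB == 0) {
        /* critical-stuff */
    }
}

void thread2(void)
{
    flagB = 1;
    __sync_synchronize();   /* store/load: publish flagB before reading flagA */
    if (flagA == 0) {
        /* critical-stuff */
    }
}</pre>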
    576 
    577 <h4 id="barrier_inst">Barrier instructions</h4>
    578 
    579 <p>Different CPUs provide different flavors of barrier instruction.  For
    580 example:</p>
    581 
    582 <ul>
<li>Sparc V8 has a "membar" instruction that takes a 4-element bit vector.  The
four categories of barrier can be specified individually.</li>
<li>Alpha provides "rmb" (load/load), "wmb" (store/store), and "mb" (full).
(Trivia: the Linux kernel provides three memory barrier functions with these
names and behaviors.)</li>
<li>x86 has a variety of options; "mfence" (introduced with SSE2) provides a
full barrier.</li>
<li>ARMv7 has "dmb st" (store/store) and "dmb sy" (full).</li>
    591 </ul>
    592 
<p>"Full barrier" means all four categories are included.</p>
    594 
    595 <p>It is important to recognize that the only thing guaranteed by barrier
instructions is ordering.  Do not treat them as cache coherency "sync points" or
synchronous "flush" instructions.  The ARM "dmb" instruction has no direct
    598 effect on other cores.  This is important to understand when trying to figure
    599 out where barrier instructions need to be issued.</p>
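
<p>In C code you would not normally write these instructions by hand; a builtin
or a small inline-assembly wrapper is typical.  A hedged sketch, assuming an
ARMv7 target and a GCC-compatible compiler (the wrapper names are made up for
illustration):</p>

<pre>/* Full data memory barrier: orders all four load/store combinations. */
static inline void dmb_full(void)
{
    __asm__ __volatile__("dmb sy" : : : "memory");
}

/* Store/store barrier: cheaper when only store ordering is needed. */
static inline void dmb_store(void)
{
    __asm__ __volatile__("dmb st" : : : "memory");
}

/* Portable alternative: GCC's __sync_synchronize() emits a full barrier. */</pre>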
    600 
    601 
    602 <h4 id="addr_dep">Address dependencies and causal consistency</h4>
    603 
<p><em>(This is a slightly more advanced topic and can be skipped.)</em></p>
    605 
    606 <p>The ARM CPU provides one special case where a load/load barrier can be
    607 avoided.  Consider the following example from earlier, modified slightly:</p>
    608 
    609 <table>
    610 <tr>
    611 <th>Thread 1</th>
    612 <th>Thread 2</th>
    613 </tr>
    614 <tr>
    615 <td><code>[A+8] = 41<br />
    616 <em>store/store barrier</em><br />
    617 B = 1    // A is ready</code></td>
    618 <td><code>loop:<br />
    619 &nbsp;&nbsp;&nbsp;&nbsp;reg0 = B<br />
    620 &nbsp;&nbsp;&nbsp;&nbsp;if (reg0 == 0) goto loop<br />
    621 reg1 = 8<br />
    622 reg2 = [A + reg1]</code></td>
    623 </tr>
    624 </table>
    625 
<p>This introduces a new notation.  If "A" refers to a memory address, "A+n"
refers to a memory address offset by n bytes from A.  If A is the base address
    628 of an object or array, [A+8] could be a field in the object or an element in the
    629 array.</p>
    630 
    631 <p>The loop_until seen in previous examples has been expanded to show the load
    632 of B into reg0.  reg1 is assigned the numeric value 8, and reg2 is loaded from
    633 the address [A+reg1] (the same location that thread 1 is accessing).</p>
    634 
    635 <p>This will not behave correctly because the load from B could be observed
    636 after the load from [A+reg1].  We can fix this with a load/load barrier after
    637 the loop, but on ARM we can also just do this:</p>
    638 
    639 <table>
    640 <tr>
    641 <th>Thread 1</th>
    642 <th>Thread 2</th>
    643 </tr>
    644 <tr>
    645 <td><code>[A+8] = 41<br />
    646 <em>store/store barrier</em><br />
    647 B = 1    // A is ready</code></td>
    648 <td><code>loop:<br />
    649 &nbsp;&nbsp;&nbsp;&nbsp;reg0 = B<br />
    650 &nbsp;&nbsp;&nbsp;&nbsp;if (reg0 == 0) goto loop<br />
    651 reg1 = 8 <strong>+ (reg0 & 0)</strong><br />
    652 reg2 = [A + reg1]</code></td>
    653 </tr>
    654 </table>
    655 
<p>What we've done here is change the assignment of reg1 from a constant (8) to
    657 a value that depends on what we loaded from B.  In this case, we do a bitwise
    658 AND of the value with 0, which yields zero, which means reg1 still has the value
    659 8.  However, the ARM CPU believes that the load from [A+reg1] depends upon the
    660 load from B, and will ensure that the two are observed in program order.</p>
    661 
    662 <p>This is called an <em>address dependency</em>.  Address dependencies exist
    663 when the value returned by a load is used to compute the address of a subsequent
    664 load or store.  It can let you avoid the need for an explicit barrier in certain
    665 situations.</p>
    666 
    667 <p>ARM does not provide <em>control dependency</em> guarantees.  To illustrate
this it's necessary to dip into ARM code for a moment: <span
    669 style="font-size:.9em;color:#777">(<em><a href="#more"
    670 style="color:#777">Barrier Litmus Tests and Cookbook</a></em>)</span>.</p>
    671 
    672 <pre>
    673 LDR r1, [r0]
    674 CMP r1, #55
    675 LDRNE r2, [r3]
    676 </pre>
    677 
    678 <p>The loads from r0 and r3 may be observed out of order, even though the load
from r3 will not execute at all if [r0] doesn't hold 55.  Inserting "AND r1, r1,
#0" and replacing the last instruction with "LDRNE r2, [r3, r1]" would ensure
proper ordering without an explicit barrier.  (This is a prime example of why
you can't think about consistency issues in terms of instruction execution.
    683 Always think in terms of memory accesses.)</p>
    684 
<p>While we're hip-deep, it's worth noting that ARM does not provide <em>causal
    686 consistency</em>:</p>
    687 
    688 <table>
    689 <tr>
    690 <th>Thread 1</th>
    691 <th>Thread 2</th>
    692 <th>Thread 3</th>
    693 </tr>
    694 <tr>
    695 <td><code>A = 1</code></td>
    696 <td><code>loop_until (A == 1)<br />
    697 B = 1</code></td>
    698 <td><code>loop:<br />
    699 &nbsp;&nbsp;reg0 = B<br />
    700 &nbsp;&nbsp;if (reg0 == 0) goto loop<br />
    701 reg1 = reg0 & 0<br />
    702 reg2 = [A+reg1]</code></td>
    703 </tr>
    704 </table>
    705 
    706 <p>Here, thread 1 sets A, signaling thread 2. Thread 2 sees that and sets B to
    707 signal thread 3.  Thread 3 sees it and loads from A, using an address dependency
    708 to ensure that the load of B and the load of A are observed in program
    709 order.</p>
    710 
<p>It's possible for reg2 to hold zero at the end of this.  The fact that a
    712 store in thread 1 causes something to happen in thread 2 which causes something
    713 to happen in thread 3 does not mean that thread 3 will observe the stores in
    714 that order.  (Inserting a load/store barrier in thread 2 fixes this.)</p>
    715 
    716 <h4 id="membarrier_summary">Memory barrier summary</h4>
    717 
    718 <p>Barriers come in different flavors for different situations.  While there can
    719 be performance advantages to using exactly the right barrier type, there are
    720 code maintenance risks in doing so &mdash; unless the person updating the code
    721 fully understands it, they might introduce the wrong type of operation and cause
a mysterious breakage.  Because of this, and because ARM doesn't provide a wide
    723 variety of barrier choices, many atomic primitives use full
    724 barrier instructions when a barrier is required.</p>
    725 
<p>The key thing to remember about barriers is that they define ordering.  Don't
think of them as a "flush" call that causes a series of actions to happen.
    728 Instead, think of them as a dividing line in time for operations on the current
    729 CPU core.</p>
    730 
    731 
    732 <h3 id="atomic_ops">Atomic operations</h3>
    733 
    734 <p>Atomic operations guarantee that an operation that requires a series of steps
    735 always behaves as if it were a single operation.  For example, consider a
    736 non-atomic increment (++A) executed on the same variable by two threads
    737 simultaneously:</p>
    738 
    739 <table>
    740 <tr>
    741 <th>Thread 1</th>
    742 <th>Thread 2</th>
    743 </tr>
    744 <tr>
    745 <td><code>reg = A<br />
    746 reg = reg + 1<br />
    747 A = reg</code></td>
    748 <td><code>reg = A<br />
    749 reg = reg + 1<br />
    750 A = reg</code></td>
    751 </tr>
    752 </table>
    753 
    754 <p>If the threads execute concurrently from top to bottom, both threads will
    755 load 0 from A, increment it to 1, and store it back, leaving a final result of
    756 1.  If we used an atomic increment operation, you would be guaranteed that the
    757 final result will be 2.</p>
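
<p>As a concrete illustration, here is a sketch of the racy and the atomic
versions in C, using GCC's <code>__sync_fetch_and_add()</code> builtin (the
exact atomic API available to your code will vary):</p>

<pre>int gCounter = 0;

void increment_racy(void)
{
    gCounter++;                          /* load, add, store: updates can be lost */
}

void increment_atomic(void)
{
    __sync_fetch_and_add(&gCounter, 1);  /* single indivisible read-modify-write */
}</pre>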
    758 
    759 <h4 id="atomic_essentials">Atomic essentials</h4>
    760 
    761 <p>The most fundamental operations &mdash; loading and storing 32-bit values
    762 &mdash; are inherently atomic on ARM so long as the data is aligned on a 32-bit
    763 boundary.  For example:</p>
    764 
    765 <table>
    766 <tr>
    767 <th>Thread 1</th>
    768 <th>Thread 2</th>
    769 </tr>
    770 <tr>
    771 <td><code>reg = 0x00000000<br />
    772 A = reg</code></td>
    773 <td><code>reg = 0xffffffff<br />
    774 A = reg</code></td>
    775 </tr>
    776 </table>
    777 
    778 <p>The CPU guarantees that A will hold 0x00000000 or 0xffffffff.  It will never
    779 hold 0x0000ffff or any other partial mix of bytes.</p>
    780 
    781 <div style="padding:.5em 2em;">
    782 <div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
<p>The atomicity guarantee is lost if the data isn't aligned.  Misaligned data
    784 could straddle a cache line, so other cores could see the halves update
    785 independently.  Consequently, the ARMv7 documentation declares that it provides
    786 single-copy atomicity for all byte accesses, halfword accesses to
    787 halfword-aligned locations, and word accesses to word-aligned locations.
    788 Doubleword (64-bit) accesses are <strong>not</strong> atomic, unless the
    789 location is doubleword-aligned and special load/store instructions are used.
    790 This behavior is important to understand when multiple threads are performing
    791 unsynchronized updates to packed structures or arrays of primitive types.</p>
    792 </div>
    793 </div>
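
<p>A hedged sketch of what these rules mean in practice.  With two threads
storing to the same fields, the aligned 32-bit store is never observed
half-written, but the plain 64-bit store may be, because it can be performed as
two separate 32-bit writes on ARMv7 unless an atomic 64-bit operation is
used:</p>

<pre>#include &lt;stdint.h&gt;

struct Shared {
    uint32_t word;     /* aligned 32-bit: stores are single-copy atomic */
    uint64_t dword;    /* 64-bit: a plain store may be split into two writes */
};

void writerA(struct Shared* s)
{
    s->word  = 0x00000000;
    s->dword = 0x0000000000000000ULL;
}

void writerB(struct Shared* s)
{
    s->word  = 0xffffffff;             /* readers see all-zeros or all-ones */
    s->dword = 0xffffffffffffffffULL;  /* readers may see a half-updated mix */
}</pre>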
    794 
    795 <p>There is no need for 32-bit atomic read or atomic write functions on ARM
    796 or x86.  Where one is provided for completeness, it just does a trivial load or
    797 store.</p>
    798 
    799 <p>Operations that perform more complex actions on data in memory are
    800 collectively known as <em>read-modify-write</em> (RMW) instructions, because
    801 they load data, modify it in some way, and write it back.  CPUs vary widely in
    802 how these are implemented.  ARM uses a technique called Load Linked / Store
    803 Conditional, or LL/SC.</p>
    804 
    805 <div style="padding:.5em 2em;">
    806 <div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
    807 <p>A <em>linked</em> or <em>locked</em> load reads the data from memory as
    808 usual, but also establishes a reservation, tagging the physical memory address.
    809 The reservation is cleared when another core tries to write to that address.  To
    810 perform an LL/SC, the data is read with a reservation, modified, and then a
    811 conditional store instruction is used to try to write the data back.  If the
    812 reservation is still in place, the store succeeds; if not, the store will fail.
    813 Atomic functions based on LL/SC usually loop, retrying the entire
    814 read-modify-write sequence until it completes without interruption.</p>
    815 </div>
    816 </div>
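
<p>You rarely write LL/SC sequences by hand; a higher-level primitive such as
compare-and-swap is built on top of them, and atomic functions loop in much the
same way.  A hedged C sketch of an atomic add built from GCC's
<code>__sync_bool_compare_and_swap()</code> builtin (the function name is
illustrative):</p>

<pre>#include &lt;stdint.h&gt;

int32_t atomic_add(int32_t* addr, int32_t delta)
{
    int32_t old;
    do {
        old = *addr;                  /* read the current value */
        /* retry if another core changed *addr before our store landed */
    } while (!__sync_bool_compare_and_swap(addr, old, old + delta));
    return old + delta;
}</pre>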
    817 
<p>It's worth noting that the read-modify-write operations would not work
    819 correctly if they operated on stale data.  If two cores perform an atomic
    820 increment on the same address, and one of them is not able to see what the other
    821 did because each core is reading and writing from local cache, the operation
won't actually be atomic.  The CPU's cache coherency rules ensure that the
    823 atomic RMW operations remain atomic in an SMP environment.</p>
    824 
    825 <p>This should not be construed to mean that atomic RMW operations use a memory
    826 barrier.  On ARM, atomics have no memory barrier semantics.  While a series of
    827 atomic RMW operations on a single address will be observed in program order by
    828 other cores, there are no guarantees when it comes to the ordering of atomic and
    829 non-atomic operations.</p>
    830 
    831 <p>It often makes sense to pair barriers and atomic operations together. The
    832 next section describes this in more detail.</p>
    833 
    834 <h4 id="atomic_barrierpairing">Atomic + barrier pairing</h4>
    835 
<p>As usual, it's useful to illuminate the discussion with an example.  We're
going to consider a basic mutual-exclusion primitive called a <em>spin
lock</em>.  The idea is that a memory address (which we'll call "lock")
    839 initially holds zero.  When a thread wants to execute code in the critical
    840 section, it sets the lock to 1, executes the critical code, and then changes it
    841 back to zero when done.  If another thread has already set the lock to 1, we sit
    842 and spin until the lock changes back to zero.</p>
    843 
    844 <p>To make this work we use an atomic RMW primitive called
    845 <em>compare-and-swap</em>.  The function takes three arguments: the memory
    846 address, the expected current value, and the new value.  If the value currently
    847 in memory matches what we expect, it is replaced with the new value, and the old
value is returned.  If the current value is not what we expect, we don't change
    849 anything.  A minor variation on this is called <em>compare-and-set</em>; instead
    850 of returning the old value it returns a boolean indicating whether the swap
    851 succeeded.  For our needs either will work, but compare-and-set is slightly
    852 simpler for examples, so we use it and just refer to it as CAS.</p>
    853 
    854 <p>The acquisition of the spin lock is written like this (using a C-like
    855 language):</p>
    856 
    857 <pre>do {
    858     success = atomic_cas(&lock, 0, 1)
    859 } while (!success)
    860 
    861 full_memory_barrier()
    862 
    863 <em>critical-section</em></pre>
    864 
    865 <p>If no thread holds the lock, the lock value will be 0, and the CAS operation
    866 will set it to 1 to indicate that we now have it.  If another thread has it, the
    867 lock value will be 1, and the CAS operation will fail because the expected
    868 current value does not match the actual current value.  We loop and retry.
    869 (Note this loop is on top of whatever loop the LL/SC code might be doing inside
    870 the atomic_cas function.)</p>
    871 
    872 <div style="padding:.5em 2em;">
    873 <div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
    874 <p>On SMP, a spin lock is a useful way to guard a small critical section.  If we
    875 know that another thread is going to execute a handful of instructions and then
    876 release the lock, we can just burn a few cycles while we wait our turn.
However, if the other thread happens to be executing on the same core, we're
just wasting time because the other thread can't make progress until the OS
    879 schedules it again (either by migrating it to a different core or by preempting
    880 us).  A proper spin lock implementation would optimistically spin a few times
    881 and then fall back on an OS primitive (such as a Linux futex) that allows the
    882 current thread to sleep while waiting for the other thread to finish up.  On a
uniprocessor you never want to spin at all.  For the sake of brevity we're
    884 ignoring all this.</p>
    885 </div>
    886 </div>
    887 
    888 <p>The memory barrier is necessary to ensure that other threads observe the
    889 acquisition of the lock before they observe any loads or stores in the critical
    890 section.  Without that barrier, the memory accesses could be observed while the
    891 lock is not held.</p>
    892 
    893 <p>The <code>full_memory_barrier</code> call here actually does
<strong>two</strong> independent operations.  First, it issues the CPU's full
    895 barrier instruction.  Second, it tells the compiler that it is not allowed to
    896 reorder code around the barrier.  That way, we know that the
    897 <code>atomic_cas</code> call will be executed before anything in the critical
    898 section.  Without this <em>compiler reorder barrier</em>, the compiler has a
    899 great deal of freedom in how it generates code, and the order of instructions in
    900 the compiled code might be much different from the order in the source code.</p>
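
<p>On GCC-compatible compilers, the compiler-only half of that barrier is often
written as an empty inline-assembly statement with a <code>"memory"</code>
clobber.  It emits no instructions at all; it only prevents the compiler from
moving memory accesses across it (the macro name is just an example):</p>

<pre>#define COMPILER_REORDER_BARRIER()  __asm__ __volatile__("" : : : "memory")</pre>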
    901 
    902 <p>Of course, we also want to make sure that none of the memory accesses
    903 performed in the critical section are observed after the lock is released.  The
    904 full version of the simple spin lock is:</p>
    905 
    906 <pre>do {
    907     success = atomic_cas(&lock, 0, 1)   // acquire
    908 } while (!success)
    909 full_memory_barrier()
    910 
    911 <em>critical-section</em>
    912 
    913 full_memory_barrier()
    914 atomic_store(&lock, 0)                  // release</pre>
    915 
    916 <p>We perform our second CPU/compiler memory barrier immediately
    917 <strong>before</strong> we release the lock, so that loads and stores in the
    918 critical section are observed before the release of the lock.</p>
    919 
    920 <p>As mentioned earlier, the <code>atomic_store</code> operation is a simple
assignment on ARM and x86.  Unlike the atomic RMW operations, we don't guarantee
that other threads will see this value immediately.  This isn't a problem,
    923 though, because we only need to keep the other threads <strong>out</strong>. The
    924 other threads will stay out until they observe the store of 0.  If it takes a
    925 little while for them to observe it, the other threads will spin a little
    926 longer, but we will still execute code correctly.</p>
    927 
<p>It's convenient to combine the atomic operation and the barrier call into a
    929 single function.  It also provides other advantages, which will become clear
    930 shortly.</p>
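
<p>Pulling the pieces together, here is a hedged C sketch of the spin lock using
GCC builtins.  <code>__sync_bool_compare_and_swap()</code> is itself documented
as a full barrier on GCC, but the explicit <code>__sync_synchronize()</code>
calls keep the structure identical to the pseudocode above; a real
implementation would also fall back to the kernel rather than spin forever:</p>

<pre>int gLock = 0;                  /* the lock word: 0 = free, 1 = held */

void spin_lock(int* lock)
{
    while (!__sync_bool_compare_and_swap(lock, 0, 1)) {
        /* spin until the lock word changes back to zero */
    }
    __sync_synchronize();       /* acquire: barrier after the atomic op */
}

void spin_unlock(int* lock)
{
    __sync_synchronize();       /* release: barrier before the store */
    *lock = 0;
}</pre>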
    931 
    932 
    933 <h4 id="acq_rel">Acquire and release</h4>
    934 
    935 <p>When acquiring the spinlock, we issue the atomic CAS and then the barrier.
    936 When releasing the spinlock, we issue the barrier and then the atomic store.
    937 This inspires a particular naming convention: operations followed by a barrier
are "acquiring" operations, while operations preceded by a barrier are
"releasing" operations.  (It would be wise to install the spin lock example
    940 firmly in mind, as the names are not otherwise intuitive.)</p>
    941 
    942 <p>Rewriting the spin lock example with this in mind:</p>
    943 
    944 <pre>do {
    945     success = atomic_<strong>acquire</strong>_cas(&lock, 0, 1)
    946 } while (!success)
    947 
    948 <em>critical-section</em>
    949 
    950 atomic_<strong>release</strong>_store(&lock, 0)</pre>
    951 
    952 <p>This is a little more succinct and easier to read, but the real motivation
    953 for doing this lies in a couple of optimizations we can now perform.</p>
    954 
    955 <p>First, consider <code>atomic_release_store</code>.  We need to ensure that
    956 the store of zero to the lock word is observed after any loads or stores in the
    957 critical section above it.  In other words, we need a load/store and store/store
barrier.  In an earlier section we learned that these aren't necessary on x86
SMP &mdash; only store/load barriers are required.  The implementation of
    960 <code>atomic_release_store</code> on x86 is therefore just a compiler reorder
    961 barrier followed by a simple store.  No CPU barrier is required.</p>
    962 
    963 <p>The second optimization mostly applies to the compiler (although some CPUs,
    964 such as the Itanium, can take advantage of it as well).  The basic principle is
    965 that code can move across acquire and release barriers, but only in one
    966 direction.</p>
    967 
    968 <p>Suppose we have a mix of locally-visible and globally-visible memory
    969 accesses, with some miscellaneous computation as well:</p>
    970 
    971 <pre>local1 = arg1 / 41
    972 local2 = threadStruct->field2
    973 threadStruct->field3 = local2
    974 
    975 do {
    976     success = atomic_acquire_cas(&lock, 0, 1)
    977 } while (!success)
    978 
    979 local5 = globalStruct->field5
    980 globalStruct->field6 = local5
    981 
    982 atomic_release_store(&lock, 0)</pre>
    983 
    984 <p>Here we see two completely independent sets of operations.  The first set
operates on a thread-local data structure, so we're not concerned about clashes
    986 with other threads.  The second set operates on a global data structure, which
    987 must be protected with a lock.</p>
    988 
    989 <p>A full compiler reorder barrier in the atomic ops will ensure that the
    990 program order matches the source code order at the lock boundaries.  However,
    991 allowing the compiler to interleave instructions can improve performance.  Loads
    992 from memory can be slow, but the CPU can continue to execute instructions that
don't require the result of that load while waiting for it to complete.  The
    994 code might execute more quickly if it were written like this instead:</p>
    995 
    996 <pre>do {
    997     success = atomic_acquire_cas(&lock, 0, 1)
    998 } while (!success)
    999 
   1000 local2 = threadStruct->field2
   1001 local5 = globalStruct->field5
   1002 local1 = arg1 / 41
   1003 threadStruct->field3 = local2
   1004 globalStruct->field6 = local5
   1005 
   1006 atomic_release_store(&lock, 0)</pre>
   1007 
   1008 <p>We issue both loads, do some unrelated computation, and then execute the
   1009 instructions that make use of the loads.  If the integer division takes less
   1010 time than one of the loads, we essentially get it for free, since it happens
   1011 during a period where the CPU would have stalled waiting for a load to
   1012 complete.</p>
   1013 
   1014 <p>Note that <strong>all</strong> of the operations are now happening inside the
   1015 critical section.  Since none of the threadStruct operations are visible
outside the current thread, nothing else can see them until we're finished here,
so it doesn't matter exactly when they happen.</p>
   1018 
   1019 <p>In general, it is always safe to move operations <strong>into</strong> a
   1020 critical section, but never safe to move operations <strong>out of</strong> a
   1021 critical section.  Put another way, you can migrate code downward across an
   1022 acquire barrier, and upward across a release barrier.  If the atomic ops used
   1023 a full barrier, this sort of migration would not be possible.</p>
   1024 
   1025 <p>Returning to an earlier point, we can state that on x86 all loads are
   1026 acquiring loads, and all stores are releasing stores.  As a result:</p>
   1027 
   1028 <ul>
<li>Loads may not be reordered with respect to each other.  You can't take a
load and move it upward across another load's acquire barrier.</li>
<li>Stores may not be reordered with respect to each other, because you can't
move a store downward across another store's release barrier.</li>
<li>A load followed by a store can't be reordered, because neither instruction
   1034 will tolerate it.</li>
   1035 <li>A store followed by a load <strong>can</strong> be reordered, because each
   1036 instruction can move across the other in that direction.</li>
   1037 </ul>
   1038 
   1039 <p>Hence, you only need store/load barriers on x86 SMP.</p>
   1040 
<p>Labeling atomic operations with "acquire" or "release" describes not only
   1042 whether the barrier is executed before or after the atomic operation, but also
   1043 how the compiler is allowed to reorder code.</p>
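
<p>The C11 <code>&lt;stdatomic.h&gt;</code> interface (part of the language
changes discussed later in this document) expresses this labeling directly as
<code>memory_order_acquire</code> and <code>memory_order_release</code>.  A
hedged sketch of the same spin lock in those terms, assuming a C11-capable
toolchain:</p>

<pre>#include &lt;stdatomic.h&gt;

atomic_int gLock;    /* statically zero-initialized: 0 = free, 1 = held */

void spin_lock_c11(atomic_int* lock)
{
    int expected = 0;
    /* acquire: accesses in the critical section may not move above this */
    while (!atomic_compare_exchange_weak_explicit(lock, &expected, 1,
                memory_order_acquire, memory_order_relaxed)) {
        expected = 0;    /* a failed exchange overwrites 'expected' */
    }
}

void spin_unlock_c11(atomic_int* lock)
{
    /* release: accesses in the critical section may not move below this */
    atomic_store_explicit(lock, 0, memory_order_release);
}</pre>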
   1044 
   1045 <h2 id="practice">Practice</h2>
   1046 
   1047 <p>Debugging memory consistency problems can be very difficult.  If a missing
   1048 memory barrier causes some code to read stale data, you may not be able to
   1049 figure out why by examining memory dumps with a debugger.  By the time you can
   1050 issue a debugger query, the CPU cores will have all observed the full set of
   1051 accesses, and the contents of memory and the CPU registers will appear to be in
   1052 an impossible state.</p>
   1053 
   1054 <h3 id="c_dont">What not to do in C</h3>
   1055 
   1056 <p>Here we present some examples of incorrect code, along with simple ways to
   1057 fix them.  Before we do that, we need to discuss the use of a basic language
   1058 feature.</p>
   1059 
<h4 id="volatile">C/C++ and "volatile"</h4>
   1061 
<p>When writing single-threaded code, declaring a variable "volatile" can be
very useful.  The compiler will not omit or reorder accesses to volatile
locations.  Combine that with the sequential consistency provided by the
hardware, and you're guaranteed that the loads and stores will appear to happen
   1066 in the expected order.</p>
   1067 
   1068 <p>However, accesses to volatile storage may be reordered with non-volatile
   1069 accesses, so you have to be careful in multi-threaded uniprocessor environments
   1070 (explicit compiler reorder barriers may be required).  There are no atomicity
guarantees, and no memory barrier provisions, so "volatile" doesn't help you at
   1072 all in multi-threaded SMP environments.  The C and C++ language standards are
   1073 being updated to address this with built-in atomic operations.</p>
   1074 
<p>If you think you need to declare something "volatile", that is a strong
   1076 indicator that you should be using one of the atomic operations instead.</p>
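
<p>For example, a flag hand-off that looks plausible with
<code>volatile</code> provides neither the ordering nor the atomicity you need
on ARM SMP; an acquire/release pair does.  The sketch below uses the GCC
<code>__atomic</code> builtins as one possible spelling of such a pair; the
names and values are illustrative:</p>

<pre>int gPayload;
int gFlag;           /* "volatile int gFlag" would NOT fix the ordering */

void producer(void)
{
    gPayload = 42;
    __atomic_store_n(&gFlag, 1, __ATOMIC_RELEASE);    /* publish */
}

int consumer(void)
{
    while (__atomic_load_n(&gFlag, __ATOMIC_ACQUIRE) == 0) {
        /* spin until the flag is observed */
    }
    return gPayload;                                  /* sees 42 */
}</pre>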
   1077 
   1078 <h4 id="examplesc">Examples</h4>
   1079 
<p>In most cases you'd be better off with a synchronization primitive (like a
   1081 pthread mutex) rather than an atomic operation, but we will employ the latter to
   1082 illustrate how they would be used in a practical situation.</p>
   1083 
<p>For the sake of brevity we're ignoring the effects of compiler optimizations
   1085 here &mdash; some of this code is broken even on uniprocessors &mdash; so for
   1086 all of these examples you must assume that the compiler generates
   1087 straightforward code (for example, compiled with gcc -O0).  The fixes presented here do
solve both compiler-reordering and memory-access-ordering issues, but we're only
   1089 going to discuss the latter.</p>
   1090 
   1091 <pre>MyThing* gGlobalThing = NULL;
   1092 
   1093 void initGlobalThing()    // runs in thread 1
   1094 {
    MyThing* thing = malloc(sizeof(*thing));
   1096     memset(thing, 0, sizeof(*thing));
   1097     thing->x = 5;
   1098     thing->y = 10;
   1099     /* initialization complete, publish */
   1100     gGlobalThing = thing;
   1101 }
   1102 
   1103 void useGlobalThing()    // runs in thread 2
   1104 {
   1105     if (gGlobalThing != NULL) {
   1106         int i = gGlobalThing->x;    // could be 5, 0, or uninitialized data
   1107         ...
   1108     }
   1109 }</pre>
   1110 
   1111 <p>The idea here is that we allocate a structure, initialize its fields, and at
the very end we "publish" it by storing it in a global variable.  At that point,
any other thread can see it, but that's fine since it's fully initialized,
   1114 right?  At least, it would be on x86 SMP or a uniprocessor (again, making the
   1115 erroneous assumption that the compiler outputs code exactly as we have it in the
   1116 source).</p>
   1117 
   1118 <p>Without a memory barrier, the store to <code>gGlobalThing</code> could be observed before
   1119 the fields are initialized on ARM.  Another thread reading from <code>thing->x</code> could
   1120 see 5, 0, or even uninitialized data.</p>
   1121 
   1122 <p>This can be fixed by changing the last assignment to:</p>
   1123 
   1124 <pre>    atomic_release_store(&gGlobalThing, thing);</pre>
   1125 
   1126 <p>That ensures that all other threads will observe the writes in the proper
   1127 order, but what about reads?  In this case we should be okay on ARM, because the
   1128 address dependency rules will ensure that any loads from an offset of
   1129 <code>gGlobalThing</code> are observed after the load of
<code>gGlobalThing</code>.  However, it's unwise to rely on architectural
   1131 details, since it means your code will be very subtly unportable.  The complete
   1132 fix also requires a barrier after the load:</p>
   1133 
   1134 <pre>    MyThing* thing = atomic_acquire_load(&gGlobalThing);
   1135     int i = thing->x;</pre>
   1136 
   1137 <p>Now we know the ordering will be correct.  This may seem like an awkward way
to write code, and it is, but that's the price you pay for accessing data
   1139 structures from multiple threads without using locks.  Besides, address
   1140 dependencies wont always save us:</p>
   1141 
   1142 <pre>MyThing gGlobalThing;
   1143 
   1144 void initGlobalThing()    // runs in thread 1
   1145 {
   1146     gGlobalThing.x = 5;
   1147     gGlobalThing.y = 10;
   1148     /* initialization complete */
   1149     gGlobalThing.initialized = true;
   1150 }
   1151 
   1152 void useGlobalThing()    // runs in thread 2
   1153 {
   1154     if (gGlobalThing.initialized) {
   1155         int i = gGlobalThing.x;    // could be 5 or 0
   1156     }
   1157 }</pre>
   1158 
<p>Because there is no relationship between the <code>initialized</code> field and the
others, the reads and writes can be observed out of order.  (Note that global data is
initialized to zero by the OS, so it shouldn't be possible to read random
uninitialized data.)</p>
   1163 
   1164 <p>We need to replace the store with:</p>
   1165 <pre>    atomic_release_store(&gGlobalThing.initialized, true);</pre>
   1166 
   1167 <p>and replace the load with:</p>
   1168 <pre>    int initialized = atomic_acquire_load(&gGlobalThing.initialized);</pre>
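<p>Pulling those two replacements together, the fixed example looks like this
(a sketch that keeps the same hypothetical <code>atomic_release_store()</code>
and <code>atomic_acquire_load()</code> helpers used above, rather than any real
library API):</p>

<pre>MyThing gGlobalThing;

void initGlobalThing()    // runs in thread 1
{
    gGlobalThing.x = 5;
    gGlobalThing.y = 10;
    /* releasing store: the writes above must be observed first */
    atomic_release_store(&gGlobalThing.initialized, true);
}

void useGlobalThing()    // runs in thread 2
{
    /* acquiring load: the reads below are observed after this load */
    if (atomic_acquire_load(&gGlobalThing.initialized)) {
        int i = gGlobalThing.x;    // guaranteed to see 5
    }
}</pre>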
   1169 
   1170 <p>Another example of the same problem occurs when implementing
   1171 reference-counted data structures.  The reference count itself will be
   1172 consistent so long as atomic increment and decrement operations are used, but
   1173 you can still run into trouble at the edges, for example:</p>
   1174 
   1175 <pre>void RefCounted::release()
   1176 {
   1177     int oldCount = atomic_dec(&mRefCount);
   1178     if (oldCount == 1) {    // was decremented to zero
   1179         recycleStorage();
   1180     }
   1181 }
   1182 
   1183 void useSharedThing(RefCountedThing sharedThing)
   1184 {
   1185     int localVar = sharedThing->x;
   1186     sharedThing->release();
    sharedThing = NULL;    // can't use this pointer any more
   1188     doStuff(localVar);    // value of localVar might be wrong
   1189 }</pre>
   1190 
   1191 <p>The <code>release()</code> call decrements the reference count using a
   1192 barrier-free atomic decrement operation.  Because this is an atomic RMW
   1193 operation, we know that it will work correctly.  If the reference count goes to
   1194 zero, we recycle the storage.</p>
   1195 
<p>The <code>useSharedThing()</code> function extracts what it needs from
<code>sharedThing</code> and then releases its copy.  However, because we didn't
use a memory barrier, and atomic and non-atomic operations can be reordered,
it's possible for other threads to observe the read of
<code>sharedThing->x</code> <strong>after</strong> they observe the recycle
operation.  It's therefore possible for <code>localVar</code> to hold a value
from "recycled" memory, for example a new object created in the same
location by another thread after <code>release()</code> is called.</p>
   1204 
   1205 <p>This can be fixed by replacing the call to <code>atomic_dec()</code> with
   1206 <code>atomic_release_dec()</code>. The barrier ensures that the reads from
   1207 <code>sharedThing</code> are observed before we recycle the object.</p>
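<p>The fixed function is a one-line change (shown here as a sketch, reusing the
hypothetical <code>atomic_release_dec()</code> helper named above):</p>

<pre>void RefCounted::release()
{
    /* release semantics: earlier reads of the object are observed
       before the decrement that may let it be recycled */
    int oldCount = atomic_release_dec(&mRefCount);
    if (oldCount == 1) {    // was decremented to zero
        recycleStorage();
    }
}</pre>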
   1208 
   1209 <div style="padding:.5em 2em;">
   1210 <div style="border-left:4px solid #ccc;padding:0 1em;font-style:italic;">
<p>In most cases the above won't actually fail, because the recycle function
   1212 is likely guarded by functions that themselves employ barriers (libc heap
   1213 <code>free()</code>/<code>delete()</code>, or an object pool guarded by a
   1214 mutex).  If the recycle function used a lock-free algorithm implemented without
   1215 barriers, however, the above code could fail on ARM SMP.</p>
   1216 </div>
   1217 </div>
   1218 
   1219 <h3 id="j_dont">What not to do in Java</h3>
   1220 
<p>We haven't discussed some relevant Java language features, so we'll take a
   1222 quick look at those first.</p>
   1223 
   1224 <h4 id="sync_volatile">Java's "synchronized" and "volatile" keywords</h4>
   1225 
<p>The synchronized keyword provides the Java language's built-in locking
   1227 mechanism.  Every object has an associated monitor that can be used to provide
   1228 mutually exclusive access.</p>
   1229 
   1230 <p>The implementation of the synchronized block has the same basic structure
   1231 as the spin lock example: it begins with an acquiring CAS, and ends with a
   1232 releasing store.  This means that compilers and code optimizers are free to
   1233 migrate code into a synchronized block.  One practical consequence: you must
   1234 <strong>not</strong> conclude that code inside a synchronized block happens
   1235 after the stuff above it or before the stuff below it in a function.  Going
   1236 further, if a method has two synchronized blocks that lock the same object, and
   1237 there are no operations in the intervening code that are observable by another
   1238 thread, the compiler may perform lock coarsening and combine them into a
   1239 single block.</p>
   1240 
   1241 <p>The other relevant keyword is volatile.  As defined in the specification
   1242 for Java 1.4 and earlier, a volatile declaration was about as weak as its C
   1243 counterpart.  The spec for Java 1.5 was updated to provide stronger guarantees,
   1244 almost to the level of monitor synchronization.</p>
   1245 
   1246 <p>The effects of volatile accesses can be illustrated with an example.  If
   1247 thread 1 writes to a volatile field, and thread 2 subsequently reads from that
   1248 same field, then thread 2 is guaranteed to see that write and all writes
   1249 previously made by thread 1.  More generally, the writes made by
   1250 <strong>any</strong> thread up to the point where it writes the field will be
visible to thread 2 when it does the read.  In effect, writing to a volatile is
   1252 like a monitor release, and reading from a volatile is like a monitor
   1253 acquire.</p>
   1254 
<p>Non-volatile accesses may be reordered with respect to volatile accesses in the
usual ways, for example the compiler could move a non-volatile load or store above a
volatile store, but couldn't move it below.  Volatile accesses may not be
   1258 reordered with respect to each other.  The VM takes care of issuing the
   1259 appropriate memory barriers.</p>
   1260 
   1261 <p>It should be mentioned that, while loads and stores of object references and
   1262 most primitive types are atomic, <code>long</code> and <code>double</code>
   1263 fields are not accessed atomically unless they are marked as volatile.
   1264 Multi-threaded updates to non-volatile 64-bit fields are problematic even on
   1265 uniprocessors.</p>
   1266 
   1267 <h4 id="examplesj">Examples</h4>
   1268 
<p>Here's a simple, incorrect implementation of a monotonic counter: <span
   1270 style="font-size:.9em;color:#777">(<em><a href="#more" style="color:#777">Java
   1271 theory and practice: Managing volatility</a></em>)</span>.</p>
   1272 
   1273 <pre>class Counter {
   1274     private int mValue;
   1275 
   1276     public int get() {
   1277         return mValue;
   1278     }
   1279     public void incr() {
   1280         mValue++;
   1281     }
   1282 }</pre>
   1283 
   1284 <p>Assume <code>get()</code> and <code>incr()</code> are called from multiple
   1285 threads, and we want to be sure that every thread sees the current count when
   1286 <code>get()</code> is called.  The most glaring problem is that
   1287 <code>mValue++</code> is actually three operations:</p>
   1288 
   1289 <ol>
   1290 <li><code>reg = mValue</code></li>
   1291 <li><code>reg = reg + 1</code></li>
   1292 <li><code>mValue = reg</code></li>
   1293 </ol>
   1294 
   1295 <p>If two threads execute in <code>incr()</code> simultaneously, one of the
   1296 updates could be lost.  To make the increment atomic, we need to declare
   1297 <code>incr()</code> synchronized.  With this change, the code will run
   1298 correctly in multi-threaded uniprocessor environments.</p>
   1299 
<p>It's still broken on SMP, however.  Different threads might see different
results from <code>get()</code>, because we're reading the value with an ordinary load.  We
   1302 can correct the problem by declaring <code>get()</code> to be synchronized.
   1303 With this change, the code is obviously correct.</p>
   1304 
<p>Unfortunately, we've introduced the possibility of lock contention, which
could hamper performance.  Instead of declaring <code>get()</code> to be
synchronized, we could declare <code>mValue</code> with volatile.  (Note that
<code>incr()</code> must still use <code>synchronized</code>.)  Now we know that
   1309 the volatile write to <code>mValue</code> will be visible to any subsequent volatile read of
   1310 <code>mValue</code>. <code>incr()</code> will be slightly slower, but
   1311 <code>get()</code> will be faster, so even in the absence of contention this is
   1312 a win if reads outnumber writes. (See also {@link
   1313 java.util.concurrent.atomic.AtomicInteger}.)</p>
   1314 
<p>Here's another example, similar in form to the earlier C examples:</p>
   1316 
   1317 <pre>class MyGoodies {
   1318     public int x, y;
   1319 }
   1320 class MyClass {
   1321     static MyGoodies sGoodies;
   1322 
   1323     void initGoodies() {    // runs in thread 1
   1324         MyGoodies goods = new MyGoodies();
   1325         goods.x = 5;
   1326         goods.y = 10;
   1327         sGoodies = goods;
   1328     }
   1329 
   1330     void useGoodies() {    // runs in thread 2
   1331         if (sGoodies != null) {
   1332             int i = sGoodies.x;    // could be 5 or 0
   1333             ....
   1334         }
   1335     }
   1336 }</pre>
   1337 
   1338 <p>This has the same problem as the C code, namely that the assignment
   1339 <code>sGoodies = goods</code> might be observed before the initialization of the
   1340 fields in <code>goods</code>.  If you declare <code>sGoodies</code> with the
   1341 volatile keyword, you can think about the loads as if they were
   1342 <code>atomic_acquire_load()</code> calls, and the stores as if they were
   1343 <code>atomic_release_store()</code> calls.</p>
   1344 
   1345 <p>(Note that only the <code>sGoodies</code> reference itself is volatile.  The
   1346 accesses to the fields inside it are not.  The statement <code>z =
   1347 sGoodies.x</code> will perform a volatile load of <code>MyClass.sGoodies</code>
   1348 followed by a non-volatile load of <code>sGoodies.x</code>.  If you make a local
   1349 reference <code>MyGoodies localGoods = sGoodies</code>, <code>z =
   1350 localGoods.x</code> will not perform any volatile loads.)</p>
   1351 
   1352 <p>A more common idiom in Java programming is the infamous double-checked
   1353 locking:</p>
   1354 
   1355 <pre>class MyClass {
   1356     private Helper helper = null;
   1357 
   1358     public Helper getHelper() {
   1359         if (helper == null) {
   1360             synchronized (this) {
   1361                 if (helper == null) {
   1362                     helper = new Helper();
   1363                 }
   1364             }
   1365         }
   1366         return helper;
   1367     }
   1368 }</pre>
   1369 
   1370 <p>The idea is that we want to have a single instance of a <code>Helper</code>
   1371 object associated with an instance of <code>MyClass</code>.  We must only create
   1372 it once, so we create and return it through a dedicated <code>getHelper()</code>
   1373 function.  To avoid a race in which two threads create the instance, we need to
synchronize the object creation.  However, we don't want to pay the overhead for
   1375 the synchronized block on every call, so we only do that part if
   1376 <code>helper</code> is currently null.</p>
   1377 
<p>This doesn't work correctly on uniprocessor systems, unless you're using a
traditional Java source compiler and an interpreter-only VM.  Once you add fancy
code optimizers and JIT compilers it breaks down.  See the "Double-Checked
Locking is Broken" Declaration link in the appendix for more details, or Item
71 ("Use lazy initialization judiciously") in Josh Bloch's <em>Effective Java,
2nd Edition</em>.</p>
   1384 
<p>Running this on an SMP system introduces an additional way to fail.  Consider
the same code rewritten slightly, as if it were compiled into a C-like language
(I've added a couple of integer fields to represent <code>Helper</code>'s
constructor activity):</p>
   1389 
   1390 <pre>if (helper == null) {
   1391     // acquire monitor using spinlock
   1392     while (atomic_acquire_cas(&this.lock, 0, 1) != success)
   1393         ;
   1394     if (helper == null) {
   1395         newHelper = malloc(sizeof(Helper));
   1396         newHelper->x = 5;
   1397         newHelper->y = 10;
   1398         helper = newHelper;
   1399     }
   1400     atomic_release_store(&this.lock, 0);
   1401 }</pre>
   1402 
   1403 <p>Now the problem should be obvious: the store to <code>helper</code> is
   1404 happening before the memory barrier, which means another thread could observe
   1405 the non-null value of <code>helper</code> before the stores to the
   1406 <code>x</code>/<code>y</code> fields.</p>
   1407 
   1408 <p>You could try to ensure that the store to <code>helper</code> happens after
   1409 the <code>atomic_release_store()</code> on <code>this.lock</code> by rearranging
the code, but that won't help, because it's okay to migrate code upward &mdash;
   1411 the compiler could move the assignment back above the
   1412 <code>atomic_release_store()</code> to its original position.</p>
   1413 
   1414 <p>There are two ways to fix this:</p>
   1415 <ol>
   1416 <li>Do the simple thing and delete the outer check.  This ensures that we never
   1417 examine the value of <code>helper</code> outside a synchronized block.</li>
<li>Declare <code>helper</code> volatile.  With this one small change, the
double-checked locking code above will work correctly on Java 1.5 and later; the
sketch after this list shows roughly what the change corresponds to in the
C-like rewrite.  (You may want to take a minute to convince yourself that this
is true.)</li>
   1421 </ol>
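<p>In terms of the C-like rewrite above, declaring <code>helper</code> volatile
corresponds roughly to the following sketch (still using the hypothetical
<code>atomic_*</code> helpers; in real Java the VM emits the equivalent barriers
for you):</p>

<pre>if (atomic_acquire_load(&helper) == null) {
    // acquire monitor using spinlock
    while (atomic_acquire_cas(&this.lock, 0, 1) != success)
        ;
    if (atomic_acquire_load(&helper) == null) {
        newHelper = malloc(sizeof(Helper));
        newHelper->x = 5;
        newHelper->y = 10;
        /* publish only after the fields have been initialized */
        atomic_release_store(&helper, newHelper);
    }
    atomic_release_store(&this.lock, 0);
}</pre>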
   1422 
   1423 <p>This next example illustrates two important issues when using volatile:</p>
   1424 
   1425 <pre>class MyClass {
   1426     int data1, data2;
   1427     volatile int vol1, vol2;
   1428 
   1429     void setValues() {    // runs in thread 1
   1430         data1 = 1;
   1431         vol1 = 2;
   1432         data2 = 3;
   1433     }
   1434 
   1435     void useValues1() {    // runs in thread 2
   1436         if (vol1 == 2) {
   1437             int l1 = data1;    // okay
   1438             int l2 = data2;    // wrong
   1439         }
   1440     }
   1441     void useValues2() {    // runs in thread 2
   1442         int dummy = vol2;
   1443         int l1 = data1;    // wrong
   1444         int l2 = data2;    // wrong
    }
}</pre>
   1446 
<p>Looking at <code>useValues1()</code>, if thread 2 hasn't yet observed the
update to <code>vol1</code>, then it can't know if <code>data1</code> or
<code>data2</code> has been set yet.  Once it sees the update to
<code>vol1</code>, it knows that the change to <code>data1</code> is also
visible, because that was made before <code>vol1</code> was changed.  However,
it can't make any assumptions about <code>data2</code>, because that store was
   1453 performed after the volatile store.</p>
   1454 
<p>The code in <code>useValues2()</code> uses a second volatile field,
<code>vol2</code>, in an attempt to force the VM to generate a memory barrier.
This doesn't generally work.  To establish a proper happens-before
relationship, both threads need to be interacting with the same volatile field.
You'd have to know that <code>vol2</code> was set after <code>data1/data2</code>
in thread 1.  (The fact that this doesn't work is probably obvious from looking
   1461 at the code; the caution here is against trying to cleverly cause a memory
   1462 barrier instead of creating an ordered series of accesses.)</p>
   1463 
   1464 <h3 id="bestpractice">What to do</h3>
   1465 
   1466 <h4 id="advice">General advice</h4>
   1467 
   1468 <p>In C/C++, use the <code>pthread</code> operations, like mutexes and
   1469 semaphores.  These include the proper memory barriers, providing correct and
   1470 efficient behavior on all Android platform versions.  Be sure to use them
   1471 correctly, for example be wary of signaling a condition variable without holding the
   1472 corresponding mutex.</p>
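<p>As a minimal sketch of the idiom (the names here are illustrative, not part
of any Android API): the waiter re-checks its predicate in a loop, and the
signaler holds the mutex while it signals.</p>

<pre>#include &lt;pthread.h&gt;

static pthread_mutex_t gLock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t gCond = PTHREAD_COND_INITIALIZER;
static int gReady = 0;
static int gValue = 0;

void produce(int value)    // runs in thread 1
{
    pthread_mutex_lock(&gLock);
    gValue = value;
    gReady = 1;
    pthread_cond_signal(&gCond);    /* signal while holding the mutex */
    pthread_mutex_unlock(&gLock);
}

int consume(void)    // runs in thread 2
{
    pthread_mutex_lock(&gLock);
    while (!gReady)                 /* re-check the predicate after waking */
        pthread_cond_wait(&gCond, &gLock);
    int value = gValue;             /* guaranteed to see thread 1's write */
    pthread_mutex_unlock(&gLock);
    return value;
}</pre>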
   1473 
   1474 <p>It's best to avoid using atomic functions directly. Locking and
unlocking a pthread mutex require a single atomic operation each if there's no
contention, so you're not going to save much by replacing mutex calls with
   1477 atomic ops.  If you need a lock-free design, you must fully understand the
   1478 concepts in this entire document before you begin (or, better yet, find an
   1479 existing code library that is known to be correct on SMP ARM).</p>
   1480 
<p>Be extremely circumspect with "volatile" in C/C++.  It often indicates a
   1482 concurrency problem waiting to happen.</p>
   1483 
   1484 <p>In Java, the best answer is usually to use an appropriate utility class from
   1485 the {@link java.util.concurrent} package.  The code is well written and well
   1486 tested on SMP.</p>
   1487 
<p>Perhaps the safest thing you can do is make your class immutable.  Objects
from classes like String and Integer hold data that cannot be changed once the
object is created, avoiding all synchronization issues.  The book <em>Effective
Java, 2nd Ed.</em> has specific instructions in Item 15: "Minimize Mutability."  Note in
particular the importance of declaring fields "final" <span
style="font-size:.9em;color:#777">(<a href="#more" style="color:#777">Bloch</a>)</span>.</p>
   1494 
   1495 <p>If neither of these options is viable, the Java synchronized statement
   1496 should be used to guard any field that can be accessed by more than one thread.
If mutexes won't work for your situation, you should declare shared fields
volatile, but you must take great care to understand the interactions between
threads.  The volatile declaration won't save you from common concurrent
   1500 programming mistakes, but it will help you avoid the mysterious failures
   1501 associated with optimizing compilers and SMP mishaps.</p>
   1502 
   1503 <p>The Java Memory Model guarantees that assignments to final fields are visible
   1504 to all threads once the constructor has finished &mdash; this is what ensures
   1505 proper synchronization of fields in immutable classes.  This guarantee does not
   1506 hold if a partially-constructed object is allowed to become visible to other
threads.  It is necessary to follow safe construction practices <span
   1508 style="font-size:.9em;color:#777">(<a href="#more" style="color:#777">Safe
   1509 Construction Techniques in Java</a>)</span>.</p>
   1510 
   1511 <h4 id="sync_guarantees">Synchronization primitive guarantees</h4>
   1512 
   1513 <p>The pthread library and VM make a couple of useful guarantees: all accesses
   1514 previously performed by a thread that creates a new thread are observable by
   1515 that new thread as soon as it starts, and all accesses performed by a thread
   1516 that is exiting are observable when a <code>join()</code> on that thread
returns.  This means you don't need any additional synchronization when
   1518 preparing data for a new thread or examining the results of a joined thread.</p>
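<p>A small sketch of what that buys you with pthreads (illustrative names, and
no extra locking around the handoff):</p>

<pre>#include &lt;pthread.h&gt;

static int gInput;
static int gResult;

static void* worker(void* arg)    // the new thread
{
    gResult = gInput + 1;    /* gInput is visible: it was written before pthread_create() */
    return NULL;
}

int runWorker(void)
{
    pthread_t thread;

    gInput = 42;    /* prepared before the thread starts */
    pthread_create(&thread, NULL, worker, NULL);
    pthread_join(thread, NULL);
    return gResult;    /* visible once join() returns */
}</pre>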
   1519 
   1520 <p>Whether or not these guarantees apply to interactions with pooled threads
   1521 depends on the thread pool implementation.</p>
   1522 
   1523 <p>In C/C++, the pthread library guarantees that any accesses made by a thread
   1524 before it unlocks a mutex will be observable by another thread after it locks
   1525 that same mutex.  It also guarantees that any accesses made before calling
   1526 <code>signal()</code> or <code>broadcast()</code> on a condition variable will
   1527 be observable by the woken thread.</p>
   1528 
   1529 <p>Java language threads and monitors make similar guarantees for the comparable
   1530 operations.</p>
   1531 
   1532 <h4 id="ccpp_changes">Upcoming changes to C/C++</h4>
   1533 
   1534 <p>The C and C++ language standards are evolving to include a sophisticated
   1535 collection of atomic operations.  A full matrix of calls for common data types
   1536 is defined, with selectable memory barrier semantics (choose from relaxed,
   1537 consume, acquire, release, acq_rel, seq_cst).</p>
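<p>As an illustration of where this is heading, the first publication example
could be written roughly like this with the draft C atomics (a sketch; it
assumes a toolchain that already provides <code>&lt;stdatomic.h&gt;</code>):</p>

<pre>#include &lt;stdatomic.h&gt;

MyThing* _Atomic gGlobalThing;    /* atomic pointer, initially NULL */

void publishGlobalThing(MyThing* thing)    // runs in thread 1, fields already set
{
    atomic_store_explicit(&gGlobalThing, thing, memory_order_release);
}

void useGlobalThing()    // runs in thread 2
{
    MyThing* thing = atomic_load_explicit(&gGlobalThing, memory_order_acquire);
    if (thing != NULL) {
        int i = thing->x;    // guaranteed to see the initialized value
    }
}</pre>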
   1538 
   1539 <p>See the <a href="#more">Further Reading</a> section for pointers to the
   1540 specifications.</p>
   1541 
   1542 
   1543 <h2 id="closing_notes">Closing Notes</h2>
   1544 
<p>While this document does more than merely scratch the surface, it doesn't
   1546 manage more than a shallow gouge.  This is a very broad and deep topic.  Some
   1547 areas for further exploration:</p>
   1548 
   1549 <ul>
   1550 <li>Learn the definitions of <em>happens-before</em>,
   1551 <em>synchronizes-with</em>, and other essential concepts from the Java Memory
Model.  (It's hard to understand what volatile really means without getting
   1553 into this.)</li>
<li>Explore what compilers are and aren't allowed to do when reordering code.
   1555 (The JSR-133 spec has some great examples of legal transformations that lead to
   1556 unexpected results.)</li>
<li>Find out how to write immutable classes in Java and C++.  (There's more to
it than just "don't change anything after construction.")</li>
   1559 <li>Internalize the recommendations in the Concurrency section of <em>Effective
   1560 Java, 2nd Edition.</em> (For example, you should avoid calling methods that are
   1561 meant to be overridden while inside a synchronized block.)</li>
   1562 <li>Understand what sorts of barriers you can use on x86 and ARM.  (And other
CPUs for that matter, for example Itanium's acquire/release instruction
   1564 modifiers.)</li>
   1565 <li>Read through the {@link java.util.concurrent} and {@link
   1566 java.util.concurrent.atomic} APIs to see what's available.  Consider using
   1567 concurrency annotations like <code>@ThreadSafe</code> and
   1568 <code>@GuardedBy</code> (from net.jcip.annotations).</li>
   1569 </ul>
   1570 
   1571 <p>The <a href="#more">Further Reading</a> section in the appendix has links to
   1572 documents and web sites that will better illuminate these topics.</p>
   1573 
   1574 <h2 id="appendix">Appendix</h2>
   1575 
   1576 <h3 id="smp_failure_example">SMP failure example</h3>
   1577 
   1578 <p>This document describes a lot of weird things that can, in theory, happen.
If you're not convinced that these issues are real, a practical example may be
   1580 useful.</p>
   1581 
<p>Bill Pugh's Java memory model <a
   1583 href="http://www.cs.umd.edu/~pugh/java/memoryModel/">web site</a> has a few
   1584 test programs on it.  One interesting test is ReadAfterWrite.java, which does
   1585 the following:</p>
   1586 
   1587 <table>
   1588 <tr>
   1589 <th>Thread 1</th>
   1590 <th>Thread 2</th>
   1591 </tr>
   1592 <tr>
   1593 <td><code>for (int i = 0; i < ITERATIONS; i++) {<br />
   1594 &nbsp;&nbsp;&nbsp;&nbsp;a = i;<br />
   1595 &nbsp;&nbsp;&nbsp;&nbsp;BB[i] = b;<br />
   1596 }</code></td>
   1597 <td><code>for (int i = 0; i < ITERATIONS; i++) {<br />
   1598 &nbsp;&nbsp;&nbsp;&nbsp;b = i;<br />
   1599 &nbsp;&nbsp;&nbsp;&nbsp;AA[i] = a;<br />
   1600 }</code></td>
   1601 </tr>
   1602 </table>
   1603 
   1604 <p>Where <code>a</code> and <code>b</code> are declared as volatile
   1605 <code>int</code> fields, and <code>AA</code> and <code>BB</code> are ordinary
integer arrays.</p>
   1607 
   1608 <p>This is trying to determine if the VM ensures that, after a value is written
   1609 to a volatile, the next read from that volatile sees the new value.  The test
   1610 code executes these loops a million or so times, and then runs through afterward
   1611 and searches the results for inconsistencies.</p>
   1612 
<p>At the end of execution, <code>AA</code> and <code>BB</code> will be full of
   1614 gradually-increasing integers.  The threads will not run side-by-side in a
   1615 predictable way, but we can assert a relationship between the array contents.
   1616 For example, consider this execution fragment:</p>
   1617 
   1618 <table>
   1619 <tr>
   1620 <th>Thread 1</th>
   1621 <th>Thread 2</th>
   1622 </tr>
   1623 <tr>
   1624 <td><code>(initially a == 1534)<br />
   1625 a = 1535<br />
   1626 BB[1535] = 165<br />
   1627 a = 1536<br />
   1628 BB[1536] = 165<br />
   1629 <br />
   1630 <br />
   1631 <br />
   1632 <br />
   1633 a = 1537<br />
   1634 BB[1537] = 167</code></td>
   1635 <td><code>(initially b == 165)
   1636 <br />
   1637 <br />
   1638 <br />
   1639 <br />
   1640 <br />
   1641 b = 166<br />
   1642 AA[166] = 1536<br />
   1643 b = 167<br />
   1644 AA[167] = 1536<br />
   1645 <br /></code></td>
   1646 </tr>
   1647 </table>
   1648 
<p>(This is written as if the threads were taking turns executing so that it's
more obvious when results from one thread should be visible to the other, but in
practice that won't be the case.)</p>
   1652 
   1653 <p>Look at the assignment of <code>AA[166]</code> in thread 2.  We are capturing
   1654 the fact that, at the point where thread 2 was on iteration 166, it can see that
   1655 thread 1 was on iteration 1536.  If we look one step in the future, at thread
1's iteration 1537, we expect to see that thread 1 saw that thread 2 was at
   1657 iteration 166 (or later).  <code>BB[1537]</code> holds 167, so it appears things
   1658 are working.</p>
   1659 
   1660 <p>Now suppose we fail to observe a volatile write to <code>b</code>:</p>
   1661 
   1662 <table>
   1663 <tr>
   1664 <th>Thread 1</th>
   1665 <th>Thread 2</th>
   1666 </tr>
   1667 <tr>
   1668 <td><code>(initially a == 1534)<br />
   1669 a = 1535<br />
   1670 BB[1535] = 165<br />
   1671 a = 1536<br />
   1672 BB[1536] = 165<br />
   1673 <br />
   1674 <br />
   1675 <br />
   1676 <br />
   1677 a = 1537<br />
   1678 BB[1537] = 165  // stale b</code></td>
   1679 <td><code>(initially b == 165)<br />
   1680 <br />
   1681 <br />
   1682 <br />
   1683 <br />
   1684 b = 166<br />
   1685 AA[166] = 1536<br />
   1686 b = 167<br />
   1687 AA[167] = 1536</code></td>
   1688 </tr>
   1689 </table>
   1690 
   1691 <p>Now, <code>BB[1537]</code> holds 165, a smaller value than we expected, so we
know we have a problem.  Put succinctly, for i=166, <code>BB[AA[i]+1] &lt; i</code>.  (This also
   1693 catches failures by thread 2 to observe writes to <code>a</code>, for example if we
   1694 miss an update and assign <code>AA[166] = 1535</code>, we will get
   1695 <code>BB[AA[166]+1] == 165</code>.)</p>
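<p>In code, the consistency check the test performs amounts to something like
this sketch (the function name is ours; a bounds check guards the last few
entries):</p>

<pre>/* returns the first index where the ordering property is violated, or -1 */
int findFailure(const int* AA, const int* BB, int iterations)
{
    for (int i = 1; i &lt; iterations; i++) {
        if (AA[i] + 1 &lt; iterations && BB[AA[i] + 1] &lt; i)
            return i;    // one of the threads observed a stale value
    }
    return -1;
}</pre>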
   1696 
   1697 <p>If you run the test program under Dalvik (Android 3.0 Honeycomb or later)
   1698 on an SMP ARM device, it will never fail.  If you remove the word volatile
   1699 from the declarations of <code>a</code> and <code>b</code>, it will consistently
   1700 fail.  The program is testing to see if the VM is providing sequentially
   1701 consistent ordering for accesses to <code>a</code> and <code>b</code>, so you
   1702 will only see correct behavior when the variables are volatile.  (It will also
   1703 succeed if you run the code on a uniprocessor device, or run it while something
else is using enough CPU that the kernel doesn't schedule the test threads on
   1705 separate cores.)</p>
   1706 
<p>If you run the modified test a few times you will note that it doesn't fail
   1708 in the same place every time.  The test fails consistently because it performs
   1709 the operations a million times, and it only needs to see out-of-order accesses
   1710 once.  In practice, failures will be infrequent and difficult to locate.  This
   1711 test program could very well succeed on a broken VM if things just happen to
   1712 work out.</p>
   1713 
   1714 <h3 id="sync_stores">Implementing synchronization stores</h3>
   1715 
<p><em>(This isn't something most programmers will find themselves implementing,
   1717 but the discussion is illuminating.)</em></p>
   1718 
   1719 <p>Consider once again volatile accesses in Java.  Earlier we made reference to
   1720 their similarities with acquiring loads and releasing stores, which works as a
starting point but doesn't tell the full story.</p>
   1722 
<p>We start with a fragment of Dekker's algorithm. Initially both
   1724 <code>flag1</code> and <code>flag2</code> are false:</p>
   1725 
   1726 <table>
   1727 <tr>
   1728 <th>Thread 1</th>
   1729 <th>Thread 2</th>
   1730 </tr>
   1731 <tr>
   1732 <td><code>flag1 = true<br />
   1733 if (flag2 == false)<br />
   1734 &nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>
   1735 <td><code>flag2 = true<br />
   1736 if (flag1 == false)<br />
   1737 &nbsp;&nbsp;&nbsp;&nbsp;<em>critical-stuff</em></code></td>
   1738 </tr>
</table>
   1740 
   1741 <p><code>flag1</code> and <code>flag2</code> are declared as volatile boolean
   1742 fields.  The rules for acquiring loads and releasing stores would allow the
   1743 accesses in each thread to be reordered, breaking the algorithm.  Fortunately,
   1744 the JMM has a few things to say here.  Informally:</p>
   1745 
   1746 <ul>
   1747 <li>A write to a volatile field <em>happens-before</em> every subsequent read of that
   1748 same field.  (For this example, it means that if one thread updates a flag, and
   1749 later on the other thread reads that flag, the reader is guaranteed to see the
   1750 write.)</li>
   1751 <li>Every execution has a total order over all volatile field accesses.  The
   1752 order is consistent with program order.</li>
   1753 </ul>
   1754 
   1755 <p>Taken together, these rules say that the volatile accesses in our example
   1756 must be observable in program order by all threads. Thus, we will never see
   1757 these threads executing the critical-stuff simultaneously.</p>
   1758 
   1759 <div style="padding:.5em 2em;">
   1760 <div style="border-left:4px solid #999;padding:0 1em;font-style:italic;">
   1761 <p>Another way to think about this is in terms of <em>data races</em>.  A data race
   1762 occurs if two accesses to the same memory location by different threads are not
   1763 ordered, at least one of them stores to the memory location, and at least one of
   1764 them is not a synchronization action <span style="font-size:.9em;color:#777">(<a
   1765 href="#more" style="color:#777">Boehm and McKenney</a>)</span>. The memory model
   1766 declares that a program free of data races must behave as if executed by a
   1767 sequentially-consistent machine.  Because both <code>flag1</code> and
   1768 <code>flag2</code> are volatile, and volatile accesses are considered
   1769 synchronization actions, there are no data races and this code must execute in a
   1770 sequentially consistent manner.</p>
   1771 </div>
   1772 </div>
   1773 
   1774 <p>As we saw in an earlier section, we need to insert a store/load barrier
   1775 between the two operations.  The code executed in the VM for a volatile access
   1776 will look something like this:</p>
   1777 
   1778 <table>
   1779 <tr>
   1780 <th>volatile load</th>
   1781 <th>volatile store</th>
   1782 </tr>
   1783 <tr>
   1784 <td><code>reg = A<br />
   1785 <em>load/load + load/store barrier</em></code></td>
   1786 <td><code><em>store/store barrier</em><br />
   1787 A = reg<br />
   1788 <em>store/load barrier</em></code></td>
   1789 </tr>
   1790 </table>
   1791 
   1792 <p>The volatile load is just an acquiring load.  The volatile store is similar
to a releasing store, but we've omitted load/store from the pre-store barrier,
   1794 and added a store/load barrier afterward.</p>
   1795 
<p>What we're really trying to guarantee, though, is that (using thread 1 as an
example) the write to flag1 is observed before the read of flag2.  We could
issue the store/load barrier before the volatile load instead and get the same
result, but because loads tend to outnumber stores it's best to associate it
   1800 with the store.</p>
   1801 
<p>On some architectures, it's possible to implement volatile stores with an
atomic operation and skip the explicit store/load barrier. On x86, for example,
atomics provide a full barrier. The ARM LL/SC operations don't include a
   1805 barrier, so for ARM we must use explicit barriers.</p>
   1806 
<p>(Much of this is due to Doug Lea and his "JSR-133 Cookbook for Compiler
Writers" page.)</p>
   1809 
   1810 <h3 id="more">Further reading</h3>
   1811 
   1812 <p>Web pages and documents that provide greater depth or breadth.  The more generally useful articles are nearer the top of the list.</p>
   1813 
   1814 <dl>
   1815 <dt>Shared Memory Consistency Models: A Tutorial</dt>
   1816 <dd>Written in 1995 by Adve & Gharachorloo, this is a good place to start if you want to dive more deeply into memory consistency models.
   1817 <br /><a href="http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf">http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf</a></dd>
   1818 
   1819 <dt>Memory Barriers</dt>
   1820 <dd>Nice little article summarizing the issues.
   1821 <br /><a href="http://en.wikipedia.org/wiki/Memory_barrier">http://en.wikipedia.org/wiki/Memory_barrier</a></dd>
   1822 
   1823 <dt>Threads Basics</dt>
   1824 <dd>An introduction to multi-threaded programming in C++ and Java, by Hans Boehm.  Excellent discussion of data races and basic synchronization methods.
   1825 <br /><a href="http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/threadsintro.html">http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/threadsintro.html</a></dd>
   1826 
   1827 <dt>Java Concurrency In Practice</dt>
   1828 <dd>Published in 2006, this book covers a wide range of topics in great detail.  Highly recommended for anyone writing multi-threaded code in Java.
   1829 <br /><a href="http://www.javaconcurrencyinpractice.com">http://www.javaconcurrencyinpractice.com</a></dd>
   1830 
   1831 <dt>JSR-133 (Java Memory Model) FAQ</dt>
   1832 <dd>A gentle introduction to the Java memory model, including an explanation of synchronization, volatile variables, and construction of final fields.
   1833 <br /><a href="http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html">http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html</a></dd>
   1834 
   1835 <dt>Overview of package java.util.concurrent</dt>
<dd>The documentation for the <code>java.util.concurrent</code> package.  Near the bottom of the page is a section entitled "Memory Consistency Properties" that explains the guarantees made by the various classes.
   1837 <br />{@link java.util.concurrent java.util.concurrent} Package Summary</dd>
   1838 
   1839 <dt>Java Theory and Practice: Safe Construction Techniques in Java</dt>
   1840 <dd>This article examines in detail the perils of references escaping during object construction, and provides guidelines for thread-safe constructors.
   1841 <br /><a href="http://www.ibm.com/developerworks/java/library/j-jtp0618.html">http://www.ibm.com/developerworks/java/library/j-jtp0618.html</a></dd>
   1842 
   1843 <dt>Java Theory and Practice: Managing Volatility</dt>
<dd>A nice article describing what you can and can't accomplish with volatile fields in Java.
   1845 <br /><a href="http://www.ibm.com/developerworks/java/library/j-jtp06197.html">http://www.ibm.com/developerworks/java/library/j-jtp06197.html</a></dd>
   1846 
   1847 <dt>The Double-Checked Locking is Broken Declaration</dt>
<dd>Bill Pugh's detailed explanation of the various ways in which double-checked locking is broken.  Includes C/C++ and Java.
   1849 <br /><a href="http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html">http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html</a></dd>
   1850 
   1851 <dt>[ARM] Barrier Litmus Tests and Cookbook</dt>
<dd>A discussion of ARM SMP issues, illuminated with short snippets of ARM code.  If you found the examples in this document too un-specific, or want to read the formal description of the DMB instruction, read this.  Also describes the instructions used for memory barriers on executable code (possibly useful if you're generating code on the fly).
   1853 <br /><a href="http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf">http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf</a></dd>
   1854 
<dt>Linux Kernel Memory Barriers</dt>
   1856 <dd>Documentation for Linux kernel memory barriers.  Includes some useful examples and ASCII art.
   1857 <br/><a href="http://www.kernel.org/doc/Documentation/memory-barriers.txt">http://www.kernel.org/doc/Documentation/memory-barriers.txt</a></dd>
   1858 
   1859 <dt>ISO/IEC JTC1 SC22 WG21 (C++ standards) 14882 (C++ programming language), chapter 29 (Atomic operations library)</dt>
   1860 <dd>Draft standard for C++ atomic operation features.
   1861 <br /><a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3090.pdf">http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3090.pdf</a>
<br />(intro: <a href="http://www.hpl.hp.com/techreports/2008/HPL-2008-56.pdf">http://www.hpl.hp.com/techreports/2008/HPL-2008-56.pdf</a>)</dd>
   1863 
   1864 <dt>ISO/IEC JTC1 SC22 WG14 (C standards) 9899 (C programming language) chapter 7.16 (Atomics &lt;stdatomic.h&gt;)</dt>
   1865 <dd>Draft standard for ISO/IEC 9899-201x C atomic operation features. (See also n1484 for errata.)
   1866 <br /><a href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1425.pdf">http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1425.pdf</a></dd>
   1867 
<dt>Dekker's algorithm</dt>
   1869 <dd>The first known correct solution to the mutual exclusion problem in concurrent programming.  The wikipedia article has the full algorithm, with a discussion about how it would need to be updated to work with modern optimizing compilers and SMP hardware.
   1870 <br /><a href="http://en.wikipedia.org/wiki/Dekker's_algorithm">http://en.wikipedia.org/wiki/Dekker's_algorithm</a></dd>
   1871 
   1872 <dt>Comments on ARM vs. Alpha and address dependencies</dt>
   1873 <dd>An e-mail on the arm-kernel mailing list from Catalin Marinas.  Includes a nice summary of address and control dependencies.
   1874 <br /><a href="http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-05/msg11811.html">http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-05/msg11811.html</a></dd>
   1875 
   1876 <dt>What Every Programmer Should Know About Memory</dt>
   1877 <dd>A very long and detailed article about different types of memory, particularly CPU caches, by Ulrich Drepper.
   1878 <br /><a href="http://www.akkadia.org/drepper/cpumemory.pdf">http://www.akkadia.org/drepper/cpumemory.pdf</a></dd>
   1879 
   1880 <dt>Reasoning about the ARM weakly consistent memory model</dt>
   1881 <dd>This paper was written by Chong & Ishtiaq of ARM, Ltd.  It attempts to describe the ARM SMP memory model in a rigorous but accessible fashion.  The definition of observability used here comes from this paper.
   1882 <br /><a href="http://portal.acm.org/ft_gateway.cfm?id=1353528&type=pdf&coll=&dl=&CFID=96099715&CFTOKEN=57505711">http://portal.acm.org/ft_gateway.cfm?id=1353528&type=pdf&coll=&dl=&CFID=96099715&CFTOKEN=57505711</a></dd>
   1883 
   1884 <dt>The JSR-133 Cookbook for Compiler Writers</dt>
   1885 <dd>Doug Lea wrote this as a companion to the JSR-133 (Java Memory Model) documentation.  It goes much deeper into the details than most people will need to worry about, but it provides good fodder for contemplation.
   1886 <br /><a href="http://g.oswego.edu/dl/jmm/cookbook.html">http://g.oswego.edu/dl/jmm/cookbook.html</a></dd>
   1887 
   1888 <dt>The Semantics of Power and ARM Multiprocessor Machine Code</dt>
   1889 <dd>If you prefer your explanations in rigorous mathematical form, this is a fine place to go next.
   1890 <br /><a href="http://www.cl.cam.ac.uk/~pes20/weakmemory/draft-ppc-arm.pdf">http://www.cl.cam.ac.uk/~pes20/weakmemory/draft-ppc-arm.pdf</a></dd>
   1891 </dl>
   1892