Understanding the Java Memory Model

Posted on April 16, 2013

Like most Java developers I have become accustomed over the years to using the concurrency capabilities of the Java language and platform to build multi-threaded programs. I understand Thread and Runnable, I (think I) know how to use volatile fields and synchronized blocks and methods, I have explored the java.util.concurrent.* utilities and used most of the high-level constructs in them, and I have even used Google’s Guava concurrency features and the Akka actor library.

Exploring concurrency, I went to the point of trying to understand the ForkJoinPool implemented by Doug Lea as part of the JSR 166 effort, and other interesting structures like lock-free queues or work-stealing thread pools. And when you study concurrency in Java you unavoidably end up trying to understand its foundation, which lies in the Java Memory Model.

This article reflects my understanding of this model. It first explains why a memory model is needed, then exposes the consequences of this model for the developer, followed by the burden the model puts on the shoulders of JVM implementors. Some references are given at the end of the article.

The need for a Memory Model

Program Optimizations

Compilers and CPUs can and do optimize programs. These optimizations may be very aggressive and, in the context of multithreading, may lead to unexpected results. Transformations are always (or nearly always) local to a single thread of execution (or one CPU), hence transformations affecting the reads and writes of shared variables can have surprising effects on the behaviour of other threads of execution.
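As an illustration, here is a sketch (class and variable names are mine, for illustration) of the classic reordering surprise:

public class ReorderingSurprise {
  static int x = 0, y = 0;
  static int r1, r2;

  public static void main(String[] args) throws InterruptedException {
    Thread t1 = new Thread() {
      public void run() { x = 1; r1 = y; }
    };
    Thread t2 = new Thread() {
      public void run() { y = 1; r2 = x; }
    };
    t1.start(); t2.start();
    t1.join(); t2.join();
    // Intuitively at least one of r1, r2 should be 1, yet because the
    // independent write and read inside each thread may be reordered by
    // the compiler or the CPU, r1 == 0 && r2 == 0 is a possible outcome.
    System.out.println("r1=" + r1 + " r2=" + r2);
  }
}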

Among the zoo of possible transformations, we can list: reordering of independent reads and writes, caching of memory values in registers instead of re-reading them, elimination of apparently dead or redundant accesses, and speculative execution of loads.

Such transformations can be motivated by: compiler optimizations, out-of-order execution in the CPU pipeline, and the behaviour of the memory hierarchy (store buffers and caches).

It is the essence of a high-level programming language like Java to insulate the developer from all those details while still providing good performance. After all, if we choose to program in Java and not, say, in C, Forth or assembler, it is because we do not want to take care of all those low-level optimizations.

Computer Architecture

MESI (modified/exclusive/shared/invalid) is the main cache coherence protocol implemented in modern hardware. It optimizes the use of caches in multiprocessor systems by identifying which state transitions require change notifications to other processors, issuing a Request For Ownership (RFO) when a processor needs to change the value of a shared cache line.

Note that the existence of a cache coherence protocol does not mean the results of memory operations on caches are always coherent! While processing instructions targeting some memory region, processors do not wait for exclusive access, which implies that two concurrent modifications can collide and one gets lost. A cache refresh only occurs with certainty when a cache line is marked invalid, in which case the processor will read it from memory (this is known as a cache miss) before marking it shared or modified.

Hence even in the presence of a coherence protocol, different processors can and do see different values of the same variable. At the CPU level, synchronization instructions are also mandatory if one needs to ensure consistent semantics of program execution in the presence of multiple threads.

This is actually the same situation as in a distributed system where nodes are connected by a network: one needs explicit protocols and instructions to ensure that all nodes share a common global state, and maintaining this global state is expensive, requiring more bandwidth, more messages, more exchanges between nodes.

Hence JVM implementors targeting different architectures and different needs must have a clear understanding of the requirements of the language and how they relate to the underlying platform.

Programmer’s Guarantee

Sequential Consistency

Sequential consistency as defined by Lamport offers the strongest guarantees to developers. It states two properties over the result of any execution of a piece of software (result is here understood as the state of the memory at each step of the program, or more abstractly the value assigned to each variable): the result is the same as if the operations of all processors were executed in some single sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

Despite its formal definition, sequential consistency is quite intuitive: any execution is an interleaving of the program order of each thread/process, with each read seeing the value of the immediately preceding write in the execution, instantaneously. In other words: take each thread’s/process’s instructions, mix them freely while preserving each thread’s own order, and you have sequential consistency.

Sequential consistency is very strong as it requires that the effects of all memory operations be propagated instantaneously and made visible to all threads atomically: this has a strong impact on which optimizations can be done and how performant memory operations are, as it effectively implies they are all synchronous, or synchronized globally across all threads. Given that each thread is mapped to a processor of the underlying host, this requirement automatically propagates to the hardware level.

Requiring sequential consistency eschews most interesting optimizations that could be done by the compiler and the hardware, and severely hinders the performance of all but the most trivial programs. The system must be able to reorder the instructions given by the program in order to minimize the latency of memory accesses: buffering writes, prefetching reads, or overlapping memory operations with computation.

Data-Race Free Guarantee

The Java Memory Model proposes a weaker guarantee called Data-Race Free Guarantee:

If all sequentially consistent executions of a program are free of data-races then all executions appear to be sequentially consistent

This guarantee is defined through the happens-before relation: if two conflicting accesses (accesses to the same memory location, at least one of which is a write) are not ordered by the happens-before relation, then they form a data race. A program all of whose sequentially consistent executions are free of data races is said to be properly synchronized.

Happens-before is the transitive closure of 2 other relations: program order, which orders the actions of a single thread as given by its code, and synchronizes-with, which pairs synchronization actions across threads (the release of a monitor with every subsequent acquisition of that monitor, a write to a volatile variable with every subsequent read of it, a thread start with the first action of the started thread, and so on).

All these orders apply to pairs of actions within an execution of the program (i.e. they are not ordering relations over the program’s source itself). Given these definitions and some program, there exists a minimal set of synchronization actions that ensures all executions are data-race free.
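The canonical way to create such an ordering is a volatile flag. Here is a minimal sketch (class and method names are mine, for illustration): the write to ready synchronizes-with the read that sees true, so the reader is guaranteed to observe data == 42.

public class HappensBefore {
  int data = 0;
  volatile boolean ready = false;

  void writer() {        // runs in thread A
    data = 42;           // ordinary write...
    ready = true;        // ...ordered before the volatile write
  }

  void reader() {        // runs in thread B
    if (ready) {         // volatile read synchronizes-with the write above
      assert data == 42; // happens-before makes the write of data visible
    }
  }
}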

Just because one uses volatile variables does not mean there won’t be any race conditions in the program: the execution order is dependent on the scheduler, hence different executions can observe different values of the variable.
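For instance (a sketch, names are illustrative), incrementing a volatile counter from two threads is free of data races yet still racy: the increment is a read followed by a write, and the interleaving chosen by the scheduler decides the final value.

public class VolatileRace {
  static volatile int counter = 0;

  public static void main(String[] args) throws InterruptedException {
    Runnable increment = new Runnable() {
      public void run() {
        for (int i = 0; i < 10000; i++)
          counter++; // a volatile read then a volatile write: not atomic
      }
    };
    Thread t1 = new Thread(increment);
    Thread t2 = new Thread(increment);
    t1.start(); t2.start();
    t1.join(); t2.join();
    System.out.println(counter); // frequently less than 20000
  }
}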

Some Examples

Note that a program can be properly synchronized (i.e. free of data races) even without explicit or implicit synchronization actions. The following program is correctly synchronized, hence its only possible outcome is r1 = r2 = x = y = 0. This is so because there is no sequentially consistent execution in which either of the x and y assignments occurs, hence there is no conflicting access.

public class NoDataRace {
  public int x = 0;
  public int y = 0;

  public void run() {
    Thread p1 = new Thread() {
      public void run() {
        int r1 = x;
        if (r1 != 0) y = 42; // dead in every sequentially consistent execution
      }
    };

    Thread p2 = new Thread() {
      public void run() {
        int r2 = y;
        if (r2 != 0) x = 42; // dead in every sequentially consistent execution
      }
    };
    p1.start();
    p2.start();
  }
}

The converse is also true: code can be improperly synchronized even in the presence of synchronization actions. Consider the classical example of the flawed lazy Singleton pattern (double-checked locking):

public class MySingleton {
   private static MySingleton instance;
   
   public static MySingleton instance() {
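     // Unsynchronized read: no happens-before edge with the write below
     // when that write occurs in another thread.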
     if(instance == null) {
       synchronized(MySingleton.class) {
         if(instance == null) {
           instance = new MySingleton();
         }
       }
     }
     return instance;
  }
}

The first read of instance and its write can occur in different threads without any happens-before relation between them, hence the code contains a data race: a thread may observe a non-null but not yet fully constructed instance.
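Under the post-JSR-133 memory model, the standard fix is to declare the field volatile: the write of the fully constructed instance then happens-before any read that sees the non-null reference.

public class MySingleton {
   private static volatile MySingleton instance;

   public static MySingleton instance() {
     if(instance == null) {
       synchronized(MySingleton.class) {
         if(instance == null) {
           instance = new MySingleton(); // volatile write: safe publication
         }
       }
     }
     return instance;
   }
}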

Final Objects

Final fields (and hence immutable objects) are handled with special care: there exists a conceptual freeze action after which publication of the object’s reference is guaranteed to make the final values visible. This freeze action takes effect at the end of the constructor of the object containing the final fields.

If the reference to the object is published to other threads before the freeze action, then the guarantee is dropped and values other than the final ones can be seen. This may be caused by cached values not being propagated, stale data being read…
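A sketch of such an unsafe publication (the class is illustrative): leaking this from the constructor publishes the reference before the freeze action takes effect.

public class FinalFieldLeak {
  static FinalFieldLeak leaked; // globally visible reference

  final int value;

  FinalFieldLeak() {
    leaked = this;   // published BEFORE the freeze at the constructor's end
    this.value = 42; // a thread reading leaked.value may still observe 0
  }
}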

Implementors’ Details

Happens-Before

One consequence of the happens-before relation is that it is possible for reads to see writes occurring “later” in the execution, which actually means the two threads see different orders of actions, or equivalently that value propagation is not serialized. This is exactly what can happen in a distributed system where messages are not guaranteed to arrive at each node at the same time, nor in the same order at all nodes, especially when they come from different nodes (i.e. there is no total order over all messages exchanged in the system).

Causality

Data-race freedom cannot be guaranteed if the JVM only obeys happens-before consistency. Given the above properly synchronized program, there exists an execution which is happens-before consistent but not sequentially consistent:

p1: r1 = x;  // sees the value 42, i.e. the write of x in p2
p1: y = 42;
p2: r2 = y;  // sees the value 42 written above
p2: x = 42;

Yet such behavior cannot be allowed, hence the Java Memory Model requires implementations to respect a non-circular causality principle: actions cannot be ordered in such a way that the occurrence of an action is caused by a circular dependency with another action, as is the case in the above trace. The formal details of the specification are beyond my understanding (it is given in terms of an increasing sequence of committed actions respecting some properties) but my intuition is simply that the execution order should be causally consistent (i.e. it forms a DAG).

It should be noted that reads of “later” writes may still be possible when they respect the causality constraints. This cannot happen in the example above, but the following program makes it perfectly admissible (by the way, this program is not data-race free):

public class CausalityExample {
  int x = 0;
  int y = 0;

  public void run() {
    Thread p1 = new Thread() {
      public void run() {
        int r1 = x;
        if (r1 == 1) y = 1;
      }
    };

    Thread p2 = new Thread() {
      public void run() {
        int r2 = y;
        if (r2 == 1) x = 1;
        if (r2 == 0) x = 1; // x = 1 executes whatever value y holds
      }
    };
    p1.start();
    p2.start();
  }
}

In this case, the outcome r1 = r2 = 1 is possible: the compiler should be given the freedom to infer that, because y never holds a value other than 0 or 1, x = 1 will always be executed; the write no longer depends on the read of y and can be moved earlier, resulting in p1 observing it and writing y = 1.
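The transformed p2 the compiler could legally produce might look like this sketch:

  Thread p2 = new Thread() {
    public void run() {
       x = 1;      // hoisted: executed on every path, whatever y holds
       int r2 = y; // may now observe the value 1 written by p1
    }
  };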

Consequences for the Programmer

It should be apparent from the above discussion and examples that understanding how the Java Memory Model works is not trivial, and I would never pretend to have accomplished such a feat. But what is of interest to me is: what consequences does this have for programmers? What practical advice can one follow to avoid bad surprises when writing concurrent programs? A lot of wisdom can be extracted from Brian Goetz’s book, and the thread model has been thoroughly debunked in The Problem with Threads.

Minimize sharing: All troubles come from the fact that some memory locations are shared between different threads, leading to potential data races. An obvious solution is to minimize sharing, even to the point of no sharing at all between threads.

This is the basic strategic choice behind actor-based concurrency: ensuring that only immutable (hence perfectly shareable) objects are exchanged between actors, and encapsulating each actor’s state behind an unbreakable mailbox barrier, removes all data-race problems. Of course this is true only if the mailboxes themselves are properly synchronized.
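A minimal sketch of this confinement idea (class and method names are mine), using a BlockingQueue as the properly synchronized mailbox:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class CounterActor implements Runnable {
  private final BlockingQueue<Integer> mailbox = new LinkedBlockingQueue<Integer>();
  private int total = 0; // confined: touched only by the actor's own thread

  public void send(int message) throws InterruptedException {
    mailbox.put(message); // the queue provides the happens-before edges
  }

  public void run() {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        total += mailbox.take(); // all state changes happen in one thread
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}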

Use higher-level constructs: Why care about the nitty-gritty details of managing shared memory access when you can use more abstract concepts which handle the complexity for you? This is a generalization of the previous advice and applies to quite a lot of concurrency constructs, from executors and concurrent collections to futures and actors.

Reason locally: When sharing is needed (for example when implementing an actor’s mailbox!) it helps to keep the reasoning local: put all the potentially concurrent accesses in the same place (e.g. a single class) and carefully analyze and test the code to ensure it is properly synchronized. There is nothing worse for breeding concurrency bugs than scattering lock acquires and releases, or synchronized blocks, across several objects and a deeply nested call tree. Moreover, locality may help reduce contention and improve performance, for example by replacing locking constructs with lock-free ones (e.g. CAS operations or volatile variables), as sketched below.
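For instance, a lock-free counter confines all concurrent access to a single class and replaces locking with a CAS retry loop (a sketch; AtomicInteger.incrementAndGet would do the same job):

import java.util.concurrent.atomic.AtomicInteger;

public class LockFreeCounter {
  private final AtomicInteger count = new AtomicInteger(0);

  public int increment() {
    int current;
    do {
      current = count.get();                               // read current value
    } while (!count.compareAndSet(current, current + 1));  // retry on contention
    return current + 1;
  }
}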

References

Brian Goetz et al., Java Concurrency in Practice, Addison-Wesley, 2006.
Edward A. Lee, The Problem with Threads, IEEE Computer 39(5), 2006.
Leslie Lamport, How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs, IEEE Transactions on Computers C-28(9), 1979.
JSR 133: Java Memory Model and Thread Specification Revision.
The Java Language Specification, Chapter 17: Threads and Locks.