# Understanding the Java Memory Model

Posted on April 16, 2013

Like most Java developers, I have become accustomed over the years to using the concurrency capabilities of the Java language and platform to build multi-threaded programs. I understand Thread and Runnable, I (think I) know how to use volatile fields and synchronized blocks and methods, I have explored the java.util.concurrent.* utilities and used most of the high-level constructs in them, and I have even used Google’s Guava concurrency features and the Akka actors library.

Exploring concurrency, I went to the point of trying to understand the ForkJoinPool implemented by Doug Lea as part of the JSR 166 effort, and other interesting structures such as lock-free queues and work-stealing thread pools. And when you study concurrency in Java you unavoidably end up trying to understand its foundation, which lies in the Java Memory Model.

This article reflects my understanding of this model. It first explains why a memory model is needed, then exposes the consequences of this model for the developer, and follows with the burden the model puts on the shoulders of JVM implementors. Some references are given at the end of this article.

# The need for a Memory Model

## Program Optimizations

Compilers and CPUs can and do optimize programs. These optimizations may be very aggressive and, in the context of multithreading, may lead to unexpected results. Transformations are always (or nearly always) local to a single thread of execution (or a single CPU), hence transformations affecting reads and writes of shared variables can have surprising effects on the behaviour of other threads of execution.

Among the zoo of possible transformations, we can list:

• Register allocation: Reading from a register is much faster than reading from a memory location, hence compilers strive to maximize the allocation of variables to registers
• Read reuse: When a variable is read into two different registers (e.g. as local variables), the optimizer can detect that no write occurs between the two reads and reuse the result of the first read for the second
• Reordering: Independent writes/reads can be moved to different locations, e.g. to the beginning of a sequence of instructions, for example to reduce memory bandwidth usage. This can be done if the compiler detects that the write is independent of intervening reads
• Control flow simplification:
  • Conditional execution can be removed if the value is guaranteed to be set to some known value. Constant propagation can yield such optimizations
  • Loop unrolling: When the number of iterations is known to be small, or constant, the compiler can remove the loop and inline its body, with further optimizations possible
• Synchronization elimination and simplification: Actions can be moved “inside” synchronized regions to merge several locks/unlocks (lock coarsening)
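As a concrete illustration of read reuse, here is a sketch of what the optimizer is allowed to do (the class and method names are mine, not those of any particular JIT compiler): with no intervening write, the second read of a field may legally be replaced by the result of the first.

```java
class ReadReuse {
    static int flag = 0;

    static int original() {
        int r1 = flag;   // first read of flag
        int r2 = flag;   // second read: no write occurs in between...
        return r1 + r2;
    }

    static int optimized() {
        int r1 = flag;   // ...so the optimizer may keep a single read
        int r2 = r1;     // and reuse the register holding r1
        return r1 + r2;
    }

    public static void main(String[] args) {
        flag = 21;
        // Both versions compute the same single-threaded result, which is
        // exactly why the transformation is legal for one thread — and why
        // another thread's write to flag may never be observed in a loop.
        System.out.println(original() + " " + optimized());  // prints "42 42"
    }
}
```

The transformation is invisible to the thread executing it, but a second thread writing `flag` between the two reads would have its write silently ignored.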

Such transformations can be motivated by:

• Optimizing cache access: Reading/writing several values located in the same cache line sequentially is much more efficient, as it reduces the risk of cache line evictions
• Optimizing instruction pipelines and caches: Modifying branching behaviour can optimize instruction cache usage by ensuring cache flushes occur less often
• Optimizing memory access: Reusing values from a register, or a cache, is much more efficient than directly accessing the cache or the RAM

It is the essence of high-level programming languages like Java to insulate the developer from all those details while still providing good performance. After all, if we chose to program in Java and not, say, in C, Forth or assembler, it is because we do not want to take care of all those low-level optimizations.

## Computer Architecture

MESI (Modified/Exclusive/Shared/Invalid) is the main cache coherence protocol implemented in modern hardware. It optimizes the use of caches in multiprocessor systems by identifying which state transitions require change notifications to other processors, issuing a Request For Ownership (RFO) when some processor needs to change the value of a shared variable.

Note that the existence of a cache coherence protocol does not mean the results of memory operations on caches are always coherent! While executing instructions against some memory region, processors do not wait for exclusive access, which implies that two concurrent modifications can collide and one gets lost. A cache refresh occurs with certainty only when a cache line is marked invalid, in which case the processor will read it from memory (this is known as a cache miss) before marking it shared or modified.

Hence even in the presence of a coherence protocol, different processors can and do see different values of the same variable. At the CPU level, synchronization instructions are also mandatory if one needs to ensure consistent semantics of program execution in the presence of multiple threads.

This is actually the same situation as in a distributed system where nodes are connected by a network: One needs explicit protocols and instructions to ensure that all nodes share a common global state, and maintaining this global state is expensive, requiring more bandwidth, more messages, and more exchanges between nodes.

Hence JVM implementors targeting different architectures and different needs must have a clear understanding of the requirements of the language and how they relate to the underlying platform.

# Programmer’s Guarantee

## Sequential Consistency

Sequential consistency as defined by Lamport offers the strongest guarantees to developers. It states two properties over the result of any execution of a piece of software (result is here understood as the state of the memory at each step of the program, or more abstractly the value assigned to each variable):

• From the point of view of a single thread, the execution is consistent with the program order
• The complete execution trace appears in a single unique order from the point of view of all the threads

Despite its formal definition, sequential consistency is quite intuitive: any execution is an interleaving of the program order of each thread/process, with each read seeing the value of the immediately preceding write in the interleaving, instantaneously. In other words: Take each thread’s/process’s instructions, mix them freely while preserving each thread’s own order, and you have sequential consistency.

Sequential consistency is very strong, as it requires the effect of all memory operations to be propagated instantaneously and made visible to all threads atomically: This has a strong impact on which optimizations can be done and on how performant memory operations are, as it effectively implies they are all synchronous, or synchronized globally across all threads. Given each thread is mapped to a processor on the underlying host, this requirement propagates automatically to the hardware level.

Requiring sequential consistency rules out most interesting optimizations that could be done by the compiler and the hardware, and severely hinders performance of all but the most trivial programs. The system must be able to reorder the instructions given by the program in order to minimize the latency of memory accesses:

• Access times for the various levels of cache (L1, L2, L3) increase exponentially, hence it is always advantageous to try to hit the lowest level of cache possible: This can be achieved most of the time by ensuring memory accesses are “linear”
• Requiring atomicity of writes severely reduces the usefulness of caches and increases latency: Writers have to wait for acknowledgements from all other processors before proceeding with a write (which makes write-back caches effectively useless)
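The cost of sequential consistency is easiest to see in the classic “store buffering” litmus test, sketched below (the class name is mine). Under sequential consistency the outcome r1 = r2 = 0 is impossible, since in every interleaving at least one store precedes both loads; without synchronization, the Java Memory Model and real hardware both allow it, because each thread’s store may still sit in a store buffer when the other thread reads.

```java
class StoreBuffering {
    static int x = 0, y = 0;
    static int r1, r2;

    public static void main(String[] args) {
        Thread t1 = new Thread(() -> { x = 1; r1 = y; });
        Thread t2 = new Thread(() -> { y = 1; r2 = x; });
        t1.start(); t2.start();
        try {
            t1.join(); t2.join();
        } catch (InterruptedException ignored) { }
        // Sequentially consistent outcomes: (r1, r2) in {(0,1), (1,0), (1,1)}.
        // The relaxed outcome (0,0) is also legal here, since the two
        // threads are not synchronized in any way.
        System.out.println("r1=" + r1 + " r2=" + r2);
    }
}
```

Enforcing sequential consistency would require flushing each store before the subsequent load, which is precisely the expensive global synchronization discussed above.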

## Data-Race Free Guarantee

The Java Memory Model proposes a weaker guarantee called Data-Race Free Guarantee:

If all sequentially consistent executions of a program are free of data-races then all executions appear to be sequentially consistent

This guarantee is defined through the happens-before relation: If two conflicting accesses (accesses to the same memory location, at least one of which is a write) are not ordered by the happens-before relation, then they form a data race. A program all of whose sequentially consistent executions are devoid of data races is said to be properly synchronized.

Happens-before is the transitive closure of two other relations:

• Program order, which is the partial order of actions as specified by the program’s source code. It is partial because actions executed by different threads can be interleaved: The underlying memory model respects program order only in an intra-thread context
• Synchronization order, which is a total order over all synchronization actions: monitor exit → monitor enter, volatile write → volatile read, thread start → first action in thread, last action in thread → thread isAlive(), interrupt → interruption catching, default value write → first action of thread,…

All these orders apply to pairs of actions within an execution of the program (i.e. they are not ordering relations on the program’s source itself). Given these definitions and some program, there exists a minimal set of synchronization actions that ensures all executions are data-race free.
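The volatile write → volatile read edge in the synchronization order is the classic way to publish plain data safely. A minimal sketch (the class and field names are mine): because the write of data is ordered before the volatile write of ready, and the volatile read that sees true is ordered before the subsequent read of data, transitivity of happens-before guarantees the reader observes 42.

```java
class SafePublication {
    static int data = 0;
    static volatile boolean ready = false;

    public static void main(String[] args) {
        Thread writer = new Thread(() -> {
            data = 42;      // ordinary write, ordered before...
            ready = true;   // ...this volatile write
        });
        Thread reader = new Thread(() -> {
            while (!ready) { }           // spin until the volatile read sees true
            System.out.println(data);    // happens-before guarantees this prints 42
        });
        reader.start();
        writer.start();
        try {
            writer.join(); reader.join();
        } catch (InterruptedException ignored) { }
    }
}
```

Remove the volatile modifier and the reader may spin forever or print 0: both conflicting accesses would then be unordered, i.e. a data race.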

Just because one uses volatile variables does not mean there won’t be any race conditions in the program: The execution order depends on the scheduler, hence different executions can observe different values of the variable.
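A minimal sketch of this caveat (the class name is mine): volatile gives visibility and ordering, not atomicity. A volatile increment is still a read, an add, and a write, so two threads can interleave between the read and the write and lose updates, even though each individual volatile access is a synchronization action.

```java
class VolatileCounter {
    static volatile int count = 0;

    public static void main(String[] args) {
        Runnable inc = () -> {
            for (int i = 0; i < 100_000; i++) {
                count++;   // read-modify-write: three actions, not one
            }
        };
        Thread t1 = new Thread(inc), t2 = new Thread(inc);
        t1.start(); t2.start();
        try {
            t1.join(); t2.join();
        } catch (InterruptedException ignored) { }
        // An update is lost whenever the two read-modify-writes interleave,
        // so the final value is frequently below 200_000.
        System.out.println(count);
    }
}
```

Making the increments atomic requires a synchronized block or an AtomicInteger, not a volatile field.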

### Some Examples

Note that a program can be properly synchronized (i.e. free of data races) even without explicit or implicit synchronization actions. The following program is correctly synchronized, hence its only possible outcome is r1 = r2 = x = y = 0. This is so because there is no sequentially consistent execution in which either of the assignments to x or y occurs, hence there is no conflicting access.

public class Example {
    public int x = 0;
    public int y = 0;

    public void race() {
        Thread p1 = new Thread() {
            public void run() {
                int r1 = x;
                if (r1 != 0) y = 42;
            }
        };

        Thread p2 = new Thread() {
            public void run() {
                int r2 = y;
                if (r2 != 0) x = 42;
            }
        };

        p1.start();
        p2.start();
    }
}

The converse is also true: Code can be improperly synchronized even in the presence of synchronization actions. Consider the classical example of the flawed lazy Singleton pattern:

public class MySingleton {
    private static MySingleton instance;

    public static MySingleton instance() {
        if (instance == null) {
            synchronized (MySingleton.class) {
                if (instance == null) {
                    instance = new MySingleton();
                }
            }
        }
        return instance;
    }
}

The unsynchronized read of instance and its write inside the synchronized block can occur in different threads without a happens-before relation existing between the two actions, hence the code contains a data race.
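A standard fix, valid since the Java 5 revision of the memory model, is to declare the field volatile, so that the write of instance happens-before any read that observes it (the local variable below is a common refinement to avoid a second volatile read on the fast path):

```java
class MySingleton {
    // volatile: the write below happens-before any read that sees it,
    // so a reader can never observe a partially constructed object
    private static volatile MySingleton instance;

    public static MySingleton instance() {
        MySingleton local = instance;        // single volatile read
        if (local == null) {
            synchronized (MySingleton.class) {
                local = instance;            // re-check under the lock
                if (local == null) {
                    instance = local = new MySingleton();
                }
            }
        }
        return local;
    }
}
```

Without volatile, the write of instance could be visible to another thread before the writes performed inside the constructor.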

### Final Objects

Final fields (and hence immutable objects) are handled with special care: There exists a conceptual freeze action after which any thread the object’s reference is published to is guaranteed to see the final value. This freeze action takes effect at the end of the constructor of the object containing the final fields.

If the reference to the object is published to another thread before the freeze action, then the guarantee is dropped and values other than the final one can be seen, for example because cached values were not propagated or stale data is read.
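A sketch of how the guarantee is lost (the class name and the REGISTRY publication point are hypothetical): if the constructor leaks this before it finishes, another thread reading the registry may observe the object before the freeze and see the default value 0 for the final field.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class FinalHolder {
    // Hypothetical publication point: any structure reachable by other threads.
    static final List<FinalHolder> REGISTRY =
            Collections.synchronizedList(new ArrayList<FinalHolder>());

    final int value;

    FinalHolder(int value) {
        REGISTRY.add(this);   // unsafe: `this` escapes before the freeze action
        this.value = value;   // the freeze only happens at the end of the constructor
    }
}
```

Moving the REGISTRY.add(this) call out of the constructor, to after construction completes, restores the final-field guarantee.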

# Implementors’ Details

## Happens-Before

One consequence of the happens-before relation is that it is possible for reads to see writes occurring “later” in the execution, which actually means the two threads see different orders of actions, or equivalently that value propagation is not serialized. This is exactly what can happen in a distributed system where messages are not guaranteed to arrive at each node at the same time, nor in the same order at all nodes, especially when they come from different nodes (i.e. there is no total order over all messages exchanged in the system).

## Causality

Data-race freedom cannot be guaranteed if the JVM only obeys happens-before consistency. For the properly synchronized program above, there exists an execution which is happens-before consistent but not sequentially consistent:

p1: r1 = x;  // sees a value of 42, i.e. the write of x in p2
p1: y = 42;
p2: r2 = y;  // sees the value 42 written above
p2: x = 42;

Yet such behavior must be prohibited, hence the Java Memory Model requires implementations to respect a non-circular causality principle: actions cannot be ordered in such a way that the occurrence of an action is caused by a circular dependence on another action, as is the case in the above trace. The formal details of the specification are beyond my understanding (it is given in terms of an increasing sequence of committed actions which respect some properties) but my intuition is simply that the execution order should be causally consistent (i.e. it forms a DAG).

It should be noted that reads of “later” writes may still be possible if they respect the causality constraints. In the example above this cannot happen, but in the following program it is perfectly admissible (by the way, this program is not data-race free):

public class Example2 {
    int x = 0;
    int y = 0;

    public void race() {
        Thread p1 = new Thread() {
            public void run() {
                int r1 = x;
                if (r1 == 1) y = 1;
            }
        };

        Thread p2 = new Thread() {
            public void run() {
                int r2 = y;
                if (r2 == 1) x = 1;
                if (r2 == 0) x = 1;
            }
        };

        p1.start();
        p2.start();
    }
}

In this case, it is possible that r1 = r2 = 1: The compiler should be given the freedom to infer that, because y never holds a value other than 0 or 1, x = 1 will always be executed; the write therefore no longer depends on the read of y and can be moved earlier, allowing p1 to observe it and write y = 1.

# Consequences for the Programmer

It should be apparent from the above discussion and examples that understanding how the Java Memory Model works is not trivial, and I would never pretend to have accomplished such a feat. But what is of interest to me is: What consequences does this have for programmers? What practical advice can one follow to avoid bad surprises when writing concurrent programs? A lot of wisdom can be extracted from Brian Goetz’s book, and the thread model has been thoroughly critiqued in The Problem with Threads.

Minimize sharing: All troubles come from the fact that some memory locations are shared between different threads, leading to potential data races. An obvious solution is to minimize sharing, even to the point of no sharing at all between threads.

This is the basic strategic choice behind actor-based concurrency: Ensuring that only immutable (hence perfectly shareable) objects are exchanged between actors, and encapsulating each actor’s state behind an unbreakable mailbox barrier, removes all data-race problems. Of course this is true only if the mailboxes themselves are properly synchronized.
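A minimal mailbox sketch along these lines, using only java.util.concurrent (all names are mine): state crosses threads only as immutable messages through a properly synchronized queue, whose take() provides the happens-before edge with the corresponding put().

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class Mailbox {
    // Immutable message: final field, frozen at the end of the constructor.
    static final class Msg {
        final String text;
        Msg(String text) { this.text = text; }
    }

    static volatile String received;   // written only by the actor thread

    public static void main(String[] args) {
        final BlockingQueue<Msg> mailbox = new ArrayBlockingQueue<Msg>(16);
        Thread actor = new Thread(() -> {
            try {
                // take() happens-after the corresponding put(): the message
                // and everything written before it are safely visible here
                received = mailbox.take().text;
            } catch (InterruptedException ignored) { }
        });
        actor.start();
        try {
            mailbox.put(new Msg("hello"));
            actor.join();
        } catch (InterruptedException ignored) { }
        System.out.println(received);   // prints "hello"
    }
}
```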

Use higher-level constructs: Why care about the nitty-gritty details of managing shared memory access when you can use more abstract concepts which handle the complexity for you? This is a generalization of the above and applies to quite a lot of possible concurrency constructs.

• Software Transactional Memory: Variables are transactional, with transactions working optimistically (roll back if someone else has changed the world behind your back). The nice thing with transactional memory is that it composes gracefully. There even is a Java implementation (actually there are several of them)
• Data flow programming is another interesting model where concurrency is handled declaratively and implicitly: Data flow variables can be written once, and reading them blocks until a value is available. This allows concise and elegant expression of fork/join and other kinds of distributed computations
• Agents (see also the GPars version) are a kind of actor that guards mutable state: Modifying this state means passing a function whose execution is serialized by the agent itself in its own thread
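The agent idea above can be sketched with only java.util.concurrent (the Agent class and its method names are mine, not the GPars API): mutable state confined to a single-threaded executor, with mutations passed in as functions that the executor serializes.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.UnaryOperator;

class Agent<T> {
    private T state;   // only ever touched by the single worker thread
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    Agent(T initial) { state = initial; }

    // Mutations are functions; the single-threaded executor serializes them.
    void send(UnaryOperator<T> update) {
        worker.execute(() -> state = update.apply(state));
    }

    // Future.get() gives the caller a happens-before edge with the last update.
    T get() throws Exception {
        return worker.submit(() -> state).get();
    }

    void shutdown() { worker.shutdown(); }
}
```

Usage: sends from a single thread are applied in order, so after `send(n -> n + 1)` then `send(n -> n * 10)` on an Agent started at 0, `get()` returns 10.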

Reason locally: When sharing is needed (for example when implementing an actor’s mailbox!) it helps to keep the reasoning local: Put all the potentially concurrent accesses in the same place (e.g. the same class) and carefully analyze and test the code to ensure it is properly synchronized. Nothing breeds concurrency bugs like scattering lock acquires and releases, or synchronized blocks, across several objects and a deeply nested call tree. Moreover, locality can help reduce contention and improve performance, for example by replacing locking constructs with lock-free ones (e.g. CAS operations or volatile variables).
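As a sketch of that last point (the class name is mine), a lock around a counter can be replaced by a CAS retry loop kept local to one class, here using java.util.concurrent.atomic.AtomicInteger:

```java
import java.util.concurrent.atomic.AtomicInteger;

class CasCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    // Lock-free increment: read the current value, attempt to swap in
    // current + 1, and retry if another thread won the race in between.
    int increment() {
        int current;
        do {
            current = value.get();
        } while (!value.compareAndSet(current, current + 1));
        return current + 1;
    }

    int get() { return value.get(); }
}
```

All the concurrent access is confined to this one class, which makes it practical to reason about and to test exhaustively.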