Memory Orders and Instruction Reordering

Learn how CPU out-of-order execution breaks lock-free code, and how to use seq_cst, acquire, release, acq_rel, and relaxed to fix it.

Ryan McCombe
Published

In the previous lesson, we successfully built a lock-free CAS (Compare-And-Swap) loop using std::atomic. We sidestepped the operating system, kept our threads awake, and updated a complex struct using raw hardware instructions. But we left out an important detail: the hardware has been lying to us.

When we look at our code, we see a top-to-bottom list of instructions. We assume that Line 1 executes, and then Line 2 executes. This is an illusion. To maximize performance, both the compiler and the CPU actively dismantle our code and execute it in whatever order they deem most efficient.

In this lesson, we will explore this chaos. We will learn how Out-of-Order Execution can break concurrent programming, and how we use std::memory_order to force the silicon back into compliance.

Out-of-Order Execution

Modern CPUs are incredibly complex engines designed to hide memory latency. They feature an architectural concept known as Out-of-Order Execution (OoOE).

If the CPU is trying to execute Instruction A, but Instruction A is waiting on a slow fetch from main RAM, the CPU will not sit idle. Its instruction decoder will look ahead in our compiled code, grab Instruction B, and might decide to execute it immediately.

As long as Instruction B does not depend on the exact mathematical result of Instruction A, the CPU is free to shuffle them. The compiler's optimizer is also free to do the exact same reordering at compile time if it expects those changes would improve performance.

For single-threaded programs, this reordering of instructions is completely invisible. The CPU guarantees the final state of the program is exactly as if it ran top-to-bottom. But when multiple threads are watching the same memory, they might see those surrounding operations happen in the "wrong" order, shattering our algorithms.

The Safety Fence - seq_cst

As we saw in , a std::atomic variable itself can never be "torn apart" or partially updated. The hardware and cache coherency protocols guarantee its absolute integrity. The chaos of instruction reordering applies entirely to the surrounding, non-atomic instructions above and below it.

If the hardware is constantly shuffling our surrounding instructions, why didn't our CAS loops and other std::atomic algorithms from the previous lessons randomly explode?

The std::atomic wrapper helps us here. Unless we explicitly specify otherwise, operations on a std::atomic applies sequential consistency, represented in the standard library as std::memory_order_seq_cst.

This is the default parameter used by std::atomic functions like load() and store(), so our previous usage was equivalent to this:

// Equivalent to MyAtomic.load();
MyAtomic.load(std::memory_order_seq_cst);

// Equivalent to MyAtomic.store(42);
MyAtomic.store(42, std::memory_order_seq_cst);

Conceptually, we can think of this default behavior as a massive, impenetrable fence dropping into our code. Whilst instructions either side of the atomic operation can be freely reordered, no instruction can cross the fence:

This means that, when we reach an atomic instruction with std::memory_order_seq_cst, we have two important guarantees:

  • Everything above the std::atomic instruction has been executed.
  • Nothing below the std::atomic instruction has been executed.

This creates a syncronization point that restores some degree of order, but it comes with a massive cost. It forces the CPU to drain all of its pending store buffers and halt the pipeline. By forcing total chronological consensus across every single core in the system, seq_cst severely bottlenecks our CPU's performance.

To create high-performance systems, we can remove or reduce these fences when we don't need them.

Removing Fences - relaxed

If our atomic algorithm does not interact with any external data, we don't care about a global timeline. We can just ask for the raw atomic hardware instruction with zero memory fences.

We do this using std::memory_order_relaxed. It guarantees the atomic variable itself won't suffer from data tearing, but allows the compiler and CPU to aggressively reorder surrounding instructions right over the top of it, maximizing performance.

Let's look at the ApplyDamage() function from the previous lesson, where we updated a PlayerState using CAS:

#include <atomic>
#include <cstdint>

struct alignas(8) PlayerState {
  uint32_t Health;
  uint32_t IsDead;
};

std::atomic<PlayerState> State;

void ApplyDamage(uint32_t Damage) {
  // Atomic Load
  PlayerState Expected{State.load()};
  PlayerState Desired;
  
  do {
    Desired = Expected;

    if (Desired.Health > Damage) {
      Desired.Health -= Damage;
    } else {
      Desired.Health = 0;
      Desired.IsDead = 1;
    }
  } while (
    // Atomic CAS
    !State.compare_exchange_weak(Expected, Desired)
  ); 
}

We can specify the memory order of the initial load(), and the compare_exchange_weak() function accepts two memory order arguments: one for if the swap succeeds, and one for if it fails.

Our ApplyDamage() only modifies the self-contained atomic PlayerState and has zero outside data dependencies that other threads might mess with.

We'll explore some more complex functions soon where fences are required but, in the current version of our ApplyDamage() function, we can safely pass relaxed to everything:

#include <atomic>
#include <cstdint>

struct alignas(8) PlayerState {
  uint32_t Health;
  uint32_t IsDead;
};

std::atomic<PlayerState> State;

void ApplyDamage(uint32_t Damage) {
  // The Initial Load
  PlayerState Expected{State.load(std::memory_order_relaxed)};
  PlayerState Desired;

  do {
    Desired = Expected;

    if (Desired.Health > Damage) {
      Desired.Health -= Damage;
    } else {
      Desired.Health = 0;
      Desired.IsDead = 1;
    }

  } while (!State.compare_exchange_weak(
    Expected,
    Desired,
    // The Success Order
    std::memory_order_relaxed, 
    // The Failure Order
    std::memory_order_relaxed  
  ));
}

This is as fast as the hardware can physically execute. However, in the real world, our data is rarely this simple.

The Provider-Consumer Pattern

In real-world applications, we rarely only share single integers or simple structs between threads. We often need to share large, complex data structures like arrays, strings, or object graphs.

We cannot make these massive structures std::atomic- as we covered in the , the hardware only guarantees lock-free atomicity for data that fits into CPU registers (typically 8 or 16 bytes). If we store something larger in a std::atomic, the standard library silently adds locks. This keeps the instructions atomic, but massively degrades performance.

To work around this, we often prefer to use a combination of non-atomic and atomic data. The non-atomic memory holds the actual heavy payload, while a lightweight atomic variable acts as a signal or flag to coordinate access to that payload.

This decision naturally leads to provider-consumer patterns emerging in our code. A "provider" writes the heavy payload and then flips the atomic flag to signal that the data is ready. A "consumer" waits for the atomic flag to flip, and then reads the payload.

Thinking of our functions in terms of providers and consumers significantly simplifies how we reason about memory ordering once we add the possibility that our provider and consumer code might be running concurrently on different threads.

Example Provider

Let's look at a simple function that provides some string data, and uses a std::atomic to communicate when the data is ready:

#include <atomic>
#include <string>

std::string SharedPayload;
std::atomic<bool> IsReady{false};

void ProviderThread() {
  // 1. Write the non-atomic payload
  SharedPayload = "Level Data Loaded";

  // 2. Flip the atomic flag to signal that
  //    the payload is ready to be consumed
  IsReady.store(true, std::memory_order_relaxed);
}

Because we use std::memory_order_relaxed here, we are explicitly telling the CPU and compiler that it is perfectly fine to reorder these instructions. The CPU might decide it's more efficient to execute the atomic store (IsReady = true) before finishing the string payload write.

If this happens, a consumer might see the flag set to true and attempt to read SharedPayload while it is still half-written or entirely empty, causing data races and crashes.

If we instead omitted the relaxed parameter and relied on the default std::memory_order_seq_cst, it would drop a massive memory fence, preventing the string write from sliding below the atomic store and keeping our logic safe.

Example Consumer

Now, let's look at the other side of this transaction - the consumer:

#include <iostream>

void ConsumerThread() {
  // 1. Wait for the atomic flag
  while (!IsReady.load(std::memory_order_relaxed)) {
    // Spin until the payload is ready
  }

  // 2. Read the non-atomic payload
  std::cout << SharedPayload << "\n";
}

Just like the provider, using relaxed here spells disaster. The CPU's out-of-order execution engine might look ahead and speculatively read SharedPayload into a register before the while loop has even asked the atomic if the data is ready.

This means when we get to the std::cout step, we could be sending it uninitialized or stale data, completely bypassing our intended waiting sequence.

Once again, std::memory_order_seq_cst would fix this by dropping a full fence that entirely stops the string read from floating above the atomic check.

However, we can now see the dilemma: seq_cst is safe but painfully slow because it forces total chronological consensus across all CPU cores, while relaxed is fast but dangerously allows instructions to leak across the atomic operation, ruining our intended behaviors.

To build lock-free algorithms that are both fast and safe, we need some middle-ground options.

Consumers - acquire

When our function is a consumer of external data, we need to ensure that we don't read that data before our atomic check says it is safe to do so.

Suppose our player has a separate, non-atomic inventory, and we want to read their DamageReduction stat before applying the damage.

If we keep using relaxed for our Initial load, the CPU is allowed to look ahead and speculatively fetch the DamageReduction into registers before the atomic State is actually loaded from memory. If another thread just equipped a new armor piece, we might process the new atomic state using stale inventory data.

To fix this, we need to upgrade our relaxed memory order with a stricter half-fence: std::memory_order_acquire. When we tag an atomic operation with acquire, we tell the CPU: "Absolutely no memory reads or writes that appear below this line are allowed to float above it."

Let's upgrade our CAS loop. Notice that we must upgrade both the initial load and the failure order, because if the CAS fails, the hardware automatically re-loads the newest state into Expected, and we will need to use the latest GetDamageReduction() value again on the next loop iteration:

// Reads non-atomic Inventory
extern int GetDamageReduction(); 

void ApplyDamage(uint32_t Damage) {
  // Ensure GetDamageReduction doesn't float above
  // the initial load
  PlayerState Expected{State.load(
    std::memory_order_acquire
  )};
  PlayerState Desired;

  do {
    Desired = Expected;

    // External Data Dependency (Consumer)
    uint32_t Reduction = GetDamageReduction();
    uint32_t FinalDamage = (Damage > Reduction)
      ? Damage - Reduction
      : 0;

    if (Desired.Health > FinalDamage) {
      Desired.Health -= FinalDamage;
    } else {
      Desired.Health = 0;
      Desired.IsDead = 1;
    }

  } while (!State.compare_exchange_weak(
    Expected,
    Desired,
    // Success can still be relaxed - GetDamageReduction 
    // can safely float above the do-while loop as long 
    // as we don't need more iterations of the loop
    std::memory_order_relaxed,
    // If CAS fails, we DO need additional iteration of 
    // the loop. We don't want GetDamageReduction to float 
    // above the loop body in this case - we want each
    // iteration of the loop to call GetDamageReduction
    // again to ensure we use the latest value
    std::memory_order_acquire  
  ));
}

Providers - release

What if our CAS loop is a provider of external data? We need to ensure that our external writes are fully completed in memory before our atomic swap becomes visible and tells other threads to come look at them.

Let's remove the damage reduction for a moment. Suppose that when a player's health hits 0, we must write a list of items to a non-atomic DroppedLoot array so other threads can pick them up.

If we use relaxed for our success order, the CPU is free to push the DroppedLoot writes below the atomic CAS. If another thread sees IsDead become 1, it might wake up and read the DroppedLoot array before our provider thread has actually written the items to memory.

To fix this, we use the opposite half-fence: std::memory_order_release. When we tag an atomic store operation with release, we tell the CPU: "Absolutely no memory reads or writes that appear above this line are allowed to sink below it."

To keep things simple, we'll just assume our CAS will succeed on the first try, so we can publish our external data from the loop body:

// Writes to non-atomic Inventory 
extern void WriteDroppedLoot();

void ApplyDamage(uint32_t Damage) {
  // We're no longer consuming external data so
  // we can switch back to relaxed ordering
  PlayerState Expected{State.load(
    std::memory_order_relaxed
  )};
  PlayerState Desired;

  do {
    Desired = Expected;

    if (Desired.Health > Damage) {
      Desired.Health -= Damage;
    } else {
      Desired.Health = 0;
      Desired.IsDead = 1;

      // External Data Write (Provider)
      WriteDroppedLoot(); 
    }

  } while (!State.compare_exchange_weak(
    Expected,
    Desired,
    // Ensure external writes (WriteDroppedLoot) don't
    // sink below the successful CAS
    std::memory_order_release, 
    
    // Our CAS loop is no longer consuming external data.
    // Failure is just a retry so we can switch back
    // to relaxed ordering
    std::memory_order_relaxed 
  ));
}

Providers and Consumers - acq_rel

Often, an atomic Read-Modify-Write operation - like a successful Compare-And-Swap - needs to act as a provider and a consumer at the same time.

Suppose that after a successful kill, we must read from a non-atomic MatchRules system to determine how the player should respawn. Now our successful CAS must provide the DroppedLoot written before it, and consume the MatchRules read after it.

If our success order was only acquire, our dropped loot write might sink too late. If our success order was only release, the CPU could speculatively read the MatchRules earlier than we intended.

We could go back to using the default std::memory_order_seq_cst here, but there is a faster alternative - std::memory_order_acq_rel:

While acq_rel might sound a lot like the default seq_cst, there is an important distinction: acq_rel acts as a localized "baton pass" only between specific threads touching this exact variable. It doesn't force different CPU cores to agree on the timing of other unrelated events across the system.

When we use acq_rel on our successful CAS, it pins the loot write above the line, and pins the rules read below the line:

// Reads non-atomic Inventory
extern int GetDamageReduction();
// Writes to non-atomic Inventory
extern void WriteDroppedLoot();
// Reads non-atomic Game State
extern void ReadMatchRules();

void ApplyDamage(uint32_t Damage) {
  // Ensure GetDamageReduction doesn't float above
  // this (Consumer)
  PlayerState Expected{State.load(
    std::memory_order_acquire
  )};
  PlayerState Desired;
  bool PlayerDied = false;

  do {
    Desired = Expected;
    PlayerDied = false;

    // External Data Dependency (Consumer)
    uint32_t Reduction = GetDamageReduction();
    uint32_t FinalDamage = (Damage > Reduction)
      ? Damage - Reduction
      : 0;

    if (Desired.Health > FinalDamage) {
      Desired.Health -= FinalDamage;
    } else {
      Desired.Health = 0;
      Desired.IsDead = 1;
      PlayerDied = true;

      // External Data Write (Provider)
      WriteDroppedLoot();
    }

  } while (!State.compare_exchange_weak(
    Expected,
    Desired,
    // Simultaneous Consumer (acquire) and
    // Provider (release)
   std::memory_order_acq_rel, 
    // Ensure GetDamageReduction doesn't
    // float above a CAS retry
    std::memory_order_acquire  
  ));

  if (PlayerDied) {
    // External Data Dependency (Consumer)
    // The 'acquire' half of 'acq_rel' ensures
    // this read doesn't float above the CAS
    ReadMatchRules(); 
  }
}

Summary

In this lesson, we looked beneath the code to understand how the hardware manipulates and reorders instructions:

  1. Out-of-Order Execution: We learned that to maximize throughput, the CPU can shuffle our instructions around, which disrupts concurrent logic. Whilst a std::atomic itself never tears, the instructions around it can be reordered.
  2. seq_cst: We saw that std::atomic is safe by default because it uses full memory fences to establish a global timeline, but this degrades performance by stalling the CPU pipeline.
  3. relaxed: We used raw atomic operations with zero memory fences to optimize fully isolated code, allowing the hardware to work as fast as physically possible.
  4. acquire and release: We used half-fences to safely consume and provide external, non-atomic data without paying the cost of a full pipeline flush.
  5. acq_rel: We used bidirectional fences for Read-Modify-Write operations, allowing a thread to act as both a consumer and a provider safely.

We're now familiar with scaling work across multiple cores. But, within each of those individual cores, there is yet another level of parallelism waiting to be unlocked.

In the next lesson, we will look at vectorization. We will learn how to use SIMD (Single Instruction, Multiple Data) registers to process 4, 8, or even 16 numbers in a single CPU cycle.

Next Lesson
Lesson 29 of 51

SIMD and Automatic Vectorization

Learn how SIMD registers allow you to process multiple data points in a single instruction, unlocking the full power of each CPU core.

Have a question about this lesson?
Answers are generated by AI models and may not be accurate