In the previous lesson, we learned that replacing a std::mutex with a std::atomic<int> yields a massive performance improvement. By relying on native hardware instructions like LOCK ADD, we avoid putting threads to sleep, bypassing the operating system's heavy context-switching tax entirely.

But a native hardware instruction only works for simple, mathematical operations. What happens if we need to do something more complex? What if we have a game entity with both a Health integer and an IsDead flag wrapped in a struct, and we need to update the entire struct atomically based on some complex combat logic? If we revert to using a std::mutex, we invite the OS scheduler back into our tight loops, killing our multi-core throughput.

In this lesson, we will look at how to solve this using optimistic concurrency. We will use the silicon's Compare-And-Swap (CAS) instruction to build our own custom atomic operations.

The Limits of Atomics

Previously, we saw how to atomically add a value to an integer. In this lesson, let's imagine we want to create more complex atomic logic. When a Player takes damage, we want to reduce their health, but not lower than 0. Additionally, if their health is exactly 0, we want to flag the Player as dead.

Let's look at the simple PlayerState struct that we'll be updating. Because it is small, modern processors can fit this struct into a single 64-bit register:

1#include <cstdint>
2
3struct PlayerState {
4  uint32_t Health;
5  uint32_t IsDead; // 0 for alive, 1 for dead
6};

Advanced

Why not a `bool`?

You might be wondering why we declared IsDead as uint32_t instead of a bool. It seems like using a bool here would make our intent clearer, and reduce our memory footprint from 64 bits to 40 bits.

In this lesson, we are going to use hardware instructions that perform a raw bit-for-bit check on our values. A 40 bit struct would be an awkward size for this - it's too big to fit into a 32 bit register, and too small to fit perfectly into a 64 bit register.

As we covered in the , this is an alignment problem, which can be solved with padding. Behind the scenes, the compiler would take care of this for us. It would silently inject 24 additional bits of invisible padding into our struct so that it neatly aligns to 8 bytes. This means we're not wasting any space by using a 32-bit variable for our boolean - our PlayerState was going to be 64 bits regardless.

In addition, by proactively intervening to inject these bits ourselves, our application code now has visibility and direct control over every bit in our struct. In many cases, particularly when using CAS, this is preferred over relying on hidden, unknown padding bits injected by the compiler.

We revisit controlling our bit-by-bit memory layout a few times throughout the course, so don't worry if this doesn't make sense right now.

We want to apply damage to this player. If the player's health drops to zero, we also need to set the IsDead flag to 1.

Because multiple threads might be processing projectiles hitting this player at the same millisecond, this entire update must be thread-safe. If we attempt a naive read-modify-write operation on a std::atomic, we will recreate the data race we worked so hard to avoid:

1#include <atomic>
2
3std::atomic<PlayerState> State;
4
5void ApplyDamage(uint32_t Damage) {
6  // 1. READ
7  PlayerState Current{State.load()};
8
9  // 2. MODIFY (Complex custom logic)
10  if (Current.Health > Damage) {
11    Current.Health -= Damage;
12  } else {
13    Current.Health = 0;
14    Current.IsDead = 1;
15  }
16
17  // 3. WRITE
18  // DANGER: What if another thread changed State
19  // while we were calculating?
20  State.store(Current); 
21}

This is fundamentally broken. If Thread A and Thread B both load the state when the player has 100 Health, they will both calculate that the new health should be 80. Thread A writes 80. Thread B blindly overwrites it with 80. One of the projectile hits was completely lost.

To fix this, we need a way to tell the CPU: "Write this new state to memory, but only if nobody else has touched it since I last looked."

Pessimistic vs. Optimistic Concurrency

When engineers face this problem, the traditional response is pessimistic concurrency. We assume the worst. We assume that other threads are constantly trying to corrupt our data, so we lock the door using a std::mutex. We pull the data into our registers, take our time executing our complex logic, push the data back to memory, and unlock the door. Anyone who wants to enter while we are working is put to sleep by the operating system.

Optimistic concurrency takes the opposite approach. We assume the best. We assume that collisions are rare. We don't lock anything. Instead, we grab a copy of the live data and do all of our heavy calculations off to the side in our thread's local registers.

When we are finished, we attempt to publish our results back to main memory. If we discover that another thread modified the live data while we were busy calculating, we simply throw away our math, grab a fresh copy of the data, and try again.

Optimistic concurrency burns CPU cycles recalculating data when collisions occur, but it guarantees that a thread is never put to sleep by the OS scheduler.

The Compare-And-Swap (CAS) Primitive

To implement optimistic concurrency, we need a way to perform the final "publish" step atomically. We need to check the current state and overwrite it in a single, indivisible hardware step.

Modern processors provide this via the Compare-And-Swap instruction. In C++, we access this using the compare_exchange_weak() method provided by std::atomic. Here is what the syntax looks like:

1bool Success = State.compare_exchange_weak(Expected, Desired);

When this line executes, the CPU locks the 64-byte cache line containing State. Unlike the slow Operating System lock of std::mutex, CAS uses fast, low-level cache concurrency protocols such as MESI, which we introduced in the previous chapter.

It then reads our State value. If the bytes at State exactly match the bytes in our Expected variable, the CPU overwrites State with our Desired variable, unlocks the cache line, and returns true.

If the bytes at State do not match Expected (because another thread snuck in and changed it), the CPU does not write anything. Instead, it overwrites our local Expected variable with the fresh bytes it just found in main memory, unlocks the cache line, and returns false.

The Magic of the API

The fact that compare_exchange_weak() modifies the Expected parameter on failure is an important feature of this API.

If it didn't do this, our failure logic would have to manually call State.load() to get the new value so we could try calculating again. By automatically populating Expected with the actual current state, the CPU saves us an entire memory fetch instruction. We are handed the fresh data for free.

Writing the CAS Loop

Because our optimistic transaction might fail, we must wrap it in a do-while loop that keeps trying until our transaction does not fail - that is, until compare_exchange_weak() returns true.

Let's rewrite our ApplyDamage() function safely. This function represents the core pattern of almost all lock-free programming:

1#include <atomic>
2#include <cstdint>
3
4struct PlayerState {
5  uint32_t Health;
6  uint32_t IsDead;
7};
8
9std::atomic<PlayerState> State;
10
11void ApplyDamage(uint32_t Damage) {
12  // 1. Take a snapshot of the live data
13  PlayerState Expected{State.load()};
14
15  // We need a variable to hold our math result
16  PlayerState Desired;
17
18  do {
19    // 2. We enter the loop. Do the math off to the side,
20    // working entirely with our local registers.
21    Desired = Expected;
22
23    if (Desired.Health > Damage) {
24      Desired.Health -= Damage;
25    } else {
26      Desired.Health = 0;
27      Desired.IsDead = 1;
28    }
29
30    // 3. Attempt the atomic swap.
31    // If it returns false, Expected is automatically updated
32    // with the fresh state, and the loop restarts
33  } while (!State.compare_exchange_weak(Expected, Desired)); 
34}

Let's trace what happens when two threads call ApplyDamage(10) at around the same time on a player with 100 Health and an IsDead value of 0:

Weak vs. Strong

You will notice the standard library also provides compare_exchange_strong(). Why did we use the weak version?

The physical silicon inside CPUs is messy. The processor will sometimes fail a CAS instruction even if the data hasn't actually changed. This is called a spurious failure. It can happen if the OS briefly interrupts the thread, or if background cache eviction traffic disturbs the memory line at the wrong microsecond.

The strong version of the function hides this reality. It wraps the hardware instruction with additional checks, doing extra work to absolutely guarantee that a false return value implies a true data mismatch.

The weak version emits the bare-metal, unadulterated machine instruction. It runs as fast as physically possible, but it might fail for no logical reason. Because we are already wrapping our logic in a do-while loop anyway, we don't particularly care. Our code works whether the failure was legitimate or spurious - our loop just spins around and tries again in a nanosecond.

In almost all cases, we prefer to use the weak logic. In aggregate, the occasional unnecessary spins of our loop caused by spurious failures are less expensive than the additional checks that compare_exchange_strong() must perform to hide them.

The Hardware Limit and Secret Locks

If CAS is so powerful, why don't we use it for everything? We could wrap our entire 100-megabyte GameWorld struct inside a std::atomic and transaction the whole thing?

Unfortunately, this won't work - std::atomic is not magic - it is a direct binding to physical silicon. Hardware CAS instructions operate on CPU registers, so they can only swap what physically fits inside those registers.

In the modern era, it's generally expected that the platforms we're targetting can support lock-free structs up to 8 bytes (64 bits). Most consumer and server architectures (including modern x86-64 chips) also support double-wide 16-byte (128-bit) CAS instruction, but beyond that, the hardware simply cannot do it.

If we try to put a large 32-byte struct inside a std::atomic, the compiler will still happily accept it. But it lies to us. Because the CPU cannot swap 32 bytes atomically, the standard library quietly injects a hidden std::mutex into our object to protect the data. We think we are writing fast, lock-free code, but we are actually suffering from the OS context switches of mutual exclusion.

To protect ourselves from this compiler betrayal, we can check that our struct maps to true lock-free hardware instructions on our target platform. The std::atomic wrapper makes this available via the is_always_lock_free constant:

1#include <atomic>
2#include <cstdint>
3
4struct PlayerState {
5  uint32_t Health;
6  uint32_t IsDead;
7};
8
9std::atomic<PlayerState> State;
10
11// The compiler will halt the build if this type is 
12// too large to fit into a native hardware register
13static_assert(
14  std::atomic<PlayerState>::is_always_lock_free, 
15  "PlayerState requires a hidden OS lock!"
16);

If our struct is too big for the architecture we're targeting, we will now get a compiler error instead of a secret std::mutex.

Additionally, the struct must be trivially copyable. The CPU has no concept of C++ copy constructors or virtual tables; it just blindly blasts raw bits from cache lines into registers. The std::atomic wrapper protects us from this by default - if the type isn't trivially copyable, we will get a compiler error.

Benchmarking CAS vs. Mutex

Let's compare our CAS approach to an equivalent implementation using a lock. We will have multiple threads, each performing 100,000 operations on a shared PlayerState:. To benchmark our performance across a range of thread counts, we can use the ->ThreadRange() helper when registering our benchmarks:

benchmarks/main.cpp

1#include <benchmark/benchmark.h>
2#include <atomic>
3#include <mutex>
4#include <cstdint>
5
6static constexpr int opsPerThread = 100'000;
7
8struct PlayerState {
9  uint32_t Health;
10  uint32_t IsDead;
11};
12
13// 1. The Mutex Approach
14PlayerState MutexState{1'000'000, 0};
15std::mutex  StateMutex;
16
17static void BM_MutexUpdate(benchmark::State& state) {
18  for (auto _ : state) {
19    // Before every test, use the main thread to reset the state
20    state.PauseTiming();
21    if (state.thread_index() == 0) MutexState = {1'000'000, 0};
22    state.ResumeTiming();
23
24    for (int i = 0; i < opsPerThread; ++i) {
25      std::scoped_lock lock(StateMutex);
26      if (MutexState.Health > 1) {
27        MutexState.Health -= 1;
28      } else {
29        MutexState.Health = 0;
30        MutexState.IsDead = 1;
31      }
32    }
33  }
34}
35
36// 2. The Lock-Free CAS Approach
37std::atomic<PlayerState> AtomicState{{1'000'000, 0}};
38
39static void BM_CASUpdate(benchmark::State& state) {
40  for (auto _ : state) {
41    state.PauseTiming();
42    if (state.thread_index() == 0) AtomicState.store({1'000'000, 0});
43    state.ResumeTiming();
44
45    for (int i = 0; i < opsPerThread; ++i) {
46      PlayerState Expected = AtomicState.load();
47      PlayerState Desired;
48      do {
49        Desired = Expected;
50        if (Desired.Health > 1) {
51          Desired.Health -= 1;
52        } else {
53          Desired.Health = 0;
54          Desired.IsDead = 1;
55        }
56      } while (!AtomicState.compare_exchange_weak(
57        Expected, Desired
58      ));
59    }
60  }
61}
62
63#define REGISTER_BENCHMARK(func)    \
64  BENCHMARK(func)                   \
65    ->ThreadRange(2, 16)            \
66    ->Unit(benchmark::kMillisecond) \
67    ->UseRealTime()
68
69REGISTER_BENCHMARK(BM_MutexUpdate);
70REGISTER_BENCHMARK(BM_CASUpdate);

1---------------------------------------------
2Benchmark                                Time
3---------------------------------------------
4BM_MutexUpdate/real_time/threads:2    1.63 ms
5BM_MutexUpdate/real_time/threads:4    4.81 ms
6BM_MutexUpdate/real_time/threads:8    12.6 ms
7BM_MutexUpdate/real_time/threads:16   18.1 ms
8BM_CASUpdate/real_time/threads:2      1.38 ms
9BM_CASUpdate/real_time/threads:4      2.00 ms
10BM_CASUpdate/real_time/threads:8      1.81 ms
11BM_CASUpdate/real_time/threads:16     1.74 ms

In this test, we'd want the wall time to remain relatively flat, as the work involved scales in proportion to the number of threads - each thread performs 100,000 updates to PlayerState. Our lock-free CAS implementation accomplishes this - the wall clock remains consistent whilst the std::mutex variation gets progressively slower as more threads contend for locks.

Preview: The Danger of Pointers

Our CAS transaction works perfectly for swapping integers and small value structs. However, we must be extremely cautious when using this pattern to update pointers.

The CAS hardware instruction protects the pointer, but it offers zero protection for the object being pointed to. This introduces two major problems:

Use-After-Free: While our thread is running its calculation loop, another thread might swoop in and delete the object our Expected pointer is looking at. If we attempt to read Expected->Health to perform the math, the operating system will hit us with a segmentation fault before the CAS instruction even fires.
The ABA Problem: Even if we manage to avoid dereferencing the pointer, global memory allocators aggressively recycle addresses. If Thread B deletes Node A, and later allocates a new node that reuses that physical memory address, our CAS instruction will see the matching 64-bit value stored in the pointer. It will blindly assume no threads have intervened and execute the swap, corrupting our memory.

We cover the ABA problem in more detail in the next chapter, and we explore the challenges of pointer concurrency and the architectural patterns to overcome them when we build lock-free data structures later in the course.

Advanced: Tuning the Memory Order

By default, every operation on a std::atomic uses sequential consistency -std::memory_order_seq_cst. This is the safest setting. It asks the compiler and the CPU to emit heavy memory fencing instructions that flush the processor's store buffers, ensuring that all threads see all memory operations in the same global order.

This safety comes at a high price. It prevents the CPU from pipelining instructions efficiently.

In a high-performance system, we can change this by providing explicit memory ordering hints. The compare_exchange functions accept two memory ordering parameters: one for what to do if the operation succeeds, and one for what to do if it fails:

1State.compare_exchange_weak(
2  Expected,
3  Desired,
4  SuccessOrder, 
5  FailureOrder  
6);

The Success Order

When our CAS loop finally succeeds, we will publish our Desired state to main memory. Other threads will be reading it. We need to guarantee that any memory writes we performed before this transaction are fully visible to other threads when they see our new PlayerState.

To enforce this, we use std::memory_order_release. It tells the CPU: "Do not allow any of my earlier memory writes to be delayed past this CAS instruction."

The Failure Order

If our CAS fails, we did not publish anything. The CPU just updated our Expected variable, and we are about to restart the loop.

Because we didn't change any global state, we don't need to enforce any memory ordering with the rest of the system. We just want the hardware to update our local registers as fast as possible.

To enforce this, we use std::memory_order_relaxed. It tells the CPU: "I just need the raw value. Do not emit any memory fences or flush any buffers."

Plugging these optimized constraints into our loop would look like this:

1void ApplyDamage(uint32_t Damage) {
2  PlayerState Expected{State.load(std::memory_order_relaxed)};
3  PlayerState Desired;
4
5  do {
6    Desired = Expected;
7
8    if (Desired.Health > Damage) {
9      Desired.Health -= Damage;
10    } else {
11      Desired.Health = 0;
12      Desired.IsDead = 1;
13    }
14
15  } while (!State.compare_exchange_weak(
16      Expected,
17      Desired,
18      std::memory_order_release, 
19      std::memory_order_relaxed  
20  ));
21}

This code now generates the absolute minimum machine instructions required to safely transaction a 64-bit block of memory across multiple cores.

Summary

In this lesson, we bridged the gap between basic integer atomics and complex data structures.

Optimistic Concurrency: Instead of pessimistically locking data with a std::mutex, we calculate state transitions locally, and optimistically attempt to publish them.
Compare-and-Swap (CAS): We used std::atomic's compare_exchange_weak() to tap into the silicon's native capacity to verify and update a memory address in a single hardware cycle.
The CAS Loop: We learned that a failed CAS automatically updates our Expected variable with the fresh memory state, saving us a read instruction and allowing our do-while loop to immediately recalculate.
Spurious Failures: We discovered that modern CPUs can fail a CAS operation due to hardware quirks, which is why weak loops are standard practice over strong calls.
Memory Ordering: We tuned our loop by using memory_order_release to publish successful work, and memory_order_relaxed to ensure failed iterations retry as fast as possible.

We're now familiar with scaling work across multiple cores. But, within each of those cores, there is yet another level of parallelism waiting to be unlocked.

In the next lesson, we will look at vectorization. We will learn how to use SIMD (Single Instruction, Multiple Data) registers to process 4, 8, or even 16 numbers in a single CPU cycle.

Compare-And-Swap and Optimistic Concurrency