The std::atomic API and Data Tearing

Understand the physical reality of moving data between RAM and registers. Learn why explicit loads and stores are required to prevent data tearing in complex structs.

Ryan McCombe

In the previous lessons, we used std::atomic<int> to fix a data race without resorting to slow operating system locks. The compiler translated our simple ++ operation into a raw, native hardware instruction like LOCK ADD.

But real-world software rarely just increments single integers. We build systems using complex entities, states, and configurations. We need to update multiple pieces of data simultaneously in a thread-safe way.

In this lesson, we will look at how std::atomic interacts with complex C++ structs. We will learn how to explicitly command the CPU to move data between main RAM and local registers, and we will discover how hardware limitations cause data tearing.

The Hidden Cost of Operators

The std::atomic template allows us to treat atomic numbers like we would basic int and float values. As we saw in the previous lesson, we can use operators like =, +, and ++ as if our atomic was a basic numeric type:

#include <atomic>

std::atomic<int> GlobalScore{0};

void UpdateScore() {
  int current = GlobalScore; 
  GlobalScore = current + 10; 
  GlobalScore++; 
}

While this compiles and runs, many experienced engineers avoid this syntax in concurrent code. By treating the atomic variable just like a regular int or float, we hide the reality of what the machine is doing.

In concurrent programming, we want to be hyper-aware of exactly when our thread reaches out across the silicon bus to touch shared memory. An atomic access is not free: the core has to take part in the cache coherency protocol, and if another core has recently written to the same cache line (typically 64 bytes), that line must be transferred before the access can complete.

To make these operations blatantly obvious, we typically prefer the use of the explicit functions in the std::atomic API. The most useful ones are load(), store(), and fetch_add():

#include <atomic>

std::atomic<int> GlobalScore{0};

void UpdateScore() {
  // Explicitly pull data from shared memory into 
  // a local CPU register
  int current{GlobalScore.load()}; 
  
  // Explicitly push data from our local register 
  // back into shared memory
  GlobalScore.store(current + 10); 

  // Atomically modify the value directly in memory
  GlobalScore.fetch_add(1); 
}

A load() commands the CPU to fetch a snapshot of the live data. Once that instruction completes, the data is entirely local to our thread's registers. We can do as much math as we want on current without triggering any cache ping-pong or locking.

A store() commands the CPU to blast our locally calculated value back out to the shared cache line, invalidating any copies of that line held by other cores.

The fetch_add() function maps to a specialized hardware instruction (such as LOCK XADD on x86 processors) that reads, modifies, and writes the value in a single, indivisible step. No other thread can observe or change the value between the read and the write, and the function returns the value that was stored immediately before the addition.

Functions like fetch_add() perfectly solve the concurrency problem for basic arithmetic, but as we will see, they are of no help when we need to apply complex logic to our data.
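
As a rough illustration of the basic-arithmetic case, the sketch below has several threads increment one shared counter with fetch_add(). The helper name CountWithFetchAdd and its parameters are hypothetical and not part of the lesson's code; the point is only that no increment is lost, whatever the interleaving.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical helper (not from the lesson): starts ThreadCount
// threads that each add 1 to a shared counter
// IncrementsPerThread times, then returns the final count.
int CountWithFetchAdd(int ThreadCount, int IncrementsPerThread) {
  std::atomic<int> Counter{0};
  std::vector<std::thread> Workers;

  for (int t = 0; t < ThreadCount; ++t) {
    Workers.emplace_back([&Counter, IncrementsPerThread] {
      for (int i = 0; i < IncrementsPerThread; ++i) {
        // Read, add, and write happen as one indivisible step.
        // The return value (ignored here) is the value the
        // counter held immediately before this addition.
        Counter.fetch_add(1);
      }
    });
  }

  // Wait for every worker before reading the result
  for (std::thread& Worker : Workers) {
    Worker.join();
  }

  return Counter.load();
}
```

Calling CountWithFetchAdd(4, 10000) should always return 40000. Replacing the fetch_add() call with a separate load() followed by a store() would remove that guarantee.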

Beyond Numbers (Atomic Structs)

Now that we understand the explicit API, let's step beyond simple integers. We want to apply this atomic capability to a custom data structure.

Consider a PlayerState struct that tracks both the health and mana of a character. We want to guarantee that any thread observing the player sees a consistent combination of these two values.

#include <cstdint>
#include <atomic>

// We size and align the struct so it exactly matches
// the width of a 64-bit hardware register
struct alignas(8) PlayerState {
  uint32_t Health;
  uint32_t Mana;
};

std::atomic<PlayerState> State;

For atomic structs, it is generally recommended that we be in full control of every bit - that is, we don't want the compiler adding any padding. To accomplish this, we size our type to fit within a specific register (64 bits, in this case) and then match that alignment (8 bytes) using alignas.
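
One way to have the compiler check these layout assumptions is a handful of static assertions. This is a minimal sketch assuming a C++17 compiler; the constant name StateIsLockFree is our own, and whether it evaluates to true depends on the target platform.

```cpp
#include <atomic>
#include <cstdint>
#include <type_traits>

struct alignas(8) PlayerState {
  uint32_t Health;
  uint32_t Mana;
};

// std::atomic<T> requires T to be trivially copyable
static_assert(std::is_trivially_copyable_v<PlayerState>);

// Exactly 64 bits, aligned to an 8-byte boundary
static_assert(sizeof(PlayerState) == 8);
static_assert(alignof(PlayerState) == 8);

// No padding: every bit belongs to either Health or Mana
static_assert(
  std::has_unique_object_representations_v<PlayerState>);

// True only if this target can always perform the atomic
// operations in hardware, without an internal lock. Kept as
// a value (hypothetical name) rather than a static_assert so
// the sketch still compiles on targets where it is false.
constexpr bool StateIsLockFree{
  std::atomic<PlayerState>::is_always_lock_free};
```

If someone later adds a field that changes the size or introduces padding, the build fails instead of the program silently losing its lock-free behavior.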

We can now read the entire state in one indivisible operation using load(), and write a completely new state using store():

void ProcessPlayer() {
  // 1. Fetch a 64-bit snapshot from shared memory
  PlayerState Snapshot{State.load()}; 

  // 2. We are now working on private, local data
  if (Snapshot.Health < 50) {
    Snapshot.Mana += 10;
  }

  // 3. Write the 64-bit struct back to shared memory
  State.store(Snapshot); 
}

Data Tearing

Why do we need to wrap the struct in std::atomic? Why couldn't we just declare a regular PlayerState globally and read/write to it?

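Without the atomic wrapper, we are vulnerable to a failure known as data tearing. Imagine we are compiling for a 32-bit processor. The CPU only has 32-bit registers, so an ordinary store instruction can write at most 32 bits. Our 64-bit PlayerState struct cannot be written in a single step.

When the compiler generates the machine code, it is forced to break the write into two separate instructions. This creates the following race:

  1. The writer thread stores the new Health into the first 32 bits.
  2. A reader thread copies the whole struct, picking up the new Health alongside the old Mana.
  3. The writer thread stores the new Mana into the second 32 bits.

The reader now holds a combination of values that never existed as a whole. This is a torn read. The std::atomic wrapper guarantees that this will never happen. It makes reads and writes indivisible, using specialized hardware instructions where the target provides them and falling back to an internal lock where it does not, so no thread ever observes a partially written value.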
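
A real torn read depends on timing, so it is hard to trigger on demand. The sketch below instead replays the problematic interleaving on a single thread, using a plain, non-atomic PlayerState written in two halves. The function name ReplayTornRead and the starting values are illustrative assumptions, not part of the lesson.

```cpp
#include <cstdint>

struct PlayerState {
  uint32_t Health;
  uint32_t Mana;
};

// Hypothetical single-threaded replay: a "writer" stores a
// 64-bit state as two 32-bit halves, and a "reader" copies
// the struct in between the two stores.
PlayerState ReplayTornRead() {
  PlayerState Shared{100, 50}; // illustrative starting values
  PlayerState Update{0, 0};    // state the writer wants to publish

  // Writer, instruction 1: store the first 32 bits
  Shared.Health = Update.Health;

  // Reader runs here, between the writer's two instructions
  PlayerState Observed{Shared};

  // Writer, instruction 2: store the second 32 bits
  Shared.Mana = Update.Mana;

  // Observed holds the new Health with the old Mana, a
  // combination the writer never intended to publish
  return Observed;
}
```

With these values the reader sees a player with 0 health and 50 mana, even though the writer only ever meant to publish {100, 50} or {0, 0}.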

The Read-Modify-Write Trap

So, load() and store() prevent data tearing. We can safely pull 64 bits of data into a register, modify it, and push it back.

Does this mean load() and store() are enough to keep everything thread-safe? Unfortunately not. They fix the tearing problem, but we may still have a concurrent logic problem. Let's look at a similar function that applies damage to the player:

void ApplyDamage(uint32_t Damage) {
  // 1. READ
  PlayerState Current{State.load()};

  // 2. MODIFY
  if (Current.Health > Damage) {
    Current.Health -= Damage;
  } else {
    Current.Health = 0;
  }

  // 3. WRITE
  State.store(Current); 
}

This is the classic Read-Modify-Write trap. The load() instruction is atomic. The store() instruction is atomic. But the gap of time between those two instructions is not.

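Let's map out what might happen if two threads are calling ApplyDamage() concurrently: a sword attack that wants to remove 20 health and a fireball that wants to remove 50. Both threads load the same state before either has stored. Each computes its result from that stale snapshot, and each then stores. Whichever store lands second overwrites the first, so the player loses only 20 or only 50 health instead of the full 70.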
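
The sword and fireball interleaving can be replayed deterministically on one thread by performing both loads before either store. This is a sketch under assumptions: the starting state of 100 health and 50 mana, and the names WithDamage and ReplayLostUpdate, are hypothetical and chosen only for illustration.

```cpp
#include <atomic>
#include <cstdint>

struct alignas(8) PlayerState {
  uint32_t Health;
  uint32_t Mana;
};

// Same logic as the MODIFY step of ApplyDamage()
PlayerState WithDamage(PlayerState Current, uint32_t Damage) {
  if (Current.Health > Damage) {
    Current.Health -= Damage;
  } else {
    Current.Health = 0;
  }
  return Current;
}

// Hypothetical replay of the race, in a fixed order
uint32_t ReplayLostUpdate() {
  // Assumed starting state: 100 health, 50 mana
  std::atomic<PlayerState> State{PlayerState{100, 50}};

  // 1. Both "threads" READ before either has written
  PlayerState SwordView{State.load()};
  PlayerState FireballView{State.load()};

  // 2. Each MODIFIES its own private snapshot
  SwordView = WithDamage(SwordView, 20);       // Health 80
  FireballView = WithDamage(FireballView, 50); // Health 50

  // 3. Each WRITES; the second store overwrites the first
  State.store(SwordView);
  State.store(FireballView);

  // The sword's 20 damage has been lost: 50, not 30
  return State.load().Health;
}
```

Every individual load() and store() here is atomic and tear-free, yet the final health is 50 rather than 30. The bug lives entirely in the gap between the read and the write.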

To fix the Read-Modify-Write trap, we cannot just blindly call store(). We need a way to tell the CPU: "Push this new state to memory, but only if nobody else has touched it since I last looked."

We need a single, magical hardware instruction that can check memory and write memory simultaneously. In the next lesson, we will introduce the Compare-And-Swap (CAS) instruction to solve this exact problem.

Summary

In this lesson, we established a more disciplined way to think about shared memory:

  1. The Explicit API: We abandoned implicit assignment operators in favor of load() and store() to make our interactions with atomic containers blatant and intentional.
  2. Registers vs. RAM: A load() pulls a snapshot of shared memory into private, local CPU registers. A store() writes local data back out to the shared cache line.
  3. Data Tearing: We discovered that writing a complex struct can take multiple hardware instructions, leaving a window in which another thread sees a half-written value. We use std::atomic to guarantee indivisible, tear-free reads and writes.
  4. The Race Window: We saw that while individual loads and stores are safe, combining them into a Read-Modify-Write pattern leaves a vulnerable window where other threads can invalidate our work.
