Locks and Atomics
Learn how mutexes and atomics prevent race conditions, and why hardware contention can make multithreaded code slower than single-threaded code.
In the previous lesson, we unlocked the full power of our CPU using parallel algorithms. We saw how std::transform_reduce() could crunch through massive datasets using every available core.
But we ended with a warning. We looked at a simple increment inside a parallel lambda and noted that it was a ticking time bomb.
```cpp
int total_weight = 0;

std::for_each(
  std::execution::par,
  inventory.begin(),
  inventory.end(),
  [&](const auto& item) {
    // Data race: many threads write total_weight at once
    total_weight += item.weight;
  }
);
```

When two threads try to modify the same memory address simultaneously, we get a data race. Our algorithm can now return the incorrect answer, or our program can crash or corrupt memory in ways that are nearly impossible to debug.
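The root of the problem is that this innocent-looking line is not a single step. Here is a sketch of the read-modify-write sequence the compiler generates for it (the AddWeight helper is purely for illustration):

```cpp
struct Item { int weight; };

int total_weight = 0;

// What the single line `total_weight += item.weight;` really does:
void AddWeight(const Item& item) {
  int tmp = total_weight;   // 1. Load the current value from memory
  tmp = tmp + item.weight;  // 2. Add in a CPU register
  total_weight = tmp;       // 3. Store the result back to memory
  // If another thread runs its own steps 1-3 between our steps 1
  // and 3, its update is silently overwritten and lost.
}
```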
To fix this, we need synchronization. We need a way to force threads to wait their turn. In this lesson, we are going to fix the race condition using the two primary tools: std::mutex and std::atomic.
We are also going to measure the cost of the fix and discover that thread safety is expensive. We will visualize how locks put threads to sleep, how atomics trigger cache coherency wars inside your CPU, and how a subtle hardware phenomenon called false sharing can make our parallel code run slower than a single-threaded loop.
Concurrency is a huge topic, with full books and entire university courses dedicated to it. We can only give a high-level overview of the key ideas here, but it should serve as a helpful introduction.
The Big Hammer: std::mutex
The most intuitive way to stop two threads from fighting is mutual exclusion. If Thread A is touching the data, Thread B must wait.
The standard library provides std::mutex for this purpose. You can think of a mutex as a physical key to a room. Only one person can hold the key at a time. If you want to enter the room (access the data) and the key is missing, you have to stand in the hallway and wait until the current owner comes out and hands it to you.
Here is how we protect our counter using a mutex:
```cpp
#include <mutex>
#include <vector>
#include <execution>
#include <algorithm>

struct Item { int weight; };

void ProcessInventory(const std::vector<Item>& items) {
  int total_weight = 0;

  // The guard object
  std::mutex weight_mutex;

  std::for_each(
    std::execution::par,
    items.begin(), items.end(),
    [&](const Item& item) {
      // Acquire the lock. If another thread holds it,
      // we block (sleep) here
      std::lock_guard<std::mutex> lock(weight_mutex);

      // Only one thread can execute this line at a time
      total_weight += item.weight;

      // The lock is released automatically when `lock`
      // goes out of scope, thanks to RAII
    }
  );
}
```

We use std::lock_guard (or std::scoped_lock in C++17) to manage the mutex.
This is the RAII (Resource Acquisition Is Initialization) pattern in action. When the lock variable is created, it calls mutex.lock(). When the variable is destroyed at the end of the scope, it calls mutex.unlock(). This ensures we never accidentally leave the mutex locked.
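To make the benefit concrete, here is a minimal sketch - the function names and shared counter are hypothetical - comparing manual locking with the RAII version:

```cpp
#include <mutex>

std::mutex m;
int shared_value = 0;

void ManualLocking() {
  m.lock();
  shared_value++;  // If this code throws or returns early,
  m.unlock();      // unlock() is never reached: a deadlock later
}

void RaiiLocking() {
  std::lock_guard<std::mutex> lock(m);  // Constructor calls m.lock()
  shared_value++;                       // Exception-safe
}                                       // Destructor calls m.unlock()
```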
The Hardware Cost of Mutexes
This code is now correct. It will never lose data. But mechanically, what happens when a thread hits a locked mutex?
It doesn't (usually) just sit there spinning in a loop, repeatedly checking the flag. That would burn CPU cycles and battery power. Instead, it makes a system call to the operating system, effectively saying: "I cannot proceed. Put me to sleep and wake me up when this mutex is free."
The OS removes the thread from the CPU core and schedules a different program to run. This is a context switch. It involves saving registers, flushing pipelines, and polluting caches. It is an incredibly heavy operation, taking thousands of CPU cycles.
When the mutex is finally unlocked, the OS has to wake the thread up and schedule it back onto a core.
By adding a mutex to a tight loop (like adding integers), we have replaced a single assembly instruction (ADD, ~1 cycle) with a potential context switch (~10,000 cycles). We have serialized our "parallel" program and added massive overhead.
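For contrast, here is roughly what "spinning in a loop checking the flag" would look like: a minimal, illustrative spinlock built on std::atomic_flag. Real mutex implementations vary (many spin briefly before sleeping), so treat this as a sketch of the idea, not of std::mutex internals:

```cpp
#include <atomic>

// Illustration only: a naive spinlock. While the lock is held,
// waiting threads burn CPU cycles instead of sleeping.
class SpinLock {
  std::atomic_flag flag_ = ATOMIC_FLAG_INIT;

public:
  void lock() {
    // Busy-wait until we are the thread that sets the flag
    while (flag_.test_and_set(std::memory_order_acquire)) {
      // spin: keep checking
    }
  }

  void unlock() {
    flag_.clear(std::memory_order_release);
  }
};
```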
Benchmarking: The Cost of Serialization
Let's prove it, but first, we need to adjust how we assess the output of our benchmarks.
By default, we've been looking at CPU Time: the amount of time the processor spends actively executing instructions for our code. If the operating system puts our thread to sleep because of something else happening on the system, the clock pauses. We shouldn't be penalized for that.
But now the reason we're being put to sleep is something we are doing. Our thread is telling the operating system, "I'm locked, put me to sleep". If a program spends 99% of its time sleeping and waiting for locks, the benchmark might report that it is incredibly "efficient" because it used almost no CPU cycles.
But to the user, the program is frozen.
To measure contention - the time wasted waiting for resources - we should switch to wall clock time. This measures the actual time elapsed, regardless of whether the CPU was busy or sleeping.
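The difference is easy to demonstrate in isolation. In this small standalone sketch, a thread that sleeps for one second accrues almost no CPU time but a full second of wall time:

```cpp
#include <chrono>
#include <ctime>
#include <iostream>
#include <thread>

int main() {
  const auto wall_start = std::chrono::steady_clock::now();
  const std::clock_t cpu_start = std::clock();

  // Stand-in for a thread blocked on a lock: the OS puts us to sleep
  std::this_thread::sleep_for(std::chrono::seconds(1));

  const std::clock_t cpu_end = std::clock();
  const auto wall_end = std::chrono::steady_clock::now();

  std::cout << "CPU time:  "
            << 1000.0 * (cpu_end - cpu_start) / CLOCKS_PER_SEC
            << " ms\n";  // ~0 ms: almost no instructions executed
  std::cout << "Wall time: "
            << std::chrono::duration_cast<std::chrono::milliseconds>(
                   wall_end - wall_start).count()
            << " ms\n";  // ~1000 ms: what the user actually waited
}
```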
Applying UseRealTime()
To tell Google Benchmark we're mostly interested in wall clock time, we can add the ->UseRealTime() modifier:

```cpp
// Measure the actual elapsed time, including sleeping/blocking
BENCHMARK(BM_MutexContention)->UseRealTime();
```

The library will then use this metric to decide things such as how many iterations need to run to get a stable result.
Benchmarking Mutexes
Let's write a benchmark that increments a shared counter 1,000,000 times. We will compare an unsafe version (no lock) against a safe version (mutex), as well as a single-threaded implementation for reference (also safe).
benchmarks/main.cpp
```cpp
#include <benchmark/benchmark.h>
#include <vector>
#include <mutex>
#include <execution>
#include <algorithm>

const std::vector<int> WORK_ITEMS(1000000, 1);

// 1. Single-Threaded (Correct)
static void BM_Seq(benchmark::State& state) {
  int counter = 0;
  for (auto _ : state) {
    std::for_each(
      std::execution::seq,
      WORK_ITEMS.begin(), WORK_ITEMS.end(),
      [&](int) {
        counter++;
      }
    );
  }
  benchmark::DoNotOptimize(counter);
}

// 2. Multi-Threaded (Incorrect)
static void BM_Unsafe(benchmark::State& state) {
  int counter = 0;
  for (auto _ : state) {
    std::for_each(
      std::execution::par,
      WORK_ITEMS.begin(), WORK_ITEMS.end(),
      [&](int) {
        // Fast, but wrong: unsynchronized data race
        counter++;
      }
    );
  }
  benchmark::DoNotOptimize(counter);
}

// 3. Multi-Threaded (Correct - Mutex)
static void BM_Mutex(benchmark::State& state) {
  int counter = 0;
  std::mutex mtx;  // The guard
  for (auto _ : state) {
    std::for_each(
      std::execution::par,
      WORK_ITEMS.begin(), WORK_ITEMS.end(),
      [&](int) {
        std::scoped_lock lock(mtx);  // Acquire (bottleneck)
        counter++;                   // Modify
      }                              // Release
    );
  }
  benchmark::DoNotOptimize(counter);
}

#define REGISTER_BENCHMARK(func)      \
  BENCHMARK(func)                     \
    ->Unit(benchmark::kMicrosecond)   \
    ->UseRealTime()

REGISTER_BENCHMARK(BM_Seq);
REGISTER_BENCHMARK(BM_Unsafe);
REGISTER_BENCHMARK(BM_Mutex);
```

The mutex version is orders of magnitude slower than the other two.
This is contention. All of our threads are fighting over a single lock. At any given moment, one thread is working, and the rest are either sleeping, waking up, or trying to acquire the lock. We have taken a parallel machine and forced it to run sequentially, paying a tax for every single operation.
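For reference, this particular sum doesn't need a shared counter at all. The reduction pattern from the previous lesson gives each thread a private partial result and combines them at the end - a sketch, reusing the Item struct from above:

```cpp
#include <execution>
#include <numeric>
#include <vector>

struct Item { int weight; };

int SumWeights(const std::vector<Item>& items) {
  // Each thread accumulates privately; the partial sums are
  // combined at the end. No lock, no shared state, no contention.
  return std::transform_reduce(
    std::execution::par,
    items.begin(), items.end(),
    0,
    std::plus<>{},
    [](const Item& item) { return item.weight; }
  );
}
```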
The Scalpel: std::atomic
For simple data types like int, bool, or pointers, a mutex is overkill. We don't need to put the thread to sleep; we just need to make sure the ADD instruction happens atomically (as one indivisible step).
The C++ standard provides std::atomic<T> for this:
```cpp
#include <algorithm>
#include <atomic>
#include <execution>
#include <vector>

struct Item { int weight; };

void ProcessInventory(const std::vector<Item>& items) {
  // Thread-safe integer
  std::atomic<int> total_weight{0};

  std::for_each(
    std::execution::par,
    items.begin(),
    items.end(),
    [&](const Item& item) {
      // Safe! No mutex required
      total_weight += item.weight;
    }
  );
}
```

When you write total_weight += item.weight, the compiler translates this into a special hardware instruction. On x86 CPUs, this is often a LOCK ADD.
This instruction tells the CPU: "Do not let anyone else touch this memory address until I am done adding this number."
It does not involve the operating system. There are no system calls, no sleeping threads, and no context switches. It happens entirely at the silicon level.
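One more detail worth knowing: on std::atomic&lt;int&gt;, += is shorthand for the fetch_add() member function with the default, sequentially consistent memory ordering. A small sketch of the equivalent spellings:

```cpp
#include <atomic>

std::atomic<int> total_weight{0};

void AddWeight(int w) {
  total_weight += w;                                      // Shorthand
  total_weight.fetch_add(w, std::memory_order_seq_cst);   // Same thing

  // For a plain counter with no ordering requirements, relaxed
  // ordering is sufficient. On x86 this compiles to the same
  // LOCK ADD, but it can be cheaper on other architectures.
  total_weight.fetch_add(w, std::memory_order_relaxed);
}
```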
Benchmark: Atomics vs. Mutex
Let's add an atomic test case to our benchmark suite.
benchmarks/main.cpp
```cpp
#include <atomic>
// ...

// 4. Multi-Threaded (Correct - Atomic)
static void BM_Atomic(benchmark::State& state) {
  std::atomic<int> counter{0};
  for (auto _ : state) {
    std::for_each(
      std::execution::par,
      WORK_ITEMS.begin(), WORK_ITEMS.end(),
      [&](int) {
        // Hardware-level synchronization
        counter++;
      }
    );
  }
  benchmark::DoNotOptimize(counter);
}

// ...
REGISTER_BENCHMARK(BM_Atomic);
```

The atomic version is much faster than the mutex version. But it is also much slower than BM_Unsafe's incrementing of a regular, non-atomic int.
If they both use a single CPU instruction to increment the number, why is the atomic version much slower? We'll cover this in the next lesson.
Summary
In this lesson, we moved from simply running code in parallel to ensuring that code is actually correct.
- **The Race Condition**: We saw that without synchronization, parallel modifications to shared memory result in data corruption.
- **std::mutex**: We can guarantee safety by locking data, putting threads to sleep until they can acquire the key. But the cost of system calls into the operating system makes it incredibly slow for fine-grained tasks.
- **std::atomic**: By using special CPU instructions, we avoided the OS scheduler entirely. This provided a massive speedup over mutexes, but it is still significantly slower than the unsafe version, and even slower than the single-threaded implementation. Why does a single hardware instruction cost so much?
In the next lesson, we will explore cache coherency, the MESI protocol, and how the physical layout of our bytes in memory can destroy our parallel performance through false sharing.
Cache Coherency and False Sharing
Explore the performance cost of synchronization, how to mitigate it, and how to avoid it entirely with better algorithm design.