Cache Coherency and False Sharing

Explore the performance cost of synchronization, how to mitigate it, and how to avoid it entirely with better algorithm design.

Ryan McCombe

In the previous lesson, we fixed our data races using std::atomic. We successfully prevented data corruption, but we paid a performance tax. Even though an atomic increment is a single CPU instruction, our benchmark showed it was slower than doing the work on a single thread.

This seems counter-intuitive. If we have 8 cores working in parallel, shouldn't we be 8 times faster?

The problem is that our cores do not live in isolation. They share a memory system. When one core modifies a value, it must inform the other cores that their view of the world is now outdated.

In this lesson, we are going to look at the physical cost of that communication. We will explore cache coherency, discover how false sharing can make parallel code slower than sequential code, and learn how to use padding and better algorithm design to restore our performance.

Cache Coherency and Ping-Pong

To understand the cost of atomics, we have to talk about cache coherency.

Modern CPUs have per-core caches (L1/L2). This means that copies of the same underlying values can exist in multiple caches. If Core A modifies counter, the copy in Core B's cache will now be out of date.

To keep the contents of each core's cache consistent with each other and with main memory, CPUs use a cache coherency protocol (such as MESI) to track which core currently "owns" each cache line, and which cores hold stale, invalid copies. A typical exchange looks like this:

  1. Core A wants to update counter.
  2. It sends a message on the internal bus: "I am modifying address 0x1234. Invalidate your copies!"
  3. Core B receives the message and marks its cache line as invalid.
  4. Core A modifies the value.
  5. Core B wants to update counter. It sees its cache is invalid.
  6. It has to ask Core A to flush the new value back to main memory (or transfer it directly between cores).
  7. Core B pulls the latest data in and updates it, invalidating Core A's copy.

This constant negotiation continues until the algorithm is complete, slowing the work to a crawl and saturating the internal bandwidth of the CPU with coherency messages.

This is called cache ping-pong:

Atomic operations deliberately force these ownership transitions and ordering guarantees to ensure thread safety, but they can make multithreaded solutions slower than just having a single core work on the problem.
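To make the pattern concrete, here is a minimal sketch - not the course's benchmark code, just an illustration - of two threads hammering a single shared atomic counter, exactly the access pattern that triggers this ping-pong:

#include <atomic>
#include <thread>

std::atomic<int> counter{0};

// Each increment claims ownership of the cache line holding
// `counter`, invalidating the other core's copy
void Hammer() {
  for (int i = 0; i < 1'000'000; ++i) {
    counter.fetch_add(1, std::memory_order_relaxed);
  }
}

int main() {
  std::thread a{Hammer};
  std::thread b{Hammer};
  a.join();
  b.join();
  // The total is correct (2,000,000), but many increments paid
  // for a cache line ownership transfer along the way
  return counter == 2'000'000 ? 0 : 1;
}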

We'll see ways to alleviate this later in the lesson, but first, let's introduce false sharing - another form of this same underlying problem.

False Sharing

Cache ping-pong is bad, but at least it happens because we are actually sharing data. There is a more frustrating version of this problem where we pay the penalty even when the threads are using entirely different data. This is called false sharing.

Recall that memory is loaded in cache lines (typically 64 bytes). The hardware cannot track ownership of a single byte; it tracks ownership of the entire 64-byte line.

Imagine we have a struct with two atomic counters, operated on by two different threads.

struct Stats {
  // Bytes 0-3 (assuming 4-byte int):
  std::atomic<int> thread_a_counter;
  // Bytes 4-7:
  std::atomic<int> thread_b_counter;
};

These two integers sit right next to each other in memory. They share the same cache line.

  1. Thread A increments thread_a_counter. The CPU invalidates the entire 64-byte line in Thread B's cache.
  2. Thread B tries to increment thread_b_counter. It sees the line is invalid (cache miss). It has to wait for Thread A to flush the data.
  3. Thread B increments its variable. This invalidates the line in Thread A's cache.

The threads are logically independent. They are touching different variables. But physically, they are fighting over the same 64-byte plot of land, so our performance is still crippled by cache ping-pong.

We'll cover two ways to solve this through the rest of this lesson. Let's first look at the most direct solution - we can just spread the values out so they can't be on the same cache line.

Alignment and Padding

We can use an alignment specifier to tell the compiler the minimum alignment our variables require. The following forces our variables to be positioned at memory addresses that are multiples of 64, our cache line size.

This means our a and b variables cannot be on the same cache line, so an algorithm working on a cannot play cache ping-pong with an algorithm working on b:

struct Stats {
  // Bytes 0-3:
  alignas(64) std::atomic<int> a;
   
  // Bytes 4-63: 
  // Padding added by compiler
    
  // Bytes 64-67:
  alignas(64) std::atomic<int> b;
  
  // Bytes 68-127: 
  // Tail padding added by compiler
};

The "tail padding" is added because the compiler assumes there is always another object immediately following it in memory. It means when a collection of Stats objects are packed into an array, the memory layout would look like this:

Portable Padding

What if we're targeting architectures that use different cache line sizes? Apple's M-series chips often use 128-byte cache lines. If we write code that pads variables by 64 bytes and run it on a Mac, we can still suffer from false sharing, because both variables may land within that larger 128-byte line.

To solve this portably, C++17 introduced a specific constant in the <new> header, the succinctly named std::hardware_destructive_interference_size.

This constant tells us the minimum offset required between two objects to ensure they do not share a cache line. Conversely, its sibling std::hardware_constructive_interference_size tells us the maximum size of contiguous memory that is guaranteed to fit within a single cache line.

We use destructive to help build layouts that put things on different cache lines, thereby reducing false sharing. We use constructive to help build layouts that keep things on the same cache line, thereby improving spatial locality.
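If we're curious what these values are on a given toolchain, we can simply print them. A quick sketch, assuming the standard library implements the constants (see the fallback discussion below):

#include <iostream>
#include <new>

int main() {
  std::cout
    << "destructive:  "
    << std::hardware_destructive_interference_size << '\n'
    << "constructive: "
    << std::hardware_constructive_interference_size << '\n';
  // On common x86-64 toolchains, both are typically 64
}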

We'll return to these conflicting goals later in the lesson, and see how to sidestep them with some clever algorithm design.

But for now, let's just increase the size of our objects to fix the false sharing problem. We can combine destructive with the alignas specifier to force the compiler to position our variables on different cache lines, regardless of the architecture we compile for:

#include <new>
#include <atomic>

struct Stats {
  alignas(std::hardware_destructive_interference_size)
    std::atomic<int> a;

  alignas(std::hardware_destructive_interference_size)
    std::atomic<int> b;
};

Compiler Support and Fallbacks

Because this value must be a compile-time constant, it represents the compiler's best guess for the target architecture. Some compilers/standard libraries (older GCC versions, for example) do not implement this feature due to ABI compatibility concerns.

If we know the platforms we're targeting and they share a consistent cache line size, we can just hardcode the value - e.g., 64. If we want to use the standard library helpers with a fallback, we can check the feature test macro:

#include <new>
#include <atomic>
#include <cstddef> // std::size_t for the fallback

// Feature test macro 
#ifdef __cpp_lib_hardware_interference_size 
  // It's supported - use the standard library: 
  using std::hardware_destructive_interference_size; 
#else 
  // Not supported - fallback to something else: 
  constexpr std::size_t hardware_destructive_interference_size = 64; 
#endif 

struct Stats {
  alignas(hardware_destructive_interference_size)
    std::atomic<int> a;

  alignas(hardware_destructive_interference_size)
    std::atomic<int> b;
};

Benchmarking False Sharing

Let's reproduce the false sharing problem in isolation so we can confirm it's real, and then confirm our alignment fixes it. We will create two counters. In the "bad" case, they are adjacent. In the "good" case, we push them apart.

benchmarks/main.cpp

#include <benchmark/benchmark.h>
#include <atomic>
#include <future>
#include <new>

struct BadPadding {
  std::atomic<int> a;
  std::atomic<int> b;
};

struct GoodPadding {
  alignas(std::hardware_destructive_interference_size)
    std::atomic<int> a;
  alignas(std::hardware_destructive_interference_size)
    std::atomic<int> b;
};

static void BM_FalseSharing(benchmark::State& state) {
  BadPadding stats;
  for (auto _ : state) {
    // Launch two async tasks that run continuously
    auto f1 = std::async(std::launch::async, [&](){
      for(int i=0; i<100000; ++i) stats.a++;
    });
    auto f2 = std::async(std::launch::async, [&](){
      for(int i=0; i<100000; ++i) stats.b++;
    });
    
    // Wait for both to finish
    f1.wait();
    f2.wait();
  }
}

static void BM_NoSharing(benchmark::State& state) {
  GoodPadding stats;
  for (auto _ : state) {
    auto f1 = std::async(std::launch::async, [&](){
      for(int i=0; i<100000; ++i) stats.a++;
    });
    auto f2 = std::async(std::launch::async, [&](){
      for(int i=0; i<100000; ++i) stats.b++;
    });
    
    f1.wait();
    f2.wait();
  }
}


#define REGISTER_BENCHMARK(func) \
  BENCHMARK(func) \
    ->Unit(benchmark::kMicrosecond) \
    ->UseRealTime()

REGISTER_BENCHMARK(BM_FalseSharing);
REGISTER_BENCHMARK(BM_NoSharing);

Eliminating the contention means even our simple two-thread example can work through the iterations in a fraction of the time:

The Cost of Isolation

Of course, making our objects physically larger in memory is a trade-off. As we discussed earlier in the course, tightly packing data has many advantages, particularly for single-threaded performance.

When we apply alignas(64), we're optimizing in the exact opposite direction. We're inflating the size of our data with unnecessary gaps.

  1. Wasted Cache Capacity: A 32KB L1 cache can hold 8,000 packed integers. It can only hold 500 aligned integers. We have shrunk our effective cache size by a factor of 16.
  2. Wasted Bandwidth: When the CPU fetches a cache line from RAM, it transfers 64 bytes. In a packed array of integers, that transfer brings in 16 useful items. In an aligned array, it brings in 1 useful item and 60 bytes of air.

In real-world scenarios, we can often mitigate this, particularly in array-of-structures layouts. Those structures carry other variables anyway, so the memory isn't entirely wasted: we can strategically place "payload" data (variables less important to our algorithms) in the gaps between our "key" data.

These objects are the same size as the previous version, and they still have plenty of space left over:

struct GoodPadding {
  // Line 1 (Bytes 0-63)
  alignas(64) std::atomic<int> a;
  int some;
  float other;
  double stuff;
  // Padding injected by compiler...

  // Line 2 (Bytes 64-127)
  alignas(64) std::atomic<int> b;
  int yet;
  float more;
  double things;
  // Tail padding injected by compiler...
};

There is still a fundamental tension: layouts optimized for concurrency are sometimes suboptimal for single-threaded tasks. However, if we're a little smarter with our algorithm design, we can often eliminate that tension entirely.

Avoiding Synchronization

So: mutexes are slow (system calls). Atomics are faster, but still slow compared to unsynchronized access (cache coherency). False sharing is a silent performance killer. Padding can fix it, but has big trade-offs.

The best way to synchronize threads is not to.

Instead of having all threads fight over a shared counter, we can give each thread its own counter. Let them work in isolation, incrementing their local counters with maximum speed and zero contention. Then, at the very end, sum the local results into the global total.
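To illustrate the idea, here's a hand-rolled sketch using two std::async tasks and purely local accumulation. The ParallelSum name and the simple two-way split are just for demonstration:

#include <future>
#include <numeric>
#include <vector>

int ParallelSum(const std::vector<int>& data) {
  const auto mid = data.begin() + data.size() / 2;

  // Each task sums its own half into a purely local value:
  // no shared state, no atomics, no cache line contention
  auto front = std::async(std::launch::async, [&] {
    return std::accumulate(data.begin(), mid, 0);
  });
  auto back = std::async(std::launch::async, [&] {
    return std::accumulate(mid, data.end(), 0);
  });

  // The only synchronization is merging the two partial results
  return front.get() + back.get();
}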

Fortunately, we don't have to write this algorithm ourselves. It is exactly how std::reduce() and std::transform_reduce() work:

// ...

// 5. Multi-Threaded (Correct - Reduce)
// No shared state, no locks, no atomics
// Threads work on local data in registers and only synchronize 
// once at the very end to merge results.
static void BM_Reduce(benchmark::State& state) {
  int counter = 0;
  for (auto _ : state) {
    counter = std::reduce(
      std::execution::par,
      WORK_ITEMS.begin(), WORK_ITEMS.end(),
      0
    );
  }
  benchmark::DoNotOptimize(counter);
}

// ...

REGISTER_BENCHMARK(BM_Reduce);

We'll likely find this solution outperforms every other variation we walked through, even the fast-but-wrong BM_Unsafe:

-------------------------------------------
Benchmark                  Time         CPU
-------------------------------------------
BM_Seq/real_time         248 us      252 us
BM_Unsafe/real_time     84.8 us     77.0 us
BM_Mutex/real_time    153993 us   109375 us
BM_Atomic/real_time      253 us     97.8 us
BM_Reduce/real_time     61.6 us     55.3 us

Complex Reductions

Let's finish off with a more complex example. So far, our reductions have been to a single number. More commonly, we want to analyze a collection and return a more detailed report containing many pieces of information.

A common but flawed approach is to create a type representing the shape of the report, and have all threads share an instance, updating it as they work. To handle the data race, mutexes or atomics are added, reintroducing locks and contention.

Instead, we can use the familiar MapReduce pattern via std::transform_reduce() in a slightly more advanced way. The transform/map step takes each element of our original data and converts it into the report format, carrying that element's partial contribution to the overall stats. The reduce step then combines all of those partial reports.

Below, we have an array of InventorySlot objects representing the player's inventory, and we generate a report of both the total weight and total item count:

#include <numeric>
#include <vector>
#include <execution>

struct InventorySlot {
  int item_count;
  double weight_per_item;
};

// The "Result" type we want to produce
struct InventoryReport {
  int total_items = 0;
  double total_weight = 0.0;
};

InventoryReport CalculateStats(
  const std::vector<InventorySlot>& inventory
) {
  return std::transform_reduce(
    std::execution::par,
    inventory.begin(), inventory.end(),

    // 1. Initial Value - an empty report
    InventoryReport{},

    // 2. Reduce Operation (Combine two reports)
    // This runs thread-locally first, then merges globally.
    [](const InventoryReport& a, const InventoryReport& b) {
      return InventoryReport{
        a.total_items + b.total_items,
        a.total_weight + b.total_weight
      };
    },

    // 3. Transform Operation (Slot -> Report)
    // Converts raw data into our report format
    [](const InventorySlot& slot) {
      return InventoryReport{
        slot.item_count,
        slot.item_count * slot.weight_per_item
      };
    }
  );
}
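A quick usage sketch - the sample values here are made up just to exercise CalculateStats():

#include <iostream>

int main() {
  std::vector<InventorySlot> inventory{
    {3, 2.5},   // 3 items weighing 2.5 each
    {10, 0.1},  // 10 items weighing 0.1 each
    {1, 12.0},  // 1 item weighing 12.0
  };

  const InventoryReport report = CalculateStats(inventory);
  std::cout << "Items: " << report.total_items
            << ", Weight: " << report.total_weight << '\n';
  // Items: 14, Weight: 20.5
}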

We have effectively turned a synchronization problem into a data transformation problem. We aren't updating a report; we are reducing a stream of values. By changing our perspective, we eliminate the need for hardware contention entirely.

  1. No Locks: The intermediate InventoryReport objects are created on the stack of the worker threads. They are private to that thread. No mutexes or atomics are needed.
  2. Cache Friendly: The CPU cores are crunching numbers in their local registers and accessing data in their local L1/L2 caches.
  3. Scalable: The threads only communicate when they need to merge their partial InventoryReport with another thread's result.

Summary

In this lesson, we learned that multithreading is not just about logic; it is also about data layout.

  1. Cache Coherency: We learned that CPU cores must negotiate ownership of memory lines (MESI). This "bus traffic" slows down processing significantly when multiple threads contend for the same data.
  2. False Sharing: We discovered that threads can contend for the same cache line even if they are modifying different variables.
  3. Padding: We can use alignas to forcibly separate data in memory, eliminating false sharing at the cost of memory usage.
  4. Reduction: The ideal solution is to avoid synchronization as much as possible. By using std::reduce() or std::transform_reduce(), we allow threads to work on local data in registers (zero contention) and only merge results at the very end.

We're now familiar with scaling work across multiple cores. But, within each of those cores, there is yet another level of parallelism waiting to be unlocked.

In the next lesson, we will look at vectorization. We will learn how to use SIMD (Single Instruction, Multiple Data) registers to process 4, 8, or even 16 numbers in a single CPU cycle.

