In the previous lesson, we built the PlayerStorage. This container stores data in a highly efficient Structure of Arrays (SoA) layout but exposes it via a friendly, object-oriented PlayerRef proxy.

We can loop over our players, filter them, and transform them, and we have confirmed this is now much more efficient than our previous array-of-structures (AoS) layout. But let's revisit an earlier problem, and see how we can implement the heavy task of sorting.

We'll implement this in the two variations we introduced previously:

Virtual (Proxy) Sorting: We create a sorted "view" of the data without touching the physical memory. This is fast and minimizes disruption, but it can hurt iteration performance later.
Physical Sorting: We physically rearrange the memory in all our parallel vectors. This is expensive, but we will learn to make it faster using multithreading.

Virtual Sorting (The Proxy Sort)

As we discussed previously, moving data is expensive. If we just want to display a leaderboard, we shouldn't rewrite gigabytes of RAM. We should just calculate the order in which the players appear.

Let's implement the proxy sorting technique first. We can add a method GetSortedIndices() to our PlayerStorage. This method generates a std::vector<size_t> containing the indices 0 to N-1, and then sorts those indices based on the data in our system.

We can eventually design whatever API we want here - perhaps hiding the "proxy" entirely - but let's first get the foundations in place so we can make sure our approach works and is viable. We'll add a GetSortedIndices() function that accepts a comparator, and then returns the proxy array of indices.

We call our comparator with two PlayerRef objects, and will expect it to return true if the first should appear before the second:

dsa_core/include/dsa/PlayerStorage.h

// ...
#include <algorithm> // for std::sort
#include <cstddef>   // for size_t
#include <numeric>   // for std::iota
#include <vector>

class PlayerStorage {
public:
  // ...

  // Returns a list of indices sorted according to
  // the comparator
  std::vector<size_t> GetSortedIndices(auto comp) {
    // 1. Create the indices {0, 1, 2, ... N-1}
    std::vector<size_t> indices(m_ids.size());
    std::iota(indices.begin(), indices.end(), size_t{0});

    // 2. Sort the INDICES, but look at the DATA
    std::sort(
      indices.begin(), indices.end(),
      [&](size_t a, size_t b) {
        // We construct temporary PlayerRefs to pass to
        // the comparator. An optimizing compiler can
        // typically inline these away.
        return comp(
          PlayerRef{
            m_ids[a], m_scores[a], m_health[a], m_names[a]
          },
          PlayerRef{
            m_ids[b], m_scores[b], m_health[b], m_names[b]
          }
        );
      });

    return indices;
  }
};
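
To see the index-sorting idea in isolation, here is a minimal sketch that applies the same std::iota and std::sort steps to a single plain vector of scores. The function name SortedIndices and the sample data are hypothetical, chosen only for illustration; they are not part of PlayerStorage.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: returns the indices of `scores`, ordered so
// that the scores they refer to are ascending. `scores` itself is
// never modified.
std::vector<std::size_t> SortedIndices(const std::vector<int>& scores) {
  // Start with the identity order {0, 1, ..., N-1}
  std::vector<std::size_t> indices(scores.size());
  std::iota(indices.begin(), indices.end(), std::size_t{0});

  // Reorder the indices by comparing the data they point at
  std::sort(indices.begin(), indices.end(),
    [&scores](std::size_t a, std::size_t b) {
      return scores[a] < scores[b];
    });

  return indices;
}
```

For the scores {50, 10, 40}, this returns {1, 2, 0}: the lowest score lives at index 1, the next at index 2, and the highest at index 0.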

Creating the Sorted View

Now that we have the indices, we can create a view that uses them. This effectively gives us a "sorted PlayerStorage" without sorting the underlying data.

We'll provide a GetSortedView() function, and will ask that the sorted indices - the value returned by GetSortedIndices() - be provided as an argument.

We use std::views::transform to map the sorted indices back into PlayerRef objects on the fly:

dsa_core/include/dsa/PlayerStorage.h

// ...
#include <ranges> // for std::views::transform

class PlayerStorage {
public:
  // ...

  // The returned view refers to `indices` rather than copying
  // it, so the vector must outlive the view
  auto GetSortedView(const std::vector<size_t>& indices) {
    // Return a view that yields PlayerRefs in sorted order
    return indices | std::views::transform([this](size_t index) {
      return PlayerRef{
        m_ids[index],
        m_scores[index],
        m_health[index],
        m_names[index]
      };
    });
  }
};

This pattern allows us to maintain multiple different sort orders simultaneously. We can have a score_indices vector and a name_indices vector, both backed by the same physical data.
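
As a rough illustration of holding two orders over the same data, the sketch below uses a hypothetical two-column MiniStorage rather than the real PlayerStorage. The names MakeOrder, ScoreOrder, and NameOrder are assumptions made for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical stand-in for two of the SoA columns
struct MiniStorage {
  std::vector<int> scores;
  std::vector<std::string> names;
};

// Builds an index order from any "index a comes before index b" rule
template <typename Less>
std::vector<std::size_t> MakeOrder(std::size_t count, Less less) {
  std::vector<std::size_t> order(count);
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::sort(order.begin(), order.end(), less);
  return order;
}

// Order by ascending score; the storage is only read
std::vector<std::size_t> ScoreOrder(const MiniStorage& s) {
  return MakeOrder(s.scores.size(),
    [&s](std::size_t a, std::size_t b) {
      return s.scores[a] < s.scores[b];
    });
}

// Order alphabetically by name; the storage is only read
std::vector<std::size_t> NameOrder(const MiniStorage& s) {
  return MakeOrder(s.names.size(),
    [&s](std::size_t a, std::size_t b) {
      return s.names[a] < s.names[b];
    });
}
```

Both results can be kept alive side by side. Neither one modifies the storage, so building the second order does not invalidate the first.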

As long as the underlying data hasn't changed, those proxies remain valid. If it has changed, they will be "stale", which we'll learn how to deal with later in the chapter. For now, let's benchmark our work.

Benchmarking: Virtual vs. Raw

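There is, of course, a catch with proxy sorting. When we iterate through GetSortedView(), we are jumping around in memory. We access index 593, then index 28, then index 9516. This unpredictable access pattern defeats the hardware prefetcher, which relies on sequential or regularly strided access to load data before we ask for it.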

Let's verify this trade-off. We will benchmark:

Physical Iteration: The iteration speed if we traverse the collection in its physical order.
Proxy Sort: The cost to generate the indices proxy.
Proxy Sort Iteration: The iteration speed if we traverse via our intermediate array constructed by proxy sorting.

benchmarks/main.cpp

#include <benchmark/benchmark.h>
#include <dsa/PlayerStorage.h>
#include <random>

static void BM_SoA_PhysicalIterate(benchmark::State& state) {
  int n = state.range(0);
  PlayerStorage ps;
  for (int i = 0; i < n; ++i) ps.AddPlayer(1, 2, 3.0f, "Name");

  for (auto _ : state) {
    float sum = 0.0f;
    for (const PlayerRef player : ps.GetView()) {
      sum += player.score;
    }
    benchmark::DoNotOptimize(sum);
  }
}

static void BM_SoA_ProxySort_Construct(benchmark::State& state) {
  int n = state.range(0);
  PlayerStorage ps;
  for (int i = 0; i < n; ++i) ps.AddPlayer(i, n - i, 100.f, "Name");

  for (auto _ : state) {
    // Measure time to generate indices
    auto indices = ps.GetSortedIndices([](auto a, auto b) {
      return a.score < b.score;
    });
    benchmark::DoNotOptimize(indices.data());
  }
}

static void BM_SoA_ProxySort_Iterate(benchmark::State& state) {
  int n = state.range(0);
  PlayerStorage ps;
  std::mt19937 rng(12345);
  std::uniform_int_distribution<int> dist(0, 100000);

  for (int i = 0; i < n; ++i) {
    ps.AddPlayer(i, dist(rng), 100.f, "Name");
  }

  // Pre-calculate indices
  auto indices = ps.GetSortedIndices([](auto a, auto b) {
    return a.score < b.score;
  });

  for (auto _ : state) {
    float sum = 0.0f;
    // Measure time to walk the scattered memory
    for (auto player : ps.GetSortedView(indices)) {
      sum += player.score;
    }
    benchmark::DoNotOptimize(sum);
  }
}

#define BENCHMARK_STD(func) \
  BENCHMARK(func) \
    ->RangeMultiplier(10) \
    ->Range(10 * 1000, 1000 * 1000) \
    ->Unit(benchmark::kMillisecond)

BENCHMARK_STD(BM_SoA_PhysicalIterate);
BENCHMARK_STD(BM_SoA_ProxySort_Construct);
BENCHMARK_STD(BM_SoA_ProxySort_Iterate);

You're likely to see the proxy-based iteration keep up as long as the working set is small enough to fit within the pre-warmed caches, with performance falling off rapidly beyond that point:

---------------------------------------------
Benchmark                                 CPU
---------------------------------------------
BM_SoA_PhysicalIterate/10000         0.008 ms
BM_SoA_PhysicalIterate/100000        0.077 ms
BM_SoA_PhysicalIterate/1000000       0.781 ms
BM_SoA_ProxySort_Construct/10000     0.122 ms
BM_SoA_ProxySort_Construct/100000     1.35 ms
BM_SoA_ProxySort_Construct/1000000    17.7 ms
BM_SoA_ProxySort_Iterate/10000       0.007 ms
BM_SoA_ProxySort_Iterate/100000      0.080 ms
BM_SoA_ProxySort_Iterate/1000000      3.56 ms

Physical Sorting (Applying Permutations)

Let's say we really do want to physically sort our SoA container to eliminate the iteration cost of the proxy indirection.

When we already have the indices array that expresses the order we want the physical data to be in, we just need to place the data in that precalculated order.

We create a temporary vector, move elements into it based on the sorted indices, and then swap it back. Our SoA system has 4 arrays and, to keep everything in sync, we need to perform this action on all of them:

dsa_core/include/dsa/PlayerStorage.h

// ...
#include <type_traits> // for std::decay_t
#include <utility>     // for std::move

class PlayerStorage {
public:
  // ...
  // Permute all the arrays in the SoA. Assumes `indices` is a
  // valid permutation of 0..N-1: a repeated index would move
  // from the same element twice.
  void ApplyPermutation(const std::vector<size_t>& indices) {
    if (indices.size() != m_ids.size()) return;

    // Helper to permute a single array
    auto permute_vec = [&](auto& vec) {
      using T = typename std::decay_t<decltype(vec)>::value_type;

      std::vector<T> temp;
      temp.reserve(vec.size());

      for (size_t i : indices) {
        temp.push_back(std::move(vec[i]));
      }
      vec = std::move(temp);
    };

    // Call the helper for every array
    permute_vec(m_ids);
    permute_vec(m_scores);
    permute_vec(m_health);
    permute_vec(m_names);
  }
};
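
The permutation step relies on the index list containing every position exactly once; a repeated index would move from the same element twice and leave a moved-from value in the result. The sketch below shows the same technique on two parallel arrays with a validity check added in front. The helpers IsPermutation, Permute, and PermuteBoth are hypothetical names for this example and are not part of the lesson's PlayerStorage.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical check: true if `indices` contains every value in
// [0, indices.size()) exactly once
bool IsPermutation(const std::vector<std::size_t>& indices) {
  std::vector<bool> seen(indices.size(), false);
  for (std::size_t i : indices) {
    if (i >= indices.size() || seen[i]) return false;
    seen[i] = true;
  }
  return true;
}

// Reorders `values` so that element k of the result is the
// element that was previously at indices[k]
template <typename T>
void Permute(std::vector<T>& values,
             const std::vector<std::size_t>& indices) {
  std::vector<T> reordered;
  reordered.reserve(values.size());
  for (std::size_t i : indices) {
    reordered.push_back(std::move(values[i]));
  }
  values = std::move(reordered);
}

// Applies one permutation to two parallel arrays. Returns false
// and leaves both untouched if the permutation is not valid.
bool PermuteBoth(std::vector<int>& scores,
                 std::vector<std::string>& names,
                 const std::vector<std::size_t>& indices) {
  if (indices.size() != scores.size() ||
      indices.size() != names.size() ||
      !IsPermutation(indices)) {
    return false;
  }
  Permute(scores, indices);
  Permute(names, indices);
  return true;
}
```

Because both arrays are reordered with the same index list, row k of scores still describes the same player as row k of names afterwards.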

Benchmarking: Physical Sorting

Physically sorting a large amount of data is unavoidably expensive, but we should at least make sure we haven't made things worse:

#include <benchmark/benchmark.h>
#include <dsa/PlayerStorage.h>
#include <random>
#include <vector>
#include <algorithm>
#include <string>

struct Player {
  int id;
  int score;
  float health;
  std::string name;
};

static void BM_AoS_PhysicalSort(benchmark::State& state) {
  int n = state.range(0);

  for (auto _ : state) {
    // We don't want to benchmark sorting a container that
    // is already sorted, so we generate fresh data for each
    // test. We don't want this creation of data to be included
    // in the benchmark results, so we pause the timer
    state.PauseTiming();
    std::vector<Player> players;

    for (int i = 0; i < n; ++i) {
      Player p;
      p.id = i;
      p.score = n - i; // Reverse sorted
      p.health = 100.f;
      p.name = "Name";
      players.push_back(p);
    }
    state.ResumeTiming();

    // Measure Sort
    std::sort(
      players.begin(), players.end(),
      [](const Player& a, const Player& b) {
        return a.score < b.score;
      }
    );

    benchmark::DoNotOptimize(players.data());
  }
}

static void BM_SoA_PhysicalSort(benchmark::State& state) {
  int n = state.range(0);

  for (auto _ : state) {
    // Generate fresh data
    state.PauseTiming();
    PlayerStorage ps;
    for (int i = 0; i < n; ++i) {
      ps.AddPlayer(i, n - i, 100.f, "Name");
    }
    state.ResumeTiming();

    // Measure Sort
    // Measure time to generate indices...
    auto indices = ps.GetSortedIndices([](auto a, auto b) {
      return a.score < b.score;
    });

    // ...and to apply the permutation
    ps.ApplyPermutation(indices);

    benchmark::DoNotOptimize(indices);
  }
}

#define BENCHMARK_STD(func) \
  BENCHMARK(func) \
    ->RangeMultiplier(10) \
    ->Range(10 * 1000, 1000 * 1000) \
    ->Unit(benchmark::kMillisecond)

BENCHMARK_STD(BM_AoS_PhysicalSort);
BENCHMARK_STD(BM_SoA_PhysicalSort);

--------------------------------------
Benchmark                          CPU
--------------------------------------
BM_AoS_PhysicalSort/10000     0.272 ms
BM_AoS_PhysicalSort/100000     2.71 ms
BM_AoS_PhysicalSort/1000000    45.0 ms
BM_SoA_PhysicalSort/10000     0.253 ms
BM_SoA_PhysicalSort/100000     2.42 ms
BM_SoA_PhysicalSort/1000000    46.0 ms

Even though our new layout doesn't reduce the raw amount of data needing to be moved, it does give us some flexibility on how we can solve some heavy problems like this.

Concurrency

Because our data layout is physically split, we don't need complex synchronization to update different components at the same time. We can just tell the CPU to "go fix the IDs" and "go fix the Names" simultaneously. We covered this idea of concurrency earlier in the course.

We can use the std::async helper to launch the permutation of each of our arrays as a discrete, independent task. These tasks are safe to run at the same time because each one writes to a different vector and only reads the shared indices:

dsa_core/include/dsa/PlayerStorage.h

// ...
#include <future> // For std::async

class PlayerStorage {
public:
  // ...
  void ApplyPermutation(const std::vector<size_t>& indices) {
    if (indices.size() != m_ids.size()) return;

    auto permute_vec = [&](auto& vec) {
      using T = typename std::decay_t<decltype(vec)>::value_type;

      std::vector<T> temp;
      temp.reserve(vec.size());

      for (size_t i : indices) {
        temp.push_back(std::move(vec[i]));
      }
      vec = std::move(temp);
    };

    // Launch a task for each array using std::async and
    // std::launch::async. The OS scheduler will distribute
    // these across available cores
    const auto policy = std::launch::async;
    auto f1 = std::async(policy, [&](){ permute_vec(m_ids); });
    auto f2 = std::async(policy, [&](){ permute_vec(m_scores); });
    auto f3 = std::async(policy, [&](){ permute_vec(m_health); });
    auto f4 = std::async(policy, [&](){ permute_vec(m_names); });

    // Wait for all tasks to finish before returning control
    // to the caller
    f1.wait(); f2.wait(); f3.wait(); f4.wait();
  }
};
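
The same pattern can be tried outside the class. The sketch below permutes two parallel vectors concurrently; Permute and PermuteConcurrently are hypothetical names for this example. It is also an assumption of this sketch, not something the lesson's code does, that we call get() rather than wait(): both block until the task finishes, but get() additionally rethrows any exception the task threw.

```cpp
#include <cstddef>
#include <future>
#include <string>
#include <utility>
#include <vector>

// Reorders `values` so that element k of the result is the
// element that was previously at indices[k]
template <typename T>
void Permute(std::vector<T>& values,
             const std::vector<std::size_t>& indices) {
  std::vector<T> reordered;
  reordered.reserve(values.size());
  for (std::size_t i : indices) {
    reordered.push_back(std::move(values[i]));
  }
  values = std::move(reordered);
}

// Hypothetical helper: permutes two parallel arrays at the same
// time. Assumes `indices` is a valid permutation for both.
void PermuteConcurrently(std::vector<int>& scores,
                         std::vector<std::string>& names,
                         const std::vector<std::size_t>& indices) {
  // Each task writes to a different vector and only reads
  // `indices`, so no locking is needed
  auto scores_done = std::async(std::launch::async,
    [&] { Permute(scores, indices); });
  auto names_done = std::async(std::launch::async,
    [&] { Permute(names, indices); });

  // Block until both tasks have finished, rethrowing any
  // exception a task threw
  scores_done.get();
  names_done.get();
}
```

The lambdas capture by reference, which is only safe because the function does not return until both tasks have completed.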

Remember, we can also tell std::sort() that it's safe to sort our original AoS layout in parallel. As we covered in a previous lesson, we can do this by passing std::execution::par as the first argument:

// ...
#include <execution> // for std::execution::par

// ...

static void BM_AoS_PhysicalSort(benchmark::State& state) {
  int n = state.range(0);

  for (auto _ : state) {
    // ...

    std::sort(
      std::execution::par,
      players.begin(), players.end(),
      [](const Player& a, const Player& b) {
        return a.score < b.score;
      }
    );

    benchmark::DoNotOptimize(players.data());
  }
}

// ...
23
24// ...

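On my system, this makes both approaches around 50% less demanding on the main thread at the largest size of one million players. At 10,000 players both approaches actually get slower, most likely because the overhead of starting the extra work outweighs the gain on such a small input. These figures are main-thread CPU time, which is what Google Benchmark displays in the CPU column by default: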

--------------------------------------
Benchmark                          CPU
--------------------------------------
BM_AoS_PhysicalSort/10000     0.441 ms
BM_AoS_PhysicalSort/100000     1.80 ms
BM_AoS_PhysicalSort/1000000    22.4 ms
BM_SoA_PhysicalSort/10000     0.419 ms
BM_SoA_PhysicalSort/100000     2.58 ms
BM_SoA_PhysicalSort/1000000    22.5 ms

API Improvements

We've proven the performance, but the API is still a bit raw. Because we are building a custom container, we have the freedom (and the burden) to implement exactly the features we need.

This will depend on our exact use case, but for ideas on what might be generally useful, we can look at standard library containers like std::vector. Maybe we want to implement the [] operator to give access to individual players, or functions like clear() and reserve().

Something like size() would be useful to count the number of players in the system, but we'll implement a more powerful version of this capability later in the chapter.
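
As one possible shape for these members, here is a sketch on a hypothetical two-column MiniStorage with its own MiniPlayerRef proxy. Every name here is an assumption for illustration, and the size() shown is only the simplest version, not the more powerful one developed later in the chapter.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical two-column version of the proxy type
struct MiniPlayerRef {
  int& score;
  std::string& name;
};

class MiniStorage {
public:
  void Add(int score, std::string name) {
    m_scores.push_back(score);
    m_names.push_back(std::move(name));
  }

  // Unchecked access, mirroring std::vector::operator[]
  MiniPlayerRef operator[](std::size_t index) {
    return MiniPlayerRef{m_scores[index], m_names[index]};
  }

  // All columns have the same length, so any one of them
  // can report the count
  std::size_t size() const { return m_scores.size(); }

  // Forward to every column so they stay in step
  void reserve(std::size_t capacity) {
    m_scores.reserve(capacity);
    m_names.reserve(capacity);
  }

  void clear() {
    m_scores.clear();
    m_names.clear();
  }

private:
  std::vector<int> m_scores;
  std::vector<std::string> m_names;
};
```

The main design point is that every container-wide operation has to be forwarded to each column; forgetting one would leave the arrays with different lengths.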

Implementing Projections

One of the benefits of C++20 range-based algorithms is the projection feature, allowing us to write code like this:

std::ranges::sort(party, {}, &PlayerRef::score);

Our proxy PlayerRef implementation doesn't support this - its members are references, and we can't take the address of a reference member.

Note that using this lightweight proxy technique is entirely optional - nothing about our system requires this API.

Our API can be set up to accept and provide full blown Player objects, completely unrestricted. It can be more difficult to avoid performance overheads in this case, but performance is only one factor. Creating a better API at the cost of performance can be a totally reasonable choice for many cases.

However, if we wanted to provide a projection-like API for our zero-cost abstraction, we could add helpful functions to support common requirements. This might be added to the PlayerRef type, to our underlying PlayerStorage type, or to a nearby namespace:

dsa_core/include/dsa/PlayerStorage.h

namespace PlayerComparators {
inline constexpr auto ByScore = [](const PlayerRef& a, const PlayerRef& b) {
  return a.score < b.score;
};

inline constexpr auto ByHealth = [](const PlayerRef& a, const PlayerRef& b) {
  return a.health < b.health;
};
}

// Usage:
auto indices = ps.GetSortedIndices(PlayerComparators::ByScore);
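
Another option that gets closer to the feel of a projection is a helper that sorts indices by a key function instead of a full comparator. This is a sketch under assumptions, not part of the lesson's API: SortedIndicesBy and HealthOrder are hypothetical names, and a plain vector of health values stands in for the storage.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: sorts the indices [0, count) by a key.
// `key_of(i)` plays the role of a projection: it returns the
// value that element i should be ordered by.
template <typename KeyOf>
std::vector<std::size_t> SortedIndicesBy(std::size_t count,
                                         KeyOf key_of) {
  std::vector<std::size_t> indices(count);
  std::iota(indices.begin(), indices.end(), std::size_t{0});
  std::sort(indices.begin(), indices.end(),
    [&key_of](std::size_t a, std::size_t b) {
      return key_of(a) < key_of(b);
    });
  return indices;
}

// Example use: order players by ascending health
std::vector<std::size_t> HealthOrder(
    const std::vector<float>& health) {
  return SortedIndicesBy(health.size(),
    [&health](std::size_t i) { return health[i]; });
}
```

The caller only says which value to sort by; the comparison itself stays inside the helper, which is the convenience projections offer.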

Implementing `const`

Finally, a container isn't complete without const support. If we pass const PlayerStorage& to a function, we should still be able to iterate it.

We need a ConstPlayerRef that holds const references.

dsa_core/include/dsa/PlayerStorage.h

// ...

struct ConstPlayerRef {
  const int& id;
  const int& score;
  const float& health;
  const std::string& name;
};

class PlayerStorage {
public:
  // ...

  // Note: std::views::zip requires C++23
  auto GetConstView() const {
    return std::views::zip(
      m_ids, m_scores, m_health, m_names
    ) | std::views::transform([](auto&& tuple) {
          const auto& [id, score, hp, name] = tuple;
          return ConstPlayerRef{id, score, hp, name};
        });
  }
};
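
The same const and non-const split applies to element access. The sketch below uses a hypothetical two-column MiniStorage with a pair of operator[] overloads, so it does not depend on std::views::zip; all names are assumptions for illustration.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical proxies: one allows writes, one is read-only
struct MiniPlayerRef {
  int& score;
  std::string& name;
};

struct MiniConstPlayerRef {
  const int& score;
  const std::string& name;
};

class MiniStorage {
public:
  void Add(int score, std::string name) {
    m_scores.push_back(score);
    m_names.push_back(std::move(name));
  }

  std::size_t size() const { return m_scores.size(); }

  // Selected when the storage is non-const
  MiniPlayerRef operator[](std::size_t i) {
    return {m_scores[i], m_names[i]};
  }

  // Selected when the storage is const
  MiniConstPlayerRef operator[](std::size_t i) const {
    return {m_scores[i], m_names[i]};
  }

private:
  std::vector<int> m_scores;
  std::vector<std::string> m_names;
};

// Takes the storage by const reference, so only the const
// overload of operator[] is available inside
int TotalScore(const MiniStorage& storage) {
  int total = 0;
  for (std::size_t i = 0; i < storage.size(); ++i) {
    total += storage[i].score;
    // storage[i].score = 0; // would not compile: const int&
  }
  return total;
}
```

Overload resolution picks the proxy type from the constness of the storage, so callers holding a const reference get read-only access without any extra effort.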

Summary

In this lesson, we rounded out our domain-specific container with the ability to deliver data in a specific order.

Proxy Sorting: We implemented GetSortedIndices() to perform cheap, non-destructive sorting. This is excellent for one-off queries.
Physical Sorting: We implemented ApplyPermutation to physically rearrange the memory for more permanent changes.
Parallelization: We used std::async to permute the SoA arrays concurrently, and std::execution::par to parallelize sorting the AoS layout.

In the next lesson, we will cover the problems associated with deleting objects from our collections and strategies on how to mitigate them.
