The Join Algorithm
Efficiently pulling data from multiple component pools simultaneously using the smallest set driver pattern, and optimizing it using bitmasks to minimize cache misses.
In the previous lesson, we broke our monolithic data structures apart. We moved from a single table (a structure of arrays) to a model that can represent different data types, like audio and proximity components.
They live in dedicated component pools, meaning any systems we build in the future can efficiently work with just the data they need. If a system only needs to act on nearby entities, it can look at our proximity components and, if it needs the entity data, it can traverse the sparse set in constant time to retrieve it.
But what if our system needs data from multiple component pools? In this lesson, we'll imagine we need to build ProximityAudioSystem. This system needs to find every entity that has both an audio component and a proximity component.
Starting Point
We'll continue working with the SoA containers and SparseSet we created previously. Simplified versions of these have been provided below if needed.
A key change is that we've removed the m_audio column from EntityStorage, which was used for benchmarking in the previous lesson. We've added a new AudioStorage SoA, which is functionally identical to ProximityStorage - it just uses different names for the columns:
Files
The Direct Scan
The most obvious approach is to iterate over every entity that exists and check if they have what we need. Our EntityStorage provides entity IDs (basic int values) which we can then look up in the sparse sets of our component pools.
The following code benchmarks this approach in a situation where 10% of entities have an audio component and 10% have a proximity component. This means that approximately 1% of entities are of interest to our hypothetical proximity audio system:
benchmarks/main.cpp
#include <benchmark/benchmark.h>
#include <dsa/EntityStorage.h>
#include <dsa/ProximityStorage.h>
#include <dsa/AudioStorage.h>
#include <random>
static void BM_NaiveScan(benchmark::State& state) {
int n = state.range(0);
EntityStorage entities;
ProximityStorage prox;
AudioStorage audio;
// Setup: 10% of entities have Proximity, 10% have Audio
// The overlap (Intersection) will be roughly 1%
std::mt19937 rng(42);
std::uniform_int_distribution<int> dist(0, 9);
for (int i = 0; i < n; ++i) {
entities.Add("Entity");
if (dist(rng) == 0) prox.Add(i, 0, 0, 0);
if (dist(rng) == 0) audio.Add(i, 0, 0, 0);
}
for (auto _ : state) {
int count = 0;
// Iterate all entities
for (int i = 0; i < n; ++i) {
// Check for existence in both sets
if (prox.m_map.contains(i) && audio.m_map.contains(i)) {
count++;
}
}
benchmark::DoNotOptimize(count);
}
}
#define BENCHMARK_CONFIG(name) \
BENCHMARK(name) \
->RangeMultiplier(10) \
->Range(100'000, 10'000'000) \
->Unit(benchmark::kMillisecond)
BENCHMARK_CONFIG(BM_NaiveScan);
-------------------------------
Benchmark                  CPU
-------------------------------
BM_NaiveScan/100000   0.148 ms
BM_NaiveScan/1000000   1.51 ms
BM_NaiveScan/10000000  14.1 ms
This is a direct, brute-force solution, and we suspect we can do better, but it gives us a baseline for comparison.
The Smallest Set Driver
Finding "Entities with Audio who also have Proximity" will return the exact same result as "Entities with Proximity who also have Audio", but the performance will be different based on which approach we choose.
We should choose the smallest set as our starting point, or "driver". If we have 50 proximity components and 5,000 audio components, we should iterate through the ProximityStorage, and then check if the associated entity also has an audio component.
This reduces our workload from 10,000 checks (using entities as the driver) or 5,000 checks (using audio as the driver) to just 50 checks (using proximity as the driver).
To use the smallest-set approach, there is the additional overhead of finding which set is smallest, but that is just a few quick size() calls on the component pools.
In the following benchmark, we recreate the same situation as the previous, where 10% of entities have audio, and 10% of them have proximity. The key difference is that we implement our algorithm using this "smallest set" approach:
benchmarks/main.cpp
// ...
static void BM_SmallestSet(benchmark::State& state) {
int n = state.range(0);
EntityStorage entities;
ProximityStorage prox;
AudioStorage audio;
std::mt19937 rng(42);
std::uniform_int_distribution<int> dist(0, 9);
for (int i = 0; i < n; ++i) {
entities.Add("Entity");
if (dist(rng) == 0) prox.Add(i, 0, 0, 0);
if (dist(rng) == 0) audio.Add(i, 0, 0, 0);
}
for (auto _ : state) {
// Calculate which set is smaller
// Smaller set becomes the driver
bool drive_audio = audio.m_volume.size() < prox.m_distance.size();
int count = 0;
if (drive_audio) {
// Driver: Audio (Smallest)
// Lookup: Proximity
for (size_t i = 0; i < audio.m_volume.size(); ++i) {
// Note: Assuming m_dense is accessible for this lesson
int entity_id = audio.m_map.m_dense[i];
if (prox.m_map.contains(entity_id)) {
count++;
}
}
} else {
// Driver: Proximity (Smallest)
// Lookup: Audio
for (size_t i = 0; i < prox.m_distance.size(); ++i) {
int entity_id = prox.m_map.m_dense[i];
if (audio.m_map.contains(entity_id)) {
count++;
}
}
}
benchmark::DoNotOptimize(count);
}
}
// ...
BENCHMARK_CONFIG(BM_SmallestSet);
---------------------------------
Benchmark                    CPU
---------------------------------
BM_NaiveScan/100000     0.151 ms
BM_NaiveScan/1000000     1.46 ms
BM_NaiveScan/10000000    14.2 ms
BM_SmallestSet/100000   0.011 ms
BM_SmallestSet/1000000  0.136 ms
BM_SmallestSet/10000000  2.72 ms
The Cache Problem
The smallest set driver is a big improvement, but there is a major cache inefficiency. Assuming ProximityStorage is the smaller set (the "driver"), our algorithm does the following:
1. Iterate over the proximity storage
2. For each proximity component, find the associated entity ID within the dense array
3. To check if that entity also has an audio component, search the sparse array within the audio component pool via its contains() method
Step 3 is problematic:
// ...
class SparseSet {
public:
// ...
bool contains(int entity_id) const {
if (entity_id >= m_sparse.size()) return false;
// Random memory access
return m_sparse[entity_id] != null_id;
}
// ...
};
The m_sparse vector is sized to the maximum number of entities (e.g., 10,000 or 100,000). When we iterate through our proximity list, the entity_ids we encounter are effectively random numbers.
One iteration might ask for m_sparse[5]. The next might ask for m_sparse[9000]. This is pointer chasing in disguise. We are jumping randomly into a large block of memory. In a realistic system, this address is unlikely to be in the cache, so we are going to stall the CPU waiting for main RAM.
If we are joining 3 or 4 component pools, it gets worse. We would have to perform a random lookup into every pool's sparse array.
We'll discuss some heavyweight options to deal with this later in the lesson, but we can mitigate the problem somewhat by adding signatures to our entities.
Bitmasks and Signatures
One way to reduce this problem is to add the "existence" check to the entity itself. This is generally done using a signature.
A signature is a bitmask where each bit represents a specific component type. If an entity has Proximity (Bit 0) and Audio (Bit 1), their signature would be 00000011.
dsa_core/include/dsa/EntityStorage.h
// ...
#include <cstdint> // for uint8_t
class EntityStorage {
public:
std::vector<std::string> m_names;
// Bit 0: Proximity, Bit 1: Audio
std::vector<uint8_t> m_signatures;
int Add(std::string name) {
m_names.push_back(std::move(name));
// Initialize signature to 0 (no components)
m_signatures.push_back(0);
return int(m_names.size() - 1);
}
// Helper to update signature
void RegisterComponent(int entity_id, uint8_t bit) {
if (entity_id >= 0 && size_t(entity_id) < m_signatures.size()) {
m_signatures[entity_id] |= bit;
}
}
};
We can now check for the required components using bitwise techniques:
// Bit 0: Proximity, Bit 1: Audio
uint8_t HAS_PROXIMITY = 1 << 0;
uint8_t HAS_AUDIO = 1 << 1;
uint8_t HAS_BOTH = HAS_PROXIMITY | HAS_AUDIO;
if ((m_signatures[entity_id] & HAS_BOTH) == HAS_BOTH) {
// entity_id has both components
}
It might not be clear why this is an improvement, and, in isolation, it isn't. Previously, we were jumping to a random location within a large array called m_sparse. Now we're jumping to a random location within an equally large array called m_signatures.
How can this possibly be an improvement? There are a few reasons for this.
Physical Size
Even though m_signatures and m_sparse have the same length, they have a different physical size in memory. An index typically requires 32 or 64 bits, whilst a signature tracking the key components we care about rarely needs to be larger than 8 bits. This means more signatures can fit in the CPU caches.
Temporal Locality
When we have lots of systems running, many of them will be performing these "does this entity have this component?" checks.
Previously, the answers to "does this entity have an audio component?" and "does this entity have a proximity component?" were in two completely different memory locations. One was in the sparse set of our audio pool, and the other was in the sparse set of our proximity pool.
Now, they're in the same place. Only one system needs to pull m_signatures[some_entity], and then it's in the cache for any other systems that need it in the near future.
Spatial Locality
When a system pulled information from a sparse array, most of the adjacent entries on the same cache line were the sentinel values. We were filling our memory bandwidth and caches with junk -1s.
Now, when a system pulls m_signatures[some_entity], all adjacent values are signatures of other entities. Our system, or some other system, may need those values soon, so we've eliminated even more cache misses.
Flexibility
Signatures also give us more flexibility. Firstly, when a system needs to join 3 or more components, we've eliminated some of the lookups - it's one check in m_signatures rather than two or more checks in different m_sparse arrays.
Secondly, not all of our data access patterns start by iterating over a component pool. Sometimes, we just have an entity, and we want to know which components it has. That's much easier if the entity has a signature.
Benchmarking Signatures
Let's add a benchmark to compare this new technique to our previous efforts:
benchmarks/main.cpp
// ...
static void BM_Bitmask(benchmark::State& state) {
int n = state.range(0);
EntityStorage entities;
ProximityStorage prox;
AudioStorage audio;
std::mt19937 rng(42);
std::uniform_int_distribution<int> dist(0, 9);
uint8_t BIT_PROX = 1 << 0;
uint8_t BIT_AUDIO = 1 << 1;
uint8_t MASK_BOTH = BIT_PROX | BIT_AUDIO;
for (int i = 0; i < n; ++i) {
entities.Add("Entity");
if (dist(rng) == 0) {
prox.Add(i, 0, 0, 0);
entities.RegisterComponent(i, BIT_PROX);
}
if (dist(rng) == 0) {
audio.Add(i, 0, 0, 0);
entities.RegisterComponent(i, BIT_AUDIO);
}
}
bool drive_audio = audio.m_volume.size() < prox.m_distance.size();
for (auto _ : state) {
int count = 0;
// We still drive with the smallest set
if (drive_audio) {
for (size_t i = 0; i < audio.m_volume.size(); ++i) {
int entity_id = audio.m_map.m_dense[i];
uint8_t signature = entities.m_signatures[entity_id];
if ((signature & MASK_BOTH) == MASK_BOTH) {
count++;
}
}
} else { // Driver: Proximity
for (size_t i = 0; i < prox.m_distance.size(); ++i) {
int entity_id = prox.m_map.m_dense[i];
uint8_t signature = entities.m_signatures[entity_id];
if ((signature & MASK_BOTH) == MASK_BOTH) {
count++;
}
}
}
benchmark::DoNotOptimize(count);
}
}
// ...
BENCHMARK_CONFIG(BM_Bitmask);
In an isolated benchmark testing a simple two-component join in a pre-warmed cache, we won't get much benefit from signatures, but we should at least ensure we haven't made things worse.
Any improvement will be a result of the smaller size of signatures, which allows larger working sets to fit within a cache:
----------------------------------
Benchmark                     CPU
----------------------------------
BM_NaiveScan/100000      0.150 ms
BM_NaiveScan/1000000      1.51 ms
BM_NaiveScan/10000000     14.4 ms
BM_SmallestSet/100000    0.010 ms
BM_SmallestSet/1000000   0.137 ms
BM_SmallestSet/10000000   2.58 ms
BM_Bitmask/100000        0.009 ms
BM_Bitmask/1000000       0.126 ms
BM_Bitmask/10000000       1.32 ms
Implementing the ProximityAudioSystem
Now that we know the benefits of each approach, we can implement our basic ProximityAudioSystem. There's a lot of ugly code here to manage the iteration. We'll improve it in the next lesson, but the important point for now is to understand the algorithm. The key steps are:
1. Identify the smallest set, which will become the driver of the loop. Let's assume it's the AudioStorage.
2. Iterate the AudioStorage, traversing its dense array to find each audio component's associated entity.
3. Inspect that entity's signature to see if it also has a proximity component.
4. If it does, traverse the ProximityStorage's sparse array to get that component.
5. Gather the data we need from both components and process it.
dsa_app/ProximityAudioSystem.h
#pragma once
#include <dsa/EntityStorage.h>
#include <dsa/ProximityStorage.h>
#include <dsa/AudioStorage.h>
#include <print> // C++23
class ProximityAudioSystem {
public:
void Process(std::string_view name, float distance, float volume) {
std::print(
"Processing audio for {} (distance = {}, volume = {})\n",
name, distance, volume
);
}
void Update(
EntityStorage& entities,
ProximityStorage& prox,
AudioStorage& audio
) {
// STEP 1: Identify smallest set
bool drive_audio = audio.m_volume.size() < prox.m_distance.size();
// Required component bits - Bit 0: Proximity, Bit 1: Audio
uint8_t required_mask = (1 << 0) | (1 << 1);
if (drive_audio) {
for (size_t i = 0; i < audio.m_volume.size(); ++i) {
// STEP 2: What entity am I attached to?
int id = audio.m_map.m_dense[i];
// STEP 3: Does the entity also have a proximity component?
uint8_t signature = entities.m_signatures[id];
if ((signature & required_mask) == required_mask) {
// STEP 4: It does, so get the component id
int prox_idx = prox.m_map.m_sparse[id];
// STEP 5: Gather the data we need to process
// Data from EntityStorage
std::string_view name = entities.m_names[id];
// Data from ProximityStorage
float distance = prox.m_distance[prox_idx];
// Data from AudioStorage
float volume = audio.m_volume[i];
Process(name, distance, volume);
}
}
} else {
// Same process, but using proximity as the driver
for (size_t i = 0; i < prox.m_distance.size(); ++i) {
int id = prox.m_map.m_dense[i];
uint8_t signature = entities.m_signatures[id];
if ((signature & required_mask) == required_mask) {
int audio_idx = audio.m_map.m_sparse[id];
std::string_view name = entities.m_names[id];
float distance = prox.m_distance[i];
float volume = audio.m_volume[audio_idx];
Process(name, distance, volume);
}
}
}
}
};
An example usage of our current system is below, but we'll improve this API soon:
dsa_app/main.cpp
#include <dsa/EntityStorage.h>
#include <dsa/ProximityStorage.h>
#include <dsa/AudioStorage.h>
#include "ProximityAudioSystem.h"
int main() {
EntityStorage entities;
ProximityStorage proximity;
AudioStorage audio;
ProximityAudioSystem system;
uint8_t BIT_PROX = 1 << 0;
uint8_t BIT_AUDIO = 1 << 1;
// Create an entity
int alice = entities.Add("Alice");
proximity.Add(alice, 1.0f, 2.0f, 3.0f);
entities.RegisterComponent(alice, BIT_PROX);
audio.Add(alice, 4.0f, 5.0f, 6.0f);
entities.RegisterComponent(alice, BIT_AUDIO);
// Close but no audio
int bob = entities.Add("Bob");
proximity.Add(bob, 1.0f, 2.0f, 3.0f);
entities.RegisterComponent(bob, BIT_PROX);
// Audio but not close
int charlie = entities.Add("Charlie");
audio.Add(charlie, 4.0f, 5.0f, 6.0f);
entities.RegisterComponent(charlie, BIT_AUDIO);
// Tick the system
system.Update(entities, proximity, audio);
}
Processing audio for Alice (distance = 1, volume = 4)
Complete Code
A complete version of the example from this lesson is available below:
Files
Further Optimizations
The combination of the smallest set driver and bitmask signatures creates a very efficient query mechanism. It minimizes loop iterations and reduces cache misses significantly.
However, there are techniques we can use to remove lookups entirely.
Caching
Not every system needs to work with the very latest data, and this means we don't need to calculate our set intersection every time our system runs. If an entity has audio enabled, they're likely to have audio enabled for a very long time.
Their proximity might change more frequently, but it's still not something we need to check every few milliseconds.
In these scenarios, we can cache the query result. Instead of iterating over the AudioStorage and checking bitmasks every frame, our system can just maintain a std::vector<int> that contains the specific list of IDs that possess both Audio and Proximity.
Perhaps it recalculates this list every few seconds instead of every few milliseconds. If the system is designed not to always require the latest data, that also gives us opportunities to move the data processing off the main thread and into a background task.
Events
There is another way we can keep our storage up to date - we can use events or observers. We can design our storage containers to emit events such as OnComponentAdded() or OnComponentRemoved().
For components that rarely get removed, our ProximityAudioSystem can just monitor these events instead of constantly rechecking which nearby entities still have an audio component.
This moves the cost of the search from the "update" loop (which runs every frame) to the add/remove events (which happen rarely).
Data Layout
There is a more heavy-handed version of this where we group our related entities together. If we arrange our data such that all the entities that have both a proximity component and an audio component are grouped contiguously, our ProximityAudioSystem just needs to iterate between those two indices - no lookups required.
The archetype pattern is this approach taken to its logical conclusion. Our entities get split into different collections based on their signatures. For example:
- Archetype A stores all entities that have only Audio and Proximity.
- Archetype B stores all entities that have only Audio and Physics.
- Archetype C stores all entities that have Audio, Proximity, and Physics.
Our ProximityAudioSystem then just iterates over the tables of Archetypes A and C, without needing to check each entity.
This creates perfectly linear iteration, at the expense of some engineering effort to set up, and additional overhead to maintain this layout when adding or removing a component changes a signature.
High-budget projects typically combine these patterns - a layout optimized around performance-critical components that rarely get removed, with lookup-based techniques for components that are less important or get added and removed often.
Summary
We have built a mechanism to join relational data tables.
- Smallest Set Driver: Always iterate the component type with the fewest entities to minimize loop count.
- Bitmask Filtering: Use a compact central signature to check for component existence. This avoids polluting the cache with random accesses to sparse arrays for entities that don't match our criteria.
- Deferred Lookup: Only pay the cost of looking up the secondary data (via the Sparse Set) once you have confirmed the entity is a valid candidate.
However, the "front end" code we're writing to interact with these systems is incredibly messy. We'd like our component registration to look more like this:
// Before:
int alice = entities.Add("Alice");
proximity.Add(alice, 1.0f, 2.0f, 3.0f);
entities.RegisterComponent(alice, BIT_PROX);
audio.Add(alice, 4.0f, 5.0f, 6.0f);
entities.RegisterComponent(alice, BIT_AUDIO);
// After:
entities.Add("Alice")
.AddComponent<Proximity>(1.0f, 2.0f, 3.0f)
.AddComponent<Audio>(4.0f, 5.0f, 6.0f);And we'd like to replace the monstrous 50-line logic in our ProximityAudioSystem with something that looks like this:
auto noisy_neighbors = entities.GetView<Proximity, Audio>();
for (auto [entity, prox, audio] : noisy_neighbors) {
std::print(
"Processing audio for {} (distance = {}, volume = {})\n",
entity.name, prox.distance, audio.volume
);
}
We'll implement this over the next few lessons.
Templatizing Components
Improving our API using templates and a centralized registry using std::tuple.