Advanced Bitwise Techniques
Use bloom filters for fast rejection, hierarchical bitmasks for spatial skipping, and hardware intrinsics using C++20's <bit> library.
In this chapter, we have explored bits primarily as a compression tool. We packed booleans into bytes to save memory (PlayerFlags), and we replaced logical branches with bitwise math to save CPU cycles.
But bits are more than just compressed booleans. Because the CPU can process 64 bits in a single cycle, bits allow us to reason about sets of data in parallel.
If we view a 64-bit integer not as a number, but as a row of 64 independent slots, we can query the state of 64 entities simultaneously. We can check existence, find gaps, and count totals without iterating.
In this final lesson on bits, we will give a quick tour of some more advanced patterns that rely on this property:
- Bloom Filters: A probabilistic structure used to reject expensive queries instantly.
- Hierarchical Bitmasks: A tree structure built of bits, allowing us to skip massive chunks of empty space in search algorithms.
- Hardware Intrinsics: Using the C++20
<bit>header to access specialized hardware instructions for population counts and bit scanning. - Dirty Masks: A pattern for optimizing data synchronization by tracking changes at the bit level.
Bloom Filters
Conceptually, a bloom filter acts somewhat like container that can answer "do you have this item?" queries very quickly. The interesting property is that they don't actually need to store the items, making them extremely memory efficient.
The trade off is that they have false positives. When we ask a bloom filter if it has an item, it can respond in two possible ways:
- Definitely not
- I think so, but I'm not 100% sure
Practically, this makes them most useful when we want to ask questions where the answer is relatively expensive to find, and almost always "no".
Example Use Case
Imagine we have loads of players logging into our game, but some tiny proportion of them are from banned accounts.
Checking every single login attempt for whether the user is banned is expensive, especially when the answer we get back is "no" 99.99% of the time.
But we still want to prevent them logging in. So, rather than checking the database on every request, we can first ask a bloom filter if they're on the banned list. If the immediate response is "definitely not", we let the user log in. If it is "I think so, but I'm not 100% sure", we can then perform the more expensive check.
Bloom filters are an architectural pattern called a fast rejection path. We accept a small amount of "maybe" to get a massive amount of "definitely not." Accessing a bloom filter takes nanoseconds, whilst accessing an external database to get the "definitely yes" might take milliseconds.
The Mechanism
A bloom filter is a large array of bits, all 0 to start with. Every time we "add" an item to the collection, let's say "Alice", we hash that input to generate a few indicies of that array, and we set those corresponding bits to 1. Once a bit is set to 1 in a bloom filter, it never goes back to 0.
When we later ask "do you have Alice?", the bloom filter will hash it again to get those same indicies, and then check those corresponding bits.
If they're not all toggled on, the filter can respond "definitely not - if Alice was added before, all of these bits would be 1, but I can see that they're not".
If they are all 1, the filter can respond "I think so, but it is possible that some combination of other items I added also happened to hash to these same bits"
This "false positive" rate is something we can tune. If we give the bloom filter more bits (more memory), the probability of these collisions reduce, and so the false positive rate reduces.
Bloom Filter Implementations
The Boost library includes a bloom filter in the form of Boost.Bloom, and ArashPartow's header-only library (shown below) is also popular.
#include <iostream>
#include <string>
#include "bloom_filter.hpp"
int main() {
bloom_parameters parameters;
// Approximately how many elements do we want to track?
parameters.projected_element_count = 1000;
// What false positive rate can we tolerate?
parameters.false_positive_probability = 0.001; // 1 in 1000
// Based on the element count and false positive rate,
// determine how many bits we should use and a hashing strategy
parameters.compute_optimal_parameters();
// Create the filter
bloom_filter filter(parameters);
// Insert some stuff
filter.insert(std::string{"Alice"});
filter.insert(std::string{"Bob"});
filter.insert(std::string{"Charlie"});
// Check if the filter has seen something
if (filter.contains(std::string{"Alice"})) {
std::cout << "I think I have Alice";
} else {
std::cout << "I definitely don't have Alice";
}
}Hardware Intrinsics using C++20 <bit>
To implement the hierarchy above efficiently, we need to answer questions like "which is the first bit set?" or "how many bits are set?".
Historically, programmers used less efficient workarounds for this. Modern CPUs have dedicated silicon for these operations. C++20 exposes them in the <bit> header.
These functions are intrinsics. They typically compile down to a single assembly instruction.
Checking Rules with std::popcount()
Population Count (sometimes called the Hamming Weight) counts the number of set bits.
This is useful for threshold logic. Suppose our game has a quest requirement: "Player must have visited at least 3 of these 5 locations."
We can store "visited locations" as a bitmask.
#include <bit>
#include <cstdint>
enum Locations : uint8_t {
Town = 1 << 0,
Cave = 1 << 1,
Tower = 1 << 2,
Forest = 1 << 3,
Dungeon = 1 << 4
};
bool CanEnterBossRoom(uint8_t visited_mask) {
// Requirement: Visit at least 3 locations
// The naive way: Loop through bits (Slow)
// The hardware way: POPCNT instruction (1 cycle)
return std::popcount(visited_mask) >= 3;
}You might wonder why we don't just store this as the number 3, but the bitmask contains more information than that. We can still use regular bitwise logic to determine which areas were visited:
if (visited_mask & Forest) {
// We know they visited the Forest
}
if ((visited_mask & Dungeon) == 0) {
// We know they have NOT visited the Dungeon
}
if ((visited_mask & (Town | Cave)) == (Town | Cave)) {
// We know they visited both the Town AND the Cave
}Finding Slots with std::countr_zero()
Count Trailing Zeros (countr_zero) tells us the index of the least significant set bit, or equivalently, how many zeroes are at "the end":
This is another way we could have solved our free list problem in the previous chapter. If we maintain a bitmask where 1 means "free" and 0 means "taken", finding the first free slot can be done with std::countr_zero(free_mask):
#include <bit>
#include <cstdint>
#include <optional>
class SmallAllocator {
// 1 = Free, 0 = Taken
// Start with all 1s (all 64 slots free)
uint64_t m_free_mask = ~0;
public:
std::optional<int> Allocate() {
if (m_free_mask == 0) return std::nullopt; // Full
// Find the index of the first '1' bit
// Hardware Instruction: TZCNT / BSF
int index = std::countr_zero(m_free_mask);
// Mark it as taken (set bit to 0)
m_free_mask &= ~(1 << index);
return index;
}
void Free(int index) {
// Mark as free (set bit to 1)
m_free_mask |= (1 << index);
}
};This allocator operates in constant time , uses zero extra memory overhead (it fits in a register), and has zero cache misses. For managing small pools of objects this beats any general-purpose allocator.
Hierarchical Bitmasks
In previous lessons, we saw ways to quickly filter through our collections based on flags like whether a player is friendly, or nearby. The approaches we used were reasonable if a decent proportion of our collection did meet the criteria.
But what if we're searching for a needle in a haystack. We might have millions of items, and fewer than 10 have the flag we're looking for. We'll walk through a problem like this using sparse sets in the next chapter, but hierarchical bitmasks are also an option.
The Base Layer
Imagine we have a large world that we want to update, but we don't want to waste resources processing regions nobody is in. We could split our world into lots of tiny regions, and use a bit to track whether each region has any players.
We'll call this base_layer, for reasons that will become apparant soon:
// 64 bits
uint64_t base_layer;In the following example, we split our world into 64 regions, and map each to a bit within our base_layer. That bit is 1 if the region contains any players:
The Pyramid of Bits
Lets scale this problem up to 4096 regions. Our base_layer collection of bits would now be an array of 64 uint64_t values:
// 64x64 = 4096 bits
std::array<uint64_t, 64> base_layer;When we have lots of regions and most of them are empty, this is a sparse data problem. Almost all of our bits will be 0, and simply finding the few regions we need to work on can become expensive.
To help with this, we can add an additional uint64_t that summarizes the contents of our 4096-bit base layer:
uint64_t summary_layer;
std::array<uint64_t, 64> base_layer;Each bit in summary_layer maps to a 64-bit entry in base_layer.
- If a summary layer bit is
0, then all of the 64 bits it maps to inbase_layerare0. - If a summary layer bit is
1, then at least one of the bits in the correspondingbase_layerentry is1.
This means we can now skip over 64 regions of our world in a single step.
We can add more layers as needed. A three-layer implementation gives us a highly-optimized way to track over 260,000 bits - 64 x 64 x 64.
Each 0 in the topmost layer lets us skip the 4,096 bits it summarizes.
The Dirty Mask
The final pattern addresses the problem of data synchronization, where we need to keep track of what needs to be synchronized.
In a multiplayer game, for example, we need to replicate the state of a player to the connected clients. However, in a complex game, there is a massive amount of data that we need to keep in sync - their position, their rotation, where they're looking, how much health they have, and more.
A "dirty mask" (or "dirty flag") is a technique used to keep track of what has changed since the last time the data were processed, saved, or synronized.
A bit mask is perfect for this. Each bit corresponds to a specific category of data (position, health, ammo, etc) and we set those bits to 1 when the corresponding data changes.
Our networking system then examines our masks to see what needs syncronized, sends only the required data, then sets the bits back to 0, ready for the next iteration.
struct Player {
// Data
vec3 position;
float health;
int ammo;
// Metadata
uint8_t dirty_flags = 0;
};
enum DirtyBits {
DIRTY_POS = 1 << 0,
DIRTY_HP = 1 << 1,
DIRTY_AMMO = 1 << 2,
DIRTY_ALL = 0xFF
};
// Logic System
void DamagePlayer(Player& p, float amount) {
p.health -= amount;
// Mark HP as dirty
p.dirty_flags |= DIRTY_HP;
}
// Replication System
void SendUpdate(Player& p, NetworkStream& stream) {
if (p.dirty_flags == 0) return; // Nothing changed
// 1. Send the mask so client knows what's coming
stream.Write(p.dirty_flags);
// 2. Only write the fields that changed
if (p.dirty_flags & DIRTY_POS) stream.Write(p.position);
if (p.dirty_flags & DIRTY_HP) stream.Write(p.health);
if (p.dirty_flags & DIRTY_AMMO) stream.Write(p.ammo);
// 3. Reset logic
p.dirty_flags = 0;
}This pattern is the cornerstone of efficient networking. It moves the cost of change detection to the point of modification (the setter), which is cheap, saving the serialization system from having to diff memory.
We then only send data when the corresponding dirty bit is set, indicating something has changed. Those changes can be further compressed using the delta encoding techniques we covered in the previous lesson.
Summary
In this chapter, we have focused on working with our data on the level of individual bits.
- Bit Packing: We can treat types like
uint8_tnot as a small number, but as collection in their own right. It can be a suitcase for up to 8 boolean values. - Bitwise Filtering: We saw how logical operators like
&and|can process conditions without branching, keeping the CPU pipeline full. - Bloom Filters: We used bits to probabilistically reject queries.
- Hardware Intrinsics: We unlocked the specialized hardware inside the CPU (
popcount,countr_zero) to solve counting and searching problems in a single instruction. - Dirty Masks: We used bits to track state changes, minimizing bandwidth in replication systems.
We have now fully optimized our data layout. We know how to store data densely (SoA), address it stably (Handles), and compress it tightly (bits). However, so far, we've only really been working with a single data source at a time.
We can't store everything in the same PlayerStorage SoA - complex projects have multiple systems, and they need to talk to each other. That's what we'll work on in the next chapter.