Fast Filtering and Branchless Logic

Replace slow logical operators with fast bitwise arithmetic. Learn how to filter data without branching, avoiding pipeline flushes and speeding up queries.

Ryan McCombe

In the previous lesson, we established the syntax for working with bits. In this lesson, we will use these techniques to pack multiple boolean values together, using | to set them and & to check them.

The immediate benefit of this is memory density. A bool requires 1 byte of memory to store a single boolean value, but a uint8_t can store 8 booleans - one per bit - in the same space. We can therefore shrink our bandwidth and storage demands by up to 87.5%.

This also improves our cache efficiency, but there is a second benefit that is often even more impactful: execution speed. When we process game entities, financial transactions, or network packets, we often ask compound questions: "Is this entity alive AND friendly AND nearby?"

Our typical instinct is to write this using the logical AND operator (&&). This operator has a specific behavior called short-circuiting. If the first condition is false, it skips the rest.

In theory, this sounds good. Fewer variables are checked, so short-circuiting saves work. But what happens in reality?
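Before we measure performance, it's worth seeing the behavioral difference directly. Here's a minimal sketch - the helper functions are hypothetical, existing only to count evaluations - showing that && stops at the first false operand, while & always evaluates both sides:

```cpp
// Hypothetical helpers that count how often they are called,
// so we can observe which operands each operator evaluates.
static int calls = 0;
static bool First()  { ++calls; return false; }
static bool Second() { ++calls; return true; }

int CallsWithLogicalAnd() {
  calls = 0;
  bool result = First() && Second(); // short-circuits: Second() never runs
  (void)result;
  return calls; // 1
}

int CallsWithBitwiseAnd() {
  calls = 0;
  bool result = First() & Second(); // both operands are evaluated
  (void)result;
  return calls; // 2
}
```

Short-circuiting is sometimes essential - for example, when the second operand is expensive or has side effects - but for cheap, side-effect-free flags, evaluating everything can be the faster option, as we'll see below.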

Boolean Algebra

Let's imagine we have a collection of players. Each player has three boolean properties, stored as standard bool types (1 byte each).

struct Player {
  bool friendly;
  bool alive;
  bool nearby;
};

std::vector<Player> players;

If we want to count how many players satisfy all three conditions, the typical approach looks like this:

int count = 0;
for (const auto& p : players) {
  // If the player can help...
  if (p.friendly && p.alive && p.nearby) {
    // ...then increase count by 1
    count++;
  }
}

This code contains three hidden if statements (branches). The CPU checks friendly. If true, it jumps to check alive. If true, it jumps to check nearby. If true, it increments the count.

Now consider the bitwise alternative. We treat the booleans as integers (0 or 1) and use the bitwise AND (&) operator.

int count = 0;
for (const auto& p : players) {
  // Important: using & operator, not &&
  bool canHelp = (p.friendly & p.alive & p.nearby);

  // Increase count by 0 or 1
  count += int(canHelp);
}

Benchmarking

Why would we want to do more work? Logic tells us that checking three variables is slower than checking just one. If p.friendly is false, we know canHelp will be false. Checking p.alive and p.nearby would be a waste of resources - it won't change the result.

Let's test this hypothesis with a benchmark, using the lab we created in a previous lesson. We will generate a large dataset of random booleans and compare the two approaches.

#include <benchmark/benchmark.h>
#include <vector>
#include <random>
#include <algorithm>

struct Player {
  bool friendly;
  bool alive;
  bool nearby;
};

std::vector<Player> GenerateData(size_t size) {/*...*/}

static void BM_PlayerFlag_Logical(benchmark::State& state) {
  auto players = GenerateData(state.range(0));
  for (auto _ : state) {
    int count = 0;
    for (const auto& p : players) {
      if (p.friendly && p.alive && p.nearby) {
        count++;
      }
    }
    benchmark::DoNotOptimize(count);
  }
}

static void BM_PlayerFlag_Bitwise(benchmark::State& state) {
  auto players = GenerateData(state.range(0));
  for (auto _ : state) {
    int count = 0;
    for (const auto& p : players) {
      bool canHelp = (p.friendly & p.alive & p.nearby);
      count += int(canHelp);
    }
    benchmark::DoNotOptimize(count);
  }
}

#define BENCHMARK_CONFIG(FUNC) \
  BENCHMARK(FUNC) \
    ->RangeMultiplier(10) \
    ->Range(100000, 10000000) \
    ->Unit(benchmark::kMillisecond)

BENCHMARK_CONFIG(BM_PlayerFlag_Logical);
BENCHMARK_CONFIG(BM_PlayerFlag_Bitwise);

The "dumb" bitwise approach that checks all the variables is over 10x faster on my system:

-----------------------------------------
Benchmark                             CPU
-----------------------------------------
BM_PlayerFlag_Logical/100000     0.586 ms
BM_PlayerFlag_Logical/1000000     5.72 ms
BM_PlayerFlag_Logical/10000000    56.2 ms
BM_PlayerFlag_Bitwise/100000     0.047 ms
BM_PlayerFlag_Bitwise/1000000    0.471 ms
BM_PlayerFlag_Bitwise/10000000    4.85 ms

So, what is going on here? Why is the "smarter" code that checks fewer variables so much slower? There are a few compounding reasons for this, but two are particularly important.

The first is that the bitwise approach is easier to automatically vectorize, which allows it to use the SIMD techniques we covered in a previous lesson.

The second advantage of the bitwise implementation is that it eliminates branch mispredictions.

Branch Misprediction

Modern CPUs are factories. They fetch instructions, decode them, and execute them in stages. To keep the factory moving, the CPU tries to guess the outcome of if statements before they happen. This is branch prediction.

If the data is predictable (e.g., all players are friendly), the predictor guesses right, and the pipeline flows smoothly. The logical && is fast in this case.

But in our benchmark (and in many real-world situations), the data is random. The CPU guesses "true", but the data is "false" half of the time.

  1. Pipeline Flush: The CPU realizes it guessed wrong.
  2. Stall: It must throw away all the work currently in the pipeline (10-20 cycles).
  3. Restart: It starts fetching the correct instructions.

The bitwise version has no branches. It executes simple, 1-cycle instructions. The CPU never has to guess, never has to stop, and never flushes the pipeline. Even though it technically evaluates "more" variables, it finishes sooner because it never trips.
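The same trick generalizes beyond boolean flags. Any time a condition can be folded into arithmetic, the branch disappears from the loop body. A sketch (the function names here are just for illustration):

```cpp
#include <vector>

// Branching version: the CPU must predict the outcome of the if
// statement for every element; random data makes it guess wrong often.
int CountAboveBranching(const std::vector<int>& values, int threshold) {
  int count = 0;
  for (int x : values) {
    if (x > threshold) count++;
  }
  return count;
}

// Branchless version: the comparison itself yields 0 or 1, which
// we add directly; the loop body needs no conditional jump.
int CountAboveBranchless(const std::vector<int>& values, int threshold) {
  int count = 0;
  for (int x : values) {
    count += (x > threshold); // bool promotes to 0 or 1
  }
  return count;
}
```

Both functions return the same answer; the difference is purely in how the CPU gets there.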

Adding Flags

Let's walk through a practical example of this. We'll bring back our PlayerStorage - the structure-of-arrays container from a previous chapter.

If you're following along from that chapter and want to run the examples, feel free to continue to update your container. To focus on the new content, we've provided a starting point below that strips out a lot of the previous topics, but in a real system, you would layer these features together.

#pragma once

#include <vector>
#include <ranges>
#include <tuple>
#include <algorithm>
#include <array>
#include <cstdint>

namespace PlayerFlags {
  constexpr uint8_t None     = 0;
  constexpr uint8_t Friendly = 1 << 0;
  constexpr uint8_t Alive    = 1 << 1;
  constexpr uint8_t Nearby   = 1 << 2;
}

struct PlayerRef {
  int& id;
  int& score;
  uint8_t& flags;
};

class PlayerStorage {
public:
  std::vector<int> m_ids;
  std::vector<int> m_scores;
  std::vector<uint8_t> m_flags;

  void AddPlayer(int id, int score, uint8_t flags) {
    m_ids.push_back(id);
    m_scores.push_back(score);
    m_flags.push_back(flags);
  }

  auto GetView() {
    return std::views::zip(m_ids, m_scores, m_flags) 
      | std::views::transform([](auto&& tuple) {
          auto& [id, score, flags] = tuple;
          return PlayerRef{id, score, flags};
        });
  }
};
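Before wiring these constants into the container's API, it's worth sanity-checking how they compose. A minimal sketch (the HasAll() helper is ours, just for illustration): | merges individual flags into one mask, and comparing (flags & mask) against mask tells us whether every requested bit is set.

```cpp
#include <cstdint>

namespace PlayerFlags {
  constexpr uint8_t None     = 0;
  constexpr uint8_t Friendly = 1 << 0;
  constexpr uint8_t Alive    = 1 << 1;
  constexpr uint8_t Nearby   = 1 << 2;
}

// True only when every bit set in mask is also set in flags
bool HasAll(uint8_t flags, uint8_t mask) {
  return (flags & mask) == mask;
}
```

This "all bits present" comparison is the same test our filtered view and CountPlayers() use later in the lesson.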

API Improvements

Raw bitmasks are efficient, but they aren't very user-friendly. We don't want consumers of our container to be manually shifting bits.

We can update our PlayerRef proxy to hide the bitwise logic behind semantic methods. This gives us the best of both worlds: the dense storage and performance of raw bits and the readability of OOP methods like IsFriendly() and SetFriendly():

// ...

class PlayerRef {
public:
  int& id;
  int& score; 
  
  PlayerRef(int& idRef, int& scoreRef, uint8_t& flagsRef)
    : id(idRef), score(scoreRef), flags(flagsRef) {} 

  bool IsFriendly() const {
    return (flags & PlayerFlags::Friendly) != 0;
  }

  void SetFriendly(bool enable) {
    if (enable) {
      flags |= PlayerFlags::Friendly;
    } else {
      flags &= ~PlayerFlags::Friendly;
    } 
  }
  
  // ...etc

private:
  uint8_t& flags;
};
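If the branch inside SetFriendly() ever shows up in a profile, the update itself can also be written branchlessly. One common idiom - shown here as a standalone sketch, not as part of the class above - turns the bool into an all-ones or all-zeros mask:

```cpp
#include <cstdint>

// Branchless flag update: clear the bit, then OR it back in only
// when enable is true. -int(enable) is 0 or -1, so casting to
// uint8_t yields 0x00 or 0xFF, which selects the bit arithmetically.
uint8_t SetFlag(uint8_t flags, uint8_t flag, bool enable) {
  uint8_t mask = uint8_t(-int(enable));
  return uint8_t((flags & ~flag) | (mask & flag));
}
```

In practice, compilers often generate branch-free code for the if/else version anyway (via conditional moves), so measure before reaching for this.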

// ...

We can also optimize our iteration. Currently, consumers have to implement the filtering logic themselves, and their instinct would likely be to use the slower, branching && approach. We can intervene here and provide a faster method.

We can add an overload to GetView() that accepts a mask. This allows us to implement filtering in a way that is fast and branchless where possible:

// ...

class PlayerStorage {
public:
  // ...

  auto GetView(uint8_t mask = PlayerFlags::None) {
    return std::views::zip(m_ids, m_scores, m_flags)
      | std::views::filter([mask](const auto& tuple) {
          const auto& flags = std::get<2>(tuple);
          return (flags & mask) == mask;
      })
      | std::views::transform([](auto&& tuple) {
          auto& [id, score, flags] = tuple;
          return PlayerRef{id, score, flags};
      });
  }
  
  // ...
};

Example Usage

This results in an API that is expressive and friendly:

#include <dsa/PlayerStorage.h>
#include <iostream>
#include <algorithm>
#include <ranges>

int main() {
  PlayerStorage ps;

  using namespace PlayerFlags;
  ps.AddPlayer(1, 100, Friendly | Alive);
  ps.AddPlayer(2, 200, Alive);
  ps.AddPlayer(3, 150, Friendly | Alive | Nearby);
  
  // The range interface still allows consumers to provide
  // their own filtering logic, with full flexibility
  auto nearby_friendlies = std::views::filter(
    [](PlayerRef p) {
      return p.IsFriendly() && p.IsNearby();
    }
  );
  
  for (PlayerRef p : ps.GetView() | nearby_friendlies) {
    // ...
  }
  
  // But we provide an easier, optimized option for simple cases:
  for (PlayerRef p : ps.GetView(Friendly | Nearby)) {
    // ...
  }
}

MapReduce

Let's go back to our original problem - how do we count the players in our collection who meet certain criteria, such as being friendly and nearby?

Our consumers could find out themselves by handing one of our views to std::distance():

// Option 1 - Branching (slow)
auto nearby_friendlies = std::views::filter(
  [](PlayerRef p) {
    return p.IsFriendly() && p.IsNearby();
  }
);
size_t opt1 = std::ranges::distance(
  ps.GetView() | nearby_friendlies
);

// Option 2 - Bitwise (faster, but still counts
// elements one by one)
size_t opt2 = std::ranges::distance(
  ps.GetView(Nearby | Friendly)
);

However, this comes at a performance cost. Lazy-evaluated views are perfect for iterating through containers element-by-element, but they're not designed for asking questions about our data as a whole.

The most famous design pattern for analyzing large amounts of data and generating a report is MapReduce, which the C++ standard library implements as std::transform_reduce(). We covered MapReduce in more detail in a previous lesson.

If a problem can be expressed as a mapping (transformation) action applied to all objects in a collection, followed by reducing (combining) those transformed objects to the result we're asking for, then MapReduce can help.

As it turns out, almost any data processing problem can be expressed in this form. Our original task of "count the players who are friendly, alive, and nearby" is a mapping and reduction problem. We solved it by mapping each player to a 0 or 1, then reducing all of those 0s and 1s to a final count:

// Starting value
int count = 0;

// Range to map/reduce
for (const PlayerRef& p : players) {
  // MAP each player to 0 or 1
  bool canHelp = (p.friendly & p.alive & p.nearby);

  // REDUCE all the 0s and 1s to a final total via addition
  count += canHelp;
}

Here is the equivalent logic expressed via arguments to std::transform_reduce():

int count = std::transform_reduce(
  // Range to map/reduce
  players.begin(), players.end(),
  
  // Starting value
  0,
  
  // REDUCE all the 0s and 1s to a final total via addition
  std::plus<>(),
  
  // MAP each player to 0 or 1
  [](const PlayerRef& p) {
    return p.friendly & p.alive & p.nearby;
  }
);

Implementing MapReduce

Below, we add a CountPlayers() method to our system to support expressions like CountPlayers(Alive | Nearby):

// ...
#include <numeric> // for transform_reduce 

// ...

class PlayerStorage {
public:
  // ...

  int CountPlayers(uint8_t mask = PlayerFlags::None) const {
    // Trivial case - no filters provided
    // Just return total player count
    if (mask == PlayerFlags::None) return int(m_ids.size());
    
    return std::transform_reduce(
      // Range to map/reduce
      m_flags.begin(), m_flags.end(),
      
      // Starting value
      0,  
      
      // REDUCE all the 0s and 1s to a final total via addition
      std::plus<>(),
      
      // MAP from uint8_t to 0 or 1
      [mask](uint8_t flags) {
        return (flags & mask) == mask ? 1 : 0;
      }
    );
  }
  // ...
};

Let's check how our CountPlayers() function performs compared to the previous attempts. We compare:

  1. Logical: The traditional iteration of an array of structures using &&.
  2. Bitwise: Iterating the array of structures with boolean algebra using &.
  3. MapReduce: Using transform_reduce() on an m_flags column within our structure-of-arrays:
// ...
#include <dsa/PlayerStorage.h> 

PlayerStorage GenerateData(size_t size) {/*...*/}

static void BM_PlayerStorage_MapReduce(benchmark::State& state) {
  auto ps = GenerateData(state.range(0));
  for (auto _ : state) {
    using namespace PlayerFlags;
    int count = ps.CountPlayers(Friendly | Alive | Nearby);
    benchmark::DoNotOptimize(count);
  }
}
BENCHMARK_CONFIG(BM_PlayerStorage_MapReduce);

On my system, the SoA layout and the optimizations of std::transform_reduce() return a result about 50 times faster than the original approach, and 5 times faster than the bitwise operations on std::vector<Player>:

---------------------------------------------
Benchmark                                 CPU
---------------------------------------------
BM_PlayerFlag_Logical/1000000         5.72 ms
BM_PlayerFlag_Logical/10000000        59.7 ms
BM_PlayerFlag_Bitwise/1000000        0.496 ms
BM_PlayerFlag_Bitwise/10000000        5.31 ms
BM_PlayerStorage_MapReduce/1000000   0.112 ms
BM_PlayerStorage_MapReduce/10000000   1.14 ms

Multithreading

The real power of the MapReduce pattern is how well it scales across many cores or, for large-scale work, many systems. We covered execution policies in a previous lesson but, to prompt std::transform_reduce() to run in parallel, we just pass std::execution::par as the first argument:

// ...
#include <execution> // for std::execution::par 

// ...

class PlayerStorage {
public:
  // ...

  int CountPlayers(uint8_t mask) const {
    return std::transform_reduce(
      std::execution::par, 
      m_flags.begin(), m_flags.end(),
      0,
      std::plus<>(),
      [mask](uint8_t f) {
        return (f & mask) == mask ? 1 : 0;
      }
    );
  }
  // ...
};

For large data sets, this execution policy can result in yet another order of magnitude performance increase on a typical system:

---------------------------------------------
Benchmark                                 CPU
---------------------------------------------
BM_PlayerFlag_Logical/10000000        59.9 ms
BM_PlayerFlag_Bitwise/10000000        5.51 ms
BM_PlayerStorage_MapReduce/10000000   0.16 ms

Reducing Filtering

Being able to quickly filter our collection is great, but it still has a performance cost. If we find we're using the same filter often ("iterate over all of the nearby players") we can instead be slightly smarter in how we arrange our data.

Our collection is currently a monolith. Conceptually, all of our data is in one big table where each of our arrays is a column and each player is a row. Later in the course, we'll expand our design to multiple tables. This lets our systems iterate and query just the things they care about, but we'll still maintain the friendly "player" abstraction that hides the underlying complexity.
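As a preview of that idea, one way to "pre-pay" a common filter is to physically group the matching rows together. The sketch below (a simplified, single-column illustration; in a real SoA container every column would be permuted together) uses std::partition to move players matching a hot mask to the front, so iterating them becomes a contiguous scan with no per-element test:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Move rows whose flags satisfy the hot mask to the front of the
// column and return how many there are. Iterating the matching
// players is then a contiguous, filter-free scan over that prefix.
size_t PartitionByMask(std::vector<uint8_t>& flags, uint8_t mask) {
  auto middle = std::partition(flags.begin(), flags.end(),
    [mask](uint8_t f) { return (f & mask) == mask; });
  return size_t(middle - flags.begin());
}
```

Note that std::partition does not preserve the relative order of elements; if order matters, std::stable_partition is the alternative.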

Summary

In this lesson, we demonstrated that how you ask a question matters as much as the data you are asking about.

  1. Branchless Logic: We replaced logical operators (&&) with bitwise operators (&). This prevents branch mispredictions, allowing the CPU pipeline to flow without interruption.
  2. Bit Packing: We consolidated multiple booleans into a single uint8_t flags column in our SoA system, improving memory density and cache usage.
  3. Encapsulation: We exposed this optimized data via PlayerRef, proving we can have the speed of bits with the readability of functions like IsFriendly().
  4. MapReduce: We implemented std::transform_reduce() as a tool for aggregate queries, showing massive speedups over manual loops.

We have successfully compressed our boolean logic. But booleans are just the beginning.

In the next lesson, we will attack the larger data types clogging our memory bus: integers and floating-point numbers. We will take our bit packing further, and also implement quantization.
