Quantization and Delta Encoding

Trade CPU cycles for memory bandwidth by compressing data using bit packing, quantization, and delta encoding.

Ryan McCombe

In the previous lessons, we compressed our status flags from 3 booleans (3 bytes) to a single uint8_t (1 byte), which we interacted with using bit flags. Can we perform similar optimizations everywhere?

Bandwidth is often a stricter bottleneck than CPU speed. A typical server might need to replicate the state of 10,000 entities to connected clients 60 times per second. A replay system might need to store minutes of gameplay in RAM. A save system might need to serialize the world state to disk instantly.

If we store our data using standard int and float types, we are wasting massive amounts of space.

  • An int ID (32 bits) can store 4 billion values. Do we have 4 billion players?
  • A float rotation (32 bits) can distinguish angles smaller than an atom. Can the player see that?
  • A vec3 position (96 bits) stores absolute coordinates that can represent any position in the world. Can a player teleport across the world in a single frame?

In this lesson, we will learn to trade ALU instructions (CPU math) for memory bandwidth (RAM transfer). We will aggressively compress our data using three techniques: Bit-Packing, Quantization, and Delta Encoding.

This will introduce a lot of complexity to our data layout and manipulation but, as usual, we can hide that. We'll have our PlayerRef proxy handle the ugliness to preserve the clean, object-oriented API.

The Bandwidth Bottleneck

Let's bring back our PlayerStorage structure of arrays and set up some columns. The comments track how much data each entity requires, with the assumption that int and float are both 32 bit (4 byte) types:

#pragma once
#include <vector>
#include <cstdint>

// We'll re-introduce PlayerRef later in the lesson

namespace PlayerFlags {
  constexpr uint8_t None     = 0;
  constexpr uint8_t Friendly = 1 << 0;
  constexpr uint8_t Alive    = 1 << 1;
  constexpr uint8_t Nearby   = 1 << 2;
}

class PlayerStorage {
public:
  std::vector<int> m_ids;        // 4 bytes
  std::vector<int> m_teams;      // 4 bytes
  std::vector<uint8_t> m_flags;  // 1 byte
  std::vector<float> m_health;   // 4 bytes
  std::vector<float> m_rotation; // 4 bytes
  
  // Positions
  std::vector<float> m_pos_x;    // 4 bytes
  std::vector<float> m_pos_y;    // 4 bytes
  std::vector<float> m_pos_z;    // 4 bytes
  
  // We'll bring back the functions later
};

Each entity is 29 bytes in total. This is an awkward size: it doesn't align well with cache lines, which are usually 64 bytes, so a single player can straddle two cache lines, requiring two fetches to read completely.

Furthermore, if we have 10,000 entities, that's roughly 290 KB of data for our relatively minimal collection of variables. This doesn't sound like much, but what if we were trying to send this over a network connection 60 times per second to keep all the clients in sync? That is over 1 GB of data per minute per player - over 10 TB per minute across 10,000 connected players.

We'll cover higher-level optimization techniques in the next chapter, like not sending data at all for players who aren't nearby. For now, let's focus on making each individual entity as efficient as possible.

Lossless Compression: Bit-Packing

The first target is our integer data. We are currently using 32 bits for IDs, 32 bits for teams, and 8 bits for flags - 72 bits in total.

What do we actually need?

  • IDs: Let's say our game supports a maximum of 1024 players per match. We only need 10 bits (2^10 = 1024).
  • Teams: We may only have 4 teams (Red, Blue, Green, Yellow). We only need 2 bits for this (2^2 = 4).
  • Flags: We currently have 3 boolean flags (Alive, Friendly, Nearby). We need 3 bits.

So, we only need 15 bits to store all of this, but we're allocating 72 bits. 80% of our bandwidth is being wasted to send bits that are never used, and always 0.

Bit Packing

We can merge these three fields into a single uint16_t. This shrinks the memory footprint from 9 bytes to 2 bytes. We even have a bit to spare. We'll assign it to the flags for now, as we might need another flag in the future.

We view our bit containers like uint16_t from right-to-left and, similar to arrays, we start counting from 0. So, our "first" bit is at position 0, and it is the rightmost bit.

When bit-packing multiple properties, we need to define each property's size (how many bits are assigned to it) and its offset (where those bits are positioned within the uint16_t). The offset of our first property will be 0, and the offset of other properties depends on the size of the previous properties.

Our strategy of assigning 10 bits to Player ID, then 2 bits to Team ID, and then 4 bits to Flags creates a uint16_t whose bitwise memory layout looks like this:

  Bit:    15 14 13 12 | 11 10 | 9 8 7 6 5 4 3 2 1 0
  Field:  Flags (4)   | Team  | Player ID (10)

To implement this, we replace the three integer arrays in our PlayerStorage with a new uint16_t array. We'll also add a PlayerLayout namespace to describe our arrangement strategy and provide some useful masks:

#pragma once
#include <vector>
#include <cstdint>

namespace PlayerFlags {
  constexpr uint8_t None     = 0;
  constexpr uint8_t Friendly = 1 << 0;
  constexpr uint8_t Alive    = 1 << 1;
  constexpr uint8_t Nearby   = 1 << 2;
}

namespace PlayerLayout {
  static constexpr int ID_BITS = 10;
  static constexpr int TEAM_BITS = 2;
  static constexpr int FLAG_BITS = 4;

  static constexpr int ID_OFFSET = 0;
  static constexpr int TEAM_OFFSET = ID_BITS;
  static constexpr int FLAG_OFFSET = ID_BITS + TEAM_BITS;
  
  // 0000 0011 1111 1111
  static constexpr uint16_t ID_MASK = (1 << ID_BITS) - 1;
  
  // 0000 0000 0000 0011 (unshifted; shift left by TEAM_OFFSET to use)
  static constexpr uint16_t TEAM_MASK = (1 << TEAM_BITS) - 1;
  
  // 0000 0000 0000 1111 (unshifted; shift left by FLAG_OFFSET to use)
  static constexpr uint16_t FLAG_MASK = (1 << FLAG_BITS) - 1;
}

class PlayerStorage {
public:
  // Replaces m_ids, m_teams, and m_flags
  std::vector<uint16_t> m_packed_info; // 2 bytes 
  
  std::vector<float> m_health;   // 4 bytes
  std::vector<float> m_rotation; // 4 bytes
  
  // Positions
  std::vector<float> m_pos_x;    // 4 bytes
  std::vector<float> m_pos_y;    // 4 bytes
  std::vector<float> m_pos_z;    // 4 bytes
};

Packing and Unpacking Data

We have deleted three vectors that cost 9 bytes per player, and added one that stores the same data in 2 bytes per player. We'll benchmark the performance later to see the payoff, but we've also introduced a serious usability problem.

Logically, we want to work with things like id and team, not this m_packed_info thing. As always, we can hide this nightmarish layout behind a much friendlier API. Our AddPlayer() function can still accept normal data, and take care of the packing for us:

// ...

class PlayerStorage {
public:
  // ...
  void AddPlayer(
    int id, int team, uint8_t flags,
    float rot, float hp, float x, float y, float z
  ) {
    // Pack Integers
    uint16_t packed = 0;
    packed |= (id & PlayerLayout::ID_MASK)
      << PlayerLayout::ID_OFFSET;
    packed |= (team & PlayerLayout::TEAM_MASK)
      << PlayerLayout::TEAM_OFFSET;
    packed |= (flags & PlayerLayout::FLAG_MASK)
      << PlayerLayout::FLAG_OFFSET;
    m_packed_info.push_back(packed);
    
    m_rotation.push_back(rot);
    m_health.push_back(hp);
    m_pos_x.push_back(x);
    m_pos_y.push_back(y);
    m_pos_z.push_back(z);
  }
};

We can also add friendly getters and setters for whatever property we want:

// ...

class PlayerStorage {
public:
// ...
  int GetID(size_t index) const {
    uint16_t packed = m_packed_info[index];
    return (packed >> PlayerLayout::ID_OFFSET)
      & PlayerLayout::ID_MASK;
  }

  void SetID(size_t index, int id) {
    uint16_t& packed = m_packed_info[index];
    
    // 1. Clear old ID bits (create a hole)
    packed &= ~(PlayerLayout::ID_MASK
      << PlayerLayout::ID_OFFSET);
    
    // 2. Insert new ID (masked for safety, shifted
    // into position)
    packed |= (id & PlayerLayout::ID_MASK)
      << PlayerLayout::ID_OFFSET;
  }
};

As before, our PlayerRef proxy tends to be what consumers interact with, so we can invest plenty of effort here to hide the ugliness. We're just working with our packed info for now, but we'll add the other fields soon:

// ...

struct PlayerRef {
private:
  // Reference to the packed storage
  uint16_t& m_packed;
  
  // Private constructor that takes the reference to the bitfield
  friend class PlayerStorage;
  explicit PlayerRef(uint16_t& packed) : m_packed(packed) {}
  
public:
  // Interacting with the ID bits:
  int GetID() const {
    // ID_OFFSET is 0, so masking alone extracts the ID bits
    return (m_packed & PlayerLayout::ID_MASK);
  }

  void SetID(int id) {
    // 1. Clear old ID
    m_packed &= ~PlayerLayout::ID_MASK;
    // 2. Set new ID (ensure it fits!)
    m_packed |= (id & PlayerLayout::ID_MASK);
  }
  
  // Interacting with the Team bits:
  int GetTeam() const {
    // Shift right to move team bits to the bottom, then mask
    return (m_packed >> PlayerLayout::TEAM_OFFSET)
      & PlayerLayout::TEAM_MASK;
  }

  void SetTeam(int team) {
    // 1. Clear old team (shifted mask)
    m_packed &= ~(PlayerLayout::TEAM_MASK
      << PlayerLayout::TEAM_OFFSET);
    // 2. Set new team
    m_packed |= (team & PlayerLayout::TEAM_MASK)
      << PlayerLayout::TEAM_OFFSET;
  }
  
  // Interacting with the Flag bits:
  uint8_t GetFlags() const {
    return (m_packed >> PlayerLayout::FLAG_OFFSET)
      & PlayerLayout::FLAG_MASK;
  }

  void SetFlags(uint8_t flags) {
    m_packed &= ~(PlayerLayout::FLAG_MASK
      << PlayerLayout::FLAG_OFFSET);
    m_packed |= (flags & PlayerLayout::FLAG_MASK)
      << PlayerLayout::FLAG_OFFSET;
  }

  // Example helpers for specific flags
  bool IsAlive() const {
    return GetFlags() & PlayerFlags::Alive;
  }
  
  void Kill() {
    SetFlags(GetFlags() & ~PlayerFlags::Alive);
  }
};

// ...

We can bring back our GetView() function to construct these new PlayerRef objects:

#pragma once
#include <vector>
#include <ranges> 
#include <tuple> 
#include <cstdint>

// ...

class PlayerStorage {
public:
  // ...

  auto GetView() {
    // Just zipping m_packed_info for now, but we'll add more later
    return std::views::zip(m_packed_info)
      | std::views::transform([](auto&& tuple) {
         auto& [p] = tuple;
         return PlayerRef{p};
      });
  }
};

Lossy Compression: Quantization

Next, let's compress the floating-point numbers. Our example system is using two float values per player - one for health and one for rotation - but a real system typically has many more.

A 32-bit float provides incredible precision. It can represent the distance to a galaxy or the width of an atom. In many cases, the values we want to store for a rotation range from 0.0 to 360.0.

A float lets us distinguish between 90.000001° and 90.000002°. Do we need that? No - a player couldn't tell the difference.

We can use quantization to map these continuous floating-point ranges into discrete integer ranges. The fewer bits we use to represent a full 360° rotation, the coarser the stored angle becomes - but in practice, you'd likely notice no difference between using 32 bits (a full float) and 8 bits. 8 bits can represent 256 possible values, which corresponds to roughly 1.4° increments when mapped to a 0°-360° range.

The Math

If we want to compress a rotation into a single byte like a uint8_t, we map the range [0, 360] to [0, 255]. Quantizing and restoring a value using that mapping looks like this:

  Quantized = Value × (255 / 360)
  Restored  = Quantized × (360 / 255)

We can quantize anything - not just rotations - using the same technique. For example, 4 bits can represent 2^4 = 16 distinct values, from 0 to 15.

Quantizing a [0, 100] range to a [0, 15] range that can be stored in 4 bits looks like this:

  Quantized = Value × (15 / 100)
  Restored  = Quantized × (100 / 15)

We can incorporate addition and subtraction if our range does not start at 0. Quantizing a [100, 200] range to [0, 15] can be done like this:

  Quantized = (Value - 100) × (15 / 100)
  Restored  = Quantized × (100 / 15) + 100

Implementing Quantized Columns

To implement this, we replace our float vectors with uint8_t vectors.

dsa_core/include/dsa/PlayerStorage.h

class PlayerStorage {
public:
  std::vector<uint16_t> m_packed_info;
  
  // Replaces m_rotation
  std::vector<uint8_t> m_rotation_quantized;
  
  // Replaces m_health
  std::vector<uint8_t> m_health_quantized;

  // ...
  
  auto GetView() {
    return std::views::zip(
      m_packed_info, m_rotation_quantized, m_health_quantized
    ) | std::views::transform([](auto&& tuple) {
       auto& [p, r, h] = tuple;
       return PlayerRef{p, r, h};
    });
  }
};

Quantizing and Restoring Data

Our outward-facing API can continue to use the friendlier floating point values for health and rotation. When we receive a value, we just quantize it before we store it:

// ...

class PlayerStorage {
public:
  // ...
  void AddPlayer(
    int id, int team, uint8_t flags,
    float rot, float hp, float x, float y, float z
  ) {
    // Pack Integers
    uint16_t packed = 0;
    packed |= (id & PlayerLayout::ID_MASK)
      << PlayerLayout::ID_OFFSET;
    packed |= (team & PlayerLayout::TEAM_MASK)
      << PlayerLayout::TEAM_OFFSET;
    packed |= (flags & PlayerLayout::FLAG_MASK)
      << PlayerLayout::FLAG_OFFSET;
    m_packed_info.push_back(packed);
    
    // Quantize Floats
    m_rotation_quantized.push_back(
      static_cast<uint8_t>(rot * (255.0f / 360.0f))
    );
    m_health_quantized.push_back(
      static_cast<uint8_t>(hp * (255.0f / 100.0f))
    );
    
    m_pos_x.push_back(x);
    m_pos_y.push_back(y);
    m_pos_z.push_back(z);
  }
};

Our PlayerRef can also hide these quantized values behind friendly OOP-inspired methods like GetRotation() and SetRotation():

struct PlayerRef {
private:
  uint16_t& m_packed; // Packed Integers
  uint8_t& m_rot;     // Quantized Rotation 
  uint8_t& m_hp;      // Quantized Health 
  
  friend class PlayerStorage;
  PlayerRef(uint16_t& p, uint8_t& r, uint8_t& h) 
    : m_packed(p), m_rot(r), m_hp(h) {} 

public:
  // ...

  float GetRotation() const {
    return m_rot * (360.0f / 255.0f);
  }

  void SetRotation(float deg) {
    m_rot = static_cast<uint8_t>(deg * (255.0f / 360.0f));
  }

  float GetHealth() const {
    return m_hp * (100.0f / 255.0f);
  }
};

So far, we have compressed 17 bytes down to 4 bytes:

  • ID/Team/Flags: 9 bytes -> 2 bytes
  • Health: 4 bytes -> 1 byte
  • Rotation: 4 bytes -> 1 byte

Delta Encoding

The final boss is position, which we're currently storing as three floating-point values. A complete 3D position (x, y, z) therefore requires 12 bytes. We cannot easily quantize absolute world positions into 8 or 16 bits.

For example, if the world is 10km wide, a 16-bit integer would only give us ~15cm precision.

However, we often don't need to store absolute positions. In many systems, we're not working with discrete snapshots of our state - we are working with a stream of data. For example:

  • Frame 1: Player is at (100.0, 50.0, 10.0).
  • Frame 2: Player is at (100.1, 50.0, 10.0).

Whilst the world itself might be huge, the distance that any object can move within a single update is much smaller - (0.1, 0.0, 0.0) in this case.

Delta encoding takes advantage of the fact that many types of data change slowly over time. Instead of storing the absolute value, we store the difference from the previous known state. Because the difference is small, we can store it in a much smaller data type, replacing each float with an int8_t or int16_t.

Implementing Delta Encoding

To support this in our PlayerStorage, let's imagine we have some networking system updating positions.

We cannot afford to transfer a full 12-byte position for every player on every update. Instead, the system will deliver the compressed delta representing how the player has moved since the previous update.

To implement this, we need to define the precision of our delta. A single byte (int8_t) can hold values from -128 to +127.

If we map 1 unit in the delta to 1.0 meter in the world, our players can move up to 127 meters per update, but they can't move less than a meter. If our system is updating frequently (many times per second), that is far too coarse: it limits how smoothly objects can move.

On the other hand, if we map 1 unit to 0.01 meters, players can move with centimeter precision, but their top speed is limited to 1.27 meters per update. Depending on how often we update, that might not be enough.

For this example, let's pick a precision factor of 0.1f:

// ...

struct CompressedDelta {
  int8_t dx; // Range -128 to +127
  int8_t dy;
  int8_t dz;
};

class PlayerStorage {
public:
  // ...

  // The current, absolute position (High Precision)
  // We keep this as float for gameplay logic calculations...
  std::vector<float> m_pos_x;
  std::vector<float> m_pos_y;
  std::vector<float> m_pos_z;

  // ...but we let systems change it using compressed deltas 
  
  // 1 unit in the delta = 0.1 units in the world
  static constexpr float DELTA_SCALE = 0.1f;
  
  void UpdatePosition(size_t index, CompressedDelta d) {
    // Apply the delta to the high-precision float storage
    m_pos_x[index] += d.dx * DELTA_SCALE;
    m_pos_y[index] += d.dy * DELTA_SCALE;
    m_pos_z[index] += d.dz * DELTA_SCALE;
  }
};

By using int8_t for deltas, we reduce the per-update cost of positions from 12 bytes (3 x float) to 3 bytes (3 x int8_t).

Overall, our changes have reduced each entity's bandwidth cost from 29 bytes down to 7 - less than a quarter of the original size. (In RAM we still keep the full-precision float positions; the 3-byte deltas are what we transfer per update.)

Complete Code

Here is our PlayerStorage, incorporating all the techniques from this lesson:

#pragma once
#include <vector>
#include <ranges>
#include <tuple>
#include <cstdint>
#include <algorithm>

// Define our bit flags
namespace PlayerFlags {
  constexpr uint8_t None     = 0;
  constexpr uint8_t Friendly = 1 << 0;
  constexpr uint8_t Alive    = 1 << 1;
  constexpr uint8_t Nearby   = 1 << 2;
}

struct PlayerLayout {
  static constexpr int ID_BITS = 10;
  static constexpr int TEAM_BITS = 2;
  static constexpr int FLAG_BITS = 4;

  static constexpr int ID_OFFSET = 0;
  static constexpr int TEAM_OFFSET = ID_BITS;
  static constexpr int FLAG_OFFSET = ID_BITS + TEAM_BITS;

  static constexpr uint16_t ID_MASK = (1 << ID_BITS) - 1;
  static constexpr uint16_t TEAM_MASK = (1 << TEAM_BITS) - 1;
  static constexpr uint16_t FLAG_MASK = (1 << FLAG_BITS) - 1;
};

struct CompressedDelta {
  int8_t dx;
  int8_t dy;
  int8_t dz;
};

struct PlayerRef {
private:
  uint16_t& m_packed; // ID (10) + Team (2) + Flags (4)
  uint8_t& m_rot;     // Quantized Rotation
  uint8_t& m_hp;      // Quantized Health
  
  friend class PlayerStorage;
  PlayerRef(uint16_t& p, uint8_t& r, uint8_t& h) 
    : m_packed(p), m_rot(r), m_hp(h) {}

public:
  int GetID() const { 
    return (m_packed >> PlayerLayout::ID_OFFSET)
      & PlayerLayout::ID_MASK; 
  }
  
  void SetID(int id) {
    m_packed &= ~(PlayerLayout::ID_MASK
      << PlayerLayout::ID_OFFSET);
    m_packed |= (id & PlayerLayout::ID_MASK)
      << PlayerLayout::ID_OFFSET;
  }

  int GetTeam() const { 
    return (m_packed >> PlayerLayout::TEAM_OFFSET)
      & PlayerLayout::TEAM_MASK;
  }

  void SetTeam(int team) {
    m_packed &= ~(PlayerLayout::TEAM_MASK
      << PlayerLayout::TEAM_OFFSET);
    m_packed |= (team & PlayerLayout::TEAM_MASK)
      << PlayerLayout::TEAM_OFFSET;
  }

  uint8_t GetFlags() const {
    return (m_packed >> PlayerLayout::FLAG_OFFSET)
      & PlayerLayout::FLAG_MASK;
  }

  void SetFlags(uint8_t flags) {
    m_packed &= ~(PlayerLayout::FLAG_MASK
      << PlayerLayout::FLAG_OFFSET);
    m_packed |= (flags & PlayerLayout::FLAG_MASK)
      << PlayerLayout::FLAG_OFFSET;
  }

  // --- Helpers for specific flags ---
  bool IsAlive() const {
    return GetFlags() & PlayerFlags::Alive;
  }
  void Kill() {
    SetFlags(GetFlags() & ~PlayerFlags::Alive);
  }

  float GetRotation() const {
    return m_rot * (360.0f / 255.0f);
  }

  void SetRotation(float deg) {
    m_rot = static_cast<uint8_t>(deg * (255.0f / 360.0f));
  }

  float GetHealth() const {
    return m_hp * (100.0f / 255.0f);
  }
};

class PlayerStorage {
public:
  std::vector<uint16_t> m_packed_info;
  std::vector<uint8_t> m_rotation_quantized;
  std::vector<uint8_t> m_health_quantized;

  // Position Storage (Absolute, High Precision)
  std::vector<float> m_pos_x;
  std::vector<float> m_pos_y;
  std::vector<float> m_pos_z;

  static constexpr float DELTA_SCALE = 0.1f;

  void AddPlayer(
    int id, int team, uint8_t flags, float rot,
    float hp, float x, float y, float z
  ) {
    // Pack Integers
    uint16_t packed = 0;
    packed |= (id & PlayerLayout::ID_MASK)
      << PlayerLayout::ID_OFFSET;
    packed |= (team & PlayerLayout::TEAM_MASK)
      << PlayerLayout::TEAM_OFFSET;
    packed |= (flags & PlayerLayout::FLAG_MASK)
      << PlayerLayout::FLAG_OFFSET;
    
    m_packed_info.push_back(packed);

    // Quantize Floats
    m_rotation_quantized.push_back(
      static_cast<uint8_t>(rot * (255.0f / 360.0f))
    );
    m_health_quantized.push_back(
      static_cast<uint8_t>(hp * (255.0f / 100.0f))
    );

    // Store Positions
    m_pos_x.push_back(x);
    m_pos_y.push_back(y);
    m_pos_z.push_back(z);
  }

  // Apply a compressed delta to a specific player
  void UpdatePosition(size_t index, CompressedDelta d) {
    if (index < m_pos_x.size()) {
      m_pos_x[index] += d.dx * DELTA_SCALE;
      m_pos_y[index] += d.dy * DELTA_SCALE;
      m_pos_z[index] += d.dz * DELTA_SCALE;
    }
  }

  auto GetView() {
    return std::views::zip(
      m_packed_info, m_rotation_quantized, m_health_quantized
    ) | std::views::transform([](auto&& tuple) {
       auto& [p, r, h] = tuple;
       return PlayerRef{p, r, h};
    });
  }
};

Benchmarking

Let's test our tradeoffs. Normally, a getter like GetRotation() is just a memory fetch. Now, GetRotation() involves a memory fetch plus a multiplication and a division.

Are we making the code slower? Let's run some tests. We will benchmark:

  • BM_Bandwidth_Fat: The bandwidth / copying costs of our uncompressed data
  • BM_Bandwidth_Packed: The bandwidth / copying costs of our packed data
  • BM_Iterate_Fat: The cost of iterating over an array of uncompressed (float) rotations and performing some operations on them
  • BM_Iterate_Packed: The cost of iterating over our PlayerStorage, decompressing the uint8_t rotation to a float, and then performing the same operations on it:

#include <benchmark/benchmark.h>
#include <dsa/PlayerStorage.h>
#include <vector>
#include <cstring>
#include <random>

// Uncompressed Data (~29 bytes)
struct FatPlayer {
  int id;
  int team;
  uint8_t flags;
  float rotation;
  float health;
  float x, y, z;
};

// Compressed Data (~7 bytes)
struct PackedPlayer {
  uint16_t packed_info; // Bitpacked ID, Team, Flags
  uint8_t rotation;     // Quantized
  uint8_t health;       // Quantized
  int8_t dx, dy, dz;    // Delta Encoded
};

// Bandwidth Benchmarks
static void BM_Bandwidth_Fat(benchmark::State& state) {
  std::vector<FatPlayer> src(state.range(0));
  std::vector<FatPlayer> dst(state.range(0));
  
  for (auto _ : state) {
    std::memcpy(
      dst.data(), src.data(), src.size() * sizeof(FatPlayer)
    );
    benchmark::DoNotOptimize(dst.data());
  }
}

static void BM_Bandwidth_Packed(benchmark::State& state) {
  std::vector<PackedPlayer> src(state.range(0));
  std::vector<PackedPlayer> dst(state.range(0));
  
  for (auto _ : state) {
    std::memcpy(
      dst.data(), src.data(), src.size() * sizeof(PackedPlayer)
    );
    benchmark::DoNotOptimize(dst.data());
  }
}

// Iteration Benchmarks
static void BM_Iterate_Fat(benchmark::State& state) {
  size_t count = state.range(0);
  std::vector<float> rotations(count, 180.0f);

  for (auto _ : state) {
    float total = 0.0f;
    for (auto rot : rotations) {
      total += rot;
    }
    benchmark::DoNotOptimize(total);
  }
}

static void BM_Iterate_Packed(benchmark::State& state) {
  int n = state.range(0);
  PlayerStorage ps;
  
  for(int i = 0; i < n; ++i) {
    ps.AddPlayer(0, 0, 0, 180.0f, 100.0f, 0.0f, 0.0f, 0.0f);
  }

  for (auto _ : state) {
    float total = 0.0f;
    for (auto p : ps.GetView()) {
      total += p.GetRotation();
    }
    benchmark::DoNotOptimize(total);
  }
}

#define BENCHMARK_CONFIG(FUNC) \
  BENCHMARK(FUNC)  \
    ->RangeMultiplier(10)  \
    ->Range(100'000, 10'000'000)  \
    ->Unit(benchmark::kMillisecond)

BENCHMARK_CONFIG(BM_Bandwidth_Fat);
BENCHMARK_CONFIG(BM_Bandwidth_Packed);
BENCHMARK_CONFIG(BM_Iterate_Fat);
BENCHMARK_CONFIG(BM_Iterate_Packed);

---------------------------------------
Benchmark                           CPU
---------------------------------------
BM_Bandwidth_Fat/1000000        2.40 ms
BM_Bandwidth_Fat/10000000       25.0 ms
BM_Bandwidth_Packed/1000000    0.426 ms
BM_Bandwidth_Packed/10000000    6.08 ms
BM_Iterate_Fat/1000000         0.750 ms
BM_Iterate_Fat/10000000         7.95 ms
BM_Iterate_Packed/1000000      0.750 ms
BM_Iterate_Packed/10000000      7.29 ms

Unsurprisingly, moving roughly 75% fewer bytes takes roughly 75% less time, with knock-on benefits for cache efficiency, RAM usage, network traffic, and disk storage.

However, what might be surprising is that there isn't even a trade-off in this experiment - the iteration speed hasn't degraded, even though BM_Iterate_Packed performs additional work to decompress the data on each iteration.

This is because the CPU wasn't the bottleneck here - it was the memory bandwidth. As we get better at algorithm design, we increasingly encounter situations where the CPU can process data faster than memory can deliver it.

An implementation that gives the CPU additional work to do, such as compressing and decompressing data, can actually improve performance if it reduces the strain on memory.

Summary

In this lesson, we covered some of the main compression strategies we can apply.

  1. Bit-Packing: We used uint16_t as a suitcase to carry multiple small integers, reducing overhead.
  2. Quantization: We accepted a loss of precision in floating point numbers to slash their size by 75%.
  3. Delta Encoding: We learned that for historical data, storing the change is often cheaper than storing the value.
  4. Abstractions: We used our PlayerRef to do the dirty work of decompression, packing, and unpacking, maintaining a friendly API.

In the next lesson, we'll finish off our chapter with a quick tour of several other bit-based techniques that can be helpful. We cover bloom filters, hierarchical bitmasks, "dirty masks", and hardware intrinsics that make working with bits even faster.
