In the previous lesson, we exposed the performance constraints of standard linked lists. While std::list provides $O(1)$ insertions and deletions alongside iterator stability, its nodes can end up scattered across the heap. Traversal then suffers frequent cache and TLB misses, stalling the CPU pipeline while it waits for RAM.

We also learned that the 8-byte pointers and chunk headers of global heap allocation are a massive structural tax to pay.

However, in many scenarios, we still benefit from the flexibility and stability of linked structures, so let's cover some of the key techniques to mitigate their weaknesses.

In this lesson, we are going to build a linked list inside a contiguous memory pool. Locating our nodes in a smaller, tightly controlled region of memory restores some locality. It also eliminates the need for 64-bit pointers that can address any memory location - instead, we can use lightweight offsets to describe where the next node is relative to the start of our pool.

The 64-Bit Pointer Waste

Let's look closely at the math of a traditional doubly linked node. If our payload is a single 4-byte int, our struct requires an 8-byte Prev pointer and an 8-byte Next pointer.

Because of hardware alignment rules, the compiler must also inject 4 bytes of padding to give the pointers their 8-byte alignment:

```cpp
struct StandardNode {
  int Payload;        // 4 bytes
  // Padding          // 4 bytes
  StandardNode* Prev; // 8 bytes
  StandardNode* Next; // 8 bytes
}; // Total Size: 24 bytes
```

We are consuming 24 bytes of RAM to store 4 bytes of useful information, before we even consider the chunk headers.

There is something that can save us: many data structures do not need absolute memory addresses at all. If we're making a game that only ever spawns a maximum of 50,000 entities in a level, why are we using a 64-bit pointer capable of addressing 16 exabytes of RAM? It is a massive waste of memory and bandwidth.

If we instead store these nodes in a fixed-size object pool or an array, we don't need absolute memory addresses. We can instead represent the Prev and Next nodes by their offset relative to the start of the pool, or their index within the array.

A uint16_t integer takes up only 2 bytes and can store indices from 0 to 65,535. By replacing our 64-bit pointers with 16-bit indices, we can shrink the footprint of our node from 24 bytes to 8 bytes, and also eliminate the chunk headers associated with the global heap:

```cpp
#include <cstdint>

struct IndexNode {
  int Payload;      // 4 bytes
  uint16_t Prev;    // 2 bytes
  uint16_t Next;    // 2 bytes
}; // Total Size: 8 bytes
```

If we need to support bigger collections, a uint32_t is still half the size of a 64-bit pointer, and can store indices up to roughly 4.29 billion.
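The size arithmetic above can be checked at compile time. This is a minimal sketch that assumes a typical 64-bit target with 8-byte pointers and a 4-byte int; PoolBytes() is a hypothetical helper introduced only for this illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Pointer-based node, as in the first listing
struct StandardNode {
  int Payload;
  StandardNode* Prev;
  StandardNode* Next;
};

// Index-based node, as in the second listing
struct IndexNode {
  int Payload;
  std::uint16_t Prev;
  std::uint16_t Next;
};

// Assumption: 64-bit platform, 8-byte pointers, 4-byte int.
// These checks fail to compile on targets where that does not hold.
static_assert(sizeof(StandardNode) == 24, "4 payload + 4 padding + 16");
static_assert(offsetof(StandardNode, Prev) == 8, "padding precedes Prev");
static_assert(sizeof(IndexNode) == 8, "4 payload + 2 + 2, no padding");

// Hypothetical helper: bytes needed to store Count nodes back to back,
// ignoring any per-allocation overhead from the heap.
template <typename NodeT>
constexpr std::size_t PoolBytes(std::size_t Count) {
  return Count * sizeof(NodeT);
}
```

For the 50,000-entity example, PoolBytes<StandardNode>(50'000) is 1,200,000 bytes, while PoolBytes<IndexNode>(50'000) is 400,000 bytes, before counting any heap chunk headers.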

Co-locating the Nodes

To make index-based linking work, our nodes must live inside a single, contiguous block of memory. We can do this using the same object pool technique we covered in the previous chapter, so reviewing that lesson is recommended if those concepts are unfamiliar.

Expanding our pool to provide access to our objects by index could be implemented using pointer arithmetic. However, a contiguous block of memory allocated on the heap supporting index-based access is exactly what a std::vector is, so we can make our lives slightly easier by using that as our Buffer.

Otherwise, the mechanism is the same - we scale the buffer to have all of the capacity we need right from the start. We never want to grow it, so we should never use methods like push_back() after initial construction - we simply treat it as a massive, pre-allocated slab of nodes.
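The relationship between an index and an address can be sketched as follows. PoolNode, SlotAddress() and SlotByteOffset() are hypothetical names used only here. The point is that indexing into the vector is the same base-plus-offset arithmetic we would otherwise write by hand, and that the result stays valid only while the buffer is never resized.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical node type for illustration
struct PoolNode {
  int Payload;
  std::uint32_t Prev;
  std::uint32_t Next;
};

// Resolve an index to an address by hand. Pointer arithmetic on a
// PoolNode* already scales the index by sizeof(PoolNode).
inline const PoolNode* SlotAddress(
  const std::vector<PoolNode>& Buffer, std::uint32_t Index
) {
  assert(Index < Buffer.size() && "Index must refer to an existing slot");
  return Buffer.data() + Index;
}

// Distance in bytes between the start of the pool and a given slot
inline std::size_t SlotByteOffset(std::uint32_t Index) {
  return static_cast<std::size_t>(Index) * sizeof(PoolNode);
}
```

Because the index is relative to Buffer.data(), it remains meaningful even if the whole buffer is copied or moved elsewhere, which a raw pointer would not.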

The Linked List

Below, we define a class that can create pools supporting up to 4 billion elements of the template type T.

The first node of the collection is represented by the HeadIndex. In a traditional linked list, we could identify the final node in the collection by its Next value being a nullptr. In an index-based list, we would instead use some sentinel integer value. We'll store that as NullIndex in our example:

dsa_core/include/dsa/IndexList.h

```cpp
#pragma once
#include <vector>
#include <cstdint>

template <typename T>
class IndexList {
private:
  // We use uint32_t to support up to 4 billion nodes.
  // We reserve the maximum integer value to act as our "nullptr".
  static constexpr uint32_t NullIndex = 0xFFFFFFFF;

  struct Node {
    T Payload;
    uint32_t Prev;
    uint32_t Next;
  };

  // Our contiguous memory pool
  std::vector<Node> Buffer;

  // The start of our active logical list
  uint32_t HeadIndex{NullIndex};

public:
  IndexList(uint32_t Capacity) {
    // Ask the OS for the physical RAM exactly once
    Buffer.resize(Capacity);
  }
};
```

The Free List

As before, we also need a free list to track where the empty slots are in the pool. We'll store the index where our free list starts in the FreeHeadIndex variable, with the end also being represented by NullIndex.

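For example, if our pool has the Capacity for 5 slots, and we currently have two of those slots allocated, our free list would span the remaining three slots. FreeHeadIndex holds the index of the first free slot, each free slot's Next holds the index of the following free slot, and the last free slot's Next holds the NullIndex (written as $\varnothing$).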

Given that our list starts empty, every slot is initially free, so the constructor threads our free list through the entire pool just like we did in the previous chapter:

dsa_core/include/dsa/IndexList.h

```cpp
#pragma once
#include <vector>
#include <cstdint>

template <typename T>
class IndexList {
private:
  static constexpr uint32_t NullIndex = 0xFFFFFFFF;

  struct Node {
    T Payload;
    uint32_t Prev;
    uint32_t Next;
  };

  std::vector<Node> Buffer;
  uint32_t HeadIndex{NullIndex};

  // The start of our implicit free list
  uint32_t FreeHeadIndex{NullIndex};

public:
  IndexList(uint32_t Capacity) {
    Buffer.resize(Capacity);

    // A zero-capacity pool has no slots to thread. Returning early
    // also avoids the unsigned underflow of Capacity - 1 below.
    if (Capacity == 0) return;

    // Thread the implicit free list through the unused slots
    FreeHeadIndex = 0;
    for (uint32_t i = 0; i < Capacity - 1; ++i) {
      // The Next index simply points to the next slot in the array
      Buffer[i].Next = i + 1;
    }

    // The final slot terminates the free list
    Buffer[Capacity - 1].Next = NullIndex;
  }
};
```
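To see what the threaded free list looks like, we can walk it and record each slot it visits. The sketch below is standalone rather than part of IndexList: ThreadFreeList() mirrors the constructor logic, and CollectFreeSlots() is a hypothetical debugging helper.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint32_t NullIndex = 0xFFFFFFFF;

struct Node {
  int Payload;
  std::uint32_t Prev;
  std::uint32_t Next;
};

// Thread a free list through every slot, as the constructor does.
// Returns the index of the first free slot, or NullIndex if there are none.
inline std::uint32_t ThreadFreeList(std::vector<Node>& Buffer) {
  if (Buffer.empty()) return NullIndex;
  const auto Count = static_cast<std::uint32_t>(Buffer.size());
  for (std::uint32_t i = 0; i + 1 < Count; ++i) {
    Buffer[i].Next = i + 1;
  }
  Buffer[Count - 1].Next = NullIndex;
  return 0;
}

// Hypothetical helper: follow the Next links from FreeHead and collect
// every free slot in the order it would be handed out.
inline std::vector<std::uint32_t> CollectFreeSlots(
  const std::vector<Node>& Buffer, std::uint32_t FreeHead
) {
  std::vector<std::uint32_t> Slots;
  for (std::uint32_t i = FreeHead; i != NullIndex; i = Buffer[i].Next) {
    assert(i < Buffer.size());
    Slots.push_back(i);
  }
  return Slots;
}
```

For a freshly threaded pool of 5 slots, CollectFreeSlots() yields 0, 1, 2, 3, 4, which is the order in which the first five insertions would claim their slots.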

Allocating and Freeing Slots

Because our list is backed by a custom pool, inserting a node requires us to explicitly "allocate" a slot from the free list.

The logic is identical to what we built in the previous lesson. We pop the index sitting at the top of the FreeHeadIndex stack, update the stack, and return the index:

dsa_core/include/dsa/IndexList.h

```cpp
// ...

template <typename T>
class IndexList {
  // ...

private:
  // Internal helper to grab a free slot in O(1) time
  uint32_t AllocateSlot() {
    if (FreeHeadIndex == NullIndex) {
      return NullIndex; // Out of memory
    }

    // Grab the first available index
    uint32_t NewSlot = FreeHeadIndex;

    // Update the free list to point to the next available slot
    FreeHeadIndex = Buffer[FreeHeadIndex].Next;

    return NewSlot;
  }

  // Internal helper to recycle a slot in O(1) time
  void FreeSlot(uint32_t TargetIndex) {
    // Push the dead index back onto the top of the free list stack
    Buffer[TargetIndex].Next = FreeHeadIndex;
    FreeHeadIndex = TargetIndex;
  }
};
```
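The stack-like behaviour of the free list is easier to see in isolation. SlotPool below is a hypothetical, stripped-down class that tracks only the Next links and omits payloads; it is a sketch of the same pop and push logic, not a replacement for IndexList.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical pool that tracks only the free list
class SlotPool {
public:
  static constexpr std::uint32_t NullIndex = 0xFFFFFFFF;

  explicit SlotPool(std::uint32_t Capacity) : Next(Capacity) {
    if (Capacity == 0) return; // FreeHead stays NullIndex
    // Slot i links to slot i + 1; the last slot ends the chain
    for (std::uint32_t i = 0; i + 1 < Capacity; ++i) {
      Next[i] = i + 1;
    }
    Next[Capacity - 1] = NullIndex;
    FreeHead = 0;
  }

  // Pop the slot on top of the free list stack
  std::uint32_t Allocate() {
    if (FreeHead == NullIndex) return NullIndex;
    std::uint32_t Slot = FreeHead;
    FreeHead = Next[Slot];
    return Slot;
  }

  // Push a slot back onto the top of the stack
  void Free(std::uint32_t Slot) {
    assert(Slot < Next.size());
    Next[Slot] = FreeHead;
    FreeHead = Slot;
  }

private:
  std::vector<std::uint32_t> Next;
  std::uint32_t FreeHead{NullIndex};
};
```

With a capacity of 3, three calls to Allocate() return 0, 1 and 2, and a fourth returns NullIndex. The most recently freed slot is always the next one handed out, so after Free(1) followed by Free(0), the next two allocations return 0 and then 1.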

The `PushFront()` Operation

With our allocation mechanics in place, inserting data into the active list is a simple matter of index swapping.

Let's implement a PushFront() method. We request a free slot, write our payload into it, and rewire the HeadIndex to logically place the new node at the front of the list.

dsa_core/include/dsa/IndexList.h

```cpp
// ...

template <typename T>
class IndexList {
  // ...

public:
  // ...
  void PushFront(const T& value) {
    uint32_t NewIndex = AllocateSlot();
    if (NewIndex == NullIndex) return;

    // 1. Write the payload
    Buffer[NewIndex].Payload = value;

    // 2. Wire the new node to point forward to the old head
    Buffer[NewIndex].Next = HeadIndex;
    Buffer[NewIndex].Prev = NullIndex;

    // 3. Wire the old head to point backward to our new node
    if (HeadIndex != NullIndex) {
      Buffer[HeadIndex].Prev = NewIndex;
    }

    // 4. Officially update the start of the list
    HeadIndex = NewIndex;
  }
};
```

We achieve the same $O(1)$ algorithmic complexity as std::list::push_front(), but we make zero calls to the global heap allocator. Each insertion costs only a handful of index writes, so it should be faster and far more predictable than an insertion that has to allocate a node.
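Note that PushFront() returns silently when the pool is full, so the caller cannot tell that the value was dropped. One possible variation, sketched here under assumed names (SmallIndexList, TryPushFront() and Front() are hypothetical and hard-coded to int payloads), is to report success through the return value:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, minimal list used only to illustrate the return value
class SmallIndexList {
  static constexpr std::uint32_t NullIndex = 0xFFFFFFFF;

  struct Node {
    int Payload;
    std::uint32_t Prev;
    std::uint32_t Next;
  };

  std::vector<Node> Buffer;
  std::uint32_t HeadIndex{NullIndex};
  std::uint32_t FreeHeadIndex{NullIndex};

public:
  explicit SmallIndexList(std::uint32_t Capacity) : Buffer(Capacity) {
    if (Capacity == 0) return;
    for (std::uint32_t i = 0; i + 1 < Capacity; ++i) {
      Buffer[i].Next = i + 1;
    }
    Buffer[Capacity - 1].Next = NullIndex;
    FreeHeadIndex = 0;
  }

  // Returns false, leaving the list untouched, when the pool is full
  bool TryPushFront(int Value) {
    if (FreeHeadIndex == NullIndex) return false;

    // Pop a slot from the free list
    std::uint32_t NewIndex = FreeHeadIndex;
    FreeHeadIndex = Buffer[NewIndex].Next;

    // Link the new node in front of the current head
    Buffer[NewIndex] = Node{Value, NullIndex, HeadIndex};
    if (HeadIndex != NullIndex) {
      Buffer[HeadIndex].Prev = NewIndex;
    }
    HeadIndex = NewIndex;
    return true;
  }

  // Payload of the first node; the list must not be empty
  int Front() const {
    assert(HeadIndex != NullIndex);
    return Buffer[HeadIndex].Payload;
  }
};
```

Whether a full pool should be reported, ignored, or treated as a fatal error depends on the application; the boolean return is just one option.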

The `PopFront()` Operation

Removing the first element is equally mechanical and completely avoids shifting memory. We simply read the payload from the current head, slide the HeadIndex forward to point to the next node in the chain, and hand the vacated slot back to our free list so it can be overwritten later:

dsa_core/include/dsa/IndexList.h

```cpp
// ...
#include <stdexcept> // std::out_of_range

template <typename T>
class IndexList {
  // ...

public:
  // ...
  T PopFront() {
    if (HeadIndex == NullIndex) {
      throw std::out_of_range("List is empty");
    }

    // 1. Grab the physical index of the current head
    uint32_t OldHeadIndex = HeadIndex;

    // 2. Extract the data to return
    T Result = Buffer[OldHeadIndex].Payload;

    // 3. Move the start of the list forward
    HeadIndex = Buffer[OldHeadIndex].Next;

    // 4. Disconnect the new head from the dead node
    if (HeadIndex != NullIndex) {
      Buffer[HeadIndex].Prev = NullIndex;
    }

    // 5. Recycle the physical memory slot
    FreeSlot(OldHeadIndex);

    return Result;
  }
};
```

Traversal

Iterating through our index-based list mechanically mirrors pointer traversal. Instead of chasing a 64-bit Next memory address across the global heap, we are chasing a 32-bit Next integer index within our contiguous array.

To expose this capability without exposing our internal index mechanics, we will write a Traverse() template method. It accepts a callback function and walks the logical chain of the list, passing the payload of each active node into the callback:

dsa_core/include/dsa/IndexList.h

```cpp
// ...

template <typename T>
class IndexList {
  // ...

public:
  // ...
  template <typename Func>
  void Traverse(Func Callback) const {
    // Start at the logical beginning of the list
    uint32_t CurrentIndex = HeadIndex;

    // Walk the chain until we hit the sentinel value
    while (CurrentIndex != NullIndex) {
      // Process the payload
      Callback(Buffer[CurrentIndex].Payload);

      // Jump to the next logical index
      CurrentIndex = Buffer[CurrentIndex].Next;
    }
  }
};
```
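The same loop works in the opposite direction by following Prev instead of Next. The sketch below uses free functions over a raw node array so it stands alone. The IndexList in this lesson does not store a tail index, so passing one in is an assumption made for the illustration, and MakeThreeNodeChain() is a hypothetical helper reproducing the slots that three PushFront() calls (101, 42, 77) would leave behind.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint32_t NullIndex = 0xFFFFFFFF;

struct Node {
  int Payload;
  std::uint32_t Prev;
  std::uint32_t Next;
};

// Walk Next links from Head to the end of the chain
template <typename Func>
void TraverseForward(
  const std::vector<Node>& Buffer, std::uint32_t Head, Func Callback
) {
  for (std::uint32_t i = Head; i != NullIndex; i = Buffer[i].Next) {
    Callback(Buffer[i].Payload);
  }
}

// Walk Prev links from Tail back to the start of the chain
template <typename Func>
void TraverseBackward(
  const std::vector<Node>& Buffer, std::uint32_t Tail, Func Callback
) {
  for (std::uint32_t i = Tail; i != NullIndex; i = Buffer[i].Prev) {
    Callback(Buffer[i].Payload);
  }
}

// Hypothetical fixture: slots 0..2 linked logically as 2 -> 1 -> 0
inline std::vector<Node> MakeThreeNodeChain() {
  return {
    {101, 1, NullIndex}, // slot 0: tail
    {42, 2, 0},          // slot 1: middle
    {77, NullIndex, 1},  // slot 2: head
  };
}
```

Starting TraverseForward() at slot 2 visits 77, 42, 101, while starting TraverseBackward() at slot 0 visits 101, 42, 77.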

Using the List

With our core operations in place, let's see how an application developer interacts with our IndexList:

dsa_app/main.cpp

```cpp
#include <iostream>
#include <dsa/IndexList.h>

int main() {
  // Pre-allocate a pool of 10,000 nodes
  IndexList<int> ActivePlayers(10'000);

  ActivePlayers.PushFront(101);
  ActivePlayers.PushFront(42);
  ActivePlayers.PushFront(77);

  // Physical order is 101 -> 42 -> 77
  // Logical order is 77 -> 42 -> 101
  ActivePlayers.Traverse([](int id) {
    std::cout << "Player ID: " << id << '\n';
  });

  return 0;
}
```

```
Player ID: 77
Player ID: 42
Player ID: 101
```

Benchmarking

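Let's compare the performance of our list to the standard library's implementation. We will populate a std::list and our IndexList with 10,000 integers. For the std::list, we then erase and re-insert every node in a shuffled order to scatter its nodes across the heap, simulating a hostile, real-world environment. For the IndexList, we repeatedly pop and re-push the front node. Note that this recycles the same slot each time, so the IndexList keeps its sequential physical layout in this test.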

Finally, we will benchmark the time it takes to traverse the list and read every element:

benchmarks/main.cpp

```cpp
#include <benchmark/benchmark.h>
#include <algorithm> // std::ranges::shuffle
#include <iterator>  // std::prev
#include <list>
#include <vector>
#include <numeric>
#include <random>
#include <dsa/IndexList.h>

// Helper to generate a std::list
std::list<int> CreateScatteredStdList(int size) {
  std::list<int> l;
  std::vector<std::list<int>::iterator> iters;
  for (int i = 0; i < size; ++i) {
    l.push_back(i);
    iters.push_back(std::prev(l.end()));
  }
  // Erase and re-insert every original node in a shuffled order
  std::mt19937 g(42);
  std::ranges::shuffle(iters, g);
  for (auto it : iters) {
    int val = *it;
    l.erase(it);
    l.push_front(val);
  }
  return l;
}

// Helper to generate an IndexList
IndexList<int> CreateScatteredIndexList(int size) {
  IndexList<int> idxList(size);
  for (int i = 0; i < size; ++i) {
    idxList.PushFront(i);
  }
  // Cycle the front node; each pop frees the slot the next push reuses
  for (int i = 0; i < size * 2; ++i) {
    int val = idxList.PopFront();
    idxList.PushFront(val);
  }
  return idxList;
}

static void BM_StdList_Traversal(benchmark::State& state) {
  std::list<int> l = CreateScatteredStdList(10000);
  for (auto _ : state) {
    int sum = 0;
    for (int val : l) {
      sum += val;
    }
    benchmark::DoNotOptimize(sum);
  }
}

static void BM_IndexList_Traversal(benchmark::State& state) {
  IndexList<int> idxList = CreateScatteredIndexList(10000);
  for (auto _ : state) {
    int sum = 0;
    idxList.Traverse([&](int val) {
      sum += val;
    });
    benchmark::DoNotOptimize(sum);
  }
}

#define REGISTER_BENCHMARK(Name) \
  BENCHMARK(Name)->Unit(benchmark::kMillisecond)

REGISTER_BENCHMARK(BM_StdList_Traversal);
REGISTER_BENCHMARK(BM_IndexList_Traversal);
```

```
---------------------------------
Benchmark                     CPU
---------------------------------
BM_StdList_Traversal     0.030 ms
BM_IndexList_Traversal   0.019 ms
```

In this case, our IndexList traverses roughly 1.6 times faster than std::list (0.019 ms against 0.030 ms), in addition to having a smaller memory footprint. The index-based linking techniques also introduce some new capabilities and solve some problems we'll cover throughout the rest of the course.
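The IndexList helper in the benchmark only cycles the front node, which reuses the same slot every time, so its logical chain never actually jumps around the buffer. To measure a genuinely scattered index-based list, we would need to link the slots in a random order. The sketch below is one way to build such a layout directly over a node array. ScatteredPool, MakeScatteredPool() and SumByTraversal() are hypothetical names, and this is not the code behind the numbers above.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

constexpr std::uint32_t NullIndex = 0xFFFFFFFF;

struct Node {
  int Payload;
  std::uint32_t Prev;
  std::uint32_t Next;
};

struct ScatteredPool {
  std::vector<Node> Buffer;
  std::uint32_t HeadIndex{NullIndex};
};

// Fill every slot, then link the slots in a shuffled order so that the
// logical chain jumps around the buffer.
inline ScatteredPool MakeScatteredPool(std::uint32_t Size, std::uint32_t Seed) {
  ScatteredPool Pool;
  Pool.Buffer.resize(Size);
  if (Size == 0) return Pool;

  // Order[k] is the slot visited k-th during traversal
  std::vector<std::uint32_t> Order(Size);
  std::iota(Order.begin(), Order.end(), 0u);
  std::mt19937 Rng(Seed);
  std::shuffle(Order.begin(), Order.end(), Rng);

  for (std::uint32_t k = 0; k < Size; ++k) {
    Node& Current = Pool.Buffer[Order[k]];
    Current.Payload = static_cast<int>(k);
    Current.Prev = (k == 0) ? NullIndex : Order[k - 1];
    Current.Next = (k + 1 == Size) ? NullIndex : Order[k + 1];
  }
  Pool.HeadIndex = Order[0];
  return Pool;
}

// Sum payloads by following Next links, mirroring the benchmark loop
inline long long SumByTraversal(const ScatteredPool& Pool) {
  long long Sum = 0;
  for (std::uint32_t i = Pool.HeadIndex; i != NullIndex;
       i = Pool.Buffer[i].Next) {
    Sum += Pool.Buffer[i].Payload;
  }
  return Sum;
}
```

Timing SumByTraversal() over a pool built this way would show how much of the advantage survives once the chain is scattered within the buffer. The result will vary with pool size and hardware.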

Advanced: Trivial Serialization

By replacing pointers with indices, we have unlocked a capability that regular linked lists do not have: trivial serialization.

When saving a game state, caching a database, or sending a packet over a network, we must convert our RAM into a flat stream of bytes.

If we attempt to do this with a std::list, the 64-bit pointers like 0x7ffee12A are absolute virtual addresses that are only meaningful inside the process that created them. If we save those raw bytes to a file and load them back tomorrow, the allocator and the operating system will almost certainly place our nodes at different addresses. Those loaded pointers will now point to garbage data.

To serialize a standard linked list, we must manually iterate through every node, extract the payload, format it, and write it out. To load it, we must read the file, allocate thousands of new nodes individually, and rebuild the pointer chains. It is slow and complex.

Our IndexList does not use pointers - it uses relative array offsets. Because Index 4 will always be Index 4, regardless of where the Buffer sits in memory, the entire structure of the list is position-independent.

As long as the payload type T is trivially copyable, meaning it holds no pointers or resources of its own, you can take the entire contiguous memory block and write it directly to the hard drive:

dsa_core/include/dsa/IndexList.h

```cpp
// ...
#include <fstream>
#include <string>
#include <type_traits>

template <typename T>
class IndexList {
  // ...

public:
  // ...

  // Save the entire list to disk
  bool SaveToFile(const std::string& filename) const {
    // Raw byte copies are only valid for trivially copyable payloads
    static_assert(std::is_trivially_copyable_v<T>);

    std::ofstream out(filename, std::ios::binary);
    if (!out) return false;

    // 1. Save our state variables
    out.write(reinterpret_cast<const char*>(
      &HeadIndex), sizeof(HeadIndex)
    );
    out.write(reinterpret_cast<const char*>(
      &FreeHeadIndex), sizeof(FreeHeadIndex)
    );

    // 2. Calculate the exact physical size of our array
    size_t ByteSize = Buffer.size() * sizeof(Node);

    // 3. Write the raw memory directly to the hard drive
    out.write(reinterpret_cast<const char*>(
      Buffer.data()), ByteSize
    );

    // Report failure if any of the writes did not succeed
    return out.good();
  }
};
```

Loading is just as trivial. We allocate the vector, and read() the raw bytes directly over the array. The logical links between the indices are perfectly preserved:

dsa_core/include/dsa/IndexList.h

```cpp
// ...

template <typename T>
class IndexList {
  // ...

public:
  // ...

  bool ReadFromFile(const std::string& filename) {
    std::ifstream in(filename, std::ios::binary);
    if (!in) return false;

    // 1. Restore our state variables
    in.read(reinterpret_cast<char*>(
      &HeadIndex), sizeof(HeadIndex)
    );
    in.read(reinterpret_cast<char*>(
      &FreeHeadIndex), sizeof(FreeHeadIndex)
    );

    // 2. We already know the size of our buffer, so
    //    we just blindly read the raw bytes directly
    //    back into our pre-allocated memory. This assumes
    //    the file was saved from a list of the same capacity.
    size_t ByteSize = Buffer.size() * sizeof(Node);
    in.read(
      reinterpret_cast<char*>(Buffer.data()),
      ByteSize
    );

    // Report failure if the file was shorter than expected
    return in.good();
  }
};
```
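The listings above trust that the file matches the list's capacity and payload type. A slightly more defensive layout, sketched below under assumed names, stores the slot count ahead of the raw nodes and refuses to load on a mismatch. WriteNodes() and ReadNodes() are hypothetical free functions working on generic streams, and the 64-bit count header is an addition for this illustration rather than part of the lesson's file format.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <type_traits>
#include <vector>

struct Node {
  int Payload;
  std::uint32_t Prev;
  std::uint32_t Next;
};

// Raw byte copies are only meaningful for trivially copyable types
static_assert(std::is_trivially_copyable<Node>::value,
              "Node must be trivially copyable");

// Assumed layout: a 64-bit slot count followed by the raw node array
inline bool WriteNodes(std::ostream& Out, const std::vector<Node>& Buffer) {
  const std::uint64_t Count = Buffer.size();
  Out.write(reinterpret_cast<const char*>(&Count), sizeof(Count));
  Out.write(reinterpret_cast<const char*>(Buffer.data()),
            static_cast<std::streamsize>(Buffer.size() * sizeof(Node)));
  return Out.good();
}

// Refuses to load data whose slot count differs from the destination
inline bool ReadNodes(std::istream& In, std::vector<Node>& Buffer) {
  std::uint64_t Count = 0;
  In.read(reinterpret_cast<char*>(&Count), sizeof(Count));
  if (!In || Count != Buffer.size()) return false;
  In.read(reinterpret_cast<char*>(Buffer.data()),
          static_cast<std::streamsize>(Buffer.size() * sizeof(Node)));
  return In.good();
}
```

Because both functions take stream references, they work equally with std::ofstream, std::ifstream, or an in-memory std::stringstream. Even with the header, the bytes are only portable between builds that agree on type sizes, padding, and byte order.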

This technique - writing contiguous memory directly to disk and reading it back in bulk - is the same basic idea that game engines use to load large levels quickly, and that high-performance databases use to snapshot their state. An example usage is provided below:

dsa_app/main.cpp

```cpp
#include <iostream>
#include <dsa/IndexList.h>

int main() {
  IndexList<int> ActivePlayers(10'000);

  ActivePlayers.PushFront(101);
  ActivePlayers.PushFront(42);
  ActivePlayers.PushFront(77);

  ActivePlayers.Traverse([](int id) {
    std::cout << "Player ID: " << id << "\n";
  });

  // Save the list to disk
  ActivePlayers.SaveToFile("gamestate.bin");

  // Remove Players
  ActivePlayers.PopFront();
  ActivePlayers.PopFront();
  ActivePlayers.PopFront();

  // Restore the list from disk
  ActivePlayers.ReadFromFile("gamestate.bin");

  // A string literal needs double quotes; single quotes would
  // create a multi-character constant and print a number instead
  std::cout << "\nRestored from Disk:\n";
  ActivePlayers.Traverse([](int id) {
    std::cout << "Player ID: " << id << "\n";
  });

  return 0;
}
```

```
Player ID: 77
Player ID: 42
Player ID: 101

Restored from Disk:
Player ID: 77
Player ID: 42
Player ID: 101
```

Advanced: The Defragmentation Pass

Our IndexList is currently fast, but we must acknowledge the physical reality of long-term execution. When the list is first populated, the free list hands out Buffer[0], then Buffer[1], then Buffer[2], so consecutive insertions land in consecutive slots. The logical order lines up with the physical order (in reverse, when we insert with PushFront()), so traversal walks through memory with a fixed stride. The hardware prefetcher is extremely happy, and we can iterate through our linked list almost as if it were an array.

However, after an hour of execution, we may have allocated and freed hundreds of nodes at random times. The implicit free list recycles these slots chaotically.

Eventually, our logical traversal might look like this: Buffer[2] -> Buffer[999] -> Buffer[14] -> Buffer[42]. The prefetcher is once again confused.

Because our collection is constrained to our limited-size object pool, this isn't as disastrous as working with a list spanning the global heap, but we can still make improvements. We can write a "Defragmentation" or "Compaction" algorithm. This algorithm traverses the list logically, and rearranges the nodes such that their physical order within memory matches the logical order represented by the linked list semantics.

This is a heavy $O(N)$ operation, but we can strategically execute these passes during natural downtime, like when a game shows a loading screen, or when a server is handling low traffic at 3:00 AM.

dsa_core/include/dsa/IndexList.h

```cpp
// ...
#include <utility> // std::move

template <typename T>
class IndexList {
  // ...

public:
  // ...
  void Defragment() {
    // A zero-capacity pool has nothing to compact. Returning early
    // also avoids underflow in Buffer.size() - 1 below.
    if (Buffer.empty()) return;

    // Create a pristine, empty buffer
    std::vector<Node> NewBuffer(Buffer.size());

    uint32_t CurrentLogical = HeadIndex;
    uint32_t NewIndex = 0;

    // Walk the old scrambled list and place elements sequentially
    while (CurrentLogical != NullIndex) {
      // Copy the payload
      NewBuffer[NewIndex].Payload = Buffer[CurrentLogical].Payload;

      // Hardcode the physical indices so
      // [0] points to [1], [1] points to [2], etc
      NewBuffer[NewIndex].Prev = (NewIndex == 0)
        ? NullIndex
        : NewIndex - 1;
      NewBuffer[NewIndex].Next = NewIndex + 1;

      // Move to the next scattered node
      CurrentLogical = Buffer[CurrentLogical].Next;
      NewIndex++;
    }

    // Terminate the active list
    if (NewIndex > 0) {
      NewBuffer[NewIndex - 1].Next = NullIndex;
    }

    HeadIndex = (NewIndex == 0) ? NullIndex : 0;

    // Rebuild the free list sequentially for the remaining slots
    FreeHeadIndex = (NewIndex < Buffer.size())
      ? NewIndex
      : NullIndex;

    for (uint32_t i = NewIndex; i < Buffer.size() - 1; ++i) {
      NewBuffer[i].Next = i + 1;
    }
    NewBuffer.back().Next = NullIndex;

    // Swap our pristine buffer into production
    Buffer = std::move(NewBuffer);
  }
};
```

By taking the time to periodically clean up our physical layout, we keep traversal speeds high and avoid the gradual degradation that affects a long-lived std::list.
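Since a compaction pass copies every node, it can be worth checking whether the list is already in order before running one. The helper below is a hypothetical sketch over a raw node array: it reports whether the k-th node of the logical chain already lives in slot k, which is the layout Defragment() produces.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint32_t NullIndex = 0xFFFFFFFF;

struct Node {
  int Payload;
  std::uint32_t Prev;
  std::uint32_t Next;
};

// True when following Next from Head visits slots 0, 1, 2, ... in order,
// meaning traversal is a straight walk through memory.
inline bool IsCompact(const std::vector<Node>& Buffer, std::uint32_t Head) {
  std::uint32_t Expected = 0;
  for (std::uint32_t i = Head; i != NullIndex; i = Buffer[i].Next) {
    assert(i < Buffer.size());
    if (i != Expected) return false;
    ++Expected;
  }
  return true;
}
```

An empty list counts as compact. A stricter or looser definition is possible, for example one that tolerates a small number of out-of-order links before triggering a pass.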


Summary

In this lesson, we rescued the linked list from the global allocator by marrying it to a contiguous memory pool.

- By restricting our list capacity to a known size, we replaced 8-byte absolute memory addresses with lightweight 4-byte or 2-byte indices.
- By managing our own implicit free list via indices, we bypassed the overhead of the global heap allocator.
- Because indices are relative offsets, the entire data structure is position-independent and, for trivially copyable payloads, can be trivially serialized and deserialized.
- When we fully control our memory layout, we can use downtime to implement techniques like defragmentation, resyncing the logical links with their physical memory layout.

In the next lesson, we will explore another pragmatic approach to balancing $O(1)$ structural flexibility with cache-line sympathy: the unrolled linked list.
