Index-Based and Pool-Allocated Lists

Replace 64-bit pointers with lightweight indices and back our linked list with a contiguous memory pool to maximize performance.

Ryan McCombe
Published

In the , we exposed the performance constraints of standard linked lists. While std::list provides O(1)O(1) insertions and deletions alongside iterator stability, it physically scatters nodes across gigabytes of virtual memory. This triggers relentless TLB misses and L1 cache starvation, bringing the CPU pipeline to a halt as it waits for RAM.

We also learned that the 8-byte pointers and chunk headers of global heap allocation are a massive structural tax to pay.

However, in many scenarios, we still benefit from the flexibility and stability of linked structures, so let's cover some of the key techniques to mitigate against their weaknesses.

In this lesson, we are going to build a linked list inside a contiguous . Locating our nodes in a smaller, tightly controlled region of memory restores some locality. It also eliminates the need for 64-bit pointers that can address any memory location - instead, we can use lightweight offsets to describe where the next node is relative to the start of our pool.

The 64-Bit Pointer Waste

Let's look closely at the math of a traditional doubly linked node. If our payload is a single 4-byte int, our struct requires an 8-byte Prev pointer and an 8-byte Next pointer.

Because of hardware alignment rules, the compiler must also inject 4 bytes of padding to give the pointers their 8-byte alignment:

struct StandardNode {
  int Payload;      // 4 bytes
  // Padding        // 4 bytes
  StandardNode* Prev; // 8 bytes
  StandardNode* Next; // 8 bytes
}; // Total Size: 24 bytes

We are consuming 24 bytes of RAM to store 4 bytes of useful information, before we even consider the chunk headers.

There is something that can save us: absolute memory addresses are entirely unnecessary for almost all data structures. If we're making a game that only ever spawns a maximum of 50,000 entities in a level, why are we using a 64-bit pointer capable of addressing 16 exabytes of RAM? It is a massive waste of bandwidth.

If we instead store these nodes in a fixed-size object pool or an array, we don't need absolute memory addresses. We can instead represent the Prev and Next nodes by their offset relative to the start of the pool, or their index within the array.

A uint16_t integer takes up only 2 bytes and can store indices from 0 to 65,535. By replacing our 64-bit pointers with 16-bit indices, we can shrink the footprint of our node from 24 bytes to 8 bytes, and also eliminate the chunk headers associated with the global heap:

#include <cstdint>

struct IndexNode {
  int Payload;      // 4 bytes
  uint16_t Prev;    // 2 bytes
  uint16_t Next;    // 2 bytes
}; // Total Size: 8 bytes

If we need to support bigger collections, a uint32_t is still half the size of a 64-bit pointer, and can store indices up to 4.2 billion.

Co-locating the Nodes

To make index-based linking work, our nodes must live inside a single, contiguous block of memory. We can do this using the same we covered in the previous chapter, so reviewing that lesson is recommended if those concepts are unfamiliar.

Expanding our pool to provide access to our objects by index could be implemented using pointer arithmetic. However, a contiguous block of memory allocated on the heap supporting index-based access is exactly what a std::vector is, so we can make our lives slightly easier by using that as our Buffer.

Otherwise, the mechanism is the same - we scale the buffer to have all of the capacity we need right from the start. We never want to grow it, so we should never use methods like push_back() after initial construction - we simply treat it as a massive, pre-allocated slab of nodes.

The Linked List

Below, we define a class that can create pools supporting up to 4 billion elements of the template type T.

The first node of the collection is represented by the HeadIndex. In a traditional linked list, we could identify the final node in the collection by its Next value being a nullptr. In an index-based list, we would instead use some sentinel integer value. We'll store that as NullIndex in our example:

dsa_core/include/dsa/IndexList.h

#pragma once
#include <vector>
#include <cstdint>

template <typename T>
class IndexList {
private:
  // We use uint32_t to support up to 4 billion nodes.
  // We reserve the maximum integer value to act as our "nullptr".
  static constexpr uint32_t NullIndex = 0xFFFFFFFF; 

  struct Node {
    T Payload;
    uint32_t Prev;
    uint32_t Next;
  };

  // Our contiguous memory pool
  std::vector<Node> Buffer; 

  // The start of our active logical list
  uint32_t HeadIndex{NullIndex};

public:
  IndexList(uint32_t Capacity) {
    // Ask the OS for the physical RAM exactly once
    Buffer.resize(Capacity); 
  }
};

The Free List

As before, we also need a free list to track where the empty slots are in the pool. We'll store the index where our free list starts in the FreeHeadIndex variable, with the end also being represented by NullIndex.

For example, if our pool has the Capacity for 5 slots, and we currently have two of those slots allocated, our free list would span the remaining three slots. Our pool might look like this, with \varnothing representing the NullIndex:

Given that our list starts empty, every slot is initially free, so the constructor threads our free list through the entire pool just like we did in the previous chapter:

dsa_core/include/dsa/IndexList.h

#pragma once
#include <vector>
#include <cstdint>

template <typename T>
class IndexList {
private:
  static constexpr uint32_t NullIndex = 0xFFFFFFFF;

  struct Node {
    T Payload;
    uint32_t Prev;
    uint32_t Next;
  };

  std::vector<Node> Buffer; 
  uint32_t HeadIndex{NullIndex};

  // The start of our implicit free list
  uint32_t FreeHeadIndex{NullIndex}; 

public:
  IndexList(uint32_t Capacity) {
    Buffer.resize(Capacity); 

    // Thread the implicit free list through the dead memory
    FreeHeadIndex = 0;
    for (uint32_t i = 0; i < Capacity - 1; ++i) {
      // The Next index simply points to the next slot in the array
      Buffer[i].Next = i + 1; 
    }

    // The final block terminates the free list
    Buffer[Capacity - 1].Next = NullIndex;
  }
};

Allocating and Freeing Slots

Because our list is backed by a custom pool, inserting a node requires us to explicitly "allocate" a slot from the free list.

The logic is identical to what we built in the previous lesson. We pop the index sitting at the top of the FreeHeadIndex stack, update the stack, and return the index:

dsa_core/include/dsa/IndexList.h

// ...

template <typename T>
class IndexList {
  // ...

private:
  // Internal helper to grab a free slot in O(1) time
  uint32_t AllocateSlot() {
    if (FreeHeadIndex == NullIndex) {
      return NullIndex; // Out of memory
    }

    // Grab the first available index
    uint32_t NewSlot = FreeHeadIndex;

    // Update the free list to point to the next available slot
    FreeHeadIndex = Buffer[FreeHeadIndex].Next;

    return NewSlot;
  }

  // Internal helper to recycle a slot in O(1) time
  void FreeSlot(uint32_t TargetIndex) {
    // Push the dead index back onto the top of the free list stack
    Buffer[TargetIndex].Next = FreeHeadIndex;
    FreeHeadIndex = TargetIndex;
  }
};

The PushFront() Operation

With our allocation mechanics in place, inserting data into the active list is a simple matter of index swapping.

Let's implement a PushFront() method. We request a free slot, write our payload into it, and rewire the HeadIndex to logically place the new node at the front of the list.

dsa_core/include/dsa/IndexList.h

// ...

template <typename T>
class IndexList {
  // ...

public:
  // ...
  void PushFront(const T& value) {
    uint32_t NewIndex = AllocateSlot();
    if (NewIndex == NullIndex) return;

    // 1. Write the payload
    Buffer[NewIndex].Payload = value;

    // 2. Wire the new node to point forward to the old head
    Buffer[NewIndex].Next = HeadIndex;
    Buffer[NewIndex].Prev = NullIndex;

    // 3. Wire the old head to point backward to our new node
    if (HeadIndex != NullIndex) {
      Buffer[HeadIndex].Prev = NewIndex;
    }

    // 4. Officially update the start of the list
    HeadIndex = NewIndex;
  }
};

We achieve the exact same O(1)O(1) algorithmic complexity as std::list::push_front(), but we execute zero calls to the global OS allocator. Our insertions will be significantly faster and highly deterministic, which we'll measure later.

The PopFront() Operation

Removing the first element is equally mechanical and completely avoids shifting memory. We simply read the payload from the current head, slide the HeadIndex forward to point to the next node in the chain, and hand the vacated slot back to our free list so it can be overwritten later:

dsa_core/include/dsa/IndexList.h

// ...

template <typename T>
class IndexList {
  // ...

public:
  // ...
  T PopFront() {
    if (HeadIndex == NullIndex) {
      throw std::out_of_range("List is empty");
    }

    // 1. Grab the physical index of the current head
    uint32_t OldHeadIndex = HeadIndex;

    // 2. Extract the data to return
    T Result = Buffer[OldHeadIndex].Payload;

    // 3. Move the start of the list forward
    HeadIndex = Buffer[OldHeadIndex].Next;

    // 4. Disconnect the new head from the dead node
    if (HeadIndex != NullIndex) {
      Buffer[HeadIndex].Prev = NullIndex;
    }

    // 5. Recycle the physical memory slot
    FreeSlot(OldHeadIndex);

    return Result;
  }
};

Traversal

Iterating through our index-based list mechanically mirrors pointer traversal. Instead of chasing a 64-bit Next memory address across the global heap, we are chasing a 32-bit Next integer index within our contiguous array.

To expose this capability without exposing our internal index mechanics, we will write a Traverse() template method. It accepts a callback function and walks the logical chain of the list, passing the payload of each active node into the callback:

dsa_core/include/dsa/IndexList.h

// ...

template <typename T>
class IndexList {
  // ...

public:
  // ...
  template <typename Func> 
  void Traverse(Func Callback) const { 
    // Start at the logical beginning of the list
    uint32_t CurrentIndex = HeadIndex;

    // Walk the chain until we hit the sentinel value
    while (CurrentIndex != NullIndex) {
      // Process the payload
      Callback(Buffer[CurrentIndex].Payload);

      // Jump to the next logical index
      CurrentIndex = Buffer[CurrentIndex].Next;
    }
  }
};

Using the List

With our core operations in place, let's see how an application developer interacts with our IndexList:

dsa_app/main.cpp

#include <iostream>
#include <dsa/IndexList.h>

int main() {
  // Pre-allocate a pool of 10,000 nodes
  IndexList<int> ActivePlayers(10'000);

  ActivePlayers.PushFront(101);
  ActivePlayers.PushFront(42);
  ActivePlayers.PushFront(77);
  
  // Physical order is 101 -> 42 > 77
  // Logical order is 77 -> 42 -> 101
  ActivePlayers.Traverse([](int id) {
    std::cout << "Player ID: " << id << '\n';
  });

  return 0;
}
Player ID: 77
Player ID: 42
Player ID: 101

Benchmarking

Let's compare the performance of our list to the standard library's implementation. We will populate a std::list and our IndexList with 10,000 integers. We will then aggressively shuffle and randomly delete/re-insert nodes to heavily fragment both structures, simulating a hostile, real-world environment.

Finally, we will benchmark the time it takes to traverse the list and read every element:

benchmarks/main.cpp

#include <benchmark/benchmark.h>
#include <list>
#include <vector>
#include <numeric>
#include <random>
#include <dsa/IndexList.h>

// Helper to generate a std::list
std::list<int> CreateScatteredStdList(int size) {
  std::list<int> l;
  std::vector<std::list<int>::iterator> iters;
  for (int i = 0; i < size; ++i) {
    l.push_back(i);
    iters.push_back(std::prev(l.end()));
  }
  std::mt19937 g(42);
  std::ranges::shuffle(iters, g);
  for (auto it : iters) {
    int val = *it;
    l.erase(it);
    l.push_front(val);
  }
  return l;
}

// Helper to generate an IndexList
IndexList<int> CreateScatteredIndexList(int size) {
  IndexList<int> idxList(size);
  for (int i = 0; i < size; ++i) {
    idxList.PushFront(i);
  }
  for (int i = 0; i < size * 2; ++i) {
    int val = idxList.PopFront();
    idxList.PushFront(val);
  }
  return idxList;
}

static void BM_StdList_Traversal(benchmark::State& state) {
  std::list<int> l = CreateScatteredStdList(10000);
  for (auto _ : state) {
    int sum = 0;
    for (int val : l) {
      sum += val;
    }
    benchmark::DoNotOptimize(sum);
  }
}

static void BM_IndexList_Traversal(benchmark::State& state) {
  IndexList<int> idxList = CreateScatteredIndexList(10000);
  for (auto _ : state) {
    int sum = 0;
    idxList.Traverse([&](int val) {
      sum += val;
    });
    benchmark::DoNotOptimize(sum);
  }
}

#define REGISTER_BENCHMARK(Name) \
  BENCHMARK(Name)->Unit(benchmark::kMillisecond)
  
REGISTER_BENCHMARK(BM_StdList_Traversal);
REGISTER_BENCHMARK(BM_IndexList_Traversal);
---------------------------------
Benchmark                     CPU
---------------------------------
BM_StdList_Traversal     0.030 ms
BM_IndexList_Traversal   0.019 ms

In this case, our IndexList almost doubles the traversal speed of std::list, in addition to the smaller memory footprint. The index-based linking techniques also introduce some new capabilities and solve some problems we'll cover throughout the rest of the course.

Advanced: Trivial Serialization

By replacing pointers with indices, we have unlocked a capability that regular linked lists do not have: trivial serialization.

When saving a game state, caching a database, or sending a packet over a network, we must convert our RAM into a flat stream of bytes.

If we attempt to do this with a std::list, the 64-bit pointers like 0x7ffee12A are absolute physical addresses specific to the exact moment the program is running. If we save those raw bytes to a file and load them back tomorrow, the operating system will place our program in a completely different sector of physical RAM. Those loaded pointers will now point to garbage data.

To serialize a standard linked list, we must manually iterate through every node, extract the payload, format it, and write it out. To load it, we must read the file, allocate thousands of new nodes individually, and rebuild the pointer chains. It is incredibly slow and complex.

Our IndexList does not use physical pointers - it uses relative array offsets. Because Index 4 will always be Index 4, regardless of where the Buffer sits in physical RAM, the entire structural integrity of the list is position-independent.

You can take the entire contiguous memory block and write it directly to the hard drive:

dsa_core/include/dsa/IndexList.h

// ...
#include <fstream> 

template <typename T>
class IndexList {
  // ...

public:
  // ...

  // Save the entire list to disk
  bool SaveToFile(const std::string& filename) const {
    std::ofstream out(filename, std::ios::binary);
    if (!out) return false;

    // 1. Save our state variables
    out.write(reinterpret_cast<const char*>(
      &HeadIndex), sizeof(HeadIndex)
    );
    out.write(reinterpret_cast<const char*>(
      &FreeHeadIndex), sizeof(FreeHeadIndex)
    );

    // 2. Calculate the exact physical size of our array
    size_t ByteSize = Buffer.size() * sizeof(Node);

    // 3. Write the raw memory directly to the hard drive
    out.write(reinterpret_cast<const char*>(
      Buffer.data()), ByteSize
    );

    return true;
  }
};

Loading is just as trivial. We allocate the vector, and read() the raw bytes directly over the array. The logical links between the indices are perfectly preserved:

dsa_core/include/dsa/IndexList.h

// ...

template <typename T>
class IndexList {
  // ...

public:
  // ...

  bool ReadFromFile(const std::string& filename) {
    std::ifstream in(filename, std::ios::binary);
    if (!in) return false;

    // 1. Restore our state variables
    in.read(reinterpret_cast<char*>(
      &HeadIndex), sizeof(HeadIndex)
    );
    in.read(reinterpret_cast<char*>(
      &FreeHeadIndex), sizeof(FreeHeadIndex)
    );

    // 2. We already know the size of our buffer, so
    //    we just blindly read the raw bytes directly
    //    back into our pre-allocated memory.
    size_t ByteSize = Buffer.size() * sizeof(Node);
    in.read(
      reinterpret_cast<char*>(Buffer.data()),
      ByteSize
    );

    return true;
  }
};

This technique - mapping contiguous memory directly to disk - is exactly how AAA game engines load massive levels in seconds, and how high-performance databases snapshot their state. An example usage is provided below:

dsa_app/main.cpp

#include <iostream>
#include <dsa/IndexList.h>

int main() {
  IndexList<int> ActivePlayers(10'000);

  ActivePlayers.PushFront(101);
  ActivePlayers.PushFront(42);
  ActivePlayers.PushFront(77);
  
  ActivePlayers.Traverse([](int id) {
    std::cout << "Player ID: " << id << "\n";
  });

  
  // Save the list to disk
  ActivePlayers.SaveToFile("gamestate.bin"); 
  
  // Remove Players
  ActivePlayers.PopFront();
  ActivePlayers.PopFront();
  ActivePlayers.PopFront();
  
  // Restore the list from disk
  ActivePlayers.ReadFromFile("gamestate.bin"); 
  
  std::cout << '\nRestored from Disk:\n';
  ActivePlayers.Traverse([](int id) {
    std::cout << "Player ID: " << id << "\n";
  });

  return 0;
}
Player ID: 77
Player ID: 42
Player ID: 101

Restored from Disk:
Player ID: 77
Player ID: 42
Player ID: 101

Advanced: The Defragmentation Pass

Our IndexList is currently fast, but we must acknowledge the physical reality of long-term execution. When the list is first created, Buffer[0] logically points to Buffer[1], which points to Buffer[2]. The logical order perfectly matches the physical order. The hardware prefetcher is extremely happy, and we can iterate through our linked list just as if it were an array.

However, after an hour of execution, we may have allocated and freed hundreds of nodes at random times. The implicit free list recycles these slots chaotically.

Eventually, our logical traversal might look like this: Buffer[2] -> Buffer[999] -> Buffer[14] -> Buffer[42]. The prefetcher is once again confused.

Because our collection is constrained to our limited-size object pool, this isn't as disastrous as working with a list spanning the global heap, but we can still make improvements. We can write a "Defragmentation" or "Compaction" algorithm. This algorithm traverses the list logically, and rearranges the nodes such that their physical order within memory matches the logical order represented by the linked list semantics.

This is a heavy O(N)O(N) operation, but we can strategically execute these passes during natural downtime, like when a game shows a loading screen, or when a server is handling low traffic at 3:00 AM.

dsa_core/include/dsa/IndexList.h

// ...

template <typename T>
class IndexList {
  // ...

public:
  // ...
  void Defragment() {
    // Create a pristine, empty buffer
    std::vector<Node> NewBuffer(Buffer.size()); 

    uint32_t CurrentLogical = HeadIndex;
    uint32_t NewIndex = 0;

    // Walk the old scrambled list and place elements sequentially
    while (CurrentLogical != NullIndex) {
      // Copy the payload
      NewBuffer[NewIndex].Payload = Buffer[CurrentLogical].Payload;

      // Hardcode the physical indices so
      // [0] points to [1], [1] points to [2], etc
      NewBuffer[NewIndex].Prev = (NewIndex == 0)
        ? NullIndex
        : NewIndex - 1;
      NewBuffer[NewIndex].Next = NewIndex + 1;

      // Move to the next scattered node
      CurrentLogical = Buffer[CurrentLogical].Next;
      NewIndex++;
    }

    // Terminate the active list
    if (NewIndex > 0) {
      NewBuffer[NewIndex - 1].Next = NullIndex;
    }

    HeadIndex = (NewIndex == 0) ? NullIndex : 0;

    // Rebuild the free list sequentially for the remaining slots
    FreeHeadIndex = (NewIndex < Buffer.size())
      ? NewIndex
      : NullIndex;
      
    for (uint32_t i = NewIndex; i < Buffer.size() - 1; ++i) {
      NewBuffer[i].Next = i + 1;
    }
    NewBuffer.back().Next = NullIndex;

    // Swap our pristine buffer into production
    Buffer = std::move(NewBuffer);
  }
};

By taking the time to periodically clean up our physical layout, we guarantee that our traversal speeds remain efficient, completely immune to the degradation cycle that kills std::list.

Complete Code

Here is the complete implementation of our high-performance, serializable, and defragmentable IndexList:

Files

dsa_core
dsa_app
Select a file to view its content

Summary

In this lesson, we rescued the linked list from the global allocator by marrying it to a contiguous memory pool.

  1. By restricting our list capacity to a known size, we replaced 8-byte absolute memory addresses with lightweight 4-byte or 2-byte indices.
  2. By explicitly managing our own implicit free list via indices, we entirely bypassed the overhead of the operating system.
  3. Because indices are relative offsets, the entire data structure is position-independent and can be trivially serialized and deserialized.
  4. When we fully control our memory layout, we can use downtime to implement techniques like defragmentation, resyncing the logical pointer links with their physical memory layout.

In the next lesson, we will explore another pragmatic approach to balancing O(1)O(1) structural flexibility with cache-line sympathy: the unrolled linked list.

Next Lesson
Lesson 49 of 49

Unrolled Linked Lists

Balance the flexibility of linked lists with cache-line sympathy of arrays. Learn how to group data into contiguous chunks to drastically reduce pointer chasing.

Have a question about this lesson?
Answers are generated by AI models and may not be accurate