Unrolled Linked Lists

Balance the flexibility of linked lists with cache-line sympathy of arrays. Learn how to group data into contiguous chunks to drastically reduce pointer chasing.

Ryan McCombe
Published

In our , we examined the trade-offs between contiguous and linked data structures.

Arrays like std::vector provide fast traversal speeds because they feed the CPU's hardware prefetcher perfectly. However, maintaining contiguity makes structural changes expensive - inserting data into the middle of a massive vector forces the CPU to physically shift millions of bytes, destroying performance.

Conversely, lists like std::list allow for fast insertions and deletions because they only require swapping local pointers. But their node-based design increases their memory footprint and scatters the elements unpredictably in memory, reducing cache efficiency and confusing the prefetcher.

Here is the reality: we don't have to choose to optimize just for fast iteration or fast modification. We can build custom data structures that combine both approaches to find a balance that works best for the usage pattern that our problem requires, and the hardware we expect our program to run on.

We can design a structure that is mostly contiguous to keep iteration speed high, but uses some linking to prevent huge memory shifts when structural changes are needed.

In this lesson, we'll introduce the simplest example of this - the unrolled linked list.

The Cache Line Problem

To understand why an unrolled linked list is so effective, let's quickly review hardware's cache line.

When you ask the CPU to read a single 4-byte int from RAM, the CPU does not fetch 4 bytes. The memory controller physically fetches an entire line of memory and pulls it into the L1 cache.

In this lesson, we'll optimize our implementation to assume our program will usually be run on systems that have 64-byte cache lines. If we are using a standard singly-linked list, how much of that 64-byte fetch is actually useful data?

Our payload is 4 bytes. We also have an 8-byte Next pointer. Due to alignment padding, the node consumes 16 bytes. Furthermore, the global allocator injects a hidden 16-byte chunk header.

Out of the 64 bytes the CPU painstakingly hauled from main RAM, we are only using 4 bytes of payload. We waste over 90% of the memory bus bandwidth. When the CPU moves to the next node, it is forced to fetch an entirely different 64-byte cache line, suffering the massive RAM latency penalty all over again.

We can fix this by changing the metadata-to-payload ratio. Instead of wrapping a single int inside a node, what if we wrap an array of integers inside the node?

Designing the Fat Node

An unrolled linked list replaces single-element nodes with "fat nodes". Each node contains a small, fixed-size contiguous array of elements, alongside a counter to track how many slots are currently in use.

Just how fat these nodes are is something we would tune for our specific use case. If each node is relatively small, our container will need more of them to store all our data. This makes our structure look more like a linked list, inheriting those performance characteristics. If we have larger nodes, our data is more contiguous, pushing the performance characteristics towards that of an array.

Let's assume we want our container to store 4-byte int values, and we want our nodes to perfectly fill a 64-byte cache line. We'll use a 4-byte Count variable to track how many elements are in our node, and we'll need a pointer (8-bytes) for the Next node.

Combined, that consumes 12 bytes of overhead, leaving us with 52 bytes for our payload. This means we can fit exactly 13 integers into the remaining space. Each of our nodes would look like this:

By unrolling the list, a single 64-byte RAM fetch now gives the CPU 13 useful elements to process. We just reduced our cache misses by a factor of 13.

Implementing the UnrolledList

Let's implement this. To keep the code generic, we will use template parameters for the payload type and the node size. T will be the type of object our list will store, and NodeCapacity will be how many Ts each node can store:

dsa_core/include/dsa/UnrolledList.h

#pragma once
#include <array>
#include <cstddef>

template <typename T, uint32_t NodeCapacity>
class UnrolledList {
private:
  struct Node {
    // A contiguous block of elements
    std::array<T, NodeCapacity> Elements;

    // How many elements are currently active in this chunk
    uint32_t Count{0};

    // Pointer to the next chunk
    Node* Next{nullptr};
  };

  Node* Head{nullptr};
  size_t TotalElements{0};

public:
  UnrolledList() {
    Head = new Node();
  }

  ~UnrolledList() {
    Node* Current = Head;
    while (Current != nullptr) {
      Node* ToDelete = Current;
      Current = Current->Next;
      delete ToDelete;
    }
  }
};

Traversal Mechanics

Iterating over an unrolled linked list requires a nested loop. The outer loop walks the pointer chain from node to node. The inner loop walks the contiguous array within the current node.

Because the inner loop operates on a flat std::array, the hardware prefetcher engages instantly. The CPU glides over the elements without stalling.

dsa_core/include/dsa/UnrolledList.h

// ...

template <typename T, size_t NodeCapacity>
class UnrolledList {
  // ...

public:
  // ...

  template <typename Func>
  void Traverse(Func Callback) const {
    Node* Current = Head;

    // The Outer Loop: A slow pointer chase
    while (Current != nullptr) {

      // The Inner Loop: Fast, contiguous execution
      for (size_t i = 0; i < Current->Count; ++i) {
        Callback(Current->Elements[i]);
      }

      // Jump to the next fat node
      Current = Current->Next;
    }
  }
};

If we set our NodeCapacity to be 100 elements, the CPU executes 100 lightning-fast instructions for every 1 slow pointer chase.

Implementing Insertion

When we insert an element into the front or middle of a massive std::vector, the CPU is forced to physically shift every single element that comes after it. If there are a million elements, that is a million heavy memory copies moving through the memory bus.

With an unrolled list, we only need to shift the elements within that specific fat node. The rest of the nodes in the list remain completely untouched.

The canonical insertion pattern supported by lists is that elements are inserted either at the start or at an arbitrary position represented by an iterator. Standard library lists provide these patterns as the push_front() and insert() methods respectively.

Let's define a simple Iterator to represent a position within our UnrolledList, and implement our first pass of the Insert() method. For this step, we will assume our node has plenty of empty space:

dsa_core/include/dsa/UnrolledList.h

// ...

template <typename T, size_t NodeCapacity>
class UnrolledList {
  // ...
public:
  // A simple struct to represent a position in the list
  struct Iterator {
    Node* TargetNode;
    size_t LocalIndex;
  };

  void Insert(Iterator Pos, const T& Value) {
    Node* Target = Pos.TargetNode;

    if (Target->Count == NodeCapacity) {
      // Node is full - cannot insert
      // We'll add support for this in the next section
      return;
    }

    // Shift elements to the right to make a gap
    for (size_t i = Target->Count; i > Pos.LocalIndex; --i) {
      Target->Elements[i] = Target->Elements[i - 1];
    }

    // Drop the new value into the gap
    Target->Elements[Pos.LocalIndex] = Value;

    Target->Count++;
    TotalElements++;
  }
  
  // Helper to insert a new element at start of the list
  void InsertFront(const T& Value) {
    Insert({Head, 0}, Value);
  }
};

This mechanism also highlights a key difference between lists and unrolled lists. With a regular list, inserting an element does not invalidate the pointers to any other element, as those elements do not need to be moved.

However, with an unrolled list, other elements within the same chunk may move to make room for the new addition. Therefore, any pre-existing pointers, references, and iterators should be considered invalid after a call to Insert().

The Splitting Mechanic

When a node hits maximum capacity, we cannot accept any more elements. We must perform a structural modification: the node split.

When a node fills up, we allocate a brand new node. We then take the back half of the elements from the full node and physically move them into the new node. Finally, we wire the new node into the linked list chain.

This ensures that both nodes are now at ~50% capacity, giving them plenty of breathing room for future insertions without triggering another immediate split.

Let's implement this splitting mechanism in code:

dsa_core/include/dsa/UnrolledList.h

// ...

template <typename T, size_t NodeCapacity>
class UnrolledList {
  // ...

public:
  // ...
  void Insert(Iterator Pos, const T& Value) {
    Node* Target = Pos.TargetNode;

    if (Target->Count == NodeCapacity) {
      return; 
      Node* NewNode = new Node(); 
      size_t Half = NodeCapacity / 2; 

      // Move the back half of the data to the new node
      for (size_t i = 0; i < Half; ++i) { 
        NewNode->Elements[i] = Target->Elements[Half + i]; 
      } 

      NewNode->Count = Half; 
      Target->Count = NodeCapacity - Half; 

      // Wire the new node into the chain
      NewNode->Next = Target->Next; 
      Target->Next = NewNode; 

      // Adjust our target if the insertion point
      // falls into the newly created right-hand node
      if (Pos.LocalIndex >= Target->Count) { 
        Target = NewNode; 
        Pos.LocalIndex -= Target->Count; 
      } 
    }

    for (size_t i = Target->Count; i > Pos.LocalIndex; --i) {
      Target->Elements[i] = Target->Elements[i - 1];
    }

    Target->Elements[Pos.LocalIndex] = Value;
    Target->Count++;
    TotalElements++;
  }
};

This split operation introduces a sudden latency spike. We are executing a new allocation and copying memory.

However, because each node has NodeCapacity slots, we only suffer this split penalty once every NodeCapacity / 2 insertions. This is an amortized cost. By paying a slightly higher cost occasionally, we guarantee that the vast majority of our insertions execute via the ultra-fast L1 cache shift.

Additionally, if we combine this with some of the custom allocator techniques covered previously, such as storing our nodes in a pre-allocated , we can avoid the performance cost of allocating new memory on the heap.

Implementing Deletion

The canonical removal pattern that lists need to support is the removal of the first element, or the removal of an arbitrary element for which we acquired an iterator. For example, these are implemented as the pop_front() and erase() methods of the standard library lists.

Just like insertion, standard contiguous arrays suffer massively here. If we remove an element near the front of a std::vector, the CPU must move every subsequent element one position to the left to close the gap.

In our unrolled list, we only shift the elements located within the current fat node. The rest of the collection remains physically undisturbed.

Let's implement this in code:

dsa_core/include/dsa/UnrolledList.h

// ...

template <typename T, size_t NodeCapacity>
class UnrolledList {
  // ...

public:
  // ...
  void Remove(Iterator Pos) {
    Node* Target = Pos.TargetNode;

    // Shift local elements to the left to overwrite the victim
    for (size_t i = Pos.LocalIndex; i < Target->Count - 1; ++i) {
      Target->Elements[i] = Target->Elements[i + 1];
    }

    Target->Count--;
    TotalElements--;
  }
  
  // Simple helper to remove the first element in the list
  void RemoveFront() {
    Remove({Head, 0});
  }
};

Again, because removal from an unrolled list involves physical movement of other elements within the same chunk, any pre-existing pointers should be considered invalid after this action.

Removing Empty Nodes

After removing an element from a node, it is possible that the node is now empty. We can check for that and remove the entire node from the list to keep things compact and efficient:

dsa_core/include/dsa/UnrolledList.h

// ...

template <typename T, size_t NodeCapacity>
class UnrolledList {
  // ...

public:
  // ...
  void Remove(Iterator Pos) {
    // ...

    // If the node is completely empty, and it isn't the Head,
    // we should unlink it and free the memory
    if (Target->Count == 0 && Target != Head) {
      // Because this is a singly-linked implementation, we must
      // walk from the Head to find the previous node to unlink it
      Node* Prev = Head;
      while (Prev != nullptr && Prev->Next != Target) {
        Prev = Prev->Next;
      }

      if (Prev != nullptr) {
        Prev->Next = Target->Next;
        delete Target;
      }
    }
  }
  // ...
};

Advanced: Merging Sparse Nodes

In our implementation above, we only delete a node when its Count reaches 0. While this works, it introduces a subtle fragmentation risk. We could end up with hundreds of fat nodes that each contain only 1 or 2 active elements.

If this happens, the performance of the unrolled list degrades back into a standard std::list but with a much larger memory footprint. The CPU is forced to chase pointers constantly because the contiguous arrays are practically empty.

In a more robust implementation, we could implement a merge mechanic. During the Remove() function, if a node's Count drops below a certain threshold (e.g., 25% capacity), the structure checks the neighboring node. If the combined elements of both nodes can fit into a single node, the data is merged, and one node is deleted.

This guarantees that the structural density of the arrays remains high, keeping our cache lines populated with useful data.

Benchmarking

The goal of the unrolled linked list is to create a middle ground between the traversal speed of contiguous arrays and the insertion flexibility of linked lists. That is, our container should:

  • Iterate through the collection faster than a std::list
  • Insert items in arbitrary positions faster than a std::vector

Let's check if we were successful. We will first simulate inserting 100,000 elements right into the front of a collection:

benchmarks/main.cpp

#include <benchmark/benchmark.h>
#include <vector>
#include <list>
#include <dsa/UnrolledList.h>

static void BM_VectorFrontInsert(benchmark::State& state) {
  for (auto _ : state) {
    state.PauseTiming();
    std::vector<int> v;
    state.ResumeTiming();
    for (int i = 0; i < 100'000; ++i) {
      v.insert(v.begin(), 42);
    }
  }
}

static void BM_ListFrontInsert(benchmark::State& state) {
  for (auto _ : state) {
    state.PauseTiming();
    std::list<int> l;
    state.ResumeTiming();
    for (int i = 0; i < 100'000; ++i) {
      l.insert(l.begin(), 42);
    }
  }
}

static void BM_UnrolledFrontInsert(benchmark::State& state) {
  for (auto _ : state) {
    state.PauseTiming();
    UnrolledList<int, 64> ul;
    state.ResumeTiming();
    for (int i = 0; i < 100'000; ++i) {
      ul.InsertFront(42);
    }
  }
}

#define BENCHMARK_MS(func) \
  BENCHMARK(func)->Unit(benchmark::kMillisecond)

BENCHMARK_MS(BM_VectorFrontInsert);
BENCHMARK_MS(BM_ListFrontInsert);
BENCHMARK_MS(BM_UnrolledFrontInsert);

For std::vector, this will trigger massive hardware shifts. For std::list, it will trigger individual global heap allocations. For our UnrolledList, it will perform blazing-fast L1 cache shifts, occasionally splitting a node:

---------------------------------
Benchmark                    Time
---------------------------------
BM_VectorFrontInsert      9.38 ms
BM_ListFrontInsert        1.72 ms
BM_UnrolledFrontInsert   0.024 ms

Now, let's benchmark the traversal speeds. We'll give each container 1,000,000 elements, and then benchmark an algorithm that iterates through all of them:

benchmarks/main.cpp

#include <benchmark/benchmark.h>
#include <vector>
#include <list>
#include <dsa/UnrolledList.h>

static void BM_ListTraversal(benchmark::State& state) {
  std::list<int> l(1'000'000, 42);
  for (auto _ : state) {
    int sum = 0;
    for (int val : l) sum += val;
    benchmark::DoNotOptimize(sum);
  }
}

static void BM_UnrolledTraversal(benchmark::State& state) {
  UnrolledList<int, 64> ul;
  for (int i = 0; i < 1'000'000; ++i) ul.InsertFront(42);
  for (auto _ : state) {
    int sum = 0;
    ul.Traverse([&](int val) { sum += val; });
    benchmark::DoNotOptimize(sum);
  }
}

static void BM_VectorTraversal(benchmark::State& state) {
  std::vector<int> v(1'000'000, 42);
  for (auto _ : state) {
    int sum = 0;
    for (int val : v) sum += val;
    benchmark::DoNotOptimize(sum);
  }
}

#define BENCHMARK_MS(func) \
  BENCHMARK(func)->Unit(benchmark::kMillisecond)

BENCHMARK_MS(BM_ListTraversal);
BENCHMARK_MS(BM_UnrolledTraversal);
BENCHMARK_MS(BM_VectorTraversal);
-------------------------------
Benchmark                   CPU
-------------------------------
BM_ListTraversal        12.2 ms
BM_UnrolledTraversal   0.600 ms
BM_VectorTraversal     0.322 ms

As expected, our unrolled list crushes the pure pointer-chasing mechanics of std::list. Because we pack 64 elements contiguously before jumping to a new pointer, the CPU prefetcher is heavily engaged.

Arrays likestd::vector are the gold standard of traversal, but our UnrolledList is keeping up admirably in these tests.

Complete Code

Here is the complete implementation of our UnrolledList:

Files

dsa_core
dsa_app
Select a file to view its content

Advanced: Chunked Arrays

Our unrolled linked list solves the cache-line problem by grouping elements into contiguous blocks. However, it still suffers from one of the major drawbacks of standard linked lists: a lack of fast random access.

Finding the Nth element fundamentally still requires walking the Next pointer chain from the head node. It is faster than a linked list, as each pointer jump skips over Count elements rather than 1, but it is still an O(N)O(N) process.

A chunked array (also called a segmented array) trades some of the insertion and deletion speed of an unrolled list to solve the random access problem. Whilst an unrolled list links nodes together with Next pointers, a chunked array uses a central "directory" to keep track of where all the nodes are:

Because every chunk is uniform in capacity, the container does not need to traverse a chain to find elements. It can use simple O(1)O(1) math to locate any element by index instantly.

For example, if every chunk holds exactly 64 elements and we request a global index 200, the container knows immediately that the value lives in chunk 3 (200 / 64) at local index 8 (200 % 64). It reads the pointer from index 3 of the top-level directory, and jumps straight to the correct memory address almost as quickly as a std::vector.

This arithmetic only works across chunks that are fully populated, which imposes some structural limitations. Segmented arrays typically only permit gaps in their "first" and "last" chunks, ensuring all of the "middle" chunks remain fully allocated and compatible with the O(1)O(1) pointer math.

This makes them ideal for double-ended queues such as std::deque, where insertion and removal of elements at the start or end (in the first and last chunks) are fast O(1)O(1) operations, but anywhere else requires a slow O(N)O(N) restructuring. If we can accept these limitations, chunked arrays strike a middle ground between the structural flexibility of an unrolled list and the fast O(1)O(1) random access of an array.

Summary

In this lesson, we explored a highly practical architectural compromise.

  1. We broke the false dichotomy between strict arrays and strict linked lists by designing a fat node that holds an internal array.
  2. We tuned the capacity of our nodes to align with our use case or target hardware, such as the CPU's 64-byte cache line, ensuring that a single RAM fetch yields maximum payload processing.
  3. We implemented the split mechanic, dynamically dividing full nodes to accommodate new insertions without triggering a global array reallocation.
  4. We relied on the physical efficiency of the CPU's L1 cache and SIMD instructions to prove that shifting tiny amounts of memory locally is significantly faster than chasing pointers out to main RAM.

Our nodes so far have acted as wrappers around our data. UnrolledList::Node physically encloses the array of elements. But what if we don't want a wrapper at all? What if we want objects that exist independently but can still be logically linked together?

In the next lesson, we will look at the exact technique the Linux Kernel uses to manage its internal structures without making a single memory allocation: the intrusive linked list.

Have a question about this lesson?
Answers are generated by AI models and may not be accurate