Memory Fragmentation
Why do long-running programs inevitably degrade and crash? Discover the hardware reality of memory fragmentation and TLB thrashing.
Imagine we have just finished writing an open-world game. We have been diligent and confirmed that our game has exactly zero memory leaks. Our player boots up the game, and initially it outputs frames every few milliseconds for a smooth experience.
But as the player explores, spawning and destroying enemies, projectiles, and particle effects, the frame time starts to climb. The CPU usage spikes, yet the simulation seems to be doing less actual work. After an hour, the game starts to stutter. At hour two, the operating system forcibly terminates our executable due to an out-of-memory error.
How is this possible? We didn't leak a single byte.
Unfortunately, when we rely on the global allocator for a program with highly dynamic object lifespans, we destroy the physical layout of our users' RAM. This silent killer is called memory fragmentation, and it is the reason why virtually every major game engine, high-frequency trading platform, and embedded system bans the use of general-purpose new and delete in their core loops.
The Swiss Cheese Effect
To understand why our game crashed, we have to look at how the global allocator manages the massive chunks of memory it receives from the operating system.
When the allocator gives us a pointer via new, it has to give us a single, perfectly contiguous sequence of free bytes. If you ask for a 1-megabyte buffer, the allocator must find a 1-megabyte gap in its physical heap space where no other data currently resides.
Let's simulate a simplified scenario. Imagine our allocator manages a tiny heap of exactly 100 bytes. We start by allocating objects of various sizes and durations. Our Small objects don't last long - they get freed shortly after they are created. We also have some Large objects that need to stick around longer, so we'd free them sometime later:
// Assume our total heap is 100 bytes
struct Small { char data[10]; };
struct Large { char data[30]; };
void SimulateWork() {
// Allocate 100 bytes total. Heap is now 100% full.
Small* a = new Small(); // 10 bytes
Large* b = new Large(); // 30 bytes
Small* c = new Small(); // 10 bytes
Large* d = new Large(); // 30 bytes
Small* e = new Small(); // 10 bytes
Small* f = new Small(); // 10 bytes
// Delete the "Small" objects
delete a;
delete c;
delete e;
delete f;
// We still need the "Large" objects so we keep them
// alive for now
}After deleting the Small objects, we have successfully freed up 40 bytes of memory. We have 40 bytes available, and only 60 bytes in use.
Now, a new request comes in. We need to allocate one more Large object, which requires 30 bytes:
Large* g = new Large();This allocation will fail, and the program will crash with a std::bad_alloc exception.
Why? Because even though we have 40 bytes of free memory, we do not have 30 contiguous bytes. We have a 10-byte hole, another 10-byte hole, and a 20-byte hole, all separated by the Large objects that are still alive in memory. The following interactive illustrates this scenario - the longer our application runs, the more cycles of SimulateWork() it performs, gradually filling our memory with holes that reduce spatial locality. Eventually, our program just no longer has space to allocate another large object:
This is known as external fragmentation. As objects of varying sizes are allocated and destroyed at unpredictable times, the heap begins to look like Swiss cheese. There might be a lot of free space, but it is splintered across thousands of tiny gaps, each of them too small to contain the Large object we're trying to store.
When the allocator cannot find a contiguous block to satisfy our request, it is forced to go back to the Operating System and ask for more raw memory pages. It does this even if our application technically has gigabytes of "free" space trapped inside these fragmented gaps.
Eventually, the OS runs out of physical RAM to give us, and our zero-leak application crashes.
Internal Fragmentation
External fragmentation occurs between our allocated objects. But there is a second type of fragmentation that occurs inside the memory we asked for.
As we've learned, the CPU reads memory from RAM into its caches in fixed-size chunks - usually 64-byte cache lines. Because of how the silicon is physically wired to the memory bus, the CPU demands that certain data types align with specific mathematical boundaries in RAM.
For example, a 4-byte int should start at an address that is a multiple of 4. An 8-byte double must start at a memory address cleanly divisible by 8. If you force the CPU to read a double from an unaligned address (like 0x1003), the read might cross the boundary of two physical cache lines.
In the best case, this hurts performance - it forces the hardware to perform two separate memory fetches instead of one, and then reconstruct our double using the relevant bits from each cache line. In the worst case, the architecture doesn't support this misalignment at all, immediately crashing our program with a hardware fault.
To protect us from this, the global allocator enforces strict alignment. It will always return a pointer that satisfies the maximum alignment requirements of the system (typically 8 or 16 bytes).
But this safety net comes at a cost. If you ask the allocator for 17 bytes, it cannot give you 17 bytes. It must maintain alignment for the next object it allocates. Therefore, it will round your request up to the nearest alignment boundary - usually 32 bytes.
struct WeirdData {
char text[17];
};
void AllocateWeird() {
// We ask for 17 bytes.
WeirdData* data = new WeirdData();
// The allocator actually reserves 32 bytes of heap space.
// We have immediately wasted 15 bytes.
}This wasted padding is called internal fragmentation. When we combine internal fragmentation with the hidden metadata headers we discussed in the previous lesson, the actual memory consumed by our application can easily be double or triple what our sizeof() calculations suggest.
The Hardware Penalty of Scattered Data
Let's assume we have a system with virtually infinite RAM, so crashing from an out-of-nemory error isn't a concern. Fragmentation still ruins our application, but it does so by destroying our CPU's throughput.
In the first chapter, we discussed how the CPU cache relies on spatial locality. If you process an object, the CPU assumes you will probably want to process the object sitting physically next to it in RAM.
When our memory is heavily fragmented, our objects are not next to each other. They are scattered randomly across the physical silicon. This defeats the hardware prefetcher and guarantees a constant stream of L1 and L2 cache misses.
But it gets much worse. To understand the true penalty of fragmentation, we need to look at how the operating system maps memory.
The TLB (Translation Lookaside Buffer)
The memory addresses our program uses (like 0x7ffee943a120) are fake. They are virtual addresses. The operating system provides this illusion so that multiple programs can run at the same time without accidentally overwriting each other's physical RAM.
When our CPU tries to read from a pointer, it cannot simply ask the RAM chips for that address. It must first translate our fake virtual address into a real physical address. It does this by looking up a massive mapping dictionary called the page table, which is managed by the OS and stored in main memory.
Reading the page table for every single pointer dereference would be extremely slow. To fix this, the CPU has a dedicated, incredibly fast hardware cache built right next to the execution cores, called the TLB (Translation Lookaside Buffer).
The TLB stores the most recent translations from virtual to physical memory. If the translation is in the TLB, the CPU finds the physical address in 1 cycle. But the TLB is incredibly small. On many modern architectures, the L1 TLB can only hold about 64 to 128 translation entries.
Here is where fragmentation destroys performance. The OS maps memory in 4-Kilobyte "pages". If we allocate 10,000 small nodes contiguously in a std::vector, they will pack tightly into just a handful of 4KB pages. The CPU only needs a few TLB entries to translate the entire dataset.
But if our memory is heavily fragmented, our 10,000 nodes might be scattered across 10,000 different 4KB pages.
void ProcessFragmentedNodes(Node* head) {
Node* current = head;
while (current != nullptr) {
// If 'current' is on a memory page the TLB hasn't seen recently,
// the CPU suffers a TLB Miss.
Process(current->data);
// The CPU fetches the next pointer, which points to a completely
// different, random memory page.
current = current->next;
}
}When we iterate through a fragmented linked list, every pointer jump crosses into a new virtual page. The TLB instantly overflows. The CPU searches the TLB, fails to find the translation, and is forced to halt execution.
The hardware must then perform a page walk - a manual trip out to main RAM to read the OS page table just to figure out where your data actually lives. This stall costs hundreds of CPU cycles per object.
The Ultimate Stall: Page Faults
If we are particularly unlucky, the page walk discovers that the physical page holding your data isn't even in RAM anymore.
If our system is running low on physical memory - perhaps due to massive external fragmentation we covered earlier - the operating system will start taking pages of memory that haven't been used recently and saving them to the physical hard drive. This is called swapping or paging.
If our pointer dereference points to a page that has been swapped to disk, the CPU triggers a page fault. The operating system completely halts our application, wakes up the disk drive, laboriously copies the 4KB chunk of memory from the hard drive back into physical RAM, updates the page table, and finally lets our program resume.
A cache miss costs hundreds of cycles, but a page fault costs millions of cycles. If our data spans hundreds of fragmented, swapped-out pages, an algorithm that should take a few milliseconds will suddenly take a few seconds. This is what creates the stuttering behavior we talked about in the introduction.
The System Degradation Cycle
We can now map the complete lifecycle of a system that relies too much on general-purpose dynamic allocation. It is a predictable cycle of degradation.
- Boot and Initialization: The program starts. The heap is empty. The allocator happily hands out tightly packed, contiguous blocks of memory. Performance is fantastic. The TLB is perfectly utilized.
- The Churn: The application enters its main loop. It spawns and deletes objects at random times, punching holes in the heap.
- The Search Penalty: New allocations take longer. The allocator's internal search algorithm must now traverse a massive "free list" of scattered holes, trying to find a gap large enough to fit new requests.
- The TLB Thrash: Because objects are now being placed into random holes scattered across gigabytes of virtual address space, the data loses all spatial locality. Iterating over collections causes continuous TLB misses. The CPU cores spend more time waiting for page walks than executing logic.
- The Collapse: External fragmentation reaches a critical mass. A request for a large contiguous block fails. The OS runs out of pages to map. The program throws
std::bad_allocand dies.
This isn't a theoretical edge case. This is the guaranteed destiny of any software that performs unpredictable allocations over a long enough timeline.
Solving Fragmentation
If the global allocator inevitably leads to fragmentation, TLB thrashing, and crashes, how do we build systems that need to run indefinitely? How do servers stay alive for months? How do large games run at a smooth frame rate for hours?
The solution is pre-allocation. In these programs, the global new and delete operators are considered toxic in the "hot path" (the main execution loop of the program). Memory is treated not as an infinite resource that can be requested on demand, but as a fixed physical budget that must be strictly managed.
When the application boots, before the main loop ever begins, it asks the OS for massive, contiguous slabs of memory upfront. It accepts the slow cost of this initial allocation because it only happens once.
class GameEngine {
void* global_memory_block;
public:
void Initialize() {
// Ask the OS for 2 Gigabytes of memory ONCE during startup.
// This is slow, but it guarantees a massive contiguous chunk.
global_memory_block = malloc(2048 * 1024 * 1024);
}
void Run() {
while (engine_is_running) {
// During the game loop, we NEVER call new or delete.
// We manually carve up 'global_memory_block' ourselves.
SimulatePhysics();
RenderFrame();
}
}
};Once the application has reserved this massive physical slab, it never asks the OS for memory again. Instead, we write our own custom memory allocators to manage this slab. Because we know the types and sizes of data we're working with, and how our program uses it, we can manage this memory in a way that is highly optimized for our use case.
In the next lesson, we will do exactly that. We will build the fastest possible custom memory manager from scratch: the arena allocator.
Summary
In this lesson, we explored the physical consequences of dynamic memory allocation over time.
- External fragmentation occurs when memory is freed in random patterns, leaving the heap splintered. A program can run out of memory and crash despite having plenty of total free space, simply because the free space isn't contiguous.
- Internal fragmentation is the invisible waste inside an allocation, caused by the hardware's strict alignment requirements and the allocator's chunk headers.
- Fragmented data destroys the CPU's Translation Lookaside Buffer (TLB). Scattered objects force the hardware to constantly read the OS page table from main RAM, stalling the CPU for hundreds of cycles per pointer jump.
- If fragmented pages are swapped to disk, accessing them triggers a page fault, which stalls the CPU for millions of cycles.
- To avoid this degradation cycle, high-performance systems use pre-allocation. They reserve massive blocks of memory at startup and ban the use of general-purpose
newanddeleteduring execution.
Linear and Arena Allocators
Bypass the OS and the global allocator. Build a custom linear arena allocator to implement fast memory allocation with zero fragmentation.