In the previous lesson, we established that long-running systems will eventually choke and die if they rely on the global C++ allocator for dynamic lifetimes. External fragmentation splinters the heap, and scattered data triggers massive hardware stalls via TLB misses and page faults.

The industry-standard solution is to ban the use of general-purpose new and delete in the critical execution path. Instead, we ask the operating system for a massive, contiguous slab of memory when the program boots up, and we take on the responsibility of managing that space ourselves.

In this lesson, we will build the simplest, fastest, and most ruthless memory manager in existence: the linear allocator, frequently called an arena allocator.

We are going to treat memory exactly as the CPU sees it: an array of raw bytes.

Bypassing the OS Completely

To build our allocator, we need a dedicated class that holds onto a block of memory. When our Arena object is constructed, it will make a single call to the global allocator to grab our physical memory budget. The type we assign to these bytes doesn't particularly matter at this stage. We will use std::byte, a modern C++ type specifically designed to represent raw memory, but representing bytes as chars or uint8_ts is also common:

Arena.h

#pragma once
#include <cstddef> // for std::byte, size_t
#include <new>

class Arena {
private:
  std::byte* Buffer;
  size_t Capacity;

public:
  Arena(size_t Size)
    : Capacity{Size} {
    // A single, heavy allocation on boot
    Buffer = new std::byte[Size];
  }

  ~Arena() {
    delete[] Buffer;
  }

  // Prevent copying - we don't want multiple arenas
  // thinking they own the same physical memory
  Arena(const Arena&) = delete;
  Arena& operator=(const Arena&) = delete;
};

Once an Arena is constructed, we own a large, contiguous block of memory. As far as the operating system is concerned, our application is using this memory, even if we have not stored anything in it yet. The operating system will leave it alone, and no other program can touch it.

Additionally, as long as the Size of our arena is sufficient for our needs, we will never need to ask the OS for any more memory, completely avoiding the performance cost of any further dynamic allocations.

The Bump Pointer Mechanic

Once our Arena has acquired a big block of RAM, it needs a way to distribute chunks of this memory to the other parts of the application that need it. Our Arena does not need the expensive search for free space that makes the global C++ allocator so slow. Instead, it can implement a lightweight "bump pointer" mechanic.

We simply keep track of an Offset integer that represents how many bytes of our buffer we have handed out so far. When a request comes in, we calculate the current memory address, return it, and bump the offset forward by the requested size.

We'll implement this as a function called Allocate():

Arena.h

// ...

class Arena {
private:
  std::byte* Buffer;
  size_t Capacity;
  size_t Offset{0};

public:
  // ...

  void* Allocate(size_t Size) {
    // Check if we have enough physical space left. Comparing
    // against the remaining space (rather than Offset + Size)
    // avoids integer overflow if Size is enormous
    if (Size > Capacity - Offset) {
      return nullptr;
    }

    // Calculate the exact memory address to return
    void* Result = Buffer + Offset;

    // Bump the offset forward for the next allocation
    Offset += Size;

    return Result;
  }

  // ...
};

Within Allocate(), there are no locks. There are no loops. There are no metadata headers being injected into the RAM. It is a single bounds check and an integer addition. This is an $O(1)$ operation that typically compiles down to a handful of instructions. We'll write benchmarks to compare the performance of this to other approaches later in the lesson.
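
To make the mechanic concrete, the sketch below traces two back-to-back allocations and measures how far apart they land. BumpDemo and DistanceBetween are hypothetical names used only for illustration, and the arena is backed by a small fixed array rather than a heap allocation so that the example stands on its own:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the lesson's Arena, backed by a
// fixed 64-byte array so the sketch has no other dependencies
struct BumpDemo {
  std::byte Buffer[64];
  std::size_t Capacity{sizeof(Buffer)};
  std::size_t Offset{0};

  void* Allocate(std::size_t Size) {
    if (Size > Capacity - Offset) {
      return nullptr; // Not enough space left
    }
    void* Result = Buffer + Offset;
    Offset += Size; // The "bump"
    return Result;
  }
};

// Returns the distance in bytes between two consecutive
// allocations, where the first is FirstSize bytes long
std::ptrdiff_t DistanceBetween(std::size_t FirstSize) {
  BumpDemo Demo;
  auto* A = static_cast<std::byte*>(Demo.Allocate(FirstSize));
  auto* B = static_cast<std::byte*>(Demo.Allocate(8));
  assert(A && B); // Both requests fit in this sketch
  return B - A;
}
```

The second allocation always begins exactly where the first one ended, so DistanceBetween(3) is 3. Nothing is stored between the two blocks, which is also why this version gives no alignment guarantees.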

Implementing Memory Alignment

If we test our Allocate() function as it currently exists, it will work most of the time. But eventually, it will cause the CPU to suffer a massive performance penalty or forcefully crash the program with a hardware exception.

As we covered in the previous lesson, the hardware expects that certain data types be aligned in memory. For example, a 4-byte int must start at a memory address cleanly divisible by 4. An 8-byte double must start at a memory address cleanly divisible by 8.

Imagine we ask our current Arena for 3 bytes (perhaps for a small string), and then we immediately ask it for an 8-byte double:

// Offset is 0
void* str = MyArena.Allocate(3);
// Offset bumps to 3

// Offset is 3. We ask for a double.
void* num = MyArena.Allocate(8);
// The double is placed at memory address Buffer + 3

Because 3 is not cleanly divisible by 8, our double is misaligned.

If a misaligned variable straddles the boundary between two 64-byte cache lines, the CPU is forced to load two cache lines instead of one, stitching the halves together in the registers to recreate our value.

This degrades performance and, on more rigid architectures, the hardware refuses to do this work entirely and will just crash.
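
One way to see the problem is to inspect the numeric address directly. The sketch below reproduces the 3-byte-then-double scenario. IsAligned and DoubleLandsAligned are hypothetical helper names, and the code assumes a mainstream platform where converting a pointer to std::uintptr_t yields its plain numeric address:

```cpp
#include <cstddef>
#include <cstdint>

// Returns true if Ptr is a suitable address for an object
// with the given alignment requirement
bool IsAligned(const void* Ptr, std::size_t Alignment) {
  auto Address = reinterpret_cast<std::uintptr_t>(Ptr);
  return Address % Alignment == 0;
}

// Reproduces the scenario above: a 3-byte allocation
// followed immediately by a request for a double
bool DoubleLandsAligned() {
  alignas(8) std::byte Buffer[32];
  std::size_t Offset = 3; // State after the 3-byte allocation
  return IsAligned(Buffer + Offset, alignof(double));
}
```

DoubleLandsAligned() returns false: the address Buffer + 3 is not a multiple of alignof(double), so constructing a double there would be invalid.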

Automatically Padding the Arena

To fix this, our Allocate() function can accept an Alignment parameter. Before we return the pointer, we must calculate if the current Offset satisfies that alignment. If it doesn't, we must artificially bump the Offset forward, injecting dead padding bytes until we hit a clean boundary.

We could do this using standard arithmetic involving the modulo operator %. However, that approach requires division, which is one of the slowest arithmetic instructions on the CPU.

Instead, because alignment requirements in C++ are powers of two (2, 4, 8, 16, 32), we can use incredibly fast bitwise arithmetic. The formula to find the next address that meets an alignment requirement is a classic pattern, but it can take some time to understand why it works:

(Offset + Alignment - 1) & ~(Alignment - 1)

We have added it to our Allocate() function below, and a step-by-step explanation of the bitwise manipulations is provided in the following section, for those interested:

Arena.h

// ...

class Arena {
// ...

public:
  // ...
  // We default to the system's maximum fundamental alignment
  // (usually 8 or 16 bytes) just to be safe.
  void* Allocate(
    size_t Size,
    size_t Alignment = alignof(std::max_align_t)
  ) {
    // 1. Round the offset up to the next multiple of Alignment.
    // This assumes Alignment is a power of two, and that Buffer
    // itself is aligned to at least Alignment
    size_t AlignedOffset =
      (Offset + Alignment - 1) & ~(Alignment - 1);

    // 2. Check if the aligned request fits, written so that
    // an enormous Size cannot overflow the comparison
    if (AlignedOffset > Capacity ||
        Size > Capacity - AlignedOffset) {
      return nullptr;
    }

    // 3. Set up the result and bump the offset
    void* Result = Buffer + AlignedOffset;
    Offset = AlignedOffset + Size;

    return Result;
  }
  // ...
};
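
One caveat worth noting: rounding the Offset only produces an aligned address if the start of the buffer is itself aligned to at least the requested Alignment. That generally holds for fundamental alignments on memory returned by new, but not necessarily for larger requests such as 64. A possible variation, sketched below as a free function with the hypothetical name AllocateByAddress, applies the same formula to the full address instead. It assumes Alignment is a power of two and that pointers convert to plain numeric addresses:

```cpp
#include <cstddef>
#include <cstdint>

// Sketch: aligns the absolute address rather than the offset,
// so it works even when Buffer starts at an unaligned address
void* AllocateByAddress(
  std::byte* Buffer, std::size_t Capacity,
  std::size_t& Offset, std::size_t Size, std::size_t Alignment
) {
  auto Base = reinterpret_cast<std::uintptr_t>(Buffer);
  std::uintptr_t Current = Base + Offset;
  std::uintptr_t Mask =
    ~(static_cast<std::uintptr_t>(Alignment) - 1);
  std::uintptr_t Aligned = (Current + Alignment - 1) & Mask;

  // Convert back to an offset from the start of the buffer
  auto AlignedOffset = static_cast<std::size_t>(Aligned - Base);
  if (AlignedOffset > Capacity ||
      Size > Capacity - AlignedOffset) {
    return nullptr;
  }

  Offset = AlignedOffset + Size;
  return Buffer + AlignedOffset;
}
```

If the buffer begins one byte past a 64-byte boundary, a request with Alignment 64 skips 63 padding bytes and returns the next address that is a true multiple of 64.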

Advanced: The Alignment Formula

The expression (Offset + Alignment - 1) & ~(Alignment - 1) is a classic systems programming idiom. It compiles down to a few simple integer instructions (an addition, a subtraction, a bitwise NOT, and a bitwise AND), with no division and no branching.

Let's imagine the case where our Offset is $3$ and we need an Alignment of $8$. In that scenario, we'd want our AlignedOffset to be $8$. Let's walk through how the alignment formula generates that result:

1. On the left side of the expression, we calculate (Offset + Alignment - 1) to push our offset past the current boundary. In our example, this will be $3 + 8 - 1 = 10$, or 00001010 in binary.
2. We want to snap this result back down to the nearest multiple of Alignment. In our case, we want to snap $10$ back to the nearest multiple of $8$, which is $8 \times 1 = 8$. We will do this by applying a mask to our 00001010 value.
3. On the right side of the expression, we calculate (Alignment - 1). This is $8 - 1 = 7$ in our example, which in binary is 00000111.
4. We invert this result with the ~ operator to give 11111000. This is our mask.
5. Our expression has now simplified to 00001010 & 11111000.
6. This evaluates to 00001000, which is exactly $8$.

Feel free to walk through this process with different Offset and Alignment values to be confident that it generates the expected output in all cases.
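
Rather than checking values by hand, we can let the compiler do the walking. The sketch below compares the bitwise formula against a straightforward modulo-based version for a range of offsets and power-of-two alignments. The function names AlignUpBitwise, AlignUpModulo, and FormulasAgree are hypothetical and used only for this illustration:

```cpp
#include <cstddef>

// The bitwise formula from this lesson.
// Only valid when Alignment is a power of two
constexpr std::size_t AlignUpBitwise(
  std::size_t Offset, std::size_t Alignment
) {
  return (Offset + Alignment - 1) & ~(Alignment - 1);
}

// The slower, more obvious version using the modulo operator
constexpr std::size_t AlignUpModulo(
  std::size_t Offset, std::size_t Alignment
) {
  std::size_t Remainder = Offset % Alignment;
  return Remainder == 0
    ? Offset
    : Offset + (Alignment - Remainder);
}

// Checks every offset below 256 against the power-of-two
// alignments 1, 2, 4, ..., 64
constexpr bool FormulasAgree() {
  for (std::size_t A = 1; A <= 64; A *= 2) {
    for (std::size_t Off = 0; Off < 256; ++Off) {
      if (AlignUpBitwise(Off, A) != AlignUpModulo(Off, A)) {
        return false;
      }
    }
  }
  return true;
}

static_assert(AlignUpBitwise(3, 8) == 8);  // The worked example
static_assert(AlignUpBitwise(8, 8) == 8);  // Already aligned
static_assert(AlignUpBitwise(9, 8) == 16); // Next boundary
static_assert(FormulasAgree());
```

Because everything is constexpr, a mismatch would show up as a compilation error rather than a runtime bug. The two versions only agree for power-of-two alignments, which is why validating the Alignment argument matters.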

Advanced: Using `std::has_single_bit` and `std::bit_ceil`

Numbers that are powers of two (1, 2, 4, ...) have a useful property: their binary representation contains exactly one set bit (00000001, 00000010, 00000100, ...).

The C++20 <bit> library that we introduced provides some high-performance utilities that allow us to take advantage of this property. For example, if we need to check that a value is a power of two, std::has_single_bit() can do that:

#include <bit>

int main() {
  // std::has_single_bit() only accepts unsigned integer
  // types, hence the u suffix on each literal
  std::has_single_bit(0u); // False (00000000)
  std::has_single_bit(1u); // True  (00000001)
  std::has_single_bit(2u); // True  (00000010)
  std::has_single_bit(3u); // False (00000011)

  std::has_single_bit(7u); // False (00000111)
}

We could use this in our Allocate() function to catch unacceptable Alignment values in debug builds:

// ...
#include <cassert>
#include <bit>

class Arena {
// ...

public:
  // ...
  void* Allocate(
    size_t Size,
    size_t Alignment = alignof(std::max_align_t)
  ) {
    // Alignment must be a power of two for the bitwise
    // formula to work
    assert(std::has_single_bit(Alignment));

    // ...
  }
  // ...
};

We also have std::bit_ceil(), which rounds a value up to the nearest power of two, leaving it unchanged if it is already a power of two:

#include <bit>

int main() {
  // Like has_single_bit(), these functions only accept
  // unsigned integer types
  std::bit_ceil(0u); // 1 (00000000 -> 00000001)
  std::bit_ceil(1u); // 1 (00000001 -> 00000001)
  std::bit_ceil(2u); // 2 (00000010 -> 00000010)
  std::bit_ceil(3u); // 4 (00000011 -> 00000100)

  std::bit_ceil(7u); // 8 (00000111 -> 00001000)

  // bit_floor is also available if we need to round
  // values down to their closest power of two
  std::bit_floor(7u); // 4 (00000111 -> 00000100)
}

Instead of asserting, we could alternatively use std::bit_ceil() to silently correct any invalid Alignment arguments in our Allocate() function:

// ...
#include <bit>

class Arena {
// ...

public:
  // ...
  void* Allocate(
    size_t Size,
    size_t Alignment = alignof(std::max_align_t)
  ) {
    Alignment = std::bit_ceil(Alignment);

    // ...
  }
  // ...
};

Using Placement New

Now that our Arena securely manages raw, aligned memory, how do we actually put objects into it?

We cannot use the standard new Player() syntax, because that hardcodes a call to the global allocator. We already have the memory; we just need the compiler to run the Player constructor directly on top of our pre-allocated bytes.

We can do this using Placement New.

Placement new is a specialized C++ syntax that allows us to pass a specific memory address to the new operator. The compiler skips the allocation step entirely and instantly invokes the constructor at the target address.

For example, if we wanted to use a memory location stored in a variable called address to construct a new Player object, passing 42 and 100.0f to the Player constructor, our syntax would look like this:

new (address) Player(42, 100.0f);

A complete example of setting up a new Arena and constructing a new Player in it might look like this:

main.cpp

#include <iostream>
#include "Arena.h"

struct Player {
  int ID;
  float Health;

  Player(int i, float h) : ID{i}, Health{h} {
    std::cout << "Player " << ID << " spawned.\n";
  }
};

int main() {
  // Grab a 1-Megabyte block from the OS once
  Arena LevelArena(1024 * 1024);

  // Ask our Arena for the exact bytes needed for a Player
  void* Memory = LevelArena.Allocate(
    sizeof(Player),
    alignof(Player)
  );

  // Execute the Player constructor inside our Arena memory
  Player* P1 = new (Memory) Player(42, 100.0f);

  std::cout << "Health: " << P1->Health;

  return 0;
}
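
Writing the Allocate() call and the placement new separately at every call site gets repetitive, and it is easy to forget that Allocate() can return nullptr. One possible convenience is a small helper template that combines the two steps. The sketch below is an assumption about how that might look, not part of the lesson's Arena: MiniArena and Create are hypothetical names, and MiniArena is a cut-down, fixed-size stand-in so that the example is self-contained:

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <utility>

// Hypothetical stand-in for the lesson's Arena, backed by a
// fixed 256-byte array aligned for any fundamental type
class MiniArena {
  alignas(std::max_align_t) std::byte Buffer[256];
  std::size_t Offset{0};

public:
  void* Allocate(std::size_t Size, std::size_t Alignment) {
    // Alignment must be a non-zero power of two
    assert(Alignment != 0 && (Alignment & (Alignment - 1)) == 0);
    std::size_t Aligned =
      (Offset + Alignment - 1) & ~(Alignment - 1);
    if (Aligned > sizeof(Buffer) ||
        Size > sizeof(Buffer) - Aligned) {
      return nullptr;
    }
    Offset = Aligned + Size;
    return Buffer + Aligned;
  }
};

// Allocates space for a T and constructs it in place,
// forwarding the arguments to T's constructor.
// Returns nullptr if the arena is full
template <typename T, typename... Args>
T* Create(MiniArena& Arena, Args&&... args) {
  void* Memory = Arena.Allocate(sizeof(T), alignof(T));
  if (!Memory) {
    return nullptr;
  }
  return new (Memory) T(std::forward<Args>(args)...);
}

struct Player {
  int ID;
  float Health;
  Player(int i, float h) : ID{i}, Health{h} {}
};
```

With this in place, a call such as Create<Player>(SomeArena, 42, 100.0f) replaces the two-step pattern, and callers only need to check the returned pointer.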

The Fast Discard

You might have noticed that our Arena class does not have a Free() or Deallocate() method.

This is an intentional limitation of the arena allocator pattern. If we allowed individual objects to be freed at random times, we would immediately introduce the exact external fragmentation and Swiss-cheese holes from the previous lesson.

We'll implement a more complex allocator soon to deal with dynamic lifetimes, but a linear allocator only moves forward. It cannot free individual objects.

So how do we reclaim memory? When the system that uses our Arena is done and no longer needs the objects it put there, we reclaim that memory all at once. When a player finishes the level, or when a renderer completes a frame, or when a web server finishes responding to a request, we simply reset the Offset back to 0:

Arena.h

class Arena {
  // ...
public:
  // ...

  void Reset() {
    Offset = 0;
  }
};

This is what makes the arena pattern so efficient. Deleting 10,000 objects takes the same amount of time as deleting 1 object: a single cycle to assign 0 to an integer. The old data is technically still sitting in RAM, but it doesn't matter. The next time we call Allocate(), our bump pointer will happily overwrite the dead bytes with new data.
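
The reuse is easy to observe: after a reset, the first allocation is handed the same address as the first allocation before the reset. The sketch below demonstrates this with a small fixed-size stand-in. FrameArena and ReusesMemoryAfterReset are hypothetical names for illustration only:

```cpp
#include <cstddef>

// Hypothetical stand-in for the lesson's Arena, backed by a
// fixed 128-byte array. Alignment handling is omitted for brevity
struct FrameArena {
  alignas(std::max_align_t) std::byte Buffer[128];
  std::size_t Offset{0};

  void* Allocate(std::size_t Size) {
    if (Size > sizeof(Buffer) - Offset) {
      return nullptr;
    }
    void* Result = Buffer + Offset;
    Offset += Size;
    return Result;
  }

  // Discards everything in the arena at once
  void Reset() { Offset = 0; }
};

// Returns true if the first allocation after a reset lands
// on the same address as the first allocation before it
bool ReusesMemoryAfterReset() {
  FrameArena Arena;
  void* First = Arena.Allocate(32);
  Arena.Allocate(32); // A second object in the same "frame"

  Arena.Reset(); // Both objects are discarded here

  void* Again = Arena.Allocate(32);
  return First == Again;
}
```

This also shows why pointers obtained before a reset must not be used afterwards: the memory they refer to now belongs to whatever is allocated next.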

The Destructor Trap

There is a severe catch to this mass-deletion strategy. When we call standard delete P1, the compiler automatically invokes the ~Player() destructor before freeing the memory. Because we are bypassing delete and simply resetting an integer, destructors are never called.

If our Player object is just "Plain Old Data" (POD) - a struct full of floats, integers, and booleans - this is perfectly fine. More precisely, any type with a trivial destructor is safe to discard this way. The data is overwritten, and nothing is lost.

But if our object holds things like a network socket, an open file handle, or an internal std::vector that manages its own heap memory, wiping the Arena will permanently leak those external resources.

If we place complex objects that require additional cleanup into an arena, we are responsible for ensuring that the cleanup happens. This usually involves manually calling destructors before resetting the offset:

main.cpp

int main() {
  Arena LevelArena(1024 * 1024);

  void* Memory = LevelArena.Allocate(
    sizeof(Player), alignof(Player)
  );

  Player* P1 = new (Memory) Player(42, 100.0f);

  // We are done with the level
  // Manually invoke destructors or any other cleanup logic
  P1->~Player();

  // Reclaim all memory for the next level
  LevelArena.Reset();

  return 0;
}

Because of this caveat, arena allocators are most frequently used to store completely flat, contiguous data structures that don't require complex teardown logic.
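
If we want the compiler to enforce this restriction rather than relying on discipline, one option is to check the standard std::is_trivially_destructible type trait at compile time. The sketch below is one possible approach. SafeToDiscard, Particle, and Inventory are hypothetical names used only for illustration:

```cpp
#include <type_traits>
#include <vector>

// True when skipping T's destructor cannot leak anything,
// because the destructor does no work
template <typename T>
constexpr bool SafeToDiscard =
  std::is_trivially_destructible_v<T>;

// Flat data: safe to wipe by resetting the offset
struct Particle {
  float X;
  float Y;
  int Life;
};

// Owns heap memory through std::vector, so its destructor
// must run or that memory is leaked
struct Inventory {
  std::vector<int> Items;
};

static_assert(SafeToDiscard<Particle>);
static_assert(!SafeToDiscard<Inventory>);
```

A static_assert(SafeToDiscard<T>) placed inside a helper that constructs objects in the arena would reject types like Inventory at compile time, before they ever get the chance to leak.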

Benchmarking the Arena

Let's confirm our arena is hitting our performance goals. We will use Google Benchmark to pit our custom Arena directly against the new operator, and against pure stack allocation.

We will create a struct containing 32 bytes of payload data, and allocate it 10,000 times:

benchmark.cpp

#include <benchmark/benchmark.h>
#include <cstdint> // for uint64_t
#include "Arena.h"

struct Entity {
  uint64_t data[4]; // 32 bytes
};

static void BM_GlobalAllocator(benchmark::State& state) {
  for (auto _ : state) {
    for (int i = 0; i < 10000; ++i) {
      Entity* e = new Entity();
      benchmark::DoNotOptimize(e);
      delete e;
    }
  }
}
BENCHMARK(BM_GlobalAllocator);

static void BM_ArenaAllocator(benchmark::State& state) {
  // Pre-allocate enough space for 10,000 entities
  Arena MyArena(10000 * sizeof(Entity));

  for (auto _ : state) {
    for (int i = 0; i < 10000; ++i) {
      void* mem = MyArena.Allocate(
        sizeof(Entity), alignof(Entity)
      );
      Entity* e = new (mem) Entity();
      benchmark::DoNotOptimize(e);
    }
    // Instant O(1) bulk free
    MyArena.Reset();
  }
}
BENCHMARK(BM_ArenaAllocator);

static void BM_StackAllocation(benchmark::State& state) {
  for (auto _ : state) {
    for (int i = 0; i < 10000; ++i) {
      Entity e;
      benchmark::DoNotOptimize(e);
    }
  }
}
BENCHMARK(BM_StackAllocation);

-----------------------------
Benchmark                 CPU
-----------------------------
BM_GlobalAllocator   0.369 ms
BM_ArenaAllocator    0.016 ms
BM_StackAllocation   0.012 ms

By eliminating the free-list search algorithms and ditching the per-object hidden metadata headers, our custom Arena executes much faster than the global heap.

It achieves stack-like performance without the stack's size and lifecycle constraints. Objects can live in our Arena as long as they need to - their lifecycle isn't linked to the functions in which they were created.

Additionally, unlike the global heap, the memory managed by our Arena will never externally fragment, regardless of how long the program runs.

Summary

In this lesson, we created a solution to the performance traps of standard dynamic memory:

We constructed an arena allocator that bypasses the operating system entirely after an initial bulk allocation.
We implemented a bump pointer, achieving O(1) allocation that executes nearly as fast as stack allocation.
We accommodated the requirements around memory alignment, using fast bitwise arithmetic to pad our pointers and prevent cache-line straddling and CPU faults.
We used Placement New to forcefully execute C++ constructors inside our custom pre-allocated bytes.
We learned the fast discard pattern, instantly reclaiming massive amounts of memory by resetting a single integer, while recognizing the danger of bypassed destructors.

The arena is perfect for data with homogeneous lifespans, like all the physics objects in a single game level or all the JSON nodes in a single web request. But what happens if we need to spawn and despawn individual objects completely unpredictably? The arena cannot do that without fragmenting.

In the next lesson, we will build a custom allocator designed specifically to solve the problem of dynamic entity lifespans: the object pool and the implicit free list.
