CPUs, Memory, and Locality

Understand how our code interacts with physical hardware. Learn about virtual memory, the stack vs. heap, cache lines, and prefetching.

Ryan McCombe

In the previous lesson, we defined a data structure as a "shape" of information in memory. But before we can start building these structures, we should understand what that actually means.

When you write int* ptr{reinterpret_cast<int*>(0x7FFF0010)}, it's easy to imagine that ptr refers to a physical location in memory at exactly that address. This is not the case.

Modern computing is built on layers of deception. The address your program sees is fake. The memory you think you have might not exist. The single variable you asked for isn't what gets fetched. And the order in which you execute instructions is merely a "suggestion" to the CPU.

The OS & Virtual Memory

The first deception is virtual memory. When your program starts, the operating system (OS) hands it a "virtual address space." This is a continuous range of addresses that your program believes it owns entirely.

This virtual address space is much larger than what is physically available on the machine. And even if the machine did have all that space, our program isn't the only one running. Discord, Chrome, and the OS itself are all fighting for space.

To manage this, the OS breaks memory into small, fixed-size chunks called pages (typically 4KB in size). It then maintains a massive dictionary, called a page table, which maps the fake virtual pages your program sees to the actual physical frames in RAM.
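
To make the translation a bit more concrete, here's a minimal sketch of the arithmetic involved. Real page tables are multi-level structures maintained by the OS and walked by hardware, but the core idea is splitting an address into a page number and an offset within that page (the 4KB page size and the example address are the ones used above):

#include <cstdint>
#include <iostream>

int main() {
  // 4KB pages, as described above
  constexpr std::uintptr_t PageSize{4096};

  // The example address from earlier in the lesson
  std::uintptr_t VirtualAddress{0x7FFF0010};

  // Which virtual page the address falls in, and where within that page
  std::uintptr_t PageNumber{VirtualAddress / PageSize};
  std::uintptr_t Offset{VirtualAddress % PageSize};

  // The OS's page table maps PageNumber to some physical frame in RAM;
  // the offset within the page carries over unchanged
  std::cout << "Virtual page: " << PageNumber
            << ", offset within page: " << Offset << '\n';
}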

Why does the OS do this?

This translation layer seems like unnecessary overhead, but it solves three important problems:

  1. Isolation & Safety: Since every program has its own page table, your program physically cannot write to the memory of another program - it simply has no mapping to reach it. If you try to access an address that isn't mapped, the hardware raises a page fault, the OS steps in, and your program is terminated with a segmentation fault (segfault), as sketched in the example after this list.
  2. Illusion of Continuity: Your program might need a continuous block of 100MB for a large std::vector. Physical RAM might be fragmented, with only small gaps available here and there. Virtual memory allows the OS to handle this situation gracefully - it can grab scattered physical frames and stitch them together so they appear continuous.
  3. Overcommitment: The OS can promise you more memory than actually exists. It only assigns a physical frame when you actually write data to a page. If you run out of physical RAM, it can even move rarely used pages to the hard drive (a process called swapping) to free up space, all without your program knowing.
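
To see the isolation guarantee in action, here's a small sketch that dereferences an address our process has (almost certainly) never mapped. This is undefined behaviour by the language rules; on a typical desktop OS, the failed translation results in the program being killed with a segmentation fault:

#include <iostream>

int main() {
  // An address we never allocated. Our page table (almost certainly)
  // has no mapping for it, so this program exists only to demonstrate
  // the crash - dereferencing it is undefined behaviour.
  int* Unmapped{reinterpret_cast<int*>(0x7FFF0010)};

  // The hardware cannot translate the virtual address, the OS steps in,
  // and the process is terminated with a segmentation fault
  std::cout << *Unmapped << '\n';
}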

The Distance Problem

Over the last few decades, CPU speeds have increased exponentially. Memory speeds have also increased, but not nearly as fast. This has created a massive gap known as the Von Neumann Bottleneck.

When the CPU needs to read a value from RAM, it sends a request and then... waits. It stalls. It sits idle doing absolutely nothing while the electrical signals travel to the RAM stick, the RAM controller finds the data, and the signals travel back.

In computing, latency is the time it takes to complete one operation. Fetching a value from main RAM can take around 400x longer than fetching a value that the CPU already has in one of its registers.

So, if our program constantly fetches data from main RAM, our user's 4GHz CPU is effectively running at tens of megahertz. It is spending almost all of its time waiting for the data it needs to complete its task.

The Cache Hierarchy

To prevent the CPU from starving, hardware engineers added caches. These are smaller, faster banks of memory that are right on the CPU die.

Designing these caches involves balancing two conflicting goals. We want the caches to be large so they can store as many things as possible, reducing the number of slow trips we need to make to RAM. But we also want them to be as close to the CPU cores as possible, so that retrieving those things is faster.

To balance these goals, modern CPUs have many caches, categorized into "levels":

  • L1 cache: Tiny (e.g., 32KB), but extremely fast, as it sits right inside each CPU core.
  • L2 cache: Larger (e.g., 256KB) but slightly further from its core, making data access a little slower.
  • L3 cache: Much larger (e.g., 10MB), further away again, and typically shared between multiple CPU cores. Slower still, but much faster than RAM.

When the CPU needs data, it checks L1. If it's there (a cache hit), it's nearly instant. If not, it checks L2, then L3. Only if it misses all of them does it go to RAM (a cache miss).

The latency of the various caches on a typical, modern system looks something like this:

Location             Approx. latency    Cycles
CPU register         ~0.25ns            1 cycle
L1 cache             ~1ns               ~4 cycles
L2 cache             ~3-5ns             ~12-20 cycles
L3 cache (shared)    ~10-15ns           ~40-70 cycles
Main RAM             ~60-120ns          ~200-400 cycles

The Cache Line

You might think that when you ask for a single int (4 bytes), the RAM sends 4 bytes. It does not. To make the trip worth the latency, the memory controller always sends a "line" of data - typically 64 bytes.

This means that, when we request some value from RAM, its neighbors come along for the ride.

This is one of the mechanical reasons why contiguous data structures (arrays like std::vector) are so fast. When you request MyArray[0], the hardware brings MyArray[1], MyArray[2], and several others along with it. If we're iterating over an array, this is perfect - we need those values soon, and they're now in the cache.

Conversely, this is why linked lists (like std::list) are slow. The "nodes" of a linked list are scattered around in memory. When you load one node, you pull in 64 bytes of data, use the tiny part relevant to that node, and the rest of the cache line contains garbage - neighboring objects that are irrelevant to your list. When you follow the pointer to the next node, it's likely in a completely different part of RAM, causing another stall and another fetch.
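
The difference is easy to observe. The sketch below stores the same values contiguously in a std::vector and scattered across the nodes of a std::list, then sums each. Exact timings depend on your hardware, compiler, and optimization settings, but the vector traversal is typically several times faster purely because of how the data is laid out:

#include <chrono>
#include <iostream>
#include <list>
#include <numeric>
#include <vector>

// Sum every element of a container and report how long the traversal took
template <typename Container>
void TimeSum(const Container& Values, const char* Label) {
  auto Start{std::chrono::steady_clock::now()};
  long long Sum{std::accumulate(Values.begin(), Values.end(), 0LL)};
  auto End{std::chrono::steady_clock::now()};
  std::cout << Label << ": sum " << Sum << " in "
            << std::chrono::duration_cast<std::chrono::milliseconds>(
                   End - Start).count()
            << "ms\n";
}

int main() {
  constexpr int Count{10'000'000};

  // The same values in two very different memory layouts
  std::vector<int> Contiguous(Count, 1); // one solid block
  std::list<int> Scattered(Count, 1);    // one heap allocation per node

  TimeSum(Contiguous, "std::vector"); // cache lines and prefetching help
  TimeSum(Scattered, "std::list");    // pointer-chasing defeats both
}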

Advanced: False Sharing

This 64-byte granularity of a cache line isn't always beneficial, particularly if we're building multi-threaded algorithms. If two threads on different cores are modifying two different variables, but those variables happen to sit on the same cache line, the cores will fight over ownership of that line.

This is called false sharing, and the additional work to maintain cache coherence heavily degrades performance. Maxims like "you don't own the variable, you own the cache line" are commonly used to remind ourselves of this dynamic when we're designing multithreaded systems.
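
Here's a hedged sketch of the effect: two threads increment two independent counters, first laid out on (very likely) the same cache line, then padded onto separate lines with alignas(64). The 64-byte figure is a common cache line size rather than a guarantee, and the speedup varies by machine, but the padded version usually wins comfortably even though the threads never touch each other's data:

#include <atomic>
#include <chrono>
#include <functional>
#include <iostream>
#include <thread>

// Two counters that will (very likely) share a single 64-byte cache line
struct SharedLine {
  std::atomic<long long> A{0};
  std::atomic<long long> B{0};
};

// The same two counters, padded onto separate cache lines
struct SeparateLines {
  alignas(64) std::atomic<long long> A{0};
  alignas(64) std::atomic<long long> B{0};
};

// Have two threads hammer the two counters and time how long it takes
template <typename Counters>
long long TimeIncrements() {
  Counters C;
  auto Work{[](std::atomic<long long>& Value) {
    for (int i{0}; i < 10'000'000; ++i) { ++Value; }
  }};

  auto Start{std::chrono::steady_clock::now()};
  std::thread T1{Work, std::ref(C.A)};
  std::thread T2{Work, std::ref(C.B)};
  T1.join();
  T2.join();
  auto End{std::chrono::steady_clock::now()};
  return std::chrono::duration_cast<std::chrono::milliseconds>(
      End - Start).count();
}

int main() {
  std::cout << "Same cache line:      "
            << TimeIncrements<SharedLine>() << "ms\n";
  std::cout << "Separate cache lines: "
            << TimeIncrements<SeparateLines>() << "ms\n";
}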

The Prefetcher

The CPU is not a passive device; it is actively trying to predict your behavior. A specialized component called the hardware prefetcher analyzes the pattern of memory accesses. If it sees you access address X, then X+1, then X+2, it assumes you are iterating through an array.

It will quietly begin fetching X+3, X+4, and X+5 from RAM into the cache before you even ask for them. This is one of the core goals of high-performance programming: make the prefetcher happy.

If you store your data in contiguous arrays and iterate through them linearly, the prefetcher can hide almost all memory latency. Our algorithm arrives at the data just as the data arrives from RAM.

If you iterate through a pointer-heavy structure like a linked list, the access pattern looks random. The prefetcher cannot predict the next step, because the address of the next node is trapped inside the current node. It gives up, and you pay the full latency cost for every step.
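
One way to see the prefetcher at work is to traverse the same array twice: once in order, and once following a shuffled list of indices. The data and the total work are identical; only the predictability of the access pattern changes. In this sketch (timings will vary by machine), the ordered walk typically finishes far sooner:

#include <algorithm>
#include <chrono>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

// Sum Data in the order given by Indices, and report how long it took
void TimedSum(const std::vector<int>& Data,
              const std::vector<int>& Indices, const char* Label) {
  auto Start{std::chrono::steady_clock::now()};
  long long Sum{0};
  for (int i : Indices) { Sum += Data[i]; }
  auto End{std::chrono::steady_clock::now()};
  std::cout << Label << ": sum " << Sum << " in "
            << std::chrono::duration_cast<std::chrono::milliseconds>(
                   End - Start).count()
            << "ms\n";
}

int main() {
  constexpr int Count{10'000'000};
  std::vector<int> Data(Count, 1);

  // 0, 1, 2, ... - a stride the prefetcher can recognize
  std::vector<int> Ordered(Count);
  std::iota(Ordered.begin(), Ordered.end(), 0);

  // The same indices, shuffled - every access looks unpredictable
  std::vector<int> Shuffled{Ordered};
  std::shuffle(Shuffled.begin(), Shuffled.end(), std::mt19937{42});

  TimedSum(Data, Ordered, "Linear walk");
  TimedSum(Data, Shuffled, "Random walk");
}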

The Grand Unification: Locality

If we synthesize everything we have just covered - from the OS page tables down to the CPU prefetcher - we arrive at the key principle: locality of reference.

Hardware wants predictability. It rewards code that accesses memory in a focused, predictable manner and punishes code that scatters its attention. The two most important forms of locality are spatial (space) and temporal (time).

Spatial Locality

This is the principle that if you access a memory address, you are highly likely to access its neighbors soon.

  • The cache line: Accessing MyArray[0] pulls in MyArray[1], MyArray[2], and their other neighbors for free.
  • The prefetcher: Walking linearly allows the hardware to predict your path.
  • Page sizes: Staying within a 4KB page lets the hardware reuse its virtual-to-physical translation instead of performing an expensive new lookup.
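
For example, both functions below compute the same sum over the same contiguous buffer, but only the first walks memory in the order it is laid out. The grid dimensions here are arbitrary, chosen only so the data is much larger than the cache:

#include <iostream>
#include <vector>

constexpr int Rows{4'000};
constexpr int Cols{4'000};

// Row-major walk: adjacent iterations touch adjacent bytes, so every
// 64-byte cache line we fetch is fully used and the prefetcher can
// stay ahead of us
long long SumRowMajor(const std::vector<int>& Grid) {
  long long Sum{0};
  for (int r{0}; r < Rows; ++r) {
    for (int c{0}; c < Cols; ++c) { Sum += Grid[r * Cols + c]; }
  }
  return Sum;
}

// Column-major walk over the same data: every step jumps Cols * 4 bytes
// ahead, so each access likely lands on a different cache line (and
// often a different 4KB page)
long long SumColumnMajor(const std::vector<int>& Grid) {
  long long Sum{0};
  for (int c{0}; c < Cols; ++c) {
    for (int r{0}; r < Rows; ++r) { Sum += Grid[r * Cols + c]; }
  }
  return Sum;
}

int main() {
  // ~64MB - far larger than any cache level
  std::vector<int> Grid(Rows * Cols, 1);
  std::cout << SumRowMajor(Grid) << '\n';    // fast on most machines
  std::cout << SumColumnMajor(Grid) << '\n'; // typically much slower
}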

Temporal Locality

This is the principle that if you access a memory address, you are highly likely to access it again soon. Think of caches as the CPU's short-term memory.

If you process a piece of data and then immediately use it again, it is right there in the cache, hot and ready.

If you ignore it for too long, the CPU may have "forgotten" it. It could have been evicted from the cache to make room for the other data you've accessed more recently. Reading our original data again requires a slow fetch from RAM.
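
One practical way to exploit temporal locality is to finish all the work on a piece of data while it is still in the cache, rather than making repeated passes over a large buffer. In this sketch, the scale and offset steps are just stand-ins for any two transformations; for buffers small enough to fit in the cache, the difference largely disappears:

#include <vector>

// Two passes: by the time the second loop revisits the start of the
// buffer, those values have long since been evicted from the cache,
// so every element makes the slow trip from RAM twice
void ScaleThenOffset(std::vector<float>& Values) {
  for (float& V : Values) { V *= 2.0f; }
  for (float& V : Values) { V += 1.0f; }
}

// One fused pass: each element is scaled and offset while it is still
// hot in the cache, so it is only fetched from RAM once
void ScaleAndOffset(std::vector<float>& Values) {
  for (float& V : Values) { V = V * 2.0f + 1.0f; }
}

int main() {
  // ~200MB working set - far too large to stay resident in the cache
  std::vector<float> Values(50'000'000, 1.0f);
  ScaleThenOffset(Values);
  ScaleAndOffset(Values);
}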

Violating Locality

Poor performance is almost always a violation of locality. When we design software, we should try to pack data tightly (spatial locality) and keep our working set small (temporal locality).

A linked list violates spatial locality, as nodes are scattered.

A large loop body that touches too much data violates temporal locality. If the data our loop body uses cannot fit in the cache, we're forcing the CPU to read from RAM on every single iteration.

Latency vs Bandwidth

We've talked a lot about memory latency in this introduction, but it is often confused with the related idea of memory bandwidth. Latency is the delay between data being requested and data arriving; bandwidth is how much data can be moved per unit of time.

Let's imagine we're running a store. We can keep some items in stock (in our CPU cache), but the rest of our products are in a nearby warehouse (RAM). A clever employee (the prefetcher) predicts what our customers might want and tries to keep those items in stock.

But a customer walks in with a request we weren't predicting. We don't have the item in our store (a cache miss) so we ask the warehouse to send a truck. It will get here in an hour (latency) and our customer will wait. The fact that the truck has room (bandwidth) for a hundred other items doesn't help us here. It's still going to take an hour for the item we need to arrive.

Why Smaller Items are Faster

To improve performance, we'll often recommend keeping the size of items we're working with small. However, the main benefit of small items is often misunderstood. The instinct might be to think it's because we're being more efficient with bandwidth - that if our items are smaller, then more of them can fit into the truck.

In reality, the main benefit of smaller items is that we often don't need to call for a truck at all. When the items are smaller, we can keep more of them in our store (the CPU cache). This increases the likelihood that the things we need are already on hand.
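
For example, if a loop only needs one field of a large object, storing that field in its own tightly packed array keeps far more of the values we actually touch in stock. The Enemy type below is invented purely for illustration:

#include <cstdint>
#include <vector>

// A large object: the Health field our loop cares about is only 4 of
// its 256 bytes
struct Enemy {
  float Health;
  char Name[60];
  std::uint8_t Inventory[192];
};

// Walking the array of large objects: each Health we read drags in a
// 64-byte cache line where the other 60 bytes are data we never touch
float TotalHealth(const std::vector<Enemy>& Enemies) {
  float Total{0};
  for (const Enemy& E : Enemies) { Total += E.Health; }
  return Total;
}

// Storing just the hot field contiguously packs 16 values into every
// 64-byte cache line, so far more of the data we actually need fits
// in the cache at once
float TotalHealth(const std::vector<float>& Healths) {
  float Total{0};
  for (float H : Healths) { Total += H; }
  return Total;
}

int main() {
  std::vector<Enemy> Enemies(1'000'000);         // ~256MB of objects
  std::vector<float> Healths(1'000'000, 100.0f); // ~4MB of hot data
  TotalHealth(Enemies);
  TotalHealth(Healths);
}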

In an ideal world, we have a totally predictable workload and can optimize everything perfectly. We can arrange our items to eliminate cache misses, pack every cache line that the CPU fetches with only useful data, and create efficient algorithms that let the CPU process those lines faster than they can arrive. In those scenarios, bandwidth may become the bottleneck. But in all other cases, latency matters more.

Summary

We have peeled back the layers of how memory is managed. Here are the important points:

  1. Virtual Memory: The OS isolates us with a translation layer, but we pay for it if we jump around too much.
  2. Latency: RAM is hundreds of times slower than the CPU. We must avoid stalls as much as possible.
  3. Cache Lines: Data moves in chunks, usually 64 bytes. Packing data contiguously exploits this.
  4. Prefetching: Predictable access patterns allow the hardware to hide latency.

We now know the price of fetching data from memory. But what happens to the total bill when our input grows from ten items to ten million?

In the next lesson, we will learn to quantify the growth of our algorithms and identify the efficiency cliffs that can bring performance to its knees.

