Mapping Complexity and Hardware Cliffs

Using Google Benchmark's Range and Complexity features to visualize Big O growth and identify the moment our data falls off the CPU cache cliff.

Ryan McCombe

In the previous lesson, we declared std::vector the winner over std::forward_list for traversal speed. We based this on a benchmark running with 10,000 elements.

Relying on a single data point is dangerous. An algorithm might look incredibly fast with small collections because the data fits entirely inside your CPU's L2 cache. But what happens if we have millions or billions of items?

Does the performance scale linearly, or does it degrade superlinearly? Does it hit a "resource wall" where the memory subsystem gets overwhelmed?

In this lesson, we will use Google Benchmark's automation tools to run our code across a spectrum of input sizes. This will allow us to verify the theoretical Big O complexity and reveal the physical hardware cliffs that theory ignores.

Automated Scaling with Range()

Manually rewriting your benchmark to test size 100, then 200, then 400 is tedious. To help with this, Google Benchmark provides the ->Range() modifier.

Instead of hardcoding a constant like ITEM_COUNT, we pass a range of inputs to our test. Inside the benchmark function, we access the current input size using state.range(0).

Let's modify our list traversal benchmark to test sizes from 8 elements up to around 134 million elements:

benchmarks/main.cpp

#include <benchmark/benchmark.h>
#include <forward_list>
#include <algorithm> // for std::fill
#include <cstdint>   // for int64_t

static void BM_ListTraverse(benchmark::State& state) {
  int64_t n = state.range(0);

  // Now using n instead of 10,000
  std::forward_list<int> l(n);
  std::fill(l.begin(), l.end(), 1);

  for (auto _ : state) {
    long long sum = 0;
    for (int i : l) {
      sum += i;
    }
    benchmark::DoNotOptimize(sum);
  }
}

// Range: 8 items up to ~134 Million items
BENCHMARK(BM_ListTraverse)->Range(8, 8 << 24);

The 8 << 24 syntax is a bit shift, which is a commonly used way of expressing large powers of two. 8 << 24 is equivalent to 8 × 2^24 = 134,217,728.

If we run our benchmark now, we'll see BM_ListTraverse() tested with a wide range of list sizes. By default, Range() steps between the two limits in multiples of 8, which is why the sizes below run 8, 64, 512, and so on:

-----------------------------------------
Benchmark                             CPU
-----------------------------------------
BM_ListTraverse/8                 4.26 ns
BM_ListTraverse/64                54.7 ns
BM_ListTraverse/512                516 ns
BM_ListTraverse/4096              5859 ns
BM_ListTraverse/32768            71498 ns
BM_ListTraverse/262144          627790 ns
BM_ListTraverse/2097152       28125000 ns
BM_ListTraverse/16777216     218750000 ns
BM_ListTraverse/134217728   1828125000 ns

Determining Big O Complexity

Now that we have a dataset spanning multiple magnitudes of size, we can ask the library to analyze the trend.

We can add the ->Complexity() modifier to our benchmark registration. This tells the library to perform a regression analysis on the results, attempting to fit the data to known complexity curves such as O(n), O(n^2), and O(log n), and reporting which one fits best.

To help with this, each benchmark needs to tell the library what value of n we're actually using. This can be done through the state.SetComplexityN() function.

In our case, we're directly using state.range(0) to control the size of our list, so we can just echo that back to the library:

benchmarks/main.cpp

#include <benchmark/benchmark.h>
#include <forward_list>
#include <algorithm> // for std::fill
#include <cstdint>   // for int64_t

static void BM_ListTraverse(benchmark::State& state) {
  int64_t n = state.range(0);
  state.SetComplexityN(n);
  
  std::forward_list<int> l(n);
  std::fill(l.begin(), l.end(), 1);

  for (auto _ : state) {
    long long sum = 0;
    for (int i : l) {
      sum += i;
    }
    benchmark::DoNotOptimize(sum);
  }
}

BENCHMARK(BM_ListTraverse)
  ->Range(8, 8 << 24)
  ->Complexity();

When you run this, the output table changes. At the bottom, you will see a summary section:

-----------------------------------------
Benchmark                             CPU
-----------------------------------------
BM_ListTraverse/8                 4.20 ns
BM_ListTraverse/64                53.0 ns
BM_ListTraverse/512                500 ns
BM_ListTraverse/4096              5625 ns
BM_ListTraverse/32768            71498 ns
BM_ListTraverse/262144          585938 ns
BM_ListTraverse/2097152       28125000 ns
BM_ListTraverse/16777216     223958333 ns
BM_ListTraverse/134217728   1812500000 ns
BM_ListTraverse_BigO             13.50 N
BM_ListTraverse_RMS                  1 %

The BM_ListTraverse_BigO value of 13.50N shows the library detected linear complexity, or O(n), as expected. The time to traverse our list scales linearly with the length of the list. The 13.5 is the scaling factor: each new item in our list increases the scan time by around 13.5 nanoseconds on this machine.

The RMS (root mean square) value measures how well the data fits the chosen curve. Our very low RMS of 1% means the fit is tight, so the Big O classification is trustworthy. A high RMS (e.g., 40%) means the data is erratic, and the classification might be wrong.
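Incidentally, we don't have to let the library guess. Complexity() also accepts an explicit curve to fit against, such as benchmark::oN for linear or benchmark::oLogN for logarithmic, in which case the reported coefficient is fitted to that curve:

// Fit the results to a specific curve instead of auto-detecting
BENCHMARK(BM_ListTraverse)
  ->Range(8, 8 << 24)
  ->Complexity(benchmark::oN);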

The Hardware Cliff

Let's do some more analysis on our benchmark by calculating the time per element from our previous results:

Time Per Element = Total CPU Time / n
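For example, at n = 32,768 the earlier run measured 71,498 ns of CPU time, giving 71,498 / 32,768 ≈ 2.18 ns per element.
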
-----------------------------------------------
Benchmark                      Time Per Element
-----------------------------------------------
BM_ListTraverse/8              0.525 ns/element
BM_ListTraverse/64             0.828 ns/element
BM_ListTraverse/512            0.977 ns/element
BM_ListTraverse/4096           1.373 ns/element
BM_ListTraverse/32768          2.182 ns/element
BM_ListTraverse/262144         2.235 ns/element
BM_ListTraverse/2097152        13.41 ns/element
BM_ListTraverse/16777216       13.35 ns/element
BM_ListTraverse/134217728      13.51 ns/element

Using SetItemsProcessed()

This table was calculated manually, but if we tell the library how many items we processed, it will add an "items per second" column to the output, reporting similar information.

The number of items we processed will depend on how often the library decided to run the loop. We can get that via state.iterations() after the loop has completed:

static void BM_ListTraverse(benchmark::State& state) {
  int64_t n = state.range(0);

  // ... setup and timing loop as before ...

  // This should go AFTER the loop to ensure
  // state.iterations() returns the final count
  state.SetItemsProcessed(state.iterations() * n);
}
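Putting it together, the full benchmark would look something like this sketch, which simply combines our earlier traversal code with the new call:

#include <benchmark/benchmark.h>
#include <forward_list>
#include <algorithm> // for std::fill
#include <cstdint>   // for int64_t

static void BM_ListTraverse(benchmark::State& state) {
  int64_t n = state.range(0);
  state.SetComplexityN(n);

  std::forward_list<int> l(n);
  std::fill(l.begin(), l.end(), 1);

  for (auto _ : state) {
    long long sum = 0;
    for (int i : l) {
      sum += i;
    }
    benchmark::DoNotOptimize(sum);
  }

  // Placed after the loop, so state.iterations()
  // reflects the final iteration count
  state.SetItemsProcessed(state.iterations() * n);
}

BENCHMARK(BM_ListTraverse)
  ->Range(8, 8 << 24)
  ->Complexity();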

Hardware Cliffs

Despite the supposedly O(n) algorithm, the time per element isn't constant as we might expect. We see a gradual degradation in performance as our list gets larger, and it then falls off a cliff somewhere between n = 262,144 and n = 2,097,152.

If you're familiar with hardware caches and locality of reference, you might recognize why we see this falloff in performance as our data gets larger, and then a sudden cliff at a specific breaking point. The issue is that, as our working set grows, it no longer fits into the various levels of our CPU caches.

In my case, the large falloff from roughly 2 ns to 13 ns per element happens when the working set can no longer fit in the machine's L3 cache, so the algorithm starts hitting main RAM. Each element in our linked list requires 16 bytes of cache space: 4 bytes for the primary data we're storing (an int, in this case), 8 bytes for the pointer to the next node in the list, and 4 bytes of alignment padding.

Even if the library didn't report the size of our caches, we could infer them from where these performance falloffs happen. The system I'm running the tests on has 16MB of L3 cache, which can only hold around a million of these 16-byte nodes, and we see the results jump from roughly 2 ns to 13 ns per element when n crosses that boundary.
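We can sanity-check that 16-byte figure with a struct mirroring the typical node layout. The real node type is an internal detail of the standard library implementation, so ApproxNode here is a hypothetical stand-in:

#include <cstdio>

// Hypothetical mirror of a forward_list<int> node
// on a typical 64-bit platform
struct ApproxNode {
  ApproxNode* next; // 8 bytes
  int value;        // 4 bytes, plus 4 bytes of padding
                    // so the struct aligns to 8 bytes
};

int main() {
  // Prints 16 on typical 64-bit platforms
  std::printf("%zu\n", sizeof(ApproxNode));
}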

Analyzing std::vector

Let's bring our std::vector version back, and do the same analysis:

benchmarks/main.cpp

#include <benchmark/benchmark.h>
#include <vector>
#include <algorithm> // for std::fill
#include <cstdint>   // for int64_t

static void BM_VectorTraverse(benchmark::State& state) {
  int64_t n = state.range(0);
  state.SetComplexityN(n);

  std::vector<int> v(n);
  std::fill(v.begin(), v.end(), 1);

  for (auto _ : state) {
    long long sum = 0;
    for (int i : v) {
      sum += i;
    }
    benchmark::DoNotOptimize(sum);
  }
}

BENCHMARK(BM_VectorTraverse)
  ->Range(8, 8 << 24)
  ->Complexity();

Building and running the benchmark generates the following output on my machine:

-----------------------------------------
Benchmark                             CPU
-----------------------------------------
BM_VectorTraverse/8               5.16 ns
BM_VectorTraverse/64              33.8 ns
BM_VectorTraverse/512              249 ns
BM_VectorTraverse/4096            1925 ns
BM_VectorTraverse/32768          14997 ns
BM_VectorTraverse/262144        122414 ns
BM_VectorTraverse/2097152       983099 ns
BM_VectorTraverse/16777216     8680556 ns
BM_VectorTraverse/134217728   69602273 ns
BM_VectorTraverse_BigO            0.52 N
BM_VectorTraverse_RMS                0 %

Again, the library has determined the algorithm is linear, this time with a 0.52 scaling factor. Much better than our linked list's 13.5.

Let's check the CPU time per element:

-------------------------------------------------
Benchmark                        Time Per Element
-------------------------------------------------
BM_VectorTraverse/8              0.645 ns/element
BM_VectorTraverse/64             0.528 ns/element
BM_VectorTraverse/512            0.486 ns/element
BM_VectorTraverse/4096           0.470 ns/element
BM_VectorTraverse/32768          0.458 ns/element
BM_VectorTraverse/262144         0.467 ns/element
BM_VectorTraverse/2097152        0.469 ns/element
BM_VectorTraverse/16777216       0.517 ns/element
BM_VectorTraverse/134217728      0.519 ns/element

Our std::vector is much more consistent than our std::forward_list, but why?

Firstly, each of our elements is smaller. The contiguous memory layout means a std::vector<int> can store each element as a simple int, whilst a std::forward_list<int> needs an intermediate "node" structure that stores the int as well as the pointer to the next node.

We've gone from each element being a 16-byte node to each element being a 4-byte int. This more efficient use of memory means each cache level can hold four times as many elements before it fills up and we reach a performance cliff. But we still can't fit 134 million integers in the cache, so that's not the whole story.

The second advantage of reading from arrays is that the hardware prefetcher can flatten those cliffs almost entirely. It sees our linear access pattern, predicts what we'll need in the future, and fetches data from RAM in the background before we even ask for it. A linked list gives the prefetcher no clues about which memory address will be needed next, so it hits the wall at full speed.
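To make the density difference concrete, here's some quick arithmetic, assuming a typical 64-byte cache line:

#include <cstdio>

int main() {
  constexpr int cacheLineBytes = 64; // typical on x86-64

  // std::vector<int>: elements are packed contiguously
  std::printf("ints per line:  %d\n", cacheLineBytes / 4);  // 16

  // std::forward_list<int>: ~16 bytes per node, and even this
  // assumes neighboring nodes are adjacent in memory
  std::printf("nodes per line: %d\n", cacheLineBytes / 16); // 4
}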

Sharing Configuration using Macros

The Google Benchmark library offers a lot of configuration options using the -> syntax. For example:

static void BM_One(benchmark::State& state) {/*...*/}

BENCHMARK(BM_One)
  // Display times in milliseconds
  ->Unit(benchmark::kMillisecond)
  // Range from 1 to 1,000,000 in multiples of 10
  ->RangeMultiplier(10)
  ->Range(1, 1000 * 1000)
  // Assess asymptotic complexity
  ->Complexity();

When we want to compare the results of multiple benchmarks, we generally want them all to use the same configuration. Creating a macro can help with this. Note that each backslash must be the very last character on its line, or the line continuation breaks:

static void BM_One(benchmark::State& state) {/*...*/}
static void BM_Two(benchmark::State& state) {/*...*/}
static void BM_Three(benchmark::State& state) {/*...*/}

#define BENCHMARK_STD(func) \
  BENCHMARK(func) \
    ->Unit(benchmark::kMillisecond) \
    ->RangeMultiplier(10) \
    ->Range(1, 1000 * 1000) \
    ->Complexity()

BENCHMARK_STD(BM_One);
BENCHMARK_STD(BM_Two);
BENCHMARK_STD(BM_Three);

Summary

In this lesson, we learned that performance is not a single number; it is a curve. Here are the key takeaways:

  1. ->Range(): We use this to stress-test our algorithms across orders of magnitude.
  2. ->Complexity(): We can automate the calculation of Big O, allowing us to verify if our implementation matches our theoretical expectations.
  3. The Hardware Cliff: We discovered that O(n) is not always a flat line. When data grows beyond the capacity of the L3 cache, "time per element" spikes dramatically, especially for pointer-based structures like linked lists.
  4. Scalability Limits: We saw that std::forward_list is performant for small n, but struggles with large n due to memory latency, whereas std::vector remains resilient due to prefetching.
Next Lesson
Lesson 6 of 6

Amortized Costs and Crossover Points

Using benchmarks to calculate the 'break-even point' where investing in data preparation pays off.
