Integrating Google Benchmark
Integrate the industry-standard Google Benchmark library into a CMake project to capture nanosecond-precision timings.
In the previous lesson, we established our project architecture. We created dsa_core to hold our algorithms and dsa_app to run them. We enabled Release mode, -march=native, and LTO to ensure that our library runs as fast as the hardware allows.
Now we'll add the ability to measure the performance of our code. A basic way to measure performance is to use something like std::chrono::high_resolution_clock: record a start time, call your function, record an end time, subtract the two, and print the result.
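For illustration, here is a minimal sketch of that manual approach (the push_back loop is just a stand-in for whatever code we actually want to measure):
#include <chrono>
#include <iostream>
#include <vector>

int main() {
  auto start = std::chrono::high_resolution_clock::now();

  // The code under test - any real workload could go here
  std::vector<int> values;
  for (int i = 0; i < 100'000; ++i) {
    values.push_back(i);
  }

  auto end = std::chrono::high_resolution_clock::now();
  auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
  std::cout << "Loop took " << elapsed.count() << " ns\n";
}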
This approach captures a single data point in a chaotic environment.
Your operating system is constantly pausing your thread to handle network packets, mouse movements, and background updates. The CPU itself is dynamic - it ramps its clock speed up and down based on temperature and power targets. The memory caches start "cold" and warm up over time.
A single measurement is just noise. To get a signal, we need to run the code thousands of times, discard the outliers caused by OS interruptions, and find the stable average.
This is exactly the problem Google Benchmark, the industry-standard library for C++ micro-benchmarking, solves for us.
Micro-Benchmarks and System Benchmarks
In large projects where performance matters, we generally care about performance in two distinct ways.
Component-level benchmarks, or micro-benchmarks, measure the individual building blocks of an application. If we're writing a function to be used in a performance-critical part of our program, we might create a micro-benchmark to determine how fast that function runs, and to test whether a change we're considering is likely to improve performance or make it worse.
System-level benchmarks measure the performance of the final application, with all of those components working together. Examples include how many "frames per second" a game can generate, or how many "requests per second" a web server can handle.
System level design is about deciding what components our overall system needs, how they should be used, and how they should interact with each other to accomplish the goals of our program.
Component-level design is then about ensuring the components we need to build or integrate as part of that overall plan are performant.
The Benchmark Directory
Following our modular architecture, we will create a new directory for our laboratory: benchmarks.
This directory will contain its own CMakeLists.txt and its own source code. By keeping it separate from dsa_app, we can make sure that our testing logic never accidentally leaks into our production binary.
MyProject/
├── CMakeLists.txt
├── dsa_core/
├── dsa_app/
└── benchmarks/ (New directory)
    ├── CMakeLists.txt (New configuration)
    └── main.cpp (New entry point)
Managing Dependencies with CMake
We need to download and link the Google Benchmark library. CMake's FetchContent module makes this simple. It allows us to describe the dependency in code, and CMake will download and build it automatically, giving us targets we can link against.
Since only the benchmark suite needs this library, we'll define it in benchmarks/CMakeLists.txt rather than in the project root:
benchmarks/CMakeLists.txt
# 1. Include the FetchContent module
include(FetchContent)
# 2. Configure Google Benchmark to skip building its own tests
# and install rules. This speeds up our build significantly.
set(BENCHMARK_ENABLE_TESTING OFF CACHE BOOL "" FORCE)
set(BENCHMARK_ENABLE_INSTALL OFF CACHE BOOL "" FORCE)
# 3. Declare the external dependency
FetchContent_Declare(
googlebenchmark
GIT_REPOSITORY https://github.com/google/benchmark.git
GIT_TAG v1.8.3
)
# 4. Download and make the library available
FetchContent_MakeAvailable(googlebenchmark)
Creating the Test Executable
Now we need to configure the actual executable that will run our tests. We will call it dsa_bench. This executable needs to link to two things:
- Google Benchmark: To provide the testing framework.
- The dsa_core target: To provide the code we want to test.
Because dsa_core is a library, we can link it into dsa_bench just as easily as we linked it into dsa_app.
benchmarks/CMakeLists.txt
# ...
# Define the executable
add_executable(dsa_bench main.cpp)
# Link the dependencies
target_link_libraries(dsa_bench PRIVATE
# The Google Benchmark library
benchmark::benchmark
# Provides a default main() function for us
benchmark::benchmark_main
# Our own algorithms library
dsa_core
)
Finally, we need to tell the root CMakeLists.txt to include this new subdirectory.
CMakeLists.txt
# ...
add_subdirectory(dsa_core)
add_subdirectory(dsa_app)
add_subdirectory(benchmarks)
Anatomy of a Benchmark
Now we can write the actual C++ code for our tests in benchmarks/main.cpp.
We want to verify that we can measure the code inside dsa_core. We will include the header we created in the previous lesson, dsa/vector.h, and write a test case for MyVector.
A Google Benchmark test consists of three parts:
- The Function: A void function that takes a benchmark::State& object.
- The Loop: A specialized loop driven by the library.
- The Macro: Registering the function as a benchmark.
benchmarks/main.cpp
#include <benchmark/benchmark.h>
#include <dsa/vector.h> // Importing our library!
// 1. The Function
static void BM_MyVectorPush(benchmark::State& state) {
// Setup code (Not timed)
// We can construct objects here
dsa::MyVector vec;
// 2. The Measurement Loop
for (auto _ : state) {
// This code is timed
vec.push_back(42);
}
}
// 3. Register the function as a benchmark
BENCHMARK(BM_MyVectorPush);
// Since we linked benchmark::benchmark_main, we don't need
// to write our own main() function.
Preview: Fighting the Optimizer
We've enabled compiler optimizations to ensure what we're benchmarking is as close as possible to what we'll eventually release. However, that same optimizer now becomes our benchmarking enemy: it will happily delete the very code we're trying to measure.
For example, our BM_MyVectorPush() function creates a vector, adds some data to it, and then...nothing. The function ends and our vector is destroyed.
static void BM_MyVectorPush(benchmark::State& state) {
dsa::MyVector vec;
for (auto _ : state) {
vec.push_back(42);
}
// ...what was the point of that?
}The compiler doesn't know this is a benchmark. It looks at that code and thinks "this does nothing except waste CPU cycles - get out of here."
We will learn how to prevent this using benchmark::DoNotOptimize() in the next lesson. For now, we'll just focus on getting the infrastructure running.
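If you're curious, the fix will look roughly like the sketch below, with benchmark::DoNotOptimize() telling the compiler the result is observed and must not be deleted (we'll cover it properly next lesson):
static void BM_MyVectorPush(benchmark::State& state) {
  dsa::MyVector vec;
  for (auto _ : state) {
    vec.push_back(42);
    // Preview: marks vec as "used" so the optimizer
    // cannot remove the push_back() above
    benchmark::DoNotOptimize(vec);
  }
}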
The state Loop
The for (auto _ : state) loop is the most important component to understand.
When you run this program, the library doesn't just run the loop once. It runs the loop continuously until it has gathered enough statistical data to be confident in the result.
- Warm-up: It might run the loop a few times just to wake up the CPU and populate the cache. These runs are discarded.
- Calibration: It guesses how many iterations are needed. If the operation is fast, it might need 1,000,000 iterations to get a measurable time.
- Measurement: It runs the batch of iterations and records the total time.
The state object controls this entire process. You don't decide how many times the code runs; the library decides.
Variables Inside the Loop
Be very careful about what you put inside versus outside the loop.
- Outside: One-time setup (allocation, initialization). This is not included in the timer.
- Inside: The code being measured. This is timed.
In our example, dsa::MyVector vec is declared outside. This means we are reusing the same vector instance for every iteration of the loop, so it grows larger and larger. This measures the cost of push_back() on an ever-growing vector.
If we moved the declaration inside the loop, we would be creating and destroying the vector every single time. Our benchmark would no longer be measuring the time it takes to perform a push_back() - it would mostly be measuring the time it takes to create and destroy the vector.
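For comparison, that alternative looks like the sketch below (the name BM_MyVectorCreateAndPush is just an illustrative choice):
static void BM_MyVectorCreateAndPush(benchmark::State& state) {
  for (auto _ : state) {
    // Construction and destruction now happen inside the timed
    // region, so they dominate the measurement
    dsa::MyVector vec;
    vec.push_back(42);
  }
}
BENCHMARK(BM_MyVectorCreateAndPush);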
Timer Management
If we want to perform some work on every iteration of our benchmark, but we do not want that work to be timed, we can pause and resume the timer:
#include <benchmark/benchmark.h>
#include <dsa/vector.h>
static void BM_MyVectorPush(benchmark::State& state) {
for (auto _ : state) {
state.PauseTiming();
// This code is not timed
dsa::MyVector vec;
state.ResumeTiming();
// This code is timed
vec.push_back(42);
}
}
BENCHMARK(BM_MyVectorPush);
However, in an unfortunate irony, it takes some time to pause the timer. This benchmark will likely report a faster run time if we don't pause the timer, as the timer management adds more overhead than just creating the dsa::MyVector.
Timer management is more useful for longer-running algorithms, where the time required to manage the timer is trivial compared to the amount of work the rest of the benchmark is doing.
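For example, here is a sketch of that pattern using std::sort on a plain std::vector (standard library types rather than our dsa_core code, purely for illustration). Rebuilding the unsorted input each iteration is setup work we exclude from the timing:
#include <algorithm>
#include <benchmark/benchmark.h>
#include <random>
#include <vector>

static void BM_SortVector(benchmark::State& state) {
  std::mt19937 rng(42);
  for (auto _ : state) {
    state.PauseTiming();
    // Rebuild an unsorted input for each iteration (not timed)
    std::vector<int> data(100'000);
    for (auto& value : data) value = static_cast<int>(rng());
    state.ResumeTiming();

    // Only the sort itself is timed
    std::sort(data.begin(), data.end());
  }
}
BENCHMARK(BM_SortVector);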
Running and Interpreting Results
Finally, let's configure and run our project using the presets we defined in the previous lesson.
First, we generate the build system:
cmake --preset release
Then we run the build. The first time you run this, it will take a moment to download and compile Google Benchmark, but future builds will be faster.
cmake --build --preset release
Now, our build system outputs not only the primary executable that we can ship to users, but also the benchmarking executable that we can use to run our tests.
Again, we can check our terminal output to see where our benchmark executable was created, but it is likely to be something like build/release/benchmarks/dsa_bench, possibly with a .exe extension:
./build/release/benchmarks/dsa_bench
You should see output similar to this:
Run on (24 X 3094 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x12)
L1 Instruction 32 KiB (x12)
L2 Unified 512 KiB (x12)
L3 Unified 16384 KiB (x4)
----------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------
BM_MyVectorPush 4.2 ns 4.2 ns 165000000
The output provides some information about the hardware environment, most notably the CPU caches, followed by the results table.
Results Table Columns
Let's break down the 4 columns:
- Benchmark: The name of the function being tested.
- Time (wall time): The actual real world time that elapsed. If you ran this function once, this is how long you would expect to wait.
- CPU (thread time): The amount of time the CPU spent actively working on your thread.
- Iterations: How many times the library ran the loop to get this average. In the example above, it ran 165 million times to calculate that stable 4.2ns average.
If Time > CPU, your thread was blocked (maybe waiting for disk I/O, or the OS paused it to run another program).
In most cases, we care primarily about the CPU results. If the operating system decided to put our process to sleep, that's (usually) not because of our code, so we don't want that time to be included in the count.
For presentation reasons, we'll exclude the other columns in most of our future benchmark output examples.
Changing Time Units
By default, the library guesses the best unit (nanoseconds, microseconds, or milliseconds). You can force a specific unit to make comparisons easier using the ->Unit() modifier.
benchmarks/main.cpp
// ...
BENCHMARK(BM_MyVectorPush)->Unit(benchmark::kNanosecond);
// ...
Complete Code
Complete versions of the files we added or updated in this lesson are available below:
Summary
We have successfully built a complete performance laboratory. Here are the key points:
- Architecture: We should maintain a clean separation between dsa_core (logic), dsa_app (production), and benchmarks (testing).
- FetchContent: CMake's FetchContent module can pull libraries from the internet and integrate them into our build.
- Linkage: To test our code, we need to link our benchmarking executable against the library that contains those functions. In this case, we linked dsa_bench against dsa_core.
- Google Benchmark: We handed off the heavy lifting of statistical analysis, warm-up cycles, and iteration counting to the library.
We now have the tools to run experiments. But as we hinted earlier, the compiler is currently our adversary. It wants to delete our "useless" benchmark loops.
In the next lesson, we will learn how to fight back. We will introduce benchmark::DoNotOptimize() and benchmark::ClobberMemory() to trick the compiler into running code that it desperately wants to optimize away.