Integrating Google Benchmark
Integrate the industry-standard Google Benchmark library into a CMake project to capture nanosecond-precision timings.
In the previous lesson, we established our project architecture. We created dsa_core to hold our algorithms and dsa_app to run them. We enabled Release mode, -march=native, and LTO to ensure that our library runs as fast as the hardware allows.
Now we'll add the ability to measure the performance of our code. A basic way to measure performance is to use something like std::chrono::high_resolution_clock: record a start time, call your function, record an end time, subtract the two, and print the result.
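For illustration, here is a minimal sketch of that manual approach (the push_back loop is just a stand-in for whatever code we actually want to measure):
#include <chrono>
#include <iostream>
#include <vector>

int main() {
  auto start = std::chrono::high_resolution_clock::now();

  // The code under test - any real workload could go here
  std::vector<int> values;
  for (int i = 0; i < 100'000; ++i) {
    values.push_back(i);
  }

  auto end = std::chrono::high_resolution_clock::now();
  auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
  std::cout << "Loop took " << elapsed.count() << " ns\n";
}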
This approach captures a single data point in a chaotic environment.
Your operating system is constantly pausing your thread to handle network packets, mouse movements, and background updates. The CPU itself is dynamic - it ramps its clock speed up and down based on temperature and power targets. The memory caches start "cold" and warm up over time.
A single measurement is just noise. To get a signal, we need to run the code thousands of times, discard the outliers caused by OS interruptions, and find the stable average.
This is exactly the problem Google Benchmark, the industry-standard library for C++ micro-benchmarking, solves for us.
Micro-Benchmarks and System Benchmarks
In large projects where performance matters, we generally care about performance in two distinct ways.
Component-level benchmarks, or micro-benchmarks, measure the individual building blocks of an application. If we're writing a function to be used in a performance-critical part of our program, we might create a micro-benchmark to determine how fast that function runs, and to test whether a change we're considering is likely to improve performance or make it worse.
System-level benchmarks measure the performance of the final application, with all of those components working together. Examples include how many "frames per second" a game can generate, or how many "requests per second" a web server can handle.
System level design is about deciding what components our overall system needs, how they should be used, and how they should interact with each other to accomplish the goals of our program.
Component-level design is then about ensuring the components we need to build or integrate as part of that overall plan are performant.
The Benchmark Directory
Following our modular architecture, we will create a new directory for our laboratory: benchmarks.
This directory will contain its own CMakeLists.txt and its own source code. By keeping it separate from dsa_app, we can make sure that our testing logic never accidentally leaks into our production binary.
MyProject/
├── CMakeLists.txt
├── dsa_core/
├── dsa_app/
└── benchmarks/ (New directory)
    ├── CMakeLists.txt (New configuration)
    └── main.cpp (New entry point)
Managing Dependencies with CMake
We need to download and link the Google Benchmark library. CMake's FetchContent module makes this simple. It allows us to describe the dependency in code, and CMake will download and build it automatically, giving us targets we can link against.
Since only the benchmark suite needs this library, we'll define it in benchmarks/CMakeLists.txt rather than in the project root:
benchmarks/CMakeLists.txt
# 1. Include the FetchContent module
include(FetchContent)
# 2. Configure Google Benchmark to skip building its own tests
# and install rules. This speeds up our build significantly.
set(BENCHMARK_ENABLE_TESTING OFF CACHE BOOL "" FORCE)
set(BENCHMARK_ENABLE_INSTALL OFF CACHE BOOL "" FORCE)
# 3. Declare the external dependency
FetchContent_Declare(
googlebenchmark
GIT_REPOSITORY https://github.com/google/benchmark.git
GIT_TAG v1.8.3
)
# 4. Download and make the library available
FetchContent_MakeAvailable(googlebenchmark)
Creating the Test Executable
Now we need to configure the actual executable that will run our tests. We will call it dsa_bench. This executable needs to link to two things:
- Google Benchmark: To provide the testing framework.
- The dsa_core target: To provide the code we want to test.
Because dsa_core is a library, we can link it into dsa_bench just as easily as we linked it into dsa_app.
benchmarks/CMakeLists.txt
# ...
# Define the executable
add_executable(dsa_bench main.cpp)
# Link the dependencies
target_link_libraries(dsa_bench PRIVATE
# The Google Benchmark library
benchmark::benchmark
# Provides a default main() function for us
benchmark::benchmark_main
# Our own algorithms library
dsa_core
)
Finally, we need to tell the root CMakeLists.txt to include this new subdirectory.
CMakeLists.txt
# ...
add_subdirectory(dsa_core)
add_subdirectory(dsa_app)
add_subdirectory(benchmarks)
Anatomy of a Benchmark
Now we can write the actual C++ code for our tests in benchmarks/main.cpp.
We want to verify that we can measure the code inside dsa_core. We will include the header we created in the previous lesson, dsa/vector.h, and write a test case for MyVector.
A Google Benchmark test consists of three parts:
- The Function: A void function that takes a benchmark::State& object.
- The Loop: A specialized loop driven by the library.
- The Macro: Registering the function as a benchmark.
benchmarks/main.cpp
#include <benchmark/benchmark.h>
#include <dsa/vector.h> // Importing our library!
// 1. The Function
static void BM_MyVectorPush(benchmark::State& state) {
// Setup code (Not timed)
// We can construct objects here
dsa::MyVector vec;
// 2. The Measurement Loop
for (auto _ : state) {
// This code is timed
vec.push_back(42);
}
}
// 3. Register the function as a benchmark
BENCHMARK(BM_MyVectorPush);
// Since we linked benchmark::benchmark_main, we don't need
// to write our own main() function.
Preview: Fighting the Optimizer
We've enabled compiler optimizations to ensure what we're benchmarking is as close as possible to what we'll eventually release. However, that same optimizer now becomes our benchmarking enemy: it will happily delete the very code we're trying to measure.
For example, our BM_MyVectorPush() function creates a vector, adds some data to it, and then...nothing. The function ends and our vector is destroyed.
static void BM_MyVectorPush(benchmark::State& state) {
dsa::MyVector vec;
for (auto _ : state) {
vec.push_back(42);
}
// ...what was the point of that?
}The compiler doesn't know this is a benchmark. It looks at that code and thinks "this does nothing except waste CPU cycles - get out of here."
We will learn how to prevent this using benchmark::DoNotOptimize() in the next lesson. For now, we'll just focus on getting the infrastructure running.
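If you're curious, the fix will look roughly like the sketch below, with benchmark::DoNotOptimize() telling the compiler the result is observed and must not be deleted (we'll cover it properly next lesson):
static void BM_MyVectorPush(benchmark::State& state) {
  dsa::MyVector vec;
  for (auto _ : state) {
    vec.push_back(42);
    // Preview: marks vec as "used" so the optimizer
    // cannot remove the push_back() above
    benchmark::DoNotOptimize(vec);
  }
}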
The state Loop
The for (auto _ : state) loop is the most important component to understand.
When you run this program, the library doesn't just run the loop once. It runs the loop continuously until it has gathered enough statistical data to be confident in the result.
- Warm-up: It might run the loop a few times just to wake up the CPU and populate the cache. These runs are discarded.
- Calibration: It guesses how many iterations are needed. If the operation is fast, it might need 1,000,000 iterations to get a measurable time.
- Measurement: It runs the batch of iterations and records the total time.
The state object controls this entire process. You don't decide how many times the code runs; the library decides.
Variables Inside the Loop
Be very careful about what you put inside versus outside the loop.
- Outside: One-time setup (allocation, initialization). This is not included in the timer.
- Inside: The code being measured. This is timed.
In our example, dsa::MyVector vec is declared outside. This means we are reusing the same vector instance for every iteration of the loop, so it grows larger and larger. This measures the cost of push_back() on an ever-growing vector.
If we moved the declaration inside the loop, we would be creating and destroying the vector every single time. Our benchmark would no longer be measuring the time it takes to perform a push_back() - it would mostly be measuring the time it takes to create and destroy the vector.
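For comparison, that alternative looks like the sketch below (the name BM_MyVectorCreateAndPush is just an illustrative choice):
static void BM_MyVectorCreateAndPush(benchmark::State& state) {
  for (auto _ : state) {
    // Construction and destruction now happen inside the timed
    // region, so they dominate the measurement
    dsa::MyVector vec;
    vec.push_back(42);
  }
}
BENCHMARK(BM_MyVectorCreateAndPush);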
Timer Management
If we want to perform some work on every iteration of our benchmark, but we do not want that work to be timed, we can pause and resume the timer:
#include <benchmark/benchmark.h>
#include <dsa/vector.h>
static void BM_MyVectorPush(benchmark::State& state) {
for (auto _ : state) {
state.PauseTiming();
// This code is not timed
dsa::MyVector vec;
state.ResumeTiming();
// This code is timed
vec.push_back(42);
}
}
BENCHMARK(BM_MyVectorPush);
However, in an unfortunate irony, it takes some time to pause the timer. This benchmark will likely report a faster run time if we don't pause the timer, as the timer management adds more overhead than just creating the dsa::MyVector.
Timer management is more useful for longer-running algorithms, where the time required to manage the timer is trivial compared to the amount of work the rest of the benchmark is doing.
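For example, here is a sketch of that pattern using std::sort on a plain std::vector (standard library types rather than our dsa_core code, purely for illustration). Rebuilding the unsorted input each iteration is setup work we exclude from the timing:
#include <algorithm>
#include <benchmark/benchmark.h>
#include <random>
#include <vector>

static void BM_SortVector(benchmark::State& state) {
  std::mt19937 rng(42);
  for (auto _ : state) {
    state.PauseTiming();
    // Rebuild an unsorted input for each iteration (not timed)
    std::vector<int> data(100'000);
    for (auto& value : data) value = static_cast<int>(rng());
    state.ResumeTiming();

    // Only the sort itself is timed
    std::sort(data.begin(), data.end());
  }
}
BENCHMARK(BM_SortVector);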
Running and Interpreting Results
Finally, let's configure and run our project using the presets we defined in the previous lesson.
First, we generate the build system:
cmake --preset release
Then we run the build. The first time you run this, it will take a moment to download and compile Google Benchmark, but future builds will be faster.
cmake --build --preset release
Now, our build system outputs not only the primary executable that we can ship to users, but also the benchmarking executable that we can use to run our tests.
Again, we can check our terminal output to see where our benchmark executable was created, but it is likely to be something like build/release/benchmarks/dsa_bench, possibly with a .exe extension:
./build/release/benchmarks/dsa_bench
You should see output similar to this:
Run on (24 X 3094 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x12)
L1 Instruction 32 KiB (x12)
L2 Unified 512 KiB (x12)
L3 Unified 16384 KiB (x4)
----------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------
BM_MyVectorPush 4.2 ns 4.2 ns 165000000
The output provides some information about the hardware environment, most notably the CPU caches, followed by the results table.
Results Table Columns
Let's break down the 4 columns:
- Benchmark: The name of the function being tested.
- Time (wall time): The actual real world time that elapsed. If you ran this function once, this is how long you would expect to wait.
- CPU (thread time): The amount of time the CPU spent actively working on your thread.
- Iterations: How many times the library ran the loop to get this average. In the example above, it ran 165 million times to calculate that stable 4.2ns average.
If Time > CPU, your thread was blocked (maybe waiting for disk I/O, or the OS paused it to run another program).
In most cases, we care primarily about the CPU results. If the operating system decided to put our process to sleep, that's (usually) not because of our code, so we don't want that time to be included in the count.
For presentation reasons, we'll exclude the other columns in most of our future benchmark output examples.
Changing Time Units
By default, the library guesses the best unit (nanoseconds, microseconds, or milliseconds). You can force a specific unit to make comparisons easier using the ->Unit() modifier.
benchmarks/main.cpp
// ...
BENCHMARK(BM_MyVectorPush)->Unit(benchmark::kNanosecond);
// ...
Complete Code
Complete versions of the files we added or updated in this lesson are available below:
Summary
We have successfully built a complete performance laboratory. Here are the key points:
- Architecture: We should maintain a clean separation between dsa_core (logic), dsa_app (production), and benchmarks (testing).
- FetchContent: CMake's FetchContent module can pull libraries from the internet and integrate them into our build.
- Linkage: To test our code, we need to link our benchmarking executable against the library that contains those functions. In this case, we linked dsa_bench against dsa_core.
- Google Benchmark: We handed off the heavy lifting of statistical analysis, warm-up cycles, and iteration counting to the library.
We now have the tools to run experiments. But as we hinted earlier, the compiler is currently our adversary. It wants to delete our "useless" benchmark loops.
In the next lesson, we will learn how to fight back. We will introduce benchmark::DoNotOptimize() and benchmark::ClobberMemory() to trick the compiler into running code that it desperately wants to optimize away.