Logarithmic and Linearithmic Algorithms

Working with massive datasets using divide and conquer strategies, and the required trade-offs.

Ryan McCombe

In the previous lesson, we explored polynomial time algorithms like $O(n)$ and $O(n^2)$. These algorithms scale predictably with the input: if you double the data, the work doubles, or perhaps quadruples.

This is acceptable for small to medium datasets. But what happens when we operate at the scale of a global database, a search engine, or the file system of an operating system? If you have 10 billion database entries, a linear $O(n)$ scan is often not an option. It would take seconds, or even minutes, to find a single record.

To handle this scale, we need algorithms that don't just "manage" growth - they fundamentally suppress it. We need algorithms where doubling the input adds almost zero extra work. These are logarithmic algorithms, or $O(\log n)$.

The Phone Book Problem

To understand logarithmic complexity, let's step away from the computer and look at a physical object that (historically) solved the problem of massive data retrieval: the phone book.

Imagine a phone book with 1,000,000 names, sorted alphabetically. You need to find "Smith."

The Linear Approach

If you treated this book like an unsorted std::vector, you would start at page 1. Is "Smith" there? No. Page 2? No. You would have to flip through potentially 500,000 pages before finding "Smith." This is linear time, and it is slow for large datasets.
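In code, this brute-force approach is just a linear pass over the data - for example, with std::ranges::find. The sketch below uses a handful of made-up names purely for illustration:

linear_scan.cpp

#include <algorithm>
#include <string>
#include <vector>

int main() {
  // A tiny, unsorted "phone book" - the names are invented for illustration
  std::vector<std::string> phone_book{
    "Miller", "Adams", "Smith", "Clark", "Baker", "Williams"};

  std::string target{"Smith"};

  // O(n): check entries one by one until we find the target (or run out)
  auto it{std::ranges::find(phone_book, target)};
  bool found{it != phone_book.end()}; // true
}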

The Divide and Conquer Approach

Instead, you open the book roughly to the middle. You land on "Miller."

You know "Smith" comes after "Miller," so you have just learned something important: Smith cannot be in the first half of the book. In one single action, you have removed 500,000 names from consideration.

Now you have 500,000 names left. You split the remaining stack in the middle. You land on "Williams." You know "Smith" is before "Williams." You've split the stack in half again. Now you have 250,000 names left.

You repeat this process. Every iteration of this loop cuts the problem space in half.

  • Step 1: 1,000,000 candidates
  • Step 2: 500,000 candidates
  • Step 3: 250,000 candidates
  • ...
  • Step 20: 1 candidate

It takes only 20 steps to find a specific entry among one million. If the phone book doubled to two million names, it would only take 21 steps. After the first step, we have halved the problem back down to one million.

This is the power of logarithmic time - $O(\log n)$. It is the complexity class of "halving." Applying this technique to the problem of finding things is called binary search.
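To make the halving loop concrete, here is a minimal hand-written sketch of binary search over a sorted std::vector of names. It is purely illustrative - as we'll see later in this lesson, the standard library already provides this algorithm for us:

manual_binary_search.cpp

#include <cstddef>
#include <string>
#include <vector>

// Returns true if target is present in sorted_names.
// Each iteration halves the candidate range [low, high).
bool contains(const std::vector<std::string>& sorted_names,
              const std::string& target) {
  std::size_t low{0};
  std::size_t high{sorted_names.size()};

  while (low < high) {
    std::size_t middle{low + (high - low) / 2};
    if (sorted_names[middle] == target) {
      return true;      // found it
    } else if (sorted_names[middle] < target) {
      low = middle + 1; // discard the first half
    } else {
      high = middle;    // discard the second half
    }
  }
  return false;         // no candidates left
}

int main() {
  std::vector<std::string> phone_book{
    "Adams", "Baker", "Clark", "Miller", "Smith", "Williams"};
  bool found{contains(phone_book, "Smith")}; // true
}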

For example, when a collection of 100 elements is sorted, any target can be found within 7 steps, since seven halvings are enough to narrow 100 candidates down to a single one.

In Big O notation, when we write $\log n$, we almost always mean $\log_2 n$ - logarithm base 2. It is the answer to the question: "how many times must I cut $n$ in half before I get to 1?"

For practical engineering purposes, $O(\log n)$ is effectively constant time for very large datasets.

| Input Size ($n$) | Linear Steps - $O(n)$ | Log Steps - $O(\log n)$ |
|---|---|---|
| 10 | 10 | 4 |
| 100 | 100 | 7 |
| 1,000 | 1,000 | 10 |
| 1,000,000 (1 Million) | 1,000,000 | 20 |
| 1,000,000,000 (1 Billion) | 1,000,000,000 | 30 |

You can search the entire range of a 64-bit integer - a number so large it exceeds the grains of sand on Earth - in just 64 operations.
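If you want to sanity-check the "Log Steps" column above, a small sketch like the following counts the halvings directly. The halving_steps() function is our own invention for illustration - it rounds up at each cut, matching the worst case:

halving_steps.cpp

#include <cstddef>
#include <iostream>

// How many times must we cut n in half before only one candidate remains?
// Rounding up at each cut reproduces the "Log Steps" column above.
std::size_t halving_steps(std::size_t n) {
  std::size_t steps{0};
  while (n > 1) {
    n = (n + 1) / 2; // keep the larger half
    ++steps;
  }
  return steps;
}

int main() {
  std::cout << halving_steps(100) << '\n';            // 7
  std::cout << halving_steps(1'000'000) << '\n';      // 20
  std::cout << halving_steps(1'000'000'000) << '\n';  // 30
}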

Linearithmic Time: The Price of Order

There is a catch. The phone book strategy only works because the book is sorted. If the names were printed in random order, opening the book to the middle would tell you nothing. You wouldn't know if "Smith" was to the left or to the right. You would be forced to scan every page - $O(n)$.

This implies a fundamental trade-off: to get fast lookups - $O(\log n)$ - you must pay an upfront cost to organize the data.

This upfront cost is usually sorting. The most efficient general-purpose sorting algorithms, like the standard library's std::sort(), generally fall into the complexity class of linearithmic time, written as $O(n \log n)$.

You can think of $n \log n$ as performing a logarithmic operation for every single element in the dataset.

Imagine you are sorting a deck of cards. You hold the cards you have sorted so far in your hand, with the remaining unsorted cards on the desk in front of you. You pick up a card and use a binary-search-like strategy to determine where to insert it in your hand. That binary search is the $\log n$ work.

You repeat this for every card in the deck - $n$ cards in total. The total work is therefore $n \times \log n$.
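Here is that card analogy sketched in code, using std::ranges::upper_bound as the binary-search step. This is only an intuition for the $n \times \log n$ counting argument - the insert() call shifts elements around, so this is not how std::ranges::sort() actually works internally, nor an efficient sort in its own right:

card_sort.cpp

#include <algorithm>
#include <vector>

int main() {
  std::vector desk{5, 1, 6, 2, 0, 4}; // unsorted "cards" on the desk
  std::vector<int> hand;              // the sorted cards in our hand

  for (int card : desk) {             // n cards in total
    // The log n step: binary search the hand for this card's slot
    auto slot{std::ranges::upper_bound(hand, card)};
    hand.insert(slot, card);          // place it, keeping the hand sorted
  }
  // hand is now {0, 1, 2, 4, 5, 6}
}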

Implementation in C++

We rarely write search or sorting algorithms by hand. The algorithms in the standard library are highly optimized, debugged, and specialized for different hardware architectures.

The following example shows some standard library algorithms from the C++20 ranges library:

  • std::ranges::sort() sorts the elements in a container
  • std::ranges::binary_search() tells us if an element is in a sorted container (returns true or false)

We can implement the "sort then search" pattern like this:

binary_search.cpp

#include <vector>
#include <algorithm>
#include <ranges> // Requires C++20

int main() {
  std::vector nums{5, 1, 6, 2, 0, 4};
  
  // 1. THE SETUP COST: O(n log n)
  // We cannot use binary search unless the data is sorted.
  std::ranges::sort(nums);

  // 2. THE PAYOFF: O(log n)
  // Now we can find any number almost instantly.
  bool has1{std::ranges::binary_search(nums, 1)}; // true
  bool has2{std::ranges::binary_search(nums, 2)}; // true
  bool has3{std::ranges::binary_search(nums, 3)}; // false
}

Notice the relationship here. We pay a heavy price - std::ranges::sort() - exactly once. After that, we can call std::ranges::binary_search() many times, as long as our original data (nums, in this case) hasn't changed.

If you are only searching once, a linear scan is faster (because $n$ is smaller than $n \log n + \log n$). But if you are searching frequently, the investment in sorting can pay dividends. Or, as the next section shows, we might have made things much worse.
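As a rough back-of-the-envelope illustration - ignoring constant factors, and using numbers chosen purely for illustration - suppose we have $n$ = 1,000,000 elements and need to perform 100 lookups:

  • Repeated linear scans: $100 \times n$ = 100 × 1,000,000 = 100,000,000 steps
  • Sort once, then binary search: $n \log n + 100 \times \log n$ ≈ 20,000,000 + (100 × 20) ≈ 20,002,000 steps

On paper, the sorted approach wins comfortably - but only on paper, as the next section explains.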

The Hardware Reality Check

Throughout this chapter, we've emphasized that hardware reality often conflicts with theoretical complexity. Logarithmic time is the prime example of this conflict.

On paper, $O(\log n)$ is always better than $O(n)$. In reality, on a modern CPU, linear scans often beat binary search when $n$ is small. This means we might want to use linear search even if the data is already sorted. There are two main reasons why it might take less time to do more work.

Cache Lines and The Prefetcher

Remember, CPUs love predictable, linear patterns. When we access some_array[0], we get its neighbors (some_array[1], some_array[2], ...) for free because they're on the same cache line. Moreover, the hardware prefetcher detects the pattern and pulls future cache lines in before our algorithm even requests them.

Binary search, by comparison, is erratic. It jumps from some_array[371], to some_array[185], to some_array[278]. Each read brings in only one useful value (we don't care about its neighbors) and the prefetcher can't detect a pattern.

One slow RAM fetch is equivalent to doing hundreds of fast instructions. You can check a lot of elements linearly in the time it takes to fetch one element randomly.

Branch Prediction

When we have branching logic in our code, like an if statement, CPUs predict what the result will be. They will start doing speculative work on that path before they know if their guess was correct. If they guessed correctly, great - the next set of work has already been completed. If they guessed incorrectly, all that speculative work gets thrown away and we start over.

This means that the easier our if statements are to predict, the faster our code can run.

In a linear search, if (element == target), the answer is almost always false. The CPU quickly learns to predict false and is almost always correct. The pipeline flows smoothly.

In a binary search, if (target < element), the answer is effectively a coin flip. The CPU cannot predict whether you will go left or right. It guesses wrong 50% of the time.

So Which is Better?

As a general rule, the physical reality of hardware matters more for small amounts of data, whilst algorithmic complexity becomes increasingly important as the scale ($n$) gets bigger.

But how big? Where is the crossover point? Unfortunately, there's no universal answer - it just depends on a lot of variables that vary from case to case. To get an answer for our specific use case, we should scientifically measure the options.

We walk through how to do this elsewhere in the course, including an example experiment to find this "crossover point" between binary search and linear search for a specific data set.
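Here is a minimal sketch of what such a measurement might look like, using std::chrono from the standard library. It is deliberately simplified - a proper experiment would repeat the runs, randomize the targets, vary $n$, and prevent the compiler from optimizing the searches away:

crossover_benchmark.cpp

#include <algorithm>
#include <chrono>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
  constexpr int n{64}; // try different sizes to hunt for the crossover
  std::vector<int> data(n);
  std::iota(data.begin(), data.end(), 0); // already sorted: 0, 1, ..., n-1

  using clock = std::chrono::steady_clock;

  // Time n linear searches
  int found_linear{0};
  auto start{clock::now()};
  for (int target{0}; target < n; ++target) {
    found_linear += std::ranges::find(data, target) != data.end();
  }
  auto linear_ns{std::chrono::duration_cast<std::chrono::nanoseconds>(
    clock::now() - start)};

  // Time n binary searches over the same sorted data
  int found_binary{0};
  start = clock::now();
  for (int target{0}; target < n; ++target) {
    found_binary += std::ranges::binary_search(data, target);
  }
  auto binary_ns{std::chrono::duration_cast<std::chrono::nanoseconds>(
    clock::now() - start)};

  std::cout << "linear: " << linear_ns.count() << "ns ("
            << found_linear << " found)\n";
  std::cout << "binary: " << binary_ns.count() << "ns ("
            << found_binary << " found)\n";
}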

Summary

In this lesson, we explored the "divide and conquer" complexity classes:

  • Logarithmic - $O(\log n)$: The algorithm of "halving." It makes massive datasets manageable. Doubling the input adds only a constant amount of work.
  • Linearithmic - $O(n \log n)$: The cost of order. This is the typical complexity of efficient sorting algorithms.

We also learned that while logarithmic algorithms are theoretically superior to linear ones, the mechanics of memory (cache misses) and CPU pipelines (branch misprediction) impose a constant-time penalty on every step. For small data, a dumb linear scan is often faster than a smart binary search.

So far, we have looked at algorithms that are manageable. Algorithms in classes like $n$, $n \log n$, and even $n^2$ are all solvable for reasonable datasets. But there are monsters lurking.

In the next lesson, we will confront exponential - $O(2^n)$ - and factorial - $O(n!)$ - complexity. We will look at problems where adding just one more element to your dataset can turn a 1-second runtime into a 100-year runtime, and how we cope with these intractable problems.

