Hash Table Load Factor

What is the load factor of a hash table, and how does it affect performance?

The load factor of a hash table is the ratio of the number of stored elements to the number of buckets in the underlying array. It measures how full the table is, and it has a significant impact on the performance of the table's operations.

Load Factor = Number of Elements / Number of Buckets

For example, if a hash table has 10 elements and an array size of 20, the load factor would be 0.5 (10 / 20).

The load factor affects the performance of a hash table in the following ways:

Collision Resolution:

  • As the load factor increases, the probability of collisions also increases.
  • With more collisions, the collision resolution mechanism (e.g., separate chaining or open addressing) has to work harder to resolve them.
  • This leads to longer chains or more probing steps, which can degrade the performance of lookup, insertion, and deletion operations.

Memory Usage:

  • A lower load factor means that the hash table has more empty buckets, resulting in wasted memory space.
  • On the other hand, a higher load factor may lead to more collisions but utilizes memory more efficiently.

Rehashing:

  • When the load factor exceeds a certain threshold (e.g., 0.75; std::unordered_map rehashes when the load factor would exceed max_load_factor(), which defaults to 1.0), the hash table is resized to maintain its performance.
  • Resizing allocates a larger bucket array and rehashes all the elements to redistribute them across the new buckets.
  • Rehashing is an expensive O(n) operation that can impact performance, especially if it occurs frequently.

Here's an example to illustrate the effect of load factor on performance:

#include <chrono>
#include <iostream>
#include <unordered_map>

int main() {
  using namespace std::chrono;

  // Low load factor: far more buckets than elements (~0.1 after insertion)
  std::unordered_map<int, int> map1(10000);

  // High load factor: few buckets, with max_load_factor raised so the
  // table does not rehash itself back to a low load factor on insertion
  std::unordered_map<int, int> map2(100);
  map2.max_load_factor(10.0f);

  // Insert the same elements into both hash tables
  for (int i = 0; i < 1000; i++) {
    map1[i] = i;
    map2[i] = i;
  }

  // Accumulate hit counts so the compiler cannot optimize the lookups away
  volatile int hits = 0;

  // Measure lookup time for map1
  auto start1 = high_resolution_clock::now();
  for (int i = 0; i < 10000; i++) {
    hits = hits + (map1.find(i % 1000) != map1.end());
  }
  auto end1 = high_resolution_clock::now();
  auto duration1 =
    duration_cast<microseconds>(end1 - start1);

  // Measure lookup time for map2
  auto start2 = high_resolution_clock::now();
  for (int i = 0; i < 10000; i++) {
    hits = hits + (map2.find(i % 1000) != map2.end());
  }
  auto end2 = high_resolution_clock::now();
  auto duration2 =
    duration_cast<microseconds>(end2 - start2);

  std::cout << "map1 load factor: " << map1.load_factor() << "\n";
  std::cout << "map2 load factor: " << map2.load_factor() << "\n";
  std::cout << "Lookup time with low load factor: "
    << duration1.count() << " microseconds\n";
  std::cout << "Lookup time with high load factor: "
    << duration2.count() << " microseconds\n";
}

In this example, we create two hash tables: map1 keeps a low load factor (many buckets relative to its elements), while map2 keeps a high one. We insert the same number of elements into both hash tables and measure the lookup time for each.

The output might resemble:

Lookup time with low load factor: 15 microseconds
Lookup time with high load factor: 28 microseconds

The actual values may vary depending on your system, but the general trend is that the lookup time is higher for the hash table with a high load factor due to more collisions and longer collision resolution.

In practice, it's important to choose an appropriate load factor that balances performance and memory usage. A common rule of thumb is to keep the load factor below 0.75 to maintain good performance. When the load factor exceeds this threshold, the hash table is typically resized to a larger capacity to reduce collisions and improve performance.
