We covered arrays in some detail in our earlier chapter. We explained how these structures give us a simple way to group data - we can add or remove items to the collection and iterate over them easily.

We can also jump to any item in the collection, using its index. This is a very fast operation - it can be done in constant time, that is, $O(1)$

We covered time complexity and "Big O" notation in an earlier lesson:

Algorithm Analysis and Big O Notation

An introduction to algorithms - the foundations of computer science. Learn how to design, analyze, and compare them.

View Related Lesson

Finding Items by Value

However, when we need to find specific items within our collection or determine if the array contains the item at all, things get a lot slower.

For example, let's imagine we want our users to select a user name, and we want to be able to check if their name was already taken.

To determine if an array contains a specific user name, we have to iterate over the collection, checking every index one at a time.

The more users we have, the longer this takes. Specifically, it scales linearly - that is, $O(n)$ - where $n$ is the size of our array.

Constant Time Lookups

It is possible to set up an array that allows us to find an item in constant time. That means, no matter how big our collection is, we can always find items quickly.

To do this, the key idea is that we create an algorithm that looks at the value of the object, and based on that value, determines at what index the object should be inserted at.

If the values we're storing are positive integers, there is a very basic algorithm we could use - we simply use the value as the index. For example, if we want to add the value 42 to our collection, we could insert it at index 42:

1#include <vector>
2#include <iostream>
3
4void Insert(std::vector<int>& A, int Value) {
5  A[Value] = Value;
6}
7
8int main() {
9  std::vector<int> Array;
10  Array.resize(100);
11  Insert(Array, 42);
12
13  std::cout << "Value at index 42: "
14    << Array[42];
15}

1Value at index 42: 42

If we enforce this rule everywhere we use our array, we can perform constant time lookups by value.

Because, if we wanted to check if our array contains the number 42 for example, our rule dictates that there is only one place the value 42 could be - index 42.

If a value is not at the location our rule says it would be, we can immediately conclude that our array doesn't contain that value:

1#include <vector>
2#include <iostream>
3
4void Insert(std::vector<int>& A, int Value) {
5  A[Value] = Value;
6}
7
8bool Contains(
9  const std::vector<int>& A, int Value) {
10  return A[Value] == Value;
11}
12
13int main() {
14  std::vector<int> Array;
15  Array.resize(100);
16  Insert(Array, 42);
17
18  if (Contains(Array, 42)) {
19    std::cout << "Array contains 42";
20  }
21
22  if (!Contains(Array, 17)) {
23    std::cout << "\nBut not 17";
24  }
25}

1Array contains 42
2But not 17

This is the basic idea behind two of the most important data structures in programming - the hash set and the hash table, sometimes also called a hash map.

Standard library implementations of these data structures are available in the form of std::unordered_set and std::unordered_map, which we cover later in this chapter.

The rest of this lesson focuses on some of the theoretical foundations of how they work.

Hash Functions

Our previous example showed a basic implementation of how we could get constant-time searching for positive integers. To make this approach more general, we have to make two improvements:

Allow it to work across many different data types. Our previous solution only works with positive integers.
Allow it to be more memory efficient. If our previous solution could store integers up to a value of 100,000 for example, it would require we allocate an array with a size of 100,000.

The idea of a hash function was invented to solve both of these problems.

A hash function is something that receives an input and, based on that input, generates an output in a form that we can use to solve some task.

The output of a hash function is sometimes referred to as the hash or the digest of the input.

When our goal is to determine where to put that input in an array, we'd want our digest to be something we can use as an index.

A simple hash function we could use to solve this task might just apply the modulus operator to the number, generating an output between 0 and 9:

1size_t Hash(int Number) {
2  return Number % 10;
3}

The Modulus Operator (`%`)

Learn how we can use the modulus operator to get the remainder of integer division, and some common use cases.

View Related Lesson

Now, regardless of what the input to this function is, its output will be one of 10 possibilities - a number from 0 to 9

We can use this technique in our data structure, applying this hash function to both our insertion and search functions:

1#include <vector>
2#include <iostream>
3
4size_t Hash(int Number) {
5  return Number % 10;
6}
7
8void Insert(std::vector<int>& A, int Value) {
9  A[Hash(Value)] = Value;
10}
11
12bool Contains(
13  const std::vector<int>& A, int Value) {
14  return A[Hash(Value)] == Value;
15}
16
17int main() {
18  std::vector<int> Array;
19  Array.resize(10);
20  Insert(Array, 1'234'567);
21
22  if (Contains(Array, 1'234'567)) {
23    std::cout << "Array contains 1,234,567";
24  }
25}

1Array contains 1,234,567

Previously, to store a number like 1,234,567 whilst supporting constant time searches, our array would have needed to be much bigger. Now, thanks to the hash function, our array has a size() of only 10, yet can still accomplish the task.

Of course, there is a problem here. Many different values will hash to the same index using this function. For example, 2, 12, 22, and 100'002 will all hash to index 2.

When this happens, it is referred to as a collision, and we introduce some techniques to deal with this in the next section.

Collisions

When we provide two different inputs to our hash function, but it generates the same output for both, this is referred to as a collision.

In many contexts where hash functions are applied, collisions aren't necessarily a problem. But when we're using a hash function to determine where to put something in an array, we need a strategy to handle collisions.

The most common approaches that are used for this are probing, double hashing, and secondary chaining:

Probing

Using this strategy, we simply adopt a rule where, if a slot is taken, we implement a rule that searches for another slot.

For example, if we're inserting 42 and our hash function returns index 2 for that value, we will check index 2 of our array:

If index 2 is empty, we insert our value 42 there
If index 2 already contains a different value, we could check index 3, then 4, and continue until we find an empty slot to put 42 in.

When searching for a value, we follow the same pattern. If we're looking for 42, our hash function will return 2, so we check index 2:

If index 2 is empty, 42 isn't in our collection.
If index 2 contains 42, we found it
If index 2 contains a different value, we then linearly search index 3, then 4, and we continue until we either find 42 or we find empty slot. If we hit an empty slot, that means 42 isn't in our collection.

This technique is referred to as linear probing, as we simply search through the array by repeatedly adding a fixed number to our index until we find what we're looking for.

This number is referred to as the offset. There are many variations of this same idea, such as quadratic probing, which attempts to generate offsets in a way that prevents objects from being clustered together in the same part of our array.

Double Hashing

Probing techniques that use pre-determined offsets can generate long chains of collisions.

This will degrade our performance, as both our search and insertion functions need to check more indices before they find an empty slot, or the object they're looking for.

Double hashing resolves this by introducing a second hash function, specifically to generate offsets to use within the probing process.

Because the offsets are now dynamically generated, based on the objects we're hashing, we're unlikely to create chains of collisions.

This is because it's very unlikely both hash functions will generate the same output for different inputs.

So, even if the first hash function generates the same index for two different objects, the second hash function will generate a different offset, meaning our objects immediately stop competing for the same slots.

Secondary Chaining

Secondary chaining is a different approach entirely to deal with collisions. Instead of each index in the array containing a unique value, it instead contains a nested collection of values that share that hash.

If a value hashes to the same index, it simply gets added to the collection in that index.

Secondary chaining has an additional property - it allows us to store duplicate values.

The C++ standard library contains variations of hash sets and hash maps that use secondary chaining, thereby allowing duplicate values. They are std::unordered_multiset and std::unordered_multimap respectively.

Rehashing

If our data structure is getting so full that collisions are happening much too frequently, rebuilding our data structure is an option. This involves 3 steps:

Allocating a Larger Array

First, we would create a new, larger array. If our previous collection was storing 10 items in an array with a size of 20, each hash would have a 50% chance to return an index that is already occupied.

If we increase the size of our array to 100, only 10% of hashes would return an occupied index.

Updating the Hash Function

Now that we have a larger array, we need to update our hash function to take advantage of the available space.

If our previous array had a size of 20, we were likely using a Hash % 20 calculation to output indices in the 0-19 range.

If our new array has a size of 100, we could change this to be a Hash % 100 calculation, outputting indices in the 0-99 range.

Rehashing

Finally, we need to move all of the existing objects from our old array to our new one.

Whilst doing this, we need to calculate their new index, using our new hash function. This ensures they are inserted into the correct position in our new array.

The `std::hash` Template

The standard library includes the std::hash class template, which implements hashing for a variety of types.

We create a hasher by instantiating an object from this class, passing the type of the object we want to hash as a template parameter:

1#include <string>
2
3int main() {
4  auto IntHasher{
5    std::hash<int>{}
6  };
7
8  auto StringHasher{
9    std::hash<std::string>{}
10  };
11}

The class overloads the () operator, giving us a function-like interface to use our object to hash values of the type we declared.

An object that can be used like a function is referred to as a function object, or functor. We cover these in more detail later in the course:

1#include <iostream>
2
3int main() {
4  using std::cout;
5
6  auto IntHasher{std::hash<int>{}};
7  cout << "42: " << IntHasher(42) << '\n';
8
9  auto StringHasher{std::hash<std::string>{}};
10  cout << "Hello: " << StringHasher("Hello");
11}

142: 10203658981158674303
2Hello: 7201466553693376363

These are rather large numbers. That's intentional, as it gives us a lot of flexibility on what size of array we use to store our objects.

Remember, large numbers like these can easily be mapped to a much smaller range of indices using the % operator. If we wanted to use an array with a size of 15, for example:

1#include <iostream>
2
3int main() {
4  using std::cout;
5
6  auto IntHasher{std::hash<int>{}};
7  cout << "42 goes to index: "
8    << IntHasher(42) % 15 << '\n';
9
10  auto StringHasher{std::hash<std::string>{}};
11  cout << "Hello goes to index: "
12    << StringHasher("Hello") % 15;
13}

142 goes to index: 8
2Hello goes to index: 13

Implementing `std::hash` for Custom Types

The std::hash template is what is used by default in hash-based standard library containers such as std::unordered_set and std::unordered_map

If we want our custom types to be compatible with these containers, it can be helpful to implement std::hash for our types.

In most real-world applications, our classes will have some class variable that is an appropriate hash input, and of a type already supported by std::hash such as std::string

If we're loading data from a database, for example, we'll almost certainly have access to unique "id" strings we could use.

In this example, we've got a type used to represent users, where their email address or username would be an appropriate basis for hashing, on the basis that it is unique for each user:

1// User.h
2#pragma once
3#include <string>
4
5struct User {
6  std::string Email;
7  bool operator==(const User& Other) const {
8    return Email == Other.Email;
9  }
10};
11
12namespace std {
13template <>
14struct hash<User> {
15  size_t operator()(const User& U) const {
16    return std::hash<std::string>{}(U.Email);
17  }
18};
19}

Template Specialization

A practical guide to template specialization in C++ covering full and partial specialization, and the scenarios where they're useful

View Related Lesson

Note that, for a hash function to be implemented correctly, two equal objects must hash to the same value. This has implications we need to be aware of when setting up our type.

Specifically, if A == B returns true, then Hash(A) == Hash(B) should also return true. If our type doesn't implement this prerequisite correctly, hash-based containers that use our type will not work correctly.

The Equality Operator

Our previous example also implements the == operator for our User type.

Remember, hash functions can create collisions, where two different inputs generate the same hash. Therefore, we need to separately implement a way to determine if two objects are truly the same.

Conventionally, that's done using the equality operator ==.

Operator Overloading

This lesson introduces operator overloading, a fundamental concept to create more intuitive and readable code by customizing operators for user-defined types

View Related Lesson

Hash Sets

Now that we have a custom type that implements std::hash and the equality operator ==, we can use it with standard library hash containers, such as std::unordered_set.

This container allows us to add objects to our collection using the emplace() method, and determine if our collection includes an object using the contains() method:

1#include <iostream>
2#include <unordered_set>
3#include "User.h"
4
5int main() {
6  std::unordered_set<User> Users;
7  Users.emplace("user@example.com");
8
9  User NewUser{"user@example.com"};
10  if (Users.contains(NewUser)) {
11    std::cout << "That user already exists";
12  }
13}

1That user already exists

Unlike when we're using arrays, both of these operations are fast, constant time algorithms. We cover std::unordered_set and its capabilities in full detail later in the chapter.

Hash Sets using `std::unordered_set`

This lesson provides a thorough understanding of std::unordered_set, from basic initialization to handling custom types and collisions

View Related Lesson

Hash Tables

So far, our examples are focused on the idea of a hash set, which gives us a performant way to determine whether a collection contains an object or not.

A simple extension of this idea unlocks a much more powerful data structure - the hash table.

With a hash table, we store two objects in each slot of our container. The two objects in each slot are commonly called the key and the value. In order words, a hash table is a collection of key-value pairs.

The key is any hashable type - although simple types like an enum or string tend to be common.

The value is any type at all - it doesn't have to be hashable.

Practical Use Cases and `std::unordered_map`

Now, if we have a key, we can use that key to quickly retrieve the value associated with it.

For example, we might have a database of users, containing their email, name, address, and more.

We could store this in a hash map, where the email address is the key, and the full user object is the value.

The standard library's implementation of a hash map is std::unordered_map. It accepts two template parameters:

The data type we're using as a key
The data type we're using as a value

Below, our key will be email addresses in the form of a std::string, whilst our values will be user records in the form of a custom type called User:

1#include <iostream>
2#include <unordered_map>
3
4struct User {
5  std::string Email;
6  std::string Name;
7  std::string Address;
8};
9
10using Users =
11  std::unordered_map<std::string, User>;

Now, we can access our hash table using key and value pairs.

Similar to arrays, we access std::unordered_map values using the [] operator. However, instead of passing indices to this operator, we are now passing keys.

In the following example, we are retrieving the User record associated with the bob@example.com key:

1User GetBob(Users& HashTable) {
2  return HashTable["bob@example.com"];
3}

In the following example, we populate our user collection with some data we might retrieve from a database or elsewhere. Then, when we need to retrieve data for a specific user, we just pass their email address to the [] operator:

1#include <iostream>
2#include <unordered_map>
3
4struct User {
5  std::string Email;
6  std::string Name;
7  std::string Address;
8};
9
10using Users =
11  std::unordered_map<std::string, User>;
12
13void Log(Users& Database, std::string Email) {
14  User U{Database[Email]};
15  std::cout << std::format(
16    "{} ({})\n{}", U.Name, U.Email, U.Address);
17}
18
19int main() {
20  Users UserDatabase;
21
22  UserDatabase["bob@example.com"] = {
23    "bob@example.com",
24    "Bob Roberts",
25    "21 The Street, Some City"
26  };
27
28  Log(UserDatabase, "bob@example.com");
29}

1Bob Roberts (bob@example.com)
221 The Street, Some City

Because we're using hashing, these insertions and lookups are constant-time operations. They remain fast regardless of how big our collection grows. We cover std::unordered_map in full later in the chapter.

Hash Maps using `std::unordered_map`

Creating hash maps using the standard library's std::unordered_map container

View Related Lesson

Summary

This lesson delved into hashing and hash functions, their purpose, and handling collisions. We also introduced C++ implementations of these concepts, such as std::hash, std::unordered_set, and std::unordered_map.

Key Takeaways:

Demonstrated constant-time lookups in arrays and introduced the basic principles of hash functions.
Addressed the challenge of collisions in hashing and explored strategies like probing, double hashing, and secondary chaining.
Discussed the process of rehashing for optimizing the performance of hash-based data structures.
Illustrated how to implement std::hash for custom types and its integration with hash-based containers.
Showcased the use of std::unordered_set for managing unique data elements.
Emphasized the significance of key-value pairs in hash tables for fast data access and management.
Explained the concept and practical applications of hash tables through std::unordered_map.

Hashing and `std::hash`

Algorithm Analysis and Big O Notation

Finding Items by Value

Constant Time Lookups

Hash Functions

The Modulus Operator (`%`)

Collisions

Probing

Double Hashing

Secondary Chaining

Rehashing

Allocating a Larger Array

Updating the Hash Function

Rehashing

The `std::hash` Template

Implementing `std::hash` for Custom Types

Template Specialization

The Equality Operator

Operator Overloading

Hash Sets

Hash Sets using `std::unordered_set`

Hash Tables

Practical Use Cases and `std::unordered_map`

Hash Maps using `std::unordered_map`

Summary

Hash Sets using `std::unordered_set`

Professional C++

Questions & Answers

Hashing and std::hash

Algorithm Analysis and Big O Notation

Finding Items by Value

Constant Time Lookups

Hash Functions

The Modulus Operator (%)

Collisions

Probing

Double Hashing

Secondary Chaining

Rehashing

Allocating a Larger Array

Updating the Hash Function

Rehashing

The std::hash Template

Implementing std::hash for Custom Types

Template Specialization

The Equality Operator

Operator Overloading

Hash Sets

Hash Sets using std::unordered_set

Hash Tables

Practical Use Cases and std::unordered_map

Hash Maps using std::unordered_map

Summary

Hash Sets using std::unordered_set

Questions & Answers

Hashing and `std::hash`

The Modulus Operator (`%`)

The `std::hash` Template

Implementing `std::hash` for Custom Types

Hash Sets using `std::unordered_set`

Practical Use Cases and `std::unordered_map`

Hash Maps using `std::unordered_map`

Hash Sets using `std::unordered_set`