Hashing and std::hash

This lesson provides an in-depth look at hashing in C++, including std::hash, collision strategies, and usage in hash-based containers.
This lesson is part of the course:

Professional C++

Comprehensive course covering advanced concepts, and how to use them on large-scale projects.

Free, Unlimited Access
3D art showing a character looking at a screen
Ryan McCombe
Ryan McCombe
Edited

We covered arrays in some detail in our earlier chapter. We explained how these structures give us a simple way to group data - we can add or remove items to the collection and iterate over them easily.

We can also jump to any item in the collection, using its index. This is a very fast operation - it can be done in constant time, that is, O(1)O(1)

We covered time complexity and "Big O" notation in an earlier lesson:

Finding Items by Value

However, when we need to find specific items within our collection or determine if the array contains the item at all, things get a lot slower.

For example, let's imagine we want our users to select a user name, and we want to be able to check if their name was already taken.

To determine if an array contains a specific user name, we have to iterate over the collection, checking every index one at a time.

The more users we have, the longer this takes. Specifically, it scales linearly - that is, O(n)O(n) - where nn is the size of our array.

Constant Time Lookups

It is possible to set up an array that allows us to find an item in constant time. That means, no matter how big our collection is, we can always find items quickly.

To do this, the key idea is that we create an algorithm that looks at the value of the object, and based on that value, determines at what index the object should be inserted at.

If the values we’re storing are positive integers, there is a very basic algorithm we could use - we simply use the value as the index. For example, if we want to add the value 42 to our collection, we could insert it at index 42:

#include <vector>
#include <iostream>

void Insert(std::vector<int>& A, int Value) {
  A[Value] = Value;
}

int main() {
  std::vector<int> Array;
  Array.resize(100);
  Insert(Array, 42);

  std::cout << "Value at index 42: "
    << Array[42];
}
Value at index 42: 42

If we enforce this rule everywhere we use our array, we can perform constant time lookups by value.

Because, if we wanted to check if our array contains the number 42 for example, our rule dictates that there is only one place the value 42 could be - index 42.

If a value is not at the location our rule says it would be, we can immediately conclude that our array doesn’t contain that value:

#include <vector>
#include <iostream>

void Insert(std::vector<int>& A, int Value) {
  A[Value] = Value;
}

bool Contains(
  const std::vector<int>& A, int Value) {
  return A[Value] == Value;
}

int main() {
  std::vector<int> Array;
  Array.resize(100);
  Insert(Array, 42);

  if (Contains(Array, 42)) {
    std::cout << "Array contains 42";
  }

  if (!Contains(Array, 17)) {
    std::cout << "\nBut not 17";
  }
}
Array contains 42
But not 17

This is the basic idea behind two of the most important data structures in programming - the hash set and the hash table, sometimes also called a hash map.

Standard library implementations of these data structures are available in the form of std::unordered_set and std::unordered_map, which we cover later in this chapter.

The rest of this lesson focuses on some of the theoretical foundations of how they work.

Hash Functions

Our previous example showed a basic implementation of how we could get constant-time searching for positive integers. To make this approach more general, we have to make two improvements:

  • Allow it to work across many different data types. Our previous solution only works with positive integers.
  • Allow it to be more memory efficient. If our previous solution could store integers up to a value of 100,000 for example, it would require we allocate an array with a size of 100,000.

The idea of a hash function was invented to solve both of these problems.

A hash function is something that receives an input and, based on that input, generates an output in a form that we can use to solve some task.

The output of a hash function is sometimes referred to as the hash or the digest of the input.

When our goal is to determine where to put that input in an array, we’d want our digest to be something we can use as an index.

A simple hash function we could use to solve this task might just apply the modulus operator to the number, generating an output between 0 and 9:

size_t Hash(int Number) {
  return Number % 10;
}

Now, regardless of what the input to this function is, its output will be one of 10 possibilities - a number from 0 to 9

We can use this technique in our data structure, applying this hash function to both our insertion and search functions:

#include <vector>
#include <iostream>

size_t Hash(int Number) {
  return Number % 10;
}

void Insert(std::vector<int>& A, int Value) {
  A[Hash(Value)] = Value;
}

bool Contains(
  const std::vector<int>& A, int Value) {
  return A[Hash(Value)] == Value;
}

int main() {
  std::vector<int> Array;
  Array.resize(10);
  Insert(Array, 1'234'567);

  if (Contains(Array, 1'234'567)) {
    std::cout << "Array contains 1,234,567";
  }
}
Array contains 1,234,567

Previously, to store a number like 1,234,567 whilst supporting constant time searches, our array would have needed to be much bigger. Now, thanks to the hash function, our array has a size() of only 10, yet can still accomplish the task.

Of course, there is a problem here. Many different values will hash to the same index using this function. For example, 2, 12, 22, and 100'002 will all hash to index 2.

When this happens, it is referred to as a collision, and we introduce some techniques to deal with this in the next section.

What if we're not storing numbers?

The examples here are using numbers for simplicity, we can write hash functions for any type of data. We simply choose a way to interpret that data that results in a positive integer.

For example, if we were hashing strings, we could assign numbers to characters, and then add the numeric representation of the characters together.

So, a string like "abc" might be interpreted as 1 + 2 + 3, which is 6

Large blocks of text would generate much larger numbers, but we could then use the % operator to coerce these numbers to a controllable range, as we did above.

Collisions

When we provide two different inputs to our hash function, but it generates the same output for both, this is referred to as a collision.

In many contexts where hash functions are applied, collisions aren’t necessarily a problem. But when we’re using a hash function to determine where to put something in an array, we need a strategy to handle collisions.

The most common approaches that are used for this are probing, double hashing, and secondary chaining:

Probing

Using this strategy, we simply adopt a rule where, if a slot is taken, we implement a rule that searches for another slot.

For example, if we're inserting 42 and our hash function returns index 2 for that value, we will check index 2 of our array:

  • If index 2 is empty, we insert our value 42 there
  • If index 2 already contains a different value, we could check index 3, then 4, and continue until we find an empty slot to put 42 in.

When searching for a value, we follow the same pattern. If we're looking for 42, our hash function will return 2, so we check index 2:

  • If index 2 is empty, 42 isn't in our collection.
  • If index 2 contains 42, we found it
  • If index 2 contains a different value, we then linearly search index 3, then 4, and we continue until we either find 42 or we find empty slot. If we hit an empty slot, that means 42 isn't in our collection.

This technique is referred to as linear probing, as we simply search through the array by repeatedly adding a fixed number to our index until we find what we’re looking for.

This number is referred to as the offset. There are many variations of this same idea, such as quadratic probing, which attempts to generate offsets in a way that prevents objects from being clustered together in the same part of our array.

Double Hashing

Probing techniques that use pre-determined offsets can generate long chains of collisions.

This will degrade our performance, as both our search and insertion functions need to check more indices before they find an empty slot, or the object they’re looking for.

Double hashing resolves this by introducing a second hash function, specifically to generate offsets to use within the probing process.

Because the offsets are now dynamically generated, based on the objects we’re hashing, we’re unlikely to create chains of collisions.

This is because it’s very unlikely both hash functions will generate the same output for different inputs.

So, even if the first hash function generates the same index for two different objects, the second hash function will generate a different offset, meaning our objects immediately stop competing for the same slots.

Secondary Chaining

Secondary chaining is a different approach entirely to deal with collisions. Instead of each index in the array containing a unique value, it instead contains a nested collection of values that share that hash.

If a value hashes to the same index, it simply gets added to the collection in that index.

Secondary chaining has an additional property - it allows us to store duplicate values.

The C++ standard library contains variations of hash sets and hash maps that use secondary chaining, thereby allowing duplicate values. They are std::unordered_multiset and std::unordered_multimap respectively.

Rehashing

If our data structure is getting so full that collisions are happening much too frequently, rebuilding our data structure is an option. This involves 3 steps:

1. Allocating a larger Array

First, we would create a new, larger array. If our previous collection was storing 10 items in an array with a size of 20, each hash would have a 50% chance to return an index that is already occupied.

If we increase the size of our array to 100, only 10% of hashes would return an occupied index.

Load Factor

This proportion is sometimes referred to as the load factor of our container.

  • An array with a size of 2020 where 1010 slots are occupied has a load factor of 0.50.5, that is 10Ă·2010 \div 20.
  • An array with a size of 100100 where 1010 slots are occupied has a load factor of 0.10.1, that is 10Ă·10010 \div 100.

Whether we want our load factor to be higher or lower depends on our priorities. Assuming all other factors are equal, then:

  • A higher load factor means collisions are more frequent, resulting in slower insertions and searches.
  • A lower load factor means our array needs to be larger, resulting in our container consuming more memory.

2. Updating the Hash Function

Now that we have a larger array, we need to update our hash function to take advantage of the available space.

If our previous array had a size of 20, we were likely using a Hash % 20 calculation to output indices in the 0-19 range.

If our new array has a size of 100, we could change this to be a Hash % 100 calculation, outputting indices in the 0-99 range.

3. Rehashing

Finally, we need to move all of the existing objects from our old array to our new one.

Whilst doing this, we need to calculate their new index, using our new hash function. This ensures they are inserted into the correct position in our new array.

The std::hash Template

The standard library includes the std::hash class template, which implements hashing for a variety of types.

We create a hasher by instantiating an object from this class, passing the type of the object we want to hash as a template parameter:

#include <string>

int main() {
  auto IntHasher{
    std::hash<int>{}
  };

  auto StringHasher{
    std::hash<std::string>{}
  };
}

The class overloads the () operator, giving us a function-like interface to use our object to hash values of the type we declared.

An object that can be used like a function is referred to as a function object, or functor. We cover these in more detail later in the course:

#include <iostream>

int main() {
  using std::cout;

  auto IntHasher{std::hash<int>{}};
  cout << "42: " << IntHasher(42) << '\n';

  auto StringHasher{std::hash<std::string>{}};
  cout << "Hello: " << StringHasher("Hello");
}
42: 10203658981158674303
Hello: 7201466553693376363

These are rather large numbers. That’s intentional, as it gives us a lot of flexibility on what size of array we use to store our objects.

Remember, large numbers like these can easily be mapped to a much smaller range of indices using the % operator. If we wanted to use an array with a size of 15, for example:

#include <iostream>

int main() {
  using std::cout;

  auto IntHasher{std::hash<int>{}};
  cout << "42 goes to index: "
    << IntHasher(42) % 15 << '\n';

  auto StringHasher{std::hash<std::string>{}};
  cout << "Hello goes to index: "
    << StringHasher("Hello") % 15;
}
42 goes to index: 8
Hello goes to index: 13

Implementing std::hash for Custom Types

The std::hash template is what is used by default in hash-based standard library containers such as std::unordered_set and std::unordered_map

If we want our custom types to be compatible with these containers, it can be helpful to implement std::hash for our types.

In most real-world applications, our classes will have some class variable that is an appropriate hash input, and of a type already supported by std::hash such as std::string

If we’re loading data from a database, for example, we’ll almost certainly have access to unique "id" strings we could use.

In this example, we’ve got a type used to represent users, where their email address or username would be an appropriate basis for hashing, on the basis that it is unique for each user:

// User.h
#pragma once
#include <string>

struct User {
  std::string Email;
  bool operator==(const User& Other) const {
    return Email == Other.Email;
  }
};

namespace std {
template <>
struct hash<User> {
  size_t operator()(const User& U) const {
    return std::hash<std::string>{}(U.Email);
  }
};
}

Note, that for a hash function to be implemented correctly, two equal objects must hash to the same value. This has implications we need to be aware of when setting up our type.

Specifically, if A == B returns true, then Hash(A) == Hash(B) should also return true. If our type doesn’t implement this prerequisite correctly, hash-based containers that use our type will not work correctly.

The Equality Operator

Our previous example also implements the == operator for our User type.

Remember, hash functions can create collisions, where two different inputs generate the same hash. Therefore, we need to separately implement a way to determine if two objects are truly the same.

Conventionally, that’s done using the equality operator ==.

Hash Sets

Now that we have a custom type that implements std::hash and the equality operator ==, we can use it with standard library hash containers, such as std::unordered_set.

This container allows us to add objects to our collection using the emplace() method, and determine if our collection includes an object using the contains() method:

#include <iostream>
#include <unordered_set>
#include "User.h"

int main() {
  std::unordered_set<User> Users;
  Users.emplace("user@example.com");

  User NewUser{"user@example.com"};
  if (Users.contains(NewUser)) {
    std::cout << "That user already exists";
  }
}
That user already exists

Unlike when we’re using arrays, both of these operations are fast, constant time algorithms. We cover std::unordered_set and its capabilities in full detail later in the chapter.

Hash Tables

So far, our examples are focused on the idea of a hash set, which gives us a performant way to determine whether a collection contains an object or not.

A simple extension of this idea unlocks a much more powerful data structure - the hash table.

With a hash table, we store two objects in each slot of our container. The two objects in each slot are commonly called the key and the value. In order words, a hash table is a collection of key-value pairs.

The key is any hashable type - although simple types like an enum or string tend to be common.

The value is any type at all - it doesn’t have to be hashable.

Practical Use Cases and std::unordered_map

Now, if we have a key, we can use that key to quickly retrieve the value associated with it.

For example, we might have a database of users, containing their email, name, address, and more.

We could store this in a hash map, where the email address is the key, and the full user object is the value.

The standard library’s implementation of a hash map is std::unordered_map. It accepts two template parameters:

  • The data type we’re using as a key
  • The data type we’re using as a value

Below, our key will be email addresses in the form of a std::string, whilst our values will be user records in the form of a custom type called User:

#include <iostream>
#include <unordered_map>

struct User {
  std::string Email;
  std::string Name;
  std::string Address;
};

using Users =
  std::unordered_map<std::string, User>;

Now, we can access our hash table using key and value pairs.

Similar to arrays, we access std::unordered_map values using the [] operator. However, instead of passing indices to this operator, we are now passing keys.

In the following example, we are retrieving the User record associated with the bob@example.com key:

User GetBob(Users& HashTable) {
  return HashTable["bob@example.com"];
}

We cover std::unordered_map in full detail later in the chapter, but this example gives a quick overview.

We populate our user collection with some data we might retrieve from a database or elsewhere.

Then, when we need to retrieve data for a specific user, we just pass their email address to the [] operator:

#include <iostream>
#include <unordered_map>

struct User {
  std::string Email;
  std::string Name;
  std::string Address;
};

using Users =
  std::unordered_map<std::string, User>;

void Log(Users& Database, std::string Email) {
  User U{Database[Email]};
  std::cout << std::format(
    "{} ({})\n{}", U.Name, U.Email, U.Address);
}

int main() {
  Users UserDatabase;

  UserDatabase["bob@example.com"] = {
    "bob@example.com",
    "Bob Roberts",
    "21 The Street, Some City"
  };

  Log(UserDatabase, "bob@example.com");
}
Bob Roberts (bob@example.com)
21 The Street, Some City

Because we’re using hashing, these insertions and lookups are constant-time operations. They remain fast regardless of how big our collection grows. We cover std::unordered_map in full later in the chapter.

Summary

This lesson delved into hashing and hash functions, their purpose, and handling collisions. We also introduced C++ implementations of these concepts, such as std::hash, std::unordered_set, and std::unordered_map.

Key Takeaways:

  • Demonstrated constant-time lookups in arrays and introduced the basic principles of hash functions.
  • Addressed the challenge of collisions in hashing and explored strategies like probing, double hashing, and secondary chaining.
  • Discussed the process of rehashing for optimizing the performance of hash-based data structures.
  • Illustrated how to implement std::hash for custom types and its integration with hash-based containers.
  • Showcased the use of std::unordered_set for managing unique data elements.
  • Emphasized the significance of key-value pairs in hash tables for fast data access and management.
  • Explained the concept and practical applications of hash tables through std::unordered_map.

Was this lesson useful?

Edit History

  • — Refreshed Content

  • — First Published

Ryan McCombe
Ryan McCombe
Edited
7a.jpg
This lesson is part of the course:

Professional C++

Comprehensive course covering advanced concepts, and how to use them on large-scale projects.

Free, Unlimited Access
Hash Sets and Hash Maps
Next Lesson

Implementing Hash Sets using std::unordered_set

This lesson provides a thorough understanding of std::unordered_set, from basic initialization to handling custom types and collisions
3D art showing a character concept
Contact|Privacy Policy|Terms of Use
Copyright © 2024 - All Rights Reserved