Implementing Hash Sets using std::unordered_set

This lesson provides a thorough understanding of std::unordered_set, from basic initialization to handling custom types and collisions
This lesson is part of the course:

Professional C++

Comprehensive course covering advanced concepts, and how to use them on large-scale projects.

Free, Unlimited Access
3D art showing a character concept
Ryan McCombe
Ryan McCombe
Updated

In this lesson, we’ll cover the C++ Standard Library’s implementation of a hash set, which is the std::unordered_set.

  • It is a collection of unique elements, ensuring each element appears only once - no duplicates.
  • Elements are stored in no particular order. This has implications for iteration, which we’ll cover.
  • The set uses a hash function to organize its elements, which allows for efficient insertion and searching.

In the first part of the lesson, we'll explore the basics of std::unordered_set, including its creation and initialization. You'll learn how this container leverages hashing to store elements efficiently, ensuring quick access and manipulation.

As we progress, we'll delve into more advanced topics such as customizing hash functions, customizing the load factor, and more.

This lesson builds on the theory we established in the previous section, which may be helpful to review for those less familiar with hashing, and how it can be used to create data structures:

Creating Unordered Sets

The std::unordered_set class is available once the <unordered_set> header is included. They can be constructed in the same way as other containers - we specify the type of data they will contain as a template parameter:

#include <unordered_set>

int main() {
  std::unordered_set<int> MySet;
}

The container can be populated with initial values:

#include <unordered_set>

int main() {
  std::unordered_set<int> MySet { 1, 2, 3 };
}

When providing initial values, the type of those values can inferred using class template argument deduction (CTAD), so we do not need to specify it:

#include <unordered_set>

int main() {
  std::unordered_set MySet { 1, 2, 3 };
}

Initializing a std::unordered_set using iterators

The std::unordered_set constructor has an overload that allows us to pass an iterator pair. This will initialize our container with the values contained within that range:

#include <vector>
#include <unordered_set>

int main() {
  std::vector MyVec{1, 2, 2, 2, 3};
  std::unordered_set MySet{
    MyVec.begin(), MyVec.end()};
}

Remember, a std::unordered_set cannot contain duplicates, so it may be initialized with fewer values than the original range.

In this case, MySet will have only three objects - 1, 2, and 3. The additional duplicate 2s are omitted.

Iteration

Unordered sets support iteration but, as the name suggests, the containers are not ordered. This means that when we iterate over the container, we cannot easily predict our objects' order.

If we want to iterate over them anyway, they implement begin() and end() methods which return iterators. As such, we can use them with range and iterator-based functions and algorithms.

For example, they’re compatible with range-based loops:

#include <unordered_set>
#include <iostream>

int main() {
  std::unordered_set MySet { 1, 2, 3 };
  for (int x : MySet) {
    std::cout << x << ", "; 
  }
}
1, 2, 3,

Below, we call the iterator-based std::vector constructor to create an array from the contents of an unordered set:

#include <unordered_set>
#include <vector>
#include <iostream>

int main() {
  std::unordered_set MySet { 1, 2, 3 };
  std::vector<int> MyVec{
    MySet.begin(), MySet.end()};

  for (int x : MyVec) {
    std::cout << x << ", "; 
  }
}
1, 2, 3,

Here, we use the iterator-based std::for_each() algorithm with an unordered set:

#include <unordered_set>
#include <algorithm>
#include <iostream>

void Log(int x) { std::cout << x << ", "; }

int main() {
  std::unordered_set Set { 1, 2, 3 };
  std::for_each(Set.begin(), Set.end(), Log);
}
1, 2, 3,

Inserting Objects

We can add new data to our container in a few different ways:

1. Constructing Objects in Place using emplace()

The preferred way to add a single object to our set is to construct it in place, if possible. We do this using the emplace(). method. Any arguments we provide to emplace() will be forwarded to the constructor of the underlying type. In this case, that is int:

#include <unordered_set>
#include <iostream>

int main() {
  std::unordered_set<std::string> Set;
  Set.emplace("Hello");

  for (const std::string& S : Set) {
    std::cout << S;
  }
}
Hello

2. Inserting Existing Objects using insert()

Emplacing an object directly into the container is more performant than creating it outside and then moving it in. But, if our object already exists, we can copy (or move) it into the set using insert():

#include <iostream>
#include <unordered_set>

int main() {
  std::unordered_set<std::string> Set;

  std::string Greeting{"Hello"};
  Set.insert(Greeting);

  std::string Target{"Everyone"};
  Set.insert(std::move(Target));

  for (const std::string& S : Set) {
    std::cout << S << ' ';
  }
}
Hello Everyone

We can insert() a list of objects in a single call:

#include <iostream>
#include <unordered_set>

int main() {
  std::unordered_set<std::string> Set;

  Set.insert({"Hello", "Everyone"});

  for (const std::string& S : Set) {
    std::cout << S << ' ';
  }
}
Hello Everyone

We can also pass an iterator pair to insert() to add multiple elements from some other container:

#include <iostream>
#include <unordered_set>

int main() {
  std::unordered_set Set{1, 2, 3};
  std::vector Vec{4, 5, 6};

  Set.insert(Vec.begin(), Vec.end());

  for (int x : Set) {
    std::cout << x << ", ";
  }
}
1, 2, 3, 4, 5, 6,

3. Inserting a Range using insert_range() (C++23)

Note: the insert_range() method was added in C++23, and may not yet be supported by all compilers.

C++20 introduced the concept of a range. We’ll cover ranges in more detail later in the course, but for now, we can think of them as a way of representing an iterator pair as a single object.

Most of the sequential containers we’ve seen so far, such as std::vector and std::list are valid ranges. As such, we can insert the contents of such a container into our set using std::insert_range():

#include <iostream>
#include <unordered_set>

int main() {
  std::unordered_set Set{1, 2, 3};
  std::vector Vec{4, 5, 6};

  Set.insert_range(Vec);

  for (int x : Set) {
    std::cout << x << ", ";
  }
}
1, 2, 3, 4, 5, 6,

Getting the Object Count

The size() method returns the number of items in a set:

#include <iostream>
#include <unordered_set>

int main() {
  std::unordered_set Set{1, 2, 3};

  std::cout << "Size: " << Set.size();
}
Size: 3

If we want to check if a set is empty - that is, the size is 0, we can do that using the more descriptive empty() method:

#include <iostream>
#include <unordered_set>

int main() {
  std::unordered_set Set{1, 2, 3};

  std::cout << "The set "
            << (Set.empty() ? "is" : "is not")
            << " empty";
}
The set is not empty

Removing Objects

We can remove items from a set using the erase() method:

#include <iostream>
#include <unordered_set>

int main() {
  std::unordered_set Set{1, 2, 3};

  Set.erase(1);

  for (int x : Set) {
    std::cout << x << ", ";
  }
}
2, 3,

We don’t need to check if the container contains an object before trying to erase it - we safely call erase() and it will simply have no effect if the object wasn’t in our set:

#include <iostream>
#include <unordered_set>

int main() {
  std::unordered_set Set{1, 2, 3};

  Set.erase(4);

  for (int x : Set) {
    std::cout << x << ", ";
  }
}
1, 2, 3,

Removing all Objects using clear()

We can remove all the items from a set using clear():

#include <iostream>
#include <unordered_set>

int main() {
  std::unordered_set Set{1, 2, 3};

  std::cout << "Size: " << Set.size();
  Set.clear();
  std::cout << "\nSize: " << Set.size();
}
Size: 3
Size: 0

Searching for Objects

There are several ways we search for objects within our std::unordered_set. The most common are:

1. Using the contains() method

The contains() method is the simplest and most common option. It will return true if the object we pass as an argument is in the set, and false otherwise:

#include <iostream>
#include <unordered_set>

int main() {
  std::unordered_set Set{1, 2, 3};

  if (Set.contains(2)) {
    std::cout << "2 is in the set";
  }
}
2 is in the set

2. Before C++20: using count()

The contains() method was only added in C++20. When targeting older specifications, we would use the count() method, which returns the number of matching objects that were found.

Given a std::unordered_set cannot contain duplicates, that return value would be either 0 or 1. Typically we’d just coerce that return value to a boolean, meaning count() is used in the same way as contains():

#include <iostream>
#include <unordered_set>

int main() {
  std::unordered_set Set{1, 2, 3};

  if (Set.count(2)) {
    std::cout << "2 is in the set";
  }
}
2 is in the set

3. Using the insert() return value

Often, we will want to check if our set contains an object, and insert it if it doesn’t.

We can do this using the basic variation of the insert() method, which accepts a single value we want to insert. This function returns a struct with the result of the insertion.

It specifically returns a std::pair - a simple container that we cover in detail in the next lesson. The pair contains two member variables:

  • first: An iterator to where the object we tried to insert is within the set
  • second: A bool representing whether the insertion happened

In other words:

  • If second is false, that means the insertion didn’t happen, because the object was already in the set
  • If second is true, that means the object wasn’t previously in the set, but our insert() call has now added it:
#include <iostream>
#include <unordered_set>

int main() {
  std::unordered_set Set{1, 2, 3};

  auto [it, wasInserted]{Set.insert(4)};

  if (wasInserted) {
    std::cout << "4 wasn't in the set";
  }

  if (Set.contains(4)) {
    std::cout << ", but now it is";
  }
}
4 wasn't in the set, but now it is

4. Using the erase() return value

Another common requirement we’ll have is the desire to check if our set contains an object, and erase that object if it does.

The erase() method can let us implement this, as it returns an integer representing the number of items that it removed. Given an unordered set cannot contain duplicates, erasing an object by value will therefore return 1 if our set contained the value, or 0 otherwise:

#include <iostream>
#include <unordered_set>

int main() {
  std::unordered_set Set{1, 2, 3};

  if (Set.erase(3)) {
    std::cout << "3 was in the set";
  }

  if (!Set.contains(3)) {
    std::cout << ", but not any more";
  }
}
3 was in the set, but not any more

5. Using find() to generate an iterator

We can also search our set using the find() method. This will return an iterator to the object if it was found, or an iterator equal to the end() return value otherwise:

#include <iostream>
#include <unordered_set>

int main() {
  std::unordered_set Set{1, 2, 3};

  auto A{Set.find(3)};

  if (A != Set.end()) {
    std::cout << "3 was found: " << *A;
  }

  auto B{Set.find(4)};

  if (B == Set.end()) {
    std::cout << "\n4 was not found";
  }
}
3 was found: 3
4 was not found

Modifying Objects

Modifying an object within a hash set is a little more complex than we might expect. This is because an object’s value determines where that object exists within the container, via the hash function.

Modifying an object’s value in place risks leaving the object in the wrong position, thereby compromising the integrity of our container. The container implements some protections using const to prevent us from doing this accidentally:

#include <iostream>
#include <unordered_set>

int main() {
  std::unordered_set Set{1, 2, 3};

  auto It{Set.find(3)};
  (*It)++;
}
'It': you cannot assign to a variable that is const

An obvious workaround would be to copy the value, erase the original, modify the copy, and then insert it.

However, as of C++17, we have a more efficient approach to this, using the extract() method. The extract() method transfers ownership of the node containing the object to the caller. The node is effectively removed from the set at this point:

#include <iostream>
#include <unordered_set>

int main() {
  std::unordered_set Set{1, 2, 3};

  auto Node{Set.extract(3)};

  for (int x : Set) {
    std::cout << x << ", ";
  }
}
1, 2,

We can then access and update the value contained within that node, using the value() method. Once we’re done with our changes, we transfer the node back to the container.

We do this using an overload of the insert() method that accepts an rvalue reference to the node, and ensures it is rehashed and positioned correctly:

#include <iostream>
#include <unordered_set>

int main() {
  std::unordered_set Set{1, 2, 3};

  auto Node{Set.extract(3)};
  Node.value() = 4;
  Set.insert(std::move(Node));

  for (int x : Set) {
    std::cout << x << ", ";
  }
}
1, 2, 4,

Custom Types in Unordered Sets

Unlike with arrays, we can’t just add any type of data to an unordered set. To meet the performance characteristics, our data type needs to have some capabilities:

  • It needs to define a hash function
  • It needs the ability to compare itself to other objects of the same type, using the == operator

Many of the C++ built-in types already have this ability.

However, many types will not, and this includes custom types we create. As a result, the following will not work, because the type we are using as the key - Item doesn’t meet either of the two requirements:

#include <unordered_set>
#include <string>

struct Item {
  std::string Name;
};

int main() {
  // In their current form, Item objects
  // cannot be in a hash set
  std::unordered_set<Item> SetA;

  // References to this type will have the
  // same issue
  std::unordered_set<Item&> SetB;

  // Pointers to any type can be hashed and
  // compared, so this would work
  std::unordered_set<Item*> SetC;
}

There are two ways we can enable this. The first option is to implement std::hash and operator==() for our type. We walked through the implementation of this in our previous lesson:

// User.h
#pragma once
#include <string>

struct User {
  std::string Email;
  bool operator==(const User& Other) const {
    return Email == Other.Email;
  }
};

namespace std {
template <>
struct hash<User> {
  size_t operator()(const User& U) const {
    return std::hash<std::string>{}(U.Email);
  }
};
}

Alternatively, we can provide a hash function and an equality function that our std::unordered_set can use We cover these in the next two sections.

Setting the Hash Function

As we saw above, std::unordered_set uses std::hash as its default hash function. However, when creating our set, we have the option of providing a custom hash function.

This allows us to create sets for storing a type that does not implement std::hash.

We provide the hash function as the second template parameter. Specifically, we need to provide a class that overloads the () operator:

#include <unordered_set>
#include <string>
#include <iostream>

struct User {/*...*/} struct Hasher { size_t operator()(const User& U) const { return std::hash<std::string>{}(U.Name); } }; int main() { std::unordered_set<User, Hasher> Set; Set.emplace("Bob"); if (Set.contains(User{"Bob"})) { std::cout << "Found Bob"; } }
Found Bob

The Hasher syntax above may seem slightly unusual. We’re defining a type that can be used to create function objects - that is, objects that can be "called" using the () operator. These are sometimes referred to as functors, and we cover them in more detail in a dedicated lesson later in the course.

Setting the Equality Function

By default, when the std::unordered_set container needs to determine if two objects are equal, it does that using the == operator. We can provide a different function for this by providing a functor type as the third template parameter.

This allows us to store types that do not overload the == operator:

#include <unordered_set>
#include <string>
#include <iostream>

struct User {
  std::string Name;
};

struct Hasher {/*...*/} struct Comparer { bool operator()( const User& A, const User& B) const { return A.Name == B.Name; } }; int main() { std::unordered_set<User, Hasher, Comparer> Set; Set.emplace("Bob"); if (Set.contains(User{"Bob"})) { std::cout << "Found Bob"; } }
Found Bob

If objects are equal, they should share the same hash

By definition, a hash function should return the same hash for the same object.

In other words, for our hash container to work correctly, if our comparison function claims two objects are equal, our hash function should also return the same hash for those two objects.

The following set is unlikely to work correctly, as our equality function is comparing objects by their absolute value, such that -1 and 1 are considered equal.

However, the hash function we’re using (the default std::hash in this case) is likely to return different hashes for -1 and 1:

#include <unordered_set>
#include <iostream>

struct Comparer {
  bool operator()(int a, int b) const {
    return std::abs(a) == std::abs(b);
  }
};

int main() {
  std::unordered_set<
    int, std::hash<int>, Comparer
  > Set{1, 2, 3};

  if (Set.contains(1)) {
    std::cout << "The set contains 1";
  }

  if (!Set.contains(-1)) {
    std::cout << " but not std::abs(-1)?";
  }
}
The set contains 1 but not std::abs(-1)?

We could fix this previous example by providing a hash function that is based on the absolute value, too:

#include <unordered_set>
#include <string>
#include <iostream>

struct Hasher {
  size_t operator()(int x) const {
    return std::hash<int>{}(std::abs(x));
  }
};

struct Comparer {/*...*/} int main() { std::unordered_set<int, Hasher, Comparer> Set{1, 2, 3}; if (Set.contains(-1)) { std::cout << "Set contains std::abs(-1)"; } }
Set contains std::abs(-1)

Collisions and Buckets

The C++ specification doesn’t state what collision resolution strategy should be used for std::unordered_set.

However, it does prescribe an API that is much easier to implement using secondary chaining, so this is the approach that compilers tend to take.

The specification refers to these secondary chains as buckets.

We can determine how many buckets our set is using with the bucket_count() method:

#include <iostream>
#include <unordered_set>

int main() {
  std::unordered_set Nums{1, 2, 3};

  std::cout << "Bucket Count: "
    << Nums.bucket_count();
}
Bucket Count: 8

Behind the scenes, each bucket is just an index of the underlying array. At each index, we have the secondary chain - often a linked list - of objects that hash to that index.

If we insert more items into the unordered set, it will eventually decide to rehash. That is, resizing the underlying array to provide more buckets:

#include <iostream>
#include <unordered_set>

int main() {
  std::unordered_set Nums{1, 2, 3, 4, 5};

  std::cout << "Bucket Count: "
    << Nums.bucket_count();

  Nums.emplace(6);
  Nums.emplace(7);
  Nums.emplace(8);
  Nums.emplace(9);

  std::cout << "\nBucket Count: "
    << Nums.bucket_count();
}
Bucket Count: 8
Bucket Count: 64

For a std::unordered_set specifically, rehashing will happen once the container exceeds its maximum load factor.

Load Factor

The load factor of a hash-based container is equivalent to the number of elements it contains, divided by the number of buckets it uses.

This is made available through the load_factor() method.

In the following example, our unordered set is using 8 buckets to store a combined 4 objects, meaning its load factor is 0.5:

#include <iostream>
#include <unordered_set>

int main() {
  std::unordered_set Nums{1, 2, 3, 4};

  std::cout << "Buckets: "
    << Nums.bucket_count();

  std::cout << "\nSize: "
    << Nums.size();

  std::cout << "\nLoad Factor: "
    << Nums.load_factor();
}
Buckets: 8
Size: 4
Load Factor: 0.5

The load factor of a hash container is the main heuristic we use to balance memory usage vs collisions:

  • A low load factor means our container has a lot of unused space. This reduces collisions but increases memory usage
  • A high load factor means our container doesn’t have a lot of unused space. This is optimal for reducing memory consumption but means collisions are more frequent, degrading performance.

Setting the Maximum Load Factor

The std::unordered_set container lets us specify the maximum load factor using the max_load_factor() method.

Any insertion that would cause our set’s load factor to go over this value will cause a rehash:

#include <iostream>
#include <unordered_set>

int main() {
  std::unordered_set Nums{1, 2, 3, 4};
  std::cout << "Bucket Count: "
    << Nums.bucket_count()
    << "\nLoad Factor: "
    << Nums.load_factor();

  Nums.max_load_factor(0.25);
  Nums.insert(5);
  std::cout << "\n\nBucket Count: "
    << Nums.bucket_count()
    << "\nLoad Factor: "
    << Nums.load_factor();
}
Bucket Count: 8
Load Factor: 0.5

Bucket Count: 64
Load Factor: 0.078125

Getting the Currently Configured Max Load Factor

A variation of max_load_factor() that takes no arguments is available. This overload currently configured maximum load factor.

By default, this is set to 1.0 for a std::unordered_set:

#include <iostream>
#include <unordered_set>

int main() {
  std::unordered_set Nums{1, 2, 3, 4};
  std::cout << "Max Load Factor: "
    << Nums.max_load_factor();
}
Max Load Factor: 1

Rehashing and Reserving

If we have an approximate idea of how many objects our set is likely going to need to store, it can be helpful to allocate enough space up-front.

This can reduce the amount of rehashing that needs to occur during the lifecycle of our container.

The rehash() Function

The rehash() method lets us manually rehash the container. We pass the number of buckets we want as an argument, and our container will then be rehashed to use at least that many buckets:

#include <iostream>
#include <unordered_set>

int main() {
  std::unordered_set<int> Nums;
  std::cout << "Bucket Count: "
    << Nums.bucket_count();

  Nums.rehash(500);

  std::cout << "\nBucket Count: "
    << Nums.bucket_count();
}
Bucket Count: 8
Bucket Count: 512

The reserve() Method

The reserve() method is very similar to rehash(), except the argument we provide is the number of items we expect our container to have.

The key difference is that the number of buckets reserve() will rehash to consider our max_load_factor().

Below, we reserve space for 500 items. Given our max load factor is 0.25, we therefore allocate space for at least 2,000 buckets:

#include <iostream>
#include <unordered_set>

int main() {
  std::unordered_set<int> Nums;
  Nums.max_load_factor(0.25);
  Nums.reserve(500);
  std::cout << "Bucket Count: "
    << Nums.bucket_count();
}
Bucket Count: 2048

Summary

In this lesson, we have explored std::unordered_set - the C++ standard library’s implementation of a hash set. We covered everything from initialization and iteration to customization for complex types.

Key Learnings

  • The basics of creating and initializing std::unordered_set with various methods, including class template argument deduction (CTAD).
  • Different ways to iterate over std::unordered_set, emphasizing its unordered nature.
  • Techniques for inserting objects using emplace(), insert(), and C++23's insert_range() methods.
  • Methods to determine the size() of a set, and whether the set is empty().
  • Removing elements from the set using erase() and clearing the entire set with clear().
  • Searching for objects using methods like contains(), count(), insert() return value, erase() return value, and find().
  • The complexity and approach for modifying objects in a set, particularly using the extract() method.
  • Requirements and methods for using custom types in std::unordered_set, including the implementation of hash functions and equality operators.
  • Setting custom hash functions and equality comparators for std::unordered_set.
  • Understanding the internal mechanism of collision handling and buckets in std::unordered_set.
  • Managing performance with concepts like load factor, and methods such as rehash() and reserve() for optimizing space and reducing collisions.

Was this lesson useful?

Next Lesson

Using std::pair

Master the use of std::pair with this comprehensive guide, encompassing everything from simple pair manipulation to template-based applications
3D character concept art
Ryan McCombe
Ryan McCombe
Updated
A computer programmer
This lesson is part of the course:

Professional C++

Comprehensive course covering advanced concepts, and how to use them on large-scale projects.

Free, Unlimited Access
Hash Sets and Hash Maps
Next Lesson

Using std::pair

Master the use of std::pair with this comprehensive guide, encompassing everything from simple pair manipulation to template-based applications
3D character concept art
Contact|Privacy Policy|Terms of Use
Copyright © 2024 - All Rights Reserved