An introduction to two of the highest-performing data structures, and the secret to how they work

We covered arrays and linked lists in some detail in our lessons on containers. These two data structures give us a simple way to group our data together - we can add items to the collection, remove them, and iterate over them easily. In the case of arrays, we can also quickly jump to any item in the collection, using its index.

However, when we need to quickly and easily find specific items within our collection, these data structures are less than ideal. For example, let's imagine we want our users to select a user name, and we want to be able to check if their name is already taken.

To do this with an array or linked list, we would have to check every single user name that has ever been registered before we can confirm that our user has chosen something new. This algorithm scales linearly with the size of our collection. The more items we have, the longer it takes.

It is possible to create a data structure that allows us to find an item in *constant time*. That means, no matter how big our collection is, we can retrieve an item just as fast. The most basic data structure that has this property is often called a **set**.

For example, let's imagine we have a collection of numbers, and we need to quickly find if the number `42` is in the collection.

A basic first implementation of this is to use an array, and ensure the index and value are always the same. For example, if we want to add `42` to our array, we store it at index `42`.

With this rule established, we now have constant time lookups. If we know the value we're searching for, we know the index it will be located in. And, since getting an array value by index is a constant time operation, we can now determine if a value is in our array in constant time. If we want to know if the *value* `42` is in our collection, we know there's only one place we need to check: *index* `42`.
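This direct-addressing idea can be sketched in a few lines of C++. The `MaxValue` capacity here is an assumption - this basic approach only works when we know the largest value we'll ever store:

```
#include <vector>

// A sketch of a direct-address set: the value IS the index.
class DirectAddressSet {
public:
  // The array must be big enough to cover every possible value
  DirectAddressSet(int MaxValue) : Slots(MaxValue + 1, false) {}

  void Insert(int Value) { Slots[Value] = true; }

  // Constant time: the only place Value could be is index Value
  bool Contains(int Value) const { return Slots[Value]; }

private:
  std::vector<bool> Slots;
};
```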

However, our performance gains have come at a serious cost in terms of memory usage. To use this strategy, our array needs to be big enough to hold the largest possible number we ever expect to store in our collection.

If we need to store strings or more complex data types, our array would need to be impossibly large to accommodate the range of possible values.

We can improve things with a hash function.

A hash function is something that maps a value of any size, to a value of fixed size. When it comes to numbers, the modulus operator is an example of this:

```
int GetHash(int Number) {
  return Number % 10;
}
```

Now, regardless of what the input to this function is, its output will be one of 10 possibilities (a number from `0` to `9`).

We can use this technique in our data structure. We can determine the index at which to store and retrieve our values by passing the value into this hash function. So, if we wanted to store `42` in our data structure, we would store it at index `2`, because our hash function returns `2` when given an input of `42`.

If we wanted to store `3,624,929`, we would store it at index `9`. To store such a number before, our array would have needed a length in the millions. Using this specific hash function, it only needs a size of 10.

The examples here are using numbers for simplicity, but the same technique works for any other type of data. Regardless of what our data is, under the hood, it's all just binary - sequences of 0s and 1s.

Regardless of the type of data we're trying to store, our hash functions can generate numbers from that data simply by interpreting the underlying binary sequence as if it were a number.
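As a sketch of this idea, we could hash a string by folding its bytes into a running total, then apply the same modulus trick to map the result to one of 10 slots. The multiplier of 31 is a common convention, not a requirement:

```
#include <cstdint>
#include <string>

// Hash arbitrary data by treating its underlying bytes as numbers
int GetHash(const std::string& Data) {
  uint64_t Total = 0;
  for (unsigned char Byte : Data) {
    Total = Total * 31 + Byte;  // fold each byte into the total
  }
  return static_cast<int>(Total % 10);  // map to one of 10 slots
}
```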

Of course, there is now a problem here. Our hash function dictates that the number `42` is stored at index `2`, but it would also try to store `12`, `62`, `102` and countless other entries in that same position.

When we want to store two different values in our data structure, but our hash function returns the same index for those values, this is commonly called a collision. Our data structure needs a way to handle this. The most common strategies are these:

The first is linear probing. Using this strategy, our data structure would simply store the value in the next available place. If we're inserting `42` and our hash function returns index `2` for that value, we will check index `2` of our array:

- If index `2` is empty, we insert our value `42` there.
- If index `2` already contains a different value, we check index `3`, then `4`, and we continue until we find an empty slot to put `42` in.

When searching for a value, we follow the same pattern. If we're looking for `42`, our hash function will return `2`, so we check index `2`:

- If index `2` is empty, `42` isn't in our collection.
- If index `2` contains `42`, we found it.
- If index `2` contains a different value, we then linearly search index `3`, then `4`, and we continue until we either find `42` or we find an empty slot. If we hit an empty slot, that means `42` isn't in our collection.
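A minimal sketch of a linear-probing set, assuming a fixed table of 10 slots to match the earlier hash function (a real implementation would grow the table as it fills):

```
#include <optional>
#include <vector>

class LinearProbingSet {
public:
  LinearProbingSet() : Slots(10) {}

  int GetHash(int Value) const { return Value % 10; }

  bool Insert(int Value) {
    for (int i = 0; i < 10; ++i) {
      int Index = (GetHash(Value) + i) % 10;  // wrap around the array
      if (!Slots[Index] || *Slots[Index] == Value) {
        Slots[Index] = Value;
        return true;
      }
    }
    return false;  // every slot is occupied
  }

  bool Contains(int Value) const {
    for (int i = 0; i < 10; ++i) {
      int Index = (GetHash(Value) + i) % 10;
      if (!Slots[Index]) return false;  // empty slot: value isn't here
      if (*Slots[Index] == Value) return true;
    }
    return false;
  }

private:
  std::vector<std::optional<int>> Slots;  // empty optional = empty slot
};
```

Inserting `42` and then `12` demonstrates the probing: both hash to `2`, so `42` lands at index `2` and `12` is pushed along to index `3`.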

Linear probing works, but it creates large chains of occupied slots within our data structure. This type of clustering can create inconsistent performance in our data structure.

If the hash function returns an index near the start of such a long chain of occupied slots, it can take many lookups to find an empty slot. Additionally, when an empty slot is finally found, if we insert something into it, we are lengthening that chain, further exacerbating the problem.

An alternative to linear probing is quadratic probing. Here, when two different values return the same hash, the probing sequence looks for empty spaces much further away from the initial position. This prevents the "primary clustering" effect of linear probing, and causes fewer unbroken chains of occupied slots.

However, it can still create long lookup chains for values that hash to the exact same index. This is sometimes called "secondary clustering", and still causes slow lookups and retrievals if values are frequently hashed to the same index.

Double hashing resolves the clustering problem entirely by introducing a second hash function, specifically to deal with collisions. It's incredibly unlikely that two different values generate the same hash from two different hash functions.

The second hash function can still generate an index that is occupied in the underlying array, so it's common for this function to take an additional parameter. If the slot the secondary hash function selects is occupied, that parameter can be incremented, and the function called again to find another.
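As a sketch, the probe sequence for double hashing might look like the following. Both hash functions here are illustrative assumptions - the important property is that `Hash2` never returns `0`, so incrementing `Attempt` always moves us to a new slot:

```
// Primary hash: picks the starting index
int Hash1(int Value) { return Value % 10; }

// Secondary hash: picks the step size (always between 1 and 7)
int Hash2(int Value) { return 7 - (Value % 7); }

// The Attempt parameter is incremented each time the chosen slot
// turns out to be occupied
int ProbeIndex(int Value, int Attempt) {
  return (Hash1(Value) + Attempt * Hash2(Value)) % 10;
}
```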

Separate chaining - sometimes called secondary chaining - is a different approach entirely to dealing with collisions. Instead of each index in the array containing a unique value, it instead contains a linked list of values that share that hash. If a value hashes to the same index as an existing entry, it simply gets added to the linked list at that index.

Separate chaining has an additional advantage - it allows us to store duplicate values. If this is a requirement, separate chaining is the best approach. A set that allows duplicates to be stored is sometimes called a "multiset".

The disadvantage here is that any time we find ourselves searching through a linked list for a specific value, we're losing the desirable constant-time performance characteristics of a set.

If our data structure is getting so full that collisions are happening much too frequently, rebuilding our array is an option. To do this, we would create a new, larger array. We would also update our hash function to make full use of this available space.

Then, we would use this new hash function to recalculate the index of all of our existing elements, and then move them across to their new position in the larger array.

This is a resource-intensive process, but it may be worth biting the bullet if it makes all future insertion and retrieval operations faster.
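A sketch of the rebuilding step, doubling the array and recalculating each index with the new size (collision handling is omitted here for brevity):

```
#include <optional>
#include <vector>

std::vector<std::optional<int>> Rehash(
    const std::vector<std::optional<int>>& Old) {
  // Create a new, larger array
  std::vector<std::optional<int>> New(Old.size() * 2);

  // The "updated hash function" is simply modulo the new size
  for (const auto& Slot : Old) {
    if (Slot) {
      New[static_cast<size_t>(*Slot) % New.size()] = *Slot;
    }
  }
  return New;
}
```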

Let's imagine we have a collection of email addresses, and our hash function converts those email addresses to indexes in an array.

So far, this is no different to what we've seen in earlier examples. However, a key realisation has some serious benefits: we can store additional data at each index. This could include the user's name, telephone number and avatar, for example.


Now, our data structure can do more than quickly determine if a user exists - it can quickly return a load of information about the user, too.

This is the concept that underlies the *hash table* data structure. It has many names in different programming languages, including hash maps, maps, associative arrays and dictionaries.

Two common terms used in this pattern are *key* and *value*. In the above example, the key is the email address, and the value is the name, telephone number and avatar.

In other words, the *key* is the part of the data that determines where the data will be stored, and the thing we use to later retrieve the data. The *value* is everything else.
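C++'s built-in hash table, `std::unordered_map`, lets us sketch this key/value idea directly. The `User` fields and the email address here are illustrative assumptions:

```
#include <string>
#include <unordered_map>

// The value: everything else we know about the user
struct User {
  std::string Name;
  std::string Telephone;
};

// The key (email address) determines where the value is stored,
// and is what we use to retrieve it later
using UserTable = std::unordered_map<std::string, User>;

User* FindUser(UserTable& Users, const std::string& Email) {
  auto It = Users.find(Email);  // hash the key, jump to its slot
  return It == Users.end() ? nullptr : &It->second;
}
```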
