This lesson introduces the concept of data structures beyond arrays, and why we may want to use alternatives.

Updated

In the previous lesson, we introduced the concept of algorithm complexity, and "Big O" notation for categorizing different algorithms.

One of the example problems we used was creating an algorithm to determine if an array, such as a `std::vector`

, contained any duplicate objects.

We saw we could solve it using the following algorithm, which has a quadratic run time, that is $O(n^2)$:

```
bool HasDuplicate(std::vector<int>& Input) {
for (size_t i{0}; i < Input.size(); ++i) {
for (size_t j{0}; j < Input.size(); ++j) {
if (Input[i] == Input[j] && i != j) {
return true;
}
}
}
return false;
}
```

In this lesson, we’ll introduce some techniques for writing better algorithms. A large component of this is creating * data structures* that are appropriate for the problems we’re trying to solve.

So far, the main data structure we’ve been using has been the * array* - either through

`std::array`

or `std::vector`

. Arrays store their objects in large, contiguous blocks of memory.Each object is adjacent to its neighbors - that is, the object at `MyArray[3]`

is right next to the object at `MyArray[4]`

.

This approach gives arrays some advantages:

- Iterating over every object in the array is very easy
- We can jump to any object by index, using the
`[]`

operator

But it also gives them some limitations:

- Inserting a new object at the start of the array is expensive as, to maintain the contiguous structure, every other object in the array needs to move to make space - an $O(n)$ process
- We can’t easily check if an array contains a specific value. As we saw above, our only approach is to iterate over the array and check every object in turn - an $O(n)$ process

Naturally, arrays are not the only data structures. There are other options we can use, and those options can be optimized for different operations.

The ability of arrays to provide fast, constant-time access to any element using the `[]`

operator is directly linked to their contiguous memory storage.

Classes that implement arrays, such as `std::vector`

and `std::array`

, understand three key aspects:

**Starting Memory Address:**The location in memory where their collection begins.**Element Size:**The size of each element in the array, which is based on the type of object the array stores.**Contiguous Storage:**The fact that all elements are stored next to each other in memory.

These aspects enable the use of * pointer arithmetic* to quickly calculate the memory address of any element in the array.

We’ll implement pointer arithmetic from scratch later in the course. For now, let's consider a simplified example to illustrate how the concept works:

- Suppose we have a
`std::vector<int>`

named`MyArray`

. - Assume
`int`

occupies 4 bytes of memory. - Imagine that
`MyArray`

starts at a memory address, say 200.

When we request an element at a specific index, like `MyArray[100]`

, the calculation to find its address is straightforward.

Since we know each `int`

is 4 bytes and the elements are stored contiguously, the 100th element (at index 99, as arrays are 0-indexed).

The memory address of `MyArray[100]`

is calculated as $200+(4*100) = 600$

This calculation doesn’t doesn’t increase in complexity as our array gets larger. This allows for constant-time access, denoted as $O(1)$, to any element.

An alternative way we can store a collection is in a data structure called a * hash set*, or sometimes simply

Unlike an array, sets do not allow us to iterate over them. Rather, they are designed to optimize inserting and searching for objects, both of which can be done in constant time - $O(1)$

The standard library’s implementation of a hash set is `std::unordered_set`

, which we cover in a dedicated lesson later.

For now, we can note that our earlier algorithm can be improved from $O(n^2)$ to $O(n)$ by incorporating a hash set. Rather than using a nested loop to find if our array contains a specific number, we can instead keep track of the numbers we’ve already seen by adding them to a hash set.

Then, for each number in our array, we can check if it’s one we’ve already seen, by checking if our hash set contains it:

```
#include <vector>
#include <unordered_set>
bool HasDuplicate(std::vector<int>& Input) {
std::unordered_set<int> PreviousNumbers;
for (int i : Input) {
if (PreviousNumbers.contains(i)) {
return true;
} else {
PreviousNumbers.emplace(i);
}
}
return false;
}
```

There are many solutions to any problem. For example, another characteristic of a hash set is that it cannot contain duplicates.

So, alternative $O(n)$ solutions to this example problem involve inserting the entire collection into a hash set, and then inspecting its size. If our array contained duplicates, the hash set it created will be smaller:

```
bool HasDuplicate(std::vector<int>& Input) {
std::unordered_set<int> Set;
Set.insert_range(Input);
return Set.size() < Input.size();
}
```

Unlike arrays, linked lists can store their elements anywhere in memory - they are not restricted to using a contiguous block:

Again, this different approach creates different performance characteristics. Compared to arrays, linked lists lose the ability to jump to arbitrary objects using the `[]`

operator.

We maintain the ability to iterate through a linked list - we just follow the links (pointers) as they guide us through the collection.

However, their less restrictive memory layout of linked lists means adding and removing objects tends to be faster.

When we remove or add elements to a linked list, none of the other elements need to be moved. Instead, we just need to update the link from the previous element to point to our new addition.

Additionally, unlike arrays, linked lists don’t run out of space - they don’t need to move to a completely different area of memory once they get too big.

Due to the non-contiguous memory allocation of linked lists, we can't use pointer arithmetic to jump directly to an element.

To access the 100th element, we must traverse the list from the beginning, following the links (pointers) between each element.

This traversal is a slower, linear process with a time complexity of $O(n)$. In this case, $n$ is the "index" of the element we want to reach, as that determines how many links we will need to jump through.

It is important to understand the difference between linked lists and arrays, as well as their relative trade-offs.

But, what is true in theory is not necessarily true in reality. There is a lot of nuance in how CPUs and memory work at a hardware level that is not considered in the pure, mathematical analysis.

Typically, these characteristics mean the tightly packed memory layout of arrays is often more performant, even in scenarios where the theory would suggest linked lists would have the edge.

Because of this, arrays should generally be the default choice in most cases. If we’re in doubt, and it’s a scenario where performance is critical, we can test both approaches and go with what works best for our use case.

We cover how to measure performance in more detail later in the course.

We’ll introduce the standard library’s implementation of a linked list - called `std::forward_list`

- later in this chapter.

However, it’s worth quickly reflecting on how we might implement the key mechanisms of a linked list ourselves. It’s a good review of some key topics, and tasks like implementing and manipulating linked lists from scratch are very commonly asked in technical interviews.

With a linked list for storing integers, for example, we are not just storing an integer in each position. Instead, we need to store both the integer, as well as the link to the next item in the collection.

This container, which has both the item we’re storing, as well as a pointer to the next item, is often called a * node*. It could look something like this:

```
struct Node {
int Value;
Node* Next;
};
```

Now, we can create a linked list of three integers using this type. Our linked list can be represented simply by the first node it contains. The end of the linked list is a node with no `Next`

pointer:

```
Node Third { 30, nullptr }; // End
Node Second { 20, &Third };
Node First { 10, &Second }; // Start
```

We could use templating here, so our Nodes can store any type of data, not just integers:

```
template<typename T>
struct Node {
T Value;
Node<T>* Next;
};
```

```
Node<int> Third { 30, nullptr };
Node<int> Second { 20, &Third };
Node<int> First { 10, &Second };
```

To iterate through a linked list, we would follow the pointer within each node, until we encounter the end of the list.

We know we’ve reached the end when the current node’s `Next`

pointer is a `nullptr`

.

Below, our `Iterate`

function uses a * recursive* approach, where our

`Iterate`

function calls itself.```
#include <iostream>
struct Node {/*...*/}
void Iterate(auto* Node) {
if (Node) {
std::cout << Node->Value << ", ";
Iterate(Node->Next);
}
}
int main() {
Node<int> Third{30, nullptr};
Node<int> Second{20, &Third};
Node<int> First{10, &Second};
Iterate(&First);
}
```

If our `Node`

is defined, we log its value and call `Iterate()`

passing a pointer to the next node in the list.

Eventually, `Iterate`

will be called with a `nullptr`

, causing the `if`

condition to be `false`

, and the recursion to stop. As such, our previous program iterates through every node in our linked list:

`10, 20, 30,`

Recursion can be quite difficult to understand, so don’t worry if the previous example doesn’t make sense. We cover recursion in more detail later in the course.

To add a new node to our linked list, we need to know the position in which to insert it. Below, this is implemented by the caller to `Insert`

providing a pointer to the node that will come before our new node.

We then perform two actions:

- We create our new node such that it copies the
`Next`

pointer from the previous node - We update the previous node’s
`Next`

pointer to point to our new node

```
#include <iostream>
struct Node {/*...*/}
void Iterate(auto* Node) {/*...*/}
template <typename T>
void Insert(Node<T>* Previous, T Value) {
auto NewNode = new Node<T>{
Value, Previous->Next};// 1
Previous->Next = NewNode;// 2
}
int main() {
Node<int> Third{30, nullptr};
Node<int> Second{20, &Third};
Node<int> First{10, &Second};
// Insert a new node after the First node
// The new node's value should be 15
Insert(&First, 15);
Iterate(&First);
}
```

Combined, these two actions ensure our new node is inserted into the correct position, and that the integrity of our list is maintained. Our `Iterate`

call shows our updated list:

`10, 15, 20, 30,`

From here, we can continue to build out our linked list functionality as needed. We might want to add ways to remove nodes, reset the list, and more.

If we were making a linked list implementation that we were going to use, we’d naturally want to encapsulate all this functionality into a custom type, making it easy and safe for other developers to use.

But, the standard library has already done this for us, in the form of `std::forward_list`

. We’ll introduce this later in the chapter.

In this lesson, we explored the concept of data structures, introducing a few different examples from the C++ standard library. By using the most appropriate data structures, or combining data structures, we can create efficient solutions to the specific problems we’re trying to solve.

- Different data structures have different performance characteristics - each has its advantages and disadvantages.
- Arrays store elements in contiguous memory blocks, offering easy iteration and fast access by index, but have limitations in dynamic insertion and element checking.
- Hash sets, particularly
, optimize for quick insertion and searching, enabling operations like checking for duplicates more efficiently.`std::unordered_set`

- Linked lists, with their non-contiguous memory allocation, allow for flexible insertion and deletion of elements without reallocating the entire structure.
- The concept of pointer arithmetic is crucial for understanding how arrays provide fast access to elements, contrasting with the sequential access method in linked lists.
- Real-world performance considerations can sometimes differ from theoretical predictions, emphasizing the importance of choosing the right data structure based on practical testing.

Was this lesson useful?

Updated

Lesson Contents### Data Structures and Algorithms

This lesson introduces the concept of data structures beyond arrays, and why we may want to use alternatives.

This lesson is part of the course:### Professional C++

Comprehensive course covering advanced concepts, and how to use them on large-scale projects.