Characters, Unicode and Encoding

An introduction to C++ character types, the Unicode standard, character encoding, and C-style strings

This lesson is part of the course:

Professional C++

Comprehensive course covering advanced concepts, and how to use them on large-scale projects.

Free, Unlimited Access

Ryan McCombe

Updated 3 months ago

We’ve been working with text and strings a lot so far, but we’ll go deeper in this lesson. We’ll gain a much more complete insight into how C++, and programming in general, deals with text.

First, we’ll discuss characters - the building blocks of text. We’ll introduce the complexities of character types, exploring the Unicode standard and different representations used to store characters, and how they impact memory and performance

We’ll also cover C-style strings, how they’re used, their pitfalls, and some useful functions that can help us work with them.

Characters

When it comes to text, the most fundamental data type is the individual character. Examples of characters include letters like H and n, numbers like 2 and 0, and punctuation like < and !.

In C++, there are many different types we can use to store characters, and we’ll go over some of the options later in this lesson. For now, the most common built-in type we use for characters is the simple char:

1int main() {
2  char Letter;
3}

A char literal is created by placing the character between single quotes:

1int main() {
2  char Letter { 'H' };
3  std::cout << Letter << 'i';
4}

1Hi

C-Style Strings

Note: in most cases, we should not use C-Style strings. More modern approaches such as std::string are typically easier and safer to work with, and they come with more capabilities out of the box. However, C-style strings remain in common use, so a basic knowledge of what they are and how they work is extremely useful.

We can imagine strings as simply being a collection of characters.

A C-style string is an example of a null-terminated string. These work by storing each character in a contiguous block of memory, like an array. Rather than keeping track of the size of the array, they instead use a special character to denote where the string ends.

This character is stored in the block of memory immediately after the string, so as our code advances through the contiguous block of memory, it will eventually find this character, and know that the string has ended.

This character is called the null character or null terminator. The null character is an example of a control character or non-printable character. We cover control characters in more detail a little later.

The null character has no visual representation but, in code, diagrams, and writing, it is often denoted as a backslash followed by a zero: \0

Putting everything together, a null-terminated string containing the text "Hello" occupies 6 contiguous bytes in memory:

Diagram illustrating a C-string in memory

With this convention in place, we can then store strings simply as a pointer to the first character - a char*.

Diagram showing a character pointer pointing at a c-string in memory

Anything receiving such a pointer will know that the content of the string continues in memory until the null terminator is encountered.

Typically we should consider these characters to be const, so the data type of a c-style string is a pointer to a const char:

1int main() {
2  const char* MyString;
3}

A c-style string literal is created by placing the characters between double quotes:

1#include <iostream>
2
3int main() {
4  const char* MyString{"Hello"};
5  std::cout << MyString << " World";
6}

1Hello World

Control Characters

Not all characters represent visible symbols. Some characters are intended to have some additional meaning which the receiver should interpret as a special instruction.

These are referred to as control characters. The new line character - \n - which we’ve been using a lot is one such example. The null terminator - \0 - covered above is another, and there are many more.

Isn’t `\0` two characters?

Whilst \0 is two characters in our source file, the way we represent a character when programming isn’t necessarily the same as how it is represented in the final compiled machine code.

Control characters have no visual appearance - they’re sometimes referred to as "non-printable characters". But, when we’re writing code, we still need to be able to insert them where appropriate, and, more importantly, we also want to be able to see where they have been inserted.

Therefore, we need to give them a visual appearance when they’re being used in the source code. \0 and similar syntax is just a standardized way to accomplish that task. Once our code is compiled, syntax such as this is represented in memory in the same way as any other character, and they become invisible to users as intended.

In C++, we insert control characters using a backslash - \

If we want a literal backslash in our string, we need to insert two backslashes: \\

1#include <iostream>
2
3int main(){
4  // This won't behave how we want
5  std::cout << "C:\Windows\notepad.exe";
6
7  std::cout << "\n\n";
8
9  // We need this instead
10  std::cout << "C:\\Windows\\notepad.exe";
11  std::cout << '\n';
12
13  // To insert \\ we need to escape both
14  std::cout << "\\\\host-name\\shared.txt";
15}

1C:Windows
2otepad.exe
3
4C:\Windows\notepad.exe
5\\host-name\shared.txt

Raw Strings

The need to escape backslashes can make some of our strings unnecessarily complex. To solve this problem, we have raw strings instead.

Raw string literals start with R"( and end with ")

1#include <iostream>
2
3int main(){
4  std::cout << R"(no \n new \n lines \n here)";
5  std::cout << \n\n";
6  std::cout << R"(C:\Windows\notepad.exe)";
7}

Raw strings ignore any control characters - anything between the delimiters R"( and ") becomes part of the string.

1no \n new \n lines \n here
2
3C:\Windows\notepad.exe

Raw strings are not a dedicated type - we can store their output in the usual ways, such as a C-style string or std::string:

1int main(){
2  std::string A {R"(no \n new \n lines)"};
3  const char* B {R"(C:\Windows\notepad.exe)"};
4}

Unicode, Encoding, and UTF

The primary char type allocates 1 byte (8 bits) of memory. This is enough to store one of 256 different characters. However, there are more than 256 possible characters - particularly once we consider non-English alphabets.

Unicode is designed to standardize the handling of text across most of the world’s writing systems. The latest version of the standard identifies approximately 150,000 characters. This includes alphabetic characters and punctuation from 160 languages, symbols such as © and ♫, and emojis such as ❤️ and 🎉

The way we represent characters in computer memory (ie, bytes) is referred to as encoding.

There are several ways to encode Unicode characters. The most common international standard is UTF, which has three common variations:

UTF-8 encodes the characters as a sequence of up to four blocks, each block having one byte (8 bits)
UTF-16 encodes characters as a sequence of up to two blocks, each block having two bytes (16 bits)
UTF-32 encodes characters as a single block of four bytes (32 bits)

UTF-8 is most commonly used, as the variable size (from one to four bytes) allows for some performance gains. This is because, whilst we need four bytes to capture all possible Unicode characters, many can be stored in fewer.

This is particularly true of the more commonly used characters. Unicode is designed such that all English / Latin alphabetic characters, punctuation, and numbers can be represented by a single byte.

As such, with UTF-8, a sequence like "Hello" only requires 5 bytes whilst, under UTF-32, it would require 20 bytes (5 characters, each requiring 4 bytes).

An endorsement for UTF-8 that goes deeper into the technical details is available here.

Character Types

As standard, C++ comes with a collection of character types.

The basic char, which has 1 byte
A "wide char" - wchar_t, which has either 2 or 4 bytes (depending on the compiler)
Fixed width characters - char8_t, char16_t, and char32_t specify exactly how many bits they should have, but these types are not widely used or supported.

For larger projects that have localization requirements, we’ll generally be building dedicated infrastructure to support complex text requirements, or using a third-party library to handle it for us. In most other cases, using the basic char type with UTF-8 encoding is a sensible default, unless we have a compelling reason to do otherwise.

We still need to be aware that alternative options exist though, as we may be required to use them. For example, software in China follows a different standard.

The libraries and APIs we’re interacting with can also follow different standards. The Windows API uses wide characters, encoded with UTF-16. If we’re heavily using that API, that may convince us to adopt the same convention or force us to handle conversions at the interface points.

Using Unicode Characters

We can store Unicode characters in our regular char type.

However, because char only stores 8 bits, we will need multiple chars to contain more exotic Unicode characters, such as emojis. So, even though we have a single character, under UTF-8 we may need to store it as a string:

1#include <iostream>
2
3int main(){
4  // This would be an error - narrowing conversion
5  // char Heart{'❤'};
6
7  // Emoji requires multiple bytes, so use a string
8  const char* Heart{"❤"};
9  std::cout << "Hello World " << Heart;
10}

1Hello World ❤

There are some environmental considerations here. Depending on our operating system, there may be additional steps required to get everything working.

For example, on Windows, we may need to update our project properties to set our source and execution character set, following the instructions available here.

Within our code, we may also need to set the console output "code page". We can do this by including windows.h and then calling SetConsoleOutputCP with the argument CP_UTF8:

1#include <windows.h>
2#include <iostream>
3
4int main(){
5  SetConsoleOutputCP(CP_UTF8);
6  const char* Heart{"❤"};
7
8  std::cout << "Hello Windows " << Heart;
9}

1Hello Windows ❤

Summary

In this lesson, we explored handling characters, strings, and encoding in C++, including the use of C-style strings, Unicode, and different character types.

Main Points Learned

Characters are the fundamental building blocks of text, with char being the most commonly used data type for their representation.
C-style strings are null-terminated arrays of characters, offering a traditional method for handling text.
Control characters, such as \n for new-line and \0 for null terminator, play special roles in text processing.
Unicode standardizes text representation across different languages and symbols, requiring various encoding standards like UTF-8, UTF-16, and UTF-32 to accommodate a wide range of characters.
C++ provides several character types, including char, wchar_t, char16_t, char32_t, and char8_t (since C++20), to support different encoding needs and language characters.
Handling Unicode characters may involve multiple char values or the use of wider character types, especially for characters beyond the basic ASCII range.
Properly displaying and manipulating Unicode characters can require specific considerations, such as setting the correct encoding in your environment or using platform-specific APIs.

Was this lesson useful?

Next Lesson

Working with C-Style Strings

A guide to working with and manipulating C-style strings, using the <cstring> library

Ryan McCombe

Updated 3 months ago

Lesson Contents