Characters, Unicode and Encoding

An introduction to C++ character types, the Unicode standard, character encoding, and C-style strings
This lesson is part of the course:

Professional C++

Comprehensive course covering advanced concepts, and how to use them on large-scale projects.

Ryan McCombe
Ryan McCombe

We’ve been working with text and strings a lot so far, but we’ll go a lot deeper in this lesson. We’ll gain a much more complete insight into how C++, and programming in general, deals with text.

First, we’ll discuss characters - the building blocks of text. We’ll introduce the complexities of character types, exploring the Unicode standard and different representations used to store characters, and how they impact memory and performance

We’ll also cover C-style strings, how they’re used, their pitfalls, and some useful functions that can help us work with them.


When it comes to text, the most fundamental data type is the individual character. Examples of characters include letters like H and n, numbers like 2 and 0, and punctuation like < and !.

In C++, there are many different types we can use to store characters, and we’ll go over some of the options later in this lesson. For now, the most common built-in type we use for characters is the simple char:

int main() {
  char Letter;

A char literal is created by placing the character between single quotes:

int main() {
  char Letter { 'H' };
  std::cout << Letter << 'i';

C-Style Strings

Note: in most cases, we should not use C-Style strings. More modern approaches such as std::string are typically easier and safer to work with, and they come with more capabilities out of the box. However, C-style strings remain in common use, so a basic knowledge of what they are and how they work is extremely useful.

We can imagine strings as simply being a collection of characters.

A C-style string is an example of a null-terminated string. These work by storing each character in a contiguous block of memory. They then append a special, additional character to the block of memory immediately after the string, which is how they denote where the string ends.

This character is called the null character or null terminator. The null character is an example of a control character or non-printable character. We cover control characters in more detail a little later.

The null character has no visual representation but, in code, diagrams, and writing, it is often denoted as a backslash followed by a zero: \0

Putting everything together, a null-terminated string containing the text “Hello” occupies 6 contiguous bytes in memory:

Diagram illustrating a C-string in memory

With this convention in place, we can then store strings simply as a pointer to the first character - a char*.

Diagram showing a character pointer pointing at a c-string in memory

Anything receiving such a pointer will know that the content of the string continues in memory until the null terminator is encountered.

Typically we should consider these characters to be const, so the data type of a c-style string is a pointer to a const char:

int main() {
  const char* MyString;

A c-style string literal is created by placing the characters between double quotes:

#include <iostream>

int main() {
  const char* MyString{"Hello"};
  std::cout << MyString << " World";
Hello World

Control Characters

Not all characters represent visible symbols. Some characters are intended to have some additional meaning which the receiver should interpret as a special instruction.

These are referred to as control characters. The new line character - \n - which we’ve been using a lot is one such example. The null terminator - \0 - covered above is another, and there are many more.

Isn’t \0 two characters?

Whilst \0 is two characters in our source file, the way we represent a character when programming isn’t necessarily the same as how it is represented in the final compiled machine code.

Control characters have no visual appearance - they’re sometimes referred to as “non-printable characters”. But, when we’re writing code, we still need to be able to insert them where appropriate, and, more importantly, we also want to be able to see where they have been inserted.

Therefore, we need to give them a visual appearance when they’re being used in the source code. \0 and similar syntax is just a standardized way to accomplish that task. Once our code is compiled, syntax such as this is represented in memory in the same way as any other character, and they become invisible to users as intended.

In C++, we insert control characters using a backslash - \

If we want a literal backslash in our string, we need to insert two backslashes: \\

#include <iostream>

int main(){
  // This won't behave how we want
  std::cout << "C:\Windows\notepad.exe";

  std::cout << "\n\n";

  // We need this instead
  std::cout << "C:\\Windows\\notepad.exe";
  std::cout << '\n';
  std::cout << "\\\\host-name\\shared.txt";


Raw Strings

The need to escape backslashes can make some of our strings unnecessarily complex. To solve this problem, we have raw strings instead.

Raw string literals start with R"( and end with ")

#include <iostream>

int main(){
  std::cout << R"(no \n new \n lines)";
  std::cout << "\n\n";
  std::cout << R"(C:\Windows\notepad.exe)";

Raw strings ignore any control characters - anything between the delimiters R"( and ") becomes part of the string.

no \n new \n lines


Raw strings are not a dedicated type - we can store their output in the usual ways, such as a C-style string or std::string:

int main(){
  std::string A {R"(no \n new \n lines)"};
  const char* B {R"(C:\Windows\notepad.exe)"};

Unicode, Encoding, and UTF

The primary char type allocates 1 byte (8 bits) of memory. This is enough to store one of 256 different characters. However, there are more than 256 possible characters - particularly once we consider non-English alphabets.

Unicode is designed to standardize the handling of text across most of the world’s writing systems. The latest version of the standard identifies approximately 150,000 characters. This includes alphabetic characters and punctuation from 160 languages, symbols such as © and ♫, and emojis such as ❤️ and 🎉

The way we represent characters in computer memory (ie, bytes) is referred to as encoding.

There are several ways to encode Unicode characters. The most common international standard is UTF, which has three common variations:

  • UTF-8 encodes the characters as a sequence of up to four blocks of one byte (8 bits)
  • UTF-16 encodes characters as a sequence of up to two blocks of two bytes (16 bits)
  • UTF-32 encodes characters as a single block of four bytes (32 bits)

UTF-8 is most commonly used, as the variable size (from one to four bytes) allows for some performance gains. This is because, whilst we need four bytes to capture all possible Unicode characters, many can be stored in fewer.

This is particularly true of the more commonly used characters. Unicode is designed such that all English / Latin alphabetic characters, punctuation, and numbers can be represented by a single byte.

As such, with UTF-8, a sequence like “Hello” only requires 5 bytes whilst, under UTF-32, it would require 20 bytes (5 characters, each requiring 4 bytes).

An endorsement for UTF-8 that goes deeper into the technical details is available here.

Character Types

As standard, C++ comes with a collection of character types.

  • The basic char, which has 1 byte
  • A “wide char” - wchar_t, which has either 2 or 4 bytes (depending on the compiler)
  • Fixed width characters - char8_t, char16_t, and char32_t specify exactly how many bits they should have, but these types are not widely used or supported.

For larger projects that have localization requirements, we’ll generally be building dedicated infrastructure to support complex text requirements, or using a third-party library to handle it for us. In most other cases, using the basic char type with UTF-8 encoding is a sensible default, unless we have a compelling reason to do otherwise.

We still need to be aware that alternative options exist though, as we may be required to use them. For example, software in China follows a different standard.

The libraries and APIs we’re interacting with can also follow different standards. The Windows API uses wide characters, encoded with UTF-16. If we’re heavily using that API, that may convince us to adopt the same convention or force us to handle conversions at the interface points.

Using Unicode Characters

We can store Unicode characters in our regular char type.

However, because char only stores 8 bits, we will need multiple chars to contain more exotic Unicode characters, such as emojis. So, even though we have a single character, under UTF-8 we may need to store it as a string:

#include <iostream>

int main(){
  // This would be an error - narrowing conversion
  // char Heart{'❤'};

  // Emoji requires multiple bytes, so use a string
  const char* Heart{"❤"};
  std::cout << "Hello World " << Heart;
Hello World ❤

There are some environmental considerations here. Depending on our operating system, there may be additional steps required to get everything working.

For example, on Windows, we may need to update our project properties to set our source and execution character set, following the instructions available here.

Within our code, we may also need to set the console output “code page”. We can do this by including windows.h and then calling SetConsoleOutputCP with the argument CP_UTF8:

#include <windows.h>
#include <iostream>

int main(){
  const char* Heart{"❤"};

  std::cout << "Hello Windows " << Heart;
Hello Windows ❤

Was this lesson useful?

Ryan McCombe
Ryan McCombe
This lesson is part of the course:

Professional C++

Comprehensive course covering advanced concepts, and how to use them on large-scale projects.

This lesson is part of the course:

Professional C++

Comprehensive course covering advanced concepts, and how to use them on large-scale projects.

Free, unlimited access!

This course includes:

  • 106 Lessons
  • 550+ Code Samples
  • 96% Positive Reviews
  • Regularly Updated
  • Help and FAQ
Next Lesson

Working with C-Style Strings

A guide to working with and manipulating C-style strings, using the cstring header file
DreamShaper_v7_hipster_Sidecut_hair_modest_clothes_fully_cloth_3 (1).jpg
Contact|Privacy Policy|Terms of Use
Copyright © 2023 - All Rights Reserved