Regular expressions, often referred to as "regex" or "regexp", are powerful tools that allow for the search and manipulation of text. They are used to detect if a string meets specific requirements, to extract information from text, or to change text in specific ways.

Some examples where they're useful include:

Validating user input - for example, ensuring a string looks like an email address, phone number, or another type of data we were expecting
Syntax highlighting - your code editor, or the code snippets displayed in our lessons, use regular expressions to change the color and formatting of characters to make our code more readable
Data extraction - for example, if we want to create a system that reads through text and extracts anything that looks like a date
Redaction - for example, automatically removing contents from a document that could be sensitive

In this lesson, we'll introduce the syntax for creating a regular expression, and show how to use it to determine if our C++ string matches the given pattern. In the next lesson, we'll expand this to cover how we can use a regex pattern to extract and replace parts of our content in a targeted way.

Regular expressions can be quite challenging to understand at first. Using them to look for complex patterns or solve larger tasks can be tough. However, working with text is ubiquitous in programming, and regular expressions are the tool we reach for when our task gets a little more complex.

Additionally, regular expression syntax is broadly similar across programming languages, not just C++. Dialects differ in some details, but once we're familiar with the core syntax, we can apply it in almost any language we work in.

Using Regular Expressions in C++

Regular expression functionality in the C++ standard library is available by including the <regex> header:

#include <regex>

Regular expressions, sometimes called patterns, are themselves written as strings. The standard way to create a regex pattern in C++ is through the std::regex type, which is an alias for std::basic_regex<char>:

std::regex Pattern{"hello world"};

The two most common functions we have for running regular expressions are std::regex_match() and std::regex_search().

In their most basic usage, they accept two arguments - the string we want to test, and the regex we want to use:

#include <iostream>
#include <regex>
#include <string>

int main() {
  std::string Input{"hello world"};
  std::regex Pattern{"hello"};

  // True only if the entire input matches
  bool MatchResult{
    std::regex_match(Input, Pattern)};

  // True if any part of the input matches
  bool SearchResult{
    std::regex_search(Input, Pattern)};
}

std::regex_match() returns true if the entire input string matches the pattern
std::regex_search() returns true if part of the input string matches the pattern

Below, we run the regex pattern hello on the string hello world. The regex_match() call will return false because the pattern doesn't match the entire string. The regex_search() call will return true because a substring of the input that matches the pattern was found:

#include <iostream>
#include <regex>
#include <string>

int main() {
  std::string Input{"hello world"};
  std::regex Pattern{"hello"};

  bool MatchResult{
      std::regex_match(Input, Pattern)};

  bool SearchResult{
      std::regex_search(Input, Pattern)};

  std::cout << "The regex_match pattern "
            << (MatchResult ? "did" : "did NOT")
            << " match";

  std::cout << "\nThe regex_search pattern "
            << (SearchResult ? "did"
                             : "did NOT")
            << " match";
}

The regex_match pattern did NOT match
The regex_search pattern did match

Case-Insensitive Regex Patterns

When creating our pattern using the std::regex constructor, we can provide a second argument. This is where we can pass flags that modify the behavior of our expression. Most of these flags are for advanced use cases that we won't cover here.

However, there is one that is simple to understand and commonly useful: std::regex::icase flags our expression as being case insensitive. The following pattern searches for "hello", "HELLO", "Hello", "hElLo" and any other variation:

std::regex Pattern{"hello", std::regex::icase};

Below, we search for hello. Our string doesn't contain hello, but it does have Hello:

#include <iostream>
#include <regex>
#include <string>

int main() {
  std::string Input{"Hello world"};
  std::regex SensitivePattern{"hello"};
  std::regex InsensitivePattern{
      "hello", std::regex::icase};

  bool SensitiveResult{std::regex_search(
      Input, SensitivePattern)};

  bool InsensitiveResult{std::regex_search(
      Input, InsensitivePattern)};

  std::cout << "The sensitive pattern "
            << (SensitiveResult ? "did"
                                : "did NOT")
            << " match";

  std::cout << "\nThe insensitive pattern "
            << (InsensitiveResult ? "did"
                                  : "did NOT")
            << " match";
}

The sensitive pattern did NOT match
The insensitive pattern did match

Raw String Literals

In the next section, we'll begin to see more complicated regular expressions that include a lot of special characters. This particularly includes the backslash character, \, which needs to be escaped in standard C++ string literals. For example, a regex containing \d has to be written as "\\d" in a standard literal.

These additional escape characters can make regular expressions even more difficult to follow. So, for more complex expressions, it is recommended to construct our patterns from raw string literals.

Whilst string literals begin and end with ", raw string literals begin with R"( and end with )":

// String Literal
std::regex PatternA{"hello"};

// Raw String Literal
std::regex PatternB{R"(hello)"};

Special Characters

With regular expressions, we're not just restricted to simple character matching. We have the option to add more complex syntax into our pattern, to create more elaborate behavior.

Below, we cover the most common special characters:

Wildcard: `.`

The period character . acts as a wildcard, matching any single character (in the default grammar, anything except a line break). For example, the regex pattern c.t will match any three-character sequence that starts with "c" and ends with "t":

✔ c.t -> cat
✔ c.t -> cut
✔ c.t -> cot

It only matches a single character, so a pattern like c.t will not match a string like cart:

❌ c.t -> cart
❌ c.t -> colt
❌ c.t -> carrot

But we can use multiple . tokens:

✔ c..t -> cart
✔ c..t -> colt
❌ c..t -> carrot
✔ c....t -> carrot

We'll see better ways of matching multiple characters later in this lesson.

Alternation: `|`

The vertical bar character, |, allows the input to match either the pattern on its left or the pattern on its right:

✔ cat|hat -> cat
✔ cat|hat -> hat
❌ cat|hat -> tat

✔ cat|h.t -> hit

Optional: `?`

The question mark character, ?, flags the previous symbol as being optional:

✔ can?t -> cant
✔ can?t -> cat

✔ ca.?t -> cart
✔ ca.?t -> cat

Start of input: `^`

The caret symbol ^ denotes the start of the input, allowing us to match patterns only at the beginning of our string:

✔ ^cat -> cat
❌ ^cat -> the cat

✔ ^cat|the -> the

✔ ^.at -> mat
❌ ^.at -> the mat

End of input: `$`

Conversely, the dollar symbol $ denotes the end of the input, which allows us to restrict our search just to the end of the string:

✔ cat$ -> cat
❌ cat$ -> the cat sat

✔ cat|mat$ -> the cat
✔ cat|mat$ -> sat on the mat

✔ ca.$ -> the can

Earlier, we noted the C++ standard library has the regex_match() function, which matches the entire input, whilst regex_search() looks for substrings. Many programming languages don't draw a distinction here - they just offer the equivalent of regex_search().

However, using both the ^ and $ symbols lets us write a pattern that only matches the entire input, even when it is run as a substring search:

✔ ^cat$ -> cat
❌ ^cat$ -> cats
❌ ^cat$ -> the cat

✔ ^.at$ -> cat
✔ ^.at$ -> mat
❌ ^.at$ -> mate

Escaping Special Characters

Often, we want to use one of the special characters with its literal meaning within our patterns. For example, imagine we want to check for a literal period within our input. However, . denotes a wildcard, so adding it to our regex will match any character.

We can escape characters using the backslash symbol: \. So, if we wanted to check for a literal ., our regex would use \.

Unescaped . is a wildcard
✔ sat on the mat. -> sat on the mat.
✔ sat on the mat. -> sat on the mat and
✔ sat on the mat. -> sat on the mate

Escape a special character using \
✔ sat on the mat\. -> sat on the mat.
❌ sat on the mat\. -> sat on the mate
❌ sat on the mat\. -> sat on the mat

Unescaped ? is the optional symbol
✔ hello? -> hello
✔ hello? -> hell

Escape it using \
❌ hello\? -> hello
❌ hello\? -> hell
✔ hello\? -> hello?

Unescaped $ denotes end of input
❌ $3.50 -> $3.50

Unescaped . is a wildcard
✔ \$3.50 -> $3.50
✔ \$3.50 -> $3-50

Escaping both special characters
✔ \$3\.50 -> $3.50
❌ \$3\.50 -> $3-50

When we want to search for a literal \ in our input, the escape character itself can be escaped with an additional \:

❌ yes\no -> yes\no
✔ yes\\no -> yes\no

To match an input that is exactly \\user\files, we escape every backslash
❌ ^\\user\\files$ -> \\user\files
✔ ^\\\\user\\files$ -> \\user\files

Character Sets / Character Classes

When we want to search for one of a range of possible characters, we can introduce a character set, sometimes also called a character class. We do this by wrapping our characters in [ and ]. The following searches for bat, cat, mat, or rat:

std::regex Pattern{R"([bcmr]at)"};

The order of characters within the set doesn't matter.

Character sets interact with surrounding special characters as expected. For example, we can check if our input starts with something in a character set using the start-of-input symbol ^, or make the entire set optional using the optional symbol ?:

✔ ^[Tt]he -> the cat
✔ ^[Tt]he -> The cat
❌ ^[Tt]he -> Not the cat

✔ [cbm]?at -> at
✔ [cbm]?at -> cat
✔ [cbm]?at -> bat
✔ [cbm]?at -> mat

Within the [ and ] boundary of a character set, the special characters ., ?, $, and | revert to their literal values. The caret ^ is also literal unless it is the first character in the set, where it has a different meaning that we cover later in this lesson. For example, the period symbol . matches only a literal . in the input:

✔ cat[.] -> cat.
❌ cat[.] -> cats

Character Set Ranges

We can specify numeric or alphabetic ranges within our character sets using -. For example, [a-e] will match any of a, b, c, d, or e:

✔ [a-h]am -> cam
✔ [a-h]am -> ham
❌ [a-h]am -> ram

✔ [a-z]am -> ram
❌ [a-z]am -> 9am

❌ [0-9]am -> ram
✔ [0-9]am -> 9am

✔ [0-9a-z]am -> 9am
✔ [0-9a-z]am -> ram

Two hexadecimal digits
✔ [A-F0-9][A-F0-9] -> FF
✔ [A-F0-9][A-F0-9] -> E4
❌ [A-F0-9][A-F0-9] -> G4

Some character sets have shortcut symbols we can use instead:

\d for any numeric digit, equivalent to [0-9]
\w for any alphabetic character, digit, or underscore, equivalent to [a-zA-Z0-9_]
\s for any white space (can be a space character, a line break character, a tab character, and so on)

\d can be any digit
✔ \d -> 9
❌ \d -> m
✔ \dam -> 9am
❌ \dam -> dam
❌ \dam -> ram

Any two digits
✔ \d\d -> 10
❌ \d\d -> 1

Making the second digit optional
✔ \d\d? -> 10
✔ \d\d? -> 1

\w can be any letter, digit, or underscore
✔ \w -> m
✔ \w -> 9
✔ \wam -> ram
✔ \w\wam -> roam
✔ help\w -> helps
❌ help\w -> help!

\s can be any whitespace
✔ the\sfox -> the fox

The \n here is a line break
✔ the\sfox -> the\nfox

Any whitespace followed by any letter
✔ the\s\wat -> the cat
✔ the\s\wat -> the mat
❌ the\s\wat -> themat

Making whitespace optional
✔ the\s?\wat -> themat

Combining \d \s and \w
❌ \d\d\s\wats -> 1 cat
❌ \d\d\s\wats -> 5 cats
✔ \d\d\s\wats -> 05 cats
✔ \d\d\s\wats -> 24 rats
✔ \d\d\s\wats -> 24\nrats
❌ \d\d\s\wats -> four cats

Full string match
❌ ^\d\d\s\wats$ -> 100 cats

Making the second \d and the final s optional
✔ \d\d?\s\wats? -> 1 cat
✔ \d\d?\s\wats? -> 5 cats
✔ \d\d?\s\wats? -> 05 cats
✔ \d\d?\s\wats? -> 24 rats
✔ \d\d?\s\wats? -> 24\nrats
❌ \d\d?\s\wats? -> four cats

Full string match
❌ ^\d\d?\s\wats?$ -> 100 cats

Escaping Character Sets

Character sets, and their shortcuts, can be escaped in the usual way, with \. For example, if we wanted our regex to search for the [ character, we'd escape it as \[.

Searching for the literal [hello]
❌ \[hello\] -> h
✔ \[hello\] -> [hello]

Searching for literal \w
❌ \\w -> a
✔ \\w -> \w

Searching for literal \d
❌ \\d -> 5
✔ \\d -> \d

Searching for literal \s
❌ \\s -> the cat
✔ \\s -> \s

Negating Character Sets

By including the caret symbol ^ at the beginning of our character set, we can negate it. This allows us to ensure a set of characters is not included in our input at that position. Note that a negated set still has to match one character, so there must be something at that position.

Searching for "at" preceded by a character that isn't c, b or m
❌ [^cbm]at -> at
❌ [^cbm]at -> cat
❌ [^cbm]at -> bat
❌ [^cbm]at -> mat
✔ [^cbm]at -> rat

Searching for "at" preceded by a character outside c-m
✔ [^c-m]at -> rat

Searching for "at" preceded by a non-digit
✔ [^\d]at -> cat
❌ [^\d]at -> 5at
✔ [^\d]at -> 5 at

Searching for "cat" followed by a character that isn't a letter, digit, or underscore
✔ cat[^\w] -> cat.
❌ cat[^\w] -> cat
✔ cat[^\w] -> the cat sat
❌ cat[^\w] -> caterpillar
❌ cat[^\w] -> vacate
✔ cat[^\w] -> copycat.

Searching for "cat" with such a character on both sides
❌ [^\w]cat[^\w] -> copycat.

Repetition

We can look for repeating patterns within our input. We do that by adding syntax directly after the symbol or character set we want to look for repetitions of. We have several options:

Zero or More: `*`

The * character states there can be any number of the preceding symbol or character set. That can include zero:

✔ ab*c -> ac
✔ ab*c -> abc
✔ ab*c -> abbc
✔ ab*c -> abbbbbc

✔ a.*c -> ac
✔ a.*c -> abc
✔ a.*c -> a123c
✔ a.*c -> a123 abc

✔ a[bcd]*e -> ae
✔ a[bcd]*e -> abe
✔ a[bcd]*e -> ace
✔ a[bcd]*e -> abcde
✔ a[bcd]*e -> abcdcdbe
❌ a[bcd]*e -> a1e

One or More: `+`

The + character specifies we want at least one of the preceding symbol or character set, but there can be more:

❌ ab+c -> ac
✔ ab+c -> abc
✔ ab+c -> abbc
✔ ab+c -> abbbbbc

❌ a.+c -> ac
✔ a.+c -> abc
✔ a.+c -> a123c
✔ a.+c -> a123 abc

❌ a[bcd]+e -> ae
✔ a[bcd]+e -> abe
✔ a[bcd]+e -> ace
✔ a[bcd]+e -> abcde
✔ a[bcd]+e -> abcdcdbe
❌ a[bcd]+e -> ale

Exactly `x` repetitions: `{x}`

The brace syntax allows us to be more specific with how many repetitions we want. We can pass a single number between the braces, to specify we want a specific number of repetitions.

In the following example, we look for exactly two repetitions:

❌ ab{2}c -> ac
❌ ab{2}c -> abc
✔ ab{2}c -> abbc
❌ ab{2}c -> abbbc

❌ a.{2}c -> ac
❌ a.{2}c -> abc
✔ a.{2}c -> a12c
❌ a.{2}c -> a123c

❌ a[bcd]{2}e -> ae
❌ a[bcd]{2}e -> abe
❌ a[bcd]{2}e -> ace
✔ a[bcd]{2}e -> abce
✔ a[bcd]{2}e -> acde
❌ a[bcd]{2}e -> abcde

One byte (8 bits) in hexadecimal
✔ [A-F0-9]{2} -> 6F

Hexadecimal color
✔ [A-F0-9]{6} -> 6F4AFF

Note that a common mistake comes when using this syntax alongside a substring search, like std::regex_search(). In that context, a pattern like [0-9]{2}, which searches for exactly 2 digits, will return true on an input like 123. Whilst the entire string 123 has 3 digits, it contains two substrings of 2 digits: 12 and 23.

If we wanted this input to not match, we'd need to be more specific. For example, if we wanted our entire string to be exactly 2 digits, we could use std::regex_match(), instead of std::regex_search(), or add the start and end of input symbols ^ and $ to our regex.

At least `x` repetitions: `{x,}`

By adding a trailing comma within our braces, we specify that we want at least x repetitions, without an upper limit. Below, we search for at least two repetitions:

❌ ab{2,}c -> ac
❌ ab{2,}c -> abc
✔ ab{2,}c -> abbc
✔ ab{2,}c -> abbbc

❌ a.{2,}c -> ac
❌ a.{2,}c -> abc
✔ a.{2,}c -> a12c
✔ a.{2,}c -> a123c

❌ a[bcd]{2,}e -> ae
❌ a[bcd]{2,}e -> abe
❌ a[bcd]{2,}e -> ace
✔ a[bcd]{2,}e -> abce
✔ a[bcd]{2,}e -> acde
✔ a[bcd]{2,}e -> abcde

From `x` to `y` repetitions: `{x,y}`

By adding a second number to our braces, we can specify both a lower and upper range for the number of repetitions we are looking for. Below, we search for at least one, but not more than three repetitions of our previous symbol or character set:

Looking for 1 to 3 repetitions
❌ ab{1,3}c -> ac
✔ ab{1,3}c -> abc
✔ ab{1,3}c -> abbc
✔ ab{1,3}c -> abbbc
❌ ab{1,3}c -> abbbbc

❌ a.{1,3}c -> ac
✔ a.{1,3}c -> abc
✔ a.{1,3}c -> a12c
✔ a.{1,3}c -> a123c
❌ a.{1,3}c -> a1234c

❌ a[bcd]{1,3}e -> ae
✔ a[bcd]{1,3}e -> abe
✔ a[bcd]{1,3}e -> ace
✔ a[bcd]{1,3}e -> abce
✔ a[bcd]{1,3}e -> acde
✔ a[bcd]{1,3}e -> abcde
❌ a[bcd]{1,3}e -> abbcde

Repetition with Character Set Shortcuts

The repetition specifiers also work with the character set shortcuts \s, \w, and \d:

❌ the\s+cat -> thecat
✔ the\s+cat -> the cat
✔ the\s+cat -> the    cat

✔ c\w*t -> ct
✔ c\w*t -> cat
✔ c\w*t -> coat
✔ c\w*t -> c5t
❌ c\w*t -> c-t

❌ \d{2} -> 1
✔ \d{2} -> 10

Substring match
✔ \d{2} -> 100

Full string match
❌ ^\d{2}$ -> 100

❌ \d{2,} -> 1
✔ \d{2,} -> 10
✔ \d{2,} -> 100

✔ \d{1,3} -> 1
✔ \d{1,3} -> 10
✔ \d{1,3} -> 100

Substring match
✔ \d{1,3} -> 1000

Full string match
❌ ^\d{1,3}$ -> 1000

In the next lesson, we'll cover regular expression capture groups. These build on the regex syntax we learned but will allow us to go beyond just checking whether or not the text matches the regular expression pattern we created.

With capture groups, we will learn how to use regex to filter through and extract parts of our text, or to replace segments of it in a targeted way.

Summary

In this lesson, we've explored the fundamentals of using regular expressions in C++, covering how to create patterns, search and match strings, and apply modifiers for case-insensitivity.

Main Points Learned:

Introduction to regular expressions and their applications.
Usage of the <regex> header and std::regex for creating regular expression patterns in C++.
The difference between std::regex_match() and std::regex_search(), and examples of their basic usage.
How to make regex patterns case-insensitive using std::regex::icase.
Utilizing raw string literals for creating complex regex patterns without excessive escaping.
Understanding and applying special characters in regex, such as the wildcard (.), alternation (|), and optional (?) characters.
Using anchors (^ for start of input, $ for end of input) to specify where a pattern should match.
Escaping special characters in regex patterns to use their literal meaning.
Defining and using character sets and ranges within regex patterns to match specific groups of characters.
Employing repetition specifiers (*, +, {x}, {x,}, and {x,y}) to match repeating patterns.
The significance of escaping character sets and shortcuts (\d, \w, \s) in regex, and how to negate character sets with ^.
