Regular Expressions in C++

An introduction to regular expressions, and how to use them in C++ with std::regex, std::regex_match, and std::regex_search
This lesson is part of the course:

Professional C++

Comprehensive course covering advanced concepts, and how to use them on large-scale projects.

Ryan McCombe

Regular expressions, often referred to as "regex" or "regexp", are powerful tools that allow for the search and manipulation of text. They are used to detect if a string meets specific requirements, to extract information from text, or to change text in specific ways.

Some examples where they’re useful include:

  • Validating user input - for example, ensuring a string looks like an email address, phone number, or another type of data we were expecting
  • Syntax highlighting - your code editor, or the code snippets displayed in our lessons, use regular expressions to change the color and formatting of characters to make our code more readable
  • Data extraction - for example, if we want to create a system that reads through text and extracts anything that looks like a date
  • Redaction - for example, automatically removing contents from a document that could be sensitive

In this lesson, we’ll introduce the syntax for creating a regular expression, and show how to use it to determine if our C++ string matches the given pattern. In the next lesson, we’ll expand this to cover how we can use a regex pattern to extract and replace parts of our content in a targeted way.

Regular expressions can be quite challenging to understand at first. Using them to look for complex patterns or solve larger tasks can be tough. However, working with text is ubiquitous in programming, and regular expressions are the tool we reach for when our task gets a little more complex.

Additionally, regular expression syntax follows broadly similar conventions across most programming languages, not just C++. As such, once we’re familiar with it, much of what we learn carries over to whatever language we’re working in.

Standalone Regex Tools

When we’re writing more complex regular expressions, it’s common to craft them with the help of a standalone tool. Those tools will explain what our expression is doing, and allow us to quickly test it against a range of inputs. Many free web-based tools are available, such as RegExr.

Once we’ve created our expression in such a tool, we can copy and paste it into our code. This tends to be much faster than trying to create it in our code editor from scratch.

Using Regular Expressions in C++

Regular expression functionality in the C++ library is available by including the <regex> header:

#include <regex>

Regular expressions, sometimes called patterns, are themselves strings. The standard way to create a regex pattern in C++ is through the std::basic_regex class template. For patterns built from ordinary char strings, we use its std::regex alias:

std::regex Pattern{"hello world"};

The two most common functions we have for running regular expressions are std::regex_match and std::regex_search.

In their most basic usage, they accept two arguments - the string we want to test, and the regex we want to use:

std::string Input{"hello world"};
std::regex Pattern{"hello"};

bool MatchResult{
    std::regex_match(Input, Pattern)};

bool SearchResult{
    std::regex_search(Input, Pattern)};
  • std::regex_match returns true if the entire input string matches the pattern
  • std::regex_search returns true if at least one substring in the input matches the pattern

Below, we run the regex pattern hello on the string hello world. The regex_match call will return false because the pattern doesn’t match the entire string. The regex_search call will return true because a substring within the input that matches the pattern was found:

#include <iostream>
#include <regex>
#include <string>

int main() {
  std::string Input{"hello world"};
  std::regex Pattern{"hello"};

  bool MatchResult{
      std::regex_match(Input, Pattern)};

  bool SearchResult{
      std::regex_search(Input, Pattern)};

  std::cout << "The regex_match pattern "
            << (MatchResult ? "did" : "did NOT")
            << " match";

  std::cout << "\nThe regex_search pattern "
            << (SearchResult ? "did"
                             : "did NOT")
            << " match";
}
The regex_match pattern did NOT match
The regex_search pattern did match

Case-Insensitive Regex Patterns

When creating our pattern, the std::regex constructor accepts a second argument. This is where we can pass flags that modify the behavior of our expression. Most of these flags are for advanced use cases that we won’t cover here.

However, there is one exception: std::regex::icase flags our expression as being case insensitive:

std::regex Pattern{"hello", std::regex::icase};

Below, we do a search for hello. Our string doesn’t contain hello, but it does have Hello:

#include <iostream>
#include <regex>
#include <string>

int main() {
  std::string Input{"Hello world"};
  std::regex SensitivePattern{"hello"};
  std::regex InsensitivePattern{
      "hello", std::regex::icase};

  bool SensitiveResult{std::regex_search(
      Input, SensitivePattern)};

  bool InsensitiveResult{std::regex_search(
      Input, InsensitivePattern)};

  std::cout << "The sensitive pattern "
            << (SensitiveResult ? "did"
                                : "did NOT")
            << " match";

  std::cout << "\nThe insensitive pattern "
            << (InsensitiveResult ? "did"
                                  : "did NOT")
            << " match";
}
The sensitive pattern did NOT match
The insensitive pattern did match

Raw String Literals

In the next section, we’ll begin to see more complicated regular expressions that include a lot of special characters. This particularly includes the backslash character, \, which needs to be escaped in standard C++ string literals.

These additional escape characters can make regular expressions even more difficult to follow. So, for more complex expressions, it is recommended to construct our patterns from raw string literals.

Whilst string literals begin and end with ", raw string literals begin with R"( and end with )":

// String Literal
std::regex PatternA{"hello"};

// Raw String Literal
std::regex PatternB{R"(hello)"};

Special Characters

With regular expressions, we’re not just restricted to simple character matching. We have the option to add more complex syntax into our pattern, to create more elaborate behavior.

Regex Order of Operations

Within the same pattern, we can combine as many of these special characters as we want. Like other programming expressions, the way in which they combine is subject to an order of operations that is not entirely intuitive.

Few people learn the order of operations. Instead, we just use the tools talked about previously to see what works and what doesn’t.

In the next lesson, we’ll learn how to manipulate the order of operations within regex by using groups.

Below, we cover the most common special characters:

Wildcard: .

The period character . acts as a wildcard, matching any single character (other than a line break):

✔️ c.t -> cat
✔️ c.t -> cut
❌ c.t -> colt

✔️ c..t -> colt

Alternation: |

The vertical bar character, | allows the input to match either the left or the right pattern:

✔️ cat|hat -> cat
✔️ cat|hat -> hat
❌ cat|hat -> tat

✔️ cat|h.t -> hit

Optional: ?

The question mark character, ? flags the previous symbol as being optional:

✔️ can?t -> cant
✔️ can?t -> cat

✔️ ca.?t -> cart
✔️ ca.?t -> cat

Start of input: ^

The caret symbol ^ denotes the start of the input, allowing us to match patterns only at the beginning of our string:

✔️ ^cat -> cat
❌ ^cat -> the cat

✔️ ^cat|the -> the

✔️ ^.at -> mat
❌ ^.at -> the mat

End of input: $

Conversely, the dollar symbol $ denotes the end of the input, which allows us to restrict our search just to the end of the string:

✔️ cat$ -> cat
❌ cat$ -> the cat sat

✔️ cat|mat$ -> the cat
✔️ cat|mat$ -> sat on the mat

✔️ ca.$ -> the can

Earlier, we noted the C++ standard library has the regex_match function, which matches the entire input, whilst regex_search looks for substrings. Many programming languages don’t draw a distinction here - they just offer the equivalent of regex_search.

However, using the ^ and $ symbols together allows us to create a pattern that only matches against the entire input anyway:

✔️ ^cat$ -> cat
❌ ^cat$ -> cats
❌ ^cat$ -> the cat

✔️ ^.at$ -> cat
✔️ ^.at$ -> mat
❌ ^.at$ -> mate

Escaping Special Characters

Often, we want to use one of the special characters with its literal meaning within our patterns. For example, we may want to check for a literal period within our input, but . denotes a wildcard, so adding it to our regex will match any character.

We can escape characters using the backslash symbol: \

// Unescaped . is a wild card
✔️ sat on the mat. -> sat on the mat.
✔️ sat on the mat. -> sat on the mat and
✔️ sat on the mat. -> sat on the mate

// Escape a special character using \
✔️ sat on the mat\. -> sat on the mat.
❌ sat on the mat\. -> sat on the mate
❌ sat on the mat\. -> sat on the mat

// Unescaped ? is the optional symbol
✔️ hello? -> hello
✔️ hello? -> hell

// Escape it using \
❌ hello\? -> hello
❌ hello\? -> hell
✔️ hello\? -> hello?

// Unescaped $ denotes end of input
❌ $3.50 -> $3.50

// Unescaped . is a wild card
✔️ \$3.50 -> $3.50
✔️ \$3.50 -> $3 50

// Escaping both special characters
❌ \$3\.50 -> $3 50
✔️ \$3\.50 -> $3.50

When we want to search for a literal \ in our input, the escape character itself can be escaped with an additional \:

❌ yes\no -> yes\no
✔️ yes\\no -> yes\no

// To search for literal \\ we escape both
❌ \\user\files -> \\user\files
✔️ \\\\user\\files -> \\user\files

Character Sets / Character Classes

When we want to search for one of a range of possible characters, we can introduce a character set, sometimes also called a character class. We do this by wrapping our characters in [ and ]. The following searches for bat, cat, mat or rat:

std::regex Pattern{R"([bcmr]at)"};

The order of characters within the set doesn’t matter.

Character sets interact with surrounding special characters as expected. For example, we can check if our input starts with something in a character set using the start-of-input symbol ^, or make the entire set optional using the optional symbol ?:

✔️ ^[Tt]he -> the cat
✔️ ^[Tt]he -> The cat
❌ ^[Tt]he -> Not the cat

✔️ [cbm]?at -> at
✔️ [cbm]?at -> cat
✔️ [cbm]?at -> bat
✔️ [cbm]?at -> mat

Within the [ and ] boundary of a character set, the special characters ., ?, $ and | revert back to their literal values. The same is true of ^, unless it is the first character in the set. For example, the period symbol . matches only a literal . in the input:

✔️ cat[.] -> cat.
❌ cat[.] -> cats

Character Set Ranges

We can specify numeric or alphabetic ranges within our character sets using -. For example, [a-e] will match any of a, b, c, d, or e:

✔️ [a-h]am -> cam
✔️ [a-h]am -> ham
❌ [a-h]am -> ram

✔️ [a-z]am -> ram
❌ [a-z]am -> 9am

❌ [0-9]am -> ram
✔️ [0-9]am -> 9am

✔️ [0-9a-z]am -> 9am
✔️ [0-9a-z]am -> ram

// Hexadecimal value
✔️ [A-F0-9][A-F0-9] -> FF
✔️ [A-F0-9][A-F0-9] -> E4
❌ [A-F0-9][A-F0-9] -> G4

Some character sets have shortcut symbols we can use instead:

  • \d for any numeric digit, equivalent to [0-9]
  • \w for any alphabetic character, digit or underscore, equivalent to [a-zA-Z0-9_]
  • \s for any white space (can be a space character, a line break character, a tab character, and so on)
// \d can be any digit
✔️ \d -> 9
❌ \d -> m
✔️ \dam -> 9am
❌ \dam -> dam
❌ \dam -> ram

// Any two digits
✔️ \d\d -> 10
❌ \d\d -> 1

// Making the second digit optional
✔️ \d\d? -> 10
✔️ \d\d? -> 1

// \w can be any letter, digit or underscore
✔️ \w -> m
✔️ \w -> 9
✔️ \wam -> ram
✔️ \w\wam -> roam
✔️ help\w -> helps
❌ help\w -> help!

// \s can be any whitespace
✔️ the\sfox -> the fox

// The \n here is a line break
✔️ the\sfox -> the\nfox

// Any whitespace followed by any letter
✔️ the\s\wat -> the cat
✔️ the\s\wat -> the mat
❌ the\s\wat -> themat

// Making whitespace optional
✔️ the\s?\wat -> themat

// Combining \d \s and \w
❌ \d\d\s\wats -> 1 cat
❌ \d\d\s\wats -> 5 cats
✔️ \d\d\s\wats -> 05 cats
✔️ \d\d\s\wats -> 24 rats
✔️ \d\d\s\wats -> 24\nrats
// Substring match on 00 cats
✔️ \d\d\s\wats -> 100 cats
// Anchored to the start of the input
❌ ^\d\d\s\wats -> 100 cats
❌ \d\d\s\wats -> four cats

// The second \d and final s are optional
✔️ \d\d?\s\wats? -> 1 cat
✔️ \d\d?\s\wats? -> 5 cats
✔️ \d\d?\s\wats? -> 05 cats
✔️ \d\d?\s\wats? -> 24 rats
✔️ \d\d?\s\wats? -> 24\nrats
// Substring match on 00 cats
✔️ \d\d?\s\wats? -> 100 cats
// Anchored to the start of the input
❌ ^\d\d?\s\wats? -> 100 cats
❌ \d\d?\s\wats? -> four cats

Escaping Character Sets

Character sets, and their shortcuts, can be escaped in the usual way, with \:

❌ \[hello\] -> h
✔️ \[hello\] -> [hello]

❌ \\w -> a
✔️ \\w -> \w

❌ \\d -> 5
✔️ \\d -> \d

❌ \\s -> the cat
✔️ \\s -> \s

Negating Character Sets

By including the caret symbol ^ at the beginning of our character set, we can negate it. This allows us to ensure a set of characters is not included in our input at that position. A negated set still needs to match a character, so it won’t match if there is nothing at that position.

❌ [^cbm]at -> at
❌ [^cbm]at -> cat
❌ [^cbm]at -> bat
❌ [^cbm]at -> mat
✔️ [^cbm]at -> rat

✔️ [^c-m]at -> rat

✔️ [^\d]at -> cat
❌ [^\d]at -> 5at
✔️ [^\d]at -> 5 at

// [^\w] must match a character, so something has to follow cat
❌ cat[^\w] -> cat
✔️ cat[^\w] -> cat.
✔️ cat[^\w] -> the cat sat
❌ cat[^\w] -> caterpillar
❌ cat[^\w] -> vacate
✔️ cat[^\w] -> copycat.

❌ [^\w]cat[^\w] -> copycat.

Repetition

We can look for repeating patterns within our input. We do that by adding syntax directly after the symbol or character set we want to look for repetitions of. We have several options:

Zero or More: *

The * character states there can be any number of the preceding symbol or character set. That can include zero:

✔️ ab*c -> ac
✔️ ab*c -> abc
✔️ ab*c -> abbc
✔️ ab*c -> abbbbbc

✔️ a.*c -> ac
✔️ a.*c -> abc
✔️ a.*c -> a123c
✔️ a.*c -> a123 abc

✔️ a[bcd]*e -> ae
✔️ a[bcd]*e -> abe
✔️ a[bcd]*e -> ace
✔️ a[bcd]*e -> abcde
✔️ a[bcd]*e -> abcdcdbe
❌ a[bcd]*e -> a1e

One or More: +

The + character specifies we want at least one of the preceding symbol or character set, but there can be more:

❌ ab+c -> ac
✔️ ab+c -> abc
✔️ ab+c -> abbc
✔️ ab+c -> abbbbbc

❌ a.+c -> ac
✔️ a.+c -> abc
✔️ a.+c -> a123c
✔️ a.+c -> a123 abc

❌ a[bcd]+e -> ae
✔️ a[bcd]+e -> abe
✔️ a[bcd]+e -> ace
✔️ a[bcd]+e -> abcde
✔️ a[bcd]+e -> abcdcdbe
❌ a[bcd]+e -> ale

Exactly x repetitions: {x}

The brace syntax allows us to be more specific with how many repetitions we want. We can pass a single number between the braces, to specify we want a specific number of repetitions.

Note, a common mistake here comes when using this syntax alongside a substring search, like std::regex_search. In that context, a pattern like [0-9]{2} which searches for exactly 2 digits will return true on an input like 123. Whilst the entire string of 123 has 3 digits, it has two substrings of 2 digits - 12 and 23.

If we wanted this input to not match, we’d need to be more specific. For example, if we wanted our entire string to be exactly 2 digits, we could use std::regex_match, instead of std::regex_search, or add the start and end of input symbols ^ and $ to our regex.

In the following example, we look for exactly two repetitions:

❌ ab{2}c -> ac
❌ ab{2}c -> abc
✔️ ab{2}c -> abbc
❌ ab{2}c -> abbbc

❌ a.{2}c -> ac
❌ a.{2}c -> abc
✔️ a.{2}c -> a12c
❌ a.{2}c -> a123c

❌ a[bcd]{2}e -> ae
❌ a[bcd]{2}e -> abe
❌ a[bcd]{2}e -> ace
✔️ a[bcd]{2}e -> abce
✔️ a[bcd]{2}e -> acde
❌ a[bcd]{2}e -> abcde

// One byte (256 possible values) in hexadecimal
✔️ [A-F0-9]{2} -> 6F

// hexadecimal color
✔️ [A-F0-9]{6} -> 6F4AFF

At least x repetitions: {x,}

By adding a trailing comma within our braces, we specify that we want at least x repetitions, without an upper limit. Below, we search for at least two repetitions:

❌ ab{2,}c -> ac
❌ ab{2,}c -> abc
✔️ ab{2,}c -> abbc
✔️ ab{2,}c -> abbbc

❌ a.{2,}c -> ac
❌ a.{2,}c -> abc
✔️ a.{2,}c -> a12c
✔️ a.{2,}c -> a123c

❌ a[bcd]{2,}e -> ae
❌ a[bcd]{2,}e -> abe
❌ a[bcd]{2,}e -> ace
✔️ a[bcd]{2,}e -> abce
✔️ a[bcd]{2,}e -> acde
✔️ a[bcd]{2,}e -> abcde

From x to y repetitions: {x,y}

By adding a second number to our braces, we can specify both a lower and upper range for the number of repetitions we are looking for. Below, we search for at least one, but not more than three repetitions of our previous symbol or character set:

// Looking for 1 to 3 repetitions
❌ ab{1,3}c -> ac
✔️ ab{1,3}c -> abc
✔️ ab{1,3}c -> abbc
✔️ ab{1,3}c -> abbbc
❌ ab{1,3}c -> abbbbc

❌ a.{1,3}c -> ac
✔️ a.{1,3}c -> abc
✔️ a.{1,3}c -> a12c
✔️ a.{1,3}c -> a123c
❌ a.{1,3}c -> a1234c

❌ a[bcd]{1,3}e -> ae
✔️ a[bcd]{1,3}e -> abe
✔️ a[bcd]{1,3}e -> ace
✔️ a[bcd]{1,3}e -> abce
✔️ a[bcd]{1,3}e -> acde
✔️ a[bcd]{1,3}e -> abcde
❌ a[bcd]{1,3}e -> abbcde

Repetition with Character Set Shortcuts

The repetition specifiers also work with the character set shortcuts \s, \w, and \d:

❌ the\s+cat -> thecat
✔️ the\s+cat -> the cat
✔️ the\s+cat -> the    cat

✔️ c\w*t -> ct
✔️ c\w*t -> cat
✔️ c\w*t -> coat
// \w also matches digits
✔️ c\w*t -> c5t
❌ c\w*t -> c-t

❌ \d{2} -> 1
✔️ \d{2} -> 10

// Substring match
✔️ \d{2} -> 100

// Full string match
❌ ^\d{2}$ -> 100

❌ \d{2,} -> 1
✔️ \d{2,} -> 10
✔️ \d{2,} -> 100

✔️ \d{1,3} -> 1
✔️ \d{1,3} -> 10
✔️ \d{1,3} -> 100

// Substring match
✔️ \d{1,3} -> 1000

// Full string match
❌ ^\d{1,3}$ -> 1000

In the next lesson, we’ll cover regular expression capture groups. These build on the regex syntax we learned but will allow us to go beyond just checking whether or not the text matches the regular expression pattern we created.

With capture groups, we will learn how to use regex to filter through and extract parts of our text, or to replace segments of it in a targeted way.
