std::regex, std::regex_match, and std::regex_search
Regular expressions, often referred to as "regex" or "regexp", are powerful tools that allow for the search and manipulation of text. They are used to detect if a string meets specific requirements, to extract information from text, or to change text in specific ways.
Some examples where they’re useful include:
In this lesson, we’ll introduce the syntax for creating a regular expression, and show how to use it to determine if our C++ string matches the given pattern. In the next lesson, we’ll expand this to cover how we can use a regex pattern to extract and replace parts of our content in a targeted way.
Regular expressions can be quite challenging to understand at first. Using them to look for complex patterns or solve larger tasks can be tough. However, working with text is ubiquitous in programming, and regular expressions are the tool we reach for when our task gets a little more complex.
Additionally, regular expression syntax adheres to broadly similar standards across all programming languages, not just C++. As such, once we’re familiar with them, we can use them regardless of what programming language we’re working in.
When we’re writing more complex regular expressions, it’s common to craft them with the help of a standalone tool. Those tools will explain what our expression is doing, and allow us to quickly test it against a range of inputs. Many free web-based tools are available, such as RegExr.
Once we’ve created our expression in such a tool, we can copy and paste it into our code. This tends to be much faster than trying to create it in our code editor from scratch.
Regular expression functionality in the C++ library is available by including the <regex>
 header:
#include <regex>
Regular expressions, sometimes called patterns, are themselves strings. The standard way to create a regex pattern in C++ is through the std::basic_regex class template. For patterns made of char, the library provides the alias std::regex:
std::regex Pattern{"hello world"};
The two most common functions we have for running regular expressions are std::regex_match
and std::regex_search
.
In their most basic usage, they accept two arguments - the string we want to test, and the regex we want to use:
std::string Input{"hello world"};
std::regex Pattern{"hello"};
bool MatchResult{
std::regex_match(Input, Pattern)};
bool SearchResult{
std::regex_search(Input, Pattern)};
std::regex_match returns true if the entire input string matches the pattern.
std::regex_search returns true if at least one substring in the input matches the pattern.
Below, we run the regex pattern hello on the string hello world. The regex_match call will return false because the pattern doesn’t match the entire string. The regex_search call will return true because a substring matching the pattern was found within the input:
#include <iostream>
#include <regex>
#include <string>
int main() {
std::string Input{"hello world"};
std::regex Pattern{"hello"};
bool MatchResult{
std::regex_match(Input, Pattern)};
bool SearchResult{
std::regex_search(Input, Pattern)};
std::cout << "The regex_match pattern "
<< (MatchResult ? "did" : "did NOT")
<< " match";
std::cout << "\nThe regex_search pattern "
<< (SearchResult ? "did"
: "did NOT")
<< " match";
}
The regex_match pattern did NOT match
The regex_search pattern did match
When creating our pattern, the std::regex constructor accepts an optional second argument. This is where we can pass flags that modify the behavior of our expression. Most of these flags are for advanced use cases that we won’t cover here.
However, there is one exception: std::regex::icase flags our expression as being case insensitive:
std::regex Pattern{"hello", std::regex::icase};
Below, we do a search for hello
. Our string doesn’t contain hello
, but it does have Hello
:
#include <iostream>
#include <regex>
#include <string>
int main() {
std::string Input{"Hello world"};
std::regex SensitivePattern{"hello"};
std::regex InsensitivePattern{
"hello", std::regex::icase};
bool SensitiveResult{std::regex_search(
Input, SensitivePattern)};
bool InsensitiveResult{std::regex_search(
Input, InsensitivePattern)};
std::cout << "The sensitive pattern "
<< (SensitiveResult ? "did"
: "did NOT")
<< " match";
std::cout << "\nThe insensitive pattern "
<< (InsensitiveResult ? "did"
: "did NOT")
<< " match";
}
The sensitive pattern did NOT match
The insensitive pattern did match
In the next section, we’ll begin to see more complicated regular expressions, that include a lot of special characters. This particularly includes the backslash character, \
, which needs to be escaped in standard C++ string literals.
These additional escape characters can make regular expressions even more difficult to follow. So, for more complex expressions, it is recommended to construct our patterns from raw string literals.
Whilst string literals begin and end with "
, raw string literals begin with R"(
and end with )"
:
// String Literal
std::regex PatternA{"hello"};
// Raw String Literal
std::regex PatternB{R"(hello)"};
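As a minimal sketch of the difference, the two functions below build the same pattern, once from a standard string literal and once from a raw string literal. The pattern uses \d and +, which are covered later in this lesson and together mean "one or more digits". The function names are hypothetical and chosen only for illustration.

```cpp
#include <regex>
#include <string>

// Standard string literal: every backslash must be doubled so that
// the regex engine receives \d+
bool HasDigitsEscaped(const std::string& Input) {
  std::regex Pattern{"\\d+"};
  return std::regex_search(Input, Pattern);
}

// Raw string literal: the pattern is written exactly as the regex
// engine will see it
bool HasDigitsRaw(const std::string& Input) {
  std::regex Pattern{R"(\d+)"};
  return std::regex_search(Input, Pattern);
}
```

Both functions should behave identically. The raw form tends to be easier to compare against a pattern copied from a standalone regex tool.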
With regular expressions, we’re not just restricted to simple character matching. We have the option to add more complex syntax into our pattern, to create more elaborate behavior.
Within the same pattern, we can combine as many of these special characters as we want. Like other programming expressions, the way in which they combine is subject to an order of operations that is not entirely intuitive.
Few people learn the order of operations. Instead, we just use the tools talked about previously to see what works and what doesn’t.
In the next lesson, we’ll learn how to manipulate the order of operations within regex by using groups.
Below, we cover the most common special characters:
.
The period character .
acts as a wildcard, matching any symbol:
✔️ c.t -> cat
✔️ c.t -> cut
❌ c.t -> colt
✔️ c..t -> colt
|
The vertical bar character, |
allows the input to match either the left or the right pattern:
✔️ cat|hat -> cat
✔️ cat|hat -> hat
❌ cat|hat -> tat
✔️ cat|h.t -> hit
?
The question mark character, ?
flags the previous symbol as being optional:
✔️ can?t -> cant
✔️ can?t -> cat
✔️ ca.?t -> cart
✔️ ca.?t -> cat
^
The caret symbol ^ denotes the start of the input, allowing us to match patterns only at the beginning of our string:
✔️ ^cat -> cat
❌ ^cat -> the cat
✔️ ^cat|the -> the
✔️ ^.at -> mat
❌ ^.at -> the mat
$
Conversely, the dollar symbol $
denotes the end of the input, which allows us to restrict our search just to the end of the string:
✔️ cat$ -> cat
❌ cat$ -> the cat sat
✔️ cat|mat$ -> the cat
✔️ cat|mat$ -> sat on the mat
✔️ ca.$ -> the can
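The examples in this section can be checked from C++. The following is a minimal sketch that assumes each example is evaluated as a substring search. Search is a hypothetical helper name, not part of the standard library.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true if any substring of Input
// matches the regular expression given in PatternText
bool Search(const std::string& Input,
            const std::string& PatternText) {
  std::regex Pattern{PatternText};
  return std::regex_search(Input, Pattern);
}
```

For instance, Search("colt", "c.t") should return false, while Search("colt", "c..t") should return true. Because none of these patterns contain a backslash, standard string literals are sufficient here.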
Earlier, we noted the C++ standard library has the regex_match
function, which matches the entire input, whilst regex_search
looks for substrings. Many programming languages don’t draw a distinction here - they just offer the equivalent of regex_search
.
However, using the ^ and $ symbols together allows us to create a pattern that can only match against the entire input anyway:
✔️ ^cat$ -> cat
❌ ^cat$ -> cats
❌ ^cat$ -> the cat
✔️ ^.at$ -> cat
✔️ ^.at$ -> mat
❌ ^.at$ -> mate
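To illustrate the equivalence, here is a small sketch comparing the two approaches. The function names are hypothetical. Both functions are expected to agree on every input.

```cpp
#include <regex>
#include <string>

// Full-string check using regex_match and an unanchored pattern
bool MatchesWhole(const std::string& Input) {
  std::regex Pattern{"cat"};
  return std::regex_match(Input, Pattern);
}

// The same check using regex_search and a pattern anchored with
// ^ and $
bool SearchesAnchored(const std::string& Input) {
  std::regex Pattern{"^cat$"};
  return std::regex_search(Input, Pattern);
}
```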
Often, we want to use one of the special characters as their literal meaning within our patterns. For example, we want to check for a literal period within our input, but .
denotes a wild card, so adding it to our regex will match any character.
We can escape characters using the backslash symbol: \
// Unescaped . is a wild card
✔️ sat on the mat. -> sat on the mat.
✔️ sat on the mat. -> sat on the mat and
✔️ sat on the mat. -> sat on the mate
// Escape a special character using \
✔️ sat on the mat\. -> sat on the mat.
❌ sat on the mat\. -> sat on the mate
❌ sat on the mat\. -> sat on the mat
// Unescaped ? is the optional symbol
✔️ hello? -> hello
✔️ hello? -> hell
// Escape it using \
❌ hello\? -> hello
❌ hello\? -> hell
✔️ hello\? -> hello?
// Unescaped $ denotes end of input
❌ $3.50 -> $3.50
// Unescaped . is a wild card
✔️ \$3.50 -> $3.50
✔️ \$3.50 -> $3 50
// Escaping both special characters
❌ \$3\.50 -> $3 50
✔️ \$3\.50 -> $3.50
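In C++ code, escaped patterns like these are most conveniently written as raw string literals. The sketch below uses hypothetical function names to contrast the pattern with an unescaped period against the fully escaped version.

```cpp
#include <regex>
#include <string>

// Only the $ is escaped, so the . still acts as a wildcard
bool LoosePrice(const std::string& Input) {
  std::regex Pattern{R"(\$3.50)"};
  return std::regex_search(Input, Pattern);
}

// Both special characters are escaped, so only a literal $3.50
// will match
bool ExactPrice(const std::string& Input) {
  std::regex Pattern{R"(\$3\.50)"};
  return std::regex_search(Input, Pattern);
}
```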
When we want to search for a literal \
in our input, the escape character itself can be escaped with an additional \
:
❌ yes\no -> yes\no
✔️ yes\\no -> yes\no
// To search for literal \\ we escape both
❌ \\user\files -> \\user\files
✔️ \\\\user\\files -> \\user\files
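When writing this in C++, it helps to keep the two layers of escaping apart. In the sketch below, which uses a hypothetical function name, the raw string literal holds the regex exactly as shown above. With a standard string literal, every backslash would need to be doubled again.

```cpp
#include <regex>
#include <string>

// The regex yes\\no looks for the text yes, a literal backslash,
// and then the text no
bool ContainsYesBackslashNo(const std::string& Input) {
  std::regex Pattern{R"(yes\\no)"};
  return std::regex_search(Input, Pattern);
}
```

An input written as the C++ literal "yes\\no" contains a single backslash and should match. The literal "yes\no" contains a line break rather than a backslash, so it should not.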
When we want to search for one of a range of possible characters, we can introduce a character set, sometimes also called a character class. We do this by wrapping our characters in [ and ]. The following searches for bat, cat, mat or rat:
std::regex Pattern{R"([bcmr]at)"};
The order of characters within the set doesn’t matter.
Character sets interact with surrounding special characters as expected. For example, we can check if our input starts with something in a character set using the start-of-input symbol ^
, or make the entire set optional using the optional symbol ?
:
✔️ ^[Tt]he -> the cat
✔️ ^[Tt]he -> The cat
❌ ^[Tt]he -> Not the cat
✔️ [cbm]?at -> at
✔️ [cbm]?at -> cat
✔️ [cbm]?at -> bat
✔️ [cbm]?at -> mat
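As a brief sketch, the first pattern above could be used from C++ as follows. StartsWithThe is a hypothetical name used only for this example.

```cpp
#include <regex>
#include <string>

// True if the input begins with "the" or "The"
bool StartsWithThe(const std::string& Input) {
  std::regex Pattern{R"(^[Tt]he)"};
  return std::regex_search(Input, Pattern);
}
```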
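Within the [ and ] boundary of a character set, the special characters ., ?, ^, $ and | revert to their literal values. The one exception is a ^ placed first in the set, which has a special meaning covered later in this lesson. For example, the period symbol . matches only a literal . in the input: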
✔️ cat[.] -> cat.
❌ cat[.] -> cats
We can specify numeric or alphabetic ranges within our character sets using -
. For example, [a-e]
will match any of a
, b
, c
, d
, or e
:
✔️ [a-h]am -> cam
✔️ [a-h]am -> ham
❌ [a-h]am -> ram
✔️ [a-z]am -> ram
❌ [a-z]am -> 9am
❌ [0-9]am -> ram
✔️ [0-9]am -> 9am
✔️ [0-9a-z]am -> 9am
✔️ [0-9a-z]am -> ram
// Hexadecimal value
✔️ [A-F0-9][A-F0-9] -> FF
✔️ [A-F0-9][A-F0-9] -> E4
❌ [A-F0-9][A-F0-9] -> G4
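The hexadecimal example could be sketched in C++ like this. IsHexByte is a hypothetical name, and the function uses std::regex_match so that the whole input must be exactly two uppercase hexadecimal digits.

```cpp
#include <regex>
#include <string>

// True if the entire input is two characters, each in the range
// A-F or 0-9
bool IsHexByte(const std::string& Input) {
  std::regex Pattern{R"([A-F0-9][A-F0-9])"};
  return std::regex_match(Input, Pattern);
}
```

Note that lowercase letters are not part of this set, so an input such as ff would be rejected unless we add a-f to the set or construct the pattern with std::regex::icase.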
Some character sets have shortcut symbols we can use instead:
\d for any numeric digit, equivalent to [0-9]
\w for any alphabetic character, digit or underscore, equivalent to [a-zA-Z0-9_]
\s for any white space (can be a space character, a line break character, a tab character, and so on)
// \d can be any digit
✔️ \d -> 9
❌ \d -> m
✔️ \dam -> 9am
❌ \dam -> dam
❌ \dam -> ram
// Any two digits
✔️ \d\d -> 10
❌ \d\d -> 1
// Making the second digit optional
✔️ \d\d? -> 10
✔️ \d\d? -> 1
// \w can be any letter, digit or underscore
✔️ \w -> m
✔️ \w -> 9
✔️ \wam -> ram
✔️ \w\wam -> roam
✔️ help\w -> helps
❌ help\w -> help!
// \s can be any whitespace
✔️ the\sfox -> the fox
// The \n here is a line break
✔️ the\sfox -> the\nfox
// Any whitespace followed by any letter
✔️ the\s\wat -> the cat
✔️ the\s\wat -> the mat
❌ the\s\wat -> themat
// Making whitespace optional
✔️ the\s?\wat -> themat
// Combining \d \s and \w
❌ \d\d\s\wats -> 1 cat
❌ \d\d\s\wats -> 5 cats
✔️ \d\d\s\wats -> 05 cats
✔️ \d\d\s\wats -> 24 rats
✔️ \d\d\s\wats -> 24\nrats
// Fails as a full-string match; a substring search finds 00 cats
❌ \d\d\s\wats -> 100 cats
❌ \d\d\s\wats -> four cats
// The second \d and final s are optional
✔️ \d\d?\s\wats? -> 1 cat
✔️ \d\d?\s\wats? -> 5 cats
✔️ \d\d?\s\wats? -> 05 cats
✔️ \d\d?\s\wats? -> 24 rats
✔️ \d\d?\s\wats? -> 24\nrats
// Fails as a full-string match; a substring search finds 00 cats
❌ \d\d?\s\wats? -> 100 cats
❌ \d\d?\s\wats? -> four cats
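The last pattern above can be sketched in C++ as follows. The names are hypothetical. Two functions are shown because the result for an input like 100 cats depends on whether we require the whole string to match or accept any matching substring.

```cpp
#include <regex>
#include <string>

// One or two digits, one whitespace character, one word character,
// then "at" and an optional "s"
const std::regex CountPattern{R"(\d\d?\s\wats?)"};

// The entire input must match the pattern
bool IsCount(const std::string& Input) {
  return std::regex_match(Input, CountPattern);
}

// Any substring of the input may match the pattern
bool ContainsCount(const std::string& Input) {
  return std::regex_search(Input, CountPattern);
}
```

With 100 cats as input, IsCount should return false, while ContainsCount should return true because the substring 00 cats satisfies the pattern.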
Character sets, and their shortcuts, can be escaped in the usual way, with \
:
❌ \[hello\] -> h
✔️ \[hello\] -> [hello]
❌ \\w -> a
✔️ \\w -> \w
❌ \\d -> 5
✔️ \\d -> \d
❌ \\s -> the cat
✔️ \\s -> \s
By including the caret symbol, ^
at the beginning of our character set, we can negate it. This allows us to ensure a set of characters is not included in our input at that position.
// A negated set must still match one character
❌ [^cbm]at -> at
❌ [^cbm]at -> cat
❌ [^cbm]at -> bat
❌ [^cbm]at -> mat
✔️ [^cbm]at -> rat
✔️ [^c-m]at -> rat
✔️ [^\d]at -> cat
❌ [^\d]at -> 5at
✔️ [^\d]at -> 5 at
// [^\w] must match a character, so something has to follow cat
✔️ cat[^\w] -> cat.
✔️ cat[^\w] -> the cat sat
❌ cat[^\w] -> caterpillar
❌ cat[^\w] -> vacate
✔️ cat[^\w] -> copycat.
❌ [^\w]cat[^\w] -> copycat.
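A point that is easy to miss is that a negated set still has to match exactly one character. It does not match "nothing". The sketch below, using a hypothetical function name, shows this with the first pattern from the examples.

```cpp
#include <regex>
#include <string>

// True if the input contains "at" immediately preceded by some
// character other than c, b, or m
bool HasOtherAt(const std::string& Input) {
  std::regex Pattern{R"([^cbm]at)"};
  return std::regex_search(Input, Pattern);
}
```

An input of just at should not match, because there is no character before the a for the negated set to consume. An input of  at with a leading space should match, as the space is not one of c, b, or m.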
We can look for repeating patterns within our input. We do that by adding syntax directly after the symbol or character set we want to look for repetitions of. We have 3 options:
*
The * character states there can be any number of the preceding symbol or character set. That can include zero:
✔️ ab*c -> ac
✔️ ab*c -> abc
✔️ ab*c -> abbc
✔️ ab*c -> abbbbbc
✔️ a.*c -> ac
✔️ a.*c -> abc
✔️ a.*c -> a123c
✔️ a.*c -> a123 abc
✔️ a[bcd]*e -> ae
✔️ a[bcd]*e -> abe
✔️ a[bcd]*e -> ace
✔️ a[bcd]*e -> abcde
✔️ a[bcd]*e -> abcdcdbe
❌ a[bcd]*e -> a1e
+
The + character specifies we want at least one of the preceding symbol or character set, but there can be more:
❌ ab+c -> ac
✔️ ab+c -> abc
✔️ ab+c -> abbc
✔️ ab+c -> abbbbbc
❌ a.+c -> ac
✔️ a.+c -> abc
✔️ a.+c -> a123c
✔️ a.+c -> a123 abc
❌ a[bcd]+e -> ae
✔️ a[bcd]+e -> abe
✔️ a[bcd]+e -> ace
✔️ a[bcd]+e -> abcde
✔️ a[bcd]+e -> abcdcdbe
❌ a[bcd]+e -> ale
Exactly x repetitions: {x}
The brace syntax allows us to be more specific with how many repetitions we want. We can pass a single number between the braces, to specify we want a specific number of repetitions.
Note, a common mistake here comes when using this syntax alongside a substring search, like std::regex_search
. In that context, a pattern like [0-9]{2}
which searches for exactly 2 digits will return true
on an input like 123
. Whilst the entire string of 123
has 3 digits, it has two substrings of 2 digits - 12
and 23
.
If we wanted this input to not match, we’d need to be more specific. For example, if we wanted our entire string to be exactly 2 digits, we could use std::regex_match
, instead of std::regex_search
, or add the start and end of input symbols ^
and $
to our regex.
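The pitfall and both fixes can be sketched as follows. The function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Substring search: true for any input that contains two
// consecutive digits, including 123
bool ContainsTwoDigits(const std::string& Input) {
  std::regex Pattern{R"([0-9]{2})"};
  return std::regex_search(Input, Pattern);
}

// Fix 1: require the entire input to match
bool IsTwoDigitsUsingMatch(const std::string& Input) {
  std::regex Pattern{R"([0-9]{2})"};
  return std::regex_match(Input, Pattern);
}

// Fix 2: keep regex_search but anchor the pattern with ^ and $
bool IsTwoDigitsUsingAnchors(const std::string& Input) {
  std::regex Pattern{R"(^[0-9]{2}$)"};
  return std::regex_search(Input, Pattern);
}
```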
In the following example, we look for exactly two repetitions:
❌ ab{2}c -> ac
❌ ab{2}c -> abc
✔️ ab{2}c -> abbc
❌ ab{2}c -> abbbc
❌ a.{2}c -> ac
❌ a.{2}c -> abc
✔️ a.{2}c -> a12c
❌ a.{2}c -> a123c
❌ a[bcd]{2}e -> ae
❌ a[bcd]{2}e -> abe
❌ a[bcd]{2}e -> ace
✔️ a[bcd]{2}e -> abce
✔️ a[bcd]{2}e -> acde
❌ a[bcd]{2}e -> abcde
// 8-bit hexadecimal value (two hex digits)
✔️ [A-F0-9]{2} -> 6F
// hexadecimal color
✔️ [A-F0-9]{6} -> 6F4AFF
At least x repetitions: {x,}
By adding a trailing comma within our braces, we specify that we want at least x
repetitions, without an upper limit. Below, we search for at least two repetitions:
❌ ab{2,}c -> ac
❌ ab{2,}c -> abc
✔️ ab{2,}c -> abbc
✔️ ab{2,}c -> abbbc
❌ a.{2,}c -> ac
❌ a.{2,}c -> abc
✔️ a.{2,}c -> a12c
✔️ a.{2,}c -> a123c
❌ a[bcd]{2,}e -> ae
❌ a[bcd]{2,}e -> abe
❌ a[bcd]{2,}e -> ace
✔️ a[bcd]{2,}e -> abce
✔️ a[bcd]{2,}e -> acde
✔️ a[bcd]{2,}e -> abcde
x to y repetitions: {x,y}
By adding a second number to our braces, we can specify both a lower and upper bound for the number of repetitions we are looking for. Below, we search for at least one, but not more than three repetitions of our previous symbol or character set:
// Looking for 1 to 3 repetitions
❌ ab{1,3}c -> ac
✔️ ab{1,3}c -> abc
✔️ ab{1,3}c -> abbc
✔️ ab{1,3}c -> abbbc
❌ ab{1,3}c -> abbbbc
❌ a.{1,3}c -> ac
✔️ a.{1,3}c -> abc
✔️ a.{1,3}c -> a12c
✔️ a.{1,3}c -> a123c
❌ a.{1,3}c -> a1234c
❌ a[bcd]{1,3}e -> ae
✔️ a[bcd]{1,3}e -> abe
✔️ a[bcd]{1,3}e -> ace
✔️ a[bcd]{1,3}e -> abce
✔️ a[bcd]{1,3}e -> acde
✔️ a[bcd]{1,3}e -> abcde
❌ a[bcd]{1,3}e -> abbcde
The repetition specifiers also work with the character set shortcuts \s
, \w
, and \d
:
❌ the\s+cat -> thecat
✔️ the\s+cat -> the cat
✔️ the\s+cat -> the   cat
✔️ c\w*t -> ct
✔️ c\w*t -> cat
✔️ c\w*t -> coat
// \w also matches digits
✔️ c\w*t -> c5t
❌ c\w*t -> c-t
❌ \d{2} -> 1
✔️ \d{2} -> 10
// Substring match
✔️ \d{2} -> 100
// Full string match
❌ ^\d{2}$ -> 100
❌ \d{2,} -> 1
✔️ \d{2,} -> 10
✔️ \d{2,} -> 100
✔️ \d{1,3} -> 1
✔️ \d{1,3} -> 10
✔️ \d{1,3} -> 100
// Substring match
✔️ \d{1,3} -> 1000
// Full string match
❌ ^\d{1,3}$ -> 1000
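As one more sketch, here is the \s+ pattern from the start of these examples in C++. HasTheCat is a hypothetical name. Because \s covers any whitespace character, tabs and line breaks between the two words are accepted as well as spaces.

```cpp
#include <regex>
#include <string>

// True if the input contains "the" and "cat" separated by one or
// more whitespace characters
bool HasTheCat(const std::string& Input) {
  std::regex Pattern{R"(the\s+cat)"};
  return std::regex_search(Input, Pattern);
}
```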
In the next lesson, we’ll cover regular expression capture groups. These build on the regex syntax we learned but will allow us to go beyond just checking whether or not the text matches the regular expression pattern we created.
With capture groups, we will learn how to use regex to filter through and extract parts of our text, or to replace segments of it in a targeted way.