Regex Capture Groups

An introduction to regular expression capture groups, and how to use them in C++ with regex_search, regex_replace, regex_iterator, and regex_token_iterator
This lesson is part of the course:

Professional C++

Comprehensive course covering advanced concepts, and how to use them on large-scale projects.

Free, Unlimited Access
Ryan McCombe

In this lesson, we’ll build on our regex knowledge by introducing capture groups, non-capture groups, and lazy quantifiers. Our previous lesson covered a broad introduction to regex, and how to use it to determine if a string matches our criteria.

Here, we will focus on a more advanced use case, where we want to extract information from the strings we receive.

Use cases for this include:

  • Extracting data from a string, such as retrieving the domain name from an email address
  • Reformatting a string, such as changing the order of date elements from MM/DD/YYYY to YYYY/MM/DD
  • Finding and replacing specific parts of a string, such as replacing all instances of a word with a different word while preserving the rest of the string

Finally, we’ll see how we can make use of these regex concepts in C++.

This is intended to be a follow-up to our introductory lesson on Regular Expressions in C++, so familiarity with the concepts covered there is recommended.

Capture Groups

In regular expressions, capture groups are defined by parentheses: ( and ). For example, if we wanted to capture every time "hello" appears in our string, our regex would look like this:

(hello)

If we wanted to capture only the "hello"s that precede " world", it would look like this:

(hello) world

Within the capture group, we still have all of our usual regex powers, for example:

  • (hello|goodbye) - Capture hello or goodbye
  • ([hc]ello) - Capture hello and cello
  • ([hc]ello?) - Capture hell, hello, cell and cello
  • (hel*o) - Capture heo, helo, hello, etc
  • (hel\w) - Capture hell, help, etc
  • hello(.*)world - Capture everything between hello and world

Just like the order of operations in maths and programming can be manipulated using brackets, so too can the order of operations within regex.

Operators like ? and * can be applied to a capture group as a whole, while a | inside a capture group only applies within that group:

✔️ The( big)? cat -> The cat
✔️ The( big)? cat -> The big cat
❌ The( big)? cat -> The big big cat

✔️ The( big)* cat -> The cat
✔️ The( big)* cat -> The big cat
✔️ The( big)* cat -> The big big cat

✔️ The (big|chonky) cat -> The big cat
✔️ The (big|chonky) cat -> The chonky cat
❌ The (big|chonky) cat -> The big chonky cat

❌ The( big| chonky)+ cat -> The cat
✔️ The( big| chonky)+ cat -> The big cat
✔️ The( big| chonky)+ cat -> The big chonky cat
✔️ The( big| chonky)+ cat -> The big chonky big cat
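
One way to check the tables above is to run each pattern through std::regex_match(), which we covered in the introductory lesson. The following is a minimal sketch. The FullyMatches() helper is a name invented for this example, not something provided by the standard library:

```cpp
#include <regex>
#include <string>

// Hypothetical helper for this example: returns true
// if the entire input matches the pattern
bool FullyMatches(const std::string& Input,
                  const std::string& Pattern) {
  return std::regex_match(Input,
                          std::regex{Pattern});
}
```

For example, FullyMatches("The big big cat", "The( big)* cat") returns true, while the same input tested against "The( big)? cat" returns false.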

Note that if we just want to manipulate the order of operations, and have no need to capture the content of our group, we can use a non-capture group instead. These are introduced a little later in this lesson.

Escaping Brackets

When we want to look for literal ( and ) in our strings rather than creating a capture group, we can escape them in the usual way, using \.

For example, if we want to search a string for the sequence "The (big) cat", our regex would be "The \(big\) cat"

Creating a capture group:
❌ The (big) cat -> The (big) cat

Searching for a pattern containing brackets:
✔️ The \(big\) cat -> The (big) cat
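
In C++ source code, the backslash is also the escape character for ordinary string literals, so a pattern like this needs either doubled backslashes or a raw string literal. Below is a small sketch of the raw string approach. ContainsBracketedBig() is a name made up for this example:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: does the input contain the
// literal text "The (big) cat"?
bool ContainsBracketedBig(const std::string& Input) {
  // A raw string literal, so \( reaches the regex
  // engine unchanged. The equivalent ordinary literal
  // would be "The \\(big\\) cat"
  static const std::regex Pattern{
      R"(The \(big\) cat)"};
  return std::regex_search(Input, Pattern);
}
```

Calling ContainsBracketedBig("The (big) cat") returns true, while ContainsBracketedBig("The big cat") returns false, because the brackets are now treated as literal characters rather than as a capture group.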

Greedy and Lazy Quantifiers

Let's imagine we want to get a breakdown of what email providers our users have signed up with. We have a list of emails, like example@gmail.com, and we want to generate a list of email providers, like gmail.

Our first attempt at a regular expression might be to capture everything from the @ to the literal .:

@(.*)\.

Some basic testing would indicate this works - given a string like example@gmail.com, the substring of gmail is captured as intended.

However, given the string example@gmail.co.uk, the substring of gmail.co is captured.

This is because, by default, repetition quantifiers such as *, +, and {x,y} are greedy. They capture as much as possible.

In this example, the .* in (.*)\. is capturing everything until the last period in the string, not the next period.

We can change this by appending a ? after the quantifier, thereby making it lazy:

@(.*?)\.
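
As a preview of the C++ functions covered later in this lesson, the difference between the two patterns can be sketched using std::regex_search(). FirstCapture() is a helper name invented for this example. It returns whatever the first capture group captured, or an empty string if the pattern was not found:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text captured by the
// first capture group, or "" if the pattern is not found
std::string FirstCapture(const std::string& Input,
                         const std::string& Pattern) {
  std::regex Regex{Pattern};
  std::smatch Match;
  if (std::regex_search(Input, Match, Regex)) {
    return Match[1].str();
  }
  return "";
}
```

Given the input example@gmail.co.uk, the greedy pattern R"(@(.*)\.)" yields gmail.co, while the lazy pattern R"(@(.*?)\.)" yields gmail.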

This also applies to the other quantifiers. The following lists what would be captured by various regular expressions, given the string of 54321!.

We apply different quantifiers to the \d character, to remove some of the digits from what is ultimately captured:

* matches as many repetitions as possible
✔️ \d*(.*) => 54321! = !

*? matches as few repetitions as possible
✔️ \d*?(.*) => 54321! = 54321!

+ matches as many repetitions as possible, but at least 1
✔️ \d+(.*) => 54321! = !

+? matches as few repetitions as possible, but at least 1
✔️ \d+?(.*) => 54321! = 4321!

{2,4} matches 2, 3 or 4 repetitions, preferring more
✔️ \d{2,4}(.*) => 54321! = 1!

{2,4}? matches 2, 3 or 4 repetitions, preferring fewer
✔️ \d{2,4}?(.*) => 54321! = 321!
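
The list above can be reproduced with a short experiment. The sketch below assumes a helper called CaptureAfterDigits(), a name invented for this example. It appends the given quantifier to \d, then reports what the (.*) group captures from 54321!:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: applies the given quantifier to
// \d, then returns what (.*) captures from "54321!"
std::string CaptureAfterDigits(
    const std::string& Quantifier) {
  const std::string Input{"54321!"};
  // Builds patterns such as \d+?(.*)
  std::regex Pattern{R"(\d)" + Quantifier + "(.*)"};
  std::smatch Match;
  if (std::regex_search(Input, Match, Pattern)) {
    return Match[1].str();
  }
  return "";
}
```

For example, CaptureAfterDigits("+") returns "!" because the greedy \d+ consumes every digit, while CaptureAfterDigits("+?") returns "4321!" because the lazy version stops after the first one.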

Non-Capture Groups

The ability to create groups of tokens within our regular expression is generally useful, even if we don’t need to capture them. For these, we have non-capture groups. Non-capture groups start with (?: and end with ).

They have many use cases, including the following examples:

Applying the optional operator (?) to sub-patterns

Below, we make the string "brown " optional:

✔️ The (?:brown )?fox -> The fox
✔️ The (?:brown )?fox -> The brown fox
❌ The (?:brown )?fox -> The red fox

Applying the repetition quantifiers (*, +, and {}) to sub-patterns

Below, we allow the string "red " to appear 0-2 times:

✔️ The (?:red ){0,2}fox -> The fox
✔️ The (?:red ){0,2}fox -> The red fox
✔️ The (?:red ){0,2}fox -> The red red fox
❌ The (?:red ){0,2}fox -> The red red red fox

Manipulating the order of operations

The following example shows how we can use a non-capturing group to manipulate which part of the pattern the alternation operator | applies to:

Default order of operators
✔️ The brown|red fox -> The brown
✔️ The brown|red fox -> red fox

Controlling it using a non-capture group
❌ The (?:brown|red) fox -> The brown
❌ The (?:brown|red) fox -> red fox
✔️ The (?:brown|red) fox -> The brown fox
✔️ The (?:brown|red) fox -> The red fox
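
The practical difference between the two kinds of group shows up in how many sub-matches a search produces, a topic covered in detail in the next section. As a rough sketch, the SubMatchCount() helper below (a name invented for this example) reports how many sub-matches a successful search stored:

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: how many sub-matches does a
// successful search produce? Returns 0 if nothing matched
std::size_t SubMatchCount(const std::string& Input,
                          const std::string& Pattern) {
  std::regex Regex{Pattern};
  std::smatch Match;
  if (!std::regex_search(Input, Match, Regex)) {
    return 0;
  }
  // The overall match, plus one entry per capture group
  return Match.size();
}
```

With the input "The red fox", the pattern "The (brown|red) fox" produces 2 sub-matches (the overall match and the captured colour), while "The (?:brown|red) fox" produces only 1, because the non-capture group stores nothing.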

Using Capture Groups in C++

The rest of this lesson will focus on how we can use capture groups within the C++ standard library’s regex helpers, available by including <regex>.

std::match_results and std::smatch

The std::match_results<> template class is typically how we want to store the output of our regex operations. This is templated so it can be used with different types of strings.

However, a specialization that works with std::string has already been aliased for us. It is called std::smatch, an alias for std::match_results<std::string::const_iterator>, which is what we'll be using here.

The std::regex_search() function has an overload that accepts a std::match_results object as the second argument:

#include <regex>

int main() {
  std::string Input{"Hello There"};
  std::regex Pattern{"Hello There"};
  std::smatch Match;
  std::regex_search(Input, Match, Pattern);
}

We covered the basics of std::regex_search() in our introductory lesson. Here, we’ll focus on its interaction with std::match_results objects and capture groups.

Each object in the match results is a std::sub_match. These objects contain some useful information about the sub-match that was found, as well as the matched string, which can be accessed using the str() method.

If the std::regex_search() call was successful, the std::smatch will contain at least one sub-match: the substring that matched the entire regex pattern we provided:

#include <regex>
#include <iostream>

int main() {
  std::string Input{"Hello There"};
  std::regex Pattern{"Hello There"};
  std::smatch Match;

  if (std::regex_search(Input, Match,
                        Pattern)) {
    std::cout << Match.size()
              << " sub-match found!";
    for (auto Submatch : Match) {
      std::cout << "\nSubmatch: " << Submatch;
    }
  }
}
1 sub-match found!
Submatch: Hello There

When we’re not using capture groups, our std::match_results container will only contain one std::sub_match. However, when our regex contains capture groups, what was captured by those capture groups will be included in the std::smatch collection.

The overall match will be at index 0, what was captured by the first capture group will be at index 1, the second group at index 2, and so on:

#include <regex>
#include <iostream>

int main() {
  std::string Input{"Hi All"};
  std::regex Pattern{"(Hello|Hi) (There|All)"};
  std::smatch Match;

  if (std::regex_search(Input, Match,
                        Pattern)) {
    std::cout << Match.size()
              << " submatches found!";
    for (auto Submatch : Match) {
      std::cout << "\nSubmatch: " << Submatch;
    }
  }
}
3 submatches found!
Submatch: Hi All
Submatch: Hi
Submatch: All

The std::smatch has some additional properties and methods we may find useful. For example, the position() method accepts an integer parameter and will return the starting position of the corresponding sub-match within the input string.

Additionally, the std::sub_match objects have members we may find useful, including:

  • length() - the length of the sub-match string
  • first - an iterator to the first character in the sub-match
  • second - an iterator to one past the last character in the sub-match
#include <regex>
#include <iostream>

int main() {
  std::string Input{"Hello World"};
  std::regex Pattern{"Hello (.*)"};
  std::smatch Matches;

  if (std::regex_search(Input, Matches,
                        Pattern)) {
    std::cout << Matches.size()
              << " submatches found!";
    for (size_t i{0}; i < Matches.size(); ++i) {
      std::cout << "\n\nSubmatch " << i << ": "
                << Matches[i] << "\n  Length: "
                << Matches[i].length()
                << "\n  First Character: "
                << *Matches[i].first
                << "\n  Position: "
                << Matches.position(i);
    }
  }
}
2 submatches found!

Submatch 0: Hello World
  Length: 11
  First Character: H
  Position: 0

Submatch 1: World
  Length: 5
  First Character: W
  Position: 6

Multiple Matches using std::regex_iterator

Calls to std::regex_search() will stop once a match is found. However, our pattern may match in several places within our input string.

When we want to match all instances of a pattern within our string, we have some other options we can use. We could do it with multiple calls to std::regex_search(), but it’s typically easier and safer to use the standard library's dedicated regex iterators instead.

In other programming languages, this style of regex matching is often called "global search". Typically, we’d activate it by adding a g flag to the regex pattern, such as /World/g in JavaScript, and then using the same function we’d use for a single match - that language’s equivalent of std::regex_search().

When using the C++ standard library, we don’t have that option. We need to write a bit more code to implement global search. However, the trade-off is that we have full control over how it behaves.
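
To illustrate what the manual approach involves, here is a minimal sketch of a global search built from repeated calls to std::regex_search(). It uses the overload that accepts an iterator pair, restarting each search just after the previous match. FindAll() is a name invented for this example:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every overall match by
// calling std::regex_search() repeatedly, each time
// starting just after the previous match
std::vector<std::string> FindAll(
    const std::string& Input,
    const std::regex& Pattern) {
  std::vector<std::string> Results;
  std::smatch Match;
  auto Start = Input.cbegin();

  while (std::regex_search(Start, Input.cend(), Match,
                           Pattern)) {
    Results.push_back(Match[0].str());
    // Stop on an empty match, which would otherwise
    // cause this loop to run forever
    if (Match[0].length() == 0) break;
    Start = Match[0].second;
  }
  return Results;
}
```

This simplified version gives up at the first empty match rather than handling it properly. Edge cases like that are part of the reason the dedicated iterators are usually the safer choice.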

The std::regex_iterator is also a template class, but an alias has been provided if we're working with std::string objects. The alias is std::sregex_iterator.

We construct the starting iterator by passing a std::string iterator pair as the first two arguments, representing where we want the search to begin, and where we want it to end. Typically, we want to search the entire string, so we just pass the results of the begin() and end() methods of our input string.

The third argument we need to pass to the constructor is our regex pattern:

std::string Input{
    "Hello World, Goodbye World"};
std::regex Pattern{
    "(Hello|Goodbye) (World|Everyone)"};
std::sregex_iterator Iterator{
    Input.begin(), Input.end(), Pattern};

To create an end iterator to compare against, we can create a second std::sregex_iterator, passing no arguments.

With this setup, we can now use the std::sregex_iterator to iterate through all the matches found in our input string.

Similar to calls to std::regex_search(), each iteration will yield a std::match_results object. We can then access the std::sub_match objects within each container in the usual way:

#include <regex>
#include <iostream>

int main() {
  std::string Input{
      "Hello World, Goodbye World"};
  std::regex Pattern{
      "(Hello|Goodbye) (World|Everyone)"};
  std::sregex_iterator Iterator{
      Input.begin(), Input.end(), Pattern};
  std::sregex_iterator End;

  while (Iterator != End) {
    std::cout << "Match";
    for (auto Match : *Iterator) {
      std::cout << "\n  Submatch: " << Match;
    }
    std::cout << "\n\n";
    ++Iterator;
  }
}
Match
  Submatch: Hello World
  Submatch: Hello
  Submatch: World

Match
  Submatch: Goodbye World
  Submatch: Goodbye
  Submatch: World
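
Because std::sregex_iterator is a standard forward iterator, it also works with standard library algorithms. As a small sketch, the CountMatches() helper below (a name invented for this example) uses std::distance() to count how many times a pattern matches:

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts how many times the pattern
// matches within the input
std::ptrdiff_t CountMatches(const std::string& Input,
                            const std::regex& Pattern) {
  std::sregex_iterator Begin{
      Input.begin(), Input.end(), Pattern};
  std::sregex_iterator End;
  // Walks from the first match to the end iterator,
  // counting the steps
  return std::distance(Begin, End);
}
```

For the input "Hello World, Goodbye World" and the pattern "World", this returns 2.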

std::regex_token_iterator

We have an alternative regex iterator we can use - the std::regex_token_iterator. A std::string version is available as std::sregex_token_iterator.

We construct and iterate over it in the same way we did std::sregex_iterator, but the matches are provided in a simpler form.

The token iterator skips the intermediate std::match_results containers - it instead just iterates directly through the std::sub_match objects.

By default, it provides us with the sub-match at index 0 of each match. That is, it gives us the sub-matches that matched the entire regex pattern, rather than any specific capture group:

#include <regex>
#include <iostream>

int main() {
  std::string Input{
      "Hello World, Goodbye Everyone"};
  std::regex Pattern{
      "(Hello|Goodbye) (World|Everyone)"};
  std::sregex_token_iterator Iterator{
      Input.begin(), Input.end(), Pattern};
  std::sregex_token_iterator End;

  while (Iterator != End) {
    std::cout << "\nSubmatch: " << *Iterator;
    ++Iterator;
  }
}
Submatch: Hello World
Submatch: Goodbye Everyone

By passing a 4th argument to the std::sregex_token_iterator, we can specify which sub-match we want. Below, we specify index 1, ie, the sub-match that was captured by our first capture group:

#include <regex>
#include <iostream>

int main() {
  std::string Input{
      "Hello World, Goodbye Everyone"};
  std::regex Pattern{
      "(Hello|Goodbye) (World|Everyone)"};
  std::sregex_token_iterator Iterator{
      Input.begin(), Input.end(), Pattern, 1};
  std::sregex_token_iterator End;

  while (Iterator != End) {
    std::cout << "\nSubmatch: " << *Iterator;
    ++Iterator;
  }
}
Submatch: Hello
Submatch: Goodbye

We can pass multiple indices to the 4th argument, using a std::vector, a C-style array, or an initializer list.

Below, we specify we want all the sub-matches of our regex. We know this pattern has two capture groups, so we expect three sub-matches per match. We want the overall match 0, and the capture groups 1 and 2:

#include <regex>
#include <iostream>

int main() {
  std::string Input{
      "Hello World, Goodbye Everyone"};
  std::regex Pattern{
      "(Hello|Goodbye) (World|Everyone)"};
  std::sregex_token_iterator Iterator{
      Input.begin(),
      Input.end(),
      Pattern,
      {0, 1, 2}};
  std::sregex_token_iterator End;

  while (Iterator != End) {
    std::cout << "\nSubmatch: " << *Iterator;
    ++Iterator;
  }
}
Submatch: Hello World
Submatch: Hello
Submatch: World
Submatch: Goodbye Everyone
Submatch: Goodbye
Submatch: Everyone
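
The indices do not need to be hard-coded, or listed in ascending order. The sketch below passes them as a std::vector and collects the results into a container instead of printing them. CollectSubmatches() is a name invented for this example:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the requested sub-matches
// of every match, in the order they are found
std::vector<std::string> CollectSubmatches(
    const std::string& Input,
    const std::regex& Pattern,
    const std::vector<int>& Indices) {
  std::sregex_token_iterator Iterator{
      Input.begin(), Input.end(), Pattern, Indices};
  std::sregex_token_iterator End;

  std::vector<std::string> Results;
  while (Iterator != End) {
    Results.push_back(Iterator->str());
    ++Iterator;
  }
  return Results;
}
```

Using the same input and pattern as above with the indices {2, 1}, the result is World, Hello, Everyone, Goodbye. Within each match, the sub-matches are visited in the order the indices were given.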

How many capture groups are in a regular expression?

If we have a std::regex object and need to programmatically find out how many capture groups it contains, we can use the mark_count() method.

#include <regex>
#include <iostream>

int main() {
  std::regex Pattern{
      "(Hello|Goodbye) (World|Everyone)"};
  std::cout << Pattern.mark_count();
}
2

This function is named mark_count() because an alternative name for a capture group is a "marked subexpression".
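
One possible use for mark_count() is building the list of sub-match indices for a std::sregex_token_iterator, rather than relying on us knowing how many capture groups the pattern has. A minimal sketch, where AllSubmatchIndices() is a name invented for this example:

```cpp
#include <numeric>
#include <regex>
#include <vector>

// Hypothetical helper: builds {0, 1, ..., mark_count()},
// the indices of the overall match and every capture
// group in the pattern
std::vector<int> AllSubmatchIndices(
    const std::regex& Pattern) {
  std::vector<int> Indices(Pattern.mark_count() + 1);
  // Fill the vector with 0, 1, 2, and so on
  std::iota(Indices.begin(), Indices.end(), 0);
  return Indices;
}
```

For the pattern "(Hello|Goodbye) (World|Everyone)", this returns a vector containing 0, 1 and 2, which can be passed as the 4th argument when constructing a std::sregex_token_iterator.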

std::regex_replace()

The std::regex_replace() function allows us to make changes to a string, based on a regular expression. In the following example, we replace every instance of "World" with the string "Everyone":

#include <regex>
#include <iostream>

int main() {
  std::string Input{
      "Hello World, Goodbye World"};
  std::regex Search{"World"};
  std::string Replace{"Everyone"};

  std::string Updated{std::regex_replace(
      Input, Search, Replace)};

  std::cout << "Before: " << Input
            << "\n After: " << Updated;
}
Before: Hello World, Goodbye World
 After: Hello Everyone, Goodbye Everyone

Below, we use a slightly more complicated regex pattern to replace anything that looks somewhat like an email address:

#include <regex>
#include <iostream>

int main() {
  std::string Input{
      "email me at bob@gmail.com or "
      "bob@yahoo.com"};
  std::regex Search{R"(\w*@[\w.]*)"};
  std::string Replace{"[redacted]"};

  std::string Updated{std::regex_replace(
      Input, Search, Replace)};

  std::cout << "Before: " << Input
            << "\n After: " << Updated;
}
Before: email me at bob@gmail.com or bob@yahoo.com
 After: email me at [redacted] or [redacted]

Note: A robust regex pattern for email addresses is significantly more complicated than this. The patterns used throughout this lesson have been simplified for clarity.

std::regex_replace() with Capture Groups

When using std::regex_replace() with capture groups, we can include what was captured within our replacement string. We do this using the $ symbol, followed by the number of our capture group within our regex pattern, starting from 1. For example, $1, $2, $3, and so on.

Below, we change how negative numbers are displayed. For example, -100 becomes (100). We do this by adding a capture group to our regex, and then referencing what was captured by that group using $1 in our replacement string:

#include <regex>
#include <iostream>

int main() {
  std::string Input{
      "The balances are 400, -100 and 250"};
  std::regex Search{R"(-(\d*))"};
  std::string Replace{"($1)"};

  std::string Updated{std::regex_replace(
      Input, Search, Replace)};

  std::cout << "Before: " << Input
            << "\n After: " << Updated;
}
Before: The balances are 400, -100 and 250
 After: The balances are 400, (100) and 250

In this example, we use multiple capture groups to reorder and duplicate parts of our string:

#include <regex>
#include <iostream>

int main() {
  std::string Input{"The name's James Bond"};
  std::regex Search{"(The name's) (.*) (.*)"};
  std::string Replace{"$1 $3, $2 $3"};

  std::string Updated{std::regex_replace(
      Input, Search, Replace)};

  std::cout << "Before: " << Input
            << "\n After: " << Updated;
}
Before: The name's James Bond
 After: The name's Bond, James Bond
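
The same technique covers the date reformatting use case mentioned at the start of this lesson. The following sketch rewrites dates from MM/DD/YYYY to YYYY/MM/DD. ReorderDates() is a name invented for this example, and its pattern is deliberately simple:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: rewrites every MM/DD/YYYY date in
// the input as YYYY/MM/DD. The pattern only checks the
// digit counts, not whether the date is valid
std::string ReorderDates(const std::string& Input) {
  static const std::regex Search{
      R"((\d{2})/(\d{2})/(\d{4}))"};
  // $3 is the year, $1 the month, and $2 the day
  return std::regex_replace(Input, Search,
                            "$3/$1/$2");
}
```

For example, ReorderDates("Due 12/25/2024") returns "Due 2024/12/25", while input containing no dates is returned unchanged.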

If we want our replacement string to include a literal dollar value, like "I can pay $3", we escape the capture group reference using an additional dollar sign. For example, to have $3 in our replacement string, we’d use $$3:

#include <regex>
#include <iostream>

int main() {
  std::string Input{"The price is {price}"};
  std::regex Search{R"(\{price\})"};
  std::string Replace{"$$3.50"};

  std::cout << std::regex_replace(Input, Search,
                                  Replace);
}
The price is $3.50

We can access the full substring that was matched using the $& token. This is equivalent to the contents at index 0 of a std::match_results object:

#include <regex>
#include <iostream>

int main() {
  std::string Input{
      "The hungry brown cat and the sleepy "
      "black bear"};
  std::regex Search{
      ".*?(happy|hungry|sleepy) (brown|black) "
      "(bear|fox|cat).*?"};
  std::string Replace{
      "Matched: $& \n  Animal: $3\n  Color: "
      "$2\n  Mood: $1\n\n"};

  std::cout << std::regex_replace(Input, Search,
                                  Replace);
}
Matched: The hungry brown cat
  Animal: cat
  Color: brown
  Mood: hungry

Matched:  and the sleepy black bear
  Animal: bear
  Color: black
  Mood: sleepy

Summary

In this lesson, we explored regex capture groups and their application within C++, enabling the manipulation and extraction of specific parts of strings for both analysis and transformation.

Main Points Learned

  • The syntax and usage of capture groups in regular expressions.
  • The distinction between capture groups and non-capture groups, and when to use each.
  • How greedy and lazy quantifiers influence the matching behavior and how to apply them.
  • The role of std::smatch and std::match_results in storing and accessing matched results from regex operations.
  • Utilization of std::regex_search(), std::regex_replace(), std::sregex_iterator, and std::sregex_token_iterator to perform complex regex operations in C++.
  • Techniques for using capture groups in string replacement operations with std::regex_replace().
  • How to programmatically determine the number of capture groups in a regex pattern using mark_count().

Next Lesson

String Views

A practical introduction to string views, and why they should be the main way we pass strings to functions
Copyright © 2024 - All Rights Reserved