Regular Expressions

Remark (Compatibility). Unless otherwise specified, all regular expressions below are Perl-Compatible Regular Expressions (PCRE).

Step 1: the purpose of regex

Regular expressions are used to find patterns in text. That’s it. The pattern might be something as simple as the word “dog” in this sentence:

The quick brown fox jumps over the lazy dog.

That regular expression looks like

dog

…easy enough, yeah?

The pattern could also be any word which contains an ‘o’. That regular expression might look like

\w*o\w*

(You can try that regex out here.)

You can see that as the requirements for a “match” get more complex, the regular expression gets more complex as well. There is extra notation to specify groups of characters and matching repeated patterns, which I’ll explain below.

But once we find a pattern in some text, what do we do with it? Well, modern regex engines allow you to extract those substrings from the contained text, or remove them, or replace them with other text. Regular expressions are used for text parsing and manipulation.
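As a rough illustration of those three operations (extract, replace, remove), here is a minimal sketch using Python's built-in re module. That choice is an assumption made for illustration: Python's engine is not PCRE, but it agrees with PCRE on the syntax used here. The variable names are hypothetical.

```python
import re

sentence = "The quick brown fox jumps over the lazy dog."

# Extract: every word that contains an 'o'.
words_with_o = re.findall(r"\w*o\w*", sentence)

# Replace: swap each occurrence of "dog" for "cat".
replaced = re.sub(r"dog", "cat", sentence)

# Remove: replace each occurrence of "lazy " with nothing.
removed = re.sub(r"lazy ", "", sentence)

print(words_with_o)
print(replaced)
print(removed)
```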

We might extract things that look like IP addresses, then try to ping them; or we might extract names and email addresses and file them in a database. Or we might use regex to find sensitive information (like Social Security numbers or phone numbers) in emails, and alert the user that they may be putting themselves at risk. Regex really is a versatile tool that is easy to learn, but difficult to master:

“Just as there is a difference between playing a musical piece well and making music, there is a difference between knowing about regular expressions and really understanding them.”

Jeffrey E. F. Friedl, Mastering Regular Expressions

Step 2: square brackets []

The easiest regular expressions to understand are those that simply look for a character-to-character match between the regex pattern and the target string, like:

pattern: cat
string:  The cat was cut when it ran under the car.
matches:     ^^^

(See this in action here.)

But we can also specify alternative matches using square brackets:

pattern: ca[rt]
string:  The cat was cut when it ran under the car.
matches:     ^^^                               ^^^

(See this in action here.)

Open-and-close square brackets tell the regex engine to match any one of the characters specified, but only one. The above regex won’t – for example – do what you might expect with the following setup:

pattern: ca[rt]
string:  The cat was cut when it ran under the cart.
matches:     ^^^                               ^^^

(See this in action here.)

When you use square brackets, you’re telling the regex engine to match on exactly one of the characters contained within the brackets. If the engine finds a c character, then an a character, but the next character isn’t r or t, it’s not a match. If it finds ca and then either r or t, it stops. It won’t continue and try to match more characters, because the square brackets indicate that only one of the contained characters should be searched for. When it finds the ca, then the r in cart, it stops, because it’s found a match on the sequence car.
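The same behaviour can be checked outside an online tester. The sketch below assumes Python's re module (not PCRE itself, though bracket expressions behave the same way here) and records where each match starts:

```python
import re

text = "The cat was cut when it ran under the cart."

# Each match is exactly three characters: "c", "a", then one of "r" or "t".
matches = [(m.start(), m.group()) for m in re.finditer(r"ca[rt]", text)]

print(matches)
```

The final match is car, taken from the first three letters of cart; the trailing t is left unmatched.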

Exercise 1 Can you write a regular expression to match all ten hads and Hads in this passage?

pattern:
string:  Jim, where Bill had had "had", had had "had had". "Had had" had been correct.
matches:                 ^^^ ^^^  ^^^   ^^^ ^^^  ^^^ ^^^    ^^^ ^^^  ^^^

(See one possible solution here.)

What about all of the animal names in the following sentence?

pattern:
string:  A bat, a cat, and a rat walked into a bar...
matches:   ^^^    ^^^        ^^^

(See one possible solution here.)

…or just the words bar and bat?

pattern:
string:  A bat, a cat, and a rat walked into a bar...
matches:   ^^^                                 ^^^

(See one possible solution here.)

You’re already writing more complex regular expressions and we’re only at Step 2! Let’s keep going!

Step 3: escape sequences

In the previous Step, we learned about square brackets [] and how they help us to provide alternative matches for the regex engine to find. But what if we want to match a literal open-and-close square bracket pair []?

You can't match [] using regex! You will regret this!

When we wanted a character-to-character match previously (like with the word cat), we would just type those characters exactly:

pattern: []
string:  You can't match [] using regex! You will regret this!
matches:

(See this in action here.)

This doesn’t seem to work, though. This is because the square bracket characters [ and ] are special characters that are usually used to denote something other than a simple character-to-character match. As we saw in Step 2, they’re used to provide alternative matches so the regex engine can match any one of the characters contained within them. If you don’t put any characters in between them, this can cause an error.

To match these special characters, we must escape them by preceding them with a backslash character \. The backslash character is another special character that tells the regex engine to treat the next character literally, and not as a special character. By preceding both the [ and the ] characters with a \ character, the regex engine will match each of them literally:

pattern: \[\]
string:  You can't match [] using regex! You will regret this!
matches:                 ^^

(See this in action here.)

If we want to match a literal \, we can escape it by preceding it with a second \:

pattern: \\
string:  C:\Users\Tanja\Pictures\Dogs
matches:   ^     ^     ^        ^

(See this in action here.)

Only special characters should be preceded by \ to force a literal match. All other characters are interpreted literally by default. For instance, the regular expression t matches only literal lowercase letter t characters:

pattern: t
string:  t  t   t   t
matches: ^  ^   ^   ^

(See this in action here.)

But the escape sequence \t is totally different. It matches tab characters:

pattern: \t
string:  t  t   t   t
matches:  ^  ^   ^

(See this in action here.)

Other common escape sequences include \n (UNIX-style line breaks) and \r (used in Windows-style line breaks, \r\n). \r is the “carriage return” character and \n is the “line feed” character, both of which were defined along with the ASCII standard when teletypes were still in common usage.
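A short sketch of the escapes covered in this step, assuming Python's re module as a stand-in for a PCRE engine. Python raw strings (r"...") are used so that each backslash reaches the regex engine unchanged; the variable names are illustrative only.

```python
import re

# Escaped brackets match a literal "[]".
brackets = re.findall(r"\[\]", "You can't match [] using regex! You will regret this!")

# An escaped backslash matches a literal backslash.
backslashes = re.findall(r"\\", r"C:\Users\Tanja\Pictures\Dogs")

# This target string holds real tab characters between the letters.
tabbed = "t\tt\tt\tt"
plain_t = re.findall(r"t", tabbed)   # literal letter t
tabs = re.findall(r"\t", tabbed)     # tab characters

print(brackets, len(backslashes), len(plain_t), len(tabs))
```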

Other common escape sequences will be covered later in this tutorial.

Exercise 2 Can you match this regex \[\] with a regex? Your goal should be something like:

pattern:
string:  ...match this regex `\[\]` with a regex?
matches:                      ^^^^

(See one possible solution here.)

Can you match all the escape sequences in this example?

pattern:
string:  `\r`, `\t`, and `\n` are all regex escape sequences.
matches:  ^^    ^^        ^^

(See one possible solution here.)

Step 4: the “any” character .

In writing your solutions to match the escape sequences we’ve seen so far, you may have been wondering… “can’t I just match a backslash character and then any other character following it?” Well, you can.

There’s another special character which is used to match (nearly) any character, and that’s the period / full stop character ..

pattern: .
string:  I'm sorry, Dave. I'm afraid I can't do that.
matches: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(See this in action here.)

If you want to match only patterns that look like escape sequences, you could do something like:

pattern: \\.
string:  Hi Walmart is my grandson there his name is "\n \r \t".
matches:                                              ^^ ^^ ^^

(See this in action here.)

And as with all special characters, if you want to match a literal ., you need to precede it with a \ character:

pattern: \.
string:  War is Peace. Freedom is Slavery. Ignorance is Strength.
matches:             ^                   ^                      ^

(See this in action here.)
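The "(nearly)" above matters: in most engines the dot does not match a line break unless a flag is set. A minimal sketch, assuming Python's re module (which shares this default with PCRE):

```python
import re

# A backslash followed by any one character.
escapes = re.findall(
    r"\\.", r'Hi Walmart is my grandson there his name is "\n \r \t".'
)

# An escaped dot matches only literal full stops.
periods = re.findall(r"\.", "War is Peace. Freedom is Slavery. Ignorance is Strength.")

# By default the dot skips the line break between "a" and "b".
no_newline = re.findall(r".", "a\nb")

print(escapes, len(periods), no_newline)
```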

Step 5: character ranges

What if you don’t want to match any character, though, but just letters? Or digits? Or vowels? Character classes and ranges allow us to achieve this.

`\n`, `\r`, and `\t` are whitespace characters, `\.`, `\\` and `\[` are not.

Characters are “whitespace” if they don’t create any visible mark within text. A space character ' ' is whitespace, as is a line break, or a tab. Suppose we want to match the escape sequences representing the whitespace characters \n, \r, and \t in the above passage, but not the other escape sequences. How could we do that?

pattern: \\[nrt]
string:  `\n`, `\r`, and `\t` are whitespace characters, `\.`, `\\` and `\[` are not.
matches:  ^^    ^^        ^^

(See this in action here.)

This works, but it’s not very elegant. What if we later need to match the escape sequence for the “form feed” character, \f? (This character is used to indicate page breaks in text.)

pattern: \\[nrt]
string:  `\n`, `\r`, `\t`, and `\f` are whitespace characters, `\.`, `\\` and `\[` are not.
matches:  ^^    ^^    ^^

(See this in action here.)

With this approach we need to list, individually, every lowercase letter we want to match within the square brackets. An easier way of accomplishing this is to use character ranges to match any lowercase letter:

pattern: \\[a-z]
string:  `\n`, `\r`, `\t`, and `\f` are whitespace characters, `\.`, `\\` and `\[` are not.
matches:  ^^    ^^    ^^        ^^

(See this in action here.)

Character ranges work the way you might expect, given the example above. Put the first and last letters you want to match in the square brackets, with a hyphen in between them. If you only want to match the letters a through m, for instance, you could do:

pattern: \\[a-m]
string:  `\n`, `\r`, `\t`, and `\f` are whitespace characters, `\.`, `\\` and `\[` are not.
matches:                        ^^

(See this in action here.)

If you want to match multiple ranges, just put them back-to-back in the square brackets:

pattern: \\[a-gq-z]
string:  `\n`, `\r`, `\t`, and `\f` are whitespace characters, `\.`, `\\` and `\[` are not.
matches:        ^^    ^^        ^^

(See this in action here.)

Other common character ranges include: A-Z and 0-9.
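The three range patterns from this step can be compared side by side. This is a sketch assuming Python's re module in place of a PCRE engine; the helper names are hypothetical.

```python
import re

passage = (
    r"`\n`, `\r`, `\t`, and `\f` are whitespace characters, "
    r"`\.`, `\\` and `\[` are not."
)

any_lower = re.findall(r"\\[a-z]", passage)      # backslash + any lowercase letter
first_half = re.findall(r"\\[a-m]", passage)     # backslash + a through m
two_ranges = re.findall(r"\\[a-gq-z]", passage)  # backslash + a-g or q-z

print(any_lower)
print(first_half)
print(two_ranges)
```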

Exercise 3 Hexadecimal numbers can contain digits 0-9 as well as letters A-F. When used to specify colours, “hex” codes can be as short as three characters. Create a regex to find valid hex codes in the list below:

pattern:
string:  1H8 4E2 8FF 0P1 T8B 776 42B G12
matches:     ^^^ ^^^         ^^^ ^^^

(See one possible solution here.)

Using character ranges, create a regex that will select only the lowercase consonants (non-vowel characters, including y) in the sentence below:

pattern:
string:  The walls in the mall are totally, totally tall.
matches:  ^  ^ ^^^  ^ ^^  ^ ^^  ^  ^ ^ ^^^  ^ ^ ^^^ ^ ^^

(See one possible solution here.)

Step 6: the “not” caret ^

My solution for that last problem is a bit long. It took 17 characters to say “get the whole alphabet except the vowels”. Surely there’s an easier way to do this. As it turns out, there is.

The “not” caret ^ allows us to specify characters and character ranges which the regex engine should not match on. An easier solution to the last pop quiz question above would be to match every character that’s not a vowel:

pattern: [^aeiou]
string:  The walls in the mall are totally, totally tall.
matches: ^^ ^^ ^^^^ ^^^^ ^^ ^^^ ^ ^^ ^ ^^^^^^ ^ ^^^^^ ^^^

(See this in action here.)

The caret ^ as the leftmost character inside the square brackets [] tells the regex engine to match one single character which is not within the square brackets. This means that the above regex also matches all spaces, the period ., the comma ,, and the capital T at the beginning of the sentence. To exclude those, we can put them inside of the square brackets, as well:

pattern: [^aeiou .,T]
string:  The walls in the mall are totally, totally tall.
matches:  ^  ^ ^^^  ^ ^^  ^ ^^  ^  ^ ^ ^^^  ^ ^ ^^^ ^ ^^

(See this in action here.)

Note that we don’t need to escape the . here. Many special characters within square brackets are treated literally, including the open [ – but not the close ] bracket character (can you see why?). The backslash \ character is also not treated literally. If you want to match on a literal backslash \ using square brackets, you have to escape it by preceding it with a second backslash \\. This behaviour must be allowed in order for whitespace characters to be matchable within square brackets:

pattern: [\t]
string:  t  t   t   t
matches:  ^  ^   ^

(See this in action here.)

The caret can be used with ranges, as well. If I wanted to only capture the characters a, b, c, x, y, and z, I could do:

pattern: [abcxyz]
string:  abcdefghijklmnopqrstuvwxyz
matches: ^^^                    ^^^

(See this in action here.)

…or, I could specify that I want any character not between d and w:

pattern: [^d-w]
string:  abcdefghijklmnopqrstuvwxyz
matches: ^^^                    ^^^

(See this in action here.)

Be careful with the “not” caret ^. It’s easy to think, “well, I said [^b-f], so I should get a lowercase letter a, or something after f.” That’s not the case. That regex will match any character not within that range, including digits, symbols, and whitespace.

pattern: [^d-w]
string:  abcdefg h.i,j-klmnopqrstuvwxyz
matches: ^^^    ^ ^ ^ ^             ^^^

(See this in action here.)
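To see the difference between listing characters and negating a range, here is a small sketch assuming Python's re module (negated bracket expressions behave the same way as in PCRE for this input):

```python
import re

alphabet = "abcdefg h.i,j-klmnopqrstuvwxyz"

# Explicit list: only these six letters can match.
listed = re.findall(r"[abcxyz]", alphabet)

# Negated range: anything that is not d through w, including
# the space and the punctuation characters.
negated = re.findall(r"[^d-w]", alphabet)

print(listed)
print(negated)
```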


Exercise 4 Use the “not” caret ^ within square brackets to match all of the words below that don’t end with a y:

pattern:
string:  day dog hog hay bog bay ray rub
matches:     ^^^ ^^^     ^^^         ^^^

(See one possible solution here.)

Write a regex using a range and a “not” caret ^ to find all the years between 1977 and 1982 (inclusive) below:

pattern:
string:  1975 1976 1977 1978 1979 1980 1981 1982 1983 1984
matches:           ^^^^ ^^^^ ^^^^ ^^^^ ^^^^ ^^^^

(See one possible solution here.)

Write a regex to match all characters below that aren’t a literal caret ^ character:

pattern:
string:  abc1^23*()
matches: ^^^^ ^^^^^

(See one possible solution here.)

Step 7: character classes

Even easier than character ranges are character classes. Different regex engines have different available classes, so I’ll only cover the basics here. (Check which version of regex you’re using, because there may be more – or different – classes available than those shown here.)

Character classes work very similarly to ranges, but you can’t specify the “start” and “end” values:

class   description          characters
\d      “digits”             [0-9]
\w      “word characters”    [A-Za-z0-9_]
\s      “whitespace”         [ \t\r\n\f]

The \w word character class is particularly useful, as this set of characters is often required for valid identifiers (variable and function names, etc.) in various programming languages.

We can use \w to simplify this regex that we saw previously:

pattern: \\[a-z]
string:  `\n`, `\r`, `\t`, and `\f` are whitespace characters, `\.`, `\\` and `\[` are not.
matches:  ^^    ^^    ^^        ^^

With \w, we can instead write:

pattern: \\\w
string:  `\n`, `\r`, `\t`, and `\f` are whitespace characters, `\.`, `\\` and `\[` are not.
matches:  ^^    ^^    ^^        ^^

(See this in action here.)
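The sketch below checks each class against its bracketed equivalent. It assumes Python's re module rather than PCRE, and it passes the re.ASCII flag because Python's \d, \w, and \s otherwise also match non-ASCII digits, letters, and whitespace, which goes beyond the table above. The sample string is made up for illustration.

```python
import re

sample = "id_7 = 42;\tx\n"

digits = re.findall(r"\d", sample, re.ASCII)
word_chars = re.findall(r"\w", sample, re.ASCII)
spaces = re.findall(r"\s", sample, re.ASCII)

# The same searches written with explicit bracket expressions.
digits_explicit = re.findall(r"[0-9]", sample)
word_explicit = re.findall(r"[A-Za-z0-9_]", sample)

print(digits, word_chars, spaces)
```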

Exercise 5 In the Java programming language, an identifier (the name of a variable, class, function, etc.) must start with a letter a-zA-Z, a dollar sign $, or an underscore _. The remainder of the characters must be word characters \w. Using one or more character classes, create a regex to find valid Java identifiers among the following 3-character sequences:

pattern:
string:  __e $12 .x2 foo Bar 3mm
matches: ^^^ ^^^     ^^^ ^^^

(See one possible solution here.)

United States Social Security Numbers (SSNs) are 9-digit numbers in the format XXX-XX-XXXX, where each X can be any digit [0-9]. Using one or more character classes, write a regex to find the properly-formatted SSNs in the list below:

pattern:
string:  113-25=1902 182-82-0192 H23-_3-9982 1I1-O0-E38B
matches:             ^^^^^^^^^^^

(See one possible solution here.)

Step 8: the asterisk * and the plus sign +

So far, we’ve more or less gotten away with only matching on strings of a set length. But in the last pop quiz, we were getting close to the limit of what we can do with the notation we’ve seen so far.

Suppose, for instance, that we weren’t limited to 3-character Java identifiers, but we could have identifiers of any length. A solution which may have worked on the previous example will not work for the following example:

pattern: [a-zA-Z_$]\w\w
string:  __e $123 3.2 fo Barr a23mm ab x
matches: ^^^ ^^^         ^^^  ^^^

(See this in action here.)

Notice how, when the identifier is valid but longer than 3 characters, only the first three characters are matched. And when the identifier is valid but shorter than 3 characters, it isn’t matched at all!

The problem is that bracketed expressions [] match exactly one character, as do the character classes like \w. This means any matches on the above regex must be exactly three characters long. So it doesn’t work as we might have hoped.

The special characters * and + can help here. These are modifiers (usually called quantifiers) which can be added to the right of any expression to match that expression a variable number of times.

The Kleene star (or “asterisk”), *, will match the preceding token any number of times, including zero. The “plus sign” +, will match one or more times. So an expression which precedes a + is mandatory (at least once), while an expression which precedes a * is optional, but when it does appear, it can appear any number of times.

With this knowledge, we can fix the above regex:

pattern: [a-zA-Z_$]\w*
string:  __e $123 3.2 fo Barr a23mm ab x
matches: ^^^ ^^^^     ^^ ^^^^ ^^^^^ ^^ ^

(See this in action here.)

We’re now matching on valid identifiers of any length! Success!

What would have happened if we’d used + above instead of *?

pattern: [a-zA-Z_$]\w+
string:  __e $123 3.2 fo Barr a23mm ab x
matches: ^^^ ^^^^     ^^ ^^^^ ^^^^^ ^^

(See this in action here.)

We dropped the last match, x. This is because + requires at least one character to match, but since the bracketed [] expression preceding \w+ already “ate” the x character, there are no characters remaining, so the match fails.

When would we use +? When we want at least one match, but don’t care how many times we match a given expression. For instance, maybe we want to match any numbers containing a decimal point:

pattern: \d*\.\d+
string:  0.011 .2 42 2.0 3.33 4.000 5 6 7.89012
matches: ^^^^^ ^^    ^^^ ^^^^ ^^^^^     ^^^^^^^

(See this in action here.)

Notice how – by making numbers to the left of the decimal point optional – we were able to match both 0.011 and .2. But we required exactly one decimal point with \. and at least one digit to the right of the decimal point with \d+. The above regex wouldn’t match a number like 3., though, because we require at least one digit to the right of the decimal point.
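The three patterns from this step can be run against the same inputs to confirm the difference between * and +. This is a sketch assuming Python's re module as a stand-in for PCRE; the variable names are illustrative.

```python
import re

identifiers = "__e $123 3.2 fo Barr a23mm ab x"

# * allows zero further word characters, so the lone "x" still matches.
with_star = re.findall(r"[a-zA-Z_$]\w*", identifiers)

# + demands at least one further word character, so "x" is dropped.
with_plus = re.findall(r"[a-zA-Z_$]\w+", identifiers)

# Optional digits, a mandatory decimal point, then at least one digit.
decimals = re.findall(r"\d*\.\d+", "0.011 .2 42 2.0 3.33 4.000 5 6 7.89012")

print(with_star)
print(with_plus)
print(decimals)
```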

Exercise 6 Match all of the English words in the passage below.

pattern:
string:  3 plus 3 is six but 4 plus three is 7
matches:   ^^^^   ^^ ^^^ ^^^   ^^^^ ^^^^^ ^^

(See one possible solution here.)

Match all of the file sizes in the list below. File sizes will be composed of a number (with or without a decimal point), followed by KB, MB, GB, or TB:

pattern:
string:  11TB 13 14.4MB 22HB 9.9GB TB 0KB
matches: ^^^^    ^^^^^^      ^^^^^    ^^^

(See one possible solution here.)

Step 9: the “optional” question mark ?

If you haven’t yet, try to write a regex to solve that last pop quiz question. Did it work? Now try it here:

pattern:
string:  1..3KB 5...GB ..6TB
matches:

Obviously, none of these are valid file sizes, so a good regex shouldn’t match any of them. The solution I wrote for the last pop quiz question matches all of them, at least in part:

pattern: \d+\.*\d*[KMGT]B
string:  1..3KB 5...GB ..6TB
matches: ^^^^^^ ^^^^^^   ^^^

(See this in action here.)

What’s the problem? We only really want one decimal point, if any. But * allows any number of matches, including zero. Is there any way to only match zero times or once? But no more than once? There is.

The “optional” question mark ? is a modifier that matches the preceding token zero times or once, but no more:

pattern: \d+\.?\d*[KMGT]B
string:  1..3KB 5...GB ..6TB
matches:    ^^^          ^^^

(See this in action here.)

We’re getting closer to a correct solution here, but we’re still not quite there: parts of two invalid strings are still matched. We’ll see how to fix this in a few steps.
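A quick comparison of the two file-size patterns, sketched with Python's re module (an assumption; the tutorial itself targets PCRE, which treats these patterns the same way):

```python
import re

bad_sizes = "1..3KB 5...GB ..6TB"

# \.* accepts any number of decimal points.
with_star = re.findall(r"\d+\.*\d*[KMGT]B", bad_sizes)

# \.? accepts at most one decimal point.
with_question = re.findall(r"\d+\.?\d*[KMGT]B", bad_sizes)

print(with_star)
print(with_question)
```

The second pattern no longer swallows the runs of dots, but it still finds the valid-looking tails 3KB and 6TB inside the invalid strings.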

Exercise 7 In some programming languages (like Java), “long integers” and floating-point numbers can be followed by l/L and f/F to indicate that they should be treated as longs / floats (respectively) rather than the usual ints / doubles. Find all of the valid longs in the line below:

pattern:
string:  13L long 2l 19 L lL 0
matches: ^^^      ^^ ^^      ^

(See one possible solution here.)

Step 10: the “or” pipe |

We had some difficulties earlier with matching various kinds of floating point numbers:

pattern: \d*\.\d+
string:  0.011 .2 42 2.0 3.33 4.000 5 6 7.89012
matches: ^^^^^ ^^    ^^^ ^^^^ ^^^^^     ^^^^^^^

The above pattern matches numbers with a decimal point, and at least one digit to the right of the decimal point. But what if we also want to match strings like 0.? (With no numbers to the right of the decimal point.)

We could write a regex like:

pattern: \d*\.\d*
string:  0.011 .2 42 2.0 3.33 4.000 5 6 7.89012 0. .
matches: ^^^^^ ^^    ^^^ ^^^^ ^^^^^     ^^^^^^^ ^^ ^

(See this in action here.)

That matches 0., but it also matches just a single ., as you can see above. Really, what we’re trying to match on above are two different classes of strings:

  1. numbers with at least one digit to the right of a decimal point, and
  2. numbers with at least one digit to the left of a decimal point

These two regexes could be written independently as, respectively:

pattern: \d*\.\d+
string:  0.011 .2 42 2.0 3.33 4.000 5 6 7.89012 0. .
matches: ^^^^^ ^^    ^^^ ^^^^ ^^^^^     ^^^^^^^

pattern: \d+\.\d*
string:  0.011 .2 42 2.0 3.33 4.000 5 6 7.89012 0. .
matches: ^^^^^       ^^^ ^^^^ ^^^^^     ^^^^^^^ ^^

We can see that in neither case are the strings 42, 5, 6, or . matched. What we want is the union of these two regexes. How can we achieve that?

The “or” pipe | allows us to specify multiple possible match sequences in a regular expression. Similar to how [] let us specify alternate single characters, with the “or” pipe |, we can specify alternate multi-character expressions.

For instance, if we wanted to match either “dog” or “cat”, we could write:

pattern: \w\w\w
string:  Obviously, a dog is a better pet than a cat.
matches: ^^^^^^^^^    ^^^      ^^^^^^ ^^^ ^^^    ^^^

(See this in action here.)

…but this matches all three-character sequences. “dog” and “cat” don’t even have any letters in common, so we can’t use square brackets to help us here. The easiest regex we could write which matches both of – and only – these two words, is:

pattern: dog|cat
string:  Obviously, a dog is a better pet than a cat.
matches:              ^^^                        ^^^

(See this in action here.)

The regex engine first attempts to match the entire sequence to the left of the pipe |, but if that fails, it then attempts to match the sequence to the right of the pipe. Multiple pipes can be chained together to match on more than two alternative sequences:

pattern: dog|cat|pet
string:  Obviously, a dog is a better pet than a cat.
matches:              ^^^             ^^^        ^^^

(See this in action here.)
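The sketch below contrasts the overly broad \w\w\w with the alternation patterns. It assumes Python's re module, whose | operator tries alternatives from left to right in the same way as PCRE.

```python
import re

sentence = "Obviously, a dog is a better pet than a cat."

# Any three consecutive word characters.
too_broad = re.findall(r"\w\w\w", sentence)

# Only the listed words, in the order they appear in the sentence.
two_words = re.findall(r"dog|cat", sentence)
three_words = re.findall(r"dog|cat|pet", sentence)

print(too_broad)
print(two_words)
print(three_words)
```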

Exercise 8 Use the “or” pipe |, to fix the decimal regex given above:

pattern:
string:  0.011 .2 42 2.0 3.33 4.000 5 6 7.89012 0. .
matches: ^^^^^ ^^    ^^^ ^^^^ ^^^^^     ^^^^^^^ ^^

(See one possible solution here.)

Use the “or” pipe |, character classes, the “optional” question mark ?, and more to create a single regex that matches both long integers and floats, as discussed in the pop quiz at the end of the previous step (this is a really tough one):

pattern:
string:  42L 12 x 3.4f 6l 3.3 0F L F .2F 0.
matches: ^^^ ^^   ^^^^ ^^ ^^^ ^^     ^^^ ^^

(See one possible solution here.)

Step 11: parentheses () for capturing groups

In that last pop quiz question, we were able to capture different kinds of integral and floating-point numerical values. But the regex engine made no distinction between those two kinds of values, since everything was captured in a single, big regular expression.

We can tell the regex engine to make a distinction between different kinds of matches by surrounding them with parentheses:

pattern: ([A-Z])|([a-z])
string:  The current President of Bolivia is Evo Morales.
matches: ^^^ ^^^^^^^ ^^^^^^^^^ ^^ ^^^^^^^ ^^ ^^^ ^^^^^^^
group:   122 2222222 122222222 22 1222222 22 122 1222222

(See this in action here.)

The above regex defines two capturing groups, which are indexed starting from 1. The first capturing group matches any single uppercase letter, and the second capturing group matches any single lowercase letter. Using the “or” pipe | and the “capturing group” parentheses () we can define a single regular expression which matches multiple kinds of strings.

If we apply this to our long/float regex above, the regex engine will capture the appropriate matches within the appropriate groups. By checking which group a string was matched into, we can tell immediately whether it’s a float value or a long value:

pattern: (\d*\.\d+[fF]|\d+\.\d*[fF]|\d+[fF])|(\d+[lL])
string:  42L 12 x 3.4f 6l 3.3 0F L F .2F 0.
matches: ^^^      ^^^^ ^^     ^^     ^^^
group:   222      1111 22     11     111

(See one possible solution here.)

This regular expression is pretty complex, but you should now be able to understand every part of it. Let’s break it apart so we can review each of these symbols:

(               // match any "float" string
  \d*\.\d+[fF]
  |
  \d+\.\d*[fF]
  |
  \d+[fF]
)
|               // OR
(               // match any "long" string
  \d+[lL]
)

The “or” pipe | and parenthetical capturing groups () allow us to match on different kinds of strings. In this case, we’re matching either “float” floating-point numbers or “long” long integer numbers.

(
  \d*\.\d+[fF]  // 1+ digits to the right of the decimal point
  |
  \d+\.\d*[fF]  // 1+ digits to the left of the decimal point
  |
  \d+[fF]       // no decimal point, only 1+ digits
)
|
(
  \d+[lL]       // no decimal point, only 1+ digits
)

Within the “float” capturing group, we have three options – numbers with at least 1 digit to the right of the decimal point, numbers with at least one digit to the left of the decimal point, and numbers with no decimal point. Any of these are “floats”, provided they have an f or an F appended to the end.

Within the “long” capturing group, we only have a single option – we must have 1 or more digits, followed by an l or an L character.

The regex engine will look for these substrings within the given string, and index them within the appropriate capturing group.

Note that we don’t match on any numbers which don’t have one of l, L, f, or F appended. What should these numbers be categorised as? Well, if they have a decimal point, the default is double in the Java language. Otherwise, they should be ints.
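In code, "checking which group a string was matched into" might look like the sketch below. It assumes Python's re module, where a match object's lastindex attribute gives the number of the last capturing group that took part in the match. The labels dictionary is a hypothetical helper, not part of any library.

```python
import re

pattern = re.compile(r"(\d*\.\d+[fF]|\d+\.\d*[fF]|\d+[fF])|(\d+[lL])")

# Map each capturing group number to a human-readable label.
labels = {1: "float", 2: "long"}

text = "42L 12 x 3.4f 6l 3.3 0F L F .2F 0."

# Only one of the two groups takes part in each match, so
# lastindex identifies the kind of literal that was found.
classified = [(m.group(), labels[m.lastindex]) for m in pattern.finditer(text)]

print(classified)
```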

Exercise 9 Add two more capturing groups to the above regex so that it also categorises numbers as double or int. (This is another tough one, don’t be discouraged if it takes a while or you need to peek at my solution.)

pattern:
string:  42L 12 x 3.4f 6l 3.3 0F L F .2F 0.
matches: ^^^ ^^   ^^^^ ^^ ^^^ ^^     ^^^ ^^
group:   333 44   1111 33 222 11     111 22

(See one possible solution here.)

Here’s a slightly easier one. Use parenthetical capturing groups (), the “or” pipe |, and character ranges to sort the following ages into “legal to drink in the U.S.” (>= 21) and “illegal to drink in the U.S.” (< 21) groups:

pattern:
string:  7 10 17 18 19 20 21 22 23 24 30 40 100 120
matches: ^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^^ ^^^
group:   2 22 22 22 22 22 11 11 11 11 11 11 111 111

(See one possible solution here.)

Step 12: define more specific matches first

You may have had some trouble with that last pop quiz question if you tried to define “legal drinkers” as the second capturing group rather than the first. To see why, let’s look at a different example. Suppose we want to capture – separately – surnames with fewer than 4 characters and surnames with 4 or more characters. If we make the shorter names the first capturing group, watch what happens:

pattern: ([A-Z][a-z]?[a-z]?)|([A-Z][a-z][a-z][a-z]+)
string:  Kim Jobs Xu Cloyd Mohr Ngo Rock
matches: ^^^ ^^^  ^^ ^^^   ^^^  ^^^ ^^^
group:   111 111  11 111   111  111 111

(See this in action here.)

By default, the modifiers we’ve seen so far (?, *, and +) are greedy: each one grabs as many characters as it can. But greediness applies within a single alternative, not across alternatives. In a backtracking engine like PCRE, the alternatives around an “or” pipe | are tried from left to right, and the first one that succeeds at the current position wins, even if a later alternative could have matched more. So even though the second group above could have captured more characters in names like “Jobs” and “Cloyd”, the first group has already matched the first three characters of these names, and those characters can’t be captured again by the second group.

This is a simple fix, though – just switch the order of the capturing groups, putting the more specific (longer) one first:

pattern: ([A-Z][a-z][a-z][a-z]+)|([A-Z][a-z]?[a-z]?)
string:  Kim Jobs Xu Cloyd Mohr Ngo Rock
matches: ^^^ ^^^^ ^^ ^^^^^ ^^^^ ^^^ ^^^^
group:   222 1111 22 11111 1111 222 1111

(See this in action here.)
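Both orderings can be compared directly. The sketch below assumes Python's re module, which tries alternatives left to right in the same way as PCRE; the groups helper is hypothetical and simply pairs each match with the number of the group that captured it.

```python
import re

names = "Kim Jobs Xu Cloyd Mohr Ngo Rock"


def groups(pattern, text):
    """Return (matched text, capturing group number) for every match."""
    return [(m.group(), m.lastindex) for m in re.finditer(pattern, text)]


# Shorter names first: group 1 always wins and truncates longer names.
short_first = groups(r"([A-Z][a-z]?[a-z]?)|([A-Z][a-z][a-z][a-z]+)", names)

# Longer names first: each name lands in the intended group.
long_first = groups(r"([A-Z][a-z][a-z][a-z]+)|([A-Z][a-z]?[a-z]?)", names)

print(short_first)
print(long_first)
```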

Exercise 10 “More specific” almost always means “longer”. Suppose we want to capture two kinds of “words”: those that begin with vowels (more specific) vs. those that don’t (any other word). How could we write a regex to capture and identify strings that match those two groups? (Groups below are lettered rather than numbered. You must determine which group should be matched first.)

pattern:
string:  pds6f uub 24r2gp ewqrty l ui_op
matches: ^^^^^ ^^^ ^^^^^^ ^^^^^^ ^ ^^^^^
group:   NNNNN VVV NNNNNN VVVVVV N VVVVV

(See one possible solution here.)

In general, the more precise your regular expression is, the longer it will be. And the more precise it is, the less likely it is that you’ll capture something you don’t want. So even though they can look intimidating, longer regexes ~= better regexes. Unfortunately.

Step 13: curly braces {} for defined repetition

In the surnames example from the previous step, we had a pretty repetitive-looking regular expression:

pattern: ([A-Z][a-z][a-z][a-z]+)|([A-Z][a-z]?[a-z]?)
string:  Kim Jobs Xu Cloyd Mohr Ngo Rock
matches: ^^^ ^^^^ ^^ ^^^^^ ^^^^ ^^^ ^^^^
group:   222 1111 22 11111 1111 222 1111

For the first group, we wanted surnames with four or more letters. The second group was meant to capture surnames with three or fewer letters. Is there not a simpler way to write this, rather than repeating those [a-z] groups over and over again? There is, using curly braces {}.

Curly braces {} allow us to specify a minimum and (optionally) maximum number of times the preceding character or capturing group should be matched. There are three possibilities with {}:

{X}   // match exactly X times
{X,}  // match >= X times
{X,Y} // match >= X and <= Y times

Here are examples of those three different syntaxes:

pattern: [a-z]{11}
string:  humuhumunukunukuapua'a
matches: ^^^^^^^^^^^

(See this in action here.)

pattern: [a-z]{18,}
string:  humuhumunukunukuapua'a
matches: ^^^^^^^^^^^^^^^^^^^^

(See this in action here.)

pattern: [a-z]{11,18}
string:  humuhumunukunukuapua'a
matches: ^^^^^^^^^^^^^^^^^^

(See this in action here.)

There are a few things to notice in the above examples. First, using the {X} notation, the preceding character or group will be matched exactly that number (X) of times. If there are more characters that could have matched if X were greater (as shown in the first example), they will not be included in the match. If there are fewer than X characters, the entire match fails (try changing 11 to 99 in the first example).

Second, both the {X,} and the {X,Y} notations are greedy. They will match as many characters as they can while still satisfying the defined regular expression. If you say {3,7} – between 3 and 7 characters can be matched – and the next 7 characters are valid, then all 7 characters will be matched. If you say {1,}, but the next 14,000 characters all match, then all 14,000 of those characters will be included in the matched string.
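The three curly-brace forms can be checked by looking at the length of each match. This is a sketch assuming Python's re module, where {X}, {X,}, and {X,Y} behave as described above:

```python
import re

fish = "humuhumunukunukuapua'a"

exactly = re.findall(r"[a-z]{11}", fish)     # exactly 11 letters
at_least = re.findall(r"[a-z]{18,}", fish)   # 18 or more letters
between = re.findall(r"[a-z]{11,18}", fish)  # 11 to 18 letters

# Asking for more letters than are available makes the match fail.
too_many = re.findall(r"[a-z]{99}", fish)

print([len(m) for m in exactly])
print([len(m) for m in at_least])
print([len(m) for m in between])
print(too_many)
```

Only 20 consecutive lowercase letters appear before the apostrophe, so the greedy {18,} takes all 20 while {11,18} stops at its upper limit.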

So how can we use this to rewrite our expression above? A really simple improvement could be to replace the adjacent [a-z] groups with [a-z]{N}, where N is appropriately chosen:

pattern: ([A-Z][a-z]{2}[a-z]+)|([A-Z][a-z]?[a-z]?)

…but that doesn’t make it all that much nicer. Look at the first capturing group: we have [a-z]{2} (matches exactly 2 lowercase letters) followed by [a-z]+ (matches 1 or more lowercase letters). We can simplify this by asking for 3 or more lowercase letters, using curly braces:

pattern: ([A-Z][a-z]{3,})|([A-Z][a-z]?[a-z]?)

The second capturing group is different. We want at most three characters in these surnames: one capital letter, followed by at most two lowercase letters. That means the lowercase letters have an upper limit of two, but their lower limit is zero:

pattern: ([A-Z][a-z]{3,})|([A-Z][a-z]{0,2})

Now, specificity is always better when using regexes, so we would be wise to stop here, but I can’t help but notice that those two character ranges ([A-Z] and [a-z]) right next to each other almost look like the “word character” class, \w ([A-Za-z0-9_]). If we’re sure that our data only contains nicely-formatted surnames, we could simplify our regex and write just:

pattern: (\w{4,})|(\w{1,3})

The first group captures any sequence of 4 or more word characters ([A-Za-z0-9_]) and the second group captures any sequence of 1 to 3 word characters (inclusive). Does it work?

pattern: (\w{4,})|(\w{1,3})
string:  Kim Jobs Xu Cloyd Mohr Ngo Rock
matches: ^^^ ^^^^ ^^ ^^^^^ ^^^^ ^^^ ^^^^
group:   222 1111 22 11111 1111 222 1111

(See this in action here.)

It does! How about that. And so much cleaner than our original example. Since the first capturing group matches all surnames with four or more characters, we could even change the second capturing group to just \w+, since this would capture all remaining surnames (those with 1, 2, or 3 characters):

pattern: (\w{4,})|(\w+)
string:  Kim Jobs Xu Cloyd Mohr Ngo Rock
matches: ^^^ ^^^^ ^^ ^^^^^ ^^^^ ^^^ ^^^^
group:   222 1111 22 11111 1111 222 1111

(See this in action here.)

Concise!
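If you want to check which group caught each surname, a small Python sketch (using the standard re module, which is an assumption on my part since this guide otherwise targets PCRE) might look like this:

```python
import re

names = "Kim Jobs Xu Cloyd Mohr Ngo Rock"
pattern = re.compile(r"(\w{4,})|(\w+)")

# m.lastindex is the number of the group that matched:
# 1 for surnames with four or more characters, 2 for the shorter ones.
grouped = [(m.group(), m.lastindex) for m in pattern.finditer(names)]

for surname, group_number in grouped:
    print(surname, group_number)
```

Because the alternatives are tried left to right, the second group only ever sees surnames that the first group has already rejected.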

Exercise 11 Use curly braces {} to rewrite the Social Security Number regex from Step 7:

pattern:
string:  113-25=1902 182-82-0192 H23-_3-9982 1I1-O0-E38B
matches:             ^^^^^^^^^^^

(See one possible solution here.)

Suppose a password strength verification system on a website requires user passwords to be between 6 and 12 non-whitespace characters. Write a regex that flags bad passwords in the list below. Each password is contained within parentheses () for easy regexing, so make sure your regex starts and ends with literal ( and ) characters. (Hint: make sure you disallow literal parentheses in the passwords with [^()] or similar, otherwise you might end up matching the entire line!)

pattern:
string:  (12345) (my password) (Xanadu.2112) (su_do) (OfSalesmen!)
matches: ^^^^^^^ ^^^^^^^^^^^^^               ^^^^^^^

(See one possible solution here.)

Step 14: \b, the zero-width boundary character

The last pop quiz question was tough. But what if we made it even more difficult by surrounding the passwords with quotes "" instead of parentheses ()? Could we write a similar solution by simply replacing all of the literal ( and ) characters with "?

pattern: \"[^"]{0,5}\"|\"[^"]+\s[^"]*\"
string:  "12345" "my password" "Xanadu.2112" "su_do" "OfSalesmen!"
matches: ^^^^^^^ ^^^^^^^^^^^^^             ^^^     ^^^

(See this in action here.)

This fails pretty spectacularly. Can you see why?

The problem is that we’re looking for bad passwords here. "Xanadu.2112" is a good password, so when the regex realizes that it is longer than five characters and doesn’t contain any whitespace, it gives up, just before the " character which bounds the password on the right-hand side. (It can’t read past that ", because we specified that " characters can’t be found within passwords, using [^"].)

Once the regex engine is sure that those characters don’t match the defined regex, it starts again, at exactly the place it left off – which is the " which bounds "Xanadu.2112" on the right. From there, it sees a single space character, and another " – a bad password! So it matches " ", and continues.
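You can watch this happen with a few lines of Python (a sketch using the standard re module; I've dropped the backslashes before the quotes, since " isn't a special character in a regex):

```python
import re

passwords = '"12345" "my password" "Xanadu.2112" "su_do" "OfSalesmen!"'
naive = r'"[^"]{0,5}"|"[^"]+\s[^"]*"'

found = re.findall(naive, passwords)
print(found)  # ['"12345"', '"my password"', '" "', '" "']
```

The last two "passwords" are really the gaps between a closing quote and the next opening quote.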

It would be really nice if we could specify that the first character of a password must be non-whitespace. Is there a way to do that? (You should know by now that the answer to all of my rhetorical questions is “yes”.) Yes! There is!

Many regex engines provide the “word boundary” escape sequence \b. \b is a zero-width escape sequence which matches, funnily enough, the boundary of a word. Remember that when we say “word”, we mean any sequence of characters in the class \w, a.k.a. [a-zA-Z0-9_].

Matching on a word boundary means that, at the position of the \b sequence, there must be a word character on one side and a non-word character (or the start or end of the string) on the other. But we don’t actually include that neighbouring character in our matched string. To see how this works, let’s look at a small example:

pattern: \b[^ ]+\b
string:  Ve still vant ze money, Lebowski.
matches: ^^ ^^^^^ ^^^^ ^^ ^^^^^  ^^^^^^^^

(See this in action here.)

The sequence [^ ] should match any character that’s not the literal space character. So why doesn’t it match the , after money or the . after Lebowski? It’s because , and . are not word characters, so there are boundaries created between word characters and non-word characters. These appear between the y of money and the , that follows it and between the i of Lebowski and the full stop / period which follows it. The regex matches on those word boundaries (but not the non-word characters which help to define them).

What would happen if we didn’t include the \b sequence?

pattern: [^ ]+
string:  Ve still vant ze money, Lebowski.
matches: ^^ ^^^^^ ^^^^ ^^ ^^^^^^ ^^^^^^^^^

(See this in action here.)

Aha, now we do match those punctuation marks.
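Here is the same comparison as a short Python sketch (standard re module, chosen by me for illustration), in case you'd rather see the matched strings than the carets:

```python
import re

line = "Ve still vant ze money, Lebowski."

# With \b on both sides, every match must start and end on a word character.
with_boundaries = re.findall(r"\b[^ ]+\b", line)

# Without \b, trailing punctuation is swept up along with the word.
without_boundaries = re.findall(r"[^ ]+", line)

print(with_boundaries)     # ['Ve', 'still', 'vant', 'ze', 'money', 'Lebowski']
print(without_boundaries)  # ['Ve', 'still', 'vant', 'ze', 'money,', 'Lebowski.']
```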

So now let’s use word boundaries to help fix our quoted passwords regex:

pattern: \"\b[^"]{0,5}\b\"|\"\b[^"]+\s[^"]*\b\"
string:  "12345" "my password" "Xanadu.2112" "su_do" "OfSalesmen!"
matches: ^^^^^^^ ^^^^^^^^^^^^^               ^^^^^^^

(See this in action here.)

By placing word boundaries “inside” the quotes ("\b...\b"), we’re effectively saying that the first and last characters of the matched passwords must be “word” characters. So this works fine here, but won’t work as nicely if the first or last character of a user’s password is not a word character:

pattern: \"\b[^"]{0,5}\b\"|\"\b[^"]+\s[^"]*\b\"
string:  "thefollowingpasswordistooshort" "C++"
matches:

(See this in action here.)

See how the second password is not flagged as “invalid”, even though it’s clearly too short? You need to be careful with \b sequences, since they only match boundaries between \w and non-\w characters. In the above example, because we allowed non-\w characters in passwords, the boundary between the " and the first/last character of the password is not guaranteed to be a word boundary, \b.
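Both behaviours, the fix and the blind spot, can be reproduced with a small Python sketch (standard re module; the variable names are mine):

```python
import re

bounded = r'"\b[^"]{0,5}\b"|"\b[^"]+\s[^"]*\b"'

passwords = '"12345" "my password" "Xanadu.2112" "su_do" "OfSalesmen!"'
flagged = re.findall(bounded, passwords)
print(flagged)  # ['"12345"', '"my password"', '"su_do"']

# "C++" ends in a non-word character, so there is no \b between + and the closing quote.
tricky = '"thefollowingpasswordistooshort" "C++"'
missed = re.findall(bounded, tricky)
print(missed)   # []
```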

Exercise 12 Word boundaries are useful in syntax highlighting engines, where we want to match on a particular sequence of characters, but we want to ensure that they only occur at the beginning or the end of a word (or by themselves entirely). Suppose we are writing a syntax highlighter and we want to highlight the word var, but only when it appears on its own (not touching any other word characters). Can you write a regex to do that?

pattern:
string:  var varx _var (var j) barvarcar *var var-> {var}
matches: ^^^            ^^^               ^^^ ^^^    ^^^

(See one possible solution here.)

Step 15: the “start of line” caret ^ and “end of line” dollar sign $

The \b word boundary sequence from the last Step is not the only zero-width special sequence available for use in regular expressions. Two of the more popular ones are the “start of line” caret ^ and the “end of line” dollar sign $. Including one of these in your regular expression means that the match must appear at the beginning or end of a line within the string you’re trying to match on:

pattern: ^start|end$
string:  start end start end start end start end
matches: ^^^^^                               ^^^

(See this in action here.)

If your string includes line breaks, then ^start will match the sequence start at the beginning of any line and end$ will match the sequence end at the end of any line (though those are difficult to show here). These characters are particularly useful when working with delimited data.
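Since multi-line strings are hard to draw with carets, here is a sketch in Python instead. One caveat: in Python's re module (and, as far as I know, in PCRE too) ^ and $ only match at every line break when multi-line mode is switched on; otherwise they anchor to the whole string. The three-line sample text is made up for the demonstration.

```python
import re

text = "start end\nstart end\nstart end"

# Default mode: ^ is the start of the string, $ is the end of the string.
whole_string = re.findall(r"^start|end$", text)

# Multi-line mode: ^ and $ also match at the start and end of each line.
per_line = re.findall(r"^start|end$", text, flags=re.MULTILINE)

print(whole_string)  # ['start', 'end']
print(per_line)      # ['start', 'end', 'start', 'end', 'start', 'end']
```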

Let’s revisit the “file size” problem from Step 9, using the “start of line” caret. In this example, our file sizes are delimited (separated) by space characters ’ ’. So we want every file size to begin with a digit that’s preceded by a space character or the beginning of a line:

pattern: (^| )(\d+|\d+\.\d+)[KMGT]B
string:  6.6KB 1..3KB 12KB 5G 3.3MB KB .6.2TB 9MB
matches: ^^^^^       ^^^^^   ^^^^^^          ^^^^
groups:  222         122     1222            12

(See this in action here.)

We’re so close! You can see we still have one small problem, where we’re matching on the space character before valid file sizes. Now, we could just ignore that capturing group (1), when our regex engine finds it, or we could use a non-capturing group, which we’ll see in the next Step.
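To see exactly what lands in each group, here is a sketch in Python's re module using the pattern (^| )(\d+|\d+\.\d+)[KMGT]B. Printing with repr makes the stray leading space visible:

```python
import re

sizes = "6.6KB 1..3KB 12KB 5G 3.3MB KB .6.2TB 9MB"
pattern = re.compile(r"(^| )(\d+|\d+\.\d+)[KMGT]B")

# (whole match, group 1 = delimiter, group 2 = number)
found = [(m.group(0), m.group(1), m.group(2)) for m in pattern.finditer(sizes)]

for whole, delimiter, number in found:
    print(repr(whole), repr(delimiter), repr(number))
```

Group 1 is an empty string for the first file size (it matched the start of the line) and a single space for the others.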

Exercise 13 Continuing our syntax highlighting example from the last Step, some syntax highlighters will mark trailing spaces – that is, any whitespace which comes between a non-whitespace character and the end of the line. Can you write a regex highlighting rule for trailing spaces?

pattern:
string:  myvec <- c(1, 2, 3, 4, 5)
matches:                          ^^^^^^^

(See one possible solution here.)

A simple Comma-Separated Values (CSV) parser will look for “tokens”, separated by commas. Generally, whitespace is not significant unless it’s inside of quotes "". Can you write a simple CSV-parsing regex which matches tokens between commas but ignores (doesn’t capture) non-quoted whitespace?

pattern:
string:  a, "b", "c d",e,f,   "g h", dfgi,, k, "", l
matches: ^^ ^^^^ ^^^^^^^^^^   ^^^^^^ ^^^^^^ ^^ ^^^ ^
groups:  21 2221 2222212121   222221 222211 21 221 2

(See one possible solution here.)

Step 16: non-capturing groups (?:)

In two examples in the previous Step, we captured text where we really didn’t need to. In the “file sizes” challenge, we grabbed the space characters before the first digit of the file sizes, and in the “CSV” challenge, we captured the commas between each token. We don’t need to capture these characters, but we do need to use them to structure our regular expression. These are perfect use cases for the non-capturing group, (?:).

A non-capturing group does exactly what it sounds like – it allows you to group characters, and use them in your regular expressions, but it doesn’t capture them within a numbered group:

pattern: (?:")([^"]+)(?:")
string:  I only want "the text inside these quotes".
matches:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
groups:               1111111111111111111111111111

(See this in action here.)

Now, the regular expression matched the text in the quotes, as well as the quote characters themselves, but the capturing group only captured the text within the quotes. Why would we ever want to do this?

Well, most regex engines allow you to recover the text from the capturing groups defined within your regexes. If we can trim off the extra characters that we don’t need, by not including them in our capturing groups, it makes it easier to parse and manipulate the text later.
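In Python, for example, that recovery looks something like the sketch below (standard re module; group(0) is the whole match and group(1) is the first capturing group):

```python
import re

sentence = 'I only want "the text inside these quotes".'
m = re.search(r'(?:")([^"]+)(?:")', sentence)

whole = m.group(0)      # the full match, quotes included
inner = m.group(1)      # only the captured text
n_groups = m.re.groups  # non-capturing groups are not counted

print(whole)     # "the text inside these quotes"
print(inner)     # the text inside these quotes
print(n_groups)  # 1
```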

Here’s another example, cleaning up the CSV parser from the previous Step:

pattern: (?:^|,)\s*(?:\"([^",]*)\"|([^", ]*))
string:  a, "b", "c d",e,f,   "g h", dfgi,, k, "", l
matches: ^   ^    ^^^  ^ ^     ^^^   ^^^^   ^      ^
groups:  2   1    111  2 2     111   2222   2      2

(See this in action here.)

There are a few things to note here. First, we no longer capture the comma delimiters, since we changed the (^|,) capturing group to a (?:^|,) non-capturing group. Second, we’ve nested a capturing group inside of a non-capturing group. This is useful when, for instance, you need a group of characters to appear in a particular order, but you only care about a subset of those characters.

In our case, we needed non-quote, non-comma characters [^",]* to appear within quotes, but we don’t actually care about the quote characters themselves, so there was no need to capture them.

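Finally, note that there are also zero-length captures in the above example. (The matches row marks only the captured characters; each full match also includes its delimiter, whitespace, and quotes.) The "" in between the k and l characters is part of a matched substring, but there are no characters between the quotation marks (which we don’t capture), so the captured text contains no characters (has length zero). The same thing happens for the empty field between the two adjacent commas after dfgi.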
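Here is a rough sketch of how you might turn that regex into a list of fields in Python (standard re module). It is a toy, not a real CSV parser, and the variable names are mine:

```python
import re

row = 'a, "b", "c d",e,f,   "g h", dfgi,, k, "", l'
token = re.compile(r'(?:^|,)\s*(?:"([^",]*)"|([^", ]*))')

fields = []
for m in token.finditer(row):
    # Group 1 holds quoted fields and group 2 holds unquoted ones.
    # Whichever group did not take part in the match is None.
    fields.append(m.group(1) if m.group(1) is not None else m.group(2))

print(fields)
# ['a', 'b', 'c d', 'e', 'f', 'g h', 'dfgi', '', 'k', '', 'l']
```

The two empty strings in the output are the zero-length captures.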

Exercise 14 Using non-capturing groups (and capturing groups, and character classes, etc.), write a regex which captures only the correctly-formatted file sizes in the string below:

pattern:
string:  6.6KB 1..3KB 12KB 5G 3.3MB KB .6.2TB 9MB
matches: ^^^^^       ^^^^^   ^^^^^^          ^^^^
groups:  11111        1111    11111           111

(See one possible solution here.)

Opening HTML tags begin with a < character and end with a > character. Closing HTML tags begin with a </ character sequence and end with a > character. The name of the tag is contained within those characters. Can you write a regex to capture only the names in the following tags? (You may be able to get away with solving this without using non-capturing groups. Try solving it two ways! Once using non-capturing groups and once without.)

pattern:
string:  <p> </span> <div> </kbd> <link>
matches: ^^^ ^^^^^^^ ^^^^^ ^^^^^^ ^^^^^^
groups:   1    1111   111    111   1111

(Here’s a solution with non-capturing groups.)

(Here’s a solution without non-capturing groups.)

Step 17: backreferences \N and named capturing groups

Even though I warned you in the introduction that trying to build an HTML parser with regex usually leads to heartache, the last example is a good segue into another (sometimes) useful feature of most regex engines: backreferences.

Backreferences are similar to repeated groups, in that you can try to capture the same text twice. But they differ in one important aspect – they will only capture the exact same text, character-for-character.

So while a repeated group would allow us to capture something like

pattern: (he(?:[a-z])+)
string:  heyabcdefg hey heyo heyellow heyyyyyyyyy
matches: ^^^^^^^^^^ ^^^ ^^^^ ^^^^^^^^ ^^^^^^^^^^^
groups:  1111111111 111 1111 11111111 11111111111

(See this in action here.)

…a backreference would only match

pattern: (he([a-z])(\2+))
string:  heyabcdefg hey heyo heyellow heyyyyyyyyy
matches:                              ^^^^^^^^^^^
groups:                               11233333333

(See this in action here.)

Repeated capturing groups are useful for when you want to match the same pattern repeatedly, while backreferences are good for when you want to match the exact same text. For instance, we could use a backreference to try to find matching open-and-close HTML tags:

pattern: <(\w+)[^>]*>[^<]+<\/\1>
string:  <span style="color: red">hey</span>
matches: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
groups:   1111

(See this in action here.)

Please note that this is an extremely over-simplified example and I strongly recommend you do not try to write a regex-based HTML parser. It’s a very complex syntax and you’ll likely have a bad time.
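With that warning in mind, here is a small sketch of both backreference examples in Python's re module, which uses the same \N syntax. The mismatched </div> string is my own addition, to show what a backreference rejects:

```python
import re

words = "heyabcdefg hey heyo heyellow heyyyyyyyyy"

# A repeated group re-applies the pattern, so each repetition can be a different letter.
repeated = [m.group(0) for m in re.finditer(r"(he(?:[a-z])+)", words)]

# A backreference re-uses the text that group 2 captured, so only repeats of that letter match.
same_letter = [m.group(0) for m in re.finditer(r"(he([a-z])(\2+))", words)]

# \1 must be identical to the tag name captured by (\w+).
tag = re.compile(r"<(\w+)[^>]*>[^<]+</\1>")
matching = tag.search('<span style="color: red">hey</span>')
mismatched = tag.search('<span style="color: red">hey</div>')

print(repeated)           # all five words
print(same_letter)        # ['heyyyyyyyyy']
print(matching.group(1))  # span
print(mismatched)         # None
```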

Named capturing groups are closely related to backreferences, so I’ll briefly cover them here as well. The only difference between a numbered capturing group and a named capturing group is that… a named capturing group is named, so you can refer back to it by name instead of by number:

pattern: <(?<tag>\w+)[^>]*>[^<]+<\/(?P=tag)>
string:  <span style="color: red">hey</span>
matches: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
groups:   1111

(See this in action here.)

You can create a named capturing group with the (?<name>...) or (?'name'...) syntax (.NET-compatible regex) or with the (?P<name>...) syntax (Python-compatible regex). Since we’re using PCRE (Perl-compatible regex), which supports all of these, we can use any of them here.

To repeat a named capturing group later in the regex, we use \k<name> or \k'name' (.NET) or (?P=name) (Python). Again, PCRE supports all of these different varieties. You can read more about named capturing groups here, but that’s most of what you really need to know about them.
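As a sketch of what this buys you in practice, here is the tag example in Python's re module. Note that I've switched to the (?P<tag>...) spelling throughout, because that is the form I know Python's re module accepts:

```python
import re

tag = re.compile(r"<(?P<tag>\w+)[^>]*>[^<]+</(?P=tag)>")
m = tag.search('<span style="color: red">hey</span>')

by_name = m.group("tag")  # look the group up by its name...
by_number = m.group(1)    # ...or by its number; named groups are still numbered
as_dict = m.groupdict()   # all named groups at once

print(by_name, by_number, as_dict)  # span span {'tag': 'span'}
```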

Exercise 15 Use backreferences to help me remember… uh… that person’s name.

pattern:
string:  "Hi my name's Joe." [later] "What's that guy's name? Joe?"
matches:        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
groups:                111

(See one possible solution here.)

Step 18: lookaheads and lookbehinds

We’re getting into some more advanced features of regular expressions now. Everything up to and including Step 16, I use on a fairly regular basis. But these last few Steps are only for people who use regex very seriously to match very complex expressions. In other words, regex masters.

Lookaheads and lookbehinds may seem quite complex the first time you see them, but they’re actually not too difficult. They allow you to do something similar to what we were doing with non-capturing groups earlier – checking if some text exists immediately before or immediately after the actual text that we want to match. For instance, suppose we want to match the names of things that people love, but only if they’re very enthusiastic about it (only if they end their sentence with an exclamation point). We could do something like:

pattern: (\w+)(?=!)
string:  I like desk. I appreciate stapler. I love lamp!
matches:                                           ^^^^
groups:                                            1111

(See this in action here.)

You can see how the above capturing group (\w+), which would normally match on any of the words in the passage, only matches the word lamp. The positive lookahead (?=...) means that we can only match sequences which end in a ! character, but we do not actually match that exclamation point character itself. This is an important distinction, because with non-capturing groups, we match the character, but don’t capture it. With lookaheads and lookbehinds, we use the character to build our regex, but then we do not even match on the character. We’re free to match on it later in our regular expression.
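The difference between “matched but not captured” and “not matched at all” is easy to see in a short Python sketch (standard re module; group(0) is the whole match):

```python
import re

text = "I like desk. I appreciate stapler. I love lamp!"

# Lookahead: the ! must be there, but it is not part of the match.
ahead = re.search(r"(\w+)(?=!)", text)

# Non-capturing group: the ! is part of the match, just not part of group 1.
grouped = re.search(r"(\w+)(?:!)", text)

print(ahead.group(0), ahead.group(1))      # lamp lamp
print(grouped.group(0), grouped.group(1))  # lamp! lamp
```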

There are four kinds of lookaheads and lookbehinds, in total: the positive lookahead (?=...), the negative lookahead (?!...), the positive lookbehind (?<=...), and the negative lookbehind (?<!...). They do what they sound like – positive lookahead and lookbehind will allow the regex engine to continue matching only when the text contained within the lookahead/lookbehind does match. Negative lookahead and lookbehind do the opposite – they only allow the regex to match when the text contained within the lookahead/lookbehind does not match.
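Here are all four side by side, as a quick Python sketch. The sample string and the patterns are made up purely for illustration; each pattern matches a single digit, and only the lookaround changes:

```python
import re

text = "cat1 cat2 dog3 dog4"

followed_by_space = re.findall(r"\d(?= )", text)      # positive lookahead
not_followed_by_space = re.findall(r"\d(?! )", text)  # negative lookahead
after_cat = re.findall(r"(?<=cat)\d", text)           # positive lookbehind
not_after_cat = re.findall(r"(?<!cat)\d", text)       # negative lookbehind

print(followed_by_space)      # ['1', '2', '3']
print(not_followed_by_space)  # ['4']
print(after_cat)              # ['1', '2']
print(not_after_cat)          # ['3', '4']
```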

For example, we might want to only match method names in a chained sequence of methods, and not the object on which they’re operating. In this case, each method name should be preceded by a literal . character. A regex using a simple lookbehind could help here:

pattern: (?<=\.)(\w+)
string:  myArray.flatMap.aggregate.summarise.print
matches:         ^^^^^^^ ^^^^^^^^^ ^^^^^^^^^ ^^^^^
groups:          1111111 111111111 111111111 11111

(See this in action here.)

In the above text, we match any sequence of word characters \w+, but only if they’re preceded by a literal .. We could have achieved something similar using non-capturing groups, but it’s a little messier:

pattern: (?:\.)(\w+)
string:  myArray.flatMap.aggregate.summarise.print
matches:        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
groups:          1111111 111111111 111111111 11111

(See this in action here.)

Even though it’s shorter, it’s matching on characters that we don’t want. Although this example may seem trivial, lookaheads and lookbehinds can really help us clean up our regular expressions.
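Here is the same comparison as a Python sketch (standard re module), collecting the full text of each match:

```python
import re

chain = "myArray.flatMap.aggregate.summarise.print"

# Lookbehind: the dot is required but never becomes part of the match.
with_lookbehind = [m.group(0) for m in re.finditer(r"(?<=\.)(\w+)", chain)]

# Non-capturing group: the dot is matched, it just isn't captured.
with_noncapturing = [m.group(0) for m in re.finditer(r"(?:\.)(\w+)", chain)]

print(with_lookbehind)    # ['flatMap', 'aggregate', 'summarise', 'print']
print(with_noncapturing)  # ['.flatMap', '.aggregate', '.summarise', '.print']
```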

Exercise 16 The negative lookbehind (?<!...) will only allow the regex engine to continue attempting to find a match if the text contained within the negative lookbehind does not appear prior to the remainder of the text to be matched. For example, we might want to use a regex to match only the surnames of the women attending a conference. To do that, we would want to make sure that a person’s surname is not preceded by a Mr.. Can you write a regex to do that? (You can assume surnames are at least four characters long.)

pattern:
string:  Mr. Brown, Ms. Smith, Mrs. Jones, Miss Daisy, Mr. Green
matches:                ^^^^^       ^^^^^       ^^^^^
groups:                 11111       11111       11111

(See one possible solution here.)

Suppose we’re cleaning a database and we have a column of information which is meant to be a percentage. Unfortunately, some people have written the numbers as decimal values on the range [0.0, 1.0], while others have written percentages on the range [0.0%, 100.0%], and still others have written the percentage values, but have forgotten the literal percent sign %. Using a negative lookahead (?!...), can you flag only the values which are meant to be percentages, but which are missing their % signs? These should be values strictly greater than 1.00, but without a trailing %. (No number has more than three digits before, or two digits after, the decimal point.)

Note that this problem is extremely difficult. If you can solve it without peeking at my answer, you have formidable regex skills.

pattern:
string:  0.32 100.00 5.6 0.27 98% 12.2% 1.01 0.99% 0.99 13.13 1.10
matches:      ^^^^^^ ^^^                ^^^^            ^^^^^ ^^^^
groups:       111111 111                1111            11111 1111

(See one possible solution here.)

Step 19: conditionals

We’re now at the point at which most people would stop using regular expressions. We’ve covered probably 95% of the use cases for simple regexes, and anything done in Steps 19 and 20 is usually performed by a more full-featured text manipulation language like awk or sed (or a general-purpose programming language). Still, let’s carry on, just so you know what regex is really capable of.

Although regular expressions are not Turing-complete, some regex engines offer features which get awfully close to looking like a full programming language. One such feature is the conditional. Regex conditionals allow if-then-else statements, where the branch that’s taken is determined by either a lookahead or a lookbehind, which we learned about in the last Step.

For instance, you might want to match only valid entries in a list of dates:

pattern: (?<=Feb )([1-2][0-9])|(?<=Mar )([1-2][0-9]|3[0-1])
string:  Dates worked: Feb 28, Feb 29, Feb 30, Mar 30, Mar 31
matches:                   ^^      ^^              ^^      ^^
groups:                    11      11              22      22

(See this in action here.)

Notice how the groups above are also indexed according to month. We could write a regex for all 12 months and capture only valid dates, which would then be captured into groups indexed by the month of the year.

The above uses a sort of if-like structure in that it will only match in the first group if "Feb " precedes the number (and similarly for the second). But what if we only wanted special treatment for February? Something like “if the number is preceded by "Feb ", do this, else do this other thing”. That’s what conditionals do:

pattern: (?(?<=Feb )([1-2][0-9])|([1-2][0-9]|3[0-1]))
string:  Dates worked: Feb 28, Feb 29, Feb 30, Mar 30, Mar 31
matches:                   ^^      ^^              ^^      ^^
groups:                    11      11              22      22

(See this in action here.)

The if-then-else structure looks like (?(if)then|else), where (if) is replaced by a lookahead or lookbehind. In the example above, (if) is (?<=Feb ). You can see that we matched on dates greater than 29, but only if they didn’t follow "Feb ". Using lookbehinds in conditionals is useful when you want to ensure a match is preceded by some text.
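Not every engine supports this. Python's built-in re module, for instance, only accepts conditionals that test whether a numbered or named group took part in the match, not a lookaround. If you are stuck with such an engine, one workaround (my substitution, not the conditional syntax itself) is to guard each branch with its own lookbehind: a positive one for the “then” branch and a negative one for the “else” branch.

```python
import re

dates = "Dates worked: Feb 28, Feb 29, Feb 30, Mar 30, Mar 31"

# "If preceded by 'Feb ', allow 10-29; else allow 10-31", written as two guarded alternatives.
pattern = re.compile(r"(?<=Feb )([1-2][0-9])|(?<!Feb )([1-2][0-9]|3[0-1])")

# m.lastindex tells us which branch matched: 1 for the Feb branch, 2 for everything else.
found = [(m.group(0), m.lastindex) for m in pattern.finditer(dates)]

print(found)  # [('28', 1), ('29', 1), ('30', 2), ('31', 2)]
```

Feb 30 is rejected by both branches: the first because 30 is out of range, the second because of the negative lookbehind.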

Positive lookahead conditionals can be confusing, because the condition itself doesn’t consume any text. So if you want the “then” branch to ever match, it needs to be able to match the same text that the lookahead just checked for, as shown below:

pattern: (?(?=exact)exact|else)wo
string:  exact else exactwo elsewo
matches:            ^^^^^^^ ^^^^^^

(See this in action here.)

This means that positive lookahead conditionals are kind of useless. You’re checking to see if that text is ahead, and then providing a match template to follow when it is. The conditional isn’t helping us at all. You might as well just replace the above with a simpler regex:

pattern: (?:exact|else)wo
string:  exact else exactwo elsewo
matches:            ^^^^^^^ ^^^^^^

(See this in action here.)

So, rule of thumb with conditionals: test, test, test. Things that you may think are obvious will fail in exciting and unexpected ways.

Exercise 17 Write a regular expression which uses a negative lookahead conditional to check if the next word starts with a capital letter. If it does, only capture a single capital letter, followed by lowercase letters. If it doesn’t, capture any word characters.

pattern:
string:  Jones Smith 9sfjn Hobbes 23r4tgr9h CSV Csv vVv
matches: ^^^^^ ^^^^^ ^^^^^ ^^^^^^ ^^^^^^^^^     ^^^ ^^^
groups:  22222 22222 11111 222222 111111111     222 111

(See one possible solution here.)

Write a negative lookbehind conditional which only captures the text owns when it is not preceded by the text cl, and which only captures the text ouds when it is preceded by the text cl. (A bit of a contrived example, but what can you do.)

pattern:
string:  Those clowns owns some clouds. ouds.
matches:              ^^^^        ^^^^

(See one possible solution here.)

Step 20: recursion and further learning

There is only so much that can be squeezed into a 20-step introduction to any topic, really, and regular expressions are no exception. There are many different implementations of and standards for regular expressions, which you can find peppered around the Internet. If you’re interested in learning more, I suggest you check out the wonderful site regular-expressions.info. It’s a fantastic reference and I’ve certainly learned a lot about regex from it. I also recommend regex101.com for testing and sharing your creations.

I’ll leave you with one parting bit of knowledge about regex: how to write recursive expressions.

Simple recursions are quite easy, really, but let’s think about what that means in the context of a regular expression. The syntax for a simple recursion in a regular expression is (?R), which we usually follow with a ? to make the recursion optional: (?R)?. Of course, this syntax must appear within the expression itself. So what we’re doing is nesting the expression inside itself, an arbitrary number of times. For example:

pattern: (hey(?R)?oh)
string:  heyoh heyyoh heyheyohoh hey oh heyhey heyheyheyohoh
matches: ^^^^^        ^^^^^^^^^^                  ^^^^^^^^^^
groups:  11111        1111111111                  1111111111

(See this in action here.)

Since the nested expression is optional ((?R) is followed by a ?), the simplest match is to just ignore the recursion completely. So hey followed by oh (heyoh) matches. To match any expression more complex than that, we must find this matched substring nested inside itself at the point in the expression at which we inserted the (?R) sequence. In other words, we could find heyheyohoh or heyheyheyohohoh, and so on.
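Python's built-in re module has no (?R), so the sketch below fakes it by unrolling the recursion by hand to a fixed depth. The expand helper is hypothetical (my own, purely for illustration), and unlike real recursion it can only handle as many levels of nesting as you unroll:

```python
import re

def expand(depth: int) -> str:
    """Hypothetical helper: unroll hey(?R)?oh to a fixed nesting depth."""
    pattern = "heyoh"  # depth 0: the recursion is skipped entirely
    for _ in range(depth):
        # Paste the pattern built so far into the spot where (?R)? would sit.
        pattern = f"hey(?:{pattern})?oh"
    return pattern

text = "heyoh heyyoh heyheyohoh hey oh heyhey heyheyheyohoh"
found = re.findall(expand(3), text)

print(expand(1))  # hey(?:heyoh)?oh
print(found)      # ['heyoh', 'heyheyohoh', 'heyheyohoh']
```

The last match is the heyheyohoh at the tail of heyheyheyohoh: three heys need three ohs, and only two are available.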

One cool thing about these nested expressions is that, unlike backreferences and named capturing groups, they don’t limit you to matching the exact text that you matched previously, character-for-character. For instance:

pattern: ([Hh][Ee][Yy](?R)?oh)
string:  heyoh heyyoh hEyHeYohoh hey oh heyhey hEyHeYHEyohohoh
matches: ^^^^^        ^^^^^^^^^^               ^^^^^^^^^^^^^^^
groups:  11111        1111111111               111111111111111

(See this in action here.)

You can imagine that the regex engine is literally copying and pasting your regular expression inside of itself an arbitrary number of times. Of course, this means that sometimes it might not do what you might have hoped:

pattern: ((?:\(\*)[^*)]*(?R)?(?:\*\)))
string:  (* comment (* nested *) not *)
matches:            ^^^^^^^^^^^^
groups:             111111111111

(See this in action here.)