Of the many new features that Ruby 2.0 shipped back in 2013, the one I paid least attention to was the new regular expression engine, Onigmo. After all, regular expressions are regular expressions - why should I care how Ruby implements them?
As it turns out, the Onigmo regex engine has a few neat tricks up its sleeve, including the ability to use conditionals inside your regular expressions.
In this post we'll dive into regex conditionals, learn about the quirks of Ruby's implementation of them, and discuss a few tricks to work around Ruby's limitations. Let's get started!
Groups & Capturing
To understand conditionals in regular expressions, you first need to understand grouping and capturing.
Imagine that you have a list of US cities:
Fayetteville, AR
Seattle, WA
You'd like to separate the city name from the state abbreviation. One way to do this is to perform multiple matches:
PLACE = "Fayetteville, AR"
# City: Match any char that's not a comma
PLACE.match(/[^,]+/)
# => #<MatchData "Fayetteville">
# Separator: Match a comma and optional spaces
PLACE.match(/, */)
# => #<MatchData ", ">
# State: Match a 2-letter code at the end of the string.
PLACE.match(/[A-Z]{2}$/)
# => #<MatchData "AR">
This works, but it's too verbose. By using groups, you can capture both the city and state with only one regular expression.
So let's combine the regular expressions above, and surround each section with parentheses. Parens are how you group things in regular expressions.
PLACE = "Fayetteville, AR"
m = PLACE.match(/([^,]+)(, *)([A-Z]{2})/)
# => #<MatchData "Fayetteville, AR" 1:"Fayetteville" 2:", " 3:"AR">
As you can see, the expression above captures both city and state. You access them by treating MatchData like an array:
m[1]
# => "Fayetteville"
m[3]
# => "AR"
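To make the "like an array" comparison concrete, here is a small sketch using `MatchData#to_a` and `MatchData#captures` from Ruby's core library. The constant names are made up for this example; the pattern is the one used above.

```ruby
# Illustrative constant names; not part of the original example.
CITY_STATE = /([^,]+)(, *)([A-Z]{2})/
PLACE_MATCH = "Fayetteville, AR".match(CITY_STATE)

# Index 0 holds the whole match; the groups follow in order.
p PLACE_MATCH.to_a     # => ["Fayetteville, AR", "Fayetteville", ", ", "AR"]

# captures drops the whole match and returns only the groups.
p PLACE_MATCH.captures # => ["Fayetteville", ", ", "AR"]

# Destructuring ties each variable to a group's position in the pattern.
city, _separator, state = PLACE_MATCH.captures
puts "#{city} is in #{state}"
```

Note that every variable in the destructuring depends on the order of the groups in the pattern.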
The problem with grouping, as it's done above, is that the captured data is put into an array. If its position in the array changes, you have to update your code or you've just introduced a bug.
For example, we might decide that it's silly to capture the ", " characters. So we remove the parens around that part of the regular expression:
m = PLACE.match(/([^,]+), *([A-Z]{2})/)
# => #<MatchData "Fayetteville, AR" 1:"Fayetteville" 2:"AR">
m[3]
# => nil
But now m[3] no longer contains the state - bug city.
Named groups
You can make regular expression groups a lot more semantic by naming them. The syntax is pretty similar to what we just used. We surround the regex in parens, and specify the name like so:
/(?<groupname>regex)/
If we apply this to our city/state regular expression, we get:
m = PLACE.match(/(?<city>[^,]+), *(?<state>[A-Z]{2})/)
# => #<MatchData "Fayetteville, AR" city:"Fayetteville" state:"AR">
And we can access the captured data by treating MatchData like a hash:
m[:city]
# => "Fayetteville"
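For a slightly fuller picture of the hash-like access, the sketch below uses `MatchData#names` and `MatchData#named_captures` (the latter assumes Ruby 2.4 or newer). The constant names are illustrative only.

```ruby
# Illustrative constant names; the pattern is the named-group one from above.
NAMED_PLACE = /(?<city>[^,]+), *(?<state>[A-Z]{2})/
NAMED_MATCH = "Seattle, WA".match(NAMED_PLACE)

p NAMED_MATCH[:state]   # => "WA" (symbol key)
p NAMED_MATCH["state"]  # => "WA" (string keys work too)
p NAMED_MATCH.names     # => ["city", "state"]

# named_captures returns a Hash keyed by group name (string keys).
captures = NAMED_MATCH.named_captures
puts captures["city"]
```

Because the groups are looked up by name, reordering them or adding new groups does not break this code.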
Conditionals
Conditionals in regular expressions take the form /(?(A)X|Y)/. Here are a few valid ways to use them:
# If A is true, then evaluate the expression X, else evaluate Y
/(?(A)X|Y)/
# If A is true, then X
/(?(A)X)/
# If A is false, then Y
/(?(A)|Y)/
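As a minimal sketch of the "if A, then X" form, the made-up pattern below requires a closing ">" only when an opening "<" was captured. It assumes Ruby 2.4+ for `Regexp#match?`; the constant name is hypothetical.

```ruby
# Hypothetical example pattern: an optional "<" must be closed by ">".
#   (<)?     optionally capture "<" as group 1
#   \w+      the tag name
#   (?(1)>)  if group 1 was captured, require a closing ">"
TAG = /\A(<)?\w+(?(1)>)\z/

puts TAG.match?("<tag>") # true:  opened and closed
puts TAG.match?("tag")   # true:  neither opened nor closed
puts TAG.match?("<tag")  # false: opened but never closed
puts TAG.match?("tag>")  # false: closed but never opened
```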
Two of the most common options for your condition, A, are:
- Has a named or numbered group been captured?
- Does a look-around evaluate to true?
Let's look at how to use them:
Has a group been captured?
To check for the presence of a group, use the ?(n) syntax, where n is an integer, or a group name surrounded by <> or ''.
# Has group number 1 been captured?
/(?(1)foo|bar)/
# Has a group named "mygroup" been captured?
/(?(<mygroup>)foo|bar)/
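Here is a hedged sketch of the named-group form with both branches filled in. The pattern and constant name are invented for illustration: a value is either wrapped in double quotes or is a single bare word.

```ruby
# Hypothetical pattern: accept either "quoted text" or a bare word.
#   (?<quote>")?   optionally capture an opening quote
#   (?(<quote>)    was the group named "quote" captured?
#   [^"]*"         if so, match up to and including the closing quote
#   |\w+           if not, match a single bare word
QUOTED_OR_BARE = /\A(?<quote>")?(?(<quote>)[^"]*"|\w+)\z/

puts QUOTED_OR_BARE.match?('"hello world"') # true
puts QUOTED_OR_BARE.match?('hello')         # true
puts QUOTED_OR_BARE.match?('hello world')   # false: bare words can't contain spaces
puts QUOTED_OR_BARE.match?('"hello')        # false: the quote is never closed
```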
Example
Imagine you're parsing US telephone numbers. These numbers have a three-digit area code that is optional unless the number starts with one.
1-800-555-1212 # Valid
800-555-1212 # Valid
555-1212 # Valid
1-555-1212 # INVALID!!
We can use a conditional to make the area code a requirement only if the number starts with 1.
# This regular expression looks complex, but it's made of simple pieces
# `^(1-)?` Does the string start with "1-"? If so, capture it as group 1
# `(?(1)` Was anything captured in group one?
# `\d{3}-` if so, do a required match of three digits and a dash (the area code)
# `|(\d{3}-)?` if not, do an optional match of three digits and a dash (area code)
# `\d{3}-\d{4}` match the rest of the phone number, which is always required.
re = /^(1-)?(?(1)\d{3}-|(\d{3}-)?)\d{3}-\d{4}/
"1-800-555-1212".match(re)
#=> #<MatchData "1-800-555-1212" 1:"1-" 2:nil>
"800-555-1212".match(re)
#=> #<MatchData "800-555-1212" 1:nil 2:"800-">
"555-1212".match(re)
#=> #<MatchData "555-1212" 1:nil 2:nil>
"1-555-1212".match(re)
#=> nil
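If you want to use this pattern for validation, one possible approach is to wrap it in a small predicate method. The sketch below is an assumption on my part rather than part of the original example: the method name `valid_phone?` is hypothetical, and the pattern is anchored with `\A` and `\z` so that the whole string has to match.

```ruby
# Same conditional as above, anchored at both ends (an added assumption:
# we want to validate the entire string, not find a number inside it).
PHONE = /\A(1-)?(?(1)\d{3}-|(\d{3}-)?)\d{3}-\d{4}\z/

# Hypothetical helper; returns true or false.
def valid_phone?(number)
  PHONE.match?(number)
end

puts valid_phone?("1-800-555-1212") # true
puts valid_phone?("800-555-1212")   # true
puts valid_phone?("555-1212")       # true
puts valid_phone?("1-555-1212")     # false
puts valid_phone?("555-12129999")   # false: trailing digits are rejected
```

The end anchor matters here: without it, a pattern like the one above would still match the leading "555-1212" of "555-12129999".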
Limitations
One problem with group-based conditionals is that matching a group "consumes" those characters in the string. Once consumed, they are no longer available for the conditional's branches to match.
For example, the following code tries and fails to match 100 if the text "USD" is present:
"100USD".match(/(USD)(?(1)\d+)/) # nil
In Perl and some other languages, you can add a look-ahead statement to your conditional. This lets you trigger the conditional based on text anywhere in the string. But Ruby doesn't have this, so we have to get a little creative.
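To see that the failure comes from the order in which characters are consumed, compare the same pattern against a string where "USD" comes first. The constant name is illustrative.

```ruby
# The pattern from the failing example above, unchanged.
CURRENCY_THEN_DIGITS = /(USD)(?(1)\d+)/

# "USD" comes first: after it is consumed, the cursor sits right before "100".
p "USD100".match(CURRENCY_THEN_DIGITS) # => #<MatchData "USD100" 1:"USD">

# "USD" comes last: once it is consumed, nothing is left for \d+ to match.
p "100USD".match(CURRENCY_THEN_DIGITS) # => nil
```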
Look-around
Fortunately, we can work around the limitations in Ruby's regex conditionals by abusing look-around expressions.
What is a look-around?
Normally, the regular expression engine steps through your string from the beginning to the end looking for matches. It's like moving the cursor from left to right in a word processor.
Look-ahead and look-behind expressions work a little differently. They let you inspect the string without consuming any characters. When they're done, the cursor is left in the exact same spot it was at the beginning.
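One way to observe that the cursor does not move is to compare where a match ends with and without a look-ahead. This is a small sketch using `MatchData#post_match` and `MatchData#end`; the constant names are made up.

```ruby
# Illustrative names: one pattern consumes " dollars", the other only peeks at it.
CONSUMING = "100 dollars".match(/\d+ dollars/)
PEEKING   = "100 dollars".match(/\d+(?= dollars)/)

p CONSUMING[0]         # => "100 dollars"
p CONSUMING.post_match # => ""

p PEEKING[0]           # => "100"
p PEEKING.post_match   # => " dollars"
p PEEKING.end(0)       # => 3
```

In the second match, " dollars" had to be present for the match to succeed, but it is left over in the rest of the string.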
For a great introduction to look-arounds, check out Rexegg's guide to mastering look ahead and look behind.
The syntax looks like so:
Type | Syntax | Example |
---|---|---|
Look Ahead | (?=query) | \d+(?= dollars) matches 100 in "100 dollars" |
Negative Look Ahead | (?!query) | \d+(?! dollars) matches 100 in "100 pesos", where the number is NOT followed by " dollars" |
Look Behind | (?<=query) | (?<=lucky )\d matches 7 in "lucky 7" |
Negative Look Behind | (?<!query) | (?<!furious )\d matches 7 in "lucky 7" |
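The four look-around types can be tried out quickly with `String#[]`, which returns the matched text or nil. The last line shows a backtracking gotcha with negative look-ahead that is easy to miss.

```ruby
# Look-ahead: digits followed by " dollars"
p "100 dollars"[/\d+(?= dollars)/]  # => "100"

# Negative look-ahead: digits NOT followed by " dollars"
p "100 pesos"[/\d+(?! dollars)/]    # => "100"

# Look-behind: a digit preceded by "lucky "
p "lucky 7"[/(?<=lucky )\d/]        # => "7"

# Negative look-behind: a digit NOT preceded by "furious "
p "lucky 7"[/(?<!furious )\d/]      # => "7"
p "furious 7"[/(?<!furious )\d/]    # => nil

# Gotcha: \d+ can backtrack to "10", which is followed by "0", not " dollars"
p "100 dollars"[/\d+(?! dollars)/]  # => "10"
```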
Abusing look-arounds to enhance conditionals
In our conditional, we can only query the existence of groups that have already been set. Normally, this means that the content of the group has been consumed and isn't available to the conditional.
But you can use a look-ahead to set a group without consuming any characters! Is your mind blown yet?
Remember this code that didn't work?
"100USD".match(/(USD)(?(1)\d+)/) # nil
If we modify it to capture the group in a look-ahead, it suddenly works fine:
"100USD".match(/(?=.*(USD))(?(1)\d+)/)
# => #<MatchData "100" 1:"USD">
Let's break down that query and see what's going on:
- `(?=.*(USD))` Using a look-ahead, scan the text for "USD" and capture it in group 1
- `(?(1)` If group 1 exists
- `\d+` Then match one or more numbers
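As a rough sketch of how this trick might be packaged for reuse, the hypothetical helper below returns the amount as an integer when "USD" appears somewhere after it, and nil otherwise. The method and constant names are my own.

```ruby
# The look-ahead pattern from above, unchanged.
USD_AMOUNT = /(?=.*(USD))(?(1)\d+)/

# Hypothetical helper: returns the amount as an Integer, or nil if no match.
def usd_amount(text)
  match = text.match(USD_AMOUNT)
  match && match[0].to_i
end

p usd_amount("100USD") # => 100
p usd_amount("100EUR") # => nil
```

One thing to keep in mind: the look-ahead here is mandatory, so when "USD" is absent the whole match fails before the conditional is ever reached.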
Pretty neat, huh?