Ruby for Newbies: Regular Expressions

Ruby is one of the most popular languages used on the web. We’ve started a new Session here on Nettuts+ that will introduce you to Ruby, as well as the great frameworks and tools that go along with Ruby development. In this lesson, we’ll look at using regular expressions in Ruby.

Preface: Regular Expression Syntax

If you’re familiar with regular expressions, you’ll be glad to know that most of the syntax for writing the actual regular expressions is very similar to what you know from PHP, JavaScript, or [your language here].

If you’re not familiar with regular expressions, you’ll want to check out our Regex tutorials here on Nettuts+ to get up to speed.

Regular Expression Matching

Just like everything else in Ruby, regular expressions are regular objects: they’re instances of the Regexp class. However, you’ll usually create a regular expression with the standard, literal syntax:

/myregex/

/\(\d{3}\) \d{3}-\d{4}/
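To make the "regular objects" point concrete, here is a minimal sketch that builds the phone-number pattern both as a literal and with Regexp.new. The constant names are only for illustration.

```ruby
# Two ways to build the same phone-number pattern.
PHONE_LITERAL = /\(\d{3}\) \d{3}-\d{4}/
# Single quotes keep the backslashes intact when passing a string.
PHONE_OBJECT = Regexp.new('\(\d{3}\) \d{3}-\d{4}')

puts PHONE_LITERAL.class            # Regexp
puts PHONE_LITERAL == PHONE_OBJECT  # true
```

The literal form is shorter, which is why you’ll see it most often; Regexp.new is handy when the pattern has to be assembled from a string at runtime.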

To start, the simplest way to use a regexp is to apply it to a string and see if there’s a match. Both strings and regexp objects have a match method that does this:

"(123) 456-7890".match /\(\d{3}\) \d{3}-\d{4}/

/\(\d{3}\) \d{3}-\d{4}/.match "(123) 456-7890"

Both of these examples match, and so we’re going to get a MatchData instance back (we’ll look at MatchData objects soon). If there’s no match, match will return nil. Because a MatchData object will evaluate to true, you can use the match method in conditional statements (like an if-statement), and just ignore that you’re getting a return value.
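As a rough sketch of that conditional usage, the snippet below branches on the result of match. The describe method and the PHONE constant are made-up names for this example.

```ruby
PHONE = /\(\d{3}\) \d{3}-\d{4}/

# Reports whether the text contains a phone number.
# match returns a MatchData (truthy) or nil (falsy), so it works directly in an if.
def describe(text)
  if text.match(PHONE)
    "found a phone number"
  else
    "no phone number"
  end
end

puts describe("Call (123) 456-7890")  # found a phone number
puts describe("Call me maybe")        # no phone number
```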

There’s another method that you can use to match regexps against strings: the =~ (equals-tilde) operator. Remember that operators are methods in Ruby. Like match, this method returns nil when there’s no match. However, if there is a match, it returns the index in the string where the match starts. Also like match, both strings and regexps have =~.

"Ruby For Newbies: Regular Expressions" =~ /New/ # => 9
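Here is a small sketch showing =~ called from either side, plus the nil you get when nothing matches. TITLE is just an illustrative constant.

```ruby
TITLE = "Ruby For Newbies: Regular Expressions"

p(TITLE =~ /New/)     # 9
p(/New/ =~ TITLE)     # 9
p(TITLE =~ /Python/)  # nil
```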

Regular expressions get more useful when we’re gleaning out some data. This is usually done with groupings: wrapping certain parts of the regular expression in parentheses. Let’s say we want to match a first name, last name, and occupation in a string, where the string is formatted like this:

str1 = "Joe Schmo, Plumber"
str2 = "Stephen Harper, Prime Minister"

To get the three fields, we’ll create this regexp:

re = /(\w*)\s(\w*),\s?([\w\s]*)/

This matches any number of word characters, a single whitespace character, any number of word characters, a comma, an optional whitespace character, and any number of word characters or whitespace. As you might guess, the parts including word characters refer to the names or occupation we’re looking for, so they are wrapped in parentheses.

So, let’s execute this:

match1 = str1.match re
match2 = str2.match re

MatchData Objects

Now, our match1 and match2 variables hold MatchData objects (because both our matches were successful). So, let’s see how we can use one of these MatchData objects.

As we go through this, you’ll notice that there are a few different ways to get the same data out of our MatchData object. We’ll start with the matched string: if you want to see the original string that was matched against the regexp, use the string method. To get just the portion of the string that matched, use the [] (square brackets) method and pass the parameter 0. Here our regexp matches the whole string, so both return the same thing:

match1.string # => "Joe Schmo, Plumber"
match1[0] # (this is the same as match1.[] 0 ) => "Joe Schmo, Plumber"

What about the regular expression itself? You can find that with the regexp method.

match1.regexp # => /(\w*)\s(\w*),\s?([\w\s]*)/

Now, how about getting those matched groups that were the point of this exercise? Firstly, we can get them with numbered indices on the MatchData object itself; of course, they are in the order we matched them in:

match1[1] # => "Joe"
match1[2] # => "Schmo"
match1[3] # => "Plumber"

match2[1] # => "Stephen"
match2[2] # => "Harper"
match2[3] # => "Prime Minister"

There’s actually another way to get these captures: the captures method, which returns them as an array. Since this is an array, it’s zero-based.

match1.captures[0] # => "Joe"

match2.captures[2] # => "Prime Minister"
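Because captures returns a plain array, you can destructure it in one assignment. The following is a minimal sketch; parse_person and NAME_AND_JOB are hypothetical names, and the pattern is the one used earlier in this lesson.

```ruby
NAME_AND_JOB = /(\w*)\s(\w*),\s?([\w\s]*)/

# Returns [first, last, job], or nil if the line doesn't fit the format.
def parse_person(line)
  match = line.match(NAME_AND_JOB)
  match && match.captures
end

first, last, job = parse_person("Stephen Harper, Prime Minister")
puts "#{first} #{last} works as #{job}"  # Stephen Harper works as Prime Minister
```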

Believe it or not, there’s actually a third way to get your captures. When you execute match or =~, Ruby fills in a series of global variables, one for each of the captured groups in your regexp:

"Andrew Burgess".match /(\w*)\s(\w*)/  # returns a MatchData object, but we're ignoring that

$1 # => "Andrew"
$2 # => "Burgess"
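These variables pair naturally with =~ inside a conditional. Here is a small sketch, assuming a made-up helper called last_first:

```ruby
# Turns "First Last" into "Last, First" using $1 and $2.
# If the name doesn't match, it is returned unchanged.
def last_first(name)
  if name =~ /(\w*)\s(\w*)/
    "#{$2}, #{$1}"
  else
    name
  end
end

puts last_first("Andrew Burgess")  # Burgess, Andrew
puts last_first("Cher")            # Cher
```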

Back to MatchData objects. If you want to find out the string index of a given capture, pass the capture’s number to the begin method (here, you want the capture’s number as you’d use it with the [] method, not via captures). Alternatively, you can use end to see where that capture ends; it returns the index just past the capture’s last character.

m = "Nettuts+ is the best".match /(is) (the)/

m[1] # => "is"
m.begin 1 # => 9
m[2] # => "the"
m.end 2   # => 15
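Since end points just past the capture’s last character, begin and end form an exclusive range you can use to slice the original string. The sketch below shows this; capture_slice is a hypothetical helper.

```ruby
# Returns the slice of the original string covered by capture n.
# The three-dot range excludes the index returned by end.
def capture_slice(match, n)
  match.string[match.begin(n)...match.end(n)]
end

m = "Nettuts+ is the best".match(/(is) (the)/)
puts capture_slice(m, 1)  # is
puts capture_slice(m, 2)  # the
```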

There are also the pre_match and post_match methods, which are pretty neat: they show you what part of the string came before and after the match, respectively.

# m from above
m.pre_match  # => "Nettuts+ "
m.post_match # => " best"
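Put together, pre_match, the match itself, and post_match reassemble the whole string. As a rough illustration, this made-up highlight helper wraps the matched portion in brackets:

```ruby
# Wraps the matched portion in brackets and leaves the rest untouched.
# Returns the text unchanged when the pattern doesn't match.
def highlight(text, pattern)
  m = text.match(pattern)
  return text unless m
  "#{m.pre_match}[#{m[0]}]#{m.post_match}"
end

puts highlight("Nettuts+ is the best", /(is) (the)/)  # Nettuts+ [is the] best
```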

That pretty much covers the basics of working with regular expressions in Ruby.

Regular Expression Use

Since regular expressions are so useful when manipulating strings, you’ll find several string methods that take advantage of them. The most useful ones are probably the substitution methods. These include

sub
sub!
gsub
gsub!

These are for substitution and global substitution, respectively. The difference is that gsub replaces all the instances of our pattern, while sub replaces only the first instance in the string.

Here’s how we use them:

"some string".sub /string/, "message" # => "some message"
"The man in the park".gsub /the/i, "a" # => "a man in a park" (the i flag ignores case, so "The" matches too)
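To see the difference side by side, here is a short sketch running sub and gsub on the same string, with and without the case-insensitive i flag. SAMPLE is just an illustrative constant.

```ruby
SAMPLE = "The man in the park near the lake"

puts SAMPLE.sub(/the/, "a")    # The man in a park near the lake
puts SAMPLE.gsub(/the/, "a")   # The man in a park near a lake
puts SAMPLE.gsub(/the/i, "a")  # a man in a park near a lake
```

Note that matching is case-sensitive by default, so the capitalized "The" is only replaced when the i flag is present.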

As you might know, the bang methods (ones ending with an exclamation mark!) are destructive methods: these change the actual string objects, instead of returning new ones. For example:

original = "My name is Andrew."
new = original.sub /My name is/, "Hi, I'm"
original # => "My name is Andrew."
new # => "Hi, I'm Andrew."

original = "Who are you?"
original.sub! /Who are/, "And"
original # => "And you?"
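One detail worth knowing about the bang versions: when nothing is replaced, sub! and gsub! return nil rather than the string, so be careful about chaining calls onto them. A minimal sketch (the greeting variable is only for illustration):

```ruby
greeting = "Who are you?"

p greeting.sub!(/Who are/, "And")  # "And you?"
p greeting.sub!(/Who are/, "And")  # nil, because there is nothing left to replace
p greeting                         # "And you?"
```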

Besides these simple examples, you can do more complex things, like this:

"1234567890".sub /(\d{3})(\d{3})(\d{4})/, '(\1) \2-\3' # => "(123) 456-7890"

The substitution methods return a string rather than a MatchData object, and because the replacement string is built before the match runs, $1 and friends can’t be used inside it. However, we can use the “backslash-number” pattern in the replacement string; single quotes keep the backslashes intact (in double quotes you’d have to write "\\1"). If you want to further manipulate the captured string, you can pass a block instead of the second parameter:

"WHAT'S GOING ON?".gsub(/\S*/) {|s| s.downcase } # => "what's going on?"
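As another sketch of the block form, this hypothetical title_case helper capitalizes every word; the block receives each match as a string and its return value becomes the replacement.

```ruby
# Capitalizes every run of word characters in the text.
def title_case(text)
  text.gsub(/\w+/) { |word| word.capitalize }
end

puts title_case("regular expressions in ruby")  # Regular Expressions In Ruby
```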

There are many other methods that use regular expressions; if you’re interested, you should check out String#scan and String#split, for starters.
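For a quick taste of those two, here is a short sketch: scan collects every match (or every set of captures, if the pattern has groups), and split breaks a string apart wherever the pattern matches. ORDER is a made-up sample string.

```ruby
ORDER = "3 apples, 12 pears,  7 plums"

p ORDER.scan(/\d+/)          # ["3", "12", "7"]
p ORDER.scan(/(\d+) (\w+)/)  # [["3", "apples"], ["12", "pears"], ["7", "plums"]]
p ORDER.split(/,\s*/)        # ["3 apples", "12 pears", "7 plums"]
```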

Conclusion

Well, that’s regular expressions in Ruby for you. If you have any questions, let’s hear them in the comments.
