In the first tutorial of this series, we explored the basics of regular expressions, including the syntax to write regular expressions. In this tutorial, we apply what we have learned so far to leverage regular expressions in Swift.
1. Regular Expressions in Swift
Open Xcode, create a new Playground, name it RegExTut, and set Platform to OS X. The choice of the platform, iOS or OS X, makes no difference with regard to the API we are going to use.
Before we start, there is one other thing you need to know about. In Swift, you need to use two backslashes, \\, for every backslash you use in a regular expression. The reason has to do with Swift having C-style string literals. The backslash is processed as a character escape inside a string literal, in addition to its role in string interpolation in Swift. In other words, you need to escape the escape character. If that sounds weird, don't worry about it. Just remember to use two backslashes instead of one.
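The escaping rule can be checked with a short sketch. It assumes Swift 5 or later, because the raw string literal used for comparison (#"..."#) does not exist in earlier versions of the language; the constant names are made up for illustration.

```swift
import Foundation

// In a regular string literal, every regex backslash is written twice.
let escaped = "\\d+\\.\\d+"

// Assumption: Swift 5 or later. A raw string literal leaves backslashes
// untouched, so the pattern can be written exactly as a regex tool shows it.
let raw = #"\d+\.\d+"#

// Both literals produce the same eight-character pattern: \d+\.\d+
print(escaped == raw)   // true
print(escaped.count)    // 8
```

Either spelling hands the same pattern to the regular expression engine, so which one to use is purely a matter of readability.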
In the first, somewhat contrived, example, we imagine we are rummaging through a string looking for a very specific type of email address. The email address meets the following criteria:
- the first letter is the first letter of the person's name
- followed by a period
- followed by the person's last name
- followed by the @ symbol
- followed by a name, representing a university in the United Kingdom
- followed by .ac.uk, the domain for academic institutions in the United Kingdom
Add the following code to the playground and let's walk through this code snippet step by step.
import Cocoa

// (1):
let pat = "\\b([a-z])\\.([a-z]{2,})@([a-z]+)\\.ac\\.uk\\b"
// (2):
let testStr = "[email protected], [email protected] [email protected], [email protected], [email protected]"
// (3):
let regex = try! NSRegularExpression(pattern: pat, options: [])
// (4):
let matches = regex.matchesInString(testStr, options: [], range: NSRange(location: 0, length: testStr.characters.count))
Step 1
We define a pattern. Note the doubly escaped backslashes. In (normal) regex representation, such as the one used on the RegExr website, this would be \b([a-z])\.([a-z]{2,})@([a-z]+)\.ac\.uk\b. Also note the use of parentheses. They define capture groups, with which we can extract the substrings matched by that part of the regular expression.
You should be able to make out that the first capture group captures the first letter of the user's name, the second one their last name, and the third one the name of the university. Note also the use of the backslash to escape the period character in order to represent its literal meaning. Alternatively, we could put it in a character set by itself ([.]). In that case, we wouldn't need to escape it.
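To make the capture groups concrete, here is a minimal sketch that runs the pattern, and a variant that writes the period as [.], against a single address. The address and the captureGroups(of:in:) helper are made up for illustration, and the code uses the current Swift spellings of the Foundation methods (firstMatch(in:options:range:), range(at:)), which is an assumption about your toolchain.

```swift
import Foundation

// Hypothetical sample address, made up for this sketch.
let sample = "j.smith@example.ac.uk"

// The tutorial's pattern, and a variant that uses [.] instead of \\.
let escapedDot = "\\b([a-z])\\.([a-z]{2,})@([a-z]+)\\.ac\\.uk\\b"
let dotInSet = "\\b([a-z])[.]([a-z]{2,})@([a-z]+)[.]ac[.]uk\\b"

// Returns the text of every capture group of the first match, or [] if there is none.
func captureGroups(of pattern: String, in text: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: []),
          let match = regex.firstMatch(in: text, options: [],
                                       range: NSRange(text.startIndex..., in: text))
    else { return [] }
    // Range 0 is the whole match, so the capture groups start at index 1.
    return (1..<match.numberOfRanges).compactMap { index in
        Range(match.range(at: index), in: text).map { String(text[$0]) }
    }
}

print(captureGroups(of: escapedDot, in: sample))  // ["j", "smith", "example"]
print(captureGroups(of: dotInSet, in: sample))    // ["j", "smith", "example"]
```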
Step 2
This is the string in which we are searching for the pattern.
Step 3
We create an NSRegularExpression object, passing in the pattern without options. In the list of options, you can specify NSRegularExpressionOptions constants, such as:
- CaseInsensitive: This option specifies that the matching is case insensitive.
- IgnoreMetacharacters: Use this option if you want to perform a literal match, meaning that the metacharacters don't have a special meaning and match themselves as ordinary characters.
- AnchorsMatchLines: Use this option if you want the ^ and $ anchors to match the start and end of lines (separated by line breaks) in a single string, rather than the start and end of the entire string.
Because the initializer is throwing, we use the try keyword, here in its try! form. If we pass in an invalid regular expression, for example, an error is thrown, and try! turns that error into a runtime crash.
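The effect of these options, and of an invalid pattern, can be sketched as follows. The countMatches helper and the sample text are hypothetical, and the option names are the current Swift spellings (.caseInsensitive, .ignoreMetacharacters, .anchorsMatchLines) of the constants listed above.

```swift
import Foundation

// Hypothetical helper: counts matches of a pattern compiled with the given options.
func countMatches(_ pattern: String, in text: String,
                  options: NSRegularExpression.Options = []) throws -> Int {
    let regex = try NSRegularExpression(pattern: pattern, options: options)
    return regex.numberOfMatches(in: text, options: [],
                                 range: NSRange(text.startIndex..., in: text))
}

// Three lines in a single string.
let lines = "Cat\ncat\nc.t"

do {
    print(try countMatches("cat", in: lines))                                  // 1
    print(try countMatches("cat", in: lines, options: .caseInsensitive))       // 2
    print(try countMatches("c.t", in: lines, options: .ignoreMetacharacters))  // 1
    print(try countMatches("^c", in: lines))                                   // 0
    print(try countMatches("^c", in: lines, options: .anchorsMatchLines))      // 2
    // An unbalanced parenthesis is not a valid pattern, so this call throws.
    _ = try countMatches("(", in: lines)
} catch {
    print("Invalid pattern: \(error.localizedDescription)")
}
```

Using do and catch instead of try! keeps a bad pattern from crashing the program, which matters once the pattern comes from user input rather than a literal.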
Step 4
We search for matches in the test string by invoking matchesInString(_:options:range:), passing in a range to indicate which part of the string we are interested in. This method also accepts a list of options. To keep things simple, we don't pass in any options in this example. I will talk about options in the next example.
The matches are returned as an array of NSTextCheckingResult objects. We can extract the matches, including the capture groups, as follows:
for match in matches {
    for n in 0..<match.numberOfRanges {
        let range = match.rangeAtIndex(n)
        let r = testStr.startIndex.advancedBy(range.location) ..<
            testStr.startIndex.advancedBy(range.location+range.length)
        testStr.substringWithRange(r)
    }
}
The above snippet iterates through each NSTextCheckingResult object in the array. The numberOfRanges property for each match in the example has a value of 4: one range for the entire matched substring, corresponding to an email address (for example, [email protected]), and the remaining three for the three capture groups within the match ("a", "khan", and "surrey" respectively).
The rangeAtIndex(_:) method returns the range of the substrings in the string so we can extract them. Note that, instead of using rangeAtIndex(0), you could also use the range property for the entire match.
Click the Show Result button in the results panel on the right. This shows us "surrey", the value of testStr.substringWithRange(r) for the last iteration of the loop. Right-click the result field and select Value History to show a history of values.
You can modify the above code to do something meaningful with the matches and/or the capture groups.
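As one possible way of doing something meaningful with the matches, the sketch below collects the capture groups into values of a small struct. The AcademicAddress type, the academicAddresses(in:) function, and the sample addresses are all hypothetical, and the code assumes the current Swift spellings of the API (matches(in:options:range:), range(at:)).

```swift
import Foundation

// Hypothetical sample text; the addresses are made up for this sketch.
let sampleText = "j.smith@example.ac.uk, not-an-address, m.jones@sample.ac.uk"
let emailPattern = "\\b([a-z])\\.([a-z]{2,})@([a-z]+)\\.ac\\.uk\\b"

// Hypothetical container for the three capture groups of one match.
struct AcademicAddress: Equatable {
    let initial: String
    let lastName: String
    let university: String
}

func academicAddresses(in text: String) -> [AcademicAddress] {
    guard let regex = try? NSRegularExpression(pattern: emailPattern, options: []) else {
        return []
    }
    // NSRange counts UTF-16 code units, so it is built from the string itself.
    let fullRange = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, options: [], range: fullRange).compactMap { match -> AcademicAddress? in
        // Convert each NSRange back to a String range before slicing.
        guard let initial = Range(match.range(at: 1), in: text),
              let lastName = Range(match.range(at: 2), in: text),
              let university = Range(match.range(at: 3), in: text) else { return nil }
        return AcademicAddress(initial: String(text[initial]),
                               lastName: String(text[lastName]),
                               university: String(text[university]))
    }
}

for address in academicAddresses(in: sampleText) {
    print(address.lastName, address.initial, address.university)
}
// smith j example
// jones m sample
```

Converting ranges with NSRange(_:in:) and Range(_:in:) avoids counting characters by hand, which can go wrong for text containing characters that occupy more than one UTF-16 code unit.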
There is a convenient way to perform find-and-replace operations, using a template string that has a special syntax for representing capture groups. Carrying on with the example, suppose we wanted to replace every matched email address with a substring of the form "(lastname, initial, university)". We could do the following:
let replacedStr = regex.stringByReplacingMatchesInString(testStr, options: [], range: NSRange(location: 0, length: testStr.characters.count), withTemplate: "($2, $1, $3)")
Note the $n syntax in the template, which acts as a placeholder for the text of capture group n. Keep in mind that $0 represents the entire match.
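The template syntax can be tried out with a small sketch. The replacing(_:in:with:) helper and the sample text are hypothetical, and the method name stringByReplacingMatches(in:options:range:withTemplate:) is the current Swift spelling of the call used above.

```swift
import Foundation

// Hypothetical sample text; the addresses are made up for this sketch.
let contacts = "Write to j.smith@example.ac.uk or m.jones@sample.ac.uk."
let addressPattern = "\\b([a-z])\\.([a-z]{2,})@([a-z]+)\\.ac\\.uk\\b"

// Replaces every match of `pattern` in `text` using a template string.
func replacing(_ pattern: String, in text: String, with template: String) -> String {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: []) else {
        return text  // leave the text unchanged if the pattern is invalid
    }
    return regex.stringByReplacingMatches(in: text, options: [],
                                          range: NSRange(text.startIndex..., in: text),
                                          withTemplate: template)
}

// $1, $2, and $3 stand for the three capture groups.
print(replacing(addressPattern, in: contacts, with: "($2, $1, $3)"))
// Write to (smith, j, example) or (jones, m, sample).

// $0 stands for the entire match.
print(replacing(addressPattern, in: contacts, with: "<$0>"))
// Write to <j.smith@example.ac.uk> or <m.jones@sample.ac.uk>.
```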
2. A More Advanced Example
The matchesInString(_:options:range:) method is one of several convenience methods that rely on enumerateMatchesInString(_:options:range:usingBlock:), which is the most flexible and general (and therefore complicated) method in the NSRegularExpression class. This method calls a block after each match, allowing you to perform whatever action you want.
By passing in one or more matching rules, using NSMatchingOptions constants, you can make sure the block is invoked on other occasions as well. For long-running operations, the ReportProgress option specifies that the block is invoked periodically, which gives you the chance to terminate the operation at some point. With the ReportCompletion option, you specify that the block should be invoked on completion.
The block has a flags parameter that reports any of these states so you can decide what action to take. Similar to some other enumeration methods in the Foundation framework, the enumeration can also be terminated at your discretion, for example, if a long-running match isn't succeeding or if you have found enough matches to begin processing.
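The following sketch illustrates both ideas on a deliberately tiny input: asking for a completion callback, and stopping early once enough matches have been collected. The variable names and the sample string are made up, and the code assumes the current Swift spellings (enumerateMatches(in:options:range:using:), .reportCompletion, .completed, stop.pointee).

```swift
import Foundation

let digits = "3 14 15 92 65 35"
let fullRange = NSRange(digits.startIndex..., in: digits)
// The pattern is a fixed, known-valid literal, so try! cannot fail here.
let numberRegex = try! NSRegularExpression(pattern: "\\d+", options: [])

// 1. Ask for an extra callback once matching has finished.
var matchCount = 0
var completionCalls = 0
numberRegex.enumerateMatches(in: digits, options: [.reportCompletion], range: fullRange) { match, flags, _ in
    if flags.contains(.completed) {
        completionCalls += 1   // this call reports completion rather than a match
    } else if match != nil {
        matchCount += 1
    }
}
print(matchCount, completionCalls)

// 2. Stop early once enough matches have been collected.
var firstThree = [String]()
numberRegex.enumerateMatches(in: digits, options: [], range: fullRange) { match, _, stop in
    guard let match = match, let range = Range(match.range, in: digits) else { return }
    firstThree.append(String(digits[range]))
    if firstThree.count == 3 {
        stop.pointee = true    // ends the enumeration after this call
    }
}
print(firstThree)   // ["3", "14", "15"]
```

Checking the flags before touching the match matters because the match parameter is optional and is not guaranteed to be present in progress or completion callbacks.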
In this scenario, we are going to search through some text for strings that look like dates and check whether a particular date is present. To keep the example manageable, we will imagine that the date strings have the following structure:
- a year with either two or four digits (for example, 09 or 2009)
- only from the present century (between 2000 and 2099) so 1982 would be rejected and 16 would automatically be interpreted as 2016
- followed by a separator
- followed by a number between 1 and 12 representing the month
- followed by a separator
- concluding with a number between 1 and 31 representing the day
Single-digit months and days may be padded with a leading zero. Valid separators are a dash, a period, and a forward slash. Apart from the above requirements, we won't be verifying whether a date is actually valid. For example, we are fine with dates like 2000-04-31 (April has only 30 days) and 2009-02-29 (2009 isn't a leap year, which means February has only 28 days) that aren't real dates.
Add the following code to the playground and let's walk through this code snippet step by step.
// (1):
typealias PossibleDate = (year: Int, month: Int, day: Int)

// (2):
func dateSearch(text: String, _ date: PossibleDate) -> Bool {
    // (3):
    let datePattern = "\\b(?:20)?(\\d\\d)[-./](0?[1-9]|1[0-2])[-./](3[0-1]|[1-2][0-9]|0?[1-9])\\b"
    let dateRegex = try! NSRegularExpression(pattern: datePattern, options: [])
    // (4):
    var wasFound: Bool = false
    // (5):
    dateRegex.enumerateMatchesInString(text, options: [], range: NSRange(location: 0, length: text.characters.count)) {
        // (6):
        (match, _, stop) in
        var dateArr = [Int]()
        for n in 1...3 {
            let range = match!.rangeAtIndex(n)
            let r = text.startIndex.advancedBy(range.location) ..<
                text.startIndex.advancedBy(range.location+range.length)
            dateArr.append(Int(text.substringWithRange(r))!)
        }
        // (7):
        if dateArr[0] == date.year && dateArr[1] == date.month && dateArr[2] == date.day {
            // (8):
            wasFound = true
            stop.memory = true
        }
    }
    return wasFound
}

let text = " 2015/10/10,11-10-20, 13/2/2 1981-2-2 2010-13-10"
let date1 = PossibleDate(15, 10, 10)
let date2 = PossibleDate(13, 1, 1)
dateSearch(text, date1) // returns true
dateSearch(text, date2) // returns false
Step 1
The date whose existence we are checking for is going to be in a standardized format. We use a named tuple. We only pass a two-digit integer to year, that is, 16 means 2016.
Step 2
Our task is to enumerate through matches that look like dates, extract the year, month, and day components from them, and check whether they match the date we passed in. We will create a function to do all this for us. The function returns true or false depending on whether the date was found or not.
Step 3
The date pattern has some interesting features:
- Note the fragment (?:20)?. If we replaced this fragment with (20)?, hopefully you would recognize that this means we are fine with the "20" (representing the century) being present in the year or not. The parentheses are necessary for grouping, but we don't care to form a capture group with this pair of parentheses, and that is what the ?: bit is for.
- The possible separators inside the character set [-./] don't need to be escaped to represent their literal selves. You can think of it like this. The dash, -, is at the start so it can't represent a range. And it doesn't make sense for the period, ., to represent any character inside a character set since it does that equally well outside.
- We make heavy use of the vertical bar for alternation to represent the various month and day digit possibilities.
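The difference between (?:20)? and (20)? shows up in the number of capture groups a match reports. The sketch below compares the two on a few inputs; the firstGroups(_:in:) helper is hypothetical, and the code assumes the current Swift spellings of the API.

```swift
import Foundation

// Hypothetical helper: capture-group texts of the first match, or nil if nothing matches.
func firstGroups(_ pattern: String, in text: String) -> [String]? {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: []),
          let match = regex.firstMatch(in: text, options: [],
                                       range: NSRange(text.startIndex..., in: text))
    else { return nil }
    return (1..<match.numberOfRanges).map { index in
        // A group that took no part in the match has no valid range; report it as "".
        Range(match.range(at: index), in: text).map { String(text[$0]) } ?? ""
    }
}

let tail = "(\\d\\d)[-./](0?[1-9]|1[0-2])[-./](3[0-1]|[1-2][0-9]|0?[1-9])\\b"
let nonCapturing = "\\b(?:20)?" + tail   // the pattern from the tutorial
let capturing = "\\b(20)?" + tail        // variant with an ordinary group

print(firstGroups(nonCapturing, in: "2015/10/10") ?? [])  // ["15", "10", "10"]
print(firstGroups(capturing, in: "2015/10/10") ?? [])     // ["20", "15", "10", "10"]
print(firstGroups(nonCapturing, in: "09.1.31") ?? [])     // ["09", "1", "31"]
print(firstGroups(nonCapturing, in: "1981-2-2") == nil)   // true
```

With the non-capturing form, the year, month, and day always sit in groups 1, 2, and 3, which keeps the extraction code independent of whether the "20" prefix was present.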
Step 4
The boolean variable wasFound will be returned by the function, indicating whether the date being sought was found or not.
Step 5
The enumerateMatchesInString(_:options:range:usingBlock:) method is called. We aren't using any of the options and we are passing in the entire range of the text being searched.
Step 6
The block object, invoked after every match, has three parameters:
- the match (an NSTextCheckingResult)
- flags representing the current state of the matching process (which we are ignoring here)
- a boolean stop variable, which we can set within the block to exit early
We use the boolean to stop the enumeration if we find the date we are looking for, since we don't need to look any further. The code that extracts the components of the date is quite similar to the previous example.
Step 7
We check whether the extracted components from the matched substring equal the components of the desired date. Note that, when extracting the components, we force-unwrap the result of the Int initializer, which we are sure won't fail because we created the corresponding capture groups to only match digits.
Step 8
If a match is found, we set wasFound to true. We stop the enumeration by setting stop.memory to true. We do this because stop is a pointer to a boolean, and the way Swift deals with the "pointed-to" memory is via the memory property.
Observe that the substring "2015/10/10" in our text corresponds to PossibleDate(15, 10, 10), which is why the function returns true in the first case. However, no string in the text corresponds to PossibleDate(13, 1, 1), that is, "2013-01-01", so the second call to the function returns false.
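For readers on a recent toolchain, the same search can be sketched with the current Swift spellings of the API, where stop.memory has become stop.pointee. This is an adaptation rather than the code discussed above: the containsDate(_:in:) name is made up, and optional binding replaces the forced unwrapping.

```swift
import Foundation

typealias PossibleDate = (year: Int, month: Int, day: Int)

// Hypothetical modern counterpart of dateSearch(_:_:).
func containsDate(_ date: PossibleDate, in text: String) -> Bool {
    let datePattern = "\\b(?:20)?(\\d\\d)[-./](0?[1-9]|1[0-2])[-./](3[0-1]|[1-2][0-9]|0?[1-9])\\b"
    guard let dateRegex = try? NSRegularExpression(pattern: datePattern, options: []) else {
        return false
    }
    var wasFound = false
    dateRegex.enumerateMatches(in: text, options: [],
                               range: NSRange(text.startIndex..., in: text)) { match, _, stop in
        guard let match = match else { return }
        // Convert capture groups 1...3 (year, month, day) to integers.
        let parts = (1...3).compactMap { index -> Int? in
            Range(match.range(at: index), in: text).flatMap { Int(text[$0]) }
        }
        if parts.count == 3 && parts[0] == date.year
            && parts[1] == date.month && parts[2] == date.day {
            wasFound = true
            stop.pointee = true   // no need to look at later matches
        }
    }
    return wasFound
}

let sampleDates = " 2015/10/10,11-10-20, 13/2/2 1981-2-2 2010-13-10"
print(containsDate((year: 15, month: 10, day: 10), in: sampleDates))  // true
print(containsDate((year: 13, month: 1, day: 1), in: sampleDates))    // false
```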
Conclusion
We have taken a leisurely, yet reasonably detailed, look at how regular expressions work, but there is a lot more to learn if you are interested, such as lookahead and lookbehind assertions, applying regular expressions to Unicode strings, in addition to looking at the various options we skimmed over in the Foundation API.
Even if you decide not to delve any deeper, hopefully you picked up enough here to be able to identify situations in which regexes might come in handy as well as some pointers on how to design regexes to solve your pattern searching problems.