Swift and Regular Expressions: Syntax

1. Introduction

Simply put, regular expressions (regexes or regexps for short) are a way of specifying string patterns. You are undoubtedly familiar with the search and replace function in your favorite text editor or IDE. You can search for exact words and phrases. You can also activate options, such as case insensitivity, so that a search for the word "color" also finds "Color", "COLOR", and "CoLoR". But what if you wanted to search for the spelling variations of the word "color" (American spelling: color, British spelling: colour) without having to perform two separate searches?

If that example seems too simple, what if you wanted to search for all the spelling variations of the English name "Katherine" (Catherine, Katharine, Kathreen, and Kathryn, to name a few)? More generally, you might want to search a document for all strings that resemble hexadecimal numbers, dates, phone numbers, email addresses, credit card numbers, etc.

Regular expressions are a powerful way of (partially or fully) tackling these (and many other) practical problems involving text.

Outline

The structure of this tutorial is as follows. I will introduce the core concepts you need by adapting an approach used in theoretical textbooks (after stripping away any unneeded rigor or pedantry). I prefer this approach because it lets you understand maybe 70% of the functionality you will need in terms of a few basic principles. The remaining 30% consists of more advanced features that you can learn later on or skip, unless you aim to become a regex maestro.

There is a copious amount of syntax associated with regular expressions, but most of it is just there to allow you to apply the core ideas as succinctly as possible. I will introduce it incrementally, rather than dropping a big table or list for you to memorize.

Instead of jumping straight into a Swift implementation, we will explore the basics through an excellent online tool that will help you design and evaluate regular expressions with the minimum amount of friction and unnecessary baggage. Once you get comfortable with the main ideas, writing Swift code is basically a problem of mapping your understanding to the Swift API.

Throughout, we will try to keep a pragmatic mindset. Regexes are not the best tool for every string processing situation. In practice, we need to identify situations where regexes work very well and situations where they don't. There is also a middle ground where regexes can be used to do part of the job (usually some preprocessing and filtering) and the rest of the job is left to algorithmic logic.

Core Concepts

Regular expressions have their theoretical underpinnings in the "theory of computation", one of the topics studied in computer science, where they describe the sets of strings accepted by a specific class of abstract computing machines called finite automata.

Relax, though, you are not required to study the theoretical background to use regular expressions practically. I only mention it because the approach I will use to initially motivate regular expressions from the ground up mirrors the approach used in computer science textbooks to define "theoretical" regular expressions.

Assuming you have some familiarity with recursion, I would like you to keep in mind how recursive functions are defined. A function is defined in terms of simpler versions of itself and, if you trace through a recursive definition, you must end up at a base case that is explicitly defined. I bring this up because our definition below is going to be recursive too.

Note that, when we talk about strings in general, we implicitly have a character set in mind, such as ASCII, Unicode, etc. Let's pretend for the moment that we live in a universe wherein strings are composed of the 26 letters of the lowercase alphabet (a, b, ... z) and nothing else.

Rules

We start by asserting that every character in this set can be regarded as a regular expression that matches itself as a string. So a as a regular expression matches "a" (regarded as a string), b is a regex matching the string "b", etc.  Let's also say there is an "empty" regular expression Ɛ that matches the empty string "". Such cases correspond to the trivial "base cases" of the recursion.

Now, we consider the following rules that help us make new regular expressions from existing ones:

  1. The concatenation (i.e. "stringing together") of any two regular expressions is a new regular expression that matches the concatenation of any two strings that matched the original regular expressions.
  2. The alternation of two regular expressions is a new regular expression that matches any string matched by either of the two original regular expressions.
  3. The Kleene star of a regular expression matches zero or more adjacent instances of whatever matched the original regular expression.

Let's make this concrete with several simple examples with our alphabetic strings.

Example 1

From rule 1, a and b being regular expressions matching "a" and "b" means that ab is a regular expression that matches the string "ab". Since ab and c are regular expressions, abc is a regular expression matching the string "abc", and so on. Continuing this way, we can make arbitrarily long regular expressions, each of which matches the string made up of the same characters in the same order. Nothing interesting has happened yet.

Example 2

From rule 2, o and a being regular expressions, o|a matches "o" or "a". The vertical bar represents alternation. c and t are regular expressions and, combined with rule 1, we can assert that c(o|a)t is a regular expression. The parentheses are being used for grouping.

What does it match? c and t only match themselves, which means that the regex c(o|a)t matches "c" followed by either an "a" or an "o" followed by "t", for example, the string "cat" or "cot". Note that it does not match "coat" as o|a only matches "a" or "o", but not both at once. Now things are starting to get interesting.
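If you would like to see this in Swift right away, here is a minimal sketch of checking c(o|a)t against whole strings. It assumes Foundation's range(of:options:) with the .regularExpression option. The name matchesWhole is a hypothetical helper, and it wraps the pattern in the ^ and $ anchors (covered later in this tutorial) so that only complete strings count.

```swift
import Foundation

// Hypothetical helper: true when `pattern` matches the whole of `text`.
// The non-capturing group (?:...) keeps the anchors outside any alternation.
func matchesWhole(_ pattern: String, _ text: String) -> Bool {
    return text.range(of: "^(?:\(pattern))$", options: .regularExpression) != nil
}

let catOrCot = "c(o|a)t"
print(matchesWhole(catOrCot, "cat"))   // true
print(matchesWhole(catOrCot, "cot"))   // true
print(matchesWhole(catOrCot, "coat"))  // false: o|a matches one letter, not both
```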

Example 3

From rule 3, a* matches zero or more instances of "a". It matches the empty string or the strings "a", "aa", "aaa", and so on. Let's exercise this rule in conjunction with the other two rules.

What does ho*t match? It matches "ht" (with zero instances of "o"), "hot", "hoot", "hooot", and so on. What about b(o|a)*? It can match "b" followed by any number of instances of "o" and "a" (including none of them). "b", "boa", "baa", "bao", "baooaoaoaoo" are just some of the infinite number of strings that this regular expression matches. Note again that the parentheses are being used to group together the part of the regular expression to which the * is being applied.
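The star examples above can be tried out the same way. This is only a sketch under the same assumptions: Foundation's regex search, plus a hypothetical matchesWhole helper that accepts a string only when the pattern covers all of it.

```swift
import Foundation

// Hypothetical helper: true when `pattern` matches the whole of `text`.
func matchesWhole(_ pattern: String, _ text: String) -> Bool {
    return text.range(of: "^(?:\(pattern))$", options: .regularExpression) != nil
}

print(matchesWhole("ho*t", "ht"))              // true: zero instances of "o"
print(matchesWhole("ho*t", "hooot"))           // true
print(matchesWhole("b(o|a)*", "b"))            // true
print(matchesWhole("b(o|a)*", "baooaoaoaoo"))  // true
print(matchesWhole("b(o|a)*", "bob"))          // false: the second "b" is not allowed
```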

Example 4

Let's try to discover regular expressions that match strings we already have in mind. How would we make a regular expression that recognizes sheep bleating, which I'll regard as any number of repetitions of the basic sound "baa" ("baa", "baabaa", "baabaabaa", etc.)?

If you said (baa)*, then you are almost correct. But notice that this regular expression would match the empty string too, which we don't want. In other words, we want to ignore non-bleating sheep. baa(baa)* is the regular expression we are looking for. Similarly, a cow mooing might be moo(moo)*. How can we recognize the sound of either animal? Simple. Use alternation: baa(baa)*|moo(moo)*.
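As a quick check of the animal-sound expression, here is a small Swift sketch. It assumes Foundation's regex search, and matchesWhole is again a hypothetical helper that only accepts complete strings.

```swift
import Foundation

// Hypothetical helper: true when `pattern` matches the whole of `text`.
func matchesWhole(_ pattern: String, _ text: String) -> Bool {
    return text.range(of: "^(?:\(pattern))$", options: .regularExpression) != nil
}

let animalSound = "baa(baa)*|moo(moo)*"
print(matchesWhole(animalSound, "baabaa"))  // true
print(matchesWhole(animalSound, "moo"))     // true
print(matchesWhole(animalSound, "baamoo"))  // false: the two sounds cannot be mixed
print(matchesWhole(animalSound, ""))        // false: silent animals are ignored
```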

If you have understood the above ideas, congratulations, you are well on your way.

2. Matters of Syntax

Recall we placed a silly restriction on our strings. They could only be composed of lower case letters of the alphabet. We will now drop this restriction and consider all strings composed of ASCII characters.

We must realize that, in order for regular expressions to be a convenient tool, they themselves need to be represented as strings. So, unlike earlier, we can no longer use characters like *, |, (, ), etc. without somehow signaling whether we are using them as "special" characters representing alternation, grouping, etc. or whether we are treating them as ordinary characters that need to be matched literally.

The solution is to treat these and other characters as "metacharacters" that can have a special meaning. To switch between one use and the other, we need to be able to escape them. This is similar to the idea of using "\n" (escaping the n) to indicate a new line in a string. It is slightly more complicated in that, depending on the context, a character that is ordinarily "meta" might represent its literal self without escaping. We will see examples of this later on.

Another thing we value is conciseness. Many regular expressions that can be expressed using just the previous section's notation would be tediously verbose. For example, suppose you just want to find all two-character strings composed of a lowercase letter followed by a numeral (for example, strings like "a0", "b9", "z3", etc.). Using the notation we discussed earlier, this would result in the following regular expression:

(a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z)(0|1|2|3|4|5|6|7|8|9)

Just typing that monster wiped me out.

Doesn't [abcdefghijklmnopqrstuvwxyz][0123456789] look like a better representation? Note the metacharacters [ and ] that signify a set of characters, any one of which gives a positive match. Actually, if we consider that the letters a to z, and the numerals 0 to 9 occur in sequence in the ASCII set, we can whittle the regex down to a cool [a-z][0-9].

Within the confines of a character set, the dash, -, is another metacharacter indicating a range. Note that you can squeeze multiple ranges into the same pair of square brackets. For example, [0-9a-zA-Z] can match any alphanumeric character. The 9 and a (and z and A) squeezed against each other might look funny, but remember that regular expressions are all about brevity and the meaning is clear.

Speaking of brevity, there are even more concise ways to represent certain classes of related characters, as we will see shortly. Note that the alternation bar, |, is still valid and useful syntax.
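Character sets and ranges behave the same way in Swift. The sketch below assumes Foundation's regex search; matchesWhole is a hypothetical helper that accepts a string only when the pattern covers all of it.

```swift
import Foundation

// Hypothetical helper: true when `pattern` matches the whole of `text`.
func matchesWhole(_ pattern: String, _ text: String) -> Bool {
    return text.range(of: "^(?:\(pattern))$", options: .regularExpression) != nil
}

print(matchesWhole("[a-z][0-9]", "a0"))   // true
print(matchesWhole("[a-z][0-9]", "z3"))   // true
print(matchesWhole("[a-z][0-9]", "A0"))   // false: uppercase is outside a-z
print(matchesWhole("[0-9a-zA-Z]", "Q"))   // true
print(matchesWhole("[0-9a-zA-Z]", "%"))   // false
```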

More Syntax

Before we start practicing, let's take a look at a bit more syntax.

Period

The period, ., matches any single character, with the exception of line breaks. This means that c.t can match "cat", "crt", "c9t", "c%t", "c.t", "c t", and so on. If we wanted to match the period as an ordinary character, for example, to match the string "c.t", we could either escape it (c\.t) or put it in a character class of its own (c[.]t).

In general, these ideas apply to other metacharacters, such as [, ], (, ), *, and others we haven't encountered yet.
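Escaping adds one wrinkle when the pattern lives in Swift source code: in an ordinary string literal the backslash itself must be escaped ("c\\.t"), while a raw string literal (#"c\.t"#) lets you write the pattern as is. Here is a minimal sketch, assuming Foundation's regex search and a hypothetical containsMatch helper.

```swift
import Foundation

// Hypothetical helper: true when `pattern` matches somewhere inside `text`.
func containsMatch(_ pattern: String, in text: String) -> Bool {
    return text.range(of: pattern, options: .regularExpression) != nil
}

print(containsMatch("c.t", in: "c9t"))      // true: the period matches any character
print(containsMatch("c.t", in: "c\nt"))     // false: except a line break
print(containsMatch(#"c\.t"#, in: "c.t"))   // true: escaped period, raw string literal
print(containsMatch("c\\.t", in: "cat"))    // false: same pattern, ordinary literal
print(containsMatch("c[.]t", in: "cat"))    // false: period inside a character class
```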

Parentheses

Parentheses (( and )) are used for grouping as we saw before. We are going to use the word token to mean either a single character or a parenthesized expression. The reason is that many regex operators can be applied to either.

Parentheses are also used to define capture groups, allowing you to figure out which part of your match was captured by a particular capture group in the regex. I will talk more about this very useful functionality later.

Plus

A + following a token matches one or more instances of that token. In our sheep bleating example, baa(baa)* could be represented more succinctly as (baa)+. Recall that * means zero or more occurrences. Note that (baa)+ is different from baa+, because in the former the + is applied to the baa token whereas in the latter it only applies to the a before it. The latter matches strings like "baa", "baaa", and "baaaa".
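The difference between (baa)+ and baa+ is easy to confirm with a small sketch. It assumes Foundation's regex search; matchesWhole is a hypothetical helper that only accepts complete strings.

```swift
import Foundation

// Hypothetical helper: true when `pattern` matches the whole of `text`.
func matchesWhole(_ pattern: String, _ text: String) -> Bool {
    return text.range(of: "^(?:\(pattern))$", options: .regularExpression) != nil
}

print(matchesWhole("(baa)+", "baabaa"))  // true: the + repeats the whole group
print(matchesWhole("(baa)+", "baaaa"))   // false
print(matchesWhole("baa+", "baaaa"))     // true: the + repeats only the last "a"
print(matchesWhole("baa+", "baabaa"))    // false
```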

Question Mark

A ? following a token means zero or one instances of that token. This answers the question from the introduction: colou?r matches both "color" and "colour".
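To illustrate the optional token, the sketch below borrows the spelling example from the introduction. The pattern colou?r is my own illustration, and matchesWhole is a hypothetical helper built on Foundation's regex search.

```swift
import Foundation

// Hypothetical helper: true when `pattern` matches the whole of `text`.
func matchesWhole(_ pattern: String, _ text: String) -> Bool {
    return text.range(of: "^(?:\(pattern))$", options: .regularExpression) != nil
}

print(matchesWhole("colou?r", "color"))    // true: zero instances of "u"
print(matchesWhole("colou?r", "colour"))   // true: one instance of "u"
print(matchesWhole("colou?r", "colouur"))  // false: two are too many
```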

Practice

RegExr is an excellent online tool to experiment with regular expressions. When you are comfortable reading and writing regular expressions, it will be much easier to use the regular expression API of the Foundation framework. Even then, it will be easier to test your regular expression in real-time on the website first.

Visit the website and focus on the main part of the page. This is what you will see:

[Screenshot: the main part of the RegExr page]

You enter a regular expression in the box at the top and enter the text in which you are looking for matches.

The "/g" at the end of the expression box is not part of the regular expression per se. It is a flag that affects the overall matching behavior of the regex engine. By appending "/g" to the regular expression, the engine searches for all possible matches of the regular expression in the text, which is the behavior we want. The blue highlight indicates a match. Hovering with your mouse over the regular expression is a handy way to remind you of the meaning of its constituent parts.

Know that regular expressions come in various flavors, depending on the language or library you are using. Not only does this mean that the syntax can be a bit different among the various flavors, but also the capabilities and features. Swift, for example, uses the pattern syntax specified by ICU. I am not sure which flavor is used in RegExr (which runs on JavaScript), but within the scope of this tutorial, they are quite similar, if not identical.

I also encourage you to explore the pane on the left hand side, which has a lot of information presented in a concise fashion.

Our First Practical Example

To avoid potential confusion, I should mention that, when talking about regular expression matching, we might mean either of two things:

  1. looking for any (or all) substrings of a string that match a regex
  2. checking whether or not the complete string matches the regular expression

The default meaning with which regex engines operate is (1). What we have been talking about so far is (2). Fortunately, it is easy to implement meaning (2) by means of metacharacters that will be introduced later on. Don't worry about this for now.

Let's start simple by testing out our sheep bleating example. Type (baa)+ into the expression box and some examples to test for matches as shown below.

[Screenshot: RegExr with (baa)+ in the expression box and sample strings in the text area]

I hope you understand why the matches that succeeded actually succeeded and why the others failed. Even in this simple example, there are a few interesting things to point out.

Greedy Matches

Does the string "baabaa" contain two matches or one? In other words, is each individual "baa" a match or is the entire "baabaa" a single match? This comes down to whether or not a "greedy match" is being sought. A greedy match attempts to match as much of a string as possible.

Right now the regex engine is matching greedily, which means "baabaa" is a single match. There are ways to do lazy matching, but that is a more advanced topic and, since we already have our plates full, we won't cover that in this tutorial.

The RegExr tool leaves a small but discernible gap in the highlighting if two adjacent parts of a string each individually (but not collectively) match the regular expression. We will see an example of this behavior in a bit.
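You can observe the same greedy behavior from Swift by collecting every match. This is a sketch, assuming Foundation's NSRegularExpression; allMatches is a hypothetical helper name.

```swift
import Foundation

// Hypothetical helper: returns every substring of `text` matched by `pattern`.
func allMatches(of pattern: String, in text: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    let whole = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, range: whole).compactMap { match in
        Range(match.range, in: text).map { String(text[$0]) }
    }
}

print(allMatches(of: "(baa)+", in: "baabaa"))      // ["baabaa"]: one greedy match
print(allMatches(of: "baa", in: "baabaa"))         // ["baa", "baa"]: two adjacent matches
print(allMatches(of: "(baa)+", in: "baabaa baa"))  // ["baabaa", "baa"]
```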

Upper- and Lowercase

"Baabaa" fails because of the uppercase "B". Say you wanted to allow only the first "B" to be uppercase, what would the corresponding regular expression be? Try to figure it out by yourself first.

One answer is (B|b)aa(baa)*. It helps if you read it aloud: an uppercase or lowercase "b", followed by "aa", followed by zero or more instances of "baa". This is workable, but note that this could quickly get inconvenient, especially if we wanted to ignore capitalization altogether. For example, we would have to specify both cases for each letter, which would result in something unwieldy like ([Bb][Aa][Aa])+.

Fortunately, regular expression engines typically have an option to ignore case. In the case of RegExr, click the button that reads "flags" and check the checkbox "ignore case". Notice that the letter "i" is prepended to the list of options at the end of the regular expression. Try some examples with mixed case letters, such as "bAABaa".
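In Swift, the counterpart of RegExr's "ignore case" flag is the .caseInsensitive option of NSRegularExpression. The sketch below assumes Foundation; containsMatch and its ignoringCase parameter are hypothetical names.

```swift
import Foundation

// Hypothetical helper: true when `pattern` matches somewhere inside `text`,
// optionally ignoring the case of letters.
func containsMatch(_ pattern: String, in text: String, ignoringCase: Bool) -> Bool {
    let options: NSRegularExpression.Options = ignoringCase ? [.caseInsensitive] : []
    guard let regex = try? NSRegularExpression(pattern: pattern, options: options) else {
        return false
    }
    let whole = NSRange(text.startIndex..., in: text)
    return regex.firstMatch(in: text, range: whole) != nil
}

print(containsMatch("(baa)+", in: "bAABaa", ignoringCase: false))  // false
print(containsMatch("(baa)+", in: "bAABaa", ignoringCase: true))   // true
```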

Another Example

Let's try to design a regular expression that can capture variants of the name "Katherine". How would you approach this problem? I would write down as many variations as I can, look at the common parts, and then try to express in words the variations (with emphasis on the alternates and optional letters) as a sequence. Next, I would attempt to formulate the regular expression that assimilates all these variations.

Let's try it out with this list of variations: Katherine, Katharine, Catherine, Kathreen, Kathleen, Katryn, and Catrin. I will leave it up to you to write down several more if you like. Looking at these variations, I can roughly say that:

  • the name starts with "k" or "c"
  • followed by "at"
  • followed possibly by an "h"
  • possibly followed by an "a" or "e"
  • followed by either an "r" or "l"
  • followed by one of "i", "ee", or "y"
  • and definitely followed by an "n"
  • possibly an "e" at the end

With this idea in mind, I can come up with the following regular expression:

[Screenshot: RegExr showing the regular expression and its matches against the name variations]

Note that the first line "KatherineKatharine" has two matches without any separation between them. If you look at it closely in RegExr's text editor, you can observe the small break in the highlighting between the two matches, which is what I was talking about earlier.

Note that the above regular expression also matches names that we did not consider and that might not even exist, for example, "Cathalin". In the present context, this doesn't affect us negatively at all. But in some applications, such as email validation, you want to be more specific about the strings you match and those you reject. This usually adds to the complexity of the regular expression.
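For experimenting in Swift, here is one possible transcription of the bullet list into a pattern. It is an assumption on my part and not necessarily the exact expression shown in RegExr; in particular, it expects a capitalized first letter. The allMatches helper is a hypothetical name built on Foundation's NSRegularExpression.

```swift
import Foundation

// Assumed pattern, transcribed bullet by bullet:
// K or C, "at", optional "h", optional "a" or "e", "r" or "l",
// one of "i", "ee", "y", then "n", optional final "e".
let namePattern = "(K|C)ath?(a|e)?(r|l)(i|ee|y)ne?"

// Hypothetical helper: returns every substring of `text` matched by `pattern`.
func allMatches(of pattern: String, in text: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    let whole = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, range: whole).compactMap { match in
        Range(match.range, in: text).map { String(text[$0]) }
    }
}

print(allMatches(of: namePattern, in: "KatherineKatharine"))
// ["Katherine", "Katharine"]
print(allMatches(of: namePattern, in: "Kathleen, Katryn and Catrin"))
// ["Kathleen", "Katryn", "Catrin"]
print(allMatches(of: namePattern, in: "Cathalin"))
// ["Cathalin"]: a name we never listed
```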

More Syntax and Examples

Before we move on to Swift, I would like to discuss a few more aspects of the syntax of regular expressions.

Concise Representations

Several classes of related characters have a concise representation:

  • \w represents an alphanumeric character or underscore, equivalent to [a-zA-Z0-9_] for ASCII text
  • \d represents a digit, equivalent to [0-9] for ASCII text
  • \s represents whitespace, that is, space, tab, or line break

These classes also have corresponding negative classes:

  • \W represents a non-alphanumeric, non-underscore character
  • \D represents a non-digit
  • \S represents a non-whitespace character

Remember the uncapitalized classes and then recall that the corresponding capitalized one matches what the uncapitalized class doesn't match. Note that these can be combined by including them inside square brackets if necessary. For example, [\s\S] represents any character, including line breaks. Recall that the period . matches any character except line breaks.
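The shorthand classes can be tried out with a short sketch. It assumes Foundation's regex search and uses raw string literals so the backslashes need no extra escaping; matchesWhole is a hypothetical helper.

```swift
import Foundation

// Hypothetical helper: true when `pattern` matches the whole of `text`.
func matchesWhole(_ pattern: String, _ text: String) -> Bool {
    return text.range(of: "^(?:\(pattern))$", options: .regularExpression) != nil
}

print(matchesWhole(#"\w+"#, "snake_case9"))  // true: letters, digits, underscore
print(matchesWhole(#"\d+"#, "2024"))         // true
print(matchesWhole(#"\d+"#, "20a4"))         // false
print(matchesWhole(#"\S+"#, "no spaces"))    // false: the space is whitespace
print(matchesWhole(#"c[\s\S]t"#, "c\nt"))    // true: [\s\S] includes line breaks
print(matchesWhole("c.t", "c\nt"))           // false: the period does not
```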

Anchors

^ and $ are anchors that represent the start and end of a string respectively. Remember that I wrote you might want to match an entire string, rather than look for substring matches? This is how you do that. ^c[oau]t$ matches "cat", "cot", or "cut", but not, say, "catch" or "recut".
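The two meanings of "matching" can be compared side by side in Swift. This is a minimal sketch, assuming Foundation's range(of:options:) with .regularExpression; found is a hypothetical helper name.

```swift
import Foundation

// Hypothetical helper: true when `pattern` matches somewhere inside `text`.
func found(_ pattern: String, in text: String) -> Bool {
    return text.range(of: pattern, options: .regularExpression) != nil
}

print(found("c[oau]t", in: "catch"))    // true: "cat" is a substring match
print(found("^c[oau]t$", in: "catch"))  // false: the anchors demand the whole string
print(found("^c[oau]t$", in: "cut"))    // true
print(found("^c[oau]t$", in: "recut"))  // false
```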

Word Boundaries

\b represents a boundary between words, such as due to space or punctuation, and also the start or end of the string. Note that it is a bit different in that it matches a position rather than an explicit character. It might help to think of a word boundary as an invisible divider that separates a word from the previous/next one. As you'd expect, \B represents "not a word boundary". \bcat\b finds matches in "cat", "a cat", "Hi,cat", but not in "acat" or "catch".
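The word-boundary examples above can be checked with the following sketch. It assumes Foundation's regex search with its default word-boundary behavior; found is a hypothetical helper name.

```swift
import Foundation

// Hypothetical helper: true when `pattern` matches somewhere inside `text`.
func found(_ pattern: String, in text: String) -> Bool {
    return text.range(of: pattern, options: .regularExpression) != nil
}

let wholeWordCat = #"\bcat\b"#
print(found(wholeWordCat, in: "a cat"))   // true
print(found(wholeWordCat, in: "Hi,cat"))  // true: punctuation ends the previous word
print(found(wholeWordCat, in: "acat"))    // false
print(found(wholeWordCat, in: "catch"))   // false
```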

Negation

The idea of negation can be made more specific using the ^ metacharacter inside a character set. This is a completely different use of ^ from "start of string anchor". This means that, for negation, ^ must be used in a character set right at the start. [^a] matches any character besides the letter "a" and [^a-z] matches any character except a lowercase letter.

Can you represent \W using negation and character ranges? The answer is [^A-Za-z0-9_]. What do you think [a^] matches? The answer is either an "a" or a "^" character, since the ^ does not occur at the beginning of the character set. Here "^" matches itself literally.

Alternatively, we could escape it explicitly like this: [\^a]. Hopefully, you are beginning to develop some intuition on how escaping works.
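Here is a brief sketch of negated sets and the literal caret, assuming Foundation's regex search; matchesWhole is a hypothetical helper that only accepts complete strings.

```swift
import Foundation

// Hypothetical helper: true when `pattern` matches the whole of `text`.
func matchesWhole(_ pattern: String, _ text: String) -> Bool {
    return text.range(of: "^(?:\(pattern))$", options: .regularExpression) != nil
}

print(matchesWhole("[^a-z]", "Q"))     // true: anything but a lowercase letter
print(matchesWhole("[^a-z]", "q"))     // false
print(matchesWhole("[a^]", "^"))       // true: a caret that is not first is literal
print(matchesWhole("[a^]", "b"))       // false
print(matchesWhole(#"[\^a]"#, "^"))    // true: explicitly escaped caret
```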

Quantifiers

We saw how * (and +) can be used to match a token zero or more (and one or more) times. This idea of matching a token multiple times can be made more specific using quantifiers in curly braces. For example, {2,4} means two to four matches of the preceding token. {2,} means two or more matches and {2} means exactly two matches.
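The three quantifier forms can be compared with a short sketch. It assumes Foundation's regex search; matchesWhole is a hypothetical helper that only accepts complete strings.

```swift
import Foundation

// Hypothetical helper: true when `pattern` matches the whole of `text`.
func matchesWhole(_ pattern: String, _ text: String) -> Bool {
    return text.range(of: "^(?:\(pattern))$", options: .regularExpression) != nil
}

print(matchesWhole("a{2,4}", "aaa"))     // true: between two and four
print(matchesWhole("a{2,4}", "aaaaa"))   // false: five is too many
print(matchesWhole("a{2,}", "aaaaa"))    // true: two or more
print(matchesWhole("a{2}", "aa"))        // true: exactly two
print(matchesWhole("a{2}", "aaa"))       // false
```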

We will look at detailed examples that use most of these elements in the next tutorial. But for the sake of practice, I encourage you to make up your own examples and test out the syntax we just saw with the RegExr tool.

Conclusion

In this tutorial, we have primarily focused on the theory and syntax of regular expressions. In the next tutorial, we add Swift to the mix. Before moving on, make sure you understand what we have covered in this tutorial by playing around with RegExr.
