Have you ever wondered how to find some text in a document, or how to check that a piece of text conforms to a format such as an email address?
The key to such operations is regular expressions (regex). Let's look at some definitions. Wikipedia defines regex as follows:
A sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. "find and replace"-like operations. The concept arose in the 1950s, when the American mathematician Stephen Kleene formalized the description of a regular language, and came into common use with the Unix text processing utilities ed, an editor, and grep, a filter.
Another nice definition from regular-expressions.info is:
A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of regular expressions as wildcards on steroids. You are probably familiar with wildcard notations such as *.txt to find all text files in a file manager. The regex equivalent is .*\.txt$
I know that the concept of regular expressions may still sound a bit vague. So, let's look at some examples of regex to understand the concept better.
Examples of Regular Expressions
In this section, I will show you some examples of regex to help you understand the concept further.
Say that you had this regex:
/abder/
This simply tells us to match the text abder only.
What about this regex?
/a[nr]t/
You can read this regex as follows: find a text pattern such that the first letter is a and the last letter is t, and between those letters comes either n or r. So the matching words are ant and art.
Let me give you a small quiz at this point. How would you write a regular expression that matches text starting with ca and ending with exactly one of the characters t, b, or r? Yes, this regex can be written as follows:
/ca[tbr]/
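If you would like to check the quiz answer yourself, here is a minimal sketch using Python's re module, which is covered in more detail later in this article. The list of candidate words is made up for illustration and is not part of the original quiz.

```python
import re

# Hypothetical candidate words, chosen only for illustration.
candidates = ["cat", "cab", "car", "can", "ca"]

# re.search returns a match object (truthy) when the pattern is found, else None.
# [tbr] stands for exactly one character: t, b, or r.
matched = [word for word in candidates if re.search(r"ca[tbr]", word)]

print(matched)
```

Note that can and ca are rejected: the character class requires one of the listed characters to be present.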
If you see a regex that starts with a circumflex accent ^, this means the match must occur at the start of the string (or at the start of a line, when the text is matched line by line). So the regex below matches a string that begins with This.
/^This/
Thus, given the following three lines:
My name is Abder
This is Abder
This is Tom
the regex /^This/ matches the following lines:
This is Abder
This is Tom
What if we wanted to match a string that ends with some string? In this case, we use the dollar sign $. Here is an example:
Abder$
Thus, in the above text (the three lines), the following lines would be matched using this regex:
My name is Abder
This is Abder
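The two anchors can be tried out with a short Python sketch like the one below. It assumes the three lines are stored in a single string separated by newlines; the variable names are made up for illustration. One detail worth knowing is that Python only treats ^ and $ as line anchors when the re.MULTILINE flag is passed.

```python
import re

# The three example lines, assumed here to live in one newline-separated string.
text = "My name is Abder\nThis is Abder\nThis is Tom"

# Option 1: test each line separately.
starts_with_this = [line for line in text.splitlines() if re.search(r"^This", line)]
ends_with_abder = [line for line in text.splitlines() if re.search(r"Abder$", line)]
print(starts_with_this)
print(ends_with_abder)

# Option 2: search the whole text at once.
# By default ^ matches only at the very start of the string, so nothing is found,
# because the text begins with "My".
print(re.findall(r"^This", text))
# With re.MULTILINE, ^ also matches at the start of every line.
print(re.findall(r"^This", text, flags=re.MULTILINE))
```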
Well, what do you think about this regex?
^[A-Z][a-z]
I know it might seem complex at first glance, but let's go through it piece by piece.
We already saw what a circumflex accent ^ is: it anchors the match to the start of the string. [A-Z] refers to the uppercase letters. So this part of the regex, ^[A-Z], tells us to match a string which begins with an uppercase letter. The last part, [a-z], means that this uppercase letter must be followed immediately by one lowercase letter. Whatever comes after those two characters does not matter.
So, which of the following strings will be matched using this regex? If you are not sure, you can use Python as we will see in the next section to test your answer.
abder
Abder
ABDER
ABder
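Once you have written down your answer, a small script along these lines can check it. This is only a sketch: the results dictionary and the loop are my own illustration, and the output is left out on purpose so the quiz is not spoiled.

```python
import re

candidates = ["abder", "Abder", "ABDER", "ABder"]
pattern = r"^[A-Z][a-z]"

# Map each candidate to True or False depending on whether the regex matches.
results = {word: bool(re.search(pattern, word)) for word in candidates}

for word, matched in results.items():
    print(word, matched)
```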
Regular expressions are a very broad topic, and those examples are just to give you a feel for what they are and why we use them.
A nice reference to learn more about regular expressions and see more examples is RexEgg.
Regular Expressions in Python
Let's now come to the fun part. We want to see how to work with some of the above regular expressions in Python. The module we will be using to work with regular expressions in Python is the re module.
The first example was about finding the word abder. In Python, we would do this as follows:
import re
text = 'My name is Abder'
match_pattern = re.match(r'Abder', text)
print(match_pattern)
If you run the above Python script, you will get the output: None!
The script works just fine, but the issue is with how the function match() works. If we return to the re module documentation, this is what the function match() does:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.
Aha, from this we can see that match() will return a result only if it finds a match at the beginning of the string.
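To see this behavior directly, here is a small sketch that calls match() twice on the same text. The pattern My is my own addition for illustration; it was picked simply because the text begins with it.

```python
import re

text = 'My name is Abder'

# 'My' sits at the very beginning of the text, so match() finds it.
at_start = re.match(r'My', text)

# 'Abder' appears later in the text, so match() returns None.
not_at_start = re.match(r'Abder', text)

print(at_start is not None)
print(not_at_start is None)
```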
We can instead use the function search(), which, according to the documentation, does the following:
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
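So, if we write the above script, but with search() instead of match(), we get output similar to the following (the exact form depends on your Python version; this is what Python 3.7 and later print):
<re.Match object; span=(11, 16), match='Abder'>
That is, a match object has been returned.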
If we want to return the result (the matched string), we use the group() method of the match object. If we want to see the entire match, we use group(0). Thus:
print(match_pattern.group(0))
will return the output: Abder.
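Since search() can return None, calling group() on the result without checking it first raises an AttributeError when nothing matches. A slightly more defensive sketch could look like this. The start(), end() and span() calls are extra match object methods that this article does not otherwise cover.

```python
import re

text = 'My name is Abder'
match_pattern = re.search(r'Abder', text)

if match_pattern is not None:
    # group(0) and group() both return the entire matched text.
    print(match_pattern.group(0))
    # start() and end() give the position of the match within the text.
    print(match_pattern.start(), match_pattern.end())
    # span() returns the same two positions as a tuple.
    print(match_pattern.span())
else:
    print('no match')
```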
If we take the second regex in the previous section, that is /a[nr]t/, it can be written in Python as follows:
import re
text = 'This is a black ant'
match_pattern = re.search(r'a[nr]t', text)
print(match_pattern.group(0))
The output for this script is: ant.
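Note that search() stops at the first match. If the text might contain several matches, one possible approach is to use findall() or finditer() from the same module, as in the sketch below. The example sentence is made up for illustration, and re.compile() is used here only so the pattern can be reused.

```python
import re

# Compile the pattern once so it can be reused.
pattern = re.compile(r'a[nr]t')

# Hypothetical sentence containing both of the words the pattern can match.
text = 'This is a black ant drawn as art'

# findall() returns every non-overlapping match as a list of strings.
all_matches = pattern.findall(text)
print(all_matches)

# finditer() yields match objects, which also carry the match position.
for m in pattern.finditer(text):
    print(m.start(), m.group(0))
```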
Conclusion
The article is getting longer, and the topic of regular expressions in Python surely takes more than one article, if not a book by itself.
This article, however, is meant to give you a quick start and the confidence to enter the world of regular expressions in Python. You can refer to the re documentation to learn more about this module and to go deeper into the topic.