Getting started with regular expressions

A tutorial for beginners

Have you ever wanted to search a document for text that satisfies certain rules, or a certain pattern? Perhaps you need to find a bunch of product numbers, email addresses, or hyphenated words? In other words, you’re trying to find segments of text that are similar in some way, but not exactly the same. This is exactly the sort of challenge that regular expressions (or regexes for short) were designed for.

A long, complex regex can look perfectly terrifying, consisting of a seemingly unintelligible string of strange characters[1]:

(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

Eek! It’s enough to scare anyone off, even people with some programming experience. However, you might be surprised by how quickly you can start writing perfectly useful regular expressions with no previous experience at all. The trick is to learn a few basic building blocks… then, just start putting them together, one after the other. Let’s start with some simple patterns that you can learn and put to use straight away.

The following is a perfectly valid regex:

cake

Can you guess what it searches for? :-)

By default, custom searches in Inkwire Hyperlinker are case sensitive and don’t match partial words, so a search for cake will not find ‘Cake’, ‘cakes’, ‘pancake’ or ‘shortcake’. (Tip: If you want to find all the cake, uncheck the case sensitive and whole words options first!) Why not give it a try? Whip up some cake-laden text in InDesign, and run a custom search using Hyperlinker’s custom search feature or InDesign’s GREP search (Edit > Find/Change > GREP). (If you don’t own InDesign, search your favourite app store for ‘regex’ and you’re sure to find a regex testing tool to play around in.)

Did you find all the cake? Or was there a slight problem? Yes, unticking case sensitive and whole words in Hyperlinker will find you every instance of those four letters (‘cake’, ‘Cake’ and even ‘cAkE’), but what if you want the whole ‘pancake’, pan and all?
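If you'd rather experiment in code than in InDesign, here is a minimal sketch using Python's built-in re module. It is only a stand-in for Hyperlinker's options: I'm assuming that \b (a word boundary) behaves roughly like the whole words option and that re.IGNORECASE behaves like unticking case sensitive. The sample text is made up.

```python
import re

text = "Cake, cakes, pancake and shortcake: I love cake."  # made-up sample

# \b marks a word boundary, standing in for a "whole words" option.
whole_word_case_sensitive = re.findall(r"\bcake\b", text)

# No \b, plus IGNORECASE: every run of those four letters matches.
anywhere_any_case = re.findall(r"cake", text, flags=re.IGNORECASE)

print(whole_word_case_sensitive)  # ['cake']
print(anywhere_any_case)          # ['Cake', 'cake', 'cake', 'cake', 'cake']
```

Notice that the second search finds the four letters inside ‘pancake’, but never returns the ‘pan’.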

For that we need to tap into the real power of regex, where some characters take on special meaning. The first special character we’ll look at is the pipe (or vertical bar) character (|) which essentially means ‘or’. We can use this to search for various alternatives:

cake|pancake|shortcake

This regex says that a successful match is any of the following: ‘cake’, ‘pancake’ or ‘shortcake’. When you run the search globally (by using Hyperlinker’s Convert All button) it will find all successful matches.
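As a rough illustration in Python (the sample sentence is my own, and re.findall plays the part of a global search here):

```python
import re

text = "We ordered cake, a pancake and some shortcake."  # made-up sample

# The pipe lets any one of the three alternatives count as a match.
matches = re.findall(r"cake|pancake|shortcake", text)

print(matches)  # ['cake', 'pancake', 'shortcake']
```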

This works, but it’s not a very elegant expression. For one thing, we had to type out the word ‘cake’ three times. To avoid this repetition, we could rewrite it like this:

(|pan|short)cake

Here, we use parentheses to group together one part of our regex, which limits the scope of the or operation. So now, our regex will match the word ‘cake’ preceded by nothing or ‘pan’ or ‘short’.
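Here is a small Python sketch of the grouped version, again with made-up sample text. One Python-specific wrinkle: re.findall returns only the contents of a group when the pattern has one, so this sketch uses finditer and group(0) to get each whole match.

```python
import re

pattern = re.compile(r"(|pan|short)cake")
text = "cake, pancake, shortcake, cheesecake"  # made-up sample

# group(0) is the full match, not just the part inside the parentheses.
matches = [m.group(0) for m in pattern.finditer(text)]

print(matches)  # ['cake', 'pancake', 'shortcake', 'cake']
```

The last result is only the ‘cake’ from ‘cheesecake’, because ‘cheese’ is not one of the listed alternatives.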

That’s certainly an improvement, but… we’re still missing out on cheesecake and that is, of course, quite unacceptable. What we really want is a way to find all the words that include ‘cake’, and to match the whole word in each case.

We’ll come back to the cake scenario in a moment. For now, let’s look at some more special regex characters and what they mean:

Character	Meaning
.	Any character (except a line break)
[abc]	One of the characters between the brackets (a, b or c)
[A-Z]	A letter from A to Z (uppercase)
[a-z]	A letter from a to z (lowercase)
[0-9]	A digit from 0 to 9
[A-Za-z0-9]	A letter (uppercase or lowercase) or digit
[^A-Za-z0-9]	One character that is NOT a letter or digit
[^ ]	One character that is NOT a space
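To see a few rows of this table in action, here is a hedged Python sketch. The sample strings and variable names are invented for illustration, and Python's re module may differ in small ways from InDesign's GREP.

```python
import re

sample = "Room B7, floor 3!"  # made-up sample text

uppercase = re.findall(r"[A-Z]", sample)
digits = re.findall(r"[0-9]", sample)
not_alnum = re.findall(r"[^A-Za-z0-9]", sample)
not_space = re.findall(r"[^ ]", "a b")

# The dot matches any character except a line break,
# so 'c\nt' is skipped here.
dot = re.findall(r"c.t", "cat cot c\nt cut")

print(uppercase)  # ['R', 'B']
print(digits)     # ['7', '3']
print(not_alnum)  # [' ', ',', ' ', ' ', '!']
print(not_space)  # ['a', 'b']
print(dot)        # ['cat', 'cot', 'cut']
```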

Square brackets define a set of possible characters, or what’s called a character class. This is a simple but powerful feature of regex. We could use it to create our own vowel character class, like this:

[aeiou]

We could use it to find tense variations of the word ‘sing’:

s[iau]ng

This will match ‘sing’, ‘sang’ or ‘sung’.

At this point, you may have realised that [iau] means ‘i’ or ‘a’ or ‘u’. This raises the question, could we have used the pipe character we learnt about earlier to achieve the same thing? The short answer is, yes:

s(i|a|u)ng

While this achieves the same result, the character class is shorter and easier to read. You certainly wouldn’t want to try and replicate [A-Za-z0-9] this way!
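If you want to check that the two patterns really do agree, a quick Python sketch might look like this. The helper name full_matches and the sample sentence are my own inventions.

```python
import re

text = "I sing, she sang, they have sung a song."  # made-up sample

def full_matches(pattern, text):
    """Return the whole text of every match (ignoring any groups)."""
    return [m.group(0) for m in re.finditer(pattern, text)]

with_class = full_matches(r"s[iau]ng", text)
with_pipe = full_matches(r"s(i|a|u)ng", text)

print(with_class)               # ['sing', 'sang', 'sung']
print(with_class == with_pipe)  # True
```

Neither pattern matches ‘song’, since ‘o’ is not among the allowed letters.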

There is often more than one way to accomplish the same thing with regular expressions. Some character classes are so common that there are shortcuts for them. \d will match a digit from 0 to 9, and \w will match any letter or digit. (Note: \w will also match underscores, and both these shortcuts will attempt to match Unicode variations of each character.) Here’s a list of similar shortcuts:

Character	Meaning
\w	A word character (letter, digit or underscore)
\W	One character that is NOT a word character
\d	A digit from 0 to 9
\D	One character that is NOT a digit
\s	A whitespace (space, tab, newline, etc)
\S	One character that is NOT a whitespace
\t	A tab
\n	A line feed (‘forced line break’ or ‘soft return’ in InDesign)
\r	A carriage return (‘end of paragraph’ in InDesign)

Here, the backslash character (\) gives special meaning to the letter that follows.
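A short Python sketch of a few of these shortcuts, using a made-up sample string. Python's versions behave much as described above, although the InDesign-specific meanings of \n and \r don't apply outside InDesign.

```python
import re

text = "id_7 = 42\n"  # made-up sample: note the underscore and the newline

digits = re.findall(r"\d", text)
word_chars = re.findall(r"\w", text)
whitespace = re.findall(r"\s", text)
non_word = re.findall(r"\W", text)

print(digits)      # ['7', '4', '2']
print(word_chars)  # ['i', 'd', '_', '7', '4', '2']
print(whitespace)  # [' ', ' ', '\n']
print(non_word)    # [' ', '=', ' ', '\n']
```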

The backslash can also be used to remove special meaning when you want a special character to be treated literally. For example, in regex the period character (.) normally means any character except a line break. But what if we want to search for an actual period or full stop? We can do this by preceding it with a backslash (\.). This is called escaping a character. Other special characters you will need to escape if you want to match them are brackets ({}()[]), certain symbols (^$|?*+), and the backslash character itself (\). (Note that many of these characters do not need to be escaped when used within a character class. As a result, an alternative to escaping a period is to enclose it in square brackets ([.]), which I quite like for readability.)
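Here is a small Python sketch showing why escaping matters. The sample text is invented, and re.escape is a Python convenience that adds the backslashes for you; other tools may not offer an equivalent.

```python
import re

text = "3.14 or 3514?"  # made-up sample

unescaped = re.findall(r"3.14", text)    # the dot matches any character
escaped = re.findall(r"3\.14", text)     # the dot must be a literal period
bracketed = re.findall(r"3[.]14", text)  # same effect via a character class
auto_escaped = re.findall(re.escape("3.14"), text)

print(unescaped)     # ['3.14', '3514']
print(escaped)       # ['3.14']
print(bracketed)     # ['3.14']
print(auto_escaped)  # ['3.14']
```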

Your regex toolbox now contains quite an array of useful tools! But… each one still only finds one character at a time. a finds one ‘a’. [a-z] finds one character between a and z. [^aeiou] finds one character that is not a lowercase vowel (that could be a consonant, but also a digit, a space or a punctuation mark).

To find multiple characters in a row, you could repeat these patterns over and over, but a better way is to use another piece of special syntax, called a quantifier. A quantifier specifies how many times the previous item should be matched. Many quantifiers even allow you to specify a variable number of repetitions:

Quantifier	Meaning
{3}	3 times
{2,5}	2 to 5 times
{2,}	2 or more times
?	Optional (0 or 1 times)
*	0 or more times
+	1 or more times
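One way to get a feel for the quantifiers is to test each against runs of the letter ‘a’ of different lengths. This Python sketch does that with re.fullmatch, which only succeeds if the pattern matches the entire string. The candidate strings and the helper name accepted are assumptions of mine.

```python
import re

candidates = ["", "a", "aa", "aaa", "aaaaaa"]  # 0, 1, 2, 3 and 6 letters
patterns = ["a{3}", "a{2,5}", "a{2,}", "a?", "a*", "a+"]

def accepted(pattern):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

for p in patterns:
    print(p, accepted(p))

# a{3}   ['aaa']
# a{2,5} ['aa', 'aaa']
# a{2,}  ['aa', 'aaa', 'aaaaaa']
# a?     ['', 'a']
# a*     ['', 'a', 'aa', 'aaa', 'aaaaaa']
# a+     ['a', 'aa', 'aaa', 'aaaaaa']
```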

By default, variable quantifiers are greedy, which means they will match as many times as they can. Let’s see how this works…

[Ww]he+

This pattern means match a ‘W’ or a ‘w’, followed by an ‘h’, followed by one or more ‘e’s. With the whole words option deselected, it will match ‘whee’ in ‘wheel’, ‘whe’ in ‘when’ and the whole of ‘Wheeeeeeee’ in the following text:

I grease the wheel bearings when I want to go faster… Wheeeeeeee!!!
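Running the same pattern over that sentence in Python gives a feel for greedy matching. This is a sketch using Python's re module rather than Hyperlinker itself.

```python
import re

text = "I grease the wheel bearings when I want to go faster… Wheeeeeeee!!!"

matches = re.findall(r"[Ww]he+", text)

# Greedy matching: e+ grabs every consecutive 'e' it can, so 'wheel'
# yields 'whee', 'when' yields 'whe', and the last match runs all the
# way to the exclamation marks.
print(matches[:2])  # ['whee', 'whe']
print(matches[2] == text.split()[-1].rstrip("!"))  # True
```

Note that ‘the’ is not matched, because the pattern insists on a ‘W’ or ‘w’ before the ‘h’.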

But by now you’re probably wondering… What about the CAKE?! The problem, if you recall, was how to find each word that includes ‘cake’ and match the whole word. We want the whole ‘pancake’ (not just the ‘cake’ part), and by golly, if there are cakes (plural), that ‘s’ is crucial. So how do we write our regex?

Well, if you’ve read this far, you have all the clues you need! Go on, have a go and see if you can solve it yourself. You can test your regex by doing a custom search using Hyperlinker, or InDesign’s GREP tool. You can also compare your answer to mine (or cheat) by reading on.

My answer:

[a-z]*cake[a-z]*

This means zero or more letters, followed by the word ‘cake’, followed by zero or more letters. (This assumes Hyperlinker’s case sensitive option is unchecked.)
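To try this answer outside Hyperlinker, here is a Python sketch. I'm assuming re.IGNORECASE is a fair substitute for unchecking the case sensitive option, and the sample text is made up.

```python
import re

text = "Cake, cakes, pancake, shortcake and Cheesecake!"  # made-up sample

matches = re.findall(r"[a-z]*cake[a-z]*", text, flags=re.IGNORECASE)

print(matches)  # ['Cake', 'cakes', 'pancake', 'shortcake', 'Cheesecake']
```

Every cake-containing word comes back whole, and ‘and’ is skipped because it contains no ‘cake’.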

While this little tutorial really only scratches the surface of what is possible with regex, I hope it has helped demystify a subject that scares many, and inspired you to try Hyperlinker’s powerful new custom search feature.

If you would like to learn more about regular expressions, I recommend these excellent resources:

Regular-Expressions.info and RegexBuddy by Jan Goyvaerts
RexEgg.com by ‘Rex’.

[1] The scary regex sample is John Gruber’s mighty URL-matching regex.
