Remember in the last two exercises where I kept asking you how you would match a URL like /blog/article/12345? You might have thought the ? optional element was the answer, but actually it's repetition. What you need is a way to tell the regex engine "any number of these". There are actually three kinds of repetition:

one-or-more
Element repeats such that there has to be at least one.
zero-or-more
Element repeats but it's optional, so there could be 0.
X-or-Y
A minimum of X and a maximum of Y. We'll cover these in the next exercise.

The regex symbol for "one-or-more" is + and for "zero-or-more" is *. Just like with the "optional" symbol ? you place it after the regex symbol or character set to repeat and that's it.
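To see the difference between the two symbols, here is a small sketch in Python, assuming its standard re module rather than the interactive tool used in this exercise. The sample strings are made up for illustration and are not part of the exercise.

```python
import re

# Hypothetical sample strings, chosen only to show the difference.
samples = ["ab", "aab", "b"]

one_or_more = re.compile(r"^a+b$")   # at least one "a", then "b"
zero_or_more = re.compile(r"^a*b$")  # any number of "a" (even none), then "b"

for s in samples:
    plus = bool(one_or_more.match(s))
    star = bool(zero_or_more.match(s))
    print(f"{s!r}: a+b={plus} a*b={star}")
```

Only "b" should give different answers: it has zero "a" characters, which * accepts and + rejects.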

Here are some URLs we'll try matching, with two ways to get at a blog article:

/blog/article/12345
/blog/article/345
/blog/article/NOV-01-2011

It's slightly contrived but we'll match by a straight numeric ID or by the MMM-DD-YYYY date.

^/blog/article/[0-9]*$
^/blog/article/[A-Z]+-[0-9]+-[0-9]+$

^     # from the start
/blog/article/
[A-Z] # A-Z for MMM
+     # one-or-more
-     # dash char
[0-9] # 0-9 for DD
+     # one-or-more
-     # dash char
[0-9] # 0-9 for YYYY
+     # one-or-more
$     # to the end

I took the most complex regex at the end and wrote it out again so you can see my breakdown of it exactly.
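If you want to check the short and verbose forms against each other outside the exercise tool, one option is Python's re module, whose re.VERBOSE flag ignores whitespace and # comments inside a pattern. This is only a sketch under that assumption; the variable names are mine, and the tool used in this exercise may treat verbose patterns differently.

```python
import re

# The two short-form regexes from this exercise.
short_id = re.compile(r"^/blog/article/[0-9]*$")
short_date = re.compile(r"^/blog/article/[A-Z]+-[0-9]+-[0-9]+$")

# The verbose breakdown, compiled with re.VERBOSE so that whitespace
# and "# ..." comments inside the pattern are ignored.
verbose_date = re.compile(r"""
^     # from the start
/blog/article/
[A-Z] # A-Z for MMM
+     # one-or-more
-     # dash char
[0-9] # 0-9 for DD
+     # one-or-more
-     # dash char
[0-9] # 0-9 for YYYY
+     # one-or-more
$     # to the end
""", re.VERBOSE)

urls = [
    "/blog/article/12345",
    "/blog/article/345",
    "/blog/article/NOV-01-2011",
]

for url in urls:
    by_id = bool(short_id.match(url))
    by_date = bool(short_date.match(url))
    by_verbose = bool(verbose_date.match(url))
    print(f"{url}: id={by_id} date={by_date} verbose_date={by_verbose}")
```

The short and verbose date regexes should agree on every URL, since they are the same pattern written two ways.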

What You Should See

Running this, you should see it match these URLs. Then go back and match the verbose form to the short form.

That file doesn't exist.
> ^/blog/article/[0-9]*$
Input file is empty. Use !load to load something.
> ^/blog/article/[A-Z]+-[0-9]+-[0-9]+$
Input file is empty. Use !load to load something.
>
^     # from the start
/blog/article/
[A-Z] # A-Z for MMM
+     # one-or-more
-     # dash char
[0-9] # 0-9 for DD
+     # one-or-more
-     # dash char
[0-9] # 0-9 for YYYY
+     # one-or-more
$     # to the end

Input file is empty. Use !load to load something.
>

The best way to map the verbose form to the short form is to go character by character through the regex on line 2 and match each character to its line in the verbose form.

Extra Credit

  • Write URLs that do not match these regexes at all and explain why they don't.
  • Take the verbose form regex, and edit it so that you convert it back to short form manually.
  • Write out the first regex in verbose form.
  • Change the date regex so that it can match "January-01-2011".
  • There's a problem with the second regex because it could match dates like "January-34355-929939494". Is this a good or bad thing? How would you change the regex to make it more strict?

Portability Notes

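Some regex engines don't treat + as a repetition symbol. In POSIX basic regular expressions, which grep and sed use by default, a plain + is just a literal character. If you're trying to use + and it's not working, don't simply swap in *, because that changes "one-or-more" into "zero-or-more". Instead, write the element twice and put * on the second copy, so [0-9]+ becomes [0-9][0-9]*. Nearly every regex engine on the planet has * at least.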
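When + isn't available, one-or-more can be spelled as the element followed by the same element with *. Here is a minimal sketch in Python, assuming its re module, that compares the two spellings on the numeric ID regex. The empty-ID URL is a made-up test case, not one from the exercise.

```python
import re

# One-or-more digits, written with + and then without it.
with_plus = re.compile(r"^/blog/article/[0-9]+$")
without_plus = re.compile(r"^/blog/article/[0-9][0-9]*$")

# The second URL has no ID at all (hypothetical case for illustration).
for url in ["/blog/article/12345", "/blog/article/"]:
    a = bool(with_plus.match(url))
    b = bool(without_plus.match(url))
    print(f"{url}: with_plus={a} without_plus={b}")
```

Both patterns should accept the first URL and reject the second, because each demands at least one digit.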