You actually now know everything you need to use very basic regular expressions. You know how to match all of the following things:

characters
Do this by just typing the chars, and using \ to escape regex symbols you want to match.
any char
Use the . (dot) to match any one char.
char sets
Use [] to make a set, including ranges of characters to match like [0-9]
inverted sets
Put a ^ inside a set and it inverts: [^a-z].
beginning anchor
A ^ at the beginning of a regex says to "anchor" it so that it only matches at the beginning.
ending anchor
A $ at the end of a regex anchors to the end so it only matches at the end. Combine with ^ to match exactly.
optional modifier
Put a ? after a regex symbol, char, or set and it will make that thing optionally matched.
one-or-more
Put a + after a regex symbol, char, or set and it will match one or more of them.
zero-or-more
Put a * after a regex symbol, char, or set and it's optionally there or repeated (zero or more).
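As a quick review, here's a minimal sketch of the list above using Python's re module. The sample patterns and strings are made up for illustration and are not part of the exercise files; each row has one string that should match and one that should not.

```python
import re

# Each entry: (concept, pattern, text that should match, text that should not).
# The patterns and sample strings are illustrative assumptions.
EXAMPLES = [
    ("characters (escaped dot)", r"a\.b", "a.b", "axb"),
    ("any char", r"a.b", "axb", "ab"),
    ("char set", r"[0-9]", "7", "x"),
    ("inverted set", r"[^a-z]", "7", "x"),
    ("beginning anchor", r"^cat", "catalog", "bobcat"),
    ("ending anchor", r"cat$", "bobcat", "catalog"),
    ("optional modifier", r"^colou?r$", "color", "colouur"),
    ("one-or-more", r"^ab+$", "abbb", "a"),
    ("zero-or-more", r"^ab*$", "a", "ac"),
]


def check(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


for name, pattern, good, bad in EXAMPLES:
    print(name, pattern, check(pattern, good), check(pattern, bad))
```

Each line should print True for the first check and False for the second.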

Mostly everything after this is additional ways to do the following to the above concepts:

In this exercise you'll learn to limit repetition in different ways, including by specifying the "greediness" of the expression. To learn this we'll try matching various dates in different formats, while trying to be as exact as possible.

NOV-10-2011 August 20, 2011 10-18-2011

Once you have that corpus text written out, here's the regex:

[A-Z]{3}-[0-9]{2}-[0-9]{4}
[A-Za-z]+? [0-9]{2}, [0-9]{2,4}?
[0-9]{2}-[0-9]{2}-[0-9]{2,4}?
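If you don't have the exercise's tool handy, here's a rough sketch that tries the same three regexes against the corpus line using Python's re module. The names CORPUS, PATTERNS, and first_match are my own for illustration, and other engines may behave differently.

```python
import re

# The corpus line and the three regexes from this exercise.
CORPUS = "NOV-10-2011 August 20, 2011 10-18-2011"

PATTERNS = [
    r"[A-Z]{3}-[0-9]{2}-[0-9]{4}",
    r"[A-Za-z]+? [0-9]{2}, [0-9]{2,4}?",
    r"[0-9]{2}-[0-9]{2}-[0-9]{2,4}?",
]


def first_match(pattern, text):
    """Return the text of the first match, or None if nothing matches."""
    found = re.search(pattern, text)
    return found.group(0) if found else None


RESULTS = [first_match(p, CORPUS) for p in PATTERNS]
for pattern, result in zip(PATTERNS, RESULTS):
    print(pattern, "->", result)
```

In Python the first regex finds NOV-10-2011. The other two stop at "August 20, 20" and "10-18-20", because a non-greedy {2,4}? at the very end of a regex is satisfied after only two digits.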

I'm using three new kinds of syntax that all do the same thing:

The way to explain "non-greedy" is simply that most regex engines are "greedy" by default: they try to match a repetition against the biggest part of the corpus text possible. If you run into situations where you're matching too much with repetition, then you can put a ? after the repetition to tell it to match the smallest repetition possible.
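To see the difference, here's a small sketch in Python's re module. The quoted sample text is made up for illustration; the year example reuses the {2,4} syntax from the regex above.

```python
import re

# Made-up sample text with two quoted words.
TEXT = 'say "one" and "two"'

# Greedy: .+ grabs as much as it can, so the match runs to the last quote.
greedy = re.search(r'".+"', TEXT).group(0)

# Non-greedy: .+? grabs as little as it can, so it stops at the next quote.
lazy = re.search(r'".+?"', TEXT).group(0)

# The same idea with a counted repetition on a year.
greedy_year = re.search(r"[0-9]{2,4}", "2011").group(0)
lazy_year = re.search(r"[0-9]{2,4}?", "2011").group(0)

print(greedy)       # "one" and "two"
print(lazy)         # "one"
print(greedy_year)  # 2011
print(lazy_year)    # 20
```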

I can then break down each of the symbols I used in this regex file:

ex10.regex:1
In this regex I use {X} to limit the month to 3 characters ([A-Z]{3}), the day to 2 digits ([0-9]{2}), and the year to 4 digits ([0-9]{4}).
ex10.regex:2
In this one I'm using +? to match one or more letters for the month, but to make it the smallest match possible. Remember the ? after the + does this. I also use {2,4}? at the end to match 2 to 4 digits for the year, but the ? will make it match the smallest number it can.
ex10.regex:3
Finally I'm mixing these up: {2} matches exactly 2 digits each for the month and the day, and {2,4}? matches 2 to 4 digits for the year, taking as few as it can.

What You Should See

When you run this it should match each of the dates correctly:

That file doesn't exist.
> [A-Z]{3}-[0-9]{2}-[0-9]{4}
Input file is empty. Use !load to load something.
> [A-Za-z]+? [0-9]{2}, [0-9]{2,4}?
Input file is empty. Use !load to load something.
> [0-9]{2}-[0-9]{2}-[0-9]{2,4}?
Input file is empty. Use !load to load something.
>

Extra Credit

  • Write each of these out in verbose form and make sure you get the comments after each part.
  • Write out some index cards with all of the symbols you've learned so far, including these new ones. On the front write the symbol/expression, and on the back write the phrase for it. For example, looking at the list at the beginning of this exercise, I'd put [] on one side, and "char set" on the other side.
  • Drill with these cards for 15 minutes before you go to bed. Put the ones you keep getting wrong into a separate pile. The next day, take breaks while you work and drill the ones you keep getting wrong only. Repeat this until you get all of them correct. Doing this before you go to bed will make your brain try to figure it out while you sleep. Drilling the ones you don't know will efficiently train you in what you don't know.
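For the first extra credit item, here's one way the first regex might look in verbose form, sketched with Python's re.VERBOSE flag. The comments are my own wording, and the name DATE_VERBOSE is just for illustration. Note that verbose mode ignores plain spaces in the pattern, so the second regex would need its literal spaces escaped.

```python
import re

# First regex from this exercise, one piece per line with a comment.
DATE_VERBOSE = re.compile(r"""
    [A-Z]{3}    # month: exactly three uppercase letters
    -           # a literal dash
    [0-9]{2}    # day: exactly two digits
    -           # a literal dash
    [0-9]{4}    # year: exactly four digits
""", re.VERBOSE)

match = DATE_VERBOSE.search("NOV-10-2011 August 20, 2011 10-18-2011")
print(match.group(0))  # NOV-10-2011
```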

Portability Notes

Some regex engines do not have the {X,Y} or {X} syntax. Other regex engines that have it become horribly slow when you give exact matching counts. I've found that it's best to just not use this syntax unless you absolutely need to, and instead do your data validation in your software after matching.
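Here's a rough sketch of what "validate in your software after matching" could look like in Python. The regex only uses + and grouping, and the length checks happen in code. The names LOOSE_DATE and parse_date are hypothetical, and the rule that a year must be 2 or 4 digits is my assumption about what you'd want.

```python
import re

# Loose pattern: no {X} or {X,Y}, just one-or-more digits between dashes.
LOOSE_DATE = re.compile(r"([0-9]+)-([0-9]+)-([0-9]+)")


def parse_date(text):
    """Return (month, day, year) as ints, or None if the text doesn't fit."""
    match = LOOSE_DATE.search(text)
    if match is None:
        return None
    month, day, year = match.groups()
    # Do the counting here instead of in the regex.
    # Assumption: months and days are 2 digits, years are 2 or 4 digits.
    if len(month) != 2 or len(day) != 2 or len(year) not in (2, 4):
        return None
    return int(month), int(day), int(year)


print(parse_date("10-18-2011"))  # (10, 18, 2011)
print(parse_date("1-18-2011"))   # None
```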