You actually now know everything you need to use very basic regular expressions.
You know how to match all of the following things:
- characters
  - Do this by just typing the chars, and using \ to escape
    regex symbols you want to match.
- any char
  - Use the . (dot) to match any one char.
- char sets
  - Use [] to make a set, including ranges of characters
    to match like [0-9].
- inverted sets
  - Put a ^ at the start of a set, right after the [, and it inverts: [^a-z].
- beginning anchor
  - A ^ at the beginning of a regex says to "anchor" it
    so that it only matches at the beginning.
- ending anchor
  - A $ at the end of a regex anchors to the end so it only
    matches at the end. Combine with ^ to match exactly.
- optional modifier
  - Put a ? after a regex symbol, char, or set and it
    will make that thing optional.
- one-or-more
  - Put a + after a regex symbol, char, or set and it will
    match one or more of them.
- zero-or-more
  - Put a * after a regex symbol, char, or set and it's optionally
    there or repeated (zero or more).
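If you want to check each item in that list for yourself, here is a minimal sketch using Python's re module as a stand-in engine. The helper name found and every sample string are made up for illustration; they are not part of the exercise files.

```python
import re

def found(pattern, text):
    """Return the first substring that matches, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# Sample strings below are invented for this sketch.
print(found(r"a\.b", "xa.bx"))     # a.b    (plain chars, \ escapes the dot)
print(found(r"a.b", "xa-bx"))      # a-b    (. matches any one char)
print(found(r"[0-9]", "abc7"))     # 7      (set with a range)
print(found(r"[^a-z]", "abc7"))    # 7      (inverted set)
print(found(r"^abc", "abcabc"))    # abc    (anchored to the beginning)
print(found(r"abc$", "abcabc"))    # abc    (anchored to the end)
print(found(r"colou?r", "color"))  # color  (the u is optional)
print(found(r"[0-9]+", "ab123"))   # 123    (one or more digits)
print(found(r"ab*", "a"))          # a      (zero or more b)
```

Other engines may differ in small ways, so treat this as a quick check and not a reference.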
Almost everything after this is additional ways to apply the following to the
concepts above:
- Limit repetition in more complex ways.
- Group expressions.
- Alternate between two or more expressions.
- Special regex engine modifications for more complex tasks.
In this exercise you'll learn to limit repetition in different ways by
specifying the "greediness" of the expression. To learn this we'll
try matching various dates in different formats, but try to be as exact
as possible.
NOV-10-2011
August 20, 2011
10-18-2011
Once you have that corpus text written out, here's the regex:
[A-Z]{3}-[0-9]{2}-[0-9]{4}
[A-Za-z]+? [0-9]{2}, [0-9]{2,4}?
[0-9]{2}-[0-9]{2}-[0-9]{2,4}?
I'm using three new kinds of syntax that all control repetition:
- If you put {X} in the same place you'd put a + (after
something to repeat) then it will make sure that it is repeated
exactly X times.
- If you put {X,Y} then it will make sure it repeats at least
X times, but not more than Y times.
- If you put a ? after any of the repetition symbols it means
"non-greedy". This is unfortunate because ? also means
"optional" but just remember that placing it after repetition means
"non-greedy".
The way to explain "non-greedy" is simply that most regex engines try to
match repetition to the biggest part of the corpus text possible. If you
run into situations where you're matching too much with repetition, then
you can use ? to tell it to match the smallest repetition
possible.
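To see the difference, here is a small sketch that runs the year part of the regex against "2011" with Python's re module. Python is only an assumed stand-in here; the tool used in this exercise may report matches differently.

```python
import re

text = "2011"  # the year from the corpus dates

# Greedy: take as many digits as {2,4} allows.
greedy = re.match(r"[0-9]{2,4}", text).group(0)

# Non-greedy: take as few digits as {2,4} allows.
lazy = re.match(r"[0-9]{2,4}?", text).group(0)

# Non-greedy, but the $ anchor forces it to keep going to the end.
anchored = re.match(r"[0-9]{2,4}?$", text).group(0)

print(greedy)    # 2011
print(lazy)      # 20
print(anchored)  # 2011
```

The last line shows that non-greedy means "smallest that still lets the whole regex match", so whatever follows the repetition decides how far it actually goes.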
I can then break down each of the symbols I used in this regex file:
- ex10.regex:1
  - In this regex I use {X} to limit the month to
    3 characters ([A-Z]{3}), the day to 2 digits ([0-9]{2}), and
    the year to 4 digits ([0-9]{4}).
- ex10.regex:2
  - In this one I'm using +? to match one or more letters
    for the month, but as few as still let the rest of the regex match. Remember
    the ? after the + does this. I also use {2,4}? at the
    end to match 2 to 4 digits for the year, but the ? will make
    it match the smallest.
- ex10.regex:3
  - Finally I'm mixing these up: {2} for the 2-digit month and day,
    and {2,4}? for a year of 2 to 4 digits.
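As a rough cross-check of that breakdown, this sketch runs the three regexes over the three corpus lines with Python's re module. The function name matched_text is hypothetical, and Python is an assumed stand-in for whatever engine you are using.

```python
import re

corpus = ["NOV-10-2011", "August 20, 2011", "10-18-2011"]

patterns = [
    r"[A-Z]{3}-[0-9]{2}-[0-9]{4}",
    r"[A-Za-z]+? [0-9]{2}, [0-9]{2,4}?",
    r"[0-9]{2}-[0-9]{2}-[0-9]{2,4}?",
]

def matched_text(pattern, lines):
    """Return the matched substring for every line the pattern hits."""
    hits = []
    for line in lines:
        m = re.search(pattern, line)
        if m:
            hits.append(m.group(0))
    return hits

for p in patterns:
    print(p, "->", matched_text(p, corpus))
# [A-Z]{3}-[0-9]{2}-[0-9]{4} -> ['NOV-10-2011']
# [A-Za-z]+? [0-9]{2}, [0-9]{2,4}? -> ['August 20, 20']
# [0-9]{2}-[0-9]{2}-[0-9]{2,4}? -> ['10-18-20']
```

Each regex finds its own line, but in Python the trailing {2,4}? stops after two digits of the year because nothing after it forces it to take more. The +? on the month still grows to "August" because the space that follows has to match.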
What You Should See
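When you run this against the corpus it should match each of the dates. The
session below was captured without the corpus file loaded, so it shows only the
prompts and the tool's complaints: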
That file doesn't exist.
> [A-Z]{3}-[0-9]{2}-[0-9]{4}
Input file is empty. Use !load to load something.
> [A-Za-z]+? [0-9]{2}, [0-9]{2,4}?
Input file is empty. Use !load to load something.
> [0-9]{2}-[0-9]{2}-[0-9]{2,4}?
Input file is empty. Use !load to load something.
>
Portability Notes
Some regex engines do not have the {X,Y} or {X} syntax. Other
regex engines that have it become horribly slow when you give exact matching
counts. I've found that it's best to just not use this syntax unless you
absolutely need to, and instead do your data validation in your software
after matching.
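Here is one way that advice could look in practice, sketched in Python: match loosely with + only, then let ordinary code check the numbers. The names LOOSE_DATE and parse_date are hypothetical, and the month-day-year order is an assumption based on the corpus line 10-18-2011.

```python
import re
from datetime import datetime

# Loose pattern: no {X} or {X,Y} counts, just "some digits" three times.
LOOSE_DATE = re.compile(r"([0-9]+)-([0-9]+)-([0-9]+)")

def parse_date(text):
    """Return a date for a month-day-year string, or None if it is not valid."""
    m = LOOSE_DATE.search(text)
    if m is None:
        return None
    month, day, year = (int(g) for g in m.groups())
    try:
        # datetime rejects impossible values such as month 13 or day 40.
        return datetime(year, month, day).date()
    except (ValueError, OverflowError):
        return None

print(parse_date("10-18-2011"))  # 2011-10-18
print(parse_date("13-40-2011"))  # None
print(parse_date("no date"))     # None
```

The regex only has to find the pieces; the range checks happen in code, where they are easier to read and change.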