You will need the latest Regetron for this exercise to work. Make sure you have at least version 1.4. Run pip install --upgrade regetron to get the new one.

You can match lines, but what if you just want the parts of a line that match? In this exercise you'll learn how to create a regex that has a "grouping" in it, which tells the regex engine to return only that part, and not the whole line. You create a group by surrounding the part of the regex you want to extract with parentheses, the () characters.
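
To see the same idea outside Regetron, here is a minimal sketch using Python's standard re module. The sample line and the variable names are assumptions made up for illustration; they are not part of the exercise files.

```python
import re

# Hypothetical sample line, chosen only for illustration.
line = "AA BB 10 CC"

# The parentheses around [0-9]+ create a group.
match = re.search(r"BB ([0-9]+) CC", line)

whole = match.group(0)   # everything the regex matched
number = match.group(1)  # only the part inside the parentheses

print(whole)   # BB 10 CC
print(number)  # 10
```

Group 0 is always the whole match, and the groups you create with parentheses are numbered from 1.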

Here's a corpus text with two lines I'll play with:

AA BB 10 CC 12 DD 30
My email is buddy.

The first line is just some pairs of numbers and letters, and I want to get only the numbers. The second line has an email address in it, and I want to extract the email address. To do that, here are four regexes:

[0-9]+
([0-9]+)
[a-z]+@[a-z.]+
([a-z]+@[a-z.]+)

These regexes come in pairs: first the version without parentheses, then the version with parentheses. The version without will just print the line, as you have seen in earlier exercises. The version with parentheses will print only what was matched, as a list of items.

What You Should See

When you run this, you'll first see the line get matched, then the groups that match with the grouped (parentheses) version after that.

That file doesn't exist.
> [0-9]+
Input file is empty. Use !load to load something.
> ([0-9]+)
Input file is empty. Use !load to load something.
> [a-z]+@[a-z.]+
Input file is empty. Use !load to load something.
> ([a-z]+@[a-z.]+)
Input file is empty. Use !load to load something.

You should first notice that the regex ([0-9]+) returned a list of all the numbers it matched, which looks like ['10', '12', '30']. That is simply a Python-formatted list. Next you'll see that the email regex returns just [''], which is just the email in a list by itself. If you put more email addresses on that line, it would return all of them.
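
You can reproduce a list like this directly in Python. The sketch below uses re.findall on the first corpus line. I am assuming this is close to what Regetron does internally, which the exercise does not confirm, and the second pair of regexes (with the [A-Z]+ prefix) is my own addition to make the effect of the group easier to see.

```python
import re

line = "AA BB 10 CC 12 DD 30"

# When the group wraps the whole regex, findall returns the same strings either way.
plain = re.findall(r"[0-9]+", line)
grouped = re.findall(r"([0-9]+)", line)
print(plain)    # ['10', '12', '30']
print(grouped)  # ['10', '12', '30']

# When the group wraps only part of the regex, findall returns just that part.
# These two regexes are illustrative additions, not from the exercise.
whole_pairs = re.findall(r"[A-Z]+ [0-9]+", line)
numbers_only = re.findall(r"[A-Z]+ ([0-9]+)", line)
print(whole_pairs)   # ['BB 10', 'CC 12', 'DD 30']
print(numbers_only)  # ['10', '12', '30']
```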

Extra Credit

  • How do you match something with parentheses then? You know this but go back and slash through until you figure it out.
  • Create an email address that makes the email regex fail, then improve the email regex until it works.
  • Use the | (or, alternate) symbol to have numbers or letters captured. Hint: if you put it between two groups, it will alternate between the groups.
  • How do you interpret these results now that it's doing an alternating grouping? Can you use just one group and still find numbers or letters?
  • Change the line so that it has something that's not letters or numbers and make sure your regex still only extracts numbers and letters.
  • Write an email regex that only captures the host name after the @ but still matches the whole email address. Hint: You don't have to put the parentheses around the whole regex.
  • Read the Python re module documentation about groups.
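
If the results of the alternating groups look confusing, this small sketch shows how Python's re.findall reports a regex with more than one group. The sample line is hypothetical, and Regetron's own output format may differ, so treat this as a guide to reading the results rather than the expected answer.

```python
import re

# Hypothetical line mixing letters, numbers, and punctuation.
line = "AA 10 !! BB 12"

# With two groups, findall returns one tuple per match, with one slot per group.
# The group that did not take part in a match shows up as an empty string.
pairs = re.findall(r"([0-9]+)|([A-Z]+)", line)
print(pairs)  # [('', 'AA'), ('10', ''), ('', 'BB'), ('12', '')]
```

The first slot of each tuple belongs to the number group and the second to the letter group, which is why every tuple has exactly one empty string.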

Portability Notes

Some regex engines are really bad at efficiently gathering captures, or do it in weird ways. Consult your engine's API documentation to see what's possible.
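
As one example of what an API can offer, Python's re module can hand back match objects instead of plain strings, which also tell you where each capture was found. This is a sketch for Python only; other engines expose captures differently.

```python
import re

line = "AA BB 10 CC 12 DD 30"

# finditer yields one match object per match; group(1) is the captured text
# and start(1) is the index in the line where that capture begins.
captures = [(m.group(1), m.start(1)) for m in re.finditer(r"([0-9]+)", line)]
print(captures)  # [('10', 6), ('12', 12), ('30', 18)]
```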