Now that you know about groupings, I'm going to show you how to do search and replace. If you remember from the last exercise, you can surround part of a regex with parentheses and it'll create a capture. In this exercise I'll show you how to use that to extract parts of a string and then replace them, as well as how to do simple replacements.

To do this exercise I'll make a corpus text that has an IP address, some numbers, and a web URL that's wrong. Then I'll play with them.

My IP address is don't hack me. I have 10 cats, 4 dogs, and 1 hamster. Go to http://index.html/ now.

I want to make a search/replace that replaces only IP addresses with "X.X.X.X" so that the text is sanitized. I want another regex that will fix this URL so that it uses https and is a correct URL, not just a file.

!rep /[0-9]/X/
!rep /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/X.X.X.X/
!rep ,http://(.+?)/,\1,

In the first line I'm doing the IP address replace wrong, which you can see when we run it, as it replaces every digit with X. The second regex is more correct and replaces only the IP address. From these two you can see that the form for a replace in Regetron is !rep /SEARCH/REPLACE/.

There's a great modification to this form that comes in handy. The / can be any character, which solves the problem of doing a search/replace when the pattern already has a / character in it. I demo this in the third regex by using !rep ,SEARCH,REPLACE, instead.
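
For comparison outside Regetron, here is a rough sketch of the first two replacements using Python's re.sub. It is only an approximate equivalent, not a description of how Regetron runs !rep, and the corpus string with the address 192.0.2.10 is a stand-in I made up for the sketch.

```python
import re

# Hypothetical corpus; the address 192.0.2.10 is a made-up stand-in.
text = "My IP address is 192.0.2.10 don't hack me. I have 10 cats, 4 dogs, and 1 hamster."

# Too broad: every single digit becomes X, including the animal counts.
too_broad = re.sub(r"[0-9]", "X", text)

# Narrower: only four dot-separated runs of digits are replaced.
sanitized = re.sub(r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+", "X.X.X.X", text)

print(too_broad)
print(sanitized)
```

Python takes the search and the replacement as separate arguments, so there is no delimiter to pick and a / inside the pattern needs no special handling.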

The final thing to look at is that the third regex has a grouping (aka capture), which you know about, but then I do \1. What this does is grab the first group that was matched (that's the 1) and insert it right there in the replace.

What You Should See

Pay close attention to this and make sure you understand how the replacements are working:

> !rep /[0-9]/X/
> !rep /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/X.X.X.X/
> !rep ,http://(.+?)/,\1,

The most important part is the last regex and the use of the \1 to grab the group from the search part and put it in the replace part. In this case the (.+?) is matching the index.html part, and then the \1 is placing it in the corrected link inside the replace. One more thing to realize is if you had 3 groups in the search, then you'd have access to \2 and \3 as well.
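
The group references can also be sketched in Python, again as a rough equivalent under assumptions. The host example.com is a placeholder I picked for the corrected link, and the three-group pattern over the animal counts is my own illustration rather than part of the exercise.

```python
import re

# Hypothetical corpus line; example.com is a placeholder host for the sketch.
text = "Go to http://index.html/ now."

# (.+?) lazily captures everything between "http://" and the next "/".
fixed = re.sub(r"http://(.+?)/", r"https://example.com/\1", text)
print(fixed)  # Go to https://example.com/index.html now.

# With three groups, \1, \2 and \3 refer to them in left-to-right order.
animals = "10 cats, 4 dogs, and 1 hamster"
reordered = re.sub(
    r"([0-9]+) cats, ([0-9]+) dogs, and ([0-9]+) hamster",
    r"\3 cats, \2 dogs, and \1 hamster",
    animals,
)
print(reordered)  # 1 cats, 4 dogs, and 10 hamster
```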

Extra Credit

  • If you use Vim or Emacs then you have access to this as a search and replace operation. In Vim try loading the corpus text and typing :%s ,http://\(.*\)/,\1, which has a slightly different search part so pay attention. Notice I have to escape the parens with \(.
  • Write a search/replace that replaces the animals with just "dog", because dogs are better.
  • Write a regex that takes a URL with a file path and keeps everything but the file, replacing it with /index.html.
  • Why did I use .+? instead of just .+ to do the last group?

Portability Notes

As mentioned in the Extra Credit, you have to escape the parentheses in groups when you do this in Vim. Other regex engines use a slightly different API and way of doing search/replace.
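
As one example of such a difference, here is a small sketch of how Python's re module handles it. Groups use plain parentheses, and a group reference can also be written \g<1>, which matters when a digit follows it. The input string is made up for the illustration.

```python
import re

# Python's re module uses plain ( ) for groups, unlike Vim's \( \).
# \g<1> is the unambiguous spelling of \1; here "\10" would be read
# as a reference to group 10 instead of group 1 followed by "0".
result = re.sub(r"([0-9]+) cats", r"\g<1>0 cats", "I have 10 cats")
print(result)  # I have 100 cats
```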