Powerful Regular Expressions (RegEx) Perform Minor Computing Miracles—This Part 1 Uses Extracting US Zip Codes from Street Addresses to Introduce Regular Expressions as Merely a Set of Confusing Wildcards
In my last blog, “Parsing and Pasting One-Line Street Addresses (AutoHotkey Multi-Paste Trick)”, I added one-line street addresses to my MultiPaste.ahk script. That short AutoHotkey app uses a few Regular Expressions (RegEx) to handle four jobs:
- Finding five-digit US zip codes.
- Finding UK postal codes.
- Removing excess tab characters from the results.
- Identifying date formats.
I used RegEx functions for these problems because the basic string functions just didn’t offer the power needed without convoluted coding. RegEx provides fairly simple solutions (although possibly confounding to the neophyte).
Note: Each item in the list above represents a different blog discussing the use of a RegEx in the MultiPaste.ahk script. Combined, the four blogs represent a mini-introduction to Regular Expressions. The concepts discussed in these pieces represent features found in all RegExs.
String Functions Versus RegEx Functions
The InStr() function, which locates specific text inside a larger text block, and the StrReplace() function, which replaces occurrences of specific text within a larger block of text, serve purposes similar to the RegExMatch() and RegExReplace() functions, respectively. The difference is that the two string functions must search for exact characters, while the RegEx functions can use wildcards that match a wide variety of characters. Although you can use the RegEx functions for precise character matches (e.g. “abc”), unless you need the RegEx wildcard capability, you should stick with the faster InStr() and StrReplace() functions.
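To make the contrast concrete, here is a minimal sketch in AutoHotkey v1 syntax. The sample address and the variable names are made up for illustration and do not come from MultiPaste.ahk.

```autohotkey
; Hypothetical sample text (not from the MultiPaste.ahk script).
Address := "123 Main St Springfield IL 62704"

; InStr() needs the exact characters it is looking for.
ExactPos := InStr(Address, "62704")

; RegExMatch() only needs the shape of the text: five digits in a row.
; The matched text is stored in the Found variable.
PatternPos := RegExMatch(Address, "\d{5}", Found)

MsgBox % "InStr position: " ExactPos "`nRegExMatch position: " PatternPos "`nMatched text: " Found
```

Both calls should report the same position, but only the RegExMatch() line keeps working when the zip code changes.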
In poker, a wildcard can represent any card in the deck. In computers, a wildcard can represent any number of characters or specific types of characters. When searching Windows folders, the asterisk (*) takes the place of unknown filenames and extensions (e.g. *.*). Wildcards make searches possible even when we don’t know a specific name or term. Wildcards add power and flexibility to our searches.
While the InStr() and StrReplace() functions don’t allow wildcards, the RegEx functions depend upon them. In fact, Regular Expression (RegEx) bewilderment rears its ugly head through the sheer number and variety of wildcard expressions and formats available for text matches. Often you can write the same RegEx three or four different ways.
Finding Five-Digit Zip Codes
In the MultiPaste.ahk script, I needed to separate US zip codes from street addresses—even though they usually do not include an obvious delimiter such as a preceding comma. That means I must identify possible zip codes merely by their format (5 digits in a row)—regardless of the value of those numbers.
When identifying basic US zip codes, we know each contains five consecutive digits and no letters, but we don’t know the specific digits in the string. The InStr() function alone cannot pick out a zip code in a line of text unless we loop through all the possible combinations.
Note: I wrote a short script which looped through all the possibilities using the InStr() function. It took about 20 seconds to complete the job. Although still awkward, perhaps incrementing through the text one character at a time looking for five digits in a row would offer a faster InStr() solution.
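The character-by-character idea in the note might look something like the sketch below. It is only an illustration in AutoHotkey v1 syntax: FindFiveDigits is a hypothetical helper (not part of MultiPaste.ahk), and it leans on SubStr() and the "if var is digit" test rather than InStr().

```autohotkey
; Hypothetical helper: returns the position of the first run of
; five digits in Text, or 0 when there is none.
FindFiveDigits(Text) {
    if (StrLen(Text) < 5)
        return 0
    Loop % StrLen(Text) - 4
    {
        Chunk := SubStr(Text, A_Index, 5)   ; five characters starting here
        if Chunk is digit                   ; true when all five are 0-9
            return A_Index
    }
    return 0
}

; Hypothetical sample text.
MsgBox % FindFiveDigits("123 Main St Springfield IL 62704")
```

The message box should show the position where 62704 begins. Like a bare five-digit RegEx, this sketch also reports the first five digits of any longer number.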
However, we can easily solve this problem using RegEx functions, although Regular Expressions complicate matters by offering numerous possible solutions. For example, if you want to match any numeric digit (0-9), you can use the \d wildcard, the range [0-9], or the expression set (0|1|2|3|4|5|6|7|8|9), all of which match a single numeric digit. Which one should you use to match a zip code?
You can place five wildcards in a row to match a zip code:
\d\d\d\d\d
Or, you can follow the wildcard with the number of occurrences enclosed in curly brackets:
\d{5}
Both expressions result in the same effect.
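A quick way to check that claim is to run both forms, five \d wildcards in a row and \d followed by {5}, against the same text. This is a small sketch in AutoHotkey v1 syntax with a made-up address.

```autohotkey
; Hypothetical sample text.
Address := "123 Main St Springfield IL 62704"

RegExMatch(Address, "\d\d\d\d\d", Repeated)   ; five wildcards in a row
RegExMatch(Address, "\d{5}", Counted)         ; one wildcard with a count

MsgBox % "Repeated: " Repeated "`nCounted: " Counted
```

Both lines of the message box should show 62704.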
To add more confusion, you can modify each expression to match one or more occurrences in a row.
Add a plus sign + and the RegEx \d+ matches one or more occurrences of a numeric digit in a row. The same holds true for the range [0-9]+ and the expression set (0|1|2|3|4|5|6|7|8|9)+.
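The three "one or more" forms can be compared side by side. The sketch below uses AutoHotkey v1 syntax; the sample text and variable names are assumptions for illustration only.

```autohotkey
; Hypothetical sample text with two runs of digits of different lengths.
Text := "Suite 1200, Springfield IL 62704"

RegExMatch(Text, "\d+", FromWildcard)
RegExMatch(Text, "[0-9]+", FromRange)
RegExMatch(Text, "(0|1|2|3|4|5|6|7|8|9)+", FromSet)

MsgBox % FromWildcard "`n" FromRange "`n" FromSet
```

Each expression should stop at the first run of digits it finds (1200 here), however long that run happens to be.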
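If you want the matching to stop after five digits for a zip code, then add the wildcard \D (backslash capital D), which matches anything but a numeric digit, to the end of the five-digit expression (for example, \d{5}\D).
This Regular Expression requires a non-digit right after the fifth digit, so a match can’t end in the middle of a longer number. It can still pick up the last five digits of a longer number, which the \b symbol covered later in this blog prevents.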
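Here is a small sketch of how a trailing \D behaves, in AutoHotkey v1 syntax. The pattern \d{5}\D is one plausible form of the expression described above (an assumption), and the sample strings are made up.

```autohotkey
; Assumed pattern: five digits followed by one non-digit.
Pattern := "\d{5}\D"

RegExMatch("IL 62704 USA", Pattern, Normal)    ; the space after 62704 satisfies \D
RegExMatch("ID 1234567 x", Pattern, LongRun)   ; only the last five digits fit
AtEnd := RegExMatch("IL 62704", Pattern)       ; 0: nothing follows for \D to match

MsgBox % "Normal: [" Normal "]`nLong run: [" LongRun "]`nAt end of text: " AtEnd
```

Note two side effects: the matched text includes the trailing non-digit, and a zip code at the very end of the text is missed because \D must match an actual character.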
Of course, in my script the US zip code usually has a space character (\s) in front of it, which I need to replace with the tab (`t) character. Either five-digit form works after the \s:
\s(\d\d\d\d\d)
\s(\d{5})
Since it seemed easier to understand, I used the first option in the RegExReplace() function:
Clipboard := RegExReplace(Clipboard, "\s(\d\d\d\d\d)", "`t$1")
The parentheses capture the five digits as a subexpression, and $1 in the replacement text reinserts that subexpression (\d\d\d\d\d). The net effect replaces only the space \s with the tab character `t.
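To watch the replacement work without touching the Clipboard, the same call can run against an ordinary variable. This is a sketch in AutoHotkey v1 syntax; the sample address is made up, and the arrow only makes the tab visible in the message box.

```autohotkey
; Hypothetical sample text standing in for the Clipboard contents.
Address := "123 Main St Springfield IL 62704"

; \s matches the space in front of the digits; (\d\d\d\d\d) captures the digits as $1.
Parsed := RegExReplace(Address, "\s(\d\d\d\d\d)", "`t$1")

; Swap the tab for a visible marker so the result is easy to read.
MsgBox % StrReplace(Parsed, "`t", " -> ")
```

The message box should read: 123 Main St Springfield IL -> 62704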
Note: While backslash+t (\t) represents the standard tab in Regular Expressions, you can also use backtick+t (`t) in AutoHotkey RegExs.
You should use the RegEx which does the job in the simplest form possible.
Separating US Zip Codes
My only concern involved placing a tab for parsing in front of any possible zip codes. The RegExReplace() function shown above might also identify longer strings of non-zip-code digits but, worst case, the number gets parsed as a separate possible multi-paste. If you need to validate a zip code, then use something similar to:
\b\d{5}\b
or for complete nine-digit codes use:
\b\d{5}-\d{4}\b
both of which use the \b symbol to designate the enclosed expression as a separate word. This prevents the expression from identifying a zip code inside a longer run of digits (e.g. the 1234556 in 1234556-54322). Keep in mind that \b counts a hyphen as a word break, so five digits sitting next to a hyphen can still match.
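The word-boundary behavior can be tried out with a short test. The sketch uses AutoHotkey v1 syntax; the two patterns are assumed forms of the \b expressions described above, and the sample strings are made up.

```autohotkey
; Assumed validation patterns built around the \b word boundary.
ZipOnly  := "\b\d{5}\b"
ZipPlus4 := "\b\d{5}-\d{4}\b"

Plain   := RegExMatch("Springfield IL 62704", ZipOnly)        ; non-zero: found
TooLong := RegExMatch("Order 1234556", ZipOnly)               ; 0: seven digits in a row
Plus4   := RegExMatch("Springfield IL 62704-1234", ZipPlus4)  ; non-zero: found

MsgBox % "Five-digit zip: " Plain "`nInside longer number: " TooLong "`nZip+4: " Plus4
```

A result of 0 means no match. Any other number gives the position where the match starts.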
Next time, we take a look at the more complicated UK postal codes (and introduce a couple more RegEx concepts). While the RegEx for complete validation gets pretty complex, you will find identifying possible UK postal codes for parsing a little simpler—although not as easy as US zip codes.