By Comparison, UK Postal Codes Offer a Greater Challenge Than US Zip Codes When Writing Regular Expressions (RegEx)
In the previous blog (“Finding US Zip Codes (AutoHotkey RegEx Tips Part 1)”), I began this mini-tutorial series on AutoHotkey Regular Expressions (RegEx) with a technique for parsing US zip codes from street addresses. For the MultiPaste.ahk script to work best (“Parsing and Pasting One-Line Street Addresses (AutoHotkey Multi-Paste Trick)”), I needed any zip code to appear as a separate paste item in the MultiPaste MsgBox. The parsing problem occurs because most one-line address formats separate the state and zip code with only a space character (no comma or newline). The same holds true for UK postal codes.
Last time, I pointed out how the string functions, InStr() and StrReplace(), require exact search characters while the Regular Expressions functions, RegExMatch() and RegExReplace(), can use a variety of wild cards to represent characters. In fact, the many ways to express wild cards cause a degree of confusion. In this blog, I introduce the \w alphanumeric wild card and the question mark modifier (?) to create optional matches.
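To make the contrast between the two families of functions concrete, here is a minimal AutoHotkey v1 sketch. The sample address and the \d{5} pattern are assumptions for illustration only, not necessarily the expression from the previous blog.

```autohotkey
; Illustrative sample address (an assumption, not taken from the script).
Address := "1600 Pennsylvania Ave NW Washington DC 20500"

; InStr() only finds text that we spell out exactly.
ExactPos := InStr(Address, "20500")

; RegExMatch() finds any run of five digits without knowing them in advance.
; The matched text is stored in the variable Found.
WildPos := RegExMatch(Address, "\d{5}", Found)

Report := "InStr position: " . ExactPos . "`nRegExMatch position: " . WildPos . "`nMatched text: " . Found
MsgBox, % Report
```

Both calls land on the same spot in this string, but only the RegEx version would still work if the zip code changed.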
Identifying UK Postal Codes
It’s important to understand that the MultiPaste.ahk script does not need to validate the entire UK postal code as an existing postal zone. It only needs to recognize a possible code so that it can parse the code into a separate paste item for the pop-up MsgBox. Therefore, by using more inclusive wildcards, we can use a much simpler RegEx than the validation expression shown in Wikipedia. Plus, we only need to identify up to the first four characters of the UK code.
Identifying a UK postal code gets a little more complicated than the simple five-digit US zip code because the British combine letters and digits in their postal codes. While we only need to identify possible UK postal codes based upon the first few characters, we need very specific RegEx wild cards. The valid formats, where A stands for a letter and 9 for a digit:
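| Format | Coverage | Example |
|---|---|---|
| AA9A 9AA | WC postcode area; EC1–EC4, NW1W, SE1P, SW1 | EC1A 1BB |
| A9A 9AA | E1W, N1C, N1P | W1A 0AX |
| A9 9AA | B, E, G, L, M, N, S, W | M1 1AE |
| A99 9AA | B, E, G, L, M, N, S, W | B33 8TH |
| AA9 9AA | All other postcodes | CR2 6XH |
| AA99 9AA | All other postcodes | DN55 1PT |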
We note that the first character must be an alphabetic character. The second character may be either alphabetical or numeric. Either the second or third character must be numeric. This first portion of the code may be two to four characters long.
The wild card \w matches any alphanumeric character (or the underscore)—the equivalent of the range [a-zA-Z0-9_]. However, the first character in the UK postal code cannot be a number. Therefore, we use the range [A-Za-z] which accounts for every letter of the alphabet—both upper and lower case—while ignoring digits (\d). This wildcard forces the first character to match any letter.
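As a quick check of the difference between \w and [A-Za-z], the following AutoHotkey v1 sketch tests three sample characters against each. The test characters and variable names are illustrative assumptions, and the ^ anchor (not otherwise used in this blog) only pins the test to the start of the string.

```autohotkey
; Test a letter, a digit, and an underscore against both wild cards.
for index, char in ["E", "7", "_"]
{
    ; \w accepts letters, digits, and the underscore.
    word   := RegExMatch(char, "^\w") ? "yes" : "no"
    ; [A-Za-z] accepts letters only.
    letter := RegExMatch(char, "^[A-Za-z]") ? "yes" : "no"
    report .= char . ":  \w = " . word . ",  [A-Za-z] = " . letter . "`n"
}
MsgBox, % report
```

Only the letter passes both tests, which is why the range [A-Za-z] suits the first character of a UK postal code.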
The second character may match either a letter or a number, but either the second or third character must be a number. We solve this problem by making a second letter optional with the question mark modifier ([A-Za-z]?) and then forcing the next character to match a digit ([A-Za-z]?\d). If the second character is a letter, the optional [A-Za-z]? consumes it and \d forces the third character to be a number. If the second character is not a letter, [A-Za-z]? matches nothing and \d forces the second character itself to be a number.
Note: While the question mark (?) plays a number of other roles in Regular Expressions, most often it serves to make the preceding character match optional. You’ll find this feature critical to RegEx flexibility in situations, such as this one, where marking the second letter as optional (?) results in requiring either the second or third character to match a number (\d). (See the AutoHotkey Regular Expressions (RegEx) Quick Guide for an example matching both color and colour—i.e. colou?r.)
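The optional modifier can be tried out with a short AutoHotkey v1 sketch. The test words and codes below are illustrative assumptions, and the ^ and $ anchors (not otherwise used in this blog) simply force the whole word to match.

```autohotkey
; The u in colou?r is optional, so both spellings match.
for index, word in ["color", "colour", "colouur"]
{
    verdict := RegExMatch(word, "^colou?r$") ? "match" : "no match"
    result .= word . " -> " . verdict . "`n"
}

; The same idea at the start of a UK postal code:
; a letter, an optional second letter, then a required digit.
for index, code in ["M1", "CR2", "9AA"]
{
    verdict := RegExMatch(code, "^[A-Za-z][A-Za-z]?\d") ? "match" : "no match"
    result .= code . " -> " . verdict . "`n"
}
MsgBox, % result
```

Here color and colour match while colouur does not. Likewise, M1 and CR2 match, but 9AA fails because its first character is not a letter.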
The last optional character (?) can match either a letter or number (\w?):
Clipboard := RegExReplace(Clipboard,"\s([A-Za-z][A-Za-z]?\d\w?)","`t$1")
This RegEx matches a space (\s) followed by two to four characters. After the space, the first character in the UK postal code must match a letter. The second character may optionally match a letter, and a digit (\d) must come next: in the second position if no second letter matched, or in the third position if one did. If another letter or digit follows that required digit, it optionally matches the \w wild card (\w?).
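To watch the replacement work outside the Clipboard, this AutoHotkey v1 sketch runs the same expression on a sample string. The street address is made up around the example code CR2 6XH, and the [TAB] marker is only there to make the inserted tab visible in the MsgBox.

```autohotkey
; Made-up one-line address ending in a UK postal code.
Address := "221 High Street Croydon CR2 6XH"

; Same expression as above, applied to a variable instead of the Clipboard.
Parsed := RegExReplace(Address, "\s([A-Za-z][A-Za-z]?\d\w?)", "`t$1")

; Swap the tab for a visible marker before displaying the result.
Visible := StrReplace(Parsed, "`t", " [TAB] ")
MsgBox, % Visible
```

The space in front of CR2 turns into a tab, while 6XH stays put because it starts with a digit rather than a letter. Keep in mind that such an inclusive pattern also catches anything else shaped like the start of a postal code, such as a road number like A40.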
Note: Quite frankly, the last \w? adds nothing to this replacement since it appears at the end of the expression. It can match one more character or nothing at all, and either way the text after the required digit ends up in the same place in the output. However, if you plan to include the space in the middle of the UK postal code:
Clipboard := RegExReplace(Clipboard,"\s([A-Za-z][A-Za-z]?\d\w?\s)","`t$1")
then the expression requires the inclusion of the optional \w? wild card.
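A side-by-side comparison shows why. In this AutoHotkey v1 sketch, the address is a made-up one built around the example code EC1A 1BB, and the variable names are hypothetical.

```autohotkey
; Made-up address whose postal code has a letter after the first digit.
Address := "1 Example Road London EC1A 1BB"

; With \w? the pattern can step over the "A" in EC1A to reach the required space.
WithOpt := RegExReplace(Address, "\s([A-Za-z][A-Za-z]?\d\w?\s)", "`t$1")

; Without \w? a space must follow the first digit directly, so EC1A never matches.
WithoutOpt := RegExReplace(Address, "\s([A-Za-z][A-Za-z]?\d\s)", "`t$1")

Report := "With \w?: " . StrReplace(WithOpt, "`t", " [TAB] ") . "`nWithout \w?: " . StrReplace(WithoutOpt, "`t", " [TAB] ")
MsgBox, % Report
```

Only the first version inserts the tab before EC1A. The second leaves the address unchanged, and the same would happen with two-digit codes such as DN55 1PT.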
As with the US zip code discussed in the previous blog, the address format must precede the UK postal code with a space (\s) for the tab (`t) replacement to occur.
This blog introduced the \w alphanumeric wild card and the question mark ? (optional) modifier. Next time, we look at the none-or-more modifier (*)—a variation of the optional modifier (?)—and the concept of greed while removing excess tabs from our MultiPaste output which cause blank lines in the MsgBox window.