A Perfect Place to Use an AutoHotkey Regular Expression (RegEx in Text Replacement)

Occasionally, You Run into a Search-and-Replace Problem that Cries Out for an AutoHotkey RegEx (Regular Expression). But Is Learning How to Use Regular Expressions Worth Your Time? You Decide! Here’s a Real Problem and a Beginner’s Mini-Tutorial for Solving It with RegEx.

Regular Expressions (RegEx) do not immediately make sense to anyone, especially if you have never used them before. Even if you understand how they work, it’s not always easy to know where to apply them.

*          *          *

New to AutoHotkey? See “Introduction to AutoHotkey: A Review and Guide for Beginners.”

New to Regular Expressions (RegEx) in AutoHotkey? See this “Introduction to Regular Expressions (RegEx) in AutoHotkey” page.

*          *          *

This blog offers one search-and-replace situation where a RegEx makes a huge difference and a short description of how the RegEx works.

Don’t expect that you will be up and flying with AutoHotkey Regular Expressions after this one example. It usually takes a little more time to grasp the techniques. However, the example provided here is a problem which I encountered while putting together a new AutoHotkey e-book. The explanation may help you better understand whether it’s worth a little of your time to make Regular Expressions part of your AutoHotkey toolbox.

The Problem

I recently began pulling together my next AutoHotkey e-book with the free EPUB editing program called Sigil, which I highly recommend. Using this program is much more natural for putting anything (text, images, and videos) into an e-book than using a word processing program such as Microsoft Word.

One of the best features included in Sigil is a great indexing tool. I consider an index essential to any computer book. Most people do not read a computer book from cover to cover. When they want to quickly find a particular reference they either check the table of contents or the index.

To create an index in Sigil you go through the chapters marking those terms you want referenced. When you create the index, those terms with links to their location are included in the new index file. If you add keywords to the built-in Index Editor, then an internal link is added anywhere those words are found in any chapter—also including each in the final index. (See the image below.) Sigil consecutively numbers the index links (1, 2, 3, …) making each an easy jump to the referenced location.

[Image: Links created in Sigil are numbered consecutively for the final index.]

This system is great for single books, but I like to include the indexes from each of my AutoHotkey e-books in the often free AutoHotkey Tricks book where the internal links are not helpful. In that e-book I need the chapter names to appear in the index rather than merely a sequential number.

Although this may seem like a simple search-and-replace problem, there is nothing simple about it. A standard word processing search-and-replace tool will not do the trick. In fact, as shown in the image above, in the WYSIWYG (What You See Is What You Get) editor there is no way to know what chapter name to use for which link. Each link could be changed manually, but that would be a great deal of tedious work.

To easily see the chapter names and the assigned numbers for each link, we change to the code view which exposes the underlying HTML language. (See the image below.) The code view shows the details of each of the embedded links.

[Image: The chapter name (Chapter Twelve) correlates with the sequentially assigned link number (1) in the HTML code.]

(At this point, you might say to yourself, “Oh no, do I need to learn HTML now?” Not so! Even if you don’t understand what the code does, it’s relatively easy to pick out the important parts. In the screenshot above, we can see that the first link “1” correlates with “Chapter Twelve.”)

In the code view we can see both the chapter names and the link numbers. However, since the link number is relative to the order of appearance, there is still no way to do a standard search-and-replace, even when working with the code. This is where a Regular Expression (RegEx) comes to the rescue. A single use of the AutoHotkey RegExReplace() function can fix all of the code in the above image, producing the final results shown in the image below.

[Image: the corrected index, with chapter names in place of the link numbers]

When using this AutoHotkey RegEx function, you don’t need to know the chapter name or the relative number of the link, yet the entire index file can be reformatted into the display shown in the image above in just one pass. If you occasionally run into this type of ambiguous formatting problem, then it may be time for you to dig into Regular Expressions. The alternative is to work through the code line by line by hand and laboriously replace the respective index link numbers with the chapter names. If you have hundreds (or thousands) of entries in your index, then this could take quite a while.

The AutoHotkey RegExReplace() Code

The routine shown below offers a technique for selecting a section of any similar HTML code and automatically replacing the link numbers with the respective chapter names. After selecting (highlighting) the target code in the text editor, activate the snippet with the Hotkey combination CTRL+ALT+y (^!y). The routine copies the selected text to the Windows Clipboard, reformats the code, then replaces the original selected text with the new code.

The details of how this routine works by manipulating text in the Windows Clipboard are covered in many places, including in my first book, A Beginner’s Guide to AutoHotkey. Right now, we’re primarily interested in the line with the RegExReplace() function. (Below it appears as two lines for display purposes only. Using AutoHotkey line continuation techniques makes it operate as one line.):

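^!y:: ; Format web link (CTRL+ALT+y)
 OldClipboard = %Clipboard% ; save the current Clipboard text
 Clipboard := "" ; empty the Clipboard so ClipWait can detect the copy
 SendInput, ^c ; copies the selected text
 ClipWait 0 ; a timeout of 0 waits up to half a second
 If ErrorLevel ; set when ClipWait times out with nothing copied
   {
     MsgBox, No Text Selected!
     Clipboard = %OldClipboard% ; put the saved text back before leaving
     Return
   }
 ; One RegExReplace() call, continued onto a second line
 Clipboard := RegExReplace(Clipboard
    ,"(/Text/)([\w\s]+)(\.[^>]+>)\d+","$1$2$3$2")
 SendInput, ^v ; pastes the reformatted text over the selection
 Clipboard = %OldClipboard% ; restore the original Clipboard text
Return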

For the most part, this AutoHotkey script is standard in many Hotkey text replace/insert routines. The short description of how it works is as follows:

  1. The old contents of the Windows Clipboard are saved (OldClipboard = %Clipboard%).
  2. The Clipboard is emptied (Clipboard := "").
  3. The highlighted text is copied to the Clipboard (SendInput, ^c).
  4. The routine pauses/waits until the Clipboard is no longer empty (ClipWait 0). With a timeout of 0, ClipWait gives up after half a second.
  5. The ErrorLevel statement, which works with the ClipWait command, is included just in case the user forgets to select (highlight) any text prior to executing the Hotkey combination.
  6. The subject Regular Expression is used in the RegExReplace() function to reformat the contents of the Clipboard (explanation to follow).
  7. The new Clipboard contents replace the selected text in the document (SendInput, ^v).
  8. The original contents of the Clipboard are restored (Clipboard = %OldClipboard%).

The AutoHotkey Regular Expression

While I offer a mini-tutorial on how RegEx works in this routine, in no way does this discussion cover all the RegEx possibilities or make you an expert on writing your own expressions. For that there is guidance readily available at the AutoHotkey RegEx Quick Reference and various other Web sources. If you think you might want more AutoHotkey-specific help with practical examples, then you may be interested in my e-book A Beginner’s Guide to Using Regular Expressions in AutoHotkey. (I even referred back to examples in that book when working on this problem.) In the book I offer specific AutoHotkey RegEx applications and explain how they work. I also dig into how you should think about Regular Expressions and the AutoHotkey functions RegExMatch() and RegExReplace().

The RegEx Tester

One of the best tools I’ve used for developing Regular Expressions is the free RegEx Tester by Robert Ryan (shown below). Not only does it work well, but it’s written in AutoHotkey script language. (The RegEx Tester script is also available at the ComputorEdge AutoHotkey download site in the RegExTester.zip file in both text ahk and compiled exe versions.)

To test how the RegEx works, I loaded a piece of the HTML text into the search field of the RegEx Tester. As is shown in the Results field, the Regular Expression copies the chapter name and replaces the link number in the HTML link.

[Image: Using Ryan’s RegEx Tester you can change the Regular Expression and/or the Replacement Text and instantly see the result.]

The beauty of Ryan’s RegEx Tester is you instantly see the results as you alter the Regular Expression and the Replacement Text fields.

To understand what the RegEx engine is doing, we’ll take the following expression apart a piece at a time:

(/Text/)([\w\s]+)(\.[^>]+>)\d+

The expression is the code which appears between the two double quotes in the second parameter of the RegExReplace() function.
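
To try the expression outside of the hotkey routine, here is a minimal sketch that runs the same RegExReplace() call on one hard-coded line. The sample link is an assumption modeled on the result line shown later in this post (the real Sigil markup may carry more attributes), and the variable names Sample and Result are only for illustration.

```autohotkey
; One index link as it might appear in Sigil's code view (assumed sample).
; Inside an AutoHotkey expression string, "" stands for one literal quote.
Sample := "<a href=""../Text/Chapter Twelve.xhtml#sigil_index_id_2"">2</a>"

; Same needle and replacement text as in the hotkey routine.
Result := RegExReplace(Sample
    , "(/Text/)([\w\s]+)(\.[^>]+>)\d+", "$1$2$3$2")

; Result now holds:
; <a href="../Text/Chapter Twelve.xhtml#sigil_index_id_2">Chapter Twelve</a>
; MsgBox % Result   ; uncomment to display it
```

Working on a plain variable like this keeps the Clipboard out of the picture while you experiment with the expression.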

How the RegEx Works

A RegEx is an expression which searches text looking for a match. Unless a match is found, nothing happens. What makes RegExs confusing are the various symbols used to determine a match. In some cases, the same symbol (e.g. ^ and ?) has a different meaning depending upon where it’s used in the expression. While none of them are complex, they are not what you might expect in most types of programming. However, once understood, Regular Expressions are powerful tools.

Locate a Consistent Text Key to Match

The first step was to find a key within the search text for locating the replacement match, in this case the chapter name. Scanning the search text above, we notice that the word “/Text/” (surrounded by forward slashes) always precedes the chapter name (e.g. Chapter Twelve). That becomes the first part of the Regular Expression:

/Text/

Whenever /Text/ is found in the search, there will be a match. However, since we want this key term to remain in the final product, we enclose it in parentheses to save the pattern:

(/Text/)

Create a Subpattern/Backreference

Putting parentheses around any set of RegEx characters or symbols creates a subpattern or backreference from the matched characters. Since it is the first subpattern encountered, it is called up with the symbol $1 (which is the first Replacement Text value shown in the RegEx Tester image above). When $1 is placed in the third parameter of the RegExReplace() function, the characters /Text/ are reinserted in the replacement string at the same location.

(Backreferences are numbered from left to right for each set of parentheses, i.e. $1, $2, $3, …)
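
As a small illustration of that numbering, the sketch below swaps two words by calling the backreferences in reverse order. The input string and the variable name Swapped are made up for this example and are not part of the index routine.

```autohotkey
; Two subpatterns: $1 holds the first word, $2 holds the second.
Swapped := RegExReplace("Twelve Chapter", "(\w+) (\w+)", "$2 $1")

; Swapped now holds: Chapter Twelve
; MsgBox % Swapped   ; uncomment to display it
```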

The next portion of the example is the chapter name (e.g. Chapter Twelve). Those characters are matched with the following expression:

[\w\s]+

Using the Symbols \w and \s

This part of the expression uses a couple of the most common wildcards found in RegExs. The first is \w (backslash w) which matches any letter (a-z in either case), digit (0-9), or underscore. That means spaces and punctuation will not cause a match. However, in our example all of the names include one space between the word “Chapter” and the chapter number.

In order to include any blank space in our matching scheme, we use the symbol \s (backslash s). Whenever \s is part of an expression, RegEx looks for a whitespace character: a blank space, a tab, or a line break.

Create a Range […] of Options

Next, since we don’t know where the letters, digits, and spaces will appear in the text, we want to accept either a \w or a \s match at each position. Placing square brackets around a series of characters or symbols [\w\s] creates a list of options. If RegEx encounters any one of the items within the square brackets, it accepts the match. However, in its current form only the first character or space would be matched.

Continue Matching with the Plus Sign +

To force RegEx to continue matching the following characters as long as a letter, digit, or space is found, a plus sign + is added to the end of the range: [\w\s]+. Whenever a plus + is added to a character or symbol, it tells RegEx to match that character (or range of characters and symbols) one or more times, continuing until it no longer finds a match. In our example, this expression captures the entire chapter name (e.g. Chapter Twelve) until encountering the dot . which is found just before the file extension .xhtml. The dot (period or decimal point) halts the repeated matching because it is not a letter, digit, or space.
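
The steps from \w to [\w\s]+ can be watched one at a time with RegExMatch(), which stores the matched text in the variable named in its third parameter. This is only a sketch: the haystack is a shortened stand-in for the real link, and the output variable names are arbitrary.

```autohotkey
Haystack := "Chapter Twelve.xhtml"

RegExMatch(Haystack, "\w", OneChar)         ; OneChar   = "C"
RegExMatch(Haystack, "[\w\s]", OneOption)   ; OneOption = "C" (still one character)
RegExMatch(Haystack, "\w+", OneWord)        ; OneWord   = "Chapter" (stops at the space)
RegExMatch(Haystack, "[\w\s]+", WholeName)  ; WholeName = "Chapter Twelve" (stops at the dot)
```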

Now we turn the resulting chapter name into a subpattern by enclosing it in parentheses:

([\w\s]+)

These parentheses save the second subpattern $2 for use in the replacement string. The chapter name needs to be included both in its original location and as a replacement for the link digits:

/Text/Chapter Twelve.xhtml#sigil_index_id_2">Chapter Twelve</a>

Note: While working on this blog, I realized that numbers above 20 often include hyphens (e.g. Twenty-one). The above subexpression would not work with any hyphenated numbers, but the fix is easy. Merely add a hyphen inside the range as an additional option and we are set for almost any chapter numbers:

([\w\s-]+)
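
A quick check of the difference, using a made-up chapter name (the variable names are only for illustration):

```autohotkey
Haystack := "Chapter Twenty-one.xhtml"

; Without the hyphen in the range, matching stops at the hyphen.
RegExMatch(Haystack, "[\w\s]+", Short)   ; Short = "Chapter Twenty"

; With the hyphen added as the last item in the range, the whole name is matched.
RegExMatch(Haystack, "[\w\s-]+", Full)   ; Full  = "Chapter Twenty-one"
```

Placing the hyphen last inside the brackets keeps it from being read as part of a range such as a-z.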

The Dot . Wildcard

Next, we need to capture the remaining text between the chapter name and the digits which show the link number. The first character is the dot . before .xhtml. However, the dot . on its own is the ultimate wildcard, matching any character or punctuation (by default, anything except a line break). To make the dot just a dot (period or decimal point), it must be preceded by a backslash \. rather than appearing on its own:

\.[^>]+>

Negative RegEx Matching

Now, we want to match all the characters following the dot . until the closing right arrow > is encountered. To do this we use a negative match—everything which is not a right arrow >.

By using the caret ^ inside the square brackets [^>], RegEx is told to match anything which is not within the range brackets. Again, the plus sign + is added [^>]+ to repeat the process as long as there is no > match. In this case, everything will be matched until the first > is encountered.

After this part of the expression, a right arrow > is added to accept it as a match and the entire portion is enclosed in parentheses to save the third subpattern for use as backreference $3 in the replacement result:

(\.[^>]+>)
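
To see what this third piece picks up on its own, here is a short sketch run against a sample link (assumed, modeled on the result line shown earlier; the variable names are arbitrary):

```autohotkey
; "" inside the quoted string stands for one literal quote character.
Sample := "/Text/Chapter Twelve.xhtml#sigil_index_id_2"">2</a>"

; Escaped dot, then everything that is not >, then the > itself.
RegExMatch(Sample, "\.[^>]+>", Middle)

; Middle now holds: .xhtml#sigil_index_id_2">
```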

Tip: Using the caret ^ within a range, e.g. [^>], as a negative match is a powerful expression. For example, the second backreference ([\w\s]+) previously discussed could be replaced with ([^.]+), matching anything that is not a dot. This negative expression also eliminates the concern about hyphenated numbers in the chapter names, although chapter numbers which include points (e.g. Chapter 3.2.1) could be problematic. (Note that the dot loses its magical wildcard properties inside a range. No backslash required.)

Advanced Tip: If you are a little familiar with Regular Expressions, then you may ask, rather than a negative expression, why not use the super wildcard, dot ., which matches anything? You could replace [^>]+> with the .+?> dot expression, but be sure to include the question mark ? as shown. The dot . wildcard is greedy and will consume all characters until the last occurrence of > in the search text. The added question mark forces the match to become non-greedy and stop at the first occurrence of >. (If this tip merely adds more confusion to the topic, forget that I said it. These are not easy concepts for the novice.)
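
For anyone who wants to see the greedy and non-greedy behavior side by side, here is a minimal sketch. The one-line haystack and the variable names are invented for the comparison.

```autohotkey
Haystack := "<a href=""x.xhtml"">1</a>"

; Greedy: runs on to the LAST > in the text.
RegExMatch(Haystack, "\..+>", Greedy)      ; Greedy  = .xhtml">1</a>

; Non-greedy: the ? makes it stop at the FIRST >.
RegExMatch(Haystack, "\..+?>", Lazy)       ; Lazy    = .xhtml">

; The negative range from the article also stops at the first >.
RegExMatch(Haystack, "\.[^>]+>", Negated)  ; Negated = .xhtml">
```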

There are three subpatterns in the expression, each one creating a backreference for the replacement string:

(/Text/)    →  $1  (e.g. /Text/)
([\w\s]+)   →  $2  (e.g. Chapter Twelve)
(\.[^>]+>)  →  $3  (e.g. .xhtml#sigil_index_id_2">)
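
The same three values can be inspected directly with RegExMatch(). In its default mode, AutoHotkey v1 stores the whole match in the output variable and each subpattern in that name followed by its number. The sample line is an assumption modeled on the result line shown earlier, and Part is an arbitrary variable name.

```autohotkey
Sample := "/Text/Chapter Twelve.xhtml#sigil_index_id_2"">2</a>"

RegExMatch(Sample, "(/Text/)([\w\s]+)(\.[^>]+>)\d+", Part)

; Part  = /Text/Chapter Twelve.xhtml#sigil_index_id_2">2   (the whole match)
; Part1 = /Text/                       (same as $1)
; Part2 = Chapter Twelve               (same as $2)
; Part3 = .xhtml#sigil_index_id_2">    (same as $3)
```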

Using the Expression \d

The only remaining piece of this RegEx is identifying and replacing the numeric hot link digits. Digits (0-9) can be matched with the expression \d (backslash d), which is the equivalent of the matching range [0-9]. Since only digits appear before the next left arrow <, only the plus sign + is needed to capture all of them: \d+. (For more flexibility, this expression could also be replaced with the negative expression [^<]+ to match anything until a left arrow < character is encountered.)
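
A short sketch of the difference between the two choices, using made-up link text (the variable names are arbitrary):

```autohotkey
; When the link text is only digits, both expressions capture the same thing.
RegExMatch("27</a>", "\d+", Digits)          ; Digits     = "27"
RegExMatch("27</a>", "[^<]+", UpToTag)       ; UpToTag    = "27"

; When other characters are mixed in, only the negative range takes them all.
RegExMatch("see 27</a>", "\d+", Digits2)     ; Digits2    = "27"
RegExMatch("see 27</a>", "[^<]+", UpToTag2)  ; UpToTag2   = "see 27"
```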

There is no need to save this subpattern by enclosing it in parentheses since it is replaced in the final product with the $2 backreference:

$1$2$3$2

This group of backreferences enclosed in quotes is the replacement text and third parameter of the RegExReplace() function. By default, the function proceeds through the entire search text (haystack) replacing any matches (needles) found. Notice that $2 is used twice to both maintain the original chapter name in the HTML code and replace the link digits.

In the replacement parameter, characters may be mixed with the backreferences. For example, "/Text/$2$3$2" will work just as well as "$1$2$3$2" since $1 always stores /Text/ as its value. Variable names and function calls typed inside the quoted replacement text are not evaluated by the RegExReplace() function; they are treated as literal characters.
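
Here is a minimal sketch of both points. The sample line is assumed (modeled on the result line shown earlier), and Needle, Prefix, and the result variable names are invented for the example. Because the parameter is an ordinary AutoHotkey expression, a variable can still be joined to the quoted pieces with the concatenation dot, outside the quotes.

```autohotkey
Sample := "/Text/Chapter Twelve.xhtml#sigil_index_id_2"">2</a>"
Needle := "(/Text/)([\w\s]+)(\.[^>]+>)\d+"

WithBackref := RegExReplace(Sample, Needle, "$1$2$3$2")
WithLiteral := RegExReplace(Sample, Needle, "/Text/$2$3$2")
; Both hold: /Text/Chapter Twelve.xhtml#sigil_index_id_2">Chapter Twelve</a>

; Joining a variable to the replacement text with the concatenation dot.
Prefix := "See "
WithVar := RegExReplace(Sample, Needle, "$1$2$3" . Prefix . "$2")
; WithVar holds: /Text/Chapter Twelve.xhtml#sigil_index_id_2">See Chapter Twelve</a>
```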

Are Regular Expressions Worth the Effort and Learning Curve?

The reason I took the time to cover this particular RegEx in such detail is that it can be difficult for someone to evaluate whether or not they want to take on the task of learning how to write and implement AutoHotkey Regular Expressions. However, once you know how they work, Regular Expressions are not difficult to understand, although it does take a little time to get a feel for the syntax and comprehend what the RegEx engine is doing.

RegExs often work where the standard search-and-replace fails, but there are far more applications for them than this one example. In the book A Beginner’s Guide to Using Regular Expressions, I give a number of practical examples such as a “Where in the World?” IP lookup app (as shown in this AutoHotkey RegEx Introductory page).

Ultimately, it’s up to you to decide whether you want to put RegEx in your AutoHotkey toolbox.

2 thoughts on “A Perfect Place to Use an AutoHotkey Regular Expression (RegEx in Text Replacement)”

  1. ^!y:: ; Format web link
    OldClipboard = %Clipboard%
    Clipboard := ""
    SendInput, ^c ;cuts selected text
    ClipWait 0
    If ErrorLevel
    {
    MsgBox, No Text Selected!
    Return
    }
    Clipboard := RegExReplace(Clipboard
    ,"(/Text/)([\w\s]+)(\.[^>]+>)\d+","$1$2$3$2")
    SendInput, ^v
    Clipboard = %OldClipboard%
    Return

    So much time and no one corrected?
    It should be:
    Send, ^c
    …..
    Send, ^v
    Otherwise it does not work.

    • Probably the reason no one has mentioned the problem is because the SendInput command works for most people. It should work for you as well. The various Send commands are a source of confusion and they do not work consistently. I have found a couple of possible issues.

      The first is the inclusion of the SendMode command in the standard script. I’ve found that on occasion the command can make hotkeys unreliable. I generally no longer include SendMode in any script.

      The second is that other programs running on your Windows computer could interfere with some AutoHotkey actions. I’m not sure why. I include a short discussion in “Chapter Six” of the Beginning Hotstrings book:

      “You may have noticed that from time to time some of your hotkey combinations just stop working. It might happen on one computer more than another. I’ve certainly seen it and found that reloading the script will normally fix the problem, but it is annoying when a Hotkey/Hotstring you depend upon suddenly stops working. Since it was random and I didn’t know where to look, I just lived with it. There may be a solution.

      “While perusing one of the AutoHotkey forums, I came across a post by someone whose key reassignments would suddenly stop working. Interestingly, he had found a solution before anyone else had a chance to make any suggestions. It turned out that a line in the template file for new AutoHotkey scripts may have been the culprit:

      “SendMode Input ; Recommended for new scripts due to its superior speed and reliability.

      “When he removed or commented out the line from the script, the problems stopped occurring.

      “As it turned out, I was having a similar problem with a script I was working on. The hotkey combination would work a couple of times, but then just stop. I put a semicolon in front of the same line in the boilerplate of the test script and ran it again. The problem did not reoccur.

      “It seemed odd to me that there would be a line in the standard new AutoHotkey script file setup that caused problems. It was even ironic since it was supposed to increase “reliability.” I did a little more investigating.

      “While I don’t fully understand the SendMode command, it seems that other programs which use keyboard hooks can interfere with AutoHotkey. It is not really an AutoHotkey bug. If you encounter similar problems where hotkeys stop working, then it may be worthwhile noting what else is loaded on your computer. The problem is that there is no list (that I know of) of the programs which may cause these symptoms.

      “If you can’t find the offending program then remove (or comment out with a semicolon) the SendMode line at the beginning of the script.”

      Some people recommend just sticking with Send rather than using SendInput. In either case, the script should work. It does for me.
