Deleting Double Words with AutoHotkey Regular Expressions (RegEx)

When Too Many Identical Chapter Numbers Appear in the E-Book Index, It’s Time for Another AutoHotkey RegEx. Includes How to Use a Backreference!

The AutoHotkey Regular Expression from last time, which converts ambiguous numbers to chapter names in an e-book index, works well. When the first RegExReplace() function does its job, all of the link numbers are converted to chapter names, as shown below. However, you can see the problem.

RegExRepeatChapters
While the sequentially numbered links (discussed in the last AutoHotkey RegEx blog) have been replaced with the chapter names, there are too many redundant listings (e.g. Chapter Five, Chapter Five, Chapter Five, …). Another RegEx is needed to eliminate the extra listings.

Seeing one chapter number listed multiple times is too much repetition—even if each link accesses a different section of that chapter. When I add the AutoHotkey Hotstring book’s index to the AutoHotkey Tricks book, the extra chapter names won’t even need the hot links. I certainly want to eliminate the redundant chapter names from any plain text.

As for the index included in the Hotstring book itself, the problem is a little trickier. I still want to reduce the clutter, but I need the links to remain hot and jump directly to the reference. For that, I will need a RegEx that works with the underlying HTML code.

Ultimately the goal is to reduce the chapter listings in the index to one for each chapter, as shown below.

RegExRepeatChaptersAfter
By using the proper RegEx, the redundant chapter names are eliminated from the e-book index.

In one pass, a Regular Expression in the RegExReplace() function can make the changes necessary to eliminate the extra chapter names.

Removing Redundant Text

There are two ways to accomplish the goal of removing excess chapter numbers. If you don’t need the active links, the first method deals with the text only. The second approach works directly with the HTML source code, maintaining the original chapter hot links. While the text-only method requires the simpler Regular Expression, both use a similar plan of attack.

Note: You can see that the results above retain some hot links (underlined chapter names) for the internal reference, but they include only the first link for each chapter. If I want to retain all of the links, yet reduce the clutter—say by adding link numbers for each chapter (e.g. Chapter Five 1, 2, 3, 4)—then the RegExReplace() function does not offer enough flexibility to do the job in one pass. I will most likely need to use the RegExMatch() function inside a Loop (or two) to get the desired result.

In this blog, I’ll introduce the text-only Regular Expression, since it’s a little less complicated than the expression required for working with the HTML. I’m saving the HTML expression for the next blog. As for the more advanced RegExMatch() script needed for sequentially numbered links within chapters, it’ll take a little longer.

Update (January 10, 2016): As it turns out, I did find a way to use RegExReplace() to both add to the e-book index the chapter names identifying the sections of the book and keep the individual numeric links—all in one pass. (No Loops required.) The expression is fairly long, but the way it works is pretty cool. I’ll present it soon.

Eliminating Redundant Chapter Names

To get the thought process started, I referred back to “Chapter Five: Eliminating Double Words with RegEx” in my e-book A Beginner’s Guide to Using Regular Expressions in AutoHotkey. While the expression in the book serves a different double-word clean-up function, it was enough to get me moving in the right direction.

I used Ryan’s RegEx Tester (shown below and discussed in the last blog) to develop a working expression. First, I copied a piece of the text from above into the “Text to be searched” field. I added my sample expression and the replacement text ($1, the first subpattern). Then, I began adjusting the initial expression until I started seeing the right results.

RegExTesterDoubleText.jpg
A portion of the original text is copied into the top of the RegEx Tester, then the expression and replacement can be adjusted until it works. The results instantly appear in the bottom field.

The final RegEx is very similar to that from last time and uses many of the same techniques:

(Chapter\s[^,]+)(,\s\1)+

There are two new RegEx techniques in this piece. (If you’re new to Regular Expressions, it may be worthwhile to review my last blog where I explain many of the symbols and techniques also used in this one.)

How the RegEx Works

The RegEx starts matching by looking for the word Chapter followed by a space (\s). Then, it continues matching every following character which is not a comma [^,]+ until it does hit a comma. The entire portion of the expression is placed within parentheses, (Chapter\s[^,]+), to create a backreference (match saved as a subpattern) for later use. This equates to the chapter name (e.g. Chapter Five).

This same first backreference, or subpattern, represented by the symbol $1, is used in the Replacement Text field as the final substitution. Ultimately, only one chapter name will remain in the text for each chapter number.
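To see what the first set of parentheses captures on its own, here is a minimal sketch in AutoHotkey v1 syntax. The sample text and the variable names (Sample, Match) are made up for illustration; they are not from the e-book index itself.

```autohotkey
; Hypothetical sample text, invented for this illustration.
Sample := "Chapter Five, Chapter Five, Chapter Six"

; In the default RegExMatch() mode, Match receives the whole match
; and Match1 receives the first parenthesized subpattern.
RegExMatch(Sample, "(Chapter\s[^,]+)", Match)

MsgBox, % "First subpattern: " . Match1   ; shows Chapter Five
```

Because [^,]+ stops at the first comma, the subpattern holds only the chapter name, which is the same text that $1 would insert as the replacement.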

The second portion of the expression (,\s\1) is looking for a redundant chapter name match preceded by a comma , and a space \s. This is where the two new RegEx elements are introduced. The first is the backreference expression \1, which tells RegEx to match the first subpattern (if found). Backreferences are numbered sequentially as the sets of parentheses appear in the expression (i.e. \1, \2, \3, …). Thus, there is a match for this expression whenever a comma , and a space \s are followed by the same chapter name in the form of the backreference \1, the first subpattern.
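The effect of \1 can be checked with a short test. This is only a sketch in AutoHotkey v1 syntax, using two invented sample strings: one with a repeated chapter name and one without.

```autohotkey
; Hypothetical sample strings, invented for this illustration.
Repeated  := "Chapter Five, Chapter Five, Chapter Six"
Different := "Chapter Five, Chapter Six"

; \1 only matches the exact text captured by the first parentheses.
Pos1 := RegExMatch(Repeated,  "(Chapter\s[^,]+),\s\1", Match)
Pos2 := RegExMatch(Different, "(Chapter\s[^,]+),\s\1")

; RegExMatch() returns the position of the match, or 0 when nothing matches.
MsgBox, % "Repeated: position " . Pos1 . ", matched " . Match
    . "`nDifferent: position " . Pos2
```

The first string matches at position 1 with "Chapter Five, Chapter Five". The second returns 0 because "Chapter Six" is not the same text as the captured "Chapter Five".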

The second new feature is the alternative use of parentheses. While placing this second portion of the expression inside parentheses does create a second subpattern/backreference, that is not the purpose here. After all, the second subpattern does not need saving, since anything matched after the first chapter name is going to be dumped as redundant. Here the parentheses are included only so we can add the plus sign + and keep repeating the entire comma/space/chapter name matching (,\s\1)+ for any number of redundant chapter names.
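Since the second set of parentheses is used only for grouping, a non-capturing group (?:…) would serve the same purpose. AutoHotkey's RegEx engine (PCRE) supports that syntax, although it is not what this article's expression uses. The sketch below, in AutoHotkey v1 syntax with an invented sample string, runs both forms side by side.

```autohotkey
; Hypothetical sample text, invented for this illustration.
Sample := "Chapter Five, Chapter Five, Chapter Five, Chapter Six, Chapter Six"

; The article's expression: the second parentheses also create subpattern 2.
Capturing := RegExReplace(Sample, "(Chapter\s[^,]+)(,\s\1)+", "$1")

; Alternative: (?: ) groups for the + without saving a subpattern.
NonCapturing := RegExReplace(Sample, "(Chapter\s[^,]+)(?:,\s\1)+", "$1")

MsgBox, % Capturing . "`n" . NonCapturing
```

Both lines of the message box read "Chapter Five, Chapter Six". For a short expression like this one the choice is mostly a matter of taste; the non-capturing form just makes it explicit that only the first subpattern is kept.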

Taking the code from last time, we change the hotkey combination to CTRL+ALT+U (^!u), add the new expression (Chapter\s[^,]+)(,\s\1)+ and replacement value $1 to the RegExReplace() function, giving us our script:

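^!u:: ; Remove redundant chapter names from the selected text
  OldClipboard = %Clipboard%  ;save the current clipboard text
  Clipboard:= ""              ;empty the clipboard so ClipWait can detect the copy
  Send, ^c ;copies selected text
  ClipWait 0
  If ErrorLevel
    {
       MsgBox, No Text Selected!
       Return
    }
  Clipboard := RegExReplace(Clipboard     ;these two lines
       ,"(Chapter\s[^,]+)(,\s\1)+","$1") ;merge into one
  SendInput, ^v               ;paste the cleaned-up text over the selection
  Clipboard = %OldClipboard%  ;restore the original clipboard text
Return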

Note: The two lines for the RegExReplace() function are wrapped for display purposes. Line continuation techniques are used to merge the two lines into one.
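For readers unfamiliar with line continuation, the following sketch (AutoHotkey v1 syntax, with an invented sample string) shows the idea in isolation: a line that begins with a comma is merged with the line above it, so a long function call can be split wherever a parameter starts.

```autohotkey
; Hypothetical sample text, invented for this illustration.
Text := "Chapter One, Chapter One"

Result := RegExReplace(Text          ; the call starts here
    , "(Chapter\s[^,]+)(,\s\1)+"     ; leading comma joins this line to the one above
    , "$1")                          ; and this one as well

MsgBox, % Result                     ; shows Chapter One
```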

This snippet works on text-only input selected (highlighted) in most text editing windows. It deletes all the redundant chapter names. However, any hot links will be lost.

Caution: This text-only code is not likely to work on editing screens which create WYSIWYG (What You See Is What You Get) output. These programs have underlying source code which interacts with the editing screen. You may need to copy the text into a text editor such as Notepad or WordPad, select the text, run the script, then copy the results back, replacing the original. The more user-friendly the program, the less likely that this RegExReplace() reformatting approach will work directly in the editing window. This caution also applies to the RegEx from last time.

This is almost exactly the same script as that used in the last blog, except the RegEx and the replacement value have been changed in the RegExReplace() function.

Next time, we’ll look at doing the same thing with the HTML code which will save the first hot link.
