Counting Words with AutoHotkey (RegEx)

How You Count Words Depends Upon How You Define a Word

While working on my new book, I finished a chapter where, in response to a reader’s question, I demonstrated how to count commas in a document with both the StringReplace command and the RegExReplace() function. The StringReplace command uses the UseErrorLevel parameter to save the count in ErrorLevel, while the RegExReplace() function automatically stores the number of replacements in its OutputVarCount variable. That gave me the idea to write a word count script using RegExReplace().

Note: The StrReplace() function, which supersedes the deprecated StringReplace command, returns its count through an OutputVarCount parameter, much like the RegExReplace() function.
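As a minimal sketch of the comma count with both functions, assuming AutoHotkey v1.1.21 or later (where StrReplace() is available), the lines below count the commas in a short string. The sample text and the variable names are made up for illustration:

```autohotkey
; Illustrative sample text; any string or the Clipboard would work the same way.
Text := "one, two, three, four"

; Both functions store the number of replacements in their fourth parameter.
StrReplace(Text, ",", "", CommaCount)
RegExReplace(Text, ",", "", RegExCount)

; Both counts are 3 for the sample above.
MsgBox % "StrReplace: " . CommaCount . "`nRegExReplace: " . RegExCount
```

The return value (the text with the commas removed) is simply discarded; only the count matters here.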

The problem seems simple enough. I only need to match and count the individual words. (See the AutoHotkey Regular Expressions (RegEx) – Quick Reference for the meaning of the RegEx operators in the following lines of code.) The OutputVarCount variable returns the number of matches found in the document. The \w operator matches any alphanumeric character (plus the underscore). So, I start with:

RegExReplace(Clipboard, ".*?\w+" ,
      , OutputVarCount, -1, 1)

(The RegExReplace() sample lines of code are word-wrapped using line continuation techniques.)

The first part of the expression ( .*? ) appearing before the \w+ tells RegEx to consume as little as possible until it can match a letter or number. The + then continues matching as long as it encounters alphanumeric characters. However, the \w+ expression includes numbers, which I don’t want to count as words. Therefore, I switch to letters only (lower and uppercase):

RegExReplace(Clipboard, ".*?[a-zA-Z]+" ,
      , OutputVarCount, -1, 1)

This expression certainly improves the result; however, it counts contractions as two words. I add the apostrophe to the character class to eliminate the word break:

RegExReplace(Clipboard, ".*?[a-zA-Z']+" ,
      , OutputVarCount, -1, 1)

Now, contractions get counted as one word, but hyphenated words continue to count as two. Adding the hyphen to the class fixes that:

RegExReplace(Clipboard, ".*?[a-zA-Z-']+" ,
      , OutputVarCount, -1, 1)

This takes care of hyphenated words, but I have a problem with AutoHotkey functions counting as two words. To deal with that situation, I add the open parenthesis ( to reduce AutoHotkey functions with one parameter down to one word:

RegExReplace(Clipboard, ".*?[a-zA-Z-(']+" ,
      , OutputVarCount, -1, 1)
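To see the effect of each change side by side, here is a minimal sketch in AutoHotkey v1.1 syntax that runs all five expressions against one sample sentence. CountMatches() is a hypothetical helper name and the sample sentence is made up for illustration. It uses straight apostrophes, since a curly apostrophe would not match the ' in the character class:

```autohotkey
; Hypothetical helper: returns how many times Pattern matches in Text.
CountMatches(Text, Pattern)
{
    ; RegExReplace() stores the number of replacements in its fourth parameter.
    RegExReplace(Text, Pattern, "", Count)
    Return Count
}

Sample := "Don't count 42 well-known calls like StrLen(Var)"

Patterns := [".*?\w+"               ; 10: Don|t, 42, well|known, StrLen|Var all split
           , ".*?[a-zA-Z]+"         ;  9: the number 42 no longer counts
           , ".*?[a-zA-Z']+"        ;  8: Don't becomes one word
           , ".*?[a-zA-Z-']+"       ;  7: well-known becomes one word
           , ".*?[a-zA-Z-(']+"]     ;  6: StrLen(Var becomes one word

Report := ""
For Index, Pattern in Patterns
    Report .= Pattern . "  ->  " . CountMatches(Sample, Pattern) . "`n"
MsgBox % Report
```

Each added character lowers the count by one for this sentence, which is exactly the one-case-at-a-time tuning described above.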

I also have a problem with filenames which include a dot (filename.ahk) returning two words. I could add the dot ( . ) to the expression but that would cause any concatenation operators to count as a word. This gets crazy!
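The trade-off with the dot shows up on a small sample. The sketch below (AutoHotkey v1.1 syntax) compares the character class without the dot, the same class with the dot added, and a third expression that is only my assumption of one possible workaround, not part of the final script: it accepts a dot only when letters follow it immediately. The sample text and variable names are made up for illustration:

```autohotkey
Sample := "Run filename.ahk then join Var . Other"

; No dot in the class: filename and ahk count separately (7 matches).
RegExReplace(Sample, ".*?[a-zA-Z-(']+", "", NoDot)

; Dot in the class: filename.ahk is one word, but the lone
; concatenation operator now counts as a word too (7 matches).
RegExReplace(Sample, ".*?[a-zA-Z-(.']+", "", WithDot)

; Assumed workaround: allow a dot only when letters follow it directly (6 matches).
RegExReplace(Sample, ".*?[a-zA-Z-(']+(?:\.[a-zA-Z]+)*", "", DotBetweenLetters)

MsgBox % "Without dot: " . NoDot . "`nWith dot: " . WithDot . "`nDot between letters only: " . DotBetweenLetters
```

The third expression gets this sample right, but it also shows how quickly the expression grows with each special case.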

I can certainly find methods for dealing with the variations needed in the Regular Expressions, but how accurate do I need the count to be? At some point, I realize that I either accept my word count as close enough or continue working on the Regular Expression, which could end up ridiculously complicated.

When working in the WordPress blogging software, I get a continuously updated count of the number of words in the post. In my testing, the WordPress count does everything that I need. I suspect the WordPress algorithm involves a significant amount of complexity.

I settled on the following script:

^#!w::
  OldClipboard := ClipboardAll
  Clipboard = ;clears the Clipboard
  Send, ^c
  ClipWait 0 ;pause for Clipboard data
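  If ErrorLevel
  {
    MsgBox, No text selected!
    Clipboard := OldClipboard ;restore the Clipboard before quitting
    Return ;skip the count when nothing was copied
  }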
  RegExReplace(Clipboard, ".*?[a-zA-Z-(']+" ,
        , OutputVarCount, -1, 1) 
  MsgBox, %OutputVarCount% words!
  Clipboard := OldClipboard
Return

Select the text for word counting and press the CTRL+WIN+ALT+W Hotkey combination; AutoHotkey then pops up the number of words in a MsgBox.

If you use software which includes word counting, then you probably don’t need this script. However, if you use Notepad or another non-counting app, you may want your own specialized script.

jack
