Quick and Dirty Web Data Extraction Script (An Easy AutoHotkey RegEx Trick)

A Simple Regular Expression (RegEx) Retrieves Your Daily Horoscope by Harvesting Data from a Web Page—This 10-Line AutoHotkey Script Demonstrates How to Build Your Own Web-Based Pop-ups

Regular Expressions (RegEx) can get pretty complicated, but for this desktop trick, you only need to learn one short wildcard expression. Anyone can implement this simple pop-up window trick—displaying virtually any selected data found on the Web without loading a browser. Perhaps you would like a message box displaying the current weather. Or, maybe you want to read your daily horoscope. If it’s on the Web, then you can probably turn it into a quick AutoHotkey app.

As a demonstration (and possible template for other pop-up apps), I’ve written a short script which, without a browser, accesses an astrology Web page and displays my daily horoscope in a Windows message box. You can find the code for this Horoscope.ahk script at the end of this blog.

I’ve used RegEx functions in many of my scripts: the SynonymLookup.ahk script culls synonyms from a Web page; lately, the InstantHotstring.ahk script identifies and loads Hotstrings from an external file; and the IPFind.ahk script captures IPs, then looks up their geographic location. I consider the AutoHotkey RegEx functions (RegExMatch() and RegExReplace()) so useful that I wrote a book aimed specifically at understanding how to use those functions in AutoHotkey scripts. Capturing data can get complicated when done with the usual string functions (InStr() and SubStr()), which often require loops and multiple lines of code, while a RegEx often does the job with a single statement.
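
To make that comparison concrete, here is a minimal sketch in AutoHotkey v1 syntax. The sample text and the variable names (Sample, CityManual, CityRegEx) are made up for illustration and do not come from any of the scripts mentioned above.

```autohotkey
; AutoHotkey v1 sketch. The sample text and variable names are hypothetical.
Sample := "Name: Alice; City: Paris; Zip: 75001"

; Manual approach: locate both markers, then cut out the text between them.
StartPos := InStr(Sample, "City: ") + StrLen("City: ")
EndPos := InStr(Sample, ";", false, StartPos)
CityManual := SubStr(Sample, StartPos, EndPos - StartPos)

; RegEx approach: one statement captures the same text into CityRegEx1.
RegExMatch(Sample, "City: (.*?);", CityRegEx)

MsgBox % "InStr/SubStr: " . CityManual . "`nRegExMatch: " . CityRegEx1
```

Both lines of the message box should show Paris. The manual version needs three statements and some position arithmetic, while the RegEx version needs one.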

Fixing the IPFind.ahk Script

A while back, my IPFind.ahk script stopped working. Something probably changed in the target Web page which caused the RegEx to fail. I might have considered rewriting the expression for the current Web page, but decided to switch to a different site:

https://whatismyipaddress.com/ip/

I found the change so easy to implement that it dawned on me that anyone could write their own specialized Web data-based scripts by using this simple RegEx. Plus, for these pop-up apps, no one really needs to learn the inner workings of Regular Expressions. But first, let’s look at the changes in the IPFind.ahk script.

Extracting Data from a Web Page

The IPFind.ahk script currently captures any IP address (or multiple IP addresses) you select in any document or Web page, extracting each IP from the text placed in the Windows clipboard using a Regular Expression (RegEx). Then, using a technique for downloading the HTML code directly from the target Web page (in this case, https://whatismyipaddress.com/ip/ plus the captured IP), the script uses an easy RegEx function to copy the critical data. Finally, AutoHotkey displays the data via the MsgBox command (as shown in the IPFind New Window image).

First, I loaded a Web page which displays the data I want by entering the following URL into any browser:

https://whatismyipaddress.com/ip/93.67.151.28

In the script, I replaced the old URL code with:

IPsearch := "https://whatismyipaddress.com/ip/" . findip

As shown in the Web browser, I located the details (highlighted in blue) required to repair the IPFind.ahk script:

[Image: IPFind Web Page]

I opened the source code page by selecting “View page source” from Google Chrome’s right-click context menu. This gave me all I needed (highlighted in blue) to extract data from the downloaded file:

[Image: RegEx Selection]

(In Firefox, click the “View Page Source” item in the right-click context menu to open the HTML source code page. In Microsoft Edge, open the “View source” menu item.)

Next, I added the new conditions to the script for the RegExMatch() function by keying on the word “Continent” as a starting point and the word “Latitude” to terminate the extraction:

RegExMatch(version, "Continent(.*)Latitude", Location)

Ideally, the keywords which mark the boundaries of the target text are unique in the HTML code. Otherwise, AutoHotkey could match the wrong text.

The RegEx (.*) matches any character (the dot . wildcard) zero or more times (*). The set of parentheses saves the matching string in the first subpattern variable Location1. (See the Regular Expressions (RegEx) Quick Reference.)
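
A small sketch (AutoHotkey v1) may help show the difference between the whole match and the subpattern. The sample string below is made up and only loosely imitates the target page.

```autohotkey
; Hypothetical sample text standing in for the downloaded HTML.
Text := "Continent: Europe Latitude: 45.46"

RegExMatch(Text, "Continent(.*)Latitude", Location)

; Location holds the entire match, including both keywords.
; Location1 holds only what the parentheses captured.
MsgBox % "Location: " . Location . "`nLocation1: " . Location1
```

Here Location contains the keywords plus everything between them, while Location1 contains only the text between the keywords, which is the part the script wants.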

Often the wildcard expression (.*) does the job. However, on occasion, you may need to add a question mark to the wildcard (.*?) to prevent the original greedy form of the RegEx (without the ?) from capturing too much text. The question mark makes the match lazy: it stops at the first occurrence of the text that follows the wildcard rather than the last.
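
The following sketch (AutoHotkey v1, with a made-up HTML fragment) compares the greedy and lazy forms on text where the closing keyword appears twice.

```autohotkey
; Hypothetical HTML fragment with two bold sections.
Html := "<b>first</b> and <b>second</b>"

; Greedy: runs to the LAST </b>, so it swallows the text in between.
RegExMatch(Html, "<b>(.*)</b>", Greedy)

; Lazy: stops at the FIRST </b>.
RegExMatch(Html, "<b>(.*?)</b>", Lazy)

MsgBox % "Greedy: " . Greedy1 . "`nLazy: " . Lazy1
```

Greedy1 ends up as "first</b> and <b>second", while Lazy1 is just "first". When the closing keyword can repeat in the page, the lazy form is usually the safer choice.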

The RegEx (.*) captures everything between the two keywords saving it to the variable Location1 (the first subexpression—as designated by the set of parentheses). After removing the HTML tags and adding tabs, the MsgBox command displays the data as shown in the image above.

Rather than going into detail about how the IPFind.ahk script works, I demonstrate how you can instantly put this simple RegEx technique to work by presenting your daily horoscope in a MsgBox pop-up without opening a Web browser—no slow load time and no annoying ads.

(For a complete explanation of the IPFind.ahk script, see Chapter Eight “A Simple Way to Find Out Where in the World That IP Address Is Located” of the book A Beginner’s Guide to Using Regular Expressions in AutoHotkey.)

Writing a Quick and Dirty Horoscope Script

To write a script for extracting data from a Web page, follow these steps:

  1. Use your Web browser to locate a Web page which offers the data you want.
  2. Navigate within the Web page to determine which URL parameters display the information you need.
  3. Insert the Web page URL (with required parameters) into the proper location within the script.
  4. Open the HTML source code page to find which keywords constrain the target text.
  5. In the script, add those keywords plus the universal wildcard (.*?) to the RegExMatch() function.
  6. Use the MsgBox command to instantly display the captured text.
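
The steps above can be sketched as one reusable function in AutoHotkey v1. This is only an illustration under stated assumptions: GetBetween() is a hypothetical helper name, the ^!w Hotkey is arbitrary, and the s) option and \Q...\E quoting are additions that do not appear in the original scripts.

```autohotkey
; Hypothetical Hotkey (CTRL+ALT+W) that uses the helper defined below.
^!w::
    Text := GetBetween("https://www.astrology.com/us/horoscope/daily-extended.aspx?sign=cancer"
        , "<p><span class=""date"">", "</p>")
    MsgBox % (Text = "" ? "Keywords not found on the page." : Text)
Return

; Hypothetical helper: downloads Url and returns the text found between
; StartKey and EndKey with HTML tags removed, or an empty string if no match.
GetBetween(Url, StartKey, EndKey)
{
    whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
    whr.Open("GET", Url)
    whr.Send()
    ; \Q...\E treats the keywords as literal text (this breaks if a keyword
    ; itself contains \E). The s) option lets the dot also match line breaks.
    Needle := "s)\Q" . StartKey . "\E(.*?)\Q" . EndKey . "\E"
    if (!RegExMatch(whr.ResponseText, Needle, Found))
        return ""
    return RegExReplace(Found1, "<.+?>")
}
```

Returning an empty string when nothing matches makes it easy to tell a changed Web page apart from an empty result, which is the failure described earlier for the IPFind.ahk script.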

Locate a Target Web Page and the Required Parameters

First, I found a Web page with the information I wanted to display in a pop-up message box. In this case, I located the following page which uses the parameter sign to designate my birth sign:

https://www.astrology.com/us/horoscope/daily-extended.aspx?sign=cancer

(Appears as one continuous line in .ahk scripts.)

The added parameter ?sign=cancer forces the site to load the appropriate page.

Insert New URL into Script

I added it to the Horoscope.ahk script:

GetHoroscope := "https://www.astrology.com/us/horoscope/daily-extended.aspx?sign=cancer"

(Appears as one continuous line in .ahk scripts.)

Open HTML Source Code Page and Identify Keywords for RegExMatch()

With the Web page loaded into a Web browser, I open the source code page:

[Image: HoroscopeSourceCode]

You can see that the HTML code

<p><span class="date">

precedes the target text while

</p>

follows it. I used these as the keywords for the RegExMatch() extraction:

RegExMatch(Horoscope, "<p><span class=""date"">(.*?)</p>", Today)

(Appears as one continuous line in .ahk scripts.)

Note: You must escape any double-quotation marks inside the RegExMatch() parameter quotes by preceding each with an additional double-quotation mark. Otherwise, the RegEx fails.
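
As a quick illustration of the escaping rule (AutoHotkey v1), the sketch below runs against a made-up HTML string. The second call shows one possible alternative, the hexadecimal escape \x22 for the double-quotation mark, which is not used in the original script.

```autohotkey
; Hypothetical HTML sample. Inside a quoted string, "" stands for one ".
Html := "<p><span class=""date"">Sample text</span></p>"

; Doubled quotation marks inside the RegEx string.
RegExMatch(Html, "<span class=""date"">(.*?)</span>", Match)

; Alternative: let the RegEx engine read \x22 as a double-quotation mark.
RegExMatch(Html, "<span class=\x22date\x22>(.*?)</span>", Alt)

MsgBox % "Doubled quotes: " . Match1 . "`n\x22 escape: " . Alt1
```

Both Match1 and Alt1 should contain "Sample text". The doubled form matches what you see in the page source more closely, while \x22 can be easier to read when a pattern contains many quotation marks.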

Cleanup Text

Remove any excess HTML tags using the RegExReplace() function:

Today1 := RegExReplace(Today1,"<.+?>")

Depending upon the content of the text, you may need to use the StrReplace() function to remove unwanted characters.
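
For example, downloaded text often contains HTML entities. The sketch below (AutoHotkey v1, requires StrReplace() from v1.1.21 or later) cleans a made-up sample string. Which entities actually appear depends on the page, so treat the list as an assumption.

```autohotkey
; Hypothetical text as it might look after the HTML tags are removed.
Text := "Don&#39;t rush &amp; stay calm.&nbsp;"

Text := StrReplace(Text, "&#39;", "'")    ; apostrophe
Text := StrReplace(Text, "&nbsp;", " ")   ; non-breaking space
Text := StrReplace(Text, "&amp;", "&")    ; ampersand last, to avoid double decoding
Text := Trim(Text)                        ; drop leading and trailing spaces

MsgBox % Text
```

The message box should read: Don't rush & stay calm.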

Display Pop-up Message Box

AutoHotkey pops up the daily horoscope using the MsgBox command:

MsgBox %Today1%

Tip: Depending upon the script, rather than displaying the data in a message box, you can Send the variable Today1 to another application or save it to a file. Next time, I plan to review an AutoHotkey technique for sending e-mail without using an e-mail program.
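
As a minimal sketch of the save-to-file option (AutoHotkey v1), the lines below append the captured text to a dated log. The file name Horoscope.txt and the sample value of Today1 are assumptions made for illustration.

```autohotkey
; Stand-in for the text captured by RegExMatch() in the real script.
Today1 := "Sample horoscope text"

; Hypothetical output file in the same folder as the script.
OutFile := A_ScriptDir . "\Horoscope.txt"

; Prefix each entry with today's date, for example 2024-01-31.
FormatTime, Stamp,, yyyy-MM-dd
FileAppend, %Stamp%: %Today1%`n, %OutFile%
```

Each run adds one line to the file, so the log grows into a simple history of daily readings.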

[Image: HoroscopeMsgBox]

This short Horoscope.ahk script uses the same Web page download routine found in the IPFind.ahk script, replacing the IP URL with the astrology URL and the original IP location RegExMatch() function with the new RegEx for extracting horoscope data.

Since the script only displays the current horoscope for one sign, I didn’t add a method for birth sign selection. To accommodate other astrological signs, you must either edit the script or add another AutoHotkey tool (e.g. DropDownList GUI control or selection menu) which allows an astrological selection.
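
One way to add that selection is sketched below in AutoHotkey v1 with a DropDownList control. The GUI name SignPicker and the label ShowHoroscope are made up, and the sketch assumes the site accepts the other lowercase sign names in the same sign parameter, which only the cancer example confirms.

```autohotkey
^!q::
    ; "SignPicker" is a hypothetical GUI name. New destroys any earlier copy.
    Gui, SignPicker:New
    ; The double pipe after cancer makes it the preselected item.
    Gui, Add, DropDownList, vSign, aries|taurus|gemini|cancer||leo|virgo|libra|scorpio|sagittarius|capricorn|aquarius|pisces
    Gui, Add, Button, Default gShowHoroscope, OK
    Gui, Show,, Pick a sign
Return

ShowHoroscope:
    ; Store the selection in the variable Sign and hide the window.
    Gui, Submit
    GetHoroscope := "https://www.astrology.com/us/horoscope/daily-extended.aspx?sign=" . Sign
    whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
    whr.Open("GET", GetHoroscope)
    whr.Send()
    Horoscope := whr.ResponseText
    RegExMatch(Horoscope, "<p><span class=""date"">(.*?)</p>", Today)
    Today1 := RegExReplace(Today1, "<.+?>")
    MsgBox %Today1%
Return
```

Pressing the Hotkey opens the small picker window, and clicking OK (or pressing Enter) downloads and displays the horoscope for the chosen sign.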

Complete Horoscope.ahk Code

Copy this code and save it in a file named Horoscope.ahk to create a daily horoscope pop-up. If you weren’t born in July (or late June), then change the sign from cancer to your astrological sign. Load the script and use the Hotkey combination CTRL+ALT+Q (^!q) to read your daily horoscope.

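^!q::
  ; Astrology page for one sign. Change "cancer" to your own sign.
  GetHoroscope := "https://www.astrology.com/us/horoscope/daily-extended.aspx?sign=cancer"
  ; Download the page's HTML without opening a browser.
  whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
  whr.Open("GET", GetHoroscope)
  whr.Send()
  Sleep 100
  Horoscope := whr.ResponseText
  ; Capture the text between the opening and closing tags into Today1.
  RegExMatch(Horoscope, "<p><span class=""date"">(.*?)</p>", Today)
  ; Strip any remaining HTML tags, then display the result.
  Today1 := RegExReplace(Today1,"<.+?>")
  MsgBox %Today1%
Return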

(Some lines word-wrapped for display purposes. When copied, the wrapped lines appear as one continuous line.)

I highlighted the key script changes in red. Using this script as a Web data-scraping template, you can modify it to create pop-up apps for virtually any other Web page. Be sure to change the Hotkey combination for each new pop-up.

Next time, I’ll show you how to e-mail the data to yourself (or anyone else) in AutoHotkey.

jack
