An SEO Guide for Automating GTmetrix with Python

For many SEOs, GTmetrix is a well-known and respected site performance scanner. In this intermediate tutorial, I’m going to show you how to automate GTmetrix using Python and store the performance data in MySQL. Usually, SEOs access GTmetrix via their web app. Lucky for us, they also offer a REST API.

Note: The free limit is 20 API calls per day, and this tutorial requires 3 calls per URL. Doing the math, you can scan 6 URLs a day. If you have more URLs, I show you how to spread them out over a week at the bottom of the article.

Placeholders exist in some of the code below where you need to fill in the details for your environment.

Requirements and Assumptions

Creating MySQL Tables

First, we need to set up the database where we’ll store our scan data. There are two common ways to access and manage MySQL: via the command terminal and via the phpMyAdmin GUI provided by cPanel.

Option 1 – Command Line Terminal:

If you don’t have cPanel or are most comfortable in the terminal, see this guide for logging into MySQL and creating the database and user. Then, run the SQL statements below to create the tables.

Option 2 – phpMyAdmin:

If you have access to cPanel, you can create the database and user in the MySQL Databases area. After that, head over to phpMyAdmin (also found in cPanel). Select your database from the list on the left side and, in the SQL tab found at the top, enter the SQL statement below to create the table that will contain the websites you want to run GTmetrix on. If you have this table already created from one of my earlier guides, you can reuse that table instead.

CREATE TABLE websites (
    websiteid int NOT NULL AUTO_INCREMENT,
    name varchar(255),
    url varchar(255),
    PRIMARY KEY (websiteid)
);

At this point, you’ll have an empty table for your websites (URLs). Naturally, you’ll want to populate this table with the URLs you want to scan. I usually just focus on the homepage. If you created your table in phpMyAdmin, you can select it in the left-side column and then select “Insert” at the top. Fill out that form for each website you want to scan.

You can also insert website records via SQL as shown below (websiteid is auto-generated):

INSERT INTO websites (name,url) VALUES ("Rocket Clicks","https://rocketclicks.com");

Next, we’re ready to create the table for the scan data using the SQL statement below:

CREATE TABLE gtmetrix_scans (
    gtmetrixid int NOT NULL AUTO_INCREMENT,
    websiteid int(255),
    date varchar(255),
    yslow int(255),
    num_requests int(255),
    page_size int(255),
    wait_time int(255),
    connect_time int(255),
    css_size int(255),
    css_time int(255),
    js_size int(255),
    js_time int(255),
    image_size int(255),
    image_time int(255),
    report_url varchar(255),
    PRIMARY KEY (gtmetrixid)
);

Importing Necessary Modules

Python modules are like libraries in other coding languages. They are collections of premade functions that you can use to save time by not reinventing the wheel. Most of the Python modules we’re going to use should be preinstalled, but two that aren’t are haralyzer and mysql.connector. To install them, go to your command terminal and run both of these commands:

pip3 install mysql-connector-python

pip3 install haralyzer

If you get any errors about other missing modules, you can use the same command above to install the rest. Just make sure to replace the last part with the name of the new module. Sometimes the names aren’t obvious; you can search for module names here.

Getting the Script Ready

Next, fire up your favorite code editor or IDE. I recommend PyCharm for more experienced coders or Thonny for beginners.

Place the code below on the first line of the file. It’s called a shebang or hashbang and tells Linux how to execute the file. It’s often optional, but it is required when running from a cronjob, which we will be doing later. This line tells Linux to run the script using Python 3.

#!/usr/bin/python3

First, let’s import the Python modules we’re going to use.

### Get date
from datetime import date

### Get time
from datetime import datetime

### For making the api call
import requests

### For mySQL functions
import mysql.connector

### For script delay
import time

### Process the json api response
import json

### Analyze the har file
from haralyzer import HarPage

Now we’ll retrieve the list of website names and URLs from the table you created earlier. We’ll use this list to have GTmetrix loop through each of those websites, assigning each row’s values to the variables clientid, name, and c_url.

Customize the “api_key“, “mydb“ and “sql_websites“ variables with your own details. You can find your API key here. Note that the script passes api_key straight into a Basic Authorization header, so it needs to hold base64-encoded credentials; for the GTmetrix 0.1 API, that is your account e-mail address and API key joined by a colon.

mydb = mysql.connector.connect(port="3306", host="",user="",password="", database="")
cursor = mydb.cursor()

api_key = "YOUR_API_KEY"
sql_websites = "SELECT * FROM websites"
cursor.execute(sql_websites)
records = cursor.fetchall()
for row in records:
    clientid = str(row[0])
    name = str(row[1])
    c_url = str(row[2])

Next, we’ll make the first of 3 API calls using the Python requests module. This call initiates the GTmetrix scan and returns the state of the scan. Optionally, this is where you can use the Postman app to test your API connection independently of the Python script, which makes debugging API errors easier. Once you’re getting valid responses to your API calls, you can continue below. Note that this API request is a POST.

    url = "https://gtmetrix.com/api/0.1/test"
    payload = {'url': c_url}
    files = [
    ]
    headers = {
      'Authorization': 'Basic ' + api_key
    }
    response = requests.request("POST", url, headers=headers, data = payload, files = files)
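
Optionally, you can add a quick sanity check here that isn’t part of the original script: if the API rejects the scan request (for example, because the daily call limit has been reached), skip that URL rather than failing later when parsing the response.

    ### Optional sanity check (not in the original script):
    ### skip this URL if the API rejected the scan request
    if not response.ok:
        print(clientid + ": API error " + str(response.status_code) + " - " + response.text)
        continue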

Most GTmetrix scans take a couple of minutes, so before we can continue we want to wait for the scan to complete. It is possible to keep polling the API to check the scan state, but that will chew up a lot of API calls. That might be fine if you have the paid version, but for those using the free account, I use a script delay of 5 minutes, which should be enough time for the scan to complete. (A polling alternative is sketched after the code below.)

    i_call = json.loads(response.text.encode('utf8'))
    r_url = i_call['poll_state_url']
    print(clientid + ": stage 1 success")
    print("Report URL: " + r_url)
    time.sleep(300)
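
As an aside, if you’re on a paid plan and would rather poll than wait a fixed 5 minutes, here is a minimal sketch of that approach (my own addition, not part of the original script) that could replace the time.sleep(300) line above. It assumes the poll response reports the scan state in a "state" field, as the GTmetrix 0.1 API does:

    ### Alternative to the fixed delay: poll the state URL until the scan finishes.
    ### Every check costs an API call, so this suits paid accounts better.
    for attempt in range(20):
        poll = requests.request("GET", r_url, headers=headers)
        state = json.loads(poll.text).get('state', '')
        if state in ('completed', 'error'):
            break
        time.sleep(30)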

Next, once we are confident the scan is complete, we make our second call to access the scan data. The headers again include your api_key variable. Note that this API request is a GET this time.

    url2 = r_url
    headers2 = {
      'Authorization': 'Basic ' + api_key
    }
    response2 = requests.request("GET", url2, headers=headers2)

Now it’s time to grab all that great performance data from the scan. GTmetrix makes some high-level metrics available from this second call. First, we load the API response into a variable using the json module we imported. This turns the JSON into a Python dictionary (associative array), which is easier to navigate.

More detailed information is contained in a HAR file. HAR stands for HTTP Archive; it’s a JSON file that describes how the browser interacts with the web page, and we’re interested in the performance information it provides. For some calls, I’ve found that the HAR file doesn’t exist, so we’re going to employ a try/except block. This gives us a way to continue the script if an error occurs because the file doesn’t exist.

    r_call = json.loads(response2.text.encode('utf8'))
    
    try:
        har_file = r_call['resources']['har']
    except:
        har_file = "none"
    report_url = str(r_call['results']['report_url'])
    num_requests = str(r_call['results']['page_elements'])
    page_size = str(r_call['results']['page_bytes'])
    yslow = str(r_call['results']['yslow_score'])
    connect_time = str(r_call['results']['connect_duration'])
    wait_time = str(r_call['results']['backend_duration'])
    print(clientid + ": stage 2 success")

Now it’s time to get the performance data that is stored within the HAR file. Let’s make that final API call.

    if har_file != "none":
        url3 = har_file
        headers3 = {
          'Authorization': 'Basic ' + api_key
        }
        response3 = requests.request("GET", url3, headers=headers3)

Now, let’s load that response into another Python dictionary. We then feed the dictionary to the Haralyzer module to calculate metrics like image download time, CSS size, and JavaScript timing.

        h_call = json.loads(response3.text.encode('utf8'))

        har_page = HarPage('page_0', har_data=h_call)

        image_time = str(round(har_page.image_load_time))
        css_time = str(round(har_page.css_load_time))
        js_time = str(round(har_page.js_load_time))
        image_size = str(round(har_page.image_size / 1000))
        css_size = str(round(har_page.css_size / 1000))
        js_size = str(round(har_page.js_size / 1000))
    else:
        image_time = "0"
        css_time = "0"
        js_time = "0"
        image_size = "0"
        css_size = "0"
        js_size = "0"

Finally, we generate today’s date, build our SQL statement using the metrics we loaded into variables above, and then execute and commit the statement to insert the new record.

    today = date.today()
    getdate = today.strftime('%m/%d/%Y')

    cursor1 = mydb.cursor()

    new_scan1 = "INSERT INTO gtmetrix_scans (websiteid,date,yslow,num_requests,page_size,wait_time,connect_time,css_size,css_time,js_size,js_time,image_size,image_time,report_url) VALUES ('" + clientid + "','" + getdate + "','" + yslow + "','" + num_requests + "','" + page_size + "','" + wait_time + "','" + connect_time + "','" + css_size + "','" + css_time + "','" + js_size + "','" + js_time + "','" + image_size + "','" + image_time + "','" + report_url + "')"
    print(new_scan1)
    cursor1.execute(new_scan1)
    mydb.commit()
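
As an aside, building the INSERT by string concatenation works, but it’s brittle (a stray quote in any value breaks it). mysql.connector also supports parameterized queries; here is a sketch you could use in place of the concatenated statement above, with the same variables:

    ### Alternative: parameterized insert with the same values
    insert_sql = ("INSERT INTO gtmetrix_scans (websiteid,date,yslow,num_requests,page_size,"
                  "wait_time,connect_time,css_size,css_time,js_size,js_time,image_size,"
                  "image_time,report_url) "
                  "VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)")
    values = (clientid, getdate, yslow, num_requests, page_size, wait_time, connect_time,
              css_size, css_time, js_size, js_time, image_size, image_time, report_url)
    cursor1.execute(insert_sql, values)
    mydb.commit()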

mydb.close()

Automating the Scan

If your GTmetrix Python script works well when you run it manually, it’s time to automate it. Luckily, Linux already supplies a solution: the crontab. The crontab stores entries for scripts along with a schedule that dictates when to execute them. You have lots of flexibility in how you schedule your scan (any time of day, day of the week, day of the month, etc.). To add entries to the crontab, run this command:

crontab -e

It will likely open the crontab file in the vi editor. On a blank line at the bottom of the file, type the code below. This entry will run the script at midnight every Sunday. To change the time to something else, use this cronjob time editor. Customize the path to your script.

0 0 * * SUN /usr/bin/python3 PATH_TO_SCRIPT/filename.py

If you want to create a log file that records each time the script runs, use this instead. Customize the paths to your script and log file.

0 0 * * SUN /usr/bin/python3 PATH_TO_SCRIPT/filename.py > PATH_TO_FILE/FILENAME.log 2>&1

Save the crontab file and you’re good to go! Just note that your computer needs to be on at the time the cronjob is set to run.

Spreading Scans Over a Week (Optional)

Let’s say you want to scan more URLs than your free daily API credits allow. Remember, the script above needs 3 calls for each URL and you get a max of 20 calls a day. One strategy I use is to automatically spread the scans over the course of a week. Who needs a performance scan more than once a week anyway!

First, let’s create a new field in our websites database table:

ALTER TABLE websites ADD scan_group int(255);

From there, you need to go through the website rows and insert a number from 0-6 into that new scan_group field. That assigns each website a day of the week, represented by that number. You can do this by hand in phpMyAdmin, or automatically with a single UPDATE statement, as sketched below.
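
Here is a minimal one-off sketch of that UPDATE, run through mysql.connector so it matches the rest of this guide (fill in your own connection details; MOD spreads the auto-increment ids across the values 0-6):

#!/usr/bin/python3
### One-off helper: assign each website a scan_group of 0-6 automatically
import mysql.connector

mydb = mysql.connector.connect(port="3306", host="", user="", password="", database="")
cursor = mydb.cursor()

### Spread websites across the 7 days of the week based on their auto-increment id
cursor.execute("UPDATE websites SET scan_group = MOD(websiteid - 1, 7)")
mydb.commit()
mydb.close()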

Then we grab the numerical value for today’s day of the week. Add this to the original script right below the importing of modules.

'''
LEGEND
mon=0
tues=1
wed=2
thurs=3
fri=4
sat=5
sun=6
'''

today = datetime.today().weekday()

Once we have our numerical value for the day, we can add it to our SQL statement grabbing only the websites whose value in the scan_group field matches today’s value. This way you spread out your scans and stay within your free API daily limit.

Now replace:

sql_websites = "SELECT * FROM websites"

with:

sql_websites = "SELECT * FROM websites WHERE scan_group="+str(today)

Conclusion

So there you have it! As you can see, you can automate GTmetrix performance scanning using their free API and store the data in MySQL without too much effort. Naturally, the next step would be to tap into the database with another script or existing application to display or further analyze the data. Please follow me on Twitter for feedback and to see interesting ways to extend the script. Enjoy!

Next up, see my guide on Automating Screaming Frog with Python!

Greg Bernhardt

Greg Bernhardt graduated from UW-Milwaukee with a degree in Information Studies. Greg has been involved in web design, development and marketing since creating his first website on Tripod in 1997. Greg's passion is SEO and how to enhance it using Python and other technologies.
