Code Example – How We Transform Log Files to Improve SEO Rankings

Posted Oct 02, 2018


As part of our SEO services at Go Fish Digital, we often help our clients collect and analyze their log files. Working with a variety of enterprise clients, we've learned that these log files can be cumbersome, siloed, and formatted in strange ways depending on the infrastructure of the client's site. Reverse proxies, internal redirects, and DDoS mitigation setups have all produced log files that are essentially unreadable, both by humans and by the automated tools we use to pull insights from them. However, with some crafty Python string formatting and the use of common log formatting standards, we've been able to recover valuable insights about search engine behavior and site structure across a variety of complex log structures.

For this post, I'll use the example of a client that used a database backend to store timestamped log fields. When it came time to perform log file analysis, this manifested as a multi-GB CSV file sitting in my inbox, definitely not something we could analyze easily with a tool like ScreamingFrog. So I needed to convert the export file to a proper log format, in this case the Combined Log Format, which I'll refer to as CLF in this post (note that CLF usually stands for Common Log Format, a different specification). The Combined Log Format is defined in Apache's mod_log_config as the following:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined

(You can see the exact definitions of the variables used here.)

The available fields in the log export database, and a row of example data, are as follows:

date          2018-05-03
time          01:01:05
method        GET
requestip     xx.xx.xx.xx
status        404
referrer      https://www.example.com/
resulttype    Miss
hostheader    www.example
useragent     Mozilla/5.0%2520(compatible;%2520Googlebot/2.1;%2520+http://www.google.com/bot.html)
uri           /clientURI
querystring   (empty in this example)
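
Mapping the example row above onto the combined format yields a line like the one below. A few caveats: the "+0000" timezone offset, the "-" byte count, and the "HTTP/1.1" protocol string are assumptions on my part, since the export includes none of those, and the user agent is shown after undoing the "%2520" double encoding discussed later in this post:

xx.xx.xx.xx - - [03/May/2018:01:01:05 +0000] "GET /clientURI HTTP/1.1" 404 - "https://www.example.com/" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"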


So it's clear that we have all the information we need to build the CLF log file; we just need to make some modifications to the data. To do this, I built a Python-driven transposition process that creates a 'log_file_object' for each row in the CSV, formats each field to meet the CLF standard, and dumps the resulting object to the final log file.
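
To give a sense of the approach, here is a minimal, self-contained sketch of that transposition loop. It is not the production script embedded below: the file names, the row_to_clf helper, the "+0000" timezone, and the "HTTP/1.1" protocol string are all illustrative assumptions.

import csv
from datetime import datetime

# Target shape: %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
CLF = '%s - - [%s] "%s %s HTTP/1.1" %s - "%s" "%s"'

def row_to_clf(row):
    # The export has no identd user, auth user, or response size,
    # so those positions get the CLF placeholder "-".
    ts = datetime.strptime(row["date"] + " " + row["time"],
                           "%Y-%m-%d %H:%M:%S").strftime("%d/%b/%Y:%H:%M:%S +0000")
    uri = row["uri"] + "?" + row["querystring"] if row["querystring"] else row["uri"]
    return CLF % (row["requestip"], ts, row["method"], uri,
                  row["status"], row["referrer"], row["useragent"])

with open("log_export.csv", newline="") as src, open("access.log", "w") as dest:
    for row in csv.DictReader(src):
        dest.write(row_to_clf(row) + "\n")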

Some notable changes that were required in this case, each sketched in code after the list:

  • The date field was in the wrong format: the CLF requires a timestamp in a specific format that includes the three-letter month abbreviation (e.g. 03/May/2018:01:01:05). To fix this, I had to take the ISO-formatted date, split it apart, and reassemble the date and time parts into a properly formatted CLF timestamp.
  • The source data broke URL query strings out into their own field, separate from the URL. The only way to allow these requests to be properly analyzed was to rebuild the URL with the query string reattached, seen on line 40 of the script.
  • Finally, the always-exciting data-quality demon reared its head, with “%2520” scattered randomly through some of the data. This was the result of double encoding of URIs by the client’s backend systems, and you can see my correction for that on line 60.
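
Here are hedged sketches of those three corrections. The function names are my own for illustration, not the names used in the embedded script, and the "+0000" offset is again an assumption since the export carries no timezone:

from datetime import datetime
from urllib.parse import unquote

def to_clf_timestamp(date_str, time_str, offset="+0000"):
    # "2018-05-03", "01:01:05" -> "03/May/2018:01:01:05 +0000"
    dt = datetime.strptime(date_str + " " + time_str, "%Y-%m-%d %H:%M:%S")
    return dt.strftime("%d/%b/%Y:%H:%M:%S ") + offset

def rebuild_url(uri, querystring):
    # Re-attach the query string that the export split into its own column,
    # so the full request can be analyzed as one URL.
    return uri + "?" + querystring if querystring else uri

def fix_double_encoding(value):
    # "%2520" is "%20" percent-encoded a second time ("%25" decodes to "%"),
    # so decoding twice recovers the original characters.
    return unquote(unquote(value))

print(to_clf_timestamp("2018-05-03", "01:01:05"))
# 03/May/2018:01:01:05 +0000
print(fix_double_encoding("Mozilla/5.0%2520(compatible;%2520Googlebot/2.1;)"))
# Mozilla/5.0 (compatible; Googlebot/2.1;)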

You can take a look at the full process file embedded below, and see my comments for details on how the process works. As always, please leave a comment if you have any questions!
