Quick comparison: Plausible vs logs

Posted on 2020-07-13 by

About a month ago, I started collecting website usage data using both Plausible.io and logs generated by Caddyserver, my reverse proxy. The goal was to compare the data sources, just like Marko Saric did in a post on the Plausible blog.

Here's a quick overview of the results. For more details, read the post mentioned above: the results are nearly identical, and Marko does a great job of explaining them.

Results

Quantitative data

The table below summarizes key metrics computed by both Plausible and GoAccess (based on Caddyserver logs). The data was collected between June 13th and July 13th, 2020.

Metric      Plausible.io   Logs + GoAccess   Δ factor
Visitors    32.1k          76.9k             x2.4
Pageviews   44.5k          468.6k            x10.5
Bandwidth   -              16.6 GiB          -

Just as Marko noticed, the logs show much higher numbers of visitors and pageviews, likely because crawlers and bots appear in the logs but do not run JavaScript and are therefore not picked up by Plausible.
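
The Δ factor column is simply the log-based count divided by the Plausible count. A minimal sketch of that arithmetic, using the rounded figures from the table (the `ratio` helper is a hypothetical name, not part of any tool used here):

```shell
#!/usr/bin/env bash
# ratio LOGS PLAUSIBLE: how many times larger the log-based count is.
# Hypothetical helper for illustration; inputs are the rounded table values.
ratio() {
  awk -v logs="$1" -v plausible="$2" 'BEGIN { printf "x%.1f\n", logs / plausible }'
}

ratio 76.9 32.1    # visitors, in thousands
ratio 468.6 44.5   # pageviews, in thousands
```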

I could compare other metrics like referrers and top pages, but again, I suggest you read the post on the Plausible blog.

I'd like to add that the logs also provide information about bandwidth usage and which files are downloaded the most. This allows you to make informed decisions when optimizing caching and file loading. Plausible does not collect this data; you need logs for it.
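
As a rough illustration of the kind of question logs can answer, the sketch below sums the bytes served per requested path. It is not what produced the numbers above (those come from GoAccess). It assumes each log line is a JSON object with `.request.uri` and `.size` fields, which may differ between Caddy versions, and `top_bandwidth` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sum response bytes per requested path, largest first.
# Assumption: every line on stdin is a JSON object with .request.uri and .size.
top_bandwidth() {
  jq --null-input --raw-output '
    # Accumulate bytes per URI without loading the whole log into memory
    reduce inputs as $entry ({}; .[$entry.request.uri] += $entry.size)
    | to_entries
    | sort_by(-.value)
    | .[]
    | "\(.value)\t\(.key)"'
}

# Usage (log path as in the Caddyfile snippet in the Methodology section):
#   zcat -f /var/log/caddy/access* | top_bandwidth | head
```

The output is one line per path, with the byte count and the path separated by a tab, so it can be piped into `head` or `column` as needed.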

Qualitative data

The experience with Plausible was more convenient than with GoAccess, as the website of the former loads in seconds whilst the latter took 3 minutes to process the logs and generate the results.

Conclusion

Both methods have advantages and disadvantages. Plausible gives fast and precise results but potentially impacts page load (although minimally). Server logs don't impact page load and can provide bandwidth stats, but they inflate visitor and pageview numbers because of traffic noise from search engines, crawlers and bots. Personally, I will continue using both for the foreseeable future.

Methodology

Plausible

Visit the Plausible.io website and simply look at the website's stats.

Caddy logs

Logs were collected using the following snippet in the Caddyfile:

log {
    output file /var/log/caddy/access.log {
        roll_size 100MiB
        roll_keep 10
        roll_keep_for 2160h
    }
}

GoAccess

As GoAccess cannot read Caddy logs directly, a small bash script is needed:

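#!/usr/bin/env bash
# Only keep entries from the last 30 days: compute the cutoff as a Unix timestamp.
today_date=$(date -u +"%Y-%m-%d")
start_date=$(date -u --date="$today_date -30 day" +"%Y-%m-%d")
start_ts=$(date -d "$start_date" +%s)

# Convert the JSON log lines to CSV with jq and feed them to GoAccess.
goaccess <(zcat -f logs/access* | jq --raw-output --argjson start_ts "$start_ts" '
   # Strip the trailing ":port" so the address can be compared to bare IPs
   .request.remote_addr |= sub(":[0-9]+$"; "") |
   # Skip requests from specific addresses (placeholder IPs)
   select(.request.remote_addr != "1.1.1.1") |
   select(.request.remote_addr != "2.2.2.2") |
   select(.ts >= $start_ts) |
   [
      .common_log,
      .request.headers.Referer[0] // "-",
      .request.headers."User-Agent"[0],
      .duration
   ] | @csv') \
   --log-format='"%h - - [%d:%t %^] ""%m %r %H"" %s %b","%R","%u",%T' \
   --time-format='%H:%M:%S' \
   --date-format='%d/%b/%Y'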

This was adapted from the bash script described by Alessandro in this blog post.
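
To see why the `--log-format` string is written with doubled quotes, it can help to run the array-to-CSV part of the jq filter on a single entry. The sketch below does that with a made-up log line: the IP, user agent and sizes are hypothetical and trimmed to the fields the filter reads, and `caddy_to_csv` is an illustrative helper name.

```shell
#!/usr/bin/env bash
# Convert JSON access-log lines (stdin) into the CSV lines that GoAccess parses.
# Same field selection as the script above, without the IP and date filters.
caddy_to_csv() {
  jq --raw-output '[
      .common_log,
      .request.headers.Referer[0] // "-",
      .request.headers."User-Agent"[0],
      .duration
    ] | @csv'
}

# Hypothetical log entry (not taken from real logs); it has no Referer header.
sample='{"ts":1594598400,"duration":0.25,"common_log":"203.0.113.7 - - [13/Jul/2020:00:00:00 +0000] \"GET / HTTP/2.0\" 200 2326","request":{"remote_addr":"203.0.113.7:51234","headers":{"User-Agent":["curl/7.68.0"]}}}'

printf '%s\n' "$sample" | caddy_to_csv
```

Because `@csv` wraps each string in double quotes and doubles any quote inside it, the request line comes out as `""GET / HTTP/2.0""`, which is what the `""%m %r %H""` part of the format string matches. The missing Referer becomes `"-"`.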

Webmentions

@markosaric @yarmo Similarly, I see 2–3× more visitors reported by Cloudflare than Plausible and just have to assume they're bots. Additionally, the data transfer aligns more with the number Plausible reports based on the size of image-heavy pages.

Commented by Jeremiah Lee on 2020-07-14 at 17:55:51 UTC