Cape Town, South Africa and Milan, Italy regions now available

We just added the Cape Town, South Africa 🇿🇦 and Milan, Italy regions. This is pretty epic, especially for the underserved African region. Both regions are enabled for all plans and available now.


[retro active] scheduling outage for browser checks

On Monday 18 May we had an outage in processing browser check results between 15:44 UTC and 20:38 UTC. This was caused by a bug in our release and deployment software. API checks were not impacted.

This outage had the following consequences:

  • No browser results were stored in our database from that period
  • You will not find browser check results in your dashboard from that period
  • No alerts were triggered for failing browser checks, as these rely on the results being processed.

We published a full post mortem on the outage detailing the root cause and, most importantly, our actions to prevent this in the future. In a nutshell:

  • Our own monitoring and alerting failed here, causing the outage to last much longer than needed.
  • The bug itself was minor and was quickly rectified.
  • We are putting three distinct measures in place to stop this from happening again.

On a more personal note: it is bitter that this outage was effectively caused by the engineering team's work on reliability and on better testing and release procedures. The code changes necessary for this sometimes have bugs, like all code.

Tim, CTO & co-founder

Find the detailed post mortem here:

[post mortem] scheduling outage us-west-1 region

Please find a full post mortem on the recent scheduling outage in the us-west-1 region, which caused checks to fail for ~30 minutes.

[retro active] scheduling outage 02:00 - 03:00 CET

We had a significant rise in scheduling errors, mostly in the us-west-1 region, between 02:00 and 03:00 CET. The largest peak was between 02:04 and 02:24.

This resulted in checks reporting errors with error messages like the snippet below. This incident was caused by an upstream provider and resolved itself.

503: null
    at Request.extractError (/app/api/node_modules/aws-sdk/lib/protocol/query.js:55:29)
    at Request.callListeners (/app/api/node_modules/aws-sdk/lib/sequential_executor.js:106:20)

We are preparing a post mortem that focuses on two topics:

  • How we can fail over to other regions more robustly. We already reschedule checks on initial failures, but this is not sufficient.
  • How we can be alerted sooner when similar issues arise.

Bug fix on checks scheduled each 12hr and 24hr

Yesterday we shipped a bug fix for the following issue:

Checks scheduled to run on a 12hr (720m) or 24hr (1440m) schedule were prone to either not running at all or running on an hourly basis. This behaviour was effectively random. This impacted a total of 8 checks and 4 customers in our system.

Checks scheduled to run anywhere from every 1 minute to every 60 minutes were not impacted.

This was a hard bug to track down and would not have been resolved if not for the kind reporting of one of our customers. Big 🖖 and 🙌

Bug fixes on groups and check triggers

We just shipped some bug fixes!

  1. Checks that were not enabled would still run when triggered in a group. This is now fixed. A disabled check will no longer run in the context of a group.

  2. You can now toggle the "double check" parameter on groups. Before, this setting was not saved in our backend.

  3. Triggering a group of checks using our command line trigger would fail when adding one of the deployment, sha and repository query parameters to the API call. This is now fixed.
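As a rough illustration of the third fix, the sketch below builds a trigger URL with the deployment, sha and repository query parameters appended. The base URL, GROUP_ID and TOKEN are hypothetical placeholders, and the "true"/"false" encoding of deployment is an assumption; take the real URL and accepted values from your own account and the documentation.

```python
from urllib.parse import urlencode


def build_trigger_url(base_url, deployment=None, sha=None, repository=None):
    """Append only the query parameters that were actually provided."""
    params = {}
    if deployment is not None:
        # Assumption: the flag is sent as a lowercase string.
        params["deployment"] = "true" if deployment else "false"
    if sha is not None:
        params["sha"] = sha
    if repository is not None:
        params["repository"] = repository
    if not params:
        return base_url
    # urlencode percent-encodes reserved characters such as "/".
    return f"{base_url}?{urlencode(params)}"


# Hypothetical placeholder URL, not a real endpoint.
url = build_trigger_url(
    "https://example.invalid/check-groups/GROUP_ID/trigger/TOKEN",
    deployment=True,
    sha="abc123",
    repository="acme/website",
)
print(url)
```

Building the query string with urlencode rather than by hand keeps values such as a repository name containing a slash safely escaped.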

Stay safe and happy Easter holidays. 🐰

Changelog: customize webhook method, headers and query params

We just released some tweaks to our already quite awesome webhook alert channels. You can now set the method, add headers and query params.
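To make the three settings concrete, here is a minimal sketch of the kind of HTTP request a webhook configured with a custom method, headers and query params would produce. It only builds the request with Python's standard library and does not send it. The URL, header names and payload fields are hypothetical and do not reflect the actual alert payload.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_webhook_request(url, method="POST", headers=None, query=None, payload=None):
    """Build (but do not send) a request with a custom method, headers and query params."""
    if query:
        url = f"{url}?{urlencode(query)}"
    body = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = Request(url, data=body, method=method)
    for name, value in (headers or {}).items():
        req.add_header(name, value)
    return req


# All values below are illustrative placeholders.
req = build_webhook_request(
    "https://example.invalid/hooks/alerts",
    method="PUT",
    headers={"Content-Type": "application/json", "X-Api-Key": "secret"},
    query={"source": "monitoring"},
    payload={"check": "homepage", "status": "failing"},
)
print(req.get_method(), req.full_url)
```

A receiving endpoint can then authenticate the call using the custom header and route it using the query params.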

Read the full changelog at: 👈

Changelog: Screenshots in GitHub PRs + design & stability updates

We just pushed a new feature and some bug fixes around our GitHub deployments integration.

  1. You can now add screenshots to the GitHub PR comment.
  2. The GitHub PR comment is now optimized to show the results of all checks in a group.

Read the full changelog here:

Changelog: Puppeteer 2.0 & Node.js 10

We just updated our Puppeteer check runners to Puppeteer 2.0 and Node.js 10! Check our blog post for some of the changes in this Puppeteer release.

New blog post: Using the Checkly Prometheus integration

Last week we published a brand new blog post on getting the most out of Checkly's Prometheus integration.

In this post, our friend John Arundel dives deep into Prometheus and Grafana and teaches you how to…

  • Slice & dice Checkly metrics
  • Alert on SLA performance
  • Set up tripwire dashboards

Find the full post at: