Opsgenie integration

We just added Opsgenie to our alerting integrations! πŸŽ‰

Now Checkly can create and resolve alerts in your Opsgenie team and integrate into your on-call workflows.

opsgenie.png

We are excited for you to start using this integration today. In the meantime, Checkly is working together with Opsgenie to expand it, so you can expect additional features soon.

Opsgenie is available on all plans above the Developer plan. Learn how to integrate Checkly with your Opsgenie team in our docs πŸ‘ˆ
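Checkly handles the alert lifecycle for you once the integration is set up. Purely as an illustration of what that lifecycle looks like against Opsgenie's public Alert API, not our actual implementation, creating and later closing an alert goes roughly like this (the API key is a placeholder):

```ts
// Rough sketch of the Opsgenie Alert API calls such an integration relies on.
// The API key is a placeholder for a key from an Opsgenie API integration.
const OPSGENIE_API_KEY = "<your-opsgenie-api-key>";

// Create an alert when a check starts failing.
async function createAlert(checkName: string, message: string): Promise<void> {
  await fetch("https://api.opsgenie.com/v2/alerts", {
    method: "POST",
    headers: {
      Authorization: `GenieKey ${OPSGENIE_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      message: `${checkName}: ${message}`,
      alias: checkName, // the alias lets us close the same alert later
      priority: "P2",
    }),
  });
}

// Close (resolve) the alert again when the check recovers.
async function closeAlert(checkName: string): Promise<void> {
  const id = encodeURIComponent(checkName);
  await fetch(`https://api.opsgenie.com/v2/alerts/${id}/close?identifierType=alias`, {
    method: "POST",
    headers: {
      Authorization: `GenieKey ${OPSGENIE_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ note: "Check recovered" }),
  });
}
```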

[retroactive] short API outage on Friday

On Friday, from 2020-06-12 10:38 UTC to 2020-06-12 10:44 UTC, the Checkly API was down for 6 minutes. Functionality like adding, editing and removing checks was affected, and since our Dashboard uses the same API, the Dashboard was affected too. None of the monitoring services were impacted.

The issue was caused by our API instance going into a crash loop when an exception occurred due to an error in our error handler. Ironic, right?
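To illustrate the failure mode, and nothing more, here is a minimal sketch (not our actual API code) of how a bug inside an error handler can turn a single exception into a process exit and, under an automatic restart policy, a crash loop:

```ts
// Illustration only, not Checkly's code: a bug inside an error handler
// turning one exception into a process exit.

process.on("uncaughtException", (err: Error) => {
  // Intentional bug: `context` does not exist on the error, so reading
  // `.requestId` on `undefined` throws *inside* the error handler itself.
  console.error("unhandled error", (err as any).context.requestId);
});

// Any exception that reaches the handler now kills the process: Node will not
// re-invoke the handler for an exception thrown from within it. Under an
// automatic restart policy, the instance restarts, hits the same exception
// on the next request, and loops.
setTimeout(() => {
  throw new Error("exception from request handling");
}, 100);
```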

Bonus fact: we were alerted to this outage by Checkly itself. This wasn't the first time this has happened, but it never gets old for us here at Checkly.

Zrzut ekranu 2020-06-12 o 13.04.55.jpg

Time in the screenshot is in CET.

Strict null assertions in API checks

We just added two new assertion types that you can use with JSON body API checks.

API_check___assertions.png

  1. Is null asserts that the value is strictly equal to null
  2. Not null asserts that the value is strictly not equal to null

You can use the new assertions combined with JSON path expressions to add better validation to your API check response.
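As a quick sketch of the semantics (the getByPath helper below is a simplified stand-in for a real JSON path evaluator), the strict null assertions pass or fail like this:

```ts
// Sketch of the strict null semantics. getByPath is a simplified stand-in for
// a JSON path expression such as $.data.user.deletedAt.
const body = { data: { user: { deletedAt: null, name: "" } } };

function getByPath(obj: unknown, path: string[]): unknown {
  return path.reduce((acc: any, key) => (acc == null ? undefined : acc[key]), obj);
}

const deletedAt = getByPath(body, ["data", "user", "deletedAt"]);
const name = getByPath(body, ["data", "user", "name"]);

// "Is null" passes only for a strict null, not for "", 0, false or a missing key.
console.log(deletedAt === null); // true  -> "Is null" passes
console.log(name === null);      // false -> "Is null" fails for the empty string

// "Not null" is the inverse: strictly not equal to null.
console.log(name !== null);      // true  -> "Not null" passes
```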

Cape Town, South Africa and Milan, Italy regions now available

We just added the Cape Town, South Africa πŸ‡ΏπŸ‡¦ and Milan, Italy regions. This is pretty epic, especially for the underserved African region. They are enabled for all plans and available now.

regions.png

[retroactive] scheduling outage for browser checks

On Monday 18 May we had an outage in processing browser check results between 15:44 UTC and 20:38 UTC. This was caused by a bug in our release and deployment software. API checks were not impacted.

This outage had the following consequences:

  • No browser check results were stored in our database during that period
  • You will not find browser check results in your dashboard for that period
  • No alerts were triggered for failing browser checks, as these rely on the results being processed.

We published a full post mortem on the outage detailing the root cause and most importantly our actions to prevent this in the future. In a nutshell:

  • Our own monitoring and alerting failed here, causing the outage to last much longer than needed.
  • The bug itself was minor and easily and quickly rectified.
  • We are putting three distinct measures in place to stop this from happening again.

On a more personal note: it is bitter that this outage was effectively caused by the engineering team working on reliability and better testing and release procedures. The code changes necessary for this work sometimes have bugs, like all code.

Tim, CTO & co-founder

Find the detailed post mortem here: https://blog.checklyhq.com/post-mortem-outage-browser-check-results-alerting/

[post mortem] scheduling outage us-west-1 region

Please find a full post mortem on the recent scheduling outage in the us-west-1 region, which caused failing checks for ~30 minutes.

https://blog.checklyhq.com/post-mortem-failing-checks-us-west-1/

[retroactive] scheduling outage 02:00 - 03:00 CET

We had a significant rise in scheduling errors, mostly in the us-west-1 region, between 02:00 and 03:00 CET. The largest peak was between 02:04 and 02:24.

This resulted in checks reporting errors with error messages like the snippet below. This incident was caused by an upstream provider and resolved itself.

503: null
    at Request.extractError (/app/api/node_modules/aws-sdk/lib/protocol/query.js:55:29)
    at Request.callListeners (/app/api/node_modules/aws-sdk/lib/sequential_executor.js:106:20)

We are preparing a post mortem that focuses on two topics:

  • How we can fail over to other regions more robustly. We already reschedule after initial failures, but this is not sufficient (see the sketch below).
  • How we can be alerted sooner when similar issues arise.
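To make the first point a bit more concrete, here is a rough sketch of the failover pattern we mean; scheduleCheckRun and the region list are hypothetical names, not our scheduler's actual code:

```ts
// Illustration of region failover for a scheduling call, not our scheduler's
// actual code. scheduleCheckRun and the region list are hypothetical.
declare function scheduleCheckRun(checkId: string, region: string): Promise<void>;

async function scheduleWithFailover(
  checkId: string,
  regions: string[] = ["us-west-1", "us-east-1", "eu-west-1"]
): Promise<void> {
  let lastError: unknown;
  for (const region of regions) {
    try {
      await scheduleCheckRun(checkId, region);
      return; // scheduled successfully, no failover needed
    } catch (err: any) {
      lastError = err;
      // Only fail over on provider-side 5xx errors (like the 503 above);
      // rethrow anything else.
      if (!err?.statusCode || err.statusCode < 500) throw err;
      console.warn(`scheduling in ${region} failed (${err.statusCode}), trying next region`);
    }
  }
  throw lastError; // every region failed; surface the last provider error
}
```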

Bug fix for checks scheduled every 12hr or 24hr

Yesterday we shipped a bug fix for the following issue:

Checks scheduled to run on a 12 hour (720m) or 24 hour (1440m) schedule were prone to either not being run at all or being run on an hourly basis. This behaviour was effectively random. It impacted a total of 8 checks and 4 customers in our system.

Checks scheduled to run anywhere from every 1 minute to every 60 minutes were not impacted.

This was a hard bug to track down and would not have been resolved if not for the kind report from one of our customers. Big πŸ–– and πŸ™Œ

Bug fixes on groups and check triggers

We just shipped some bug fixes!

  1. Checks that were not enabled would still run when triggered as part of a group. This is now fixed: a disabled check will no longer run in the context of a group.

  2. You can now toggle the "double check" parameter on groups. Before, this setting was not saved in our backend.

  3. Triggering a group of checks using our command line trigger would fail when adding any of the deployment, sha or repository query parameters to the API call. This is now fixed (see the sketch below).
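For reference, calling a group trigger from a deployment script with those query parameters looks roughly like this; the trigger URL below is a placeholder, the real one comes from the group's trigger settings:

```ts
// The trigger URL is a placeholder; copy the real one from the check group's
// trigger settings in the Checkly UI.
const triggerUrl = new URL("https://api.checklyhq.com/check-groups/GROUP_ID/trigger/TOKEN");

// Optional metadata about the deployment that caused this run.
triggerUrl.searchParams.set("deployment", "production");
triggerUrl.searchParams.set("sha", "9fceb02");
triggerUrl.searchParams.set("repository", "acme/webshop");

async function triggerGroup(): Promise<void> {
  const response = await fetch(triggerUrl.toString());
  console.log(response.status, await response.text());
}

triggerGroup().catch(console.error);
```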

Stay safe and happy Easter holidays. 🐰

Changelog: customize webhook method, headers and query params

We just released some tweaks to our already quite awesome webhook alert channels. You can now set the HTTP method and add custom headers and query params.
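To make that concrete, here is a sketch of the kind of request a customized webhook can now send; the endpoint, header and body values are illustrative placeholders, not a real integration:

```ts
// Illustrative values only: the endpoint, header and body are placeholders for
// whatever your receiving system expects.
const url = new URL("https://example.com/api/incidents");
url.searchParams.set("source", "checkly");   // custom query parameter
url.searchParams.set("env", "production");   // custom query parameter

async function sendCustomWebhook(): Promise<void> {
  const response = await fetch(url.toString(), {
    method: "PUT", // custom method, e.g. PUT instead of POST
    headers: {
      "Content-Type": "application/json",
      "X-Api-Key": "<your-token>", // custom header, e.g. for authentication
    },
    body: JSON.stringify({ checkName: "API latency check", status: "failing" }),
  });
  console.log(response.status);
}

sendCustomWebhook().catch(console.error);
```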

Read the full changelog at: https://blog.checklyhq.com/changelog-customize-your-webhook-with-method-etc/ πŸ‘ˆ