[retroactive] Partial API outage ~12:00 - 13:00 CET

Parts of Checkly were degraded or unavailable today from roughly 12:00 to 13:00 CET. The affected features were:

  • The web application
  • Public API

Other features (monitoring, alerting, deployments, public dashboards, triggers) were not affected.

Root cause

The cause of the outage was a limit on API calls to our authentication provider Auth0.com that we had set too low. Auth0 has a 20 requests per second limit on their public JWT endpoint (https://auth0.com/docs/policies/rate-limits).

The code on our side already caches and rate limits these calls so we don't hit the provider limit. Our own rate limit, however, was set too low for the growing Checkly user base.

Note: This issue was 100% a mistake on the Checkly side. Auth0 has no blame here.

Triage

Resolving the issue took a little bit longer than ideal. Our alerting was on-point and immediately showed the JWT authentication rate limit to be the issue.

JwksRateLimitError: Too many requests to the JWKS endpoint

However, it was unclear whether the limit was being hit on our side or on the provider's side. Figuring this out and patching the configuration setting therefore took longer than expected. It also did not help that our rate limit is set in requests per minute, while Auth0 states its limit in requests per second:

{ 
  cache: true,
  rateLimit: true,
  jwksRequestsPerMinute: 20
}
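
To make the unit mismatch concrete: the provider limit of 20 requests per second mentioned above equals 1,200 requests per minute, so a setting of 20 per minute is 60 times stricter. Below is a minimal sketch of a helper that derives the per-minute setting from the per-second limit. The function name and the headroom factor are illustrative assumptions, not the actual Checkly code or the value that was deployed.

```javascript
// Hypothetical helper: derive a per-minute client limit from a per-second
// provider limit. `headroom` is an assumed safety margin in (0, 1].
function perMinuteFromPerSecond(providerPerSecond, headroom = 0.5) {
  if (providerPerSecond <= 0 || headroom <= 0 || headroom > 1) {
    throw new RangeError('providerPerSecond must be > 0 and headroom in (0, 1]');
  }
  return Math.floor(providerPerSecond * 60 * headroom);
}

// Provider limit as stated above: 20 requests per second.
const providerLimitPerSecond = 20;

const jwksOptions = {
  cache: true,
  rateLimit: true,
  // Both numbers are now expressed in the same unit (requests per minute).
  jwksRequestsPerMinute: perMinuteFromPerSecond(providerLimitPerSecond),
};
```

Deriving the setting from the provider limit keeps the two values in one place, so a mismatch in units is easier to spot during review.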

Lessons learned

  • This was avoidable. Capacity planning with regard to Checkly's growth should have caught this.
  • Checkly's monitoring was not affected. This is good proof that the monitoring & alerting backend is robust and can withstand API outages.

API maintenance at 20:00 CET today

Today we have to do some maintenance on our API and the underlying database.

This means the Checkly web application will not be usable. We expect the maintenance to take less than 5 minutes, probably even less.

Your checks and associated alerting will keep running; only the web application at app.checklyhq.com will be briefly offline.

Alert channel updates & custom webhooks

We just updated our alert channels! You can now

  • Manage all alert channel preferences from one page
  • Create custom webhook payloads
  • Add multiple PagerDuty channels

Alert channels

Checks are now "subscribed" to a channel and you can tweak which check is subscribed to which channel on the alert settings screen.

Before this update, you could also create channels per check in addition to global channels. This is no longer possible.

Note: We have migrated your "per check channels". This might result in some duplicates if you had two or more checks, each reporting to the same email. Please have a quick look at your alert settings tab and remove any duplicate channels.

Custom webhooks

This was a long time coming and it's finally here. You now have full control over the payload of your webhook.

You can now:

  • Add tokens, keys and other secrets from your environment variables to the URL and payload.
  • Create completely custom webhook payloads using environment variables and event specific variables like the check status and name.

Have a look at our docs for a full overview of the available variables and examples of delivering webhooks to Jira and Trello.
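
As a rough illustration of how variable substitution in a webhook URL and payload can work, here is a minimal sketch. The `{{NAME}}` placeholder syntax, the variable names (`API_TOKEN`, `CHECK_NAME`, `CHECK_STATUS`) and the example URL are assumptions made for this example only; the docs list the variables and syntax that are actually supported.

```javascript
// Hypothetical template renderer: replaces {{NAME}} with values from `vars`.
// Unknown placeholders are left untouched so typos stay visible.
function renderTemplate(template, vars) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(vars, name) ? String(vars[name]) : match
  );
}

// Assumed variables: one secret from the environment, two event-specific values.
const webhookVars = {
  API_TOKEN: 'secret123',
  CHECK_NAME: 'Homepage',
  CHECK_STATUS: 'failing',
};

const webhookUrl = renderTemplate(
  'https://example.com/hooks?token={{API_TOKEN}}',
  webhookVars
);
const webhookBody = renderTemplate(
  '{"text": "{{CHECK_NAME}} is {{CHECK_STATUS}}"}',
  webhookVars
);
```

Note that this sketch does not escape values for JSON or URLs; a value containing quotes would produce an invalid payload.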

Bugfixes for GitHub deployments

We just rolled out a patch for two issues:

  1. Browser checks triggered by GitHub deployments would not function correctly if these browser checks used encrypted variables stored in the check. The issue was a bug in the code that stopped the environment variables from being decrypted correctly.

  2. The events overview showed duplicate deployments whenever one deployment was actually running multiple checks. We tweaked the query and it looks good now.

Happy Friday!

[Resolved] API Outage

We experienced an outage on our public API and the API behind our web app from ~12:15 to 12:45 UTC. Logging in, using the web application and using the public API were either very slow or resulted in 503 errors.

Note: alerting & notifications were not impacted.

Post mortem

The outage was due to the overlap of deleting an index and creating an index on one of the tables of our database.

This index was needed to speed up retrieval of time range queries. However, removing the older index impacted general performance far more than expected. Some queries were dominating the database resource pool. This caused our app to time out these queries after 10 seconds, with the outage as a result.

Lessons learned

This type of update is perfectly possible without any downtime. We just need to keep the old index, create a new one and only then delete the old one.
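
The ordering described above can be sketched as a small migration plan. This is an illustration only: it assumes PostgreSQL-style SQL (`CREATE INDEX CONCURRENTLY` / `DROP INDEX CONCURRENTLY`), and the table, column and index names are hypothetical, not the real schema.

```javascript
// Hypothetical planner: returns the SQL statements in the safe order,
// i.e. build the new index first and drop the old one only afterwards.
// Identifiers are interpolated as-is, so only pass trusted names.
function planIndexSwap({ table, oldIndex, newIndex, columns }) {
  return [
    // Step 1: create the replacement while the old index keeps serving queries.
    `CREATE INDEX CONCURRENTLY ${newIndex} ON ${table} (${columns.join(', ')})`,
    // Step 2: remove the old index once the new one exists.
    `DROP INDEX CONCURRENTLY ${oldIndex}`,
  ];
}

// Example with made-up names.
const indexSwapSteps = planIndexSwap({
  table: 'check_results',
  oldIndex: 'check_results_created_idx',
  newIndex: 'check_results_check_created_idx',
  columns: ['check_id', 'created_at'],
});
```

Each statement would be run separately and the second one only after the first has finished successfully, so queries always have an index to use.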

Bugfix JSON query assertion

We just patched a bug in the JSON path query handler for API check assertions.

If you had a nested query on a response body, e.g. $.user.username, and the actual response body did not include the top level structure (in this case user), the result would be undefined.

Depending on the other assertions, this could cause the assertion to evaluate to "passing", which in 99.999% of cases is not what you want.

This is now fixed. Missing top level structures will always evaluate to "failing".
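
The fixed behaviour can be illustrated with a minimal sketch. It is not the actual handler: it only supports simple dotted paths such as $.user.username, and the function names are made up for this example.

```javascript
// Resolve a simple dotted path such as "$.user.username".
// Reports found: false as soon as any segment is missing.
function resolvePath(body, path) {
  const segments = path.replace(/^\$\.?/, '').split('.').filter(Boolean);
  let current = body;
  for (const segment of segments) {
    if (current === null || typeof current !== 'object' || !(segment in current)) {
      return { found: false, value: undefined };
    }
    current = current[segment];
  }
  return { found: true, value: current };
}

// A missing structure always fails, regardless of the predicate.
function evaluateAssertion(body, path, predicate) {
  const { found, value } = resolvePath(body, path);
  return found && predicate(value) ? 'passing' : 'failing';
}

// A "not equals" style predicate would be true for undefined,
// which is the kind of accidental pass the fix prevents.
const notRoot = (value) => value !== 'root';
const missingResult = evaluateAssertion({}, '$.user.username', notRoot);
const presentResult = evaluateAssertion({ user: { username: 'alice' } }, '$.user.username', notRoot);
```

Here `missingResult` is "failing" because user is absent, while `presentResult` is "passing".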

30 seconds API checks & Slack message update

Two updates for you, not strictly related but both (probably) very welcome:

1) We are currently allowing API checks to run for 30 seconds instead of the normal 8 seconds. Extending the runtime for API checks is often requested by customers and we're looking at how this impacts our backend and cost structure. This is still an experiment. More news soon.

2) Slack messages for API checks now show the method and URL for the API request.

GitHub integration for deployments

With our new GitHub integration, you can run checks when your favourite hosting platform, like Zeit or Heroku, triggers a deployment in GitHub.

On each deploy, you get detailed feedback in GitHub if your commit or pull request broke some part of your API or web frontend.

This new feature is free for all plans. Head over to our documentation for more details.

P.S. Our existing command line triggers can now also record deployments, so you can keep track of how deployments influence your availability stats.

Check statistics & metrics got an update

We just released an update that adds some polish and much-wanted new options to the Check stats & metrics page.

  • You can now use the date search box to select any date range you wish. Date ranges starting more than 30 days ago have a resolution of 1 hour.

  • You can use the forward and backward buttons to hop an hour in either direction.

  • The graph now shows clear markers on each failed check. Clicking the marker takes you to the failed check's detail page.

  • All uptime metrics are now reported in Five Nines notation.

[resolved] Checkly API outage

We were experiencing some issues with our API due to a maintenance / upgrade process.

The Checkly web application and dashboards were showing errors and might not have been available.

This outage lasted from 13:08 to 13:11 CET.