Home dashboard update

We did a full rewrite and a slight redesign of the home dashboard to make it snappier, more useful and easier to update with new features in the future.

Here's what changed:

  • You can now search! Click the search bar or hit / and start typing.
  • Click tags to filter by those tags.
  • Click the column names to sort.
  • We added the 95th percentile value to the metrics.
  • The three-dotted "more" menu on the right now allows running, editing, copying and deleting a check.
  • Aggregated metrics and check results are now lazy loaded, which makes the initial load far snappier.

Soft Limits for API checks + alert toggling

Soft limits

You can now set a soft limit on the response times for API checks. Using a handy slider, you can set when a check is "just" degraded and when a check should be considered failing due to extreme latency.

Learn more about soft limits in our docs

Note: You can still set a hard assertion on the response time if you want.
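The two slider positions can be read as a simple three-way classification. The sketch below is illustrative only: the function name, the parameter names, the example thresholds, and the choice to treat a response time exactly on a threshold as crossing it are all assumptions, not Checkly's actual implementation.

```javascript
// Hypothetical sketch: classify an API check by its response time.
// degradedAfterMs and failAfterMs stand in for the two slider values.
function classifyResponseTime(responseTimeMs, degradedAfterMs, failAfterMs) {
  if (degradedAfterMs > failAfterMs) {
    throw new RangeError('degradedAfterMs must not exceed failAfterMs');
  }
  if (responseTimeMs >= failAfterMs) return 'failing';
  if (responseTimeMs >= degradedAfterMs) return 'degraded';
  return 'passing';
}

console.log(classifyResponseTime(300, 500, 2000));  // passing
console.log(classifyResponseTime(800, 500, 2000));  // degraded
console.log(classifyResponseTime(2500, 500, 2000)); // failing
```

Checking the failing threshold first keeps the function correct even when both thresholds are set to the same value.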

Alert toggling

Checkly now also sends out degraded alerts through email, Slack etc.

However, we initially turned the new degraded alerts off when rolling out this feature, so that users whose checks might already be in the degraded zone do not receive extra alerts.

To enable degraded alerts, just head over to your alert channel settings and check the box! More info in the docs

[Retroactive] Partial API outage ~12:00 - 13:00 CET

Parts of Checkly had degraded or no availability today from roughly 12:00 to 13:00 CET. The affected features were:

  • The web application
  • Public API

Other features (monitoring, alerting, deployments, public dashboards, triggers) were not affected.

Root cause

The cause of the outage was a rate limit on API calls to our authentication provider, Auth0.com, that was set too low on our side. Auth0 has a limit of 20 requests per second on their public JWT endpoint (https://auth0.com/docs/policies/rate-limits).

Our code already caches and rate limits these calls so that we never hit the provider limit. However, our own limit was set too low for the growing Checkly user base.

Note: This issue was 100% a mistake on the Checkly side. Auth0 has no blame here.

Triage

Resolving the issue took a bit longer than ideal. Our alerting was on point and immediately showed the JWT authentication rate limit to be the issue.

JwksRateLimitError: Too many requests to the JWKS endpoint

However, it was unclear whether this limit was being hit on our side or on the provider's side. Figuring this out and patching the configuration setting therefore took longer than expected. It also did not help that our rate limit is set in requests per minute, while the provider documents its limit in requests per second:

{ 
  cache: true,
  rateLimit: true,
  jwksRequestsPerMinute: 20
}
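To make the unit mismatch concrete, here is a small sketch of the arithmetic. The helper name is made up for illustration, and the comparison is not a recommendation for what value to configure; it only puts the per-second figure quoted from the provider docs and the per-minute figure in the snippet above on the same scale.

```javascript
// Illustrative helper: convert a per-second rate to a per-minute rate.
function perSecondToPerMinute(requestsPerSecond) {
  return requestsPerSecond * 60;
}

const providerLimitPerSecond = 20; // limit quoted from the provider docs above
const configuredPerMinute = 20;    // jwksRequestsPerMinute in the snippet above

const providerLimitPerMinute = perSecondToPerMinute(providerLimitPerSecond);
console.log(providerLimitPerMinute);                       // 1200
console.log(providerLimitPerMinute / configuredPerMinute); // 60
```

The same number, 20, means something 60 times smaller once the unit is per minute, which is easy to overlook when comparing a setting against provider documentation.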

Lessons learned

  • This was avoidable. Capacity planning with regard to Checkly's growth should have caught this.
  • Checkly's monitoring was not affected. This is good proof that the monitoring & alerting backend is robust and can withstand API outages.

API maintenance at 20:00 CET today

Today we have to do some maintenance on our API and the underlying database.

This means the Checkly web application will not be usable. We expect the maintenance to take under 5 minutes, and probably less.

Your checks and associated alerting will keep running. Only the web application at app.checklyhq.com will be briefly offline.

Alert channel updates & custom webhooks

We just updated our alert channels! You can now:

  • Manage all alert channel preferences from one page
  • Create custom webhook payloads
  • Add multiple PagerDuty channels

Alert channels

Checks are now "subscribed" to a channel and you can tweak which check is subscribed to which channel on the alert settings screen.

Before this update, you could also create channels per check in addition to global channels. This is no longer possible.

Note: We have migrated your "per check channels". This might result in some duplicates if you had two or more checks each reporting to the same email address. Please have a quick look at your alert settings tab and remove any duplicate channels.

Custom webhooks

This was a long time coming and it's finally here. You now have full control over the payload of your webhook.

You can now:

  • Add tokens, keys and other secrets from your environment variables to the URL and payload.
  • Create completely custom webhook payloads using environment variables and event-specific variables like the check status and name.

Have a look at our docs for a full overview of the available variables and examples of delivering webhooks to Jira and Trello.
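As a rough illustration of how variables end up in a payload, here is a minimal sketch of placeholder substitution. The `{{NAME}}` placeholder syntax, the `renderTemplate` helper, and every variable name below are assumptions made for this example; the docs list the variables that are actually available.

```javascript
// Hypothetical sketch: fill {{VARIABLE}} placeholders in a webhook payload template.
// The placeholder syntax and all variable names are assumptions for illustration.
function renderTemplate(template, variables) {
  return template.replace(/\{\{\s*([A-Za-z0-9_]+)\s*\}\}/g, (match, name) => {
    // Leave unknown placeholders untouched so mistakes are easy to spot.
    return Object.prototype.hasOwnProperty.call(variables, name)
      ? String(variables[name])
      : match;
  });
}

const template =
  '{"summary": "{{CHECK_NAME}} is {{CHECK_STATUS}}", "token": "{{API_TOKEN}}"}';

const payload = renderTemplate(template, {
  CHECK_NAME: 'Homepage API',  // event-specific variable (made-up name)
  CHECK_STATUS: 'failing',     // event-specific variable (made-up name)
  API_TOKEN: 'secret-from-env' // environment variable (made-up name)
});

console.log(JSON.parse(payload).summary); // Homepage API is failing
```

This sketch does no JSON escaping, so a value containing a double quote would break the payload. A real template engine would need to handle that.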

Bugfixes for GitHub deployments

We just rolled out a patch for two issues:

  1. Browser checks triggered by GitHub deployments would not function correctly if they used encrypted variables stored in the check. A bug in our code stopped these environment variables from being decrypted correctly.

  2. The events overview showed duplicate deployments whenever a single deployment was running multiple checks. We tweaked the query and this is now fixed.

Happy Friday!

[Resolved] API Outage

We experienced an outage on our public API and the API backing our web app from ~12:15 UTC to 12:45 UTC. Logging in, using the web application and using the public API were either very slow or resulted in 503 errors.

Note: Alerting & notifications were not impacted.

Post mortem

The outage was caused by deleting an index and creating a new index on one of our database tables at the same time.

The new index was needed to speed up time range queries. However, removing the older index impacted general performance far more than expected. Some queries were dominating the database resource pool. This caused our app to time out these queries after 10 seconds, which resulted in the outage.

Lessons learned

This type of update is perfectly possible without any downtime. We just need to keep the old index, create the new one and only then delete the old one.
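The ordering can be sketched as a tiny migration plan. Everything here is an assumption for illustration: the SQL is PostgreSQL-style (the post does not name the database), and the table, index and column names are made up.

```javascript
// Hypothetical sketch: order the statements so the old index stays in place
// until the new one exists. PostgreSQL-style SQL is assumed; all names are made up.
function indexSwapPlan({ table, oldIndex, newIndex, columns }) {
  return [
    // Step 1: build the new index while the old one keeps serving queries.
    `CREATE INDEX CONCURRENTLY ${newIndex} ON ${table} (${columns.join(', ')});`,
    // Step 2: only after step 1 has finished, drop the old index.
    `DROP INDEX CONCURRENTLY ${oldIndex};`,
  ];
}

const plan = indexSwapPlan({
  table: 'check_results',
  oldIndex: 'idx_results_old',
  newIndex: 'idx_results_created_at',
  columns: ['check_id', 'created_at'],
});

plan.forEach((statement) => console.log(statement));
```

Running the two statements as separate steps, and confirming the first has completed before starting the second, is the part that avoids the overlap described in the post mortem.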

Bugfix JSON query assertion

We just patched a bug in the JSON path query handler for API check assertions.

If you had a nested query on a response body, e.g. $.user.username, and the actual response body did not include the top level structure (in this case user), the result would be undefined.

Depending on the other assertions, this could evaluate the assertion to "passing", which in 99.999% of cases is not what you want.

This is now fixed. Missing top level structures will always evaluate to "failing".
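A minimal sketch of the fixed behaviour, under stated assumptions: the `resolvePath` and `assertPath` helpers are hypothetical, only plain dotted paths are handled, and this is not the actual JSON path handler.

```javascript
// Hypothetical sketch: resolve a simple dotted path such as "$.user.username".
// Only plain property access is handled; this is not a full JSON path implementation.
function resolvePath(body, path) {
  const keys = path.replace(/^\$\.?/, '').split('.').filter(Boolean);
  let current = body;
  for (const key of keys) {
    if (current === null || typeof current !== 'object' || !(key in current)) {
      return { found: false, value: undefined };
    }
    current = current[key];
  }
  return { found: true, value: current };
}

// An assertion on a path that does not resolve fails, whatever the comparison is.
function assertPath(body, path, predicate) {
  const { found, value } = resolvePath(body, path);
  return found && predicate(value) ? 'passing' : 'failing';
}

const isEmpty = (v) => v === undefined || v === null || v === '';

console.log(assertPath({ user: { username: '' } }, '$.user.username', isEmpty)); // passing
console.log(assertPath({}, '$.user.username', isEmpty));                         // failing
```

Tracking `found` separately from `value` is what prevents an "is empty" style comparison from passing on a structure that is missing altogether.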

30 seconds API checks & Slack message update

Two updates for you, not strictly related but both (probably) very welcome:

1) We are currently allowing API checks to run for 30 seconds instead of the normal 8 seconds. Extending the runtime for API checks is often requested by customers and we're looking at how this impacts our back end and cost structure. This is still an experiment. More news soon.

2) Slack messages for API checks now show the method and URL for the API request.

GitHub integration for deployments

With our new GitHub integration, you can run checks when your favourite hosting platform, like Zeit or Heroku, triggers a deployment in GitHub.

On each deploy, you get detailed feedback in GitHub if your commit or pull request broke some part of your API or web frontend.

This new feature is free for all plans. Head over to our documentation for more details.

P.S. Our existing command line triggers can now also record deployments, so you can keep track of how deployments influence your availability stats. Read more here.