Webhook, public API and Basic Auth input

We just shipped three bug fixes for the following issues:

  1. Our webhook alerter was not sending the correct ALERT_SSL value in the ALERT_TYPE variable when an SSL expiry alert was triggered. This is now fixed and a test has been added to our code base (see the illustrative payload after this list).

  2. The public API was not paginating the /v1/checks endpoint correctly: not enough results were returned. We patched that one too, and we now monitor this with Checkly!

  3. Lastly, when configuring Basic Auth credentials in the UI, the credentials were not stored correctly due to a UI bug. That is working again.
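
For illustration, here is roughly the shape of an SSL expiry alert as it might arrive at your webhook. This is a hedged sketch: the field names other than ALERT_TYPE and the value ALERT_SSL are assumptions for the example, not the documented payload schema.

// Illustrative payload only; field names besides ALERT_TYPE are made up for this example.
const examplePayload = {
  CHECK_NAME: 'example.com homepage',
  ALERT_TYPE: 'ALERT_SSL' // now correctly set for SSL certificate expiry alerts
}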

Thanks to all who reported these bugs!

Changelog: Alerting, webhooks and Prometheus

We just shipped some updates:

  • We now alert on failing setup scripts that prevent a check from running.
  • We expose the "recovery from degraded" alert type on our webhooks.
  • We introduced a dedicated "degraded status" gauge in our Prometheus integration.

Read all about the details in this blog post

Fixes & small tweaks for GitHub deployment triggers

We just fixed some bugs and usability/reliability issues with GitHub deployment triggers.

  1. Environment URLs containing dashes ("-") were sometimes not parsed and replaced correctly in API check URLs.

  2. Our queue worker behaved unreliably during maintenance. We moved our queue to a more reliable solution.

  3. We fixed some layout issues in the GitHub "check" markdown you see in your PRs and commits. We now also report the environment URL used in the GitHub check.

Home dashboard update

We did a full rewrite and a slight redesign of the home dashboard to make it snappier, more useful and easier to update with new features in the future.

Here's what changed:

  • You can now search! Click the search bar or hit / and start typing.
  • Click tags to filter by those tags.
  • Click the column names to sort.
  • We added the 95th percentile value to the metrics.
  • The three-dot "more" menu on the right now allows running, editing, copying and deleting a check.
  • Aggregated metrics and check results are lazy-loaded, making the initial load far snappier.

Soft Limits for API checks + alert toggling

Soft limits

You can now set a soft limit on the response times for API checks. Using a handy slider, you can set when a check is "just" degraded and when a check should be considered failing due to extreme latency.
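
To make this concrete, here is a minimal sketch of the classification idea. The function name, threshold values and status labels below are illustrative assumptions, not Checkly's actual implementation.

// Minimal sketch: classify a response time against a "degraded" and a "failed" limit.
// Thresholds and status labels are made up for illustration.
function classifyResponseTime (responseTimeMs, degradedAfterMs = 3000, failedAfterMs = 10000) {
  if (responseTimeMs >= failedAfterMs) return 'FAILED'     // extreme latency counts as failing
  if (responseTimeMs >= degradedAfterMs) return 'DEGRADED' // slow, but not failing yet
  return 'PASSED'
}

classifyResponseTime(4200) // => 'DEGRADED'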

Learn more about soft limits in our docs

Note: You can still set a hard assertion on the response time if you want.

Alert toggling

Checkly now also sends out degraded alerts through email, Slack, etc.

However, we initially turned the new degraded alerts off when rolling out this feature, so as not to trigger any extra alerts for users whose checks might be in the degraded zone.

To enable degraded alerts, just head over to your alert channel settings and check the box! More info in the docs

[retroactive] Partial API outage ~12:00 - 13:00 CET

Parts of Checkly were degraded or unavailable today from roughly 12:00 to 13:00 CET. The affected features were:

  • The web application
  • Public API

Other features (monitoring, alerting, deployments, public dashboards, triggers) were not affected.

Root cause

The cause of the outage was a too-low limit on API calls to our authentication provider, Auth0. Auth0 has a 20 requests per second limit on its public JWKS endpoint (https://auth0.com/docs/policies/rate-limits).

The code on our side already caches and rate limits these calls so we don't hit the provider's limit. However, our own limit was set too low for the growing Checkly user base.

Note: This issue was 100% a mistake on the Checkly side. Auth0 has no blame here.

Triage

Resolving the issue took a little longer than ideal. Our alerting was on point and immediately showed the JWKS rate limit to be the issue:

JwksRateLimitError: Too many requests to the JWKS endpoint

However, it was unclear whether this was on our side or on the provider's side. Figuring this out and patching the configuration setting therefore took longer than expected. The fact that our rate limit is configured in requests per minute, while Auth0's limit is stated in requests per second, also did not help:

{ 
  cache: true,
  rateLimit: true,
  jwksRequestsPerMinute: 20
}
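
For context, this is the kind of configuration a JWKS client such as the jwks-rsa library accepts. The snippet below is a hedged sketch of that pattern, not Checkly's actual code; the tenant URL and the raised per-minute budget are made up for illustration.

const jwksClient = require('jwks-rsa')

// Sketch only: cache JWKS responses and rate limit our own calls to the provider,
// with a per-minute budget sized for real traffic (the original value of 20 was too low).
const client = jwksClient({
  cache: true,
  rateLimit: true,
  jwksRequestsPerMinute: 60, // illustrative value; the fix was raising this limit
  jwksUri: 'https://YOUR_TENANT.auth0.com/.well-known/jwks.json'
})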

Lessons learned

  • This was avoidable. Capacity planning with regard to Checkly's growth should have caught this.
  • Checkly's monitoring was not affected. This is good proof that the monitoring & alerting backend is robust and can withstand API outages.

API maintenance at 20:00 CET today

Today we have to do some maintenance on our API and the underlying database.

This means the Checkly web application will not be usable. We expect the maintenance to be done in under 5 minutes, probably even quicker.

Your checks and associated alerting will keep running; only the web application at app.checklyhq.com will be briefly offline.

Alert channel updates & custom webhooks

We just updated our alert channels! You can now:

  • Manage all alert channel preferences from one page
  • Create custom webhook payloads
  • Add multiple PagerDuty channels

Alert channels

Checks are now "subscribed" to a channel and you can tweak which check is subscribed to which channel on the alert settings screen.

Before this update, you could also create channels per check in addition to global channels. This is no longer possible.

Note: We have migrated your "per check" channels. This might result in some duplicates if you had two or more checks each reporting to the same email address. Please have a quick look at your alert settings tab and remove any duplicate channels.

Custom webhooks

This was a long time coming and it's finally here. You now have full control over the payload of your webhook.

You can now:

  • Add tokens, keys and other secrets from your environment variables to the URL and payload.
  • Create completely custom webhook payloads using environment variables and event-specific variables like the check status and name.

Have a look at our docs for a full overview of the available variables and examples of delivering webhooks to Jira and Trello.
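
As a purely illustrative sketch, a custom payload could look something like the snippet below. The {{...}} placeholder syntax and the variable names are assumptions for this example; the docs list the variables that are actually supported.

// Illustrative custom webhook body; placeholder syntax and variable names are assumptions.
const customPayload = {
  summary: 'Check {{CHECK_NAME}} triggered a {{ALERT_TYPE}} alert', // event-specific variables
  checkId: '{{CHECK_ID}}',
  apiToken: '{{MY_API_TOKEN}}' // a secret pulled from your environment variables
}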

Bugfixes for GitHub deployments

We just rolled out a patch for two issues:

  1. Browser checks triggered by GitHub deployments would not function correctly if they used encrypted variables stored in the check. The issue was a bug in the code that stopped these environment variables from being decrypted correctly.

  2. The events overview showed duplicate deployments whenever one deployment was actually running multiple checks. We tweaked the query; it looks good now.

Happy Friday!

[Resolved] API Outage

We experienced an outage on our public API and our web app-facing API from ~12:15 UTC to 12:45 UTC. Logging in, using the web application and using the public API were either very slow or resulted in 503 errors.

Note: alerting & notifications were not impacted.

Post mortem

The outage was caused by overlapping index operations on one of our database tables: we deleted an old index while a new index was being created.

The new index was needed to speed up time range queries. However, removing the older index impacted general performance far beyond what we expected. Some queries started dominating the database resource pool, which caused our app to time out these queries after 10 seconds, with the outage as a result.

Lessons learned

This type of update is perfectly possible without any downtime. We just need to keep the old index in place, create the new one, and only then delete the old one.
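
As a hedged sketch of that approach, assuming PostgreSQL and a knex-style query runner (the table and index names below are made up for illustration):

const db = require('knex')({ client: 'pg', connection: process.env.DATABASE_URL })

async function swapIndex () {
  // 1. Build the new index first, without blocking reads or writes.
  await db.raw('CREATE INDEX CONCURRENTLY check_results_created_at_idx ON check_results (created_at)')
  // 2. Only once the new index is in place, drop the old one.
  await db.raw('DROP INDEX check_results_old_idx')
}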