[retroactive] Partial API outage ~12:00 - 13:00 CET

Parts of Checkly were degraded or unavailable today from roughly 12:00 to 13:00 CET. The affected features were:

  • The web application
  • Public API

Other features (monitoring, alerting, deployments, public dashboards, triggers) were not affected.

Root cause

The cause of the outage was a rate limit on calls to our authentication provider, Auth0.com, that was set too low. Auth0 has a limit of 20 requests per second on their public JWT endpoint (https://auth0.com/docs/policies/rate-limits).

Our own code, however, already caches and rate limits these requests so that we never hit the provider's limit. This limit on our side was set too low for the growing Checkly user base.
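To illustrate how a limit on the client side can reject requests long before the provider's limit is reached, here is a minimal sketch of a per-minute limiter. The class name FixedWindowLimiter, the fixed-window strategy, and the injectable clock are assumptions made for illustration; this is not the code Checkly runs.

```javascript
// Hypothetical fixed-window limiter, for illustration only.
class FixedWindowLimiter {
  constructor(limitPerMinute, now = () => Date.now()) {
    this.limitPerMinute = limitPerMinute;
    this.now = now;               // injectable clock, so the sketch is testable
    this.windowStart = now();
    this.count = 0;
  }

  // Returns true if a request may go out, false if this minute's budget is spent.
  tryAcquire() {
    const t = this.now();
    if (t - this.windowStart >= 60000) {
      // A new one-minute window has started: reset the budget.
      this.windowStart = t;
      this.count = 0;
    }
    if (this.count >= this.limitPerMinute) {
      return false;
    }
    this.count += 1;
    return true;
  }
}
```

With a budget that is too small for the real traffic, tryAcquire() starts returning false even though the provider would still happily serve the request.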

Note: This issue was 100% a mistake on the Checkly side. Auth0 has no blame here.

Triage

Resolving the issue took a little bit longer than ideal. Our alerting was on-point and immediately showed the JWT authentication rate limit to be the issue.

JwksRateLimitError: Too many requests to the JWKS endpoint

However, it was unclear whether the limit was being hit on our side or on the provider's side. Figuring this out and patching the configuration setting therefore took longer than expected. It also did not help that the rate limit on our side is set in requests per minute, while Auth0 documents its limit in requests per second, so the two values are easy to confuse:

{ 
  cache: true,
  rateLimit: true,
  jwksRequestsPerMinute: 20
}
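As a rough check of how far apart the two limits are, both can be converted to the same unit before comparing them. The sketch below only uses the two numbers already mentioned in this post; the variable names are hypothetical.

```javascript
// Provider limit, taken from the Auth0 rate-limit page linked above.
const providerLimitPerSecond = 20;

// The setting on our side as shown above, expressed per minute.
const options = { cache: true, rateLimit: true, jwksRequestsPerMinute: 20 };

// Convert the provider limit to requests per minute so the units match.
const providerLimitPerMinute = providerLimitPerSecond * 60;

// How many times lower the limit on our side is than the provider's limit.
const ratio = providerLimitPerMinute / options.jwksRequestsPerMinute;

console.log(
  `provider: ${providerLimitPerMinute}/min, ours: ${options.jwksRequestsPerMinute}/min, ratio: ${ratio}x`
);
```

Writing the unit into a variable name or a comment next to each limit makes this kind of mismatch easier to spot during triage.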

Lessons learned

  • This was avoidable. Capacity planning with regard to Checkly's growth should have caught this.
  • Checkly's monitoring was not affected. This is good proof that the monitoring & alerting backend is robust and can withstand API outages.