Checkly did not run any scheduled API and/or browser checks between 18:44 and 19:42 CET.
Other features like the web application, adhoc checks and checks triggered by deployment or the API were not affected.
The exact cause of the outage is as of yet unknown. Two of our background daemons running on a 3rd party cloud infrastructure were effectively stopped for — as of yet — unknown reasons. We've escalated to the 3rd party.
Our monitoring alerted this outage almost exactly at the time the issue occurred. The reason it took almost an hour to resolve was due to the on call not having his laptop at hand.
Once logged in and online, the issue was found and resolved within 10 minutes. In this case a simple restarted brought everything back to normal.
- This was partly avoidable: on call should always have a laptop at hand or very close by.
- Until we have an analysis of our 3rd party provider we are not sure how we can prohibit this from happening in the future.
- Our monitoring was on point and alerted correctly and quickly.
This incident was of course very serious. I hope that by being transparent and honest about the ups& downs of the Checkly services we can continue to build trust and make Checkly better every day.