[Resolved] API Outage

We experienced an outage of our public API and the API behind our web app from roughly 12:15 UTC to 12:45 UTC. Logging in, using the web application and using the public API were either very slow or resulted in 503 errors.

Note: alerting & notifications were not impacted

Post mortem

The outage was caused by deleting an index and creating a new one at the same time on one of our database tables.

The new index was needed to speed up time range queries. However, removing the older index hurt general performance far more than expected. Some queries dominated the database resource pool, which caused our app to time out these queries after 10 seconds and resulted in the outage.

Lessons learned

This type of update is perfectly possible without any downtime. We just need to keep the old index, create the new one, and only then delete the old one.
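
The ordering described above can be written down as a small migration plan. This is a minimal sketch, not the migration we actually ran: the table, column and index names are hypothetical, and the SQL assumes PostgreSQL-style syntax, which this post does not confirm.

```typescript
// One step of a schema migration: a human-readable note plus the SQL to run.
type MigrationStep = { description: string; sql: string };

// Builds the steps for replacing an index without a gap in index coverage.
// All names passed in are illustrative; the SQL syntax is an assumption.
function planIndexSwap(
  table: string,
  oldIndex: string,
  newIndex: string,
  columns: string[],
): MigrationStep[] {
  return [
    {
      description: "create the new index while the old one still serves queries",
      sql: `CREATE INDEX ${newIndex} ON ${table} (${columns.join(", ")})`,
    },
    {
      description: "drop the old index only after the new one exists",
      sql: `DROP INDEX ${oldIndex}`,
    },
  ];
}
```

The point of returning an ordered list is that the drop can never be scheduled before the create: a runner would execute the steps one by one and stop if the first step fails.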

Bugfix JSON query assertion

We just patched a bug in the JSON path query handler for API check assertions.

If you had a nested query on a response body, e.g. $.user.username, and the actual response body did not include the top level structure (in this case user), the result would be undefined.

Depending on the assertion, this could evaluate to "passing", which in 99.999% of cases is not what you want.

This is now fixed. Missing top level structures will always evaluate to "failing".
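
To illustrate the behavior, here is a minimal sketch of how a missing structure can be forced to fail. It is not our actual handler: the function names, the `MISSING` sentinel and the two comparison types are hypothetical, it only supports simple dotted paths, and it treats any missing path segment as failing, which goes slightly beyond the top level case described above.

```typescript
// Sentinel meaning "this path does not exist in the response body".
const MISSING = Symbol("missing");

type Comparison = "equals" | "notEquals";
type Outcome = "passing" | "failing";

// Walks a simple dotted path such as "$.user.username" through the body.
function resolvePath(body: unknown, path: string): unknown {
  const keys = path.replace(/^\$\.?/, "").split(".").filter((k) => k.length > 0);
  let current: unknown = body;
  for (const key of keys) {
    if (typeof current !== "object" || current === null || !(key in current)) {
      return MISSING; // a segment is absent, so stop instead of yielding undefined
    }
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// Evaluates one assertion; a missing path fails regardless of the comparison.
function evaluate(body: unknown, path: string, comparison: Comparison, target: unknown): Outcome {
  const actual = resolvePath(body, path);
  if (actual === MISSING) return "failing";
  const matches = actual === target;
  return (comparison === "equals" ? matches : !matches) ? "passing" : "failing";
}
```

With a naive lookup, a "not equals" assertion on `$.user.username` against an empty body would compare undefined to the target and pass. In this sketch the same call returns "failing" because the path check happens before any comparison.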

30 second API checks & Slack message update

Two updates for you, not strictly related but both (probably) very welcome:

1) We are currently allowing API checks to run for 30 seconds instead of the normal 8 seconds. Extending the runtime for API checks is often requested by customers and we're looking at how this impacts our backend and cost structure. This is still an experiment. More news soon.

2) Slack messages for API checks now show the method and URL for the API request.

Slack___web___checklyhq.png

GitHub integration for deployments

With our new GitHub integration, you can run checks when your favourite hosting platform, like Zeit or Heroku, triggers a deployment in GitHub.

image.png

On each deploy, you get detailed feedback in GitHub on whether your commit or pull request broke some part of your API or web frontend.

This new feature is free for all plans. Head over to our documentation for more details.

P.S. Our existing command line triggers can now also record deployments, so you can keep track of how deployments influence your availability stats. Read more here.

Check statistics & metrics got an update

We just released an update that adds some polish and much-wanted new options to the Check stats & metrics page.

check_stats.png

  • You can now use the date search box to select any date range you wish. Date ranges starting more than 30 days ago have a resolution of 1 hour.

  • You can use the forward and backward buttons to hop an hour in either direction.

  • The graph now shows clear markers on each failed check. Clicking the marker takes you to the failed check's detail page.

  • All uptime metrics are reported with Five Nines notation now.

[Resolved] Checkly API outage

We were experiencing some issues with our API due to a maintenance / upgrade process.

The Checkly web application and dashboards were showing errors and might not have been available.

This outage lasted from 13:08 to 13:11 CET.

New Alert History & navigation changes

You can now find all things related to day-to-day monitoring and check management in the new sidebar menu on the left.

Dashboard management has also moved from the homepage to its own dedicated menu item.

All things related to your account (billing, plans, teams) are still in the "old" menu at the top right.

Furthermore, we released the new Alerts History overview: a timeline with all your alerts across all checks.

Check___alerts.png

Bugfixes, tweaks and a new graph

Over the last two weeks we released some iterative updates:

  • We added a nice visualization of request timing to all API check results, inspired by the Chrome Developer Tools.

API_check___request.png

  • We pushed an important bugfix where timed out Browser checks would in some cases not report this timeout correctly.

  • Your in-app dashboard is now updated directly via websockets. This means new checks run instantly and their results are visible right away; the previous delay caused some confusion for new users. Performance for busy dashboards will also be better, as we now only update state when needed instead of in bulk for all checks. We will introduce websockets to the public Dashboards too.

  • Your dashboard's state is now bookmarkable and linkable, as we added the necessary filtering, tagging and pagination options to the query parameters.

  • TV-mode dashboards are now just called "Dashboards". Why complicate things?

Dashboard.png

Puppeteer Recorder now records screenshots

We just released v0.7.0 of Puppeteer Recorder, our handy Chrome extension that makes recording Puppeteer scripts a breeze.

This new version adds the option to take screenshots, either of the current page or of a clipped portion of the page.

Just right click or use the Cmd+Shift+A shortcut!

context_menu.png

Read more about Puppeteer Recorder and how to use it right here in our docs

Ongoing: Slack alert delivery errors

Due to a system-wide outage at Slack, some Slack alerts are not being delivered. We are seeing intermittent timeouts and errors on our backend.

Unfortunately, there is not much we can do. Please follow https://status.slack.com/ for any updates on this issue.