Timeouts in US-EAST-1 region for browser checks

We are experiencing timeouts in the us-east-1 region for browser checks. Extra capacity is being deployed and we will monitor the situation.

Update:

All systems are back to normal for region us-east-1.

Introducing global environment variables

You can now store API tokens, passwords, usernames or any other piece of configuration data you want to reuse across multiple checks in global environment variables. This was a long-requested feature and it is live right now.

Adding variables

Head over to your account settings and start adding variables.


Accessing variables

In API checks, access variables using the {{MY_VAR}} notation. In browser checks, use the process.env.MY_VAR notation.
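As a minimal sketch of the browser check side, the snippet below reads a variable from process.env and fails fast when it is missing. The helper name requireEnv and the variable name MY_VAR are illustrative assumptions, not part of Checkly's API.

```javascript
// Hypothetical helper (not a Checkly API): read a required variable
// and throw a clear error if it was never defined.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (value === undefined || value === '') {
    throw new Error(`Missing environment variable: ${name}`);
  }
  return value;
}

// In a browser check you would typically call it like this:
//   const token = requireEnv('MY_VAR');
// Here a plain object stands in for process.env so the sketch runs anywhere.
const token = requireEnv('MY_VAR', { MY_VAR: 'example-token' });
```

Failing early on a missing variable tends to give a clearer error than a login step that breaks later because an undefined value was typed into a form.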


Learn more

Retroactive: Timeouts in US-EAST-1 for Browser Checks

Over the last 30 minutes we noticed an increase in timeouts for browser checks running in the Northern Virginia (us-east-1) region.

We are actively monitoring the situation and deploying extra capacity if and where needed.

Improvements for API checks

We made some usability tweaks for API checks based on customer feedback.

Asserting JSON arrays

When an API endpoint returns an array of items, you might want to assert one of the items in that array. As of today you can pick the JSON Array source as an assertion source.

With this source you can:

  • Assert the length of an array.
  • Pick the first, last or an Nth, arbitrary item.
  • Assert the item directly as a value (greater than, equals, etc.).
  • Assert any nested JSON value in that item. This works the same as for normal JSON objects, e.g. customer.id selects the id property in a customer object.
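The options above can be sketched in plain JavaScript. This only illustrates the idea and is not Checkly's implementation: the helper names pickItem and getPath, the sample data, and the zero-based index for the Nth item are all assumptions.

```javascript
// Hypothetical helpers illustrating the JSON array assertion source.

// Pick the first, last, or Nth item (zero-based index assumed here).
function pickItem(array, position) {
  if (position === 'first') return array[0];
  if (position === 'last') return array[array.length - 1];
  return array[position];
}

// Resolve a dotted path such as "customer.id" inside a single item.
function getPath(item, path) {
  return path
    .split('.')
    .reduce((value, key) => (value == null ? undefined : value[key]), item);
}

// Sample response body: an array of orders.
const orders = [
  { customer: { id: 17 } },
  { customer: { id: 42 } },
];

const lengthOk = orders.length === 2;                                     // assert the array length
const lastCustomerId = getPath(pickItem(orders, 'last'), 'customer.id');  // nested value in the last item
```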

Learn more about asserting JSON arrays

Showing actual values in API check runner

When creating & editing your API checks, we now show the actual value next to the target value of an assertion if we can determine it. This makes debugging a ton quicker. We also show this value on all normally scheduled check results.


Delayed or non-functioning retries

Over the weekend of 13-14 Oct 2018, retries of failed API checks with the "double check" option enabled were delayed or did not work in all regions except eu-central-1.

The reason for this was a configuration change that caused the routing of double checks to fail or take much longer than necessary.

We did not notice this behavior directly, as the issue was region specific and only for failing checks (a small percentage of Checkly traffic).

To make sure this doesn't happen again, we have added monitoring for this specific case.

Title

Between 17:30 and 18:10 CET on 09 Oct 2018 the Checkly dashboard and API calls for triggers were slower than normal. One of our infrastructure suppliers, Heroku, reported routing latency in their EU zone.

This outage only affected usage of the Checkly web application, not the running of API or Browser checks as they run on separate cloud infrastructure.

Post mortem outage Thursday 04 October 15:00 CET and Friday 05 October

Summary

Browser checks were less available for 24hr due to a new release that was misconfigured. The issue wasn’t noticed due to oversights in our monitoring infrastructure. All systems are back online and we have added extra tests and monitoring to make sure this never happens again.

1. What happened?

Between Thursday 04 October 15:00 CET and Friday 05 October 18:30 CET, browser checks either did not run or did not report any results. No false alerts were triggered. Also, the ad hoc browser check runs triggered in the edit and create screens did not work.

2. Why did it happen?

TLDR: We forgot to tweak a configuration parameter.

On 04 October we released a new version of our browser check feature. Part of that release was a new way of handling browser checks on our back end. One change in this release was how browser checks report their data to the main application.

Browser checks are run in isolated containers for security reasons. They are launched by a launcher container which deals with all the scheduling and communication.

All our tests passed and we ran a shadow deployment for one week. This testing period however did not show that multiple concurrent runs could trigger a port allocation issue. Because each browser check run gets a dedicated control server listening on a port, these ports need to be unique per box. If not, an “address already in use” error is thrown. We wrongly configured our runners to use the same port for potentially five spawned runners. Spawning depends mostly on how busy a certain region is.
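To make the port clash concrete, here is a minimal sketch of the difference between giving every runner on a box the same port and giving each runner slot its own. The function names, the base port 9000 and the slot scheme are hypothetical and do not reflect the actual configuration.

```javascript
// Hypothetical sketch: one control-server port per runner slot on a box.

// Derive a distinct port for each runner slot from a base port.
function assignPorts(basePort, runnerCount) {
  return Array.from({ length: runnerCount }, (_, slot) => basePort + slot);
}

// True when at least two runners would try to listen on the same port.
function hasPortCollision(ports) {
  return new Set(ports).size !== ports.length;
}

// The misconfiguration described above: five runners sharing one port.
const misconfigured = Array(5).fill(9000);
// One possible fix: a unique port per slot.
const perSlot = assignPorts(9000, 5);

console.log(hasPortCollision(misconfigured)); // true
console.log(hasPortCollision(perSlot));       // false
```

Another common approach in Node.js is to listen on port 0, which asks the operating system for any free port, and then report the assigned port back to whichever process needs to connect.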

This would not have been an issue if our “restart on death” policy had worked correctly. But due to a completely unrelated code issue, crashing processes were not restarted by our nanny process.

This problem was noticed very late, only on October 05. The reason for this was that our external monitoring (completely outside the Checkly infrastructure) was not set up to alert on instances that had stopped reporting. A “dead” process would have triggered an alert, but a “hung” process did not.
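The difference between a “dead” and a “hung” process can be sketched as a heartbeat check: a runner whose process is still alive but has not reported for too long is treated as hung. The field names processAlive and lastReportAt and the five-minute threshold are assumptions made for illustration, not a description of the actual monitoring setup.

```javascript
// Hypothetical sketch: classify a runner from its process state and last report time.
function classifyRunner(runner, now, maxSilenceMs) {
  if (!runner.processAlive) return 'dead';                    // process exited
  if (now - runner.lastReportAt > maxSilenceMs) return 'hung'; // alive but silent
  return 'healthy';
}

const now = 1_000_000;           // current time in milliseconds (sample value)
const maxSilenceMs = 5 * 60_000; // assumed threshold: five minutes of silence

const runners = [
  { id: 'a', processAlive: true, lastReportAt: now - 10_000 },
  { id: 'b', processAlive: true, lastReportAt: now - 600_000 },
  { id: 'c', processAlive: false, lastReportAt: now - 10_000 },
];

// Alert on anything that is not healthy, so both dead and hung runners are caught.
const needsAlert = runners
  .filter((runner) => classifyRunner(runner, now, maxSilenceMs) !== 'healthy')
  .map((runner) => runner.id);
```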

3. What are we doing about it?

  • Configuration changes have been made so that processes never compete for the same port again.
  • Monitoring has been updated to alert on unresponsive browser check runners.
  • Unit and end-to-end tests have been updated to simulate concurrency and trigger these situations.

Introducing Browser Checks V2

As of today, all of Checkly's browser checks run on the second iteration of our site transaction monitoring system. This upgrade brings the following benefits:

  • Every valid Puppeteer script is now a valid check. If your Puppeteer script passes, your check passes.
  • Assertions are now optional. You can still use assertions (more on that below) but you don't need to. A failing script is enough to signal something is wrong and trigger an alert.
  • More and better logging. We now report debug and console logs directly to the user on each run. This makes debugging flaky checks a lot easier.
  • Use Chai.js assertions. When you do want to use assertions, you can now use all functions from the popular Chai.js library.

We are confident that this new iteration will make monitoring your vital site transactions a lot easier.

What are browser checks?

A browser check is a Node.js script that starts up a Chrome browser, loads a web page and interacts with that web page. The script validates assumptions you have about that web page, for instance:

  • Is my shopping cart visible?
  • Can users add products to the shopping cart?
  • Can users log in to my app?

Learn more about browser checks

We're starting a changelog

Big news today: we're starting a public changelog so you're always up to date with all the updates, improvements and fixes that are made in Checkly.

Even though we work on Checkly all the time, sometimes it may seem that not much is happening. This changelog is here to improve that very important part of the communication between you and us.

You'll always receive an update in the widget when we change something, and everything we've changed is available on our public changelog page.

UI and performance tweaks to Check statistics page

We made a couple of tweaks to the Check statistics page based on customer feedback and to address some performance issues.

  1. We now show the 24 hour, 7 day and 30 day success ratio KPIs to give a better short term / long term view on the health of your checks.

  2. We've added the 99th percentile response time to the per location response times chart to give you a better view on how response times are distributed for a specific data center location.

Under the hood, we tweaked query performance and aggregation performance, which resulted in a 70% performance gain.
