We experienced an outage on our public API and the API that serves our web app from approximately 12:15 UTC to 12:45 UTC. Logging in, using the web application, and using the public API were either very slow or resulted in 503 errors.
Note: alerting and notifications were not impacted.
The outage was caused by deleting an index and creating a new index on one of our database tables at the same time.
The new index was needed to speed up time range queries. However, removing the older index hurt general performance far more than expected. Some queries were dominating the database resource pool. This caused our app to time out these queries after 10 seconds, which resulted in the outage.
This type of update is perfectly possible without any downtime. We just need to keep the old index, create the new one, and only then delete the old one.
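The safe ordering can be sketched as follows. This is only an illustration, not our actual migration: it uses an in-memory SQLite database so it can run anywhere, and the table name `events`, the index names, and the `swap_index` helper are all hypothetical. A production database will have its own syntax and options for building an index without blocking traffic.

```python
import sqlite3


def swap_index(conn, table, old_index, new_index, new_columns):
    """Replace old_index with new_index, creating the new one first.

    Hypothetical helper for illustration; names are not from our codebase.
    """
    cols = ", ".join(new_columns)

    # Step 1: build the new index while the old one still serves queries.
    conn.execute(f"CREATE INDEX IF NOT EXISTS {new_index} ON {table} ({cols})")

    # Step 2: confirm the new index exists before touching the old one.
    # PRAGMA index_list returns one row per index; column 1 is the name.
    existing = {row[1] for row in conn.execute(f"PRAGMA index_list({table})")}
    if new_index not in existing:
        raise RuntimeError(f"{new_index} was not created; keeping {old_index}")

    # Step 3: only now drop the old index.
    conn.execute(f"DROP INDEX IF EXISTS {old_index}")
    conn.commit()


# Demo setup with assumed table and index names.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "id INTEGER PRIMARY KEY, account_id INTEGER, created_at TEXT)"
)
conn.execute("CREATE INDEX idx_events_old ON events (account_id)")

swap_index(
    conn,
    table="events",
    old_index="idx_events_old",
    new_index="idx_events_time",
    new_columns=["account_id", "created_at"],
)

index_names = sorted(row[1] for row in conn.execute("PRAGMA index_list(events)"))
print(index_names)
```

The check in step 2 is the important part: if creating the new index fails or is interrupted, the old index stays in place and queries keep their existing performance.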