At 1AM on the morning of 12/13, we performed maintenance updates on the compute cluster running our client-facing applications on Google Cloud Platform. The update included both a security patch and an effort to future-proof the cluster. One of the major changes was enabling the cluster to automatically scale in response to increases in traffic, which lets us maintain a consistent level of performance under both light and heavy request loads.
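For readers curious what enabling autoscaling looks like in practice, here is a minimal sketch using the gcloud CLI for a GKE cluster. The cluster name, node pool, zone, and node bounds below are hypothetical placeholders, not our actual configuration:

```shell
# Enable the cluster autoscaler on an existing GKE node pool.
# "showcase-cluster", "default-pool", the zone, and the min/max
# node counts are illustrative values only.
gcloud container clusters update showcase-cluster \
  --zone us-central1-a \
  --node-pool default-pool \
  --enable-autoscaling \
  --min-nodes 3 \
  --max-nodes 10
```

With bounds like these, GKE adds nodes when pods can't be scheduled due to resource pressure and removes underutilized nodes when traffic drops, keeping capacity matched to load.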
The downtime window we communicated in advance was 1AM to 5AM. The infrastructure update was completed by 1:40AM. Two key issues were then identified, mostly revolving around database connectivity. Our databases are locked down: incoming connections must be explicitly allowed (above and beyond a username and password) or carried out over an internal private cloud network. A second issue was discovered in the part of our technology stack that deploys new code into production, i.e., makes it go live.
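As a rough illustration of what "explicitly allowed" means on GCP, a Cloud SQL instance can restrict inbound connections to an allowlist of networks. The instance name and CIDR range below are hypothetical examples, not our real settings:

```shell
# Allowlist a specific network range on a Cloud SQL instance.
# "showcase-db" and 203.0.113.0/24 are placeholder values;
# connections from any other external address are refused
# regardless of valid credentials.
gcloud sql instances patch showcase-db \
  --authorized-networks=203.0.113.0/24
```

When a cluster's nodes change (as they did during this maintenance), connections can originate from new addresses that are not yet on the allowlist, which is the class of problem described here.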
The issues surrounding database connectivity were resolved around 3:30AM. Issues with the deployment jobs were resolved around 9:30AM. During the window between our previously stated 5AM deadline and the 9:30AM all-clear, backend services were technically available and in no jeopardy, but to avoid expected errors we disabled external connectivity, which resulted in additional customer downtime.
The good news is that all of our frontline application cloud services are now fully set up to autoscale under heavy load, which should mean a better experience for all websites using Showcase.