During a routine deployment of our portal webapp to release our new UI, we encountered an error that made our API unavailable for one minute. We pride ourselves on running scalable and redundant systems, so any downtime on our service is fully investigated to ensure we can avoid similar problems in the future.
Timeline
API degraded performance: 2019-10-08 09:58:54 UTC
API unavailable: 2019-10-08 10:22:01 UTC
API degraded performance: 2019-10-08 10:23:01 UTC
Full resolution time: 2019-10-08 12:11:07 UTC
Root Cause, Resolution and Recovery
We run multiple servers behind a load-balancing proxy so that we can scale efficiently and meet the demands of our users. When upgrading parts of our API or webapp, we run rolling updates across our infrastructure to maintain reliability and uptime. When we release breaking changes (such as our new UI redesign), this method of updating each server in turn does not work so well: the load balancer could send a request to a server that may or may not have the latest changes, causing inconsistent responses to users' requests.
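To make that failure mode concrete, here is a minimal sketch (server names and version labels are hypothetical) of a round-robin pool mid-rollout, where consecutive requests from the same user land on different versions of the UI:

```python
from itertools import cycle

# Mid-rollout state: one server upgraded, one still on the old release.
servers = [
    {"name": "web-1", "ui_version": "v2"},  # already upgraded
    {"name": "web-2", "ui_version": "v1"},  # not yet upgraded
]

round_robin = cycle(servers)

def handle_request() -> str:
    """Forward a request to the next server in the pool, round-robin style."""
    server = next(round_robin)
    return f"{server['name']} answered with UI {server['ui_version']}"

# Successive requests flip between the old and new UI.
for _ in range(4):
    print(handle_request())
```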
We currently run two servers, and each server runs an instance of both the API and the webapp; in theory this means we can lose any single server and still continue to serve both parts of our application. During yesterday's deployment of our new UI, we upgraded one of the instances and, once that was complete, shut down the Apache service on the other instance at 09:58:54 UTC so that requests would only be forwarded to our newly upgraded server. This is where we first see the intentional degraded performance on the API.
A server that is not responding to our load balancer should automatically be taken out of the pool, and no requests should be passed to it. However, in yesterday's outage this did not happen. We have now identified the root cause: our health check tested that the underlying API application was running on the server, which it was, but because we had stopped the Apache service the load balancer was unable to actually pass queries through to it.
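The flaw is the level at which the check operates. A minimal sketch of the kind of check we were effectively relying on (the application port is hypothetical) shows why it kept passing:

```python
import socket

def process_level_check(host: str, app_port: int = 8080) -> bool:
    """Pass if the API process accepts TCP connections on its own port.

    This stays green even when Apache, the service the load balancer
    actually talks to, is stopped, which is exactly what happened here.
    """
    try:
        with socket.create_connection((host, app_port), timeout=2):
            return True
    except OSError:
        return False
```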
Once we noticed that we were getting bad responses from one of our API instances, the decision was made to pull that instance from our pool of servers completely so that no requests would be forwarded to it at all. At 10:22:01 UTC the server was removed from our pool, and at this point the API became completely unavailable. We quickly realised that an incorrectly noted-down IP address was the issue: through simple human error, the wrong server had been removed from the application pool. The good server was quickly added back in and the bad server removed; at 10:23:01 UTC the single good server began handling all API requests and we returned to our intended degraded performance.
After the initial complete outage was resolved, we upgraded the webapp on our secondary server and re-enabled the Apache service. We then carried out a full root cause analysis to understand why requests were still being sent to the API on this server when we believed they should not have been. Once we were happy that we fully understood what went wrong, we returned this server to our pool at 12:11:07 UTC and normal service was restored.
Corrective and Preventative Measures
Our first step is to immediately correct the health checks used by our load balancer to determine whether an instance is healthy: the check must exercise the same request path the load balancer itself uses, through Apache, rather than only confirming that the API process is running. This would have automatically taken the bad server out of our pool and removed any chance of human error in pulling the server manually.
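A sketch of the corrected check (the /healthz endpoint is a hypothetical example) issues a real HTTP request on the public port, so a stopped Apache service now fails the check and the server is pulled from the pool automatically:

```python
from urllib.request import urlopen
from urllib.error import URLError

def end_to_end_check(host: str, port: int = 80, path: str = "/healthz") -> bool:
    """Pass only if Apache accepts the request and the API answers 200."""
    try:
        with urlopen(f"http://{host}:{port}{path}", timeout=2) as response:
            return response.status == 200
    except (URLError, OSError):
        return False
```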
We also currently run a continuous deployment pipeline for our test environments, and it is clear at this point that we need to push forward and put a full CD pipeline in place for our production environments. This will remove human involvement from the deployment process and, with it, the chance of human error.
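As a sketch of the core step such a pipeline would automate (every helper below is a hypothetical stand-in for a real load balancer API or installer call), the rollout upgrades one server at a time and only returns a server to the pool once it passes the corrected health check:

```python
def remove_from_pool(server: str) -> None:
    print(f"draining {server}")                 # stand-in for a real LB API call

def add_to_pool(server: str) -> None:
    print(f"re-enabling {server}")              # stand-in for a real LB API call

def upgrade(server: str, version: str) -> None:
    print(f"upgrading {server} to {version}")   # stand-in for the installer

def passes_health_check(server: str) -> bool:
    return True                                 # stand-in for the end-to-end check

def rolling_deploy(pool: list[str], version: str) -> None:
    """Upgrade one server at a time; abort the rollout on a failed check."""
    for server in pool:
        remove_from_pool(server)                # drain: the LB stops routing here
        upgrade(server, version)
        if not passes_health_check(server):
            # Leave the bad server out of the pool and stop: the remaining
            # servers keep serving the previous, known-good version.
            raise RuntimeError(f"{server} failed its post-upgrade check")
        add_to_pool(server)                     # only healthy servers rejoin

rolling_deploy(["web-1", "web-2"], "v2")
```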