Yesterday, on April 2nd, 2018, we noticed that parts of our service were responding with a server error. This affected all pages that make a request to our database and lasted for 4 hours until our mitigation was in place. Here’s what happened and what we’re going to do to improve the situation.
First, we noticed that all requests going to our database server returned a connection error. Unfortunately, the systems we have in place to monitor such issues weren’t running as intended. Because of this, we only identified the issue at 10:22 UTC, when we discovered it manually. We immediately started to research the root cause, which was quickly identified as a major outage of the block storage at our hosting provider. Every attempt to bring the server back to life by power-cycling it was unsuccessful.
Finally, we decided to abandon and delete the old instance and instead create a new instance of our database server on a different cluster that was unaffected by the outage. We then restored the last successful backup of our database to it. At 14:09 UTC we could report that our service was running again as expected.
What we’ve learned
While we found a way to work around the long-lasting issue at our hosting provider, we’ve identified a couple of things that we can do to improve the responsiveness of our service further and take quicker action next time. First of all, we want to make sure that our alert manager works and can report errors to us when one of our services is down.
We also want issues at the service provider level that affect our systems to be reported to us more effectively, so that we can quickly check whether we are affected or not.
Finally, while we have automated a lot of our infrastructure and configuration, a few dependency-related parts aren’t automated yet. Solving these issues manually takes us longer than we want, so we will try to find a way to automate these parts as well, so that we can restore or create new machines in a fully automated way.
While we have a service that checks whether colloq.io is up or down, it only checks the homepage as seen by a logged-out visitor. Thanks to our caching mechanism, yesterday’s issue affected only user actions that actively query our database servers, which that check does not cover. To make sure we are notified about such issues, we need a system that checks this workflow regularly, and we’re going to investigate how we can do that in future.
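To illustrate the kind of check we have in mind, here is a minimal sketch in Python. It is not what we run today. It assumes there is a page or endpoint that always has to query the database; the `/health/db` path used below is hypothetical, and so are the names `CheckResult`, `fetch_status`, `check` and `run_checks`. The idea is simply to probe a database-backed URL alongside the cached homepage and to treat anything other than a 2xx response as a failure.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass
class CheckResult:
    url: str
    healthy: bool
    detail: str


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of a GET request to `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is what we want.
        return err.code


def check(url: str, fetch=fetch_status) -> CheckResult:
    """Probe one URL; it counts as healthy only on a 2xx response."""
    try:
        status = fetch(url)
    except OSError as err:
        # Covers connection errors and timeouts (URLError is an OSError).
        return CheckResult(url, False, f"request failed: {err}")
    return CheckResult(url, 200 <= status < 300, f"HTTP {status}")


def run_checks(urls, fetch=fetch_status):
    """Probe every URL and return only the failed results."""
    results = [check(url, fetch) for url in urls]
    return [result for result in results if not result.healthy]
```

Run on a schedule, `run_checks(["https://colloq.io/", "https://colloq.io/health/db"])` would return an empty list while everything is fine and one `CheckResult` per failing URL otherwise. The failing results could then be forwarded to whatever alerting channel is in use. The `fetch` parameter exists only so the logic can be exercised without real network requests.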
We’re very sorry for the long service outage and will improve our reliability as well as our workflows to handle similar issues even better.