This post is only relevant to users of Codebase and Deploy. No other services were affected by these issues.
Today at 12.04pm, our monitoring system alerted us to an issue with one of our storage nodes. The issue initially presented itself as extremely high load on the node in question; however, we were unable to access the server to perform any checks. Following our internal procedures for such an event, we investigated possible causes of the high load and were unable to find any reason for it on our application servers.
At this point, we had to consider the possibility of a more serious error on the storage node in question. We despatched a team to the datacentre to investigate the host and discovered the server was displaying the same symptoms as those experienced around 200 days ago. This meant we needed to hard reboot the server. The reboot, including a disk check, took around 25 minutes to complete, and the service was back to normal at 1.15pm.

Unfortunately, there were a number of things we could have handled better, and we are already working to improve them:
The failure of a single storage node had too much impact on the Codebase application. In the event of a storage server failure, Codebase is designed to disable the repositories located on that server to avoid congestion on our frontend web processes. Unfortunately, this did not kick in as intended and our web processes quickly became saturated with requests for data stored on the unresponsive storage node. We have already implemented procedures to ensure that unaffected repositories remain accessible in the event of any future failures of this nature.
We were misled by our monitoring when we assumed high load on the server. The process of determining the root cause of the failure was slowed down by investigations surrounding the failed node. We have implemented internal procedures to ensure that issues such as these are detected faster.
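For readers curious about the first point, the general idea of disabling repositories on an unresponsive storage node can be sketched as a simple per-node circuit breaker. This is only an illustration in Python and not Codebase's actual implementation: the names NodeBreaker, fetch_repository and read, the node names, and the thresholds are all hypothetical.

```python
import time

# Message shown instead of repository data (wording is illustrative).
UNAVAILABLE = "Your repository is temporarily unavailable."


class NodeBreaker:
    """Hypothetical per-node circuit breaker: stop sending requests to a
    storage node after repeated failures so web processes are not tied up."""

    def __init__(self, failure_threshold=3, retry_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before disabling a node
        self.retry_after = retry_after              # seconds before probing it again
        self.clock = clock
        self.failures = {}   # node -> consecutive failure count
        self.opened_at = {}  # node -> time the node was disabled

    def is_available(self, node):
        opened = self.opened_at.get(node)
        if opened is None:
            return True
        # Once the cool-down has passed, let a request through as a probe.
        return self.clock() - opened >= self.retry_after

    def record_success(self, node):
        self.failures.pop(node, None)
        self.opened_at.pop(node, None)

    def record_failure(self, node):
        count = self.failures.get(node, 0) + 1
        self.failures[node] = count
        if count >= self.failure_threshold:
            self.opened_at[node] = self.clock()


def fetch_repository(breaker, node, repo, read):
    """Return repository data, or fail fast if the node is disabled.

    `read(node, repo)` is a caller-supplied function (an assumption here)
    that raises OSError or TimeoutError when the node does not respond.
    """
    if not breaker.is_available(node):
        return UNAVAILABLE  # no request is sent to the failed node
    try:
        data = read(node, repo)
    except OSError:  # TimeoutError is a subclass of OSError
        breaker.record_failure(node)
        return UNAVAILABLE
    breaker.record_success(node)
    return data
```

With something along these lines, requests for repositories on healthy nodes carry on as normal, while requests for the failed node return immediately rather than waiting on it.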
Emergency Maintenance. We will be undertaking some emergency maintenance tomorrow morning from 4.30am until 6.30am (GMT+1), during which repository access will be unavailable for around 1 hour for some users within Codebase. When repositories are unavailable, you will receive a message saying your repository is temporarily unavailable.
I’d like to take this opportunity to apologise for any inconvenience caused by these issues and assure all our customers & users that, as always, we’re working tirelessly to ensure the Codebase platform remains fast & stable.

