Emergency Maintenance

Update: Maintenance was completed successfully at 6:13am.

Following this morning’s problems we’ll be performing some emergency maintenance tomorrow (11th May 2012) to ensure the same problem doesn’t affect services again.

At some point between 4.30am and 6.30am (GMT+1), repository access will be unavailable for around 1 hour for some users within Codebase. While repositories are unavailable, you will receive a message saying your repository is temporarily unavailable. We will also be performing some maintenance on our database servers, which may result in some servers being unavailable for a few moments.

(200) Days of Ubuntu Uptime

Over the last 48 hours, we’ve had a rather unfortunate series of events which has led to some service unavailability. This post outlines what happened, why it happened and what we have done to ensure that issues like this are mitigated in the future.

What happened?

On Monday evening at 9pm, our primary database server, which serves most of our applications, went offline and became completely unresponsive over the network. Fortunately, we run our database servers in a redundant (active-passive) pair with live replication of data between them. When the primary server went offline, the secondary one was ready and waiting to serve database requests and, usually, this process would have been seamless to our users and they wouldn’t have been any the wiser. However, due to a configuration issue in a couple of our applications, they were not able to automatically fail over to the other server in the database cluster. Our team immediately corrected this issue and service was fully restored. Once everything was back online, we despatched one of our team to the data centre to investigate what had happened to the primary server and we discovered what initially appeared to be a hardware failure with the machine’s memory.

At 10am the next morning, we started to investigate the issues with this machine and discovered that it appeared to have died as a result of a bug in the Linux kernel which affects machines which have been online for long periods of time (around 200 days). More information about this can be found on the Ubuntu Launchpad site. Unfortunately for us, quite a bit of our infrastructure was brought online at the same time, which meant other machines could suffer the same issue, so we needed to act fast to ensure that the kernels on all our servers were upgraded as soon as possible. We immediately upgraded the offending database server and brought it back online at about 12pm. It started to replicate the data from the “live” server; however, we opted to leave it as a hot spare rather than immediately making it take over as the “live” server again.
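As a rough illustration of how at-risk machines might be flagged, here is a minimal sketch that compares each server's uptime against a threshold. The hostnames, the uptime figures and the 180-day warning threshold are all made up for the example; the only figure taken from our investigation is that the bug shows up after roughly 200 days online. This is not the tooling we actually used.

```python
from datetime import timedelta

# Assumption: warn some way before the ~200-day mark described above.
WARN_AFTER = timedelta(days=180)


def read_local_uptime(path="/proc/uptime"):
    """Return this machine's uptime as a timedelta (Linux only)."""
    with open(path) as f:
        # The first field of /proc/uptime is the uptime in seconds.
        seconds = float(f.read().split()[0])
    return timedelta(seconds=seconds)


def at_risk(uptimes, threshold=WARN_AFTER):
    """Return hostnames whose uptime meets or exceeds the threshold, longest first."""
    flagged = [(host, up) for host, up in uptimes.items() if up >= threshold]
    flagged.sort(key=lambda item: item[1], reverse=True)
    return [host for host, _ in flagged]


# Hypothetical inventory; the hostnames and figures are invented.
uptimes = {
    "db01": timedelta(days=208),
    "lb01": timedelta(days=195),
    "web01": timedelta(days=12),
}
print(at_risk(uptimes))  # ['db01', 'lb01']
```

In practice the uptime figures would be collected from each machine (for example with something like `read_local_uptime`) rather than typed in by hand.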

At this point, we were aware that a number of our key servers (including our load balancers, storage servers and virtual machine hosts) were at risk from this bug and immediately started planning some emergency maintenance to ensure they were all upgraded with the minimum downtime necessary.

Before we were able to publish any information about this, at about 3pm the same day, one of our virtual machine hosts (vmhost02) went offline in a similar (although not identical) manner to that shown by the database server the night before. As this was during our office hours, we sent a couple of our guys down to the data centre to investigate why it was unresponsive and, after some diagnostics, the machine was rebooted and its virtual machines began to come back online.

As soon as our guys walked back into the office, our monitoring system started to report issues with another virtual machine host (vmhost03), suggesting that it had met with the same fate as its friend. We went back down to the data centre, performed the same diagnostics and rebooted this machine which, in turn, brought its virtual machines back online.

As the issue seemed to be spreading rapidly, we were getting worried about how long we had on all our other machines before this issue affected them too. Our load balancers are set up in the same way as our database servers (with two servers in an active-passive configuration) so we immediately upgraded the kernel on our backup load balancer and rebooted it. We then moved all our incoming traffic over to the newly upgraded machine and performed the same upgrade and reboot on the other machine in the pair. This process was seamless and there was no downtime involved.

Finally, we were left with our storage servers which also needed upgrading as a matter of urgency. Due to the amount of data involved, our storage server redundancy isn’t as quick to bring online as the load balancers or databases. We therefore scheduled an emergency maintenance window for 5am on Wednesday morning, during which we would take down the servers in turn and perform the necessary upgrades. These upgrades were completed with less than 10 minutes of unavailability at one of our quietest periods. We did have some issues with some customers being unable to access their SVN repositories on one of our storage servers, but these were resolved quickly once our monitoring system brought them to our attention.

Why did everything go wrong & what have we done to prevent it happening again?

The scenario which occurred here was one we had thought about when designing the infrastructure; however, we always assumed the issue would be hardware-related and would only affect one machine at a time.

The problems with the database and the subsequent unavailability at 9pm on Monday evening were caused by a configuration issue on a per-application basis. Our redundancy procedures successfully transferred the traffic over to the stand-by host; however, some of our applications were not configured to connect to the virtual IP address which can move between the hosts. We have updated our internal policies to ensure that all new applications are configured to use the virtual IP address of the database server. We have already audited our current applications to ensure they are all using the appropriate settings in order to mitigate any further issues with the database cluster.
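To make the idea of that audit concrete, here is a minimal sketch of the kind of check involved: compare the database host each application is configured with against the cluster's virtual IP address. The IP addresses, the application names and the way the settings are gathered are all hypothetical; this is not our actual audit tooling.

```python
# Assumption: the cluster's floating (virtual) address; this value is made up.
DATABASE_VIP = "10.0.0.10"


def misconfigured_apps(app_db_hosts, vip=DATABASE_VIP):
    """Return the names of applications whose database host is not the virtual IP."""
    return sorted(name for name, host in app_db_hosts.items() if host != vip)


# Hypothetical settings, as they might be collected from each application's config.
app_db_hosts = {
    "app-a": "10.0.0.10",  # uses the virtual IP, so it follows a fail-over
    "app-b": "10.0.0.11",  # pinned to one physical server, so it breaks on fail-over
    "app-c": "10.0.0.10",
}
print(misconfigured_apps(app_db_hosts))  # ['app-b']
```

An application pinned to a physical server's own address keeps trying to reach that server after it has gone offline, which is the failure we saw on Monday evening.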

The issues with the virtual machine hosts should not have been service affecting as all our key applications have at least 2 (in some cases, more) virtual machines spread across multiple physical machines so the failure of one machine should not impact the service. However, because we lost two machines in quick succession, the redundancy of some applications (in this case, our aTech Identity service) was compromised. As aTech Identity is required to authenticate users across most of our applications, we saw some issues for users when trying to authenticate. As aTech Identity is such a key requirement for all our applications and can have a serious knock-on effect when it suffers an outage, we are now ensuring that the service which hosts aTech Identity will be supported by at least three virtual machines in the future. We are hoping to have the extra redundancy in place by the end of the week.
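The reasoning here can be sketched in a few lines: a service stays up as long as at least one physical host carrying one of its virtual machines is still running, so the number of simultaneous host failures it tolerates is one less than the number of distinct hosts it is spread across. The virtual machine names, the exact placements and the host vmhost04 below are assumptions made purely for illustration.

```python
def tolerated_host_failures(placement):
    """Number of simultaneous host failures a service can survive.

    `placement` maps each virtual machine name to the physical host it runs on.
    """
    if not placement:
        raise ValueError("service has no virtual machines")
    return len(set(placement.values())) - 1


# Assumed placement before the change: two VMs on two hosts.
before = {"identity-1": "vmhost02", "identity-2": "vmhost03"}

# Assumed placement after the change: three VMs on three hosts.
after = {"identity-1": "vmhost02", "identity-2": "vmhost03", "identity-3": "vmhost04"}

print(tolerated_host_failures(before))  # 1
print(tolerated_host_failures(after))   # 2
```

Note that it is the number of distinct hosts that counts in this model: three virtual machines squeezed onto two hosts would still only tolerate a single host failure.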

To conclude

We are very aware of how important our uptime is to our customers and whenever we have issues like this we hate it just as much as you do. We work very hard to ensure our service is as reliable as possible and will continue to improve this whenever we can. We can only apologise for the issues which occurred in the last few days but hopefully you can see that we are serious & passionate about ensuring we have a fully redundant infrastructure in order to provide you with a fantastic service.

As always, if you have any questions or comments, please don’t hesitate to drop us an e-mail at support@atechmedia.com.

After all this, I think everyone now needs some cute baby pandas:

Scheduled Maintenance – 25th October

Update (05:10 UTC): Everything is now back up and running as normal. If you have any specific issues, please contact us.

Update (05:02 UTC): All our network maintenance is now complete. We are currently investigating an issue with inbound & outbound e-mail in Deliver (and subsequently, Codebase & Deploy).

Update (04:16 UTC): The key and most service affecting part of our maintenance is now complete with less than 1 minute of downtime. We are currently investigating an issue with connecting to some parts of our network (including access to the Point web interface and DNS5).

Update (04:01 UTC): We are about to begin the first stage of our network maintenance which should see a very small period of unavailability.

As a result of our growing user base and increasing network traffic, we have decided to make some upgrades to our network infrastructure. We have scheduled a maintenance window for this between 4am & 5am UTC on Tuesday 25th October. During this time we will be shutting down our primary router, simultaneously bringing a new device online. At this point there will be an expected downtime of approximately 5 minutes while connections are renegotiated. Following this, we will be bringing the secondary redundant device online during which no additional downtime is expected. The service should be considered at risk during this 60 minute period.

This maintenance window will not affect Point nameservers 1 through 4 which will remain available throughout the maintenance period allowing users to resolve names as normal.

For the more technical readers interested in knowing exactly what we’re planning, read on… Our current network is controlled by a redundant pair of Juniper SRX240 firewalls with statically routed connectivity to our ISP. This provides a good level of resilience; however, we have made the decision to move our connectivity to BGP. This offers a number of advantages:

  • Redundant IPv6 connectivity
  • Ability to add additional resilience and capacity to our network in the future in addition to the 2 independent connections we currently maintain with C4L
  • Additional control over connectivity allowing quicker resolution of any network issues

We will update everybody when the upgrade is complete.

Planned Maintenance – 9th February 2011

Update: This maintenance was completed successfully although it did take slightly longer than expected to get all services back online due to a couple of technical issues (including a disk array configuration issue and a failed kernel upgrade). Photos from the maintenance can be found on flickr.

Over the last 12 months we have grown rapidly and our hosting needs have increased as we’ve enjoyed incredible growth in both our traffic levels & our customer base. We’ve quickly run out of space for servers in our current location which means we need to move things around in order to keep all our equipment together. This relocation is also the first step in our preparations for releasing Codebase v4.

Therefore, at 6am GMT on 9th February 2011 we will be re-locating our servers to a new data centre, operated by Melbourne, on the same campus as our current hosting. In addition to the physical relocation of the servers, we will also be installing a new firewall to provide greater reliability for accessing our applications.

We estimate the move to take about an hour during which time the services listed below will be unavailable.

  • Codebase (all HTTP and SSH access will be unavailable)
  • Deploy
  • Deliver (any incoming email received while we’re offline will be held by the delivering server and re-attempted when we’re back online)
  • Appli (all applications running when the server is shut down will be restarted and checked)
  • Point’s Web Interface (you will be unable to make changes to your DNS records)
  • Point Nameserver #5 (all other nameservers will remain online and sites will remain accessible as always)
  • Support ticketing system
  • Our office phone system

During the downtime our website & internal e-mail will remain online. During the maintenance window we will be updating Twitter with the latest information, so keep an eye on our feed there.

We have tried hard to find the most suitable time but with users all over the world it is impossible to choose a perfect time for everyone. We apologise for any inconvenience this maintenance may cause you and, as always, if you have any questions don’t hesitate to get in touch.

VAT Increases on 4th January 2011

From 4th January 2011, the UK VAT rate will rise from 17.5% to 20%. If you currently pay VAT on your invoices from us, this will cause the VAT element of your invoice to increase. To avoid the VAT increase, you can pre-pay your subscription before midnight on the 4th January by logging into your Codebase or Deploy account and selecting Make a Payment, which allows you to choose to pay for 3 months, 6 months or 12 months in advance at the current VAT rate. Alternatively, if you are in the European Community (excl. UK) and you are VAT registered, you can apply for VAT-free billing from the My Account pages within Codebase or Deploy.
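As a worked example of what the change means for an invoice, the sketch below applies both rates to the same net amount. The £100.00 figure is purely hypothetical and is not one of our plan prices; only the two VAT rates come from the announcement above.

```python
from decimal import Decimal, ROUND_HALF_UP

OLD_RATE = Decimal("0.175")  # UK VAT rate before 4th January 2011
NEW_RATE = Decimal("0.20")   # UK VAT rate from 4th January 2011


def gross(net, rate):
    """Return the VAT-inclusive amount, rounded to the nearest penny."""
    return (net * (1 + rate)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Hypothetical net invoice amount in pounds.
net = Decimal("100.00")

print(gross(net, OLD_RATE))                         # 117.50
print(gross(net, NEW_RATE))                         # 120.00
print(gross(net, NEW_RATE) - gross(net, OLD_RATE))  # 2.50
```

In other words, every £100 of net charges costs £2.50 more once the new rate applies.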