Emergency Maintenance

Update: Maintenance was completed successfully at 6:13am.

Following this morning’s problems we’ll be performing some emergency maintenance tomorrow (11th May 2012) to ensure the same problem doesn’t affect services again.

At some point between 4.30am and 6.30am (GMT+1), repository access will be unavailable for around 1 hour for some Codebase users. While repositories are unavailable, you will receive a message saying your repository is temporarily unavailable. We will also be performing some maintenance on our database servers, which may result in some servers being unavailable for a few moments.

How we monitor our resources

There are lots of hosted tools available which will monitor the resources you provide for your applications; however, they all come with a high price tag when running any sort of serious infrastructure. At aTech we monitor everything internally using a combination of tools including Nagios and Cacti. Installing both of these tools is easy, although configuring them can be a bit confusing until you get your head around how both systems work. We use these tools for two different purposes: Nagios is primarily used for making sure everything is online and notifying our team should we detect any issues, whereas Cacti is used to keep track of historical statistics and plot them onto charts for viewing at a later date.

As part of our live monitoring, we monitor a number of key metrics across all our servers including:

  • Current load, memory, hard drive & swap usage on all servers
  • The status of any RAID arrays within any of our physical servers
  • The number of running processes and number of zombie processes
  • The status of key services – which actual services depends on the server being monitored – for example, on Codebase we monitor HTTP & SSH as these are the key public facing services whereas on Deliver we monitor SMTP.

In addition to these, we also monitor a number of business logic metrics which allow us to identify any issues which may arise without actually taking things offline, for example a backlog of jobs in any of our background job runners. For more information about this, see our blog post about it.
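
To give a feel for what a business logic check can look like, here is a minimal sketch of a backlog check written in the style of a Nagios plugin. It relies on the standard plugin convention of one status line plus an exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names, the thresholds and the `count_pending_jobs` stand-in are all assumptions made for illustration and are not taken from our actual checks.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(pending_jobs, warn_at, crit_at):
    """Map a backlog size onto a Nagios status code."""
    if pending_jobs is None or pending_jobs < 0:
        return UNKNOWN  # the queue could not be read
    if pending_jobs >= crit_at:
        return CRITICAL
    if pending_jobs >= warn_at:
        return WARNING
    return OK


def plugin_line(pending_jobs, warn_at, crit_at):
    """Return (exit_code, status_line); perfdata follows the pipe character."""
    status = classify(pending_jobs, warn_at, crit_at)
    text = f"JOBS {LABELS[status]} - {pending_jobs} pending"
    perfdata = f"pending={pending_jobs};{warn_at};{crit_at}"
    return status, f"{text} | {perfdata}"


def count_pending_jobs():
    """Hypothetical stand-in: replace with a real query against the job queue."""
    return 42


def main():
    # Illustrative thresholds only.
    code, line = plugin_line(count_pending_jobs(), warn_at=100, crit_at=500)
    print(line)
    return code
```

When run as a real plugin, the script would finish with `sys.exit(main())` so that Nagios picks up the status from the exit code.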

In order to keep track of busy periods and identify potential issues, we also keep track of statistics and plot them onto graphs. We have a number of different graphs:

  • The numbers and types of MySQL queries which are executed on all our database servers (see above).
  • The volume & types of jobs which are executed by any of our applications which include a background runner (see below).
  • The number of requests processed by our load balancers.
  • The power usage of all our data centre equipment.
  • All network bandwidth information from our routers, firewalls and switches.
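
Many of the figures behind graphs like these (query counts, request counts, interface byte counts) are exposed as ever-increasing counters, so they have to be turned into per-second rates before plotting. The sketch below shows that conversion under a few assumptions: two samples taken at known times, a counter of a fixed bit width, and at most one wrap between samples. The function name and parameters are illustrative and this is not a description of how our Cacti installation is configured.

```python
def counter_rate(prev_value, prev_time, curr_value, curr_time, counter_bits=32):
    """Per-second rate between two samples of an increasing counter.

    Times are in seconds. If the counter went backwards we assume it
    wrapped around exactly once at 2**counter_bits.
    """
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_value - prev_value
    if delta < 0:
        # The counter overflowed between the two samples.
        delta += 2 ** counter_bits
    return delta / elapsed
```

For example, a counter that moves from 1000 to 4000 over a 300 second polling interval gives a rate of 10 per second.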

All this information allows us to keep our fingers on the pulse of our entire infrastructure and all our applications. The information is available to all staff through our intranet and is also projected onto a wall in our office, allowing us to react immediately to any issues during office hours. Outside of office hours, we automatically receive text alerts.

The screenshot above shows the display which is projected onto our office wall and includes our most common and important machines.

Monitoring more than just services

Last week, we had an issue within Codebase which stopped some users from being able to view tickets within their projects. Unfortunately, we weren’t immediately aware of the situation because our monitoring system was reporting everything was online and operating successfully. The only real indication to us that something was wrong here was the volume of support requests which had been submitted to our support site.

In light of this, we have now added some additional monitoring which will highlight any unusual behaviour within any of our applications, for example:

  • Any large volume of discussions submitted to our support site within a one-hour period
  • Any large volume of application exceptions logged by our exceptions service
  • Any build-up of unprocessed background jobs in Codebase or Deploy
  • Any build-up of undelivered or pending e-mail in Deliver

As we now monitor these, our team will be alerted by text message/SMS whenever an issue occurs, ensuring that the appropriate action can be taken without delay.
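
As a rough illustration of the first item in the list above, the following sketch counts events (for example, new support discussions) inside a rolling one-hour window and reports when the count goes over a threshold. The class name, the threshold and the assumption that timestamps arrive in non-decreasing order are ours for the sake of the example; it is not the code we run in production.

```python
from collections import deque


class VolumeMonitor:
    """Flag when more than `threshold` events fall within a rolling window."""

    def __init__(self, threshold, window_seconds=3600):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = deque()  # timestamps, oldest first

    def record(self, timestamp):
        """Record one event; return True if the window is now over threshold.

        Timestamps are seconds and are assumed to arrive in order.
        """
        self._events.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Discard events that have aged out of the window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events) > self.threshold
```

With a threshold of 3, the fourth event inside the same hour makes `record` return True, which is the point at which an alert would be sent.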

New Codebase IP addresses

We have recently completed the migration of all users onto our new v4 platform and have updated our DNS records to ensure all customers are routed to this new infrastructure in the most efficient and fastest manner available.

This change should not affect most customers, but if you have specific firewall rules or have pointed services directly to our old IP addresses, you should note the new details provided below.

  • Our main load balancer which accepts all web & SSH traffic is now 109.104.109.67
  • If you use SSH on port 443, you should connect on 109.104.109.68
  • If you are still using our legacy *.svn.codebasehq.com domains for accessing Subversion repositories, this service is now located at 109.104.109.77
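
If you maintain an outbound allow-list, a quick way to audit it against the addresses above is sketched below. The three addresses come from this post; the function name, the service labels and the idea of expressing your rules as a list of IPs or CIDR ranges are assumptions for illustration, so adapt it to however your firewall rules are actually stored.

```python
import ipaddress

# New Codebase addresses as listed in this post.
NEW_ADDRESSES = {
    "web and SSH load balancer": "109.104.109.67",
    "SSH on port 443": "109.104.109.68",
    "legacy *.svn.codebasehq.com": "109.104.109.77",
}


def missing_from_allow_list(allowed_networks):
    """Return the services whose new address is not covered by the allow-list.

    `allowed_networks` is a list of strings, each a single IP or a CIDR range.
    """
    networks = [ipaddress.ip_network(n, strict=False) for n in allowed_networks]
    missing = {}
    for service, address in NEW_ADDRESSES.items():
        ip = ipaddress.ip_address(address)
        if not any(ip in network for network in networks):
            missing[service] = address
    return missing
```

An empty result means every new address is already permitted by the rules you passed in.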

The aTech Application Cloud

Over the last month or so we’ve been busy designing and implementing our new cloud for hosting our web applications. Designed from the ground up with maximum redundancy in mind, we’ve made every effort to ensure all parts of our infrastructure are at least n+1 redundant, which means various services can fail completely without affecting service uptime.
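
For readers unfamiliar with the term, n+1 means that if n units are needed to carry the load, at least one more is provisioned, so losing any single unit still leaves enough capacity. A small sketch of that check follows; the function name and the example numbers are hypothetical and say nothing about the real capacity of our servers.

```python
def survives_single_failure(unit_capacities, peak_load):
    """True if the remaining units can carry `peak_load` after any one fails.

    The worst case is losing the largest unit, so only that case is checked.
    """
    if len(unit_capacities) < 2:
        return False  # a single unit has no spare to fall back on
    remaining = sum(unit_capacities) - max(unit_capacities)
    return remaining >= peak_load
```

For instance, three units of capacity 100 each can lose one and still serve a peak load of 200, but not a peak load of 250.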

As well as changing our entire server & network infrastructure, we have also relocated all our services from Manchester to Bournemouth to correspond with us moving offices. Our office is now just a 5 minute drive from the data centre, so if anything should require attention locally, we can be there in a flash. Our team is also local, so out of hours emergencies can be responded to very efficiently.

The diagram below shows how we’ve designed our new network:

The new cloud is all ready and waiting for the public launch of Codebase v4 which is due very soon now.

Say hello to aTech Identity

We have just completed work on aTech Identity, our powerful single sign-on allowing users to switch seamlessly between aTech Media applications, as well as update passwords and email addresses from a single interface.

Users signing in with an aTech identity will see a small bar at the bottom of the screen allowing fast switching between accounts and applications.

Users can of course still log in using their existing accounts, and the process of moving to an aTech Identity is simple. Just click the upgrade banner and either create or link an identity.

Next week we will be rolling out aTech Identity to Codebase and Deliver, allowing seamless switching between accounts in all aTech Media applications.

As always, please let us know if you have any questions or feedback.

New Hosting Provider – Melbourne Server Hosting

After a few months of design & implementation, we’re really pleased to announce that we’ve successfully migrated all our services to our new infrastructure. We’ve moved from VPS.NET to colocation with Melbourne Server Hosting, a leading UK-based hosting solution provider. We bought a completely new set of high-specification Dell-branded servers, which the guys at Melbourne racked up and set up with redundant network & power feeds. They also provided us with permanent KVM-over-IP, giving us full control over the servers from our offices (not needing to sit on the highly uncomfortable datacentre floor while deploying & managing physical hardware is a massive bonus for us!).

You can see lots of pics of the new stuff on our Flickr page.