This is a detailed report of the failure of our secondary network, which hosts FTP5, dedicated and VPS servers, so that you can see what happened on Monday. The majority of our servers were not affected, as 90% of sites are hosted on our primary network, which is entirely separate and in a different data centre.
9:09am: Our monitoring system alerted us to performance issues with our network at the second data centre. We started an investigation and put our emergency plan into action. Our tests showed that the external network coming into the data centre and the local network connecting the servers in the racks were both fine.
9:30am: The loss of performance was pinpointed to the network inside the data centre that carries connectivity from the external links to our racks. Not all services were down at this point, but performance was poor. We reported this to the data centre, who were already aware of the issue from their own monitoring.
9:45am: The issue was traced to the primary and secondary Cisco 6500 network systems, which are configured as a VSS-1440 redundant cluster. It appears one had failed and the secondary hadn't taken the full load as it is designed to.
10:10am: A Cisco TAC engineer logged into the routers to try to identify the problem, working with the data centre. By this point the network had failed completely.
5:00pm: The issue was resolved, and the routers were rebooted and brought back online. Parts of the data centre came back online, but we weren't part of this first wave.
6:00pm: The data centre contacted us to say that they were investigating the cause of the issue and had seen suspicious activity on our connection into the routers. They needed to check this to make sure our network wasn't affecting the routers, and gave us an ETA of being back online by 9:00pm.
8:15pm: We worked with the data centre to check the data they had seen, and it turned out to be a false alarm. The strange behaviour was caused by our redundant link system struggling to keep things online: it couldn't fail over to the secondary link on the second router, because that router hadn't taken over as it should.
9:00pm: After checking everything, the data centre brought up our links to the routers in the building and we were back online as normal.
The main issue today was the failure of one of the data centre's routers, which should have failed over to its backup but did not. The data centre could have simply removed the failed switch and brought things up with one router, but as they had passed this on to Cisco as part of a support arrangement, it had to be left down. This was so Cisco could gather the data needed to prevent the issue in future, but it did lengthen the downtime.
The data centre and Cisco are now working together to make sure that the failover system does work in future. This is a very complex redundancy system and, as with any system, even a redundant one, it can fail when multiple things go wrong. We are also looking at the way we get our connectivity into the data centre, and will be looking at taking this over and using our own routers so that we have full control. Part of the issue today was working with the data centre's plan for getting back online, which we think was too slow, and we will raise this with them.
We really do understand the effect today has had on customers, and this is not something we think should ever happen, but these things do happen at data centres. Examples include the Microsoft Sidekick cloud failure with total data loss, the HBOS data centre failure that took down card payment systems, and a US data centre substation explosion that took it down for 3 weeks, and there are many more.
We do plan with redundancy at the heart of the data centre and are committed to providing the highest quality of hosting in the UK that we can. We will learn from what has happened today and welcome any comments customers have about it. If anyone wishes to discuss this with a member of staff or has any concerns, please contact us and we'll talk these through with you. For customers who need more guarantees, we can show you how to host sites and email across multiple data centres, so that a single data centre failure does not take your business down.
Once again, I'd like to thank the customers affected for their patience and understanding, and we sincerely apologise for the downtime on Monday.