Most web hosts don’t touch on this subject. Downtime is our least favourite word in this industry, and it puts clients off when they’re viewing the company site.
You may have heard about the explosion at a data centre in Houston. Lots of my favourite sites were knocked offline by it, such as b3ta.com and… erm… ok… I live on b3ta when I’m not replying to tickets… Anyway, around 9,000 servers and who knows how many websites were taken offline, many of which are facing 50 to 60+ hours of downtime at the time of writing.
I have been keeping up to date on the happenings, reading a lot of posts on the forums of some of the companies affected. I notice many, many people complaining about the outage and how it has affected their business.
Many of these people (not all, admittedly) didn’t have to suffer downtime. They just hadn’t made any form of disaster recovery plan, and when their sites wouldn’t load they took absolutely no responsibility for their own negligence or lack of knowledge. If you are making money from your site then it is in your best interest to learn how everything works and to plan for the worst. You’d do this for any other business, right? I worked for Mattel for some time while I was starting Evo. They had their own backup office in case their UK HQ burnt down, and all their data was sent offsite weekly. So why aren’t you doing something similar for your web server? It’s common sense.
I noticed today that even staff at one company were telling their customers on the forums that they should have had a backup plan if their site is valuable. I wholeheartedly agree with this stance from a business owner’s point of view. Unfortunately for those customers it is too late this time and they’ll just have to learn from their mistake. The worst didn’t happen for them: their sites are merely offline (I say merely, I know this is life or death for some people), but all the data is intact.
Now imagine if that building had burnt to the ground, all the data was lost and all they had to show for it was some crispy fried servers.
If that had happened, I imagine some of those hosting customers could have gone out of business purely due to poor or no disaster planning. And of course I wouldn’t be able to check out the awesome drunk cheeseburger-eating Hoff animated GIFs and LOLCATS-style pictures at b3ta any more, which would indeed make me feel quite sad.
Accidents happen no matter how good a data centre is, no matter how good the equipment is, no matter how good the staff are, no matter how much things are checked and no matter how well we as hosts practise our disaster recovery procedures. It is inevitable that at some point something will go wrong, especially when your building uses as much power as a small town to stay running and needs generators the size of a plane to operate when the power goes out.
The explosion at H1 is by no means the first data centre problem in the world. Every single web host has some problem at one point or another, whether it be those pesky hackers, a server configuration issue or a lack of power, network or air-con.
We’ve had a couple of instances of 7 – 12 hour FSCKs (file system checks) ourselves, rare as they are, and you can read about them on our blog. They can and do happen to every hosting company at some point, no matter what the marketing spiel says.
If your business relies on your web site or email to stay alive and you haven’t got a disaster recovery plan yet, you should take some time out today to sort this out. I can’t emphasize enough how important this is.
Here are some hints on starting out with your disaster notification & recovery plan. They are by no means exhaustive but should give you some insight into the things you should be thinking about.
Monitor your website – As a web designer, isn’t it rather embarrassing when your main customer phones up to ask why their web site is down and you hadn’t realised yourself? We use Wormly here and we love it. It monitors all the services on each server and lets us know within the same minute via ICQ & SMS when something is dying, so we can go and fix it before any of our customers have even noticed. If you had Wormly then you’d know whether your customer’s web site was pinging away happily or not, and you’d also know that we’d be fixing it already, because we have a one-minute monitor too.
Monitor your home page – Similar to what Wormly does, but home page monitoring makes a call to your website every few minutes to check that it is loading the data you want your customers to see, rather than a “THIS HAS BEEN HACKED BY …” message or “Internet Explorer cannot display this page”. You should be using home page monitoring if you care about your website. It’s your website we’re hosting and you should know the second something happens. It’s our responsibility to make sure the servers are stable and working fine, but it’s your responsibility to make sure your web site is working. We don’t do home page monitoring because we don’t know when you update your website. If we did, then the second you changed your homepage with a new design or different text we’d be alerted and have to call you, and for an average £5/month hosting plan that isn’t feasible.
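To give a rough idea of what a home page check involves, here is a minimal Python sketch. It is not how Wormly or any particular monitoring service works. The function names, the phrase you pass as `must_contain` and the "hacked by" marker are all made up for illustration, and a real service adds retries, alerting and checks from several locations.

```python
import urllib.error
import urllib.request


def page_looks_healthy(status, body, must_contain, must_not_contain=("hacked by",)):
    """Return True if the response looks like the page we expect to serve.

    must_contain is a phrase that only appears on the real home page
    (an assumption: you pick it and update it when you redesign).
    """
    if status != 200:
        return False
    text = body.lower()
    if must_contain.lower() not in text:
        return False
    # Reject pages carrying any of the known-bad markers.
    return not any(bad.lower() in text for bad in must_not_contain)


def check_homepage(url, must_contain, timeout=10):
    """Fetch the page and check it; any network or HTTP error counts as down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return page_looks_healthy(resp.status, body, must_contain)
    except (urllib.error.URLError, OSError):
        return False
```

You would run something like `check_homepage("http://www.whatever.com/", "a phrase from your home page")` every few minutes from a machine that is not your web server, and alert yourself whenever it returns False.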
Put your data in at least 2 completely different geographical locations – Sounds like a waste of £5 a month for a second shared hosting account somewhere, doesn’t it? On the flip side, if an outage occurs you have to work out what uptime vs £££ means to you. If you have a replica of your site elsewhere, coupled with the next item I’m going to mention, then downtime at one host is no longer a problem.
Set the name servers on your domain to use an external DNS provider so you can either flick the switch to your backup provider manually or have automated DNS failover – DNS is the thing that resolves your domain name to a server’s IP address. When you type www.whatever.com into your browser, your computer goes and asks a DNS server where to go. If your name servers are pointed at a multi-homed DNS provider then you can just press a button to point your domain at another server, and visitors will follow as soon as their cached DNS records expire, so keep the TTL on those records short. You can do this manually or with automated failover.
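Automated failover boils down to a small decision: which IP address should the domain point at right now? The sketch below shows only that decision, assuming you already record whether each check of the primary server succeeded. The function name, the three-failures threshold and the IP addresses (taken from the ranges reserved for documentation) are illustrative, and actually changing the record depends on your DNS provider, so that part is left out.

```python
def choose_target(recent_checks, primary_ip, backup_ip, failures_needed=3):
    """Pick the IP address the DNS record should point at.

    recent_checks is a list of booleans, oldest first, where True means the
    primary server answered. We only fail over after `failures_needed`
    consecutive failures, so a single dropped check does not flip the record.
    """
    last = recent_checks[-failures_needed:]
    if len(last) == failures_needed and not any(last):
        return backup_ip
    return primary_ip


# Hypothetical example: three failed checks in a row, so point at the backup.
target = choose_target([True, False, False, False], "192.0.2.10", "198.51.100.20")
print(target)
```

Waiting for a few consecutive failures is a design choice: it costs you a couple of minutes before failing over, but it stops one slow response from bouncing your visitors between servers.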
Take nightly or weekly offsite backups of your data – Whatever your host says about backups doesn’t matter. Whoever you host with, YOU should take backups too, it is as simple as that. The onus is on you to make sure your data, or your customers’ data, is safe. You are our customer and we take nightly backups of our shared servers, we take nightly, weekly and monthly backups of our business class servers, and we run RAID so we’re protected against single drive failure, something we didn’t have in 2004 when we started. However, this doesn’t protect us against total RAID array failure (unlikely, but then again a data centre explosion is unlikely and that happened… so…). Soon we’ll have entire servers backing up between data centres in case of RAID array failure or even a data centre fire, which means we can roll out entirely new servers, pre-built with all customer sites, in 4 – 12 hours. Not many shared hosts bother with this. Even with all this in place, you need to take your own backups too.
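As a starting point for your own backups, here is a minimal Python sketch that packs a site directory into a timestamped tar.gz and writes a SHA-256 checksum next to it. The function name and the file naming are assumptions for illustration. It only covers files, so a database would need exporting separately, and the backup is only offsite once the archive sits on a different machine in a different building.

```python
import hashlib
import tarfile
import time
from pathlib import Path


def make_backup(site_dir, dest_dir):
    """Archive site_dir into dest_dir and return the path of the new archive.

    dest_dir should not be inside site_dir, or each backup would try to
    swallow the previous ones.
    """
    site_dir = Path(site_dir)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # A timestamp in the name keeps older backups instead of overwriting them.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest_dir / f"{site_dir.name}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(site_dir, arcname=site_dir.name)

    # Store a checksum so a damaged copy can be spotted after transfer.
    digest = hashlib.sha256(archive.read_bytes()).hexdigest()
    checksum_file = archive.with_name(archive.name + ".sha256")
    checksum_file.write_text(f"{digest}  {archive.name}\n")
    return archive
```

Run it nightly or weekly from a scheduled job, then copy both the archive and the checksum file somewhere that is not your web host.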
Practise restoring backups on your backup server before you actually need to – So the nightmare has happened and your site is down. You don’t already have a ready, rolled-out version of your site on your backup server, but at least you downloaded your backup and you have a nice tar.gz file of your site sitting on your Windows desktop, right? Great, but you’ve only done half the job. How exactly are you going to know it works unless you’ve tested it beforehand? There are always little issues, and there are always configuration differences between servers too. Most of ours have phpSuExec and your backup server might not, so you’ll have to change permissions on all your folders in rather a hurry. Or your database was corrupt when it was exported from the downed server, so your backup is useless. For those reasons you need to practise what you are doing before it is needed.
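A first sanity check you can automate is whether that tar.gz is readable at all and contains the files you expect. The Python sketch below does only that, with a hypothetical function name and file list. It is no substitute for a real restore on your backup server, because it tells you nothing about permissions, phpSuExec or whether the database dump actually imports.

```python
import tarfile
import zlib


def verify_backup(archive_path, required_files):
    """Return a list of problems found; an empty list means the checks passed."""
    seen = set()
    try:
        with tarfile.open(archive_path, "r:gz") as tar:
            for member in tar:
                seen.add(member.name)
                if member.isfile():
                    fileobj = tar.extractfile(member)
                    # Read every byte so a truncated or corrupt archive fails here.
                    while fileobj.read(1024 * 1024):
                        pass
    except (tarfile.TarError, OSError, EOFError, zlib.error) as exc:
        return [f"archive unreadable: {exc}"]

    # required_files uses the paths as stored in the archive, e.g. "site/index.html".
    return [f"missing from backup: {name}" for name in required_files if name not in seen]
```

Call it with the paths you cannot live without, for example your home page and your database export, and treat any non-empty result as a failed backup.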
Keep your customers up to date – We will always do our best to keep our customers up to date in the event of an emergency, and with that information you should be able to do the same for yours. Give them ETAs, best case/worst case scenarios, everything you can, but never over-promise. Make a plan for what you are going to say to your customers: are you going to make the first move in notifying them, or are you going to wait for them to phone you?
I hope this gives you some ideas of the kinds of things you should be looking into right now. I’ll let you and Google fill in the blanks, but next time an outage occurs somewhere, anywhere, hopefully not here, be prepared. The last thing we want to hear, while we’re sitting here frantically fixing the shiny expensive Dell PowerEdge server you’re hosted on and keeping you in the loop, is that you’re losing money through something that could have been avoided by being a little bit proactive.