Archive for February, 2008

Optimus (Resolved)

Wednesday, February 20th, 2008

We’re experiencing a load hike on Optimus. We cannot currently access it, so we have asked the datacenter to reboot it.

@ 16:33 - Server has now been rebooted. We continue to monitor it to find the cause of the load hike.

@ 17:57 - Load hike was caused by a customer’s old Mambo installation, which had been breached. The customer has been suspended.

Vortex (Resolved)

Thursday, February 14th, 2008

We are investigating issues with Vortex. This is a repeat of Monday’s problem; a full report will be emailed to Vortex customers during the following week.

Latest news:

@ 00:39 Sun (Tim) - Kernel recompile completed on this and all UK servers!

@ 03:54 Sat (Tim) - We’re upgrading another server first. Unfortunately, post-upgrade it won’t find the LAN card, so we’re rescheduling for tomorrow at midnight. The other server is being worked on before Vortex.

@ 00:53 Sat (Tim) - Kernel upgrade has begun.

@ 22:25 Fri (Tim) - We’re going ahead with the Kernel upgrade as scheduled from around midnight. The server will be rebooted a couple of times during the night, nothing to worry about. I’ll post a report here if there is an issue.

@ 05:52 Fri (Tim) - I had a problem with the kernel update (it’s a manual rebuild rather than just typing ‘yum update’ then ‘reboot’), so I’m going to do this at 00:00 (Friday night / Sat morning). It’s coming back into daylight hours and I cannot allow more downtime on this box if it’s avoidable, so it has to wait until tomorrow night. This won’t involve much downtime (enough to reboot the server, usually 3 minutes, however I’m taking no chances).

Issue is resolved for the moment: no hardware failure, server is stable, file system is okay, all sites are up, all services are running, and 95% of tickets have been replied to. We are just working on the last few now. Tomorrow is likely to be very busy on support tickets due to this, but we’ll do our best to action your request within our usual 1 - 6 hour SLA.

Once again, I’m really sorry for the inconvenience caused. If you haven’t already claimed a free month for the downtime, you can open a ticket with sales and Mark/Jan-Erik/Pete will extend your hosting period by one month. If we have already given you a free month on Monday we cannot do this again.

I’m also copying the backup files to another server, so if this happens again in the next few days (it shouldn’t) we can set up all the sites on another box and switch the nameserver IPs.

Last thing, thank you to YOU, our customers, for being so understanding and cheering us up with your comments during this stressful time. We don’t like to let you down; the team and I know the seriousness of this and how it has caused you significant problems. A personal thank you also to our team for working some long hours this week. Great team, good job guys.

@ 04:13 Fri (Tim) - Kernel update is progressing well; it has been compiling for some time now.

@ 02:17 Fri (Tim) - IT LIVES! However, we’re not quite out of the woods yet. We are going to complete a manual kernel update as a precaution. This is happening right now, and we’ll then need to reboot the box. If nothing bad happens then we’ll have it back by 03:30.

@ 01:02 Fri (Tim) - We’re checking on the server every 5 minutes (I’d make Toby sit in there constantly but apparently it’s against Health and Safety due to extreme noise). Currently the server is on Pass 1D (that’s good) of the FSCK check and hasn’t bailed. Hopefully it shouldn’t be much longer. While we’re waiting we’re tidying up the support desk (150+ open tickets due to this). Once the FSCK is done we’re performing a kernel upgrade and rebooting the server again. ETA is now 03:00 - 06:00; we’ll do our ABSOLUTE BEST to get it back online for 06:00.

@ 23:21 Thur (Tim) - Same as an hour ago, but an hour closer. Just playing the waiting game right now. Some people might be interested to actually SEE what we’re seeing … For those who want to see, here you go:

FSCK

@ 22:21 Thur (Tim) - Disk check failed, performing manual disk check - we started the manual one at 22:06 and expect it to finish at approx 01:00. Uptime ETA 01:00 - 06:00.

@ 21:01 Thur (Tim) - Sitting here in the Bluesquare datacentre chill out room with Toby waiting for the disk check to finish - we’re hoping it’ll be done by midnight although ETA as Mark said below is still the same.

@ 19:17 Thur (Mark) - We have discovered that there is no hardware failure. We are currently running another mandatory disk check on the hard drives. (Current ETA: Midnight - 6am as long as there are no further errors).

@ 16:49 Thur (Tim) - We’re still looking at the ETAs below as Dell Diagnostics haven’t finished.

@ 15:56 Thur (Tim) - Disk check finished, Dell diagnostics is running, likely until 6pm.

15:50 - 18:00 (approx.) - Dell Diagnostics will run.

When diagnostics finish: if no faults are found, we reboot the server and see whether it wants another disk check or wants to start up. A disk check will then take another 4 - 6 hours.

If faults are found then we call Dell and await engineer arrival.

@ 15:42 Thur (Tim) - Still the same, see the 13:33 update for ETA.

@ 14:42 Thur (Tim) - Still the same as the previous post due to disk check. Nothing new to report as yet.

@ 13:33 Thur (Tim) - Disk check still progressing; we’re estimating approx 6pm before it finishes. Dell are also on call to swap out any hardware, with a 1 - 4 hr SLA for them to arrive on scene. Best case scenario will mean server is online 4pm - 6pm, worst case 10pm - tomorrow morning.

@ 12:06 Thur (Tim) - We saw some I/O errors before it shut down, possibly a hard disk failure. It is running a disk check now which (as you will know from Monday) takes 4 - 6 hours.

Firestar (Resolved)

Thursday, February 14th, 2008

We are looking into intermittent reboots on the Firestar server. Multiple outages of 2 - 10 minutes have been reported this morning.

@ 00:39 Sun (Tim) - Kernel recompile completed on this and all UK servers!

@ 03:55 Sat (Tim) - Post-upgrade kernel does not find the LAN card - we are working on a solution and will try again tomorrow at 00:00.

@ 00:53 Sat (Tim) - Kernel upgrade has begun.

@ 22:38 (Tim) - We’re going ahead with the Kernel upgrade as scheduled from around midnight. The server will be rebooted a couple of times during the night, nothing to worry about. I’ll post a report here if there is an issue.

@ 06:08 (Tim) - Had an issue with the kernel upgrade. We cannot afford more downtime on this box during peak (06:00 - 23:00), so I am leaving it until 00:00 tomorrow along with Vortex.

@ 22:20 (Tim) - Firestar just rebooted again - we will take it down at 00:00 for a little while to perform a kernel upgrade.

@ 16:19 (Tim) - We found a customer running a process they shouldn’t have been; the customer has now been suspended. This is very likely the cause of the problem. We continue to check and monitor the situation.

@ 15:35 (Tim) - Firestar just rebooted itself, a technician is running Dell Diagnostics on it. I’ll update you ASAP.

Vortex (Resolved)

Monday, February 11th, 2008

Hi there, these are the latest updates on the server issue. We will keep you informed with any news every 30 - 60 minutes.

LATEST @ 16:00 - SERVER IS BACK ONLINE! We are thoroughly checking it over.

Posted on: 11 Feb 2008 10:15 AM

Hi there,

Just to give you an update on the situation:

The server was rebooted early this morning and has been performing a disk check since. We hope it will be back very soon but we currently have no ETA.

Unfortunately, a disk check takes as long as it takes and is forced to run after every X reboots of the server. It is checking the consistency of approximately 900GB of data.

We will keep you in the loop & mail you back as soon as it’s fixed.
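
For those who want the detail behind the "every X reboots" rule: on ext2/ext3 file systems (an assumption, as the update above does not name the file system), a check is typically forced once a mount counter reaches a configured maximum. The Python sketch below only models that rule. The function names and the example numbers are hypothetical, and it does not read anything from a real disk.

```python
from typing import Optional


def forced_check_due(mount_count: int, max_mount_count: int) -> bool:
    """True when a mount-count-triggered check would run at the next boot.

    Assumption: a maximum of 0 or below means the limit is disabled.
    """
    if max_mount_count <= 0:
        return False
    return mount_count >= max_mount_count


def boots_until_check(mount_count: int, max_mount_count: int) -> Optional[int]:
    """Boots remaining before a forced check, or None if the limit is disabled."""
    if max_mount_count <= 0:
        return None
    return max(0, max_mount_count - mount_count)


if __name__ == "__main__":
    # Hypothetical values: 28 mounts so far, check forced at 30.
    print(forced_check_due(28, 30))
    print(boots_until_check(28, 30))
```

Knowing how many boots remain before a forced check is what makes it possible to schedule the long check for a quiet period rather than have it start after an unplanned reboot.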

Posted on: 11 Feb 2008 10:40 AM

Vortex Server - further update.

The disk consistency check has finished; however, the server refuses to boot. We are assessing the server for hardware failure and running another FSCK.

We will keep you updated.

Posted on: 11 Feb 2008 11:26 AM

Further update on Vortex.

The second FSCK (disk consistency check) is still running. The server will reboot automatically once that has finished, and we will then be able to see the state of play in full.

At the moment it is not looking like a hardware failure; no monitoring devices have picked up any physical faults.

Please rest assured that we do understand the severity of the problem and we are doing everything within our power to make it right.

We will update you in the next 30 - 60 minutes.

Posted on: 11 Feb 2008 12:00 PM

Vortex - Further Update:

Some customers are interested in the intricacies of how the servers work; others just want their sites online again. So I will explain where we are now and try to strike a balance:

The bottom line: At this stage we’d expect to see the server back in a time frame of 1 - 6 hours depending on how many more times we have to perform a disk check.

More info for those who want it: We have been checking the server for hardware faults. None have been found, so it is most likely ‘only’ a file system problem. (A file system problem can be bad, yes, however a hardware fault on (for example) a RAID array controller would be a LOT more problematic).

The good news is that the FSCK has gotten further without manual intervention. It is currently at 23%; the previous time it managed to get to 11% before it required intervention and failed.

We will most likely need to run FSCK 1 or 2 more times before the server becomes operational. Each time we run it we would expect to get the file system back into a better state.

I will update you again in 30 - 60 minutes (or sooner if there is news in the meantime).

Posted on: 11 Feb 2008 12:43 PM

Vortex Server - Update:

The FSCK is progressing well; we haven’t had to manually intervene and it hasn’t crashed out. I estimate this disk check should finish around 13:30 - 14:00. The moment it finishes the staff will diagnose the server and I will reply to you again.

Expect the next update within the next 60-90 minutes, I’ll reply earlier if the FSCK crashes out or if anything changes.

Posted on: 11 Feb 2008 02:06 PM

Vortex Situation - Update.

The FSCK is still progressing well, no intervention required - however we cannot see the % complete as that is no longer visible on the screen.

It could be as little as 1hr or as much as 6hrs.

I will reply every hour with a standard update if there is no further news, however I will reply sooner if anything else happens - I will make sure to keep you in the loop at all times.

We are on top of the situation and hope to have it worked out as soon as possible.

Posted on: 11 Feb 2008 05:08 PM

Vortex Server - Update. Total downtime 13 hours.

Hi, I am INCREDIBLY relieved to tell you that this server is back online now. Thankfully our team did their jobs impeccably behind the scenes, big thanks to them.

I am truly sorry for the outage today. We will spend the rest of the day actively monitoring this server and will work on a solution to make disk checking a lot quicker. This may require some overnight maintenance downtime, but I’m sure you’ll understand that it will be worth it in the long run.

I will also work out some compensation for each customer who has written to us, I’ll arrange for each customer to have a month free.

I hope the number of updates I’ve given shows that we do care about the quality of service we offer, both uptime and customer relations. It’s so important to the ethos of Evohosting. I will take steps in the future to make our public site entirely transparent so you can see what goes on behind the scenes.

Again, I’m truly sorry for the inconvenience caused and will do my best to make sure this cannot happen again.

But first I need to go make a cup of tea as I’ve been sitting here for 13 hours.

February’08 Specials

Friday, February 1st, 2008

We have one special offer this month, see below for details.

Enter code: LOVEEVO for 90% off first month of web hosting when paying monthly.