Archive for the ‘Issues: Resolved’ Category
Monday, November 10th, 2008
0930: The server Rampage.3v0.net is having I/O (disk) issues and all services are currently down on that server. We are doing everything we can to restore service and will let you know more as soon as more information is available.
0957: The server has been back for a good 15 minutes now. It had extensive file system errors but is stable at this time, and we’re checking the hardware at the moment. If we detect an imminent failure we’ll get a drive replaced under warranty. In any case we’ll let you know, but for now everything seems to be under control.
Posted in Issues: Resolved | 4 Comments »
Monday, October 27th, 2008
@ 09:30: There is a load issue on Hydra and I am currently trying to get it back under control. The load on the box is very high; it looks like it is being attacked. I will keep you up-to-date.
@ 11:12: We had to reboot Hydra after the issues earlier this morning, as the load was so high that we weren’t able to access it. It is now stable again and has been for some time. We are working to establish the cause of the issue.
@ 12:03: We suspended a customer whose site caused the problems this morning. We’re sorry for the inconvenience caused. We believe this issue has been resolved now but will continue to watch Hydra closely.
Posted in Issues: Resolved | 2 Comments »
Monday, October 6th, 2008
@ 23:00 - I need to reboot Rampage now; this is due to a kernel update we have put in place. It should take 3 minutes if everything goes well. If it doesn’t, there is a tech on hand on-site who will boot it back into the original kernel. I’ll keep you up-to-date either way. Sorry for any inconvenience caused; this is necessary to keep the server running for the long term.
@ 23:05 - Server is back! Kernel upgrade successful, downtime was about 90 seconds, if that. — 23:05:49 up 0 min, 1 user, load average: 0.84, 0.19, 0.06 — Mem: 8311304k total, 959504k used, 7351800k free, 83304k buffers
Posted in Issues: Resolved | No Comments »
Friday, September 5th, 2008
@14:49: Our monitoring has been triggered by a failure on Rampage.3v0.net. We are investigating and further updates will be made shortly.
@14:54: We have access to the server currently on some ports but not full access, as soon as we can get into the server to fully investigate the issue we will do so. From the current state of the server we believe there is a load spike on the server for an as yet unknown reason.
@15:33: We are now rebooting the server. If all goes well when the server comes back up, we will be back online very soon. We will update again as soon as the reboot is completed.
@15:44: We have rebooted the server and it has come back OK, but web services are still down. We are working to restore them.
@16:28: We are having to force the server to perform an FSCK (File System Check) as we are seeing locked files when trying to start web and database services, this may take some time, further updates will be issued hourly. Once again we are very sorry for this outage and are doing everything we can to resolve it as soon as possible.
@17:31: The first run of FSCK has completed. We are now attempting to bring the server back online and investigate the issue further. I will make a further update as soon as more information is available.
@19:33: I am VERY relieved to tell you we have now resolved the issue on Rampage and everything is running as normal. We will continue to monitor the server over the weekend.
@ 20:13: (Tim) Thanks very much to all our customers who were affected by this outage for remaining calm during this situation. We work very hard to make sure that any server issue is dealt with as fast as possible and that customers are kept in the loop AT ALL TIMES; uptime and customer satisfaction are the two most important things for us, with no exception. Also thanks to the whole team who pulled together on this one, great job guys. We were working on fixing this within 45 seconds of the error occurring; unfortunately some things just take time to fix, especially if you’re doing many disk checks.

For the non technical, green means good!

All is well again in the Evo rack. Thanks to everyone.
We will keep a VERY close eye on Rampage over the coming days.
Posted in Issues: Resolved | 4 Comments »
Saturday, August 23rd, 2008
We are upgrading Apache / PHP on Octane tonight to add PDO support. This will cause a very short outage while the update completes. Sorry for any inconvenience this may cause.
Posted in Issues: Resolved | No Comments »
Friday, August 1st, 2008
21.37: (Official note from the USDC) Duration: August 1 2008 04:25AM - 10:45AM EST - As you may have noticed this morning, starting around 3:30AM we had multiple BGP sessions drop with one of our upstream providers (Level3). This caused some instability in the network while routes were re-routed. This issue was also compounded by a large DDoS attack targeted at our core networking system. As a result of the attack, troubleshooting of the initial route related issue was made much harder, thus extending the time to get things sorted out. The network team was able to resolve the issue completely and everything should be running now. We will be looking into strengthening our internal policies to help alleviate issues such as these in the future. Thank you for your patience during this time.
12.36: The USDC is still on/off at random points. Unfortunately this is out of our control. It will stabilize over time; we’re seeing outages of up to a minute every 5 - 60 minutes (it’s quite random).
11.24: Ohh, what a day. The US data centre just had 100% ping loss for approx 10 minutes; unfortunately this is out of our control. It’s back now, however there is approx 10% ping loss on and off. This should tidy itself up very quickly.
Posted in Issues: Resolved | No Comments »
Friday, August 1st, 2008
07:47 - Just to let you know, we’ve finished working on the firewall, we tweaked a few settings and will keep an eye on it. If it recurs (hopefully not) we know immediately what the issue is so can fix it quickly.
07:24 - We had some problems with a few ports between the last post and now, this has been resolved. Everything seems to be working okay, however we continue to work on the server to stop this from repeating.
07:06 - Just to let you know, the server is back. It locked itself out, so we’ll be doing work today on the firewall which might cause outages of a couple of minutes (hopefully no longer) at random periods. Nothing to worry about, just letting you know in case your connection drops for a moment.
06:41 - No reboot necessary, we’re in and back online - Firewall issue.
06:29 - Problem seems to be firewall setting, however engineer on the floor cannot access the box successfully via KVM so we might have to reboot.
05:50 - We are experiencing problems with Firestar this morning. We will let you know as soon as there is news on the situation. We have seen multiple occurrences of 100% ping loss.
Posted in Issues: Resolved | No Comments »
Tuesday, July 29th, 2008
22:38: Monitoring has reported that the load on Rampage is very high and we are currently investigating. This is causing a temporary drop out of every service on that server.
22:44: Load has dropped; it was caused by several thousand HTTP connections being directed at a specific site. The site has been suspended and the server is returning to its usual load. All services were restored at 22:41 - total downtime 10 minutes 50 seconds. (Load hiked to 220; it should be about 1.5.)
We will rectify the issue with the customer involved so this doesn’t repeat.
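For readers wondering what a load of 220 versus 1.5 means in practice, here is a minimal sketch of the kind of check a monitoring script might run. It assumes the monitor reads a line in the same format as the uptime output quoted in the October 6th post; the function names and the alert threshold are hypothetical and are not taken from our actual monitoring setup.

```python
import re


def parse_load_average(uptime_line):
    """Return the (1, 5, 15)-minute load averages from an uptime-style line."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load average found in: %r" % uptime_line)
    return tuple(float(value) for value in match.groups())


def is_overloaded(load_1min, threshold):
    """True when the 1-minute load is above the chosen alert threshold."""
    return load_1min > threshold


# Hypothetical threshold: well above the usual load of about 1.5.
ALERT_THRESHOLD = 10.0

if __name__ == "__main__":
    line = "23:05:49 up 0 min, 1 user, load average: 0.84, 0.19, 0.06"
    one_minute, five_minutes, fifteen_minutes = parse_load_average(line)
    print("1-minute load:", one_minute)
    print("overloaded:", is_overloaded(one_minute, ALERT_THRESHOLD))
```

A real monitor would also look at the 5 and 15 minute figures so that a short spike does not raise an alert on its own.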
Posted in Issues: Resolved | No Comments »
Thursday, July 17th, 2008
22:00 - Tim: We are commencing the scheduled PHP Recompile & PHPSuExec install on Firestar, maintenance window begins now and will hopefully end by 00:00. We will update the blog if we have any issues which incur server down time. If you experience Internal Server Error / Error 500 please read the email we sent out, alternatively open a ticket.
22:16 - Tim: Server is pretty busy, monitoring has detected HTTP errors, sites we’re checking are working fine right now.
22:21 - Tim: Monitoring says server has recovered.
22:30 - Toby: Everything is going as normal, minor issue seen by monitoring as above was to be expected and had no effect on customer websites.
23:20 - Toby: We have now completed recompiling PHP on the server and will now be testing to make sure everything is still working.
18/07 @ 10:53 - Tim: As expected we have a lot of tickets coming in with Internal Server Errors - we’re currently fixing issues on average within 60-90 minutes, please refer to the email we sent out if you want to tackle it yourself - quick summary below:
Go to your cPanel. Click Error Logs - this will give you a clue as to what the problem is. Look for keywords; often it will mention permissions issues, and for that you change any file or folder that’s 777 to 755 or 644 or lower using your FTP client or the cPanel file manager.
If you want to speed up the process when opening a ticket, please paste some of the last entries from your error log inside your ticket. This will save us looking and get your issue dealt with faster - you don’t have to, it’s just worth mentioning.
You may also see issues with your .htaccess - move any php_value lines out of .htaccess and recode them in php.ini.
Some people will also see file ownership problems - this is because some scripts uploaded direct via the server have the ownership of ‘nobody’, you need to open a ticket to fix that as you don’t have sufficient permissions to change it.
The above will fix 90% of issues with regards to Internal Server Errors - this is all perfectly normal, a one-time thing, and we are doing this to make sure your server is secure. It is vital to do this for the long-term uptime of your website and our servers.
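If you prefer to script the two fixes above rather than click through them, here is a minimal sketch in Python. It is an illustration only, not a tool we supply: the function names are made up for this example, it assumes you have shell or script access to your own files, and handling php_flag lines alongside php_value is an assumption that goes beyond the summary above. Run it with dry_run=True first to see what would change.

```python
import os
import stat


def tighten_permissions(root, dry_run=True):
    """Find entries under root with mode 777; reset dirs to 755, files to 644.

    Returns a list of (path, new_mode) pairs. Nothing is changed on disk
    while dry_run is True.
    """
    changed = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # leave symlinks alone
            mode = stat.S_IMODE(os.lstat(path).st_mode)
            if mode != 0o777:
                continue
            new_mode = 0o755 if os.path.isdir(path) else 0o644
            if not dry_run:
                os.chmod(path, new_mode)
            changed.append((path, new_mode))
    return changed


def htaccess_to_php_ini(htaccess_text):
    """Turn 'php_value name value' lines into 'name = value' php.ini lines.

    Lines that are not php_value / php_flag directives are ignored.
    """
    ini_lines = []
    for line in htaccess_text.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0].lower() in ("php_value", "php_flag"):
            ini_lines.append("%s = %s" % (parts[1], parts[2].strip()))
    return ini_lines


if __name__ == "__main__":
    # "public_html" is a placeholder path; point it at your own web root.
    for path, new_mode in tighten_permissions("public_html", dry_run=True):
        print("would set", oct(new_mode), "on", path)
    print(htaccess_to_php_ini("php_value upload_max_filesize 10M"))
```

Scripts that must stay executable (CGI scripts, for example) need 755 rather than 644, so check the dry-run list before applying it. The converted lines go in your php.ini, and the original php_value lines still need removing from .htaccess by hand.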
Best regards,
Tim.
Posted in Issues: Resolved | No Comments »
Saturday, July 12th, 2008
@ 16:11
Everything seems to be stable now, all servers are back online.
@ 14:59
I talked with the lead engineer - part of the datacenter over in the US had a brownout, and a few racks seem to be without power, including the one with Venom in. I am constantly hassling and will update this post when I have more info - as you can imagine it is slightly hectic at that datacenter with techs running around fixing things.
Further update as soon as I get it — Tim.
@ 14:45
Hello.
The datacenter in the US experienced some issue a short while ago. All boxes went offline for approx 5 minutes; 2 of 3 are back (namely Inferno and Vector) but Venom is still MIA.
Unfortunately the datacenter’s websites are down too, their phone lines are busy and nobody is available on AIM to get things straightened out for Venom customers.
We will continue to repeatedly attempt to contact the datacenter to get Venom online and will report back here every hour. (Hopefully a 2nd update won’t be needed!).
I’ll keep you in the loop as ever.
Best wishes,
Tim.
Posted in Issues: Resolved | No Comments »