Kevin Hazard
Jun 2 2008, 11:38 PM
As Doug mentioned in his message, we have moved past most of the systemic issues and are now dealing with individual customer issues. Accordingly, we will reduce the frequency of our updates here to every six hours unless significant issues are present.
In the meantime, please use our standard processes (ticket, chat or phone) to get support or to report any issues. We have additional staff on hand, but there are numerous tickets coming into the queue and the teams are working to get through the backlog, which may take several hours. We appreciate your patience.
Kevin Hazard
Jun 3 2008, 01:07 AM
The ServerCommand errors that you may have experienced recently should be completely resolved now.
Kevin Hazard
Jun 3 2008, 01:28 AM
Due to an issue with one of our backup generators, we've noted inconsistent power distribution to our CRACs (air conditioning units) and PDUs. Because these key components are fundamental to server racks, customers may note some downtime currently.
We have our data center operations and facilities teams checking the generators, CRACs, PDUs and racks to restore connectivity.
UPDATE: This issue appears to only affect phase 1.
Kevin Hazard
Jun 3 2008, 01:58 AM
The backup generator issue is affecting around 1/2 of phase 1. From most recent reports, servers have power, but until the issue is resolved, they may not be accessible.
Kevin Hazard
Jun 3 2008, 02:35 AM
The Customer Access Routers in H1, Phase 1 have been affected by the generator issues. While customer servers in racks may be powered on, they will not be accessible until the access routers have power restored.
Kevin Hazard
Jun 3 2008, 03:45 AM
CRAC units are back online. The facilities and data center operations teams are verifying the stability of the generator, and they will restore power to the PDUs as quickly as possible.
Kevin Hazard
Jun 3 2008, 05:39 AM
Around 2:20 AM CDT, the backup generator being used to power H1 Phase I experienced an electrical issue resulting in service loss for Phase I; Phase II remains unaffected at this time. Our data center operations and facilities teams immediately began investigating the cause of the failure in order to restore power to the Computer Room Air Conditioner (CRAC) units and Power Distribution Units (PDUs) for Phase I.
The staff successfully tested the 2 megawatt generator without load, so they began powering up the CRAC units and PDUs to restore service to Phase I. While working through this power restoration, the generator's breakers were tripped by their internal electronics. The generator is rated to handle more than the load required to power the phase, and the generator itself is fully functional, but the breaker system must be replaced to guarantee stable power distribution.
We have attempted to locate a replacement generator and are evaluating the time necessary to repair the breakers on the current generator so we can restore power as quickly as possible. We do not have an ETA for power restoration, but we will update you with our current status hourly, or sooner as developments warrant.
Kevin Hazard
Jun 3 2008, 06:27 AM
As Doug mentioned in his audio message last night, "The explosion and electrical fire damaged, beyond repair, the electrical gear where the utility service enters the building as well as the transfer switch and main distribution panel that feeds the first floor of the data center."
Because the transfer switch and distribution panel were damaged beyond repair, we are running H1 Phase I from a temporary generator, while Phase II is being powered by our permanent generator. We tested the temporary generator extensively prior to bringing it into service, and we did not find any indication of the faulty breaker.
Our facilities group is working with the generator contractors to repair the faulty breaker as soon as possible.
Brooke-Sales
Jun 3 2008, 08:33 AM
We are still working to repair the faulty generator breaker, and to restore power to H1 Phase 1 as quickly as possible. An additional update regarding the EV1Servers nameservers will follow shortly.
Brooke-Sales
Jun 3 2008, 09:25 AM
A message from our Customer Support Team:
Dear Customer:
This morning at approximately 2:45 a.m. CST, the temporary generator supplying power to the servers and environmental control systems located in Phase 1 of our H1 facility shut down. This was caused by some faulty current sensors in the output breaker. The sensors detected an out of balance current condition that did not exist.
Technicians from the generator company were onsite within 15 minutes. After working on the breaker for an hour, they believed the issue was remedied, and the generator was restarted. As the servers and environmental control systems were brought back online, the breaker again caused the generator to trip offline.
At this time we have a replacement breaker en route to the site and will get power restored as soon as physically possible.
We understand the difficult situation this causes for our customers. As such, we are offering to move all H1 Phase 1 customers to our H2 data center here in Houston. This requires physically moving servers to our data center, which is approximately three miles away from the H1 data center. It also requires IP address changes for all servers relocated to H2.
We will take move requests on a first-come, first-served basis. We will need the customer name and customer ID (for example C13572). You'll also need to update your root/administrative password in Orbit or ServerCommand prior to submitting your ticket. To request a move for your server from H1 Phase 1 to H2, please log into Orbit or ServerCommand and click on the link for Manual Reboot Request. In the summary input box, include 'H1 Phase I Server Move Request', and put the hardware object ID or server IP address in the description input box.
Estimated time to have your server moved depends on the volume of requests we have. We have additional staff on hand to begin the process immediately.
To make this request, please submit a Manual Reboot Request via ServerCommand or Orbit, and we will begin processing requests immediately.
Regards,
The Planet Customer Support
Brooke-Sales
Jun 3 2008, 10:13 AM
We are still working to update the DNS Zone files on ns1, ns2, ns5 and ns6.ev1servers.net. This process has been delayed by the generator breaker failure at H1 Phase 1, and a new expected resolution time has been set for 3:00PM central.
Brooke-Sales
Jun 3 2008, 10:42 AM
Fixing the faulty breaker on the generator powering H1 Phase 1 was not successful. We have located a second generator that is currently being delivered to the facility. It is expected to arrive this afternoon, and we will provide additional information regarding the new generator at that time.
Brooke-Sales
Jun 3 2008, 12:08 PM
Regarding the reported DNS zone file issues for ns1, ns2, ns5 and ns6.ev1servers.net, the zone files have been synced. We are in the process of taking these servers offline and reloading them in groups. The ETA for resolution is still set for 3:00PM central.
Brooke-Sales
Jun 3 2008, 01:08 PM
There is no new information to report at this time. All previously reported ETAs are still on track. Additional information regarding the new H1 Phase 1 generator will be available once the new generator arrives.
Brooke-Sales
Jun 3 2008, 02:05 PM
All DNS Zone files for ns1, ns2, ns5 and ns6 are completely updated as of the information that was available 5:30PM Central Saturday. All DNS servers have been rebooted and BIND has been restarted.
Brooke-Sales
Jun 3 2008, 02:19 PM
The new H1 generator has arrived on site. For the next hour we will be pumping fuel out of the old generator and into the new one, and performing tests on the new generator. More news will be posted once testing is complete.
Brooke-Sales
Jun 3 2008, 03:19 PM
We are continuing to test the new H1 Phase 1 generator. We will post additional information as testing progresses.
Brooke-Sales
Jun 3 2008, 04:09 PM
We have received reports of some DNS queries coming back with errors. Some of the nameservers in the nameserver farm were returning inaccurate information. Those nameservers have been removed from the farm. All DNS queries should return correctly now.
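The post does not say how the inaccurate nameservers were identified. One possible approach, sketched here purely as an assumption, is to ask every server in the farm the same question and flag any server whose answer differs from the majority. The function name, server labels and sample answers below are hypothetical (the addresses come from the reserved documentation range 192.0.2.0/24), and the actual DNS lookups are left out so the example stays self-contained.

```python
from collections import Counter


def find_outliers(answers):
    """Return servers whose answer differs from the most common answer.

    `answers` maps a nameserver label to the set of records it returned.
    """
    if not answers:
        return []
    # Count how many servers returned each distinct record set.
    counts = Counter(frozenset(records) for records in answers.values())
    majority, _ = counts.most_common(1)[0]
    return sorted(name for name, records in answers.items()
                  if frozenset(records) != majority)


if __name__ == "__main__":
    # Hypothetical responses for one query against four farm members.
    sample = {
        "ns-a": {"192.0.2.10"},
        "ns-b": {"192.0.2.10"},
        "ns-c": {"192.0.2.99"},  # disagrees with the others
        "ns-d": {"192.0.2.10"},
    }
    print(find_outliers(sample))
```

A majority vote only works when most of the farm is correct; if the farm is evenly split, the comparison has to be made against a known-good source instead.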
Brooke-Sales
Jun 3 2008, 04:31 PM
Initial testing of the H1 Phase 1 generator is complete, and the generator is now connected to the facility. We are now performing load testing of the generator and we expect to deliver power to the facility within the next several hours, pending successful load testing.
Brooke-Sales
Jun 3 2008, 04:39 PM
Testing of the H1 Phase 1 generator went remarkably well and faster than expected. We are now bringing customer servers online in batches.
Brooke-Sales
Jun 3 2008, 04:46 PM
Some customers have noted that several of their servers show at least one day of uptime. This is due to the fact that a portion of Phase 1 is powered off of a separate generator. However, these servers were not accessible because the network is powered off of a second generator.
Brooke-Sales
Jun 3 2008, 05:41 PM
Though we have successfully updated the zone records using the most recent pre-outage information, customers are not yet able to add new zones to the DNS. An update will be posted once it is possible to add new zone records to the DNS.
Kevin Hazard
Jun 3 2008, 08:18 PM
Our data center staff is responding to the reboot requests and hardware tickets that have been submitted for servers in H1. Due to the ticket loads, we have expanded staff coverage in the DCs through the night to take care of each request as soon as possible.
Kevin Hazard
Jun 3 2008, 10:05 PM
Doug Erwin, chairman and CEO of The Planet, has a new message tonight for our customers about our data center outage and our response over the past 24 hours:
http://service-update.theplanet.com/Erwin-msg6-3-08.wav.
AaronC
Jun 3 2008, 11:25 PM
Power has been restored across the datacenter, but not all servers are online. Some will need to be power cycled, some may need to be evaluated for physical problems. There is power available to all servers, but not all servers are automatically powered on.
DC Ops staff are going server by server to make sure servers are powering on and booting up.
Kevin Hazard
Jun 4 2008, 01:10 AM
We've updated the service-update page to include a transcript of Doug's message tonight for those who are unable to listen to the wav file.
We will continue to post status updates every six hours or as new developments warrant.
Kevin Hazard
Jun 4 2008, 12:49 PM
Customers have reported problems with the NTT resolvers we referenced previously. While we are finalizing the restoration of H1's resolvers, please use the IPs below:
D6 (The Planet Data Center):
216.185.111.10
69.56.222.10
D5 (The Planet Data Center):
70.84.160.10
70.84.161.10
D4 (The Planet Data Center):
67.19.0.10
67.19.1.10
Level3:
resolver1.level3.net [209.244.0.3]
resolver2.level3.net [209.244.0.4]
AT&T:
68.94.156.1
68.94.157.1
Abovenet:
ns.above.net [129.250.35.250]
ns3.above.net [129.250.35.251]
Ticket queue loads are still high, but our support chat and phone queues are almost empty, and representatives are available to take your call. We have a request in to our data center operations group for up-to-date information about reboot/outage response times. As soon as that information is available, it will be posted here.
Kevin Hazard
Jun 4 2008, 02:43 PM
Data Center Operations response update: We're processing the work required to bring servers online right now. Any extraneous scheduled work (upgrades, reloads unrelated to the outage, etc.) is being postponed until we're caught up with outages.
We presently have hundreds of tickets, dating back to the afternoon of June 2 (only in H1), and we're working on them from oldest to newest. In our other data centers, we have been able to respond to all tickets normally.
Given the level of complexity in some of the tickets we are working, we can't guarantee that H1 ops will be completely caught up on responding to H1's tickets by tonight. We have more technicians inbound from Dallas to help, and we're rotating volunteers in and out to keep people from being exhausted to the point that they can't come in for their normal shifts. We anticipate that the overnight shift will continue to speed the rate at which we are catching up, as the normal workload is lighter ... Currently, we're receiving the normal workload on top of resolving the backlog.
We just received over 200 power supplies and various necessary hardware components so we can continue replacing those that have failed.
Kevin Hazard
Jun 4 2008, 04:45 PM
Updated list of resolvers:
D5:
70.84.160.10
70.84.161.10
D6:
216.185.111.10
69.56.222.10
D4:
67.19.0.10
67.19.1.10
Global Crossing:
dns1.snv.gblx.net [67.17.215.132]
dns2.snv.gblx.net [67.17.215.133]
dns1.phx.gblx.net [206.165.6.11]
dns2.phx.gblx.net [206.165.6.12]
dns1.jfk.gblx.net [64.212.106.84]
dns2.jfk.gblx.net [64.212.106.85]
dns1.roc.gblx.net [209.130.136.2]
dns2.roc.gblx.net [209.130.139.2]
Level3:
resolver1.level3.net [209.244.0.3]
resolver2.level3.net [209.244.0.4]
AT&T:
68.94.156.1
68.94.157.1
Abovenet:
ns.above.net [129.250.35.250]
ns3.above.net [129.250.35.251]
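On Linux and other Unix-like servers, a resolver list like the one above is typically applied through /etc/resolv.conf. The sketch below is only an illustration, not an official procedure from The Planet: the helper name `render_resolv_conf` is made up for this example, and it assumes the usual glibc behaviour of reading at most three `nameserver` entries. The addresses are taken from the list above.

```python
# Hypothetical helper (not from The Planet): build resolv.conf lines
# from resolver IPs taken from the list above.
import ipaddress

MAX_NAMESERVERS = 3  # glibc reads at most three "nameserver" entries


def render_resolv_conf(resolvers):
    """Return resolv.conf text for the first few valid, unique IPs."""
    seen = []
    for raw in resolvers:
        # ip_address() raises ValueError on anything that is not an IP.
        ip = str(ipaddress.ip_address(raw.strip()))
        if ip not in seen:
            seen.append(ip)
    chosen = seen[:MAX_NAMESERVERS]
    return "".join(f"nameserver {ip}\n" for ip in chosen)


if __name__ == "__main__":
    # The two D5 resolvers plus one Level3 resolver, as listed above.
    print(render_resolv_conf(["70.84.160.10", "70.84.161.10", "209.244.0.3"]), end="")
```

Mixing resolvers from more than one provider, as in the usage example, means a single provider's outage does not leave the server without DNS.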
Todd Mitchell
Jun 4 2008, 07:55 PM
Doug Erwin, chairman and CEO of The Planet, has a new message tonight for our customers about our data center outage and our response over the past 24 hours:
http://service-update.theplanet.com/Erwin-msg6-4-08.wav.
Kevin Hazard
Jun 5 2008, 11:09 AM
Many servers are requiring OS Reloads, power supply swaps, or chassis swaps. We also had an issue where the VLAN configuration of many servers was lost, which we had to work with network engineering to resolve. All our Level 1 techs are knocking out the reboots and anything that takes less than 5 mins. The rest we're escalating to our Level 2s, since those requests can get more complex and the Level 2s are trained to handle them fairly quickly.
We have been systematically going through the tickets, attempting to respond to the oldest tickets first, but at this time there are still a few tickets from Monday and Tuesday which we have not yet been able to resolve. Most of the requests are for reboots, but this is rarely what is needed to resolve the situation.
We understand how time critical it is to get these servers online and we will not stop until every request has been answered and resolved.
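The triage described above (oldest tickets first, quick jobs handled by Level 1, everything else escalated to Level 2) can be sketched roughly as follows. This is an illustration only, not The Planet's actual ticketing system: the `Ticket` fields, the `triage` function and the sample tickets are all hypothetical. Only the five-minute threshold comes from the post.

```python
from dataclasses import dataclass
from datetime import datetime

LEVEL1_MAX_MINUTES = 5  # "anything that takes less than 5 mins" stays with Level 1


@dataclass
class Ticket:
    ticket_id: str          # hypothetical identifier
    opened: datetime        # when the customer submitted the ticket
    estimated_minutes: int  # hypothetical estimate of hands-on time


def triage(tickets):
    """Split tickets into Level 1 and Level 2 queues, oldest first in each."""
    level1, level2 = [], []
    for ticket in sorted(tickets, key=lambda t: t.opened):
        if ticket.estimated_minutes < LEVEL1_MAX_MINUTES:
            level1.append(ticket)
        else:
            level2.append(ticket)
    return level1, level2


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        Ticket("T-3", datetime(2008, 6, 4, 9, 0), 2),     # reboot
        Ticket("T-1", datetime(2008, 6, 2, 17, 30), 45),  # OS reload
        Ticket("T-2", datetime(2008, 6, 3, 8, 15), 3),    # reboot
    ]
    level1, level2 = triage(sample)
    print([t.ticket_id for t in level1])
    print([t.ticket_id for t in level2])
```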
Kevin Hazard
Jun 5 2008, 04:13 PM
As of 5 pm today we have power to 100% of the servers and approximately 95% of the customers restored. All permanent and temporary generators continue to function normally. The new electrical distribution gear is onsite and being set in place. The parking lot is being torn up to accommodate all the new conduit runs required. We now have a spare generator in place to back up the generator running the downstairs phase, and we have taken the initiative to pre-cable it in the event that we must switch over to the spare.
New developments:
As reported yesterday, we have decided to replace 100% of the conduit under the parking lot feeding H1. This might be overkill, but at this point in time we would rather be safe than sorry. We have brought in two new generators to assist in the process and will be firing them up within the hour. We need only one generator to carry the load, but have decided a backup would be prudent.
More to come later.
Kevin Hazard
Jun 5 2008, 06:32 PM
Customer Support Overview (June 5, 7:00pm CDT):
Technical Support Phone: No Calls on Hold
Technical Support Chat Hold Time: ~30 minutes
H1 DC Ops Queue Load: 380 tickets
The H1 Queue Load in the past 2 days:
June 4, 8:00am CDT: 800 tickets
June 4, 9:00pm CDT: 600 tickets
June 5, 10:30am CDT: 464 tickets
June 5, 5:00pm CDT: 380 tickets
Our double shifts in the H1 data center have decreased the outstanding ticket volume considerably. With the decreased volume of new tickets at night (CDT), we will continue to improve technical support ticket response times and eliminate chat hold time.
Aaron Chernosky
Jun 5 2008, 10:00 PM
Customer Support Overview (June 5, 11:00pm CDT)
Technical Support Phone: No Calls on Hold
Technical Support Chat Hold Time: ~9 minutes
H1 DC Ops Queue Load: 209 tickets
The H1 Queue Load in the past 2 days:
June 4, 8:00am CDT: 800 tickets
June 4, 9:00pm CDT: 600 tickets
June 5, 10:30am CDT: 464 tickets
June 5, 5:00pm CDT: 380 tickets
June 5, 11:00pm CDT: 209 tickets
draxisreborn
Jun 6 2008, 01:57 AM
Customer Support Overview (June 6, 3:00am CDT)
Technical Support Phone: No Calls on Hold
Technical Support Chat Hold Time: ~31 minutes
H1 DC Ops Queue Load: 90 tickets
The H1 Queue Load in the past 2 days:
June 4, 8:00am CDT: 800 tickets
June 4, 9:00pm CDT: 600 tickets
June 5, 10:30am CDT: 464 tickets
June 5, 5:00pm CDT: 380 tickets
June 5, 11:00pm CDT: 209 tickets
June 6, 3:00am CDT: 90 tickets
draxisreborn
Jun 6 2008, 05:46 AM
June 6 – 7:00am CDT
Customer Support Overview (June 6, 7:00am CDT)
Technical Support Phone: No Calls on Hold
Technical Support Chat Hold Time: ~0 minutes
H1 DC Ops Queue Load: 64 tickets
The H1 Queue Load in the past 3 days:
June 4, 8:00am CDT: 800 tickets
June 4, 9:00pm CDT: 600 tickets
June 5, 10:30am CDT: 464 tickets
June 5, 5:00pm CDT: 380 tickets
June 5, 11:00pm CDT: 209 tickets
June 6, 3:00am CDT: 90 tickets
June 6, 7:00am CDT: 64 tickets
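The snapshots above give a rough picture of how quickly the H1 queue has been shrinking. As a small worked example (not an official metric), the sketch below computes the net change in open tickets per hour between consecutive snapshots. The function name is an assumption for this illustration, and the figure is a net rate: new tickets were still arriving, so the number actually closed per hour would have been at least this high.

```python
from datetime import datetime

# Queue snapshots (CDT) copied from the list above.
SNAPSHOTS = [
    (datetime(2008, 6, 4, 8, 0), 800),
    (datetime(2008, 6, 4, 21, 0), 600),
    (datetime(2008, 6, 5, 10, 30), 464),
    (datetime(2008, 6, 5, 17, 0), 380),
    (datetime(2008, 6, 5, 23, 0), 209),
    (datetime(2008, 6, 6, 3, 0), 90),
    (datetime(2008, 6, 6, 7, 0), 64),
]


def net_drop_per_hour(snapshots):
    """Return the net decrease in open tickets per hour for each interval."""
    rates = []
    for (t0, n0), (t1, n1) in zip(snapshots, snapshots[1:]):
        hours = (t1 - t0).total_seconds() / 3600
        rates.append((n0 - n1) / hours)
    return rates


if __name__ == "__main__":
    for (start, _), rate in zip(SNAPSHOTS, net_drop_per_hour(SNAPSHOTS)):
        print(f"from {start:%b %d %H:%M}: {rate:.1f} tickets/hour net")
```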
Kevin Hazard
Jun 6 2008, 09:01 AM
We have lost network connectivity to H1. We are confirming the extent of any power loss, and we will be updating shortly.
Kevin Hazard
Jun 6 2008, 09:06 AM
Transport for H1 temporarily fell offline and is restored. H1 Phase 2 did not lose power. H1 Phase 1 lost power. We will be updating again shortly.
Kevin Hazard
Jun 6 2008, 09:12 AM
The temporary generator powering Phase 1 failed. We switched over to the backup generators that were just brought in. The CRAC units have been powered on, and PDUs are having power restored right now.
Kevin Hazard
Jun 6 2008, 09:24 AM
Power has been restored completely to Phase 1. Our DC Ops team will be walking through the aisles to confirm all racks are online.
Kevin Hazard
Jun 6 2008, 09:35 AM
Automated OS Reloads are not currently functioning in H1. We will provide an update as soon as we get more information.
Kevin Hazard
Jun 6 2008, 09:44 AM
Customer Support Overview (June 6, 10:45am CDT):
Technical Support Phone: ~3 minutes
Technical Support Chat Hold Time: ~20 minutes
Kevin Hazard
Jun 6 2008, 10:04 AM
Customer Support Overview (June 6, 11:00am CDT):
Technical Support Phone: ~1 minute
Technical Support Chat Hold Time: ~30 minutes
Kevin Hazard
Jun 6 2008, 10:29 AM
Customer Support Overview (June 6, 11:30am CDT):
Technical Support Phone: No Hold Time
Technical Support Chat Hold Time: ~30 minutes
Kevin Hazard
Jun 6 2008, 10:49 AM
Automated OS Reloads are currently back online. Any OS Reloads that were in progress in H1 have been restarted.
Kevin Hazard
Jun 6 2008, 11:09 AM
H1 DC Ops Queue Load: ~120 tickets
If any H1 customers are still experiencing downtime, please submit an outage ticket in ServerCommand or Orbit. In addition to the team responding to tickets, we have a few people proactively walking through the aisles to replace power supplies on servers as necessary.
Kevin Hazard
Jun 6 2008, 12:14 PM
We've been working to get all servers back online as fast as possible. For the majority of our customers, this involved a simple reboot once power was restored. For some, there has been severe damage to their hard drives as a result of the power loss in the H1 data center.
To assist customers whose drives have been affected, we have arranged to ship the devices to Data Recovery Systems (DRS), a leading provider of hard-drive recovery services. DRS will attempt to recover the data. Once this has been done, we will reinstall the drive.
We have no idea how many customers will want us to do this, and we have no idea how long it will take DRS to recover what they can. For customers who would like for us to proceed with this process, please submit a ticket and we will begin taking action tomorrow.
Our goal is to stand by our customers during this difficult time. As such, there will be no charge to customers for this service, and The Planet will absorb the costs.
Aaron Chernosky
Jun 6 2008, 04:56 PM
Customer Support Overview (June 6, 6:00pm CDT)
Technical Support Phone: No Calls on Hold
Technical Support Chat Hold Time: ~50 minutes
H1 DC Ops Queue Load: 55 tickets
Aaron Chernosky
Jun 6 2008, 10:02 PM
Customer Support Overview (June 6, 11:00pm CDT)
Technical Support Phone: No hold time
Technical Support Chat Hold Time: No hold time
H1 DC Ops Queue Load: 40 tickets
draxisreborn
Jun 7 2008, 02:04 AM
Customer Support Overview (June 7, 3:00am CDT):
Technical Support Phone: No Hold Time
Technical Support Chat Hold Time: ~0 minutes
H1 DC Ops Queue Load: ~33 tickets