Tomy Durden
May 31 2008, 05:29 PM
The Planet is currently experiencing an outage which is affecting a number of customers' servers. This issue may also be affecting customers' ability to get through to our call center.
We are doing everything we can to remedy this problem as quickly as possible, and we will place an updated notice here as soon as there is new information.
Kevin Hazard
May 31 2008, 06:36 PM
Today at approximately 5:45 p.m., a transformer in our H1 data center in Houston caught fire, thus requiring us to take down all generators as instructed by the fire department. All servers are down.
We are working with the fire department, with our facilities staff on site, to assess the situation.
To keep you updated, we will send messages every 15 minutes in Orbit and here in our forum.
Kevin Hazard
May 31 2008, 06:51 PM
We have determined that no servers in the data center have been damaged. Nonetheless, they are down because power is out. Teams across the board are working to take appropriate action.
We will continue to keep you updated.
Kevin Hazard
May 31 2008, 07:03 PM
We have no additional updates at this time. Our team is still evaluating the time required to bring all affected servers online as soon as possible.
Kevin Hazard
May 31 2008, 07:17 PM
In our latest assessment, we have determined that networking gear has not been damaged, but we are without power so assessments continue. All disaster recovery systems are in motion, and we have teams already working in the data center.
Kevin Hazard
May 31 2008, 07:32 PM
We have no additional updates at this time. Our networking, technology, support, and facilities teams are still working to restore power to all affected customer servers.
Kevin Hazard
May 31 2008, 07:43 PM
The ServerCommand customer portal is down, so please contact our customer support team if you have questions. We have begun moving the ServerCommand infrastructure and will keep you updated as it comes online.
Kevin Hazard
May 31 2008, 08:06 PM
We have no additional updates at this time. In a post coming shortly, you can expect full details about the incident.
Kevin Hazard
May 31 2008, 08:23 PM
Our entire team will be convening at 9:30pm CDT to consolidate a status report. I hope to pass along that information soon thereafter.
Kevin Hazard
May 31 2008, 08:54 PM
The senior managers are still meeting about the incident. I will update the thread as soon as I speak with them.
Kevin Hazard
May 31 2008, 09:46 PM
From Doug Erwin:
This evening at 4:55pm CDT in our H1 data center, electrical gear shorted, creating an explosion and fire that knocked down three walls surrounding our electrical equipment room. Thankfully, no one was injured. In addition, no customer servers were damaged or lost.
We have just been allowed into the building to physically inspect the damage. Early indications are that the short was in a high-volume wire conduit. We were not allowed to activate our backup generator plan based on instructions from the fire department.
This is a significant outage, impacting approximately 9,000 servers and 7,500 customers. All members of our support team are in, and all vendors who supply us with data center equipment are on site. Our initial assessment, although early, points to being able to have some service restored by mid-afternoon on Sunday. Rest assured we are working around the clock.
We are in the process of communicating with all affected customers. We are planning to post updates every hour via our forum and in our customer portal. Our interactive voice response system is updating customers as well.
There is no impact in any of our other five data centers.
I am sorry that this accident has occurred and I apologize for the impact.
Kevin Hazard
May 31 2008, 10:16 PM
Because the management servers were located in H1, services provided by ResellOne, Legacy EV1Servers domain management, and retail SSL have also been affected by the outage. Domain and SSL functionality in Orbit has proven to be unaffected by the outage.
Kevin Hazard
May 31 2008, 10:58 PM
As you know, we have vendors onsite at the H1 data center. With their help, we’ve created a list of equipment that will be required, and we’re already dealing with those manufacturers to find the gear. Since it’s Saturday night, we do have a few challenges.
We are prioritizing issues as follows:
- Getting the network up at H1 is first and foremost. We’re pulling components from our five other data centers – including Dallas – which will be an all-night effort.
- Getting power back to the data center is key, though it is too early to establish success there.
- Because ServerCommand is in H1, our legacy EV1 customers are blinded about this incident. We are in the process of moving the ServerCommand servers to other Houston data centers so that we’re able to loop them into communications.
- We absolutely intend to live up to our SLA agreements, and we will proactively credit accounts once we understand full outage times. Right now, getting customers back online is the most critical.
Todd Mitchell
Jun 1 2008, 12:56 AM
Hello,
Our various departments continue to work very hard to restore service to our H1 data center. We expect to have ServerCommand access restored within the next several hours as well as access to single homed nameservers out of H1 redirected to another facility.
We will continue to provide status updates once an hour. We do not have an Estimated Time to Repair at present.
Todd
Todd Mitchell
Jun 1 2008, 02:24 AM
Ladies/Gents,
Our facilities team continues to work with our vendors at the H1 site to restore service to affected clients. We do not have an Estimated Time to Repair at present. Our staff and management continue to work through the night and we will continue to provide hourly updates.
Todd
Todd Mitchell
Jun 1 2008, 03:38 AM
We continue to work through the early morning hours to restore service. We have vendors on-site working with our facilities group.
We do not have an Estimated Time to Repair at present. Our staff and management continue to work through the night and we will continue to provide hourly updates.
Todd
Todd Mitchell
Jun 1 2008, 04:40 AM
Our UNIX and development team continue to work to restore service to both ServerCommand and EV1 DNS. Based on current information, 4 of the 8 DNS servers are in service and we expect the remaining DNS servers to come online within the next 180 minutes. The same approx. time line holds true for ServerCommand. The server farm has been relocated to another data center and development is currently working on bringing the services back online.
In terms of the facility, we do not have a firm ETR at the moment. Facilities continues to work with our on-site vendors to acquire replacement equipment and get it installed to bring service back online.
We do not have an Estimated Time to Repair at present. Our staff and management continue to work through the night and we will continue to provide hourly updates.
Todd
Todd Mitchell
Jun 1 2008, 05:54 AM
Morning,
We are continuing to work through various issues this morning. We will have additional contractors on-site this morning starting at approx. 7 AM. Some will hand-off from contractors who worked overnight and others will start the recovery/installation of new electrical gear to power the data center.
We are still working through the EV1 DNS and ServerCommand items. We are making progress on both items and expect to have both functional within the next 120 minutes.
In addition to the above, the network engineering group worked overnight to prepare the network for the recovery of H1. We expect the reconvergence of the network to go smoothly once H1 comes back online.
We do not have an Estimated Time to Repair at present; we should have a better estimate this morning. Our staff and management continue to work through the night and morning-- we will continue to provide hourly updates.
Todd
Todd Mitchell
Jun 1 2008, 07:54 AM
Hello,
The team here at The Planet continues to work through the various issues that we continue to encounter. We are still making progress on the previous items that I mentioned in my last post. DNS infrastructure has been migrated to another data center and propagation has begun. We are working through some database issues with ServerCommand and fully expect those to be resolved within the next hour.
I’d also like to address the idea of migrating from one data center to another. During the early stages of the H1 data center outage, we opportunistically relocated some customers to another data center. However, due to network and data center (power/cooling) constraints, this option is no longer available and requests for migration cannot be honored. Please rest assured that our teams are working diligently to return service to all affected customers.
We do not have an Estimated Time to Repair at present; we should have a better estimate this morning. Our staff and management continue to work through the night and morning-- we will continue to provide hourly updates.
Todd
uvashi
Jun 1 2008, 09:11 AM
More teams from The Planet are coming along with more contractors from key vendors for electrical and facilities to help get H1 online. At this time, DNS infrastructure continues to propagate. ServerCommand servers are installed, but the teams are making sure all networking is intact and ready. In the meantime, please call our support lines for any issues. Additional support techs are available.
At this time, we do not have an updated Estimated Time to repair. Please continue following this thread for updates.
-----
Urvish Vashi
Director, Product Management
The Planet
uvashi
Jun 1 2008, 11:24 AM
To keep you up-to-date, here is the latest information about the outage in our H1 data center.
We expect to be able to provide initial power to parts of the H1 data center beginning at 5:00 p.m. CDT. At that time, we will begin testing and validating network and power systems, turning on air-conditioning systems and monitoring environmental conditions. We expect this testing to last approximately four hours.
Following this testing, we will begin to power-on customer servers in phases. These are approximate times, and as we know more, we will keep you apprised of the situation.
We will update you again around 2:30 p.m. this afternoon.
-----
Urvish Vashi
Director, Product Management
The Planet
uvashi
Jun 1 2008, 12:44 PM
The networking teams are ensuring connectivity to bring ServerCommand back online. Please expect another update on ServerCommand shortly. We are seeing fewer DNS issues as the new addresses continue to propagate.
Our primary focus is to hit our 5:00pm CDT initial power test, and all necessary staff are onsite and working diligently to hit this deadline. Additional staff and spare server hardware are being delivered on site in preparation for bringing customer servers online pending a successful power test.
-----
Urvish Vashi
Director, Product Management
The Planet
uvashi
Jun 1 2008, 01:30 PM
We are continuing to pursue plans as noted in our last message, and we have no additional updates at this time. At 4:30, we will plan to issue another message.
-----
Urvish Vashi
Director, Product Management
The Planet
uvashi
Jun 1 2008, 02:31 PM
As you may have already noticed, our forum servers continue to lag due to very heavy load. This is in part due to the fact that our outage is now being carried on several sites (including Slashdot). Even though we added servers to our forums last night, we are looking at alternatives at this time to provide simple status updates quickly.
We are still working on getting all management systems up, but Legacy EV1 domain customers can access a backup management panel hosted by Tucows at
https://manage.opensrs.net. There are some limitations to these backup systems: changes may be restricted if the domain is locked, and use may be intermittent until the servers hosting the main domain management systems in H1 are back online.
We have no other update at this time regarding our planned power test at 5:00pm.
Regards,
------
Urvish Vashi
Director, Product Management
The Planet
uvashi
Jun 1 2008, 04:02 PM
We continue to pursue our plans to provide initial power to our H1 data center this evening. It will take several hours to assure power can be safely restored to the facility. Based on how the initial work goes, we will have more information to provide you in the upcoming hours. We will post another update by 7:30 pm tonight.
In the meantime, we have further rerouted both old and new IP addresses for our name servers previously housed in H1. This means that the servers can start resolving IP addresses on both their former and new addresses, and this will alleviate the issues we have been seeing with propagation delay from this address change.
As always, please continue to monitor this thread or contact our support teams if you have any questions.
------
Urvish Vashi
Director, Product Management
The Planet
Kevin Hazard
Jun 1 2008, 06:47 PM
We continue to work to restore power to the data center and bring all affected customer servers online.
Currently, ServerCommand is back online.
https://www.servercommand.net
Kevin Hazard
Jun 1 2008, 10:14 PM
As previously committed, I would like to provide an update on where we stand following yesterday's explosion in our H1 data center. First, I would like to extend my sincere thanks for your patience during the past 28 hours. We are acutely aware that uptime is critical to your business, and you have my personal commitment that The Planet team will continue to work around the clock to restore your service.
As you have read, we have begun receiving some of the equipment required to start repairs. While no customer servers have been damaged or lost, we have new information that damage to our H1 data center is worse than initially expected. Three walls of the electrical equipment room on the first floor blew several feet from their original position, and the underground cabling that powers the first floor of H1 was destroyed.
There is some good news, however. We have found a way to get power to Phase 2 (upstairs, second floor) of the data center and to restore network connectivity. We will be powering up the air conditioning system and other necessary equipment within the next few hours. Once these systems are tested, we will begin bringing the 6,000 servers online. It will take four to five hours to get them all running.
We have brought in additional support from Dallas to have more hands and eyes on site to help with any servers that may experience problems. The call center has also brought in double staff to handle the increase in tickets we're expecting. Hopefully by sunrise tomorrow Phase 2 will be well on its way to full production.
Let me next address Phase 1 (first floor) of the data center and the affected 3,000 servers. The news is not as good, and we were not as lucky. The damage there was far more extensive, and we have a bigger challenge that will require a two-step process. For the first step, we have designed a temporary method that we believe will bring power back to those servers sometime tomorrow evening, but the solution will be temporary. We will use a generator to supply power through next weekend when the necessary gear will be delivered to permanently restore normal utility power and our battery backup system. During the upcoming week, we will be working with those customers to resolve issues.
We know this may not be a satisfactory solution for you and your business but at this time, it is the best we can do.
We understand that you will be due service credits based on our Service Level Agreement. We will proactively begin providing those following the restoration of service, which is our number one priority, so please bear with us until this has been completed.
I recognize that this is not all good news. I can only assure you we will continue to utilize every means possible to fully restore service.
I plan to have an audio update tomorrow evening.
Until then,
Douglas J. Erwin
Chairman & Chief Executive Officer
To centralize communication for easy access, you can check
http://service-update.theplanet.com/ for additional updates.
Tomy Durden
Jun 1 2008, 10:53 PM
If you're using Orbit, go to your hardware description (https://orbit.theplanet.com/nav_hardware/a3_server_details.htm?hw_id=) page and look for the following:
Hardware Object's Upstream Connection:aj31b.01.dllstx6 (10.6.201.94) Port: FastEthernet0/11
(switch).(phase).(data center)
switch: aj31b
phase: 01
data center: dllstx6
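For anyone checking more than a few servers, the breakdown above can be sketched in a few lines of Python. This is only an illustration, assuming every upstream connection string follows the same "(switch).(phase).(data center) (IP) Port: (port)" layout as the one example shown; the names Upstream and parse_upstream are made up for this sketch and are not part of Orbit.

```python
import re
from typing import NamedTuple


class Upstream(NamedTuple):
    """Fields of an upstream connection string (hypothetical helper type)."""
    switch: str
    phase: str
    data_center: str
    ip: str
    port: str


# Assumed layout, based on the single example in the post:
#   <switch>.<phase>.<data center> (<ip>) Port: <port>
_UPSTREAM_RE = re.compile(
    r"(?P<switch>[^.\s]+)\.(?P<phase>[^.\s]+)\.(?P<data_center>[^.\s]+)"
    r"\s*\((?P<ip>[^)]+)\)\s*Port:\s*(?P<port>\S+)"
)


def parse_upstream(line: str) -> Upstream:
    """Split an upstream connection line into its named parts."""
    # Drop the "Hardware Object's Upstream Connection:" label if it was pasted in.
    _, _, value = line.rpartition("Connection:")
    match = _UPSTREAM_RE.search(value)
    if match is None:
        raise ValueError(f"unrecognized upstream connection string: {line!r}")
    return Upstream(**match.groupdict())


if __name__ == "__main__":
    example = (
        "Hardware Object's Upstream Connection:"
        "aj31b.01.dllstx6 (10.6.201.94) Port: FastEthernet0/11"
    )
    info = parse_upstream(example)
    print(f"switch: {info.switch}")
    print(f"phase: {info.phase}")
    print(f"data center: {info.data_center}")
```

The phase and data center fields are the ones to compare against the status updates in this thread.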
Kevin Hazard
Jun 1 2008, 11:55 PM
After the fire marshal inspected the H1 location, we were given the green light to bring power back to the facility. The generators have been turned on, and we are receiving power on the second floor. The generator power restoration is the first step in the full restoration of service to the data center.
From here, we will begin the process of cooling the DC floor, which could take a few hours. As soon as the power integrity is confirmed and the DC floor is ready for operation, we will be restoring power and checking server hardware on a rack-by-rack basis.
Kevin Hazard
Jun 2 2008, 12:55 AM
Following the restoration of power to the second floor of the data center, we've cooled the data center floor and are now in the process of systematically restoring power to racks.
We've got a full staff in the data center to power up racks in sections and verify that the server hardware starts up successfully. This process may take a few hours to restore service to all customer servers on the second floor.
Kevin Hazard
Jun 2 2008, 02:05 AM
We are continuing the process of turning on and verifying hardware integrity of customer servers on the second floor of H1.
Our network operations team is currently working on the ev1servers.net nameservers to ensure that they are online, are routed to correctly, and propagate as quickly as possible.
ServerCommand is currently online and accessible.
Kevin Hazard
Jun 2 2008, 02:37 AM
As we continue to restore power to customer servers on the second floor, several customers have reported intermittent losses of connectivity. This connectivity loss is due to the balancing of network gear in the data center and is unrelated to power.
Kevin Hazard
Jun 2 2008, 03:37 AM
Our network engineers have been working on the ev1servers.net nameservers. Currently, the nameservers are visible to the majority of the Internet, and we hope to have complete visibility very soon.
Kevin Hazard
Jun 2 2008, 04:46 AM
We've made significant progress in restoring customer servers on the second floor (phase 2) of H1. The data center staff is still in the process of verifying that servers booted appropriately and is troubleshooting any that have not yet come online.
Per Doug's earlier message, we are still on target to restore service to the first floor (phase 1) by this evening.
Kevin Hazard
Jun 2 2008, 05:43 AM
Our network engineers and Unix IS teams are working to restore service to the following H1 resolvers: 207.218.192.38 and 207.218.192.39
Kevin Hazard
Jun 2 2008, 06:39 AM
With the start of another official workday (though a large number of people on the team have been working through the night), we are poised for a significant amount of work on both phases of our H1 data center. We have a full team of people on site working to ensure our targets are met to restore power to H1, Phase 1 (first floor).
Brooke-Sales
Jun 2 2008, 08:19 AM
We now have 90% of servers located on the second floor of H1 online. Support technicians are on location to manually bring the remaining 10% online.
Brooke-Sales
Jun 2 2008, 09:04 AM
Our network engineers are currently working on the resolvers. ETA for resolution is unknown at this time.
Brooke-Sales
Jun 2 2008, 09:30 AM
We now have offsite resolvers our customers are welcome to use.
NTT x.ns.verio.net 129.250.35.250
NTT y.ns.verio.net 129.250.35.251
Brooke-Sales
Jun 2 2008, 10:06 AM
Onsite technicians are currently working to restore service to the remaining 10% of Phase 2 upstairs servers. The Phase 1 downstairs servers are expected to start coming online late this evening.
Brooke-Sales
Jun 2 2008, 11:10 AM
There is no new information to report at this time.
Brooke-Sales
Jun 2 2008, 12:09 PM
We are still working to bring all of our resolvers back online; however, we have temporary alternative resolvers courtesy of NTT, one of our partners. Changing to these resolvers will help customers with problems such as sending e-mails and resolving domains.
NTT x.ns.verio.net 129.250.35.250
NTT y.ns.verio.net 129.250.35.251
H2 67.15.31.131
H2 66.98.240.131
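For customers on Linux or another Unix-like system, switching resolvers usually means editing /etc/resolv.conf. The sketch below is one way to generate those lines from the list above. It assumes a resolv.conf-style setup (the post does not say which operating systems are affected), and the function name render_resolv_conf is made up for illustration. It keeps only the first three entries because the standard resolver reads at most three nameserver lines.

```python
import ipaddress

# Temporary resolvers as listed in the post: (label, address).
TEMPORARY_RESOLVERS = [
    ("NTT x.ns.verio.net", "129.250.35.250"),
    ("NTT y.ns.verio.net", "129.250.35.251"),
    ("H2", "67.15.31.131"),
    ("H2", "66.98.240.131"),
]


def render_resolv_conf(resolvers, limit=3):
    """Return resolv.conf text for the first `limit` resolvers."""
    lines = []
    for label, address in resolvers[:limit]:
        # Raises ValueError if an address was mistyped.
        ipaddress.ip_address(address)
        lines.append(f"# {label}")
        lines.append(f"nameserver {address}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Review the output before copying it into /etc/resolv.conf.
    print(render_resolv_conf(TEMPORARY_RESOLVERS), end="")
```

Back up the existing file first so the original resolvers can be put back once the H1 resolvers return.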
Brooke-Sales
Jun 2 2008, 01:10 PM
We are working on getting initial power testing in H1 Phase 1. It is expected to begin within the next hour. The teams are working on a more detailed timeline. Expect a more detailed communication from our management team soon.
Brooke-Sales
Jun 2 2008, 01:59 PM
Power in H1 Phase 1 has been restored. We are starting to turn customer servers on in batches.
Brooke-Sales
Jun 2 2008, 03:11 PM
Customer servers are coming online now. We are doing a rack by rack physical check for any servers that need technical support.
Brooke-Sales
Jun 2 2008, 04:02 PM
There is no new information to report at this time.
Brooke-Sales
Jun 2 2008, 04:38 PM
A message from our CEO ...
Dear Customer,
Late last night, I told you we hoped to have power to the 6,000 servers in Phase 2 of our H1 data center by midnight, with all servers up by early morning. I am glad to say we came close, just a few hours after sunrise. At this time, 100% of our servers in Phase 2 have power, and our technicians are working with customers on any remaining server issues. We are confident all remaining issues will be resolved shortly.
I also explained the significant challenge we faced in the other phase where the actual explosion occurred. Our team came up with a creative way to restore power quicker than the 4-5 day outage. We decided not to wait for equipment for the electrical room completely, opting instead for a temporary solution to get power to the 3,000 servers. That solution involves using generator power for the next 10-12 days until all the new equipment arrives to rebuild the electrical room for Phase 1. I explained that we expected to have a temporary solution in place by midnight tonight, with servers powered up tomorrow. The good news is that as you read this letter, the power is restored, and the temporary solution is in effect. Within the next two hours, the remaining 3,000 servers will have power. We have overstaffed our data centers again to help during this initial power up.
This now leaves us facing step two of this process, which requires getting all of the equipment delivered and then rebuilding the electrical room to its original standard. To make the cutover to the rebuilt electrical room, the operations group believed it would take a maintenance outage of 24-48 hours. I have good news on that front. It's not perfect, but at present we now believe the maintenance window will be just 4-6 hours. That's still too long, and we will continue this week to find ways to reduce the time. Given that there will be some outage for the cutover, we will execute this step at midnight on a Saturday, either June 7 or June 14. We want to pick the most appropriate time to minimize impact to you.
I must admit that I am amazed. We are almost 18 hours ahead of schedule with this phase, thanks to our great suppliers and of course the great folks working here at The Planet. This could never have happened without the help of both, and I want to thank all of them.
There is still more work to do, but the progress is terrific. We will continue to work any and all customer issues, and we face the challenge of putting the permanent power fix in place for Phase 1. Nonetheless, there is still good news based on what I told you last night.
As each hour passes, we learn more and more. Please give us the time to continue our planning. We will provide you with information as we have it.
Until tonight's update….
Douglas J. Erwin
Chairman & Chief Executive Officer
Brooke-Sales
Jun 2 2008, 06:37 PM
There is no new information to report at this time. Updates will now be posted once every two hours instead of once an hour unless there is specific news to report.
Brooke-Sales
Jun 2 2008, 07:40 PM
We are aware that the zone files for ns1. and ns2.ev1servers.net are not completely in sync. We are systematically taking each of the load balanced nameservers offline to update each individual server's zone files. We expect re-syncing of the zone files to be complete early tomorrow morning.
Brooke-Sales
Jun 2 2008, 08:24 PM
Doug Erwin, chairman and CEO of The Planet, is providing a message tonight for our customers to offer additional insight into our data center outage:
http://service-update.theplanet.com/Erwin-msg6-2-08.wav.