A few weeks ago we lived through a blackout that affected almost all of our customers to a greater or lesser extent. In this post I would like to go through some of the most notable points from those hours of uncertainty.
Some of these points are good, some bad and some ugly. All of them relate to IT services, ours and our customers', and to the operations and plans that should have worked differently.
I say ours because part of our own operation was also affected during the incident. The lessons below were identified during the crisis and analyzed afterwards.
From the first minutes we got in touch with our clients, regardless of the type of service they had contracted with us. We wanted to find out the following:
1- Whether any of their services were affected.
2- The status of their business operation.
3- What their action plan was, if they had one.
4- How critical it was to restore service on a Sunday.
Because we have clients in very diverse verticals, some had to operate on Sunday no matter what, while others could wait until Monday morning.
From these conversations with customers and the answers we received, we put together this list.
Let's start with the good: the things that I think worked well throughout those uncertain hours.
We are not as bad as it seems
This is one of the good points that I want to highlight. Despite what one might think, few of our clients had their services completely affected, at least as far as the primary data center services are concerned.
In almost all cases, by one method or another, the services continued to be delivered with minimal interruption.
The plans and investments made in recent years proved to be effective.
Communications are still running
This is another good point we can learn from. Many clients had electricity in their data centers but could not reach them, because communications failed at some point along the path.
Something that did not fail at all until a few hours into the blackout was cellular data service. This means that we can take advantage of modern communications technologies such as SD-WAN and back them with secondary links such as cellular lines.
To be clear, the goal is not to replace remote access services but to offer an additional path.
Since cellular data services held up as well as they did, we should consider adopting them as a backup.
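The idea of an additional path can be illustrated with a minimal sketch. It assumes each link has already been health-checked by some other means, and the link names are hypothetical, not taken from any real SD-WAN product. The function simply picks the first link that is up, in order of preference, with cellular as the last resort.

```python
from typing import Optional

# Hypothetical link names, ordered from most to least preferred.
PATH_PRIORITY = ["fiber-primary", "dsl-secondary", "cellular-backup"]


def choose_path(link_up: dict[str, bool],
                priority: list[str] = PATH_PRIORITY) -> Optional[str]:
    """Return the most preferred link that is currently up, or None."""
    for name in priority:
        # A link missing from the health report is treated as down.
        if link_up.get(name, False):
            return name
    return None


# Example: both wired links are down, cellular is still working.
status = {"fiber-primary": False, "dsl-secondary": False, "cellular-backup": True}
selected = choose_path(status)
print(selected)
```

In a real deployment the SD-WAN appliance makes this decision itself; the sketch only shows why a cellular line at the end of the list keeps remote access alive when everything else is down.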
Access to information vital to the operation
Clearly this is one of the bad points we have to learn from.
Having the information stored in our web services and not being able to access it for lack of communications is something that should not be repeated.
This ranges from not having a copy in another online service of the document with the VPN configuration procedure to, much more seriously, not having a copy of the data center shutdown procedure.
We had a case where we were within minutes of having to shut down the entire data center and there was no access to the orderly shutdown procedure.
If we publish this information from our own data center, we should replicate at least the critical part of it in another service that remains reachable when the data center is not.
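One way to catch this gap before the next incident is a simple inventory check. The sketch below is only an illustration: the site label, the location names and the `Runbook` structure are assumptions made for the example. It lists every critical document that has no copy outside the primary data center.

```python
from dataclasses import dataclass, field

# Hypothetical label for the primary site; adjust to your own inventory.
PRIMARY_SITE = "primary-datacenter"


@dataclass
class Runbook:
    name: str
    locations: set[str] = field(default_factory=set)  # where copies are stored


def missing_offsite_copy(runbooks, primary=PRIMARY_SITE):
    """Return the names of runbooks that have no copy outside the primary site."""
    return [r.name for r in runbooks if not (r.locations - {primary})]


runbooks = [
    Runbook("VPN configuration procedure", {"primary-datacenter", "external-cloud-drive"}),
    Runbook("Data center shutdown procedure", {"primary-datacenter"}),
]
at_risk = missing_offsite_copy(runbooks)
print(at_risk)  # ['Data center shutdown procedure']
```

Running a check like this periodically turns "we should have a copy somewhere else" into a list of specific documents to replicate.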
Decisions are made beforehand
Among the bad points I also want to talk about the lack of planning for this type of scenario. Although the most common thing to analyze and create is a disaster recovery plan (DRP), different scenarios have to be considered within it.
In DRP practice we always have to analyze different scenarios, such as a blackout or a prolonged event that will impact the service.
These scenarios are raised while the plan is being created, and the actions to take are defined for each of them. Doing this well in advance allows you to make much better decisions.
During the blackout we found many cases where nothing had been defined beforehand, and we saw a lot of improvisation.
This event should immediately trigger a review of your disaster recovery plan if you have one or, otherwise, the creation of one.
Tests are necessary
I have left the ugly for last: the worst point of all, in my opinion.
In one case, when we made initial contact and asked about the status of their services, the client told us that the UPS had taken the load and that they would start the generator manually.
Our contact told us that they had enough fuel to run for hours once the generator started and that they were not having any difficulties.
The generator never started, the batteries ran out and the rest … is history.
In this case there was a quarterly generator test plan.
A plan that had not been executed in two years. Yes, two years.
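A test plan only helps if someone notices when it stops being followed. As a rough sketch, assuming a quarter is approximated as 92 days and using made-up dates, the following shows how a last-test date can be checked against the schedule.

```python
from datetime import date, timedelta

# Assumption: "quarterly" is approximated as one test every 92 days.
TEST_INTERVAL = timedelta(days=92)


def overdue_by(last_test: date, today: date,
               interval: timedelta = TEST_INTERVAL) -> timedelta:
    """Return how long ago the next test was due (zero if it is not overdue)."""
    due = last_test + interval
    return max(today - due, timedelta(0))


def missed_tests(last_test: date, today: date,
                 interval: timedelta = TEST_INTERVAL) -> int:
    """Return how many full test intervals have elapsed since the last test."""
    return max((today - last_test) // interval, 0)


# Hypothetical dates: the last generator test was exactly two years ago.
last_test = date(2022, 1, 10)
today = date(2024, 1, 10)
print(overdue_by(last_test, today).days)  # 638
print(missed_tests(last_test, today))     # 7
```

With these example dates the check reports seven skipped tests, which is the kind of number that should raise an alert long before a blackout does.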
During the incident, in some cases, there were moments of crisis that could have been avoided with planning, with testing and with a disaster recovery plan that had not even been fully executed.
Do not wait for another blackout to plan. Make these definitions now, while things are calm.
From all this we have to learn. The good, the bad and the ugly.
Are you interested in knowing more about good DRP practices? Contact us at https://www.wetcom.com/page/contactus
This post was first published in Spanish at 5 cosas que podemos aprender del apagón.