Production outages and incident management

For well over a decade I have frequently found myself at the centre of the chaos that ensues when a production system goes down. An incident is a complex situation that requires clear processes to manage it effectively.

I've sat across the table from executives of Fortune 500 companies and convinced them that outages are a good thing, something not to fear but rather to embrace and use as an opportunity to learn. You see, you cannot remove the risk entirely. It's impossible, because at some point something is going to break. The best thing to do with risk is manage it.

The anxiety around outages comes from the fear of failure and the unknown, not to mention the undesirable consequences that follow. But let's be honest, outages can also be costly, so this is no joking matter.

Be prepared

One of my favourite quotes recently has been 'Failing to plan is planning to fail.' It's a real beauty because it's true. So part of your preparation is writing an incident playbook that details what needs to happen in the event of an outage. You are going to need to collaborate on this because it will affect everyone around you.

In this playbook you are going to designate someone to play the coordinator role, often referred to as the 'incident controller'. This is the most critical role in the whole process, as they are responsible for ensuring everyone plays their part. Your incident playbook should centre around this role.

The good news is that there are many good tools out there that can help you and your teams prepare these playbooks. Once you have a draft, the best possible thing to do is break out the costumes, table-top a few scenarios, and see how well you handle them. Then you take those learnings and add them to the playbook.

At this point I'm going to assume you have some form of alerting that comes from your systems. Just in case you don't, please set it up and make sure it wakes up the right people when things go south. Don't just set your alert thresholds to 0. Take some time to define the acceptable ranges your systems should perform within, because an outage doesn't always mean things have crashed out completely.
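As a rough illustration of alerting on acceptable ranges rather than on total failure, here is a minimal Python sketch. The metric names and limits are entirely hypothetical, chosen only to show the idea, and the check is not tied to any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """Acceptable operating range for one metric."""
    low: float
    high: float


# Hypothetical metrics and limits; replace them with values that make
# sense for your own systems.
ACCEPTABLE = {
    "p95_latency_ms": Range(0, 800),
    "error_rate_pct": Range(0, 2),
    "requests_per_min": Range(50, 100_000),
}


def breaches(sample):
    """Return the sorted names of metrics outside their acceptable range.

    Metrics without a configured range are ignored.
    """
    out = []
    for name, value in sample.items():
        limits = ACCEPTABLE.get(name)
        if limits is not None and not (limits.low <= value <= limits.high):
            out.append(name)
    return sorted(out)


if __name__ == "__main__":
    # The service is still up, but it is slow and traffic has dropped away.
    print(breaches({
        "p95_latency_ms": 1200,
        "error_rate_pct": 0.5,
        "requests_per_min": 10,
    }))
```

Note the lower bound on request volume: a sudden drop in traffic is often the first sign that something is wrong, even while every service still reports as healthy.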

Go and take a look at incident.io for its automations and status pages, as these are pretty comprehensive. What you will find is that everything you need, and more, has been combined into this excellent tool. The key is to leverage as much tooling as you can so that your incident processes are as easy to navigate as making a coffee. That's enough about the technical side of things. I could talk about this for hours, but let's move to the most important part.

Focus your mindset

The best thing you can bring to any incident is a calm and relaxed mindset. That's right, you will be the calmest person in the room, and you will persuade everyone else to be the same. De-escalating the tension is the first and most critical stage of getting a handle on the situation and moving forward.

It's actually amazing how powerful a calm, slow, and encouraging tone of voice can be in settling anxiety and moving folk towards analysing, collaborating, and solving problems. To remain calm you need to think about how and what you need to communicate. Do you know when the best time to decide on this is? Before it happens. So write out some basic templates, with wording that lets you communicate effectively with your stakeholders and allows the team to focus on solving the problem.
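To make the idea of a pre-written template concrete, here is a minimal sketch using Python's standard `string.Template`. The field names and the wording are assumptions for illustration only; write your own with the people who will actually receive the messages.

```python
from string import Template

# Hypothetical stakeholder update, agreed on before any incident happens.
UPDATE = Template(
    "[$severity] $service incident update ($time UTC)\n"
    "What we know: $summary\n"
    "What we are doing: $action\n"
    "Next update: $next_update UTC"
)


def render_update(**fields):
    """Fill in the template.

    substitute() raises KeyError when a field is missing, so a
    half-finished update cannot be sent by accident.
    """
    return UPDATE.substitute(**fields)


if __name__ == "__main__":
    print(render_update(
        severity="SEV2",
        service="Checkout",
        time="14:05",
        summary="Some customers cannot complete payment.",
        action="We have identified the cause and are rolling out a fix.",
        next_update="14:35",
    ))
```

Committing to a time for the next update is a deliberate choice: it tells stakeholders when they will hear from you again, which gives the team room to work.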

Be transparent and honest, because this is how you will build trust with your stakeholders and customers. Reassure them that you have identified the issue and are doing everything you can right now to solve it. You should also promise them that you will share the details of what happened once you have fixed the issue. You see, you want them to come away with the feeling that if something happens they don't need to worry: you will let them know ASAP when something isn't right, and this will create the space you need to work the issue. In essence, you are messaging confidence.

Learning & lessons

This is my favourite part of the process. You see, people need to be allowed to fail, and this must be ingrained in your company culture. That way washups become blameless and the focus is placed on the specific problem, not the people. A company can learn a huge amount from holding psychologically safe incident washups: talking through the incident, capturing lessons, and sharing learnings. This will result in your teams being brave and bold, not fearful of reprisal.

It used to be one of the most stressful situations for me. I really dreaded the RCA, or 'Root Cause Analysis', report discussion with customers. However, if you are transparent and honest throughout the process you have nothing to worry about. You really don't, because your customer will value the service you provide and the open way you provide it. Share what happened, what broke, and what you had to do to fix it. Don't worry about sounding too technical, as the detail helps the listener appreciate the context.

Be sure to include what you have learned and the measures you have put in place to ensure this sort of issue doesn't happen again. This is the most delightful outcome: you have remained human to your customer, and they will trust and respect you even more because you took them on this journey of discovery.

In conclusion

By following these best practices and leveraging effective tooling, organisations can manage production outages more effectively, minimise downtime, and improve overall system resilience. But more importantly, you will not be fearful of risk or of failing your customers. Rather, you will have established a transparent and bold culture that values failing forward.
