It is hard to imagine a world where nothing goes wrong. Especially in software development, which is not an exact science, things will go wrong. As far as I am aware, no definitive research has been done on this, and different sources give different numbers: Security Week talks about 0.6 bugs per 1000 lines of code, while Gray Hat Hacking mentions 5-50 bugs per 1000 lines of code. I am sure numbers like these also depend on your QA process. But it’s impossible to write bug-free code.
So when things inevitably go wrong and your production environment goes down or errors out, it is important to figure out what went wrong. If you know what went wrong, you can figure out how to prevent that issue the next time. Part of that is a good post-mortem. A post-mortem usually includes a meeting where the event is discussed openly and freely, and a written report of the findings (a summary of which you could and should send to your customers).
In the past few days I’ve seen this blogpost from Stay Saasy do the rounds on social media and in communities. As I already said on Mastodon, I couldn’t disagree more. I feel the need to expand on more than just that one statement, so I’ll focus on some statements from the blogpost and explain why I disagree so strongly.
Frequency
The first thing that I noticed in the blogpost is an assumption that shocked me a bit:
Many companies do weekly incident reviews
Hold up. Weekly? I realize the statistics vary wildly, and if you indeed have 50 bugs per 1000 lines of code you’ll have a lot of bugs, but I would hope that you have a QA process that weeds out most of them. I am used to having several steps between writing code and it going to production. That may include the following (a sketch of the automated steps follows the list):
- Code reviews by other developers
- Static analysis tools
- Automated tests (unit tests and functional tests)
- Manual tests by QA
- Acceptance tests by customers
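To make the automated part of that list a bit more concrete, here is a minimal sketch of a merge gate, assuming a Python project that uses flake8 for static analysis and pytest for the automated tests (substitute whatever tools your own pipeline uses). The point is simply that a change does not move towards production until every check passes.

```python
#!/usr/bin/env python3
"""Minimal sketch of a pre-merge quality gate.

Assumes a Python project with flake8 (static analysis) and pytest
(automated tests) installed, and a `src` directory to lint; adjust
the commands to match your own pipeline.
"""
import subprocess
import sys

# Each gate is a command that must exit with status 0 before the
# change is allowed to move on towards production.
GATES = [
    ("static analysis", ["flake8", "src"]),
    ("automated tests", ["pytest", "--quiet"]),
]


def main() -> int:
    for name, command in GATES:
        print(f"Running {name}: {' '.join(command)}")
        if subprocess.run(command).returncode != 0:
            print(f"Gate '{name}' failed; blocking the merge.")
            return 1
    print("All gates passed.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Manual QA and acceptance tests obviously don’t fit in a script like this, but the principle is the same: every step is a checkpoint the change has to clear before it reaches production.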
Let’s take that worst-case number of 50 bugs per 1000 lines of code. Even if each of the steps above only caught half of the remaining bugs, five steps would already bring that down to fewer than 2 per 1000 lines, so I would expect that the majority of bugs are caught before the code even ends up on production servers. If this is true, why would you have weekly incident reviews? I mean, that’s OK if it is indeed needed, but if you need weekly incident reviews, I’d combine looking at the incident with looking at your overall QA process, because then something is wrong.
Is it somebody’s fault?
In the blogpost, Stay Saasy states that it is always somebody’s fault.
it must be primarily one person or one team’s fault
No. Just no. If you look back at the different ways to catch bugs that I described earlier, you can already see that it is impossible to blame a single person. One or more developers write the code, one or more other developers review the code, one or more people set up and configured the static analysis tools and one or more people interpreted their results, one or more people wrote the automated tests, the QA team did manual tests where needed, and the customer did acceptance testing. A bug reaching production is, aside from simply being something that happens sometimes, a shared responsibility of everyone involved. It is impossible and unfair to blame a single person or even a single team.
Accountability
It feels like Stay Saasy mixes up blameless post-mortems with non-accountability. But these are two different things, with two different motivations. The post-mortem is not about laying blame. It is about figuring out what went wrong and how we can prevent it in the future. It is a group effort of all involved. The accountability part is something that is best handled in a private meeting between the people who were involved in the cause of the issue. To mix these two up would indeed be a mistake, which is why blameful post-mortems are such a bad idea.
On the flip side, if you really messed up, you might get fired. If we said we’re in a code freeze and you YOLOed a release to try to push out a project to game the performance assessment round and you took out prod for 2 days, you will be blamed and you will be fired.
While I agree up to a certain point with this statement, I think in this case you might also want to fire the IT manager, CTO or whoever is responsible for the fact that an individual developer could even YOLO a release and push it to production during a code freeze. Again, have a look at the process please.
But yes, even if it is possible to push a release on your own, you should not actually do it. So if you do, it might warrant repercussions up to and including termination of your contract.
Fear as an incentive
There is one main incentive that all employees have – act with high integrity or get fired.
I can’t even. Really. If fear is your only tactic to get people to behave, you should really have a good look at your hiring policy, because you’re hiring the wrong people.
In every role where I was (partially) responsible for hiring people, my main focus would be to hire people with the right mindset. Skills were not even the main focus; mindset was. People who are highly motivated to write quality code, who will take the extra effort of double-checking their code, who welcome comments from other developers that will improve the code. People who are always willing to learn new things that will improve their skills. You do not need fear to keep people in check when you hire the right people, because they are already motivated by their own yearning to write good code and to deliver high-quality software.
So how to post-mortem?
It might not be a surprise to you, after all of the above, that I am a big supporter of blameless post-mortems. Why? Because of the goal of a post-mortem. The main goal (in my humble opinion) is to find out what went wrong, and brainstorm about ways to prevent it from happening again. There are four main phases in a post-mortem process:
- Figure out what went wrong
- Think of ways to prevent this from happening again
- Assign follow-up tasks
- Document the meeting results
Figure out what went wrong
The first phase of the meeting is to figure out what went wrong. This first phase should be about facts, and facts alone. Figure out which part of your code or infra was the root cause of the incident. Focus not just on that offending part of your software, but also on how it got there. Reproduce the path of the offending bit from the moment it was written to the moment things went wrong.
In the first phase, it is OK to use names of team members, but only in factual statements. So “Stefan started working on story ABC-123 to implement this feature, and wrote that code” or “Tessa took the story from the Ready For Test column and started running through test cases 1, 2 and 5”. Avoid opinions or blame. Everyone should be free to add details.
Think of ways to prevent this from happening again
Now that you have your facts straight, you can look at the individual steps the cause took from the keyboard to your production server, and figure out at which steps someone or something could’ve prevented the cause from proceeding to the next step. It can also be worthwhile to look not just at individual steps, but also at the big picture of your process, to identify whether there are changes to make across multiple steps to prevent issues.
Initially, in this phase, it works well to just brainstorm: put the wildest ideas on the table, and then look at which have the most impact and/or take the least effort to implement. Together, you then identify which steps to take to implement the most promising measures to prevent the issue in the future.
Let everyone speak in this meeting. Involve your junior developers, your product manager, your architect, your QA and whoever else is a stakeholder or in another way involved in this. You might be surprised how creative people can get when it comes to preventing incidents.
Assign follow-up tasks
Now that you have a list of tasks to do to prevent future issues, it’s time to assign who will do what. Someone (usually a lead dev or team lead, sometimes a scrum master, manager or CTO) will follow up on whether the tasks are done, to make sure that we don’t just talk about how to fix things, but actually do them.
Document the meeting results
Aside from talking about things and preventing future issues, you should also document your findings: pretty extensively for internal use, but preferably also in a summarized form for publication. Customers will notice issues, and even if they don’t notice, they will want to be informed. Honest and transparent communication about the things that go wrong will help your customers trust you more: you show that you care about problems, and that you do all you can to solve them and to prevent them in the future. Things will go wrong, that’s inherent in software development. The way you handle the situation when things go wrong is where you can show your quality. In all documentation, try to avoid blaming as well. That isn’t important. What’s important is that you care and put in effort to prevent future issues.
So what about accountability?
Blameless post-mortems do not stop you from also holding people accountable for the things they do. If someone messes up, they should be spoken to directly. But that should not happen in a lynch-mob setting; preferably it is a one-on-one setting where two individuals evaluate the situation. And yes, there can be consequences. The most important thing is that the accountability is completely separate from the post-mortem. It is not the focus of a post-mortem to hold someone accountable. That is a completely separate process.