Saturday, April 25, 2020

Building a postmortem culture

post·mor·tem

\ ˌpōs(t)-ˈmȯr-təm \ 
noun
1. Autopsy
    A postmortem showed that the man had been poisoned.
2. An analysis or discussion of an event after it is over
    A blameful postmortem culture shuts down the exploration of a problem because no one wants to be seen as stupid, even if that means ignoring the obvious truth.

A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring. The postmortem concept is well known in the technology industry.

I picked up the concept of the postmortem from my previous job at Silicon Straits Saigon. The idea that we could study an incident was there, but the guidelines and the enforcement of the culture were weak. So although I was sold on the postmortem as a powerful practice that, with proper enforcement, makes a system more robust over time, I didn't exactly know how to start a culture around it. The most concrete guidelines I received came from Site Reliability Engineering. Wherever the Google practices seemed too extreme or impractical in my context, there was the Internet. The knowledge was powerful and enlightening, and I appreciated the journey over the last six months of transforming it into operational HOWTOs.

From the very beginning, I was aware that a postmortem culture needed to be a joint effort of the entire organization for it to be effective. And I was never interested in being a secretary. But like many other initiatives that involve other people, you can't just make an announcement and expect things to happen, magically. I tried. A few times. So in the beginning it was just me recording the incidents in which I was part of either the solution or the problem. Most of the time, both. And that gave me the time and experience I needed to calibrate the plan before it was presented to everyone.

Work to take blame out of the process

Blame, both the act of blaming and the fear of being blamed, is the enemy of a productive postmortem culture. If a culture of finger-pointing and shaming individuals or teams for doing the "wrong" thing prevails, people will not bring issues to light for fear of punishment, or they will stop investigations prematurely as soon as a "culprit" is identified. That halts the development of preventive measures for the same situation in the future. The urge to blame is formidable; we as human beings are wired for it. Dr. Brené Brown, in a TED talk, explained blame as "a way to discharge pain and discomfort". The fact that whenever you want to trace back whose code caused your miserable wake-up at two in the morning for a system outage, the command is called `git blame` certainly doesn't help.

This is where being blameless gets popular in postmortem literature. And if there is anything subjective about an otherwise objective piece of work like a postmortem, this is it. I find being completely blameless hard to implement. On the one hand, a postmortem is simply not a place to vent frustration. On the other hand, at times, it feels like tip-toeing around people, so worried about triggering their fragile souls that you miss out on chances to call out where and how services can be improved. This is where I come to an agreement with J. Paul Reed that it is important to acknowledge the human tendency to blame, allow a productive form of its expression, and yet constantly refocus to move beyond it.

Here are some examples. They might or might not involve me, in scenarios that might or might not have actually happened.

Blameful:
Someone pushed bad code to production via the emergency pipeline. The tests in CICD could have caught this, but someone thought he knew better. Seriously, if you aren't sure what you are doing, you shouldn't act so recklessly. Rolling back in the middle of the night is a waste of time.
Action items:
  • Think before you edit someone's code.

Completely blameless:
Last night, unauthorized code was pushed to production. CICD was skipped because CICD takes 30 minutes and it was a firefighting situation. The fix was not compatible with a recent refactor.
Action items:
  • Improve CICD speed

Blame-aware:
Last night, unauthorized code was pushed to production. CICD was skipped because CICD takes 30 minutes and it was a firefighting situation. The fix was not compatible with a recent refactor.
Action items:
  • Improve CICD speed
  • Infrequent contributors should use the safety net of CICD
  • Issues in a service need to be escalated to the maintainer of the repo for code review
  • A rollback mechanism needs to be available to developers on pager duty.
I feel that in the examples above, without acknowledging that pushing code to an unfamiliar service in the middle of the night was a reckless action, we would miss the chance to put preventive measures in place. But again, it is subjective; perhaps my blame-aware version fits perfectly into someone else's blameless version. I hope you get the point.

Work on some guidelines

It is useful to be as specific as possible about when a postmortem is expected, who should write it, what should be written, and what the goals of the record are. Not only does this provide a level of consistency across your organization, it also prevents the task of writing a postmortem from being seen as a whimsical assignment from some higher-level authority, or as someone being picked on as punishment for doing the "wrong" thing. A rough sketch of what such a record might look like follows the notes below.

Some of my personal notes on the matter:
  • Different teams might have different sets of postmortem triggers. The more critical your function is, the more detailed the triggers should be.
  • The people who caused the incident might or might not be the ones to write the postmortem. The choice should be based on the level of contribution the person has to offer, both in terms of context and knowledge, not on their previous actions.
  • Be patient. The people you are working with are professionals in software development, but the ability to write good software does not translate into the ability to write a good document. Quality of root cause analysis prevails over eloquence. Save the latter for your blog.
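
To make "what should be written" concrete, here is a minimal sketch of what a postmortem template might look like, assembled from the definition at the top of this post (impact, mitigation, root causes, follow-up actions) plus the notes above. The exact headings and fields are illustrative, not a standard; adapt them to your team.

```markdown
# Postmortem: <incident title>

- Status: draft | in review | final
- Date of incident:
- Authors:   <!-- whoever has the most context and knowledge to offer -->
- Trigger:   <!-- which of your team's postmortem triggers this incident matched -->

## Impact
Who and what were affected, for how long, and how badly.

## Timeline and mitigation
Detection, escalation, mitigation, and resolution, in chronological order.

## Root cause(s)
What actually went wrong, beyond the immediate symptom.

## Follow-up actions
| Action | Owner | Priority | Ticket |
|--------|-------|----------|--------|
```

A follow-up table with explicit owners keeps the focus on prevention rather than on who to blame, and it gives the wider audience something concrete to check back on.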

Work on the impact to your audience

In the beginning, the incidents I was working on were about a database migration to Aurora (and a hasty fallback), so I was assuming my audience would be my fellow developers, possibly extending to project managers; project managers love knowing why you are stealing time from their teams. And a reasonable consequence was to write the postmortems in Markdown and store them in the same code repos as the affected services. There were a few issues with that.

Firstly, in a technology startup, the scope of a tech choice is always bigger than "just the tech team". In my case, Customer Success people need to know what the impacts on our customers were and are, Product people want to know if the choice comes with new possibilities, and Sales people want to sell those possibilities. As much as I love such integration between the developers and the rest of the company, the idea of granting universal access to code repos just to view Markdown postmortems terrifies my boss, and consequently me, obviously.

Secondly, as familiar as Markdown is, it is not a very productive option if you want to include media. And we want charts of various system metrics from around the time of the incident included in the postmortem.

Lastly, writing a postmortem is gradually becoming a collaborative effort, and a git repo, though it supports collaboration, does not do so in real time.

Considering all the options, we finally settled on a shared Google Drive in the company account. It is neither techie nor fancy. But it allows very flexible access control, tracks versions, natively supports embedded media, and lets multiple people collaborate in real time. We share our postmortems in a company-wide channel, and sometimes hold an additional presentation for particularly interesting ones.

Let it grow

When you have done your homework, built a foundation of trust and safety, laid out the guidelines and constantly improved them, and integrated the postmortems with your larger audience, it is probably time to take a step back and let the culture spin on its own. My company's postmortem culture won't be the same as Google's, no matter how many Google books I read. And as long as it works for us, that doesn't matter.

With some gentle nudges, my colleagues are picking up postmortems on their own. We have seen contributions from Product Owners and Project Managers, besides the traditional developer and DevOps contributors. The findings are anticipated by a large audience across the company.

And in the latest incident, which involved the degradation of performance in a few key features of our SaaS offering over the course of a week, we identified another use for a postmortem: a postmortem updated regularly with the latest incident reports, findings, and potential impacts, in a near real-time fashion, is a powerful communication tool across the organization. It both ensures the flow of information to the people who need it (CS to answer questions, PMs to change project plans for urgent hotfixes, etc.) and allows developers to focus on their critical work without frequent interruptions.

As we grow and our system gets more sophisticated, hopefully this constructive postmortem culture will turn out to be a solid building block.