While the above quote comes from a Google post about their “Blameless Postmortem” philosophy and is aimed largely at Site Reliability Engineers, it points to a broader truth that applies to almost any software platform operating at scale.
I’d argue this is even more vital for anyone working on a Trust & Safety team. Our incidents often involve software just as much as people, and any incident management or postmortem should try to account for both. But therein lies the trouble unique to Trust & Safety: how do you write an effective postmortem that accounts for people, levels of harm, judgement calls, and ethics, and not just systems and code?
While most postmortems from teams I’ve been involved with have been on the technical side, Trust & Safety often involves multiple layered challenges. A technical bug or a bad merge can cause a large incident, but that incident can be fixed. The postmortem will very likely deal with measurable, identifiable issues: there are logs, root causes can be traced, and fixes can be verified. You can add tests and error logging to surface issues faster, if not stop them before they happen. That doesn’t make technical failures trivial, but it does make them legible.
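To make that legibility concrete, here is a minimal sketch in Python, with hypothetical names throughout, of the kind of artifact a technical postmortem can anchor itself to: a structured error log that makes the failure traceable, and a regression test pinned to the bug so it can’t silently return.

```python
import logging

import pytest

logger = logging.getLogger("moderation_queue")  # hypothetical service name


def enqueue_report(report: dict) -> dict:
    """Validate a user report before it enters the review queue."""
    if "content_id" not in report:
        # Structured log line: the failure is recorded with enough
        # context that a postmortem can trace it to a root cause.
        logger.error("malformed report rejected", extra={"payload_keys": list(report)})
        raise ValueError("report missing content_id")
    return report


# Regression test pinned to the incident: if the bug ever
# reappears, the build fails before the code ships again.
def test_malformed_report_is_rejected():
    with pytest.raises(ValueError):
        enqueue_report({"reporter_id": 42})
```

There is no equivalent assertion you can write for an inconsistent enforcement call or a misread of user intent, and that asymmetry is the point.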
T&S issues involving people and abuse aren’t lucky enough to have it that easy. Our incidents can stem from simple human error, an incorrect policy decision, inconsistent enforcement, or a misunderstanding of user intent. These are not systems failing under technical constraints, but people doing their best under challenging conditions. Not to mention that our failures and mistakes can affect people’s lives, not just break things.
Trust & Safety needs postmortems precisely because the stakes are higher, but the classic postmortem is made inadequate by the very nature of the work. This leaves T&S teams in an uncomfortable position: we need the same clear learnings and disciplined reflection that postmortems offer, yet the incidents we deal with resist clean narratives and clear root causes. And doing nothing, learning nothing, is not an option either.
Which brings me to the question writ large on my mind: what does a meaningful postmortem look like when the incident isn’t just a system failure but a human one?