TeamDoing the work

How to Write a Postmortem That Gets Read

Structure, blame vs causation, and closure rate — the parts most teams copy from a template and still get wrong

Serhii Malyshev6 min read · Jun 30, 2026

#SoftwareEngineering #DevOps #SRE #IncidentManagement #EngineeringManagement

Three checked. Four open. That's the scoreboard.
Three checked. Four open. That's the scoreboard.

Most engineering teams know what a postmortem is supposed to contain. Summary. Timeline. Root cause. Action items. Maybe contributing factors if someone was feeling thorough on a Friday.

Few teams write one a stranger finishes.

The gap isn't missing templates. Every incident tool ships one. What fails is the page itself — consensus dressed as analysis, "human error" where causation should be, seven action items dying in a shared doc while everyone congratulates themselves on blamelessness. Google's SRE book is explicit that the value of a postmortem is proportional to the learning that happens after the document ships. If nobody reads past the timeline and nothing closes, you didn't write a postmortem. You filed paperwork.

That's the scoreboard.

The Readable Spine — seven fields, one job each

Treat the postmortem as a spine, not a scrapbook. Seven fields answer seven questions a busy engineer asks when they land on your doc from search six months later:

  • Summary — What broke, for whom, for how long, in three sentences. Quantify impact. "Checkout degraded" is vague. "Checkout error rate 12% for 47 minutes, ~8,400 failed payments" is readable.

  • Timeline — Timestamp, event, source. Not a story from memory — a record. Every line should point at something: PagerDuty alert ID, deploy log, Slack thread, Grafana panel. A timeline without sources is fan fiction.

  • Impact — Who felt it, how badly, what SLO budget you burned. This is where leadership decides whether the doc matters.

  • Root cause — The systemic reason the incident was possible, not the symptom that triggered the page. Connection pool exhausted is a symptom. Missing query review in the deploy path is closer to cause.

  • Contributing factors — Two to five conditions that made the incident likely or worse: stale runbook, missing alert, rate limit on a dependency, marketing event doubling traffic. Contributors, not a single villain. List the weather, not just the lightning strike.

  • Action items — Preventive work with a named owner, measurable outcome, and deadline. Not "team." Not "TBD." Not "consider improving monitoring."

  • Lessons learned — Optional. If it repeats the action items, cut it.

Google Cloud's reliability guidance says move supplementary dumps to an appendix — long log excerpts, screenshot galleries, the play-by-play nobody needs in the spine.

Readable beats exhaustive.

A twelve-section doc often signals "capture everything" instead of "teach something." Longer isn't more thorough — it's more skippable.

Blameless Isn't Anonymous — causation without a trial

Blameless postmortems are a tenet of SRE culture for a reason that has nothing to do with being nice. When people fear punishment, they hide details. The doc looks complete. The room knows it isn't.

Blameless does not mean nobody did anything traceable. It means you stop at character judgment and keep going on causation. You can write that the on-call engineer merged at 16:02 and that rollback took eleven minutes because the feature-flag path wasn't in the runbook. You don't write that they're careless.

The failure mode I see most often isn't overt blame. It's performative blamelessness.

The root cause section says human error. Or it names the engineer closest to the trigger and stops. Or — worse — the room agrees on a root cause by consensus before anyone checks the logs. The doc says all the right words. Everyone leaves knowing who "really" caused it. That's not blameless. That's blame with better formatting.

Compare two root-cause paragraphs for the same rollback:

Blame-shaped: "Engineer deployed config v2.4.1 without running load tests. Human error."

Causation-shaped: "Deploy v2.4.1 passed CI but load tests aren't in the production deploy gate. An N+1 query raised per-request DB connections from 2 to 11 under traffic. Rollback took eleven minutes because the runbook documented the old flag path, not the one shipped in v2.3."

Same incident. One stops at a person. One traces the system that made the person's reasonable action dangerous.

The question blameless culture actually asks isn't who messed up — it's what about our system made this action easy and the consequences hard to predict. Performance issues that repeat across incidents still belong in private management channels. The public doc stays focused on fixes the whole team can use.

If your engineers read the root cause section and feel exposed, you wrote a trial. If they read it and see a map of gaps they can actually close, you wrote something that might get cited the next time the same class of failure shows up.

Closure Rate — the scoreboard templates don't mention

Templates tell you what sections to fill. They almost never tell you whether the process works.

Track one number: what percentage of postmortem action items close within 90 days? If you can't answer that, the loop isn't closed — you're generating homework.

Industry practice treats sub-50% completion as a failure signal — not because fifty is magic, but because once most corrective actions never land, the ritual persists while outcomes don't. Follow-through research treats that band as the point where learning stops.

Below that line, you're running documentation theater. The postmortem template is fine. The culture that absorbs fixes isn't.

High-performing teams aim for 80% or higher, with high-priority items closing inside thirty days and every item owned by a specific person, not a squad label.

Worked example — same template, different outcomes:

Team A ships a postmortem with seven action items. Owners say "platform" or "infra." Tickets never leave the doc. At ninety days, closure rate: 29%.

Team B ships three items. "Add load-test gate to deploy pipeline — owner: Riley, due: March 15." "Update runbook flag path — owner: Sam, due: March 8." "Alert on connection pool saturation — owner: Jordan, due: March 22." Items live in Jira with the postmortem linked. At ninety days: 85% closed; the one open item was explicitly deprioritized with a one-line rationale in the doc.

Same incident severity. Same blameless policy on paper. Different scoreboard.

When closure is low, don't add sections to the template. Cut action items until the org can finish them. Three to seven owned items beats a backlog of fourteen noble intentions. Move items into the system your team already tracks. Review open postmortem actions in a standing ceremony — sprint planning, ops review, on-call handoff — whatever you already attend.

An unreviewed postmortem might as well never have existed.

A reviewed postmortem whose actions never close is worse — it teaches the org that writing is the work.

What to Cut When Nobody Reads Past the Timeline

If readership dies after the timeline, the spine is buried. Noise won.

Cut or appendix: raw chat logs, forty-line stack traces, duplicate recap of facts already in the timeline, "lessons learned" that restate action items, generic recommendations ("improve monitoring") with no owner.

Keep in the spine: evidence-linked timeline, quantified impact, root cause deep enough to pass review, contributing factors, owned actions.

Google's postmortem review asks whether root cause was sufficiently deep and the action plan appropriate. Depth doesn't mean length. It means a stranger could explain why recurrence is less likely because of a specific change you're making — not because everyone promised to be more careful.

Before you propose a complex fix, ask whether the incident is likely to recur. Some postmortems should end with "accepted risk" documented honestly. That's still learning. It's just not a seven-item wish list.

Publish the doc where people actually look — team repo, incident tool, mailing list — wide enough that the next engineer on call can find it. A postmortem saved only for the people in the room didn't happen for the organization.


What's your team's ninety-day action-item closure rate — and when you read your last postmortem's root cause aloud, does it trace decisions and systems, or stop at the nearest person?

Show your work, not just your conclusions.

More in Team

← Back to hub