
Your First On-Call Rotation — What to Expect

The pager anxiety, the runbooks, and what 3am actually looks like before you've lived it

Serhii Malyshev · Software architect and tech lead · 6 min read · Jun 1, 2026

#SoftwareEngineering #DevOps #SiteReliabilityEngineering #IncidentManagement #BackendDevelopment

The runbook won't tell you what the notification feels like.

Your phone buzzes at 3:12am. Not a text — the short vibration you've trained yourself to fear. You read the alert twice because the service name in the page doesn't match anything you deployed last month. The runbook link opens to a Confluence page last edited by someone whose Slack account says "deactivated." Somewhere, a product manager is already typing "any update?" in a channel you haven't joined yet.

Welcome to your first on-call rotation. The mistake you're trying to avoid isn't "I don't know Redis" — it's paging the wrong thing twice because nobody told you triage comes before heroics.

Nobody prepares you for this moment properly. Not because they don't care. Because most of what matters at 3am can't be copied from an onboarding deck.

What your manager won't spell out

Companies sell on-call like a skills test. Can you read logs? Can you restart a worker? Can you tell Redis from a rabbit hole?

That's half the job.

Being on-call means being available during a window and responding with appropriate urgency when production breaks. The other half is coordination under sleep debt: triage, communicate, escalate, hand off, and improve the system so the same page doesn't own your next month. PagerDuty frames it as Prepare → Triage → Fix → Improve → Support. Most first-timers only hear "Fix."

The page is not asking whether you're smart enough. It's asking whether someone accountable is awake, looking at the right dashboards, and willing to say out loud what's true — including "I don't know yet."

Google's incident model splits roles for a reason: someone coordinates, someone talks to stakeholders, someone debugs. On a small team you play all three poorly if you dive straight into kubectl before you've posted a status line. Communication isn't overhead on top of the real work. On your first rotation, it is the real work until impact is understood.

Before your first solo week — the prep that matters

If your only prep is "read the runbooks," you're walking in with a map that might describe a city that no longer exists.

PagerDuty's incident response docs are blunt: new team members should shadow the rotation for weeks (receive alerts, follow along) and not carry it solo until they've watched real incidents end-to-end. Google bootstraps the same way: checklists, Wheel of Misfortune drills, handoffs before new hires touch prod alone.

If your org skips shadow shifts, that's a gap in their process, not a badge of honor on yours. Ask for shadow time. Watch someone post "investigating" before they fix anything. Watch them escalate without apologizing.

Read the handoff, not just the runbook index

The runbook tells you what someone believed when they wrote it. The handoff tells you what's on fire right now — flaky deploys, silenced alerts, the queue that's been drifting since Tuesday.

Before your shift: read the last handoff, skim recent postmortems, confirm who primary and secondary are, and test that your laptop can reach VPN and prod-adjacent tools from your actual bedroom, not the office Wi-Fi you used once in onboarding. Shift handoffs should name specific dashboard URLs, not "check Datadog."

Ask which alerts are lies

Rotation quality is a systems problem. Google's SRE guidance treats roughly two incidents per 12-hour on-call shift as the ceiling, not the goal; pager load above that is a signal to fix alerting, not to "try harder." Ask which pages fire weekly and never need human action.

Industry practice: delete or fix alerts nobody has acted on in 30–90 days — don't just mute them. That's alert hygiene, not optional cleanup.
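
A minimal sketch of that hygiene pass, assuming you can export the date of the last human action per alert. The alert names and dates below are invented for illustration, and the 90-day default is simply the upper end of the range above.

```python
from datetime import date, timedelta


def stale_alerts(last_acted_on, today, max_idle_days=90):
    """Return names of alerts nobody has acted on within max_idle_days.

    last_acted_on maps alert name -> date of the last human action,
    or None if nobody has ever acted on the alert.
    """
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(
        name
        for name, acted in last_acted_on.items()
        if acted is None or acted < cutoff
    )


# Hypothetical alert history, not taken from any real system.
history = {
    "disk-usage-warning": date(2026, 1, 10),
    "webhook-queue-depth": date(2026, 5, 28),
    "legacy-cron-heartbeat": None,
}

candidates = stale_alerts(history, today=date(2026, 6, 1))
print(candidates)  # alerts to delete or fix, not mute
```

The output is a review list, not an automatic delete: someone still decides whether each alert gets fixed or removed.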

If you're getting paged ten times on a quiet night, your anxiety is rational.

The pager is broken — not you.
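
To put a number on "broken," one rough approach is to count pages per shift and flag any shift above the roughly-two-per-shift figure mentioned earlier. This is a sketch under assumptions: the shift identifiers and alert names are made up, and it counts every page, so filter to actionable ones first if your export lets you.

```python
from collections import Counter

MAX_INCIDENTS_PER_SHIFT = 2  # threshold borrowed from the figure quoted earlier


def overloaded_shifts(pages, limit=MAX_INCIDENTS_PER_SHIFT):
    """Return {shift_id: page_count} for shifts that exceeded the limit.

    pages is a list of (shift_id, alert_name) tuples.
    """
    per_shift = Counter(shift for shift, _alert in pages)
    return {shift: count for shift, count in per_shift.items() if count > limit}


# Hypothetical page log for two nights.
pages = [
    ("mon-night", "webhook-queue-depth"),
    ("mon-night", "disk-usage-warning"),
    ("mon-night", "disk-usage-warning"),
    ("tue-night", "webhook-queue-depth"),
]

print(overloaded_shifts(pages))
```

A shift that shows up here week after week is evidence for the alerting conversation, not a performance review.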

When the phone buzzes at 3am — triage before heroics

Here's a first hour they won't put in the wiki. "Stay calm" is useless advice. This is the sequence.

Minute 0–5: Triage before heroics. Open the alert. What is actually failing? What still works? Did anything deploy or config-change in the last hour? PagerDuty's triage step exists because not every red graph is a 3am fire — disk trending high with weeks of headroom can wait for morning.
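
The disk example can be made concrete with a back-of-the-envelope estimate. The sketch below assumes linear growth between two usage samples, which real disks rarely honor exactly, and the seven-day headroom threshold is an arbitrary placeholder your team would set.

```python
def days_until_full(used_then_pct, used_now_pct, days_between):
    """Linearly extrapolate disk usage.

    Returns the estimated days until 100%, or None if usage is flat or falling.
    """
    growth_per_day = (used_now_pct - used_then_pct) / days_between
    if growth_per_day <= 0:
        return None
    return (100.0 - used_now_pct) / growth_per_day


def can_wait_until_morning(used_then_pct, used_now_pct, days_between,
                           min_headroom_days=7):
    """True if the estimate leaves at least min_headroom_days of room."""
    remaining = days_until_full(used_then_pct, used_now_pct, days_between)
    return remaining is None or remaining >= min_headroom_days


# 70% a week ago, 72% now: slow drift with months of headroom.
print(can_wait_until_morning(70, 72, days_between=7))

# 80% yesterday, 95% now: hours left, this one is a real page.
print(can_wait_until_morning(80, 95, days_between=1))
```

The point is not the arithmetic; it is asking "how long do I have?" before asking "how do I fix it?"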

Minute 5–10: Join the comms surface. Incident channel, video if your team uses it, status line in the thread. Atlassian's handbook is explicit: note observations in chat as you go, even if it looks like talking to yourself — that thread becomes the postmortem timeline. Post something like: Investigating webhook queue depth alert. Customer impact unconfirmed. Root cause unknown. Next update in 15 minutes.

Silence reads as neglect.

"I don't know yet" reads as control — not as failure.
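
If typing that status line at 3am feels like too much, a tiny template helps. This is a hypothetical helper, not part of any chat or paging tool's API; the field names and the 15-minute default simply mirror the example message above.

```python
from datetime import datetime, timedelta


def status_update(summary, impact, root_cause, now, next_update_minutes=15):
    """Build a one-line incident status with an explicit next-update time."""
    next_at = now + timedelta(minutes=next_update_minutes)
    return (
        f"[{now:%H:%M}] {summary}. "
        f"Customer impact: {impact}. "
        f"Root cause: {root_cause}. "
        f"Next update by {next_at:%H:%M}."
    )


message = status_update(
    "Investigating webhook queue depth alert",
    impact="unconfirmed",
    root_cause="unknown",
    now=datetime(2026, 6, 1, 3, 17),
)
print(message)
```

Forcing "unconfirmed" and "unknown" into named fields makes it harder to leave the unknowns out by accident.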

Minute 10–30: Runbook as hypothesis. If the runbook says restart payment-worker, check whether that job still exists. Stale runbooks buy false confidence — worse than admitting you're reading the codebase cold.

The runbook tells you what someone believed in 2023. Verify before you restart. If the doc lies, screenshot it, document what you tried, and move to escalation before you restart the wrong thing twice.

Concrete shape: payment webhooks backing up at 03:12. Dashboard shows lag, not customer-facing errors yet. Runbook points to a systemd unit renamed six months ago. You post the mismatch, ping secondary, and stop. Twenty minutes, no fix, still a win.
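
Treating the runbook as a hypothesis can be partly mechanical. The sketch below, for Python 3.9+, pulls systemd unit names out of runbook text and reports the ones that no longer exist. It assumes runbook steps are written as plain `systemctl <verb> <unit>` commands, and the renamed unit `payments-webhook-worker` is invented for illustration. The set of existing units is passed in; on a real host it could be collected from `systemctl list-unit-files --type=service`.

```python
import re

# Matches commands such as "systemctl restart payment-worker".
UNIT_PATTERN = re.compile(
    r"systemctl\s+(?:restart|start|stop|status)\s+([\w@.-]+)"
)


def missing_units(runbook_text, existing_units):
    """Return unit names the runbook references that are not in existing_units."""
    known = {unit.removesuffix(".service") for unit in existing_units}
    referenced = UNIT_PATTERN.findall(runbook_text)
    return [
        unit for unit in referenced
        if unit.removesuffix(".service") not in known
    ]


# Hypothetical runbook excerpt and host state.
runbook = """
1. Run: systemctl restart payment-worker
2. Confirm: systemctl status nginx
"""
existing = {"payments-webhook-worker.service", "nginx.service"}

print(missing_units(runbook, existing))
```

A non-empty result is exactly the mismatch worth posting in the incident channel before touching anything.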

Escalation is a feature, not a confession

Junior engineers treat escalation like failing an exam. Staff engineers treat it like using a backup generator.

Backup schedules exist because commutes exist, because expertise is uneven, because sleep exists. PagerDuty's line is "Never hesitate to escalate." Google adds psychological safety: escalation paths must be real, not performative.

Primary, secondary, and the five-minute window

Typical pattern: you're primary. Secondary auto-pages if you don't ack in about five minutes. That's not a judgment — it's load-bearing infrastructure. Escalate manually sooner when you're past your depth: "Redis latency inconclusive after 20 minutes; escalating to @secondary before I try failover."
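
The pattern above fits in a few lines of logic. This is an illustrative model only, not how any particular paging product implements escalation; the five-minute timeout is the "about five minutes" from the text, and your policy may differ.

```python
from datetime import datetime, timedelta

ACK_TIMEOUT = timedelta(minutes=5)  # assumption: matches the typical pattern above


def who_to_page(paged_at, now, acked, manually_escalated=False):
    """Return the responders who should currently be notified."""
    if manually_escalated:
        # The primary asked for help; no timer needed.
        return ["primary", "secondary"]
    if not acked and now - paged_at >= ACK_TIMEOUT:
        # No acknowledgement within the window: the backup is paged automatically.
        return ["primary", "secondary"]
    return ["primary"]


paged = datetime(2026, 6, 1, 3, 12)
print(who_to_page(paged, paged + timedelta(minutes=2), acked=False))
print(who_to_page(paged, paged + timedelta(minutes=5), acked=False))
print(who_to_page(paged, paged + timedelta(minutes=20), acked=True,
                  manually_escalated=True))
```

Note that the manual branch does not check the clock: escalating at minute 20 and escalating at minute 2 are both valid uses of the system.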

Managers in the channel asking for ETAs before you know impact? Answer the comms layer first: impact unknown, investigating, next update timed. Atlassian's guidance: state unknowns explicitly rather than omitting them.

"I don't know yet" as a professional state

What separates "I don't know yet" as a professional state from "I don't know yet" as a failure is permission, and whether your team staffed a secondary at all.

Snoozing a non-urgent disk alert until 9am is professional triage. Snoozing a customer-facing outage because you're scared to wake secondary is not — know which side of that line you're on before the page lands.

If there's no backup, no shadow, and only a wiki link, you're not underprepared. You're under-supported.

After the graph goes green

The incident ends when customer impact ends. Your job doesn't.

PagerDuty's Improve step: noisy pages, recurring disk fills, alerts that should have been deleted — push back so your future self sleeps. Blameless postmortems ask why the system allowed the failure, not who clicked wrong. Vague action items ("improve monitoring") don't count. Named owners and deadlines do.
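
One way to hold that line in a postmortem review is a quick check that every action item has a named owner and a deadline. The record shape below is an assumption for illustration, not a format any postmortem tool prescribes, and the owner name and date are invented.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


def is_trackable(item):
    """An item counts only if it has a non-blank owner and a due date."""
    has_owner = bool(item.owner and item.owner.strip())
    return has_owner and item.due is not None


# Hypothetical items from a postmortem draft.
items = [
    ActionItem("Improve monitoring"),
    ActionItem("Delete the stale disk alert", owner="serhii", due=date(2026, 6, 15)),
]

untracked = [item.description for item in items if not is_trackable(item)]
print(untracked)
```

This catches missing owners and dates, not vagueness itself; "improve monitoring" with an owner attached still needs a human to push back.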

Document while it's warm

Reconstructing the timeline from memory a week later is fiction. The Slack thread you maintained at 3am is the first draft of the postmortem.

What to tell the next person on-call

Handoff isn't optional. Active incidents, silenced alerts, risky deploys, specific dashboard URLs — not "check Datadog." The engineer starting Monday morning shouldn't learn about your Friday fire from the pager. PagerDuty's Support step and handoff checklists both say the same thing: pass context, not vibes.
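
A handoff note can be checked before it is sent. The sketch below assumes a simple four-field note mirroring the list above and flags dashboard entries that are not actual links; the field names and the example.com URL are placeholders.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class Handoff:
    active_incidents: list = field(default_factory=list)
    silenced_alerts: list = field(default_factory=list)
    risky_deploys: list = field(default_factory=list)
    dashboards: list = field(default_factory=list)

    def problems(self):
        """Return human-readable issues with the dashboard section."""
        issues = []
        if not self.dashboards:
            issues.append("no dashboards listed")
        for entry in self.dashboards:
            parsed = urlparse(entry)
            if parsed.scheme not in ("http", "https") or not parsed.netloc:
                issues.append(f"not a URL: {entry!r}")
        return issues


# Hypothetical handoff with one vague entry and one real link.
note = Handoff(
    silenced_alerts=["disk-usage-warning until Monday"],
    dashboards=["check Datadog", "https://dashboards.example.com/webhooks"],
)
print(note.problems())
```

It only validates the dashboard links; whether the incidents and deploys sections carry enough context is still a judgment call for the person writing the note.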

What a good first rotation looks like

You won't know every service. You will wake up scared at least once. A good first rotation means: you triaged before heroics, you updated stakeholders before they panicked, you escalated before drowning, you left the next shift smarter than you found it.

A bad first rotation is the same fire three weeks later because nobody deleted the lying alert and nobody fixed the runbook step that sent you in circles.

Three things to remember when the phone buzzes: verify before you restart, update before you're sure, escalate before you're heroic. The pager isn't testing whether you're a genius. It's testing whether the team built a rotation that lets a tired human be competent.

That's what 3am actually looks like.