
Escalation & Communication

Classify severity, gather evidence before paging, and write updates engineers and stakeholders can act on.

Prerequisite: Hands-On Lab


What you’ll be able to do after this module

  • Classify an incident’s severity and know who should be paged for it.
  • Know exactly what evidence to gather before escalating, so the person you hand off to doesn’t have to redo your work.
  • Write an incident update that’s useful to both engineers and non-technical stakeholders.

Every alert playbook ended with an “Escalation Trigger”. This module is about what happens the moment you hit one.


1. Why escalation quality matters as much as escalation timing

Escalating too late wastes time. Escalating with no evidence wastes someone else’s time: the on-call engineer has to redo your diagnosis from scratch before they can start fixing anything. The goal is not “escalate fast” or “escalate rarely”; it is to escalate with a diagnosis attached, the same way every alert playbook built toward a specific, evidence-backed root cause rather than a vague “something’s wrong.”

A bad escalation: “RabbitMQ seems broken, orders aren’t going through.”

A good escalation: “orders.created.queue has 12,000 messages ready and climbing, 0 consumers attached (confirmed via rabbitmqctl list_queues). The orders-service pods are up and healthy per their own health checks, but their logs show no RabbitMQ connection activity since 14:02 UTC. Checked security groups: no recent changes. This looks like an app-side connection issue, not a broker issue. Escalating to the orders-service on-call for their side; broker itself is healthy.”

The second version lets the next person start fixing immediately instead of re-diagnosing.
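
The queue figures in the good escalation can be pulled out of saved command output rather than retyped. The following is a minimal Python sketch, assuming the output of rabbitmqctl list_queues name messages_ready consumers has been captured as tab-separated text. The helper names, the backlog threshold, and the sample output are illustrative assumptions, and the exact output layout can differ between RabbitMQ versions.

```python
from typing import List, NamedTuple


class QueueStat(NamedTuple):
    name: str
    messages_ready: int
    consumers: int


def parse_list_queues(output: str) -> List[QueueStat]:
    """Parse tab-separated rows of: name, messages_ready, consumers."""
    stats = []
    for line in output.splitlines():
        parts = line.strip().split("\t")
        # Skip header or informational lines that are not three data columns.
        if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue
        stats.append(QueueStat(parts[0], int(parts[1]), int(parts[2])))
    return stats


def find_stalled_queues(stats: List[QueueStat], min_backlog: int = 1000) -> List[QueueStat]:
    """Return queues that have a backlog but no consumers attached."""
    return [s for s in stats if s.messages_ready >= min_backlog and s.consumers == 0]


# Hypothetical sample output, for illustration only.
SAMPLE = (
    "name\tmessages_ready\tconsumers\n"
    "orders.created.queue\t12000\t0\n"
    "payments.queue\t3\t2\n"
)

stalled = find_stalled_queues(parse_list_queues(SAMPLE))
for q in stalled:
    print(f"{q.name}: {q.messages_ready} ready, {q.consumers} consumers")
```

A line like the one printed here can be pasted straight into the escalation message, next to the command that produced it.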


2. Severity matrix

Use this as a starting framework, and replace it with your team’s actual severity definitions once you have them (these are usually documented in your ticketing/incident tool).

Severity | Definition | Example from the playbooks | Typical response
SEV-1 (Critical) | Broker-wide outage or data-loss risk affecting multiple applications/customers | Cluster lost quorum (2-of-3 nodes down, Playbook 03); disk alarm blocking all publishers cluster-wide (Playbook 02) | Page on-call engineering immediately, open an incident channel, notify stakeholders
SEV-2 (High) | Single application/team significantly impacted, or a broker-level issue with a workaround in place | Broker-wide file-descriptor exhaustion from a connection leak (Playbook 04) affecting one noisy app but risking others | Escalate to the owning team + notify platform on-call; monitor closely
SEV-3 (Medium) | Degraded but functioning: delayed processing, one queue backing up, no customer-facing outage yet | Queue depth growing with an identified cause and a fix in progress (Playbook 01) | Handle within support tier if within your remit; escalate during business hours if a code fix is needed
SEV-4 (Low) | Isolated, non-urgent, or purely informational | A DLQ has a handful of messages from a known, already-being-fixed bug (Playbook 06) | Ticket it, no immediate escalation needed

Rule of thumb from the playbooks: anything that’s broker-wide (affects multiple unrelated applications) or involves potential data loss trends toward SEV-1/2 regardless of how “simple” the underlying cause turns out to be. Anything scoped to one application’s queue, with the broker itself otherwise healthy, trends toward SEV-3/4.
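
The rule of thumb can be written down as a small decision function. This is a sketch only: the function name suggest_severity and its parameters are hypothetical, and the tie-breaks inside each pair (a workaround separating SEV-1 from SEV-2, urgency separating SEV-3 from SEV-4) are assumptions read off the matrix, not definitions from any incident tool. Your team's own severity definitions take precedence.

```python
def suggest_severity(
    broker_wide: bool,
    data_loss_risk: bool,
    workaround_in_place: bool = False,
    significant_impact: bool = False,
    urgent: bool = True,
) -> str:
    """Suggest a starting severity from scope and risk (a sketch, not policy)."""
    # Broker-wide scope or potential data loss trends toward SEV-1/2.
    if broker_wide or data_loss_risk:
        # Assumption: a working workaround is what separates SEV-2 from SEV-1.
        return "SEV-2" if workaround_in_place else "SEV-1"
    # A single application or team that is significantly impacted is SEV-2.
    if significant_impact:
        return "SEV-2"
    # One application's queue with a healthy broker trends toward SEV-3/4.
    # Assumption: urgency is what separates SEV-3 from SEV-4.
    return "SEV-3" if urgent else "SEV-4"


# Cluster lost quorum: broker-wide, no workaround.
print(suggest_severity(broker_wide=True, data_loss_risk=False))
# A handful of DLQ messages from a known bug: isolated and non-urgent.
print(suggest_severity(broker_wide=False, data_loss_risk=False, urgent=False))
```

Treat the result as a starting point to justify in one sentence, not as a final classification.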


3. What to capture before you escalate

Regardless of severity, gather this before paging anyone. It’s the same information a good escalation message needs, so collecting it isn’t extra work; it’s the actual deliverable:

Category | What to capture
Symptom | What’s actually broken, from whose perspective (customer-facing? internal only?), and since when
Scope | One queue/app, or broker-wide? One AZ/node, or the whole cluster?
Evidence gathered | Specific output (rabbitmqctl list_queues, rabbitmqctl cluster_status, relevant CloudWatch graphs, relevant log lines), not just “I checked and it looked bad”
What you’ve ruled out | E.g., “confirmed security groups haven’t changed,” “confirmed broker alarms are clear.” This stops the next person from re-checking things you already checked
Your working hypothesis | Even if you’re not certain, state your best guess and why. It’s a starting point, not a commitment
What you’ve already tried | Any safe remediation attempted and its result
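
The six categories above can double as a pre-page checklist. Below is a minimal Python sketch, assuming a hypothetical EscalationEvidence record whose field names mirror the table; the class, its missing() helper, and the sample values are illustrative, not part of any tool.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class EscalationEvidence:
    symptom: str = ""
    scope: str = ""
    evidence: List[str] = field(default_factory=list)
    ruled_out: List[str] = field(default_factory=list)
    hypothesis: str = ""
    # If nothing was tried, record that explicitly, e.g. ["none attempted"].
    actions_tried: List[str] = field(default_factory=list)

    def missing(self) -> List[str]:
        """Names of the categories that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Illustrative values based on the good escalation in section 1.
packet = EscalationEvidence(
    symptom="Orders not being processed since 14:02 UTC",
    scope="orders.created.queue only; broker itself healthy",
    evidence=["orders.created.queue: 12,000 ready, 0 consumers (rabbitmqctl list_queues)"],
    ruled_out=["Security groups: no recent changes"],
)

print(packet.missing())
```

An empty list from missing() means every category has at least something written down; it does not judge whether the content is good enough to act on.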

4. Incident communication template

INCIDENT: [one-line summary]
SEVERITY: [SEV-1 / SEV-2 / SEV-3 / SEV-4]
STARTED: [timestamp, timezone]
SCOPE: [which app(s)/queue(s)/nodes affected]

SYMPTOM:
[what's observably wrong, from a user/business perspective]

EVIDENCE:
- [command/metric #1 and its output]
- [command/metric #2 and its output]

RULED OUT:
- [thing you checked that wasn't the cause]

WORKING HYPOTHESIS:
[your best current explanation, and confidence level]

ACTIONS TAKEN:
- [anything already tried, and the result]

ESCALATING TO: [team/individual]
WHY: [specific reason this is outside your remit, e.g., "requires a code
      deploy from orders-service," "requires broker config change requiring
      platform sign-off," "requires EBS volume resize"]

NEXT UPDATE BY: [time]

Keep the customer/stakeholder-facing update separate and shorter. Non-technical stakeholders need impact and ETA, not rabbitmqctl output:

We're aware of an issue affecting [what's impacted] since [time]. Our team
has identified the cause and is working on a fix. Next update by [time].
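
Both templates can be filled from one set of findings, so the two updates never disagree on time or scope. This is a sketch under assumptions: the dictionary keys, the function names, and the sample values (including the 15:00 UTC next-update time) are hypothetical, and the wording follows the templates above.

```python
def bullets(items):
    """Format a list as '- item' lines, or '- none' when empty."""
    return "\n".join(f"- {item}" for item in items) or "- none"


def render_technical(u: dict) -> str:
    """Fill the technical incident template from a findings dictionary."""
    return "\n".join([
        f"INCIDENT: {u['summary']}",
        f"SEVERITY: {u['severity']}",
        f"STARTED: {u['started']}",
        f"SCOPE: {u['scope']}",
        "",
        "SYMPTOM:", u["symptom"], "",
        "EVIDENCE:", bullets(u["evidence"]), "",
        "RULED OUT:", bullets(u["ruled_out"]), "",
        "WORKING HYPOTHESIS:", u["hypothesis"], "",
        "ACTIONS TAKEN:", bullets(u["actions"]), "",
        f"ESCALATING TO: {u['escalating_to']}",
        f"WHY: {u['why']}",
        "",
        f"NEXT UPDATE BY: {u['next_update']}",
    ])


def render_stakeholder(u: dict) -> str:
    """Fill the short stakeholder template: impact and timing only."""
    return (
        f"We're aware of an issue affecting {u['impact']} since {u['started']}. "
        "Our team has identified the cause and is working on a fix. "
        f"Next update by {u['next_update']}."
    )


# Illustrative values based on the good escalation in section 1.
update = {
    "summary": "orders.created.queue backing up, no consumers attached",
    "severity": "SEV-3",
    "started": "14:02 UTC",
    "scope": "orders-service / orders.created.queue",
    "impact": "order processing",
    "symptom": "Orders are not going through.",
    "evidence": ["rabbitmqctl list_queues: 12,000 messages ready, 0 consumers"],
    "ruled_out": ["Security groups: no recent changes"],
    "hypothesis": "App-side connection issue, not a broker issue.",
    "actions": [],
    "escalating_to": "orders-service on-call",
    "why": "requires investigation on the application side",
    "next_update": "15:00 UTC",
}

print(render_technical(update))
print()
print(render_stakeholder(update))
```

Keeping the stakeholder text in its own function makes it harder for command output to leak into the non-technical update.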

Practical: write both versions for the Hands-On Lab incident

Using the incident you diagnosed in the Hands-On Lab (the EU-region consumer bug compounded by concurrency: 1):

Step 1: Fill out the full technical incident template above using your actual findings from the Hands-On Lab: real command output, not placeholder text.

Step 2: Assign it a severity from the matrix in section 2, and justify your choice in one sentence.

Step 3: Write the short stakeholder-facing version.

Step 4: Compare your technical version against the “good escalation” example in section 1. Does yours let the next person start fixing immediately, or would they need to ask you clarifying questions first? If the latter, that’s exactly the gap to close before you escalate for real.


✅ Checkpoint

You should now be able to:

  • Classify an incident into a severity level using scope (single app vs. broker-wide) as the primary signal.
  • List the six categories of information to gather before escalating, without looking them up.
  • Write a technical incident update that lets the receiving engineer start working immediately, plus a separate plain-language stakeholder update.

Next: Assessment