Kafka Incident Escalation and Communication for On-Call

Knowing how to diagnose is half of on-call; knowing when to hand off, and how to communicate while you do, is the other half. Each playbook ends with an escalation trigger. This module is the shared framework behind those triggers: who owns what, what to collect before you page someone, and how to write an update that actually helps. Good escalation is not failure; it is routing the problem to whoever can fix it fastest.

What you’ll be able to do after this module

Place a symptom on the right ownership layer.
Gather the right diagnostic bundle before escalating.
Write a clear, actionable incident update.
Page the correct team at the right time.

1. Ownership boundaries

Most Kafka incidents fall into one of three layers, and each has a different owner. Misrouting wastes the most precious thing in an incident: time.

flowchart TD
    a["Application layer<br/>listeners, config, keys, idempotency"]
    b["Broker / Kafka layer<br/>partitions, ISR, controller, retention"]
    c["AWS / infra layer<br/>SG, NACL, EBS, EC2, VPC"]
    a -->|app bug/config| appteam["You / app team"]
    b -->|cluster health| eng["Platform / Kafka engineering"]
    c -->|infra fault| net["Network / cloud team"]

Application layer: slow handlers, wrong keys, missing idempotency, deserialization config. You (support/app team) own these.
Broker layer: offline partitions, controller/quorum, retention and storage policy, partition reassignment. Platform/Kafka engineering owns these.
AWS layer: security groups, NACLs, EBS, EC2, VPC routing. The network/cloud team owns these.

The playbooks map symptoms to layers; use them to decide who to page.
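The layer-to-owner mapping above can be sketched as a small lookup. This is only an illustration: the keyword lists are lifted from the three layer descriptions, and `Layer`, `SYMPTOM_KEYWORDS`, and `classify` are hypothetical names, not part of any playbook or tool. Real triage should follow the playbooks rather than keyword matching.

```python
import re
from enum import Enum
from typing import Optional


class Layer(Enum):
    """Ownership layers; each value is the owning team named in this module."""
    APPLICATION = "You / app team"
    BROKER = "Platform / Kafka engineering"
    AWS = "Network / cloud team"


# Illustrative keywords taken from the layer descriptions (an assumption,
# not an exhaustive symptom catalogue).
SYMPTOM_KEYWORDS = {
    Layer.APPLICATION: ("slow handler", "wrong key", "idempotency", "deserialization"),
    Layer.BROKER: ("offline partition", "controller", "quorum", "retention", "reassignment"),
    Layer.AWS: ("security group", "nacl", "ebs", "ec2", "vpc"),
}


def classify(symptom: str) -> Optional[Layer]:
    """Return the first layer whose keyword starts a word in the symptom text."""
    text = symptom.lower()
    for layer, keywords in SYMPTOM_KEYWORDS.items():
        # \b anchors the keyword at a word start, so "ebs" does not match "websocket".
        if any(re.search(r"\b" + re.escape(k), text) for k in keywords):
            return layer
    return None  # unknown symptom: fall back to the playbooks


layer = classify("2 offline partitions on the orders topic")
print(layer.value if layer else "unclassified")
```

Returning `None` for an unrecognized symptom is deliberate: guessing an owner is exactly the misrouting this section warns against.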

2. What to gather before escalating

An escalation with no context forces the next person to start from zero. Before you page, collect a diagnostic bundle so they can act immediately.

The alert: exact text, the metric, and when it started.
Scope: which topic, group, service, partition, broker, or AZ is affected.
What you have checked: the diagnostic steps you already ran and their results.
Key exhibits: kafka-consumer-groups --describe output, ISR state from kafka-topics --describe, relevant log lines, CloudWatch snapshots.
Impact: what is failing for users right now, and how fast it is growing.
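One way to make the checklist above hard to skip is to model the bundle as a record and check it for gaps before paging. The sketch below is a minimal illustration: `DiagnosticBundle`, `missing`, and `describe_commands` are hypothetical names, and the bootstrap address is a placeholder. The command names follow the form used in this module; Apache Kafka distributions ship the same tools as `kafka-consumer-groups.sh` and `kafka-topics.sh`.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List


@dataclass
class DiagnosticBundle:
    """One field per checklist item; field names are illustrative."""
    alert: str = ""    # exact text, metric, and start time
    scope: str = ""    # topic, group, service, partition, broker, or AZ
    checked: List[str] = field(default_factory=list)        # steps run and results
    exhibits: Dict[str, str] = field(default_factory=dict)  # label -> captured output
    impact: str = ""   # what is failing for users and how fast it grows

    def missing(self) -> List[str]:
        """Return the names of checklist items that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


def describe_commands(bootstrap: str, group: str, topic: str) -> List[List[str]]:
    """Build (but do not run) the two describe commands named in the checklist."""
    return [
        ["kafka-consumer-groups", "--bootstrap-server", bootstrap,
         "--describe", "--group", group],
        ["kafka-topics", "--bootstrap-server", bootstrap,
         "--describe", "--topic", topic],
    ]


bundle = DiagnosticBundle(
    alert="Consumer lag alert on payment-service",
    scope="payment-service consumer group, orders topic, partition 1",
)
print(bundle.missing())  # ['checked', 'exhibits', 'impact']
for cmd in describe_commands("broker:9092", "payment-service", "orders"):
    print(" ".join(cmd))
```

If `missing()` returns anything, the bundle is not ready to attach to an escalation.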

3. How to write an incident update

An incident update is not a story; it is a status others act on. Keep it short and structured, and post it on a regular cadence.

A reliable template:

[SEV2] Payment processing delayed
Impact:   Payments lagging ~15 min; orders confirmed but not charged. Growing.
Scope:    payment-service consumer group, orders topic, partition 1 only.
Status:   INVESTIGATING. Lag isolated to one partition; suspect hot key.
Done:     Ruled out absent consumers (all attached) and broker health (ISR full).
Next:     Checking producer key selection; ETA update in 15 min.
Owner:    @you, paging platform on-call for partition guidance.

State impact first (that is what others care about), then scope, current status, what you have ruled out, and the next step with an ETA. Update on a cadence even if the status is “still investigating,” because silence reads as “nobody is on it.”
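If updates are posted from a script or chat bot, the template can be rendered from structured fields so the order never drifts. The following is a minimal sketch, assuming a hypothetical `IncidentUpdate` class; the field order and label alignment mirror the template above, and the sample values are the ones from that template.

```python
from dataclasses import dataclass


@dataclass
class IncidentUpdate:
    """Fields in the order the template presents them (names are illustrative)."""
    severity: str
    title: str
    impact: str
    scope: str
    status: str
    done: str
    next_step: str
    owner: str

    def render(self) -> str:
        """Render the update with labels padded to a fixed 10-character column."""
        rows = [
            ("Impact", self.impact),
            ("Scope", self.scope),
            ("Status", self.status),
            ("Done", self.done),
            ("Next", self.next_step),
            ("Owner", self.owner),
        ]
        lines = [f"[{self.severity}] {self.title}"]
        lines += [f"{label}:".ljust(10) + value for label, value in rows]
        return "\n".join(lines)


update = IncidentUpdate(
    severity="SEV2",
    title="Payment processing delayed",
    impact="Payments lagging ~15 min; orders confirmed but not charged. Growing.",
    scope="payment-service consumer group, orders topic, partition 1 only.",
    status="INVESTIGATING. Lag isolated to one partition; suspect hot key.",
    done="Ruled out absent consumers (all attached) and broker health (ISR full).",
    next_step="Checking producer key selection; ETA update in 15 min.",
    owner="@you, paging platform on-call for partition guidance.",
)
print(update.render())
```

Because every field is required, the constructor fails if an update is drafted without an impact or a next step, which keeps "still investigating" updates structured.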

4. When to page which team

Match the confirmed layer to the team, and page in parallel when the blast radius is large.

Situation	Page
App config, slow handler, keys, idempotency	App team (you), fix in place
Offline partitions, controller/quorum loss	Platform / Kafka engineering, immediately
Persistent under-replicated partitions, RF change or reassignment needed	Platform / Kafka engineering
Retention/storage expansion, tiered storage	Platform / Kafka engineering
SG, NACL, EBS, EC2, VPC faults	Network / cloud team
Downstream dependency outage (DB, API)	That service’s on-call, in parallel
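The table above can also be read as a lookup from confirmed situations to the set of teams to page in parallel. The sketch below is illustrative only: the situation keys and the `teams_to_page` helper are hypothetical, while the team names are copied from the table.

```python
from typing import Dict, Iterable, List

# Hypothetical short keys for the table rows; team names come from the table.
PAGE_FOR: Dict[str, str] = {
    "app_config_or_handler": "App team (you)",
    "offline_partitions_or_quorum": "Platform / Kafka engineering",
    "under_replicated_persisting": "Platform / Kafka engineering",
    "retention_or_storage": "Platform / Kafka engineering",
    "aws_infra_fault": "Network / cloud team",
    "downstream_dependency_outage": "That service's on-call",
}


def teams_to_page(situations: Iterable[str]) -> List[str]:
    """Return each team once, in first-seen order, for all confirmed situations."""
    teams: List[str] = []
    for situation in situations:
        team = PAGE_FOR[situation]  # KeyError on an unknown key is intentional
        if team not in teams:
            teams.append(team)
    return teams


# A large blast radius usually means more than one confirmed situation.
print(teams_to_page(["offline_partitions_or_quorum", "downstream_dependency_outage"]))
```

Deduplicating keeps a team from being paged twice when several broker-layer situations are confirmed at once.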

5. Guided practical

Practice communication on the incidents you already ran in the Hands-On Lab.

For the broker-loss incident, write the two-line impact and scope you would post.
For the poison-message incident, assemble the diagnostic bundle you would attach to an escalation.
For the slow-consumer/storm incident, decide which layer owns it and whether it needs escalation at all.
Draft one full incident update using the template for whichever incident you would escalate.

Next: Cheat Sheet, the daily-use quick reference.