Kafka Alert Playbooks: Production Incident Response Guide

This module is split into twelve short, largely self-contained playbooks, one per recurring Kafka incident type. Each is designed to be read and practiced in a focused sitting, so you can work through them one at a time rather than in a single long session. The examples target the Order/Payment/Inventory/Notification system on AWS MSK, but the diagnostic approach carries over to other Kafka deployments.

Do them in order the first time through, since later playbooks occasionally build on earlier concepts (rebalancing in Playbook 4 reappears in Playbook 10). After that, use them as standalone reference material during real incidents.

Common structure

Every playbook follows the same shape, so once you know it you can navigate any of them under time pressure:

Symptom: what the alert or ticket actually says.
Likely causes: broker-side and application-side causes, listed separately.
How it manifests to the Spring app: what your service actually sees.
Diagnostic steps: an ordered checklist, cheapest and least invasive first.
Safe remediations: what you can do yourself, with cautions on anything risky.
Escalation trigger: the specific condition that means stop and page on-call engineering.
Relevant commands and exhibits: copy-pasteable CLI and CloudWatch references.
Guided practical: reproduce a scaled-down version locally, or diagnose from AWS exhibits.
Checkpoint: verifiable skills the playbook builds.

The twelve playbooks

#	Playbook	Core skill it builds
1	Consumer Lag and Stuck Consumers	Reading lag; spotting slow, absent, or evicted consumers
2	Under-Replicated and Offline Partitions	ISR health and data-loss risk
3	Broker Down, Controller Failover, KRaft Quorum Loss	Cluster-level failover and quorum
4	Rebalance Storms and Group Instability	Timeouts, cooperative rebalancing, static membership
5	Producer Failures	Send-side exceptions and their root causes
6	Poison Messages, Deserialization Errors, and the DLT	Un-deserializable records and DLT triage
7	Disk Pressure, Retention, and Segment Issues	Broker storage and retention
8	Auth Failures After Credential or Certificate Rotation	Secrets lifecycle vs long-lived connections
9	AWS-Layer Connectivity	Distinguishing network/infra from broker symptoms
10	Latency, Ordering, and Duplicates	GC/broker slowness; idempotency and ordering
11	Offset Problems	Offset resets, replays, and out-of-range
12	Schema Registry Incompatibility in Production	Compatibility modes and rollout order

How to use these under pressure

Start with the symptom that matches your alert, not the root cause you suspect. Within each playbook, the diagnostic steps are ordered from cheapest and least invasive to most invasive, precisely so you avoid risky actions early. When two playbooks seem to apply, work the one matching the alert text first, then follow its cross-links. After the playbooks, the Hands-On Lab walks a full incident end to end, and Escalation and Communication covers what to do when a playbook tells you to escalate.
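
The "start from the alert text" rule can be sketched as a small lookup. This is only an illustration, not part of the playbooks themselves: the playbook numbers and titles come from the table above, but the keyword lists and the function name match_playbooks are hypothetical, and a real mapping would have to be built from your own alert names.

```python
# Playbook numbers and titles, copied from the table in this module.
PLAYBOOKS = {
    1: "Consumer Lag and Stuck Consumers",
    2: "Under-Replicated and Offline Partitions",
    4: "Rebalance Storms and Group Instability",
    7: "Disk Pressure, Retention, and Segment Issues",
    11: "Offset Problems",
}

# HYPOTHETICAL keywords per playbook. These are assumptions made for
# illustration; replace them with the wording your alerts actually use.
ALERT_KEYWORDS = {
    1: ("lag", "stuck consumer"),
    2: ("under-replicated", "underreplicated", "offline partition"),
    4: ("rebalance",),
    7: ("disk", "retention"),
    11: ("offset",),
}


def match_playbooks(alert_text):
    """Return playbook numbers whose keywords appear in the alert text.

    Matching is a naive case-insensitive substring test, so short
    keywords such as "lag" can over-match (for example inside "flag").
    Results are in playbook-number order, not ranked by relevance.
    """
    text = alert_text.lower()
    return [
        number
        for number, keywords in sorted(ALERT_KEYWORDS.items())
        if any(keyword in text for keyword in keywords)
    ]


if __name__ == "__main__":
    for number in match_playbooks("Consumer lag rising after rebalance"):
        print(number, PLAYBOOKS[number])
```

The lookup keys on what the alert says rather than on a suspected cause, which is the point of the rule. When several playbooks match, it only narrows the candidates; choosing which one to work first and following its cross-links is still a human decision.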