Alert Playbooks
Twelve independent playbooks for recurring Kafka incidents, each with symptom, diagnosis, safe remediation, and escalation triggers for Spring Boot on-call.
This module is split into twelve short playbooks, one per recurring Kafka incident type. Each is designed to be read and practiced in one focused sitting, so you can work through them one at a time rather than in a single long session. The examples target the Order/Payment/Inventory/Notification system on AWS MSK, but most of the diagnostic reasoning applies to any Kafka deployment.
Work through them in order the first time, since later playbooks occasionally build on earlier concepts (for example, the rebalancing covered in Playbook 4 reappears in Playbook 10). After that, use them as standalone reference material during real incidents.
Common structure
Every playbook follows the same shape, so once you know it you can navigate any of them under time pressure:
- Symptom: what the alert or ticket actually says.
- Likely causes: broker-side and application-side causes, listed separately.
- How it manifests to the Spring app: what your service actually sees.
- Diagnostic steps: an ordered checklist, cheapest and least invasive first.
- Safe remediations: what you can do yourself, with cautions on anything risky.
- Escalation trigger: the specific condition that means stop and page on-call engineering.
- Relevant commands and exhibits: copy-pasteable CLI and CloudWatch references.
- Guided practical: reproduce a scaled-down version locally, or diagnose from AWS exhibits.
- Checkpoint: verifiable skills the playbook builds.
The twelve playbooks
| # | Playbook | Core skill it builds |
|---|---|---|
| 1 | Consumer Lag and Stuck Consumers | Reading lag; spotting slow, absent, or evicted consumers |
| 2 | Under-Replicated and Offline Partitions | ISR health and data-loss risk |
| 3 | Broker Down, Controller Failover, KRaft Quorum Loss | Cluster-level failover and quorum |
| 4 | Rebalance Storms and Group Instability | Timeouts, cooperative rebalancing, static membership |
| 5 | Producer Failures | Send-side exceptions and their root causes |
| 6 | Poison Messages, Deserialization Errors, and the DLT | Un-deserializable records and DLT triage |
| 7 | Disk Pressure, Retention, and Segment Issues | Broker storage and retention |
| 8 | Auth Failures After Credential or Certificate Rotation | Secrets lifecycle vs long-lived connections |
| 9 | AWS-Layer Connectivity | Distinguishing network/infra from broker symptoms |
| 10 | Latency, Ordering, and Duplicates | GC/broker slowness; idempotency and ordering |
| 11 | Offset Problems | Offset resets, replays, and out-of-range |
| 12 | Schema Registry Incompatibility in Production | Compatibility modes and rollout order |
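As a taste of the first core skill in the table, reading lag comes down to simple arithmetic: for each partition, lag is the log end offset minus the group's committed offset. The sketch below is illustrative only and is not taken from Playbook 1. The class and method names (`LagSketch`, `lagByPartition`, `totalLag`) and the offset values are hypothetical, and it makes the simplifying assumption that a partition with no committed offset is entirely unread, which ignores the log start offset and the consumer's `auto.offset.reset` setting.

```java
import java.util.Map;
import java.util.TreeMap;

public class LagSketch {

    /**
     * Per-partition lag: log end offset minus committed offset, floored at zero.
     * Keys are partition numbers; values are offsets.
     */
    static Map<Integer, Long> lagByPartition(Map<Integer, Long> endOffsets,
                                             Map<Integer, Long> committedOffsets) {
        Map<Integer, Long> lag = new TreeMap<>(); // sorted by partition for stable output
        for (Map.Entry<Integer, Long> entry : endOffsets.entrySet()) {
            // Assumption: no committed offset means the whole partition is unread.
            long committed = committedOffsets.getOrDefault(entry.getKey(), 0L);
            lag.put(entry.getKey(), Math.max(0L, entry.getValue() - committed));
        }
        return lag;
    }

    /** Total lag across all partitions of the topic. */
    static long totalLag(Map<Integer, Long> lagByPartition) {
        long total = 0L;
        for (long partitionLag : lagByPartition.values()) {
            total += partitionLag;
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical offsets for a three-partition topic.
        Map<Integer, Long> endOffsets = Map.of(0, 1500L, 1, 900L, 2, 400L);
        Map<Integer, Long> committed = Map.of(0, 1500L, 1, 650L); // partition 2 never committed

        Map<Integer, Long> lag = lagByPartition(endOffsets, committed);
        System.out.println("Lag by partition: " + lag);
        System.out.println("Total lag: " + totalLag(lag));
    }
}
```

In this made-up example, partition 0 is caught up, partition 1 is behind, and partition 2 has never had an offset committed. Telling those three situations apart is the kind of reading the first playbook practices.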
How to use these under pressure
Start with the symptom that matches your alert, not the root cause you suspect. Within each playbook, the diagnostic steps run from cheapest and least invasive to most invasive, precisely so that you avoid risky actions early. When two playbooks seem to apply, work the one matching the alert text first, then follow its cross-links. After the playbooks, the Hands-On Lab walks a full incident end to end, and Escalation and Communication covers what to do when a playbook tells you to escalate.