Alert Playbooks

Twelve independent playbooks for recurring Kafka incidents, each with symptom, diagnosis, safe remediation, and escalation triggers for Spring Boot on-call.

This module is split into twelve short, independent playbooks, one per recurring Kafka incident type. Each is designed to be read and practiced in a focused sitting, so you can work through them one at a time rather than in a single long session. They target the Order/Payment/Inventory/Notification system on AWS MSK, but the diagnosis applies to any Kafka deployment.

Do them in order the first time through, since later playbooks occasionally build on earlier concepts (rebalancing in Playbook 4 reappears in Playbook 10). After that, use them as standalone reference material during real incidents.


Common structure

Every playbook follows the same shape, so once you know it you can navigate any of them under time pressure:

  1. Symptom: what the alert or ticket actually says.
  2. Likely causes: broker-side and application-side causes, listed separately.
  3. How it manifests to the Spring app: what your service actually sees.
  4. Diagnostic steps: an ordered checklist, cheapest and least invasive first.
  5. Safe remediations: what you can do yourself, with cautions on anything risky.
  6. Escalation trigger: the specific condition that means stop and page on-call engineering.
  7. Relevant commands and exhibits: copy-pasteable CLI and CloudWatch references.
  8. Guided practical: reproduce a scaled-down version locally, or diagnose from AWS exhibits.
  9. Checkpoint: verifiable skills the playbook builds.

The twelve playbooks

#   Playbook                                                Core skill it builds
1   Consumer Lag and Stuck Consumers                        Reading lag; spotting slow, absent, or evicted consumers
2   Under-Replicated and Offline Partitions                 ISR health and data-loss risk
3   Broker Down, Controller Failover, KRaft Quorum Loss     Cluster-level failover and quorum
4   Rebalance Storms and Group Instability                  Timeouts, cooperative rebalancing, static membership
5   Producer Failures                                       Send-side exceptions and their root causes
6   Poison Messages, Deserialization Errors, and the DLT    Un-deserializable records and DLT triage
7   Disk Pressure, Retention, and Segment Issues            Broker storage and retention
8   Auth Failures After Credential or Certificate Rotation  Secrets lifecycle vs long-lived connections
9   AWS-Layer Connectivity                                  Distinguishing network/infra from broker symptoms
10  Latency, Ordering, and Duplicates                       GC/broker slowness; idempotency and ordering
11  Offset Problems                                         Offset resets, replays, and out-of-range
12  Schema Registry Incompatibility in Production           Compatibility modes and rollout order

How to use these under pressure

Start with the symptom that matches your alert, not the root cause you suspect. Within each playbook, the diagnostic steps run from cheapest and least invasive to most invasive, precisely so you avoid risky actions early. When two playbooks seem to apply, work the one matching the alert text first, then follow its cross-links. The Hands-On Lab then walks a full incident end to end, and Escalation and Communication covers what to do when a playbook tells you to escalate.
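
The "match the alert text first" rule can be pictured as a simple lookup. The following is a minimal sketch, not part of the playbooks themselves: the class name PlaybookRouter, the route method, and every keyword are hypothetical, chosen loosely from the playbook titles above. Real alert wording depends on how your alarms are named, so treat the keyword list as a placeholder.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

public class PlaybookRouter {

    // Hypothetical keyword -> playbook number table. Insertion order matters:
    // the first keyword found in the alert text wins, which mirrors
    // "work the playbook matching the alert text first".
    private static final Map<String, Integer> KEYWORDS = new LinkedHashMap<>();

    static {
        KEYWORDS.put("lag", 1);
        KEYWORDS.put("under-replicated", 2);
        KEYWORDS.put("offline partition", 2);
        KEYWORDS.put("rebalanc", 4);       // matches "rebalance" and "rebalancing"
        KEYWORDS.put("deserializ", 6);     // matches "deserialization", "deserialize"
        KEYWORDS.put("disk", 7);
        KEYWORDS.put("authentication", 8);
        KEYWORDS.put("offset", 11);
        KEYWORDS.put("schema", 12);
    }

    // Returns the playbook number suggested by the alert text, or empty if
    // no keyword matches (in which case a human picks the closest symptom).
    public static Optional<Integer> route(String alertText) {
        String text = alertText.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, Integer> entry : KEYWORDS.entrySet()) {
            if (text.contains(entry.getKey())) {
                return Optional.of(entry.getValue());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Example alert strings are invented for illustration.
        System.out.println(route("Consumer lag above threshold on order-events"));
        System.out.println(route("Under-replicated partitions > 0"));
        System.out.println(route("CPU credit balance low"));
    }
}
```

Plain substring matching is deliberately crude: "lag" would also match a word such as "flag". The point is only that triage starts from what the alert says, with the suspected root cause left for the playbook's own diagnostic steps.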