
Alert Playbooks

Nine independent playbooks for recurring RabbitMQ incidents, each covering symptom, diagnosis, remediation, and escalation.

Prerequisite: Tooling Walkthrough

This module is split into 9 short, independent playbooks, one per recurring incident type. Each is designed to be read and practiced in 15-20 minutes, so you can work through them one at a time (e.g., one per day) rather than in a single long session.

Do them in order the first time through, because later playbooks occasionally reference concepts from earlier ones (e.g., quorum queues from Playbook 03 come up again in Playbook 09). After that, use them as standalone reference material during real incidents.

Common structure

Every playbook follows the same shape, so once you know it you can navigate any of them under time pressure:

  1. Symptom: what the alert/ticket actually says.
  2. Likely Causes: broker-side causes and application-side (Spring Boot) causes, listed separately.
  3. Diagnostic Steps: an ordered checklist, cheapest/fastest checks first.
  4. Safe Remediations: what you can do yourself, with ⚠️ CAUTION callouts on anything risky.
  5. Escalation Trigger: the specific condition that means “stop, page on-call engineering.”
  6. Relevant Commands/Queries: copy-pasteable commands referenced in the steps above.
  7. Mini practical: a small, safe exercise to reproduce a scaled-down version of the incident locally.
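The "cheapest/fastest checks first" ordering in step 3 can be sketched in a few lines. This is an illustration only: the `Check` type, the check names, and the cost estimates are hypothetical and are not taken from any of the playbooks.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Check:
    name: str                # hypothetical label for a diagnostic step
    est_seconds: int         # rough cost estimate, used only for ordering
    run: Callable[[], bool]  # returns True when the check passes


def first_failing_check(checks: List[Check]) -> Optional[str]:
    """Run checks cheapest-first; return the name of the first one that fails."""
    for check in sorted(checks, key=lambda c: c.est_seconds):
        if not check.run():
            return check.name
    return None  # every check passed


# Hypothetical checklist; the real playbooks define their own steps.
checks = [
    Check("inspect application logs", 300, lambda: True),
    Check("queue has consumers", 10, lambda: False),
    Check("broker alarms clear", 30, lambda: True),
]
result = first_failing_check(checks)
```

Here the 10-second consumer check runs before the 5-minute log review, so the failure is found without paying for the expensive step.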

Playbooks

| # | Playbook | Core skill it builds |
|---|----------|----------------------|
| 1 | Queue Depth Growing / Consumer Lag | Reading queue metrics; spotting stuck/absent consumers |
| 2 | Memory/Disk Alarm & Blocked Publishers | Understanding broker self-protection mechanisms |
| 3 | Node Down / Cluster Partition | Quorum queue behavior under node failure |
| 4 | Connection/Channel Exhaustion | Spotting connection/channel leaks in app code |
| 5 | Auth Failures After Credential Rotation | Secrets lifecycle vs. long-lived app connections |
| 6 | Poison Messages & DLQ | Retry/dead-lettering interplay with Spring Retry |
| 7 | AWS-Layer Connectivity Issues | Distinguishing network/infra symptoms from broker symptoms |
| 8 | TLS/Certificate Expiry | Reading SSL handshake failures end-to-end |
| 9 | Latency Spikes & Ordering/Duplicate Surprises | GC/CPU-credit correlation; idempotent consumer design |
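As a small taste of the first core skill (spotting absent consumers from queue metrics), the sketch below filters a list of queue records for queues that hold messages but have no consumers attached. It assumes records shaped like the entries returned by the RabbitMQ management API queue listing (`name`, `messages`, `consumers`). The function name, the threshold parameter, and the sample data are made up for illustration, and the playbook itself may use different tooling.

```python
from typing import Dict, List


def queues_without_consumers(queues: List[Dict], min_messages: int = 1) -> List[str]:
    """Return names of queues that hold messages but have no consumers attached."""
    return [
        q["name"]
        for q in queues
        if q.get("consumers", 0) == 0 and q.get("messages", 0) >= min_messages
    ]


# Made-up sample data, shaped like management API queue entries.
sample = [
    {"name": "orders", "messages": 1200, "consumers": 0},
    {"name": "emails", "messages": 5, "consumers": 2},
    {"name": "audit", "messages": 0, "consumers": 0},
]
stuck = queues_without_consumers(sample)
```

An empty queue with no consumers (`audit` above) is deliberately not flagged, since it is not accumulating a backlog.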