Cheat Sheet
Top diagnostic commands, Spring Boot config keys, and escalation triggers for daily on-call rotation.
One page for daily use once you’re on rotation. Everything here is explained in depth elsewhere in the course; this is the lookup, not the explanation.
Top 10 diagnostic commands
Run these from an SSM Session Manager session on a broker node (or locally via docker exec -it rabbitmq-crashcourse bash).
| # | Command | Tells you |
|---|---|---|
| 1 | rabbitmq-diagnostics cluster_status | Which nodes are up, whether a partition is active |
| 2 | rabbitmq-diagnostics check_local_alarms | Whether this node has a memory/disk alarm triggered |
| 3 | rabbitmqctl list_queues name messages_ready messages_unacknowledged consumers | Backlog + whether consumers are attached, per queue |
| 4 | rabbitmqctl list_connections name peer_host state | Who’s connected, from where, and connection churn |
| 5 | rabbitmqctl list_channels name connection consumer_count messages_unacknowledged | Per-channel consumer/unacked detail, useful for finding a stuck consumer |
| 6 | rabbitmqctl list_consumers | Which consumers are attached to which queues; a queue missing from the output has zero consumers |
| 7 | rabbitmqctl list_users / list_permissions -p / | Confirming a user exists and has the expected vhost permissions |
| 8 | rabbitmq-diagnostics status | Node-level memory/disk watermarks, uptime, enabled plugins |
| 9 | openssl s_client -connect <host>:5671 -servername <host> </dev/null 2>/dev/null \| openssl x509 -noout -dates | Actual TLS certificate expiry dates, checked externally |
| 10 | curl -s http://<app-host>:8080/actuator/health | Whether the Spring app itself thinks its RabbitMQ connection is up |
⚠️ None of the above modify broker state. Anything that does (delete_queue, purge_queue, forget_cluster_node, reset, change_password) requires the approval path in Escalation and Communication. Never run these solo during live triage.
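As an illustration of how command 3's output can be narrowed down during triage, here is a minimal sketch of a filter that prints queues with a backlog and no consumers attached. The function name flag_orphaned_queues is hypothetical, and the sketch assumes tab-separated output in the column order used in command 3. It only reads text and never touches broker state.

```shell
# Hypothetical helper: reads tab-separated rows in the column order
#   name, messages_ready, messages_unacknowledged, consumers
# and prints the name of every queue with ready messages but zero consumers.
flag_orphaned_queues() {
  awk -F'\t' '$2 > 0 && $4 == 0 { print $1 }'
}

# Possible wiring on a broker node (read-only; left commented out so the
# sketch runs anywhere, and assumes your rabbitmqctl supports these flags):
# rabbitmqctl list_queues -q --no-table-headers \
#   name messages_ready messages_unacknowledged consumers | flag_orphaned_queues

# Sample rows standing in for real broker output.
printf 'orders\t120\t0\t0\nbilling\t0\t3\t2\n' | flag_orphaned_queues
```

A queue flagged here still needs the usual confirmation in the Management UI before you act on it.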
Top 10 metrics to know
| Metric | Where | Healthy | Alerting |
|---|---|---|---|
| messages_ready | Management UI / list_queues | Near 0, draining | Climbing steadily |
| consumers | Management UI / list_queues | Matches expected instance count | 0 while messages arrive |
| NodeMemoryUsage | CloudWatch / status | < ~60% of watermark | > ~90% (blocks publishers) |
| Disk free space | CloudWatch / status | Comfortably above disk_free_limit | At/near limit (blocks publishers cluster-wide) |
| ConnectionCount | Management UI / CloudWatch | Stable | Rapid continuous growth (leak/churn) |
| FileDescriptorsUsed | CloudWatch / status | Well below ulimit | Approaching ulimit |
| PublishRate vs AckRate | Management UI graphs | Ack ≈ Deliver over time | Ack persistently lower (processing stuck/failing) |
| CPUCreditBalance | CloudWatch (burstable instances only) | Stable/replenishing | Steadily depleting toward 0 |
| EBSVolumeQueueLength | CloudWatch | Below provisioned limit | Sustained saturation |
| jvm.gc.pause (app-side) | Spring Actuator | Small, infrequent | Large jump correlated with a latency report |
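To make the memory row concrete, the sketch below turns a used-bytes figure and a watermark into one of three labels. The 60% and 90% cut-offs come from the table; the function name classify_memory and the middle "watch" band are assumptions added for illustration, since the table only defines the two ends.

```shell
# Hypothetical helper. Usage: classify_memory <used_bytes> <watermark_bytes>
# Thresholds follow the table (< ~60% healthy, > ~90% alerting).
# The in-between "watch" label is an assumption, not from the course.
classify_memory() {
  pct=$(( $1 * 100 / $2 ))   # integer percentage of the watermark
  if [ "$pct" -gt 90 ]; then
    echo "alerting"
  elif [ "$pct" -lt 60 ]; then
    echo "healthy"
  else
    echo "watch"
  fi
}

classify_memory 950 1000
```

The percentage uses integer division, so treat values right at a boundary as approximate, in the same spirit as the "~" in the table.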
Key Spring AMQP config properties
| Property | Controls |
|---|---|
| spring.rabbitmq.host / port / username / password | Connection basics |
| spring.rabbitmq.addresses | List all cluster nodes here, not just one: a single-address config means losing that one node looks like total broker unavailability to the app |
| spring.rabbitmq.virtual-host | Which vhost this app connects to |
| spring.rabbitmq.listener.simple.concurrency / max-concurrency | Number of parallel listener threads: too low causes head-of-line blocking behind one slow/failing message; also the setting that explains most “out of order” complaints once > 1 |
| spring.rabbitmq.listener.simple.prefetch | How many unacked messages the broker hands a consumer before waiting for acks |
| spring.rabbitmq.listener.simple.acknowledge-mode | AUTO (ack on clean return, nack on thrown exception) vs MANUAL (you call ack/nack yourself) |
| spring.rabbitmq.listener.simple.default-requeue-rejected | On failure, requeue (default true, risks a poison-message loop) or drop/dead-letter (false) |
| spring.rabbitmq.listener.simple.retry.* (enabled, max-attempts, initial-interval, multiplier) | In-listener retry with exponential backoff before the message is finally rejected |
| spring.rabbitmq.cache.channel.size | Channel pool size in CachingConnectionFactory: undersized causes churn under load |
| spring.rabbitmq.publisher-confirm-type | Whether/how the producer gets confirmation a message reached the broker (correlated for async confirms) |
| spring.rabbitmq.publisher-returns + template mandatory | Return unroutable messages to a ReturnsCallback instead of silently dropping them |
| spring.rabbitmq.ssl.* | TLS settings: never set validate-server-certificate=false outside an explicitly approved, tracked exception |
| spring.rabbitmq.ssl.verify-hostname | Verify the broker hostname against its certificate: keep true |
| spring.rabbitmq.ssl.key-store / key-store-password | Client certificate for mutual TLS (mTLS) |
| x-dead-letter-exchange (queue arg, via QueueBuilder) | Where rejected/expired messages go instead of looping forever or being lost |
| x-message-ttl (queue arg) | How long a message can live before expiring/dead-lettering |
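If your services take these settings from the environment rather than a properties file, the spring.rabbitmq.* keys above map to variable names under Spring Boot's relaxed binding: dots become underscores, dashes are removed, and letters are uppercased. The sketch below shows that conversion. The helper name to_env_name, the hostnames, and the values are hypothetical, and whether your deployment injects config this way is an assumption.

```shell
# Hypothetical helper: convert a Spring property key to its
# environment-variable form (dots -> underscores, dashes removed, uppercase).
to_env_name() {
  printf '%s' "$1" | tr '.' '_' | tr -d -- '-' | tr '[:lower:]' '[:upper:]'
}

to_env_name spring.rabbitmq.listener.simple.max-concurrency
echo

# Illustrative values only; hostnames and numbers are placeholders.
export SPRING_RABBITMQ_ADDRESSES="rabbit-1.example.internal:5671,rabbit-2.example.internal:5671"
export SPRING_RABBITMQ_LISTENER_SIMPLE_PREFETCH=10
export SPRING_RABBITMQ_LISTENER_SIMPLE_DEFAULTREQUEUEREJECTED=false
```

This is mainly useful when reading a container or task definition during triage and matching a variable back to the property it overrides.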
Developer deep-dive links
Use this map when the cheat sheet isn’t enough and you need the full explanation:
| Topic | Module |
|---|---|
| Ack modes, prefetch, concurrency | Acknowledgements & Prefetch |
| Publisher confirms and returns | Publisher Confirms |
| Retry, backoff, error handling | Retry & Error Handling |
| Dead-letter exchanges, DLQ, parking-lot | Dead Letter Exchanges |
| Idempotency and duplicates | Idempotency & Duplicates |
| Cluster addresses and recovery | Connection Recovery |
| Users, permissions, vhosts, TLS/mTLS | Security |
| Metrics, tracing, health checks | Observability |
| Throughput and tuning | Performance Tuning |
| Testing with Testcontainers | Testing |
Fast triage flow
- Management UI → Overview. Any alarm banner? Any node missing? → broker-wide issue, check Playbook 02 or 03.
- Management UI → Queues. Which specific queue is misbehaving? Ready climbing + 0 consumers → Playbook 01. Ready climbing + consumers attached but stuck → Playbook 06 (check for a repeating exception on specific payloads).
- App logs.
  - AmqpConnectException → network/broker down, see Playbook 07.
  - PossibleAuthenticationFailureException → Playbook 05.
  - SSLHandshakeException → Playbook 08.
  - ListenerExecutionFailedException → app code bug, likely Playbook 06.
- Connections/Channels tab. Count climbing with no traffic increase → Playbook 04.
- Nothing broker-side looks wrong, but it’s slow. → Playbook 09 (GC pause vs. CPU credit exhaustion).
- Still unclear, or it’s broker-wide / needs a config or code change beyond your remit → escalate with your evidence attached.
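The app-log step of this flow can be sketched as a small lookup from a log line to a playbook. The exception names and playbook numbers are taken from the flow above; the function name playbook_for_exception and the "unknown" fallback are hypothetical. The more specific causes are checked first, on the assumption that a connection failure may wrap an auth or TLS error in the same log line.

```shell
# Hypothetical helper: map a log line to the playbook named in the triage flow.
# Specific causes are matched before the generic connect failure, because a
# single log line may mention both (for example a wrapped "Caused by").
playbook_for_exception() {
  case "$1" in
    *PossibleAuthenticationFailureException*) echo "Playbook 05" ;;
    *SSLHandshakeException*)                  echo "Playbook 08" ;;
    *ListenerExecutionFailedException*)       echo "Playbook 06" ;;
    *AmqpConnectException*)                   echo "Playbook 07" ;;
    *)                                        echo "unknown" ;;
  esac
}

# Sample log line, made up for illustration.
playbook_for_exception "AmqpConnectException: java.net.ConnectException: Connection refused"
```

An "unknown" result is a cue to continue down the flow, not a sign that the broker is healthy.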
Escalation contacts
(placeholder, fill in with your team’s actual contacts/on-call rotation)
| Role | Contact / on-call schedule |
|---|---|
| Platform/Infra on-call (broker, AWS, networking) | [PagerDuty/Opsgenie schedule link] |
| Owning app team on-call (per-service) | [link to your service catalog / on-call directory] |
| Security/secrets team (cert & credential rotation issues) | [contact] |
| Escalation manager / incident commander (SEV-1) | [contact] |
Course complete. Return to the course index for the full learning path, or jump directly into the alert playbooks as living reference material during real incidents.