RabbitMQ Cheat Sheet: Commands, Config & Escalation Triggers

One page for daily use once you’re on rotation. Everything here is explained in depth elsewhere in the course; this is the lookup, not the explanation.

Top 10 diagnostic commands

Run these from an SSM Session Manager session on a broker node (or `docker exec -it rabbitmq-crashcourse bash` locally).

#	Command	Tells you
1	`rabbitmq-diagnostics cluster_status`	Which nodes are up, whether a partition is active
2	`rabbitmq-diagnostics check_local_alarms`	Whether this node has a memory/disk alarm triggered
3	`rabbitmqctl list_queues name messages_ready messages_unacknowledged consumers`	Backlog + whether consumers are attached, per queue
4	`rabbitmqctl list_connections name peer_host state`	Who’s connected, from where, and connection churn
5	`rabbitmqctl list_channels name consumer_count messages_unacknowledged`	Per-channel consumer/unacked detail (the channel name includes its connection), useful for finding a stuck consumer
6	`rabbitmqctl list_consumers`	Which queues have consumers attached; a queue missing from the output has zero consumers
7	`rabbitmqctl list_users` / `list_permissions -p /`	Confirming a user exists and has the expected vhost permissions
8	`rabbitmq-diagnostics status`	Node-level memory/disk watermarks, uptime, enabled plugins
9	`openssl s_client -connect <host>:5671 -servername <host> </dev/null 2>/dev/null | openssl x509 -noout -dates`	Actual TLS certificate expiry dates, checked externally
10	`curl -s http://<app-host>:8080/actuator/health`	Whether the Spring app itself thinks its RabbitMQ connection is up

⚠️ None of the above modify broker state. Anything that does (`delete_queue`, `purge_queue`, `forget_cluster_node`, `reset`, `change_password`) requires the approval path in Escalation and Communication. Never run these solo during live triage.
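
Command 3 is quicker to scan when it is filtered down to the "backlog with no consumers" pattern. The following is a minimal sketch, assuming the tab-separated output of `rabbitmqctl list_queues` with the four columns in the order shown in the table; the function name `stranded_queues` is hypothetical and not part of RabbitMQ. It only reads output, so it stays on the safe side of the warning above.

```shell
# Print queues that hold ready messages but have no consumers attached.
# Reads the output of
#   list_queues name messages_ready messages_unacknowledged consumers
# on stdin. Header or banner lines, if present, fail the numeric check
# and are skipped.
stranded_queues() {
  awk -F '\t' '
    NF == 4 && $2 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ && ($2 + 0) > 0 && ($4 + 0) == 0 {
      printf "%s\tready=%s\tunacked=%s\n", $1, $2, $3
    }
  '
}

# Usage on a broker node (read-only):
#   rabbitmqctl -q list_queues name messages_ready messages_unacknowledged consumers | stranded_queues
```

An empty result means no queue currently matches the pattern; it does not rule out consumers that are attached but stuck.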

Top 10 metrics to know

Metric	Where	Healthy	Alerting
`messages_ready`	Management UI / `list_queues`	Near 0, draining	Climbing steadily
`consumers`	Management UI / `list_queues`	Matches expected instance count	0 while messages arrive
`NodeMemoryUsage`	CloudWatch / `status`	< ~60% of watermark	> ~90% of watermark (reaching it blocks publishers)
Disk free space	CloudWatch / `status`	Comfortably above `disk_free_limit`	At/near limit (blocks publishers cluster-wide)
`ConnectionCount`	Management UI / CloudWatch	Stable	Rapid continuous growth (leak/churn)
`FileDescriptorsUsed`	CloudWatch / `status`	Well below ulimit	Approaching ulimit
`PublishRate` vs `AckRate`	Management UI graphs	Ack ≈ Publish over time	Ack persistently lower than Publish (processing stuck/failing)
`CPUCreditBalance`	CloudWatch (burstable instances only)	Stable/replenishing	Steadily depleting toward 0
`EBSVolumeQueueLength`	CloudWatch	Below provisioned limit	Sustained saturation
`jvm.gc.pause` (app-side)	Spring Actuator	Small, infrequent	Large jump correlated with a latency report
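
The `NodeMemoryUsage` row is a ratio against the watermark, which is easy to misread when the raw numbers are in bytes. A small sketch of that arithmetic, assuming you have already read the used-memory and watermark values from `rabbitmq-diagnostics status`; the function name `memory_band` and the middle "watch" band are illustrative assumptions, not something the table defines.

```shell
# Classify node memory use relative to the high watermark.
# Thresholds mirror the table (~60% healthy, ~90% alerting); the "watch"
# band in between is an assumption added for illustration.
memory_band() {
  used_bytes=$1
  watermark_bytes=$2
  awk -v u="$used_bytes" -v w="$watermark_bytes" 'BEGIN {
    if (w <= 0) { print "unknown"; exit }
    pct = 100 * u / w
    if (pct < 60)      print "healthy"
    else if (pct > 90) print "alerting"
    else               print "watch"
  }'
}
```

For example, `memory_band 750 1000` prints `watch`: above the healthy range but not yet near the point where publishers are blocked.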

Key Spring AMQP config properties

Property	Controls
`spring.rabbitmq.host` / `port` / `username` / `password`	Connection basics
`spring.rabbitmq.addresses`	List all cluster nodes here, not just one: a single-address config means losing that one node looks like total broker unavailability to the app
`spring.rabbitmq.virtual-host`	Which vhost this app connects to
`spring.rabbitmq.listener.simple.concurrency` / `max-concurrency`	Number of parallel listener threads: too low causes head-of-line blocking behind one slow/failing message; also the setting that explains most “out of order” complaints once > 1
`spring.rabbitmq.listener.simple.prefetch`	How many unacked messages the broker hands a consumer before waiting for acks
`spring.rabbitmq.listener.simple.acknowledge-mode`	`AUTO` (ack on clean return, nack on thrown exception) vs `MANUAL` (you call ack/nack yourself)
`spring.rabbitmq.listener.simple.default-requeue-rejected`	On failure, requeue (default `true`, risks a poison-message loop) or drop/dead-letter (`false`)
`spring.rabbitmq.listener.simple.retry.*` (`enabled`, `max-attempts`, `initial-interval`, `multiplier`)	In-listener retry with exponential backoff before the message is finally rejected
`spring.rabbitmq.cache.channel.size`	Channel pool size in `CachingConnectionFactory`: undersized causes churn under load
`spring.rabbitmq.publisher-confirm-type`	Whether/how the producer gets confirmation a message reached the broker (`correlated` for async confirms)
`spring.rabbitmq.publisher-returns` + template `mandatory`	Return unroutable messages to a `ReturnsCallback` instead of silently dropping them
`spring.rabbitmq.ssl.*`	TLS settings: never set `validate-server-certificate=false` outside an explicitly approved, tracked exception
`spring.rabbitmq.ssl.verify-hostname`	Verify the broker hostname against its certificate: keep `true`
`spring.rabbitmq.ssl.key-store` / `key-store-password`	Client certificate for mutual TLS (mTLS)
`x-dead-letter-exchange` (queue arg, via `QueueBuilder`)	Where rejected/expired messages go instead of looping forever or being lost
`x-message-ttl` (queue arg)	How long a message can live before expiring/dead-lettering
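
Three rows of the table above describe settings that are wrong more often than they are right: a single-node `addresses` value, and the two TLS verification switches set to `false`. The sketch below checks a config file for them. It assumes a `key=value` `.properties` file (not YAML), and the helper names `prop_value` and `lint_rabbit_props` are hypothetical.

```shell
# Print the value of KEY from a .properties FILE (last occurrence wins).
# Prints an empty line when the key is absent.
prop_value() {
  awk -F '=' -v key="$2" '
    { k = $1; gsub(/^[ \t]+|[ \t]+$/, "", k) }
    k == key {
      v = substr($0, index($0, "=") + 1)
      gsub(/^[ \t]+|[ \t\r]+$/, "", v)
      found = v
    }
    END { print found }
  ' "$1"
}

# Print one finding per risky setting; print nothing when no check fires.
lint_rabbit_props() {
  file=$1

  # A value without a comma names at most one node.
  case $(prop_value "$file" spring.rabbitmq.addresses) in
    *,*) ;;
    *) echo "addresses: list every cluster node, not a single address" ;;
  esac

  if [ "$(prop_value "$file" spring.rabbitmq.ssl.validate-server-certificate)" = "false" ]; then
    echo "ssl.validate-server-certificate=false: needs an approved, tracked exception"
  fi

  if [ "$(prop_value "$file" spring.rabbitmq.ssl.verify-hostname)" = "false" ]; then
    echo "ssl.verify-hostname=false: keep true"
  fi
}
```

Run it as `lint_rabbit_props path/to/application.properties`. It only looks at one file, so values supplied through environment variables or profiles are not seen.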

Developer deep-dive links

Use this map when the cheat sheet isn’t enough and you need the full explanation:

Topic	Module
Ack modes, prefetch, concurrency	Acknowledgements & Prefetch
Publisher confirms and returns	Publisher Confirms
Retry, backoff, error handling	Retry & Error Handling
Dead-letter exchanges, DLQ, parking-lot	Dead Letter Exchanges
Idempotency and duplicates	Idempotency & Duplicates
Cluster addresses and recovery	Connection Recovery
Users, permissions, vhosts, TLS/mTLS	Security
Metrics, tracing, health checks	Observability
Throughput and tuning	Performance Tuning
Testing with Testcontainers	Testing

Fast triage flow

Management UI → Overview. Any alarm banner? Any node missing? → broker-wide issue, check Playbook 02 or 03.
Management UI → Queues. Which specific queue is misbehaving? Ready climbing + 0 consumers → Playbook 01. Ready climbing + consumers attached but stuck → Playbook 06 (check for a repeating exception on specific payloads).
App logs. AmqpConnectException → network/broker down, see Playbook 07. PossibleAuthenticationFailureException → Playbook 05. SSLHandshakeException → Playbook 08. ListenerExecutionFailedException → app code bug, likely Playbook 06.
Connections/Channels tab. Count climbing with no traffic increase → Playbook 04.
Nothing broker-side looks wrong, but it’s slow. → Playbook 09 (GC pause vs. CPU credit exhaustion).
Still unclear, or it’s broker-wide / needs a config or code change beyond your remit → escalate with your evidence attached.
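
The app-log step of the flow is a straight lookup from exception name to playbook, sketched here as a shell function. The function name `playbook_for` is hypothetical. Checking the more specific exceptions before `AmqpConnectException` is a design choice made on the assumption that one log line or stack trace can mention more than one of them.

```shell
# Map the first recognised exception in a log line to the playbook named
# in the triage flow. Prints "no match" when none of them appears.
playbook_for() {
  case $1 in
    *PossibleAuthenticationFailureException*) echo "Playbook 05" ;;
    *SSLHandshakeException*)                  echo "Playbook 08" ;;
    *AmqpConnectException*)                   echo "Playbook 07" ;;
    *ListenerExecutionFailedException*)       echo "Playbook 06" ;;
    *)                                        echo "no match" ;;
  esac
}
```

A "no match" result means the log line does not settle it; carry on with the remaining steps of the flow.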

Escalation contacts

(placeholder, fill in with your team’s actual contacts/on-call rotation)

Role	Contact / on-call schedule
Platform/Infra on-call (broker, AWS, networking)	`[PagerDuty/Opsgenie schedule link]`
Owning app team on-call (per-service)	`[link to your service catalog / on-call directory]`
Security/secrets team (cert & credential rotation issues)	`[contact]`
Escalation manager / incident commander (SEV-1)	`[contact]`

Course complete. Return to the course index for the full learning path, or jump directly into the alert playbooks as living reference material during real incidents.