Cheat Sheet

Top diagnostic commands, Spring Boot config keys, and escalation triggers for daily on-call rotation.

One page for daily use once you’re on rotation. Everything here is explained in depth elsewhere in the course; this is the lookup, not the explanation.


Top 10 diagnostic commands

Run these from an SSM Session Manager session on a broker node (or locally via docker exec -it rabbitmq-crashcourse bash).

| # | Command | Tells you |
|---|---------|-----------|
| 1 | `rabbitmq-diagnostics cluster_status` | Which nodes are up, whether a partition is active |
| 2 | `rabbitmq-diagnostics check_local_alarms` | Whether this node has a memory/disk alarm triggered |
| 3 | `rabbitmqctl list_queues name messages_ready messages_unacknowledged consumers` | Backlog + whether consumers are attached, per queue |
| 4 | `rabbitmqctl list_connections name peer_host state` | Who’s connected, from where, and connection churn |
| 5 | `rabbitmqctl list_channels name consumer_count messages_unacknowledged` | Per-channel consumer/unacked detail, useful for finding a stuck consumer |
| 6 | `rabbitmqctl list_consumers` | Which queues have consumers attached; a queue absent from the output has none |
| 7 | `rabbitmqctl list_users` / `rabbitmqctl list_permissions -p /` | Confirming a user exists and has the expected vhost permissions |
| 8 | `rabbitmq-diagnostics status` | Node-level memory/disk watermarks, uptime, enabled plugins |
| 9 | `openssl s_client -connect <host>:5671 -servername <host> </dev/null 2>/dev/null \| openssl x509 -noout -dates` | Actual TLS certificate expiry dates, checked externally |
| 10 | `curl -s http://<app-host>:8080/actuator/health` | Whether the Spring app itself thinks its RabbitMQ connection is up |

⚠️ None of the above modify broker state. Anything that does (delete_queue, purge_queue, forget_cluster_node, reset, change_password) requires the approval path in Escalation and Communication. Never run these solo during live triage.
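
The backlog check from command 3 can be scripted once you have its output. The sketch below is a minimal, read-only Python example and not part of the course tooling. It assumes the output is tab-separated in the column order shown for command 3 (for example, when run with `-q` to suppress banner lines), and it skips any header row. The names `QueueRow`, `parse_list_queues`, `flag_queues`, and the `backlog_threshold` default are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class QueueRow(NamedTuple):
    name: str
    ready: int
    unacked: int
    consumers: int


def parse_list_queues(text: str) -> List[QueueRow]:
    """Parse tab-separated list_queues rows, skipping header and banner lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 4:
            continue  # not a data row (banner, blank line, unexpected shape)
        name, ready, unacked, consumers = parts
        if not (ready.isdigit() and unacked.isdigit() and consumers.isdigit()):
            continue  # header row such as "name  messages_ready ..."
        rows.append(QueueRow(name, int(ready), int(unacked), int(consumers)))
    return rows


def flag_queues(rows: List[QueueRow],
                backlog_threshold: int = 1000) -> List[Tuple[str, str]]:
    """Return (queue, reason) pairs worth a closer look.

    backlog_threshold is an arbitrary illustrative value; tune it per queue.
    """
    findings = []
    for row in rows:
        if row.ready > 0 and row.consumers == 0:
            findings.append((row.name, "backlog with no consumers"))
        elif row.ready >= backlog_threshold:
            findings.append((row.name, "large backlog despite attached consumers"))
    return findings
```

Pass the captured stdout of command 3 to `parse_list_queues` and the result to `flag_queues`. A single snapshot cannot show whether a backlog is climbing, so compare two runs taken a minute or so apart before drawing conclusions.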


Top 10 metrics to know

| Metric | Where | Healthy | Alerting |
|--------|-------|---------|----------|
| messages_ready | Management UI / list_queues | Near 0, draining | Climbing steadily |
| consumers | Management UI / list_queues | Matches expected instance count | 0 while messages arrive |
| NodeMemoryUsage | CloudWatch / status | < ~60% of watermark | > ~90% (blocks publishers) |
| Disk free space | CloudWatch / status | Comfortably above disk_free_limit | At/near limit (blocks publishers cluster-wide) |
| ConnectionCount | Management UI / CloudWatch | Stable | Rapid continuous growth (leak/churn) |
| FileDescriptorsUsed | CloudWatch / status | Well below ulimit | Approaching ulimit |
| PublishRate vs AckRate | Management UI graphs | Ack ≈ Deliver over time | Ack persistently lower (processing stuck/failing) |
| CPUCreditBalance | CloudWatch (burstable instances only) | Stable/replenishing | Steadily depleting toward 0 |
| EBSVolumeQueueLength | CloudWatch | Below provisioned limit | Sustained saturation |
| jvm.gc.pause (app-side) | Spring Actuator | Small, infrequent | Large jump correlated with a latency report |
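
The memory row above expresses its bands as a fraction of the high watermark. As a small illustration, the Python sketch below turns the two byte values reported for a node into one of those bands. The function name and the middle "watch" band (between roughly 60% and 90%) are assumptions added for the example; the table itself only names the healthy and alerting ends.

```python
def classify_memory(used_bytes: int, watermark_bytes: int) -> str:
    """Map memory use to the cheat-sheet bands, as a fraction of the watermark."""
    if watermark_bytes <= 0:
        raise ValueError("watermark_bytes must be positive")
    ratio = used_bytes / watermark_bytes
    if ratio < 0.60:
        return "healthy"
    if ratio > 0.90:
        return "alerting"
    return "watch"  # hypothetical middle band, not defined in the table
```

For example, `classify_memory(950, 1000)` falls in the alerting band, while `classify_memory(500, 1000)` is healthy.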

Key Spring AMQP config properties

| Property | Controls |
|----------|----------|
| `spring.rabbitmq.host` / `port` / `username` / `password` | Connection basics |
| `spring.rabbitmq.addresses` | List all cluster nodes here, not just one: a single-address config means losing that one node looks like total broker unavailability to the app |
| `spring.rabbitmq.virtual-host` | Which vhost this app connects to |
| `spring.rabbitmq.listener.simple.concurrency` / `max-concurrency` | Number of parallel listener threads: too low causes head-of-line blocking behind one slow/failing message; also the setting that explains most “out of order” complaints once > 1 |
| `spring.rabbitmq.listener.simple.prefetch` | How many unacked messages the broker hands a consumer before waiting for acks |
| `spring.rabbitmq.listener.simple.acknowledge-mode` | AUTO (ack on clean return, nack on thrown exception) vs MANUAL (you call ack/nack yourself) |
| `spring.rabbitmq.listener.simple.default-requeue-rejected` | On failure, requeue (default true, risks a poison-message loop) or drop/dead-letter (false) |
| `spring.rabbitmq.listener.simple.retry.*` (`enabled`, `max-attempts`, `initial-interval`, `multiplier`) | In-listener retry with exponential backoff before the message is finally rejected |
| `spring.rabbitmq.cache.channel.size` | Channel pool size in CachingConnectionFactory: undersized causes churn under load |
| `spring.rabbitmq.publisher-confirm-type` | Whether/how the producer gets confirmation a message reached the broker (correlated for async confirms) |
| `spring.rabbitmq.publisher-returns` + template `mandatory` | Return unroutable messages to a ReturnsCallback instead of silently dropping them |
| `spring.rabbitmq.ssl.*` | TLS settings: never set validate-server-certificate=false outside an explicitly approved, tracked exception |
| `spring.rabbitmq.ssl.verify-hostname` | Verify the broker hostname against its certificate: keep true |
| `spring.rabbitmq.ssl.key-store` / `key-store-password` | Client certificate for mutual TLS (mTLS) |
| `x-dead-letter-exchange` (queue arg, via QueueBuilder) | Where rejected/expired messages go instead of looping forever or being lost |
| `x-message-ttl` (queue arg) | How long a message can live before expiring/dead-lettering |
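
The `retry.*` keys decide how long a failing message is held inside the listener before it is finally rejected, which matters when you are judging whether a consumer is stuck or just backing off. The sketch below is plain Python arithmetic, not Spring code. It assumes the usual exponential policy: the first wait equals `initial-interval`, each later wait is multiplied by `multiplier`, and every wait is capped by `max-interval` (a key not listed above). The function name is hypothetical.

```python
from typing import List


def retry_delays_ms(max_attempts: int,
                    initial_interval_ms: int,
                    multiplier: float,
                    max_interval_ms: int) -> List[int]:
    """Waits (ms) between attempts; max_attempts counts the first delivery too."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = []
    interval = float(initial_interval_ms)
    # N attempts means N - 1 waits: there is no wait after the final failure.
    for _ in range(max_attempts - 1):
        delays.append(int(min(interval, max_interval_ms)))
        interval *= multiplier
    return delays


# Example: 5 attempts, 1 s initial interval, doubling, capped at 10 s.
delays = retry_delays_ms(5, 1000, 2.0, 10000)
print(delays, "total:", sum(delays), "ms")
```

With these example values a poison message occupies its listener thread for about 15 seconds of waiting, plus processing time, before it is rejected. Multiply that by the number of bad messages to estimate how long a low-concurrency consumer can appear stalled.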

Use this map when the cheat sheet isn’t enough and you need the full explanation:

| Topic | Module |
|-------|--------|
| Ack modes, prefetch, concurrency | Acknowledgements & Prefetch |
| Publisher confirms and returns | Publisher Confirms |
| Retry, backoff, error handling | Retry & Error Handling |
| Dead-letter exchanges, DLQ, parking-lot | Dead Letter Exchanges |
| Idempotency and duplicates | Idempotency & Duplicates |
| Cluster addresses and recovery | Connection Recovery |
| Users, permissions, vhosts, TLS/mTLS | Security |
| Metrics, tracing, health checks | Observability |
| Throughput and tuning | Performance Tuning |
| Testing with Testcontainers | Testing |

Fast triage flow

  1. Management UI → Overview. Any alarm banner? Any node missing? → broker-wide issue, check Playbook 02 or 03.
  2. Management UI → Queues. Which specific queue is misbehaving? Ready climbing + 0 consumers → Playbook 01. Ready climbing + consumers attached but stuck → Playbook 06 (check for a repeating exception on specific payloads).
  3. App logs. AmqpConnectException → network/broker down, see Playbook 07. PossibleAuthenticationFailureException → Playbook 05. SSLHandshakeException → Playbook 08. ListenerExecutionFailedException → app code bug, likely Playbook 06.
  4. Connections/Channels tab. Count climbing with no traffic increase → Playbook 04.
  5. Nothing broker-side looks wrong, but it’s slow. → Playbook 09 (GC pause vs. CPU credit exhaustion).
  6. Still unclear, or it’s broker-wide / needs a config or code change beyond your remit → escalate with your evidence attached.
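
The app-log step of the flow is a straight lookup from exception name to playbook, so it can be sketched as a small Python helper. The exception-to-playbook pairs come from step 3. The function name and the matching order are assumptions made for this example: more specific exceptions are checked first because a stack trace can contain a generic connect failure wrapping a TLS or authentication cause.

```python
from typing import Optional

# Checked in order: specific causes before the generic connection failure.
PLAYBOOK_BY_EXCEPTION = [
    ("SSLHandshakeException", "Playbook 08"),
    ("PossibleAuthenticationFailureException", "Playbook 05"),
    ("ListenerExecutionFailedException", "Playbook 06"),
    ("AmqpConnectException", "Playbook 07"),
]


def suggest_playbook(log_text: str) -> Optional[str]:
    """Return the playbook for the first known exception name found, else None."""
    for exception_name, playbook in PLAYBOOK_BY_EXCEPTION:
        if exception_name in log_text:
            return playbook
    return None
```

A `None` result means the log excerpt matched none of the four exceptions; continue with the next steps of the flow.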

Escalation contacts

(placeholder, fill in with your team’s actual contacts/on-call rotation)

| Role | Contact / on-call schedule |
|------|----------------------------|
| Platform/Infra on-call (broker, AWS, networking) | [PagerDuty/Opsgenie schedule link] |
| Owning app team on-call (per-service) | [link to your service catalog / on-call directory] |
| Security/secrets team (cert & credential rotation issues) | [contact] |
| Escalation manager / incident commander (SEV-1) | [contact] |

Course complete. Return to the course index for the full learning path, or jump directly into the alert playbooks as living reference material during real incidents.