
Tooling Walkthrough

Management UI tabs, safe CLI commands, CloudWatch signals, and how to read healthy vs alerting output.

Prerequisite: AWS Architecture
You’ll need: the Docker container from Environment Setup running, and terminal access.


What you’ll be able to do after this module

  • Navigate the Management UI to check node health, queue depth, connections, and consumers.
  • Run the core rabbitmqctl / rabbitmq-diagnostics commands and tell healthy output from alerting output.
  • Know which CloudWatch metric to check for a given symptom, and cross-reference it with Spring Boot Actuator.

1. The Management UI tour

Open http://localhost:15672 (served by the Environment Setup container) and log in with the default credentials (guest/guest).

| Tab | What it shows | What to look for when triaging |
|---|---|---|
| Overview | Cluster-wide message rates, node list, alarms | Any node not green/running; any active resource alarm (memory/disk) banner at the top |
| Connections | Every open TCP connection, from which app, since when | Sudden spike in connection count (churn); connections stuck in a state other than running (for example blocked or blocking) |
| Channels | Every open channel, its consumer count, unacked messages, prefetch | Channels with a huge “unacked” count (consumer stuck or crashed mid-processing) |
| Exchanges | All exchanges, message-in/out rates | Confirms whether a producer is actually publishing (rate > 0) |
| Queues | Every queue: ready/unacked/total messages, consumer count, message rates | This is the #1 tab you’ll live in. Ready count climbing = backlog. Consumers = 0 = nobody’s listening. |
| Admin | Users, vhosts, policies | Rarely touched by the support tier; access here is usually read-only |

Click into a specific queue (e.g., orders.created.queue from First Producer and Consumer) and note the fields:

  • Ready: messages waiting to be delivered to a consumer.
  • Unacked: messages delivered to a consumer but not yet acknowledged (i.e., currently “in flight” / being processed).
  • Total: Ready + Unacked.
  • Consumers: how many active consumers are subscribed to this queue right now.
  • Message rates: publish/deliver/ack rates over time, graphed.

Healthy pattern: Ready hovers near 0, Unacked briefly spikes then drops, consumer count matches your expected deployed instance count. Alerting pattern: Ready climbs steadily and doesn’t recover, or Consumers = 0 while Ready > 0.


2. CLI tools: rabbitmqctl and rabbitmq-diagnostics

Exec into the running container to try these (in production you’d use SSM Session Manager instead of docker exec, but the commands themselves are identical):

docker exec -it rabbitmq-crashcourse bash

Cluster health

rabbitmq-diagnostics status

Healthy: shows Status of node rabbit@<hostname> ... with no errors, lists enabled plugins, memory/disk watermarks not exceeded.

rabbitmq-diagnostics cluster_status

Healthy: lists all expected nodes under Running Nodes, with none under Nodes Not Running. Alerting: a node is missing from Running Nodes. This is your first signal for Playbook 03, Node Down.

rabbitmq-diagnostics check_running
rabbitmq-diagnostics check_local_alarms

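Healthy: check_running confirms the RabbitMQ application is running on the node, and check_local_alarms exits with status 0 and reports no local alarms. Alerting: check_local_alarms exits non-zero and reports a resource_limit_alarm for memory or disk, meaning this node has hit a watermark and is now blocking publishers. Go straight to Playbook 02.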
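
The two health checks above can be chained into a single pass/fail sweep. The following is a minimal sketch, not a standard tool: the run_health_checks function and the DIAG_CMD variable are hypothetical names introduced here, and the sketch relies only on each check exiting non-zero when it fails.

```shell
# DIAG_CMD is a hypothetical override so the same function works inside the
# container or from the host (for example via docker exec).
DIAG_CMD="${DIAG_CMD:-rabbitmq-diagnostics}"

run_health_checks() {
  failed=0
  for check in check_running check_local_alarms; do
    # $DIAG_CMD is left unquoted on purpose: it may hold several words.
    if $DIAG_CMD "$check" >/dev/null 2>&1; then
      echo "OK   $check"
    else
      echo "FAIL $check"
      failed=1
    fi
  done
  # Non-zero return means at least one check failed.
  return "$failed"
}
```

From the host, one way to call it would be to set DIAG_CMD to "docker exec rabbitmq-crashcourse rabbitmq-diagnostics" before running run_health_checks. Any FAIL line is a cue to rerun that check by hand and read its full output.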

Queues

rabbitmqctl list_queues name messages_ready messages_unacknowledged consumers

Healthy example output:

name                    messages_ready  messages_unacknowledged  consumers
orders.created.queue    0               0                        2

Alerting example output:

name                    messages_ready  messages_unacknowledged  consumers
orders.created.queue    48213           0                        0

consumers = 0 with a large and growing messages_ready is the single most common alert pattern you’ll triage. It means: messages are arriving, nothing is picking them up.
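
When a broker has many queues, scanning for this pattern by eye gets slow. The sketch below filters list_queues output down to queues that have a backlog and no consumers. It assumes the four columns are requested in the order shown above and that the output is tab-separated; flag_stalled_queues is a hypothetical helper name, not part of rabbitmqctl.

```shell
# Reads rows of: name, messages_ready, messages_unacknowledged, consumers
# (tab-separated) on stdin and prints queues with a backlog and no consumers.
flag_stalled_queues() {
  awk -F '\t' '
    # Skip the header row if one is present.
    $1 == "name" { next }
    # Backlog (ready > 0) with nobody listening (consumers == 0).
    $2 > 0 && $4 == 0 { printf "STALLED %s ready=%s\n", $1, $2 }
  '
}
```

A typical invocation would pipe the command into it: rabbitmqctl --quiet list_queues name messages_ready messages_unacknowledged consumers | flag_stalled_queues. No output means no queue currently matches the pattern.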

Connections and channels

rabbitmqctl list_connections name peer_host state
rabbitmqctl list_channels name consumer_count messages_unacknowledged

Use these to identify which application instance owns a problematic connection or channel. This is critical when escalating to an app team, since you can tell them exactly which pod/instance to look at instead of “something’s wrong with your service.”

Users and permissions (read-only checks during an auth incident)

rabbitmqctl list_users
rabbitmqctl list_permissions -p /

Useful for confirming “does this user actually have publish/consume rights on this vhost” before assuming it’s a network problem.

⚠️ CAUTION: Commands like rabbitmqctl delete_queue, purge_queue, forget_cluster_node, or reset are destructive and can cause data loss or cluster damage. None of the commands shown earlier in this section are among them: they are all safe, read-only diagnostics that modify nothing. Anything that changes broker state requires the approval/escalation path in Escalation and Communication.


3. CloudWatch metrics reference

| Metric | Meaning | Healthy range (example) | Alert threshold (example) |
|---|---|---|---|
| QueueDepth / MessageReadyCount | Messages waiting for a consumer | Near 0, or draining quickly | Sustained growth over N minutes |
| ConsumerCount | Active consumers attached to a queue | Matches expected deployed instance count | 0 while messages are arriving |
| NodeMemoryUsage (or mem_used via plugin) | Broker memory usage vs. configured high watermark | < 60% of watermark | > 90% of watermark (triggers publisher blocking) |
| DiskFreeLimitAlarm | Whether the disk-space alarm is active | Not triggered | Triggered (publishers blocked cluster-wide) |
| FileDescriptorsUsed | OS file descriptors in use by the broker process | Well below the ulimit | Approaching the ulimit (connection/channel exhaustion) |
| ConnectionCount | Total open AMQP connections | Stable, matches expected app instance count × pool size | Rapid, continuous growth (churn/leak) |
| PublishRate / DeliverRate / AckRate | Messages/sec published, delivered, acknowledged | Deliver ≈ Publish over time; Ack ≈ Deliver | Ack rate persistently lower than Deliver rate (processing is failing or hanging) |
| EBSVolumeQueueLength / EBSReadWriteOps (AWS-layer) | Disk I/O saturation on the underlying EBS volume | Below provisioned IOPS/throughput limit | Sustained saturation → broker-level latency even though CPU looks fine |
| CPUCreditBalance (if burstable instance type) | Remaining CPU burst credits | Stable or replenishing | Steadily depleting toward 0 → imminent throttling |
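
“Sustained growth over N minutes” is easier to apply once it is written down as a concrete rule. The sketch below is one possible interpretation, not the definition used by any particular alarm: it treats growth as sustained when each of the last N samples is higher than the one before it. The is_sustained_growth name is hypothetical, and the input is assumed to be one numeric sample per line, oldest first (for example one queue-depth datapoint per minute).

```shell
# Usage: is_sustained_growth N < samples
# Exits 0 when each of the last N samples is greater than its predecessor.
is_sustained_growth() {
  awk -v n="$1" '
    { v[NR] = $1 + 0 }
    END {
      if (NR < n + 1) exit 1          # not enough history to judge
      for (i = NR - n + 1; i <= NR; i++)
        if (v[i] <= v[i - 1]) exit 1  # flat or falling step: not sustained
      exit 0
    }
  '
}
```

Requiring every step to rise is deliberately strict: a single dip resets the verdict, which keeps a briefly bursty but draining queue from being flagged.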

4. Cross-referencing with the Spring Boot side

Infra metrics only tell half the story. Always correlate with the application side:

  • Spring Boot Actuator health: if you have spring-boot-starter-actuator + spring-boot-starter-amqp, hit /actuator/health. With health details exposed (management.endpoint.health.show-details set to always or when-authorized), a healthy RabbitMQ connection shows:
    { "components": { "rabbit": { "status": "UP", "details": { "version": "3.13.x" } } } }

    A broken connection shows "status": "DOWN" with an exception message. This is often faster to check than digging through broker logs.

  • Application logs to grep for:

    | Log signature | Usually means |
    |---|---|
    | AmqpConnectException / Connection refused | Broker unreachable: network/SG issue or broker down |
    | PossibleAuthenticationFailureException | Credentials wrong: check Playbook 05 |
    | ListenerExecutionFailedException | Your @RabbitListener method threw an exception: this is an app-code bug, not a broker problem |
    | SSLHandshakeException | Certificate issue: check Playbook 08 |
    | Consumer thread silent, no errors, but queue growing | Listener likely blocked/hung (e.g., waiting on a slow downstream DB call): check thread dumps, not broker logs |
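
The exception-based signatures listed above can be turned into a quick first-pass filter over an application log. This is a rough sketch: classify_amqp_errors is a hypothetical helper, the hint labels are shorthand invented here, and it only matches on the exception names, so the “silent consumer” case still needs a thread dump.

```shell
# Reads application log lines on stdin and prints a triage hint in front of
# each line that contains one of the known signatures. Other lines are dropped.
classify_amqp_errors() {
  awk '
    /AmqpConnectException|Connection refused/ { print "broker unreachable: " $0; next }
    /PossibleAuthenticationFailureException/  { print "credentials: " $0; next }
    /ListenerExecutionFailedException/        { print "app code: " $0; next }
    /SSLHandshakeException/                   { print "certificate: " $0; next }
  '
}
```

It reads from stdin, so it can be fed a saved log file or the output of whatever log viewer your deployment uses. The first matching rule wins for each line.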

Rule of thumb: if the Management UI shows healthy broker-side metrics (low ready count, consumers attached, no alarms) but the business symptom persists (e.g., orders not shipping), the problem is almost certainly in application code, not RabbitMQ. If the Management UI itself shows the anomaly (growing ready count, 0 consumers, active alarms), start with the broker/infra side.


Practical: diagnose a live backlog using only the CLI

Step 1: Using your producer/consumer app from First Producer and Consumer, stop the consumer (comment out @Component again, or just stop the Spring Boot app).

Step 2: Publish 20 messages in a loop:

for i in $(seq 1 20); do
  curl -s -X POST localhost:8080/orders -H "Content-Type: application/json" -d "{\"id\":$i}"
done

Step 3: Without opening the Management UI, use only rabbitmqctl (via docker exec) to answer:

  • How many messages are ready in orders.created.queue?
  • How many consumers are attached?
  • Based on that alone, what’s your diagnosis?

Step 4: Restart the consumer app and re-run the same list_queues command to confirm the backlog drains and messages_ready returns to 0.
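
If you would rather not re-run the command by hand in Step 4, a small polling loop can watch the queue drain. The sketch below is one way to do it under a few assumptions: wait_for_drain and LIST_QUEUES_CMD are hypothetical names, the container is called rabbitmq-crashcourse as in this module, and list_queues output is tab-separated.

```shell
# LIST_QUEUES_CMD is a hypothetical override; by default it runs rabbitmqctl
# inside the container used in this module.
LIST_QUEUES_CMD="${LIST_QUEUES_CMD:-docker exec rabbitmq-crashcourse rabbitmqctl --quiet list_queues name messages_ready}"

# Usage: wait_for_drain QUEUE [MAX_TRIES] [SLEEP_SECONDS]
# Returns 0 once messages_ready is 0, or 1 if the tries run out first.
wait_for_drain() {
  queue=$1; tries=${2:-30}; pause=${3:-2}
  while [ "$tries" -gt 0 ]; do
    # Pick the messages_ready column for the requested queue.
    ready=$($LIST_QUEUES_CMD | awk -F '\t' -v q="$queue" '$1 == q { print $2 }')
    echo "$queue messages_ready=${ready:-unknown}"
    if [ "$ready" = "0" ]; then
      return 0
    fi
    tries=$((tries - 1))
    sleep "$pause"
  done
  return 1
}
```

Calling wait_for_drain orders.created.queue prints one reading per poll, so you can see the count fall as the restarted consumer catches up.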


✅ Checkpoint

You should now be able to:

  • Name the four Management UI tabs you’d check first during an incident, in priority order.
  • Run rabbitmqctl list_queues and rabbitmq-diagnostics check_local_alarms from memory.
  • Explain the rule of thumb for deciding “broker problem” vs. “app problem” based on what the Management UI shows.

Next: Alert Playbooks