Tooling Walkthrough
Management UI tabs, safe CLI commands, CloudWatch signals, and how to read healthy vs alerting output.
Prerequisite: AWS Architecture
You’ll need: the Docker container from Environment Setup running, and terminal access.
What you’ll be able to do after this module
- Navigate the Management UI to check node health, queue depth, connections, and consumers.
- Run the core rabbitmqctl / rabbitmq-diagnostics commands and tell healthy output from alerting output.
- Know which CloudWatch metric to check for a given symptom, and cross-reference it with Spring Boot Actuator.
1. The Management UI tour
Open localhost:15672 (from the Environment Setup container) and log in (guest/guest).
| Tab | What it shows | What to look for when triaging |
|---|---|---|
| Overview | Cluster-wide message rates, node list, alarms | Any node not green/running; any active resource alarm (memory/disk) banner at the top |
| Connections | Every open TCP connection, from which app, since when | Sudden spike in connection count (churn); connections stuck in a state other than running (e.g., blocked or blocking) |
| Channels | Every open channel, its consumer count, unacked messages, prefetch | Channels with a huge “unacked” count (consumer stuck or crashed mid-processing) |
| Exchanges | All exchanges, message-in/out rates | Confirms whether a producer is actually publishing (rate > 0) |
| Queues | Every queue: ready/unacked/total messages, consumer count, message rates | This is the #1 tab you’ll live in. Ready count climbing = backlog. Consumers = 0 = nobody’s listening. |
| Admin | Users, vhosts, policies | Rarely touched by support tier: usually read-only access here |
Click into a specific queue (e.g., orders.created.queue from First Producer and Consumer) and note the fields:
- Ready: messages waiting to be delivered to a consumer.
- Unacked: messages delivered to a consumer but not yet acknowledged (i.e., currently “in flight” / being processed).
- Total: Ready + Unacked.
- Consumers: how many consumers are subscribed to this queue right now.
- Message rates: publish/deliver/ack rates over time, graphed.
Healthy pattern: Ready hovers near 0, Unacked briefly spikes then drops, and the consumer count matches your expected deployed instance count.
Alerting pattern: Ready climbs steadily and doesn’t recover, or Consumers = 0 while Ready > 0.
2. CLI tools: rabbitmqctl and rabbitmq-diagnostics
Exec into the running container to try these (in production you’d use SSM Session Manager instead of docker exec, but the commands themselves are identical):
docker exec -it rabbitmq-crashcourse bash
Cluster health
rabbitmq-diagnostics status
Healthy: shows Status of node rabbit@<hostname> ... with no errors, lists enabled plugins, memory/disk watermarks not exceeded.
rabbitmq-diagnostics cluster_status
Healthy: lists all expected nodes under Running Nodes, with none under Nodes Not Running. Alerting: a node is missing from Running Nodes. This is your first signal for Playbook 03, Node Down.
rabbitmq-diagnostics check_running
rabbitmq-diagnostics check_local_alarms
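Healthy: check_running confirms the RabbitMQ application is running on the node, and check_local_alarms reports no local alarms; both exit with status 0. Alerting: check_local_alarms exits non-zero and reports a resource alarm for memory or disk. This node has hit a watermark and is now blocking publishers. Go straight to Playbook 02.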
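The two checks above can be chained into a single pass/fail probe by relying on their exit codes. The following is a minimal sketch, not an official script: the DIAG variable and the node_health function are hypothetical names introduced here, and the messages are illustrative only.

```shell
# DIAG is an assumption: how rabbitmq-diagnostics is invoked from where you are.
# Inside the container it is just the command name; from the host it could be
# "docker exec rabbitmq-crashcourse rabbitmq-diagnostics".
DIAG="${DIAG:-rabbitmq-diagnostics}"

# node_health (hypothetical helper): run both read-only checks in order and
# report the first one that fails. Each check signals failure via exit code.
node_health() {
  if ! $DIAG check_running >/dev/null 2>&1; then
    echo "ALERT: node not running"
    return 1
  fi
  if ! $DIAG check_local_alarms >/dev/null 2>&1; then
    echo "ALERT: local resource alarm (memory or disk)"
    return 2
  fi
  echo "OK: node running, no local alarms"
}

# Usage (inside the container):
#   node_health
```

Both underlying commands are read-only, so the wrapper stays within the safe set of diagnostics covered in this module.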
Queues
rabbitmqctl list_queues name messages_ready messages_unacknowledged consumers
Healthy example output:
name messages_ready messages_unacknowledged consumers
orders.created.queue 0 0 2
Alerting example output:
name messages_ready messages_unacknowledged consumers
orders.created.queue 48213 0 0
consumers = 0 with a large and growing messages_ready is the single most common alert pattern you’ll triage. It means: messages are arriving, nothing is picking them up.
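To make the healthy vs. alerting reading mechanical, the list_queues output can be piped through a small filter. This is a sketch under assumptions: classify_queues is a hypothetical helper, the WATCH label is an illustrative middle state not defined in this module, and the input is assumed to have exactly the four columns requested above.

```shell
# classify_queues (hypothetical helper): read the output of
#   rabbitmqctl list_queues name messages_ready messages_unacknowledged consumers
# on stdin and print one verdict per queue. Banner lines and the header row are
# skipped because their count columns are not plain integers.
classify_queues() {
  awk '
    NF == 4 && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ {
      if ($2 + 0 > 0 && $4 + 0 == 0) {
        verdict = "ALERT: backlog with no consumers"
      } else if ($2 + 0 > 0) {
        verdict = "WATCH: backlog, consumers attached"
      } else {
        verdict = "OK"
      }
      printf "%s\t%s\n", $1, verdict
    }'
}

# Usage (from the host):
#   docker exec rabbitmq-crashcourse rabbitmqctl list_queues \
#     name messages_ready messages_unacknowledged consumers | classify_queues
```

A single snapshot cannot show whether the ready count is growing, so re-run it a few times before treating a WATCH row as a real backlog.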
Connections and channels
rabbitmqctl list_connections name peer_host state
rabbitmqctl list_channels name consumer_count messages_unacknowledged
Use these to identify which application instance owns a problematic connection or channel. This is critical when escalating to an app team, since you can tell them exactly which pod/instance to look at instead of “something’s wrong with your service.”
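When there are many channels, it helps to filter for the ones holding a large number of unacknowledged messages. The sketch below assumes tab-separated list_channels output with the columns name, consumer_count, messages_unacknowledged, in that order. stuck_channels is a hypothetical helper, and its default threshold of 100 is an arbitrary placeholder rather than a recommendation.

```shell
# stuck_channels (hypothetical helper): print channels whose unacknowledged
# count is at or above a threshold (first argument, default 100).
# Expects tab-separated rows: name, consumer_count, messages_unacknowledged.
# The header row is skipped because its third column is not an integer.
stuck_channels() {
  awk -F '\t' -v limit="${1:-100}" '
    NF == 3 && $3 ~ /^[0-9]+$/ && $3 + 0 >= limit + 0 {
      printf "%s\tunacked=%s\tconsumers=%s\n", $1, $3, $2
    }'
}

# Usage (from the host):
#   docker exec rabbitmq-crashcourse rabbitmqctl list_channels \
#     name consumer_count messages_unacknowledged | stuck_channels 50
```

The channel name includes the client address and port, which is usually enough to map a channel back to a specific application instance.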
Users and permissions (read-only checks during an auth incident)
rabbitmqctl list_users
rabbitmqctl list_permissions -p /
Useful for confirming “does this user actually have publish/consume rights on this vhost” before assuming it’s a network problem.
⚠️ CAUTION: Commands like rabbitmqctl delete_queue, purge_queue, forget_cluster_node, or reset are destructive and can cause data loss or cluster damage. None of the commands above modify anything; they are all safe, read-only diagnostics. Anything that changes broker state requires the approval/escalation path in Escalation and Communication.
3. CloudWatch metrics reference
| Metric | Meaning | Healthy range (example) | Alert threshold (example) |
|---|---|---|---|
| QueueDepth / MessageReadyCount | Messages waiting for a consumer | Near 0, or draining quickly | Sustained growth over N minutes |
| ConsumerCount | Active consumers attached to a queue | Matches expected deployed instance count | 0 while messages are arriving |
| NodeMemoryUsage (or mem_used via plugin) | Broker memory usage vs. configured high watermark | < 60% of watermark | > 90% of watermark (publishers are blocked once the watermark is reached) |
| DiskFreeLimitAlarm | Whether the disk-space alarm is active | Not triggered | Triggered (publishers blocked cluster-wide) |
| FileDescriptorsUsed | OS file descriptors in use by the broker process | Well below the ulimit | Approaching the ulimit (connection/channel exhaustion) |
| ConnectionCount | Total open AMQP connections | Stable, matches expected app instance count × pool size | Rapid, continuous growth (churn/leak) |
| PublishRate / DeliverRate / AckRate | Messages/sec published, delivered, acknowledged | Deliver ≈ Publish over time; Ack ≈ Deliver | Ack rate persistently lower than Deliver rate (processing is failing or hanging) |
| EBSVolumeQueueLength / EBSReadWriteOps (AWS-layer) | Disk I/O saturation on the underlying EBS volume | Below provisioned IOPS/throughput limit | Sustained saturation → broker-level latency even though CPU looks fine |
| CPUCreditBalance (if burstable instance type) | Remaining CPU burst credits | Stable or replenishing | Steadily depleting toward 0 → imminent throttling |
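The “sustained growth over N minutes” threshold for queue depth can be read as “every sample is higher than the previous one.” The sketch below illustrates that one interpretation; sustained_growth is a hypothetical helper, and a real CloudWatch alarm would express the condition in its own configuration rather than in a script.

```shell
# sustained_growth (hypothetical helper): succeed (exit 0) only when every
# sample is strictly larger than the one before it.
# Arguments: queue-depth samples, oldest first, e.g. one per minute.
sustained_growth() {
  [ "$#" -ge 2 ] || return 1        # need at least two samples to see a trend
  prev=$1
  shift
  for depth in "$@"; do
    [ "$depth" -gt "$prev" ] || return 1
    prev=$depth
  done
  return 0
}

# Usage: five one-minute samples of messages_ready
#   sustained_growth 10 40 90 200 500 && echo "ALERT: backlog growing"
```

Requiring strict growth across all samples keeps a single bursty minute from triggering the alert; how many samples to require is a tuning decision for your queues.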
4. Cross-referencing with the Spring Boot side
Infra metrics only tell half the story. Always correlate with the application side:
- Spring Boot Actuator health: if you have spring-boot-starter-actuator + spring-boot-starter-amqp, hit /actuator/health. With health details exposed (management.endpoint.health.show-details), a healthy RabbitMQ connection shows:
  { "components": { "rabbit": { "status": "UP", "details": { "version": "3.13.x" } } } }
  A broken connection shows "status": "DOWN" with an exception message. This is often faster to check than digging through broker logs.
- Application logs to grep for:

| Log signature | Usually means |
|---|---|
| AmqpConnectException / Connection refused | Broker unreachable: network/SG issue or broker down |
| PossibleAuthenticationFailureException | Credentials wrong: check Playbook 05 |
| ListenerExecutionFailedException | Your @RabbitListener method threw an exception: this is an app-code bug, not a broker problem |
| SSLHandshakeException | Certificate issue: check Playbook 08 |
| Consumer thread silent, no errors, but queue growing | Listener likely blocked/hung (e.g., waiting on a slow downstream DB call): check thread dumps, not broker logs |
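The log signatures above lend themselves to a quick first-pass filter over an application log. The following is a minimal sketch: triage_log is a hypothetical helper, it only matches the exception names listed in this module, and it cannot detect the “silent consumer” case, which leaves no log signature.

```shell
# triage_log (hypothetical helper): read application log lines on stdin and
# print each likely cause once, sorted alphabetically.
triage_log() {
  while IFS= read -r line; do
    case "$line" in
      *AmqpConnectException*|*"Connection refused"*)
        echo "broker unreachable: network/SG issue or broker down" ;;
      *PossibleAuthenticationFailureException*)
        echo "credentials wrong: see Playbook 05" ;;
      *ListenerExecutionFailedException*)
        echo "listener threw: app-code bug, not a broker problem" ;;
      *SSLHandshakeException*)
        echo "certificate issue: see Playbook 08" ;;
    esac
  done | sort -u
}

# Usage (app.log is a placeholder path):
#   triage_log < app.log
```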
Rule of thumb: if the Management UI shows healthy broker-side metrics (low ready count, consumers attached, no alarms) but the business symptom persists (e.g., orders not shipping), the problem is almost certainly in application code, not RabbitMQ. If the Management UI itself shows the anomaly (growing ready count, 0 consumers, active alarms), start with the broker/infra side.
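The rule of thumb can be written down as a small decision function. This is only a sketch of the reasoning, not a substitute for judgment: where_to_start is a hypothetical helper, and ready_limit (default 100) is an assumed stand-in for “low ready count” that should be tuned per queue. A single snapshot also approximates “growing” with “above the limit.”

```shell
# where_to_start (hypothetical helper): suggest which side to investigate first.
# Arguments: messages_ready, consumers, number of active alarms,
#            optional ready_limit (assumed cut-off for a "low" ready count).
where_to_start() {
  ready=$1 consumers=$2 alarms=$3 ready_limit=${4:-100}
  if [ "$alarms" -gt 0 ] || [ "$consumers" -eq 0 ] || [ "$ready" -gt "$ready_limit" ]; then
    echo "broker/infra"      # the broker itself shows the anomaly
  else
    echo "application"       # broker looks healthy, so suspect app code
  fi
}

# Usage: 0 ready, 2 consumers, 0 alarms, yet orders are not shipping
#   where_to_start 0 2 0
```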
Practical: diagnose a live backlog using only the CLI
Step 1: Using your producer/consumer app from First Producer and Consumer, stop the consumer (comment out @Component again, or just stop the Spring Boot app).
Step 2: Publish 20 messages in a loop:
for i in $(seq 1 20); do
curl -s -X POST localhost:8080/orders -H "Content-Type: application/json" -d "{\"id\":$i}"
done
Step 3: Without opening the Management UI, use only rabbitmqctl (via docker exec) to answer:
- How many messages are ready in orders.created.queue?
- How many consumers are attached?
- Based on that alone, what’s your diagnosis?
Step 4: Restart the consumer app and re-run the same list_queues command to confirm the backlog drains and messages_ready returns to 0.
✅ Checkpoint
You should now be able to:
- Name the four Management UI tabs you’d check first during an incident, in priority order.
- Run rabbitmqctl list_queues and rabbitmq-diagnostics check_local_alarms from memory.
- Explain the rule of thumb for deciding “broker problem” vs. “app problem” based on what the Management UI shows.
Next: Alert Playbooks