Queue Depth & Consumer Lag
Diagnose growing queues, stuck consumers, and Spring AMQP listener misconfiguration.
Prerequisite: Tooling Walkthrough
1. Symptom
A CloudWatch alarm or Slack/PagerDuty alert fires with something like:
QueueDepth > 10000 for orders.created.queue (5 min sustained)
or a downstream team pages you saying “orders are stuck, nothing’s shipping.” In the Management UI, the Queues tab shows a messages_ready count that keeps climbing and doesn’t recover: the queue is filling faster than it’s draining (or not draining at all).
This is the single most common alert you’ll triage on this rotation. The goal of this playbook is to answer one question fast: is this a broker-side problem, an app-side problem, or just normal load exceeding normal capacity?
2. Likely Causes
Broker-side
| Cause | How it manifests |
|---|---|
| No consumers attached at all | consumers = 0 in list_queues: nothing is subscribed to the queue |
| Uneven prefetch distribution | With prefetch set too high relative to consumer count, one consumer instance can hoard messages while others sit idle, making the queue look “stuck” even though total consumer count looks fine |
| Sudden producer traffic spike | PublishRate far exceeds DeliverRate: consumers are working normally, they’re just outpaced (e.g., a batch job, retry storm, or marketing event upstream) |
| Queue bound to the wrong exchange/routing key | Messages are landing in a different queue than the one your consumers are attached to: the alerting queue receives traffic that nothing consumes, while the queue your consumers listen on receives little or no real work |
App-side (Spring Boot)
| Cause | How it manifests |
|---|---|
| @RabbitListener throwing exceptions that get swallowed or endlessly retried | Messages get nacked and requeued in a tight loop: messages_unacknowledged may spike and drop repeatedly, but messages_ready never actually decreases |
| Consumer concurrency configured too low | spring.rabbitmq.listener.simple.concurrency (or concurrency: 1-1) can’t keep up with PublishRate even under normal load |
| A downstream call blocks the listener thread | A slow DB query, a slow REST call to another service, or a lock contention issue turns each message into a multi-second (or hung) operation, starving throughput |
| Consumer instance(s) crashed or never deployed | A bad deploy, OOM-killed pod, or scale-in event leaves fewer (or zero) consumer instances than expected |
| Prefetch misconfigured causing uneven load | spring.rabbitmq.listener.simple.prefetch set very high (e.g., 250) with multiple concurrent consumers means one thread can grab a large batch and fall behind while others starve, instead of load being spread evenly |
The broker-side and app-side causes share one symptom (queue depth climbing) but require completely different fixes. That’s why the diagnostic steps below are ordered to narrow the cause down before you touch anything.
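For orientation, the two listener settings named in the app-side table (concurrency and prefetch) can also be set in Java configuration instead of properties. The sketch below is a configuration fragment under stated assumptions, not this rotation's actual setup: the class name ListenerTuningConfig and the values 5, 10, and 10 are illustrative only. It needs Spring AMQP on the classpath and is not runnable on its own.

```java
import org.springframework.amqp.rabbit.config.SimpleRabbitListenerContainerFactory;
import org.springframework.amqp.rabbit.connection.ConnectionFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Hypothetical configuration class; the name and all numeric values are assumptions.
@Configuration
public class ListenerTuningConfig {

    @Bean
    public SimpleRabbitListenerContainerFactory rabbitListenerContainerFactory(
            ConnectionFactory connectionFactory) {
        SimpleRabbitListenerContainerFactory factory = new SimpleRabbitListenerContainerFactory();
        factory.setConnectionFactory(connectionFactory);
        factory.setConcurrentConsumers(5);     // minimum listener threads per container
        factory.setMaxConcurrentConsumers(10); // upper bound the container may scale to
        factory.setPrefetchCount(10);          // unacked messages each consumer may hold
        return factory;
    }
}
```

Declaring a bean with this name replaces the factory Spring Boot would otherwise auto-configure, so the spring.rabbitmq.listener.simple.* properties no longer apply to it. A concurrency attribute set directly on @RabbitListener still takes precedence over the factory value.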
3. Diagnostic Steps
Work top to bottom: cheapest, fastest checks first. Stop as soon as you have a confident diagnosis.
1. Check the Management UI → Queues tab (or rabbitmqctl list_queues via SSM) for the affected queue. Note three numbers: messages_ready, messages_unacknowledged, consumers.
   - consumers = 0 → skip straight to “no consumers attached” and go check the app.
   - consumers > 0 but ready still climbing → keep going; this is a throughput problem, not an absence problem.
2. Compare consumer count to expected deployed instance count. If you expect 3 app instances × concurrency 5 = 15 consumers and you only see 2, some instances aren’t consuming (crashed, still starting up, or misconfigured).
3. Check Spring Boot application logs for the affected service, grepping for:
   - ListenerExecutionFailedException: the listener method is throwing; this is an app-code bug, not a broker issue.
   - Silent stack traces / repeated retry log lines with no visible progress: points to exceptions being caught and swallowed somewhere in the listener, masking the real error.
4. Hit Spring Boot Actuator /actuator/health on each consumer instance. A "rabbit": {"status": "DOWN"} means that instance has lost its broker connection entirely (it won’t show as a “slow” consumer; it just won’t show as a consumer at all). "UP" with the connection healthy but the queue still backing up shifts suspicion toward slow processing rather than connectivity.
5. Look for a slow downstream dependency. If logs show messages being received but rarely acknowledged, the listener thread is likely blocked on something else (DB, another REST API, an external lock). Pull a thread dump from the consumer instance (jstack <pid> via SSM) and look for listener container threads (SimpleAsyncTaskExecutor / org.springframework.amqp.rabbit.listener...) sitting in BLOCKED or WAITING state inside a downstream call. That’s your smoking gun.
6. Check the CloudWatch trend for ConsumerCount, QueueDepth, PublishRate, and DeliverRate over the last few hours.
   - ConsumerCount dropped and stayed low → deployment/crash issue.
   - PublishRate spiked while DeliverRate stayed flat → traffic spike outpacing normal capacity; consumers are otherwise healthy.
   - DeliverRate itself dropped even though consumers are attached → confirms a per-message slowdown (points back to step 5).
| Step | Question it answers | Typical time cost |
|---|---|---|
| 1. Management UI / list_queues | Is anyone consuming at all? | seconds |
| 2. Consumer count vs. expected | Are all instances actually attached? | seconds |
| 3. App logs | Are messages failing, not just slow? | 1-2 min |
| 4. Actuator health | Is the connection itself healthy? | 1 min |
| 5. Thread dump | Is a listener thread stuck on something else? | 3-5 min |
| 6. CloudWatch trend | Is this a spike or a sustained regression? | 2-3 min |
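The decision order of the steps above can be sketched as a small classifier. This is a minimal illustration, not part of any tooling on this rotation: the class name QueueTriage, the Diagnosis labels, and the "below half of baseline" slowdown threshold are all assumptions made for the example.

```java
/** Hypothetical triage helper that mirrors the order of the diagnostic steps. */
final class QueueTriage {

    enum Diagnosis { NO_CONSUMERS, MISSING_CONSUMERS, SLOW_CONSUMERS, TRAFFIC_SPIKE, KEEPING_UP }

    /** Expected consumer count: app instances x listener concurrency (step 2). */
    static int expectedConsumers(int instances, int concurrencyPerInstance) {
        return instances * concurrencyPerInstance;
    }

    /** Rates are messages per second; baselineDeliverRate is the usual DeliverRate. */
    static Diagnosis classify(int consumers, int expectedConsumers,
                              double publishRate, double deliverRate,
                              double baselineDeliverRate) {
        if (consumers == 0) return Diagnosis.NO_CONSUMERS;                     // step 1
        if (consumers < expectedConsumers) return Diagnosis.MISSING_CONSUMERS; // step 2
        if (deliverRate >= publishRate) return Diagnosis.KEEPING_UP;           // backlog not growing
        // Assumed threshold: delivery under half its usual rate counts as a slowdown (step 6).
        if (deliverRate < 0.5 * baselineDeliverRate) return Diagnosis.SLOW_CONSUMERS;
        return Diagnosis.TRAFFIC_SPIKE;
    }

    public static void main(String[] args) {
        int expected = expectedConsumers(3, 5); // 15, as in step 2
        System.out.println(classify(2, expected, 40.0, 10.0, 40.0));   // MISSING_CONSUMERS
        System.out.println(classify(15, expected, 200.0, 40.0, 40.0)); // TRAFFIC_SPIKE
        System.out.println(classify(15, expected, 40.0, 5.0, 40.0));   // SLOW_CONSUMERS
    }
}
```

A classifier like this only narrows the search. Logs, Actuator health, and a thread dump are still what confirm the cause.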
4. Safe Remediations
| Situation | Safe action |
|---|---|
| Consumer count lower than expected (crashed/not deployed instances) | Restart the affected instance(s) via your normal deploy/orchestration tooling. Confirm /actuator/health returns UP and consumer count in the Management UI climbs back to the expected number. |
| Traffic spike, consumers otherwise healthy, downstream dependency confirmed to have spare capacity | Scale up consumer instances or raise spring.rabbitmq.listener.simple.concurrency temporarily, then monitor DeliverRate climbing back toward PublishRate. |
| Downstream dependency (DB, other API) is itself under load | Do not blindly scale consumer concurrency: more concurrent listener threads hammering an already-struggling downstream service can make things worse. Confirm downstream headroom first, or hold and escalate. |
⚠️ Caution: never “fix” a growing queue by purging it.
rabbitmqctl purge_queue (or the Management UI “Purge Messages” button) permanently deletes every message in the queue without processing it. This is data loss, not a resolution: orders never ship and events never fire. Purging is only ever done deliberately, with explicit sign-off from the owning app team, as a last resort for known-poison messages (see Playbook 06, Poison Messages & DLQ), never as a way to “clear an alert.”
Scaling and restarting are your two safe levers as support tier. Anything involving broker topology changes, queue policy changes, or purges requires the escalation path.
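Before scaling, a rough estimate helps show whether the change can catch up at all. The sketch below is back-of-the-envelope arithmetic under simplifying assumptions (steady rates, identical per-message cost). The class and method names are hypothetical, and the input figures are made up for illustration.

```java
/** Hypothetical capacity arithmetic for the "scale consumers" remediation. */
final class CatchUpEstimate {

    /**
     * Break-even number of single-threaded consumers:
     * ceil(publish rate x seconds spent per message).
     * Draining an existing backlog needs more than this.
     */
    static int consumersNeeded(double publishPerSecond, double secondsPerMessage) {
        return (int) Math.ceil(publishPerSecond * secondsPerMessage);
    }

    /**
     * Seconds until the backlog is gone, given steady rates.
     * Returns positive infinity when delivery does not exceed publishing.
     */
    static double secondsToDrain(long backlog, double publishPerSecond, double deliverPerSecond) {
        double netDrainPerSecond = deliverPerSecond - publishPerSecond;
        if (netDrainPerSecond <= 0) {
            return Double.POSITIVE_INFINITY; // the queue keeps growing or holds steady
        }
        return backlog / netDrainPerSecond;
    }
}
```

For example, with 10 messages/second published and 0.5 seconds of work per message, consumersNeeded returns 5. A backlog of 18,000 with 50/s published and 80/s delivered drains in 600 seconds. If the estimate shows no convergence even after scaling, that is a signal to escalate rather than keep adding threads.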
5. Escalation Trigger
Stop and page on-call engineering (per Escalation and Communication) if any of these are true:
- Queue depth keeps growing for more than ~20-30 minutes after your diagnostic pass, with no consumer-side explanation found (consumers are attached, healthy, and not obviously slow, yet the backlog doesn’t shrink).
- The fix requires broker-level intervention beyond restarting or scaling the app, e.g., suspected routing/binding misconfiguration, a stuck queue leader in a quorum queue, or anything that requires touching exchange/queue/policy definitions.
- Restarting the consumer instance(s) does not restore expected consumer count or does not reduce messages_ready.
- The root cause looks like a downstream dependency outage (DB, another microservice) rather than anything RabbitMQ- or app-config-related: escalate to that service’s on-call in parallel.
6. Relevant Commands/Queries
# Ready / unacked / consumer count per queue (pipe through grep to isolate one queue)
rabbitmqctl list_queues name messages_ready messages_unacknowledged consumers
# Healthy example
name messages_ready messages_unacknowledged consumers
orders.created.queue 3 2 6
# Alerting example: no one listening, backlog growing
name messages_ready messages_unacknowledged consumers
orders.created.queue 52140 0 0
# Alerting example: consumers attached but not keeping up (slow/blocked listener)
name messages_ready messages_unacknowledged consumers
orders.created.queue 18422 30 6
messages_unacknowledged pinned near your prefetch × consumer count ceiling while messages_ready keeps growing is the signature of a blocked/slow listener, not an absent one: every consumer has grabbed its maximum prefetch batch and is stuck processing (or not processing) it.
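The ceiling check described above is simple arithmetic and can be sketched as follows. This is an illustration only: the class name, the 10% tolerance, and the prefetch value of 5 used in the example are assumptions. The listing shows 30 unacked messages across 6 consumers, which would fit a prefetch of 5, but the actual prefetch is not stated.

```java
/** Hypothetical check for the "unacked pinned at the prefetch ceiling" signature. */
final class PrefetchCeiling {

    /** Maximum unacked messages the broker will hand out: consumers x prefetch. */
    static long ceiling(int consumers, int prefetch) {
        return (long) consumers * prefetch;
    }

    /**
     * True when there is a ready backlog and unacked is within 10% of the ceiling
     * (assumed tolerance). A single snapshot cannot show growth, so compare two
     * readings taken a minute apart before concluding anything.
     */
    static boolean looksPinned(long ready, long unacked, int consumers, int prefetch) {
        if (consumers == 0 || ready == 0) {
            return false; // absent consumers or an empty queue is a different pattern
        }
        return unacked * 10 >= ceiling(consumers, prefetch) * 9;
    }
}
```

With the numbers from the third listing and the assumed prefetch of 5, looksPinned(18422, 30, 6, 5) is true. The healthy listing, looksPinned(3, 2, 6, 5), is false.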
# Who is actually consuming this queue right now
rabbitmqctl list_consumers
# Example columns of interest: queue_name, channel_pid, consumer_tag, prefetch_count
# Cross-check against expected app instance count/concurrency
# (compare this number to consumers column above)
kubectl get pods -l app=order-consumer # or your platform's equivalent
# Actuator health check per instance
curl -s http://<instance-host>:8080/actuator/health | jq '.components.rabbit'
# Thread dump to catch a blocked listener thread (via SSM Session Manager)
jstack <pid> | grep -A 20 "org.springframework.amqp.rabbit.listener"
7. Mini Practical
Reproduce a scaled-down backlog locally and diagnose it with the exact commands above.
Step 1: Start from the First Producer and Consumer app (or reuse the RabbitMQ container from Environment Setup, still running on localhost:5672).
Step 2: Add a deliberately slow listener. Replace (or add alongside) your OrderConsumer with a version that simulates a blocked downstream call:
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.stereotype.Component;

@Component
public class SlowOrderConsumer {
    @RabbitListener(
        queues = RabbitConfig.QUEUE,
        concurrency = "1-1" // deliberately under-provisioned
    )
    public void handleOrder(String orderJson) throws InterruptedException {
        System.out.println("Processing: " + orderJson);
        Thread.sleep(5000); // simulates a slow DB call / downstream REST call
        System.out.println("Done: " + orderJson);
    }
}
concurrency = "1-1" pins this listener to exactly one thread. With a 5-second fake downstream call, this consumer can process at most ~12 messages/minute, which is easy to outpace.
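The throughput figure above, and what it implies for the 30-message flood in the next step, can be worked through in a few lines. This is a sketch with hypothetical names. It assumes every message costs exactly the sleep time and that messages are spread evenly across threads, which in practice also depends on prefetch.

```java
/** Hypothetical arithmetic for the mini practical's slow listener. */
final class ListenerThroughput {

    /** Upper bound on throughput: each thread finishes one message per secondsPerMessage. */
    static double maxMessagesPerMinute(int threads, double secondsPerMessage) {
        return threads * 60.0 / secondsPerMessage;
    }

    /** Time to work through a fixed batch, assuming an even split across threads. */
    static double secondsToProcess(int messages, int threads, double secondsPerMessage) {
        int rounds = (messages + threads - 1) / threads; // ceil(messages / threads)
        return rounds * secondsPerMessage;
    }

    public static void main(String[] args) {
        System.out.println(maxMessagesPerMinute(1, 5.0)); // 12.0
        System.out.println(secondsToProcess(30, 1, 5.0)); // 150.0
        System.out.println(secondsToProcess(30, 5, 5.0)); // 30.0
    }
}
```

So one thread needs about two and a half minutes for 30 messages, while five threads would need about 30 seconds under the same assumptions.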
Step 3: Flood the queue faster than the consumer can drain it:
for i in $(seq 1 30); do
curl -s -X POST localhost:8080/orders -H "Content-Type: application/json" -d "{\"id\":$i}"
done
Step 4: Immediately check queue state (don’t wait for it to drain):
docker exec -it rabbitmq-crashcourse rabbitmqctl list_queues name messages_ready messages_unacknowledged consumers
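You should see a backlog that shrinks by roughly one message every 5 seconds, with consumers = 1. Where the backlog shows up depends on prefetch: with a low prefetch it sits in messages_ready, but if prefetch is left at the Spring AMQP default (250 in 2.x and later), the single consumer takes delivery of all 30 messages at once, so messages_ready reads 0 and the backlog appears under messages_unacknowledged instead. Either way, this reproduces the “consumers attached but too slow/under-concurrent” pattern from Section 3, step 1.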
Step 5: Confirm the diagnosis with list_consumers:
docker exec -it rabbitmq-crashcourse rabbitmqctl list_consumers
You’ll see a single consumer tag against the queue, confirming there’s only one worker thread, matching concurrency = "1-1".
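Step 6: Apply the fix and re-verify. Change concurrency = "1-1" to concurrency = "5-10", restart the app, and re-run the list_queues command from Step 4. The backlog should now drain rapidly as multiple threads process it in parallel. This is the same “scale consumer concurrency” remediation from Section 4, just observed end-to-end on your own machine. If it still drains one message at a time, check prefetch: with a large prefetch, the first consumer thread to attach can take the whole backlog (the uneven-prefetch cause from Section 2).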
✅ Checkpoint
You should now be able to:
- Look at messages_ready, messages_unacknowledged, and consumers together and state whether the problem is “no consumers,” “slow consumers,” or “traffic spike.”
- Explain why purging a queue is never an acceptable way to clear a queue-depth alert.
- Reproduce and diagnose a consumer-lag backlog locally using list_queues and list_consumers, and confirm the fix by watching the backlog drain after increasing concurrency.