Queue Depth Growing & Consumer Lag: RabbitMQ Incident Guide

1. Symptom

A CloudWatch alarm or Slack/PagerDuty alert fires with something like:

QueueDepth > 10000 for orders.created.queue (5 min sustained)

or a downstream team pages you saying “orders are stuck, nothing’s shipping.” In the Management UI, the Queues tab shows a messages_ready count that keeps climbing and doesn’t recover: the queue is filling faster than it’s draining, or not draining at all.

This is the single most common alert you’ll triage on this rotation. The goal of this playbook is to answer one question fast: is this a broker-side problem, an app-side problem, or just normal load exceeding normal capacity?

2. Likely Causes

Broker-side

Cause	How it manifests
No consumers attached at all	`consumers = 0` in `list_queues`: nobody is subscribed to the queue
Uneven prefetch distribution	With `prefetch` set too high relative to consumer count, one consumer instance can hoard messages while others sit idle, making the queue look “stuck” even though total consumer count looks fine
Sudden producer traffic spike	`PublishRate` far exceeds `DeliverRate`: consumers are working normally, they’re just outpaced (e.g., a batch job, retry storm, or marketing event upstream)
Queue bound to the wrong exchange/routing key	Messages are landing in a different queue than the one your consumers are attached to: the alerting queue fills with messages nobody consumes, while the queue your consumers listen on receives no real work

App-side (Spring Boot)

Cause	How it manifests
`@RabbitListener` throwing exceptions that get swallowed or endlessly retried	Messages get nacked and requeued in a tight loop: `messages_unacknowledged` may spike and drop repeatedly, but `messages_ready` never actually decreases
Consumer concurrency configured too low	With `spring.rabbitmq.listener.simple.concurrency` set too low (or `concurrency: 1-1`), the listener threads can’t keep up with `PublishRate` even under normal load
A downstream call blocks the listener thread	A slow DB query, a slow REST call to another service, or a lock contention issue turns each message into a multi-second (or hung) operation, starving throughput
Consumer instance(s) crashed or never deployed	A bad deploy, OOM-killed pod, or scale-in event leaves fewer (or zero) consumer instances than expected
Prefetch misconfigured causing uneven load	`spring.rabbitmq.listener.simple.prefetch` set very high (e.g., 250, which is also the Spring AMQP default since 2.0) with multiple concurrent consumers means one thread can grab a large batch and fall behind while others starve, instead of load being spread evenly
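To make the hoarding effect concrete, here is a minimal Java sketch of a deliberately simplified model: consumers attach one after another to a queue that already holds a backlog, and each is handed up to its prefetch limit of unacknowledged messages. The class and method names are hypothetical, and the model ignores acknowledgements and ongoing publishes. It only illustrates why a large prefetch can leave later consumers idle.

```java
import java.util.Arrays;

/** Hypothetical illustration only: a simplified model of prefetch hand-out, not broker code. */
final class PrefetchHoarding {

    /**
     * Consumers attach in order to a queue with an existing backlog.
     * Each one receives up to `prefetch` messages from whatever is still ready.
     */
    static long[] initialAllocation(long backlog, int prefetch, int consumers) {
        long[] held = new long[consumers];
        long remaining = backlog;
        for (int i = 0; i < consumers; i++) {
            held[i] = Math.min(remaining, prefetch);
            remaining -= held[i];
        }
        return held;
    }

    public static void main(String[] args) {
        // Backlog of 300, three consumers, prefetch 250: the third consumer gets nothing.
        System.out.println(Arrays.toString(initialAllocation(300, 250, 3))); // [250, 50, 0]
        // Same backlog with prefetch 10: every consumer holds the same small batch.
        System.out.println(Arrays.toString(initialAllocation(300, 10, 3)));  // [10, 10, 10]
    }
}
```

With the smaller prefetch, the remaining 270 messages stay in messages_ready and go to whichever consumer acknowledges first, which is what spreads the load.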

The broker-side and app-side causes overlap in symptom (queue depth climbing) but require completely different fixes. That’s exactly why the diagnostic steps below are ordered to narrow this down before you touch anything.

3. Diagnostic Steps

Work top to bottom: cheapest, fastest checks first. Stop as soon as you have a confident diagnosis.

1. Check the Management UI → Queues tab (or rabbitmqctl list_queues via SSM) for the affected queue. Note three numbers: messages_ready, messages_unacknowledged, consumers.
- consumers = 0 → skip straight to “no consumers attached” and go check the app.
- consumers > 0 but ready still climbing → keep going; this is a throughput problem, not an absence problem.
2. Compare consumer count to expected deployed instance count. If you expect 3 app instances × concurrency 5 = 15 consumers and you only see 2, some instances aren’t consuming (crashed, still starting up, or misconfigured).
3. Check Spring Boot application logs for the affected service, grepping for:
- ListenerExecutionFailedException: the listener method is throwing; this is an app-code bug, not a broker issue.
- Repeated stack traces or retry log lines with no visible progress: this points to exceptions being caught and swallowed somewhere in the listener, masking the real error.
4. Hit Spring Boot Actuator /actuator/health on each consumer instance. A "rabbit": {"status": "DOWN"} means that instance has lost its broker connection entirely (it won’t show as a “slow” consumer; it won’t show as a consumer at all). "UP" with the connection healthy but the queue still backing up shifts suspicion toward slow processing rather than connectivity.
5. Look for a slow downstream dependency. If logs show messages being received but rarely acknowledged, the listener thread is likely blocked on something else (DB, another REST API, an external lock). Pull a thread dump from the consumer instance (jstack <pid> via SSM) and look for listener container threads (SimpleAsyncTaskExecutor / org.springframework.amqp.rabbit.listener...) sitting in BLOCKED, WAITING, or TIMED_WAITING state inside a downstream call; that’s your smoking gun. A thread stuck in a blocking socket read can also report RUNNABLE, so read the top stack frames rather than relying on the state alone.
6. Check the CloudWatch trend for ConsumerCount, QueueDepth, PublishRate, and DeliverRate over the last few hours.
- ConsumerCount dropped and stayed low → deployment/crash issue.
- PublishRate spiked while DeliverRate stayed flat → traffic spike outpacing normal capacity; consumers are otherwise healthy.
- DeliverRate itself dropped even though consumers are attached → confirms a per-message slowdown (points back to step 5).

Step	Question it answers	Typical time cost
1. Management UI / `list_queues`	Is anyone consuming at all?	seconds
2. Consumer count vs. expected	Are all instances actually attached?	seconds
3. App logs	Are messages failing, not just slow?	1-2 min
4. Actuator health	Is the connection itself healthy?	1 min
5. Thread dump	Is a listener thread stuck on something else?	3-5 min
6. CloudWatch trend	Is this a spike or a sustained regression?	2-3 min
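
Steps 1 and 2 reduce to a small decision rule over the numbers you read from list_queues. The following Java sketch shows that rule under stated assumptions: the class, enum, and method names are hypothetical, the alert threshold is the 10000 from the example alert in Section 1, and the expected consumer count is instances × concurrency as in step 2. It illustrates the reasoning and is not a tool that ships with this playbook.

```java
/** Hypothetical triage helper: names and thresholds are illustrative assumptions. */
final class QueueTriage {

    enum Diagnosis { HEALTHY, NO_CONSUMERS, MISSING_CONSUMERS, THROUGHPUT_PROBLEM }

    /** Step 2: expected consumers = app instances x listener concurrency per instance. */
    static int expectedConsumers(int instances, int concurrencyPerInstance) {
        return instances * concurrencyPerInstance;
    }

    /** Steps 1-2: decide which branch of the playbook applies. */
    static Diagnosis classify(long messagesReady, int consumers,
                              int expectedConsumers, long alertThreshold) {
        if (messagesReady <= alertThreshold) {
            return Diagnosis.HEALTHY;            // backlog is below the alert line
        }
        if (consumers == 0) {
            return Diagnosis.NO_CONSUMERS;       // nobody is listening: go check the app
        }
        if (consumers < expectedConsumers) {
            return Diagnosis.MISSING_CONSUMERS;  // some instances crashed or never attached
        }
        return Diagnosis.THROUGHPUT_PROBLEM;     // everyone is attached but not keeping up
    }

    public static void main(String[] args) {
        int expected = expectedConsumers(3, 5);                      // 15
        System.out.println(classify(52140, 0, expected, 10000));     // NO_CONSUMERS
        System.out.println(classify(18422, 2, expected, 10000));     // MISSING_CONSUMERS
        System.out.println(classify(18422, 15, expected, 10000));    // THROUGHPUT_PROBLEM
    }
}
```

A THROUGHPUT_PROBLEM result is where steps 3 to 6 take over: the three numbers alone cannot tell a slow listener from a traffic spike.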

4. Safe Remediations

Situation	Safe action
Consumer count lower than expected (crashed/not deployed instances)	Restart the affected instance(s) via your normal deploy/orchestration tooling. Confirm `/actuator/health` returns `UP` and consumer count in the Management UI climbs back to the expected number.
Traffic spike, consumers otherwise healthy, downstream dependency confirmed to have spare capacity	Scale up consumer instances or raise `spring.rabbitmq.listener.simple.concurrency` temporarily, then monitor `DeliverRate` climbing back toward `PublishRate`.
Downstream dependency (DB, other API) is itself under load	Do not blindly scale consumer concurrency: more concurrent listener threads hammering an already-struggling downstream service can make things worse. Confirm downstream headroom first, or hold and escalate.

⚠️ Caution: never “fix” a growing queue by purging it. rabbitmqctl purge_queue (or the Management UI “Purge Messages” button) permanently deletes every message in the queue without processing it. This is data loss (orders never ship, events never fire), not a resolution. Purging is only ever done deliberately, with explicit sign-off from the owning app team, as a last resort for known-poison messages (see Playbook 06, Poison Messages & DLQ), never as a way to “clear an alert.”

Scaling and restarting are your two safe levers as support tier. Anything involving broker topology changes, queue policy changes, or purges requires the escalation path.

5. Escalation Trigger

Stop and page on-call engineering (per Escalation and Communication) if any of these are true:

Queue depth keeps growing for more than ~20-30 minutes after your diagnostic pass, with no consumer-side explanation found (consumers are attached, healthy, and not obviously slow, yet the backlog doesn’t shrink).
The fix requires broker-level intervention beyond restarting or scaling the app, e.g., suspected routing/binding misconfiguration, a stuck queue leader in a quorum queue, or anything that requires touching exchange/queue/policy definitions.
Restarting the consumer instance(s) does not restore expected consumer count or does not reduce messages_ready.
The root cause looks like a downstream dependency outage (DB, another microservice) rather than anything RabbitMQ- or app-config-related. In that case, escalate to that service’s on-call in parallel.

6. Relevant Commands/Queries

# Ready / unacked / consumer count for every queue (pipe through grep to isolate one)
rabbitmqctl list_queues name messages_ready messages_unacknowledged consumers

# Healthy example
name                    messages_ready  messages_unacknowledged  consumers
orders.created.queue    3               2                        6

# Alerting example: no one listening, backlog growing
name                    messages_ready  messages_unacknowledged  consumers
orders.created.queue    52140           0                        0

# Alerting example: consumers attached but not keeping up (slow/blocked listener)
name                    messages_ready  messages_unacknowledged  consumers
orders.created.queue    18422           30                       6

messages_unacknowledged pinned near your prefetch × consumer count ceiling while messages_ready keeps growing is the signature of a blocked/slow listener, not an absent one: every consumer has grabbed its max prefetch batch and is stuck processing (or not processing) it.
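
The ceiling check above is simple arithmetic. The Java sketch below shows it under stated assumptions: the names are hypothetical, the prefetch of 5 is assumed only because it matches the 30 unacknowledged messages across 6 consumers in the alerting example, and the 10% tolerance is an arbitrary illustrative choice.

```java
/** Hypothetical helper: checks whether unacked messages sit at the prefetch ceiling. */
final class PrefetchCeiling {

    /** Maximum messages that can be unacknowledged at once across all consumers. */
    static long ceiling(int prefetch, int consumers) {
        return (long) prefetch * consumers;
    }

    /** True when unacked is within `tolerance` (e.g. 0.1 = 10%) of the ceiling. */
    static boolean pinnedAtCeiling(long unacked, int prefetch, int consumers, double tolerance) {
        long max = ceiling(prefetch, consumers);
        return max > 0 && unacked >= max * (1.0 - tolerance);
    }

    public static void main(String[] args) {
        // Alerting example: 6 consumers, assumed prefetch 5, 30 unacked.
        System.out.println(ceiling(5, 6));                    // 30
        System.out.println(pinnedAtCeiling(30, 5, 6, 0.1));   // true: slow or blocked listeners
        // Healthy example: 6 consumers, 2 unacked.
        System.out.println(pinnedAtCeiling(2, 5, 6, 0.1));    // false
    }
}
```

Read the real prefetch from the prefetch_count column of list_consumers rather than assuming it.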

# Who is actually consuming this queue right now
rabbitmqctl list_consumers

# Example columns of interest: queue_name, channel_pid, consumer_tag, prefetch_count

# Cross-check against expected app instance count/concurrency
# (compare this number to consumers column above)
kubectl get pods -l app=order-consumer   # or your platform's equivalent

# Actuator health check per instance (the rabbit component only appears if health details/components are exposed)
curl -s http://<instance-host>:8080/actuator/health | jq '.components.rabbit'

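# Thread dump to catch a blocked listener thread (via SSM Session Manager)
# -B pulls in the thread header, state, and top frames, which sit above the first Spring AMQP frame
jstack <pid> | grep -B 30 -A 5 "org.springframework.amqp.rabbit.listener"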
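
If grepping a large thread dump gets noisy, a small filter can reduce it to one line per listener thread. The Java sketch below is an illustration under stated assumptions: the class name is hypothetical, and it assumes the usual jstack layout (a quoted thread name on the header line, a java.lang.Thread.State line, then "at ..." frames, with a blank line between threads). It reports the thread name, state, and top frame for every thread whose stack passes through the Spring AMQP listener package.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical filter: summarizes listener threads found in jstack output. */
final class ListenerThreadScan {

    static final String LISTENER_PKG = "org.springframework.amqp.rabbit.listener";
    static final String STATE_PREFIX = "java.lang.Thread.State:";

    /** Returns "name | state | top frame" for each thread stanza that contains a listener frame. */
    static List<String> summarize(String dump) {
        List<String> out = new ArrayList<>();
        for (String stanza : dump.split("\\R\\s*\\R")) {       // stanzas are separated by blank lines
            if (!stanza.contains(LISTENER_PKG)) continue;
            String name = "?", state = "?", top = "?";
            for (String raw : stanza.split("\\R")) {
                String line = raw.trim();
                if (line.startsWith("\"") && name.equals("?")) {
                    int end = line.indexOf('"', 1);             // closing quote of the thread name
                    if (end > 0) name = line.substring(1, end);
                } else if (line.startsWith(STATE_PREFIX)) {
                    state = line.substring(STATE_PREFIX.length()).trim().split("\\s+")[0];
                } else if (line.startsWith("at ") && top.equals("?")) {
                    top = line.substring(3);                    // first frame = where the thread is now
                }
            }
            out.add(name + " | " + state + " | " + top);
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Reads the dump from standard input and prints one summary line per listener thread.
        String dump = new String(System.in.readAllBytes(), StandardCharsets.UTF_8);
        summarize(dump).forEach(System.out::println);
    }
}
```

Assuming a JDK that supports single-file source launch (Java 11+), one way to run it is `jstack <pid> | java ListenerThreadScan.java`. Idle listener threads also wait inside Spring AMQP code, so the top frame is the thing to read: a frame in your own handler, a database driver, or an HTTP client is the interesting case.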

7. Mini Practical

Reproduce a scaled-down backlog locally and diagnose it with the exact commands above.

Step 1: Start from the First Producer and Consumer app (or reuse the RabbitMQ container from Environment Setup, still running on localhost:5672).

Step 2: Add a deliberately slow listener. Replace (or add alongside) your OrderConsumer with a version that simulates a blocked downstream call:

import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.stereotype.Component;

@Component
public class SlowOrderConsumer {

    @RabbitListener(
        queues = RabbitConfig.QUEUE,
        concurrency = "1-1" // deliberately under-provisioned
    )
    public void handleOrder(String orderJson) throws InterruptedException {
        System.out.println("Processing: " + orderJson);
        Thread.sleep(5000); // simulates a slow DB call / downstream REST call
        System.out.println("Done: " + orderJson);
    }
}

concurrency = "1-1" pins this listener to exactly one thread. With a 5-second fake downstream call, this consumer can process at most ~12 messages/minute, which is easy to outpace.
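
The throughput figure is worth working through, because the same arithmetic tells you how long the backlog in this exercise should take to drain. A minimal Java sketch follows, with hypothetical names and two simplifying assumptions: every message costs exactly the same time, and nothing new is published while the backlog drains.

```java
/** Hypothetical helper; assumes a fixed per-message cost and no new publishes while draining. */
final class DrainEstimate {

    /** Upper bound on throughput: each thread finishes one message every secondsPerMessage. */
    static double messagesPerMinute(int threads, int secondsPerMessage) {
        return threads * 60.0 / secondsPerMessage;
    }

    /** Seconds to clear a backlog when the threads work through it in parallel rounds. */
    static long secondsToDrain(long backlog, int threads, int secondsPerMessage) {
        long rounds = (backlog + threads - 1) / threads; // ceiling division
        return rounds * secondsPerMessage;
    }

    public static void main(String[] args) {
        System.out.println(messagesPerMinute(1, 5));   // 12.0 messages/minute with one thread
        System.out.println(secondsToDrain(30, 1, 5));  // 150 seconds for the 30-message flood
        System.out.println(secondsToDrain(30, 5, 5));  // 30 seconds with five threads
    }
}
```

The five-thread figure assumes the messages are spread evenly across threads, which in practice depends on prefetch being low enough that one consumer does not take the whole backlog.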

Step 3: Flood the queue faster than the consumer can drain it:

for i in $(seq 1 30); do
  curl -s -X POST localhost:8080/orders -H "Content-Type: application/json" -d "{\"id\":$i}"
done

Step 4: Immediately check queue state (don’t wait for it to drain):

docker exec -it rabbitmq-crashcourse rabbitmqctl list_queues name messages_ready messages_unacknowledged consumers

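What you see depends on prefetch. With a low prefetch (for example spring.rabbitmq.listener.simple.prefetch=1), messages_ready sits well above 0 and slowly decreases (roughly one every 5 seconds) with consumers = 1, reproducing the “consumers attached but too slow/under-concurrent” pattern from Section 3, step 1. With Spring AMQP’s default prefetch of 250, the single consumer is handed all 30 messages at once, so the backlog appears under messages_unacknowledged instead and messages_ready reads 0; the drain rate is the same either way.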

Step 5: Confirm the diagnosis with list_consumers:

docker exec -it rabbitmq-crashcourse rabbitmqctl list_consumers

You’ll see a single consumer tag against the queue, confirming there’s only one worker thread, matching concurrency = "1-1".

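Step 6: Apply the fix and re-verify. Change concurrency = "1-1" to concurrency = "5-10", restart the app, re-run the flood from Step 3 if the backlog has already drained, and re-run the list_queues command from Step 4. The backlog should now drain rapidly as multiple threads process it in parallel: the same “scale consumer concurrency” remediation from Section 4, observed end-to-end on your own machine. If it still drains one message at a time, check prefetch: a high value (the Spring AMQP default is 250) lets the first consumer to attach take the whole 30-message backlog while the other threads sit idle, which is the uneven-prefetch cause from Section 2.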

✅ Checkpoint

You should now be able to:

Look at messages_ready, messages_unacknowledged, and consumers together and state whether the problem is “no consumers,” “slow consumers,” or “traffic spike.”
Explain why purging a queue is never an acceptable way to clear a queue-depth alert.
Reproduce and diagnose a consumer-lag backlog locally using list_queues and list_consumers, and confirm the fix by watching the backlog drain after increasing concurrency.