Poison Messages & Dead-Letter Queues: RabbitMQ Incident Guide

Prerequisite: Auth Failures After Rotation

1. Symptom

One of two very different-looking alerts, both rooted in the same underlying problem:

The “quiet” version: a DLQ (e.g., orders.created.dlq) starts accumulating messages, a CloudWatch alarm on that queue’s depth fires, or someone notices it during a routine Management UI check. No one’s paging about the main queue; it looks healthy.
The “loud” version: the main queue’s redeliver rate is elevated and sustained, messages_unacknowledged is non-zero but oddly flat (not draining, not growing much either), and application logs show the exact same exception, for what looks like the exact same message, repeating over and over. CPU on the affected consumer instance(s) may also be elevated from the tight processing/retry loop.

Both are symptoms of a poison message: a message that a consumer can never successfully process, no matter how many times it’s redelivered. Typical examples are malformed JSON that will never deserialize, a business-logic exception that’s deterministic for that payload (e.g., a null field the code doesn’t guard against), or a bug that only triggers on that specific data. Without dead-lettering configured, RabbitMQ has no way to know the message is unprocessable. Every time the consumer nacks it with requeue, the broker simply redelivers it, which burns CPU, floods your logs with the same stack trace, and, especially with concurrency = "1-1", blocks every other message behind it in the queue, since that one thread is stuck in a nack/requeue loop instead of moving on.

The goal of this playbook: figure out whether dead-lettering is working as a safety net (good: an accumulating DLQ means poison messages are being caught) or whether there’s no DLX at all and a message is looping forever in the main queue (bad: this is an active incident, not just a queue to clean up).

2. Likely Causes

Broker-side

Cause	How it manifests
No DLX configured on the queue at all	The poison message just nacks and requeues indefinitely: `messages_unacknowledged` cycles up and down but `messages_ready` never permanently drops, and there’s no DLQ to check because none exists. This is the worst case: no safety net.
No TTL or retry-count limit before dead-lettering	Even with a DLX configured, if nothing caps how long or how many times a message can be retried, it may take an extremely long time to actually reach the DLQ (or never reach it, if retries are purely broker-level nack/requeue rather than app-level), so the queue still looks “stuck” for a long window.
Dead-letter routing key mismatch	The queue does have `x-dead-letter-exchange` set, and messages genuinely do get dead-lettered, but the DLX’s binding routes them to a different queue than the one you’re watching (or nowhere, if there’s no matching binding at all). You’ll see the main queue’s `messages_ready`/`messages_unacknowledged` behave correctly (the message leaves the queue) while the DLQ you’re staring at stays empty, making it look like dead-lettering “isn’t working” when it’s actually just misrouted.

App-side (Spring Boot)

Cause	How it manifests
Deserialization failure	A malformed JSON payload (often from a producer schema change that shipped without consumer compatibility) throws during message conversion, before your `@RabbitListener` method body even runs. This is a poison message by definition: no amount of retrying fixes malformed bytes.
Deterministic `NullPointerException` or business exception	The listener method throws for a specific message’s data every single time (e.g., an order with a null customer ID that a downstream call chokes on), as opposed to a transient failure like a timeout, which might succeed on retry.
Error handler acks when it should nack (or vice versa)	A bug in a custom exception handler that swallows the exception and acks anyway silently loses the message: it vanishes with no DLQ entry and no error, which is arguably worse than a poison message, since there’s no evidence anything went wrong.
Error handler nacks-with-requeue in a tight loop	The inverse bug: an exception handler (or default behavior) that requeues on every failure with no backoff and no eventual dead-letter creates the classic redelivery storm: the same message bounces between “delivered” and “requeued” as fast as the broker and consumer can cycle it, pegging CPU on both sides.

The common thread: broker-side causes are about whether a safety net exists and is wired correctly; app-side causes are about whether the message was ever processable in the first place, and whether the app’s error handling correctly identifies that fact.

2a. How Spring’s retry layer and RabbitMQ’s DLX actually interact

This is the part that trips people up, because there are two separate retry mechanisms stacked on top of each other, and it matters which one you’re looking at when you read logs or configure behavior.

Local (in-JVM) retries happen first. If the listener container is configured with a RetryOperationsInterceptor (via RetryTemplate, or @Retryable inside the listener method itself), a failing message is retried inside the same delivery: no broker round-trip, no redelivery flag, just the same JVM calling your method again after a backoff. This is fast and cheap, and it’s the right layer for transient failures (a downstream call timing out, a brief DB blip).
Once local retries are exhausted, the listener container’s error handler decides what happens to the broker-level delivery. The ConditionalRejectingErrorHandler (Spring AMQP’s default error handler for listener containers, replaceable via setErrorHandler on your SimpleRabbitListenerContainerFactory) inspects the exception and classifies it as fatal or not fatal:
- Fatal (e.g., a MessageConversionException from bad JSON, or any exception you’ve explicitly configured as non-retryable) → the message is rejected outright (basic.nack with requeue=false), skipping any further broker-level requeue attempts.
- Not fatal, but local retries are exhausted anyway → same outcome by default: rejected without requeue, since the RetryOperationsInterceptor has already done all the retrying it was going to do and its default recovery behavior is to reject the message rather than requeue it.
What happens to a rejected message depends entirely on whether the queue has x-dead-letter-exchange set. If it does, a basic.nack/basic.reject with requeue=false causes RabbitMQ to route the message to the DLX, and that’s your DLQ landing. If the queue has no DLX configured, that same rejected message is just dropped (if requeue=false) or redelivered forever (if something in your code path sets requeue=true on every failure, which is the redelivery-storm bug from the table above).

The practical implication: Spring’s retry config controls how many times the message is retried before giving up; the queue’s DLX config controls what happens to it after Spring gives up. You need both configured correctly. Retry-then-reject without a DLX just means the message vanishes silently once retries are exhausted, which is arguably worse than looping forever, because now there’s no evidence it ever existed.
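The hand-off described in this section can be modelled in plain Java without a broker. This is a minimal sketch and not Spring AMQP’s implementation: DeliveryOutcomeModel, Outcome, and every parameter name are hypothetical, and the model only encodes the decision order above (local retries first, then reject or requeue, then DLX or drop).

```java
import java.util.function.Predicate;

/** Toy model of the retry-then-dead-letter hand-off; hypothetical, not Spring AMQP code. */
class DeliveryOutcomeModel {

    enum Outcome { PROCESSED, DEAD_LETTERED, DROPPED, REQUEUED_FOREVER }

    /**
     * @param handler       returns true when processing succeeds for the payload
     * @param maxAttempts   local (in-JVM) attempts before giving up
     * @param requeueOnFail true models the "nack with requeue on every failure" bug
     * @param queueHasDlx   whether the queue has x-dead-letter-exchange set
     */
    static Outcome deliver(String payload, Predicate<String> handler,
                           int maxAttempts, boolean requeueOnFail, boolean queueHasDlx) {
        // Layer 1: local retries inside the same delivery, no broker round-trip.
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (handler.test(payload)) {
                return Outcome.PROCESSED;
            }
        }
        // Layer 2: local retries exhausted, so the broker-level decision applies.
        if (requeueOnFail) {
            // A deterministic failure never succeeds, so requeueing loops forever.
            return Outcome.REQUEUED_FOREVER;
        }
        // Rejected with requeue=false: only a DLX keeps the message alive.
        return queueHasDlx ? Outcome.DEAD_LETTERED : Outcome.DROPPED;
    }

    public static void main(String[] args) {
        // In this sketch, id 13 stands in for the poison payload.
        Predicate<String> handler = p -> !p.contains("\"id\":13");
        System.out.println(deliver("{\"id\":1}", handler, 3, false, true));   // PROCESSED
        System.out.println(deliver("{\"id\":13}", handler, 3, false, true));  // DEAD_LETTERED
        System.out.println(deliver("{\"id\":13}", handler, 3, false, false)); // DROPPED
        System.out.println(deliver("{\"id\":13}", handler, 3, true, false));  // REQUEUED_FOREVER
    }
}
```

The last two calls are the two failure modes this guide warns about: silent loss when there is no DLX, and a redelivery storm when every failure is requeued.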

3. Diagnostic Steps

Work top to bottom: cheapest, fastest checks first.

Step 1: Check the Management UI Queues tab for the affected queue. Look for the classic poison-message signature: a non-zero messages_unacknowledged combined with a flat (not climbing, not draining) messages_ready, and an elevated redeliver rate in the message-rate graph. A healthy queue’s redeliver rate is near zero; a poison message being retried in a loop shows a steady, non-zero redeliver rate that doesn’t correlate with any real growth in throughput.
Step 2: Check whether a DLQ exists for this queue, and if so, check its depth. An accumulating DLQ is a good sign: it means the safety net is working and poison messages are being caught rather than looping. A non-existent DLX with a queue stuck in a redeliver loop is the worse case: there’s no safety net, and this is actively burning broker/consumer resources right now.
Step 3: Grep application logs for the same exception repeating. Look specifically for ListenerExecutionFailedException (per Tooling Walkthrough) wrapping the same root cause (a MessageConversionException, the same NullPointerException stack trace, or the same business exception) and appearing at a frequency that lines up with the redeliver rate from step 1. Repeating identical stack traces (not just the same exception type, but the same message/line) strongly suggest the same message is being retried, not a new failure each time.
Step 4: Inspect the actual message payload. In the Management UI, open the queue (main queue if it’s still looping, or the DLQ if it’s already landed there) and use Get Message(s). This is the fastest way to see the malformed JSON or bad field that’s triggering the failure.
- ⚠️ Note the Requeue checkbox in the Get Message dialog: if checked, the message goes back to the front of the queue after you view it (safe, non-destructive). If unchecked, viewing it removes it from the queue permanently. Treat “Get Message” with requeue unchecked as a destructive read, not a passive one, especially on a queue with only one or two messages left.
Step 5: Check the Spring retry and error-handler configuration for the affected listener: max attempts, backoff policy, and whether a ConditionalRejectingErrorHandler with explicit fatal-exception classification is configured. This tells you what the expected behavior is, so you can tell whether what you’re observing (e.g., a message dead-lettering after 3 attempts) is working as designed or is itself misconfigured (e.g., retrying 50 times before giving up, needlessly prolonging the incident).

Step	Question it answers	Typical time cost
1. Management UI: unacked/ready/redeliver rate	Is there a poison-message pattern here at all?	seconds
2. DLQ existence + depth	Is the safety net present and working, or absent?	seconds
3. App logs for repeating exceptions	Is app code confirming the same message keeps failing?	1-2 min
4. Inspect payload via Get Message(s)	What specifically is wrong with the data?	2-3 min
5. Review retry/error-handler config	Is current behavior expected, or itself misconfigured?	2-3 min
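When reviewing config in step 5, it helps to know roughly what an explicit fatal-exception classification looks like in Spring AMQP. The following is a minimal sketch, assuming Spring AMQP 2.x/3.x on the classpath. The names FatalClassificationConfig, OrdersFatalStrategy, and ordersListenerFactory are hypothetical, and treating IllegalStateException as fatal is only an illustration, not a recommendation for your listener.

```java
import org.springframework.amqp.rabbit.config.SimpleRabbitListenerContainerFactory;
import org.springframework.amqp.rabbit.connection.ConnectionFactory;
import org.springframework.amqp.rabbit.listener.ConditionalRejectingErrorHandler;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class FatalClassificationConfig {

    /** Adds one application exception to the causes Spring AMQP already treats as fatal. */
    static class OrdersFatalStrategy
            extends ConditionalRejectingErrorHandler.DefaultExceptionStrategy {

        @Override
        protected boolean isUserCauseFatal(Throwable cause) {
            // Assumption for this sketch: this exception type is never worth retrying.
            return cause instanceof IllegalStateException;
        }
    }

    // Hypothetical bean name; a listener would opt in with
    // @RabbitListener(containerFactory = "ordersListenerFactory").
    @Bean
    SimpleRabbitListenerContainerFactory ordersListenerFactory(
            ConnectionFactory connectionFactory) {
        SimpleRabbitListenerContainerFactory factory = new SimpleRabbitListenerContainerFactory();
        factory.setConnectionFactory(connectionFactory);
        // Fatal exceptions are rejected without requeue, so a DLX (if present) catches them.
        factory.setErrorHandler(new ConditionalRejectingErrorHandler(new OrdersFatalStrategy()));
        return factory;
    }
}
```

If the affected listener has nothing like this, it is running on the default classification, which already treats message-conversion failures as fatal but knows nothing about your business exceptions.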

4. Safe Remediations

Situation	Safe action
A small, confirmed set of poison messages sitting in the DLQ, root cause understood	After coordinating with the owning app team, manually remove the confirmed-bad messages from the DLQ via Get Message(s) (requeue unchecked) once you’ve both agreed the payload is truly unprocessable and doesn’t need reprocessing after a fix.
Failure looks transient (e.g., a downstream dependency was briefly down, not a data problem)	Temporarily increase retry backoff/attempts on the affected listener so messages get more chances to succeed once the dependency recovers, rather than prematurely dead-lettering messages that would have worked on attempt 4 instead of attempt 3. Revert once the dependency is confirmed stable.
No DLX configured at all and a redelivery storm is actively consuming resources	This is broker topology change territory (adding `x-dead-letter-exchange` to an existing queue isn’t a live update to queue arguments: it typically requires a new queue or a policy-based DLX). Support tier should not make this change unilaterally: see Escalation Trigger.

⚠️ Caution: never blindly discard DLQ contents. A DLQ often holds business-critical data (an order, a payment event, a shipment update) that needs to be reprocessed after a code fix, not deleted. Coordinate with the owning app team before removing anything. “The message is malformed” doesn’t necessarily mean “the underlying business event doesn’t matter”; it may just mean the producer needs to resend it correctly, or the consumer needs a fix before replaying it from the DLQ.

⚠️ Caution: never purge a DLQ as a way to “clear the alert.” rabbitmqctl purge_queue or the Management UI “Purge Messages” button deletes every message in the queue instantly and irreversibly. A growing DLQ is a symptom to investigate, not a metric to zero out. Purging without app-team sign-off is exactly the kind of action that turns a contained incident (a handful of bad messages, safely quarantined) into a real data-loss incident.

5. Escalation Trigger

Escalate to the owning application team or on-call engineering when:

The DLQ is growing rapidly with a new or unknown failure signature: this requires an app-code fix (a deserialization compatibility fix, a null-check, a business-logic correction), which is outside support tier’s remit.
There’s an active redelivery storm (no DLX configured, or DLX misrouted) causing measurable broker or consumer CPU pressure. This needs an emergency DLX/queue-policy addition, which is a broker topology change, not a support-tier action.
You can’t tell whether messages in the DLQ are safe to leave alone or need urgent reprocessing (e.g., time-sensitive business data). Don’t guess; ask the owning team.
The error-handling bug is the “silently acks instead of nacks” variant. This means messages have already been lost, not just delayed, and the owning team needs to know immediately so they can assess data impact.

6. Relevant Commands/Queries

# Main queue and its DLQ side by side: compare ready/unacked/consumers on both
rabbitmqctl list_queues name messages_ready messages_unacknowledged consumers

# Healthy: no poison message activity
name                        messages_ready  messages_unacknowledged  consumers
orders.created.queue        2               1                        4
orders.created.dlq          0               0                        0

# Poison message looping, no DLX configured: the worse case
name                        messages_ready  messages_unacknowledged  consumers
orders.created.queue        140             1                        4
orders.created.dlq          (queue does not exist)

# DLX working as designed: safety net catching poison messages
name                        messages_ready  messages_unacknowledged  consumers
orders.created.queue        3               1                        4
orders.created.dlq          17              0                        0

A flat, non-zero messages_unacknowledged on the main queue that doesn’t correlate with growth in messages_ready, paired with a climbing messages_ready on the DLQ (its messages_unacknowledged stays at 0 while nothing consumes from it), is the clearest broker-side confirmation of “dead-lettering is active and catching something.”
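To make the reading of the three sample outputs explicit, here is a minimal sketch that classifies a single snapshot. QueuePatternCheck, Pattern, and the backlogThreshold parameter are hypothetical names, and the threshold is an assumption you would tune per queue. A single snapshot cannot show “flat” versus “climbing”, so treat the result as a hint to confirm against the redeliver rate, not as a verdict.

```java
/** Hypothetical helper: maps one list_queues reading onto the three patterns shown above. */
class QueuePatternCheck {

    enum Pattern { HEALTHY, LOOPING_NO_DLX, DLX_CATCHING }

    /**
     * @param mainReady        messages_ready on the main queue
     * @param dlqReady         messages_ready on the DLQ, or null when the DLQ does not exist
     * @param backlogThreshold assumed cut-off above which the main queue counts as backed up
     */
    static Pattern classify(long mainReady, Long dlqReady, long backlogThreshold) {
        if (dlqReady != null && dlqReady > 0) {
            return Pattern.DLX_CATCHING;   // safety net exists and holds messages
        }
        if (dlqReady == null && mainReady >= backlogThreshold) {
            return Pattern.LOOPING_NO_DLX; // backlog with nowhere for poison messages to go
        }
        return Pattern.HEALTHY;
    }

    public static void main(String[] args) {
        System.out.println(classify(2, 0L, 100));    // HEALTHY
        System.out.println(classify(140, null, 100)); // LOOPING_NO_DLX
        System.out.println(classify(3, 17L, 100));   // DLX_CATCHING
    }
}
```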

# Confirm queue arguments, including DLX config, for the main queue
rabbitmqctl list_queues name arguments

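Look for x-dead-letter-exchange and x-dead-letter-routing-key in the output. A DLX can also be applied by policy (as dead-letter-exchange), in which case it won’t appear in the queue arguments, so check rabbitmqctl list_policies before concluding anything. Their absence from both places confirms “no DLX configured” as the root cause; their presence but pointing somewhere unexpected confirms a routing-key mismatch.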

Management UI navigation for inspecting DLQ contents:

Queues tab → click the DLQ name.
Scroll to Get Message(s) → set “Requeue” per whether you want a non-destructive peek (checked) or to actually remove it while inspecting (unchecked) → click Get Message(s).
Expand the Payload section to see the raw message body. This is where you’ll spot the malformed JSON or unexpected field value.
Check the Properties panel alongside the payload. The x-death headers (added automatically by RabbitMQ on dead-letter) show the original queue, the reason (rejected, expired, maxlen), and how many times the message has been dead-lettered, which is useful for confirming this isn’t the first time this exact message has bounced through.
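If you end up reading x-death from application code rather than the UI, the header is a list of entries with the most recent dead-lettering event first, each entry carrying fields such as queue, reason, and count. A minimal sketch follows, assuming the client library has already decoded the header into plain Java List and Map values; XDeathSummary and summarize are hypothetical names.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical helper: one-line summary of the most recent x-death entry. */
class XDeathSummary {

    static String summarize(Map<String, Object> headers) {
        Object raw = headers.get("x-death");
        // No header (or an empty list) means the message has never been dead-lettered.
        if (!(raw instanceof List<?>) || ((List<?>) raw).isEmpty()) {
            return "never dead-lettered";
        }
        Object first = ((List<?>) raw).get(0); // most recent event comes first
        if (!(first instanceof Map<?, ?>)) {
            return "unrecognized x-death format";
        }
        Map<?, ?> latest = (Map<?, ?>) first;
        return "queue=" + latest.get("queue")
                + " reason=" + latest.get("reason")
                + " count=" + latest.get("count");
    }
}
```

A count above 1 for the same queue and reason is the programmatic equivalent of “this exact message has bounced through before.”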

7. Mini Practical

Extend the producer/consumer example from First Producer and Consumer with a DLX/DLQ, deliberately trigger a poison message, and watch it retry then land in the DLQ.

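Step 1: Add a DLX and DLQ to the queue config, extending your RabbitConfig (if orders.created.queue already exists on the broker from the earlier example, delete it first, because redeclaring a queue with different arguments fails with PRECONDITION_FAILED instead of updating it):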

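import org.springframework.amqp.core.Binding;
import org.springframework.amqp.core.BindingBuilder;
import org.springframework.amqp.core.DirectExchange;
import org.springframework.amqp.core.Queue;
import org.springframework.amqp.core.QueueBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration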
public class RabbitConfig {

    static final String EXCHANGE = "orders.exchange";
    static final String QUEUE = "orders.created.queue";
    static final String ROUTING_KEY = "order.created";

    static final String DLX = "orders.dlx";
    static final String DLQ = "orders.created.dlq";
    static final String DLQ_ROUTING_KEY = "order.created.dlq";

    @Bean
    DirectExchange ordersExchange() {
        return new DirectExchange(EXCHANGE);
    }

    @Bean
    Queue ordersQueue() {
        return QueueBuilder.durable(QUEUE)
                .deadLetterExchange(DLX)
                .deadLetterRoutingKey(DLQ_ROUTING_KEY)
                .build();
    }

    @Bean
    Binding binding(Queue ordersQueue, DirectExchange ordersExchange) {
        return BindingBuilder.bind(ordersQueue)
                .to(ordersExchange)
                .with(ROUTING_KEY);
    }

    @Bean
    DirectExchange ordersDlx() {
        return new DirectExchange(DLX);
    }

    @Bean
    Queue ordersDlq() {
        return QueueBuilder.durable(DLQ).build();
    }

    @Bean
    Binding dlqBinding(Queue ordersDlq, DirectExchange ordersDlx) {
        return BindingBuilder.bind(ordersDlq)
                .to(ordersDlx)
                .with(DLQ_ROUTING_KEY);
    }
}

Step 2: Configure local retry with a small max-attempts, so failures dead-letter quickly instead of looping for a long time:

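import org.springframework.amqp.rabbit.config.RetryInterceptorBuilder;
import org.springframework.amqp.rabbit.config.SimpleRabbitListenerContainerFactory;
import org.springframework.amqp.rabbit.connection.ConnectionFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.retry.interceptor.RetryOperationsInterceptor;

@Configuration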
public class RabbitListenerConfig {

    @Bean
    SimpleRabbitListenerContainerFactory rabbitListenerContainerFactory(
            ConnectionFactory connectionFactory) {

        RetryOperationsInterceptor retryInterceptor = RetryInterceptorBuilder.stateless()
                .maxAttempts(3)
                .backOffOptions(500, 2.0, 5000) // initial 500ms, x2 multiplier, 5s cap
                .build();

        SimpleRabbitListenerContainerFactory factory = new SimpleRabbitListenerContainerFactory();
        factory.setConnectionFactory(connectionFactory);
        factory.setAdviceChain(retryInterceptor);
        // After 3 local attempts fail, the interceptor's default recovery rejects
        // the message without requeue, which the DLX above then catches.
        return factory;
    }
}

Step 3: Write a consumer that deliberately throws for a specific “poison” payload:

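import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.stereotype.Component;

@Component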
public class OrderConsumer {

    @RabbitListener(queues = RabbitConfig.QUEUE)
    public void handleOrder(String orderJson) {
        System.out.println("Processing: " + orderJson);
        if (orderJson.contains("\"id\":13")) {
            // Simulates a deterministic bug/malformed-data failure:
            // this will fail identically on every retry, exactly like a poison message.
            throw new IllegalStateException("Unprocessable order payload: " + orderJson);
        }
        System.out.println("Done: " + orderJson);
    }
}

Step 4: Publish a batch of healthy messages, then the poison message:

for i in $(seq 1 5); do
  curl -s -X POST localhost:8080/orders -H "Content-Type: application/json" -d "{\"id\":$i}"
done

Message id:13 isn’t in this batch. First confirm the happy path works, then publish the poison message on its own so it’s easy to watch in isolation:

curl -s -X POST localhost:8080/orders -H "Content-Type: application/json" -d '{"id":13}'

Step 5: Watch it retry, then dead-letter. In the Spring Boot console, you should see Processing: {"id":13} printed three times (matching maxAttempts(3)), with increasing delay between attempts (the backoff), then nothing further. There is no infinite loop, because local retries are capped and the message is then rejected without requeue.
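The delays you should observe follow directly from the backOffOptions(500, 2.0, 5000) values in step 2. A minimal sketch of that arithmetic, assuming the usual exponential-backoff behavior (wait only between attempts, multiply after each wait, cap at the maximum); BackoffSchedule and delaysMs are hypothetical names and this is not the retry library’s own code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists the waits an exponential backoff inserts between attempts. */
class BackoffSchedule {

    static List<Long> delaysMs(int maxAttempts, long initialMs, double multiplier, long maxMs) {
        List<Long> delays = new ArrayList<>();
        double next = initialMs;
        // maxAttempts attempts have maxAttempts - 1 gaps between them.
        for (int gap = 1; gap < maxAttempts; gap++) {
            delays.add(Math.min(Math.round(next), maxMs)); // cap each wait at maxMs
            next *= multiplier;
        }
        return delays;
    }

    public static void main(String[] args) {
        System.out.println(delaysMs(3, 500, 2.0, 5000)); // [500, 1000]
        System.out.println(delaysMs(6, 500, 2.0, 5000)); // [500, 1000, 2000, 4000, 5000]
    }
}
```

With three attempts the poison message should be rejected about 1.5 seconds after the first attempt, plus processing time. The same arithmetic shows why a large max-attempts value prolongs an incident: once the cap is reached, every extra attempt adds another 5 seconds before the message can reach the DLQ.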

Step 6: Confirm it landed in the DLQ:

docker exec -it rabbitmq-crashcourse rabbitmqctl list_queues name messages_ready messages_unacknowledged

You should see orders.created.queue back at 0 ready (the poison message is gone from the main queue) and orders.created.dlq showing 1 ready.

Step 7: Inspect the DLQ contents in the Management UI. Go to localhost:15672 → Queues → orders.created.dlq → Get Message(s) (leave Requeue unchecked since you’re done with it) → confirm the payload shown matches {"id":13}, and check the x-death header in the properties panel. It should show orders.created.queue as the original queue and rejected as the reason.

✅ Checkpoint

You should now be able to:

Explain the difference between Spring’s local in-JVM retry (via RetryOperationsInterceptor/RetryTemplate) and RabbitMQ’s broker-level dead-lettering, and how they hand off to each other.
Look at a queue’s messages_ready/messages_unacknowledged/redeliver rate and correctly identify a poison-message pattern versus normal processing.
Explain why an accumulating DLQ is a good diagnostic sign, while a redelivery storm on a queue with no DLX is a worse, escalation-worthy one.