Retries, Error Handling, and Dead Letter Topics
Spring Kafka DefaultErrorHandler, blocking vs non-blocking retries, @RetryableTopic with retry topics and a DLT, backoff, and classifying transient vs poison failures.
When a consumer’s handler throws, Kafka does not have a broker-side retry queue like a traditional message broker. The record’s offset simply is not committed, so the same record is redelivered. Left alone, a record that always fails blocks its partition forever. This module shows how Spring Kafka turns that raw behavior into controlled retries, backoff, and a dead letter topic for records that cannot be processed.
What you’ll be able to do after this module
- Explain what happens by default when a listener throws.
- Configure blocking retries with
DefaultErrorHandlerand backoff. - Use
@RetryableTopicfor non-blocking retries with retry topics and a DLT. - Classify transient failures from deterministic poison failures.
- Route un-deserializable records safely.
1. What happens by default
If a @KafkaListener method throws, the container does not commit that record’s offset. On the next poll the record is delivered again. If the failure is permanent, for example a record that violates a business rule, this becomes an infinite redelivery loop that stalls the whole partition, because Kafka preserves order and will not skip ahead.
So the question is never “should I handle errors” but “how many times do I retry, how long do I wait, and where does a record go when it still fails.”
2. Blocking retries with DefaultErrorHandler
Spring Kafka’s DefaultErrorHandler retries the record in place, in the consumer thread, with a backoff. After the attempts are exhausted, it invokes a recoverer, which commonly publishes the record to a dead letter topic.
@Bean
DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> template) {
DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(template);
// 3 retries, 2 seconds apart, then publish to <topic>.DLT
return new DefaultErrorHandler(recoverer, new FixedBackOff(2000L, 3L));
}
This is called blocking retry because the container pauses on that record during the backoff. It is simple and keeps ordering, but a long backoff holds up every later record on the partition.
flowchart TD
rec["record"] --> h{"handler throws?"}
h -->|no| commit["commit offset"]
h -->|yes| retry{"retries left?"}
retry -->|yes| wait["wait backoff, retry in place"] --> h
retry -->|no| dlt["publish to topic.DLT, commit"]
3. Non-blocking retries with @RetryableTopic
When you need longer or escalating backoff without stalling the partition, use non-blocking retries. Spring forwards a failed record to a separate retry topic and moves on, so the main topic keeps flowing. A retry topic consumer processes it later, and after the final attempt the record lands in a dead letter topic.
@RetryableTopic(
attempts = "4",
backoff = @Backoff(delay = 1000, multiplier = 2.0),
dltStrategy = DltStrategy.FAIL_ON_ERROR
)
@KafkaListener(topics = "orders", groupId = "inventory-service")
public void onOrderCreated(OrderCreated event) {
inventoryService.reserve(event);
}
@DltHandler
public void handleDlt(OrderCreated event) {
log.error("Order {} sent to DLT after retries exhausted", event.orderId());
}
Spring auto-creates the retry topics (orders-retry-0, orders-retry-1, and so on) and the dead letter topic (orders-dlt). The backoff escalates per topic, so early transient failures recover fast while stubborn ones get more time without blocking healthy traffic.
flowchart LR
main["orders"] -->|fail| r0["orders-retry-0<br/>(1s)"]
r0 -->|fail| r1["orders-retry-1<br/>(2s)"]
r1 -->|fail| r2["orders-retry-2<br/>(4s)"]
r2 -->|fail| dlt["orders-dlt"]
main -->|ok| done["processed"]
4. Blocking vs non-blocking: which to use
| Aspect | Blocking (DefaultErrorHandler) | Non-blocking (@RetryableTopic) |
|---|---|---|
| Where retries run | In the partition’s consumer thread | On separate retry topics |
| Ordering | Preserved on the partition | Not preserved across retries |
| Best backoff | Short | Long or escalating |
| Cost | Stalls the partition during backoff | Extra topics to manage |
Use blocking retries for fast, likely-transient blips where order matters. Use non-blocking retries when a failure may take a while to clear, such as a downstream service being down, and you cannot hold up the partition.
5. Classify the failure
Retrying only helps if the failure might succeed next time. Classify before you retry.
- Transient: a temporary condition, such as a timeout or a downstream service briefly unavailable. Retry with backoff.
- Deterministic (poison): the record will fail every time, such as a business-rule violation or malformed data. Retrying wastes time; send it straight to the DLT.
Tell DefaultErrorHandler not to retry deterministic exceptions:
errorHandler.addNotRetryableExceptions(IllegalArgumentException.class);
A special poison case is a record that cannot even be deserialized. As introduced in First Producer and Consumer, wrap the deserializer in ErrorHandlingDeserializer so the failure is caught and routed to the error handler instead of looping. Without it, deserialization fails before your handler runs and no retry logic applies.
6. Guided practical
Run this against the local lab, using the Inventory consumer.
- Add
@RetryableTopicwith 4 attempts and an escalating backoff, plus a@DltHandler. - Make the handler throw for a specific
itemvalue to simulate a poison record. - Produce one poison record and several good ones; confirm the good ones flow while the poison one moves through the retry topics.
- Confirm the poison record lands in
orders-dltafter the final attempt. - Add a deserialization failure (malformed JSON) and confirm
ErrorHandlingDeserializerroutes it rather than looping.
Next:Idempotent Consumers, Ordering, and Duplicates, where you make the consumer safe against the duplicates that at-least-once delivery guarantees.