Kafka Retries, Error Handling, and Dead Letter Topics with Spring

When a consumer’s handler throws, Kafka does not have a broker-side retry queue like a traditional message broker. The record’s offset simply is not committed, so the same record is redelivered. Left alone, a record that always fails blocks its partition forever. This module shows how Spring Kafka turns that raw behavior into controlled retries, backoff, and a dead letter topic for records that cannot be processed.

What you’ll be able to do after this module

Explain what happens by default when a listener throws.
Configure blocking retries with DefaultErrorHandler and backoff.
Use @RetryableTopic for non-blocking retries with retry topics and a DLT.
Classify transient failures from deterministic poison failures.
Route un-deserializable records safely.

1. What happens by default

If a @KafkaListener method throws, the container does not commit that record’s offset. On the next poll the record is delivered again. If the failure is permanent, for example a record that violates a business rule, this becomes an infinite redelivery loop that stalls the whole partition, because Kafka preserves order and will not skip ahead.

So the question is never “should I handle errors” but “how many times do I retry, how long do I wait, and where does a record go when it still fails.”

2. Blocking retries with DefaultErrorHandler

Spring Kafka’s DefaultErrorHandler retries the record in place, in the consumer thread, with a backoff. After the attempts are exhausted, it invokes a recoverer, which commonly publishes the record to a dead letter topic.

@Bean
DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> template) {
    DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(template);
    // 3 retries, 2 seconds apart, then publish to <topic>.DLT
    return new DefaultErrorHandler(recoverer, new FixedBackOff(2000L, 3L));
}

This is called blocking retry because the container pauses on that record during the backoff. It is simple and keeps ordering, but a long backoff holds up every later record on the partition.

flowchart TD
    rec["record"] --> h{"handler throws?"}
    h -->|no| commit["commit offset"]
    h -->|yes| retry{"retries left?"}
    retry -->|yes| wait["wait backoff, retry in place"] --> h
    retry -->|no| dlt["publish to topic.DLT, commit"]

3. Non-blocking retries with @RetryableTopic

When you need longer or escalating backoff without stalling the partition, use non-blocking retries. Spring forwards a failed record to a separate retry topic and moves on, so the main topic keeps flowing. A retry topic consumer processes it later, and after the final attempt the record lands in a dead letter topic.

@RetryableTopic(
    attempts = "4",
    backoff = @Backoff(delay = 1000, multiplier = 2.0),
    dltStrategy = DltStrategy.FAIL_ON_ERROR
)
@KafkaListener(topics = "orders", groupId = "inventory-service")
public void onOrderCreated(OrderCreated event) {
    inventoryService.reserve(event);
}

@DltHandler
public void handleDlt(OrderCreated event) {
    log.error("Order {} sent to DLT after retries exhausted", event.orderId());
}

Spring auto-creates the retry topics (orders-retry-0, orders-retry-1, and so on) and the dead letter topic (orders-dlt). The backoff escalates per topic, so early transient failures recover fast while stubborn ones get more time without blocking healthy traffic.

flowchart LR
    main["orders"] -->|fail| r0["orders-retry-0<br/>(1s)"]
    r0 -->|fail| r1["orders-retry-1<br/>(2s)"]
    r1 -->|fail| r2["orders-retry-2<br/>(4s)"]
    r2 -->|fail| dlt["orders-dlt"]
    main -->|ok| done["processed"]

4. Blocking vs non-blocking: which to use

Aspect	Blocking (`DefaultErrorHandler`)	Non-blocking (`@RetryableTopic`)
Where retries run	In the partition’s consumer thread	On separate retry topics
Ordering	Preserved on the partition	Not preserved across retries
Best backoff	Short	Long or escalating
Cost	Stalls the partition during backoff	Extra topics to manage

Use blocking retries for fast, likely-transient blips where order matters. Use non-blocking retries when a failure may take a while to clear, such as a downstream service being down, and you cannot hold up the partition.

5. Classify the failure

Retrying only helps if the failure might succeed next time. Classify before you retry.

Transient: a temporary condition, such as a timeout or a downstream service briefly unavailable. Retry with backoff.
Deterministic (poison): the record will fail every time, such as a business-rule violation or malformed data. Retrying wastes time; send it straight to the DLT.

Tell DefaultErrorHandler not to retry deterministic exceptions:

errorHandler.addNotRetryableExceptions(IllegalArgumentException.class);

A special poison case is a record that cannot even be deserialized. As introduced in First Producer and Consumer, wrap the deserializer in ErrorHandlingDeserializer so the failure is caught and routed to the error handler instead of looping. Without it, deserialization fails before your handler runs and no retry logic applies.

6. Guided practical

Run this against the local lab, using the Inventory consumer.

Add @RetryableTopic with 4 attempts and an escalating backoff, plus a @DltHandler.
Make the handler throw for a specific item value to simulate a poison record.
Produce one poison record and several good ones; confirm the good ones flow while the poison one moves through the retry topics.
Confirm the poison record lands in orders-dlt after the final attempt.
Add a deserialization failure (malformed JSON) and confirm ErrorHandlingDeserializer routes it rather than looping.

Next:Idempotent Consumers, Ordering, and Duplicates, where you make the consumer safe against the duplicates that at-least-once delivery guarantees.