Kafka Producer Failures: Incident Playbook for Spring Boot

1. Symptom

The Order service starts logging send failures and orders are not reaching the orders topic. The exception varies: TimeoutException, NotEnoughReplicasException, RecordTooLargeException, or errors about the producer buffer being full. Upstream, API calls that publish events start failing or slowing.

Producer failures split cleanly by exception type, and each points at a different root cause, so the goal is to read the exception and map it to the layer at fault.

2. Likely causes

Exception	Root cause
`NotEnoughReplicasException`	ISR below `min.insync.replicas` (a cluster-health issue)
`TimeoutException` (`delivery.timeout.ms` or `request.timeout.ms` expired)	Cannot reach brokers, or brokers overloaded/slow
`RecordTooLargeException`	Message exceeds producer `max.request.size` or broker `message.max.bytes` (or a topic-level `max.message.bytes` override)
`BufferExhaustedException` / send blocking	Producing faster than the broker accepts; `buffer.memory` full
Authentication/authorization errors	Credentials or ACLs (see Auth Failures After Rotation)
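
The mapping in the table can be sketched as a small triage helper. This is a minimal, dependency-free illustration, not code from the Order service: the class name `ProducerFailureTriage` and the `Layer` enum are hypothetical, and matching on the exception's simple class name is an assumption made so the sketch runs without the Kafka client on the classpath. In the real service you would more likely use `instanceof` checks against the classes in `org.apache.kafka.common.errors`. The helper walks the cause chain because wrappers (for example Spring's `KafkaProducerException`) can hide the client exception.

```java
/** Hypothetical triage helper: maps a send failure to the layer this playbook points at. */
final class ProducerFailureTriage {

    enum Layer { CLUSTER_HEALTH, CONNECTIVITY_OR_BROKER_LOAD, PAYLOAD_SIZE, CLIENT_BUFFER, AUTH, UNKNOWN }

    /** Maps an exception's simple class name to a layer (name matching is a simplification). */
    static Layer byName(String simpleName) {
        switch (simpleName) {
            case "NotEnoughReplicasException":
            case "NotEnoughReplicasAfterAppendException":
                return Layer.CLUSTER_HEALTH;
            case "TimeoutException":
                return Layer.CONNECTIVITY_OR_BROKER_LOAD;
            case "RecordTooLargeException":
                return Layer.PAYLOAD_SIZE;
            case "BufferExhaustedException":
                return Layer.CLIENT_BUFFER;
            default:
                // Covers names such as SaslAuthenticationException or TopicAuthorizationException.
                if (simpleName.endsWith("AuthenticationException")
                        || simpleName.endsWith("AuthorizationException")) {
                    return Layer.AUTH;
                }
                return Layer.UNKNOWN;
        }
    }

    /** Walks up to 10 levels of the cause chain and returns the first recognised layer. */
    static Layer classify(Throwable failure) {
        Throwable current = failure;
        for (int depth = 0; current != null && depth < 10; depth++) {
            Layer layer = byName(current.getClass().getSimpleName());
            if (layer != Layer.UNKNOWN) {
                return layer;
            }
            current = current.getCause();
        }
        return Layer.UNKNOWN;
    }
}
```

Calling `ProducerFailureTriage.classify(ex)` inside the send callback and logging the result alongside the exception would make the failure class searchable in logs.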

3. How it manifests to the Spring app

Cause	What the service sees
`NotEnoughReplicasException`	Producer retries internally (the error is retriable); if ISR does not recover within `delivery.timeout.ms`, the `send()` future completes exceptionally
Timeout	`send()` callback fails after `delivery.timeout.ms`; latency spikes first
Too large	Immediate failure on that record; others succeed
Buffer full	`send()` blocks up to `max.block.ms`, then throws; throughput stalls
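
The timings in the table come from a handful of producer settings, and it can help to check how they combine. The sketch below is an illustration under stated assumptions: the class `ProducerTimingCheck` is hypothetical, the fallback values are the commonly documented Kafka client defaults (the `linger.ms` default differs in newer client versions, so verify against your client), and the config is passed as a plain map rather than read from Spring. It checks the client's rule that `delivery.timeout.ms` must be at least `linger.ms + request.timeout.ms`, and estimates the longest a caller can wait before seeing a failure (blocking in `send()` plus the delivery timeout).

```java
import java.util.Map;

/** Hypothetical helper for reasoning about producer timeout settings. */
final class ProducerTimingCheck {

    /** Reads a millisecond setting from the map, falling back to an assumed default. */
    static long getMs(Map<String, Object> config, String key, long defaultMs) {
        Object value = config.get(key);
        return value == null ? defaultMs : Long.parseLong(value.toString());
    }

    /** The Kafka client rejects configs where delivery.timeout.ms < linger.ms + request.timeout.ms. */
    static boolean deliveryTimeoutIsValid(Map<String, Object> config) {
        long delivery = getMs(config, "delivery.timeout.ms", 120_000L);
        long linger = getMs(config, "linger.ms", 0L);
        long request = getMs(config, "request.timeout.ms", 30_000L);
        return delivery >= linger + request;
    }

    /** Upper bound on time to failure: send() may block up to max.block.ms, then delivery may take delivery.timeout.ms. */
    static long worstCaseFailureMs(Map<String, Object> config) {
        return getMs(config, "max.block.ms", 60_000L)
                + getMs(config, "delivery.timeout.ms", 120_000L);
    }
}
```

With the assumed defaults, the worst case is 180000 ms, which is worth comparing against the upstream API timeout: if the API gives up sooner, callers see failures before the producer reports one.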

4. Diagnostic steps

Read the exact exception in the Order service logs. This alone usually identifies the class of problem from the table above.
For NotEnoughReplicasException, pivot to cluster health: check ISR and UnderReplicatedPartitions (see Under-Replicated Partitions). This is not an app bug.
For TimeoutException, check connectivity and broker load: can the app reach the bootstrap servers, and are brokers healthy?
For RecordTooLargeException, check the payload size against max.request.size and broker message.max.bytes. A new large field or an accidental blob is common.
For buffer exhaustion, check produce rate vs broker acceptance; a broker slowdown backs up the client buffer.
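
For the connectivity question, a plain TCP probe from the app's host or container is a quick first check. The following is a minimal sketch, assuming the bootstrap list is a comma-separated `host:port` string; the class `BootstrapProbe` and its method names are hypothetical. A successful connect only shows that the port is reachable. It says nothing about TLS, SASL, or whether the broker is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical probe: can this host open a TCP connection to each bootstrap server? */
final class BootstrapProbe {

    /** Splits "host1:9092,host2:9092" into unresolved socket addresses. */
    static List<InetSocketAddress> parse(String bootstrapServers) {
        List<InetSocketAddress> addresses = new ArrayList<>();
        for (String entry : bootstrapServers.split(",")) {
            String hostPort = entry.trim();
            if (hostPort.isEmpty()) {
                continue;
            }
            int colon = hostPort.lastIndexOf(':');
            if (colon <= 0) {
                throw new IllegalArgumentException("Expected host:port but got: " + hostPort);
            }
            String host = hostPort.substring(0, colon);
            int port = Integer.parseInt(hostPort.substring(colon + 1));
            addresses.add(InetSocketAddress.createUnresolved(host, port));
        }
        return addresses;
    }

    /** Returns true if a TCP connection succeeds within timeoutMs; DNS and connect failures return false. */
    static boolean canConnect(InetSocketAddress address, int timeoutMs) {
        try (Socket socket = new Socket()) {
            // Resolve at connect time so DNS failures are reported as "cannot connect".
            socket.connect(new InetSocketAddress(address.getHostString(), address.getPort()), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

Looping over `BootstrapProbe.parse(servers)` and logging `canConnect(address, 2000)` for each entry separates "no route to brokers" from "brokers reachable but slow", which is the distinction this step needs.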

Step	Question it answers	Time cost
1. Read exception	Which class of failure?	seconds
2. ISR check	Cluster durability issue?	1-2 min
3. Connectivity/load	Can we reach healthy brokers?	1-2 min
4. Payload size	Message too big?	1 min
5. Rate vs buffer	Producing too fast for capacity?	2-3 min
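
The payload-size step can be approximated before the record ever reaches the producer. The sketch below assumes keys and values are serialized as UTF-8 strings (for example JSON), which may not match the Order service's serializer; the class `PayloadSizeCheck` is hypothetical. Key plus value bytes is a lower bound on what the client measures, since headers and record overhead add to it, so a payload just under the limit can still be rejected.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical pre-send size check, assuming UTF-8 string serialization. */
final class PayloadSizeCheck {

    /** Kafka producer default for max.request.size (1 MiB). */
    static final int DEFAULT_MAX_REQUEST_SIZE = 1_048_576;

    /** Key bytes plus value bytes; excludes headers and record overhead. */
    static int serializedSize(String key, String value) {
        int keyBytes = key == null ? 0 : key.getBytes(StandardCharsets.UTF_8).length;
        int valueBytes = value == null ? 0 : value.getBytes(StandardCharsets.UTF_8).length;
        return keyBytes + valueBytes;
    }

    /** True when key plus value alone already exceed the limit, so the send would certainly fail. */
    static boolean exceeds(String key, String value, int maxRequestSize) {
        return serializedSize(key, value) > maxRequestSize;
    }
}
```

Logging `serializedSize` for a sample of events before and after a release is a cheap way to spot the new large field or accidental blob mentioned above.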

5. Safe remediations

Situation	Safe action
`NotEnoughReplicasException`	Treat as cluster health; restore ISR or escalate. Keep `acks=all`; do not weaken durability
Transient timeouts, brokers healthy	Confirm retries and `delivery.timeout.ms` are configured; let retries cover blips
`RecordTooLargeException` from a legit large payload	Reduce payload, or raise `max.request.size` and broker limit together (a coordinated change)
Buffer exhaustion from a spike	Slow the producer or scale; verify broker health first
Auth errors	Follow the auth-rotation playbook
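
One possible way to "slow the producer" during a spike is to cap the number of sends in flight, so callers are pushed back before the client buffer fills. This is a sketch of that idea, not the Order service's actual mechanism: the class `InFlightLimiter` is hypothetical, and it assumes the send call returns a `CompletableFuture` (as `KafkaTemplate.send` does in Spring for Apache Kafka 3.x).

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical limiter that bounds concurrent in-flight sends with a semaphore. */
final class InFlightLimiter {

    private final Semaphore permits;

    InFlightLimiter(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    /** Waits up to waitMs for a free slot, then runs the send; fails fast if no slot frees up. */
    <T> CompletableFuture<T> submit(Supplier<CompletableFuture<T>> send, long waitMs) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(waitMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            acquired = false;
        }
        if (!acquired) {
            return CompletableFuture.failedFuture(
                    new IllegalStateException("Too many sends in flight"));
        }
        try {
            // Release the slot when the send completes, successfully or not.
            return send.get().whenComplete((result, ex) -> permits.release());
        } catch (RuntimeException e) {
            permits.release(); // the send never started, so free the slot
            throw e;
        }
    }

    int available() {
        return permits.availablePermits();
    }
}
```

Usage would look like `limiter.submit(() -> kafkaTemplate.send("orders", key, event), 50)`, with the API layer translating a rejected send into a retryable response. The limit and wait time are tuning choices; verify broker health first, since throttling only hides a broker slowdown.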

6. Escalation trigger

Page on-call engineering if:

NotEnoughReplicasException persists because the cluster cannot restore ISR.
Timeouts stem from broker overload or an outage rather than a client blip.
Raising max.request.size / message.max.bytes is needed (coordinated producer and broker change).
Send failures continue after confirming connectivity, healthy brokers, and correct config.

7. Relevant commands and exhibits

# Exception exhibits from the Order service
org.apache.kafka.common.errors.TimeoutException:
  Expiring 1 record(s) for orders-0: 120000 ms has passed since batch creation

org.apache.kafka.common.errors.RecordTooLargeException:
  The message is 2148271 bytes when serialized which is larger than 1048576

org.apache.kafka.common.errors.NotEnoughReplicasException:
  Messages are rejected since there are fewer in-sync replicas than required

// Always handle the send result so failures are visible.
// whenComplete assumes send() returns a CompletableFuture (Spring for Apache Kafka 3.x).
kafkaTemplate.send("orders", key, event).whenComplete((result, ex) -> {
    if (ex != null) {
        log.error("Failed to publish OrderCreated {}", key, ex);
    }
});

MSK CloudWatch metrics to check: UnderReplicatedPartitions (for NotEnoughReplicasException), plus broker CPU and network throughput (for overload).

8. Guided practical

Reproduce producer failures in the local lab.

Create the orders topic with replication factor 3 and min.insync.replicas=2 in the three-broker lab, then stop two brokers and produce with acks=all. With only one replica left in sync, the send fails with NotEnoughReplicasException.
Restore brokers and produce a record larger than 1 MB to trigger RecordTooLargeException.
Confirm your producer logs the failure via whenComplete rather than dropping it silently.

Next: Poison Messages, Deserialization Errors, and the DLT.