Producer Failures
Diagnose NotEnoughReplicasException, TimeoutException, buffer exhaustion, and RecordTooLargeException from a Spring Boot producer.
1. Symptom
The Order service starts logging send failures and orders are not reaching the orders topic. The exception varies: TimeoutException, NotEnoughReplicasException, RecordTooLargeException, or errors about the producer buffer being full. Upstream, API calls that publish events start failing or slowing.
Producer failures split cleanly by exception type, and each points at a different root cause, so the goal is to read the exception and map it to the layer at fault.
2. Likely causes
| Exception | Root cause |
|---|---|
| NotEnoughReplicasException | ISR below min.insync.replicas (a cluster-health issue) |
| TimeoutException (delivery.timeout.ms or request.timeout.ms) | Cannot reach brokers, or brokers overloaded/slow |
| RecordTooLargeException | Message exceeds producer max.request.size or broker message.max.bytes |
| BufferExhaustedException / send blocking | Producing faster than the broker accepts; buffer.memory full |
| Authentication/authorization errors | Credentials or ACLs (see Auth Failures After Rotation) |
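
The mapping in this table can be sketched as a small classifier. This is a minimal illustration, not the Order service's actual code: `ProducerFailureClassifier` and its `Layer` enum are hypothetical names, and the sketch matches on the exception's simple class name only so that it compiles without the Kafka client on the classpath. In the real service you would more likely use `instanceof` against the `org.apache.kafka.common.errors` types.

```java
/** Hypothetical helper: maps a producer send failure to the layer that most likely owns it. */
final class ProducerFailureClassifier {

    enum Layer { CLUSTER_HEALTH, NETWORK_OR_BROKER_LOAD, PAYLOAD_SIZE, CLIENT_BUFFER, AUTH, UNKNOWN }

    /** Walks the cause chain, because futures and frameworks often wrap the Kafka exception. */
    static Layer classify(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            Layer layer = byName(t.getClass().getSimpleName());
            if (layer != Layer.UNKNOWN) {
                return layer;
            }
        }
        return Layer.UNKNOWN;
    }

    /** Name-based lookup; note that java.util.concurrent.TimeoutException matches as well. */
    static Layer byName(String simpleName) {
        switch (simpleName) {
            case "NotEnoughReplicasException":
            case "NotEnoughReplicasAfterAppendException":
                return Layer.CLUSTER_HEALTH;
            case "BufferExhaustedException":
                return Layer.CLIENT_BUFFER;
            case "TimeoutException":
                return Layer.NETWORK_OR_BROKER_LOAD;
            case "RecordTooLargeException":
                return Layer.PAYLOAD_SIZE;
            case "SaslAuthenticationException":
            case "TopicAuthorizationException":
                return Layer.AUTH;
            default:
                return Layer.UNKNOWN;
        }
    }
}
```

Calling `ProducerFailureClassifier.classify(ex)` inside the send callback gives a label that can be attached to the log line or a metric, so the failure class is visible without reading stack traces.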
3. How it manifests to the Spring app
| Cause | What the service sees |
|---|---|
| NotEnoughReplicasException | Producer retries until ISR recovers; if it does not recover within delivery.timeout.ms, the send() future completes exceptionally |
| Timeout | send() callback fails after delivery.timeout.ms; latency spikes first |
| Too large | Immediate failure on that record; others succeed |
| Buffer full | send() blocks up to max.block.ms, then throws; throughput stalls |
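
The buffer-full row follows from simple arithmetic: if the application appends bytes faster than the brokers accept them, the surplus accumulates in buffer.memory until it is full, and only then does send() start blocking. The sketch below is a rough estimate under stated assumptions. `BufferFillEstimate` is a hypothetical name, the two rates are made-up inputs, and batch overhead and compression are ignored.

```java
/** Hypothetical back-of-envelope estimate of how long a producer buffer absorbs a slowdown. */
final class BufferFillEstimate {

    /** Seconds until the buffer is full, or positive infinity if the brokers keep up. */
    static double secondsUntilFull(long bufferMemoryBytes, long appendBytesPerSec, long drainBytesPerSec) {
        long netGrowthPerSec = appendBytesPerSec - drainBytesPerSec;
        if (netGrowthPerSec <= 0) {
            return Double.POSITIVE_INFINITY;
        }
        return (double) bufferMemoryBytes / netGrowthPerSec;
    }

    public static void main(String[] args) {
        long bufferMemory = 33_554_432L;       // Kafka default buffer.memory (32 MiB)
        long appendRate = 8L * 1024 * 1024;    // assumed: app produces 8 MiB/s
        long drainRate = 4L * 1024 * 1024;     // assumed: slowed brokers accept 4 MiB/s
        System.out.println(secondsUntilFull(bufferMemory, appendRate, drainRate)); // prints 8.0
    }
}
```

With these assumed rates the buffer hides the slowdown for about eight seconds, which is why latency and throughput symptoms can lag behind the broker problem that caused them.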
4. Diagnostic steps
- Read the exact exception in the Order service logs. This alone usually identifies the class of problem from the table above.
- For NotEnoughReplicasException, pivot to cluster health: check ISR and UnderReplicatedPartitions (see Under-Replicated Partitions). This is not an app bug.
- For TimeoutException, check connectivity and broker load: can the app reach the bootstrap servers, and are brokers healthy?
- For RecordTooLargeException, check the payload size against max.request.size and broker message.max.bytes. A new large field or an accidental blob is common.
- For buffer exhaustion, check produce rate vs broker acceptance; a broker slowdown backs up the client buffer.
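
For the connectivity step, a plain TCP probe from the application host can rule out the network quickly. The following is a sketch using only the JDK. `BootstrapReachability` is a hypothetical name and `localhost:9092` is a placeholder default. A successful connect shows only that the port is reachable; it does not prove the broker is healthy, nor that TLS/SASL or advertised listeners are set up correctly.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical probe: can this host open a TCP connection to each bootstrap server? */
final class BootstrapReachability {

    /** Splits a bootstrap.servers value such as "host1:9092,host2:9092" without touching DNS. */
    static List<InetSocketAddress> parse(String bootstrapServers) {
        List<InetSocketAddress> result = new ArrayList<>();
        for (String entry : bootstrapServers.split(",")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int colon = trimmed.lastIndexOf(':');
            if (colon <= 0) {
                throw new IllegalArgumentException("Expected host:port but got: " + trimmed);
            }
            int port = Integer.parseInt(trimmed.substring(colon + 1));
            result.add(InetSocketAddress.createUnresolved(trimmed.substring(0, colon), port));
        }
        return result;
    }

    /** Returns true if a TCP connection succeeds within the timeout; DNS failures count as false. */
    static boolean canConnect(InetSocketAddress address, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(address.getHostString(), address.getPort()), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String bootstrap = args.length > 0 ? args[0] : "localhost:9092"; // placeholder default
        for (InetSocketAddress a : parse(bootstrap)) {
            System.out.println(a.getHostString() + ":" + a.getPort() + " reachable=" + canConnect(a, 2000));
        }
    }
}
```

If every bootstrap server is reachable but sends still time out, the next suspect is broker load rather than the network.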
| Step | Question it answers | Time cost |
|---|---|---|
| 1. Read exception | Which class of failure? | seconds |
| 2. ISR check | Cluster durability issue? | 1-2 min |
| 3. Connectivity/load | Can we reach healthy brokers? | 1-2 min |
| 4. Payload size | Message too big? | 1 min |
| 5. Rate vs buffer | Producing too fast for capacity? | 2-3 min |
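
Step 4 can be turned into a pre-send guard or a unit test. The sketch below assumes the event is serialized as a UTF-8 JSON string, which may not match the Order service's serializer. `PayloadSizeCheck` and the `headroomBytes` parameter are hypothetical. The headroom exists because the producer's size check also counts the key, headers, and record overhead, so a value just under the limit can still be rejected.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical guard: compares a serialized payload with the producer's request size limit. */
final class PayloadSizeCheck {

    static final int DEFAULT_MAX_REQUEST_SIZE = 1_048_576; // Kafka producer default max.request.size

    /** Size in bytes of a JSON payload, assuming UTF-8 encoding on the wire. */
    static int serializedSize(String json) {
        return json.getBytes(StandardCharsets.UTF_8).length;
    }

    /** True if the payload would not fit once headroom for key, headers, and overhead is reserved. */
    static boolean exceedsLimit(int serializedBytes, int limitBytes, int headroomBytes) {
        return serializedBytes > limitBytes - headroomBytes;
    }

    public static void main(String[] args) {
        // Size taken from the RecordTooLargeException exhibit in this playbook.
        System.out.println(exceedsLimit(2_148_271, DEFAULT_MAX_REQUEST_SIZE, 0)); // prints true
    }
}
```

Logging the serialized size alongside the event type when this check trips makes it easy to spot which field or blob grew.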
5. Safe remediations
| Situation | Safe action |
|---|---|
| NotEnoughReplicasException | Treat as cluster health; restore ISR or escalate. Keep acks=all; do not weaken durability |
| Transient timeouts, brokers healthy | Confirm retries and delivery.timeout.ms are configured; let retries cover blips |
| RecordTooLargeException from a legitimate large payload | Reduce the payload, or raise max.request.size and the broker limit (message.max.bytes) together (a coordinated change) |
| Buffer exhaustion from a spike | Slow the producer or scale; verify broker health first |
| Auth errors | Follow the auth-rotation playbook |
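
Before relying on retries to cover blips, it helps to confirm the effective producer settings are coherent. The sketch below audits a property map using the standard Kafka producer keys. `ProducerConfigAudit` is a hypothetical helper, and the fallback values are the documented client defaults as I understand them; check them against your client version. The one hard rule encoded here is that the client requires delivery.timeout.ms to be at least linger.ms plus request.timeout.ms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical audit of producer properties relevant to the remediations in this playbook. */
final class ProducerConfigAudit {

    static List<String> audit(Map<String, ?> props) {
        List<String> findings = new ArrayList<>();

        Object acks = props.get("acks");
        if (acks == null) {
            findings.add("acks is not set explicitly; the default depends on the client version");
        } else if (!"all".equals(acks.toString()) && !"-1".equals(acks.toString())) {
            findings.add("acks=" + acks + " weakens durability; keep acks=all");
        }

        // Fallbacks are assumed client defaults, used only when a key is absent.
        long deliveryTimeoutMs = asLong(props, "delivery.timeout.ms", 120_000L);
        long requestTimeoutMs = asLong(props, "request.timeout.ms", 30_000L);
        long lingerMs = asLong(props, "linger.ms", 0L);
        if (deliveryTimeoutMs < lingerMs + requestTimeoutMs) {
            findings.add("delivery.timeout.ms must be >= linger.ms + request.timeout.ms");
        }

        if (asLong(props, "retries", Integer.MAX_VALUE) == 0) {
            findings.add("retries=0 means transient broker errors surface immediately");
        }
        return findings;
    }

    static long asLong(Map<String, ?> props, String key, long fallback) {
        Object value = props.get(key);
        return value == null ? fallback : Long.parseLong(value.toString().trim());
    }
}
```

An empty list means none of these three checks tripped; it does not validate the rest of the configuration.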
6. Escalation trigger
Page on-call engineering if:
- NotEnoughReplicasException persists because the cluster cannot restore ISR.
- Timeouts stem from broker overload or an outage rather than a client blip.
- Raising max.request.size / message.max.bytes is needed (coordinated producer and broker change).
- Send failures continue after confirming connectivity, healthy brokers, and correct config.
7. Relevant commands and exhibits
Exception exhibits from the Order service:

    org.apache.kafka.common.errors.TimeoutException:
      Expiring 1 record(s) for orders-0: 120000 ms has passed since batch creation
    org.apache.kafka.common.errors.RecordTooLargeException:
      The message is 2148271 bytes when serialized which is larger than 1048576
    org.apache.kafka.common.errors.NotEnoughReplicasException:
      Messages are rejected since there are fewer in-sync replicas than required

Always handle the send result so failures are visible:

    kafkaTemplate.send("orders", key, event).whenComplete((result, ex) -> {
        if (ex != null) {
            // Log with the key so the failed order can be traced and replayed
            log.error("Failed to publish OrderCreated {}", key, ex);
        }
    });

MSK CloudWatch metrics to check: UnderReplicatedPartitions (for NotEnoughReplicasException), and broker CPU and network throughput (for overload).
8. Guided practical
Reproduce producer failures in the local lab.
- Create orders with RF 3 and min.insync.replicas 2 in the three-broker lab, then stop two brokers and produce with acks=all: observe NotEnoughReplicasException.
- Restore brokers and produce a record larger than 1 MB to trigger RecordTooLargeException.
- Confirm your producer logs the failure via whenComplete rather than dropping it silently.
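
The expected outcomes of the first two lab steps can be checked on paper before running them. The sketch below is illustrative only. `LabExpectations` is a hypothetical name, and it assumes every stopped broker hosted a replica of the partition, which holds for RF 3 on a three-broker lab.

```java
import java.util.Arrays;

/** Hypothetical helper that predicts the lab outcomes and builds an oversized test payload. */
final class LabExpectations {

    /** With acks=all, a write is accepted only while the ISR size is at least min.insync.replicas. */
    static boolean acksAllWriteAccepted(int replicationFactor, int brokersStopped, int minInSyncReplicas) {
        int bestCaseIsr = Math.max(0, replicationFactor - brokersStopped);
        return bestCaseIsr >= minInSyncReplicas;
    }

    /** Builds a value one byte larger than the given limit, enough to exceed max.request.size. */
    static byte[] oversizedValue(int limitBytes) {
        byte[] value = new byte[limitBytes + 1];
        Arrays.fill(value, (byte) 'x');
        return value;
    }

    public static void main(String[] args) {
        System.out.println(acksAllWriteAccepted(3, 2, 2)); // prints false: expect NotEnoughReplicasException
        System.out.println(acksAllWriteAccepted(3, 1, 2)); // prints true: one broker down is tolerated
        System.out.println(oversizedValue(1_048_576).length); // prints 1048577
    }
}
```

Sending the array from `oversizedValue` as the record value, with a byte-array serializer, should be enough to trigger the size check in the second step.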