Producer Failures

Diagnose NotEnoughReplicasException, TimeoutException, buffer exhaustion, and RecordTooLargeException from a Spring Boot producer.


1. Symptom

The Order service starts logging send failures and orders are not reaching the orders topic. The exception varies: TimeoutException, NotEnoughReplicasException, RecordTooLargeException, or errors about the producer buffer being full. Upstream, API calls that publish events start failing or slowing.

Producer failures split cleanly by exception type, and each points at a different root cause, so the goal is to read the exception and map it to the layer at fault.


2. Likely causes

| Exception | Root cause |
|---|---|
| NotEnoughReplicasException | ISR below min.insync.replicas (a cluster-health issue) |
| TimeoutException (delivery.timeout.ms / request.timeout.ms) | Cannot reach brokers, or brokers overloaded/slow |
| RecordTooLargeException | Message exceeds max.request.size or broker message.max.bytes |
| BufferExhaustedException / send blocking | Producing faster than the broker accepts; buffer.memory full |
| Authentication/authorization errors | Credentials or ACLs (see Auth Failures After Rotation) |
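
The mapping above can be sketched as a small lookup. This is a minimal illustration, not part of the Order service: the class name ProducerFailureTriage and the layer descriptions are hypothetical, and it matches on simple class names only so that it compiles without kafka-clients on the classpath. Real code would more likely use instanceof checks against the classes in org.apache.kafka.common.errors. It walks the cause chain on the assumption that the Kafka exception may arrive wrapped in another exception.

```java
import java.util.Map;

/** Hypothetical triage helper: maps a producer exception to the layer that likely owns the fix. */
final class ProducerFailureTriage {

    // Keyed by simple class name so the sketch compiles without kafka-clients on the classpath.
    // Real code would use instanceof checks against org.apache.kafka.common.errors.* instead.
    private static final Map<String, String> LAYER_BY_EXCEPTION = Map.of(
        "NotEnoughReplicasException", "cluster health: ISR below min.insync.replicas",
        "TimeoutException", "connectivity or broker load: brokers unreachable, overloaded, or slow",
        "RecordTooLargeException", "payload: record exceeds max.request.size or message.max.bytes",
        "BufferExhaustedException", "throughput: producing faster than the broker accepts"
    );

    /** Walks the cause chain, because the Kafka exception may arrive wrapped by another exception. */
    static String classify(Throwable failure) {
        Throwable current = failure;
        // The depth guard protects against pathological cyclic cause chains.
        for (int depth = 0; current != null && depth < 10; depth++) {
            String layer = LAYER_BY_EXCEPTION.get(current.getClass().getSimpleName());
            if (layer != null) {
                return layer;
            }
            current = current.getCause();
        }
        return "unknown: read the full stack trace";
    }

    public static void main(String[] args) {
        // java.util.concurrent.TimeoutException is only a stand-in that shares the simple name.
        Throwable wrapped = new RuntimeException(new java.util.concurrent.TimeoutException("expired"));
        System.out.println(classify(wrapped));
    }
}
```

Calling classify(ex) from the send callback would let the log line name the suspected layer next to the stack trace.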

3. How it manifests to the Spring app

| Cause | What the service sees |
|---|---|
| NotEnoughReplicasException | Producer retries internally; the send() future completes exceptionally if ISR does not recover before delivery.timeout.ms expires |
| Timeout | send() callback fails after delivery.timeout.ms; latency spikes first |
| Too large | Immediate failure on that record; others succeed |
| Buffer full | send() blocks up to max.block.ms, then throws; throughput stalls |

4. Diagnostic steps

  1. Read the exact exception in the Order service logs. This alone usually identifies the class of problem from the table above.
  2. For NotEnoughReplicasException, pivot to cluster health: check ISR and UnderReplicatedPartitions (see Under-Replicated Partitions). This is not an app bug.
  3. For TimeoutException, check connectivity and broker load: can the app reach the bootstrap servers, and are brokers healthy?
  4. For RecordTooLargeException, check the payload size against max.request.size and broker message.max.bytes. A new large field or an accidental blob is common.
  5. For buffer exhaustion, check produce rate vs broker acceptance; a broker slowdown backs up the client buffer.
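
For step 3, a quick TCP probe from the application host can separate "cannot reach the brokers at all" from "brokers reachable but slow". The sketch below is hypothetical (the class BrokerReachability and the address localhost:9092 are placeholders) and assumes every bootstrap entry has the form host:port. A successful connect only shows that the port is open; it does not prove that authentication or the Kafka protocol handshake works.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical probe: checks TCP reachability of each bootstrap server. */
final class BrokerReachability {

    /** Returns true if a TCP connection to host:port opens within timeoutMs. */
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false; // refused, timed out, or host not resolvable
        }
    }

    /** Probes every entry of a bootstrap.servers style list such as "b-1:9092,b-2:9092". */
    static Map<String, Boolean> probe(String bootstrapServers, int timeoutMs) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String entry : bootstrapServers.split(",")) {
            String server = entry.trim();
            int colon = server.lastIndexOf(':'); // assumes each entry is host:port
            String host = server.substring(0, colon);
            int port = Integer.parseInt(server.substring(colon + 1));
            result.put(server, canConnect(host, port, timeoutMs));
        }
        return result;
    }

    public static void main(String[] args) {
        // Placeholder address; substitute the service's real bootstrap servers.
        System.out.println(probe("localhost:9092", 1000));
    }
}
```
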

| Step | Question it answers | Time cost |
|---|---|---|
| 1. Read exception | Which class of failure? | seconds |
| 2. ISR check | Cluster durability issue? | 1-2 min |
| 3. Connectivity/load | Can we reach healthy brokers? | 1-2 min |
| 4. Payload size | Message too big? | 1 min |
| 5. Rate vs buffer | Producing too fast for capacity? | 2-3 min |
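
The payload check in step 4 can be approximated in code before the record is sent. This is a rough sketch with a hypothetical helper, PayloadSizeCheck, and it assumes key and value are serialized as UTF-8 strings. It ignores headers and record-batch overhead, so treat a result close to the limit as a warning rather than a guarantee. The constant 1048576 is the limit shown in the RecordTooLargeException exhibit.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical pre-send check: estimates serialized size against max.request.size. */
final class PayloadSizeCheck {

    // Limit from the RecordTooLargeException exhibit (1 MiB).
    static final int DEFAULT_MAX_REQUEST_SIZE = 1_048_576;

    /** Approximate size: key bytes plus value bytes. Headers and batch overhead are not counted. */
    static int approximateSize(String key, String value) {
        int keyBytes = key == null ? 0 : key.getBytes(StandardCharsets.UTF_8).length;
        int valueBytes = value == null ? 0 : value.getBytes(StandardCharsets.UTF_8).length;
        return keyBytes + valueBytes;
    }

    /** True when the estimate is already over the configured limit. */
    static boolean exceedsLimit(String key, String value, int maxRequestSize) {
        return approximateSize(key, value) > maxRequestSize;
    }

    public static void main(String[] args) {
        String event = "{\"orderId\":\"o-1\",\"status\":\"CREATED\"}";
        System.out.println(approximateSize("o-1", event) + " bytes");
        System.out.println(exceedsLimit("o-1", event, DEFAULT_MAX_REQUEST_SIZE)); // prints "false"
    }
}
```

Logging the estimate for failed records makes it easier to spot a new large field or an accidental blob.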

5. Safe remediations

| Situation | Safe action |
|---|---|
| NotEnoughReplicasException | Treat as cluster health; restore ISR or escalate. Keep acks=all; do not weaken durability |
| Transient timeouts, brokers healthy | Confirm retries and delivery.timeout.ms are configured; let retries cover blips |
| RecordTooLargeException from a legit large payload | Reduce payload, or raise max.request.size and broker limit together (a coordinated change) |
| Buffer exhaustion from a spike | Slow the producer or scale; verify broker health first |
| Auth errors | Follow the auth-rotation playbook |
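
When confirming that retries and delivery.timeout.ms are configured, one relationship is worth checking: the Kafka producer rejects a configuration in which delivery.timeout.ms is smaller than linger.ms plus request.timeout.ms. The sketch below uses plain string keys and illustrative values (120000 ms matches the TimeoutException exhibit; the other numbers are assumptions, not the Order service's real settings). The class ProducerTimeoutCheck is hypothetical.

```java
import java.util.Properties;

/** Hypothetical sanity check for producer timeout settings. */
final class ProducerTimeoutCheck {

    /** Illustrative values only; the service's real settings may differ. */
    static Properties exampleProducerProps() {
        Properties props = new Properties();
        props.setProperty("acks", "all");                   // keep durability, per the table above
        props.setProperty("delivery.timeout.ms", "120000"); // total time budget for a send, retries included
        props.setProperty("request.timeout.ms", "30000");   // wait per request before retrying
        props.setProperty("linger.ms", "5");                // batching delay (assumed value)
        props.setProperty("max.block.ms", "60000");         // how long send() may block on a full buffer
        return props;
    }

    /** Requires all three keys to be present; a missing key fails with NumberFormatException. */
    static boolean deliveryTimeoutIsConsistent(Properties props) {
        long delivery = Long.parseLong(props.getProperty("delivery.timeout.ms"));
        long request = Long.parseLong(props.getProperty("request.timeout.ms"));
        long linger = Long.parseLong(props.getProperty("linger.ms"));
        return delivery >= linger + request;
    }

    public static void main(String[] args) {
        System.out.println(deliveryTimeoutIsConsistent(exampleProducerProps())); // prints "true"
    }
}
```

In a Spring Boot service these keys would typically be supplied through the spring.kafka.producer settings rather than a hand-built Properties object.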

6. Escalation trigger

Page on-call engineering if:

  • NotEnoughReplicasException persists because the cluster cannot restore ISR.
  • Timeouts stem from broker overload or an outage rather than a client blip.
  • Raising max.request.size / message.max.bytes is needed (coordinated producer and broker change).
  • Send failures continue after confirming connectivity, healthy brokers, and correct config.

7. Relevant commands and exhibits

# Exception exhibits from the Order service
org.apache.kafka.common.errors.TimeoutException:
  Expiring 1 record(s) for orders-0: 120000 ms has passed since batch creation

org.apache.kafka.common.errors.RecordTooLargeException:
  The message is 2148271 bytes when serialized which is larger than 1048576

org.apache.kafka.common.errors.NotEnoughReplicasException:
  Messages are rejected since there are fewer in-sync replicas than required

// Always handle the send result so failures are visible
kafkaTemplate.send("orders", key, event).whenComplete((result, ex) -> {
    if (ex != null) {
        log.error("Failed to publish OrderCreated {}", key, ex);
    }
});

In MSK CloudWatch, check UnderReplicatedPartitions (for NotEnoughReplicasException) and broker CPU and network metrics (for overload).


8. Guided practical

Reproduce producer failures in the local lab.

  1. Create orders with RF 3 and min.insync.replicas 2 in the three-broker lab, then stop two brokers and produce with acks=all: observe NotEnoughReplicasException.
  2. Restore brokers and produce a record larger than 1 MB to trigger RecordTooLargeException.
  3. Confirm your producer logs the failure via whenComplete rather than dropping it silently.
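
For step 2, one way to build a record larger than 1 MB is to pad a JSON field. This is a lab-only sketch: the class OversizedPayload, the orderId value, and the padding field name are all hypothetical. The payload is pure ASCII, so its UTF-8 byte count equals its character count.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical lab helper: builds a JSON payload of roughly the requested size. */
final class OversizedPayload {

    /** Returns an ASCII JSON string of exactly targetBytes, provided targetBytes exceeds the wrapper length. */
    static String build(int targetBytes) {
        String prefix = "{\"orderId\":\"lab-1\",\"padding\":\"";
        String suffix = "\"}";
        int padding = Math.max(0, targetBytes - prefix.length() - suffix.length());
        return prefix + "x".repeat(padding) + suffix;
    }

    public static void main(String[] args) {
        String payload = build(2_000_000);
        System.out.println(payload.getBytes(StandardCharsets.UTF_8).length + " bytes");
    }
}
```

Passing the result as the value to kafkaTemplate.send("orders", key, payload) with the default 1048576-byte limit should then trigger the failure described in step 2.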

Next: Poison Messages, Deserialization Errors, and the DLT.