
Latency, Ordering, and Duplicates

Diagnose GC pauses and broker-wide slowness, out-of-order symptoms, and duplicate processing after a rebalance.


1. Symptom

Three related complaints often arrive together: end-to-end latency spikes (an event takes seconds instead of milliseconds), records that appear to be processed out of order, and the same event processed twice (a customer charged twice, a duplicate notification). These are usually not bugs in Kafka but consequences of how the system is configured or how consumers handle rebalances.

The goal is to map each symptom to its real cause, because “out of order” and “duplicate” almost always trace back to keys, partitions, and at-least-once delivery (covered in Idempotency and Ordering).


2. Likely causes

| Symptom | Likely cause |
| --- | --- |
| Latency spike | GC pauses on broker or client, broker overload, or EBS saturation (see AWS-Layer Connectivity) |
| Out of order | Records for one logical entity split across partitions (wrong key), or max.in.flight > 1 without idempotence |
| Duplicates | At-least-once redelivery after a rebalance or retry, with a non-idempotent consumer |
| Duplicates after deploy | Offsets committed after processing, then a crash/rebalance replays the batch |

3. How it manifests to the Spring app

| Cause | What the service sees |
| --- | --- |
| GC pause | Periodic latency spikes; heartbeat failures in logs during long pauses |
| Wrong key | Same customer’s events processed on different consumers, interleaved |
| Rebalance replay | A batch of already-processed records delivered again after partitions assigned |
| Retry duplication | Producer resend created a duplicate that a non-idempotent consumer acts on twice |
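
The rebalance replay row can be illustrated without a broker. The following is a minimal in-memory sketch, assuming a consumer that commits only after processing and crashes (or loses its partitions) before the commit lands. The class and method names are hypothetical and no Kafka client is involved.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory simulation of at-least-once redelivery; not real Kafka client code.
final class ReplaySketch {

    // Processes offsets [committed, crashAt) without committing them, then
    // "restarts" from the last committed offset and processes to the end of the log.
    // Returns every offset in the order its side effect ran.
    static List<Integer> run(int logSize, int committed, int crashAt) {
        List<Integer> processed = new ArrayList<>();
        for (int offset = committed; offset < crashAt; offset++) {
            processed.add(offset); // side effect happens, commit never does
        }
        // Crash or rebalance: the next owner resumes from the committed offset.
        for (int offset = committed; offset < logSize; offset++) {
            processed.add(offset);
        }
        return processed;
    }

    public static void main(String[] args) {
        // Offsets 2 and 3 were processed but not committed, so they are delivered again.
        System.out.println(run(5, 2, 4)); // prints [2, 3, 2, 3, 4]
    }
}
```

The repeated offsets are the "already-processed batch" the service sees; nothing was lost, which is the trade-off at-least-once delivery makes.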

4. Diagnostic steps

  1. For latency, check whether spikes are periodic (suspect GC) or sustained (suspect broker/EBS load). Look at client and broker GC logs and CloudWatch broker metrics.
  2. For ordering, confirm the key. Ordering is per-partition only, so two events that must stay ordered must share a key. Check the producer’s key selection.
  3. Check max.in.flight.requests.per.connection. A value greater than 1 without the idempotent producer can reorder records when a failed batch is retried (see Reliable Producing).
  4. For duplicates, determine if they follow a rebalance or a producer retry. Post-rebalance duplicates mean at-least-once redelivery, which is expected.
  5. Check consumer idempotency. If the consumer is not idempotent, duplicates will eventually cause double effects; that is the real defect.

| Step | Question it answers | Time cost |
| --- | --- | --- |
| 1. Latency shape | GC or load? | 2-3 min |
| 2. Key check | Correct partitioning for ordering? | 1-2 min |
| 3. In-flight setting | Can retries reorder? | 1 min |
| 4. Duplicate timing | Rebalance or retry origin? | 1-2 min |
| 5. Idempotency | Is the consumer safe against replays? | 2-3 min |
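
The key check in step 2 rests on the partition being derived from the key. The sketch below shows that idea with a simplified hash; it is an assumption for illustration only (Kafka's default partitioner hashes the serialized key bytes with murmur2, not the hash used here), and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Simplified stand-in for key-based partitioning. NOT Kafka's actual algorithm.
final class KeyPartitionSketch {

    // Same key and same partition count always yield the same partition,
    // which is what keeps one entity's events in a single ordered log.
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        int hash = Arrays.hashCode(keyBytes);
        return Math.floorMod(hash, numPartitions); // floorMod avoids negative results
    }

    public static void main(String[] args) {
        int partitions = 6;
        // Two events for the same customer map to the same partition.
        System.out.println(partitionFor("customer-42", partitions));
        System.out.println(partitionFor("customer-42", partitions));
        // A different partition count may map the same key elsewhere.
        System.out.println(partitionFor("customer-42", partitions + 1));
    }
}
```

The last call hints at why a partition-count change is a coordinated topic change: existing keys can start landing on different partitions.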

5. Safe remediations

| Situation | Safe action |
| --- | --- |
| GC-driven latency | Right-size JVM heap on the affected side; escalate broker GC to engineering |
| Broker/EBS load | Follow AWS-Layer Connectivity; escalate provisioning |
| Ordering from wrong key | Fix producer key selection so a logical entity maps to one partition |
| Reordering on retry | Enable the idempotent producer, or set max.in.flight to 1 |
| Duplicates | Make the consumer idempotent (dedup key / inbox table); do not try to eliminate all redelivery |
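
The two options for reordering on retry can be written down as producer settings. This is a minimal sketch using plain `java.util.Properties` with the standard Kafka producer config keys; the class and method names are hypothetical, and connection settings such as bootstrap servers and serializers are deliberately left out.

```java
import java.util.Properties;

// Sketch of producer settings only; how they reach the producer depends on the app's setup.
final class ProducerSettingsSketch {

    // Option 1: idempotent producer. Ordering is preserved across retries
    // with up to 5 in-flight requests, and acks=all is required.
    static Properties idempotentProducer() {
        Properties props = new Properties();
        props.put("enable.idempotence", "true");
        props.put("acks", "all");
        props.put("max.in.flight.requests.per.connection", "5");
        return props;
    }

    // Option 2: no idempotence, but only one request in flight,
    // so a retried batch cannot overtake a later one (at a throughput cost).
    static Properties singleInFlight() {
        Properties props = new Properties();
        props.put("enable.idempotence", "false");
        props.put("max.in.flight.requests.per.connection", "1");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(idempotentProducer());
        System.out.println(singleInFlight());
    }
}
```

Option 1 is usually the better default because it also removes producer-side retry duplicates within a session; neither option removes the need for an idempotent consumer.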

6. Escalation trigger

Page on-call engineering if:

  • Latency is driven by broker-side GC, overload, or EBS saturation you cannot resolve from the app.
  • Fixing ordering requires a partition-count or key change (a coordinated topic change).
  • Duplicates cause real financial or customer impact and the consumer cannot be made idempotent quickly.
  • The reordering stems from a broker or protocol issue rather than client config.

7. Relevant commands and exhibits

# GC-driven latency: client log during a long pause
Attempt to heartbeat failed since group is rebalancing
# ...after a multi-second stop-the-world GC

# Duplicate after rebalance: same offsets delivered twice
Assigned partitions [orders-1]
Processing OrderCreated orderId=1001 (offset 4471)   # again

// Idempotent consumer: the real fix for duplicates
@Transactional
public void handle(OrderCreated event) {
    if (inboxRepository.existsById(event.eventId())) {
        return; // already processed, safe to skip
    }
    inboxRepository.save(new ProcessedEvent(event.eventId()));
    // ...business logic
}

MSK CloudWatch: check broker CPU, and correlate latency spikes with GC activity and VolumeQueueLength.


8. Guided practical

Reproduce duplicates and ordering in the local lab.

  1. Produce events for the same customer with the customer id as key and confirm they land on one partition and stay ordered.
  2. Produce with a random/null key and observe the same customer’s events spread across partitions and interleave across consumers.
  3. Force a rebalance (start a second consumer) mid-processing and observe a small batch redelivered.
  4. Add the inbox-table dedup from Idempotency and Ordering and confirm the duplicate has no double effect.
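
Steps 3 and 4 can be rehearsed in memory before touching the lab. The sketch below assumes a `HashSet` standing in for the inbox table and a counter standing in for the business effect; all names are hypothetical. In the real consumer the inbox insert and the business change must commit in the same database transaction.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// In-memory stand-in for an inbox-table dedup; not a replacement for a transactional store.
final class InboxDedupSketch {
    private final Set<String> inbox = new HashSet<>(); // processed event ids
    private int charges = 0;                           // the side effect to protect

    // Returns true if the event was applied, false if it was skipped as a duplicate.
    boolean handle(String eventId) {
        if (!inbox.add(eventId)) {
            return false; // id already recorded, so this is a replay
        }
        charges++;
        return true;
    }

    int charges() {
        return charges;
    }

    public static void main(String[] args) {
        InboxDedupSketch consumer = new InboxDedupSketch();
        // e2 and e3 are redelivered after a simulated rebalance.
        List<String> delivered = Arrays.asList("e1", "e2", "e3", "e2", "e3");
        for (String eventId : delivered) {
            consumer.handle(eventId);
        }
        System.out.println(consumer.charges()); // prints 3
    }
}
```

Five deliveries produce three effects, which is the outcome to confirm in step 4.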

Next: Offset Problems.