Latency, Ordering, and Duplicates
Diagnose GC pauses and broker-wide slowness, out-of-order symptoms, and duplicate processing after a rebalance.
1. Symptom
Three related complaints that often arrive together: end-to-end latency spikes (an event takes seconds instead of milliseconds), records appear processed out of order, or the same event is processed twice (a customer charged twice, a duplicate notification). These are usually not bugs in Kafka but consequences of how the system is configured or how consumers handle rebalances.
The goal is to map each symptom to its real cause. “Out of order” and “duplicate” almost always trace back to keys, partitions, and at-least-once delivery, as covered in Idempotency and Ordering.
2. Likely causes
| Symptom | Likely cause |
|---|---|
| Latency spike | GC pauses on broker or client, broker overload, or EBS saturation (see AWS-Layer Connectivity) |
| Out of order | Records for one logical entity split across partitions (wrong key), or max.in.flight.requests.per.connection > 1 with retries and without the idempotent producer |
| Duplicates | At-least-once redelivery after a rebalance or retry, with a non-idempotent consumer |
| Duplicates after deploy | Offsets are committed only after processing, so a crash or rebalance before the commit replays the batch |
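
The retry-reordering cause in the table above can be hard to picture, so here is a minimal sketch of it. This is a toy model, not Kafka client code: the class `RetryReorderSketch` and the method `brokerLog` are hypothetical names, and the model assumes a non-idempotent producer where one batch fails its first attempt while later batches are already in flight.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of one producer connection; illustrative only, not Kafka client code. */
final class RetryReorderSketch {

    /**
     * Returns the order in which batches end up in the (simulated) partition log.
     * The batch at failFirstAttemptIndex fails once and is retried.
     * With maxInFlight = 1 nothing else is on the wire, so the retry keeps its place.
     * With maxInFlight > 1, up to maxInFlight - 1 later batches are already in flight
     * and are written before the retry lands.
     */
    static List<String> brokerLog(List<String> batches, int failFirstAttemptIndex, int maxInFlight) {
        List<String> log = new ArrayList<>();
        int i = 0;
        while (i < batches.size()) {
            if (i == failFirstAttemptIndex) {
                int end = Math.min(batches.size(), i + maxInFlight);
                // Batches sent concurrently with the failed one succeed first.
                for (int j = i + 1; j < end; j++) {
                    log.add(batches.get(j));
                }
                // The retried batch is appended after them.
                log.add(batches.get(i));
                i = end;
            } else {
                log.add(batches.get(i));
                i++;
            }
        }
        return log;
    }
}
```

In this model, `brokerLog(List.of("A", "B", "C"), 0, 1)` keeps the order A, B, C, while the same call with `maxInFlight` set to 2 yields B, A, C. The idempotent producer avoids this outcome because the broker rejects out-of-sequence batches.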
3. How it manifests to the Spring app
| Cause | What the service sees |
|---|---|
| GC pause | Periodic latency spikes; heartbeat failures in logs during long pauses |
| Wrong key | Same customer’s events processed on different consumers, interleaved |
| Rebalance replay | A batch of already-processed records delivered again after partitions assigned |
| Retry duplication | Producer resend created a duplicate that a non-idempotent consumer acts on twice |
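
The "wrong key" row above follows from how a key is mapped to a partition. The sketch below illustrates the idea under a loud assumption: Kafka's default partitioner hashes the key bytes with murmur2, whereas this example uses `Arrays.hashCode` purely to stay dependency-free, so the partition numbers it produces will not match a real cluster. `KeyPartitionSketch` and `partitionFor` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrates hash-based key partitioning; the hash is a stand-in, not Kafka's murmur2. */
final class KeyPartitionSketch {

    /** Maps a record key to a partition index in [0, numPartitions). */
    static int partitionFor(String key, int numPartitions) {
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        // floorMod keeps the result non-negative even when the hash is negative.
        return Math.floorMod(Arrays.hashCode(bytes), numPartitions);
    }
}
```

The point is determinism: every event keyed by the same customer id goes to the same partition and is therefore seen in order by one consumer. Keying by a per-event id, or sending a null key, removes that guarantee for the customer as a whole.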
4. Diagnostic steps
- For latency, check whether spikes are periodic (suspect GC) or sustained (suspect broker/EBS load). Look at client and broker GC logs and CloudWatch broker metrics.
- For ordering, confirm the key. Ordering is per-partition only, so two events that must stay ordered must share a key. Check the producer’s key selection.
- Check max.in.flight.requests.per.connection. Greater than 1 without the idempotent producer can reorder on retry (see Reliable Producing).
- For duplicates, determine if they follow a rebalance or a producer retry. Post-rebalance duplicates mean at-least-once redelivery, which is expected.
- Check consumer idempotency. If the consumer is not idempotent, duplicates will eventually cause double effects; that is the real defect.
| Step | Question it answers | Time cost |
|---|---|---|
| 1. Latency shape | GC or load? | 2-3 min |
| 2. Key check | Correct partitioning for ordering? | 1-2 min |
| 3. In-flight setting | Can retries reorder? | 1 min |
| 4. Duplicate timing | Rebalance or retry origin? | 1-2 min |
| 5. Idempotency | Is the consumer safe against replays? | 2-3 min |
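
Step 1 asks whether latency comes in isolated spikes or is sustained. A rough way to make that call from a window of samples is sketched below. Everything here is an assumption for illustration: the class `LatencyShapeSketch`, the threshold, and the `sustainedFraction` cut-off are hypothetical, and the result is only a hint about where to look next, not a diagnosis.

```java
import java.util.List;

/** Rough heuristic for the shape of a latency window; illustrative only. */
final class LatencyShapeSketch {

    enum Shape { NORMAL, SPIKY, SUSTAINED }

    /**
     * SPIKY: a minority of samples exceed the threshold (check GC logs first).
     * SUSTAINED: at least sustainedFraction of samples exceed it (check broker/EBS load first).
     */
    static Shape classify(List<Long> latenciesMs, long thresholdMs, double sustainedFraction) {
        if (latenciesMs.isEmpty()) {
            return Shape.NORMAL;
        }
        long slow = latenciesMs.stream().filter(ms -> ms > thresholdMs).count();
        if (slow == 0) {
            return Shape.NORMAL;
        }
        double fraction = (double) slow / latenciesMs.size();
        return fraction >= sustainedFraction ? Shape.SUSTAINED : Shape.SPIKY;
    }
}
```

A window with one 2500 ms sample among single-digit values classifies as SPIKY with a 1000 ms threshold. Confirm against GC logs before acting, since a single slow request can have other causes.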
5. Safe remediations
| Situation | Safe action |
|---|---|
| GC-driven latency | Right-size JVM heap on the affected side; escalate broker GC to engineering |
| Broker/EBS load | Follow AWS-Layer Connectivity; escalate provisioning |
| Ordering from wrong key | Fix producer key selection so a logical entity maps to one partition |
| Reordering on retry | Enable the idempotent producer (enable.idempotence=true), or set max.in.flight.requests.per.connection to 1 |
| Duplicates | Make the consumer idempotent (dedup key / inbox table); do not try to eliminate all redelivery |
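
For the "reordering on retry" row, the producer settings involved can be sketched as plain properties. The configuration keys are standard Kafka producer settings; the class `ProducerConfigSketch`, the method name, and the bootstrap address used in the usage note are placeholders. Serializer settings are omitted for brevity and would be required in a real producer.

```java
import java.util.Properties;

/** Producer settings that preserve per-partition order across retries; a sketch, not a full config. */
final class ProducerConfigSketch {

    static Properties orderedProducerProps(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        // Idempotence lets the broker reject duplicate or out-of-sequence batches on retry.
        props.put("enable.idempotence", "true");
        // Idempotence requires acks=all.
        props.put("acks", "all");
        // With idempotence enabled this must be 5 or lower; ordering is still preserved.
        props.put("max.in.flight.requests.per.connection", "5");
        return props;
    }
}
```

A call such as `orderedProducerProps("localhost:9092")` returns properties that could be passed to a `KafkaProducer` once serializers are added. If idempotence cannot be enabled, setting `max.in.flight.requests.per.connection` to 1 is the fallback from the table, at some cost in throughput.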
6. Escalation trigger
Page on-call engineering if:
- Latency is driven by broker-side GC, overload, or EBS saturation you cannot resolve from the app.
- Fixing ordering requires a partition-count or key change (a coordinated topic change).
- Duplicates cause real financial or customer impact and the consumer cannot be made idempotent quickly.
- The reordering stems from a broker or protocol issue rather than client config.
7. Relevant commands and exhibits

    # GC-driven latency: client log during a long pause
    Attempt to heartbeat failed since group is rebalancing
    # ...after a multi-second stop-the-world GC

    # Duplicate after rebalance: same offsets delivered twice
    Assigned partitions [orders-1]
    Processing OrderCreated orderId=1001 (offset 4471) # again

    // Idempotent consumer: the real fix for duplicates
    @Transactional
    public void handle(OrderCreated event) {
        if (inboxRepository.existsById(event.eventId())) {
            return; // already processed, safe to skip
        }
        inboxRepository.save(new ProcessedEvent(event.eventId()));
        // ...business logic
    }

In MSK CloudWatch, watch broker CPU and correlate latency spikes with GC activity and VolumeQueueLength.
8. Guided practical
Reproduce the duplicate and ordering symptoms in the local lab.
- Produce events for the same customer with the customer id as key and confirm they land on one partition and stay ordered.
- Produce with a random/null key and observe the same customer’s events spread across partitions and interleave across consumers.
- Force a rebalance (start a second consumer) mid-processing and observe a small batch redelivered.
- Add the inbox-table dedup from Idempotency and Ordering and confirm the duplicate has no double effect.
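
Before running the last two steps against the lab cluster, the expected outcome can be checked with a small in-memory model. This is a sketch under stated assumptions: a `HashSet` stands in for the inbox table, a list of strings stands in for real side effects, and `ReplayDedupSketch` with its methods is a hypothetical name, not part of the lab code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** In-memory model of an idempotent consumer receiving a replayed batch. */
final class ReplayDedupSketch {

    private final Set<String> inbox = new HashSet<>();      // stand-in for the inbox table
    private final List<String> effects = new ArrayList<>(); // side effects actually performed

    /** Returns true if the event was processed, false if it was skipped as a duplicate. */
    boolean handle(String eventId) {
        // Set.add returns false when the id is already recorded.
        if (!inbox.add(eventId)) {
            return false;
        }
        effects.add("charged:" + eventId);
        return true;
    }

    List<String> effects() {
        return effects;
    }

    public static void main(String[] args) {
        ReplayDedupSketch consumer = new ReplayDedupSketch();
        List<String> batch = List.of("evt-1001", "evt-1002");
        batch.forEach(consumer::handle); // first delivery, offsets not yet committed
        batch.forEach(consumer::handle); // rebalance: the same batch is delivered again
        System.out.println(consumer.effects()); // prints [charged:evt-1001, charged:evt-1002]
    }
}
```

Each event produces exactly one effect even though the batch arrives twice, which is the behavior to confirm in the lab. In a real service the inbox write and the business logic need to share one database transaction, as in the `@Transactional` handler shown earlier.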
Next: Offset Problems.