Consumer Lag and Stuck Consumers

Diagnose climbing lag, absent or evicted consumers, slow handlers, and max.poll.interval.ms breaches in a Spring Boot consumer.


1. Symptom

A CloudWatch alarm or PagerDuty alert fires with something like MaxOffsetLag > 100000 for payment-service (5 min sustained), or a downstream team pages you that “payments are delayed.” Consumer lag, the distance between the group's committed offset and the log end offset (introduced in Observability), keeps climbing and does not recover.

This is the most common Kafka alert you will triage. The goal is to answer one question fast: is this no consumers, slow consumers, or load exceeding capacity?


2. Likely causes

Broker or cluster side

| Cause | How it manifests |
| --- | --- |
| Producer traffic spike | Log end offset climbing far faster than usual; consumers healthy but outpaced |
| Partitions skewed to one key | Lag concentrated on one or two partitions, others at zero |

Application side (Spring Boot)

| Cause | How it manifests |
| --- | --- |
| Slow listener handler | Records consumed but each takes too long; lag grows steadily |
| Handler exceeds max.poll.interval.ms | Consumer evicted mid-batch, triggering repeated rebalances |
| Consumer instances crashed or scaled to zero | CONSUMER-ID missing; lag climbs with no one reading |
| Too little concurrency for the partition count | Fewer consumer threads than partitions, so the throughput ceiling is too low |

3. How it manifests to the Spring app

| Cause | What the service sees |
| --- | --- |
| Slow handler | Listener logs steady progress but throughput below produce rate |
| Poll interval breach | "Attempt to heartbeat failed" and "Member ... rejoining group" log lines; repeated partition assignment |
| Crashed instances | No listener activity; /actuator/health down or pod missing |
| Traffic spike | Handler healthy, latency normal, backlog just large |

4. Diagnostic steps

Work top to bottom, cheapest first. Stop when you have a confident diagnosis.

  1. Describe the group: kafka-consumer-groups.sh --bootstrap-server $BROKER --describe --group payment-service. Note LAG, CONSUMER-ID, and which partitions are behind.
    • No CONSUMER-ID on partitions: consumers are absent, go to the app.
    • CONSUMER-ID present but LAG climbing: slow or outpaced, keep going.
  2. Check per-partition skew. If one partition holds all the lag, suspect a hot key (see Idempotency and Ordering), not overall slowness.
  3. Check the app logs for rejoining group / heartbeat failures: a max.poll.interval.ms breach means the handler is too slow per batch.
  4. Check /actuator/health and instance count against expected. Missing instances explain absent consumers.
  5. Compare produce rate to consume rate over the last few hours in CloudWatch (BytesInPerSec against the consume-side rate). A spike with healthy consumers is a capacity issue.

| Step | Question it answers | Time cost |
| --- | --- | --- |
| 1. Describe group | Anyone consuming, how far behind? | seconds |
| 2. Partition skew | Whole group or one hot partition? | seconds |
| 3. App logs | Failing/evicted or just slow? | 1-2 min |
| 4. Health + count | Are instances actually attached? | 1 min |
| 5. Rate trend | Spike or sustained regression? | 2-3 min |
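
The first two steps can be sketched as a small decision function over rows copied from the describe output. This is a minimal illustration, not a tool that ships with Kafka or Spring: the `LagTriage` class, the `PartitionRow` record, and the `hotShare` threshold are all hypothetical names and values chosen for the example.

```java
import java.util.List;

// Hypothetical helper: classifies rows copied from the describe output.
final class LagTriage {

    // One row of describe output; "-" in CONSUMER-ID means no member owns the partition.
    record PartitionRow(int partition, long currentOffset, long logEndOffset, String consumerId) {
        long lag() {
            return Math.max(0, logEndOffset - currentOffset);
        }
    }

    enum Diagnosis { NO_LAG, ABSENT_CONSUMERS, HOT_PARTITION, SLOW_OR_OUTPACED }

    // hotShare is an assumed threshold: the fraction of total lag on a single
    // partition above which a hot key is suspected rather than overall slowness.
    static Diagnosis classify(List<PartitionRow> rows, double hotShare) {
        long total = rows.stream().mapToLong(PartitionRow::lag).sum();
        if (total == 0) {
            return Diagnosis.NO_LAG;
        }
        boolean unowned = rows.stream()
                .anyMatch(r -> r.lag() > 0 && "-".equals(r.consumerId()));
        if (unowned) {
            return Diagnosis.ABSENT_CONSUMERS;   // step 1: go to the app
        }
        long max = rows.stream().mapToLong(PartitionRow::lag).max().orElse(0);
        if (rows.size() > 1 && (double) max / total >= hotShare) {
            return Diagnosis.HOT_PARTITION;      // step 2: suspect a hot key
        }
        return Diagnosis.SLOW_OR_OUTPACED;       // steps 3 to 5: logs, health, rates
    }

    public static void main(String[] args) {
        List<PartitionRow> rows = List.of(
                new PartitionRow(0, 10432, 98765, "-"),
                new PartitionRow(1, 9876, 12001, "consumer-2"));
        System.out.println(classify(rows, 0.9)); // prints ABSENT_CONSUMERS
    }
}
```

The ordering mirrors the checklist: an unowned partition with lag short-circuits everything else, because no amount of handler tuning helps when nothing is reading.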

5. Safe remediations

| Situation | Safe action |
| --- | --- |
| Fewer instances than expected | Restart or redeploy consumers; confirm CONSUMER-ID returns and lag falls |
| Healthy but outpaced, downstream has headroom | Scale out consumers up to the partition count, or raise concurrency |
| Handler breaching poll interval | Lower max.poll.records, or move slow work off the listener thread; do not just raise the timeout |
| Hot partition | Fix the key or partition count (a topic change): plan it, do not hot-patch |

Scaling and restarting are your safe levers. Changing partition count, keys, or resetting offsets is escalation territory.
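
To see why lowering max.poll.records is the lever for a poll interval breach, it helps to do the arithmetic: a batch must finish processing before max.poll.interval.ms elapses, so the batch size times the per-record handler time has to fit inside that window. The sketch below assumes the Kafka consumer defaults (max.poll.interval.ms of 300000 and max.poll.records of 500) and a handler that takes about 5 seconds per record; the `PollBudget` class and its `safetyFactor` margin are hypothetical, not Kafka or Spring settings.

```java
import java.util.Properties;

// Hypothetical helper: sizes max.poll.records so one batch fits inside max.poll.interval.ms.
final class PollBudget {

    // Largest batch whose processing time stays within a fraction of the poll interval.
    // safetyFactor (for example 0.5) is an assumed margin for slow outliers.
    static int safeMaxPollRecords(long maxPollIntervalMs, long perRecordMs, double safetyFactor) {
        if (perRecordMs <= 0) {
            throw new IllegalArgumentException("perRecordMs must be positive");
        }
        long budgetMs = (long) (maxPollIntervalMs * safetyFactor);
        return (int) Math.max(1, budgetMs / perRecordMs);
    }

    // Builds an override using the standard consumer property name.
    // Only the batch size changes; the timeout itself is left alone.
    static Properties consumerOverrides(long maxPollIntervalMs, long perRecordMs) {
        Properties props = new Properties();
        props.setProperty("max.poll.records",
                Integer.toString(safeMaxPollRecords(maxPollIntervalMs, perRecordMs, 0.5)));
        return props;
    }

    public static void main(String[] args) {
        // A default-sized batch at 5 s per record needs 2,500,000 ms, far beyond 300,000 ms.
        long defaultBatchMs = 500 * 5_000L;
        System.out.println("default batch takes " + defaultBatchMs + " ms");
        System.out.println(consumerOverrides(300_000L, 5_000L)); // prints {max.poll.records=30}
    }
}
```

With these numbers a full default batch would take roughly 41 minutes, so the consumer is evicted long before it polls again. Capping the batch at 30 records keeps each poll cycle near 150 seconds, half the interval.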


6. Escalation trigger

Stop and page on-call engineering (per Escalation and Communication) if:

  • Lag keeps growing 20 to 30 minutes after your diagnostic pass, with consumers attached, healthy, and not obviously slow.
  • The fix needs a partition-count or key change, or an offset reset.
  • A downstream dependency outage (DB, external API) is the real cause: page that team in parallel.
  • Restarting and scaling do not restore consumer count or reduce lag.

7. Relevant commands and exhibits

# Lag, current/end offset, and consumer id per partition
kafka-consumer-groups.sh --bootstrap-server $BROKER --describe --group payment-service

# Absent consumers: no CONSUMER-ID, lag climbing
GROUP            TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG    CONSUMER-ID
payment-service  orders  0          10432           98765           88333  -

# Slow consumers: attached but behind
GROUP            TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG    CONSUMER-ID
payment-service  orders  1          9876            12001           2125   consumer-2

# App log signature of a max.poll.interval.ms breach
Member consumer-2 ... rejoining group payment-service
Attempt to heartbeat failed since group is rebalancing

MSK CloudWatch metrics to watch: MaxOffsetLag, SumOffsetLag, and EstimatedMaxTimeLag.
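
As a rough illustration of how the offset metrics relate to the per-partition rows, the sketch below recomputes lag from the two exhibit rows and aggregates it. It assumes MaxOffsetLag and SumOffsetLag correspond to the maximum and the sum of partition lag for a topic, which is how the MSK documentation describes them; treating the two exhibit rows as one topic snapshot is purely for the example, and the `LagMetrics` class is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: per-partition lag and the two aggregates worth alerting on.
final class LagMetrics {

    // Lag is the log end offset minus the group's committed (current) offset.
    static long lag(long currentOffset, long logEndOffset) {
        return Math.max(0, logEndOffset - currentOffset);
    }

    // Worst single partition: the shape of a MaxOffsetLag-style metric.
    static long maxLag(Map<Integer, Long> lagByPartition) {
        return lagByPartition.values().stream().mapToLong(Long::longValue).max().orElse(0);
    }

    // Total backlog across partitions: the shape of a SumOffsetLag-style metric.
    static long sumLag(Map<Integer, Long> lagByPartition) {
        return lagByPartition.values().stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        Map<Integer, Long> lagByPartition = new LinkedHashMap<>();
        lagByPartition.put(0, lag(10432, 98765)); // 88333
        lagByPartition.put(1, lag(9876, 12001));  // 2125
        System.out.println("max=" + maxLag(lagByPartition)); // prints max=88333
        System.out.println("sum=" + sumLag(lagByPartition)); // prints sum=90458
    }
}
```

A maximum close to the sum, as here, is the numeric signature of lag concentrated on one partition.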


8. Guided practical

Reproduce a scaled-down backlog in the local lab.

  1. Start the Payment consumer from First Producer and Consumer with concurrency: 1 and a Thread.sleep(5000) in the handler to simulate a slow downstream call.
  2. Produce 30 OrderCreated records quickly.
  3. Run kafka-consumer-groups.sh --bootstrap-server $BROKER --describe --group payment-service and confirm LAG sitting high with one CONSUMER-ID.
  4. Remove the sleep and raise concurrency to 3, restart, and watch lag drain.
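
Before running the lab it can help to estimate how long the backlog should take to drain, so you know what "sitting high" looks like. The sketch below is a back-of-the-envelope model, assuming records are spread evenly across partitions, no new records arrive, and each listener thread processes one record at a time. The `DrainEstimate` class is hypothetical, and the partition count of 3 is an assumption about the lab topic.

```java
// Hypothetical helper: rough drain time for a fixed backlog.
final class DrainEstimate {

    // Threads beyond the partition count sit idle, so effective workers are capped.
    static long drainMillis(int backlog, long handlerMs, int concurrency, int partitions) {
        int workers = Math.max(1, Math.min(concurrency, partitions));
        long perWorker = (backlog + workers - 1) / workers; // ceiling division
        return perWorker * handlerMs;
    }

    public static void main(String[] args) {
        // Step 1 setup: 30 records, 5 s sleep, concurrency 1.
        System.out.println(drainMillis(30, 5_000L, 1, 3)); // prints 150000
        // Same slow handler with concurrency 3 on an assumed 3 partitions.
        System.out.println(drainMillis(30, 5_000L, 3, 3)); // prints 50000
        // Concurrency 6 gains nothing over 3 when there are only 3 partitions.
        System.out.println(drainMillis(30, 5_000L, 6, 3)); // prints 50000
    }
}
```

With the sleep in place and a single thread, expect lag to take about two and a half minutes to clear, which leaves enough time to run the describe command and see it.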

Next: Under-Replicated and Offline Partitions.