Consumer Lag and Stuck Consumers
Diagnose climbing lag, absent or evicted consumers, slow handlers, and max.poll.interval.ms breaches in a Spring Boot consumer.
1. Symptom
A CloudWatch alarm or PagerDuty alert fires with something like MaxOffsetLag > 100000 for payment-service (5 min sustained), or a downstream team pages you that “payments are delayed.” Consumer lag (the distance between the group's committed offset and the log end offset, as covered in Observability) keeps climbing and does not recover.
This is the most common Kafka alert you will triage. The goal is to answer one question fast: are there no consumers, are the consumers slow, or is load exceeding capacity?
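As a worked example of the lag definition above, here is a minimal Java sketch of the arithmetic. The class and method names are hypothetical (they are not part of any Kafka API), and the offsets are the ones shown in the exhibits in section 7. The max and sum mirror, conceptually, what MaxOffsetLag and SumOffsetLag report.

```java
// Minimal sketch: lag arithmetic for a consumer group.
// LagMath and its methods are hypothetical names, not a Kafka API.
final class LagMath {

    /** Lag for one partition: log end offset minus committed offset, never negative. */
    static long lag(long committedOffset, long logEndOffset) {
        return Math.max(0L, logEndOffset - committedOffset);
    }

    /** Largest single-partition lag; both arrays are indexed by partition. */
    static long maxLag(long[] committed, long[] logEnd) {
        long max = 0L;
        for (int i = 0; i < committed.length; i++) {
            max = Math.max(max, lag(committed[i], logEnd[i]));
        }
        return max;
    }

    /** Total lag across all partitions. */
    static long sumLag(long[] committed, long[] logEnd) {
        long sum = 0L;
        for (int i = 0; i < committed.length; i++) {
            sum += lag(committed[i], logEnd[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Offsets for partitions 0 and 1 of the orders topic, from the exhibits.
        long[] committed = {10432L, 9876L};
        long[] logEnd = {98765L, 12001L};
        System.out.println("max lag = " + maxLag(committed, logEnd));
        System.out.println("sum lag = " + sumLag(committed, logEnd));
    }
}
```

With those two partitions the per-partition lags are 88333 and 2125, so the maximum is 88333 and the sum is 90458.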
2. Likely causes
Broker or cluster side
| Cause | How it manifests |
|---|---|
| Producer traffic spike | Log end offset climbing far faster than usual; consumers healthy but outpaced |
| Traffic skewed to one key (hot partition) | Lag concentrated on one or two partitions, others at zero |
Application side (Spring Boot)
| Cause | How it manifests |
|---|---|
| Slow listener handler | Records consumed but each takes too long; lag grows steadily |
| Handler exceeds max.poll.interval.ms | Consumer evicted mid-batch, triggering repeated rebalances |
| Consumer instances crashed or scaled to zero | CONSUMER-ID missing; lag climbs with no one reading |
| Too little concurrency for the partition count | Fewer consumer threads than partitions, ceiling too low |
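To make the max.poll.interval.ms row concrete, the sketch below estimates whether a full poll batch can finish inside the poll interval. It assumes records from one poll are handled sequentially on the listener thread and uses the stock Kafka client defaults as I understand them (max.poll.records=500, max.poll.interval.ms=300000); check your own configuration before relying on either number. The class name and the headroom percentage are hypothetical.

```java
// Minimal sketch: does a worst-case batch fit inside max.poll.interval.ms?
// PollBudget is a hypothetical helper, not part of Kafka or Spring.
final class PollBudget {

    /** Worst-case time to work through one poll batch, in milliseconds. */
    static long worstCaseBatchMillis(int maxPollRecords, long perRecordMillis) {
        return (long) maxPollRecords * perRecordMillis;
    }

    /** True when a full batch would take longer than the poll interval allows. */
    static boolean breachesPollInterval(int maxPollRecords, long perRecordMillis,
                                        long maxPollIntervalMs) {
        return worstCaseBatchMillis(maxPollRecords, perRecordMillis) > maxPollIntervalMs;
    }

    /**
     * Largest batch size that keeps a full batch within headroomPercent of the
     * poll interval. Always returns at least 1.
     */
    static int safeMaxPollRecords(long perRecordMillis, long maxPollIntervalMs,
                                  int headroomPercent) {
        if (perRecordMillis <= 0) {
            throw new IllegalArgumentException("perRecordMillis must be positive");
        }
        long budgetMillis = maxPollIntervalMs * headroomPercent / 100;
        return (int) Math.max(1L, budgetMillis / perRecordMillis);
    }

    public static void main(String[] args) {
        // 500 records at 5 s each is far beyond a 300 s interval.
        System.out.println(breachesPollInterval(500, 5000L, 300_000L));
        // Batch size that uses at most half of the interval at 5 s per record.
        System.out.println(safeMaxPollRecords(5000L, 300_000L, 50));
    }
}
```

At 5 seconds per record, a 500-record batch needs 2,500,000 ms against a 300,000 ms interval, so the consumer would be evicted; a batch of 30 uses about half the interval.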
3. How it manifests to the Spring app
| Cause | What the service sees |
|---|---|
| Slow handler | Listener logs steady progress but throughput below produce rate |
| Poll interval breach | “Attempt to heartbeat failed” and “Member ... rejoining group” log lines; repeated partition assignment |
| Crashed instances | No listener activity; /actuator/health down or pod missing |
| Traffic spike | Handler healthy, latency normal, backlog just large |
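The poll-interval signature in the table above can be checked mechanically. The following sketch counts the two log fragments named there in a list of log lines. The class name and the threshold are hypothetical, and the same lines can also appear during an ordinary rebalance (for example a deploy), so a single hit is not proof of a breach.

```java
import java.util.List;

// Minimal sketch: count rebalance-related log signatures in application logs.
// PollBreachLogScan is a hypothetical helper; the two fragments come from this runbook.
final class PollBreachLogScan {

    static final String HEARTBEAT_FAILED = "Attempt to heartbeat failed";
    static final String REJOINING = "rejoining group";

    /** Number of lines containing either signature. */
    static int countSignatures(List<String> lines) {
        int hits = 0;
        for (String line : lines) {
            if (line.contains(HEARTBEAT_FAILED) || line.contains(REJOINING)) {
                hits++;
            }
        }
        return hits;
    }

    /** Treat repeated hits as a likely rebalance loop; the threshold is an assumption. */
    static boolean looksLikeRebalanceLoop(List<String> lines, int threshold) {
        return countSignatures(lines) >= threshold;
    }
}
```

Feed it the last few minutes of listener logs; a count that keeps growing between runs points at repeated eviction rather than a one-off rebalance.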
4. Diagnostic steps
Work top to bottom, cheapest first. Stop when you have a confident diagnosis.
- Describe the group: kafka-consumer-groups --describe --group payment-service. Note LAG, CONSUMER-ID, and which partitions are behind.
  - No CONSUMER-ID on partitions: consumers are absent, go to the app.
  - CONSUMER-ID present but LAG climbing: slow or outpaced, keep going.
- Check per-partition skew. If one partition holds all the lag, suspect a hot key (see Idempotency and Ordering), not overall slowness.
- Check the app logs for rejoining group / heartbeat failures: a max.poll.interval.ms breach means the handler is too slow per batch.
- Check /actuator/health and instance count against expected. Missing instances explain absent consumers.
- Compare produce rate to consume rate over the last few hours in CloudWatch (BytesInPerSec vs consume rate). A spike with healthy consumers is a capacity issue.
| Step | Question it answers | Time cost |
|---|---|---|
| 1. Describe group | Anyone consuming, how far behind? | seconds |
| 2. Partition skew | Whole group or one hot partition? | seconds |
| 3. App logs | Failing/evicted or just slow? | 1-2 min |
| 4. Health + count | Are instances actually attached? | 1 min |
| 5. Rate trend | Spike or sustained regression? | 2-3 min |
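The first two steps amount to a small decision rule over the describe output, sketched below. All type names, the enum values, and the hot-partition share threshold are hypothetical choices for illustration; the rule only encodes what the steps above say (no CONSUMER-ID means absent consumers, lag on one partition means a hot key, otherwise slow or outpaced).

```java
import java.util.List;

// Minimal sketch of the triage rule from steps 1 and 2.
// LagTriage, PartitionLag, and Diagnosis are hypothetical names.
final class LagTriage {

    enum Diagnosis { NO_LAG, ABSENT_CONSUMERS, HOT_PARTITION, SLOW_OR_OUTPACED }

    static final class PartitionLag {
        final int partition;
        final long lag;
        final String consumerId; // "-" when no consumer is attached

        PartitionLag(int partition, long lag, String consumerId) {
            this.partition = partition;
            this.lag = lag;
            this.consumerId = consumerId;
        }

        boolean unassigned() {
            return consumerId == null || consumerId.equals("-");
        }
    }

    /** hotShare is the fraction of total lag on one partition that counts as "hot". */
    static Diagnosis classify(List<PartitionLag> rows, double hotShare) {
        long total = 0L;
        long max = 0L;
        boolean laggingWithoutConsumer = false;
        for (PartitionLag row : rows) {
            total += row.lag;
            max = Math.max(max, row.lag);
            if (row.unassigned() && row.lag > 0) {
                laggingWithoutConsumer = true;
            }
        }
        if (total == 0L) {
            return Diagnosis.NO_LAG;
        }
        if (laggingWithoutConsumer) {
            return Diagnosis.ABSENT_CONSUMERS;
        }
        if (rows.size() > 1 && (double) max / total >= hotShare) {
            return Diagnosis.HOT_PARTITION;
        }
        return Diagnosis.SLOW_OR_OUTPACED;
    }
}
```

SLOW_OR_OUTPACED is deliberately left as one bucket: telling a slow handler from a traffic spike still needs steps 3 to 5.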
5. Safe remediations
| Situation | Safe action |
|---|---|
| Fewer instances than expected | Restart or redeploy consumers; confirm CONSUMER-ID returns and lag falls |
| Healthy but outpaced, downstream has headroom | Scale out consumers up to the partition count, or raise concurrency |
| Handler breaching poll interval | Lower max.poll.records, or move slow work off the listener thread; do not just raise the timeout |
| Hot partition | Fix the key or partition count (a topic change): plan it, do not hot-patch |
Scaling and restarting are your safe levers. Changing partition count, keys, or resetting offsets is escalation territory.
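The reason scaling stops at the partition count is that, within one consumer group, a partition is read by at most one consumer at a time. A rough sketch of that ceiling follows; the class name is hypothetical and it assumes a single topic and an even spread of listener threads across instances.

```java
// Minimal sketch: how many consumer threads can actually do work.
// ScaleOutCeiling is a hypothetical helper for capacity reasoning.
final class ScaleOutCeiling {

    /** Threads that receive at least one partition. */
    static int activeConsumers(int instances, int concurrencyPerInstance, int partitions) {
        return Math.min(instances * concurrencyPerInstance, partitions);
    }

    /** Threads that sit idle because there are no partitions left to assign. */
    static int idleConsumers(int instances, int concurrencyPerInstance, int partitions) {
        return Math.max(0, instances * concurrencyPerInstance - partitions);
    }

    public static void main(String[] args) {
        // 4 instances x concurrency 3 against a 6-partition topic.
        System.out.println(activeConsumers(4, 3, 6)); // 6 threads working
        System.out.println(idleConsumers(4, 3, 6));   // 6 threads idle
    }
}
```

If idleConsumers is already above zero, adding instances or concurrency will not reduce lag; the remaining levers are the handler itself or a planned partition change.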
6. Escalation trigger
Stop and page on-call engineering (per Escalation and Communication) if:
- Lag keeps growing 20 to 30 minutes after your diagnostic pass, with consumers attached, healthy, and not obviously slow.
- The fix needs a partition-count or key change, or an offset reset.
- A downstream dependency outage (DB, external API) is the real cause: page that team in parallel.
- Restarting and scaling do not restore consumer count or reduce lag.
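The first trigger is easier to apply consistently if "keeps growing" is checked against numbers. A minimal sketch, assuming you note a lag sample every few minutes across the 20 to 30 minute window; the class name and the tolerance value are hypothetical.

```java
// Minimal sketch: is lag still growing across an observation window?
// LagTrend is a hypothetical helper; samples are ordered oldest to newest.
final class LagTrend {

    /** Change in lag from the first sample to the last; 0 if fewer than two samples. */
    static long netGrowth(long[] samples) {
        if (samples.length < 2) {
            return 0L;
        }
        return samples[samples.length - 1] - samples[0];
    }

    /** True when lag rose by more than the tolerance over the window. */
    static boolean stillGrowing(long[] samples, long tolerance) {
        return netGrowth(samples) > tolerance;
    }
}
```

The tolerance keeps small fluctuations from counting as growth; pick it relative to your normal lag noise.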
7. Relevant commands and exhibits
# Lag, current/end offset, and consumer id per partition
kafka-consumer-groups.sh --bootstrap-server $BROKER --describe --group payment-service
# Absent consumers: no CONSUMER-ID, lag climbing
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID
payment-service orders 0 10432 98765 88333 -
# Slow consumers: attached but behind
payment-service orders 1 9876 12001 2125 consumer-2
# App log signature of a max.poll.interval.ms breach
Member consumer-2 ... rejoining group payment-service
Attempt to heartbeat failed since group is rebalancing
MSK CloudWatch metrics to watch: MaxOffsetLag, SumOffsetLag, and EstimatedMaxTimeLag.
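If you want to feed the describe output into a script, the rows in the exhibit above can be parsed by splitting on whitespace. This is a sketch under the assumption that the output has the seven columns shown in the exhibit, in that order; real output may carry extra trailing columns, which this parser ignores, and may print "-" for offsets when nothing has been committed, which it skips. The class names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: parse rows of kafka-consumer-groups --describe output.
// Assumes the column order shown in the exhibit; DescribeOutputParser is hypothetical.
final class DescribeOutputParser {

    static final class Row {
        final String group;
        final String topic;
        final int partition;
        final long currentOffset;
        final long logEndOffset;
        final long lag;
        final String consumerId;

        Row(String[] cols) {
            this.group = cols[0];
            this.topic = cols[1];
            this.partition = Integer.parseInt(cols[2]);
            this.currentOffset = Long.parseLong(cols[3]);
            this.logEndOffset = Long.parseLong(cols[4]);
            this.lag = Long.parseLong(cols[5]);
            this.consumerId = cols[6];
        }

        /** False when the CONSUMER-ID column is "-", meaning nobody is attached. */
        boolean hasConsumer() {
            return !"-".equals(consumerId);
        }
    }

    static List<Row> parse(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            // Skip blanks, comment lines, and the header row.
            if (line.isEmpty() || line.startsWith("#") || line.startsWith("GROUP")) {
                continue;
            }
            String[] cols = line.split("\\s+");
            if (cols.length < 7) {
                continue;
            }
            try {
                rows.add(new Row(cols));
            } catch (NumberFormatException e) {
                // Offsets printed as "-" (nothing committed yet): skip the row.
            }
        }
        return rows;
    }
}
```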
8. Guided practical
Reproduce a scaled-down backlog in the local lab.
- Start the Payment consumer from First Producer and Consumer with concurrency: 1 and a Thread.sleep(5000) in the handler to simulate a slow downstream call.
- Produce 30 OrderCreated records quickly.
- Run kafka-consumer-groups --describe --group payment-service and confirm LAG sitting high with one CONSUMER-ID.
- Remove the sleep and raise concurrency to 3, restart, and watch lag drain.
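Before running the lab it helps to know roughly how long the backlog should take to drain, so you can tell expected slowness from a stuck consumer. The sketch below is a back-of-the-envelope estimate that assumes no new records arrive, records are spread evenly across partitions, and each thread handles one record at a time. The class name is hypothetical, and the three-partition topic is an assumption about the lab setup.

```java
// Minimal sketch: estimated time to drain a fixed backlog.
// LabDrainEstimate is a hypothetical helper for the guided practical.
final class LabDrainEstimate {

    /** Seconds until the busiest thread finishes its share of the backlog. */
    static long drainSeconds(int backlog, long perRecordMillis, int threads, int partitions) {
        int active = Math.min(threads, partitions);
        if (active <= 0) {
            throw new IllegalArgumentException("need at least one thread and one partition");
        }
        // Ceiling division: records handled by the busiest thread.
        long perThread = (backlog + active - 1) / active;
        return perThread * perRecordMillis / 1000L;
    }

    public static void main(String[] args) {
        // Lab baseline: 30 records, 5 s sleep, concurrency 1.
        System.out.println(drainSeconds(30, 5000L, 1, 3));
        // Same sleep kept, concurrency 3 over 3 partitions.
        System.out.println(drainSeconds(30, 5000L, 3, 3));
    }
}
```

The baseline works out to 150 seconds, and to 50 seconds if the sleep were kept with three threads over three partitions. Since 150 seconds is below a 300000 ms poll interval (the client default, if you have not changed it), the lab should show steady slow-handler lag rather than eviction.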