Kafka Consumer Lag and Stuck Consumers: Incident Playbook

1. Symptom

A CloudWatch alarm or PagerDuty alert fires with something like MaxOffsetLag > 100000 for payment-service (5 min sustained), or a downstream team pages you that “payments are delayed.” Consumer lag (the distance between the committed offset and the log end offset, as covered in Observability) keeps climbing and does not recover.

This is the most common Kafka alert you will triage. The goal is to answer one question fast: is this no consumers, slow consumers, or load exceeding capacity?
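For reference, the lag arithmetic behind the alert can be sketched in a few lines. This is a minimal illustration, not how MSK computes its metrics internally: the class and method names are hypothetical, and it only assumes that a max-lag metric tracks the worst partition while a sum-lag metric adds all partitions together.

```java
import java.util.Map;

/** Illustrative lag arithmetic. Class and method names are hypothetical. */
final class LagMath {

    /** Lag for one partition: log end offset minus committed offset, floored at zero. */
    static long lag(long logEndOffset, long committedOffset) {
        return Math.max(0L, logEndOffset - committedOffset);
    }

    /** Largest per-partition lag, the kind of value a max-lag alarm watches. */
    static long maxLag(Map<Integer, Long> lagByPartition) {
        return lagByPartition.values().stream().mapToLong(Long::longValue).max().orElse(0L);
    }

    /** Total lag across all partitions. */
    static long sumLag(Map<Integer, Long> lagByPartition) {
        return lagByPartition.values().stream().mapToLong(Long::longValue).sum();
    }
}
```

With a committed offset of 10432 and a log end offset of 98765, `LagMath.lag(98765, 10432)` gives 88333, which matches the LAG column that kafka-consumer-groups prints for that partition.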

2. Likely causes

Broker or cluster side

Cause	How it manifests
Producer traffic spike	Log end offset climbing far faster than usual; consumers healthy but outpaced
Traffic skewed to one key (hot partition)	Lag concentrated on one or two partitions, others at zero
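One rough way to tell "lag concentrated on one or two partitions" from "everything is behind" is to look at how much of the total lag sits on the worst partition. The sketch below is an assumption-laden heuristic: the class name, method names, and the threshold are hypothetical and should be tuned to your topic.

```java
import java.util.Map;

/** Heuristic skew check. Names and threshold are hypothetical. */
final class LagSkew {

    /** Fraction of total lag held by the single worst partition (0.0 when there is no lag). */
    static double hottestShare(Map<Integer, Long> lagByPartition) {
        long total = 0L;
        long max = 0L;
        for (long lag : lagByPartition.values()) {
            total += lag;
            max = Math.max(max, lag);
        }
        return total == 0L ? 0.0 : (double) max / total;
    }

    /** True when the topic has several partitions and one of them holds most of the lag. */
    static boolean looksLikeHotPartition(Map<Integer, Long> lagByPartition, double threshold) {
        return lagByPartition.size() > 1 && hottestShare(lagByPartition) >= threshold;
    }
}
```

For example, lags of 90000, 0, and 10000 across three partitions put 90% of the backlog on one partition, which a threshold of 0.8 would flag. Evenly spread lag would not be flagged and points toward slow consumers or a traffic spike instead.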

Application side (Spring Boot)

Cause	How it manifests
Slow listener handler	Records consumed but each takes too long; lag grows steadily
Handler exceeds `max.poll.interval.ms`	Consumer evicted mid-batch, triggering repeated rebalances
Consumer instances crashed or scaled to zero	`CONSUMER-ID` missing; lag climbs with no one reading
Too little `concurrency` for the partition count	Fewer consumer threads than partitions, so the throughput ceiling is too low
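The poll-interval row comes down to simple arithmetic: if a full batch takes longer to process than `max.poll.interval.ms`, the consumer is evicted. The sketch below is a back-of-the-envelope check, assuming records are handled one at a time on the listener thread. The class name, method names, and the headroom factor are hypothetical.

```java
/** Back-of-the-envelope poll budget check. Names and headroom factor are hypothetical. */
final class PollBudget {

    /** Worst-case time to process one full batch, in milliseconds. */
    static long worstCaseBatchMillis(int maxPollRecords, long perRecordMillis) {
        return (long) maxPollRecords * perRecordMillis;
    }

    /** True when a full batch would take longer than max.poll.interval.ms. */
    static boolean breaches(int maxPollRecords, long perRecordMillis, long maxPollIntervalMillis) {
        return worstCaseBatchMillis(maxPollRecords, perRecordMillis) > maxPollIntervalMillis;
    }

    /** Largest max.poll.records that fits within a fraction (headroom) of the poll interval. */
    static int safeMaxPollRecords(long perRecordMillis, long maxPollIntervalMillis, double headroom) {
        if (perRecordMillis <= 0) {
            throw new IllegalArgumentException("perRecordMillis must be positive");
        }
        long budgetMillis = (long) (maxPollIntervalMillis * headroom);
        return (int) Math.max(1L, Math.min(Integer.MAX_VALUE, budgetMillis / perRecordMillis));
    }
}
```

With the Kafka client defaults (`max.poll.records` = 500, `max.poll.interval.ms` = 300000) and a handler that takes 5 seconds per record, a full batch needs 2,500,000 ms and breaches the interval. `PollBudget.safeMaxPollRecords(5000, 300000, 0.5)` suggests 30 records per poll, leaving half the interval as slack.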

3. How it manifests to the Spring app

Cause	What the service sees
Slow handler	Listener logs steady progress but throughput below produce rate
Poll interval breach	`Attempt to heartbeat failed` and `Member ... rejoining group` log lines; repeated partition assignment
Crashed instances	No listener activity; `/actuator/health` down or pod missing
Traffic spike	Handler healthy, latency normal, backlog just large

4. Diagnostic steps

Work top to bottom, cheapest first. Stop when you have a confident diagnosis.

1. Describe the group: `kafka-consumer-groups.sh --bootstrap-server $BROKER --describe --group payment-service`. Note LAG, CONSUMER-ID, and which partitions are behind.
   - No CONSUMER-ID on partitions: consumers are absent, go to the app.
   - CONSUMER-ID present but LAG climbing: slow or outpaced, keep going.
2. Check per-partition skew. If one partition holds all the lag, suspect a hot key (see Idempotency and Ordering), not overall slowness.
3. Check the app logs for rejoining group / heartbeat failures: a `max.poll.interval.ms` breach means the handler is too slow per batch.
4. Check `/actuator/health` and the instance count against expected. Missing instances explain absent consumers.
5. Compare produce rate to consume rate over the last few hours in CloudWatch (BytesInPerSec against the consume-side rate, such as BytesOutPerSec). A spike with healthy consumers is a capacity issue.

Step	Question it answers	Time cost
1. Describe group	Anyone consuming, how far behind?	seconds
2. Partition skew	Whole group or one hot partition?	seconds
3. App logs	Failing/evicted or just slow?	1-2 min
4. Health + count	Are instances actually attached?	1 min
5. Rate trend	Spike or sustained regression?	2-3 min
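The five steps can be read as a decision ladder that stops at the first confident answer. The sketch below encodes that ladder under the assumption that each step reduces to a simple yes/no or a count. The class, enum values, and parameters are hypothetical and stand in for what you observe by hand.

```java
/** Decision ladder mirroring the five diagnostic steps. All names are hypothetical. */
final class LagTriage {

    enum Diagnosis {
        NO_CONSUMERS, HOT_PARTITION, POLL_INTERVAL_BREACH,
        MISSING_INSTANCES, TRAFFIC_SPIKE, SLOW_HANDLER
    }

    static Diagnosis classify(int attachedConsumers,
                              int expectedConsumers,
                              boolean lagOnOnePartition,
                              boolean rejoinOrHeartbeatErrors,
                              boolean produceRateSpiked) {
        // Step 1: nobody attached means there is nothing to tune yet.
        if (attachedConsumers == 0) return Diagnosis.NO_CONSUMERS;
        // Step 2: lag on a single partition points at a hot key, not overall slowness.
        if (lagOnOnePartition) return Diagnosis.HOT_PARTITION;
        // Step 3: rejoin/heartbeat log lines suggest a max.poll.interval.ms breach.
        if (rejoinOrHeartbeatErrors) return Diagnosis.POLL_INTERVAL_BREACH;
        // Step 4: some consumers attached, but fewer than expected.
        if (attachedConsumers < expectedConsumers) return Diagnosis.MISSING_INSTANCES;
        // Step 5: healthy consumers plus a produce spike is a capacity issue;
        // otherwise the handler itself has become slower.
        return produceRateSpiked ? Diagnosis.TRAFFIC_SPIKE : Diagnosis.SLOW_HANDLER;
    }
}
```

The order of the checks matters more than the exact inputs: cheap, decisive signals come first, and the rate comparison is only reached once the consumers are known to be attached and stable.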

5. Safe remediations

Situation	Safe action
Fewer instances than expected	Restart or redeploy consumers; confirm `CONSUMER-ID` returns and lag falls
Healthy but outpaced, downstream has headroom	Scale out consumers up to the partition count, or raise `concurrency`
Handler breaching poll interval	Lower `max.poll.records`, or move slow work off the listener thread; do not just raise the timeout
Hot partition	Fix the key or partition count (a topic change): plan it, do not hot-patch

Scaling and restarting are your safe levers. Changing partition count, keys, or resetting offsets is escalation territory.
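Before scaling out, it helps to estimate whether the extra capacity will drain the backlog in a useful time. The sketch below is a rough estimate that assumes steady produce and consume rates. The class name, method name, and the rates in the example are hypothetical.

```java
import java.time.Duration;
import java.util.Optional;

/** Rough backlog drain estimate under steady rates. Names are hypothetical. */
final class DrainEstimate {

    /**
     * Time to clear the backlog, or empty when consumers are not outpacing producers
     * (in that case lag never drains and more capacity or escalation is needed).
     */
    static Optional<Duration> timeToDrain(long lagRecords, double consumePerSec, double producePerSec) {
        if (lagRecords <= 0) {
            return Optional.of(Duration.ZERO);
        }
        double netPerSec = consumePerSec - producePerSec;
        if (netPerSec <= 0) {
            return Optional.empty();
        }
        return Optional.of(Duration.ofSeconds((long) Math.ceil(lagRecords / netPerSec)));
    }
}
```

For a backlog of 100000 records with producers at 400 records/s, consuming at 500 records/s drains in 1000 seconds, while consuming at 900 records/s drains in 200 seconds. An empty result means the current capacity will never catch up.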

6. Escalation trigger

Stop and page on-call engineering (per Escalation and Communication) if:

- Lag keeps growing 20 to 30 minutes after your pass with consumers attached, healthy, and not obviously slow.
- The fix needs a partition-count or key change, or an offset reset.
- A downstream dependency outage (DB, external API) is the real cause: page that team in parallel.
- Restarting and scaling do not restore consumer count or reduce lag.

7. Relevant commands and exhibits

# Lag, current/end offset, and consumer id per partition
kafka-consumer-groups.sh --bootstrap-server $BROKER --describe --group payment-service

# Absent consumers: no CONSUMER-ID, lag climbing
GROUP            TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG    CONSUMER-ID
payment-service  orders  0          10432           98765           88333  -

# Slow consumers: attached but behind
payment-service  orders  1          9876            12001           2125   consumer-2

# App log signature of a max.poll.interval.ms breach
Member consumer-2 ... rejoining group payment-service
Attempt to heartbeat failed since group is rebalancing

MSK CloudWatch metrics to watch: MaxOffsetLag, SumOffsetLag, and EstimatedMaxTimeLag.
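If you want to script the "absent consumers" check, the describe output can be scanned for rows whose CONSUMER-ID is `-`. The sketch below is a minimal text parser, assuming whitespace-separated columns and a header row like the exhibit above. Real output carries extra columns (such as HOST and CLIENT-ID), so the code locates columns by header name. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Minimal parser for describe output. Names are hypothetical. */
final class DescribeOutput {

    /** Returns the partitions whose CONSUMER-ID column is "-", meaning no member is assigned. */
    static List<Integer> unassignedPartitions(String output) {
        List<Integer> result = new ArrayList<>();
        int partitionCol = -1;
        int consumerCol = -1;
        for (String line : output.split("\\R")) {
            String[] cols = line.trim().split("\\s+");
            List<String> cells = Arrays.asList(cols);
            // The header row tells us where the columns of interest are.
            if (cells.contains("PARTITION") && cells.contains("CONSUMER-ID")) {
                partitionCol = cells.indexOf("PARTITION");
                consumerCol = cells.indexOf("CONSUMER-ID");
                continue;
            }
            // Skip anything before the header, blank lines, and short rows.
            if (partitionCol < 0 || cols.length <= Math.max(partitionCol, consumerCol)) {
                continue;
            }
            if ("-".equals(cols[consumerCol])) {
                result.add(Integer.parseInt(cols[partitionCol]));
            }
        }
        return result;
    }
}
```

Fed the two exhibit rows under the header, this returns only partition 0, the row with no CONSUMER-ID. Treat it as a triage convenience, not a replacement for reading the full output.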

8. Guided practical

Reproduce a scaled-down backlog in the local lab.

1. Start the Payment consumer from First Producer and Consumer with concurrency: 1 and a Thread.sleep(5000) in the handler to simulate a slow downstream call.
2. Produce 30 OrderCreated records quickly.
3. Run kafka-consumer-groups --describe --group payment-service and confirm LAG sitting high with one CONSUMER-ID.
4. Remove the sleep and raise concurrency to 3, restart, and watch lag drain.

Next: Under-Replicated and Offline Partitions.