Kafka Rebalance Storms and Group Instability: Incident Playbook

1. Symptom

Consumer lag climbs even though consumers are running, and the app logs are full of `rejoining group` and `partitions revoked` / `partitions assigned` lines cycling every few seconds. The group spends more time rebalancing than processing, so throughput collapses. This is a rebalance storm: continuous rebalancing that prevents progress.

The goal is to find what keeps kicking members out of the group and stop the cycle, drawing on the concepts from Rebalancing.

2. Likely causes

Cause	How it manifests
Handler exceeds `max.poll.interval.ms`	Member evicted mid-batch, rejoins, evicted again
`session.timeout.ms` too low for GC pauses	Heartbeats miss, member declared dead
Rolling deploys or flapping pods	Members constantly joining and leaving
Eager rebalancing on a large group	Every rebalance stops the whole group (stop-the-world)
No static membership	A pod restart looks like a brand-new member each time

3. How it manifests to the Spring app

Cause	What the service sees
Poll interval breach	`Member ... rejoining` right after processing a large batch
Session timeout	`Attempt to heartbeat failed` right after a long GC pause
Eager protocol	All partitions revoked on every membership change
Deploy churn	Rebalances clustered around deploy times

4. Diagnostic steps

Grep the logs for `rejoining group`, `revoked`, and `assigned`. Frequency tells you whether it is a storm; timing tells you whether it aligns with deploys.
Correlate with processing time. If a rebalance follows each large batch, the handler is breaching `max.poll.interval.ms`.
Check GC and pauses. A long stop-the-world GC pause can make the member miss heartbeats for longer than `session.timeout.ms`; check the JVM logs.
Check the rebalance protocol. Eager (the default in older clients) revokes every partition on each rebalance; cooperative revokes only the partitions that move.
Check for static membership. Without `group.instance.id`, every restart triggers a full rebalance.

Step	Question it answers	Time cost
1. Log frequency/timing	Storm? Deploy-aligned?	1-2 min
2. Batch vs rebalance	Poll interval breach?	1-2 min
3. GC logs	Heartbeat starvation?	2-3 min
4. Protocol	Eager or cooperative?	1 min
5. Static membership	Restarts look like new members?	1 min
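
Step 1 can be sketched as a small script that buckets rebalance-related log lines by minute. This is a minimal sketch, not a definitive tool: the marker strings come from the storm signature in section 7, while the timestamp format, the sample lines, and the threshold of 5 events per minute are assumptions to adapt to your own logs.

```python
import re
from collections import Counter

# Substrings taken from the storm signature exhibit in this playbook.
REBALANCE_MARKERS = (
    "Revoking previously assigned partitions",
    "(Re-)joining group",
    "Successfully joined group",
)

# ASSUMPTION: each log line starts with "YYYY-MM-DD HH:MM:SS".
# Adjust the pattern to match your logging layout.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}")


def rebalance_events_per_minute(lines):
    """Count rebalance-related log lines, bucketed by minute."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.match(line)
        if match and any(marker in line for marker in REBALANCE_MARKERS):
            counts[match.group(1)] += 1
    return counts


def storm_minutes(counts, threshold=5):
    """Minutes whose event count reaches a (hypothetical) storm threshold."""
    return sorted(minute for minute, n in counts.items() if n >= threshold)


# Illustrative sample only: timestamps and the last line are made up.
sample = [
    "2024-05-01 10:15:02 INFO Revoking previously assigned partitions [orders-0, orders-1]",
    "2024-05-01 10:15:02 INFO (Re-)joining group payment-service",
    "2024-05-01 10:15:05 INFO Successfully joined group with generation 842",
    "2024-05-01 10:15:09 INFO Revoking previously assigned partitions [orders-0, orders-1]",
    "2024-05-01 10:15:09 INFO (Re-)joining group payment-service",
    "2024-05-01 10:16:30 INFO Processed batch of 100 records",
]
counts = rebalance_events_per_minute(sample)
print(dict(counts))
print(storm_minutes(counts))
```

Comparing the flagged minutes against your deploy timestamps answers the second half of step 1: whether the storm is deploy-aligned.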

5. Safe remediations

Situation	Safe action
Handler too slow per batch	Lower `max.poll.records` so a batch finishes well within the interval
Slow work on the listener thread	Move it off-thread; keep the poll loop responsive
Eager rebalancing hurting a big group	Switch to the cooperative sticky assignor (config change, test first)
Frequent restarts	Add static membership via `group.instance.id` so restarts do not reshuffle
GC pauses	Right-size heap; do not just raise `session.timeout.ms` blindly
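
The first row of the table comes down to simple arithmetic: the worst-case batch time is roughly `max.poll.records` times the per-record handler time, and it has to stay under `max.poll.interval.ms`. The following is a rough sketch, assuming records are processed sequentially on the listener thread and that you have measured a per-record time. The function names and the 50% headroom default are illustrative choices, not Kafka settings.

```python
def batch_duration_ms(max_poll_records, per_record_ms):
    """Worst-case time to process one full batch, record by record."""
    return max_poll_records * per_record_ms


def safe_max_poll_records(max_poll_interval_ms, per_record_ms, headroom=0.5):
    """Largest batch size whose worst-case duration stays within
    `headroom` of max.poll.interval.ms.

    ASSUMPTION: headroom=0.5 is an arbitrary safety margin, not a Kafka rule.
    """
    budget_ms = max_poll_interval_ms * headroom
    return max(1, int(budget_ms // per_record_ms))


# Lab values from section 8: 500 records at 50 ms each, 20 s interval.
print(batch_duration_ms(500, 50))        # 25000 ms, above the 20000 ms interval
print(safe_max_poll_records(20000, 50))  # 200 records fit in half the interval
```

If the measured per-record time varies a lot, feed in a high percentile rather than the average, since a single slow batch is enough to evict the member.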

6. Escalation trigger

Page on-call engineering if:

The storm continues after lowering `max.poll.records` and confirming the handler fits the interval.
Rebalances are triggered by broker-side instability (cross-check Broker Down, Controller Failover), not the app.
Switching to cooperative rebalancing or static membership requires a coordinated deploy you cannot safely perform alone.

7. Relevant commands and exhibits

# Storm signature in app logs (repeating every few seconds)
Revoking previously assigned partitions [orders-0, orders-1]
(Re-)joining group payment-service
Successfully joined group with generation 842
Attempt to heartbeat failed since group is rebalancing

# Stabilizing config (Spring)
spring:
  kafka:
    consumer:
      max-poll-records: 100
      properties:
        max.poll.interval.ms: 300000
        session.timeout.ms: 45000
        group.instance.id: ${HOSTNAME}   # static membership; must be unique per instance and stable across restarts
        partition.assignment.strategy: org.apache.kafka.clients.consumer.CooperativeStickyAssignor

A generation number that climbs quickly (generation 842 in the log excerpt above) is the clearest sign of a storm.
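
To put a number on that climb, the generation can be pulled out of the `Successfully joined group` lines. This is a minimal sketch: the pattern follows the log excerpt above, and the optional `Generation{generationId=...}` part is an assumption about a longer form that some client versions may print, so check it against your actual logs. The sample lines are illustrative.

```python
import re

# Matches "generation 842" as in the exhibit. The optional prefix is an
# ASSUMPTION to tolerate a longer "Generation{generationId=842, ...}" form.
GENERATION = re.compile(
    r"Successfully joined group with generation (?:Generation\{generationId=)?(\d+)"
)


def generations(lines):
    """Extract generation numbers in the order they appear in the log."""
    return [int(m.group(1)) for line in lines if (m := GENERATION.search(line))]


def generation_climb(lines):
    """Return (first, last, join_count) for the generations seen, or None."""
    seen = generations(lines)
    if not seen:
        return None
    return seen[0], seen[-1], len(seen)


# Illustrative sample only.
sample = [
    "Successfully joined group with generation 842",
    "Attempt to heartbeat failed since group is rebalancing",
    "Successfully joined group with generation 843",
    "Revoking previously assigned partitions [orders-0, orders-1]",
    "Successfully joined group with generation 844",
]
print(generation_climb(sample))
```

Run it over a fixed window of log (for example the last ten minutes): a healthy group shows few or no new generations, while a storm shows one every few seconds.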

8. Guided practical

Reproduce a storm in the local lab.

Run the Payment consumer with `max-poll-records: 500` and a handler that sleeps 50 ms per record. A full batch then takes about 25 s (500 × 50 ms), which exceeds a deliberately low `max.poll.interval.ms` (for example 20000).
Produce a few thousand records and watch the logs cycle through revoke/rejoin: a storm.
Lower `max-poll-records` to 50, so a batch takes about 2.5 s, and confirm that batches finish inside the interval and the storm stops.
Add `group.instance.id` and restart an instance. With static membership, an instance that returns within `session.timeout.ms` should keep its partitions rather than trigger a reshuffle.

Next: Producer Failures.