Rebalance Storms and Group Instability

Diagnose constant rebalancing from slow handlers and max.poll.interval.ms breaches, and stabilize with cooperative rebalancing and static membership.


1. Symptom

Consumer lag climbs even though consumers are running, and the app logs are full of "rejoining group" and "partitions revoked" / "partitions assigned" lines that cycle every few seconds. The group spends more time rebalancing than processing, so throughput collapses. This is a rebalance storm: continuous rebalancing that prevents progress.

The goal is to find what keeps kicking members out of the group and stop the cycle, drawing on the concepts from Rebalancing.


2. Likely causes

| Cause | How it manifests |
| --- | --- |
| Handler exceeds max.poll.interval.ms | Member evicted mid-batch, rejoins, evicted again |
| session.timeout.ms too low for GC pauses | Heartbeats miss, member declared dead |
| Rolling deploys or flapping pods | Members constantly joining and leaving |
| Eager rebalancing on a large group | Every rebalance stops the whole group (stop-the-world) |
| No static membership | A pod restart looks like a brand-new member each time |

3. How it manifests to the Spring app

| Cause | What the service sees |
| --- | --- |
| Poll interval breach | Member ... rejoining right after processing a large batch |
| Session timeout | Attempt to heartbeat failed during a GC pause |
| Eager protocol | All partitions revoked on every membership change |
| Deploy churn | Rebalances clustered around deploy times |

4. Diagnostic steps

  1. Grep the logs for rejoining group, revoked, assigned. Frequency tells you it is a storm; timing tells you if it aligns with deploys.
  2. Correlate with processing time. If a rebalance follows each large batch, the handler is breaching max.poll.interval.ms.
  3. Check GC and pauses. Long stop-the-world GC can miss session.timeout.ms heartbeats; check the JVM logs.
  4. Check the rebalance protocol. Eager (the default with the older assignors) revokes everything each time; cooperative revokes only the partitions that move.
  5. Check for static membership. Without group.instance.id, every restart triggers a full rebalance.

| Step | Question it answers | Time cost |
| --- | --- | --- |
| 1. Log frequency/timing | Storm? Deploy-aligned? | 1-2 min |
| 2. Batch vs rebalance | Poll interval breach? | 1-2 min |
| 3. GC logs | Heartbeat starvation? | 2-3 min |
| 4. Protocol | Eager or cooperative? | 1 min |
| 5. Static membership | Restarts look like new members? | 1 min |
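
Step 1 can be scripted rather than eyeballed. The sketch below is a minimal illustration, not a tool from this runbook: it counts log lines that contain one of the rebalance markers shown in the log exhibit in section 7. The class name, method name, and the "Processed batch" filler line are hypothetical, and marker wording may differ between Kafka client versions.

```java
import java.util.List;

// Sketch: count rebalance-related lines in application logs.
// Marker strings come from the log exhibit in this runbook;
// class and method names are hypothetical.
class RebalanceLogScan {

    // Substrings that indicate a rebalance step in the consumer logs.
    static final List<String> MARKERS = List.of(
            "Revoking previously assigned partitions",
            "(Re-)joining group",
            "Successfully joined group");

    // Returns how many lines contain at least one rebalance marker.
    static long countRebalanceLines(List<String> lines) {
        return lines.stream()
                .filter(line -> MARKERS.stream().anyMatch(line::contains))
                .count();
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "Revoking previously assigned partitions [orders-0, orders-1]",
                "(Re-)joining group payment-service",
                "Successfully joined group with generation 842",
                "Processed batch of 100 records"); // hypothetical non-rebalance line
        System.out.println(countRebalanceLines(sample)); // prints 3
    }
}
```

A high count over a short window suggests a storm; the sketch does not look at timestamps, so checking alignment with deploys still needs the log timing.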

5. Safe remediations

| Situation | Safe action |
| --- | --- |
| Handler too slow per batch | Lower max.poll.records so a batch finishes well within the interval |
| Slow work on the listener thread | Move it off-thread; keep the poll loop responsive |
| Eager rebalancing hurting a big group | Switch to the cooperative sticky assignor (config change, test first) |
| Frequent restarts | Add static membership via group.instance.id so restarts do not reshuffle |
| GC pauses | Right-size heap; do not just raise session.timeout.ms blindly |
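
For the first remediation, a rough way to choose a max.poll.records ceiling is to divide a share of max.poll.interval.ms by the measured worst-case time per record. The sketch below assumes a sequential handler and ignores poll and commit overhead; the 50% headroom, the 200 ms per-record figure, and all names are illustrative assumptions, not Kafka defaults.

```java
// Sketch: derive a max.poll.records ceiling from handler latency.
// The headroom factor and all names are assumptions for illustration.
class PollSizing {

    // Largest batch size whose worst-case processing time stays within
    // `headroom` (0 < headroom <= 1) of max.poll.interval.ms.
    static long maxRecordsFor(long maxPollIntervalMs, long perRecordMs, double headroom) {
        if (perRecordMs <= 0 || headroom <= 0 || headroom > 1) {
            throw new IllegalArgumentException(
                    "perRecordMs must be > 0 and headroom must be in (0, 1]");
        }
        long budgetMs = (long) (maxPollIntervalMs * headroom);
        return budgetMs / perRecordMs;
    }

    public static void main(String[] args) {
        // 300000 ms interval, hypothetical 200 ms per record, 50% headroom:
        // 150000 ms budget / 200 ms per record.
        System.out.println(maxRecordsFor(300_000, 200, 0.5)); // prints 750
    }
}
```

Measure per-record time at a high percentile rather than the average, since a single slow batch is enough to evict the member.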

6. Escalation trigger

Page on-call engineering if:

  • The storm continues after lowering max.poll.records and confirming the handler fits the interval.
  • Rebalances are triggered by broker-side instability (cross-check Broker Down, Controller Failover), not the app.
  • Switching to cooperative rebalancing or static membership requires a coordinated deploy you cannot safely perform alone.

7. Relevant commands and exhibits

# Storm signature in app logs (repeating every few seconds)
Revoking previously assigned partitions [orders-0, orders-1]
(Re-)joining group payment-service
Successfully joined group with generation 842
Attempt to heartbeat failed since group is rebalancing

# Stabilizing config (Spring)
spring:
  kafka:
    consumer:
      max-poll-records: 100
      properties:
        max.poll.interval.ms: 300000
        session.timeout.ms: 45000
        group.instance.id: ${HOSTNAME}   # static membership
        partition.assignment.strategy: org.apache.kafka.clients.consumer.CooperativeStickyAssignor

A generation number (generation 842 in the log above) that climbs quickly is the clearest sign of a storm.
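
To put a number on how fast the generation climbs, one option is to pull the generation values out of a log window and compare the first with the last. This is a minimal sketch: the regular expression matches the "generation 842" wording from the exhibit, which is an assumption about your client's log format (some client versions print the generation differently), and the class name and the sample values 843 and 845 are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: measure how far the group generation climbed within a log window.
// The pattern assumes the "generation <number>" wording from the exhibit.
class GenerationClimb {

    private static final Pattern GENERATION = Pattern.compile("generation (\\d+)");

    // Extracts every generation number found, in log order.
    static List<Long> generations(List<String> lines) {
        List<Long> found = new ArrayList<>();
        for (String line : lines) {
            Matcher m = GENERATION.matcher(line);
            if (m.find()) {
                found.add(Long.parseLong(m.group(1)));
            }
        }
        return found;
    }

    // Generations gained between the first and last sighting; 0 if fewer than two.
    static long climb(List<String> lines) {
        List<Long> g = generations(lines);
        return g.size() < 2 ? 0 : g.get(g.size() - 1) - g.get(0);
    }

    public static void main(String[] args) {
        List<String> window = List.of(
                "Successfully joined group with generation 842",
                "Attempt to heartbeat failed since group is rebalancing",
                "Successfully joined group with generation 843",
                "Successfully joined group with generation 845");
        System.out.println(climb(window)); // prints 3
    }
}
```

Dividing the climb by the length of the log window gives rebalances per minute; a stable group gains a generation only when membership or subscriptions actually change.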


8. Guided practical

Reproduce a storm in the local lab.

  1. Run the Payment consumer with max-poll-records: 500 and a handler that sleeps 50ms per record, so a full batch exceeds a low max.poll.interval.ms you set (for example 20000).
  2. Produce a few thousand records and watch the logs cycle through revoke/rejoin: a storm.
  3. Lower max-poll-records to 50 and confirm batches finish inside the interval and the storm stops.
  4. Add group.instance.id and restart an instance. If it comes back within session.timeout.ms, it should reclaim its previous partitions with little or no rebalancing.
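
The numbers in steps 1 and 3 can be checked before running the lab. The sketch below assumes the handler processes records one at a time and ignores poll overhead; the class and method names are hypothetical.

```java
// Sketch: check the lab settings against max.poll.interval.ms.
// Assumes strictly sequential processing with a fixed sleep per record.
class LabBatchCheck {

    // Time to process one full batch, in milliseconds.
    static long batchMillis(int records, long perRecordMs) {
        return records * perRecordMs;
    }

    // True when a full batch takes longer than the poll interval allows.
    static boolean breachesInterval(int records, long perRecordMs, long maxPollIntervalMs) {
        return batchMillis(records, perRecordMs) > maxPollIntervalMs;
    }

    public static void main(String[] args) {
        long intervalMs = 20_000; // the low max.poll.interval.ms from step 1

        // Step 1: 500 records at 50 ms each.
        System.out.println(batchMillis(500, 50));                  // prints 25000
        System.out.println(breachesInterval(500, 50, intervalMs)); // prints true

        // Step 3: 50 records at 50 ms each.
        System.out.println(batchMillis(50, 50));                   // prints 2500
        System.out.println(breachesInterval(50, 50, intervalMs));  // prints false
    }
}
```

A full batch of 500 takes about 25 seconds against a 20 second interval, so the member is evicted; at 50 records the batch takes about 2.5 seconds and fits comfortably.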

Next: Producer Failures.