Rebalance Storms and Group Instability
Diagnose constant rebalancing from slow handlers and max.poll.interval.ms breaches, and stabilize with cooperative rebalancing and static membership.
1. Symptom
Consumer lag climbs even though consumers are running, and the app logs are full of rejoining group and partitions revoked / partitions assigned lines cycling every few seconds. The group spends more time rebalancing than processing, so throughput collapses. This is a rebalance storm: continuous rebalancing that prevents progress.
The goal is to find what keeps kicking members out of the group and stop the cycle, drawing on the concepts from Rebalancing.
2. Likely causes
| Cause | How it manifests |
|---|---|
| Handler exceeds max.poll.interval.ms | Member evicted mid-batch, rejoins, evicted again |
| session.timeout.ms too low for GC pauses | Heartbeats miss, member declared dead |
| Rolling deploys or flapping pods | Members constantly joining and leaving |
| Eager rebalancing on a large group | Every rebalance stops the whole group (stop-the-world) |
| No static membership | A pod restart looks like a brand-new member each time |
3. How it manifests to the Spring app
| Cause | What the service sees |
|---|---|
| Poll interval breach | Member ... rejoining right after processing a large batch |
| Session timeout | Attempt to heartbeat failed during a GC pause |
| Eager protocol | All partitions revoked on every membership change |
| Deploy churn | Rebalances clustered around deploy times |
4. Diagnostic steps
- Grep the logs for rejoining group, revoked, and assigned. Frequency tells you whether it is a storm; timing tells you whether it aligns with deploys.
- Correlate with processing time. If a rebalance follows each large batch, the handler is breaching max.poll.interval.ms.
- Check GC and pauses. A long stop-the-world GC pause can delay heartbeats past session.timeout.ms; check the JVM logs.
- Check the rebalance protocol. Eager (the default in older clients) revokes everything each time; cooperative revokes only what moves.
- Check for static membership. Without group.instance.id, every restart is treated as a brand-new member and triggers a rebalance.
| Step | Question it answers | Time cost |
|---|---|---|
| 1. Log frequency/timing | Storm? Deploy-aligned? | 1-2 min |
| 2. Batch vs rebalance | Poll interval breach? | 1-2 min |
| 3. GC logs | Heartbeat starvation? | 2-3 min |
| 4. Protocol | Eager or cooperative? | 1 min |
| 5. Static membership | Restarts look like new members? | 1 min |
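The first step (log frequency and timing) can be sketched as a small script. This is a minimal sketch, not a definitive tool: it assumes each log line starts with an ISO-8601 timestamp such as 2024-05-01T12:00:03, which may not match your logging pattern, and the sample lines are made up from the storm signature in this runbook. The function name is hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Substrings taken from the storm signature exhibit in this runbook.
MARKERS = ("Revoking previously assigned partitions", "(Re-)joining group")

# ASSUMPTION: every log line begins with an ISO-8601 timestamp.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")


def rebalance_events_per_minute(lines):
    """Count lines containing a rebalance marker, bucketed by minute."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.match(line)
        if match and any(marker in line for marker in MARKERS):
            minute = datetime.fromisoformat(match.group(1)).replace(second=0)
            counts[minute] += 1
    return counts


# Made-up sample lines for illustration only.
sample = [
    "2024-05-01T12:00:03 Revoking previously assigned partitions [orders-0, orders-1]",
    "2024-05-01T12:00:04 (Re-)joining group payment-service",
    "2024-05-01T12:00:09 Successfully joined group with generation 842",
    "2024-05-01T12:00:41 (Re-)joining group payment-service",
    "2024-05-01T12:01:02 Revoking previously assigned partitions [orders-0]",
]

per_minute = rebalance_events_per_minute(sample)
for minute, count in sorted(per_minute.items()):
    print(minute.isoformat(), count)
```

Several marker lines in every minute suggest a storm; buckets that spike only around deploy times point to deploy churn instead.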
5. Safe remediations
| Situation | Safe action |
|---|---|
| Handler too slow per batch | Lower max.poll.records so a batch finishes well within the interval |
| Slow work on the listener thread | Move it off-thread; keep the poll loop responsive |
| Eager rebalancing hurting a big group | Switch to the cooperative sticky assignor (config change, test first) |
| Frequent restarts | Add static membership via group.instance.id so restarts do not reshuffle |
| GC pauses | Right-size heap; do not just raise session.timeout.ms blindly |
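To size max.poll.records for the first remediation, a rough budget check may help. The sketch below assumes records are processed sequentially on the listener thread, that every poll returns a full batch, and that per_record_ms is a worst-case estimate you have measured. The function names and the 50% headroom default are illustrative assumptions, not recommendations from Kafka or Spring.

```python
def batch_duration_ms(max_poll_records, per_record_ms):
    """Worst-case time to process one full batch sequentially."""
    return max_poll_records * per_record_ms


def max_safe_poll_records(per_record_ms, max_poll_interval_ms, headroom=0.5):
    """Largest batch size whose worst case uses at most `headroom`
    (a fraction between 0 and 1) of max.poll.interval.ms."""
    budget_ms = max_poll_interval_ms * headroom
    return max(1, int(budget_ms // per_record_ms))


# 500 records at 50 ms each against a 20000 ms interval: the batch overruns.
print(batch_duration_ms(500, 50) > 20000)   # True
# Batch size that stays within half of the 20000 ms interval.
print(max_safe_poll_records(50, 20000))     # 200
```

Leaving headroom matters because a single slow record or a GC pause can push an otherwise fitting batch over the interval.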
6. Escalation trigger
Page on-call engineering if:
- The storm continues after lowering max.poll.records and confirming the handler fits the interval.
- Rebalances are triggered by broker-side instability (cross-check Broker Down, Controller Failover), not the app.
- Switching to cooperative rebalancing or static membership requires a coordinated deploy you cannot safely perform alone.
7. Relevant commands and exhibits
# Storm signature in app logs (repeating every few seconds)
Revoking previously assigned partitions [orders-0, orders-1]
(Re-)joining group payment-service
Successfully joined group with generation 842
Attempt to heartbeat failed since group is rebalancing
# Stabilizing config (Spring)
spring:
  kafka:
    consumer:
      max-poll-records: 100
      properties:
        max.poll.interval.ms: 300000
        session.timeout.ms: 45000
        group.instance.id: ${HOSTNAME} # static membership; must be unique per instance and stable across restarts
        partition.assignment.strategy: org.apache.kafka.clients.consumer.CooperativeStickyAssignor
A generation number that climbs quickly (generation 842 in the exhibit above) is the clearest sign of a storm.
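One way to quantify how fast the generation climbs is to extract it from the join lines in a log window. This is a sketch under assumptions: it matches the "Successfully joined group with generation N" wording from the exhibit above, and it also tolerates a generationId=N form on the assumption that some client versions log it that way. The function name and sample lines are hypothetical, and no threshold is implied.

```python
import re

# Matches "generation 842" (as in the exhibit); "generationId=842" is an
# ASSUMED alternative format, accepted here for tolerance.
GENERATION = re.compile(r"generation(?:Id=|\s+)(\d+)")


def joined_generations(lines):
    """Return the generation numbers from successful group joins, in order."""
    generations = []
    for line in lines:
        if "Successfully joined group" in line:
            match = GENERATION.search(line)
            if match:
                generations.append(int(match.group(1)))
    return generations


# Made-up sample window for illustration only.
sample = [
    "Successfully joined group with generation 842",
    "Attempt to heartbeat failed since group is rebalancing",
    "Successfully joined group with generation 843",
    "Successfully joined group with generation 845",
]

gens = joined_generations(sample)
print(gens, "advanced by", gens[-1] - gens[0])
```

Compare the advance against the length of the log window: a stable group may hold the same generation for hours, while a storm advances it every few seconds.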
8. Guided practical
Reproduce a storm in the local lab.
- Run the Payment consumer with max-poll-records: 500 and a handler that sleeps 50 ms per record, so a full batch (500 records at 50 ms each, about 25 s) exceeds a low max.poll.interval.ms you set (for example 20000).
- Produce a few thousand records and watch the logs cycle through revoke/rejoin: a storm.
- Lower max-poll-records to 50 (about 2.5 s per batch) and confirm batches finish inside the interval and the storm stops.
- Add group.instance.id and restart an instance, observing a smaller rebalance.
Next: Producer Failures.