Kafka Rebalancing: Cooperative Protocol and Group Stability

A consumer group has to agree on which instance owns which partition, and it renegotiates whenever membership or the topic changes. That negotiation is a rebalance, and while it happens, processing on the affected partitions pauses. Occasional rebalances are normal, but a misconfigured group can fall into a rebalance storm, rebalancing so often that it barely processes anything. This final reliability module explains what triggers rebalances and how to keep a group stable.

What you’ll be able to do after this module

List what triggers a consumer group rebalance.
Contrast eager and cooperative (incremental) rebalancing.
Use static group membership to avoid rebalances on restart.
Set session.timeout.ms, heartbeat.interval.ms, and max.poll.interval.ms correctly.
Explain how a slow handler causes a rebalance storm and fix it.

1. What triggers a rebalance

A rebalance is the group coordinator reassigning partitions across the group’s members. It is triggered by a change in membership or in the topic:

A consumer joins the group (a new instance starts).
A consumer leaves gracefully (shutdown) or is presumed dead (missed heartbeats).
A consumer is evicted for exceeding max.poll.interval.ms between polls.
The topic’s partition count changes.

During a rebalance, affected consumers stop processing until new assignments are settled. Keeping rebalances rare and cheap is the goal.

2. Eager vs cooperative rebalancing

The rebalance protocol decides how disruptive a rebalance is.

Eager (older, stop-the-world): every consumer revokes all its partitions, then the group reassigns from scratch. The whole group pauses, even for partitions that were not moving.
Cooperative (incremental, the newer protocol): only the partitions that actually need to move are revoked. Consumers keep processing the partitions they retain, so the disruption is limited to the changed assignments.
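
To put numbers on the difference, here is a toy model in plain Java with no Kafka client involved. It counts how many partitions are revoked when a third consumer joins a group of two that owns six partitions. The class, the method names, and the "keep up to a fair share" rule are illustrative assumptions, not the logic of any real assignor.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RebalanceCostSketch {

    /** Number of partitions currently owned across the whole group. */
    static int totalOwned(Map<String, List<Integer>> current) {
        int total = 0;
        for (List<Integer> owned : current.values()) {
            total += owned.size();
        }
        return total;
    }

    /** Eager: every member gives up everything it owns before reassignment. */
    static int revokedEager(Map<String, List<Integer>> current) {
        return totalOwned(current);
    }

    /**
     * Cooperative (toy version): each existing member keeps up to its fair
     * share and gives up only the surplus, which then goes to the newcomers.
     */
    static int revokedCooperative(Map<String, List<Integer>> current, int membersAfterJoin) {
        if (membersAfterJoin <= 0) {
            throw new IllegalArgumentException("group must have at least one member");
        }
        int fairShare = (int) Math.ceil((double) totalOwned(current) / membersAfterJoin);
        int revoked = 0;
        for (List<Integer> owned : current.values()) {
            revoked += Math.max(0, owned.size() - fairShare);
        }
        return revoked;
    }

    /** Example group: two consumers, six partitions, three each. */
    static Map<String, List<Integer>> exampleGroup() {
        Map<String, List<Integer>> group = new LinkedHashMap<>();
        group.put("consumer-a", List.of(0, 1, 2));
        group.put("consumer-b", List.of(3, 4, 5));
        return group;
    }
}
```

For the example group with a third member joining, `revokedEager` returns 6 and `revokedCooperative` returns 2: four of the six partitions never stop being processed.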

```mermaid
flowchart TD
    subgraph eager [Eager]
        e1["all consumers revoke ALL partitions"]
        e2["whole group pauses"]
        e3["reassign everything"]
        e1 --> e2 --> e3
    end
    subgraph coop [Cooperative]
        c1["revoke only moving partitions"]
        c2["retained partitions keep processing"]
        c3["assign just the moved ones"]
        c1 --> c2 --> c3
    end
```

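Spring for Apache Kafka does not pick the protocol itself; it inherits partition.assignment.strategy from the Kafka client. With the classic group protocol, recent clients list the cooperative sticky assignor as a supported option but still put an eager assignor first, so confirm that CooperativeStickyAssignor is actually configured. Prefer it, because it turns a scaling event or a single restart into a minor adjustment rather than a full stall.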
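
If you would rather not depend on client defaults, the assignor can be set explicitly. The sketch below builds the consumer properties as a plain map with string keys, so it compiles without kafka-clients on the classpath. The broker address and group name are assumptions for the local lab; the property key and the assignor class name are the real ones.

```java
import java.util.HashMap;
import java.util.Map;

final class CooperativeAssignorConfig {

    /** Consumer properties that opt in to cooperative rebalancing. */
    static Map<String, Object> consumerProps() {
        Map<String, Object> props = new HashMap<>();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local lab broker
        props.put("group.id", "payment-service");         // hypothetical group name
        // Select the cooperative (incremental) assignor explicitly.
        props.put("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
        return props;
    }
}
```

In a Spring Boot application the same key would sit under spring.kafka.consumer.properties. Before switching the assignor on a group that is already running, check the Kafka upgrade notes for the recommended rolling-restart procedure.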

3. Static group membership

By default, every time a consumer restarts it gets a new member identity, which triggers a rebalance on the way out and again on the way back. For rolling deploys of a large group, that is a lot of churn.

Static membership fixes this. Give each instance a stable group.instance.id, and the coordinator recognizes a restarting instance as the same member. As long as it returns within the session timeout, its partitions are held for it and no rebalance happens.

```yaml
spring:
  kafka:
    consumer:
      properties:
        group.instance.id: payment-service-1   # unique and stable per instance
```
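
The id has to be the same across restarts of one instance and different between instances. One possible way to get that, assuming each instance runs on a stable hostname ending in "-<ordinal>" (as Kubernetes StatefulSet pods do), is to derive it from the hostname. The helper below is hypothetical and only illustrates the idea.

```java
final class StaticMemberId {

    /**
     * Builds a group.instance.id from a service name and a stable hostname.
     * Assumes the hostname ends in "-<ordinal>", for example "payment-service-1".
     */
    static String fromHostname(String serviceName, String hostname) {
        int dash = hostname.lastIndexOf('-');
        if (dash < 0 || dash == hostname.length() - 1) {
            throw new IllegalArgumentException("hostname has no ordinal suffix: " + hostname);
        }
        String ordinal = hostname.substring(dash + 1);
        if (!ordinal.chars().allMatch(Character::isDigit)) {
            throw new IllegalArgumentException("ordinal is not numeric: " + hostname);
        }
        // Same hostname after a restart gives the same id, so the coordinator
        // can recognize the returning member.
        return serviceName + "-" + ordinal;
    }
}
```

A random suffix or a timestamp would defeat the purpose: the restarted instance would look like a new member and trigger the rebalance that static membership is meant to avoid.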

4. The timeouts that matter

Three settings govern when a consumer is considered alive, and getting them wrong is the usual cause of surprise rebalances.

| Setting | Governs | Typical guidance |
| --- | --- | --- |
| `heartbeat.interval.ms` | How often the consumer sends a heartbeat | About 1/3 of the session timeout |
| `session.timeout.ms` | How long the coordinator waits without a heartbeat before declaring the member dead | Default 45s in recent clients; the eviction window |
| `max.poll.interval.ms` | Max time between `poll()` calls before eviction | Default 5 minutes; must exceed your worst-case processing time for a batch |
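
The guidance in the table can be turned into a quick sanity check. The sketch below is a hypothetical helper, not part of Kafka or Spring; the worst-case batch time is something you would have to measure or estimate for your own handler.

```java
import java.util.ArrayList;
import java.util.List;

final class TimeoutCheck {

    /** Returns human-readable warnings; an empty list means the guidance is met. */
    static List<String> check(long heartbeatIntervalMs, long sessionTimeoutMs,
                              long maxPollIntervalMs, long worstCaseBatchMs) {
        List<String> warnings = new ArrayList<>();
        // Rule of thumb: several heartbeats should fit inside one session timeout.
        if (heartbeatIntervalMs * 3 > sessionTimeoutMs) {
            warnings.add("heartbeat.interval.ms should be at most 1/3 of session.timeout.ms");
        }
        // The poll deadline must leave room for the slowest batch.
        if (maxPollIntervalMs <= worstCaseBatchMs) {
            warnings.add("max.poll.interval.ms must exceed worst-case batch processing time");
        }
        return warnings;
    }
}
```

For example, `check(3000, 45000, 300000, 60000)` returns no warnings, while a batch that can take 400000 ms against a 300000 ms poll interval is flagged.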

Heartbeats run on a background thread, so a consumer can be alive by heartbeat yet still be evicted if it does not call poll() often enough. That second deadline, max.poll.interval.ms, is the one most people trip over.
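
The two deadlines can be pictured as independent checks. The following is a simplified model for reasoning only; the real client and broker split this work differently, and the names are made up for the example.

```java
final class MemberLiveness {

    enum Status { ALIVE, SESSION_EXPIRED, POLL_EXPIRED }

    /** Evaluates both deadlines for one member at time nowMs. */
    static Status evaluate(long nowMs, long lastHeartbeatMs, long lastPollMs,
                           long sessionTimeoutMs, long maxPollIntervalMs) {
        if (nowMs - lastHeartbeatMs > sessionTimeoutMs) {
            return Status.SESSION_EXPIRED; // no heartbeat: process or network is gone
        }
        if (nowMs - lastPollMs > maxPollIntervalMs) {
            return Status.POLL_EXPIRED;    // heartbeats are fine, but the poll loop is stuck
        }
        return Status.ALIVE;
    }
}
```

A member that sent a heartbeat one second ago but last polled 400 seconds ago, with a 300 second poll interval, comes out as `POLL_EXPIRED`: healthy heartbeats do not save it.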

5. Rebalance storms from a slow handler

Here is the classic failure. A handler does slow work for each record, for example a long external call. The batch of records from one poll takes longer than max.poll.interval.ms to process. The consumer is treated as stuck and evicted, the group rebalances, and its partitions move to other members. The evicted consumer finishes, typically fails to commit offsets for partitions it no longer owns, rejoins, gets a batch, is slow again, and is evicted again. Because the uncommitted records are redelivered, the slow work repeats, and the group spends its time rebalancing instead of working.

```mermaid
sequenceDiagram
    participant C as Consumer
    participant Co as Group coordinator
    C->>Co: poll(), get a large batch
    Note over C: slow processing exceeds max.poll.interval.ms
    Co->>Co: consumer presumed stuck, evict
    Co->>Co: rebalance
    C->>Co: finishes, rejoins
    Note over C,Co: cycle repeats: rebalance storm
```

The fixes, roughly in order of preference:

Process faster or reduce max.poll.records, so a batch finishes well within the interval.
Raise max.poll.interval.ms if the work is legitimately long and cannot be shortened.
Move slow work off the poll thread, or push retries to a retry topic as in Retries, Error Handling, and Dead Letter Topics, so the poll loop stays responsive.
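
For the first fix, a rough batch-size budget helps. The sketch below estimates the largest max.poll.records whose worst-case batch time stays within a chosen fraction of max.poll.interval.ms. It is a hypothetical helper; the per-record worst case and the headroom fraction are inputs you would have to supply from your own measurements.

```java
final class PollBatchSizing {

    /**
     * Largest max.poll.records whose worst-case batch time stays within
     * budgetFraction of max.poll.interval.ms. Never returns less than 1.
     */
    static int maxSafeRecords(long maxPollIntervalMs, long worstCasePerRecordMs,
                              double budgetFraction) {
        if (maxPollIntervalMs <= 0 || worstCasePerRecordMs <= 0
                || budgetFraction <= 0 || budgetFraction > 1) {
            throw new IllegalArgumentException("inputs must be positive, fraction in (0, 1]");
        }
        long budgetMs = (long) (maxPollIntervalMs * budgetFraction);
        long records = Math.max(1L, budgetMs / worstCasePerRecordMs);
        return (int) Math.min(Integer.MAX_VALUE, records);
    }
}
```

With the default 5 minute interval, a 2 second worst case per record, and half the interval as budget, `maxSafeRecords(300_000, 2_000, 0.5)` gives 75. The default max.poll.records of 500 would need up to 1000 seconds for the same handler, which is exactly the storm described above.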

6. Guided practical

Run this against the local lab.

Start two instances of a consumer in one group over a multi-partition topic and watch the initial rebalance assign partitions.
With the cooperative assignor configured, stop one instance and confirm the rebalance moves only its partitions.
Add a stable group.instance.id to each instance, restart one, and confirm no rebalance occurs within the session timeout.
Add a Thread.sleep longer than max.poll.interval.ms in the handler and observe the eviction and repeated rebalance.
Lower max.poll.records or raise max.poll.interval.ms and confirm the group stabilizes.

Next: Section 6, Event-Driven Microservices and Topic Design, where the reliability building blocks combine into a full event-driven architecture.