Broker Down, Controller Failover, KRaft Quorum Loss
Diagnose broker loss, controller re-election, and KRaft metadata quorum falling below majority, and know what is self-healing versus an emergency.
1. Symptom
A broker disappears from the cluster: ActiveControllerCount alarms (not equal to 1), a broker count drop, or client logs full of disconnected and NOT_LEADER_FOR_PARTITION. In the worst case the KRaft metadata quorum loses majority and the cluster cannot elect a controller at all, which halts metadata changes cluster-wide.
The goal is to tell apart a routine single-broker loss (usually self-healing on MSK) from a control-plane emergency (quorum loss), because the second is a full-cluster incident.
2. Likely causes
| Cause | How it manifests |
|---|---|
| Single broker failure | Broker count drops by one; partitions with a replica there re-elect leaders |
| Controller broker failure | Brief metadata pause while a new controller is elected |
| KRaft quorum loss (majority of controllers down) | No controller can be elected; metadata operations stall |
| Network partition between brokers | Brokers see each other as down; ISR churn and leader changes |
On KRaft, the controllers form a Raft quorum, so a three-controller setup tolerates losing one. Losing two of three means no majority and no controller, from Control Plane.
3. How it manifests to the Spring app
| Condition | What the service sees |
|---|---|
| Single broker down | Brief retriable errors, then automatic recovery as leaders move |
| Controller re-election | Short pause on metadata-dependent operations (topic create, etc.) |
| Quorum loss | Producers/consumers keep working on existing leaders but no new leaders can be elected; failures accumulate |
Well-configured clients with retries (from Reliable Producing) ride out a single broker loss with only a latency blip.
4. Diagnostic steps
- Check
ActiveControllerCountin CloudWatch. Exactly 1 is healthy. 0 means no controller (emergency); more than 1 briefly during failover is expected, but sustained is wrong. - Check the broker count. How many of the expected brokers are reporting? One missing is routine; multiple missing is serious.
- Check client logs for
NOT_LEADER_FOR_PARTITIONanddisconnected. Transient bursts during failover are normal; sustained is not. - Check whether it is recovering. MSK auto-replaces failed brokers; a single loss should recover within minutes as leaders move to surviving replicas.
- Assess quorum. If controllers are down to no majority, metadata operations hang: this is the escalate-now condition.
| Step | Question it answers | Time cost |
|---|---|---|
| 1. ActiveControllerCount | Is there a controller? | seconds |
| 2. Broker count | How many brokers lost? | seconds |
| 3. Client logs | Transient failover or sustained? | 1-2 min |
| 4. Recovery trend | Self-healing? | 2-3 min |
| 5. Quorum state | Control-plane emergency? | 1-2 min |
5. Safe remediations
| Situation | Safe action |
|---|---|
| Single broker down, recovering | Monitor; confirm ISR and leaders restore. Ensure clients have retries configured |
| Client-side connection churn | Verify bootstrap servers list all brokers and retries are set; no broker action needed |
| Controller re-election blip | Wait; it completes in seconds to a minute |
| Quorum loss or multiple brokers down | Escalate immediately; do not attempt control-plane recovery yourself |
6. Escalation trigger
Page on-call engineering immediately if:
ActiveControllerCountis 0, or not exactly 1 for more than a minute or two.- More than one broker is down, or a single broker does not recover within ~15 minutes.
- KRaft controllers have lost majority (quorum loss): this is a cluster-wide emergency.
- Offline partitions accompany the broker loss (cross-check Under-Replicated and Offline Partitions).
7. Relevant commands and exhibits
# Cluster and broker view
kafka-broker-api-versions.sh --bootstrap-server $BROKER | head
# Controller and quorum status (KRaft)
kafka-metadata-quorum.sh --bootstrap-server $BROKER describe --status
# Healthy KRaft quorum: a leader plus followers, all current
ClusterId: abc123
LeaderId: 1
CurrentVoters: [1,2,3]
CurrentObservers: []
MSK CloudWatch metrics: ActiveControllerCount (must be 1), broker-level uptime/health, and OfflinePartitionsCount.
8. Guided practical
Reproduce single-broker failover in the local three-broker lab.
- In the three-broker lab from Local Lab, produce to
orderswith a running consumer. docker stop kafka-2and watch client logs: briefNOT_LEADER_FOR_PARTITIONretries, then recovery as leadership moves.- Confirm produce/consume continues throughout, because retries and surviving replicas cover the gap.
docker start kafka-2and confirm it rejoins and ISR refills.- Note that stopping a majority of the single-node lab’s control plane would halt metadata changes, illustrating quorum loss.