Read time: ~

Broker Down, Controller Failover, KRaft Quorum Loss

Diagnose broker loss, controller re-election, and KRaft metadata quorum falling below majority, and know what is self-healing versus an emergency.


1. Symptom

A broker disappears from the cluster: ActiveControllerCount alarms (not equal to 1), a broker count drop, or client logs full of disconnected and NOT_LEADER_FOR_PARTITION. In the worst case the KRaft metadata quorum loses majority and the cluster cannot elect a controller at all, which halts metadata changes cluster-wide.

The goal is to tell apart a routine single-broker loss (usually self-healing on MSK) from a control-plane emergency (quorum loss), because the second is a full-cluster incident.


2. Likely causes

CauseHow it manifests
Single broker failureBroker count drops by one; partitions with a replica there re-elect leaders
Controller broker failureBrief metadata pause while a new controller is elected
KRaft quorum loss (majority of controllers down)No controller can be elected; metadata operations stall
Network partition between brokersBrokers see each other as down; ISR churn and leader changes

On KRaft, the controllers form a Raft quorum, so a three-controller setup tolerates losing one. Losing two of three means no majority and no controller, from Control Plane.


3. How it manifests to the Spring app

ConditionWhat the service sees
Single broker downBrief retriable errors, then automatic recovery as leaders move
Controller re-electionShort pause on metadata-dependent operations (topic create, etc.)
Quorum lossProducers/consumers keep working on existing leaders but no new leaders can be elected; failures accumulate

Well-configured clients with retries (from Reliable Producing) ride out a single broker loss with only a latency blip.


4. Diagnostic steps

  1. Check ActiveControllerCount in CloudWatch. Exactly 1 is healthy. 0 means no controller (emergency); more than 1 briefly during failover is expected, but sustained is wrong.
  2. Check the broker count. How many of the expected brokers are reporting? One missing is routine; multiple missing is serious.
  3. Check client logs for NOT_LEADER_FOR_PARTITION and disconnected. Transient bursts during failover are normal; sustained is not.
  4. Check whether it is recovering. MSK auto-replaces failed brokers; a single loss should recover within minutes as leaders move to surviving replicas.
  5. Assess quorum. If controllers are down to no majority, metadata operations hang: this is the escalate-now condition.
StepQuestion it answersTime cost
1. ActiveControllerCountIs there a controller?seconds
2. Broker countHow many brokers lost?seconds
3. Client logsTransient failover or sustained?1-2 min
4. Recovery trendSelf-healing?2-3 min
5. Quorum stateControl-plane emergency?1-2 min

5. Safe remediations

SituationSafe action
Single broker down, recoveringMonitor; confirm ISR and leaders restore. Ensure clients have retries configured
Client-side connection churnVerify bootstrap servers list all brokers and retries are set; no broker action needed
Controller re-election blipWait; it completes in seconds to a minute
Quorum loss or multiple brokers downEscalate immediately; do not attempt control-plane recovery yourself

6. Escalation trigger

Page on-call engineering immediately if:

  • ActiveControllerCount is 0, or not exactly 1 for more than a minute or two.
  • More than one broker is down, or a single broker does not recover within ~15 minutes.
  • KRaft controllers have lost majority (quorum loss): this is a cluster-wide emergency.
  • Offline partitions accompany the broker loss (cross-check Under-Replicated and Offline Partitions).

7. Relevant commands and exhibits

# Cluster and broker view
kafka-broker-api-versions.sh --bootstrap-server $BROKER | head

# Controller and quorum status (KRaft)
kafka-metadata-quorum.sh --bootstrap-server $BROKER describe --status
# Healthy KRaft quorum: a leader plus followers, all current
ClusterId:              abc123
LeaderId:               1
CurrentVoters:          [1,2,3]
CurrentObservers:       []

MSK CloudWatch metrics: ActiveControllerCount (must be 1), broker-level uptime/health, and OfflinePartitionsCount.


8. Guided practical

Reproduce single-broker failover in the local three-broker lab.

  1. In the three-broker lab from Local Lab, produce to orders with a running consumer.
  2. docker stop kafka-2 and watch client logs: brief NOT_LEADER_FOR_PARTITION retries, then recovery as leadership moves.
  3. Confirm produce/consume continues throughout, because retries and surviving replicas cover the gap.
  4. docker start kafka-2 and confirm it rejoins and ISR refills.
  5. Note that stopping a majority of the single-node lab’s control plane would halt metadata changes, illustrating quorum loss.

Next:Rebalance Storms and Group Instability.