Kafka Broker Down and Controller Failover: Incident Playbook

1. Symptom

A broker disappears from the cluster: ActiveControllerCount alarms (not equal to 1), a broker count drop, or client logs full of disconnected and NOT_LEADER_FOR_PARTITION. In the worst case the KRaft metadata quorum loses majority and the cluster cannot elect a controller at all, which halts metadata changes cluster-wide.

The goal is to tell apart a routine single-broker loss (usually self-healing on MSK) from a control-plane emergency (quorum loss), because the second is a full-cluster incident.

2. Likely causes

Cause	How it manifests
Single broker failure	Broker count drops by one; partitions with a replica there re-elect leaders
Controller broker failure	Brief metadata pause while a new controller is elected
KRaft quorum loss (majority of controllers down)	No controller can be elected; metadata operations stall
Network partition between brokers	Brokers see each other as down; ISR churn and leader changes

On KRaft, the controllers form a Raft quorum, so a three-controller setup tolerates losing one. Losing two of three means no majority and no controller, from Control Plane.

3. How it manifests to the Spring app

Condition	What the service sees
Single broker down	Brief retriable errors, then automatic recovery as leaders move
Controller re-election	Short pause on metadata-dependent operations (topic create, etc.)
Quorum loss	Producers/consumers keep working on existing leaders but no new leaders can be elected; failures accumulate

Well-configured clients with retries (from Reliable Producing) ride out a single broker loss with only a latency blip.

4. Diagnostic steps

Check ActiveControllerCount in CloudWatch. Exactly 1 is healthy. 0 means no controller (emergency); more than 1 briefly during failover is expected, but sustained is wrong.
Check the broker count. How many of the expected brokers are reporting? One missing is routine; multiple missing is serious.
Check client logs for NOT_LEADER_FOR_PARTITION and disconnected. Transient bursts during failover are normal; sustained is not.
Check whether it is recovering. MSK auto-replaces failed brokers; a single loss should recover within minutes as leaders move to surviving replicas.
Assess quorum. If controllers are down to no majority, metadata operations hang: this is the escalate-now condition.

Step	Question it answers	Time cost
1. ActiveControllerCount	Is there a controller?	seconds
2. Broker count	How many brokers lost?	seconds
3. Client logs	Transient failover or sustained?	1-2 min
4. Recovery trend	Self-healing?	2-3 min
5. Quorum state	Control-plane emergency?	1-2 min

5. Safe remediations

Situation	Safe action
Single broker down, recovering	Monitor; confirm ISR and leaders restore. Ensure clients have retries configured
Client-side connection churn	Verify bootstrap servers list all brokers and retries are set; no broker action needed
Controller re-election blip	Wait; it completes in seconds to a minute
Quorum loss or multiple brokers down	Escalate immediately; do not attempt control-plane recovery yourself

6. Escalation trigger

Page on-call engineering immediately if:

ActiveControllerCount is 0, or not exactly 1 for more than a minute or two.
More than one broker is down, or a single broker does not recover within ~15 minutes.
KRaft controllers have lost majority (quorum loss): this is a cluster-wide emergency.
Offline partitions accompany the broker loss (cross-check Under-Replicated and Offline Partitions).

7. Relevant commands and exhibits

# Cluster and broker view
kafka-broker-api-versions.sh --bootstrap-server $BROKER | head

# Controller and quorum status (KRaft)
kafka-metadata-quorum.sh --bootstrap-server $BROKER describe --status

# Healthy KRaft quorum: a leader plus followers, all current
ClusterId:              abc123
LeaderId:               1
CurrentVoters:          [1,2,3]
CurrentObservers:       []

MSK CloudWatch metrics: ActiveControllerCount (must be 1), broker-level uptime/health, and OfflinePartitionsCount.

8. Guided practical

Reproduce single-broker failover in the local three-broker lab.

In the three-broker lab from Local Lab, produce to orders with a running consumer.
docker stop kafka-2 and watch client logs: brief NOT_LEADER_FOR_PARTITION retries, then recovery as leadership moves.
Confirm produce/consume continues throughout, because retries and surviving replicas cover the gap.
docker start kafka-2 and confirm it rejoins and ISR refills.
Note that stopping a majority of the single-node lab’s control plane would halt metadata changes, illustrating quorum loss.

Next:Rebalance Storms and Group Instability.