Kafka Under-Replicated and Offline Partitions: Incident Playbook

1. Symptom

A CloudWatch alarm fires: UnderReplicatedPartitions > 0 or, worse, OfflinePartitionsCount > 0. Producers on acks=all may start seeing NotEnoughReplicasException once a partition's ISR falls below min.insync.replicas, and in the offline case the affected partitions stop serving reads and writes entirely.

These are cluster-health alerts, not application bugs. The goal is to distinguish recoverable replica lag (under-replicated) from availability loss (offline), because the two differ sharply in urgency.

2. Likely causes

Cause	How it manifests
A broker is down or restarting	Its replicas drop out of ISR; `UnderReplicatedPartitions` rises
Broker slow (disk, network, GC)	Follower cannot keep up, ISR shrinks without a full outage
Loss of enough brokers for a partition	No leader can be elected: `OfflinePartitionsCount` rises
Replication factor 1 topic	Losing the broker that hosts a partition takes that partition offline immediately

Offline partitions almost always mean you have lost more brokers than a partition’s replication can tolerate, or a partition had no redundancy to begin with.

3. How it manifests to the Spring app

Condition	What the service sees
Under-replicated, ISR still meets min	Normal operation; durability margin reduced
ISR below `min.insync.replicas`	`acks=all` producers get `NotEnoughReplicasException` and retry
Offline partition	Producers and consumers for that partition block or time out
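
The three conditions in the table can be told apart from four values per partition. The sketch below is illustrative only: `classify_partition` and its argument order are hypothetical names, not part of any Kafka tooling, and the extra "healthy" case is added for completeness.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify one partition from its replica count,
# ISR size, the topic's min.insync.replicas, and whether it has a leader.
# Usage: classify_partition REPLICA_COUNT ISR_COUNT MIN_ISR HAS_LEADER(yes|no)
classify_partition() {
  local replicas=$1 isr=$2 min_isr=$3 has_leader=$4
  if [ "$has_leader" != "yes" ]; then
    echo "offline"            # no leader: producers and consumers block or time out
  elif [ "$isr" -lt "$min_isr" ]; then
    echo "under-min-isr"      # acks=all producers get NotEnoughReplicasException
  elif [ "$isr" -lt "$replicas" ]; then
    echo "under-replicated"   # still serving, durability margin reduced
  else
    echo "healthy"
  fi
}
```

For example, `classify_partition 3 2 2 yes` prints `under-replicated`, while `classify_partition 3 1 2 yes` prints `under-min-isr`.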

4. Diagnostic steps

Check the scope in CloudWatch. OfflinePartitionsCount > 0 is a live outage; treat it as urgent. Under-replicated only is less urgent but still degraded.
Confirm broker health: ActiveControllerCount should be 1, and check how many brokers are reporting. A missing broker explains most ISR shrink.
Describe the affected topics: run kafka-topics.sh --bootstrap-server $BROKER --describe --topic orders and compare Isr to Replicas. Note which broker id is missing from ISR.
Check whether it is recovering. If the down broker is coming back, ISR should refill on its own within minutes. Watch the count trend.
Check the topic’s replication factor. RF 1 topics cannot be under-replicated; they go straight to offline. That is a design fault to flag.

Step	Question it answers	Time cost
1. CloudWatch scope	Outage or just degraded?	seconds
2. Broker/controller health	How many brokers are up?	1 min
3. `--describe` topics	Which replicas/brokers are missing?	1-2 min
4. Trend	Is it self-healing?	2-3 min
5. RF check	Was there redundancy at all?	1 min

5. Safe remediations

Situation	Safe action
One broker down, others healthy, ISR refilling	Wait and monitor; MSK replaces failed brokers automatically
Under-replicated but stable, no broker obviously down	Gather exhibits and escalate; do not force actions
Offline partitions	Escalate immediately; this is an availability incident
RF 1 topic exposed	Flag for a replication-factor increase (a planned change)

As support tier your safe actions are monitoring recovery and escalating. Partition reassignment, unclean leader election, and RF changes are engineering-owned.

6. Escalation trigger

Page on-call engineering immediately if:

OfflinePartitionsCount > 0 at all: this is a live availability loss.
UnderReplicatedPartitions stays elevated for more than ~15 minutes with no broker recovering.
ActiveControllerCount is not exactly 1 (see Broker Down, Controller Failover).
Anyone proposes unclean leader election or an RF change to resolve it.
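
The three metric-based triggers above can be written as a small decision function, which is useful when recording why a page was or was not sent. This is a sketch under assumptions: `should_page` and its arguments are hypothetical, the values are read off CloudWatch by hand, and the fourth trigger (someone proposing unclean leader election or an RF change) is a human judgment that is deliberately not encoded.

```shell
#!/usr/bin/env bash
# Hypothetical helper encoding the metric-based escalation triggers.
# Usage: should_page OFFLINE_COUNT URP_COUNT URP_ELEVATED_MINUTES \
#                    ACTIVE_CONTROLLERS BROKER_RECOVERING(yes|no)
should_page() {
  local offline=$1 urp=$2 urp_minutes=$3 controllers=$4 recovering=$5
  if [ "$offline" -gt 0 ]; then
    echo "page: offline partitions"
  elif [ "$controllers" -ne 1 ]; then
    echo "page: ActiveControllerCount is $controllers"
  elif [ "$urp" -gt 0 ] && [ "$urp_minutes" -gt 15 ] && [ "$recovering" != "yes" ]; then
    echo "page: under-replicated for $urp_minutes min"
  else
    echo "monitor"
  fi
}
```

For example, `should_page 0 4 20 1 no` prints `page: under-replicated for 20 min`, whereas `should_page 0 4 5 1 no` prints `monitor`.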

7. Relevant commands and exhibits

kafka-topics.sh --bootstrap-server $BROKER --describe --topic orders

# Healthy: Isr equals Replicas
Topic: orders  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3

# Under-replicated: broker 2 missing from Isr
Topic: orders  Partition: 1  Leader: 1  Replicas: 1,2,3  Isr: 1,3

# Offline: no leader
Topic: orders  Partition: 2  Leader: none  Replicas: 2,3  Isr:
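
On a topic with many partitions, comparing Isr to Replicas by eye is error-prone. The filter below is a minimal sketch that reads `--describe` output on stdin and prints only the unhealthy partitions. `flag_partitions` is a hypothetical name, and the parsing assumes the `Label: value` layout shown in the exhibits above. Recent Kafka releases also accept `--under-replicated-partitions` and `--unavailable-partitions` on `kafka-topics.sh --describe`, which may be simpler where available.

```shell
#!/usr/bin/env bash
# Hypothetical filter: read kafka-topics --describe output on stdin and
# print one line per offline or under-replicated partition.
flag_partitions() {
  awk '
    /Partition:/ {
      topic = part = leader = replicas = isr = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Topic:")     topic = $(i + 1)
        if ($i == "Partition:") part = $(i + 1)
        if ($i == "Leader:")    leader = $(i + 1)
        if ($i == "Replicas:")  replicas = $(i + 1)
        # An empty Isr is followed by another label or by end of line.
        if ($i == "Isr:" && $(i + 1) !~ /:$/) isr = $(i + 1)
      }
      if (leader == "none" || leader == "-1") {
        printf "%s-%s OFFLINE replicas=%s\n", topic, part, replicas
        next
      }
      split("", in_isr)                      # reset the ISR lookup set
      n = split(isr, members, ",")
      for (j = 1; j <= n; j++) in_isr[members[j]] = 1
      missing = ""
      n = split(replicas, members, ",")
      for (j = 1; j <= n; j++) {
        if (!(members[j] in in_isr)) {
          sep = (missing == "") ? "" : ","
          missing = missing sep members[j]
        }
      }
      if (missing != "")
        printf "%s-%s UNDER-REPLICATED missing=%s\n", topic, part, missing
    }'
}
```

Typical use would be `kafka-topics.sh --bootstrap-server "$BROKER" --describe --topic orders | flag_partitions`. Healthy partitions produce no output.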

MSK CloudWatch metrics: UnderReplicatedPartitions, OfflinePartitionsCount, ActiveControllerCount, KafkaDataLogsDiskUsed (a full disk causes ISR shrink).

8. Guided practical

Reproduce ISR shrink in the local three-broker lab.

Create orders with RF 3 and min.insync.replicas 2 in the three-broker lab from Local Lab.
Run kafka-topics.sh --describe on orders and confirm Isr lists all three brokers.
Stop one broker (docker stop kafka-2) and re-describe: the partition is now under-replicated but still serves, because two replicas remain.
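Stop a second broker and re-describe: the partition still has a leader, but ISR is down to one replica, below min.insync.replicas 2, so acks=all producers now get NotEnoughReplicasException. A partition goes offline only once every broker holding one of its replicas is down.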
Restart the brokers and watch ISR refill.
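
The lab steps can be scripted as follows. This is a sketch, not the lab's official tooling: the container name `kafka-3`, the partition count of 3, and the `run`, `describe_orders`, and `isr_shrink_lab` helpers are assumptions, and `$BROKER` must point at a broker that stays up (for example the first one). Setting `DRY_RUN=1` prints the commands instead of executing them. If the lab brokers also act as KRaft controllers, stopping two of three may cost the controller quorum, in which case the describe call itself can fail.

```shell
#!/usr/bin/env bash
# run CMD...: execute the command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

describe_orders() {
  run kafka-topics.sh --bootstrap-server "$BROKER" --describe --topic orders
}

isr_shrink_lab() {
  # RF 3 with min.insync.replicas 2, as in the first lab step.
  run kafka-topics.sh --bootstrap-server "$BROKER" --create --topic orders \
      --partitions 3 --replication-factor 3 --config min.insync.replicas=2
  describe_orders              # expect Isr to list all three brokers
  run docker stop kafka-2
  describe_orders              # expect broker 2 missing from Isr
  run docker stop kafka-3      # assumed container name
  describe_orders              # one replica left, below min.insync.replicas
  run docker start kafka-2 kafka-3
  describe_orders              # ISR should refill; re-run until it does
}
```

ISR changes are not instantaneous, so in a real run it may be necessary to wait a few seconds and repeat `describe_orders` after each stop or start.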

Next: Broker Down, Controller Failover, KRaft Quorum Loss.