Under-Replicated and Offline Partitions
Diagnose ISR shrink, UnderReplicatedPartitions, and OfflinePartitionsCount, and understand the data-loss risk each represents.
1. Symptom
A CloudWatch alarm fires: UnderReplicatedPartitions > 0 or, worse, OfflinePartitionsCount > 0. Producers on acks=all may start seeing NotEnoughReplicasException, and in the offline case some partitions stop serving reads and writes entirely.
These are cluster-health alerts, not application bugs. The goal is to distinguish recoverable replica lag (under-replicated) from an availability loss (offline), because the two differ sharply in urgency.
2. Likely causes
| Cause | How it manifests |
|---|---|
| A broker is down or restarting | Its replicas drop out of ISR; UnderReplicatedPartitions rises |
| Broker slow (disk, network, GC) | Follower cannot keep up, ISR shrinks without a full outage |
| Loss of enough brokers for a partition | No leader can be elected: OfflinePartitionsCount rises |
| Replication factor 1 topic | Loss of the broker hosting a partition immediately takes that partition offline |
Offline partitions almost always mean you have lost more brokers than a partition’s replication can tolerate, or a partition had no redundancy to begin with.
3. How it manifests to the Spring app
| Condition | What the service sees |
|---|---|
| Under-replicated, ISR still meets min | Normal operation; durability margin reduced |
| ISR below min.insync.replicas | acks=all producers get NotEnoughReplicasException and retry |
| Offline partition | Producers and consumers for that partition block or time out |
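The conditions in the table above can be sketched as a small classifier. This is an illustrative helper only: the function name `classify_partition` and its argument order are hypothetical, and it assumes you have already read the leader, ISR size, min.insync.replicas, and replica count for one partition.

```shell
# classify_partition LEADER ISR_COUNT MIN_ISR REPLICA_COUNT
# Hypothetical helper (not part of Kafka): maps one partition's state
# onto the conditions listed in the table.
classify_partition() {
  local leader="$1" isr="$2" min_isr="$3" replicas="$4"
  if [ "$leader" = "none" ] || [ "$leader" = "-1" ]; then
    # No leader: producers and consumers block or time out.
    echo "offline"
  elif [ "$isr" -lt "$min_isr" ]; then
    # Leader exists, but acks=all writes are rejected.
    echo "below-min-isr"
  elif [ "$isr" -lt "$replicas" ]; then
    # Still serving normally, with a reduced durability margin.
    echo "under-replicated"
  else
    echo "healthy"
  fi
}
```

For example, `classify_partition 1 2 2 3` describes a partition with two of three replicas in sync and min.insync.replicas 2, which is under-replicated but still accepting acks=all writes.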
4. Diagnostic steps
- Check the scope in CloudWatch. OfflinePartitionsCount > 0 is a live outage; treat it as urgent. Under-replicated only is less urgent but still degraded.
- Confirm broker health: ActiveControllerCount should be 1, and check how many brokers are reporting. A missing broker explains most ISR shrink.
- Describe the affected topics: run kafka-topics --describe --topic orders and compare Isr to Replicas. Note which broker id is missing from ISR.
- Check whether it is recovering. If the down broker is coming back, ISR should refill on its own within minutes. Watch the count trend.
- Check the topic’s replication factor. RF 1 topics cannot be under-replicated; they go straight to offline. That is a design fault to flag.
| Step | Question it answers | Time cost |
|---|---|---|
| 1. CloudWatch scope | Outage or just degraded? | seconds |
| 2. Broker/controller health | How many brokers are up? | 1 min |
| 3. --describe topics | Which replicas/brokers are missing? | 1-2 min |
| 4. Trend | Is it self-healing? | 2-3 min |
| 5. RF check | Was there redundancy at all? | 1 min |
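Step 3 (comparing Isr to Replicas) is easy to get wrong by eye on a topic with many partitions. The following is a minimal sketch that reads kafka-topics --describe output on stdin and prints which broker ids are missing from ISR for each partition. The function name `missing_from_isr` is hypothetical, and the parser assumes the "Partition: … Replicas: … Isr: …" line layout shown in this runbook's exhibits.

```shell
# missing_from_isr: reads `kafka-topics --describe` output on stdin.
# Hypothetical helper; prints one line per partition whose Isr is
# smaller than its Replicas list.
missing_from_isr() {
  awk '
    {
      part = ""; replicas = ""; isr = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Partition:") part = $(i + 1)
        if ($i == "Replicas:")  replicas = $(i + 1)
        # An empty Isr is followed by another label or nothing at all.
        if ($i == "Isr:" && $(i + 1) !~ /:$/) isr = $(i + 1)
      }
      if (part == "") next            # skip the topic summary line
      n = split(replicas, r, ",")
      m = split(isr, s, ",")
      split("", in_isr)               # portable way to clear an array
      for (j = 1; j <= m; j++) in_isr[s[j]] = 1
      missing = ""; sep = ""
      for (j = 1; j <= n; j++) {
        if (!(r[j] in in_isr)) { missing = missing sep r[j]; sep = "," }
      }
      if (missing != "") print "partition " part " missing " missing
    }
  '
}
```

A typical use would be to pipe the describe command for the affected topic into `missing_from_isr`. If the same broker id appears on every line, that broker is the one to check in step 2.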
5. Safe remediations
| Situation | Safe action |
|---|---|
| One broker down, others healthy, ISR refilling | Wait and monitor; MSK replaces failed brokers automatically |
| Under-replicated but stable, no broker obviously down | Gather exhibits and escalate; do not force actions |
| Offline partitions | Escalate immediately; this is an availability incident |
| RF 1 topic exposed | Flag for a replication-factor increase (a planned change) |
As support tier, your safe actions are monitoring recovery and escalating. Partition reassignment, unclean leader election, and RF changes are engineering-owned.
6. Escalation trigger
Page on-call engineering immediately if:
- OfflinePartitionsCount > 0 at all: this is a live availability loss.
- UnderReplicatedPartitions stays elevated for more than ~15 minutes with no broker recovering.
- ActiveControllerCount is not exactly 1 (see Broker Down, Controller Failover).
- Anyone proposes unclean leader election or an RF change to resolve it.
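The metric-based triggers above can be written down as a simple decision function, which may help when wiring them into a checklist or a script. This is a sketch under assumptions: the name `should_page` and its arguments are hypothetical, the values are ones you read off CloudWatch yourself, and the 15-minute threshold is the approximate figure from this section. The fourth trigger (someone proposing unclean leader election or an RF change) is a human judgment and is deliberately not encoded.

```shell
# should_page OFFLINE_COUNT URP_ELEVATED_MINUTES BROKER_RECOVERING ACTIVE_CONTROLLERS
# Hypothetical helper: prints either a reason to page or "monitor".
# BROKER_RECOVERING is "yes" or "no".
should_page() {
  local offline="$1" urp_minutes="$2" recovering="$3" controllers="$4"
  if [ "$offline" -gt 0 ]; then
    echo "page: offline partitions"
  elif [ "$controllers" -ne 1 ]; then
    echo "page: ActiveControllerCount is $controllers"
  elif [ "$urp_minutes" -gt 15 ] && [ "$recovering" != "yes" ]; then
    echo "page: under-replicated for ${urp_minutes} min with no broker recovering"
  else
    echo "monitor"
  fi
}
```

For instance, `should_page 0 20 yes 1` prints `monitor`, because a broker is already recovering and ISR is expected to refill on its own.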
7. Relevant commands and exhibits
kafka-topics.sh --bootstrap-server $BROKER --describe --topic orders
# Healthy: Isr equals Replicas
Topic: orders Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
# Under-replicated: broker 2 missing from Isr
Topic: orders Partition: 1 Leader: 1 Replicas: 1,2,3 Isr: 1,3
# Offline: no leader
Topic: orders Partition: 2 Leader: none Replicas: 2,3 Isr:
MSK CloudWatch metrics: UnderReplicatedPartitions, OfflinePartitionsCount, ActiveControllerCount, KafkaDataLogsDiskUsed (a full disk causes ISR shrink).
8. Guided practical
Reproduce ISR shrink in the local three-broker lab.
- Create orders with RF 3 and min.insync.replicas 2 in the three-broker lab from Local Lab.
- Run kafka-topics --describe and confirm Isr lists all three brokers.
- Stop one broker (docker stop kafka-2) and re-describe: the partition is now under-replicated but still serves, because two replicas remain in sync and min.insync.replicas is still met.
- Stop a second broker and observe that acks=all producers start failing with NotEnoughReplicasException, because min.insync.replicas can no longer be met. A partition only goes fully offline once no in-sync replica is left to lead it.
- Restart the brokers and watch ISR refill.
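To watch ISR refill in the lab without re-reading the describe output by hand, a small counter can be polled in a loop. The sketch below is an assumption-laden helper, not a Kafka tool: `count_under_replicated` is a hypothetical name, and it simply counts partitions whose Isr list is shorter than their Replicas list (which also includes partitions with no leader).

```shell
# count_under_replicated: reads `kafka-topics --describe` output on stdin
# and prints how many partitions have fewer Isr entries than Replicas.
# Hypothetical helper for the local lab only.
count_under_replicated() {
  awk '
    {
      part = ""; replicas = ""; isr = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Partition:") part = $(i + 1)
        if ($i == "Replicas:")  replicas = $(i + 1)
        if ($i == "Isr:" && $(i + 1) !~ /:$/) isr = $(i + 1)
      }
      if (part == "") next            # skip the topic summary line
      if (split(isr, s, ",") < split(replicas, r, ",")) count++
    }
    END { print count + 0 }
  '
}

# Example polling loop (commented out; assumes $BROKER is set as in section 7):
# while [ "$(kafka-topics.sh --bootstrap-server "$BROKER" --describe --topic orders \
#            | count_under_replicated)" -gt 0 ]; do sleep 5; done
```

The count should rise after each `docker stop` and fall back to 0 a short while after the brokers are restarted.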