
Under-Replicated and Offline Partitions

Diagnose ISR shrink, UnderReplicatedPartitions, and OfflinePartitionsCount, and understand the data-loss risk each represents.


1. Symptom

A CloudWatch alarm fires: UnderReplicatedPartitions > 0 or, worse, OfflinePartitionsCount > 0. Producers on acks=all may start seeing NotEnoughReplicasException, and in the offline case some partitions stop serving reads and writes entirely.

These are cluster-health alerts, not application bugs. The goal is to distinguish recoverable replica lag (under-replicated) from availability loss (offline), because the two differ sharply in urgency.


2. Likely causes

| Cause | How it manifests |
|---|---|
| A broker is down or restarting | Its replicas drop out of ISR; UnderReplicatedPartitions rises |
| Broker slow (disk, network, GC) | Follower cannot keep up, ISR shrinks without a full outage |
| Loss of enough brokers for a partition | No leader can be elected: OfflinePartitionsCount rises |
| Replication factor 1 topic | Any broker loss immediately makes those partitions offline |

Offline partitions almost always mean you have lost more brokers than a partition’s replication can tolerate, or a partition had no redundancy to begin with.


3. How it manifests to the Spring app

| Condition | What the service sees |
|---|---|
| Under-replicated, ISR still meets min.insync.replicas | Normal operation; durability margin reduced |
| ISR below min.insync.replicas | acks=all producers get NotEnoughReplicasException and retry |
| Offline partition | Producers and consumers for that partition block or time out |
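
The three conditions in the table above can be expressed as a small decision function. This is only a sketch: the helper names isr_size and producer_outcome, their argument order, and the labels they print are hypothetical, and the leader markers "none" and "-1" are assumed to be how your kafka-topics version reports a missing leader.

```shell
# Count broker ids in a comma-separated ISR list; an empty list counts as 0.
isr_size() {
  if [ -z "$1" ]; then
    echo 0
    return
  fi
  echo "$1" | awk -F, '{ print NF }'
}

# Usage: producer_outcome <leader> <isr list> <min.insync.replicas>
# Prints what an acks=all producer should expect for that partition.
producer_outcome() {
  local leader="$1" isr="$2" min_isr="$3" size
  # No leader means the partition is offline for producers and consumers.
  if [ -z "$leader" ] || [ "$leader" = "none" ] || [ "$leader" = "-1" ]; then
    echo "OFFLINE"
    return
  fi
  size="$(isr_size "$isr")"
  # ISR smaller than min.insync.replicas: acks=all writes are rejected.
  if [ "$size" -lt "$min_isr" ]; then
    echo "NOT_ENOUGH_REPLICAS"
  else
    echo "OK"
  fi
}
```

For example, producer_outcome 1 "1,3" 2 reports OK because two in-sync replicas still satisfy a minimum of two, even though the partition is under-replicated.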

4. Diagnostic steps

  1. Check the scope in CloudWatch. OfflinePartitionsCount > 0 is a live outage; treat it as urgent. UnderReplicatedPartitions > 0 on its own is less urgent, but the cluster is still degraded.
  2. Confirm broker health: ActiveControllerCount should be 1, and check how many brokers are reporting. A missing broker explains most ISR shrink.
  3. Describe the affected topics: run kafka-topics --describe --topic orders and compare Isr to Replicas. Note which broker id is missing from Isr.
  4. Check whether it is recovering. If the down broker is coming back, ISR should refill on its own within minutes. Watch the count trend.
  5. Check the topic’s replication factor. RF 1 topics cannot be under-replicated; they go straight to offline. That is a design fault to flag.

| Step | Question it answers | Time cost |
|---|---|---|
| 1. CloudWatch scope | Outage or just degraded? | seconds |
| 2. Broker/controller health | How many brokers are up? | 1 min |
| 3. --describe topics | Which replicas/brokers are missing? | 1-2 min |
| 4. Trend | Is it self-healing? | 2-3 min |
| 5. RF check | Was there redundancy at all? | 1 min |
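
Step 3 is easier on a topic with many partitions if the describe output is labelled automatically. The following is a minimal sketch, assuming the per-partition line format shown in section 7; the function name classify_partitions and its three labels are hypothetical, not part of any Kafka tool.

```shell
# Reads kafka-topics --describe output on stdin and prints one line per
# partition: "<topic>-<partition> HEALTHY|UNDER_REPLICATED|OFFLINE".
classify_partitions() {
  awk '
    # Return the token after key, or "" when it is missing or is itself a key.
    function value_of(key,    i) {
      for (i = 1; i <= NF; i++)
        if ($i == key)
          return (i < NF && $(i + 1) !~ /:$/) ? $(i + 1) : ""
      return ""
    }
    {
      partition = value_of("Partition:")
      if (partition == "") next            # skip topic summary lines
      topic = value_of("Topic:")
      leader = value_of("Leader:")
      replicas = split(value_of("Replicas:"), r, ",")
      in_sync = split(value_of("Isr:"), s, ",")
      # Assumption: a missing leader is printed as "none" or "-1".
      if (leader == "" || leader == "none" || leader == "-1") status = "OFFLINE"
      else if (in_sync < replicas) status = "UNDER_REPLICATED"
      else status = "HEALTHY"
      printf "%s-%s %s\n", topic, partition, status
    }
  '
}
```

A typical use would be to pipe the describe command from section 7 into it, for example kafka-topics.sh --bootstrap-server $BROKER --describe --topic orders | classify_partitions.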

5. Safe remediations

| Situation | Safe action |
|---|---|
| One broker down, others healthy, ISR refilling | Wait and monitor; MSK replaces failed brokers automatically |
| Under-replicated but stable, no broker obviously down | Gather exhibits and escalate; do not force actions |
| Offline partitions | Escalate immediately; this is an availability incident |
| RF 1 topic exposed | Flag for a replication-factor increase (a planned change) |

As the support tier, your safe actions are monitoring recovery and escalating. Partition reassignment, unclean leader election, and RF changes are engineering-owned.
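
Flagging RF 1 topics is a read-only check, so it fits within the support tier's safe actions. A rough sketch follows; it assumes the describe output includes a per-topic summary line carrying a ReplicationFactor: field, and the function name rf1_topics is hypothetical.

```shell
# Reads kafka-topics --describe output on stdin and prints the name of every
# topic whose summary line reports ReplicationFactor: 1.
rf1_topics() {
  awk '
    {
      topic = ""; rf = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Topic:") topic = $(i + 1)
        if ($i == "ReplicationFactor:") rf = $(i + 1)
      }
      # Per-partition lines have no ReplicationFactor: field and are ignored.
      if (topic != "" && rf == "1") print topic
    }
  '
}
```

Running the describe command without --topic and piping it into rf1_topics gives a list to attach to the escalation; changing the replication factor itself stays with engineering.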


6. Escalation trigger

Page on-call engineering immediately if:

  • OfflinePartitionsCount > 0 at all: this is a live availability loss.
  • UnderReplicatedPartitions stays elevated for more than ~15 minutes with no broker recovering.
  • ActiveControllerCount is not exactly 1 (see Broker Down, Controller Failover).
  • Anyone proposes unclean leader election or an RF change to resolve it.
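
The three metric-based triggers above can be written down as a checklist function, which is useful when turning this runbook into an alert rule. This is a sketch only: the name should_page, its argument order, and its output strings are hypothetical, it treats "~15 minutes" as a hard threshold, and it does not model whether a broker is currently recovering. The fourth trigger is a human decision and is not covered.

```shell
# Usage: should_page <OfflinePartitionsCount> <UnderReplicatedPartitions> \
#                    <minutes under-replicated> <ActiveControllerCount>
# Prints "PAGE: <reason>" or "MONITOR".
should_page() {
  local offline="$1" urp="$2" urp_minutes="$3" controllers="$4"
  if [ "$offline" -gt 0 ]; then
    echo "PAGE: OfflinePartitionsCount > 0"
  elif [ "$controllers" -ne 1 ]; then
    echo "PAGE: ActiveControllerCount is $controllers"
  elif [ "$urp" -gt 0 ] && [ "$urp_minutes" -gt 15 ]; then
    echo "PAGE: under-replicated for $urp_minutes minutes"
  else
    echo "MONITOR"
  fi
}
```

For example, should_page 0 3 5 1 prints MONITOR, while the same counts sustained for 20 minutes print a PAGE line.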

7. Relevant commands and exhibits

kafka-topics.sh --bootstrap-server $BROKER --describe --topic orders
# Healthy: Isr equals Replicas
Topic: orders  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3

# Under-replicated: broker 2 missing from Isr
Topic: orders  Partition: 1  Leader: 1  Replicas: 1,2,3  Isr: 1,3

# Offline: no leader
Topic: orders  Partition: 2  Leader: none  Replicas: 2,3  Isr:

MSK CloudWatch metrics: UnderReplicatedPartitions, OfflinePartitionsCount, ActiveControllerCount, KafkaDataLogsDiskUsed (a full disk causes ISR shrink).
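
kafka-topics can also filter the describe output to just the problem partitions. The wrapper below is a sketch: the function name triage_partitions and the KAFKA_TOPICS variable are hypothetical conveniences, and the three filter flags are assumed to be available in the Kafka version installed on your admin host.

```shell
# Path to the CLI; override if the tool is installed as kafka-topics.
KAFKA_TOPICS="${KAFKA_TOPICS:-kafka-topics.sh}"

# Usage: triage_partitions <bootstrap-server>
# Read-only: prints the partitions matching each unhealthy condition.
triage_partitions() {
  local broker="$1"
  echo "# under-replicated partitions"
  "$KAFKA_TOPICS" --bootstrap-server "$broker" --describe --under-replicated-partitions
  echo "# partitions below min.insync.replicas"
  "$KAFKA_TOPICS" --bootstrap-server "$broker" --describe --under-min-isr-partitions
  echo "# partitions with no available leader"
  "$KAFKA_TOPICS" --bootstrap-server "$broker" --describe --unavailable-partitions
}
```

Calling triage_partitions "$BROKER" gives three sections of output that can be pasted into an escalation as exhibits; an empty section means no partition matched that condition.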


8. Guided practical

Reproduce ISR shrink in the local three-broker lab.

  1. Create orders with RF 3 and min.insync.replicas 2 in the three-broker lab from Local Lab.
  2. Run kafka-topics --describe --topic orders and confirm Isr lists all three brokers.
  3. Stop one broker (docker stop kafka-2) and re-describe: the partition is now under-replicated but still serves, because two replicas remain.
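  4. Stop a second broker and re-describe: the surviving replica can still lead the partition, but ISR is now below min.insync.replicas, so acks=all producers get NotEnoughReplicasException. A partition goes offline only when no in-sync replica is left to lead it, as happens if the third broker is stopped too.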
  5. Restart the brokers and watch ISR refill.
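
The lab steps above can be scripted roughly as follows. This is a sketch under assumptions: the bootstrap address localhost:9092 and the partition count of 3 are guesses about the lab, and the function names and the KAFKA_TOPICS and BROKER variables are hypothetical. Only the container name kafka-2 comes from step 3.

```shell
KAFKA_TOPICS="${KAFKA_TOPICS:-kafka-topics.sh}"
BROKER="${BROKER:-localhost:9092}"   # assumed lab bootstrap address

# Step 1: create orders with RF 3 and min.insync.replicas 2.
lab_create_orders() {
  "$KAFKA_TOPICS" --bootstrap-server "$BROKER" --create --topic orders \
    --partitions 3 --replication-factor 3 --config min.insync.replicas=2
}

# Steps 2 to 5: re-run after each broker stop or restart and compare Isr to Replicas.
lab_describe_orders() {
  "$KAFKA_TOPICS" --bootstrap-server "$BROKER" --describe --topic orders
}

# Between describes, stop and restart brokers by hand, for example:
#   docker stop kafka-2
#   docker start kafka-2
```

Keeping the broker stops manual is deliberate: the point of the exercise is to watch the Isr column change between describes.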

Next: Broker Down, Controller Failover, KRaft Quorum Loss.