Disk Pressure, Retention, and Segment Issues

Diagnose broker disk filling, retention misconfiguration, and log directory failure before a broker goes offline.


1. Symptom

A CloudWatch alarm fires: KafkaDataLogsDiskUsed > 85% on one or more brokers. Left unchecked, a broker whose log directory fills stops accepting writes and can go offline, cascading into under-replicated and then offline partitions. This is a slow-moving incident that becomes an outage if ignored.

The goal is to understand why storage is growing and buy headroom safely, without deleting data that is still needed, using the retention concepts from Storage Internals.


2. Likely causes

Cause | How it manifests
Retention too long for the volume | Steady disk growth toward capacity over days
Throughput higher than planned | Faster-than-expected fill on all brokers
A topic with huge retention or retention.ms=-1 | One topic dominates disk usage
Compaction not keeping up | A compacted topic still growing
Skewed partitions | One broker fills faster due to uneven leadership
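
The first two causes come down to the same arithmetic: data retained is roughly ingress rate times retention time times replication factor. A minimal sketch of that estimate follows; the function name and the example figures are hypothetical, and it assumes evenly spread partitions while ignoring index files, compression, and compaction.

```shell
# Hypothetical helper: rough steady-state disk per broker, in GB (decimal).
# Assumes even partition distribution; ignores indexes and compression.
# Usage: est_disk_gb_per_broker <ingress_MB_per_sec> <retention_days> <replication_factor> <brokers>
est_disk_gb_per_broker() {
  awk -v mbps="$1" -v days="$2" -v rf="$3" -v n="$4" 'BEGIN {
    total_mb = mbps * 86400 * days * rf   # MB retained across the whole cluster
    printf "%.0f\n", total_mb / n / 1000  # per broker, MB -> GB
  }'
}
```

For example, `est_disk_gb_per_broker 5 7 3 6` prints 1512: 5 MB/s kept for 7 days at replication factor 3 needs about 1.5 TB on each of 6 brokers. Comparing that figure with the provisioned volume shows whether retention or throughput is the likelier culprit.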

3. How it manifests to the Spring app

Condition | What the service sees
Disk high but under limit | Nothing yet; this is the warning window
Broker log dir full | Producers to partitions led by that broker fail (TimeoutException, then NotEnoughReplicas)
Broker offline from full disk | Under-replicated/offline partitions (see Under-Replicated Partitions)

4. Diagnostic steps

  1. Check the trend, not just the level. Is disk climbing steadily (retention/throughput), or did it jump (a runaway topic, or a compacted topic whose log cleaning has stalled)?
  2. Find the biggest consumers of disk. Identify which topics and partitions hold the most data.
  3. Check retention config on the largest topics with kafka-configs --describe: look for long retention.ms, retention.bytes unset, or retention.ms=-1.
  4. Check cleanup policy. A compact topic that keeps growing may have compaction lagging or a key cardinality problem.
  5. Assess urgency. Above ~85% and climbing is act-now; a slow climb with days of headroom is plan-and-fix.

Step | Question it answers | Time cost
1. Trend | Steady growth or a jump? | 1-2 min
2. Top topics | What is using the disk? | 2-3 min
3. Retention config | Is retention misconfigured? | 1-2 min
4. Cleanup policy | Is compaction keeping up? | 1-2 min
5. Urgency | Emergency or planned fix? | 1 min
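
To turn the trend check into a number, two readings of KafkaDataLogsDiskUsed taken some hours apart can be extrapolated linearly. The sketch below is illustrative only: the function name is hypothetical, and a straight-line projection is an assumption that will be wrong for bursty traffic or a topic that has just gone runaway.

```shell
# Hypothetical helper: hours until 100% disk, by linear extrapolation.
# Usage: hours_until_full <used_pct_now> <used_pct_earlier> <hours_between_samples>
hours_until_full() {
  awk -v now="$1" -v prev="$2" -v h="$3" 'BEGIN {
    if (h <= 0 || now <= prev) { print "not-growing"; exit }
    rate = (now - prev) / h               # percentage points per hour
    printf "%.0f\n", (100 - now) / rate   # remaining headroom / growth rate
  }'
}
```

With readings of 80% yesterday and 85% now, `hours_until_full 85 80 24` prints 72, which is roughly three days of headroom if nothing changes.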

5. Safe remediations

Situation | Safe action
Retention longer than needed on a topic | Reduce retention.ms / set retention.bytes (a config change with owner sign-off); old segments age out
Volume genuinely too small | Expand MSK storage or enable storage autoscaling / tiered storage (engineering-owned)
Runaway topic (retention.ms=-1) | Set a sane retention with the owner's agreement
Broker near full now | Escalate. A retention change buys time only gradually, because space is reclaimed as segments roll and older ones become eligible for deletion

6. Escalation trigger

Page on-call engineering if:

  • A broker is above ~90% and climbing, or already rejecting writes.
  • Freeing space needs storage expansion, autoscaling, or tiered-storage changes.
  • Disk pressure has already caused under-replicated or offline partitions.
  • Reducing retention would delete data other teams still depend on (get owner sign-off).
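
The disk-level thresholds in this runbook can be written down as a small decision helper. This is a sketch, not an official policy: the function name and the output labels are made up for illustration, it takes whole-number percentages only, and it covers just the disk thresholds (~85% and ~90%), not the other triggers listed above.

```shell
# Hypothetical triage helper based on the ~85% / ~90% guidance in this runbook.
# Usage: triage_disk <used_pct_now> <used_pct_earlier>   (integer percentages)
triage_disk() {
  now=$1
  earlier=$2
  climbing=0
  if [ "$now" -gt "$earlier" ]; then climbing=1; fi

  if [ "$climbing" -eq 1 ] && [ "$now" -ge 90 ]; then
    echo "page-on-call"     # above ~90% and climbing
  elif [ "$climbing" -eq 1 ] && [ "$now" -ge 85 ]; then
    echo "act-now"          # above ~85% and climbing
  elif [ "$climbing" -eq 1 ] || [ "$now" -ge 85 ]; then
    echo "plan-and-fix"     # growing with headroom, or high but flat
  else
    echo "monitor"
  fi
}
```

For instance, `triage_disk 92 88` prints page-on-call, while `triage_disk 70 65` prints plan-and-fix. A broker that is already rejecting writes should be escalated regardless of what the percentages say.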

7. Relevant commands and exhibits

# Effective retention for a topic
kafka-configs.sh --bootstrap-server "$BROKER" --describe \
  --entity-type topics --entity-name orders

# A runaway retention setting
retention.ms=-1        # no time-based deletion: unbounded growth unless retention.bytes is set
cleanup.policy=delete

# Reduce retention to 3 days (with sign-off)
kafka-configs.sh --bootstrap-server "$BROKER" --alter \
  --entity-type topics --entity-name orders \
  --add-config retention.ms=259200000
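
The value 259200000 is 3 days expressed in milliseconds. Hand-computed millisecond values are an easy place to drop a zero, so a tiny converter can help; the function name below is hypothetical.

```shell
# Hypothetical helper: convert whole days to milliseconds for retention.ms.
# Usage: days_to_ms <days>
days_to_ms() {
  echo $(( $1 * 24 * 60 * 60 * 1000 ))
}
```

`days_to_ms 3` prints 259200000, matching the command above, and the result can be substituted directly, for example `--add-config retention.ms="$(days_to_ms 3)"`.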

MSK CloudWatch metrics: KafkaDataLogsDiskUsed (percent of data-log disk used per broker); also watch UnderReplicatedPartitions for the downstream effect.
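
For diagnostic step 2 (finding the biggest consumers of disk), one option is to aggregate per-partition sizes by topic. The sketch below assumes that kafka-log-dirs.sh from the Apache Kafka distribution is available, that the cluster permits describing log directories, and that jq is installed; the helper name sum_by_topic is hypothetical. Sizes are summed across all brokers, so replicas are counted.

```shell
# Hypothetical helper: reads "<topic>-<partition> <size_bytes>" lines on stdin
# and prints "<topic> <total_bytes>", largest first.
sum_by_topic() {
  awk '{
    topic = $1
    sub(/-[0-9]+$/, "", topic)     # strip the trailing partition number
    total[topic] += $2
  }
  END {
    for (t in total) printf "%s %.0f\n", t, total[t]
  }' | sort -k2,2nr
}

# Possible usage (assumes jq and a cluster that allows DescribeLogDirs):
#   kafka-log-dirs.sh --bootstrap-server "$BROKER" --describe \
#     | grep '^{' \
#     | jq -r '.brokers[].logDirs[].partitions[] | "\(.partition) \(.size)"' \
#     | sum_by_topic | head
```

The `grep '^{'` step keeps only the JSON line, since the tool prints status lines before it. The topic at the top of the list is the first one whose retention settings are worth describing.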


8. Guided practical

Explore retention in the local lab (small-scale).

  1. In the local lab, create a topic with a tiny retention.bytes (for example, a few KB) and a small segment.bytes.
  2. Produce enough records to exceed it, then wait for the broker's periodic retention check (log.retention.check.interval.ms, 5 minutes by default) and re-describe: confirm old segments are deleted and the log stops growing.
  3. Set retention.ms=-1 on another topic and note that it never deletes, illustrating the runaway case.
  4. Read kafka-configs --describe to confirm the settings.
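
The lab steps above could be scripted roughly as follows. Everything specific here is an assumption: the bootstrap address, the topic name retention-lab, the sizes, and the tool flags, which follow the Apache Kafka 3.x distribution and may differ in other versions. segment.bytes is set to 64 KB rather than something smaller because the broker rejects a record batch larger than the segment size. Commands are only printed unless DRY_RUN=0 is set, so nothing runs by accident.

```shell
BOOTSTRAP="${BOOTSTRAP:-localhost:9092}"   # assumed local lab address
LAB_TOPIC="${LAB_TOPIC:-retention-lab}"    # hypothetical topic name

# Print the command by default; set DRY_RUN=0 to execute it against the lab.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Step 1: topic with a tiny retention.bytes and a small segment.bytes.
lab_create_topic() {
  run kafka-topics.sh --bootstrap-server "$BOOTSTRAP" --create \
    --topic "$LAB_TOPIC" --partitions 1 --replication-factor 1 \
    --config retention.bytes=10240 --config segment.bytes=65536
}

# Step 2: produce about 500 KB, well past the retention limit.
lab_produce() {
  run kafka-producer-perf-test.sh --topic "$LAB_TOPIC" \
    --num-records 5000 --record-size 100 --throughput -1 \
    --producer-props "bootstrap.servers=$BOOTSTRAP"
}

# Step 2 (after waiting): on-disk size of the topic's partitions.
lab_log_size() {
  run kafka-log-dirs.sh --bootstrap-server "$BOOTSTRAP" --describe \
    --topic-list "$LAB_TOPIC"
}

# Step 4: confirm the topic-level settings.
lab_describe() {
  run kafka-configs.sh --bootstrap-server "$BOOTSTRAP" --describe \
    --entity-type topics --entity-name "$LAB_TOPIC"
}
```

Running `lab_create_topic` with the defaults prints the kafka-topics.sh command for review; `DRY_RUN=0 lab_create_topic` executes it. For step 3, the same pattern with `--config retention.ms=-1` on a second topic shows the runaway case.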

Next: Auth Failures After Credential or Certificate Rotation.