Disk Pressure, Retention, and Segment Issues
Diagnose broker disk filling, retention misconfiguration, and log directory failure before a broker goes offline.
1. Symptom
A CloudWatch alarm fires: KafkaDataLogsDiskUsed > 85% on one or more brokers. Left unchecked, a broker whose log directory fills stops accepting writes and can go offline, cascading into under-replicated and then offline partitions. This is a slow-moving incident that becomes an outage if ignored.
The goal is to understand why storage is growing and buy headroom safely, without deleting data that is still needed, using the retention concepts from Storage Internals.
2. Likely causes
| Cause | How it manifests |
|---|---|
| Retention too long for the volume | Steady disk growth toward capacity over days |
| Throughput higher than planned | Faster-than-expected fill on all brokers |
| A topic with huge retention or retention.ms=-1 | One topic dominates disk usage |
| Compaction not keeping up | A compacted topic still growing |
| Skewed partitions | One broker fills faster due to uneven leadership |
3. How it manifests to the Spring app
| Condition | What the service sees |
|---|---|
| Disk high but under limit | Nothing yet; this is the warning window |
| Broker log dir full | Producers to partitions led by that broker fail (TimeoutException, then NotEnoughReplicasException) |
| Broker offline from full disk | Under-replicated/offline partitions (see Under-Replicated Partitions) |
4. Diagnostic steps
- Check the trend, not just the level. Is disk climbing steadily (retention/throughput) or did it jump (a runaway topic or a stuck consumer on a compacted topic)?
- Find the biggest consumers of disk. Identify which topics and partitions hold the most data.
- Check retention config on the largest topics with kafka-configs --describe: look for long retention.ms, retention.bytes unset, or retention.ms=-1.
- Check cleanup policy. A compact topic that keeps growing may have compaction lagging or a key cardinality problem.
- Assess urgency. Above ~85% and climbing is act-now; a slow climb with days of headroom is plan-and-fix.
| Step | Question it answers | Time cost |
|---|---|---|
| 1. Trend | Steady growth or a jump? | 1-2 min |
| 2. Top topics | What is using the disk? | 2-3 min |
| 3. Retention config | Is retention misconfigured? | 1-2 min |
| 4. Cleanup policy | Is compaction keeping up? | 1-2 min |
| 5. Urgency | Emergency or planned fix? | 1 min |
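Step 2 can be sketched with kafka-log-dirs.sh, which reports per-partition sizes as JSON. The helpers below are a minimal sketch, not a definitive tool: the function names are illustrative, and the parsing assumes each partition entry lists "partition" immediately before "size", so check the output of your Kafka version before relying on it.

```shell
# Extract "topic-partition size_bytes" pairs from kafka-log-dirs JSON on stdin.
# Assumption: entries look like {"partition":"orders-0","size":123,...}.
log_dir_pairs() {
  grep -o '"partition":"[^"]*","size":[0-9][0-9]*' |
    sed -E 's/"partition":"([^"]*)","size":([0-9]+)/\1 \2/'
}

# Sum sizes per topic and print "bytes topic", largest first.
# Replicas on every broker are counted, which matches on-disk usage.
top_topics() {
  awk '{ t = $1; sub(/-[0-9]+$/, "", t); s[t] += $2 }
       END { for (t in s) printf "%.0f %s\n", s[t], t }' | sort -rn
}

# Usage against a cluster (not run here):
#   kafka-log-dirs.sh --bootstrap-server "$BROKER" --describe \
#     | log_dir_pairs | top_topics | head
```

The partition suffix is stripped with a trailing `-<number>` match, so topic names that contain hyphens are still grouped correctly.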
5. Safe remediations
| Situation | Safe action |
|---|---|
| Retention longer than needed on a topic | Reduce retention.ms / set retention.bytes (a config change with owner sign-off); old segments age out |
| Volume genuinely too small | Expand MSK storage or enable storage autoscaling / tiered storage (engineering-owned) |
| Runaway topic (retention.ms=-1) | Set a sane retention with the owner’s agreement |
| Broker near full now | Escalate; a retention change buys time only gradually, because space is freed as segments roll and become eligible for deletion |
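Because retention.ms is expressed in milliseconds, a retention change is easy to mistype. The sketch below converts days to milliseconds and prints the kafka-configs.sh command for review instead of running it; the function names are illustrative, and printing rather than executing is this sketch's choice, to leave room for owner sign-off.

```shell
# Convert whole days to milliseconds for retention.ms.
days_to_ms() {
  echo $(( $1 * 24 * 60 * 60 * 1000 ))
}

# Print (do not run) the command that sets retention.ms on a topic.
# $BROKER is left unexpanded so the reviewer sees the command as it will be run.
retention_alter_cmd() {
  topic=$1
  days=$2
  printf 'kafka-configs.sh --bootstrap-server "$BROKER" --alter --entity-type topics --entity-name %s --add-config retention.ms=%s\n' \
    "$topic" "$(days_to_ms "$days")"
}

# Example: retention_alter_cmd orders 3
```

For example, `days_to_ms 3` yields 259200000, the three-day value used elsewhere in this runbook.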
6. Escalation trigger
Page on-call engineering if:
- A broker is above ~90% and climbing, or already rejecting writes.
- Freeing space needs storage expansion, autoscaling, or tiered-storage changes.
- Disk pressure has already caused under-replicated or offline partitions.
- Reducing retention would delete data other teams still depend on (get owner sign-off).
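The urgency and escalation thresholds can be turned into a quick check from two readings of KafkaDataLogsDiskUsed. This is a rough sketch under stated assumptions: whole-number percentages, linear extrapolation between two samples, and illustrative function names. The ~85% and ~90% figures are approximate in this runbook, so treat the output as a prompt for judgment, not a decision.

```shell
# Estimate hours until the disk reaches 100%, given two disk-used
# percentages taken hours_apart hours apart. Prints "stable" if not climbing.
hours_to_full() {
  before=$1; now=$2; hours_apart=$3
  if [ "$now" -le "$before" ]; then
    echo stable
    return
  fi
  echo $(( (100 - now) * hours_apart / (now - before) ))
}

# Map the runbook thresholds to an action.
# Simplification: anything not climbing past a threshold is "plan-and-fix".
disk_urgency() {
  before=$1; now=$2
  if [ "$now" -ge 90 ] && [ "$now" -gt "$before" ]; then
    echo page
  elif [ "$now" -ge 85 ] && [ "$now" -gt "$before" ]; then
    echo act-now
  else
    echo plan-and-fix
  fi
}

# Example: 80% yesterday, 85% now -> hours_to_full 80 85 24
```

A broker that went from 80% to 85% in 24 hours has roughly 72 hours of headroom at the same rate, which is the "days of headroom" case only if the trend does not accelerate.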
7. Relevant commands and exhibits
# Effective retention for a topic
kafka-configs.sh --bootstrap-server $BROKER --describe \
--entity-type topics --entity-name orders
# A runaway retention setting
retention.ms=-1 # never delete: unbounded growth
cleanup.policy=delete
# Reduce retention to 3 days (with sign-off)
kafka-configs.sh --bootstrap-server $BROKER --alter \
--entity-type topics --entity-name orders \
--add-config retention.ms=259200000
MSK CloudWatch: KafkaDataLogsDiskUsed (percent), and watch UnderReplicatedPartitions for the downstream effect.
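To spot the runaway setting shown above without reading the describe output by eye, a small filter can flag it. This is a sketch that assumes each config appears on its own line starting with name=value (optionally indented), as in the exhibit; the function name and messages are illustrative.

```shell
# Read kafka-configs.sh --describe output on stdin and report retention risks.
# Matches only at the start of a line so values repeated later on the same
# line (for example in a synonyms list) are not counted.
retention_flags() {
  awk '
    /^[ \t]*retention\.ms=-1/        { unbounded = 1 }
    /^[ \t]*retention\.bytes=[0-9]/  { capped = 1 }
    END {
      if (unbounded) print "retention.ms=-1: time-based deletion disabled"
      if (!capped)   print "retention.bytes: no explicit size cap found"
    }'
}

# Usage (not run here):
#   kafka-configs.sh --bootstrap-server "$BROKER" --describe \
#     --entity-type topics --entity-name orders | retention_flags
```

A missing size cap is not a fault on its own; it only matters in combination with long or unbounded time-based retention.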
8. Guided practical
Explore retention in the local lab (small-scale).
- Create a topic with a tiny retention.bytes (for example a few KB) and segment.bytes small, in the local lab.
- Produce enough records to exceed it, then wait and re-describe: confirm old segments are deleted and the log stops growing.
- Set retention.ms=-1 on another topic and note that it never deletes, illustrating the runaway case.
- Read kafka-configs --describe to confirm the settings.
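The steps above can be sketched as a script. Everything environment-specific here is an assumption: a single local broker on localhost:9092, the Kafka CLI scripts on PATH, and the hypothetical topic names lab-small and lab-forever. The sizes are slightly larger than "a few KB" so that default producer batches still fit within one segment.

```shell
# Local-lab walkthrough (assumed broker address; override with BROKER=...).
BROKER=${BROKER:-localhost:9092}

# Echo commands when DRY_RUN=1, otherwise execute them.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

retention_lab() {
  # Small size-based retention: 128 KiB per partition, 64 KiB segments.
  run kafka-topics.sh --bootstrap-server "$BROKER" --create --topic lab-small \
    --partitions 1 --replication-factor 1 \
    --config retention.bytes=131072 --config segment.bytes=65536
  # Produce about 2 MB so the log exceeds retention.bytes many times over.
  run kafka-producer-perf-test.sh --topic lab-small --num-records 2000 \
    --record-size 1024 --throughput -1 \
    --producer-props bootstrap.servers="$BROKER"
  # The runaway case: time-based deletion disabled.
  run kafka-topics.sh --bootstrap-server "$BROKER" --create --topic lab-forever \
    --partitions 1 --replication-factor 1 --config retention.ms=-1
  # Confirm the settings, then compare on-disk sizes after a few minutes.
  run kafka-configs.sh --bootstrap-server "$BROKER" --describe \
    --entity-type topics --entity-name lab-small
  run kafka-log-dirs.sh --bootstrap-server "$BROKER" --describe \
    --topic-list lab-small,lab-forever
}

# Preview with: DRY_RUN=1; retention_lab
```

Deletion is not immediate: the broker checks retention periodically (log.retention.check.interval.ms, five minutes by default) and never deletes the active segment, so expect the small topic to settle near its limit, not shrink to exactly retention.bytes.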
Next: Auth Failures After Credential or Certificate Rotation.