Kafka Disk Pressure and Retention: Incident Playbook

1. Symptom

A CloudWatch alarm fires: KafkaDataLogsDiskUsed > 85% on one or more brokers. Left unchecked, a broker whose log directory fills stops accepting writes and can go offline, cascading into under-replicated and then offline partitions. This is a slow-moving incident that becomes an outage if ignored.

The goal is to understand why storage is growing and buy headroom safely, without deleting data that is still needed, using the retention concepts from Storage Internals.

2. Likely causes

Cause	How it manifests
Retention too long for the volume	Steady disk growth toward capacity over days
Throughput higher than planned	Faster-than-expected fill on all brokers
A topic with huge retention or `retention.ms=-1`	One topic dominates disk usage
Compaction not keeping up	A compacted topic still growing
Skewed partitions	One broker fills faster because it hosts more, or larger, partition replicas than its peers
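
The first two rows come down to the same arithmetic: at steady state a topic needs roughly ingest rate times retention time times replication factor, spread across the brokers. A rough sketch of that estimate follows, assuming even replica placement and ignoring index files and compression changes. The function name and the example figures are illustrative only, not taken from a real cluster.

```shell
# steady_state_mib INGEST_MIB_PER_S RETENTION_HOURS REPLICATION_FACTOR BROKERS
# Prints the approximate MiB each broker holds once retention reaches steady state.
# Hypothetical helper: integer math, assumes data is spread evenly across brokers.
steady_state_mib() {
  ingest=$1; hours=$2; rf=$3; brokers=$4
  echo $(( ingest * hours * 3600 * rf / brokers ))
}

# Example: 5 MiB/s ingest, 72 h retention, replication factor 3, 3 brokers
steady_state_mib 5 72 3 3   # prints 1296000 (MiB), about 1.2 TiB per broker
```

If the result is close to or above the provisioned volume size, the alarm is a sizing problem rather than a misbehaving topic.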

3. How it manifests to the Spring app

Condition	What the service sees
Disk high but under limit	Nothing yet; this is the warning window
Broker log dir full	Producers to partitions led by that broker fail (`TimeoutException`, then `NotEnoughReplicasException` once the in-sync replica count drops below `min.insync.replicas` for `acks=all` producers)
Broker offline from full disk	Under-replicated/offline partitions (see Under-Replicated Partitions)

4. Diagnostic steps

1. Check the trend, not just the level. Is disk climbing steadily (retention/throughput) or did it jump (a runaway topic, or a compacted topic whose log cleaner has stalled)?
2. Find the biggest consumers of disk. Identify which topics and partitions hold the most data.
3. Check retention config on the largest topics with kafka-configs --describe: look for long retention.ms, retention.bytes unset, or retention.ms=-1. An empty result means the topic has no overrides and inherits the broker defaults.
4. Check cleanup policy. A compact topic that keeps growing may have compaction lagging or a key cardinality problem.
5. Assess urgency. Above ~85% and climbing is act-now; a slow climb with days of headroom is plan-and-fix.
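
For step 2, one possible approach is to sum the per-partition sizes reported by kafka-log-dirs.sh. The sketch below is a hedged example, not a supported tool: topic_sizes is a hypothetical helper, and it assumes each partition entry in the tool's JSON output looks like "partition":"orders-0","size":123. Verify that shape on your Kafka version, and confirm your principal is allowed to describe log directories.

```shell
# topic_sizes: read kafka-log-dirs.sh --describe output on stdin and print
# "bytes topic" lines, largest first. Sizes are summed over every replica reported.
# Assumes entries of the form "partition":"<topic>-<n>","size":<bytes>.
topic_sizes() {
  grep -o '"partition":"[^"]*","size":[0-9]*' |
    sed -E 's/^"partition":"(.*)-[0-9]+","size":([0-9]+)$/\1 \2/' |
    awk '{ bytes[$1] += $2 } END { for (t in bytes) print bytes[t], t }' |
    sort -rn
}

# Usage (needs broker access, so left commented):
# kafka-log-dirs.sh --bootstrap-server "$BROKER" --describe | topic_sizes | head
```

The greedy match in the sed expression strips only the trailing partition number, so topic names that contain hyphens are kept whole.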

Step	Question it answers	Time cost
1. Trend	Steady growth or a jump?	1-2 min
2. Top topics	What is using the disk?	2-3 min
3. Retention config	Is retention misconfigured?	1-2 min
4. Cleanup policy	Is compaction keeping up?	1-2 min
5. Urgency	Emergency or planned fix?	1 min
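
The urgency call in step 5 can be made concrete with a linear extrapolation from two disk-usage readings. This is a minimal sketch: hours_to_full is a hypothetical helper, and it assumes growth stays roughly constant, which will not hold for a sudden jump.

```shell
# hours_to_full PCT_THEN PCT_NOW HOURS_BETWEEN [LIMIT_PCT]
# Linear extrapolation from two disk-usage samples to LIMIT_PCT (default 100).
# Prints whole hours remaining, or "not growing" if usage is flat or falling.
hours_to_full() {
  awk -v a="$1" -v b="$2" -v h="$3" -v limit="${4:-100}" 'BEGIN {
    rate = (b - a) / h                 # percentage points per hour
    if (rate <= 0) { print "not growing"; exit }
    printf "%d\n", (limit - b) / rate  # truncated to whole hours
  }'
}

hours_to_full 80 86 24   # 0.25 points/hour with 14 points left: prints 56
```

Passing 90 as the fourth argument estimates the time until the escalation threshold rather than a completely full disk.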

5. Safe remediations

Situation	Safe action
Retention longer than needed on a topic	Reduce `retention.ms` / set `retention.bytes` (a config change with owner sign-off); old segments age out
Volume genuinely too small	Expand MSK storage or enable storage autoscaling / tiered storage (engineering-owned)
Runaway topic (`retention.ms=-1`)	Set a sane retention with the owner’s agreement
Broker near full now	Escalate; a retention change does not free space immediately, because only closed segments are deleted and the active segment must roll first
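
Before changing retention it helps to record the current topic-level value so the change can be reverted. The sketch below is one way to do that; current_override is a hypothetical helper, and it assumes kafka-configs.sh --describe prints one override per line in the form "  retention.ms=604800000 sensitive=false ...".

```shell
# current_override KEY: read kafka-configs.sh --describe output on stdin and
# print the topic-level value of KEY, or nothing if the topic has no override.
current_override() {
  key=$(printf '%s' "$1" | sed 's/\./\\./g')   # escape dots for the regex
  sed -n -E "s/^[[:space:]]*${key}=([^[:space:]]+).*/\1/p"
}

# Usage (needs broker access, so left commented):
# old=$(kafka-configs.sh --bootstrap-server "$BROKER" --describe \
#         --entity-type topics --entity-name orders | current_override retention.ms)
# Roll back with --alter --add-config retention.ms=$old, or, if $old is empty,
# with --alter --delete-config retention.ms to return to the broker default.
```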

6. Escalation trigger

Page on-call engineering if:

A broker is above ~90% and climbing, or already rejecting writes.
Freeing space needs storage expansion, autoscaling, or tiered-storage changes.
Disk pressure has already caused under-replicated or offline partitions.
Reducing retention would delete data other teams still depend on (get owner sign-off).

7. Relevant commands and exhibits

# Retention overrides for a topic (add --all to include inherited defaults)
kafka-configs.sh --bootstrap-server $BROKER --describe \
  --entity-type topics --entity-name orders

# A runaway retention setting
retention.ms=-1        # never delete: unbounded growth
cleanup.policy=delete

# Reduce retention to 3 days (with sign-off)
kafka-configs.sh --bootstrap-server $BROKER --alter \
  --entity-type topics --entity-name orders \
  --add-config retention.ms=259200000
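
Since retention.ms is expressed in milliseconds, a small conversion helper avoids arithmetic slips during an incident. days_to_ms is a hypothetical name used only for this sketch.

```shell
# days_to_ms DAYS: convert whole days to a retention.ms value.
days_to_ms() {
  echo $(( $1 * 24 * 60 * 60 * 1000 ))
}

days_to_ms 3   # prints 259200000, the value used above
days_to_ms 7   # prints 604800000, Kafka's default retention (168 hours)
```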

MSK CloudWatch: KafkaDataLogsDiskUsed (percent), and watch UnderReplicatedPartitions for the downstream effect.
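
To read the trend from the command line rather than the console, something like the following may work. It is a sketch under assumptions: a configured AWS CLI, the AWS/Kafka namespace, and the "Cluster Name" and "Broker ID" dimensions. Check the metric's dimensions in your account before relying on it. disk_used_samples is a hypothetical helper name.

```shell
# disk_used_samples CLUSTER BROKER_ID START END
# Prints "timestamp percent" rows of the hourly maximum KafkaDataLogsDiskUsed.
# START and END are ISO 8601 timestamps, for example 2024-01-01T00:00:00Z.
disk_used_samples() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/Kafka \
    --metric-name KafkaDataLogsDiskUsed \
    --dimensions "Name=Cluster Name,Value=$1" "Name=Broker ID,Value=$2" \
    --start-time "$3" --end-time "$4" \
    --period 3600 --statistics Maximum \
    --query 'sort_by(Datapoints, &Timestamp)[].[Timestamp, Maximum]' \
    --output text
}
```

Comparing the first and last rows gives the two readings needed to judge whether the disk is climbing steadily or jumped.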

8. Guided practical

Explore retention in the local lab (small-scale).

Create a topic with a tiny retention.bytes (for example a few KB) and a small segment.bytes.
Produce enough records to exceed it, then wait for the broker's next retention check (log.retention.check.interval.ms, 5 minutes by default) and inspect the partition's size on disk: confirm old segments are deleted and the log stops growing.
Set retention.ms=-1 on another topic and note that it never deletes, illustrating the runaway case.
Read kafka-configs --describe to confirm the settings.
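
The exercise above could be scripted roughly as follows. Everything here is an assumption about the lab: the broker address, the topic names, and the sizes are hypothetical. segment.bytes is set to 1 MiB because some recent Kafka releases reject smaller values; older releases accept a few KB. Set DRY_RUN=1 to print the commands instead of running them.

```shell
BROKER=${BROKER:-localhost:9092}   # assumed local lab address
TOPIC=${TOPIC:-retention-lab}      # hypothetical topic name

# run: execute a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Small-retention topic: one partition, 4 KiB retention, 1 MiB segments.
lab_create() {
  run kafka-topics.sh --bootstrap-server "$BROKER" --create --topic "$TOPIC" \
    --partitions 1 --replication-factor 1 \
    --config retention.bytes=4096 --config segment.bytes=1048576
}

# Runaway case: a second topic that never deletes.
lab_create_unbounded() {
  run kafka-topics.sh --bootstrap-server "$BROKER" --create \
    --topic "$TOPIC-unbounded" --partitions 1 --replication-factor 1 \
    --config retention.ms=-1
}

# Write about 5 MiB (5000 records of 1 KiB) so several segments roll.
lab_produce() {
  run kafka-producer-perf-test.sh --topic "$TOPIC" --num-records 5000 \
    --record-size 1024 --throughput -1 \
    --producer-props bootstrap.servers="$BROKER"
}

# On-disk size per partition, then the topic's config overrides.
lab_check() {
  run kafka-log-dirs.sh --bootstrap-server "$BROKER" --describe --topic-list "$TOPIC"
  run kafka-configs.sh --bootstrap-server "$BROKER" --describe \
    --entity-type topics --entity-name "$TOPIC"
}
```

Run lab_create, lab_produce, and lab_check in that order, leaving several minutes before the final check, because deletion happens on the periodic retention check and segment files are removed from disk after a further short delay.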

Next: Auth Failures After Credential or Certificate Rotation.