The Commit Log on Disk: Segments, Retention, Compaction
Partition directories and log segments, index files, retention by time and size, log compaction and tombstones, the OS page cache, and zero-copy transfer.
You now know a partition is an append-only log replicated across brokers. This module shows what that log actually is on disk: how it is split into files, how old data is removed, and the two mechanisms (page cache and zero-copy) that make Kafka fast despite writing everything to disk.
What you’ll be able to do after this module
- Describe how a partition is laid out on disk as a directory of segments.
- Explain what the .log, .index, and .timeindex files hold and why the index files exist.
- Explain retention by time and by size, and what the active segment is.
- Explain log compaction, tombstones, and when to choose compact over delete.
- Explain how the OS page cache and zero-copy transfer make Kafka fast.
1. A partition is a directory of segments
Stored as one file, a partition’s log would grow into a single enormous file that is hard to expire or index efficiently. So Kafka splits each partition’s log into segments: a sequence of files on the broker’s disk.
Each partition is a directory named topic-partition, for example orders-0. Inside it, the log is a series of segments, each named by the base offset of its first record.
flowchart LR
subgraph dir [orders-0 directory]
direction TB
s1["00000000000000000000.log<br/>(offsets 0 to 4999)<br/>CLOSED"]
s2["00000000000000005000.log<br/>(offsets 5000 to 9999)<br/>CLOSED"]
s3["00000000000000010000.log<br/>(offsets 10000+)<br/>ACTIVE"]
end
- Only the newest segment is the active segment. All new records are appended to it.
- When the active segment reaches a size or age limit (segment.bytes, segment.ms), Kafka closes it and opens a new active segment.
- Closed segments are immutable, which is what makes retention and compaction safe: Kafka can delete or rewrite whole segments without touching the one being written.
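The naming scheme above can be sketched in a few lines of Python. This is an illustrative model, not Kafka's own code: the function names are hypothetical, and the base offsets are the ones from the diagram. It shows how a base offset becomes a zero-padded file name and how a sorted list of base offsets is enough to find the segment that holds a given offset.

```python
from bisect import bisect_right

def segment_file_name(base_offset: int) -> str:
    """Zero-pad the base offset to 20 digits, matching the names in the diagram."""
    return f"{base_offset:020d}.log"

def segment_for_offset(base_offsets: list[int], offset: int) -> int:
    """Return the base offset of the segment that contains `offset`.

    `base_offsets` must be sorted ascending (hypothetical helper for illustration).
    """
    i = bisect_right(base_offsets, offset) - 1
    if i < 0:
        raise KeyError(f"offset {offset} is before the first segment")
    return base_offsets[i]

# Base offsets of the three segments in the orders-0 example.
base_offsets = [0, 5000, 10000]
print(segment_file_name(segment_for_offset(base_offsets, 7500)))
```

Because each segment is named by its first offset, locating the right file is a binary search over names; no segment has to be opened to find out which offsets it covers.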
2. Segment files: .log, .index, .timeindex
Each segment is actually a small set of files that share the base-offset name:
| File | Holds | Purpose |
|---|---|---|
| .log | The actual records | The append-only data itself |
| .index | Offset to physical position | Find a record by offset without scanning |
| .timeindex | Timestamp to offset | Find records by time, and drive time retention |
The index files exist for fast lookup. Without them, seeking to “offset 7500” would mean scanning the segment from the start. Instead, Kafka keeps a sparse index mapping offsets to byte positions in the .log file, so a consumer can jump close to the target and read forward a little.
flowchart LR
req["Consumer: read from offset 7500"]
idx[".index<br/>7000 -> byte 210KB<br/>7500 -> byte 225KB"]
logf[".log<br/>seek to 225KB,<br/>read forward"]
req --> idx --> logf
The .timeindex is what lets time-based retention and timestamp lookups (for example, “consume from 9am today”) work without scanning.
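A sparse offset index lookup can be modeled as a binary search for the closest entry at or before the target. The sketch below is a simplified illustration, not Kafka's implementation: the index entries, byte positions, and function name are made up for the example.

```python
from bisect import bisect_right

# Hypothetical sparse index for the segment with base offset 5000:
# (offset, byte position in the .log file), sorted by offset.
sparse_index = [(5000, 0), (6000, 61_440), (7000, 215_040), (7400, 227_328)]

def lookup_start_position(index, target_offset):
    """Return the (offset, position) entry with the largest offset <= target_offset."""
    offsets = [offset for offset, _ in index]
    i = bisect_right(offsets, target_offset) - 1
    if i < 0:
        raise ValueError("target offset is before this segment's base offset")
    return index[i]

# No entry exists for 7500 itself, so the lookup lands on the nearest
# earlier entry; the reader then scans forward in the .log from that position.
print(lookup_start_position(sparse_index, 7500))
```

Keeping only some offsets in the index keeps it small. The cost is a short forward scan from the returned position to the exact record.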
3. Retention: how old data leaves
Kafka does not keep data forever. Each topic has a retention policy that decides when closed segments are eligible for cleanup. The default cleanup policy is delete, which removes whole old segments.
Retention is bounded by time, size, or both:
- Time: retention.ms keeps records for a period (a common default is 7 days), then deletes segments whose newest record is older than that.
- Size: retention.bytes caps the total size per partition; when exceeded, the oldest segments are deleted.
flowchart LR
subgraph part [orders-0 over time]
old["old segment<br/>(past retention)<br/>DELETED"]
mid["segment<br/>within window"]
act["active segment"]
end
old -.retention.ms exceeded.-> gone["removed"]
mid --> act
Retention works on whole closed segments, never on individual records, and never on the active segment. This is why retention is coarse: a segment is kept until every record in it is past the limit.
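The rules above can be modeled with a short Python sketch. It is a simplification under stated assumptions, not Kafka's actual cleanup code: the Segment class and segments_to_delete function are hypothetical, and the size check follows this module's description (delete oldest segments while the partition exceeds the cap) rather than the broker's exact arithmetic.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    base_offset: int
    size_bytes: int
    newest_timestamp_ms: int   # timestamp of the newest record in the segment
    active: bool = False

def segments_to_delete(segments, now_ms, retention_ms=None, retention_bytes=None):
    """Return base offsets of closed segments eligible for deletion, oldest first."""
    total = sum(s.size_bytes for s in segments)
    closed = sorted((s for s in segments if not s.active), key=lambda s: s.base_offset)
    doomed = []
    for s in closed:
        expired = retention_ms is not None and now_ms - s.newest_timestamp_ms > retention_ms
        oversize = retention_bytes is not None and total > retention_bytes
        if not (expired or oversize):
            break              # keep this segment and everything newer
        doomed.append(s.base_offset)
        total -= s.size_bytes
    return doomed

segments = [
    Segment(0, 100, newest_timestamp_ms=1_000),
    Segment(5000, 100, newest_timestamp_ms=5_000),
    Segment(10000, 50, newest_timestamp_ms=9_000, active=True),
]
print(segments_to_delete(segments, now_ms=10_000, retention_ms=6_000))
print(segments_to_delete(segments, now_ms=10_000, retention_bytes=120))
```

Note that the time check uses the newest record in each segment, so one recent record keeps the whole segment alive, and the active segment is never a candidate.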
4. Log compaction and tombstones
Sometimes you do not want time-based deletion. You want to keep the latest value per key forever, and only discard superseded values. That is log compaction, enabled with cleanup.policy=compact.
With compaction, Kafka guarantees that the log retains at least the most recent record for each key. Older records with the same key are eventually removed during compaction.
flowchart TD
subgraph before [Before compaction]
b1["key=A v1"]
b2["key=B v1"]
b3["key=A v2"]
b4["key=A v3"]
b5["key=B v2"]
end
subgraph after [After compaction]
a1["key=A v3"]
a2["key=B v2"]
end
before --> after
To delete a key entirely in a compacted topic, you write a tombstone: a record with that key and a null value. Compaction removes the key’s prior values like any other superseded record, and keeps the tombstone for a configurable period (delete.retention.ms, 24 hours by default) so that consumers have time to observe the deletion. After that period the tombstone itself is removed.
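The effect of compaction and tombstones can be illustrated with a small in-memory model. This is a sketch, not how the broker's log cleaner works internally: the compact function and the (offset, key, value) tuple layout are assumptions for the example, and a None value stands in for a tombstone.

```python
def compact(records, drop_tombstones=False):
    """Keep only the latest record per key, preserving original offsets.

    `records` is a list of (offset, key, value) tuples in offset order;
    a value of None represents a tombstone.
    """
    latest = {}
    for record in records:
        _, key, _ = record
        latest[key] = record          # later records overwrite earlier ones
    survivors = sorted(latest.values(), key=lambda r: r[0])
    if drop_tombstones:
        # Models the later pass, after the tombstone retention period has elapsed.
        survivors = [r for r in survivors if r[2] is not None]
    return survivors

log = [(0, "A", "v1"), (1, "B", "v1"), (2, "A", "v2"), (3, "A", "v3"), (4, "B", "v2")]
print(compact(log))

log_with_delete = log + [(5, "A", None)]   # tombstone for key A
print(compact(log_with_delete))
print(compact(log_with_delete, drop_tombstones=True))
```

Surviving records keep their original offsets, so consumers see gaps in the offset sequence of a compacted log rather than renumbered records.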
When to use compaction
Compaction fits topics that represent the current state of something keyed by id, rather than a stream of independent events:
- A changelog of the latest known state per entity (for example, latest inventory count per productId).
- The backing topic for a Kafka Streams state store or a KTable, covered in Kafka Streams.
| Policy | Keeps | Use for |
|---|---|---|
| delete (default) | Everything within the time/size window | Event streams like orders, payments |
| compact | Latest record per key | State/changelog topics keyed by entity id |
| compact,delete | Latest per key, but also expires by time | Compacted topics that should still age out |
5. Why Kafka is fast: page cache and zero-copy
It seems like writing everything to disk should be slow. Two mechanisms make Kafka fast anyway.
The OS page cache
Kafka does not maintain its own large in-process cache of records. Instead it appends to files and lets the operating system’s page cache hold recently written and read data in RAM.
- Writes go to the page cache and are flushed to disk by the OS, so producing does not block on a physical disk write.
- Consumers reading recent data (the common case) are served from RAM, because those pages are still cached from when they were written.
This is why a healthy Kafka broker often shows most of its memory used by the page cache, and why consumers reading the tail of the log are extremely fast.
Zero-copy transfer
When a consumer fetches records, the data is already sitting in the page cache as bytes in exactly the on-wire format. Kafka uses a zero-copy system call (sendfile) to send those bytes from the page cache straight to the network socket, without copying them through application memory.
flowchart LR
disk["Disk / page cache<br/>(records in wire format)"]
nic["Network socket<br/>to consumer"]
disk -->|"sendfile: kernel copies directly"| nic
Avoiding the usual copies between kernel and application buffers saves CPU and memory bandwidth, letting one broker serve very high throughput.
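To see the idea outside Kafka, here is a minimal Python sketch of sending a byte range of a file to a socket. It is an analogy, not the broker's code (Kafka is written in Java and uses FileChannel.transferTo): the send_file_range function is hypothetical, a temporary file stands in for a segment, and a local socket pair stands in for the consumer connection. Python's socket.sendfile() uses the operating system's sendfile call where available and falls back to ordinary sends elsewhere.

```python
import os
import socket
import tempfile

def send_file_range(sock, path, offset, count):
    """Send `count` bytes of the file at `path`, starting at `offset`, over `sock`.

    The file contents are never read into a Python bytes object here;
    where the platform supports it, the kernel moves the data directly.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f, offset=offset, count=count)

# Demo: a temporary file plays the role of a .log segment.
payload = bytes(range(256)) * 4          # 1024 bytes of stand-in record data
fd, path = tempfile.mkstemp()
os.write(fd, payload)
os.close(fd)

sender, receiver = socket.socketpair()
sent = send_file_range(sender, path, offset=256, count=512)
sender.close()                           # signals end of stream to the receiver

received = b""
while chunk := receiver.recv(4096):
    received += chunk
receiver.close()
os.remove(path)

print(sent, received == payload[256:768])
```

The offset and count arguments mirror what a fetch needs: the index lookup gives a byte position in the segment, and the broker hands that range to the kernel instead of reading it into its own buffers.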
6. Tying it back to durability
Storage and replication work together. A record is appended to the active segment on the leader, replicated to followers (each writing to its own segments), and only counts as durable once the in-sync replicas (introduced in Cluster Anatomy) have it. Retention and compaction then govern how long it lives. Disk pressure and retention misconfiguration are common production incidents, covered in Disk Pressure, Retention, and Segment Issues.
Checkpoint
You should now be able to:
- Describe how a partition is stored as a directory of segments, and what the active segment is.
- Explain what the .log, .index, and .timeindex files hold and why the indexes exist.
- Explain retention by time and size, and why it operates on whole segments.
- Explain log compaction, tombstones, and when to choose compact over delete.
- Explain how the page cache and zero-copy transfer make Kafka fast.
Next: Section 2, the Local Lab, where you run a KRaft cluster yourself and inspect these structures firsthand.