Kafka Storage Internals: Segments, Retention, and Log Compaction

You now know a partition is an append-only log replicated across brokers. This module shows what that log actually is on disk: how it is split into files, how old data is removed, and the two mechanisms (page cache and zero-copy) that make Kafka fast despite writing everything to disk.

What you’ll be able to do after this module

Describe how a partition is laid out on disk as a directory of segments.
Explain what the .log, .index, and .timeindex files hold and why the index files exist.
Explain retention by time and by size, and what the active segment is.
Explain log compaction, tombstones, and when to choose compact over delete.
Explain how the OS page cache and zero-copy transfer make Kafka fast.

1. A partition is a directory of segments

Stored as a single file, a partition’s append-only log would grow into one enormous file that is impossible to expire or index efficiently. So Kafka splits each partition’s log into segments: a sequence of files on the broker’s disk.

Each partition is a directory named topic-partition, for example orders-0. Inside it, the log is a series of segments, each named by the base offset of its first record.

flowchart LR
    subgraph dir [orders-0 directory]
        direction TB
        s1["00000000000000000000.log<br/>(offsets 0 to 4999)<br/>CLOSED"]
        s2["00000000000000005000.log<br/>(offsets 5000 to 9999)<br/>CLOSED"]
        s3["00000000000000010000.log<br/>(offsets 10000+)<br/>ACTIVE"]
    end

Only the newest segment is the active segment. All new records are appended to it.
When the active segment reaches a size or age limit (segment.bytes, 1 GiB by default, or segment.ms, 7 days by default), Kafka closes it and opens a new active segment.
Closed segments are immutable, which is what makes retention and compaction safe: Kafka can delete or rewrite whole segments without touching the one being written.
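
The naming rule above can be sketched in a few lines of Python. This is an illustration, not broker code: the function names segment_file_name and find_segment are made up for this example, and the base offsets are the ones from the diagram.

```python
from bisect import bisect_right


def segment_file_name(base_offset, suffix=".log"):
    """Build a segment file name: the base offset zero-padded to 20 digits."""
    return f"{base_offset:020d}{suffix}"


def find_segment(base_offsets, target_offset):
    """Return the base offset of the segment that would hold target_offset.

    base_offsets must be sorted ascending. The owning segment is the one
    with the largest base offset that is <= target_offset.
    """
    i = bisect_right(base_offsets, target_offset) - 1
    if i < 0:
        raise ValueError(f"offset {target_offset} is before the first segment")
    return base_offsets[i]


# Base offsets taken from the orders-0 diagram above.
bases = [0, 5000, 10000]
print(segment_file_name(find_segment(bases, 7500)))  # 00000000000000005000.log
```

Because each file name is the base offset, a sorted list of names is enough to find the right segment with a binary search, before any index file is opened.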

2. Segment files: .log, .index, .timeindex

Each segment is actually a small set of files that share the base-offset name:

File	Holds	Purpose
`.log`	The actual records	The append-only data itself
`.index`	Offset to physical position	Find a record by offset without scanning
`.timeindex`	Timestamp to offset	Find records by time, and drive time retention

The index files exist for fast lookup. Without them, seeking to “offset 7500” would mean scanning the segment from the start. Instead, Kafka keeps a sparse index mapping offsets to byte positions in the .log file, so a consumer can jump close to the target and read forward a little.

flowchart LR
    req["Consumer: read from offset 7500"]
    idx[".index<br/>7000 -> byte 210KB<br/>7500 -> byte 225KB"]
    logf[".log<br/>seek to 225KB,<br/>read forward"]
    req --> idx --> logf

The .timeindex is what lets time-based retention and timestamp lookups (for example, “consume from 9am today”) work without scanning.
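
As a rough sketch of the two lookups described in this section, the Python below models both index files as sorted lists of pairs. The entries are invented for illustration (the last two offset entries mirror the diagram), and the function names are hypothetical. The real index files are binary and the broker's lookup code differs in detail.

```python
from bisect import bisect_right

# Hypothetical sparse offset index: (offset, byte position in the .log file).
OFFSET_INDEX = [(5000, 0), (6000, 180_000), (7000, 210_000), (7500, 225_000)]

# Hypothetical time index: (timestamp in ms, offset).
TIME_INDEX = [(1_000, 5000), (2_000, 7000), (3_000, 7500)]


def lookup_position(offset_index, target_offset):
    """Return the (offset, position) entry with the largest offset <= target.

    The caller seeks to that position and scans forward to the exact record.
    """
    offsets = [offset for offset, _ in offset_index]
    i = bisect_right(offsets, target_offset) - 1
    if i < 0:
        raise ValueError("target offset is before this segment")
    return offset_index[i]


def scan_start_for_timestamp(time_index, offset_index, target_ts):
    """Map a timestamp to a starting position in two steps:
    timestamp -> offset (time index), then offset -> position (offset index).
    """
    timestamps = [ts for ts, _ in time_index]
    i = bisect_right(timestamps, target_ts) - 1
    # If the timestamp is earlier than every entry, start at the first one.
    start_offset = time_index[i][1] if i >= 0 else offset_index[0][0]
    return lookup_position(offset_index, start_offset)


print(lookup_position(OFFSET_INDEX, 7250))                        # (7000, 210000)
print(scan_start_for_timestamp(TIME_INDEX, OFFSET_INDEX, 2_500))  # (7000, 210000)
```

Offset 7250 has no index entry of its own, so the lookup lands on the entry for 7000 and the reader scans forward from byte 210,000. That is the trade-off of a sparse index: a small file in exchange for a short scan.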

3. Retention: how old data leaves

Kafka does not keep data forever. Each topic has a retention policy that decides when closed segments are eligible for cleanup. The default cleanup policy is delete, which removes whole old segments.

Retention is bounded by time, size, or both:

Time: retention.ms keeps records for a period (7 days by default), then deletes segments whose newest record is older than that.
Size: retention.bytes caps the total size per partition (unlimited by default); when exceeded, the oldest segments are deleted.

flowchart LR
    subgraph part [orders-0 over time]
        old["old segment<br/>(past retention)<br/>DELETED"]
        mid["segment<br/>within window"]
        act["active segment"]
    end
    old -.->|"retention.ms exceeded"| gone["removed"]
    mid --> act

Retention works on whole closed segments, never on individual records, and never on the active segment. This is why retention is coarse: a segment is kept until every record in it is past the limit.
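
The rules above can be modeled with a short Python sketch. It is a simplified model under stated assumptions, not the broker's exact algorithm: the Segment class and the segments_to_delete function are hypothetical, and the sizes and timestamps are invented.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    base_offset: int
    size_bytes: int
    max_timestamp_ms: int  # timestamp of the newest record in the segment
    active: bool = False


def segments_to_delete(segments, now_ms, retention_ms=None, retention_bytes=None):
    """Return the closed segments eligible for deletion, oldest first.

    segments must be ordered oldest to newest. Walks from the oldest
    segment and stops at the first one that must be kept, so only a
    prefix of the log is ever removed. The active segment is never returned.
    """
    total = sum(s.size_bytes for s in segments)
    doomed = []
    for seg in segments:
        if seg.active:
            break
        too_old = retention_ms is not None and now_ms - seg.max_timestamp_ms > retention_ms
        too_big = retention_bytes is not None and total > retention_bytes
        if not (too_old or too_big):
            break
        doomed.append(seg)
        total -= seg.size_bytes
    return doomed


log = [
    Segment(0, 100, max_timestamp_ms=1_000),
    Segment(5000, 100, max_timestamp_ms=5_000),
    Segment(10000, 50, max_timestamp_ms=9_000, active=True),
]
by_time = segments_to_delete(log, now_ms=10_000, retention_ms=6_000)
print([s.base_offset for s in by_time])  # [0]
```

Note that the time check uses the newest record in each segment, which is why a segment survives until every record in it is past the limit.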

4. Log compaction and tombstones

Sometimes you do not want time-based deletion. You want to keep the latest value per key forever, and only discard superseded values. That is log compaction, enabled with cleanup.policy=compact.

With compaction, Kafka guarantees that the log retains at least the most recent record for each key. Older records with the same key are eventually removed during compaction.

flowchart TD
    subgraph before [Before compaction]
        b1["key=A v1"]
        b2["key=B v1"]
        b3["key=A v2"]
        b4["key=A v3"]
        b5["key=B v2"]
    end
    subgraph after [After compaction]
        a1["key=A v3"]
        a2["key=B v2"]
    end
    before --> after

To delete a key entirely in a compacted topic, you write a tombstone: a record with that key and a null value. The key’s earlier values are compacted away as usual, and the tombstone itself is retained for a configurable period (delete.retention.ms, 24 hours by default) so that consumers can observe the deletion before the tombstone is removed too.
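
A toy model of compaction, assuming records are simple (offset, key, value) tuples and a null value is represented by None. The compact function is hypothetical. The real log cleaner works segment by segment in the background and never touches the active segment.

```python
def compact(records, drop_tombstones=False):
    """Keep only the latest record per key, preserving offsets and order.

    records: list of (offset, key, value) tuples, oldest first.
    A value of None marks a tombstone. With drop_tombstones=True the
    tombstones are removed as well, modeling the state after their
    retention period has passed.
    """
    latest = {}
    for offset, key, value in records:
        latest[key] = (offset, key, value)  # later records overwrite earlier ones
    survivors = sorted(latest.values(), key=lambda record: record[0])
    if drop_tombstones:
        survivors = [r for r in survivors if r[2] is not None]
    return survivors


log = [(0, "A", "v1"), (1, "B", "v1"), (2, "A", "v2"), (3, "A", "v3"), (4, "B", "v2")]
print(compact(log))  # [(3, 'A', 'v3'), (4, 'B', 'v2')]

# Writing a tombstone for key A.
log.append((5, "A", None))
print(compact(log))                        # [(4, 'B', 'v2'), (5, 'A', None)]
print(compact(log, drop_tombstones=True))  # [(4, 'B', 'v2')]
```

Surviving records keep their original offsets, so a compacted log has gaps in its offset sequence but never renumbers anything.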

When to use compaction

Compaction fits topics that represent the current state of something keyed by id, rather than a stream of independent events:

A changelog of the latest known state per entity (for example, latest inventory count per productId).
The backing topic for a Kafka Streams state store or a KTable, covered in Kafka Streams.

Policy	Keeps	Use for
`delete` (default)	Everything within the time/size window	Event streams like `orders`, `payments`
`compact`	Latest record per key	State/changelog topics keyed by entity id
`compact,delete`	Latest per key, but also expires by time	Compacted topics that should still age out
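
The three policies in the table map to topic-level settings. The Python sketch below only builds the configuration as a dictionary. The helper topic_config is hypothetical and does not call any Kafka client. The keys cleanup.policy and retention.ms are real topic settings, and values are written as strings on the assumption that whichever admin tool or client receives them expects string values.

```python
DAY_MS = 24 * 60 * 60 * 1000
VALID_POLICIES = {"delete", "compact", "compact,delete"}


def topic_config(policy, retention_days=None):
    """Build topic-level settings for one of the three cleanup policies.

    retention_days is only applied when the policy includes delete,
    because a compact-only topic does not expire records by time.
    """
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown cleanup policy: {policy!r}")
    config = {"cleanup.policy": policy}
    if retention_days is not None and "delete" in policy:
        config["retention.ms"] = str(retention_days * DAY_MS)
    return config


print(topic_config("delete", retention_days=7))
# {'cleanup.policy': 'delete', 'retention.ms': '604800000'}
print(topic_config("compact"))
# {'cleanup.policy': 'compact'}
```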

5. Why Kafka is fast: page cache and zero-copy

It seems like writing everything to disk should be slow. Two mechanisms make Kafka fast anyway.

The OS page cache

Kafka does not maintain its own large in-process cache of records. Instead it appends to files and lets the operating system’s page cache hold recently written and read data in RAM.

Writes go to the page cache and are flushed to disk by the OS, so producing does not block on a physical disk write.
Consumers reading recent data (the common case) are served from RAM, because those pages are still cached from when they were written.

This is why a healthy Kafka broker often shows most of its memory used by the page cache, and why consumers reading the tail of the log are extremely fast.
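
The difference between handing bytes to the OS and forcing them onto the disk can be shown with ordinary file calls. This Python sketch is an analogy under stated assumptions, not how the broker is implemented: append_record is a hypothetical helper, and the force_flush flag exists only to contrast the two behaviors.

```python
import os


def append_record(path, payload, force_flush=False):
    """Append bytes to a file and return the file size afterwards.

    By default the write only reaches the OS page cache and the OS decides
    when to flush it to disk. With force_flush=True the call blocks in
    os.fsync until the OS reports the data has been written to the device.
    """
    with open(path, "ab") as f:
        f.write(payload)
        f.flush()  # move bytes from Python's buffer to the OS (page cache)
        if force_flush:
            os.fsync(f.fileno())  # wait for the physical write
    return os.path.getsize(path)
```

A reader opening the same file right after the default call still sees the new bytes, because reads are served from the same page cache. That is the effect consumers at the tail of the log benefit from.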

Zero-copy transfer

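When a consumer fetches recent records, the data is usually already in the page cache, stored as bytes in exactly the on-wire format. Kafka uses a zero-copy system call (sendfile) to send those bytes from the page cache straight to the network socket, without copying them through application memory. This path applies to unencrypted connections; with TLS enabled, the broker has to read the bytes into application memory to encrypt them.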

flowchart LR
    disk["Disk / page cache<br/>(records in wire format)"]
    nic["Network socket<br/>to consumer"]
    disk -->|"sendfile: kernel copies directly"| nic

Avoiding the usual copies between kernel and application buffers saves CPU and memory bandwidth, letting one broker serve very high throughput.
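
Python exposes the same idea through socket.sendfile, which uses the operating system's sendfile call where available and falls back to ordinary sends elsewhere. The sketch below is an illustration only: send_log_range is a hypothetical helper, and the broker itself is a JVM application with its own transfer code.

```python
import socket


def send_log_range(sock, log_path, position, length):
    """Send `length` bytes of a file, starting at `position`, to a socket.

    The bytes are not read into a Python buffer first; where the platform
    supports it, the kernel moves them from the page cache to the socket.
    Returns the number of bytes sent.
    """
    with open(log_path, "rb") as log_file:
        return sock.sendfile(log_file, offset=position, count=length)
```

Combined with an offset index lookup, a fetch becomes: find the byte position for the requested offset, then hand that range of the .log file to the socket.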

6. Tying it back to durability

Storage and replication work together. A record is appended to the active segment on the leader, replicated to followers (each writing to its own segments), and only counts as durable once the in-sync replicas described in Cluster Anatomy have it. Retention and compaction then govern how long it lives. Disk pressure and retention misconfiguration are common causes of production incidents, covered in Disk Pressure, Retention, and Segment Issues.

Checkpoint

You should now be able to:

Describe how a partition is stored as a directory of segments, and what the active segment is.
Explain what the .log, .index, and .timeindex files hold and why the indexes exist.
Explain retention by time and size, and why it operates on whole segments.
Explain log compaction, tombstones, and when to choose compact over delete.
Explain how the page cache and zero-copy transfer make Kafka fast.

Next: Section 2, the Local Lab, where you run a KRaft cluster yourself and inspect these structures firsthand.