Kafka Control Plane: KRaft and ZooKeeper Internals Compared

The previous module mentioned a controller that assigns leaders and tracks brokers. But where does the cluster keep its metadata, and how is that controller chosen? That is the job of the control plane. Kafka is mid-transition between two answers, ZooKeeper and KRaft, so you learn both in depth: the older model runs in many production clusters today, and the newer model is where every cluster is heading.

What you’ll be able to do after this module

Explain what the control plane is responsible for, separate from moving actual data.
Describe how the ZooKeeper era stored metadata and elected a controller, and why it became a bottleneck.
Describe how KRaft stores metadata as a log and elects a controller via a Raft quorum.
Explain what the active controller, the metadata log, and snapshots are in KRaft.
Explain the ZooKeeper-to-KRaft migration path at a high level.

1. Data plane vs control plane

Kafka does two very different jobs, and it helps to separate them:

Data plane: moving your actual events. Producers write to partition leaders, followers replicate, consumers read. This is everything from Section 0 and the previous module.
Control plane: managing the cluster’s metadata. Which brokers are alive, which topics and partitions exist, who leads each partition, what the configs and ACLs are, and electing the controller that coordinates all of this.

flowchart TD
    subgraph control [Control plane]
        meta["Cluster metadata:<br/>brokers, topics, partitions,<br/>leaders, configs, ACLs"]
        ctrl["Controller<br/>(coordinates changes)"]
    end
    subgraph data [Data plane]
        lead["Partition leaders"]
        foll["Followers replicate"]
    end
    ctrl --> meta
    ctrl -->|assigns leaders| lead
    lead --> foll

The control plane is not on the hot path of every message, but the whole cluster depends on it. If metadata is wrong or the controller cannot be elected, leaders cannot be assigned and the data plane stalls.

2. The ZooKeeper era

For most of Kafka’s history, the control plane lived in Apache ZooKeeper, a separate distributed coordination service running alongside the Kafka brokers.

How it worked:

ZooKeeper stored metadata in a tree of znodes (nodes in a hierarchical namespace), for example /brokers/ids, /brokers/topics, and /controller.
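One broker became the controller broker by winning a race to create the ephemeral /controller znode. ZooKeeper lets only one client create a given znode, so exactly one broker wins. Because the znode is ephemeral, it disappears when that broker’s ZooKeeper session ends, which lets the remaining brokers race again.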
Brokers and the controller set watches on znodes. When something changed (a broker registered, a topic was created), ZooKeeper fired a notification and the interested party reacted.
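
The race and the watch mechanism can be sketched in a few lines of Python. This is a toy in-memory model, not a real ZooKeeper client: `ZNodeStore`, `Broker`, and their methods are hypothetical names invented for illustration, and the model assumes one session per broker.

```python
class ZNodeStore:
    """Toy in-memory stand-in for a ZooKeeper ensemble (hypothetical)."""

    def __init__(self):
        self.nodes = {}      # path -> (data, owning session id)
        self.watchers = {}   # path -> callbacks fired once when the path is deleted

    def create_ephemeral(self, path, data, session):
        # Creation fails if the znode already exists, so only one caller wins.
        if path in self.nodes:
            return False
        self.nodes[path] = (data, session)
        return True

    def watch(self, path, callback):
        self.watchers.setdefault(path, []).append(callback)

    def expire_session(self, session):
        # Ephemeral znodes vanish together with the session that created them.
        dead = [p for p, (_, owner) in self.nodes.items() if owner == session]
        for path in dead:
            del self.nodes[path]
            # Watches are one-shot: remove them before firing.
            for callback in self.watchers.pop(path, []):
                callback()


class Broker:
    def __init__(self, broker_id, zk):
        self.broker_id = broker_id
        self.zk = zk
        self.is_controller = False

    def try_become_controller(self):
        self.is_controller = self.zk.create_ephemeral(
            "/controller", self.broker_id, session=self.broker_id)
        if not self.is_controller:
            # Lost the race: watch the znode and retry when it disappears.
            self.zk.watch("/controller", self.try_become_controller)


zk = ZNodeStore()
brokers = [Broker(i, zk) for i in (1, 2, 3)]
for b in brokers:
    b.try_become_controller()

first = [b.broker_id for b in brokers if b.is_controller]
zk.expire_session(1)  # the controller broker dies
second = [b.broker_id for b in brokers if b.is_controller and b.broker_id != 1]
print(first, second)
```

In this run broker 1 wins the first race. When its session expires, the watch fires and broker 2, the first to retry, takes over while broker 3 re-registers its watch.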

flowchart TD
    subgraph zk [ZooKeeper ensemble]
        z1["/controller"]
        z2["/brokers/ids"]
        z3["/brokers/topics"]
    end
    subgraph kafka [Kafka brokers]
        cb["Controller broker"]
        b2["Broker 2"]
        b3["Broker 3"]
    end
    cb -->|created /controller| z1
    b2 -->|register + watch| z2
    b3 -->|register + watch| z2
    cb -->|read/write metadata| z3

Why ZooKeeper became a bottleneck

ZooKeeper worked for years, but it had structural limits:

Two systems to run: you operated a ZooKeeper ensemble and a Kafka cluster, each with its own configuration, tuning, and failure modes.
Metadata scaling: on controller failover, the new controller had to load the full cluster metadata from ZooKeeper and push it to every broker. With hundreds of thousands of partitions, this could take a long time, extending outages.
Watch storms: many watches firing at once during large changes created bursts of coordination traffic.
Divergent source of truth: metadata lived in ZooKeeper, but brokers cached their own view, and reconciling the two added complexity and edge cases.

3. The KRaft era

KRaft (Kafka Raft) removes ZooKeeper entirely. The control plane moves inside Kafka itself, and metadata becomes just another Kafka log, managed by a built-in Raft consensus quorum.

The key ideas:

A small set of nodes run as controllers and form a controller quorum. They use the Raft consensus algorithm to agree on metadata changes.
All metadata is stored in an internal, replicated log: the metadata log (the __cluster_metadata topic). Every metadata change (create topic, leader change, config update) is an ordered record appended to this log.
One controller in the quorum is the active controller (the Raft leader). It handles metadata writes; the other controllers replicate the metadata log and stand ready to take over.
Brokers become simple followers of the metadata log. They replay it to build their view of the cluster, rather than being pushed a full snapshot on every controller change.
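
To make the replay idea concrete, here is a minimal sketch in which two brokers build the same view from one ordered log. The record shapes (`"create_partition"`, `"leader_change"`, `"delete_partition"`) and the class names are simplified assumptions for illustration, not Kafka’s actual metadata record schema.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataLog:
    """Hypothetical stand-in for the replicated metadata log."""
    records: list = field(default_factory=list)

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record


@dataclass
class BrokerView:
    """A broker's locally built picture of partition leadership."""
    next_offset: int = 0
    leaders: dict = field(default_factory=dict)

    def apply(self, record):
        kind, partition, leader = record
        if kind in ("create_partition", "leader_change"):
            self.leaders[partition] = leader
        elif kind == "delete_partition":
            self.leaders.pop(partition, None)

    def catch_up(self, log):
        # Replay only the records this broker has not seen yet.
        for record in log.records[self.next_offset:]:
            self.apply(record)
        self.next_offset = len(log.records)


log = MetadataLog()
log.append(("create_partition", "orders-0", 1))
log.append(("create_partition", "orders-1", 2))

broker_a = BrokerView()
broker_a.catch_up(log)                         # sees offsets 0 and 1

log.append(("leader_change", "orders-0", 3))

broker_b = BrokerView()
broker_b.catch_up(log)                         # fresh broker replays everything
broker_a.catch_up(log)                         # existing broker reads only offset 2

print(broker_a.leaders == broker_b.leaders)
```

Because every broker applies the same records in the same order, their views converge regardless of when each one started reading.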

flowchart TD
    subgraph quorum [Controller quorum]
        ac["Active controller<br/>(Raft leader)"]
        c2["Controller<br/>(follower)"]
        c3["Controller<br/>(follower)"]
    end
    mlog["Metadata log<br/>(__cluster_metadata)"]
    subgraph brokers [Brokers]
        b1["Broker 1"]
        b2["Broker 2"]
    end
    ac -->|appends changes| mlog
    ac -->|replicate| c2
    ac -->|replicate| c3
    mlog -->|brokers replay the log| b1
    mlog --> b2

Metadata as a log, and snapshots

Because metadata is an append-only log, it inherits Kafka’s own strengths: it is ordered, replicated, and replayable. A broker that restarts catches up by reading the metadata log from the offset where it left off, much like a consumer resuming from its last committed offset.

To keep that log from growing forever, KRaft periodically writes a snapshot: a compacted point-in-time image of the metadata. A restarting broker loads the latest snapshot, then replays only the newer records, so recovery stays fast even in large clusters.
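
The following sketch shows why a snapshot plus the newer records yields the same state as replaying the whole log. The `("set", key, value)` and `("delete", key, None)` record shapes and the function names are assumptions made for this example. Here the full log is kept in memory only so the two results can be compared.

```python
def apply(state, record):
    """Apply one metadata record to a state dictionary."""
    kind, key, value = record
    if kind == "set":
        state[key] = value
    elif kind == "delete":
        state.pop(key, None)


def replay(records, base_state=None):
    """Return the state reached by applying records on top of base_state."""
    state = dict(base_state or {})
    for record in records:
        apply(state, record)
    return state


def take_snapshot(log, end_offset):
    """Point-in-time image of the state after applying log[0:end_offset]."""
    return {"end_offset": end_offset, "state": replay(log[:end_offset])}


def recover(snapshot, log):
    """Load the snapshot, then replay only the records written after it."""
    return replay(log[snapshot["end_offset"]:], snapshot["state"])


log = [
    ("set", "orders-0.leader", 1),
    ("set", "orders-1.leader", 2),
    ("set", "orders-0.leader", 3),
    ("delete", "orders-1.leader", None),
    ("set", "payments-0.leader", 2),
]

snapshot = take_snapshot(log, end_offset=4)
recovered = recover(snapshot, log)
replayed_tail = len(log) - snapshot["end_offset"]

print(recovered == replay(log), replayed_tail)
```

The restarting node applies one record instead of five. The saving grows with the length of the log, which is the point of snapshotting.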

Why KRaft is the modern default

One system to run: no separate ZooKeeper ensemble. Fewer moving parts, simpler operations.
Faster failover and startup: the new active controller already has the metadata log; there is no expensive full reload and push.
Scales to far more partitions: the log-plus-snapshot model handles millions of partitions where ZooKeeper struggled.
Single source of truth: metadata lives in one ordered log, not split between ZooKeeper and broker caches.

KRaft has been production-ready since Kafka 3.3 and is the default for new clusters. ZooKeeper mode was deprecated in Kafka 3.5 and removed entirely in Kafka 4.0.

4. Controller election: the two models side by side

The election mechanism is the sharpest contrast between the two.

sequenceDiagram
    participant B as Broker
    participant ZK as ZooKeeper
    Note over B,ZK: ZooKeeper model
    B->>ZK: try to create /controller znode
    ZK-->>B: success, you are the controller
    Note over ZK: if controller dies, its znode disappears
    ZK-->>B: watch fires, brokers race again

sequenceDiagram
    participant C1 as Controller 1
    participant C2 as Controller 2
    participant C3 as Controller 3
    Note over C1,C3: KRaft model
    C1->>C2: Raft vote request
    C1->>C3: Raft vote request
    C2-->>C1: vote granted
    C3-->>C1: vote granted
    Note over C1: majority reached, becomes active controller
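
The vote exchange in the KRaft diagram can be sketched as below. This is a heavily simplified illustration of Raft-style voting, not Kafka’s implementation. The `Controller` class and `run_election` function are hypothetical, and log freshness is compared by end offset alone, whereas real Raft also compares the term of the last entry.

```python
from dataclasses import dataclass
from typing import Optional


def majority(quorum_size):
    """Smallest number of votes that is more than half the quorum."""
    return quorum_size // 2 + 1


@dataclass
class Controller:
    node_id: int
    epoch: int = 0                  # Raft term
    log_end_offset: int = 0
    voted_for: Optional[int] = None
    alive: bool = True

    def handle_vote_request(self, candidate_id, candidate_epoch, candidate_log_end):
        if not self.alive or candidate_epoch < self.epoch:
            return False
        if candidate_epoch > self.epoch:
            # A newer epoch resets the vote cast in the older one.
            self.epoch = candidate_epoch
            self.voted_for = None
        if self.voted_for not in (None, candidate_id):
            return False            # at most one vote per epoch
        if candidate_log_end < self.log_end_offset:
            return False            # never elect a candidate with a staler log
        self.voted_for = candidate_id
        return True


def run_election(candidate, quorum):
    candidate.epoch += 1
    candidate.voted_for = candidate.node_id
    votes = 1                       # the candidate votes for itself
    for peer in quorum:
        if peer is candidate:
            continue
        if peer.handle_vote_request(
                candidate.node_id, candidate.epoch, candidate.log_end_offset):
            votes += 1
    return votes >= majority(len(quorum))


c1 = Controller(1, log_end_offset=10)
c2 = Controller(2, log_end_offset=10)
c3 = Controller(3, log_end_offset=8)
quorum = [c1, c2, c3]

won = run_election(c1, quorum)        # both peers grant their vote
stale = run_election(c3, quorum)      # c3's log is behind, so peers refuse
c2.alive = c3.alive = False
no_quorum = run_election(c1, quorum)  # one vote out of three is not a majority

print(won, stale, no_quorum)
```

The last case is the quorum-loss scenario: with two of three controllers down, no candidate can reach a majority, so no active controller can be elected until a second controller returns.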

Aspect	ZooKeeper model	KRaft model
Metadata store	External ZooKeeper znodes	Internal metadata log (`__cluster_metadata`)
Controller election	Race to create `/controller` znode	Raft leader election among controller quorum
Failover cost	New controller reloads full metadata, pushes to brokers	New active controller already has the log; brokers replay
Systems to operate	Kafka + ZooKeeper	Kafka only
Scaling ceiling	Failover slows as partition counts reach the hundreds of thousands	Millions of partitions

The failure mode when a KRaft quorum loses its majority is covered operationally in Broker Down, Controller Failover, KRaft Quorum Loss.

5. The migration path

Existing ZooKeeper-based clusters do not switch to KRaft in one step. Kafka provides a staged migration so a running cluster moves to KRaft without downtime:

Provision a KRaft controller quorum alongside the existing ZooKeeper-backed cluster.
Enter migration mode: the controllers copy existing metadata out of ZooKeeper into the KRaft metadata log, then write every subsequent change to both stores (dual write) so the two stay in sync.
Migrate brokers: brokers are restarted in KRaft mode one at a time, so the cluster keeps serving traffic throughout.
Finalize: once all brokers run in KRaft mode and metadata is fully in the KRaft log, ZooKeeper is disconnected and decommissioned.
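
The four stages can be modelled as a small state machine. The sketch below is a conceptual illustration only: `Stage`, `MigratingCluster`, and its methods are hypothetical names, the metadata is reduced to key/value pairs, and none of this corresponds to real Kafka configuration or tooling.

```python
from enum import Enum


class Stage(Enum):
    ZOOKEEPER = 1
    DUAL_WRITE = 2
    BROKERS_ROLLED = 3
    KRAFT_ONLY = 4


class MigratingCluster:
    """Hypothetical model of a cluster moving from ZooKeeper to KRaft."""

    def __init__(self, zk_metadata, broker_ids):
        self.stage = Stage.ZOOKEEPER
        self.zk = dict(zk_metadata)
        self.kraft_log = []
        self.broker_mode = {b: "zk" for b in broker_ids}

    def _require(self, expected):
        if self.stage is not expected:
            raise RuntimeError(f"expected {expected.name}, got {self.stage.name}")

    def enter_migration(self):
        self._require(Stage.ZOOKEEPER)
        # Copy the existing ZooKeeper metadata into the KRaft log.
        for key, value in sorted(self.zk.items()):
            self.kraft_log.append((key, value))
        self.stage = Stage.DUAL_WRITE

    def write(self, key, value):
        if self.stage is Stage.ZOOKEEPER:
            self.zk[key] = value
            return
        self.kraft_log.append((key, value))
        if self.stage is not Stage.KRAFT_ONLY:
            self.zk[key] = value  # dual write keeps ZooKeeper in sync

    def roll_broker(self, broker_id):
        self._require(Stage.DUAL_WRITE)
        self.broker_mode[broker_id] = "kraft"
        if all(mode == "kraft" for mode in self.broker_mode.values()):
            self.stage = Stage.BROKERS_ROLLED

    def finalize(self):
        self._require(Stage.BROKERS_ROLLED)
        self.zk = None  # ZooKeeper is decommissioned
        self.stage = Stage.KRAFT_ONLY


cluster = MigratingCluster({"orders-0.leader": 1}, broker_ids=[1, 2])
cluster.enter_migration()
cluster.write("orders-0.leader", 2)   # lands in both stores
zk_during_migration = dict(cluster.zk)
cluster.roll_broker(1)
cluster.roll_broker(2)
cluster.finalize()
cluster.write("payments-0.leader", 1)  # KRaft log only

print(cluster.stage.name, cluster.kraft_log)
```

The `_require` guard captures the ordering constraint: ZooKeeper cannot be removed until every broker has been rolled into KRaft mode.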

flowchart LR
    s1["1. ZooKeeper cluster<br/>(starting point)"]
    s2["2. Add KRaft quorum,<br/>dual-write metadata"]
    s3["3. Roll brokers<br/>into KRaft mode"]
    s4["4. Remove ZooKeeper<br/>(fully KRaft)"]
    s1 --> s2 --> s3 --> s4

As an application developer you rarely run the migration yourself, but you should recognize which model a cluster uses, because it changes how the platform team operates and troubleshoots it. On Amazon MSK, the managed service handles the control plane for you, as covered in MSK Architecture.

Checkpoint

You should now be able to:

Explain the difference between the data plane and the control plane.
Describe how the ZooKeeper era stored metadata and elected the controller broker, and name two reasons it became a bottleneck.
Explain how KRaft stores metadata as a replicated log and elects an active controller via a Raft quorum.
Explain what the metadata log and snapshots do, and why they make failover faster.
Outline the four stages of the ZooKeeper-to-KRaft migration.

Next: The Commit Log on Disk, where you see how a partition is actually stored in segments and why Kafka is so fast.