AWS MSK Architecture and High Availability for Kafka

Section 8 turns from building applications to running Kafka in production, and on AWS that almost always means Amazon MSK (Managed Streaming for Apache Kafka). This module shows how an MSK cluster is laid out for high availability across availability zones, how replication settings make that resilience real, and where MSK ends and your responsibility begins. It is the foundation for the tooling and troubleshooting modules that follow.

What you’ll be able to do after this module

Describe an MSK cluster spread across availability zones.
Set replication factor and min.insync.replicas for AZ resilience.
Explain the VPC, subnet, and security group layout MSK needs.
Reason about EBS storage and its limits.
Choose between MSK and self-managed Kafka honestly.

1. MSK is managed, not magic

MSK runs the same Apache Kafka you have used all course, but AWS operates the brokers: provisioning, patching, and broker replacement. What MSK does not do is design your topics, choose your replication, or write your clients correctly. High availability is a shared outcome: AWS keeps brokers running, and you configure replication and producers so that losing a broker is survivable.

Everything you learned about brokers, partitions, replicas, and ISR in Cluster Anatomy applies unchanged. MSK just places those brokers on AWS infrastructure.

2. Multi-AZ topology

An availability zone (AZ) is one or more isolated data centers within an AWS region, with power and networking independent of the other AZs. The core HA idea is to spread brokers across AZs so the loss of one AZ cannot take the cluster down. A typical MSK cluster uses three brokers in three AZs.

flowchart TD
    subgraph region["AWS Region"]
        subgraph az1["AZ a"]
            b1["Broker 1"]
        end
        subgraph az2["AZ b"]
            b2["Broker 2"]
        end
        subgraph az3["AZ c"]
            b3["Broker 3"]
        end
    end
    b1 <-->|replication| b2
    b2 <-->|replication| b3
    b1 <-->|replication| b3

With partition replicas placed across these brokers, an AZ outage takes down at most one replica of each partition, and the partition stays available from the survivors. MSK manages the control plane (KRaft or, on older clusters, ZooKeeper) for you across AZs as well.
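
The placement argument above can be checked with a toy model. This is a minimal sketch, not Kafka's actual assignment algorithm: the broker-to-AZ map, the round-robin rule, and the function names are all assumptions made for illustration.

```python
# Hypothetical broker-to-AZ map for the three-broker cluster in the diagram.
BROKER_AZ = {1: "a", 2: "b", 3: "c"}


def assign_replicas(partition: int, replication_factor: int) -> list[int]:
    """Toy round-robin placement: the partition number picks the first broker."""
    brokers = sorted(BROKER_AZ)
    start = partition % len(brokers)
    return [brokers[(start + i) % len(brokers)] for i in range(replication_factor)]


def surviving_replicas(replicas: list[int], failed_az: str) -> list[int]:
    """Replicas still reachable after every broker in failed_az is lost."""
    return [b for b in replicas if BROKER_AZ[b] != failed_az]


# Six partitions with RF 3: each one ends up with a replica in every AZ.
assignment = {p: assign_replicas(p, 3) for p in range(6)}

for failed_az in ("a", "b", "c"):
    for partition, replicas in assignment.items():
        # One replica per AZ, so any single AZ outage costs exactly one replica.
        assert len(surviving_replicas(replicas, failed_az)) == 2
```

With one broker per AZ and RF 3, whichever AZ fails, every partition keeps two replicas.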

3. Replication and min.insync.replicas for AZ resilience

Multi-AZ topology only helps if partitions are actually replicated across those AZs. The two settings that make AZ loss survivable, from Reliable Producing, are the topic replication factor and min.insync.replicas.

Replication factor 3: one replica per AZ, so every partition has a copy in all three AZs.
min.insync.replicas=2: an acks=all write is acknowledged only after every in-sync replica has it, and is rejected if fewer than two replicas are in sync, so an acknowledged write survives the loss of any one AZ.

flowchart LR
    p["acks=all produce"]
    l["leader (AZ a)"]
    f1["follower (AZ b)"]
    f2["follower (AZ c)"]
    p --> l
    l --> f1
    l --> f2
    l -.->|ack after<br/>min.insync.replicas=2| p

This is the standard MSK durability posture: RF 3, min.insync.replicas 2, producers on acks=all. It tolerates one AZ down while still accepting writes. Dropping min.insync.replicas to 1 to keep writing during an outage trades away the guarantee, covered in Under-Replicated and Offline Partitions.
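
The effect of these settings during an outage can be sketched as a small decision function. It is a simplified model that assumes one replica per AZ and that each lost AZ removes exactly one replica from the ISR; the function names are illustrative.

```python
def write_accepted(in_sync_replicas: int, min_insync_replicas: int) -> bool:
    """An acks=all produce succeeds only while the ISR size meets min.insync.replicas."""
    return in_sync_replicas >= min_insync_replicas


def outcome(replication_factor: int, min_isr: int, azs_down: int) -> str:
    # Assumption: one replica per AZ, so each lost AZ shrinks the ISR by one.
    isr = max(replication_factor - azs_down, 0)
    if isr == 0:
        return "offline"
    return "writes accepted" if write_accepted(isr, min_isr) else "writes rejected"


# RF 3 with min.insync.replicas 2: walk through zero to three AZs down.
for azs_down in range(4):
    print(azs_down, outcome(3, 2, azs_down))
```

Under this model the number of AZ failures that still allow acks=all writes is the replication factor minus min.insync.replicas, which is one for the RF 3 / min ISR 2 posture.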

4. Networking: VPC, subnets, security groups

MSK brokers live inside your VPC, one subnet per AZ, and are reached by private IPs. Clients connect from within the VPC or over peering/VPN, never from the public internet by default. Three pieces must line up:

VPC and subnets: one private subnet per AZ for the brokers; clients run in the same VPC or a connected one.
Security groups: the broker security group must allow inbound traffic from the client security group on the Kafka port for your auth mode (9092 plaintext, 9094 TLS, 9096 SASL/SCRAM, 9098 IAM).
DNS: clients bootstrap from the MSK broker endpoints, which resolve to private IPs.

Most “cannot connect to MSK” incidents are security groups or subnet routing, not Kafka itself, which is why AWS-Layer Connectivity is a dedicated playbook.
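
As an illustration of the security group piece, the sketch below builds the single ingress permission a broker security group needs for a given auth mode. The dictionary follows the shape of one `IpPermissions` entry in the EC2 AuthorizeSecurityGroupIngress API; the helper name and the security group ID are placeholders, and nothing here calls AWS.

```python
# In-VPC MSK listener ports by client authentication mode.
MSK_PORTS = {
    "plaintext": 9092,
    "tls": 9094,
    "sasl_scram": 9096,
    "iam": 9098,
}


def broker_ingress_rule(auth: str, client_sg_id: str) -> dict:
    """Build one ingress permission allowing the client SG on the Kafka port."""
    port = MSK_PORTS[auth]
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        # Reference the client security group instead of a CIDR range.
        "UserIdGroupPairs": [{"GroupId": client_sg_id}],
    }


# "sg-0123client" is a placeholder ID, not a real security group.
rule = broker_ingress_rule("iam", "sg-0123client")
print(rule)
```

Referencing the client security group rather than an IP range means the rule keeps working as client instances come and go.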

5. Storage: EBS

Each MSK broker stores its log segments (from Storage Internals) on an attached EBS volume. Storage is finite and is a real operational limit: retention and partition count must fit the provisioned volume.

Broker storage fills when retention is too long or throughput is higher than planned.
MSK offers storage autoscaling and tiered storage to offload older segments, but the fundamentals of retention still apply.
A full log directory takes a broker offline, the subject of Disk Pressure, Retention, and Segment Issues.

Size volumes for peak retention plus headroom, and monitor the CloudWatch storage metrics covered in the next module.
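
A rough sizing calculation makes "peak retention plus headroom" concrete. This is a back-of-the-envelope sketch that assumes time-based retention, a steady ingress rate, and partitions balanced evenly across brokers; it ignores index files and compression, and the 30% default headroom is an illustrative choice, not an AWS recommendation.

```python
def required_storage_gib(
    ingress_mib_per_s: float,
    retention_hours: float,
    replication_factor: int,
    broker_count: int,
    headroom: float = 0.3,
) -> float:
    """Estimated EBS GiB per broker for time-based retention at a steady ingress rate."""
    # Data written cluster-wide during one retention window, before replication.
    retained_mib = ingress_mib_per_s * retention_hours * 3600
    # Every byte is stored once per replica.
    replicated_mib = retained_mib * replication_factor
    # Assumes replicas are spread evenly over the brokers.
    per_broker_gib = replicated_mib / broker_count / 1024
    return per_broker_gib * (1 + headroom)


# Example: 5 MiB/s ingress, 72 h retention, RF 3, three brokers.
print(round(required_storage_gib(5, 72, 3, 3)))  # prints 1645
```

With RF 3 on three brokers, every broker holds a full copy of the retained data, so adding brokers (not only larger volumes) is what lowers the per-broker figure.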

6. MSK vs self-managed

MSK is the recommended default on AWS, but be honest about the trade-off.

Aspect	MSK	Self-managed (KRaft on EC2/K8s)
Broker ops	AWS handles patching, replacement	You handle everything
Control	Config within MSK’s allowed set	Full control of every setting
Integration	Native IAM, CloudWatch, VPC	You wire it all up
Effort	Lower operational burden	Higher, needs Kafka expertise

Self-managed makes sense when you need configuration or versions MSK does not expose, or you run outside AWS. For most teams on AWS, MSK’s lower operational burden wins. A self-managed cluster is still the same KRaft Kafka from Control Plane, just with you as the operator.

7. Guided practical

MSK is AWS-only, so this practical is exhibit-based and maps to your local lab.

Sketch a three-broker, three-AZ MSK layout and mark where the three replicas of one partition live.
State the RF and min.insync.replicas you would use and explain what one AZ failing does to writes.
In the local three-broker lab from Local Lab, create a topic with RF 3 and min.insync.replicas 2, then stop one broker and confirm produce/consume continues.
List the security group rule a client needs to reach the brokers.
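
For the local-lab step, the topic can be created with the standard `kafka-topics.sh` tool. The sketch below only assembles the command so you can inspect it before running; the bootstrap address `localhost:9092` and the topic name `az-demo` are assumptions about your lab, not values from this course.

```python
import shlex


def create_topic_command(
    bootstrap: str,
    topic: str,
    partitions: int = 3,
    replication_factor: int = 3,
    min_isr: int = 2,
) -> list[str]:
    """Assemble a kafka-topics.sh invocation for an RF 3 / min ISR 2 topic."""
    return [
        "kafka-topics.sh",
        "--bootstrap-server", bootstrap,
        "--create",
        "--topic", topic,
        "--partitions", str(partitions),
        "--replication-factor", str(replication_factor),
        "--config", f"min.insync.replicas={min_isr}",
    ]


# Placeholder bootstrap address and topic name for a local lab.
command = create_topic_command("localhost:9092", "az-demo")
print(shlex.join(command))
```

To execute it rather than print it, the list can be passed to `subprocess.run(command, check=True)` on a machine where `kafka-topics.sh` is on the PATH.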

Next: Tooling Walkthrough, the operational CLI and CloudWatch toolkit you use to inspect and fix a running cluster.