AWS-Layer Connectivity Issues: Kafka MSK Incident Playbook

1. Symptom

The first signal is broker- or app-side, but the root cause lives one layer down in AWS infrastructure. Common presentations:

A client cannot connect to MSK at all, but nothing changed in Kafka: no deploy, no config change.
Produce and consume latency climbs steadily while broker CPU and memory look normal.
Connections that used to work start timing out (not refusing) after a network or subnet change.
Only clients in one subnet or AZ are affected, while others are fine.

The common thread: the broker and the app both report what they see honestly, but the fault is in compute, storage, or networking underneath. The skill is recognizing that pattern quickly instead of chasing it as a Kafka problem.

2. Likely causes (AWS/infra side)

Cause	What is actually happening
Security group misconfiguration	The broker SG no longer allows the client SG on the Kafka port; security groups are stateful, so a single missing inbound rule is usually the whole story
NACL misconfiguration	A subnet Network ACL blocks traffic the SG would allow; NACLs are stateless, so return traffic must be allowed explicitly (the “we fixed the SG but it is still broken” trap)
EBS throughput/IOPS exhaustion	The broker data volume is saturated; every fsync-backed write queues behind the disk, so latency rises while the broker process stays healthy
EC2/broker host failure	The underlying host fails; MSK replaces the broker, but there is a gap and a leadership move
Subnet routing / DNS	Clients cannot resolve or route to broker private IPs after a VPC change
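
The stateless NACL trap in the table above can be sketched in a few lines. This is a simplified model, not an AWS API: `NaclRule`, `nacl_allows`, and `blocked_directions` are hypothetical names, CIDR matching is left out (the rules are assumed to be pre-filtered to the relevant peer range), and port 40000 stands in for whatever ephemeral port the client happens to use. It shows why all four directions need an allow before a connection completes.

```python
from typing import List, NamedTuple


class NaclRule(NamedTuple):
    """Hypothetical, simplified NACL entry (CIDR matching omitted)."""
    number: int      # rule number; lowest is evaluated first
    port_from: int
    port_to: int
    allow: bool


def nacl_allows(rules: List[NaclRule], port: int) -> bool:
    """First matching rule wins; no match falls through to the implicit deny."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port_from <= port <= rule.port_to:
            return rule.allow
    return False


def blocked_directions(client_out, broker_in, broker_out, client_in,
                       kafka_port: int = 9094,
                       ephemeral_port: int = 40000) -> List[str]:
    """Return every direction a stateless NACL would drop for one connection."""
    checks = [
        ("client subnet outbound to Kafka port", client_out, kafka_port),
        ("broker subnet inbound on Kafka port", broker_in, kafka_port),
        # Return traffic is addressed to the client's ephemeral port.
        ("broker subnet outbound to ephemeral port", broker_out, ephemeral_port),
        ("client subnet inbound on ephemeral port", client_in, ephemeral_port),
    ]
    return [name for name, rules, port in checks
            if not nacl_allows(rules, port)]


allow_all = [NaclRule(100, 0, 65535, True)]
kafka_only = [NaclRule(100, 9094, 9094, True)]

# Broker subnet allows the Kafka port in both directions but not return traffic.
print(blocked_directions(allow_all, kafka_only, kafka_only, allow_all))
# prints ['broker subnet outbound to ephemeral port']
```

A security group would have allowed the return traffic automatically, which is why fixing the SG alone does not clear this failure.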

3. How it manifests to the Spring app

AWS-side cause	What the Spring app sees
Security group block	`TimeoutException` / connection timed out on new connections (not “refused”); existing connections work until they cycle
NACL block	Same timeout, or a stranger pattern where the handshake starts then hangs if only return traffic is blocked
EBS saturation	No connection errors at all: produce/consume just get slower, and `acks=all` latency climbs. The easiest to misdiagnose because nothing looks down
Host failure	Brief `NOT_LEADER_FOR_PARTITION` errors (named `NOT_LEADER_OR_FOLLOWER` from Kafka 2.6 onward) and disconnects, then recovery once MSK replaces the broker

4. Diagnostic steps

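Classify the client error. Timed out means packets were dropped silently (suspect SG/NACL); refused means the host was reached and actively rejected the connection (suspect the port or listener, not SG/NACL); slow-but-working points at EBS or host load.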
Check the scope. All clients, or only one subnet/AZ? One subnet points at that subnet’s routing or NACL.
Read the security group rules. Does the broker SG allow the client SG on the listener port in use (9092 plaintext, 9094 TLS, 9096 SASL/SCRAM, 9098 IAM)? Recent changes are the usual cause.
Read the NACLs for the client and broker subnets. Remember they are stateless: check both inbound and outbound.
Check EBS and broker metrics if latency is the symptom: VolumeReadOps/VolumeWriteOps, VolumeQueueLength, throughput, and broker CPU. Saturation explains slow-without-errors.

Step	Question it answers	Time cost
1. Error class	Dropped, refused, or slow?	seconds
2. Scope	Whole VPC or one subnet?	1-2 min
3. SG rules	Is the port allowed?	2-3 min
4. NACLs	Is a subnet rule blocking a direction?	2-3 min
5. EBS/broker	Is storage or host saturated?	2-3 min
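
Step 1 can be reproduced from a host in the affected subnet with a plain TCP probe. The sketch below uses only the Python standard library; `classify_connect`, `NEXT_CHECK`, and the 5-second default timeout are illustrative choices, not part of any Kafka or AWS tooling. A raw TCP connect is enough to separate dropped from refused even on a TLS listener, because the distinction is made before any TLS handshake.

```python
import socket


def classify_connect(host: str, port: int, timeout_s: float = 5.0) -> str:
    """Attempt one TCP connection and name the failure class."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "connected"
    except socket.gaierror:
        return "dns_failure"          # name did not resolve
    except (socket.timeout, TimeoutError):
        return "timed_out"            # no answer at all: packets dropped
    except ConnectionRefusedError:
        return "refused"              # host answered and rejected the connection
    except OSError:
        return "other_network_error"  # e.g. no route to host


# Where the playbook says to look next for each class.
NEXT_CHECK = {
    "timed_out": "SG rules, then NACLs in both directions",
    "refused": "port and listener on the broker, not SG/NACL",
    "dns_failure": "subnet routing / DNS",
    "connected": "network path is open; if slow, check EBS and broker metrics",
}
```

Calling `classify_connect` with a broker hostname and listener port from each client subnet also answers step 2: if only one subnet reports `timed_out`, that subnet's NACL or routing is the first suspect.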

5. Safe remediations

Situation	Safe action
Missing/narrowed SG rule	Restore the specific client SG allow on the Kafka port (with network-owner sign-off)
NACL blocking a direction	Fix the stateless rule for the affected subnet (network team)
EBS saturation	Escalate for storage/throughput provisioning; do not mask as a Kafka tuning issue
Host failure	Let MSK replace the broker; confirm clients recover with retries

6. Escalation trigger

Page the platform/network team (and on-call engineering) if:

The fault is confirmed in SG, NACL, subnet routing, or EBS provisioning: these are infra-owned.
Latency is driven by EBS or host saturation rather than anything Kafka or app config.
Multiple brokers or AZs are affected simultaneously.
You cannot determine the layer from the exhibits within your diagnostic pass.

7. Relevant commands and exhibits

# Client error: dropped packets (SG/NACL), not refused
org.apache.kafka.common.errors.TimeoutException:
  Topic orders not present in metadata after 60000 ms.
Connection to node -1 (b-1.msk.../10.0.1.20:9094) could not be established.

# Security group table exhibit (broker SG inbound)
Type        Protocol  Port   Source
Custom TCP  TCP       9094   sg-clients (app tier)   <-- must exist
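
# The rule in the exhibit above can also be checked mechanically. A minimal
# sketch, assuming the inbound rules are available as a list of dicts shaped
# like the IpPermissions entries returned by the EC2 describe-security-groups
# call (IpProtocol, FromPort, ToPort, UserIdGroupPairs). The function name and
# the group IDs are hypothetical, and CIDR-based sources are not considered.

```python
from typing import Any, Dict, List


def sg_allows_client(ip_permissions: List[Dict[str, Any]],
                     client_sg_id: str, port: int) -> bool:
    """True if any inbound rule admits the client SG on the given TCP port."""
    for perm in ip_permissions:
        proto = perm.get("IpProtocol")
        if proto == "-1":
            in_range = True  # "-1" means all protocols and all ports
        elif proto == "tcp":
            lo, hi = perm.get("FromPort"), perm.get("ToPort")
            in_range = lo is not None and hi is not None and lo <= port <= hi
        else:
            in_range = False
        if not in_range:
            continue
        sources = {pair.get("GroupId")
                   for pair in perm.get("UserIdGroupPairs", [])}
        if client_sg_id in sources:
            return True
    return False


# Hypothetical broker SG that admits the app tier on the TLS listener only.
broker_inbound = [{
    "IpProtocol": "tcp", "FromPort": 9094, "ToPort": 9094,
    "UserIdGroupPairs": [{"GroupId": "sg-0aaa-example-clients"}],
}]

print(sg_allows_client(broker_inbound, "sg-0aaa-example-clients", 9094))  # True
print(sg_allows_client(broker_inbound, "sg-0aaa-example-clients", 9098))  # False
```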

# CloudWatch exhibit: EBS saturation with healthy broker CPU
VolumeWriteOps         : rising toward provisioned limit
VolumeQueueLength      : elevated (I/O queuing)
CpuUser (broker)       : normal
Produce acks=all p99   : climbing   <-- latency without errors

Where to look: MSK CloudWatch metrics for broker CPU/network, VolumeReadOps/VolumeWriteOps, and VolumeQueueLength, plus the Kafka metrics from Observability to confirm the broker itself is healthy.
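
The EBS exhibit can be turned into a rough arithmetic check. This is a sketch under stated assumptions: the op counts are taken to be per-period sums (so dividing by the period length gives IOPS), the provisioned IOPS figure is known, and the three thresholds are placeholders to tune rather than AWS guidance. The function names are hypothetical.

```python
def iops_utilization(read_ops_sum: float, write_ops_sum: float,
                     period_s: int, provisioned_iops: float) -> float:
    """Fraction of provisioned IOPS consumed over one metric period."""
    observed_iops = (read_ops_sum + write_ops_sum) / period_s
    return observed_iops / provisioned_iops


def looks_like_ebs_saturation(utilization: float, queue_length: float,
                              cpu_user_pct: float,
                              util_threshold: float = 0.9,    # assumed
                              queue_threshold: float = 1.0,   # assumed
                              cpu_threshold: float = 60.0) -> bool:  # assumed
    """Disk near its limit and queuing while broker CPU stays normal."""
    return (utilization >= util_threshold
            and queue_length >= queue_threshold
            and cpu_user_pct < cpu_threshold)


# Example: 30,000 reads and 255,000 writes in a 60 s period on a volume
# provisioned for 5,000 IOPS is (285000 / 60) / 5000 = 0.95 utilization.
util = iops_utilization(30_000, 255_000, 60, 5_000)
print(looks_like_ebs_saturation(util, queue_length=4.0, cpu_user_pct=25.0))
# prints True
```

If CPU is also high, the check returns False on purpose: that pattern points at broker load rather than storage alone, and the diagnosis goes back to the broker metrics.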

8. Guided practical

This playbook is exhibit-based. Practice the diagnosis, not the outage.

For each exhibit above, state whether the fault is SG/NACL, EBS, or host, and why.
Given “timed out” on new connections while existing ones still work, name the most likely cause and the exact rule to check.
Given rising acks=all latency with normal broker CPU and elevated VolumeQueueLength, name the layer and who owns the fix.
Explain the timed-out versus refused distinction to a teammate in one sentence.
