AWS-Layer Connectivity
Security groups, NACLs, EBS throughput and IOPS, and EC2 status checks beneath Kafka broker symptoms, diagnosed from AWS exhibits.
1. Symptom
The first signal is broker- or app-side, but the root cause lives one layer down in AWS infrastructure. Common presentations:
- A client cannot connect to MSK at all, but nothing changed in Kafka: no deploy, no config change.
- Produce and consume latency climbs steadily while broker CPU and memory look normal.
- Connections that used to work start timing out (not refusing) after a network or subnet change.
- Only clients in one subnet or AZ are affected, while others are fine.
The common thread: the broker and the app both report what they see honestly, but the fault is in compute, storage, or networking underneath. The skill is recognizing that pattern quickly instead of chasing it as a Kafka problem.
2. Likely causes (AWS/infra side)
| Cause | What is actually happening |
|---|---|
| Security group misconfiguration | The broker SG no longer allows the client SG on the Kafka port; a single missing stateful rule is usually the whole story |
| NACL misconfiguration | A subnet Network ACL blocks traffic the SG would allow; NACLs are stateless, so return traffic must be allowed explicitly (the “we fixed the SG but it is still broken” trap) |
| EBS throughput/IOPS exhaustion | The broker data volume is saturated; log writes and flushes queue behind the disk, so latency rises while the broker process itself stays healthy |
| EC2/broker host failure | The underlying host fails; MSK replaces the broker, but there is a gap in availability and partition leadership moves to other brokers |
| Subnet routing / DNS | Clients cannot resolve or route to broker private IPs after a VPC change |
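The stateful versus stateless difference is easier to see in a small model. The sketch below is a deliberate simplification: it matches on port only (real NACL rules also match protocol and CIDR), and `NaclRule`, `nacl_allows`, and `flow_works` are hypothetical names used for illustration, not an AWS API. It shows why a correct inbound rule is not enough when the reply to the client's ephemeral port has no outbound allow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NaclRule:
    number: int      # lower rule numbers are evaluated first
    port_from: int
    port_to: int
    allow: bool


def nacl_allows(rules, port):
    """First matching rule (lowest number) decides; no match is an implicit deny."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port_from <= port <= rule.port_to:
            return rule.allow
    return False


def flow_works(broker_inbound, broker_outbound, kafka_port, client_port):
    """A stateless NACL checks the request and the reply independently."""
    request_ok = nacl_allows(broker_inbound, kafka_port)   # client -> broker
    reply_ok = nacl_allows(broker_outbound, client_port)   # broker -> client
    return request_ok and reply_ok


inbound = [NaclRule(100, 9094, 9094, True)]
outbound_missing = []                                  # return traffic never allowed
outbound_fixed = [NaclRule(100, 1024, 65535, True)]    # ephemeral port range allowed

print(flow_works(inbound, outbound_missing, 9094, 50321))  # False
print(flow_works(inbound, outbound_fixed, 9094, 50321))    # True
```

A security group would need only the inbound rule, because it tracks the connection and lets the reply through automatically.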
3. How it manifests to the Spring app
| AWS-side cause | What the Spring app sees |
|---|---|
| Security group block | TimeoutException / connection timed out on new connections (not “refused”); existing connections work until they cycle |
| NACL block | Same timeout, or a stranger pattern where the handshake starts then hangs if only return traffic is blocked |
| EBS saturation | No connection errors at all: produce/consume just get slower, and acks=all latency climbs. The easiest to misdiagnose because nothing looks down |
| Host failure | Brief NOT_LEADER_OR_FOLLOWER errors (NOT_LEADER_FOR_PARTITION on older Kafka versions) and disconnects, then recovery once MSK replaces the broker |
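The mapping in the table can be sketched as a small triage helper. This is a rough illustration only: the matched substrings and the returned labels are assumptions chosen for this example, and real client messages vary by Kafka version and logging setup.

```python
def suspect_layer(client_log):
    """Map Kafka client log text to the layer worth checking first."""
    text = client_log.lower()
    if "connection refused" in text:
        # The host answered with a reset, so packets are not being dropped.
        return "host-reached"
    if "not_leader" in text or "notleader" in text:
        # Leadership moved, which fits a broker or host replacement.
        return "host-failure"
    if "timeoutexception" in text or "timed out" in text:
        # Silence on the wire: SG, NACL, or routing is the first suspect.
        return "sg-or-nacl"
    # No connection error at all; if latency is rising, look at EBS or host load.
    return "no-connection-error"


exhibit = (
    "org.apache.kafka.common.errors.TimeoutException: "
    "Topic orders not present in metadata after 60000 ms."
)
print(suspect_layer(exhibit))  # sg-or-nacl
```

The helper only suggests where to look first; the SG, NACL, and metric exhibits still decide the diagnosis.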
4. Diagnostic steps
- Classify the client error. Timed out (packets silently dropped, suspect SG/NACL) versus refused (the packet reached the host and was actively rejected, so SG/NACL is not the block) versus slow-but-working (suspect EBS/host load).
- Check the scope. All clients, or only one subnet/AZ? One subnet points at that subnet’s routing or NACL.
- Read the security group rules. Does the broker SG allow the client SG on the listener port in use: 9092 (plaintext), 9094 (TLS), 9096 (SASL/SCRAM), or 9098 (IAM)? Recent changes are the usual cause.
- Read the NACLs for the client and broker subnets. Remember they are stateless: check both inbound and outbound.
- Check EBS and broker metrics if latency is the symptom: VolumeReadOps/VolumeWriteOps, throughput, and broker CPU. Saturation explains slow-without-errors.
| Step | Question it answers | Time cost |
|---|---|---|
| 1. Error class | Dropped, refused, or slow? | seconds |
| 2. Scope | Whole VPC or one subnet? | 1-2 min |
| 3. SG rules | Is the port allowed? | 2-3 min |
| 4. NACLs | Is a subnet rule blocking a direction? | 2-3 min |
| 5. EBS/broker | Is storage or host saturated? | 2-3 min |
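Step 1 can be reproduced outside the Spring app with a plain TCP probe run from a host in the affected subnet. The sketch below uses only the Python standard library; the function name `probe` and the 3-second timeout are assumptions for illustration, and the host in the usage comment is a placeholder, not a real endpoint.

```python
import socket


def probe(host, port, timeout_s=3.0):
    """Classify a TCP connect attempt as connected, refused, timed out, or other."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "connected"
    except ConnectionRefusedError:
        # The host replied with a reset: packets reached it, nothing dropped them.
        return "refused"
    except socket.timeout:
        # No reply at all: consistent with an SG, NACL, or routing drop.
        return "timed out"
    except OSError as exc:
        # Includes DNS resolution failures (socket.gaierror) and unreachable routes.
        return f"other: {exc}"


# Usage (placeholder host; substitute a broker endpoint from your cluster):
# print(probe("BROKER_HOST", 9094))
```

Running the same probe from an unaffected subnet gives the scope answer for step 2 at the same time.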
5. Safe remediations
| Situation | Safe action |
|---|---|
| Missing/narrowed SG rule | Restore the specific client SG allow on the Kafka port (with network-owner sign-off) |
| NACL blocking a direction | Fix the stateless rule for the affected subnet (network team) |
| EBS saturation | Escalate for storage/throughput provisioning; do not mask as a Kafka tuning issue |
| Host failure | Let MSK replace the broker; confirm clients recover with retries |
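Before and after an SG rule is restored, it helps to confirm mechanically that the allow is really present. The sketch below works on a dictionary shaped like one entry of the `SecurityGroups` list returned by the EC2 DescribeSecurityGroups API; fetching that data is left out. The function name `sg_allows` and the group IDs are hypothetical, and the check covers only SG-to-SG references over TCP, not CIDR-based rules.

```python
def sg_allows(group, client_sg_id, port):
    """Return True if the group has an inbound rule allowing client_sg_id on port."""
    for perm in group.get("IpPermissions", []):
        proto = perm.get("IpProtocol")
        if proto == "-1":
            port_ok = True  # "-1" means all protocols and all ports
        elif proto == "tcp":
            port_ok = perm.get("FromPort", 0) <= port <= perm.get("ToPort", 65535)
        else:
            continue
        sources = {pair.get("GroupId") for pair in perm.get("UserIdGroupPairs", [])}
        if port_ok and client_sg_id in sources:
            return True
    return False


# Hypothetical broker SG with the rule from the exhibit in place.
broker_sg = {
    "GroupId": "sg-brokers",
    "IpPermissions": [
        {
            "IpProtocol": "tcp",
            "FromPort": 9094,
            "ToPort": 9094,
            "UserIdGroupPairs": [{"GroupId": "sg-clients"}],
        }
    ],
}

print(sg_allows(broker_sg, "sg-clients", 9094))  # True
print(sg_allows(broker_sg, "sg-clients", 9098))  # False
```

A False result for the port the clients actually use is the missing or narrowed rule to raise with the network owner.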
6. Escalation trigger
Page the platform/network team (and on-call engineering) if:
- The fault is confirmed in SG, NACL, subnet routing, or EBS provisioning: these are infra-owned.
- Latency is driven by EBS or host saturation rather than anything Kafka or app config.
- Multiple brokers or AZs are affected simultaneously.
- You cannot determine the layer from the exhibits within your diagnostic pass.
7. Relevant commands and exhibits
# Client error: dropped packets (SG/NACL), not refused
org.apache.kafka.common.errors.TimeoutException:
Topic orders not present in metadata after 60000 ms.
Connection to node -1 (b-1.msk.../10.0.1.20:9094) could not be established.
# Security group table exhibit (broker SG inbound)
| Type | Protocol | Port | Source |
|---|---|---|---|
| Custom TCP | TCP | 9094 | sg-clients (app tier), must exist |
# CloudWatch exhibit: EBS saturation with healthy broker CPU
VolumeWriteOps : rising toward provisioned limit
VolumeQueueLength : elevated (I/O queuing)
CpuUser (broker) : normal
Produce acks=all p99 : climbing <-- latency without errors
Where to look: MSK CloudWatch for broker CPU/network and VolumeReadOps/VolumeWriteOps, plus the Kafka metrics from Observability to confirm the broker itself is healthy.
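The EBS exhibit above boils down to three conditions holding together: latency rising, I/O queue elevated, CPU normal. A minimal sketch of that check follows. The function name, the thresholds (`queue_threshold`, `cpu_ceiling_pct`), and the sample numbers are all illustrative assumptions, not recommended values or real measurements; pick thresholds from your own baseline.

```python
from statistics import mean


def looks_like_ebs_saturation(queue_length, cpu_user_pct, p99_ms,
                              queue_threshold=1.0, cpu_ceiling_pct=60.0):
    """True when latency rises with an elevated I/O queue while CPU stays normal."""
    if len(p99_ms) < 2 or not queue_length or not cpu_user_pct:
        return False  # not enough data to judge a trend
    half = len(p99_ms) // 2
    # Compare the later half of the window against the earlier half.
    latency_rising = mean(p99_ms[half:]) > mean(p99_ms[:half])
    queue_elevated = mean(queue_length) > queue_threshold
    cpu_normal = max(cpu_user_pct) < cpu_ceiling_pct
    return latency_rising and queue_elevated and cpu_normal


# Made-up datapoints, oldest first, for illustration only.
queue = [4.0, 6.5, 9.0, 12.0]      # VolumeQueueLength
cpu = [22.0, 25.0, 24.0, 23.0]     # CpuUser, percent
p99 = [40.0, 55.0, 90.0, 140.0]    # produce acks=all p99, milliseconds

print(looks_like_ebs_saturation(queue, cpu, p99))  # True
```

If CPU is also high, the pattern points at host load rather than storage alone, and the check returns False.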
8. Guided practical
This playbook is exhibit-based. Practice the diagnosis, not the outage.
- For each exhibit above, state whether the fault is SG/NACL, EBS, or host, and why.
- Given “timed out” on new connections but existing ones work, name the most likely cause and the exact rule to check.
- Given rising acks=all latency with normal broker CPU and elevated VolumeQueueLength, name the layer and who owns the fix.
- Explain the timed-out versus refused distinction to a teammate in one sentence.