
AWS-Layer Connectivity

Security groups, NACLs, EBS throughput and IOPS, and EC2 status checks beneath Kafka broker symptoms, diagnosed from AWS exhibits.


1. Symptom

The first signal is broker- or app-side, but the root cause lives one layer down in AWS infrastructure. Common presentations:

  • A client cannot connect to MSK at all, but nothing changed in Kafka: no deploy, no config change.
  • Produce and consume latency climbs steadily while broker CPU and memory look normal.
  • Connections that used to work start timing out (not refusing) after a network or subnet change.
  • Only clients in one subnet or AZ are affected, while others are fine.

The common thread: the broker and the app both report what they see honestly, but the fault is in compute, storage, or networking underneath. The skill is recognizing that pattern quickly instead of chasing it as a Kafka problem.


2. Likely causes (AWS/infra side)

Cause | What is actually happening
Security group misconfiguration | The broker SG no longer allows the client SG on the Kafka port; a single missing stateful rule is usually the whole story
NACL misconfiguration | A subnet Network ACL blocks traffic the SG would allow; NACLs are stateless, so return traffic must be allowed explicitly (the “we fixed the SG but it is still broken” trap)
EBS throughput/IOPS exhaustion | The broker data volume is saturated; every fsync-backed write queues behind disk, so latency rises with the broker process healthy
EC2/broker host failure | The underlying host fails; MSK replaces the broker, but there is a gap and a leadership move
Subnet routing / DNS | Clients cannot resolve or route to broker private IPs after a VPC change
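
The stateless NACL behavior in the table above is easier to see in a toy model. The sketch below is a simplification, not the real AWS evaluation engine: it matches on TCP port only (no CIDRs or protocols), and the `NaclRule` type, rule numbers, and ephemeral port 50000 are illustrative assumptions. It keeps the two properties that matter here: rules are checked in ascending rule number with the first match winning, and the request and the reply are evaluated independently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NaclRule:
    """Hypothetical, simplified NACL entry (port range only)."""
    number: int      # lower numbers are evaluated first
    port_from: int
    port_to: int
    allow: bool


def nacl_allows(rules, port):
    """First matching rule by ascending number wins; no match means deny."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port_from <= port <= rule.port_to:
            return rule.allow
    return False


def broker_subnet_passes(inbound, outbound, broker_port, client_ephemeral_port):
    """Stateless check: the request in and the reply out are judged separately."""
    request_ok = nacl_allows(inbound, broker_port)
    reply_ok = nacl_allows(outbound, client_ephemeral_port)
    return request_ok and reply_ok


inbound = [NaclRule(100, 9094, 9094, True)]           # client -> broker allowed
outbound_missing = []                                 # no rule for replies
outbound_fixed = [NaclRule(100, 1024, 65535, True)]   # replies to ephemeral ports

print(broker_subnet_passes(inbound, outbound_missing, 9094, 50000))  # False
print(broker_subnet_passes(inbound, outbound_fixed, 9094, 50000))    # True
```

A security group would pass the first case, because it tracks the connection and lets the reply through automatically. That difference is why a correct SG does not rule out a NACL fault.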

3. How it manifests to the Spring app

AWS-side cause | What the Spring app sees
Security group block | TimeoutException / connection timed out on new connections (not “refused”); existing connections work until they cycle
NACL block | Same timeout, or a stranger pattern where the handshake starts then hangs if only return traffic is blocked
EBS saturation | No connection errors at all: produce/consume just get slower, and acks=all latency climbs. The easiest to misdiagnose because nothing looks down
Host failure | Brief NOT_LEADER_FOR_PARTITION and disconnects, then recovery once MSK replaces the broker

4. Diagnostic steps

  1. Classify the client error. Timed out (packets silently dropped, suspect SG/NACL) versus refused (packets reached the host, which rejected the connection) versus slow-but-working (suspect EBS/host load).
  2. Check the scope. All clients, or only one subnet/AZ? One subnet points at that subnet’s routing or NACL.
  3. Read the security group rules. Does the broker SG allow the client SG on the listener port the client actually uses: 9092 (plaintext), 9094 (TLS), 9096 (SASL/SCRAM), or 9098 (IAM)? Recent changes are the usual cause.
  4. Read the NACLs for the client and broker subnets. Remember they are stateless: check both inbound and outbound.
  5. Check EBS and broker metrics if latency is the symptom: VolumeReadOps/VolumeWriteOps, throughput, and broker CPU. Saturation explains slow-without-errors.

Step | Question it answers | Time cost
1. Error class | Dropped, refused, or slow? | seconds
2. Scope | Whole VPC or one subnet? | 1-2 min
3. SG rules | Is the port allowed? | 2-3 min
4. NACLs | Is a subnet rule blocking a direction? | 2-3 min
5. EBS/broker | Is storage or host saturated? | 2-3 min
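
Step 1 can be reproduced outside the application with a plain TCP probe from the affected client host. The following is a minimal sketch using only the Python standard library; the function name `classify_connect` and the result labels are illustrative choices, and the probe tests TCP reachability only (it does not perform TLS or speak the Kafka protocol).

```python
import socket


def classify_connect(host, port, timeout=3.0):
    """Return 'connected', 'refused', 'timed_out', or 'other:<ExceptionName>'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"      # TCP handshake completed
    except ConnectionRefusedError:
        return "refused"            # host reached, connection rejected
    except socket.timeout:
        return "timed_out"          # no answer: consistent with SG/NACL drops
    except OSError as exc:
        # Includes DNS failures (socket.gaierror) and unreachable networks.
        return f"other:{type(exc).__name__}"


# Example call (the hostname is a placeholder, not a real endpoint):
# classify_connect("broker.example.internal", 9094)
```

Running the same probe from a client in a healthy subnet and one in the affected subnet also answers step 2: if only one of them returns `timed_out`, the scope is that subnet's routing or NACL.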

5. Safe remediations

Situation | Safe action
Missing/narrowed SG rule | Restore the specific client SG allow on the Kafka port (with network-owner sign-off)
NACL blocking a direction | Fix the stateless rule for the affected subnet (network team)
EBS saturation | Escalate for storage/throughput provisioning; do not mask as a Kafka tuning issue
Host failure | Let MSK replace the broker; confirm clients recover with retries
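
Before asking for an SG rule to be restored, it helps to state exactly which allow is missing. The sketch below checks a list of inbound permissions in the shape that the EC2 `DescribeSecurityGroups` API returns under `IpPermissions` (`IpProtocol`, `FromPort`, `ToPort`, `UserIdGroupPairs`). It is a partial check under stated assumptions: it only considers rules whose source is another security group and ignores CIDR sources, and the group IDs in the sample data are placeholders.

```python
def sg_allows(ip_permissions, client_sg_id, port):
    """True if any inbound rule allows TCP `port` from `client_sg_id`."""
    for perm in ip_permissions:
        proto = perm.get("IpProtocol")
        if proto == "-1":
            port_ok = True   # all protocols, all ports
        elif proto == "tcp":
            port_ok = perm.get("FromPort", 0) <= port <= perm.get("ToPort", 65535)
        else:
            continue         # udp, icmp, etc. are irrelevant for Kafka
        sources = perm.get("UserIdGroupPairs", [])
        if port_ok and any(s.get("GroupId") == client_sg_id for s in sources):
            return True
    return False


# Sample data mirroring the SG exhibit; IDs are placeholders.
broker_perms = [
    {
        "IpProtocol": "tcp",
        "FromPort": 9094,
        "ToPort": 9094,
        "UserIdGroupPairs": [{"GroupId": "sg-clients"}],
    }
]

print(sg_allows(broker_perms, "sg-clients", 9094))  # True
print(sg_allows(broker_perms, "sg-clients", 9098))  # False
```

A `False` for the port the client actually uses gives the network owner a precise rule to restore, rather than a request to "open up Kafka".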

6. Escalation trigger

Page the platform/network team (and on-call engineering) if:

  • The fault is confirmed in SG, NACL, subnet routing, or EBS provisioning: these are infra-owned.
  • Latency is driven by EBS or host saturation rather than anything Kafka or app config.
  • Multiple brokers or AZs are affected simultaneously.
  • You cannot determine the layer from the exhibits within your diagnostic pass.

7. Relevant commands and exhibits

# Client error: dropped packets (SG/NACL), not refused
org.apache.kafka.common.errors.TimeoutException:
  Topic orders not present in metadata after 60000 ms.
Connection to node -1 (b-1.msk.../10.0.1.20:9094) could not be established.

# Security group table exhibit (broker SG inbound)
Type        Protocol  Port   Source
Custom TCP  TCP       9094   sg-clients (app tier)   <-- must exist

# CloudWatch exhibit: EBS saturation with healthy broker CPU
VolumeWriteOps         : rising toward provisioned limit
VolumeQueueLength      : elevated (I/O queuing)
CpuUser (broker)       : normal
Produce acks=all p99   : climbing   <-- latency without errors

In MSK CloudWatch, check broker CPU/network and VolumeReadOps/VolumeWriteOps, and use the Kafka metrics from Observability to confirm the broker itself is healthy.
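
The CloudWatch exhibit above boils down to three conditions holding at once: write ops near the provisioned limit, I/O queuing, and normal CPU. The following sketch encodes that reading as a pure function. The function name and all three thresholds (90% of provisioned IOPS, queue length above 1, CPU below 60%) are illustrative assumptions, not AWS guidance, and the write rate is assumed to be already converted to operations per second.

```python
def looks_like_ebs_saturation(write_ops_per_sec, provisioned_iops,
                              queue_length, cpu_user_pct,
                              iops_ratio=0.9, queue_limit=1.0, cpu_limit=60.0):
    """Heuristic: disk near its limit and queuing while the broker CPU is normal."""
    near_limit = write_ops_per_sec >= iops_ratio * provisioned_iops
    queuing = queue_length > queue_limit
    cpu_normal = cpu_user_pct < cpu_limit
    return near_limit and queuing and cpu_normal


# Disk-bound broker: high write rate, queued I/O, idle-ish CPU.
print(looks_like_ebs_saturation(2900, 3000, 4.2, 25.0))  # True
# CPU-bound broker: same disk numbers, but CPU is the more likely suspect.
print(looks_like_ebs_saturation(2900, 3000, 4.2, 90.0))  # False
```

The CPU condition is what separates this case from ordinary broker overload: when CPU is also high, the storage layer is not the obvious owner of the fix.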


8. Guided practical

This playbook is exhibit-based. Practice the diagnosis, not the outage.

  1. For each exhibit above, state whether the fault is SG/NACL, EBS, or host, and why.
  2. Given “timed out” on new connections but existing ones work, name the most likely cause and the exact rule to check.
  3. Given rising acks=all latency with normal broker CPU and elevated VolumeQueueLength, name the layer and who owns the fix.
  4. Explain the timed-out versus refused distinction to a teammate in one sentence.

Next: Latency, Ordering, and Duplicates.