AWS-Layer Connectivity

EC2 status checks, EBS throttling, security groups, and NACLs beneath broker symptoms.

Prerequisites: AWS Architecture, Tooling Walkthrough

Note on this playbook’s format: the other playbooks assume you can poke at a real non-prod broker. This one is deliberately different: you’re diagnosing from exhibits (CloudWatch snippets, EC2 console text, SG/NACL tables, ASG activity logs), because that’s exactly the skill you need on a real on-call rotation. Most AWS-layer incidents are diagnosed by reading console/log output under time pressure, not by causing them yourself in a sandbox.


1. Symptom

This playbook covers alerts and tickets where the first signal is broker- or app-side, but the root cause lives one layer down, in AWS infrastructure rather than RabbitMQ configuration or application code. Common presentations:

  • A node “goes down” in cluster_status or the Management UI, but there’s no corresponding RabbitMQ error, crash log, or graceful shutdown on that node; it just stops responding.
  • Publish/ack latency climbs steadily, but rabbitmq-diagnostics status shows normal memory, no alarms, and CPU utilization on the node looks unremarkable.
  • An app team reports sudden, total connection failures to the broker, but nothing changed in RabbitMQ itself: no deploy, no config change, no restart.
  • A node that was down comes back, but as a different instance ID than before, and it doesn’t automatically rejoin the cluster.

The common thread: the broker and the app are both telling the truth about what they see, but the actual fault is in the AWS layer underneath them: compute, storage, or networking. This playbook is about learning to recognize that pattern quickly instead of spending 20 minutes chasing it as a “RabbitMQ problem.”


2. Likely Causes (AWS/infra-side)

| Cause | What’s actually happening |
| --- | --- |
| EC2 instance status check failure | AWS reports the instance itself as unhealthy at the hypervisor or OS-network level (see the status-check breakdown below). RabbitMQ never got a chance to log anything: the underlying compute failed out from under it. |
| EBS volume throughput/IOPS exhaustion | The broker’s data volume is saturated: more I/O is being requested than the volume’s provisioned throughput/IOPS can serve. Every fsync-backed operation (publish confirms, quorum queue log writes) queues up waiting on disk, even though the broker process itself is healthy. |
| Security group misconfiguration | A rule was removed, narrowed, or never updated after a migration (see the SG table in AWS Architecture). SGs are stateful: allowing inbound on a port automatically allows the matching return traffic, so a single missing rule is usually the whole story. |
| NACL (Network ACL) misconfiguration | A subnet-level Network ACL rule blocks traffic that the security groups would otherwise allow. NACLs are stateless: unlike SGs, allowing inbound traffic does not automatically allow the outbound return traffic (or vice versa). This is the classic “we fixed the SG but it’s still broken” trap. |
| ASG replaces a node mid-incident | An Auto Scaling Group’s health check flags a node as unhealthy (sometimes due to a transient condition: a GC pause, a slow health-check response under load) and terminates it. A brand-new EC2 instance boots in its place. This turns a recoverable blip into an actual node-loss-and-rejoin event, potentially mid-incident. |

EC2 instance status checks: the distinction that matters

AWS reports two separate status checks per instance, and which one fails tells you where the fault lives:

| Check | What it verifies | Typical cause | Who fixes it |
| --- | --- | --- | --- |
| System status check | The underlying AWS hardware/hypervisor/network the instance runs on | Host hardware failure, hypervisor issue, underlying network problem | AWS itself: usually resolves via instance stop/start (moves you to new hardware), not a reboot |
| Instance status check | The guest OS/instance’s own network reachability and software state | OS kernel panic, exhausted memory causing the network stack to stop responding, filesystem corruption, misconfigured OS networking | You/your team: usually needs a reboot or investigation on the instance itself |

Why this matters for RabbitMQ specifically: from the broker’s own perspective, an instance that fails either status check is not “shut down”; it is abruptly terminated. There’s no graceful rabbitmqctl stop_app, no partition-handling negotiation, nothing written to the RabbitMQ log about why. The other two nodes simply stop receiving heartbeats and, after the detection timeout, report it missing from Running Nodes. From cluster_status alone, that is indistinguishable from a network partition (see Playbook 03). The only way to tell “the node process is confused about the network” (partition) apart from “the node is actually gone” (instance failure) is to check the EC2/CloudWatch side, which is why it’s step 1 in Section 4 below.
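The status-check table above can be sketched as a small lookup. This is a minimal illustration, not a tool from this playbook: the function and class names are hypothetical, the two booleans are assumed to come from the StatusCheckFailed_System and StatusCheckFailed_Instance metrics, and treating a system-check failure as taking precedence when both fail is an assumption made here for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatusCheckTriage:
    layer: str         # where the fault most likely lives
    owner: str         # who normally fixes it
    usual_action: str  # typical first remediation from the table


def triage_status_checks(system_failed: bool, instance_failed: bool) -> StatusCheckTriage:
    """Map the two EC2 status-check results onto the table's rows."""
    if system_failed:
        # Assumption: when both checks fail, look at the host first, since
        # guest-level findings are hard to trust on unhealthy hardware.
        return StatusCheckTriage(
            "AWS host/hypervisor/network", "AWS", "stop/start the instance")
    if instance_failed:
        return StatusCheckTriage(
            "guest OS", "your team", "reboot or investigate on the instance")
    # Both checks pass: the node is up at the infra level, so keep looking
    # at network rules or a cluster partition instead.
    return StatusCheckTriage(
        "none detected", "n/a", "check SG/NACL rules or a partition")


print(triage_status_checks(system_failed=False, instance_failed=True))
```

The value of keeping the two inputs separate is the same as in the console: a combined "failed" flag would hide which row of the table applies.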


3. How it manifests to the Spring app

| AWS-side cause | What the Spring Boot app sees |
| --- | --- |
| EC2 instance status check failure (node terminated) | AmqpConnectException from any app whose spring.rabbitmq.addresses lists only that node’s hostname. New attempts usually surface as Connection timed out rather than Connection refused, because an unreachable instance sends no reply at all. Apps configured against the full node list fail over silently (see Playbook 03). |
| EBS throughput/IOPS exhaustion | No connection errors at all. Publishes succeed but take much longer: rising RabbitTemplate publish-confirm latency, growing thread pool queues if publishing is synchronous, possibly AmqpTimeoutException if a publisher-confirm timeout is configured tightly. This is the easiest of the five to misdiagnose, because nothing looks “down.” |
| Security group misconfiguration | AmqpConnectException wrapping a ConnectException: Connection timed out (not “refused”) on new connection attempts (see the timeout-vs-refused distinction below). Existing, already-established connections are typically unaffected until they naturally cycle. |
| NACL misconfiguration | Same symptom as an SG block (Connection timed out) if it blocks inbound, but can also produce a stranger pattern: connections that hang mid-handshake or succeed only intermittently, if only the return traffic is blocked or only part of the ephemeral port range is allowed, because NACLs evaluate each direction independently. |
| ASG replaces a node | Same as EC2 instance failure initially (AmqpConnectException for that node), but if the replacement instance doesn’t cleanly rejoin the cluster, apps may see the cluster permanently down to 2 nodes rather than recovering on its own within the usual window. Worth comparing against Playbook 03’s “does not automatically rejoin” escalation trigger. |

A useful diagnostic shortcut: Connection refused means something on the target actively rejected the TCP connection (broker process down, but OS/network fine). Connection timed out means the packet never got a response at all: no SYN-ACK, no RST. A timeout is the single strongest log-side hint that you’re looking at a network-layer block (SG, NACL, routing) rather than a broker-process problem, because a genuinely stopped RabbitMQ process on a reachable host still produces a fast “refused,” not a timeout.

# Timeout pattern: network/SG/NACL block (broker unreachable at the network layer)
o.s.amqp.AmqpConnectException: java.net.ConnectException: Connection timed out
    at org.springframework.amqp.rabbit.support.RabbitExceptionTranslator...

# Refused pattern: host reachable, but nothing listening on the port (process actually down)
o.s.amqp.AmqpConnectException: java.net.ConnectException: Connection refused
    at org.springframework.amqp.rabbit.support.RabbitExceptionTranslator...
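
When scanning a large paste of application logs, the timeout-vs-refused shortcut can be applied mechanically. The sketch below is illustrative only: the function name and the returned hint strings are hypothetical, and it assumes the exception text appears on the log line in the form shown in the two patterns above (it also accepts the "connect timed out" wording, which is an assumption about how some JVM versions phrase a connect timeout).

```python
def classify_connect_failure(log_line: str) -> str:
    """Return a coarse hint for where to look, based on the exception text."""
    text = log_line.lower()
    if "connection timed out" in text or "connect timed out" in text:
        # No SYN-ACK and no RST came back: suspect SG, NACL, or routing.
        return "network-layer block likely (SG, NACL, routing)"
    if "connection refused" in text:
        # The host answered with a reset: the network path is fine.
        return "host reachable, nothing listening (broker process down)"
    return "unclassified"


sample = "o.s.amqp.AmqpConnectException: java.net.ConnectException: Connection timed out"
print(classify_connect_failure(sample))
```

This is only a first-pass filter: it narrows which layer to check first, and the diagnostic steps in Section 4 still decide the actual cause.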

4. Diagnostic Steps

Work top to bottom: cheapest, fastest, least-invasive checks first. The goal at every step is to narrow down which AWS layer is involved before you touch anything.

  1. EC2 console / CloudWatch: instance status checks. For the node(s) in question, check System status check vs. Instance status check individually (not just the combined “2/2 checks passed” summary). A system-check failure points at AWS-side hardware; an instance-check failure points at the guest OS. Either way, this tells you the node is a genuine infra casualty, not a network partition (compare against Playbook 03, Section 3, step 4).
  2. CloudWatch EBS metrics for the node’s data volume: EBSVolumeQueueLength (I/O requests waiting to be served) and EBSReadWriteOps (I/O operations actually served), plus BurstBalance if the volume is gp2. A queue length sustained above roughly 1 per 1,000 provisioned IOPS (about 3 for a 3,000 IOPS volume), or a BurstBalance heading toward 0, means the disk, not the broker, is the bottleneck.
  3. Security group rules, both directions, for the broker SG and the app-tier SG involved. Reuse the table from AWS Architecture as your checklist (app→broker 5672/5671, broker↔broker 4369/25672/35672-35682, management 15672). Confirm the rule exists and that its source/destination reference (SG ID or CIDR) still matches reality after any recent migration.
  4. Network ACL rules for the subnets involved, both inbound and outbound, on both the app subnet and the broker subnet. This step is easy to skip because NACLs are touched far less often than SGs, but a stateless NACL needs an explicit rule for the return leg of the conversation, not just the initiating leg. Missing this is the single most common reason “we already checked the security group and it’s fine” turns out to be wrong.
  5. ASG activity history for the broker’s Auto Scaling Group: look for any Terminating/Launching activity around the time symptoms started. If a node was replaced, note the exact timestamp of termination and the exact timestamp the replacement instance became InService.
  6. Cross-reference timestamps. Line up the RabbitMQ-side symptom (node missing from cluster_status, latency spike start time) against the AWS-side event (ASG termination timestamp, EBS queue-length spike start, SG/NACL change timestamp from CloudTrail if available). A match within a minute or two is strong evidence of causation; a gap of many minutes suggests coincidence and means you should keep looking rather than anchoring on the first AWS event you find.

| Step | Question it answers | Typical time cost |
| --- | --- | --- |
| 1. EC2/CloudWatch status checks | Is the node actually down at the infra level, or just unreachable? | 1-2 min |
| 2. CloudWatch EBS metrics | Is the disk saturated? | 1-2 min |
| 3. Security group rules | Is a stateful network rule blocking this? | 2-3 min |
| 4. NACL rules (both directions, both subnets) | Is a stateless network rule blocking this asymmetrically? | 3-5 min |
| 5. ASG activity history | Was a node replaced, and when exactly? | 1-2 min |
| 6. Timestamp cross-reference | Causation or coincidence? | 2-3 min |
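
Step 6 (timestamp cross-reference) is simple arithmetic, but it is easy to get wrong under pressure. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the two-minute window is taken from the “within a minute or two” guidance in step 6, and the timestamps are invented for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def classify_gap(symptom_at: datetime, event_at: datetime,
                 window: timedelta = timedelta(minutes=2)) -> str:
    """Label an AWS-side event by how close it sits to the symptom start."""
    gap = abs(symptom_at - event_at)
    return "strong candidate" if gap <= window else "likely coincidence"


def rank_events(symptom_at: datetime,
                events: List[Tuple[str, datetime]]) -> List[Tuple[str, str]]:
    """Return (label, verdict) pairs, closest event first."""
    ordered = sorted(events, key=lambda e: abs(symptom_at - e[1]))
    return [(label, classify_gap(symptom_at, at)) for label, at in ordered]


# Hypothetical timestamps, for illustration only.
symptom = datetime(2024, 1, 10, 10, 5, 30)
events = [
    ("SG rule change (CloudTrail)", datetime(2024, 1, 10, 6, 0, 0)),
    ("ASG terminated instance", datetime(2024, 1, 10, 10, 4, 50)),
]
for label, verdict in rank_events(symptom, events):
    print(f"{label}: {verdict}")
```

The absolute gap is used because alert timestamps often lag the underlying event. A "likely coincidence" verdict is not a dismissal: the event may still be a latent misconfiguration worth flagging separately.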

5. Safe Remediations

| Situation | Action |
| --- | --- |
| SG rule confirmed missing/wrong | Identify the exact corrected rule (port, direction, source/destination SG or CIDR) and hand it to the network/platform team to apply. |
| NACL rule confirmed missing/wrong | Identify the exact corrected rule, but treat this with extra care (see caution below): hand off with the specific rule number, direction, and subnet affected. |
| EBS volume saturated | This is generally an escalation to resize or re-provision the volume (increase provisioned IOPS/throughput, or move from gp2 to gp3/higher baseline), not a support-tier live fix. |
| ASG replaced a node mid-incident, or is likely to replace another during ongoing investigation | Pausing or adjusting the ASG’s health-check-triggered replacement is a valid stopgap to prevent compounding the incident, but is high-blast-radius (see caution below). |

⚠️ Caution: correcting a security group or NACL rule should be coordinated with the network/platform team, not applied solo. A security group is scoped to the resources that reference it, so an SG fix is relatively contained. A NACL is scoped to the entire subnet: every resource in that subnet is affected by any rule you add, remove, or reorder, including things that have nothing to do with RabbitMQ. Fixing a NACL rule to unblock the broker can silently open or close traffic for unrelated workloads sharing that subnet. Always have the network/platform team apply or review NACL changes.

⚠️ Caution: pausing ASG health-check replacement during an active incident is a risky, high-blast-radius change. RabbitMQ nodes are stateful cluster members with data on attached EBS volumes, not interchangeable stateless web servers.

Blind auto-replacement (the default ASG behavior for a “generic” fleet) is dangerous for a broker unless the ASG has been specifically tuned for it. That tuning means sufficient health check grace periods, so a transient blip (GC pause, brief high CPU, a slow health-check response) doesn’t trigger replacement, and either automated EBS/data-volume reattachment or proper cluster-rejoin automation baked into the instance’s bootstrap. If none of that tuning exists, pausing the ASG mid-incident is the safer of two bad options, but this decision should be made by or with on-call engineering, not unilaterally by support tier, because getting it wrong either way (leaving it on vs. pausing it incorrectly) can extend the incident.


6. Escalation Trigger

Escalate immediately (page on-call engineering, per Escalation and Communication) if any of the following are true:

  • A security group or NACL change is needed to restore connectivity. Network configuration changes are owned by the network/platform team, not applied directly by support tier.
  • An EBS volume resize/re-provision is needed. This is a capacity/infra change requiring engineering approval and planning (potential downtime or performance impact during the change itself).
  • An ASG configuration change is needed (pausing health-check replacement, adjusting grace periods, adjusting scaling policies). This is always a joint call with on-call engineering, never solo.
  • A node loss is confirmed to be caused by AWS infrastructure (failed status check, ASG replacement) rather than an application or broker-level bug, especially if the replacement node has not cleanly rejoined the cluster (see the related trigger in Playbook 03, Section 5).

7. Relevant Commands/Queries/Exhibits

CloudWatch metrics to check (from Tooling Walkthrough):

| Metric | What it tells you here |
| --- | --- |
| EBSVolumeQueueLength | I/O requests queued waiting on the volume: sustained elevation means disk saturation |
| EBSReadWriteOps | Actual read/write throughput being served: compare against the volume’s provisioned limit |
| CPUCreditBalance | If depleting toward 0 on a burstable instance type, the node may soon be throttled: a different root cause than EBS, but produces a similarly confusing “broker is slow but nothing broker-side explains it” symptom |
| StatusCheckFailed_System / StatusCheckFailed_Instance | Split view of the two EC2 status checks: check both individually, not just the combined status |

EBS baseline vs. provisioned IOPS, why this is easy to misdiagnose:

A gp3 volume comes with a fixed baseline (3,000 IOPS / 125 MiB/s throughput) regardless of size, which you can raise independently by provisioning more. If message rates outgrow that baseline (high publish volume, many quorum queues doing Raft log writes, mirrored durability workloads), the volume throttles once the baseline is exceeded, and every fsync-backed broker operation (publisher confirms, quorum queue appends) queues up behind it. CPU and memory on the node look completely normal because the bottleneck is the disk, not the broker process, which is exactly why this gets misreported as “RabbitMQ is slow” instead of “the EBS volume is undersized for this workload.”
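
The saturation pattern described above can be checked against a handful of CloudWatch samples. This is a sketch, not a definitive detector: the function names are hypothetical, the queue-length threshold uses the rule of thumb from Section 4 (about 1 per 1,000 provisioned IOPS), and the 95% ceiling ratio and the two-consecutive-samples requirement are assumptions chosen for illustration. Samples are assumed to already be per-second IOPS; in the AWS/EBS CloudWatch namespace the underlying metrics are VolumeQueueLength, VolumeReadOps, and VolumeWriteOps, where the ops metrics are counts per period and need dividing by the period length.

```python
from typing import List, Tuple


def queue_length_threshold(provisioned_iops: int) -> float:
    """Rule of thumb: roughly 1 queued request per 1,000 provisioned IOPS."""
    return provisioned_iops / 1000.0


def is_saturated(samples: List[Tuple[float, float]], provisioned_iops: int,
                 min_consecutive: int = 2, iops_ratio: float = 0.95) -> bool:
    """samples: (queue_length, iops) pairs in time order.

    Flags saturation only when the queue is above the threshold AND served
    IOPS sit at the ceiling for several samples in a row, so that a single
    spike does not count.
    """
    threshold = queue_length_threshold(provisioned_iops)
    streak = 0
    for queue_length, iops in samples:
        if queue_length > threshold and iops >= iops_ratio * provisioned_iops:
            streak += 1
            if streak >= min_consecutive:
                return True
        else:
            streak = 0
    return False


# Hypothetical samples for a volume at the gp3 baseline of 3,000 IOPS.
samples = [(0.5, 900.0), (4.2, 2990.0), (7.8, 3004.0)]
print(is_saturated(samples, provisioned_iops=3000))
```

Requiring both conditions matters: a deep queue with low served IOPS points elsewhere (for example, a throughput limit rather than an IOPS limit), while IOPS at the ceiling with a shallow queue means the volume is busy but still keeping up.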

Security group reference (from AWS Architecture), the exact rules to check:

| Rule | Port | Source | Purpose |
| --- | --- | --- | --- |
| App → Broker | 5672 (or 5671 TLS) | App tier SG | AMQP client connections |
| App → Broker (Management API) | 15672 | App tier SG or admin CIDR | HTTP Management API/UI |
| Broker ↔ Broker | 4369, 25672, 35672-35682 | Broker SG itself (self-referencing) | Erlang clustering/gossip |
| Admin → Broker | via SSM only | Bastion SG / SSM | Node administration |
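
Checking a rule set against this table by eye is where narrowed port ranges slip through. The sketch below compares a required port range with the ranges an SG actually allows and reports the gaps. It is a simplified illustration: the function name is hypothetical, and it assumes the allowed ranges have already been filtered down to the correct source (SG ID or CIDR) and protocol, which the code does not verify.

```python
from typing import List, Tuple

PortRange = Tuple[int, int]  # inclusive (low, high)


def uncovered_ports(required: PortRange, allowed: List[PortRange]) -> List[PortRange]:
    """Return the sub-ranges of `required` not covered by any allowed range."""
    gaps: List[PortRange] = []
    cursor = required[0]  # lowest required port not yet known to be covered
    for low, high in sorted(allowed):
        if high < cursor:
            continue  # this range ends before the part still unchecked
        if low > required[1]:
            break  # this and all later ranges start past the requirement
        if low > cursor:
            gaps.append((cursor, low - 1))
        cursor = max(cursor, high + 1)
        if cursor > required[1]:
            break
    if cursor <= required[1]:
        gaps.append((cursor, required[1]))
    return gaps


# Hypothetical example: the clustering range was narrowed during a migration.
print(uncovered_ports((35672, 35682), [(35672, 35678)]))
# A rule that is missing entirely shows up as one gap covering the whole range.
print(uncovered_ports((4369, 4369), []))
```

An empty result means the port requirement is met. It says nothing about whether the rule’s source reference still matches reality, which step 3 of Section 4 asks you to confirm separately.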

Key SG vs. NACL difference to keep straight:

| | Security Group | Network ACL |
| --- | --- | --- |
| Scope | Attached to specific resources (ENIs/instances) | Applies to an entire subnet |
| State | Stateful: allowing inbound automatically allows the matching outbound return traffic | Stateless: inbound and outbound rules are evaluated independently; you must explicitly allow both directions |
| Default | Deny all inbound, allow all outbound (until rules added) | Default NACL allows all; a custom NACL denies all until rules added |
| Common mistake | Forgetting the self-referencing clustering rule | Adding an inbound allow rule but forgetting the matching outbound allow for the response traffic (or vice versa) |

# Once network connectivity is confirmed and you're validating cluster state
# (via SSM Session Manager, not direct SSH)
rabbitmq-diagnostics cluster_status
rabbitmq-diagnostics check_running
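
The stateless behavior in the SG vs. NACL comparison above is easier to reason about with a toy model. The sketch below evaluates numbered NACL rules in ascending order, first match wins, with an implicit final deny, and checks the request and the reply independently. It is deliberately simplified and rests on assumptions: it models ports only (no CIDRs or protocols), looks at the broker subnet’s NACL alone, and the client’s ephemeral port 49152 is a made-up example. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NaclRule:
    number: int  # lower numbers are evaluated first
    low: int     # inclusive port range
    high: int
    allow: bool


def nacl_allows(rules: List[NaclRule], port: int) -> bool:
    """First matching rule by ascending number decides; otherwise deny."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.low <= port <= rule.high:
            return rule.allow
    return False  # implicit final deny


def round_trip_ok(inbound: List[NaclRule], outbound: List[NaclRule],
                  service_port: int, client_port: int) -> bool:
    """Broker-subnet view: the request arrives inbound on the service port,
    and the reply leaves outbound toward the client's ephemeral port."""
    return nacl_allows(inbound, service_port) and nacl_allows(outbound, client_port)


inbound = [NaclRule(100, 5672, 5672, True)]
# Inbound AMQP is allowed, but with no outbound rule the reply is dropped.
print(round_trip_ok(inbound, [], service_port=5672, client_port=49152))
# Adding an outbound allow for the ephemeral range completes the round trip.
outbound = [NaclRule(100, 1024, 65535, True)]
print(round_trip_ok(inbound, outbound, service_port=5672, client_port=49152))
```

A security group would need only the inbound rule here, because connection tracking admits the reply automatically. A full check repeats the same evaluation on the app subnet’s NACL, with the directions reversed.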

8. Guided Practical

You don’t have hands-on AWS access for this one. Instead, diagnose from three exhibits, exactly as you would from an incident channel where someone pastes console screenshots and CLI output.

Ticket:

“PagerDuty fired RabbitMQ node unreachable, rmq-3 at 14:32 UTC. Around the same time, the payments-service team reports publish latency spiking from ~5ms to over 4 seconds on rmq-1 and rmq-2 (the two nodes still up). Management UI shows no memory or disk alarms on either surviving node. We need to know: is this one incident or two, and what’s the root cause of each?”

Exhibit A, CloudWatch EBS metrics for rmq-1’s data volume (gp3, baseline 3,000 IOPS / 125 MiB/s):

Timestamp (UTC)   EBSVolumeQueueLength   EBSReadWriteOps (IOPS)   BurstBalance
14:15             0.8                    1,150                    N/A (gp3 has no burst balance)
14:20             1.1                    1,400                    N/A
14:25             3.6                    2,950                    N/A
14:30             9.2                    3,010                    N/A
14:35             11.4                   3,005                    N/A
14:40             10.9                   2,998                    N/A

Exhibit B, ASG activity history for the RabbitMQ broker ASG (rmq-broker-asg):

Activity                                    Start (UTC)   End (UTC)     Cause
Terminating EC2 instance: i-0a1b2c3d4e5f     14:31:02      14:31:48      At 14:30:55 UTC an instance was
                                                                          taken out of service in response
                                                                          to a failed EC2 instance status
                                                                          check.
Launching a new EC2 instance: i-0f9e8d7c6b5   14:31:50      14:34:12      A new instance was launched in
                                                                          response to a difference between
                                                                          desired and actual capacity.

(Instance i-0a1b2c3d4e5f was tagged rmq-3.)

Exhibit C, Security group rule diff (from a change ticket, applied at 09:00 UTC that same day, unrelated migration):

SG: sg-broker0123 (RabbitMQ broker nodes)

  Rule removed:
    Type: Custom TCP   Port: 35672-35682   Source: sg-broker0123 (self)
  Rule added:
    Type: Custom TCP   Port: 35672-35680   Source: sg-broker0123 (self)

Your task: using only these three exhibits plus what you know from Sections 2-3 above, answer:

  1. Which exhibit explains the rmq-3 node-down alert at 14:32, and what’s the precise mechanism?
  2. Which exhibit explains the payments-service latency spike on rmq-1/rmq-2, and why wouldn’t Management UI alarms catch it?
  3. Is the 09:00 UTC security group change (Exhibit C) relevant to either symptom? Why or why not?
  4. For each root cause, is this something you’d remediate yourself or escalate, and to whom?

Suggested answers:

  1. Exhibit B. The ASG activity log shows rmq-3’s instance (i-0a1b2c3d4e5f) was terminated at 14:31:02 UTC due to a failed EC2 instance status check, roughly a minute before the PagerDuty alert fired. The instance-check failure (not system-check) means the guest OS/network stack on that instance stopped responding; from the broker’s perspective it was abruptly terminated, not gracefully stopped, so rmq-3 simply disappears from cluster_status with no corresponding RabbitMQ shutdown log. This is a node-down event, not a network partition: the timestamps line up almost exactly with the alert.
  2. Exhibit A. EBSVolumeQueueLength and EBSReadWriteOps on rmq-1’s volume climb from ~14:20, and the volume reaches its gp3 baseline IOPS ceiling (3,000 IOPS) between 14:25 and 14:30, coinciding with the latency spike report. Note the ordering: the ramp begins before rmq-3 was terminated at 14:31, so the node loss cannot be the original trigger. The more likely reading is that rmq-1’s workload was already outgrowing the baseline, and that losing rmq-3 then shifted its share of quorum queue replica traffic onto the two surviving nodes and kept the disk pinned at the ceiling. Management UI alarms don’t catch this because RabbitMQ’s built-in alarms watch memory and disk space, not disk throughput/IOPS saturation: that’s an AWS-layer metric (EBSVolumeQueueLength) that RabbitMQ itself has no visibility into. This is exactly the “CPU/memory look fine but publish/ack latency is up” pattern from Sections 1 and 3.
  3. Not relevant to either symptom, but worth flagging separately. The SG change happened at 09:00 UTC, over 5 hours before either symptom began; the timestamp gap rules out causation for this incident specifically (per the Section 4, step 6 cross-referencing check). However, note that the rule change narrowed the clustering port range from 35672-35682 to 35672-35680. This is a latent risk (a CLI tool’s Erlang distribution client port could land in the now-blocked 35681-35682 range under the right conditions) that should be flagged to the network team as a follow-up even though it isn’t the cause of today’s incident. Good triage means noticing a real misconfiguration even when it isn’t the one that paged you.
  4. rmq-3 node loss (Exhibit B): confirmed infra-caused node loss per the Section 6 escalation trigger. Escalate to on-call engineering to verify the replacement instance cleanly rejoins the cluster (don’t assume it will “just work”; see Playbook 03, Section 5, at /rabbit-mq/alert-playbooks/node-down-partition#5-escalation-trigger). EBS saturation (Exhibit A): escalate for a volume resize/re-provision (increase gp3 provisioned throughput/IOPS), not a support-tier live fix. SG rule narrowing (Exhibit C): hand off to the network/platform team as a follow-up correction, coordinated rather than self-applied, even though it’s not today’s root cause.

✅ Checkpoint

You should now be able to:

  • Distinguish a system status check failure from an instance status check failure, and explain why either one makes a RabbitMQ node look “down” with no broker-side log evidence.
  • Explain why EBS throughput/IOPS exhaustion produces elevated latency with no corresponding CPU/memory/alarm signal, and name the two CloudWatch metrics that expose it.
  • Explain the stateful-vs-stateless distinction between security groups and NACLs, and why a NACL fix requires checking both inbound and outbound rules on both subnets.