AWS-Layer Connectivity Issues: RabbitMQ Incident Guide

Prerequisite: AWS Architecture, Tooling Walkthrough

Note on this playbook’s format: the other playbooks assume you can poke at a real non-prod broker. This one is deliberately different: you’re diagnosing from exhibits (CloudWatch snippets, EC2 console text, SG/NACL tables, ASG activity logs), because that’s exactly the skill you need on a real on-call rotation. Most AWS-layer incidents are diagnosed by reading console/log output under time pressure, not by causing them yourself in a sandbox.

1. Symptom

This playbook covers alerts and tickets where the first signal is broker- or app-side, but the root cause lives one layer down, in AWS infrastructure rather than RabbitMQ configuration or application code. Common presentations:

A node “goes down” in cluster_status or the Management UI, but there’s no corresponding RabbitMQ error, crash log, or graceful shutdown on that node; it just stops responding.
Publish/ack latency climbs steadily, but rabbitmq-diagnostics status shows normal memory, no alarms, and CPU utilization on the node looks unremarkable.
An app team reports sudden, total connection failures to the broker, but nothing changed in RabbitMQ itself: no deploy, no config change, no restart.
A node that was down comes back, but as a different instance ID than before, and it doesn’t automatically rejoin the cluster.

The common thread: the broker and the app are both telling the truth about what they see, but the actual fault is in the AWS layer underneath them: compute, storage, or networking. This playbook is about learning to recognize that pattern quickly instead of spending 20 minutes chasing it as a “RabbitMQ problem.”

2. Likely Causes (AWS/infra-side)

Cause	What’s actually happening
EC2 instance status check failure	AWS reports the instance itself as unhealthy at the hypervisor or OS-network level (see the status-check breakdown below). RabbitMQ never got a chance to log anything: the underlying compute failed out from under it.
EBS volume throughput/IOPS exhaustion	The broker’s data volume is saturated: more I/O is being requested than the volume’s provisioned throughput/IOPS can serve. Every fsync-backed operation (publish confirms, quorum queue log writes) queues up waiting on disk, even though the broker process itself is healthy.
Security group misconfiguration	A rule was removed, narrowed, or never updated after a migration: see the SG table in AWS Architecture. SGs are stateful: allowing inbound on a port automatically allows the matching return traffic, so a single missing rule is usually the whole story.
NACL (Network ACL) misconfiguration	A subnet-level Network ACL rule blocks traffic that the security groups would otherwise allow. NACLs are stateless: unlike SGs, allowing inbound traffic does not automatically allow the outbound return traffic (or vice versa). This is the classic “we fixed the SG but it’s still broken” trap.
ASG replaces a node mid-incident	An Auto Scaling Group’s health check flags a node as unhealthy (sometimes due to a transient condition: a GC pause, a slow health-check response under load) and terminates it. A brand-new EC2 instance boots in its place. This turns a recoverable blip into an actual node-loss-and-rejoin event, potentially mid-incident.

EC2 instance status checks: the distinction that matters

AWS reports two separate status checks per instance, and which one fails tells you where the fault lives:

Check	What it verifies	Typical cause	Who fixes it
System status check	The underlying AWS hardware/hypervisor/network the instance runs on	Host hardware failure, hypervisor issue, underlying network problem	AWS itself: usually resolves via instance stop/start (moves you to new hardware), not a reboot
Instance status check	The guest OS/instance’s own network reachability and software state	OS kernel panic, exhausted memory causing the network stack to stop responding, filesystem corruption, misconfigured OS networking	You/your team: usually needs a reboot or investigation on the instance itself

Why this matters for RabbitMQ specifically: from the broker’s own perspective, an instance that fails either status check is not “shut down”; it’s abruptly terminated. There’s no graceful rabbitmqctl stop_app, no partition-handling negotiation, nothing written to the RabbitMQ log about why. The other two nodes simply stop receiving heartbeats and, after the detection timeout, report it missing from Running Nodes. From cluster_status alone, that is indistinguishable from a network partition (see Playbook 03). The only way to tell “the node process is confused about the network” (partition) apart from “the node is actually gone” (instance failure) is to check the EC2/CloudWatch side, which is why it’s step 1 in Section 4 below.
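
The split between the two checks can be sketched as a small triage helper. This is a minimal illustration, not tooling from the playbook: `StatusChecks` and `triage_status_checks` are hypothetical names, and giving the system check precedence when both fail is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatusChecks:
    """Latest state of the two per-instance checks (True means the check failed)."""
    system_failed: bool
    instance_failed: bool


def triage_status_checks(checks: StatusChecks) -> str:
    """Map the split status checks to the fault location from the table above."""
    if checks.system_failed:
        # Host/hypervisor/underlying network: stop/start moves the instance to new hardware.
        # Assumption: if both checks fail, the system-level fault is treated as primary.
        return "system: AWS-side host fault; stop/start rather than reboot"
    if checks.instance_failed:
        # Guest OS or its networking: reboot or investigate on the instance itself.
        return "instance: guest OS fault; reboot or investigate on the instance"
    # Both checks pass: the node is up at the infra level, so look elsewhere.
    return "none: infra healthy; suspect partition or SG/NACL block"


print(triage_status_checks(StatusChecks(system_failed=False, instance_failed=True)))
```

The point of the helper is only that the two flags are read separately; a combined “checks failed” summary would lose the information that decides who fixes the problem.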

3. How it manifests to the Spring app

AWS-side cause	What the Spring Boot app sees
EC2 instance status check failure (node terminated)	`AmqpConnectException`, typically wrapping `Connection timed out` once the host itself is gone (a `Connection refused` would mean the host is still up but nothing is listening), from any app whose `spring.rabbitmq.addresses` lists only that node’s hostname; apps configured against the full node list fail over silently (see Playbook 03)
EBS throughput/IOPS exhaustion	No connection errors at all. Publishes succeed but take much longer: rising `RabbitTemplate` publish-confirm latency, growing thread pool queues if publishing is synchronous, possibly `AmqpTimeoutException` if a publisher-confirm timeout is configured tightly. This is the easiest of the five to misdiagnose, because nothing looks “down.”
Security group misconfiguration	`AmqpConnectException` wrapping a `ConnectException: Connection timed out` (not “refused”) on new connection attempts: see the timeout-vs-refused distinction below. Existing, already-established connections are typically unaffected until they naturally cycle.
NACL misconfiguration	Same symptom as an SG block (`Connection timed out`) if it blocks inbound, but can also produce a stranger pattern: connections that complete the initial handshake but then hang or reset, if only the return traffic is blocked, because NACLs evaluate each direction independently.
ASG replaces a node	Same as EC2 instance failure initially (`AmqpConnectException` for that node), but if the replacement instance doesn’t cleanly rejoin the cluster, apps may see the cluster permanently down to 2 nodes rather than recovering on its own within the usual window: worth comparing against Playbook 03’s “does not automatically rejoin” escalation trigger.

A useful diagnostic shortcut: Connection refused means something on the target actively rejected the TCP connection (broker process down, but OS/network fine). Connection timed out means the packet never got a response at all: no SYN-ACK, no RST. A timeout is the single strongest log-side hint that you’re looking at a network-layer block (SG, NACL, routing) rather than a broker-process problem, because a genuinely stopped RabbitMQ process on a reachable host still produces a fast “refused,” not a timeout.

# Timeout pattern: network/SG/NACL block (broker unreachable at the network layer)
o.s.amqp.AmqpConnectException: java.net.ConnectException: Connection timed out
    at org.springframework.amqp.rabbit.support.RabbitExceptionTranslator...

# Refused pattern: host reachable, but nothing listening on the port (process actually down)
o.s.amqp.AmqpConnectException: java.net.ConnectException: Connection refused
    at org.springframework.amqp.rabbit.support.RabbitExceptionTranslator...
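
When scanning a large app log for these two patterns, the distinction can be sketched as a simple classifier. This is an illustrative helper only: `classify_connect_failure` is a hypothetical name, and matching on the message text assumes the JVM’s English `ConnectException` wording shown above.

```python
def classify_connect_failure(log_line: str) -> str:
    """Classify a connection-failure log line by the timeout-vs-refused distinction."""
    text = log_line.lower()
    if "connection timed out" in text:
        # No SYN-ACK and no RST came back: something dropped the packets.
        return "timeout: suspect network-layer block (SG, NACL, routing)"
    if "connection refused" in text:
        # The host answered with a reset: it is reachable, but nothing is listening.
        return "refused: host reachable, broker process likely down"
    return "unclassified"


line = "o.s.amqp.AmqpConnectException: java.net.ConnectException: Connection timed out"
print(classify_connect_failure(line))
```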

4. Diagnostic Steps

Work top to bottom: cheapest, fastest, least-invasive checks first. The goal at every step is to narrow down which AWS layer is involved before you touch anything.

1. EC2 console / CloudWatch: instance status checks. For the node(s) in question, check System status check vs. Instance status check individually (not just the combined “2/2 checks passed” summary). A system-check failure points at AWS-side hardware; an instance-check failure points at the guest OS. Either way, this tells you the node is a genuine infra casualty, not a network partition; compare against Playbook 03, Section 3, step 4.
2. CloudWatch EBS metrics for the node’s data volume: EBSVolumeQueueLength (I/O requests waiting to be served) and EBSReadWriteOps (I/O operations actually served), plus BurstBalance if the volume is gp2. A queue length sustained above roughly 1 per 1,000 provisioned IOPS (about 3 for a 3,000-IOPS volume), or a BurstBalance heading toward 0, means the disk, not the broker, is the bottleneck.
3. Security group rules, in both directions, for the broker SG and the app-tier SG involved. Reuse the table from AWS Architecture as your checklist (app→broker 5672/5671, broker↔broker 4369/25672/35672-35682, management 15672). Confirm the rule exists and that its source/destination reference (SG ID or CIDR) still matches reality after any recent migration.
4. Network ACL rules for the subnets involved, both inbound and outbound, on both the app subnet and the broker subnet. This step is easy to skip because NACLs are touched far less often than SGs, but a stateless NACL needs an explicit rule for the return leg of the conversation, not just the initiating leg. Missing this is the single most common reason “we already checked the security group and it’s fine” turns out to be wrong.
5. ASG activity history for the broker’s Auto Scaling Group: look for any Terminating/Launching activity around the time symptoms started. If a node was replaced, note the exact timestamp of termination and the exact timestamp the replacement instance became InService.
6. Cross-reference timestamps. Line up the RabbitMQ-side symptom (node missing from cluster_status, latency spike start time) against the AWS-side event (ASG termination timestamp, EBS queue-length spike start, SG/NACL change timestamp from CloudTrail if available). A match within a minute or two is strong evidence of causation; a gap of many minutes suggests coincidence and means you should keep looking rather than anchoring on the first AWS event you find.

Step	Question it answers	Typical time cost
1. EC2/CloudWatch status checks	Is the node actually down at the infra level, or just unreachable?	1-2 min
2. CloudWatch EBS metrics	Is the disk saturated?	1-2 min
3. Security group rules	Is a stateful network rule blocking this?	2-3 min
4. NACL rules (both directions, both subnets)	Is a stateless network rule blocking this asymmetrically?	3-5 min
5. ASG activity history	Was a node replaced, and when exactly?	1-2 min
6. Timestamp cross-reference	Causation or coincidence?	2-3 min
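
The timestamp cross-reference in the last step can be sketched as follows. This is a minimal illustration under stated assumptions: `correlate` is a hypothetical helper, the two-minute window is one reading of “a minute or two,” and the timestamps are invented for the example.

```python
from datetime import datetime, timedelta


def correlate(symptom_at: datetime, aws_event_at: datetime,
              window: timedelta = timedelta(minutes=2)) -> tuple[str, float]:
    """Classify the gap between a broker-side symptom and an AWS-side event."""
    # Absolute gap: metric granularity can place the symptom slightly before the event.
    gap = abs(symptom_at - aws_event_at)
    verdict = "likely causal" if gap <= window else "possibly coincidental"
    return verdict, gap.total_seconds()


# Hypothetical timestamps, for illustration only.
symptom = datetime(2024, 1, 10, 10, 5, 30)
asg_termination = datetime(2024, 1, 10, 10, 4, 50)
sg_change = datetime(2024, 1, 10, 3, 0, 0)

print(correlate(symptom, asg_termination))  # ('likely causal', 40.0)
print(correlate(symptom, sg_change))
```

A “possibly coincidental” verdict is a prompt to keep looking, not proof of innocence: a network rule change can stay latent until existing connections cycle, so a large gap lowers suspicion without eliminating it.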

5. Safe Remediations

Situation	Action
SG rule confirmed missing/wrong	Identify the exact corrected rule (port, direction, source/destination SG or CIDR) and hand it to the network/platform team to apply.
NACL rule confirmed missing/wrong	Identify the exact corrected rule, but treat this with extra care (see caution below): hand off with the specific rule number, direction, and subnet affected.
EBS volume saturated	This is generally an escalation to resize or re-provision the volume (increase provisioned IOPS/throughput, or move from `gp2` to `gp3`/higher baseline), not a support-tier live fix.
ASG replaced a node mid-incident, or is likely to replace another during ongoing investigation	Pausing or adjusting the ASG’s health-check-triggered replacement is a valid stopgap to prevent compounding the incident, but is high-blast-radius (see caution below).

⚠️ Caution: correcting a security group or NACL rule should be coordinated with the network/platform team, not applied solo. A security group is scoped to the resources that reference it, so an SG fix is relatively contained. A NACL is scoped to the entire subnet: every resource in that subnet is affected by any rule you add, remove, or reorder, including things that have nothing to do with RabbitMQ. Fixing a NACL rule to unblock the broker can silently open or close traffic for unrelated workloads sharing that subnet. Always have the network/platform team apply or review NACL changes.

⚠️ Caution: pausing ASG health-check replacement during an active incident is a risky, high-blast-radius change. RabbitMQ nodes are stateful cluster members with data on attached EBS volumes, not interchangeable stateless web servers. Blind auto-replacement (the default ASG behavior for a “generic” fleet) is dangerous for a broker unless the ASG has been specifically tuned for it: health check grace periods long enough that a transient blip (GC pause, brief high CPU, a slow health-check response) doesn’t trigger replacement, and either automated EBS/data-volume reattachment or proper cluster-rejoin automation baked into the instance’s bootstrap. If none of that tuning exists, pausing the ASG mid-incident is the safer of two bad options, but this decision should be made by or with on-call engineering, not unilaterally by support tier, because getting it wrong either way (leaving it on vs. pausing it incorrectly) can extend the incident.

6. Escalation Trigger

Escalate immediately (page on-call engineering, per Escalation and Communication) if any of the following are true:

A security group or NACL change is needed to restore connectivity. Network configuration changes are owned by the network/platform team, not applied directly by support tier.
An EBS volume resize/re-provision is needed. This is a capacity/infra change requiring engineering approval and planning (potential downtime or performance impact during the change itself).
An ASG configuration change is needed (pausing health-check replacement, adjusting grace periods, adjusting scaling policies). This is always a joint call with on-call engineering, never solo.
A node loss is confirmed to be caused by AWS infrastructure (failed status check, ASG replacement) rather than an application or broker-level bug, especially if the replacement node has not cleanly rejoined the cluster (see the related trigger in Playbook 03, Section 5).

7. Relevant Commands/Queries/Exhibits

CloudWatch metrics to check (from Tooling Walkthrough):

Metric	What it tells you here
`EBSVolumeQueueLength`	I/O requests queued waiting on the volume: sustained elevation means disk saturation
`EBSReadWriteOps`	Actual read/write operations being served: compare against the volume’s provisioned IOPS limit
`CPUCreditBalance`	If depleting toward 0 on a burstable instance type, the node may soon be throttled: a different root cause than EBS, but produces a similarly confusing “broker is slow but nothing broker-side explains it” symptom
`StatusCheckFailed_System` / `StatusCheckFailed_Instance`	Split view of the two EC2 status checks: check both individually, not just the combined status

EBS baseline vs. provisioned IOPS (why this is easy to misdiagnose):

A gp3 volume comes with a fixed baseline (3,000 IOPS / 125 MiB/s throughput) regardless of size, which you can raise independently by provisioning more. If message rates outgrow that baseline (high publish volume, many quorum queues doing Raft log writes, mirrored durability workloads), the volume throttles once the baseline is exceeded, and every fsync-backed broker operation (publisher confirms, quorum queue appends) queues up behind it. CPU and memory on the node look completely normal because the bottleneck is the disk, not the broker process, which is exactly why this gets misreported as “RabbitMQ is slow” instead of “the EBS volume is undersized for this workload.”
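
The saturation pattern described above (queue length climbing while served IOPS flatten at the provisioned limit) can be sketched as a check over metric samples. This is a rough illustration, not an official AWS formula: `ebs_saturated` is a hypothetical helper, the threshold of 1 queued request per 1,000 provisioned IOPS follows the rule of thumb in the diagnostic steps, and the 95% ceiling and three-sample run are assumptions chosen for the example.

```python
def ebs_saturated(samples, provisioned_iops, min_consecutive=3):
    """Return True if queue length and IOPS both indicate sustained saturation.

    samples: list of (queue_length, served_iops) tuples in time order.
    """
    queue_threshold = provisioned_iops / 1000  # rule of thumb: 1 per 1,000 IOPS
    iops_ceiling = 0.95 * provisioned_iops     # assumption: "at the limit" means >= 95%
    run = 0
    for queue_length, served_iops in samples:
        if queue_length > queue_threshold and served_iops >= iops_ceiling:
            run += 1
            if run >= min_consecutive:
                return True
        else:
            run = 0  # a healthy sample breaks the streak
    return False


# Hypothetical samples for a volume provisioned at 3,000 IOPS.
pinned = [(0.5, 900), (4.2, 2980), (7.9, 3004), (8.3, 2995)]
spiky = [(0.5, 900), (3.5, 2900), (1.0, 1200), (3.2, 2990)]

print(ebs_saturated(pinned, 3000))  # True
print(ebs_saturated(spiky, 3000))   # False
```

Requiring a run of consecutive samples is what separates a disk that is pinned at its limit from one that briefly touches it during a burst.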

Security group reference (from AWS Architecture): the exact rules to check.

Rule	Port	Source	Purpose
App → Broker	5672 (or 5671 TLS)	App tier SG	AMQP client connections
App → Broker (Management API)	15672	App tier SG or admin CIDR	HTTP Management API/UI
Broker ↔ Broker	4369, 25672, 35672-35682	Broker SG itself (self-referencing)	Erlang clustering/gossip
Admin → Broker	via SSM only	Bastion SG / SSM	Node administration
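
Checking a security group against this table comes down to asking whether every required port is covered by some allowed range. The sketch below assumes rules have already been reduced to `(from_port, to_port)` pairs for the correct source; `uncovered_ports` is a hypothetical helper and the `allowed` list is invented for the example.

```python
def uncovered_ports(required, allowed):
    """Return the required ports that no allowed range covers.

    required, allowed: lists of inclusive (from_port, to_port) tuples.
    """
    missing = []
    for low, high in required:
        for port in range(low, high + 1):
            if not any(a_low <= port <= a_high for a_low, a_high in allowed):
                missing.append(port)
    return missing


# Broker-to-broker ports from the table above.
required = [(4369, 4369), (25672, 25672), (35672, 35682)]
# Hypothetical rule set in which the 25672 rule was dropped.
allowed = [(4369, 4369), (35672, 35682)]

print(uncovered_ports(required, allowed))  # [25672]
```

Checking port by port, rather than comparing range endpoints, also catches a range that was narrowed rather than removed outright.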

Key SG vs. NACL difference to keep straight:

	Security Group	Network ACL
Scope	Attached to specific resources (ENIs/instances)	Applies to an entire subnet
State	Stateful: allowing inbound automatically allows the matching outbound return traffic	Stateless: inbound and outbound rules are evaluated independently; you must explicitly allow both directions
Default	Deny all inbound, allow all outbound (until rules added)	Default NACL allows all; a custom NACL denies all until rules added
Common mistake	Forgetting the self-referencing clustering rule	Adding an inbound allow rule but forgetting the matching outbound allow for the response traffic (or vice versa)
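
The stateful-vs-stateless row is the one that causes the most confusion, so here is a toy model of it. It is deliberately simplified and all names are hypothetical: it looks only at the broker side, ignores sources, deny rules, and NACL rule-number ordering, and treats every rule as an allow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    direction: str  # "in" or "out"
    port_from: int
    port_to: int


def allows(rules, direction, port):
    """True if any allow rule in the given direction covers the port."""
    return any(r.direction == direction and r.port_from <= port <= r.port_to
               for r in rules)


def sg_round_trip(rules, dest_port):
    """Stateful: an inbound allow on dest_port also admits the reply traffic."""
    return allows(rules, "in", dest_port)


def nacl_round_trip(rules, dest_port, client_ephemeral_port):
    """Stateless: the request and the reply are each checked against the rules."""
    return (allows(rules, "in", dest_port)
            and allows(rules, "out", client_ephemeral_port))


inbound_only = [Rule("in", 5672, 5672)]
print(sg_round_trip(inbound_only, 5672))           # True
print(nacl_round_trip(inbound_only, 5672, 49152))  # False: reply leg not allowed

with_return = inbound_only + [Rule("out", 1024, 65535)]
print(nacl_round_trip(with_return, 5672, 49152))   # True
```

The same inbound-only rule set passes as a security group and fails as a NACL, which is the “we fixed the SG but it’s still broken” trap in miniature.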

# Once network connectivity is confirmed and you're validating cluster state
# (via SSM Session Manager, not direct SSH)
rabbitmq-diagnostics cluster_status
rabbitmq-diagnostics check_running

8. Guided Practical

You don’t have hands-on AWS access for this one. Instead, diagnose from three exhibits, exactly as you would from an incident channel where someone pastes console screenshots and CLI output.

Ticket:

“PagerDuty fired RabbitMQ node unreachable, rmq-3 at 14:32 UTC. Around the same time, the payments-service team reports publish latency spiking from ~5ms to over 4 seconds on rmq-1 and rmq-2 (the two nodes still up). Management UI shows no memory or disk alarms on either surviving node. We need to know: is this one incident or two, and what’s the root cause of each?”

Exhibit A, CloudWatch EBS metrics for rmq-1’s data volume (gp3, baseline 3,000 IOPS / 125 MiB/s):

Timestamp (UTC)   EBSVolumeQueueLength   EBSReadWriteOps (IOPS)   BurstBalance
14:15             0.8                    1,150                    N/A (gp3 has no burst balance)
14:20             1.1                    1,400                    N/A
14:25             3.6                    2,950                    N/A
14:30             9.2                    3,010                    N/A
14:35             11.4                   3,005                    N/A
14:40             10.9                   2,998                    N/A

Exhibit B, ASG activity history for the RabbitMQ broker ASG (rmq-broker-asg):

Activity                                    Start (UTC)   End (UTC)     Cause
Terminating EC2 instance: i-0a1b2c3d4e5f     14:31:02      14:31:48      At 14:30:55 UTC an instance was
                                                                          taken out of service in response
                                                                          to a failed EC2 instance status
                                                                          check.
Launching a new EC2 instance: i-0f9e8d7c6b5   14:31:50      14:34:12      A new instance was launched in
                                                                          response to a difference between
                                                                          desired and actual capacity.

(Instance i-0a1b2c3d4e5f was tagged rmq-3.)

Exhibit C, Security group rule diff (from a change ticket, applied at 09:00 UTC that same day, unrelated migration):

SG: sg-broker0123 (RabbitMQ broker nodes)

  Rule removed:
    Type: Custom TCP   Port: 35672-35682   Source: sg-broker0123 (self)
  Rule added:
    Type: Custom TCP   Port: 35672-35680   Source: sg-broker0123 (self)

Your task: using only these three exhibits plus what you know from Sections 2-3 above, answer:

Which exhibit explains the rmq-3 node-down alert at 14:32, and what’s the precise mechanism?
Which exhibit explains the payments-service latency spike on rmq-1/rmq-2, and why wouldn’t Management UI alarms catch it?
Is the 09:00 UTC security group change (Exhibit C) relevant to either symptom? Why or why not?
For each root cause, is this something you’d remediate yourself or escalate, and to whom?

✅ Checkpoint

You should now be able to:

Distinguish a system status check failure from an instance status check failure, and explain why either one makes a RabbitMQ node look “down” with no broker-side log evidence.
Explain why EBS throughput/IOPS exhaustion produces elevated latency with no corresponding CPU/memory/alarm signal, and name the two CloudWatch metrics that expose it.
Explain the stateful-vs-stateless distinction between security groups and NACLs, and why a NACL fix requires checking both inbound and outbound rules on both subnets.