RabbitMQ on AWS: EC2 Cluster Architecture Deep Dive

Prerequisite: Core Concepts
You’ll need: nothing new; this module is reading plus a diagram exercise.

Start of the Operations and Troubleshooting section. Everything so far taught you how to build with RabbitMQ. This section shifts perspective to running and troubleshooting it in production: cluster architecture, tooling, alert playbooks, an incident lab, and escalation. The examples use a self-managed AWS deployment and an on-call framing, but the diagnostic thinking transfers to any environment. Read it to understand how the system you built behaves under real infrastructure, failure, and incident conditions.

Placeholder notice: This module uses a realistic, commonly-seen setup as a stand-in for your team’s actual environment: a 3-node, self-managed RabbitMQ cluster on EC2, using quorum queues. Ask your team lead for the real architecture diagram and swap in real values (VPC IDs, security group names, instance types) once you have them. The concepts below don’t change regardless of the exact numbers.

What you’ll be able to do after this module

Describe how RabbitMQ nodes are networked, clustered, and secured in AWS.
Explain what “healthy” looks like for the cluster at the infrastructure level.
Read a security-group table and correctly identify whether it allows your app to reach the broker.

1. Compute layer: EC2 nodes and clustering

Our assumed setup: 3 EC2 instances, each running a RabbitMQ node, forming one logical cluster.

   AZ-a                  AZ-b                  AZ-c
┌─────────┐          ┌─────────┐          ┌─────────┐
│ rmq-1   │◀────────▶│ rmq-2   │◀────────▶│ rmq-3   │
│ (EC2)   │          │ (EC2)   │          │ (EC2)   │
└─────────┘          └─────────┘          └─────────┘
     ▲                    ▲                    ▲
     └────────────────────┴────────────────────┘
              Erlang distribution (port 25672)
              + epmd peer discovery (4369), CLI tools (35672-35682)

Key facts:

Nodes are spread across Availability Zones (AZs) for resilience: losing one AZ shouldn’t take down the whole cluster.
Nodes talk to each other over Erlang’s distribution protocol (port 25672, with epmd on 4369 for peer discovery; CLI tools use 35672-35682). This traffic must be allowed within the cluster’s security group but should never be exposed publicly.
Instance type matters: RabbitMQ is memory- and I/O-sensitive. Watch for burstable instance types (e.g., t3.medium): under sustained load they can exhaust their CPU credits and throttle to baseline, which looks like a mysterious latency spike (see Playbook 09).
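To get a feel for how quickly a burstable instance can run dry, here is a rough back-of-the-envelope sketch. The t3.medium figures used in the example (2 vCPUs, 24 credits earned per hour, a 576-credit maximum balance) are assumptions to verify against current AWS documentation, and the model assumes standard credit mode; in unlimited mode the instance keeps bursting and is billed for the surplus instead of throttling.

```python
from typing import Optional


def hours_until_credits_exhausted(
    vcpus: int,
    utilization: float,        # average per-vCPU utilization, 0.0 to 1.0
    credits_per_hour: float,   # credits the instance earns each hour
    starting_credits: float,   # credit balance when the load begins
) -> Optional[float]:
    """Estimate hours of sustained load before the credit balance hits zero.

    One CPU credit is one vCPU running at 100% for one minute, so a vCPU
    at 100% spends 60 credits per hour. Returns None when the load is at
    or below the earn rate and the balance never drains.
    """
    spent_per_hour = vcpus * utilization * 60
    net_burn = spent_per_hour - credits_per_hour
    if net_burn <= 0:
        return None
    return starting_credits / net_burn


# Assumed t3.medium values: 2 vCPUs, 24 credits/hour, full balance of 576.
print(hours_until_credits_exhausted(2, 1.0, 24, 576))  # both vCPUs pinned
print(hours_until_credits_exhausted(2, 0.1, 24, 576))  # light load
```

The point is not the exact number but the shape: a node can look healthy for hours after a load increase and then slow down abruptly once the balance reaches zero.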

Why 3 (or any odd number)? Quorum-based systems need a majority to make decisions. With 3 nodes, the cluster tolerates 1 node failure and keeps operating; with 2 node failures, it loses quorum and pauses writes to protect data consistency. An even node count adds cost without adding tolerance: 4 nodes need a majority of 3, so they still tolerate only 1 failure. This is the same majority logic used by etcd, ZooKeeper, and Raft-based systems, if you’ve encountered those.
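The majority arithmetic is small enough to write down. This is a minimal sketch of the rule described above, not RabbitMQ code; the function names are made up for illustration.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can fail while a majority is still reachable."""
    return cluster_size - majority(cluster_size)


for n in (2, 3, 4, 5):
    print(f"{n} nodes: majority={majority(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

Running it shows why odd sizes are the usual choice: going from 3 to 4 nodes raises the majority from 2 to 3 without raising the number of failures you can absorb.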

2. Queue types and high availability

Queue type	HA model	Status
Classic queue (non-mirrored)	Lives on exactly one node. If that node dies, the queue is unavailable until it comes back.	Legacy: avoid for anything requiring HA
Classic mirrored queue	Replicated across nodes via HA policies (classic queue mirroring).	Deprecated by RabbitMQ upstream and removed in RabbitMQ 4.0: do not use for new queues
Quorum queue	Replicated across multiple nodes using the Raft consensus algorithm. Survives minority node failure without data loss.	Current recommended default: assume this is what your production queues use unless told otherwise

What this means for you during an incident: if one node in a 3-node quorum-queue cluster goes down, queues with replicas on the surviving 2 nodes keep working normally, with no message loss and only a brief leader re-election. If two nodes go down simultaneously, those queues lose quorum and stop accepting new messages until quorum is restored. This distinction is central to Playbook 03, Node Down / Cluster Partition.
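During an incident it helps to reason per queue rather than per cluster. The sketch below models that: given where each queue's replicas live and which nodes are down, it reports which queues still have a majority. The queue names and the `triage` helper are hypothetical; on a real cluster you would read replica membership from the management UI or CLI.

```python
from typing import Dict, Set


def has_quorum(replicas: Set[str], down: Set[str]) -> bool:
    """True when a strict majority of the queue's replicas is still up."""
    reachable = replicas - down
    return len(reachable) > len(replicas) // 2


def triage(queues: Dict[str, Set[str]], down: Set[str]) -> Dict[str, str]:
    """Label each queue as available or unavailable for the given outage."""
    return {
        name: "available" if has_quorum(replicas, down) else "unavailable"
        for name, replicas in queues.items()
    }


# Hypothetical queues, each replicated on all three nodes.
queues = {
    "payments.events": {"rmq-1", "rmq-2", "rmq-3"},
    "orders.created": {"rmq-1", "rmq-2", "rmq-3"},
}

print(triage(queues, down={"rmq-2"}))           # one node lost
print(triage(queues, down={"rmq-2", "rmq-3"}))  # two nodes lost
```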

3. Networking: VPC, subnets, and security groups

                         VPC (10.0.0.0/16)
   ┌──────────────────────────────────────────────────────┐
   │  Private subnet (app tier)     Private subnet (broker)│
   │  ┌───────────────────┐         ┌────────────────────┐ │
   │  │ Spring Boot        │  5672   │  RabbitMQ nodes     │ │
   │  │ microservices  ────┼────────▶│  (rmq-1/2/3)        │ │
   │  │ (EC2/ECS)          │  TLS    │                     │ │
   │  └───────────────────┘         └────────────────────┘ │
   │                                         │ 15672        │
   │                                         ▼               │
   │                              Management UI (internal    │
   │                              only, via bastion/VPN)      │
   └──────────────────────────────────────────────────────┘

Security group (SG) reference table: this is the #1 thing to check when an app team reports “can’t connect to RabbitMQ”:

Rule	Port	Source	Purpose
App → Broker	5672 (or 5671 for TLS)	App tier SG	AMQP client connections
App → Broker (Management API, if used programmatically)	15672	App tier SG or specific admin CIDR	HTTP Management API/UI
Broker ↔ Broker	4369, 25672, 35672-35682	Broker SG itself (self-referencing rule)	Erlang clustering/gossip
Admin → Broker	22 (SSH) or via SSM (no port needed)	Bastion SG / SSM only	Node administration
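Reading the table is easier if you think of it as data plus one rule: inbound traffic is denied unless some ingress rule matches both the source and the port. Here is a minimal sketch of that evaluation. The security group names (`sg-app-tier`, `sg-broker`, `sg-reporting`) are hypothetical, and real rules also carry protocol and CIDR fields that are left out here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IngressRule:
    name: str
    from_port: int
    to_port: int
    source_sg: str


# Simplified model of the broker security group's ingress rules.
BROKER_INGRESS = [
    IngressRule("App -> Broker (AMQP)", 5672, 5672, "sg-app-tier"),
    IngressRule("App -> Broker (AMQP over TLS)", 5671, 5671, "sg-app-tier"),
    IngressRule("Broker <-> Broker (epmd)", 4369, 4369, "sg-broker"),
    IngressRule("Broker <-> Broker (distribution)", 25672, 25672, "sg-broker"),
    IngressRule("Broker <-> Broker (CLI tools)", 35672, 35682, "sg-broker"),
]


def matching_rules(rules: List[IngressRule], source_sg: str, port: int) -> List[IngressRule]:
    """Return every rule that admits traffic from source_sg on this port."""
    return [
        r for r in rules
        if r.source_sg == source_sg and r.from_port <= port <= r.to_port
    ]


def is_allowed(rules: List[IngressRule], source_sg: str, port: int) -> bool:
    """Security groups are default-deny: no matching rule means blocked."""
    return bool(matching_rules(rules, source_sg, port))


print(is_allowed(BROKER_INGRESS, "sg-app-tier", 5672))   # listed source
print(is_allowed(BROKER_INGRESS, "sg-reporting", 5672))  # unlisted source
```

Note that the match is on the source group itself, not on what the instance runs or which subnet it sits in.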

Common infra-side connectivity failure: someone tightens a security group during a routine hardening pass and forgets the self-referencing clustering rule (4369/25672). Nodes can no longer see each other, and you get what looks like a “network partition” but is actually a misconfigured SG. Covered in Playbook 07.
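A quick way to tell a filtering problem from a broker problem is to look at how a TCP connection attempt fails. The sketch below is a small probe using only the standard library; the `probe` and `probe_ports` names are made up for this module. As a rule of thumb, a timeout typically means packets are being dropped silently (security group, NACL, or routing), while a refusal means the host was reached but nothing is listening on that port.

```python
import socket
from typing import Dict, Iterable


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP connection and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        return "timeout"   # often a security group or routing drop
    except ConnectionRefusedError:
        return "refused"   # host reachable, no listener on the port
    except OSError as exc:
        return f"error: {exc}"


def probe_ports(host: str, ports: Iterable[int], timeout: float = 3.0) -> Dict[int, str]:
    """Probe several ports on one host, e.g. the clustering ports."""
    return {port: probe(host, port, timeout) for port in ports}
```

Run from one broker node against a peer, something like `probe_ports("rmq-2", [4369, 25672])` returning timeouts on both ports points at the self-referencing rule before it points at RabbitMQ. The hostname is the placeholder from the diagram; substitute the peer's private address.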

4. Storage: EBS volumes

Each RabbitMQ node persists messages (for durable queues) to disk on an attached EBS volume. Two metrics matter most:

Disk space free: RabbitMQ has a built-in disk-space alarm, controlled by the disk_free_limit setting. If free space on any node drops below the threshold, the broker blocks all publishers cluster-wide to protect itself from running out of disk. This is a deliberate safety mechanism, not a bug, but it looks like a total outage to producers.
IOPS/throughput: under-provisioned EBS volumes (e.g., a gp3 volume left at its baseline of 3,000 IOPS and 125 MiB/s) can throttle under sustained high message rates, causing publish/ack latency to spike even though CPU and memory look fine.
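Both storage checks reduce to simple comparisons, sketched below. The 2 GiB threshold and the traffic figures are illustrative assumptions, not recommendations, and the throughput estimate ignores write amplification, batching, and fsync behavior, so treat it as a lower bound on what the volume has to absorb.

```python
from typing import Dict, List

GIB = 1024 ** 3
MIB = 1024 ** 2


def nodes_in_disk_alarm(free_bytes: Dict[str, int], limit_bytes: int) -> List[str]:
    """Nodes whose free disk space has fallen below the configured limit."""
    return sorted(node for node, free in free_bytes.items() if free < limit_bytes)


def publishers_blocked(free_bytes: Dict[str, int], limit_bytes: int) -> bool:
    """One node in alarm is enough to block publishers across the cluster."""
    return bool(nodes_in_disk_alarm(free_bytes, limit_bytes))


def write_throughput_mib_s(messages_per_s: float, avg_message_bytes: float) -> float:
    """Raw payload bytes written per second, expressed in MiB/s."""
    return messages_per_s * avg_message_bytes / MIB


# Hypothetical snapshot: rmq-3 is nearly full, the others are fine.
free = {"rmq-1": 40 * GIB, "rmq-2": 38 * GIB, "rmq-3": 1 * GIB}
limit = 2 * GIB  # assumed threshold for this example

print(nodes_in_disk_alarm(free, limit))
print(publishers_blocked(free, limit))

# 20,000 persistent messages per second at 8 KiB each, compared against
# the 125 MiB/s gp3 baseline.
needed = write_throughput_mib_s(20_000, 8 * 1024)
print(f"{needed:.2f} MiB/s needed, over baseline: {needed > 125}")
```

The first half is why a single full disk looks like a cluster-wide outage; the second is why latency can spike while CPU and memory graphs stay flat.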

5. Secrets and IAM

Broker credentials (username/password, or TLS client certs) are stored in AWS Secrets Manager, not hardcoded in application.yml.
Spring Boot apps typically fetch these at startup via the AWS Secrets Manager Spring Cloud integration, or via an init container/sidecar that injects them as environment variables.
Important gotcha: if a secret is rotated (e.g., scheduled password rotation), a running Spring Boot app’s ConnectionFactory does not automatically pick up the new credential. It keeps using the value it loaded at startup, so existing connections stay up, but the next reconnect presents the cached (now stale) credential and fails authentication until the app is restarted or re-reads the secret. This is a very common source of “why did auth suddenly start failing at 2am” tickets; see Playbook 05.
IAM roles control which AWS principals can read the secret, manage the EC2 instances, or (if using Amazon MQ instead of self-managed) call the Amazon MQ API. IAM does not control RabbitMQ-level permissions (which user can publish/consume on which vhost); those are managed inside RabbitMQ itself via its own user/permission system.
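The rotation gotcha is easier to see in a toy model. The sketch below is a simulation only: `FakeSecretStore`, `FakeBroker`, and both client classes are hypothetical stand-ins, not the Secrets Manager or Spring AMQP APIs. It contrasts a client that reads the password once at startup with one that re-reads it on every connection attempt.

```python
class FakeSecretStore:
    """Stand-in for a secret store holding the broker password."""

    def __init__(self, password: str) -> None:
        self._password = password

    def get(self) -> str:
        return self._password

    def rotate(self, new_password: str) -> None:
        self._password = new_password


class FakeBroker:
    """Accepts a connection only if the password matches the current secret."""

    def __init__(self, store: FakeSecretStore) -> None:
        self._store = store

    def authenticate(self, password: str) -> bool:
        return password == self._store.get()


class CachingClient:
    """Reads the credential once at startup and reuses it forever."""

    def __init__(self, store: FakeSecretStore, broker: FakeBroker) -> None:
        self._password = store.get()
        self._broker = broker

    def connect(self) -> bool:
        return self._broker.authenticate(self._password)


class RefetchingClient:
    """Re-reads the credential each time it opens a connection."""

    def __init__(self, store: FakeSecretStore, broker: FakeBroker) -> None:
        self._store = store
        self._broker = broker

    def connect(self) -> bool:
        return self._broker.authenticate(self._store.get())


store = FakeSecretStore("initial-password")
broker = FakeBroker(store)
caching = CachingClient(store, broker)
refetching = RefetchingClient(store, broker)

print(caching.connect(), refetching.connect())  # before rotation
store.rotate("rotated-password")
print(caching.connect(), refetching.connect())  # after rotation, on reconnect
```

In the model, the caching client only fails when it next calls `connect`, which mirrors why these incidents surface at reconnect time rather than at rotation time.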

Practical: annotate the architecture

No infra access needed for this one; it’s a comprehension check.

Exercise: Below is a support ticket. Using only the diagrams and tables above, answer the three questions.

“The payments-service team says their Spring Boot app can’t connect to RabbitMQ. They get connection timeouts on port 5672. They also mention their app was just moved from the app-tier-a security group to a brand-new app-tier-b security group as part of a migration.”

What’s the most likely root cause, based on the SG table in section 3?
What would you check first: application logs, RabbitMQ logs, or AWS security group rules? Why that order?
Is this something you could safely fix yourself, or does it need escalation? (You don’t know the full escalation policy yet. Just reason about whether this is a “broker problem” or an “AWS config problem,” which determines who owns the fix.)

✅ Checkpoint

You should now be able to:

Explain why quorum queues tolerate 1-of-3 node failures but not 2-of-3.
Point to the specific SG rule that would block app-to-broker connectivity vs. broker-to-broker clustering.
Explain why a disk-space alarm blocks publishers cluster-wide instead of just failing on one node.

Next: Tooling Walkthrough