Node Down & Cluster Partition: RabbitMQ Incident Guide

Prerequisites: AWS Architecture, Tooling Walkthrough

1. Symptom

CloudWatch alarm or PagerDuty page: “RabbitMQ node unreachable” or “Cluster node count < 3.”
Management UI Overview shows fewer than 3 nodes under the node list, or a node shown in red/grey instead of green.
App-side: Spring Boot logs suddenly full of AmqpConnectException / connection retry messages, sometimes from only some app instances, not all.
Sometimes there’s no alarm at all yet, just a Slack message from another team: “is RabbitMQ down? Our app can’t connect.”

First, understand the concept behind this alert: what a network partition actually is.

A cluster of 3 nodes normally has all nodes able to see and talk to each other over the clustering ports from AWS Architecture: 4369 (epmd peer discovery), 25672 (inter-node Erlang distribution), and 35672-35682 (CLI tools). A network partition (“split-brain”) happens when all nodes are still running but some of them can no longer reach each other over the network. For example, rmq-1 can’t reach rmq-2 and rmq-3, but rmq-2 and rmq-3 can still see each other. From rmq-1’s point of view, it looks like the other two nodes died. From rmq-2/rmq-3’s point of view, it looks like rmq-1 died. Both sides are alive; they just disagree about who’s in the cluster.

This is dangerous for a stateful system: if both sides kept accepting writes independently, you’d get two diverging copies of the same queue’s data, with no automatic way to reconcile them later. RabbitMQ’s cluster_partition_handling setting decides what happens when this is detected:

Mode	Behavior	Used here?
`ignore`	Do nothing: both sides keep running independently. Risk of data divergence when the partition heals.	No: never recommended for quorum queues
`autoheal`	Let the partition happen, then automatically pick a winning side and restart nodes on the losing side, discarding their state since the split.	No: can silently lose data
`pause_minority`	The side of the partition that is not part of the majority immediately pauses itself (stops serving clients) until it can rejoin the majority.	Yes: our assumed cluster config

With pause_minority on a 3-node cluster: if a partition splits the cluster 1-vs-2, the lone node pauses itself rather than risk serving stale or divergent data, while the 2-node majority side keeps running normally. This is deliberately conservative: it sacrifices availability on the minority side to guarantee consistency, which is the same trade-off quorum queues make internally via Raft (see AWS Architecture). A paused node looks “down” from the outside even though the process is technically still alive. That’s expected, not a separate bug.
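The majority rule behind pause_minority is simple arithmetic, and a small sketch can make it concrete. The helper name `side_keeps_running` below is hypothetical (it is not a RabbitMQ command); it only illustrates the "strictly more than half" test described above.

```shell
# side_keeps_running SIDE_SIZE CLUSTER_SIZE
# Succeeds (exit 0) when SIDE_SIZE nodes are a strict majority of CLUSTER_SIZE,
# i.e. the side of a partition that pause_minority leaves running.
# Hypothetical helper for illustration only; not part of any RabbitMQ tooling.
side_keeps_running() {
  [ $(( $1 * 2 )) -gt "$2" ]
}

# Walk through the 1-vs-2 split on a 3-node cluster.
for side in 1 2; do
  if side_keeps_running "$side" 3; then
    echo "side of $side in a 3-node cluster: keeps serving"
  else
    echo "side of $side in a 3-node cluster: pauses"
  fi
done
```

Note that an even split has no majority at all: with 4 nodes split 2-vs-2, the test fails for both sides, which is one reason odd cluster sizes are preferred.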

2. Likely Causes

Broker-side

Cause	Notes
Actual EC2 instance failure/termination	Hardware fault, spot interruption, or an ASG health-check terminating the instance. The node is genuinely gone, not just partitioned.
Network blip between AZs	Transient AZ-to-AZ latency/packet loss makes nodes stop seeing each other over the clustering ports even though all instances are still running: a true partition, not a node failure.
Security group misconfiguration blocking clustering ports	Someone edits the broker SG and breaks the self-referencing rule for `4369`/`25672`/`35672-35682` (see the SG table in AWS Architecture). Looks exactly like a network partition, but the actual root cause is a config change, not a network fault.
ASG replaces the “unhealthy” node mid-incident	If an Auto Scaling Group’s health check flags a paused/partitioned node as unhealthy and terminates + replaces it before the underlying network issue resolves, the new instance joins as a fresh node with no data, not a rejoined node: this turns a transient blip into a real, permanent membership problem.
Erlang process/resource exhaustion on one node	File descriptor limits, Erlang process limits, or memory pressure on one node can make it stop responding to cluster heartbeats, which the other nodes interpret as a partition or node-down event even though the OS process hasn’t crashed.

Application-side (Spring Boot)

Cause	Notes
App only configured with one node’s address	If `spring.rabbitmq.addresses` (or `host`/`port`) lists only a single node instead of all three, or there’s no load balancer/VIP in front of the cluster, losing that one node looks like total broker unavailability to the app, even though the other two nodes are healthy and quorum queues are serving fine. This is the single most common app-side misconfiguration behind this alert.
`CachingConnectionFactory` failover working as designed, but slowly	If `spring.rabbitmq.addresses` is configured with all cluster nodes, Spring AMQP’s connection factory will automatically attempt the next address in the list on connection failure. This is correct behavior, but you’ll still see a burst of retry/reconnect log noise during the failover window, which can look alarming even though the app recovers on its own within seconds.
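Connection retry/backoff exhausting before recovery	If the app’s retry settings (for example publisher retry under `spring.rabbitmq.template.retry`) have a low max-attempts and the partition/failover takes longer than the backoff window, the app may give up and surface errors to callers instead of quietly retrying through the blip. Note that `spring.rabbitmq.listener.simple.retry` governs retries of failed message processing in the listener, not reconnection to the broker, so confirm which setting is actually in play.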

What the app logs look like during this:

o.s.a.r.c.CachingConnectionFactory : Attempting to connect to: [rmq-1:5672]
o.s.amqp.AmqpConnectException: java.net.ConnectException: Connection refused
	at org.springframework.amqp.rabbit.support.RabbitExceptionTranslator...
o.s.a.r.l.SimpleMessageListenerContainer : Consumer raised exception, processing can restart if the connection factory supports it

If spring.rabbitmq.addresses lists all three nodes, you’ll instead see a quick sequence of connect attempts across nodes followed by a successful connection, much shorter-lived and self-resolving.

3. Diagnostic Steps

Cheapest, fastest checks first:

Check the Management UI Overview node list (if reachable at all): are all 3 nodes listed, and are they green? If the UI itself is unreachable, that’s a stronger signal of a broader outage; move to the AWS console/CLI checks below.
Run rabbitmq-diagnostics cluster_status (via SSM Session Manager). This is the definitive source of truth. Look at two things:
- Which nodes appear under Running Nodes vs. missing entirely.
- Whether the Network Partitions section is populated (a true network partition) vs. a node just being absent (a node-down event). These look different and point to different root causes.
Run rabbitmq-diagnostics check_running on a surviving node. This confirms the local node’s own services are up, ruling out “the whole cluster is down” in favor of “one specific node has a problem.”
Check CloudWatch/EC2 console for the underlying instance’s health check status: is the EC2 instance itself stopped/terminated/failing status checks? This tells you whether you’re dealing with real infrastructure failure vs. a network-only partition where the instance is still running fine.
Check whether affected quorum queues still have 2-of-3 replicas alive. In the Management UI, click into a queue and check its member/leader status, or reason from the AWS Architecture module: if only 1 node is down/partitioned, quorum queues keep a majority (2 of 3) and continue serving normally with no message loss, just a brief leader re-election for queues whose leader happened to be on the affected node.
Check the Spring app’s connection config: look at spring.rabbitmq.addresses in the affected app’s config. Does it list all 3 cluster node addresses (or point at a load balancer/VIP in front of the cluster), or just one node’s hostname? This tells you immediately whether the app’s outage is a real cluster problem or a single-point-of-failure config issue on the app side.
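The last check above (how many broker addresses the app is configured with) can be scripted if you have the app’s config file to hand. The following is a minimal sketch, assuming the config is a `.properties` file with the key on a single line; the function name `count_rabbit_addresses` and the sample file are hypothetical, and YAML configs (`application.yml`) are not handled.

```shell
# count_rabbit_addresses FILE
# Prints how many comma-separated entries spring.rabbitmq.addresses has in a
# .properties file, or 0 if the key is absent.
# Hypothetical helper; assumes the key and its value sit on one line.
count_rabbit_addresses() (
  value=$(sed -n 's/^[[:space:]]*spring\.rabbitmq\.addresses[[:space:]]*[=:][[:space:]]*//p' "$1" | head -n 1)
  if [ -z "$value" ]; then
    echo 0
  else
    # One entry per line, then count the non-blank lines.
    printf '%s\n' "$value" | tr ',' '\n' | grep -c '[^[:space:]]'
  fi
)

# Demo against a throwaway sample file (hostnames are illustrative).
sample=$(mktemp)
cat > "$sample" <<'EOF'
spring.rabbitmq.addresses=rmq-1:5672,rmq-2:5672,rmq-3:5672
EOF
count_rabbit_addresses "$sample"   # prints 3
rm -f "$sample"
```

A result of 1 (or 0 with only `spring.rabbitmq.host` set) points at the single-node misconfiguration from Section 2, unless that one address is a load balancer or VIP.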

4. Safe Remediations

Situation	Action
Single transient node loss, `cluster_status` shows 2 nodes running and quorum intact	Usually no action needed beyond monitoring: confirm the node rejoins cleanly once the underlying EC2/network issue clears, and watch `cluster_status` return to all 3 nodes.
App only had one node’s address configured	This is a config fix, not a live remediation: note it as a follow-up action item for the app team (update `spring.rabbitmq.addresses` to list all cluster nodes, or put a load balancer/VIP in front of the cluster). Don’t attempt to change app config live during an active incident unless directed to.
Node rejoins but you want to confirm health	Re-run `rabbitmq-diagnostics cluster_status` and `check_running` on the rejoined node; confirm the Management UI Overview shows all 3 nodes green again.

⚠️ CAUTION: Do not manually force-restart a node or run rabbitmqctl forget_cluster_node without engineering sign-off. forget_cluster_node permanently removes a node from cluster membership. If that node later comes back online, it will refuse to rejoin (it still thinks it’s part of the old cluster) and requires careful manual re-provisioning to bring back in. This is a one-way door during an active incident; treat it as an escalation-only action, not a self-service fix.

5. Escalation Trigger

Escalate immediately (page on-call engineering) if:

2 or more nodes are down/partitioned simultaneously: quorum is lost, and affected quorum queues stop accepting new writes cluster-wide. This is a full incident, not a “wait and see.”
A node does not automatically rejoin after the underlying EC2/network issue is confirmed resolved (e.g., cluster_status still shows it missing 10+ minutes after the instance passes EC2 health checks).
An ASG has already replaced the affected node with a brand-new instance before it could rejoin cleanly. This needs engineering to correctly add the new node to the cluster rather than assuming it will “just work.”
Anything that looks like it requires forget_cluster_node or other manual partition/membership intervention. These are destructive, one-way operations that need sign-off, not a support-tier judgment call.

6. Relevant Commands/Queries

Run via SSM Session Manager, not direct SSH (per our access model in Environment Setup).

# Cluster membership and partition status: your primary diagnostic
rabbitmq-diagnostics cluster_status

# Confirm the local node's own services are healthy
rabbitmq-diagnostics check_running

Healthy output (all 3 nodes present, no partitions):

Basics

Cluster name: rabbit@rmq-1

Disk Nodes

rabbit@rmq-1
rabbit@rmq-2
rabbit@rmq-3

Running Nodes

rabbit@rmq-1
rabbit@rmq-2
rabbit@rmq-3

Versions

...

Alarms

(none)

Network Partitions

(none)

Alerting output, node down (rmq-3 missing entirely from Running Nodes):

Running Nodes

rabbit@rmq-1
rabbit@rmq-2

Alerting output, active partition (all nodes technically “known,” but the Network Partitions section is populated):

Network Partitions

Node rabbit@rmq-1 cannot communicate with rabbit@rmq-3

The distinction matters: “missing from Running Nodes” usually means a real node failure. A populated “Network Partitions” section with the node still listed means the process is alive but split off. That is a true partition, and pause_minority behavior may already be in effect on the minority side.
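The node-down vs. partition distinction can also be checked mechanically. The sketch below assumes the section layout shown in the sample outputs above (bare title lines such as "Running Nodes", node names starting with `rabbit@`, and "(none)" for an empty section); the function name `classify_cluster_status` is hypothetical, and real output may vary between RabbitMQ versions.

```shell
# classify_cluster_status EXPECTED_NODES
# Reads `rabbitmq-diagnostics cluster_status` text on stdin and prints one of:
#   partition  - the Network Partitions section lists at least one entry
#   node_down  - fewer than EXPECTED_NODES nodes under Running Nodes
#   healthy    - neither of the above
# Hypothetical helper; relies on the output layout shown in this guide.
classify_cluster_status() {
  awk -v expected="$1" '
    # A line made only of letters and spaces is treated as a section title.
    /^[A-Za-z][A-Za-z ]*$/ { section = $0; next }
    section == "Running Nodes" && /^rabbit@/ { running++ }
    section == "Network Partitions" && NF > 0 && $0 != "(none)" { partitioned = 1 }
    END {
      if (partitioned) print "partition"
      else if (running + 0 < expected + 0) print "node_down"
      else print "healthy"
    }
  '
}

# Demo using the partition sample from above.
classify_cluster_status 3 <<'EOF'
Network Partitions

Node rabbit@rmq-1 cannot communicate with rabbit@rmq-3
EOF
```

On a live node this would be fed from the real command, for example `rabbitmq-diagnostics cluster_status | classify_cluster_status 3`. Treat the result as a hint for triage, not a replacement for reading the full output.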

7. Mini Practical

Spin up a local 3-node cluster, kill a node, and watch quorum queue behavior with your own eyes.

Step 1: docker-compose.yml:

version: "3.8"

services:
  rmq-1:
    image: rabbitmq:3.13-management
    # Fixed container name so the `docker exec rmq-1 ...` commands below resolve
    container_name: rmq-1
    hostname: rmq-1
    environment:
      RABBITMQ_ERLANG_COOKIE: "shared-cookie-value"
    ports:
      - "15672:15672"
    networks:
      - rmq-net

  rmq-2:
    image: rabbitmq:3.13-management
    container_name: rmq-2
    hostname: rmq-2
    environment:
      RABBITMQ_ERLANG_COOKIE: "shared-cookie-value"
    networks:
      - rmq-net
    depends_on:
      - rmq-1

  rmq-3:
    image: rabbitmq:3.13-management
    container_name: rmq-3
    hostname: rmq-3
    environment:
      RABBITMQ_ERLANG_COOKIE: "shared-cookie-value"
    networks:
      - rmq-net
    depends_on:
      - rmq-1

networks:
  rmq-net:
    driver: bridge

Step 2: Start it and form the cluster:

docker compose up -d

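# Join rmq-2 and rmq-3 to rmq-1's cluster.
# If these fail because a node is not running yet, the brokers are still booting: wait a few seconds and retry.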
docker exec rmq-2 bash -c "rabbitmqctl stop_app && rabbitmqctl join_cluster rabbit@rmq-1 && rabbitmqctl start_app"
docker exec rmq-3 bash -c "rabbitmqctl stop_app && rabbitmqctl join_cluster rabbit@rmq-1 && rabbitmqctl start_app"

# Confirm all 3 nodes see each other
docker exec rmq-1 rabbitmq-diagnostics cluster_status

You should see all three rabbit@rmq-* nodes under Running Nodes.

Step 3: Declare a quorum queue (via Management UI at localhost:15672, login guest/guest, or CLI):

docker exec rmq-1 rabbitmqadmin declare queue name=orders.qq durable=true arguments='{"x-queue-type":"quorum"}'

Step 4: Kill one node and observe:

docker stop rmq-3

docker exec rmq-1 rabbitmq-diagnostics cluster_status

You should see rmq-3 missing from Running Nodes, but rmq-1 and rmq-2 still show as running. Check the Management UI on rmq-1 (localhost:15672): the cluster overview shows 2 of 3 nodes, but orders.qq is still there and still accepting publishes, because 2-of-3 replicas is still a majority.

Step 5: Bring it back and confirm clean rejoin:

docker start rmq-3

# Give it a few seconds to rejoin, then check again
docker exec rmq-1 rabbitmq-diagnostics cluster_status

rmq-3 should reappear under Running Nodes with no manual intervention required. This is the “transient node loss, no action needed beyond monitoring” case from Section 4. Compare this against how much more serious it would look if you stopped two nodes (rmq-2 and rmq-3) instead of one. Try it, and watch orders.qq stop accepting new messages once quorum is lost.
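Rather than guessing how long “a few seconds” is, you can poll until the restarted node reports healthy. The helper below is a minimal sketch of a generic retry loop; the name `wait_for` and its argument order are assumptions made for this example, not part of Docker or RabbitMQ.

```shell
# wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, sleeping DELAY_SECONDS between tries.
# Returns 0 on the first success, 1 if every attempt fails.
# Hypothetical helper for this exercise.
wait_for() {
  _wf_attempts=$1
  _wf_delay=$2
  shift 2
  while [ "$_wf_attempts" -gt 0 ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    _wf_attempts=$((_wf_attempts - 1))
    if [ "$_wf_attempts" -gt 0 ]; then
      sleep "$_wf_delay"
    fi
  done
  return 1
}

# Example for Step 5 (requires the local cluster from this practical):
#   wait_for 30 2 docker exec rmq-3 rabbitmq-diagnostics check_running \
#     && docker exec rmq-1 rabbitmq-diagnostics cluster_status
```

The example polls check_running on rmq-3 up to 30 times, 2 seconds apart, and only prints cluster_status once the node is up. If the loop gives up, that mirrors the “node does not automatically rejoin” escalation trigger from Section 5.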

✅ Checkpoint

You should now be able to:

Explain what a network partition is and why pause_minority deliberately sacrifices availability on the minority side to avoid data divergence.
Distinguish “node missing from Running Nodes” (node down) from a populated “Network Partitions” section (split-brain) in cluster_status output.
Explain why an app configured with only one node’s address in spring.rabbitmq.addresses can show a full outage even when the cluster itself is healthy.