Node Down & Partition
Tell node failure from network partition, read cluster_status, and spot single-address app misconfig.
Prerequisites: AWS Architecture, Tooling Walkthrough
1. Symptom
- CloudWatch alarm or PagerDuty page: “RabbitMQ node unreachable” or “Cluster node count < 3.”
- Management UI Overview shows fewer than 3 nodes under the node list, or a node shown in red/grey instead of green.
- App-side: Spring Boot logs suddenly full of AmqpConnectException / connection retry messages, sometimes from only some app instances, not all.
- Sometimes there’s no alarm at all yet, just a Slack message from another team: “is RabbitMQ down? Our app can’t connect.”
First, understand the concept behind this alert: what a network partition actually is.
A cluster of 3 nodes normally has all nodes able to see and talk to each other (over the Erlang distribution ports from AWS Architecture: 4369, 25672, 35672-35682). A network partition (“split-brain”) happens when nodes are all still running, but some of them can no longer see each other over the network, e.g., rmq-1 can’t reach rmq-2 and rmq-3, but rmq-2 and rmq-3 can still see each other. From rmq-1’s point of view, it looks like the other two nodes died. From rmq-2/rmq-3’s point of view, it looks like rmq-1 died. Both sides are alive, they just disagree about who’s in the cluster.
This is dangerous for a stateful system: if both sides kept accepting writes independently, you’d get two diverging copies of the same queue’s data, with no automatic way to reconcile them later. RabbitMQ’s cluster_partition_handling setting decides what happens when this is detected:
| Mode | Behavior | Used here? |
|---|---|---|
| ignore | Do nothing: both sides keep running independently. Risk of data divergence when the partition heals. | No: never recommended for quorum queues |
| autoheal | Let the partition happen, then automatically pick a winning side and restart nodes on the losing side, discarding their state since the split. | No: can silently lose data |
| pause_minority | The side of the partition that is not part of the majority immediately pauses itself (stops serving clients) until it can rejoin the majority. | Yes: our assumed cluster config |
With pause_minority on a 3-node cluster: if a partition splits the cluster 1-vs-2, the lone node pauses itself rather than risk serving stale or divergent data, while the 2-node majority side keeps running normally. This is deliberately conservative: it sacrifices availability on the minority side to guarantee consistency, which is the same trade-off quorum queues make internally via Raft (see AWS Architecture). A paused node looks “down” from the outside even though the process is technically still alive; that’s expected, not a separate bug.
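The majority rule behind pause_minority is simple arithmetic, and a small sketch may make it easier to reason about other splits. The helper name partition_side_state below is hypothetical (it is not a RabbitMQ command); it only encodes the assumption that a side keeps running when it holds strictly more than half of the cluster's members.

```shell
# Sketch only: partition_side_state is a hypothetical helper, not part of RabbitMQ.
# Usage: partition_side_state SIDE_SIZE CLUSTER_SIZE
# Prints "running" if the side holds a strict majority, otherwise "paused".
partition_side_state() {
  if [ $(( $1 * 2 )) -gt "$2" ]; then
    echo "running"
  else
    echo "paused"
  fi
}

partition_side_state 2 3   # the two-node side of a 1-vs-2 split
partition_side_state 1 3   # the lone node
```

The first call prints running and the second prints paused, matching the 1-vs-2 case described above. Note that an even split (for example 2-vs-2 in a 4-node cluster) leaves no side with a strict majority, which is one reason odd cluster sizes are preferred.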
2. Likely Causes
Broker-side
| Cause | Notes |
|---|---|
| Actual EC2 instance failure/termination | Hardware fault, spot interruption, or an ASG health-check terminating the instance. The node is genuinely gone, not just partitioned. |
| Network blip between AZs | Transient AZ-to-AZ latency/packet loss makes nodes stop seeing each other over the clustering ports even though all instances are still running: a true partition, not a node failure. |
| Security group misconfiguration blocking clustering ports | Someone edits the broker SG and breaks the self-referencing rule for 4369/25672/35672-35682 (see the SG table in AWS Architecture). Looks exactly like a network partition, but the actual root cause is a config change, not a network fault. |
| ASG replaces the “unhealthy” node mid-incident | If an Auto Scaling Group’s health check flags a paused/partitioned node as unhealthy and terminates + replaces it before the underlying network issue resolves, the new instance joins as a fresh node with no data, not a rejoined node: this turns a transient blip into a real, permanent membership problem. |
| Erlang process/resource exhaustion on one node | File descriptor limits, Erlang process limits, or memory pressure on one node can make it stop responding to cluster heartbeats, which the other nodes interpret as a partition or node-down event even though the OS process hasn’t crashed. |
Application-side (Spring Boot)
| Cause | Notes |
|---|---|
| App only configured with one node’s address | If spring.rabbitmq.addresses (or host/port) lists only a single node instead of all three, or there’s no load balancer/VIP in front of the cluster, losing that one node looks like total broker unavailability to the app: even though the other two nodes are healthy and quorum queues are serving fine. This is the single most common app-side misconfiguration behind this alert. |
| CachingConnectionFactory failover working as designed, but slowly | If spring.rabbitmq.addresses is configured with all cluster nodes, Spring AMQP’s connection factory will automatically attempt the next address in the list on connection failure. This is correct behavior: but you’ll still see a burst of retry/reconnect log noise during the failover window, which can look alarming even though the app recovers on its own within seconds. |
| Connection retry/backoff exhausting before recovery | If spring.rabbitmq.listener.simple.retry / connection retry settings have a low max-attempts and the partition/failover takes longer than the backoff window, the app may give up and surface errors to callers instead of quietly retrying through the blip. |
What the app logs look like during this:
o.s.a.r.c.CachingConnectionFactory : Attempting to connect to: [rmq-1:5672]
o.s.amqp.AmqpConnectException: java.net.ConnectException: Connection refused
at org.springframework.amqp.rabbit.support.RabbitExceptionTranslator...
o.s.a.r.l.SimpleMessageListenerContainer : Consumer raised exception, processing can restart if the connection factory supports it
If spring.rabbitmq.addresses lists all three nodes, you’ll instead see a quick sequence of connect attempts across nodes followed by a successful connection, much shorter-lived and self-resolving.
3. Diagnostic Steps
Cheapest, fastest checks first:
- Check the Management UI Overview node list (if reachable at all): are all 3 nodes listed, and are they green? If the UI itself is unreachable, that’s a stronger signal of a broader outage; move to the AWS console/CLI checks below.
- Run rabbitmq-diagnostics cluster_status (via SSM Session Manager); this is the definitive source of truth. Look at two things:
  - Which nodes appear under Running Nodes vs. missing entirely.
  - Whether a Partitions section is populated (a true network partition) vs. a node just being absent (a node-down event). These look different and point to different root causes.
- Run rabbitmq-diagnostics check_running on a surviving node; this confirms the local node’s own services are up, ruling out “the whole cluster is down” in favor of “one specific node has a problem.”
- Check CloudWatch/EC2 console for the underlying instance’s health check status: is the EC2 instance itself stopped/terminated/failing status checks? This tells you whether you’re dealing with real infrastructure failure vs. a network-only partition where the instance is still running fine.
- Check whether affected quorum queues still have 2-of-3 replicas alive. In the Management UI, click into a queue and check its member/leader status, or reason from the AWS Architecture module: if only 1 node is down/partitioned, quorum queues keep a majority (2 of 3) and continue serving normally with no message loss, just a brief leader re-election for queues whose leader happened to be on the affected node.
- Check the Spring app’s connection config: look at spring.rabbitmq.addresses in the affected app’s config. Does it list all 3 cluster node addresses (or point at a load balancer/VIP in front of the cluster), or just one node’s hostname? This tells you immediately whether the app’s outage is a real cluster problem or a single-point-of-failure config issue on the app side.
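The last check can be done quickly from a shell if you have the app's config file in hand. This is a minimal sketch under stated assumptions: count_rabbit_addresses is a hypothetical helper, and it assumes a flat application.properties file using key=value syntax with a single spring.rabbitmq.addresses line (YAML config or externalized config would need a different approach).

```shell
# Sketch only: count_rabbit_addresses is a hypothetical helper.
# Usage: count_rabbit_addresses PATH_TO_PROPERTIES_FILE
# Prints how many comma-separated entries spring.rabbitmq.addresses contains,
# or 0 if the property is not set in that file.
count_rabbit_addresses() {
  value=$(grep -E '^[[:space:]]*spring\.rabbitmq\.addresses[[:space:]]*=' "$1" \
    | head -n 1 | cut -d= -f2-)
  if [ -z "$value" ]; then
    echo 0
    return
  fi
  # One entry per line, then count the non-empty lines.
  printf '%s\n' "$value" | tr ',' '\n' | grep -c '[^[:space:]]'
}
```

A result of 1 (or 0 with only host/port set) suggests the single-address misconfiguration described in Section 2, unless that one address is a load balancer or VIP; a result of 3 suggests the app should be able to fail over on its own.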
4. Safe Remediations
| Situation | Action |
|---|---|
| Single transient node loss, cluster_status shows 2 nodes running and quorum intact | Usually no action needed beyond monitoring: confirm the node rejoins cleanly once the underlying EC2/network issue clears, and watch cluster_status return to all 3 nodes. |
| App only had one node’s address configured | This is a config fix, not a live remediation: note it as a follow-up action item for the app team (update spring.rabbitmq.addresses to list all cluster nodes, or put a load balancer/VIP in front of the cluster). Don’t attempt to change app config live during an active incident unless directed to. |
| Node rejoins but you want to confirm health | Re-run rabbitmq-diagnostics cluster_status and check_running on the rejoined node; confirm the Management UI Overview shows all 3 nodes green again. |
⚠️ CAUTION: Do not manually force-restart a node or run rabbitmqctl forget_cluster_node without engineering sign-off. forget_cluster_node permanently removes a node from cluster membership; if that node later comes back online, it will refuse to rejoin (it still thinks it’s part of the old cluster) and requires careful manual re-provisioning to bring back in. This is a one-way door during an active incident; treat it as an escalation-only action, not a self-service fix.
5. Escalation Trigger
Escalate immediately (page on-call engineering) if:
- 2 or more nodes are down/partitioned simultaneously: quorum is lost, and affected quorum queues stop accepting new writes cluster-wide. This is a full incident, not a “wait and see.”
- A node does not automatically rejoin after the underlying EC2/network issue is confirmed resolved (e.g., cluster_status still shows it missing 10+ minutes after the instance passes EC2 health checks).
- An ASG has already replaced the affected node with a brand-new instance before it could rejoin cleanly; this needs engineering to correctly add the new node to the cluster rather than assuming it will “just work.”
- Anything that looks like it requires forget_cluster_node or other manual partition/membership intervention; these are destructive, one-way operations that need sign-off, not a support-tier judgment call.
6. Relevant Commands/Queries
Run via SSM Session Manager, not direct SSH (per our access model in Environment Setup).
# Cluster membership and partition status: your primary diagnostic
rabbitmq-diagnostics cluster_status
# Confirm the local node's own services are healthy
rabbitmq-diagnostics check_running
Healthy output (all 3 nodes present, no partitions):
Basics
Cluster name: rabbit@rmq-1
Disk Nodes
rabbit@rmq-1
rabbit@rmq-2
rabbit@rmq-3
Running Nodes
rabbit@rmq-1
rabbit@rmq-2
rabbit@rmq-3
Versions
...
Alarms
(none)
Network Partitions
(none)
Alerting output, node down (rmq-3 missing entirely from Running Nodes):
Running Nodes
rabbit@rmq-1
rabbit@rmq-2
Alerting output, active partition (all nodes technically “known,” but a Partitions section is populated):
Network Partitions
Node rabbit@rmq-1 cannot communicate with rabbit@rmq-3
The distinction matters: “missing from Running Nodes” usually means a real node failure; a populated “Network Partitions” section with the node still listed means the process is alive but split off, a true partition, and pause_minority behavior may already be in effect on the minority side.
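The node-down vs. partition distinction can be sketched as a small text filter. This is an illustration, not a supported tool: classify_cluster_status is a hypothetical helper, and it assumes the plain-text section layout shown in the sample outputs above (a "Running Nodes" heading followed by rabbit@... lines, and a "Network Partitions" heading followed by either "(none)" or "cannot communicate with" lines). Real output may differ between RabbitMQ versions, so treat the result as a hint and read the raw output too.

```shell
# Sketch only: classify_cluster_status is a hypothetical helper.
# Usage: <cluster_status text on stdin> | classify_cluster_status [EXPECTED_NODES]
# Prints one of: healthy, node-down, partition
classify_cluster_status() {
  awk -v expected="${1:-3}" '
    { sub(/^[[:space:]]+/, ""); sub(/[[:space:]]+$/, "") }   # trim each line
    $0 == "" { next }
    $0 == "Running Nodes"      { section = "running"; next }
    $0 == "Network Partitions" { section = "partitions"; next }
    section == "running" && /^rabbit@/ { running++; next }
    section == "partitions" && /cannot communicate with/ { partitioned = 1; next }
    section == "partitions" && $0 == "(none)" { next }
    { section = "" }   # any other line is treated as the start of another section
    END {
      if (partitioned) print "partition"
      else if (running + 0 < expected + 0) print "node-down"
      else print "healthy"
    }'
}
```

On a broker host this could be used as rabbitmq-diagnostics cluster_status | classify_cluster_status 3. A reported partition takes precedence over a low running-node count, since a populated Network Partitions section is the more specific signal.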
7. Mini Practical
Spin up a local 3-node cluster, kill a node, and watch quorum queue behavior with your own eyes.
Step 1: docker-compose.yml:
version: "3.8"
services:
  rmq-1:
    image: rabbitmq:3.13-management
    container_name: rmq-1  # fixed name so the docker exec rmq-1 commands in later steps resolve
    hostname: rmq-1
    environment:
      RABBITMQ_ERLANG_COOKIE: "shared-cookie-value"
    ports:
      - "15672:15672"
    networks:
      - rmq-net
  rmq-2:
    image: rabbitmq:3.13-management
    container_name: rmq-2
    hostname: rmq-2
    environment:
      RABBITMQ_ERLANG_COOKIE: "shared-cookie-value"
    networks:
      - rmq-net
    depends_on:
      - rmq-1
  rmq-3:
    image: rabbitmq:3.13-management
    container_name: rmq-3
    hostname: rmq-3
    environment:
      RABBITMQ_ERLANG_COOKIE: "shared-cookie-value"
    networks:
      - rmq-net
    depends_on:
      - rmq-1
networks:
  rmq-net:
    driver: bridge
Step 2: Start it and form the cluster:
docker compose up -d
# Join rmq-2 and rmq-3 to rmq-1's cluster
docker exec rmq-2 bash -c "rabbitmqctl stop_app && rabbitmqctl join_cluster rabbit@rmq-1 && rabbitmqctl start_app"
docker exec rmq-3 bash -c "rabbitmqctl stop_app && rabbitmqctl join_cluster rabbit@rmq-1 && rabbitmqctl start_app"
# Confirm all 3 nodes see each other
docker exec rmq-1 rabbitmq-diagnostics cluster_status
You should see all three rabbit@rmq-* nodes under Running Nodes.
Step 3: Declare a quorum queue (via Management UI at localhost:15672, login guest/guest, or CLI):
docker exec rmq-1 rabbitmqadmin declare queue name=orders.qq durable=true arguments='{"x-queue-type":"quorum"}'
Step 4: Kill one node and observe:
docker stop rmq-3
docker exec rmq-1 rabbitmq-diagnostics cluster_status
You should see rmq-3 missing from Running Nodes, but rmq-1 and rmq-2 still show as running. Check the Management UI on rmq-1 (localhost:15672): the cluster overview shows 2 of 3 nodes, but orders.qq is still there and still accepting publishes, because 2-of-3 replicas is still a majority.
Step 5: Bring it back and confirm clean rejoin:
docker start rmq-3
# Give it a few seconds to rejoin, then check again
docker exec rmq-1 rabbitmq-diagnostics cluster_status
rmq-3 should reappear under Running Nodes with no manual intervention required; this is the “transient node loss, no action needed beyond monitoring” case from Section 4. Compare this against how much more serious it would look if you stopped two nodes (rmq-2 and rmq-3) instead of one. Try it, and watch orders.qq stop accepting new messages once quorum is lost.
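Instead of re-running cluster_status by hand while waiting for the rejoin, you could poll for it. The following is a rough sketch: wait_for_running_nodes is a hypothetical helper, and it assumes the same "Running Nodes" text layout shown in Section 6. It takes the status command as arguments so the same function works locally with docker exec or on a broker host.

```shell
# Sketch only: wait_for_running_nodes is a hypothetical helper.
# Usage: wait_for_running_nodes EXPECTED ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND up to ATTEMPTS times and returns 0 once at least EXPECTED
# rabbit@ entries appear under "Running Nodes"; returns 1 if that never happens.
wait_for_running_nodes() {
  expected=$1; attempts=$2; delay=$3
  shift 3
  i=0
  count=0
  while [ "$i" -lt "$attempts" ]; do
    count=$("$@" 2>/dev/null | awk '
      { sub(/^[[:space:]]+/, ""); sub(/[[:space:]]+$/, "") }
      $0 == "Running Nodes" { in_running = 1; next }
      in_running && /^rabbit@/ { n++; next }
      $0 != "" { in_running = 0 }      # any other non-blank line ends the section
      END { print n + 0 }')
    if [ "$count" -ge "$expected" ]; then
      echo "all $expected nodes running"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "only $count of $expected nodes running after $attempts checks"
  return 1
}
```

For this practical, a call such as wait_for_running_nodes 3 12 5 docker exec rmq-1 rabbitmq-diagnostics cluster_status would check every 5 seconds for up to a minute. The attempt count and delay are arbitrary illustration values, not recommended thresholds.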
✅ Checkpoint
You should now be able to:
- Explain what a network partition is and why pause_minority deliberately sacrifices availability on the minority side to avoid data divergence.
- Distinguish “node missing from Running Nodes” (node down) from a populated “Network Partitions” section (split-brain) in cluster_status output.
- Explain why an app configured with only one node’s address in spring.rabbitmq.addresses can show a full outage even when the cluster itself is healthy.