Memory/Disk Alarm & Blocked Publishers: RabbitMQ Incident Guide

Prerequisite: Queue Depth & Consumer Lag

1. Symptom

One or more of the following appear, often together:

PagerDuty/Slack alert: NodeMemoryUsage or a disk-space CloudWatch alarm firing on a specific rmq-* node.
Management UI Overview page shows a red/orange banner: “Memory alarm in effect” or “Disk alarm in effect” for a node.
App teams suddenly report that publishing hangs: rabbitTemplate.convertAndSend() calls that used to return instantly now block for seconds or time out, or publisher-confirm callbacks stop firing.
Nothing looks wrong on the consumer side: messages already in queues are still being delivered and acked normally.

The key tell that distinguishes this from Playbook 01: in a queue-depth/consumer-lag incident, consumers are the problem (or absent) and publishing still works fine. In a memory/disk alarm, publishing itself stops working, cluster-wide, while consumers keep humming along on whatever is already queued. If publishers are blocked but consumers are fine, come here first.

2. Likely Causes

Why this happens at all: RabbitMQ’s self-protection mechanism

RabbitMQ has two built-in safety valves, both covered at a high level in AWS Architecture and Tooling Walkthrough:

Watermark	What it protects against	Config
`vm_memory_high_watermark`	Broker process getting OOM-killed by the OS or crashing the Erlang VM	Fraction (default `0.4`) of total system RAM, or an absolute byte value
`disk_free_limit`	Broker running out of disk mid-write and corrupting the message store	Absolute size (e.g., `2GB`) or relative to RAM

This is deliberate, not a bug. When either watermark is breached on any single node, that node raises a resource alarm, and the broker responds by blocking all publishers across the entire cluster, not just those connected to the affected node. RabbitMQ would rather refuse new work everywhere than risk crashing a node or corrupting data on disk. Existing consumers are not blocked: messages already sitting in queues can still be delivered and acked, because that only frees memory and disk rather than consuming more.

This cluster-wide blocking behavior is the single most important thing to internalize about this alert type: one struggling node can take down publishing for every producer in the org, even ones publishing to queues that live entirely on healthy nodes.
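
To make the memory watermark concrete, here is a minimal sketch of the arithmetic behind a relative `vm_memory_high_watermark`. The helper name `watermark_gb` is hypothetical, and the 7.25 GB RAM figure is an assumption chosen only because 0.4 × 7.25 matches the 2.9 gb value in the status excerpt in section 6.

```shell
# Sketch: compute the absolute memory threshold implied by a relative watermark.
# watermark_gb is a hypothetical helper; arguments are <fraction> <total RAM in GB>.
watermark_gb() {
  awk -v fraction="$1" -v total_gb="$2" 'BEGIN { printf "%.1f\n", fraction * total_gb }'
}

# Assumed node with 7.25 GB of RAM and the 0.4 fraction
watermark_gb 0.4 7.25   # prints 2.9
```

The same arithmetic shows why a fraction that is reasonable on a large instance can leave very little absolute headroom on a small one.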

Broker-side causes

Cause	Why it triggers the alarm
A queue with a huge backlog, especially a classic queue	Classic queues keep message content in memory much more aggressively than quorum queues; a large backlog on a classic queue can exhaust RAM fast. Quorum queues page segments to disk more gracefully under memory pressure, but aren’t immune: a big enough backlog still pressures disk.
Large numbers of unacked/undelivered messages	Every unacked message is held in memory by the broker until it’s acked or requeued; a stuck consumer with a large prefetch amplifies this.
Message store growth on disk	Persistent messages for durable queues accumulate on disk; a sustained publish rate with no draining consumer fills the EBS volume.
Log file growth	RabbitMQ and OS logs can grow unexpectedly large (verbose logging, crash loops writing repeatedly) and eat into the same volume as the message store.
Undersized EBS volume for sustained throughput	A volume sized for average load runs out of headroom during a sustained traffic spike or a backlog event.
`vm_memory_high_watermark` set too conservatively for the instance size	If it was set assuming a smaller instance, or left at RabbitMQ’s default fraction on a memory-constrained instance type, the alarm trips well before the node is genuinely at risk.

App-side causes / symptoms

Symptom	What’s happening
`rabbitTemplate.convertAndSend()` calls hang or time out	While the alarm is in effect, the broker sends a `connection.blocked` notification (a RabbitMQ protocol extension) to publishing connections and stops reading from them. The connection is intentionally not accepting new publishes until the alarm clears.
Publisher confirms stop arriving	If `spring.rabbitmq.publisher-confirm-type` is `correlated`/`simple`, the broker isn’t sending `basic.ack` for new publishes because it isn’t accepting them. Your app’s confirm callbacks just go quiet, which can look like a silent hang rather than an obvious error.
Producer thread pool exhaustion	If many application threads each call a blocking publish and all of them stall waiting on the blocked connection, the thread pool backing those calls can fill up. This then cascades into unrelated timeouts elsewhere in the same app (e.g., HTTP request threads all blocked on a downstream publish).
Nothing in the app logs explains it	Spring AMQP doesn’t always surface “connection blocked” as a loud exception; it can just look like elevated latency until you know to check for the blocked notification specifically (see Diagnostic Steps).

3. Diagnostic Steps

Work top-down, cheapest checks first:

Check the Management UI Overview page for the alarm banner. It names the node and the resource (memory or disk_space). This is the fastest confirmation that you’re dealing with this playbook and not something else.
Run rabbitmq-diagnostics check_local_alarms (or check_alarms for a cluster-wide view) via SSM Session Manager on the suspect node(s). A non-empty result confirms which node and which resource.
Check rabbitmq-diagnostics status on the same node for the actual memory/disk numbers against the configured watermark. This tells you how close you are, not just that you tripped it.
Identify which queue(s) are consuming the most memory. In the Management UI Queues tab, sort by message count and check queue type (classic vs. quorum). A classic queue with a huge Ready/Unacked count is the most common single culprit.
Check CloudWatch for NodeMemoryUsage (trend, not just current value) and the EBS volume’s free-space metric. Is this a sudden spike or a slow leak over days?
Check EBS volume free space directly on the node (via SSM) if CloudWatch lags: run df -h on the data directory’s mount point, and look at what’s actually consuming space (message store vs. logs).
Check Spring producer application logs for signs of blocked publishing. Spring AMQP publishes ConnectionBlockedEvent and ConnectionUnblockedEvent application events, and a com.rabbitmq.client.BlockedListener can be registered on the connection. If neither is wired up, watch for elevated rabbitTemplate latency/timeouts in metrics. Either way, confirm the timing lines up with the alarm.

4. Safe Remediations

Situation	Action
Disk alarm caused by log growth	Rotate/compress/ship old logs off the volume, then confirm the alarm clears once free space rises above `disk_free_limit`.
Memory/disk pressure caused by a backlog on a healthy consumer setup that’s merely behind	Fix or scale the consumers per Playbook 01: draining the backlog is what actually resolves the root cause, since it frees both the in-memory queue state and (once acked) disk-persisted messages.
Watermark clearly misconfigured for the instance size (e.g., default fraction on a small instance, alarm trips constantly under normal load)	Flag it, but treat raising the watermark as a config change requiring the same care as any broker config change: see caution below.

⚠️ Caution (freeing disk space): Only delete files you are certain are logs, not the message store. RabbitMQ’s data directory contains queue index and message store files (mnesia/quorum data depending on version and queue type). Deleting or truncating these causes data loss or an unrecoverable node, not just a cleared alarm. If you’re not 100% sure a file is a rotatable log, don’t touch it; escalate instead.

⚠️ Caution (raising vm_memory_high_watermark or lowering disk_free_limit): This is a broker configuration change, not a support-tier action. Raising the memory watermark reduces the safety margin before an actual OOM crash; lowering disk_free_limit reduces the margin before an actual out-of-disk corruption. Treat this as a last resort, and only after platform/on-call engineering has confirmed the instance has genuine headroom (real free RAM/disk, not just “the alarm is annoying”).

⚠️ Caution: do NOT restart the node as a first response. Restarting doesn’t fix underlying disk or memory pressure; the alarm will very likely just re-trip once the node rejoins and picks the workload back up. Worse, forcing a node restart during a memory/disk incident risks a cluster partition or a slow, disruptive rejoin (leader re-elections for every quorum queue with a replica on that node); see Playbook 03. Only restart if explicitly directed by on-call engineering as part of a broader remediation.

5. Escalation Trigger

Escalate to platform/on-call engineering when:

The disk alarm is not a log/rotation issue, i.e., the message store itself has grown and needs an EBS volume resize, which support tier cannot do unilaterally.
Draining the backlog (Playbook 01 remediations) does not clear the alarm within ~15-20 minutes, or the backlog is too large to drain fast enough to matter before downstream SLAs are breached.
You determine the watermark configuration itself is wrong for the instance size and needs changing.
The alarm has already triggered a node restart or you suspect a partition risk (hand off to Playbook 03 territory).
You’re not sure whether a file on the data volume is safe to delete. Never guess here.
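
The "does not clear within ~15-20 minutes" trigger above can be turned into a timed check rather than a judgment call. The following is a minimal sketch, not an established tool: `wait_until_ok` is a hypothetical helper that polls any command until it exits 0 or a deadline passes. It assumes the alarm check exits non-zero while an alarm is active, consistent with the "exit code 0" healthy case described in section 6.

```shell
# Sketch: poll a health-check command until it succeeds or a deadline passes.
# Usage: wait_until_ok <timeout seconds> <interval seconds> <command...>
# Returns 0 as soon as the command succeeds, 1 if it is still failing at the deadline.
wait_until_ok() {
  wuo_timeout=$1
  wuo_interval=$2
  shift 2
  wuo_waited=0
  until "$@" >/dev/null 2>&1; do
    if [ "$wuo_waited" -ge "$wuo_timeout" ]; then
      return 1   # still failing at the deadline: time to escalate
    fi
    sleep "$wuo_interval"
    wuo_waited=$((wuo_waited + wuo_interval))
  done
  return 0
}
```

For example, `wait_until_ok 1200 30 rabbitmq-diagnostics check_alarms || echo "Alarm still active after 20 minutes, escalate"` would check every 30 seconds for 20 minutes. The 1200 and 30 values are illustrative.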

6. Relevant Commands/Queries

# Confirm which node has an active alarm, and which resource
rabbitmq-diagnostics check_local_alarms

Healthy: empty output, exit code 0. Alerting:

Error:
resource_limit_alarm: memory

Error:
resource_limit_alarm: disk
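
For scripting this check, here is a small sketch that reduces the output above to a single word. It assumes the alerting output contains `resource_limit_alarm: <resource>` lines exactly as shown in this guide; the helper name `alarm_kinds` is hypothetical, and the real output may differ between RabbitMQ versions, so verify against your nodes before relying on it.

```shell
# Sketch: summarize alarm output read from stdin as
# "none", "memory", "disk", or "memory,disk".
# Assumes lines of the form "resource_limit_alarm: <resource>" as in the sample above.
alarm_kinds() {
  awk -F': *' '
    $1 == "resource_limit_alarm" { seen[$2] = 1 }
    END {
      out = ""
      if (seen["memory"]) out = "memory"
      if (seen["disk"])   out = (out == "" ? "disk" : out ",disk")
      print (out == "" ? "none" : out)
    }'
}

# Example using the sample output from this guide:
printf 'Error:\nresource_limit_alarm: memory\n' | alarm_kinds   # prints memory
```

On a node this would be used as `rabbitmq-diagnostics check_local_alarms 2>&1 | alarm_kinds`.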

# Cluster-wide alarm view (run from any node)
rabbitmq-diagnostics check_alarms

# Detailed memory/disk numbers vs. configured watermarks
rabbitmq-diagnostics status

Healthy example (excerpt):

Memory

Total memory used: 0.61 gb
Calculation strategy: rss
Memory high watermark setting: 0.4 of available memory, computed to: 2.9 gb

Disk

Free disk space: 45.2 gb
Free disk space low watermark: 2.0 gb

Alerting example (excerpt):

Memory

Total memory used: 2.87 gb
Calculation strategy: rss
Memory high watermark setting: 0.4 of available memory, computed to: 2.9 gb
  *** MEMORY ALARM RAISED ***

Disk

Free disk space: 1.4 gb
Free disk space low watermark: 2.0 gb
  *** DISK ALARM RAISED ***
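
Reading "how close are we" off these excerpts is simple arithmetic, sketched below. The helper name `status_headroom` is hypothetical, and the parsing assumes the exact line wording and gb units shown in the excerpts above, which may not hold for every RabbitMQ version.

```shell
# Sketch: read a `rabbitmq-diagnostics status` excerpt on stdin and report
# memory use as a percentage of the watermark, and free disk minus the disk limit.
# Assumes the line formats shown in this guide, with all values in gb.
status_headroom() {
  awk '
    /Total memory used:/             { used  = $(NF-1) }
    /Memory high watermark setting:/ { mark  = $(NF-1) }
    /Free disk space:/               { free  = $(NF-1) }
    /Free disk space low watermark:/ { limit = $(NF-1) }
    END {
      if (mark > 0) printf "memory: %.0f%% of watermark\n", 100 * used / mark
      printf "disk margin: %.1f gb\n", free - limit
    }'
}

# Example using the alerting excerpt from this guide:
status_headroom <<'EOF'
Total memory used: 2.87 gb
Memory high watermark setting: 0.4 of available memory, computed to: 2.9 gb
Free disk space: 1.4 gb
Free disk space low watermark: 2.0 gb
EOF
# prints:
#   memory: 99% of watermark
#   disk margin: -0.6 gb
```

A negative disk margin means free space is already below `disk_free_limit`.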

# Which queues are heaviest in memory (look at type + message counts together)
rabbitmqctl list_queues name type messages_ready messages_unacknowledged memory
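
On a broker with many queues, the raw listing is easier to read once ranked. A minimal sketch, assuming the tab-separated column order of the command above (memory in column 5) and no header row; `top_queues_by_memory` is a hypothetical helper name.

```shell
# Sketch: keep the N rows with the largest value in column 5 (queue memory, in bytes).
# Expects tab-separated rows on stdin in the column order used above, without a header row.
# N defaults to 5 when no argument is given.
top_queues_by_memory() {
  sort -t "$(printf '\t')" -k5,5nr | head -n "${1:-5}"
}
```

Usage would look like `rabbitmqctl -q list_queues --no-table-headers name type messages_ready messages_unacknowledged memory | top_queues_by_memory 10`. Check that your rabbitmqctl version supports `--no-table-headers` before relying on it.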

# Check disk usage directly on the node (via SSM Session Manager)
df -h
du -sh /var/lib/rabbitmq/mnesia/*
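
To separate message store growth from log growth, it helps to rank what is under each directory. A minimal read-only sketch; `largest_entries` is a hypothetical helper, and the log path in the usage note is an assumption about where this cluster writes RabbitMQ logs.

```shell
# Sketch: list the N largest entries directly under a directory, sizes in KB.
# Read-only: it never deletes or modifies anything. N defaults to 10.
largest_entries() {
  du -sk "$1"/* 2>/dev/null | sort -rn | head -n "${2:-10}"
}
```

For example, compare `largest_entries /var/lib/rabbitmq/mnesia` against `largest_entries /var/log/rabbitmq` (assumed log location). This only tells you where the space went; the caution in section 4 about what is safe to delete still applies.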

CloudWatch metrics to pull up (from Tooling Walkthrough): NodeMemoryUsage and DiskFreeLimitAlarm, plus the underlying EBS volume’s free-space metric. Look at the trend over the last few hours, not just the current value.

7. Mini Practical

Reproduce a memory alarm locally and watch a Spring Boot producer get blocked, using an artificially low watermark so you don’t need gigabytes of messages to trip it.

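Step 1: Run RabbitMQ, then set a very low memory high watermark at runtime:

docker run -d --name rabbitmq-alarm-lab \
  -p 5672:5672 -p 15672:15672 \
  rabbitmq:3.13-management
# Wait for the node to finish booting before changing runtime settings
docker exec rabbitmq-alarm-lab rabbitmqctl await_startup
# Set the watermark far below the broker's idle memory use
docker exec rabbitmq-alarm-lab rabbitmqctl set_vm_memory_high_watermark absolute 1MiB

This puts the watermark so far below the broker’s idle memory use that the node raises a memory alarm almost immediately, with no need to actually pump gigabytes of traffic through it. The watermark is set at runtime because recent official images (3.9 and later) reject the deprecated RABBITMQ_VM_MEMORY_HIGH_WATERMARK environment variable at startup.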

Step 2: Confirm the alarm is active:

docker exec -it rabbitmq-alarm-lab rabbitmq-diagnostics check_local_alarms

You should see resource_limit_alarm: memory. Also check the Management UI at localhost:15672; the Overview page banner should show the memory alarm.

Step 3: Try publishing from your Spring Boot producer (same orders.exchange / orders.created.queue setup, pointed at this container):

curl -X POST localhost:8080/orders -H "Content-Type: application/json" -d '{"id":999}'

Observe: the call hangs or times out instead of returning immediately with “Order published.” If you register a BlockedListener (handleBlocked/handleUnblocked) or listen for Spring AMQP’s ConnectionBlockedEvent and ConnectionUnblockedEvent, you’ll see the blocked notification fire when you try to publish.

Step 4: Fix it and watch recovery:

# Raise the watermark back to a sane fraction; the alarm clears once usage is below it
docker exec -it rabbitmq-alarm-lab rabbitmqctl set_vm_memory_high_watermark 0.6

Re-run check_local_alarms; it should now return clean. Re-run the curl publish; it should return “Order published” immediately again, and (if you wired up a listener) you should see the unblocked notification fire.

Step 5: Clean up:

docker rm -f rabbitmq-alarm-lab

✅ Checkpoint

You should now be able to:

Explain why a memory or disk alarm on one node blocks publishers cluster-wide, and why that’s intentional.
Run rabbitmq-diagnostics check_local_alarms and rabbitmq-diagnostics status, and read watermark numbers from the output.
Explain why restarting a node is not a safe first response to a memory/disk alarm.