Memory/Disk Alarm
Recognize broker resource alarms, blocked publishers, and cluster-wide publish failures.
Prerequisite: Queue Depth & Consumer Lag
1. Symptom
One or more of these shows up, usually all at once:
- PagerDuty/Slack alert: `NodeMemoryUsage` or a disk-space CloudWatch alarm firing on a specific `rmq-*` node.
- Management UI Overview page shows a red/orange banner: “Memory alarm in effect” or “Disk alarm in effect” for a node.
- App teams suddenly report that publishing hangs: `rabbitTemplate.convertAndSend()` calls that used to return instantly now block for seconds or time out, or publisher-confirm callbacks stop firing.
- Nothing looks wrong on the consumer side: messages already in queues are still being delivered and acked normally.
The key tell that distinguishes this from Playbook 01: in a queue-depth/consumer-lag incident, consumers are the problem (or absent) and publishing still works fine. In a memory/disk alarm, publishing itself stops working, cluster-wide, while consumers keep humming along on whatever is already queued. If publishers are blocked but consumers are fine, come here first.
2. Likely Causes
Why this happens at all: RabbitMQ’s self-protection mechanism
RabbitMQ has two built-in safety valves, both covered at a high level in AWS Architecture and Tooling Walkthrough:
| Watermark | What it protects against | Config |
|---|---|---|
| `vm_memory_high_watermark` | Broker process getting OOM-killed by the OS or crashing the Erlang VM | Fraction (default 0.4) of total system RAM, or an absolute byte value |
| `disk_free_limit` | Broker running out of disk mid-write and corrupting the message store | Absolute size (e.g., 2GB) or relative to RAM |
This is deliberate, not a bug. When either watermark is breached on any single node, that node raises a resource alarm, and the broker’s response is to block all publishers across the entire cluster, not just on the affected node. RabbitMQ would rather refuse new work everywhere than risk crashing a node or corrupting data on disk. Existing consumers are not blocked: messages already sitting in queues can still be delivered and acked, because that only frees memory and disk; it doesn’t consume more.
This cluster-wide blocking behavior is the single most important thing to internalize about this alert type: one struggling node can take down publishing for every producer in the org, even ones publishing to queues that live entirely on healthy nodes.
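The rule above can be sketched as a toy model. This is not broker code: the input format (one node per line, values in MiB) and the function name `publishers_state` are made up for illustration. It only shows that a single node crossing either watermark is enough to block publishing for the whole cluster.

```bash
# Toy model of the alarm rule, NOT how the broker is implemented.
# Input on stdin, one node per line, values in MiB (hypothetical format):
#   <node> <mem_used> <mem_watermark> <disk_free> <disk_free_limit>
publishers_state() {
  awk '
    # Memory alarm: usage has risen above the high watermark.
    $2 > $3 { alarmed = alarmed " " $1 ":memory" }
    # Disk alarm: free space has dropped below the limit.
    $4 < $5 { alarmed = alarmed " " $1 ":disk" }
    END {
      # Any single alarmed node blocks publishers everywhere.
      if (alarmed != "") print "BLOCKED" alarmed
      else print "OK"
    }
  '
}
```

Feeding it two nodes where only `rmq-2` is over its memory watermark prints `BLOCKED rmq-2:memory`, even though `rmq-1` is healthy.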
Broker-side causes
| Cause | Why it triggers the alarm |
|---|---|
| A queue with a huge backlog, especially a classic queue | Classic queues keep message content in memory much more aggressively than quorum queues; a large backlog on a classic queue can exhaust RAM fast. Quorum queues page segments to disk more gracefully under memory pressure, but aren’t immune: a big enough backlog still pressures disk. |
| Large numbers of unacked/undelivered messages | Every unacked message is held in memory by the broker until it’s acked or requeued; a stuck consumer with a huge prefetch amplifies this. |
| Message store growth on disk | Persistent messages for durable queues accumulate on disk; a sustained publish rate with no draining consumer fills the EBS volume. |
| Log file growth | RabbitMQ and OS logs can grow unexpectedly large (verbose logging, crash loops writing repeatedly) and eat into the same volume as the message store. |
| Undersized EBS volume for sustained throughput | A volume sized for average load runs out of headroom during a sustained traffic spike or a backlog event. |
| `vm_memory_high_watermark` set too conservatively for the instance size | If it was set assuming a smaller instance, or left at RabbitMQ’s default fraction on a memory-constrained instance type, the alarm trips well before the node is genuinely at risk. |
App-side causes / symptoms
| Symptom | What’s happening |
|---|---|
| `rabbitTemplate.convertAndSend()` calls hang or time out | While the alarm is in effect, the broker blocks connections that publish and sends them a `connection.blocked` AMQP protocol notification. The underlying connection intentionally stops accepting new publishes until the alarm clears. |
| Publisher confirms stop arriving | If `spring.rabbitmq.publisher-confirm-type` is `correlated` or `simple`, the broker isn’t sending `basic.ack` for new publishes because it isn’t accepting them. Your app’s confirm callbacks just go quiet, which can look like a silent hang rather than an obvious error. |
| Producer thread pool exhaustion | If many application threads each call a blocking publish and all of them stall waiting on the blocked connection, the thread pool backing those calls can fill up. This then cascades into unrelated timeouts elsewhere in the same app (e.g., HTTP request threads all blocked on a downstream publish). |
| Nothing in the app logs explains it | Spring AMQP doesn’t always surface “connection blocked” as a loud exception; it can just look like elevated latency until you know to check for the blocked notification specifically (see Diagnostic Steps). |
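The broker-side view of these symptoms can be checked from the connection list. The sketch below assumes the tab-separated output of `rabbitmqctl list_connections name state`, where publishing connections typically show up as `blocking` or `blocked` while an alarm is in effect. The function name `blocked_connections` is hypothetical.

```bash
# Filter `rabbitmqctl list_connections name state` output (tab-separated)
# down to connections the broker reports as blocking/blocked.
# Header and informational lines are skipped because their last column
# is neither of those states.
blocked_connections() {
  awk -F '\t' '
    $NF == "blocked" || $NF == "blocking" { print $1 "\t" $NF; n++ }
    END { printf "%d publishing connection(s) affected\n", n }
  '
}
```

A possible invocation on a node is `rabbitmqctl -q list_connections name state | blocked_connections`. If the count lines up with the producers that are reporting hangs, the app-side symptoms are explained by the alarm rather than by an application bug.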
3. Diagnostic Steps
Work top-down, cheapest checks first:
- Check the Management UI Overview page for the alarm banner. It names the node and the resource (`memory` or `disk_space`); this is the fastest confirmation you’re dealing with this playbook and not something else.
- Run `rabbitmq-diagnostics check_local_alarms` (or `check_alarms` for a cluster-wide view) via SSM Session Manager on the suspect node(s). A non-empty result confirms which node and which resource.
- Check `rabbitmq-diagnostics status` on the same node for the actual memory/disk numbers against the configured watermark; this tells you how close you are, not just that you tripped it.
- Identify which queue(s) are consuming the most memory. In the Management UI Queues tab, sort by message count and check queue type (classic vs. quorum); a classic queue with a huge `Ready`/`Unacked` count is the most common single culprit.
- Check CloudWatch for `NodeMemoryUsage` (trend, not just current value) and the EBS volume’s free-space metric: is this a sudden spike or a slow leak over days?
- Check EBS volume free space directly on the node (via SSM) if CloudWatch lags: run `df -h` on the data directory’s mount point, and look at what’s actually consuming space (message store vs. logs).
- Check Spring producer application logs for signs of blocked publishing. With Spring AMQP, register interest in blocked/unblocked events (`ConnectionFactory.addConnectionListener` plus a `BlockedListener` on the created connection, or Spring’s `ConnectionBlockedEvent`/`ConnectionUnblockedEvent`), or watch for elevated `rabbitTemplate` latency/timeouts in metrics, and confirm the timing lines up with the alarm.
4. Safe Remediations
| Situation | Action |
|---|---|
| Disk alarm caused by log growth | Rotate/compress/ship old logs off the volume, then confirm the alarm clears once free space rises above disk_free_limit. |
| Memory/disk pressure caused by a backlog on a healthy consumer setup that’s merely behind | Fix or scale the consumers per Playbook 01: draining the backlog is what actually resolves the root cause, since it frees both the in-memory queue state and (once acked) disk-persisted messages. |
| Watermark clearly misconfigured for the instance size (e.g., default fraction on a small instance, alarm trips constantly under normal load) | Flag it, but treat raising the watermark as a config change requiring the same care as any broker config change: see caution below. |
**⚠️ Caution: freeing disk space:** Only delete files you are certain are logs, not the message store. RabbitMQ’s data directory contains queue index and message store files (`mnesia`/`quorum` data, depending on version and queue type); deleting or truncating these causes data loss or an unrecoverable node, not just a cleared alarm. If you’re not 100% sure a file is a rotatable log, don’t touch it; escalate instead.
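One way to make that rule hard to get wrong is a small guard that classifies a path before anyone deletes it. This is a sketch under assumptions: `/var/lib/rabbitmq` is the data directory used in the commands later in this playbook, `/var/log/rabbitmq` is assumed to be the log directory (confirm it on your nodes), and the function name `classify_path` is hypothetical. It does plain string matching and does not resolve symlinks or `..` segments.

```bash
# Classify a path before deleting anything from a RabbitMQ node.
#   never-delete : under the data directory (message store, queue indexes)
#   rotated-log  : an already-rotated or compressed log file
#   escalate     : anything else, including the active log file
classify_path() {
  case "$1" in
    /var/lib/rabbitmq/*)
      echo "never-delete" ;;
    /var/log/rabbitmq/*.gz|/var/log/rabbitmq/*.log.[0-9]*)
      echo "rotated-log" ;;
    *)
      echo "escalate" ;;
  esac
}
```

Only paths that come back as `rotated-log` are candidates for cleanup; everything else follows the escalation path.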
**⚠️ Caution: raising `vm_memory_high_watermark` or lowering `disk_free_limit`:** This is a broker configuration change, not a support-tier action. Raising the memory watermark reduces the safety margin before an actual OOM crash; lowering `disk_free_limit` reduces the margin before an actual out-of-disk corruption. Treat this as a last resort, and only after platform/on-call engineering has confirmed the instance has genuine headroom (real free RAM/disk, not just “the alarm is annoying”).
**⚠️ Caution: do NOT restart the node as a first response.** Restarting doesn’t fix underlying disk or memory pressure; the alarm will very likely just re-trip once the node rejoins and picks the workload back up. Worse, forcing a node restart during a memory/disk incident risks a cluster partition or a slow, disruptive rejoin (leader re-elections for every quorum queue with a replica on that node); see Playbook 03. Only restart if explicitly directed by on-call engineering as part of a broader remediation.
5. Escalation Trigger
Escalate to platform/on-call engineering when:
- The disk alarm is not a log/rotation issue, i.e., the message store itself has grown and needs an EBS volume resize, which support tier cannot do unilaterally.
- Draining the backlog (Playbook 01 remediations) does not clear the alarm within ~15-20 minutes, or the backlog is too large to drain fast enough to matter before downstream SLAs are breached.
- You determine the watermark configuration itself is wrong for the instance size and needs changing.
- The alarm has already triggered a node restart or you suspect a partition risk (hand off to Playbook 03 territory).
- You’re not sure whether a file on the data volume is safe to delete. Never guess here.
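To judge the ~15-20 minute drain threshold above before waiting it out, a rough estimate is backlog divided by the net drain rate. The sketch below assumes publishers are blocked (so nothing new arrives) and that the consumer ack rate stays constant; the function name `drain_eta_minutes` is hypothetical.

```bash
# Estimate minutes needed to drain a backlog at a steady net ack rate.
#   $1 = messages currently queued (ready + unacked)
#   $2 = net drain rate in messages per second
drain_eta_minutes() {
  local backlog=$1 rate=$2
  if [ "$rate" -le 0 ]; then
    # Nothing is draining, so the alarm will not clear on its own.
    echo "never"
    return 1
  fi
  local per_min=$((rate * 60))
  # Integer division, rounded up to the next whole minute.
  echo $(( (backlog + per_min - 1) / per_min ))
}
```

For example, `drain_eta_minutes 1200000 1000` prints `20`: a 1.2M-message backlog at 1,000 acks/s sits right at the edge of the threshold, so it is worth escalating in parallel rather than waiting.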
6. Relevant Commands/Queries
# Confirm which node has an active alarm, and which resource
rabbitmq-diagnostics check_local_alarms
Healthy: empty output, exit code 0. Alerting:
Error:
resource_limit_alarm: memory
or
Error:
resource_limit_alarm: disk
# Cluster-wide alarm view (run from any node)
rabbitmq-diagnostics check_alarms
# Detailed memory/disk numbers vs. configured watermarks
rabbitmq-diagnostics status
Healthy example (excerpt):
Memory
Total memory used: 0.61 gb
Calculation strategy: rss
Memory high watermark setting: 0.4 of available memory, computed to: 2.9 gb
Disk
Free disk space: 45.2 gb
Free disk space low watermark: 2.0 gb
Alerting example (excerpt):
Memory
Total memory used: 2.87 gb
Calculation strategy: rss
Memory high watermark setting: 0.4 of available memory, computed to: 2.9 gb
*** MEMORY ALARM RAISED ***
Disk
Free disk space: 1.4 gb
Free disk space low watermark: 2.0 gb
*** DISK ALARM RAISED ***
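To turn the memory lines of that output into a single “how close are we” number, a small parser can help. This sketch matches only the line shapes shown in the excerpts above and assumes both values are reported in the same unit; the function name `memory_watermark_pct` is hypothetical. For anything beyond ad-hoc use, a structured output format is likely more robust than text scraping.

```bash
# Read `rabbitmq-diagnostics status` text on stdin and print memory use
# as a percentage of the computed high watermark.
memory_watermark_pct() {
  awk '
    /Total memory used:/ { used  = $(NF-1) }   # e.g. "... used: 2.87 gb"
    /computed to:/       { limit = $(NF-1) }   # e.g. "... computed to: 2.9 gb"
    END {
      if (limit > 0) printf "%.0f\n", used / limit * 100
      else { print "unknown"; exit 1 }
    }
  '
}
```

A possible invocation is `rabbitmq-diagnostics status | memory_watermark_pct`. Against the excerpts above, the healthy example works out to about 21% of the watermark and the alerting example to about 99%.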
# Which queues are heaviest in memory (look at type + message counts together)
rabbitmqctl list_queues name type messages_ready messages_unacknowledged memory
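On a node with many queues, the raw listing is hard to scan. The helper below is a sketch that assumes the tab-separated column order of the command above (memory in the fifth column); the function name `top_queues_by_memory` is hypothetical.

```bash
# Show the N queues using the most memory (default 5).
# Expects `rabbitmqctl list_queues name type messages_ready
# messages_unacknowledged memory` output on stdin.
top_queues_by_memory() {
  local n=${1:-5}
  local tab
  tab=$(printf '\t')
  # Keep only rows whose 5th column is numeric (drops header/info lines),
  # then sort by that column, largest first.
  awk -F '\t' '$5 ~ /^[0-9]+$/' | sort -t "$tab" -k5,5nr | head -n "$n"
}
```

A possible invocation is `rabbitmqctl list_queues name type messages_ready messages_unacknowledged memory | top_queues_by_memory 3`. Read the `type` column alongside the counts, as in the diagnostic steps.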
# Check disk usage directly on the node (via SSM Session Manager)
df -h
du -sh /var/lib/rabbitmq/mnesia/*
CloudWatch metrics to pull up (from Tooling Walkthrough): `NodeMemoryUsage` and `DiskFreeLimitAlarm`, plus the underlying EBS volume’s free-space metric. Look at the trend over the last few hours, not just the current value.
7. Mini Practical
Reproduce a memory alarm locally and watch a Spring Boot producer get blocked, using an artificially low watermark so you don’t need gigabytes of messages to trip it.
Step 1: Run RabbitMQ with a very low memory high watermark:
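# The official image (3.9+) refuses to start with the deprecated
# RABBITMQ_VM_MEMORY_HIGH_WATERMARK variable, so set the watermark via a config file.
echo 'vm_memory_high_watermark.relative = 0.000000001' > low-watermark.conf
docker run -d --name rabbitmq-alarm-lab \
-p 5672:5672 -p 15672:15672 \
-v "$PWD/low-watermark.conf:/etc/rabbitmq/conf.d/90-low-watermark.conf:ro" \
rabbitmq:3.13-management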
This forces the watermark so low that the node will raise a memory alarm almost immediately after startup, no need to actually pump gigabytes of traffic through it.
Step 2: Confirm the alarm is active:
docker exec -it rabbitmq-alarm-lab rabbitmq-diagnostics check_local_alarms
You should see `resource_limit_alarm: memory`. Also check the Management UI at localhost:15672; the Overview page banner should show the memory alarm.
Step 3: Try publishing from your Spring Boot producer (same orders.exchange / orders.created.queue setup, pointed at this container):
curl -X POST localhost:8080/orders -H "Content-Type: application/json" -d '{"id":999}'
Observe: the call hangs or times out instead of returning immediately with “Order published.” If you register a `BlockedListener` on the connection (or listen for Spring AMQP’s `ConnectionBlockedEvent`/`ConnectionUnblockedEvent`) and log the blocked/unblocked callbacks, you’ll see the blocked notification fire the moment you try to publish.
Step 4: Fix it and watch recovery:
# Raise the watermark at runtime; no container restart is needed
docker exec -it rabbitmq-alarm-lab rabbitmqctl set_vm_memory_high_watermark 0.6
Re-run `check_local_alarms`; it should now return clean. Re-run the curl publish; it should return “Order published” immediately again, and (if you wired up the listener) you should see the unblocked callback fire.
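Instead of re-running the check by hand, recovery can be polled. The sketch below relies on the check command exiting non-zero while an alarm is active, as described in the commands section. The helper name `wait_until_clear` is hypothetical; it takes a timeout in seconds followed by the command to poll.

```bash
# Poll a command once per second until it exits 0 or the timeout elapses.
#   $1    = timeout in seconds
#   $2... = command to run (e.g. an alarm check)
wait_until_clear() {
  local timeout=$1
  shift
  local waited=0
  until "$@" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1   # still alarmed after the timeout
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0       # command succeeded: no active alarm
}
```

In this lab, a possible invocation is `wait_until_clear 30 docker exec rabbitmq-alarm-lab rabbitmq-diagnostics -q check_local_alarms && echo "alarm cleared"`.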
Step 5: Clean up:
docker rm -f rabbitmq-alarm-lab
✅ Checkpoint
You should now be able to:
- Explain why a memory or disk alarm on one node blocks publishers cluster-wide, and why that’s intentional.
- Run `rabbitmq-diagnostics check_local_alarms` and `rabbitmq-diagnostics status`, and read watermark numbers from the output.
- Explain why restarting a node is not a safe first response to a memory/disk alarm.