Latency, Ordering & Duplicate Messages: RabbitMQ Incident Guide

Prerequisite: TLS & Certificate Expiry

This is the last playbook in this section, and it covers two distinct incident types that both trace back to the same root cause: a synchronous/REST mental model doesn’t hold for messaging. An HTTP call either blocks until it’s done or it fails; a message queue decouples producer and consumer in time, which opens the door to surprises REST never has: latency that isn’t “the request,” and delivery that isn’t “exactly once, in order.”

Part A: Latency spikes correlated with GC pauses or CPU throttling. Processing suddenly gets slow, and it’s not a broker outage; it’s either the consumer JVM pausing or the broker instance running out of gas.
Part B: Ordering/duplicate delivery surprises. A downstream team reports “we processed the same order twice” or “messages arrived out of order” and assumes it’s a bug. It’s usually expected behavior colliding with non-idempotent consumer code.

1. Symptom

Part A: Message processing latency (time from publish to ack) spikes intermittently or during load, then recovers. No errors, no crashed consumers, no obvious broker alarm, just “it got slow for a while.” Reports vary: sometimes it’s “our consumer lagged for two minutes,” sometimes it’s “the whole queue felt sluggish,” which is your first clue these are two different problems wearing the same costume.

Part B: A downstream/app team files a ticket along the lines of “RabbitMQ delivered the same message twice” or “we got message B before message A even though A was sent first,” usually accompanied by a real business symptom: a duplicate charge, a duplicate shipment, a status update applied twice. They’re expecting you to confirm a broker bug. You won’t find one.

2. Likely Causes

Part A: broker-side

Cause	How it manifests
Burstable EC2 instance (`t3.*`) CPU credit exhaustion	Under sustained load, the node burns through its CPU credit balance faster than it replenishes. Once credits hit zero, an instance running in standard credit mode is throttled to its baseline CPU performance (first flagged in AWS Architecture, section 1). This is a broker-wide effect: publish/deliver/ack latency degrades for every queue and consumer on that node, not just one app’s.
Underlying EBS volume saturation	Covered in AWS Architecture, section 4: a separate but similarly “nothing looks broken, just slow” broker-side cause, worth ruling out if CPU credits look fine but latency is still broker-wide.
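
As a rough illustration of why sustained load drains a burstable node, the sketch below estimates how long a credit balance lasts. It relies on the AWS definition that one CPU credit equals one vCPU at 100% for one minute; the class name, method name, and the t3.medium-like figures (2 vCPUs, 24 credits earned per hour, a 576-credit balance) are assumptions for the example, not values taken from this environment.

```java
/** Back-of-envelope estimate of how long a burstable instance can sustain a load. */
final class CpuCreditEstimate {

    /**
     * One CPU credit = one vCPU at 100% for one minute, so an instance spends
     * vcpus * utilization * 60 credits per hour while earning a fixed amount.
     * Returns hours until the balance reaches zero, or positive infinity when
     * the load is at or below baseline and the balance never drains.
     */
    static double hoursUntilExhausted(double balance, double earnedPerHour,
                                      int vcpus, double utilization) {
        double spentPerHour = vcpus * utilization * 60.0;
        double netDrainPerHour = spentPerHour - earnedPerHour;
        if (netDrainPerHour <= 0) {
            return Double.POSITIVE_INFINITY;
        }
        return balance / netDrainPerHour;
    }

    public static void main(String[] args) {
        // Assumed figures: 2 vCPUs, 24 credits/hour earned, 576 credits banked.
        // At 60% utilization: spend 72/hour, earn 24/hour, so roughly 12 hours of headroom.
        System.out.println(hoursUntilExhausted(576, 24, 2, 0.60));
        // At 10% utilization the node earns more than it spends.
        System.out.println(hoursUntilExhausted(576, 24, 2, 0.10));
    }
}
```

The point of the arithmetic is that throttling arrives hours after the load starts, which is why the slowdown can appear unrelated to any recent change.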

Part A: app-side (Spring Boot)

Cause	How it manifests
JVM stop-the-world GC pause on the consumer	A long GC pause (old-gen collection under memory pressure, for example) freezes every thread in the JVM, including the `@RabbitListener` container threads. Messages already delivered to that consumer sit as `unacked` for the duration of the pause: the consumer isn’t nacking or crashing, it’s just not running. Once the pause ends, those messages process in a sudden burst, which can look like “it caught up all at once.”
Undersized heap or a poorly-tuned GC algorithm for the app’s allocation rate	The same symptom, but happening repeatedly under normal load rather than as a one-off: a tuning problem, not an anomaly.

The key distinguishing fact: broker-side CPU credit exhaustion affects every queue/consumer on that node; a single consumer’s GC pause only affects that consumer’s queue(s). That scope difference is your fastest diagnostic signal, available before you even open a metrics dashboard (see Diagnostic Steps).

Part B: broker-side

There isn’t really a broker-side “cause” here; this is the important mental shift for Part B. RabbitMQ is behaving exactly as designed:

Behavior	Why it happens
At-least-once delivery permits duplicates	Already established in Core Concepts, section 5: if a consumer crashes, is killed, or the connection drops after processing but before the ack reaches the broker, RabbitMQ has no way to know the work was done: it redelivers the message (to the same consumer on reconnect, or to another consumer entirely). This is the mechanism referenced back in Core Concepts as “covered in Playbook 09”: this is that coverage.
Per-queue ordering is preserved, but not guaranteed across concurrent consumer threads	A single queue delivers messages in the order they became ready. But once `concurrency > 1` is configured, multiple listener threads pull from the same queue and process in parallel: thread 2 can finish message B before thread 1 finishes message A, even though A was enqueued first. The queue offered them in order; nothing guaranteed they’d finish in order.
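
The ordering effect can be reproduced without a broker. The following is a minimal sketch in plain Java, with a thread pool standing in for listener threads; the class name, message names, and sleep times are illustrative assumptions, not part of the producer/consumer app.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Shows that dispatch order and completion order diverge once workers run in parallel. */
final class CompletionOrderDemo {

    /** Hands message A, then message B, to a pool of workers; returns the order they finished in. */
    static List<String> completionOrder(int workers) {
        List<String> finished = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        // "A" is dispatched first but is slow; "B" is dispatched second and is fast.
        pool.submit(() -> process("A", 300, finished));
        pool.submit(() -> process("B", 10, finished));
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return finished;
    }

    private static void process(String id, long workMillis, List<String> finished) {
        try {
            Thread.sleep(workMillis); // stand-in for real processing time
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        finished.add(id);
    }

    public static void main(String[] args) {
        System.out.println("1 worker:  " + completionOrder(1)); // [A, B]
        System.out.println("2 workers: " + completionOrder(2)); // [B, A]
    }
}
```

With one worker the slow message blocks the fast one, so order is preserved at the cost of throughput; with two, both start in order but finish out of order.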

Part B: app-side (Spring Boot)

Cause	How it manifests
Consumer has no idempotency check	The `@RabbitListener` method applies a side effect (charge a card, increment a counter, ship an order) directly, with no check for “have I already done this for this message ID?” A redelivered duplicate just does the side effect again.
Consumer designed with non-idempotent operations	Even with some dedup logic, operations like “increment shipped count by 1” are inherently unsafe under duplicate delivery: running it twice produces a different (wrong) result. “Set status to SHIPPED” run twice produces the same (correct) result.
`spring.rabbitmq.listener.simple.concurrency` set above `1` without the code accounting for it	Perfectly normal, often desirable for throughput: but it means “message A was published before message B” no longer implies “message A’s side effects land before message B’s.” Code that silently assumed strict ordering (e.g., “the latest message always wins because it’s always processed last”) breaks under concurrency, not because of a bug, but because the assumption was never true to begin with.

The common thread for Part B: none of this is a RabbitMQ defect. At-least-once delivery and concurrent per-message processing are documented, deliberate design choices. The fix lives entirely in consumer code.
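
To make the difference between the two kinds of operation concrete, here is a small sketch that applies the same "order shipped" message twice to both. The class and method names are hypothetical and the state is held in memory purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Contrasts a non-idempotent side effect with an idempotent one under duplicate delivery. */
final class IdempotencyDemo {
    private int shippedCount = 0;
    private final Map<String, String> statusByOrder = new HashMap<>();

    /** Non-idempotent: each delivery of the same message changes the total again. */
    void incrementShippedCount(String orderId) {
        shippedCount++;
    }

    /** Idempotent: a second delivery writes the value that is already there. */
    void markShipped(String orderId) {
        statusByOrder.put(orderId, "SHIPPED");
    }

    int shippedCount() {
        return shippedCount;
    }

    String statusOf(String orderId) {
        return statusByOrder.get(orderId);
    }

    public static void main(String[] args) {
        IdempotencyDemo demo = new IdempotencyDemo();
        // Simulate the broker delivering the same message twice.
        for (int delivery = 0; delivery < 2; delivery++) {
            demo.incrementShippedCount("ORD-1");
            demo.markShipped("ORD-1");
        }
        System.out.println("shippedCount = " + demo.shippedCount());    // 2, although one order shipped
        System.out.println("status       = " + demo.statusOf("ORD-1")); // SHIPPED, same as after one delivery
    }
}
```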

3. Diagnostic Steps

Work top to bottom: cheapest, fastest checks first.

Part A

1. Check scope in the Management UI first. Open the Queues tab (or Overview for a cluster-wide view). Is the slowdown showing up on every queue/consumer, or just one? Broker-wide slowness across unrelated queues points at the broker/node; a single queue’s consumer lagging while everything else looks normal points at that one app.
2. Check CPUCreditBalance in CloudWatch for the broker node(s), per the metric already listed in the Tooling Walkthrough CloudWatch table. A steadily depleting balance trending toward 0 around the same time window as the reported slowness is a strong broker-side signal.
3. Check the consumer app’s JVM GC activity, via Actuator (/actuator/metrics/jvm.gc.pause) if available, or GC logs if not. Look for a long pause (hundreds of ms to seconds) whose timestamp lines up with the reported latency spike.
4. Correlate timestamps precisely. Line up (a) the CPUCreditBalance depletion curve, (b) the GC pause event timestamp, and (c) the queue’s ready/unacked spike in the Management UI’s per-queue Queued messages chart. A GC pause shows up as a brief spike in unacked (messages delivered but frozen mid-processing) immediately followed by a burst of acks once the pause ends: a very distinctive sawtooth. Broker CPU throttling shows up as a smoother, sustained increase in publish/deliver/ack latency across the board, without that same all-at-once “catch-up burst” shape.
5. Rule out EBS saturation. If CPU credits look healthy but broker-wide latency is still elevated, check the EBS volume metrics (VolumeQueueLength, VolumeReadOps/VolumeWriteOps) per the AWS Architecture and Tooling Walkthrough modules.

Step	Question it answers	Typical time cost
1. Management UI scope check	Broker-wide or single-consumer?	seconds
2. `CPUCreditBalance`	Is the broker node throttled?	1-2 min
3. `jvm.gc.pause` / GC logs	Is the consumer JVM pausing?	2-3 min
4. Timestamp correlation	Which one actually lines up with the reported spike?	2-3 min
5. EBS metrics	Ruling out a third broker-side cause	1-2 min
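
The timestamp correlation in step 4 comes down to an interval-overlap check. The sketch below shows one way to express it; the class name, the clock-skew tolerance parameter, and the sample timestamps are all hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

/** Checks whether a GC pause lines up with a reported latency spike. */
final class SpikeCorrelation {

    /**
     * True if the pause interval overlaps the spike window. The tolerance widens
     * the spike window on both sides to allow for clock skew and coarse metric periods.
     */
    static boolean pauseOverlapsSpike(Instant pauseStart, Duration pauseLength,
                                      Instant spikeStart, Instant spikeEnd,
                                      Duration tolerance) {
        Instant pauseEnd = pauseStart.plus(pauseLength);
        return !pauseEnd.isBefore(spikeStart.minus(tolerance))
            && !pauseStart.isAfter(spikeEnd.plus(tolerance));
    }

    public static void main(String[] args) {
        // Made-up timestamps for illustration only.
        Instant spikeStart = Instant.parse("2026-07-02T03:14:04Z");
        Instant spikeEnd = Instant.parse("2026-07-02T03:14:10Z");
        Duration skew = Duration.ofSeconds(5);

        // A 2.3 s pause inside the spike window: plausible cause.
        System.out.println(pauseOverlapsSpike(
            Instant.parse("2026-07-02T03:14:05Z"), Duration.ofMillis(2300),
            spikeStart, spikeEnd, skew)); // true

        // A short pause 24 minutes earlier: unrelated.
        System.out.println(pauseOverlapsSpike(
            Instant.parse("2026-07-02T02:50:00Z"), Duration.ofMillis(400),
            spikeStart, spikeEnd, skew)); // false
    }
}
```

An overlap is supporting evidence rather than proof; it still needs the single-consumer scope from step 1 to point at the app rather than the broker.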

Part B

1. Get the message payload and confirm whether it carries a unique/idempotency key (an order ID, an event UUID, anything a consumer could use to detect “have I seen this before”). Most well-designed event payloads have one; if this payload doesn’t, that’s already a finding worth surfacing.
2. Check the consumer code (or ask the owning team) whether that key is actually checked anywhere before the side effect runs. This is usually the crux of the ticket: the key often exists in the payload but nothing consults it.
3. Check the listener’s configured concurrency:
```
spring:
  rabbitmq:
    listener:
      simple:
        concurrency: 5
        max-concurrency: 10
```
Any value above 1 fully explains “out of order” complaints on its own; no further investigation is needed for the ordering half of the ticket.
4. Grep consumer logs for the same message ID appearing twice. This is your hard evidence for the duplicate half of the ticket. Look for two log lines (possibly minutes apart, possibly across two different consumer instances) referencing the identical business ID. Also check for a nack/redelivery or a consumer crash/restart in that window: the sequence “delivered → about-to-ack crash/nack → redelivered → processed again” is the textbook at-least-once duplicate.
5. Confirm this against x-death/redelivered flags if the message ever touched a DLQ path (see Playbook 06). A message that was nacked and retried through Spring’s retry layer before eventually succeeding will legitimately show multiple processing attempts in logs; that’s expected retry behavior, not a mystery duplicate.

Step	Question it answers	Typical time cost
1. Payload has an idempotency key?	Is there a natural dedup key to check?	1-2 min
2. Is the key actually checked in code?	Is idempotency implemented at all?	2-5 min (may need the owning team)
3. `concurrency` setting	Does config alone explain the ordering complaint?	1 min
4. Duplicate message ID in logs	Hard evidence of redelivery	2-5 min
5. Redelivery/`x-death` context	Was this an expected retry, not a mystery?	2-3 min
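
When there are more than a handful of log lines, the duplicate check in step 4 can be automated. The sketch below counts how often each ID appears, assuming the `orderId=...` log format used elsewhere in this playbook and assuming each processing attempt logs the ID exactly once; the class name and sample lines are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Finds business IDs that appear in more than one log line. */
final class DuplicateIdScan {
    private static final Pattern ORDER_ID = Pattern.compile("orderId=([A-Za-z0-9-]+)");

    /** Returns each order ID seen at least twice, mapped to its number of occurrences. */
    static Map<String, Integer> duplicates(List<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = ORDER_ID.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        counts.values().removeIf(n -> n < 2); // keep only IDs processed more than once
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "03:14:05 INFO Processed orderId=ORD-10493",
            "03:14:06 INFO Processed orderId=ORD-10494",
            "03:16:41 INFO Processed orderId=ORD-10493");
        System.out.println(duplicates(lines)); // {ORD-10493=2}
    }
}
```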

4. Safe Remediations

Part A

Situation	Safe action
Broker CPU credits depleting under sustained, expected load	This means the instance type is undersized for real traffic, not a transient blip. Escalate for a resize to a non-burstable family (e.g., `m6i`): this is a capacity-planning change, not a live support fix.
Need an immediate stopgap while a resize is scheduled/approved	Unlimited burst mode can be enabled on the instance to avoid throttling in the short term.
App-side GC pauses causing periodic latency spikes	Heap sizing and GC algorithm choice are a code/deployment change owned by the app team: hand off the correlated evidence (GC pause timestamp + queue impact) rather than attempting to tune JVM flags yourself on a running production instance.

⚠️ Caution: unlimited burst mode is a cost decision, not a quiet toggle. Switching a t3.* instance to unlimited burst credit mode avoids throttling, but any usage beyond the included baseline is billed extra, and sustained heavy load can make that bill significantly larger than expected. Treat this as a deliberate, approved stopgap communicated to whoever owns the AWS bill, not something to flip unilaterally mid-incident and forget about.
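
For a sense of scale when raising this with the bill owner, the surplus charge can be approximated as usage above baseline multiplied by a per-vCPU-hour rate. The sketch below is a simplification: it ignores any banked credits and the way AWS averages usage over time, and the rate passed in is a placeholder assumption that should be replaced with the current published price for the region and OS.

```java
/** Rough estimate of the extra charge for running above baseline in unlimited mode. */
final class BurstSurplusCost {

    /**
     * surplus vCPU-hours = (utilization - baseline) * vcpus * hours, floored at zero;
     * cost = surplus vCPU-hours * ratePerVcpuHour.
     */
    static double surplusCost(int vcpus, double baseline, double utilization,
                              double hours, double ratePerVcpuHour) {
        double aboveBaseline = Math.max(0.0, utilization - baseline);
        return aboveBaseline * vcpus * hours * ratePerVcpuHour;
    }

    public static void main(String[] args) {
        double assumedRate = 0.05; // placeholder rate per vCPU-hour, not a quoted price
        // 2 vCPUs, 20% baseline, sustained 80% load.
        System.out.printf("per day:   %.2f%n", surplusCost(2, 0.20, 0.80, 24, assumedRate));
        System.out.printf("per month: %.2f%n", surplusCost(2, 0.20, 0.80, 24 * 30, assumedRate));
    }
}
```

The charge scales linearly with time, so a stopgap left enabled keeps accruing cost for as long as the load stays above baseline.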

Part B

Situation	Safe action
Duplicate/ordering complaint with a clear at-least-once or concurrency explanation	Your job is diagnosis, not a broker-side fix. Present the evidence (duplicate message ID in logs, or the `concurrency` setting) to the owning team and explain that this is expected broker behavior that their consumer code needs to handle.
Team pushes back that “it’s a RabbitMQ bug”	Point them at the two concrete, checkable facts: at-least-once delivery is documented broker behavior, not a defect, and per-queue ordering was never guaranteed across concurrent consumer threads once `concurrency > 1`. Neither is fixable from the broker side.

There is no broker-level remediation for Part B because there’s nothing broken to fix. The correct “safe remediation” from a support perspective is accurate diagnosis and clear communication of why this isn’t an infra issue, which is what prevents the same ticket from bouncing back to you next week under a different symptom description.

5. Escalation Trigger

Part A: escalate when:

CPUCreditBalance is sustained near zero under normal (not anomalous) load. This needs an instance type change, which is a capacity/infra decision above the support tier’s remit.
GC pauses are frequent and app-owned. Hand off to the owning dev team with the correlated timestamps; heap/GC tuning is their change to make and test.

Part B: escalate when:

The duplicate/ordering issue has real business impact: a customer double-charged, an order double-shipped, a status incorrectly reverted. Escalate to the owning app team with the diagnostic evidence (duplicate message ID, timestamps, concurrency setting) attached. This needs a code fix (idempotency), not an infra action, but the business impact means it can’t just sit as “expected behavior, closing ticket.”

6. Relevant Commands/Queries

Part A:

# CloudWatch: CPU credit balance for a broker node
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=<broker-instance-id> \
  --start-time 2026-07-02T00:00:00Z \
  --end-time 2026-07-02T06:00:00Z \
  --period 300 \
  --statistics Average

# Spring Boot Actuator: JVM GC pause metric (needs micrometer + actuator)
curl -s http://<app-host>:8080/actuator/metrics/jvm.gc.pause | jq .

Healthy-looking output has a low COUNT and small TOTAL_TIME relative to the observation window. A single large jump in TOTAL_TIME between two polls, with a timestamp you can pin down from surrounding app logs, is your GC-pause evidence.
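
Comparing two polls is simple subtraction, but writing it down makes the evidence easier to hand off. This sketch computes the average length of the pauses that happened between two readings; the class name and sample values are hypothetical, and it assumes TOTAL_TIME is reported in seconds (check the `baseUnit` field in the Actuator response).

```java
/** Derives GC pause evidence from two successive polls of jvm.gc.pause. */
final class GcPauseDelta {

    /**
     * Average length, in the metric's base unit, of the pauses recorded between
     * two polls. Returns 0 when no new pauses were recorded.
     */
    static double averagePause(long countBefore, double totalTimeBefore,
                               long countAfter, double totalTimeAfter) {
        long newPauses = countAfter - countBefore;
        if (newPauses <= 0) {
            return 0.0;
        }
        return (totalTimeAfter - totalTimeBefore) / newPauses;
    }

    public static void main(String[] args) {
        // Made-up readings: one new pause accounts for about 2.3 s of TOTAL_TIME.
        System.out.printf("%.2f%n", averagePause(40, 0.80, 41, 3.10));
        // Made-up readings: 20 new pauses averaging about 10 ms each, ordinary young-gen activity.
        System.out.printf("%.3f%n", averagePause(40, 0.80, 60, 1.00));
    }
}
```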

# Management UI equivalent, per queue, to eyeball the "sawtooth" pattern:
# Queues tab -> click the queue -> Queued messages chart -> look at Unacked over time

Part B:

# Grep consumer logs for a specific business/message ID appearing more than once
grep "orderId=ORD-10493" application.log

# The setting that fully explains "out of order" complaints on its own
spring:
  rabbitmq:
    listener:
      simple:
        concurrency: 5
        max-concurrency: 10

# Confirm consumer count / concurrency in practice via the broker side
rabbitmqctl list_channels connection consumer_count

7. Mini Practical

Two short exercises, one per part, both extending the producer/consumer app.

Part A: correlate a GC pause with Actuator

Step 1: Add Actuator to your app if it isn’t already there (pom.xml):

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

Enable the metrics endpoint (application.yml):

management:
  endpoints:
    web:
      exposure:
        include: health,metrics

Step 2: Check the baseline GC metric before generating any load:

curl -s http://localhost:8080/actuator/metrics/jvm.gc.pause | jq .

Note the COUNT and TOTAL_TIME values.

Step 3: Publish a burst of messages to give the JVM something to do:

for i in $(seq 1 200); do
  curl -s -X POST localhost:8080/orders -H "Content-Type: application/json" -d "{\"id\":$i}" > /dev/null
done

Step 4: Re-check the metric and note whether COUNT increased and by how much TOTAL_TIME grew:

curl -s http://localhost:8080/actuator/metrics/jvm.gc.pause | jq .

Step 5: Correlate with the Management UI. Open localhost:15672 → Queues → orders.created.queue → Queued messages. In production, you’d line up a spike in TOTAL_TIME from step 4 against a corresponding blip in that queue’s unacked count at the same timestamp. That correlation, not either metric alone, is what confirms “the app paused, not the broker.” (A full GC-stress reproduction that forces a visible multi-second pause requires deliberately constraining heap size and generating heavy allocation; it’s worth knowing this is possible, but it isn’t necessary to complete this exercise.)

Part B: simulate a duplicate delivery and fix it with an idempotency check

Step 1: Write a consumer with no deduplication, extending your OrderConsumer:

import java.util.concurrent.atomic.AtomicInteger;

import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.stereotype.Component;

@Component
public class OrderConsumer {

    private final AtomicInteger processedCount = new AtomicInteger(0);

    @RabbitListener(queues = RabbitConfig.QUEUE)
    public void handleOrder(String orderJson) {
        // No idempotency check -- every delivery is treated as new work.
        int count = processedCount.incrementAndGet();
        System.out.println("Processed order (count=" + count + "): " + orderJson);
    }
}

Step 2: Publish the same message twice on purpose, simulating a redelivery after a crash-before-ack:

curl -X POST localhost:8080/orders -H "Content-Type: application/json" -d '{"id":"ORD-1","item":"widget"}'
curl -X POST localhost:8080/orders -H "Content-Type: application/json" -d '{"id":"ORD-1","item":"widget"}'

Console output shows processedCount incrementing to 2 for what is logically the same order; this is the double-processing bug in miniature. In production this would be RabbitMQ genuinely redelivering after a consumer crash, not two manual curl calls, but the resulting consumer-side symptom is identical.

Step 3: Add an idempotency check using an in-memory Set of seen message IDs (a real implementation would use a DB unique constraint or a Redis dedup key so it survives restarts and works across multiple instances; the Set here is just for a quick local demo):

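import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.stereotype.Component;

@Component
public class OrderConsumer {

    private static final Pattern ORDER_ID = Pattern.compile("\"id\"\\s*:\\s*\"?([^\",}\\s]+)\"?");

    private final Set<String> seenOrderIds = ConcurrentHashMap.newKeySet();
    private final AtomicInteger processedCount = new AtomicInteger(0);

    @RabbitListener(queues = RabbitConfig.QUEUE)
    public void handleOrder(String orderJson) {
        String orderId = extractOrderId(orderJson);

        // add() returns false when the ID is already in the set, i.e. a duplicate delivery.
        if (!seenOrderIds.add(orderId)) {
            System.out.println("Skipping duplicate delivery of " + orderId);
            return;
        }

        int count = processedCount.incrementAndGet();
        System.out.println("Processed order (count=" + count + "): " + orderJson);
    }

    // Demo-only helper: pulls the "id" value out with a regex and falls back to the
    // whole payload as the key. A real consumer would use a proper JSON parser.
    private static String extractOrderId(String orderJson) {
        Matcher m = ORDER_ID.matcher(orderJson);
        return m.find() ? m.group(1) : orderJson;
    }
}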

Step 4: Re-run the same two curl calls from Step 2. This time, the console shows the order processed once and the second delivery logged as Skipping duplicate delivery of ORD-1; processedCount stays at 1. This is the idempotent-consumer pattern a support engineer needs to recognize in someone else’s code, not necessarily hand-build in production: check a unique ID against an already-processed store before applying any side effect.

✅ Checkpoint

You should now be able to:

Distinguish broker-side CPU credit throttling from app-side JVM GC pauses using scope (broker-wide vs. single-consumer) and timestamp correlation, without needing to be told which one it is.
Explain why “unlimited burst mode” is a cost/approval decision, not a quiet fix to flip during an incident.
Explain to a downstream team, with evidence, why a duplicate or out-of-order message is expected at-least-once/concurrent-consumer behavior rather than a RabbitMQ defect.
Recognize an idempotent consumer pattern (unique-ID dedup check, naturally idempotent operations) well enough to confirm whether an app team’s code has one.