
Hands-On Lab

Diagnose a multi-symptom incident end-to-end without being told which playbook applies.

Prerequisite: all Alert Playbooks

You’ll need: the tools and setup from Environment Setup through Tooling Walkthrough, plus everything you built along the way


What you’ll be able to do after this module

  • Diagnose a multi-symptom incident end-to-end, from “something’s wrong” to “here’s the root cause and the fix,” without being told which playbook applies.
  • Practice the exact workflow you’ll use on real rotation: look at symptoms, form a hypothesis, check the cheapest evidence first, confirm or rule out, escalate only if genuinely needed.

Unlike earlier modules, this one does not tell you which concept it’s testing. That’s the point: a real alert doesn’t come labeled “this is a Playbook 03 situation.” Diagnosing which playbook applies is itself the skill.


Setup

You’ll extend the producer/consumer app one more time. If you don’t still have it, recreate it quickly using the code in First Producer and Consumer. You need the RabbitConfig, OrderController, and OrderConsumer classes, running against the Docker container from Environment Setup (rabbitmq-crashcourse).

Add one more dependency if you haven’t already, for the DLQ portion below:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

Scenario

Ticket: “The orders-service team says order confirmations have basically stopped going out. Customers are placing orders fine, but nothing downstream is happening: no confirmation emails, no shipping triggers. This started roughly 20 minutes ago. No recent deploys that they’re aware of. Please investigate.”

You are not told what’s wrong. Work the ticket the way you would in real life: gather evidence, form a hypothesis, test it, and either resolve it or escalate with a clear, specific handoff.


Part 1: Reproduce a realistic multi-cause incident

To make this concrete without needing a production broker, build the following into your local setup. (In a real incident you wouldn’t do this step. This is just how the lab manufactures a realistic mess for you to untangle, mixing an app-code bug with a config oversight, because real incidents are usually two small things compounding rather than one dramatic failure.)

Step 1: Introduce a silent consumer failure. Modify your OrderConsumer so it throws for a subset of messages, simulating a bug that shipped unnoticed:

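import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.stereotype.Component;

@Component
public class OrderConsumer {

    @RabbitListener(queues = RabbitConfig.QUEUE)
    public void handleOrder(String orderJson) {
        // Simulated bug: throw for every EU-region order so the listener
        // rejects those messages instead of acknowledging them.
        if (orderJson.contains("\"region\":\"EU\"")) {
            throw new IllegalStateException("Unrecognized region format");
        }
        System.out.println("Processed order: " + orderJson);
    }
}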

Step 2: Under-provision the consumer. Set concurrency to 1 explicitly, simulating a config that was never tuned for current traffic:

spring:
  rabbitmq:
    listener:
      simple:
        concurrency: 1
        max-concurrency: 1

Step 3: Generate realistic mixed traffic:

for i in $(seq 1 30); do
  region=$([ $((i % 3)) -eq 0 ] && echo "EU" || echo "US")
  curl -s -X POST localhost:8080/orders -H "Content-Type: application/json" \
    -d "{\"id\":$i,\"region\":\"$region\"}" > /dev/null
done
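To know what to expect in the queue before you start looking, it can help to see the traffic mix the loop above produces. The following is a minimal sketch in plain Java that mirrors the shell logic; the class and method names (`TrafficMix`, `regionFor`, `payloadFor`) are illustrative and not part of the lab app.

```java
import java.util.ArrayList;
import java.util.List;

public class TrafficMix {

    // Mirrors the shell loop: every third order id is tagged EU, the rest US.
    static String regionFor(int id) {
        return (id % 3 == 0) ? "EU" : "US";
    }

    // Builds the same compact JSON body that the curl command posts.
    static String payloadFor(int id) {
        return "{\"id\":" + id + ",\"region\":\"" + regionFor(id) + "\"}";
    }

    // Returns the payloads for order ids 1..count, in send order.
    static List<String> payloads(int count) {
        List<String> out = new ArrayList<>();
        for (int id = 1; id <= count; id++) {
            out.add(payloadFor(id));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> all = payloads(30);
        long eu = all.stream().filter(p -> p.contains("\"region\":\"EU\"")).count();
        System.out.println("EU orders: " + eu + ", US orders: " + (all.size() - eu));
    }
}
```

For 30 orders this gives 10 EU and 20 US payloads, with the first EU order at id 3, so the first failure arrives almost immediately after traffic starts.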

Now go investigate, as if this ticket just landed in your queue and you built none of this yourself.


Part 2: Diagnose it

Work through these prompts in order. Resist jumping to the answer key below until you’ve actually run the commands.

  1. Check the queue first. Using either the Management UI or rabbitmqctl list_queues name messages_ready messages_unacknowledged consumers, what do you observe about orders.created.queue? Is this consistent with “confirmations have stopped entirely,” or something more specific?
  2. Check application logs. What exception, if any, is being logged repeatedly? Does it correlate with a pattern in the message payloads (hint: compare the failing messages to the succeeding ones)?
  3. Form a hypothesis for why some orders process fine while others seemingly get stuck. Which playbook’s mental model does this match: 01, 06, or something else?
  4. Check consumer concurrency (spring.rabbitmq.listener.simple.concurrency). Does the current setting fully explain why a single bad message could hold up all subsequent orders, not just other EU ones?
  5. Decide: is this a broker problem, an app-code problem, a config/capacity problem, or some combination? What’s your specific, evidence-backed diagnosis, not “something’s wrong with RabbitMQ,” but the actual root cause you’d write in a ticket?

Answer key (check your reasoning after you've worked through it yourself)

- **Queue check:** `messages_ready` climbs steadily and `consumers` shows `1` (not `0`), so this is *not* "nobody's listening" ([Playbook 01](/rabbit-mq/alert-playbooks/queue-depth-consumer-lag)'s classic zero-consumer signature). Something more specific is happening: a consumer is attached but not making progress.
- **Log check:** `IllegalStateException: Unrecognized region format` repeats, and only ever for `region: EU` payloads. That is a deterministic, payload-specific failure. This is a **poison message** pattern ([Playbook 06](/rabbit-mq/alert-playbooks/poison-messages-dlq)), not a generic backlog.
- **Root cause, compounded by two factors:**
  1. **App bug:** the consumer throws for any EU-region order. This is a real code defect, not a broker issue.
  2. **Config amplifier:** `concurrency: 1` means there is exactly **one** listener thread. With no DLQ configured, the default behavior is to requeue the failed message, and with only one thread, that same EU message gets redelivered and retried immediately, ahead of the US orders queued behind it, effectively **head-of-line blocking** the entire queue. If concurrency were higher, other threads could keep draining US orders while one thread spins on the EU messages. The underlying bug would still exist, but its blast radius would be far smaller.
- **This is a combination**, and a good ticket response says so explicitly: *"Root cause: the EU-region order handler throws `IllegalStateException` on a payload format it doesn't recognize (app bug, needs a code fix from the orders-service team). Impact was amplified by `concurrency: 1` and no DLX configured, so the failing message blocked the entire queue instead of just EU orders (a config gap worth fixing alongside the code fix, see [Playbook 06](/rabbit-mq/alert-playbooks/poison-messages-dlq) for adding a DLX so this can't happen again)."*
- **Escalation call:** the code fix belongs to the owning app team; this is not something support patches directly. But you're not escalating blind: you're handing off a precise diagnosis, the reproducing payload pattern, and a concrete follow-up recommendation (DLX + concurrency review), which is exactly what [Escalation and Communication](/rabbit-mq/escalation-communication) will ask you to practice next.
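The head-of-line reasoning in the answer key can be sketched as a small, broker-free simulation. This is a deliberately simplified model, not RabbitMQ or Spring code: it assumes one consumer thread, one message in flight at a time, and that a rejected message returns to the head of the queue. On a real broker, prefetch and requeue position affect how closely the behavior matches. The class name `HeadOfLineDemo` is illustrative.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class HeadOfLineDemo {

    // Returns {processed, deadLettered, stillQueued} after at most maxDeliveries
    // delivery attempts. Assumed model: single consumer, one message in flight,
    // a requeued message goes back to the head of the queue.
    static int[] run(boolean requeueOnFailure, int maxDeliveries) {
        Deque<String> queue = new ArrayDeque<>();
        for (int id = 1; id <= 30; id++) {
            queue.addLast(id % 3 == 0 ? "EU" : "US"); // same mix as the lab traffic
        }
        int processed = 0;
        int deadLettered = 0;

        for (int delivery = 0; delivery < maxDeliveries && !queue.isEmpty(); delivery++) {
            String region = queue.pollFirst();
            if (!region.equals("EU")) {
                processed++;              // handler succeeds, message is acked
            } else if (requeueOnFailure) {
                queue.addFirst(region);   // handler throws, message is requeued
            } else {
                deadLettered++;           // handler throws, message leaves the queue
            }
        }
        return new int[] {processed, deadLettered, queue.size()};
    }

    public static void main(String[] args) {
        int[] blocked = run(true, 1000);
        int[] fixed = run(false, 1000);
        System.out.println("requeue:     processed=" + blocked[0]
                + " deadLettered=" + blocked[1] + " stillQueued=" + blocked[2]);
        System.out.println("dead-letter: processed=" + fixed[0]
                + " deadLettered=" + fixed[1] + " stillQueued=" + fixed[2]);
    }
}
```

Under these assumptions, the requeue run processes only the two US orders ahead of the first EU order and then spends every remaining delivery on that one message, leaving 28 messages queued. The dead-letter run drains the queue: 20 processed, 10 dead-lettered.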

Part 3: Apply the fix and confirm recovery

Step 1: Add a DLX to your queue configuration (see Playbook 06 for the full pattern) so a genuinely broken message stops blocking the queue and instead lands somewhere visible.
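As a reminder of what "adding a DLX" means at the queue level, here is a minimal sketch of the optional queue arguments involved. `x-dead-letter-exchange` and `x-dead-letter-routing-key` are the argument keys RabbitMQ reads. The exchange and routing-key names (`orders.dlx`, `orders.created.dlq`) are placeholders for this example, so use whatever names Playbook 06 defines.

```java
import java.util.HashMap;
import java.util.Map;

public class DeadLetterArgs {

    // Hypothetical names for illustration only; not taken from the lab app.
    static final String DLX_NAME = "orders.dlx";
    static final String DLQ_ROUTING_KEY = "orders.created.dlq";

    // Builds the optional queue arguments that tell RabbitMQ where to
    // republish messages that are rejected without requeue (or that expire).
    static Map<String, Object> deadLetterArguments(String exchange, String routingKey) {
        Map<String, Object> arguments = new HashMap<>();
        arguments.put("x-dead-letter-exchange", exchange);
        arguments.put("x-dead-letter-routing-key", routingKey);
        return arguments;
    }

    public static void main(String[] args) {
        System.out.println(deadLetterArguments(DLX_NAME, DLQ_ROUTING_KEY));
    }
}
```

In Spring AMQP, a map like this can be passed as the last argument of the `Queue(name, durable, exclusive, autoDelete, arguments)` constructor in your RabbitConfig. Two things to keep in mind: a message is dead-lettered only when it is rejected without requeue, so the listener also has to stop requeueing failures (for example via `spring.rabbitmq.listener.simple.default-requeue-rejected: false`, or a retry policy as described in Playbook 06). And RabbitMQ refuses to redeclare an existing queue with different arguments, so you will likely need to delete the lab queue before restarting the app.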

Step 2: Raise concurrency to a more realistic value:

spring:
  rabbitmq:
    listener:
      simple:
        concurrency: 3
        max-concurrency: 5

Step 3: Re-run the traffic generator from Part 1, Step 3, and confirm via the Management UI that US orders process immediately and continuously (no longer blocked behind EU messages), and that EU messages land in the DLQ after their retry attempts are exhausted instead of looping forever.
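If you prefer the command line over the Management UI, the same check can be made against the output of `rabbitmqctl list_queues name messages_ready messages_unacknowledged consumers`. The sketch below parses one row of that output and labels it. It assumes whitespace-separated columns in the order requested and a queue name without spaces; the class name and the three labels are illustrative, not an official vocabulary.

```java
public class QueueCheck {

    /** One row of list_queues output: name, ready, unacknowledged, consumers. */
    static final class QueueStats {
        final String name;
        final long ready;
        final long unacked;
        final int consumers;

        QueueStats(String name, long ready, long unacked, int consumers) {
            this.name = name;
            this.ready = ready;
            this.unacked = unacked;
            this.consumers = consumers;
        }
    }

    // Parses a single row, assuming the four columns appear in the order requested.
    static QueueStats parse(String line) {
        String[] cols = line.trim().split("\\s+");
        if (cols.length != 4) {
            throw new IllegalArgumentException("Expected 4 columns, got " + cols.length);
        }
        return new QueueStats(cols[0], Long.parseLong(cols[1]),
                Long.parseLong(cols[2]), Integer.parseInt(cols[3]));
    }

    // Coarse labels for a single snapshot of one queue.
    static String classify(QueueStats s) {
        if (s.consumers == 0) {
            return "NO_CONSUMERS";            // Playbook 01 signature
        }
        if (s.ready + s.unacked > 0) {
            return "BACKLOG_WITH_CONSUMERS";  // attached, but work is pending
        }
        return "DRAINED";
    }

    public static void main(String[] args) {
        System.out.println(classify(parse("orders.created.queue\t0\t0\t3")));
    }
}
```

A single snapshot cannot distinguish a stuck consumer from a busy one, so take two readings a few seconds apart: after the fix, the main queue should trend toward `DRAINED` while the DLQ's ready count rises to match the number of EU orders sent.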

Step 4: Write a two-sentence incident summary as if closing the ticket, stating the root cause and the fix. Compare it against the answer key above: did you capture both the bug and the amplifying config issue, or only one?


✅ Checkpoint

You should now be able to:

  • Work an ambiguous incident from symptom to root cause without being told which playbook applies.
  • Recognize when an incident has more than one contributing cause, and describe both instead of stopping at the first one you find.
  • Apply a fix (DLX + concurrency tuning) and verify recovery using the Management UI, rather than assuming a fix worked.

Next: Escalation and Communication