
Hands-On Lab: Incident Diagnosis

Break a running scenario deliberately (stop a broker, produce a poison message, throttle a consumer), then diagnose it top to bottom using the playbooks.

Reading the playbooks builds recognition; breaking things yourself builds diagnosis. This lab walks through three deliberate incidents in the local Order/Payment/Inventory/Notification system. For each one you follow the same top-to-bottom method: read the symptom, form a hypothesis, run the cheapest diagnostic, then confirm and remediate. The aim is to internalize the decision tree so real incidents feel familiar.


What you’ll be able to do after this module

  • Apply a single diagnostic decision tree to any Kafka incident.
  • Reproduce and diagnose a broker loss, a poison message, and a slow consumer.
  • Choose the correct playbook from the symptom, not a guess.
  • Confirm a remediation actually resolved the incident.

1. The diagnostic decision tree

Every incident starts the same way: read the symptom and let it route you to the right playbook. Resist jumping to a suspected cause; follow the cheapest checks first.

flowchart TD
    s["Alert / ticket"]
    s --> q1{"Cluster metric or app symptom?"}
    q1 -->|OfflinePartitions / controller| p3["Broker/Controller playbook"]
    q1 -->|UnderReplicated| p2["Under-Replicated playbook"]
    q1 -->|lag climbing| q2{"Consumers attached?"}
    q2 -->|no| p1a["Consumer-lag: absent"]
    q2 -->|yes, cycling rejoin| p4["Rebalance-storm playbook"]
    q2 -->|yes, slow| p1b["Consumer-lag: slow handler"]
    q1 -->|send errors| p5["Producer-failures playbook"]
    q1 -->|deserialization error| p6["Poison-message playbook"]
    q1 -->|disk alarm| p7["Disk/retention playbook"]
    q1 -->|auth error| p8["Auth-rotation playbook"]
    q1 -->|timeouts, one subnet| p9["AWS-layer playbook"]

The whole point of the tree is that the symptom, not your hunch, chooses the branch. Each leaf is one of the twelve Alert Playbooks.
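
As a rough illustration, the routing in the flowchart can be written as a small lookup. This is only a sketch: the symptom keys (such as "lag_climbing") and the function name route are hypothetical labels invented here, and only the branches drawn in the flowchart are covered.

```python
from typing import Optional

# Top-level branches of the flowchart: symptom -> playbook.
# The symptom keys are hypothetical labels, not real alert names.
PLAYBOOKS = {
    "offline_partitions": "Broker/Controller",
    "under_replicated": "Under-Replicated",
    "send_errors": "Producer-failures",
    "deserialization_error": "Poison-message",
    "disk_alarm": "Disk/retention",
    "auth_error": "Auth-rotation",
    "timeouts_one_subnet": "AWS-layer",
}


def route(symptom: str,
          consumers_attached: Optional[bool] = None,
          cycling_rejoin: bool = False) -> str:
    """Return the playbook the symptom routes to, following the flowchart."""
    if symptom == "lag_climbing":
        # Lag is the only branch with a second question: are consumers attached?
        if consumers_attached is None:
            raise ValueError("check whether consumers are attached first")
        if not consumers_attached:
            return "Consumer-lag: absent"
        return "Rebalance-storm" if cycling_rejoin else "Consumer-lag: slow handler"
    if symptom not in PLAYBOOKS:
        raise ValueError(f"unknown symptom: {symptom!r}")
    return PLAYBOOKS[symptom]


print(route("under_replicated"))
print(route("lag_climbing", consumers_attached=True, cycling_rejoin=True))
```

The function refuses to answer a lag symptom until the "consumers attached?" question has been checked, which mirrors the rule that the symptom, not a hunch, picks the branch.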


2. Incident A: a broker disappears

Simulate a broker loss and confirm the system rides it out.

  1. In the three-broker lab, run the Order producer and Payment consumer against the orders topic (replication factor 3, min.insync.replicas 2).
  2. Break it: docker stop kafka-2.
  3. Symptom: client logs show brief NOT_LEADER_FOR_PARTITION errors (reported as NOT_LEADER_OR_FOLLOWER by newer clients); kafka-topics --describe shows under-replicated partitions.
  4. Diagnose: follow Under-Replicated and Offline Partitions. Run kafka-topics --describe and confirm ISR shrank but partitions still serve.
  5. Confirm recovery: docker start kafka-2, watch the ISR refill, and note that produce/consume never stopped.

The lesson: with RF 3 and retries, one broker loss is a degraded state, not an outage.
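
The describe check in step 4 can be sketched as a small parser that compares each partition's ISR with its replica list and with min.insync.replicas. The sample text below is illustrative output written for this sketch, not captured from the lab; the broker ids are assumptions, and the exact columns printed by kafka-topics --describe vary between Kafka versions.

```python
import re

# Illustrative describe output while kafka-2 is stopped (broker ids are assumptions).
SAMPLE = (
    "Topic: orders\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,3\n"
    "Topic: orders\tPartition: 1\tLeader: 3\tReplicas: 2,3,1\tIsr: 3,1\n"
    "Topic: orders\tPartition: 2\tLeader: 3\tReplicas: 3,1,2\tIsr: 3,1\n"
)

LINE = re.compile(
    r"Partition:\s*(\d+)\s+Leader:\s*(\S+)\s+Replicas:\s*([\d,]+)\s+Isr:\s*([\d,]*)"
)


def partition_states(describe_output, min_insync):
    """Classify each partition line as under-replicated and/or still writable."""
    states = []
    for match in LINE.finditer(describe_output):
        partition, leader, replicas, isr = match.groups()
        replica_ids = replicas.split(",")
        isr_ids = [broker for broker in isr.split(",") if broker]
        states.append({
            "partition": int(partition),
            # Fewer in-sync replicas than assigned replicas: degraded.
            "under_replicated": len(isr_ids) < len(replica_ids),
            # acks=all writes need a leader and at least min.insync.replicas in the ISR.
            # Assumption: a leaderless partition is printed as "Leader: none".
            "accepts_acks_all": leader != "none" and len(isr_ids) >= min_insync,
        })
    return states


for state in partition_states(SAMPLE, min_insync=2):
    print(state)
```

With one of three brokers down, every partition in the sample is under-replicated but still has two in-sync replicas, which meets min.insync.replicas 2. That is the degraded-but-serving state this incident is meant to show.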


3. Incident B: a poison message

Introduce a record the consumer cannot process and unblock the partition.

  1. Configure the Payment consumer with ErrorHandlingDeserializer and a DefaultErrorHandler + DLT, as in Retry and Error Handling.
  2. Break it: produce one malformed record to orders with the console producer.
  3. Symptom: lag climbs on one partition; logs repeat a SerializationException (or, without the error handler, the partition freezes).
  4. Diagnose: follow Poison Messages, Deserialization Errors, and the DLT. Inspect the offending offset with the console consumer.
  5. Confirm: the bad record lands in orders.DLT, good records keep flowing, and lag drains.

The lesson: correct error-handling config turns a partition-blocker into a routed DLT record.
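
To make the idea concrete, here is a plain-Python simulation of the routing behavior. It is not Spring Kafka code and does not use ErrorHandlingDeserializer or DefaultErrorHandler; the function consume, the JSON payloads, and the orderId field are all hypothetical. It only shows the principle: a record that cannot be decoded is set aside with its offset, and the loop moves on to the next record.

```python
import json


def consume(records, handle, dead_letters):
    """Process records in offset order; route undecodable ones aside instead of stalling."""
    for offset, raw in enumerate(records):
        try:
            event = json.loads(raw)
        except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
            # Keep the offset and raw bytes so the record can be inspected later.
            dead_letters.append((offset, raw, str(exc)))
            continue
        handle(event)


# One malformed record between two good ones (payloads are made up).
records = [b'{"orderId": 1}', b'not-json{', b'{"orderId": 2}']
processed, dead_letters = [], []
consume(records, processed.append, dead_letters)

print("processed:", processed)
print("dead-lettered offsets:", [offset for offset, _, _ in dead_letters])
```

Without the except branch, the same loop would raise on the malformed record every time it was retried, which is the frozen-partition behavior described in step 3.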


4. Incident C: a slow consumer and a storm

Throttle the consumer to create lag, then push it into a rebalance storm.

  1. Set the Payment consumer to concurrency: 1 and max-poll-records: 500, and add a 50 ms sleep per record. Set a low max.poll.interval.ms (for example 20000).
  2. Break it: produce a few thousand OrderCreated records.
  3. Symptom: lag climbs; then logs cycle through revoke/rejoin, because a full 500-record batch takes about 25 s (500 × 50 ms) and breaches the 20 s poll interval.
  4. Diagnose: start at Consumer Lag and Stuck Consumers; when you see cycling rejoins, move to Rebalance Storms.
  5. Remediate: lower max-poll-records to 50 and raise concurrency, restart, and watch the storm stop and lag drain.

The lesson: one symptom (lag) can have layered causes; the tree routes you from lag to the storm.
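
The storm condition comes down to simple arithmetic, sketched below with the lab's numbers. The helper names are hypothetical, and the estimate assumes full batches and counts only the per-record sleep, ignoring all other processing overhead.

```python
def batch_duration_ms(max_poll_records, per_record_ms):
    """Estimated time to process one full batch between two poll() calls."""
    return max_poll_records * per_record_ms


def breaches_poll_interval(max_poll_records, per_record_ms, max_poll_interval_ms):
    """True when one full batch takes longer than max.poll.interval.ms."""
    return batch_duration_ms(max_poll_records, per_record_ms) > max_poll_interval_ms


# Broken configuration from step 1: 500 records x 50 ms against a 20 s interval.
print(batch_duration_ms(500, 50), breaches_poll_interval(500, 50, 20_000))

# Remediated configuration from step 5: 50 records x 50 ms against the same interval.
print(batch_duration_ms(50, 50), breaches_poll_interval(50, 50, 20_000))
```

The first configuration needs about 25 s per batch and breaches the 20 s interval; the second needs about 2.5 s and stays well inside it. Running this kind of check before changing max-poll-records is a cheap way to predict whether a handler will trigger rejoins.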


5. Guided practical

Run all three incidents back to back as a single on-call simulation.

  1. Trigger Incident A, diagnose and recover, then write one sentence stating the root cause and the fix.
  2. Repeat for Incidents B and C.
  3. For each, note which playbook the symptom routed you to and the single cheapest diagnostic that confirmed it.
  4. Time yourself: the goal is a confident diagnosis within a few minutes per incident.

Next: Escalation and Communication, for when a playbook tells you to escalate.