Hands-On Lab: Incident Diagnosis
Break a running scenario deliberately (stop a broker, deploy a poison message, throttle a consumer), then diagnose it top to bottom using the playbooks.
Reading the playbooks builds recognition; breaking things yourself builds diagnosis. This lab walks through three deliberate incidents in the local Order/Payment/Inventory/Notification system. For each one you follow the same top-to-bottom method: read the symptom, form a hypothesis, run the cheapest diagnostic, then confirm and remediate. The aim is to internalize the decision tree so real incidents feel familiar.
What you’ll be able to do after this module
- Apply a single diagnostic decision tree to any Kafka incident.
- Reproduce and diagnose a broker loss, a poison message, and a slow consumer.
- Choose the correct playbook from the symptom, not a guess.
- Confirm a remediation actually resolved the incident.
1. The diagnostic decision tree
Every incident starts the same way: read the symptom and let it route you to the right playbook. Resist jumping to a suspected cause; follow the cheapest checks first.
flowchart TD
s["Alert / ticket"]
s --> q1{"Cluster metric or app symptom?"}
q1 -->|OfflinePartitions / controller| p3["Broker/Controller playbook"]
q1 -->|UnderReplicated| p2["Under-Replicated playbook"]
q1 -->|lag climbing| q2{"Consumers attached?"}
q2 -->|no| p1a["Consumer-lag: absent"]
q2 -->|yes, cycling rejoin| p4["Rebalance-storm playbook"]
q2 -->|yes, slow| p1b["Consumer-lag: slow handler"]
q1 -->|send errors| p5["Producer-failures playbook"]
q1 -->|deserialization error| p6["Poison-message playbook"]
q1 -->|disk alarm| p7["Disk/retention playbook"]
q1 -->|auth error| p8["Auth-rotation playbook"]
q1 -->|timeouts, one subnet| p9["AWS-layer playbook"]
The whole point of the tree is that the symptom, not your hunch, chooses the branch. Each leaf is one of the twelve Alert Playbooks.
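The routing in the flowchart can be sketched as a small lookup. This is only an illustration: the symptom keys and the `route` function are hypothetical names, and only the playbook labels are taken from the tree above.

```python
# Symptom keys are hypothetical; playbook labels are copied from the flowchart.
ROUTES = {
    "offline_partitions": "Broker/Controller playbook",
    "under_replicated": "Under-Replicated playbook",
    "send_errors": "Producer-failures playbook",
    "deserialization_error": "Poison-message playbook",
    "disk_alarm": "Disk/retention playbook",
    "auth_error": "Auth-rotation playbook",
    "timeouts_one_subnet": "AWS-layer playbook",
}


def route(symptom, consumers_attached=None, cycling_rejoin=False):
    """Return the playbook a symptom routes to, following the tree's branches."""
    if symptom == "lag_climbing":
        # Lag is the only branch with a second question: are consumers attached?
        if consumers_attached is None:
            raise ValueError("check whether consumers are attached first")
        if not consumers_attached:
            return "Consumer-lag: absent"
        if cycling_rejoin:
            return "Rebalance-storm playbook"
        return "Consumer-lag: slow handler"
    if symptom not in ROUTES:
        raise ValueError(f"unknown symptom: {symptom!r}")
    return ROUTES[symptom]
```

Note that the function refuses to answer a lag symptom until the cheap check (are consumers attached?) has been made, which mirrors the rule that the symptom chooses the branch.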
2. Incident A: a broker disappears
Simulate a broker loss and confirm the system rides it out.
- In the three-broker lab, run the Order producer and Payment consumer against orders (RF 3, min.insync.replicas=2).
- Break it: docker stop kafka-2.
- Symptom: client logs show brief NOT_LEADER_FOR_PARTITION; a describe shows under-replicated partitions.
- Diagnose: follow Under-Replicated and Offline Partitions. Run kafka-topics --describe and confirm ISR shrank but partitions still serve.
- Confirm recovery: docker start kafka-2, watch ISR refill, and note produce/consume never stopped.
The lesson: with RF 3 and retries, one broker loss is a degraded state, not an outage.
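The arithmetic behind that lesson can be sketched as follows. The function name and state labels are hypothetical, and the sketch assumes every stopped broker hosts a replica of the partition, the surviving replicas stay in sync, and the producer uses acks=all.

```python
def partition_state(replication_factor, min_insync_replicas, brokers_down):
    """Classify a partition after some of its replica brokers stop.

    Assumes each stopped broker held one replica and the rest remain in sync.
    """
    in_sync = max(replication_factor - brokers_down, 0)
    if in_sync == 0:
        return "offline"
    if in_sync < min_insync_replicas:
        # A leader still exists, but acks=all writes are rejected.
        return "writes rejected for acks=all"
    if in_sync < replication_factor:
        return "degraded (under-replicated, still serving)"
    return "healthy"


# The lab's settings: RF 3, min.insync.replicas=2.
for down in range(4):
    print(down, "broker(s) down:", partition_state(3, 2, down))
```

With the lab's settings, one stopped broker leaves two in-sync replicas, which still meets the minimum of two, so the partition is under-replicated but keeps accepting writes.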
3. Incident B: a poison message
Introduce a record the consumer cannot process and unblock the partition.
- Configure the Payment consumer with ErrorHandlingDeserializer and a DefaultErrorHandler + DLT, as in Retry and Error Handling.
- Break it: produce one malformed record to orders with the console producer.
- Symptom: lag climbs on one partition; logs repeat a SerializationException (or, without the error handler, the partition freezes).
- Diagnose: follow Poison Messages, Deserialization Errors, and the DLT. Inspect the offending offset with the console consumer.
- Confirm: the bad record lands in orders.DLT, good records keep flowing, and lag drains.
The lesson: correct error-handling config turns a partition-blocker into a routed DLT record.
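To see why routing the bad record aside unblocks the partition, here is a minimal simulation of the idea. It is not the Spring Kafka API: `consume`, `dead_letters`, and the JSON payload format are all assumptions made for illustration.

```python
import json


def consume(records, handler, dead_letters):
    """Process records in offset order; set undecodable ones aside and move on."""
    for offset, raw in enumerate(records):
        try:
            event = json.loads(raw)
        except ValueError as exc:  # covers JSON and byte-decoding failures
            # Stand-in for publishing to a dead-letter topic such as orders.DLT.
            dead_letters.append({"offset": offset, "payload": raw, "error": str(exc)})
            continue  # advance past the poison record instead of retrying forever
        handler(event)


records = [b'{"orderId": 1}', b"{not json", b'{"orderId": 2}']
processed, dead_letters = [], []
consume(records, processed.append, dead_letters)
print(len(processed), "processed;", len(dead_letters), "dead-lettered")
```

The key design choice is the `continue`: the consumer records the failure and advances, so the records behind the poison message are still handled.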
4. Incident C: a slow consumer and a storm
Throttle the consumer to create lag, then push it into a rebalance storm.
- Set the Payment consumer to concurrency: 1, max-poll-records: 500, and add a 50 ms sleep per record. Set a low max.poll.interval.ms (for example 20000).
- Break it: produce a few thousand OrderCreated records.
- Symptom: lag climbs; then logs cycle through revoke/rejoin as batches breach the poll interval.
- Diagnose: start at Consumer Lag and Stuck Consumers; when you see cycling rejoins, move to Rebalance Storms.
- Remediate: lower max-poll-records to 50 and raise concurrency, restart, and watch the storm stop and lag drain.
The lesson: one symptom (lag) can have layered causes; the tree routes you from lag to the storm.
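The storm in this incident follows from simple arithmetic, sketched below. The function names are hypothetical, and the sketch assumes each poll returns a full batch and that handling time is dominated by the 50 ms sleep.

```python
def batch_time_ms(max_poll_records, per_record_ms):
    """Worst-case time between polls when a full batch is returned."""
    return max_poll_records * per_record_ms


def breaches_poll_interval(max_poll_records, per_record_ms, max_poll_interval_ms):
    """True if a full batch takes longer than max.poll.interval.ms to process."""
    return batch_time_ms(max_poll_records, per_record_ms) > max_poll_interval_ms


# Lab settings before and after the remediation (interval 20000 ms, 50 ms per record).
for records in (500, 50):
    took = batch_time_ms(records, 50)
    print(records, "records:", took, "ms, breach =", breaches_poll_interval(records, 50, 20000))
```

At 500 records a full batch needs 25000 ms, which exceeds the 20000 ms interval, so the consumer is evicted and rejoins. At 50 records the batch needs 2500 ms and stays well inside the interval.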
5. Guided practical
Run all three incidents back to back as a single on-call simulation.
- Trigger Incident A, diagnose and recover, then write one sentence stating the root cause and the fix.
- Repeat for Incidents B and C.
- For each, note which playbook the symptom routed you to and the single cheapest diagnostic that confirmed it.
- Time yourself: the goal is a confident diagnosis within a few minutes per incident.
Next: Escalation and Communication, for when a playbook tells you to escalate.