Kafka Hands-On Lab: End-to-End Incident Diagnosis

Reading the playbooks builds recognition; breaking things yourself builds diagnosis. This lab walks through three deliberate incidents in the local Order/Payment/Inventory/Notification system. For each one you follow the same top-to-bottom method: read the symptom, form a hypothesis, run the cheapest diagnostic, then confirm and remediate. The aim is to internalize the decision tree so that real incidents feel familiar.

What you’ll be able to do after this module

Apply a single diagnostic decision tree to any Kafka incident.
Reproduce and diagnose a broker loss, a poison message, and a slow consumer.
Choose the correct playbook from the symptom, not a guess.
Confirm a remediation actually resolved the incident.

1. The diagnostic decision tree

Every incident starts the same way: read the symptom and let it route you to the right playbook. Resist jumping to a suspected cause; follow the cheapest checks first.

flowchart TD
    s["Alert / ticket"]
    s --> q1{"Cluster metric or app symptom?"}
    q1 -->|OfflinePartitions / controller| p3["Broker/Controller playbook"]
    q1 -->|UnderReplicated| p2["Under-Replicated playbook"]
    q1 -->|lag climbing| q2{"Consumers attached?"}
    q2 -->|no| p1a["Consumer-lag: absent"]
    q2 -->|yes, cycling rejoin| p4["Rebalance-storm playbook"]
    q2 -->|yes, slow| p1b["Consumer-lag: slow handler"]
    q1 -->|send errors| p5["Producer-failures playbook"]
    q1 -->|deserialization error| p6["Poison-message playbook"]
    q1 -->|disk alarm| p7["Disk/retention playbook"]
    q1 -->|auth error| p8["Auth-rotation playbook"]
    q1 -->|timeouts, one subnet| p9["AWS-layer playbook"]

The whole point of the tree is that the symptom, not your hunch, chooses the branch. Each leaf is one of the twelve Alert Playbooks.
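The routing in the flowchart can be written down as a small lookup, which makes the "symptom chooses the branch" rule explicit. This is a minimal sketch, not part of the lab tooling: the symptom keys (such as "lag_climbing") and the function name route are hypothetical labels chosen for illustration, while the playbook names are copied from the tree above.

```python
from typing import Optional

# Branches that need no follow-up question: symptom key -> playbook.
# The keys are hypothetical labels; the playbook names come from the flowchart.
ROUTES = {
    "offline_partitions": "Broker/Controller playbook",
    "under_replicated": "Under-Replicated playbook",
    "send_errors": "Producer-failures playbook",
    "deserialization_error": "Poison-message playbook",
    "disk_alarm": "Disk/retention playbook",
    "auth_error": "Auth-rotation playbook",
    "subnet_timeouts": "AWS-layer playbook",
}


def route(symptom: str,
          consumers_attached: Optional[bool] = None,
          cycling_rejoin: bool = False) -> str:
    """Return the playbook for a symptom; only lag needs the second question."""
    if symptom == "lag_climbing":
        if consumers_attached is None:
            raise ValueError("check whether consumers are attached first")
        if not consumers_attached:
            return "Consumer-lag: absent"
        if cycling_rejoin:
            return "Rebalance-storm playbook"
        return "Consumer-lag: slow handler"
    if symptom not in ROUTES:
        raise ValueError(f"unknown symptom: {symptom!r}")
    return ROUTES[symptom]
```

Note that the function refuses to route a lag symptom until the "Consumers attached?" check has been answered, mirroring the tree's insistence on running the cheap check before picking a playbook.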

2. Incident A: a broker disappears

Simulate a broker loss and confirm the system rides it out.

In the three-broker lab, run the Order producer and Payment consumer against orders (RF 3, min.insync.replicas 2).
Break it: docker stop kafka-2.
Symptom: client logs show brief NOT_LEADER_FOR_PARTITION errors (reported as NOT_LEADER_OR_FOLLOWER by newer Kafka versions); kafka-topics --describe shows under-replicated partitions.
Diagnose: follow Under-Replicated and Offline Partitions. Run kafka-topics --describe and confirm ISR shrank but partitions still serve.
Confirm recovery: docker start kafka-2, watch the ISR refill, and note that produce/consume never stopped.

The lesson: with RF 3, min.insync.replicas 2, and producer retries, losing one broker is a degraded state, not an outage.
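The "ISR shrank but partitions still serve" check can be made mechanical by comparing each partition's Isr list against its Replicas list and against min.insync.replicas. The following is a rough sketch that parses the text printed by kafka-topics --describe. The SAMPLE lines are hand-written for illustration (they assume kafka-2 has broker id 2 and were not captured from the lab), and partition_health is a hypothetical helper name.

```python
import re

# Matches the per-partition lines of `kafka-topics --describe`.
# The topic summary line (PartitionCount, ReplicationFactor) does not match.
LINE = re.compile(r"Partition:\s*(\d+).*?Replicas:\s*([\d,]+).*?Isr:\s*([\d,]*)")


def partition_health(describe_output: str, min_isr: int = 2):
    """Return (partition, under_replicated, accepts_acks_all) per partition."""
    rows = []
    for line in describe_output.splitlines():
        match = LINE.search(line)
        if not match:
            continue
        partition = int(match.group(1))
        replicas = [b for b in match.group(2).split(",") if b]
        isr = [b for b in match.group(3).split(",") if b]
        under_replicated = len(isr) < len(replicas)
        # acks=all writes succeed only while the ISR size meets min.insync.replicas
        accepts_acks_all = len(isr) >= min_isr
        rows.append((partition, under_replicated, accepts_acks_all))
    return rows


# Illustrative output while broker 2 is stopped (hand-written, not a real capture).
SAMPLE = (
    "Topic: orders\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,3\n"
    "Topic: orders\tPartition: 1\tLeader: 3\tReplicas: 2,3,1\tIsr: 3,1\n"
    "Topic: orders\tPartition: 2\tLeader: 3\tReplicas: 3,1,2\tIsr: 3,1\n"
)

for row in partition_health(SAMPLE):
    print(row)
```

With one of three replicas missing, every partition is under-replicated yet still has two in-sync replicas, which is why writes with acks=all keep succeeding. Losing a second broker would drop the ISR below min.insync.replicas and the second flag would turn false.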

3. Incident B: a poison message

Introduce a record the consumer cannot process and unblock the partition.

Configure the Payment consumer with ErrorHandlingDeserializer and a DefaultErrorHandler + DLT, as in Retry and Error Handling.
Break it: produce one malformed record to orders with the console producer.
Symptom: lag climbs on one partition and logs show a deserialization failure; without the error-handling configuration, the same SerializationException repeats and the partition freezes.
Diagnose: follow Poison Messages, Deserialization Errors, and the DLT. Inspect the offending offset with the console consumer.
Confirm: the bad record lands in orders.DLT, good records keep flowing, and lag drains.

The lesson: correct error-handling config turns a partition-blocker into a routed DLT record.
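The mechanism behind that lesson is simple to show outside Spring. The sketch below is a language-neutral illustration in plain Python, not the Spring Kafka API: consume, handle, and dead_letters are hypothetical names, and JSON stands in for whatever format the lab's OrderCreated records use. It shows the one property that matters, namely that a record that cannot be decoded is set aside and the loop moves on to the next offset.

```python
import json


def consume(records, handle, dead_letters):
    """Process (offset, raw_bytes) pairs in order, diverting undecodable ones."""
    for offset, raw in records:
        try:
            event = json.loads(raw)
        except ValueError as exc:  # JSONDecodeError and UnicodeDecodeError are ValueErrors
            # Keep the offset, payload, and reason so the record can be inspected later.
            dead_letters.append((offset, raw, str(exc)))
            continue  # move past the bad offset instead of retrying it forever
        handle(event)


records = [
    (0, b'{"orderId": 1}'),
    (1, b"not-json"),  # the poison message
    (2, b'{"orderId": 2}'),
]
handled, dead_letters = [], []
consume(records, handled.append, dead_letters)
print(handled)
print([offset for offset, _, _ in dead_letters])
```

Without the except branch, the failure at offset 1 would be raised on every attempt and offset 2 would never be reached, which is the frozen-partition behavior described in the symptom.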

4. Incident C: a slow consumer and a storm

Throttle the consumer to create lag, then push it into a rebalance storm.

Set the Payment consumer to concurrency: 1 and max-poll-records: 500, add a 50 ms sleep per record, and set a low max.poll.interval.ms (for example 20000).
Break it: produce a few thousand OrderCreated records.
Symptom: lag climbs; then logs cycle through revoke/rejoin as batches breach the poll interval.
Diagnose: start at Consumer Lag and Stuck Consumers; when you see cycling rejoins, move to Rebalance Storms.
Remediate: lower max-poll-records to 50 and raise concurrency (no higher than the partition count, since extra consumers sit idle), restart, and watch the storm stop and lag drain.

The lesson: one symptom (lag) can have layered causes; the tree routes you from lag to the storm.
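The storm in this incident follows directly from the numbers in the setup, and the arithmetic is worth doing before touching any config. The sketch below assumes the worst case, where every poll returns a full batch and the per-record sleep is the only processing cost. The function names are hypothetical.

```python
def batch_ms(max_poll_records: int, per_record_ms: int) -> int:
    """Worst-case time to process one full poll batch, in milliseconds."""
    return max_poll_records * per_record_ms


def breaches_poll_interval(max_poll_records: int,
                           per_record_ms: int,
                           max_poll_interval_ms: int) -> bool:
    """True if a full batch takes longer than max.poll.interval.ms."""
    return batch_ms(max_poll_records, per_record_ms) > max_poll_interval_ms


# Incident setup: 500 records x 50 ms = 25000 ms, over the 20000 ms interval.
print(batch_ms(500, 50), breaches_poll_interval(500, 50, 20000))

# After remediation: 50 records x 50 ms = 2500 ms, well inside the interval.
print(batch_ms(50, 50), breaches_poll_interval(50, 50, 20000))
```

A full batch of 500 needs 25 seconds against a 20-second limit, so the consumer is evicted from the group and the batch is redelivered, which is the revoke/rejoin cycle in the logs. Dropping to 50 records per poll brings the batch down to 2.5 seconds.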

5. Guided practical

Run all three incidents back to back as a single on-call simulation.

Trigger Incident A, diagnose and recover, then write one sentence stating the root cause and the fix.
Repeat for Incidents B and C.
For each, note which playbook the symptom routed you to and the single cheapest diagnostic that confirmed it.
Time yourself: the goal is a confident diagnosis within a few minutes per incident.

Next: Escalation and Communication, for when a playbook tells you to escalate.