Assessment

Scenario-based self-check spanning all ten sections, from foundations and architecture to reliability, EOS, streams, production readiness, and operations.


Scenario-based questions grouped by section, covering the full course. Try to answer each one before expanding the answer. If you get one wrong, revisit the linked module rather than just reading the correction: the point is to fix the gap, not to see the right answer.

This is self-assessment, not a graded test. If you can confidently answer most of these without peeking, you are ready to design and operate event-driven systems with Kafka. A scoring guide is at the end.


Section 1: Foundations

1. A teammate asks why you use Kafka instead of having the Order service call the Payment service directly over REST. Give a one-sentence answer.

Answer Decoupling, durability, and replay: the producer does not need to know who consumes or wait for them, events are retained on disk so consumers can read at their own pace or reprocess, and new consumers can be added without touching the producer. See Why Kafka.

2. What is the smallest unit of parallelism for a consumer group, and what does that imply about consumers versus partitions?

Answer The partition. A partition is read by at most one consumer in a group, so the number of partitions is the ceiling on group parallelism. Adding consumers beyond the partition count leaves the extras idle. See Core Concepts.
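A minimal sketch of that ceiling, assuming a simplified round-robin assignment. Kafka's real assignors (range, sticky, cooperative-sticky) differ in detail but share the one-consumer-per-partition rule; the class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a simplified round-robin assignment, not Kafka's actual assignor. */
final class GroupAssignmentSketch {

    /** Returns, for each consumer index, the partition numbers it owns. Requires consumers >= 1. */
    static List<List<Integer>> assign(int partitions, int consumers) {
        List<List<Integer>> owned = new ArrayList<>();
        for (int c = 0; c < consumers; c++) {
            owned.add(new ArrayList<>());
        }
        // Each partition goes to exactly one consumer in the group.
        for (int p = 0; p < partitions; p++) {
            owned.get(p % consumers).add(p);
        }
        return owned;
    }

    /** Counts consumers that received no partition and therefore sit idle. */
    static long idleConsumers(int partitions, int consumers) {
        return assign(partitions, consumers).stream().filter(List::isEmpty).count();
    }

    public static void main(String[] args) {
        // Three partitions, five consumers: two consumers have nothing to read.
        System.out.println(assign(3, 5));
        System.out.println("idle = " + idleConsumers(3, 5));
    }
}
```

With three partitions and five consumers, two consumers end up with an empty assignment, which is why scaling a group past the partition count buys nothing.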

Section 2: Architecture and Internals

3. A partition shows Replicas: 1,2,3 but Isr: 1,3. What does this mean and should you be alarmed?

Answer Broker 2's replica has fallen out of the in-sync replica set (behind or unreachable). The partition still serves from 1 and 3, so it is degraded, not down. Investigate whether broker 2 is recovering; if ISR drops below min.insync.replicas, acks=all writes will fail. See Cluster Anatomy and Under-Replicated Partitions.
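The two checks in that answer reduce to simple comparisons. The sketch below assumes min.insync.replicas=2, which the question does not state; the class and method names are hypothetical.

```java
/** Illustrative only: the two comparisons implied by Replicas vs Isr. */
final class IsrCheckSketch {

    /** A partition is under-replicated when some assigned replica is not in sync. */
    static boolean underReplicated(int replicaCount, int isrSize) {
        return isrSize < replicaCount;
    }

    /** With acks=all, writes are rejected once the ISR is smaller than min.insync.replicas. */
    static boolean acksAllWriteAccepted(int isrSize, int minInsyncReplicas) {
        return isrSize >= minInsyncReplicas;
    }

    public static void main(String[] args) {
        int minInsyncReplicas = 2; // assumed setting, not given in the question

        // Replicas: 1,2,3 and Isr: 1,3 -> degraded but still writable.
        System.out.println(underReplicated(3, 2));                      // true
        System.out.println(acksAllWriteAccepted(2, minInsyncReplicas)); // true

        // If one more replica drops out, acks=all writes start failing.
        System.out.println(acksAllWriteAccepted(1, minInsyncReplicas)); // false
    }
}
```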

4. In KRaft mode, how many controllers can a three-controller quorum lose while staying operational, and what happens if it loses more?

Answer It tolerates losing one (a majority of two remains). Losing two of three breaks the majority, so no controller can be elected and metadata operations stall cluster-wide. See Control Plane.
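The majority arithmetic behind that answer can be sketched as follows. This is generic quorum math, not KRaft code, and the names are hypothetical.

```java
/** Illustrative only: majority-quorum arithmetic. */
final class QuorumMath {

    /** Votes needed for a strict majority of the given number of voters. */
    static int majority(int voters) {
        return voters / 2 + 1;
    }

    /** How many voters can be lost while a majority still remains. */
    static int tolerableFailures(int voters) {
        return voters - majority(voters);
    }

    /** True when the surviving voters can still form a majority. */
    static boolean hasQuorum(int voters, int alive) {
        return alive >= majority(voters);
    }

    public static void main(String[] args) {
        System.out.println(tolerableFailures(3)); // 1
        System.out.println(hasQuorum(3, 2));      // true
        System.out.println(hasQuorum(3, 1));      // false
        System.out.println(tolerableFailures(5)); // 2
    }
}
```

Note that four voters tolerate only one failure, the same as three, which is why quorums are usually sized with an odd number of members.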

Section 3: Building with Spring

5. Two events for the same customer must be processed in order. What determines whether that ordering holds?

Answer The key. Ordering is guaranteed per partition only, so both events must share a key (the customer id) that routes them to the same partition. With different or null keys they can land on different partitions and be processed out of order. See Producing Deeper.
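A minimal sketch of key-based routing, assuming a stand-in hash. Kafka's default partitioner hashes the serialized key bytes with murmur2; String.hashCode is used here only to keep the example self-contained, and the names are hypothetical.

```java
/** Illustrative only: hash-mod routing with a stand-in hash, not Kafka's murmur2 partitioner. */
final class KeyRoutingSketch {

    /** Maps a key to a partition in [0, numPartitions). The same key always maps to the same partition. */
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int partitions = 6; // assumed partition count

        // Two events for the same customer share a key, so they share a partition.
        int first = partitionFor("customer-42", partitions);
        int second = partitionFor("customer-42", partitions);
        System.out.println(first == second); // true

        // A different key may or may not land on the same partition.
        System.out.println(partitionFor("customer-7", partitions));
    }
}
```

Because the mapping depends on the partition count, adding partitions to a topic can send an existing key to a different partition from that point on.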

6. Your @KafkaListener handler is slow and you see repeated rejoining group log lines. What is happening and what is the first fix?

Answer The handler takes longer than max.poll.interval.ms between polls, so the consumer is evicted from the group mid-batch and then rejoins; when this repeats, it becomes a rebalance storm. First fix: lower max.poll.records so a batch finishes within the interval, or move slow work off the listener thread, rather than just raising the timeout. See Rebalancing and Rebalance Storms.
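A rough budget check makes the fix concrete. The sketch assumes records in a batch are processed sequentially at a fixed cost per record, and uses the client defaults of max.poll.records=500 and max.poll.interval.ms=300000; the one-second handler cost and the names are hypothetical.

```java
/** Illustrative only: does a full batch fit inside max.poll.interval.ms? */
final class PollBudgetSketch {

    /** True when processing a full batch takes longer than the allowed gap between polls. */
    static boolean breachesInterval(int maxPollRecords, long perRecordMs, long maxPollIntervalMs) {
        return maxPollRecords * perRecordMs > maxPollIntervalMs;
    }

    /** Largest batch that still finishes within the interval (no headroom included). */
    static long maxSafeRecords(long perRecordMs, long maxPollIntervalMs) {
        return maxPollIntervalMs / perRecordMs;
    }

    public static void main(String[] args) {
        long perRecordMs = 1_000L;         // assumed handler cost per record
        long maxPollIntervalMs = 300_000L; // client default (5 minutes)

        System.out.println(breachesInterval(500, perRecordMs, maxPollIntervalMs)); // true
        System.out.println(breachesInterval(100, perRecordMs, maxPollIntervalMs)); // false
        System.out.println(maxSafeRecords(perRecordMs, maxPollIntervalMs));        // 300
    }
}
```

In practice you would leave headroom below the computed maximum, since per-record cost is rarely constant.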

Section 4: Schema Management

7. You want to add a field to OrderCreated without breaking existing consumers under BACKWARD compatibility. How?

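Answer Add the field as optional with a default value. Under BACKWARD compatibility the new schema must be able to read data written with the previous one, so consumers upgraded to the new schema fill in the default when an older record lacks the field, while consumers still on the old schema typically ignore the field they do not know. Adding a field without a default, or renaming a field without an alias, would be breaking. See Schema Registry and Schema Incompatibility.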

Section 5: Reliability and Delivery Semantics

8. What combination gives at-least-once delivery, and why is an idempotent consumer still required?

Answer Producer acks=all with retries and idempotence, and a consumer that commits offsets after processing. At-least-once still allows duplicates (from retries or redelivery after a rebalance), so the consumer must be idempotent to avoid double effects. See Delivery Guarantees and Idempotency and Ordering.
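A minimal sketch of an idempotent handler, assuming each event carries a unique id. The in-memory set is a stand-in; a real consumer would keep processed ids in a durable store, ideally updated in the same transaction as the side effect. All names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: deduplicate by event id so redelivery has no second effect. */
final class IdempotentHandlerSketch {

    private final Set<String> processedIds = new HashSet<>(); // stand-in for a durable store
    private int effectsApplied = 0;

    /** Applies the side effect once per event id. Returns false for a duplicate delivery. */
    boolean handle(String eventId) {
        if (!processedIds.add(eventId)) {
            return false; // already seen: skip the side effect
        }
        effectsApplied++; // stand-in for the real business effect
        return true;
    }

    int effectsApplied() {
        return effectsApplied;
    }

    /** Feeds a sequence of deliveries through a fresh handler and returns the number of effects. */
    static int effectsAfter(String... deliveredIds) {
        IdempotentHandlerSketch handler = new IdempotentHandlerSketch();
        for (String id : deliveredIds) {
            handler.handle(id);
        }
        return handler.effectsApplied();
    }

    public static void main(String[] args) {
        // evt-1 is redelivered after a rebalance, but only two effects are applied.
        System.out.println(effectsAfter("evt-1", "evt-2", "evt-1")); // 2
    }
}
```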

9. A producer on acks=all starts throwing NotEnoughReplicasException. Is this an application bug? What do you do?

Answer No, it is a cluster-health issue: the ISR has shrunk below min.insync.replicas. Do not lower min.insync.replicas to make the write succeed; that removes the durability guarantee. Investigate broker and ISR health, then restore replicas or escalate. See Reliable Producing and Producer Failures.

10. What does exactly-once (EOS) actually guarantee in Kafka, and what does it not?

Answer It guarantees atomic read-process-write within Kafka: consumed offsets and produced records commit together, and read_committed consumers never see aborted output. It does not extend exactly-once to arbitrary external side effects (a third-party charge), which still need idempotency. See Transactions and EOS.

Section 6: Event-Driven and Advanced

11. What problem does the transactional outbox solve, and how?

Answer The dual-write problem: writing to the database and publishing to Kafka are not atomic, so a crash between them loses or duplicates events. The outbox writes the event to an outbox table in the same DB transaction as the business change, so recording the event is atomic with the state change. A relay (poller or CDC) then publishes it, and can safely retry until it succeeds. See Outbox and CDC.
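The shape of the pattern can be sketched with in-memory stand-ins. Nothing here talks to a real database or to Kafka: the lists simulate tables and a topic, the "transaction" is simulated by committing both rows together or not at all, and every name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: in-memory stand-ins for a database with an outbox table and a Kafka topic. */
final class OutboxSketch {

    final List<String> orders = new ArrayList<>(); // business table
    final List<String> outbox = new ArrayList<>(); // outbox table in the same "database"
    final List<String> topic = new ArrayList<>();  // stand-in for the Kafka topic

    /** One simulated DB transaction: the order row and the outbox row commit together or not at all. */
    void placeOrder(String orderId, boolean failBeforeCommit) {
        if (failBeforeCommit) {
            return; // rollback: neither table changes
        }
        orders.add(orderId);
        outbox.add("OrderCreated:" + orderId);
    }

    /** Relay step: publish every pending outbox row, then clear it. Returns the number published. */
    int relayOnce() {
        int published = outbox.size();
        topic.addAll(outbox);
        outbox.clear();
        return published;
    }

    /** Runs a small scenario and returns what ended up on the topic. */
    static List<String> demo() {
        OutboxSketch db = new OutboxSketch();
        db.placeOrder("o-1", false); // commits order and event together
        db.placeOrder("o-2", true);  // rolled back: no order, no event
        db.relayOnce();
        return db.topic;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [OrderCreated:o-1]
    }
}
```

A real relay can crash after publishing but before marking the row as sent, so it delivers at least once, and downstream consumers still need to be idempotent.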

12. In Kafka Streams, what is a changelog topic and why does it exist?

Answer It is the backing topic Kafka Streams uses to persist and restore a state store. State is kept locally for fast access, and every update is also written to the compacted changelog so the store can be rebuilt on another instance after a failure or rebalance. See Kafka Streams.
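Restoring a store from a compacted changelog amounts to replaying updates so that the latest value per key wins. The sketch below assumes a simple key-to-count store and treats a null value as a tombstone; it does not use the Kafka Streams API, and the names are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: rebuild a key-value store by replaying its changelog in order. */
final class ChangelogRestoreSketch {

    /** Replays updates in offset order; later values overwrite earlier ones, null deletes the key. */
    static Map<String, Long> restore(List<Map.Entry<String, Long>> changelog) {
        Map<String, Long> store = new LinkedHashMap<>();
        for (Map.Entry<String, Long> update : changelog) {
            if (update.getValue() == null) {
                store.remove(update.getKey()); // tombstone
            } else {
                store.put(update.getKey(), update.getValue());
            }
        }
        return store;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> changelog = List.of(
                new SimpleEntry<String, Long>("customer-1", 1L),
                new SimpleEntry<String, Long>("customer-2", 1L),
                new SimpleEntry<String, Long>("customer-1", 2L),   // overwrites the earlier value
                new SimpleEntry<String, Long>("customer-2", null)  // tombstone removes the key
        );
        System.out.println(restore(changelog)); // {customer-1=2}
    }
}
```

Compaction keeps at least the latest record per key, which is why the rebuilt store matches the one that was lost without replaying every historical update.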

Section 7: Production Readiness

13. On AWS MSK, which authentication mechanism removes the need to rotate Kafka passwords, and why prefer it?

Answer MSK IAM authentication. Clients authenticate with their IAM role's temporary credentials, so there is no long-lived Kafka password to store, rotate, or leak, which also removes an entire class of rotation incidents. See Security and Auth Failures After Rotation.

14. What is the single most important signal to watch on a consumer, and what does a steadily rising value mean?

Answer Consumer lag (distance between committed offset and log end offset). Steadily rising lag means the consumer is processing slower than the producer is writing, so it is falling behind, either under-provisioned, slow, or partly down. See Observability and Consumer Lag.
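The lag calculation and the "steadily rising" check can be sketched as follows. The offsets and samples are made-up numbers, and the names are hypothetical.

```java
/** Illustrative only: per-partition lag and a naive rising-trend check. */
final class LagSketch {

    /** Lag is how far the committed offset trails the log end offset. */
    static long lag(long logEndOffset, long committedOffset) {
        return Math.max(0L, logEndOffset - committedOffset);
    }

    /** True when every sample is strictly larger than the one before it. */
    static boolean steadilyRising(long[] lagSamples) {
        if (lagSamples.length < 2) {
            return false;
        }
        for (int i = 1; i < lagSamples.length; i++) {
            if (lagSamples[i] <= lagSamples[i - 1]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(lag(1_500L, 1_200L));                        // 300
        System.out.println(steadilyRising(new long[] {100, 250, 400})); // true: falling behind
        System.out.println(steadilyRising(new long[] {100, 400, 50}));  // false: a spike that drained
    }
}
```

A spike that drains is normal under bursty load; it is the monotonic climb that signals a consumer unable to keep up.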

15. How would you assert that an async @KafkaListener received a record in a test, without flakiness?

Answer Use EmbeddedKafka (or Testcontainers) and Awaitility to poll for the expected condition with a timeout, never a fixed Thread.sleep. Awaitility returns as soon as the condition holds and fails with a clear assertion error if the timeout expires. See Testing.
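The idea behind polling with a timeout can be sketched with the standard library alone. This is a stand-in to show the mechanism, not Awaitility's API, and the names are hypothetical; in a real test you would use Awaitility itself.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Illustrative only: a tiny poll-until-true helper, not the Awaitility library. */
final class PollingAwaitSketch {

    /** Polls the condition until it holds or the timeout expires. Returns whether it held. */
    static boolean awaitUntil(BooleanSupplier condition, long timeoutMs, long pollIntervalMs) {
        long start = System.nanoTime();
        long timeoutNanos = timeoutMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true; // return as soon as the condition holds
            }
            if (System.nanoTime() - start >= timeoutNanos) {
                return false; // bounded wait: give up at the timeout
            }
            try {
                Thread.sleep(pollIntervalMs); // short pause between checks, not one long blind wait
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        AtomicBoolean received = new AtomicBoolean(false);

        // Stand-in for an async listener that receives a record a little later.
        new Thread(() -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            received.set(true);
        }).start();

        System.out.println(awaitUntil(received::get, 2_000, 20));
    }
}
```

The difference from a fixed sleep is that the wait ends the moment the condition holds, and the timeout only bounds the failure case.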

Section 8: Operations and Troubleshooting

16. OfflinePartitionsCount is greater than zero. How urgent is this and what is your first action?

Answer It is a live availability incident: those partitions have no leader and cannot serve reads or writes. Escalate to Kafka engineering immediately while gathering exhibits (broker count, ActiveControllerCount, affected topics); do not attempt control-plane recovery from the support tier. See Broker/Controller Failover.

17. A consumer suddenly reprocesses an entire topic after a deploy. What are the likely causes?

Answer Most likely the group id changed (so there are no committed offsets and auto.offset.reset=earliest replays everything), or a manual offset reset went to earliest, or the group's committed offsets expired while it was inactive. Stop the group's consumers, then reset to the correct target, previewing with --dry-run before applying with --execute. See Offset Problems.

18. A client reports connection timeouts to MSK, but only from one subnet, and brokers look healthy. Where do you look?

Answer The AWS layer: a security group or NACL for that subnet is likely dropping packets, which is why the client sees a timeout rather than a connection refused. Check that the broker SG allows the client SG on the Kafka port, and check the subnet NACLs in both directions (NACLs are stateless). This is infra-owned. See AWS-Layer Connectivity.

Scoring guide

Count the questions you answered confidently without expanding:

  • 15 to 18: You have a strong, production-ready grasp. You can design and operate Kafka systems and lead incident response.
  • 10 to 14: Solid working knowledge. Revisit the linked modules for the ones you missed, especially any in reliability or operations.
  • Below 10: Re-read the sections where you struggled and rebuild the running scenario from the Capstone Project. Understanding comes from building.

Wherever you missed a question, the linked module is the fix. That is the whole point of the course: not just knowing Kafka, but knowing where each answer lives when you need it under pressure.