Tooling Walkthrough
The operational CLI toolkit for inspecting and fixing a running cluster, plus the MSK CloudWatch metrics reference and which tool answers which question.
When an alert fires you reach for a small set of command-line tools. This module is a reference walkthrough of the Kafka CLI toolkit and the MSK CloudWatch metrics that go with it, so that when you hit the playbooks in the next module you already know how to read each tool’s output. The goal is fluency: which tool answers which question, and what the output means.
What you’ll be able to do after this module
- Use kafka-topics to inspect topics, partitions, and replicas.
- Use kafka-consumer-groups to read lag and reset offsets.
- Use kafka-configs to inspect and change configuration.
- Use the console producer and consumer to test by hand.
- Understand kafka-reassign-partitions and the MSK CloudWatch metrics.
1. Which tool for which question
Every operational question maps to one tool. Learn the mapping and you can navigate any incident.
flowchart TD
q{"What do you need to know?"}
q -->|topic layout, replicas, ISR| t["kafka-topics --describe"]
q -->|consumer lag, group state| g["kafka-consumer-groups --describe"]
q -->|effective config| c["kafka-configs --describe"]
q -->|is data flowing?| io["console producer / consumer"]
q -->|move partitions| r["kafka-reassign-partitions"]
q -->|broker/cluster health over time| cw["CloudWatch metrics"]
On MSK you run the CLI tools from a client machine inside the VPC, pointing --bootstrap-server at the MSK broker endpoints. Locally, run them inside the lab container.
2. kafka-topics: topic and partition layout
kafka-topics inspects and manages topics. --describe is the one you use most: it shows partitions, the leader broker, the replica set, and the in-sync replicas (ISR).
kafka-topics.sh --bootstrap-server $BROKER --describe --topic orders
Topic: orders PartitionCount: 3 ReplicationFactor: 3
Topic: orders Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
Topic: orders Partition: 1 Leader: 2 Replicas: 2,3,1 Isr: 2,3,1
Topic: orders Partition: 2 Leader: 3 Replicas: 3,1,2 Isr: 3,1
Read it like this: Isr shorter than Replicas means a replica has fallen behind (partition 2 above is missing broker 2 from ISR). That is the under-replicated signal you act on in the playbooks. Create and alter topics here too:
kafka-topics.sh --bootstrap-server $BROKER --create \
--topic orders --partitions 3 --replication-factor 3
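The ISR-versus-Replicas reading described above can be automated when a topic has many partitions. The following is a minimal sketch, assuming the output keeps the labelled layout of the sample (Partition:, Replicas:, Isr:); the function name under_replicated is hypothetical, and the sample text is copied from this section rather than taken from a live cluster.

```shell
# Flag partitions whose ISR is smaller than the replica set.
# Input: text shaped like the `kafka-topics.sh --describe` output shown in this section.
under_replicated() {
  awk '
    # Partition lines carry "Partition:", "Replicas:" and "Isr:" labels,
    # each followed by its value in the next field.
    /Partition:/ && /Isr:/ {
      for (i = 1; i < NF; i++) {
        if ($i == "Partition:") part = $(i + 1)
        if ($i == "Replicas:")  reps = $(i + 1)
        if ($i == "Isr:")       isr  = $(i + 1)
      }
      n_reps = split(reps, r, ",")
      n_isr  = split(isr, s, ",")
      if (n_isr < n_reps) {
        printf "partition %s under-replicated: isr=%s replicas=%s\n", part, isr, reps
      }
    }
  '
}

# Sample copied from the describe output in this section.
sample='Topic: orders PartitionCount: 3 ReplicationFactor: 3
Topic: orders Partition: 0 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
Topic: orders Partition: 1 Leader: 2 Replicas: 2,3,1 Isr: 2,3,1
Topic: orders Partition: 2 Leader: 3 Replicas: 3,1,2 Isr: 3,1'

printf '%s\n' "$sample" | under_replicated
```

Against a real cluster you would pipe the describe command into the function instead of the sample. kafka-topics also accepts --under-replicated-partitions together with --describe, which applies a similar filter on the tool side.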
3. kafka-consumer-groups: lag and offset resets
This is the most important operational tool, because consumer lag is the first signal, as established in Observability. --describe shows per-partition current offset, log end offset, and LAG.
kafka-consumer-groups.sh --bootstrap-server $BROKER \
--describe --group payment-service
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID
payment-service orders 0 10432 10432 0 consumer-1
payment-service orders 1 9876 12001 2125 consumer-2
payment-service orders 2 8500 8500 0 consumer-3
Partition 1’s LAG of 2125 with a CONSUMER-ID present means a slow consumer, not an absent one. The same tool also resets offsets, a powerful and dangerous operation. Kafka rejects a reset while the group still has active members, so stop the consumers first:
# Preview only (safe): --dry-run
kafka-consumer-groups.sh --bootstrap-server $BROKER --group payment-service \
--topic orders --reset-offsets --to-earliest --dry-run
# Apply (group must be down): --execute
kafka-consumer-groups.sh --bootstrap-server $BROKER --group payment-service \
--topic orders --reset-offsets --to-earliest --execute
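Reading the LAG column by eye works for three partitions but not for thirty. Here is a rough sketch of summarising the describe table, assuming the seven-column layout of the sample in this section (LAG in column 6, CONSUMER-ID in column 7). The function name lag_report and the threshold of 1000 records are hypothetical choices for illustration.

```shell
# Summarise lag from text shaped like the `kafka-consumer-groups.sh --describe` table.
# Column positions follow the sample in this section; adjust them if your output differs.
lag_report() {
  local threshold="${1:-1000}"   # hypothetical alerting threshold, in records
  awk -v threshold="$threshold" '
    $1 == "GROUP" { next }                 # skip the header row
    NF >= 7 && $6 ~ /^[0-9]+$/ {           # ignore rows whose LAG is not a number
      total += $6
      if (($6 + 0) > (threshold + 0)) {
        # A "-" in CONSUMER-ID means no member currently owns the partition.
        state = ($7 == "-") ? "no-consumer" : "slow-consumer"
        printf "%s-%s lag=%d %s\n", $2, $3, $6, state
      }
    }
    END { printf "total_lag=%d\n", total }
  '
}

# Sample copied from the describe output in this section.
sample='GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID
payment-service orders 0 10432 10432 0 consumer-1
payment-service orders 1 9876 12001 2125 consumer-2
payment-service orders 2 8500 8500 0 consumer-3'

printf '%s\n' "$sample" | lag_report 1000
```

The slow-versus-absent distinction mirrors the reading above: lag with a CONSUMER-ID present points at a slow consumer, while lag with no owner points at a missing one.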
4. kafka-configs: inspect and change configuration
kafka-configs reads and sets configuration on topics and brokers. Use --describe to see the effective config, which is essential when a setting is not behaving as you expect.
# Effective config for a topic
kafka-configs.sh --bootstrap-server $BROKER --describe \
--entity-type topics --entity-name orders
# Change retention for one topic
kafka-configs.sh --bootstrap-server $BROKER --alter \
--entity-type topics --entity-name orders \
--add-config retention.ms=604800000
This is how you check whether min.insync.replicas, retention.ms, or cleanup.policy are actually set to what you think. On MSK, broker-level settings are managed through MSK configurations rather than arbitrary kafka-configs broker edits.
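The retention.ms value in the --alter example is easier to trust once you see where it comes from. A small sketch of the arithmetic, with days_to_ms as a hypothetical helper name:

```shell
# Convert a retention period in days to the millisecond value retention.ms expects.
days_to_ms() {
  echo $(( $1 * 24 * 60 * 60 * 1000 ))
}

days_to_ms 7    # prints 604800000, the value used in the --alter example
```

Computing the number rather than typing it by hand avoids the common slip of dropping or adding a zero.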
5. Console producer and consumer: is data flowing?
When you need to answer “is anything being produced or consumed at all,” the console tools let you produce and read by hand.
# Produce a couple of records (type, then Ctrl-D)
kafka-console-producer.sh --bootstrap-server $BROKER --topic orders
# Read from the beginning, showing keys
kafka-console-consumer.sh --bootstrap-server $BROKER --topic orders \
--from-beginning --property print.key=true --property key.separator=:
If the console consumer sees records but your Spring listener does not, the problem is in the application (group, deserialization, offsets), not the broker. That single test cuts the problem space in half.
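The triage rule above can be written down as a tiny decision function. This is only a sketch: classify_flow is a hypothetical name, both counts are gathered by hand or from your own logs, and the first branch (nothing readable from the topic) is an inference from the rule rather than something stated in this module.

```shell
# classify_flow CONSOLE_COUNT APP_COUNT
#   CONSOLE_COUNT: records the console consumer read during the test window
#   APP_COUNT:     records the application listener processed in the same window
# Nothing here talks to Kafka; it only encodes the decision.
classify_flow() {
  local console_count="$1" app_count="$2"
  if [ "$console_count" -eq 0 ]; then
    echo "nothing readable from the topic: look at producers and the broker"
  elif [ "$app_count" -eq 0 ]; then
    echo "topic has data but the app sees none: look at group, deserialization, offsets"
  else
    echo "data is flowing end to end"
  fi
}

classify_flow 5 0
```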
6. kafka-reassign-partitions and CloudWatch
Two more you should recognize even if you use them rarely.
- kafka-reassign-partitions: moves partition replicas between brokers, for rebalancing load or replacing a broker. On MSK this is largely automated, but you may run it to rebalance after scaling. It is heavy: it copies data, so it is an escalation-level action, not a first response.
- MSK CloudWatch metrics: MSK publishes the broker metrics from Observability to CloudWatch. The ones you watch:

| CloudWatch metric | What it tells you |
|---|---|
| MaxOffsetLag / SumOffsetLag | Consumer group lag |
| UnderReplicatedPartitions | Replicas behind, resilience at risk |
| OfflinePartitionsCount | Partitions with no leader (outage) |
| KafkaDataLogsDiskUsed | Broker storage percent used |
| ActiveControllerCount | Should be exactly 1 across the cluster |
OfflinePartitionsCount above zero or ActiveControllerCount not equal to 1 are cluster-level emergencies handled in Broker Down, Controller Failover.
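The two emergency conditions reduce to a simple check. A minimal sketch, assuming you already have the latest OfflinePartitionsCount and the cluster-wide ActiveControllerCount as plain integers; how you fetch them from CloudWatch is left out, and cluster_emergency is a hypothetical name.

```shell
# cluster_emergency OFFLINE_PARTITIONS ACTIVE_CONTROLLERS
# Returns 0 (true) when either emergency condition from the text holds:
# any offline partition, or a controller count other than exactly 1.
cluster_emergency() {
  local offline="$1" controllers="$2"
  [ "$offline" -gt 0 ] || [ "$controllers" -ne 1 ]
}

if cluster_emergency 0 1; then echo "emergency"; else echo "ok"; fi   # healthy cluster
if cluster_emergency 2 1; then echo "emergency"; else echo "ok"; fi   # offline partitions
if cluster_emergency 0 0; then echo "emergency"; else echo "ok"; fi   # no active controller
```

A controller count of 2 is flagged just like 0, since the table above says the value should be exactly 1.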
7. Guided practical
Run this in the local lab.
- kafka-topics --describe the orders topic and identify the leader, replicas, and ISR of each partition.
- Start a consumer group, produce a backlog, and watch LAG shrink with kafka-consumer-groups --describe.
- Use kafka-configs --describe to confirm min.insync.replicas on orders.
- Produce with the console producer and read it back with the console consumer, printing keys.
- Run a --reset-offsets --to-earliest --dry-run and read the preview without applying it.
Next: Alert Playbooks, the twelve incident guides where you put this toolkit to work.