
Tooling Walkthrough

The operational CLI toolkit for inspecting and fixing a running cluster, plus the MSK CloudWatch metrics reference and which tool answers which question.

When an alert fires you reach for a small set of command-line tools. This module is a reference walkthrough of the Kafka CLI toolkit and the MSK CloudWatch metrics that go with it, so that when you hit the playbooks in the next module you already know how to read each tool’s output. The goal is fluency: which tool answers which question, and what the output means.


What you’ll be able to do after this module

  • Use kafka-topics to inspect topics, partitions, and replicas.
  • Use kafka-consumer-groups to read lag and reset offsets.
  • Use kafka-configs to inspect and change configuration.
  • Use the console producer and consumer to test by hand.
  • Understand kafka-reassign-partitions and the MSK CloudWatch metrics.

1. Which tool for which question

Every operational question maps to one tool. Learn the mapping and you can navigate any incident.

flowchart TD
    q{"What do you need to know?"}
    q -->|topic layout, replicas, ISR| t["kafka-topics --describe"]
    q -->|consumer lag, group state| g["kafka-consumer-groups --describe"]
    q -->|effective config| c["kafka-configs --describe"]
    q -->|is data flowing?| io["console producer / consumer"]
    q -->|move partitions| r["kafka-reassign-partitions"]
    q -->|broker/cluster health over time| cw["CloudWatch metrics"]

On MSK you run the CLI tools from a client machine inside the VPC, pointing --bootstrap-server at the MSK broker endpoints. Locally, run them inside the lab container.
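The commands in this module read the broker list from $BROKER. As a minimal sketch, assuming placeholder hostnames and a plaintext port (both hypothetical; use your cluster's real bootstrap brokers and the port that matches its authentication mode), you might set it and sanity-check its shape like this. valid_bootstrap is an illustrative helper, not part of Kafka.

```shell
# Hypothetical endpoints: replace with your cluster's bootstrap brokers.
BROKER="b-1.example.internal:9092,b-2.example.internal:9092"

# Succeed only if the value is a comma-separated list of host:port entries.
valid_bootstrap() {
  printf '%s\n' "$1" |
    grep -Eq '^[^,:[:space:]]+:[0-9]+(,[^,:[:space:]]+:[0-9]+)*$'
}

if valid_bootstrap "$BROKER"; then
  echo "bootstrap list looks well-formed"
else
  echo "bootstrap list is malformed" >&2
fi
```

The check only validates the shape of the string. It does not prove the brokers are reachable from where you are running the tools.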


2. kafka-topics: topic and partition layout

kafka-topics inspects and manages topics. --describe is the one you use most: it shows partitions, the leader broker, the replica set, and the in-sync replicas (ISR).

kafka-topics.sh --bootstrap-server $BROKER --describe --topic orders
Topic: orders  PartitionCount: 3  ReplicationFactor: 3
  Topic: orders  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
  Topic: orders  Partition: 1  Leader: 2  Replicas: 2,3,1  Isr: 2,3,1
  Topic: orders  Partition: 2  Leader: 3  Replicas: 3,1,2  Isr: 3,1

Read it like this: Isr shorter than Replicas means a replica has fallen behind (partition 2 above is missing broker 2 from ISR). That is the under-replicated signal you act on in the playbooks. Create and alter topics here too:

kafka-topics.sh --bootstrap-server $BROKER --create \
  --topic orders --partitions 3 --replication-factor 3
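The "Isr shorter than Replicas" reading from the describe output can be automated. The sketch below assumes the line layout shown in the sample above; under_replicated is a hypothetical helper name, and it reads saved or piped --describe output rather than calling the broker itself.

```shell
# Print "topic partition" for every partition whose ISR list is shorter
# than its replica list. Reads kafka-topics --describe output on stdin.
under_replicated() {
  awk '
    {
      topic = ""; part = ""; replicas = ""; isr = ""
      # Pick up the value that follows each label on the line.
      for (i = 1; i < NF; i++) {
        if ($i == "Topic:")     topic = $(i + 1)
        if ($i == "Partition:") part = $(i + 1)
        if ($i == "Replicas:")  replicas = $(i + 1)
        if ($i == "Isr:")       isr = $(i + 1)
      }
      # Skip the topic summary line, which has no Partition: label.
      if (part == "" || replicas == "") next
      if (split(isr, a, ",") < split(replicas, b, ","))
        print topic, part
    }'
}

# Usage with the sample output from this section:
under_replicated <<'EOF'
Topic: orders  PartitionCount: 3  ReplicationFactor: 3
  Topic: orders  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
  Topic: orders  Partition: 1  Leader: 2  Replicas: 2,3,1  Isr: 2,3,1
  Topic: orders  Partition: 2  Leader: 3  Replicas: 3,1,2  Isr: 3,1
EOF
# prints: orders 2
```

Against a live cluster you would pipe the describe command into the function. kafka-topics also accepts --under-replicated-partitions together with --describe, which does the same filtering on the tool side.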

3. kafka-consumer-groups: lag and offset resets

This is the most important operational tool, because consumer lag is usually the first signal that something is wrong, as established in Observability. --describe shows the per-partition current offset, log end offset, and LAG.

kafka-consumer-groups.sh --bootstrap-server $BROKER \
  --describe --group payment-service
GROUP            TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID
payment-service  orders  0          10432           10432           0    consumer-1
payment-service  orders  1          9876            12001           2125 consumer-2
payment-service  orders  2          8500            8500            0    consumer-3

Partition 1’s LAG of 2125 with a CONSUMER-ID present means a slow consumer, not an absent one. The same tool also resets offsets, a powerful and dangerous operation. A reset is rejected unless the group has no active members, so stop the consumers first:

# Preview only (safe): --dry-run
kafka-consumer-groups.sh --bootstrap-server $BROKER --group payment-service \
  --topic orders --reset-offsets --to-earliest --dry-run

# Apply (group must be down): --execute
kafka-consumer-groups.sh --bootstrap-server $BROKER --group payment-service \
  --topic orders --reset-offsets --to-earliest --execute
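The "slow consumer versus absent consumer" reading of the describe output earlier in this section can be sketched as a small filter. It assumes the column headers shown in the sample, and that a partition with no active member shows "-" in CONSUMER-ID; lagging_partitions is a hypothetical helper name.

```shell
# Reads kafka-consumer-groups --describe output on stdin.
# For each partition whose LAG exceeds the threshold (first argument,
# default 0), prints: topic partition lag slow-consumer|no-consumer
lagging_partitions() {
  awk -v min="${1:-0}" '
    # Header row: remember which column holds each field.
    $1 == "GROUP" { for (i = 1; i <= NF; i++) col[$i] = i; next }
    # Data rows, once the header has been seen.
    ("LAG" in col) && NF >= col["LAG"] {
      lag = $(col["LAG"])
      if (lag !~ /^[0-9]+$/ || lag + 0 <= min + 0) next
      id = $(col["CONSUMER-ID"])
      print $(col["TOPIC"]), $(col["PARTITION"]), lag, \
        (id == "" || id == "-" ? "no-consumer" : "slow-consumer")
    }'
}

# Usage with the sample output from this section:
lagging_partitions <<'EOF'
GROUP            TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID
payment-service  orders  0          10432           10432           0    consumer-1
payment-service  orders  1          9876            12001           2125 consumer-2
payment-service  orders  2          8500            8500            0    consumer-3
EOF
# prints: orders 1 2125 slow-consumer
```

Looking columns up by header name rather than by fixed position keeps the filter working when the tool prints extra columns such as HOST and CLIENT-ID.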

4. kafka-configs: inspect and change configuration

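kafka-configs reads and sets configuration on topics and brokers. Use --describe to see what is set, which is essential when a setting is not behaving as you expect. On recent Kafka versions, --describe alone lists only the overrides set on the entity; add --all to include defaults and see the effective config.

# Effective config for a topic (--all includes defaults, not just overrides)
kafka-configs.sh --bootstrap-server $BROKER --describe --all \
  --entity-type topics --entity-name orders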

# Change retention for one topic
kafka-configs.sh --bootstrap-server $BROKER --alter \
  --entity-type topics --entity-name orders \
  --add-config retention.ms=604800000

This is how you check whether min.insync.replicas, retention.ms, or cleanup.policy are actually set to what you think. On MSK, broker-level settings are managed through MSK configurations rather than arbitrary kafka-configs broker edits.
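The retention.ms value in the --alter example is in milliseconds, and 604800000 is seven days. A small conversion helper (days_to_ms is a hypothetical name) saves you from counting zeros when you choose a different retention:

```shell
# Convert whole days to milliseconds for retention.ms.
days_to_ms() {
  echo $(( $1 * 24 * 60 * 60 * 1000 ))
}

days_to_ms 7   # prints 604800000
```

You could then pass the result straight into the command, for example --add-config retention.ms=$(days_to_ms 3).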


5. Console producer and consumer: is data flowing?

When you need to answer “is anything being produced or consumed at all,” the console tools let you produce and read by hand.

# Produce a couple of records (type, then Ctrl-D)
kafka-console-producer.sh --bootstrap-server $BROKER --topic orders

# Read from the beginning, showing keys
kafka-console-consumer.sh --bootstrap-server $BROKER --topic orders \
  --from-beginning --property print.key=true --property key.separator=:

If the console consumer sees records but your Spring listener does not, the problem is in the application (group, deserialization, offsets), not the broker. That single test cuts the problem space in half.
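To make the hand test repeatable, you can generate a few keyed records and pipe them into the console producer instead of typing them. This is a sketch: make_records and the test-N key format are hypothetical, while parse.key and key.separator are console producer properties that mirror the print.key and key.separator properties used on the consumer side above.

```shell
# Emit N keyed test records, one per line, in key:value form.
# The body runs in a subshell so the loop variables do not leak.
make_records() (
  n=${1:-3}
  i=1
  while [ "$i" -le "$n" ]; do
    printf 'test-%d:hello %d\n' "$i" "$i"
    i=$((i + 1))
  done
)

make_records 2
# test-1:hello 1
# test-2:hello 2

# Against a real broker, pipe the records in so the producer splits each
# line into key and value at the colon:
# make_records 2 | kafka-console-producer.sh --bootstrap-server "$BROKER" \
#   --topic orders --property parse.key=true --property key.separator=:
```

Recognizable keys make it easy to spot your own test records in the console consumer output among real traffic.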


6. kafka-reassign-partitions and CloudWatch

Two more you should recognize even if you use them rarely.

  • kafka-reassign-partitions: moves partition replicas between brokers, for rebalancing load or replacing a broker. On MSK the service replaces failed brokers for you, but it does not move existing partitions onto brokers you add, so you may run it to rebalance after scaling. It is heavy: it copies data, so it is an escalation-level action, not a first response.

  • MSK CloudWatch metrics: MSK publishes the broker metrics from Observability to CloudWatch. The ones you watch:

CloudWatch metric              What it tells you
MaxOffsetLag / SumOffsetLag    Consumer group lag
UnderReplicatedPartitions      Replicas behind, resilience at risk
OfflinePartitionsCount         Partitions with no leader (outage)
KafkaDataLogsDiskUsed          Broker storage percent used
ActiveControllerCount          Should be exactly 1 across the cluster

OfflinePartitionsCount above zero or ActiveControllerCount not equal to 1 are cluster-level emergencies handled in Broker Down, Controller Failover.
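The emergency rule above is simple enough to encode as a check. This sketch assumes you have already fetched the two values from CloudWatch by some other means; cluster_emergency is a hypothetical helper, and the second argument is ActiveControllerCount summed across all brokers.

```shell
# Args: OfflinePartitionsCount, then ActiveControllerCount summed across brokers.
# Prints a verdict and returns non-zero when the cluster needs escalation.
cluster_emergency() {
  offline=$1
  controllers=$2
  if [ "$offline" -gt 0 ]; then
    echo "EMERGENCY: $offline offline partition(s)"
    return 1
  fi
  if [ "$controllers" -ne 1 ]; then
    echo "EMERGENCY: active controllers = $controllers, expected 1"
    return 1
  fi
  echo "OK"
}

cluster_emergency 0 1          # prints OK
cluster_emergency 2 1 || true  # prints EMERGENCY: 2 offline partition(s)
```

The non-zero return code lets you chain the check into a script that pages or escalates without parsing the message text.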


7. Guided practical

Run this in the local lab.

  1. Run kafka-topics --describe on the orders topic and identify the leader, replicas, and ISR of each partition.
  2. Start a consumer group, produce a backlog, and watch LAG shrink with kafka-consumer-groups --describe.
  3. Use kafka-configs --describe to confirm min.insync.replicas on orders.
  4. Produce with the console producer and read it back with the console consumer, printing keys.
  5. Run a --reset-offsets --to-earliest --dry-run and read the preview without applying it.

Next: Alert Playbooks, the twelve incident guides where you put this toolkit to work.