Kafka Observability: Metrics, Consumer Lag, and Tracing with Spring

A Kafka system can be healthy at the broker and still be failing the business, because a consumer has quietly fallen behind. Observability is how you see that before customers do. This module covers the signals that matter, with consumer lag first, and how to collect metrics and traces from a Spring application.

What you’ll be able to do after this module

Explain why consumer lag is the first signal to watch.
Read the key broker and client metrics.
Expose Kafka metrics from Spring with Micrometer and Actuator.
Scrape metrics with Prometheus and visualize them in Grafana.
Trace a record across producer and consumer with OpenTelemetry.

1. Consumer lag: the signal that matters most

Consumer lag is the number of records between a consumer group’s committed offset and the end of the partition (the log end offset). It is how far behind the consumer is. Lag is the single most important Kafka signal, because it directly measures whether processing is keeping up with production.

Flat, low lag: the consumer keeps up. Healthy.
Steadily rising lag: the consumer is slower than the producer. Something is wrong or under-provisioned.
Lag with no active members in the group: the consumers are down or have been evicted, so the committed offset has stopped advancing.

flowchart LR
    le["log end offset<br/>(latest produced)"]
    co["committed offset<br/>(consumer position)"]
    le ---|"lag = distance"| co
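
The arithmetic behind the diagram can be sketched in a few lines of plain Java. This is only an illustration: the class and method names are hypothetical, the offsets are made-up sample values, and partitions without a committed offset are skipped (an assumption, chosen so the sketch does not guess a starting position).

```java
import java.util.Map;
import java.util.TreeMap;

public class LagCalculator {

    /** Lag for one partition: records between the committed offset and the log end offset. */
    public static long lag(long logEndOffset, long committedOffset) {
        // Clamp at zero: the two offsets are read at slightly different moments,
        // so the raw difference can briefly come out negative.
        return Math.max(0L, logEndOffset - committedOffset);
    }

    /** Per-partition lag. Partitions with no committed offset are skipped (assumption). */
    public static Map<Integer, Long> lagByPartition(Map<Integer, Long> logEndOffsets,
                                                    Map<Integer, Long> committedOffsets) {
        Map<Integer, Long> result = new TreeMap<>();
        for (Map.Entry<Integer, Long> entry : logEndOffsets.entrySet()) {
            Long committed = committedOffsets.get(entry.getKey());
            if (committed == null) {
                continue; // nothing committed yet for this partition
            }
            result.put(entry.getKey(), lag(entry.getValue(), committed));
        }
        return result;
    }

    /** Sum of lag across all partitions of the group. */
    public static long totalLag(Map<Integer, Long> lagByPartition) {
        return lagByPartition.values().stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        // Sample offsets for a three-partition topic (illustrative values only).
        Map<Integer, Long> logEnd = Map.of(0, 1_500L, 1, 1_200L, 2, 900L);
        Map<Integer, Long> committed = Map.of(0, 1_500L, 1, 1_150L, 2, 400L);

        Map<Integer, Long> lags = lagByPartition(logEnd, committed);
        System.out.println("lag by partition: " + lags); // {0=0, 1=50, 2=500}
        System.out.println("total lag: " + totalLag(lags)); // 550
    }
}
```

Per-partition values matter more than the total: a single hot partition (partition 2 here) can hide behind an average that looks acceptable.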

Check it from the CLI:

kafka-consumer-groups.sh --bootstrap-server $BROKER \
  --describe --group payment-service

The LAG column per partition is what you alert on. Diagnosing rising lag in production is covered in Consumer Lag and Stuck Consumers.

2. The metrics that matter

Beyond lag, a handful of broker and client metrics tell you most of what you need.

Metric	Side	Why it matters
Consumer lag	Consumer	Keeping up with production
`records-consumed-rate`	Consumer	Throughput being processed
`UnderReplicatedPartitions`	Broker	Replication health, data-loss risk
`OfflinePartitionsCount`	Broker	Partitions with no leader
`request-latency-avg`	Client	Producer/consumer round-trip health
`records-per-request-avg`	Producer	Batching effectiveness

Kafka exposes broker and client metrics over JMX. In production you feed them into a metrics pipeline rather than reading JMX by hand.
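
To see what "over JMX" means for a client, the sketch below reads an attribute from the MBeans registered in the current JVM using only the JDK. The class name `JmxMetricReader` is hypothetical. The object name pattern and the `records-lag-max` attribute follow the Kafka consumer's documented JMX naming; broker MBeans live in the broker's JVM and would need a remote JMX connection or an exporter, which this sketch does not cover.

```java
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.TreeMap;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class JmxMetricReader {

    /** Reads one attribute from every MBean in this JVM whose name matches the pattern. */
    public static Map<String, Object> readAttribute(String objectNamePattern, String attribute) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        Map<String, Object> values = new TreeMap<>();
        try {
            for (ObjectName name : server.queryNames(new ObjectName(objectNamePattern), null)) {
                values.put(name.getCanonicalName(), server.getAttribute(name, attribute));
            }
        } catch (JMException e) {
            throw new IllegalStateException("Could not read " + attribute, e);
        }
        return values;
    }

    public static void main(String[] args) {
        // A Kafka consumer registers one client-level MBean per client id under this name.
        // In a JVM with no running consumer the result is simply empty.
        Map<String, Object> lagMax = readAttribute(
                "kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*",
                "records-lag-max");
        System.out.println("records-lag-max by client: " + lagMax);
    }
}
```

This is useful for a quick check inside a running service; for anything continuous, the metrics pipeline described in the following sections is the better tool.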

3. Metrics from Spring with Micrometer

Spring Boot integrates with Micrometer, and Actuator exposes the metrics. Spring Boot auto-configures Micrometer bindings for the Kafka consumer and producer factories it creates, so client metrics such as consumer lag, request latency, and throughput appear alongside your application metrics.

Add Actuator and the Prometheus registry:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

Expose the Prometheus endpoint:

management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
  metrics:
    tags:
      application: payment-service

Spring for Apache Kafka can also publish listener-level timers (spring.kafka.listener), so you can see per-listener invocation counts, durations, and failures from the application’s own view.

4. Prometheus and Grafana

The standard pipeline scrapes the Prometheus endpoint on a schedule, stores the time series, and renders dashboards and alerts in Grafana.

flowchart LR
    app["Spring app<br/>/actuator/prometheus"]
    brokers["Broker JMX<br/>(via exporter)"]
    prom["Prometheus<br/>scrape + store"]
    graf["Grafana<br/>dashboards + alerts"]
    otel["OTel collector"]
    app --> prom
    brokers --> prom
    prom --> graf
    app --> otel

Set alerts on the signals that matter: consumer lag that keeps rising, UnderReplicatedPartitions above zero, and any nonzero OfflinePartitionsCount. On Amazon MSK, these same broker metrics are published to CloudWatch, so the dashboard source differs but the signals are identical.
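
In practice these conditions live in Prometheus or Grafana alert rules, not in application code. The logic they encode is simple enough to sketch in Java, though. Everything here is illustrative: the class name, the "never drops and grows by more than a threshold" definition of rising lag, and the sample values are assumptions, not a recommended rule.

```java
import java.util.List;

public class KafkaAlertRules {

    /**
     * True when lag never drops between consecutive samples and grows by more
     * than minGrowth from the first sample to the last.
     */
    public static boolean lagSteadilyRising(List<Long> lagSamples, long minGrowth) {
        if (lagSamples.size() < 2) {
            return false; // not enough data to call it a trend
        }
        for (int i = 1; i < lagSamples.size(); i++) {
            if (lagSamples.get(i) < lagSamples.get(i - 1)) {
                return false; // the consumer caught up at some point
            }
        }
        long growth = lagSamples.get(lagSamples.size() - 1) - lagSamples.get(0);
        return growth > minGrowth;
    }

    /** Any under-replicated or offline partition is worth an alert. */
    public static boolean brokerUnhealthy(long underReplicatedPartitions, long offlinePartitionsCount) {
        return underReplicatedPartitions > 0 || offlinePartitionsCount > 0;
    }

    public static void main(String[] args) {
        // Lag sampled once per scrape interval (illustrative values).
        List<Long> samples = List.of(100L, 180L, 260L, 400L);
        System.out.println("rising lag: " + lagSteadilyRising(samples, 100L)); // true
        System.out.println("broker unhealthy: " + brokerUnhealthy(0, 0));      // false
    }
}
```

Alerting on a trend over a window, not on a single high reading, avoids paging on the short spikes that follow a deploy or a rebalance.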

5. Distributed tracing with OpenTelemetry

Metrics tell you something is slow; tracing tells you where. A trace follows one logical operation across services, and Kafka is a hop in that trace. OpenTelemetry propagates a trace context in record headers, so a span in the Order service links to the span where the Payment service consumes the same record.

// With the OpenTelemetry Kafka instrumentation, the trace context is
// injected into record headers on send and extracted on consume, so
// producer and consumer spans join one end-to-end trace automatically.
kafkaTemplate.send("orders", String.valueOf(event.orderId()), event);

Spring Boot’s observability support (Micrometer Tracing with an OpenTelemetry bridge) wires this in with configuration rather than manual span code. The payoff is a single trace showing an order flow through Order, Payment, and Inventory, including the time spent waiting in each topic.
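
To make the header propagation concrete, here is a minimal sketch of what the instrumentation does on your behalf, written with the JDK only. It assumes the W3C Trace Context format (the `traceparent` header, which is OpenTelemetry's default propagator) and uses a plain `Map` as a stand-in for Kafka record headers. The class and method names are hypothetical; real code should rely on the instrumentation and not hand-roll this.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

public class TraceHeaderSketch {

    static final String HEADER = "traceparent";

    /** Random lowercase hex string of the given byte length (16 bytes = trace id, 8 = span id). */
    public static String randomHex(int bytes) {
        StringBuilder sb = new StringBuilder(bytes * 2);
        for (int i = 0; i < bytes; i++) {
            sb.append(String.format("%02x", ThreadLocalRandom.current().nextInt(256)));
        }
        return sb.toString();
    }

    /** Producer side: write version, trace id, span id, and flags into the record headers. */
    public static void inject(Map<String, byte[]> headers, String traceId, String spanId) {
        String value = "00-" + traceId + "-" + spanId + "-01"; // 01 = sampled
        headers.put(HEADER, value.getBytes(StandardCharsets.UTF_8));
    }

    /** Consumer side: read the trace id back so the consumer span joins the same trace. */
    public static String extractTraceId(Map<String, byte[]> headers) {
        byte[] raw = headers.get(HEADER);
        if (raw == null) {
            return null; // no context: the consumer would start a new trace
        }
        String[] parts = new String(raw, StandardCharsets.UTF_8).split("-");
        return parts.length == 4 ? parts[1] : null;
    }

    public static void main(String[] args) {
        Map<String, byte[]> headers = new HashMap<>(); // stand-in for Kafka record headers
        String traceId = randomHex(16);

        inject(headers, traceId, randomHex(8));        // Order service sends
        String seenByConsumer = extractTraceId(headers); // Payment service receives

        System.out.println("same trace: " + traceId.equals(seenByConsumer)); // true
    }
}
```

The consumer creates its own span id but reuses the extracted trace id, which is what lets the tracing backend stitch both sides into one trace.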

6. Guided practical

Run this against the local lab.

Add Actuator and the Prometheus registry to a consumer service and expose /actuator/prometheus.
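Produce a backlog, start the consumer, and watch the kafka.consumer.fetch.manager.records.lag metrics fall as it catches up.
Run kafka-consumer-groups.sh --describe and correlate the LAG column with the metric. Expect small differences: the CLI measures from the committed offset, while the client metric measures from the consumer’s current fetch position.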
Point a local Prometheus at the endpoint and graph consumer lag over time.
Enable tracing and confirm a produced record and its consumption share one trace id.
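
While working through the steps above, it can help to check the raw /actuator/prometheus output before pointing Prometheus at it. The sketch below pulls the values of one metric out of the text exposition format. It makes several assumptions: the metric name `kafka_consumer_fetch_manager_records_lag` is how Micrometer's dotted name is expected to appear in Prometheus form, the sample lines and their labels are made up, and lines are assumed to carry no trailing timestamp. The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class PrometheusLagParser {

    /** Returns the sample values of every series whose metric name matches exactly. */
    public static List<Double> valuesFor(String exposition, String metricName) {
        List<Double> values = new ArrayList<>();
        for (String rawLine : exposition.split("\n")) {
            String line = rawLine.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and HELP/TYPE comments
            }
            int valueStart = line.lastIndexOf(' ');
            if (valueStart < 0) {
                continue; // not a "series value" line
            }
            String series = line.substring(0, valueStart).trim();
            int brace = series.indexOf('{');
            String name = brace >= 0 ? series.substring(0, brace) : series;
            if (!name.equals(metricName)) {
                continue;
            }
            try {
                values.add(Double.parseDouble(line.substring(valueStart + 1)));
            } catch (NumberFormatException e) {
                // Values such as +Inf are not parsed by this sketch; skip them.
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Made-up sample of what the endpoint might return for a two-partition topic.
        String sample =
                "# TYPE kafka_consumer_fetch_manager_records_lag gauge\n"
              + "kafka_consumer_fetch_manager_records_lag{partition=\"0\",topic=\"orders\"} 120.0\n"
              + "kafka_consumer_fetch_manager_records_lag{partition=\"1\",topic=\"orders\"} 30.0\n"
              + "kafka_consumer_fetch_manager_records_lag_max{client_id=\"consumer-1\"} 120.0\n";

        List<Double> lags = valuesFor(sample, "kafka_consumer_fetch_manager_records_lag");
        System.out.println("per-partition lag: " + lags); // [120.0, 30.0]
    }
}
```

The exact-name match keeps the `_max` series out of the result, so the per-partition values can be compared one-to-one with the LAG column from the CLI.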

Next: Performance Tuning and Throughput, where you use these signals to tune the system.