Observability: Metrics, Consumer Lag, and Tracing
The signals that matter, with consumer lag first: broker and client JMX metrics, Micrometer and Actuator, Prometheus and Grafana, and distributed tracing with OpenTelemetry.
A Kafka system can be healthy at the broker and still be failing the business, because a consumer has quietly fallen behind. Observability is how you see that before customers do. This module covers the signals that matter, with consumer lag first, and how to collect metrics and traces from a Spring application.
What you’ll be able to do after this module
- Explain why consumer lag is the first signal to watch.
- Read the key broker and client metrics.
- Expose Kafka metrics from Spring with Micrometer and Actuator.
- Scrape metrics with Prometheus and visualize them in Grafana.
- Trace a record across producer and consumer with OpenTelemetry.
1. Consumer lag: the signal that matters most
Consumer lag is the number of records between a consumer group’s committed offset and the end of the partition (the log end offset). It is how far behind the consumer is. Lag is the single most important Kafka signal, because it directly measures whether processing is keeping up with production.
- Flat, low lag: the consumer keeps up. Healthy.
- Steadily rising lag: the consumer is slower than the producer. Something is wrong or under-provisioned.
- Lag frozen or growing while the group has no active members: the group is down or its members were evicted.
flowchart LR
le["log end offset<br/>(latest produced)"]
co["committed offset<br/>(consumer position)"]
le ---|"lag = distance"| co
Check it from the CLI:
kafka-consumer-groups.sh --bootstrap-server $BROKER \
--describe --group payment-service
The LAG column per partition is what you alert on. Diagnosing rising lag in production is covered in Consumer Lag and Stuck Consumers.
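The lag definition above reduces to a subtraction per partition. The following is a minimal sketch of that arithmetic only, not how the CLI computes it. The class and method names are hypothetical, and the offsets are passed in as plain maps. In a real service they would come from the broker, for example through the Kafka Admin client.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LagCalculator {

    /** Lag for one partition: log end offset minus committed offset, never negative. */
    static long lag(long logEndOffset, long committedOffset) {
        return Math.max(0L, logEndOffset - committedOffset);
    }

    /** Per-partition lag for one consumer group, keyed by partition number. */
    static Map<Integer, Long> lagByPartition(Map<Integer, Long> logEndOffsets,
                                             Map<Integer, Long> committedOffsets) {
        Map<Integer, Long> result = new LinkedHashMap<>();
        for (Map.Entry<Integer, Long> entry : logEndOffsets.entrySet()) {
            // Simplifying assumption: a partition with no committed offset yet
            // is treated as if the group had committed offset 0.
            long committed = committedOffsets.getOrDefault(entry.getKey(), 0L);
            result.put(entry.getKey(), lag(entry.getValue(), committed));
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative numbers, not taken from a real cluster.
        Map<Integer, Long> logEnd = Map.of(0, 1_500L, 1, 900L);
        Map<Integer, Long> committed = Map.of(0, 1_480L, 1, 900L);
        System.out.println(lagByPartition(logEnd, committed));
    }
}
```

With these sample numbers, partition 0 is 20 records behind and partition 1 is caught up, which is the same picture the LAG column gives per partition.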
2. The metrics that matter
Beyond lag, a handful of broker and client metrics tell you most of what you need.
| Metric | Side | Why it matters |
|---|---|---|
| Consumer lag | Consumer | Keeping up with production |
| records-consumed-rate | Consumer | Throughput being processed |
| UnderReplicatedPartitions | Broker | Replication health, data-loss risk |
| OfflinePartitionsCount | Broker | Partitions with no leader |
| request-latency-avg | Client | Producer/consumer round-trip health |
| records-per-request-avg | Producer | Batching effectiveness |
Kafka exposes broker and client metrics over JMX. In production you feed them into a metrics pipeline rather than reading JMX by hand.
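To make "metrics over JMX" concrete, here is a rough sketch that reads one attribute from every matching MBean in the current JVM, using only the standard `javax.management` API. The class name `JmxMetricReader` is hypothetical. The object-name pattern and the `records-lag-max` attribute in `main` follow the naming the Kafka consumer uses for its fetch-manager metrics, but treat them as an assumption and check them against your client version with a JMX browser.

```java
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.TreeMap;
import javax.management.AttributeNotFoundException;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

final class JmxMetricReader {

    /**
     * Reads one attribute from every MBean in this JVM that matches the pattern.
     * Returns MBean name to attribute value; MBeans without the attribute are skipped.
     */
    static Map<String, Object> readAll(String objectNamePattern, String attribute) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        Map<String, Object> values = new TreeMap<>();
        try {
            for (ObjectName name : server.queryNames(new ObjectName(objectNamePattern), null)) {
                try {
                    values.put(name.getCanonicalName(), server.getAttribute(name, attribute));
                } catch (AttributeNotFoundException e) {
                    // This MBean does not expose the attribute; skip it.
                }
            }
        } catch (JMException e) {
            throw new IllegalStateException("JMX read failed for " + objectNamePattern, e);
        }
        return values;
    }

    public static void main(String[] args) {
        // Inside a JVM that runs a Kafka consumer, this pattern is expected to match
        // the consumer fetch-manager MBeans. In any other JVM the map is simply empty.
        System.out.println(readAll(
                "kafka.consumer:type=consumer-fetch-manager-metrics,client-id=*",
                "records-lag-max"));
    }
}
```

This only sees MBeans registered in the same JVM, which is one reason a metrics pipeline is the better fit for production.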
3. Metrics from Spring with Micrometer
Spring Boot integrates with Micrometer, and Actuator exposes the metrics. Spring Boot binds the Kafka client metrics to Micrometer automatically, so the clients' lag, request latency, and throughput metrics appear alongside your application metrics.
Add Actuator and the Prometheus registry:
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
Expose the Prometheus endpoint:
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
  metrics:
    tags:
      application: payment-service
Spring for Apache Kafka can also publish listener-level metrics, so you can see per-listener throughput and lag from the application’s own view.
4. Prometheus and Grafana
The standard pipeline scrapes the Prometheus endpoint on a schedule, stores the time series, and renders dashboards and alerts in Grafana.
flowchart LR
app["Spring app<br/>/actuator/prometheus"]
brokers["Broker JMX<br/>(via exporter)"]
prom["Prometheus<br/>scrape + store"]
graf["Grafana<br/>dashboards + alerts"]
otel["OTel collector"]
app --> prom
brokers --> prom
prom --> graf
app --> otel
Set alerts on the signals that matter: rising consumer lag, UnderReplicatedPartitions above zero, and OfflinePartitionsCount above zero. On Amazon MSK, these same broker metrics are published to CloudWatch, so the dashboard source differs but the signals are identical.
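The three alert conditions can be written down as plain predicates. In practice they would live in Prometheus or Grafana alert rules; the sketch below only illustrates the logic in Java. The class name, the alert names, and the "rising" definition (no sample lower than the previous one, and the last above the first) are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class AlertRules {

    /**
     * True when lag grew across the window: no sample is lower than the one
     * before it, and the last sample is strictly above the first.
     */
    static boolean lagIsRising(List<Long> lagSamples) {
        if (lagSamples.size() < 2) {
            return false;
        }
        for (int i = 1; i < lagSamples.size(); i++) {
            if (lagSamples.get(i) < lagSamples.get(i - 1)) {
                return false;
            }
        }
        return lagSamples.get(lagSamples.size() - 1) > lagSamples.get(0);
    }

    /** Names of the alerts that fire for the given readings. */
    static List<String> firing(List<Long> lagSamples, int underReplicated, int offline) {
        List<String> alerts = new ArrayList<>();
        if (lagIsRising(lagSamples)) {
            alerts.add("ConsumerLagRising");
        }
        if (underReplicated > 0) {
            alerts.add("UnderReplicatedPartitions");
        }
        if (offline > 0) {
            alerts.add("OfflinePartitions");
        }
        return alerts;
    }

    public static void main(String[] args) {
        // Lag climbing over three scrapes, replication healthy, one partition offline.
        System.out.println(firing(List.of(100L, 250L, 400L), 0, 1));
        // prints [ConsumerLagRising, OfflinePartitions]
    }
}
```

Alerting on the trend rather than on a fixed lag threshold keeps a large but shrinking backlog from paging anyone.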
5. Distributed tracing with OpenTelemetry
Metrics tell you something is slow; tracing tells you where. A trace follows one logical operation across services, and Kafka is a hop in that trace. OpenTelemetry propagates a trace context in record headers, so a span in the Order service links to the span where the Payment service consumes the same record.
// With the OpenTelemetry Kafka instrumentation, the trace context is
// injected into record headers on send and extracted on consume, so
// producer and consumer spans join one end-to-end trace automatically.
kafkaTemplate.send("orders", String.valueOf(event.orderId()), event);
Spring Boot’s observability support (Micrometer Tracing with an OpenTelemetry bridge) wires this in with configuration rather than manual span code. The payoff is a single trace showing an order flow through Order, Payment, and Inventory, including the time spent waiting in each topic.
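To show what "trace context in record headers" means on the wire, here is a small sketch that writes and reads a W3C `traceparent` value (`00-<trace id>-<span id>-<flags>`). It is an illustration of the idea, not the OpenTelemetry API: the class name is hypothetical, and a plain `Map<String, byte[]>` stands in for Kafka record headers. With the instrumentation in place you would not write this code yourself.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TraceContextHeaders {

    static final String TRACEPARENT = "traceparent";

    // version "00", 32 hex chars of trace id, 16 hex chars of span id, 2 hex chars of flags
    private static final Pattern FORMAT =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    /** Producer side: write the current trace and span ids into the headers. */
    static void inject(Map<String, byte[]> headers, String traceId, String spanId, boolean sampled) {
        String value = "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
        headers.put(TRACEPARENT, value.getBytes(StandardCharsets.UTF_8));
    }

    /** Consumer side: recover the trace id, or empty if the header is absent or malformed. */
    static Optional<String> extractTraceId(Map<String, byte[]> headers) {
        byte[] raw = headers.get(TRACEPARENT);
        if (raw == null) {
            return Optional.empty();
        }
        Matcher matcher = FORMAT.matcher(new String(raw, StandardCharsets.UTF_8));
        return matcher.matches() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, byte[]> headers = new HashMap<>(); // stands in for Kafka record headers
        inject(headers, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", true);
        System.out.println(extractTraceId(headers).orElse("none"));
        // prints 4bf92f3577b34da6a3ce929d0e0e4736
    }
}
```

Because the consumer starts its span with the trace id it read from the header, the producer and consumer spans end up in the same trace.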
6. Guided practical
Run this against the local lab.
- Add Actuator and the Prometheus registry to a consumer service and expose /actuator/prometheus.
- Produce a backlog, start the consumer, and watch the kafka.consumer.fetch.manager lag metrics change as it catches up.
- Run kafka-consumer-groups.sh --describe and correlate the LAG column with the metric.
- Point a local Prometheus at the endpoint and graph consumer lag over time.
- Enable tracing and confirm a produced record and its consumption share one trace id.
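For the correlation step, it can help to total the lag samples from one scrape and compare the number with the sum of the LAG column. The sketch below parses Prometheus text output for that purpose. The class name is hypothetical, and the metric name `kafka_consumer_fetch_manager_records_lag` is an assumption about how the client lag gauge appears after Micrometer's Prometheus renaming, so verify it against your own /actuator/prometheus output. The parser also assumes samples carry no trailing timestamp.

```java
final class PrometheusLagParser {

    // Assumed Prometheus name of the per-partition client lag gauge.
    static final String LAG_METRIC = "kafka_consumer_fetch_manager_records_lag";

    /** Sums every sample of LAG_METRIC in a scrape body, ignoring NaN values. */
    static double totalLag(String scrapeBody) {
        double total = 0;
        for (String rawLine : scrapeBody.split("\n")) {
            String line = rawLine.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // blank line, or a HELP/TYPE comment
            }
            // A sample is "<name>{<labels>} <value>" or "<name> <value>".
            int nameEnd = line.indexOf('{');
            if (nameEnd < 0) {
                nameEnd = line.indexOf(' ');
            }
            if (nameEnd < 0 || !line.substring(0, nameEnd).equals(LAG_METRIC)) {
                continue; // a different metric, such as the _max or _avg variants
            }
            double value = Double.parseDouble(line.substring(line.lastIndexOf(' ') + 1));
            if (!Double.isNaN(value)) {
                total += value;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Illustrative scrape body, not captured from a real service.
        String body = String.join("\n",
                "# TYPE kafka_consumer_fetch_manager_records_lag gauge",
                "kafka_consumer_fetch_manager_records_lag{topic=\"orders\",partition=\"0\"} 20.0",
                "kafka_consumer_fetch_manager_records_lag{topic=\"orders\",partition=\"1\"} 5.0",
                "kafka_consumer_fetch_manager_records_lag_max{client_id=\"c1\"} 20.0");
        System.out.println(totalLag(body)); // prints 25.0
    }
}
```

Expect small differences between the two views: the client gauge is measured from the consumer's current position, while the CLI reports lag from the committed offset.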
Next: Performance Tuning and Throughput, where you use these signals to tune the system.