Observability: Metrics, Tracing and Health

Micrometer and Actuator metrics, the broker and consumer signals that matter, Prometheus and Grafana, distributed tracing, and health checks.

Prerequisite: Consumer Acknowledgements and Prefetch
You’ll need: The Spring AMQP services from earlier modules and, optionally, Prometheus and Grafana.

You cannot tune or troubleshoot what you cannot see. Observability answers three questions in production: is the system healthy right now, is it keeping up with load, and where did a slow or failed request actually spend its time. This module wires metrics, tracing, and health into the Spring services you have already built.


What you’ll be able to do after this module

  • Expose RabbitMQ and application metrics through Actuator and Micrometer.
  • Identify the broker and consumer signals worth alerting on.
  • Scrape metrics into Prometheus and chart them in Grafana.
  • Propagate a trace across a publish and consume hop.
  • Add a broker health check to your readiness probe.

1. The three pillars for a messaging system

Metrics, traces, and health each answer a different question. Together they give you a full picture.

flowchart LR
    App["Spring Boot service"]
    App -->|"Micrometer"| Prom["Prometheus"]
    Prom --> Graf["Grafana dashboards & alerts"]
    App -->|"OpenTelemetry"| Trace["Tracing backend (Tempo / Zipkin)"]
    App -->|"Actuator /health"| Probe["Orchestrator readiness probe"]

Metrics tell you what is happening at scale, traces tell you why a specific request was slow, and health tells an orchestrator whether to send traffic at all.


2. Metrics with Micrometer and Actuator

Spring Boot auto-instruments Spring AMQP. Add Actuator and a registry, and RabbitTemplate and @RabbitListener containers publish timers and counters with no extra code.

<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

# application.yml
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
  metrics:
    tags:
      application: order-service

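This exposes the RabbitMQ client's rabbitmq.published counter, the spring.rabbitmq.listener timers, and connection and channel gauges at /actuator/prometheus. In the Prometheus output the dots become underscores and unit suffixes are added, so look for names such as rabbitmq_published_total and spring_rabbitmq_listener_seconds_count. The application tag lets Grafana separate the Order, Inventory, and Notification services.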
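To confirm the RabbitMQ metrics are really in the scrape output, you can filter the response body by name prefix. The following is a minimal sketch using only the JDK. The class name ScrapeFilter and the sample text in main are hypothetical and hand-written for illustration, and the parser assumes samples carry no trailing timestamp.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: pick samples by metric-name prefix out of Prometheus text exposition output. */
final class ScrapeFilter {

    /** Returns "name{labels}" -> value for every sample whose name starts with the prefix. */
    static Map<String, Double> samplesWithPrefix(String body, String prefix) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (String line : body.split("\n")) {
            String trimmed = line.trim();
            // Lines starting with '#' are HELP/TYPE comments, not samples.
            if (trimmed.isEmpty() || trimmed.startsWith("#") || !trimmed.startsWith(prefix)) {
                continue;
            }
            // Assumption: no timestamp, so the value is whatever follows the last space.
            int split = trimmed.lastIndexOf(' ');
            if (split < 0) {
                continue;
            }
            out.put(trimmed.substring(0, split), Double.parseDouble(trimmed.substring(split + 1)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample body, not captured from a real service.
        String sample = String.join("\n",
            "# TYPE rabbitmq_published_total counter",
            "rabbitmq_published_total{application=\"order-service\"} 42.0",
            "jvm_threads_live_threads{application=\"order-service\"} 31.0");
        System.out.println(samplesWithPrefix(sample, "rabbitmq_"));
    }
}
```

In practice you would pass in the body fetched from /actuator/prometheus; an empty result for the rabbitmq_ prefix usually means the registry or the endpoint exposure is missing.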


3. The signals that matter

Do not chart everything. These few signals catch most incidents.

| Signal | Where | Why it matters |
| --- | --- | --- |
| Queue depth (messages_ready) | Broker | Rising depth means consumers are falling behind |
| Consumer ack rate vs publish rate | Broker | Publish outpacing ack is a leading lag indicator |
| Unacked messages | Broker | Stuck or slow consumers holding prefetch |
| Listener processing time | App (Micrometer) | Slow handlers, the root cause of lag |
| Redelivery / DLQ rate | Broker | Poison messages or failing handlers |
| Connection and channel count | Both | Leaks from unclosed resources |

Broker-level numbers come from the rabbitmq_prometheus plugin (enable it with rabbitmq-plugins enable rabbitmq_prometheus), while handler timing comes from your app. You need both, because a healthy broker can still hide a slow consumer.
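Queue depth and the two rates combine into a single, easy-to-read number: how long the backlog would take to clear at current rates. The following is a minimal sketch of that arithmetic. The class and method names are hypothetical, and the inputs are assumed to be values you have already read from the broker metrics.

```java
/** Sketch: turn queue depth plus publish and ack rates into a drain-time estimate. */
final class QueueLag {

    /**
     * Seconds until the backlog clears at the current rates.
     * Returns 0 for an empty queue and positive infinity when the backlog cannot shrink.
     */
    static double secondsToDrain(long messagesReady, double publishPerSec, double ackPerSec) {
        if (messagesReady <= 0) {
            return 0.0;
        }
        // Only the surplus of acks over publishes actually reduces the backlog.
        double netDrain = ackPerSec - publishPerSec;
        if (netDrain <= 0) {
            return Double.POSITIVE_INFINITY;
        }
        return messagesReady / netDrain;
    }

    public static void main(String[] args) {
        // 6000 ready, consumers ack 50 msg/s faster than publishers send: 6000 / 50
        System.out.println(secondsToDrain(6_000, 100.0, 150.0)); // prints 120.0
        // Publishers outpace consumers: the queue only grows
        System.out.println(secondsToDrain(6_000, 150.0, 100.0)); // prints Infinity
    }
}
```

An infinite result is the "publish outpacing ack" row of the table expressed as a number: depth may still look small, but it will not recover without more consumer capacity.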


4. Prometheus and Grafana

Prometheus scrapes each service and the broker on a fixed interval; Grafana queries Prometheus for dashboards and alerts.

flowchart LR
    OS["order-service :8080/actuator/prometheus"]
    IS["inventory-service"]
    RB["rabbitmq :15692/metrics"]
    Prom["Prometheus"]
    Graf["Grafana"]
    OS --> Prom
    IS --> Prom
    RB --> Prom
    Prom --> Graf

# prometheus.yml
scrape_configs:
  - job_name: 'order-service'
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ['order-service:8080']
  - job_name: 'rabbitmq'
    static_configs:
      - targets: ['rabbitmq:15692']

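A good starting alert: warn when a queue’s messages_ready (exported by the plugin as rabbitmq_queue_messages_ready) stays above a threshold for several minutes, which is the metric version of the Queue Depth and Consumer Lag playbook. The plugin aggregates metrics across queues by default, so per-queue alerting needs per-object metrics: scrape /metrics/per-object or set prometheus.return_per_object_metrics = true.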
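The "stays above a threshold for several minutes" condition is what keeps a brief spike from paging anyone. Prometheus alerting rules express it with a for clause. The sketch below only illustrates the semantics in plain Java, with a hypothetical class name and made-up sample values.

```java
import java.time.Duration;
import java.time.Instant;

/** Sketch: fire only when a value has stayed above a threshold for a minimum duration. */
final class SustainedThreshold {
    private final double threshold;
    private final Duration holdFor;
    private Instant breachedSince; // null while the value is at or below the threshold

    SustainedThreshold(double threshold, Duration holdFor) {
        this.threshold = threshold;
        this.holdFor = holdFor;
    }

    /** Feed one sample; returns true once the breach has lasted at least holdFor. */
    boolean observe(Instant at, double value) {
        if (value <= threshold) {
            breachedSince = null; // any dip below the threshold resets the clock
            return false;
        }
        if (breachedSince == null) {
            breachedSince = at;
        }
        return Duration.between(breachedSince, at).compareTo(holdFor) >= 0;
    }

    public static void main(String[] args) {
        Instant t0 = Instant.parse("2024-01-01T00:00:00Z"); // arbitrary start time
        SustainedThreshold alert = new SustainedThreshold(1_000, Duration.ofMinutes(5));
        System.out.println(alert.observe(t0, 1_500));                  // false: breach just started
        System.out.println(alert.observe(t0.plusSeconds(120), 1_500)); // false: only 2 minutes
        System.out.println(alert.observe(t0.plusSeconds(300), 1_500)); // true: 5 minutes sustained
    }
}
```

Resetting on every dip is a deliberate choice: a queue that oscillates around the threshold is usually keeping up, while one that never dips is not.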


5. Distributed tracing

A message breaks the call stack: the publisher returns immediately and the consumer runs later on another thread or host. Tracing stitches these back together by carrying a trace context in the message headers.

sequenceDiagram
    participant O as Order Service
    participant B as RabbitMQ
    participant I as Inventory Service
    O->>O: start span "POST /orders" (traceId=abc)
    O->>B: publish OrderCreated (headers carry traceId=abc)
    B->>I: deliver
    I->>I: continue span (same traceId=abc)
    Note over O,I: One trace spans both services

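With Micrometer Tracing plus an OpenTelemetry bridge, Spring AMQP propagates the context across the publish and consume boundary once observation is switched on. Spring AMQP 3.x leaves it off by default, so call setObservationEnabled(true) on both the RabbitTemplate and the listener container factory.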

<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
  <groupId>io.opentelemetry</groupId>
  <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>

Now a single traceId links the HTTP request, the publish, and the consumer’s processing, so you can see exactly where an order slowed down.
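Micrometer Tracing and OpenTelemetry handle propagation for you, but it helps to see how small the mechanism is. The sketch below imitates it with a plain header map and the W3C traceparent format that OpenTelemetry propagates by default. The class and method names are hypothetical; this is not the library's code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

/** Sketch: carry a W3C traceparent value in message headers across a publish/consume hop. */
final class TraceHeaders {
    static final String HEADER = "traceparent";

    /** Publisher side: write the current trace id and a fresh span id into the headers. */
    static void inject(Map<String, Object> headers, String traceId) {
        // Span id is 16 hex characters; %016x renders the long as unsigned hex.
        String spanId = String.format("%016x", ThreadLocalRandom.current().nextLong());
        // Layout: version "00", 32-hex trace id, 16-hex span id, flags "01" (sampled).
        headers.put(HEADER, "00-" + traceId + "-" + spanId + "-01");
    }

    /** Consumer side: recover the trace id, or null if the header is absent or malformed. */
    static String extractTraceId(Map<String, Object> headers) {
        Object raw = headers.get(HEADER);
        if (!(raw instanceof String)) {
            return null;
        }
        String[] parts = ((String) raw).split("-");
        return parts.length == 4 && parts[1].length() == 32 ? parts[1] : null;
    }

    public static void main(String[] args) {
        Map<String, Object> headers = new HashMap<>();               // stands in for AMQP message headers
        inject(headers, "4bf92f3577b34da6a3ce929d0e0e4736");         // Order Service publishes
        System.out.println(extractTraceId(headers));                 // Inventory Service continues the trace
    }
}
```

The consumer starts a new span with its own span id but reuses the extracted trace id, which is why both services show up under one trace in the backend.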


6. Health checks

Actuator ships a RabbitHealthIndicator that checks broker connectivity. It is included automatically and appears in /actuator/health.

{
  "status": "UP",
  "components": {
    "rabbit": { "status": "UP", "details": { "version": "3.13.0" } }
  }
}

Split liveness from readiness so a broker blip does not kill your pod:

management:
  endpoint:
    health:
      probes:
        enabled: true
      group:
        readiness:
          # keep readinessState in the group; drop db if the service has no DataSource
          include: readinessState,rabbit,db

Map the orchestrator’s readiness probe to /actuator/health/readiness. When the broker is unreachable the pod is pulled from rotation but not restarted, and it recovers automatically per Connection Recovery.
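The reason the pod leaves rotation without being restarted is that each probe aggregates a different set of contributors. The sketch below models that with a simplified two-value status. Actuator's real aggregation has more statuses, and the names here are hypothetical.

```java
import java.util.Map;

/** Sketch: a probe group is UP only if every contributor in that group is UP. */
final class ProbeGroups {
    enum Status { UP, DOWN }

    static Status aggregate(Map<String, Status> contributors) {
        boolean allUp = contributors.values().stream().allMatch(s -> s == Status.UP);
        return allUp ? Status.UP : Status.DOWN;
    }

    public static void main(String[] args) {
        // Broker is unreachable; the process itself is fine.
        Map<String, Status> readiness = Map.of("rabbit", Status.DOWN, "db", Status.UP);
        Map<String, Status> liveness = Map.of("livenessState", Status.UP);

        System.out.println("readiness=" + aggregate(readiness)); // DOWN: stop routing traffic
        System.out.println("liveness=" + aggregate(liveness));   // UP: do not restart the pod
    }
}
```

Keeping the broker check out of the liveness group is the design choice that matters: restarting a pod does nothing to fix an unreachable broker.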


Checkpoint

You should now be able to:

  • Expose Prometheus metrics from a Spring AMQP service.
  • Name the handful of signals worth alerting on.
  • Scrape services and the broker into Prometheus and Grafana.
  • Explain how a trace propagates across a publish and consume hop.
  • Wire the broker health check into a readiness probe.

Next: Performance Tuning and Throughput.