TLS & Certificate Expiry: RabbitMQ Incident Guide

1. Symptom

Spring Boot apps start failing to connect (or reconnect) to RabbitMQ on port 5671 (the TLS AMQP port, see AWS Architecture), with SSLHandshakeException in the application logs. This is the exact log signature flagged in the Tooling Walkthrough log-signature table as pointing to this playbook.

This alert takes two shapes, and telling them apart in the first 60 seconds is the whole game:

Shape	What you see	What it usually means
Every app, every instance, all at once	Every Spring Boot service talking to the cluster starts throwing `SSLHandshakeException` simultaneously, across all environments/instances	The broker’s own server-side certificate has expired or is otherwise broken: a total outage for all TLS clients
One app, or one team’s app, others fine	Only `payments-service` (say) fails, while every other consumer/producer on the same cluster connects normally	A client certificate (mTLS) has expired for that one app, or that one app’s truststore is stale: not a broker problem

Rule of thumb for this playbook: broker cert expiry breaks everyone at once; client cert expiry breaks one app at a time. Confirming which of these you’re looking at is Diagnostic Step 3 below, and it determines almost everything downstream: who you escalate to, whether it’s urgent-for-everyone or urgent-for-one-team, and whether you can safely help directly or need to wait on another team.

Why TLS exists here at all: in this environment, AMQP connections to the broker are encrypted (port 5671, not the plaintext 5672) and, if mutual TLS (mTLS) is configured, the broker also authenticates which client is connecting using a client certificate, in addition to (or sometimes instead of) username/password. Certificates issued by the internal CA are stored and rotated via AWS Secrets Manager or ACM. This playbook is about what happens when that rotation doesn’t happen on time or doesn’t happen everywhere it needs to.

What actually happens during the handshake (enough to triage, not a TLS deep-dive)

You don’t need to be a TLS expert to triage this, but you do need to know which side does what, because the error message tells you which step failed:

Client Hello: the Spring Boot app (via the Java TLS stack under Spring AMQP) opens a TCP connection to port 5671 and says “let’s negotiate TLS,” offering supported protocol versions/ciphers.
Server presents its certificate chain: the RabbitMQ broker sends its server certificate (the “leaf” cert identifying the broker) plus any intermediate CA certificates needed to chain up to a trusted root.
Client validates the server’s certificate: the JVM checks three independent things (using its trust store for the chain check), and each has a different failure signature:
- Expiry: is notBefore <= now <= notAfter? Fails as certificate expired (or “not yet valid” for clock-skew cases).
- Trust chain: does this cert chain up to a CA the client trusts? Fails as unable to find valid certification path to requested target.
- Hostname: does the cert’s CN/SAN match the hostname the client actually connected to? Fails as a hostname-mismatch error, distinct from the two above.
(mTLS only) Server requests and validates the client’s certificate: if the broker is configured for mutual TLS, it asks the client to present its own certificate, and performs the same expiry/trust-chain checks against it. If the client cert is expired or untrusted, the broker rejects the handshake. This shows up broker-side in RabbitMQ’s logs, and client-side as a generic handshake failure that can be harder to read than a server-cert failure.
If all checks pass on both sides, the TLS session is established and the AMQP protocol handshake proceeds on top of it as normal.

The practical payoff: “certificate expired” vs. “unable to find valid certification path” are different bugs. The first is a pure expiry problem (Section 2). The second is a trust-chain problem, often a missing/expired intermediate CA cert, or a client truststore that was never updated after a broker cert rotation, and needs different remediation.
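
The mapping from exception text to failed check can be sketched as a small shell helper. This is an illustrative sketch, not existing tooling: the function name classify_tls_error is hypothetical, and the matched strings assume typical JDK wording (the two hostname patterns in particular are assumptions about how your JVM phrases a mismatch).

```shell
# Hypothetical helper (not part of any existing tooling): read a Java TLS
# exception message on stdin and print which handshake check failed.
classify_tls_error() (
  # Read the whole message once so every pattern sees the same text.
  msg=$(cat)
  case "$msg" in
    *CertificateNotYetValidException*|*"not yet valid"*)
      echo "not-yet-valid" ;;   # check clock sync before anything else
    *CertificateExpiredException*|*"certificate expired"*)
      echo "expiry" ;;          # pure expiry problem
    *"unable to find valid certification path"*)
      echo "trust-chain" ;;     # missing intermediate or stale truststore
    *"No subject alternative"*|*"No name matching"*)
      echo "hostname" ;;        # assumed JDK wording for a CN/SAN mismatch
    *)
      echo "unknown" ;;
  esac
)
```

For example, piping the relevant lines of an app log into it (grep SSLHandshakeException -A 5 app.log | classify_tls_error) gives a one-word answer to carry into the diagnostic steps. Treat "unknown" as "read the stack trace by hand", not as "no TLS problem".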

2. Likely Causes

Broker-side (affects all/most clients)

Cause	How it manifests
Server certificate genuinely expired: a scheduled renewal/rotation automation job failed silently	All TLS clients fail at (or shortly after) the exact expiry timestamp; broker logs show nothing wrong because from the broker’s perspective it’s just presenting the cert it has
An intermediate CA certificate in the chain expired, even though the broker’s own leaf cert is still valid	Confusing case: the leaf cert “looks fine” if you only check its own dates, but clients still fail trust-chain validation because the chain up to the root is broken
Certificate renewal was deployed to some nodes but not all three	Inconsistent, seemingly random failures: a client connecting (or reconnecting, or load-balanced) to `rmq-1` or `rmq-2` (already rotated) works fine, but the same client hitting `rmq-3` (not yet rotated) fails. Easy to misdiagnose as “flaky” rather than a partial rotation

App-side (Spring Boot): affects one app/instance at a time

Cause	How it manifests
Client certificate (used for mTLS) expired and the app was never redeployed with a fresh one	Only that app fails; broker-side logs (if mTLS) show a rejected client cert; other apps using different certs/instances are unaffected
App’s truststore is outdated: broker cert was rotated correctly, but this app’s bundled/mounted truststore still only trusts the old (now-replaced) cert or CA	Looks like a broker problem (`SSLHandshakeException` on connect) but only affects apps that haven’t picked up the new trust material: often the ones that weren’t redeployed recently
System clock skew on either the app host or the broker node	A cert that is genuinely valid appears expired or “not yet valid” because one side’s clock is wrong. Easy to misdiagnose as a real expiry issue since the exception message looks identical

Notice the pattern: broker-side causes are almost always about rotation not completing everywhere it needed to. App-side causes are almost always about the app not picking up material that changed elsewhere (a fresh client cert, or a new thing to trust). Both are fundamentally “stale material” problems: the fix is nearly always getting the right cert/trust data into the right place, never disabling the check itself (see Section 4).

3. Diagnostic Steps

Work top to bottom; the cheapest, fastest checks come first.

1. Read the actual exception in the Spring app logs. The message tells you which validation step failed (see the handshake breakdown above):
- ... certificate expired → pure expiry problem; go to step 2.
- ... unable to find valid certification path to requested target → trust-chain problem (missing intermediate, stale truststore), not simple expiry. Treat this as more likely to need security/platform help (Section 5).
- ... not yet valid → check clock sync (step 6) before assuming it’s a real cert problem.
2. Check the broker certificate’s actual expiry directly, independent of any app’s logs, using openssl s_client against port 5671 (command in Section 6). This tells you the ground truth regardless of which client is complaining.
3. Check whether ALL clients are failing, or only some. Look at the Management UI Connections tab (or ask in the incident channel): are apps across multiple teams failing simultaneously, or is it isolated to one service?
- All/most failing → broker-side server cert problem. Skip to escalation (Section 5); this is a near-total outage for TLS clients.
- One app failing → client cert or that app’s truststore. Continue to step 4.
4. Repeat the openssl s_client check against each of the 3 broker nodes individually (not just whichever one DNS/load balancing happens to route you to) if you suspect partial rotation. Different expiry dates or different cert fingerprints per node confirm a partial rotation rather than a clean expiry.
5. Check whether this was a scheduled/known rotation window. Check with the platform/security team or your change calendar: a cert expiring exactly when a rotation was scheduled (but apparently didn’t finish) is a very different situation from a surprise expiry nobody planned for.
6. If the error is ambiguous (“not yet valid,” or expiry dates that don’t match what you’d expect), check clock sync on the affected host:
```
timedatectl status
```
Significant drift (more than a minute or two) can make a perfectly valid certificate look expired or not-yet-valid to that specific host.
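
If you want a yes/no answer out of that output, a minimal sketch follows. The function name clock_sync_state is hypothetical, and the two label spellings it matches are assumptions about what your systemd version prints (newer releases say "System clock synchronized", older ones "NTP synchronized").

```shell
# Hypothetical helper: read `timedatectl status` output on stdin and print
# whether systemd reports the clock as synchronized.
clock_sync_state() (
  status=$(cat)
  case "$status" in
    *"System clock synchronized: yes"*|*"NTP synchronized: yes"*)
      echo "synced" ;;
    *"System clock synchronized: no"*|*"NTP synchronized: no"*)
      echo "unsynced" ;;   # drift is possible; measure it before blaming the cert
    *)
      echo "unknown" ;;    # label not found; read the output by hand
  esac
)
```

Usage on the suspect host: timedatectl status | clock_sync_state. This only reports the sync flag, not the size of the drift, so an "unsynced" result is a prompt to compare the host’s time against a known-good clock rather than proof of skew.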

Step	Question it answers	Typical time cost
1. Read the exception message	Expiry vs. trust-chain vs. clock-skew?	seconds
2. `openssl s_client` against the broker	What’s the actual cert expiry, independent of any app’s opinion?	1 min
3. Management UI Connections / cross-team check	Is this everyone (broker) or just one app (client)?	1-2 min
4. `openssl s_client` per node	Is rotation incomplete on some nodes only?	2-3 min
5. Change calendar / platform team check	Known rotation gone wrong, or a total surprise?	2-5 min
6. `timedatectl status`	Is this even a real cert issue, or clock skew?	1 min

4. Safe Remediations

Situation	Safe action
Known, scheduled cert rotation that’s simply incomplete (e.g., propagated to 2 of 3 nodes)	Coordinate with the platform/security team to finish rolling it out to the remaining node(s): not something support does solo, since it involves deploying new cert material to broker nodes
App’s client cert or truststore has been correctly updated/redeployed already, but the running instance still fails	Restarting that app instance to pick up the new cert/truststore material is generally safe: this is analogous to the credential-rotation problem in Playbook 05, where a running process caches what it started with
Broker server cert has genuinely expired with no rotation in progress	Not self-service: this needs whoever owns cert issuance/rotation to issue and deploy a new cert immediately. Escalate (Section 5)
Unsure whether it’s expiry or a trust-chain problem after initial diagnosis	Don’t guess: hand off to the security/platform team with the exact exception message and `openssl s_client` output from Section 6; trust-chain issues often need CA-level context you won’t have

⚠️ Caution: never disable TLS verification as a “quick fix.” Setting a permissive/no-op TrustManager, or spring.rabbitmq.ssl.validate-server-certificate=false, will make the SSLHandshakeException go away, but it does so by silently turning off the security guarantee TLS exists to provide, for that connection, indefinitely, until someone remembers to undo it (often nobody does). This is never an acceptable routine incident response. The only time this kind of override is acceptable is as an explicitly approved, time-boxed, tracked exception signed off by security/platform, not a support-tier judgment call under pressure.

⚠️ Caution: do not restart apps or broker nodes speculatively “to see if it fixes it.” Restarting an app only helps if its underlying cert/truststore material has already been corrected upstream. Restarting before that just reproduces the same failure and burns time you could have spent on Section 3’s diagnosis.
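
Because a verification override tends to outlive the incident that prompted it, it can help to look for one in an app’s configuration. The sketch below is illustrative only: the function name find_disabled_tls_checks is hypothetical, it covers only the two Spring Boot properties named in its pattern (in .properties or YAML form), and it cannot see a no-op TrustManager set in code.

```shell
# Hypothetical audit helper: print, with line numbers, any config lines read
# from stdin that switch off server-certificate or hostname verification.
# Assumption: config is Spring Boot .properties or YAML text.
find_disabled_tls_checks() {
  # `|| true` keeps the exit status at 0 when nothing matches.
  grep -nE '(validate-server-certificate|verify-hostname)[[:space:]]*[:=][[:space:]]*false' || true
}
```

Usage: find_disabled_tls_checks < application.properties. Empty output means neither override was found in that file; any output is something to raise with security/platform, not to fix silently.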

5. Escalation Trigger

Stop and page on-call engineering / the team owning cert issuance and rotation (per Escalation and Communication) if any of these are true:

- The broker’s own server certificate has expired or is otherwise broken: this affects every TLS client at once and is a near-total outage; it needs the cert-owning team immediately, not a support-tier workaround.
- You cannot tell, after Section 3’s diagnosis, whether this is a pure expiry issue or a trust-chain issue (unable to find valid certification path). Trust-chain problems often involve intermediate/root CA details that need security/platform expertise.
- Partial rotation across the 3 broker nodes is confirmed and needs to be completed. This is a broker-topology change, not something support applies directly.
- A fix would require disabling or weakening TLS verification in any way. That decision is never support’s to make alone; escalate for an explicit, approved exception if one is genuinely needed.

6. Relevant Commands/Queries

# Check the broker's actual TLS certificate validity dates directly, from any
# host that can reach port 5671 (no app involved: ground truth check)
openssl s_client -connect <broker-host>:5671 -servername <broker-host> </dev/null 2>/dev/null \
  | openssl x509 -noout -dates

# Example output
notBefore=Jan 15 00:00:00 2025 GMT
notAfter=Jul  1 23:59:59 2026 GMT

If notAfter is in the past (or suspiciously close to “now”), you’ve confirmed the broker cert itself is the problem, independent of anything any client is reporting.
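
To turn “in the past or suspiciously close” into something checkable, here is a minimal sketch. The function name expiry_state and the warning-window parameter are hypothetical, and the date parsing assumes GNU date (date -d); BSD/macOS date uses different flags.

```shell
# Hypothetical helper: classify a `notAfter=...` line from `openssl x509 -dates`.
# Assumes GNU date. Arguments:
#   $1 = the notAfter line, e.g. "notAfter=Jul  1 23:59:59 2026 GMT"
#   $2 = warning window in days
#   $3 = reference time ("now") as epoch seconds
expiry_state() (
  not_after=${1#notAfter=}                         # strip the "notAfter=" prefix
  expires=$(date -u -d "$not_after" +%s) || exit 2 # unparseable date: give up
  remaining=$(( expires - $3 ))
  if [ "$remaining" -le 0 ]; then
    echo "expired"
  elif [ "$remaining" -le $(( $2 * 86400 )) ]; then
    echo "expiring-soon"
  else
    echo "ok"
  fi
)
```

Usage: expiry_state "$(openssl s_client ... | openssl x509 -noout -enddate)" 14 "$(date +%s)". For a single certificate file, openssl x509 -checkend <seconds> answers a similar question through its exit status without any date parsing.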

# Check the full chain presented, useful for spotting an expired intermediate
# even when the leaf certificate's own dates look fine
openssl s_client -connect <broker-host>:5671 -servername <broker-host> -showcerts </dev/null 2>/dev/null

# Repeat against each of the 3 nodes individually to catch partial rotation
for host in rmq-1.internal rmq-2.internal rmq-3.internal; do
  echo "== $host =="
  openssl s_client -connect "$host:5671" -servername "$host" </dev/null 2>/dev/null \
    | openssl x509 -noout -dates -fingerprint
done
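
To compare the per-node results without eyeballing them, a small sketch follows. The function name distinct_fingerprints is hypothetical; it matches the "Fingerprint=" label case-insensitively because the exact capitalization is an assumption about your OpenSSL version.

```shell
# Hypothetical helper: read the per-node loop output on stdin and print how
# many distinct certificate fingerprints appear across the nodes.
distinct_fingerprints() {
  grep -i 'fingerprint=' | cut -d= -f2 | sort -u | wc -l | tr -d ' '
}
```

Pipe the loop’s output into it: a result of 1 means every node presented the same certificate, and anything higher is consistent with a partial rotation. If your nodes are deliberately issued separate certificates, fingerprints will differ by design, so compare the notAfter dates instead.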

# Check clock sync on a suspect host (via SSM Session Manager, no SSH)
timedatectl status

# Example Spring Boot stack trace: pure expiry (broker or client cert)
org.springframework.amqp.AmqpConnectException: java.io.IOException
Caused by: javax.net.ssl.SSLHandshakeException: PKIX path validation failed:
    java.security.cert.CertPathValidatorException: validity check failed
Caused by: java.security.cert.CertificateExpiredException:
    NotAfter: Mon Jun 30 23:59:59 UTC 2026
    at sun.security.x509.CertificateValidity.valid(CertificateValidity.java:335)

# Example Spring Boot stack trace: trust-chain problem, NOT simple expiry
org.springframework.amqp.AmqpConnectException: java.io.IOException
Caused by: javax.net.ssl.SSLHandshakeException:
    PKIX path building failed:
    sun.security.provider.certpath.SunCertPathBuilderException:
    unable to find valid certification path to requested target

7. Mini Practical

Reproduce a broker-side expired-certificate outage locally, watch a Spring Boot app fail against it, then fix it.

Step 1: Generate a lab CA and a short-lived (already-expired) server certificate signed by it:

mkdir -p ~/rmq-tls-lab && cd ~/rmq-tls-lab

# Generate a CA, then a server cert signed by it, valid for only 1 minute
openssl req -x509 -newkey rsa:2048 -days 1 -nodes \
  -keyout ca_key.pem -out ca_cert.pem \
  -subj "/CN=crashcourse-internal-ca"

openssl req -newkey rsa:2048 -nodes \
  -keyout server_key.pem -out server_req.pem \
  -subj "/CN=localhost"

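# In `openssl x509`, -startdate/-enddate only print a cert's dates; setting
# them needs -not_before/-not_after (OpenSSL 3.4 or newer). On older releases,
# `openssl ca` with -startdate/-enddate is the usual way to set explicit dates.
openssl x509 -req -in server_req.pem -CA ca_cert.pem -CAkey ca_key.pem \
  -CAcreateserial -out server_cert.pem \
  -not_before 20260101000000Z -not_after 20260101000100Z   # valid for 1 minute, in the past

Since -not_after is set well in the past relative to today, this cert is already expired the moment it’s created, reproducing the broker-cert-expired scenario without waiting around.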

Step 2: Write a minimal rabbitmq.conf enabling TLS on 5671 with this cert:

listeners.ssl.default = 5671
ssl_options.cacertfile = /certs/ca_cert.pem
ssl_options.certfile   = /certs/server_cert.pem
ssl_options.keyfile    = /certs/server_key.pem
ssl_options.verify     = verify_none
ssl_options.fail_if_no_peer_cert = false

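Step 3: Run RabbitMQ with the cert mounted and TLS port exposed. On Linux hosts the bind-mounted key keeps its host ownership and mode, so if the broker logs a permission error for server_key.pem, make this throwaway lab key readable first (chmod 644 server_key.pem):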

docker run -d --name rabbitmq-tls-lab \
  -p 5671:5671 -p 15672:15672 \
  -v ~/rmq-tls-lab/rabbitmq.conf:/etc/rabbitmq/rabbitmq.conf:ro \
  -v ~/rmq-tls-lab:/certs:ro \
  rabbitmq:3.13-management

Step 4: Confirm the expiry from the outside, exactly as you would in a real incident:

openssl s_client -connect localhost:5671 -servername localhost </dev/null 2>/dev/null \
  | openssl x509 -noout -dates
# notAfter should already be in the past

Step 5: Point a Spring Boot AMQP client at it over TLS:

spring:
  rabbitmq:
    host: localhost
    port: 5671
    ssl:
      enabled: true
      trust-store: file:${HOME}/rmq-tls-lab/truststore.p12
      trust-store-type: PKCS12
      trust-store-password: changeit

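(Create the truststore first by importing ca_cert.pem: keytool -importcert -alias lab-ca -file ca_cert.pem -keystore truststore.p12 -storetype PKCS12 -storepass changeit -noprompt, where the alias lab-ca is arbitrary. The trust-store path above must exist, and the CA needs to be trusted for the failure to surface as the expiry signature; without it, the JVM will typically report the trust-chain error, unable to find valid certification path, instead.)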

Start the app and confirm you get the same SSLHandshakeException / CertificateExpiredException shown in Section 6. This is the exact signature the Tooling Walkthrough tells you to route to this playbook.

Step 6: Fix it: generate a valid-dated cert and swap it in:

openssl x509 -req -in server_req.pem -CA ca_cert.pem -CAkey ca_key.pem \
  -CAcreateserial -out server_cert.pem -days 365   # valid starting now, for a year

docker restart rabbitmq-tls-lab

Step 7: Re-run the openssl s_client check and re-attempt the Spring Boot connection. The dates should now show a valid notBefore/notAfter window and the app should connect cleanly, confirming the fix the same way you’d confirm it in production: independently, at the TLS layer, before trusting the app’s own “it works now.”

✅ Checkpoint

You should now be able to:

Explain, from the failure pattern alone (everyone vs. one app), whether an SSLHandshakeException incident is a broker cert problem or a client cert/truststore problem.
Run openssl s_client ... | openssl x509 -noout -dates against a broker’s TLS port to get ground-truth cert expiry, independent of any app’s logs.
Distinguish a pure expiry error (CertificateExpiredException) from a trust-chain error (unable to find valid certification path) and explain why they need different remediation paths.