Auth Failures After Credential Rotation: RabbitMQ Incident Guide

Prerequisite: Connection & Channel Exhaustion

1. Symptom

An app that has been running fine for hours or days, sometimes since its last deploy, suddenly starts throwing authentication errors against RabbitMQ. There was no code change, no deploy, and often no obvious trigger at all from the app team’s point of view. What actually happened is usually one of:

- A network blip closed the app’s existing AMQP connection and it’s now failing to reconnect.
- A RabbitMQ node restarted or failed over, forcing every client connected to it to reconnect.
- Someone bounced the app itself for an unrelated reason, and it re-established its connection from scratch.

In every case, the failure shows up as PossibleAuthenticationFailureException in the Spring Boot app logs (first flagged in Tooling Walkthrough) and/or ACCESS_REFUSED in the broker log. /actuator/health flips to "rabbit": {"status": "DOWN"}.

The tell that points straight at this playbook: the timing is delayed and disconnected from the actual cause. AWS Architecture, section 5, already covered why: credentials are checked once, when the AMQP connection is opened, so a running Spring AMQP ConnectionFactory never re-authenticates an already-open connection. If AWS Secrets Manager rotated the RabbitMQ user’s password three days ago and the app has not dropped its connection since, it has been publishing and consuming normally the whole time on credentials that are now stale.

The failure only surfaces the moment that connection breaks and Spring AMQP tries to reconnect using whatever password it currently has in memory. If the app never refreshed that value, it reconnects with the old password, the broker (which now has the new one) rejects it, and you get an auth failure that looks like it “came out of nowhere”, often triggered by something totally unrelated to secrets, like a node failover from Playbook 03.

This playbook is about untangling that delay: confirming whether you’re looking at a genuine broker/Secrets Manager mismatch, or an app instance that simply needs to re-fetch a secret it already should have.

2. Likely Causes

Broker-side

Cause	How it manifests
Rotation Lambda/automation updated Secrets Manager but failed to update the RabbitMQ user’s actual password	Secrets Manager shows a new secret value; every app instance that fetches it fresh still fails to authenticate, because the broker-side user still has the old password. This is a genuine desync, not a stale-cache problem.
Vhost permissions changed at the same time as the rotation	Authentication succeeds (username/password are correct) but the app fails immediately after with an authorization error on publish/consume: a different failure shape from a plain auth rejection, worth distinguishing (see Diagnostic Steps).
The RabbitMQ user itself was deleted or renamed as part of a cleanup that wasn’t coordinated with the rotation schedule	`list_users` shows the account missing entirely: every instance fails, immediately and permanently, not just on reconnect.

App-side (Spring Boot)

Cause	How it manifests
App fetches the secret once at startup and never refreshes it	The classic pattern: `spring.rabbitmq.password` is resolved once during Spring context startup (directly, or via the Secrets Manager/Spring Cloud Config integration) and cached in the `ConnectionFactory` for the life of the JVM. Nothing refreshes it later: the app has no idea a rotation happened until it needs to open a new connection.
App was never restarted/redeployed since the last rotation	The single most common root cause in practice. The app is holding the pre-rotation password in memory; it worked the whole time on its existing connection; the moment that connection drops, reconnection fails.
Instances restarted at different times relative to the rotation (staggered restarts)	If some instances happened to restart (deploy, autoscaling, OOM) after the rotation and some didn’t, you get a partial failure: some instances authenticate fine on fresh connections, others fail. This looks confusing (“why does it work for 2 out of 5 pods?”) until you check restart timestamps.
App-side connection pool/cache retries with the same bad credential in a tight loop	`CachingConnectionFactory` will keep attempting to reconnect using the same in-memory credential on every retry: it has no mechanism to “notice” the password changed and go fetch a new one. Retries alone will never fix this; only a restart (or an explicit secret refresh mechanism) will.

Broker-side and app-side causes produce nearly identical symptoms (PossibleAuthenticationFailureException, ACCESS_REFUSED). The fastest way to tell them apart: restart one affected instance. If it comes back healthy, it was stale credentials, app-side. If it still fails immediately after restart (meaning it fetched the current secret and still got rejected), the broker-side password and Secrets Manager are genuinely out of sync.

3. Diagnostic Steps

Work top to bottom: the cheapest, fastest checks come first.

1. Confirm the log signature. Grep the affected app’s logs for PossibleAuthenticationFailureException (app-side) and check the broker logs (via SSM) for ACCESS_REFUSED around the same timestamp. Finding both in the same time window confirms this playbook over a network/SG issue: auth failures are rejected fast; they don’t time out.
2. Establish timing. When did the failures start, and does that line up with a known Secrets Manager rotation event (scheduled or manual/incident-driven) or a broker-side maintenance/restart/failover? Check the Secrets Manager rotation history for the secret in question and cross-reference with CloudTrail if you need exact timestamps. A gap of hours or days between “rotation happened” and “alerts fired” is expected and consistent with this playbook; it does not mean the rotation is unrelated.
3. Confirm the user still exists and check its permissions. This rules out “someone deleted the account” or “vhost permissions changed” as the actual root cause, as distinct from a plain password mismatch:
   ```
   rabbitmqctl list_users
   rabbitmqctl list_permissions -p /
   ```
   If the user is missing entirely, or permissions look wrong (unexpectedly empty configure/write/read patterns), that’s a broker-side config problem, not a stale-secret problem. Skip to Escalation.
4. Check whether the affected instance(s) were restarted since the last rotation. Compare instance start time (deployment/orchestration tooling, or simply process uptime) against the rotation timestamp from step 2. An instance that has been running since before the last rotation is holding a stale credential by definition; this is the classic signature.
5. Compare behavior across all instances of the same app. Pull /actuator/health from every instance, or check rabbitmqctl list_connections to see which app instances currently hold an open, authenticated connection and which are absent/failing.
   - All instances failing → likely a genuine broker-side/Secrets Manager desync (or a cluster-wide event, such as a full broker restart, that hit everyone at once).
   - Only some instances failing → strong signal of a partial/staggered rotation issue: the failing ones simply haven’t restarted since the rotation, while the healthy ones have (or started after it).
6. Compare across apps. If multiple unrelated apps/services are failing at the same time, this stops being a single app’s stale-cache problem and starts looking systemic. Treat it as a broker-side/rotation-automation issue and move to escalation.
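The start-time-versus-rotation comparison can be scripted once both timestamps are in hand. The following is a minimal sketch, not part of any existing tooling: the function names are hypothetical, and it assumes both timestamps have already been normalized to the same UTC format (`YYYY-MM-DDTHH:MM:SSZ`), which is what makes a plain string comparison valid.

```shell
# is_iso_utc TS: succeeds only for timestamps shaped like 2026-06-29T02:00:00Z.
# (Hypothetical helper; the fixed format is an assumption of this sketch.)
is_iso_utc() {
  case "$1" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]:[0-9][0-9]Z) return 0 ;;
    *) return 1 ;;
  esac
}

# holds_stale_credential INSTANCE_START LAST_ROTATION
# Exit 0: the instance started before the rotation (stale credential likely).
# Exit 1: the instance started at or after the rotation.
# Exit 2: a timestamp is not in the expected format.
holds_stale_credential() {
  is_iso_utc "$1" || return 2
  is_iso_utc "$2" || return 2
  # Same-format UTC timestamps sort chronologically as plain strings.
  awk -v start="$1" -v rotated="$2" 'BEGIN { exit !(start < rotated) }'
}

# Usage (timestamps are made-up examples):
#   holds_stale_credential "2026-06-28T09:30:00Z" "2026-06-29T02:00:00Z" \
#     && echo "stale: started before the last rotation"
```

The exit-code design lets the check slot into a larger script or a one-line `&&` on the command line; the format guard is there so a mismatched timestamp fails loudly instead of producing a wrong answer.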

Step	Question it answers	Typical time cost
1. Log signature	Auth failure, or something else (network/TLS)?	1-2 min
2. Timing vs. rotation event	Does this line up with a known rotation?	2-3 min
3. `list_users` / `list_permissions`	Does the account and its permissions still look right broker-side?	1 min
4. Instance restart time vs. rotation time	Is this instance holding a stale credential by definition?	2-3 min
5. Cross-instance comparison	Partial (stale cache) or total (broker desync) failure?	2-3 min
6. Cross-app comparison	Isolated to one app, or systemic?	1-2 min
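For the cross-instance comparison, the partial-versus-total distinction can be sketched as a small filter. This is an illustrative helper only: `classify_fleet` is a hypothetical name, and it assumes you have already collected one `<instance> <status>` pair per line (for example from each instance’s /actuator/health).

```shell
# classify_fleet: reads "<instance> <status>" lines on stdin and prints
# a one-line verdict. Any status other than UP counts as failing.
classify_fleet() {
  awk '
    BEGIN { total = 0; failing = 0 }
    NF >= 2 { total++; if ($2 != "UP") failing++ }
    END {
      if (total == 0)
        print "no data"
      else if (failing == 0)
        print "healthy: no instance is failing"
      else if (failing == total)
        print "total: suspect broker-side desync or a cluster-wide event"
      else
        print "partial: suspect stale credentials on " failing " of " total " instances"
    }
  '
}

# Example with made-up instance names:
printf '%s\n' "pod-a UP" "pod-b DOWN" "pod-c DOWN" "pod-d UP" "pod-e UP" | classify_fleet
# prints: partial: suspect stale credentials on 2 of 5 instances
```

A "partial" verdict is only a hint; confirm it against restart timestamps before acting on it.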

4. Safe Remediations

Situation	Safe action
Instance(s) confirmed running since before the last rotation, failing on reconnect	Restart the affected app instance(s) so they re-fetch the current secret from Secrets Manager on startup. This is the actual fix in the overwhelming majority of cases, and is routine, low-risk work.
Multiple instances affected	Restart them as a rolling restart: one (or a small batch) at a time, confirming `/actuator/health` returns `UP` and the Management UI shows the connection re-established before moving to the next.

⚠️ Caution: restart safely, not all-at-once. Restarting a stale-secret instance is routine, but restarting every instance simultaneously is still a deploy action and needs the same safety practices as any other rolling restart: enough healthy instances/consumers remaining in service throughout, and standard deployment tooling rather than a manual all-at-once bounce. Don’t let “this is just a quick auth fix” bypass your normal rollout discipline.
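The one-at-a-time pattern with a health gate might look like the sketch below. It is not a replacement for your deployment tooling: `restart_instance`, `rabbit_status`, `HEALTH_ATTEMPTS`, and `HEALTH_PAUSE` are all hypothetical names, the restart hook is a placeholder you would swap for the real mechanism, and the health check assumes the actuator layout and port used elsewhere in this guide plus `curl` and `jq` on the box.

```shell
# Placeholder hook: replace with your real restart mechanism.
restart_instance() {
  echo "restarting $1 (placeholder, no action taken)"
}

# Prints the rabbit component status (UP/DOWN) for one instance.
rabbit_status() {
  curl -s "http://$1:8080/actuator/health" | jq -r '.components.rabbit.status'
}

# wait_until_up HOST: polls until the rabbit component reports UP.
# Tunable through HEALTH_ATTEMPTS (default 30) and HEALTH_PAUSE seconds (default 5).
wait_until_up() {
  attempts=${HEALTH_ATTEMPTS:-30}
  while [ "$attempts" -gt 0 ]; do
    if [ "$(rabbit_status "$1")" = "UP" ]; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "${HEALTH_PAUSE:-5}"
  done
  return 1
}

# rolling_restart HOST...: restarts one instance at a time and halts on the
# first instance that does not come back UP, leaving the rest untouched.
rolling_restart() {
  for host in "$@"; do
    restart_instance "$host" || return 1
    if ! wait_until_up "$host"; then
      echo "halt: $host did not return UP; stop and escalate" >&2
      return 1
    fi
    echo "ok: $host is UP"
  done
}
```

Halting on the first failure is deliberate: if a freshly restarted instance still cannot authenticate, further restarts only remove more healthy capacity, and the escalation path applies instead.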

⚠️ Caution: never manually reset the RabbitMQ user’s password yourself as a quick fix. Even if you have rabbitmqctl change_password access, do not use it to “solve” an auth incident by forcing the broker-side password to match what you think Secrets Manager has (or vice versa). Secrets Manager is the source of truth for every consumer of that credential. Manually changing the broker-side password outside the rotation process desyncs it further and can break every other instance/app that was working fine, turning a one-app problem into an org-wide one. If restarting the affected instances doesn’t fix it, the fix belongs to the team that owns the secret/rotation automation, not to a manual change_password from support tier.

If restarting the affected instance(s) resolves it, confirmed by a healthy /actuator/health and a new authenticated connection in list_connections, you’re done; no broker-side change was needed.

If restarting does not resolve it, do not keep restarting or guessing; move to escalation.

5. Escalation Trigger

Stop and hand off to the team owning secrets/rotation automation (per Escalation and Communication) if:

- Restarting the affected app instance(s) does not resolve the failure. This means the freshly fetched secret from Secrets Manager still doesn’t authenticate, i.e., the RabbitMQ broker-side password and Secrets Manager are genuinely out of sync. This needs whoever owns the rotation Lambda/automation, not a support-tier action.
- Multiple unrelated apps/services start failing authentication at the same time. This points to a systemic rotation problem (broker-side password changed for a shared/service account, or the rotation automation broke), not one app’s stale cache.
- list_permissions -p / shows vhost permissions changed unexpectedly alongside the password rotation. Someone needs to confirm whether that was intentional, since it’s an authorization problem layered on top of (or instead of) an authentication problem.
- The RabbitMQ user itself is missing from list_users. Account lifecycle changes are outside support-tier remediation.

6. Relevant Commands/Queries

# Confirm the user still exists broker-side
rabbitmqctl list_users

Healthy example:

user            tags
guest           [administrator]
orders-service  []

Alerting example (the app’s user is simply gone):

user            tags
guest           [administrator]

# Confirm vhost permissions weren't the actual change
rabbitmqctl list_permissions -p /

Healthy example:

user            configure  write  read
orders-service  .*         .*     .*

Alerting example (permissions quietly narrowed at the same time as a rotation):

user            configure  write  read
orders-service                    .*
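Spotting a narrowed row by eye is easy with one user and harder with dozens. A minimal sketch of a filter for it follows; `narrowed_permissions` is a hypothetical name, and it assumes the tab-separated column layout shown above (user, configure, write, read). Verify the layout against your broker version’s output before relying on it.

```shell
# narrowed_permissions: reads `rabbitmqctl list_permissions -p /` output on
# stdin and prints every user whose configure, write, or read pattern is empty.
narrowed_permissions() {
  awk -F'\t' '
    $1 == "user" || /^Listing / { next }   # skip header lines
    NF >= 1 && ($2 == "" || $3 == "" || $4 == "") { print $1 }
  '
}

# Example with the alerting row from above plus a made-up healthy user:
printf 'user\tconfigure\twrite\tread\norders-service\t\t\t.*\nbilling\t.*\t.*\t.*\n' \
  | narrowed_permissions
# prints: orders-service
```

An empty pattern is not always a mistake (read-only consumers legitimately have no configure or write rights), so treat the output as a list of rows to review, not a list of faults.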

Broker log line (via SSM), the broker-side signature:

2026-07-02 02:14:11.902 [error] <0.2201.0> Channel error on connection <0.2190.0> (10.0.3.41:53422 -> 10.0.3.10:5672, vhost: '/', user: 'orders-service'): operation none caused a connection exception access_refused: "ACCESS_REFUSED - Login was refused using authentication mechanism PLAIN. For details see the broker logfile."

Spring Boot log signature, the app-side symptom:

2026-07-02 02:14:12.115  WARN 1 --- [ntContainer#0-1] o.s.a.r.l.SimpleMessageListenerContainer :
  Consumer raised exception, processing can restart if the connection factory supports it.
  Exception summary: org.springframework.amqp.AmqpAuthenticationException:
  com.rabbitmq.client.AuthenticationFailureException: ACCESS_REFUSED - Login was refused using authentication mechanism PLAIN. For details see the broker logfile.
      at org.springframework.amqp.rabbit.connection.RabbitUtils.convertRabbitAccessException(RabbitUtils.java:191)
      at org.springframework.amqp.rabbit.connection.CachingConnectionFactory.createBareConnection(CachingConnectionFactory.java:664)
      ...
Caused by: com.rabbitmq.client.PossibleAuthenticationFailureException: Authentication failed
      at com.rabbitmq.client.impl.AMQConnection.start(AMQConnection.java:322)
      ...
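To establish timing, it helps to pull the timestamp of the first auth failure out of a log. In the Spring Boot output above, the signature sits on a continuation line with no timestamp of its own, so the sketch below remembers the most recent timestamped line and reports that. `first_auth_failure` is a hypothetical name, and the sketch assumes log lines start with `YYYY-MM-DD HH:MM:SS.mmm` as in the samples here.

```shell
# first_auth_failure: reads a log on stdin and prints the timestamp of the
# log entry containing the first auth-failure signature, then stops.
first_auth_failure() {
  awk '
    # Remember the timestamp of the most recent timestamped line.
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] / { ts = $1 " " $2 }
    # On the first signature, report the timestamp of the entry it belongs to.
    /PossibleAuthenticationFailureException|ACCESS_REFUSED/ { print ts; exit }
  '
}

# Usage (the log path is a placeholder):
#   first_auth_failure < /path/to/app.log
```

It works on the broker log as well, since there the timestamp and ACCESS_REFUSED share a line. If a signature appears before any timestamped line, the function prints an empty line.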

# Which instances currently hold an authenticated connection right now
rabbitmqctl list_connections name peer_host user state
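The output of that command can be diffed against the instances you expect to see. The sketch below is one way to do it, under stated assumptions: `established_peers` and `missing_peers` are hypothetical names, the input is the tab-separated output of the command above (name, peer_host, user, state), each instance’s address appears unmodified as peer_host (no NAT or proxy in between), and any of the states running, flow, blocking, or blocked counts as an established connection.

```shell
# established_peers USER: prints the distinct peer hosts that hold an
# established connection as USER.
established_peers() {
  awk -F'\t' -v u="$1" '
    $3 == u && $4 ~ /^(running|flow|blocking|blocked)$/ { print $2 }
  ' | sort -u
}

# missing_peers USER HOST...: reads list_connections output on stdin and
# prints each expected HOST that has no established connection as USER.
missing_peers() {
  user=$1
  shift
  seen=$(established_peers "$user")
  for host in "$@"; do
    printf '%s\n' "$seen" | grep -qxF -- "$host" || echo "$host"
  done
}

# Usage (addresses are made-up examples):
#   rabbitmqctl -q list_connections name peer_host user state \
#     | missing_peers orders-service 10.0.3.41 10.0.3.42
```

Hosts printed here are candidates for the restart-time check, since an instance with no connection may be stuck in the failing reconnect loop.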

# Cross-check instance uptime vs. rotation timestamp
# (via your orchestration tooling, or process start time on the instance itself)

# Actuator health per instance: fastest way to compare across a fleet
curl -s http://<instance-host>:8080/actuator/health | jq '.components.rabbit'

Check Secrets Manager’s rotation history for the secret (console or aws secretsmanager describe-secret --secret-id <name>) to get the exact last-rotated timestamp to compare against instance uptime and log timestamps.

7. Mini Practical

Reproduce the exact delayed-failure pattern locally: an app that keeps working after a password change, then fails only once forced to reconnect.

Step 1: Create a second user on the local container (still rabbitmq-crashcourse from Environment Setup):

docker exec -it rabbitmq-crashcourse rabbitmqctl add_user orders-service supersecret1
docker exec -it rabbitmq-crashcourse rabbitmqctl set_permissions -p / orders-service ".*" ".*" ".*"

Step 2: Point your Spring Boot app at this user (application.yml):

spring:
  rabbitmq:
    host: localhost
    port: 5672
    username: orders-service
    password: supersecret1

Start the app and confirm it works:

curl -X POST localhost:8080/orders -H "Content-Type: application/json" -d '{"id":1,"item":"widget"}'

You should see Received order: {...} in the console, same as before.

Step 3: Rotate the password broker-side, without touching the running app:

docker exec -it rabbitmq-crashcourse rabbitmqctl change_password orders-service brandnewpassword2

Step 4: Confirm the app keeps working on its existing connection. Without restarting anything, publish again:

curl -X POST localhost:8080/orders -H "Content-Type: application/json" -d '{"id":2,"item":"widget"}'

It still works, and Received order: {...} prints normally. This is the core mechanic from AWS Architecture: the app’s existing AMQP connection was authenticated before the password changed, and RabbitMQ has no reason to re-check credentials on a connection that’s already open.

Step 5: Force a reconnect to simulate the network blip / node failover / broker restart that eventually happens in production. The easiest local option is to close the connection from the broker side:

# close_connection takes the connection's pid, so list pid alongside name
docker exec -it rabbitmq-crashcourse rabbitmqctl list_connections pid name peer_host user
docker exec -it rabbitmq-crashcourse rabbitmqctl close_connection "<connection-pid-from-above>" "forced for lab"

(Alternatively, docker restart rabbitmq-crashcourse achieves the same thing by tearing down every connection.)

Step 6: Watch it fail. Spring AMQP’s CachingConnectionFactory will try to reconnect automatically using the password still sitting in application.yml/memory, supersecret1, which the broker no longer accepts. Publish again:

curl -X POST localhost:8080/orders -H "Content-Type: application/json" -d '{"id":3,"item":"widget"}'

This call now fails, and your Spring Boot console shows PossibleAuthenticationFailureException / ACCESS_REFUSED, reproducing, at lab scale, the exact “worked for days, then failed the moment it had to reconnect” pattern from production.

Step 7: Apply the fix. Update application.yml with brandnewpassword2 and restart the app. Publish again and confirm it succeeds. This mirrors the production remediation of restarting an instance so it picks up the current credential.

✅ Checkpoint

You should now be able to:

- Explain why a Spring Boot app can keep working for hours or days after a RabbitMQ password rotation, and why the failure only appears on reconnect.
- Use rabbitmqctl list_users / list_permissions -p / to distinguish “account or permissions changed” from “just a stale cached password.”
- Reproduce the delayed-failure pattern locally: change a user’s password without restarting the app, confirm it keeps working, then force a reconnect and observe the auth failure.