Auth Failures After Credential or Certificate Rotation
Diagnose SASL/SCRAM password rotation, MSK IAM policy changes, expired TLS certificates, and stale cached credentials on reconnect.
1. Symptom
Shortly after a security change (a password rotation, an IAM policy update, a certificate renewal), one or more services can no longer connect to Kafka. Logs show SaslAuthenticationException, Authentication failed, TLS handshake errors, or TopicAuthorizationException. Often the incident appears not at rotation time but hours later, when a service reconnects or a new pod starts.
The goal is to distinguish authentication failures (who you are) from authorization failures (what you may do), and to identify which credential or certificate is stale. This page builds on Security.
2. Likely causes
| Cause | How it manifests |
|---|---|
| SASL/SCRAM password rotated, app has old value | SaslAuthenticationException on connect |
| MSK IAM policy changed, role lost a permission | TopicAuthorizationException on produce/consume |
| TLS/client certificate expired | Handshake failure before auth even runs |
| Truststore missing the new CA | Cannot validate broker cert after a CA change |
| Stale cached credentials on a long-lived connection | Works until reconnect, then fails |
3. How it manifests to the Spring app
| Cause | What the service sees |
|---|---|
| Wrong SASL password | Startup or reconnect fails with authentication error |
| IAM permission removed | Connects fine, then TopicAuthorizationException on a specific topic (or GroupAuthorizationException on a consumer group) |
| Expired cert | SSLHandshakeException; connection never established |
| Old connection still open | Healthy until a rebalance/restart forces a fresh auth |
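The two tables above amount to a lookup from exception class to failure category. A minimal sketch of that lookup follows, assuming the exception reaches your code with its cause chain intact. `AuthFailureClassifier` and its `Category` enum are hypothetical names, and `GroupAuthorizationException` (the consumer-group counterpart of `TopicAuthorizationException`) is an addition the tables do not list. Matching is by class name so the sketch compiles without kafka-clients on the classpath; code that already depends on Kafka could use `instanceof` instead.

```java
import java.util.Map;

/** Triage sketch: maps an exception's cause chain to a failure category. */
final class AuthFailureClassifier {

    // TLS is kept separate because the handshake fails before SASL runs;
    // for triage it sits on the authentication side of the split.
    enum Category { AUTHENTICATION, AUTHORIZATION, TLS, UNKNOWN }

    // Simple class names from the tables above, plus GroupAuthorizationException
    // (an addition for consumer-group permissions).
    private static final Map<String, Category> BY_SIMPLE_NAME = Map.of(
            "SaslAuthenticationException", Category.AUTHENTICATION,
            "TopicAuthorizationException", Category.AUTHORIZATION,
            "GroupAuthorizationException", Category.AUTHORIZATION,
            "SSLHandshakeException", Category.TLS);

    /** Classifies a fully qualified or simple exception class name. */
    static Category classifyName(String className) {
        String simple = className.substring(className.lastIndexOf('.') + 1);
        return BY_SIMPLE_NAME.getOrDefault(simple, Category.UNKNOWN);
    }

    /** Walks the cause chain and returns the first recognized category. */
    static Category classify(Throwable error) {
        int depth = 0; // guard against pathological cause loops
        for (Throwable t = error; t != null && depth < 20; t = t.getCause(), depth++) {
            Category category = classifyName(t.getClass().getName());
            if (category != Category.UNKNOWN) {
                return category;
            }
        }
        return Category.UNKNOWN;
    }

    public static void main(String[] args) {
        Throwable wrapped = new RuntimeException("send failed",
                new javax.net.ssl.SSLHandshakeException("PKIX path validation failed"));
        System.out.println(classify(wrapped)); // prints TLS
    }
}
```

Walking the cause chain matters because client and framework layers often wrap the original Kafka or JSSE exception in a more generic one.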
4. Diagnostic steps
- Read the exception class. Authentication (SaslAuthenticationException, SSLHandshakeException) versus authorization (TopicAuthorizationException) splits the problem immediately.
- Correlate timing with the recent security change. Auth failures right after a rotation point straight at the rotated item.
- Check which services are affected. All services means a broker/CA-level change; one service means its credential, role, or ACL.
- For authorization errors, check the ACL or IAM policy for that principal against the topic and group it needs (least privilege from Security).
- For TLS, check certificate expiry and that the truststore contains the current CA.
| Step | Question it answers | Time cost |
|---|---|---|
| 1. Exception class | Authn or authz? | seconds |
| 2. Timing | Which change caused it? | 1 min |
| 3. Scope | One service or all? | 1-2 min |
| 4. ACL/IAM | Does the principal have rights? | 2-3 min |
| 5. Cert/truststore | Expired or untrusted? | 2-3 min |
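For the certificate step, expiry can be read straight from the truststore or keystore with the JDK alone. The following is a rough sketch, not a hardened tool: `CertExpiryCheck` is a hypothetical name, the store path and password come from command-line arguments, and the store is assumed to be loadable through the JVM's default keystore type.

```java
import java.io.FileInputStream;
import java.security.KeyStore;
import java.security.KeyStoreException;
import java.security.cert.Certificate;
import java.security.cert.X509Certificate;
import java.time.Duration;
import java.time.Instant;
import java.util.Collections;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: report the expiry of every X.509 certificate in a keystore or truststore. */
final class CertExpiryCheck {

    /** True once the certificate's notAfter instant lies before 'now'. */
    static boolean isExpired(Date notAfter, Instant now) {
        return notAfter.toInstant().isBefore(now);
    }

    /** Whole days remaining until notAfter (truncated toward zero). */
    static long daysLeft(Date notAfter, Instant now) {
        return Duration.between(now, notAfter.toInstant()).toDays();
    }

    /** Returns alias -> notAfter for each X.509 entry in the store. */
    static Map<String, Date> notAfterByAlias(KeyStore store) throws KeyStoreException {
        Map<String, Date> result = new LinkedHashMap<>();
        for (String alias : Collections.list(store.aliases())) {
            Certificate cert = store.getCertificate(alias);
            if (cert instanceof X509Certificate) {
                result.put(alias, ((X509Certificate) cert).getNotAfter());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Usage (hypothetical): java CertExpiryCheck <store path> <store password>
        KeyStore store = KeyStore.getInstance(KeyStore.getDefaultType());
        try (FileInputStream in = new FileInputStream(args[0])) {
            store.load(in, args[1].toCharArray());
        }
        Instant now = Instant.now();
        notAfterByAlias(store).forEach((alias, notAfter) ->
                System.out.println(alias + ": " + (isExpired(notAfter, now)
                        ? "EXPIRED on " + notAfter
                        : daysLeft(notAfter, now) + " days left")));
    }
}
```

Expiry is tested with a direct instant comparison rather than the day count, because a certificate that expired a few hours ago still has a day count of zero. Whether the truststore contains the current CA can be judged from the aliases this prints.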
5. Safe remediations
| Situation | Safe action |
|---|---|
| App has an old SASL password | Update the secret (env/secrets manager) and restart; confirm reconnect |
| IAM policy missing a permission | Restore the least-privilege statement for that role (with owner sign-off) |
| Expired certificate | Deploy the renewed cert/truststore; restart affected services |
| Stale cached credentials | Restart the service to force a fresh authentication |
| Recurrent rotation pain | Prefer MSK IAM to eliminate static passwords (a design improvement) |
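To make the last row concrete, the sketch below shows the client properties typically used for MSK IAM authentication. It assumes the `aws-msk-iam-auth` library is on the classpath at runtime and that the service's role already carries the required Kafka permissions; `MskIamClientConfig` is a hypothetical helper name. No password appears anywhere, which is why a password rotation cannot break this client.

```java
import java.util.Properties;

/** Sketch: Kafka client security settings for MSK IAM authentication. */
final class MskIamClientConfig {

    /** Builds only the security-related properties; bootstrap servers etc. are set elsewhere. */
    static Properties iamAuthProperties() {
        Properties props = new Properties();
        props.setProperty("security.protocol", "SASL_SSL");
        props.setProperty("sasl.mechanism", "AWS_MSK_IAM");
        // Login module and callback handler provided by the aws-msk-iam-auth library.
        props.setProperty("sasl.jaas.config",
                "software.amazon.msk.auth.iam.IAMLoginModule required;");
        props.setProperty("sasl.client.callback.handler.class",
                "software.amazon.msk.auth.iam.IAMClientCallbackHandler");
        return props;
    }

    public static void main(String[] args) {
        iamAuthProperties().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

In a Spring Boot service the same keys would usually be supplied as client properties in configuration rather than built in code. The trade-off is that authorization then lives in IAM policy, so the policy-change failure mode described above still applies.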
6. Escalation trigger
Page on-call engineering or the security/platform team if:
- All services lose auth at once, pointing at a broker-side credential, CA, or listener change.
- The correct IAM policy or ACL is unclear, or changing it needs security approval.
- A certificate is expired and renewal is owned by another team.
- Auth failures persist after updating the app-side secret and restarting.
7. Relevant commands and exhibits
# Authentication failure (wrong/rotated credential)
org.apache.kafka.common.errors.SaslAuthenticationException:
Authentication failed during authentication due to invalid credentials
# Authorization failure (missing ACL / IAM permission)
org.apache.kafka.common.errors.TopicAuthorizationException:
Not authorized to access topics: [orders]
# TLS certificate problem
javax.net.ssl.SSLHandshakeException: PKIX path validation failed:
... certificate expired on 20260701...
# List ACLs for a principal
kafka-acls.sh --bootstrap-server $BROKER --list --principal User:payment-service
Secrets should be ${...} placeholders resolved from env or a secrets manager, never literals, as in Security.
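As an illustration of the no-literals rule, the following sketch builds SASL/SCRAM client properties from environment values and fails fast when one is missing. The environment variable names `KAFKA_SASL_USERNAME` and `KAFKA_SASL_PASSWORD` and the class `ScramClientConfig` are assumptions for this example, not names from the platform; the mechanism is assumed to be SCRAM-SHA-512.

```java
import java.util.Map;
import java.util.Properties;

/** Sketch: SASL/SCRAM client settings resolved from the environment, never from literals. */
final class ScramClientConfig {

    /** Builds security properties from an environment map such as System.getenv(). */
    static Properties fromEnv(Map<String, String> env) {
        String user = require(env, "KAFKA_SASL_USERNAME");     // hypothetical variable name
        String password = require(env, "KAFKA_SASL_PASSWORD"); // hypothetical variable name
        Properties props = new Properties();
        props.setProperty("security.protocol", "SASL_SSL");
        props.setProperty("sasl.mechanism", "SCRAM-SHA-512");
        props.setProperty("sasl.jaas.config",
                "org.apache.kafka.common.security.scram.ScramLoginModule required "
                        + "username=\"" + escape(user) + "\" "
                        + "password=\"" + escape(password) + "\";");
        return props;
    }

    /** Fails at startup with a clear message instead of a later authentication error. */
    private static String require(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException("Missing required environment variable: " + name);
        }
        return value;
    }

    /** Escapes backslashes and double quotes so the value survives JAAS quoting. */
    private static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        Properties props = fromEnv(System.getenv());
        System.out.println("sasl.mechanism=" + props.getProperty("sasl.mechanism"));
    }
}
```

The values are read once when the properties are built, so a running client keeps the password it started with. That is why the remediation for a rotated password is to update the secret and restart.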
8. Guided practical
This is largely exhibit-based, but you can reproduce the app-side stale-secret pattern locally if you enable SASL, or reason through the exhibits.
- From the exhibits above, classify each as authentication or authorization.
- For the TopicAuthorizationException, write the kafka-acls.sh command that grants exactly the missing read on orders.
- For the SASL failure, identify where the app reads its password and confirm it is a placeholder, not a literal.
- Explain why MSK IAM would have prevented the password-rotation case.
Next: AWS-Layer Connectivity.