Auth Failures After Credential or Certificate Rotation

Diagnose SASL/SCRAM password rotation, MSK IAM policy changes, expired TLS certificates, and stale cached credentials on reconnect.


1. Symptom

Shortly after a security change (a password rotation, an IAM policy update, a certificate renewal), one or more services can no longer connect to Kafka or access its topics. Logs show SaslAuthenticationException, Authentication failed, TLS handshake errors, or TopicAuthorizationException. Often the incident appears not at rotation time but hours later, when a service reconnects or a new pod starts, because connections that authenticated before the change typically stay open until then.

The goal is to distinguish authentication failures (who you are) from authorization failures (what you may do), and to identify which credential or certificate is stale, building on Security.


2. Likely causes

Cause                                               | How it manifests
----------------------------------------------------+-----------------------------------------------
SASL/SCRAM password rotated, app has old value      | SaslAuthenticationException on connect
MSK IAM policy changed, role lost a permission      | TopicAuthorizationException on produce/consume
TLS/client certificate expired                      | Handshake failure before auth even runs
Truststore missing the new CA                       | Cannot validate broker cert after a CA change
Stale cached credentials on a long-lived connection | Works until reconnect, then fails

3. How it manifests to the Spring app

Cause                     | What the service sees
--------------------------+-------------------------------------------------------------------------
Wrong SASL password       | Startup or reconnect fails with authentication error
IAM permission removed    | Connects fine, then TopicAuthorizationException on a specific topic/group
Expired cert              | SSLHandshakeException; connection never established
Old connection still open | Healthy until a rebalance/restart forces a fresh auth

4. Diagnostic steps

  1. Read the exception class. Authentication (SaslAuthenticationException, SSLHandshakeException) versus authorization (TopicAuthorizationException) splits the problem immediately.
  2. Correlate timing with the recent security change. Auth failures right after a rotation point straight at the rotated item.
  3. Check which services are affected. All services means a broker/CA-level change; one service means its credential, role, or ACL.
  4. For authorization errors, check the ACL or IAM policy for that principal against the topic and group it needs (least privilege from Security).
  5. For TLS, check certificate expiry and that the truststore contains the current CA.

Step               | Question it answers             | Time cost
-------------------+---------------------------------+----------
1. Exception class | Authn or authz?                 | seconds
2. Timing          | Which change caused it?         | 1 min
3. Scope           | One service or all?             | 1-2 min
4. ACL/IAM         | Does the principal have rights? | 2-3 min
5. Cert/truststore | Expired or untrusted?           | 2-3 min
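
Step 1 can be approximated in code. The sketch below is a hypothetical helper (the class name AuthFailureClassifier and its Kind enum do not come from any library) that walks an exception's cause chain and matches simple class names, so it compiles without the Kafka client on the classpath. SslAuthenticationException and GroupAuthorizationException are included as assumptions about what else a service might log; adjust the sets to the exceptions you actually see.

```java
import java.util.Set;

/** Hypothetical helper: splits a failure into authentication vs authorization by class name. */
final class AuthFailureClassifier {

    enum Kind { AUTHENTICATION, AUTHORIZATION, UNKNOWN }

    // Simple class names only, so this sketch has no dependency on kafka-clients.
    private static final Set<String> AUTHENTICATION_NAMES = Set.of(
            "SaslAuthenticationException",   // wrong or rotated SASL credential
            "SslAuthenticationException",    // assumption: TLS auth failure reported by Kafka
            "SSLHandshakeException");        // expired or untrusted certificate

    private static final Set<String> AUTHORIZATION_NAMES = Set.of(
            "TopicAuthorizationException",   // missing ACL or IAM permission on a topic
            "GroupAuthorizationException");  // assumption: same problem for a consumer group

    private static final int MAX_CAUSE_DEPTH = 20; // guard against cyclic cause chains

    static Kind classify(Throwable error) {
        Throwable current = error;
        for (int depth = 0; current != null && depth < MAX_CAUSE_DEPTH; depth++) {
            String name = current.getClass().getSimpleName();
            if (AUTHENTICATION_NAMES.contains(name)) {
                return Kind.AUTHENTICATION;
            }
            if (AUTHORIZATION_NAMES.contains(name)) {
                return Kind.AUTHORIZATION;
            }
            current = current.getCause();
        }
        return Kind.UNKNOWN;
    }
}
```

Calling AuthFailureClassifier.classify(e) on a wrapped SSLHandshakeException yields AUTHENTICATION, which points at step 5 rather than step 4. Matching on names is a deliberate shortcut for a sketch; inside a real service, catching the Kafka exception types directly would be more robust.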

5. Safe remediations

Situation                       | Safe action
--------------------------------+-------------------------------------------------------------------------
App has an old SASL password    | Update the secret (env/secrets manager) and restart; confirm reconnect
IAM policy missing a permission | Restore the least-privilege statement for that role (with owner sign-off)
Expired certificate             | Deploy the renewed cert/truststore; restart affected services
Stale cached credentials        | Restart the service to force a fresh authentication
Recurrent rotation pain         | Prefer MSK IAM to eliminate static passwords (a design improvement)
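
Before restarting services on a renewed keystore or truststore, it can help to confirm that the deployed file no longer contains expired entries. The following is a minimal sketch using only the JDK; CertExpiryCheck and its method names are hypothetical, and the warning window is whatever your team considers "too close".

```java
import java.security.KeyStore;
import java.security.KeyStoreException;
import java.security.cert.Certificate;
import java.security.cert.X509Certificate;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical pre-restart check: which certificates in a store are expired or close to it? */
final class CertExpiryCheck {

    /** True when notAfter falls before now + window, which includes already-expired certificates. */
    static boolean expiresWithin(Instant notAfter, Instant now, Duration window) {
        return notAfter.isBefore(now.plus(window));
    }

    /** Returns one line per X.509 entry in the store that expires within the window. */
    static List<String> expiringEntries(KeyStore store, Instant now, Duration window)
            throws KeyStoreException {
        List<String> findings = new ArrayList<>();
        for (String alias : Collections.list(store.aliases())) {
            Certificate cert = store.getCertificate(alias);
            if (cert instanceof X509Certificate) {
                Instant notAfter = ((X509Certificate) cert).getNotAfter().toInstant();
                if (expiresWithin(notAfter, now, window)) {
                    findings.add(alias + " notAfter=" + notAfter);
                }
            }
        }
        return findings;
    }
}
```

A store can be loaded with KeyStore.getInstance(new File(path), password) on Java 9 or later, where path and password are placeholders for your own deployment. An empty result for the truststore suggests expiry is not the problem there, in which case the missing-CA cause is worth checking next.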

6. Escalation trigger

Page on-call engineering or the security/platform team if:

  • All services lose auth at once, pointing at a broker-side credential, CA, or listener change.
  • The correct IAM policy or ACL is unclear, or changing it needs security approval.
  • A certificate is expired and renewal is owned by another team.
  • Auth failures persist after updating the app-side secret and restarting.

7. Relevant commands and exhibits

# Authentication failure (wrong/rotated credential)
org.apache.kafka.common.errors.SaslAuthenticationException:
  Authentication failed during authentication due to invalid credentials

# Authorization failure (missing ACL / IAM permission)
org.apache.kafka.common.errors.TopicAuthorizationException:
  Not authorized to access topics: [orders]

# TLS certificate problem
javax.net.ssl.SSLHandshakeException: PKIX path validation failed:
  ... certificate expired on 20260701...
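
# List ACLs for a principal
# (on a secured cluster the tool must authenticate too: pass --command-config with a client properties file)
kafka-acls.sh --bootstrap-server $BROKER --list --principal User:payment-service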

Secrets should be ${...} placeholders resolved from env or a secrets manager, never literals, as in Security.
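
To illustrate the same rule in plain Java, the sketch below builds SASL/SCRAM client settings from environment variables instead of literals. It is an assumption-laden example, not the module's own configuration: ScramClientConfig, the variable names KAFKA_SASL_USERNAME and KAFKA_SASL_PASSWORD, and the SCRAM-SHA-512 mechanism are all illustrative choices, while the property keys and the ScramLoginModule class name are the standard Kafka client ones.

```java
import java.util.Map;
import java.util.Properties;

/** Hypothetical helper: builds SASL/SCRAM client settings from the environment, never from literals. */
final class ScramClientConfig {

    static Properties fromEnv(Map<String, String> env) {
        String username = required(env, "KAFKA_SASL_USERNAME"); // hypothetical variable name
        String password = required(env, "KAFKA_SASL_PASSWORD"); // hypothetical variable name

        Properties props = new Properties();
        props.setProperty("security.protocol", "SASL_SSL");
        props.setProperty("sasl.mechanism", "SCRAM-SHA-512"); // assumption: match your cluster
        props.setProperty("sasl.jaas.config",
                "org.apache.kafka.common.security.scram.ScramLoginModule required"
                        + " username=\"" + escape(username) + "\""
                        + " password=\"" + escape(password) + "\";");
        return props;
    }

    // Fail at startup with a clear message instead of sending an empty credential to the broker.
    private static String required(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.isEmpty()) {
            throw new IllegalStateException("Missing environment variable " + name);
        }
        return value;
    }

    // Keep quotes and backslashes inside the quoted JAAS value.
    private static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

A service would call ScramClientConfig.fromEnv(System.getenv()) once at startup. Because the password is read when the properties are built, a rotated secret only takes effect after a restart, which is why the stale-secret remediation pairs updating the secret with restarting the service.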


8. Guided practical

This is largely exhibit-based, but you can reproduce the app-side stale-secret pattern locally if you enable SASL, or reason through the exhibits.

  1. From the exhibits above, classify each as authentication or authorization.
  2. For the TopicAuthorizationException, write the kafka-acls.sh command that grants exactly the missing read on orders.
  3. For the SASL failure, identify where the app reads its password and confirm it is a placeholder, not a literal.
  4. Explain why MSK IAM would have prevented the password-rotation case.

Next: AWS-Layer Connectivity.