The Dashboard shows a spike in sign-in failures, or your alerting fires "sign-in failure rate above threshold". Five minutes of investigation usually identifies the cause; here's the diagnostic tree.
Step 1 — Open the audit log filtered to failures
Audit → Read logs. Filter:
- Type: user.signed_in.failed.
- Time range: the spike's window (typically last hour).
You'll see a list of failed sign-in events. Each has data.error_code showing why.
Open the first 5-10 events and note the error_code distribution. One of three patterns will usually emerge:
- One error code dominates — the spike has a single root cause.
- Multiple error codes, mixed — multiple things are happening; investigate each.
- Same user across many failures — a single user is having trouble (or being targeted).
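The tally above can be sketched in a few lines of Python. This is a minimal sketch, not platform code: it assumes the failed events have been exported as a list of dicts, that each one carries the data.error_code field shown in the log, and that the user is identified by a hypothetical user_id key. The 80% dominance cut-off is an arbitrary illustration.

```python
from collections import Counter


def classify_spike(events, dominance=0.8):
    """Return (pattern, detail) for a list of failed sign-in events.

    Assumed event shape: {"user_id": ..., "data": {"error_code": ...}}.
    "user_id" is a hypothetical key; "dominance" is an arbitrary cut-off.
    """
    if not events:
        return ("no_events", None)

    total = len(events)
    users = Counter(e["user_id"] for e in events)
    codes = Counter(e["data"]["error_code"] for e in events)

    # Check the single-user case first: one user retrying will also
    # tend to produce a single dominant error code.
    top_user, user_hits = users.most_common(1)[0]
    if user_hits / total >= dominance:
        return ("same_user", top_user)

    top_code, code_hits = codes.most_common(1)[0]
    if code_hits / total >= dominance:
        return ("one_code_dominates", top_code)

    # Neither a user nor a code dominates: return the full breakdown.
    return ("mixed_codes", dict(codes))
```

With only 5-10 events the result is a hint, not a verdict; widen the sample before acting on a "mixed" outcome.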
Step 2 — Match the error code to the cause
The common error codes and what they mean:
invalid_credentials
Most common in normal traffic. The user typed the wrong password.
Spike pattern:
- Across many users → likely a password-policy change that caught a lot of grandfathered passwords, or just the routine background of typos during busy hours.
- Against one specific user → either the user forgot, or a targeted attack. Check the data.context.ip distribution; many IPs = attack; one IP = forgotten password.
Investigation:
- Did anything change recently? Audit → filter on admin.policy_updated for the day.
- Was there a password-feed update? Audit → security.threat_feed_refreshed events.
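For the single-user case, the IP-distribution check can be sketched as follows. It assumes exported events shaped as dicts with the data.context.ip field from the log and a hypothetical user_id key. The many_ips cut-off is an illustrative number, not a platform setting.

```python
from collections import Counter


def ip_distribution(events, user_id):
    """Count failed sign-ins per source IP for one user.

    "user_id" is an assumed key name; data.context.ip comes from the event.
    """
    return Counter(
        e["data"]["context"]["ip"]
        for e in events
        if e.get("user_id") == user_id
    )


def likely_cause(ip_counts, many_ips=5):
    """Map an IP distribution to a rough diagnosis.

    many_ips is an arbitrary illustrative threshold.
    """
    if not ip_counts:
        return "no failures for this user"
    if len(ip_counts) == 1:
        return "forgotten password"
    if len(ip_counts) >= many_ips:
        return "possible targeted attack"
    return "inconclusive"
```

A handful of IPs (home, office, mobile) is normal for a legitimate user, which is why the middle band is reported as inconclusive and left for a human to judge.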
account_locked
Brute-force defence kicked in. The user (or someone trying to be them) hit the rate limit.
Spike pattern: Almost always a single user or a small cluster. The lock is per-user.
Investigation:
- Filter audit to the locked user's id + last 24h. Look for the failed-attempt pattern that led to the lock.
- If the failed attempts are from the user's normal IP, they forgot the password. Help them reset.
- If from unusual IPs, it's a targeted attack. Notify the user, consider tightening lockout policy.
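The lockout investigation above amounts to a time-window filter followed by a set difference. The sketch below assumes exported events with a hypothetical user_id key, an ISO 8601 timestamp key (also an assumption), and the data.context.ip field. The list of the user's normal IPs is something you would supply yourself, for example from earlier successful sign-ins.

```python
from datetime import datetime, timedelta, timezone


def attempts_before_lock(events, user_id, now, window=timedelta(hours=24)):
    """Failed attempts for one user inside the window ending at `now`."""
    cutoff = now - window
    return [
        e for e in events
        if e["user_id"] == user_id
        and cutoff <= datetime.fromisoformat(e["timestamp"]) <= now
    ]


def unusual_ips(attempts, normal_ips):
    """IPs seen in the attempts that are not in the user's normal set."""
    seen = {e["data"]["context"]["ip"] for e in attempts}
    return sorted(seen - set(normal_ips))


# Illustrative data only; field names other than data.context.ip are assumed.
now = datetime(2024, 5, 2, 12, 0, tzinfo=timezone.utc)
sample = [
    {"user_id": "alice", "timestamp": "2024-05-02T11:00:00+00:00",
     "data": {"context": {"ip": "203.0.113.7"}}},
    {"user_id": "alice", "timestamp": "2024-05-02T11:05:00+00:00",
     "data": {"context": {"ip": "198.51.100.20"}}},
    {"user_id": "alice", "timestamp": "2024-04-28T09:00:00+00:00",  # older than 24h
     "data": {"context": {"ip": "192.0.2.99"}}},
]
recent = attempts_before_lock(sample, "alice", now)
suspicious = unusual_ips(recent, normal_ips={"203.0.113.7"})
```

An empty result suggests a forgotten password; any entries in it are the IPs to look at before deciding whether to notify the user.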
mfa_failed
The user's primary auth succeeded but MFA failed.
Spike pattern: Usually a single user. A burst of mfa_failed per user is one of:
- They lost their MFA factor. Walk them through MFA recovery.
- Their authenticator app's clock drifted (TOTP fails if the device's time is off). Check the user agent + device fingerprint to see if it's a known-buggy device.
- An attacker has the password but not the second factor. The user's password is compromised; force-rotate it.
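To see why clock drift breaks TOTP, it helps to look at how a code is derived from the current time. The sketch below implements the standard RFC 6238 algorithm with HMAC-SHA-1, a 30-second step, and 6 digits. Whether the platform uses exactly these parameters, and how many steps of drift its verifier tolerates, are assumptions; the window argument here is purely illustrative.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """RFC 6238 TOTP (HMAC-SHA-1) for the given Unix time."""
    counter = unix_time // step
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation, per RFC 4226
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def accepted(secret, device_time, server_time, window=1, step=30, digits=6):
    """True if the device's code matches a server step within +/- window.

    `window` models a hypothetical verifier tolerance, not a platform setting.
    """
    device_code = totp(secret, device_time, step, digits)
    for k in range(-window, window + 1):
        t = server_time + k * step
        if t < 0:
            continue  # no time step exists before the Unix epoch
        if hmac.compare_digest(totp(secret, t, step, digits), device_code):
            return True
    return False
```

The code depends only on the secret and on unix_time // step, so a device whose clock is off by more than the tolerated number of steps produces codes the server never considers, even though the secret is correct.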
account_disabled
The user account is in a disabled state.
Spike pattern: rare to spike. Usually a single user who was disabled and didn't realise.
Investigation:
- Why was the user disabled? Audit on the user → user.disabled event. Reason field tells you.
- Should the user be re-enabled? Decide on the spot or escalate.
captcha_required
The risk engine demanded captcha; the user didn't complete it (or the captcha provider had issues).
Spike pattern:
- A few users at once → routine; the risk engine flagged them, they walked away.
- Many users across many IPs → check security.threat_feed_hit events in the same window; threat feeds may have erroneously flagged a legitimate IP range.
- All users → captcha provider is down (rare; Cloudflare Turnstile / reCAPTCHA outage). Wait it out; or fall back to a less-strict policy temporarily.
email_unverified
The user signed up but didn't click the email verification link, then tried to sign in.
Spike pattern: usually after a sign-up campaign. Many new sign-ups didn't verify.
Investigation:
- Are the verification emails arriving? Test from your tenant via the integration's Send test email.
- If they ARE arriving but not being clicked, the comms might be poorly worded; users don't realise they need to click. Consider tweaking the email template.
- Or relax the policy to allow sign-in without verification on first try (with a follow-up reminder).
step_up_required
The user's session had an insufficient AAL (authenticator assurance level) for the action they tried. This is not really a "failed sign-in", but it is counted in the same category for monitoring purposes.
Spike pattern: usually correlates with a policy change tightening step-up. Verify in the audit log.
Step 3 — Correlate with platform events
Check what else happened in the same window:
- admin.policy_updated: did someone change something?
- security.threat_feed_refreshed: did a feed update broaden the blocked set?
- federation.metadata_refreshed: did an external IdP rotate certs (which may have temporarily broken sign-ins)?
If yes to any: did the timing match the spike? If yes, that's likely the cause.
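The timing check can be sketched as a filter over exported audit events. The three event types come from this step; the type and timestamp key names, and the 30-minute lookback before the spike start, are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

PLATFORM_EVENT_TYPES = {
    "admin.policy_updated",
    "security.threat_feed_refreshed",
    "federation.metadata_refreshed",
}


def candidate_causes(events, spike_start, lookback=timedelta(minutes=30)):
    """Platform events that occurred shortly before (or at) the spike start.

    "type" and "timestamp" are assumed key names; lookback is arbitrary.
    """
    earliest = spike_start - lookback
    hits = [
        e for e in events
        if e["type"] in PLATFORM_EVENT_TYPES
        and earliest <= datetime.fromisoformat(e["timestamp"]) <= spike_start
    ]
    # Oldest first, so the list reads as a timeline leading up to the spike.
    return sorted(hits, key=lambda e: datetime.fromisoformat(e["timestamp"]))


# Illustrative data only.
spike_start = datetime(2024, 5, 1, 10, 0, tzinfo=timezone.utc)
sample = [
    {"type": "admin.policy_updated", "timestamp": "2024-05-01T09:50:00+00:00"},
    {"type": "user.signed_in.failed", "timestamp": "2024-05-01T09:55:00+00:00"},
    {"type": "federation.metadata_refreshed", "timestamp": "2024-05-01T08:00:00+00:00"},
]
causes = candidate_causes(sample, spike_start)
```

A match here is a lead, not proof: confirm that the change could plausibly produce the error codes you saw in Step 2.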
Step 4 — Test sign-in flow yourself
Sometimes the audit log lies (or doesn't capture everything). Sign in as a test user; walk through the sign-in. If sign-in fails for YOU, the issue is system-wide (something's broken).
Quick to do: 30 seconds. Confirms or refutes "is sign-in actually broken right now?"
Step 5 — Decide + act
Most spikes resolve in one of three ways:
- A real user-side issue → help the affected user(s) directly; document the case; consider whether to adjust policy.
- A platform / integration issue → fix the integration (the IdP cert, the SMTP, the captcha provider); communicate the resolution.
- A misconfiguration → roll back the offending change; document the lesson; tighten the change-review process.
When to escalate
Escalate to the platform team (your IntelliAuth platform admin or support contact) when:
- All users are failing across all applications (system-wide auth outage).
- The audit log itself shows gaps or missing events (something's broken on the platform side).
- The error code is unfamiliar — check the error code index first; if still unfamiliar, escalate.
Don't escalate for routine causes (one user forgot their password). Manage those at the support level.
Post-incident
After resolving:
- Document what you found and how you fixed it. Even small incidents benefit from a paragraph in your team's wiki.
- If a policy change caused the spike, decide whether to roll it back or tighten the change-review process.
- If a user reported the issue, follow up to confirm resolution.
- If the cause is recurring (same root cause as last month), invest in prevention — automate the alert, write a per-cause runbook, etc.
The Dashboard's "Sign-in failure rate over time" chart is the long-view; a spike worth investigating today should leave a visible signal that you can reference when patterns recur.