Diagnose failed-login spikes

The Dashboard shows a spike in sign-in failures, or your alerting fires "sign-in failure rate above threshold". Five minutes of investigation usually identifies the cause; here's the diagnostic tree.

Step 1 — Open the audit log filtered to failures

Go to Audit → Read logs and filter on:

Type: user.signed_in.failed.
Time range: the spike's window (typically the last hour).

You'll see a list of failed sign-in events. Each event's data.error_code records why it failed.

Open the first 5-10 events and note the error_code distribution. You'll typically see one of three patterns:

One error code dominates — the spike has a single root cause.
Multiple error codes, mixed — multiple things are happening; investigate each.
Same user across many failures — a single user is having trouble (or being targeted).
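You can export the filtered events and triage them offline. If you do, the pattern check above can be roughed out in Python. This is a minimal sketch. It assumes each exported event is a dict with type, a user_id field, and data.error_code. The export shape, the user_id field, and the 60% dominance threshold are all illustrative assumptions, not documented IntelliAuth behaviour.

```python
from collections import Counter

# Assumed export shape (hypothetical, not a documented IntelliAuth format):
# {"type": "user.signed_in.failed", "user_id": "...",
#  "data": {"error_code": "...", "context": {"ip": "..."}}}


def summarise_failures(events, dominance=0.6):
    """Label a batch of failed sign-ins with one of the three patterns."""
    failures = [e for e in events if e["type"] == "user.signed_in.failed"]
    if not failures:
        return "no failures", Counter()
    total = len(failures)
    codes = Counter(e["data"]["error_code"] for e in failures)
    users = Counter(e["user_id"] for e in failures)
    top_user, user_hits = users.most_common(1)[0]
    top_code, code_hits = codes.most_common(1)[0]
    # Check the single-user pattern first: one user's burst would otherwise
    # also look like "one error code dominates".
    if user_hits / total >= dominance:
        return f"single user: {top_user}", codes
    if code_hits / total >= dominance:
        return f"single root cause: {top_code}", codes
    return "mixed error codes", codes
```

The user check runs before the error-code check on purpose. A single user hammering one code is a different investigation from a tenant-wide root cause.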

Step 2 — Match the error code to the cause

The common error codes and what they mean:

`invalid_credentials`

Most common in normal traffic. The user typed the wrong password.

Spike pattern:

Across many users → likely a password-policy change that caught a lot of grandfathered passwords, or just the routine background of typos during busy hours.
Against one specific user → either the user forgot their password, or it's a targeted attack. Check the data.context.ip distribution: many IPs suggest an attack; a single IP suggests a forgotten password.
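The IP check can be sketched the same way. It uses the same assumed export shape: a hypothetical user_id field, plus data.context.ip from the audit event. The function counts the distinct source IPs behind one user's invalid_credentials failures. The many_ips cut-off of 5 is an illustrative guess, and the in-between case is left for a human to judge.

```python
from collections import Counter


def ip_spread(events, user_id, many_ips=5):
    """Summarise where one user's invalid_credentials failures came from."""
    ips = Counter(
        e["data"]["context"]["ip"]
        for e in events
        if e["type"] == "user.signed_in.failed"
        and e["user_id"] == user_id  # hypothetical field name
        and e["data"]["error_code"] == "invalid_credentials"
    )
    if not ips:
        verdict = "no invalid_credentials failures"
    elif len(ips) == 1:
        verdict = "one IP: likely a forgotten password"
    elif len(ips) >= many_ips:
        verdict = "many IPs: likely a targeted attack"
    else:
        verdict = "a few IPs: inspect manually"
    return verdict, ips
```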

Investigation:

Did anything change recently? Audit → filter on admin.policy_updated for the day.
Was there a password-feed update? Audit → filter on security.threat_feed_refreshed.

`account_locked`

Brute-force defence kicked in. The user (or someone trying to be them) hit the rate limit.

Spike pattern: Almost always a single user or a small cluster. The lock is per-user.

Investigation:

Filter the audit log to the locked user's ID and the last 24 hours. Look for the failed-attempt pattern that led to the lock.
If the failed attempts come from the user's normal IP, they forgot the password. Help them reset it.
If they come from unusual IPs, it's a targeted attack. Notify the user and consider tightening the lockout policy.
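Here is a sketch of the lockout check. It assumes events also carry an occurred_at datetime, which is a hypothetical field name. It also assumes you already have a set of the user's usual IPs, for example from their past successful sign-ins. How you build that set is up to you.

```python
from datetime import timedelta


def classify_lockout(events, user_id, usual_ips, now, lookback=timedelta(hours=24)):
    """Split a locked user's recent failed sign-ins into usual vs unusual source IPs.

    Assumes each event has "type", "user_id", "occurred_at" (datetime) and
    data.context.ip; "user_id" and "occurred_at" are hypothetical field names.
    """
    since = now - lookback
    attempts = [
        e for e in events
        if e["type"] == "user.signed_in.failed"
        and e["user_id"] == user_id
        and e["occurred_at"] >= since
    ]
    # IPs seen in the window that are not in the user's known set.
    unusual = sorted({e["data"]["context"]["ip"] for e in attempts} - set(usual_ips))
    if not attempts:
        return "no failed attempts in window", unusual
    if not unusual:
        return "all from usual IPs: likely a forgotten password, help them reset", unusual
    return "attempts from unusual IPs: possible targeted attack, notify the user", unusual
```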

`mfa_failed`

The user's primary auth succeeded but MFA failed.

Spike pattern: Usually a single user. A burst of mfa_failed events for one user typically means one of:

They lost their MFA factor. Walk them through MFA recovery.
Their authenticator app's clock has drifted (TOTP codes fail when the device's time is off). Check the user agent and device fingerprint to see whether it's a known-buggy device.
An attacker has the password but not the second factor. The user's password is compromised; force-rotate it.
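To see why clock drift breaks TOTP, here is a minimal RFC 6238 implementation (HMAC-SHA1, 30-second steps, 6 digits). It comes with a verifier that tolerates one step of skew on either side. The ±1-step window is a common server choice, not necessarily IntelliAuth's setting. The secret is the RFC's published test secret.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, at: float, step: int = 30, digits: int = 6) -> str:
    """RFC 6238 TOTP (HMAC-SHA1) for the time step containing `at` (Unix seconds)."""
    counter = int(at // step)
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation from RFC 4226
    value = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10**digits).zfill(digits)


def verify(secret: bytes, code: str, server_time: float, window: int = 1, step: int = 30) -> bool:
    """Accept a code from the server's current step or up to `window` steps either side."""
    return any(
        hmac.compare_digest(totp(secret, server_time + d * step, step), code)
        for d in range(-window, window + 1)
    )


secret = b"12345678901234567890"  # RFC 6238 test secret
print(totp(secret, 59))  # 287082 (RFC 6238 vector for T=59, last 6 digits)
print(verify(secret, totp(secret, 59 + 20), 59))  # device 20 s fast: True
print(verify(secret, totp(secret, 59 + 180), 59))  # device 3 min fast: False
```

A device that is 20 seconds fast still lands inside the window. A device that is 3 minutes fast is six steps away, so every code it produces is rejected. That shows up as a run of mfa_failed events for that user.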

`account_disabled`

The user's account is disabled.

Spike pattern: rarely spikes. Usually a single user who was disabled and didn't realise it.

Investigation:

Why was the user disabled? Filter the audit log on the user and find the user.disabled event; its reason field tells you.
Should the user be re-enabled? Decide on the spot or escalate.
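If you're scripting this against exported events, a lookup for the most recent user.disabled event might look like the following. It makes three assumptions: the reason is stored at data.reason, user_id identifies the disabled user rather than the admin who acted, and occurred_at is a datetime. None of these field names is confirmed by this guide.

```python
def disabled_reason(events, user_id):
    """Return the reason from the most recent user.disabled event for this user, or None.

    Field names "user_id", "occurred_at" and data.reason are assumptions.
    """
    disabled = [
        e for e in events
        if e["type"] == "user.disabled" and e["user_id"] == user_id
    ]
    if not disabled:
        return None
    latest = max(disabled, key=lambda e: e["occurred_at"])
    return latest["data"].get("reason")
```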

`captcha_required`

The risk engine required a captcha and the user didn't complete it (or the captcha provider had issues).

Spike pattern:

A few users at once → routine; the risk engine flagged them and they walked away.
Many users across many IPs → check security.threat_feed_hit events in the same window; a threat feed may have wrongly flagged a legitimate IP range.
All users → the captcha provider is likely down (rare; e.g. a Cloudflare Turnstile or reCAPTCHA outage). Wait it out, or temporarily fall back to a less strict policy.
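The three captcha cases can be triaged roughly as below. The sketch assumes you pass in the failed events plus the total number of sign-in attempts (successful and failed) in the same window. The thresholds (5 users, 90% of attempts) are illustrative only.

```python
def classify_captcha_spike(failures, total_attempts, few_users=5, all_ratio=0.9):
    """Rough triage of a captcha_required spike; thresholds are illustrative."""
    captcha = [e for e in failures if e["data"]["error_code"] == "captcha_required"]
    users = {e["user_id"] for e in captcha}  # "user_id" is a hypothetical field name
    ips = {e["data"]["context"]["ip"] for e in captcha}
    # Nearly every attempt failing on captcha points at the provider, not the users.
    if total_attempts and len(captcha) / total_attempts >= all_ratio:
        return "nearly all attempts: suspect a captcha provider outage"
    if len(users) <= few_users:
        return "a few users: routine risk-engine challenges"
    return f"{len(users)} users across {len(ips)} IPs: check security.threat_feed_hit events"
```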

`email_unverified`

The user signed up but didn't click the email verification link, then tried to sign in.

Spike pattern: usually after a sign-up campaign. Many new sign-ups didn't verify.

Investigation:

Are the verification emails arriving? Test from your tenant using the integration's Send test email.
If they ARE arriving but not being clicked, the email may be poorly worded, so users don't realise they need to click the link. Consider tweaking the email template.
Alternatively, relax the policy to allow the first sign-in without verification (with a follow-up reminder).

`step_up_required`

The user's session had an insufficient authenticator assurance level (AAL) for the action they attempted. It's not really a "failed sign-in", but it's counted in the same category for monitoring purposes.

Spike pattern: usually correlates with a policy change that tightened step-up requirements. Verify by checking the audit log for admin.policy_updated events.

Step 3 — Correlate with platform events

Check what else happened in the same window:

admin.policy_updated — did someone change something?
security.threat_feed_refreshed — did a feed update broaden the blocked set?
federation.metadata_refreshed — did an external IdP rotate certs (which may have temporarily broken sign-ins)?

If any of these occurred, check whether its timing matches the spike. If it does, that change is the likely cause.
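To check timing programmatically, a sketch like this lists platform events of the three types above that landed shortly before the spike started, closest first. The occurred_at field and the 30-minute look-back are assumptions; adjust the look-back to how quickly a change in your tenant takes effect.

```python
from datetime import timedelta

PLATFORM_EVENT_TYPES = {
    "admin.policy_updated",
    "security.threat_feed_refreshed",
    "federation.metadata_refreshed",
}


def candidate_causes(events, spike_start, lead=timedelta(minutes=30)):
    """Platform events in the `lead` window before the spike began, closest first.

    Assumes each event has a "type" and a hypothetical "occurred_at" datetime.
    """
    window_start = spike_start - lead
    hits = [
        e for e in events
        if e["type"] in PLATFORM_EVENT_TYPES
        and window_start <= e["occurred_at"] <= spike_start
    ]
    return sorted(hits, key=lambda e: spike_start - e["occurred_at"])
```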

Step 4 — Test sign-in flow yourself

Sometimes the audit log is misleading or incomplete. Sign in as a test user and walk through the sign-in flow. If sign-in fails for YOU, the issue is system-wide (something's broken).

It's quick (about 30 seconds) and confirms or refutes "is sign-in actually broken right now?"

Step 5 — Decide + act

Most spikes resolve in one of three ways:

A real user-side issue → help the affected user(s) directly; document the case; consider whether to adjust policy.
A platform / integration issue → fix the integration (the IdP cert, the SMTP, the captcha provider); communicate the resolution.
A misconfiguration → roll back the offending change; document the lesson; tighten the change-review process.

When to escalate

Escalate to the platform team (your IntelliAuth platform admin or support contact) when:

All users are failing across all applications (system-wide auth outage).
The audit log itself shows gaps or missing events (something's broken on the platform side).
The error code is unfamiliar — check the error code index first; if still unfamiliar, escalate.
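A rough way to spot gaps is to look for unusually long silences between consecutive audit events. This is a heuristic sketch. The 10-minute threshold is arbitrary, and a quiet tenant at night produces gaps without anything being broken. Treat hits as prompts to look closer rather than proof.

```python
from datetime import timedelta


def find_gaps(timestamps, max_gap=timedelta(minutes=10)):
    """Return (earlier, later) pairs where consecutive audit events are more than max_gap apart."""
    ordered = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > max_gap
    ]
```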

Don't escalate for routine causes (one user forgot their password). Manage those at the support level.

Post-incident

After resolving:

Document what you found and how you fixed it. Even small incidents benefit from a paragraph in your team's wiki.
If a policy change caused the spike, decide whether to roll it back or tighten the change-review process.
If a user reported the issue, follow up to confirm resolution.
If the cause is recurring (same root cause as last month), invest in prevention — automate the alert, write a per-cause runbook, etc.
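For a recurring cause, a per-cause alert can start very simply: count one error code per hour and flag the hours above a fixed limit. Here is a minimal sketch. It assumes exported events that carry a hypothetical occurred_at datetime. In practice you'd wire it into whatever alerting you already use.

```python
from collections import Counter


def per_cause_alerts(events, error_code, limit_per_hour):
    """Return the hours (truncated datetimes) where one error code exceeded a limit.

    Assumes events carry a hypothetical "occurred_at" datetime plus data.error_code.
    """
    per_hour = Counter(
        e["occurred_at"].replace(minute=0, second=0, microsecond=0)
        for e in events
        if e["type"] == "user.signed_in.failed"
        and e["data"]["error_code"] == error_code
    )
    return sorted(hour for hour, count in per_hour.items() if count > limit_per_hour)
```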

The Dashboard's "Sign-in failure rate over time" chart gives the long view; a spike worth investigating today should leave a visible signal you can refer back to when patterns recur.
