
Diagnose failed-login spikes

The Dashboard shows a spike in sign-in failures, or your alerting fires "sign-in failure rate above threshold". Five minutes of investigation usually identifies the cause; the diagnostic tree below walks through it.

Step 1 — Open the audit log filtered to failures


Audit → Read logs. Filter:

  • Type: user.signed_in.failed.
  • Time range: the spike's window (typically last hour).

You'll see a list of failed sign-in events. Each event carries a data.error_code field that records why the sign-in failed.

Open the first 5 to 10 events and note the error_code distribution. One of three patterns usually emerges:

  • One error code dominates — the spike has a single root cause.
  • Multiple error codes, mixed — multiple things are happening; investigate each.
  • Same user across many failures — a single user is having trouble (or being targeted).
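If you export the filtered events (for example as JSON), the distribution can be tallied in a few lines. The following is a minimal sketch, not the platform's API. It assumes each event is a dict with a `type`, a `user_id`, and a nested `data.error_code`; the `user_id` field name, the `user.signed_in` event type, the `example_other_code` value, and the 60% "dominates" threshold are hypothetical choices made for illustration.

```python
from collections import Counter

def summarize_failures(events):
    """Tally error codes and per-user counts across failed sign-in events."""
    failed = [e for e in events if e.get("type") == "user.signed_in.failed"]
    codes = Counter(e.get("data", {}).get("error_code", "unknown") for e in failed)
    users = Counter(e.get("user_id", "unknown") for e in failed)
    return codes, users

def dominant(counter, threshold=0.6):
    """Return the key holding at least `threshold` of all counts, else None."""
    total = sum(counter.values())
    if total == 0:
        return None
    key, count = counter.most_common(1)[0]
    return key if count / total >= threshold else None

# Hypothetical export: field names and the second error code are illustrative.
sample_events = [
    {"type": "user.signed_in.failed", "user_id": "u1", "data": {"error_code": "mfa_failed"}},
    {"type": "user.signed_in.failed", "user_id": "u1", "data": {"error_code": "mfa_failed"}},
    {"type": "user.signed_in.failed", "user_id": "u1", "data": {"error_code": "mfa_failed"}},
    {"type": "user.signed_in.failed", "user_id": "u2", "data": {"error_code": "example_other_code"}},
    {"type": "user.signed_in", "user_id": "u3", "data": {}},  # success, ignored
]

codes, users = summarize_failures(sample_events)
print("error codes:", codes.most_common())
print("dominant code:", dominant(codes))   # one code dominating -> single root cause
print("dominant user:", dominant(users))   # one user dominating -> single-user issue
```

If `dominant(codes)` returns `None`, the codes are mixed and each one deserves its own investigation.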

Step 2 — Match the error code to the cause


The common error codes and what they mean:

Most common in normal traffic. The user typed the wrong password.

Spike pattern:

  • Across many users → likely a password-policy change that caught many grandfathered passwords, or simply the routine background of typos during busy hours.
  • Against one specific user → either the user forgot their password, or it is a targeted attack. Check the data.context.ip distribution: many IPs suggest an attack; a single IP suggests a forgotten password.
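The IP check for a single user can be sketched as follows. This is an illustration under assumptions: the events are an export with the `data.context.ip` field mentioned above, `user_id` is a hypothetical field name, and the cut-off of five distinct IPs is an arbitrary example value, not a platform setting.

```python
from collections import Counter

def ip_distribution(events, user_id):
    """Count failed sign-ins per source IP for one user."""
    return Counter(
        e["data"]["context"]["ip"]
        for e in events
        if e.get("user_id") == user_id
    )

def classify_ip_spread(ips, many_ips=5):
    """Rough triage: many distinct IPs hint at an attack, one IP at a forgotten password."""
    if not ips:
        return "no data"
    if len(ips) >= many_ips:
        return "possible targeted attack"
    if len(ips) == 1:
        return "likely forgotten password"
    return "inconclusive"

def make_event(user_id, ip):
    # Helper that builds a hypothetical exported event for the demo below.
    return {"type": "user.signed_in.failed", "user_id": user_id,
            "data": {"context": {"ip": ip}}}

# Addresses come from the documentation-only ranges (RFC 5737).
forgetful = [make_event("alice", "203.0.113.7") for _ in range(6)]
targeted = [make_event("bob", f"198.51.100.{n}") for n in range(1, 9)]

print(classify_ip_spread(ip_distribution(forgetful, "alice")))
print(classify_ip_spread(ip_distribution(targeted, "bob")))
```

Treat the result as a hint only: a user on a mobile network can legitimately appear from several IPs, which is why the middle case is reported as inconclusive.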

Investigation:

  • Did anything change recently? Audit → filter on admin.policy_updated for the day.
  • Was there a password-feed update? Audit → security.threat_feed_refreshed events.

Brute-force defence kicked in. The user (or someone trying to be them) hit the rate limit.

Spike pattern: Almost always a single user or a small cluster. The lock is per-user.

Investigation:

  • Filter audit to the locked user's id + last 24h. Look for the failed-attempt pattern that led to the lock.
  • If the failed attempts are from the user's normal IP, they forgot the password. Help them reset.
  • If from unusual IPs, it's a targeted attack. Notify the user, consider tightening lockout policy.

The user's primary authentication succeeded but MFA failed (error code mfa_failed).

Spike pattern: Usually a single user. A burst of mfa_failed per user is one of:

  • They lost their MFA factor. Walk them through MFA recovery.
  • Their authenticator app's clock drifted (TOTP fails if the device's time is off). Check the user agent + device fingerprint to see if it's a known-buggy device.
  • An attacker has the password but not the second factor. The user's password is compromised; force-rotate it.
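To see why clock drift breaks TOTP, it helps to look at how a code is derived from the current time. The sketch below implements the standard TOTP algorithm (RFC 6238, HMAC-SHA1) using only the Python standard library. The `accepts` function and its `window` tolerance are assumptions made for illustration; this guide does not state how much drift the platform actually tolerates.

```python
import hashlib
import hmac
import struct

def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Compute a TOTP code (RFC 6238, HMAC-SHA1) for the given Unix time."""
    counter = unix_time // step                      # index of the 30-second step
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                       # dynamic truncation (RFC 4226)
    value = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)

def accepts(secret, submitted, server_time, window=1, step=30, digits=6):
    """Hypothetical server check: accept codes within +/- `window` time steps."""
    return any(
        hmac.compare_digest(totp(secret, server_time + i * step, step, digits), submitted)
        for i in range(-window, window + 1)
    )

# Shared secret from the RFC 6238 test vectors (ASCII, SHA-1 variant).
secret = b"12345678901234567890"
server_time = 1111111109
device_time = server_time + 2        # two seconds ahead, but already in the next step

code = totp(secret, device_time, digits=8)
print("device code:", code)
print("accepted with one step of tolerance:",
      accepts(secret, code, server_time, window=1, digits=8))
print("accepted with no tolerance:",
      accepts(secret, code, server_time, window=0, digits=8))
```

A device whose clock is off by several minutes produces codes many steps away from the server's, so no small tolerance window will match. That is the pattern behind repeated mfa_failed events from an otherwise legitimate device.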

The user account is in disabled state.

Spike pattern: This code rarely spikes. It usually involves a single user who was disabled and didn't realise.

Investigation:

  • Why was the user disabled? Audit on the user → user.disabled event. Reason field tells you.
  • Should the user be re-enabled? Decide on the spot or escalate.

The risk engine demanded captcha; the user didn't complete it (or the captcha provider had issues).

Spike pattern:

  • A few users at once → routine; the risk engine flagged them, they walked away.
  • Many users across many IPs → check security.threat_feed_hit events in the same window; threat feeds may have erroneously flagged a legitimate IP range.
  • All users → captcha provider is down (rare; Cloudflare Turnstile / reCAPTCHA outage). Wait it out; or fall back to a less-strict policy temporarily.

The user signed up but didn't click the email verification link, then tried to sign in.

Spike pattern: usually after a sign-up campaign. Many new sign-ups didn't verify.

Investigation:

  • Are the verification emails arriving? Test from your tenant via the integration's Send test email.
  • If they ARE arriving but not being clicked, the comms might be poorly worded; users don't realise they need to click. Consider tweaking the email template.
  • Or relax the policy to allow sign-in without verification on first try (with a follow-up reminder).

The user's session did not meet the authenticator assurance level (AAL) required for the action they tried. This is not really a "failed sign-in", but it is counted in the same category for monitoring purposes.

Spike pattern: usually correlates with a policy change tightening step-up. Verify in the audit log.

Check what else happened in the same window:

  • admin.policy_updated — did someone change something?
  • security.threat_feed_refreshed — did a feed update broaden the blocked set?
  • federation.metadata_refreshed — did an external IdP rotate certs (which may have temporarily broken sign-ins)?

If any of these occurred, check whether the timing matches the start of the spike. If it does, that event is the likely cause.
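The timing check can be sketched against an exported event list. This is a minimal illustration, not a platform feature: the `timestamp` field name, the use of `datetime` values, and the 30-minute lookback are assumptions; only the three event types come from the list above.

```python
from datetime import datetime, timedelta

CHANGE_TYPES = {
    "admin.policy_updated",
    "security.threat_feed_refreshed",
    "federation.metadata_refreshed",
}

def changes_near_spike(events, spike_start, lookback=timedelta(minutes=30)):
    """Return change events that happened shortly before the spike began, oldest first."""
    candidates = [
        e for e in events
        if e["type"] in CHANGE_TYPES
        and spike_start - lookback <= e["timestamp"] <= spike_start
    ]
    return sorted(candidates, key=lambda e: e["timestamp"])

# Hypothetical export; timestamps are illustrative.
spike_start = datetime(2024, 1, 15, 14, 0)
audit_events = [
    {"type": "security.threat_feed_refreshed", "timestamp": datetime(2024, 1, 15, 9, 0)},
    {"type": "admin.policy_updated", "timestamp": datetime(2024, 1, 15, 13, 50)},
    {"type": "user.signed_in.failed", "timestamp": datetime(2024, 1, 15, 14, 1)},
]

for event in changes_near_spike(audit_events, spike_start):
    minutes_before = (spike_start - event["timestamp"]).total_seconds() / 60
    print(f"{event['type']} occurred {minutes_before:.0f} min before the spike")
```

A change that lands minutes before the spike is a strong suspect; one from hours earlier, like the feed refresh in the sample, is less likely to be related.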

The audit log can be misleading or incomplete. Sign in as a test user and walk through the sign-in flow yourself. If sign-in fails for you too, the issue is system-wide (something is broken).

This takes about 30 seconds and confirms or refutes "is sign-in actually broken right now?"

Most spikes resolve in one of three ways:

  • A real user-side issue → help the affected user(s) directly; document the case; consider whether to adjust policy.
  • A platform / integration issue → fix the integration (the IdP cert, the SMTP, the captcha provider); communicate the resolution.
  • A misconfiguration → roll back the offending change; document the lesson; tighten the change-review process.

Escalate to the platform team (your IntelliAuth platform admin or support contact) when:

  • All users are failing across all applications (system-wide auth outage).
  • The audit log itself shows gaps or missing events (something's broken on the platform side).
  • The error code is unfamiliar — check the error code index first; if still unfamiliar, escalate.

Don't escalate for routine causes (one user forgot their password). Manage those at the support level.

After resolving:

  • Document what you found and how you fixed it. Even small incidents benefit from a paragraph in your team's wiki.
  • If a policy change caused the spike, decide whether to roll it back or tighten the change-review process.
  • If a user reported the issue, follow up to confirm resolution.
  • If the cause is recurring (same root cause as last month), invest in prevention — automate the alert, write a per-cause runbook, etc.

The Dashboard's "Sign-in failure rate over time" chart gives the long view. A spike worth investigating today stays visible there, so you can refer back to it when the pattern recurs.