Incident response runbook

An incident in IntelliAuth most often means something is wrong with authentication traffic for one or more tenants. Users can't sign in, sessions are dropping, tokens are failing to refresh, or the platform itself is degraded. This runbook is the order of operations for the on-call platform operator.

The principle: observe before you act. The console gives you most of what you need to figure out blast radius before you make any state change. Read first, click second.

Stage 1 — Observe

Three sources of truth, in order:

The audit feed at /dashboard/audit. Filter by tenant.provisioning.failed, tenant.deprovisioning.*, org.member_role_changed — anything that's a recent state change worth correlating with the symptom. If you see the symptom started right after a change event, you've got your suspect.
The Tenants page. Look for tenants in unexpected states. A tenant in failed state that should be active is a red flag. A tenant in suspended state nobody remembers suspending is another.
External signals — your monitoring/alerting layer, the status pages of upstream providers (DNS, email provider, etc.), customer reports. These tell you whether the issue is contained to your platform or wider.

Time-box this stage to 5–10 minutes. If you don't know what's happening after that, escalate.

Stage 2 — Contain

If you've identified a misbehaving tenant (login failures concentrated to one customer, or a specific tenant emitting odd events):

Suspend the tenant. Pause its traffic without releasing resources. See Suspend and resume. This contains blast radius — other tenants are unaffected.
Save state. Don't decommission unless you're sure. Decommission is destructive; suspend is reversible.

If you've identified a platform-wide issue (every tenant degraded):

Don't suspend tenants. Suspending one or many doesn't help — the underlying problem is upstream.
Stop the bleeding via the upstream channel. Roll back a recent platform deploy, scale up infrastructure, talk to your infra team. The control plane is the wrong tool for fixing platform-level outages.

Stage 3 — Diagnose

The audit feed plus the tenant detail page give you the chronology. For a single-tenant issue:

Open the tenant's detail page at /dashboard/tenants/<id>. Look at the provisioning timeline — did anything fail recently?
Check the tenant's own admin console at https://<slug>-<org>.<domain>/admin if accessible. Tenant-internal audit shows user-level events (login failures, MFA challenges) that the platform-level audit doesn't.
Cross-reference timestamps. A spike of login failures right after a brand colour change is suspicious; right after a code release is more suspicious.

For platform-wide issues, the control plane diagnostic surface is shallow by design (it's a customer-facing console, not an SRE console). Pull observability from your infrastructure stack.

Stage 4 — Recover

Once you know the cause, the recovery path depends on it:

Cause	Recovery
Tenant in `failed` state from a partial provisioning	See Recover a stuck provisioning saga. Retry or decommission.
Tenant suspended in error	Resume from its detail page. Instant.
Misconfigured authentication policy on a tenant	Sign in to the tenant's admin console and fix it there. The control plane can't reach into tenant config.
Member with wrong role caused damage	Edit their role on the Members page. Audit log keeps the record of the original wrong role.
Platform-wide outage	Out of scope for the control plane. Handle via infrastructure tooling.

Stage 5 — Post-incident

Two artefacts to produce:

A short note in your team's incident tracker. What happened, what you did, when. The audit feed is the chronology; you supply the why.
A check on the tenant's audit log to confirm the recovery action emitted the right event (tenant.resumed, tenant.provisioning.completed, etc.). If the event isn't there, the recovery didn't take and you should re-check.

If a similar incident is likely to recur, the right next move is a process/tooling improvement, not just more reading of this runbook. Surface that to the team.

What this runbook explicitly doesn't cover

Tenant-internal incidents (one customer's user can't sign in to their tenant). Those are tenant-admin concerns, not platform-admin. Direct the customer to their own admin console.
Infrastructure-level outages. The control plane runs on top of infrastructure; when the infra is down, the control plane is too. Escalate to your infra team.
Recovery of decommissioned tenants. Decommission is terminal. The path forward is recreate (a fresh tenant under the same slug) or accept the loss.