Disaster recovery and failover are mostly not your job as a platform admin. The infrastructure operating IntelliAuth — the underlying storage, the workflow engine, the identity store — is run by the IntelliAuth platform team (or, on a self-hosted deployment, your own infrastructure team). What this runbook covers is the slice of DR that is your job: knowing what the platform does on your behalf, when to escalate, and what the boundary of your authority is.
What the platform does for you
Three guarantees you don't have to enforce yourself:
- Data is replicated. Tenant data, audit logs, member records — everything written through the control plane is durably stored with replication. A single node loss doesn't lose data.
- Workflows are resumable. If the platform restarts mid-provisioning, the saga picks up where it left off. The state flips back to `provisioning` until the workflow completes (or fails into `failed`).
- Audit retention is enforced. Every event in your org audit log is retained per the platform's retention policy. Disasters don't erase the past.
What the platform doesn't do:
- It doesn't restore decommissioned tenants. Decommission is intentional and terminal; DR doesn't undo it.
- It doesn't restore your customers' user data. Tenant-internal data (users, sessions, applications) is the tenant's own concern. The platform retains it via the same replication, but you don't have a single "restore this tenant to last Tuesday" button — that's a tenant-admin concern with its own escape valves.
When something genuinely breaks at the platform level
Symptoms of a real platform-level problem:
- The control plane console (`https://manage.<domain>`) is unreachable.
- The console loads but every action fails with 5xx errors.
- Tenants show inconsistent states across the list view vs. detail view.
- The audit feed stops updating (no new events for an unusually long window).
Your job in this scenario isn't to fix the platform — it's to:
- Stop digging. Don't retry destructive actions hoping they'll go through. They might queue and execute later in an unexpected order.
- Note what you tried. Timestamps, what you clicked, what the errors said. This is gold for the platform team's post-mortem.
- Communicate to your end-users. End-users authenticating against your tenants will see the symptom; you should set their expectations. A short "we're seeing degraded service on our auth platform; updates here" is better than silence.
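One way to keep the "note what you tried" habit consistent is a tiny structured log you fill in as you go. The sketch below is illustrative only: `IncidentNote` and `render_log` are hypothetical names, not part of IntelliAuth, and the tenant name and error text in the usage example are made up.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentNote:
    """One thing you tried during the outage (hypothetical structure)."""
    action: str   # what you clicked or ran
    error: str    # what the console or API reported
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def render(self) -> str:
        # ISO 8601 timestamps in UTC are unambiguous for a post-mortem timeline.
        return f"{self.at.isoformat(timespec='seconds')} | {self.action} | {self.error}"


def render_log(notes: list[IncidentNote]) -> str:
    """Return the notes as a chronological, copy-pasteable timeline."""
    return "\n".join(n.render() for n in sorted(notes, key=lambda n: n.at))


if __name__ == "__main__":
    notes = [
        IncidentNote("Clicked Suspend on tenant 'acme'", "HTTP 503 from console"),
        IncidentNote("Reloaded Tenants list", "List view and detail view disagree"),
    ]
    print(render_log(notes))
```

Pasting the rendered timeline into the escalation message gives the platform team timestamps, actions, and errors in one place.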
When to escalate
You should escalate to the platform team (or, on self-hosted, your infrastructure team) when:
- A platform-level symptom (above) lasts more than 5 minutes.
- A destructive action (suspend, decommission) appears to have hung — the state didn't change after several minutes.
- The audit log shows an event you don't recognise.
- Multiple tenants entered `failed` state at roughly the same time without you having done anything.
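The criteria above can be sketched as a small decision helper. This is an illustration under stated assumptions, not an IntelliAuth API: only the 5-minute symptom threshold comes from this runbook. The values chosen for "several minutes" and for "multiple tenants at roughly the same time" are placeholders you should agree on with your team, and the `Observations` fields are hypothetical inputs you would fill in by hand or from your own monitoring.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Sequence

# From the runbook: a platform-level symptom lasting more than 5 minutes.
SYMPTOM_THRESHOLD = timedelta(minutes=5)
# ASSUMPTION: the runbook only says "several minutes" for a hung action.
HUNG_ACTION_THRESHOLD = timedelta(minutes=5)
# ASSUMPTION: "multiple tenants at roughly the same time" is not quantified.
FAILED_TENANT_MIN_COUNT = 2
FAILED_TENANT_WINDOW = timedelta(minutes=10)


@dataclass
class Observations:
    """What the admin has observed so far (hypothetical structure)."""
    now: datetime
    symptom_started_at: Optional[datetime] = None
    destructive_action_started_at: Optional[datetime] = None  # still unchanged
    unrecognised_audit_events: int = 0
    # Times at which tenants entered `failed` without any admin action.
    unexplained_failures_at: Sequence[datetime] = ()


def escalation_reasons(obs: Observations) -> list[str]:
    """Return every escalation criterion that currently applies."""
    reasons = []
    if (obs.symptom_started_at is not None
            and obs.now - obs.symptom_started_at > SYMPTOM_THRESHOLD):
        reasons.append("platform-level symptom has lasted more than 5 minutes")
    if (obs.destructive_action_started_at is not None
            and obs.now - obs.destructive_action_started_at > HUNG_ACTION_THRESHOLD):
        reasons.append("destructive action appears to have hung")
    if obs.unrecognised_audit_events > 0:
        reasons.append("audit log shows an event you don't recognise")
    # Look for any group of failures that falls inside one time window.
    times = sorted(obs.unexplained_failures_at)
    k = FAILED_TENANT_MIN_COUNT
    if any(times[i + k - 1] - times[i] <= FAILED_TENANT_WINDOW
           for i in range(len(times) - k + 1)):
        reasons.append("multiple tenants entered failed state at about the same time")
    return reasons


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    obs = Observations(now=now, symptom_started_at=now - timedelta(minutes=7))
    for reason in escalation_reasons(obs):
        print("ESCALATE:", reason)
```

Returning the list of reasons, rather than a bare yes or no, means the escalation message can state exactly which criterion was met.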
Escalation paths are environment-specific; your team should have these documented. Common targets: a #platform-on-call channel, a paging system like PagerDuty, or a direct contact for your IntelliAuth provider.
After the platform recovers
Once the platform team confirms the outage is resolved:
- Re-check your tenants. Walk the Tenants list and confirm that states match what you expect.
- Re-check pending invitations. If your members got pending invitations during the outage, they may need to be resent.
- Re-check the audit log. Look for gaps or out-of-order events. If anything looks wrong, surface it — the platform team's reconciliation may need help.
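If you keep your own record of expected tenant states and can export audit events, the first and last checks above can be partly mechanised. The following is a minimal sketch, assuming you can get tenant states as a mapping and audit events as a list in feed order; how you obtain them is environment-specific and not shown. `AuditEvent`, the function names, the tenant IDs, the state names other than `provisioning` and `failed`, and the `max_gap` value are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class AuditEvent:
    """Minimal stand-in for an exported audit log entry (hypothetical)."""
    event_id: str
    timestamp: datetime


def tenant_state_mismatches(
    expected: dict[str, str], observed: dict[str, str]
) -> dict[str, tuple[Optional[str], Optional[str]]]:
    """Map tenant ID to (expected, observed) wherever the two differ.

    A tenant missing from one side shows up with None on that side.
    """
    mismatches = {}
    for tenant_id in sorted(set(expected) | set(observed)):
        want, got = expected.get(tenant_id), observed.get(tenant_id)
        if want != got:
            mismatches[tenant_id] = (want, got)
    return mismatches


def audit_anomalies(events: list[AuditEvent], max_gap: timedelta) -> list[str]:
    """Flag out-of-order events and gaps longer than max_gap.

    `events` must be in the order the audit feed returned them.
    """
    findings = []
    for prev, cur in zip(events, events[1:]):
        delta = cur.timestamp - prev.timestamp
        if delta < timedelta(0):
            findings.append(f"out of order: {cur.event_id} is earlier than {prev.event_id}")
        elif delta > max_gap:
            findings.append(f"gap of {delta} between {prev.event_id} and {cur.event_id}")
    return findings


if __name__ == "__main__":
    expected = {"acme": "active", "globex": "suspended"}
    observed = {"acme": "active", "globex": "active"}
    print(tenant_state_mismatches(expected, observed))

    t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    feed = [
        AuditEvent("e1", t0),
        AuditEvent("e2", t0 + timedelta(minutes=1)),
        AuditEvent("e3", t0 + timedelta(minutes=40)),
        AuditEvent("e4", t0 + timedelta(minutes=39)),
    ]
    for finding in audit_anomalies(feed, max_gap=timedelta(minutes=30)):
        print(finding)
```

A gap during the outage window itself may be expected, so treat the findings as prompts for a closer look, not as proof that something is wrong.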
The platform's job during a real DR event is to bring service back without data loss. Your job is to bridge your end-users through the impact and verify the recovery from your seat.
What this manual doesn't cover
- Restoring data inside a tenant — out of scope here. That's a tenant-admin concern.
- The platform team's internal runbooks — what they do to bring the platform back up isn't your job to know.
- Re-architecting for higher resilience — if you need stronger DR guarantees (e.g., multi-region active-active), that's a procurement conversation about your IntelliAuth plan, not a runbook step.