Provisioning overview named the five phases at a high level. This page goes one level deeper — what each phase actually does, why it can fail, and what an audit event looks like when it does. The intended audience is the platform operator on call who needs to read a stuck saga and know which knob to twist.
Phase 1 — Validate the request
The workflow's first phase confirms the request can proceed. Three checks:
- Slug uniqueness. The slug isn't in use by another active or suspended tenant in this organisation.
- Plan existence. The plan_id on the request resolves to a real plan tier the org is entitled to.
- Role check. The caller's role admits provisioning (Owner or Admin).
Failures here surface immediately as 4xx responses on the create-tenant API; the saga never starts. Operators don't typically see Phase-1 failures in the audit feed — they're rejected before any event is emitted.
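As a rough illustration of the three checks, here is a minimal sketch. The request shape, the function name, and the specific status codes are assumptions made for the example; this page only says that failures surface as 4xx responses.

```python
from dataclasses import dataclass

# Roles named on this page as allowed to provision.
PROVISIONING_ROLES = {"Owner", "Admin"}


@dataclass(frozen=True)
class CreateTenantRequest:
    """Hypothetical request shape; the field names are illustrative."""
    slug: str
    plan_id: str
    caller_role: str


def validate_request(req, slugs_in_use, entitled_plan_ids):
    """Return (status, reason). Any 4xx means the saga never starts.

    The exact status codes are assumptions, not the platform's contract.
    """
    if req.slug in slugs_in_use:
        return 409, "slug already used by an active or suspended tenant"
    if req.plan_id not in entitled_plan_ids:
        return 422, "plan_id does not resolve to an entitled plan tier"
    if req.caller_role not in PROVISIONING_ROLES:
        return 403, "caller role does not admit provisioning"
    return 202, "accepted"
```

Because validation runs before the saga exists, a rejected request leaves nothing behind to compensate.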
Phase 2 — Allocate the tenant's resources
The platform reserves what the new tenant needs: storage, identity store, traffic boundary. Each allocation is recorded as a row in the platform's resource-allocation ledger so cleanup is precise on rollback.
Audit events emitted during this phase:
- tenant.provisioning.step_completed once per resource the platform reserves, with a human-readable phrase in the details column naming what was allocated.
- tenant.provisioning.step_completed once each allocation is persisted to the platform's catalog.
If an allocation fails (e.g. the upstream is at capacity or unreachable), the platform retries the activity up to its budget. After exhausting retries, the workflow rolls back any earlier allocations and the tenant lands in failed state.
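The retry-then-roll-back behaviour can be sketched as follows. This is an illustration under assumptions, not the platform's code: allocate_all, AllocationError, the callable signatures, and the default budget of three attempts are all hypothetical, and a real implementation would normally wait between attempts.

```python
class AllocationError(Exception):
    """Hypothetical error for an upstream that is at capacity or unreachable."""


def allocate_all(allocators, release, max_attempts=3):
    """Allocate each resource in order, retrying each one up to max_attempts.

    allocators: dict mapping resource name -> zero-argument callable returning a handle.
    release:    callable(resource_name, handle) that undoes one allocation.
    Returns (state, ledger), where the ledger lists (resource, handle) rows.
    """
    ledger = []  # one row per successful allocation, in allocation order
    for resource, allocate in allocators.items():
        for attempt in range(1, max_attempts + 1):
            try:
                ledger.append((resource, allocate()))
                break  # this resource is done; move on to the next one
            except AllocationError:
                if attempt < max_attempts:
                    continue  # budget left: retry this activity only
                # Budget exhausted: release earlier allocations, newest first.
                for done_resource, handle in reversed(ledger):
                    release(done_resource, handle)
                return "failed", []
    return "allocated", ledger
```

The ledger is what keeps cleanup precise: only rows that were actually written get released.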
Phase 3 — Deploy the tenant's workloads
The platform launches the services that will serve this tenant's authentication traffic. These services are tenant-scoped (separate from your platform-wide services), and they configure themselves against the resources allocated in Phase 2.
Audit event:
- tenant.provisioning.step_completed once the workloads launch.
Then the workflow waits for readiness checks to pass. If they don't (the services start crashing, the readiness endpoint never returns 200), the workflow stalls in Phase 3 until either readiness succeeds or the per-activity timeout fires.
Audit event after readiness:
- tenant.provisioning.step_completed once all deployments report ready.
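The readiness wait in this phase is, in essence, a poll loop bounded by a deadline. The sketch below is illustrative only: wait_for_ready, its parameters, and the polling interval are assumptions, and the clock and sleep functions are injectable purely so the loop can be exercised without waiting.

```python
import time


def wait_for_ready(check, timeout_s, poll_s=1.0, clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or timeout_s elapses.

    Returns True if readiness was observed, False if the timeout fired first.
    check is a hypothetical callable, e.g. "did the readiness endpoint return 200?".
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False  # the per-activity timeout fired
        sleep(poll_s)
```

A crashing service and a readiness endpoint that never answers look the same from here: check() keeps returning False until the deadline passes.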
Phase 4 — Mark the tenant active
A short phase: the platform flips the tenant's state from provisioning to active in the catalog. This is what causes the state badge in the console to change.
Audit event:
- tenant.provisioning.completed with duration_seconds and a resources summary.
After this point, traffic to https://<tenant-slug>-<org-slug>.<domain> succeeds. The tenant's own admin can sign in (if the welcome email reached them).
Phase 5 — Send the welcome emails
Best-effort three-email sequence to the first admin:
- Welcome email — they exist; here's the org context.
- Console URL email — where to sign in to their tenant's admin console.
- Onboarding email — what to configure first.
If email delivery fails (provider issue, dev-environment allowlist guard, etc.), the platform logs the failure but does NOT roll back the tenant. The thinking: a tenant with no welcome email is still operational — the admin can be re-invited or the email re-sent. Losing the tenant entirely over a transient email issue would be worse.
Audit event:
- tenant.provisioning.step_completed with "welcome emails dispatched" if all three sent.
- If any email failed to dispatch, the platform records the failure on the saga's step record and leaves the tenant in place.
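Best-effort here means that a failed send is recorded rather than raised. A minimal sketch, assuming a hypothetical send(kind, recipient) callable and a made-up step-record shape; whether the platform still attempts the remaining emails after one fails is also an assumption of this example.

```python
EMAIL_SEQUENCE = ("welcome", "console_url", "onboarding")


def send_welcome_sequence(send, recipient):
    """Try all three emails; record failures instead of raising.

    Returns a step record (illustrative shape). Nothing here touches the
    tenant's state, so an email failure can never trigger a rollback.
    """
    step = {"failed": [], "tenant_rolled_back": False, "audit_detail": None}
    for kind in EMAIL_SEQUENCE:
        try:
            send(kind, recipient)
        except Exception as exc:  # provider outage, allowlist guard, etc.
            step["failed"].append((kind, str(exc)))
    if not step["failed"]:
        # Only a fully successful sequence gets the step_completed detail.
        step["audit_detail"] = "welcome emails dispatched"
    return step
```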
Rollback (the compensation chain)
When a phase fails after retries, the workflow doesn't just abort; it runs the compensation chain in reverse order. Each phase that emitted a "step completed" gets a corresponding "compensated" event:
- Phase 3 (workloads deployed) → uninstall workloads.
- Phase 2 (resources allocated) → release allocations back to the pool.
- Phase 1 (state flipped) → flip back.
Audit events during rollback:
- tenant.provisioning.compensated once per compensated phase, with the details column naming what was rolled back.
- tenant.provisioning.failed at the end, with failed_activity naming the phase that originally broke.
The tenant lands in failed state with its audit history intact. From there, retry or decommission — see Tenant lifecycle and Recover a stuck provisioning saga.
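The compensation chain follows the usual saga pattern: remember what completed, and undo it newest-first when something breaks. The sketch below is a simplified illustration. The event type names and the failed_activity and details fields come from this page, while run_saga, the step tuples, and the dict-shaped events are assumptions, and retries are left out for brevity.

```python
PREFIX = "tenant.provisioning."


def run_saga(steps):
    """Run steps in order; on failure, compensate completed steps in reverse.

    steps: list of (name, action, compensate) tuples of zero-argument callables.
    Returns the list of audit events the run would emit (illustrative shape).
    """
    events = []
    completed = []  # (name, compensate) for every step that finished
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            # Undo what already happened, newest first.
            for done_name, undo in reversed(completed):
                undo()
                events.append({"type": PREFIX + "compensated", "details": done_name})
            events.append({"type": PREFIX + "failed", "failed_activity": name})
            return events
        completed.append((name, compensate))
        events.append({"type": PREFIX + "step_completed", "details": name})
    events.append({"type": PREFIX + "completed"})
    return events
```

Reading the resulting feed bottom-up gives the story in one pass: the failed event names the culprit, and the compensated events above it confirm what was cleaned up.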
Why phases are idempotent
Each phase is designed to be re-runnable without side effects. That property is what makes the whole system resilient to transient failures:
- A network blip during Phase 2 retries Phase 2 (not the whole workflow).
- A platform restart mid-saga resumes from the last successfully-completed phase.
- A manual retry from the console restarts from the last-failed phase.
If a phase wasn't idempotent, every retry would risk double-allocating, double-deploying, double-sending. Idempotency is achieved by tracking "did this exact step already happen?" in the platform's catalog and short-circuiting when it has.
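The short-circuit can be pictured as a guard around each step. This is a deliberately small sketch: the in-memory set stands in for the platform's catalog, and run_once and the step-key format are hypothetical.

```python
def run_once(catalog, step_key, action):
    """Run action() only if step_key is not already recorded in the catalog.

    catalog:  a set standing in for the platform's catalog (assumption).
    step_key: a unique identifier for "this exact step", e.g. "<tenant>:<step>".
    Returns True if the action ran, False if it was short-circuited.
    """
    if step_key in catalog:
        return False  # already happened; a retry must not repeat it
    action()
    catalog.add(step_key)  # record only after the action succeeded
    return True
```

A real system also has to cover a crash between the action and the catalog write, typically by making the action itself safe to repeat or by recording both atomically; the sketch does not attempt that.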
Reading a stuck saga from the audit feed
The pattern for diagnosing a stuck provisioning:
- Open /dashboard/audit and filter by tenant id (or just by tenant.provisioning).
- Find the most-recent step_completed event. That's where the saga last made progress.
- The next event tells you what failed: another step_completed, a failed, or, if the saga is genuinely stuck, silence (no new event in 30+ seconds).
From there, the recovery path depends on what failed. See Recover a stuck provisioning saga.
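If you export the feed and want to apply the same reading mechanically, a sketch like the one below may help. It assumes each event is a dict with a type and a timestamp under an "at" key, which is an illustrative shape rather than the platform's export format; the 30-second silence threshold is the one given above.

```python
from datetime import datetime, timedelta

PREFIX = "tenant.provisioning."


def diagnose(events, now, silence=timedelta(seconds=30)):
    """Classify a tenant's saga from its audit events (any order)."""
    saga = sorted(
        (e for e in events if e["type"].startswith(PREFIX)),
        key=lambda e: e["at"],
    )
    if not saga:
        return "no provisioning events"
    last = saga[-1]
    kind = last["type"][len(PREFIX):]
    if kind == "failed":
        return f"failed in {last.get('failed_activity', 'unknown activity')}"
    if kind == "completed":
        return "completed"
    if now - last["at"] >= silence:
        # No new event for 30+ seconds: the saga is likely stuck here.
        return f"stuck after: {last.get('details', kind)}"
    return "in progress"
```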