Skip to content

Provisioning workflow — phase by phase

Provisioning overview named the five phases at a high level. This page goes one level deeper — what each phase actually does, why it can fail, and what an audit event looks like when it does. The intended audience is the platform operator on call who needs to read a stuck saga and know which knob to twist.

The workflow's first phase confirms the request can proceed. Three checks:

  • Slug uniqueness. The slug isn't in use by another active or suspended tenant in this organisation.
  • Plan existence. The plan_id on the request resolves to a real plan tier the org is entitled to.
  • Role check. The caller's role admits provisioning (Owner or Admin).

Failures here surface immediately as 4xx responses on the create-tenant API; the saga never starts. Operators don't typically see Phase-1 failures in the audit feed — they're rejected before any event is emitted.

The platform reserves what the new tenant needs: storage, identity store, traffic boundary. Each allocation is recorded as a row in the platform's resource-allocation ledger so cleanup is precise on rollback.

Audit events emitted during this phase:

  • tenant.provisioning.step_completed once per resource the platform reserves, with a human-readable phrase in the details column naming what was allocated.
  • tenant.provisioning.step_completed once each allocation is persisted to the platform's catalog.

If an allocation fails (e.g. the upstream is at capacity or unreachable), the platform retries the activity up to its budget. After exhausting retries, the workflow rolls back any earlier allocations and the tenant lands in failed state.

The platform launches the services that will serve this tenant's authentication traffic. These services are tenant-scoped — separate from your platform-wide services — and they configure themselves against the resources allocated in Phase 2.

Audit event:

  • tenant.provisioning.step_completed once the workloads launch.

Then the workflow waits for readiness checks to pass. If they don't (the services start crashing, the readiness endpoint never returns 200), the workflow stalls in Phase 3 until either readiness succeeds or the per-activity timeout fires.

Audit event after readiness:

  • tenant.provisioning.step_completed once all deployments report ready.

A short phase: the platform flips the tenant's state from provisioning to active in the catalog. This is what causes the state badge in the console to change.

Audit event:

  • tenant.provisioning.completed with duration_seconds and a resources summary.

After this point, traffic to https://<tenant-slug>-<org-slug>.<domain> succeeds. The tenant's own admin can sign in (if the welcome email reached them).

Best-effort three-email sequence to the first admin:

  1. Welcome email — they exist; here's the org context.
  2. Console URL email — where to sign in to their tenant's admin console.
  3. Onboarding email — what to configure first.

If email delivery fails (provider issue, dev-environment allowlist guard, etc.), the platform logs the failure but does NOT roll back the tenant. The thinking: a tenant with no welcome email is still operational — the admin can be re-invited or the email re-sent. Losing the tenant entirely over a transient email issue would be worse.

Audit event:

  • tenant.provisioning.step_completed with "welcome emails dispatched" if all three sent.
  • If any email failed to dispatch the platform records the failure on the saga's step record but doesn't roll the tenant back — the tenant is still operational and the admin can be re-invited.

When a phase fails after retries, the workflow doesn't just abort — it runs the compensation chain in reverse order. Each phase that emitted a "step completed" gets a corresponding "compensated" event:

  • Phase 3 (workloads deployed) → uninstall workloads.
  • Phase 2 (resources allocated) → release allocations back to the pool.
  • Phase 1 (state flipped) → flip back.

Audit events during rollback:

  • tenant.provisioning.compensated once per compensated phase, with the details column naming what was rolled back.
  • tenant.provisioning.failed at the end, with failed_activity naming the phase that originally broke.

The tenant lands in failed state with its audit history intact. From there, retry or decommission — see Tenant lifecycle and Recover a stuck provisioning saga.

Each phase is designed to be re-runnable without side effects. That property is what makes the whole system resilient to transient failures:

  • A network blip during Phase 2 retries Phase 2 (not the whole workflow).
  • A platform restart mid-saga resumes from the last successfully-completed phase.
  • A manual retry from the console restarts from the last-failed phase.

If a phase wasn't idempotent, every retry would risk double-allocating, double-deploying, double-sending. Idempotency is achieved by tracking "did this exact step already happen?" in the platform's catalog and short-circuiting when it has.

The pattern for diagnosing a stuck provisioning:

  1. Open /dashboard/audit and filter by tenant id (or just by tenant.provisioning).
  2. Find the most-recent step_completed event. That's where the saga last made progress.
  3. The next event tells you what failed: another step_completed, a failed, or — if the saga is genuinely stuck — silence (no new event in 30+ seconds).

From there, the recovery path depends on what failed. See Recover a stuck provisioning saga.