A tenant's provisioning workflow has been in "Provisioning" for more than about 15 minutes. Or it has flipped to "Failed" and stayed there. Or you cancelled, the tenant row is now in an in-between state, and the console is offering confusing actions.
This runbook walks through the recovery.
What a healthy provisioning looks like
Most tenants reach Active within 30 seconds to 5 minutes. The detail page shows the saga's per-activity progress: validate input → allocate resources → seed schema → deploy workloads → wait for ready → mark active. Each activity ticks within seconds of the previous one.
When something hangs, one activity stays in-progress past its expected duration. That's the wedged step.
Step 1 — Diagnose where it's stuck
Open the tenant detail page. Look at the "Provisioning timeline" pane. The currently-active activity has a spinner; the completed ones have a tick; the not-yet-started ones are greyed out.
Common stalls and their typical causes:
| Stuck on | Likely cause | What to do |
|---|---|---|
| Validate request | Programmatic input mismatch (rare; usually fails immediately rather than stalling) | Cancel; correct the input; recreate |
| Allocate storage | Storage allocation slow or refusing | Check platform health; wait 5 minutes; if no progress, reset |
| Allocate workflow handle | Workflow engine not responding | Platform-tier issue; escalate |
| Apply per-tenant schema | Migration step error | Check the saga's error log; reset + retry usually resolves transient causes |
| Deploy tenant workloads | Workload deployment stuck or failed | Cancel; reset; retry |
| Wait for workloads ready | Workloads deployed but not coming up healthy | Wait up to 10 minutes; if no progress, escalate for cluster-level investigation |
| Mark active | Final flip stuck (rare) | Reset; the tenant retains its resources; the flip retries |
The timeline pane also shows the activity's wall-clock duration. If it's been > 5 minutes on any single step (except "Wait for workloads ready", which can legitimately take 10 minutes), it's stuck.
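The stuck-step rule above can be sketched as a small check. This is a minimal illustration, not platform code: the thresholds come from this runbook, while the function and constant names are made up for the example.

```python
from datetime import timedelta

# Thresholds taken from the runbook text; names are illustrative only.
DEFAULT_LIMIT = timedelta(minutes=5)
STEP_LIMITS = {"Wait for workloads ready": timedelta(minutes=10)}


def is_stuck(activity: str, elapsed: timedelta) -> bool:
    """Return True when an in-progress activity has exceeded its expected duration."""
    limit = STEP_LIMITS.get(activity, DEFAULT_LIMIT)
    return elapsed > limit
```

For example, `is_stuck("Allocate storage", timedelta(minutes=6))` is true, while eight minutes on "Wait for workloads ready" is still within its allowance.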
Step 2 — Cancel if still in-flight
If the tenant is still in Provisioning state, Cancel provisioning is the first action. Cancel runs the saga's compensations: whatever has been provisioned is unwound, and the tenant flips to Failed state.
Why cancel first? It releases the resources cleanly. Skipping cancel and going straight to reset can leave orphaned resources (a half-created database, a Helm release that nobody owns) that need manual cleanup.
After cancel, the tenant is in Failed. The Failed state is recoverable — see step 3.
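To make "compensations" concrete, here is a toy sketch of how a saga unwinds on cancel. It assumes compensations run in reverse order of completion, which is the usual saga convention but is not confirmed for this platform; the function name, the step-to-compensation mapping, and the `resources` set are all hypothetical.

```python
from typing import Callable, Dict, List


def cancel_saga(completed: List[str],
                compensations: Dict[str, Callable[[], None]]) -> List[str]:
    """Run the compensation for each completed step, newest first.

    Returns the order in which steps were unwound.
    """
    unwound = []
    for step in reversed(completed):
        compensations[step]()  # undo this step's side effects
        unwound.append(step)
    return unwound


# Toy stand-in for real resources (hypothetical).
resources = {"storage", "schema"}
compensations = {
    "Allocate storage": lambda: resources.discard("storage"),
    "Apply per-tenant schema": lambda: resources.discard("schema"),
}
order = cancel_saga(["Allocate storage", "Apply per-tenant schema"], compensations)
```

After the call, `resources` is empty. Skipping this unwind is what leaves the orphaned resources described above.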
Step 3 — Reset if you can't move forward
If the tenant is in Failed, whether because you cancelled or because it failed on its own, Reset provisioning clears the workflow's run state. After reset, the tenant is still in Failed, with a fresh saga ready to start.
Reset is a "give the saga a clean slate" operation. It doesn't reallocate resources; it just clears the workflow's run history so a retry doesn't trip over half-completed steps.
If reset itself fails, it's because cancel didn't fully complete: the platform refuses to reset a saga that is still running. This is rare. Wait 2 minutes; try cancel again; then reset.
Step 4 — Retry
Retry provisioning starts the saga fresh against the same tenant id. The tenant transitions Failed → Provisioning. Watch the timeline.
If retry succeeds, the tenant flips to Active. Done.
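The cancel → reset → retry sequence from steps 2 to 4 can be read as a small state machine. The sketch below encodes only the operator-driven transitions this runbook describes; the `State` enum, the `TRANSITIONS` table, and `apply` are illustrative names, not a platform API. The Provisioning → Active flip is left out because the saga makes it, not the operator.

```python
from enum import Enum


class State(Enum):
    PROVISIONING = "Provisioning"
    FAILED = "Failed"
    ACTIVE = "Active"


# (current state, operator action) -> next state, per the runbook text.
TRANSITIONS = {
    (State.PROVISIONING, "cancel"): State.FAILED,
    (State.FAILED, "reset"): State.FAILED,  # clears run state; tenant stays Failed
    (State.FAILED, "retry"): State.PROVISIONING,
}


def apply(state: State, action: str) -> State:
    """Return the next state, or raise if the action is not valid from `state`."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"{action!r} is not valid from {state.value}") from None
```

Note that `apply(State.PROVISIONING, "reset")` raises, mirroring the platform's refusal to reset a saga that is still running.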
If retry fails at the same step as the original, the cause isn't transient. Investigate:
- The platform's underlying infrastructure (database, workflow engine, ingress controller).
- A misconfiguration in the tenant's plan (e.g., requested a region that isn't available).
- A platform-side limit (out of quota, etc.).
Don't loop retry → fail → retry → fail more than 3 times. Each attempt costs allocation churn. After the 3rd failure, decommission + recreate is cleaner.
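One way to read the retry guidance as a decision procedure is sketched below. The `attempt` callable is a hypothetical stand-in for one reset + retry cycle; it returns `None` on success or the name of the step that failed. The ceiling of 3 comes from this runbook. Stopping as soon as the original step fails again is one interpretation of "the cause isn't transient".

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # ceiling from the runbook


def recover(attempt: Callable[[], Optional[str]], original_failed_step: str) -> str:
    """Drive up to MAX_ATTEMPTS retries and return the recommended next move."""
    for _ in range(MAX_ATTEMPTS):
        failed_step = attempt()
        if failed_step is None:
            return "active"
        if failed_step == original_failed_step:
            # Same step failed again: not transient, stop retrying.
            return "investigate"
    return "decommission-and-recreate"
```

A call such as `recover(lambda: "Allocate storage", "Allocate storage")` returns `"investigate"` after a single attempt rather than burning the remaining retries.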
Step 5 — When the runbook above doesn't fix it
Some failure modes need platform-tier intervention:
- Saga stuck in an internal workflow-engine state the platform can't surface ("workflow query timeout").
- A resource the platform thinks it allocated but you can't find any trace of.
- Audit log shows successive provisions for the same slug.
Escalate to platform engineering. Capture before escalating:
- The tenant's id.
- The slug.
- The Provisioning timeline's last successful activity.
- Any error message visible in the saga's error log.
- Whether you've already tried cancel + reset + retry.
The escalation handler has access to deeper diagnostic tools (workflow-engine-level state inspection, workflow history replay) that the console doesn't surface.
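If you want the capture list in a form that pastes cleanly into a ticket, a small record type is enough. This is a sketch: the field names mirror the list above, but the `EscalationReport` class and the sample values are placeholders, not a format the platform defines.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class EscalationReport:
    tenant_id: str
    slug: str
    last_successful_activity: str
    error_message: Optional[str]  # None if the saga's error log showed nothing
    tried_cancel_reset_retry: bool


# Placeholder values for illustration only.
report = EscalationReport(
    tenant_id="tenant-123",
    slug="acme",
    last_successful_activity="Allocate storage",
    error_message="workflow query timeout",
    tried_cancel_reset_retry=True,
)
payload = json.dumps(asdict(report), indent=2)
```

Filling in every field before escalating saves a round trip with the escalation handler.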
Step 6 — Cleanup
After the tenant is finally Active (via retry or recreate):
- Verify the tenant accepts auth traffic via its subdomain.
- Have the tenant admin sign in to the tenant-admin console and confirm.
- Check the audit log for `tenant.provisioning.completed`; confirm the timestamp is recent.
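The audit-log check can be sketched as follows. Only the event name comes from this runbook. The event shape (a dict with `type` and `at` keys) is an assumption for the example, and so is the 15-minute window, since "recent" is not defined here.

```python
from datetime import datetime, timedelta, timezone

EVENT_TYPE = "tenant.provisioning.completed"  # event name from the runbook


def completed_recently(events, now, window=timedelta(minutes=15)) -> bool:
    """True if a completion event falls within `window` before `now`.

    `events` is an iterable of dicts with "type" and "at" keys (assumed shape).
    """
    return any(
        e["type"] == EVENT_TYPE and timedelta(0) <= now - e["at"] <= window
        for e in events
    )


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
events = [
    {"type": "tenant.provisioning.started", "at": now - timedelta(minutes=4)},
    {"type": EVENT_TYPE, "at": now - timedelta(minutes=2)},
]
```

A stale completion event from an earlier attempt would fall outside the window, which is the case this check is meant to catch.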
What NOT to do
- Don't manually edit the tenant's row in the database. The platform's saga is the source of truth; manual edits create drift that the saga doesn't know about.
- Don't decommission + recreate during an active incident without first cancelling. You can leave orphaned resources behind.
- Don't retry more than 3 times. Loop detection is left to the operator; the platform won't stop you, but repeated retries almost never work.
- Don't bypass cancel and go straight to reset. Reset on a still-running saga is refused; if you force it via the API, you'll have orphaned in-flight activities.
After the incident
Document:
- Which step stalled.
- The cause, if you found it.
- Whether the standard cancel → reset → retry worked.
- Total recovery time.
- Any user-facing impact (a tenant that is still being provisioned carries no traffic, so the impact is usually operator time and frustration).
If a specific stall pattern recurs across tenants, escalate to platform engineering as a class issue, not a per-tenant one.
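Spotting a class issue is easier if incident notes are tallied by stalled step. The sketch below assumes incidents are recorded as `(tenant_id, stalled_step)` pairs and treats a step as recurring once it has stalled for at least two distinct tenants. Both the record shape and the threshold are assumptions for the example.

```python
from typing import Dict, Iterable, List, Set, Tuple


def recurring_stalls(incidents: Iterable[Tuple[str, str]],
                     min_tenants: int = 2) -> List[str]:
    """Return steps that stalled for at least `min_tenants` distinct tenants."""
    tenants_by_step: Dict[str, Set[str]] = {}
    for tenant_id, step in incidents:
        tenants_by_step.setdefault(step, set()).add(tenant_id)
    return sorted(
        step for step, tenants in tenants_by_step.items()
        if len(tenants) >= min_tenants
    )
```

Counting distinct tenants rather than raw incidents keeps one tenant's repeated retries from looking like a platform-wide pattern.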