Recover a stuck provisioning saga

A tenant's provisioning workflow has been stuck in "Provisioning" for more than about 15 minutes. Or it has flipped to "Failed" and stayed there. Or you cancelled, the row is now in an in-between state, and the console shows confusing actions.

This runbook walks through the recovery. Follow the steps in order.

What a healthy provisioning looks like

Most tenants reach Active within 30 seconds to 5 minutes. The detail page shows the saga's per-activity progress: validate input → allocate resources → seed schema → deploy workloads → wait for ready → mark active. Each activity ticks within seconds of the previous one.

When something hangs, one activity stays in-progress past its expected duration. That's the wedged step.

Step 1 — Diagnose where it's stuck

Open the tenant detail page. Look at the "Provisioning timeline" pane. The currently-active activity has a spinner; the completed ones have a tick; the not-yet-started ones are greyed out.

Common stalls and their typical causes:

Stuck on	Likely cause	What to do
Validate request	Programmatic input mismatch (rare; usually fails immediately, not stalls)	Cancel; correct the input; recreate
Allocate storage	Storage allocation slow or refusing	Check platform health; wait 5 minutes; if no progress, cancel, then reset
Allocate workflow handle	Workflow engine not responding	Platform-tier issue; escalate
Apply per-tenant schema	Migration step error	Check the saga's error log; a reset followed by a retry usually resolves transient causes
Deploy tenant workloads	Workload deployment stuck or failed	Cancel; reset; retry
Wait for workloads ready	Workloads deployed but not coming up healthy	Wait up to 10 minutes; if no progress, escalate for cluster-level investigation
Mark active	Final flip stuck (rare)	Reset; the tenant retains its resources; the flip retries

The timeline pane also shows the active step's wall-clock duration. If a single step has been running for more than 5 minutes, treat it as stuck. The exception is "Wait for workloads ready", which can legitimately take up to 10 minutes.
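The duration rule can be written down as a small check. This is a minimal sketch, not platform code: the function name and the constants are hypothetical, and the thresholds are simply the 5-minute and 10-minute figures quoted in this runbook.

```python
from datetime import timedelta

# Thresholds taken from this runbook; the names here are illustrative only.
DEFAULT_LIMIT = timedelta(minutes=5)
STEP_LIMITS = {"Wait for workloads ready": timedelta(minutes=10)}


def is_stuck(activity: str, elapsed: timedelta) -> bool:
    """Return True when the active step has exceeded its expected duration."""
    limit = STEP_LIMITS.get(activity, DEFAULT_LIMIT)
    return elapsed > limit
```

For example, `is_stuck("Allocate storage", timedelta(minutes=6))` is true, while the same six minutes on "Wait for workloads ready" is still within its limit.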

Step 2 — Cancel if still in-flight

If the tenant is still in the Provisioning state, Cancel provisioning is the first action. Cancel runs the saga's compensations: whatever has been provisioned is unwound, and the tenant flips to the Failed state.

Why cancel first? It releases the resources cleanly. Skipping cancel and going straight to reset can leave orphaned resources (a half-created database, a Helm release that nobody owns) that need manual cleanup.

After cancel, the tenant is in Failed. The Failed state is recoverable — see step 3.

Step 3 — Reset if you can't move forward

If the tenant is in Failed, either because you cancelled or because it failed on its own, Reset provisioning clears the workflow's run state. After reset, the tenant remains in Failed, with a fresh saga ready to start.

Reset is a "give the saga a clean slate" operation. It doesn't reallocate resources; it just clears the workflow's run history so a retry doesn't trip over half-completed steps.

If reset itself fails, it's because cancel didn't fully complete: the platform refuses to reset a saga that is still running. This is rare. Wait 2 minutes, try cancel again, then reset.
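For a tenant that is still in Provisioning, the cancel, reset, wait, and repeat sequence can be sketched as follows. This is an illustration under stated assumptions: `cancel` and `reset` are hypothetical callables standing in for whatever triggers the console actions, and `reset` is assumed to return False when the platform refuses it.

```python
import time
from typing import Callable


def cancel_then_reset(
    cancel: Callable[[], None],
    reset: Callable[[], bool],
    wait_seconds: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Cancel, then reset; on a refused reset, wait, cancel again, reset once more."""
    cancel()
    if reset():
        return True
    # Reset was refused, so the cancel had probably not finished yet.
    sleep(wait_seconds)
    cancel()
    return reset()
```

If the second reset is also refused, the function returns False rather than looping, which is the point to move on to the escalation step.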

Step 4 — Retry

Retry provisioning starts the saga fresh against the same tenant id. The tenant transitions Failed → Provisioning. Watch the timeline.

If retry succeeds, the tenant flips to Active. Done.
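The operator actions in steps 2 to 4 can be read as a small state machine. The sketch below is a simplification for orientation only: the enum, the action strings, and `apply_action` are hypothetical names, and transitions the saga makes on its own (failing, or reaching Active) are left out.

```python
from enum import Enum


class TenantState(Enum):
    PROVISIONING = "Provisioning"
    FAILED = "Failed"
    ACTIVE = "Active"


# (current state, operator action) -> next state, as described in this runbook.
TRANSITIONS = {
    (TenantState.PROVISIONING, "cancel"): TenantState.FAILED,
    (TenantState.FAILED, "reset"): TenantState.FAILED,
    (TenantState.FAILED, "retry"): TenantState.PROVISIONING,
}


def apply_action(state: TenantState, action: str) -> TenantState:
    """Return the next state, or raise if the action is not valid in this state."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"{action!r} is not allowed in state {state.value}") from None
```

Note that `("Provisioning", "reset")` is deliberately absent from the table: resetting a saga that is still running is the case the platform refuses.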

If retry fails at the same step as the original, the cause isn't transient. Investigate:

The platform's underlying infrastructure (database, workflow engine, ingress controller).
A misconfiguration in the tenant's plan (e.g., requested a region that isn't available).
A platform-side limit (out of quota, etc.).

Don't loop retry → fail → retry → fail more than 2 or 3 times. Each attempt costs allocation churn. After the 3rd failure, decommissioning and recreating the tenant is cleaner.
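The retry guidance amounts to a capped loop with an early exit. The sketch below assumes a hypothetical `attempt` callable that runs one retry and returns None on success or the name of the step that failed; counting only retries (not the original failure) toward the cap of 3 is also an assumption.

```python
from typing import Callable, List, Optional, Tuple

MAX_ATTEMPTS = 3  # cap taken from this runbook


def retry_with_cap(
    original_failed_step: str,
    attempt: Callable[[], Optional[str]],
    max_attempts: int = MAX_ATTEMPTS,
) -> Tuple[str, List[str]]:
    """Retry up to max_attempts times and say what to do next."""
    failures: List[str] = []
    for _ in range(max_attempts):
        failed_step = attempt()
        if failed_step is None:
            return "active", failures
        failures.append(failed_step)
        if failed_step == original_failed_step:
            # Same step as the original failure: the cause is not transient.
            return "investigate", failures
    return "decommission_and_recreate", failures
```

The returned list of failed steps is useful later, both for escalation and for the post-incident notes.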

Step 5 — When the runbook above doesn't fix it

Some failure modes need platform-tier intervention:

Saga stuck in an internal workflow-engine state the platform can't surface ("workflow query timeout").
A resource the platform thinks it allocated but you can't find any trace of.
The audit log shows repeated provisioning runs for the same slug.

Escalate to platform engineering. Capture before escalating:

The tenant's id.
The slug.
The Provisioning timeline's last successful activity.
Any error message visible in the saga's error log.
Whether you've already tried cancel + reset + retry.

The escalation handler has access to deeper diagnostic tools (workflow-engine-level state inspection, workflow history replay) that the console doesn't surface.
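One way to make sure nothing on the capture list is forgotten is to fill in a small record before opening the escalation. This is only a sketch: the class and field names are hypothetical and do not correspond to any platform API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EscalationReport:
    tenant_id: str
    slug: str
    last_successful_activity: Optional[str]  # from the Provisioning timeline
    error_messages: List[str] = field(default_factory=list)  # from the saga's error log
    tried_cancel: bool = False
    tried_reset: bool = False
    tried_retry: bool = False

    def missing_fields(self) -> List[str]:
        """List the identifying fields that are still empty."""
        missing = []
        if not self.tenant_id:
            missing.append("tenant_id")
        if not self.slug:
            missing.append("slug")
        return missing
```

A report is ready to send when `missing_fields()` returns an empty list; the three `tried_*` flags answer the question of whether cancel, reset, and retry have already been attempted.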

Step 6 — Cleanup

After the tenant is finally Active (via retry or recreate):

Verify the tenant accepts auth traffic via its subdomain.
Have the tenant admin sign in to the tenant-admin console and confirm that the sign-in succeeds.
Check the audit log for tenant.provisioning.completed; confirm the timestamp is recent.
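The audit-log check can be expressed as a small function over exported events. This sketch assumes the events are available as (event name, timestamp) pairs, which is a hypothetical shape; the 15-minute window for "recent" is also an assumption, since this runbook does not define one.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Tuple

COMPLETED_EVENT = "tenant.provisioning.completed"


def completed_recently(
    events: Iterable[Tuple[str, datetime]],
    now: datetime,
    max_age: timedelta = timedelta(minutes=15),  # assumed meaning of "recent"
) -> bool:
    """True if the newest completion event is no older than max_age."""
    timestamps = [ts for name, ts in events if name == COMPLETED_EVENT]
    if not timestamps:
        return False
    return now - max(timestamps) <= max_age
```

Passing `now` explicitly, for example `datetime.now(timezone.utc)`, keeps the check easy to test and avoids mixing naive and timezone-aware timestamps.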

What NOT to do

Don't manually edit the tenant's row in the database. The platform's saga is the source of truth; manual edits create drift that the saga doesn't know about.
Don't decommission + recreate during an active incident without first cancelling. You can leave orphaned resources behind.
Don't retry more than 3 times. Loop detection is a humans-only safeguard; the platform won't stop you, but the pattern almost never works.
Don't bypass cancel and go straight to reset. Reset on a still-running saga is refused; if you force it via the API, you'll have orphaned in-flight activities.

After the incident

Document:

Which step stalled.
The cause, if you found it.
Whether the standard cancel → reset → retry worked.
Total recovery time.
Any user-facing impact (a tenant that is still being provisioned has no traffic, so the impact is usually operator time and frustration).

If a specific stall pattern recurs across tenants, escalate to platform engineering as a class issue, not a per-tenant one.