Skip to content

Retry a failed provisioning

When a provisioning workflow fails after retries, the tenant lands in Failed state. From there you have two paths: retry (try the workflow again) or decommission (release the slot and start over with a clean slug). This page covers retry; decommission is the destructive sibling and gets its own topic.

Retry makes sense when the failure was likely transient:

  • An upstream provider had a brief outage during the saga (the platform's storage layer, the workflow engine, a TLS cert provisioner). It's available again now.
  • The saga timed out waiting for readiness checks and you suspect the underlying issue is fixed.
  • A configuration value was wrong, you fixed it at the platform level, and you want to re-run.

Retry is NOT the right move when:

  • The same retry has already failed multiple times. Something structural is broken; chasing it with more retries won't help.
  • You don't know what caused the failure. Read the audit feed first — tenant.provisioning.failed events carry the failed_activity name and details. Diagnose before retrying.
  • The plan or slug is invalid. Retry won't fix a validation failure; you need to decommission and recreate with valid inputs.

From the tenant detail page (state = Failed), the Actions section shows a Retry provisioning button. Click it and the platform:

  • Resets the workflow state.
  • Starts the provisioning workflow fresh, picking up from where it failed (or from the beginning, depending on what's recoverable).
  • Flips the tenant's state back to Provisioning while the new attempt runs.

The detail page polls; the Provisioning timeline picks up new step events. If the retry succeeds, the tenant flips to Active and the Open Admin Console link appears. If it fails again, you're back in Failed state and can either retry once more, decommission, or escalate.

A retry emits:

  • tenant.provisioning.started — the new attempt is running (attempt_number incremented).
  • Then the normal sequence of step_completed events as the saga progresses.

Comparing the new attempt's events to the previous failed attempt's events is the cleanest way to spot whether the same step is failing again.

Two retries is usually the right cap. If the second attempt fails the same way, the cause is structural — keep retrying won't help. From there, the right moves are:

  • Decommission and recreate. Free the slot, start over with the same slug. Often clears whatever local state was confusing the saga.
  • Escalate to the platform team. If decommission + recreate also fails, the issue is platform-level. Surface to whoever operates IntelliAuth for you.

Most successful retries clear on the first attempt because most provisioning failures are transient (a network blip, a brief upstream outage). Persistent failures are usually configuration or platform-state issues that need diagnosis, not more retries.

  • Reset. If the workflow is stuck mid-saga rather than cleanly failed, Reset clears the workflow state without starting a fresh attempt. Useful when retry won't initiate because the previous workflow's lock is still held.
  • Decommission. Terminal release. Use when you've decided this tenant isn't worth saving — see Decommission.