Each phase of the provisioning workflow runs under two clocks:
- A timeout — how long the phase is allowed to take before the platform gives up and rolls back.
- A heartbeat interval — how often the phase reports "I'm still working" while it runs.
Understanding both helps you tell the difference between "a saga is taking a while" (still running, no action needed) and "a saga is genuinely stuck" (timeout will fire, you might want to act).
The per-phase timeout budget
Approximate budgets per phase, listed in the order the phases run:
| Phase | Typical duration | Timeout |
|---|---|---|
| Validate the request | < 1 second | 30 seconds |
| Allocate the tenant's resources | 5–30 seconds | 5 minutes |
| Deploy the tenant's workloads | 20–90 seconds | 10 minutes |
| Wait for readiness checks | 10–60 seconds | 10 minutes |
| Mark the tenant active | < 1 second | 30 seconds |
| Send the welcome emails | 2–10 seconds | 1 minute |
Operators sometimes see a tenant in the provisioning state for a couple of minutes; that's normal, because most of the time goes to workload deployment and readiness checks. A tenant that has been provisioning for more than 15 minutes is almost certainly stuck, not just slow.
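To see why a couple of minutes in provisioning is unremarkable, the figures in the table can simply be added up. The sketch below is illustrative only: the phase keys are hypothetical labels rather than platform identifiers, and "< 1 second" is treated as an upper bound of 1 second.

```python
# Budgets copied from the table above, all in seconds.
# Keys are hypothetical labels; "< 1 second" is entered as (0, 1).
PHASE_BUDGETS = {
    # phase: (typical_low_s, typical_high_s, timeout_s)
    "validate": (0, 1, 30),
    "allocate_resources": (5, 30, 300),
    "deploy_workloads": (20, 90, 600),
    "wait_for_readiness": (10, 60, 600),
    "mark_active": (0, 1, 30),
    "send_welcome_emails": (2, 10, 60),
}


def total_seconds(column: int) -> int:
    """Sum one column of the budget table across all phases."""
    return sum(row[column] for row in PHASE_BUDGETS.values())


typical_low = total_seconds(0)   # 37 seconds
typical_high = total_seconds(1)  # 192 seconds, a little over 3 minutes
timeout_sum = total_seconds(2)   # 1620 seconds, i.e. 27 minutes
```

Under these assumptions a typical run takes roughly 37 to 192 seconds end to end. The 27-minute figure is only an arithmetic ceiling for a run in which every phase comes close to exhausting its own budget, which is why a much shorter wait already suggests a problem.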
Heartbeats
While a long-running phase is in flight, it emits a heartbeat to the platform's workflow engine every 10–15 seconds. The heartbeat carries a small progress payload: which step it's currently on, how far through it is, and whether anything looks wrong.
What heartbeats give you:
- Stuck detection. If a phase stops heartbeating for more than 30 seconds, the workflow engine assumes the worker died and re-schedules the phase to a different worker. The saga resumes without waiting for the timeout.
- Progress in the console. The "Polling..." indicator next to the Provisioning timeline panel reflects whether the page is getting heartbeat-derived progress updates. If it stops, the saga is in a quiet phase (Validate, MarkActive) or actually stalled.
You don't tune heartbeats from the console; they're internal to the workflow engine.
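The stuck-detection rule can be pictured as a simple gap check. This is a minimal sketch of the idea, not the workflow engine's actual code: `PhaseRun` and `should_reschedule` are hypothetical names, and only the 30-second threshold comes from the description above.

```python
from dataclasses import dataclass

# From the text: silence longer than this makes the engine assume the worker died.
HEARTBEAT_GRACE_S = 30.0


@dataclass
class PhaseRun:
    phase: str
    last_heartbeat_at: float  # seconds on a monotonic clock


def should_reschedule(run: PhaseRun, now: float) -> bool:
    """Return True when the heartbeat gap exceeds the grace period."""
    return (now - run.last_heartbeat_at) > HEARTBEAT_GRACE_S
```

With heartbeats arriving every 10–15 seconds, a 30-second grace period tolerates one missed heartbeat before the phase is handed to another worker.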
When a timeout fires
A phase that exhausts its timeout doesn't silently fail. The workflow engine:
- Cancels the in-flight activity.
- Marks the activity as failed in the workflow state.
- Emits a `tenant.provisioning.failed` audit event with `failed_activity` naming the phase.
- Runs the compensation chain in reverse to roll back what's been done.
- Lands the tenant in `failed` state.
From the operator's perspective: the timeline stops advancing; after a few seconds you see the rollback events appear in reverse order; eventually the state badge flips to failed.
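The sequence above follows the usual saga shape: each completed phase registers an undo step, and a failure replays those undo steps newest-first. The sketch below illustrates that shape under stated assumptions. `run_saga`, the phase names, and the `rolled_back` event are hypothetical; only the `tenant.provisioning.failed` event and its `failed_activity` field come from the text, and a raised `TimeoutError` stands in for an exhausted budget.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None], Callable[[], None]]  # (name, action, compensation)


def run_saga(steps: List[Step], events: list) -> str:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            # Record which phase failed, then unwind what already succeeded.
            events.append(("tenant.provisioning.failed", {"failed_activity": name}))
            for done_name, undo in reversed(completed):
                undo()
                events.append(("rolled_back", done_name))  # hypothetical event name
            return "failed"
        completed.append((name, compensate))
    return "active"


def _exceeds_budget() -> None:
    raise TimeoutError("phase exceeded its timeout budget")


events: list = []
state = run_saga(
    [
        ("allocate", lambda: None, lambda: None),
        ("deploy", lambda: None, lambda: None),
        ("readiness", _exceeds_budget, lambda: None),
    ],
    events,
)
# state is "failed"; events list the failure first, then "deploy" and "allocate" rolled back.
```

This matches what the timeline shows: the failure event first, then rollback entries in the opposite order to the original steps.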
When to act on a slow saga
| Time since "step_completed" | What's likely going on | Action |
|---|---|---|
| 0–60 seconds | Phase is still running. Heartbeats may not yet be visible. | Wait. |
| 60–180 seconds | A long phase (workloads, readiness) is in progress. | Wait, but consider checking external dashboards if you have them. |
| 180 seconds – timeout budget | Phase is taking longer than typical. Might recover, might not. | Don't intervene yet — interrupting can complicate cleanup. |
| Past the timeout budget | The phase should have failed by now. | If the state is still provisioning, escalate. See Recover a stuck provisioning saga. |
The cleanest signal that a saga is truly stuck is that the timeout budget has elapsed without a state transition. Heartbeat-driven re-scheduling handles transient worker issues; a saga still hanging past the timeout means something structural is wrong.
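For anyone scripting a check against the timeline, the table can be expressed as a small decision function. This is a sketch rather than a platform API: `triage` and its return strings are hypothetical, and the row boundaries (which overlap in the table) are treated as half-open intervals.

```python
def triage(elapsed_s: float, timeout_budget_s: float, state: str = "provisioning") -> str:
    """Map time since the last step_completed event to the table's suggested action."""
    if elapsed_s > timeout_budget_s:
        # Past the budget the phase should already have failed.
        return "escalate" if state == "provisioning" else "no action"
    if elapsed_s < 60:
        return "wait"
    if elapsed_s < 180:
        return "wait, check external dashboards"
    return "do not intervene yet"
```

The timeout check comes first so that short-budget phases (30 seconds for validation, for example) escalate correctly even when less than a minute has passed.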
What times don't count
Two things don't accumulate against a phase's timeout:
- Workflow retries. If the workflow itself crashes mid-saga and restarts, the new attempt gets a fresh per-phase clock.
- Time spent in queue before a worker picks up the activity. Queue time is bounded by the workflow engine's own SLOs, not the per-phase timeout.
One thing does count: external wait time (e.g. waiting for DNS propagation or for an upstream provider) is part of the phase's timeout, not separate.
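One way to picture the accounting is a clock that starts when a worker picks up the activity, not when the activity is queued. The sketch below is an interpretation of the rules above, with hypothetical names (`ActivityAttempt`, `budget_used`, `timed_out`); it is not the engine's implementation.

```python
from dataclasses import dataclass


@dataclass
class ActivityAttempt:
    enqueued_at: float   # when the activity entered the queue
    picked_up_at: float  # when a worker started it; the phase clock starts here
    timeout_s: float     # per-phase timeout budget

    def budget_used(self, now: float) -> float:
        """Seconds charged to the timeout: queue time is excluded,
        but any external waiting after pickup is included."""
        return now - self.picked_up_at

    def timed_out(self, now: float) -> bool:
        return self.budget_used(now) > self.timeout_s


# 40 seconds in the queue, then 20 seconds of work against a 30-second budget.
attempt = ActivityAttempt(enqueued_at=0.0, picked_up_at=40.0, timeout_s=30.0)
still_within_budget = not attempt.timed_out(now=60.0)  # True: only 20 s charged
```

A restarted workflow would create a new attempt with its own `picked_up_at`, which is how the fresh per-phase clock falls out of this model.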
What you can change
You can't tune phase-level timeouts from the control plane console; they're set by the platform team. If your tenants consistently provision on the timeout edge (close to the budget but completing), surface that to your IntelliAuth contact; it usually points to an under-provisioned upstream that needs scaling.