Provisioning timeouts + heartbeats

Each phase of the provisioning workflow runs under two clocks:

A timeout — how long the phase is allowed to take before the platform gives up and rolls back.
A heartbeat interval — how often the phase reports "I'm still working" while it runs.

Understanding both helps you tell the difference between "a saga is taking a while" (still running, no action needed) and "a saga is genuinely stuck" (timeout will fire, you might want to act). A saga here means one run of the provisioning workflow.

The per-phase timeout budget

Approximate budgets per phase, listed in the order the phases run:

Phase	Typical duration	Timeout
Validate the request	< 1 second	30 seconds
Allocate the tenant's resources	5–30 seconds	5 minutes
Deploy the tenant's workloads	20–90 seconds	10 minutes
Wait for readiness checks	10–60 seconds	10 minutes
Mark the tenant active	< 1 second	30 seconds
Send the welcome emails	2–10 seconds	1 minute

Operators sometimes see a tenant in the provisioning state for a couple of minutes. That's normal: most of the time is spent in workload deployment and readiness checks. A tenant that has been in provisioning for more than 15 minutes is almost certainly stuck, not just slow.
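Summing the table shows why a couple of minutes is normal. The following is a minimal sketch: the phase keys and the PHASE_BUDGETS name are illustrative, not platform identifiers, and "< 1 second" is rounded up to 1 second.

```python
# Illustrative only: the keys are hypothetical names for the rows in the table above.
# Each value is (typical_max_seconds, timeout_seconds).
PHASE_BUDGETS = {
    "validate": (1, 30),
    "allocate": (30, 5 * 60),
    "deploy": (90, 10 * 60),
    "readiness": (60, 10 * 60),
    "mark_active": (1, 30),
    "welcome_emails": (10, 60),
}

# Upper end of the typical durations, summed across all phases.
typical_max = sum(typical for typical, _ in PHASE_BUDGETS.values())
# Sum of every phase timeout: the theoretical ceiling if each phase ran to its limit.
worst_case = sum(timeout for _, timeout in PHASE_BUDGETS.values())
# Longest any single phase may run before its timeout fires.
longest_phase = max(timeout for _, timeout in PHASE_BUDGETS.values())

print(f"typical upper bound: {typical_max} s")         # 192 s
print(f"sum of all timeouts: {worst_case // 60} min")  # 27 min
print(f"longest single phase: {longest_phase // 60} min")  # 10 min
```

With these figures a healthy saga finishes in a little over three minutes at the slow end, and no single phase can run longer than 10 minutes before its timeout fires.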

Heartbeats

While a long-running phase is in flight, it emits a heartbeat to the platform's workflow engine every 10–15 seconds. The heartbeat carries a small progress payload — what step it's currently on, how far through, whether anything looks wrong.

What heartbeats give you:

Stuck detection. If a phase stops heartbeating for more than 30 seconds, the workflow engine assumes the worker died and re-schedules the phase to a different worker. The saga resumes without waiting for the timeout.
Progress in the console. The "Polling..." indicator next to the Provisioning timeline panel reflects whether the page is getting heartbeat-derived progress updates. If it stops, the saga is in a quiet phase (Validate, MarkActive) or actually stalled.

You don't tune heartbeats from the console; they're internal to the workflow engine.
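The stuck-detection rule can be expressed as a simple staleness check. This is a sketch of the idea only: the 30-second threshold comes from the text above, while the function and constant names are hypothetical and say nothing about how the workflow engine implements it.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the text; the names are illustrative.
HEARTBEAT_GRACE = timedelta(seconds=30)

def worker_presumed_dead(last_heartbeat: datetime, now: datetime) -> bool:
    """Return True once the gap since the last heartbeat exceeds the grace period."""
    return now - last_heartbeat > HEARTBEAT_GRACE

now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
# A 12-second gap is within the normal 10-15 second cadence.
print(worker_presumed_dead(now - timedelta(seconds=12), now))
# A 45-second gap means heartbeats were missed, so the phase would be re-scheduled.
print(worker_presumed_dead(now - timedelta(seconds=45), now))
```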

When a timeout fires

A phase that exhausts its timeout doesn't silently fail. The workflow engine:

Cancels the in-flight activity.
Marks the activity as failed in the workflow state.
Emits a tenant.provisioning.failed audit event with failed_activity naming the phase.
Runs the compensation chain in reverse to roll back what's been done.
Lands the tenant in failed state.

From the operator's perspective: the timeline stops advancing; after a few seconds you see the rollback events appear in reverse order; eventually the state badge flips to failed.
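The sequence above can be sketched as a small saga runner. This is a minimal illustration, not the workflow engine's implementation: run_saga, the step tuples, and the log strings are hypothetical, and a raised TimeoutError stands in for the engine cancelling the in-flight activity.

```python
from typing import Callable, List, Tuple

# Hypothetical shape for a step: (name, action, compensation).
Step = Tuple[str, Callable[[], None], Callable[[], None]]

def run_saga(steps: List[Step], log: List[str]) -> str:
    """Run steps in order; on a timeout, roll back completed steps newest-first."""
    completed: List[Step] = []
    for name, action, compensate in steps:
        try:
            action()
        except TimeoutError:
            log.append(f"activity_failed:{name}")
            log.append(f"audit:tenant.provisioning.failed failed_activity={name}")
            # Compensation chain runs in reverse order of completion.
            for done_name, _, done_compensate in reversed(completed):
                done_compensate()
                log.append(f"rolled_back:{done_name}")
            return "failed"
        completed.append((name, action, compensate))
        log.append(f"step_completed:{name}")
    return "active"

def ok() -> None:
    pass

def times_out() -> None:
    raise TimeoutError("phase exceeded its budget")

events: List[str] = []
state = run_saga(
    [("allocate", ok, ok), ("deploy", ok, ok), ("readiness", times_out, ok)],
    events,
)
print(state)
for event in events:
    print(event)
```

In this run the readiness step times out, so the log shows deploy rolled back before allocate, matching the reverse-order rollback an operator sees in the timeline.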

When to act on a slow saga

Time since "step_completed"	What's likely going on	Action
0–60 seconds	Phase is still running. Heartbeats may not yet be visible.	Wait.
60–180 seconds	A long phase (workloads, readiness) is in progress.	Wait, but consider checking external dashboards if you have them.
180 seconds – timeout budget	Phase is taking longer than typical. Might recover, might not.	Don't intervene yet — interrupting can complicate cleanup.
Past the timeout budget	Phase should have failed by now.	If state is still `provisioning`, escalate. See Recover a stuck provisioning saga.

The cleanest signal that a saga is truly stuck is the timeout budget elapsing without a state transition. Heartbeat-driven re-scheduling handles transient worker issues; a saga still hanging past the timeout means something structural is wrong.
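The triage table can be written as a small decision function. This is a sketch under stated assumptions: the function name and return strings are illustrative, the tenant is assumed to still be in the provisioning state, and the boundaries at exactly 60 and 180 seconds are a choice made here because the table leaves them open.

```python
def triage(seconds_since_step_completed: float, phase_timeout_seconds: float) -> str:
    """Map elapsed time to the guidance in the table above.

    Assumes the tenant state is still 'provisioning'.
    """
    # Check the budget first: short phases time out well before 60 seconds.
    if seconds_since_step_completed > phase_timeout_seconds:
        return "escalate"
    if seconds_since_step_completed < 60:
        return "wait"
    if seconds_since_step_completed < 180:
        return "wait, check external dashboards"
    return "do not intervene yet"

# Workload deployment has a 10-minute (600 s) timeout in the budget table.
print(triage(30, 600))
print(triage(120, 600))
print(triage(400, 600))
print(triage(601, 600))
```

The timeout comparison comes first so that a short phase, such as one with a 30-second budget, is escalated as soon as it overruns rather than being treated as "still running".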

What time doesn't count

Two things don't accumulate against any phase's timeout:

Workflow retries. If the workflow itself crashes mid-saga and restarts, the new attempt gets a fresh per-phase clock.
Queue time. Time spent in the queue before a worker picks up the activity is bounded by the workflow engine's own SLOs, not the per-phase timeout.

External wait time does count. Waiting for DNS propagation or for an upstream provider is part of the phase's timeout, not separate from it.
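One way to picture the accounting is a clock that starts when a worker picks up the activity and restarts on each new attempt. This is a sketch only: the Attempt type and its field names are hypothetical, and times are plain seconds on a shared clock.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    # Hypothetical fields, in seconds on a shared clock.
    enqueued_at: float  # activity placed on the queue
    started_at: float   # worker picked it up; the phase clock starts here

def seconds_charged(attempt: Attempt, now: float) -> float:
    """Time counted against the phase timeout for this attempt.

    Queue time (started_at - enqueued_at) is excluded. External waits such as
    DNS propagation happen after started_at, so they are included.
    """
    return now - attempt.started_at

# First attempt: 20 s in the queue, then 120 s of work including external waits.
first = Attempt(enqueued_at=0, started_at=20)
print(seconds_charged(first, now=140))

# After a workflow restart, the new attempt is charged from its own start time.
retry = Attempt(enqueued_at=150, started_at=155)
print(seconds_charged(retry, now=200))
```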

What you can change

You can't tune phase-level timeouts from the control plane console — they're set by the platform team. If your tenants consistently provision on the timeout edge (close to the budget but completing), surface that to your IntelliAuth contact; it usually points to an under-provisioned upstream that needs scaling.