
Provisioning timeouts + heartbeats

Each phase of the provisioning workflow runs under two clocks:

  • A timeout — how long the phase is allowed to take before the platform gives up and rolls back.
  • A heartbeat interval — how often the phase reports "I'm still working" while it runs.

Understanding both helps you tell the difference between "a saga is taking a while" (still running, no action needed) and "a saga is genuinely stuck" (timeout will fire, you might want to act).

Approximate budgets per phase, listed in the order the phases run:

Phase                           | Typical duration | Timeout
Validate the request            | < 1 second       | 30 seconds
Allocate the tenant's resources | 5–30 seconds     | 5 minutes
Deploy the tenant's workloads   | 20–90 seconds    | 10 minutes
Wait for readiness checks       | 10–60 seconds    | 10 minutes
Mark the tenant active          | < 1 second       | 30 seconds
Send the welcome emails         | 2–10 seconds     | 1 minute

Operators sometimes see a tenant in provisioning state for a couple of minutes. That's normal: most of the time is spent in workload deployment and readiness checks. A tenant that stays in provisioning for more than 15 minutes is almost certainly stuck, not just slow.
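To make the arithmetic behind these budgets concrete, here is a minimal sketch that encodes the table as data and sums it. The phase identifiers are hypothetical names chosen for illustration; the durations and timeouts are copied from the table, with "< 1 second" rounded up to 1 second.

```python
from datetime import timedelta

# (phase, upper end of typical duration, timeout) -- values from the table above.
# The phase identifiers are illustrative, not the platform's real names.
PHASE_BUDGETS = [
    ("validate", timedelta(seconds=1), timedelta(seconds=30)),
    ("allocate_resources", timedelta(seconds=30), timedelta(minutes=5)),
    ("deploy_workloads", timedelta(seconds=90), timedelta(minutes=10)),
    ("wait_for_readiness", timedelta(seconds=60), timedelta(minutes=10)),
    ("mark_active", timedelta(seconds=1), timedelta(seconds=30)),
    ("send_welcome_emails", timedelta(seconds=10), timedelta(minutes=1)),
]


def total(column: int) -> timedelta:
    """Sum one column of PHASE_BUDGETS across all phases."""
    return sum((row[column] for row in PHASE_BUDGETS), timedelta())


typical_upper = total(1)  # every phase at the slow end of "typical"
worst_case = total(2)     # every phase running right up to its timeout

print(typical_upper)  # 0:03:12
print(worst_case)     # 0:27:00
```

With these figures, a saga where every phase runs at the slow end of typical finishes in a little over three minutes, while the sum of all timeouts is 27 minutes. The 15-minute figure is therefore a rule of thumb for "far outside typical", not a hard upper bound.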

While a long-running phase is in flight, it emits a heartbeat to the platform's workflow engine every 10–15 seconds. The heartbeat carries a small progress payload — what step it's currently on, how far through, whether anything looks wrong.

What heartbeats give you:

  • Stuck detection. If a phase stops heartbeating for more than 30 seconds, the workflow engine assumes the worker died and re-schedules the phase to a different worker. The saga resumes without waiting for the timeout.
  • Progress in the console. The "Polling..." indicator next to the Provisioning timeline panel reflects whether the page is getting heartbeat-derived progress updates. If it stops, the saga is in a quiet phase (Validate, MarkActive) or actually stalled.
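The stuck-detection rule can be sketched as a small decision function. This is an illustration only, not the workflow engine's implementation: the `PhaseStatus` type, its field names, and the choice to check the timeout before the heartbeat gap are all assumptions. Only the 30-second threshold comes from the text above.

```python
from dataclasses import dataclass

# From the text: more than 30 seconds of heartbeat silence triggers re-scheduling.
HEARTBEAT_GRACE_SECONDS = 30


@dataclass
class PhaseStatus:
    """Hypothetical snapshot of a running phase (times in seconds)."""
    phase: str
    started_at: float
    last_heartbeat_at: float
    timeout_seconds: float


def next_action(status: PhaseStatus, now: float) -> str:
    """Return 'timeout', 'reschedule', or 'wait' for a running phase."""
    # Assumption: an exhausted timeout takes priority over a missed heartbeat.
    if now - status.started_at > status.timeout_seconds:
        return "timeout"
    # Worker presumed dead: hand the phase to a different worker.
    if now - status.last_heartbeat_at > HEARTBEAT_GRACE_SECONDS:
        return "reschedule"
    return "wait"


status = PhaseStatus("deploy_workloads", started_at=0,
                     last_heartbeat_at=100, timeout_seconds=600)
print(next_action(status, now=110))  # wait
print(next_action(status, now=131))  # reschedule
print(next_action(status, now=601))  # timeout
```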

You don't tune heartbeats from the console; they're internal to the workflow engine.

A phase that exhausts its timeout doesn't silently fail. The workflow engine:

  1. Cancels the in-flight activity.
  2. Marks the activity as failed in the workflow state.
  3. Emits a tenant.provisioning.failed audit event with failed_activity naming the phase.
  4. Runs the compensation chain in reverse to roll back what's been done.
  5. Lands the tenant in failed state.

From the operator's perspective: the timeline stops advancing; after a few seconds you see the rollback events appear in reverse order; eventually the state badge flips to failed.
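The five steps above can be sketched as follows. The function, the `state` dictionary, and the compensation callables are hypothetical stand-ins; only the `tenant.provisioning.failed` event name, its `failed_activity` field, and the reverse-order rollback come from the description.

```python
from typing import Callable


def handle_phase_timeout(
    failed_phase: str,
    completed: list[str],
    compensations: dict[str, Callable[[], None]],
    state: dict,
    audit_events: list[dict],
) -> dict:
    """Illustrative model of what happens when a phase exhausts its timeout."""
    # Steps 1-2: cancel the in-flight activity and record it as failed.
    state["activities"][failed_phase] = "failed"
    # Step 3: emit the audit event naming the phase that timed out.
    audit_events.append(
        {"type": "tenant.provisioning.failed", "failed_activity": failed_phase}
    )
    # Step 4: undo completed phases, most recent first.
    for phase in reversed(completed):
        compensations[phase]()
    # Step 5: the tenant lands in the failed state.
    state["tenant"] = "failed"
    return state


# Example: the deploy phase times out after two phases have completed.
undone: list[str] = []
completed = ["validate", "allocate_resources"]
compensations = {p: (lambda p=p: undone.append(p)) for p in completed}
state = {"tenant": "provisioning", "activities": {}}
audit_events: list[dict] = []

handle_phase_timeout("deploy_workloads", completed, compensations, state, audit_events)
print(undone)           # ['allocate_resources', 'validate']
print(state["tenant"])  # failed
```

The reverse iteration is what produces the rollback events in reverse order on the timeline.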

Time since "step_completed"  | What's likely going on | Action
0–60 seconds                 | Phase is still running. Heartbeats may not yet be visible. | Wait.
60–180 seconds               | A long phase (workloads, readiness) is in progress. | Wait, but consider checking external dashboards if you have them.
180 seconds – timeout budget | Phase is taking longer than typical. Might recover, might not. | Don't intervene yet; interrupting can complicate cleanup.
Past the timeout budget      | Should have failed by now. If state is still provisioning, escalate. | See Recover a stuck provisioning saga.
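If you script your own monitoring, the triage table can be expressed as a lookup. This is a sketch under assumptions: the function name and the returned strings are illustrative, and checking the timeout first is a design choice so that short phases (for example, a 30-second budget) escalate correctly even inside the first minute.

```python
def triage(seconds_since_step_completed: float, timeout_budget_seconds: float) -> str:
    """Map elapsed time since the last step_completed event to a suggested action."""
    elapsed = seconds_since_step_completed
    # Past the budget the phase should already have failed, so escalate.
    if elapsed > timeout_budget_seconds:
        return "escalate"
    if elapsed < 60:
        return "wait"
    if elapsed < 180:
        return "wait; check external dashboards"
    return "wait; do not intervene"


print(triage(30, 600))   # wait
print(triage(120, 600))  # wait; check external dashboards
print(triage(300, 600))  # wait; do not intervene
print(triage(45, 30))    # escalate
```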

The cleanest signal that a saga is truly stuck is the timeout budget elapsing without a state transition. Heartbeat-driven re-scheduling handles transient worker issues; a saga still hanging past the timeout means something structural is wrong.

Two things don't count against a phase's timeout, and one thing that might look exempt does:

  • Workflow retries don't count. If the workflow itself crashes mid-saga and restarts, the new attempt gets a fresh per-phase clock.
  • Queue time doesn't count. Time spent in queue before a worker picks up the activity is bounded by the workflow engine's own SLOs, not the per-phase timeout.
  • External wait time does count. Waiting for DNS propagation or for an upstream provider is part of the phase's timeout, not separate from it.
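A small worked example of the clock rules: queue time is excluded because the clock starts when a worker picks up the activity, while external waits fall inside the running phase and are therefore charged. The `ActivityTimes` type and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ActivityTimes:
    """Hypothetical timestamps for one phase attempt (seconds)."""
    scheduled_at: float  # when the activity entered the queue
    picked_up_at: float  # when a worker started running it


def seconds_charged(times: ActivityTimes, now: float) -> float:
    """Time counted against the phase timeout: from worker pickup to now.

    Queue time (scheduled_at to picked_up_at) is not included. External waits
    such as DNS propagation happen after pickup, so they are included.
    """
    return max(0.0, now - times.picked_up_at)


# Queued for 20 s, then running for 120 s (60 s of which was waiting on DNS).
attempt = ActivityTimes(scheduled_at=0, picked_up_at=20)
print(seconds_charged(attempt, now=140))  # 120
```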

You can't tune phase-level timeouts from the control plane console — they're set by the platform team. If your tenants consistently provision on the timeout edge (close to the budget but completing), surface that to your IntelliAuth contact; it usually points to an under-provisioned upstream that needs scaling.