Webhook delivery failures

Problem

The subscription's delivery feed shows a high failure rate, or events are landing in the DLQ, or your downstream system is out of sync with the tenant.

Diagnostic tree

Walk this in order. Each branch is a different root cause.

1. What status is the platform recording?

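Open the subscription's delivery feed in the tenant admin console. Each failed attempt shows either the HTTP status returned by your receiver or, when no HTTP response arrived, a transport error such as "connection timeout" or "TLS handshake failed". The next steps depend on what's there.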

2. If you see "connection timeout"

The platform never got a response within 10 seconds.

Receiver is slow. Check your receiver's response time. The fix is to return 2xx fast and process asynchronously; don't do the work inline in the request handler.
Receiver is offline. Confirm the URL resolves and accepts connections from the public internet.
Network in the middle is dropping packets. Less common, and usually transient.
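
The "return 2xx fast, process async" pattern can be sketched as follows. This is a minimal illustration, not a production receiver: handle_webhook, process_event, and the in-process queue are hypothetical names, and a real deployment would normally hand off to a durable queue rather than process memory.

```python
import queue
import threading

# In-process queue for illustration only; events here are lost on restart.
work_queue = queue.Queue()
processed = []


def handle_webhook(raw_body: bytes) -> int:
    """Accept the delivery and return at once; no business logic runs here."""
    work_queue.put(raw_body)
    return 200


def process_event(raw_body: bytes) -> None:
    """Placeholder for the slow work (database writes, outbound API calls)."""
    processed.append(raw_body)


def worker() -> None:
    """Drain the queue in the background, off the request path."""
    while True:
        raw_body = work_queue.get()
        try:
            process_event(raw_body)
        finally:
            work_queue.task_done()


threading.Thread(target=worker, daemon=True).start()
```

The request handler only enqueues, so its response time stays well under the 10-second limit regardless of how long process_event takes.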

3. If you see "TLS handshake failed"

Certificate expired. Run openssl s_client -connect your-receiver:443 -servername your-receiver and look at the verification result in the output. The platform refuses to deliver to an expired cert.
Certificate self-signed. The platform requires a CA-signed cert. Let's Encrypt is free.
Hostname mismatch. The cert's CN / SAN doesn't include the hostname the subscription targets.
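
The three certificate checks above can be approximated from any machine with Python installed. The sketch below assumes Python's default trust store behaves roughly like the platform's, which may not hold exactly; check_tls and days_until_expiry are hypothetical helper names.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter timestamp (negative if past)."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / 86400


def check_tls(host, port=443, timeout=10.0):
    """Attempt a verified TLS handshake and summarize the outcome."""
    # The default context verifies the chain, the validity dates, and the hostname.
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
    except ssl.SSLCertVerificationError as exc:
        # Covers expired, self-signed, and hostname-mismatch certificates.
        return f"verification failed: {exc.verify_message}"
    except OSError as exc:
        # DNS failures, refused connections, timeouts, other TLS errors.
        return f"connection failed: {exc}"
    days = days_until_expiry(cert["notAfter"])
    return f"ok, certificate expires in {days:.0f} days"
```

Calling check_tls with the hostname from the subscription URL returns a one-line summary; a "verification failed" result names which of the three certificate problems applies.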

4. If you see 401 / 403

The platform got through TCP and TLS; your receiver explicitly rejected the request.

Signature verification failing on your side. See signature verification. The most common cause is parsing the body before HMAC verification; make sure you're hashing the raw bytes exactly as received.
Your receiver expects auth credentials the platform isn't sending. Webhooks don't carry bearer tokens; the only auth is the HMAC signature. If your receiver checks for Authorization, remove that check for the webhook route.
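
To illustrate why raw bytes matter, here is a minimal sketch. It assumes a hex-encoded HMAC-SHA256 over the request body; the platform's actual algorithm, encoding, and header name are described in the signature verification documentation and may differ. The sign and verify_signature helpers are hypothetical.

```python
import hashlib
import hmac
import json


def sign(secret: bytes, raw_body: bytes) -> str:
    """Hex HMAC-SHA256 of the body (assumed scheme, for illustration)."""
    return hmac.new(secret, raw_body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, raw_body: bytes, received: str) -> bool:
    """Constant-time comparison against the signature sent with the request."""
    expected = sign(secret, raw_body)
    return hmac.compare_digest(expected.encode(), received.encode())


secret = b"example-shared-secret"
raw_body = b'{"event_id":"evt_1","type":"user.updated"}'
signature = sign(secret, raw_body)

# Parsing and re-serializing changes whitespace, so the bytes no longer match.
reserialized = json.dumps(json.loads(raw_body)).encode()

raw_ok = verify_signature(secret, raw_body, signature)
reserialized_ok = verify_signature(secret, reserialized, signature)
```

Here raw_ok is True and reserialized_ok is False, even though both byte strings decode to the same JSON object. Frameworks that parse the body automatically usually offer a way to read the unparsed bytes; use that for verification.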

5. If you see 404

URL is wrong. Check the subscription's URL field; check that your receiver actually has a route at that path.
Receiver framework strips trailing slashes (or adds them). The subscription URL has to match exactly. Pick one and stick to it.

6. If you see 5xx

Your receiver is failing. The platform retries on 5xx, so eventual delivery is likely, but a high 5xx rate is a sign of receiver instability.

Check your receiver's logs. The exact 5xx status and the error logged alongside it will tell you why.
Common cause: a code path you didn't anticipate. A new event type with a payload shape your code wasn't ready for; treat unknown event types as "log and 2xx" rather than throwing.
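
A "log and 2xx" dispatcher might look like the sketch below. The type and payload field names, the handler table, and the dispatch function are assumptions for illustration; adapt them to the actual event envelope.

```python
import logging

logger = logging.getLogger("webhook_receiver")
signed_up_users = []


def on_user_signed_up(payload: dict) -> None:
    """Example handler for an event type the receiver knows about."""
    signed_up_users.append(payload)


# Map of known event types to handlers; anything else is logged and acknowledged.
HANDLERS = {"user.signed_up": on_user_signed_up}


def dispatch(event: dict) -> int:
    """Route an event to its handler and return the HTTP status to send."""
    event_type = event.get("type")
    handler = HANDLERS.get(event_type)
    if handler is None:
        # Unknown type: record it for follow-up, but do not fail the delivery.
        logger.warning("ignoring unknown event type %r", event_type)
        return 200
    handler(event.get("payload", {}))
    return 200
```

With this shape, a newly introduced event type shows up as a warning in the logs instead of a burst of 5xx responses and retries.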

7. If you see deliveries succeed (2xx) but downstream is out of sync

The platform delivered and your receiver accepted, but your processing logic failed silently.

Async processing dropped an event. If your receiver returns 2xx then enqueues, and the enqueue fails, the platform thinks the delivery succeeded. Enqueue first, and return 2xx only once the enqueue has succeeded.
Idempotency key collision. Two events with the same business meaning but different event_id were treated as duplicates.
Replica lag. The event was processed but your read replica hasn't caught up to your write primary yet.

The fix here is in your code, not in the subscription configuration.
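
For the dropped-event case, one possible fix is to acknowledge only after the enqueue succeeds and to return a 5xx otherwise, so the platform's retry behavior covers the gap. In this sketch, accept_delivery is a hypothetical name, and a bounded in-memory queue stands in for whatever queue client you use, which will raise its own error types.

```python
import queue


def accept_delivery(raw_body: bytes, work_queue: queue.Queue) -> int:
    """Return 200 only if the event was actually enqueued."""
    try:
        work_queue.put_nowait(raw_body)
    except queue.Full:
        # Enqueue failed: answer 5xx so the platform retries the delivery.
        return 503
    return 200
```

A failed enqueue now appears in the delivery feed as a retried 5xx rather than as a success that silently went nowhere.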

When everything looks healthy but events still aren't arriving

Two non-obvious causes:

The subscription filter is too narrow. Check the events field on the subscription. If you only subscribed to user.signed_up, you won't get user.updated.
The events aren't being emitted at all. Some event types (e.g., security.brute_force_detected) only fire when their conditions are met. Test by sending the synthetic test event from the console.

Resetting after a long outage

If your receiver was down for hours, dozens of events may be in the DLQ. Two options:

Replay everything. From the DLQ view, replay each entry. Filter and bulk-replay where useful.
Skip the missed window and resync from authoritative state. Read user records, application configs, etc. directly from the API; use the audit log to fill in events you care about. The platform will keep DLQ entries for 7 days, so you have time to decide.

For mission-critical event consumption, make processing idempotent so that re-running the past hour of events is safe; that makes "replay everything" the safe default.
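
A minimal sketch of replay-safe processing follows. It deduplicates on event_id rather than on business fields, which also avoids the idempotency key collision described earlier. IdempotentProcessor is a hypothetical class, and the in-memory set stands in for a durable store.

```python
class IdempotentProcessor:
    """Apply each event at most once, keyed by its event_id."""

    def __init__(self, apply):
        self._apply = apply
        # In-memory for illustration; a real receiver would persist this.
        self._seen = set()

    def process(self, event: dict) -> bool:
        """Return True if the event was applied, False if it was a replay."""
        event_id = event["event_id"]
        if event_id in self._seen:
            return False
        self._apply(event)
        # Mark as seen only after a successful apply, so failures can be retried.
        self._seen.add(event_id)
        return True


applied = []
processor = IdempotentProcessor(applied.append)

first = {"event_id": "evt_1", "type": "user.updated", "user": "u_42"}
second = {"event_id": "evt_2", "type": "user.updated", "user": "u_42"}

results = [
    processor.process(first),
    processor.process(second),  # same business meaning, different event_id
    processor.process(first),   # replayed from the DLQ
]
```

Here results is [True, True, False]: both distinct events are applied, and the replayed one is skipped.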