Webhooks are the second-most-debugged integration surface in any payment platform (behind only "why didn't my payout settle?"). They look simple — POST a JSON body to a customer-supplied URL when something happens — but the production reality is full of nasty edge cases.
This is a battle-tested checklist for shipping a webhook system that survives real customer integrations.
Why webhooks are hard
The basic problem: you're calling code you don't control, on infrastructure you don't control, on someone else's network. Every assumption has to be defensive.
Things that will go wrong:
- The customer's endpoint will be down for 4 hours during a deploy
- The customer's endpoint will return 200 but lose the message
- The customer's endpoint will respond too slowly, your client times out, you don't know if they got it
- The customer's endpoint will return 503 then 200 to a retry, and the customer will process the same event twice
- The customer's endpoint will be replaced and the new one will reject your old signing secret
- The customer's WAF will silently drop your IP
- The customer's TLS will expire on a Saturday
A naïve "fire and forget" webhook implementation gives up on most of these. A production-grade one handles them.
The non-negotiable features
Every payment-grade webhook system needs:
- At-least-once delivery with explicit retry policy
- HMAC signing of every body
- Stable event IDs so receivers can dedup
- Dead-letter queue for permanently-failed deliveries
- Replay UI for manual recovery
- Per-endpoint health metrics the customer can see
- IP allowlist documentation that's actually accurate
- Sane timeouts (we use 5s, others use 10–30s)
- Webhook secret rotation without downtime
- Versioned event schemas so you can evolve
If your design is missing any of these, you're going to have a bad time.
Retry policy
This is the most common place to get it wrong. Two failure modes:
- Too aggressive. You retry every 5 seconds for 24 hours. The customer's endpoint comes back online and gets hammered by a thundering herd of your retries.
- Too gentle. You retry 3 times in 1 minute and give up. The customer's deploy takes 10 minutes; they lose the event.
The pattern that works: truncated exponential backoff with jitter, capped retention.
Our schedule:
- T+0s — first attempt
- T+10s — first retry (if first failed)
- T+30s
- T+1m
- T+5m
- T+15m
- T+1h
- T+6h
- T+24h — final attempt
Total of 9 attempts over 24h. Each retry has ±25% jitter. After 9 failures the event goes to the DLQ.
This catches roughly 97% of customer endpoints that are recoverable, and bounds load on broken endpoints at 9 attempts/24h.
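A minimal sketch of that schedule, assuming attempt offsets are measured from event creation and the ±25% jitter is applied per attempt (function and variable names are illustrative):

```js
// Offsets, in seconds, of each attempt from the moment the event was created.
// Matches the schedule above: 9 attempts over 24 hours.
const ATTEMPT_OFFSETS_S = [0, 10, 30, 60, 300, 900, 3600, 21600, 86400];

// Returns the absolute time (ms) of the next attempt, or null once the
// schedule is exhausted and the event should go to the DLQ.
function nextAttemptAt(attemptIndex, eventCreatedAtMs) {
  if (attemptIndex >= ATTEMPT_OFFSETS_S.length) return null;
  const jitter = 1 + (Math.random() * 0.5 - 0.25); // uniform in [0.75, 1.25]
  return eventCreatedAtMs + ATTEMPT_OFFSETS_S[attemptIndex] * jitter * 1000;
}
```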
What counts as "success"?
We accept any HTTP status in the 2xx range as success. Specifically:
- 200 — explicit success
- 201, 202, 204 — also success
- Anything 3xx — we follow up to 2 redirects, then treat as failure
- 4xx — failure, retry (maybe the endpoint is misconfigured but will be fixed)
- 5xx — failure, retry
- Connection errors / timeouts — failure, retry
Conventional wisdom says "4xx is the client's fault, don't retry." Reality: a 401 might mean the customer rotated secrets and is fixing the bug; a 404 might mean their proxy is misconfigured; a 422 might mean their schema validation is wrong. Treating 4xx as terminal punishes the customer for transient errors that they can fix in minutes. Retry, but log loudly.
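As a sketch, the classification reduces to a small function; connection errors and timeouts never reach it because the HTTP client reports them as retryable failures directly:

```js
// Classify a completed HTTP response. 3xx responses only reach this point
// after the client has exhausted its 2-redirect budget, so they count as
// failures (and are retried like any other failure).
function classifyDelivery(statusCode) {
  if (statusCode >= 200 && statusCode < 300) {
    return { outcome: 'success' };
  }
  if (statusCode >= 400 && statusCode < 500) {
    // Retry, but log loudly: 4xx usually means the receiver is misconfigured
    return { outcome: 'retry', log: 'warn' };
  }
  // Exhausted redirects, 5xx, or anything unexpected
  return { outcome: 'retry', log: 'info' };
}
```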
HMAC signatures
Every webhook body MUST be signed. The pattern:
- The customer registers an endpoint and we generate a secret (32 bytes random, hex-encoded)
- On each delivery, we compute HMAC-SHA256(secret, timestamp + "." + body) and send it as X-Kxp-Signature: t=<ts>,v1=<hex>
- The customer's handler:
  - Reads the timestamp and signature header
  - Verifies the timestamp is within ±5 minutes (replay protection)
  - Recomputes the HMAC over timestamp + "." + body
  - Compares in constant time
We publish verification examples in our docs for the popular languages.
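For illustration only (the docs remain the source of truth), a minimal Node.js verification sketch assuming the header format above might look like this:

```js
const crypto = require('crypto');

const TOLERANCE_SECONDS = 5 * 60; // ±5 minute replay window

// header: "t=<unix-seconds>,v1=<hex>", rawBody: the exact bytes received
function verifySig(header, rawBody, secret) {
  if (!header) return false;
  const parts = Object.fromEntries(header.split(',').map((kv) => kv.split('=')));
  const timestamp = Number(parts.t);
  if (!Number.isFinite(timestamp)) return false;

  // Replay protection: reject timestamps outside the tolerance window
  if (Math.abs(Date.now() / 1000 - timestamp) > TOLERANCE_SECONDS) return false;

  // Recompute HMAC-SHA256 over "<timestamp>.<body>"
  const expected = crypto
    .createHmac('sha256', secret)
    .update(`${parts.t}.${rawBody}`)
    .digest('hex');

  // Constant-time comparison of the two digests
  const a = Buffer.from(expected, 'hex');
  const b = Buffer.from(parts.v1 || '', 'hex');
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}
```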
Two anti-patterns we see:
- Signing only the body, not the timestamp. Vulnerable to replay.
- Storing the secret in source code. Customers will commit it. Document loudly that it goes in env or vault.
Idempotency for receivers
Same idea as our API idempotency post, mirrored:
- We publish a unique event_id (UUIDv4) for every business event
- We retry with the SAME event_id regardless of how many delivery attempts it takes
- The receiver's handler dedups on event_id
If you're integrating webhooks from any payment provider, your handler skeleton should be:
async function handleWebhook(req, res) {
  const sig = req.headers['x-kxp-signature'];
  // Verify the signature over the RAW body, before any JSON parsing
  const body = await req.text();
  if (!verifySig(sig, body, process.env.KXP_WEBHOOK_SECRET)) {
    return res.status(401).send('Invalid signature');
  }
  const event = JSON.parse(body);
  // Dedup on event_id (back this with a unique index so concurrent retries can't both insert)
  const existing = await db.events.findOne({ event_id: event.id });
  if (existing) return res.status(200).send('Already processed');
  await db.events.insert({ event_id: event.id, status: 'processing' });
  try {
    await processEvent(event);
    await db.events.update(event.id, { status: 'done' });
    return res.status(200).send('OK');
  } catch (e) {
    await db.events.update(event.id, { status: 'failed', error: e.message });
    // Non-2xx tells the sender to retry this delivery
    return res.status(500).send('Retry me');
  }
}
The dedup check before processing is critical. Without it, a slow handler that gets retried while still running will double-process.
Dead-letter queue and replay
After 9 failed attempts over 24h, an event goes to the DLQ. The DLQ should be:
- Inspectable in the merchant dashboard — they need to see what's stuck
- Replayable manually with one click
- Bulk-replayable for "we just fixed our endpoint, replay everything from the last 6 hours"
- Filterable by event type, time range, error reason
Our DLQ also has an automatic replay feature: if the customer's endpoint goes from "down" to "up" (we probe it with a periodic lightweight health check), we automatically replay queued events. This catches the common case where their deploy finishes and we're back in business without manual intervention.
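A bulk replay is conceptually just "move the matching DLQ entries back onto the delivery queue, keeping their original event IDs". A rough sketch with hypothetical table and queue names:

```js
// Hypothetical db.dlq / deliveryQueue interfaces; the important part is that
// replayed deliveries keep their ORIGINAL event_id so receiver dedup still works.
async function bulkReplay(db, deliveryQueue, endpointId, { since, eventType } = {}) {
  const stuck = await db.dlq.find({
    endpoint_id: endpointId,
    ...(since ? { failed_at: { $gte: since } } : {}),
    ...(eventType ? { event_type: eventType } : {}),
  });
  for (const entry of stuck) {
    await deliveryQueue.enqueue({ ...entry.delivery, attempt: 0 }); // restart the retry schedule
    await db.dlq.remove(entry.id);
  }
  return stuck.length; // number of events replayed
}
```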
Per-endpoint health
Every endpoint we deliver to gets a health page in the merchant dashboard.
The merchant sees what we see. When they say "your webhooks are broken", we can immediately point at "actually, your endpoint has been returning 503 for 22 minutes — here's the body."
This single feature has cut webhook-related support tickets by ~60% for us.
IP allowlist accuracy
Customers will firewall us. They need a stable, accurate, well-documented set of source IPs.
If your webhook delivery runs in Kubernetes pods on dynamic IPs, you have a problem. The fix:
- Route all outbound webhook traffic through a dedicated NAT gateway with reserved IPs
- Document those IPs prominently in the dashboard and the API reference
- Never change them silently. Announce 90 days in advance, send pre-deploy emails, run both old and new IPs in parallel for 30 days
We publish 4 IPs (covering 2 regions, A/B for HA). They've been stable since launch and will be for the foreseeable future.
Secret rotation
Customers will rotate their webhook secret. Possibly because of a compromise, possibly just hygiene. Without a graceful rotation flow, every rotation breaks delivery.
The pattern:
- Endpoint configuration accepts two active secrets at a time (current, previous)
- We sign with current. The receiver verifies against either.
- Customer rotates: they ask us to set the new value as current and shift the old to previous. They deploy code that accepts both. They wait until their deploys are stable. They ask us to clear previous.
Total downtime: zero, with proper sequencing. Without it: full outage during the rotation window.
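On the receiver side, the two-secret window is a thin wrapper around the verification helper sketched earlier; the second env var name here is hypothetical:

```js
// Accept signatures made with either the current or the previous secret.
// KXP_WEBHOOK_SECRET_PREVIOUS stays unset outside of a rotation window.
function verifyWithRotation(header, rawBody) {
  const current = process.env.KXP_WEBHOOK_SECRET;
  const previous = process.env.KXP_WEBHOOK_SECRET_PREVIOUS;
  if (verifySig(header, rawBody, current)) return true;
  return Boolean(previous) && verifySig(header, rawBody, previous);
}
```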
Versioning
Every event we publish has a schema_version. Customers select a webhook schema version at endpoint registration time. We deliver in that version forever (or until they migrate).
When we add fields, we add them to a new schema version, deliver the old version unchanged to customers on the old version, and let them migrate at their own pace.
This is the webhook equivalent of pinned API versioning: the payload shape a customer registered for stays deliverable indefinitely. It works. Anything less leads to grumpy customers when you accidentally break their integration with an "additive" change.
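One way to wire this up on the delivery side is to key the payload serializer on the endpoint's pinned version; the version strings and field names below are invented for illustration:

```js
// Each schema version has its own serializer; endpoints are pinned at
// registration time and keep receiving that shape until they migrate.
const SERIALIZERS = {
  '2023-01': (event) => ({ amount: event.amountMinor }),               // hypothetical v1 shape
  '2024-06': (event) => ({ amount: event.amountMinor, fx: event.fx }), // adds a field in v2
};

function buildPayload(endpoint, event) {
  const serialize = SERIALIZERS[endpoint.schema_version];
  if (!serialize) throw new Error(`Unknown schema_version: ${endpoint.schema_version}`);
  return {
    id: event.id,
    type: event.type,
    schema_version: endpoint.schema_version,
    data: serialize(event),
  };
}
```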
Things to monitor
Five SLIs we alert on:
- Delivery success rate (should be > 99% over 24h)
- P99 delivery latency from event creation to first delivery attempt (should be < 30s)
- DLQ accumulation rate (alerts if > 50 events/h going to DLQ — likely a systemic problem)
- Per-customer 5xx rate (alerts the customer, not us)
- Signing-key age (alerts customer if their secret hasn't been rotated in > 6 months — gentle hygiene nudge)
What we'd build differently
Hindsight items, if we were starting fresh:
- Build the replay UI before launch, not after the first incident. We didn't, and it cost us about three weeks of "send me the missing event_ids and I'll re-deliver them" support work.
- Make the secret rotation flow self-service from day one. Manual rotations were a footgun.
- Default to verbose receiver logging in our SDKs. Quiet failure modes in customer code are the worst.
TL;DR
Build the boring infrastructure. Retry with backoff. Sign every body. Use stable event IDs. Have a DLQ. Have a replay button. Publish health to the customer. Don't change IPs silently. Support rotation gracefully. Version your schemas.
If you're integrating with us, check out the webhook section of our developer docs for the production version of all of this.