
Webhook Reliability: A Survival Guide for Payment Systems

Why your webhook delivery system needs retry policies, signing secrets, dead-letter queues, and a manual replay UI — and how to ship all of it without breaking exactly-once semantics.

April 1, 2026 · 8 min read · By Kaadxpay Engineering

Webhooks are the second-most-debugged integration surface in any payment platform (behind only "why didn't my payout settle?"). They look simple — POST a JSON body to a customer-supplied URL when something happens — but the production reality is full of nasty edge cases.

This is a battle-tested checklist for shipping a webhook system that survives real customer integrations.

Why webhooks are hard

The basic problem: you're calling code you don't control, on infrastructure you don't control, on someone else's network. Every assumption has to be defensive.

Things that will go wrong:

  • The customer's endpoint will be down for 4 hours during a deploy
  • The customer's endpoint will return 200 but lose the message
  • The customer's endpoint will respond too slowly, your client times out, you don't know if they got it
  • The customer's endpoint will return 503 then 200 to a retry, and the customer will process the same event twice
  • The customer's endpoint will be replaced and the new one will reject your old signing secret
  • The customer's WAF will silently drop your IP
  • The customer's TLS will expire on a Saturday

A naïve "fire and forget" webhook implementation gives up on most of these. A production-grade one handles them.

The non-negotiable features

Every payment-grade webhook system needs:

  1. At-least-once delivery with explicit retry policy
  2. HMAC signing of every body
  3. Stable event IDs so receivers can dedup
  4. Dead-letter queue for permanently-failed deliveries
  5. Replay UI for manual recovery
  6. Per-endpoint health metrics the customer can see
  7. IP allowlist documentation that's actually accurate
  8. Sane timeouts (we use 5s, others use 10–30s)
  9. Webhook secret rotation without downtime
  10. Versioned event schemas so you can evolve

If your design is missing any of these, you're going to have a bad time.

Retry policy

This is the most common place to get it wrong. Two failure modes:

  • Too aggressive. You retry every 5 seconds for 24 hours. The customer's endpoint comes back online and is immediately buried under a thundering herd of your queued retries.
  • Too gentle. You retry 3 times in 1 minute and give up. The customer's deploy takes 10 minutes; they lose the event.

The pattern that works: truncated exponential backoff with jitter, capped retention.

Our schedule:

  • T+0s — first attempt
  • T+10s — first retry (if first failed)
  • T+30s
  • T+1m
  • T+5m
  • T+15m
  • T+1h
  • T+6h
  • T+24h — final attempt

Total of 9 attempts over 24h. Each retry has ±25% jitter. After 9 failures the event goes to the DLQ.

This catches roughly 97% of customer endpoints that are recoverable, and bounds load on broken endpoints at 9 attempts/24h.
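The schedule above can be sketched as a fixed delay table with per-attempt jitter. A minimal illustration; `RETRY_DELAYS_MS` and `nextRetryDelay` are invented names for this post, not our production code:

```javascript
// Delay before each attempt, matching the T+0s ... T+24h schedule above.
const RETRY_DELAYS_MS = [
  0, 10_000, 30_000, 60_000, 300_000, 900_000, 3_600_000, 21_600_000, 86_400_000,
];

// Returns the jittered delay before attempt `n` (0-indexed), or null when
// the event should go to the DLQ instead of being retried again.
function nextRetryDelay(n, rand = Math.random) {
  if (n >= RETRY_DELAYS_MS.length) return null; // 9 attempts exhausted: DLQ
  const base = RETRY_DELAYS_MS[n];
  const jitter = base * 0.25 * (2 * rand() - 1); // uniform in +/-25%
  return Math.round(base + jitter);
}
```

Injecting the random source (`rand`) keeps the schedule testable; the jitter spreads retries so a recovering endpoint is not hit by every queued delivery at once.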

What counts as "success"?

We accept any HTTP status in the 2xx range as success. Specifically:

  • 200 — explicit success
  • 201, 202, 204 — also success
  • Anything 3xx — we follow up to 2 redirects, then treat as failure
  • 4xx — failure, retry (maybe the endpoint is misconfigured but will be fixed)
  • 5xx — failure, retry
  • Connection errors / timeouts — failure, retry

Why retry on 4xx?

Conventional wisdom says "4xx is the client's fault, don't retry." Reality: a 401 might mean the customer rotated secrets and is fixing the bug; a 404 might mean their proxy is misconfigured; a 422 might mean their schema validation is wrong. Treating 4xx as terminal punishes the customer for transient errors that they can fix in minutes. Retry, but log loudly.
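Put together, the classification logic is small. A hypothetical helper (not our actual delivery worker) that maps a response status to an outcome, flagging 4xx for the loud logging mentioned above:

```javascript
// status: an HTTP status code, or null for a connection error / timeout.
// Returns { outcome, loud } where `loud` means "log this prominently".
function classifyDelivery(status) {
  if (status === null) return { outcome: 'retry', loud: true }; // timeout / conn error
  if (status >= 200 && status < 300) return { outcome: 'success', loud: false };
  // 4xx is conventionally "terminal", but we retry anyway and surface it
  // loudly so the customer's misconfiguration is visible while retries run.
  const loud = status >= 400 && status < 500;
  return { outcome: 'retry', loud }; // 3xx (post-redirect), 4xx, 5xx
}
```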

HMAC signatures

Every webhook body MUST be signed. The pattern:

  1. The customer registers an endpoint and we generate a secret (32 bytes random, hex-encoded)
  2. On each delivery, we compute HMAC-SHA256(secret, timestamp + "." + body) and send it as X-Kxp-Signature: t=<ts>,v1=<hex>
  3. The customer's handler:
    • Reads the timestamp and signature header
    • Verifies the timestamp is within ±5 minutes (replay protection)
    • Recomputes the HMAC over timestamp + "." + body
    • Compares constant-time

We publish verification examples in our docs for the popular languages.

Two anti-patterns we see:

  • Signing only the body, not the timestamp. Vulnerable to replay.
  • Storing the secret in source code. Customers will commit it. Document loudly that it goes in env or vault.

Idempotency for receivers

Same idea as our API idempotency post, mirrored:

  • We publish a unique event_id (UUIDv4) for every business event
  • We retry with the SAME event_id regardless of how many delivery attempts
  • The receiver's handler dedups on event_id

If you're integrating webhooks from any payment provider, your handler skeleton should be:

// Express-style sketch; `req.text()` stands in for however your framework
// exposes the raw request body.
async function handleWebhook(req, res) {
  const sig = req.headers['x-kxp-signature'];
  const body = await req.text(); // raw body: verify before JSON.parse

  if (!verifySig(sig, body, process.env.KXP_WEBHOOK_SECRET)) {
    return res.status(401).send('Invalid signature');
  }

  const event = JSON.parse(body);

  // Dedup on event_id
  const existing = await db.events.findOne({ event_id: event.id });
  if (existing) return res.status(200).send('Already processed');

  await db.events.insert({ event_id: event.id, status: 'processing' });

  try {
    await processEvent(event);
    await db.events.update(event.id, { status: 'done' });
    return res.status(200).send('OK');
  } catch (e) {
    await db.events.update(event.id, { status: 'failed', error: e.message });
    return res.status(500).send('Retry me');
  }
}

The dedup check before processing is critical. Without it, a slow handler that gets retried while still running will double-process.
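The findOne-then-insert pair still has a small window: two concurrent deliveries can both pass the check before either inserts. A unique index on event_id closes it, because the insert itself becomes the dedup. A sketch against a generic SQL driver; `db.query` and the Postgres unique-violation code `23505` are assumptions here:

```javascript
// Atomically claim an event for processing. Requires:
//   CREATE UNIQUE INDEX ON events (event_id)
// Returns true if we won the claim, false if this is a duplicate delivery.
async function claimEvent(db, eventId) {
  try {
    await db.query(
      'INSERT INTO events (event_id, status) VALUES ($1, $2)',
      [eventId, 'processing']
    );
    return true;
  } catch (e) {
    if (e.code === '23505') return false; // unique violation: already claimed
    throw e;
  }
}
```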

Dead-letter queue and replay

After 9 failed attempts over 24h, an event goes to the DLQ. The DLQ should be:

  • Inspectable in the merchant dashboard — they need to see what's stuck
  • Replayable manually with one click
  • Bulk-replayable for "we just fixed our endpoint, replay everything from the last 6 hours"
  • Filterable by event type, time range, error reason

Our DLQ also has an automatic replay feature: if the customer's endpoint goes from "down" to "up" (we monitor with a periodic lightweight health check), we automatically replay queued events. This catches the common case where their deploy finishes and we're back in business without manual intervention.

Per-endpoint health

Every endpoint we deliver to gets a health page in the merchant dashboard:

  • P50 response time: 142 ms (customer endpoint, last 1h)
  • 2xx ratio: 99.4% (last 24h)
  • Pending retries: 2 (currently in retry queue)

The merchant sees what we see. When they say "your webhooks are broken", we can immediately point at "actually, your endpoint has been returning 503 for 22 minutes — here's the body."

This single feature has cut webhook-related support tickets by ~60% for us.

IP allowlist accuracy

Customers will firewall us. They need a stable, accurate, well-documented set of source IPs.

If your webhook delivery runs in Kubernetes pods on dynamic IPs, you have a problem. The fix:

  • Route all outbound webhook traffic through a dedicated NAT gateway with reserved IPs
  • Document those IPs prominently in the dashboard and the API reference
  • Never change them silently. Announce 90 days in advance, send pre-deploy emails, run both old and new IPs in parallel for 30 days

We publish 4 IPs (covering 2 regions, A/B for HA). They've been stable since launch and will be for the foreseeable future.

Secret rotation

Customers will rotate their webhook secret. Possibly because of a compromise, possibly just hygiene. Without a graceful rotation flow, every rotation breaks delivery.

The pattern:

  • Endpoint configuration accepts two active secrets at a time (current, previous)
  • We sign with current. The receiver verifies against either.
  • Customer rotates: they ask us to set the new value as current and shift the old to previous. They deploy code that accepts both. They wait until their deploys are stable. They ask us to clear previous.

Total downtime: zero, with proper sequencing. Without it: full outage during the rotation window.
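On the receiver side, accepting both secrets is a few lines. A sketch that assumes you already have a single-secret `verify(header, body, secret)` function like the one in our docs; names here are illustrative:

```javascript
// secrets = { current, previous }; previous is null outside a rotation window.
// Accepts a signature made with either active secret.
function verifyWithRotation(header, body, secrets, verify) {
  return [secrets.current, secrets.previous]
    .filter(Boolean) // skip a null `previous`
    .some((s) => verify(header, body, s));
}
```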

Versioning

Every event we publish has a schema_version. Customers select a webhook schema version at endpoint registration time. We deliver in that version forever (or until they migrate).

When we add fields, we add them to a new schema version, deliver the old version unchanged to customers on the old version, and let them migrate at their own pace.
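One way to sketch this is a serializer per schema version, selected by the endpoint's registered version. The version strings and fields below are invented for the example, not our real schemas:

```javascript
// One serializer per published schema version.
const serializers = {
  '2025-06-01': (e) => ({ id: e.id, type: e.type, amount: e.amount }),
  // Newer version adds a field; endpoints pinned to the old version
  // keep receiving the old shape unchanged.
  '2026-01-01': (e) => ({ id: e.id, type: e.type, amount: e.amount, currency: e.currency }),
};

function renderEvent(event, endpointVersion) {
  const serialize = serializers[endpointVersion];
  if (!serialize) throw new Error(`unknown schema_version ${endpointVersion}`);
  return { schema_version: endpointVersion, ...serialize(event) };
}
```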

This is basically pinned API versioning applied to webhook payloads. It works. Anything less leads to grumpy customers when you accidentally break their integration with an "additive" change.

Things to monitor

Five SLIs we alert on:

  1. Delivery success rate (should be > 99% over 24h)
  2. P99 delivery latency from event creation to first delivery attempt (should be < 30s)
  3. DLQ accumulation rate (alerts if > 50 events/h going to DLQ — likely a systemic problem)
  4. Per-customer 5xx rate (alerts the customer, not us)
  5. Signing-key age (alerts customer if their secret hasn't been rotated in > 6 months — gentle hygiene nudge)

What we'd build differently

Hindsight items, if we were starting fresh:

  • Build the replay UI before launch, not after the first incident. We didn't, and it cost us about three weeks of "send me the missing event_ids and I'll re-deliver them" support work.
  • Make the secret rotation flow self-service from day one. Manual rotations were a footgun.
  • Default to verbose receiver logging in our SDKs. Quiet failure modes in customer code are the worst.

TL;DR

Build the boring infrastructure. Retry with backoff. Sign every body. Use stable event IDs. Have a DLQ. Have a replay button. Publish health to the customer. Don't change IPs silently. Support rotation gracefully. Version your schemas.

If you're integrating with us, check out the webhook section of our developer docs for the production version of all of this.

Author
Kaadxpay Engineering
Platform Engineering

Posts from the Kaadxpay engineering team covering API design, webhook reliability, reconciliation patterns, and the practical realities of running a cross-border payment platform.

Related Reading

Idempotency for Payment APIs: The Engineering Playbook

How to design idempotency keys that survive retries, network partitions, and creative client behavior. The patterns we use in production at Kaadxpay, written for engineers who actually have to debug double-charge tickets.

Apr 15, 2026 · 8 min read

Labuan FSA PSO License Explained: Why It Matters for ASEAN Cross-Border Payments

A practical guide to the Labuan Financial Services Authority Payment System Operator license — what it permits, who needs it, and how it compares to BNM, MAS, and offshore alternatives.

Apr 28, 2026 · 6 min read

ASEAN Payment Corridors 2026: State of the Market

A corridor-by-corridor breakdown of how ASEAN cross-border payments actually flow today — from MY-SG QR linkage to the IDR-PHP backwater. What's working, what's broken, and what's worth integrating.

Apr 22, 2026 · 7 min read
