
Idempotency for Payment APIs: The Engineering Playbook

How to design idempotency keys that survive retries, network partitions, and creative client behavior. The patterns we use in production at Kaadxpay, written for engineers who actually have to debug double-charge tickets.

April 15, 2026 · 8 min read · By Kaadxpay Engineering

If you've shipped a payment API and not lost sleep over an idempotency edge case, you haven't shipped enough payment volume yet.

This is the playbook we wish we'd had. It's the result of watching enough double-charge incident reviews to know which patterns survive contact with reality and which ones don't.

What "idempotent" actually means

A formal definition: an operation is idempotent if performing it N times yields the same observable state as performing it once. For payments:

POST /orders with the same idempotency_key → at most one order created, regardless of how many times the request is sent
POST /payouts with the same idempotency_key → at most one payout dispatched
PUT /orders/{id} is naturally idempotent (assuming deterministic state derivation)

Important: idempotency is about the side effect, not the response. A second call with the same key should ideally return the same response, but the binding contract is that the system state mutates at most once.
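To make the definition concrete, here is a minimal sketch in Python. The function names and the order dictionary are illustrative assumptions, not part of any real API; the point is only that applying an idempotent operation three times leaves the same state as applying it once.

```python
def apply_n_times(operation, state, n):
    """Apply `operation` to `state` n times and return the final state."""
    for _ in range(n):
        state = operation(state)
    return state


def set_status_captured(order):
    # Idempotent: overwriting a field with a fixed value converges after one call.
    return {**order, "status": "captured"}


def add_charge(order):
    # Not idempotent: every call appends another charge.
    return {**order, "charges": order["charges"] + [order["amount"]]}


order = {"status": "pending", "amount": 5000, "charges": []}

once = apply_n_times(set_status_captured, order, 1)
thrice = apply_n_times(set_status_captured, order, 3)
print(once == thrice)  # True

charged_once = apply_n_times(add_charge, order, 1)
charged_thrice = apply_n_times(add_charge, order, 3)
print(charged_once == charged_thrice)  # False
```

An idempotency key is what turns an operation like add_charge into one that behaves like set_status_captured under retries.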

Why every payment API needs it

Five real failure modes that idempotency keys defend against:

Network retry on the client. Connection drops mid-request; client retries; without dedup the operation runs twice.
Edge proxy retry. CDN, load balancer, or service mesh retries an "errored" request that actually succeeded server-side.
Multi-tenant queue retry. A worker crashes after creating the order but before acking the message; another worker picks it up.
Webhook retry storms. When you retry webhooks (and you should), the receiver needs to dedup by event_id.
Operator double-click. A merchant clicks "Pay out USD 50K" twice in your dashboard.

In our production telemetry, retry-induced duplicate attempts happen on roughly 1 in 200 payment calls. Without idempotency keys, that's a 0.5% double-execution rate at base. With well-designed keys, it's effectively zero.

The naïve approach (and why it fails)

The instinct is something like:

def create_order(req):
    key = req.idempotency_key
    existing = db.orders.find_one({"idempotency_key": key})
    if existing:
        return existing
    order = create_order_in_db(req)
    return order

This breaks on race conditions. Two concurrent requests both pass the find_one check before either commits — both create orders.

You need atomicity. Either a database-level unique constraint or a transactional check-and-create.
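The race is easy to reproduce. The sketch below is a toy model, not production code: a plain list stands in for the orders table, and a threading.Barrier forces the unlucky interleaving in which both requests finish the check before either writes.

```python
import threading

orders = []                            # stand-in for the orders table
both_checked = threading.Barrier(2)    # forces both threads past the check first


def naive_create_order(key):
    # Check: is there already an order for this key?
    existing = [o for o in orders if o["idempotency_key"] == key]
    # Both threads wait here, so neither has written when the other checks.
    both_checked.wait()
    if existing:
        return existing[0]
    order = {"idempotency_key": key}
    orders.append(order)
    return order


threads = [
    threading.Thread(target=naive_create_order, args=("key-123",))
    for _ in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(orders))  # 2: the same key produced two orders
```

In a real service the barrier is replaced by ordinary scheduling luck, which is why this bug tends to show up only under load.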

The pattern that works

Here's the canonical pattern, distilled to the essentials:

def create_order(req):
    key = require_idempotency_key(req)
    merchant_id = req.merchant_id
    request_hash = sha256(canonical_form(req.body))

    # Step 1: atomically insert the idempotency record OR fetch the existing one.
    # Uniqueness is enforced on (merchant_id, key), not on key alone.
    inserted = db.idempotency.insert_or_get(
        key=key,
        request_hash=request_hash,
        endpoint="POST /orders",
        status="in_progress",
        merchant_id=merchant_id,
        ttl_seconds=86400,  # 24h retention
    )

    # Step 2a: re-request with the SAME body
    if inserted.was_existing and inserted.request_hash == request_hash:
        if inserted.status == "completed":
            return cached_response(inserted.response_id)
        if inserted.status == "in_progress":
            # Caller is retrying while we're still processing the original.
            # Return 409 Conflict with a Retry-After header.
            return conflict_response(retry_after=2)
        # status == "failed": the earlier attempt failed, so re-execute.
        # The failed -> in_progress transition must be atomic, otherwise two
        # concurrent retries both run Step 3. try_reclaim is a hypothetical
        # helper (e.g. UPDATE ... WHERE status = 'failed').
        if not db.idempotency.try_reclaim(key=key, merchant_id=merchant_id):
            return conflict_response(retry_after=2)

    # Step 2b: re-request with a DIFFERENT body: semantic mismatch
    if inserted.was_existing and inserted.request_hash != request_hash:
        return error_response(
            422,
            "Idempotency-Key reused with different request body"
        )

    # Step 3: do the actual work
    try:
        order = create_order_in_db(req, idempotency_key=key)
        response = serialize(order)
        db.idempotency.update(
            key=key,
            merchant_id=merchant_id,  # always address records by (merchant_id, key)
            status="completed",
            response_id=response.id,
        )
        return response
    except Exception as e:
        db.idempotency.update(
            key=key, merchant_id=merchant_id, status="failed", error=str(e)
        )
        raise

Five things that matter:

1. Hash the request body. A client must not be able to "reuse" a key with different parameters. Hash a canonical form of the body (sorted keys, no whitespace) and store it. On retry, re-hash and compare.

2. Three states, not two. in_progress, completed, failed. The middle state lets you tell a retrying caller "I'm still working on the original request — back off and retry" instead of either making them wait or letting them race.

3. Scope by merchant. A leaked or guessed idempotency key should never let attacker A interfere with merchant B's transactions. The unique constraint should be (merchant_id, idempotency_key), not idempotency_key alone.

4. TTL the records. 24 hours is the sweet spot for most payment scenarios. Stripe uses 24h, we use 24h, and that is the closest thing the industry has to a consensus.

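5. The unique constraint is the real lock. Don't rely on application-level "check then write." Use DB-level uniqueness. PostgreSQL's INSERT ... ON CONFLICT DO NOTHING RETURNING * is the ideal primitive: it returns the new row when your insert wins and no row when the key already exists, in which case you follow up with a SELECT to load the existing record.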
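Points 1, 3, and 5 can be sketched together. The example below is an assumption-laden illustration, not our production code: it uses Python's built-in sqlite3 as a stand-in for PostgreSQL, and the table layout, the request_hash function, and the insert_or_get signature are all hypothetical. SQLite's ON CONFLICT ... DO NOTHING behaves like PostgreSQL's here, and the follow-up SELECT plays the role of fetching the existing record when the insert is skipped.

```python
import hashlib
import json
import sqlite3


def request_hash(body: dict) -> str:
    """SHA-256 over a canonical JSON form: sorted keys, no extra whitespace."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


idem_db = sqlite3.connect(":memory:")
idem_db.execute(
    "CREATE TABLE idempotency ("
    " merchant_id TEXT NOT NULL,"
    " idempotency_key TEXT NOT NULL,"
    " request_hash TEXT NOT NULL,"
    " status TEXT NOT NULL DEFAULT 'in_progress',"
    " UNIQUE (merchant_id, idempotency_key))"   # the real lock, scoped by merchant
)


def insert_or_get(merchant_id: str, key: str, body: dict):
    """Return (was_existing, stored_hash, status) for this merchant and key."""
    with idem_db:  # one transaction: commit on success, roll back on error
        cur = idem_db.execute(
            "INSERT INTO idempotency (merchant_id, idempotency_key, request_hash)"
            " VALUES (?, ?, ?)"
            " ON CONFLICT (merchant_id, idempotency_key) DO NOTHING",
            (merchant_id, key, request_hash(body)),
        )
        was_existing = cur.rowcount == 0  # nothing inserted: the key was taken
        stored_hash, status = idem_db.execute(
            "SELECT request_hash, status FROM idempotency"
            " WHERE merchant_id = ? AND idempotency_key = ?",
            (merchant_id, key),
        ).fetchone()
    return was_existing, stored_hash, status


body = {"amount": 5000, "currency": "USD"}
reordered = {"currency": "USD", "amount": 5000}

print(insert_or_get("merchant_a", "key-1", body)[0])       # False: first insert
print(insert_or_get("merchant_a", "key-1", reordered)[0])  # True: same key, same canonical body
print(insert_or_get("merchant_b", "key-1", body)[0])       # False: other merchant, separate record
```

Comparing the returned stored_hash with the hash of the incoming body is what distinguishes a legitimate retry from a key reused with different parameters.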

Don't use Redis for idempotency state

We see this anti-pattern often: developers reach for Redis because it's "fast." Two problems: (1) many production Redis setups run without durable persistence (the default configuration relies on periodic snapshots rather than an append-only log), so a node restart can lose recently written keys; (2) Redis-only uniqueness has subtle race windows, for example when asynchronous replication drops an acknowledged write during a failover. Use your transactional database for idempotency. If volume warrants caching, layer Redis on top of the durable record, not in place of it.
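One way to layer a cache on top of the durable record is a read-through lookup, sketched below. This is an illustration under assumptions: plain dictionaries stand in for both Redis and the database, the class name is hypothetical, and caching only completed records is a design choice of this sketch rather than something the pattern requires.

```python
class CachedIdempotencyLookup:
    """Read-through cache in front of a durable idempotency store."""

    def __init__(self, durable_store: dict, cache: dict):
        self.durable_store = durable_store  # stand-in for the transactional database
        self.cache = cache                  # stand-in for Redis

    def get(self, merchant_id: str, key: str):
        cache_key = f"{merchant_id}:{key}"
        cached = self.cache.get(cache_key)
        if cached is not None:
            return cached
        record = self.durable_store.get((merchant_id, key))
        # Only completed records are cached: they no longer change, so losing
        # the cache costs a database read and nothing else.
        if record is not None and record["status"] == "completed":
            self.cache[cache_key] = record
        return record


durable = {
    ("merchant_a", "key-1"): {"status": "completed", "response_id": "resp_1"},
    ("merchant_a", "key-2"): {"status": "in_progress", "response_id": None},
}
redis_stand_in = {}
lookup = CachedIdempotencyLookup(durable, redis_stand_in)

lookup.get("merchant_a", "key-1")   # cache miss, then cached
lookup.get("merchant_a", "key-2")   # in_progress: read from the database, not cached
print(sorted(redis_stand_in))       # ['merchant_a:key-1']

redis_stand_in.clear()              # simulate a Redis restart
print(lookup.get("merchant_a", "key-1")["response_id"])  # resp_1
```

Writes never go through the cache in this sketch; the insert that claims a key always hits the database, so the unique constraint stays the only lock.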

Where this gets interesting: distributed payouts

The single-table pattern above works for "create order." It gets harder when the operation triggers a downstream side effect — say, dispatching to a banking partner.

Consider: you successfully create the payout record, then call the partner API to actually move the money. The partner's response times out. Did the partner receive your request? You don't know.

Naive retry: re-call the partner. Maybe move the money twice.

The correct pattern uses end-to-end idempotency: pass the same idempotency token (or a deterministic derivative) to the downstream system, and rely on their idempotency.

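def execute_payout(payout_id, idempotency_key):
    # Load the payout record created earlier
    # (load_payout is a hypothetical helper; the original snippet left this implicit)
    payout = load_payout(payout_id)

    # Generate a deterministic downstream key: every retry of this payout
    # sends the same value, so the partner can dedup on it
    downstream_key = f"kxp-{payout_id}-{idempotency_key}"

    response = partner_api.create_transfer(
        body=payout.to_partner_format(),
        headers={"X-Idempotency-Key": downstream_key},
    )
    return response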

Partners with mature APIs (Stripe, Wise, Currencycloud, the major card schemes) will dedup on this. Partners without mature APIs are a liability — work with them at your own risk, and absolutely build a reconciliation job that runs the next day to detect and reverse duplicates.
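To see why the deterministic key makes a retry safe, consider the simulation below. FakePartner is a hypothetical stand-in for a partner that dedups on the idempotency header, and execute_payout_with_retry is an illustrative wrapper, not a real client library. The first call records the transfer but loses the response; the retry sends the same key and gets the original transfer back.

```python
class FakePartner:
    """Hypothetical partner that dedups transfers on the idempotency key."""

    def __init__(self):
        self.transfers = {}
        self.lose_next_response = False

    def create_transfer(self, body, headers):
        key = headers["X-Idempotency-Key"]
        if key not in self.transfers:
            transfer_id = f"tr_{len(self.transfers) + 1}"
            self.transfers[key] = {"transfer_id": transfer_id, **body}
        if self.lose_next_response:
            # The transfer was recorded, but the caller never hears about it.
            self.lose_next_response = False
            raise TimeoutError("response lost after the transfer was recorded")
        return self.transfers[key]


def execute_payout_with_retry(partner, payout_id, idempotency_key, body, attempts=3):
    # Same payout, same key on every attempt.
    downstream_key = f"kxp-{payout_id}-{idempotency_key}"
    for attempt in range(attempts):
        try:
            return partner.create_transfer(
                body=body,
                headers={"X-Idempotency-Key": downstream_key},
            )
        except TimeoutError:
            if attempt == attempts - 1:
                raise


partner = FakePartner()
partner.lose_next_response = True
result = execute_payout_with_retry(
    partner, "po_42", "key-1", {"amount": 50000, "currency": "USD"}
)
print(result["transfer_id"])   # tr_1
print(len(partner.transfers))  # 1: the retry did not move the money twice
```

If the key were generated fresh on each attempt (a random UUID per call, say), the same run would record two transfers.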

The webhook side: idempotent consumers

Same idea, mirror image. When you publish webhooks, every event has a stable event_id. Receivers dedup on event_id. This is critical because any sane webhook publisher will retry on failure, often aggressively.

In Kaadxpay's case, every webhook carries:

event_id — globally unique UUID
event_type — e.g. order.captured
delivery_id — unique per delivery attempt (so receivers can distinguish retries)
signature — HMAC over the body
delivered_at — server-side dispatch timestamp

Receivers should dedup on event_id. Use delivery_id for diagnostics only; it must never drive state changes.
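A receiver along these lines might look like the sketch below. Several details are assumptions made for illustration: the signature is taken to be HMAC-SHA256 in hex and is passed in separately from the body, the secret is a placeholder, and an in-memory set stands in for what should be a durable table with a unique constraint on event_id.

```python
import hashlib
import hmac
import json

SIGNING_SECRET = b"whsec_example"   # placeholder shared secret
processed_event_ids = set()         # in production: a table with UNIQUE(event_id)
applied_events = []                 # records the state changes actually applied


def sign(body: bytes) -> str:
    """Hex HMAC-SHA256 over the raw body (the hash choice is an assumption)."""
    return hmac.new(SIGNING_SECRET, body, hashlib.sha256).hexdigest()


def handle_webhook(body: bytes, signature: str) -> str:
    # Verify before parsing: never trust an unsigned payload.
    if not hmac.compare_digest(sign(body), signature):
        return "rejected"
    event = json.loads(body)
    # Dedup on event_id only; delivery_id differs on every retry.
    if event["event_id"] in processed_event_ids:
        return "duplicate"
    applied_events.append((event["event_id"], event["event_type"]))
    processed_event_ids.add(event["event_id"])
    return "processed"


first = json.dumps(
    {"event_id": "evt-1", "event_type": "order.captured", "delivery_id": "dlv-1"}
).encode()
retry = json.dumps(
    {"event_id": "evt-1", "event_type": "order.captured", "delivery_id": "dlv-2"}
).encode()

print(handle_webhook(first, sign(first)))   # processed
print(handle_webhook(retry, sign(retry)))   # duplicate
print(handle_webhook(first, "bad-sig"))     # rejected
print(len(applied_events))                  # 1
```

With a real database, the check and the insert of event_id should be a single atomic statement, for the same reason the idempotency record relies on a unique constraint.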

Edge cases to test

If you're building this, run these test scenarios. They catch real bugs:

Same key, identical body, sent twice in quick succession → second returns 200 with same response or 409 if first still pending
Same key, different body → 422 with clear error message
Same key, different merchant → both succeed (correct scoping)
Same key after TTL expiry → second is treated as a new request (not an error)
Process crash mid-execution, retry after restart → idempotency record is left in in_progress and the retry returns 409; your background worker eventually marks the record failed once no completion has been recorded within the sweep timeout
DB primary failover during execution → exactly-once semantics preserved if you use transactional updates
Network partition between you and downstream partner → reconciliation job catches and resolves any duplicates
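The first four scenarios can be exercised against a toy in-memory model before you point the same assertions at a real service. Everything below is a sketch under assumptions: the class and its create_order signature are hypothetical, the clock is a fake one advanced by the test, and the failed state is not modeled.

```python
import hashlib
import json

TTL_SECONDS = 86400


class InMemoryIdempotentApi:
    """Toy model of the pattern, for exercising test scenarios only."""

    def __init__(self):
        self.now = 0.0       # fake clock, advanced by tests
        self.records = {}    # (merchant_id, key) -> record
        self.orders = []

    def create_order(self, merchant_id, key, body, finish=True):
        h = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        record = self.records.get((merchant_id, key))
        if record and self.now - record["created_at"] >= TTL_SECONDS:
            record = None    # expired: treat as a new request
        if record:
            if record["hash"] != h:
                return 422, None
            if record["status"] == "in_progress":
                return 409, None
            return 200, record["response"]
        record = {"hash": h, "status": "in_progress",
                  "created_at": self.now, "response": None}
        self.records[(merchant_id, key)] = record
        if not finish:       # simulate an original request that is still running
            return None, None
        order = {"order_id": len(self.orders) + 1, **body}
        self.orders.append(order)
        record.update(status="completed", response=order)
        return 200, order


api = InMemoryIdempotentApi()
body = {"amount": 5000, "currency": "USD"}

first = api.create_order("merchant_a", "key-1", body)
assert api.create_order("merchant_a", "key-1", body) == first        # replay: same response
assert api.create_order("merchant_a", "key-1", {**body, "amount": 1})[0] == 422
assert api.create_order("merchant_b", "key-1", body)[0] == 200       # other merchant succeeds
assert len(api.orders) == 2

api.create_order("merchant_a", "key-2", body, finish=False)          # original still pending
assert api.create_order("merchant_a", "key-2", body)[0] == 409

api.now += TTL_SECONDS                                                # key-1 has expired
assert api.create_order("merchant_a", "key-1", body)[0] == 200
assert len(api.orders) == 3
```

The crash, failover, and partition scenarios need a real database and fault injection; an in-memory model cannot say anything useful about them.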

What this looks like at scale

For perspective, our PostgreSQL-backed idempotency table at Kaadxpay handles roughly:

Inserts/second (peak): ~150, with headroom for 5x growth
P99 latency (lookup + insert): < 8 ms
Storage (24h TTL): ~80 MB per million daily transactions

PostgreSQL with proper indexing handles this without breaking a sweat well past 10M operations/day. You don't need a special-purpose datastore until you're operating at a different scale than us.
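For a sense of what 10M operations/day means as a sustained rate, a quick back-of-the-envelope conversion helps. This is plain arithmetic on the figure above, assuming traffic spread evenly over the day; real traffic is burstier.

```python
SECONDS_PER_DAY = 24 * 60 * 60

daily_operations = 10_000_000
average_per_second = daily_operations / SECONDS_PER_DAY

print(round(average_per_second))  # 116
```

That average is a modest write load for a single well-indexed table. Peaks will sit above it, so size for your own peak-to-average ratio.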

Cleanup and operational tips

Background sweep. Have a worker that marks in_progress records older than X minutes as failed. Catches stuck records from process crashes.
Retain completed and failed records full TTL. Don't garbage-collect aggressively — debug visibility into "why did the second call return what it did?" is invaluable in incident response.
Surface idempotency state in your dashboard. When merchants ask "did my request go through?", giving them the idempotency record's state is faster than human investigation.
Add structured logs. Every idempotency match (cache hit) and conflict (422) should log with context. This is your single best data source for client-side bug patterns.
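The background sweep can be a single conditional UPDATE. The sketch below again uses sqlite3 as a stand-in for PostgreSQL; the table layout, the function name, and the five-minute cutoff are assumptions chosen for illustration (the right value of X depends on how long your slowest legitimate request takes).

```python
import sqlite3
import time

STALE_AFTER_SECONDS = 300  # assumption: stands in for "X minutes"

sweep_db = sqlite3.connect(":memory:")
sweep_db.execute(
    "CREATE TABLE idempotency ("
    " merchant_id TEXT, idempotency_key TEXT,"
    " status TEXT, created_at REAL, error TEXT)"
)


def sweep_stale_records(now=None):
    """Mark in_progress records older than the cutoff as failed; return the count."""
    now = time.time() if now is None else now
    with sweep_db:  # commit on success, roll back on error
        cur = sweep_db.execute(
            "UPDATE idempotency"
            " SET status = 'failed', error = 'swept: no completion recorded'"
            " WHERE status = 'in_progress' AND created_at < ?",
            (now - STALE_AFTER_SECONDS,),
        )
    return cur.rowcount


now = 1_000_000.0
sweep_db.executemany(
    "INSERT INTO idempotency VALUES (?, ?, ?, ?, NULL)",
    [
        ("merchant_a", "stuck", "in_progress", now - 900),  # crashed 15 minutes ago
        ("merchant_a", "fresh", "in_progress", now - 5),    # still legitimately running
        ("merchant_a", "done", "completed", now - 900),     # must not be touched
    ],
)

print(sweep_stale_records(now=now))  # 1
```

Because the WHERE clause filters on status, the sweep never overwrites a record that completed between the scan and the update.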

TL;DR for engineers

If you're new to payment APIs and just want the survival kit:

Require an Idempotency-Key header on every state-changing POST and DELETE endpoint
Hash the request body, store it alongside the key
Use three states: in_progress, completed, failed
Scope keys by merchant ID
TTL records for 24 hours
Pass the key (or a deterministic derivative) through to downstream systems
Test the edge cases above
Log everything

The teams that get this right ship payments without double-charge incidents. The teams that don't eventually have one, usually at the worst possible moment.

If you're building on Kaadxpay or evaluating us, you can see the production version of this design in our API reference.


