
Implementation guide: webhook reliability for multi-brand portfolios

3-minute scan
  • Webhook reliability is not about retries; it is about idempotency and dead-lettering. Any retry strategy works if the downstream consumers are idempotent, and no retry strategy works if they are not.
  • For multi-brand, the layer between acquirer and downstream systems does three things: verify signature, route by brand, fan-out to subscribers. Everything else (transform, enrich, persist) happens inside individual subscribers.
  • Target 99.95% delivery to downstream systems over 7 days. Anything lower costs you orders, emails, and reconciliation sanity.

    Every multi-brand operator eventually hits the "why did this Klaviyo email not fire on this brand's order?" question at 11pm on a Saturday. The answer is almost always a webhook that didn't deliver, didn't retry, or delivered to the wrong subscriber. This guide is the complete production pattern for a webhook layer that sits between your parent gateway and 8–15 downstream consumers across 10–30 brands.

    1. The architecture

    Acquirer → [Edge Receiver] → [Event Bus] → [Subscriber A (Klaviyo)]
                                              → [Subscriber B (3PL)]
                                              → [Subscriber C (NetSuite)]
                                              → [Subscriber D (ad pixels)]
                                              → [Subscriber E (analytics)]

    The Edge Receiver is a single HTTPS endpoint. The Event Bus is SQS, Kafka, Redis Streams, or Postgres LISTEN/NOTIFY depending on your stack. Each Subscriber is a worker that reads from the bus, transforms the event, and posts to its downstream system.
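
    The rest of this guide talks to the bus through five methods: publish, consume, ack, nack, and deadletter. Those names are this guide's convention, not a library API; a thin adapter maps them onto whichever backend you picked. A minimal sketch of the contract:

    // bus.js: the interface the Edge Receiver and subscribers assume.
    // Back it with SQS, Kafka, Redis Streams, or Postgres LISTEN/NOTIFY;
    // the adapter is the only place that knows which one.
    const bus = {
      async publish(topic, message) { /* enqueue message on topic */ },
      async consume(subscription) { /* return the next message for this subscriber, or null */ },
      async ack(msg) { /* remove the message; it will not be redelivered */ },
      async nack(msg, opts) { /* make it visible again, optionally after a delay */ },
      async deadletter(msg, reason) { /* move the message to the DLQ with a reason */ },
    };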

    2. Edge Receiver: the minimum viable endpoint

    // POST /webhooks/acquirer
    app.post('/webhooks/acquirer', rawBody(), async (req, res) => {
      // 1. Verify signature — if invalid, 401 immediately
      const sig = req.headers['acquirer-signature'];
      if (!verifySig(req.rawBody, sig, WEBHOOK_SECRET)) return res.status(401).end();
    
      // 2. Parse, log, enqueue — do NOT process inline
      const event = JSON.parse(req.rawBody);
      await bus.publish(event.type, {
        id: event.id,
        type: event.type,
        brand_id: event.data.metadata.brand_id,
        payload: event,
        received_at: new Date().toISOString(),
      });
    
      // 3. Ack to acquirer within 1 second
      res.status(200).end();
    });

    The Edge Receiver must ack in under 3 seconds or the acquirer will retry. Never do business logic inside the receiver — only signature verification and enqueue.
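
    verifySig appears in both receiver snippets but is never defined. A minimal sketch, assuming the acquirer signs the raw body with HMAC-SHA256 and sends the hex digest in the signature header; real schemes usually add a timestamp to defeat replay, so follow your acquirer's docs exactly:

    const crypto = require('crypto');

    // Assumption: signature header = hex(HMAC-SHA256(secret, rawBody)).
    function verifySig(rawBody, signatureHeader, secret) {
      if (!signatureHeader) return false;
      const expected = crypto.createHmac('sha256', secret).update(rawBody).digest('hex');
      const a = Buffer.from(signatureHeader, 'utf8');
      const b = Buffer.from(expected, 'utf8');
      // timingSafeEqual throws on length mismatch, so check length first
      if (a.length !== b.length) return false;
      return crypto.timingSafeEqual(a, b);
    }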

    3. Idempotency at the Edge

    Acquirers retry webhooks, so your Edge must dedupe on event.id. Use Redis SET with NX and a 7-day TTL:

    const firstSeen = await redis.set(`webhook:${event.id}`, 1, { EX: 604800, NX: true });
    if (!firstSeen) return res.status(200).end();  // duplicate delivery: silent ack, no re-enqueue
    await bus.publish(...);

    Seven days is the standard retry window for most acquirers. Longer TTLs waste memory; shorter TTLs risk double-delivery on long retry storms.

    4. Brand routing

    Every event must carry a brand_id. The parent gateway attaches this via metadata.brand_id on every charge. If an event arrives without brand_id, treat it as a critical error — log, alert, and drop (do not guess).

    if (!event.data.metadata?.brand_id) {
      await alert.critical(`Webhook missing brand_id: ${event.id} ${event.type}`);
      return res.status(200).end();  // ack to avoid retry storm
    }

    The Event Bus topic should include the brand: charge.succeeded.peptides_a, charge.refunded.nutra_b. This lets subscribers filter cleanly.
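
    A sketch of the brand-scoped publish inside the Edge Receiver, assuming the bus accepts arbitrary topic strings (topic names are illustrative):

    // e.g. topic = "charge.succeeded.peptides_a"
    const brandId = event.data.metadata.brand_id;
    await bus.publish(`${event.type}.${brandId}`, {
      id: event.id,
      type: event.type,
      brand_id: brandId,
      payload: event,
      received_at: new Date().toISOString(),
    });

    Subscribers then bind by event type, by brand, or both, depending on what filtering your bus supports.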

    5. Subscriber workers

    Each subscriber is a worker process pulling from the Event Bus. Its responsibilities: transform the acquirer-shaped event into the target-system shape, POST it to the target system, handle the response (5xx and 429 retry with backoff; other 4xx are permanent failures that go to the DLQ), and emit metrics.

    // subscriber-klaviyo.js
    const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

    while (true) {
      const msg = await bus.consume('klaviyo-subscriber');
      if (!msg) { await sleep(1000); continue; }  // empty poll, back off for a second
    
      try {
        const payload = transformToKlaviyo(msg.payload);
        const res = await fetch('https://a.klaviyo.com/api/events/', {
          method: 'POST',
          headers: { Authorization: `Klaviyo-API-Key ${KEY}`, 'Content-Type': 'application/json' },
          body: JSON.stringify(payload),
        });
    
        if (res.ok) await bus.ack(msg);
        else if (res.status >= 500 || res.status === 429) await bus.nack(msg, { retry: true });
        else await bus.deadletter(msg, `HTTP ${res.status}`);  // permanent failure
      } catch (e) {
        await bus.nack(msg, { retry: true });
      }
    }

    6. Retry strategy

    Exponential backoff: 30s, 2min, 10min, 1hr, 6hr, 24hr, then dead-letter. That is six retries over roughly 31 hours. Anything that has not succeeded after 31 hours almost certainly never will; put it in the DLQ and alert ops.

    Jitter every retry interval by ±20% to avoid thundering herd when a downstream system goes down and comes back.
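
    A sketch of the schedule with jitter, using the bus convention from the subscriber example; retryCount on the message and the delaySeconds option on nack are assumptions about how your bus tracks and schedules redelivery:

    // Backoff schedule in seconds: 30s, 2min, 10min, 1hr, 6hr, 24hr, then dead-letter
    const BACKOFF_SECONDS = [30, 120, 600, 3600, 21600, 86400];

    function nextRetryDelay(retryCount) {
      if (retryCount >= BACKOFF_SECONDS.length) return null;   // retries exhausted
      const base = BACKOFF_SECONDS[retryCount];
      const jitter = base * 0.2 * (Math.random() * 2 - 1);     // ±20%
      return Math.round(base + jitter);
    }

    // In the subscriber's failure path:
    const delay = nextRetryDelay(msg.retryCount || 0);
    if (delay === null) await bus.deadletter(msg, 'retries exhausted');
    else await bus.nack(msg, { retry: true, delaySeconds: delay });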

    7. The dead-letter queue

    The DLQ is a Postgres table: event_id, subscriber, brand_id, error, first_failed_at, retry_count, payload. A daily cron surfaces DLQ items to Slack, grouped by brand and subscriber. Ops reviews each item and either fixes the downstream issue and manually re-enqueues it, or acknowledges the permanent failure (e.g., customer deleted).
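
    A sketch of the write behind bus.deadletter, assuming the table is named webhook_dlq, has a unique constraint on (event_id, subscriber), and you are using node-postgres; pool setup is omitted:

    async function deadletter(pool, msg, subscriber, error) {
      // Keep first_failed_at from the first insert; later failures only bump error and retry_count.
      await pool.query(
        `INSERT INTO webhook_dlq
           (event_id, subscriber, brand_id, error, first_failed_at, retry_count, payload)
         VALUES ($1, $2, $3, $4, now(), $5, $6)
         ON CONFLICT (event_id, subscriber)
           DO UPDATE SET error = EXCLUDED.error, retry_count = EXCLUDED.retry_count`,
        [msg.id, subscriber, msg.brand_id, error, msg.retryCount || 0, JSON.stringify(msg.payload)]
      );
    }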

    8. Signature verification at scale

    Each parent gateway has its own signature scheme. If you have multiple parent accounts (e.g., separate account for high-risk verticals), the Edge Receiver needs to map the incoming signature to the right verification key. Use a header hint or a different endpoint per account.

    // multi-account signature verification
    const account = req.headers['x-acquirer-account'] || 'default';
    const secret = WEBHOOK_SECRETS[account];
    if (!secret) return res.status(400).end();   // unknown account: reject
    if (!verifySig(req.rawBody, req.headers['acquirer-signature'], secret)) return res.status(401).end();

    9. Observability

    Four metrics cover 95% of webhook debugging:

    1. Edge ack p99 latency: must stay under 2 seconds.
    2. Bus depth per subscriber: growing queue = subscriber is slow or down.
    3. Subscriber success rate per brand: if brand_A's Klaviyo success rate drops, there is a brand-specific issue.
    4. DLQ rate per day per brand: baseline 0–5 per day for a healthy portfolio.

    Ship these to Datadog, CloudWatch, or Grafana. Alert when any metric stays past its threshold for 10+ minutes.
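
    A sketch of the emit calls, with "metrics" standing in for whichever StatsD/Datadog/CloudWatch client you use; the metric names and tagging style are illustrative, not a vendor API:

    // 1. edge ack latency (record in the Edge Receiver after res.end())
    metrics.timing('webhook.edge.ack_ms', Date.now() - startedAt);
    // 2. bus depth (poll each subscriber queue on a timer)
    metrics.gauge('webhook.bus.depth', depth, { subscriber: 'klaviyo' });
    // 3. subscriber success rate, tagged by brand
    metrics.increment('webhook.subscriber.success', { subscriber: 'klaviyo', brand: msg.brand_id });
    // 4. dead-letter count, tagged by brand
    metrics.increment('webhook.dlq', { subscriber: 'klaviyo', brand: msg.brand_id });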

    10. Gotchas

    • HTTP 200 on partial success: Some downstream APIs return 200 but include an error in the JSON body. Always inspect response body, not just status code.
    • Rate limits: Klaviyo caps at 350 req/sec, NetSuite at 60/min. The subscriber needs a per-target rate limit, not just a global retry; see the token-bucket sketch after this list.
    • Event ordering: Bus consumers see events in receive order, not create order. If brand logic depends on ordering (refund must follow charge), enforce order at the subscriber by fetching current state before applying.
    • Payload drift: Acquirers update webhook payload schemas. Version-tag each payload shape and keep transformers version-aware. Do not assume today's event.data.object shape holds in 6 months.
    • Secrets rotation: When you rotate the webhook secret, the acquirer will send events signed with both old and new secrets for a brief window. Support both during rotation.
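
    For the rate-limit gotcha above, a minimal per-target token bucket the subscriber can await before each POST. The limits shown are the ones quoted in the list; the class itself is a generic sketch, not a library:

    class TokenBucket {
      constructor(ratePerSec, burst) {
        this.ratePerSec = ratePerSec;
        this.capacity = burst;
        this.tokens = burst;
        this.last = Date.now();
      }
      async take() {
        for (;;) {
          const now = Date.now();
          this.tokens = Math.min(this.capacity, this.tokens + ((now - this.last) / 1000) * this.ratePerSec);
          this.last = now;
          if (this.tokens >= 1) { this.tokens -= 1; return; }
          // Wait roughly long enough for the next token to refill.
          await new Promise((r) => setTimeout(r, ((1 - this.tokens) / this.ratePerSec) * 1000));
        }
      }
    }

    const limits = {
      klaviyo: new TokenBucket(350, 350),   // ~350 req/sec
      netsuite: new TokenBucket(1, 1),      // ~60 req/min
    };

    // In the Klaviyo subscriber, before the fetch:
    await limits.klaviyo.take();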

    11. Load test before go-live

    Before putting the layer in production, replay 24 hours of the previous day's events at 10x speed through the Edge. Confirm that the bus does not overflow, that no subscriber dead-letters unexpectedly, and that p99 ack latency holds. If any leg of the test fails, fix it before launch; a Black Friday spike will expose every weakness at once.
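
    A sketch of that replay, assuming yesterday's events were archived as newline-delimited JSON with their original received_at timestamps, and that signFor re-signs each body with the staging webhook secret; the file name, field names, and helper are assumptions:

    // replay.js: re-POST archived events to the staging Edge at 10x original pace
    const fs = require('fs');
    const readline = require('readline');

    async function replay(file, edgeUrl, speedup = 10) {
      const rl = readline.createInterface({ input: fs.createReadStream(file) });
      let prevTs = null;
      for await (const line of rl) {
        const event = JSON.parse(line);
        const ts = new Date(event.received_at).getTime();
        if (prevTs !== null) {
          // Compress the original inter-event gaps by the speedup factor.
          await new Promise((r) => setTimeout(r, Math.max(0, (ts - prevTs) / speedup)));
        }
        prevTs = ts;
        const body = JSON.stringify(event.payload);
        await fetch(edgeUrl, {
          method: 'POST',
          headers: { 'Content-Type': 'application/json', 'acquirer-signature': signFor(body) },
          body,
        });
      }
    }

    replay('events-yesterday.ndjson', 'https://staging.example.com/webhooks/acquirer');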

    Ready to harden your webhook layer? Start with the 12-question intake — we will audit your current webhook topology in 48 hours. Or see pricing and parent account architecture.


    FAQ

    Do we really need a bus, or can we fan out directly from the Edge?
    For under 3 subscribers, direct fan-out is fine. Above 3 you want a bus so slow subscribers cannot back-pressure the Edge ack.
    What about AWS EventBridge vs SQS?
    EventBridge is simpler for fan-out with filter patterns. SQS is simpler for per-subscriber retry control. Most multi-brand ops end up with EventBridge at the front and SQS per subscriber.
    How do we test signature verification?
    Most parent gateways ship a CLI that sends signed test events. Use it in staging. Never disable signature verification "just for debugging" — that is how webhook endpoints get exploited.
    Can we use Zapier as the layer?
    For under ~5k events/month yes. Above that the Zapier cost exceeds the cost of a self-hosted layer and you lose idempotency control. Not recommended at multi-brand scale.
    How big of a team do we need to own this?
    One senior engineer for initial build (2–3 weeks), then 5–10% of one engineer's ongoing time for maintenance and DLQ triage.

