Implementation guide: webhook reliability for multi-brand portfolios
- Webhook reliability is not about retries; it is about idempotency and dead-lettering. Any retry strategy works if the downstream consumers are idempotent, and no retry strategy works if they are not.
- For multi-brand, the layer between acquirer and downstream systems does three things: verify the signature, route by brand, and fan out to subscribers. Everything else (transform, enrich, persist) happens inside individual subscribers.
- Target 99.95% delivery to downstream systems over 7 days. Anything lower costs you orders, emails, and reconciliation sanity.
Every multi-brand operator eventually hits the "why did this Klaviyo email not fire on this brand's order?" question at 11pm on a Saturday. The answer is almost always a webhook that didn't deliver, didn't retry, or delivered to the wrong subscriber. This guide is the complete production pattern for a webhook layer that sits between your parent gateway and 8–15 downstream consumers across 10–30 brands.
1. The architecture
    Acquirer → [Edge Receiver] → [Event Bus] → [Subscriber A (Klaviyo)]
                                             → [Subscriber B (3PL)]
                                             → [Subscriber C (NetSuite)]
                                             → [Subscriber D (ad pixels)]
                                             → [Subscriber E (analytics)]

The Edge Receiver is a single HTTPS endpoint. The Event Bus is SQS, Kafka, Redis Streams, or Postgres LISTEN/NOTIFY depending on your stack. Each Subscriber is a worker that reads from the bus, transforms the event, and posts to its downstream system.
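To make the fan-out step concrete, here is a minimal in-memory sketch of a bus in which every subscriber gets its own copy of each event. The names createBus, subscribe, publish, consume, and depth are illustrative only and are not the API of SQS, Kafka, or Redis Streams; a real bus would also add durability and ack/nack.

```javascript
// In-memory stand-in for the Event Bus. All names here are hypothetical;
// a production deployment would use SQS, Kafka, Redis Streams, or similar.
function createBus() {
  const queues = new Map(); // subscriber name -> array of pending messages

  return {
    // Register a subscriber so it gets its own queue.
    subscribe(name) {
      if (!queues.has(name)) queues.set(name, []);
    },
    // Fan out: every registered subscriber receives its own copy.
    publish(topic, message) {
      for (const queue of queues.values()) {
        queue.push({ topic, ...message });
      }
    },
    // Pull the oldest pending message for one subscriber, or null if empty.
    consume(name) {
      const queue = queues.get(name);
      return queue && queue.length > 0 ? queue.shift() : null;
    },
    // Pending message count, usable as a "bus depth" metric.
    depth(name) {
      return queues.has(name) ? queues.get(name).length : 0;
    },
  };
}

const bus = createBus();
bus.subscribe('klaviyo-subscriber');
bus.subscribe('netsuite-subscriber');
bus.publish('charge.succeeded', { id: 'evt_1', brand_id: 'peptides_a' });
```

Because each subscriber drains its own queue, a slow or failing NetSuite worker in this model does not delay the Klaviyo worker.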
2. Edge Receiver: the minimum viable endpoint
    // POST /webhooks/acquirer
    // rawBody(), verifySig(), bus, and WEBHOOK_SECRET are assumed to be defined elsewhere.
    app.post('/webhooks/acquirer', rawBody(), async (req, res) => {
      // 1. Verify signature; if invalid, return 401 immediately
      const sig = req.headers['acquirer-signature'];
      if (!verifySig(req.rawBody, sig, WEBHOOK_SECRET)) return res.status(401).end();

      try {
        // 2. Parse and enqueue; do NOT process inline
        const event = JSON.parse(req.rawBody);
        await bus.publish(event.type, {
          id: event.id,
          type: event.type,
          brand_id: event.data.metadata.brand_id,
          payload: event,
          received_at: new Date().toISOString(),
        });
      } catch (err) {
        // Parse or enqueue failed: return 500 so the acquirer retries later
        return res.status(500).end();
      }

      // 3. Ack fast (target: under 1 second)
      res.status(200).end();
    });

The Edge Receiver must ack in under 3 seconds or the acquirer will retry. Never do business logic inside the receiver: only signature verification and enqueue.
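The receiver calls verifySig without defining it. One possible shape is sketched below, under the assumption that the acquirer sends a hex-encoded HMAC-SHA256 of the raw request body. That scheme is an assumption, not something this guide specifies; many acquirers also sign a timestamp or use a structured header, so check their documentation before relying on this.

```javascript
const crypto = require('node:crypto');

// Assumption: the signature header carries a hex-encoded HMAC-SHA256 of the
// raw body, keyed with the webhook secret. Real schemes may differ.
function verifySig(rawBody, signatureHex, secret) {
  if (typeof signatureHex !== 'string' || !secret) return false;

  const expected = crypto.createHmac('sha256', secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, 'hex');

  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (received.length !== expected.length) return false;

  // Constant-time comparison avoids leaking how many bytes matched.
  return crypto.timingSafeEqual(received, expected);
}
```

The signature must be computed over the exact bytes received, which is why the receiver keeps the raw body instead of re-serializing the parsed JSON.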
3. Idempotency at the Edge
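Acquirers retry webhooks. Your Edge must dedupe on event.id. Use the Redis SET command with NX and a 7-day TTL:

    // SET ... NX returns 'OK' if the key was created and null if it already existed
    const firstDelivery = await redis.set(`webhook:${event.id}`, 1, { EX: 604800, NX: true });
    if (!firstDelivery) return res.status(200).end(); // already processed, silent ack
    await bus.publish(...);

Seven days covers the retry window of most acquirers; confirm the window for yours. Longer TTLs waste memory; shorter TTLs risk double-delivery on long retry storms. If bus.publish fails after the key is set, delete the key before returning an error, otherwise the acquirer's retry is silently dropped as a duplicate.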
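The dedupe key is written before the event is enqueued, so a failed enqueue needs a rollback path or the next retry will look like a duplicate. A minimal sketch of that claim-then-publish flow follows. publishOnce is a hypothetical helper, redis is assumed to expose node-redis style set and del, and publish stands in for whatever enqueues to your bus.

```javascript
const SEVEN_DAYS_SECONDS = 7 * 24 * 60 * 60; // 604800

// Hypothetical helper: claims the event id, publishes, and releases the
// claim if publishing fails. Resolves to 'published' or 'duplicate'.
async function publishOnce(redis, publish, event) {
  const key = `webhook:${event.id}`;

  // SET ... NX returns 'OK' when the key was created, null if it existed.
  const claimed = await redis.set(key, 1, { EX: SEVEN_DAYS_SECONDS, NX: true });
  if (!claimed) return 'duplicate';

  try {
    await publish(event);
    return 'published';
  } catch (err) {
    // Release the claim so the acquirer's next retry is not dropped.
    await redis.del(key);
    throw err;
  }
}
```

In the receiver, a 'duplicate' result maps to a silent 200, and a thrown error maps to a 5xx so the acquirer retries.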
4. Brand routing
Every event must carry a brand_id. The parent gateway attaches this via metadata.brand_id on every charge. If an event arrives without brand_id, treat it as a critical error — log, alert, and drop (do not guess).
    if (!event.data.metadata?.brand_id) {
      await alert.critical(`Webhook missing brand_id: ${event.id} ${event.type}`);
      return res.status(200).end(); // ack to avoid retry storm
    }

The Event Bus topic should include the brand: charge.succeeded.peptides_a, charge.refunded.nutra_b. This lets subscribers filter cleanly.
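One way to build and match brand-scoped topics is sketched below. topicFor and matchesBrand are hypothetical helper names; the topic format is simply the event type followed by the brand_id, as in the examples above.

```javascript
// Builds a bus topic of the form "<event type>.<brand_id>".
// Returns null when brand_id is missing so the caller can alert and ack.
function topicFor(event) {
  const brandId = event?.data?.metadata?.brand_id;
  if (typeof brandId !== 'string' || brandId.length === 0) return null;
  return `${event.type}.${brandId}`;
}

// Subscribers can filter on the brand suffix of the topic.
function matchesBrand(topic, brandId) {
  return topic.endsWith(`.${brandId}`);
}
```

Returning null rather than a default brand keeps the "do not guess" rule enforceable in one place.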
5. Subscriber workers
Each subscriber is a worker process pulling from the Event Bus. Subscriber responsibilities: transform the acquirer-shape event to the target-system shape, POST to the target system, handle 4xx (do not retry, move to DLQ) vs 5xx (retry with backoff) responses, emit metrics.
    // subscriber-klaviyo.js
    while (true) {
      const msg = await bus.consume('klaviyo-subscriber');
      if (!msg) { await sleep(1000); continue; } // queue empty: poll again in 1s
      try {
        const payload = transformToKlaviyo(msg.payload);
        const res = await fetch('https://a.klaviyo.com/api/events/', {
          method: 'POST',
          headers: { Authorization: `Klaviyo-API-Key ${KEY}`, 'Content-Type': 'application/json' },
          body: JSON.stringify(payload),
        });
        if (res.ok) await bus.ack(msg);
        // 5xx and 429 are transient: retry with backoff
        else if (res.status >= 500 || res.status === 429) await bus.nack(msg, { retry: true });
        else await bus.deadletter(msg, `HTTP ${res.status}`); // permanent failure
      } catch (e) {
        // Network error or transform failure: retry with backoff
        await bus.nack(msg, { retry: true });
      }
    }

6. Retry strategy
Exponential backoff: 30s, 2min, 10min, 1hr, 6hr, 24hr, then dead-letter. That is 6 retries (7 delivery attempts counting the first) over ~31 hours. Anything that does not succeed in 31 hours is almost certainly never going to, so put it in the DLQ and alert ops.
Jitter every retry interval by ±20% to avoid thundering herd when a downstream system goes down and comes back.
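The schedule and jitter can be expressed as a small lookup function. This is a sketch: retryDelayMs and its injectable random parameter are illustrative names, and the delays are the ones listed above.

```javascript
// Base delays from the schedule above, in seconds.
const RETRY_DELAYS_SECONDS = [30, 120, 600, 3600, 21600, 86400];
const JITTER = 0.2; // plus or minus 20%

// retryNumber is 1-based. Returns the delay in milliseconds before that
// retry, or null when the schedule is exhausted and the message should be
// dead-lettered. `random` is injectable so the function can be tested.
function retryDelayMs(retryNumber, random = Math.random) {
  const base = RETRY_DELAYS_SECONDS[retryNumber - 1];
  if (base === undefined) return null;
  const factor = 1 + (random() * 2 - 1) * JITTER; // uniform in [0.8, 1.2)
  return Math.round(base * 1000 * factor);
}

// Total un-jittered wait across all retries, in hours (about 31.2).
const totalHours = RETRY_DELAYS_SECONDS.reduce((a, b) => a + b, 0) / 3600;
```

Because each worker draws its own jitter, messages that failed at the same moment are spread across a window instead of retrying in lockstep.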
7. The dead-letter queue
The DLQ is a Postgres table with these columns: event_id, subscriber, brand_id, error, first_failed_at, retry_count, payload. A daily cron surfaces DLQ items by brand and subscriber to Slack. Ops reviews each item and either fixes the downstream issue and manually re-enqueues it, or acknowledges the permanent failure (e.g., customer deleted).
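The daily digest is essentially a group-by over DLQ rows. A sketch of that step is below, assuming the rows have already been read from the table with the brand_id and subscriber columns described above. summarizeDlq and formatDigest are hypothetical names, and posting the text to Slack is left out.

```javascript
// Groups DLQ rows by brand and subscriber, largest groups first.
function summarizeDlq(rows) {
  const counts = new Map(); // "brand_id|subscriber" -> count
  for (const row of rows) {
    const key = `${row.brand_id}|${row.subscriber}`;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts.entries()]
    .map(([key, count]) => {
      const [brand_id, subscriber] = key.split('|');
      return { brand_id, subscriber, count };
    })
    .sort((a, b) => b.count - a.count || a.brand_id.localeCompare(b.brand_id));
}

// Renders the summary as plain text, one line per brand/subscriber pair.
function formatDigest(summary) {
  if (summary.length === 0) return 'DLQ is empty.';
  return summary
    .map((s) => `${s.brand_id} / ${s.subscriber}: ${s.count} failed`)
    .join('\n');
}
```

Sorting by count puts the brand and subscriber pair with the most failures at the top of the message, which is usually where ops should look first.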
8. Signature verification at scale
Each parent gateway has its own signature scheme. If you have multiple parent accounts (e.g., separate account for high-risk verticals), the Edge Receiver needs to map the incoming signature to the right verification key. Use a header hint or a different endpoint per account.
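    // multi-account signature verification
    const account = req.headers['x-acquirer-account'] || 'default';
    // Own-property lookup so values like "constructor" cannot match inherited keys
    const secret = Object.hasOwn(WEBHOOK_SECRETS, account) ? WEBHOOK_SECRETS[account] : undefined;
    if (!secret) return res.status(400).end();
    if (!verifySig(req.rawBody, req.headers['acquirer-signature'], secret)) return res.status(401).end();

9. Observability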
Four metrics cover 95% of webhook debugging:
- Edge ack p99 latency: must stay under 2 seconds.
- Bus depth per subscriber: growing queue = subscriber is slow or down.
- Subscriber success rate per brand: if brand_A's Klaviyo success rate drops, there is a brand-specific issue.
- DLQ rate per day per brand: baseline 0–5 per day for a healthy portfolio.
Ship these to Datadog, CloudWatch, or Grafana. Alert on any metric crossing threshold for 10+ minutes.
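Most monitoring tools implement the "crossing threshold for 10+ minutes" rule natively. If you need to evaluate it yourself, the logic can be sketched as follows; breachedFor and the sample shape are assumptions made for illustration.

```javascript
// Returns true when the metric has been above `threshold` continuously for
// at least `windowMs`, judged from the most recent samples.
// samples: [{ t: epoch milliseconds, value: number }], sorted by t ascending.
function breachedFor(samples, threshold, windowMs, now) {
  let runStart = null; // timestamp where the current breaching run began

  // Walk backwards from the newest sample until the breach is interrupted.
  for (let i = samples.length - 1; i >= 0; i--) {
    if (samples[i].value <= threshold) break;
    runStart = samples[i].t;
  }
  return runStart !== null && now - runStart >= windowMs;
}

// Example: Edge ack p99 above 2000 ms for 10 minutes.
const TEN_MINUTES_MS = 10 * 60 * 1000;
```

Requiring a sustained breach rather than a single bad sample keeps one slow minute from paging anyone.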
10. Gotchas
- HTTP 200 on partial success: Some downstream APIs return 200 but include an error in the JSON body. Always inspect response body, not just status code.
- Rate limits: Klaviyo caps at 350 req/sec, NetSuite at 60/min (verify the current limits for your plan and endpoint). Each subscriber needs its own per-target rate limiter, not just a global retry policy.
- Event ordering: Bus consumers see events in receive order, not create order. If brand logic depends on ordering (refund must follow charge), enforce order at the subscriber by fetching current state before applying.
- Payload drift: Acquirers update webhook payload schemas. Version-tag each payload shape and keep transformers version-aware. Do not assume today's event.data.object shape holds in 6 months.
- Secrets rotation: When you rotate the webhook secret, the acquirer will send events signed with both old and new secrets for a brief window. Support both during rotation.
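The first gotcha changes how a subscriber should decide between ack, retry, and dead-letter: the status code alone is not enough. A sketch of a classifier that also inspects the body is below. classifyResponse is a hypothetical helper, and the errors array is an assumed body shape; the real field depends on each downstream API.

```javascript
// Classifies a downstream response into 'ack', 'retry', or 'deadletter'.
// Assumption: the target reports soft failures in an `errors` array in the
// JSON body. Adjust the check to the actual API you are calling.
function classifyResponse(status, bodyText) {
  if (status === 429 || status >= 500) return 'retry'; // transient
  if (status < 200 || status >= 300) return 'deadletter'; // permanent

  if (!bodyText) return 'ack'; // success with an empty body
  let body;
  try {
    body = JSON.parse(bodyText);
  } catch {
    return 'ack'; // non-JSON success body: nothing further to inspect
  }
  const hasErrors = Array.isArray(body?.errors) && body.errors.length > 0;
  return hasErrors ? 'deadletter' : 'ack';
}
```

Whether a 200 with errors should be retried or dead-lettered depends on the target system; dead-lettering is the conservative choice here because it puts the event in front of a human.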
11. Load test before go-live
Before putting the layer in production, replay the previous day's events (a full 24 hours) at 10x speed through the Edge. Confirm that the bus does not overflow, that no subscriber dead-letters unexpectedly, and that p99 latency holds. If any test leg fails, fix it before launch: a Black Friday spike will expose every weakness at once.
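Replaying at 10x means keeping the original spacing between events but dividing every gap by ten, so 24 hours of traffic runs in about 2.4 hours. A sketch of computing that send schedule from recorded events follows; replaySchedule is a hypothetical helper and assumes each recorded event carries the received_at timestamp written by the Edge Receiver.

```javascript
// Computes when to send each recorded event so the original spacing is
// preserved but compressed by `speed` (10 means 10x faster).
// events: [{ id, received_at: ISO 8601 string }]
// Returns [{ id, offsetMs }] where offsetMs is measured from replay start.
function replaySchedule(events, speed = 10) {
  if (!(speed > 0)) throw new RangeError('speed must be positive');

  // Sort a copy by original receive time; the input is left untouched.
  const sorted = [...events].sort(
    (a, b) => Date.parse(a.received_at) - Date.parse(b.received_at)
  );
  if (sorted.length === 0) return [];

  const start = Date.parse(sorted[0].received_at);
  return sorted.map((e) => ({
    id: e.id,
    offsetMs: Math.round((Date.parse(e.received_at) - start) / speed),
  }));
}
```

A replay driver would then schedule each POST to the Edge at its offsetMs, signing the body with a test secret so the signature check still runs.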