Insights · Article · Engineering · May 2026
Idempotency keys, signature verification, backoff jitter, poison messages, and operator dashboards so outbound events do not become silent data loss or retry storms.

Webhooks look simple until partners slow down, TLS middleboxes break handshakes, and your workers retry the same non-idempotent side effect fifty times. The difference between a healthy integration fabric and a pager factory is explicit state machines, not hope. Every team that ships outbound events eventually discovers that delivery guarantees require dedicated engineering, not just an HTTP POST buried inside a background job.
Most webhook systems begin as a single function that serializes a payload and sends it to a URL stored in a database column. That approach works for a handful of consumers, but it falls apart when delivery counts climb into the millions per day. Queue-backed architectures decouple event production from delivery attempts, letting you throttle, prioritize, and inspect traffic independently of the business logic that triggered the event.
Start with delivery semantics you can document: at-least-once is the common default. That forces consumers to deduplicate using event IDs you stamp at enqueue time, not only at HTTP success. Generate those IDs deterministically from the aggregate root and sequence number so replays produce the same identifier. Consumers that store the event ID before processing can skip duplicates without querying upstream, reducing coupling and keeping acknowledgment windows tight.
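One way to derive those deterministic IDs is a name-based UUID over the aggregate root and sequence number. A minimal sketch; the namespace UUID and the `order-42` identifier are illustrative, and any fixed namespace works as long as it never changes once deliveries begin:

```python
import uuid

# Illustrative fixed namespace for this event stream. Pick one UUID and
# never change it, or replayed events will get fresh identifiers.
EVENT_NAMESPACE = uuid.UUID("6ba7b810-9dad-11d1-80b4-00c04fd430c8")

def event_id(aggregate_root: str, sequence: int) -> str:
    """Derive a deterministic event ID so replays reuse the same identifier."""
    return str(uuid.uuid5(EVENT_NAMESPACE, f"{aggregate_root}:{sequence}"))
```

Because the ID is a pure function of the aggregate and its sequence number, a replayed event collides with the consumer's stored ID and is skipped without any upstream lookup.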
Exactly-once delivery across network boundaries is a distributed systems myth. Accepting at-least-once semantics and building idempotent consumers is cheaper and more reliable than chasing transactional guarantees that break under partitions. Document this contract in your developer portal so integration partners design their handlers accordingly. When both sides agree on the guarantee, debugging production incidents becomes a matter of log correlation rather than philosophical argument.
The retry queue itself deserves careful schema design. Each message should carry the original payload, a delivery attempt counter, the next scheduled retry timestamp, the most recent HTTP status code, and the consumer endpoint URL. Storing these fields in a structured format rather than a serialized blob makes it possible to query retry state, build operator dashboards, and run bulk replay operations without deserializing every record in the backlog.
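The fields above translate directly into a typed record. A sketch of one possible shape, with hypothetical field names, rather than a prescribed schema:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class DeliveryRecord:
    """One row in the retry queue; every field is queryable, not buried in a blob."""
    event_id: str
    endpoint_url: str
    payload: bytes
    attempt: int = 0                         # delivery attempt counter
    next_retry_at: Optional[datetime] = None # scheduler reads this, not the payload
    last_status: Optional[int] = None        # most recent HTTP status code
```

With the retry state in first-class columns, "show every event stuck on HTTP 503 for endpoint X" becomes an indexed query instead of a full deserialization pass over the backlog.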
Signing payloads protects integrity but rotates poorly if you forget overlapping key windows. Publish two active signing keys during rotation and reject only after dependents confirm pickup of the new secret. Use HMAC-SHA256 with a timestamp included in the signed content to prevent replay attacks. Consumers should verify the signature before parsing the body, rejecting any request where the timestamp drifts more than a few minutes from server time.
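The signing and verification halves can be sketched with the standard library alone. The `timestamp.body` framing and the five-minute tolerance are assumptions for illustration; during a key rotation window, a consumer would simply run `verify` against each of the two published secrets:

```python
import hashlib
import hmac
import time
from typing import Optional

def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """HMAC-SHA256 over 'timestamp.body' so the timestamp is covered by the signature."""
    return hmac.new(secret, f"{timestamp}.".encode() + body, hashlib.sha256).hexdigest()

def verify(secret: bytes, timestamp: int, body: bytes, signature: str,
           tolerance_s: int = 300, now: Optional[int] = None) -> bool:
    """Check freshness first, then compare digests in constant time."""
    now = int(time.time()) if now is None else now
    if abs(now - timestamp) > tolerance_s:
        return False  # stale or future-dated: treat as a possible replay
    return hmac.compare_digest(sign(secret, timestamp, body), signature)
```

Note the use of `hmac.compare_digest` rather than `==`, which avoids leaking the match position through timing differences.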
Authentication and authorization deserve separate attention from payload signing. Mutual TLS adds transport layer trust but introduces certificate lifecycle complexity that many teams underestimate. A simpler alternative is a shared secret header combined with IP allowlisting. Whichever approach you choose, rotate credentials on a published schedule and provide a verification endpoint that consumers can poll to confirm their configuration is current before real traffic arrives.
Backoff should use jitter on top of exponential delay. Synchronized retries across shards recreate the thundering herd you thought you solved. A common formula adds a random component between zero and the full delay interval to each attempt. Cap maximum delay at a reasonable ceiling, typically five to fifteen minutes, so stale events do not linger in the retry queue for hours before the next attempt reaches the consumer.
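The formula described, a uniform draw between zero and the full (capped) exponential delay, is commonly called full jitter. A minimal sketch with illustrative base and ceiling values:

```python
import random

def next_delay(attempt: int, base_s: float = 1.0, cap_s: float = 600.0) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)].

    The cap keeps stale events from waiting hours; the jitter keeps
    shards that failed together from retrying together.
    """
    return random.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))
```

Because each worker draws independently, a fleet of shards that all failed at the same instant spreads its retries across the whole window instead of hammering the consumer in lockstep.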
Retry budgets prevent a single failing consumer from monopolizing worker capacity. Assign each endpoint a maximum number of in-flight retries and a daily attempt ceiling. When either limit trips, park remaining events in the dead letter queue and alert the consumer owner. This isolation strategy keeps one misconfigured partner from degrading delivery latency for every other subscriber in the system.
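A per-endpoint budget can be as simple as two counters checked before each attempt. The class name and limits below are illustrative, and a production version would reset the daily counter on a schedule and persist state across worker restarts:

```python
from collections import defaultdict

class RetryBudget:
    """Tracks in-flight retries and daily attempts per endpoint (illustrative limits)."""

    def __init__(self, max_in_flight: int = 50, daily_cap: int = 5000):
        self.max_in_flight = max_in_flight
        self.daily_cap = daily_cap
        self.in_flight = defaultdict(int)
        self.attempts_today = defaultdict(int)

    def try_acquire(self, endpoint: str) -> bool:
        """Returns False when either limit trips; caller parks the event instead."""
        if (self.in_flight[endpoint] >= self.max_in_flight
                or self.attempts_today[endpoint] >= self.daily_cap):
            return False
        self.in_flight[endpoint] += 1
        self.attempts_today[endpoint] += 1
        return True

    def release(self, endpoint: str) -> None:
        """Call after the attempt completes, whatever the outcome."""
        self.in_flight[endpoint] -= 1
```

A `False` from `try_acquire` is the signal to dead-letter the event and page the consumer owner, so one broken endpoint never saturates the shared worker pool.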
Poison messages belong in a dead letter topic with structured failure reasons: HTTP status, truncated body, timeout flag. Operators need search, not raw logs. Include the original headers, the delivery attempt history, and a correlation ID that links back to the source event. A well-designed dead letter store is queryable by endpoint, by error category, and by time range so incident responders can scope the blast radius in seconds.
Dead letter queues without replay tooling are archives, not recovery systems. Build a replay API that lets operators resubmit a filtered set of events with a single call. Support dry-run mode so teams can preview which events will fire before committing. Log every replay as a distinct delivery attempt so audit trails remain unbroken and downstream idempotency checks can distinguish original deliveries from operator-initiated retries.
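The core of such a replay call is a filter plus an optional resubmit. A sketch under the assumption that dead letters are dicts with `endpoint` and `last_status` keys and that `enqueue` is whatever callable feeds your delivery queue (both illustrative):

```python
def replay(dead_letters, enqueue, endpoint=None, status=None, dry_run=True):
    """Resubmit dead-lettered events matching the filters.

    With dry_run=True (the default), returns the matches without
    touching the queue, so operators can preview before committing.
    """
    matches = [d for d in dead_letters
               if (endpoint is None or d["endpoint"] == endpoint)
               and (status is None or d["last_status"] == status)]
    if not dry_run:
        for d in matches:
            enqueue(d)  # recorded downstream as a fresh, operator-initiated attempt
    return matches
```

Defaulting to dry run is deliberate: the dangerous path requires an explicit flag, which matches how operators actually want replay tools to behave during an incident.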
Partial success is the subtle failure mode. If your handler wrote to the database then crashed before acknowledging the queue message, your idempotency layer must recognize the duplicate and short circuit safely. Store a completion token alongside the business write in the same transaction. On retry, check for that token before executing any side effects. This pattern turns a dangerous reprocessing scenario into a cheap lookup that preserves exactly-once application semantics.
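The completion-token pattern can be sketched with SQLite standing in for the real database; the table names and the order-writing handler are illustrative. The key point is that the token insert and the business write share one transaction, so a crash leaves either both or neither:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL)")
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")

def handle(event_id: str, order_id: str, total: float) -> bool:
    """Returns True if the side effect ran, False if this was a duplicate."""
    try:
        with conn:  # one transaction: completion token + business write together
            conn.execute("INSERT INTO processed_events VALUES (?)", (event_id,))
            conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        return True
    except sqlite3.IntegrityError:
        return False  # token already present: the retry is a cheap no-op
```

The primary-key constraint on `processed_events` does the dedup check and the short circuit in one step: a redelivered event hits the constraint, the transaction rolls back, and no side effect fires twice.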

Rate limiting outbound deliveries protects both your infrastructure and your consumers. Publish per-endpoint rate limits in your API documentation and enforce them at the queue worker level. When a consumer returns HTTP 429, respect the Retry-After header and adjust your scheduling accordingly. Treating rate limit responses as temporary failures rather than permanent errors prevents unnecessary dead lettering and keeps the relationship between producer and consumer cooperative.
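One wrinkle worth handling: the Retry-After header may be either delta-seconds or an HTTP-date, and consumers occasionally send garbage. A defensive parsing sketch, with the 60-second fallback as an assumed default:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_seconds(headers: dict, default_s: float = 60.0) -> float:
    """Parse Retry-After as delta-seconds or an HTTP-date; fall back on junk."""
    value = headers.get("Retry-After")
    if value is None:
        return default_s
    try:
        return max(0.0, float(value))          # delta-seconds form, e.g. "120"
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)    # HTTP-date form
        return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())
    except (TypeError, ValueError):
        return default_s                       # unparseable: use the default
```

The worker then feeds this value into `next_retry_at` instead of its own backoff schedule, which is exactly the cooperative behavior a rate-limiting consumer is asking for.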
Observability should chart attempt histograms, age of oldest undelivered message, and consumer lag for each tenant tier. Sudden lag often precedes certificate expiry or DNS cutovers. Add percentile latency tracking for first delivery success and total time to final disposition. Alert on the derivative of failure rate rather than absolute counts so you catch regressions early without drowning in noise from endpoints that have been failing for weeks.
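Alerting on the derivative rather than the level amounts to comparing consecutive failure-rate samples. A toy sketch, with the 5-point jump threshold as an assumed tuning value:

```python
def failure_rate_jump(samples, threshold=0.05):
    """Flag when the failure rate rises by more than `threshold` between
    consecutive samples, regardless of the absolute level."""
    return any(b - a > threshold for a, b in zip(samples, samples[1:]))
```

An endpoint stuck at 40% failure for weeks stays quiet under this check, while a healthy endpoint that jumps from 2% to 10% fires immediately, which is the signal-to-noise tradeoff the paragraph above describes.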
Structured logging for every delivery attempt pays dividends during incident triage. Include the event type, endpoint identifier, attempt number, response status, and round trip latency in every log line. When these fields are indexed, an operator can reconstruct the full delivery timeline for any event in seconds. Pair logs with distributed traces that span from the originating service through the queue worker to the consumer acknowledgment for complete visibility.
Tenant isolation in a multi-consumer webhook system prevents noisy-neighbor problems. Partition retry queues by consumer tier or endpoint health score so that a single degraded subscriber does not inflate backpressure for healthy ones. Priority lanes let you guarantee delivery SLAs for premium integrations while best effort consumers absorb delays gracefully. This architecture mirrors how CDNs separate traffic classes and applies the same principle to event delivery.
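In its simplest form, a priority lane is a heap keyed by tier with FIFO order preserved within each tier. The two tier names below are illustrative, and a real system would likely add aging so best-effort events cannot starve indefinitely:

```python
import heapq
import itertools

class TieredQueue:
    """Premium events always dequeue before best-effort ones (illustrative tiers)."""

    PRIORITY = {"premium": 0, "best_effort": 1}

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # monotonic tiebreaker: FIFO within a tier

    def push(self, tier: str, event) -> None:
        heapq.heappush(self._heap, (self.PRIORITY[tier], next(self._seq), event))

    def pop(self):
        return heapq.heappop(self._heap)[2]
```
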
Contract testing with sandbox endpoints catches schema drift before production. Version webhooks in the path or header and sunset old versions with published timelines. Provide a test harness that replays recent production event shapes against staging consumers so partners can validate their parsers ahead of any schema migration. Automated compatibility checks in your CI pipeline flag breaking changes before they merge, shifting integration failures left where they cost least to fix.
A mature webhook platform is not a single service but a coordinated system of producers, queues, retry schedulers, dead letter stores, replay tools, and observability layers. Investing in each component incrementally transforms outbound events from a fragile courtesy notification into a contractual data delivery channel. Teams that treat webhook infrastructure with the same rigor as their primary API earn partner trust and reduce the operational burden of scaling integrations across dozens of consumers.