Event ID & Retry Mechanism Design

Problem & Intent: Deterministic Stream Resumption

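Real-time applications require deterministic message delivery across unstable networks. Without explicit Event ID and retry design, clients face duplicate processing, state drift, or silent message loss. The core intent is to establish a strict resumption contract. The client records the ID of the last event it received and automatically reconnects with that exact cursor. The server resolves the cursor and resumes the stream from the next event. SSE has no acknowledgement channel, so the cursor the client reports is the only delivery signal the server receives.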

This mechanism anchors reliable streaming and complements the broader SSE Protocol Fundamentals & Architecture. You must design for three explicit objectives:

Architecture & Explicit Configuration

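Implement a monotonic ID strategy, either time-based or sequence-based. The server must emit the id: and retry: fields in the same event block as the data: payload, before the blank line that terminates the event. Field order inside the block does not matter, although placing id: and retry: first is a common convention. The native EventSource reads the retry: directive and uses it as the reconnection delay in milliseconds. If the field is omitted, the delay is browser-defined, commonly a few seconds (around 3000ms in Chrome). On every reconnect, EventSource automatically attaches the Last-Event-ID header, provided it has received at least one non-empty id: value.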

retry: 2000
id: evt_1715892000_001
event: message
data: {"type":"order_update","status":"confirmed"}
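The framing above can be produced by a small serializer. The following is a minimal sketch, not a reference implementation; the function name format_sse_event and its parameters are hypothetical and chosen only for illustration.

```python
from typing import Optional


def format_sse_event(data: str, event_id: Optional[str] = None,
                     event: Optional[str] = None,
                     retry_ms: Optional[int] = None) -> str:
    """Serialize one SSE event block, terminated by a blank line."""
    lines = []
    if retry_ms is not None:
        # retry: is only honored when the value is ASCII digits.
        if retry_ms < 0:
            raise ValueError("retry_ms must be non-negative")
        lines.append(f"retry: {int(retry_ms)}")
    if event_id is not None:
        # A line break inside the id would corrupt the framing,
        # and clients ignore ids that contain NUL.
        if any(ch in event_id for ch in "\r\n\0"):
            raise ValueError("event id must not contain CR, LF, or NUL")
        lines.append(f"id: {event_id}")
    if event is not None:
        if "\r" in event or "\n" in event:
            raise ValueError("event name must not contain CR or LF")
        lines.append(f"event: {event}")
    # Multi-line payloads need one data: line per line of text.
    normalized = data.replace("\r\n", "\n").replace("\r", "\n")
    for chunk in normalized.split("\n"):
        lines.append(f"data: {chunk}")
    return "\n".join(lines) + "\n\n"
```

Calling it with the values from the example block yields the same four lines followed by the terminating blank line.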

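Scale horizontally by decoupling event generation from stream delivery. Route generation through a partitioned message broker. Store recent events in a Redis sorted set scored by sequence number. Use ZADD to append (O(log N) per insert) and ZRANGEBYSCORE to read everything after a cursor. EXPIRE sets a TTL on the whole key, not on individual members, so trim old entries with ZREMRANGEBYSCORE and reserve EXPIRE for cleaning up abandoned streams. Stateless stream nodes pull from the cache on reconnect.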
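To make the cursor cache concrete, here is a minimal in-memory sketch of the access pattern. It is a stand-in for a sorted set, not Redis client code; the class CursorCache and its method names are hypothetical, and the comments name the Redis commands each method is meant to resemble.

```python
import bisect
from typing import List, Tuple


class CursorCache:
    """In-memory stand-in for a sorted set of (sequence, payload) entries."""

    def __init__(self) -> None:
        self._scores: List[int] = []    # kept sorted, like sorted-set scores
        self._payloads: List[str] = []  # payloads aligned with _scores

    def add(self, seq: int, payload: str) -> None:
        # Comparable to ZADD: insert while keeping score order.
        idx = bisect.bisect_right(self._scores, seq)
        self._scores.insert(idx, seq)
        self._payloads.insert(idx, payload)

    def events_after(self, last_seq: int) -> List[Tuple[int, str]]:
        # Comparable to ZRANGEBYSCORE with an exclusive lower bound:
        # everything strictly newer than the client's cursor.
        idx = bisect.bisect_right(self._scores, last_seq)
        return list(zip(self._scores[idx:], self._payloads[idx:]))

    def evict_through(self, max_seq: int) -> int:
        # Comparable to ZREMRANGEBYSCORE: drop entries at or below max_seq
        # and report how many were removed.
        idx = bisect.bisect_right(self._scores, max_seq)
        del self._scores[:idx]
        del self._payloads[:idx]
        return idx
```

A stream node would call events_after with the sequence parsed from Last-Event-ID, replay the result, then continue with live events.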

Debug connection drops systematically. Simulate a reconnect with curl, substituting your own origin for the placeholder host:

curl -N -H "Last-Event-ID: evt_1715892000_001" https://example.com/stream

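Verify that server-side cursor resolution matches the header exactly. Inspect dispatched events in the Chrome DevTools Network panel under the EventStream tab, and confirm the retry: line in the raw curl output, because retry: is a field in the response body and not an HTTP header. If retry: lines vanish or events arrive in bursts, suspect a reverse proxy that buffers or rewrites the response body. If Last-Event-ID never reaches the origin, suspect a proxy that strips the request header.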

Edge Cases & Failure Modes

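Network flapping triggers aggressive reconnect storms. Mitigate by enforcing a server-side minimum for the retry: value. Rate-limit reconnects at the proxy layer, and implement exponential backoff with jitter in any custom client, since backoff is a client-side behavior. Reject rapid reconnection attempts that bypass the published interval. Keep in mind that native EventSource treats a non-200 response as fatal and stops reconnecting, so the client must handle the error event and reconnect itself after a rejection.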
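The two halves of this mitigation can be sketched as follows: a client-side delay calculation and a server-side check for early reconnects. This is an illustrative sketch under assumptions; backoff_ms, ReconnectGuard, the 2000ms base, and the 30000ms cap are hypothetical names and values, and a production guard would also need to expire idle client entries.

```python
import random
from typing import Dict, Optional


def backoff_ms(attempt: int, base_ms: int = 2000, cap_ms: int = 30000,
               rng: Optional[random.Random] = None) -> int:
    """Exponential backoff with jitter, never below the published base."""
    rng = rng if rng is not None else random.Random()
    ceiling = max(base_ms, min(cap_ms, base_ms * (2 ** attempt)))
    # Pick uniformly between the minimum interval and the current ceiling
    # so that clients dropped at the same moment do not reconnect in lockstep.
    return rng.randint(base_ms, ceiling)


class ReconnectGuard:
    """Tracks the last accepted connect time per client and flags early reconnects."""

    def __init__(self, min_interval_ms: int) -> None:
        self.min_interval_ms = min_interval_ms
        self._last_seen: Dict[str, float] = {}

    def allow(self, client_id: str, now_ms: float) -> bool:
        last = self._last_seen.get(client_id)
        if last is not None and now_ms - last < self.min_interval_ms:
            return False  # reconnected sooner than the published retry: value
        self._last_seen[client_id] = now_ms
        return True
```

The guard takes the current time as an argument so it can be driven by a fake clock in tests.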

ID collisions occur when using non-atomic counters across distributed nodes. Switch to ULID or Snowflake IDs for distributed safety. Out-of-order delivery happens if multiple stream instances lack synchronized state. Enforce strict ordering via partition keys.
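As an illustration of collision-free IDs across nodes, the sketch below packs a millisecond timestamp, a node ID, and a per-millisecond sequence into one integer, in the style of Snowflake IDs. The class name, the 41/10/12 bit split, the default epoch, and the injectable clock are assumptions made for this example. IDs from different nodes are only roughly time-ordered, so strict ordering still depends on partition keys.

```python
import threading
import time


class SnowflakeLikeIds:
    """Snowflake-style IDs: ms timestamp | 10-bit node | 12-bit sequence."""

    def __init__(self, node_id: int, epoch_ms: int = 1_700_000_000_000,
                 clock=None) -> None:
        if not 0 <= node_id < 1024:
            raise ValueError("node_id must fit in 10 bits")
        self._node_id = node_id
        self._epoch_ms = epoch_ms
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._lock = threading.Lock()
        self._last_ms = -1
        self._seq = 0

    def next_id(self) -> int:
        with self._lock:  # atomic per node; node_id separates nodes
            now = self._clock()
            if now < self._last_ms:
                # Clock went backwards (or we ran ahead of it):
                # reuse the last timestamp so IDs keep increasing.
                now = self._last_ms
            if now == self._last_ms:
                self._seq = (self._seq + 1) & 0xFFF
                if self._seq == 0:
                    # Sequence exhausted for this millisecond:
                    # advance to the next logical millisecond.
                    now = self._last_ms + 1
            else:
                self._seq = 0
            self._last_ms = now
            return (((now - self._epoch_ms) << 22)
                    | (self._node_id << 12)
                    | self._seq)
```

When the sequence overflows, this sketch advances a logical clock instead of sleeping, which keeps the generator non-blocking at the cost of timestamps that can run slightly ahead of wall time under sustained bursts.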

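Monitor Last-Event-ID header size limits. Many servers and proxies cap request headers at roughly 8KB by default. A single event ID stays far below that, but composite cursors that accumulate state over a long-lived session (for example, one offset per partition) can exceed it. Keep the cursor compact, reject or truncate oversized values explicitly, and fall back to a full resync when the cursor points outside the retention window.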
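One way to express the size and retention checks is a small resolver that runs before any replay. This is a sketch under stated assumptions: the cursor is taken to be a plain decimal sequence number, and the function name resolve_resume_point, the 1024-byte budget, and the "snapshot" and "replay" mode labels are all hypothetical.

```python
from typing import Optional

MAX_CURSOR_BYTES = 1024  # hypothetical budget, well under common header caps


def resolve_resume_point(last_event_id: Optional[str],
                         oldest_retained_seq: int) -> dict:
    """Decide whether to replay from a cursor or force a full resync."""
    if not last_event_id:
        return {"mode": "snapshot", "reason": "no cursor"}
    if len(last_event_id.encode("utf-8")) > MAX_CURSOR_BYTES:
        return {"mode": "snapshot", "reason": "cursor too large"}
    # Assumption: cursors are decimal sequence numbers.
    if not (last_event_id.isascii() and last_event_id.isdigit()):
        return {"mode": "snapshot", "reason": "malformed cursor"}
    seq = int(last_event_id)
    if seq < oldest_retained_seq - 1:
        # At least one event after the cursor has already been evicted,
        # so a replay would silently skip it.
        return {"mode": "snapshot", "reason": "cursor outside retention"}
    return {"mode": "replay", "after_seq": seq}
```

With the oldest retained sequence at 10, a cursor of "9" can still be replayed without gaps, while a cursor of "8" cannot.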

Apply these mitigation tactics in production:

Fallback Strategies & Transport Degradation

Native EventSource does not let the client set the retry interval (only the server can, via retry:), does not allow custom request headers, and does not support POST. When strict delivery guarantees exceed native capabilities, evaluate transport alternatives as outlined in SSE vs WebSockets vs HTTP Polling.

Implement a polyfill using fetch + ReadableStream for granular retry control and header injection. This bypasses browser limitations and allows manual cursor management. Graceful degradation should trigger automatically. Switch to HTTP long-polling if retry: exceeds 30s or if Last-Event-ID is missing from a reconnect request. Its absence on the very first connection is expected and should not trigger the fallback.

Transport Mode | Retry Behavior          | Header Support        | Cursor Handling
---------------|-------------------------|-----------------------|----------------------
Native SSE     | Standard auto-reconnect | Last-Event-ID only    | Automatic
Fetch Polyfill | Custom backoff logic    | Full header injection | Manual
Long Polling   | Stateless fallback      | Explicit query param  | Explicit cursor param

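Handle transport degradation explicitly in your client router. Detect EventSource failures by listening for the error event and checking whether readyState equals EventSource.CLOSED, which means the browser has stopped reconnecting. Fall back to fetch with a ?cursor= query parameter that carries the last received event ID. Maintain identical payload schemas across all transports to prevent client-side parsing errors.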
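The routing decision itself is independent of the browser APIs and can be sketched as two pure functions. The names choose_transport and with_cursor, the transport labels, and the argument set are hypothetical; the 30s threshold is the one stated in the text.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

MAX_NATIVE_RETRY_MS = 30_000  # threshold from the text: retry: above 30s


def choose_transport(native_closed: bool, needs_custom_headers: bool,
                     retry_ms: Optional[int] = None) -> str:
    """Pick a transport label based on the degradation rules."""
    if retry_ms is not None and retry_ms > MAX_NATIVE_RETRY_MS:
        return "long-polling"
    if native_closed or needs_custom_headers:
        return "fetch-stream"
    return "native-sse"


def with_cursor(url: str, cursor: Optional[str]) -> str:
    """Return url with cursor=<cursor> set, replacing any existing value."""
    if cursor is None:
        return url
    parts = urlsplit(url)
    query = [(k, v)
             for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "cursor"]
    query.append(("cursor", cursor))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Keeping the decision in one place makes it easier to guarantee that every transport requests the same resource with the same cursor semantics.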

Validation & Testing Workflow

Verify idempotency by replaying events with identical IDs across multiple client instances and asserting that each event is applied only once. Test connection drops at 1s, 5s, and 30s intervals. Assert Last-Event-ID propagation accuracy on every reconnect attempt. Check stream compliance against the strict line-break rules described in Understanding the Event Stream Format.

Load test with 10k concurrent reconnects. Validate broker backpressure handling, cursor cache eviction speed, and memory stability. Monitor garbage collection spikes during mass reconnect events.

Execute this validation checklist before deployment: