Engineering·Apr 18, 2026·7 min read

Worker heartbeats and a production preflight that fails fast

When a worker silently stops ticking, payments stop confirming. Here's how we use heartbeat rows in Postgres and a single preflight endpoint to catch that before customers do.

OpenSettle Team· Platform engineering

Every payment that lands on-chain is observed by a worker. Every subscription that renews is driven by a worker. Every webhook that delivers is dispatched by a worker. If a worker silently stops ticking (its container is still up, but its loop has hung), payments queue, renewals miss, webhooks pile up, and the dashboard shows green because the API is fine. That's the failure mode worth designing for.

Heartbeats in Postgres

Each worker upserts a row at the end of every successful tick:

0019_worker_heartbeats.sql
CREATE TABLE worker_heartbeats (
  name          TEXT PRIMARY KEY,        -- 'chain_reader' | 'webhook_deliverer' | 'renewal_worker'
  last_tick_at  TIMESTAMPTZ NOT NULL,
  details       JSONB
);

The upsert is best-effort: a DB error here is caught and swallowed rather than allowed to crash the worker loop, because a liveness signal that can take down the worker it reports on would defeat the purpose. The heartbeat is observed, not load-bearing.
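
A minimal sketch of what a best-effort upsert like this could look like in TypeScript. The names QueryFn and recordHeartbeat are hypothetical, and the sketch assumes a Postgres client that accepts a parameterized statement plus a values array; it is not the actual worker code.

```typescript
// Hypothetical minimal query signature; a real worker would pass in its
// Postgres client's parameterized query method.
type QueryFn = (text: string, values: unknown[]) => Promise<unknown>;

// Upsert keyed on the primary key `name`, so each worker keeps exactly one row.
const HEARTBEAT_UPSERT = `
  INSERT INTO worker_heartbeats (name, last_tick_at, details)
  VALUES ($1, now(), $2::jsonb)
  ON CONFLICT (name) DO UPDATE
    SET last_tick_at = EXCLUDED.last_tick_at,
        details      = EXCLUDED.details
`;

// Resolves to true if the heartbeat was written and false if the write failed.
// It never rejects: a failed heartbeat must not take down the worker loop.
async function recordHeartbeat(
  query: QueryFn,
  name: string,
  details: Record<string, unknown> = {},
): Promise<boolean> {
  try {
    await query(HEARTBEAT_UPSERT, [name, JSON.stringify(details)]);
    return true;
  } catch (err) {
    // Swallow the error on purpose, but leave a trace in the logs.
    console.warn(`heartbeat upsert failed for ${name}:`, err);
    return false;
  }
}
```

A worker would call recordHeartbeat at the end of each successful tick and ignore the result. If the write keeps failing, last_tick_at stops advancing and the row goes stale, which is the same signal a hung loop produces.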

Preflight, not health check

We expose /v1/internal/preflight, gated by a bearer token. It returns a structured per-check result: db reachable, every worker's last tick within tolerance, RPC config presence, secret presence, verified-wallet count, enabled-webhook count. Each check is { ok | warn | fail }, and the runner exits 1 on any fail. That makes it usable as a deploy gate:

bash
# Run before any live-crypto session
$ pnpm --filter @opensettle/api preflight:prod
✓ db_reachable
✓ chain_reader_heartbeat (last tick 4s ago)
✓ webhook_deliverer_heartbeat (last tick 12s ago)
✓ renewal_worker_heartbeat (last tick 38s ago)
✗ chain_reader_rpc_base_configured — CHAIN_READER_RPC_BASE not set
exit 1
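
The heartbeat checks in that output boil down to comparing a row's last_tick_at against a tolerance. A rough TypeScript sketch follows, with the caveat that checkHeartbeat, the CheckResult shape, and the separate warn and fail thresholds are assumptions made for illustration; the post only says each worker's last tick must be "within tolerance".

```typescript
type CheckStatus = "ok" | "warn" | "fail";

interface CheckResult {
  name: string;
  status: CheckStatus;
  message: string;
}

// Classify one worker's heartbeat by age.
// warnAfterMs and failAfterMs are hypothetical per-worker tolerances.
function checkHeartbeat(
  worker: string,
  lastTickAt: Date | null,
  now: Date,
  warnAfterMs: number,
  failAfterMs: number,
): CheckResult {
  const name = `${worker}_heartbeat`;

  // A worker that has never written a row is treated as a failure.
  if (lastTickAt === null) {
    return { name, status: "fail", message: "no heartbeat row" };
  }

  const ageMs = now.getTime() - lastTickAt.getTime();
  const message = `last tick ${Math.round(ageMs / 1000)}s ago`;

  if (ageMs > failAfterMs) return { name, status: "fail", message };
  if (ageMs > warnAfterMs) return { name, status: "warn", message };
  return { name, status: "ok", message };
}
```

With a tick 4 seconds old and thresholds of 60 and 300 seconds, this returns status "ok" with the message "last tick 4s ago". Passing now in as a parameter, rather than reading the clock inside the function, keeps the check deterministic in tests.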

Why a single endpoint instead of dashboards

Dashboards are for humans. Preflight is for machines. The same response feeds three different consumers: the deploy script (refuse a release if any check fails), the on-call alerts (page on warn or fail), and the CLI runner (a human typing preflight:prod before flipping a feature flag). Picking the right shape keeps all three honest: { ok | warn | fail } per check, plus a single rolled-up status.
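
One way to picture the rolled-up status and how each consumer reads it is the sketch below. The function names (rollUp, exitCodeFor, shouldPage) and the "worst status wins" rule are illustrative assumptions; the exit-1-on-fail and page-on-warn-or-fail behaviors are taken from the description above.

```typescript
type PreflightStatus = "ok" | "warn" | "fail";

interface PreflightCheck {
  name: string;
  status: PreflightStatus;
}

// Higher number means more severe.
const SEVERITY: Record<PreflightStatus, number> = { ok: 0, warn: 1, fail: 2 };

// Rolled-up status: the worst individual status wins. An empty list is "ok".
function rollUp(checks: PreflightCheck[]): PreflightStatus {
  return checks.reduce<PreflightStatus>(
    (worst, check) => (SEVERITY[check.status] > SEVERITY[worst] ? check.status : worst),
    "ok",
  );
}

// Deploy gate and CLI runner: exit 1 on any fail, 0 otherwise.
function exitCodeFor(status: PreflightStatus): number {
  return status === "fail" ? 1 : 0;
}

// On-call alerting: page on warn or fail.
function shouldPage(status: PreflightStatus): boolean {
  return status !== "ok";
}
```

Because all three consumers derive their decision from the same rolled-up value, a warn can page on-call without blocking a deploy, and a fail does both.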

What we don't try to detect

Preflight checks are about the platform's own inputs and process state. We don't try to detect upstream RPC degradation, exchange-rate de-pegs, or third-party email outages. Those go through Sentry alerting and an on-call runbook. The bar for preflight is: "Is this thing wired up correctly enough to take traffic?" If you put too much in, it becomes another dashboard nobody trusts.

The pattern is small but it pays: a Postgres table, an endpoint, a CLI runner, and a deploy gate. Worker liveness becomes something the platform asserts about itself, not something an engineer remembers to verify by grepping logs at 3am.