Worker heartbeats and a production preflight that fails fast
When a worker silently stops ticking, payments stop confirming. Here's how we use heartbeat rows in Postgres and a single preflight endpoint to catch that before customers do.
Every payment that lands on-chain is observed by a worker. Every subscription that renews is driven by a worker. Every webhook that delivers is dispatched by a worker. If a worker silently stops ticking — its container's still up but its loop has hung — payments queue, renewals miss, webhooks pile up, and the dashboard shows green because the API is fine. That's the failure mode worth designing for.
Heartbeats in Postgres
Each worker upserts a row at the end of every successful tick:
CREATE TABLE worker_heartbeats (
  name TEXT PRIMARY KEY, -- 'chain_reader' | 'webhook_deliverer' | 'renewal_worker'
  last_tick_at TIMESTAMPTZ NOT NULL,
  details JSONB
);
The upsert is best-effort: a DB error here is swallowed rather than allowed to crash the worker loop, because a heartbeat that can take the worker down would defeat the purpose. The heartbeat is observed, not load-bearing.
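A minimal sketch of what a best-effort upsert could look like. The `Queryable` interface, `recordHeartbeat`, and `tickWithHeartbeat` are hypothetical names chosen for illustration, not the project's actual code; the interface is modeled loosely on the `query(text, values)` method that drivers such as node-postgres expose.

```typescript
// Hypothetical minimal client shape (an assumption, not the project's real type).
interface Queryable {
  query(text: string, values: unknown[]): Promise<unknown>;
}

// Insert the row on first tick, overwrite it on every later tick.
const UPSERT_HEARTBEAT_SQL = `
  INSERT INTO worker_heartbeats (name, last_tick_at, details)
  VALUES ($1, now(), $2::jsonb)
  ON CONFLICT (name) DO UPDATE
    SET last_tick_at = EXCLUDED.last_tick_at,
        details = EXCLUDED.details
`;

// Best-effort write: resolves to false instead of throwing when the DB call fails.
async function recordHeartbeat(
  db: Queryable,
  name: string,
  details: Record<string, unknown> = {},
): Promise<boolean> {
  try {
    await db.query(UPSERT_HEARTBEAT_SQL, [name, JSON.stringify(details)]);
    return true;
  } catch {
    // Swallowed on purpose: a failed heartbeat must not stop the worker loop.
    return false;
  }
}

// Hypothetical wrapper: the heartbeat is written only after the tick itself succeeds.
async function tickWithHeartbeat(
  db: Queryable,
  name: string,
  tick: () => Promise<void>,
): Promise<void> {
  await tick(); // if this throws, no heartbeat is recorded for this tick
  await recordHeartbeat(db, name);
}
```

Returning a boolean rather than throwing lets the caller count or log missed heartbeats if it wants to, without making the write load-bearing.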
Preflight, not health check
We expose /v1/internal/preflight, gated by a bearer token. It returns a structured result per check: DB reachable, every worker's last tick within tolerance, RPC config present, secrets present, verified-wallet count, and enabled-webhook count. Each check is { ok | warn | fail }, and the runner exits 1 on any fail. That makes it usable as a deploy gate:
# Run before any live-crypto session
$ pnpm --filter @opensettle/api preflight:prod
✓ db_reachable
✓ chain_reader_heartbeat (last tick 4s ago)
✓ webhook_deliverer_heartbeat (last tick 12s ago)
✓ renewal_worker_heartbeat (last tick 38s ago)
✗ chain_reader_rpc_base_configured — CHAIN_READER_RPC_BASE not set
exit 1
Why a single endpoint instead of dashboards
Dashboards are for humans. Preflight is for machines. The same response feeds three different consumers: the deploy script (refuse a release if any check fails), the on-call alerts (page on warn or fail), and the CLI runner (a human typing preflight:prod before flipping a feature flag). Picking the right shape — { ok | warn | fail } per check, plus a single rolled-up status — keeps all three honest.
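To make the shape concrete, here is a sketch of a per-check result, a heartbeat freshness check, and the roll-up the three consumers read. All type and function names are hypothetical, and the separate warn and fail thresholds are an assumption: the text above only says a tick must be "within tolerance".

```typescript
type CheckStatus = "ok" | "warn" | "fail";

interface CheckResult {
  name: string;
  status: CheckStatus;
  detail?: string;
}

interface PreflightReport {
  status: CheckStatus; // worst status across all checks
  checks: CheckResult[];
}

const SEVERITY: Record<CheckStatus, number> = { ok: 0, warn: 1, fail: 2 };

// Compares a worker's last tick against two assumed thresholds.
function checkHeartbeat(
  worker: string,
  lastTickAt: Date | null,
  now: Date,
  warnAfterMs: number,
  failAfterMs: number,
): CheckResult {
  const name = `${worker}_heartbeat`;
  if (lastTickAt === null) {
    return { name, status: "fail", detail: "no heartbeat row" };
  }
  const ageMs = now.getTime() - lastTickAt.getTime();
  const detail = `last tick ${Math.round(ageMs / 1000)}s ago`;
  if (ageMs > failAfterMs) return { name, status: "fail", detail };
  if (ageMs > warnAfterMs) return { name, status: "warn", detail };
  return { name, status: "ok", detail };
}

// Rolls individual checks up into a single worst-case status.
function rollUp(checks: CheckResult[]): PreflightReport {
  const status = checks.reduce<CheckStatus>(
    (worst, c) => (SEVERITY[c.status] > SEVERITY[worst] ? c.status : worst),
    "ok",
  );
  return { status, checks };
}

// Deploy script and CLI runner: non-zero exit only on a fail.
function exitCodeFor(report: PreflightReport): 0 | 1 {
  return report.status === "fail" ? 1 : 0;
}

// On-call alerting: page on warn or fail.
function shouldPage(report: PreflightReport): boolean {
  return report.status !== "ok";
}
```

Passing `now` in explicitly keeps the check a pure function, so it can be tested without a clock or a database.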
What we don't try to detect
Preflight checks are about the platform's own inputs and process state. We don't try to detect upstream RPC degradation, exchange-rate de-pegs, or third-party email outages. Those go through Sentry alerting and an on-call runbook. The bar for preflight is: "is this thing wired up correctly enough to take traffic." If you put too much in, it becomes another dashboard nobody trusts.
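As an illustration of that boundary, a config-presence check might look like the sketch below: it only asks whether a value is set and never calls the RPC endpoint itself. The function name and the shape of the `env` argument are assumptions; the check name and variable are taken from the sample output earlier.

```typescript
// Hypothetical presence check: inspects the platform's own configuration only.
// It deliberately makes no network call, so upstream degradation cannot fail it.
function checkEnvPresent(
  checkName: string,
  envVar: string,
  env: Record<string, string | undefined>,
): { name: string; status: "ok" | "fail"; detail?: string } {
  const value = env[envVar];
  if (value === undefined || value.trim() === "") {
    return { name: checkName, status: "fail", detail: `${envVar} not set` };
  }
  return { name: checkName, status: "ok" };
}

// Usage with an explicit map; a real runner would presumably pass process.env.
const result = checkEnvPresent(
  "chain_reader_rpc_base_configured",
  "CHAIN_READER_RPC_BASE",
  {},
);
```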
The pattern is small but it pays: a Postgres table, an endpoint, a CLI runner, and a deploy gate. Worker liveness becomes something the platform asserts about itself, not something an engineer remembers to verify by grepping logs at 3am.