Incident runbook: Notifier service silent
The notifier fans out alerts from Scribe to Telegram / Discord / email / custom-webhook channels per user preferences. When the notifier is silent, on-chain events still occur but users are not paged.
Severity
- SEV-3 if a single channel kind is down (e.g. Telegram only)
- SEV-2 if the notifier service has not ticked in 10 minutes
- SEV-1 if alerts for tier-1 events (`Plinth.pause`, `LiquidationTriggered`, `EmergencyPaused`) are missed entirely
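The severity rules above can be sketched as a small shell helper for on-call scripts. This is only an illustration: the function name `classify_sev` and its three arguments are hypothetical, and the runbook does not say how to classify cases outside the three listed rules, so those fall through to `unclassified`.

```bash
#!/usr/bin/env bash
# Sketch: map triage observations to the severity levels in this runbook.
# Arguments (hypothetical names, not part of the notifier itself):
#   $1  minutes since the last notifier tick
#   $2  1 if a tier-1 alert was missed entirely, else 0
#   $3  number of channel kinds currently failing
classify_sev() {
  local minutes_since_tick="$1" tier1_missed="$2" channels_down="$3"
  if [ "$tier1_missed" -eq 1 ]; then
    echo "SEV-1"          # tier-1 event alerts missed entirely
  elif [ "$minutes_since_tick" -ge 10 ]; then
    echo "SEV-2"          # service has not ticked in 10 minutes
  elif [ "$channels_down" -eq 1 ]; then
    echo "SEV-3"          # a single channel kind is down
  else
    echo "unclassified"   # not covered by the rules above
  fi
}
```

For example, `classify_sev 12 0 0` prints `SEV-2`. The checks run from most to least severe so that the highest applicable level wins.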
Signals
- Better Stack monitor on the notifier GHA workflow run history.
- Discord ops webhook fires on `if: failure()` of `.github/workflows/notifier-cron.yml`.
- Sentry events tagged `service: notifier`.
- User report: "I didn't get the alert about X."
Triage (10 min target)
- Open the Actions tab for `.github/workflows/notifier-cron.yml` and confirm the 1-minute cron is firing. A halted cron means GHA throttled the workflow (rare; Atrium is well within free-tier minutes).
- Inspect the latest tick log for: scribe-fetch errors, KV cursor stalls, `fetchPrefs` returning null (auth header missing).
- Test the prefs API directly:
curl -H "Authorization: Bearer $ATRIUM_INTERNAL_KEY" "$PREFS_API_URL?user=0xYourWallet"
- Verify Vercel KV is reachable:
curl -H "Authorization: Bearer $ATRIUM_KV_REST_TOKEN" "$ATRIUM_KV_REST_URL/get/notifier:lastBlock"
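The two curl probes above can be wrapped so that they report only the HTTP status plus a hint for the next step. This is a minimal sketch: the function names are hypothetical, the environment variables are the ones used in the commands above, and the hints are this sketch's own reading of the runbook (a 401 suggests a key mismatch, a 5xx suggests escalation), not output produced by the services.

```bash
#!/usr/bin/env bash
# Sketch: report the HTTP status of the two triage probes and a hint.

# Print only the HTTP status code for a GET with a bearer token.
# curl prints 000 when no HTTP response was received at all.
probe_status() {
  local token="$1" url="$2"
  curl -s -o /dev/null -m 10 -w '%{http_code}' \
    -H "Authorization: Bearer $token" "$url"
}

# Translate a status code into a short hint (wording is illustrative).
interpret_status() {
  case "$1" in
    2??) echo "ok" ;;
    401) echo "auth rejected: check the bearer token" ;;
    5??) echo "server error: consider escalating" ;;
    000) echo "unreachable: no HTTP response" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# Run both probes; $1 is the wallet address to query prefs for.
run_triage_probes() {
  local prefs kv
  prefs="$(probe_status "$ATRIUM_INTERNAL_KEY" "$PREFS_API_URL?user=$1")"
  kv="$(probe_status "$ATRIUM_KV_REST_TOKEN" "$ATRIUM_KV_REST_URL/get/notifier:lastBlock")"
  echo "prefs API: $prefs ($(interpret_status "$prefs"))"
  echo "Vercel KV: $kv ($(interpret_status "$kv"))"
}
```

With the four environment variables exported, `run_triage_probes 0xYourWallet` prints one line per probe. Printing only status codes keeps tokens and response bodies out of shared incident channels.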
Mitigations
- Cron halted: trigger `workflow_dispatch` once
- `fetchPrefs` 401s: confirm `ATRIUM_INTERNAL_KEY` matches between notifier + verify-app deploys
- KV cursor stalled: `SET notifier:lastBlock <recent-block>` via Upstash UI
- Telegram channel down: check `TELEGRAM_BOT_TOKEN`; users re-/start the bot
Resolution checklist
- [ ] Notifier tick log shows successful event ingestion
- [ ] Test alert (force-trigger a non-prod event) reaches every wired channel
- [ ] Sentry events stop firing
- [ ] Better Stack monitor returns to green
- [ ] Post-mortem in `/incidents/` if SEV ≤ 2
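Before force-triggering a test event, it can help to smoke-test individual channels directly, bypassing the notifier. The sketch below is not the force-trigger mechanism itself, which this runbook does not describe. It assumes the Telegram Bot API `getMe` method and a standard Discord webhook POST; the function names and the `DISCORD_TEST_WEBHOOK_URL` variable are hypothetical, and the webhook should point at a non-prod channel.

```bash
#!/usr/bin/env bash
# Sketch: direct smoke checks for two channels, bypassing the notifier.

# Succeeds if a Telegram Bot API response body reports "ok": true.
telegram_response_ok() {
  printf '%s' "$1" | grep -q '"ok"[[:space:]]*:[[:space:]]*true'
}

# Ask Telegram whether TELEGRAM_BOT_TOKEN is accepted.
# getMe only returns bot metadata; it sends no messages.
check_telegram_token() {
  local body
  body="$(curl -s -m 10 "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/getMe")"
  if telegram_response_ok "$body"; then
    echo "telegram: token accepted"
  else
    echo "telegram: token rejected or API unreachable"
  fi
}

# Post a fixed test message to a Discord webhook and print the HTTP status.
# DISCORD_TEST_WEBHOOK_URL is a hypothetical variable (assumption).
send_discord_test() {
  curl -s -o /dev/null -m 10 -w '%{http_code}\n' \
    -H "Content-Type: application/json" \
    -d '{"content":"notifier runbook smoke test"}' \
    "$DISCORD_TEST_WEBHOOK_URL"
}
```

A passing smoke check only shows that the channel credentials work. The checklist item above still requires an end-to-end test alert through the notifier.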
Escalation contacts
- On-call frontend (notifier service owner) per `runbooks/on-call-rotation.md`
- Vercel KV support if KV REST API returns 5xx
- Telegram / Discord / Resend support per the failing channel