Incident response

Incident runbook: Archive weekly backtest cron failed

The archive service runs the weekly Atrium-vs-baseline backtest and publishes the result on chain via ResearchAttestation.publishBacktest. If a run fails, that week's attestation never lands and the /research dashboard shows stale data.

Severity

  • SEV-3 if a single weekly run fails (next run can recover)
  • SEV-2 if the last 4 weekly runs all failed (research dashboard more than 30 days stale)
  • SEV-1 if RESEARCH_SIGNER_KEY has been rotated but the GHA secret is stale (signing failures with a now-public key)
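
The severity rules above can be read as a small decision function. The following is a minimal sketch, not the team's tooling: the function name, its inputs, and the idea of passing run history as a list of booleans are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def classify_severity(
    recent_runs_ok: Sequence[bool],
    signer_secret_stale: bool,
) -> Optional[str]:
    """Map the runbook's severity rules onto a label.

    recent_runs_ok: hypothetical input, outcomes of the weekly runs,
        oldest first (True means the run succeeded).
    signer_secret_stale: hypothetical flag, True when RESEARCH_SIGNER_KEY
        was rotated but the GHA secret still holds the old key.
    """
    # A stale signing secret outranks any run-history rule.
    if signer_secret_stale:
        return "SEV-1"

    # Four consecutive failed weekly runs.
    last_four = list(recent_runs_ok)[-4:]
    if len(last_four) == 4 and not any(last_four):
        return "SEV-2"

    # Only the most recent run failed; the next run can recover.
    if len(recent_runs_ok) > 0 and not recent_runs_ok[-1]:
        return "SEV-3"

    return None  # nothing to page on
```

Checking the most severe condition first keeps the function from under-reporting when several rules match at once.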

Signals

  • .github/workflows/archive-weekly.yml shows red on the Actions tab
  • Discord ops webhook fires on if: failure()
  • Sentry events tagged service: archive
  • /research page: generatedAt field older than 14 days
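
The generatedAt signal can be checked mechanically. This is a sketch under one assumption the runbook does not confirm: that generatedAt is an ISO 8601 timestamp string. The function name and its parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_stale(generated_at: str, now: datetime, max_age_days: int = 14) -> bool:
    """Return True when generated_at is more than max_age_days before now.

    generated_at: assumed ISO 8601 string, e.g. "2024-05-01T00:00:00Z".
    now: timezone-aware current time, passed in so the check is testable.
    """
    # Older Python versions reject a trailing "Z", so normalise it first.
    ts = datetime.fromisoformat(generated_at.replace("Z", "+00:00"))
    # Treat a timestamp without an offset as UTC (an assumption).
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)
    return now - ts > timedelta(days=max_age_days)
```

In practice the caller would pass `datetime.now(timezone.utc)` as `now`.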

Triage (30 min target)

  1. Open the failed workflow run. The notebook-execution step prints the Python traceback if the model rebuild crashed.
  2. Check the publish-step: a failure here means RESEARCH_SIGNER_KEY is wrong, or RESEARCH_CONTRACT_ADDR points at a redeployed instance.
  3. Check IPFS pinning: the web3.storage token may have expired (free-tier quota or token revocation).
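
The triage steps amount to a lookup from the failing workflow step to its likely cause. The sketch below encodes that lookup; the step keys are hypothetical labels, since the runbook does not give the exact step names used in archive-weekly.yml.

```python
# Hypothetical step labels mapped to the likely causes named in the triage list.
TRIAGE_HINTS = {
    "notebook-execution": (
        "Model rebuild crashed; read the Python traceback in the step log."
    ),
    "publish": (
        "RESEARCH_SIGNER_KEY is wrong, or RESEARCH_CONTRACT_ADDR points at "
        "a redeployed instance."
    ),
    "ipfs-pin": (
        "web3.storage token may have expired (free-tier quota or token "
        "revocation)."
    ),
}


def triage_hint(failed_step: str) -> str:
    """Return the likely cause for a failed step, or a generic fallback."""
    return TRIAGE_HINTS.get(
        failed_step,
        "Unrecognized step; read the workflow log from the top.",
    )
```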

Mitigations

| Symptom                                                 | Fix                                                              | Rollback safe? |
|---------------------------------------------------------|------------------------------------------------------------------|----------------|
| Notebook crash                                          | Pin previous notebook commit; reopen issue with the failing cell | yes            |
| RESEARCH_SIGNER_KEY rotated                             | Update GHA secret + Vercel env; rerun cron via workflow_dispatch | yes            |
| IPFS pin 401                                            | Rotate WEB3_STORAGE_TOKEN; rerun                                 | yes            |
| RPC throttle                                            | Wait 60 min, rerun; if persistent, swap RPC endpoint             | yes            |
| Backtest produced impossible deltas (sanity-check fail) | DO NOT publish; reopen issue                                     | yes            |
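
For the RPC throttle case, swapping endpoints can be expressed as an ordered fallback. This is a generic sketch, not the archive service's code: the function, the exception class, and the idea of passing the RPC call as a callable are assumptions made for illustration.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllEndpointsFailed(Exception):
    """Raised when every candidate endpoint has failed."""


def call_with_fallback(endpoints: Sequence[str], call: Callable[[str], T]) -> T:
    """Try call(url) against each endpoint in order and return the first success.

    endpoints: hypothetical list of RPC URLs, preferred endpoint first.
    call: hypothetical callable that performs the request and raises on
        throttling or any other failure.
    """
    errors: List[str] = []
    for url in endpoints:
        try:
            return call(url)
        except Exception as exc:  # record the failure, then try the next URL
            errors.append(f"{url}: {exc}")
    raise AllEndpointsFailed("; ".join(errors))
```

This does not replace the 60-minute wait: a throttled provider usually recovers on its own, and the fallback only matters when throttling persists.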

Resolution checklist

  • [ ] Latest archive run is green
  • [ ] On-chain ResearchAttestation.latestRoot() reflects the new run
  • [ ] /research page generatedAt is recent
  • [ ] Sentry events stop firing
  • [ ] Post-mortem in /incidents/ if severity was SEV-2 or SEV-1

Escalation contacts

  • On-call ops owner per runbooks/on-call-rotation.md
  • web3.storage support for IPFS pinning issues