Incident runbook: Archive weekly backtest cron failed
The archive service runs the weekly Atrium-vs-baseline backtest + publishes the result via ResearchAttestation.publishBacktest on chain. A failure means the next ResearchAttestation never lands and the /research dashboard renders stale.
Severity
- SEV-3 if a single weekly run fails (next run can recover)
- SEV-2 if the last 4 weekly runs all failed (research dashboard
staler than 30 days)
- SEV-1 if the
RESEARCH_SIGNER_KEYhas been rotated but the GHA
secret is stale (signing failures with a now-public key)
Signals
.github/workflows/archive-weekly.ymlshows red on the Actions tab- Discord ops webhook fires on
if: failure() - Sentry events tagged
service: archive /researchpage:generatedAtfield older than 14 days
Triage (30 min target)
- Open the failed workflow run. The notebook-execution step prints the
Python traceback if the model rebuild crashed.
- Check the
publish-step: a failure here means
RESEARCH_SIGNER_KEY is wrong, or RESEARCH_CONTRACT_ADDR points at a redeployed instance.
- Check IPFS pinning:
web3.storagetoken may have expired (free-tier
quota or token revocation).
Mitigations
RESEARCH_SIGNER_KEY rotatedworkflow_dispatchWEB3_STORAGE_TOKEN; rerunResolution checklist
- [ ] Latest archive run is green
- [ ] On-chain
ResearchAttestation.latestRoot()reflects the new run - [ ] /research page
generatedAtis recent - [ ] Sentry events stop firing
- [ ] Post-mortem in
/incidents/if SEV ≤ 2
Escalation contacts
- On-call ops owner per
runbooks/on-call-rotation.md - web3.storage support for IPFS pinning issues