OpenWork Hosted Ops Runbook (Observability, Alerts, Rollback)
Version: 1.0
Date: 2026-03-07
Owner: Platform Engineering
Scope: Hosted OpenWork deployment only. Operational source of truth for the app now lives in brainforge-work.
1) Purpose
Define pragmatic operational coverage for hosted OpenWork:
- baseline health signals,
- automated alerting,
- rollback procedure for bad deploys,
- release cadence for upstream OpenWork syncs.
This runbook intentionally targets the Phase 4 hardening baseline, not full enterprise observability.
2) Health and failure signals
| Signal | What it detects | Source | Trigger condition |
|---|---|---|---|
| GET /health | OpenWork service liveness | OpenWork server | Non-200, timeout, or payload ok != true |
| GET /status with client token | Token validity + server runtime metadata | OpenWork server | Non-200/timeout, payload ok != true, or tokenSource.client/host = generated |
| GET /ui | User-facing UI shell availability | OpenWork toy UI served by openwork-server | Non-200/timeout or payload missing the expected UI marker |
| GET /opencode-router/health with client token | Sidecar startup/readiness | OpenWork router proxy via OpenWork server | Only when router is enabled; non-200/timeout or payload reports unhealthy |
| GET /tokens with host token | Host token validity | OpenWork server | Non-200/timeout or invalid payload |
| Runtime error patterns | Sidecar download/start failures and token auth failures | Deploy/runtime logs | Any of: openwork-server download failed, opencode download failed, opencodeRouter download failed, OpenCodeRouter health server is unavailable, Invalid bearer token, Invalid host token, Missing token scope |
Code references for runtime behavior should be taken from brainforge-work.
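For illustration, the /health trigger condition (non-200, timeout, or payload ok != true) could be checked along these lines. This is a minimal sketch, not the probe script from brainforge-work: the function names payload_ok and probe_health are hypothetical, the 10-second timeout is an assumed value, and the payload check is a grep heuristic rather than a full JSON parse.

```shell
# Hypothetical helpers; names and the 10 s timeout are assumptions.

# Succeed when the payload contains "ok": true (grep heuristic, not a JSON parser).
payload_ok() {
  printf '%s' "$1" | grep -Eq '"ok"[[:space:]]*:[[:space:]]*true'
}

# Probe GET <base-url>/health. Fails on curl error/timeout, non-200, or ok != true.
probe_health() {
  # Append the HTTP status code on its own line after the response body.
  resp=$(curl --silent --max-time 10 --write-out '\n%{http_code}' "$1/health") || return 1
  code=$(printf '%s\n' "$resp" | tail -n 1)
  body=$(printf '%s\n' "$resp" | sed '$d')
  [ "$code" = "200" ] && payload_ok "$body"
}
```

During an incident this might be run as: probe_health "$OPENWORK_LABS_BASE_URL" && echo healthy || echo unhealthy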
3) Alerting and error notification coverage
3.1 Automated alerting (implemented)
This monorepo no longer carries the hosted OpenWork health workflow or probe script. Keep deploy health checks, rollback automation, and alert routing with the hosted Work repo and its Railway service.
Automated checks:
- GET /health
- GET /status with OPENWORK_TOKEN
- GET /ui
- GET /tokens with OPENWORK_HOST_TOKEN (when provided)
- GET /opencode-router/health with OPENWORK_TOKEN, only when OPENWORK_EXPECT_OPENCODE_ROUTER=1
On failure:
- Job fails in GitHub Actions.
- Slack notification is sent when SLACK_BOT_TOKEN is available.
If required OpenWork secrets are not configured yet, the workflow exits as disabled (non-failing) and logs which secrets are missing.
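The "disabled, non-failing" behavior could look roughly like the sketch below. It assumes OPENWORK_LABS_BASE_URL and OPENWORK_LABS_CLIENT_TOKEN are the required secrets (the host token is treated as optional because /tokens is only probed "when provided"); the function names are hypothetical and the real workflow may differ.

```shell
# Assumption: these two secrets are the required ones.
# Print the name of each required variable that is unset or empty, one per line.
missing_secrets() {
  for name in OPENWORK_LABS_BASE_URL OPENWORK_LABS_CLIENT_TOKEN; do
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}

# Return 0 (probe disabled) and log the missing names when secrets are absent;
# return 1 when everything required is configured.
guard_disabled() {
  missing=$(missing_secrets)
  if [ -n "$missing" ]; then
    printf 'OpenWork probe disabled; missing secrets: %s\n' \
      "$(printf '%s' "$missing" | tr '\n' ' ')"
    return 0
  fi
  return 1
}
```

A workflow step might start with: if guard_disabled; then exit 0; fi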
3.2 Initial alert ownership channel
Initial owner channel for OpenWork deploy/runtime failures:
- Slack channel ID: C08CRQJ636X (default, overridable via repo variable OPENWORK_ALERT_SLACK_CHANNEL)
This answers the open question in PLT-1079 and can be changed later without code changes.
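As a sketch of how the default-with-override channel and the optional Slack post might fit together: the function names below are hypothetical, and the post uses Slack's chat.postMessage Web API method with the bot token as a bearer token. The message is interpolated into JSON without escaping, so it must not contain double quotes or backslashes.

```shell
# Resolve the alert channel: repo variable when set, else the documented default.
alert_channel() {
  printf '%s\n' "${OPENWORK_ALERT_SLACK_CHANNEL:-C08CRQJ636X}"
}

# Post a plain-text message ($1) to the alert channel.
# Skips without failing when SLACK_BOT_TOKEN is not available.
notify_slack() {
  if [ -z "${SLACK_BOT_TOKEN:-}" ]; then
    echo "SLACK_BOT_TOKEN not set; skipping Slack notification"
    return 0
  fi
  curl --silent --show-error --max-time 10 \
    -X POST https://slack.com/api/chat.postMessage \
    -H "Authorization: Bearer $SLACK_BOT_TOKEN" \
    -H 'Content-Type: application/json; charset=utf-8' \
    --data "{\"channel\":\"$(alert_channel)\",\"text\":\"$1\"}"
}
```

Example: notify_slack "OpenWork health probe failed"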
3.3 Required secrets/variables
GitHub repository Secrets:
- OPENWORK_LABS_BASE_URL (example: https://labs.brainforge.ai)
- OPENWORK_LABS_CLIENT_TOKEN
- OPENWORK_LABS_HOST_TOKEN
- SLACK_BOT_TOKEN (optional but recommended for notification)
GitHub repository Variables:
- OPENWORK_ALERT_SLACK_CHANNEL (optional; defaults to C08CRQJ636X)
Credential source of truth:
- Store OpenWork tokens in 1Password (Brainforge AI Team vault).
- Never commit tokens or URLs containing credentials.
4) Rollback playbook (failed deployment)
Use this when a new deploy causes health probe failures or user-visible breakage.
4.1 Trigger conditions
- Health workflow fails 2 consecutive runs.
- /opencode-router/health fails after deploy.
- /ui returns 404/empty shell after deploy.
- Critical auth failures (Invalid bearer token / Invalid host token) after deploy.
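The "2 consecutive runs" condition can be checked mechanically. The sketch below assumes run conclusions are passed newest first, using the values "failure" and "success" that GitHub reports for workflow runs; the function name is hypothetical.

```shell
# Succeed only when the two most recent run conclusions are both "failure".
# Arguments: conclusions, newest first (for example: failure failure success).
two_consecutive_failures() {
  [ "${1:-}" = "failure" ] && [ "${2:-}" = "failure" ]
}
```

With the GitHub CLI, the inputs could plausibly come from something like: two_consecutive_failures $(gh run list --workflow <workflow-file> --limit 2 --json conclusion --jq '.[].conclusion'), where <workflow-file> is a placeholder for the health workflow in the hosted Work repo.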
4.2 Rollback steps
1. Freeze changes
   - Stop promoting new OpenWork commits.
   - Announce rollback in alert channel.
2. Identify last known good deployment
   - In the Railway/OpenWork hosting dashboard, locate the last healthy deployment before the failure window.
   - Confirm the associated git SHA/tag.
3. Redeploy last known good version
   - Redeploy the prior healthy deployment (or redeploy the prior SHA/tag through the normal deploy path).
   - Do not change tokens during the first rollback pass unless auth is the root cause.
4. Validate rollback
   - Wait for health workflow to pass.
   - Run manual spot checks:
     - /health returns ok: true
     - /status authenticated
     - /ui loads
     - /opencode-router/health healthy when router is enabled
     - core session start/smoke flow succeeds
5. Stabilize and communicate
   - Post incident note with:
     - failed SHA,
     - rollback SHA,
     - impact window,
     - preliminary root cause.
6. Recovery follow-up
   - Open follow-up ticket for root cause and prevention.
   - Re-enable forward deploys only after mitigation is merged.
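The endpoint spot checks in the validation step could be scripted roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the client token is assumed to be sent as an Authorization bearer header (suggested by the "Invalid bearer token" error, but not confirmed here), and only HTTP status codes are reported, so payloads and the session smoke flow still need a manual look.

```shell
# Print the paths to spot-check, one per line.
# The router path is included only when OPENWORK_EXPECT_OPENCODE_ROUTER=1.
spot_check_paths() {
  printf '%s\n' /health /status /ui
  if [ "${OPENWORK_EXPECT_OPENCODE_ROUTER:-0}" = "1" ]; then
    printf '%s\n' /opencode-router/health
  fi
}

# Fetch each path under base URL $1 and print "<path> <http-code>".
# Assumption: the bearer header is accepted (or ignored) on every path.
run_spot_checks() {
  for path in $(spot_check_paths); do
    code=$(curl --silent --max-time 10 --output /dev/null \
      --write-out '%{http_code}' \
      -H "Authorization: Bearer ${OPENWORK_TOKEN:-}" "$1$path") || code="timeout/error"
    printf '%s %s\n' "$path" "$code"
  done
}
```

Example: OPENWORK_EXPECT_OPENCODE_ROUTER=1 run_spot_checks "$OPENWORK_LABS_BASE_URL"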
5) Release cadence for upstream OpenWork syncs
5.1 Cadence
- Default sync cadence: weekly (one planned upstream sync window per week)
- Patch/hotfix sync: ad hoc, only for production incidents or security fixes
- No blind auto-updates: each sync must go through the checklist below
5.2 Update pattern
- Pull and review the latest changes in brainforge-work.
- Run OpenWork-focused validation before deploy:
  - run the validation commands defined in brainforge-work
- Deploy to hosted OpenWork target.
- Confirm health workflow passes.
- Announce completion in alert/eng channel.
5.3 Roll-forward policy
- Prefer roll-forward only when health probes are green and no auth/sidecar regression is observed.
- If probes fail, rollback first (Section 4), then investigate.
6) Manual vs automation responsibilities
| Area | Automated | Manual |
|---|---|---|
| Service liveness (/health) | GitHub scheduled probe | Manual curl check during incident |
| Token validity and token-source drift | GitHub scheduled probe (/status, /tokens) | Rotate/reseed tokens in secrets manager |
| UI shell availability (/ui) | GitHub scheduled probe | Manual browser check during incident |
| Sidecar readiness (/opencode-router/health) | GitHub scheduled probe when router is enabled | Deep inspection of sidecar/startup logs |
| Deploy failure notification | GitHub Actions + Slack post | Engineer triage and escalation |
| Rollback execution | N/A | Operator executes rollback steps |
| Upstream release cadence | N/A | Team follows weekly sync process |
7) Operator quick commands
# Railway logs (if linked) - inspect startup/auth failures
railway logs --service "openwork" --environment "production"
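To narrow log output to the runtime error patterns listed in Section 2, a small filter can be piped after the logs command. This is a sketch: filter_runtime_errors is a hypothetical helper name, and the patterns are copied verbatim from the signals table.

```shell
# Keep only log lines that match a runtime error pattern from Section 2.
# Reads stdin; exits non-zero when nothing matches (standard grep behavior).
filter_runtime_errors() {
  grep -E 'openwork-server download failed|opencode download failed|opencodeRouter download failed|OpenCodeRouter health server is unavailable|Invalid bearer token|Invalid host token|Missing token scope'
}
```

Example: railway logs --service "openwork" --environment "production" | filter_runtime_errors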