OpenWork Hosted Ops Runbook (Observability, Alerts, Rollback)

Version: 1.0
Date: 2026-03-07
Owner: Platform Engineering
Scope: Hosted OpenWork deployment only. Operational source of truth for the app now lives in brainforge-work.

1) Purpose

Define pragmatic operational coverage for hosted OpenWork:

- baseline health signals,
- automated alerting,
- rollback procedure for bad deploys,
- release cadence for upstream OpenWork syncs.

This runbook intentionally targets the Phase 4 hardening baseline, not full enterprise observability.

2) Health and failure signals

Signal	What it detects	Source	Trigger condition
`GET /health`	OpenWork service liveness	OpenWork server	Non-200, timeout, or payload `ok != true`
`GET /status` with client token	Token validity + server runtime metadata	OpenWork server	Non-200/timeout, payload `ok != true`, or `tokenSource.client` or `tokenSource.host` set to `generated`
`GET /ui`	User-facing UI shell availability	OpenWork toy UI served by `openwork-server`	Non-200/timeout or payload missing the expected UI marker
`GET /opencode-router/health` with client token	Sidecar startup/readiness	OpenWork router proxy via OpenWork server	Only when router is enabled; non-200/timeout or payload reports unhealthy
`GET /tokens` with host token	Host token validity	OpenWork server	Non-200/timeout or invalid payload
Runtime error patterns	Sidecar download/start failures and token auth failures	Deploy/runtime logs	Any of: `openwork-server download failed`, `opencode download failed`, `opencodeRouter download failed`, `OpenCodeRouter health server is unavailable`, `Invalid bearer token`, `Invalid host token`, `Missing token scope`
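
To make the trigger conditions in the table concrete, here is a minimal Python sketch of how a probe might classify `/health` and `/status` responses. The function names and the JSON shape (an object with `ok` and a `tokenSource` object holding `client` and `host`) are assumptions inferred from the trigger column; they are not taken from the probe code in brainforge-work.

```python
from typing import Any, List, Optional


def payload_ok(payload: Any) -> bool:
    """True only when the payload is a JSON object whose `ok` is exactly true."""
    return isinstance(payload, dict) and payload.get("ok") is True


def evaluate_health(status_code: Optional[int], payload: Any) -> List[str]:
    """Return trigger reasons for GET /health; an empty list means healthy.

    status_code is None when the request timed out.
    """
    if status_code is None:
        return ["timeout"]
    reasons = []
    if status_code != 200:
        reasons.append(f"non-200 status: {status_code}")
    if not payload_ok(payload):
        reasons.append("payload ok != true")
    return reasons


def evaluate_status(status_code: Optional[int], payload: Any) -> List[str]:
    """Same checks as /health, plus the token-source check for GET /status."""
    reasons = evaluate_health(status_code, payload)
    token_source = payload.get("tokenSource") if isinstance(payload, dict) else None
    if isinstance(token_source, dict):
        for side in ("client", "host"):
            # Assumed field layout: tokenSource.client / tokenSource.host
            if token_source.get(side) == "generated":
                reasons.append(f"tokenSource.{side} is generated")
    return reasons
```

An empty list means the probe passed; any returned reason would be grounds for failing the job and alerting.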

Code references for runtime behavior should be taken from brainforge-work.

3) Alerting and error notification coverage

3.1 Automated alerting (implemented)

This monorepo no longer carries the hosted OpenWork health workflow or probe script. Keep deploy health checks, rollback automation, and alert routing with the hosted Work repo and its Railway service.

Automated checks:

- GET /health
- GET /status with OPENWORK_TOKEN
- GET /ui
- GET /tokens with OPENWORK_HOST_TOKEN (when provided)
- GET /opencode-router/health with OPENWORK_TOKEN, only when OPENWORK_EXPECT_OPENCODE_ROUTER=1

On failure:

- The job fails in GitHub Actions.
- A Slack notification is sent when SLACK_BOT_TOKEN is available.

If required OpenWork secrets are not configured yet, the workflow exits as disabled (non-failing) and logs which secrets are missing.
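
The check selection and the "disabled when secrets are missing" behavior described above could be sketched roughly as follows. This is an illustration, not the actual probe script: `plan_checks` is a hypothetical helper, and treating the base URL and client token as the "required" inputs is an assumption, as is the mapping from repository secrets to these environment variable names.

```python
from typing import List, Mapping, Optional, Tuple

# Assumption: these are the inputs the workflow treats as required.
REQUIRED = ("OPENWORK_LABS_BASE_URL", "OPENWORK_TOKEN")

Check = Tuple[str, Optional[str]]  # (path, token to send, or None for unauthenticated)


def plan_checks(env: Mapping[str, str]) -> Tuple[List[Check], List[str]]:
    """Return (checks, missing).

    When required inputs are missing, checks is empty and missing names them,
    so the caller can log the gap and exit as disabled without failing.
    """
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        return [], missing

    token = env["OPENWORK_TOKEN"]
    checks: List[Check] = [("/health", None), ("/status", token), ("/ui", None)]

    host_token = env.get("OPENWORK_HOST_TOKEN")
    if host_token:  # /tokens is probed only when a host token is provided
        checks.append(("/tokens", host_token))

    if env.get("OPENWORK_EXPECT_OPENCODE_ROUTER") == "1":
        checks.append(("/opencode-router/health", token))

    return checks, []
```

Calling `plan_checks(os.environ)` at the top of a probe run would give either the list of endpoints to hit or the names of the missing inputs to log.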

3.2 Initial alert ownership channel

Initial owner channel for OpenWork deploy/runtime failures:

Slack channel ID: C08CRQJ636X (default, overridable via repo variable OPENWORK_ALERT_SLACK_CHANNEL)

This answers the open question in PLT-1079 and can be changed later without code changes.
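
As an illustration of the default-with-override behavior, the sketch below resolves the alert channel and builds (but does not send) a Slack message request. It assumes the workflow posts through Slack's `chat.postMessage` Web API method with a bot token; the runbook does not say which Slack mechanism is used, and the helper names are hypothetical.

```python
import json
import urllib.request
from typing import Mapping, Optional

DEFAULT_CHANNEL = "C08CRQJ636X"  # default owner channel from this runbook


def resolve_channel(env: Mapping[str, str]) -> str:
    """Use OPENWORK_ALERT_SLACK_CHANNEL when set and non-empty, else the default."""
    return env.get("OPENWORK_ALERT_SLACK_CHANNEL") or DEFAULT_CHANNEL


def build_slack_request(env: Mapping[str, str], text: str) -> Optional[urllib.request.Request]:
    """Build a chat.postMessage request, or return None when no bot token is available."""
    token = env.get("SLACK_BOT_TOKEN")
    if not token:
        return None  # notification is skipped, matching "when SLACK_BOT_TOKEN is available"
    body = json.dumps({"channel": resolve_channel(env), "text": text}).encode("utf-8")
    return urllib.request.Request(
        "https://slack.com/api/chat.postMessage",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` would send the message; keeping construction separate from sending makes the channel override easy to verify without network access.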

3.3 Required secrets/variables

GitHub repository Secrets:

- OPENWORK_LABS_BASE_URL (example: https://labs.brainforge.ai)
- OPENWORK_LABS_CLIENT_TOKEN
- OPENWORK_LABS_HOST_TOKEN
- SLACK_BOT_TOKEN (optional but recommended for notification)

GitHub repository Variables:

- OPENWORK_ALERT_SLACK_CHANNEL (optional; defaults to C08CRQJ636X)

Credential source of truth:

- Store OpenWork tokens in 1Password (Brainforge AI Team vault).
- Never commit tokens or URLs containing credentials.

4) Rollback playbook (failed deployment)

Use this when a new deploy causes health probe failures or user-visible breakage.

4.1 Trigger conditions

- Health workflow fails 2 consecutive runs.
- /opencode-router/health fails after deploy.
- /ui returns 404/empty shell after deploy.
- Critical auth failures (Invalid bearer token/Invalid host token) after deploy.
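
The trigger conditions can be expressed as a small decision helper. The sketch below is one possible encoding; the `PostDeployState` record and its field names are hypothetical, and gating the router check on `router_enabled` is an assumption carried over from the signals table in Section 2.

```python
from dataclasses import dataclass
from typing import List, Sequence

CRITICAL_AUTH_PATTERNS = ("Invalid bearer token", "Invalid host token")


@dataclass
class PostDeployState:
    """Hypothetical snapshot of what an operator knows after a deploy."""
    workflow_runs: Sequence[bool]  # oldest first; True means the health run passed
    router_enabled: bool
    router_healthy: bool
    ui_ok: bool                    # False for a 404 or an empty shell
    log_lines: Sequence[str] = ()  # recent runtime log lines


def rollback_reasons(state: PostDeployState) -> List[str]:
    """Return every trigger condition that currently holds; empty means no rollback."""
    reasons = []
    if len(state.workflow_runs) >= 2 and not any(state.workflow_runs[-2:]):
        reasons.append("health workflow failed 2 consecutive runs")
    if state.router_enabled and not state.router_healthy:
        reasons.append("/opencode-router/health failing after deploy")
    if not state.ui_ok:
        reasons.append("/ui returns 404 or an empty shell")
    if any(p in line for line in state.log_lines for p in CRITICAL_AUTH_PATTERNS):
        reasons.append("critical auth failures after deploy")
    return reasons
```

Any non-empty result corresponds to at least one trigger condition and would justify starting the rollback steps.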

4.2 Rollback steps

1. Freeze changes
   - Stop promoting new OpenWork commits.
   - Announce the rollback in the alert channel.
2. Identify the last known good deployment
   - In the Railway/OpenWork hosting dashboard, locate the last successful healthy deployment prior to the failure window.
   - Confirm the associated git SHA/tag.
3. Redeploy the last known good version
   - Redeploy the prior healthy deployment (or redeploy the prior SHA/tag through the normal deploy path).
   - Do not change tokens during the first rollback pass unless auth is the root cause.
4. Validate the rollback
   - Wait for the health workflow to pass.
   - Run manual spot checks:
     - /health returns ok: true
     - /status succeeds with the client token
     - /ui loads
     - /opencode-router/health is healthy when the router is enabled
     - core session start/smoke flow succeeds
5. Stabilize and communicate
   - Post an incident note with:
     - failed SHA,
     - rollback SHA,
     - impact window,
     - preliminary root cause.
6. Recovery follow-up
   - Open a follow-up ticket for root cause and prevention.
   - Re-enable forward deploys only after the mitigation is merged.
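
Two of the steps above lend themselves to a small sketch: picking the last known good deployment and assembling the incident note. The `Deployment` record below is hypothetical and is not Railway's data model; in practice the operator reads this information from the hosting dashboard.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass
class Deployment:
    """Hypothetical deployment record for illustration only."""
    sha: str
    deployed_at: datetime
    healthy: bool  # deploy succeeded and health probes passed


def last_known_good(deployments: Sequence[Deployment],
                    failure_start: datetime) -> Optional[Deployment]:
    """Newest healthy deployment strictly before the failure window, or None."""
    candidates = [d for d in deployments
                  if d.healthy and d.deployed_at < failure_start]
    return max(candidates, key=lambda d: d.deployed_at, default=None)


def incident_note(failed_sha: str, rollback_sha: str,
                  impact_window: str, root_cause: str) -> str:
    """Render the four fields the runbook asks for in the incident note."""
    return "\n".join([
        f"Failed SHA: {failed_sha}",
        f"Rollback SHA: {rollback_sha}",
        f"Impact window: {impact_window}",
        f"Preliminary root cause: {root_cause}",
    ])
```

If `last_known_good` returns `None`, no healthy deployment precedes the failure window and the operator would need to fall back to redeploying a known SHA/tag through the normal deploy path.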

5) Release cadence for upstream OpenWork syncs

5.1 Cadence

- Default sync cadence: weekly (one planned upstream sync window per week)
- Patch/hotfix sync: ad hoc, only for production incidents or security fixes
- No blind auto-updates: each sync must go through the checklist below

5.2 Update pattern

1. Pull and review the latest changes in brainforge-work.
2. Run the OpenWork-focused validation commands defined in brainforge-work before deploying.
3. Deploy to the hosted OpenWork target.
4. Confirm the health workflow passes.
5. Announce completion in the alert/eng channel.

5.3 Roll-forward policy

Prefer rolling forward only when health probes are green and no auth or sidecar regression is observed.
If probes fail, roll back first (Section 4), then investigate.

6) Manual vs automation responsibilities

Area	Automated	Manual
Service liveness (`/health`)	GitHub scheduled probe	Manual curl check during incident
Token validity and token-source drift	GitHub scheduled probe (`/status`, `/tokens`)	Rotate/reseed tokens in secrets manager
UI shell availability (`/ui`)	GitHub scheduled probe	Manual browser check during incident
Sidecar readiness (`/opencode-router/health`)	GitHub scheduled probe when router is enabled	Deep inspection of sidecar/startup logs
Deploy failure notification	GitHub Actions + Slack post	Engineer triage and escalation
Rollback execution	N/A	Operator executes rollback steps
Upstream release cadence	N/A	Team follows weekly sync process

7) Operator quick commands

# Railway logs (if linked) - inspect startup/auth failures
railway logs --service "openwork" --environment "production"
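
When scanning a long log dump, it can help to filter for the runtime error patterns listed in Section 2. The following is a minimal sketch; `find_error_lines` is a hypothetical helper, and plain substring matching is an assumption about how the patterns appear in the logs.

```python
from typing import Iterable, Iterator, Tuple

# Patterns copied from the "Runtime error patterns" row in Section 2.
ERROR_PATTERNS = (
    "openwork-server download failed",
    "opencode download failed",
    "opencodeRouter download failed",
    "OpenCodeRouter health server is unavailable",
    "Invalid bearer token",
    "Invalid host token",
    "Missing token scope",
)


def find_error_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (pattern, line) for each log line containing a known error pattern.

    Only the first matching pattern is reported per line.
    """
    for line in lines:
        for pattern in ERROR_PATTERNS:
            if pattern in line:
                yield pattern, line.rstrip("\n")
                break
```

Saving the Railway log output to a file and passing the open file object to `find_error_lines` yields only the lines worth triaging, each tagged with the pattern it matched.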
