OpenWork Hosted Ops Runbook (Observability, Alerts, Rollback)

Version: 1.0
Date: 2026-03-07
Owner: Platform Engineering
Scope: Hosted OpenWork deployment only. Operational source of truth for the app now lives in brainforge-work.


1) Purpose

Define pragmatic operational coverage for hosted OpenWork:

  • baseline health signals,
  • automated alerting,
  • rollback procedure for bad deploys,
  • release cadence for upstream OpenWork syncs.

This runbook intentionally targets the Phase 4 hardening baseline, not full enterprise observability.


2) Health and failure signals

Signal | What it detects | Source | Trigger condition
GET /health | OpenWork service liveness | OpenWork server | Non-200, timeout, or payload ok != true
GET /status with client token | Token validity + server runtime metadata | OpenWork server | Non-200/timeout, payload ok != true, or tokenSource.client/host = generated
GET /ui | User-facing UI shell availability | OpenWork toy UI served by openwork-server | Non-200/timeout or payload missing the expected UI marker
GET /opencode-router/health with client token | Sidecar startup/readiness | OpenWork router proxy via OpenWork server | Only when router is enabled; non-200/timeout or payload reports unhealthy
GET /tokens with host token | Host token validity | OpenWork server | Non-200/timeout or invalid payload
Runtime error patterns | Sidecar download/start failures and token auth failures | Deploy/runtime logs | Any of: openwork-server download failed, opencode download failed, opencodeRouter download failed, OpenCodeRouter health server is unavailable, Invalid bearer token, Invalid host token, Missing token scope
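
The trigger conditions for /health and /status can be sketched as small pure functions. This is only an illustration, not the probe script in brainforge-work. It assumes the payload is a JSON object with a boolean ok field and, for /status, a tokenSource object with client and host keys whose value may be the string "generated". The function names are hypothetical.

```python
from collections.abc import Mapping
from typing import Any, Optional


def health_failure(status_code: Optional[int],
                   payload: Optional[Mapping[str, Any]]) -> Optional[str]:
    """Return a failure reason for GET /health, or None when healthy.

    status_code is None when the request timed out.
    """
    if status_code is None:
        return "timeout"
    if status_code != 200:
        return f"non-200 response ({status_code})"
    if not isinstance(payload, Mapping) or payload.get("ok") is not True:
        return "payload ok != true"
    return None


def status_failure(status_code: Optional[int],
                   payload: Optional[Mapping[str, Any]]) -> Optional[str]:
    """Return a failure reason for GET /status, or None when healthy."""
    reason = health_failure(status_code, payload)
    if reason is not None:
        return reason
    # Assumed shape: {"tokenSource": {"client": "...", "host": "..."}}
    token_source = payload.get("tokenSource")
    if isinstance(token_source, Mapping):
        for role in ("client", "host"):
            if token_source.get(role) == "generated":
                return f"tokenSource.{role} = generated"
    return None
```

Returning a reason string rather than a boolean keeps the alert text specific about which condition fired.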

Code references for runtime behavior should be taken from brainforge-work.


3) Alerting and error notification coverage

3.1 Automated alerting (implemented)

This monorepo no longer carries the hosted OpenWork health workflow or probe script. Keep deploy health checks, rollback automation, and alert routing with the hosted Work repo and its Railway service.

Automated checks:

  1. GET /health
  2. GET /status with OPENWORK_TOKEN
  3. GET /ui
  4. GET /tokens with OPENWORK_HOST_TOKEN (when provided)
  5. GET /opencode-router/health with OPENWORK_TOKEN only when OPENWORK_EXPECT_OPENCODE_ROUTER=1
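
The conditional checks in the list above can be expressed as a small planning function. This is a sketch under assumptions: the environment variable names come from the list, while plan_checks and the (path, token variable) tuple shape are hypothetical.

```python
from typing import List, Mapping, Optional, Tuple


def plan_checks(env: Mapping[str, str]) -> List[Tuple[str, Optional[str]]]:
    """Return (path, token_env_var) pairs in the order the checks run.

    token_env_var is None for checks that are listed without a token.
    """
    checks: List[Tuple[str, Optional[str]]] = [
        ("/health", None),
        ("/status", "OPENWORK_TOKEN"),
        ("/ui", None),
    ]
    # /tokens is probed only when a host token is provided.
    if env.get("OPENWORK_HOST_TOKEN"):
        checks.append(("/tokens", "OPENWORK_HOST_TOKEN"))
    # The router check is opt-in via OPENWORK_EXPECT_OPENCODE_ROUTER=1.
    if env.get("OPENWORK_EXPECT_OPENCODE_ROUTER") == "1":
        checks.append(("/opencode-router/health", "OPENWORK_TOKEN"))
    return checks
```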

On failure:

  • Job fails in GitHub Actions.
  • Slack notification is sent when SLACK_BOT_TOKEN is available.

If required OpenWork secrets are not configured yet, the workflow exits as disabled (non-failing) and logs which secrets are missing.
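
A minimal sketch of that disabled-but-not-failing preflight follows. The secret names are those listed in Section 3.3. Which of them count as required is an assumption here (base URL and client token), as are the function name and message wording.

```python
from typing import Mapping, Optional

# Assumption: the host token and Slack token are optional, so only these
# two secrets gate whether the probe runs at all.
REQUIRED_SECRETS = ("OPENWORK_LABS_BASE_URL", "OPENWORK_LABS_CLIENT_TOKEN")


def disabled_message(env: Mapping[str, str]) -> Optional[str]:
    """Return a log message when the probe should be skipped, else None."""
    missing = [name for name in REQUIRED_SECRETS if not env.get(name)]
    if not missing:
        return None
    return "OpenWork health probe disabled; missing secrets: " + ", ".join(missing)
```

A caller would log the message and exit with status 0, so an unconfigured repository does not show up as a failing job.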

3.2 Initial alert ownership channel

Initial owner channel for OpenWork deploy/runtime failures:

  • Slack channel ID: C08CRQJ636X (default, overridable via repo variable OPENWORK_ALERT_SLACK_CHANNEL)

This answers the open question in PLT-1079 and can be changed later without code changes.
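
The channel default and override, together with the "notify only when SLACK_BOT_TOKEN is available" rule, might look like the sketch below. It builds a request for Slack's chat.postMessage Web API method without sending it. Whether the real workflow uses that method is not confirmed by this runbook, and build_slack_alert and the message wording are hypothetical.

```python
from typing import Any, Dict, List, Mapping, Optional

DEFAULT_ALERT_CHANNEL = "C08CRQJ636X"


def build_slack_alert(env: Mapping[str, str],
                      failures: List[str]) -> Optional[Dict[str, Any]]:
    """Describe the Slack request for a failed probe run.

    Returns None when there is nothing to send: no failures, or no
    SLACK_BOT_TOKEN (the notification is optional).
    """
    token = env.get("SLACK_BOT_TOKEN")
    if not token or not failures:
        return None
    # The repo variable overrides the default channel without code changes.
    channel = env.get("OPENWORK_ALERT_SLACK_CHANNEL") or DEFAULT_ALERT_CHANNEL
    text = "OpenWork health probe failed:\n" + "\n".join(f"- {f}" for f in failures)
    return {
        "url": "https://slack.com/api/chat.postMessage",
        "headers": {
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        "json": {"channel": channel, "text": text},
    }
```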

3.3 Required secrets/variables

GitHub repository Secrets:

  • OPENWORK_LABS_BASE_URL (example: https://labs.brainforge.ai)
  • OPENWORK_LABS_CLIENT_TOKEN
  • OPENWORK_LABS_HOST_TOKEN
  • SLACK_BOT_TOKEN (optional but recommended for notification)

GitHub repository Variables:

  • OPENWORK_ALERT_SLACK_CHANNEL (optional; defaults to C08CRQJ636X)

Credential source of truth:

  • Store OpenWork tokens in 1Password (Brainforge AI Team vault).
  • Never commit tokens or URLs containing credentials.

4) Rollback playbook (failed deployment)

Use this when a new deploy causes health probe failures or user-visible breakage.

4.1 Trigger conditions

  • Health workflow fails on 2 consecutive runs.
  • /opencode-router/health fails after deploy.
  • /ui returns 404/empty shell after deploy.
  • Critical auth failures (Invalid bearer token/Invalid host token) after deploy.
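
The trigger conditions above can be sketched as one function, so that a rollback decision always records why it was made. PostDeployState and its fields are hypothetical. Treating "empty shell" as an empty response body is an assumption, and the router check is applied only when the router is enabled, as in Section 2.

```python
from dataclasses import dataclass, field
from typing import List

AUTH_FAILURE_MARKERS = ("Invalid bearer token", "Invalid host token")


@dataclass
class PostDeployState:
    """Hypothetical snapshot of what an operator sees after a deploy."""
    workflow_runs: List[bool]   # oldest first; True means the run passed
    router_enabled: bool
    router_healthy: bool
    ui_status: int
    ui_body: str
    log_lines: List[str] = field(default_factory=list)


def rollback_reasons(state: PostDeployState) -> List[str]:
    """Return every trigger condition that currently holds (empty = no rollback)."""
    reasons: List[str] = []
    if len(state.workflow_runs) >= 2 and not any(state.workflow_runs[-2:]):
        reasons.append("health workflow failed 2 consecutive runs")
    if state.router_enabled and not state.router_healthy:
        reasons.append("/opencode-router/health failing")
    # Assumption: an "empty shell" is approximated as a blank body.
    if state.ui_status == 404 or not state.ui_body.strip():
        reasons.append("/ui returned 404 or an empty shell")
    if any(m in line for line in state.log_lines for m in AUTH_FAILURE_MARKERS):
        reasons.append("critical auth failures in logs")
    return reasons
```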

4.2 Rollback steps

  1. Freeze changes

    • Stop promoting new OpenWork commits.
    • Announce rollback in alert channel.
  2. Identify last known good deployment

    • In the Railway/OpenWork hosting dashboard, locate the last successful, healthy deployment before the failure window.
    • Confirm its associated git SHA/tag.
  3. Redeploy last known good version

    • Redeploy the prior healthy deployment (or redeploy the prior SHA/tag through the normal deploy path).
    • Do not change tokens during the first rollback pass unless auth is the root cause.
  4. Validate rollback

    • Wait for health workflow to pass.
    • Run manual spot checks:
      • /health returns ok: true
      • /status returns ok: true with the client token
      • /ui loads
      • /opencode-router/health healthy when router is enabled
      • core session start/smoke flow succeeds
  5. Stabilize and communicate

    • Post incident note with:
      • failed SHA,
      • rollback SHA,
      • impact window,
      • preliminary root cause.
  6. Recovery follow-up

    • Open follow-up ticket for root cause and prevention.
    • Re-enable forward deploys only after mitigation is merged.
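
Steps 2 and 5 lend themselves to a small helper: picking the last healthy deployment before the failure window, and formatting the incident note with the four required fields. This is a sketch over hypothetical data. The Deployment record is not a Railway API type, and the deployment list would be transcribed from the hosting dashboard by hand.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical record copied from the hosting dashboard."""
    sha: str
    deployed_at: datetime
    healthy: bool


def last_known_good(deployments: List[Deployment],
                    failure_start: datetime) -> Optional[Deployment]:
    """Return the most recent healthy deployment before the failure window."""
    candidates = [d for d in deployments
                  if d.healthy and d.deployed_at < failure_start]
    return max(candidates, key=lambda d: d.deployed_at, default=None)


def incident_note(failed_sha: str, rollback_sha: str,
                  impact_start: datetime, impact_end: datetime,
                  root_cause: str) -> str:
    """Format the incident note fields listed in step 5."""
    return "\n".join([
        "OpenWork rollback completed",
        f"Failed SHA: {failed_sha}",
        f"Rollback SHA: {rollback_sha}",
        f"Impact window: {impact_start.isoformat()} to {impact_end.isoformat()}",
        f"Preliminary root cause: {root_cause}",
    ])
```

Using timezone-aware timestamps (for example datetime(..., tzinfo=timezone.utc)) avoids ambiguity when the impact window is read by people in different time zones.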

5) Release cadence for upstream OpenWork syncs

5.1 Cadence

  • Default sync cadence: weekly (one planned upstream sync window per week)
  • Patch/hotfix sync: ad hoc, only for production incidents or security fixes
  • No blind auto-updates: each sync must go through the checklist below

5.2 Update pattern

  1. Pull and review the latest changes in brainforge-work.
  2. Run OpenWork-focused validation before deploy:
    • run the validation commands defined in brainforge-work
  3. Deploy to hosted OpenWork target.
  4. Confirm health workflow passes.
  5. Announce completion in alert/eng channel.

5.3 Roll-forward policy

  • Roll forward only when health probes are green and no auth or sidecar regression is observed.
  • If probes fail, rollback first (Section 4), then investigate.

6) Manual vs automation responsibilities

Area | Automated | Manual
Service liveness (/health) | GitHub scheduled probe | Manual curl check during incident
Token validity and token-source drift | GitHub scheduled probe (/status, /tokens) | Rotate/reseed tokens in secrets manager
UI shell availability (/ui) | GitHub scheduled probe | Manual browser check during incident
Sidecar readiness (/opencode-router/health) | GitHub scheduled probe when router is enabled | Deep inspection of sidecar/startup logs
Deploy failure notification | GitHub Actions + Slack post | Engineer triage and escalation
Rollback execution | N/A | Operator executes rollback steps
Upstream release cadence | N/A | Team follows weekly sync process

7) Operator quick commands

# Railway logs (if linked) - inspect startup/auth failures
railway logs --service "openwork" --environment "production"
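
Log output from the command above can be scanned for the runtime error patterns listed in Section 2. The sketch below assumes the patterns appear verbatim (case-sensitive) in log lines. The function name is hypothetical.

```python
from typing import Dict, Iterable

# Patterns copied from the "Runtime error patterns" row in Section 2.
ERROR_PATTERNS = (
    "openwork-server download failed",
    "opencode download failed",
    "opencodeRouter download failed",
    "OpenCodeRouter health server is unavailable",
    "Invalid bearer token",
    "Invalid host token",
    "Missing token scope",
)


def find_error_patterns(lines: Iterable[str]) -> Dict[str, int]:
    """Count how many log lines contain each known error pattern."""
    hits: Dict[str, int] = {}
    for line in lines:
        for pattern in ERROR_PATTERNS:
            if pattern in line:
                hits[pattern] = hits.get(pattern, 0) + 1
    return hits
```

To use it on live logs, pipe the railway logs output into a script that calls find_error_patterns(sys.stdin) and prints the result. An empty result means none of the known patterns appeared, not that the logs are clean.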