Railway Setup Review

Date: 2026-03-07
Status: Draft
Audience: Internal engineering / platform

Summary

This review recommends tightening the current Railway setup without turning it into a broader hosting redesign. The near-term goals are to reduce documentation drift, shrink unnecessary blast radius, and make current services easier to deploy and troubleshoot.

This is a docs-first recommendation memo. It does not propose expanding OpenWork rollout until the hosted runtime is stable.

Assumptions

  • Railway is the current canonical hosting target for apps/platform and the active Slack assistant services.
  • Some deploy state will remain in Railway console settings, so the repo cannot be the only source of truth.
  • Platform and Slack cleanup can proceed even while OpenWork remains blocked.

Decisions

  • Create a Railway operator runbook as the canonical operator reference for live services.
  • Scope the internal deploy-status dashboard to an explicit allowlist of Railway projects.
  • Resolve the Slack assistant prod/test operating model before changing its docs and manifest.
  • Keep OpenWork on a separate Railway service/domain and do not expand rollout until hosted runtime gates pass.

Current Railway Shape Observed

  • brainforge-platform: web
  • slack-assistant: Nudge-Agent, brainforge-test-assistant, brainforge-assistant-gh
  • google-workspace-mcp: google-workspace-mcp
  • openwork-labs: openwork-host

Recommendations

1. Create a Railway operator runbook

Current deploy behavior still depends on manual Railway configuration, especially for rootDirectory, builders, domains, and environment setup. We should document the active operating model in one place.

The runbook should cover, per service:

  • Railway project name
  • service name
  • intended environment model
  • root directory
  • builder and fallback config
  • build and start expectations
  • required domains and webhook endpoints
  • required env vars and secret owners
  • post-deploy verification steps
  • rollback path

Important framing: this should be the canonical operator reference, not a claim that all live state is source-controlled.

2. Clean up Platform deployment and env drift

Platform docs still reflect older hosting assumptions in a few places:

  • apps/platform/README.md still reads like a Vercel-first setup and points readers to the wrong .env.local location.
  • apps/platform/scripts/README.md still references older env names such as OPENAI_API_KEY and OPENAI_BASE_URL.
  • package.json still describes the root as a Heroku deploy package.

These should be updated so the documented local and deploy workflow consistently reflects:

  • Railway-first hosting
  • apps/platform/.env.local
  • apps/platform/scripts/pull-railway-env.js
  • the current Azure-based env contract

3. Tighten deploy-status dashboard scope

The internal deploy-status page should not enumerate every Railway project visible to the configured token.

Today, apps/platform/src/lib/railwayApiServer.ts falls back to listing all accessible projects when RAILWAY_PROJECT_IDS is unset. That is convenient, but it broadens blast radius and increases dashboard noise.

Recommendation:

  • define the intended allowlist from the active service inventory
  • document RAILWAY_PROJECT_IDS clearly in the Platform env docs
  • make production behavior fail closed, or return a clear configuration error, when the allowlist is missing
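The fail-closed behavior could look like the sketch below. The env var name `RAILWAY_PROJECT_IDS` and the current fallback behavior come from this memo; the function name, comma-separated format, and error message are illustrative assumptions, not the current `railwayApiServer.ts` implementation.

```typescript
// Sketch of a fail-closed allowlist resolver. Assumes RAILWAY_PROJECT_IDS
// holds a comma-separated list of Railway project IDs; the function name
// and error message are illustrative, not the current implementation.
function resolveProjectAllowlist(
  env: Record<string, string | undefined>
): string[] {
  const raw = env.RAILWAY_PROJECT_IDS;
  if (!raw || raw.trim() === "") {
    // Fail closed instead of falling back to "all accessible projects".
    throw new Error(
      "RAILWAY_PROJECT_IDS is not set; refusing to enumerate all Railway projects."
    );
  }
  return raw
    .split(",")
    .map((id) => id.trim())
    .filter((id) => id.length > 0);
}
```

With this shape, a missing allowlist surfaces as a clear configuration error in the dashboard instead of silently widening scope to every project the token can see.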

4. Fix Slack assistant Railway drift after choosing the operating model

The Slack assistant docs and manifest are inconsistent with the current repo and service layout:

  • apps/slack-apps/brainforge-assistant/README.md uses the wrong monorepo root directory path
  • apps/slack-apps/brainforge-assistant/manifest.yaml mixes a real Railway callback URL with placeholders
  • the live Railway project has multiple services, but the docs do not explain which service is canonical or how prod/test are intended to work

Before cleanup, decide whether the Slack prod/test split is by service, by Railway environment, or both. Then update the repo docs and manifest policy to match that model.

5. Keep OpenWork blocked from broader rollout

OpenWork should remain isolated on its own Railway service/domain and stay blocked from broader rollout until the hosted runtime passes explicit checks.

The current feasibility note shows that base host health passes, but the interactive runtime paths needed for actual use are still failing in hosted validation.

Hosted runtime gates:

  • GET /health passes
  • OpenCode proxy health passes
  • session creation passes
  • SSE event stream passes
  • persistent /data and /workspace behavior is validated
  • internal-only access gate remains enabled until the above pass in Railway
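The gates above can be treated as a single pass/fail record so the rollout decision stays explicit rather than impressionistic. This is a sketch: the gate field names mirror the checklist, and none of this code exists in the repo today.

```typescript
// Sketch: aggregate the hosted runtime gates into one explicit decision.
// Field names mirror the checklist above; this is illustrative, not
// existing code.
interface HostedRuntimeGates {
  health: boolean;            // GET /health
  opencodeProxyHealth: boolean;
  sessionCreation: boolean;
  sseEventStream: boolean;
  persistentVolumes: boolean; // /data and /workspace behavior validated
}

function rolloutAllowed(gates: HostedRuntimeGates): boolean {
  // Every gate must pass; otherwise the internal-only access gate stays on.
  return Object.values(gates).every(Boolean);
}
```

This encodes the memo's point that base host health alone is not sufficient: with only `health` passing and the interactive runtime gates failing, the rollout decision is still "no".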

Suggested Sequence

  1. Build the active Railway service inventory and operator runbook.
  2. Use that inventory to define the deploy-status allowlist and Slack operating model.
  3. Update Platform and Slack docs to match the intended live model.
  4. Apply deploy-status scoping changes only after the allowlist is agreed.
  5. Keep OpenWork isolated and gated until hosted runtime checks pass.

Risks To Watch

  • dashboard scope changes could hide expected services if the service inventory is incomplete
  • Slack cleanup could churn without a clear prod/test model
  • a runbook that only inventories config, without verification and rollback steps, will not help much operationally
  • OpenWork can appear healthy while still failing interactive runtime paths

Outcome

If we implement the above, Railway setup becomes easier to reason about, easier to onboard into, and less likely to fail because of stale docs or hidden console state.

The main non-goal is expanding OpenWork rollout before the current hosted runtime blocker is resolved.