Railway Setup Review
Date: 2026-03-07
Status: Draft
Audience: Internal engineering / platform
Summary
This review recommends tightening the current Railway setup without turning the work into a broader hosting redesign. The near-term goals are to reduce documentation drift, shrink unnecessary blast radius, and make the current services easier to deploy and troubleshoot.
This is a docs-first recommendation memo. It does not propose expanding OpenWork rollout until the hosted runtime is stable.
Assumptions
- Railway is the current canonical hosting target for `apps/platform` and the active Slack assistant services.
- Some deploy state will remain in Railway console settings, so the repo cannot be the only source of truth.
- Platform and Slack cleanup can proceed even while OpenWork remains blocked.
Decisions
- Create a Railway operator runbook as the canonical operator reference for live services.
- Scope the internal deploy-status dashboard to an explicit allowlist of Railway projects.
- Resolve the Slack assistant prod/test operating model before changing its docs and manifest.
- Keep OpenWork on a separate Railway service/domain and do not expand rollout until hosted runtime gates pass.
Current Railway Shape Observed
- `brainforge-platform`: `web`
- `slack-assistant`: `Nudge-Agent`, `brainforge-test-assistant`, `brainforge-assistant-gh`
- `google-workspace-mcp`: `google-workspace-mcp`
- `openwork-labs`: `openwork-host`
Recommendations
1. Create a Railway operator runbook
Current deploy behavior still depends on manual Railway configuration, especially for `rootDirectory`, builders, domains, and environment setup. We should document the active operating model in one place.
The runbook should cover, per service:
- Railway project name
- service name
- intended environment model
- root directory
- builder and fallback config
- build and start expectations
- required domains and webhook endpoints
- required env vars and secret owners
- post-deploy verification steps
- rollback path
Important framing: this should be the canonical operator reference, not a claim that all live state is source-controlled.
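The per-service fields above can be captured as a typed record, which keeps runbook entries uniform and makes gaps obvious. This is a minimal sketch: the field names, the builder values, and the example entry for the platform web service are illustrative assumptions, not an existing schema in the repo.

```typescript
// Sketch of one runbook entry as a typed record.
// Field names mirror the checklist in this memo; values are illustrative.
interface RunbookEntry {
  railwayProject: string;                      // Railway project name
  service: string;                             // service name within the project
  environmentModel: "single" | "prod-test-split"; // intended environment model
  rootDirectory: string;                       // e.g. "apps/platform"
  builder: { primary: string; fallback?: string }; // builder and fallback config
  buildCommand: string;                        // build expectation
  startCommand: string;                        // start expectation
  domains: string[];                           // required domains / webhook endpoints
  envVars: { name: string; secretOwner: string }[]; // required env vars and owners
  verifySteps: string[];                       // post-deploy verification
  rollback: string;                            // rollback path
}

// Hypothetical example entry; builder names and domain are assumptions.
const platformWeb: RunbookEntry = {
  railwayProject: "brainforge-platform",
  service: "web",
  environmentModel: "single",
  rootDirectory: "apps/platform",
  builder: { primary: "nixpacks", fallback: "dockerfile" },
  buildCommand: "npm run build",
  startCommand: "npm run start",
  domains: ["platform.example.com"],
  envVars: [{ name: "RAILWAY_PROJECT_IDS", secretOwner: "platform team" }],
  verifySteps: ["GET /health returns 200"],
  rollback: "redeploy the previous Railway deployment from the console",
};
```

A flat array of such entries doubles as the service inventory that later steps (allowlist, Slack model) draw from.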
2. Clean up Platform deployment and env drift
Platform docs still reflect older hosting assumptions in a few places:
- `apps/platform/README.md` still reads like a Vercel-first setup and points readers to the wrong `.env.local` location.
- `apps/platform/scripts/README.md` still references older env names such as `OPENAI_API_KEY` and `OPENAI_BASE_URL`.
- `package.json` still describes the root as a Heroku deploy package.
These should be updated so the documented local and deploy workflow consistently reflects:
- Railway-first hosting
- `apps/platform/.env.local`
- `apps/platform/scripts/pull-railway-env.js`
- the current Azure-based env contract
3. Tighten deploy-status dashboard scope
The internal deploy-status page should not enumerate every Railway project visible to the configured token.
Today, `apps/platform/src/lib/railwayApiServer.ts` falls back to listing all accessible projects when `RAILWAY_PROJECT_IDS` is unset. That is convenient, but it broadens blast radius and increases dashboard noise.
Recommendation:
- define the intended allowlist from the active service inventory
- document `RAILWAY_PROJECT_IDS` clearly in the Platform env docs
- make production behavior fail closed, or return a clear configuration error, when the allowlist is missing
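The fail-closed behavior above can be sketched as a small resolver. The comma-separated format for `RAILWAY_PROJECT_IDS` is an assumption; the real contract should be whatever the Platform env docs end up specifying.

```typescript
// Sketch of fail-closed allowlist handling for RAILWAY_PROJECT_IDS.
// Assumes a comma-separated list of project IDs; adjust to the
// documented env contract once it exists.
function resolveProjectAllowlist(
  env: Record<string, string | undefined>
): string[] {
  const raw = env.RAILWAY_PROJECT_IDS?.trim();
  if (!raw) {
    // Fail closed instead of falling back to "all accessible projects".
    throw new Error(
      "RAILWAY_PROJECT_IDS is not set; refusing to enumerate all Railway projects"
    );
  }
  return raw
    .split(",")
    .map((id) => id.trim())
    .filter((id) => id.length > 0);
}
```

In production, the dashboard would surface this thrown error as an explicit configuration problem rather than silently widening scope.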
4. Fix Slack assistant Railway drift after choosing the operating model
The Slack assistant docs and manifest are inconsistent with the current repo and service layout:
- `apps/slack-apps/brainforge-assistant/README.md` uses the wrong monorepo root directory path
- `apps/slack-apps/brainforge-assistant/manifest.yaml` mixes a real Railway callback URL with placeholders
- the live Railway project has multiple services, but the docs do not explain which service is canonical or how prod/test are intended to work
Before cleanup, decide whether Slack prod/test split is by service, by Railway environment, or both. Then update the repo docs and manifest policy to match that model.
5. Keep OpenWork blocked from broader rollout
OpenWork should remain isolated on its own Railway service/domain and stay blocked from broader rollout until the hosted runtime passes explicit checks.
The current feasibility note shows that base host health passes, but the interactive runtime paths needed for actual use are still failing in hosted validation.
Hosted runtime gates:
- `GET /health` passes
- OpenCode proxy health passes
- session creation passes
- SSE event stream passes
- persistent `/data` and `/workspace` behavior is validated
- internal-only access gate remains enabled until the above pass in Railway
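The gates above reduce to a single go/no-go decision: every check must pass before the rollout block is lifted. The sketch below only models that decision; how each check is actually performed (HTTP probe, SSE client, volume test) is out of scope, and the sample values are illustrative, not measured results.

```typescript
// Minimal sketch of the hosted runtime gate decision.
// Gate names mirror the list in this memo.
type GateName =
  | "health"
  | "opencodeProxyHealth"
  | "sessionCreation"
  | "sseEventStream"
  | "persistentVolumes";

function rolloutUnblocked(results: Record<GateName, boolean>): boolean {
  // Any failing gate keeps OpenWork internal-only.
  return Object.values(results).every(Boolean);
}

// Illustrative state: base health can pass while the interactive
// runtime paths still fail, which must keep the rollout blocked.
const sampleState: Record<GateName, boolean> = {
  health: true,
  opencodeProxyHealth: false,
  sessionCreation: false,
  sseEventStream: false,
  persistentVolumes: false,
};
```

The point of the all-gates-must-pass shape is exactly the risk noted later in this memo: a service can look healthy on `/health` while still being unusable interactively.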
Recommended Sequence
1. Build the active Railway service inventory and operator runbook.
2. Use that inventory to define the deploy-status allowlist and Slack operating model.
3. Update Platform and Slack docs to match the intended live model.
4. Apply deploy-status scoping changes only after the allowlist is agreed.
5. Keep OpenWork isolated and gated until hosted runtime checks pass.
Risks To Watch
- dashboard scope changes could hide expected services if the service inventory is incomplete
- Slack cleanup could churn without a clear prod/test model
- a runbook that only inventories config, without verification and rollback steps, will not help much operationally
- OpenWork can appear healthy while still failing interactive runtime paths
Outcome
If we implement the above, Railway setup becomes easier to reason about, easier to onboard into, and less likely to fail because of stale docs or hidden console state.
The main non-goal is expanding OpenWork rollout before the current hosted runtime blocker is resolved.