# HyperDX Operations Runbook

Operational procedures for Brainforge’s self-hosted HyperDX observability stack.
## Service Inventory
| Service | Railway Service Name | Port | Health Check |
|---|---|---|---|
| HyperDX Server | hyperdx | 80 | GET /health → 200 |
| HyperDX MongoDB | hyperdx-mongo | 27017 | N/A (internal) |
| HyperDX ClickHouse | hyperdx-clickhouse-v2 | 8123 (HTTP), 9000 (Native) | GET /ping → Ok. |
| OTel Collector | hyperdx-otel-collector-v2 | 4318 (HTTP), 4317 (gRPC) | GET /v1/metrics → 200 |
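
The health checks above can be swept in one pass. A minimal sketch, assuming `curl` is available; the `<...>` URLs are placeholders for your Railway-assigned domains:

```shell
#!/bin/sh
# Probe each service's health endpoint and report UP/DOWN.
check() {
  name="$1"; url="$2"; expect="$3"
  # -w prints the HTTP status; curl prints 000 if the request fails entirely.
  code=$(curl -s -o /dev/null -m 5 -w "%{http_code}" "$url" 2>/dev/null)
  if [ "$code" = "$expect" ]; then
    echo "$name: UP ($code)"
  else
    echo "$name: DOWN (got ${code:-000}, want $expect)"
  fi
}

check "hyperdx"        "https://<hyperdx-url>/health"            200
check "clickhouse"     "https://<clickhouse-url>:8123/ping"      200
check "otel-collector" "https://<collector-url>:4318/v1/metrics" 200
```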
⚠️ Stale v1 services to delete: The following v1 services are superseded by the v2 variants above and should be removed from the Railway dashboard (requires 2FA). Before deletion, export each v1 service’s environment variables via Railway dashboard as a recovery reference.
| Service | Railway Service ID | Status |
|---|---|---|
| hyperdx-clickhouse (v1) | 22f472f8 | Superseded by hyperdx-clickhouse-v2 |
| hyperdx-otel-collector (v1) | 6221ec9c | Superseded by hyperdx-otel-collector-v2 |

Recovery: If v1 services need to be restored, use the exported env vars to recreate them. All configurations are also source-controlled under `apps/platform/deploy/hyperdx-clickhouse/` and `apps/platform/deploy/hyperdx-otel-collector/`.
## Key Rotation

### Rotate HYPERDX_API_KEY

Ingestion key rotation (Platform, Slack Assistant, OpenCode Worker):

1. Generate a new key in the HyperDX UI: Settings → API Keys → Create Key.
2. Update the key in all service environments simultaneously to avoid data gaps:
   - Platform: `HYPERDX_API_KEY` + `NEXT_PUBLIC_HYPERDX_API_KEY`
   - Slack Assistant: `HYPERDX_API_KEY`
   - OpenCode Worker: `HYPERDX_API_KEY`
3. Deploy or restart all services to pick up the new key.
4. Revoke the old key in the HyperDX UI.
5. Verify traces are flowing by triggering a test request and checking HyperDX search.
ClickHouse password rotation:

1. Generate a new password: `openssl rand -base64 32`.
2. Update `CLICKHOUSE_PASSWORD` in:
   - `hyperdx-clickhouse-v2` Railway env vars
   - `hyperdx` Railway env vars
   - `hyperdx-otel-collector-v2` Railway env vars
3. Redeploy ClickHouse first (it stores the password), then the collector, then the HyperDX server.
4. Verify connectivity: `clickhouse-client --host <host> --port 9000 --user default --password <new-password>` should connect.
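
The password-rotation steps above can be sketched as a script. This is a sketch, not a turnkey tool: the Railway env-var updates stay manual, and `<host>` is a placeholder:

```shell
#!/bin/sh
# Sketch of the ClickHouse password rotation. Assumes openssl is installed
# and, for the final check, clickhouse-client.
set -eu

# Step 1: generate the new password (32 random bytes -> 44 base64 chars).
NEW_PW=$(openssl rand -base64 32)
echo "Generated new password (${#NEW_PW} chars)"

# Steps 2-3: update CLICKHOUSE_PASSWORD in the three Railway services, then
# redeploy in order: clickhouse -> collector -> hyperdx server.
# (Done in the Railway dashboard; not scripted here.)

# Step 4: verify connectivity once the redeploys finish:
# clickhouse-client --host <host> --port 9000 --user default --password "$NEW_PW"
```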
## ClickHouse Retention Adjustment

Default retention is 30 days. To adjust:

Option A: Via ClickHouse SQL

```sql
-- Check the current TTL on the main table
SHOW CREATE TABLE hyperdx.log_entries;

-- Set the TTL to 90 days (example). Note that ClickHouse TTL takes an
-- expiry-time expression, not a boolean condition.
ALTER TABLE hyperdx.log_entries
MODIFY TTL timestamp + INTERVAL 90 DAY;
```

Option B: Via Dockerfile env var

Edit `apps/platform/deploy/hyperdx-clickhouse/Dockerfile` and update `CLICKHOUSE_RETENTION_DAYS` (if you added this env var). Rebuild and redeploy.
## Monitor Disk Usage

```sql
SELECT
    database,
    table,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    sum(rows) AS rows
FROM system.parts
WHERE database = 'hyperdx' AND active
GROUP BY database, table;
```

Set a Railway alert when disk usage exceeds 80%.
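
The 80% threshold above (and the 90% critical level from the Monitoring Alerts section) can be encoded as a small classifier, e.g. for a cron job that reads the volume usage from `df` or Railway metrics. A sketch:

```shell
#!/bin/sh
# Map a disk-used percentage to an alert level.
# 80%+ -> WARNING, 90%+ -> CRITICAL, otherwise OK.
disk_alert_level() {
  pct="$1"
  if [ "$pct" -ge 90 ]; then
    echo "CRITICAL"
  elif [ "$pct" -ge 80 ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}

disk_alert_level 75   # -> OK
disk_alert_level 85   # -> WARNING
disk_alert_level 95   # -> CRITICAL
```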
## Troubleshooting

### Missing Traces

Symptoms: Traces appear in Langfuse but not in HyperDX, or vice versa.

1. Check the kill switch:

   ```sh
   # In the suspect service's logs, search for:
   grep "HyperDX" <log-output>
   # If disabled, you'll see no [hyperdx]-prefixed log lines.
   ```

2. Verify OTLP endpoint connectivity:

   ```sh
   curl -s https://<hyperdx-url>/v2/traces -X POST \
     -H "Content-Type: application/json" \
     -d '{"resourceSpans":[]}' -w "\nHTTP %{http_code}\n"
   # Expect: HTTP 200 (or 401 if a key is required, confirming the endpoint is up)
   ```

3. Check OTel Collector logs:
   - Railway → `hyperdx-otel-collector-v2` → Deployments → Latest → Logs
   - Look for `ExporterConfig` errors or `connection refused` messages.

4. Verify `OTEL_SERVICE_NAME` matches the expected values:
   - `brainforge-platform` (Platform)
   - `brainforge-slack-assistant` (Slack)
   - `brainforge-opencode-worker` (Worker)
### High Memory / Slow Queries

1. Check the ClickHouse query log:

   ```sql
   SELECT query, query_duration_ms, memory_usage
   FROM system.query_log
   WHERE event_date = today()
   ORDER BY query_duration_ms DESC
   LIMIT 10;
   ```

2. Reduce retention if disk space is low (see Retention Adjustment above).

3. Restart ClickHouse if it’s unresponsive:

   ```sh
   # Via Railway CLI
   railway restart --service hyperdx-clickhouse-v2
   ```
### HyperDX Server Won’t Start

- Check Railway deployment logs for `HYPERDX_LOG_LEVEL` errors.
- Verify `MONGO_URL`, `CLICKHOUSE_URL`, and `INGESTION_API_KEY` are set.
- If using S3/S3-compatible storage for session replay, verify the `S3_*` env vars.
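
The env-var check above can be scripted so a restart fails fast with a clear message. A minimal sketch; run it inside the HyperDX server’s environment:

```shell
#!/bin/sh
# Report which of the required env vars are missing. Returns non-zero if
# any are unset or empty.
require_env() {
  missing=0
  for var in "$@"; do
    # Indirect lookup of $var's value in POSIX sh.
    val=$(eval "printf '%s' \"\${$var:-}\"")
    if [ -z "$val" ]; then
      echo "MISSING: $var"
      missing=1
    fi
  done
  return $missing
}

require_env MONGO_URL CLICKHOUSE_URL INGESTION_API_KEY \
  || echo "Set the variables above before restarting the server."
```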
### OTel Collector Not Forwarding

- Check config: verify `custom.config.yaml` in the collector service mounts correctly.
- Test the receiver directly:

  ```sh
  curl -s http://<otel-collector>:4318/v1/traces -X POST \
    -H "Content-Type: application/json" \
    -d '{"resourceSpans":[{"resource":{"attributes":[{"key":"service.name","value":{"stringValue":"test"}}]},"scopeSpans":[{"spans":[{"name":"test-span","traceId":"0102030405060708090a0b0c0d0e0f10","spanId":"0102030405060708","kind":1}]}]}]}' \
    -w "\nHTTP %{http_code}\n"
  ```

- Check collector logs for `clickhouseexporter` errors.
### Session Replay Not Working

- Verify `NEXT_PUBLIC_HYPERDX_API_KEY` is set in the Platform’s client env.
- Check that `HyperDXProvider` wraps the app (search for `<HyperDXProvider>` in `Providers.tsx`).
- Verify `NEXT_PUBLIC_HYPERDX_ENABLED !== 'false'`.
- Check browser DevTools for `@hyperdx/browser` script-loading errors.
## Incident Response

### HyperDX Completely Down

- Platform/Slack/Worker keep running: kill switches default to enabled, but if `HYPERDX_ENABLED` is not explicitly set, traces will attempt to send and fail silently. This is acceptable; there is no traffic loss.
- Explicit kill switch if needed: set `HYPERDX_ENABLED=false` in all services to stop connection attempts and log noise.
- Restart order: ClickHouse → OTel Collector → HyperDX Server.
- Verify: check the health endpoints for all three services after restart.
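
The restart order above can be emitted as a dry-run script, so the commands are reviewed before they run. A sketch, assuming the Railway CLI is installed and linked to the project (pipe the output to `sh` to execute):

```shell
#!/bin/sh
# Print the restart commands in the required order:
# ClickHouse -> OTel Collector -> HyperDX Server.
for svc in hyperdx-clickhouse-v2 hyperdx-otel-collector-v2 hyperdx; do
  echo "railway restart --service $svc"
done
```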
### Data Loss / Corruption in ClickHouse

1. Check ClickHouse replica status (if using replicated tables):

   ```sql
   SELECT database, table, is_readonly, queue_size, absolute_delay
   FROM system.replicas;
   ```

2. Replay from the OTel Collector backlog: the collector has internal queuing, so if ClickHouse was down briefly, data may catch up automatically.

3. Last resort: drop and recreate the `hyperdx` database using the `init-db.sh` script from `apps/platform/deploy/hyperdx-clickhouse/`:

   ```sh
   # WARNING: This destroys all historical data
   clickhouse-client --host <host> --port 9000 --user default --password <pw> \
     -q "DROP DATABASE IF EXISTS hyperdx;"
   # Then redeploy the ClickHouse service to re-run init-db.sh
   ```
### Partial Outage (One Service Down)

- ClickHouse down: the HyperDX server will fail to ingest. Traces buffer in the OTel Collector’s memory until ClickHouse recovers (limited buffering). If ClickHouse is down for more than 5 minutes, some data may be lost.
- OTel Collector down: all services fall back to direct OTLP sends. Check whether the HyperDX server’s `/v2/traces` endpoint handles the load; if not, set `HYPERDX_ENABLED=false` temporarily.
- HyperDX Server down: the UI is inaccessible and ingestion stops. ClickHouse continues to store data. Restart the server to resume.
## Monitoring Alerts

Recommended Railway alerts:

| Service | Condition | Severity |
|---|---|---|
| hyperdx | CPU > 80% for 5 min | Warning |
| hyperdx-clickhouse-v2 | Disk > 80% | Warning; > 90% → Critical |
| hyperdx-otel-collector-v2 | Restarts > 3 in 10 min | Critical |
| hyperdx | Health check failure | Critical |
## Backup & Recovery

- ClickHouse data: stored on persistent Railway volumes. A periodic `clickhouse-dump` is optional for cross-region recovery.
- HyperDX config: all config lives in Railway env vars and the repo (`apps/hyperdx/`, `apps/platform/deploy/hyperdx-*`). Restoring means redeploying from the repo.
- Session replay objects: stored in S3-compatible storage (if configured). Ensure S3 bucket versioning is enabled.
## Environment Separation
| Environment | HYPERDX_API_KEY | OTEL_SERVICE_NAME Suffix | ClickHouse Password |
|---|---|---|---|
| Production | <prod-key> | brainforge-platform, etc. | <prod-password> |
| Staging | <staging-key> | brainforge-platform-staging, etc. | <staging-password> |
| Development | Not required | Not required (kill switch off) | N/A |
Never share keys or passwords between environments.