# HyperDX Operations Runbook

Operational procedures for Brainforge’s self-hosted HyperDX observability stack.
## Service Inventory
| Service | Railway Service Name | Port | Health Check |
|---|---|---|---|
| HyperDX Server | hyperdx | 80 | GET /health → 200 |
| HyperDX MongoDB | hyperdx-mongo | 27017 | N/A (internal) |
| HyperDX ClickHouse | hyperdx-clickhouse-v2 | 8123 (HTTP), 9000 (Native) | GET /ping → Ok. |
| OTel Collector | hyperdx-otel-collector-v2 | 4318 (HTTP), 4317 (gRPC) | GET /v1/metrics → 200 |
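
The health checks above can be swept in one pass. A minimal sketch, assuming `curl` is available; the `<...>` URLs are placeholders for your Railway-assigned domains:

```shell
#!/bin/sh
# Probe each service's health endpoint and report UP/DOWN.
check() {
  name="$1"; url="$2"; expect="$3"
  # -w prints the HTTP status; curl prints 000 if the request fails entirely.
  code=$(curl -s -o /dev/null -m 5 -w "%{http_code}" "$url" 2>/dev/null)
  if [ "$code" = "$expect" ]; then
    echo "$name: UP ($code)"
  else
    echo "$name: DOWN (got ${code:-000}, want $expect)"
  fi
}

check "hyperdx"        "https://<hyperdx-url>/health"            200
check "clickhouse"     "https://<clickhouse-url>:8123/ping"      200
check "otel-collector" "https://<collector-url>:4318/v1/metrics" 200
```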
⚠️ Stale v1 services to delete: The following v1 services are superseded by the v2 variants above and should be removed from the Railway dashboard (requires 2FA). Before deletion, export each v1 service’s environment variables via Railway dashboard as a recovery reference.
| Service | Railway Service ID | Status |
|---|---|---|
| hyperdx-clickhouse (v1) | 22f472f8 | Superseded by hyperdx-clickhouse-v2 |
| hyperdx-otel-collector (v1) | 6221ec9c | Superseded by hyperdx-otel-collector-v2 |

Recovery: If v1 services need to be restored, use the exported env vars to recreate them. All configurations are also source-controlled under `apps/platform/deploy/hyperdx-clickhouse/` and `apps/platform/deploy/hyperdx-otel-collector/`.
## Key Rotation

### Rotate HYPERDX_API_KEY

Ingestion key rotation (Platform, Slack Assistant, OpenCode Worker):

1. Generate a new key in the HyperDX UI: Settings → API Keys → Create Key.
2. Update the key in all service environments simultaneously to avoid data gaps:
   - Platform: `HYPERDX_API_KEY` + `NEXT_PUBLIC_HYPERDX_API_KEY`
   - Slack Assistant: `HYPERDX_API_KEY`
   - OpenCode Worker: `HYPERDX_API_KEY`
3. Deploy or restart all services to pick up the new key.
4. Revoke the old key in the HyperDX UI.
5. Verify traces are flowing by triggering a test request and checking HyperDX search.
ClickHouse password rotation:

1. Generate a new password: `openssl rand -base64 32`.
2. Update `CLICKHOUSE_PASSWORD` in:
   - `hyperdx-clickhouse-v2` Railway env vars
   - `hyperdx` Railway env vars
   - `hyperdx-otel-collector-v2` Railway env vars
3. Redeploy ClickHouse first (it stores the password), then the collector, then the HyperDX server.
4. Verify connectivity: `clickhouse-client --host <host> --port 9000 --user default --password <new-password>` should connect.
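
The password-rotation steps above can be sketched as a script. This is a sketch, not a turnkey tool: the Railway env-var updates stay manual, and `<host>` is a placeholder:

```shell
#!/bin/sh
# Sketch of the ClickHouse password rotation. Assumes openssl is installed
# and, for the final check, clickhouse-client.
set -eu

# Step 1: generate the new password (32 random bytes -> 44 base64 chars).
NEW_PW=$(openssl rand -base64 32)
echo "Generated new password (${#NEW_PW} chars)"

# Steps 2-3: update CLICKHOUSE_PASSWORD in the three Railway services, then
# redeploy in order: clickhouse -> collector -> hyperdx server.
# (Done in the Railway dashboard; not scripted here.)

# Step 4: verify connectivity once the redeploys finish:
# clickhouse-client --host <host> --port 9000 --user default --password "$NEW_PW"
```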
## ClickHouse Retention Adjustment

Default retention is 30 days. To adjust:

Option A: Via ClickHouse SQL

```sql
-- Check the current TTL on the main table
SHOW CREATE TABLE hyperdx.log_entries;

-- Set the TTL to 90 days (example). Note that ClickHouse TTL takes an
-- expiry-time expression, not a boolean condition.
ALTER TABLE hyperdx.log_entries
MODIFY TTL timestamp + INTERVAL 90 DAY;
```

Option B: Via Dockerfile env var

Edit `apps/platform/deploy/hyperdx-clickhouse/Dockerfile` and update `CLICKHOUSE_RETENTION_DAYS` (if you added this env var). Rebuild and redeploy.
## Monitor Disk Usage

```sql
SELECT
    database,
    table,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    sum(rows) AS rows
FROM system.parts
WHERE database = 'hyperdx' AND active
GROUP BY database, table;
```

Set a Railway alert when disk usage exceeds 80%.
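
The 80% threshold above (and the 90% critical level from the Monitoring Alerts section) can be encoded as a small classifier, e.g. for a cron job that reads the volume usage from `df` or Railway metrics. A sketch:

```shell
#!/bin/sh
# Map a disk-used percentage to an alert level.
# 80%+ -> WARNING, 90%+ -> CRITICAL, otherwise OK.
disk_alert_level() {
  pct="$1"
  if [ "$pct" -ge 90 ]; then
    echo "CRITICAL"
  elif [ "$pct" -ge 80 ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}

disk_alert_level 75   # -> OK
disk_alert_level 85   # -> WARNING
disk_alert_level 95   # -> CRITICAL
```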
## Troubleshooting

### Missing Traces

Symptoms: Traces appear in Langfuse but not in HyperDX, or vice versa.

1. Check the kill switch:

   ```sh
   # In the suspect service's logs, search for:
   grep "HyperDX" <log-output>
   # If disabled, you'll see no [hyperdx]-prefixed log lines.
   ```

2. Verify OTLP endpoint connectivity:

   ```sh
   curl -s https://<hyperdx-url>/v2/traces -X POST \
     -H "Content-Type: application/json" \
     -d '{"resourceSpans":[]}' -w "\nHTTP %{http_code}\n"
   # Expect: HTTP 200 (or 401 if a key is required, confirming the endpoint is up)
   ```

3. Check OTel Collector logs:
   - Railway → `hyperdx-otel-collector-v2` → Deployments → Latest → Logs
   - Look for `ExporterConfig` errors or `connection refused` messages.

4. Verify `OTEL_SERVICE_NAME` matches the expected values:
   - `brainforge-platform` (Platform)
   - `brainforge-slack-assistant` (Slack)
   - `brainforge-opencode-worker` (Worker)
### High Memory / Slow Queries

1. Check the ClickHouse query log:

   ```sql
   SELECT query, query_duration_ms, memory_usage
   FROM system.query_log
   WHERE event_date = today()
   ORDER BY query_duration_ms DESC
   LIMIT 10;
   ```

2. Reduce retention if disk space is low (see Retention Adjustment above).

3. Restart ClickHouse if it’s unresponsive:

   ```sh
   # Via Railway CLI
   railway restart --service hyperdx-clickhouse-v2
   ```
### HyperDX Server Won’t Start

- Check Railway deployment logs for `HYPERDX_LOG_LEVEL` errors.
- Verify `MONGO_URL`, `CLICKHOUSE_URL`, and `INGESTION_API_KEY` are set.
- If using S3/S3-compatible storage for session replay, verify the `S3_*` env vars.
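
The env-var check above can be scripted so a restart fails fast with a clear message. A minimal sketch; run it inside the HyperDX server’s environment:

```shell
#!/bin/sh
# Report which of the required env vars are missing. Returns non-zero if
# any are unset or empty.
require_env() {
  missing=0
  for var in "$@"; do
    # Indirect lookup of $var's value in POSIX sh.
    val=$(eval "printf '%s' \"\${$var:-}\"")
    if [ -z "$val" ]; then
      echo "MISSING: $var"
      missing=1
    fi
  done
  return $missing
}

require_env MONGO_URL CLICKHOUSE_URL INGESTION_API_KEY \
  || echo "Set the variables above before restarting the server."
```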
### OTel Collector Not Forwarding

- Check config: verify `custom.config.yaml` in the collector service mounts correctly.
- Test the receiver directly:

  ```sh
  curl -s http://<otel-collector>:4318/v1/traces -X POST \
    -H "Content-Type: application/json" \
    -d '{"resourceSpans":[{"resource":{"attributes":[{"key":"service.name","value":{"stringValue":"test"}}]},"scopeSpans":[{"spans":[{"name":"test-span","traceId":"0102030405060708090a0b0c0d0e0f10","spanId":"0102030405060708","kind":1}]}]}]}' \
    -w "\nHTTP %{http_code}\n"
  ```

- Check collector logs for `clickhouseexporter` errors.
### Session Replay Not Working

- Verify `NEXT_PUBLIC_HYPERDX_API_KEY` is set in the Platform’s client env.
- Check that `HyperDXProvider` wraps the app (search for `<HyperDXProvider>` in `Providers.tsx`).
- Verify `NEXT_PUBLIC_HYPERDX_ENABLED !== 'false'`.
- Check browser DevTools for `@hyperdx/browser` script-loading errors.
## Incident Response

### HyperDX Completely Down

- Platform/Slack/Worker keep running: kill switches default to enabled, but if `HYPERDX_ENABLED` is not explicitly set, traces will attempt to send and fail silently. This is acceptable; there is no traffic loss.
- Explicit kill switch if needed: set `HYPERDX_ENABLED=false` in all services to stop connection attempts and log noise.
- Restart order: ClickHouse → OTel Collector → HyperDX Server.
- Verify: check the health endpoints for all three services after restart.
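
The restart order above can be emitted as a dry-run script, so the commands are reviewed before they run. A sketch, assuming the Railway CLI is installed and linked to the project (pipe the output to `sh` to execute):

```shell
#!/bin/sh
# Print the restart commands in the required order:
# ClickHouse -> OTel Collector -> HyperDX Server.
for svc in hyperdx-clickhouse-v2 hyperdx-otel-collector-v2 hyperdx; do
  echo "railway restart --service $svc"
done
```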
### Data Loss / Corruption in ClickHouse

1. Check ClickHouse replica status (if using replicated tables):

   ```sql
   SELECT database, table, is_readonly, queue_size, absolute_delay
   FROM system.replicas;
   ```

2. Replay from the OTel Collector backlog: the collector has internal queuing, so if ClickHouse was down briefly, data may catch up automatically.

3. Last resort: drop and recreate the `hyperdx` database using the `init-db.sh` script from `apps/platform/deploy/hyperdx-clickhouse/`:

   ```sh
   # WARNING: This destroys all historical data
   clickhouse-client --host <host> --port 9000 --user default --password <pw> \
     -q "DROP DATABASE IF EXISTS hyperdx;"
   # Then redeploy the ClickHouse service to re-run init-db.sh
   ```
### Partial Outage (One Service Down)

- ClickHouse down: the HyperDX server will fail to ingest. Traces buffer in the OTel Collector’s memory until ClickHouse recovers (limited buffering). If ClickHouse is down for more than 5 minutes, some data may be lost.
- OTel Collector down: all services fall back to direct OTLP sends. Check whether the HyperDX server’s `/v2/traces` endpoint handles the load; if not, set `HYPERDX_ENABLED=false` temporarily.
- HyperDX Server down: the UI is inaccessible and ingestion stops. ClickHouse continues to store data. Restart the server to resume.
## Monitoring Alerts

Recommended Railway alerts:

| Service | Condition | Severity |
|---|---|---|
| hyperdx | CPU > 80% for 5 min | Warning |
| hyperdx-clickhouse-v2 | Disk > 80% | Warning; > 90% → Critical |
| hyperdx-otel-collector-v2 | Restarts > 3 in 10 min | Critical |
| hyperdx | Health check failure | Critical |
## Backup & Recovery

- ClickHouse data: stored on persistent Railway volumes. A periodic `clickhouse-dump` is optional for cross-region recovery.
- HyperDX config: all config lives in Railway env vars and the repo (`apps/hyperdx/`, `apps/platform/deploy/hyperdx-*`). Restoring means redeploying from the repo.
- Session replay objects: stored in S3-compatible storage (if configured). Ensure S3 bucket versioning is enabled.
## Environment Separation
| Environment | HYPERDX_API_KEY | OTEL_SERVICE_NAME Suffix | ClickHouse Password |
|---|---|---|---|
| Production | <prod-key> | brainforge-platform, etc. | <prod-password> |
| Staging | <staging-key> | brainforge-platform-staging, etc. | <staging-password> |
| Development | Not required | Not required (kill switch off) | N/A |
Never share keys or passwords between environments.