HyperDX Operations Runbook

Operational procedures for Brainforge’s self-hosted HyperDX observability stack.

Service Inventory

Service              Railway Service Name        Port                         Health Check
HyperDX Server       hyperdx                     80                           GET /health → 200
HyperDX MongoDB      hyperdx-mongo               27017                        N/A (internal)
HyperDX ClickHouse   hyperdx-clickhouse-v2       8123 (HTTP), 9000 (Native)   GET /ping → Ok.
OTel Collector       hyperdx-otel-collector-v2   4318 (HTTP), 4317 (gRPC)     GET /v1/metrics → 200
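
For a quick post-deploy smoke test, the health endpoints above can be swept in one loop. A minimal sketch; the hostnames are placeholders for your Railway domains, and the expected responses are the ones in the Health Check column:

# Sweep the health endpoints; compare codes against the table above
for url in \
  "https://<hyperdx-url>/health" \
  "http://<clickhouse-host>:8123/ping" \
  "http://<otel-collector>:4318/v1/metrics"; do
  code=$(curl -s -o /dev/null -w "%{http_code}" "$url")
  echo "$url -> HTTP $code"
done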

⚠️ Stale v1 services to delete: The following v1 services are superseded by the v2 variants above and should be removed from the Railway dashboard (requires 2FA). Before deletion, export each v1 service’s environment variables via the Railway dashboard as a recovery reference.

Service                       Railway Service ID   Status
hyperdx-clickhouse (v1)       22f472f8             Superseded by hyperdx-clickhouse-v2
hyperdx-otel-collector (v1)   6221ec9c             Superseded by hyperdx-otel-collector-v2

Recovery: If v1 services need to be restored, use the exported env vars to recreate them. All configurations are also source-controlled under apps/platform/deploy/hyperdx-clickhouse/ and apps/platform/deploy/hyperdx-otel-collector/.
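
If you prefer the CLI to the dashboard for the export step, a hedged sketch; it assumes your Railway CLI version supports a --service flag on the variables command (confirm with railway variables --help):

# Snapshot each v1 service's env vars before deletion (flags are an
# assumption; verify against your Railway CLI version first)
for svc in hyperdx-clickhouse hyperdx-otel-collector; do
  railway variables --service "$svc" > "backup-${svc}-env.txt"
done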

Key Rotation

Rotate HYPERDX_API_KEY

Ingestion key rotation (Platform, Slack Assistant, OpenCode Worker):

  1. Generate a new key in HyperDX UI: Settings → API Keys → Create Key.
  2. Update the key in all service environments simultaneously to avoid data gaps:
    • Platform: HYPERDX_API_KEY + NEXT_PUBLIC_HYPERDX_API_KEY
    • Slack Assistant: HYPERDX_API_KEY
    • OpenCode Worker: HYPERDX_API_KEY
  3. Deploy or restart all services to pick up the new key.
  4. Revoke the old key in HyperDX UI.
  5. Verify traces are flowing by triggering a test request and checking HyperDX search.
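
To make step 5 concrete, one hedged end-to-end check: post an empty OTLP payload with the new key and expect a 200. This assumes the ingestion endpoint reads the API key from an authorization header, the way the HyperDX SDKs send it:

# Post an empty trace payload using the NEW key; expect HTTP 200
curl -s https://<hyperdx-url>/v2/traces -X POST \
  -H "Content-Type: application/json" \
  -H "authorization: <new-api-key>" \
  -d '{"resourceSpans":[]}' -w "\nHTTP %{http_code}\n"
# Repeat with the OLD key after revoking it; expect HTTP 401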

ClickHouse password rotation:

  1. Generate a new password: openssl rand -base64 32.
  2. Update CLICKHOUSE_PASSWORD in:
    • hyperdx-clickhouse-v2 Railway env vars
    • hyperdx Railway env vars
    • hyperdx-otel-collector-v2 Railway env vars
  3. Redeploy ClickHouse first (it stores the password), then the collector, then HyperDX server.
  4. Verify connectivity: clickhouse-client --host <host> --port 9000 --user default --password <new-password> → should connect.
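
The HTTP interface on port 8123 is worth checking too, since that is the path the HyperDX server and collector actually use. A minimal sketch with basic auth (note /ping is unauthenticated, so run a real query instead):

# Verify the new password over the HTTP interface; expect: 1
curl -s -u "default:<new-password>" "http://<clickhouse-host>:8123/?query=SELECT%201"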

ClickHouse Retention Adjustment

Default retention is 30 days. To adjust:

Option A: Via ClickHouse SQL

-- Check current TTL on the main table
SHOW CREATE TABLE hyperdx.log_entries;
 
-- Set TTL to 90 days (example); TTL takes a date expression, not a condition
ALTER TABLE hyperdx.log_entries
  MODIFY TTL timestamp + INTERVAL 90 DAY;

Option B: Via Dockerfile env var

Edit apps/platform/deploy/hyperdx-clickhouse/Dockerfile and update CLICKHOUSE_RETENTION_DAYS (if you added this env var). Rebuild and redeploy.
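
If you do wire up such a variable, one hypothetical approach is to have init-db.sh interpolate it into the TTL clause. A sketch under that assumption; CLICKHOUSE_RETENTION_DAYS is the hypothetical variable named above:

# In init-db.sh (hypothetical): derive the TTL from the env var, default 30
RETENTION_DAYS="${CLICKHOUSE_RETENTION_DAYS:-30}"
clickhouse-client -q "
  ALTER TABLE hyperdx.log_entries
    MODIFY TTL timestamp + INTERVAL ${RETENTION_DAYS} DAY;"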

Monitor Disk Usage

SELECT
  database,
  table,
  formatReadableSize(sum(data_compressed_bytes)) AS compressed,
  formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
  sum(rows) AS rows
FROM system.parts
WHERE database = 'hyperdx' AND active
GROUP BY database, table;
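
To see remaining headroom rather than usage, query system.disks the same way; free_space and total_space are standard columns there:

# Remaining headroom on each ClickHouse disk
clickhouse-client --host <host> -q "
  SELECT name,
         formatReadableSize(free_space)  AS free,
         formatReadableSize(total_space) AS total
  FROM system.disks;"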

Set a Railway alert when disk usage exceeds 80%.

Troubleshooting

Missing Traces

Symptoms: Traces appear in Langfuse but not in HyperDX, or vice versa.

  1. Check kill switch:

    # In the suspect service's logs, search for hyperdx entries:
    grep -i "hyperdx" <log-output>
    # If the kill switch has disabled HyperDX, you'll see no [hyperdx]-prefixed log lines.
  2. Verify OTLP endpoint connectivity:

    curl -s https://<hyperdx-url>/v2/traces -X POST \
      -H "Content-Type: application/json" \
      -d '{"resourceSpans":[]}' -w "\nHTTP %{http_code}\n"
    # Expect: HTTP 200 (or 401 if key required, confirming endpoint is up)
  3. Check OTel Collector logs:

    • Railway → hyperdx-otel-collector-v2 → Deployments → Latest → Logs
    • Look for exporter config errors or connection refused messages.
  4. Verify OTEL_SERVICE_NAME matches expected values:

    • brainforge-platform (Platform)
    • brainforge-slack-assistant (Slack)
    • brainforge-opencode-worker (Worker)

High Memory / Slow Queries

  1. Check ClickHouse query log:

    SELECT query, query_duration_ms, memory_usage
    FROM system.query_log
    WHERE event_date = today()
    ORDER BY query_duration_ms DESC
    LIMIT 10;
  2. Reduce retention if disk space is low (see Retention Adjustment above).

  3. Restart ClickHouse if it’s unresponsive:

    # Via Railway CLI
    railway restart --service hyperdx-clickhouse-v2
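
If a single runaway query is the culprit, it can be terminated before resorting to a restart. A sketch using system.processes and KILL QUERY:

# Find the offending query_id, then terminate it
clickhouse-client --host <host> -q "
  SELECT query_id, elapsed, memory_usage, query
  FROM system.processes
  ORDER BY memory_usage DESC;"
clickhouse-client --host <host> -q "KILL QUERY WHERE query_id = '<query-id>';"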

HyperDX Server Won’t Start

  1. Check Railway deployment logs for startup errors (raise HYPERDX_LOG_LEVEL if the output is too sparse to diagnose).
  2. Verify MONGO_URL, CLICKHOUSE_URL, and INGESTION_API_KEY are set (a quick check follows this list).
  3. If using S3/S3-compatible storage for session replay, verify S3_* env vars.
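
A quick way to rule out step 2 is a shell check over the required variables, run inside the service container (or against a dump of its env vars):

# Flag any required env vars that are unset or empty (bash indirection)
for var in MONGO_URL CLICKHOUSE_URL INGESTION_API_KEY; do
  if [ -z "${!var}" ]; then
    echo "MISSING: $var"
  else
    echo "OK: $var is set"
  fi
done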

OTel Collector Not Forwarding

  1. Check config: Verify that custom.config.yaml is mounted correctly in the collector service.
  2. Test receiver directly:
    curl -s http://<otel-collector>:4318/v1/traces -X POST \
      -H "Content-Type: application/json" \
      -d '{"resourceSpans":[{"resource":{"attributes":[{"key":"service.name","value":{"stringValue":"test"}}]},"scopeSpans":[{"spans":[{"name":"test-span","traceId":"0102030405060708090a0b0c0d0e0f10","spanId":"0102030405060708","kind":1}]}]}]}' \
      -w "\nHTTP %{http_code}\n"
  3. Check collector logs for clickhouseexporter errors.
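
If the collector's health_check extension is enabled in custom.config.yaml (an assumption; check the file), its endpoint gives an up/down signal independent of the receivers. It listens on port 13133 by default:

# health_check extension default endpoint; expect HTTP 200 when healthy
curl -s http://<otel-collector>:13133/ -w "\nHTTP %{http_code}\n"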

Session Replay Not Working

  1. Verify NEXT_PUBLIC_HYPERDX_API_KEY is set in the Platform’s client env.
  2. Check that HyperDXProvider wraps the app (search for <HyperDXProvider> in Providers.tsx).
  3. Verify NEXT_PUBLIC_HYPERDX_ENABLED !== 'false'.
  4. Check browser DevTools for @hyperdx/browser script loading errors.

Incident Response

HyperDX Completely Down

  1. Platform, Slack Assistant, and OpenCode Worker keep running: the kill switch defaults to enabled, so if HYPERDX_ENABLED is not explicitly set, traces will still attempt to send and fail silently. This is acceptable; no user traffic is lost.
  2. Explicit kill switch if needed: Set HYPERDX_ENABLED=false in all services to stop connection attempts and log noise.
  3. Restart order: ClickHouse → OTel Collector → HyperDX Server.
  4. Verify: Check health endpoints for all three services after restart.
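
Steps 3 and 4 can be scripted using the railway restart invocation shown earlier; a sketch with placeholder hosts:

# Restart in dependency order, verifying health after each service
set -e
railway restart --service hyperdx-clickhouse-v2
until curl -sf "http://<clickhouse-host>:8123/ping" > /dev/null; do sleep 5; done
railway restart --service hyperdx-otel-collector-v2
sleep 15  # give the collector time to reconnect to ClickHouse
railway restart --service hyperdx
until curl -sf "https://<hyperdx-url>/health" > /dev/null; do sleep 5; done
echo "All services healthy"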

Data Loss / Corruption in ClickHouse

  1. Check ClickHouse replica status (if using replicated tables):
    SELECT database, table, is_readonly, queue_size, absolute_delay
    FROM system.replicas;
  2. Replay from OTel Collector backlog — the collector has internal queuing; if ClickHouse was down briefly, data may catch up automatically.
  3. Last resort: Drop and recreate the hyperdx database using the init-db.sh script from apps/platform/deploy/hyperdx-clickhouse/.
    # WARNING: This destroys all historical data
    clickhouse-client --host <host> --port 9000 --user default --password <pw> \
      -q "DROP DATABASE IF EXISTS hyperdx;"
    # Then redeploy the ClickHouse service to re-run init-db.sh

Partial Outage (One Service Down)

  • ClickHouse down: HyperDX server will fail to ingest. Traces buffer in OTel Collector’s memory until ClickHouse recovers (limited buffering). If ClickHouse is down for > 5 minutes, some data may be lost.
  • OTel Collector down: All services fall back to direct OTLP sends. Check if HyperDX server’s /v2/traces endpoint handles the load; if not, set HYPERDX_ENABLED=false temporarily.
  • HyperDX Server down: UI is inaccessible; ingestion stops. ClickHouse continues to store data. Restart the server to resume.

Monitoring Alerts

Recommended Railway alerts:

Service                     Metric                   Threshold
hyperdx                     CPU > 80% for 5 min      Warning
hyperdx-clickhouse-v2       Disk > 80%               Warning; > 90% → Critical
hyperdx-otel-collector-v2   Restarts > 3 in 10 min   Critical
hyperdx                     Health check failure     Critical

Backup & Recovery

  • ClickHouse data: Use Railway volumes (persistent). Periodic clickhouse-dump is optional for cross-region recovery (one alternative is sketched after this list).
  • HyperDX config: All config is in Railway env vars and the repo (apps/hyperdx/, apps/platform/deploy/hyperdx-*). Restoring means redeploying from the repo.
  • Session replay objects: Stored in S3-compatible storage (if configured). Ensure S3 bucket versioning is enabled.
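
For the optional cross-region copy, ClickHouse's native BACKUP statement is one alternative (requires ClickHouse 22.10+ and configured S3 credentials; a hedged sketch, bucket URL and keys are placeholders):

# One-shot logical backup of the hyperdx database to S3-compatible storage
clickhouse-client --host <host> --user default --password <pw> -q "
  BACKUP DATABASE hyperdx
  TO S3('https://<bucket>.s3.<region>.amazonaws.com/hyperdx/', '<access-key>', '<secret-key>');"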

Environment Separation

Environment   HYPERDX_API_KEY   OTEL_SERVICE_NAME Suffix             ClickHouse Password
Production    <prod-key>        brainforge-platform, etc.            <prod-password>
Staging       <staging-key>     brainforge-platform-staging, etc.    <staging-password>
Development   Not required      Not required (kill switch off)       N/A

Never share keys or passwords between environments.