Data Platform Doc Skills — Common Rules

Purpose: Shared rules and references for the three Data Platform Documentation skills (Kickoff, Backfill, Update). Each skill reads this file so behavior stays consistent without duplicating long text.

Used by: data-platform-doc (.cursor/skills/data-platform-doc/SKILL.md) — unified implementation. The data-platform-doc-kickoff, data-platform-doc-update, and data-platform-doc-backfill skill files are thin aliases that point to the same skill with a fixed mode (kickoff / update / backfill).


Client resolution

  • Resolve client name to knowledge/clients/{client}/ using standards/04-prompts/client-vault-mapping.md. Use exact folder name (casing). If not in mapping, list knowledge/clients/ and match (exact casing as on disk).

Google identity and tool order

  • user_google_email: Per .cursor/rules/google-workspace-mcp-user-email.mdc — resolve the active signed-in Google email from gws/MCP auth state and use that exact value for all gws and Google Workspace MCP calls (never guess).
  • Tool priority: (1) gws (Google Workspace CLI) per standards/03-knowledge/engineering/setup/google-workspace-cli-setup.md; (2) Google Workspace MCP when gws is unavailable or does not support the operation. For reads: gws sheets get / gws sheets values get, then MCP. For writes: gws sheets values update, then MCP.
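
A minimal sketch of the gws-first ordering, in Python for illustration only. The exact gws argument shape and its output format are assumptions (confirm against google-workspace-cli-setup.md), and the MCP fallback runs through the agent’s tool interface rather than a shell call, so it is left as a stub here.

```python
import subprocess

def read_values(spreadsheet_id: str, a1_range: str) -> list[list[str]] | None:
    """Try gws first; return None so the caller falls back to the Workspace MCP."""
    try:
        # Hypothetical argument order -- check the CLI setup doc before relying on it.
        result = subprocess.run(
            ["gws", "sheets", "values", "get", spreadsheet_id, a1_range],
            capture_output=True, text=True, check=True, timeout=60,
        )
    except (FileNotFoundError, subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return None  # caller retries via the Google Workspace MCP
    # Assumes tab-separated output; adjust to whatever gws actually emits.
    return [line.split("\t") for line in result.stdout.splitlines()]
```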

No paste fallback

If the agent cannot write to the sheet, do not output “here’s the text to paste.” Report the failure and ask for edit access or another method.

Red for no/TBD

Where the integration supports formatting, apply red background to cells that are empty or explicitly no or TBD. Where it does not, list those ranges in the changelog for manual highlight.
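
Where the integration exposes the raw Sheets API, red highlighting is one repeatCell request per range in a spreadsheets.batchUpdate call. A minimal sketch; the pale-red RGB values are an arbitrary choice:

```python
def red_fill_request(sheet_id: int, start_row: int, end_row: int,
                     start_col: int, end_col: int) -> dict:
    """Build one repeatCell request painting a cell range red (Sheets API batchUpdate)."""
    return {
        "repeatCell": {
            "range": {  # 0-based, end-exclusive indexes
                "sheetId": sheet_id,
                "startRowIndex": start_row, "endRowIndex": end_row,
                "startColumnIndex": start_col, "endColumnIndex": end_col,
            },
            "cell": {"userEnteredFormat": {
                "backgroundColor": {"red": 1.0, "green": 0.8, "blue": 0.8}}},
            "fields": "userEnteredFormat.backgroundColor",
        }
    }
```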

Changelog and summary

  • Changelog: Tab-by-tab (or range-by-range) what was added or changed.
  • Summary: What is still TBD or no (and highlighted or listed for manual red); what is left for client confirmation.
  • Optional: Write knowledge/clients/{client}/resources/data_platform_documentation/DATA_PLATFORM_DOC_UPDATE_SUMMARY_YYYY-MM-DD.md with “What was filled,” “Sources used,” “Left for client confirmation.”

Standard tabs reference

  • Tab list and column headers: standards/02-writing/data-platform-documentation-standard-tabs-sheets.md
  • Tab purposes and semantics: standards/02-writing/data-platform-documentation-template-guide.md
  • Optional marts tabs: Production Mart Catalog (and the optional Subject areas glance tab) are documented in standard-tabs-sheets; use them when the engagement includes a warehouse marts layer (PROD_MARTS, dbt models/marts/**, etc.). For CSV/metadata generation that aligns with the catalog, see standards/03-knowledge/engineering/workflows/marts-documentation-creation.md

dbt technical cross-check

Purpose: Before writing TBD for warehouse location or mart names, derive values from the client’s dbt project when it is available in the workspace. This keeps the Data Sources → Destination (column R) and Core Metrics → Marts Model (column N) aligned with code.

When to run

Run after resolving the client and scanning sheet gaps, and before filling empty/unknown cells with TBD for:

  • Data Sources → Destination (R) — if empty, unknown, or placeholder.
  • Core Metrics → Marts Model (N) — optional second pass; only fill on high-confidence match.

If no dbt project is found, skip this subsection and note in the discovery summary: “dbt cross-check skipped: no dbt_project.yml found”. TBD is then acceptable.

Resolve the dbt project root (try in order; stop at first hit)

  1. User-provided path in the request (e.g. “dbt is in ../client-analytics/dbt_project”).
  2. Client vault: knowledge/clients/{client}/README.md — optional field primary_dbt_project_path (path to directory containing dbt_project.yml, or repo + path note).
  3. Workspace search: glob **/dbt_project.yml across open workspace folders; prefer the repo that matches the client engagement (name in README, Linear team, or folder naming).
  4. If none found → skip cross-check.
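
A minimal sketch of that resolution order, assuming the README stores primary_dbt_project_path as a plain “key: value” line (the field format is not pinned down above):

```python
from pathlib import Path

def resolve_dbt_root(user_path: str | None, vault_readme: Path,
                     workspace_roots: list[Path]) -> Path | None:
    def has_project(p: Path) -> bool:
        return (p / "dbt_project.yml").exists()

    # 1. User-provided path wins.
    if user_path and has_project(Path(user_path)):
        return Path(user_path)
    # 2. Optional primary_dbt_project_path field in the client README.
    if vault_readme.exists():
        for line in vault_readme.read_text().splitlines():
            if line.strip().startswith("primary_dbt_project_path:"):
                candidate = Path(line.split(":", 1)[1].strip())
                if has_project(candidate):
                    return candidate
    # 3. Workspace glob, skipping installed packages and build artifacts.
    for root in workspace_roots:
        hits = [p for p in sorted(root.glob("**/dbt_project.yml"))
                if "dbt_packages" not in p.parts and "target" not in p.parts]
        if hits:
            return hits[0].parent  # prefer the repo matching the engagement
    # 4. None found -> skip the cross-check.
    return None
```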

Data Sources → Destination (R)

  1. Under the resolved dbt root, read all models/**/sources.yml (and root sources.yml if present).
  2. For each top-level sources: entry, read database, schema, and the source block name (logical source name in dbt).
  3. Canonical sheet value: database.schema (e.g. PARTNER_SHARE.CLIENT_RAW for a Snowflake share defined in YAML).
  4. Map sheet row → source: Match the Data Sources “Name” cell to dbt name or description (case-insensitive; normalize punctuation and parentheticals). Example: “Retail partner feed (Vendor A, Vendor B)” ↔ source retail_partner / description containing “retail partner…”.
  5. Multiple tables under one source: Destination remains database.schema (schema-level is correct). Optionally add detail in Notes (e.g. “tables per dbt source emerson”) without listing every table in R.
  6. Polytomic / RAW.* landings: If sources.yml defines explicit database / schema (e.g. RAW + connector schema), use database.schema. If YAML does not specify them, check dbt_project.yml for models: config, custom schema macros, or profile-driven database — only fill when unambiguous; else TBD.
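
Steps 1–4 in a minimal sketch (assumes PyYAML; the normalization shown is one reasonable reading of “normalize punctuation and parentheticals”, not a fixed spec):

```python
import re
from pathlib import Path
import yaml  # PyYAML

def normalize(name: str) -> str:
    """Case-insensitive key with parentheticals and punctuation stripped."""
    name = re.sub(r"\(.*?\)", "", name.lower())
    return re.sub(r"[^a-z0-9]+", " ", name).strip()

def destination_index(dbt_root: Path) -> dict[str, str]:
    """Map normalized dbt source name -> canonical 'database.schema' sheet value."""
    paths = list(dbt_root.glob("models/**/sources.yml"))
    if (dbt_root / "sources.yml").exists():
        paths.append(dbt_root / "sources.yml")
    index: dict[str, str] = {}
    for path in paths:
        doc = yaml.safe_load(path.read_text()) or {}
        for src in doc.get("sources", []):
            database, schema = src.get("database"), src.get("schema")
            if database and schema:  # only fill when YAML is explicit (step 6)
                index[normalize(src.get("name", ""))] = f"{database}.{schema}"
    return index
```

A sheet row then matches when normalize of the “Name” cell equals (or contains) a key in the index; anything weaker should go to chat as a candidate list rather than into column R.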

Core Metrics → Marts Model (N) (optional)

  1. Search models/marts/**/*.sql and related schema.yml for model names that match the metric row (vault/Linear names, ref('...') patterns).
  2. Fill only when the match is high-confidence; otherwise leave TBD (avoid wrong model names).
  3. If still TBD and target/manifest.json exists, run the dbt manifest cross-check (below) before leaving TBD.
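
A deliberately conservative sketch of step 1: match on the model file name only, and fill column N only when exactly one candidate survives.

```python
from pathlib import Path

def candidate_marts_models(dbt_root: Path, metric_slug: str) -> list[str]:
    """Marts model names whose file name contains the metric's slug."""
    return sorted({p.stem for p in dbt_root.glob("models/marts/**/*.sql")
                   if metric_slug in p.stem})
```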

dbt manifest cross-check (optional)

Purpose: Suggest or narrow Marts Model in Core Metrics (column N) from target/manifest.json, after the sources.yml and SQL / schema.yml heuristics above, without replacing human-approved text.

When to run

  • Same dbt project root as the rest of the cross-check.
  • If target/manifest.json (or a path the user provided, e.g. from CI) does not exist, skip and note: “manifest cross-check skipped: no manifest.json at …”.
  • Run only for Marts Model (N) cells that are still empty, or exactly TBD (case-insensitive), or obvious placeholders — never replace non-placeholder text the client or PM has confirmed.
  • If in doubt, list candidate dbt model names in chat and leave the cell to the user.

How to use the manifest

  1. Parse target/manifest.json. dbt v1+ stores nodes: use entries where resource_type == model. Prefer marts models: path under models/marts/, or name that matches a metric row’s Metric Name, Metric ID, or ref('...') seen in the metric’s Formula line.
  2. For each Core Metrics row, use fuzzy, conservative match (e.g. normalized slug of metric name to model name); if one high-confidence model remains, set Marts Model (N) to the model name as dbt uses it.
  3. Do not clobber: if N already has any value that is not empty/TBD, do not overwrite. If the cell says “TBD (pending client)”, do not auto-fill.
  4. Changelog: e.g. “Core Metrics: Marts Model for [metric] from dbt manifest.json model fct_… (column N, previously TBD).”
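
Steps 1–2 as a sketch. The nodes / resource_type / original_file_path keys are standard manifest fields; the slug matching is one conservative choice, not the required algorithm:

```python
import json
import re
from pathlib import Path

def manifest_marts_candidates(manifest_path: Path, metric_name: str) -> list[str]:
    """Marts models from manifest.json whose name contains the metric's slug."""
    nodes = json.loads(manifest_path.read_text()).get("nodes", {})
    slug = re.sub(r"[^a-z0-9]+", "_", metric_name.lower()).strip("_")
    out = set()
    for node in nodes.values():
        if node.get("resource_type") != "model":
            continue
        parts = node.get("original_file_path", "").replace("\\", "/").split("/")
        if "marts" not in parts:  # prefer models under models/marts/
            continue
        if slug and slug in node["name"]:
            out.add(node["name"])
    return sorted(out)
```

Per rule 3, write the single candidate to N only when the cell is empty or exactly TBD; with zero or several candidates, list them in chat instead.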

Changelog

When Destination, Marts Model, or a manifest-suggested N cell is set from dbt, record it explicitly (e.g. “Data Sources: Emerson Destination from dbt sources.yml → EMERSON_LMNT_SHARE.LMNT_425” or “Marts Model from manifest.json model fct_…”).

Production Mart Catalog tab

When: Always for Data Platform Documentation engagements that include a mart layer. Production Mart Catalog is standard tab #11 (see data-platform-documentation-standard-tabs-sheets.md). Treat tab names in the Aliases table as satisfying this requirement.

What: Keep a table-level inventory of production marts: one row per mart relation, columns A–R exactly as in the standard-tabs doc (Full name through Doc freshness). Inventory must reflect live warehouse objects first, then enrich from dbt paths and markdown / schema.yml descriptions.

Resolve warehouse (do not assume Snowflake silently):

  1. Prefer Data Sources tab (which DW the client uses), vault README, or dbt_project/profiles.yml / target.
  2. If it is still unclear whether marts live in Snowflake, BigQuery, or Redshift, ask the user once and record the answer in the changelog for future runs.
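
When profiles.yml is available, the adapter type answers this directly. A sketch, assuming PyYAML and the standard profiles layout (profile → outputs → target → type):

```python
from pathlib import Path
import yaml

def warehouse_type(profiles_path: Path, profile: str) -> str | None:
    """Return the dbt adapter type (snowflake / bigquery / redshift) or None."""
    profiles = yaml.safe_load(profiles_path.read_text()) or {}
    entry = profiles.get(profile, {})
    outputs = entry.get("outputs", {})
    env = outputs.get(entry.get("target")) or next(iter(outputs.values()), {})
    return env.get("type")
```

Note profiles.yml often lives at ~/.dbt/profiles.yml rather than in the repo; if it is absent, fall back to asking the user as above.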

Mart catalog automation — behavior contract (implementation-agnostic)

Agents and engineers do not need a pre-existing script in any particular Brainforge or client repository. They may implement tooling in the client’s dbt repo (any language; conventional location dbt_project/scripts/), run ad hoc SQL and merge the results into the sheet, or fill the tab by hand, as long as the outputs and precedence rules below are satisfied. Naming a script build_production_mart_catalog_for_sheet.py is for discoverability only; the behavior contract is normative.

Output shape (required):

  • One header row + one row per production mart relation, with exactly the eighteen headers A–R defined in data-platform-documentation-standard-tabs-sheets.md → Production Mart Catalog.
  • Default serialization: UTF-8 TSV suitable for pasting or values.update starting at Production Mart Catalog!A1. Optional: emit structured JSON describing { range_name, values } for chunked Google API / MCP writes (catalog only, or plus companion tabs such as Subject areas / _lists if the engagement uses them).
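
A sketch of both serializations. The 18-column shape and the { range_name, values } JSON come from this contract; the tab/newline scrubbing is a practical assumption for TSV paste:

```python
import json

def serialize_catalog(rows: list[list[str]]) -> tuple[str, str]:
    """rows[0] is the header row; every row carries the same 18 columns (A-R)."""
    tsv = "\n".join(
        "\t".join(c.replace("\t", " ").replace("\n", " ") for c in row)
        for row in rows
    )
    payload = json.dumps({"range_name": "Production Mart Catalog!A1", "values": rows})
    return tsv, payload
```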

Precedence (source of truth, high to low):

  1. Warehouse catalog: List relations in the client’s production marts scope (database + schema + object name + type) via that platform’s metadata (INFORMATION_SCHEMA, BigQuery INFORMATION_SCHEMA, Redshift catalog views, etc.). Exclude system schemas unless the client treats them as marts.
  2. dbt layout: Map each warehouse schema to models/marts/<folder>/ using the project’s naming rules (dbt_project.yml, custom schema macros, or folder ↔ schema table documented in README). Set dbt path hint to the matching .sql model file when one exists.
  3. Human-readable descriptions: Prefer a single markdown file under dbt_project/docs/ with ## SCHEMA_NAME sections and per-table rows (pipe tables), or schema.yml description on mart models. Merge by (schema, table) / model name.
  4. Derived columns (K–R, and D–F where not explicit): Infer Table type, Grain, Cross-source, Build state, etc. from documented rules in the client repo (README or comments in the generator). Use TBD when unknown and red-highlight per the universal rules. Do not mark Cross-source “Yes” without a documented rule or description to support it.
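
A sketch of step 1 and the merge, for a Snowflake-shaped engagement; the PROD_MARTS scope is hypothetical and the dict keys are illustrative, not the real A–R headers:

```python
# Hypothetical marts database -- substitute the client's actual scope.
CATALOG_SQL = """
select table_catalog, table_schema, table_name, table_type
from PROD_MARTS.INFORMATION_SCHEMA.TABLES
where table_schema <> 'INFORMATION_SCHEMA'
order by table_schema, table_name
"""

def merge_row(wh: dict, dbt_path: str | None, description: str | None) -> dict:
    """Precedence: the warehouse defines the row; dbt and docs only enrich it."""
    return {
        "full_name": f"{wh['database']}.{wh['schema']}.{wh['name']}",
        "object_type": wh["type"],
        "dbt_path_hint": dbt_path or "TBD",   # from the models/marts/<folder>/ mapping
        "description": description or "TBD",  # markdown doc or schema.yml description
    }
```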

When warehouse CLI or API is unavailable: Fall back to enumerating models/marts/**/*.sql; set Full name and type fields to TBD until verified against the warehouse, and note the limitation in the changelog.

Write path: gws sheets spreadsheets values batchUpdate (preferred) or MCP modify_sheet_values. For large grids or tool limits, chunk row ranges or refresh narrow columns (e.g. Q:Q only).
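
A sketch of row-range chunking; the 500-row chunk size is arbitrary, tune it to the tool’s payload limits:

```python
def chunked_ranges(values: list[list[str]], start_row: int = 1, size: int = 500):
    """Yield (A1 range, rows) pairs sized for one values-update call each."""
    for i in range(0, len(values), size):
        yield f"'Production Mart Catalog'!A{start_row + i}", values[i:i + size]
```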

BigQuery / Redshift: Use the same eighteen headers; adjust Full name format (project.dataset.table vs database.schema.table) and the metadata query only.

Changelog (mart catalog): Row count, warehouse + database used, dbt root path, and whether output was automated, partial fallback, or manual.

How agents use this contract in practice:

  1. Resolve the dbt root (see “Resolve the dbt project root” above).
  2. If a repo already contains a script that satisfies this contract, run it (typical invocation: python3 dbt_project/scripts/build_production_mart_catalog_for_sheet.py with the client’s warehouse CLI profile).
  3. If not, implement a minimal generator (or SQL + spreadsheet formulas) that follows the contract, or fill manually from metadata queries—do not block on a missing file path from another engagement.

Limitations

  • Manifest (column N): In large monorepos, loading the full manifest.json is acceptable for agent runs; in CI, prefer a pre-built or pruned manifest. Fetching the manifest via the dbt Cloud API is out of scope until the product adds it (see planning notes).
  • dbt not in Cursor workspace: Cross-check skipped; TBD OK.
  • Name mismatch between sheet row and dbt source: fuzzy-match or list candidate sources in chat for user confirmation.
  • Environment-specific database names: sources.yml is usually non-secret; if names differ by target (dev/prod), prefer dbt docs / profile docs or ask once.

Template spreadsheet (Kickoff only)