Data Platform Doc Skills — Common Rules
Purpose: Shared rules and references for the three Data Platform Documentation skills (Kickoff, Backfill, Update). Each skill reads this file so behavior stays consistent without duplicating long text.
Used by: data-platform-doc (.cursor/skills/data-platform-doc/SKILL.md) — unified implementation. The data-platform-doc-kickoff, data-platform-doc-update, and data-platform-doc-backfill skill files are thin aliases that point to the same skill with a fixed mode (kickoff / update / backfill).
Client resolution
- Resolve the client name to `knowledge/clients/{client}/` using `standards/04-prompts/client-vault-mapping.md`. Use the exact folder name (casing). If not in the mapping, list `knowledge/clients/` and match (exact casing as on disk).
Google identity and tool order
- user_google_email: Per `.cursor/rules/google-workspace-mcp-user-email.mdc` — resolve the active signed-in Google email from gws/MCP auth state and use that exact value for all gws and Google Workspace MCP calls (never guess).
- Tool priority: (1) gws (Google Workspace CLI) per `standards/03-knowledge/engineering/setup/google-workspace-cli-setup.md`; (2) Google Workspace MCP when gws is unavailable or does not support the operation. For reads: `gws sheets get` / `values get`, then MCP. For writes: `gws sheets values update`, then MCP.
No paste fallback
If the agent cannot write to the sheet, do not output “here’s the text to paste.” Report the failure and ask for edit access or another method.
Red for no/TBD
Where the integration supports formatting, apply a red background to cells that are empty or explicitly "no" or "TBD". Where it does not, list those ranges in the changelog for manual highlighting.
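The rule above can be captured as a small predicate when building a highlight pass; a minimal sketch, where `needs_red_highlight` is an illustrative name, not part of any skill or gws API:

```python
def needs_red_highlight(cell) -> bool:
    """Per the universal rule: red background for cells that are
    empty or explicitly "no" / "TBD" (case-insensitive)."""
    if cell is None:
        return True
    text = str(cell).strip()
    return text == "" or text.lower() in {"no", "tbd"}
```

Matching cells are either formatted red (where the integration supports it) or collected into ranges for the changelog; everything else is left untouched.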
Changelog and summary
- Changelog: Tab-by-tab (or range-by-range) what was added or changed.
- Summary: What is still TBD or no (and highlighted or listed for manual red); what is left for client confirmation.
- Optional: Write `knowledge/clients/{client}/resources/data_platform_documentation/DATA_PLATFORM_DOC_UPDATE_SUMMARY_YYYY-MM-DD.md` with “What was filled,” “Sources used,” and “Left for client confirmation.”
Standard tabs reference
- Tab list and column headers: standards/02-writing/data-platform-documentation-standard-tabs-sheets.md
- Tab purposes and semantics: standards/02-writing/data-platform-documentation-template-guide.md
- Optional marts tabs: Production Mart Catalog (and the optional Subject areas glance tab) are documented in standard-tabs-sheets — use them when the engagement includes a warehouse marts layer (`PROD_MARTS`, dbt `models/marts/**`, etc.). CSV/metadata generation to align with the catalog: `standards/03-knowledge/engineering/workflows/marts-documentation-creation.md`
dbt technical cross-check
Purpose: Before writing TBD for warehouse location or mart names, derive values from the client’s dbt project when it is available in the workspace. This keeps the Data Sources → Destination (column R) and Core Metrics → Marts Model (column N) aligned with code.
When to run
Run after resolving the client and scanning sheet gaps, and before filling empty/unknown cells with TBD for:
- Data Sources → Destination (R) — if empty, unknown, or placeholder.
- Core Metrics → Marts Model (N) — optional second pass; only fill on high-confidence match.
If no dbt project is found, skip this subsection and note in the discovery summary: “dbt cross-check skipped: no dbt_project.yml found”; TBD is then acceptable.
Resolve the dbt project root (try in order; stop at first hit)
- User-provided path in the request (e.g. “dbt is in `../client-analytics/dbt_project`”).
- Client vault: `knowledge/clients/{client}/README.md` — optional field `primary_dbt_project_path` (path to the directory containing `dbt_project.yml`, or repo + path note).
- Workspace search: glob `**/dbt_project.yml` across open workspace folders; prefer the repo that matches the client engagement (name in README, Linear team, or folder naming).
- If none is found → skip the cross-check.
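The workspace-search step can be sketched as a recursive glob; a minimal sketch, assuming the open workspace folders are passed in as paths (`find_dbt_roots` is an illustrative name):

```python
from pathlib import Path

def find_dbt_roots(workspace_dirs):
    """Glob **/dbt_project.yml under each open workspace folder and
    return the candidate project roots (directories), deduplicated,
    in a stable order."""
    roots = []
    for base in workspace_dirs:
        for hit in sorted(Path(base).glob("**/dbt_project.yml")):
            root = hit.parent
            if root not in roots:
                roots.append(root)
    return roots
```

Engagement matching (name in README, Linear team, folder naming) still happens on the returned candidates; the helper only enumerates them.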
Data Sources → Destination (R)
- Under the resolved dbt root, read all `models/**/sources.yml` (and root `sources.yml` if present).
- For each top-level `sources:` entry, read `database`, `schema`, and the source block `name` (the logical source name in dbt).
- Canonical sheet value: `database.schema` (e.g. `PARTNER_SHARE.CLIENT_RAW` for a Snowflake share defined in YAML).
- Map sheet row → source: Match the Data Sources “Name” cell to the dbt `name` or `description` (case-insensitive; normalize punctuation and parentheticals). Example: “Retail partner feed (Vendor A, Vendor B)” ↔ source `retail_partner` / description containing “retail partner…”.
- Multiple tables under one source: Destination remains `database.schema` (schema-level is correct). Optionally add detail in Notes (e.g. “tables per dbt source `emerson`”) without listing every table in R.
- Polytomic / `RAW.*` landings: If `sources.yml` defines an explicit `database`/`schema` (e.g. `RAW` + connector schema), use `database.schema`. If the YAML does not specify them, check `dbt_project.yml` for `models:` config, custom schema macros, or a profile-driven database — only fill when unambiguous; else TBD.
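The Destination derivation can be sketched against an already-parsed `sources.yml` (a real run would load the YAML first, e.g. with PyYAML); `destination_for_source` and the matching heuristic below are illustrative assumptions, not a fixed API:

```python
import re

def _norm(text: str) -> str:
    """Lowercase, drop parentheticals, and strip punctuation for matching."""
    text = re.sub(r"\(.*?\)", " ", text or "")
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def destination_for_source(sources_doc: dict, sheet_name: str):
    """Given a parsed sources.yml (as a dict) and a Data Sources 'Name'
    cell, return the canonical 'database.schema' value, or None when no
    source entry matches by name or description."""
    target = _norm(sheet_name)
    for src in sources_doc.get("sources", []):
        candidates = [_norm(src.get("name", "")), _norm(src.get("description", ""))]
        if target and any(target in c or c in target for c in candidates if c):
            db, schema = src.get("database"), src.get("schema")
            if db and schema:  # only fill when YAML is explicit; else TBD
                return f"{db}.{schema}"
    return None
```

A `None` return maps to the TBD path above: either escalate to `dbt_project.yml` config checks or leave the cell for the user.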
Core Metrics → Marts Model (N) (optional)
- Search `models/marts/**/*.sql` and related `schema.yml` files for model names that match the metric row (vault/Linear names, `ref('...')` patterns).
- Fill only when the match is high-confidence; otherwise leave TBD (avoid wrong model names).
- If still TBD and `target/manifest.json` exists, run the dbt manifest cross-check (below) before leaving TBD.
dbt manifest cross-check (optional)
Purpose: Suggest or narrow Marts Model in Core Metrics (column N) from target/manifest.json, after the sources.yml and SQL / schema.yml heuristics above, without replacing human-approved text.
When to run
- Same dbt project root as the rest of the cross-check.
- If `target/manifest.json` (or a path the user provided, e.g. from CI) does not exist, skip and note: “manifest cross-check skipped: no manifest.json at …”.
- Run only for Marts Model (N) cells that are still empty, exactly TBD (case-insensitive), or obvious placeholders — never replace non-placeholder text the client or PM has confirmed.
- If in doubt, list candidate dbt `model` names in chat and leave the cell to the user.
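The never-clobber rule reduces to a strict predicate on the current cell value; a minimal sketch (`can_autofill` is an illustrative name):

```python
def can_autofill(cell) -> bool:
    """True only for cells the cross-check may write: empty, or exactly
    'TBD' (case-insensitive). Anything else, including 'TBD (pending
    client)', is treated as human-owned and never overwritten."""
    if cell is None:
        return True
    return str(cell).strip().lower() in {"", "tbd"}
```

Note that "TBD (pending client)" deliberately fails the check, matching the do-not-clobber rule for qualified TBDs.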
How to use the manifest
- Parse `target/manifest.json`. dbt v1+ stores `nodes`: use entries where `resource_type == model`. Prefer marts models: path under `models/marts/`, or a `name` that matches a metric row’s Metric Name, Metric ID, or a `ref('...')` seen in the metric’s Formula line.
- For each Core Metrics row, use a fuzzy, conservative match (e.g. normalized slug of the metric name to the model `name`); if one high-confidence model remains, set Marts Model (N) to the model name as dbt uses it.
- Do not clobber: if N already has any value that is not empty/TBD, do not overwrite. If the cell says “TBD (pending client)”, do not auto-fill.
- Changelog: e.g. “Core Metrics: Marts Model for [metric] from dbt `manifest.json` model `fct_…` (column N, previously TBD).”
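The manifest match can be sketched as below, assuming `manifest` is the parsed `target/manifest.json` dict; the slug heuristic and the name `marts_model_candidates` are illustrative assumptions, and anything other than a single candidate is listed in chat rather than written:

```python
import re

def marts_model_candidates(manifest: dict, metric_name: str):
    """Conservative fuzzy match: normalize the metric name to a slug and
    return marts models whose name equals the slug or ends with '_' + slug.
    The caller fills column N only when exactly one candidate remains."""
    slug = re.sub(r"[^a-z0-9]+", "_", metric_name.lower()).strip("_")
    hits = []
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model":
            continue
        if "marts" not in node.get("path", "").split("/"):
            continue  # prefer models living under models/marts/
        name = node.get("name", "")
        if slug and (name == slug or name.endswith("_" + slug)):
            hits.append(name)
    return sorted(hits)
```
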
Changelog
When Destination, Marts Model, or a manifest-suggested N cell is set from dbt, record explicitly (e.g. “Data Sources: Emerson Destination from dbt sources.yml → EMERSON_LMNT_SHARE.LMNT_425” or “Marts Model from manifest.json model fct_…”).
Production Mart Catalog tab
When: Always for Data Platform Documentation engagements that include a mart layer. Production Mart Catalog is standard tab #11 (see data-platform-documentation-standard-tabs-sheets.md). Treat tab names in the Aliases table as satisfying this requirement.
What: Keep a table-level inventory of production marts: one row per mart relation, columns A–R exactly as in the standard-tabs doc (Full name through Doc freshness). Inventory must reflect live warehouse objects first, then enrich from dbt paths and markdown / schema.yml descriptions.
Resolve warehouse (do not assume Snowflake silently):
- Prefer the Data Sources tab (which DW the client uses), the vault README, or `dbt_project/profiles.yml` / `target`.
- If it is still unclear whether marts live in Snowflake, BigQuery, or Redshift, ask the user once and record the answer in the changelog for future runs.
Mart catalog automation — behavior contract (implementation-agnostic)
Agents and engineers do not need a pre-existing script in any particular Brainforge or client repository. They may implement tooling in the client’s dbt repo (any language; conventional location `dbt_project/scripts/`), run ad hoc SQL and merge the results into the sheet, or hand-fill — as long as the outputs and precedence rules below are satisfied. Naming a script `build_production_mart_catalog_for_sheet.py` is for discoverability only and is optional; the behavior contract is normative.
Output shape (required):
- One header row + one row per production mart relation, exactly the eighteen headers A–R defined in data-platform-documentation-standard-tabs-sheets.md → Production Mart Catalog.
- Default serialization: UTF-8 TSV suitable for pasting or `values.update` starting at `Production Mart Catalog!A1`. Optional: emit structured JSON describing `{ range_name, values }` for chunked Google API / MCP writes (catalog only, or plus companion tabs such as Subject areas / `_lists` if the engagement uses them).
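The required output shape is easy to pin down in code; a minimal sketch, with `catalog_to_tsv` as an illustrative name and the two headers in the test standing in for the full eighteen defined in the standard-tabs doc:

```python
def catalog_to_tsv(headers, rows):
    """Serialize the Production Mart Catalog as TSV: one header row plus
    one row per mart relation. Tabs/newlines inside cells are replaced
    with spaces so the grid pastes cleanly; missing cells become TBD
    (to be red-highlighted per the universal rules)."""
    def clean(cell):
        return str(cell if cell is not None else "TBD").replace("\t", " ").replace("\n", " ")
    lines = ["\t".join(clean(c) for c in headers)]
    for row in rows:
        lines.append("\t".join(clean(c) for c in row))
    return "\n".join(lines) + "\n"
```

A JSON variant would wrap the same grid per `{ range_name, values }` for chunked Google API / MCP writes.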
Precedence (source of truth, high to low):
- Warehouse catalog: List relations in the client’s production marts scope (database + schema + object name + type) via that platform’s metadata (Snowflake `INFORMATION_SCHEMA`, BigQuery `INFORMATION_SCHEMA`, Redshift catalog views, etc.). Exclude system schemas unless the client treats them as marts.
- dbt layout: Map each warehouse schema to `models/marts/<folder>/` using the project’s naming rules (`dbt_project.yml`, custom schema macros, or a folder ↔ schema table documented in the README). Set the dbt path hint to the matching `.sql` model file when one exists.
- Human-readable descriptions: Prefer a single markdown file under `dbt_project/docs/` with `## SCHEMA_NAME` sections and per-table rows (pipe tables), or the `schema.yml` `description` on mart models. Merge by `(schema, table)` / model name.
- Derived columns (K–R, and D–F where not explicit): Infer Table type, Grain, Cross-source, Build state, etc. from documented rules in the client repo (README or comments in the generator). Use TBD when unknown; red-highlight per the universal rules. Do not invent a cross-source “Yes” without a rule or description basis.
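The precedence rules amount to a keyed merge in which warehouse rows are authoritative for row existence and everything else only enriches; a minimal sketch with illustrative names and a reduced column set (a real generator would emit all eighteen A–R columns):

```python
def merge_catalog_rows(warehouse_rows, dbt_paths, descriptions):
    """Warehouse inventory is the source of truth for which rows exist;
    dbt path hints and markdown/schema.yml descriptions enrich by
    (schema, table) key but never add or drop rows. Missing enrichment
    stays TBD."""
    merged = []
    for schema, table, obj_type in warehouse_rows:
        key = (schema, table)
        merged.append({
            "schema": schema,
            "table": table,
            "type": obj_type,
            "dbt_path": dbt_paths.get(key, "TBD"),
            "description": descriptions.get(key, "TBD"),
        })
    return merged
```
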
When warehouse CLI or API is unavailable: Fall back to enumerating models/marts/**/*.sql; set Full name and type fields to TBD until verified against the warehouse, and note the limitation in the changelog.
Write path: gws sheets spreadsheets values batchUpdate (preferred) or MCP modify_sheet_values. For large grids or tool limits, chunk row ranges or refresh narrow columns (e.g. Q:Q only).
BigQuery / Redshift: Use the same eighteen headers; adjust Full name format (project.dataset.table vs database.schema.table) and the metadata query only.
Changelog (mart catalog): Row count, warehouse + database used, dbt root path, and whether output was automated, partial fallback, or manual.
How agents use this contract in practice:
- Resolve the dbt root (see “Resolve the dbt project root” above).
- If a repo already contains a script that satisfies this contract, run it (typical invocation: `python3 dbt_project/scripts/build_production_mart_catalog_for_sheet.py` with the client’s warehouse CLI profile).
- If not, implement a minimal generator (or SQL + spreadsheet formulas) that follows the contract, or fill manually from metadata queries — do not block on a missing file path from another engagement.
Limitations
- Manifest (column N): In large monorepos, loading the full `manifest.json` is acceptable for agent runs; in CI, prefer a pre-built or pruned manifest. The dbt Cloud API manifest is out of scope here until product adds it (see planning notes).
- dbt not in the Cursor workspace: Cross-check skipped; TBD is OK.
- Name mismatch between sheet row and dbt source: fuzzy-match or list candidate sources in chat for user confirmation.
- Environment-specific database names: `sources.yml` is usually non-secret; if names differ by target (dev/prod), prefer `dbt docs` / profile docs or ask once.
Template spreadsheet (Kickoff only)
- Canonical template: https://docs.google.com/spreadsheets/d/1VD8PrCifgkt5JCOLkMHtlbzShckMS4k2Q7AYLE3MjIM
- Spreadsheet ID: `1VD8PrCifgkt5JCOLkMHtlbzShckMS4k2Q7AYLE3MjIM`
- Create the client’s doc by copying this file (Drive API copy), not by creating a blank spreadsheet and pasting structure.