[Client] — Data source discovery memo — [Domain / share name]
About this document (Brainforge)
Internal conventions for how this file works in the repo. Optional: strip this section, or export without it, when sharing a client-facing artifact.
Titling and filename
Use [Client] — Data source discovery memo — [Domain or share name] for the document title (the H1 above). Examples: LMNT — Data source discovery memo — Shopify (RAW.POLYTOMIC_SHOPIFY) · Acme — Data source discovery memo — ERP (RAW.ORACLE_ERP).
Filename: {client}-discovery-memo-{domain}.md under knowledge/clients/{client}/resources/.
When the memo covers multiple sources or shares, use a broader domain name (e.g., lmnt-discovery-memo-retail.md for Shopify + ReCharge + Gorgias in one document) or split into per-source files with a parent README. Prefer one memo per share or tightly coupled source group; avoid monolithic “all sources” memos that are hard to update.
Single source of truth
Repo markdown under knowledge/clients/{client}/resources/ is the canonical discovery memo once merged. Google Docs may exist for stakeholder review; this file wins on scope, table detail, and warehouse-verified metrics.
- This memo (repo markdown) is the version-of-record for the discovery baseline — what was profiled, when, with what row counts and caveats. Update it when new tables arrive or share scopes change; do not maintain a parallel “v2” without a new file version.
- Warehouse facts (row counts, date ranges, grain) are snapshotted on the profiling date. Append update sections for additive changes; do not silently overwrite historical numbers.
- Google Docs (if used for stakeholder review) are downstream convenience copies. Import this markdown to create them; do not let the Google Doc diverge from the repo source.
When to use this template
Use this template when:
- a new private share or schema appears for a client and leadership needs a written baseline
- many new tables need to be introduced in one memo with consistent per-table subsections
- the ask is discovery / profiling / interpretation, not full dimensional modeling or dbt project design
This template includes three consolidated artifacts:
- Changelog (§11) — date-stamped entries for source updates (new tables, scope changes). Replaces the need for a standalone “Discovery Memo Update” document.
- Appendix C (Schema Audit) — column-level per-table catalog with data types, nulls, cardinalities, and foreign keys. Replaces the need for a standalone Schema Audit document.
- §2 (Source and lineage) includes ownership, SLA, and freshness tables. Replaces the need for a standalone Data Contract / SLA document.
Do not use this template when:
- the ask is a one-page executive overview only (use data-source-memo-executive-style-playbook.md)
- the ask is a pure technical schema audit with no narrative (this is now Appendix C — use this template, just skip the narrative sections)
- the deliverable is a data accuracy investigation (use the Data Findings Memo template)
Document metadata
Status: [Draft / In stakeholder review / Approved / Locked]
Warehouse: [Snowflake / BigQuery / other] — Account/region: [details]
Database / project: [database or project name]
Schema / dataset: [schema or dataset name]
Profiled: [YYYY-MM-DD] (single snapshot date for all profiling in this memo)
Prepared for: [Client stakeholder names and roles]
Prepared by: Brainforge
Last updated: [YYYY-MM-DD]
Related artifacts
| Artifact | Link / path | Notes |
|---|---|---|
| Data Platform Documentation | [Google Sheet link] | Source catalog, metric definitions |
| Linear project | [Linear project URL] | Discovery or data foundation milestones |
| Prior schema audit | [path or link] | If a separate schema audit exists, reconcile table names and domains |
| Vendor documentation | [link] | API docs, ERDs, or vendor data dictionaries |
| Stakeholder intake notes | [link or path] | Call notes or requirements that shaped discovery scope |
1. Engagement context
1.1 Why this data matters
[2–4 sentences. What business problem does this data help solve? What becomes possible that was not possible before? Why now — what triggered this discovery?]
1.2 Audience
| Role | What they need from this memo |
|---|---|
| [e.g., VP of Analytics] | [e.g., Can we retire the legacy spreadsheet? What KPIs are reliable?] |
| [e.g., Data Engineering lead] | [e.g., Refresh cadence, join keys, TEMP table decisions, dbt staging priorities] |
| [e.g., Finance / Ops stakeholder] | [e.g., Which tables anchor revenue or cost reporting? What filters are required?] |
1.3 Discovery scope
[Narrative: which share, schema, or source group was profiled. List the tables included and any that were explicitly excluded. Note if scope was constrained by access, time, or stakeholder direction.]
In scope:
| Source domain | Schema / dataset | Tables profiled | Notes |
|---|---|---|---|
| [domain name] | [schema] | [count] | [e.g., full schema; excluding deprecated _v1 tables] |
Excluded (and why):
| Table / domain | Reason excluded | When to revisit |
|---|---|---|
| [name] | [e.g., empty snapshot, deprecated, access denied] | [condition or date] |
2. Source and lineage
2.1 How data arrives
[Describe the ingestion path: source system → connector (Polytomic / Fivetran / Stitch / custom) → warehouse schema. Note refresh cadence, sync scope, and any known latency or reliability concerns.]
[Source system]
→ [Connector / pipeline] (refresh: [cadence])
→ [Warehouse].[database].[schema]
→ dbt staging / marts
→ BI tool / analytics
2.2 Ownership and access
| Concern | Detail |
|---|---|
| Connector owner | [team or person] |
| Warehouse access | [role / user / service account used for profiling] |
| API / token health | [known expiry, rotation schedule, or concerns] |
| Runbook / playbook | [link to operational runbook if it exists] |
2.3 Freshness SLA
| Metric | Expectation | Measured value | Notes |
|---|---|---|---|
| Refresh cadence | [e.g., Daily at 06:00 UTC] | [actual as of profiling date] | [e.g., may drift on weekends] |
| Freshness SLA | [e.g., Data ≤ 24h old at start of business] | [actual lag] | [e.g., delays if upstream API is down] |
| Expected latency | [e.g., 2h from source event to warehouse] | [actual p50/p95] | [e.g., bulk loads land in single daily batch] |
| Alert contact | [name / team / Slack channel] | — | [e.g., #data-alerts] |
| Known failure modes | [e.g., connector drops auth token every 90 days] | — | [e.g., documented in runbook linked above] |
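The measured lag in the table above can be filled in with a freshness query. This is a sketch: the `_polytomic_synced_at` column name is an assumption (connectors name their sync-timestamp column differently), and the bracketed identifiers are placeholders.

```sql
-- Hypothetical freshness check: lag between now and the newest synced row.
-- _polytomic_synced_at is an assumed column name; substitute the
-- connector's actual sync-timestamp column.
SELECT
    MAX(_polytomic_synced_at) AS last_synced_at,
    DATEDIFF('hour', MAX(_polytomic_synced_at), CURRENT_TIMESTAMP()) AS lag_hours
FROM [database].[schema].[table];
```

Run it at the same time of day the SLA is defined against (e.g., start of business), or the measured lag will not be comparable to the expectation.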
3. Executive summary
Write this section last, after §4 is stable. A director should read this in two minutes and know what landed, what scale, and what decisions are needed. No table dumps here.
3.1 What we found
[3–5 sentences. What sources were profiled? At what scale (row counts, date ranges)? What is the most important thing leadership should know?]
3.2 Key decisions needed
- [Decision] — [Why it matters, what depends on it]
- [Decision] — [...]
- [Decision] — [...]
3.3 At a glance
| Source domain | Primary fact table | Approx. rows | Date range | Grain |
|---|---|---|---|---|
| [domain] | [table] | [count] | [start]–[end] | [grain] |
4. Per-source / domain catalog
Each domain below follows a fixed subsection pattern. For multi-source memos, group by domain (e.g., 4.1 Shopify, 4.2 Gorgias). Within each domain, one subsection per table. Append new tables as 4.1.N, 4.1.N+1; do not renumber existing subsections in a published memo.
4.1 [Source domain name]
[1–2 sentences: what this source represents to the business, its role in the client's stack, and any cross-cutting notes about the schema as a whole.]
4.1.1 [Table name]
Metrics (queried [YYYY-MM-DD]):
| Attribute | Value |
|---|---|
| Approx. row count | [count] |
| Approx. storage | [bytes] |
| Grain | [one row per …] |
| Date range | [start]–[end] (based on [column]) |
| Distinct keys | [column]: [count] distinct |
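Most of the metrics table above can be filled from one pass over the table. This is a sketch: `created_at` and `order_id` are stand-ins for the table's actual date and key columns, and the bracketed identifiers are placeholders.

```sql
-- Hypothetical profiling query for one table; swap created_at / order_id
-- for the table's actual date and key columns.
SELECT
    COUNT(*)                 AS approx_row_count,
    MIN(created_at)          AS date_range_start,
    MAX(created_at)          AS date_range_end,
    COUNT(DISTINCT order_id) AS distinct_keys
FROM [database].[schema].[table];
-- Approx. storage comes from the catalog, e.g. on Snowflake:
-- SHOW TABLES LIKE '[table]' IN SCHEMA [database].[schema];  (BYTES column)
```

Grain cannot be queried directly; assert it by checking that distinct_keys equals (or nearly equals) approx_row_count for the claimed key.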
Business objective: [1–2 sentences. What does this table represent in business terms? What decisions or reports does it support?]
Questions this table answers:
- [Question 1]
- [Question 2]
- [Question 3]
Caveats and interpretation:
- [Grain warning, join caveat, known duplication, snapshot stacking, filter required]
- [Date column reliability — which column to trust for time-series queries]
- [Comparison to other tables — when to use this vs an alternative]
4.1.2 [Table name]
(Repeat the pattern above for each table in this domain.)
5. Cross-cutting questions
Synthesis across source domains. Only reference tables that are profiled in §4.
| Business question | Where to start | Caveat |
|---|---|---|
| [e.g., Revenue and units sold] | [table(s)] | [grain or filter note] |
| [e.g., Customer identity for CRM joins] | [table(s)] | [mapping required] |
| [e.g., Support contact volume] | [table(s)] | [date filter note] |
6. Joins and caveats
6.1 Key relationships
| Table A | Join key | Table B | Join type | Notes |
|---|---|---|---|---|
| [table] | [column(s)] | [table] | [inner / left / requires key mapping] | [e.g., filter test orders, deduplicate by date] |
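Each relationship declared above is worth an orphan check before anyone models on it. This is a sketch with placeholder table and column names:

```sql
-- Hypothetical orphan check: rows in table A whose key has no match in table B.
-- A nonzero count means the join silently drops (inner) or nulls out (left) rows.
SELECT COUNT(*) AS orphan_rows
FROM [database].[schema].[table_a] a
LEFT JOIN [database].[schema].[table_b] b
    ON a.[join_key] = b.[join_key]
WHERE b.[join_key] IS NULL;
```

Record a nonzero orphan count in the Notes column of the table above, along with the filter or mapping that resolves it.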
6.2 Known traps
- [Trap 1] — [what happens, how to avoid]
- [Trap 2] — [...]
- [Trap 3] — [...]
6.3 Non-canonical tables
Tables that exist in the schema but should NOT be used for enterprise KPIs until explicitly blessed.
| Table | Risk if used | When to revisit |
|---|---|---|
| [name] | [e.g., double-counting, stale data, unclear purpose] | [condition] |
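Candidates for this table can be surfaced mechanically before a human review. This is a Snowflake-flavored sketch; the name patterns are assumptions and should be adjusted to the client's naming conventions.

```sql
-- Hypothetical sweep for TEMP / backup / versioned tables worth flagging.
SELECT table_name, row_count, bytes
FROM [database].INFORMATION_SCHEMA.TABLES
WHERE table_schema = '[SCHEMA_NAME]'
  AND (table_name ILIKE '%TEMP%'
       OR table_name ILIKE '%BACKUP%'
       OR table_name ILIKE '%V1')
ORDER BY bytes DESC;
```

A pattern match is only a flag, not a verdict; a table stays in this list until an owner confirms it is (or is not) canonical.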
7. Recommendations / next steps
Action-oriented, ordered by priority. Minimal hype — state what should happen and why.
- [Action] — [Why this, why now, who should do it]
- [Action] — [...]
- [Action] — [...]
- [Action] — [...]
8. Open items
Anything not yet resolved. One bullet per item. Bold the item name.
- [Item name] — What it is, why it is not resolved, what the impact is, and when it will be addressed or what triggers addressing it.
- [Item name] — […]
9. Validation
How to verify the profiling in this memo is accurate. Write for the technical user who needs to reproduce or audit the work.
9.1 Warehouse access verification
```sql
-- Confirms the profiled schema is reachable
SHOW SCHEMAS IN DATABASE [database_name];
SHOW TABLES IN SCHEMA [database_name].[schema_name];
```
9.2 Row count and date range verification
```sql
-- Spot-check row counts and date ranges for key tables
SELECT 'ORDERS' AS table_name, COUNT(*) AS row_count, MIN(created_at) AS min_date, MAX(created_at) AS max_date FROM [database].[schema].[table];
-- Repeat for each profiled table
```
9.3 What passing looks like
- Row counts are within 5% of the snapshotted values in §4 (warehouse activity continues between profiling and validation)
- Date ranges overlap the stated ranges in §4
- Distinct key counts are not dramatically lower than row counts (no massive duplication)
- Schema names match what is declared in §2
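The duplication check in the third bullet above can be made concrete. This is a sketch: `order_id` stands in for whichever key column §4 declares for the table.

```sql
-- Hypothetical duplication check: ratio of rows to distinct keys.
SELECT
    COUNT(*)                 AS row_count,
    COUNT(DISTINCT order_id) AS distinct_keys,
    COUNT(*) / NULLIF(COUNT(DISTINCT order_id), 0) AS rows_per_key
FROM [database].[schema].[table];
-- rows_per_key near 1.0 means no material duplication at the declared grain.
```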
10. Acceptance and sign-off
| Role | Name | Date | Status |
|---|---|---|---|
| [e.g., Data Engineering lead] | [name] | [date] | [Approved / Changes requested] |
| [e.g., Analytics stakeholder] | [name] | [date] | [Approved / Changes requested] |
11. Changelog
Date-stamped entries for source changes after the initial memo. Newest first. Each entry records what changed and which tables were affected. This section replaces the need for a separate “Discovery Memo Update” document.
| Date | What changed | Affected tables | Author |
|---|---|---|---|
| [YYYY-MM-DD] | [e.g., New tables added for Q2 scope] | [e.g., RAW.POLYTOMIC_NEWSOURCE.*] | [name] |
| [YYYY-MM-DD] | [e.g., Column added to entities table] | [table_name] | [name] |
Initial memo: [YYYY-MM-DD] (see §4 for baseline catalog). Add entries above this line as the source evolves.
Appendix A — Agent guardrails (anti-fluff)
When an LLM agent drafts or updates this memo, enforce these rules. Humans: treat these as style reminders.
Do:
- Prefer short sentences, neutral voice, and specific numbers with dates.
- Use markdown tables for metric bundles; use bullets for interpretations and questions.
- Name grain explicitly (order line vs invoice vs snapshot history).
- Flag caveats (estimate vs exact count, repeated snapshots inflating rows).
- Match the client’s vocabulary when known (program names, retailer names, system names).
- Write the executive summary to pass the two-minute director skim test.
Do not:
- Use filler openers (“In today’s data landscape…”, “It is important to note…”, “This comprehensive overview…”).
- Stack synonyms for importance (critical, robust, powerful, holistic) without adding fact.
- Add generic value claims (“drives actionable insights”) without tying to a decision or metric.
- Invent row counts, schemas, or join keys; if unknown, write TBD and list what query will answer it.
- Use em dashes as a crutch for long clauses — prefer periods.
- Rewrite the memo in a “friendlier” tone when the ask was factual discovery.
Appendix B — Pre-handoff QA checklist
Complete before sharing with a client stakeholder.
- Warehouse platform, database, and schema are confirmed and written in metadata
- Profiling date is recorded and consistent across §4
- Every table in §4 has: metrics table, business objective, questions, and caveats
- Row counts and date ranges were queried (not estimated from INFORMATION_SCHEMA alone)
- Cross-cutting questions (§5) reference only tables profiled in §4
- Non-canonical / TEMP tables are flagged in §6.3
- Executive summary (§3) previews §4 faithfully — no claims that §4 doesn’t support
- Open items (§8) includes anything unresolved or approximate
- Validation queries (§9) would pass if run today
- Filename follows convention: {client}-discovery-memo-{domain}.md
- Prior schema audit or related artifacts reconciled — no contradictions shipped
- Changelog (§11) populated if this memo has been updated since initial profiling
- Appendix C (Schema Audit) column catalog covers every table profiled in §4
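The checklist item about not relying on INFORMATION_SCHEMA alone can be verified by comparing the catalog estimate against an exact count. This is a sketch with placeholder identifiers:

```sql
-- Hypothetical comparison: catalog row-count estimate vs. an exact COUNT(*).
-- A large gap usually means stale catalog stats or ongoing loads.
SELECT
    (SELECT row_count
     FROM [database].INFORMATION_SCHEMA.TABLES
     WHERE table_schema = '[SCHEMA_NAME]'
       AND table_name  = '[TABLE_NAME]') AS catalog_estimate,
    (SELECT COUNT(*)
     FROM [database].[schema].[table])   AS exact_count;
```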
Appendix C — Schema Audit
Column-level reference for each profiled table. This is a technical appendix — no narrative, no business interpretation. Pure schema documentation for the data engineer building models on top of this share. Replace with a live SHOW TABLES / DESCRIBE TABLE output in the warehouse when possible.
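On Snowflake, the raw material for each per-table catalog below can be pulled with the statements sketched here; placeholders are bracketed, and cardinality still has to be queried separately per column since DESCRIBE does not report it.

```sql
-- Hypothetical extraction of the column catalog for one table.
DESC TABLE [database].[schema].[table];   -- name, data type, nullability

-- Cardinality must be queried per column of interest:
SELECT COUNT(DISTINCT [column]) AS cardinality
FROM [database].[schema].[table];
```

FK/PK references and the Notes column are judgment calls; fill them from the §4 narrative and vendor documentation rather than from the catalog alone.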
C.1 [Table name]
| Column | Data type | Nullable | Cardinality | FK / PK reference | Notes |
|---|---|---|---|---|---|
| [column] | [type] | [Y/N] | [count] | [references table.column] | [e.g., deprecated, use alt column] |
| [column] | [type] | [Y/N] | [count] | [references table.column] | [...] |
C.2 [Table name]
(Repeat for each table profiled in §4. Keep the same table ordering used in §4.)