[Client] — Data source discovery memo — [Domain / share name]
About this document (Brainforge)
Internal conventions for how this file works in the repo. Optional: strip this section, or export without it, when sharing a client-facing artifact.
Titling and filename
Use [Client] — Data source discovery memo — [Domain or share name] for the document title (the H1 above). Examples: LMNT — Data source discovery memo — Shopify (RAW.POLYTOMIC_SHOPIFY) · Acme — Data source discovery memo — ERP (RAW.ORACLE_ERP).
Filename: {client}-discovery-memo-{domain}.md under knowledge/clients/{client}/resources/.
When the memo covers multiple sources or shares, use a broader domain name (e.g., lmnt-discovery-memo-retail.md for Shopify + ReCharge + Gorgias in one document) or split into per-source files with a parent README. Prefer one memo per share or tightly coupled source group; avoid monolithic “all sources” memos that are hard to update.
Single source of truth
Repo markdown under knowledge/clients/{client}/resources/ is the canonical discovery memo once merged. Google Docs may exist for stakeholder review; this file wins on scope, table detail, and warehouse-verified metrics.
- This memo (repo markdown) is the version-of-record for the discovery baseline — what was profiled, when, with what row counts and caveats. Update it when new tables arrive or share scopes change; do not maintain a parallel “v2” without a new file version.
- Warehouse facts (row counts, date ranges, grain) are snapshotted on the profiling date. Append update sections for additive changes; do not silently overwrite historical numbers.
- Google Docs (if used for stakeholder review) are downstream convenience copies. Import this markdown to create them; do not let the Google Doc diverge from the repo source.
When to use this template
Use this template when:
- a new private share or schema appears for a client and leadership needs a written baseline
- many new tables need to be introduced in one memo with consistent per-table subsections
- the ask is discovery / profiling / interpretation, not full dimensional modeling or dbt project design
This template includes three consolidated artifacts:
- Changelog (§11) — date-stamped entries for source updates (new tables, scope changes). Replaces the need for a standalone “Discovery Memo Update” document.
- Appendix C (Schema Audit) — column-level per-table catalog with data types, nulls, cardinalities, and foreign keys. Replaces the need for a standalone Schema Audit document.
- §2 (Source and lineage) includes ownership, SLA, and freshness tables. Replaces the need for a standalone Data Contract / SLA document.
Do not use this template when:
- the ask is a one-page executive overview only (use data-source-memo-executive-style-playbook.md)
- the ask is a pure technical schema audit with no narrative (this is now Appendix C — use this template, just skip the narrative sections)
- the deliverable is a data accuracy investigation (use the Data Findings Memo template)
Document metadata
Status: [Draft / In stakeholder review / Approved / Locked]
Warehouse: [Snowflake / BigQuery / other] — Account/region: [details]
Database / project: [database or project name]
Schema / dataset: [schema or dataset name]
Profiled: [YYYY-MM-DD] (single snapshot date for all profiling in this memo)
Prepared for: [Client stakeholder names and roles]
Prepared by: Brainforge
Last updated: [YYYY-MM-DD]
Related artifacts
| Artifact | Link / path | Notes |
|---|---|---|
| Data Platform Documentation | [Google Sheet link] | Source catalog, metric definitions |
| Linear project | [Linear project URL] | Discovery or data foundation milestones |
| Prior schema audit | [path or link] | If a separate schema audit exists, reconcile table names and domains |
| Vendor documentation | [link] | API docs, ERDs, or vendor data dictionaries |
| Stakeholder intake notes | [link or path] | Call notes or requirements that shaped discovery scope |
1. Engagement context
1.1 Why this data matters
[2–4 sentences. What business problem does this data help solve? What becomes possible that was not possible before? Why now — what triggered this discovery?]
1.2 Audience
| Role | What they need from this memo |
|---|---|
| [e.g., VP of Analytics] | [e.g., Can we retire the legacy spreadsheet? What KPIs are reliable?] |
| [e.g., Data Engineering lead] | [e.g., Refresh cadence, join keys, TEMP table decisions, dbt staging priorities] |
| [e.g., Finance / Ops stakeholder] | [e.g., Which tables anchor revenue or cost reporting? What filters are required?] |
1.3 Discovery scope
[Narrative: which share, schema, or source group was profiled. List the tables included and any that were explicitly excluded. Note if scope was constrained by access, time, or stakeholder direction.]
In scope:
| Source domain | Schema / dataset | Tables profiled | Notes |
|---|---|---|---|
| [domain name] | [schema] | [count] | [e.g., full schema; excluding deprecated _v1 tables] |
Excluded (and why):
| Table / domain | Reason excluded | When to revisit |
|---|---|---|
| [name] | [e.g., empty snapshot, deprecated, access denied] | [condition or date] |
2. Source and lineage
2.1 How data arrives
[Describe the ingestion path: source system → connector (Polytomic / Fivetran / Stitch / custom) → warehouse schema. Note refresh cadence, sync scope, and any known latency or reliability concerns.]
[Source system]
→ [Connector / pipeline] (refresh: [cadence])
→ [Warehouse].[database].[schema]
→ dbt staging / marts
→ BI tool / analytics
2.2 Ownership and access
| Concern | Detail |
|---|---|
| Connector owner | [team or person] |
| Warehouse access | [role / user / service account used for profiling] |
| API / token health | [known expiry, rotation schedule, or concerns] |
| Runbook / playbook | [link to operational runbook if it exists] |
2.3 Freshness SLA
| Metric | Expectation | Measured value | Notes |
|---|---|---|---|
| Refresh cadence | [e.g., Daily at 06:00 UTC] | [actual as of profiling date] | [e.g., may drift on weekends] |
| Freshness SLA | [e.g., Data ≤ 24h old at start of business] | [actual lag] | [e.g., delays if upstream API is down] |
| Expected latency | [e.g., 2h from source event to warehouse] | [actual p50/p95] | [e.g., bulk loads land in single daily batch] |
| Alert contact | [name / team / Slack channel] | — | [e.g., #data-alerts] |
| Known failure modes | [e.g., connector drops auth token every 90 days] | — | [e.g., documented in runbook linked above] |
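The measured lag in the table above can be filled in with a freshness query. This is a sketch: the `_polytomic_synced_at` column name is an assumption (connectors name their sync-timestamp column differently), and the bracketed identifiers are placeholders.

```sql
-- Hypothetical freshness check: lag between now and the newest synced row.
-- _polytomic_synced_at is an assumed column name; substitute the
-- connector's actual sync-timestamp column.
SELECT
    MAX(_polytomic_synced_at) AS last_synced_at,
    DATEDIFF('hour', MAX(_polytomic_synced_at), CURRENT_TIMESTAMP()) AS lag_hours
FROM [database].[schema].[table];
```

Run it at the same time of day the SLA is defined against (e.g., start of business), or the measured lag will not be comparable to the expectation.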
3. Executive summary
Write this section last, after §4 is stable. A director should read this in two minutes and know what landed, what scale, and what decisions are needed. No table dumps here.
3.1 What we found
[3–5 sentences. What sources were profiled? At what scale (row counts, date ranges)? What is the most important thing leadership should know?]
3.2 Key decisions needed
- [Decision] — [Why it matters, what depends on it]
- [Decision] — [...]
- [Decision] — [...]
3.3 At a glance
| Source domain | Primary fact table | Approx. rows | Date range | Grain |
|---|---|---|---|---|
| [domain] | [table] | [count] | [start]–[end] | [grain] |
4. Per-source / domain catalog
Each domain below follows a fixed subsection pattern. For multi-source memos, group by domain (e.g., 4.1 Shopify, 4.2 Gorgias). Within each domain, one subsection per table. Append new tables as 4.1.N, 4.1.N+1; do not renumber existing subsections in a published memo.
4.1 [Source domain name]
[1–2 sentences: what this source represents to the business, its role in the client's stack, and any cross-cutting notes about the schema as a whole.]
4.1.1 [Table name]
Metrics (queried [YYYY-MM-DD]):
| Attribute | Value |
|---|---|
| Approx. row count | [count] |
| Approx. storage | [bytes] |
| Grain | [one row per …] |
| Date range | [start]–[end] (based on [column]) |
| Distinct keys | [column]: [count] distinct |
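Most of the metrics table above can be filled from one pass over the table. This is a sketch: `created_at` and `order_id` are stand-ins for the table's actual date and key columns, and the bracketed identifiers are placeholders.

```sql
-- Hypothetical profiling query for one table; swap created_at / order_id
-- for the table's actual date and key columns.
SELECT
    COUNT(*)                 AS approx_row_count,
    MIN(created_at)          AS date_range_start,
    MAX(created_at)          AS date_range_end,
    COUNT(DISTINCT order_id) AS distinct_keys
FROM [database].[schema].[table];
-- Approx. storage comes from the catalog, e.g. on Snowflake:
-- SHOW TABLES LIKE '[table]' IN SCHEMA [database].[schema];  (BYTES column)
```

Grain cannot be queried directly; assert it by checking that distinct_keys equals (or nearly equals) approx_row_count for the claimed key.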
Business objective: [1–2 sentences. What does this table represent in business terms? What decisions or reports does it support?]
Questions this table answers:
- [Question 1]
- [Question 2]
- [Question 3]
Caveats and interpretation:
- [Grain warning, join caveat, known duplication, snapshot stacking, filter required]
- [Date column reliability — which column to trust for time-series queries]
- [Comparison to other tables — when to use this vs an alternative]
4.1.2 [Table name]
(Repeat the pattern above for each table in this domain.)
5. Cross-cutting questions
Synthesis across source domains. Only reference tables that are profiled in §4.
| Business question | Where to start | Caveat |
|---|---|---|
| [e.g., Revenue and units sold] | [table(s)] | [grain or filter note] |
| [e.g., Customer identity for CRM joins] | [table(s)] | [mapping required] |
| [e.g., Support contact volume] | [table(s)] | [date filter note] |
6. Joins and caveats
6.1 Key relationships
| Table A | Join key | Table B | Join type | Notes |
|---|---|---|---|---|
| [table] | [column(s)] | [table] | [inner / left / requires key mapping] | [e.g., filter test orders, deduplicate by date] |
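Each relationship declared above is worth an orphan check before anyone models on it. This is a sketch with placeholder table and column names:

```sql
-- Hypothetical orphan check: rows in table A whose key has no match in table B.
-- A nonzero count means the join silently drops (inner) or nulls out (left) rows.
SELECT COUNT(*) AS orphan_rows
FROM [database].[schema].[table_a] a
LEFT JOIN [database].[schema].[table_b] b
    ON a.[join_key] = b.[join_key]
WHERE b.[join_key] IS NULL;
```

Record a nonzero orphan count in the Notes column of the table above, along with the filter or mapping that resolves it.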
6.2 Known traps
- [Trap 1] — [what happens, how to avoid]
- [Trap 2] — [...]
- [Trap 3] — [...]
6.3 Non-canonical tables
Tables that exist in the schema but should NOT be used for enterprise KPIs until explicitly blessed.
| Table | Risk if used | When to revisit |
|---|---|---|
| [name] | [e.g., double-counting, stale data, unclear purpose] | [condition] |
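Candidates for this table can be surfaced mechanically before a human review. This is a Snowflake-flavored sketch; the name patterns are assumptions and should be adjusted to the client's naming conventions.

```sql
-- Hypothetical sweep for TEMP / backup / versioned tables worth flagging.
SELECT table_name, row_count, bytes
FROM [database].INFORMATION_SCHEMA.TABLES
WHERE table_schema = '[SCHEMA_NAME]'
  AND (table_name ILIKE '%TEMP%'
       OR table_name ILIKE '%BACKUP%'
       OR table_name ILIKE '%V1')
ORDER BY bytes DESC;
```

A pattern match is only a flag, not a verdict; a table stays in this list until an owner confirms it is (or is not) canonical.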
7. Recommendations / next steps
Action-oriented, ordered by priority. Minimal hype — state what should happen and why.
- [Action] — [Why this, why now, who should do it]
- [Action] — [...]
- [Action] — [...]
- [Action] — [...]
8. Open items
Anything not yet resolved. One bullet per item. Bold the item name.
- [Item name] — What it is, why it is not resolved, what the impact is, and when it will be addressed or what triggers addressing it.
- [Item name] — […]
9. Validation
How to verify the profiling in this memo is accurate. Write for the technical user who needs to reproduce or audit the work.
9.1 Warehouse access verification
```sql
-- Confirms the profiled schema is reachable
SHOW SCHEMAS IN DATABASE [database_name];
SHOW TABLES IN SCHEMA [database_name].[schema_name];
```
9.2 Row count and date range verification
```sql
-- Spot-check row counts and date ranges for key tables
SELECT 'ORDERS' AS table_name, COUNT(*) AS row_count, MIN(created_at) AS min_date, MAX(created_at) AS max_date FROM [database].[schema].[table];
-- Repeat for each profiled table
```
9.3 What passing looks like
- Row counts are within 5% of the snapshotted values in §4 (warehouse activity continues between profiling and validation)
- Date ranges overlap the stated ranges in §4
- Distinct key counts are not dramatically lower than row counts (no massive duplication)
- Schema names match what is declared in §2
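The duplication check in the third bullet above can be made concrete. This is a sketch: `order_id` stands in for whichever key column §4 declares for the table.

```sql
-- Hypothetical duplication check: ratio of rows to distinct keys.
SELECT
    COUNT(*)                 AS row_count,
    COUNT(DISTINCT order_id) AS distinct_keys,
    COUNT(*) / NULLIF(COUNT(DISTINCT order_id), 0) AS rows_per_key
FROM [database].[schema].[table];
-- rows_per_key near 1.0 means no material duplication at the declared grain.
```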
10. Acceptance and sign-off
| Role | Name | Date | Status |
|---|---|---|---|
| [e.g., Data Engineering lead] | [name] | [date] | [Approved / Changes requested] |
| [e.g., Analytics stakeholder] | [name] | [date] | [Approved / Changes requested] |
11. Changelog
Date-stamped entries for source changes after the initial memo. Newest first. Each entry records what changed and which tables were affected. This section replaces the need for a separate “Discovery Memo Update” document.
| Date | What changed | Affected tables | Author |
|---|---|---|---|
| [YYYY-MM-DD] | [e.g., New tables added for Q2 scope] | [e.g., RAW.POLYTOMIC_NEWSOURCE.*] | [name] |
| [YYYY-MM-DD] | [e.g., Column added to entities table] | [table_name] | [name] |
Initial memo: [YYYY-MM-DD] (see §4 for baseline catalog). Add entries above this line as the source evolves.
Appendix A — Agent guardrails (anti-fluff)
When an LLM agent drafts or updates this memo, enforce these rules. Humans: treat these as style reminders.
Do:
- Prefer short sentences, neutral voice, and specific numbers with dates.
- Use markdown tables for metric bundles; use bullets for interpretations and questions.
- Name grain explicitly (order line vs invoice vs snapshot history).
- Flag caveats (estimate vs exact count, repeated snapshots inflating rows).
- Match the client’s vocabulary when known (program names, retailer names, system names).
- Write the executive summary to pass the two-minute director skim test.
Do not:
- Use filler openers (“In today’s data landscape…”, “It is important to note…”, “This comprehensive overview…”).
- Stack synonyms for importance (critical, robust, powerful, holistic) without adding fact.
- Add generic value claims (“drives actionable insights”) without tying to a decision or metric.
- Invent row counts, schemas, or join keys; if unknown, write TBD and list what query will answer it.
- Use em dashes as a crutch for long clauses — prefer periods.
- Rewrite the memo in a “friendlier” tone when the ask was factual discovery.
Appendix B — Pre-handoff QA checklist
Complete before sharing with a client stakeholder.
- Warehouse platform, database, and schema are confirmed and written in metadata
- Profiling date is recorded and consistent across §4
- Every table in §4 has: metrics table, business objective, questions, and caveats
- Row counts and date ranges were queried (not estimated from INFORMATION_SCHEMA alone)
- Cross-cutting questions (§5) reference only tables profiled in §4
- Non-canonical / TEMP tables are flagged in §6.3
- Executive summary (§3) previews §4 faithfully — no claims that §4 doesn’t support
- Open items (§8) includes anything unresolved or approximate
- Validation queries (§9) would pass if run today
- Filename follows convention: {client}-discovery-memo-{domain}.md
- Prior schema audit or related artifacts reconciled — no contradictions shipped
- Changelog (§11) populated if this memo has been updated since initial profiling
- Appendix C (Schema Audit) column catalog covers every table profiled in §4
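The checklist item about not relying on INFORMATION_SCHEMA alone can be verified by comparing the catalog estimate against an exact count. This is a sketch with placeholder identifiers:

```sql
-- Hypothetical comparison: catalog row-count estimate vs. an exact COUNT(*).
-- A large gap usually means stale catalog stats or ongoing loads.
SELECT
    (SELECT row_count
     FROM [database].INFORMATION_SCHEMA.TABLES
     WHERE table_schema = '[SCHEMA_NAME]'
       AND table_name  = '[TABLE_NAME]') AS catalog_estimate,
    (SELECT COUNT(*)
     FROM [database].[schema].[table])   AS exact_count;
```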
Appendix C — Schema Audit
Column-level reference for each profiled table. This is a technical appendix — no narrative, no business interpretation. Pure schema documentation for the data engineer building models on top of this share. Replace with a live SHOW TABLES / DESCRIBE TABLE output in the warehouse when possible.
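On Snowflake, the raw material for each per-table catalog below can be pulled with the statements sketched here; placeholders are bracketed, and cardinality still has to be queried separately per column since DESCRIBE does not report it.

```sql
-- Hypothetical extraction of the column catalog for one table.
DESC TABLE [database].[schema].[table];   -- name, data type, nullability

-- Cardinality must be queried per column of interest:
SELECT COUNT(DISTINCT [column]) AS cardinality
FROM [database].[schema].[table];
```

FK/PK references and the Notes column are judgment calls; fill them from the §4 narrative and vendor documentation rather than from the catalog alone.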
C.1 [Table name]
| Column | Data type | Nullable | Cardinality | FK / PK reference | Notes |
|---|---|---|---|---|---|
| [column] | [type] | [Y/N] | [count] | [references table.column] | [e.g., deprecated, use alt column] |
| [column] | [type] | [Y/N] | [count] | [references table.column] | [...] |
C.2 [Table name]
(Repeat for each table profiled in §4. Keep the same table ordering used in §4.)