[Client] — Data source discovery memo — [Domain / share name]

About this document (Brainforge)

Internal conventions for how this file works in the repo. Optional: strip this section (or export without it) when sharing a client-only artifact.

Titling and filename

Use [Client] — Data source discovery memo — [Domain or share name] for the document title (the H1 above). Examples: LMNT — Data source discovery memo — Shopify (RAW.POLYTOMIC_SHOPIFY) · Acme — Data source discovery memo — ERP (RAW.ORACLE_ERP).

Filename: {client}-discovery-memo-{domain}.md under knowledge/clients/{client}/resources/.

When the memo covers multiple sources or shares, use a broader domain name (e.g., lmnt-discovery-memo-retail.md for Shopify + ReCharge + Gorgias in one document) or split into per-source files with a parent README. Prefer one memo per share or tightly coupled source group; avoid monolithic “all sources” memos that are hard to update.

Single source of truth

Repo markdown under knowledge/clients/{client}/resources/ is the canonical discovery memo once merged. Google Docs may exist for stakeholder review; this file wins on scope, table detail, and warehouse-verified metrics.

  • This memo (repo markdown) is the version of record for the discovery baseline — what was profiled, when, with what row counts and caveats. Update it when new tables arrive or share scopes change; do not fork a parallel “v2” document.
  • Warehouse facts (row counts, date ranges, grain) are snapshotted on the profiling date. Append update sections for additive changes; do not silently overwrite historical numbers.
  • Google Docs (if used for stakeholder review) are downstream convenience copies. Import this markdown to create them; do not let the Google Doc diverge from the repo source.

When to use this template

Use this template when:

  • a new private share or schema appears for a client and leadership needs a written baseline
  • many new tables need to be introduced in one memo with consistent per-table subsections
  • the ask is discovery / profiling / interpretation, not full dimensional modeling or dbt project design

This template includes three consolidated artifacts:

  • Changelog (§11) — date-stamped entries for source updates (new tables, scope changes). Replaces the need for a standalone “Discovery Memo Update” document.
  • Appendix C (Schema Audit) — column-level per-table catalog with data types, nulls, cardinalities, and foreign keys. Replaces the need for a standalone Schema Audit document.
  • §2 (Source and lineage) includes ownership, SLA, and freshness tables. Replaces the need for a standalone Data Contract / SLA document.

Do not use this template when:

  • the ask is a one-page executive overview only (use data-source-memo-executive-style-playbook.md)
  • the ask is a pure technical schema audit with no narrative (this is now Appendix C — use this template, just skip the narrative sections)
  • the deliverable is a data accuracy investigation (use the Data Findings Memo template)

Document metadata

  • Status: [Draft / In stakeholder review / Approved / Locked]
  • Warehouse: [Snowflake / BigQuery / other] — Account/region: [details]
  • Database / project: [database or project name]
  • Schema / dataset: [schema or dataset name]
  • Profiled: [YYYY-MM-DD] (single snapshot date for all profiling in this memo)
  • Prepared for: [Client stakeholder names and roles]
  • Prepared by: Brainforge
  • Last updated: [YYYY-MM-DD]


| Artifact | Link / path | Notes |
| --- | --- | --- |
| Data Platform Documentation | [Google Sheet link] | Source catalog, metric definitions |
| Linear project | [Linear project URL] | Discovery or data foundation milestones |
| Prior schema audit | [path or link] | If a separate schema audit exists, reconcile table names and domains |
| Vendor documentation | [link] | API docs, ERDs, or vendor data dictionaries |
| Stakeholder intake notes | [link or path] | Call notes or requirements that shaped discovery scope |

1. Engagement context

1.1 Why this data matters

[2–4 sentences. What business problem does this data help solve? What becomes possible that was not possible before? Why now — what triggered this discovery?]

1.2 Audience

| Role | What they need from this memo |
| --- | --- |
| [e.g., VP of Analytics] | [e.g., Can we retire the legacy spreadsheet? What KPIs are reliable?] |
| [e.g., Data Engineering lead] | [e.g., Refresh cadence, join keys, TEMP table decisions, dbt staging priorities] |
| [e.g., Finance / Ops stakeholder] | [e.g., Which tables anchor revenue or cost reporting? What filters are required?] |

1.3 Discovery scope

[Narrative: which share, schema, or source group was profiled. List the tables included and any that were explicitly excluded. Note if scope was constrained by access, time, or stakeholder direction.]

In scope:

| Source domain | Schema / dataset | Tables profiled | Notes |
| --- | --- | --- | --- |
| [domain name] | [schema] | [count] | [e.g., full schema; excluding deprecated _v1 tables] |

Excluded (and why):

| Table / domain | Reason excluded | When to revisit |
| --- | --- | --- |
| [name] | [e.g., empty snapshot, deprecated, access denied] | [condition or date] |

2. Source and lineage

2.1 How data arrives

[Describe the ingestion path: source system → connector (Polytomic / Fivetran / Stitch / custom) → warehouse schema. Note refresh cadence, sync scope, and any known latency or reliability concerns.]

```
[Source system]
  → [Connector / pipeline] (refresh: [cadence])
    → [Warehouse].[database].[schema]
      → dbt staging / marts
        → BI tool / analytics
```

2.2 Ownership and access

| Concern | Detail |
| --- | --- |
| Connector owner | [team or person] |
| Warehouse access | [role / user / service account used for profiling] |
| API / token health | [known expiry, rotation schedule, or concerns] |
| Runbook / playbook | [link to operational runbook if it exists] |

2.3 Freshness SLA

| Metric | Expectation | Measured value | Notes |
| --- | --- | --- | --- |
| Refresh cadence | [e.g., Daily at 06:00 UTC] | [actual as of profiling date] | [e.g., may drift on weekends] |
| Freshness SLA | [e.g., Data ≤ 24h old at start of business] | [actual lag] | [e.g., delays if upstream API is down] |
| Expected latency | [e.g., 2h from source event to warehouse] | [actual p50/p95] | [e.g., bulk loads land in single daily batch] |
| Alert contact | [name / team / Slack channel] | | [e.g., #data-alerts] |
| Known failure modes | [e.g., connector drops auth token every 90 days] | | [e.g., documented in runbook linked above] |
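
A quick way to fill the measured-value column above is a lag query against the connector's load timestamp. A minimal sketch, assuming Snowflake syntax and a hypothetical _loaded_at audit column (substitute whatever column the connector actually writes, e.g. _fivetran_synced):

```sql
-- Current lag between the newest loaded row and now.
-- _loaded_at is an assumed connector audit column; substitute the real one.
SELECT
    MAX(_loaded_at) AS last_load,
    DATEDIFF('hour', MAX(_loaded_at), CURRENT_TIMESTAMP()) AS hours_since_last_load
FROM [database].[schema].[table];
```

Record hours_since_last_load against the SLA expectation in the table above.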

3. Executive summary

Write this section last, after §4 is stable. A director should read this in two minutes and know what landed, what scale, and what decisions are needed. No table dumps here.

3.1 What we found

[3–5 sentences. What sources were profiled? At what scale (row counts, date ranges)? What is the most important thing leadership should know?]

3.2 Key decisions needed

  1. [Decision] — [Why it matters, what depends on it]
  2. [Decision] — [...]
  3. [Decision] — [...]

3.3 At a glance

| Source domain | Primary fact table | Approx. rows | Date range | Grain |
| --- | --- | --- | --- | --- |
| [domain] | [table] | [count] | [start]–[end] | [grain] |

4. Per-source / domain catalog

Each domain below follows a fixed subsection pattern. For multi-source memos, group by domain (e.g., 4.1 Shopify, 4.2 Gorgias). Within each domain, one subsection per table. Append new tables as 4.1.N, 4.1.N+1; do not renumber existing subsections in a published memo.

4.1 [Source domain name]

[1–2 sentences: what this source represents to the business, its role in the client's stack, and any cross-cutting notes about the schema as a whole.]

4.1.1 [Table name]

Metrics (queried [YYYY-MM-DD]):

| Attribute | Value |
| --- | --- |
| Approx. row count | [count] |
| Approx. storage | [bytes] |
| Grain | [one row per …] |
| Date range | [start]–[end] (based on [column]) |
| Distinct keys | [column]: [count] distinct |
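
The attribute table above can be filled from a single profiling query per table. A minimal sketch, assuming Snowflake syntax; [date_column] and [key_column] are placeholders for the table's reliable event-date column and primary key:

```sql
-- Profiling snapshot: row count, date range, and distinct key count.
-- [date_column] and [key_column] are placeholders, not real column names.
SELECT
    COUNT(*)                     AS approx_row_count,
    MIN([date_column])           AS date_range_start,
    MAX([date_column])           AS date_range_end,
    COUNT(DISTINCT [key_column]) AS distinct_keys
FROM [database].[schema].[table];
```

Run all profiling queries on the single snapshot date recorded in the document metadata.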

Business objective: [1–2 sentences. What does this table represent in business terms? What decisions or reports does it support?]

Questions this table answers:

  • [Question 1]
  • [Question 2]
  • [Question 3]

Caveats and interpretation:

  • [Grain warning, join caveat, known duplication, snapshot stacking, filter required]
  • [Date column reliability — which column to trust for time-series queries]
  • [Comparison to other tables — when to use this vs an alternative]

4.1.2 [Table name]

(Repeat the pattern above for each table in this domain.)


5. Cross-cutting questions

Synthesis across source domains. Only reference tables that are profiled in §4.

| Business question | Where to start | Caveat |
| --- | --- | --- |
| [e.g., Revenue and units sold] | [table(s)] | [grain or filter note] |
| [e.g., Customer identity for CRM joins] | [table(s)] | [mapping required] |
| [e.g., Support contact volume] | [table(s)] | [date filter note] |

6. Joins and caveats

6.1 Key relationships

| Table A | Join key | Table B | Join type | Notes |
| --- | --- | --- | --- | --- |
| [table] | [column(s)] | [table] | [inner / left / requires key mapping] | [e.g., filter test orders, deduplicate by date] |
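
Where a relationship needs filtering or deduplication before it is safe, capture the canonical join shape as SQL next to its row above. A hedged sketch with placeholder names throughout, assuming Snowflake-style QUALIFY for the dedup:

```sql
-- Canonical join shape: left join with test orders filtered out and the
-- right side deduplicated to one row per key before joining.
-- Every table and column name here is a placeholder.
SELECT
    a.*,
    b.[attribute]
FROM [table_a] AS a
LEFT JOIN (
    SELECT [join_key], [attribute]
    FROM [table_b]
    QUALIFY ROW_NUMBER() OVER (PARTITION BY [join_key] ORDER BY [updated_at] DESC) = 1
) AS b
    ON a.[join_key] = b.[join_key]
WHERE a.[is_test] = FALSE;
```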

6.2 Known traps

  • [Trap 1] — [what happens, how to avoid]
  • [Trap 2] — [...]
  • [Trap 3] — [...]

6.3 Non-canonical tables

Tables that exist in the schema but should NOT be used for enterprise KPIs until explicitly blessed.

| Table | Risk if used | When to revisit |
| --- | --- | --- |
| [name] | [e.g., double-counting, stale data, unclear purpose] | [condition] |

7. Recommendations / next steps

Action-oriented, ordered by priority. Minimal hype — state what should happen and why.

  1. [Action] — [Why this, why now, who should do it]
  2. [Action] — [...]
  3. [Action] — [...]
  4. [Action] — [...]

8. Open items

Anything not yet resolved. One bullet per item. Bold the item name.

  • [Item name] — What it is, why it is not resolved, what the impact is, and when it will be addressed or what triggers addressing it.
  • [Item name] — […]

9. Validation

How to verify the profiling in this memo is accurate. Write for the technical user who needs to reproduce or audit the work.

9.1 Warehouse access verification

```sql
-- Confirms the profiled schema is reachable
SHOW SCHEMAS IN DATABASE [database_name];
SHOW TABLES IN SCHEMA [database_name].[schema_name];
```

9.2 Row count and date range verification

```sql
-- Spot-check row counts and date ranges for key tables
SELECT
    'ORDERS'        AS table_name,
    COUNT(*)        AS row_count,
    MIN(created_at) AS min_date,
    MAX(created_at) AS max_date
FROM [database].[schema].[table];
-- Repeat for each profiled table
```
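
A companion check for the distinct-key criterion in §9.3 below; a sketch with a placeholder key column:

```sql
-- Duplication check: distinct keys should be close to row count for tables
-- whose grain is one row per key. [key_column] is a placeholder.
SELECT
    COUNT(*)                                AS row_count,
    COUNT(DISTINCT [key_column])            AS distinct_keys,
    COUNT(*) - COUNT(DISTINCT [key_column]) AS duplicate_rows
FROM [database].[schema].[table];
```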

9.3 What passing looks like

  • Row counts are within 5% of the snapshotted values in §4 (warehouse activity continues between profiling and validation)
  • Date ranges overlap the stated ranges in §4
  • Distinct key counts are not dramatically lower than row counts (no massive duplication)
  • Schema names match what is declared in §2

10. Acceptance and sign-off

| Role | Name | Date | Status |
| --- | --- | --- | --- |
| [e.g., Data Engineering lead] | [name] | [date] | [Approved / Changes requested] |
| [e.g., Analytics stakeholder] | [name] | [date] | [Approved / Changes requested] |

11. Changelog

Date-stamped entries for source changes after the initial memo. Newest first. Each entry records what changed and which tables were affected. This section replaces the need for a separate “Discovery Memo Update” document.

| Date | What changed | Affected tables | Author |
| --- | --- | --- | --- |
| [YYYY-MM-DD] | [e.g., New tables added for Q2 scope] | [e.g., RAW.POLYTOMIC_NEWSOURCE.*] | [name] |
| [YYYY-MM-DD] | [e.g., Column added to entities table] | [table_name] | [name] |

Initial memo: [YYYY-MM-DD] (see §4 for baseline catalog). Add entries above this line as the source evolves.


Appendix A — Agent guardrails (anti-fluff)

When an LLM agent drafts or updates this memo, enforce these rules. Humans: treat these as style reminders.

Do:

  • Prefer short sentences, neutral voice, and specific numbers with dates.
  • Use markdown tables for metric bundles; use bullets for interpretations and questions.
  • Name grain explicitly (order line vs invoice vs snapshot history).
  • Flag caveats (estimate vs exact count, repeated snapshots inflating rows).
  • Match the client’s vocabulary when known (program names, retailer names, system names).
  • Write the executive summary to pass the two-minute director skim test.

Do not:

  • Use filler openers (“In today’s data landscape…”, “It is important to note…”, “This comprehensive overview…”).
  • Stack synonyms for importance (critical, robust, powerful, holistic) without adding fact.
  • Add generic value claims (“drives actionable insights”) without tying to a decision or metric.
  • Invent row counts, schemas, or join keys; if unknown, write TBD and list what query will answer it.
  • Use em dashes as a crutch for long clauses — prefer periods.
  • Rewrite the memo in a “friendlier” tone when the ask was factual discovery.

Appendix B — Pre-handoff QA checklist

Complete before sharing with a client stakeholder.

  • Warehouse platform, database, and schema are confirmed and written in metadata
  • Profiling date is recorded and consistent across §4
  • Every table in §4 has: metrics table, business objective, questions, and caveats
  • Row counts and date ranges were queried (not estimated from INFORMATION_SCHEMA alone)
  • Cross-cutting questions (§5) reference only tables profiled in §4
  • Non-canonical / TEMP tables are flagged in §6.3
  • Executive summary (§3) previews §4 faithfully — no claims that §4 doesn’t support
  • Open items (§8) includes anything unresolved or approximate
  • Validation queries (§9) would pass if run today
  • Filename follows convention: {client}-discovery-memo-{domain}.md
  • Prior schema audit or related artifacts reconciled — no contradictions shipped
  • Changelog (§11) populated if this memo has been updated since initial profiling
  • Appendix C (Schema Audit) column catalog covers every table profiled in §4

Appendix C — Schema Audit

Column-level reference for each profiled table. This is a technical appendix — no narrative, no business interpretation. Pure schema documentation for the data engineer building models on top of this share. Replace with live SHOW TABLES / DESCRIBE TABLE output from the warehouse when possible.
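
Where the warehouse exposes INFORMATION_SCHEMA, the column, type, and nullability skeleton can be generated rather than hand-typed. A minimal sketch, assuming Snowflake syntax; cardinality and FK references still require per-column profiling, since most ELT shares declare no constraints:

```sql
-- Generates the Column / Data type / Nullable columns of the catalog below.
-- Cardinality and FK / PK references must be filled in separately.
SELECT
    table_name,
    column_name,
    data_type,
    is_nullable
FROM [database].INFORMATION_SCHEMA.COLUMNS
WHERE table_schema = '[SCHEMA_NAME]'
ORDER BY table_name, ordinal_position;
```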

C.1 [Table name]

| Column | Data type | Nullable | Cardinality | FK / PK reference | Notes |
| --- | --- | --- | --- | --- | --- |
| [column] | [type] | [Y/N] | [count] | [references table.column] | [e.g., deprecated, use alt column] |
| [column] | [type] | [Y/N] | [count] | [references table.column] | [...] |

C.2 [Table name]

(Repeat for each table profiled in §4. Keep the same table ordering used in §4.)