[Client] — Golden dataset spec — [Domain / scope]
About this document (Brainforge)
Internal conventions for how this file works in the repo. Strip this section (or export the document without it) before sharing with a client.
Titling and filename
Use [Client] — Golden dataset spec — [Domain or use case] for the document title. Examples: LMNT — Golden dataset spec — Omnichannel revenue · Acme — Golden dataset spec — Wholesale order-to-cash.
Filename: {client}-golden-dataset-{domain}.md under knowledge/clients/{client}/resources/.
When to use this template
Use this when building an evaluation dataset for AI-powered natural-language querying (e.g., Cortex Analyst, custom NL2SQL). The golden dataset defines a set of questions with verified correct answers that serve as the acceptance test for the semantic layer.
Do not use this template when:
- designing the semantic view itself (use the Semantic View Design Doc)
- profiling a new data source (use the Discovery Memo)
Document metadata
Status: [Draft / In review / Approved / Locked]
Warehouse: [platform] — Account/region: [details]
Semantic view: [view name or path to C2 doc]
Version: [1.0 / increment when questions are added or answers change]
Prepared by: Brainforge
Last updated: [YYYY-MM-DD]
Related artifacts
| Artifact | Link / path | Notes |
|---|---|---|
| Semantic View Design Doc | [path to C2 doc] | The semantic view this dataset tests |
| Discovery Memo | [path to A1 memo] | Source profiling reference |
| Data Platform Documentation | [Google Sheet link] | Source catalog, metric definitions |
1. Dataset purpose
[2–4 sentences. What questions should this golden dataset cover? What use cases or user personas does it represent? What makes a passing vs. failing result?]
2. Question catalog
Each row is one test case. Placeholder values are shown; replace with actual questions and answers.
| # | Natural language question | Expected SQL logic | Expected result type | Result value | Tolerance | Status |
|---|---|---|---|---|---|---|
| 1 | [e.g., "What was total revenue last month?"] | SUM(revenue) WHERE month = CURRENT_MONTH | [Scalar / Row / Table] | [$X] | [±% or exact] | [Active] |
| 2 | [e.g., "Which product had the most growth?"] | TOP 1 product ORDER BY growth DESC | [Row] | [Product name] | [exact] | [Active] |
| 3 | [e.g., "Show me revenue by state for last quarter"] | SELECT state, SUM(revenue) WHERE quarter = PREVIOUS | [Table] | [State: X, Revenue: Y; ...] | [±5%] | [Active] |
| 4 | [Edge case: "What was revenue last month?" when table is empty] | — | [Scalar] | [null or 0] | [exact] | [Active] |
| 5 | [Edge case: date range with no data] | — | [Scalar] | [0 or empty] | [exact] | [Active] |
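How a catalog row translates into a runnable check is left to the runner, but a minimal sketch helps pin down the tolerance semantics. The sketch below assumes a hypothetical `GoldenQuestion` dataclass and a percentage tolerance applied to scalar answers; the field names mirror the catalog columns above and are not a required schema:

```python
from dataclasses import dataclass

@dataclass
class GoldenQuestion:
    """One row of the question catalog (hypothetical schema)."""
    id: int
    question: str           # natural-language question as the user would phrase it
    expected_sql: str       # directional SQL logic, not necessarily executable
    result_type: str        # "scalar", "row", or "table"
    expected_value: float   # verified answer pulled from the warehouse
    tolerance_pct: float    # 0.0 means exact match
    status: str = "active"

def scalar_passes(q: GoldenQuestion, actual: float) -> bool:
    """Check a scalar NLQ answer against the verified value, within tolerance."""
    if q.tolerance_pct == 0.0:
        return actual == q.expected_value
    allowed = abs(q.expected_value) * q.tolerance_pct / 100.0
    return abs(actual - q.expected_value) <= allowed
```

For row and table results, the comparison would key on the identifying columns (e.g., product name or state) before applying the tolerance to the numeric fields; null or empty edge cases need an exact-match branch rather than the percentage check.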
3. Source tables
| Table | Role | Verified answers query source |
|---|---|---|
| [database.schema.table] | [fact / dimension / reference] | [SQL used to generate verified answers] |
| [database.schema.table] | [fact / dimension / reference] | [SQL used to generate verified answers] |
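The verified answer for each catalog row should come from a query run directly against these tables, never from a hand calculation. A minimal sketch of pulling one scalar answer, assuming a Snowflake warehouse and the snowflake-connector-python package (adapt to the actual platform); credentials, table names, and the query text are illustrative placeholders:

```python
import snowflake.connector

# Illustrative connection; in practice load credentials from the environment or a secrets manager.
conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="<warehouse>", database="<database>", schema="<schema>",
)

# Example verification query for question 1 ("total revenue last month").
verified_sql = """
SELECT SUM(revenue)
FROM <database>.<schema>.<fact_table>
WHERE DATE_TRUNC('month', order_date) = DATE_TRUNC('month', DATEADD('month', -1, CURRENT_DATE))
"""

cur = conn.cursor()
cur.execute(verified_sql)
verified_value = cur.fetchone()[0]   # this number populates the "Result value" column
cur.close()
conn.close()
```

Keeping the exact query text in the "Verified answers query source" column makes the dataset reproducible when the underlying data changes.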
4. Edge cases
- [Edge case]: [What the edge case is. How the golden dataset handles it. What the expected behavior of the NLQ system should be.]
- [Edge case]: [...]
5. Known limitations
- [Limitation]: [What the golden dataset does not cover. Why. What follow-up work would address it.]
6. Runner manifest
| Attribute | Value |
|---|---|
| Question count | [N] active questions |
| Last run | [YYYY-MM-DD] |
| Pass rate | [N / N] (XX%) |
| Runner command | [e.g., python scripts/golden_audit.py --dataset {path}] |
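The runner command is a placeholder; what matters for the manifest is that the script replays every active question against the NLQ engine and records the resulting pass rate. A minimal sketch of that loop, reusing the hypothetical `GoldenQuestion` and `scalar_passes` helpers sketched in section 2 and an assumed `ask_nlq` callable that wraps the engine under test:

```python
def run_golden_dataset(questions, ask_nlq):
    """Replay every active question and report the pass rate for the manifest."""
    active = [q for q in questions if q.status == "active"]
    passed = 0
    for q in active:
        actual = ask_nlq(q.question)   # e.g., a call to the Cortex Analyst / NL2SQL endpoint
        if scalar_passes(q, actual):
            passed += 1
        else:
            print(f"FAIL #{q.id}: {q.question!r} -> {actual} (expected {q.expected_value})")
    rate = 100 * passed / len(active) if active else 0.0
    print(f"Pass rate: {passed} / {len(active)} ({rate:.0f}%)")
    return passed, len(active)
```

Logging each run's date and pass rate alongside the dataset version keeps the manifest table above auditable over time.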
Appendix — Pre-handoff QA checklist
- Every question has a verified answer from the warehouse (not hand-crafted)
- Questions cover: simple aggregations, filters, group-bys, time ranges, comparisons
- Edge cases are included (empty results, null handling, ambiguous phrasing)
- Tolerance is defined per question and justified
- Expected SQL logic is documented (directional, not executable — the NLQ engine may generate different SQL for the same answer)
- Runner manifest tracks pass rate over time
- Questions are written in the vocabulary the client actually uses