Marts Documentation Creation Workflow

Purpose

Create comprehensive documentation for dbt marts by extracting metadata from Snowflake tables. This workflow generates structured CSV documentation that includes table definitions, source mappings, key metrics, dimensions, and operational details.

Prerequisites

  • Snowflake access with permissions to query INFORMATION_SCHEMA
  • 1Password CLI installed and signed in (see setup/1password-cli-setup.md)
  • Python 3.8+ with required packages:
    pip install snowflake-connector-python pyyaml
  • Access to dbt project (for understanding source tables and model structure)
  • Network access to Snowflake instance

Workflow Overview

  1. Query Snowflake for table metadata (columns, row counts, sizes)
  2. Map tables to dbt source definitions
  3. Extract key metrics and dimensions from table schemas
  4. Generate CSV documentation with structured format
  5. Review and validate output

Step-by-Step Procedure

1. Query Snowflake for Mart Tables

Query INFORMATION_SCHEMA to get table metadata:

SELECT 
    TABLE_SCHEMA,
    TABLE_NAME,
    ROW_COUNT,
    BYTES,
    LAST_ALTERED
FROM PROD_MARTS.INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'WHOLESALE_CUSTOMERS'
ORDER BY TABLE_NAME;

2. Get Column Details

For each table, query column information:

SELECT 
    COLUMN_NAME,
    DATA_TYPE,
    IS_NULLABLE,
    COLUMN_DEFAULT
FROM PROD_MARTS.INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'WHOLESALE_CUSTOMERS'
  AND TABLE_NAME = '<table_name>'
ORDER BY ORDINAL_POSITION;

3. Map to dbt Sources

Review dbt project to understand:

  • Which source tables feed into each mart
  • Intermediate models (int_*) used in transformations
  • Business logic and transformations applied

Check dbt_project/models/raw/sources.yml for source definitions.
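
If you want to script this lookup, the pyyaml package from the prerequisites can list the defined sources. A minimal sketch, assuming the dbt_project/models/raw/sources.yml path above (your file's keys may differ slightly):

import yaml

def list_dbt_sources(path: str = 'dbt_project/models/raw/sources.yml') -> dict:
    """Return {source name: [table names]} from a dbt sources.yml file."""
    with open(path) as f:
        spec = yaml.safe_load(f)
    return {
        src['name']: [t['name'] for t in src.get('tables', [])]
        for src in spec.get('sources', [])
    }

# Print every source table so marts can be mapped back to them
for source, tables in list_dbt_sources().items():
    print(source, '->', ', '.join(tables))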

4. Categorize Columns

For each table, identify:

  • Key Metrics: Numeric measures (revenue, counts, averages)
  • Key Dimensions: Descriptive attributes (customer_id, date, segment)
  • Grain: Level of detail (one row per customer, per order, etc.)
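
A quick way to confirm a candidate grain is to check whether the candidate key is unique across the table. A minimal sketch, reusing the Snowflake connection from this workflow; CUSTOMER_ID below is only an illustrative key:

def is_unique_key(conn, schema: str, table: str, column: str) -> bool:
    """True if the column uniquely identifies rows, i.e. it matches the table's grain."""
    cursor = conn.cursor()
    cursor.execute(
        f"SELECT COUNT(*) = COUNT(DISTINCT {column}) "
        f"FROM PROD_MARTS.{schema}.{table}"
    )
    return bool(cursor.fetchone()[0])

# e.g. is each row one wholesale customer?
# is_unique_key(conn, 'WHOLESALE_CUSTOMERS', 'WHOLESALE_DIM_CUSTOMERS', 'CUSTOMER_ID')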

5. Create Documentation CSV

Generate CSV with the following columns:

Column           | Description                       | Example
Marts            | Table name                        | WHOLESALE_DIM_CUSTOMERS
Definition       | Business purpose and description  | Final wholesale customer dimension table...
Source Tables    | dbt source/intermediate tables    | int_shopify__wholesale_dim_customer, int_gs__wholesaler_detail
Key Metrics      | Comma-separated metric names      | TOTAL_REVENUE, TOTAL_ORDERS, AVERAGE_ORDER_VALUE
Key Dimensions   | Comma-separated dimension names   | CUSTOMER_ID, EMAIL, SEGMENT, CUSTOMER_STATE
Grain            | Level of detail                   | One row per wholesale customer
Update Frequency | How often the table is refreshed  | Table (or Daily, Hourly)
Schema           | Full schema identifier            | PROD_MARTS.WHOLESALE_CUSTOMERS
Related Marts    | Other related marts               | (optional)
Business Purpose | High-level business use case      | Master dimension table for all wholesale customer analysis
Usage Notes      | Additional context                | Row count: 14,217, Size: 10.87 MB

6. Python Script Example

import csv
from typing import List, Tuple

import snowflake.connector

def get_table_metadata(conn, schema: str) -> List[Tuple]:
    """Query Snowflake for table metadata (name, row count, bytes, last altered)."""
    cursor = conn.cursor()
    
    query = f"""
    SELECT 
        TABLE_NAME,
        ROW_COUNT,
        BYTES,
        LAST_ALTERED
    FROM PROD_MARTS.INFORMATION_SCHEMA.TABLES
    WHERE TABLE_SCHEMA = '{schema}'
    ORDER BY TABLE_NAME
    """
    
    cursor.execute(query)
    return cursor.fetchall()
 
def get_table_columns(conn, schema: str, table: str) -> List[str]:
    """Get column names for a table."""
    cursor = conn.cursor()
    
    query = f"""
    SELECT COLUMN_NAME
    FROM PROD_MARTS.INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_SCHEMA = '{schema}'
      AND TABLE_NAME = '{table}'
    ORDER BY ORDINAL_POSITION
    """
    
    cursor.execute(query)
    return [row[0] for row in cursor.fetchall()]
 
def categorize_columns(columns: List[str]) -> Tuple[List[str], List[str]]:
    """Categorize columns into metrics and dimensions (simple name heuristic; review manually in step 7)."""
    # Metric-sounding names become metrics; everything else is a dimension,
    # except housekeeping columns (dbt timestamps, _POLYTOMIC_* sync fields).
    metrics = [c for c in columns if any(x in c.upper() for x in ['COUNT', 'TOTAL', 'SUM', 'AVG', 'REVENUE', 'SPEND'])]
    dimensions = [c for c in columns if c not in metrics
                  and c != 'DBT_UPDATED_AT' and not c.startswith('_POLYTOMIC_')]
    
    return metrics, dimensions
 
# Main workflow
conn = get_snowflake_connection(...)  # helper returning a snowflake.connector connection (see sketch after this script)
tables = get_table_metadata(conn, 'WHOLESALE_CUSTOMERS')
 
with open('marts_documentation.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=[
        'Marts', 'Definition', 'Source Tables', 'Key Metrics', 
        'Key Dimensions', 'Grain', 'Update Frequency', 'Schema',
        'Related Marts', 'Business Purpose', 'Usage Notes'
    ])
    writer.writeheader()
    
    for table_name, row_count, bytes_size, last_altered in tables:
        columns = get_table_columns(conn, 'WHOLESALE_CUSTOMERS', table_name)
        metrics, dimensions = categorize_columns(columns)
        
        writer.writerow({
            'Marts': table_name,
            'Definition': '[Define based on dbt code and business context]',
            'Source Tables': '[Map from dbt sources.yml]',
            'Key Metrics': ', '.join(metrics),
            'Key Dimensions': ', '.join(dimensions),
            'Grain': '[Determine from table structure]',
            'Update Frequency': 'Table',
            'Schema': 'PROD_MARTS.WHOLESALE_CUSTOMERS',
            'Related Marts': '',
            'Business Purpose': '[Business use case]',
            'Usage Notes': f'Row count: {row_count:,}, Size: {bytes_size / 1024 / 1024:.2f} MB'
        })
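
The script assumes a get_snowflake_connection helper. One way to build it, sketched here assuming credentials live in 1Password and are read with op read (the op://Data/Snowflake/... item paths and the warehouse name are placeholders; use whatever was configured in setup/1password-cli-setup.md):

import subprocess
import snowflake.connector

def op_read(reference: str) -> str:
    """Read a secret from the 1Password CLI, e.g. op_read('op://Data/Snowflake/password')."""
    result = subprocess.run(['op', 'read', reference], capture_output=True, text=True, check=True)
    return result.stdout.strip()

def get_snowflake_connection():
    return snowflake.connector.connect(
        account=op_read('op://Data/Snowflake/account'),    # placeholder item path
        user=op_read('op://Data/Snowflake/username'),      # placeholder item path
        password=op_read('op://Data/Snowflake/password'),  # placeholder item path
        warehouse='ANALYTICS_WH',                          # placeholder warehouse
        database='PROD_MARTS',
    )

After the script runs, fill in the bracketed placeholders (Definition, Source Tables, Grain, Business Purpose) by hand as part of step 7.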

7. Review and Enhance

  1. Review definitions - Add business context from dbt model code
  2. Verify source tables - Cross-reference with dbt sources.yml and model files
  3. Validate metrics/dimensions - Ensure categorization is accurate
  4. Add business purpose - Document how each mart is used
  5. Include usage notes - Add row counts, sizes, update frequencies

Output Format

The CSV should be structured for easy consumption:

  • Spreadsheet-friendly: Opens cleanly in Excel/Google Sheets
  • Searchable: Column names enable filtering and sorting
  • Complete: All relevant metadata included
  • Validated: Data accuracy verified against Snowflake

Best Practices

  1. Query actual Snowflake metadata - Don’t rely on assumptions
  2. Review dbt code - Understand transformations and business logic
  3. Categorize carefully - Metrics vs dimensions affects downstream usage
  4. Document grain clearly - Critical for understanding table usage
  5. Include operational details - Size, row counts help with performance planning
  6. Keep definitions concise - But include enough context for understanding

Common Patterns

Dimension Tables

  • Grain: One row per entity (customer, product, etc.)
  • Key Dimensions: ID, name, attributes
  • Key Metrics: Aggregated measures per entity

Fact Tables

  • Grain: One row per event (order, transaction, etc.)
  • Key Dimensions: Foreign keys, date, status
  • Key Metrics: Amounts, counts, quantities

Summary Tables

  • Grain: Aggregated level (daily, weekly, by segment)
  • Key Dimensions: Grouping attributes
  • Key Metrics: Aggregated measures

Troubleshooting

“No tables found”

  • Verify schema name is correct (case-sensitive)
  • Check role has access to schema
  • Query SHOW SCHEMAS IN DATABASE PROD_MARTS;

“Cannot determine grain”

  • Review table structure and sample data
  • Check dbt model code for comments
  • Query SELECT * FROM table LIMIT 10 to understand structure

“Source tables unclear”

  • Review dbt sources.yml for source definitions
  • Trace source() → ref() → mart dependencies
  • Check intermediate model files (int_*)

References

  • Snowflake Table Profiling: See snowflake-table-profiling.md
  • Metrics Dictionary Creation: See metrics-dictionary-creation.md
  • 1Password CLI Setup: See setup/1password-cli-setup.md