SOW Update Summary - December 12, 2024

Based on: Technical Deep-Dive Meeting (December 10, 2024) with Greg, Xiaojie, Uttam, and Awaish

Documents Updated:

  • /sow/SOW-Breezy-BrainforgeAI.md
  • ../transcripts/2024-12-10_technical_deep_dive_notes.md (created)
  • ../transcripts/brainforge_breezy_technical_discussion_12_10_25.md (processed)

Executive Summary of Changes

The December 10 technical deep-dive with Greg and Xiaojie revealed significantly more complexity in Phase 2 (MLS Data Infrastructure) than originally scoped, along with clearer constraints on timing (Android launch) and new opportunities in Phase 3 (Underbuilt automation).

Key Insights That Changed SOW:

  1. MLS Data Scale: 160 million records daily, ~300 columns (not previously specified in detail)
  2. Performance Requirements: Sub-1-second query latency requirement for production comp generation API
  3. Architecture Pivot: Data lake-first approach (S3 → PySpark → Postgres) may be more appropriate than traditional warehouse (Snowflake) for production use case
  4. Timeline Constraint: Phase 1 should start after Android launch (early January 2025), not December 2024
  5. Underbuilt Opportunity: Current manual extraction process moving toward LLM-based automation; POC exploration added to scope

Section-by-Section Changes

Document Header

Added:

  • Last Updated date (December 12, 2024)
  • Summary of December 10 technical deep-dive
  • Note about architecture approach refinement

Why: Transparency on SOW evolution; helps reviewers understand what’s new


Phase 1: Analytics Foundation (Weeks 1-4)

Changes:

  • Timeline: Explicitly noted Phase 1 starts “early January 2025” after Android launch completes
  • Assumptions: Added confirmation that MixPanel/Statsig already streaming events (per December 10)
  • Assumptions: Added coordination note for Customer.io integration

Impact: Minimal scope change, mostly clarifications and timeline alignment

Quote from meeting:

“We’re trying to launch Android, which is big enough work… by end of this year” - Xiaojie


Phase 2: MLS Data Infrastructure (Weeks 3-8) - MAJOR UPDATES

Old Title: “MLS Data Infrastructure & Accuracy Benchmarking”

New Title: “MLS Data Infrastructure & Comp Generation Pipeline”

Changes to Context Section (NEW):

Added detailed technical context missing from original SOW:

  • CoreLogic/Hotality provides 160 million records daily with ~300 columns
  • Current provider is ~85% accurate with specific data quality issues (sale dates, days on market)
  • Sub-1-second query performance requirement for comp generation API
  • Moving to “gold standard” bulk data to solve data quality problems

Quote from meeting:

“We need super accurate data… they may be 85% there, but we need it to be as high as possible.” - Greg

Changes to Methods & Platforms (MAJOR):

Old approach: Assumed Snowflake warehouse with dbt transformations

New approach: Three architecture options with recommendation:

Option A: Data Lake-First (Recommended)

CoreLogic → S3 → PySpark → Postgres
  • Fastest path to sub-1-second queries
  • Lowest cost (avoids Snowflake ~$500-1500/month)
  • Can add Snowflake later if analytics needed

Option B: Warehouse-Based

CoreLogic → S3 → Snowflake → dbt → Postgres
  • Use if analytics on MLS data confirmed
  • Higher cost and latency

Option C: Hybrid

CoreLogic → S3 → PySpark → Both Postgres + Snowflake
  • Best of both worlds
  • Start production, add analytics when needed

Rationale: December 10 discussion confirmed “no current product requirements for analytics on MLS data” - Greg stated “our use case is very narrow and focused… making amazing comps that are super accurate, super fast.”

Architecture Decision: Option A recommended unless new requirements emerge during Week 1 prototype.
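The Option A flow above can be sketched as a three-stage job. This is a minimal in-memory illustration only: all names (helper functions, table name, required fields) are placeholders, and a production run would use PySpark jobs against S3 and a JDBC/COPY load into Postgres rather than these stand-ins.

```python
# Sketch of the Option A flow (CoreLogic -> S3 -> transform -> Postgres).
# Every name here is an illustrative placeholder, not Breezy's actual schema.

def extract_from_s3(records):
    """Stand-in for reading the daily CoreLogic drop out of S3."""
    return list(records)

def transform(records):
    """Stand-in for the PySpark cleaning step: keep only rows that
    carry the fields comp generation needs."""
    required = ("property_id", "sale_date", "list_date", "price")
    return [r for r in records if all(r.get(k) is not None for k in required)]

def load_to_postgres(records, table="comps"):
    """Stand-in for the JDBC/COPY load into the serving Postgres instance."""
    return {"table": table, "rows_loaded": len(records)}

def run_daily_pipeline(raw_records):
    return load_to_postgres(transform(extract_from_s3(raw_records)))
```

The value of writing it this way, even as a sketch, is that each stage can be swapped (e.g., the load target becomes Snowflake + Postgres under Option C) without touching the others.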

Changes to Deliverables:

Added:

  • Specific mention of geo-spatial indexing (PostGIS)
  • Parallel processing for 160M records
  • High availability configuration
  • Architecture Decision Record (ADR) documenting chosen approach

Enhanced:

  • Data quality monitoring now includes specific fields (sale dates, listing dates, days on market) based on known current provider gaps

Removed:

  • Accuracy benchmarking vs. Zillow/Redfin (deprioritized, not blocking comp generation)
  • Reference to “Bugatti” data processor (using CoreLogic directly)

Changes to Acceptance Criteria:

Added:

  • <1 second latency for 95th percentile comp queries (new performance requirement)
  • Data quality threshold: 95%+ of properties have sale date, listing date, price
  • Zero downtime during daily data refreshes
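The first two criteria above are mechanically checkable. A minimal sketch of both checks, using the field names from the known provider gaps (sale date, listing date, price) and the nearest-rank method for the 95th percentile; thresholds come directly from the criteria:

```python
import math

REQUIRED_FIELDS = ("sale_date", "list_date", "price")

def completeness(records, fields=REQUIRED_FIELDS):
    """Fraction of records carrying every required field."""
    if not records:
        return 0.0
    ok = sum(1 for r in records if all(r.get(f) is not None for f in fields))
    return ok / len(records)

def p95(latencies_s):
    """95th-percentile latency via the nearest-rank method."""
    ranked = sorted(latencies_s)
    idx = math.ceil(0.95 * len(ranked)) - 1
    return ranked[idx]

def meets_criteria(records, latencies_s):
    return completeness(records) >= 0.95 and p95(latencies_s) < 1.0
```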

Changes to Assumptions:

Added:

  • CoreLogic delivery flexibility confirmed (SFTP, S3, Snowflake, Databricks)
  • Delta vs. full refresh strategy TBD with vendor (Greg needs to confirm)
  • Current comp generation logic to be documented by Breezy
  • May require specialized database (Elasticsearch) or caching layer for query performance

Quote from meeting:

“I need to find out whether we can get deltas, or if it has to be full files every time” - Greg
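If CoreLogic can deliver deltas, the daily refresh reduces to an upsert keyed on a stable property ID; in Postgres this would be `INSERT ... ON CONFLICT (property_id) DO UPDATE`. A minimal in-memory sketch of that merge (the key name is an assumption):

```python
def apply_delta(current: dict, delta_records: list) -> dict:
    """Upsert delta rows into the current table keyed on property_id.
    Mirrors Postgres INSERT ... ON CONFLICT (property_id) DO UPDATE."""
    merged = dict(current)  # leave the input table untouched
    for rec in delta_records:
        merged[rec["property_id"]] = rec
    return merged
```

If only full files are available, the fallback is a full reload into a staging table followed by an atomic swap, which is what the zero-downtime acceptance criterion implies.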


Phase 3: Underbuilt Expansion Intelligence (Weeks 5-8)

Old Title: “Underbuilt Expansion Intelligence”

New Title: “Underbuilt Expansion Intelligence & Automation Exploration”

Changes to Context Section (NEW):

Added context on current state:

  • “Very manual” process currently
  • Extracting from PDFs, Word docs, county websites, some APIs
  • ~500 cities covered
  • Moving toward “large language model-based” approach
  • China team focused on “pipeline and AI side”

Quote from meeting:

“As far as I know, we’re the only ones doing this” - Greg (on competitive moat)

Changes to Deliverables:

Split into two parts:

Part A: Expansion Intelligence (Original scope)

  • Demand heatmap
  • Expansion prioritization (next 100 cities)
  • ROI model

Part B: Automation Exploration (NEW)

  • Extraction tool evaluation report (3-5 modern platforms)
  • POC results processing 3-5 sample counties
  • Architecture recommendations for automated workflow
  • Quick win opportunities (e.g., “condense 50-page building code to 3-page TLDR”)

Why added: Meeting revealed current manual bottleneck and opportunity to leverage modern LLM/OCR tools. Xiaojie noted “5 years ago this would have been unsolvable” - timing is right to explore automation.

Scope note: Part B is proof-of-concept only, not production implementation (separate phase)

Changes to Acceptance Criteria:

Added:

  • Extraction POC demonstrates >80% accuracy on structured data
  • Breezy has clear path forward for Underbuilt scaling
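The >80% POC threshold needs an agreed scoring method. One simple, defensible choice is field-level exact-match accuracy against hand-labeled ground truth, averaged across sample county documents. The field names below (height, setback, etc.) are illustrative, not the actual Underbuilt schema:

```python
def extraction_accuracy(extracted: dict, ground_truth: dict) -> float:
    """Share of ground-truth fields the POC extracted with an
    exactly matching value."""
    if not ground_truth:
        return 0.0
    hits = sum(1 for k, v in ground_truth.items() if extracted.get(k) == v)
    return hits / len(ground_truth)

def poc_passes(samples, threshold=0.80):
    """samples: list of (extracted, ground_truth) pairs, one per county doc."""
    scores = [extraction_accuracy(e, g) for e, g in samples]
    return sum(scores) / len(scores) >= threshold
```

Agreeing on this metric up front (question 2 under “Underbuilt” below: how accuracy is validated today) avoids disputes over whether the POC passed.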

Risk Section Updates

Added New Risks:

Risk 2: Sub-1-second query performance on 160M records

  • May require specialized database (Elasticsearch) or caching layer
  • Mitigation: Prototype in Week 1 with sample data, proper indexing strategy
  • Impact: Could add 1-2 weeks if Postgres insufficient
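One of the mitigations named above is a caching layer in front of Postgres. A minimal sketch of the idea using `functools.lru_cache`: the lookup function is a hypothetical stand-in for the real comp query, and rounding coordinates to ~3 decimal places (roughly 110 m cells) lets nearby requests share one cache entry:

```python
from functools import lru_cache

CALLS = {"db": 0}  # instrumentation, just to show cache hits in this sketch

@lru_cache(maxsize=10_000)
def cached_comps(lat_key: float, lon_key: float):
    """Cache comps per rounded-coordinate cell; the real implementation
    would query Postgres here."""
    CALLS["db"] += 1
    return f"comps near ({lat_key}, {lon_key})"  # placeholder payload

def get_comps(lat: float, lon: float, precision: int = 3):
    # Nearby queries round to the same cell and reuse one cached entry.
    return cached_comps(round(lat, precision), round(lon, precision))
```

Whether this is sufficient, or Elasticsearch is needed, is exactly what the Week 1 prototype should answer.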

Risk 3: CoreLogic data delivery unknowns

  • Schema, deltas vs. full dumps, timing not yet confirmed
  • Mitigation: Request sample data before Phase 2 kickoff, flexible S3-based architecture
  • Escalation path if data quality fails

Risk 4: Team bandwidth during Android launch (NEW)

  • Key stakeholders have limited availability Q4 2024
  • Mitigation: Start Phase 1 after Android launch (January 2025), async-first communication
  • Timeline: Phase 1 pushed to January 2025

Reordered: Existing risks renumbered to accommodate new ones


Success Metrics Updates

Changed:

  • Week targets now explicitly restart from “Phase 1 kickoff in early January 2025 (post-Android launch)”
  • Added Phase 2-specific metrics:
    • Week 1: Architecture decision finalized
    • Week 3: First 3 test markets processed successfully
    • Week 6: <1 second query latency achieved for 95th percentile
    • Week 6: Data quality monitoring and Slack alerts functional

Mutual Commitments Updates

Reorganized by phase for clarity:

Phase 1 additions:

  • Specific MixPanel/Statsig/Customer.io access requirements
  • Event schema coordination needs

Phase 2 additions (SIGNIFICANT):

  • CoreLogic credentials and documentation Week 1
  • Specific data delivery questions Greg needs to answer:
    • Full dumps vs. deltas?
    • Daily delivery timing and SLAs?
    • Data format (CSV, Parquet, JSON)?
  • Comp generation algorithm documentation needed
  • 10-20 test properties for validation
  • Greg allocated 5+ hrs/week during Phase 2 (more than original 3-5 hrs)

Phase 3 additions:

  • Sample building code documents (3-5 counties)
  • Manual process documentation for current workflow

All Phases:

  • Added async communication expectations (Slack, Loom)
  • Added decision velocity requirement (<48 hour turnaround on blocking questions)

What Didn’t Change

These sections remain largely unchanged:

  • Phase 1 core deliverables (retention dashboards, event taxonomy, DAU/MAU tracking)
  • Executive Value Thesis (still relevant, not updated)
  • Commercial Options (pricing structure unchanged)
  • References (same clients)
  • Overall engagement duration (8 weeks still appropriate)

Open Questions for Breezy Team

Based on December 10 discussion, these questions need answers to finalize scope:

CoreLogic/MLS Data (Greg)

  1. Have you received schema documentation from CoreLogic?
  2. What is the daily delta size vs. full dataset size?
  3. What time of day does data become available?
  4. What are their SLAs for data delivery?
  5. Do they provide data lineage or quality metrics?

Architecture & Performance

  1. What is the acceptable data freshness SLA? (Is a daily refresh okay, or is something more frequent needed?)
  2. Do you have an existing Postgres instance we should use, or should we provision a new one?
  3. What is the expected query volume for the comp generation API? (QPS estimates)

Timeline

  1. What is the exact Android launch date?
  2. When do you want Phase 1 to start? (Early January: first week? Second week?)
  3. Is there a hard deadline for the MLS pipeline, or is it flexible based on quality?

Underbuilt

  1. Can we get sample PDFs from 3-5 counties for extraction POC?
  2. How do you currently validate extraction accuracy?
  3. What is the average time to add one city under the current manual process?

Next Steps

  1. Breezy internal discussion: Review updated SOW, align on priorities and timeline
  2. Greg to gather CoreLogic details: Schema docs, sample data, delivery mechanism answers
  3. Confirm Phase 1 start date: After Android launch, specific week in January
  4. Share Underbuilt samples: PDFs/docs for extraction POC (if Phase 3 confirmed)
  5. Brainforge to share extraction tool recommendations: Email/Slack with tool evaluation (Uttam)
  6. Schedule kickoff call: Once SOW approved and start date confirmed

Impact on Commercial Terms

No change to pricing structure; the scope expansion in Phases 2 and 3 is balanced by:

  • More focused approach (production-first, skip analytics if not needed)
  • Clarified assumptions reduce rework risk
  • Proof-of-concept vs. production implementation boundary clearer

Timeline impact:

  • Phase 1 start pushed to January 2025 (was December 2024)
  • Overall 8-week duration still achievable
  • Delivery before the January 20 waitlist launch is still feasible if work starts in the first week of January

Meeting Artifacts

  • Full Transcript: ../transcripts/brainforge_breezy_technical_discussion_12_10_25.md
  • Detailed Meeting Notes: ../transcripts/2024-12-10_technical_deep_dive_notes.md
  • Updated SOW: /sow/SOW-Breezy-BrainforgeAI.md
  • This Summary: /SOW_UPDATE_SUMMARY_2024-12-12.md

Key Quotes from Meeting

“We need super accurate data… they may be 85% there, but we need it to be as high as possible.” - Greg

“Our use case is very narrow and focused… making amazing comps that are super accurate, super fast, that builds trust with agents.” - Greg

“We’re a startup, we’re scrappy, there are tons of other random, to be very honest, random shit there.” - Xiaojie

“As far as I know, we’re the only ones doing this [Underbuilt].” - Greg

“5 years ago this would have been unsolvable.” - Xiaojie (on Underbuilt extraction problem)

“I’m always pushing GMC [Jimsy], saying data solution, we should take it as early as possible to gather user signals.” - Xiaojie


Approval Process

To proceed:

  1. Breezy team reviews updated SOW
  2. Confirm architecture approach (Option A/B/C preference)
  3. Confirm Phase 1 start date (early January 2025)
  4. Sign or acknowledge updated SOW
  5. Kick off with access provisioning

Questions? Reach out via Slack or email (uttam@brainforge.ai)