SOW Update Summary - December 12, 2024

Based on: Technical Deep-Dive Meeting (December 10, 2024) with Greg, Xiaojie, Uttam, and Awaish

Documents Updated:

  • /sow/SOW-Breezy-BrainforgeAI.md
  • ../transcripts/2024-12-10_technical_deep_dive_notes.md (created)
  • ../transcripts/brainforge_breezy_technical_discussion_12_10_25.md (processed)

Executive Summary of Changes

The December 10 technical deep-dive with Greg and Xiaojie revealed significantly more complexity in Phase 2 (MLS Data Infrastructure) than originally scoped, along with clearer constraints on timing (Android launch) and new opportunities in Phase 3 (Underbuilt automation).

Key Insights That Changed SOW:

  1. MLS Data Scale: 160 million records daily, ~300 columns (not previously specified in detail)
  2. Performance Requirements: Sub-1-second query latency requirement for production comp generation API
  3. Architecture Pivot: Data lake-first approach (S3 → PySpark → Postgres) may be more appropriate than traditional warehouse (Snowflake) for production use case
  4. Timeline Constraint: Phase 1 should start after Android launch (early January 2025), not December 2024
  5. Underbuilt Opportunity: Current manual extraction process moving toward LLM-based automation; POC exploration added to scope

Section-by-Section Changes

Document Header

Added:

  • Last Updated date (December 12, 2024)
  • Summary of December 10 technical deep-dive
  • Note about architecture approach refinement

Why: Transparency on SOW evolution; helps reviewers understand what’s new


Phase 1: Analytics Foundation (Weeks 1-4)

Changes:

  • Timeline: Explicitly noted Phase 1 starts “early January 2025” after Android launch completes
  • Assumptions: Added confirmation that MixPanel/Statsig already streaming events (per December 10)
  • Assumptions: Added coordination note for Customer.io integration

Impact: Minimal scope change, mostly clarifications and timeline alignment

Quote from meeting:

“We’re trying to launch Android, which is big enough work… by end of this year” - Xiaojie


Phase 2: MLS Data Infrastructure (Weeks 3-8) - MAJOR UPDATES

Old Title: “MLS Data Infrastructure & Accuracy Benchmarking”

New Title: “MLS Data Infrastructure & Comp Generation Pipeline”

Changes to Context Section (NEW):

Added detailed technical context missing from original SOW:

  • CoreLogic/Hotality provides 160 million records daily with ~300 columns
  • Current provider is ~85% accurate with specific data quality issues (sale dates, days on market)
  • Sub-1-second query performance requirement for comp generation API
  • Moving to “gold standard” bulk data to solve data quality problems

Quote from meeting:

“We need super accurate data… they may be 85% there, but we need it to be as high as possible.” - Greg

Changes to Methods & Platforms (MAJOR):

Old approach: Assumed Snowflake warehouse with dbt transformations

New approach: Three architecture options with recommendation:

Option A: Data Lake-First (Recommended)

CoreLogic → S3 → PySpark → Postgres
  • Fastest path to sub-1-second queries
  • Lowest cost (avoids Snowflake ~$500-1500/month)
  • Can add Snowflake later if analytics needed

Option B: Warehouse-Based

CoreLogic → S3 → Snowflake → dbt → Postgres
  • Use if analytics on MLS data confirmed
  • Higher cost and latency

Option C: Hybrid

CoreLogic → S3 → PySpark → Both Postgres + Snowflake
  • Best of both worlds
  • Start production, add analytics when needed

Rationale: December 10 discussion confirmed “no current product requirements for analytics on MLS data” - Greg stated “our use case is very narrow and focused… making amazing comps that are super accurate, super fast.”

Architecture Decision: Option A recommended unless new requirements emerge during Week 1 prototype.
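The Option A flow above can be sketched as a three-stage job. This is a minimal in-memory illustration only: all names (helper functions, table name, required fields) are placeholders, and a production run would use PySpark jobs against S3 and a JDBC/COPY load into Postgres rather than these stand-ins.

```python
# Sketch of the Option A flow (CoreLogic -> S3 -> transform -> Postgres).
# Every name here is an illustrative placeholder, not Breezy's actual schema.

def extract_from_s3(records):
    """Stand-in for reading the daily CoreLogic drop out of S3."""
    return list(records)

def transform(records):
    """Stand-in for the PySpark cleaning step: keep only rows that
    carry the fields comp generation needs."""
    required = ("property_id", "sale_date", "list_date", "price")
    return [r for r in records if all(r.get(k) is not None for k in required)]

def load_to_postgres(records, table="comps"):
    """Stand-in for the JDBC/COPY load into the serving Postgres instance."""
    return {"table": table, "rows_loaded": len(records)}

def run_daily_pipeline(raw_records):
    return load_to_postgres(transform(extract_from_s3(raw_records)))
```

The value of writing it this way, even as a sketch, is that each stage can be swapped (e.g., the load target becomes Snowflake + Postgres under Option C) without touching the others.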

Changes to Deliverables:

Added:

  • Specific mention of geo-spatial indexing (PostGIS)
  • Parallel processing for 160M records
  • High availability configuration
  • Architecture Decision Record (ADR) documenting chosen approach

Enhanced:

  • Data quality monitoring now includes specific fields (sale dates, listing dates, days on market) based on known current provider gaps

Removed:

  • Accuracy benchmarking vs. Zillow/Redfin (deprioritized, not blocking comp generation)
  • Reference to “Bugatti” data processor (using CoreLogic directly)

Changes to Acceptance Criteria:

Added:

  • <1 second latency for 95th percentile comp queries (new performance requirement)
  • Data quality threshold: 95%+ of properties have sale date, listing date, price
  • Zero downtime during daily data refreshes
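The first two criteria above are mechanically checkable. A minimal sketch of both checks, using the field names from the known provider gaps (sale date, listing date, price) and the nearest-rank method for the 95th percentile; thresholds come directly from the criteria:

```python
import math

REQUIRED_FIELDS = ("sale_date", "list_date", "price")

def completeness(records, fields=REQUIRED_FIELDS):
    """Fraction of records carrying every required field."""
    if not records:
        return 0.0
    ok = sum(1 for r in records if all(r.get(f) is not None for f in fields))
    return ok / len(records)

def p95(latencies_s):
    """95th-percentile latency via the nearest-rank method."""
    ranked = sorted(latencies_s)
    idx = math.ceil(0.95 * len(ranked)) - 1
    return ranked[idx]

def meets_criteria(records, latencies_s):
    return completeness(records) >= 0.95 and p95(latencies_s) < 1.0
```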

Changes to Assumptions:

Added:

  • CoreLogic delivery flexibility confirmed (SFTP, S3, Snowflake, Databricks)
  • Delta vs. full refresh strategy TBD with vendor (Greg needs to confirm)
  • Current comp generation logic to be documented by Breezy
  • May require specialized database (Elasticsearch) or caching layer for query performance

Quote from meeting:

“I need to find out whether we can get deltas, or if it has to be full files every time” - Greg
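If CoreLogic can deliver deltas, the daily refresh reduces to an upsert keyed on a stable property ID; in Postgres this would be `INSERT ... ON CONFLICT (property_id) DO UPDATE`. A minimal in-memory sketch of that merge (the key name is an assumption):

```python
def apply_delta(current: dict, delta_records: list) -> dict:
    """Upsert delta rows into the current table keyed on property_id.
    Mirrors Postgres INSERT ... ON CONFLICT (property_id) DO UPDATE."""
    merged = dict(current)  # leave the input table untouched
    for rec in delta_records:
        merged[rec["property_id"]] = rec
    return merged
```

If only full files are available, the fallback is a full reload into a staging table followed by an atomic swap, which is what the zero-downtime acceptance criterion implies.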


Phase 3: Underbuilt Expansion Intelligence (Weeks 5-8)

Old Title: “Underbuilt Expansion Intelligence”

New Title: “Underbuilt Expansion Intelligence & Automation Exploration”

Changes to Context Section (NEW):

Added context on current state:

  • “Very manual” process currently
  • Extracting from PDFs, Word docs, county websites, some APIs
  • ~500 cities covered
  • Moving toward “large language model-based” approach
  • China team focused on “pipeline and AI side”

Quote from meeting:

“As far as I know, we’re the only ones doing this” - Greg (on competitive moat)

Changes to Deliverables:

Split into two parts:

Part A: Expansion Intelligence (Original scope)

  • Demand heatmap
  • Expansion prioritization (next 100 cities)
  • ROI model

Part B: Automation Exploration (NEW)

  • Extraction tool evaluation report (3-5 modern platforms)
  • POC results processing 3-5 sample counties
  • Architecture recommendations for automated workflow
  • Quick win opportunities (e.g., “condense 50-page building code to 3-page TLDR”)

Why added: Meeting revealed current manual bottleneck and opportunity to leverage modern LLM/OCR tools. Xiaojie noted “5 years ago this would have been unsolvable” - timing is right to explore automation.

Scope note: Part B is proof-of-concept only, not production implementation (separate phase)

Changes to Acceptance Criteria:

Added:

  • Extraction POC demonstrates >80% accuracy on structured data
  • Breezy has clear path forward for Underbuilt scaling
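The >80% POC threshold needs an agreed scoring method. One simple, defensible choice is field-level exact-match accuracy against hand-labeled ground truth, averaged across sample county documents. The field names below (height, setback, etc.) are illustrative, not the actual Underbuilt schema:

```python
def extraction_accuracy(extracted: dict, ground_truth: dict) -> float:
    """Share of ground-truth fields the POC extracted with an
    exactly matching value."""
    if not ground_truth:
        return 0.0
    hits = sum(1 for k, v in ground_truth.items() if extracted.get(k) == v)
    return hits / len(ground_truth)

def poc_passes(samples, threshold=0.80):
    """samples: list of (extracted, ground_truth) pairs, one per county doc."""
    scores = [extraction_accuracy(e, g) for e, g in samples]
    return sum(scores) / len(scores) >= threshold
```

Agreeing on this metric up front (question 2 under “Underbuilt” below: how accuracy is validated today) avoids disputes over whether the POC passed.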

Risk Section Updates

Added New Risks:

Risk 2: Sub-1-second query performance on 160M records

  • May require specialized database (Elasticsearch) or caching layer
  • Mitigation: Prototype in Week 1 with sample data, proper indexing strategy
  • Impact: Could add 1-2 weeks if Postgres insufficient
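One of the mitigations named above is a caching layer in front of Postgres. A minimal sketch of the idea using `functools.lru_cache`: the lookup function is a hypothetical stand-in for the real comp query, and rounding coordinates to ~3 decimal places (roughly 110 m cells) lets nearby requests share one cache entry:

```python
from functools import lru_cache

CALLS = {"db": 0}  # instrumentation, just to show cache hits in this sketch

@lru_cache(maxsize=10_000)
def cached_comps(lat_key: float, lon_key: float):
    """Cache comps per rounded-coordinate cell; the real implementation
    would query Postgres here."""
    CALLS["db"] += 1
    return f"comps near ({lat_key}, {lon_key})"  # placeholder payload

def get_comps(lat: float, lon: float, precision: int = 3):
    # Nearby queries round to the same cell and reuse one cached entry.
    return cached_comps(round(lat, precision), round(lon, precision))
```

Whether this is sufficient, or Elasticsearch is needed, is exactly what the Week 1 prototype should answer.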

Risk 3: CoreLogic data delivery unknowns

  • Schema, deltas vs. full dumps, timing not yet confirmed
  • Mitigation: Request sample data before Phase 2 kickoff, flexible S3-based architecture
  • Escalation path if data quality fails

Risk 4: Team bandwidth during Android launch (NEW)

  • Key stakeholders have limited availability Q4 2024
  • Mitigation: Start Phase 1 after Android launch (January 2025), async-first communication
  • Timeline: Phase 1 pushed to January 2025

Reordered: Existing risks renumbered to accommodate new ones


Success Metrics Updates

Changed:

  • Week targets now explicitly restart from “Phase 1 kickoff in early January 2025 (post-Android launch)”
  • Added Phase 2-specific metrics:
    • Week 1: Architecture decision finalized
    • Week 3: First 3 test markets processed successfully
    • Week 6: <1 second query latency achieved for 95th percentile
    • Week 6: Data quality monitoring and Slack alerts functional

Mutual Commitments Updates

Reorganized by phase for clarity:

Phase 1 additions:

  • Specific MixPanel/Statsig/Customer.io access requirements
  • Event schema coordination needs

Phase 2 additions (SIGNIFICANT):

  • CoreLogic credentials and documentation Week 1
  • Specific data delivery questions Greg needs to answer:
    • Full dumps vs. deltas?
    • Daily delivery timing and SLAs?
    • Data format (CSV, Parquet, JSON)?
  • Comp generation algorithm documentation needed
  • 10-20 test properties for validation
  • Greg allocated 5+ hrs/week during Phase 2 (more than original 3-5 hrs)

Phase 3 additions:

  • Sample building code documents (3-5 counties)
  • Manual process documentation for current workflow

All Phases:

  • Added async communication expectations (Slack, Loom)
  • Added decision velocity requirement (<48 hour turnaround on blocking questions)

What Didn’t Change

These sections remain largely unchanged:

  • Phase 1 core deliverables (retention dashboards, event taxonomy, DAU/MAU tracking)
  • Executive Value Thesis (still relevant, not updated)
  • Commercial Options (pricing structure unchanged)
  • References (same clients)
  • Overall engagement duration (8 weeks still appropriate)

Open Questions for Breezy Team

Based on December 10 discussion, these questions need answers to finalize scope:

CoreLogic/MLS Data (Greg)

  1. Have you received schema documentation from CoreLogic?
  2. What is the daily delta size vs. full dataset size?
  3. What time of day does data become available?
  4. What are their SLAs for data delivery?
  5. Do they provide data lineage or quality metrics?

Architecture & Performance

  1. What is the acceptable data freshness SLA? (Is a daily refresh okay, or is something more frequent needed?)
  2. Do you have an existing Postgres instance we should use, or should we provision a new one?
  3. What is the expected query volume for the comp generation API? (QPS estimates)

Timeline

  1. What is the exact Android launch date?
  2. When do you want Phase 1 to start? (Early January: first week? Second week?)
  3. Is there a hard deadline for the MLS pipeline, or is it flexible based on quality?

Underbuilt

  1. Can we get sample PDFs from 3-5 counties for extraction POC?
  2. How do you currently validate extraction accuracy?
  3. What is the average time to add one city under the current manual process?

Next Steps

  1. Breezy internal discussion: Review updated SOW, align on priorities and timeline
  2. Greg to gather CoreLogic details: Schema docs, sample data, delivery mechanism answers
  3. Confirm Phase 1 start date: After Android launch, specific week in January
  4. Share Underbuilt samples: PDFs/docs for extraction POC (if Phase 3 confirmed)
  5. Brainforge to share extraction tool recommendations: Email/Slack with tool evaluation (Uttam)
  6. Schedule kickoff call: Once SOW approved and start date confirmed

Impact on Commercial Terms

No change to pricing structure; the scope expansion in Phases 2 and 3 is balanced by:

  • More focused approach (production-first, skip analytics if not needed)
  • Clarified assumptions reduce rework risk
  • Proof-of-concept vs. production implementation boundary clearer

Timeline impact:

  • Phase 1 start pushed to January 2025 (was December 2024)
  • Overall 8-week duration still achievable
  • Delivery before the January 20 waitlist launch is still feasible if work starts in the first week of January

Meeting Artifacts

  • Full Transcript: ../transcripts/brainforge_breezy_technical_discussion_12_10_25.md
  • Detailed Meeting Notes: ../transcripts/2024-12-10_technical_deep_dive_notes.md
  • Updated SOW: /sow/SOW-Breezy-BrainforgeAI.md
  • This Summary: /SOW_UPDATE_SUMMARY_2024-12-12.md

Key Quotes from Meeting

“We need super accurate data… they may be 85% there, but we need it to be as high as possible.” - Greg

“Our use case is very narrow and focused… making amazing comps that are super accurate, super fast, that builds trust with agents.” - Greg

“We’re a startup, we’re scrappy, there are tons of other random, to be very honest, random shit there.” - Xiaojie

“As far as I know, we’re the only ones doing this [Underbuilt].” - Greg

“5 years ago this would have been unsolvable.” - Xiaojie (on Underbuilt extraction problem)

“I’m always pushing GMC [Jimsy], saying data solution, we should take it as early as possible to gather user signals.” - Xiaojie


Approval Process

To proceed:

  1. Breezy team reviews updated SOW
  2. Confirm architecture approach (Option A/B/C preference)
  3. Confirm Phase 1 start date (early January 2025)
  4. Sign or acknowledge updated SOW
  5. Kick off with access provisioning

Questions? Reach out via Slack or email (uttam@brainforge.ai)