SOW Update Summary - December 12, 2025
Based on: Technical Deep-Dive Meeting (December 10, 2025) with Greg, Xiaojie, Uttam, and Awaish
Documents Updated:
- /sow/SOW-Breezy-BrainforgeAI.md
- ../transcripts/2024-12-10_technical_deep_dive_notes.md (created)
- ../transcripts/brainforge_breezy_technical_discussion_12_10_25.md (processed)
Executive Summary of Changes
The December 10 technical deep-dive with Greg and Xiaojie revealed significantly more complexity in Phase 2 (MLS Data Infrastructure) than originally scoped, along with clearer constraints on timing (Android launch) and new opportunities in Phase 3 (Underbuilt automation).
Key Insights That Changed SOW:
- MLS Data Scale: 160 million records daily, ~300 columns (not previously specified in detail)
- Performance Requirements: Sub-1-second query latency requirement for production comp generation API
- Architecture Pivot: Data lake-first approach (S3 → PySpark → Postgres) may be more appropriate than traditional warehouse (Snowflake) for production use case
- Timeline Constraint: Phase 1 should start after Android launch (early January 2026), not December 2025
- Underbuilt Opportunity: Current manual extraction process moving toward LLM-based automation; POC exploration added to scope
Section-by-Section Changes
Document Header
Added:
- Last Updated date (December 12, 2025)
- Summary of December 10 technical deep-dive
- Note about architecture approach refinement
Why: Transparency on SOW evolution; helps reviewers understand what’s new
Phase 1: Analytics Foundation (Weeks 1-4)
Changes:
- Timeline: Explicitly noted Phase 1 starts “early January 2026” after Android launch completes
- Assumptions: Added confirmation that MixPanel/Statsig already streaming events (per December 10)
- Assumptions: Added coordination note for Customer.io integration
Impact: Minimal scope change, mostly clarifications and timeline alignment
Quote from meeting:
“We’re trying to launch Android, which is big enough work… by end of this year” - Xiaojie
Phase 2: MLS Data Infrastructure (Weeks 3-8) - MAJOR UPDATES
Old Title: “MLS Data Infrastructure & Accuracy Benchmarking”
New Title: “MLS Data Infrastructure & Comp Generation Pipeline”
Changes to Context Section (NEW):
Added detailed technical context missing from original SOW:
- CoreLogic/Hotality provides 160 million records daily with ~300 columns
- Current provider is ~85% accurate with specific data quality issues (sale dates, days on market)
- Sub-1-second query performance requirement for comp generation API
- Moving to “gold standard” bulk data to solve data quality problems
Quote from meeting:
“We need super accurate data… they may be 85% there, but we need it to be as high as possible.” - Greg
Changes to Methods & Platforms (MAJOR):
Old approach: Assumed Snowflake warehouse with dbt transformations
New approach: Three architecture options with recommendation:
Option A: Data Lake-First (Recommended)
CoreLogic → S3 → PySpark → Postgres
- Fastest path to sub-1-second queries
- Lowest cost (avoids Snowflake ~$500-1500/month)
- Can add Snowflake later if analytics needed
Option B: Warehouse-Based
CoreLogic → S3 → Snowflake → dbt → Postgres
- Use if analytics on MLS data confirmed
- Higher cost and latency
Option C: Hybrid
CoreLogic → S3 → PySpark → Both Postgres + Snowflake
- Best of both worlds
- Start production, add analytics when needed
Rationale: December 10 discussion confirmed “no current product requirements for analytics on MLS data” - Greg stated “our use case is very narrow and focused… making amazing comps that are super accurate, super fast.”
Architecture Decision: Option A recommended unless new requirements emerge during Week 1 prototype.
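To make the Option A data flow concrete, here is a minimal sketch of the per-record normalization an Option A PySpark job could apply to the daily S3 drop before the Postgres load. The field names (`parcel_id`, `sale_date`, `listing_date`, `sale_price`) are placeholders, since the real CoreLogic schema (~300 columns) has not yet been shared:

```python
from datetime import datetime

def normalize_record(raw: dict) -> dict:
    """Parse one raw CoreLogic-style row into the shape loaded into Postgres.

    In the Option A pipeline this logic would run inside a PySpark job
    (e.g. as a DataFrame transform) over the daily S3 drop. Field names
    here are hypothetical pending the vendor's schema documentation.
    """
    def parse_date(value):
        try:
            return datetime.strptime(value, "%Y%m%d").date().isoformat()
        except (TypeError, ValueError):
            return None  # left null; flagged downstream by data-quality monitoring

    def parse_price(value):
        try:
            price = float(value)
            return price if price > 0 else None
        except (TypeError, ValueError):
            return None

    return {
        "parcel_id": raw.get("parcel_id"),
        "sale_date": parse_date(raw.get("sale_date")),
        "listing_date": parse_date(raw.get("listing_date")),
        "price": parse_price(raw.get("sale_price")),
    }

# Example: a malformed listing date becomes null rather than failing the load.
print(normalize_record({"parcel_id": "123", "sale_date": "20251101",
                        "listing_date": "bad", "sale_price": "850000"}))
```

Keeping bad values as nulls (instead of dropping or crashing) is what lets the data-quality monitoring in the deliverables count gaps per field.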
Changes to Deliverables:
Added:
- Specific mention of geo-spatial indexing (PostGIS)
- Parallel processing for 160M records
- High availability configuration
- Architecture Decision Record (ADR) documenting chosen approach
Enhanced:
- Data quality monitoring now includes specific fields (sale dates, listing dates, days on market) based on known current provider gaps
Removed:
- Accuracy benchmarking vs. Zillow/Redfin (deprioritized, not blocking comp generation)
- Reference to “Bugatti” data processor (using CoreLogic directly)
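The geo-spatial indexing deliverable can be sketched as two SQL statements: a GiST index on the geometry column and a radius-bounded comp query. Table and column names (`properties`, `geom`) are illustrative; in production these would run via a Postgres driver such as psycopg2 against the comp-generation database:

```python
# Hypothetical PostGIS sketch for the comp-generation deliverable.
# Assumes a `properties` table with a geography column `geom`.

CREATE_GEOM_INDEX = """
CREATE INDEX IF NOT EXISTS idx_properties_geom
ON properties USING GIST (geom);
"""

# Comps within a radius of a subject property, nearest first.
COMP_QUERY = """
SELECT parcel_id, sale_date, price,
       ST_Distance(geom,
                   ST_SetSRID(ST_MakePoint(%(lon)s, %(lat)s), 4326)::geography) AS dist_m
FROM properties
WHERE ST_DWithin(geom,
                 ST_SetSRID(ST_MakePoint(%(lon)s, %(lat)s), 4326)::geography,
                 %(radius_m)s)
  AND sale_date >= %(since)s
ORDER BY dist_m
LIMIT 20;
"""

params = {"lon": -118.24, "lat": 34.05, "radius_m": 1600, "since": "2025-06-01"}
print("USING GIST" in CREATE_GEOM_INDEX, "ST_DWithin" in COMP_QUERY)
```

`ST_DWithin` over a GiST-indexed geography column is the standard PostGIS pattern for "comps within N meters" and is the main lever for hitting the sub-1-second target without a specialized search engine.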
Changes to Acceptance Criteria:
Added:
- <1 second latency for 95th percentile comp queries (new performance requirement)
- Data quality threshold: 95%+ of properties have sale date, listing date, price
- Zero downtime during daily data refreshes
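Both thresholds above can be checked mechanically in the monitoring layer. A small sketch, using illustrative numbers rather than real query measurements (field names match the data-quality criterion):

```python
import math

def p95(latencies_ms):
    """95th-percentile latency via the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

def meets_latency_slo(latencies_ms, threshold_ms=1000):
    """Acceptance check: p95 comp-query latency under 1 second."""
    return p95(latencies_ms) < threshold_ms

def completeness(records, fields=("sale_date", "listing_date", "price")):
    """Share of records with all required fields populated (target: 95%+)."""
    ok = sum(1 for r in records if all(r.get(f) is not None for f in fields))
    return ok / len(records)

# Illustrative data: 96 fast queries plus 4 slow outliers still pass p95.
sample = [200] * 96 + [1500] * 4
print(p95(sample), meets_latency_slo(sample))
```

Wiring these checks into the daily refresh is what would drive the Slack alerts listed under the Week 6 success metrics.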
Changes to Assumptions:
Added:
- CoreLogic delivery flexibility confirmed (SFTP, S3, Snowflake, Databricks)
- Delta vs. full refresh strategy TBD with vendor (Greg needs to confirm)
- Current comp generation logic to be documented by Breezy
- May require specialized database (Elasticsearch) or caching layer for query performance
Quote from meeting:
“I need to find out whether we can get deltas, or if it has to be full files every time” - Greg
Phase 3: Underbuilt Expansion Intelligence (Weeks 5-8)
Old Title: “Underbuilt Expansion Intelligence”
New Title: “Underbuilt Expansion Intelligence & Automation Exploration”
Changes to Context Section (NEW):
Added context on current state:
- “Very manual” process currently
- Extracting from PDFs, Word docs, county websites, some APIs
- ~500 cities covered
- Moving toward “large language model-based” approach
- China team focused on “pipeline and AI side”
Quote from meeting:
“As far as I know, we’re the only ones doing this” - Greg (on competitive moat)
Changes to Deliverables:
Split into two parts:
Part A: Expansion Intelligence (Original scope)
- Demand heatmap
- Expansion prioritization (next 100 cities)
- ROI model
Part B: Automation Exploration (NEW)
- Extraction tool evaluation report (3-5 modern platforms)
- POC results processing 3-5 sample counties
- Architecture recommendations for automated workflow
- Quick win opportunities (e.g., “condense 50-page building code to 3-page TLDR”)
Why added: Meeting revealed current manual bottleneck and opportunity to leverage modern LLM/OCR tools. Xiaojie noted “5 years ago this would have been unsolvable” - timing is right to explore automation.
Scope note: Part B is proof-of-concept only, not production implementation (separate phase)
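The POC loop can be sketched as prompt-build, model call, JSON parse. Everything here is hypothetical: `call_llm` is a stub standing in for whichever extraction platform the tool evaluation selects, and the zoning fields are invented for illustration:

```python
import json

# Invented example fields -- the real target schema comes from Breezy's
# current manual Underbuilt workflow documentation.
FIELDS = ["max_building_height_ft", "setback_front_ft", "far"]

def build_prompt(document_text: str) -> str:
    """Ask the model for the zoning fields as strict JSON."""
    return (
        "Extract these fields from the building code excerpt below. "
        "Return JSON only, with null for anything not stated.\n"
        f"Fields: {', '.join(FIELDS)}\n\n{document_text}"
    )

def call_llm(prompt: str) -> str:
    """Stub: stands in for the chosen extraction platform's API."""
    return json.dumps({"max_building_height_ft": 35,
                       "setback_front_ft": 20, "far": None})

def extract(document_text: str) -> dict:
    """One POC iteration: prompt, call, parse."""
    return json.loads(call_llm(build_prompt(document_text)))

result = extract("Maximum building height: 35 feet. Front setback: 20 feet.")
print(result)
```

Requesting strict JSON with explicit nulls keeps unextractable values distinguishable from extraction errors, which matters for the accuracy measurement below.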
Changes to Acceptance Criteria:
Added:
- Extraction POC demonstrates >80% accuracy on structured data
- Breezy has clear path forward for Underbuilt scaling
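The >80% accuracy criterion implies scoring extracted output against a hand-labeled ground truth for the sample counties. A minimal sketch of that scoring, with invented field names and toy data:

```python
def field_accuracy(extracted, ground_truth, fields):
    """Fraction of (record, field) pairs where extraction matches truth."""
    matches = total = 0
    for got, want in zip(extracted, ground_truth):
        for f in fields:
            total += 1
            matches += got.get(f) == want.get(f)
    return matches / total

fields = ["max_building_height_ft", "setback_front_ft"]
# Toy ground truth for 5 hypothetical counties, plus one extraction error.
want = [{"max_building_height_ft": 30 + 5 * i, "setback_front_ft": 20}
        for i in range(5)]
got = [dict(r) for r in want]
got[0]["setback_front_ft"] = 15  # one wrong field out of 10

acc = field_accuracy(got, want, fields)
print(acc, acc > 0.80)
```

Field-level (rather than whole-record) scoring gives partial credit, which is the more informative metric for deciding whether a tool is worth productionizing.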
Risk Section Updates
Added New Risks:
Risk 2: Sub-1-second query performance on 160M records
- May require specialized database (Elasticsearch) or caching layer
- Mitigation: Prototype in Week 1 with sample data, proper indexing strategy
- Impact: Could add 1-2 weeks if Postgres insufficient
Risk 3: CoreLogic data delivery unknowns
- Schema, deltas vs. full dumps, timing not yet confirmed
- Mitigation: Request sample data before Phase 2 kickoff, flexible S3-based architecture
- Escalation path if data quality fails
Risk 4: Team bandwidth during Android launch (NEW)
- Key stakeholders have limited availability Q4 2025
- Mitigation: Start Phase 1 after Android launch (January 2026), async-first communication
- Timeline: Phase 1 pushed to January 2026
Reordered: Existing risks renumbered to accommodate new ones
Success Metrics Updates
Changed:
- Week targets now explicitly restart from “Phase 1 kickoff in early January 2026 (post-Android launch)”
- Added Phase 2-specific metrics:
- Week 1: Architecture decision finalized
- Week 3: First 3 test markets processed successfully
- Week 6: <1 second query latency achieved for 95th percentile
- Week 6: Data quality monitoring and Slack alerts functional
Mutual Commitments Updates
Reorganized by phase for clarity:
Phase 1 additions:
- Specific MixPanel/Statsig/Customer.io access requirements
- Event schema coordination needs
Phase 2 additions (SIGNIFICANT):
- CoreLogic credentials and documentation Week 1
- Specific data delivery questions Greg needs to answer:
- Full dumps vs. deltas?
- Daily delivery timing and SLAs?
- Data format (CSV, Parquet, JSON)?
- Comp generation algorithm documentation needed
- 10-20 test properties for validation
- Greg to allocate 5+ hrs/week during Phase 2 (up from the original 3-5 hrs)
Phase 3 additions:
- Sample building code documents (3-5 counties)
- Manual process documentation for current workflow
All Phases:
- Added async communication expectations (Slack, Loom)
- Added decision velocity requirement (<48 hour turnaround on blocking questions)
What Didn’t Change
These sections remain largely unchanged:
- Phase 1 core deliverables (retention dashboards, event taxonomy, DAU/MAU tracking)
- Executive Value Thesis (still relevant, not updated)
- Commercial Options (pricing structure unchanged)
- References (same clients)
- Overall engagement duration (8 weeks still appropriate)
Open Questions for Breezy Team
Based on December 10 discussion, these questions need answers to finalize scope:
CoreLogic/MLS Data (Greg)
- Have you received schema documentation from CoreLogic?
- What is the daily delta size vs. full dataset size?
- What time of day does data become available?
- What are their SLAs for data delivery?
- Do they provide data lineage or quality metrics?
Architecture & Performance
- What is acceptable data freshness SLA? (Daily refresh okay, or need more frequent?)
- Do you have an existing Postgres instance we'll use, or should we provision a new one?
- What is expected query volume for comp generation API? (QPS estimates)
Timeline
- What is exact Android launch date?
- When do you want Phase 1 to start? (Early January = first week? Second week?)
- Is there a hard deadline for MLS pipeline? (Or flexible based on quality?)
Underbuilt
- Can we get sample PDFs from 3-5 counties for extraction POC?
- How do you currently validate extraction accuracy?
- What is average time to add one city under current manual process?
Recommended Next Steps
- Breezy internal discussion: Review updated SOW, align on priorities and timeline
- Greg to gather CoreLogic details: Schema docs, sample data, delivery mechanism answers
- Confirm Phase 1 start date: After Android launch, specific week in January
- Share Underbuilt samples: PDFs/docs for extraction POC (if Phase 3 confirmed)
- Brainforge to share extraction tool recommendations: Email/Slack with tool evaluation (Uttam)
- Schedule kickoff call: Once SOW approved and start date confirmed
Impact on Commercial Terms
No change to pricing structure - scope expansion in Phase 2 and 3 balanced by:
- More focused approach (production-first, skip analytics if not needed)
- Clarified assumptions reduce rework risk
- Proof-of-concept vs. production implementation boundary clearer
Timeline impact:
- Phase 1 start pushed to January 2026 (was December 2025)
- Overall 8-week duration still achievable
- Delivery before the January 20th waitlist launch is still feasible if we start the first week of January
Meeting Artifacts
- Full Transcript: ../transcripts/brainforge_breezy_technical_discussion_12_10_25.md
- Detailed Meeting Notes: ../transcripts/2024-12-10_technical_deep_dive_notes.md
- Updated SOW: /sow/SOW-Breezy-BrainforgeAI.md
- This Summary: /SOW_UPDATE_SUMMARY_2024-12-12.md
Key Quotes from Meeting
“We need super accurate data… they may be 85% there, but we need it to be as high as possible.” - Greg
“Our use case is very narrow and focused… making amazing comps that are super accurate, super fast, that builds trust with agents.” - Greg
“We’re a startup, we’re scrappy, there are tons of other random, to be very honest, random shit there.” - Xiaojie
“As far as I know, we’re the only ones doing this [Underbuilt].” - Greg
“5 years ago this would have been unsolvable.” - Xiaojie (on Underbuilt extraction problem)
“I’m always pushing GMC [Jimsy], saying data solution, we should take it as early as possible to gather user signals.” - Xiaojie
Approval Process
To proceed:
- Breezy team reviews updated SOW
- Confirm architecture approach (Option A/B/C preference)
- Confirm Phase 1 start date (early January 2026)
- Sign or acknowledge updated SOW
- Kick off with access provisioning
Questions? Reach out via Slack or email (uttam@brainforge.ai)