Statement of Work
Data Foundation & Product Analytics: Real Estate Agent Operating System
Prepared for: Breezy (James Harris, Sigal Bareket, Jimsy)
Prepared on: December 4, 2025
Last Updated: December 12, 2025 (following December 10 technical deep-dive with Greg and Xiaojie)
Note on Proposed Timelines: All phase durations, deliverable sequencing, and milestone dates outlined in this SOW are proposed estimates based on our current understanding from discovery conversations. Actual timelines and scope will be validated and refined collaboratively as we gain deeper visibility into data quality, infrastructure readiness, and product priorities during the engagement. We expect to adjust pacing and deliverables together based on what we discover.
Recent Updates (December 12, 2025): Phase 2 (MLS Data Infrastructure) significantly expanded based on technical deep-dive revealing 160M daily records, ~300 columns, and sub-1-second query performance requirements. Architecture approach refined to explore data lake-first option (S3 → PySpark → Postgres) to optimize for production speed vs. traditional warehouse approach. Phase 3 (Underbuilt) enhanced with automation exploration based on discussion of current manual extraction process and LLM-based opportunities. Phase 1 start date aligned with Android launch completion (early January 2026).
1) Executive Value Thesis
Breezy is poised to lead the market with agent-focused property intelligence, using data accuracy as a defensible moat before competitors can replicate its feature set. More than 300 beta waitlist signups are expected by January 20th, yet there is currently limited visibility into which features drive retention, which user segments are likely to convert to paid, or whether expansion into underbuilt markets justifies further engineering investment. Today, product decisions rest on founder intuition and Slack feedback rather than cohort behavior, feature correlation, and churn indicators.
This engagement establishes trusted data infrastructure and product analytics before scaling, enabling three business-critical outcomes: launch paid tiers in Q1 with confidence by understanding D0-D30 retention curves and feature activation patterns; defend 15 to 20% higher pricing than legacy CRMs by demonstrating accuracy advantages through MLS data quality benchmarks; and expand into underbuilt cities 3x faster by prioritizing high-demand geographies using logged address interest instead of gut feel.
2) Approach and Deliverables
Phase 1: Analytics Foundation & Product Intelligence (Weeks 1 to 4 | December 2025)
Responsibilities:
- Brainforge: Audit existing Mixpanel/Statsig implementation; design event taxonomy and data dictionary; build D0-D30 retention dashboards; establish feature usage funnels (comps → underbuilt → pipeline → notetaker); create DAU/MAU tracking with cohort segmentation
- Breezy: Grant access (Mixpanel, Statsig, Customer.io, codebase for event review); provide product roadmap and beta user feedback themes; designate engineering POCs (Greg, Xiaojie) for event validation
Methods & Platforms:
- Mixpanel for behavioral analytics (already streaming events); Data.io for lightweight BI layer; Python/Jupyter for cohort analysis prototyping
- Daily Slack standups; weekly product review with Jimsy/Sigal; Linear for milestone tracking
Deliverables:
- Product Analytics Playbook: Event taxonomy, naming conventions, tracking plan for new features (onboarding flow, monetization events)
- Retention Dashboard: D0/D7/D30 curves by user cohort (sign-up date, invite source, James network vs. organic)
- Feature Adoption Matrix: Which feature combinations predict retained users? (e.g., users who run 3+ comps in Week 1 retain at 2x rate)
- DAU/MAU Baseline: Establish current engagement patterns across 200 beta users before January 20th waitlist influx
- Alerting: Slack notifications for anomalies (for example, notable shifts in feature usage, authentication issues, or API rate limit events)
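As an illustrative sketch of the retention computation behind the dashboard deliverable (the column names and the toy cohort below are assumptions for illustration, not the final event taxonomy):

```python
import pandas as pd

def retention_curve(events, horizons=(0, 7, 30)):
    """DN retention from an event export with columns
    user_id, signup_date, event_date (dates, not timestamps)."""
    events = events.copy()
    events["day_n"] = (events["event_date"] - events["signup_date"]).dt.days
    n_users = events["user_id"].nunique()
    return pd.Series({
        f"D{h}": events.loc[events["day_n"] == h, "user_id"].nunique() / n_users
        for h in horizons
    })

# Toy cohort: two signups on Jan 5; only user 1 returns on day 7
df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "signup_date": pd.to_datetime(["2026-01-05"] * 3),
    "event_date": pd.to_datetime(["2026-01-05", "2026-01-12", "2026-01-05"]),
})
print(retention_curve(df))  # D0 = 1.0, D7 = 0.5, D30 = 0.0
```

The production version would segment the same computation by cohort dimensions (sign-up date, invite source, James network vs. organic) rather than running on the full base.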
Acceptance: Breezy team independently runs retention queries in Mixpanel; feature usage data validates or challenges product hypotheses (e.g., “Do agents who use underbuilt convert to paid faster?”); dashboards load efficiently (targeting under 3 seconds with 90 days of data)
Assumptions & Discovery Dependencies:
- Mixpanel and Statsig events already streaming (confirmed December 10: “Mixpanel already receiving events”); we audit quality and enhance taxonomy, not rebuild from scratch. Timeline adjusts if foundational instrumentation requires significant rework.
- Phase 1 starts early January 2026 after Android launch completes (end of December per Xiaojie: “trying to launch Android, which is big enough work”)
- Beta user base stable at 200 to 400 through December; January 20th waitlist influx (~300 signups) creates new cohort for comparison and statistical significance
- Product team available 3 to 5 hours/week for validation sessions once Android launch complete
- Phase duration may extend or compress based on data quality findings and infrastructure readiness discovered during Week 1 audit
- Customer.io integration (starting per Slack) coordinated to align event taxonomy
Phase 2: MLS Data Infrastructure & Comp Generation Pipeline (Weeks 3 to 8 | January 2026)
Updated based on December 10, 2025 technical deep-dive with Greg and Xiaojie
Context: CoreLogic/Cotality provides 160 million records daily with ~300 columns of property data. The current third-party provider is ~85% accurate, with visible data quality issues (wrong sale dates, missing days on market). Breezy is moving to “gold standard” bulk data to build an internal comp generation API with sub-1-second query performance.
Responsibilities:
- Brainforge: Design and implement data pipeline from CoreLogic bulk files to production Postgres; establish transformation layer (PySpark or dbt); optimize Postgres schema and indexing for sub-1-second comp queries; build data quality tests (completeness, freshness, accuracy); create monitoring and alerting for pipeline health
- Breezy: Provide CoreLogic credentials and sample data; define comp generation algorithm requirements; allocate Greg (engineering lead) for architecture decisions and schema review; provide test properties for validation
Methods & Platforms:
Architecture approach to be validated during Week 1 prototype based on CoreLogic delivery options:
Option A: Data Lake-First (Recommended for Production Speed)
- CoreLogic → S3 (data lake) → PySpark in-memory transformations → Postgres (production)
- Fastest path to sub-1-second queries, lowest cost, avoids warehouse latency
- Snowflake can be added later on same S3 data if analytics needed
- Use when: Only production use case confirmed, optimize for speed and cost
Option B: Warehouse-Based (If Analytics Required)
- CoreLogic → S3 → Snowflake → dbt transformations → Postgres (reverse ETL via Polytomic)
- Enables analytics on MLS data (neighborhood trends, market analysis)
- Higher cost (~$500-1,500/month Snowflake), additional latency hop
- Use when: Product/data team needs to run analytics on listings data
Option C: Hybrid
- CoreLogic → S3 → PySpark → Both Postgres (production) + Snowflake (analytics)
- Best of both worlds, start with production and add analytics when needed
- Use when: Future analytics confirmed but not blocking initial launch
Note: December 10 discussion confirmed no current product requirements for analytics on MLS data. “Our use case is very narrow and focused… making amazing comps that are super accurate, super fast” (Greg). Recommending Option A unless new requirements emerge.
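As a minimal sketch of the Option A transformation step (mapping vendor columns onto a normalized comp schema), shown here in plain Python for clarity; the vendor column names are hypothetical, and the production version would run the same logic as a PySpark job across the full ~300 columns:

```python
# Hypothetical vendor column names mapped to a normalized comp schema;
# the real map would cover the relevant subset of the ~300 source columns.
COLUMN_MAP = {
    "SaleDate": "sale_date",
    "ListDate": "listing_date",
    "SalePrice": "price",
    "Bedrooms": "beds",
    "Bathrooms": "baths",
    "PropertyType": "property_type",
}

def normalize_record(raw):
    """Rename mapped columns and drop empty values; in Option A this
    logic sits between S3 ingestion and the Postgres load."""
    return {
        dst: raw[src]
        for src, dst in COLUMN_MAP.items()
        if raw.get(src) not in (None, "", "NULL")
    }

raw = {"SaleDate": "2025-11-02", "SalePrice": 450000, "Bedrooms": 3,
       "ListDate": "", "VendorOnlyCol": "x"}
print(normalize_record(raw))  # {'sale_date': '2025-11-02', 'price': 450000, 'beds': 3}
```

Dropped-versus-defaulted handling for empty values (like the blank ListDate above) is one of the schema decisions we would validate with Greg during the Week 1 prototype.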
Deliverables:
- MLS Bulk Data Pipeline: Automated daily ingestion from CoreLogic (SFTP/S3); handles 160M records with parallel processing; incremental updates if available from vendor; error logging and data lineage
- Postgres Comp Database: Optimized schema with geo-spatial indexing (PostGIS); sub-1-second query performance for comp searches by radius, bed/bath, property type; high availability configuration for production API
- Transformation Layer: PySpark or dbt jobs for data cleaning, standardization, enrichment; handles ~300 source columns, maps to normalized comp schema
- Data Quality Monitoring: Automated tests for completeness (sale dates, listing dates, days on market); freshness alerting (pipeline failed, data stale); schema drift detection; Slack notifications for anomalies
- Architecture Decision Record: Documents chosen approach (Option A/B/C), rationale, trade-offs; includes cost projections and scaling plan
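To make the comp-search access pattern concrete, here is a pure-Python sketch of the radius plus bed/bath filter that the PostGIS queries would implement with a geo-spatial index (field names and tolerances are illustrative assumptions, not Breezy’s actual comp algorithm):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3959 * 2 * asin(sqrt(a))  # Earth radius ~3,959 miles

def find_comps(subject, candidates, radius_miles=1.0, bed_tolerance=1):
    """Radius + beds + property-type filter; PostGIS would satisfy this
    with an indexed ST_DWithin-style query instead of a Python scan."""
    return [
        c for c in candidates
        if haversine_miles(subject["lat"], subject["lon"], c["lat"], c["lon"]) <= radius_miles
        and abs(c["beds"] - subject["beds"]) <= bed_tolerance
        and c["property_type"] == subject["property_type"]
    ]

subject = {"lat": 28.600, "lon": -81.350, "beds": 3, "property_type": "SFR"}
candidates = [
    {"id": "A", "lat": 28.601, "lon": -81.351, "beds": 4, "property_type": "SFR"},   # ~0.1 mi away
    {"id": "B", "lat": 29.600, "lon": -81.350, "beds": 3, "property_type": "SFR"},   # ~69 mi away
    {"id": "C", "lat": 28.601, "lon": -81.350, "beds": 3, "property_type": "Condo"},
]
print([c["id"] for c in find_comps(subject, candidates)])  # ['A']
```

At 160M rows, the sub-1-second target depends on this filter hitting a geo-spatial index first, then applying attribute filters to the reduced candidate set, which is the query plan we would validate in the Week 1 prototype.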
Acceptance:
- Comp generation API queries return results in <1 second for 95th percentile
- Pipeline processes daily CoreLogic data within agreed SLA (TBD based on vendor delivery time)
- Data quality tests pass: 95%+ of properties have sale date, listing date, price
- Zero downtime during daily data refreshes
- Breezy engineering team can query production Postgres for any property in covered markets
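The 95% completeness gate can be expressed as a simple check, sketched here in plain Python; in production this would run as an automated test inside the pipeline rather than an ad hoc script:

```python
REQUIRED_FIELDS = ("sale_date", "listing_date", "price")

def completeness(records, required=REQUIRED_FIELDS):
    """Share of records with every required field populated; the
    acceptance gate is completeness >= 0.95 on the full dataset."""
    if not records:
        return 0.0
    ok = sum(all(r.get(f) not in (None, "") for f in required) for r in records)
    return ok / len(records)

sample = [
    {"sale_date": "2025-10-01", "listing_date": "2025-08-15", "price": 450000},
    {"sale_date": "2025-09-12", "listing_date": "2025-07-30", "price": 610000},
    {"sale_date": "2025-11-02", "listing_date": None, "price": 380000},  # missing listing date
]
print(round(completeness(sample), 3))  # 0.667
```

The same pattern extends to freshness (max sale_date vs. today) and schema-drift checks, each wired to the Slack alerting described above.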
Key Assumptions & Discovery Dependencies:
- CoreLogic provides bulk files via S3/SFTP (confirmed: “They can put it into S3 or whatever”); schema documentation available; sample data provided Week 1 of Phase 2
- Delta vs. full refresh strategy TBD with vendor (Greg: “I need to find out whether we can get deltas, or if it has to be full files every time”)
- Production SLAs defined (data freshness, query latency, availability targets)
- Current comp generation logic documented and provided by Breezy team
- Postgres instance sized appropriately for 160M+ record dataset (or Elasticsearch/specialized query engine if needed)
- Phase timeline may extend if query performance optimization requires specialized database (e.g., Elasticsearch) or caching layer
Phase 3: Underbuilt Expansion Intelligence & Automation Exploration (Weeks 5 to 8, Parallel Track | January 2026)
Updated based on December 10, 2025 technical deep-dive with Xiaojie and Greg
Context: Underbuilt currently uses a “very manual” process to extract building codes and zoning data from county/municipal sources (PDFs, Word docs, county websites, some APIs). Data acquisition is “hardcore,” with high variance across the ~500 cities covered. Breezy is “moving towards a more large language model-based” approach to automate extraction. As Greg noted, “We’re the only ones doing this, as far as I know”: a unique proprietary dataset that creates a competitive moat.
Responsibilities:
- Brainforge: (1) Analyze logged address searches from beta users to identify high-demand markets; build demand heatmap and expansion prioritization model. (2) Evaluate modern extraction tools (OCR, LLM-based document processing) for building code automation; conduct proof-of-concept with sample county data; provide recommendations on tools and architecture for automated workflow
- Breezy: Provide address search logs and current underbuilt coverage map; document current manual extraction workflow; share sample building code PDFs from 3-5 counties; allocate time with team familiar with extraction process (China team focused on “pipeline and AI side”)
Methods & Platforms:
- Python analysis of Mixpanel address search events; Mapbox or Carto for geographic visualization
- Evaluation of extraction platforms: Google Document AI, AWS Textract, Anthropic Claude with vision, specialized PDF extraction tools
- Proof-of-concept: Process 3-5 sample counties through recommended tools, measure accuracy vs. manual extraction
- Simple prioritization model: (interest volume × market TAM × ease of data acquisition)
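The prioritization model above can be sketched as a simple multiplicative score; the inputs and example values below are placeholders to be calibrated against Breezy’s actual search logs and market data:

```python
def expansion_score(city):
    """Multiplicative prioritization: interest volume x market TAM x ease
    of data acquisition. All inputs are placeholder scores for illustration."""
    return city["interest_volume"] * city["market_tam"] * city["acquisition_ease"]

cities = [
    {"name": "Orlando", "interest_volume": 50, "market_tam": 0.8, "acquisition_ease": 0.9},
    {"name": "Springfield", "interest_volume": 5, "market_tam": 0.4, "acquisition_ease": 0.5},
]
ranked = sorted(cities, key=expansion_score, reverse=True)
print([c["name"] for c in ranked])  # ['Orlando', 'Springfield']
```

A multiplicative form means a city with zero logged interest scores zero regardless of TAM, which matches the demand-led expansion thesis; we would revisit the functional form if Breezy wants to seed markets ahead of demand.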
Deliverables:
Part A: Expansion Intelligence
- Demand Heatmap: Map showing underbuilt search interest by city/zip; highlights gaps between user demand and current 500-city coverage
- Expansion Prioritization: Ranked list of next 100 cities to build with rationale (e.g., “Orlando ranks #7: high agent density + 50 Winter Park queries from beta users + county provides API access”)
- ROI Model: Estimate incremental paid conversions per city based on beta usage patterns (e.g., “Adding Orlando unlocks 50 potential paying agents at ~$18K annual impact”)
Part B: Automation Exploration (Proof-of-concept, not production implementation)
- Extraction Tool Evaluation: Report comparing 3-5 modern extraction platforms (accuracy, cost, API limitations, domain-specific capabilities)
- POC Results: Sample output from processing 3-5 counties through recommended tools; accuracy comparison vs. manual extraction; identified challenges (tables, diagrams, domain terminology)
- Architecture Recommendations: Proposed automated workflow (data acquisition → extraction → validation → storage); phased rollout approach; estimated effort for production implementation
- Quick Win Opportunities: Identify immediate use cases (e.g., “Condense 50-page building code to 3-page TLDR for agent consumption,” per Xiaojie’s suggestion)
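The POC accuracy comparison could be scored with a simple field-level metric like the following sketch (the zoning field names are hypothetical; the real gold standard would be the manually extracted records):

```python
def field_accuracy(extracted, manual):
    """Share of manually verified fields that the automated extraction
    reproduced exactly; the POC feasibility bar is > 0.80."""
    if not manual:
        return 0.0
    matches = sum(extracted.get(field) == value for field, value in manual.items())
    return matches / len(manual)

# Hypothetical zoning fields for one parcel (not real county data)
manual = {"max_height_ft": 35, "min_setback_ft": 20, "far": 0.5, "zoning": "R-1"}
extracted = {"max_height_ft": 35, "min_setback_ft": 25, "far": 0.5, "zoning": "R-1"}
print(field_accuracy(extracted, manual))  # 0.75
```

Exact-match scoring is deliberately strict; for prose-heavy fields we would likely relax this to normalized or fuzzy matching during the POC.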
Acceptance:
- Expansion prioritization list validates against known user requests (3+ cities correctly predicted)
- Extraction POC demonstrates feasibility of automation (>80% accuracy on structured data)
- Breezy team has clear path forward for Underbuilt scaling based on tool recommendations
- ROI model informs product roadmap decisions on expansion velocity
Key Assumptions & Discovery Dependencies:
- Address search events tracked in Mixpanel (validated during Phase 1 analytics audit)
- Sample building code documents representative of typical complexity/format across US counties
- Current manual extraction process documented (time per city, accuracy validation method, failure modes)
- POC is exploration only; production implementation is separate phase (not included in this SOW)
- China team available to share context on current pipeline and domain requirements
- “5 years ago this would have been unsolvable” (Xiaojie); modern LLMs make automation feasible, though still complex
3) Adoption, Proof, and Risk
Success Metrics
Note: Week targets below are proposed milestones subject to adjustment based on discovery findings and infrastructure readiness. Weeks restart from Phase 1 kickoff in early January 2026 (post-Android launch).
Phase 1: Analytics Foundation
- Week 2: Retention dashboard available in Mixpanel; Breezy team runs first “why did this cohort churn?” analysis independently
- Week 4: Feature usage data informs product roadmap (e.g., prioritize onboarding flow changes based on activation patterns showing which features predict retention)
Phase 2: MLS Data Infrastructure
- Week 1 (of Phase 2): Architecture decision finalized based on CoreLogic data samples and query performance prototypes
- Week 3: Pipeline processes first 3 test markets successfully; data quality tests pass (95%+ properties have required fields)
- Week 6: Full MLS pipeline operational with 160M records; comp generation queries achieve <1 second latency for 95th percentile
- Week 6: Data quality monitoring in place; Slack alerts functional for pipeline failures, stale data, schema drift
- Week 6: Breezy engineering team demonstrates querying production Postgres for comps in any covered market
Phase 3: Underbuilt Intelligence & Automation
- Week 5: Demand heatmap shows top 100 expansion cities; model validates against 3+ known high-demand markets
- Week 7: Extraction tool POC completes for 3-5 sample counties; accuracy report and recommendations delivered
- Week 8: Expansion prioritization drives next quarter roadmap; automation approach documented for future implementation
Enablement Plan
- Weekly Office Hours: 60-minute session for the Breezy team to learn Mixpanel queries, cohort analysis, and data interpretation
- Documentation: Event taxonomy wiki; Mixpanel query cookbook; MLS schema guide with example queries
- Handoff: All code, dashboards, and infrastructure documented for internal ownership post-engagement
Decision Cadence
- Daily: Slack async standups (blocker escalation, quick questions)
- Weekly: 60-minute sync with Jimsy/Sigal (demo insights, adjust priorities based on beta feedback)
- Bi-weekly: Technical sync with Greg/Xiaojie (MLS schema review, event validation, infrastructure decisions)
Top Risks & Mitigation
- Beta user volume too low for statistical significance in retention analysis
  Mitigation: Focus on directional insights and qualitative patterns in Phase 1; statistical rigor increases after the January 20th waitlist influx; use James’s luxury agent network as a high-intent cohort for early signals
- Sub-1-second query performance on 160M records may require a specialized database
  Risk: Postgres may not handle comp generation queries (<1s latency) at scale without significant optimization or an alternative architecture
  Mitigation: Prototype query patterns in Week 1 of Phase 2 with sample data; establish an indexing strategy (geo-spatial, composite indexes); evaluate Elasticsearch or a caching layer if Postgres proves insufficient; size the Postgres instance appropriately (memory, CPU, IOPS) before full data load
  Impact on timeline: Could add 1 to 2 weeks if a specialized query engine is needed
- CoreLogic data delivery unknowns (schema, deltas, timing)
  Risk: “I need to find out whether we can get deltas, or if it has to be full files every time” (Greg). Schema documentation, data quality, or the delivery mechanism may be worse than expected
  Mitigation: Request sample data and schema docs before Phase 2 kickoff; prototype with 3 test markets before full rollout; establish acceptance criteria (95%+ of properties have sale date, listing date, price); the flexible S3-based architecture supports multiple ingestion patterns
  Escalation: If CoreLogic data quality fails, evaluate alternative “gold standard” providers or revert to the current provider with an enhanced data quality layer
- Team bandwidth during Android launch and holidays
  Risk: “We’re trying to launch Android, which is big enough work… by end of this year” (Xiaojie). Key stakeholders (Greg, Xiaojie) may have limited availability for data-initiative decisions during Q4 2025
  Mitigation: Phase 1 starts after the Android launch (early January 2026); async-friendly communication (Slack, Loom, documented ADRs); don’t block on real-time meetings for non-critical decisions; establish a clear escalation path for time-sensitive questions
  Adjusted timeline: Phase 1 kickoff pushed to January 2026 (after Android launch)
- Customer.io integration (starting next week per Slack) conflicts with analytics priorities
  Mitigation: Coordinate with Breezy engineering on event schema so Customer.io triggers align with the Mixpanel taxonomy; a unified data dictionary prevents divergence
4) Co-Authored Commercial Options
| Option | Scope & Duration | Commercials |
|---|---|---|
| Discovery Sprint | 2 weeks; analytics audit, retention dashboards, event taxonomy | Pricing to be finalized based on agreed timeline and scope |
| Phases 1 to 2 | 6 weeks; full product analytics plus MLS infrastructure | Pricing to be finalized based on agreed timeline and scope |
| Full Build (Phases 1 to 3) | 8 weeks; analytics plus MLS plus underbuilt expansion intelligence | Pricing to be finalized based on agreed timeline and scope |
| Retain & Iterate | Ongoing post-launch; about 30 to 50 hrs/month (analytics support, new feature instrumentation, MLS maintenance) | Pricing to be finalized based on agreed timeline and scope |
Suggested References:
We recommend reaching out to clients who’ve collaborated on similar analytics and data infrastructure projects:
Default.com – Zero-to-one analytics infrastructure and event taxonomy structuring; also led pricing strategy development
Contact info available upon request
Hedra – Zero-to-one product analytics and BI stack; full data warehousing, reverse ETL, advanced SaaS workflows
Contact info available upon request
Bolt.new – Zero-to-one analytics implementation, including Snowflake and dbt setup; user retention analysis and funnel optimization
Contact info available upon request
Overage Terms: Hours beyond monthly retainer billed at standard rates (see below); reviewed monthly to right-size engagement
Standard Hourly Rates (if opting for hourly model):
| Role | Name | LinkedIn | Hourly Rate |
|---|---|---|---|
| Strategist | Uttam Kumaran | linkedin.com/in/uttamkumaran | $250/hr |
| Architect | Awaish Kumar | linkedin.com/in/awaishkumar | $200/hr |
| Analytics Engineer | [TBD] | [TBD] | $150/hr |
5) Mutual Commitments
To move fast and deliver before January 20th, we need from Breezy:
Phase 1 (Analytics Foundation - January 2026):
- Access within 3 business days of kickoff: Mixpanel (read access for event audit), Statsig (configuration review), codebase (for event instrumentation review), Customer.io (event schema coordination)
- Designated SMEs: Jimsy (product context, 2-3 hrs/week), Sigal (growth/marketing use cases, 2 hrs/week), Greg or Xiaojie (technical validation, 2 hrs/week for event schema review)
- Beta user insights: Access to beta community feedback (agent community Slack, one-on-one session notes) to inform retention hypotheses
Phase 2 (MLS Data Infrastructure - January 2026):
- CoreLogic/Cotality credentials and documentation: Vendor credentials in Week 1 of Phase 2; schema documentation; sample data files (at least 3 test markets) for prototyping
- Data delivery answers from Greg’s CoreLogic discussions:
- Full dumps vs. delta/incremental updates?
- Daily delivery timing and SLAs?
- Data format (CSV, Parquet, JSON)?
- Any data quality metrics or lineage provided?
- Comp generation algorithm: Current logic for searching comparable properties (bed/bath/radius filters, scoring/ranking method)
- Test properties for validation: 10-20 properties where accurate comps are known (James’s personal knowledge or verified data)
- Technical SME availability: Greg allocated 5+ hrs/week during Phase 2 for architecture decisions, schema review, query performance validation
Phase 3 (Underbuilt Expansion):
- Address search logs: Mixpanel events or database exports showing city/zip-level search volume
- Current coverage map: List of 500 cities currently covered by Underbuilt
- Sample building code documents: PDFs/Word docs from 3-5 representative counties (high variance examples) for extraction POC
- Manual process documentation: Current workflow for adding new city (time, steps, accuracy validation)
All Phases:
- Weekly sync attendance: 60-minute call with at least one founder present (Jimsy or Sigal) plus relevant technical SMEs
- Priority clarity: Linear board or Slack channel for top 3 analytics questions each week
- Async communication: Slack for daily standup updates, blockers, quick questions; Loom for demos when meeting schedules conflict
- Decision velocity: <48 hour turnaround on questions blocking progress (architecture choices, vendor decisions, acceptance criteria)
References
We’re happy to connect you with clients managing similar product analytics and data infrastructure challenges:
LMNT (Drink CPG) – Omnichannel revenue data foundation; Snowflake plus dbt plus product analytics; fractional data team model
Contact: Shivani Amar (BizOps Manager) [contact info available upon request]
Lilo Social (Agency Platform) – Behavioral analytics, forecasting engine, creative automation; owned infrastructure strategy
Contact: Zac Fromson (Co-Founder) [contact info available upon request]
Both can speak to responsiveness, technical depth, and ability to translate data insights into product decisions.
Next Steps:
- Review the updated SOW and provide feedback
- Select commercial option and confirm start date (target: early January 2026 kickoff, following Android launch completion, to deliver before the January 20th waitlist influx)
- Sign mutual NDA for underbuilt proprietary discussion (as discussed with Sigal)
- Kickoff call with full Brainforge team (Uttam, Awaish, analytics engineer) plus Breezy team (Jimsy, Greg, Xiaojie)
- Access provisioning: Mixpanel, Statsig, Customer.io, GitHub/codebase (first 48 hours)