Statement of Work
Data Foundation & Product Analytics: Real Estate Agent Operating System
Prepared for: Breezy (James Harris, Sigal Bareket, Jimsy)
Prepared on: December 4, 2025
Last Updated: December 12, 2025 (following December 10 technical deep-dive with Greg and Xiaojie)
Note on Proposed Timelines: All phase durations, deliverable sequencing, and milestone dates outlined in this SOW are proposed estimates based on our current understanding from discovery conversations. Actual timelines and scope will be validated and refined collaboratively as we gain deeper visibility into data quality, infrastructure readiness, and product priorities during the engagement. We expect to adjust pacing and deliverables together based on what we discover.
Recent Updates (December 12, 2025): Phase 2 (MLS Data Infrastructure) significantly expanded based on technical deep-dive revealing 160M daily records, ~300 columns, and sub-1-second query performance requirements. Architecture approach refined to explore data lake-first option (S3 → PySpark → Postgres) to optimize for production speed vs. traditional warehouse approach. Phase 3 (Underbuilt) enhanced with automation exploration based on discussion of current manual extraction process and LLM-based opportunities. Phase 1 start date aligned with Android launch completion (early January 2026).
1) Executive Value Thesis
Breezy is poised to lead the market with agent-focused property intelligence, using data accuracy as a defensible moat before competitors can replicate its feature set. More than 300 beta waitlist signups are expected by January 20th, yet there is currently limited visibility into which features drive retention, which user segments are likely to convert to paid, or whether expansion into underbuilt markets justifies further engineering investment. Today, product decisions rest on founder intuition and Slack feedback rather than cohort behavior, feature correlation, and churn indicators.
This engagement establishes trusted data infrastructure and product analytics before scaling, enabling three business-critical outcomes: launch paid tiers in Q1 with confidence by understanding D0-D30 retention curves and feature activation patterns; defend 15 to 20% higher pricing than legacy CRMs by demonstrating accuracy advantages through MLS data quality benchmarks; and expand into underbuilt cities 3x faster by prioritizing high-demand geographies using logged address interest instead of gut feel.
2) Approach and Deliverables
Phase 1: Analytics Foundation & Product Intelligence (Weeks 1 to 4 | December 2025)
Responsibilities:
- Brainforge: Audit existing Mixpanel/Statsig implementation; design event taxonomy and data dictionary; build D0-D30 retention dashboards; establish feature usage funnels (comps → underbuilt → pipeline → notetaker); create DAU/MAU tracking with cohort segmentation
- Breezy: Grant access (Mixpanel, Statsig, Customer.io, codebase for event review); provide product roadmap and beta user feedback themes; designate engineering POCs (Greg, Xiaojie) for event validation
Methods & Platforms:
- Mixpanel for behavioral analytics (already streaming events); Data.io for lightweight BI layer; Python/Jupyter for cohort analysis prototyping
- Daily Slack standups; weekly product review with Jimsy/Sigal; Linear for milestone tracking
Deliverables:
- Product Analytics Playbook: Event taxonomy, naming conventions, tracking plan for new features (onboarding flow, monetization events)
- Retention Dashboard: D0/D7/D30 curves by user cohort (sign-up date, invite source, James network vs. organic)
- Feature Adoption Matrix: Which feature combinations predict retained users? (e.g., users who run 3+ comps in Week 1 retain at 2x rate)
- DAU/MAU Baseline: Establish current engagement patterns across 200 beta users before January 20th waitlist influx
- Alerting: Slack notifications for anomalies (for example, notable shifts in feature usage, authentication issues, or API rate limit events)
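As an illustrative sketch of the retention computation behind the dashboard deliverable (the column names and the toy cohort below are assumptions for illustration, not the final event taxonomy):

```python
import pandas as pd

def retention_curve(events, horizons=(0, 7, 30)):
    """DN retention from an event export with columns
    user_id, signup_date, event_date (dates, not timestamps)."""
    events = events.copy()
    events["day_n"] = (events["event_date"] - events["signup_date"]).dt.days
    n_users = events["user_id"].nunique()
    return pd.Series({
        f"D{h}": events.loc[events["day_n"] == h, "user_id"].nunique() / n_users
        for h in horizons
    })

# Toy cohort: two signups on Jan 5; only user 1 returns on day 7
df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "signup_date": pd.to_datetime(["2026-01-05"] * 3),
    "event_date": pd.to_datetime(["2026-01-05", "2026-01-12", "2026-01-05"]),
})
print(retention_curve(df))  # D0 = 1.0, D7 = 0.5, D30 = 0.0
```

The production version would segment the same computation by cohort dimensions (sign-up date, invite source, James network vs. organic) rather than running on the full base.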
Acceptance: Breezy team independently runs retention queries in Mixpanel; feature usage data validates or challenges product hypotheses (e.g., “Do agents who use underbuilt convert to paid faster?”); dashboards load efficiently (targeting under 3 seconds with 90 days of data)
Assumptions & Discovery Dependencies:
- Mixpanel and Statsig events already streaming (confirmed December 10: “Mixpanel already receiving events”); we audit quality and enhance taxonomy, not rebuild from scratch. Timeline adjusts if foundational instrumentation requires significant rework.
- Phase 1 starts early January 2026 after Android launch completes (end of December per Xiaojie: “trying to launch Android, which is big enough work”)
- Beta user base stable at 200 to 400 through December; January 20th waitlist influx (~300 signups) creates new cohort for comparison and statistical significance
- Product team available 3 to 5 hours/week for validation sessions once Android launch complete
- Phase duration may extend or compress based on data quality findings and infrastructure readiness discovered during Week 1 audit
- Customer.io integration (starting per Slack) coordinated to align event taxonomy
Phase 2: MLS Data Infrastructure & Comp Generation Pipeline (Weeks 3 to 8 | January 2026)
Updated based on December 10, 2025 technical deep-dive with Greg and Xiaojie
Context: CoreLogic/Cotality provides 160 million records daily with ~300 columns of property data. The current third-party provider is ~85% accurate, with visible data quality issues (wrong sale dates, missing days on market). Breezy is moving to “gold standard” bulk data to build an internal comp generation API with sub-1-second query performance.
Responsibilities:
- Brainforge: Design and implement data pipeline from CoreLogic bulk files to production Postgres; establish transformation layer (PySpark or dbt); optimize Postgres schema and indexing for sub-1-second comp queries; build data quality tests (completeness, freshness, accuracy); create monitoring and alerting for pipeline health
- Breezy: Provide CoreLogic credentials and sample data; define comp generation algorithm requirements; allocate Greg (engineering lead) for architecture decisions and schema review; provide test properties for validation
Methods & Platforms:
Architecture approach to be validated during Week 1 prototype based on CoreLogic delivery options:
Option A: Data Lake-First (Recommended for Production Speed)
- CoreLogic → S3 (data lake) → PySpark in-memory transformations → Postgres (production)
- Fastest path to sub-1-second queries, lowest cost, avoids warehouse latency
- Snowflake can be added later on same S3 data if analytics needed
- Use when: Only production use case confirmed, optimize for speed and cost
Option B: Warehouse-Based (If Analytics Required)
- CoreLogic → S3 → Snowflake → dbt transformations → Postgres (reverse ETL via Polytomic)
- Enables analytics on MLS data (neighborhood trends, market analysis)
- Higher cost (~$500-1,500/month Snowflake), additional latency hop
- Use when: Product/data team needs to run analytics on listings data
Option C: Hybrid
- CoreLogic → S3 → PySpark → Both Postgres (production) + Snowflake (analytics)
- Best of both worlds, start with production and add analytics when needed
- Use when: Future analytics confirmed but not blocking initial launch
Note: December 10 discussion confirmed no current product requirements for analytics on MLS data. “Our use case is very narrow and focused… making amazing comps that are super accurate, super fast” (Greg). Recommending Option A unless new requirements emerge.
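As a minimal sketch of the Option A transformation step (mapping vendor columns onto a normalized comp schema), shown here in plain Python for clarity; the vendor column names are hypothetical, and the production version would run the same logic as a PySpark job across the full ~300 columns:

```python
# Hypothetical vendor column names mapped to a normalized comp schema;
# the real map would cover the relevant subset of the ~300 source columns.
COLUMN_MAP = {
    "SaleDate": "sale_date",
    "ListDate": "listing_date",
    "SalePrice": "price",
    "Bedrooms": "beds",
    "Bathrooms": "baths",
    "PropertyType": "property_type",
}

def normalize_record(raw):
    """Rename mapped columns and drop empty values; in Option A this
    logic sits between S3 ingestion and the Postgres load."""
    return {
        dst: raw[src]
        for src, dst in COLUMN_MAP.items()
        if raw.get(src) not in (None, "", "NULL")
    }

raw = {"SaleDate": "2025-11-02", "SalePrice": 450000, "Bedrooms": 3,
       "ListDate": "", "VendorOnlyCol": "x"}
print(normalize_record(raw))  # {'sale_date': '2025-11-02', 'price': 450000, 'beds': 3}
```

Dropped-versus-defaulted handling for empty values (like the blank ListDate above) is one of the schema decisions we would validate with Greg during the Week 1 prototype.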
Deliverables:
- MLS Bulk Data Pipeline: Automated daily ingestion from CoreLogic (SFTP/S3); handles 160M records with parallel processing; incremental updates if available from vendor; error logging and data lineage
- Postgres Comp Database: Optimized schema with geo-spatial indexing (PostGIS); sub-1-second query performance for comp searches by radius, bed/bath, property type; high availability configuration for production API
- Transformation Layer: PySpark or dbt jobs for data cleaning, standardization, enrichment; handles ~300 source columns, maps to normalized comp schema
- Data Quality Monitoring: Automated tests for completeness (sale dates, listing dates, days on market); freshness alerting (pipeline failed, data stale); schema drift detection; Slack notifications for anomalies
- Architecture Decision Record: Documents chosen approach (Option A/B/C), rationale, trade-offs; includes cost projections and scaling plan
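To make the comp-search access pattern concrete, here is a pure-Python sketch of the radius plus bed/bath filter that the PostGIS queries would implement with a geo-spatial index (field names and tolerances are illustrative assumptions, not Breezy’s actual comp algorithm):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3959 * 2 * asin(sqrt(a))  # Earth radius ~3,959 miles

def find_comps(subject, candidates, radius_miles=1.0, bed_tolerance=1):
    """Radius + beds + property-type filter; PostGIS would satisfy this
    with an indexed ST_DWithin-style query instead of a Python scan."""
    return [
        c for c in candidates
        if haversine_miles(subject["lat"], subject["lon"], c["lat"], c["lon"]) <= radius_miles
        and abs(c["beds"] - subject["beds"]) <= bed_tolerance
        and c["property_type"] == subject["property_type"]
    ]

subject = {"lat": 28.600, "lon": -81.350, "beds": 3, "property_type": "SFR"}
candidates = [
    {"id": "A", "lat": 28.601, "lon": -81.351, "beds": 4, "property_type": "SFR"},   # ~0.1 mi away
    {"id": "B", "lat": 29.600, "lon": -81.350, "beds": 3, "property_type": "SFR"},   # ~69 mi away
    {"id": "C", "lat": 28.601, "lon": -81.350, "beds": 3, "property_type": "Condo"},
]
print([c["id"] for c in find_comps(subject, candidates)])  # ['A']
```

At 160M rows, the sub-1-second target depends on this filter hitting a geo-spatial index first, then applying attribute filters to the reduced candidate set, which is the query plan we would validate in the Week 1 prototype.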
Acceptance:
- Comp generation API queries return results in <1 second for 95th percentile
- Pipeline processes daily CoreLogic data within agreed SLA (TBD based on vendor delivery time)
- Data quality tests pass: 95%+ of properties have sale date, listing date, price
- Zero downtime during daily data refreshes
- Breezy engineering team can query production Postgres for any property in covered markets
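The 95% completeness gate can be expressed as a simple check, sketched here in plain Python; in production this would run as an automated test inside the pipeline rather than an ad hoc script:

```python
REQUIRED_FIELDS = ("sale_date", "listing_date", "price")

def completeness(records, required=REQUIRED_FIELDS):
    """Share of records with every required field populated; the
    acceptance gate is completeness >= 0.95 on the full dataset."""
    if not records:
        return 0.0
    ok = sum(all(r.get(f) not in (None, "") for f in required) for r in records)
    return ok / len(records)

sample = [
    {"sale_date": "2025-10-01", "listing_date": "2025-08-15", "price": 450000},
    {"sale_date": "2025-09-12", "listing_date": "2025-07-30", "price": 610000},
    {"sale_date": "2025-11-02", "listing_date": None, "price": 380000},  # missing listing date
]
print(round(completeness(sample), 3))  # 0.667
```

The same pattern extends to freshness (max sale_date vs. today) and schema-drift checks, each wired to the Slack alerting described above.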
Key Assumptions & Discovery Dependencies:
- CoreLogic provides bulk files via S3/SFTP (confirmed: “They can put it into S3 or whatever”); schema documentation available; sample data provided Week 1 of Phase 2
- Delta vs. full refresh strategy TBD with vendor (Greg: “I need to find out whether we can get deltas, or if it has to be full files every time”)
- Production SLAs defined (data freshness, query latency, availability targets)
- Current comp generation logic documented and provided by Breezy team
- Postgres instance sized appropriately for 160M+ record dataset (or Elasticsearch/specialized query engine if needed)
- Phase timeline may extend if query performance optimization requires specialized database (e.g., Elasticsearch) or caching layer
Phase 3: Underbuilt Expansion Intelligence & Automation Exploration (Weeks 5 to 8, Parallel Track | January 2026)
Updated based on December 10, 2025 technical deep-dive with Xiaojie and Greg
Context: Underbuilt currently uses a “very manual” process to extract building codes and zoning data from county/municipal sources (PDFs, Word docs, county websites, some APIs). Data acquisition is “hardcore,” with high variance across the ~500 cities covered. Breezy is “moving towards a more large language model-based” approach to automate extraction. As Greg noted, “We’re the only ones doing this, as far as I know”: a unique proprietary dataset that creates a competitive moat.
Responsibilities:
- Brainforge: (1) Analyze logged address searches from beta users to identify high-demand markets; build demand heatmap and expansion prioritization model. (2) Evaluate modern extraction tools (OCR, LLM-based document processing) for building code automation; conduct proof-of-concept with sample county data; provide recommendations on tools and architecture for automated workflow
- Breezy: Provide address search logs and current underbuilt coverage map; document current manual extraction workflow; share sample building code PDFs from 3-5 counties; allocate time with team familiar with extraction process (China team focused on “pipeline and AI side”)
Methods & Platforms:
- Python analysis of Mixpanel address search events; Mapbox or Carto for geographic visualization
- Evaluation of extraction platforms: Google Document AI, AWS Textract, Anthropic Claude with vision, specialized PDF extraction tools
- Proof-of-concept: Process 3-5 sample counties through recommended tools, measure accuracy vs. manual extraction
- Simple prioritization model: (interest volume × market TAM × ease of data acquisition)
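The prioritization model above can be sketched as a simple multiplicative score; the inputs and example values below are placeholders to be calibrated against Breezy’s actual search logs and market data:

```python
def expansion_score(city):
    """Multiplicative prioritization: interest volume x market TAM x ease
    of data acquisition. All inputs are placeholder scores for illustration."""
    return city["interest_volume"] * city["market_tam"] * city["acquisition_ease"]

cities = [
    {"name": "Orlando", "interest_volume": 50, "market_tam": 0.8, "acquisition_ease": 0.9},
    {"name": "Springfield", "interest_volume": 5, "market_tam": 0.4, "acquisition_ease": 0.5},
]
ranked = sorted(cities, key=expansion_score, reverse=True)
print([c["name"] for c in ranked])  # ['Orlando', 'Springfield']
```

A multiplicative form means a city with zero logged interest scores zero regardless of TAM, which matches the demand-led expansion thesis; we would revisit the functional form if Breezy wants to seed markets ahead of demand.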
Deliverables:
Part A: Expansion Intelligence
- Demand Heatmap: Map showing underbuilt search interest by city/zip; highlights gaps between user demand and current 500-city coverage
- Expansion Prioritization: Ranked list of next 100 cities to build with rationale (e.g., “Orlando ranks #7: high agent density + 50 Winter Park queries from beta users + county provides API access”)
- ROI Model: Estimate incremental paid conversions per city based on beta usage patterns (e.g., “Adding Orlando unlocks 50 potential paying agents at ~$18K annual impact”)
Part B: Automation Exploration (Proof-of-concept, not production implementation)
- Extraction Tool Evaluation: Report comparing 3-5 modern extraction platforms (accuracy, cost, API limitations, domain-specific capabilities)
- POC Results: Sample output from processing 3-5 counties through recommended tools; accuracy comparison vs. manual extraction; identified challenges (tables, diagrams, domain terminology)
- Architecture Recommendations: Proposed automated workflow (data acquisition → extraction → validation → storage); phased rollout approach; estimated effort for production implementation
- Quick Win Opportunities: Identify immediate use cases (e.g., “Condense 50-page building code to 3-page TLDR for agent consumption,” per Xiaojie’s suggestion)
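The POC accuracy comparison could be scored with a simple field-level metric like the following sketch (the zoning field names are hypothetical; the real gold standard would be the manually extracted records):

```python
def field_accuracy(extracted, manual):
    """Share of manually verified fields that the automated extraction
    reproduced exactly; the POC feasibility bar is > 0.80."""
    if not manual:
        return 0.0
    matches = sum(extracted.get(field) == value for field, value in manual.items())
    return matches / len(manual)

# Hypothetical zoning fields for one parcel (not real county data)
manual = {"max_height_ft": 35, "min_setback_ft": 20, "far": 0.5, "zoning": "R-1"}
extracted = {"max_height_ft": 35, "min_setback_ft": 25, "far": 0.5, "zoning": "R-1"}
print(field_accuracy(extracted, manual))  # 0.75
```

Exact-match scoring is deliberately strict; for prose-heavy fields we would likely relax this to normalized or fuzzy matching during the POC.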
Acceptance:
- Expansion prioritization list validates against known user requests (3+ cities correctly predicted)
- Extraction POC demonstrates feasibility of automation (>80% accuracy on structured data)
- Breezy team has clear path forward for Underbuilt scaling based on tool recommendations
- ROI model informs product roadmap decisions on expansion velocity
Key Assumptions & Discovery Dependencies:
- Address search events tracked in Mixpanel (validated during Phase 1 analytics audit)
- Sample building code documents representative of typical complexity/format across US counties
- Current manual extraction process documented (time per city, accuracy validation method, failure modes)
- POC is exploration only; production implementation is separate phase (not included in this SOW)
- China team available to share context on current pipeline and domain requirements
- “5 years ago this would have been unsolvable” (Xiaojie); modern LLMs make automation feasible, though still complex
3) Adoption, Proof, and Risk
Success Metrics
Note: Week targets below are proposed milestones subject to adjustment based on discovery findings and infrastructure readiness. Weeks restart from Phase 1 kickoff in early January 2026 (post-Android launch).
Phase 1: Analytics Foundation
- Week 2: Retention dashboard available in Mixpanel; Breezy team runs first “why did this cohort churn?” analysis independently
- Week 4: Feature usage data informs product roadmap (e.g., prioritize onboarding flow changes based on activation patterns showing which features predict retention)
Phase 2: MLS Data Infrastructure
- Week 1 (of Phase 2): Architecture decision finalized based on CoreLogic data samples and query performance prototypes
- Week 3: Pipeline processes first 3 test markets successfully; data quality tests pass (95%+ properties have required fields)
- Week 6: Full MLS pipeline operational with 160M records; comp generation queries achieve <1 second latency for 95th percentile
- Week 6: Data quality monitoring in place; Slack alerts functional for pipeline failures, stale data, schema drift
- Week 6: Breezy engineering team demonstrates querying production Postgres for comps in any covered market
Phase 3: Underbuilt Intelligence & Automation
- Week 5: Demand heatmap shows top 100 expansion cities; model validates against 3+ known high-demand markets
- Week 7: Extraction tool POC completes for 3-5 sample counties; accuracy report and recommendations delivered
- Week 8: Expansion prioritization drives next quarter roadmap; automation approach documented for future implementation
Enablement Plan
- Weekly Office Hours: 60-minute session for the Breezy team to learn Mixpanel queries, cohort analysis, and data interpretation
- Documentation: Event taxonomy wiki; Mixpanel query cookbook; MLS schema guide with example queries
- Handoff: All code, dashboards, and infrastructure documented for internal ownership post-engagement
Decision Cadence
- Daily: Slack async standups (blocker escalation, quick questions)
- Weekly: 60-minute sync with Jimsy/Sigal (demo insights, adjust priorities based on beta feedback)
- Bi-weekly: Technical sync with Greg/Xiaojie (MLS schema review, event validation, infrastructure decisions)
Top Risks & Mitigation
- Beta user volume too low for statistical significance in retention analysis
  Mitigation: Focus on directional insights and qualitative patterns in Phase 1; statistical rigor increases after the January 20th waitlist influx; use James’s luxury agent network as a high-intent cohort for early signals
- Sub-1-second query performance on 160M records may require a specialized database
  Risk: Postgres may not handle comp generation queries (<1s latency) at scale without significant optimization or an alternative architecture
  Mitigation: Prototype query patterns in Week 1 of Phase 2 with sample data; establish an indexing strategy (geo-spatial, composite indexes); evaluate Elasticsearch or a caching layer if Postgres proves insufficient; size the Postgres instance appropriately (memory, CPU, IOPS) before full data load
  Impact on timeline: Could add 1 to 2 weeks if a specialized query engine is needed
- CoreLogic data delivery unknowns (schema, deltas, timing)
  Risk: “I need to find out whether we can get deltas, or if it has to be full files every time” (Greg). Schema documentation, data quality, or the delivery mechanism may be worse than expected
  Mitigation: Request sample data and schema docs before Phase 2 kickoff; prototype with 3 test markets before full rollout; establish acceptance criteria (95%+ of properties have sale date, listing date, price); the flexible S3-based architecture supports multiple ingestion patterns
  Escalation: If CoreLogic data quality fails, evaluate alternative “gold standard” providers or revert to the current provider with an enhanced data quality layer
- Team bandwidth during Android launch and holidays
  Risk: “We’re trying to launch Android, which is big enough work… by end of this year” (Xiaojie). Key stakeholders (Greg, Xiaojie) may have limited availability for data-initiative decisions during Q4 2025
  Mitigation: Phase 1 starts after the Android launch (early January 2026); async-friendly communication (Slack, Loom, documented ADRs); don’t block on real-time meetings for non-critical decisions; establish a clear escalation path for time-sensitive questions
  Adjusted timeline: Phase 1 kickoff pushed to January 2026 (after Android launch)
- Customer.io integration (starting next week per Slack) conflicts with analytics priorities
  Mitigation: Coordinate with Breezy engineering on event schema so Customer.io triggers align with the Mixpanel taxonomy; a unified data dictionary prevents divergence
4) Co-Authored Commercial Options
| Option | Scope & Duration | Commercials |
|---|---|---|
| Discovery Sprint | 2 weeks; analytics audit, retention dashboards, event taxonomy | Pricing to be finalized based on agreed timeline and scope |
| Phases 1 to 2 | 6 weeks; full product analytics plus MLS infrastructure | Pricing to be finalized based on agreed timeline and scope |
| Full Build (Phases 1 to 3) | 8 weeks; analytics plus MLS plus underbuilt expansion intelligence | Pricing to be finalized based on agreed timeline and scope |
| Retain & Iterate | Ongoing post-launch; about 30 to 50 hrs/month (analytics support, new feature instrumentation, MLS maintenance) | Pricing to be finalized based on agreed timeline and scope |
Suggested References:
We recommend reaching out to clients who’ve collaborated on similar analytics and data infrastructure projects:
Default.com – Zero-to-one analytics infrastructure and event taxonomy structuring; also led pricing strategy development
Contact info available upon request
Hedra – Zero-to-one product analytics and BI stack; full data warehousing, reverse ETL, advanced SaaS workflows
Contact info available upon request
Bolt.new – Zero-to-one analytics implementation, including Snowflake and dbt setup; user retention analysis and funnel optimization
Contact info available upon request
Overage Terms: Hours beyond monthly retainer billed at standard rates (see below); reviewed monthly to right-size engagement
Standard Hourly Rates (if opting for hourly model):
| Role | Name | LinkedIn | Hourly Rate |
|---|---|---|---|
| Strategist | Uttam Kumaran | linkedin.com/in/uttamkumaran | $250/hr |
| Architect | Awaish Kumar | linkedin.com/in/awaishkumar | $200/hr |
| Analytics Engineer | [TBD] | [TBD] | $150/hr |
5) Mutual Commitments
To move fast and deliver before January 20th, we need from Breezy:
Phase 1 (Analytics Foundation - January 2026):
- Access within 3 business days of kickoff: Mixpanel (read access for event audit), Statsig (configuration review), codebase (for event instrumentation review), Customer.io (event schema coordination)
- Designated SMEs: Jimsy (product context, 2-3 hrs/week), Sigal (growth/marketing use cases, 2 hrs/week), Greg or Xiaojie (technical validation, 2 hrs/week for event schema review)
- Beta user insights: Access to beta community feedback (agent community Slack, one-on-one session notes) to inform retention hypotheses
Phase 2 (MLS Data Infrastructure - January 2026):
- CoreLogic/Cotality credentials and documentation: Vendor credentials in Week 1 of Phase 2; schema documentation; sample data files (at least 3 test markets) for prototyping
- Data delivery answers from Greg’s CoreLogic discussions:
- Full dumps vs. delta/incremental updates?
- Daily delivery timing and SLAs?
- Data format (CSV, Parquet, JSON)?
- Any data quality metrics or lineage provided?
- Comp generation algorithm: Current logic for searching comparable properties (bed/bath/radius filters, scoring/ranking method)
- Test properties for validation: 10-20 properties where accurate comps are known (James’s personal knowledge or verified data)
- Technical SME availability: Greg allocated 5+ hrs/week during Phase 2 for architecture decisions, schema review, query performance validation
Phase 3 (Underbuilt Expansion):
- Address search logs: Mixpanel events or database exports showing city/zip-level search volume
- Current coverage map: List of 500 cities currently covered by Underbuilt
- Sample building code documents: PDFs/Word docs from 3-5 representative counties (high variance examples) for extraction POC
- Manual process documentation: Current workflow for adding new city (time, steps, accuracy validation)
All Phases:
- Weekly sync attendance: 60-minute call with at least one founder present (Jimsy or Sigal) plus relevant technical SMEs
- Priority clarity: Linear board or Slack channel for top 3 analytics questions each week
- Async communication: Slack for daily standup updates, blockers, quick questions; Loom for demos when meeting schedules conflict
- Decision velocity: <48 hour turnaround on questions blocking progress (architecture choices, vendor decisions, acceptance criteria)
References
We’re happy to connect you with clients managing similar product analytics and data infrastructure challenges:
LMNT (Drink CPG) – Omnichannel revenue data foundation; Snowflake plus dbt plus product analytics; fractional data team model
Contact: Shivani Amar (BizOps Manager) [contact info available upon request]
Lilo Social (Agency Platform) – Behavioral analytics, forecasting engine, creative automation; owned infrastructure strategy
Contact: Zac Fromson (Co-Founder) [contact info available upon request]
Both can speak to responsiveness, technical depth, and ability to translate data insights into product decisions.
Next Steps:
- Review the updated SOW and provide feedback
- Select commercial option and confirm start date (target: early January 2026 kickoff, following Android launch completion, to deliver before the January 20th waitlist influx)
- Sign mutual NDA for underbuilt proprietary discussion (as discussed with Sigal)
- Kickoff call with full Brainforge team (Uttam, Awaish, analytics engineer) plus Breezy team (Jimsy, Greg, Xiaojie)
- Access provisioning: Mixpanel, Statsig, Customer.io, GitHub/codebase (first 48 hours)