Meeting Notes: Technical Deep Dive - Data Stack & MLS Architecture

Date: December 10, 2025
Attendees:

  • Brainforge: Uttam Kumaran, Awaish Kumar
  • Breezy: Xiaojie Zhang (Engineering Lead, China team), Greg (Engineering Leader, Founding Engineer)

Recording/Transcript: brainforge_breezy_technical_discussion_12_10_25.md


Executive Summary

Technical deep-dive meeting to discuss Breezy’s data infrastructure needs across three workstreams: (1) Product analytics foundation using MixPanel/Statsig, (2) MLS bulk data pipeline for comp generation requiring sub-1-second query performance, and (3) Underbuilt expansion and automation opportunities. Key decision: explore data lake-based architecture (S3 → PySpark → Postgres) to potentially bypass Snowflake for production use cases while maintaining analytics optionality. Team is currently focused on Android launch (end of year) with limited bandwidth for data initiatives until early January.


Key Discussion Points

1. Product Analytics Stack

Context: Breezy has existing MixPanel and Statsig implementations streaming events. Need to audit, establish event taxonomy, and build core measurement dashboards.

Discussion:

  • MixPanel already receiving events (confirmed)
  • Statsig being used for potential A/B testing framework
  • Current decision-making relies on founder intuition and Slack feedback vs. cohort behavior
  • ~200-400 beta users currently, expecting 300+ waitlist signups by January 20th
  • Need DAU/MAU, retention curves (D0/D7/D30), feature adoption patterns, activation funnels

Brainforge Approach Presented:

  • Audit existing event taxonomy
  • Build core dashboards within MixPanel (no need for external BI tool initially)
  • Track cohorts: sign-up date, invite source, James network vs. organic
  • Understand paths into product usage
  • Layer AI chat-with-data tools for broader business access to insights

Decisions Made:

  • Phase 1 will focus on analytics foundation
  • Use existing MixPanel/Statsig rather than rebuild instrumentation

Open Questions:

  • What is current event taxonomy completeness?
  • Which features/metrics are highest priority for product team?
  • When can we get access to MixPanel/Statsig?

2. MLS Data Pipeline Architecture (Major Focus)

Context: Breezy needs to ingest bulk MLS data from CoreLogic (now called Cotality) to generate accurate property comps. The current third-party provider is only ~85% accurate, causing data quality issues.

Scale & Requirements

Data Volume:

  • 160 million records daily
  • ~300 columns wide
  • Bulk data files (not real-time streaming)
  • Question: Full dumps or incremental deltas? (TBD with vendor)

Performance Requirements:

  • Sub-1-second query latency for comp generation API
  • Needs “super high availability” and “super fast” responses
  • Replicating Trestle-quality experience (CoreLogic’s real-time product)

Data Quality Drivers:

  • Need “super accurate data” - current provider has visible seams
  • Wrong/missing sale dates, listing dates, days on market
  • Currently cobbling together multiple APIs to fill gaps
  • Moving to the “gold standard” (CoreLogic/Cotality) to eliminate data quality issues

Architecture Options Discussed

Option 1: Traditional Warehouse Approach

CoreLogic → S3 → Snowflake → dbt transformations → S3 → Postgres
  • Pros: Full analytics capabilities, well-understood pattern
  • Cons: Additional latency hop, Snowflake costs ($500-1500/month), not needed if only use case is production queries

Option 2: Data Lake In-Memory Processing (Recommended to explore)

CoreLogic → S3 → PySpark/DuckDB (in-memory transforms) → Postgres
  • Pros: Faster SLAs, lower costs, fewer hops, direct operational DB loading
  • Cons: Need separate solution if analytics required later
  • Can still add Snowflake on top of data lake later without losing work
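
To make Option 2 concrete, the sketch below uses DuckDB to read the S3 landing zone, dedupe in memory, and load straight into Postgres. This is a minimal sketch, not confirmed Breezy infrastructure: the bucket path, column names, and connection details are all illustrative.

```python
# Minimal sketch of Option 2: read the S3 landing zone with DuckDB,
# transform in memory, and land the result directly in Postgres.
# Bucket, schema, and connection details are illustrative placeholders.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")      # S3 reads (creds from env/secret)
con.execute("INSTALL postgres; LOAD postgres;")  # Postgres attach/write
con.execute("ATTACH 'dbname=breezy host=localhost user=app' AS pg (TYPE postgres)")

# Dedupe to the latest record per listing and keep only the columns the
# comp API needs, then load into the production database.
con.execute("""
    CREATE TABLE pg.public.listings AS
    SELECT listing_id, beds, baths, sale_date, sale_price,
           latitude, longitude
    FROM (
        SELECT *,
               row_number() OVER (
                   PARTITION BY listing_id ORDER BY updated_at DESC) AS rn
        FROM read_parquet('s3://breezy-mls-landing/corelogic/*.parquet')
    )
    WHERE rn = 1
""")
```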

Option 3: Hybrid

CoreLogic → S3 → PySpark transformations → Both Postgres (production) + Snowflake (analytics)
  • Best of both worlds
  • Start with production pipeline, add Snowflake when analytics needed

Technical Details:

  • CoreLogic can deliver via SFTP, S3, Snowflake, or Databricks (flexible)
  • Prefer S3/data lake landing for maximum flexibility
  • Use tools like Snowpipe, Polytomic, or custom PySpark jobs for ingestion
  • PySpark can process billions of rows in parallel in seconds
  • DuckDB for in-memory transformations also discussed
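
A hedged sketch of what a daily PySpark bulk job could look like at the scale above (~160M rows, ~300 columns). Paths, schema, partition count, and the JDBC target are placeholder assumptions, and the cluster would need the hadoop-aws and Postgres JDBC jars on its classpath.

```python
# Sketch of a daily PySpark bulk load; all names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mls_daily_load").getOrCreate()

raw = spark.read.parquet("s3a://breezy-mls-landing/corelogic/latest/")

# Light in-flight cleanup: normalize types, drop rows with no usable key.
clean = (
    raw.withColumn("sale_date", F.to_date("sale_date"))
       .filter(F.col("listing_id").isNotNull())
)

# Parallel JDBC write into a staging table; the partition count would be
# tuned to the cluster and to what the Postgres instance can absorb.
(clean.repartition(64)
      .write.format("jdbc")
      .option("url", "jdbc:postgresql://db.internal:5432/breezy")
      .option("dbtable", "listings_staging")
      .option("user", "app")            # credentials via a secrets manager
      .option("driver", "org.postgresql.Driver")
      .mode("overwrite")
      .save())
```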

Decisions Made:

  • Explore data lake approach (Option 2/3) as primary path
  • Don’t commit to Snowflake unless analytics use case materializes
  • Need to confirm with CoreLogic: full dumps vs. deltas, delivery frequency, SLAs

Open Questions:

  • What is the delta size each day? (Incremental vs. full refresh)
  • What are product SLAs for data freshness? (Daily? Hourly?)
  • Does CoreLogic provide schema documentation?
  • What transformations are needed before landing in Postgres?

Production vs. Analytics Use Cases

Current Use Case (Production):

  • Generate comps for properties based on subject property
  • Search criteria: bedrooms, bathrooms, radius from long/lat
  • Query comparable properties super fast
  • Serve via Breezy API to agents
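
For illustration, the comp lookup described above could look like the following against a PostGIS-enabled Postgres. The `listings` table, its columns, and the connection string are hypothetical, not Breezy's actual schema.

```python
# Hypothetical comp lookup: filter on beds/baths plus a radius search
# around the subject property's long/lat.
import psycopg2

COMP_QUERY = """
    SELECT listing_id, sale_price, sale_date, beds, baths
    FROM listings
    WHERE beds = %(beds)s
      AND baths = %(baths)s
      AND ST_DWithin(
              geom,
              ST_MakePoint(%(lon)s, %(lat)s)::geography,
              %(radius_m)s)              -- radius in meters
    ORDER BY sale_date DESC
    LIMIT 20
"""

with psycopg2.connect("dbname=breezy user=app") as conn:
    with conn.cursor() as cur:
        cur.execute(COMP_QUERY, {
            "beds": 3, "baths": 2,
            "lat": 30.2672, "lon": -97.7431,  # subject property
            "radius_m": 1609,                 # ~1 mile
        })
        comps = cur.fetchall()
```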

Future Use Cases (Analytics - not prioritized yet):

"The possibilities are endless" - Greg
  • Neighborhood/area analysis
  • Market trend analysis
  • Property appreciation patterns
  • “It’s gonna be a waste if we just transforming data and just make comps” - Xiaojie
  • But: No current product requirements for analytics on listings data

Insight: Don’t over-engineer for undefined future. Build production pipeline first, add analytics layer when needed.


3. Underbuilt Product & Data Challenges

Context: Underbuilt extracts building codes and zoning data from county/municipal sources (PDFs, Word docs, county websites, some APIs) to help agents understand development potential.

Current State:

  • “Very manual” - hardcore approach (Xiaojie’s words)
  • County-level data with huge variance in format and accessibility
  • ~500 cities currently covered
  • China team focused on “pipeline and AI side”

Moving Toward:

  • “Large language model-based” extraction
  • Automated workflow for data acquisition
  • Still need some scraping/manual pulling initially

Technical Challenges:

  • Wide variety of source formats (PDFs, Word docs, websites, APIs)
  • Domain-specific knowledge required (building codes, zoning terms)
  • Need to understand building code charts and regulations
  • “5 years ago would have been unsolvable” - Xiaojie

Opportunities Discussed:

  • Modern OCR and LLM extraction tools dramatically improved
  • Brainforge has evaluated multiple extraction platforms
  • Potential to condense 50-page building code guidelines to 3-page TLDR
  • Could build system that layers latest extraction tools (Google, Contextual, etc.)
  • PDFs don’t go anywhere - can iterate and improve extraction over time
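
To make the extraction idea concrete, here is a rough shape for one pass over a county PDF. pypdf handles the text pull; `llm_summarize` is a hypothetical stand-in for whichever extraction platform (Google, Contextual, etc.) is ultimately selected.

```python
# Rough shape of an LLM-based extraction pass for Underbuilt.
from pypdf import PdfReader

def extract_pdf_text(path: str) -> str:
    """Pull raw text from a county building-code PDF."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def llm_summarize(text: str) -> str:
    """Placeholder: prompt the chosen model to condense ~50 pages of
    building-code guidelines into a ~3-page TLDR of setbacks, height
    limits, and allowed uses."""
    raise NotImplementedError("wire up the selected extraction platform")

# Because the source PDFs don't change, this pass can be re-run as
# extraction tooling improves.
tldr = llm_summarize(extract_pdf_text("county_building_code.pdf"))
```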

Decisions Made:

  • Underbuilt is third priority (after analytics and MLS)
  • Brainforge to share extraction tool recommendations
  • Explore proof-of-concept when team has bandwidth

Open Questions:

  • What is current data acquisition process step-by-step?
  • Where does data land after extraction?
  • What accuracy threshold is acceptable?
  • How do we measure extraction quality?

Quote:

“We’re the only ones doing this, as far as I know” - Greg

Insight: This could be a proprietary moat if executed well. Unique dataset that competitors can’t easily replicate.


Technical Details Captured

Architecture Decisions

  1. Decoupling production and analytics pipelines

    • Production: Optimize for speed (sub-1s queries)
    • Analytics: Can tolerate higher latency, prioritize query flexibility
    • Don’t force both through same architecture
  2. Data lake as single source of truth

    • Land all data in S3/data lake
    • Multiple consumers can read from lake (PySpark, Snowflake, etc.)
    • Avoid vendor lock-in
  3. Transformation location trade-offs

    • In-warehouse (Snowflake/dbt): Better for analytics, easier debugging
    • In-flight (PySpark/DuckDB): Faster for production, lower cost
    • Choice depends on use case

Data Sources Discussed

| Source | Purpose | Volume/Frequency | Status |
| --- | --- | --- | --- |
| CoreLogic/Cotality | MLS bulk data | 160M records, ~300 cols, daily | Negotiating |
| MixPanel | Product events | Current beta users (~200-400) | Active |
| Statsig | A/B testing framework | Same as MixPanel | Active |
| County/Municipal | Building codes, zoning (PDFs, Word, APIs) | 500 cities covered | Manual process |

Tools/Platforms Mentioned

Current Stack:

  • MixPanel (product analytics)
  • Statsig (experimentation platform)
  • Postgres (production database)
  • CoreLogic/Cotality (MLS data provider)
  • Trestle (CoreLogic real-time product - using for California only)

Brainforge Recommended Tools:

  • Fivetran or Polytomic (data ingestion)
  • Snowflake (data warehouse - if analytics needed)
  • dbt (transformation layer)
  • PySpark or DuckDB (in-memory transformations)
  • Klaviyo (marketing activation - discussed)
  • Modern extraction tools for Underbuilt (specific names to be shared)

Tools Mentioned in Context:

  • Airflow (Xiaojie’s experience at Airbnb - “there’s better options now”)
  • Superset (same - “good open source but we have better choices”)
  • Granola (AI note-taker both teams use)
  • Customer.io (mentioned as starting next week per Slack)

Team Tech Background

Xiaojie Zhang:

  • Worked at Airbnb (product engineer + data engineer)
  • Experience with Airflow, Superset
  • Now focused on China dev team, AI/pipeline work
  • “More passionate about building software for customers”

Greg:

  • Worked at Afterpay from small scale-up through hockey stick growth
  • Core team, backend engineer and engineering manager for 8 years
  • Currently in Eastern Malaysia (just landed, on travel)
  • Leading MLS data project

Awaish Kumar:

  • 8+ years as data engineer
  • Experience at startups and growth-stage companies
  • Previously in vacation rental business (similar to property domain)
  • Built data infrastructure from scratch multiple times

Uttam Kumaran:

  • Background in data engineering (New York, WeWork data team)
  • Led product at data startup
  • Started Brainforge 3 years ago
  • Based in Austin

Pain Points Identified

1. Data Quality from Current MLS Provider

Current State: Using a third-party provider that handles individual MLS relationships and provides API access across all MLSs

Problem:

  • Only ~85% accurate
  • Visible “seams” in data
  • Wrong sale dates, listing dates
  • Missing days on market
  • Cobbling together multiple smaller APIs to fill gaps

Impact: Undermines trust with agents; core value prop is accuracy

Solution: Move to CoreLogic bulk data (gold standard)

2. Real-Time Data Access Complexity

Current State: CoreLogic Trestle provides real-time data but requires individual broker agreements with each MLS

Problem:

  • Implemented Trestle for California only
  • “Not a straightforward process” to set up agreements in all states
  • Can’t scale to national coverage quickly

Impact: Forced to use bulk data (not real-time) for most markets

Solution: Accept bulk data trade-off, optimize for accuracy over real-time

3. Manual Underbuilt Data Processing

Current State: “Very manual” extraction of building codes from county sources

Problem:

  • Doesn’t scale to 500+ cities efficiently
  • High variance in source formats
  • Requires domain expertise per county

Impact: Limits expansion velocity, high operational cost

Solution: Layer LLM/OCR extraction tools, build automated workflow

4. Limited Product Analytics Visibility

Current State: Decisions based on “founder intuition and Slack feedback”

Problem:

  • Can’t identify retention drivers
  • Don’t know which features predict paid conversion
  • No cohort analysis or funnel optimization

Impact: Risk launching paid tiers without understanding unit economics

Solution: Phase 1 analytics foundation

5. Team Bandwidth Constraints

Current State: Team “swamped” with Android launch (end-of-year deadline)

Problem:

  • Limited engineering hours for data initiatives
  • Greg on travel (just landed in Malaysia during call)
  • “Tons of random shit” competing for attention

Impact: Data work may slip or get de-prioritized

Solution: Start after Android launch (early January), clear prioritization

Requirements Gathered

Must-Have

MLS Pipeline

  • Ingest 160M records daily from CoreLogic
  • Sub-1-second query performance for comp generation
  • Super high availability for production API
  • Handle ~300 columns of property data
  • Support incremental updates, if available from vendor (see the upsert sketch after this list)
  • Data quality monitoring and alerting
  • Fallback/redundancy if pipeline fails
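
If CoreLogic does supply daily deltas, a minimal upsert from staging into the serving table could look like the sketch below. It assumes a unique index on `listing_id`; table and connection names are illustrative.

```python
# Hypothetical daily upsert keyed on listing_id (requires a unique
# constraint on that column).
import psycopg2

UPSERT = """
    INSERT INTO listings (listing_id, beds, baths, sale_date, sale_price,
                          latitude, longitude)
    SELECT listing_id, beds, baths, sale_date, sale_price,
           latitude, longitude
    FROM listings_staging
    ON CONFLICT (listing_id) DO UPDATE SET
        beds       = EXCLUDED.beds,
        baths      = EXCLUDED.baths,
        sale_date  = EXCLUDED.sale_date,
        sale_price = EXCLUDED.sale_price,
        latitude   = EXCLUDED.latitude,
        longitude  = EXCLUDED.longitude
"""

with psycopg2.connect("dbname=breezy user=app") as conn:
    with conn.cursor() as cur:
        cur.execute(UPSERT)
```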

Product Analytics

  • Audit existing MixPanel event taxonomy
  • D0/D7/D30 retention dashboards
  • DAU/MAU tracking by cohort
  • Feature adoption matrix (what predicts retention?)
  • Funnel analysis (onboarding, feature activation)
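
As a sketch of how the D0/D7/D30 numbers could be computed once events are exported from MixPanel: the query below assumes a Parquet export with `user_id` and `event_time` columns, a layout that has not been audited.

```python
# Sketch of a D0/D7/D30 retention pull over an assumed events export.
import duckdb

EVENTS = "read_parquet('s3://breezy-analytics/mixpanel_events/*.parquet')"

retention = duckdb.sql(f"""
    WITH first_seen AS (
        SELECT user_id, min(event_time::DATE) AS d0
        FROM {EVENTS}
        GROUP BY user_id
    ),
    activity AS (
        SELECT DISTINCT user_id, event_time::DATE AS d
        FROM {EVENTS}
    )
    SELECT f.d0 AS cohort_date,
           count(DISTINCT f.user_id) AS users,
           count(DISTINCT a.user_id) FILTER (WHERE a.d = f.d0 + 7)  AS d7,
           count(DISTINCT a.user_id) FILTER (WHERE a.d = f.d0 + 30) AS d30
    FROM first_seen f
    LEFT JOIN activity a USING (user_id)
    GROUP BY 1
    ORDER BY 1
""").df()
```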

Nice-to-Have

MLS Analytics (Future)

  • Neighborhood/area trend analysis
  • Market analytics for agents
  • Property appreciation modeling
  • Comp accuracy benchmarking vs. Zillow/Redfin

Underbuilt Automation

  • Automated extraction from PDFs/Word docs
  • LLM-based summarization of building codes
  • Expansion prioritization model (demand heatmap)

Action Items

  • Share extraction tool recommendations for Underbuilt - Owner: Uttam - Due: This week - Status: Not Started
  • Get answers from CoreLogic on data delivery - Owner: Greg - Due: TBD - Status: Not Started
    • Full dumps vs. deltas?
    • Delivery frequency and timing?
    • Schema documentation available?
  • Internal discussion on engagement scope and timeline - Owner: Xiaojie, Greg - Due: Before next meeting - Status: Not Started
  • Edit and send updated SOW document - Owner: Uttam - Due: After this meeting - Status: In Progress
  • Provide MixPanel/Statsig access - Owner: Breezy team - Due: Upon engagement start - Status: Not Started
  • Schedule follow-up after internal Breezy discussion - Owner: Uttam - Due: TBD - Status: Not Started

Quotes & Insights

On Team Dynamics

“We’re a startup, we’re scrappy, there are tons of other random, to be very honest, random shit there.” - Xiaojie Zhang

Insight: Need to be flexible and prioritize ruthlessly. Don’t over-plan, deliver incrementally.

On Data Tools Evolution

“5 years ago, I feel like… now we have a… there’s a lot of better and cheaper options that are more stable than even 5 years ago.” - Uttam

“It’s unsolvable. It’s unsolvable [5 years ago].” - Xiaojie (about Underbuilt problem)

Insight: Modern data stack enables previously impossible use cases. Don’t anchor on old approaches.

On Product Focus

“Our use case is very narrow and focused… making amazing comps that are super accurate, super fast, that builds trust with agents.” - Greg

Insight: Don’t build analytics infrastructure for undefined future needs. Solve the immediate problem.

On MLS Data Quality

“We need super accurate data… they may be 85% there, but we need it to be as high as possible.” - Greg

“CoreLogic… is the gold standard for this stuff. There’s not really an option that can give us better data.” - Greg

Insight: Data accuracy is existential for product value prop. Worth the engineering complexity.

On Underbuilt Competitive Moat

“As far as I know, we’re the only ones doing this.” - Greg

Insight: Unique proprietary dataset = defensible competitive advantage.

On Early Analytics

“I’m always pushing GMC [Jimsy], saying data solution, we should take it as early as possible to gather user signals.” - Xiaojie

“Almost 10 years ago, I was in Airbnb. It’s a 50-people team to make this… speed testing framework… but now, I get 1 million events free per month from [Statsig], which is a dream… why not?” - Xiaojie

Insight: Modern tools democratize capabilities that used to require massive teams.


Decisions Requiring Confirmation

Architecture Decisions

  1. Explore data lake-first approach (S3 → PySpark → Postgres) before committing to Snowflake

    • Confirm: Is Breezy comfortable with this approach?
    • Confirm: What are Snowflake budget constraints if we do need it?
  2. Start with production pipeline, add analytics later

    • Confirm: Does product team agree analytics on MLS data is not priority?
    • Confirm: Any stakeholders who need MLS reporting now?

Timeline Decisions

  1. Phase 1 (Analytics) starts after Android launch (early January)

    • Confirm: What is exact Android launch date?
    • Confirm: When is team bandwidth available?
  2. Underbuilt is lowest priority of three workstreams

    • Confirm: Does product team agree?
    • Confirm: Any investors/customers demanding Underbuilt expansion?

Technical Risks Identified

1. CoreLogic Data Delivery Unknowns

Risk: Schema, delivery mechanism, or data quality worse than expected

Mitigation:

  • Get sample data ASAP
  • Prototype with 3 test markets before full rollout
  • Define acceptance criteria with Breezy (e.g., “95% of properties must have sale date + price”)

2. Sub-1-Second Query Performance on 160M Records

Risk: Postgres can’t handle query volume/complexity at required speed

Mitigation:

  • Proper indexing strategy
  • Potential for Elasticsearch or other specialized query engine
  • Caching layer for common queries
  • Consider geo-spatial database extensions (PostGIS)
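
An illustrative indexing pass matching the hypothetical schema from the comp-query sketch earlier: a GiST index for the geo filter plus a composite b-tree for the beds/baths predicate. Requires the PostGIS extension and appropriate privileges; whether this alone meets the sub-1-second target would need benchmarking.

```python
# Illustrative PostGIS setup and indexes for the assumed listings table.
import psycopg2

DDL = """
    CREATE EXTENSION IF NOT EXISTS postgis;

    ALTER TABLE listings
        ADD COLUMN IF NOT EXISTS geom geography(Point, 4326);
    UPDATE listings
        SET geom = ST_MakePoint(longitude, latitude)::geography
        WHERE geom IS NULL;

    CREATE INDEX IF NOT EXISTS listings_geom_gist
        ON listings USING GIST (geom);
    CREATE INDEX IF NOT EXISTS listings_beds_baths
        ON listings (beds, baths);
"""

with psycopg2.connect("dbname=breezy user=app") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```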

3. Daily Bulk Load Performance

Risk: Can’t process 160M records fast enough to meet daily refresh SLA

Mitigation:

  • Parallel processing with PySpark
  • Incremental updates if available from vendor
  • Pipeline monitoring and alerting

4. Team Bandwidth During Holidays

Risk: Key stakeholders (Greg traveling, team on Android launch) unavailable for decisions

Mitigation:

  • Don’t start until early January when team is available
  • Document all decisions clearly in async-friendly format
  • Use Slack/Loom for updates that don’t require live meetings


Follow-Up Questions for Next Meeting

CoreLogic/MLS Data

  1. Have you received schema documentation from CoreLogic?
  2. What is the daily delta size vs. full dataset size?
  3. What time of day does data become available?
  4. Do they provide data lineage or quality metrics?
  5. What are their SLAs for data delivery?

Product Analytics

  1. Can we get read access to MixPanel to audit current events?
  2. What are the top 3 analytics questions product team asks most often?
  3. Are there specific cohorts or segments you want to understand better?
  4. What metrics do you review in your weekly product meetings currently?

Underbuilt

  1. Can we see the current manual workflow for adding a new city?
  2. What is the average time to add one city’s building code data?
  3. How do you validate extraction accuracy currently?

Team/Timeline

  1. What is the Android launch date?
  2. When do you want to start Phase 1 (analytics foundation)?
  3. What is the critical deadline for MLS pipeline (if any)?

Next Steps

  1. Brainforge: Share Underbuilt extraction tool recommendations (Uttam)
  2. Brainforge: Update SOW based on technical insights from this call (Uttam)
  3. Brainforge: Send meeting summary to full team via Slack
  4. Breezy: Internal discussion on engagement timeline and priorities (Xiaojie, Greg, Jimsy, Sigal)
  5. Breezy: Get CoreLogic data delivery details (Greg)
  6. Both: Schedule follow-up after Breezy internal discussion

Meeting Effectiveness

What Went Well:

  • Deep technical discussion on MLS architecture options
  • Clear understanding of scale and performance requirements
  • Identified that Snowflake may not be needed for initial use case
  • Good alignment on prioritization (analytics first, then MLS, then Underbuilt)

What Could Improve:

  • Need more specific timeline commitments from Breezy
  • Would benefit from seeing actual CoreLogic data samples
  • Could have dug deeper on MixPanel event taxonomy during call

Recommended for Next Meeting:

  • Have CoreLogic schema documentation ready
  • Share MixPanel dashboard examples
  • Demo current comp generation process to understand transformation requirements