Meeting Notes: Technical Deep Dive - Data Stack & MLS Architecture

Date: December 10, 2025
Attendees:

  • Brainforge: Uttam Kumaran, Awaish Kumar
  • Breezy: Xiaojie Zhang (Engineering Lead, China team), Greg (Engineering Leader, Founding Engineer)

Recording/Transcript: brainforge_breezy_technical_discussion_12_10_25.md


Executive Summary

Technical deep-dive meeting to discuss Breezy’s data infrastructure needs across three workstreams: (1) Product analytics foundation using MixPanel/Statsig, (2) MLS bulk data pipeline for comp generation requiring sub-1-second query performance, and (3) Underbuilt expansion and automation opportunities. Key decision: explore data lake-based architecture (S3 → PySpark → Postgres) to potentially bypass Snowflake for production use cases while maintaining analytics optionality. Team is currently focused on Android launch (end of year) with limited bandwidth for data initiatives until early January.


Key Discussion Points

1. Product Analytics Stack

Context: Breezy has existing MixPanel and Statsig implementations streaming events. Need to audit, establish event taxonomy, and build core measurement dashboards.

Discussion:

  • MixPanel already receiving events (confirmed)
  • Statsig being used for potential A/B testing framework
  • Current decision-making relies on founder intuition and Slack feedback vs. cohort behavior
  • ~200-400 beta users currently, expecting 300+ waitlist signups by January 20th
  • Need DAU/MAU, retention curves (D0/D7/D30), feature adoption patterns, activation funnels

Brainforge Approach Presented:

  • Audit existing event taxonomy
  • Build core dashboards within MixPanel (no need for external BI tool initially)
  • Track cohorts: sign-up date, invite source, James network vs. organic
  • Understand paths into product usage
  • Layer AI chat-with-data tools for broader business access to insights

Decisions Made:

  • Phase 1 will focus on analytics foundation
  • Use existing MixPanel/Statsig rather than rebuild instrumentation

Open Questions:

  • What is current event taxonomy completeness?
  • Which features/metrics are highest priority for product team?
  • When can we get access to MixPanel/Statsig?

2. MLS Data Pipeline Architecture (Major Focus)

Context: Breezy needs to ingest bulk MLS data from CoreLogic (now called Cotality) to generate accurate property comps. The current third-party provider is only ~85% accurate, causing data quality issues.

Scale & Requirements

Data Volume:

  • 160 million records daily
  • ~300 columns wide
  • Bulk data files (not real-time streaming)
  • Question: Full dumps or incremental deltas? (TBD with vendor)

Performance Requirements:

  • Sub-1-second query latency for comp generation API
  • Needs “super high availability” and “super fast” responses
  • Replicating Trestle-quality experience (CoreLogic’s real-time product)

Data Quality Drivers:

  • Need “super accurate data” - current provider has visible seams
  • Wrong/missing sale dates, listing dates, days on market
  • Currently cobbling together multiple APIs to fill gaps
  • Moving to the “gold standard” (CoreLogic/Cotality) to eliminate data quality issues

Architecture Options Discussed

Option 1: Traditional Warehouse Approach

CoreLogic → S3 → Snowflake → dbt transformations → S3 → Postgres
  • Pros: Full analytics capabilities, well-understood pattern
  • Cons: Additional latency hop, Snowflake costs ($500-1500/month), not needed if only use case is production queries

Option 2: Data Lake In-Memory Processing (Recommended to explore)

CoreLogic → S3 → PySpark/DuckDB (in-memory transforms) → Postgres
  • Pros: Faster SLAs, lower costs, fewer hops, direct operational DB loading
  • Cons: Need separate solution if analytics required later
  • Can still add Snowflake on top of data lake later without losing work
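
To make Option 2 concrete, the sketch below uses DuckDB to read the S3 landing zone, dedupe in memory, and load straight into Postgres. This is a minimal sketch, not confirmed Breezy infrastructure: the bucket path, column names, and connection details are all illustrative.

```python
# Minimal sketch of Option 2: read the S3 landing zone with DuckDB,
# transform in memory, and land the result directly in Postgres.
# Bucket, schema, and connection details are illustrative placeholders.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")      # S3 reads (creds from env/secret)
con.execute("INSTALL postgres; LOAD postgres;")  # Postgres attach/write
con.execute("ATTACH 'dbname=breezy host=localhost user=app' AS pg (TYPE postgres)")

# Dedupe to the latest record per listing and keep only the columns the
# comp API needs, then load into the production database.
con.execute("""
    CREATE TABLE pg.public.listings AS
    SELECT listing_id, beds, baths, sale_date, sale_price,
           latitude, longitude
    FROM (
        SELECT *,
               row_number() OVER (
                   PARTITION BY listing_id ORDER BY updated_at DESC) AS rn
        FROM read_parquet('s3://breezy-mls-landing/corelogic/*.parquet')
    )
    WHERE rn = 1
""")
```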

Option 3: Hybrid

CoreLogic → S3 → PySpark transformations → Both Postgres (production) + Snowflake (analytics)
  • Best of both worlds
  • Start with production pipeline, add Snowflake when analytics needed

Technical Details:

  • CoreLogic can deliver via SFTP, S3, Snowflake, or Databricks (flexible)
  • Prefer S3/data lake landing for maximum flexibility
  • Use tools like Snowpipe, Polytomic, or custom PySpark jobs for ingestion
  • PySpark can process billions of rows in parallel in seconds
  • DuckDB for in-memory transformations also discussed
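
A hedged sketch of what a daily PySpark bulk job could look like at the scale above (~160M rows, ~300 columns). Paths, schema, partition count, and the JDBC target are placeholder assumptions, and the cluster would need the hadoop-aws and Postgres JDBC jars on its classpath.

```python
# Sketch of a daily PySpark bulk load; all names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mls_daily_load").getOrCreate()

raw = spark.read.parquet("s3a://breezy-mls-landing/corelogic/latest/")

# Light in-flight cleanup: normalize types, drop rows with no usable key.
clean = (
    raw.withColumn("sale_date", F.to_date("sale_date"))
       .filter(F.col("listing_id").isNotNull())
)

# Parallel JDBC write into a staging table; the partition count would be
# tuned to the cluster and to what the Postgres instance can absorb.
(clean.repartition(64)
      .write.format("jdbc")
      .option("url", "jdbc:postgresql://db.internal:5432/breezy")
      .option("dbtable", "listings_staging")
      .option("user", "app")            # credentials via a secrets manager
      .option("driver", "org.postgresql.Driver")
      .mode("overwrite")
      .save())
```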

Decisions Made:

  • Explore data lake approach (Option 2/3) as primary path
  • Don’t commit to Snowflake unless analytics use case materializes
  • Need to confirm with CoreLogic: full dumps vs. deltas, delivery frequency, SLAs

Open Questions:

  • What is the delta size each day? (Incremental vs. full refresh)
  • What are product SLAs for data freshness? (Daily? Hourly?)
  • Does CoreLogic provide schema documentation?
  • What transformations are needed before landing in Postgres?

Production vs. Analytics Use Cases

Current Use Case (Production):

  • Generate comps for properties based on subject property
  • Search criteria: bedrooms, bathrooms, radius from long/lat
  • Query comparable properties super fast
  • Serve via Breezy API to agents
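
For illustration, the comp lookup described above could look like the following against a PostGIS-enabled Postgres. The `listings` table, its columns, and the connection string are hypothetical, not Breezy's actual schema.

```python
# Hypothetical comp lookup: filter on beds/baths plus a radius search
# around the subject property's long/lat.
import psycopg2

COMP_QUERY = """
    SELECT listing_id, sale_price, sale_date, beds, baths
    FROM listings
    WHERE beds = %(beds)s
      AND baths = %(baths)s
      AND ST_DWithin(
              geom,
              ST_MakePoint(%(lon)s, %(lat)s)::geography,
              %(radius_m)s)              -- radius in meters
    ORDER BY sale_date DESC
    LIMIT 20
"""

with psycopg2.connect("dbname=breezy user=app") as conn:
    with conn.cursor() as cur:
        cur.execute(COMP_QUERY, {
            "beds": 3, "baths": 2,
            "lat": 30.2672, "lon": -97.7431,  # subject property
            "radius_m": 1609,                 # ~1 mile
        })
        comps = cur.fetchall()
```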

Future Use Cases (Analytics - not prioritized yet):

"The possibilities are endless" - Greg
  • Neighborhood/area analysis
  • Market trend analysis
  • Property appreciation patterns
  • “It’s gonna be a waste if we just transforming data and just make comps” - Xiaojie
  • But: No current product requirements for analytics on listings data

Insight: Don’t over-engineer for undefined future. Build production pipeline first, add analytics layer when needed.


3. Underbuilt Product & Data Challenges

Context: Underbuilt extracts building codes and zoning data from county/municipal sources (PDFs, Word docs, county websites, some APIs) to help agents understand development potential.

Current State:

  • “Very manual” - hardcore approach (Xiaojie’s words)
  • County-level data with huge variance in format and accessibility
  • ~500 cities currently covered
  • China team focused on “pipeline and AI side”

Moving Toward:

  • “Large language model-based” extraction
  • Automated workflow for data acquisition
  • Still need some scraping/manual pulling initially

Technical Challenges:

  • Wide variety of source formats (PDFs, Word docs, websites, APIs)
  • Domain-specific knowledge required (building codes, zoning terms)
  • Need to understand building code charts and regulations
  • “5 years ago would have been unsolvable” - Xiaojie

Opportunities Discussed:

  • Modern OCR and LLM extraction tools dramatically improved
  • Brainforge has evaluated multiple extraction platforms
  • Potential to condense 50-page building code guidelines to 3-page TLDR
  • Could build system that layers latest extraction tools (Google, Contextual, etc.)
  • PDFs don’t go anywhere - can iterate and improve extraction over time
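
To make the extraction idea concrete, here is a rough shape for one pass over a county PDF. pypdf handles the text pull; `llm_summarize` is a hypothetical stand-in for whichever extraction platform (Google, Contextual, etc.) is ultimately selected.

```python
# Rough shape of an LLM-based extraction pass for Underbuilt.
from pypdf import PdfReader

def extract_pdf_text(path: str) -> str:
    """Pull raw text from a county building-code PDF."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def llm_summarize(text: str) -> str:
    """Placeholder: prompt the chosen model to condense ~50 pages of
    building-code guidelines into a ~3-page TLDR of setbacks, height
    limits, and allowed uses."""
    raise NotImplementedError("wire up the selected extraction platform")

# Because the source PDFs don't change, this pass can be re-run as
# extraction tooling improves.
tldr = llm_summarize(extract_pdf_text("county_building_code.pdf"))
```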

Decisions Made:

  • Underbuilt is third priority (after analytics and MLS)
  • Brainforge to share extraction tool recommendations
  • Explore proof-of-concept when team has bandwidth

Open Questions:

  • What is current data acquisition process step-by-step?
  • Where does data land after extraction?
  • What accuracy threshold is acceptable?
  • How do we measure extraction quality?

Quote:

“We’re the only ones doing this, as far as I know” - Greg

Insight: This could be a proprietary moat if executed well. Unique dataset that competitors can’t easily replicate.


Technical Details Captured

Architecture Decisions

  1. Decoupling production and analytics pipelines

    • Production: Optimize for speed (sub-1s queries)
    • Analytics: Can tolerate higher latency, prioritize query flexibility
    • Don’t force both through same architecture
  2. Data lake as single source of truth

    • Land all data in S3/data lake
    • Multiple consumers can read from lake (PySpark, Snowflake, etc.)
    • Avoid vendor lock-in
  3. Transformation location trade-offs

    • In-warehouse (Snowflake/dbt): Better for analytics, easier debugging
    • In-flight (PySpark/DuckDB): Faster for production, lower cost
    • Choice depends on use case

Data Sources Discussed

| Source | Purpose | Volume/Frequency | Status |
| --- | --- | --- | --- |
| CoreLogic/Cotality | MLS bulk data | 160M records, ~300 cols, daily | Negotiating |
| MixPanel | Product events | Current beta users (~200-400) | Active |
| Statsig | A/B testing framework | Same as MixPanel | Active |
| County/Municipal | Building codes, zoning (PDFs, Word, APIs) | 500 cities covered | Manual process |

Tools/Platforms Mentioned

Current Stack:

  • MixPanel (product analytics)
  • Statsig (experimentation platform)
  • Postgres (production database)
  • CoreLogic/Cotality (MLS data provider)
  • Trestle (CoreLogic real-time product - using for California only)

Brainforge Recommended Tools:

  • Fivetran or Polytomic (data ingestion)
  • Snowflake (data warehouse - if analytics needed)
  • dbt (transformation layer)
  • PySpark or DuckDB (in-memory transformations)
  • Klaviyo (marketing activation - discussed)
  • Modern extraction tools for Underbuilt (specific names to be shared)

Tools Mentioned in Context:

  • Airflow (Xiaojie’s experience at Airbnb - “there’s better options now”)
  • Superset (same - “good open source but we have better choices”)
  • Granola (AI note-taker both teams use)
  • Customer.io (mentioned as starting next week per Slack)

Team Tech Background

Xiaojie Zhang:

  • Worked at Airbnb (product engineer + data engineer)
  • Experience with Airflow, Superset
  • Now focused on China dev team, AI/pipeline work
  • “More passionate about building software for customers”

Greg:

  • Worked at Afterpay from small scale-up through hockey stick growth
  • Core team, backend engineer and engineering manager for 8 years
  • Currently in Eastern Malaysia (just landed, on travel)
  • Leading MLS data project

Awaish Kumar:

  • 8+ years as data engineer
  • Experience at startups and growth-stage companies
  • Previously in vacation rental business (similar to property domain)
  • Built data infrastructure from scratch multiple times

Uttam Kumaran:

  • Background in data engineering (New York, WeWork data team)
  • Led product at data startup
  • Started Brainforge 3 years ago
  • Based in Austin

Pain Points Identified

1. Data Quality from Current MLS Provider

Current State: Using a third-party provider that handles individual MLS relationships and provides API access across all MLSs

Problem:

  • Only ~85% accurate
  • Visible “seams” in data
  • Wrong sale dates, listing dates
  • Missing days on market
  • Cobbling together multiple smaller APIs to fill gaps

Impact: Undermines trust with agents; core value prop is accuracy

Solution: Move to CoreLogic bulk data (gold standard)

2. Real-Time Data Access Complexity

Current State: CoreLogic Trestle provides real-time data but requires individual broker agreements with each MLS

Problem:

  • Implemented Trestle for California only
  • “Not a straightforward process” to set up agreements in all states
  • Can’t scale to national coverage quickly

Impact: Forced to use bulk data (not real-time) for most markets

Solution: Accept bulk data trade-off, optimize for accuracy over real-time

3. Manual Underbuilt Data Processing

Current State: “Very manual” extraction of building codes from county sources

Problem:

  • Doesn’t scale to 500+ cities efficiently
  • High variance in source formats
  • Requires domain expertise per county

Impact: Limits expansion velocity, high operational cost

Solution: Layer LLM/OCR extraction tools, build automated workflow

4. Limited Product Analytics Visibility

Current State: Decisions based on “founder intuition and Slack feedback”

Problem:

  • Can’t identify retention drivers
  • Don’t know which features predict paid conversion
  • No cohort analysis or funnel optimization

Impact: Risk launching paid tiers without understanding unit economics

Solution: Phase 1 analytics foundation

5. Team Bandwidth Constraints

Current State: Team “swamped” with Android launch (end-of-year deadline)

Problem:

  • Limited engineering hours for data initiatives
  • Greg on travel (just landed in Malaysia during call)
  • “Tons of random shit” competing for attention

Impact: Data work may slip or get de-prioritized

Solution: Start after Android launch (early January), clear prioritization

Requirements Gathered

Must-Have

MLS Pipeline

  • Ingest 160M records daily from CoreLogic
  • Sub-1-second query performance for comp generation
  • Super high availability for production API
  • Handle ~300 columns of property data
  • Support incremental updates, if available from vendor (see the upsert sketch after this list)
  • Data quality monitoring and alerting
  • Fallback/redundancy if pipeline fails
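
If CoreLogic does supply daily deltas, a minimal upsert from staging into the serving table could look like the sketch below. It assumes a unique index on `listing_id`; table and connection names are illustrative.

```python
# Hypothetical daily upsert keyed on listing_id (requires a unique
# constraint on that column).
import psycopg2

UPSERT = """
    INSERT INTO listings (listing_id, beds, baths, sale_date, sale_price,
                          latitude, longitude)
    SELECT listing_id, beds, baths, sale_date, sale_price,
           latitude, longitude
    FROM listings_staging
    ON CONFLICT (listing_id) DO UPDATE SET
        beds       = EXCLUDED.beds,
        baths      = EXCLUDED.baths,
        sale_date  = EXCLUDED.sale_date,
        sale_price = EXCLUDED.sale_price,
        latitude   = EXCLUDED.latitude,
        longitude  = EXCLUDED.longitude
"""

with psycopg2.connect("dbname=breezy user=app") as conn:
    with conn.cursor() as cur:
        cur.execute(UPSERT)
```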

Product Analytics

  • Audit existing MixPanel event taxonomy
  • D0/D7/D30 retention dashboards
  • DAU/MAU tracking by cohort
  • Feature adoption matrix (what predicts retention?)
  • Funnel analysis (onboarding, feature activation)
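
As a sketch of how the D0/D7/D30 numbers could be computed once events are exported from MixPanel: the query below assumes a Parquet export with `user_id` and `event_time` columns, a layout that has not been audited.

```python
# Sketch of a D0/D7/D30 retention pull over an assumed events export.
import duckdb

EVENTS = "read_parquet('s3://breezy-analytics/mixpanel_events/*.parquet')"

retention = duckdb.sql(f"""
    WITH first_seen AS (
        SELECT user_id, min(event_time::DATE) AS d0
        FROM {EVENTS}
        GROUP BY user_id
    ),
    activity AS (
        SELECT DISTINCT user_id, event_time::DATE AS d
        FROM {EVENTS}
    )
    SELECT f.d0 AS cohort_date,
           count(DISTINCT f.user_id) AS users,
           count(DISTINCT a.user_id) FILTER (WHERE a.d = f.d0 + 7)  AS d7,
           count(DISTINCT a.user_id) FILTER (WHERE a.d = f.d0 + 30) AS d30
    FROM first_seen f
    LEFT JOIN activity a USING (user_id)
    GROUP BY 1
    ORDER BY 1
""").df()
```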

Nice-to-Have

MLS Analytics (Future)

  • Neighborhood/area trend analysis
  • Market analytics for agents
  • Property appreciation modeling
  • Comp accuracy benchmarking vs. Zillow/Redfin

Underbuilt Automation

  • Automated extraction from PDFs/Word docs
  • LLM-based summarization of building codes
  • Expansion prioritization model (demand heatmap)

Action Items

  • Share extraction tool recommendations for Underbuilt - Owner: Uttam - Due: This week - Status: Not Started
  • Get answers from CoreLogic on data delivery - Owner: Greg - Due: TBD - Status: Not Started
    • Full dumps vs. deltas?
    • Delivery frequency and timing?
    • Schema documentation available?
  • Internal discussion on engagement scope and timeline - Owner: Xiaojie, Greg - Due: Before next meeting - Status: Not Started
  • Edit and send updated SOW document - Owner: Uttam - Due: After this meeting - Status: In Progress
  • Provide MixPanel/Statsig access - Owner: Breezy team - Due: Upon engagement start - Status: Not Started
  • Schedule follow-up after internal Breezy discussion - Owner: Uttam - Due: TBD - Status: Not Started

Quotes & Insights

On Team Dynamics

“We’re a startup, we’re scrappy, there are tons of other random, to be very honest, random shit there.” - Xiaojie Zhang

Insight: Need to be flexible and prioritize ruthlessly. Don’t over-plan, deliver incrementally.

On Data Tools Evolution

“5 years ago, I feel like… now we have a… there’s a lot of better and cheaper options that are more stable than even 5 years ago.” - Uttam

“It’s unsolvable. It’s unsolvable [5 years ago].” - Xiaojie (about Underbuilt problem)

Insight: Modern data stack enables previously impossible use cases. Don’t anchor on old approaches.

On Product Focus

“Our use case is very narrow and focused… making amazing comps that are super accurate, super fast, that builds trust with agents.” - Greg

Insight: Don’t build analytics infrastructure for undefined future needs. Solve the immediate problem.

On MLS Data Quality

“We need super accurate data… they may be 85% there, but we need it to be as high as possible.” - Greg

“CoreLogic… is the gold standard for this stuff. There’s not really an option that can give us better data.” - Greg

Insight: Data accuracy is existential for product value prop. Worth the engineering complexity.

On Underbuilt Competitive Moat

“As far as I know, we’re the only ones doing this.” - Greg

Insight: Unique proprietary dataset = defensible competitive advantage.

On Early Analytics

“I’m always pushing GMC [Jimsy], saying data solution, we should take it as early as possible to gather user signals.” - Xiaojie

“Almost 10 years ago, I was in Airbnb. It’s a 50-people team to make this… speed testing framework… but now, I get 1 million events free per month from [Statsig], which is a dream… why not?” - Xiaojie

Insight: Modern tools democratize capabilities that used to require massive teams.


Decisions Requiring Confirmation

Architecture Decisions

  1. Explore data lake-first approach (S3 → PySpark → Postgres) before committing to Snowflake

    • Confirm: Is Breezy comfortable with this approach?
    • Confirm: What are Snowflake budget constraints if we do need it?
  2. Start with production pipeline, add analytics later

    • Confirm: Does product team agree analytics on MLS data is not priority?
    • Confirm: Any stakeholders who need MLS reporting now?

Timeline Decisions

  1. Phase 1 (Analytics) starts after Android launch (early January)

    • Confirm: What is exact Android launch date?
    • Confirm: When is team bandwidth available?
  2. Underbuilt is lowest priority of three workstreams

    • Confirm: Does product team agree?
    • Confirm: Any investors/customers demanding Underbuilt expansion?

Technical Risks Identified

1. CoreLogic Data Delivery Unknowns

Risk: Schema, delivery mechanism, or data quality worse than expected

Mitigation:

  • Get sample data ASAP
  • Prototype with 3 test markets before full rollout
  • Define acceptance criteria with Breezy (e.g., “95% of properties must have sale date + price”)

2. Sub-1-Second Query Performance on 160M Records

Risk: Postgres can’t handle query volume/complexity at required speed

Mitigation:

  • Proper indexing strategy
  • Potential for Elasticsearch or other specialized query engine
  • Caching layer for common queries
  • Consider geo-spatial database extensions (PostGIS)
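
An illustrative indexing pass matching the hypothetical schema from the comp-query sketch earlier: a GiST index for the geo filter plus a composite b-tree for the beds/baths predicate. Requires the PostGIS extension and appropriate privileges; whether this alone meets the sub-1-second target would need benchmarking.

```python
# Illustrative PostGIS setup and indexes for the assumed listings table.
import psycopg2

DDL = """
    CREATE EXTENSION IF NOT EXISTS postgis;

    ALTER TABLE listings
        ADD COLUMN IF NOT EXISTS geom geography(Point, 4326);
    UPDATE listings
        SET geom = ST_MakePoint(longitude, latitude)::geography
        WHERE geom IS NULL;

    CREATE INDEX IF NOT EXISTS listings_geom_gist
        ON listings USING GIST (geom);
    CREATE INDEX IF NOT EXISTS listings_beds_baths
        ON listings (beds, baths);
"""

with psycopg2.connect("dbname=breezy user=app") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```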

3. Daily Bulk Load Performance

Risk: Can’t process 160M records fast enough to meet daily refresh SLA

Mitigation:

  • Parallel processing with PySpark
  • Incremental updates if available from vendor
  • Pipeline monitoring and alerting

4. Team Bandwidth During Holidays

Risk: Key stakeholders (Greg traveling, team on Android launch) unavailable for decisions

Mitigation:

  • Don’t start until early January when team is available
  • Document all decisions clearly in async-friendly format
  • Use Slack/Loom for updates that don’t require live meetings


Follow-Up Questions for Next Meeting

CoreLogic/MLS Data

  1. Have you received schema documentation from CoreLogic?
  2. What is the daily delta size vs. full dataset size?
  3. What time of day does data become available?
  4. Do they provide data lineage or quality metrics?
  5. What are their SLAs for data delivery?

Product Analytics

  1. Can we get read access to MixPanel to audit current events?
  2. What are the top 3 analytics questions product team asks most often?
  3. Are there specific cohorts or segments you want to understand better?
  4. What metrics do you review in your weekly product meetings currently?

Underbuilt

  1. Can we see the current manual workflow for adding a new city?
  2. What is the average time to add one city’s building code data?
  3. How do you validate extraction accuracy currently?

Team/Timeline

  1. What is the Android launch date?
  2. When do you want to start Phase 1 (analytics foundation)?
  3. What is the critical deadline for MLS pipeline (if any)?

Next Steps

  1. Brainforge: Share Underbuilt extraction tool recommendations (Uttam)
  2. Brainforge: Update SOW based on technical insights from this call (Uttam)
  3. Brainforge: Send meeting summary to full team via Slack
  4. Breezy: Internal discussion on engagement timeline and priorities (Xiaojie, Greg, Jimsy, Sigal)
  5. Breezy: Get CoreLogic data delivery details (Greg)
  6. Both: Schedule follow-up after Breezy internal discussion

Meeting Effectiveness

What Went Well:

  • Deep technical discussion on MLS architecture options
  • Clear understanding of scale and performance requirements
  • Identified that Snowflake may not be needed for initial use case
  • Good alignment on prioritization (analytics first, then MLS, then Underbuilt)

What Could Improve:

  • Need more specific timeline commitments from Breezy
  • Would benefit from seeing actual CoreLogic data samples
  • Could have dug deeper on MixPanel event taxonomy during call

Recommended for Next Meeting:

  • Have CoreLogic schema documentation ready
  • Share MixPanel dashboard examples
  • Demo current comp generation process to understand transformation requirements