Root Cause Analysis (RCA) Memo — Template
About this document (Brainforge)
Internal conventions for how this file works in the repo. Remove this section before sharing the document as a client-facing artifact.
Titling and filename
Use [Client Name]: [Topic] — Root Cause Analysis for the document title. Example: LMNT: Pipeline Ingestion Failure — Root Cause Analysis.
Filename: {client}-rca-{topic-slug}.md under knowledge/clients/{client}/resources/.
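The filename convention above can be sketched as a small helper. This is a minimal illustration, not an established repo utility: the function names `slugify` and `rca_path` are hypothetical, and the slug rules (lowercase, ASCII only, hyphen-separated) are an assumption, since this template does not define how `{topic-slug}` is formed.

```python
import re
import unicodedata


def slugify(text: str) -> str:
    """Lowercase the text, drop accents, and join alphanumeric runs with hyphens.

    Assumed slug rules; the template itself does not specify them.
    """
    ascii_text = (
        unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def rca_path(client: str, topic: str) -> str:
    """Build the repo path for an RCA memo from a client name and a topic."""
    client_slug = slugify(client)
    return (
        f"knowledge/clients/{client_slug}/resources/"
        f"{client_slug}-rca-{slugify(topic)}.md"
    )


print(rca_path("LMNT", "Pipeline Ingestion Failure"))
```

For the example title above, this yields `knowledge/clients/lmnt/resources/lmnt-rca-pipeline-ingestion-failure.md`.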
When to use this template
This template covers two use cases, differentiated by the Type field below:
Use as an RCA (Incident) when: something went wrong — a pipeline failure, incorrect data delivered to a client, a model error discovered after delivery, or a reporting discrepancy that affected a client decision. This document closes the loop with the client on what happened, why, and what prevents it from happening again.
Use as an RCA (KPI Anomaly) when: a key metric spiked or dropped unexpectedly and needs explanation — but nothing broke. No pipeline failure, no data loss, no bug. The question is “why did the number change?” not “what broke?”
This is distinct from a Data Findings Memo, which investigates pre-existing data quality issues and ends in a corrected figure rather than a deployed fix. Use this template when the primary deliverable is: “something broke or changed unexpectedly, here is a clear account, here is what we did about it, and here is how we prevent recurrence.”
An RCA is a trust-building document. Write it with the assumption that the client is more interested in understanding and prevention than in assigning blame.
Do not use this template when:
- investigating a pre-existing data quality issue that needs corrected figures (use the Data Findings Memo)
- profiling a new data source (use the Discovery Memo)
- running a periodic health check (use the Data Quality Assessment)
[Client Name]: [Topic] — Root Cause Analysis
Prepared by: Brainforge ([names])
Prepared for: [Client stakeholder names and titles]
Date: YYYY-MM-DD
Type: [Incident / KPI Anomaly]
Incident / anomaly date: YYYY-MM-DD
Status: [Under investigation / Fix deployed / Monitoring / Closed / Explained]
Related artifacts
| Artifact | Link / path | Notes |
|---|---|---|
| Data Platform Documentation | [Google Sheet link] | Source catalog, metric definitions |
| Discovery Memo | [path to A1 memo] | Source profiling reference |
| Data Findings Memo (if escalated) | [path] | Prior investigation if this RCA follows a findings memo |
| Linear ticket | [Linear URL] | Investigation or fix ticket |
Executive Summary
[3–5 sentences. What happened? What was the impact? Is it fixed or explained? What is the one thing the client should walk away with?
For incidents: state what broke, when, and that it’s fixed. For anomalies: state what metric changed, by how much, and the root cause of the movement.]
Impact Assessment
| Dimension | Detail |
|---|---|
| Time range affected | [When did the issue begin? When was it resolved?] |
| Data or reports affected | [Which tables, dashboards, or reports were affected?] |
| Downstream impact | [Did this affect a client decision, board report, investor update, or operational workflow?] |
| Users affected | [Which client team members were working from affected data?] |
| Severity | [Low / Medium / High / Critical] |
Timeline
Complete this section in full for incidents. For anomalies, the timeline may be as simple as the date range over which the anomaly was observed.
| Timestamp (UTC) | Event |
|---|---|
| YYYY-MM-DD HH:MM | [What happened] |
| YYYY-MM-DD HH:MM | [First observed / reported] |
| YYYY-MM-DD HH:MM | [Investigation began] |
| YYYY-MM-DD HH:MM | [Root cause identified] |
| YYYY-MM-DD HH:MM | [Fix deployed or explanation confirmed] |
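Because the timeline is recorded in UTC, timestamps taken from local logs or chat messages need converting before they go into the table. A minimal sketch, assuming events are available as timezone-aware `datetime` values; the helper name `timeline_row` and the sample event are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def timeline_row(when: datetime, event: str) -> str:
    """Format one timeline table row with the timestamp converted to UTC."""
    # Naive datetimes carry no offset, so converting them would be a guess.
    if when.tzinfo is None or when.utcoffset() is None:
        raise ValueError("timestamp must be timezone-aware")
    utc = when.astimezone(timezone.utc)
    return f"| {utc:%Y-%m-%d %H:%M} | {event} |"


# Illustrative only: an event logged at 09:30 in a UTC-05:00 zone.
local = datetime(2024, 3, 5, 9, 30, tzinfo=timezone(timedelta(hours=-5)))
print(timeline_row(local, "Hypothetical alert fired"))
```

Rejecting naive timestamps is a deliberate choice here: a silently assumed offset is an easy way to put events in the wrong order.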
Root Cause Analysis
Immediate cause
[What directly caused the incident or anomaly? The technical fact, stated plainly.]
Contributing factors
[What conditions made the immediate cause possible or made the impact worse? Use the “5 Whys” method: for each cause, ask why it existed, and follow the chain until you reach a systemic or process-level root cause rather than a one-time mistake. Five is a guideline, so add or remove steps below as the chain requires.]
- Why [immediate cause]? — [Because…]
- Why [that cause]? — [Because…]
- Why [that cause]? — [This is the systemic root cause: …]
What was not the cause
[Optional but valuable. If the client may suspect a different cause, address it directly.]
What We Did About It
For incidents: fix applied
[What was done to contain and resolve the issue, the date the fix was deployed, how it was verified, and what data correction was issued if incorrect data was delivered.]
For anomalies: explanation and monitoring
[What the data shows. Whether the movement was a real business signal, a data artifact, or a seasonal pattern. What monitoring has been added to catch it next time.]
Prevention
[The most important section. For each action, name what is changing, who owns it, and when it will be complete.]
| Action | Description | Owner | Target date |
|---|---|---|---|
| [Action name] | [What specifically is changing] | [Name] | YYYY-MM-DD |
Lessons Learned
[1–3 honest observations about what this incident or anomaly revealed, generalized so the team can apply them elsewhere.]
- [Lesson] — [What this revealed about process, tooling, or assumptions.]
Appendix: Pre-handoff QA Checklist
- Type field is set correctly (Incident vs KPI Anomaly) — determines which sections are primary
- Executive summary states the impact and current status in plain language
- Timeline is complete and honest (omitting unflattering facts erodes trust)
- Root cause follows the 5 Whys to a systemic level and does not stop at a one-time mistake
- Each prevention action states a specific change, a named owner, and a target date, with no vague commitments
- For incidents: fix is verified and data correction is communicated
- For anomalies: the metric movement is classified as a real business signal, a data artifact, or a seasonal pattern
- All placeholders are filled or marked as intentional TBD
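The last check above can be partly automated. The sketch below is an assumption about how one might do it, not an existing repo tool: it flags square-bracket placeholders and leftover `YYYY-MM-DD` / `HH:MM` stubs, skips Markdown links, and treats any bracket containing `TBD` as intentional. The function name `unfilled_placeholders` is hypothetical.

```python
import re

# Square-bracket text not followed by "(" so Markdown links are skipped.
BRACKET = re.compile(r"\[([^\[\]]+)\](?!\()")
# Literal date/time stubs left over from the template.
DATE_STUB = re.compile(r"YYYY-MM-DD|HH:MM")


def unfilled_placeholders(markdown: str) -> list[tuple[int, str]]:
    """Return (line number, placeholder) pairs that still need attention."""
    findings = []
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        for match in BRACKET.finditer(line):
            # Brackets mentioning TBD are treated as intentionally open.
            if "TBD" not in match.group(1):
                findings.append((lineno, match.group(0)))
        for match in DATE_STUB.finditer(line):
            findings.append((lineno, match.group(0)))
    return findings


sample = (
    "Date: YYYY-MM-DD\n"
    "Owner: [Name]\n"
    "Target: [TBD, pending client reply]\n"
    "See [ticket](https://example.com)\n"
)
for lineno, text in unfilled_placeholders(sample):
    print(f"line {lineno}: {text}")
```

A script like this only catches leftover template text; it cannot judge whether a filled-in value is accurate, so the manual review still applies.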