Root Cause Analysis (RCA) Memo — Template

About this document (Brainforge)

Internal conventions for how this file works in the repo. Strip this section, or export without it, when sharing a client-facing artifact.

Titling and filename

Use [Client Name]: [Topic] — Root Cause Analysis for the document title. Example: LMNT: Pipeline Ingestion Failure — Root Cause Analysis.

Filename: {client}-rca-{topic-slug}.md under knowledge/clients/{client}/resources/.
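As a rough illustration of the naming convention above, the path can be derived from the client name and topic. This is only a sketch: the `slugify` and `rca_path` helpers are hypothetical, and the slug rules (lowercase, hyphens in place of any non-alphanumeric run, lowercased client folder) are assumptions rather than a documented Brainforge standard.

```python
import re
from pathlib import PurePosixPath


def slugify(text: str) -> str:
    """Lowercase the text and collapse each run of non-alphanumerics into one hyphen."""
    # Assumed slug rules; adjust if the repo uses a different convention.
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def rca_path(client: str, topic: str) -> PurePosixPath:
    """Build knowledge/clients/{client}/resources/{client}-rca-{topic-slug}.md."""
    client_slug = slugify(client)
    filename = f"{client_slug}-rca-{slugify(topic)}.md"
    return PurePosixPath("knowledge/clients") / client_slug / "resources" / filename


print(rca_path("LMNT", "Pipeline Ingestion Failure"))
# knowledge/clients/lmnt/resources/lmnt-rca-pipeline-ingestion-failure.md
```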

When to use this template

This template covers two use cases, differentiated by the Type field below:

Use as an RCA (Incident) when: something went wrong — a pipeline failure, incorrect data delivered to a client, a model error discovered after delivery, or a reporting discrepancy that affected a client decision. This document closes the loop with the client on what happened, why, and what prevents it from happening again.

Use as an RCA (KPI Anomaly) when: a key metric spiked or dropped unexpectedly and needs explanation — but nothing broke. No pipeline failure, no data loss, no bug. The question is “why did the number change?” not “what broke?”

This is distinct from a Data Findings Memo, which investigates pre-existing data quality issues and whose outcome is a corrected figure rather than a deployed fix. Use this template when the primary deliverable is: “something broke or changed unexpectedly, here is a clear account, here is what we did about it, and here is how we prevent recurrence.”

An RCA is a trust-building document. Write it with the assumption that the client is more interested in understanding and prevention than in assigning blame.

Do not use this template when:

  • investigating a pre-existing data quality issue that needs corrected figures (use the Data Findings Memo)
  • profiling a new data source (use the Discovery Memo)
  • running a periodic health check (use the Data Quality Assessment)

[Client Name]: [Topic] — Root Cause Analysis

Prepared by: Brainforge ([names])
Prepared for: [Client stakeholder names and titles]
Date: YYYY-MM-DD
Type: [Incident / KPI Anomaly]
Incident / anomaly date: YYYY-MM-DD
Status: [Under investigation / Fix deployed / Monitoring / Closed / Explained]


| Artifact | Link / path | Notes |
| --- | --- | --- |
| Data Platform Documentation | [Google Sheet link] | Source catalog, metric definitions |
| Discovery Memo | [path to A1 memo] | Source profiling reference |
| Data Findings Memo (if escalated) | [path] | Prior investigation if this RCA follows a findings memo |
| Linear ticket | [Linear URL] | Investigation or fix ticket |

Executive Summary

[3–5 sentences. What happened? What was the impact? Is it fixed or explained? What is the one thing the client should walk away with?

For incidents: state what broke, when, and that it’s fixed. For anomalies: state what metric changed, by how much, and the root cause of the movement.]


Impact Assessment

| Dimension | Detail |
| --- | --- |
| Time range affected | [When did the issue begin? When was it resolved?] |
| Data or reports affected | [Which tables, dashboards, or reports were affected?] |
| Downstream impact | [Did this affect a client decision, board report, investor update, or operational workflow?] |
| Users affected | [Which client team members were working from affected data?] |
| Severity | [Low / Medium / High / Critical] |

Timeline

For incidents, complete the full timeline. For anomalies, the timeline may be as simple as the date range over which the anomaly was observed.

| Timestamp (UTC) | Event |
| --- | --- |
| YYYY-MM-DD HH:MM | [What happened] |
| YYYY-MM-DD HH:MM | [First observed / reported] |
| YYYY-MM-DD HH:MM | [Investigation began] |
| YYYY-MM-DD HH:MM | [Root cause identified] |
| YYYY-MM-DD HH:MM | [Fix deployed or explanation confirmed] |

Root Cause Analysis

Immediate cause

[What directly caused the incident or anomaly? The technical fact, stated plainly.]

Contributing factors

[What conditions made the immediate cause possible or made the impact worse? Use the “5 Whys” method: for each cause, ask why it existed, and follow the chain until you reach a systemic or process-level root cause rather than a one-time mistake.]

  • Why [immediate cause]? — [Because…]
  • Why [that cause]? — [Because…]
  • Why [that cause]? — [This is the systemic root cause: …]

What was not the cause

[Optional but valuable. If the client may suspect a different cause, address it directly.]


What We Did About It

For incidents: fix applied

[What was done to stop the bleeding. Date deployed. How it was verified. Data correction if incorrect data was delivered.]

For anomalies: explanation and monitoring

[What the data shows. Whether the movement was a real business signal, a data artifact, or a seasonal pattern. What monitoring has been added to catch it next time.]


Prevention

[The most important section. For each action, name what is changing, who owns it, and when it will be complete.]

| Action | Description | Owner | Target date |
| --- | --- | --- | --- |
| [Action name] | [What specifically is changing] | [Name] | YYYY-MM-DD |

Lessons Learned

[1–3 honest observations about what this revealed. Useful generalizations for the team.]

  • [Lesson] — [What this revealed about process, tooling, or assumptions.]

Appendix: Pre-handoff QA Checklist

  • Type field is set correctly (Incident vs KPI Anomaly) — determines which sections are primary
  • Executive summary states the impact and current status in plain language
  • Timeline is complete and honest (omitting unflattering facts erodes trust)
  • Root cause follows the 5-Whys to a systemic level, not a one-time mistake
  • Prevention actions are specific, named, and dated — not vague commitments
  • For incidents: fix is verified and data correction is communicated
  • For anomalies: the metric movement is explained as business signal or artifact
  • All placeholders are filled or marked as intentional TBD
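
The last checklist item can be partly automated. The sketch below scans a draft for leftover bracketed placeholders and unfilled date stubs. It is a hypothetical helper, not part of any existing Brainforge tooling, and it assumes that an intentional TBD is written inside the brackets (for example `[TBD: pending client reply]`) and that markdown links and checkboxes should be ignored.

```python
import re

# Bracketed template text such as [Name]; may span lines. Markdown links [text](url) are skipped.
BRACKET = re.compile(r"\[([^\[\]]+)\](?!\()")
# Date and timestamp stubs copied from the template.
DATE_STUB = re.compile(r"YYYY-MM-DD(?: HH:MM)?")


def find_unfilled(text: str) -> list[tuple[int, str]]:
    """Return (line number, token) pairs for placeholders that still look unfilled."""
    hits = []
    for match in BRACKET.finditer(text):
        inner = match.group(1).strip()
        # Assumed conventions: "TBD" marks an intentional gap; "[ ]" and "[x]" are checkboxes.
        if "TBD" in inner or inner.lower() in {"", "x"}:
            continue
        hits.append((text.count("\n", 0, match.start()) + 1, match.group(0)))
    for match in DATE_STUB.finditer(text):
        hits.append((text.count("\n", 0, match.start()) + 1, match.group(0)))
    return sorted(hits)


sample = (
    "Date: YYYY-MM-DD\n"
    "Owner: [Name]\n"
    "Target: [TBD: pending client reply]\n"
    "See [ticket](https://example.com)\n"
)
for lineno, token in find_unfilled(sample):
    print(lineno, token)
# 1 YYYY-MM-DD
# 2 [Name]
```

A check like this only flags candidates. Legitimate bracketed text in a finished memo would also be reported, so the output still needs a human read before handoff.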