AI, Automation & Platform Engineering Projects Technical Design Document Template (TDD)

0. Document Control

Version | Date | Author | Reviewer(s) | Linear Project | Status
0.1 | YYYY_MM_DD | | | |

1. Overview

  • Purpose — why does this project exist?
  • Scope & Goals — what’s in / out, success metrics.
  • Stakeholders & Roles — engineering, product, client, ops.
  • Context diagram — one-slide “where this fits” view.

2. Problem Statement & Requirements

  • Business Problem / User Stories — narrative of current pain; tie to measurable outcomes.
  • Use Cases — what tasks the system must perform: questions it must answer, workflows it must orchestrate, services it must expose, user experiences it must deliver; link to PRDs/OKRs.
  • Functional Requirements — e.g., answer user queries from a knowledge base, expose a REST/GraphQL API, build a customer-facing feature, execute multi-step workflows, integrate with external services via MCP.
  • Non-Functional Requirements (NFRs) — latency/throughput, cost budgets (tokens, infrastructure), security/compliance, observability, scalability, SLAs.

3. Current State Assessment

Include sections relevant to the project. Not every project will touch every area.

  • Application Services & APIs — existing backend services, APIs, frontend apps, languages/frameworks, database layer.
  • Existing AI Capabilities — models in use, fine-tuned models, current accuracy/quality baselines.
  • Agent Frameworks & Orchestration — current tooling (LangChain, LlamaIndex, custom), runtime patterns.
  • Workflow Automation — existing automations (n8n, Windmill, cron jobs, custom scripts), trigger patterns.
  • MCP Servers & Integrations — deployed MCP servers, tool registries, API integrations.
  • LLM Providers & API Access — which providers, rate limits, spend, model versions pinned.
  • Vector Stores & Knowledge Bases — current retrieval infrastructure, embedding models, index sizes.
  • Prompt Management — how prompts are versioned, stored, and deployed today.
  • Infrastructure & Deployment — hosting platforms (Railway, Heroku, GCP, AWS, etc.), CI/CD, environments, domain/DNS, secrets management.
  • Observability & Monitoring — tracing (LangFuse, LangSmith), logging, APM, alerting, cost dashboards.
  • Known Constraints / Debt — brittle systems, missing tests/evals, provider lock-in, cost overruns, legacy code.
  • [ARCHITECTURE DIAGRAM]

4. Research & Discovery

Goal: expose all due diligence so reviewers trust the design decisions.

Link | Type (spike, RFC, benchmark, eval, article) | Key takeaway
  • Model benchmarks & eval results (if AI project)
  • Library / framework comparisons, proof-of-concept results
  • Competitive / prior-art summary
  • Assumptions & constraints
  • Open questions

5. Target Architecture & Design

  • High-Level Architecture Diagram — show the full system: clients/UIs → services/APIs → AI/model layer → integrations/tools → data stores.
  • Component Breakdown — purpose, inputs/outputs, dependencies for each component (services, agents, RAG pipelines, MCP servers, workflow automations, databases, frontend apps).
  • Integration Architecture — how components connect (MCP protocol, REST/GraphQL APIs, webhooks, event buses, message queues, database connections).
  • Sequence / Activity Diagrams — critical flows (request lifecycles, agent reasoning loops, RAG retrieval-to-response, workflow trigger chains, multi-turn conversations, auth flows).
  • Interfaces & Contracts — API specs (OpenAPI/GraphQL), MCP tool schemas, webhook payloads, database schemas, prompt templates with variable contracts.
  • Deployment View — environments, hosting platform selection (Railway, Heroku, GCP, AWS, etc.), CI/CD pipelines, feature flags, secrets management, domain/DNS.
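Interface contracts are easiest to review when the TDD includes a concrete shape. Below is a minimal sketch of what an MCP tool contract plus a toy argument check could look like; the tool name, fields, and validation logic are all illustrative placeholders, not a prescribed schema:

```python
# Hypothetical MCP tool contract: name, description, and a JSON-Schema-style
# argument spec. A real server would register this with its MCP runtime.
SEARCH_TICKETS_TOOL = {
    "name": "search_tickets",
    "description": "Search support tickets by keyword and status.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "status": {"type": "string", "enum": ["open", "closed", "any"]},
            "limit": {"type": "integer", "minimum": 1, "maximum": 50},
        },
        "required": ["query"],
    },
}


def validate_args(tool, args):
    """Minimal check of required keys and primitive types against the spec."""
    schema = tool["input_schema"]
    type_map = {"string": str, "integer": int, "object": dict}
    for key in schema.get("required", []):
        if key not in args:
            return False, f"missing required argument: {key}"
    for key, value in args.items():
        spec = schema["properties"].get(key)
        if spec is None:
            return False, f"unknown argument: {key}"
        if not isinstance(value, type_map[spec["type"]]):
            return False, f"wrong type for {key}"
    return True, "ok"
```

Pinning contracts like this in the TDD lets reviewers catch mismatches (missing auth scopes, loose types) before implementation starts.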

A. Key Questions & Answers (KQA) Matrix

Business Question | Metric(s) / Output | Required Context & Data | System Path (Input → Agent/Model → Tool → Output) | Latency / Freshness SLA | Accuracy Target & Eval(s)

6. AI Model & Knowledge Strategy

Include this section when the project involves AI/ML components. For purely traditional dev projects, note “N/A — no AI components” and move on.

  • Model Selection Strategy — which LLMs/embedding models for which tasks; cost vs. capability tradeoffs; fallback chains.
  • Prompt Engineering Approach — system prompts, few-shot examples, chain-of-thought, structured output schemas.
  • Knowledge Management — RAG architecture, vector store design, chunking strategy, embedding model selection, index update cadence.
  • Agent Design Patterns — tool calling conventions, memory management (short-term / long-term), planning strategies, error recovery.
  • Workflow Automation Design — trigger logic, branching/conditional paths, retry/error handling, human-in-the-loop gates.
Capability | Model / Approach | Why | Fallback | Cost Tier
Primary reasoning
Embeddings
Classification / routing
Code generation
Summarization
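The "Fallback" column above implies an ordered chain at runtime. One way to sketch it, with model names and call functions as placeholders for real provider clients:

```python
def call_with_fallback(prompt, chain):
    """Try each (model_name, call_fn) in order; return the first success.

    `call_fn` is any callable that takes a prompt and returns text, raising
    on provider errors. Names and callables here are illustrative.
    """
    failures = []
    for model_name, call_fn in chain:
        try:
            return {"model": model_name, "output": call_fn(prompt)}
        except Exception as exc:  # timeout, rate limit, provider outage, etc.
            failures.append((model_name, repr(exc)))
    raise RuntimeError(f"all models in chain failed: {failures}")


# Example wiring (replace with real client calls):
# result = call_with_fallback("Summarize ...", [
#     ("primary-large", call_primary_model),
#     ("fallback-small", call_cheap_model),
# ])
```

Recording which model actually served each request also feeds the cost-attribution dashboards in Section 13.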

7. Tooling & Technology Decisions Matrix

Evaluate and justify tools for each domain relevant to the project:

Application & Infrastructure

  • Languages & frameworks (Python, TypeScript/Node, Next.js, FastAPI, etc.)
  • Databases (Postgres, Supabase, Redis, MongoDB, etc.)
  • Deployment & hosting (Railway, Heroku, GCP, AWS, Vercel, etc.)
  • CI/CD & DevOps (GitHub Actions, Docker, Terraform, etc.)
  • Auth & identity (Supabase Auth, Clerk, Auth0, custom)

AI & Automation (when applicable)

  • LLM providers (OpenAI, Anthropic, Google, open-source / self-hosted)
  • Agent frameworks & orchestration (LangChain, LlamaIndex, CrewAI, custom)
  • Vector stores & retrieval (TurboPuffer, Pinecone, Weaviate, pgvector, Qdrant)
  • Workflow automation (n8n, Windmill, Temporal)
  • MCP framework & runtime
  • Observability & tracing (LangFuse, LangSmith, custom)
  • Evaluation frameworks (Ragas, DeepEval, custom harness)
  • Prompt management & versioning
Domain | Option(s) Considered | Selected | Criteria / Score | Notes

8. Decision Log & Rationale (ADR Index)

Record each significant trade-off as an Architecture Decision Record (ADR). Link them here or inline.

# | Choice | Options Considered | Decision | Rationale | Impact / Risks

9. Security, Safety & Guardrails

General Security (all projects)

  • Authentication & Authorization — auth strategy, role-based access, API key management, OAuth flows.
  • Input Validation & Sanitization — injection prevention (SQL, XSS, etc.), request validation.
  • PII & Data Handling — how sensitive data is handled in transit, at rest, and in logs; redaction strategy.
  • Rate Limiting & Abuse Prevention — per-user/per-tenant limits, escalation rules.
  • Compliance & Regulatory — applicable regulations, data residency, audit trail requirements.
  • Secrets Management — how credentials, API keys, and tokens are stored and rotated.

AI-Specific Safety (when AI components are involved)

  • Output Validation — content filtering, format enforcement, confidence thresholds, hallucination detection.
  • Prompt Injection Defenses — input guardrails, context length management, system prompt protection.
  • Bias & Fairness — assessment methodology, testing cadence, mitigation strategies.
  • Human-in-the-Loop Requirements — which actions require human approval, escalation flows, override mechanisms.
  • Red-Teaming Plan — adversarial testing scope, cadence, responsible disclosure.
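The output-validation bullet above can be made concrete with a small gate that runs before a model response reaches users or downstream tools. This is a sketch under assumed conventions (JSON responses carrying a `confidence` field, a 0.7 threshold); the real checks and thresholds belong in Section 11's eval criteria:

```python
import json


def validate_model_output(raw, required_keys, min_confidence=0.7):
    """Gate a model response: returns (payload, None) on success or
    (None, reason) on rejection, so the caller can retry, fall back,
    or escalate to a human."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None, "malformed_json"
    missing = [k for k in required_keys if k not in payload]
    if missing:
        return None, "missing_keys:" + ",".join(missing)
    if payload.get("confidence", 0.0) < min_confidence:
        return None, "low_confidence"
    return payload, None
```

Rejection reasons should be logged and counted; a spike in `low_confidence` or `malformed_json` is often the first signal of a prompt or model regression.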

10. Implementation & Migration Plan

  • Milestones / phases (POC → pilot → production rollout)
  • Workstream ownership
  • Linear linkage to Project
  • Migration / cutover strategy — prompt migration, model swaps, A/B traffic shifting, rollback plans
  • Decommission plan — what legacy systems/workflows retire and when

B. Before / After Impact Scorecard

Dimension | Current State | Target State | % / Delta Improvement | Evidence / Benchmark Plan
Time to answer “X?” | Manual research, hours/days | < 30 sec self-serve | 95%+ faster | Query log baseline + post-rollout tracking
Task completion accuracy | Manual, ~80% consistency | > 95% eval accuracy | ~15 pt improvement | Golden dataset evals, human review sample
Workflow cycle time | 4 hrs manual, 3 handoffs | < 5 min automated | 50x faster | Timing instrumentation
Cost per interaction | $X (manual labor) | $Y (tokens + infra) | Z% reduction | LangFuse cost tracking

11. Testing & Validation

Standard Testing (all projects)

  • Unit Tests — coverage targets, framework (pytest, Jest, Vitest, etc.), mocking strategy.
  • Integration Tests — API contract tests, database integration, service-to-service calls.
  • End-to-End Tests — critical user paths, browser automation (if frontend), workflow paths.
  • Performance / Load Testing — response time benchmarks, throughput under load, stress testing.
  • Regression Testing — test suites that run on dependency updates, code changes, deploys.

AI Evaluation (when AI components are involved)

  • Eval Framework — tooling (Ragas, DeepEval, custom), golden datasets, automated vs. human grading.
  • Accuracy & Faithfulness Testing — factuality checks, hallucination detection, citation verification.
  • Safety Testing — red-teaming results, adversarial input testing, guardrail validation.
  • Model Regression — test suites that run on model upgrades or prompt changes.
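The golden-dataset and model-regression bullets above reduce to a simple gate: score the current model/prompt against fixed (question, expected) pairs and block the deploy if accuracy drops below threshold. A minimal sketch, with the predict function and thresholds as placeholders:

```python
def eval_accuracy(golden, predict_fn):
    """Fraction of golden (question, expected) pairs the model answers exactly.

    Real harnesses usually add fuzzy matching or LLM-as-judge scoring;
    exact match keeps this sketch deterministic.
    """
    correct = sum(1 for question, expected in golden if predict_fn(question) == expected)
    return correct / len(golden)


def gate_deploy(golden, predict_fn, threshold=0.95):
    """True only if accuracy clears the release threshold from the test matrix."""
    return eval_accuracy(golden, predict_fn) >= threshold
```

Run the same gate on every prompt change and model upgrade, not just code deploys, so regressions surface before users see them.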

Validation & Acceptance

  • User Acceptance Testing — pilot user feedback loops, A/B test design, success criteria.
  • Staging / QA Environments — how pre-production mirrors production, data seeding strategy.
Test Type | Scope | Tool / Method | Pass Criteria | Cadence
Unit | Functions, modules | | > X% coverage | Every commit
Integration | API contracts, service calls | | All contracts pass | Every PR
E2E | Critical user paths | | All paths green | Every deploy
Eval suite | Golden dataset, N questions | | > X% accuracy | Every deploy
Latency | P50 / P90 / P99 | | < Y ms | Every deploy
Safety | Adversarial inputs, N scenarios | | 0 critical failures | Weekly / release
Regression | Prompt + model + code changes | | No degradation > Z% | Every change

12. Operational Considerations

  • Runbooks / on-call playbooks (service outage, model degradation, provider outage, cost spike)
  • SLA/SLO targets (availability, latency, accuracy, error rates)
  • Incident response — how to roll back a deploy, revert a model/prompt, disable a feature/tool
  • Scaling strategy — autoscaling triggers, queue management, circuit breakers, database connection pooling
  • Backup & recovery — database backups, disaster recovery, data export strategy
  • Logging & alerting — structured logging, alert routing, escalation paths

13. Cost Management

Infrastructure Costs

  • Hosting & Compute — platform costs (Railway, Heroku, GCP, AWS), instance sizing, autoscaling impact.
  • Database & Storage — managed database costs, storage tiers, backup costs.
  • Third-Party Services — SaaS/API costs, per-call pricing, volume commitments.

AI-Specific Costs (when applicable)

  • Token Usage Projections — estimated tokens per interaction, daily/monthly volume, growth assumptions.
  • Cost Per Interaction Targets — budget per query/task, broken down by model tier.
  • Model Tier Strategy — routing cheap tasks to smaller/cheaper models, reserving expensive models for complex reasoning.
  • Caching Strategy — semantic caching, response caching, embedding cache; expected hit rates.
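Token projections are simple arithmetic, but writing the formula down avoids disagreements later. A sketch; all prices and volumes below are placeholders to replace with the provider's rate card and your own traffic estimates:

```python
def monthly_token_cost(interactions_per_day, input_tokens, output_tokens,
                       input_price_per_1k, output_price_per_1k, days=30):
    """Project monthly LLM spend from per-interaction token counts.

    Separate input/output prices matter: output tokens are typically
    several times more expensive than input tokens.
    """
    per_interaction = (input_tokens / 1000) * input_price_per_1k \
                    + (output_tokens / 1000) * output_price_per_1k
    return round(interactions_per_day * days * per_interaction, 2)


# Illustrative: 1,000 interactions/day, 2k input + 500 output tokens each,
# at hypothetical $0.003 / $0.015 per 1k tokens.
estimate = monthly_token_cost(1000, 2000, 500, 0.003, 0.015)
```

Multiply out the same formula per model tier to justify the routing strategy above (cheap models for cheap tasks).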

Governance

  • Cost Monitoring & Alerting — dashboards, alert thresholds, per-tenant/per-feature cost attribution.
  • Budget Governance — hard caps, soft warnings, escalation when thresholds are breached.
Cost Component | Unit | Estimated Volume | Unit Cost | Monthly Estimate | Notes
Hosting / compute | per instance/hour | | | |
Database | per plan / GB | | | |
LLM inference (primary) | per 1k tokens | | | |
LLM inference (fallback) | per 1k tokens | | | |
Embedding generation | per 1k tokens | | | |
Vector store | per GB / per query | | | |
Third-party APIs / SaaS | per call / month | | | |

14. Review, RFC & Sign-offs

  • Design Review Meeting — date/time, attendees, key outcomes.
  • RFC Process — distribution list, comment window, approval criteria (link to RFC doc/thread).
  • Sign-off Checklist — required approvals (AI Lead, Engineering Lead, Security, Product).

15. Meeting Notes

Paste minutes or link to recordings for every design review.

Action-item table drives accountability.

16. Agent Handoff Brief

Purpose: Condense the TDD into agent-ready context for AI coding agents (Cursor Plan Mode, Codex, etc.). Fill this out after the design is approved. For each implementation workstream or milestone, create one of these briefs. You can paste just this section as agent context, or pair it with the relevant architecture section.

Codebase Orientation

  • Repo(s): <repo name and path>
  • Key files / entry points:
    • path/to/relevant/file — what it does
    • path/to/another/file — what it does
  • Reference implementations: Point to existing features or patterns in the codebase that this work should follow.

Scope for This Workstream

What specifically is being built in this phase. Reference the milestone from Section 10.

Implementation Sequence

Ordered list of what to build, in what order. Each step should be independently testable.

Acceptance Criteria

Concrete, verifiable statements. An agent (or a human) should be able to check each one off.

Key Decisions to Carry Forward

Pull the most important decisions from the ADR Index (Section 8) and Tooling Matrix (Section 7) that the agent needs to respect. Don’t make the agent read the whole TDD — surface the decisions that matter for implementation.

Constraints and Guardrails

  • Don’t touch: files, services, or areas that are out of scope for this workstream
  • Must use: specific libraries, frameworks, patterns, or conventions required
  • Must not: hard constraints (e.g., no new dependencies without approval, don’t modify shared schemas)
  • Style / conventions: naming patterns, file organization, code style to follow
  • AI-specific: model provider to use, prompt patterns to follow, eval thresholds that must pass

17. Appendices

  • Glossary
  • Reference material
  • Change log

Appendix A — Section 5 Variants (Target Architecture & Design)

Replace your standard Section 5 with one (or combine relevant pieces) of the following variants based on project type.

5A. Retrieval-Augmented Generation (RAG) Pipeline

  • Context & Drivers — why pure LLM won’t meet accuracy/compliance; need grounded answers.
  • High-Level Diagram — Query → Retriever → Ranker → LLM (w/ context) → Post-processor.
  • Knowledge Store — vector DB choice, chunking strategy, embeddings model, update cadence.
  • Latency Budget — e.g., 300 ms retrieval, 700 ms generation (90th percentile).
  • Grounding & Citations — how sources are surfaced, confidence scoring, fallback if below threshold.
  • Failure Modes on Questions — stale index, long-tail queries, retrieval miss; mitigation.
  • Eval Hooks — automated factuality/hallucination tests every nightly index build.
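The grounding-and-fallback bullets above can be sketched as a single control-flow function. The retrieval/generation callables, score field, and 0.75 threshold are all assumptions standing in for the real pipeline:

```python
def grounded_answer(query, retrieve_fn, generate_fn, min_score=0.75):
    """Answer only from retrieved context; fall back when retrieval is weak.

    `retrieve_fn(query)` is assumed to return dicts with "text", "source",
    and a similarity "score"; `generate_fn(query, context)` returns text.
    """
    hits = [h for h in retrieve_fn(query) if h["score"] >= min_score]
    if not hits:
        # Retrieval miss: surface a fallback instead of letting the
        # model answer ungrounded (the failure mode called out above).
        return {"answer": None, "citations": [], "fallback": "escalate"}
    context = "\n\n".join(h["text"] for h in hits)
    return {
        "answer": generate_fn(query, context),
        "citations": [h["source"] for h in hits],
        "fallback": None,
    }
```

The `citations` list is what the UI surfaces as sources, and the `fallback` branch is what the eval hooks should exercise with long-tail queries.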

5B. Agentic System with Tool Invocation

  • Agent Loop Diagram — Planner → Tool-calling → Memory Update → Critic → Next Action.
  • Tool Registry — JSON schema of actions (name, args, guardrails, auth scope).
  • Memory & Long-Term Context — vector store vs. relational store; eviction strategy.
  • Orchestration Runtime — LangChain, LlamaIndex, custom; concurrency, timeout rules.
  • Safety Layer — output filtering, rate limiting, red-teaming hooks.
  • Question-to-Component Map — which tools answer which business questions.
  • Observability — token/sec, cost per call, success/rollback metrics.

5C. Embedded AI Copilot in SaaS Product

  • User Journey Diagram — entry point, context capture, backend calls, UI surfacing.
  • Context Window Assembly — telemetry, recent actions, role/permissions.
  • Personalization Logic — per-user embeddings, on-device vs. server storage.
  • Real-Time Constraints — < 1 s perceived latency; streaming vs. single shot.
  • Fallback UX — graceful degradation when model confidence is low.
  • Telemetry for Product Questions — how “Time-to-Answer” and adoption are logged.

5D. Fine-Tuned / Custom LLM Service

  • Base Model Selection — criteria (license, capability, cost).
  • Fine-Tuning Dataset Flow — collection, filtering, dedup, weighting, holdout split.
  • Training Infra — GPUs, parameter-efficient tuning (LoRA, QLoRA), MLOps pipeline.
  • Serving Stack — quantization, tensor parallelism, autoscaling triggers.
  • Versioning & Rollback — traffic shadowing, canary %, safety nets.
  • Evaluation Harness — regression suite on KQA, toxicity, bias, latency, cost.

5E. Multi-Modal AI Assistant

  • Input Modalities — text, image, voice; pre-processing pipelines.
  • Fusion Strategy — encoder sharing, late fusion, routing logic.
  • Real-Time Transcription / OCR — latency and cost gates.
  • Output Rendering — rich snippets, highlighted source images.
  • Accessibility Compliance — captions, alt-text generation.
  • Question Failure Map — how missing modality affects specific questions.

5F. MCP Server & Tool Platform

  • MCP Server Architecture — transport layer, session management, capability negotiation.
  • Tool Schema Design — JSON schema definitions, argument validation, auth scoping.
  • Resource & Prompt Exposure — what resources are served, prompt templates provided.
  • Multi-Tenant Isolation — per-client/per-user scoping, credential management.
  • Composability — how multiple MCP servers are composed for a single agent.
  • Deployment & Versioning — server lifecycle, backward compatibility, discovery mechanisms.

5G. Workflow Automation & Integration Platform

  • Workflow Engine — n8n, Windmill, Temporal; selection rationale.
  • Trigger Taxonomy — webhooks, schedules, event-driven, manual; trigger-to-workflow mapping.
  • Step Design — idempotency, retry logic, timeout handling, error routing.
  • Integration Registry — external systems connected, auth methods, rate limits per integration.
  • Human-in-the-Loop Gates — approval steps, escalation paths, notification channels.
  • Monitoring & Alerting — execution dashboards, failure alerts, SLA tracking.
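The step-design bullets (idempotency, retries, error routing) can be condensed into one executor sketch. The in-memory `completed` set stands in for a durable idempotency store, and the delays are shortened for illustration:

```python
import time


def run_step(step_fn, payload, idempotency_key, completed,
             max_retries=3, base_delay=0.01):
    """Execute one workflow step with an idempotency check and
    exponential backoff; exhausted retries route to a dead letter.

    Returns "skipped", "done", or "dead_letter" so the engine can
    branch (the error-routing bullet above).
    """
    if idempotency_key in completed:
        return "skipped"  # replayed trigger: do not re-execute side effects
    for attempt in range(max_retries):
        try:
            step_fn(payload)
            completed.add(idempotency_key)
            return "done"
        except Exception:
            if attempt == max_retries - 1:
                return "dead_letter"  # hand off to error routing / alerting
            time.sleep(base_delay * (2 ** attempt))
```

Engines like n8n, Windmill, and Temporal provide these primitives natively; the sketch is only to make the required semantics explicit for review.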

5H. API / Backend Service

  • Service Architecture — monolith vs. microservices, request lifecycle, middleware stack.
  • API Design — REST vs. GraphQL, versioning strategy, pagination, error contract.
  • Database Design — schema (ERD), ORM/query layer, migrations strategy, indexing plan.
  • Auth & Permissions — authentication method, authorization model (RBAC, ABAC), token management.
  • Background Jobs / Queues — async processing, job scheduling, retry/dead-letter handling.
  • Caching Layer — what’s cached, TTL strategy, invalidation rules.
  • Third-Party Integrations — external APIs consumed, webhook handling, circuit breakers.
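For the API-design bullet, the error contract is worth pinning down early so every endpoint fails the same way. A minimal sketch of one reasonable envelope convention (field names are illustrative, not a fixed standard):

```python
def error_response(status_code, error_code, message, details=None):
    """Uniform error envelope: a machine-readable code for clients to
    branch on, a human-readable message, and optional structured details."""
    return {
        "status": status_code,
        "error": {
            "code": error_code,
            "message": message,
            "details": details or {},
        },
    }
```

Documenting this once in the OpenAPI/GraphQL spec keeps client retry logic and frontend error handling consistent across endpoints.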

5I. Frontend / Product Feature

  • User Journey & Wireframes — key screens, user flows, interaction patterns.
  • Component Architecture — framework (Next.js, React, etc.), component hierarchy, state management.
  • API Integration Layer — how the frontend talks to backend(s), data fetching strategy, optimistic updates.
  • Performance Budget — bundle size targets, Core Web Vitals targets, lazy loading strategy.
  • Accessibility — WCAG compliance level, testing approach.
  • Deployment & CDN — hosting (Vercel, Railway, etc.), edge caching, preview deployments.

5.X Question-Driven Design Lens

  • Primary Questions Enabled — (reference KQA matrix rows this section covers).
  • Speed/Accuracy Impact — what improves and how it’s measured (link to Scorecard).
  • Failure Mode on Questions — if this component fails/degrades, which questions break and what’s the fallback?

5.Y Question-to-Component Mapping

Component / ServiceQuestions UnlockedModel / Tool DependenciesSLA / Perf TargetOwner

Appendix B — Section 6 Variants (AI Model & Knowledge Strategy)

Pick the paradigm that matches the project. You can combine approaches (e.g., RAG knowledge base with a prompt library and eval dataset).

6A. RAG Knowledge-Base Design

  • Corpus Scope & Owners — docs, tickets, wikis, DB records; TTL rules.
  • Chunking & Embeddings Strategy — overlap, window size, model choice, re-embed frequency.
  • Metadata Schema — source, confidence, permissions tag.
  • Index Update Workflow — CDC, webhook triggers, nightly rebuild.
  • Evaluation — recall@k, answer faithfulness, citation coverage.
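The chunking bullet above is where most RAG quality is won or lost, so the TDD should state the exact windowing rule. A fixed-window sketch with overlap; the 500/100 character sizes are starting points to tune, not recommendations:

```python
def chunk_text(text, window=500, overlap=100):
    """Split text into fixed-size windows that overlap by `overlap`
    characters, so sentences straddling a boundary appear in both chunks.

    Production pipelines often chunk on token or sentence boundaries
    instead; characters keep this sketch dependency-free.
    """
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    return [text[start:start + window]
            for start in range(0, max(len(text) - overlap, 1), step)]
```

Record the chosen window/overlap in the metadata schema so re-embedding runs are reproducible.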

6B. Prompt & System-Message Library

  • Prompt Catalog — table of prompt ID, purpose, target model, guardrails.
  • Parameterization Rules — slots, variable escaping, defaults.
  • Versioning & A/B Testing — traffic split strategy, success metrics.
  • Observability — prompt-level latency, token usage, cost.
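The catalog, slot, and default rules above can be sketched with the standard library's `string.Template`. The prompt ID, text, and defaults below are hypothetical catalog entries, not a mandated format:

```python
from string import Template

# Hypothetical catalog entry; real catalogs would live in versioned storage.
PROMPT_CATALOG = {
    "support_answer_v2": {
        "template": Template(
            "You are a support assistant for $product.\n"
            "Answer only from the context below; say so if it is missing.\n\n"
            "Context:\n$context\n\nQuestion: $question"
        ),
        "defaults": {"product": "Acme"},
    },
}


def render_prompt(prompt_id, **slots):
    """Fill a catalog prompt; a missing required slot raises KeyError,
    so bad call sites fail fast instead of sending a broken prompt."""
    entry = PROMPT_CATALOG[prompt_id]
    return entry["template"].substitute({**entry["defaults"], **slots})
```

Versioned IDs like `support_answer_v2` are what make the A/B-testing and per-prompt observability bullets above possible: every trace can be tagged with the exact prompt version that produced it.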

6C. Tool / Action Schema for Agents

  • JSON/YAML Contract — name, description, args, required auth, rate limits.
  • Registration & Discovery — dynamic loading vs. static registry.
  • Safety Scopes — sandboxing, dry-run, audit logging.
  • Mapping to Business Questions — which tool enables which KQA row.

6D. Evaluation & Feedback Dataset Design

  • Golden Dataset — representative user questions, expected answers, edge cases.
  • Scoring Rubrics — factuality, helpfulness, brevity, style.
  • Automated Graders — LLM-as-judge vs. deterministic scripts.
  • Human-in-the-Loop Pipeline — sampling %, UI for raters, adjudication.
  • Continuous Learning Loop — when data promotes to fine-tuning set.

6E. Vector-Store & Embedding Governance

  • Embedding Model Lifecycle — upgrade cycle, backfill policy.
  • Similarity Metrics — cosine vs. dot product vs. Euclidean distance; threshold tuning.
  • Namespace & ACL Strategy — per-tenant isolation, encryption at rest.
  • Cost Controls — shard sizing, cold-storage tiers, deletion hooks.

6F. Fine-Tuning / RLHF Dataset Curation

  • Source Mix — customer chats, docs, synthetic Q&A.
  • Filtering — PII removal, toxicity filter, language detection.
  • Label Schema — preference pairs, scorecards, multi-choice.
  • Data Weighting & Sampling — boost rare intents, down-weight low quality.
  • Ethics & Bias Review — checklists, external audit steps.

6G. Workflow & Automation Recipe Design

  • Recipe Catalog — table of automation name, trigger, steps, owner, frequency.
  • Input/Output Contracts — expected payloads, schema validation, error shapes.
  • Dependency Map — which automations depend on which services/APIs/models.
  • Testing Strategy — dry-run modes, mock integrations, snapshot testing.
  • Versioning & Rollback — recipe version control, staged rollouts, instant rollback.

Key Questions & Answers (KQA) Matrix & Impact Scorecard

(Mandatory in every variant — paste near the top of Section 5 or 6)

Business Question | Metric / Output | Latency SLA | Accuracy Target | Component Path | Eval Test

Dimension | Current | Target | Delta Improvement | Evidence Plan
Time-to-Answer | | | |
Answer Accuracy | | | |
Cost / 1k Tokens | | | |
Workflow Cycle Time | | | |