Validation Discipline: Testing, Code Review & System Evolution
This guide explains how to validate AI-generated code systematically, use code review and system review commands effectively, and evolve your system through validation feedback — covering the 5-Level Validation Pyramid, the four validation commands, parallel code review with agents, and the critical distinction between fixing code bugs vs fixing process bugs.
1. Why Validation Is a Discipline
The Validation Problem
AI coding agents are remarkably productive — but they also produce subtle bugs, skip edge cases, and drift from requirements. Without systematic validation, you’re rolling the dice on code quality with every implementation.
The PIV Loop places validation as the THIRD pillar for a reason: it’s not an afterthought, it’s a gate. Implementation without validation is just hope.
The Core Insight
“Don’t just fix the bug. Fix the system that allowed the bug.”
When validation catches an issue, you have two choices:
- Fix the code — solve the immediate problem
- Fix the system — update commands, templates, or rules to prevent the category of problem
Fixing the code is necessary. Fixing the system is what compounds. The validation discipline is about making the second choice a habit.
What This Guide Covers
- The 5-Level Validation Pyramid — a gated progression from syntax to human review
- Four validation commands — /code-review, /code-review-fix, /execution-report, /system-review
- Parallel code review — using 4 specialized agents for 40-50% faster reviews
- Validation as feedback — the meta-reasoning loop that evolves your system
- The selectivity principle — why you shouldn’t blindly follow all AI recommendations
- Practical workflows — when to use which command and in what order
2. The 5-Level Validation Pyramid
Each level gates the next. Don’t proceed to the next level if the current one fails. This prevents wasting time running expensive integration tests when a linting error would catch the issue in seconds.
Level 5: Human Review
(Alignment with intent)
|
Level 4: Integration Tests
(System behavior)
|
Level 3: Unit Tests
(Isolated logic)
|
Level 2: Type Safety
(Type checking)
|
Level 1: Syntax & Style
(Linting, formatting)
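The gating rule can be sketched as a small runner that executes each automated level in order and stops at the first failure. This is a minimal illustration, not part of the PIV tooling: `run_pyramid` is a hypothetical helper, and the commands passed to it are whatever your project uses at each level.

```python
import subprocess
import sys
from typing import Optional, Sequence, Tuple

# (level name, command as an argument list). The names and commands are
# supplied by the caller; nothing here is specific to one toolchain.
Level = Tuple[str, Sequence[str]]


def run_pyramid(levels: Sequence[Level]) -> Optional[str]:
    """Run levels in order; return the first failing level's name, or None."""
    for name, command in levels:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            # Gate: stop here so the slower, costlier levels never run
            # against code that already failed a cheaper check.
            return name
    return None


# Stand-in commands so the sketch runs anywhere; replace them with your
# real linter, type checker, and test invocations.
demo_levels = [
    ("Level 1: Syntax & Style", [sys.executable, "-c", "pass"]),
    ("Level 2: Type Safety", [sys.executable, "-c", "raise SystemExit(1)"]),
    ("Level 3: Unit Tests", [sys.executable, "-c", "pass"]),
]
failed_level = run_pyramid(demo_levels)  # Level 3 is never executed
```

Level 5 is deliberately absent: human review is a judgment call and cannot be reduced to an exit code.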
Level 1 — Syntax & Style
What it catches: Syntax errors, formatting inconsistencies, import issues, style violations.
Tools: Linters (ruff, eslint, biome), formatters (black, prettier), auto-fix on save.
When to run: After every file write. Most editors and AI tools can do this automatically.
Why it matters: The fastest, cheapest check. If your code doesn’t parse, nothing else matters. Running linting first prevents false positives at higher levels (a type checker may report errors that are actually just syntax issues).
Common commands:
# Python
ruff check . --fix
ruff format .
# TypeScript/JavaScript
npx eslint . --fix
npx prettier --write .
# Go
go fmt ./...
golangci-lint run
Level 2 — Type Safety
What it catches: Type mismatches, missing return types, incorrect function signatures, null/undefined handling.
Tools: Type checkers (mypy, pyright, tsc --noEmit).
When to run: After implementation, before tests. Type errors can cause tests to fail for the wrong reason.
Why it matters: Catches an entire category of bugs that tests might miss, especially in dynamically typed languages where type hints exist but aren’t enforced at runtime.
Common commands:
# Python
mypy app/ --strict
pyright
# TypeScript
npx tsc --noEmit
# Go (built into compiler)
go build ./...
AI-specific pitfall: AI agents frequently generate code that looks correct but has subtle type issues — wrong generic parameters, missing optional markers, incorrect union types. Type checking catches these before runtime.
Level 3 — Unit Tests
What it catches: Logic errors, edge cases, incorrect calculations, broken algorithms.
Tools: Test frameworks (pytest, jest, vitest, go test).
When to run: After type checking passes. Run focused tests first (just the feature), then the full suite.
Why it matters: Verifies that individual functions and classes work correctly in isolation.
Common commands:
# Python
pytest tests/unit/ -v
pytest tests/ -k "test_feature_name"
# TypeScript/JavaScript
npx jest --testPathPattern="feature-name"
npx vitest run
# Go
go test ./... -v
Critical pitfall — AI mocking tests to pass: AI agents will sometimes write tests that mock so heavily they test nothing real. Watch for:
- Tests that mock the function being tested (circular)
- Tests with no real assertions (just assert mock.called)
- Tests where all external dependencies are mocked away
- Tests that pass trivially regardless of implementation
Rule: Reject mocks without justification. If you see a mock, ask: “What real behavior is this test verifying?” If the answer is “nothing,” the test is fake.
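To make the difference concrete, here is a minimal sketch contrasting a fake test with a real one. The function `apply_discount` is a hypothetical example invented for illustration; the point is only the shape of the two tests.

```python
from unittest.mock import Mock


def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function under test: reduce a price by a percentage."""
    return round(price * (1 - percent / 100), 2)


def test_discount_theater() -> None:
    # Fake: the function under test is replaced by a mock, so this test
    # passes no matter what apply_discount actually does.
    fake_apply_discount = Mock(return_value=90.0)
    fake_apply_discount(100.0, 10)
    assert fake_apply_discount.called


def test_discount_real() -> None:
    # Real: calls the actual implementation and checks concrete values,
    # including an edge case (a 0% discount leaves the price unchanged).
    assert apply_discount(100.0, 10) == 90.0
    assert apply_discount(59.99, 0) == 59.99
```

If `apply_discount` were changed to return the price unmodified, `test_discount_theater` would still pass while `test_discount_real` would fail. Only the second test is verifying behavior.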
Level 4 — Integration Tests
What it catches: Component interaction bugs, database issues, API contract violations, race conditions.
Tools: Same test frameworks with integration markers, test databases, fixtures.
When to run: After unit tests pass. Integration tests use more resources and take longer, so run them last among the automated levels.
Why it matters: Individual components can work perfectly in isolation but fail when combined. Integration tests verify the connections between components.
Common commands:
# Python (with markers)
pytest tests/integration/ -v
pytest -m integration
# TypeScript
npx jest --testPathPattern="integration"
# Docker-based
docker compose up -d test-db && pytest tests/integration/ && docker compose down
Best practices:
- Mock external services (Stripe, SendGrid) but not internal components
- Use test fixtures and factories for database setup
- Clean up after each test (use transactions or truncation)
- Run against a real test database, not mocks
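The transaction-based cleanup mentioned above can be sketched with the standard library's `sqlite3` module. This is an illustration under assumptions: the `users` table and the helper names are hypothetical, and an in-memory SQLite database stands in for whatever real test database your project uses. In a pytest suite the same idea would normally live in a fixture.

```python
import sqlite3
from contextlib import contextmanager
from typing import Iterator


def create_test_db() -> sqlite3.Connection:
    """Create a throwaway database with a hypothetical `users` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE)")
    conn.commit()
    return conn


@contextmanager
def rolled_back(conn: sqlite3.Connection) -> Iterator[sqlite3.Connection]:
    """Run a test body, then roll back whatever it wrote."""
    try:
        yield conn
    finally:
        # Discard the test's uncommitted writes so the next test
        # starts from the same clean state.
        conn.rollback()


def count_users(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

A test body then runs inside `with rolled_back(conn) as db:` and can insert freely. The approach only works if the code under test does not commit on its own; when it does, truncating tables between tests is the usual fallback.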
Level 5 — Human Review
What it catches: Architectural drift, intent misalignment, pattern violations, security oversights, over-engineering.
How to do it: Read the git diff. Compare implementation against the plan. Look at the big picture.
When to do it: After levels 1-4 pass. This is the final gate before merge/commit.
Why it matters: AI handles mechanical validation (levels 1-4). Humans judge strategic alignment — does this implementation actually solve the right problem the right way?
What to look for:
- Does the implementation match the plan? (Or did the AI drift?)
- Are the right patterns being followed? (Check against CLAUDE.md conventions)
- Is the approach sound architecturally? (Or is it a hack that will cause problems later?)
- Are there security concerns the automated checks missed?
- Is there unnecessary complexity? (YAGNI violations)
The 80/20 of human review: Focus on:
- New files — these define new patterns
- Public interfaces — API contracts, function signatures
- Database changes — migrations, schema modifications
- Security-sensitive code — auth, input validation, data access
3. The Four Validation Commands
The PIV Loop includes four dedicated validation commands. Each serves a different purpose and has different optimal timing.
/code-review — Technical Code Review
Purpose: Find bugs, security issues, performance problems, and pattern violations in changed files.
When to run: After implementation, before commit. Works on uncommitted changes.
What it does:
- Gets the git diff of changed files
- Reads the full content of each changed file
- Reviews for: correctness, security, performance, patterns
- Produces a structured report with severity levels (Critical/Major/Minor)
- Saves the report to requests/code-reviews/
Output format: Structured findings with severity, location (file:line), issue description, evidence (code snippet), and suggested fix.
Key design decisions:
- Reviews changed files only — not the entire codebase
- Reads the full file, not just the diff — because context matters
- Produces a saved artifact — consumed by /code-review-fix downstream
- Optimized for agent consumption — explicit file paths, exact line numbers
/code-review-fix — Fix Issues from Review
Purpose: Take a code review report and fix the identified issues.
When to run: After /code-review, before commit.
What it does:
- Reads the code review report
- Processes issues by severity (Critical first, then Major, then Minor)
- Fixes each issue with minimal changes
- Optionally scopes fixes (all files, specific files, or specific severity)
Key design decisions:
- Accepts a scope parameter — fix everything, or just critical issues
- Fixes in severity order — critical issues first, don’t waste time on minor issues if critical ones exist
- Minimal changes — fix the issue, don’t refactor surrounding code
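The severity ordering and scoping behavior can be illustrated in a few lines. The real command is a prompt that an agent interprets, so this is only a sketch of the selection logic: the `Finding` record and the `select_findings` function are hypothetical, and the scope strings are modeled on the usage patterns shown in this section.

```python
from dataclasses import dataclass
from typing import List, Optional

SEVERITY_ORDER = {"critical": 0, "major": 1, "minor": 2}


@dataclass(frozen=True)
class Finding:
    severity: str     # "critical", "major", or "minor"
    location: str     # "file:line"
    description: str


def select_findings(findings: List[Finding], scope: Optional[str] = None) -> List[Finding]:
    """Filter findings by scope, then order them Critical, Major, Minor.

    scope is None (everything), a "+"-joined severity list such as
    "critical+major", or a path prefix such as "src/auth/".
    """
    if scope is None:
        selected = list(findings)
    elif all(part in SEVERITY_ORDER for part in scope.split("+")):
        wanted = set(scope.split("+"))
        selected = [f for f in findings if f.severity in wanted]
    else:
        # Anything that is not a severity list is treated as a path prefix.
        selected = [f for f in findings if f.location.startswith(scope)]
    # sorted() is stable, so findings of equal severity keep report order.
    return sorted(selected, key=lambda f: SEVERITY_ORDER[f.severity])
```

Sorting after filtering means that a scoped run still handles the most severe remaining issue first.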
Usage patterns:
# Fix all issues
/code-review-fix requests/code-reviews/feature-review.md
# Fix only critical and major issues
/code-review-fix requests/code-reviews/feature-review.md critical+major
# Fix issues in specific files only
/code-review-fix requests/code-reviews/feature-review.md src/auth/
/execution-report — Implementation Report
Purpose: Document what actually happened during implementation for comparison against the plan.
Critical constraint: Must run in the SAME conversation context as /execute. If you start a new conversation, the AI loses memory of what it did and the report becomes generic guesswork.
When to run: Immediately after /execute completes, BEFORE commit or context switch.
What it does:
- Recalls what was implemented (from current conversation memory)
- Lists files created, modified, and deleted
- Documents any deviations from the plan
- Notes any issues encountered and how they were resolved
- Records validation results
Why it must be same-context: The report is accurate because the AI remembers exactly what it did. In a fresh context, the AI would have to infer from file diffs — missing the why behind decisions, workarounds tried and rejected, and deviations from the plan.
Output: Saved to requests/execution-reports/ for later use by /system-review.
/system-review — Divergence Analysis
Purpose: Compare plan vs implementation to find process bugs — not code bugs.
When to run: After commit, when you want to evolve the system. Not every loop — only when something felt wrong.
What it does:
- Reads the structured plan
- Reads the execution report
- Compares: What was planned vs what actually happened?
- Identifies divergences and categorizes them:
- Plan gap — plan was missing information → fix the planning command/template
- Execution drift — AI deviated from plan → fix the execute command
- Validation miss — issue not caught → fix validation commands
- System gap — no command covers this scenario → create new command
- Recommends specific system improvements
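The four divergence categories map directly to where the fix belongs. As a rough sketch, assuming you have already classified the divergences by hand or from the review output, the mapping and a simple tally look like this. The `Divergence` enum and `fix_targets` function are hypothetical names used for illustration.

```python
from collections import Counter
from enum import Enum
from typing import Iterable, List, Tuple


class Divergence(Enum):
    PLAN_GAP = "plan gap"
    EXECUTION_DRIFT = "execution drift"
    VALIDATION_MISS = "validation miss"
    SYSTEM_GAP = "system gap"


# Where each category of divergence should be fixed, per the list above.
FIX_TARGET = {
    Divergence.PLAN_GAP: "planning command or template",
    Divergence.EXECUTION_DRIFT: "execute command",
    Divergence.VALIDATION_MISS: "validation commands",
    Divergence.SYSTEM_GAP: "new command",
}


def fix_targets(divergences: Iterable[Divergence]) -> List[Tuple[str, int]]:
    """Return (fix target, count) pairs, most frequent category first."""
    counts = Counter(divergences)
    return [(FIX_TARGET[kind], n) for kind, n in counts.most_common()]
```

Ranking by frequency points at the part of the system that is failing most often, which is usually the first place to spend effort.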
Key insight: /code-review finds bugs in code. /system-review finds bugs in process. Different purposes, different commands, different timing.
When NOT to use it: Don’t run system review after every loop. Use it when:
- Implementation took significantly longer than expected
- You had to manually intervene multiple times
- The AI made the same type of mistake repeatedly
- The output quality was noticeably different from previous runs
4. The Complete Validation Workflow
Recommended Sequence
/execute [plan] → /execution-report → /code-review → /code-review-fix → /commit
Step-by-step:
- /execute [plan] — Implement the feature from the structured plan
- /execution-report — Document what happened (SAME context, before commit)
- /code-review — Technical review of changed files
- /code-review-fix — Fix issues found in review
- /commit — Create the git commit
Optional system evolution (after commit):
/system-review [plan] [report]
When to Skip Steps
Not every loop needs the full workflow:
| Situation | Recommended Workflow |
|---|---|
| Simple bug fix | /execute → /commit |
| Standard feature | /execute → /code-review → /code-review-fix → /commit |
| Complex feature | Full workflow including /execution-report |
| Repeated quality issues | Full workflow + /system-review |
| Documentation only | /execute → /commit (no code review needed) |
Timing Matters
The order is not arbitrary:
- Execution report before commit: The AI remembers what it did
- Code review before fix: You need findings before you can fix them
- Fix before commit: Don’t commit known issues
- System review after commit: You need the final artifact (committed code) to compare against the plan
5. Parallel Code Review with Agents
The Pattern
Instead of one agent reviewing everything sequentially, four specialized agents review in parallel — each focused on a specific concern:
Main Agent
├─> Type Safety Agent (types, annotations, type errors)
├─> Security Agent (vulnerabilities, injection, secrets)
├─> Architecture Agent (patterns, conventions, structure)
└─> Performance Agent (queries, algorithms, memory)
↓ (results return in parallel)
Main Agent combines findings → unified report
Why Parallel
Speed: 40-50% faster than sequential review. Four agents working simultaneously instead of one doing four passes.
Depth: Each agent is specialized — its entire system prompt focuses on one concern. A security-focused agent catches vulnerabilities that a general reviewer might miss.
Consistency: Each agent follows a fixed analysis approach. No concern gets short-changed because the reviewer got tired or distracted.
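The fan-out and combine pattern can be sketched with Python's `concurrent.futures`. This is not how the agents are implemented: the two reviewer functions below are placeholder string checks standing in for agent calls, and `run_parallel_review` is a hypothetical name. Threads are used because waiting on an agent is I/O-bound.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Reviewer = Callable[[str], List[str]]


def security_review(diff: str) -> List[str]:
    # Placeholder check; a real agent would apply a security-focused prompt.
    return ["possible hardcoded secret"] if "password =" in diff else []


def performance_review(diff: str) -> List[str]:
    # Placeholder check for a query issued inside a loop.
    return ["query inside loop"] if "for " in diff and ".execute(" in diff else []


def run_parallel_review(diff: str, reviewers: Dict[str, Reviewer]) -> Dict[str, List[str]]:
    """Run every reviewer concurrently and collect findings by agent name."""
    with ThreadPoolExecutor(max_workers=max(1, len(reviewers))) as pool:
        # Submit all reviewers first so they run at the same time...
        futures = {name: pool.submit(fn, diff) for name, fn in reviewers.items()}
        # ...then wait for each result and merge into one report.
        return {name: future.result() for name, future in futures.items()}
```

With this structure the total wall-clock time is bounded by the slowest reviewer rather than the sum of all of them, which is where the speedup over sequential passes comes from.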
The Four Review Agents
| Agent | Focus | What It Checks |
|---|---|---|
| code-review-type-safety | Type annotations & checking | Missing types, incorrect types, type errors, generic issues |
| code-review-security | Security vulnerabilities | SQL injection, XSS, exposed secrets, auth bypass, OWASP top 10 |
| code-review-architecture | Design patterns & conventions | Layer violations, DRY, YAGNI, naming, file structure compliance |
| code-review-performance | Performance & scalability | N+1 queries, inefficient algorithms, memory leaks, unnecessary computations |
Agent Output Format
Each agent returns structured findings that the main agent can parse and combine:
## [AGENT-NAME] Review
### Critical
- **[file:line]**: Issue description
- Evidence: `code snippet`
- Fix: What to do
### Major
- ...
### Minor
- ...
### Summary
- Files reviewed: N
- Issues found: X critical, Y major, Z minor
Key design element: The output format includes an instruction to the main agent: “Do NOT start fixing issues without user approval.” This prevents the main agent from automatically acting on all findings when the user just wanted a report.
How to Activate
The 4 code review agents are pre-installed in .claude/agents/. The /code-review command automatically detects these agents and switches to parallel mode. If no agents are found, it falls back to single-agent sequential review.
Customization
The pre-installed agents are generic. Customize for your project:
- Update context gathering — reference your project’s specific files and patterns
- Add project-specific checks — e.g., check for pytest coverage in architecture agent
- Adjust severity thresholds — what’s “critical” varies by project
- Tune output format — add fields relevant to your workflow
When NOT to Use All Four
Pick agents based on the feature type:
| Feature Type | Agents to Use |
|---|---|
| New API endpoint | Security + Architecture + Performance |
| Frontend component | Type Safety + Architecture |
| Database migration | Security + Architecture + Performance |
| Bug fix | Type Safety + Security |
| Documentation | Skip parallel review entirely |
6. Validation as Feedback
The Meta-Reasoning Loop
When validation catches an issue, don’t just fix the code. Ask:
- What went wrong? — Describe the specific issue
- Why did it happen? — Was the plan unclear? Did the AI drift? Was the pattern undocumented?
- Where in the system should I fix it? — Is this a plan issue, a command issue, a template issue, or a rules issue?
- How do I prevent this category of problem? — What system change prevents recurrence?
Where to Fix: Decision Framework
| Fix Location | When to Use | Example |
|---|---|---|
| Global rules (CLAUDE.md/sections) | Convention that applies to ALL tasks | “Always use structured logging” |
| On-demand context (reference/) | Task-type-specific guidance | “When building APIs, follow this contract pattern” |
| Commands (planning, execute, etc.) | Process/workflow issue | “Planning produces plans that are too long” |
| Templates (structured plan, PRD) | Output format/structure issue | “Plans are missing validation commands section” |
| Vibe planning (your prompts) | Research was incomplete or scope was wrong | “I didn’t specify the auth method clearly enough” |
The System Evolution Principle
When a command produces suboptimal output, update the command itself — don’t just one-off fix it.
This is the highest-leverage activity in the entire PIV Loop. Every system fix compounds:
- Fix the planning command once → every future plan is better
- Fix the execute command once → every future implementation is better
- Fix a template once → every future output matches the right format
Two types of improvements:
- System updates — fix the command/template for all future runs
- One-off fixes — fix the immediate output without changing the system
Always prefer system updates. One-off fixes solve today; system updates solve forever.
Practical Example
Problem: AI keeps writing 1500-line plans when you want 700-1000.
Bad response: “Make this plan shorter” (one-off fix)
Good response:
- Ask the AI to analyze WHY the plan is long (meta-reasoning)
- AI identifies: “The planning command has no line constraint, and the template encourages detailed task descriptions”
- Add a CRITICAL: Plan must be 700-1000 lines constraint to the planning command
- Add a conciseness guideline to the structured plan template
- Now every future plan respects the constraint
The Selectivity Principle
LLMs over-engineer recommendations. Be selective about which suggestions to implement.
When /system-review or /code-review produces recommendations, evaluate each one critically:
- Does this solve a real problem I’ve experienced? (Not a hypothetical one)
- Will this simplify or complicate the system? (Prefer simplification)
- Is this a pattern I’ll use repeatedly? (Not a one-time scenario)
- Does the cost of the fix justify the benefit? (Adding complexity to prevent a rare issue isn’t worth it)
AI has a bias toward adding safety nets, abstractions, and edge case handling. Often the right answer is “this is fine as-is” or “fix it when it actually becomes a problem.”
Rule of thumb: If /system-review suggests 10 improvements, implement 2-3 that address problems you’ve actually experienced. Ignore the rest until they become real.
7. Embedded Validation in Plans
Task-Level Validation
Every task in a structured plan should include its own validation step. The VALIDATE field in the task format ensures validation isn’t an afterthought:
### UPDATE src/auth/middleware.py
- **IMPLEMENT**: Add JWT token validation middleware
- **PATTERN**: Follow existing middleware pattern in `src/middleware/base.py`
- **IMPORTS**: `from jose import jwt`, `from app.core.config import settings`
- **GOTCHA**: Token expiry check must use UTC, not local time
- **VALIDATE**: `pytest tests/auth/test_middleware.py -v` — all tests pass
The VALIDATE field tells the execution agent exactly what command to run after implementing. This creates a tight feedback loop: implement → validate → fix → validate → move on.
Plan-Level Validation
At the bottom of every structured plan, there should be a VALIDATION COMMANDS section that lists all commands to run after all tasks are complete:
## VALIDATION COMMANDS
### Level 1: Syntax & Style
ruff check . --fix
ruff format .
### Level 2: Type Safety
mypy app/ --strict
### Level 3: Unit Tests
pytest tests/unit/ -v
### Level 4: Integration Tests
pytest tests/integration/ -v
The execution agent runs these in order, fixing issues at each level before proceeding to the next. This matches the pyramid — levels gate each other.
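If you want to run a plan's validation commands outside the agent, for example from a script, the section is regular enough to parse. The sketch below assumes the plan is markdown with a `## VALIDATION COMMANDS` heading and one `### Level N` subheading per level, as in the example above; `parse_validation_commands` is a hypothetical helper, not part of the PIV commands.

```python
from typing import Dict, List, Optional

FENCE = "`" * 3  # code-fence marker, skipped if the plan wraps commands in fences


def parse_validation_commands(plan_text: str) -> Dict[str, List[str]]:
    """Return {level heading: [commands]} in the order the plan lists them."""
    levels: Dict[str, List[str]] = {}
    in_section = False
    current: Optional[str] = None
    for raw_line in plan_text.splitlines():
        line = raw_line.strip()
        if line.startswith("## "):
            # Any second-level heading opens or closes the section.
            in_section = line == "## VALIDATION COMMANDS"
            current = None
        elif in_section and line.startswith("### "):
            current = line[4:]
            levels[current] = []
        elif in_section and current is not None and line and not line.startswith(FENCE):
            levels[current].append(line)
    return levels
```

Because dictionaries preserve insertion order, iterating over the result visits the levels in pyramid order as long as the plan lists them that way.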
8. Common Validation Anti-Patterns
Anti-Pattern 1: Testing Theater
Symptom: Tests pass but don’t verify real behavior. Heavy mocking, trivial assertions.
Fix: Review test files during human review (Level 5). Ask: “If I changed the implementation, would this test fail?” If not, the test is theater.
Anti-Pattern 2: Skipping Levels
Symptom: Running integration tests directly after implementation, skipping linting and type checking.
Fix: Each level is a gate. Failing Level 1 (syntax) can cause cascading failures at Level 3 (tests). Fix cheap issues first.
Anti-Pattern 3: Validating Once, Never Again
Symptom: Running validation at the end of implementation, never during.
Fix: Embed VALIDATE in every task. Run focused tests after each task, full suite only at the end.
Anti-Pattern 4: Applying All Recommendations
Symptom: Every suggestion from /code-review or /system-review gets implemented, even speculative ones.
Fix: Apply the selectivity principle. Implement only recommendations that address real, observed problems. Ignore hypothetical improvements.
Anti-Pattern 5: Human Review as Rubber Stamp
Symptom: Glancing at the diff and approving without reading. Trusting “tests pass” as sufficient.
Fix: Focus human review on new files, public interfaces, database changes, and security-sensitive code. These are where architectural drift hides.
9. Integrating Validation with the PIV Loop
During Planning (Layer 2)
Include validation strategy in every structured plan:
- What tests should exist for this feature?
- What type safety requirements apply?
- What integration points need testing?
- What should human review focus on?
The planning command should produce plans that include validation commands. If your plans don’t have a VALIDATION COMMANDS section, update the planning command and/or template.
During Implementation
The execute command should validate as it goes:
- After each task: run the task’s VALIDATE command
- After all tasks: run the plan’s validation commands in pyramid order
- Fix issues before moving to the next task or level
During Post-Implementation
Choose the appropriate validation workflow:
- Minimal: /execute → /commit
- Standard: /execute → /code-review → /code-review-fix → /commit
- Thorough: /execute → /execution-report → /code-review → /code-review-fix → /commit
- With system evolution: Add /system-review after commit
Trust Progression for Validation
Manual validation → Embedded validation → Command validation → Parallel agents → Automated CI
↑ trust & verify ↑ ↑ trust & verify ↑ ↑ trust & verify ↑
- Manual: You run tests and read diffs yourself
- Embedded: Plans include VALIDATE fields, AI runs them during execution
- Command: /code-review and /code-review-fix automate the review cycle
- Parallel agents: 4 specialized agents review simultaneously
- Automated CI: GitHub Actions + CodeRabbit handle review-fix loops
Don’t skip stages. Each tier builds trust for the next.
10. FAQ
Q: Should I run /code-review on every feature?
A: For code changes, yes. It’s fast and catches real issues. Skip it for documentation-only changes.
Q: How do I know when to use /system-review?
A: When something felt wrong during the loop — the AI made repeated mistakes, the plan was missing information, or you had to intervene more than expected. Don’t use it routinely; use it when you sense a process problem.
Q: Can I use /code-review without the parallel agents?
A: Yes. If no review agents are found in .claude/agents/, the command falls back to single-agent sequential review. Parallel agents are an optional enhancement that improves speed and depth, but they aren’t required.
Q: What if /code-review finds issues I disagree with?
A: Use the selectivity principle. The AI over-reports. Review each finding and only fix what’s actually a problem. You can scope /code-review-fix to specific severities or files.
Q: Should tests be written during planning or implementation?
A: Test strategy during planning (what to test, test structure). Test code during implementation (the AI writes tests alongside the feature code). The plan should specify what tests to write; the execute command should actually write them.
Q: How do I handle validation failures that reveal plan gaps?
A: This is the feedback loop working. When a test fails because the plan missed something:
- Fix the immediate issue (code fix)
- Note what the plan should have included
- After the loop, update the planning command or template to capture this in future plans
Q: What’s the difference between /execution-report and git log?
A: Git log shows what changed. The execution report shows why things changed — including deviations from the plan, issues encountered, workarounds attempted, and decisions made. This context is only available in the same conversation where implementation happened.
Q: How many issues should /code-review typically find?
A: Varies by feature size. For a well-planned feature: 0-2 critical, 2-5 major, 5-10 minor. If you’re consistently seeing 5+ critical issues, your planning phase needs improvement — the issues should be prevented at planning time, not caught at review time.
11. Reference Files
Commands:
- .claude/commands/code-review.md — Technical code review command
- .claude/commands/code-review-fix.md — Fix issues from code review
- .claude/commands/execution-report.md — Implementation report command
- .claude/commands/system-review.md — Divergence analysis command
Agents:
- .claude/agents/code-review-type-safety.md — Type safety reviewer
- .claude/agents/code-review-security.md — Security reviewer
- .claude/agents/code-review-architecture.md — Architecture reviewer
- .claude/agents/code-review-performance.md — Performance reviewer
Related guides:
- reference/implementation-discipline.md — The execute phase that validation follows
- reference/command-design-framework.md — How the validation commands are designed (INPUT→PROCESS→OUTPUT)
- reference/subagents-deep-dive.md — How parallel review agents work under the hood