Q2 Platform AI Execution Harness
Status: Draft (Pending Clarence Review)
Created: 2026-03-24
Author: Uttam
Related: Executive Q2 Planning Operating Model, Honcho Integration
Quick Links:
- Executive Q2 Planning Operating Model
- Platform Team Charter (primitives, portfolio review)
- Linear: mirror harness work after Clarence sign-off (see §12); initiative/project names TBD
- Honcho Documentation: https://docs.honcho.dev
1. Context & Problem Statement
Current State
The Platform team (Uttam + Clarence) operates as a 2-person engineering team. Current execution model:
- Tickets are manually implemented by humans
- Code review is the quality bottleneck
- No systematic AI involvement in end-to-end delivery
- Tool decisions have been ad hoc rather than principled
Problem Statement
- Human bandwidth limits throughput — 2 engineers can only ship so much
- Review becomes the bottleneck — Even if AI writes code, human review burden doesn’t decrease proportionally
- No durable decision framework — New tools are evaluated case-by-case without a principled rubric
- Trust in AI output is inconsistent — Without verification infrastructure, AI-shipped code requires heavy manual validation
Goal
Enable 50% or more of Platform tickets to be completed end-to-end by AI agents by end of Q2 (June 30, 2026).
“End-to-end” means: AI interprets ticket → AI writes implementation → AI writes tests → AI runs verification → AI creates PR → Human spot-check review → AI merges (with safety checks).
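The stage flow above can be sketched as a small state machine, which is useful when the harness needs to report exactly where a ticket is stuck. This is an illustrative sketch with invented stage names, not an agreed schema:

```python
from enum import Enum, auto

class Stage(Enum):
    INTERPRET = auto()
    IMPLEMENT = auto()
    WRITE_TESTS = auto()
    VERIFY = auto()
    OPEN_PR = auto()
    HUMAN_SPOT_CHECK = auto()
    MERGE = auto()

# Legal transitions: a linear flow, with VERIFY failures and rejected
# spot-checks looping back to IMPLEMENT.
NEXT = {
    Stage.INTERPRET: [Stage.IMPLEMENT],
    Stage.IMPLEMENT: [Stage.WRITE_TESTS],
    Stage.WRITE_TESTS: [Stage.VERIFY],
    Stage.VERIFY: [Stage.OPEN_PR, Stage.IMPLEMENT],
    Stage.OPEN_PR: [Stage.HUMAN_SPOT_CHECK],
    Stage.HUMAN_SPOT_CHECK: [Stage.MERGE, Stage.IMPLEMENT],
    Stage.MERGE: [],
}

def advance(current: Stage, target: Stage) -> Stage:
    """Move a ticket to the next stage, rejecting illegal jumps."""
    if target not in NEXT[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Making the loop-back edges explicit (verify → implement, spot-check → implement) keeps the "human spot-check" gate visible in the data rather than buried in process documentation.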
2. Connection to Broader Goals
Executive Q2 Planning Operating Model
This project delivers against the Platform team’s requirement under the Executive Q2 Planning Operating Model:
- Internal teams must have approved quarterly plans
- Work must ladder into Linear (Initiative → Project → Milestone → Issue)
- Plans must have sponsor-visible outcomes
Company-Wide AI Enablement
The harness defined here becomes the foundation for other teams (Delivery, GTM, etc.) to achieve similar AI execution rates. Platform’s role is to prove the model first.
3. Core Primitives
The harness is organized around six technology-agnostic primitives. Tool selections are mapped to primitives, but primitives are durable — new tools are judged by how well they serve these primitives.
Primitive 1: Context
Definition: The AI has access to relevant memory, history, and company knowledge to make contextually appropriate decisions.
Why Critical: Without context, AI makes naive implementation choices that require heavy human correction. Context reduces the “why did you do it this way?” review cycles.
Required Capabilities:
- Ephemeral context (current ticket, active conversation)
- Persistent context (past similar tickets, resolved issues)
- Procedural context (company standards, coding conventions, SOPs)
- Active retrieval (AI can query for relevant context when needed)
Honcho Mapping: Honcho provides long-term agent memory, user memories, and workspace context. This primitive is partially addressed by Honcho but may need supplementation for codebase-specific context.
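The three stored context layers plus active retrieval can be sketched as a single query interface. The keyword match below is a placeholder for whatever memory backend is chosen (e.g. a Honcho search call); all names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class ContextBundle:
    """Container for the three stored context layers named above."""
    ephemeral: list[str] = field(default_factory=list)   # current ticket, conversation
    persistent: list[str] = field(default_factory=list)  # past similar tickets
    procedural: list[str] = field(default_factory=list)  # standards, conventions, SOPs

def assemble_context(bundle: ContextBundle, query: str) -> list[str]:
    """Active retrieval: a naive keyword match standing in for a real
    memory/search backend."""
    pool = bundle.ephemeral + bundle.persistent + bundle.procedural
    terms = query.lower().split()
    return [doc for doc in pool if any(t in doc.lower() for t in terms)]
```

The point of the shape is that the agent queries one interface and gets relevant documents from all three layers, rather than knowing where each kind of context lives.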
Primitive 2: Specification
Definition: Clear, testable definition of what “done” means for any given ticket.
Why Critical: Subjective “done” creates a review bottleneck. Objective “done” (verified by automated tests) reduces human review to “does this make sense?” rather than “does this work?”
Required Capabilities:
- Ticket contains human-written acceptance criteria
- AI generates test plan from acceptance criteria
- Human approves test plan (not implementation) — fast checkpoint
- AI implements to make approved tests pass
Tool Mapping: Linear provides ticket structure; test plan generation likely requires custom MCP or Honcho skill. Not yet assigned.
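The approval checkpoint above can be sketched as a small gating structure. Real test-plan generation would be an LLM call; this just shows the 1:1 criterion-to-case mapping and the rule that implementation is gated on an approved plan. All names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class TestPlan:
    ticket_id: str
    cases: list[str]
    approved: bool = False

def plan_from_criteria(ticket_id: str, criteria: list[str]) -> TestPlan:
    """Turn each human-written acceptance criterion into one test case stub."""
    cases = [f"test: verify that {c.rstrip('.')}" for c in criteria]
    return TestPlan(ticket_id=ticket_id, cases=cases)

def approve(plan: TestPlan) -> TestPlan:
    """The fast human checkpoint: sign off on the plan, not the implementation."""
    plan.approved = True
    return plan

def may_implement(plan: TestPlan) -> bool:
    """AI implementation is gated on an approved, non-empty plan."""
    return plan.approved and bool(plan.cases)
```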
Primitive 3: Verification
Definition: Multi-layer automated proof that implementation satisfies specification.
Why Critical: A green verification run must mean the change is trustworthy. The human review burden only decreases if verification is comprehensive enough to catch errors before a human sees the code.
Required Layers:
- Static analysis (types, lint, format) — fast, catches trivial errors
- Unit tests (functions in isolation) — catches logic errors
- Integration tests (components together) — catches interface errors
- Smoke tests (app runs, basic flows work) — catches catastrophic errors
Tool Mapping: TBD. OSS-first, hybrid-run capable required. Options include GitHub Actions, self-hosted runners, pytest/jest, Playwright.
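The layer ordering matters: cheap checks run first so trivial errors fail fast. A minimal sketch of that fail-fast runner (the layer checks would be real CI invocations in practice):

```python
from typing import Callable

def run_layers(layers: list[tuple[str, Callable[[], bool]]]) -> tuple[bool, list[str]]:
    """Run verification layers in order, cheapest first; stop at the first
    failure and report which layers passed."""
    passed: list[str] = []
    for name, check in layers:
        if not check():
            return False, passed
        passed.append(name)
    return True, passed
```

For example, `run_layers([("static", lint), ("unit", unit_tests), ...])` never pays for an integration run when lint already fails, which keeps the AI's retry loop cheap.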
Primitive 4: Execution
Definition: The AI can actually invoke tools, run commands, query APIs, and deploy changes.
Why Critical: The AI must do the work, not just suggest it. Execution capability turns the AI from an advisor into an implementer.
Required Capabilities:
- Tool invocation (MCPs or equivalent)
- Sandboxed code execution
- File system operations (read, write, modify)
- Deployment triggers (staging, production)
Honcho Mapping: Honcho Cron provides scheduled execution and basic triggering. MCPs provide tool interfaces. May need additional execution environment infrastructure.
Tool Mapping: MCPs (Custom MCPs TBD), deployment tooling TBD.
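One way to bound execution capability is a command allowlist with a hard timeout, sketched below. The allowlist contents are placeholders, and a real sandbox would also isolate filesystem and network (container, jail, etc.):

```python
import shlex
import subprocess

# Hypothetical allowlist: the harness only lets the agent invoke known commands.
ALLOWED = {"pytest", "ruff", "git"}

def run_sandboxed(command: str, timeout: int = 60) -> subprocess.CompletedProcess:
    """Execute an allowlisted command with a hard timeout; refuse anything else."""
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED:
        raise PermissionError(f"command not allowlisted: {argv[0] if argv else '(empty)'}")
    return subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
```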
Primitive 5: Observation
Definition: Both AI and humans can inspect what happened, debug failures, and trace AI decisions.
Why Critical: When AI execution fails, being able to understand why without manual investigation is essential for iteration and trust.
Required Capabilities:
- Execution logs with AI reasoning
- Decision traces (why did AI choose X over Y?)
- Failure categorization (test failure vs. dependency issue vs. unclear spec)
- Audit trail of AI actions
Honcho Mapping: Honcho memories store execution history and reasoning. May need structured logging/monitoring supplementation.
Tool Mapping: TBD. Options include structured logging (Loki, etc.), tracing systems.
Primitive 6: Safety
Definition: Guardrails, rollback capability, and blast radius containment for AI execution.
Why Critical: One production incident from AI-shipped code erodes trust and increases future review burden. Safety enables progressive trust.
Required Capabilities:
- Staging environment (AI deploys here first automatically)
- Feature flags (AI-shipped code starts disabled/canary)
- Rollback (one-click or automatic revert to pre-AI state)
- Rate limiting (max X AI deployments per day until proven reliable)
- Change categorization (safe vs. risky change types)
Tool Mapping: TBD. Options include PostHog (feature flags, self-hostable), Railway/Coolify (deployment with rollback), custom safety layer.
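Rate limiting and change categorization compose into a single deployment gate, sketched below. The daily limit and the safe-change-type set are placeholders to be agreed during tool selection:

```python
from collections import defaultdict
from datetime import date

class DeployGate:
    """Caps AI deployments per day and blocks change types not yet marked safe."""

    def __init__(self, daily_limit: int, safe_change_types: set[str]):
        self.daily_limit = daily_limit
        self.safe_change_types = safe_change_types
        self._count: dict[date, int] = defaultdict(int)

    def allow(self, change_type: str, today: date) -> bool:
        if change_type not in self.safe_change_types:
            return False  # risky change types always route to a human
        if self._count[today] >= self.daily_limit:
            return False  # daily rate limit reached
        self._count[today] += 1
        return True
```

Progressive trust then means loosening the gate over time: raising `daily_limit` and promoting change types into the safe set as the pattern library proves out.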
4. Tool Selection Principles
All tools selected for the harness must satisfy:
- OSS-First Preference: Open source core with option for managed service
- Hybrid-Run Capable: Can run locally (dev) and in cloud (prod/staging)
- Agent-Native: Callable by AI via API/MCP, not just human GUI
- Proven at Scale: Evidence of production use (not experimental)
- Fit-to-Primitive: Maps cleanly to one or more primitives above
5. Current Tool Assignments
| Primitive | Assigned Tool | Rationale | Gaps |
|---|---|---|---|
| Context | Honcho | Already decided; provides memory, workspace, user memories | Codebase-specific context (search, indexing) |
| Specification | TBD | Test plan generation from Linear tickets | |
| Verification | TBD | CI/CD platform, test runners (unit, integration, smoke) | |
| Execution | Honcho (partial) + MCPs | Honcho Cron for triggers; Custom MCPs for tools | Deployment automation, sandboxed execution |
| Observation | Honcho (partial) | Memories store reasoning | Structured logging, failure categorization |
| Safety | TBD | Staging env, feature flags, rollback mechanism | |
6. Work Phases
Calendar weeks below are illustrative sequencing until the full Platform portfolio review with Clarence locks Q2 dates.
Phase 1: Primitive Definition & Tool Selection (Week 1: March 24-28)
- Clarence review of 6 core primitives (this document)
- Agreement on primitive definitions
- Tool selection for Verification, Safety, Specification gaps
- Document “How Platform Ships Code” playbook
- Create Linear project + first issues
Deliverable: Approved harness plan with tool stack defined
Phase 2: Verification Infrastructure (Week 2-3: March 31 - April 11)
- Implement CI/CD pipeline (Verification primitive)
- Configure test layers (unit, integration, smoke)
- Connect to repository (apps/platform/)
- AI-triggerable verification (AI can run tests on demand)
Deliverable: Green verification pipeline that AI can invoke
Phase 3: First AI End-to-End Ticket (Week 4: April 14-18)
- Select first ticket (small, well-scoped)
- AI writes test plan → human approves
- AI implements code + tests
- Verification runs automatically
- Human spot-check review
- AI merges (with safety checks)
Deliverable: One ticket completed end-to-end by AI; documented process
Phase 4: Safety & Staging (Week 5-6: April 21 - May 2)
- Implement staging environment (Safety primitive)
- Configure feature flags for AI-shipped code
- Add rollback mechanism
- Rate limiting for AI deployments
Deliverable: Safe deployment path for AI-shipped code
Phase 5: Scale to 25% (Week 7-10: May 5 - May 30)
- Process 25% of Platform tickets via AI end-to-end
- Iterate on verification based on failure modes
- Build “proven pattern” library (patterns that pass verification consistently)
- Document what works vs. what still needs human heavy-lifting
Deliverable: 25% AI completion rate; pattern library v1
Phase 6: Harden for 50% (Week 11-13: June 2 - June 30)
- Refine Specification primitive (better test plans)
- Expand proven pattern library
- Achieve 50% AI completion rate
- Document harness for other teams
Deliverable: 50% AI completion rate sustained; harness documented
7. Success Metrics
| Metric | Before | Target (End of Q2) | Owner |
|---|---|---|---|
| AI end-to-end completion rate | 0% | 50% | Platform team |
| Human review time per AI PR | Unknown | < 15 minutes | Platform team |
| Verification pass rate (first attempt) | N/A | 70% | Platform team |
| Production incidents from AI-shipped code | N/A | 0 | Clarence |
| Time from ticket creation to merge (AI-shipped) | Human baseline | 50% faster than human | Platform team |
8. Risks & Mitigations
| Risk | Mitigation | Owner |
|---|---|---|
| Tool selection takes > 1 week | Cap analysis at 3 days per primitive; pick “good enough” over perfect | Uttam |
| Verification infrastructure becomes the project | Cap Phase 2 at 2 weeks; minimal viable test layers first | Uttam |
| AI completion rate stays < 25% | Intermediate milestone at 10% by April 30; reassess primitives if missed | Platform team |
| Human review doesn’t decrease | Measure weekly; if review time doesn’t drop, investigate Specification/Verification gaps | Platform team |
| Production incident from AI code | Safety primitive gates (staging, flags, rollback) must be operational before any auto-merge | Clarence |
| Clarence/Uttam disagreement on approach | Document 2-3 options for each open primitive; decision criteria agreed upfront | Uttam |
9. Open Questions
- Verification Tool Stack: Do we use GitHub Actions (familiar) or evaluate alternatives (self-hosted Drone, etc.)? What is the hybrid-run requirement, specifically?
- Safety Infrastructure: Do we have an existing staging environment, or is this net-new? What feature flag system (if any) is currently in use?
- Clarence’s Role: Does Clarence stay hands-off from code, or is he the primary human reviewer? This affects bandwidth planning significantly.
- First Ticket Selection: What is a good first test case? Suggestions: Linear cleanup task, small UI component, documentation update.
- “Proven Pattern” Definition: What criteria make a pattern trustworthy for reduced human review? (e.g., 5 consecutive green verifications?)
10. Next Steps
- Review and approve primitive definitions (owner: Clarence + Uttam) — Due: March 28
- Decision on Verification tool stack (owner: Clarence + Uttam) — Due: March 28
- Create Linear project “Platform AI Execution Harness” with Phase 1-6 issues (owner: Uttam) — Due: March 28
- Schedule 30-min working session to finalize tool selections (owner: Uttam) — Due: March 28
- Begin Phase 2: Verification infrastructure setup (owner: Platform team) — Due: April 11
11. Related Resources
| Resource | Location | Description |
|---|---|---|
| Honcho Documentation | https://docs.honcho.dev | Memory/context primitive documentation |
| Executive Q2 Planning Operating Model | executive-q2-planning-operating-model-2026.md | Planning framework this project operates within |
| Linear Structure Guide | linear-structure-guide.md | How to mirror this plan in Linear |
| Platform Initiatives Reflection | platform-initiatives-and-plans-reflection-2026-03.md | Current state of Platform Linear structure |
12. Linear Execution
Proposed Linear Structure
Initiative: Platform AI Execution Harness
Target Date: 2026-06-30
Owner: [TBD - Uttam or Clarence]
Projects:
- Primitive Definition & Tool Selection (Target: March 28)
- Verification Infrastructure (Target: April 11)
- First AI End-to-End Ticket (Target: April 18)
- Safety & Staging (Target: May 2)
- Scale to 25% (Target: May 30)
- Harden for 50% (Target: June 30)
Issues to Create:
- PLT-XXXX: Define 6 core primitives (draft → review → approve)
- PLT-XXXX: Select Verification tool stack (CI/CD, test runners)
- PLT-XXXX: Select Safety tool stack (staging, flags, rollback)
- PLT-XXXX: Select Specification tooling (test plan generation)
- PLT-XXXX: Document “How Platform Ships Code” playbook
- PLT-XXXX: Implement CI/CD pipeline
- PLT-XXXX: Configure unit test layer
- PLT-XXXX: Configure integration test layer
- PLT-XXXX: Configure smoke test layer
- PLT-XXXX: Select and implement first AI end-to-end ticket
- [Additional issues for Phases 4-6]
Last updated: 2026-03-24