Q2 Platform AI Execution Harness

Status: Draft (Pending Clarence Review)
Created: 2026-03-24
Author: Uttam
Related: Executive Q2 Planning Operating Model, Honcho Integration


1. Context & Problem Statement

Current State

The Platform team (Uttam + Clarence) operates as a 2-person engineering team. Current execution model:

  • Tickets are manually implemented by humans
  • Code review is the quality bottleneck
  • No systematic AI involvement in end-to-end delivery
  • Tool decisions have been ad hoc rather than principled

Problem Statement

  1. Human bandwidth limits throughput — 2 engineers can only ship so much
  2. Review becomes the bottleneck — Even if AI writes code, human review burden doesn’t decrease proportionally
  3. No durable decision framework — New tools are evaluated case-by-case without a principled rubric
  4. Trust in AI output is inconsistent — Without verification infrastructure, AI-shipped code requires heavy manual validation

Goal

Enable 50% or more of Platform tickets to be completed end-to-end by AI agents by end of Q2 (June 30, 2026).

“End-to-end” means: AI interprets ticket → AI writes implementation → AI writes tests → AI runs verification → AI creates PR → Human spot-check review → AI merges (with safety checks).
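
That flow can be made concrete as an explicit state machine with named human checkpoints. The sketch below is illustrative only; the stage names and the approval mechanism are assumptions, not a committed design:

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Stage(Enum):
    """Ordered stages of the end-to-end flow described above."""
    INTERPRET_TICKET = auto()
    WRITE_TEST_PLAN = auto()   # human approves the plan at this checkpoint
    IMPLEMENT = auto()
    VERIFY = auto()
    OPEN_PR = auto()
    SPOT_CHECK = auto()        # human spot-check review
    MERGE = auto()             # gated by safety checks


# Stages that require an explicit human sign-off before advancing.
HUMAN_CHECKPOINTS = {Stage.WRITE_TEST_PLAN, Stage.SPOT_CHECK}


@dataclass
class TicketRun:
    ticket_id: str
    stage: Stage = Stage.INTERPRET_TICKET
    approvals: set[Stage] = field(default_factory=set)

    def advance(self) -> None:
        # Refuse to move past a human checkpoint without recorded approval.
        if self.stage in HUMAN_CHECKPOINTS and self.stage not in self.approvals:
            raise PermissionError(f"{self.stage.name} awaits human approval")
        if self.stage is Stage.MERGE:
            raise RuntimeError("run already complete")
        self.stage = Stage(self.stage.value + 1)
```

The point of modeling stages explicitly is that the two human checkpoints become enforced gates rather than conventions.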


2. Connection to Broader Goals

Executive Q2 Planning Operating Model

This project delivers against the Platform team’s requirement under the Executive Q2 Planning Operating Model:

  • Internal teams must have approved quarterly plans
  • Work must ladder into Linear (Initiative → Project → Milestone → Issue)
  • Plans must have sponsor-visible outcomes

Company-Wide AI Enablement

The harness defined here becomes the foundation for other teams (Delivery, GTM, etc.) to achieve similar AI execution rates. Platform’s role is to prove the model first.


3. Core Primitives

The harness is organized around six technology-agnostic primitives. Tool selections are mapped to primitives, but primitives are durable — new tools are judged by how well they serve these primitives.

Primitive 1: Context

Definition: The AI has access to relevant memory, history, and company knowledge to make contextually appropriate decisions.

Why Critical: Without context, AI makes naive implementation choices that require heavy human correction. Context reduces the “why did you do it this way?” review cycles.

Required Capabilities:

  • Ephemeral context (current ticket, active conversation)
  • Persistent context (past similar tickets, resolved issues)
  • Procedural context (company standards, coding conventions, SOPs)
  • Active retrieval (AI can query for relevant context when needed)

Honcho Mapping: Honcho provides long-term agent memory, user memories, and workspace context. This primitive is partially addressed by Honcho but may need supplementation for codebase-specific context.
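
To make the capability types concrete, here is a minimal interface sketch. This is a hypothetical protocol, not the Honcho SDK; it only shows the surface the harness would need, whatever backs it. "Active retrieval" is realized by the agent calling these methods on demand.

```python
from typing import Protocol


class ContextProvider(Protocol):
    """Hypothetical interface for the Context primitive (illustrative)."""

    def ephemeral(self, ticket_id: str) -> str:
        """Current ticket body and active conversation."""
        ...

    def persistent(self, query: str, limit: int = 5) -> list[str]:
        """Past similar tickets and resolved issues, via search."""
        ...

    def procedural(self, topic: str) -> str:
        """Company standards, coding conventions, SOPs."""
        ...
```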

Primitive 2: Specification

Definition: Clear, testable definition of what “done” means for any given ticket.

Why Critical: Subjective “done” creates a review bottleneck. Objective “done” (verified by automated tests) reduces human review to “does this make sense?” rather than “does this work?”

Required Capabilities:

  • Ticket contains human-written acceptance criteria
  • AI generates test plan from acceptance criteria
  • Human approves test plan (not implementation) — fast checkpoint
  • AI implements to make approved tests pass

Tool Mapping: Linear provides the ticket structure; test plan generation likely requires a custom MCP or a Honcho skill. Not yet assigned.
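
A rough sketch of the specification artifact, assuming a simple dataclass shape; all field names are illustrative. The key property is that the human checkpoint approves this object, not the implementation.

```python
from dataclasses import dataclass


@dataclass
class TestPlan:
    """AI-generated from a ticket's acceptance criteria (illustrative)."""
    ticket_id: str
    acceptance_criteria: list[str]   # human-written, copied from the ticket
    test_cases: list[str]            # AI-proposed, ideally one per criterion
    approved_by: str | None = None   # set at the human checkpoint

    def is_approved(self) -> bool:
        return self.approved_by is not None
```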

Primitive 3: Verification

Definition: Multi-layer automated proof that implementation satisfies specification.

Why Critical: A green verification run must be trustworthy. The human review burden only decreases if verification is comprehensive enough to catch errors before a human sees the code.

Required Layers:

  • Static analysis (types, lint, format) — fast, catches trivial errors
  • Unit tests (functions in isolation) — catches logic errors
  • Integration tests (components together) — catches interface errors
  • Smoke tests (app runs, basic flows work) — catches catastrophic errors

Tool Mapping: TBD; must be OSS-first and hybrid-run capable. Options include GitHub Actions, self-hosted runners, pytest/jest, Playwright.
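
Whatever stack is chosen, the layering itself can be sketched now. The runner below uses placeholder commands (ruff and pytest paths are assumptions, not decisions) and simply encodes the fastest-first, fail-fast ordering of the four layers:

```python
import subprocess

# Ordered fastest-first so cheap failures short-circuit expensive layers.
# Commands and paths are placeholders; the real stack is an open decision.
LAYERS: list[tuple[str, list[str]]] = [
    ("static", ["ruff", "check", "."]),
    ("unit", ["pytest", "tests/unit", "-q"]),
    ("integration", ["pytest", "tests/integration", "-q"]),
    ("smoke", ["pytest", "tests/smoke", "-q"]),
]


def verify() -> tuple[bool, str]:
    """Run each layer in order; stop at the first failure so the AI gets
    one categorized failure to act on (see the Observation primitive)."""
    for name, cmd in LAYERS:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return False, f"{name} layer failed:\n{result.stdout}{result.stderr}"
    return True, "all layers green"
```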

Primitive 4: Execution

Definition: The AI can actually invoke tools, run commands, query APIs, and deploy changes.

Why Critical: AI must do the work, not just suggest it. Execution capability turns the AI from advisor into implementer.

Required Capabilities:

  • Tool invocation (MCPs or equivalent)
  • Sandboxed code execution
  • File system operations (read, write, modify)
  • Deployment triggers (staging, production)

Honcho Mapping: Honcho Cron provides scheduled execution and basic triggering. MCPs provide tool interfaces. May need additional execution environment infrastructure.

Tool Mapping: MCPs (Custom MCPs TBD), deployment tooling TBD.
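
A minimal sketch of the sandboxed-execution capability, assuming a plain subprocess confined by a working directory and a wall-clock timeout. A production sandbox (containers, seccomp, network policy) is a separate, open infrastructure decision.

```python
import subprocess
from pathlib import Path


def run_sandboxed(cmd: list[str], workdir: Path, timeout_s: int = 120) -> str:
    """Confine an AI-invoked command to a working directory and a timeout;
    raises on non-zero exit or timeout. Defaults are illustrative."""
    if not workdir.is_dir():
        raise ValueError(f"workdir does not exist: {workdir}")
    result = subprocess.run(
        cmd,
        cwd=workdir,
        capture_output=True,
        text=True,
        timeout=timeout_s,  # kills runaway processes
    )
    result.check_returncode()  # surface failures to the caller
    return result.stdout
```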

Primitive 5: Observation

Definition: Both AI and humans can inspect what happened, debug failures, and trace AI decisions.

Why Critical: When AI execution fails, humans need to understand why without manual investigation; that ability is essential for iteration and trust.

Required Capabilities:

  • Execution logs with AI reasoning
  • Decision traces (why did AI choose X over Y?)
  • Failure categorization (test failure vs. dependency issue vs. unclear spec)
  • Audit trail of AI actions

Honcho Mapping: Honcho memories store execution history and reasoning. May need structured logging/monitoring supplementation.

Tool Mapping: TBD. Options include structured logging (e.g., Loki) and tracing systems.
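
One way the decision-trace capability could look, assuming plain JSON-over-logging; the schema below is illustrative and would be replaced by whatever structured logging tool is selected.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("harness.trace")


def log_decision(ticket_id: str, chose: str, over: str, because: str) -> None:
    """Emit one structured decision-trace record per AI choice so a human
    can later answer 'why X over Y?' without re-running the agent."""
    logger.info(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "ticket": ticket_id,
        "chose": chose,
        "over": over,
        "because": because,
    }))
```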

Primitive 6: Safety

Definition: Guardrails, rollback capability, and blast radius containment for AI execution.

Why Critical: One production incident from AI-shipped code erodes trust and increases future review burden. Safety enables progressive trust.

Required Capabilities:

  • Staging environment (AI deploys here first automatically)
  • Feature flags (AI-shipped code starts disabled/canary)
  • Rollback (one-click or automatic revert to pre-AI state)
  • Rate limiting (max X AI deployments per day until proven reliable)
  • Change categorization (safe vs. risky change types)

Tool Mapping: TBD. Options include PostHog (feature flags, self-hostable), Railway/Coolify (deployment with rollback), custom safety layer.
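
Two of the guardrails above, rate limiting and change categorization, are simple enough to sketch now; the daily cap and category names below are placeholders, not agreed values.

```python
from datetime import date

MAX_DEPLOYS_PER_DAY = 3  # placeholder cap until reliability is proven
SAFE_CHANGE_TYPES = {"docs", "tests", "ui-copy"}  # placeholder categories

_deploys_today: dict[date, int] = {}


def may_auto_merge(change_type: str) -> bool:
    """Gate combining change categorization (only pre-approved 'safe'
    types auto-merge) with a daily deploy cap. Staging, feature flags,
    and rollback would wrap this gate in the real harness."""
    if change_type not in SAFE_CHANGE_TYPES:
        return False  # risky change types always get a human in the loop
    used = _deploys_today.get(date.today(), 0)
    if used >= MAX_DEPLOYS_PER_DAY:
        return False
    _deploys_today[date.today()] = used + 1
    return True
```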


4. Tool Selection Principles

All tools selected for the harness must satisfy the following criteria (a simple scoring sketch follows the list):

  1. OSS-First Preference: Open source core with option for managed service
  2. Hybrid-Run Capable: Can run locally (dev) and in cloud (prod/staging)
  3. Agent-Native: Callable by AI via API/MCP, not just human GUI
  4. Proven at Scale: Evidence of production use (not experimental)
  5. Fit-to-Primitive: Maps cleanly to one or more primitives above
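
To keep evaluations comparable across candidate tools, the five principles could be applied as a numeric rubric; the 0-2 scale here is an illustrative assumption, not an agreed process.

```python
CRITERIA = ["oss_first", "hybrid_run", "agent_native",
            "proven_at_scale", "fit_to_primitive"]


def score_tool(ratings: dict[str, int]) -> int:
    """Sum 0-2 ratings per criterion so candidate tools are compared
    on the same axes rather than ad hoc (illustrative rubric)."""
    missing = set(CRITERIA) - ratings.keys()
    if missing:
        raise ValueError(f"unrated criteria: {sorted(missing)}")
    return sum(ratings[c] for c in CRITERIA)


# Example: score_tool({"oss_first": 2, "hybrid_run": 2, "agent_native": 1,
#                      "proven_at_scale": 2, "fit_to_primitive": 2})  # -> 9
```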

5. Current Tool Assignments

Primitive | Assigned Tool | Rationale | Gaps
Context | Honcho | Already decided; provides memory, workspace, user memories | Codebase-specific context (search, indexing)
Specification | TBD | n/a | Test plan generation from Linear tickets
Verification | TBD | n/a | CI/CD platform, test runners (unit, integration, smoke)
Execution | Honcho (partial) + MCPs | Honcho Cron for triggers; custom MCPs for tools | Deployment automation, sandboxed execution
Observation | Honcho (partial) | Memories store reasoning | Structured logging, failure categorization
Safety | TBD | n/a | Staging env, feature flags, rollback mechanism

6. Work Phases

Calendar weeks below are illustrative sequencing until the full Platform portfolio review with Clarence locks Q2 dates.

Phase 1: Primitive Definition & Tool Selection (Week 1: March 24-28)

  • Clarence review of 6 core primitives (this document)
  • Agreement on primitive definitions
  • Tool selection for Verification, Safety, Specification gaps
  • Document “How Platform Ships Code” playbook
  • Create Linear project + first issues

Deliverable: Approved harness plan with tool stack defined

Phase 2: Verification Infrastructure (Week 2-3: March 31 - April 11)

  • Implement CI/CD pipeline (Verification primitive)
  • Configure test layers (unit, integration, smoke)
  • Connect to repository (apps/platform/)
  • AI-triggerable verification (AI can run tests on demand)

Deliverable: Green verification pipeline that AI can invoke

Phase 3: First AI End-to-End Ticket (Week 4: April 14-18)

  • Select first ticket (small, well-scoped)
  • AI writes test plan → human approves
  • AI implements code + tests
  • Verification runs automatically
  • Human spot-check review
  • AI merges (with safety checks)

Deliverable: One ticket completed end-to-end by AI; documented process

Phase 4: Safety & Staging (Week 5-6: April 21 - May 2)

  • Implement staging environment (Safety primitive)
  • Configure feature flags for AI-shipped code
  • Add rollback mechanism
  • Rate limiting for AI deployments

Deliverable: Safe deployment path for AI-shipped code

Phase 5: Scale to 25% (Week 7-10: May 5 - May 30)

  • Process 25% of Platform tickets via AI end-to-end
  • Iterate on verification based on failure modes
  • Build “proven pattern” library (patterns that pass verification consistently)
  • Document what works vs. what still needs human heavy-lifting

Deliverable: 25% AI completion rate; pattern library v1

Phase 6: Harden for 50% (Week 11-13: June 2 - June 30)

  • Refine Specification primitive (better test plans)
  • Expand proven pattern library
  • Achieve 50% AI completion rate
  • Document harness for other teams

Deliverable: 50% AI completion rate sustained; harness documented


7. Success Metrics

Metric | Before | Target (End of Q2) | Owner
AI end-to-end completion rate | 0% | 50% | Platform team
Human review time per AI PR | Unknown | < 15 minutes | Platform team
Verification pass rate (first attempt) | N/A | 70% | Platform team
Production incidents from AI-shipped code | N/A | 0 | Clarence
Time from ticket creation to merge (AI-shipped) | Human baseline | 50% faster than human | Platform team

8. Risks & Mitigations

Risk | Mitigation | Owner
Tool selection takes > 1 week | Cap analysis at 3 days per primitive; pick “good enough” over perfect | Uttam
Verification infrastructure becomes the project | Cap Phase 2 at 2 weeks; minimal viable test layers first | Uttam
AI completion rate stays < 25% | Intermediate milestone of 10% by April 30; reassess primitives if missed | Platform team
Human review doesn’t decrease | Measure weekly; if review time doesn’t drop, investigate Specification/Verification gaps | Platform team
Production incident from AI code | Safety primitive gates (staging, flags, rollback) must be operational before any auto-merge | Clarence
Clarence/Uttam disagreement on approach | Document 2-3 options for each open primitive; agree on decision criteria upfront | Uttam

9. Open Questions

  1. Verification Tool Stack: Do we use GitHub Actions (familiar) or evaluate alternatives (self-hosted Drone, etc.)? What’s the hybrid-run requirement specifically?

  2. Safety Infrastructure: Do we have existing staging environment, or is this net-new? What feature flag system (if any) is currently in use?

  3. Clarence’s Role: Does Clarence also stay hands-off from code, or is he the primary human reviewer? This affects bandwidth planning significantly.

  4. First Ticket Selection: What is a good first test case? Suggestions: Linear cleanup task, small UI component, documentation update.

  5. “Proven Pattern” Definition: What criteria make a pattern trustworthy for reduced human review? (e.g., 5 consecutive green verifications?)


10. Next Steps

  • Review and approve primitive definitions (owner: Clarence + Uttam) — Due: March 28
  • Decision on Verification tool stack (owner: Clarence + Uttam) — Due: March 28
  • Create Linear project “Platform AI Execution Harness” with Phase 1-6 issues (owner: Uttam) — Due: March 28
  • Schedule 30-min working session to finalize tool selections (owner: Uttam) — Due: March 28
  • Begin Phase 2: Verification infrastructure setup (owner: Platform team) — Due: April 11


11. Resources

Resource | Location | Description
Honcho Documentation | https://docs.honcho.dev | Memory/context primitive documentation
Executive Q2 Planning Operating Model | executive-q2-planning-operating-model-2026.md | Planning framework this project operates within
Linear Structure Guide | linear-structure-guide.md | How to mirror this plan in Linear
Platform Initiatives Reflection | platform-initiatives-and-plans-reflection-2026-03.md | Current state of Platform Linear structure

12. Linear Execution

Proposed Linear Structure

Initiative: Platform AI Execution Harness
Target Date: 2026-06-30
Owner: [TBD - Uttam or Clarence]

Projects:

  1. Primitive Definition & Tool Selection (Target: March 28)
  2. Verification Infrastructure (Target: April 11)
  3. First AI End-to-End Ticket (Target: April 18)
  4. Safety & Staging (Target: May 2)
  5. Scale to 25% (Target: May 30)
  6. Harden for 50% (Target: June 30)

Issues to Create:

  • PLT-XXXX: Define 6 core primitives (draft → review → approve)
  • PLT-XXXX: Select Verification tool stack (CI/CD, test runners)
  • PLT-XXXX: Select Safety tool stack (staging, flags, rollback)
  • PLT-XXXX: Select Specification tooling (test plan generation)
  • PLT-XXXX: Document “How Platform Ships Code” playbook
  • PLT-XXXX: Implement CI/CD pipeline
  • PLT-XXXX: Configure unit test layer
  • PLT-XXXX: Configure integration test layer
  • PLT-XXXX: Configure smoke test layer
  • PLT-XXXX: Select and implement first AI end-to-end ticket
  • [Additional issues for Phases 4-6]

Last updated: 2026-03-24