Harness Engineering for AI Coding Agents
Your developers adopted AI coding agents and throughput jumped. Then the problems started showing up: architecture drift across repositories, inconsistent code patterns that pass tests but break in production, security controls that work in one module and get bypassed in another. The agents write code faster than anyone can review it, and the failures follow a pattern nobody designed for.
Harness engineering is the emerging discipline that fixes this. Not by restricting agents, but by wrapping them in constraints, verification loops, and feedback systems that make their output predictable and reliable. We build those harnesses for enterprise engineering teams.
Siblings Software is a software outsourcing company based in Miami, Florida, with engineering teams in Argentina working in US time zones. We have been building outsourced software teams since 2014, and our AI practice now includes harness engineering as a core capability.
Why AI Agents Need Harnesses, Not Just Prompts
The term gained traction in early 2026 after Mitchell Hashimoto (co-founder of HashiCorp) articulated a principle that many engineering teams were already discovering the hard way: whenever an AI agent makes a mistake, the fix should be a change to the system, not a change to the prompt. OpenAI formalized the concept shortly after, describing how they built an entire internal product with zero lines of manually written code by investing heavily in the scaffolding around their Codex agents rather than in the prompts that drove them.
The Thoughtworks Technology Radar (Volume 34, April 2026) identified harness engineering as one of the defining macro trends in the industry. Their assessment: the field is moving from experimentation to a desire for repeatability and stability. Agent reliability has shifted from one concern among many to one of the most critical issues facing engineering organizations.
Here is the core problem. Telling an agent "follow our coding standards" in a prompt is like telling a new hire to "write good code." It works some of the time. But LLM compliance with instructions is probabilistic, not deterministic. The agent might follow your architecture conventions perfectly on Monday and introduce a direct database call that bypasses your ORM layer on Tuesday. Prompts don't enforce; they suggest.
Harness engineering adds the enforcement layer. AGENTS.md files encode architectural boundaries in a format agents can reference. CI gates block code that violates constraints before it reaches main. Plan-Execute-Verify loops force agents through structured checkpoints. And observability dashboards track whether the harness is actually working or whether new failure patterns are emerging. Each layer reinforces the others. The constraint layer reduces failure volume, the verification layer catches what slips through, and the observability layer surfaces patterns that both layers missed.
What We Build
Our harness engineering practice covers five service areas. Most engagements start with constraint design and verification pipelines, then expand to orchestration and observability as the team's agent usage matures.
Agent Constraint Design
We write the AGENTS.md files, rules files, and constraint specifications that tell agents what they can and cannot do within your codebase. This goes well beyond project context. We encode architectural boundaries (which modules are off-limits for modification), testing requirements (what coverage is mandatory), security constraints (how credentials must be handled), and dependency rules (which packages are approved). The constraint layer is version-controlled and evolves with your codebase.
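A constraint section in such a file might read like the following. This is a minimal sketch; the module paths, coverage threshold, and file names are illustrative, not a real client configuration:

```markdown
## Architectural boundaries
- Do not modify files under `core/billing/` without an approved plan.
- All database access goes through the repository layer in `app/repositories/`;
  never execute raw SQL from application code.

## Testing requirements
- New API routes require integration tests; CI blocks merges without them.
- Minimum line coverage on changed files: 85 percent.

## Dependencies
- Only packages listed in `approved-packages.txt` may be added.
```

The point is that every bullet corresponds to a check the CI pipeline can actually enforce, which is what separates a constraint file from a context file.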
Verification Pipeline Build
We implement Plan-Execute-Verify (PEV) loops that force agents through structured checkpoints. The agent plans its approach, executes in a sandboxed environment, then the output is verified against automated tests, static analysis, architectural compliance checks, and mutation testing. Failed verification feeds back to the agent with specific remediation instructions. We wire this into your existing CI/CD infrastructure: GitHub Actions, GitLab CI, Jenkins, Azure DevOps, or CircleCI.
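Wired into GitHub Actions, the verify stage of such a pipeline might look like this sketch. The job structure is standard Actions syntax; the `make` targets and the `check_boundaries.py` script are placeholder names for whatever your build and compliance tooling actually exposes:

```yaml
# .github/workflows/agent-verify.yml (illustrative; script names are placeholders)
name: agent-pev-verify
on:
  pull_request:
    branches: [main]
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0          # full history, needed to diff against main
      - name: Run test suite
        run: make test
      - name: Static analysis
        run: make lint
      - name: Architecture compliance
        run: python scripts/check_boundaries.py --diff origin/main
```

A failing step here is what produces the remediation feedback that goes back to the agent.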
Agent Orchestration Infrastructure
For teams running multiple agents (Claude Code for complex refactoring, Copilot for routine tasks, Codex for batch operations), we build the coordination layer. Task decomposition so agents work on independent slices without merge conflicts. Sandbox environments so each agent operates in isolation. Rollback mechanisms so failed agent runs don't contaminate the main branch. This is the "agent orchestration as the new CI/CD" pattern that the industry is converging on.
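Task decomposition can be sketched as a greedy partition: tasks whose file sets do not overlap can run in concurrent agent sessions without merge conflicts. The task shape below (an `id` plus a set of `files`) is an assumption for illustration, not any specific tool's API:

```python
# Illustrative sketch: group agent tasks into batches of mutually disjoint
# file sets, so concurrent agent sessions cannot produce merge conflicts.

def partition_tasks(tasks):
    """Greedily place each task into the first batch it does not conflict with."""
    batches = []  # each entry: (set of files claimed, list of task ids)
    for task in sorted(tasks, key=lambda t: -len(t["files"])):
        for files, members in batches:
            if files.isdisjoint(task["files"]):
                files.update(task["files"])   # claim this task's files
                members.append(task["id"])
                break
        else:
            batches.append((set(task["files"]), [task["id"]]))
    return [members for _, members in batches]
```

Each returned batch can be dispatched to a separate sandboxed agent session; tasks that share files end up in different batches and run sequentially.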
Observability and Metrics
You cannot improve what you do not measure. We deploy dashboards that track agent task resolution rate, pass@1 rate (whether the agent gets it right on the first try), rework rate, and architecture drift. These integrate with DORA metrics so you can see how agent-generated code affects your broader delivery performance. The metrics surface when a harness is too loose (too many failures) or too tight (blocking valid work).
Team Enablement and Training
The hardest part of harness engineering is the human side. Your developers need to shift from writing code to writing specs, reviewing agent output, and maintaining constraint files. We run workshops on spec-driven development, build agent workflow playbooks tailored to your team, and help manage the cognitive debt that accumulates when teams delegate code generation to agents without maintaining understanding of what was built.
Our harness engineering work integrates with our AI code security practice for the security constraint layer, and draws on our AI-powered testing expertise for verification pipeline design.
The Plan-Execute-Verify Loop
PEV is the core pattern we implement across every engagement. It transforms agent coding from a "generate and hope" workflow into a structured engineering process with built-in quality gates.
Plan
The agent reads the specification, loads project context from AGENTS.md and rules files, decomposes the task into subtasks, checks constraints, and proposes an implementation approach. For high-risk changes, a human approval gate sits between plan and execution. The agent cannot proceed until the approach is signed off.
Execute
The agent generates code and tests within a sandboxed environment. Execution is bounded by the constraints defined in the plan phase. Changes are committed incrementally, not as one massive diff. If the agent attempts to modify a protected module or introduce a disallowed dependency, the constraint layer blocks the action before it reaches the codebase.
Verify
The full test suite runs at every level: unit, integration, end-to-end. Static analysis checks for code quality and security issues. Architecture compliance verifiers confirm the change respects module boundaries. Mutation testing validates that the new tests actually catch real bugs, not just exercise code paths. Pass/fail feedback goes back to the agent, which can retry with the specific failure context.
What makes PEV different from just "running tests after the agent writes code" is the planning phase. Without it, agents generate plausible-looking code that technically compiles but introduces subtle architectural problems. The constraint-informed plan reduces these issues by roughly 60 to 70 percent based on what we have seen across engagements. The remaining failures get caught in verification.
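The loop itself is simple to state in code. This is a minimal sketch assuming the harness supplies `plan`, `execute`, and `verify` callables and a `report` dict with `passed` and `failures` keys; it does not mirror any specific vendor's API:

```python
# Minimal Plan-Execute-Verify loop: verification failures are fed back to
# the agent as remediation context for the next attempt.

def pev_loop(task, plan, execute, verify, max_retries=3):
    """Run a task through plan -> execute -> verify with bounded retries."""
    proposal = plan(task)                      # constraint-informed plan
    feedback = None
    for _ in range(max_retries):
        change = execute(proposal, feedback)   # sandboxed code generation
        report = verify(change)                # tests, static analysis, compliance
        if report["passed"]:
            return change
        feedback = report["failures"]          # specific failure context for retry
    raise RuntimeError(f"verification failed after {max_retries} attempts")
```

The human approval gate for high-risk changes would sit between the `plan` call and the loop body.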
AGENTS.md: Where Constraints Live
The AGENTS.md specification, released in August 2025 as an open standard, gives agents structured project context. But context alone is not a harness. Most teams we work with have some form of AGENTS.md or CLAUDE.md file when we arrive. The files typically describe the tech stack, list some conventions, and maybe include a few "don't do this" instructions.
That is a starting point, not a constraint system. A well-engineered AGENTS.md includes architectural boundaries that map to enforceable CI checks, testing requirements tied to specific verification gates, security constraints that reference actual scanning rules, and dependency policies that match your package management configuration. The file becomes a contract between the human team and the agents, not a suggestion list.
We build AGENTS.md files that function as executable specifications. Every constraint in the file maps to a CI check that enforces it. Every testing requirement maps to a verification gate. When the file says "all database queries go through the ORM layer," there is a static analysis rule that catches direct SQL calls. When it says "integration tests required for API routes," the CI pipeline blocks merges without them.
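As a sketch of what "a static analysis rule that catches direct SQL calls" can look like, here is a deliberately simple regex-based check. In practice this would be a Semgrep or AST-based rule; the path prefix and pattern are illustrative assumptions:

```python
# Illustrative CI gate: flag raw SQL execution outside the ORM layer.
import re

RAW_SQL = re.compile(
    r"\.execute\(\s*[\"']\s*(SELECT|INSERT|UPDATE|DELETE)\b", re.IGNORECASE
)

def find_violations(path, source, allowed_prefix="app/repositories/"):
    """Return (path, line_number) pairs for raw SQL outside the ORM layer."""
    if path.startswith(allowed_prefix):
        return []  # the repository layer is allowed to issue SQL
    return [
        (path, lineno)
        for lineno, line in enumerate(source.splitlines(), 1)
        if RAW_SQL.search(line)
    ]
```

A non-empty result fails the CI job, which is what turns the sentence in AGENTS.md into an enforced boundary.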
How an Engagement Works
Most engagements follow four phases over 10 to 14 weeks. Focused engagements covering a single team or a limited set of repositories can reach full deployment in 4 to 6 weeks.
Phase 1: Agent Audit (Weeks 1-2)
We map your current agent usage across teams and repositories. Which tools are developers using? What percentage of recent commits involve AI assistance? Where are the failure clusters? We measure baseline metrics: task resolution rate, rework rate, architecture drift. The output is a prioritized inventory of harness requirements and a practical implementation plan.
Phase 2: Harness Design (Weeks 3-5)
Based on the audit, we design the constraint architecture. Which boundaries need enforcement? Where do verification gates go in your CI pipeline? What level of human oversight is appropriate for different risk levels? How should multi-agent workflows be coordinated? We make these decisions collaboratively with your team. Harness engineering does not work if it is imposed from outside.
Phase 3: Build and Integrate (Weeks 5-10)
The implementation phase. We author AGENTS.md and rules files for each repository, deploy PEV loop infrastructure, configure CI gates, set up agent sandboxing, wire in observability dashboards, and calibrate constraint thresholds. Every component is tested against your actual agent workflows, not theoretical scenarios. Thresholds are tuned to avoid over-constraining: a harness that blocks too much valid work is worse than no harness at all.
Phase 4: Handoff and Training (Weeks 10-14)
Documentation, runbooks, spec-writing workshops, and hands-on training for maintaining the harness as your codebase and agent tools evolve. We train your team to update constraints, adjust verification gates, and interpret the observability data. The goal is your team running the harness independently. We offer ongoing support as an option, but the default assumption is full ownership transfer.
The harness infrastructure connects with your existing DevOps pipelines. For teams building AI-powered products alongside their agent workflows, our AI agents development practice provides complementary engineering depth.
How We Helped a B2B SaaS Company Tame Agent-Generated Code
The Situation
A Series B SaaS platform with about 70 engineers came to us in early 2026 after hitting a wall with their AI coding tool adoption. Their team had enthusiastically adopted Claude Code and Cursor Agent six months earlier. Individual developer throughput had increased by roughly 40 percent based on their internal tracking. Leadership was pleased with the numbers.
Then three things happened in the same month. A production incident traced back to an AI-generated database migration that bypassed their ORM layer and introduced a data integrity issue affecting 12,000 customer records. A security audit found credential handling inconsistencies across 8 of their 14 repositories, all in AI-generated modules. And their lead architect realized that their monorepo had developed two competing API patterns because different agents had introduced different conventions in different services, and nobody caught the drift until it caused integration failures.
They tried fixing these problems with better prompts and more detailed system instructions. It helped for about two weeks. Then new instances of the same patterns appeared in different parts of the codebase. The engineering VP told us: "We cannot go back to the old speed, but we cannot keep shipping at this defect rate either."
What We Built
We deployed a full harness engineering implementation over 12 weeks with a five-person team: two platform engineers, two DevOps specialists, and a harness architect leading the engagement.
The critical decisions were:
- AGENTS.md files per repository with enforceable architectural boundaries. The ORM-only constraint for database access was backed by a Semgrep rule that blocked direct SQL in CI. The duplicate API pattern problem was addressed with a shared API style guide that agents had to reference.
- PEV loops in GitHub Actions that required agent-generated PRs to pass plan validation, automated testing at three levels, and architecture compliance checks before merge was allowed.
- A multi-agent coordination layer that assigned Claude Code to complex refactoring tasks and Cursor to routine feature work, with task decomposition to prevent merge conflicts between concurrent agent sessions.
- Observability dashboards tracking agent success rate, rework rate, and drift detection across all 14 repositories, integrated with their existing Datadog setup.
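The Semgrep rule backing the ORM-only constraint followed this general shape. The pattern and excluded path here are illustrative, not the client's actual rule:

```yaml
rules:
  - id: no-raw-sql-outside-orm
    pattern: $CURSOR.execute(...)
    paths:
      exclude:
        - app/repositories/
    message: Direct SQL bypasses the ORM layer; route queries through the repository API.
    languages: [python]
    severity: ERROR
```

Running this in CI on every agent-generated PR is what closed the gap that caused the data integrity incident: the constraint moved from a prompt suggestion to a merge-blocking check.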
Six months after deployment, the team maintained the 40 percent throughput gain from AI tools while reducing the defect rate below pre-agent levels. The engineering VP's summary: "We went from fighting the agents to working with them." Want to see more of our work? Visit our case studies page.
The Cognitive Debt Problem (And What Clients Usually Get Wrong)
Most companies approach harness engineering as a purely technical challenge. Build the CI gates, write the rules files, deploy the dashboards. Done. But the hardest failure mode we see is not technical. It is cognitive debt: the growing gap between what your codebase does and what your team understands about it.
When agents generate 60 or 70 percent of your code, developers lose touch with implementation details they used to know intimately. A senior engineer who used to understand every quirk of the authentication module now reviews agent-generated changes to it without the deep context they once had. The code works, the tests pass, but the human understanding erodes. Six months later, when something breaks in a way the tests did not anticipate, nobody has the mental model to debug it quickly.
This is why our engagements always include the team enablement component. We help engineering managers identify which parts of the codebase require deep human understanding (security-critical paths, data integrity logic, customer-facing payment flows) and which parts can safely become "agent-maintained territory" with lighter human oversight. The goal is not maximizing agent autonomy. It is finding the right balance between velocity and understanding for each team.
When Outsourcing Harness Engineering Makes Sense
Not always. Here is an honest breakdown.
Outsourcing Is a Good Fit When
- You have no platform engineers with agent reliability experience on staff (most teams do not, since the discipline is less than a year old)
- Agent-generated code is already causing production issues and you need harnesses operational in weeks, not quarters
- Your team uses multiple AI coding tools and you need harnesses that work across all of them
- You want the harness built once, tuned to your codebase, and handed off to your internal team to maintain
- You have more than 30 engineers using agents and the coordination overhead is becoming unmanageable
Building In-House Makes More Sense When
- You already have a strong platform engineering team that just needs guidance and a reference architecture
- Your team is small enough (under 15 engineers) that informal review processes still catch most issues
- You use a single AI coding tool and your constraint needs are straightforward
- You have 6 to 9 months to iterate on the harness without urgent production pressure
The Cost Reality
Hiring a platform engineer with agent harness experience, a DevOps specialist who can build PEV infrastructure, and an architect who understands multi-agent coordination in the US costs upward of $650,000 per year in fully loaded compensation. And you are hiring for a specialty where the talent pool is extremely thin since the discipline barely existed 18 months ago. Our nearshore model delivers the same skill set at roughly 40 to 50 percent of that cost, with engineers working in your time zone.
For project-based engagements, typical harness engineering builds range from $90,000 to $280,000 depending on scope, number of repositories, agent tools in use, and the complexity of your CI/CD infrastructure.
Measuring Harness Effectiveness
Harness engineering without measurement is guesswork. We deploy a specific set of metrics that track whether the harness is working and where it needs adjustment.
Task Resolution Rate measures the percentage of tasks an agent resolves correctly, verified by automated tests. Industry benchmarks from DORA research suggest top agents achieve 65 to 77 percent resolution rates in well-harnessed environments. Without harnesses, most teams see rates between 25 and 40 percent.
Pass@1 Rate tracks whether the agent gets it right on the first attempt without retries. This matters because retries consume compute budget and developer attention. A well-tuned harness pushes pass@1 above 60 percent for routine tasks.
Rework Rate measures what percentage of agent-generated PRs require human intervention after the first verification pass. Before harness deployment, rework rates of 50 to 60 percent are common. After deployment, we target below 15 percent.
Architecture Drift tracks how often agent-generated code violates established architectural patterns. This is the metric that surfaces the slow degradation problems that individual PR reviews miss. We integrate drift detection into your existing observability stack.
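The first three metrics fall out of per-run records directly. This sketch assumes a run log with `resolved`, `attempts`, and `human_rework` fields; the field names are assumptions about what an agent-run log might contain, not a standard schema:

```python
# Illustrative computation of harness metrics from agent run records.

def harness_metrics(runs):
    """Compute task resolution, pass@1, and rework rates over a set of runs."""
    total = len(runs)
    resolved = sum(r["resolved"] for r in runs)
    first_try = sum(r["resolved"] and r["attempts"] == 1 for r in runs)
    reworked = sum(r["human_rework"] for r in runs)
    return {
        "task_resolution_rate": resolved / total,
        "pass_at_1": first_try / total,
        "rework_rate": reworked / total,
    }
```

Tracked over time, a falling pass@1 with a stable resolution rate usually means the harness is catching more on retries; a rising rework rate means failures are slipping past verification to humans.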
How to Work With Us
Three engagement models, depending on where you are with agent adoption.
Project-Based Outsourcing
We build the complete harness engineering infrastructure and hand it over. Best for companies that want a production-ready harness without managing the build process internally. Typical duration: 10-14 weeks. Includes AGENTS.md authoring, PEV loop deployment, CI gate configuration, observability setup, and team training.
Dedicated Engineering Team
An ongoing team embedded in your organization: harness architects, platform engineers, and DevOps specialists. They maintain and evolve the harness as your agent usage grows, onboard new repositories, adjust constraints, and monitor effectiveness. Works as an extension of your platform team.
Staff Augmentation
Embed individual harness engineers into your existing platform or DevOps team. Best when you have the strategy defined but need hands-on expertise to implement PEV loops, configure agent gates, write AGENTS.md files, or build observability dashboards.
Frequently Asked Questions
What is harness engineering, and why does it matter?
Harness engineering is the discipline of designing environments, constraints, and feedback loops that make AI coding agents reliable at scale. Instead of relying on prompts alone, it wraps agents in deterministic guardrails: AGENTS.md files that encode architectural boundaries, CI gates that block non-compliant code, and Plan-Execute-Verify loops that catch failures before they reach production. It matters because AI agents generate code faster than teams can review it, and without structural constraints, the result is architecture drift, security gaps, and rework rates above 50 percent.
Which AI coding agents do you support?
We build harnesses for all major agent platforms: Claude Code, Cursor (Agent mode), GitHub Copilot Workspace, OpenAI Codex, JetBrains Junie, Augment Code, and Amazon Q Developer. Each tool has different constraint mechanisms and failure modes. We design harnesses that work across whichever combination your team uses, including multi-agent orchestration.
How much does a harness engineering engagement cost?
Project-based engagements typically range from $90,000 to $280,000 depending on repository count, agent tools in use, and team size. Dedicated teams start around $28,000 per month. Focused engagements covering a single team or limited repository set start around $50,000. We scope every engagement after an initial discovery call.
How quickly will we see results?
Basic constraint harnesses and AGENTS.md files can be operational within two weeks, and teams see a measurable drop in agent rework within a month. Full enterprise deployments with PEV loops, multi-agent orchestration, and observability take 10 to 14 weeks. Focused engagements for a single team can reach full deployment in 4 to 6 weeks.
Do your engineers work in US time zones?
Yes. Our engineering teams are based in Argentina, which shares time zones with the US East Coast. Real-time collaboration during your working hours, same-day responses, no 12-hour communication delays. For harness engineering, this matters especially because constraint design requires tight feedback loops with your development team.
Related Services