
The Complete Harness: Why Most AI Agents Are Only One-Third of the Solution

April 10, 2026 · William VanBuskirk

Everyone's building harnesses now. Linear announced coding agents. OpenAI killed Sora to focus on Codex. Anthropic went all-in on Claude Code. Notion shipped agents for work. The convergence is real.

But most of these harnesses are incomplete. They do one thing well and pretend the other dimensions don't exist. If you've spent any time trying to deploy an AI agent into a real enterprise operation, you know the gap between "demo" and "daily driver" is enormous.

I think there are three things a complete harness needs to handle. Miss any one of them and you've built a toy.

1. Probabilistic AND Deterministic

Here's the thing nobody talks about: an LLM is a terrible calculator.

I'm serious. Ask an LLM to write a SQL query against a messy production database and you'll get something that looks right, runs without errors, and returns the wrong numbers. Research shows a 39-point accuracy gap between LLM performance on SQL benchmarks (90%) and on production schemas (51%). That's not a rounding error. That's a coin flip.

But ask that same LLM to read a meeting transcript, figure out which customer project it's about, and draft a summary with action items? It's extraordinary.

The mistake most harnesses make is treating everything as an LLM call. Need to calculate margin on a quote? LLM call. Need to check inventory levels? LLM call. Need to route a message? LLM call.

No. Some things should be deterministic. SQL queries, data transforms, API calls, cron jobs, math — these should execute as code, every time, with predictable results. The LLM's job is to decide what to execute, not to be the execution.

Anthropic figured this out. When they switched from having Claude call tools directly to having Claude write code that calls tools, they got a 98.7% reduction in token usage. Same results, fraction of the cost. The insight: separate the thinking (probabilistic) from the doing (deterministic).

A complete harness needs both. The LLM reasons about what to do. Deterministic functions actually do it. The harness orchestrates the handoff. This is what we call the "turn loop" — the LLM reflects, decides on an action, dispatches to a function or script, gets the result, and reflects again. Probabilistic reasoning wrapped around deterministic execution.
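The turn loop above can be sketched in a few lines of Python. Everything here is illustrative, not our actual implementation: `llm_decide` is a stub standing in for a real model call, and the tool names are invented. The point is the shape — the probabilistic step only *chooses* an action; a registry of deterministic functions *executes* it.

```python
# A minimal turn loop: the LLM (stubbed) picks an action;
# deterministic functions do the actual work.

def check_inventory(sku: str) -> int:
    """Deterministic: same input, same output, every time."""
    stock = {"WIDGET-9": 42}          # stand-in for a real DB query
    return stock.get(sku, 0)

def compute_margin(price: float, cost: float) -> float:
    """Deterministic: math runs as code, never as a token guess."""
    return round((price - cost) / price, 4)

# Registry of deterministic tools the LLM may dispatch to.
TOOLS = {"check_inventory": check_inventory, "compute_margin": compute_margin}

def llm_decide(goal: str, history: list) -> dict:
    """Stub for the probabilistic step. A real harness would call a model
    and parse its chosen action; here one decision is hard-coded."""
    if not history:
        return {"tool": "compute_margin", "args": {"price": 120.0, "cost": 84.0}}
    return {"tool": None, "answer": f"Margin is {history[-1]:.0%}"}

def run_turn_loop(goal: str, max_turns: int = 5) -> str:
    history = []
    for _ in range(max_turns):
        decision = llm_decide(goal, history)        # probabilistic: what to do
        if decision["tool"] is None:                # reflect -> done
            return decision["answer"]
        result = TOOLS[decision["tool"]](**decision["args"])  # deterministic: do it
        history.append(result)                      # feed the result back in
    return "turn budget exhausted"

print(run_turn_loop("What's our margin on the widget quote?"))
# -> Margin is 30%
```

Notice what the model never does: arithmetic. If the margin is wrong, it's a bug in `compute_margin`, which you can test — not a hallucination you can only apologize for.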

When you don't do this, you get agents that hallucinate SQL results, miscalculate margins, and confidently present wrong numbers with a smiley face. Your users lose trust on day one.

2. Multi-Dimensional, Not Flat

Here's a question: when a production manager walks into a daily standup, are they using the same mental model as when they're doing a root cause analysis on a quality escape?

Obviously not. Same person, completely different mode of thinking. Different data, different tools, different judgment patterns, different people in the room.

Most agent architectures ignore this. They give you one agent persona with a pile of skills and hope the right one surfaces. It's like giving a surgeon a toolbox with 200 tools and saying "figure it out." A surgeon in the OR needs a scalpel. A surgeon reviewing imaging needs a PACS viewer. Same person, different dimension.

We've seen multi-agent systems try to solve this — CrewAI lets you define agents with different roles, AutoGen has group chat patterns, Microsoft's Magentic-One has an orchestrator routing to specialized agents. These are steps in the right direction, but they're thinking about it as "different agents for different roles."

The reality in enterprise is messier. It's not just different roles — it's different dimensions within the same role. That production manager has at least three dimensions:

  • Daily operations: What ran yesterday, what's behind schedule, who's out sick
  • Analytics: Trends, KPIs, capacity utilization, bottleneck analysis
  • Quality: NCRs, scrap rates, customer complaints, root cause investigation

The transitions between these dimensions are themselves a form of expertise. Knowing when to shift from "let's look at the numbers" to "let's go to the shop floor" is a skill. A complete harness needs to understand this.
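One way to make "dimension-aware" concrete: model each dimension as a named bundle of data sources, tools, and prompt framing, and let the harness switch bundles instead of piling every skill into one flat agent. The dimension names and fields below are illustrative, not a real schema, and the keyword router is a crude stand-in for an LLM classifier.

```python
from dataclasses import dataclass

# A "dimension" bundles the tools, data, and framing for one mode of work.
# Names and fields here are illustrative, not a real schema.
@dataclass
class Dimension:
    name: str
    data_sources: list
    tools: list
    framing: str   # system-prompt framing the LLM sees in this dimension

DIMENSIONS = {
    "daily_ops": Dimension("daily_ops",
        data_sources=["schedule_db", "attendance"],
        tools=["list_late_jobs", "who_is_out"],
        framing="Standup mode: short, current, actionable."),
    "analytics": Dimension("analytics",
        data_sources=["kpi_warehouse"],
        tools=["trend_query", "capacity_model"],
        framing="Analysis mode: trends, baselines, caveats."),
    "quality": Dimension("quality",
        data_sources=["ncr_log", "complaints"],
        tools=["fetch_ncrs", "fishbone_template"],
        framing="Root-cause mode: evidence before conclusions."),
}

def switch_dimension(signal: str) -> Dimension:
    """Crude keyword routing as a stand-in for an LLM classifier.
    The transition itself is the harness's job, not the user's."""
    if any(w in signal for w in ("scrap", "NCR", "complaint")):
        return DIMENSIONS["quality"]
    if any(w in signal for w in ("trend", "KPI", "utilization")):
        return DIMENSIONS["analytics"]
    return DIMENSIONS["daily_ops"]

dim = switch_dimension("why did scrap spike on line 3?")
print(dim.name)  # quality: same manager, different toolset
```

Role-based routing picks which *agent* answers; this picks which *configuration of the same agent* answers. The difference sounds small until the production manager asks a quality question in the middle of standup.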

72% of enterprise AI projects now use multi-agent architectures, up from 23% in 2024. The industry is moving here fast. But most implementations are still flat — role-based routing, not dimension-aware context switching.

3. The Output Has to Shape-Shift

This is the one that trips up every "AI-first" product I've seen.

Different users want different things from the same intelligence. Not different answers — different formats. The VP wants a dashboard. The plant manager wants a daily email digest. The engineer wants an interactive app. The ops team wants a canvas they can mark up. The IT team wants a webhook they can integrate into their monitoring stack.

Same underlying data. Same agent intelligence. Completely different output modalities.

Today's landscape gives you one modality per tool:

  • ChatGPT gives you chat
  • Claude gives you artifacts
  • Notion gives you documents
  • Retool gives you apps
  • Zapier gives you workflows

So the user context-switches. They go to one tool for the conversation, another for the document, another for the dashboard, another for the workflow. The average knowledge worker toggles between apps 1,200 times per day. That's once every 24 seconds. And every switch costs 23 minutes to fully regain deep focus.

A complete harness absorbs this. The harness context-switches so the user doesn't have to.

If a user asks a question that's best answered with a chart, build the chart. If they ask for something that needs to go to five people as an email, send the email. If the answer is really an app with interactive inputs and a database behind it, build the app. If it's a recurring workflow with a cron trigger, set up the workflow.

The harness is the shape-shifter. Same brain, different body for each task.
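The shape-shifting step can be sketched as a dispatch table: one payload, several renderers, and a judgment call about which body the answer gets. The renderers and the keyword-based `choose_modality` below are stubs for illustration; in a real harness that choice would itself be a probabilistic decision.

```python
# Same answer payload, different "body": the harness picks the output
# modality so the user stays where they are. Renderers here are stubs.

def render_chat(payload: dict) -> str:
    return f"[chat] {payload['summary']}"

def render_email(payload: dict) -> str:
    recipients = ", ".join(payload.get("to", []))
    return f"[email to {recipients}] {payload['summary']}"

def render_dashboard(payload: dict) -> str:
    return f"[dashboard] {len(payload['series'])} chart(s) built"

RENDERERS = {"chat": render_chat, "email": render_email,
             "dashboard": render_dashboard}

def choose_modality(request: str) -> str:
    """Stand-in for the harness's judgment about output shape."""
    if "send" in request or "share" in request:
        return "email"
    if "trend" in request or "chart" in request:
        return "dashboard"
    return "chat"

def shape_shift(request: str, payload: dict) -> str:
    modality = choose_modality(request)      # same brain...
    return RENDERERS[modality](payload)      # ...different body

payload = {"summary": "Line 3 OEE dipped 4% this week.",
           "to": ["pm@example.com"], "series": [[1, 2, 3]]}
print(shape_shift("share this with the PM", payload))
# -> [email to pm@example.com] Line 3 OEE dipped 4% this week.
```

Adding a new modality (canvas, webhook, workflow) means adding one renderer to the table — the reasoning side doesn't change.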

This isn't hypothetical. We've built this. A project manager at one of our pilot customers talks to myai through Teams. myai responds in Teams when it's a quick answer. When the task requires a canvas with live data, myai builds the canvas. When the PM needs something shared with the team, myai sends an email. When the answer is really a recurring report, myai sets up the workflow. The PM never leaves Teams. myai does the context switching.

The Punchline: Context Switching Is the Harness's Job

Here's how I think about it. The knowledge worker's day is fragmented. 275 interruptions during core hours. A ping every two minutes. 10 different apps, 117 emails, 153 chat messages, all day, every day.

The promise of an AI agent isn't "one more tool." It's fewer tools. The harness replaces the context switching, not the user.

That means the harness needs to:

  • Think in two modes — probabilistic reasoning for judgment calls, deterministic execution for data and scripts
  • Operate across dimensions — understanding that the same person needs different tools, data, and patterns depending on what mode of work they're in
  • Output in whatever shape the task demands — chat, canvas, email, app, dashboard, webhook, workflow

Miss any one of these and you've built a feature, not a platform.

Philipp Schmid of Hugging Face put it well: the model is the CPU, the harness is the operating system, and the agent is the application. In 2026, models absorbed roughly 80% of what multi-agent frameworks used to provide. The remaining 20% — persistence, deterministic replay, cost control, error recovery, output routing — is exactly what the harness provides.

That 20% is where the moat lives.