Voice agents don't need a faster horse. They need a dispatcher.

May 2, 2026 · William VanBuskirk

Try to do anything substantive with a voice agent and you'll feel the wall.

Ask ChatGPT Voice to dig into your calendar. Ask Claude Voice to reason about a long document. The agent goes quiet, then comes back with something competent but thin. The texture is wrong. You can feel that the agent doing the talking and the agent doing the work are not the same thing, except they are, because that's the only architecture the consumer voice loop currently knows.

The reason this happens is structural, and once you see it, the fix is obvious.

The single-loop problem

A consumer voice agent is one loop. Voice in → speech-to-text → LLM (with a small allow-list of tools) → text-to-speech → voice out. The whole thing has to stay tight; if a tool takes more than a couple of seconds, the conversation dies. So providers ship a deliberately small allow-list: search, a calendar peek, a weather call. Anything that takes time (a real tool chain over MCP, a multi-step workflow against your actual data) doesn't make the cut.
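
To make the shape concrete, here's a minimal sketch of that loop. Every name in it (stt, runLLM, tts, speak) is a hypothetical stand-in for a provider's real stack; the point is that one await chain is the entire conversation.

```typescript
type AudioChunk = Uint8Array;

// Hypothetical stand-ins for the provider's real pipeline.
declare function stt(audio: AudioChunk): Promise<string>;
declare function runLLM(text: string, tools: string[]): Promise<string>;
declare function tts(text: string): Promise<AudioChunk>;
declare function speak(audio: AudioChunk): Promise<void>;

// The deliberately small allow-list: only tools fast enough
// to hide inside a conversational pause.
const ALLOWED_TOOLS = ["search", "calendar_peek", "weather"];

async function voiceTurn(audioIn: AudioChunk): Promise<void> {
  const text = await stt(audioIn);
  // If a tool call inside this await takes fifteen seconds,
  // the user hears fifteen seconds of silence.
  const reply = await runLLM(text, ALLOWED_TOOLS);
  await speak(await tts(reply));
}
```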

That's not a model limitation. It's an architecture choice. The agent that's listening to you is the agent that has to wait. There's nowhere to park you while it works.

If you've ever asked ChatGPT Voice why it can't see your MCP tools, that's the answer. It can. The model is the same. But the loop can't carry them. So it doesn't.

The dispatcher pattern

What if the agent that's listening to you isn't the agent doing the work?

Think of how a good front-desk person handles a hard request. They don't go silent and disappear into the back office. They keep talking to you. Let me check on that. While I'm pulling it up, what else are you trying to figure out today? OK, I just heard back. Give me one more second on the second thing... actually, here's what we have on the first one. The back office is doing real work. The front desk is doing real conversation. Both are happening at once.

That's the dispatcher pattern. The voice surface is a liaison. It's not the agent. It's the thing that knows how to talk to you while the agent works.

This is the same shape as Claude-over-MCP today. When you connect Claude to an external tool surface, Claude doesn't suddenly become an expert in your data. It stays Claude, the conversational interface, and dispatches the deep work to the surface that holds your context. The reason that pairing feels deep is that there are two systems doing two different jobs.

Make the voice surface the liaison and the same architecture works for voice. The user is talking. The dispatcher is also talking, narrating progress, asking adjacent questions, confirming the dispatch. The actual tool chain is running in the background. Results stream back into the conversation when ready, not on a poll.
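
As a sketch, assuming a hypothetical dispatch() that runs the real tool chain and streams events back, the dispatcher side looks roughly like this:

```typescript
type AgentEvent =
  | { kind: "progress"; note: string }     // what the tool chain is doing
  | { kind: "result"; summary: string };   // the finished answer

// Hypothetical: runs the real tool chain in the background and
// streams events back as they happen, not on a poll.
declare function dispatch(request: string): AsyncIterable<AgentEvent>;
declare function say(utterance: string): Promise<void>;

async function handleHardRequest(request: string): Promise<void> {
  await say("Let me check on that.");
  // The voice surface narrates each event as it arrives
  // instead of going silent until the work is done.
  for await (const event of dispatch(request)) {
    if (event.kind === "progress") {
      await say(event.note);
    } else {
      await say(event.summary);
    }
  }
}
```

A real implementation would also run the listening loop concurrently so the user can keep talking; the sketch shows just the narration half.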

What changes when you have it

The kinds of asks the agent can take change.

Today: "what's on my calendar tomorrow."

With a dispatcher: "I'm trying to figure out whether to push the customer demo. Pull up the last three weeks of work-order activity, see what's still in flight, and tell me whether the team would actually be ready by Tuesday."

The first one is a single-loop tool call. Three seconds, in and out. The second one is a real piece of judgment that needs to scan a graph, reason against several artifacts, possibly delegate to a specialized dimension. Easily fifteen seconds of work.

A voice agent that has to wait for fifteen seconds is a bad voice agent. A voice agent that can keep talking to you for fifteen seconds (narrating the dispatch, surfacing intermediate findings, asking a clarifying question that comes back into the running query) is suddenly a useful one.

Why MCP isn't quite enough

MCP is request/response with a synchronous "agent waits for tool" assumption baked into the protocol. That works fine when the agent is text-based and happy to take its turn. It breaks down for voice for the same reason the single loop does: there's no slot for "the tool is still running, but here's what we've got so far."
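
Concretely, the MCP-shaped turn reduces to a single awaited call. callTool here is a generic stand-in, not any SDK's actual binding:

```typescript
// One request, one response, nothing in between.
declare function callTool(name: string, args: object): Promise<unknown>;

async function answer(question: string): Promise<unknown> {
  // The agent takes its turn and waits. A text agent can afford
  // the silence; a voice agent cannot.
  return callTool("workorder_report", { question });
}
```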

A voice-tuned protocol on the same kind of surface needs a few specific things:

  • Streaming tool progress. The dispatcher can't narrate what it doesn't know about. Tools need to emit what they're doing, not just what they returned.
  • Mid-tool cancellation. The user changes direction mid-sentence. The dispatcher should be able to abandon the current query without crashing or hanging.
  • Concurrent dispatches. While a prior request is still running, the user keeps talking. The protocol has to multiplex.
  • Session-scoped state. Subsequent dispatches inherit the prior context without re-establishing it. The dispatcher remembers what it's working on for you across messages.

These are layerable on top of an MCP-shaped surface, but they're not what MCP itself is. They're what would sit on top of MCP if voice were a primary use case the protocol had been designed around.
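
As a sketch of what that layer's wire format might look like (these type and field names are illustrative, not a spec), one message union can carry all four properties:

```typescript
type DispatcherMessage =
  // Session-scoped state: sessionId carries context across dispatches.
  // Concurrent dispatches: requestId lets the stream multiplex.
  | { type: "dispatch"; sessionId: string; requestId: string; query: string }
  // Streaming tool progress: what the tool is doing, as it does it.
  | { type: "progress"; requestId: string; note: string }
  // Intermediate findings, usable before the tool finishes.
  | { type: "partial"; requestId: string; payload: unknown }
  | { type: "result"; requestId: string; payload: unknown }
  // Mid-tool cancellation: abandon a running query cleanly.
  | { type: "cancel"; requestId: string };
```

The detail that matters is the requestId on every message: it's what lets the user keep talking while a prior request is still in flight.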

Where this goes

The narrow version of this story is "build a voice app." A few of our design partners are asking for one (one specifically used the word voice; the others used phrases that translate to I want to ask myai things while I'm walking). The shape on our side is small: a thin iOS client, a streaming session against the existing tool/workflow surface, no app-store review cycle for the design-partner cohort.

The wider version is more interesting. The reason every consumer voice agent feels limited the moment you push on it is that the model layer has been treated as the load-bearing part of the system, and the orchestration around the model has been treated as plumbing. The dispatcher pattern inverts that: the model is the conversational fluency, the orchestration is where the actual capability lives. Once you're willing to have two systems doing two jobs, voice stops being a degraded version of chat. It becomes its own kind of interface.

We've been writing about this same shift in other places: myai is the context, your agent is the hands. This is the same idea, applied to a different surface. The agent that's good at being a conversational presence isn't the same agent that's good at thinking deeply about your work. The architecture should let both of those things be true at once.

If you've ever found yourself wishing your voice agent could just do the actual thing, the gap is not the model. It's the loop. The fix is to give the loop somewhere to send the work.


If you want to talk about voice surfaces over your context, we're spinning up an early iOS prototype for the voice dispatcher. Reach out and we'll get you in.