A new paper from researchers at Shanghai Jiao Tong University, CMU, and others just dropped a framework that every enterprise AI buyer should understand. It's called "Externalization in LLM Agents," and its core argument is this: the next wave of agent capability won't come from bigger models — it will come from better infrastructure around them.
We've been building Make Yourself AI on exactly this thesis. Here's what the paper gets right, what it misses, and why it matters for anyone deploying agents in operations-heavy businesses.
The Big Idea: Externalization as Cognitive Transformation
The paper introduces a framework with four dimensions: memory, skills, protocols, and harness engineering. The insight isn't that agents need databases. It's that externalizing knowledge transforms the cognitive task itself.
A shopping list doesn't make your memory better — it converts the hard task of recall into the easy task of recognition. A skill registry doesn't make the model smarter — it converts open-ended generation into constrained composition.
This is a crucial distinction. Most enterprise AI products bolt on "memory" or "tool use" as features. The researchers argue these should be architectural primitives that fundamentally reshape what the model is asked to do.
We agree. That's exactly what we built.
Where Most Agent Systems Stop
The paper identifies four layers of memory, from ephemeral to persistent:
- Working context — what the agent is thinking about right now
- Episodic experience — records of past runs, decisions, and outcomes
- Semantic knowledge — domain facts, rules, conventions
- Personalized memory — user-specific preferences, habits, and recurring constraints
Most enterprise agent platforms live in the first and third layers: working context and semantic knowledge. They stuff context windows with RAG results and call it memory. Some track conversation history for a session. Very few build persistent episodic memory. Almost none attempt true personalization.
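To make the four layers concrete, here is a minimal sketch of what treating them as distinct stores with different lifetimes might look like. All class and field names are illustrative assumptions, not the paper's or any vendor's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    working_context: list[str] = field(default_factory=list)    # ephemeral, per-task
    episodic: list[dict] = field(default_factory=list)          # past runs and outcomes
    semantic: dict[str, str] = field(default_factory=dict)      # domain facts, rules
    personalized: dict[str, str] = field(default_factory=dict)  # user-specific judgment cues

    def record_run(self, task: str, outcome: str) -> None:
        """Persist an episode so future runs can recognize rather than recall."""
        self.episodic.append({"task": task, "outcome": outcome})

    def recall_similar(self, task: str) -> list[dict]:
        """Naive keyword lookup; a real system would use embeddings or indexes."""
        return [e for e in self.episodic if task.split()[0] in e["task"]]

memory = AgentMemory()
memory.record_run("investigate batch variance on line 3", "flagged early anomaly")
print(memory.recall_similar("investigate supplier delay"))
```

The point of the sketch is the separation: each layer has its own lifetime and retrieval pattern, so "memory" stops being one context-window blob.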
The paper is clear: personalized memory is the frontier. It's also the hardest to build because it requires understanding not just what someone knows, but how they think — their decision patterns, their interpretive frames, their judgment under ambiguity.
The Mirror Profile: Personalized Memory as Architecture
At Make Yourself AI, the core primitive is the Mirror Profile — a structured representation of an expert's perspective. Not their data. Their judgment.
A Mirror Profile captures:
- Beliefs — guiding principles that shape interpretation ("trust requires silence after vulnerability," "early shipping signals are more diagnostic than late ones")
- Interpretive modes — contextual postures the agent adopts (investigating, building trust, escalating)
- Trigger patterns — linguistic and semantic cues that activate specific beliefs
- Judgment examples — specific past situations where the expert applied a belief to an ambiguous event
In the paper's taxonomy, this spans all four memory layers simultaneously. Beliefs are semantic knowledge. Judgment examples are episodic experience. Trigger patterns are personalized memory. And at runtime, we compose these into a MindFrame — a working context payload that grounds every agent action in the expert's perspective.
This isn't memory bolted onto an agent. It's memory as the agent's cognitive architecture.
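A rough sketch of that composition step, assuming a simplified data model (the field names and trigger logic here are hypothetical, not the production Mirror Profile schema):

```python
from dataclasses import dataclass

@dataclass
class MirrorProfile:
    beliefs: dict[str, str]           # principle name -> statement (semantic layer)
    judgment_examples: list[str]      # past applications (episodic layer)
    trigger_patterns: dict[str, str]  # cue -> belief it activates (personalized layer)
    interpretive_modes: list[str]     # contextual postures the agent can adopt

def compose_mindframe(profile: MirrorProfile, event: str) -> dict:
    """Select the beliefs whose triggers fire on this event and bundle them
    into a working-context payload that grounds the agent's next action."""
    active = [belief for cue, belief in profile.trigger_patterns.items() if cue in event]
    return {
        "event": event,
        "active_beliefs": [profile.beliefs[b] for b in active if b in profile.beliefs],
        "examples": profile.judgment_examples[:2],  # ground with a few precedents
    }

profile = MirrorProfile(
    beliefs={"early_signals": "Early shipping signals are more diagnostic than late ones"},
    judgment_examples=["Flagged supplier drift two weeks before a spec breach"],
    trigger_patterns={"shipping delay": "early_signals"},
    interpretive_modes=["investigating", "building trust", "escalating"],
)
frame = compose_mindframe(profile, "minor shipping delay reported by vendor A")
print(frame["active_beliefs"])
```

Notice the transformation at work: the model is never asked "what would the expert think?" in the open. It receives a constrained frame and reasons inside it.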
Skills Aren't Tools. They're Packaged Judgment.
The paper describes a maturity curve for skills: from atomic tool calls, to managing large capability libraries, to "skill as packaged expertise" — bundled workflows with documentation and preconditions.
Most platforms are stuck at level one. They give agents access to APIs and hope prompt engineering handles the rest.
We think skills are expressions of judgment, not just wrappers around functions. When a quality engineer's mirror decides to flag a production anomaly, it's not executing a "call anomaly detection API" skill. It's applying a belief ("early signals in batch variance are worth investigating even when they're within spec") through an interpretive mode ("investigating"), producing a judgment that gets logged, reviewed, and refined.
The paper identifies four pathways for skill acquisition: authored, distilled, discovered, and composed. Our architecture supports all four — experts author beliefs directly, judgment patterns get distilled from execution traces, trigger patterns are discovered from interaction, and cross-functional canvases compose skills across domains.
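Here is one way "skill as packaged expertise" can be sketched in code: a capability bundled with its documentation and explicit preconditions, so selection becomes constrained composition. The names and the anomaly example are illustrative, not an actual API:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Skill:
    name: str
    doc: str  # when and why to use this skill, readable by agent and human
    preconditions: list[Callable[[dict], bool]] = field(default_factory=list)
    run: Callable[[dict], str] = lambda ctx: "no-op"

    def applicable(self, ctx: dict) -> bool:
        """A skill only enters the candidate set when its preconditions hold."""
        return all(check(ctx) for check in self.preconditions)

flag_anomaly = Skill(
    name="flag_batch_anomaly",
    doc="Flag batch variance worth investigating even when it is within spec.",
    preconditions=[lambda ctx: ctx.get("variance_trend") == "rising"],
    run=lambda ctx: f"flagged batch {ctx['batch_id']} for review",
)

ctx = {"batch_id": "B-1047", "variance_trend": "rising"}
if flag_anomaly.applicable(ctx):
    print(flag_anomaly.run(ctx))
```

The precondition is where the judgment lives: it encodes the belief about which signals matter, not just the mechanics of an API call.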
The Protocol Gap: Co-Evolution, Not Just Oversight
The paper's protocol dimension covers agent-tool, agent-agent, and agent-user interaction standards. For agent-user protocols, the focus is on oversight — approval gates, human-in-the-loop checkpoints, escalation policies.
This is necessary but insufficient.
At Make Yourself AI, the expert-agent relationship is governed by the Mirror Pair Contract — a co-evolutionary protocol where the expert reflects, the agent acts, actions are logged as judgments, and the expert reviews and refines. This isn't oversight. It's a growth mechanism.
Over time, the agent starts surfacing patterns the expert hadn't consciously articulated. The paper calls this the frontier of "self-evolving harnesses." We call it the natural trajectory of a well-tuned mirror: the agent becomes a thinking partner, not just a tool.
The paper's framework treats the human as a supervisor. Our architecture treats the human as a source of evolving wisdom that the system actively learns from.
Why This Matters for Enterprise Buyers
If you're evaluating agent platforms for manufacturing, operations, or portfolio management, the externalization framework gives you a diagnostic:
Ask your vendor these questions:
- Memory depth: Does the agent remember across sessions? Does it remember how you think, or just what you said? Can it distinguish between domain knowledge and personal judgment?
- Skill maturity: Are capabilities atomic API calls, or packaged expertise with preconditions and context? Can the system compose skills across functional boundaries (accounting to finance to operations)?
- Protocol sophistication: Is the human-agent relationship a gate (approve/reject) or a loop (reflect/evolve)? Does the system get better from your feedback, or just more compliant?
- Harness observability: Can you see why the agent made a decision? Can you trace a judgment back to the belief that produced it? Can you interrogate the working context that was active at decision time?
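Judgment-level observability has a simple test: every decision record should carry a pointer to the belief that produced it and a snapshot of the working context active at decision time. A minimal sketch, with an entirely hypothetical schema:

```python
beliefs = {"early_signals": "Early shipping signals are more diagnostic than late ones"}

decisions = [
    {
        "id": "J-001",
        "action": "escalated supplier delay",
        "belief_id": "early_signals",
        "context_snapshot": ["vendor A delay email", "Q3 variance report"],
    }
]

def trace(judgment_id: str) -> dict:
    """Answer 'why did the agent do that?' by walking decision -> belief -> context."""
    d = next(x for x in decisions if x["id"] == judgment_id)
    return {
        "action": d["action"],
        "because": beliefs[d["belief_id"]],
        "context_at_decision": d["context_snapshot"],
    }

print(trace("J-001")["because"])
```

If a vendor can't produce something shaped like `trace()` for an arbitrary past decision, the harness isn't observable at the judgment level.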
Most platforms will score well on basic memory and atomic skills. Very few will have answers for personalized memory, composed skills, co-evolutionary protocols, or judgment-level observability.
The Deeper Point
The paper's theoretical grounding in cognitive science — Norman's cognitive artifacts, Clark and Chalmers' extended cognition — points to something the AI industry hasn't fully absorbed: the model is not the agent. The agent is the entire system of externalized memory, skills, protocols, and infrastructure that transforms what the model is asked to do.
Bigger context windows won't solve the memory problem. More tools won't solve the skills problem. Better prompts won't solve the protocol problem.
What solves them is architecture — deliberate, structured externalization that converts hard cognitive tasks into ones models handle reliably.
That's what we've been building. The researchers just gave it a name.
The paper: "Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering" — arXiv:2604.08224, April 2026.
Make Yourself AI builds mirrored agents for operations-heavy businesses. Our architecture externalizes expert judgment — not just data — into persistent, inspectable, evolvable systems. Learn more.