
Personas Are Content, Coordination Is Structure

How driver, worker, selector, reviewer, analyst, and coordinator roles become reliable multi-agent systems instead of roleplay.

Drew Stone
Tags: agents, multi-agent, systems, self-improvement

Short answer: Multi-agent coordination is not roleplay. It becomes real when agents have contracts, authority boundaries, tool permissions, state isolation, selection rules, budgets, and traces. More agents only help when disagreement becomes useful evidence under an equal-compute gate.

The hard part is not naming the agents.

The hard part is making their disagreement useful.

A “researcher,” “critic,” “architect,” “driver,” and “supervisor” can all be the same model, with the same blind spot, reading the same context, under the same budget, producing five versions of one mistake. That is not a multi-agent system in the meaningful sense. It is a single correlated policy wearing role labels.

Multi-agent coordination becomes real when roles have separate contracts, state boundaries, tool permissions, selection rules, budgets, and traces.

The persona is content. Coordination is structure.

The Object Being Optimized

The previous post made runtime topology explicit:

g = executable graph of agents, tools, validators, selectors, handoffs, and gates
pi = runtime policy over graph moves

Multi-agent coordination sits one layer above that. It decides what the nodes are supposed to do together.

Let:

r = role contracts
p = persona and instruction content
k = active skills per role
u = tools and permissions per role
c = communication and context-sharing policy
sigma = selector or merger policy
v = verifier and judge stack
b = budget allocation policy
tau = termination and escalation policy

A multi-agent system candidate is:

s = (g, r, p, k, u, c, sigma, v, b, tau)

The optimization target is:

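J(s | m, h) =
  E_{x ~ D}[R(run(m, h, s, x))]
  - lambda * E_{x ~ D}[C(run(m, h, s, x))]

Here R scores a run, C measures its cost, and lambda sets how much score one unit of cost is worth, with m and h held fixed while s varies.

The important part is not the notation. It is the coordinate system.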

If only p changes, the optimizer is tuning role descriptions. If k changes, it is tuning durable role procedure. If g, sigma, c, b, or tau change, it is tuning coordination structure.

That distinction prevents a common category error: evaluating a better persona as if it were a better multi-agent system.
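
One way to keep that distinction honest is to diff two candidates coordinate by coordinate before comparing their scores. The sketch below is illustrative only: the `Candidate` class, the `SURFACE` table, and the string labels are assumptions, and the text does not classify r, u, or v, so they are marked "unclassified" here.

```python
from dataclasses import dataclass, fields, replace

# Which surface each coordinate belongs to. The p, k, and structure groupings
# follow the text; r, u, and v are not classified there, so they are labeled
# "unclassified" as an assumption.
SURFACE = {
    "p": "persona",
    "k": "skill",
    "g": "structure", "sigma": "structure", "c": "structure",
    "b": "structure", "tau": "structure",
    "r": "unclassified", "u": "unclassified", "v": "unclassified",
}

@dataclass(frozen=True)
class Candidate:
    """One candidate s = (g, r, p, k, u, c, sigma, v, b, tau), as opaque labels."""
    g: str
    r: str
    p: str
    k: str
    u: str
    c: str
    sigma: str
    v: str
    b: str
    tau: str

def changed_surfaces(old: Candidate, new: Candidate) -> set:
    """Name every surface on which the two candidates differ."""
    return {
        SURFACE[f.name]
        for f in fields(Candidate)
        if getattr(old, f.name) != getattr(new, f.name)
    }

base = Candidate(g="fanout-3", r="contracts-v1", p="careful critic", k="review-v1",
                 u="read-only", c="isolated", sigma="verifier-max",
                 v="unit-tests", b="equal-split", tau="max-3-rounds")
persona_tweak = replace(base, p="very careful critic")
topology_tweak = replace(base, g="fanout-5", b="front-loaded")

print(changed_surfaces(base, persona_tweak))   # {'persona'}
print(changed_surfaces(base, topology_tweak))  # {'structure'}
```

A lift that arrives with `{'persona'}` is evidence about role descriptions, not about the coordination structure.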

Persona Is Not Authority

A persona can say:

You are a careful supervisor.
You delegate independent work.
You force disagreement before consensus.
You stop when the reviewer passes the artifact.

Those instructions may improve local judgment. They do not grant runtime authority.

Authority lives in the structure:

Can this role spawn workers?
Can it choose tools?
Can it read child traces?
Can it cancel branches?
Can it override a reviewer?
Can it spend more budget?
Can it merge artifacts?
Can it promote the result?

If the answers are not represented in the runtime, the persona is aspirational. It may be useful text, but it is not a coordination contract.
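
A minimal sketch of what "represented in the runtime" could mean: each question in the list becomes an enumerated action, and a role holds an explicit grant that is checked before the action runs. `Action`, `RoleGrant`, `authorize`, and `NotAuthorized` are hypothetical names for illustration, not the API of any framework mentioned in this post.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Action(Enum):
    SPAWN_WORKER = auto()
    CHOOSE_TOOL = auto()
    READ_CHILD_TRACE = auto()
    CANCEL_BRANCH = auto()
    OVERRIDE_REVIEWER = auto()
    SPEND_BUDGET = auto()
    MERGE_ARTIFACTS = auto()
    PROMOTE_RESULT = auto()

class NotAuthorized(Exception):
    """Raised when a role attempts an action its grant does not include."""

@dataclass(frozen=True)
class RoleGrant:
    role: str
    allowed: frozenset

def authorize(grant: RoleGrant, action: Action) -> None:
    """Check the grant, not the persona text, before an action runs."""
    if action not in grant.allowed:
        raise NotAuthorized(f"{grant.role} may not {action.name}")

supervisor = RoleGrant("supervisor", frozenset({
    Action.SPAWN_WORKER, Action.READ_CHILD_TRACE,
    Action.CANCEL_BRANCH, Action.MERGE_ARTIFACTS,
}))

authorize(supervisor, Action.SPAWN_WORKER)  # permitted, returns None
try:
    authorize(supervisor, Action.OVERRIDE_REVIEWER)
except NotAuthorized as err:
    print(err)  # supervisor may not OVERRIDE_REVIEWER
```

However the persona is worded, this supervisor cannot override a reviewer, because the grant says so.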

This is why “supervisor” is an overloaded word. In one system, it means a prompt that asks an LLM to choose the next agent. In another, it means a graph node that routes state. In a third, it means a durable workflow controller with scoped budget, cancellation, replay, and audit authority. These are not interchangeable.

Roles As Contracts

A role becomes real when it has an input contract, an output contract, authority, state access, and accountability.

The minimum useful taxonomy:

| Role | Contract | Failure mode |
|---|---|---|
| Driver | Choose the next runtime move | private reasoning controls execution without trace |
| Planner | Decompose goal into scoped tasks | decomposition creates fake parallelism or missing dependencies |
| Worker | Produce an artifact under a task contract | hidden assumptions leak into final output |
| Reviewer | Find defects against a rubric | reviewer becomes vague taste instead of a gate |
| Judge | Score output or trajectory | judge labels are reused as training signal without holdout |
| Selector | Pick branch, winner, or continuation | selector optimizes agreement, not correctness |
| Coordinator | Allocate work and merge artifacts | merge erases dissent and provenance |
| Analyst | Convert traces into findings | analyst lists symptoms instead of causal failure modes |

The same LLM can occupy several roles, but the role contracts still need separation. A selector that also writes the candidate can rationalize its own work. A reviewer that never blocks promotion is commentary. A coordinator that cannot see child traces is guessing.

In code, the role split looks less like a cast list and more like a typed interface:

worker(task, context, tools, budget) -> artifact, trace
reviewer(artifact, rubric, trace) -> defects, pass
judge(artifact, task, trace) -> score, dimensions
selector(candidates, scores, budget_policy) -> winner | continue | abort
coordinator(goal, children, traces) -> merged_artifact, lineage

The point is not bureaucracy. The point is falsifiability. If a role has no contract, its contribution cannot be tested.
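
To make the falsifiability point concrete, here is a reduced sketch of two of those signatures in Python. It drops context, tools, budget, and trace to stay short, and every name in it (`Artifact`, `Review`, `keyword_reviewer`, `gated`, `sloppy_worker`) is an illustrative assumption. What it shows is a reviewer whose verdict actually blocks the artifact.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass(frozen=True)
class Artifact:
    content: str
    author: str

@dataclass(frozen=True)
class Review:
    defects: tuple
    passed: bool

# Reduced signatures: context, tools, budget, and trace are omitted for brevity.
Worker = Callable[[str], Artifact]
Reviewer = Callable[[Artifact, Sequence[str]], Review]

def keyword_reviewer(artifact: Artifact, rubric: Sequence[str]) -> Review:
    """Toy reviewer: every rubric term must appear in the artifact."""
    defects = tuple(f"missing: {term}" for term in rubric
                    if term not in artifact.content)
    return Review(defects=defects, passed=not defects)

def gated(worker: Worker, reviewer: Reviewer, task: str,
          rubric: Sequence[str]) -> tuple:
    """Return (artifact, review); the artifact is None when the review fails."""
    artifact = worker(task)
    review = reviewer(artifact, rubric)
    return (artifact if review.passed else None), review

def sloppy_worker(task: str) -> Artifact:
    return Artifact(content=f"plan for {task}", author="worker-1")

artifact, review = gated(sloppy_worker, keyword_reviewer,
                         "db migration", rubric=["rollback", "migration"])
print(artifact)        # None
print(review.defects)  # ('missing: rollback',)
```

Because the reviewer has a typed output and a gate attached to it, its contribution can be tested: swap in a reviewer that always passes and measure what ships.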

The Disagreement Problem

Multi-agent systems are usually sold as specialization. The stronger reason is controlled disagreement.

For an ensemble to help, at least one of these must be true:

  • agents see different evidence
  • agents use different tools
  • agents use different models
  • agents use different skills
  • agents explore different branches
  • agents are scored by an independent verifier
  • the selector can preserve dissent instead of forcing consensus

If every worker shares the same prompt, same model, same examples, same context, same decoding parameters, and same evaluator, the ensemble is highly correlated.

The idealized independent case is easy:

P(at least one success) = 1 - product_i(1 - q_i)

where q_i is the probability that worker i independently finds a valid answer.
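
As a worked instance of the formula, the sketch below evaluates it for five workers with q_i = 0.3. For contrast it also evaluates an assumed opposite extreme that is not in the text: all workers share one source of luck, so adding workers adds nothing. Function names are illustrative.

```python
from math import prod

def p_any_success_independent(qs):
    """P(at least one success) when workers succeed independently."""
    return 1.0 - prod(1.0 - q for q in qs)

def p_any_success_comonotone(qs):
    """Same quantity under an assumed fully shared source of luck:
    worker i succeeds exactly when one shared draw U falls below q_i,
    so at least one succeeds exactly when U < max(q_i)."""
    return max(qs)

qs = [0.3] * 5
print(round(p_any_success_independent(qs), 5))  # 0.83193
print(p_any_success_comonotone(qs))             # 0.3
```

Real systems sit somewhere between these two numbers, and where they sit depends on how correlated the workers' errors are.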

But multi-agent LLM systems rarely get independence for free. The useful quantity is not worker count. It is error correlation.

coordination_gain =
  E[score_multi at budget B] - E[score_best_single at budget B]

If the gain disappears at matched budget, the system did not learn coordination. It bought more samples.

If the gain disappears when workers use isolated context, the system may have been copying. If the gain disappears when the selector is replaced with a deterministic verifier, the selector may have been rewarding style. If the gain disappears on held-out tasks, the role split overfit the benchmark.

This is the central test: does the coordination policy create useful diversity, or just more tokens?
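
A small sketch of the matched-budget comparison, under stated assumptions: each task records one multi-agent run plus a sequence of independent single-agent attempts, all attempts are scored by the same verifier, and the baseline may spend the whole budget on repeated attempts and keep its best. `TaskResult`, `coordination_gain`, and `naive_gain` are hypothetical names, and the numbers are made up for illustration.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass(frozen=True)
class TaskResult:
    multi_score: float      # score of the multi-agent run on this task
    multi_cost: float       # what that run spent
    single_scores: tuple    # independent single-agent attempts, in order
    single_cost: float      # cost of one single-agent attempt

def coordination_gain(results, budget):
    """Mean of (multi score - best single-agent score), both capped at budget."""
    gains = []
    for r in results:
        if r.multi_cost > budget:
            raise ValueError("multi-agent run exceeded the matched budget")
        n = int(budget // r.single_cost)  # attempts the baseline can afford
        if n < 1 or n > len(r.single_scores):
            raise ValueError("not enough single-agent attempts recorded")
        gains.append(r.multi_score - max(r.single_scores[:n]))
    return mean(gains)

def naive_gain(results):
    """The unmatched comparison: multi-agent run against one single attempt."""
    return mean(r.multi_score - r.single_scores[0] for r in results)

results = [
    TaskResult(0.8, 8.0, (0.5, 0.6, 0.8, 0.7, 0.4, 0.6, 0.5, 0.7), 1.0),
    TaskResult(0.9, 8.0, (0.6, 0.9, 0.7, 0.5, 0.8, 0.6, 0.7, 0.9), 1.0),
]
print(round(naive_gain(results), 3))           # 0.3
print(coordination_gain(results, budget=8.0))  # 0.0
```

In this toy data the apparent 0.3 lift vanishes once the single agent gets the same eight units of budget, which is the "bought more samples" case.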

Lineage: From Sampling To Societies

The modern multi-agent conversation did not appear fully formed. It grew out of several older ideas.

Self-consistency sampled multiple reasoning paths and selected the most consistent answer. That is not multi-agent in the social sense, but it is the simplest version of a coordination move: generate diverse candidates, aggregate them with a rule, and beat greedy decoding on reasoning tasks.

Tree of Thoughts made the search structure more explicit. It explored coherent intermediate thoughts, evaluated them, and allowed lookahead and backtracking. Again, the key object is not a persona. It is a search policy over branches.

CAMEL pushed role-playing into agent cooperation. Its important contribution is not that agents had names. It is that role-conditioned communicative agents could be studied as a cooperative system, with inception prompting used to keep the interaction on task.

AutoGen framed multi-agent applications as configurable conversable agents, where interaction behavior can be programmed in natural language or code. That shifted attention from one prompt to conversation protocol.

Multiagent debate showed another route: multiple model instances propose and critique answers over rounds, improving factuality and reasoning in some settings. But debate also reveals the danger. More discussion is not automatically more truth. It can become persuasion, anchoring, or convergence to a fluent wrong answer unless the selector and verifier are strong.

Mixture-of-Agents made the ensemble structure more layered: agents generate outputs, later agents consume previous outputs as auxiliary information, and an aggregator improves the final response. This is close to a production pattern: proposers, aggregators, and selectors are distinct roles.

The research arc is clear:

sample many -> search branches -> assign roles -> debate -> aggregate -> orchestrate

The open engineering problem is making the orchestration measurable.

Coordination Patterns

The useful patterns are not defined by agent names. They are defined by information flow and authority.

Best-of-N

Multiple workers attempt the same task. A selector or verifier picks one.

spawn N -> score each -> return winner

This is strong when outputs are easy to score and independent attempts are cheap. It is weak when scoring is subjective or all workers share the same blind spot.
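
A minimal sketch of the pattern with stub workers and an exact-match verifier. The worker names and the toy sorting task are assumptions chosen so the example runs without a model. Note that the function returns every scored candidate, not only the winner, so the losing branches stay available to a trace.

```python
def best_of_n(task, workers, score):
    """Run every worker on the task, score each output, return (winner, all_scored).

    workers: mapping of worker name -> callable(task) -> output
    score:   callable(task, output) -> float, higher is better
    """
    outputs = [(name, fn(task)) for name, fn in workers.items()]
    scored = [(score(task, out), name, out) for name, out in outputs]
    winner = max(scored, key=lambda item: item[0])  # first best wins ties
    return winner, scored

# Toy task: sort a list. Two of the three stub workers get it wrong.
task = [3, 1, 2]
workers = {
    "a": lambda xs: list(xs),
    "b": lambda xs: sorted(xs),
    "c": lambda xs: sorted(xs, reverse=True),
}

def exact_match(task, out):
    return 1.0 if out == sorted(task) else 0.0

winner, scored = best_of_n(task, workers, exact_match)
print(winner)       # (1.0, 'b', [1, 2, 3])
print(len(scored))  # 3
```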

Self-consistency

Multiple reasoning paths produce candidate answers. The system chooses the answer supported by the most paths or highest marginal score.

sample paths -> marginalize answers -> choose stable answer

This helps when the final answer has a stable attractor and errors are diverse. It is less useful for open-ended artifact quality where many incompatible answers can all be plausible.
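
A sketch of the aggregation step as a plain majority vote over final answers. The `min_share` threshold is an assumption added here so the function can decline to answer when no attractor exists; it is not part of the pattern as described above.

```python
from collections import Counter

def self_consistent_answer(answers, min_share=0.5):
    """Return (answer, share); answer is None when no answer reaches min_share."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    share = votes / len(answers)
    return (answer if share >= min_share else None), share

# Five sampled reasoning paths that mostly agree on the final answer.
print(self_consistent_answer(["42", "42", "41", "42", "40"]))  # ('42', 0.6)

# Open-ended outputs: every path lands somewhere different, so nothing is stable.
answer, share = self_consistent_answer(["plan A", "plan B", "plan C"])
print(answer)  # None
```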

Tree search

The system expands intermediate states, evaluates partial progress, prunes weak branches, and backtracks.

expand -> evaluate -> select frontier -> continue or backtrack

This is coordination over thoughts, plans, or artifacts. It requires explicit state and a heuristic good enough to guide search.
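
One simple frontier policy is best-first search, sketched below on a toy problem: reach a target integer from 1 using "add 3" or "double". This is an illustrative stand-in, not the Tree of Thoughts algorithm itself; the state space, heuristic, and function names are all assumptions.

```python
import heapq

def best_first_search(start, expand, heuristic, is_goal, max_expansions=100):
    """Expand the most promising state first; return the path to a goal or None."""
    tie = 0  # insertion counter so the heap never has to compare states
    frontier = [(heuristic(start), tie, start, [start])]
    seen = {start}
    expansions = 0
    while frontier and expansions < max_expansions:
        _, _, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path
        expansions += 1
        for nxt in expand(state):
            if nxt in seen:
                continue
            seen.add(nxt)
            tie += 1
            heapq.heappush(frontier, (heuristic(nxt), tie, nxt, path + [nxt]))
    return None

TARGET = 10
path = best_first_search(
    start=1,
    expand=lambda x: [x + 3, x * 2],
    heuristic=lambda x: abs(TARGET - x),  # lower means closer to the target
    is_goal=lambda x: x == TARGET,
)
print(path)  # [1, 4, 7, 10]
```

The search first follows 4 -> 8 -> 11 because those states look closest, then backtracks to 7 once that branch overshoots. The explicit frontier is what makes the backtrack possible.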

Debate

Agents expose arguments, counterarguments, and revisions before a final decision.

propose -> critique -> respond -> judge

Debate is useful when hidden assumptions matter. It fails when agents optimize rhetoric, defer to the strongest voice, or converge before evidence changes.

Supervisor

A central coordinator delegates scoped tasks to workers and keeps authority over the final artifact.

supervisor -> assign -> collect -> merge -> verify

This is good for work that has clear subdomains. It fails when the supervisor has no real budget, no child trace access, or no merge discipline.

Handoff

Control transfers from one agent to another.

triage -> active specialist -> maybe hand off again

Handoffs are good when the next specialist should own the state and speak directly. They are dangerous when state transfer is implicit or context grows without boundaries.
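
A sketch of an explicit handoff, assuming state is a flat dictionary: the sender names exactly which keys cross the boundary, and a hop counter bounds how many times control can transfer. `Handoff`, `hand_off`, `HandoffRejected`, and the example state are hypothetical.

```python
from dataclasses import dataclass

class HandoffRejected(Exception):
    """Raised when a transfer would exceed the hop limit."""

@dataclass(frozen=True)
class Handoff:
    from_agent: str
    to_agent: str
    reason: str
    payload: dict   # only the state the receiver is allowed to see
    hop: int        # how many transfers this conversation has made

def hand_off(state, from_agent, to_agent, reason, allowed_keys, hop=0, max_hops=3):
    """Build a transfer record carrying an allowlisted slice of the state."""
    if hop >= max_hops:
        raise HandoffRejected(f"hop limit {max_hops} reached at {from_agent}")
    payload = {key: value for key, value in state.items() if key in allowed_keys}
    return Handoff(from_agent, to_agent, reason, payload, hop + 1)

state = {
    "user_request": "refund for a duplicate charge",
    "triage_notes": "billing issue, account verified",
    "triage_scratchpad": "long private reasoning",
}
transfer = hand_off(state, "triage", "billing", "needs billing tools",
                    allowed_keys={"user_request", "triage_notes"})
print(sorted(transfer.payload))  # ['triage_notes', 'user_request']
print(transfer.hop)              # 1
```

The scratchpad never reaches the specialist, and the record itself says who handed off to whom and why.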

Blackboard

Agents write partial results into a shared workspace. Other agents read and improve them.

workers -> shared artifact store -> reviewers -> revised artifact

This is natural for code, research, planning, and design. It needs locking, provenance, conflict resolution, and traceable authorship.
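
A minimal in-memory sketch of those requirements, using optimistic version checks in place of locks: a write must state which version it was based on, stale writes are rejected, and every accepted write keeps its author. The `Blackboard` class and its method names are assumptions for illustration.

```python
from dataclasses import dataclass

class WriteConflict(Exception):
    """Raised when a writer's view of the entry is stale."""

@dataclass(frozen=True)
class Entry:
    key: str
    value: str
    author: str
    version: int

class Blackboard:
    def __init__(self):
        self._current = {}   # key -> latest Entry
        self.history = []    # every accepted write, in order (provenance)

    def read(self, key):
        return self._current.get(key)

    def write(self, key, value, author, expected_version):
        """Accept the write only if it was based on the latest version."""
        current = self._current.get(key)
        current_version = current.version if current else 0
        if expected_version != current_version:
            raise WriteConflict(
                f"{author} expected v{expected_version}, board is at v{current_version}")
        entry = Entry(key, value, author, current_version + 1)
        self._current[key] = entry
        self.history.append(entry)
        return entry

board = Blackboard()
board.write("design", "first draft", "worker-1", expected_version=0)
seen = board.read("design")
board.write("design", "draft with fixes", "reviewer-1", expected_version=seen.version)
try:
    board.write("design", "competing draft", "worker-2", expected_version=1)
except WriteConflict as err:
    print(err)  # worker-2 expected v1, board is at v2
print([e.author for e in board.history])  # ['worker-1', 'reviewer-1']
```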

Layered mixture

One layer proposes outputs. Later layers aggregate, refine, or route.

proposers -> aggregators -> final selector

This works when the aggregator can exploit complementary model strengths. It fails when later layers smooth away critical dissent.

The Framework Map

As of June 5, 2026, major agent frameworks expose this distinction directly.

OpenAI’s Agents SDK documentation defines orchestration as which agents run, in what order, and how that decision is made. It separates LLM-driven orchestration from code-driven orchestration, names agents-as-tools and handoffs as common patterns, and explicitly says code orchestration is more deterministic and predictable for speed, cost, and performance.

AutoGen AgentChat exposes teams and multi-agent design patterns, including Selector Group Chat, Swarm, Magentic-One, and GraphFlow. The documentation names selectors, shared context, localized tool-based routing, and directed graphs as first-class concepts.

LangChain and LangGraph documentation describes multi-agent systems as coordination among specialized components, while warning that a single agent with the right tools and prompt can often be enough. Its handoff docs are especially concrete: behavior changes through state, agents can be distinct graph nodes, and context engineering determines what messages cross agent boundaries.

The shared direction is not “more personas.” It is explicit control over routing, handoff, state, context, and observability.

The Tangle Placement

In the local @tangle-network/agent-runtime source, the surface separates three coordination shapes.

The first layer is the focused multi-shot kernel:

  • runLoop: a topology-agnostic kernel over sandbox executions.
  • Driver: the topology object through plan() and decide().
  • createRefineDriver: serial attempt, validate, retry until pass or cap.
  • createFanoutVoteDriver: parallel attempts with scored winner selection.
  • AgentRunSpec: profile plus task-to-prompt formatter.
  • OutputAdapter: sandbox event stream to typed output.
  • Validator: typed output to score and pass/fail verdict.

This kernel is intentionally narrow. It owns iteration accounting, bounded concurrency, abort propagation, cost aggregation, and trace emission. It does not own persona, domain policy, output scoring, or topology. That shape is good for optimization: the mutable coordinate is the driver and profile set, not a hidden monolith.

The second layer is the multi-agent conversation substrate:

  • defineConversation: declares participants and policy before execution.
  • runConversation / runConversationStream: drive speaker turns and event streams.
  • createConversationBackend: lets a whole conversation become a participant in a larger conversation.
  • ConversationPolicy: maxTurns, maxCreditsCents, turn order, halt predicate, default call policy.
  • ConversationParticipant.authSource: per-participant billing identity, either forward the user or use agent-owned credentials.
  • ConversationJournal: resumable transcript storage, with in-memory, file, and SQL implementations.
  • turnId: deterministic per-turn id for retries and trace stitching.
  • buildForwardHeaders and DEFAULT_MAX_DEPTH: cross-gateway run, turn, parent-turn, speaker, authorization, and recursion-depth propagation.
  • CircuitBreakerState and call policy: per-participant deadlines, retries, backoff, and circuit breaking.

This is a more serious coordination surface. It makes long-running multi-agent dialogue a runtime object with durability, economics, recursion bounds, and trace correlation.

The MCP layer is a third shape, not a replacement for either of the first two. delegate_code, delegate_research, delegate_feedback, delegation_status, and delegation_history expose async fire-and-poll delegation to agents. The runtime owns the queue, feedback store, schemas, and tool projection. The product supplies the delegates. The default coder delegate is shipped through coderProfile and multiHarnessCoderFanout; researcher delegation is peer-backed through @tangle-network/agent-knowledge or an injected ResearcherDelegate, not a top-level agent-runtime/profiles export in the inspected source.

The separation is important:

runLoop = bounded multi-shot task kernel
conversation = long-horizon participant dialogue
MCP delegation = async specialist work surface

A reliable coordination stack needs the following contracts regardless of which layer hosts them:

scope:
  budget
  allowed tools
  allowed agents
  state visibility
  cancellation authority
  trace parent

assignment:
  task
  role contract
  input artifacts
  expected output
  verifier
  deadline

selection:
  candidates
  scores
  cost
  risk
  lineage
  decision rationale
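
These contracts can be written down as plain records and checked mechanically. The sketch below is generic Python, not the Tangle API. The field names follow the lists above, while `assignee`, `tools`, and `max_spend` are hypothetical additions needed only so an assignment can be checked against a scope.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scope:
    budget: float
    allowed_tools: frozenset
    allowed_agents: frozenset
    state_visibility: frozenset
    cancellation_authority: bool
    trace_parent: str

@dataclass(frozen=True)
class Assignment:
    task: str
    role_contract: str
    input_artifacts: tuple
    expected_output: str
    verifier: str
    deadline_s: float
    assignee: str        # hypothetical: who receives the task
    tools: frozenset     # hypothetical: tools the task asks for
    max_spend: float     # hypothetical: budget the task asks for

@dataclass(frozen=True)
class Selection:
    candidates: tuple
    scores: dict
    cost: float
    risk: str
    lineage: tuple
    decision_rationale: str

def scope_violations(scope, a):
    """List every way the assignment exceeds its scope (empty means it fits)."""
    problems = []
    if a.assignee not in scope.allowed_agents:
        problems.append(f"agent outside scope: {a.assignee}")
    extra = a.tools - scope.allowed_tools
    if extra:
        problems.append(f"tools outside scope: {sorted(extra)}")
    if a.max_spend > scope.budget:
        problems.append("requested spend exceeds budget")
    return problems

def is_auditable(sel):
    """A selection is auditable when every candidate is scored and explained."""
    return (all(c in sel.scores for c in sel.candidates)
            and bool(sel.lineage) and bool(sel.decision_rationale.strip()))

scope = Scope(10.0, frozenset({"search"}), frozenset({"researcher"}),
              frozenset({"goal"}), False, "run-1")
task = Assignment("summarize sources", "worker", (), "summary", "citation-check",
                  60.0, assignee="researcher", tools=frozenset({"search", "shell"}),
                  max_spend=4.0)
print(scope_violations(scope, task))  # ["tools outside scope: ['shell']"]
```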

In the local @tangle-network/agent-eval source, the package is not just a judge wrapper. It is a promotion and analysis system:

  • AgentProfileCell, AGENT_PROFILE_KINDS, buildSandboxAgentProfileCell, and toAgentProfileJson: stable cells for model, prompt, tool, skill, runtime, and harness variation.
  • runEvalCampaign: variant-by-scenario campaign runner with raw-provider capture and profile-cell checks.
  • HeldOutGate: paired promotion gate, now with a cost ceiling so lift cannot ignore budget.
  • scorecards and release confidence: longitudinal evidence, paired deltas, overfit gaps, release reports.
  • runProductionLoop: production trace clusters -> candidate improvement -> held-out gate -> PR.
  • runIntentMatchJudge, failure taxonomy, semantic judges, and multi-layer verifiers: scoring beyond one rubric prompt.
  • AnalystRegistry with DEFAULT_TRACE_ANALYST_KINDS: failure-mode, knowledge-gap, knowledge-poisoning, and improvement analysts over trace stores.
  • focused subpaths such as /optimization, /reporting, /control, /rl, /traces, /pipelines, /meta-eval, /prm, /builder-eval, /governance, and /knowledge.

For multi-agent coordination, the clean split is:

agent-runtime/conversation decides which participants spoke, under which policy
agent-runtime/loops decides which bounded workers ran
agent-runtime/mcp exposes async specialist delegation
agent-eval decides whether the resulting system was better

Multi-agent coordination without eval is theater. Eval without trace-level runtime evidence is an opinion poll.

Why More Agents Often Make Things Worse

The failure modes are predictable.

Correlated blind spots

Five agents using the same model and context may agree because they share the same missing fact.

Consensus collapse

Agents converge on the first plausible answer because nobody has authority or incentive to preserve dissent.

Selector overfitting

The selector learns to prefer fluent, long, confident, or rubric-shaped outputs instead of correct ones.

Unpriced compute

The multi-agent variant wins because it spent eight workers' worth of compute against a single-worker baseline, so the comparison measures budget rather than coordination.

Context contamination

A worker sees another worker’s answer before producing its own, so the supposed independent samples are not independent.

Merge loss

The coordinator combines outputs but drops provenance, uncertainty, and unresolved contradictions.

Authority confusion

The reviewer finds a hard failure, but the supervisor treats it as advisory feedback and ships anyway.

Trace gaps

The final answer looks good, but the system cannot show which child saw which context, used which tools, or caused which decision.

These are not edge cases. They are the default unless the coordination structure prevents them.

Evaluation Protocol

Do not ask whether a multi-agent system “feels smarter.” Ask whether it beats the right baseline.

Minimum protocol:

1. Define the task distribution and artifact contract.
2. Freeze model set, tools, prompts, skills, dataset, and evaluator where possible.
3. Compare against best single-agent and best-of-N baselines at matched budget.
4. Record child context, tool calls, artifacts, scores, selector decisions, and merge lineage.
5. Measure quality, cost, latency, branch failure rate, trace integrity, and human review load.
6. Run ablations: no debate, no shared context, no heterogeneity, deterministic selector.
7. Promote only on held-out lift with acceptable cost, latency, and failure-mode profile.

The promotion rule can be written:

promote(s_multi) if:
  LCB_95(median(score_multi - score_baseline on holdout)) > epsilon
  and median_cost_multi <= cost_ceiling
  and median_latency_multi <= latency_ceiling
  and trace_integrity == 1
  and selector_ablation_delta > 0
  and deterministic_failures == 0

The selector_ablation_delta term matters. If the multi-agent system still performs the same when the selector is replaced with a trivial rule, the sophisticated coordination may not be doing causal work.
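
The rule can be sketched as a single predicate. The text does not say how LCB_95 is computed, so this example swaps in a percentile bootstrap over the paired holdout deltas as one reasonable choice. `HoldoutEvidence`, `lcb_median`, the thresholds, and the sample numbers are all assumptions.

```python
import random
from dataclasses import dataclass, replace
from statistics import median

def lcb_median(deltas, confidence=0.95, n_boot=2000, seed=0):
    """One-sided lower confidence bound on the median via percentile bootstrap."""
    rng = random.Random(seed)  # seeded so the gate is reproducible
    medians = sorted(
        median(rng.choices(deltas, k=len(deltas))) for _ in range(n_boot)
    )
    return medians[int((1.0 - confidence) * n_boot)]

@dataclass(frozen=True)
class HoldoutEvidence:
    deltas: tuple                 # paired score_multi - score_baseline per holdout task
    median_cost: float
    median_latency: float
    trace_integrity: float        # fraction of runs with complete traces
    selector_ablation_delta: float
    deterministic_failures: int

def promote(e, epsilon, cost_ceiling, latency_ceiling):
    """Every clause must hold; a strong lift cannot buy back a failed clause."""
    return (
        lcb_median(e.deltas) > epsilon
        and e.median_cost <= cost_ceiling
        and e.median_latency <= latency_ceiling
        and e.trace_integrity == 1
        and e.selector_ablation_delta > 0
        and e.deterministic_failures == 0
    )

evidence = HoldoutEvidence(
    deltas=(0.10, 0.12, 0.15, 0.11, 0.20, 0.13, 0.10, 0.18),
    median_cost=4.0, median_latency=30.0, trace_integrity=1.0,
    selector_ablation_delta=0.05, deterministic_failures=0,
)
limits = dict(epsilon=0.02, cost_ceiling=5.0, latency_ceiling=60.0)
print(promote(evidence, **limits))                                      # True
print(promote(replace(evidence, selector_ablation_delta=0.0), **limits))  # False
```

The second call fails even though the holdout lift is unchanged.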

For open-ended work, add a disagreement audit:

disagreement_audit:
  independent evidence found?
  contradictions preserved?
  reviewer defects resolved?
  final merge cites child lineage?
  rejected branches explained?

Disagreement is useful only when it changes the final decision or improves confidence calibration.

What Optimizers Can And Cannot Do

Prompt optimizers can improve role instructions:

critic prompt
planner prompt
selector rubric
handoff description
reviewer checklist

Skill optimizers can improve durable role procedure:

how a reviewer inspects a patch
how a researcher triangulates sources
how a coordinator merges conflicting evidence
how an analyst clusters trace failures

Runtime topology optimizers can improve execution shape:

fanout width
which roles run in parallel
whether debate happens before or after evidence collection
which selector sees which fields
when branches cancel
how budget is allocated

Harness evolution can change the coordination machine itself:

new driver
new selector implementation
new trace schema
new sandbox isolation model
new promotion gate

This is where GEPA, MIPRO, SkillOpt, agent-runtime, agent-eval, and meta-harness stop looking like competitors. They operate on different mutable surfaces.

The mistake is asking one optimizer to search a surface it cannot execute.

Working Rule

Use multiple agents when the work needs at least one of these:

  • independent evidence gathering
  • heterogeneous tools or models
  • decomposable subtasks with real parallelism
  • adversarial review
  • branch search with pruning
  • scoped handoff to a specialist
  • artifact merge with provenance
  • trace analysis by multiple lenses

Do not use multiple agents when the only benefit is a richer cast list.

The engineering test is simple:

Can the system show why this role existed?
Can it show what information the role had?
Can it show what the role produced?
Can it show how the selector used or rejected that output?
Can it beat a compute-matched single-agent baseline?

If not, the coordination is not yet a system property. It is prose.

Personas can help agents think in different local modes. Coordination decides whether those modes become useful work.

Source Trail

Source freshness checked on 2026-06-06.

FAQ

What makes a multi-agent system real?

A multi-agent system is real when roles have contracts, state boundaries, tool permissions, budgets, communication policy, selectors, traces, and promotion gates. Names alone do not create coordination.

Why do more agents often make results worse?

More agents can amplify correlated errors, duplicate context, hide failed branches, spend more compute, and let weak selectors ship the wrong artifact. The useful quantity is not worker count; it is verified coordination gain at equal budget.

What should a builder optimize first?

Start with runtime topology and test-time compute: can the system execute the coordination pattern, and does it beat simpler budget-matched baselines? Then tune role contracts and skills.