Short answer: Multi-agent coordination is not roleplay. It becomes real when agents have contracts, authority boundaries, tool permissions, state isolation, selection rules, budgets, and traces. More agents only help when disagreement becomes useful evidence under an equal-compute gate.
The hard part is not naming the agents.
The hard part is making their disagreement useful.
A “researcher,” “critic,” “architect,” “driver,” and “supervisor” can all be the same model, with the same blind spot, reading the same context, under the same budget, producing five versions of one mistake. That is not a multi-agent system in the meaningful sense. It is a single correlated policy wearing role labels.
Multi-agent coordination becomes real when roles have separate contracts, state boundaries, tool permissions, selection rules, budgets, and traces.
The persona is content. Coordination is structure.
The Object Being Optimized
The previous post made runtime topology explicit:
g = executable graph of agents, tools, validators, selectors, handoffs, and gates
pi = runtime policy over graph moves
Multi-agent coordination sits one layer above that. It decides what the nodes are supposed to do together.
Let:
r = role contracts
p = persona and instruction content
k = active skills per role
u = tools and permissions per role
c = communication and context-sharing policy
sigma = selector or merger policy
v = verifier and judge stack
b = budget allocation policy
tau = termination and escalation policy
A multi-agent system candidate is:
s = (g, r, p, k, u, c, sigma, v, b, tau)
The optimization target is:
J(s | m, h) =
E_{x ~ D}[R(run(m, h, s, x))]
- lambda * E[C(run(m, h, s, x))]
The important part is not the notation. It is the coordinate system.
If only p changes, the optimizer is tuning role descriptions. If k changes, it is tuning durable role procedure. If g, sigma, c, b, or tau change, it is tuning coordination structure.
That distinction prevents a common category error: evaluating a better persona as if it were a better multi-agent system.
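One way to keep that coordinate system honest is to diff two candidates and report which surface actually moved. The sketch below is illustrative only: the `Candidate` class, its string-valued fields, and `changed_surfaces` are hypothetical names, and the grouping covers just the coordinates the text classifies (p, k, and g, sigma, c, b, tau).

```python
from dataclasses import dataclass, fields

# Hypothetical container for s = (g, r, p, k, u, c, sigma, v, b, tau).
# Values are opaque version strings here; a real system would hold richer objects.
@dataclass(frozen=True)
class Candidate:
    g: str = "graph-v1"
    r: str = "roles-v1"
    p: str = "personas-v1"
    k: str = "skills-v1"
    u: str = "tools-v1"
    c: str = "comms-v1"
    sigma: str = "selector-v1"
    v: str = "verifiers-v1"
    b: str = "budget-v1"
    tau: str = "termination-v1"

# Grouping taken from the text. r, u, and v are not classified there,
# so they are reported under their own coordinate names.
SURFACE = {
    "p": "role description",
    "k": "role procedure",
    "g": "coordination structure",
    "sigma": "coordination structure",
    "c": "coordination structure",
    "b": "coordination structure",
    "tau": "coordination structure",
}

def changed_surfaces(old: Candidate, new: Candidate) -> set:
    """Return the surfaces that differ between two candidates."""
    changed = set()
    for f in fields(Candidate):
        if getattr(old, f.name) != getattr(new, f.name):
            changed.add(SURFACE.get(f.name, f.name))
    return changed

base = Candidate()
persona_only = Candidate(p="personas-v2")
restructured = Candidate(p="personas-v2", sigma="selector-v2")

persona_diff = changed_surfaces(base, persona_only)
structure_diff = changed_surfaces(base, restructured)
```

If an experiment reports a multi-agent win and the diff contains only "role description", the claim is about a persona, not about coordination.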
Persona Is Not Authority
A persona can say:
You are a careful supervisor.
You delegate independent work.
You force disagreement before consensus.
You stop when the reviewer passes the artifact.
Those instructions may improve local judgment. They do not grant runtime authority.
Authority lives in the structure:
Can this role spawn workers?
Can it choose tools?
Can it read child traces?
Can it cancel branches?
Can it override a reviewer?
Can it spend more budget?
Can it merge artifacts?
Can it promote the result?
If the answers are not represented in the runtime, the persona is aspirational. It may be useful text, but it is not a coordination contract.
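A minimal sketch of what "represented in the runtime" could mean: the questions above become explicit flags that the runtime checks before acting. `Authority`, `AuthorityError`, and `require` are hypothetical names for illustration, not any framework's API.

```python
from dataclasses import dataclass

# Each flag mirrors one of the authority questions in the text.
# Everything defaults to False: authority must be granted, not assumed.
@dataclass(frozen=True)
class Authority:
    spawn_workers: bool = False
    choose_tools: bool = False
    read_child_traces: bool = False
    cancel_branches: bool = False
    override_reviewer: bool = False
    spend_budget: bool = False
    merge_artifacts: bool = False
    promote_result: bool = False

class AuthorityError(PermissionError):
    """Raised when a role attempts an action its contract does not grant."""

def require(role: str, authority: Authority, action: str) -> None:
    # Unknown action names are denied as well, so a typo cannot grant power.
    if not getattr(authority, action, False):
        raise AuthorityError(f"{role} lacks runtime authority: {action}")

supervisor = Authority(spawn_workers=True, read_child_traces=True, merge_artifacts=True)

require("supervisor", supervisor, "spawn_workers")  # allowed, returns None

try:
    require("supervisor", supervisor, "override_reviewer")
    denied = None
except AuthorityError as err:
    denied = str(err)
```

The persona text can still say "careful supervisor". The difference is that the override attempt now fails in code instead of succeeding silently.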
This is why “supervisor” is an overloaded word. In one system, it means a prompt that asks an LLM to choose the next agent. In another, it means a graph node that routes state. In a third, it means a durable workflow controller with scoped budget, cancellation, replay, and audit authority. These are not interchangeable.
Roles As Contracts
A role becomes real when it has an input contract, an output contract, authority, state access, and accountability.
The minimum useful taxonomy:
| Role | Contract | Failure mode |
|---|---|---|
| Driver | Choose the next runtime move | private reasoning controls execution without trace |
| Planner | Decompose goal into scoped tasks | decomposition creates fake parallelism or missing dependencies |
| Worker | Produce an artifact under a task contract | hidden assumptions leak into final output |
| Reviewer | Find defects against a rubric | reviewer becomes vague taste instead of a gate |
| Judge | Score output or trajectory | judge labels are reused as training signal without holdout |
| Selector | Pick branch, winner, or continuation | selector optimizes agreement, not correctness |
| Coordinator | Allocate work and merge artifacts | merge erases dissent and provenance |
| Analyst | Convert traces into findings | analyst lists symptoms instead of causal failure modes |
The same LLM can occupy several roles, but the role contracts still need separation. A selector that also writes the candidate can rationalize its own work. A reviewer that never blocks promotion is commentary. A coordinator that cannot see child traces is guessing.
In code, the role split looks less like a cast list and more like a typed interface:
worker(task, context, tools, budget) -> artifact, trace
reviewer(artifact, rubric, trace) -> defects, pass
judge(artifact, task, trace) -> score, dimensions
selector(candidates, scores, budget_policy) -> winner | continue | abort
coordinator(goal, children, traces) -> merged_artifact, lineage
The point is not bureaucracy. The point is falsifiability. If a role has no contract, its contribution cannot be tested.
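As a rough illustration of that falsifiability, two of the signatures above can be written as typed callables and exercised with toy implementations. This is a simplified sketch: the worker drops the tools and budget arguments, and `Trace`, `Artifact`, `Review`, `echo_worker`, and `keyword_reviewer` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Trace:
    events: List[str] = field(default_factory=list)

@dataclass
class Artifact:
    content: str
    author: str

@dataclass
class Review:
    defects: List[str]
    passed: bool

# Role contracts as callable types: a role is whatever satisfies the signature.
Worker = Callable[[str, Trace], Artifact]
Reviewer = Callable[[Artifact, List[str], Trace], Review]

def echo_worker(task: str, trace: Trace) -> Artifact:
    """Toy worker: produces a draft and records that it ran."""
    trace.events.append(f"worker: started {task!r}")
    return Artifact(content=f"draft for {task}", author="worker-1")

def keyword_reviewer(artifact: Artifact, rubric: List[str], trace: Trace) -> Review:
    """Toy reviewer: every rubric term missing from the artifact is a defect."""
    defects = [term for term in rubric if term not in artifact.content]
    trace.events.append(f"reviewer: {len(defects)} defect(s)")
    return Review(defects=defects, passed=not defects)

trace = Trace()
artifact = echo_worker("retry policy", trace)
review = keyword_reviewer(artifact, ["retry", "timeout"], trace)
```

Because the reviewer returns a structured verdict rather than prose, its contribution can be checked: here it blocks the draft for missing "timeout", and the trace shows both roles ran.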
The Disagreement Problem
Multi-agent systems are usually sold as specialization. The stronger reason is controlled disagreement.
For an ensemble to help, at least one of these must be true:
- agents see different evidence
- agents use different tools
- agents use different models
- agents use different skills
- agents explore different branches
- agents are scored by an independent verifier
- the selector can preserve dissent instead of forcing consensus
If every worker shares the same prompt, same model, same examples, same context, same decoding parameters, and same evaluator, the ensemble is highly correlated.
The idealized independent case is easy:
P(at least one success) = 1 - product_i(1 - q_i)
where q_i is the probability that worker i independently finds a valid answer.
But multi-agent LLM systems rarely get independence for free. The useful quantity is not worker count. It is error correlation.
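A small numeric sketch makes the gap concrete. The first function is the independent formula above. The second is a toy correlation model that is an assumption of this example, not something from the cited work: with probability `rho` all workers share a single outcome, otherwise they are independent.

```python
import math

def p_any_success(qs):
    """P(at least one success) for independent workers with success probabilities qs."""
    return 1.0 - math.prod(1.0 - q for q in qs)

def p_any_success_shared(q, n, rho):
    """Toy model (assumption): with probability rho all n workers share one
    outcome, so they succeed or fail together; otherwise they are independent."""
    return rho * q + (1.0 - rho) * p_any_success([q] * n)

independent = p_any_success([0.3] * 5)              # 1 - 0.7**5, about 0.832
half_shared = p_any_success_shared(0.3, 5, 0.5)     # about 0.566
fully_shared = p_any_success_shared(0.3, 5, 1.0)    # 0.3, same as one worker
```

Under this toy model, five fully correlated workers are worth exactly one worker, at five times the cost.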
coordination_gain =
E[score_multi at budget B] - E[score_best_single at budget B]
If the gain disappears at matched budget, the system did not learn coordination. It bought more samples.
If the gain disappears when workers use isolated context, the system may have been copying. If the gain disappears when the selector is replaced with a deterministic verifier, the selector may have been rewarding style. If the gain disappears on held-out tasks, the role split overfit the benchmark.
This is the central test: does the coordination policy create useful diversity, or just more tokens?
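The matched-budget comparison can be sketched as a function that refuses to compute a gain when any run exceeded the shared budget. `Run` and `coordination_gain` are hypothetical names, the cost unit is whatever the harness records, and `single_runs` is assumed to come from the best single-agent configuration already chosen elsewhere.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass(frozen=True)
class Run:
    score: float
    cost: float  # any consistent unit: tokens, cents, seconds

def coordination_gain(multi_runs, single_runs, budget):
    """Mean score difference, valid only if every run stayed within the budget."""
    for run in list(multi_runs) + list(single_runs):
        if run.cost > budget:
            raise ValueError(f"run exceeds matched budget {budget}: cost={run.cost}")
    multi_mean = mean([r.score for r in multi_runs])
    single_mean = mean([r.score for r in single_runs])
    return multi_mean - single_mean

multi = [Run(score=0.8, cost=90), Run(score=0.6, cost=100)]
single = [Run(score=0.7, cost=100), Run(score=0.5, cost=95)]
gain = coordination_gain(multi, single, budget=100)  # about 0.1

try:
    coordination_gain([Run(score=0.9, cost=400)], single, budget=100)
    unpriced = False
except ValueError:
    unpriced = True  # the 4x-budget run is rejected rather than counted as a win
```

The same function can be rerun for each ablation (isolated context, deterministic verifier, held-out tasks) to see which one makes the gain vanish.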
Lineage: From Sampling To Societies
The modern multi-agent conversation did not appear fully formed. It grew out of several older ideas.
Self-consistency sampled multiple reasoning paths and selected the most consistent answer. That is not multi-agent in the social sense, but it is the simplest version of a coordination move: generate diverse candidates, aggregate them with a rule, and beat greedy decoding on reasoning tasks.
Tree of Thoughts made the search structure more explicit. It explored coherent intermediate thoughts, evaluated them, and allowed lookahead and backtracking. Again, the key object is not a persona. It is a search policy over branches.
CAMEL pushed role-playing into agent cooperation. Its important contribution is not that agents had names. It is that role-conditioned communicative agents could be studied as a cooperative system, with inception prompting used to keep the interaction on task.
AutoGen framed multi-agent applications as configurable conversable agents, where interaction behavior can be programmed in natural language or code. That shifted attention from one prompt to conversation protocol.
Multiagent debate showed another route: multiple model instances propose and critique answers over rounds, improving factuality and reasoning in some settings. But debate also reveals the danger. More discussion is not automatically more truth. It can become persuasion, anchoring, or convergence to a fluent wrong answer unless the selector and verifier are strong.
Mixture-of-Agents made the ensemble structure more layered: agents generate outputs, later agents consume previous outputs as auxiliary information, and an aggregator improves the final response. This is close to a production pattern: proposers, aggregators, and selectors are distinct roles.
The research arc is clear:
sample many -> search branches -> assign roles -> debate -> aggregate -> orchestrate
The open engineering problem is making the orchestration measurable.
Coordination Patterns
The useful patterns are not defined by agent names. They are defined by information flow and authority.
Best-of-N
Multiple workers attempt the same task. A selector or verifier picks one.
spawn N -> score each -> return winner
This is strong when outputs are easy to score and independent attempts are cheap. It is weak when scoring is subjective or all workers share the same blind spot.
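A minimal sketch of the pattern, assuming workers are plain callables and the scorer is a cheap exact check. `best_of_n` and the toy square-root task are illustrative names and data, not part of any framework.

```python
def best_of_n(task, workers, score):
    """Run every worker on the same task, score each candidate, return the winner.

    Ties go to the earliest worker, because max() keeps the first maximum.
    Also returns the full (candidate, score) table so losers stay inspectable.
    """
    candidates = [worker(task) for worker in workers]
    scores = [score(task, candidate) for candidate in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], list(zip(candidates, scores))

# Toy task: find an integer whose square is the task value.
workers = [lambda n: 6, lambda n: 7, lambda n: 8]
verifier = lambda n, c: 1.0 if c * c == n else 0.0

winner, table = best_of_n(49, workers, verifier)
```

The pattern is only as good as `score`. Swap the exact verifier for a judge that prefers longer answers and the same code selects for length.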
Self-consistency
Multiple reasoning paths produce candidate answers. The system chooses the answer supported by the most paths or highest marginal score.
sample paths -> marginalize answers -> choose stable answer
This helps when the final answer has a stable attractor and errors are diverse. It is less useful for open-ended artifact quality where many incompatible answers can all be plausible.
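The aggregation step can be sketched as a plain majority vote over final answers. This is the simplest form of the rule, assuming answers are already normalized to comparable strings; `self_consistent_answer` is a hypothetical name.

```python
from collections import Counter

def self_consistent_answer(answers):
    """Return the most frequent final answer and the share of paths that support it."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Final answers extracted from five sampled reasoning paths (toy data).
paths = ["42", "42", "41", "42", "40"]
answer, support = self_consistent_answer(paths)
```

The support share doubles as a rough confidence signal: a 3-of-5 vote with scattered losers is weaker evidence than a 5-of-5 vote, and a low share is a reasonable trigger for escalation.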
Tree search
The system expands intermediate states, evaluates partial progress, prunes weak branches, and backtracks.
expand -> evaluate -> select frontier -> continue or backtrack
This is coordination over thoughts, plans, or artifacts. It requires explicit state and a heuristic good enough to guide search.
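A compact sketch of the expand, evaluate, select, backtrack loop, written as best-first search over an explicit frontier. This is a generic search routine, not the Tree of Thoughts implementation. The toy problem (reach 10 from 1 using doubling or adding 3) and all names are assumptions for illustration.

```python
import heapq

def best_first_search(start, expand, score, is_goal, max_expansions=100):
    """Always expand the highest-scoring frontier state.

    Backtracking is implicit: when a branch dead-ends, the next pop
    resumes from the best state left on the frontier.
    """
    frontier = [(-score(start), 0, start)]  # (priority, tiebreak, state)
    tiebreak = 1
    expansions = 0
    while frontier and expansions < max_expansions:
        _, _, state = heapq.heappop(frontier)
        if is_goal(state):
            return state
        expansions += 1
        for child in expand(state):
            heapq.heappush(frontier, (-score(child), tiebreak, child))
            tiebreak += 1
    return None

TARGET = 10

def expand(state):
    value, path = state
    children = [(value * 2, path + ("*2",)), (value + 3, path + ("+3",))]
    return [c for c in children if c[0] <= TARGET]  # prune overshoots

def score(state):
    return -abs(TARGET - state[0])  # closer to the target is better

result = best_first_search((1, ()), expand, score, lambda s: s[0] == TARGET)
```

In this run the heuristic first prefers the state with value 8, which has no legal children, so the search falls back to 7 and finishes with three "+3" moves. The explicit frontier is what makes that recovery possible.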
Debate
Agents expose arguments, counterarguments, and revisions before a final decision.
propose -> critique -> respond -> judge
Debate is useful when hidden assumptions matter. It fails when agents optimize rhetoric, defer to the strongest voice, or converge before evidence changes.
Supervisor
A central coordinator delegates scoped tasks to workers and keeps authority over the final artifact.
supervisor -> assign -> collect -> merge -> verify
This is good for work that has clear subdomains. It fails when the supervisor has no real budget, no child trace access, or no merge discipline.
Handoff
Control transfers from one agent to another.
triage -> active specialist -> maybe hand off again
Handoffs are good when the next specialist should own the state and speak directly. They are dangerous when state transfer is implicit or context grows without boundaries.
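One way to make state transfer explicit is to build the handoff payload from a whitelist and a message cap, so nothing crosses the boundary by default. This is a sketch under those assumptions; `hand_off` and the ticket fields are hypothetical.

```python
def hand_off(state, to_agent, allowed_keys, max_messages):
    """Build the state the next agent receives.

    Only whitelisted keys cross the boundary, and only the most recent
    messages are carried, so context cannot grow without limit.
    """
    payload = {key: state[key] for key in allowed_keys if key in state}
    messages = list(state.get("messages", []))
    # Guard the zero case: messages[-0:] would return the whole list.
    payload["messages"] = messages[-max_messages:] if max_messages > 0 else []
    payload["owner"] = to_agent
    return payload

triage_state = {
    "ticket_id": "T-1",
    "customer_tier": "pro",
    "scratchpad": "internal triage notes",
    "messages": ["m1", "m2", "m3", "m4"],
}

specialist_state = hand_off(
    triage_state,
    to_agent="billing-specialist",
    allowed_keys={"ticket_id", "customer_tier"},
    max_messages=2,
)
```

The triage scratchpad stays behind, and the specialist's view of the conversation is bounded. Both choices are visible in one place instead of being a side effect of prompt concatenation.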
Blackboard
Agents write partial results into a shared workspace. Other agents read and improve them.
workers -> shared artifact store -> reviewers -> revised artifact
This is natural for code, research, planning, and design. It needs locking, provenance, conflict resolution, and traceable authorship.
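A minimal sketch of those requirements, using optimistic versioning in place of real locks: every write names the version it was based on, and stale writes are rejected instead of silently overwriting. `Blackboard`, `Entry`, and `StaleWrite` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Entry:
    version: int
    author: str
    content: str
    parent: Optional[int]  # version this write was based on

class StaleWrite(Exception):
    """Raised when a writer's base version is no longer the head."""

class Blackboard:
    def __init__(self):
        self.history: List[Entry] = []  # append-only, so provenance is never lost

    def head_version(self) -> Optional[int]:
        return self.history[-1].version if self.history else None

    def write(self, author: str, content: str, based_on: Optional[int]) -> Entry:
        if based_on != self.head_version():
            raise StaleWrite(f"{author} wrote against v{based_on}, head is v{self.head_version()}")
        entry = Entry(len(self.history) + 1, author, content, based_on)
        self.history.append(entry)
        return entry

board = Blackboard()
board.write("worker-a", "initial plan", based_on=None)
board.write("reviewer-b", "plan with rollback step", based_on=1)

try:
    board.write("worker-c", "plan with caching", based_on=1)  # read v1, but head is v2
    conflict = False
except StaleWrite:
    conflict = True

authors = [entry.author for entry in board.history]
```

Rejecting the stale write forces worker-c to reread the head and reconcile, which is the conflict-resolution step the pattern needs. The append-only history keeps authorship traceable.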
Layered mixture
One layer proposes outputs. Later layers aggregate, refine, or route.
proposers -> aggregators -> final selector
This works when the aggregator can exploit complementary model strengths. It fails when later layers smooth away critical dissent.
The Framework Map
As of June 5, 2026, major agent frameworks expose this distinction directly.
OpenAI’s Agents SDK documentation defines orchestration as which agents run, in what order, and how that decision is made. It separates LLM-driven orchestration from code-driven orchestration, names agents-as-tools and handoffs as common patterns, and explicitly says code orchestration is more deterministic and predictable in terms of speed, cost, and performance.
AutoGen AgentChat exposes teams and multi-agent design patterns, including Selector Group Chat, Swarm, Magentic-One, and GraphFlow. The documentation names selectors, shared context, localized tool-based routing, and directed graphs as first-class concepts.
The LangChain and LangGraph documentation describes multi-agent systems as coordination among specialized components, while warning that a single agent with the right tools and prompt can often be enough. The handoff docs are especially concrete: behavior changes through state, agents can be distinct graph nodes, and context engineering determines what messages cross agent boundaries.
The shared direction is not “more personas.” It is explicit control over routing, handoff, state, context, and observability.
The Tangle Placement
In the local @tangle-network/agent-runtime source, the surface separates three coordination shapes.
The first layer is the focused multi-shot kernel:
- runLoop: a topology-agnostic kernel over sandbox executions.
- Driver: the topology object through plan() and decide().
- createRefineDriver: serial attempt, validate, retry until pass or cap.
- createFanoutVoteDriver: parallel attempts with scored winner selection.
- AgentRunSpec: profile plus task-to-prompt formatter.
- OutputAdapter: sandbox event stream to typed output.
- Validator: typed output to score and pass/fail verdict.
This kernel is intentionally narrow. It owns iteration accounting, bounded concurrency, abort propagation, cost aggregation, and trace emission. It does not own persona, domain policy, output scoring, or topology. That shape is good for optimization: the mutable coordinate is the driver and profile set, not a hidden monolith.
The second layer is the multi-agent conversation substrate:
- defineConversation: declares participants and policy before execution.
- runConversation / runConversationStream: drive speaker turns and event streams.
- createConversationBackend: lets a whole conversation become a participant in a larger conversation.
- ConversationPolicy: maxTurns, maxCreditsCents, turn order, halt predicate, default call policy.
- ConversationParticipant.authSource: per-participant billing identity, either forward the user or use agent-owned credentials.
- ConversationJournal: resumable transcript storage, with in-memory, file, and SQL implementations.
- turnId: deterministic per-turn id for retries and trace stitching.
- buildForwardHeaders and DEFAULT_MAX_DEPTH: cross-gateway run, turn, parent-turn, speaker, authorization, and recursion-depth propagation.
- CircuitBreakerState and call policy: per-participant deadlines, retries, backoff, and circuit breaking.
This is a more serious coordination surface. It makes long-running multi-agent dialogue a runtime object with durability, economics, recursion bounds, and trace correlation.
The MCP layer is a third shape, not a replacement for either of the first two. delegate_code, delegate_research, delegate_feedback, delegation_status, and delegation_history expose async fire-and-poll delegation to agents. The runtime owns the queue, feedback store, schemas, and tool projection. The product supplies the delegates. The default coder delegate is shipped through coderProfile and multiHarnessCoderFanout; researcher delegation is peer-backed through @tangle-network/agent-knowledge or an injected ResearcherDelegate, not a top-level agent-runtime/profiles export in the inspected source.
The separation is important:
runLoop = bounded multi-shot task kernel
conversation = long-horizon participant dialogue
MCP delegation = async specialist work surface
A reliable coordination stack needs the following contracts regardless of which layer hosts them:
scope:
budget
allowed tools
allowed agents
state visibility
cancellation authority
trace parent
assignment:
task
role contract
input artifacts
expected output
verifier
deadline
selection:
candidates
scores
cost
risk
lineage
decision rationale
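The three contracts above translate naturally into record types plus a check that a selection decision is complete. The field types, units, and the `selection_gaps` helper are assumptions of this sketch. The field names follow the lists above and are not taken from the Tangle packages.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Scope:
    budget_cents: int
    allowed_tools: frozenset
    allowed_agents: frozenset
    state_visibility: str        # e.g. "own-branch" or "shared"
    can_cancel: bool
    trace_parent: Optional[str]

@dataclass(frozen=True)
class Assignment:
    task: str
    role_contract: str
    input_artifacts: tuple
    expected_output: str
    verifier: str
    deadline_s: float

@dataclass(frozen=True)
class Selection:
    candidates: tuple
    scores: dict                 # candidate id -> score
    cost_cents: int
    risk: str
    lineage: tuple               # ids of the child runs behind the winner
    rationale: str

def selection_gaps(selection: Selection, scope: Scope) -> list:
    """List what is missing before this selection can be trusted."""
    gaps = []
    unscored = [c for c in selection.candidates if c not in selection.scores]
    if unscored:
        gaps.append(f"unscored candidates: {unscored}")
    if selection.cost_cents > scope.budget_cents:
        gaps.append("cost exceeds scope budget")
    if not selection.lineage:
        gaps.append("no lineage recorded")
    if not selection.rationale.strip():
        gaps.append("no decision rationale")
    return gaps

scope = Scope(500, frozenset({"search"}), frozenset({"worker"}), "own-branch", True, "run-1")
assignment = Assignment("summarize incident", "worker", ("log.txt",), "summary", "rubric-v1", 60.0)
selection = Selection(("a", "b"), {"a": 0.9}, 620, "low", (), "")
gaps = selection_gaps(selection, scope)
```

An empty list from `selection_gaps` is a cheap precondition for promotion. A non-empty list is a finding in its own right.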
In the local @tangle-network/agent-eval source, the package is not just a judge wrapper. It is a promotion and analysis system:
- AgentProfileCell, AGENT_PROFILE_KINDS, buildSandboxAgentProfileCell, and toAgentProfileJson: stable cells for model, prompt, tool, skill, runtime, and harness variation.
- runEvalCampaign: variant-by-scenario campaign runner with raw-provider capture and profile-cell checks.
- HeldOutGate: paired promotion gate, now with a cost ceiling so lift cannot ignore budget.
- scorecards and release confidence: longitudinal evidence, paired deltas, overfit gaps, release reports.
- runProductionLoop: production trace clusters to candidate improvement to held-out gate to PR.
- runIntentMatchJudge, failure taxonomy, semantic judges, and multi-layer verifiers: scoring beyond one rubric prompt.
- AnalystRegistry with DEFAULT_TRACE_ANALYST_KINDS: failure-mode, knowledge-gap, knowledge-poisoning, and improvement analysts over trace stores.
- focused subpaths such as /optimization, /reporting, /control, /rl, /traces, /pipelines, /meta-eval, /prm, /builder-eval, /governance, and /knowledge.
For multi-agent coordination, the clean split is:
agent-runtime/conversation decides which participants spoke, under which policy
agent-runtime/loops decides which bounded workers ran
agent-runtime/mcp exposes async specialist delegation
agent-eval decides whether the resulting system was better
Multi-agent coordination without eval is theater. Eval without trace-level runtime evidence is an opinion poll.
Why More Agents Often Make Things Worse
The failure modes are predictable.
Correlated blind spots
Five agents using the same model and context may agree because they share the same missing fact.
Consensus collapse
Agents converge on the first plausible answer because nobody has authority or incentive to preserve dissent.
Selector overfitting
The selector learns to prefer fluent, long, confident, or rubric-shaped outputs instead of correct ones.
Unpriced compute
The multi-agent variant wins only because it spent eight workers' worth of compute against a single-worker baseline, so the comparison measures budget rather than coordination.
Context contamination
A worker sees another worker’s answer before producing its own, so the supposed independent samples are not independent.
Merge loss
The coordinator combines outputs but drops provenance, uncertainty, and unresolved contradictions.
Authority confusion
The reviewer finds a hard failure, but the supervisor treats it as advisory feedback and ships anyway.
Trace gaps
The final answer looks good, but the system cannot show which child saw which context, used which tools, or caused which decision.
These are not edge cases. They are the default unless the coordination structure prevents them.
Evaluation Protocol
Do not ask whether a multi-agent system “feels smarter.” Ask whether it beats the right baseline.
Minimum protocol:
1. Define the task distribution and artifact contract.
2. Freeze model set, tools, prompts, skills, dataset, and evaluator where possible.
3. Compare against best single-agent and best-of-N baselines at matched budget.
4. Record child context, tool calls, artifacts, scores, selector decisions, and merge lineage.
5. Measure quality, cost, latency, branch failure rate, trace integrity, and human review load.
6. Run ablations: no debate, no shared context, no heterogeneity, deterministic selector.
7. Promote only on held-out lift with acceptable cost, latency, and failure-mode profile.
The promotion rule can be written:
promote(s_multi) if:
LCB_95(median(score_multi - score_baseline on holdout)) > epsilon
and median_cost_multi <= cost_ceiling
and median_latency_multi <= latency_ceiling
and trace_integrity == 1
and selector_ablation_delta > 0
and deterministic_failures == 0
The selector_ablation_delta term matters. If the multi-agent system still performs the same when the selector is replaced with a trivial rule, the sophisticated coordination may not be doing causal work.
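The rule can be sketched directly. The lower confidence bound here is a one-sided percentile bootstrap over the median of paired deltas, which is one reasonable choice among several and an assumption of this example. `lcb95_median` and `promote` are hypothetical names and do not describe how HeldOutGate is implemented.

```python
import random
from statistics import median

def lcb95_median(deltas, n_boot=2000, seed=0):
    """One-sided 95% lower bound on the median via percentile bootstrap."""
    rng = random.Random(seed)  # seeded so the gate decision is reproducible
    n = len(deltas)
    medians = sorted(median(rng.choices(deltas, k=n)) for _ in range(n_boot))
    return medians[int(0.05 * n_boot)]  # 5th percentile of bootstrap medians

def promote(deltas, costs, latencies, *, epsilon, cost_ceiling, latency_ceiling,
            trace_integrity, selector_ablation_delta, deterministic_failures):
    """deltas are paired (score_multi - score_baseline) values on held-out tasks."""
    return (
        lcb95_median(deltas) > epsilon
        and median(costs) <= cost_ceiling
        and median(latencies) <= latency_ceiling
        and trace_integrity == 1
        and selector_ablation_delta > 0
        and deterministic_failures == 0
    )

gate = dict(epsilon=0.05, cost_ceiling=100, latency_ceiling=30,
            trace_integrity=1, selector_ablation_delta=0.02, deterministic_failures=0)

clear_lift = promote([0.2] * 20, [80] * 20, [12] * 20, **gate)
too_expensive = promote([0.2] * 20, [140] * 20, [12] * 20, **gate)
no_lift = promote([0.0] * 20, [80] * 20, [12] * 20, **gate)
```

Every clause is a hard gate: the second candidate has the same lift as the first and is still rejected on cost.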
For open-ended work, add a disagreement audit:
disagreement_audit:
independent evidence found?
contradictions preserved?
reviewer defects resolved?
final merge cites child lineage?
rejected branches explained?
Disagreement is useful only when it changes the final decision or improves confidence calibration.
What Optimizers Can And Cannot Do
Prompt optimizers can improve role instructions:
critic prompt
planner prompt
selector rubric
handoff description
reviewer checklist
Skill optimizers can improve durable role procedure:
how a reviewer inspects a patch
how a researcher triangulates sources
how a coordinator merges conflicting evidence
how an analyst clusters trace failures
Runtime topology optimizers can improve execution shape:
fanout width
which roles run in parallel
whether debate happens before or after evidence collection
which selector sees which fields
when branches cancel
how budget is allocated
Harness evolution can change the coordination machine itself:
new driver
new selector implementation
new trace schema
new sandbox isolation model
new promotion gate
This is where GEPA, MIPRO, SkillOpt, agent-runtime, agent-eval, and meta-harness stop looking like competitors. They operate on different mutable surfaces.
The mistake is asking one optimizer to search a surface it cannot execute.
Working Rule
Use multiple agents when the work needs at least one of these:
- independent evidence gathering
- heterogeneous tools or models
- decomposable subtasks with real parallelism
- adversarial review
- branch search with pruning
- scoped handoff to a specialist
- artifact merge with provenance
- trace analysis by multiple lenses
Do not use multiple agents when the only benefit is a richer cast list.
The engineering test is simple:
Can the system show why this role existed?
Can it show what information the role had?
Can it show what the role produced?
Can it show how the selector used or rejected that output?
Can it beat a compute-matched single-agent baseline?
If not, the coordination is not yet a system property. It is prose.
Personas can help agents think in different local modes. Coordination decides whether those modes become useful work.
Source Trail
Source freshness checked on 2026-06-06.
- Self-Consistency Improves Chain of Thought Reasoning in Language Models, checked June 5, 2026.
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models, checked June 5, 2026.
- CAMEL: Communicative Agents for “Mind” Exploration of Large Language Model Society, checked June 5, 2026.
- AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation, checked June 5, 2026.
- Improving Factuality and Reasoning in Language Models through Multiagent Debate, checked June 5, 2026.
- Mixture-of-Agents Enhances Large Language Model Capabilities, checked June 5, 2026.
- OpenAI Agents SDK orchestration docs, checked June 5, 2026.
- AutoGen AgentChat docs, checked June 5, 2026.
- LangChain multi-agent docs, checked June 5, 2026.
- Local @tangle-network/agent-runtime source audit: runLoop, Driver, createRefineDriver, createFanoutVoteDriver, defineConversation, runConversation, createConversationBackend, ConversationJournal, authSource, cross-gateway headers, MCP delegation tools, June 6, 2026.
- Local @tangle-network/agent-eval source audit: AgentProfileCell, AGENT_PROFILE_KINDS, buildSandboxAgentProfileCell, runEvalCampaign, HeldOutGate, release confidence, scorecard, intent-match judge, failure taxonomy, AnalystRegistry, DEFAULT_TRACE_ANALYST_KINDS, June 6, 2026.
FAQ
What makes a multi-agent system real?
A multi-agent system is real when roles have contracts, state boundaries, tool permissions, budgets, communication policy, selectors, traces, and promotion gates. Names alone do not create coordination.
Why do more agents often make results worse?
More agents can amplify correlated errors, duplicate context, hide failed branches, spend more compute, and let weak selectors ship the wrong artifact. The useful quantity is not worker count; it is verified coordination gain at equal budget.
What should a builder optimize first?
Start with runtime topology and test-time compute: can the system execute the coordination pattern, and does it beat simpler budget-matched baselines? Then tune role contracts and skills.