The Self-Improving Stack

A series map for self-improving agent systems, from optimization theory and prompt search to runtime topology, traces, memory, and governance.

Drew Stone
agents, evals, systems, self-improvement

Short answer: A self-improving agent system is not a model that reflects harder. It is a closed loop around mutable state, trace evidence, evaluation gates, memory, runtime topology, harness code, and governance. The first question is not whether the system improves itself. The first question is which layer changed and what proof allowed that change to persist.

Self-improvement is not a model property.

It is a system property.

A model can sit inside a self-improving system, but the loop usually lives around it: prompts, skills, tools, traces, memory, evaluators, runtimes, harnesses, and release gates.

That distinction matters because a lot of AI discourse collapses very different loops into one phrase:

the system optimizes itself

That sentence is too vague.

The useful questions are:

what is allowed to change?
what evidence says it improved?
how are candidates generated?
what gate decides promotion?
what can go wrong when that layer changes?

Those five questions define the self-improving stack.

The Loop

A self-improving agent system has a closed loop:

run
observe
diagnose
propose
validate
promote
remember
govern

The loop is only real when each verb has a concrete implementation.

run: execute the agent under a scenario
observe: capture a full trace, not only a score
diagnose: identify failure modes and missing knowledge
propose: generate a candidate change
validate: test the candidate against baseline
promote: replace baseline only if the gate passes
remember: persist the right lesson for future runs
govern: keep the optimizer inside its authority and evidence boundary

The system is not self-improving because it says “reflect.” It is self-improving when a future run changes in the right direction because a previous run produced admissible evidence.

The compact equation is:

s_{t+1} =
  Promote(s_t, c_t)
  if Gate(Eval(Run(c_t), Run(s_t)), policy) passes
  else s_t

where:

s_t = current system state
c_t = candidate state
Run(.) = full agent trajectory under scenarios
Eval(.) = measured evidence
Gate(.) = promotion rule under policy

The system improves only when the promoted state performs better on the right distribution while staying inside cost, safety, integrity, and governance constraints.
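As a rough illustration, the transition above can be written as one small function. This is a minimal sketch, not a reference implementation: `step`, `toy_run`, `mean_paired_delta`, `toy_gate`, and the 0.05 threshold are all hypothetical names and values chosen for the example, and the toy scores stand in for real agent trajectories.

```python
from statistics import mean
from typing import Callable, Sequence, TypeVar

State = TypeVar("State")


def step(
    state: State,
    candidate: State,
    run: Callable[[State], Sequence[float]],
    evaluate: Callable[[Sequence[float], Sequence[float]], float],
    gate: Callable[[float], bool],
) -> State:
    """One transition: s_{t+1} = candidate if the gate passes, else s_t."""
    evidence = evaluate(run(candidate), run(state))
    return candidate if gate(evidence) else state


# Toy stand-ins (hypothetical): each state maps to fixed per-scenario scores.
SCORES = {"baseline": [0.6, 0.5, 0.7], "candidate": [0.7, 0.6, 0.7]}


def toy_run(state: str) -> Sequence[float]:
    return SCORES[state]


def mean_paired_delta(cand: Sequence[float], base: Sequence[float]) -> float:
    # Scores are paired by scenario, so compare position by position.
    return mean(c - b for c, b in zip(cand, base))


def toy_gate(delta: float) -> bool:
    return delta > 0.05  # illustrative threshold standing in for "policy"


next_state = step("baseline", "candidate", toy_run, mean_paired_delta, toy_gate)
```

The point of the shape is that `step` never inspects the candidate directly. It only sees evidence from runs, and the gate is passed in from outside.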

The Stack

The stack has layers because “candidate” can mean many different things.

| Layer | Mutable surface | Search operator | Trusted feedback | Gate |
| --- | --- | --- | --- | --- |
| Optimization theory | candidate state and objective | hill climbing, bandits, evolutionary search, Bayesian search | reward, loss, regret, Pareto frontier | statistical and structural validity |
| Prompt optimization | instructions, examples, rubrics, LM program text | GEPA, MIPRO, DSPy, AxLLM-style search | task score, judge score, validation set | held-out prompt eval |
| Skill optimization | reusable procedures and action policies | reflection, trace-to-skill, skill mutation | transfer success, recurrence reduction | skill invocation and transfer tests |
| Runtime topology | drivers, fanout, reviewers, selectors, turn budgets | topology search, hand-designed patterns, meta-harness mutation | full trajectory score and cost | trace integrity plus budget gate |
| Multi-agent coordination | roles, contracts, supervisors, workers | role decomposition, delegation, debate, vote | coordination quality, disagreement resolution | role isolation and selector audit |
| Test-time compute | samples, branches, retries, verifier calls | best-of-N, tree search, adaptive allocation | compute-matched lift | Pareto dominance under equal compute |
| Evaluation gates | scorecards, judges, baselines, release criteria | evaluator design and calibration | paired deltas, human calibration, deterministic checks | fail-closed promotion |
| Trace systems | spans, artifacts, raw calls, replay records | trace mining and analyst loops | causal evidence from trajectories | capture integrity and replayability |
| Harness evolution | source code around the agent | code search, meta-harness, worktree variants | benchmark and production evidence | release gate outside the mutation surface |
| Post-training | model weights or adapters | SFT, RLHF, DPO, PPO, GRPO, tool-use RL | training loss, preferences, verifiable reward, deployment eval | model release and data-governance gate |
| Memory and knowledge | persistent state across episodes | trace mining, proposal review, retrieval tuning | retrieval-conditioned task lift | source, freshness, scope, poisoning checks |
| Governance | authority, risk controls, release policy | safety-case iteration | red-team, audit, incident, outcome evidence | accountable approval and rollback |

This table is the core of the series.

GEPA, SkillOpt, meta-harness, post-training, and memory flywheels all have the same outer skeleton:

propose candidate
run candidate
measure candidate
promote or reject

They are not the same system because they mutate different surfaces.

A prompt optimizer cannot add a new sandbox boundary. A skill optimizer cannot guarantee a runtime will invoke the skill. A memory system cannot invent a fanout topology. A harness optimizer cannot safely approve itself if it owns the gate. Post-training can move model behavior, but it also moves the rollback and explanation boundary.

The layer is not an implementation detail.

The layer determines the reachable set.

Layer Confusion

The most common mistake is using the optimizer from one layer to fix a failure in another.

| Symptom | Tempting fix | Likely real layer |
| --- | --- | --- |
| The agent says “parallelize” but still works serially | optimize the prompt | runtime topology |
| A worker follows the persona but misses the task contract | rewrite the persona | multi-agent coordination |
| The same tool argument bug recurs | add another reminder | skill or harness evolution |
| Scores improve while traces get messier | tune the judge | evaluation gate and trace integrity |
| A memory helps one task and poisons another | retrieve more context | memory gate and scope policy |
| A benchmark improves only with more samples | claim better reasoning | test-time compute accounting |

This is the practical reason to map every system by mutable surface first.

Where Prompt Optimization Stops

Prompt optimization is real.

It can discover better instructions, examples, decomposition strategies, rubric wording, and reflection text. Systems like GEPA and DSPy-style optimizers made that point concrete: language program text can be searched.

But prompt search operates inside a fixed runtime.

The objective is roughly:

p* = argmax_p E[R(Run(h_fixed, p, x))]

The harness h_fixed is held constant.

If the runtime lacks a worker pool, the prompt can ask for parallelism but cannot create it. If the tool graph lacks a verifier, the prompt can request verification but cannot execute one. If the evaluator leaks holdout answers, the prompt can overfit beautifully.
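To make the fixed-harness point concrete, here is a minimal sketch of the argmax over prompts. It is an exhaustive search over a handful of candidates, not GEPA or any DSPy optimizer. `best_prompt`, `toy_harness`, and the reward values are hypothetical, and the toy harness is written to have an answer checker but no worker pool.

```python
from statistics import mean
from typing import Callable, Sequence


def best_prompt(
    harness: Callable[[str, str], float],
    prompts: Sequence[str],
    inputs: Sequence[str],
) -> str:
    """Return the prompt with the highest mean reward. The harness is never modified."""
    return max(prompts, key=lambda p: mean(harness(p, x) for x in inputs))


def toy_harness(prompt: str, x: str) -> float:
    # Hypothetical reward model: this runtime can run a checker when asked,
    # but it has no worker pool, so a request for parallelism changes nothing.
    return 1.0 if "check your answer" in prompt else 0.4


prompts = [
    "Solve it.",
    "Solve it in parallel.",
    "Solve it and check your answer.",
]
winner = best_prompt(toy_harness, prompts, ["task-1", "task-2"])
```

In this toy setup the "parallel" prompt scores the same as the plain one, because the search can only select among behaviors the harness already supports.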

This is why the series keeps separating:

better wording
better procedure
better topology
better evaluator
better harness
better model
better memory
better governance

Those are different control surfaces.

Skills Are Trainable State

Skills sit between prompts and code.

A skill is durable procedural memory:

when this task class appears,
use this decomposition,
with these tools,
under these checks,
and stop under these conditions

That makes skills more reusable than a one-off prompt and less rigid than hard-coded application logic.

The hard part is activation. A skill that never triggers is inert. A skill that triggers everywhere becomes a new bug. The gate has to test transfer:

does the skill improve held-out tasks in the intended class?
does it avoid harming nearby tasks outside the class?
does it reduce repeated failures?

That is why skill optimization belongs in the stack but does not replace runtime design.
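The three transfer questions can be folded into one check. The sketch below is an assumption-heavy illustration: `SkillEval`, `skill_passes`, and the default thresholds are hypothetical, and the deltas are presumed to come from held-out evaluations run elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillEval:
    in_class_delta: float    # held-out lift on the intended task class
    near_class_delta: float  # lift on nearby tasks outside the class (negative = harm)
    failures_before: int     # occurrences of the targeted failure without the skill
    failures_after: int      # occurrences with the skill enabled


def skill_passes(e: SkillEval, min_lift: float = 0.02, max_harm: float = 0.01) -> bool:
    """Pass only if the skill helps in class, does not hurt nearby, and cuts repeats."""
    return (
        e.in_class_delta >= min_lift
        and e.near_class_delta >= -max_harm
        and e.failures_after < e.failures_before
    )


helpful = SkillEval(in_class_delta=0.08, near_class_delta=0.0, failures_before=5, failures_after=1)
overeager = SkillEval(in_class_delta=0.08, near_class_delta=-0.06, failures_before=5, failures_after=1)
```

Here `overeager` models a skill that triggers too widely: it helps its own class but fails the gate because it harms neighboring tasks.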

Topology Is The Action Space

Agent behavior is not only model output.

It is workflow shape:

single shot
refine loop
fanout and vote
planner plus worker
researcher plus coder
supervisor plus reviewer
debate
tree search
human approval gate

The topology defines which actions exist and which observations can influence future actions.

This matters for multi-agent systems. “Persona” is content. “Coordinator,” “reviewer,” “selector,” “budget holder,” and “release approver” are structural roles. You can optimize persona text, but coordination quality usually depends on contracts, routing, isolation, and selection.

For maxTurns=0 worker flows, learning does not happen inside the worker’s conversation. It happens across runs:

pre-run retrieval
single worker attempt
post-run trace capture
analyst finding
candidate change
promotion gate
next run

That is still self-improvement, but the loop lives in the harness.

Test-Time Compute Is The Baseline

Before claiming that a new optimizer improved the agent, show that it beats random or naive sampling at equal compute.

A lot of agent improvements are really compute allocation changes:

more samples
more branches
more retries
more verifier calls
more expensive judge
more time

Those can be useful. They are not free.

The fair comparison is:

quality(candidate) - quality(baseline)
cost(candidate) - cost(baseline)

A candidate that wins only by spending more may still be worth shipping, but the claim is different: it is a cost-quality trade, not a pure intelligence gain.
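One way to keep the claim honest is to label it from both deltas at once. This is a minimal sketch under stated assumptions: `RunSummary`, `classify_claim`, and the 5% cost tolerance are hypothetical, and quality and cost are assumed to be measured on the same scenarios for both systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    quality: float  # e.g. pass rate on a held-out set
    cost: float     # e.g. dollars, tokens, or wall-clock seconds


def classify_claim(candidate: RunSummary, baseline: RunSummary, cost_tolerance: float = 0.05) -> str:
    """Label the comparison using both the quality delta and the cost delta."""
    quality_delta = candidate.quality - baseline.quality
    cost_delta = candidate.cost - baseline.cost
    if quality_delta <= 0:
        return "no improvement"
    if cost_delta <= cost_tolerance * baseline.cost:
        return "improvement at matched compute"
    return "cost-quality trade"


baseline = RunSummary(quality=0.70, cost=1.0)
same_budget = RunSummary(quality=0.78, cost=1.02)
triple_budget = RunSummary(quality=0.78, cost=3.0)
```

Both candidates reach the same quality here, but only `same_budget` supports a claim about a better method. `triple_budget` supports a claim about buying quality with compute.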

The Gate Is The Optimizer

The promotion gate decides what the system becomes.

If the gate rewards shallow style, the system learns shallow style. If the gate leaks the answer, the system learns leakage. If the gate ignores cost, the system learns to spend. If the gate ignores safety, the system learns unsafe shortcuts.

A usable gate is explicit:

promote(c) iff
  paired_delta(c, baseline, holdout) > threshold
  and deterministic_verifiers(c) pass
  and trace_integrity(c) passes
  and cost(c) <= budget
  and safety_regression(c) == false

That gate can be statistical, deterministic, human-reviewed, or all three.

The key is that it is separate from the candidate.
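The gate rule above translates almost line for line into code. The sketch below is illustrative only: `CandidateReport`, `gate_failures`, `promote`, and the sample numbers are hypothetical, and the boolean fields are assumed to be filled in by verifiers and trace checks that live outside the candidate. It fails closed when the holdout scores are missing or unpaired.

```python
from dataclasses import dataclass, replace
from statistics import mean
from typing import List, Sequence


@dataclass(frozen=True)
class CandidateReport:
    candidate_scores: Sequence[float]  # per-scenario scores on the holdout set
    baseline_scores: Sequence[float]   # same scenarios, same order
    verifiers_passed: bool
    trace_integrity_ok: bool
    cost: float
    safety_regression: bool


def gate_failures(report: CandidateReport, threshold: float, budget: float) -> List[str]:
    """Return the reasons a candidate is blocked; an empty list means promote."""
    failures = []
    n_cand, n_base = len(report.candidate_scores), len(report.baseline_scores)
    if n_cand == 0 or n_cand != n_base:
        failures.append("unpaired or empty holdout scores")  # fail closed
    else:
        delta = mean(c - b for c, b in zip(report.candidate_scores, report.baseline_scores))
        if delta <= threshold:
            failures.append("paired delta not above threshold")
    if not report.verifiers_passed:
        failures.append("deterministic verifier failed")
    if not report.trace_integrity_ok:
        failures.append("trace integrity failed")
    if report.cost > budget:
        failures.append("cost over budget")
    if report.safety_regression:
        failures.append("safety regression")
    return failures


def promote(report: CandidateReport, threshold: float, budget: float) -> bool:
    return not gate_failures(report, threshold, budget)


good = CandidateReport([0.8, 0.7, 0.9], [0.7, 0.7, 0.8], True, True, cost=1.0, safety_regression=False)
too_expensive = replace(good, cost=5.0)
```

Returning the list of failures rather than a bare boolean keeps the rejection reason available for the trace.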

Traces Are The Data

Scores say that something happened.

Traces say what happened.

An agent trace needs enough information to explain the mechanism:

which prompt
which model
which tools
which arguments
which observations
which retrieved documents
which artifacts
which verifier
which failure class
which budget
which outcome

Without traces, the system can only hill-climb on a lossy projection of what happened.

With traces, the system can diagnose:

missing knowledge
bad tool argument
weak verifier
wrong selector
coordination failure
memory poisoning
budget breach
unsafe side effect

That is why traces are not logging decoration. They are the training data for the external-state loop.
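A trace record that carries those fields might look like the following. This is a sketch of one possible shape, not the schema of any particular tracing library: `ToolCall`, `Trace`, `failure_histogram`, and every field name are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToolCall:
    tool: str
    arguments: Dict[str, str]
    observation: str


@dataclass
class Trace:
    prompt_id: str
    model: str
    tool_calls: List[ToolCall] = field(default_factory=list)
    retrieved_docs: List[str] = field(default_factory=list)
    artifacts: List[str] = field(default_factory=list)
    verifier: Optional[str] = None
    failure_class: Optional[str] = None  # None means no diagnosed failure
    budget_used: float = 0.0
    outcome: str = "unknown"


def failure_histogram(traces: List[Trace]) -> Counter:
    """Count diagnosed failure classes so the analyst sees what recurs."""
    return Counter(t.failure_class for t in traces if t.failure_class)


traces = [
    Trace("p1", "model-a", failure_class="bad tool argument", outcome="fail"),
    Trace("p1", "model-a", failure_class="bad tool argument", outcome="fail"),
    Trace("p1", "model-a", failure_class="weak verifier", outcome="fail"),
    Trace("p1", "model-a", outcome="pass"),
]
histogram = failure_histogram(traces)
```

A score alone would report one pass in four runs. The histogram says which failure to fix first.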

Harness Evolution Changes The Machine

When prompt, skill, and topology tuning plateau, the mutable surface may need to be source code.

Harness evolution changes:

planner contracts
tool routers
selectors
trace emitters
verifiers
benchmark adapters
worktree lifecycle
promotion gates
memory write paths

This is powerful because it expands the reachable set.

It is dangerous because the harness may contain the evaluator. The core rule from the governance layer is:

the optimizer cannot own the gate that promotes it

If the candidate can rewrite the judge or release policy that approves it, the loop is no longer honest.
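One cheap way to enforce that rule is to check a candidate's changed files against a protected surface before any automatic promotion. The sketch below assumes a hypothetical repository layout (`gates/`, `judges/`, `release_policy.yaml`), and the function names are invented for the example. It is a lexical check only, so paths would need to be resolved (symlinks, `..`) before it could be trusted.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

# Hypothetical layout: adjust to wherever the gate actually lives.
PROTECTED_DIRS = ("gates", "judges")
PROTECTED_FILES = ("release_policy.yaml",)


def protected_hits(changed_paths: Iterable[str]) -> List[str]:
    """Return the changed paths that fall inside the gate's protected surface."""
    hits = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        in_protected_dir = any(PurePosixPath(d) in path.parents for d in PROTECTED_DIRS)
        if in_protected_dir or str(path) in PROTECTED_FILES:
            hits.append(str(path))
    return hits


def may_auto_promote(changed_paths: Iterable[str]) -> bool:
    """A candidate that edits its own gate needs an approver outside the loop."""
    return not protected_hits(changed_paths)


harness_only = ["harness/planner.py", "harness/tool_router.py"]
touches_gate = ["harness/planner.py", "gates/promotion.py", "release_policy.yaml"]
```

A candidate like `touches_gate` is not necessarily bad. It just cannot be approved by the gate it modifies.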

Post-Training Moves The Model Boundary

Most systems discussed before the post-training layer mutate external state:

prompt
skill
tool docs
memory
runtime
harness
evaluator

Post-training mutates model behavior itself:

theta_{t+1} = Update(theta_t, data, objective)

or:

theta' = theta_base + Delta_adapter

That can generalize better than prompt edits when the signal is strong. It also makes the behavior harder to inspect, partially roll back, and attribute to one trace.

That is why post-training sits near the top of the stack. It is not “more advanced prompt optimization.” It changes the policy.

Memory Is Not Automatically Learning

Memory changes future runs by changing what persists.

The update rule is:

M_{t+1} =
  Apply(M_t, u_t) if G_mem(u_t, trace, policy) passes
  M_t             otherwise

A memory system is useful when:

Score(with memory) - Score(without memory) > threshold

under cost, freshness, privacy, and poisoning constraints.

Remembering more is not learning. Learning is remembering the right thing, retrieving it in the right context, and proving it improved behavior.
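The update rule can be sketched as a gated, non-destructive write. Everything named here is an assumption made for illustration: `MemoryUpdate`, `memory_gate`, `next_memory`, the 30-day freshness window, and the scope labels. The lift test (score with memory minus score without) would run separately and is not modeled.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class MemoryUpdate:
    key: str
    value: str
    source_trace_id: Optional[str]  # provenance: which trace produced this lesson
    scope: str                      # task class the lesson is allowed to affect
    observed_on: date


def memory_gate(u: MemoryUpdate, allowed_scopes: Set[str], today: date, max_age_days: int = 30) -> bool:
    """G_mem: require provenance, an allowed scope, and fresh evidence."""
    return (
        u.source_trace_id is not None
        and u.scope in allowed_scopes
        and today - u.observed_on <= timedelta(days=max_age_days)
    )


def next_memory(memory: Dict[str, str], u: MemoryUpdate, allowed_scopes: Set[str], today: date) -> Dict[str, str]:
    """M_{t+1} = Apply(M_t, u_t) if the gate passes, otherwise M_t unchanged."""
    if not memory_gate(u, allowed_scopes, today):
        return memory
    return {**memory, u.key: u.value}


today = date(2026, 6, 6)
sourced = MemoryUpdate("csv-tool", "quote fields with commas", "trace-17", "data-cleaning", date(2026, 6, 1))
unsourced = MemoryUpdate("csv-tool", "always skip validation", None, "data-cleaning", date(2026, 6, 1))
memory = next_memory({}, sourced, {"data-cleaning"}, today)
```

An update with no source trace is rejected outright, which is one simple defense against poisoned writes.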

Governance Closes The Loop

Governance is not the opposite of autonomy.

Governance is what makes autonomy accountable.

A self-improving system needs a safety case:

claim
scope
evidence
residual risk
owner
release gate
rollback path

The system can propose improvements. The gate decides which improvements persist. The owner accepts residual risk.

The minimum release rule is:

ship(candidate) iff
  improves(candidate)
  and evidence_complete(candidate)
  and controls_pass(candidate)
  and owner_accepts_residual_risk(candidate)

Without that rule, self-improvement can become proxy hacking with better branding.
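The release rule and the safety-case fields can be tied together in a few lines. This is a minimal sketch, assuming a safety case is a record of non-empty text fields. `SafetyCase`, `evidence_complete`, `ship`, and the example values are hypothetical, and `improves` and `controls_pass` are presumed to come from the evaluation gate.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class SafetyCase:
    claim: str
    scope: str
    evidence: str
    residual_risk: str
    owner: str
    release_gate: str
    rollback_path: str


def evidence_complete(case: SafetyCase) -> bool:
    """Every field of the safety case must be filled in, not blank."""
    return all(getattr(case, f.name).strip() for f in fields(case))


def ship(case: SafetyCase, improves: bool, controls_pass: bool, owner_accepts_residual_risk: bool) -> bool:
    return (
        improves
        and evidence_complete(case)
        and controls_pass
        and owner_accepts_residual_risk
    )


case = SafetyCase(
    claim="new selector raises holdout pass rate",
    scope="code-review tasks only",
    evidence="paired holdout eval, run 42",
    residual_risk="untested on long repositories",
    owner="release owner on call",
    release_gate="eval gate v3",
    rollback_path="revert selector config",
)
no_rollback = replace(case, rollback_path="")
```

A candidate with strong scores but no rollback path, like `no_rollback`, does not ship under this rule.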

The Practical Test

When someone says their agent improves itself, ask for the layer.

What changed?
Who proposed it?
What evidence was captured?
What baseline was beaten?
What held-out set was protected?
What gate approved it?
What got more expensive?
What became riskier?
What can roll back?
What persisted into the next run?

If those questions have concrete answers, there may be a real loop.

If the answer is only “the model reflected,” there probably is not.

Series Map

Source Trail

Source freshness checked on 2026-06-06.

  • Microsoft MAI hill-climbing, published 2026-06-02.
  • GEPA, submitted 2025-07-25 and revised 2026-02-14.
  • MIPROv2 / DSPy, submitted 2024-06-17 and revised 2024-10-06.
  • SkillOpt, submitted 2026-05-22 and revised 2026-05-25.
  • Voyager, submitted 2023-05-25 and revised 2023-10-19.
  • Reflexion, submitted 2023-03-20 and revised 2023-10-10.
  • RAG, submitted 2020-05-22 and revised 2021-04-12.
  • InstructGPT, submitted 2022-03-04.
  • Direct Preference Optimization, submitted 2023-05-29 and revised 2024-07-29.
  • DeepSeek-R1, submitted 2025-01-22 and revised 2026-01-04.
  • NIST AI RMF, with AI RMF 1.0 released 2023-01-26, the Generative AI Profile released 2024-07-26, and the critical-infrastructure concept note released 2026-04-07.
  • OWASP Top 10 for LLM Applications, project page verified 2026-06-06 with a 2025 version link.
  • @tangle-network/agent-runtime local package: /Users/drew/webb/agent-runtime
  • @tangle-network/agent-eval local package: /Users/drew/webb/agent-eval
  • @tangle-network/agent-knowledge local package: /Users/drew/webb/agent-knowledge

FAQ

What is the self-improving stack?

The self-improving stack is the set of mutable layers around an agent: prompts, skills, runtime topology, test-time compute, evaluation gates, traces, harness code, model weights, memory, and governance. The practical question is always what changed, what evidence proved improvement, and what gate allowed the change to persist.

Where should a builder start?

Start with the gate and trace layer before optimizing prompts. The Gate Is The Optimizer explains promotion discipline, and Traces Are The Training Data explains what evidence the optimizer needs.

How does this connect to Tangle infrastructure?

Tangle’s agent stack treats self-improvement as a systems problem: runtime actions, sandboxed work, scorecards, traces, and release gates. Use the series map to decide whether the next change belongs in runtime topology, skills, or governance.