Short answer: Agent memory is not learning by default. It becomes learning only when a trace produces a scoped write, the write is gated, retrieval happens in the right context, and a paired eval shows the future run improved. Otherwise memory is just a larger prompt with more ways to preserve mistakes.
Remembering more is not learning.
Learning means the next run changes in the right direction.
Memory is one way to change the next run without changing model weights. It lets an agent carry evidence, preferences, decisions, failures, and procedures across episodes. That makes memory powerful. It also makes memory dangerous.
Persistent state is inherited by future behavior. A bad prompt can ruin one run. A bad memory can keep ruining runs until it expires, is contradicted, or is deleted.
So the important question is not “does the agent have memory?”
The important question is:
what is allowed to persist,
who can retrieve it,
what evidence supports it,
how it is tested,
and how it is retired?
That is the memory flywheel.
The State Variable
The clean way to think about memory is as mutable external state.
Let:
M_t = memory state before episode t
tau_t = full trace from episode t
u_t = proposed memory write after episode t
G_mem = memory write gate
Write candidates need structure:
u_t =
kind
claim_or_procedure
evidence_refs
scope
confidence
sensitivity
freshness_policy
retrieval_policy
The update rule is:
M_{t+1} =
Apply(M_t, u_t) if G_mem(u_t, tau_t, policy) passes
M_t otherwise
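A minimal Python sketch of the write candidate and the update rule, assuming memory is a plain dict and the gate is any callable that returns a boolean. The `key` field, the default policy strings, `toy_gate`, and the evidence reference are illustrative assumptions, not part of any library.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class MemoryWrite:
    """Proposed write u_t. `key` is an assumed identifier; other fields follow the list above."""
    key: str
    kind: str
    claim_or_procedure: str
    evidence_refs: tuple
    scope: str
    confidence: float
    sensitivity: str = "public"
    freshness_policy: str = "recheck-in-30-days"  # hypothetical policy label
    retrieval_policy: str = "same-scope-only"     # hypothetical policy label


Gate = Callable[[MemoryWrite, dict, dict], bool]


def apply_write(memory: dict, write: MemoryWrite) -> dict:
    """Apply(M_t, u_t): return a new state so the old one stays available for rollback."""
    return {**memory, write.key: write}


def step(memory: dict, write: MemoryWrite, trace: dict, policy: dict, gate: Gate) -> dict:
    """M_{t+1} = Apply(M_t, u_t) if the gate passes, otherwise M_t."""
    return apply_write(memory, write) if gate(write, trace, policy) else memory


def toy_gate(write: MemoryWrite, trace: dict, policy: dict) -> bool:
    # Toy gate (assumption): require at least one evidence ref and a minimum confidence.
    return bool(write.evidence_refs) and write.confidence >= policy["min_confidence"]


policy = {"min_confidence": 0.7}
m0 = {}
grounded = MemoryWrite("api-v2-field-x", "semantic", "API v2 requires field X",
                       ("tool-response-1",), "project", 0.9)
ungrounded = MemoryWrite("guess", "semantic", "Endpoint A is deprecated", (), "global", 0.9)

m1 = step(m0, grounded, {}, policy, toy_gate)    # admitted
m2 = step(m1, ungrounded, {}, policy, toy_gate)  # rejected, state unchanged
```

Returning a new dict instead of mutating in place is a design choice here: it keeps every prior state addressable, which makes rollback and replay cheap in a sketch like this.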
At inference time, the memory layer changes the context seen by the policy:
c_t = Retrieve(M_t, q_t, k, policy)
y_t = pi_theta(x_t, c_t, tools)
where:
x_t = task input
q_t = retrieval query or retrieval plan
k = retrieval budget
c_t = retrieved context
pi_theta = model policy with fixed weights theta
y_t = output or next action
Memory is not magic. It is an intervention on the policy’s input distribution.
That gives us a measurable target:
Delta_memory =
E[Score(pi_theta with M)] - E[Score(pi_theta without M)]
The unit test for memory is not “did retrieval return something?”
The unit test is whether the memory intervention improved the downstream task, under cost, safety, freshness, and privacy constraints.
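One way to estimate Delta_memory is a paired difference over the same tasks. The sketch below assumes each task has been scored once with memory and once without, under otherwise identical conditions. The scores are made-up toy values and `paired_memory_lift` is a hypothetical helper.

```python
from math import sqrt
from statistics import mean, stdev


def paired_memory_lift(with_memory, without_memory):
    """Return (mean paired difference, standard error) for per-task scores."""
    if len(with_memory) != len(without_memory):
        raise ValueError("both arms must cover the same tasks")
    if len(with_memory) < 2:
        raise ValueError("need at least two paired tasks")
    diffs = [w - wo for w, wo in zip(with_memory, without_memory)]
    delta = mean(diffs)
    stderr = stdev(diffs) / sqrt(len(diffs))
    return delta, stderr


# Toy scores (illustrative only): 1.0 = task passed, 0.0 = task failed.
scores_with = [1.0, 0.0, 1.0, 1.0]
scores_without = [0.0, 0.0, 1.0, 0.0]
delta, stderr = paired_memory_lift(scores_with, scores_without)
```

With four tasks the standard error is large relative to the estimate, which is the point: a memory lift claimed from a handful of runs is weak evidence.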
A Short History
RAG made the basic distinction famous in 2020: a model can combine parametric memory in its weights with non-parametric memory in an external index. Lewis et al. explicitly framed provenance and updating world knowledge as open problems for parameter-only models.
Agent memory then became more explicit.
Generative Agents in 2023 stored natural-language experiences, synthesized reflections, and retrieved memories to plan social behavior. Reflexion in 2023 turned feedback into verbal reflections held in an episodic memory buffer, then used those reflections to improve later attempts without updating model weights. Voyager in 2023 pushed the procedural version: an embodied Minecraft agent built an ever-growing library of executable skills and reused them in new worlds.
The same year, MemGPT treated memory management like virtual context management. MemoryBank focused on long-term conversational memory, including selective forgetting and reinforcement. LongMem explored model architectures that augment fixed models with long-term memory banks.
By 2024 and 2025, the evaluation pressure sharpened. LongMemEval tested extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. Mem0 reported production-oriented memory extraction, consolidation, graph memory, latency, and token-cost results on LOCOMO. A-MEM moved toward dynamically organized agent memory, where adding a memory can update links and contextual attributes of older memories. AgeMem, submitted in January 2026 and revised in April 2026, frames memory operations as tool actions and trains long-term plus short-term memory management with reinforcement learning.
The security story sharpened too. MemoryGraft, submitted on December 18, 2025, describes poisoned experience retrieval: malicious successful-looking records persist into an agent’s memory and later steer behavior on semantically similar tasks.
That is the line from RAG to agent memory:
retrieve facts
-> store experiences
-> reflect across episodes
-> store executable procedures
-> manage memory as context
-> evaluate multi-session behavior
-> learn memory operations
-> defend the memory trust boundary
The frontier is not “bigger memory.”
The frontier is controlled memory mutation.
The Memory Types
Different memories mutate different parts of the system.
| Memory type | Mutable unit | Used by | Evaluator | Main failure mode |
|---|---|---|---|---|
| Episodic | trace, episode summary, decision record | planner, analyst, supervisor | replay and outcome comparison | summary loses the causal detail |
| Semantic | claim, page, source anchor, relation | retriever, answerer, researcher | citation, contradiction, freshness | false or stale fact becomes canonical |
| Procedural | skill, checklist, tool habit, repair routine | driver, worker, coding agent | task success and transfer | local trick gets over-applied |
| Preference | user, team, product, or persona preference | assistant, router, UI agent | satisfaction, explicit confirmation | preference is scoped too broadly |
| Negative | invalid assumption, failed tactic, banned path | planner, verifier, selector | recurrence reduction | stale warning blocks correct behavior |
| Source | raw document, trace artifact, anchor, quote | knowledge curator, judge | provenance and hash integrity | generated text is mistaken for source evidence |
| Decision | chosen option, rejected option, rationale | future planner and reviewer | consistency under same constraints | old rationale survives after constraints change |
This table matters because “memory” is too broad a word.
A retrieved user preference, an executable skill, a stale API fact, a failed deployment lesson, and a product requirement are not the same kind of thing. They need separate scope, retrieval, and write policies.
The Flywheel
A useful memory flywheel has seven steps:
observe
extract
propose
gate
retrieve
act
evaluate
The trace supplies the raw material:
tau_t =
task
messages
tool calls
observations
artifacts
verifier results
analyst findings
outcome
The extractor turns trace evidence into proposed writes:
tau_t -> {u_1, u_2, ..., u_n}
The gate decides whether each write is safe, scoped, supported, and useful:
G_mem(u_i, tau_t, policy) -> admit | reject | ask | quarantine | expire
The retriever selects admitted memory for a future task:
Retrieve(M_{t+1}, q, k, policy) -> context
Then the evaluator measures whether the retrieval actually helped.
That last step is where many memory systems become cargo cults. They store more, retrieve more, and show more context to the model, but never run the paired ablation:
same task
same model
same tool surface
with memory versus without memory
Without that ablation, memory success is often just retrieval theater.
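The ablation can be sketched as a small harness that runs each task twice with the same seed and differs only in the retrieved context. `run_agent` and `retrieve` are hypothetical callables supplied by the caller. The toy versions below exist only to show retrieval succeeding on a task where it does not help.

```python
def paired_ablation(tasks, run_agent, retrieve, seed=0):
    """Run every task with and without memory, holding everything else fixed."""
    rows = []
    for task in tasks:
        context = retrieve(task)
        score_with = run_agent(task, context, seed)
        score_without = run_agent(task, [], seed)  # same task, same seed, empty context
        rows.append({
            "task": task,
            "retrieved": len(context),
            "lift": score_with - score_without,
        })
    return rows


# Toy stand-ins (assumptions): the same memory is returned for both tasks,
# but it is only useful when it actually matches the task.
STORE = {
    "deploy": ["deploy: check logs before retrying the hook"],
    "billing": ["deploy: check logs before retrying the hook"],
}


def toy_retrieve(task):
    return STORE.get(task, [])


def toy_run_agent(task, context, seed):
    return 1.0 if any(item.startswith(task) for item in context) else 0.0


rows = paired_ablation(["deploy", "billing"], toy_run_agent, toy_retrieve)
```

Both rows report one retrieved item, but only the `deploy` row shows lift. A dashboard that counted retrieval hits would report the two tasks as equally successful.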
The Write Gate
The memory write gate is the admission controller.
It asks different questions for different memory classes, but the core checks are stable:
| Gate check | Question |
|---|---|
| Provenance | What evidence supports the write? |
| Locus | Is this global, persona-scoped, task-scoped, project-scoped, or agent-scoped? |
| Sensitivity | Is the content public, private, secret, or user-confirmed? |
| Freshness | Does it expire? When was it last verified? |
| Contradiction | Does it conflict with active claims or newer traces? |
| Confidence | Is the evidence strong enough for the target use? |
| Reversibility | Can the write be rolled back or superseded? |
| Retrieval impact | Does it improve retrieval-conditioned behavior? |
| Promotion | Has it passed held-out tasks or production replay? |
The provenance check is the one that prevents the most subtle mistakes.
A self-generated artifact is not the same thing as a source. An agent can write an analysis that says “API v2 requires field X.” That analysis is useful trace evidence. It is not itself proof that the API requires field X. The source-grounded memory needs to point to the API documentation, a tool response, a schema file, or a verified runtime observation.
Some memories do not need external source grounding. A user preference can be grounded in the user’s own instruction. A local coding habit can be grounded in a repeated trace pattern. A decision record can be grounded in the decision meeting or session. But the scope must be explicit:
operator preference for this repo
team convention for this product
task-local assumption
global technical fact
Most poisoning problems start when a scoped memory is treated as global truth.
One useful mental model is a scope lattice:
run
task
project
persona
team
organization
global
Promotion up the lattice requires stronger evidence. A run-local observation can become a task memory after repeated traces. A task memory can become a project convention after review. A project convention rarely deserves to become a global technical fact.
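A rough sketch of promotion rules, assuming the lattice can be treated as a simple chain in the order listed and that promotion moves one level at a time. The evidence thresholds and the review requirement are invented for illustration.

```python
from enum import IntEnum


class Scope(IntEnum):
    RUN = 0
    TASK = 1
    PROJECT = 2
    PERSONA = 3
    TEAM = 4
    ORGANIZATION = 5
    GLOBAL = 6


# Hypothetical evidence thresholds: broader scopes need more supporting traces.
MIN_SUPPORTING_TRACES = {
    Scope.TASK: 3,
    Scope.PROJECT: 5,
    Scope.PERSONA: 5,
    Scope.TEAM: 10,
    Scope.ORGANIZATION: 20,
    Scope.GLOBAL: 50,
}


def can_promote(current: Scope, target: Scope, supporting_traces: int, reviewed: bool) -> bool:
    """Allow promotion only one level up, with enough evidence and review where required."""
    if target - current != 1:
        return False  # no skipping levels, no demotion through this path
    needs_review = target >= Scope.PROJECT  # assumption: project scope and above need review
    enough_evidence = supporting_traces >= MIN_SUPPORTING_TRACES[target]
    return enough_evidence and (reviewed or not needs_review)
```

For example, `can_promote(Scope.RUN, Scope.TASK, 3, False)` passes on repeated traces alone, while `can_promote(Scope.TASK, Scope.PROJECT, 10, False)` fails until a review is recorded.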
The gate can be written as a predicate:
admit(u) iff
provenance(u) passes
and scope(u) is allowed for the target readers
and sensitivity(u) is allowed for the target storage
and freshness(u, now) >= required_freshness
and contradiction_check(u, M_t) passes
and expected_lift(u) - expected_cost(u) > threshold
The last term is estimated, then corrected by real evals after the memory is used.
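The predicate can be sketched in Python. This version assumes claims are simple subject and value pairs, treats freshness as a maximum age since last verification, and returns the failed checks so a rejection is explainable. All type names, field names, and example values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Candidate:
    subject: str
    value: str
    evidence_refs: list
    scope: str
    sensitivity: str
    last_verified: datetime
    expected_lift: float
    expected_cost: float


@dataclass
class GatePolicy:
    allowed_scopes: set
    allowed_sensitivity: set
    max_age: timedelta
    threshold: float


def admit(u: Candidate, active_claims: dict, policy: GatePolicy, now: datetime):
    """Return (admitted, failed_checks). Every check must pass for admission."""
    failures = []
    if not u.evidence_refs:
        failures.append("provenance")
    if u.scope not in policy.allowed_scopes:
        failures.append("scope")
    if u.sensitivity not in policy.allowed_sensitivity:
        failures.append("sensitivity")
    if now - u.last_verified > policy.max_age:
        failures.append("freshness")
    # Toy contradiction check: same subject already active with a different value.
    if u.subject in active_claims and active_claims[u.subject] != u.value:
        failures.append("contradiction")
    if u.expected_lift - u.expected_cost <= policy.threshold:
        failures.append("lift")
    return (not failures, failures)


policy = GatePolicy({"task", "project"}, {"public", "private"}, timedelta(days=30), 0.1)
now = datetime(2026, 1, 10)
active = {"api.v2.required_field": "X"}
fresh = Candidate("api.v2.required_field", "X", ["schema-file"], "project", "public",
                  datetime(2026, 1, 1), 0.2, 0.05)
stale = Candidate("api.v2.required_field", "Y", ["old-summary"], "project", "public",
                  datetime(2025, 1, 1), 0.2, 0.05)

fresh_result = admit(fresh, active, policy, now)
stale_result = admit(stale, active, policy, now)
```

Collecting every failed check, instead of stopping at the first, costs little and gives the reviewer the full picture when a write is rejected or quarantined.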
Retrieval Is An Intervention
Retrieval is not context stuffing.
Retrieval changes the policy input:
pi_theta(y | x)
becomes:
pi_theta(y | x, c)
where c is retrieved context.
That context can help, do nothing, or harm. It can help by supplying missing facts, preserving user preferences, recalling a successful procedure, or warning against a repeated mistake. It can harm by anchoring the model on irrelevant facts, stale procedures, false summaries, or over-broad preferences.
The evaluator has to measure the whole effect:
| Metric | What it catches |
|---|---|
| Recall@k | whether relevant memory can be found |
| Precision@k | whether retrieved memory is mostly useful |
| Contradiction rate | whether retrieval injects conflicting claims |
| Freshness pass rate | whether retrieved sources are still valid |
| Answer lift | whether final outputs improve |
| Task lift | whether full agent trajectories improve |
| Cost | whether memory increases latency, tokens, or tool calls |
| Abstention quality | whether the system knows when memory is insufficient |
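The first two rows can be computed directly once relevance labels exist. A small sketch, assuming memory IDs are unique and precision is divided by k (conventions differ when fewer than k items are returned). The IDs are invented.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision@k and Recall@k for one query, given ranked IDs and a relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    hits = sum(1 for memory_id in top_k if memory_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy example: two of the top three results are relevant, one relevant item is never found.
precision, recall = precision_recall_at_k(["m1", "m7", "m3", "m9"], {"m1", "m3", "m5"}, k=3)
```

These two numbers only describe the retriever. They say nothing about whether the retrieved items changed the outcome, which is why the lift and cost rows in the table are measured separately.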
Hit rate is not enough.
A memory can be retrievable and harmful. A vector store can return semantically similar experience that is operationally wrong. A graph can link related facts that differ in scope. A persona memory can dominate a task where it does not apply.
The promotion gate compares:
Score_with_memory - Score_without_memory
and also:
Cost_with_memory - Cost_without_memory
A memory layer that improves one benchmark by adding large latency and subtle privacy risk may be a bad production trade.
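A promotion decision that weighs both differences might look like the following sketch. The thresholds, the `ArmResult` fields, and the rule that any privacy regression is an automatic rejection are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ArmResult:
    score: float              # mean task score for this arm
    latency_s: float          # mean latency per task
    tokens: int               # mean tokens per task
    privacy_violations: int   # count found by a separate audit (assumed to exist)


def promotion_decision(with_memory: ArmResult, without_memory: ArmResult,
                       min_lift=0.02, max_extra_latency_s=1.0, max_extra_tokens=2000):
    lift = with_memory.score - without_memory.score
    extra_latency = with_memory.latency_s - without_memory.latency_s
    extra_tokens = with_memory.tokens - without_memory.tokens
    if with_memory.privacy_violations > without_memory.privacy_violations:
        return "reject: privacy regression"
    if lift < min_lift:
        return "reject: insufficient lift"
    if extra_latency > max_extra_latency_s or extra_tokens > max_extra_tokens:
        return "hold: lift does not justify cost"
    return "promote"


baseline = ArmResult(0.70, 4.0, 6000, 0)
candidate = ArmResult(0.78, 4.5, 7200, 0)
slow = ArmResult(0.78, 9.0, 7200, 0)
leaky = ArmResult(0.90, 4.0, 6000, 1)
```

Here `candidate` is promoted, `slow` is held despite the same lift, and `leaky` is rejected even though it has the highest score.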
Negative Knowledge
Negative knowledge is one of the most useful and most dangerous forms of memory.
It records what not to do:
do not use endpoint A after version 3
do not assume screenshots live in path P
do not ask the user for repo facts before inspecting the repo
do not collapse supervisor and worker roles for this task class
do not retry a failed deploy hook without checking logs
This is often the difference between an agent that keeps repeating a class of mistake and one whose gains actually compound across runs.
But negative knowledge needs expiration and scope.
If endpoint A comes back, the old warning can block correct work. If a failed tactic only failed under a specific model, repo, budget, or date, the warning cannot become a universal law. If the negative memory is a vague sentence like “avoid parallelization here,” it can suppress a good multi-agent strategy later.
Good negative memory has this shape:
claim: tactic T failed under condition C
evidence: trace spans and artifacts
scope: repo, tool, model, task class, or date range
replacement: use tactic R instead
expiry: when to re-check
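That shape maps onto a small record with an applicability check. In this sketch the scope is a dict of exact-match conditions and an expired warning simply stops applying until it is re-verified. The field values are invented examples.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class NegativeMemory:
    claim: str
    evidence: tuple
    scope: dict        # conditions under which the failure was observed
    replacement: str
    expiry: date       # after this date the warning must be re-checked

    def applies(self, context: dict, today: date) -> bool:
        """A warning applies only inside its scope and before its expiry."""
        if today >= self.expiry:
            return False
        return all(context.get(key) == value for key, value in self.scope.items())


warning = NegativeMemory(
    claim="endpoint A fails after version 3",
    evidence=("trace-span-ref",),  # placeholder reference
    scope={"service": "billing", "api_version": "3"},
    replacement="use endpoint B",
    expiry=date(2026, 3, 1),
)

in_scope = warning.applies({"service": "billing", "api_version": "3"}, date(2026, 1, 15))
other_version = warning.applies({"service": "billing", "api_version": "4"}, date(2026, 1, 15))
after_expiry = warning.applies({"service": "billing", "api_version": "3"}, date(2026, 3, 2))
```

Treating an expired warning as inactive is one option. A stricter system could instead surface it as "needs re-check" so the planner knows a constraint used to exist.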
Negative memory is not cynicism. It is a falsifiable constraint.
Memory Versus Skill
Procedural memory is close to skill optimization, but the distinction is useful.
A memory can say:
when patching a repo, inspect status and the last few commits first
A skill can operationalize it:
inputs: repo path
preconditions: git worktree exists
steps: status, log, reflog, open PRs
verification: no live rebase, no mid-merge, branch context known
The skill has an invocation contract, parameters, steps, and verification. The memory is the durable lesson that motivates or updates the skill.
Voyager’s executable library sits on the skill side. Reflexion’s verbal reflections sit on the episodic/procedural memory side. In production agents, the clean loop is:
trace shows repeated procedural failure
-> memory records the failure pattern
-> skill proposal updates the reusable procedure
-> held-out tasks test the skill
-> memory stores the promotion evidence
This prevents the memory layer from becoming a bag of instructions that only work when the model happens to read them.
Multi-Agent Memory
Multi-agent systems make memory harder because there is no single “the agent.”
There are drivers, workers, reviewers, supervisors, routers, judges, researchers, and coordinators. Each role needs a different memory view.
A coding worker may need:
repo conventions
tool-call habits
known failure modes
current task artifacts
A supervisor may need:
branch state
worker assignments
conflict map
quality bar
promotion gate
A judge may need:
rubric
reference outputs
verifier traces
leakage restrictions
A coordinator may need:
fanout policy
budget policy
selector rules
stop conditions
This is why a single optimized persona prompt is not enough. In a multi-agent flow, memory has to be routed by role and task. A worker does not need every supervisor constraint. A judge must not retrieve candidate-internal rationales, which would contaminate its independence. A coordinator should not treat one worker’s failed local path as a global ban unless the evidence says so.
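Role routing can be sketched as an allow-list filter applied before retrieval results reach any agent. The kind names below are derived from the lists above. The item format and the `"*"` wildcard for task-independent memories are assumptions.

```python
# Allow-list per role: anything not listed is denied by default.
ROLE_VIEWS = {
    "worker": {"repo_convention", "tool_habit", "failure_mode", "task_artifact"},
    "supervisor": {"branch_state", "worker_assignment", "conflict_map",
                   "quality_bar", "promotion_gate"},
    "judge": {"rubric", "reference_output", "verifier_trace"},
    "coordinator": {"fanout_policy", "budget_policy", "selector_rule", "stop_condition"},
}


def memory_view(role, task_id, items):
    """Return only the memory items this role may see for this task."""
    allowed_kinds = ROLE_VIEWS.get(role, set())
    return [
        item for item in items
        if item["kind"] in allowed_kinds and item["task"] in (task_id, "*")
    ]


items = [
    {"kind": "repo_convention", "task": "*", "text": "run the linter before committing"},
    {"kind": "candidate_rationale", "task": "t1", "text": "worker's private reasoning"},
    {"kind": "rubric", "task": "t1", "text": "score correctness before style"},
    {"kind": "rubric", "task": "t2", "text": "rubric for a different task"},
]

judge_view = memory_view("judge", "t1", items)
worker_view = memory_view("worker", "t1", items)
```

Deny-by-default means a new memory kind stays invisible to every role until someone decides who should see it, which is the safer failure mode for judge independence.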
The same point applies to maxTurns=0 agentic flows.
If a subagent gets one shot, it cannot learn inside its own episode. The learning has to happen outside it:
pre-run retrieval
post-run trace capture
cross-run write proposal
promotion gate
next-run retrieval
That is still a flywheel, but the flywheel lives in the harness and memory substrate, not inside the worker’s conversational loop.
Operator directives such as “parallelize independent reads” or “inspect the repo before asking” can become procedural or preference memory, but only if they are captured with scope and evidence. A prompt optimizer may discover a wording that says “parallelize,” but it cannot invent a parallel execution graph if the runtime cannot fan out calls. A memory system can remember the directive. The runtime still needs the action surface to use it.
Memory is not a substitute for topology.
It is the substrate that lets topology improve across episodes.
Knowledge Poisoning
Knowledge poisoning is not merely “the agent did not know something.”
A gap is:
the agent needed X and did not have it
Poisoning is:
the agent confidently used X, and X was wrong
The second case is worse because the agent does not ask. It acts.
In December 2025, MemoryGraft named a concrete version of this attack surface: poison an agent’s experience retrieval by making malicious successful-looking records persist into long-term memory. Later, semantically similar tasks retrieve those records and imitate the unsafe pattern.
The general pattern is broader:
stale wiki page
outdated web result
wrong prior-run summary
tool description with old return shape
system prompt copied from an older runtime
successful-looking trace from a compromised task
The defense is not “trust memory less” in the abstract. The defense is dual verification:
1. Did the agent act on the belief?
2. Does trace or source evidence show the belief is false?
Only when both answers are yes can the system emit a poisoning finding. Otherwise it risks turning uncertainty into fake certainty.
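The two checks can be expressed as a small classifier. In this sketch a poisoning finding requires both a span where the agent acted on the belief and evidence that contradicts it. The other labels are hypothetical names for the remaining cases.

```python
from dataclasses import dataclass, field


@dataclass
class BeliefCheck:
    belief: str
    acted_on_spans: list = field(default_factory=list)          # where the agent used the belief
    contradicting_evidence: list = field(default_factory=list)  # trace or source refs refuting it


def classify(check: BeliefCheck) -> str:
    acted = bool(check.acted_on_spans)
    refuted = bool(check.contradicting_evidence)
    if acted and refuted:
        return "poisoning"
    if refuted:
        return "false-but-unused"      # worth fixing, but no harm was observed
    if acted:
        return "acted-but-unverified"  # uncertainty, not a poisoning finding
    return "no-finding"


confirmed = classify(BeliefCheck("tool returns a list", ["span-12"], ["tool-response-3"]))
unverified = classify(BeliefCheck("tool returns a list", ["span-12"], []))
unused = classify(BeliefCheck("tool returns a list", [], ["tool-response-3"]))
```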
Poisoning remediation is also a memory write:
mark stale
supersede claim
quarantine source
lower confidence
add expiry
link contradiction evidence
trigger held-out replay
Bad memory cannot just be deleted quietly. The system needs to learn why it was bad.
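One way to keep the reason is to record remediation as a status change on the entry instead of removing it. The sketch below assumes a dict-based store where each entry has a `status` field. The keys, claims, and references are invented.

```python
from datetime import datetime, timezone


def quarantine(store, key, reason, contradiction_refs, now):
    """Return a new store where the entry is kept but marked quarantined, with the reason."""
    updated = {
        **store[key],
        "status": "quarantined",
        "reason": reason,
        "contradicted_by": list(contradiction_refs),
        "quarantined_at": now.isoformat(),
    }
    return {**store, key: updated}


def retrievable_keys(store):
    """Only active entries are eligible for retrieval."""
    return sorted(k for k, v in store.items() if v.get("status", "active") == "active")


store = {
    "tool.search.return_shape": {"claim": "returns a list", "status": "active"},
    "repo.test_command": {"claim": "run the test target", "status": "active"},
}
after = quarantine(
    store,
    "tool.search.return_shape",
    reason="tool response showed a dict, not a list",
    contradiction_refs=["tool-response-ref"],  # placeholder reference
    now=datetime(2026, 1, 10, tzinfo=timezone.utc),
)
```

The quarantined entry no longer reaches the model, but the claim, the reason, and the contradicting evidence stay available for audit and for held-out replay.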
How Tangle Fits
The local Tangle stack is close to the architecture described above.
@tangle-network/agent-knowledge is the knowledge substrate. In the checked local source, version 1.3.0 describes itself as “source-grounded, eval-gated knowledge growth primitives for agents.” Its exported surfaces include source records, source anchors, claims, relations, pages, graph search, readiness scoring, freshness tracking, safe write blocks, validation, proposal generation from analyst findings, research loops, and release reports.
The important detail is that it models memory as structured knowledge, not loose text:
SourceRecord
SourceAnchor
KnowledgeClaim
KnowledgeRelation
KnowledgePage
KnowledgeIndex
KnowledgeSearchResult
KnowledgeLintFinding
KnowledgeRelease
That shape supports the gate:
refs
confidence
status
validUntil
lastVerifiedAt
sourceIds
allowedPathPrefixes
lint findings
release reports
@tangle-network/agent-eval supplies the analyst side. The local source includes knowledge-gap and knowledge-poisoning analyst specs. The knowledge-gap analyst asks what the agent lacked or what was stale, then attributes the gap to the layer responsible for holding it:
agent-knowledge:wiki:<page>
agent-knowledge:claim:<topic>
agent-knowledge:raw:<source>
agent-knowledge:stale:<page>
websearch:outdated:<topic>
tool-doc:<tool>
system-prompt:<section>
memory:<key>
The knowledge-poisoning analyst asks for confident wrong action, then requires the dual verification protocol:
acted on false belief
belief contradicted by trace evidence
@tangle-network/agent-runtime supplies the bridge. The local createSurfaceKnowledgeAdapter wraps agent-knowledge proposal generation and write-block application. It converts analyst findings into knowledge proposals, applies write blocks against a knowledge root, and optionally lints after apply.
Put together, the stack can express this loop:
production trace
-> agent-eval analyst finding
-> agent-knowledge proposal
-> safe write block
-> lint and readiness checks
-> retrieved context
-> future production run
-> held-out and production evaluation
That is the memory flywheel as software.
The Readiness Gate
The most underrated piece is readiness.
Before an agent starts a task, the system can ask:
what knowledge is required for this task?
is it present?
is it fresh?
is it sensitive?
how confident does it need to be?
what happens if it is missing?
The local agent-knowledge readiness builder maps specs to requirements with fields such as:
category
acquisitionMode
importance
freshness
sensitivity
confidenceNeeded
fallbackPolicy
minSources
minHits
That is a better frame than “give the model memories.”
For blocking requirements, a missing item blocks the run, prompts a question, or triggers acquisition. For non-blocking requirements, the run can continue with caveats. For high-sensitivity requirements, retrieval may be disallowed for some roles. For realtime requirements, stale memory counts as missing.
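These rules can be sketched as a pre-flight check. This is not the agent-knowledge readiness API. `Requirement`, `check_readiness`, and the status strings are hypothetical, and the sensitivity rule is omitted to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Requirement:
    name: str
    blocking: bool
    realtime: bool
    max_age_days: int


def check_readiness(requirements, ages_days):
    """Map each requirement to a status, given the age in days of the best hit (absent = not found)."""
    report = {}
    for req in requirements:
        age = ages_days.get(req.name)
        stale = age is not None and age > req.max_age_days
        missing = age is None or (stale and req.realtime)  # stale realtime data counts as missing
        if missing:
            report[req.name] = "block" if req.blocking else "continue-with-caveat"
        elif stale:
            report[req.name] = "continue-with-caveat"
        else:
            report[req.name] = "ready"
    return report


requirements = [
    Requirement("deploy-runbook", blocking=True, realtime=False, max_age_days=90),
    Requirement("service-status", blocking=True, realtime=True, max_age_days=1),
    Requirement("style-guide", blocking=False, realtime=False, max_age_days=365),
]
report = check_readiness(requirements, {"deploy-runbook": 10, "service-status": 3})
```

In this toy run the runbook is ready, the three-day-old service status blocks the task because it is realtime, and the missing style guide only adds a caveat.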
Readiness turns memory from a passive archive into a pre-flight gate.
The Core Test
A memory system is doing real self-improvement when all of these are true:
1. A trace produces a specific finding.
2. The finding proposes a scoped memory write.
3. The write is source-grounded or explicitly scoped to its evidence.
4. The gate admits, rejects, asks, quarantines, or expires it.
5. Future retrieval selects it only for appropriate roles and tasks.
6. A paired eval shows task lift, not just retrieval activity.
7. Staleness, contradiction, privacy, and poisoning have review paths.
If any part is missing, the system may still be useful, but it is not a disciplined learning loop.
It may just be a larger prompt with a longer memory leak.
Source Trail
Source freshness checked on 2026-06-06.
- RAG, 2020
- Generative Agents, 2023
- Reflexion, 2023
- Voyager, 2023
- MemoryBank, 2023
- LongMem, 2023
- MemGPT, 2023
- LongMemEval, 2024 with 2025 revision
- A-MEM, 2025
- Mem0, 2025
- MemoryGraft, 2025
- AgeMem, 2026
- @tangle-network/agent-knowledge, local package: /Users/drew/webb/agent-knowledge
- @tangle-network/agent-eval, local package: /Users/drew/webb/agent-eval
- @tangle-network/agent-runtime, local package: /Users/drew/webb/agent-runtime
FAQ
Is memory the same as learning?
No. Memory becomes learning only when a gated write is later retrieved in the right context and a paired evaluation shows better behavior. Retrieval activity alone is not evidence of improvement.
What should an agent memory write include?
A useful write needs a claim or procedure, evidence references, scope, confidence, sensitivity, freshness policy, and retrieval policy. Without source grounding, a self-generated memory can preserve a hallucination.
Where does memory fit in the self-improving stack?
Memory sits beside skills and traces. Traces provide the evidence that proposes a memory write. Skills preserve procedures. Memory preserves source-grounded facts, observations, and preferences.