CoreTxInc.

Memory Architecture Benchmarks

LLMs lose knowledge as conversations grow. Context windows overflow, compaction silently erases facts, and RAG retrieves documents when you need specific answers.

Knowledge Objects are a structured memory layer. An LLM extracts discrete claims from unstructured text, each attributed and retrievable by exact key, linked by typed dependencies, persistent across sessions.

These benchmarks compare KOs against in-context memory, RAG, and flat memory files across 16 experiments with 5 random seeds, using real API calls to Claude Sonnet 4 and GPT-4o. KOs interoperate with any LLM. No retraining required.

Primary experiments on Claude Sonnet 4 (claude-sonnet-4-20250514). Compaction validated on Opus 4.6.

100% vs 0%: KO vs in-context retrieval at 10,000 facts
54% of goals silently lost after compaction; KO memory: zero lost
57× cheaper per query (KO vs in-context) at 1,000 stored facts
Accuracy vs. Corpus Size
KO retrieval stays at 100% while in-context methods hit context window limits

KO retrieval stays constant. In-context hits a hard wall at window limits. And 7,000 facts is not a lot — roughly what you'd extract from 100 research papers (each yielding 50–150 structured claims). In practice, attention degrades well before the hard limit: retrieval accuracy drops even with 20–30 documents in context as the model loses track of specific details amid surrounding text.

Retrieval Accuracy vs. Number of Full Papers
Accuracy degrades from 100% to 40% as more papers are loaded into context

38 real arXiv cosmology papers, target paper always at position 1 (most favorable). Even in this best case, accuracy drops to 94% at 10 papers (~88K tokens) and 40% at 20 papers (~177K tokens). At N=20, failures were confident wrong answers rather than abstentions: the model returned values from the wrong paper. At 25 full papers, the API rejects the request entirely: prompt too long. The context window is not just a quality constraint; it is a hard physical limit. 50 queries per data point, 95% Wilson confidence intervals shown. KO memory: 100% at any corpus size.

Goal Drift Under Compaction
Constraints silently lost under cascading compaction

After 3 rounds of compaction, over half of project constraints are silently lost. "Full context" (no compaction) preserves all 20 constraints because the corpus fits within the context window. The scaling figure above shows what happens when it does not fit: accuracy degrades well before the hard token limit.

What Is a Knowledge Object?

The Unit of Storage Matters
RAG stores document chunks. LLMs store parameters. KOs store individual facts.

RAG Chunk

"Erlotinib is a tyrosine kinase inhibitor used in non-small cell lung cancer. Studies have shown IC50 values ranging from 2.3 to 15.7 nM depending on the cell line..."

512 tokens. Embedded. Retrieved by similarity.

Flat Memory

"The user works on EGFR inhibitors. Key compounds include erlotinib (potent), gefitinib (moderate). See previous conversations for IC50 data."

Compressed summary. Details lost. No version history.

Knowledge Object

(Erlotinib, IC50 for EGFR, 2.3 nM)

LLM-extracted. Attributed. Versioned. O(1) retrieval.

A KO is a single verifiable claim: one subject, one predicate, one value. An LLM identifies structured claims from unstructured text, classifies the relationship type, and attaches provenance (SHA-256 hash + author). Each KO is linked to others by typed dependency edges (prerequisite, supports, contradicts). When a value is updated, the old KO is archived, creating a version chain. Retrieval uses exact key lookup with embedding fallback for ambiguous queries.

Why This Architecture Works: Neuroscience Foundations

KOs Mirror Hippocampal Memory Traces
The design of Knowledge Objects is modeled on how the hippocampus stores and retrieves episodic memories.
Hippocampal Property | KO Implementation
Pattern separation (distinct traces for similar inputs) | Each claim extracted as an attributed unit with a unique key, distinguishing even near-duplicate facts
Source memory (who said it, where, when) | Attribution hash (SHA-256) + author ORCID + timestamp + version chain
Associative binding (linking contexts) | Typed dependency edges between KOs (prerequisite, supports, contradicts)
Retrieval-dependent strengthening | Retrieval count + recency boost: memories accessed more often rank higher
Vividness decay (old memories fade) | Encoding strength at write time, exponential vividness decay (90-day half-life), refreshed on retrieval
Contextual reinstatement (one cue recalls the whole episode) | Session-grouped KOs: recalling one retrieves co-captured knowledge
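The decay and strengthening rows above can be expressed as one small ranking rule. This is a sketch: only the 90-day half-life comes from the text, while the log-damped retrieval boost and the weighting are illustrative assumptions.

```python
# Sketch of a retrieval-ranking rule: exponential vividness decay with a
# 90-day half-life (from the text), plus retrieval-dependent strengthening.
# The log damping and combination are illustrative assumptions.
import math

HALF_LIFE_DAYS = 90.0

def vividness(encoding_strength: float, days_since_last_access: float) -> float:
    # Exponential decay: vividness halves every 90 days without access.
    return encoding_strength * 0.5 ** (days_since_last_access / HALF_LIFE_DAYS)

def rank_score(encoding_strength: float, days_since_last_access: float,
               retrieval_count: int) -> float:
    # Memories accessed more often rank higher; log1p keeps a few hot
    # facts from dominating the ranking entirely.
    return vividness(encoding_strength, days_since_last_access) * (
        1.0 + math.log1p(retrieval_count))
```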
Bicameral Memory: Fast Store + Slow Consolidation
Complementary Learning Systems (CLS) theory: the hippocampus captures new experiences rapidly, while the neocortex consolidates them slowly into general knowledge. KO architecture implements both.

Hippocampal Layer (Fast)

Every fact stored as an attributed unit. Exact recall. No compression. Quality-gated at write time, O(1) retrieval at read time. This is what prevents the compaction loss shown above.

Neocortical Layer (Slow)

Summaries, cross-session patterns, and salience-weighted memories consolidate over time. Used for narrative context and general orientation, not for fact retrieval.

Standard LLM memory is neocortex-only: it compresses everything into summaries and loses the originals. KO architecture keeps both layers. Facts live in the structured knowledge layer (lossless, attributed, quality-gated). Summaries live in the neocortical layer (lossy but useful for context). Retrieval checks the structured layer first.

Write-Time Salience Gating
The hippocampus does not record everything. Salient inputs (novel, surprising, goal-relevant) are encoded more strongly. KOs implement this as a write-time gate.

Before a claim is admitted to the knowledge layer, a salience scorer evaluates relevance, novelty, and consistency with existing knowledge. Low-salience noise is rejected at write time rather than filtered at read time. This matters because read-time filtering collapses under noise: at 8:1 distractor ratios, the retriever returns mostly junk and the critic cannot recover (see Write-Time Curation results below). The gating architecture prevents noise from entering the knowledge layer in the first place.
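A minimal sketch of such a write-time gate, assuming a simple weighted score over novelty and relevance. The real scorer would be LLM-based; the stand-in scoring functions and the 0.6 threshold here are illustrative, not the production logic.

```python
# Sketch of a write-time salience gate: score a claim before admission and
# reject low-salience noise. Scorers and threshold are illustrative stand-ins.

def novelty(claim_key: str, store: dict) -> float:
    # Exact duplicates of existing knowledge score zero.
    return 0.0 if claim_key in store else 1.0

def relevance(claim_text: str, goal_terms: set[str]) -> float:
    # Toy stand-in for an LLM relevance scorer: keyword overlap with goals.
    words = set(claim_text.lower().split())
    return 1.0 if words & goal_terms else 0.0

def admit(claim_key: str, claim_text: str, store: dict,
          goal_terms: set[str], threshold: float = 0.6) -> bool:
    salience = 0.5 * novelty(claim_key, store) + 0.5 * relevance(claim_text, goal_terms)
    if salience < threshold:
        return False            # rejected at write time, never pollutes reads
    store[claim_key] = claim_text
    return True
```

Because rejection happens before storage, the read path never has to out-filter an 8:1 distractor majority.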

Beyond RAG

RAG retrieves documents. KOs retrieve facts. The difference is the unit of storage.

Capability | RAG (chunks) | KOs (facts)
Retrieval unit | Documents that seem relevant | The specific fact, by exact key
Provenance | "From paper X" | SHA-256 hash, author, version chain
Update propagation | Re-embed the document | Follow typed dependency edges downstream
Contradiction detection | None | Same (S,P), different O = signal
"What breaks if this is wrong?" | Cannot answer | Trace the dependency graph

The KO layer is also embedding-indexed for similarity search: exact key lookup first, embedding fallback for ambiguous queries. You get RAG-style retrieval plus exact addressing, structural matching, and per-fact provenance in the same system.
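The lookup order can be sketched in a few lines. The token-overlap similarity below is a toy stand-in for a real embedding index, and the function names are illustrative assumptions.

```python
# Sketch of hybrid retrieval: exact KO key match first (noise-immune, O(1)),
# embedding-style similarity as fallback. Token overlap stands in for a real
# embedding index; names are illustrative.

def token_overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def hybrid_retrieve(query_key: str, query_text: str,
                    ko_store: dict[str, str], documents: list[str]) -> str:
    # 1) Exact key lookup handles known facts regardless of corpus noise.
    if query_key in ko_store:
        return ko_store[query_key]
    # 2) Similarity fallback covers the long tail of uncovered queries.
    return max(documents, key=lambda d: token_overlap(query_text, d))
```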

Compaction Loss Compounds Over Sessions

Older Knowledge Is Silently Erased
Simulating 5 sessions of research work (40 facts each, compacted to 2,000 tokens after each session). Querying facts from each session after all compaction rounds.
Session of origin | KO Memory | Compacted Context
Session 1 (oldest) | 100% | 0%
Session 2 | 100% | 0%
Session 3 | 100% | 0%
Session 4 | 100% | 60%
Session 5 (most recent) | 100% | 100%
Overall | 100% | 32%

Compacted context develops a severe recency bias: sessions 1–3 are completely erased, session 4 partially retained, session 5 fully present. This is the opposite of what long-term memory should do. KO memory retains all facts from all sessions equally.

Experiment M. 200 facts across 5 sessions, compacted to 2,000 tokens after each. Seed 42.

Compaction Curve: Model Quality Doesn't Help
Facts retained after a single compaction at increasing ratios. Tested with Claude Sonnet 4 ($3/M) and Claude Opus 4.6 ($15/M, 5× the cost).
Compaction Ratio | Sonnet 4 | Opus 4.6 | KO Memory
1× (no compaction) | 100% | 100% | 100%
– | – | 80% | 100%
– | 25% | 45% | 100%
10× | 0% | 15% | 100%
20× | 0% | 5% | 100%

Even the most capable model (Opus 4.6, 5× the cost) delays the collapse slightly but doesn't prevent it. Compaction loss is architectural, not a model quality issue. At 10× compression, Sonnet retains nothing; Opus retains 15%. At 20×, both are near zero. KO memory is unaffected at any ratio.

Experiment A. 5 seeds averaged.

Goal Drift Under Compaction

Alignment Decisions Lost After Compaction
20 non-default project constraints embedded in an 88-turn conversation, then compacted at increasing ratios. Loss = silent reversion to the default.
Compaction | Ratio | Constraints Preserved | Lost
None | – | 20 / 20 | 0
1 round | – | 18.2 / 20 | 1.8
2 rounds | 17× | 12.4 / 20 | 7.6
3 rounds | 31× | 9.2 / 20 | 10.8

After 3 rounds: "No Redis" (client requirement) reverts to Redis. "Deploy to eu-west-1" (GDPR) reverts to us-east-1. The model continues with full confidence. There is no error signal. KO memory: all 20 constraints preserved at every compression ratio.

Experiment 10c. 5 seeds averaged.

KO + Embedding Hybrid: Strictly Better Than Either Alone

Exact Retrieval + Semantic Search
KOs don't replace embedding retrieval — they augment it. The hybrid architecture checks the KO layer first (exact key match), then falls back to embedding retrieval for queries without a stored fact. This gives you noise-immune precision on known facts and full coverage on the long tail.
Noise Ratio | RAG Only | KO Only | KO + RAG Hybrid
1:1 (clean) | 100% | 50% | 100%
2:1 | 100% | 50% | 100%
4:1 | 95% | 50% | 98%
8:1 | 90% | 50% | 95%
16:1 | 85% | 50% | 93%

Hybrid ≥ RAG at every single noise ratio. RAG degrades as distractors increase (100% → 85%). KO alone has limited coverage (50%) but perfect precision on stored facts. The hybrid takes the best of both: for KO-covered queries, it maintains 100% accuracy at all noise ratios. The remaining gap comes from queries that must fall back to embedding retrieval.

Experiment K. Pharmacology corpus, 5 seeds averaged. Noise = unrelated facts mixed into the retrieval corpus at increasing ratios.

Real-World Validation: 1,001 Wikipedia Facts

Same-Domain Distractors Break Embeddings
1,001 facts across 7 domains (countries, elements, animals, cities, films, historical events, scientists). Same-domain distractors to challenge cosine similarity — the realistic case where your knowledge base contains many similar entities.
Noise Ratio | KO Only | Embedding Only | KO + Embedding Hybrid
1:1 (clean) | 92% | 97% | 96%
4:1 | 93% | 92% | 96%
8:1 | 93% | 82% | 96%
16:1 | 93% | 68% | 95%

Embedding retrieval loses 29pp under noise (97% → 68%). Same-domain distractors confuse cosine similarity — "population of Paris" and "population of Lyon" are nearly identical vectors. KO exact retrieval is noise-immune (92–93% throughout); the ~7% error comes from key extraction ambiguity, not retrieval failure. Hybrid is strictly better than either alone at 4:1+ noise.

In real knowledge bases, semantic noise is the norm: multiple similar entities, overlapping facts, near-duplicate claims. Exact retrieval handles known facts; embeddings handle the long tail. The hybrid architecture gives you both.

3 seeds averaged. Wikipedia facts with same-domain distractors at each ratio.

Cost Per Query

KO Cost Is Constant, In-Context Grows Linearly
KO retrieves ~30 tokens (the specific fact). In-context loads ALL facts into the prompt. At Claude Sonnet 4 pricing ($3/M input):
Corpus size | KO query | In-context query | Multiplier
100 facts | $0.002 | $0.01 | –
1,000 facts | $0.002 | $0.09 | 57×
5,000 facts | $0.002 | $0.45 | 283×
10,000 facts | $0.002 | $0.90 | 566×

At N=10,000, Sonnet's context window overflows entirely. RAG sits in between at ~$0.009/query. KO's retrieval unit is already answer-sized (~30 tokens vs ~500 for a RAG chunk).
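The shape of this cost table follows from back-of-envelope arithmetic. The token counts below (~30 tokens per stored fact loaded in-context, a roughly constant ~670-token KO query prompt) are assumptions inferred from the listed costs, not measured values.

```python
# Back-of-envelope cost model at Sonnet 4 input pricing ($3/M tokens).
# Token counts are assumptions inferred from the cost table, not measurements.

PRICE_PER_TOKEN = 3.0 / 1_000_000   # $3 per million input tokens
TOKENS_PER_FACT = 30                # assumed tokens per fact in the prompt
KO_PROMPT_TOKENS = 670              # assumed fixed KO query prompt size

def in_context_cost(n_facts: int) -> float:
    # In-context loads every stored fact: cost grows linearly with N.
    return n_facts * TOKENS_PER_FACT * PRICE_PER_TOKEN

def ko_cost() -> float:
    # KO retrieves one answer-sized fact: cost is constant in N.
    return KO_PROMPT_TOKENS * PRICE_PER_TOKEN
```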

Won't Larger Context Windows Solve This?

Larger windows delay when compaction triggers. They do not prevent it.
Problem | Bigger Window | KO Memory
Goal drift after compaction | Same degradation, delayed onset | Constraints stored as discrete objects
Cost per query | Worse: grows linearly with N | Constant ($0.002 regardless of N)
Session boundaries | Context doesn't persist | Persists by construction
Long-running agents | Any window fills eventually | Corpus grows without degradation

Detailed Evidence

Scaling: Accuracy vs. Corpus Size
Exact-match retrieval of pharmacology facts at increasing corpus sizes. 30 queries per N value for N ≥ 1,000.
Corpus Size (N) | Claude In-Context | KO (hash lookup)
10–500 | 445 / 445 | 445 / 445
1,000 | 150 / 150 | 150 / 150
3,000 | 149 / 150 | 150 / 150
5,000 | 149 / 150 | 150 / 150
7,000 | 150 / 150 | 150 / 150
8,000+ | Overflow (200K) | 300 / 300

Experiments 1, 7, 8, 9. 5 seeds. Claude Sonnet 4 performs near-perfectly within its window. The failure mode is a hard wall, not gradual degradation.

Aggregation: Counting and Range Queries
Queries requiring multiple facts: "How many drugs target EGFR?" Embedding top-k misses facts.
Query Type | In-Context | KO (hash) | Embedding (top-5)
Count ("How many...") | 100% | 100% | 44%
Comparison | 100% | 96% | 100%
Range ("min/max...") | 100% | 100% | 92%
Overall | 100% | 74% | 59%

Experiment F. 90 facts, 5 targets, 4 query types. 5-seed mean. KO outperforms embedding by +15pp.

KO + RAG Hybrid: Noise Immunity (Detail)
Full breakdown across 5 noise ratios. KO handles stored facts (exact, noise-immune). RAG handles the long tail. Hybrid outperforms either alone at every ratio.
Noise Ratio | RAG Only | KO Only | KO + RAG Hybrid
1:1 (clean) | 100% | 50% | 100%
2:1 | 100% | 50% | 100%
4:1 | 95% | 50% | 98%
8:1 | 90% | 50% | 95%
16:1 | 85% | 50% | 93%

Experiment K. 5 seeds averaged. KO alone scores 50% because half the queries target non-KO facts (requiring embedding fallback). RAG degrades under noise. Hybrid is strictly superior.

Write-Time Curation
50 facts (10 correct + 40 distractors) per trial. Salience scoring filters at write time.
Method | Accuracy | At 8:1 Noise
Ungated store (all facts admitted) | 13.3% | 0%
Read-time filtering (LLM critic) | 93.8% | 0%
Write-gated KOs | 100% | 100%

Experiments 1–3. At 8:1 noise, read-time filtering collapses because the retriever returns mostly distractors. Write gating prevents noise from entering the store entirely.

Multi-Domain: Five Domains, One Store
500 facts across pharmacology, materials science, astronomy, economics, chemistry.
Method | Accuracy
KO (hash lookup) | 94%
Embedding (semantic search) | 100%

Experiment H. 75 queries across 5 domains. 5-seed mean. KO's 6pp gap comes from key extraction failures in economics and chemistry phrasing.

What About Neural Memory?

Titans: Google's Learned Memory Layer
Titans (Google DeepMind, 2025) adds a differentiable long-term memory module to transformer architectures. Instead of storing facts externally, the model learns to write and read from a neural memory matrix during training. It is the most prominent recent attempt to give transformers persistent memory through architecture changes alone. We benchmarked it as a neural memory baseline.
Metric | Titans (best dim) | KO Memory
Free recall ("What is the IC50 of erlotinib?") | 0–10% | 100%
Forced choice (pick correct answer from 4) | 25–90% | 100%
Memorization (reproduce training data) | 89–100% | –

Titans memorizes but doesn't know. It achieves near-perfect memorization of training sequences (89–100%) but cannot answer questions about the same facts (0–10% free recall). The memory module learns to compress and replay patterns, not to extract and retrieve structured knowledge. This is the fundamental gap between parametric memory (learned weights) and explicit memory (stored facts).

3 memory dimensions (128, 256, 512) × 5 corpus sizes (10–200 facts) × 5 seeds. Depth-6 memory module, 100 training epochs per run. Total: 75 configurations.

Methodology

Corpus: Synthetic pharmacology (drug-target binding affinities, confusable near-duplicates), five synthetic domains (500 facts), real arXiv cosmology papers.
Models: Claude Sonnet 4 (claude-sonnet-4-20250514, 200K context), GPT-4o. All numbers from real API calls.
Seeds: 5 random seeds (42, 123, 456, 789, 1337) for most experiments.
Evaluation: Exact string match. KO retrieval uses a realistic pipeline (LLM parses query, hash lookup), not oracle key extraction.
Total cost: ~$110 across all experiments. ~1,600 API calls.