Structured memory for AI agents

Agents forget.
We remember.

Drop a long agent run into structured memory — episodic log, causal graph, phase index. Then ask the hard questions a raw context window can't answer. Watch both sides race, live.

208 benchmark episodes · 3 memory layers · 100+ turns per run

See it race.

Same question, same model. Left has only the raw transcript. Right has our memory. Pick a question and hit run.

Question · type A (Recall / Causal)

The observation after the `up` action at Step 8 is identical to the observation from Step 6. What is the causal relationship between the action at Step 7 (`down`) and the action at Step 8 (`up`) that explains this state reversion, and what does this two-step sequence imply about the agent's progress?
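The state reversion this question describes can be spotted mechanically. The sketch below is only an illustration: the `Step` shape, the `findReversions` helper, the "state T" observation, and the action at Step 6 are all assumptions, not part of the benchmark or of Hermes.

```typescript
// Hypothetical step record; the field names are illustrative assumptions.
interface Step {
  index: number;
  action: string;
  observation: string;
}

// Returns the indices of steps whose observation equals the observation
// from two steps earlier, meaning the last two actions cancelled out.
function findReversions(steps: Step[]): number[] {
  const reverted: number[] = [];
  for (let i = 2; i < steps.length; i++) {
    if (steps[i].observation === steps[i - 2].observation) {
      reverted.push(steps[i].index);
    }
  }
  return reverted;
}

// Toy run: "state T" and the Step 6 action are invented for the example.
const demoRun: Step[] = [
  { index: 6, action: "up", observation: "state S" },
  { index: 7, action: "down", observation: "state T" },
  { index: 8, action: "up", observation: "state S" },
  { index: 9, action: "down", observation: "state T" },
  { index: 10, action: "up", observation: "state S" },
];

const reversions = findReversions(demoRun);
console.log(reversions); // [8, 9, 10]
```

Consecutive reversions like these are what mark an oscillation: the agent keeps undoing its previous move and makes no net progress.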

Memory graph · retrieval traversal

Legend: Causal · Temporal · Entity · Semantic · State update
Nodes: origin · position S · Step 6 (state S) · Step 7 (down) · Step 8 (up) · oscillation (loop) · Step 9 (down) · Step 10 (up)

Baseline — no memory

Hermes + Memory

recorded

How it works

Three layers of memory, one query.

A raw context window is a flat wall of text — the model re-reads everything and still misses the link between step 7 and step 90. We split a run into three structured layers, then fuse them at query time.

Layer 1

Episodic log

Every step, time-stamped and embedded. Answers what happened and when — recall by meaning, not by scrolling a transcript.
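A minimal sketch of what recall by meaning could look like, assuming each log entry stores an embedding vector and recall ranks entries by cosine similarity. The `EpisodicEntry` shape, the `recall` helper, the toy three-dimensional vectors, and the step texts are all hypothetical; a real system would get its embeddings from a model.

```typescript
// Hypothetical entry shape: one record per step, time-stamped and embedded.
interface EpisodicEntry {
  step: number;
  timestamp: number; // milliseconds since the run started (assumed unit)
  text: string;
  embedding: number[];
}

// Cosine similarity between two vectors of equal length.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Returns the k entries most similar to the query embedding.
function recall(entries: EpisodicEntry[], query: number[], k: number): EpisodicEntry[] {
  return entries
    .slice() // copy so the original log order is preserved
    .sort((x, y) => cosine(y.embedding, query) - cosine(x.embedding, query))
    .slice(0, k);
}

// Toy log with made-up embeddings and texts.
const episodicLog: EpisodicEntry[] = [
  { step: 7, timestamp: 7000, text: "moved down", embedding: [0.9, 0.1, 0] },
  { step: 8, timestamp: 8000, text: "moved up", embedding: [0.1, 0.9, 0] },
  { step: 90, timestamp: 90000, text: "opened door", embedding: [0, 0.1, 0.9] },
];

const topEntries = recall(episodicLog, [0, 0.2, 1], 2);
console.log(topEntries.map((e) => e.step)); // [90, 8]
```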

Layer 2

Causal graph

Typed edges — causal, temporal, entity, semantic — link steps and objects. Answers why: cause, effect, and state changes across the whole run.
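One way to picture typed edges is as a list of `(from, to, type)` records that a query walks selectively. The sketch below assumes that shape; the `Edge` interface, the `traverse` helper, and the example edges are illustrative guesses, not the actual graph schema.

```typescript
// The four edge types named above.
type EdgeType = "causal" | "temporal" | "entity" | "semantic";

// Hypothetical edge record linking two node ids.
interface Edge {
  from: string;
  to: string;
  type: EdgeType;
}

// Breadth-first walk from `start`, following only edges whose type is allowed,
// for at most `maxHops` hops. Returns visited node ids in discovery order.
function traverse(edges: Edge[], start: string, allowed: EdgeType[], maxHops: number): string[] {
  const visited: string[] = [start];
  let frontier: string[] = [start];
  for (let hop = 0; hop < maxHops && frontier.length > 0; hop++) {
    const next: string[] = [];
    for (const node of frontier) {
      for (const edge of edges) {
        const follow =
          edge.from === node &&
          allowed.indexOf(edge.type) !== -1 &&
          visited.indexOf(edge.to) === -1;
        if (follow) {
          visited.push(edge.to);
          next.push(edge.to);
        }
      }
    }
    frontier = next;
  }
  return visited;
}

// Toy graph: the specific edges are assumptions made for the example.
const demoEdges: Edge[] = [
  { from: "step6", to: "step7", type: "temporal" },
  { from: "step7", to: "step8", type: "temporal" },
  { from: "step7", to: "step8", type: "causal" },
  { from: "step6", to: "state S", type: "entity" },
  { from: "step8", to: "state S", type: "entity" },
  { from: "step8", to: "step9", type: "temporal" },
];

const reached = traverse(demoEdges, "step7", ["causal", "entity"], 2);
console.log(reached); // ["step7", "step8", "state S"]
```

Restricting the walk to causal and entity edges skips the purely temporal link to `step9`, which is the point of typing the edges: a "why" question follows cause and state, not just sequence.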

Layer 3

Phase index

Boundaries split the run into phases (explore, loop, stall). Answers structure: where progress happened and where the agent got stuck.
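A phase index can be as small as a sorted list of labeled step ranges. The sketch below assumes that representation; the `Phase` shape, both helpers, and the boundary values are hypothetical and chosen only to show the lookups.

```typescript
// Phase labels taken from the description above.
type PhaseLabel = "explore" | "loop" | "stall";

// Hypothetical phase record; start and end are inclusive step numbers.
interface Phase {
  label: PhaseLabel;
  start: number;
  end: number;
}

// Returns the phase containing the given step, if any.
function phaseAt(phases: Phase[], step: number): Phase | undefined {
  for (const phase of phases) {
    if (step >= phase.start && step <= phase.end) {
      return phase;
    }
  }
  return undefined;
}

// Counts how many steps the run spent in phases with the given label.
function stepsIn(phases: Phase[], label: PhaseLabel): number {
  let total = 0;
  for (const phase of phases) {
    if (phase.label === label) {
      total += phase.end - phase.start + 1;
    }
  }
  return total;
}

// Toy boundaries, invented for the example.
const demoPhases: Phase[] = [
  { label: "explore", start: 1, end: 5 },
  { label: "loop", start: 6, end: 10 },
  { label: "stall", start: 11, end: 20 },
];

const phaseOfStep8 = phaseAt(demoPhases, 8);
console.log(phaseOfStep8 ? phaseOfStep8.label : "none"); // "loop"
console.log(stepsIn(demoPhases, "loop")); // 5
```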

Multi-anchor retrieval locks onto the most relevant nodes, traverses typed edges, and assembles evidence from all three layers. Reciprocal rank fusion (RRF) then ranks that evidence before the model answers.
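Assuming RRF here means reciprocal rank fusion, the fusion step can be sketched in a few lines: each layer returns its own ranked list, and an item scores 1 / (k + rank) for every list it appears in. The `rrfFuse` helper, the constant k = 60 (a commonly used default), and the three example rankings are assumptions for illustration, not the production retrieval code.

```typescript
// Reciprocal rank fusion: sums 1 / (k + rank) over every ranking an item
// appears in (rank is 1-based), then sorts items by descending score.
function rrfFuse(rankings: string[][], k: number = 60): string[] {
  const fused: Record<string, number> = {};
  for (const ranking of rankings) {
    ranking.forEach((id, position) => {
      fused[id] = (fused[id] || 0) + 1 / (k + position + 1);
    });
  }
  return Object.keys(fused).sort((a, b) => fused[b] - fused[a]);
}

// Hypothetical per-layer results for one query.
const episodicHits = ["step8", "step7", "step6"];
const causalHits = ["step7", "step8", "step9"];
const phaseHits = ["step7", "step9", "step10"];

const fusedRanking = rrfFuse([episodicHits, causalHits, phaseHits]);
console.log(fusedRanking.slice(0, 3)); // ["step7", "step8", "step9"]
```

Because RRF uses only ranks, the layers do not need comparable scores: `step7` wins here because it ranks near the top of all three lists, not because any single layer scored it highest.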

AMA-bench accuracy by question type

208 episodes · baseline vs. memory-augmented

Question type                Baseline   + Memory
Type A · Recall / Causal     41%        78%
Type B · Causal Inference    33%        71%
Type C · State Updating      28%        69%
Type D · State Abstraction   36%        74%

placeholder scores — replace in app/lib/demo-data.ts with measured runs
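For orientation, here is one shape the score data could take, with the percentage-point gain computed per question type. The `TypeScore` interface and all names are hypothetical and do not reflect the real contents of `app/lib/demo-data.ts`; the numbers are the placeholder scores shown above, not measured results.

```typescript
// Hypothetical record for one question type; values are accuracy percentages.
interface TypeScore {
  type: "A" | "B" | "C" | "D";
  label: string;
  baseline: number;
  memory: number;
}

// Placeholder scores copied from the chart above (not measured runs).
const placeholderScores: TypeScore[] = [
  { type: "A", label: "Recall / Causal", baseline: 41, memory: 78 },
  { type: "B", label: "Causal Inference", baseline: 33, memory: 71 },
  { type: "C", label: "State Updating", baseline: 28, memory: 69 },
  { type: "D", label: "State Abstraction", baseline: 36, memory: 74 },
];

// Percentage-point gain of the memory-augmented run over the baseline.
function pointGain(score: TypeScore): number {
  return score.memory - score.baseline;
}

const gains = placeholderScores.map(pointGain);
const meanGain = gains.reduce((sum, g) => sum + g, 0) / gains.length;

console.log(gains); // [37, 38, 41, 38]
console.log(meanGain); // 38.5
```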

Give your agents a memory that actually answers.

We turn long, messy agent runs into structured memory that beats raw context on the hardest recall and reasoning questions. Built on AMA-bench, wired into Hermes.