Agents forget.
We remember.
Drop a long agent run into structured memory — episodic log, causal graph, phase index. Then ask the hard questions a raw context window can't answer. Watch both sides race, live.
208 benchmark episodes
3 memory layers
100+ turns per run
See it race.
Same question, same model. Left has only the raw transcript. Right has our memory. Pick a question and hit run.
Question · type A (Recall / Causal)
The observation after the `up` action at Step 8 is identical to the observation from Step 6. What is the causal relationship between the action at Step 7 (`down`) and the action at Step 8 (`up`) that explains this state reversion, and what does this two-step sequence imply about the agent's progress?
Memory graph · retrieval traversal
Baseline — no memory
Hermes + Memory
How it works
Three layers of memory, one query.
A raw context window is a flat wall of text — the model re-reads everything and still misses the link between step 7 and step 90. We split a run into three structured layers, then fuse them at query time.
Episodic log
Every step, time-stamped and embedded. Answers what happened and when — recall by meaning, not by scrolling a transcript.
Causal graph
Typed edges — causal, temporal, entity, semantic — link steps and objects. Answers why: cause, effect, and state changes across the whole run.
Phase index
Boundaries split the run into phases (explore, loop, stall). Answers structure: where progress happened and where the agent got stuck.
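As a rough illustration of how the three layers could sit side by side, here is a minimal TypeScript sketch. Every name in it (`RunMemory`, `EpisodicEntry`, `Edge`, `Phase`, `follow`, `phaseOf`) is hypothetical and not taken from the actual codebase, and the sample data is invented to mirror the Step 7 / Step 8 question above. It assumes node ids are plain strings so that steps and objects can share one graph.

```typescript
// Hypothetical shapes for the three layers; not the project's real schema.
type EdgeType = "causal" | "temporal" | "entity" | "semantic";

interface EpisodicEntry {
  step: number;
  timestamp: string; // ISO 8601 time of the step
  action: string;
  observation: string;
  embedding: number[]; // vector used for recall by meaning
}

interface Edge {
  from: string; // node id, e.g. "step:7" or "obj:door"
  to: string;
  type: EdgeType;
}

interface Phase {
  label: "explore" | "loop" | "stall";
  startStep: number; // inclusive
  endStep: number; // inclusive
}

interface RunMemory {
  episodic: EpisodicEntry[];
  edges: Edge[];
  phases: Phase[];
}

// Follow outgoing edges of one type from a node ("why" questions).
function follow(memory: RunMemory, from: string, type: EdgeType): string[] {
  return memory.edges
    .filter((e) => e.from === from && e.type === type)
    .map((e) => e.to);
}

// Find the phase that contains a step ("structure" questions).
function phaseOf(memory: RunMemory, step: number): Phase | undefined {
  return memory.phases.find((p) => step >= p.startStep && step <= p.endStep);
}

// Invented sample data: `down` at step 7 is undone by `up` at step 8.
const memory: RunMemory = {
  episodic: [
    { step: 7, timestamp: "2024-01-01T00:00:07Z", action: "down", observation: "obs-B", embedding: [0.1, 0.9] },
    { step: 8, timestamp: "2024-01-01T00:00:08Z", action: "up", observation: "obs-A", embedding: [0.9, 0.1] },
  ],
  edges: [
    { from: "step:7", to: "step:8", type: "temporal" },
    { from: "step:7", to: "step:8", type: "causal" },
  ],
  phases: [{ label: "loop", startStep: 6, endStep: 8 }],
};

const effects = follow(memory, "step:7", "causal");
const phase = phaseOf(memory, 7);
console.log(effects, phase?.label);
```

Keeping edges typed lets a query ask for causes only, without wading through every temporal neighbor, while the phase lookup places a step inside the larger shape of the run.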
Multi-anchor retrieval locks onto the most relevant nodes, traverses typed edges, and assembles evidence from all three layers. Reciprocal rank fusion (RRF) then merges the per-layer rankings into a single ordered list before the model answers.
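The fusion step can be sketched with the standard reciprocal rank fusion formula, where each item scores the sum of 1 / (k + rank) over every list it appears in. This is a generic sketch, not the project's implementation: the function name `rrfFuse`, the sample node ids, and the constant k = 60 (a commonly used default) are assumptions.

```typescript
// Reciprocal rank fusion over ranked lists of node ids (best first).
// score(id) = sum over lists of 1 / (k + rank), with rank starting at 1.
interface FusedItem {
  id: string;
  score: number;
}

function rrfFuse(lists: string[][], k = 60): FusedItem[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    // Highest score first; break ties by id so the output is deterministic.
    .sort((a, b) => b.score - a.score || a.id.localeCompare(b.id));
}

// Hypothetical per-layer rankings for one query.
const episodicHits = ["step-8", "step-7", "step-6"];
const causalHits = ["step-7", "step-8", "step-90"];
const phaseHits = ["step-7", "step-6"];

const fused = rrfFuse([episodicHits, causalHits, phaseHits]);
console.log(fused.map((item) => item.id));
```

Because RRF uses only rank positions, the layers do not need comparable score scales: an embedding similarity and a graph hop count can be fused without normalization. In this sample, `step-7` ranks first because it places highly in all three lists.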
AMA-bench accuracy by question type
208 episodes · baseline vs. memory-augmented
Recall / Causal · type A
Causal Inference · type B
State Updating · type C
State Abstraction · type D
placeholder scores — replace in app/lib/demo-data.ts with measured runs
Give your agents a memory that actually answers.
We turn long, messy agent runs into structured memory that beats raw context on the hardest recall and reasoning questions. Built on AMA-bench, wired into Hermes.