91.4%
Overall accuracy — all 500 questions, including abstention.
92.3%
Averaged across the six task types.
500 questions
Histories scalable past 1M tokens. ICLR 2025 dataset.
Results by task type
| Task type | Score | Correct |
|---|---|---|
| Knowledge update | 100.0% | 72 / 72 |
| Abstention | 100.0% | 30 / 30 |
| Single-session · user fact | 98.4% | 63 / 64 |
| Single-session · preference | 93.3% | 28 / 30 |
| Multi-session | 90.2% | 109 / 121 |
| Single-session · assistant | 89.3% | 50 / 56 |
| Temporal reasoning | 82.3% | 102 / 124 |
| Overall | 91.4% | 457 / 500 |
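The two headline figures differ because one pools all questions and the other averages over task types. The sketch below reproduces both from the published scores. It assumes the "six task types" are every row except Abstention, which this page does not state explicitly; the function names are illustrative only.

```python
# Per-task scores (%) as published in the table above.
# Assumption: the "six task types" are all rows except Abstention.
task_scores = {
    "knowledge_update": 100.0,
    "single_session_user": 98.4,
    "single_session_preference": 93.3,
    "multi_session": 90.2,
    "single_session_assistant": 89.3,
    "temporal_reasoning": 82.3,
}


def overall_accuracy(correct: int, total: int) -> float:
    """Pooled accuracy: every question counts equally."""
    return 100.0 * correct / total


def task_averaged_accuracy(scores: dict[str, float]) -> float:
    """Unweighted mean over task types: every task type counts equally."""
    return sum(scores.values()) / len(scores)


overall = overall_accuracy(457, 500)
task_avg = task_averaged_accuracy(task_scores)
print(f"overall: {overall:.1f}%")
print(f"task-averaged: {task_avg:.2f}%")
```

Under that assumption the unweighted mean is about 92.25%, consistent with the 92.3% shown above. Larger task types such as temporal reasoning pull the pooled figure lower than the task average.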
What each task type measures
Knowledge update — 100.0%
A fact changes over the course of the history (the user moves city, switches jobs, updates a goal). The agent must answer with the current value and ignore the superseded one. Engram’s confidence + contradiction model is built for exactly this — newer evidence supersedes older beliefs.
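As a rough illustration of "newer evidence supersedes older beliefs", the sketch below keeps only the most recently observed value for each fact. It is not Engram's implementation: the `Belief` type, its fields, and `current_beliefs` are hypothetical, and the sketch uses recency alone, ignoring the confidence scoring mentioned above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Belief:
    """Hypothetical record of one stated fact (not an Engram type)."""
    key: str          # what the fact is about, e.g. "home_city"
    value: str        # the stated value
    observed_on: date # when the user stated it


def current_beliefs(beliefs):
    """Return the latest belief per key; older values are superseded."""
    latest = {}
    for belief in beliefs:
        held = latest.get(belief.key)
        if held is None or belief.observed_on > held.observed_on:
            latest[belief.key] = belief
    return latest


history = [
    Belief("home_city", "Boston", date(2023, 1, 10)),
    Belief("home_city", "Denver", date(2024, 6, 2)),  # the user moved
    Belief("employer", "Acme", date(2023, 3, 5)),
]
resolved = current_beliefs(history)
print(resolved["home_city"].value)
```

Here the later "Denver" entry replaces "Boston", while the uncontested employer fact is kept unchanged.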
Abstention — 100.0%
Some questions are unanswerable from the history by design. The agent must recognise this and decline rather than fabricate an answer. Surfacing only well-supported memories keeps the agent from inventing facts.
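One way to picture "surfacing only well-supported memories" is a confidence cutoff applied before anything reaches the answerer. The sketch below is an assumption-laden illustration: the `Memory` type, the `0.7` threshold, and the convention of returning `None` to signal abstention are all hypothetical and not taken from Engram.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    """Hypothetical retrieved memory with a confidence score in [0, 1]."""
    text: str
    confidence: float


def build_context(memories, min_confidence=0.7):
    """Keep only well-supported memories; return None if none qualify.

    A None result signals that the answerer should decline to answer
    rather than guess from weak evidence.
    """
    kept = [m for m in memories if m.confidence >= min_confidence]
    if not kept:
        return None
    return "\n".join(m.text for m in kept)


strong = build_context([Memory("Dog is named Rex", 0.95),
                        Memory("Maybe owns a cat", 0.30)])
weak = build_context([Memory("Maybe owns a cat", 0.30)])
print(strong)
print("abstain" if weak is None else weak)
```

With only weakly supported memories available, the second call yields no context, which is the cue to abstain.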
Single-session · user fact — 98.4%
Recall a specific fact the user stated within a single conversation (“my flight is at 6am”, “my dog’s name is Rex”).
Single-session · preference — 93.3%
Surface a user preference — stated outright (“I prefer aisle seats”) or implied by repeated behaviour.
Single-session · assistant — 89.3%
Recall something the assistant said, recommended, or listed earlier (“what was the 7th restaurant you suggested?”). Requires storing assistant turns, not just user statements.
Temporal reasoning — 82.3%
Reason about when events happened (ordering, duration, and “how long ago”) across the timeline of the conversation history. Engram sorts recalled memories chronologically with [DATE] tags and gives the answerer the question’s reference date, so duration and “how long ago” questions resolve correctly.
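The chronological ordering and reference date described above could look roughly like the following. This is a sketch under stated assumptions: the exact tag format, the `Reference date:` header, and the function names `format_timeline` and `days_ago` are hypothetical, not Engram's actual output.

```python
from datetime import date


def format_timeline(memories, reference_date):
    """Sort (date, text) pairs oldest-first and prefix each with a date tag.

    The reference date goes on the first line so the answerer can compute
    durations against a known "today". Tag format is an assumption.
    """
    ordered = sorted(memories, key=lambda pair: pair[0])
    lines = [f"[{day.isoformat()}] {text}" for day, text in ordered]
    header = f"Reference date: {reference_date.isoformat()}"
    return "\n".join([header, *lines])


def days_ago(event_date, reference_date):
    """Whole days between an event and the question's reference date."""
    return (reference_date - event_date).days


recalled = [
    (date(2024, 3, 1), "Started the new job"),
    (date(2024, 1, 20), "Accepted the offer"),
]
question_date = date(2024, 3, 15)
prompt_context = format_timeline(recalled, question_date)
print(prompt_context)
print(days_ago(date(2024, 3, 1), question_date))
```

Anchoring "how long ago" to the question's reference date, rather than the date the answer is generated, keeps the arithmetic tied to the conversation's own timeline.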
Multi-session — 90.2%
Combine and aggregate facts spread across many separate conversations (“how many projects have I mentioned working on?”). Demands complete cross-session recall plus a counting-aware answerer that de-duplicates repeated mentions and excludes plans the user never acted on.
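The counting behaviour described above (de-duplicate repeated mentions, exclude plans never acted on) can be sketched as a small aggregation step. The `Mention` type, its `acted_on` flag, and the case-insensitive matching rule are hypothetical simplifications for illustration, not Engram's answerer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """Hypothetical record of one item mentioned in one session."""
    item: str
    session_id: str
    acted_on: bool  # False for plans the user never followed through on


def count_distinct(mentions):
    """Count unique items the user acted on, across all sessions.

    Items are normalised (trimmed, lowercased) so repeated mentions in
    different sessions collapse into one.
    """
    acted = {m.item.strip().lower() for m in mentions if m.acted_on}
    return len(acted)


mentions = [
    Mention("Website redesign", "s1", True),
    Mention("website redesign", "s4", True),   # repeat in a later session
    Mention("Mobile app", "s2", True),
    Mention("Podcast", "s3", False),           # planned, never started
]
project_count = count_distinct(mentions)
print(project_count)
```

In this example the repeated "website redesign" is counted once and the unstarted podcast is excluded, so the count is two projects.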
Methodology
Engram is the memory layer
Conversation histories are ingested into Engram. At question time, the agent reads only what Engram retrieves — Engram is responsible for storing the right facts and surfacing them on recall.
Standard dataset and grading
Evaluated against the official LongMemEval 500-question set, with answers graded by the benchmark’s standard LLM judge (the official answer-correctness prompt).
How to read memory benchmarks
We publish the full per-task breakdown and methodology rather than one cherry-picked figure. The numbers above reflect a fixed, documented setup, and the areas where Engram is still improving (multi-session, temporal) are shown plainly rather than averaged away.
Reference
- Paper: LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory — Wu, Wang, Yu, Zhang, Chang, Yu (ICLR 2025), arXiv:2410.10813
- Code & dataset: github.com/xiaowu0162/LongMemEval
See the full comparison
How Engram’s architecture — calibrated confidence, contradiction detection, hybrid retrieval — maps to these results.