Benchmarks - Engram

Engram is evaluated on LongMemEval (Wu et al., ICLR 2025), a widely used benchmark for long-term conversational memory. It poses 500 questions over chat histories that can scale past a million tokens, spanning six task types plus a set of abstention questions, which together cover the core abilities a memory layer must get right.

91.4% overall accuracy: all 500 questions, including abstention.

92.3% task-averaged accuracy: the average across the six task types.

500 questions: histories scalable past 1M tokens, from the ICLR 2025 dataset.

Results by task type

Task type	Score	Correct
Knowledge update	100.0%	72 / 72
Abstention	100.0%	30 / 30
Single-session · user fact	98.4%	63 / 64
Single-session · preference	93.3%	28 / 30
Multi-session	90.2%	109 / 121
Single-session · assistant	89.3%	50 / 56
Temporal reasoning	82.3%	102 / 124
Overall	91.4%	457 / 500

Engram scores highest where reliability matters most for production agents: knowledge updates (100%) and abstention (100%), followed by single-session recall (89–98%). Multi-session aggregation (90.2%) uses a counting-aware answerer that scans every recalled session and de-duplicates instances across conversations. Temporal reasoning (82.3%) uses date-aware answering: the answerer is given the question’s reference date, so “how long ago” questions have an anchor. It is the area with the most remaining headroom.
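The two headline figures can be checked with simple arithmetic. The sketch below is illustrative only: it copies the scores from the table and assumes that the 92.3% figure is the unweighted mean of the six task-type rows other than abstention, which is consistent with the published numbers. The variable names are ours, not part of Engram.

```python
# Per-task scores copied from the results table (percent).
# Assumption: abstention is left out of the task-type average.
scores = {
    "knowledge_update": 100.0,
    "single_session_user": 98.4,
    "single_session_preference": 93.3,
    "multi_session": 90.2,
    "single_session_assistant": 89.3,
    "temporal_reasoning": 82.3,
}

# Overall accuracy pools every question, abstention included.
overall = 100.0 * 457 / 500

# Task-averaged accuracy weights each task type equally,
# regardless of how many questions it contains.
task_average = sum(scores.values()) / len(scores)

print(f"overall:      {overall:.1f}%")
print(f"task average: {task_average:.2f}%")
```

The task average comes out higher than the overall figure because the lowest-scoring type, temporal reasoning, is also the largest, so it weighs more in the pooled number than in the per-type mean.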

What each task type measures

Knowledge update — 100.0%

A fact changes over the course of the history (the user moves city, switches jobs, updates a goal). The agent must answer with the current value and ignore the superseded one. Engram’s confidence and contradiction model is built for this case: newer evidence supersedes older beliefs.
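As a rough illustration of "newer evidence supersedes older beliefs", the sketch below keeps only the most recently observed value for each attribute. It is a toy model under our own assumptions, not Engram's implementation: the `Fact` type and `current_beliefs` function are hypothetical, and a real system would also weigh confidence rather than recency alone.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fact:
    subject: str      # e.g. "user"
    attribute: str    # e.g. "city"
    value: str
    observed: date    # when the statement was made


def current_beliefs(facts: list[Fact]) -> dict[tuple[str, str], Fact]:
    """Keep only the most recently observed value per (subject, attribute)."""
    latest: dict[tuple[str, str], Fact] = {}
    for fact in sorted(facts, key=lambda f: f.observed):
        latest[(fact.subject, fact.attribute)] = fact  # newer overwrites older
    return latest


history = [
    Fact("user", "city", "Berlin", date(2023, 3, 1)),
    Fact("user", "job", "analyst", date(2023, 4, 12)),
    Fact("user", "city", "Lisbon", date(2024, 1, 20)),  # the user moved
]
beliefs = current_beliefs(history)
print(beliefs[("user", "city")].value)  # Lisbon
```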

Abstention — 100.0%

Some questions are unanswerable from the history by design. The agent must recognise this and decline rather than fabricate an answer. Engram surfaces only well-supported memories, which helps keep the agent from inventing facts.
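One simple way to picture this behaviour is a support threshold: if nothing retrieved clears it, the agent declines. The sketch below is a hypothetical illustration. The `Memory` type, the `confidence` score, and the 0.7 threshold are all assumptions made for the example, not documented Engram parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    text: str
    confidence: float  # hypothetical support score in [0, 1]


ABSTAIN = "I don't have enough information to answer that."


def supported(memories: list[Memory], threshold: float = 0.7) -> list[Memory]:
    """Keep only memories whose confidence meets the threshold."""
    return [m for m in memories if m.confidence >= threshold]


def answer_or_abstain(memories: list[Memory], threshold: float = 0.7) -> str:
    """Return the best-supported memory, or decline if none qualifies."""
    kept = supported(memories, threshold)
    if not kept:
        return ABSTAIN
    return max(kept, key=lambda m: m.confidence).text


print(answer_or_abstain([Memory("Flight departs at 6am", 0.92)]))
print(answer_or_abstain([Memory("Might own a cat", 0.31)]))
```

The first call returns the stored fact; the second falls below the threshold and returns the abstention message.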

Single-session · user fact — 98.4%

Recall a specific fact the user stated within a single conversation (“my flight is at 6am”, “my dog’s name is Rex”).

Single-session · preference — 93.3%

Surface a user preference — stated outright (“I prefer aisle seats”) or implied by repeated behaviour.

Single-session · assistant — 89.3%

Recall something the assistant said, recommended, or listed earlier (“what was the 7th restaurant you suggested?”). Requires storing assistant turns, not just user statements.

Temporal reasoning — 82.3%

Reason about when events happened (ordering, duration, and “how long ago”) across the timeline of the conversation history. Engram sorts recalled memories chronologically with [DATE] tags and gives the answerer the question’s reference date, so duration and “how long ago” questions are computed against a fixed anchor rather than guessed.
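The chronological sorting and reference-date anchoring described above might look roughly like the following. This is a minimal sketch: the ISO format inside the date tag, the "Question date" line, and the function names are assumptions for illustration, since the exact prompt layout is not specified here.

```python
from datetime import date


def format_timeline(memories: list[tuple[date, str]]) -> str:
    """Sort memories oldest-first and prefix each with a date tag."""
    ordered = sorted(memories, key=lambda m: m[0])
    return "\n".join(f"[{d.isoformat()}] {text}" for d, text in ordered)


def days_ago(event: date, reference: date) -> int:
    """Days between an event and the question's reference date."""
    return (reference - event).days


memories = [
    (date(2024, 5, 20), "User adopted a dog named Rex."),
    (date(2024, 3, 2), "User started a new job."),
]
reference_date = date(2024, 6, 1)

# The answerer sees the ordered timeline plus the anchor date.
prompt = format_timeline(memories) + f"\nQuestion date: {reference_date.isoformat()}"
print(prompt)
print(days_ago(date(2024, 5, 20), reference_date))  # 12
```

Without the reference date, "how long ago did I adopt Rex?" has no well-defined answer; with it, the duration is plain date arithmetic.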

Multi-session — 90.2%

Combine and aggregate facts spread across many separate conversations (“how many projects have I mentioned working on?”). Demands complete cross-session recall plus a counting-aware answerer that de-duplicates repeated mentions and excludes plans the user never acted on.
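To make "counting-aware" concrete, the sketch below counts distinct items across sessions, merges repeated mentions, and skips plans that were never acted on. It is a simplified illustration with hypothetical names (`Mention`, `acted_on`, `count_distinct`). Engram's answerer is described as LLM-based, so it would not rely on exact string matching as this toy version does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    session_id: str
    item: str          # the thing being counted, e.g. a project name
    acted_on: bool     # False for plans the user never followed through on


def count_distinct(mentions: list[Mention]) -> int:
    """Count unique items across sessions, skipping unrealised plans."""
    seen: set[str] = set()
    for m in mentions:
        if m.acted_on:
            seen.add(m.item.strip().lower())  # naive normalisation
    return len(seen)


mentions = [
    Mention("s1", "Website redesign", True),
    Mention("s4", "website redesign", True),   # repeat in a later session
    Mention("s2", "Mobile app", True),
    Mention("s3", "Podcast", False),           # planned, never started
]
print(count_distinct(mentions))  # 2
```

A naive count over these four mentions would answer 4; de-duplicating and dropping the unrealised plan gives 2.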

Methodology

Engram is the memory layer

Conversation histories are ingested into Engram. At question time, the agent reads only what Engram retrieves — Engram is responsible for storing the right facts and surfacing them on recall.

Standard dataset and grading

Evaluated against the official LongMemEval 500-question set, with answers graded by the benchmark’s standard LLM judge (the official answer-correctness prompt).
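Once the judge has labelled each answer, the per-task table is a straightforward tally. The sketch below assumes each graded record carries `question_id`, `question_type`, and a boolean `correct` field, and that abstention questions are identified by an `_abs` suffix on the id, which is our understanding of the dataset's convention. The records shown are made up for illustration and are not benchmark output.

```python
from collections import defaultdict


def tally(graded: list[dict]) -> dict[str, tuple[int, int]]:
    """Return {task_type: (correct, total)} from judged records."""
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for row in graded:
        # Assumption: abstention items are marked by an "_abs" id suffix
        # and are reported as their own row rather than under their type.
        is_abs = row["question_id"].endswith("_abs")
        task = "abstention" if is_abs else row["question_type"]
        counts[task][0] += int(row["correct"])
        counts[task][1] += 1
    return {task: (c, n) for task, (c, n) in counts.items()}


graded = [  # made-up records for illustration only
    {"question_id": "q1", "question_type": "multi-session", "correct": True},
    {"question_id": "q2", "question_type": "multi-session", "correct": False},
    {"question_id": "q3_abs", "question_type": "multi-session", "correct": True},
]
for task, (correct, total) in tally(graded).items():
    print(f"{task}: {correct} / {total} = {100 * correct / total:.1f}%")
```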

Per-task breakdown published

Every task type is published above with its raw count, rather than as a single aggregate headline, so both the strong and the in-progress areas are visible.

How to read memory benchmarks

Memory-layer benchmark numbers are notoriously setup-dependent. Published LongMemEval scores swing widely with the reader model, retrieval budget, and harness, and memory vendors have publicly disputed each other’s figures on both LongMemEval and LOCOMO. Treat any single headline number — including ours — with healthy skepticism.

That is precisely why we publish the full per-task breakdown and methodology rather than one cherry-picked figure. The numbers above reflect a fixed, documented setup, and the areas where Engram is still improving (multi-session, temporal) are shown plainly rather than averaged away.

Reference

Paper: LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory — Wu, Wang, Yu, Zhang, Chang, Yu (ICLR 2025), arXiv:2410.10813
Code & dataset: github.com/xiaowu0162/LongMemEval

See the full comparison

How Engram’s architecture — calibrated confidence, contradiction detection, hybrid retrieval — maps to these results.
