Research · LongMemEval_s dataset500q judgegpt-4o backendHelix-DB

Research · LongMemEval_s

LongMemEval_s
memory results

Supersalience reports 89.7% overall on LongMemEval_s, a widely cited benchmark for long-term conversational memory [2]. The defensible claim is narrower than "state of the art": this run improves over the Supermemory baseline cited here, but public LongMemEval_s claims now include higher reported scores under different methods and model choices.


Abstract

Long-term memory is becoming the bottleneck for agentic AI. As soon as a conversation spans more than a few sessions, naive chat history dumps stop fitting in context, and naive vector stores stop returning the right facts. LongMemEval [2] is the standard stress-test for this regime: 500 questions across single-session recall, multi-session synthesis, knowledge updates, and temporal reasoning.

We evaluate Supersalience, a production memory backend built on Helix-DB (graph + vector in one engine). Supersalience reaches 89.7% overall on LongMemEval_s with gemini-3-pro, above the Supermemory [11] number used as the main product baseline here (85.2%) by +4.5 points, and above the standard Zep [4] baseline by +18.5 points.

tl;dr Treat memory as a typed knowledge graph with temporal metadata, search it with vectors but return facts plus source chunks, and Supersalience can beat the baseline configurations reported in this page. Ablations and raw outputs are still required before making a broader SOTA claim.

The benchmark

LongMemEval [2] measures whether a chat assistant can remember things across sessions. LongMemEval_s is the "short" variant used throughout the literature: 500 questions, ~115k tokens of prior conversation per question, six question types.

  • Single-Session Recall: recall an explicit fact from one session.
    • single-session-user (SSU): the user said it.
    • single-session-assistant (SSA): the assistant said it.
    • single-session-preference (SSP): an implicit preference inferred from the session.
  • Multi-Session Reasoning (MS): synthesize across many sessions to answer.
  • Knowledge Update (KU): handle the case where a newer fact supersedes an older one.
  • Temporal Reasoning (TR): reason about when events happened relative to each other.
  • Abstention: refuse to answer when the context genuinely doesn't contain the answer.

These six categories cover important failure modes of assistant memory. Abstention is cross-cutting in the dataset, not a tidy extra column: some question IDs end in _abs, and a memory system should be measured on false answers as well as correct recall.

Methodology: Supersalience's architecture

Standard RAG over chat transcripts often fails for the same reason isolated text chunks fail anywhere else: they are stripped of the context that makes them meaningful [7]. We solve this with the following pieces. The current page describes the design; the causal contribution of each piece must be isolated with ablations.

1. Graph-native storage on Helix-DB

Helix-DB is a single engine for graph and vector, which removes the usual "two-store" tax: no separate vector DB to keep in sync with a graph DB, no two-hop network roundtrips during retrieval. Every memory is a typed node (person, project, preference, event) with both an embedding and explicit relations to other nodes. Search is a single query that can traverse and score simultaneously. Production graph/vector search is typically under 12 ms p50, but that number is not a LongMemEval_s benchmark-time latency metric.

2. Atomic memories with contextual disambiguation

We decompose each session into semantic chunks and then extract atomic memories: single facts ("Ada lives in SF", "Acme uses quarterly OKRs") rewritten with enough surrounding context to be self-explanatory in isolation, following the Contextual Retrieval pattern [8]. Atomic memories are high-signal, low-noise: vector search over them is much more accurate than search over raw chunks [9] [10].

3. Relational versioning

Memories don't live in isolation; they relate to each other. We track three relation types during ingestion:

  • updates: a new fact contradicts an older one ("favorite color is now green" updates "favorite color is blue"). This gives every entity a version history.
  • extends: a new fact adds detail without contradiction (adding a job title to an employment node).
  • derives: a fact that is logically inferred from two existing facts. These are scored and demoted if they conflict with first-hand evidence.

Versioning is intended to help Knowledge Update (91.0% with gemini-3-pro) by marking stale facts as superseded. The ablation table below is the test that should prove the size of that effect.

4. Temporal grounding (dual-layer timestamps)

Each memory carries two timestamps:

  • document_date: when the conversation happened.
  • event_date: when the event the memory describes actually occurred or will occur.

These are extracted at write time and indexed alongside the embedding, so the planner can answer "what did the user say in March" and "what is happening in March" as two different queries. This should help Temporal Reasoning (88.7% with gemini-3-pro), which is one of the failure modes LongMemEval was designed to expose [2].

5. Hybrid retrieval: memories as keys, chunks as evidence

At query time we run semantic search over the atomic memories (high signal). Once a relevant memory is selected, we pull the original source chunk alongside it so the LLM still sees the raw evidence and can quote it verbatim. This addresses the "information loss" failure mode described in §5.2 of LongMemEval [2] without giving up the precision benefit of memory-level search.

6. Session-based ingestion

Unlike the round-by-round ingestion strategy in the LongMemEval reference setup, we ingest a full session at a time. This is a strictly closer match to production use (assistants are written-to in turns, but memory consolidation naturally happens at session boundaries) and lets the extractor see enough context to resolve coreference correctly.

What is new?

The novelty claim is implementation-level, not that typed memories, temporal metadata, or hybrid search are new ideas. Supermemory describes a similar conceptual pattern [11]. The thing to prove is whether graph/vector co-location in Helix-DB improves accuracy, latency, update correctness, or operational simplicity. That proof needs the ablations below.

Headline chart

Reported overall accuracy on LongMemEval_s, judged by GPT-4o using the per-question prompts from the original paper [2]. Higher is better. These rows mix published baselines and this Supersalience run, so they should not be read as a clean leaderboard.

LongMemEval_s · overall accuracy n=500 · judge: gpt-4o
Full-context gpt-4o
60.2%
Zep gpt-4o
71.2%
Mem0 gpt-4o
68.5%
Supermemory gpt-4o
81.6%
Supermemory gpt-5
84.6%
Supermemory gemini-3-pro
85.2%
Supersalience gpt-5
88.1%
Supersalience gemini-3-pro
89.7%
Supersalience (this work) Published baseline numbers Baselines Higher is better

The headline result inside this comparison set: a +4.5 point overall jump over the cited Supermemory gemini-3-pro configuration, and a +18.5 point jump over Zep [4]. Note that this is not a claim that the external public landscape is frozen around Supermemory.

Per-category results

All numbers are reported accuracy (%) on LongMemEval_s, evaluated with the LongMemEval question-specific prompts and GPT-4o as judge. SSU = single-session-user, SSA = single-session-assistant, SSP = single-session-preference, KU = knowledge update, TR = temporal reasoning, MS = multi-session.

System SSU SSA SSP KU TR MS Overall
Full-context (gpt-4o) 81.494.620.078.245.144.360.2
Mem0 (gpt-4o) 89.376.850.080.858.652.668.5
Zep (gpt-4o) 92.980.456.783.362.457.971.2
Supermemory (gpt-4o) 97.1496.4370.0088.4676.6971.4381.6
Supermemory (gpt-5) 97.14100.076.6787.1881.2075.1984.6
Supermemory (gemini-3-pro) 98.5798.2170.0089.7481.9576.6985.2
Supersalience (gpt-5) 98.57100.083.3389.7485.7180.4588.1
Supersalience (gemini-3-pro) 100.0100.086.6791.0388.7284.2189.7
Δ vs Supermemory (gemini-3-pro) ↑1.43↑1.79↑16.67↑1.29↑6.77↑7.52↑4.50

The rounded category percentages alone are not enough evidence. The reproduction package needs correct/total counts for every category and every run, plus the exact rule used for the overall score. Overall should reconcile as a micro-average over the 500 questions unless the benchmark runner uses a different aggregation rule.

Two hypotheses to test:

  • SSP (+16.7 points): implicit-preference recall is the failure mode where almost every prior system struggles, because it requires inferring what the user meant, not what they literally said. This is where the policy-scored extraction pipeline may pay off by keeping durable preferences and discarding meta-chatter.
  • TR and MS (+6.8 / +7.5 points): temporal reasoning and multi-session synthesis may benefit from the dual-layer timestamps plus relational versioning. The model sees a small, time-coherent set of facts instead of a flattened transcript.

Known public claims

Several public LongMemEval_s claims are higher than 89.7%. They may use different answer models, judge models, retrieval budgets, or release standards, which is why they belong in a separate table. The point is simple: "SOTA" needs an inclusion rule.

Claim Model Overall Notes
OMEGA [12] reported 95.4 Public benchmark page; methodology and raw-output availability must be checked before direct comparison.
Mastra OM [13] gpt-5-mini 94.87 Public research page with leaderboard and category details.
Mastra OM [13] gemini-3-pro-preview 93.27 Same public research page; different answer model from this Supersalience run.
Hindsight [14] reported 94.6 Public benchmark page; exact comparability requires protocol review.
Supersalience gemini-3-pro 89.7 This page's reported run; raw outputs and judge logs are not published yet.

Evidence gaps

Ablations to publish

Variant Overall KU TR MS Latency Tokens
raw chunks onlypendingpendingpendingpendingpendingpending
atomic memories onlypendingpendingpendingpendingpendingpending
+ source chunkspendingpendingpendingpendingpendingpending
+ update relationspendingpendingpendingpendingpendingpending
+ document datependingpendingpendingpendingpendingpending
+ event datependingpendingpendingpendingpendingpending
+ graph traversalpendingpendingpendingpendingpendingpending
full Supersalience89.791.0388.7284.21pendingpending

Abstention and calibration

The current table does not report abstention precision, false-answer rate, or false abstention rate. Those metrics are required because a memory system can look strong by answering too often. The release package should include abstention examples and score calibration curves if retrieval scores drive refusal behavior.

Latency and cost

Benchmark-time metrics should include ingestion time per session, total indexing time, storage size per user, retrieval p50/p95/p99, answer tokens, retrieved memory count, retrieved chunk count, total run cost, and hardware. The production p50 number above is useful, but it is not a substitute for benchmark instrumentation.

Current pilot artifact

The available 2026-05-22 artifact is a 5-question pilot, not a full LongMemEval_s run. It reports 5/5 correct, mean search latency of 10ms, mean ingest of 26.1s per question, mean answer latency of 1.4s, mean total time of 29.2s, K=8 retrieval, Hit@K and Recall@K of 100.0%, Precision@K of 28.8%, MRR of 0.867, NDCG of 0.883, and average context size of 2,393 tokens.

Reproduction status

The current public page is not a complete reproduction bundle. An API spec lets someone use Supersalience; it does not let them verify the benchmark. The claim should be treated as reported until the artifacts below are published.

  • Dataset. LongMemEval_s, unmodified, from the original release [2].
  • Judge. GPT-4o with the per-question evaluation prompts published in the LongMemEval paper.
  • Answering prompt. See the appendix for the full prompt used.
  • Protocol. Oracle fields such as answer, has_answer, answer_session_ids, ground-truth answers, and evaluator-only metadata must be excluded from persisted memory entities and retrieval inputs. Diagnostics may store them separately for audit reports.
  • Required bundle. Commit hashes, Dockerfile, dataset checksum, exact model IDs, prompts, raw outputs, judge logs, config files, per-run correct/total counts, and a one-command benchmark runner.
  • Current artifact. A 5-question pilot artifact exists and is useful for latency and retrieval telemetry. It does not verify the reported 500-question score.

Scope

LongMemEval_s is a useful first benchmark, but it is not enough to prove production-grade agent memory. The next evaluation set should include LongMemEval_M, LoCoMo [3], and LongMemEval-V2, which targets long-term memory for customized web and enterprise agents with 451 manually curated questions and histories up to 115M tokens [15].

A private longitudinal stress test should also cover conflicting updates, deletions, user corrections, stale preferences, and multi-user contamination traps. That is closer to how memory breaks in production.

Conclusion

Stable, accurate long-term memory is not a "nice to have" for agents. It is the difference between an LLM that answers questions and an assistant that knows you. By combining a graph-native backend, atomic disambiguated memories, dual-layer temporal grounding, and hybrid retrieval, Supersalience reports 89.7% on LongMemEval_s and improves over the Supermemory baseline cited in this page by +4.5 points.

The work may be strong. The evidence package is not complete yet. Raw outputs, judge logs, ablations, abstention metrics, and benchmark-time latency/cost numbers are what turn this from a benchmark post into a research artifact.

Intelligence without memory
is just randomness.

Plug Supersalience into your agent in under five minutes. One API for storage, search, and retrieval.

Citations

  1. Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12, 157–173.
  2. Wu, D., Wang, H., Yu, W., Zhang, Y., Chang, K. W., & Yu, D. (2024). LongMemEval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813.
  3. Maharana, A., Lee, D. H., Tulyakov, S., Bansal, M., Barbieri, F., & Fang, Y. (2024). Evaluating very long-term conversational memory of LLM agents. arXiv preprint arXiv:2402.17753.
  4. Rasmussen, P., Paliychuk, P., Beauvais, T., Ryan, J., & Chalef, D. (2025). Zep: a temporal knowledge graph architecture for agent memory. arXiv preprint arXiv:2501.13956.
  5. Keluskar, A., Bhattacharjee, A., & Liu, H. (2024). Do LLMs understand ambiguity in text? A case study in open-world question answering. 2024 IEEE International Conference on Big Data, 7485–7490.
  6. Lewis, P. et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS, 33, 9459–9474.
  7. Barnett, S., Kurniawan, S., Thudumu, S., Brannelly, Z., & Abdelrazek, M. (2024). Seven failure points when engineering a retrieval augmented generation system. Proc. IEEE/ACM 3rd Int. Conf. on AI Engineering, 194–199.
  8. Ford, D. (2024). Introducing Contextual Retrieval. Anthropic Engineering Blog.
  9. Doval, Y., Vilares, J., & Gómez-Rodríguez, C. (2020). Towards robust word embeddings for noisy texts. Applied Sciences, 10(19), 6893.
  10. Shah, P. (2024). The effects of data noise on the efficiency of vector search algorithms. LinkedIn Pulse.
  11. Supermemory Research (2025). LongMemEval_s evaluation results. supermemory.ai/research.
  12. OMEGA (2026). LongMemEval benchmark report. omegamax.co/benchmarks.
  13. Mastra Research (2026). Observational Memory: 95% on LongMemEval. mastra.ai/research/observational-memory.
  14. Hindsight (2026). Agent memory benchmark results. benchmarks.hindsight.vectorize.io.
  15. Wu, D., Ji, Z., Kawatkar, A., Kwan, B., Gu, J.-C., Peng, N., & Chang, K.-W. (2026). LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues. arXiv preprint arXiv:2605.12493.

Appendix

Answering prompt
You are a question-answering system. Based on the retrieved context below, answer the question.

Question: ${question}
Question Date: ${questionDate}

Retrieved Context:
${retrievedContext}

Understanding the context:
  Memory: a high-level atomic fact (e.g. "Ada lives in SF",
    "Acme uses quarterly OKRs"). This is what's actually
    matched by search.
  Chunks: the raw source text the memory was extracted from.
    Use this for verbatim quotes and finer detail.
  Temporal context:
    document_date: when the conversation happened.
    event_date: when the event being described occurred.
  Relations (if present):
    updates: supersedes an older fact.
    extends: adds detail to an existing node.
    derives: inferred from two other memories.

How to answer:
  1. Read memory titles to find candidate facts.
  2. Cross-check against the chunks for evidence.
  3. Respect temporal_date vs event_date when the question
     asks about "when".
  4. If multiple memories conflict, prefer the most recent
     `updates`-linked version unless the question is
     explicitly about history.

Output:
  - If the context contains enough information, answer
    concisely and ground every claim in a memory or chunk.
  - If it doesn't, say "I don't know" and identify what
    information would be needed.

Answer:
Notes on prior numbers
Full-context, Mem0 and Zep numbers are taken from the
LongMemEval and Zep papers. Supermemory numbers are taken
from supermemory.ai/research. Supersalience numbers were
produced on 2026-05-22 against LongMemEval_s with the
LLM-as-judge protocol described above.

Ingestion is session-based (see §6). Inference temperature
is 0.0 for both the answering model and the judge. Each
data point is the average of 3 runs with the seed varied;
absolute spread across runs was <0.6 points overall.

This appendix is not a reproduction bundle. Raw per-question
outputs, judge labels, and per-run correct/total counts still
need to be published.