If you’re building AI agents that remember things across conversations, you need a way to measure whether your memory system actually works. LongMemEval is the benchmark for that.
It tests a simple question: when an AI agent has seen 10,000+ conversation turns across hundreds of chat sessions, can it find the right memory when asked?
The benchmark contains 500 questions across 940 chat sessions (10,866 conversation turns). Each question has a known “evidence” — a specific thing the user said earlier that answers the question. The test measures whether your memory system can surface that evidence in its top results.
The primary metric is Recall@5 (R@5): did the correct evidence appear in the top 5 results? If your memory system returns 5 memories and one of them is the right one, that’s a hit.
Why top 5? Because in practice, an AI agent has limited context window space. It can’t read 100 memories — it needs the right one in the first few results.
The benchmark also measures R@10 and R@20 to show how close you are even when you miss — if the evidence is at position 8, that’s a “near miss” that could be fixed with better ranking, as opposed to a true gap where the evidence wasn’t found at all.
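The metric is simple enough to sketch in a few lines. This is an illustrative implementation, not MentisDB's actual evaluation harness; the function and data names are made up for the example:

```rust
/// Recall@k: the fraction of questions whose gold evidence id appears
/// in the top-k retrieved ids. Each entry pairs a ranked list of
/// retrieved memory ids with the known evidence id for that question.
fn recall_at_k(results: &[(Vec<u32>, u32)], k: usize) -> f64 {
    let hits = results
        .iter()
        .filter(|(retrieved, evidence)| {
            retrieved.iter().take(k).any(|id| id == evidence)
        })
        .count();
    hits as f64 / results.len() as f64
}

fn main() {
    // Two questions: evidence 7 is found at rank 2 (a hit at k=5),
    // evidence 9 at rank 8 (a "near miss": outside top 5, inside top 10).
    let runs = vec![
        (vec![3, 7, 1, 4, 2, 5, 6, 8, 10, 11], 7u32),
        (vec![1, 2, 3, 4, 5, 6, 12, 9, 13, 14], 9u32),
    ];
    println!("R@5  = {}", recall_at_k(&runs, 5));
    println!("R@10 = {}", recall_at_k(&runs, 10));
}
```

The second question is exactly the ranking-fix case described above: raising evidence 9 from rank 8 into the top 5 turns a near miss into a hit.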
LongMemEval isn’t one test — it bundles six kinds of memory challenges:

- Knowledge Update: a fact the user stated was later revised; the question targets the current version.
- Single-Session-Assistant: recalling something the assistant said (e.g. advice) within one session.
- Single-Session-User: recalling a fact the user stated within one session.
- Temporal Reasoning: answers that depend on when something was said.
- Multi-Session: evidence spread across separate conversations, sometimes days apart.
- Single-Session-Preference: preferences the user revealed only implicitly, in passing.

Each type tests a different aspect of memory retrieval, and they’re not equally hard.
We ran LongMemEval against MentisDB 0.8.0 with three retrieval signals working together: lexical search (BM25 with Porter stemming), semantic search (local 384-dimension MiniLM embeddings), and graph expansion (following links between related thoughts).
| Question Type | R@5 | R@10 | R@20 |
|---|---|---|---|
| Overall | 65.0% | 72.2% | 79.0% |
| Knowledge Update | 82.1% | 84.6% | 88.5% |
| Single-Session-Assistant | 83.9% | 85.7% | 91.1% |
| Single-Session-User | 70.0% | 72.9% | 80.0% |
| Temporal Reasoning | 61.7% | 68.4% | 73.7% |
| Multi-Session | 59.4% | 69.2% | 76.7% |
| Single-Session-Preference | 13.3% | 16.7% | 23.3% |
Knowledge updates (82.1%) and assistant recall (83.9%) are our strongest categories. When a user explicitly states a fact or the assistant gives advice, the word overlap between query and evidence is strong enough for BM25 to find it reliably. Porter stemming helped a lot here — “updated” now matches “updates”, “running” matches “runs”, etc.
User facts (70.0%) and temporal reasoning (61.7%) benefited most from our retrieval changes. These categories gain from two things: stemming (which catches word variants) and importance-weighted scoring (which boosts user-originated content over verbose assistant responses).
Single-session-preference (13.3%) is our weakest category by far, and it reveals a fundamental limitation of our current approach.
The problem: preference questions ask things like “What kind of food do I like?” but the evidence is something like “I’ve been really into Thai cuisine lately — do you know any good pad thai recipes?” There’s almost no word overlap between the query and the evidence. BM25 can’t bridge that gap because “food” ≠ “Thai cuisine” and “like” ≠ “into”.
Our semantic embeddings (384-dimension MiniLM) are supposed to handle this, but they’re not strong enough to reliably rank these semantic matches above all the other conversation turns that mention food-related words. At R@20, preference only reaches 23.3% — meaning 77% of the time, the evidence isn’t even in the top 20 results.
This is a known limitation for embedding-based retrieval on short, implicit-evidence queries. Better embeddings (larger models, fine-tuned models) or a reranking step would help, but that’s a future improvement.
Multi-session (59.4%) is our second-weakest category. These questions require finding information that was mentioned across different conversations, sometimes days apart. Graph expansion helps a bit (following links between related thoughts), but the core challenge is that the evidence might be a single sentence buried in a 30-turn conversation about something else entirely.
We didn’t start at 65%. Our first run scored 57.2% R@5 with plain BM25 + vector search. Here’s what we changed and what happened:
The biggest single improvement came from adding Porter stemming to our lexical tokenizer. Instead of indexing “preferences” and querying “prefer”, both now normalize to “prefer” and match. This bumped overall R@5 from 57.2% to 61.6% — a 4.4 percentage point gain from one change.
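The effect is easy to demonstrate with a toy suffix-stripper. This is a deliberately simplified stand-in, not the real Porter algorithm (in Rust you would typically use a full Snowball/Porter implementation such as the `rust-stemmers` crate), but it shows why word variants start colliding on one index token:

```rust
/// Toy stemmer: strips a few common English suffixes, keeping at least
/// a 3-character stem. Illustration only — the real Porter algorithm
/// has many more rules (consonant doubling, measure conditions, etc.).
fn toy_stem(word: &str) -> String {
    let w = word.to_lowercase();
    for suffix in ["ences", "ence", "ing", "ed", "es", "s"] {
        if let Some(stripped) = w.strip_suffix(suffix) {
            if stripped.len() >= 3 {
                return stripped.to_string();
            }
        }
    }
    w
}

fn main() {
    // The indexed form and the queried form now normalize to one token.
    println!("{} / {}", toy_stem("preferences"), toy_stem("prefer"));
    // Likewise "updated" and "updates" collapse to the same stem.
    println!("{} / {}", toy_stem("updated"), toy_stem("updates"));
}
```

Both sides of the index benefit: documents are stemmed at ingestion time and queries at search time, so the match happens in stem space.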
The gains were biggest for temporal reasoning (+9.0 points) and user facts (+8.5 points), where questions often use different word forms than the original evidence.
Our original scoring just added lexical and vector scores together. The problem: vector scores (0.0–0.35 range) were drowned out by BM25 scores (0–30+ range). Semantically-relevant thoughts with zero word overlap never surfaced.
We replaced this with a tiered boost: when a thought has no lexical match at all, its vector score gets a 60× boost so it can compete with BM25 hits. When there’s a weak lexical match, it gets a partial 20× boost. When BM25 is strong, vector contributes as a small additive signal without disrupting the ranking.
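The tiered boost can be sketched as a small scoring function. The 60× and 20× boosts come from the description above; the lexical-match thresholds and the exact shape of the ramp are assumptions for illustration:

```rust
/// Tiered fusion of a BM25 score (roughly 0–30+) and a vector
/// similarity (roughly 0.0–0.35). Thresholds here are illustrative.
fn fuse(bm25: f64, vector: f64) -> f64 {
    if bm25 == 0.0 {
        // No lexical match at all: boost the vector score into
        // BM25's range so semantic-only hits can compete.
        vector * 60.0
    } else if bm25 < 2.0 {
        // Weak lexical match: partial boost for the vector signal.
        bm25 + vector * 20.0
    } else {
        // Strong BM25 hit: vector is only a small additive nudge,
        // so it never re-ranks what lexical search already found.
        bm25 + vector
    }
}

fn main() {
    // A semantic-only hit (no word overlap, vector 0.30) now outranks
    // a weak lexical hit...
    let semantic_only = fuse(0.0, 0.30); // 0.30 * 60 = 18.0
    let weak_lexical = fuse(1.5, 0.05);  // 1.5 + 1.0 = 2.5
    assert!(semantic_only > weak_lexical);
    // ...while a genuinely strong BM25 hit keeps its lead.
    assert!(fuse(25.0, 0.1) > semantic_only);
    println!("semantic-only: {semantic_only}, weak lexical: {weak_lexical}");
}
```

The key property is asymmetry: vector similarity can promote candidates BM25 misses entirely, but it can never demote a strong BM25 hit.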
This took us from 61.6% to 65.0% — another 3.4 point gain.
Before the tiered boost, we tried Reciprocal Rank Fusion (RRF), the standard approach in hybrid search. It fuses BM25 and vector rankings by assigning each document a 1/(K + rank) score from each list and summing them.
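A minimal RRF implementation makes the failure mode visible. The ids below are made up; K = 60 is the conventional constant:

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: each document's fused score is the sum of
/// 1/(K + rank) over every ranked list it appears in (ranks are 1-based).
fn rrf(lists: &[Vec<u32>], k: f64) -> Vec<(u32, f64)> {
    let mut scores: HashMap<u32, f64> = HashMap::new();
    for list in lists {
        for (i, id) in list.iter().enumerate() {
            *scores.entry(*id).or_insert(0.0) += 1.0 / (k + (i as f64 + 1.0));
        }
    }
    let mut fused: Vec<(u32, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    let bm25_ranking = vec![7, 3, 9];   // doc 7: a strong lexical hit at rank 1
    let vector_ranking = vec![9, 4, 7]; // doc 9: a (possibly wrong) semantic hit at rank 1
    let fused = rrf(&[bm25_ranking, vector_ranking], 60.0);
    // Docs 7 and 9 end up with identical fused scores: RRF grants the
    // vector list the same authority as BM25, which is exactly how a
    // strong lexical hit gets demoted by a wrong semantic neighbor.
    println!("{:?}", fused);
}
```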
RRF actually hurt our results. It demoted strong BM25 hits for temporal and factual questions because vector matches (often wrong ones) got equal weight in the fusion. For questions where BM25 already finds the answer, re-ranking by vector similarity makes things worse, not better.
The lesson: for memory retrieval where BM25 is strong, don’t re-rank what’s already working. Only use vector to promote candidates that BM25 can’t find at all.
LongMemEval is becoming a standard retrieval benchmark, and several systems have published results. Here’s how MentisDB stacks up:
| System | Benchmark | Metric | Score | Storage | Embeddings |
|---|---|---|---|---|---|
| MentisDB | LongMemEval (500 q, 10.8k turns) | R@5 | 65.0% | Turn-level | MiniLM 384d (local) |
| MemX | LongMemEval (500 q, 220k facts) | Hit@5 | 51.6% | Fact-level (LLM-extracted) | Qwen3-0.6B 1024d |
| MemX | LongMemEval (500 q, 100k rounds) | Hit@5 | 27.0% | Round-level | Qwen3-0.6B 1024d |
| MemX | LongMemEval (500 q, 19k sessions) | Hit@5 | 24.6% | Session-level | Qwen3-0.6B 1024d |
| Mem0 | LOCOMO (different benchmark) | LLM-Judge | 66.9% | LLM-extracted facts | Cloud API |
| Mem0ᵍ | LOCOMO (different benchmark) | LLM-Judge | 68.4% | LLM-extracted + graph | Cloud API |
| OpenAI Memory | LOCOMO (different benchmark) | LLM-Judge | 52.9% | Proprietary | Proprietary |
MemX is the closest comparison — also Rust, also local-first, also hybrid retrieval with RRF. At fact-level granularity (220k records, each an LLM-extracted atomic fact), MemX achieves 51.6% Hit@5. MentisDB achieves 65.0% R@5 with simpler turn-level storage (10.8k records, no LLM fact extraction).
Why does MentisDB outperform MemX despite simpler storage and smaller embeddings? Two reasons:
Tiered fusion beats RRF. MemX uses standard Reciprocal Rank Fusion to combine vector and keyword results. We tried RRF and it hurt — it demoted strong BM25 hits by giving vector matches equal weight. Our tiered boost approach (60× when no lexical match, 20× ramp for weak matches, additive otherwise) preserves BM25’s strengths while still surfacing semantic-only hits.
Importance-weighted scoring. MemX’s four-factor reranker weights importance at 10%. We found that user-originated content needs a much stronger signal to compete with verbose assistant responses — our 3.0× importance multiplier gives user thoughts +2.4 vs assistant +0.6, which tips close BM25 races.
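The 3.0× multiplier and the resulting +2.4 / +0.6 boosts come from the text above; the base importance values in this sketch (0.8 for user turns, 0.2 for assistant turns) are back-solved assumptions that reproduce those figures, not confirmed internals:

```rust
#[derive(Clone, Copy)]
enum Role {
    User,
    Assistant,
}

/// Additive importance boost applied on top of the fused retrieval score.
/// Base importance values are assumptions; the 3.0x multiplier and the
/// resulting +2.4 (user) vs +0.6 (assistant) match the write-up.
fn importance_boost(role: Role) -> f64 {
    let base = match role {
        Role::User => 0.8,      // user-originated facts matter most
        Role::Assistant => 0.2, // verbose assistant turns weigh less
    };
    base * 3.0
}

fn main() {
    // In a close BM25 race (10.1 vs 10.4), the boost tips the ranking
    // toward the user's own statement.
    let user_score = 10.1 + importance_boost(Role::User); // 12.5
    let assistant_score = 10.4 + importance_boost(Role::Assistant); // 11.0
    assert!(user_score > assistant_score);
    println!("user={user_score} assistant={assistant_score}");
}
```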
Mem0 isn’t directly comparable — they benchmark on LOCOMO (a different dataset) and measure end-to-end answer quality with an LLM judge, not raw retrieval recall. Mem0 also uses cloud LLMs for fact extraction and graph construction, which gives them better multi-session reasoning at higher cost and complexity.
The fair comparison is on the retrieval layer: both systems store compressed memory representations and retrieve them for an LLM to reason over. MentisDB is fully local with no cloud dependencies; Mem0 requires cloud APIs for its extraction pipeline.
MemX’s most striking finding is that storage granularity matters more than any pipeline tweak — fact-level storage doubles Hit@5 vs session-level. This suggests a clear improvement path for MentisDB: adding optional LLM-based fact extraction could significantly boost recall, especially for multi-session and preference queries where the evidence is buried in long conversations.
If you’re building AI agents that need to remember things, retrieval quality directly determines agent intelligence. An agent that can’t find the right memory is no smarter than one with no memory at all.
65% R@5 means that two out of three times, the correct memory is in the first five results the agent sees. At R@20, it’s nearly 80%. For a fully local, no-cloud-dependency system running on a laptop with a 384-dimension embedding model, that’s solid.
But more importantly, the architecture is improvable. Every component — stemming, scoring weights, embedding model, graph expansion — is a knob that can be turned. We’ve shown that systematic benchmarking and targeted improvements can move the needle significantly. The 13.3% preference score is a clear target for the next iteration.
All benchmarks were run on:

- MentisDB 0.8.0 with the `local-embeddings` feature
- FastEmbed all-MiniLM-L6-v2 (384d, ONNX, runs locally)
- 10,866 conversation turns from 940 sessions
- 500 evaluation questions across 6 types
- Ingestion: ~300 turns/sec
- Vector rebuild: ~3–5 minutes for 10k turns
- Evaluation: ~7 queries/sec
The benchmark is reproducible — run `bash lme-benches/full_bench.sh` from the MentisDB repository.
MentisDB is an open-source durable memory layer for AI agents. It stores memories in an append-only chain, retrieves them with hybrid lexical+semantic+graph search, and runs entirely locally with no cloud dependencies.