If you’re building AI agents that remember things across conversations, you need a way to measure whether your memory system actually works. LongMemEval is the benchmark for that.
It tests a simple question: when an AI agent has seen 10,000+ conversation turns across hundreds of chat sessions, can it find the right memory when asked?
The benchmark contains 500 questions across 940 chat sessions (10,866 conversation turns). Each question has a known “evidence” — a specific thing the user said earlier that answers the question. The test measures whether your memory system can surface that evidence in its top results.
The primary metric is Recall@5 (R@5): did the correct evidence appear in the top 5 results? If your memory system returns 5 memories and one of them is the right one, that’s a hit.
Why top 5? Because in practice, an AI agent has limited context window space. It can’t read 100 memories — it needs the right one in the first few results.
The benchmark also measures R@10 and R@20 to show how close you are even when you miss — if the evidence is at position 8, that’s a “near miss” that could be fixed with better ranking, as opposed to a true gap where the evidence wasn’t found at all.
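The metric is simple enough to sketch in a few lines. This is an illustrative implementation, not MentisDB's actual evaluation harness; the function and data names are made up for the example:

```rust
/// Recall@k: the fraction of questions whose gold evidence id appears
/// in the top-k retrieved ids. Each entry pairs a ranked list of
/// retrieved memory ids with the known evidence id for that question.
fn recall_at_k(results: &[(Vec<u32>, u32)], k: usize) -> f64 {
    let hits = results
        .iter()
        .filter(|(retrieved, evidence)| {
            retrieved.iter().take(k).any(|id| id == evidence)
        })
        .count();
    hits as f64 / results.len() as f64
}

fn main() {
    // Two questions: evidence 7 is found at rank 2 (a hit at k=5),
    // evidence 9 at rank 8 (a "near miss": outside top 5, inside top 10).
    let runs = vec![
        (vec![3, 7, 1, 4, 2, 5, 6, 8, 10, 11], 7u32),
        (vec![1, 2, 3, 4, 5, 6, 12, 9, 13, 14], 9u32),
    ];
    println!("R@5  = {}", recall_at_k(&runs, 5));
    println!("R@10 = {}", recall_at_k(&runs, 10));
}
```

The second question is exactly the ranking-fix case described above: raising evidence 9 from rank 8 into the top 5 turns a near miss into a hit.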
LongMemEval isn’t one test — it bundles six kinds of memory challenges:

- Knowledge Update: a fact the user stated was later revised; the question targets the current version.
- Single-Session-Assistant: recalling something the assistant said (e.g. advice) within one session.
- Single-Session-User: recalling a fact the user stated within one session.
- Temporal Reasoning: answers that depend on when something was said.
- Multi-Session: evidence spread across separate conversations, sometimes days apart.
- Single-Session-Preference: preferences the user revealed only implicitly, in passing.

Each type tests a different aspect of memory retrieval, and they’re not equally hard.
We ran LongMemEval against MentisDB 0.8.0 with three retrieval signals working together: lexical search (BM25 with Porter stemming), semantic search (local 384-dimension MiniLM embeddings), and graph expansion (following links between related thoughts).
| Question Type | R@5 | R@10 | R@20 |
|---|---|---|---|
| Overall | 65.0% | 72.2% | 79.0% |
| Knowledge Update | 82.1% | 84.6% | 88.5% |
| Single-Session-Assistant | 83.9% | 85.7% | 91.1% |
| Single-Session-User | 70.0% | 72.9% | 80.0% |
| Temporal Reasoning | 61.7% | 68.4% | 73.7% |
| Multi-Session | 59.4% | 69.2% | 76.7% |
| Single-Session-Preference | 13.3% | 16.7% | 23.3% |
Knowledge updates (82.1%) and assistant recall (83.9%) are our strongest categories. When a user explicitly states a fact or the assistant gives advice, the word overlap between query and evidence is strong enough for BM25 to find it reliably. Porter stemming helped a lot here — “updated” now matches “updates”, “running” matches “runs”, etc.
User facts (70.0%) and temporal reasoning (61.7%) benefited most from our retrieval changes. These categories gain from two things: stemming (which catches word variants) and importance-weighted scoring (which boosts user-originated content over verbose assistant responses).
Single-session-preference (13.3%) is our weakest category by far, and it reveals a fundamental limitation of our current approach.
The problem: preference questions ask things like “What kind of food do I like?” but the evidence is something like “I’ve been really into Thai cuisine lately — do you know any good pad thai recipes?” There’s almost no word overlap between the query and the evidence. BM25 can’t bridge that gap because “food” ≠ “Thai cuisine” and “like” ≠ “into”.
Our semantic embeddings (384-dimension MiniLM) are supposed to handle this, but they’re not strong enough to reliably rank these semantic matches above all the other conversation turns that mention food-related words. At R@20, preference only reaches 23.3% — meaning 77% of the time, the evidence isn’t even in the top 20 results.
This is a known limitation for embedding-based retrieval on short, implicit-evidence queries. Better embeddings (larger models, fine-tuned models) or a reranking step would help, but that’s a future improvement.
Multi-session (59.4%) is our second-weakest category. These questions require finding information that was mentioned across different conversations, sometimes days apart. Graph expansion helps a bit (following links between related thoughts), but the core challenge is that the evidence might be a single sentence buried in a 30-turn conversation about something else entirely.
We didn’t start at 65%. Our first run scored 57.2% R@5 with plain BM25 + vector search. Here’s what we changed and what happened:
The biggest single improvement came from adding Porter stemming to our lexical tokenizer. Instead of indexing “preferences” and querying “prefer”, both now normalize to “prefer” and match. This bumped overall R@5 from 57.2% to 61.6% — a 4.4 percentage point gain from one change.
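The effect is easy to demonstrate with a toy suffix-stripper. This is a deliberately simplified stand-in, not the real Porter algorithm (in Rust you would typically use a full Snowball/Porter implementation such as the `rust-stemmers` crate), but it shows why word variants start colliding on one index token:

```rust
/// Toy stemmer: strips a few common English suffixes, keeping at least
/// a 3-character stem. Illustration only — the real Porter algorithm
/// has many more rules (consonant doubling, measure conditions, etc.).
fn toy_stem(word: &str) -> String {
    let w = word.to_lowercase();
    for suffix in ["ences", "ence", "ing", "ed", "es", "s"] {
        if let Some(stripped) = w.strip_suffix(suffix) {
            if stripped.len() >= 3 {
                return stripped.to_string();
            }
        }
    }
    w
}

fn main() {
    // The indexed form and the queried form now normalize to one token.
    println!("{} / {}", toy_stem("preferences"), toy_stem("prefer"));
    // Likewise "updated" and "updates" collapse to the same stem.
    println!("{} / {}", toy_stem("updated"), toy_stem("updates"));
}
```

Both sides of the index benefit: documents are stemmed at ingestion time and queries at search time, so the match happens in stem space.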
The gains were biggest for temporal reasoning (+9.0 points) and user facts (+8.5 points), where questions often use different word forms than the original evidence.
Our original scoring just added lexical and vector scores together. The problem: vector scores (0.0–0.35 range) were drowned out by BM25 scores (0–30+ range). Semantically-relevant thoughts with zero word overlap never surfaced.
We replaced this with a tiered boost: when a thought has no lexical match at all, its vector score gets a 60× boost so it can compete with BM25 hits. When there’s a weak lexical match, it gets a partial 20× boost. When BM25 is strong, vector contributes as a small additive signal without disrupting the ranking.
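The tiered boost can be sketched as a small scoring function. The 60× and 20× boosts come from the description above; the lexical-match thresholds and the exact shape of the ramp are assumptions for illustration:

```rust
/// Tiered fusion of a BM25 score (roughly 0–30+) and a vector
/// similarity (roughly 0.0–0.35). Thresholds here are illustrative.
fn fuse(bm25: f64, vector: f64) -> f64 {
    if bm25 == 0.0 {
        // No lexical match at all: boost the vector score into
        // BM25's range so semantic-only hits can compete.
        vector * 60.0
    } else if bm25 < 2.0 {
        // Weak lexical match: partial boost for the vector signal.
        bm25 + vector * 20.0
    } else {
        // Strong BM25 hit: vector is only a small additive nudge,
        // so it never re-ranks what lexical search already found.
        bm25 + vector
    }
}

fn main() {
    // A semantic-only hit (no word overlap, vector 0.30) now outranks
    // a weak lexical hit...
    let semantic_only = fuse(0.0, 0.30); // 0.30 * 60 = 18.0
    let weak_lexical = fuse(1.5, 0.05);  // 1.5 + 1.0 = 2.5
    assert!(semantic_only > weak_lexical);
    // ...while a genuinely strong BM25 hit keeps its lead.
    assert!(fuse(25.0, 0.1) > semantic_only);
    println!("semantic-only: {semantic_only}, weak lexical: {weak_lexical}");
}
```

The key property is asymmetry: vector similarity can promote candidates BM25 misses entirely, but it can never demote a strong BM25 hit.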
This took us from 61.6% to 65.0% — another 3.4 point gain.
Before the tiered boost, we tried Reciprocal Rank Fusion (RRF), the standard approach in hybrid search. It fuses BM25 and vector rankings by assigning each document a 1/(K + rank) score from each list and summing them.
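A minimal RRF implementation makes the failure mode visible. The ids below are made up; K = 60 is the conventional constant:

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: each document's fused score is the sum of
/// 1/(K + rank) over every ranked list it appears in (ranks are 1-based).
fn rrf(lists: &[Vec<u32>], k: f64) -> Vec<(u32, f64)> {
    let mut scores: HashMap<u32, f64> = HashMap::new();
    for list in lists {
        for (i, id) in list.iter().enumerate() {
            *scores.entry(*id).or_insert(0.0) += 1.0 / (k + (i as f64 + 1.0));
        }
    }
    let mut fused: Vec<(u32, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    let bm25_ranking = vec![7, 3, 9];   // doc 7: a strong lexical hit at rank 1
    let vector_ranking = vec![9, 4, 7]; // doc 9: a (possibly wrong) semantic hit at rank 1
    let fused = rrf(&[bm25_ranking, vector_ranking], 60.0);
    // Docs 7 and 9 end up with identical fused scores: RRF grants the
    // vector list the same authority as BM25, which is exactly how a
    // strong lexical hit gets demoted by a wrong semantic neighbor.
    println!("{:?}", fused);
}
```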
RRF actually hurt our results. It demoted strong BM25 hits for temporal and factual questions because vector matches (often wrong ones) got equal weight in the fusion. For questions where BM25 already finds the answer, re-ranking by vector similarity makes things worse, not better.
The lesson: for memory retrieval where BM25 is strong, don’t re-rank what’s already working. Only use vector to promote candidates that BM25 can’t find at all.
LongMemEval is becoming a standard retrieval benchmark, and several systems have published results. Here’s how MentisDB stacks up:
| System | Benchmark | Metric | Score | Storage | Embeddings |
|---|---|---|---|---|---|
| MentisDB | LongMemEval (500 q, 10.8k turns) | R@5 | 65.0% | Turn-level | MiniLM 384d (local) |
| MemX | LongMemEval (500 q, 220k facts) | Hit@5 | 51.6% | Fact-level (LLM-extracted) | Qwen3-0.6B 1024d |
| MemX | LongMemEval (500 q, 100k rounds) | Hit@5 | 27.0% | Round-level | Qwen3-0.6B 1024d |
| MemX | LongMemEval (500 q, 19k sessions) | Hit@5 | 24.6% | Session-level | Qwen3-0.6B 1024d |
| Mem0 | LOCOMO (different benchmark) | LLM-Judge | 66.9% | LLM-extracted facts | Cloud API |
| Mem0ᵍ | LOCOMO (different benchmark) | LLM-Judge | 68.4% | LLM-extracted + graph | Cloud API |
| OpenAI Memory | LOCOMO (different benchmark) | LLM-Judge | 52.9% | Proprietary | Proprietary |
MemX is the closest comparison — also Rust, also local-first, also hybrid retrieval with RRF. At fact-level granularity (220k records, each an LLM-extracted atomic fact), MemX achieves 51.6% Hit@5. MentisDB achieves 65.0% R@5 with simpler turn-level storage (10.8k records, no LLM fact extraction).
Why does MentisDB outperform MemX despite simpler storage and smaller embeddings? Two reasons:
Tiered fusion beats RRF. MemX uses standard Reciprocal Rank Fusion to combine vector and keyword results. We tried RRF and it hurt — it demoted strong BM25 hits by giving vector matches equal weight. Our tiered boost approach (60× when no lexical match, 20× ramp for weak matches, additive otherwise) preserves BM25’s strengths while still surfacing semantic-only hits.
Importance-weighted scoring. MemX’s four-factor reranker weights importance at 10%. We found that user-originated content needs a much stronger signal to compete with verbose assistant responses — our 3.0× importance multiplier gives user thoughts +2.4 vs assistant +0.6, which tips close BM25 races.
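The 3.0× multiplier and the resulting +2.4 / +0.6 boosts come from the text above; the base importance values in this sketch (0.8 for user turns, 0.2 for assistant turns) are back-solved assumptions that reproduce those figures, not confirmed internals:

```rust
#[derive(Clone, Copy)]
enum Role {
    User,
    Assistant,
}

/// Additive importance boost applied on top of the fused retrieval score.
/// Base importance values are assumptions; the 3.0x multiplier and the
/// resulting +2.4 (user) vs +0.6 (assistant) match the write-up.
fn importance_boost(role: Role) -> f64 {
    let base = match role {
        Role::User => 0.8,      // user-originated facts matter most
        Role::Assistant => 0.2, // verbose assistant turns weigh less
    };
    base * 3.0
}

fn main() {
    // In a close BM25 race (10.1 vs 10.4), the boost tips the ranking
    // toward the user's own statement.
    let user_score = 10.1 + importance_boost(Role::User); // 12.5
    let assistant_score = 10.4 + importance_boost(Role::Assistant); // 11.0
    assert!(user_score > assistant_score);
    println!("user={user_score} assistant={assistant_score}");
}
```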
Mem0 isn’t directly comparable — they benchmark on LOCOMO (a different dataset) and measure end-to-end answer quality with an LLM judge, not raw retrieval recall. Mem0 also uses cloud LLMs for fact extraction and graph construction, which gives them better multi-session reasoning at higher cost and complexity.
The fair comparison is on the retrieval layer: both systems store compressed memory representations and retrieve them for an LLM to reason over. MentisDB is fully local with no cloud dependencies; Mem0 requires cloud APIs for its extraction pipeline.
MemX’s most striking finding is that storage granularity matters more than any pipeline tweak — fact-level storage doubles Hit@5 vs session-level. This suggests a clear improvement path for MentisDB: adding optional LLM-based fact extraction could significantly boost recall, especially for multi-session and preference queries where the evidence is buried in long conversations.
If you’re building AI agents that need to remember things, retrieval quality directly determines agent intelligence. An agent that can’t find the right memory is no smarter than one with no memory at all.
65% R@5 means that two out of three times, the correct memory is in the first five results the agent sees. At R@20, it’s nearly 80%. For a fully local, no-cloud-dependency system running on a laptop with a 384-dimension embedding model, that’s solid.
But more importantly, the architecture is improvable. Every component — stemming, scoring weights, embedding model, graph expansion — is a knob that can be turned. We’ve shown that systematic benchmarking and targeted improvements can move the needle significantly. The 13.3% preference score is a clear target for the next iteration.
All benchmarks were run on:

- MentisDB 0.8.0 with the `local-embeddings` feature
- FastEmbed all-MiniLM-L6-v2 (384d, ONNX, runs locally)
- 10,866 conversation turns from 940 sessions
- 500 evaluation questions across 6 types
- Ingestion: ~300 turns/sec
- Vector rebuild: ~3–5 minutes for 10k turns
- Evaluation: ~7 queries/sec
The benchmark is reproducible — run `bash lme-benches/full_bench.sh` from the MentisDB repository.
MentisDB is an open-source durable memory layer for AI agents. It stores memories in an append-only chain, retrieves them with hybrid lexical+semantic+graph search, and runs entirely locally with no cloud dependencies.