April 11, 2026

MentisDB 0.8.5 — Retrieval Tuning Beats the Baseline

After a rough 72 hours of migration bugs and hotfixes, we needed a win. Here it is: tuned session cohesion and graph scoring push LoCoMo 10-persona R@10 to 74.6%, clearing the 74.2% baseline from 0.8.1.

The Numbers

| Benchmark | Metric | 0.8.1 Baseline | 0.8.3 (broken) | 0.8.5 (tuned) | Δ vs baseline |
|---|---|---|---|---|---|
| LoCoMo 10-persona | R@10 | 74.2% | 72.8% | 74.6% | +0.4% |
| single-hop | R@10 | 77.0% | – | 79.0% | +2.0% |
| multi-hop | R@10 | 57.7% | – | 58.4% | +0.7% |

Single-hop recall jumped +2.0% to 79.0%. Multi-hop improved +0.7%. The regression from 0.8.3 is fully recovered, and we're now above the 0.8.1 baseline.

What Changed

Session Cohesion: Wider Radius, Stronger Boost

Session cohesion scores adjacent thoughts higher when a nearby thought already matched lexically. This is critical for LoCoMo-style conversations where evidence sits in turns adjacent to the matching turn but shares no lexical terms.

| Parameter | Before | After |
|---|---|---|
| SESSION_COHESION_RADIUS | 8 | 12 |
| SESSION_COHESION_BOOST | 0.8 | 1.2 |

The wider radius (8→12) captures more context around matched seeds. The stronger boost (0.8→1.2) gives adjacent thoughts enough weight to push past near-misses into the top-10.
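To make the mechanism concrete, here is a minimal sketch of session-cohesion boosting. The function name and the linear distance decay are assumptions for illustration, not MentisDB's actual scoring code; only the radius and boost constants come from the table above.

```rust
// Illustrative sketch, not the real MentisDB implementation.
// Only the two constants are taken from the 0.8.5 release notes.
const SESSION_COHESION_RADIUS: i64 = 12; // was 8
const SESSION_COHESION_BOOST: f64 = 1.2; // was 0.8

/// Boost a candidate thought's score if a lexically matched seed sits
/// within SESSION_COHESION_RADIUS turns in the same session. The linear
/// decay with distance is an assumption made for this sketch.
fn cohesion_boost(candidate_turn: i64, seed_turns: &[i64]) -> f64 {
    seed_turns
        .iter()
        .filter(|&&seed| (candidate_turn - seed).abs() <= SESSION_COHESION_RADIUS)
        .map(|&seed| {
            let d = (candidate_turn - seed).abs() as f64;
            SESSION_COHESION_BOOST * (1.0 - d / (SESSION_COHESION_RADIUS as f64 + 1.0))
        })
        .fold(0.0_f64, f64::max)
}

fn main() {
    // A thought 3 turns from a matched seed gets a partial boost.
    let b = cohesion_boost(10, &[7]);
    println!("boost = {b:.3}");
    // A thought outside the radius gets nothing.
    assert_eq!(cohesion_boost(30, &[7]), 0.0);
}
```

The wider radius means an adjacent-but-lexically-unrelated turn up to 12 positions away can still inherit weight from the matched seed.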

Graph Relation Scores: Doubled

Graph expansion via ThoughtRelation edges was contributing near-zero scores (avg 0.0019 on misses). We doubled all relation kind boosts:

| Relation Kind | Before | After |
|---|---|---|
| ContinuesFrom | 0.30 | 0.60 |
| Corrects / Invalidates | 0.25 | 0.50 |
| Supersedes | 0.22 | 0.45 |
| DerivedFrom | 0.20 | 0.40 |
| Summarizes / CausedBy | 0.11 | 0.20 |
| Supports / Contradicts | 0.09 | 0.15 |
| RelatedTo | 0.05 | 0.08 |
| References | 0.04 | 0.06 |

Seed support score also doubled from 0.1 to 0.2 per supporting path. These changes give graph-connected thoughts enough weight to surface when the lexical signal is weak — exactly the multi-hop scenario.
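A sketch of how these contributions might combine during graph expansion. The enum and function shapes are assumptions for illustration; the boost values and the 0.2-per-path seed support are the 0.8.5 numbers from the table above.

```rust
// Illustrative sketch; only the numeric values come from the release notes.
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum RelationKind {
    ContinuesFrom,
    Corrects,
    Invalidates,
    Supersedes,
    DerivedFrom,
    Summarizes,
    CausedBy,
    Supports,
    Contradicts,
    RelatedTo,
    References,
}

/// Per-edge score contribution during graph expansion (0.8.5 values).
fn relation_boost(kind: RelationKind) -> f64 {
    use RelationKind::*;
    match kind {
        ContinuesFrom => 0.60,
        Corrects | Invalidates => 0.50,
        Supersedes => 0.45,
        DerivedFrom => 0.40,
        Summarizes | CausedBy => 0.20,
        Supports | Contradicts => 0.15,
        RelatedTo => 0.08,
        References => 0.06,
    }
}

/// Seed support: each supporting path now adds 0.2 (was 0.1).
fn seed_support(paths: usize) -> f64 {
    0.2 * paths as f64
}

fn main() {
    // A thought reached via ContinuesFrom with two supporting paths.
    let total = relation_boost(RelationKind::ContinuesFrom) + seed_support(2);
    println!("graph score contribution = {total:.2}");
}
```

At the old values, the same thought would have contributed 0.30 + 0.20 = 0.50 instead of 1.00 — often not enough to displace a lexical near-miss from the top-10.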

FastEmbed Vectors in the Benchmark

The LoCoMo benchmark script now rebuilds fastembed-minilm vectors in addition to local-text-v1 when the local-embeddings feature is compiled. Real sentence embeddings give the vector-lexical fusion a meaningful semantic signal instead of just text hashing.

To build with real embeddings:

```shell
cargo build --release --features local-embeddings
```
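Conceptually, the benchmark script's vector-space selection looks something like the sketch below. The function and flag names are hypothetical; the two space names and the feature-gating behavior are from the description above.

```rust
// Hypothetical sketch of how the benchmark script might choose which
// vector spaces to rebuild; names besides the two space identifiers
// and the feature name are illustrative.
fn spaces_to_rebuild(local_embeddings_compiled: bool) -> Vec<&'static str> {
    // Hash-based text vectors are always rebuilt.
    let mut spaces = vec!["local-text-v1"];
    if local_embeddings_compiled {
        // With the feature compiled in, real sentence embeddings too.
        spaces.push("fastembed-minilm");
    }
    spaces
}

fn main() {
    // In a real script this flag would come from
    // cfg!(feature = "local-embeddings").
    println!("{:?}", spaces_to_rebuild(true));
}
```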

Near-Miss Analysis

Of the 503 misses (queries where the correct answer wasn't in top-10):

  1. 130 near-misses: evidence ranked in the top-20
  2. 155: evidence ranked between 21 and 50
  3. 218 (43.3%): evidence not in the top-50 at all

The 130 near-misses in top-20 are the most actionable bucket. Session cohesion and graph boosts are specifically designed to push these over the line, and the +2.0% single-hop improvement confirms they're working.

The 43% Problem

The 218 queries (43.3%) where the evidence isn't even in top-50 are the hard ceiling. These are genuine lexical gaps — the query asks "what is Caroline's identity?" but the evidence says "the transgender stories were so inspiring!" No amount of scoring tuning will fix these; they need at least one of:

  1. Better stemming — irregular verb lemmas (0.8.3 roadmap)
  2. Stronger vector signals — larger embedding models or reranking
  3. Query expansion — LLM-based query rewriting

What's Next

The LongMemEval benchmark is next. Then we continue the 0.8.3 roadmap: irregular verb lemmas for better stemming, RRF (Reciprocal Rank Fusion) for combining lexical and vector signals, and per-field BM25 DF cutoffs to reduce false positives.
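Since RRF is on the roadmap, here is a minimal sketch of the standard formula (score = Σ 1/(k + rank) across rankings). k = 60 is the commonly used default from the literature, not a MentisDB value, and the function shape is illustrative.

```rust
// Minimal Reciprocal Rank Fusion sketch; k = 60 is the conventional
// default, not a MentisDB parameter.
use std::collections::HashMap;

/// Fuse several rankings: each document scores the sum of
/// 1 / (k + rank) over every ranking it appears in (ranks are 1-based).
fn rrf(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in rankings {
        for (i, doc) in ranking.iter().enumerate() {
            *scores.entry(doc.to_string()).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    let lexical = vec!["t1", "t2", "t3"];
    let vector = vec!["t3", "t1", "t4"];
    // t1 tops the fused list: it sits near the top of both rankings.
    let fused = rrf(&[lexical, vector], 60.0);
    println!("{fused:?}");
}
```

The appeal over weighted score fusion is that RRF only consumes ranks, so the lexical and vector scores don't need to share a scale.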