After a rough 72 hours of migration bugs and hotfixes, we needed a win. Here it is: tuned session cohesion and graph scoring push LoCoMo 10-persona R@10 to 74.6%, clearing the 74.2% baseline from 0.8.1.
| Benchmark | Metric | 0.8.1 Baseline | 0.8.3 (broken) | 0.8.5 (tuned) | Δ |
|---|---|---|---|---|---|
| LoCoMo 10-persona | R@10 | 74.2% | 72.8% | 74.6% | +0.4 pp vs 0.8.1 |
| single-hop | R@10 | — | 77.0% | 79.0% | +2.0 pp vs 0.8.3 |
| multi-hop | R@10 | — | 57.7% | 58.4% | +0.7 pp vs 0.8.3 |
Single-hop recall jumped 2.0 points over 0.8.3 to 79.0%, and multi-hop improved 0.7 points to 58.4%. The regression from 0.8.3 is fully recovered, and we're back above the 0.8.1 baseline.
Session cohesion scores adjacent thoughts higher when a nearby thought in the same session already matched lexically. This is critical for LoCoMo-style conversations, where the evidence often sits in turns adjacent to the matching turn but shares no lexical terms with the query.
| Parameter | Before | After |
|---|---|---|
| `SESSION_COHESION_RADIUS` | 8 | 12 |
| `SESSION_COHESION_BOOST` | 0.8 | 1.2 |
The wider radius (8→12) captures more context around matched seeds. The stronger boost (0.8→1.2) gives adjacent thoughts enough weight to push past near-misses into the top-10.
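In sketch form, the mechanism looks roughly like the following. The struct, function name, and linear distance falloff are assumptions for illustration; only the two constants come from the table above.

```rust
const SESSION_COHESION_RADIUS: i64 = 12; // was 8
const SESSION_COHESION_BOOST: f64 = 1.2; // was 0.8

/// A candidate thought: its turn position within the session and its
/// current retrieval score. (Hypothetical type for this sketch.)
struct Candidate {
    position: i64,
    score: f64,
}

/// Boost candidates that sit near a lexically matched seed turn.
/// The boost decays linearly with distance from the nearest seed.
fn apply_session_cohesion(candidates: &mut [Candidate], seed_positions: &[i64]) {
    for c in candidates.iter_mut() {
        // Distance to the nearest matched seed in the same session.
        let nearest = seed_positions
            .iter()
            .map(|s| (c.position - s).abs())
            .min();
        if let Some(d) = nearest {
            if d > 0 && d <= SESSION_COHESION_RADIUS {
                let falloff = 1.0 - (d as f64 / (SESSION_COHESION_RADIUS + 1) as f64);
                c.score += SESSION_COHESION_BOOST * falloff;
            }
        }
    }
}
```

With the wider radius, a thought 10 turns from a seed now receives a boost; under the old radius of 8 it received none.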
Graph expansion via `ThoughtRelation` edges was contributing near-zero scores (an average of 0.0019 on misses). We doubled all relation-kind boosts:
| Relation Kind | Before | After |
|---|---|---|
| ContinuesFrom | 0.30 | 0.60 |
| Corrects / Invalidates | 0.25 | 0.50 |
| Supersedes | 0.22 | 0.45 |
| DerivedFrom | 0.20 | 0.40 |
| Summarizes / CausedBy | 0.11 | 0.20 |
| Supports / Contradicts | 0.09 | 0.15 |
| RelatedTo | 0.05 | 0.08 |
| References | 0.04 | 0.06 |
Seed support score also doubled from 0.1 to 0.2 per supporting path. These changes give graph-connected thoughts enough weight to surface when the lexical signal is weak — exactly the multi-hop scenario.
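A sketch of how these scores might combine during expansion. The enum, function names, and the flat sum are assumptions; only the boost values and the 0.2-per-path seed support score come from the tables above.

```rust
#[derive(Clone, Copy)]
enum RelationKind {
    ContinuesFrom,
    Corrects,
    Invalidates,
    Supersedes,
    DerivedFrom,
    Summarizes,
    CausedBy,
    Supports,
    Contradicts,
    RelatedTo,
    References,
}

/// Score contributed when a candidate is reached from a seed via this edge.
fn relation_boost(kind: RelationKind) -> f64 {
    use RelationKind::*;
    match kind {
        ContinuesFrom => 0.60,          // was 0.30
        Corrects | Invalidates => 0.50, // was 0.25
        Supersedes => 0.45,             // was 0.22
        DerivedFrom => 0.40,            // was 0.20
        Summarizes | CausedBy => 0.20,  // was 0.11
        Supports | Contradicts => 0.15, // was 0.09
        RelatedTo => 0.08,              // was 0.05
        References => 0.06,             // was 0.04
    }
}

/// Each supporting path back to a matched seed adds a flat score.
const SEED_SUPPORT_SCORE: f64 = 0.2; // was 0.1

/// Total graph score for a candidate reached via `edges` from matched
/// seeds, with `supporting_paths` independent paths back to seeds.
fn graph_score(edges: &[RelationKind], supporting_paths: usize) -> f64 {
    let edge_score: f64 = edges.iter().map(|&k| relation_boost(k)).sum();
    edge_score + SEED_SUPPORT_SCORE * supporting_paths as f64
}
```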
The LoCoMo benchmark script now rebuilds `fastembed-minilm` vectors in addition to `local-text-v1` when the `local-embeddings` feature is compiled. Real sentence embeddings give the vector-lexical fusion a meaningful semantic signal instead of just text hashing.
To build with real embeddings:

```sh
cargo build --release --features local-embeddings
```
Of the 503 misses (queries where the correct answer wasn't in the top-10), two buckets stand out.
The 130 near-misses in top-20 are the most actionable bucket. Session cohesion and graph boosts are specifically designed to push these over the line, and the +2.0% single-hop improvement confirms they're working.
The 218 queries (43.3%) where the evidence isn't even in the top-50 are the hard ceiling. These are genuine lexical gaps: the query asks "what is Caroline's identity?" but the evidence says "the transgender stories were so inspiring!" No amount of scoring tuning will fix these; they need a genuinely semantic signal, such as the real sentence embeddings enabled by the `local-embeddings` feature.
The LongMemEval benchmark is next. After that we continue the 0.8.3 roadmap: irregular-verb lemmas for better stemming, RRF (Reciprocal Rank Fusion) for combining lexical and vector signals, and per-field BM25 document-frequency cutoffs to reduce false positives.
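RRF itself is simple: each ranked list contributes 1/(k + rank) per document, and the sums are re-sorted. A minimal sketch follows; the function is illustrative, and k = 60 is the common default from the RRF literature, not a setting of this project.

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion over several ranked lists of document ids.
/// Each list contributes 1 / (k + rank) per document (ranks 1-based);
/// k damps the dominance of top ranks. Returns ids sorted by fused score.
fn rrf(lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in lists {
        for (rank, id) in list.iter().enumerate() {
            *scores.entry((*id).to_string()).or_insert(0.0) +=
                1.0 / (k + (rank + 1) as f64);
        }
    }
    let mut fused: Vec<_> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}
```

A document ranked moderately well by both the lexical and vector lists outscores one ranked first by only a single list, which is exactly the behavior we want from the fusion step.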