One day after shipping 0.8.0, we've pushed search quality further. MentisDB 0.8.1 adds session cohesion scoring, tunes the vector-lexical fusion curve, and tightens BM25's document-frequency filter. The result: LongMemEval R@5 climbs from 65.0% to 67.6%, and our LoCoMo 2-persona benchmark hits 88.7% R@10 — just 0.2% shy of MemPalace's published 88.9% hybrid score.
| Benchmark | Metric | 0.8.0 | 0.8.1 | Δ |
|---|---|---|---|---|
| LoCoMo (2-persona) | R@10 | 87.4% | 88.7% | +1.3% |
| LoCoMo (2-persona) | single-hop | 89.4% | 90.7% | +1.3% |
| LoCoMo (2-persona) | multi-hop | 78.2% | 80.0% | +1.8% |
| LoCoMo (10-persona) | R@10 | — | 74.2% | new |
| LongMemEval | R@5 | 65.0% | 67.6% | +2.6% |
| LongMemEval | R@10 | 70.6% | 73.2% | +2.6% |
The LoCoMo 2-persona baseline was 55.8% R@10 before 0.8.0. Over two releases we've added 32.9 percentage points — all without reindexing, format changes, or cloud dependencies.
Long conversations have a structural property that BM25 and vectors both miss: the evidence for a query often sits in a turn adjacent to the matching turn, sharing no keywords. "I went to an LGBTQ conference two days ago" doesn't contain the words from the query "when did Caroline go to the LGBTQ conference?" — but it sits right next to a turn that does.
Session cohesion detects high-scoring lexical hits (score ≥ 3.0) and boosts thoughts within ±8 positions in the append-order index. The boost is linear — 0.8 at distance 1, decaying to zero at the radius boundary. Thoughts that already have strong lexical scores (≥ 5.0) are excluded from the boost to avoid double-counting.
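The mechanics can be sketched in a few lines. The constants mirror the thresholds quoted above, but the function name and the exact linear-decay formula are illustrative assumptions, not MentisDB's internal API:

```rust
const SEED_MIN: f32 = 3.0;       // lexical score needed for a hit to act as a seed
const STRONG_LEXICAL: f32 = 5.0; // already-strong hits are excluded from the boost
const RADIUS: i64 = 8;           // ±8 positions in the append-order index
const MAX_BOOST: f32 = 0.8;      // boost at distance 1, decaying linearly to 0

/// Cohesion boost a seed hit at `seed_pos` contributes to a thought at `pos`.
/// The seed itself (distance 0) is assumed to receive no boost.
fn cohesion_boost(seed_score: f32, seed_pos: i64, pos: i64, own_lexical: f32) -> f32 {
    let dist = (seed_pos - pos).abs();
    if seed_score < SEED_MIN || dist == 0 || dist > RADIUS || own_lexical >= STRONG_LEXICAL {
        return 0.0;
    }
    // Linear decay: 0.8 at distance 1, 0.0 at the radius boundary.
    MAX_BOOST * (RADIUS - dist) as f32 / (RADIUS - 1) as f32
}

fn main() {
    assert_eq!(cohesion_boost(4.0, 100, 99, 0.0), 0.8); // adjacent turn: full boost
    assert_eq!(cohesion_boost(4.0, 100, 92, 0.0), 0.0); // at the radius edge: zero
    assert_eq!(cohesion_boost(4.0, 100, 99, 6.0), 0.0); // strong lexical: excluded
}
```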
The effect is most visible on multi-hop queries where evidence spans adjacent turns: in early testing, session cohesion alone lifted LoCoMo multi-hop from 74.5% to 80.0% R@10, and the final release holds at 80.0% with all changes combined.
0.8.0 replaced flat vector addition with a tiered boost — 60× for no-lexical, 20× for weak, additive for strong. That worked, but the step function between tiers created discontinuities. A thought with BM25 score 0.99 got a very different vector treatment than one with score 1.01.
0.8.1 replaces the step function with a smooth exponential decay:

```
vector_contribution = vector × (1 + BOOST × exp(-lexical / DECAY_RATE))
```

With BOOST=35 and DECAY_RATE=3.0, a pure-semantic match gets ~36× amplification. By the time lexical reaches 3.0, the extra boost has decayed to ~13×; at lexical 6.0 it is under 5×, at which point vector is assistive rather than dominant. This gives every lexical level the right amount of vector assistance without discontinuities.
The decay rate was the most impactful parameter: raising it from 2.0 to 3.0 alone added +1.0% to LoCoMo R@10. The slower decay keeps vector assistance active a little longer for moderate-lexical results that are close but not yet correct, without letting vector dominate strong lexical matches.
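A minimal sketch of the curve, using the published constants; `vector_contribution` is an illustrative function name, not necessarily the internal one. It also shows why the step-function discontinuity is gone: scores of 0.99 and 1.01 now get nearly identical treatment.

```rust
const BOOST: f32 = 35.0;
const DECAY_RATE: f32 = 3.0;

/// Vector contribution for a thought with the given lexical (BM25) score.
fn vector_contribution(vector: f32, lexical: f32) -> f32 {
    vector * (1.0 + BOOST * (-lexical / DECAY_RATE).exp())
}

fn main() {
    // Pure-semantic match: 1 + 35·e⁰ = 36× amplification.
    println!("{:.1}", vector_contribution(1.0, 0.0)); // prints "36.0"

    // Under 0.8.0's tiers, 0.99 vs 1.01 straddled a boundary and got very
    // different treatment; under the smooth curve they are nearly identical.
    let a = vector_contribution(1.0, 0.99);
    let b = vector_contribution(1.0, 1.01);
    assert!((a - b).abs() < 0.2);
}
```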
BM25 assigns high IDF (inverse document frequency) to rare terms and low IDF to common ones. But in conversational data, entity names like "Caroline" or "Melanie" appear in 30–50% of turns — not rare enough to filter under 0.8.0's 50% cutoff, but common enough to contribute noise. A query like "what did Caroline research?" would match every turn mentioning Caroline with nearly equal BM25 weight, burying the one that actually discusses her research.
Lowering the DF cutoff from 50% to 30% filters these entity names out of BM25 scoring. The right turn still surfaces — it matches on "research" and "adoption" — but it no longer competes with 200 other Caroline mentions.
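A simplified sketch of a DF cutoff filter, assuming a plain term-to-document-frequency map; the data structures are stand-ins for the real index, and the function name is hypothetical:

```rust
use std::collections::HashMap;

const DF_CUTOFF: f64 = 0.30; // was 0.50 in 0.8.0

/// Keep only query terms rare enough to be discriminative under BM25.
fn filter_terms<'a>(
    terms: &[&'a str],
    doc_freq: &HashMap<&str, usize>,
    n_docs: usize,
) -> Vec<&'a str> {
    terms
        .iter()
        .filter(|t| {
            let df = *doc_freq.get(**t).unwrap_or(&0) as f64 / n_docs as f64;
            df <= DF_CUTOFF
        })
        .copied()
        .collect()
}

fn main() {
    // "caroline" appears in 40% of turns: kept under the old 50% cutoff,
    // filtered under the new 30% one. "research" survives either way.
    let df = HashMap::from([("caroline", 400), ("research", 12)]);
    let kept = filter_terms(&["caroline", "research"], &df, 1000);
    assert_eq!(kept, vec!["research"]);
}
```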
Two bugs could crash the dashboard under edge conditions:

- `f32::NAN.clamp(0.0, 1.0)` returns NaN in Rust, not a clamped value. If `with_importance(NaN)` was called, the stored NaN would crash serde_json serialization when the dashboard tried to render the thought. Now `with_confidence()` silently drops non-finite values, `with_importance()` defaults to 0.5, and `thought_json()` also sanitizes existing NaN values at the serialization boundary.
- The `RankedSearchScoreResponse` struct was missing the new `session_cohesion` field, which would have caused a compile error for anyone building from git HEAD.
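The NaN pitfall is easy to reproduce; the `sanitize_importance` helper below is an illustrative stand-in for the fix, not MentisDB's actual API:

```rust
/// Rust's f32::clamp propagates NaN rather than clamping it, so non-finite
/// values must be caught explicitly before they reach serialization.
fn sanitize_importance(v: f32) -> f32 {
    if v.is_finite() {
        v.clamp(0.0, 1.0)
    } else {
        0.5 // fall back to the default instead of storing NaN
    }
}

fn main() {
    // The bug: clamp does not remove NaN, so it survived to serde_json.
    assert!(f32::NAN.clamp(0.0, 1.0).is_nan());
    // The fix: non-finite values fall back to the 0.5 default.
    assert_eq!(sanitize_importance(f32::NAN), 0.5);
    assert_eq!(sanitize_importance(2.0), 1.0); // finite values still clamp
}
```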
We tried English stopword filtering — removing "the", "is", "what", "when" etc. before stemming. It hurt: R@10 dropped from 87.4% to 86.7%. The reason: words like "when" and "what" carry temporal and spatial discriminative signal in conversational queries ("when did Caroline go camping?"). The DF cutoff already handles genuinely non-discriminative terms. Blanket stopword removal is too blunt for memory retrieval.
On LoCoMo, MemPalace reports 88.9% R@10 with hybrid search and no reranking. We're at 88.7% on the 2-persona subset. The 10-persona full benchmark shows 74.2% R@10 — a harder problem where entity names flood the index and more conversations create longer sessions.
The dominant miss pattern is stemming limitations: "went" doesn't stem to "go", "gave" doesn't stem to "give", and "research" as a noun is conflated with "research" as a verb. About 38% of remaining misses are not in the top-50 at all: a true lexical gap that no scoring adjustment can fix.
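A toy suffix-stripping stemmer illustrates the gap: rule-based stemming recovers regular inflections but has no rule that maps irregular forms like "went" back to "go". This is a deliberately naive sketch, not the stemmer MentisDB actually uses:

```rust
/// Strip a handful of common suffixes; keep at least a 3-character stem.
fn naive_stem(word: &str) -> &str {
    for suffix in ["ing", "ed", "s"] {
        if let Some(stem) = word.strip_suffix(suffix) {
            if stem.len() >= 3 {
                return stem;
            }
        }
    }
    word
}

fn main() {
    assert_eq!(naive_stem("walked"), "walk"); // regular inflection: recovered
    assert_eq!(naive_stem("went"), "went");   // irregular: no rule applies
    assert_eq!(naive_stem("goes"), "goe");    // suffix rules can also overshoot
}
```

Queries phrased with "go" simply never match documents that only say "went", which is why these misses show up as a lexical gap rather than a ranking problem.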
```shell
cargo install mentisdb
```

Or from source:

```shell
git pull
cargo install --path . --locked
```
Existing chains, vector sidecars, and skill registries are migrated automatically. The lexical index rebuilds on first access due to the DF cutoff change.
MentisDB is an open-source durable memory layer for AI agents. It stores memories in an append-only hash-chained log, retrieves them with hybrid lexical+semantic+graph search, and runs entirely locally with no cloud dependencies. GitHub · Docs · Website