One day after shipping 0.8.0, we've pushed search quality further. MentisDB 0.8.1 adds session cohesion scoring, tunes the vector-lexical fusion curve, and tightens BM25's document-frequency filter. The result: LongMemEval R@5 climbs from 65.0% to 67.6%, and our LoCoMo 2-persona benchmark hits 88.7% R@10 — just 0.2% shy of MemPalace's published 88.9% hybrid score.
| Benchmark | Metric | 0.8.0 | 0.8.1 | Δ |
|---|---|---|---|---|
| LoCoMo (2-persona) | R@10 | 87.4% | 88.7% | +1.3% |
| LoCoMo (2-persona) | single-hop | 89.4% | 90.7% | +1.3% |
| LoCoMo (2-persona) | multi-hop | 78.2% | 80.0% | +1.8% |
| LoCoMo (10-persona) | R@10 | — | 74.2% | new |
| LongMemEval | R@5 | 65.0% | 67.6% | +2.6% |
| LongMemEval | R@10 | 70.6% | 73.2% | +2.6% |
The LoCoMo 2-persona baseline was 55.8% R@10 before 0.8.0. Over two releases we've added 32.9 percentage points — all without reindexing, format changes, or cloud dependencies.
Long conversations have a structural property that BM25 and vectors both miss: the evidence for a query often sits in a turn adjacent to the matching turn, sharing no keywords. "I went to an LGBTQ conference two days ago" doesn't contain the words from the query "when did Caroline go to the LGBTQ conference?" — but it sits right next to a turn that does.
Session cohesion detects high-scoring lexical hits (score ≥ 3.0) and boosts thoughts within ±8 positions in the append-order index. The boost is linear — 0.8 at distance 1, decaying to zero at the radius boundary. Thoughts that already have strong lexical scores (≥ 5.0) are excluded from the boost to avoid double-counting.
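The mechanics can be sketched in a few lines. The constants mirror the thresholds quoted above, but the function name and the exact linear-decay formula are illustrative assumptions, not MentisDB's internal API:

```rust
const SEED_MIN: f32 = 3.0;       // lexical score needed for a hit to act as a seed
const STRONG_LEXICAL: f32 = 5.0; // already-strong hits are excluded from the boost
const RADIUS: i64 = 8;           // ±8 positions in the append-order index
const MAX_BOOST: f32 = 0.8;      // boost at distance 1, decaying linearly to 0

/// Cohesion boost a seed hit at `seed_pos` contributes to a thought at `pos`.
/// The seed itself (distance 0) is assumed to receive no boost.
fn cohesion_boost(seed_score: f32, seed_pos: i64, pos: i64, own_lexical: f32) -> f32 {
    let dist = (seed_pos - pos).abs();
    if seed_score < SEED_MIN || dist == 0 || dist > RADIUS || own_lexical >= STRONG_LEXICAL {
        return 0.0;
    }
    // Linear decay: 0.8 at distance 1, 0.0 at the radius boundary.
    MAX_BOOST * (RADIUS - dist) as f32 / (RADIUS - 1) as f32
}

fn main() {
    assert_eq!(cohesion_boost(4.0, 100, 99, 0.0), 0.8); // adjacent turn: full boost
    assert_eq!(cohesion_boost(4.0, 100, 92, 0.0), 0.0); // at the radius edge: zero
    assert_eq!(cohesion_boost(4.0, 100, 99, 6.0), 0.0); // strong lexical: excluded
}
```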
The effect is most visible on multi-hop queries where evidence spans adjacent turns: in early testing, session cohesion alone lifted LoCoMo multi-hop from 74.5% to 80.0% R@10, and the final release holds at 80.0% with all changes combined.
0.8.0 replaced flat vector addition with a tiered boost — 60× for no-lexical, 20× for weak, additive for strong. That worked, but the step function between tiers created discontinuities. A thought with BM25 score 0.99 got a very different vector treatment than one with score 1.01.
0.8.1 replaces the step function with a smooth exponential decay:

```
vector_contribution = vector × (1 + BOOST × exp(-lexical / DECAY_RATE))
```

With BOOST=35 and DECAY_RATE=3.0, a pure-semantic match gets ~36× amplification. By the time lexical reaches 3.0, the extra boost has decayed to ~13×; at lexical 6.0 it is under 5×, at which point vector is assistive rather than dominant. This gives every lexical level the right amount of vector assistance without discontinuities.
The decay rate was the most impactful parameter: raising it from 2.0 to 3.0 alone added +1.0% to LoCoMo R@10. The slower decay keeps vector assistance active a little longer for moderate-lexical results that are close but not yet correct, without letting vector dominate strong lexical matches.
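A minimal sketch of the curve, using the published constants; `vector_contribution` is an illustrative function name, not necessarily the internal one. It also shows why the step-function discontinuity is gone: scores of 0.99 and 1.01 now get nearly identical treatment.

```rust
const BOOST: f32 = 35.0;
const DECAY_RATE: f32 = 3.0;

/// Vector contribution for a thought with the given lexical (BM25) score.
fn vector_contribution(vector: f32, lexical: f32) -> f32 {
    vector * (1.0 + BOOST * (-lexical / DECAY_RATE).exp())
}

fn main() {
    // Pure-semantic match: 1 + 35·e⁰ = 36× amplification.
    println!("{:.1}", vector_contribution(1.0, 0.0)); // prints "36.0"

    // Under 0.8.0's tiers, 0.99 vs 1.01 straddled a boundary and got very
    // different treatment; under the smooth curve they are nearly identical.
    let a = vector_contribution(1.0, 0.99);
    let b = vector_contribution(1.0, 1.01);
    assert!((a - b).abs() < 0.2);
}
```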
BM25 assigns high IDF (inverse document frequency) to rare terms and low IDF to common ones. But in conversational data, entity names like "Caroline" or "Melanie" appear in 30–50% of turns — not rare enough to filter under 0.8.0's 50% cutoff, but common enough to contribute noise. A query like "what did Caroline research?" would match every turn mentioning Caroline with nearly equal BM25 weight, burying the one that actually discusses her research.
Lowering the DF cutoff from 50% to 30% filters these entity names out of BM25 scoring. The right turn still surfaces — it matches on "research" and "adoption" — but it no longer competes with 200 other Caroline mentions.
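A simplified sketch of a DF cutoff filter, assuming a plain term-to-document-frequency map; the data structures are stand-ins for the real index, and the function name is hypothetical:

```rust
use std::collections::HashMap;

const DF_CUTOFF: f64 = 0.30; // was 0.50 in 0.8.0

/// Keep only query terms rare enough to be discriminative under BM25.
fn filter_terms<'a>(
    terms: &[&'a str],
    doc_freq: &HashMap<&str, usize>,
    n_docs: usize,
) -> Vec<&'a str> {
    terms
        .iter()
        .filter(|t| {
            let df = *doc_freq.get(**t).unwrap_or(&0) as f64 / n_docs as f64;
            df <= DF_CUTOFF
        })
        .copied()
        .collect()
}

fn main() {
    // "caroline" appears in 40% of turns: kept under the old 50% cutoff,
    // filtered under the new 30% one. "research" survives either way.
    let df = HashMap::from([("caroline", 400), ("research", 12)]);
    let kept = filter_terms(&["caroline", "research"], &df, 1000);
    assert_eq!(kept, vec!["research"]);
}
```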
Two bugs could crash the dashboard under edge conditions:

- `f32::NAN.clamp(0.0, 1.0)` returns NaN in Rust, not a clamped value. If `with_importance(NaN)` was called, the stored NaN would crash serde_json serialization when the dashboard tried to render the thought. Now `with_confidence()` silently drops non-finite values, `with_importance()` defaults to 0.5, and `thought_json()` also sanitizes existing NaN values at the serialization boundary.
- The `RankedSearchScoreResponse` struct was missing the new `session_cohesion` field, which would have caused a compile error for anyone building from git HEAD.
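The NaN pitfall is easy to reproduce; the `sanitize_importance` helper below is an illustrative stand-in for the fix, not MentisDB's actual API:

```rust
/// Rust's f32::clamp propagates NaN rather than clamping it, so non-finite
/// values must be caught explicitly before they reach serialization.
fn sanitize_importance(v: f32) -> f32 {
    if v.is_finite() {
        v.clamp(0.0, 1.0)
    } else {
        0.5 // fall back to the default instead of storing NaN
    }
}

fn main() {
    // The bug: clamp does not remove NaN, so it survived to serde_json.
    assert!(f32::NAN.clamp(0.0, 1.0).is_nan());
    // The fix: non-finite values fall back to the 0.5 default.
    assert_eq!(sanitize_importance(f32::NAN), 0.5);
    assert_eq!(sanitize_importance(2.0), 1.0); // finite values still clamp
}
```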
We tried English stopword filtering — removing "the", "is", "what", "when" etc. before stemming. It hurt: R@10 dropped from 87.4% to 86.7%. The reason: words like "when" and "what" carry temporal and spatial discriminative signal in conversational queries ("when did Caroline go camping?"). The DF cutoff already handles genuinely non-discriminative terms. Blanket stopword removal is too blunt for memory retrieval.
On LoCoMo, MemPalace reports 88.9% R@10 with hybrid search and no reranking. We're at 88.7% on the 2-persona subset. The 10-persona full benchmark shows 74.2% R@10 — a harder problem where entity names flood the index and more conversations create longer sessions.
The dominant miss pattern is stemming limitations: "went" doesn't stem to "go", "gave" doesn't stem to "give", and "research" as a noun is conflated with "research" as a verb. About 38% of remaining misses are not in the top-50 at all: a true lexical gap that no scoring adjustment can fix.
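A toy suffix-stripping stemmer illustrates the gap: rule-based stemming recovers regular inflections but has no rule that maps irregular forms like "went" back to "go". This is a deliberately naive sketch, not the stemmer MentisDB actually uses:

```rust
/// Strip a handful of common suffixes; keep at least a 3-character stem.
fn naive_stem(word: &str) -> &str {
    for suffix in ["ing", "ed", "s"] {
        if let Some(stem) = word.strip_suffix(suffix) {
            if stem.len() >= 3 {
                return stem;
            }
        }
    }
    word
}

fn main() {
    assert_eq!(naive_stem("walked"), "walk"); // regular inflection: recovered
    assert_eq!(naive_stem("went"), "went");   // irregular: no rule applies
    assert_eq!(naive_stem("goes"), "goe");    // suffix rules can also overshoot
}
```

Queries phrased with "go" simply never match documents that only say "went", which is why these misses show up as a lexical gap rather than a ranking problem.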
```shell
cargo install mentisdb
```

Or from source:

```shell
git pull
cargo install --path . --locked
```
Existing chains, vector sidecars, and skill registries are migrated automatically. The lexical index rebuilds on first access due to the DF cutoff change.
MentisDB is an open-source durable memory layer for AI agents. It stores memories in an append-only hash-chained log, retrieves them with hybrid lexical+semantic+graph search, and runs entirely locally with no cloud dependencies. GitHub · Docs · Website