3.4 Benchmarking Your Memory
A memory benchmark should answer one practical question: when an agent asks a realistic question, does the right memory show up soon enough to use? Latency matters, but recall is the quality gate. A fast system that misses the decision is just confidently wrong.
The pattern
- Create a small eval set from real incidents, decisions, and handoffs.
- Store expected thought indices or stable ids for each query.
- Measure recall at k before latency tuning.
- Run the eval in CI before changing retrieval knobs or embedding providers.
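As a rough illustration of the bullets above, recall at k over stable ids can be scored with nothing but the standard library. This is a minimal sketch, not MentisDB API: `IdEvalCase`, `hit_at_k`, `recall_at_k`, and the `retrieve` closure are hypothetical names introduced here, and `retrieve` stands in for whatever call returns ranked ids in your setup.

```rust
use std::collections::HashSet;

/// One eval case: a query plus the stable ids of the memories that should come back.
/// (Hypothetical type for illustration; not part of any library.)
struct IdEvalCase {
    query: &'static str,
    expected_ids: Vec<&'static str>,
}

/// Returns true when at least one expected id appears in the first `k` ranked ids.
fn hit_at_k(ranked_ids: &[&str], expected_ids: &[&str], k: usize) -> bool {
    let expected: HashSet<&str> = expected_ids.iter().copied().collect();
    ranked_ids.iter().take(k).any(|id| expected.contains(id))
}

/// Fraction of cases with a hit in the top `k`.
/// `retrieve` is an assumed stand-in for the real retrieval call.
fn recall_at_k<F>(cases: &[IdEvalCase], k: usize, mut retrieve: F) -> f32
where
    F: FnMut(&str) -> Vec<&'static str>,
{
    if cases.is_empty() {
        return 0.0;
    }
    let hits = cases
        .iter()
        .filter(|case| hit_at_k(&retrieve(case.query), &case.expected_ids, k))
        .count();
    hits as f32 / cases.len() as f32
}
```

Scoring on ids rather than content substrings keeps the eval stable when the wording of a stored memory is edited. Running the same cases at several values of k shows whether a miss is a ranking problem (found at k = 10 but not at k = 3) or a retrieval problem (not found at all).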
use std::io;
// MentisDb, BinaryStorageAdapter, ThoughtInput, ThoughtType, and
// RankedSearchQuery are assumed to be in scope from the mentisdb crate.

struct MemoryEvalCase {
    query: &'static str,
    expected_content: &'static str,
}

fn benchmark_recall_at_k() -> io::Result<()> {
    // Use a throwaway chain so the benchmark never touches real memory.
    let dir = tempfile::tempdir()?;
    let adapter = BinaryStorageAdapter::for_chain_key(dir.path(), "cookbook-benchmark");
    let mut chain = MentisDb::open_with_storage(Box::new(adapter))?;

    chain.upsert_agent(
        "eval-agent",
        Some("Eval Agent"),
        Some("memory-team"),
        Some("Builds retrieval eval sets"),
        None,
    )?;

    // Seed the memories that the eval queries are expected to find.
    for content in [
        "Decision: use bearer tokens for remote MCP access.",
        "Constraint: dashboard access must require a PIN on shared hosts.",
        "Lesson: stale vector sidecars should be rebuilt from the append-only log.",
    ] {
        chain.append_thought(
            "eval-agent",
            ThoughtInput::new(ThoughtType::Decision, content)
                .with_concepts(["deployment", "security"])
                .with_importance(0.8),
        )?;
    }

    let cases = [
        MemoryEvalCase {
            query: "How do we secure remote MCP access?",
            expected_content: "bearer tokens",
        },
        MemoryEvalCase {
            query: "What protects the dashboard on shared hosts?",
            expected_content: "PIN",
        },
    ];

    // A case counts as a hit when any of the top 3 results contains the expected text.
    let mut hits = 0;
    for case in &cases {
        let result = chain.query_ranked(
            &RankedSearchQuery::new()
                .with_text(case.query)
                .with_limit(3),
        );
        if result
            .hits
            .iter()
            .any(|hit| hit.thought.content.contains(case.expected_content))
        {
            hits += 1;
        }
    }

    let recall_at_3 = hits as f32 / cases.len() as f32;
    assert!(recall_at_3 >= 0.5);
    Ok(())
}
Use your own queries. LoCoMo and LongMemEval are useful external bars, but
production regressions usually show up first in project-specific questions.
What to track
Track recall@5 or recall@10, median latency, p95 latency, and the number of candidates considered. When a benchmark fails, save the query, expected memory, actual top hits, and retrieval settings as a durable thought so future tuning work starts with evidence.
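A minimal sketch of two of those tracking pieces, using only the standard library: a nearest-rank median and p95 over recorded query latencies, and a record that bundles the evidence from a failed case into one line of text. All names here (`percentile`, `LatencySummary`, `summarize_latency`, `FailureReport`, `to_thought_text`) are hypothetical, and the text layout of the report is an assumption rather than a MentisDB format.

```rust
use std::time::Duration;

/// Nearest-rank percentile over an ascending-sorted slice; `pct` is 1..=100.
fn percentile(sorted: &[Duration], pct: usize) -> Option<Duration> {
    if sorted.is_empty() {
        return None;
    }
    // Ceiling division gives the 1-based nearest rank.
    let rank = (pct * sorted.len() + 99) / 100;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

struct LatencySummary {
    median: Duration,
    p95: Duration,
}

/// Sorts a copy of the samples and reads off the median and p95.
fn summarize_latency(samples: &[Duration]) -> Option<LatencySummary> {
    let mut sorted = samples.to_vec();
    sorted.sort();
    Some(LatencySummary {
        median: percentile(&sorted, 50)?,
        p95: percentile(&sorted, 95)?,
    })
}

/// Evidence from one failed eval case (hypothetical shape).
struct FailureReport {
    query: String,
    expected: String,
    top_hits: Vec<String>,
    settings: String,
}

impl FailureReport {
    /// Renders the report as one line of text that could be stored as a thought.
    fn to_thought_text(&self) -> String {
        format!(
            "Benchmark failure: query={:?}; expected={:?}; top_hits={:?}; settings={}",
            self.query, self.expected, self.top_hits, self.settings
        )
    }
}
```

Nearest-rank is used here because it always returns a latency that was actually observed, which makes a p95 regression easy to trace back to a specific query. The rendered report text could then be appended to the chain the same way any other thought is.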