3.4 Benchmarking Your Memory

A memory benchmark should answer one practical question: when an agent asks a realistic question, does the right memory show up soon enough to use? Latency matters, but recall is the quality gate. A fast system that misses the decision is just confidently wrong.

The pattern

Create a small eval set from real incidents, decisions, and handoffs.
Store expected thought indices or stable ids for each query.
Measure recall@k first, and tune latency only once recall is acceptable.
Run the eval in CI before changing retrieval knobs or embedding providers.

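use std::io;

// The MentisDB types used below (BinaryStorageAdapter, MentisDb, ThoughtInput,
// ThoughtType, RankedSearchQuery) are assumed to be imported from your MentisDB
// crate, and tempfile is assumed to be available as a dev-dependency.

// One eval case: a realistic query and a substring the right memory must contain.
struct MemoryEvalCase {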
    query: &'static str,
    expected_content: &'static str,
}

fn benchmark_recall_at_k() -> io::Result<()> {
    let dir = tempfile::tempdir()?;
    let adapter = BinaryStorageAdapter::for_chain_key(dir.path(), "cookbook-benchmark");
    let mut chain = MentisDb::open_with_storage(Box::new(adapter))?;

    chain.upsert_agent(
        "eval-agent",
        Some("Eval Agent"),
        Some("memory-team"),
        Some("Builds retrieval eval sets"),
        None,
    )?;

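    // Seed the chain with known memories. All three are stored as
    // ThoughtType::Decision to keep the fixture short; in a real eval set,
    // give each memory the thought type that matches its content.
    for content in [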
        "Decision: use bearer tokens for remote MCP access.",
        "Constraint: dashboard access must require a PIN on shared hosts.",
        "Lesson: stale vector sidecars should be rebuilt from the append-only log.",
    ] {
        chain.append_thought(
            "eval-agent",
            ThoughtInput::new(ThoughtType::Decision, content)
                .with_concepts(["deployment", "security"])
                .with_importance(0.8),
        )?;
    }

    let cases = [
        MemoryEvalCase {
            query: "How do we secure remote MCP access?",
            expected_content: "bearer tokens",
        },
        MemoryEvalCase {
            query: "What protects the dashboard on shared hosts?",
            expected_content: "PIN",
        },
    ];

    // A case counts as a hit when any of the top 3 results contains the expected text.
    let mut hits = 0;
    for case in &cases {
        let result = chain.query_ranked(
            &RankedSearchQuery::new()
                .with_text(case.query)
                .with_limit(3),
        );
        if result
            .hits
            .iter()
            .any(|hit| hit.thought.content.contains(case.expected_content))
        {
            hits += 1;
        }
    }

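    // With two cases, 0.5 tolerates one miss. Raise the bar as the eval set grows.
    let recall_at_3 = hits as f32 / cases.len() as f32;
    assert!(recall_at_3 >= 0.5, "recall@3 dropped to {}", recall_at_3);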
    Ok(())
}
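
Substring matching, as in the example above, breaks as soon as a memory is reworded. The second step of the pattern suggests matching on stable ids instead. The following is a minimal sketch of that variant in plain Rust, independent of any storage API. `IdEvalCase`, the `u64` id type, and `recall_at_k` are hypothetical names introduced for illustration, and the ranked id lists are assumed to come from whatever retrieval call you use. Like the example above, it counts a case as a hit when any expected id appears in the top k.

```rust
use std::collections::HashSet;

/// One eval case: a query plus the stable ids of the memories that should come back.
/// The `u64` id type is an assumption; substitute whatever id your store exposes.
struct IdEvalCase {
    query: &'static str,
    expected_ids: Vec<u64>,
}

/// Returns (recall@k, queries that missed).
/// `retrieved` holds one ranked id list per case, best hit first, in case order.
fn recall_at_k(
    cases: &[IdEvalCase],
    retrieved: &[Vec<u64>],
    k: usize,
) -> (f32, Vec<&'static str>) {
    assert_eq!(cases.len(), retrieved.len(), "one result list per case");

    let mut missed = Vec::new();
    for (case, ranked) in cases.iter().zip(retrieved) {
        // Only the first k ranked ids count toward a hit.
        let top_k: HashSet<u64> = ranked.iter().take(k).copied().collect();
        if !case.expected_ids.iter().any(|id| top_k.contains(id)) {
            missed.push(case.query);
        }
    }

    // An empty eval set has no meaningful recall; report 0.0 rather than NaN.
    if cases.is_empty() {
        return (0.0, missed);
    }
    let hits = cases.len() - missed.len();
    (hits as f32 / cases.len() as f32, missed)
}
```

Returning the missed queries alongside the score makes it easy to see which cases regressed without rerunning the eval.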

Use your own queries. LoCoMo and LongMemEval are useful external benchmarks, but production regressions usually show up first in project-specific questions.

What to track

Track recall@5 or recall@10, median latency, p95 latency, and the number of candidates considered. When a benchmark fails, save the query, expected memory, actual top hits, and retrieval settings as a durable thought so future tuning work starts with evidence.
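
As a rough sketch of the latency side, the helper below summarizes per-query timings into a median and a p95 using the nearest-rank method, and renders a failed case as text that could be stored as a thought. It uses only the standard library. `percentile`, `LatencySummary`, `summarize`, and `FailureRecord` are hypothetical names, the field layout is an assumption, and the step that actually appends the text to your memory store is left out.

```rust
use std::time::Duration;

/// Nearest-rank percentile over an ascending slice; `pct` is a whole percent (0..=100).
fn percentile(sorted: &[Duration], pct: usize) -> Option<Duration> {
    if sorted.is_empty() {
        return None;
    }
    // rank = ceil(pct / 100 * n), computed with integer ceiling division.
    let rank = (pct * sorted.len() + 99) / 100;
    let idx = rank.clamp(1, sorted.len()) - 1;
    Some(sorted[idx])
}

struct LatencySummary {
    median: Duration,
    p95: Duration,
}

/// Sorts a copy of the samples and reports median and p95; None if there are no samples.
fn summarize(samples: &[Duration]) -> Option<LatencySummary> {
    let mut sorted = samples.to_vec();
    sorted.sort();
    Some(LatencySummary {
        median: percentile(&sorted, 50)?,
        p95: percentile(&sorted, 95)?,
    })
}

/// Evidence to keep when an eval case fails. Field names are illustrative.
struct FailureRecord {
    query: String,
    expected: String,
    top_hits: Vec<String>,
    settings: String,
}

impl FailureRecord {
    /// Renders the record as plain text suitable for a thought's content.
    fn to_thought_text(&self) -> String {
        format!(
            "Benchmark miss. query: {:?}; expected: {:?}; top hits: {:?}; settings: {}",
            self.query, self.expected, self.top_hits, self.settings
        )
    }
}
```

Collect one `Duration` per query (for example with `std::time::Instant::now()` and `elapsed()` around the retrieval call), pass the samples to `summarize`, and report the summary next to recall so a latency win is never read without its recall cost.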