LoCoMo — Hebb Mind vs MemPalace

MemPalace publishes session-level Recall@k on the full 1,986 LoCoMo questions. Their metric and ours (see LoCoMo) are computed the same way: ground-truth evidence is parsed into a set of session_ids, and a question counts as correct iff any GT session appears in the top-10 retrieved set (session-level hit@10).
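The metric described above can be sketched as follows. This is a minimal illustration, not code from either project: the function names and the `session_id` field on each retrieved item are assumptions made for the example.

```python
from typing import Iterable, List, Mapping, Set, Tuple


def session_hit_at_k(retrieved: Iterable[Mapping[str, str]],
                     gt_sessions: Set[str],
                     k: int = 10) -> bool:
    """Return True if any ground-truth session appears among the top-k results.

    `retrieved` is assumed to be ordered best-first, and each item is assumed
    to carry the session it came from under the (hypothetical) key "session_id".
    """
    top_k = list(retrieved)[:k]
    retrieved_sessions = {item["session_id"] for item in top_k}
    return bool(retrieved_sessions & gt_sessions)


def hit_rate(results: List[Tuple[List[Mapping[str, str]], Set[str]]],
             k: int = 10) -> float:
    """Fraction of scored questions with a session-level hit.

    Questions whose ground-truth evidence set is empty are skipped,
    so they count in neither the numerator nor the denominator.
    """
    scored = [(r, gt) for r, gt in results if gt]
    hits = sum(session_hit_at_k(r, gt, k) for r, gt in scored)
    return hits / len(scored)
```

Because the hit is decided on session IDs, it does not matter whether the retrieved items are whole-session documents or individual utterances, as long as each item can be mapped back to its session.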

Production parity — the most important caveat

Hebb Mind benchmarks call the same code paths as the shipped product. The eval harness ingests every LoCoMo turn through the production Claude Code capture hook (src/hebb/integrations/claude_code/stop.py, which fires on each turn round-trip): per-utterance memories with the same min-length filter and session-scoped dedupe; per-turn-pair summaries with the same [<timestamp>] [<role>] … formatting. Retrieval goes through the same /api/v1/search that Claude Code, the MCP server, and the Web Console all hit. The 91.4% / 94.1% R@10 you see here is the R@10 a user actually gets in production (94.1% with the --profile best bge tier, rerank off).

MemPalace's benchmark does not call their production pipeline. Our source-level audit of their repo finds three concrete divergences:

Ingestion granularity — the benchmark ingests one document per session (or per turn) of verbatim text; production chunks every memory into 800-character windows. Whole-session docs are much larger and more semantically coherent than the chunks a real user accumulates.
Scoring pipeline — production adds a "closet boost" (rank-based distance reduction [0.40, 0.25, 0.15, 0.08, 0.04] for boosted hits with cosine distance < 1.5), BM25 hybrid weighting, and neighbor-chunk enrichment. None of these run in the benchmark.
Embedding flexibility — production hard-codes ChromaDB's default (all-MiniLM-L6-v2) with no model swap; benchmark numbers sweep across MiniLM, bge-large, etc.
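To make the granularity difference concrete, here is a rough sketch of what fixed-size character windowing does to a session-length text. It is not MemPalace's code: the helper name is hypothetical, and non-overlapping windows are an assumption (the audit only states the 800-character window size).

```python
from typing import List


def chunk_fixed(text: str, window: int = 800) -> List[str]:
    """Split text into consecutive character windows.

    Assumption: windows do not overlap. The real chunker may differ.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [text[i:i + window] for i in range(0, len(text), window)]


# Toy session of 2,000 characters (placeholder content, not LoCoMo data).
session_text = "x" * 2000

benchmark_docs = [session_text]             # one document per session
production_docs = chunk_fixed(session_text)  # 800-character windows

print(len(benchmark_docs), len(production_docs))
```

Under these assumptions a single session document becomes several smaller chunks, each of which has to be retrieved on its own merits.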

Quoting our audit verbatim: "The production pipeline (closet boost, BM25 hybrid, neighbor expansion) was NOT tested in benchmarks. Benchmark numbers reflect the benchmark pipeline, not production search quality."

This matters for the tables below: we are comparing prod-mirror Hebb Mind against benchmark-only MemPalace. The MemPalace numbers are a ceiling on what their evaluation harness produces, not a measurement of what their shipped system does.

Headline — no rerank (R@10)

System	R@10	Embedding	Notes
Hebb Mind (prod-mirror)	94.14%	bge-large-1024	Full 10 scenarios, 1,978q scored (8 adversarial excluded)
MemPalace bge-large hybrid	92.40%	bge-large-1024	Full 1,986q (MemPalace-published)
MemPalace MiniLM hybrid	92.63%	MiniLM-384	Full 1,986q (hit@10 recomputed from their released per-question data)
Hebb Mind (prod-mirror)	91.41%	MiniLM-384	Full 10 scenarios, 1,978q scored

A note on MemPalace's MiniLM number: their headline "88.9%" is mean per-question recall (|GT ∩ retrieved| / |GT|), not hit@10. Computed the same way as ours from their released data, the same run is 92.63% hit@10 (mean recall 0.889). We compare hit@10 to hit@10 throughout.
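The gap between the two metrics comes from questions with more than one evidence session. A small sketch on made-up data (the session IDs below are illustrative only) shows how the same retrieval results produce a higher hit rate than mean recall:

```python
from typing import List, Set, Tuple


def hit(gt: Set[str], retrieved: Set[str]) -> float:
    """1.0 if at least one ground-truth session was retrieved, else 0.0."""
    return 1.0 if gt & retrieved else 0.0


def recall(gt: Set[str], retrieved: Set[str]) -> float:
    """Per-question recall: |GT ∩ retrieved| / |GT|."""
    return len(gt & retrieved) / len(gt)


# (ground-truth sessions, sessions present in the top-10), toy values
questions: List[Tuple[Set[str], Set[str]]] = [
    ({"s1", "s2"}, {"s1", "s5"}),  # one of two evidence sessions found
    ({"s3"}, {"s3", "s4"}),        # the only evidence session found
    ({"s6"}, {"s7"}),              # miss
]

mean_hit = sum(hit(g, r) for g, r in questions) / len(questions)
mean_recall = sum(recall(g, r) for g, r in questions) / len(questions)

print(f"hit@10 = {mean_hit:.3f}, mean recall = {mean_recall:.3f}")
# prints "hit@10 = 0.667, mean recall = 0.500"
```

Per question, a hit is always at least as large as recall, so hit@10 can never fall below mean recall on the same run.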

Same-embedding deltas, no rerank (the only honest comparison):

Embedding	Hebb	MemPalace	Δ
bge-large-1024	94.14	92.40	+1.74 pp
MiniLM-384	91.41	92.63	−1.22 pp

Without rerank we lead at the bge-large tier but trail slightly at MiniLM — MemPalace's BM25-hybrid is well tuned for the weaker 384-dim embedder. Our shipped default uses bge-large, where the prod-mirror 3-path RRF (date-proximity boost, general-English synonym expansion, prev/next-turn window) already leads.
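For readers unfamiliar with the fusion step, the sketch below shows standard reciprocal rank fusion over several ranked lists. It assumes that "RRF" here means reciprocal rank fusion and that each retrieval path yields its own ranked list of memory IDs. The function name, the smoothing constant k=60 (a conventional default), and the toy rankings are assumptions, not values taken from the Hebb Mind source.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each item scores sum(1 / (k + rank)) over the lists.

    Ranks are 1-based. Items missing from a list contribute nothing for it.
    Ties are broken by item ID so the output is deterministic.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


# Three hypothetical retrieval paths returning memory IDs, best first.
path_a = ["a", "b", "c"]
path_b = ["b", "a", "d"]
path_c = ["b", "c", "a"]

fused = reciprocal_rank_fusion([path_a, path_b, path_c])
print(fused)  # prints ['b', 'a', 'c', 'd']
```

RRF uses only ranks, not raw scores, so paths with incomparable score scales can be combined without calibration.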

With rerank (R@10)

Hebb Mind now ships an optional local cross-encoder rerank (BAAI/bge-reranker-base, src/hebb/retrieval/rerank/) — no LLM API call, runs on CPU.
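A cross-encoder rerank of this kind can be sketched as below. This is an illustration under stated assumptions rather than the code in src/hebb/retrieval/rerank/: the function names are hypothetical, and the loader assumes the `sentence-transformers` package, whose `CrossEncoder.predict` scores a list of (query, text) pairs. The scoring function is passed in so the ranking logic can be exercised without downloading a model.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer takes (query, candidate_text) pairs and returns one score per pair.
ScoreFn = Callable[[Sequence[Tuple[str, str]]], Sequence[float]]


def rerank(query: str, candidates: List[str], score_fn: ScoreFn,
           top_k: int = 10) -> List[str]:
    """Reorder candidates by cross-encoder score, highest first, keep top_k."""
    if not candidates:
        return []
    pairs = [(query, text) for text in candidates]
    scores = score_fn(pairs)
    # sorted() is stable, so ties keep their first-stage order.
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_k]]


def load_bge_reranker() -> ScoreFn:
    """Build a CPU scorer from BAAI/bge-reranker-base.

    Assumes sentence-transformers is installed; the model weights are
    downloaded on first use.
    """
    from sentence_transformers import CrossEncoder

    model = CrossEncoder("BAAI/bge-reranker-base", device="cpu")
    return model.predict
```

Typical use would be `rerank(query, first_stage_texts, load_bge_reranker())`, where `first_stage_texts` is the candidate list from the no-rerank retrieval step.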

System	R@10	Rerank	Notes
MemPalace bge-large + Haiku rerank	96.30%	LLM (Claude Haiku)	Full 1,986q (MemPalace-published)
Hebb Mind bge-large + bge-reranker-base	95.75%	local cross-encoder	Full 10 scenarios
Hebb Mind MiniLM + bge-reranker-base	94.69%	local cross-encoder	Full 10 scenarios

Same-embedding deltas, Hebb Mind with rerank vs. MemPalace without rerank:

Embedding	Hebb (+ rerank)	MemPalace (no rerank)	Δ
bge-large-1024	95.75	92.40	+3.35 pp
MiniLM-384	94.69	92.63	+2.06 pp

The local cross-encoder lifts every embedding tier (bge-large +1.61, MiniLM +3.28, e5-small +2.43 pp) and reverses the MiniLM deficit — MiniLM + rerank (94.69%) now beats MemPalace's MiniLM hybrid by +2.06 pp and even edges past our own bge-large with no rerank (94.14%).

Against MemPalace's strongest published config (bge-large + Claude Haiku LLM rerank, 96.30%) we are −0.55 pp — and we close almost the entire gap with a free local cross-encoder instead of a per-query LLM call. (The previous version of this page reported −3.0 pp with "rerank not implemented; on roadmap"; it is now implemented.)
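The deltas quoted in this section follow directly from the R@10 values in the tables. A short check, with dictionary keys that are just labels chosen for this example:

```python
# R@10 values (percent) copied from the tables on this page.
scores = {
    "hebb_bge": 94.14,
    "hebb_minilm": 91.41,
    "hebb_bge_rerank": 95.75,
    "hebb_minilm_rerank": 94.69,
    "mp_bge": 92.40,
    "mp_minilm": 92.63,
    "mp_bge_haiku": 96.30,
}


def delta(a: str, b: str) -> float:
    """Difference a - b in percentage points, rounded to two decimals."""
    return round(scores[a] - scores[b], 2)


print(delta("hebb_bge", "mp_bge"))                  # no rerank, bge-large
print(delta("hebb_minilm", "mp_minilm"))            # no rerank, MiniLM
print(delta("hebb_bge_rerank", "mp_bge"))           # Hebb rerank vs MemPalace no rerank
print(delta("hebb_minilm_rerank", "mp_minilm"))     # Hebb rerank vs MemPalace no rerank
print(delta("hebb_bge_rerank", "hebb_bge"))         # rerank lift, bge-large
print(delta("hebb_minilm_rerank", "hebb_minilm"))   # rerank lift, MiniLM
print(delta("hebb_bge_rerank", "mp_bge_haiku"))     # vs MemPalace's best config
```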

Why the comparison is fair (and where it isn't)

Fair:

Same metric (session-level hit@10 via evidence intersection), computed identically for both sides
Same top_k=10
Same dataset, full 10/10 LoCoMo scenarios on both sides (1,978 of 1,986 questions scored after excluding 8 with empty/unparseable evidence — same exclusion policy MemPalace uses)

Not strictly fair:

We use prod-mirror per-utterance + per-pair memories (~875 memories per scenario, 8,755 total); MemPalace ingests one document per session (~19–32 documents per conversation). At equal top-k we search a much larger candidate pool, which is harder — but session granularity is what the metric scores at, so this favours their setup.
MemPalace's bge-large and Haiku-rerank figures are their own published numbers (no per-question data released); only their MiniLM run could be recomputed on our exact metric.
Embedding capacity explains the absolute score level, not the gap: the same-embedding deltas hold at both 384-dim and 1024-dim tiers.

Source

mempalace benchmark deep-dive §4 — source-code-level breakdown of MemPalace's hybrid v1–v5 pipeline, embedding sweep, LLM rerank schedule, and the benchmark-vs-production divergences that motivate the prod-parity callout above. Hebb Mind numbers: eval/reports/locomo/matrix/SUMMARY.md.
