LoCoMo — Hebb Mind vs mem0

Our headline LoCoMo metric is session-level retrieval Recall@10 (no judge in the loop):

System	Metric	Score	Source
Hebb Mind (bge-large + rerank)	Session R@10 (full 1,978q)	95.75%	LoCoMo
Hebb Mind (bge-large, default)	Session R@10 (full 1,978q)	94.14%	LoCoMo
Hebb Mind (MiniLM-384)	Session R@10 (full 1,978q)	91.41%	LoCoMo
mem0	—	does not publish a retrieval-recall number	mem0ai/mem0

mem0 publishes only end-to-end QA accuracy scored by an LLM judge (the "J score"), not retrieval recall, so there is no same-metric row for the table above. To compare on their metric we ran our own end-to-end QA on the identical question subset. That like-for-like comparison is below, followed by why the QA number, on either side, is hard to trust.

Like-for-like: same subset (cat 1–4), same metric (end-to-end QA)

mem0 evaluates only LoCoMo categories 1–4; the 446 adversarial (category 5) questions are excluded. This is their stated convention ("evaluation only on Categories 1–4") and the standard LoCoMo-QA practice, since category 5 is "the model should refuse" and has documented ground-truth problems. The subset is therefore 1,986 − 446 = 1,540 questions, not 1,986.

To remove the judge as a confounder, we score the same retrieved-and-generated answers two ways on that subset.

(a) Our own judge (DeepSeek-V4-Pro, strict — for a list answer it requires all/all-but-one items):

Hebb Mind end-to-end QA	Accuracy	Denominator
all categories	77.0%	1,986
categories 1–4 (mem0's subset)	73.8%	1,540
adversarial only	88.3%	446

(Our all-category figure is propped up by the adversarial questions, our strongest class; 73.8% on categories 1–4 is the number comparable to mem0's.)
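The three rows of the table above are tied together by a question-weighted average, which is why a strong adversarial score lifts the all-category figure. The following is a minimal sketch of that check, using only the accuracies and denominators reported above; the function name `pooled_accuracy` is hypothetical and is not part of the Hebb Mind harness.

```python
# Accuracies (%) and denominators copied from the table above.
CAT_1_4 = (73.8, 1540)      # categories 1-4 (mem0's subset)
ADVERSARIAL = (88.3, 446)   # category 5


def pooled_accuracy(*groups):
    """Question-weighted mean accuracy (%) over (accuracy, n_questions) pairs.

    Hypothetical helper for illustration only.
    """
    total_questions = sum(n for _, n in groups)
    return sum(acc * n for acc, n in groups) / total_questions


all_categories = pooled_accuracy(CAT_1_4, ADVERSARIAL)
print(f"all categories: about {all_categories:.2f}%")
```

Because the inputs are already rounded to one decimal place, the pooled value comes out near 77.06%, which agrees with the reported 77.0% to within input rounding.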

(b) mem0's exact judge prompt, copied verbatim from benchmarks/locomo/prompts.py — a much more lenient rubric ("≥1 correct item from the gold list = CORRECT", dates within 14 days pass, paraphrases / extra detail / same-referent all pass):

Hebb Mind, scored by mem0's judge prompt	cat 1–4
overall	77.9%
multi-hop (cat 1)	72.0%
temporal (cat 2)	72.3%
open-domain (cat 3)	34.4%
single-hop (cat 4)	86.9%

Swapping our judge for mem0's moves the number only +4.1 pp (73.8 → 77.9): their judge is more lenient, but that is not where the headroom is.

(c) vs mem0's published numbers, same subset, same judge prompt:

System	cat 1–4 QA	judge	vs Hebb 77.9%
Hebb Mind (bge-large + rerank)	77.9%	mem0's, verbatim	—
mem0 — arXiv paper	66.9%	mem0's	Hebb +11.0 pp
mem0 — graph (paper)	68.4%	mem0's	Hebb +9.5 pp
mem0 — README / marketing	91.6%	mem0's	mem0 +13.7 pp
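The percentage-point gaps in the last column of the table above, and the judge-swap effect from step (b), are plain differences of the reported scores. A short sketch that recomputes them is given here; the dictionary keys are labels chosen for this example only.

```python
# Scores (%) on categories 1-4, copied from the tables above.
HEBB_MEM0_JUDGE = 77.9   # Hebb Mind scored with mem0's judge prompt
HEBB_OWN_JUDGE = 73.8    # Hebb Mind scored with its own strict judge

published = {
    "mem0 arXiv paper": 66.9,
    "mem0 graph (paper)": 68.4,
    "mem0 README": 91.6,
}

# Positive gap: Hebb Mind is ahead. Negative gap: mem0 is ahead.
gaps = {name: round(HEBB_MEM0_JUDGE - score, 1) for name, score in published.items()}

# Effect of swapping only the judge prompt, with retrieval and answers fixed.
judge_swap = round(HEBB_MEM0_JUDGE - HEBB_OWN_JUDGE, 1)

for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f} pp")
print(f"judge swap: {judge_swap:+.1f} pp")
```

The README gap (13.7 pp) is more than three times the judge-swap effect (4.1 pp), which is the arithmetic behind the claim that leniency of the judge prompt alone cannot account for it.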

With mem0's own judge prompt on the same 1,540-question subset, our retrieval + answers beat the result in mem0's arXiv paper by +11.0 pp. The only mem0 figure above ours is the README's 91.6%, and the judge does not explain that gap (matching their judge prompt bought just +4.1 pp). Two things bear on it, neither related to retrieval quality:

mem0's 7-step chain-of-thought answer-generation prompt (entity verification, temporal disambiguation, inclusion checks, …). We used a plain one-shot answer prompt; their generation harness is doing the heavy lifting.
That 91.6% does not reproduce from mem0's public harness (see next section).

Remaining caveat: the judge model still differs (we run DeepSeek-V4-Pro; mem0 runs GPT-4o-mini) — but the judge prompt is now byte-for-byte identical.

Why the QA number — on either side — is hard to trust

The leaderboard is contested by every participant. Zep's audit, "Lies, Damn Lies, and Statistics: Is Mem0 Really SOTA in Agent Memory?", found that a full-context baseline (~73% J) beat mem0's best graph configuration (~68% J) — i.e. not using a memory system at all scored higher, because LoCoMo conversations (~16k–26k tokens) fit in a modern context window. For balance: mem0 published a rebuttal arguing Zep misconfigured their system, and Zep's own 84% LoCoMo claim was separately audited down to 58.44% (getzep/zep-papers #5).

mem0's own numbers don't reproduce from the public harness. mem0 #3944 reports an LLM score of ~0.20 running their eval with GPT-4o-mini (traced to the platform stamping memories with the current date instead of dataset timestamps); mem0 #2800 reproduces locally with scores "significantly lower than the ones I see in the paper."

The LoCoMo ground truth itself is broken. A public audit (dial481/locomo-audit, summarized in MemPalace #29) documents ~99 wrong, hallucinated, or misattributed gold answers across the ten conversations (honest ceiling ~93–94%, not 100%) and, decisively, that the LoCoMo LLM judge accepts up to ~63% of intentionally wrong answers. A judge that can green-light up to ~63% of deliberately wrong answers puts a large, systematic error bar on every LoCoMo J-score, mem0's and ours alike.
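To make the size of these effects concrete, here is a rough sketch of the arithmetic. It rests on two assumptions that are not stated in the audit summary above: that the ~99 bad gold answers are counted against the 1,540-question category 1–4 subset, and that the judge never rejects a correct answer while accepting wrong ones at a uniform rate. The second assumption is a deliberately pessimistic simplification, so the result is a worst-case floor and not an estimate of any system's real accuracy.

```python
WRONG_GOLD_ANSWERS = 99   # approximate count reported by the audit
SUBSET_SIZE = 1540        # ASSUMPTION: counted against the cat 1-4 subset

# Best score reachable if every bad gold answer counts as a miss.
ceiling_pct = 100 * (1 - WRONG_GOLD_ANSWERS / SUBSET_SIZE)


def worst_case_true_accuracy(observed_pct, false_accept_rate):
    """Lower bound (%) on true accuracy behind an observed judge score.

    ASSUMED model: observed = true + false_accept_rate * (1 - true),
    i.e. no false rejections and a uniform false-accept rate.
    """
    if not 0 <= false_accept_rate < 1:
        raise ValueError("false_accept_rate must be in [0, 1)")
    observed = observed_pct / 100
    true = (observed - false_accept_rate) / (1 - false_accept_rate)
    return 100 * max(0.0, true)


# Illustration with the 77.9% score and the "up to ~63%" false-accept rate.
floor_pct = worst_case_true_accuracy(77.9, 0.63)
print(f"ceiling about {ceiling_pct:.1f}%, worst-case floor about {floor_pct:.1f}%")
```

Under these assumptions the ceiling lands at roughly 93.6%, inside the ~93–94% range quoted above, while the worst-case floor for an observed 77.9% is near 40%. The true figure almost certainly sits well above that floor, since ordinary wrong answers are less persuasive than intentionally crafted ones, but the spread shows how little a J-score pins down on its own.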

This is why we report retrieval recall as the headline: the ground truth is the evidence session set, scored by set intersection, with no judge in the loop.
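A judge-free, set-intersection score of this kind can be sketched as follows. This is an illustration under stated assumptions, not the Hebb Mind eval/ harness: the function names, the session-id format, and the choice to count a question as a hit when at least one evidence session appears in the top k are all assumptions made for the example.

```python
from typing import Iterable, List, Sequence, Tuple


def session_hit_at_k(retrieved: Sequence[str], evidence: Iterable[str], k: int = 10) -> bool:
    """True if any gold evidence session is among the top-k retrieved sessions.

    ASSUMPTION: a non-empty intersection counts as a hit; a stricter variant
    would require every evidence session to be retrieved.
    """
    return bool(set(retrieved[:k]) & set(evidence))


def session_recall_at_k(runs: List[Tuple[Sequence[str], Iterable[str]]], k: int = 10) -> float:
    """Fraction of questions with a hit. Questions without evidence are skipped."""
    scored = [(retrieved, set(evidence)) for retrieved, evidence in runs]
    scored = [(retrieved, evidence) for retrieved, evidence in scored if evidence]
    if not scored:
        raise ValueError("no questions with evidence sessions")
    hits = sum(session_hit_at_k(retrieved, evidence, k) for retrieved, evidence in scored)
    return hits / len(scored)


# Toy data with hypothetical session ids: (ranked retrieval, gold evidence).
toy_runs = [
    (["s4", "s1", "s9"], {"s1"}),        # hit
    (["s2", "s7", "s3"], {"s5", "s7"}),  # hit (one of two evidence sessions)
    (["s8", "s6", "s2"], {"s1"}),        # miss
]
score = session_recall_at_k(toy_runs, k=10)
print(f"session hit rate at 10: {score:.3f}")
```

Because the score depends only on ranked session ids and the gold evidence set, two systems run through the same function are directly comparable without any LLM judge.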

What a same-metric retrieval comparison would need

To put mem0 in the R@10 table above we would:

Run mem0 through the Hebb Mind eval/ retrieval-recall harness so both systems use session-level hit@10 via evidence intersection.
Run the full 1,978-question set on both sides (same exclusion policy used elsewhere on this site).
Disclose mem0's version and embedding model.

This is on the roadmap. Open a PR if you have already done a same-harness mem0 run.

Sources

Mem0 paper — arXiv 2504.19413
mem0 judge + answer-generation prompts (used verbatim for the same-judge test) — memory-benchmarks/benchmarks/locomo/prompts.py
Zep audit of mem0 — blog.getzep.com
Zep's own claim audited — getzep/zep-papers #5
mem0 reproducibility — mem0 #3944, mem0 #2800
LoCoMo ground-truth audit — dial481/locomo-audit, MemPalace #29
Hebb Mind numbers — eval/reports/locomo/matrix/SUMMARY.md (retrieval), eval/reports/locomo_qa/locomo-qa/v1/run-2/ (QA)
