LoCoMo

LoCoMo (snap-research/locomo) is a benchmark of multi-session conversations between two personas, with 1,986 questions across single-hop, multi-hop, temporal, open-ended, and adversarial categories. Each conversation spans 19–32 sessions; the questions test whether a memory system can surface facts that were established many sessions earlier.

We report LoCoMo as session-level Recall@10 — the same metric MemPalace publishes, so the numbers compare directly.

Production parity. Ingestion calls the exact same Claude Code hook code paths that fire in real usage (src/hebb/integrations/claude_code/write.py + stop.py), and retrieval goes through the same /api/v1/search endpoint that the MCP server, CLI, and Web Console use. The numbers below are what a user actually gets in production — not an idealised eval-only configuration. See vs MemPalace for how this contrasts with eval setups that diverge from their own production pipeline.

Session-level Recall@10

No LLM at scoring time. The question is "did the retrieved set surface a memory tagged with any of the evidence sessions?". Ingestion mirrors the production Claude Code hooks (integrations/claude_code/{write,stop}.py): one memory per user utterance + one memory per turn round-trip with an ISO timestamp prefix, no chunking, no image captions. Search uses prev_turns=2 / next_turns=2 context-window expansion, a date-proximity boost on query timestamps (src/hebb/retrieval/temporal_boost.py), and a general-English synonym group expander inside the FTS query builder (src/hebb/retrieval/fts_query.py). An optional local cross-encoder rerank pass (src/hebb/retrieval/rerank/, BAAI/bge-reranker-base) can be enabled on top.
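The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the harness code: the `Question` type, the function names, and the reading of "mean recall" as the fraction of evidence sessions covered by the top k are all assumptions made for the example, and every question is assumed to have at least one evidence session.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Question:
    """Hypothetical container: a question id plus its gold evidence sessions."""
    qid: str
    evidence_sessions: FrozenSet[str]


def session_hit(retrieved: List[str], evidence: FrozenSet[str], k: int = 10) -> bool:
    """True if any of the top-k retrieved memories is tagged with an evidence session."""
    return any(session in evidence for session in retrieved[:k])


def session_recall(retrieved: List[str], evidence: FrozenSet[str], k: int = 10) -> float:
    """Fraction of evidence sessions covered by the top-k (assumed meaning of 'mean recall')."""
    return len(set(retrieved[:k]) & evidence) / len(evidence)


def score(questions: List[Question],
          retrieved: Dict[str, List[str]],
          k: int = 10) -> Tuple[float, float]:
    """Return (Recall@k, mean recall) over questions that all have non-empty evidence."""
    n = len(questions)
    hits = sum(session_hit(retrieved[q.qid], q.evidence_sessions, k) for q in questions)
    recall = sum(session_recall(retrieved[q.qid], q.evidence_sessions, k) for q in questions)
    return hits / n, recall / n
```

`retrieved` maps each question id to the session tags of its ranked results, best first. No model is called anywhere in the scoring path, which is the point of the metric.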

Embedding × rerank sweep (full 10 scenarios, 1,978q)

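| Config                | Embedding      | Rerank            | R@10   | Mean recall |
|-----------------------|----------------|-------------------|--------|-------------|
| prod-mirror + rerank  | bge-large-1024 | bge-reranker-base | 95.75% | 0.917       |
| prod-mirror + rerank  | MiniLM-384     | bge-reranker-base | 94.69% | 0.902       |
| prod-mirror + rerank  | e5-small-384   | bge-reranker-base | 94.44% | 0.903       |
| prod-mirror (default) | bge-large-1024 | off               | 94.14% | 0.899       |
| prod-mirror           | e5-small-384   | off               | 92.01% | 0.870       |
| prod-mirror           | jina-v3-1024   | off               | 92.01% | 0.870       |
| prod-mirror           | MiniLM-384     | off               | 91.41% | 0.865       |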

Source: eval/reports/locomo/matrix/<config>/locomo/v4/run-1/ and eval/reports/locomo/matrix/SUMMARY.md. Denominator is 1,978 not 1,986 because 8 questions carry empty/unparseable evidence (adversarial-by-design); per MemPalace convention they are excluded from the R@k denominator.
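The denominator rule can be illustrated as follows. This sketch assumes evidence references look like `D<session>:<turn>` (for example `D4:12`) and that each question is a dict with an optional `evidence` list; both the format and the function names are assumptions for the example, not a description of the harness.

```python
import re
from typing import Dict, List, Set, Tuple

# Assumed reference shape: "D<session>:<turn>", e.g. "D4:12".
_EVIDENCE_REF = re.compile(r"^D(\d+):(\d+)$")


def evidence_sessions(refs: List[str]) -> Set[int]:
    """Collect session numbers from the references that parse; ignore the rest."""
    sessions = set()
    for ref in refs:
        match = _EVIDENCE_REF.match(ref.strip())
        if match:
            sessions.add(int(match.group(1)))
    return sessions


def split_scorable(questions: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Partition questions into (scorable, excluded) by whether any evidence parses."""
    scorable, excluded = [], []
    for question in questions:
        refs = question.get("evidence") or []
        (scorable if evidence_sessions(refs) else excluded).append(question)
    return scorable, excluded
```

Only the `scorable` list contributes to the R@k denominator; questions in `excluded` are the ones with empty or unparseable evidence.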

The shipped default (bge-large-1024, rerank off) scores 94.14%. Enabling the optional cross-encoder rerank lifts it to 95.75% — the best configuration.
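A rerank pass of this kind re-scores each (query, candidate) pair with a cross-encoder and reorders the candidates before truncating to the top k. The sketch below is a hypothetical illustration rather than the code in src/hebb/retrieval/rerank/: the scorer is passed in as a callable, so the example runs with a toy word-overlap scorer and needs no model download.

```python
from typing import Callable, List, Sequence, Tuple

PairScorer = Callable[[List[Tuple[str, str]]], Sequence[float]]


def rerank(query: str,
           candidates: Sequence[str],
           score_pairs: PairScorer,
           top_k: int = 10) -> List[str]:
    """Reorder candidates by pairwise relevance score, highest first, and keep top_k."""
    pairs = [(query, candidate) for candidate in candidates]
    scores = score_pairs(pairs)
    # sorted() is stable, so ties keep their first-stage retrieval order.
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_k]]


def overlap_scorer(pairs: List[Tuple[str, str]]) -> List[float]:
    """Toy stand-in for a cross-encoder: counts lowercase words shared by each pair."""
    return [float(len(set(q.lower().split()) & set(d.lower().split()))) for q, d in pairs]
```

With the sentence-transformers package installed, `CrossEncoder("BAAI/bge-reranker-base").predict` accepts the same list of pairs and could be passed as `score_pairs` in place of the toy scorer.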

Rerank lift

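| Embedding      | No rerank | + rerank | Δ        |
|----------------|-----------|----------|----------|
| bge-large-1024 | 94.14%    | 95.75%   | +1.61 pp |
| MiniLM-384     | 91.41%    | 94.69%   | +3.28 pp |
| e5-small-384   | 92.01%    | 94.44%   | +2.43 pp |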

Rerank helps at every embedding tier, and helps the cheaper 384-dim embedders most. It narrows the gap between MiniLM-384 and bge-large-1024 from 2.73 pp (no rerank) to 1.06 pp (both reranked), and MiniLM-384 + rerank (94.69%) edges past bge-large-1024 with no rerank (94.14%).
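The lift figures are plain differences in percentage points and can be rechecked from the R@10 values reported in the sweep. A small sketch, with variable names chosen only for this example:

```python
# R@10 (%) per embedding as (no rerank, with rerank), copied from the sweep results.
r_at_10 = {
    "bge-large-1024": (94.14, 95.75),
    "MiniLM-384": (91.41, 94.69),
    "e5-small-384": (92.01, 94.44),
}

# Lift in percentage points, rounded to two decimals to absorb float noise.
lift_pp = {name: round(after - before, 2) for name, (before, after) in r_at_10.items()}

# Embedding that gains the most from reranking.
biggest_gain = max(lift_pp, key=lift_pp.get)

# How far reranked MiniLM-384 sits from bge-large-1024 without rerank.
margin_pp = round(r_at_10["MiniLM-384"][1] - r_at_10["bge-large-1024"][0], 2)

for name, delta in lift_pp.items():
    print(f"{name}: +{delta:.2f} pp")
```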

Per-category (bge-large-1024 + rerank, headline)

| Category    | R@10  |
|-------------|-------|
| open_ended  | 98.2% |
| adversarial | 97.3% |
| multi_hop   | 94.1% |
| single_hop  | 92.9% |
| temporal    | 79.8% |

Temporal lags because LoCoMo "temporal" questions are largely inferential ("Would X be considered Y?") rather than time-anchored, so the date-proximity boost rarely fires; every other category is ≥ 92%.

Per-competitor comparisons

  • vs MemPalace — same-metric R@10
  • vs mem0 — TBD (different judge / scoring; same-harness re-run pending)
  • vs Letta — TBD (no public LoCoMo result found)
  • vs Zep — same-metric QA: ~tied (Hebb ~74–78% vs Zep 75.14% on cat 1–4); Zep publishes no R@10

Released under the MIT License.