Teaching Memory to Find Itself

How Ditto's Seed Memories v4 finds the right memory from a vague question: a per-user query adapter and a subject graph that together lift Recall@1 by 7.6 points, train on CPU in seconds, and stay under a megabyte per user.

When you ask Ditto something indirect, like “what did I decide about that thing a while back?”, it has to find one specific memory out of thousands. There’s no keyword to match on, no obvious filter. Just a fuzzy question and a haystack of everything you’ve ever told it.

Ditto already does this with a strong, frozen text encoder (text-embedding-005, 768 dimensions) and cosine similarity: embed the question, embed every memory, rank by closeness. That baseline is good. The research question for Seed Memories v4 was narrower and more interesting: can a small, cheap, per-user model make it meaningfully better, without retraining the encoder, without a GPU, without shipping anything heavy?

The answer is yes, and it comes from two ideas that stack. One reads your knowledge graph. One is a tiny model that’s yours and only yours. Here’s the whole system in one picture:

Seed Memories v4 overview: a vague question is embedded by a frozen encoder, passed through a per-user query adapter f(q), then scored two ways (dense cosine against every memory and an additive subject-graph signal) and fused into a single ranking handed to the answering model. Result badges: +7.6pp Recall@1, +5.9pp MRR.

Everything to the left of the green f(q) node is shared and frozen. Everything to the right is cheap arithmetic over vectors you already have. The two scoring lanes, dense similarity and the subject graph, get added together, never multiplied, for a reason we’ll get to.

What we measured against

All the numbers below come from one real user’s memory store: about 2,200 real memories and roughly 200 held-out “hard” questions, the indirect, vague kind. For each question we know which memory answers it, but that label is hidden from the ranker. We report the four standard retrieval metrics (Recall@1, Recall@5, Recall@10, and MRR), with differences in percentage points:

  • Recall@1: was the single right memory the very top result?
  • Recall@5 / Recall@10: did it land anywhere in the top 5 / top 10?
  • MRR: how high up was the first correct hit? (Position 1 scores 1.0, position 2 scores 0.5, and so on.)
  • pp is a percentage point, the unit we use for differences between scores: 0.606 → 0.682 is +7.6 pp.

The dense baseline scores 0.606 Recall@1 and 0.705 MRR. That’s the bar to clear.
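To make the metrics concrete, here is a minimal Python sketch of how a dense baseline could be scored. It is an illustration, not Ditto’s evaluation code: the toy 3-d vectors, the memory ids, and the helper names (`gold_rank`, `recall_at_k`, `mean_reciprocal_rank`) are all made up for the example, and real embeddings would be 768-d vectors from the frozen encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def gold_rank(query_vec, gold_id, memory_vecs):
    """1-based position of the gold memory when memories are ranked by cosine."""
    ranked = sorted(memory_vecs, key=lambda mid: cosine(query_vec, memory_vecs[mid]), reverse=True)
    return ranked.index(gold_id) + 1

def recall_at_k(gold_ranks, k):
    """Fraction of questions whose gold memory landed in the top k."""
    return sum(1 for r in gold_ranks if r <= k) / len(gold_ranks)

def mean_reciprocal_rank(gold_ranks):
    """Average of 1/rank: position 1 scores 1.0, position 2 scores 0.5, and so on."""
    return sum(1.0 / r for r in gold_ranks) / len(gold_ranks)

# Toy stand-ins for embedded memories (hypothetical ids and values).
memories = {
    "annual_billing": [0.9, 0.1, 0.0],
    "board_deck": [0.1, 0.9, 0.1],
    "lunch_order": [0.0, 0.1, 0.9],
}
# Each held-out question is (query embedding, id of the memory that answers it).
questions = [
    ([0.8, 0.3, 0.1], "annual_billing"),
    ([0.5, 0.6, 0.1], "annual_billing"),
    ([0.1, 0.2, 0.9], "lunch_order"),
]

gold_ranks = [gold_rank(q, gold, memories) for q, gold in questions]
print(gold_ranks)  # [1, 2, 1]
print(recall_at_k(gold_ranks, 1), mean_reciprocal_rank(gold_ranks))
```

The second question is the “hard” kind: the right memory is only the runner-up, so it counts toward Recall@5 but not Recall@1, and contributes 0.5 to MRR.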

What works, part 1: the subject graph

Every Ditto user already has a knowledge graph. As you talk, Ditto extracts subjects (typed topics, people, projects, preferences) and links each one to the memories it touches. (That’s the dreaming pipeline at work.) We’d already seen that matching a query to subjects, rather than only to raw memories, was a powerful signal. v4 leans on it directly.

Subject fusion: the query f(q) matches its top-40 subjects by cosine; strong subject matches (Pricing decision, Q3 Planning) light up brass edges to the memories they link to (Switched Acme to annual billing, Pricing A/B test results, Board deck v3). The score is the dense cosine plus lambda times the best subject match, additively.

Instead of generating anything, we match the query to its closest subjects and then lift the memories those subjects link to. The fused score is:

$$\text{score}(q, m) = \cos(q, m) \;+\; \lambda \cdot \max_{\,s \,\in\, \text{subjects}(m)\, \cap\, \text{top}_N(q)} \cos(q, s)$$

Three design choices matter here:

  • It’s additive, not multiplicative. The subject term only adds to a memory’s own dense score, and the boost is capped at λ, so a strong, direct query match is never wiped out because its subject signal is missing or noisy. Multiplying would let graph noise tank a strong match (or let one strong subject hit inflate a weak match out of proportion); adding keeps the dense score as the floor.
  • Fixed subjects can’t fabricate. The subjects are real entries in your graph. There’s nothing to hallucinate: the signal is grounded in topics you actually have.
  • It’s free of training. Subject vectors live in the same 768-d space as memories. We embed them once. Scoring is a max and an add.

With λ ≈ 0.4 over the query’s top-40 subjects, subject fusion alone delivers +6.6 pp Recall@1 (0.606 → 0.672) and +4.5 pp MRR. No model, no fitting, no per-user state beyond the graph you already have.
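The fused score is small enough to sketch in a few lines of Python. This is a toy illustration under stated assumptions, not the production scorer: the vectors, ids, and the `fused_scores` helper are hypothetical, and treating a memory with no subject in the top-N as “no boost” is our reading of the max over an empty set.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def fused_scores(q, memories, subjects, memory_subjects, lam=0.4, top_n=40):
    """score(q, m) = cos(q, m) + lam * best match among m's subjects in the query's top-N."""
    subject_sims = {sid: cosine(q, vec) for sid, vec in subjects.items()}
    top = set(sorted(subject_sims, key=subject_sims.get, reverse=True)[:top_n])
    scores = {}
    for mid, vec in memories.items():
        matched = [subject_sims[s] for s in memory_subjects.get(mid, ()) if s in top]
        # Assumption: a memory with no subject in the top-N gets no boost.
        boost = max(matched) if matched else 0.0
        scores[mid] = cosine(q, vec) + lam * boost
    return scores

# Toy data: the vague query sits closer to the wrong memory in dense space,
# but its best-matching subject links to the right one.
q = [1.0, 0.0, 0.0]
memories = {"annual_billing": [0.6, 0.8, 0.0], "lunch_order": [0.7, 0.0, 0.7]}
subjects = {"pricing_decision": [0.9, 0.4, 0.0], "food": [0.1, 0.0, 1.0]}
memory_subjects = {"annual_billing": ["pricing_decision"], "lunch_order": ["food"]}

dense = {mid: cosine(q, vec) for mid, vec in memories.items()}
fused = fused_scores(q, memories, subjects, memory_subjects)
print(max(dense, key=dense.get))  # lunch_order
print(max(fused, key=fused.get))  # annual_billing
```

Dense similarity alone ranks the wrong memory first here; the subject boost flips the order without lowering any memory’s score.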

What works, part 2: a tiny per-user adapter

Subject fusion improves the scoring. The second idea improves the query itself, and this is where each user gets their own model.

It is not a language model and not a fine-tune of the encoder. It’s a small residual MLP that nudges a query embedding toward the region of space where that user’s memories live, following Google’s “Search-Adaptor” recipe. The base encoder stays frozen and shared; only this little adapter is personal.

The query adapter: a residual MLP takes the 768-d query through Linear 768 to 128, GELU plus dropout, Linear 128 to 768, scaled by alpha=0.3 and added back to the original query via a skip connection, then normalized to give f(q). It is initialized near identity so f(q) starts equal to q. The contrastive loss has three terms (InfoNCE to the answering memory, an auxiliary subject term, and a reconstruction anchor), and the model is about 200k parameters, trains on CPU in seconds, and is sub-megabyte per user.

The adapter is a residual shift:

$$f(q) = \text{normalize}\!\big(\, q + \alpha \cdot \mathrm{MLP}_{128}(q) \,\big), \qquad \mathrm{MLP}_{128}:\; 768 \to 128 \to 768$$

It’s initialized so the second linear layer starts at zero, meaning f(q) begins exactly equal to q, and training only ever moves it as far as the data justifies. With α = 0.3 and a 128-dim bottleneck, the whole thing is about 200,000 parameters: it trains on a CPU in seconds and saves to under a megabyte.
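Here is a dependency-free Python sketch of that forward pass, plus the parameter arithmetic. It is a sketch under assumptions rather than the shipped model: the `QueryAdapter` class, the 0.02 init scale, and the float32 size estimate are ours, dropout is left out because it only applies during training, and the demo uses a 4-d query so it runs instantly.

```python
import math
import random

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

class QueryAdapter:
    """f(q) = normalize(q + alpha * MLP(q)), with MLP: dim -> hidden -> dim."""

    def __init__(self, dim, hidden, alpha=0.3, seed=0):
        rng = random.Random(seed)
        self.alpha = alpha
        # First layer gets small random weights (the 0.02 scale is an assumption).
        self.w1 = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        # Second layer starts at zero, so the residual is zero before training.
        self.w2 = [[0.0] * hidden for _ in range(dim)]
        self.b2 = [0.0] * dim

    def __call__(self, q):
        h = [gelu(sum(w * x for w, x in zip(row, q)) + b) for row, b in zip(self.w1, self.b1)]
        delta = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(self.w2, self.b2)]
        return normalize([x + self.alpha * d for x, d in zip(q, delta)])

def param_count(dim, hidden):
    """Weights and biases of both linear layers."""
    return dim * hidden + hidden + hidden * dim + dim

q = normalize([0.2, -0.5, 0.8, 0.1])
adapter = QueryAdapter(dim=4, hidden=2)
print(adapter(q))  # matches q up to rounding: the untrained adapter is the identity

n_params = param_count(768, 128)
print(n_params)      # 197504
print(n_params * 4)  # bytes if stored as float32 (an assumption): 790016
```

At the real sizes the count comes to 197,504 parameters, which is where the “about 200,000” figure comes from, and at four bytes per parameter that lands comfortably under a megabyte.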

It learns from your own (query → answering memory) pairs with a three-part contrastive loss:

$$\mathcal{L} = \underbrace{\mathrm{InfoNCE}\big(f(q), m^{+}\big)}_{\text{pull toward the right memory}} \;+\; w_{\text{subj}}\,\mathrm{InfoNCE}\big(f(q), s^{+}\big) \;+\; w_{\text{recon}}\,\big\lVert f(q) - q \big\rVert^{2}$$

where InfoNCE is the standard contrastive term:

$$\mathrm{InfoNCE}\big(f(q), m^{+}\big) = -\log \frac{\exp\!\big(\cos(f(q), m^{+})/\tau\big)}{\sum_j \exp\!\big(\cos(f(q), m_j)/\tau\big)}$$

The first term pulls the adapted query toward the memory that answered it and pushes it away from everything else in the batch, with temperature τ = 0.05. The second is a gentle auxiliary pull toward the right subject (w_subj = 0.2). The third, the reconstruction anchor (w_recon = 0.5), is the one that makes this safe on small, personal data: it penalizes f(q) for wandering away from q, so a few thousand pairs can reshape the geometry without distorting it. Our first attempts without that anchor actually made recall worse; adding it alongside the bottleneck, dropout, weight decay, and early stopping turned the adapter into a reliable, monotone gain.
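For readers who prefer code to notation, the loss for a single training pair can be sketched like this. The function names and the toy 2-d vectors are hypothetical, and using the same temperature for the subject term as for the memory term is an assumption on our part.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def info_nce(anchor, positive_idx, candidates, tau=0.05):
    """-log softmax of the positive's cos/tau against every candidate."""
    logits = [cosine(anchor, c) / tau for c in candidates]
    peak = max(logits)  # subtract the max before exp for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[positive_idx]

def adapter_loss(fq, q, memories, mem_pos, subjects, subj_pos,
                 w_subj=0.2, w_recon=0.5, tau=0.05):
    """InfoNCE to the gold memory + weighted InfoNCE to the gold subject + anchor to q."""
    recon = sum((a - b) ** 2 for a, b in zip(fq, q))  # squared distance ||f(q) - q||^2
    return (info_nce(fq, mem_pos, memories, tau)
            + w_subj * info_nce(fq, subj_pos, subjects, tau)  # same tau: an assumption
            + w_recon * recon)

# Toy pair: the adapted query has moved toward the gold memory (index 0).
q = [1.0, 0.0]
fq = [0.8, 0.6]
batch_memories = [[0.6, 0.8], [1.0, 0.0]]
batch_subjects = [[0.5, 0.5], [1.0, 0.0]]
print(adapter_loss(fq, q, batch_memories, 0, batch_subjects, 0))
```

The reconstruction term is the only one that grows as f(q) moves away from q, which is what keeps the two contrastive pulls from dragging the query too far.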

Pair selection matters

The negatives the adapter trains against come from the subject graph: memories that share a subject with the gold answer but aren’t it, the genuinely confusable ones. The catch is that some of those “negatives” are really just other correct answers. So we drop any candidate that’s too close to the gold:

$$\text{keep negative } n \iff \cos(q, n) < 0.95 \cdot \cos(q, \text{gold})$$

Anything that scores at least 95% of the gold’s similarity to the query is almost certainly a true match in disguise, and training against it would teach the model exactly the wrong thing.
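The selection rule can be sketched as a small filter. The `select_negatives` helper, its `margin` parameter name, and the toy data are illustrative assumptions; only the shared-subject condition and the 0.95 threshold come from the rule above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def select_negatives(q, gold_id, memories, memory_subjects, margin=0.95):
    """Memories that share a subject with the gold but sit clearly below it in similarity."""
    gold_sim = cosine(q, memories[gold_id])
    gold_subjects = set(memory_subjects[gold_id])
    negatives = []
    for mid, vec in memories.items():
        if mid == gold_id:
            continue
        if not gold_subjects & set(memory_subjects.get(mid, ())):
            continue  # not confusable: no subject in common with the gold
        if cosine(q, vec) < margin * gold_sim:
            negatives.append(mid)  # far enough from the gold to be a safe negative
    return negatives

q = [1.0, 0.0]
memories = {
    "gold": [0.8, 0.6],          # cos 0.80, so the cutoff is 0.76
    "near_duplicate": [0.79, 0.61],  # cos ~0.79: probably another right answer
    "confusable": [0.5, 0.866],  # cos ~0.50: same subject, clearly different
    "unrelated": [0.1, 0.99],    # different subject entirely
}
memory_subjects = {
    "gold": ["pricing"],
    "near_duplicate": ["pricing"],
    "confusable": ["pricing"],
    "unrelated": ["food"],
}
print(select_negatives(q, "gold", memories, memory_subjects))  # ['confusable']
```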

Stacking them: Seed Memories v4

The full system adapts the query, scores it densely, and adds the subject signal, all at once:

$$\text{score}(q, m) = \cos\!\big(f(q), m\big) \;+\; \lambda \cdot \text{subject\_match}\!\big(f(q), m\big)$$
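Written out in full, and reading subject_match as the max term from the subject-fusion formula with the adapted query f(q) in place of q (consistent with the figure description, where f(q) is what gets matched against its top-40 subjects), the stacked score would be:

```latex
\text{score}(q, m)
  = \cos\!\big(f(q), m\big)
  \;+\; \lambda \cdot \max_{\,s \,\in\, \text{subjects}(m)\, \cap\, \text{top}_N(f(q))} \cos\!\big(f(q), s\big),
\qquad \lambda \approx 0.4,\quad N = 40
```

The adapter changes only the query side: memory and subject vectors stay exactly as the frozen encoder produced them.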

Results bar chart: Recall@1 rises from 0.606 (baseline) to 0.672 (+subject fusion, +6.6pp) to 0.682 (+query adapter, +7.6pp); MRR rises from 0.705 to 0.750 (+4.5pp) to 0.764 (+5.9pp).

On the held-out questions:

| method | Recall@1 | Recall@5 | Recall@10 | MRR |
| --- | --- | --- | --- | --- |
| baseline (raw query) | 0.606 | 0.818 | 0.894 | 0.705 |
| + subject fusion | 0.672 | 0.859 | 0.889 | 0.750 |
| + subject fusion & query adapter | 0.682 | 0.848 | 0.909 | 0.764 |

The combined system lands +7.6 pp Recall@1 (a 12.5% relative jump), +5.9 pp MRR, and +1.5 pp Recall@10 over the dense baseline. Notice that subject fusion alone already captures most of the Recall@1 win, which is exactly what you want, because it’s the half that needs no per-user training at all.
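The headline deltas follow directly from the table. A quick arithmetic check in Python, with the reported values copied into a dictionary (the key names and helper functions are ours):

```python
# Reported held-out results, copied from the table above.
results = {
    "baseline": {"R@1": 0.606, "R@5": 0.818, "R@10": 0.894, "MRR": 0.705},
    "subject_fusion": {"R@1": 0.672, "R@5": 0.859, "R@10": 0.889, "MRR": 0.750},
    "fusion_plus_adapter": {"R@1": 0.682, "R@5": 0.848, "R@10": 0.909, "MRR": 0.764},
}

def delta_pp(method, metric):
    """Gain over the baseline in percentage points."""
    return (results[method][metric] - results["baseline"][metric]) * 100

def relative_gain(method, metric):
    """Gain over the baseline as a percentage of the baseline score."""
    return delta_pp(method, metric) / results["baseline"][metric]

print(round(delta_pp("fusion_plus_adapter", "R@1"), 1))       # 7.6
print(round(delta_pp("fusion_plus_adapter", "MRR"), 1))       # 5.9
print(round(delta_pp("fusion_plus_adapter", "R@10"), 1))      # 1.5
print(round(relative_gain("fusion_plus_adapter", "R@1"), 1))  # 12.5

# Share of the combined Recall@1 gain that subject fusion delivers on its own.
share = delta_pp("subject_fusion", "R@1") / delta_pp("fusion_plus_adapter", "R@1")
print(round(share, 2))  # 0.87
```

Roughly 87% of the Recall@1 gain is already there with subject fusion alone.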

A negative we’re glad we checked

Before settling, we tried the obvious “smarter graph” idea: Personalized PageRank over the subject↔memory graph, the technique behind HippoRAG’s strong multi-hop numbers. On our task it was clearly worse: Recall@1 fell from 0.68 to roughly 0.42–0.54.

The reason is instructive. PageRank shines when answering a question means traversing several entities across multiple hops. Finding one memory is a single hop, and PageRank’s probability diffusion floods well-connected “hub” memories, burying the precise target. For a single-hop task like ours, the part of HippoRAG that matters isn’t the diffusion; it’s the query-to-entity linking, which is exactly what our subject fusion already does. We kept the part that works and dropped the part that doesn’t.
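The hub effect is easy to reproduce on a toy graph. The sketch below is a generic random walk with restart written for illustration, not HippoRAG’s pipeline or our experiment code; the graph, the node names, and the 0.85 damping factor are assumptions.

```python
def personalized_pagerank(edges, seeds, damping=0.85, iters=100):
    """Power iteration for a random walk with restart on an undirected graph."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    restart = {n: seeds.get(n, 0.0) for n in neighbors}
    rank = dict(restart)
    for _ in range(iters):
        # Restart mass goes back to the seed nodes; the rest flows along edges.
        nxt = {n: (1.0 - damping) * restart[n] for n in neighbors}
        for n, nbs in neighbors.items():
            share = damping * rank[n] / len(nbs)
            for nb in nbs:
                nxt[nb] += share
        rank = nxt
    return rank

edges = [
    ("subj:pricing", "mem:annual_billing"),  # the target: one precise link
    ("subj:pricing", "mem:board_deck"),      # the hub: linked to every subject
    ("subj:q3_planning", "mem:board_deck"),
    ("subj:acme", "mem:board_deck"),
]
# The query matched a single subject, so all restart mass starts there.
rank = personalized_pagerank(edges, seeds={"subj:pricing": 1.0})
print(rank["mem:board_deck"] > rank["mem:annual_billing"])  # True
```

Both memories are one hop from the matched subject, yet the hub ends up with roughly twice the mass because probability keeps recirculating through its extra links.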

Shipping it for everyone

None of this is expensive to run for real users, because nothing heavy is per-user:

  • The encoder is shared and frozen: one model for everyone.
  • Each user gets a small subject-vector index (which the knowledge graph already maintains) and a sub-megabyte adapter.
  • New users start with subject fusion alone. It needs no training and works from the very first memory.
  • Once a user has accumulated enough (query → memory) supervision to train without overfitting, they get a personal adapter. It is refit on a schedule (think weekly) as their memory grows, and a new adapter only ships if it beats the old one on held-out validation. No silent regressions.
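The rollout rule in the last two bullets amounts to a small gate. The sketch below is one way to write it, not the production logic: the function name, the `min_pairs` threshold (the post does not give a number), and the use of a single validation score are all assumptions. For a user’s first adapter, we assume the score to beat is subject fusion alone.

```python
def rollout_decision(num_pairs, min_pairs, serving_score, candidate_score):
    """Decide whether a freshly trained adapter replaces what is serving today.

    serving_score: held-out validation score of the current setup
        (the old adapter, or subject fusion alone for a first adapter).
    candidate_score: held-out validation score of the new adapter.
    """
    if num_pairs < min_pairs:
        return "keep"  # not enough supervision to train without overfitting
    if candidate_score > serving_score:
        return "ship"  # strictly better on held-out validation
    return "keep"      # ties and regressions never ship

# Hypothetical numbers, for illustration only.
print(rollout_decision(num_pairs=50, min_pairs=500, serving_score=0.67, candidate_score=0.70))    # keep
print(rollout_decision(num_pairs=2000, min_pairs=500, serving_score=0.67, candidate_score=0.68))  # ship
print(rollout_decision(num_pairs=2000, min_pairs=500, serving_score=0.68, candidate_score=0.66))  # keep
```

Because the gate only ever swaps in a strictly better adapter, the serving configuration can stay the same or improve on validation, never regress.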

That’s Seed Memories v4: a frozen shared encoder, a graph signal that’s free, and a personal model that costs one tiny MLP. Better recall on vague questions, and the worst case is “no worse than today.”

Coming soon: Generative Memories

There’s a third idea we’re building, and it’s the one I’m most excited about.

Imagine a per-user model that doesn’t just shift your query, but recalls in your own words, generating a fuzzy draft of the memory you’re reaching for, in your phrasing, with your specifics, and using that to retrieve. It’s the natural endgame for personal memory: a model that knows you well enough to finish your sentence before you do.

Coming soon, Generative Memories: a large personal memory store feeds a deliberately overfit per-user generator, which produces grounded fuzzy recalls that carry your true specifics and blend into v4 scoring. Three notes explain why deliberate overfitting plus a large memory database turn hallucination into recall, why scale earns the right to overfit, and why the per-user model is private by construction.

The whole game here is turning hallucination into recall, and it rests on two things that sound like sins and are actually the design:

Overfitting is the feature. A generic model asked to “imagine the memory” will produce the right theme with invented details: plausible dates, plausible names, none of them yours. That noise drifts the query away from the real entry. But a model deliberately overfit to one person’s memories stops inventing: when it generates a fuzzy recall, the specifics it reaches for are the ones it memorized, your real decisions, your real people. Expansion only helps a retriever when the model has memorized the corpus it’s expanding into. So we memorize it, on purpose, per user.

A large memory database earns the right to overfit. Overfit a tiny store and you memorize noise. Overfit a large one and you memorize signal: there’s enough true ground truth that the model can lock onto real entries instead of artifacts. This is the same gate the query adapter already respects: wait until there are enough pairs, then fit hard. The more you’ve trusted Ditto with, the sharper this gets, and because the model is per-user and never shared, the only specifics it can ever memorize are your own.

It’s still a work in progress. Training and serving a generative model per user is real engineering work, and we’re validating it the same way we validate everything else: it only ships if it wins on held-out questions. But the direction is set. Seed Memories v4 taught your queries to find the right memory. The next version will let your memory describe itself.

Omar

Sources

Seed Memories v4 builds on published work. In rough order of appearance:

  1. text-embedding-005. Google Cloud Vertex AI text embeddings, the frozen 768-dimensional encoder used throughout. Vertex AI text embeddings documentation.
  2. Search-Adaptor. Jinsung Yoon, Yanfei Chen, Sercan Ö. Arık, Tomas Pfister. Search-Adaptor: Embedding Customization for Information Retrieval. ACL 2024. arXiv:2310.08750. The frozen-encoder adapter recipe behind our per-user query adapter.
  3. InfoNCE / Contrastive Predictive Coding. Aäron van den Oord, Yazhe Li, Oriol Vinyals. Representation Learning with Contrastive Predictive Coding. 2018. arXiv:1807.03748. The contrastive loss the adapter trains on.
  4. GELU. Dan Hendrycks, Kevin Gimpel. Gaussian Error Linear Units (GELUs). 2016. arXiv:1606.08415. The adapter’s activation function.
  5. HyDE. Luyu Gao, Xueguang Ma, Jimmy Lin, Jamie Callan. Precise Zero-Shot Dense Retrieval without Relevance Labels. ACL 2023. arXiv:2212.10496. Generating a hypothetical document to retrieve with: the seed of the “recall in your own words” direction.
  6. Query2doc. Liang Wang, Nan Yang, Furu Wei. Query2doc: Query Expansion with Large Language Models. EMNLP 2023. arXiv:2303.07678. LLM query expansion for retrieval.
  7. HippoRAG. Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, Yu Su. HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models. NeurIPS 2024. arXiv:2405.14831. Knowledge-graph retrieval with Personalized PageRank, the multi-hop method we tested and adapted.
  8. PageRank. Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Stanford InfoLab, 1999. Technical report. The personalized random-walk algorithm behind HippoRAG’s traversal.
