Open-Sourcing the Ditto Harness — and a DittoBench Miner Starter Kit

A while back we introduced DittoBench: a benchmark for any agentic harness that wields tools and memory, scoring whether it calls the right tool, whether it surfaces the right memory, and how fast it does both. DittoBench is also the scoring core of an upcoming Bittensor subnet (SN118) where miners compete to build the best agent harness.

Today we’re shipping the two things we promised the community:

ditto-harness — Ditto’s agent + memory harness, rewritten in Rust and dual-licensed: open source under AGPL-3.0, with a commercial license for partners.
dittobench-starter-kit — a self-contained kit to run that harness, talk to the agent, practice memory retrieval, and prepare a mining submission.

A note for readers, up front. Ditto’s production backend does not run this Rust harness yet — it still runs the original Go implementation. We’re migrating to the Rust harness by the end of the month, around the same time the on-chain mining mechanism goes live. We’re open-sourcing it now so the community can start building and practicing against the real thing while we finish that migration.

Why a harness, and why Rust

A “harness” is everything around the model: it stores and retrieves memories, exposes tools, runs the multi-turn agent loop, and assembles the prompt. It’s where most of an assistant’s perceived intelligence actually lives — the model is interchangeable; the harness is the product.

The original harness was extracted from our Go backend (Postgres + pgvector). The rewrite is Rust, backed by an embedded Turso/SQLite database with native vector search: no external database, no services to stand up. That makes it genuinely portable: a miner clones one repo, runs one binary, and has the whole agent + memory stack running locally. It also keeps the door open to exposing the same harness through native bindings (the repo already ships NAPI bindings for Node).

The crate is intentionally small and importable: a chat::Harness (prepare → agent loop → save), a memory::Store (ingest, vector + composite search, a subject graph), the retrieval pipeline, and pluggable model/embedder clients via rig-core (Ollama, OpenRouter, vLLM). It deliberately omits Ditto’s closed-source application features and billing.

How the pieces fit

You only ever touch the starter kit. The kit pulls in the harness as a Cargo git dependency, and the validator scores your kit: off-chain today, on-chain soon.

[Diagram: dittobench-starter-kit (the Rust harness you optimize) depends on ditto-harness (the open-source Rust crate) and is scored by a hosted validator (coming soon). A band below shows the timeline: practice locally today, on-chain mining on Bittensor SN118 by end of month. Production does not run this harness yet.]

The kit is the optimization surface: one file, baseline.rs, marked with EXTENSION POINTs. Everything a miner tunes (model choice, system prompt, retrieval, tools) lives there. You score yourself locally against a fixed benchmark while you iterate. The hosted validator (coming soon) will instead rotate a fresh, randomized dataset every submission, so there’s nothing to overfit, and it will score tool selection and memory recall with the same loop the on-chain validator will use.

Memory retrieval, mirrored 1:1 from production

This is the part we’re most excited to put in people’s hands. The kit doesn’t approximate Ditto’s memory system — it ships the same retrieval pipeline and trained ranking models production uses (with one twist for the local embedder, below).

The pipeline runs in three stages:

Candidate pool — a vector search over the embedded store pulls the top ~50 memories by cosine similarity.
Composite scoring (V2) — seven signals (semantic similarity, linear + exponential recency, subject frequency, subject semantic match, session continuity, neighbor density) are fused into one score. The fusion weights aren’t hand-tuned — they’re predicted per query by a small weight-predictor MLP (model.bin, ~217K parameters) from the query embedding plus 17 auxiliary features.
Cross-encoder rerank — a TinyBERT-L2 cross-encoder (model.onnx, run locally via ONNX Runtime) scores each (query, memory) pair and fuses its ranking with the composite ranking via Reciprocal Rank Fusion.
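
To make the first two stages concrete, here is a minimal Rust sketch. It is not the harness’s code: the function names (cosine, candidate_pool, composite_score) are hypothetical, vectors are plain slices rather than rows in the embedded store, and the weighted sum is an assumption, since the exact fusion formula isn’t spelled out above. In the kit the seven weights would come from the weight-predictor MLP; here they are just inputs.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns 0.0 if either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Stage 1 (sketch): indices of the `k` memories most similar to the query,
/// best first. A brute-force scan stands in for the store's vector search.
fn candidate_pool(query: &[f32], memories: &[Vec<f32>], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = memories
        .iter()
        .enumerate()
        .map(|(i, m)| (i, cosine(query, m)))
        .collect();
    // Highest similarity first; the sort is stable, so ties keep index order.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().take(k).map(|(i, _)| i).collect()
}

/// Stage 2 (sketch): fuse seven per-memory signals with per-query weights.
/// ASSUMPTION: a plain weighted sum; the real fusion may differ.
fn composite_score(signals: &[f32; 7], weights: &[f32; 7]) -> f32 {
    signals.iter().zip(weights).map(|(s, w)| s * w).sum()
}
```

With a query of [1.0, 0.0] and memories [0.0, 1.0], [1.0, 0.0], and [1.0, 1.0], candidate_pool with k = 2 returns indices 1 and 2: the exact match first, then the partial match.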

Both models ship in the kit as weights. To make this faithful, we had to add a clean Reranker hook to the harness itself so a second-stage reranker can slot in between the candidate pool and the final ordering — the same shape as production.
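
For readers who haven’t met Reciprocal Rank Fusion, the sketch below shows the standard formulation: each item earns 1 / (k + rank) from every ranking it appears in, and items are ordered by the summed score. The function name rrf and the use of u32 memory ids are hypothetical, and k = 60 is the conventional default from the RRF literature, not a value confirmed for Ditto’s pipeline.

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion over several rankings of memory ids (best first).
/// Each id earns 1 / (k + rank) per ranking it appears in, with rank
/// starting at 1. Returns ids ordered by fused score, highest first.
fn rrf(rankings: &[Vec<u32>], k: f64) -> Vec<u32> {
    let mut scores: HashMap<u32, f64> = HashMap::new();
    for ranking in rankings {
        for (pos, id) in ranking.iter().enumerate() {
            *scores.entry(*id).or_insert(0.0) += 1.0 / (k + (pos + 1) as f64);
        }
    }
    let mut fused: Vec<(u32, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by id so output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused.into_iter().map(|(id, _)| id).collect()
}
```

If the composite ranking is [1, 2, 3] and the cross-encoder ranking is [2, 3, 1], then rrf with k = 60.0 yields [2, 1, 3]: memory 2 sits near the top of both lists, so it edges out memory 1, which one ranker put first and the other put last.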

One honest twist: production embeds with Vertex text-embedding-005, but the kit embeds locally and for free with Ollama’s embeddinggemma — a different vector space. The cross-encoder doesn’t care (it scores raw text). The MLP does, since it’s calibrated to whatever embedding space it was trained on. So we retrained the MLP on embeddinggemma through the same production training pipeline (on LongMemEval) and shipped that — it’s calibrated to the embedder the kit actually uses, and on the bundled seed user it lifts retrieval from hit@10 0.90 to 0.96. Run the exact production stack by swapping in Vertex + the production weights; everything else is identical.

One dummy user to experiment with

A retrieval pipeline is only as interesting as the memories behind it, so the kit bundles a self-contained seed user: a coherent, type-balanced slice of LongMemEval — 477 conversation pairs, 1,049 subjects already run through subject-sync, 1,710 subject links, and 50 questions. One command bulk-loads it into the local vector store, embeds everything, and you have a realistic user to query.

Then mem-eval runs the bundled questions through the full pipeline and reports retrieval recall@k — no LLM calls, so it’s free and fast, and it isolates retrieval quality from the model:

cargo run -- seed-user        # load the dummy user (embeds pairs + subjects)
cargo run -- mem-eval --k 10  # recall@k over the full pipeline, per question type
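
If the metric names are unfamiliar, this sketch shows two common definitions for a single question. It is an illustration under assumptions, not mem-eval’s implementation: the function names are hypothetical, memories are identified by u32 ids, and which variant mem-eval reports (and how it averages across questions) is not specified here.

```rust
/// recall@k: fraction of gold memory ids found in the top `k` retrieved ids.
/// Returns 0.0 when there are no gold ids.
fn recall_at_k(retrieved: &[u32], gold: &[u32], k: usize) -> f64 {
    if gold.is_empty() {
        return 0.0;
    }
    let top = &retrieved[..k.min(retrieved.len())];
    let found = gold.iter().filter(|g| top.contains(*g)).count();
    found as f64 / gold.len() as f64
}

/// hit@k: 1.0 if at least one gold id is in the top `k`, otherwise 0.0.
fn hit_at_k(retrieved: &[u32], gold: &[u32], k: usize) -> f64 {
    let top = &retrieved[..k.min(retrieved.len())];
    if gold.iter().any(|g| top.contains(g)) {
        1.0
    } else {
        0.0
    }
}
```

For retrieved ids [5, 3, 9, 1] and gold ids [3, 7], recall_at_k at k = 2 is 0.5 (one of two gold memories surfaced) while hit_at_k at k = 2 is 1.0. Averaging either value over all questions gives a dataset-level number.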

Talk to the agent

Benchmarks are abstract; talking to the thing is not. The kit includes a playground — a single-file web UI wired to a 1:1 production-Ditto chat agent: the real v2 system prompt and persona, the production default model, the full tool catalog, and real memory retrieval over the seed user.

cp .env.example .env          # paste your OpenRouter key
cargo run -- seed-user        # one-time
cargo run -- playground       # open http://127.0.0.1:8088

Action tools (web search, image generation, agent jobs, settings) return fake-but-plausible results so you can exercise tool-calling without wiring up real integrations — while the memory tools are real and query the seed user. The UI shows every tool’s definition and, after each turn, a live trace of the tool calls and the memories that were retrieved. Ask “how many postcards have I collected?” and watch it answer from memory, with citations; ask it to “search the web for…” and watch the tool fire.

The playground also has a Score tab: run the fixed benchmark right in the browser and watch it score live — a progress bar, each case streaming in with its score, latency, and called-vs-expected detail, then a composite breakdown with a “how to read this” legend. It’s the same evaluation as the CLI, made watchable.

Practice the mining flow

The kit gives you a fixed local benchmark to iterate against — the same static seed user and the same questions every run, so you can actually tell whether a change helped:

cargo run -- evaluate   # static seed user + the same questions, every run

Because the inputs are fixed, your score is comparable run-to-run (the model itself is still stochastic). When you want the anti-overfit picture, cargo run -- practice rotates a fresh random dataset instead.
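
Since the model is stochastic, one way to judge whether a change really helped is to run evaluate several times before and after, then compare means against the run-to-run spread. The helper below is a hypothetical sketch of that bookkeeping; it is not part of the kit, and you would collect the scores yourself.

```rust
/// Mean and sample standard deviation of scores from repeated runs.
/// Returns None with fewer than two scores, where spread is undefined.
fn mean_and_std(scores: &[f64]) -> Option<(f64, f64)> {
    if scores.len() < 2 {
        return None;
    }
    let n = scores.len() as f64;
    let mean = scores.iter().sum::<f64>() / n;
    // Sample variance (divide by n - 1), since the runs are a small sample.
    let var = scores.iter().map(|s| (s - mean).powi(2)).sum::<f64>() / (n - 1.0);
    Some((mean, var.sqrt()))
}
```

A difference between two configurations that is smaller than the standard deviation of either one is more likely noise than improvement.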

The real subnet, of course, must rotate a fresh dataset every submission so nothing can be overfit. That’s the job of the hosted DittoBench validator (coming soon) — the same loop, with a freshly randomized dataset per submission and (eventually) a Docker sandbox that builds and runs each submission in isolation, the way the on-chain validator will execute untrusted miner code.

What’s next

By end of month: Ditto’s production backend migrates onto this Rust harness, and the on-chain mining mechanism (Bittensor SN118) goes live — validators scoring signed submissions and emitting rewards.
Going fully public: the harness is dual-licensed — open under AGPL-3.0, with a commercial license for partners — and we’re finishing the last housekeeping (a stable tagged release) so external miners can depend on it without friction.

Get started

Starter kit: github.com/ditto-assistant/dittobench-starter-kit — start with the SETUP.md.
The harness: github.com/ditto-assistant/ditto-harness (AGPL-3.0 + commercial).
The benchmark itself: DittoBench — how we measure Ditto’s agent.

A hosted validator that rotates a fresh dataset per submission is coming soon; until then, practice locally with the kit’s evaluate.

If you build something on it, or you’re planning to mine SN118, we’d love to hear from you.