Scoring & Rewards

How Ditto SN118 scores a harness: a fresh anti-cheat dataset every run, a composite of tool accuracy and memory recall, multiple validators finalizing on the median, and winner-take-most emissions.

When your submission reaches the evaluating status, validators run it through DittoBench in an isolated Docker sandbox and post signed scores. This page explains how that score is computed and how it turns into emissions.

A fresh, anti-cheat dataset every run

Two design choices make scoring fair and hard to game:

  • The dataset is regenerated for every run. Tool cases are paraphrased and the memory “haystack” is reassembled with distractors and fresh timestamps, so there is no static answer key. Memorizing the benchmark doesn’t help — only a genuinely good memory pipeline scores well.
  • Multiple validators, median score. Several assigned validators each run your harness on their own seed and post a signed score. The result is finalized on the median, and validators recompute weights from the same public ledger with the same open-source function — so no single validator decides your fate.
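
To make the median step concrete, here is a minimal sketch of how several posted scores could collapse into one final score. The function name, validator names, and score values are hypothetical, and the real ledger format and finalization code may differ; treat this as an illustration of why one outlier cannot move the result, not as the validator implementation.

```python
from statistics import median


def finalize_score(validator_scores: dict[str, float]) -> float:
    """Return the median of the scores posted by the assigned validators.

    `validator_scores` maps a (hypothetical) validator name to the
    composite it posted for one submission.
    """
    if not validator_scores:
        raise ValueError("no validator scores posted yet")
    return median(validator_scores.values())


# Hypothetical scores: two validators roughly agree, one is far off.
scores = {"validator_a": 0.71, "validator_b": 0.74, "validator_c": 0.12}
final = finalize_score(scores)
print(final)
```

With three scores the median is the middle value, so the single low outlier in this example has no effect on the finalized number.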

The composite score

Each run produces a composite in [0, 1], blending tool-calling accuracy and memory recall:

composite = 0.6 × tool_mean + 0.4 × memory_mean
  • tool_mean: tool-calling / routing accuracy. Did the harness call the right memory tools with the right arguments?
  • memory_mean: memory recall. Did it surface the correct memory for each question (LLM-judged)?
  • Latency: the median per-case wall-clock time is measured and reported alongside the composite. Keep your harness fast; a correct-but-slow harness leaves points on the table.
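
As a worked example of the formula, the sketch below computes a composite and a median latency from per-case results. It assumes each case is scored in [0, 1] and that tool_mean and memory_mean are plain arithmetic means over cases; the per-case values and variable names are made up for illustration and are not DittoBench output.

```python
from statistics import mean, median

# Weights taken from the composite formula above.
TOOL_WEIGHT = 0.6
MEMORY_WEIGHT = 0.4


def composite_score(tool_scores: list[float], memory_scores: list[float]) -> float:
    """Blend per-case tool and memory scores into one composite in [0, 1]."""
    tool_mean = mean(tool_scores)      # assumed: simple average over tool cases
    memory_mean = mean(memory_scores)  # assumed: simple average over memory cases
    return TOOL_WEIGHT * tool_mean + MEMORY_WEIGHT * memory_mean


# Hypothetical per-case results for one small run.
tool_scores = [1.0, 1.0, 0.0, 1.0]      # tool_mean = 0.75
memory_scores = [1.0, 0.5, 0.5, 0.0]    # memory_mean = 0.5
latencies_s = [0.8, 1.2, 0.9, 3.0]      # per-case wall-clock seconds

composite = composite_score(tool_scores, memory_scores)
median_latency_s = median(latencies_s)

print(f"composite={composite:.2f} median_latency={median_latency_s:.2f}s")
```

Here the composite works out to 0.6 × 0.75 + 0.4 × 0.5 = 0.65. The median latency ignores the single 3.0 s case, which is the point of reporting a median rather than a mean.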

The same composite is what you see when you practice off-chain with the starter kit, so your practice number and your on-chain number are directly comparable.

From score to emissions

Winner-take-most. Only positive composites earn weight, and emissions concentrate on the top harness. A small, real improvement over the field can be the difference between most of the subnet’s emissions and none — this is a benchmark race, not a participation pool.

The exact weight curve is part of the incentive mechanism and is being finalized ahead of launch; expect a steep, top-heavy distribution. The score ledger is public, so you can always see where you stand and by how much you need to improve.
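
Since the real curve is not published yet, the following is only an illustration of what "steep and top-heavy" can look like. It is not the subnet's weight function: the power-law shape, the `steepness` value, and the miner names are all assumptions chosen for the example. The one rule taken from the text is that only positive composites earn weight.

```python
def illustrative_weights(composites: dict[str, float], steepness: float = 20.0) -> dict[str, float]:
    """Turn composites into normalized weights with a hypothetical power curve.

    NOT the actual incentive mechanism; it only shows how a steep curve
    concentrates weight on the top score.
    """
    # Only positive composites earn weight (stated in the text above).
    raw = {uid: (score ** steepness if score > 0 else 0.0)
           for uid, score in composites.items()}
    total = sum(raw.values())
    if total == 0:
        return {uid: 0.0 for uid in composites}
    return {uid: value / total for uid, value in raw.items()}


# Hypothetical ledger: the leader is only 0.02 ahead of second place.
ledger = {"miner_a": 0.70, "miner_b": 0.68, "miner_c": 0.50, "miner_d": 0.0}
weights = illustrative_weights(ledger)

for uid, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{uid}: {weight:.3f}")
```

Under these assumed parameters, a 0.02 lead is enough for the top harness to take more than half of the weight, while a harness at 0.50 gets almost nothing and a zero composite gets exactly nothing.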

How to climb

  • Raise accuracy and recall first. tool_mean and memory_mean are the only terms in the composite, so they are where the points are. Better embeddings, ranking, reranking, and subject handling all move these.
  • Watch latency. A heavier pipeline that barely improves recall can be a net loss once speed is weighed.
  • Measure on full before you submit. Small runs are for fast iteration; a full practice run is the closest proxy to an on-chain evaluation.
  • Resubmit to climb. Each submission is a fresh, independent evaluation. When you have a better harness, submit it — but remember each on-chain submission costs a fee, so prove it off-chain first.

Status

The scoring design above — DittoBench, the composite formula, multi-validator median finalization — is implemented in the practice bench and in the validator pipeline that is rolling out for launch. The precise on-chain weight/emissions curve is being finalized; treat the mechanics here as the current design and confirm against the subnet repos and announcements before a high-stakes submission.