# Scoring & Rewards

> How Ditto SN118 scores a harness: a fresh anti-cheat dataset every run, a composite of tool accuracy and memory recall, multiple validators finalizing on the median, and winner-take-most emissions.

---

When your submission reaches `evaluating`, validators run it through **DittoBench** in an isolated Docker sandbox and post signed scores. This page explains how the number is computed and how it turns into emissions.

## A fresh, anti-cheat dataset every run

Two design choices make scoring fair and hard to game:

- **The dataset is regenerated for every run.** Tool cases are paraphrased and the memory "haystack" is reassembled with distractors and fresh timestamps, so there is no static answer key. Memorizing the benchmark doesn't help — only a genuinely good memory pipeline scores well.
- **Multiple validators, median score.** Several assigned validators each run your harness on their own seed and post a **signed** score. The result is finalized on the **median**, and validators recompute weights from the same public ledger with the same open-source function — so no single validator decides your fate.
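
As a rough illustration of median finalization, the sketch below takes the composites posted by several validators and returns their median. The function name `finalize_score` and the example values are hypothetical, and signature verification is deliberately left out; the real validator pipeline may differ.

```python
from statistics import median


def finalize_score(validator_scores: list[float]) -> float:
    """Return the median of the per-validator composites.

    Hypothetical helper: assumes every score has already been
    signature-checked and lies in [0, 1].
    """
    if not validator_scores:
        raise ValueError("need at least one validator score")
    return median(validator_scores)


# Illustrative values only: two validators agree, one is far off.
posted = [0.71, 0.69, 0.12]
finalized = finalize_score(posted)  # the middle value, 0.69
```

With an odd number of validators the median is simply the middle score, so a single outlier, whether honest noise or a misbehaving validator, cannot drag the finalized result far from the others.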

## The composite score

Each run produces a composite in `[0, 1]` that blends tool-calling accuracy and memory recall, weighted 60/40:

```
composite = 0.6 × tool_mean + 0.4 × memory_mean
```

- **`tool_mean`**: tool-calling / routing accuracy. Did the harness call the right memory tools with the right arguments?
- **`memory_mean`**: memory recall. Did it surface the correct memory for each question (LLM-judged)?
- **Latency**: the median per-case wall-clock time is measured and reported alongside the composite, though it is not a term in the formula above. Keep your harness fast; a correct-but-slow harness leaves points on the table.
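
The formula and its inputs can be sketched as follows. This is a minimal example, assuming each case yields a score in `[0, 1]` and that `tool_mean` and `memory_mean` are plain arithmetic means over cases; the function names and sample values are hypothetical and not taken from the starter kit.

```python
from statistics import mean, median

TOOL_WEIGHT = 0.6    # weight on tool_mean, from the formula above
MEMORY_WEIGHT = 0.4  # weight on memory_mean, from the formula above


def composite_score(tool_scores: list[float], memory_scores: list[float]) -> float:
    """Blend per-case scores into one composite in [0, 1].

    Assumption: each per-case score is already in [0, 1] and the
    means are arithmetic means over all cases in the run.
    """
    tool_mean = mean(tool_scores)
    memory_mean = mean(memory_scores)
    return TOOL_WEIGHT * tool_mean + MEMORY_WEIGHT * memory_mean


def median_latency(latencies_s: list[float]) -> float:
    """Median per-case wall-clock time, reported next to the composite."""
    return median(latencies_s)


# Hypothetical run: 3 of 4 tool cases correct, two judged memory cases.
score = composite_score([1.0, 1.0, 0.0, 1.0], [1.0, 0.5])  # about 0.75
latency = median_latency([0.8, 1.2, 0.9])  # middle value, 0.9 seconds
```

Because tools carry the larger weight, a one-point gain in `tool_mean` moves the composite 1.5 times as much as the same gain in `memory_mean`.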

The same composite is what you see when you practice off-chain with the [starter kit](https://github.com/ditto-assistant/dittobench-starter-kit), so your practice number and your on-chain number are directly comparable.

## From score to emissions

**Winner-take-most.** Only positive composites earn weight, and emissions concentrate on the top harness. A small, real improvement over the field can be the difference between most of the subnet's emissions and none — this is a benchmark race, not a participation pool.

The exact weight curve is part of the incentive mechanism and is being finalized ahead of launch; expect a steep, top-heavy distribution. The score ledger is public, so you can always see where you stand and by how much you need to improve.
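
Since the weight curve is not final, the sketch below does not try to compute emissions. It only shows one way to read your standing from finalized scores: drop non-positive composites, rank the rest, and measure the gap to the leader. The ledger shape (a mapping of harness ID to finalized composite) and the name `standing` are assumptions for illustration, not the actual ledger format.

```python
from typing import Optional


def standing(finalized: dict[str, float], harness_id: str) -> Optional[tuple[int, float]]:
    """Return (rank, gap_to_top) for a harness, or None if it earns no weight.

    Assumption: `finalized` maps harness IDs to finalized composites.
    Only positive composites are eligible, as described above.
    """
    eligible = {h: s for h, s in finalized.items() if s > 0}
    if harness_id not in eligible:
        return None
    # Highest composite first.
    ranked = sorted(eligible, key=eligible.get, reverse=True)
    rank = ranked.index(harness_id) + 1
    gap_to_top = eligible[ranked[0]] - eligible[harness_id]
    return rank, gap_to_top


# Hypothetical ledger snapshot.
ledger = {"harness-a": 0.82, "harness-b": 0.79, "harness-c": 0.0}
print(standing(ledger, "harness-b"))  # rank 2, gap of roughly 0.03
print(standing(ledger, "harness-c"))  # None: a zero composite earns no weight
```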

## How to climb

- **Raise accuracy and recall first.** `tool_mean` and `memory_mean` together make up the entire composite, so that is where the points are. Better embeddings, ranking, subject handling, and reranking all move these.
- **Watch latency.** A heavier pipeline that barely improves recall can be a net loss once speed is weighed.
- **Measure on `full` before you submit.** Small runs are for fast iteration; a `full` practice run is the closest proxy to an on-chain evaluation.
- **Resubmit to climb.** Each submission is a fresh, independent evaluation. When you have a better harness, submit it — but remember each on-chain submission costs a fee, so prove it off-chain first.

## Status

The scoring design above — DittoBench, the composite formula, multi-validator median finalization — is implemented in the practice bench and in the validator pipeline that is **rolling out for launch**. The precise on-chain weight/emissions curve is being finalized; treat the mechanics here as the current design and confirm against the subnet repos and announcements before a high-stakes submission.

## Related

- [Start Mining: From Fork to Live](/docs/mining-getting-started) — practice against this exact score
- [Submitting to Subnet 118](/docs/mining-submitting) — how to get into the evaluation queue
- [Mining on Ditto (Subnet 118)](/docs/mining-on-ditto) — the overview